Inevitable #Distributed #Transactions for #Microservices

Published on July 23, 2019

Part 1

CONTENT

  • Business Case
  • Does Compensating Transaction Compensate?
  • DT Transaction Manager / Orchestrator
  • Orchestrated DT over multiple MS
  • DT Implementation Options (how to)

Business Case

For many years, technology has tried to resolve a contest between the needs of business operational logic and the technological means available for implementing that logic. This race negatively impacts business in at least two situations: operational business has to decline functionality (and risk losing to market competition) to fit within an application's constraints, and it can even lose operational competency where the limits of a graphical user interface substitute for the real purposes of business activities in the minds of operators. Finally, technology has come up with Microservices (MS), which are set around business functions rather than around databases and networks. These are the simplest, finest-grained business functions, and the mentioned contest is supposed to head toward its resolution.

Unfortunately, MS still carry such a significant IT 'anchor', articulated via the principles of Microservice development, that the spectrum of business tasks where those principles can work is dramatically limited. Why is this so? Because many business tasks are presented as processes and adaptive cases, which were previously realised via monolithic applications. Decomposing a monolith into individual small functions, or Microservices, loses the process/case logic that is so important to business. When we emulate business actions/functions via MS, we can optimise some process steps, but no business value embodied in the process logic may be omitted or lost in the implementation.

This process logic is known as a functional transaction. That is, in specific cases, MS must work in transactions whether Developers like it or not. The Microservice development principles that oppose transactional behaviour should be relaxed if we want to satisfy business needs with Microservice technology[1]. Since MS are independent units of work that can have different locations/distributions, ownerships and life-cycles, we have to deal with distributed transactions (DT) across MS.

In a distributed environment with multiple ownerships, it is impossible to implement the two-phase-commit (2PC) model that locks the entities' states for the sake of "all commit at once or none". It is evident that some practitioners point to the difficulty of designing and testing DT while actually meaning problems with 2PC. For some Developers, DT themselves appear so difficult (with logic very distinct from OOD) that we have found numerous recommendations to avoid such designs altogether.

The modern perception states that Microservice development does not need to be precise and flawless because the major value is scalability and performance, i.e. it is OK for a DT to fail and be re-done. Unfortunately for the promoters of this idea, our society still vitally depends on concrete things of high quality, including solid transactions. Social media, with minimal or at best eventual data consistency permitted, is not and will not be the standard for the rest of the world. Consistency and accuracy of data is not a trade-off for scalability and performance in many of the industries our society has – we need both at the same time. Consciously imposing problematic data consistency on the rest of technology development is a very bad practice, next to a crime (just imagine software with eventual consistency for nuclear power stations, aircraft, ships, trains, cars, electrical devices and finance – eventually we survive and keep our money, but not necessarily when we need to).

Thus, we have a choice: either to avoid MS when programming DT or to avoid transaction-requiring business tasks. The latter is simply impossible without breaking business values, which is unacceptable. So, we have to learn how to use MS in DT in the most reliable way. If such development requires the effort of more than one developer or Team, so be it, and DevOps should accommodate this line of thinking and appropriate automation.

In this article, we will discuss different solutions for MS-based DT. We will make certain assumptions, define preconditions, and articulate trade-offs and potential consequences of doing or not doing certain things with and for MS.

(to be continued)

[1] I avoid calling Microservices an architecture, because its principles and description do not match the definition of a system architecture. Also, "The fewer communications between microservices, the better…" means that ideally Microservices should not communicate. In this case, the presence or absence of any Microservice is not fundamental to the system's existence. Then, the smallest possible Microservices cannot be the main drivers for the system (while an architecture can be and is), because the system is based on inter-relationships of architectural and non-architectural elements, which the smallest elements are incapable of delivering. Non-communicating and isolated Microservices have no mechanism that makes them cohesive. "If a microservice must rely on another service to directly service a request, it is not truly autonomous" reinterprets the Autonomy Principle of Service Orientation and loses its sense. A SOA Service can represent to its consumers functionality that it gathers from other Services – this is the exact model of business behaviour. Microservice development principles combat this crucial feature.

Part 2

First of all, the problem with distributed transactions in MS-based implementations is well known in the industry. Good publications on this topic are "Patterns for distributed transactions within a microservices architecture" and "Compensating Transaction pattern". The Red Hat publication recommends using the Saga pattern, though it recognises that "it also introduces a new set of problems, such as how to atomically update the database and emit an event". We think that the Saga pattern has many more problems than quoted. So, we start by uncovering the related gaps and risks.

Business transaction logic used to be relatively stable, not changing daily, i.e. it could be built up-front. However, with modern technologies that deliver information about business events and AI predictions in near real-time, a corporate business has to be able to change its operational logic weekly or monthly. So, an MS-based transaction implementation should design inter-MS interactions (compositions) with maximum flexibility. Also, the dynamics of the execution context demand that each MS be ready for use in a transaction by design and by implementation. These are the new preconditions for DT design and implementation. A method for a quantitative estimate of solution/design flexibility is known [one of the sub-topics] and is based on the complexity of adopting changes in the design.

A Compensating Transaction (CT) is known as a means of compensating real-world effects caused by a transaction if it fails to complete, regardless of the reason. Within a monolithic application set around data transformation or transition under the application's monopoly, a CT was simple – we needed to lock the data fields during the transaction and then undo the usually simple CRUD operation on the locked fields. CT was not really considered for sending e-mails or messages (non-compensable actions). Moreover, messaging/MOM is known as a non-transactional communication type (where someone implemented the opposite, it violated the principles of messaging). For a DT, the picture is dramatically different.

The Saga pattern is designed on the assurance that a CT is always available. Therefore, Saga states that each individual step of a DT can be committed/completed in full isolation from the status of other steps by using local transactions. Let us review whether this is really possible.
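To fix terms before the review, here is a minimal sketch of the assumption the Saga pattern makes: every step commits locally and registers a compensating action that is presumed to undo it. All names (`Step`, `run_saga`) are hypothetical and for illustration only, not the author's implementation.

```python
# Minimal sketch of the Saga assumption under discussion: each step commits
# locally, and a compensating action (CT) is presumed to exist for every step.
# Hypothetical names; not a production implementation.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    name: str
    action: Callable[[], None]        # local transaction, commits immediately
    compensation: Callable[[], None]  # the CT that Saga assumes always exists

def run_saga(steps: List[Step]) -> bool:
    done: List[Step] = []
    for step in steps:
        try:
            step.action()             # commits in isolation from other steps
            done.append(step)
        except Exception:
            # Saga's promise: run the CTs of the already-committed steps.
            for committed in reversed(done):
                committed.compensation()
            return False
    return True
```

The review below examines exactly this promise: whether the compensation calls can actually undo effects that have already propagated.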

An execution context for MS-based DT comprises multiple independent MS deployed on virtual distributed computational and network platforms, together with also-distributed clusters of execution infrastructure – Event Buses, Messaging and networks of API end-points – which cause poorly predictable communication latency. Moreover, individual MS may have different owners who do not promise the same responses to DT or CT requests and related SLAs. Most importantly, independent MS cannot be locked by a DT until it completes, which may also take an unpredictable amount of time. As a result, the effect caused by a DT action can be overwritten and/or acquired and propagated through barely understood MS invocation chains in the distributed environment, especially across MS Application boundaries.

Note that if a DT fails to complete, we are looking not to "undo" each of the locally completed steps but to compensate their effects, which have been uncontrollably propagated by other MS and transactions. Here is an example where compensation is issued but arrives late and fails. An online bookstore has accepted a purchase of a book and has to conduct a transaction that includes: order check-out in the stock, payment assurance, order placement with the distributor, and order distribution & delivery including an up-front notification to the consumer. Assume that the order placement took much longer than anticipated because its systems/MS had to be rebooted/redeployed. The transaction recognised a failure and invoked a CT to cancel the order, but the order placement MS found incomplete activities in the Event Sourcing store, immediately started the order delivery and issued the up-front notification. By the time the compensating payment was returned to the consumer's bank account, the account had been on a very low balance for a while, and a standing order could not obtain enough money and failed. By the time the order placement MS was able to receive the cancellation message/event, the customer notification had already been sent and the customer had been able to download the majority of the purchased SW product.

In spite of multiple warnings that a CT should compensate the effect of the transaction's action, many developers treat it as an "undo" of a local CRUD transaction. If an MS has changed data in the data store and, upon a concurrent request, changed it again, or another MS was able to obtain the initially changed data before the compensation command came, it does not make sense to return the data to the previous value, because this will crash the overall data flow integrity.

Furthermore, nothing can guarantee that if a transaction fires an event or sends a message/e-mail, and then produces an event or message/e-mail asking to ignore the initial notification and undo related operations, the receiving MS will act as requested (note that the owners of one MS cannot control the owners of another MS). At the very least, there is no guarantee that the compensation will be executed in all processing chains triggered by the initial action.

Thus, when executing business transactions in a distributed environment where each step completes/commits regardless of other steps and where the action providers are different (and may even operate under different execution policies), we cannot and should not rely on compensation. The impacts of many of a transaction's actions cannot be compensated, either due to prohibitive complexity and cost or due to unforeseen business consequences.

This is why the assumption that each transaction implemented in the Saga pattern has a CT is simply unrealistic or even incorrect, because some actions caused by a transaction cannot be undone in full. While the Saga pattern can deliver relatively reliable DT over MS, its use of CT is almost meaningless and even dangerous.

To minimise the effects of incomplete DT, we propose a distributed procedure that we call a Conditional Roll-Back Process (CRBP). It does not carry a notion of compensation; it just tries to undo the changes caused by an incomplete DT where there is no risk of uncontrollable consequences. The CRBP can still improve the overall outcome of a failed MS-based transaction. However, the use of each MS-based business transaction has to be documented with an elaboration on the risks of failure at different points of the transaction and the possible automated or manual mitigation means.

For instance, a booking application offers a consumer not only the ability to book a flight and hotel, but also a dinner before the flight. The consumer can choose a restaurant, flight and hotel independently within a single booking transaction. If we implement this solution using independent MS for each selected business using the Saga pattern and the transaction fails, the CRBP will attempt to roll back the booking with each of these businesses automatically or with the involvement of the consumer, who might have a choice to keep a particular booking anyway. This is not compensation.

At the lower level, if some DT steps require creating or updating data in the data store, the CRBP does not perform a simple deletion or reverse transformation of the data. The CRBP requires the MS to check 1) whether that data is in the same state as it was left after the initial action, i.e. whether it has not been overwritten in any way, and 2) whether this data has not already been read and shared (in any way). If either of these conditions has occurred, no roll-back for this data is requested/recommended. The CRBP can fire "undo" events, messages and e-mails, but it is not wise to assume that the outcomes of these activities will be the ones expected.
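The two CRBP conditions above can be sketched as a guard around the undo operation. The `Record` shape, version counter and reader markers are hypothetical stand-ins for whatever change-tracking the data store actually provides.

```python
# Sketch of the CRBP guard for a single data change: roll back only if the
# data was neither overwritten nor read/shared since the DT step's write.
# Record/version/readers are hypothetical stand-ins for real change tracking.
from dataclasses import dataclass, field

@dataclass
class Record:
    value: str
    version: int = 0                           # bumped on every write
    readers: set = field(default_factory=set)  # consumers since our write

def crbp_rollback(record: Record, written_version: int, old_value: str) -> bool:
    """Undo the DT step's write only when it is provably safe to do so."""
    if record.version != written_version:
        return False   # condition 1 violated: overwritten, do not roll back
    if record.readers:
        return False   # condition 2 violated: already read/shared downstream
    record.value = old_value
    record.version += 1
    return True
```

The function reports its decision, so the Orchestrator can log which changes were left in place and escalate them for manual mitigation.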

The CRBP can also be used in the narrowed scope of a single DT step. For instance, in the example above, a consumer may change his/her mind and, after booking the dinner and hotel, decide to remove the restaurant from the overall order even if it has already been booked. The DT's Transaction Manager/Orchestrator should have a logical step where it engages the CRBP for one or several steps.

Part 3

DT Transaction #Manager / #Orchestrator

Traditionally, SW transactions run under the rules provided by the role of a Transaction Manager. For services, this role is known as an Orchestrator.

In contrast to technical transactions, which can fail for native technology causes (and MS principles count on this), business transactions may not fail unless such an option is stated up-front. Acceptance of a transaction #failure depends on the data consistency required of the outcome: if it is a payment transaction, the data consistency must be strong; if it is a financial advertisement, we may accept eventual data consistency. For example, if a system looks up Mutual Funds with certain attributes (#risk/#ROI) for possible investments, i.e. several sources of data have to be accessed within the search transaction, the resulting list can miss a few funds and this will be OK from the business perspective of the #consumer, but the search itself may not fail.

Thus, a DT across MS should complete on the first attempt if strong data consistency is expected. Our task is to come up with as reliable a design as possible despite the assumption that any MS may fail individually.

A failure of a DT can be caused by one, several or all of three factors:

a)     a failure of the infrastructure where the transaction runs

b)     a failure of the operating SW within which the transaction runs

c)     a failure of the application SW used in the transaction. In this article, we focus on the mitigation of the latter.

There are two most popular algorithms for transactions (also mentioned in the #Saga pattern commentary) across independent entities:

1)     A chain of invocations provided by the transaction participants themselves

2)     An invocation sequence provided by an additional entity – a Transaction Manager or Orchestrator.

#DevOps usually tend toward the chain-of-invocations model because the team can agree on which MS communicates with which other MS and in what order – everything is in the "hands" of the development team. However, in a DT, not all MS running in the transaction are necessarily under the control of one team.

The chain-of-invocations algorithm, known as #Choreography, requires each MS to be aware of all other MS from which it has to expect an event notification, and to know how to react to it, including its own dependent events or other MS to be invoked. If the transaction includes a "foreign" MS, any change in it during the transaction's development or execution becomes a subject of additional negotiations and delivery latency, and can lead to a conflict of interests. Let us recall that business transactions do not exist to please DevOps; it is the other way around. Thus, the question arises – is the Choreography algorithm right for transaction implementations across distributed MS? Unfortunately, practice with this pattern over the last 15 years has clearly demonstrated its very low reliability and poor adaptability/flexibility.

Architects know about the consequences of Choreography and the related risks. If we have several MS that we need to compose with a Choreography pattern, we have to be ready to re-develop many of them – the purpose of the Choreography, by definition of this method, should become one of the purposes of each composition participant. Also, two MS that are supposed to interact (even via event-listening) are de facto coupled by design, i.e. they re-create a new monolith. This is unacceptable for MS-based solutions. To be concrete, here are the risks that Choreography presents:

1)     A change in any chained MS can impact the execution and outcomes of the entire chain, including placing the execution on hold until the change comes online

2)     If one MS of an interacting pair is owned by another provider (a team or even an external organisation) and the provider changes its MS without notifying the other one in the pair, the transaction fails. Organising and, especially, executing proper, on-time notifications is an additional and non-trivial task on its own, considering the independent MS life-cycles of different providers. This is a common problem of all MS-based Applications under different ownership, not of DT only. This situation becomes more inevitable the longer we use growing applications and the more cross-boundary interactions a "digital integration" needs

3)     Since every MS participating in the Choreography requires adjustment of its design and implementation for the sake of such transactions, there is a risk that Choreography-required modifications may be out of sync with previous MS design features/solutions, which reduces the quality of the MS's work. Also, it becomes difficult to use the MS in several Choreography-type transactions because of continuous re-development, which contradicts the purpose of services (MS) from the business perspective.

4)     If any of the MS in such a transaction chain fails, i.e. does not operate as needed and does not notify anyone about this (which is permitted by MS practice), the entire transaction fails.

We can conclude that in spite of the use of local transactions and event-based MS interactions, Choreography-type transactions couple MS and appear unreliable and fragile by design. It is better to avoid this type of transaction despite its illusory development convenience[1]. The main basis for this statement is that when we need to implement a business transaction, we are looking for the final outcome rather than the "well-being"/convenience of an individual MS. If we want to remove or add an MS in a Choreography, it seems a simple task from the MS developer's viewpoint, but from the transaction's viewpoint it is twice as much work as for the orchestrated type of DT, because two already-existing MS might need to be modified – the sender of the invocation to the new MS and the receiver of the invocation from the new MS (this relates to both event-based and direct MS invocations).

In the case of MS, managed or orchestrated transactions are organised by a special MS – an orchestrator/conductor/manager. The Orchestrator must preserve the business logic of the transaction, and this is the only business functionality it is responsible for. The Orchestrator should:

a)     Maintain and execute the transaction logic. The Orchestrator/Transaction Manager is a single point where the transaction logic can be changed if needed

b)     Engage appropriate supplemental MS at the steps of the logical process. The managing MS can engage supplemental MS using different mechanisms that can be specific to each MS

c)     Use supplemental MS as-is, with no re-development and even without them knowing that they are used in the transaction.
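The three responsibilities above can be sketched as follows: the transaction logic lives only inside the orchestrating MS, and each supplemental MS is reached through whatever engagement mechanism it already exposes, untouched. The class and step names are hypothetical, for illustration only.

```python
# Sketch of an orchestrating MS: (a) the transaction logic is held in one
# place, (b) supplemental MS are engaged via per-step mechanisms, and
# (c) the supplemental MS are used as-is. Hypothetical names throughout.
from typing import Callable, Dict, List

class Orchestrator:
    def __init__(self, engage: Dict[str, Callable[[dict], dict]]):
        # engage maps a step name to whatever mechanism reaches that MS
        # (REST call, event publish, ...); the MS itself is not re-developed.
        self.engage = engage
        # The transaction logic lives only here and is changed only here.
        self.logic: List[str] = ["checkout", "payment", "placement"]

    def run(self, context: dict) -> dict:
        for step in self.logic:
            context = self.engage[step](context)  # supplemental MS as-is
        return context
```

Because the supplemental MS appear only behind the `engage` mapping, swapping one provider for another is a configuration change rather than a re-development of the transaction.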

The risks of this transaction type are:

1)     The Orchestrator/Transaction Manager may be a single point of failure

2)     A change in the engagement mechanism of the supplemental MS impacts the Orchestrator

3)     A change in the transaction logic may require design/implementation/deployment of a new Orchestrator

4)     Any supplemental MS can have an inconsistent outcome.

An orchestrated transaction, like any process, has a state. That is, an MS acting as an Orchestrator is a stateful service. At each step, the Orchestrator should be sure that the needed data has been received (and temporarily persisted) or that the requested action has been successfully performed by the supplemental MS. In a case of failure, the step has to be repeated until it succeeds; however, which supplemental MS is used in this case may vary.
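The repeat-until-succeeded behaviour with varying providers can be sketched as below: the Orchestrator persists its state after each successful step and, on failure, retries the same step with an alternative supplemental MS. Function and variable names are hypothetical.

```python
# Sketch of stateful step execution: repeat the step until some provider
# succeeds, persisting the transaction state after each completed step.
# Which supplemental MS serves the step may vary. Hypothetical names.
from typing import Callable, List

def run_step(providers: List[Callable[[dict], dict]],
             state: dict, saved_states: List[dict]) -> dict:
    """Try alternative supplemental MS for one step; persist state on success."""
    last_error = None
    for provider in providers:          # the serving MS may vary per attempt
        try:
            state = provider(state)
            saved_states.append(dict(state))   # temporary persistence
            return state
        except Exception as e:
            last_error = e              # step failed: try an alternative MS
    raise RuntimeError("step failed with all providers") from last_error
```

The persisted `saved_states` is what lets a restored Orchestrator instance resume from the last completed step instead of restarting the whole transaction.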

In the following parts, we will offer a solution that eliminates the single point of failure for the Orchestrator. For risk 2), the mitigation is relatively straightforward. Since the Orchestrator cares only about the results of activities performed by its supplemental MS, if a particular MS changes its interaction interface/functionality/outcome, it is not necessary to re-develop the Orchestrator, because it can discard the changed MS and engage another MS that meets the supplemental requirements (this is the main power of #SOA). The same reasoning applies to risk 3) – a change in the orchestrating logic constitutes a new Orchestrator. At the same time, a change in any supplemental MS that preserves the same interaction interface, provided functionality and outcome does not impact the Orchestrator and does not necessarily impact the transaction.

A DT is usually built over existing MS, each of which may implement a different data consistency model. This model may conflict with the data consistency required by the DT. The Orchestrator has only two options in the case of a model mismatch: 1) try to find and engage an MS with the same data consistency model as the transaction requires; 2) try to negotiate with the supplemental MS provider the use of the #Sib or #Tandem patterns for the deployment of its to-be-engaged MS. This will minimise MS failures and increase transaction robustness.

[1] The abnormal destruction and death rate caused by Hurricane Katrina in Louisiana, USA, are attributed to the Choreography-based emergency system implemented at that time: several Choreography nodes/companies were initially destroyed by the flood and this crashed the entire emergency notification system (no coordinator was available to fix the damage on time until the military with centralised command came), i.e. those who still could be evacuated did not move because they did not receive communications from the failed nodes.

Part 4

Orchestrated Transaction: addressing identified risks

During an orchestrated transaction, the overall system stays in a state of eventual consistency. When the transaction completes, strong data/state consistency should be achieved. If the solution comprises several systems, e.g. in a case of retail order processing, we may have a global transaction composed of system-level transactions, which appear as individual steps of the global transaction (a transaction of transactions). In SOA, the global transaction and each system transaction are independent Services. If we recognise transaction logic as a self-sufficient business function, we can talk about an 'orchestrating MS'. The justification for this is a known method of business function decomposition – from coarse- to fine-grained functionality suitable for an MS-based implementation – and the continued re-assembly of fine-grained functionality into final task solutions via temporary orchestrated compositions, as needed.

In order to resolve the problem of a "single point of failure", the Orchestrator must be durable. This can be achieved via resilience and a combination of run-time fail-over patterns. These patterns should be applied both to the orchestrating MS per se and to the supplemental MS. Unfortunately, application to the latter cannot be guaranteed if the MS are under different ownership, which may not share the design policies.

Usually, transactions are not self-starting entities. So, there are triggers, such as events or direct invocations, that cause an orchestrating MS to start working. In this article, we consider event-based communication only, though our patterns are good for both trigger types. Particularly, if a DT is triggered via an event bus or event broker that can push event notifications to the orchestrating MS, we recommend the #Sib pattern for the Orchestrator implementation. If the orchestrating MS has to poll for triggering events, we recommend the #Tandem pattern for the Orchestrator.

Both mentioned patterns are not about load balancing between multiple identical instances of an MS. URL, application or network/host load balancers do their work well, but they do not operate at the level of MS, especially if the latter are deployed in containers. Kubernetes delivers load balancing at the level of pods, not MS. The maximum a load balancer can do is direct an MS invocation to another distributed instance of the MS. However, if the problem is not in connectivity but in the MS itself – its code or its internal resources – neither the Fall-Back pattern nor the Circuit Breaker pattern will help. This is why the Sib and Tandem patterns introduce an alternative MS to execute rather than another instance of the same (probably broken) MS.

It is assumed that the triggers of the orchestrating MS and the Orchestrator itself are under the same ownership. This is important because the trigger has to be aware of how the Orchestrator may be invoked (including the patterns of the Orchestrator's implementation). Additionally, it is recommended that the trigger use a Health-Check pattern before pushing an event to a live instance of the Orchestrator.

A failure of any of the orchestrating MS in either pattern should be logged […] and reported, e.g. via monitoring, for an immediate restoration (re-deployment) of the failed entity, so that a fully functioning pattern is available for the next transaction.

Now, let us review the interactions between orchestrating and supplemental MS. An owner of the transaction cannot in all cases enforce a particular design on supplemental MS to make them unified or standardised, but we will show the best practice for transaction implementation.

The simplest case for designing an Orchestrator is when all supplemental MS use the same interaction mechanism – direct invocation/API, an event broker (push), or event polling (pull). If the Orchestrator uses Event Sourcing for posting interaction events, it may be a sub-case of one of the mentioned pull/push mechanisms depending on the implementation.

The most complex case for designing an Orchestrator is when different supplemental MS use different interaction mechanisms. The Orchestrator should be aware of each interaction mechanism and realise it for each process step. In essence, this couples the Orchestrator with its supplemental MS. As a result, a change in the interaction mechanism of any supplemental MS leads to a re-design/re-build/re-deployment of the Orchestrator. Thus, one of the Best Practices of MS design/development is to avoid changes in the interaction mechanism by all means (as was recommended before for Service interfaces). Here is an example of this complex case: an orchestrating oMS works with supplemental sMS-3, sMS-4, sMS-2 and sMS-8. The sMS-3, sMS-4 and sMS-8 operate via an Event Broker and, therefore, are expected to be implemented via the Sib pattern. The sMS-2 uses event polling and is implemented via the Tandem pattern. The oMS should use a push Event-Broker-based communication mechanism for sMS-3, sMS-4 and sMS-8, but an event store for pulling events to/from sMS-2.
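One common way to contain the coupling described above is to hide each supplemental MS behind a small adapter that implements its specific mechanism, so the step logic itself stays mechanism-agnostic. This is a sketch under that assumption, not the author's prescribed design; the adapter classes and the in-memory broker/store are hypothetical.

```python
# Sketch of the mixed-mechanism example: per-MS adapters confine the coupling
# between the oMS and each supplemental MS's interaction mechanism.
# Hypothetical classes; broker and store are in-memory stand-ins.
class BrokerPushAdapter:                 # for sMS-3, sMS-4, sMS-8 (push)
    def __init__(self, broker: dict, topic: str):
        self.broker, self.topic = broker, topic
    def send(self, event: dict):
        self.broker.setdefault(self.topic, []).append(event)

class EventStorePullAdapter:             # for sMS-2 (it polls the store itself)
    def __init__(self, store: list):
        self.store = store
    def send(self, event: dict):
        self.store.append(event)

def engage_all(adapters, event: dict):
    # The oMS step logic stays mechanism-agnostic: a mechanism change in one
    # supplemental MS touches only its adapter, not the transaction logic.
    for adapter in adapters:
        adapter.send(event)
```

A mechanism change still forces a re-build of the oMS, as the text notes, but the change is localised to one adapter rather than spread through the transaction logic.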

A change in the transaction logic does not impact supplemental MS, but it can require different supplemental MS. Although the logic can be externalised via the Rules pattern – with no code change in the Orchestrator – a logic change constitutes a new transaction (a process). Developers should be aware of this phenomenon. This is why a change in the transaction logic is less desirable, but sometimes inevitable.

One of the major values of SOA can be associated with MS-based solutions as well – the re-composability of service-based solutions for the purpose of addressing an external change (in the system or in the market). For transactions, re-composability means a new transaction rather than a modification of the existing one – to solve a new task, different transactions might be needed. Hence, transactions in MS-based applications represent relations between individual functions and form intermediary functional compositions. Some of these compositions are so useful that they can 'live' on and be reused as self-consistent components.

The risk of internal implementation problems in supplemental MS is less anticipated by transaction developers. At a glance, why should a DT care about the internals of an arbitrary supplemental MS? Indeed, the Transaction Manager/Orchestrator entity should not. Nevertheless, the business task that the DT should solve cares very much about this matter. No transaction makes sense if it is not confident that the providers of its steps deliver the requested values reliably.

For example, a supplemental MS should mark the record "Order" in its data store as 'complete'. The MS does this work, but fails to send an audit log message because the logging mechanism has a problem at the moment. As a result, the transaction-related job is done, but the audit will find a record of an incomplete Order and escalate a quality blame. From the DT perspective, the supplemental MS has not done the work. In other words, the supplemental MS faces a clear case for a two-phase commit (2PC) that should be implemented locally, in the MS. Also, we have to note that the MS implements not one business function but two – making the 'complete' record and providing the related audit data.

In essence, there is no such thing as a leaf, simplest, non-decomposable business function – there are always a few satellite functions required by the business execution context, with its many laws, rules, regulations and inter-dependencies. In particular, logging events for audit purposes is equally as important to the company as the action itself. For our example above, a failure of the local 2PC within the supplemental MS will cause the MS to apply patterns such as Retry, Fall-Back and even Circuit Breaker.
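The local two-phase commit the example calls for can be sketched as below: the supplemental MS treats the 'complete' mark and the audit record as one unit, applying the mark only once the audit write has also succeeded, with the Retry pattern around the audit side. The function and store objects are hypothetical in-memory stand-ins.

```python
# Sketch of the local 2PC within a supplemental MS: the 'complete' mark and
# the audit record commit together or not at all, with Retry on the audit
# write. Hypothetical names; orders/audit_log are in-memory stand-ins.
def complete_order(orders: dict, audit_log: list, order_id: str,
                   audit_write, retries: int = 3) -> bool:
    staged = {"order": order_id, "status": "complete"}   # phase 1: prepare
    for _ in range(retries):                             # Retry pattern
        try:
            audit_write(audit_log, staged)               # audit must succeed
            orders[order_id] = "complete"                # phase 2: commit both
            return True
        except Exception:
            continue
    return False   # neither side committed; Fall-Back / Circuit Breaker next
```

If all retries fail, the order is deliberately left unmarked so that the DT sees the step as not done, which matches the article's point that the DT perspective counts both functions.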

Logging is a factor that can significantly impact DT performance. This situation is similar to the one where an MS has to deal with a shared data store (usually a corporate legacy data store like a Reference Database). The MS principle requiring each data entity to be owned and accessed by one MS is rather an ideal wish, which is not supported by modern data governance at the corporate level. Moreover, such access should be resilient if we talk about corporate data, i.e., again, one MS may not be able to provide it. So, a logging solution is either provided by the platform where the MS runs – with all resilience/robustness being the platform's duty – or, if such a service is absent, the developers of the MS have to take care of it. The recommended solution here is a Proxy pattern set on top of the Sib/Tandem pattern.

The Proxy pattern shields the operations of one MS from the operations of another MS or resource by placing an additional proxy MS between the two. That is, when an MS-A needs to log audit information into a Logging System via its Logging API, the MS-A communicates with the proxy MS, which, in turn, interacts with the Logging API. In this case, the proxy MS is responsible for overcoming all potential communication problems of the Logging API and releasing the MS-A for its own work. The proxy MS should also be able to recover the audit information in the case of its own failure – this is where the Sib/Tandem pattern comes to help.
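A store-and-forward sketch of such a logging proxy is shown below. The class and method names (`LoggingProxy`, `log`, `flush`, the `write` method on the Logging API) are assumptions for illustration; the point is that the caller is released as soon as the message is journalled, and the journal lets a restarted proxy, or its Sib/Tandem companion sharing the same journal, replay anything not yet delivered.

```python
import json
import os

class LoggingProxy:
    """Store-and-forward proxy between a business MS and the Logging API.

    The calling MS is acknowledged once the message is durably journalled;
    forwarding happens in flush(), which can also run after a proxy restart.
    Delivery is at-least-once, so the Logging API side should tolerate
    duplicate messages.
    """
    def __init__(self, logging_api, journal_path):
        self.logging_api = logging_api
        self.journal_path = journal_path

    def log(self, message):
        # durably record before acknowledging the calling MS
        with open(self.journal_path, "a") as f:
            f.write(json.dumps(message) + "\n")

    def flush(self):
        # forward everything journalled so far; clear the journal on full success
        if not os.path.exists(self.journal_path):
            return 0
        with open(self.journal_path) as f:
            pending = [json.loads(line) for line in f]
        for message in pending:
            self.logging_api.write(message)   # may raise; journal stays intact
        os.remove(self.journal_path)
        return len(pending)
```

If the Logging API is down, `flush()` raises and the journal survives, so the companion MS can retry later from the shared file.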

Part 5

DT Implementation Options

Scope

We talk about distributed transactions (DT) in distributed systems that, according to the CAP theorem, can be in a state of eventual data consistency. Such a state is attributed to replication latency over the network points. In contrast, a DT is about ordered activities or functionality provided by Microservices (MS), and it relies on whatever data each MS works with at the moment of invocation. This means that in distributed systems the outcome of a DT can vary depending on the data consistency over the transaction time. Nevertheless, the DT demands that the supplemental MS be available and accessible to the Transaction Orchestrator when needed.

According to Microservice Architecture, an MS participating in a DT may fail, but this is not an isolated case because it can cause a failure of the related transaction. From a business perspective, that is a serious, inexcusable problem. This is why I believe that an implementation of a DT should provide as much fault-tolerance at run-time as possible.

While an MS is narrowed to a particular business function, MS-based Applications can operate across several business domains. The DT within an Application mimics the Application’s domain landscape, i.e. the DT may be single- or multi-domain scoped. This depends on the business logic managed by the Orchestrator.

Transaction Agent & Identifier

A DT cannot start by itself. In any case, there has to be a trigger. As a matter of fact, a trigger also cannot come out of the blue. Even a time-based trigger is based on an entity that counts time. We call an entity representing a transaction trigger a Transaction Agent. It can be physical or programmatic. For example, if a Transaction Orchestrator subscribes to and receives an event notification, a Transaction Agent has to fire this event notification. As we outlined in previous sections, it is recommended to have the Transaction Agent and the Transaction Orchestrator aware of each other.

Each DT has its unique identifier, plus each DT instance has its own identifier as well. This allows several instances of the transaction to run at the same time. Thus, the same MS can be invoked in several transactions at the same time. This is not a transaction-specific statement – an MS should be able to handle concurrent invocations by design and by the definition of ‘service’. An MS designer might or might not know whether the MS will be used in transactions but, to be on the safe side, it is always better to consider such usage – this will be very helpful to the transaction designer, though it is not a mandatory requirement.
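The pair of identifiers described above can be captured in a small value object. This is a sketch; the class name `DTId` and the example transaction name are illustrative, not part of any standard.

```python
import uuid
from dataclasses import dataclass

@dataclass(frozen=True)
class DTId:
    """Composite identifier: the DT definition plus one running instance of it."""
    transaction_id: str   # identifies the DT definition, e.g. "order-fulfilment"
    instance_id: str      # unique per run, so instances can execute concurrently

    @classmethod
    def new_instance(cls, transaction_id):
        return cls(transaction_id, uuid.uuid4().hex)

# two concurrent runs of the same transaction carry distinct composite IDs
a = DTId.new_instance("order-fulfilment")
b = DTId.new_instance("order-fulfilment")
```

Every invocation of a supplemental MS carries the composite ID, so concurrent instances never confuse each other’s responses.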

This is the reason why a transaction aligns with the MS, not the other way around.

Specifics of Event Polling Interactions

This practice has a consequence for an event polling mechanism. Assume that an Orchestrator MS is triggered. If a supplemental MS can be invoked via event polling, the Orchestrator must know all event polls of that supplemental MS: the poll that the supplemental MS uses for receiving events and the poll where the supplemental MS places its response-events. This is a requirement for the DT design.

An additional inconvenience comes from the fact that the Orchestrator MS and the supplemental MS may belong to different owners, and if the owner of the latter changes the polls, the former might not be notified. Also, while all push mechanisms – API and events – can easily apply security controls for the authorisation of an invocation, polling events require additional protection of the polls to make sure that only an authorised MS has access to particular events in the poll and that only an authorised MS left certain events in the poll. At the level of an MS-based application, it is possible to protect the polls as required, but in this case the polls do not belong to the MS anymore. Event polling, which has been expected to be an attractive means for decoupling interacting MS, de facto turns out to be one of the most complex ways of intercommunication.

A note: in SOA, trust between the consumer and the service provider is the fundamental precondition for having business relationships. Since MS are independent and realise business functions, the same trust is mandatory for MS interactions. This trust requires a consumer, or to-be-involved MS, to know which MS interacts with it and how. While such relationships do not fit with Microservice Architecture, the business responsibilities of MS mandate that the event pipelines/brokers/messaging/ESB play only a ‘bus’ role with no intelligence: the target should be identified by the sender, and the bus has to locate it in the net.

Though a polling MS is supposed to obtain the event notification, there is no guarantee of when this would happen. Overall, a DT may not rely on an eventual reaction of the requested functionality/MS, and there is no feasible mechanism that can force the polling MS to obtain and act on the event. Therefore, the DT works most effectively, with better performance, if an Orchestrator can invoke supplemental MS via an API, push-Event Brokers, a durable Message Intermediary and the like.

DT Design Steps

The DT design follows these sequential steps:

1) Define the purpose/goal of the DT

2) Identify the logic of the DT and split it into conditional sequences of steps

3) Identify the domains the DT has to cross

4) Identify the execution context for each DT step (the context for the step provider)

5) Identify the Transaction Agent and the type of communication between the Transaction Agent and the Transaction Orchestrator (e.g. event polling via a subscription, messaging via Pub/Sub or dedicated message Queues, push events via an Event Broker, direct API-based invocation, etc.)

6) Identify which MS the Orchestrator has to interact with at each step and, respectively, which communication method should be used for the chosen MS. Realise the appropriate Sib, Tandem or Listener patterns for the related step

7) Identify and design error/exception handling for each step

8) Verify which design patterns are realised by each chosen MS. If the transaction has to deal with an MS via a push-invocation like an API, messaging or an Event Broker, only the corresponding Senders and Listeners should be prepared. If it is a polling invocation, the designer has to find out what invocation resources are available to the Orchestrator for the fail-over of the polling supplemental MS. If no failure mitigation is offered by the owner/provider of the MS, the designer should raise a severe risk for the DT. For cases where the transaction must complete, it is not recommended even to start a DT if no resilience is provided for the risky MS

9) Verify that the MS “companions” implementing the Sib or Tandem patterns share the same transaction data store where the transaction state is persisted

10) Verify that the transaction store can perform self-cleansing and can be reliably invoked upon transaction completion.
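Steps 6 and 7 above – invoking the chosen MS per step and handling errors per step – can be sketched as a simple orchestrator loop. All names here (`run_transaction`, `StepFailed`, the `append` method of the state store) are assumptions for illustration; each `invoke` callable is presumed to wrap whatever communication method was chosen for that MS.

```python
class StepFailed(Exception):
    """Raised when a DT step fails after its error handling is exhausted."""

def run_transaction(steps, state_store, dt_id):
    """Execute DT steps in order, persisting each attempt and outcome.

    `steps` is an ordered list of (name, invoke) pairs, where `invoke` is a
    callable wrapping the communication method chosen for that MS (API call,
    message send, etc.). Every attempt and completion is recorded, so the
    state store doubles as an Event Source of the transaction's progress.
    """
    for name, invoke in steps:
        state_store.append(dt_id, {"step": name, "status": "attempt"})
        try:
            result = invoke()
        except Exception as exc:
            # step 7: record the failure and escalate to the caller's
            # error/exception handling (e.g. a roll-back process)
            state_store.append(dt_id, {"step": name, "status": "failed",
                                       "error": str(exc)})
            raise StepFailed(name) from exc
        state_store.append(dt_id, {"step": name, "status": "complete",
                                   "result": result})
    return True
```

Because every attempt is persisted before the invocation, a surviving companion Orchestrator can read the store and see exactly which step was in flight when its primary failed.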

Orchestrator Bootstrap

Special care has to be taken with the bootstrap of the Orchestrator. Both the Sib and Tandem patterns comprise two MS “companions” – a primary MS and a companion MS. So, we have two orchestrating MS for each transaction. They are not two instances of the same MS; they are two different MS that have the same inputs, functionality and outcomes (such pairing eliminates a single point of failure).

In the Tandem pattern, both MS in the pair work continuously and check the status of each other. That is, the primary MS has to be bootstrapped to start the polling Orchestrator pair. In the Sib pattern, the MS “companion” acts as a “cold reserve” in case of primary MS failure. However, the first push-invocation will bootstrap the primary MS and, in case of its failure, another push-invocation will bootstrap the companion MS.
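The Sib-style push-invocation behaviour can be reduced to a tiny dispatch rule. This is only a sketch of the idea; in practice the “primary is down” signal would come from the platform or a health check rather than an in-process exception, and both handlers would read the shared transaction state store to continue from the last persisted step.

```python
def dispatch(invocation, primary, companion):
    """Route a push-invocation to the primary Orchestrator MS; if the
    primary fails, the same invocation bootstraps the companion MS
    (the 'cold reserve' of the Sib pattern)."""
    try:
        return primary(invocation)
    except Exception:
        return companion(invocation)   # cold reserve takes over
```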

Transaction State

The DT Orchestrator stores the state of the transaction in the data store shared between the MS companions – primary and companion. Every change in the state of the transaction is stored as an event, forming the Event Source storage. If needed, the latest transaction step – an attempt and its completion – can be easily uncovered from the records in the Event Source store.

Additionally, in the case of failure, the surviving Orchestrator MS logs the failure-message about its companion and, if possible, sends a run-time report to the Management Console/Module or Support Service about the failure. The purpose of the report is to facilitate a prompt analysis of the failure and to fix and re-deploy the failed MS companion as soon as possible. The re-deployed MS should be linked to the same Event Source storage used before and continue working.

While each new DT ideally wants its own transaction state store, depending on the size and complexity of the MS-based solution and the number of required transactions, it is possible to have a single distributed transaction state store for all transactions. In this case, each record in the data store should include a composite ‘DT ID’ – a transaction ID plus the transaction instance ID. If a particular Orchestrator demonstrates frequent failures, it is recommended to log the content of the entire Event Source store at the end of the transaction for further analysis and corrections.
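A shared Event Source store keyed by the composite ‘DT ID’ might look like the in-memory sketch below (a real store would be a durable, distributed database; the class and method names are assumptions).

```python
from collections import defaultdict

class TransactionStateStore:
    """Single shared Event Source store for all transactions.

    Every record is keyed by the composite 'DT ID' (transaction ID plus
    instance ID), so many transactions and many concurrent instances can
    share one store. The latest step of any instance is simply its last
    appended event.
    """
    def __init__(self):
        self._events = defaultdict(list)

    def append(self, transaction_id, instance_id, event):
        self._events[(transaction_id, instance_id)].append(event)

    def latest(self, transaction_id, instance_id):
        events = self._events[(transaction_id, instance_id)]
        return events[-1] if events else None

    def purge(self, transaction_id, instance_id):
        """Self-cleansing upon transaction completion."""
        self._events.pop((transaction_id, instance_id), None)
```

The `purge` method corresponds to the self-cleansing requirement of design step 10: once a transaction completes, its events can be removed (or, for a frequently failing Orchestrator, logged first for analysis).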

Orchestrator-Supplemental MS Interactions

Now we can turn to the interactions between the DT Orchestrator and the supplemental MS. An ideal model for these interactions is one where any interaction from the Orchestrator at any of its steps is performed towards a collection of supplemental MS that provide the same needed functionality via the same interaction mechanism. The Orchestrator could then pick any available MS from this collection. However, this is not a trivial task and may be viewed as over-design (once, IBM implemented this model in one of its products).

Each supplemental MS may and should be able to participate in as many DT as needed. It is a choice between creating a new instance of the MS or load balancing across the existing MS instances in response to concurrent requests. Either way, the supplemental MS needs to recognise and specify the composite DT ID in its responses.

In the case of failure, Best Practice recommends that a supplemental MS fail quickly but gracefully. For a DT, this recommendation has two aspects. First, if the Orchestrator receives an error code in the response indicating the failure of the MS, it immediately applies the CRBP[1]. If the response code is about the accessibility of the MS, the CRBP is not needed and the Orchestrator may retry the invocation a couple of times, but then, with no additional delay, should switch to the companion MS. Second, if a graceful failure does not take place and a few attempts to engage the supplemental MS collapse, the switch to the companion MS is also due.
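These failure-handling rules can be condensed into one decision function. The response codes, callables and retry count below are illustrative assumptions, not a real protocol; the point is the branching: functional failure triggers the CRBP at once, while an accessibility problem earns a couple of retries and then an immediate switch to the companion MS.

```python
def handle_response(code, retry_call, companion_call, crbp, max_retries=2):
    """Apply the failure-handling rules for one supplemental-MS response.

    'ok'          -> nothing further to do.
    'failed'      -> the MS failed functionally: trigger the CRBP roll-back.
    anything else -> treated as an accessibility problem: retry a couple of
                     times, then switch to the companion MS without delay.
    """
    if code == "ok":
        return "done"
    if code == "failed":
        crbp()                          # conditional roll-back, no retries
        return "rolled-back"
    for _ in range(max_retries):        # accessibility problem: retry briefly
        if retry_call() == "ok":
            return "done"
    companion_call()                    # then fail over to the companion MS
    return "switched-to-companion"
```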

Sub-Transactions of DT

Finally, let’s look at so-called sub-transactions. They act in the same way as sub-processes of a process. Designing, deploying, monitoring and maintaining a DT is already a non-trivial task. We consider the following technique for dealing with distributed sub-transactions (here, we do not consider MS-internal transactions, which may be addressed via 2PC).

We can simplify the work with distributed sub-transactions if we change our point of view and look at the case from the functionality viewpoint. As we know, decomposing a coarse-grained business function (represented by the main DT) produces a comprehensive layered segregation of either finer-grained compositions or individual functions. The lowest layer of the structure represents all the simple functions that have no sub-transactions (as shown in the diagram here).

That is, each sub-transaction can be viewed in its own layer (scope isolation) while its Transaction Agent appears as an element (MS) in the upper layer. This allows the development of the DT and its sub-transactions to be loosely coupled and performed in parallel.

Conclusions

Distributed transactions over individual functional providers such as Microservices fulfil business needs. The realisation of the majority of business tasks requires the collective efforts of separate functions. In the contemporary economy, business tasks become more and more complex. If technology works with only very simple business tasks, it loses its credibility with the business. Unless the business can switch to AI-based operations in full, technology must provide complex solutions adequate to the business tasks. Yes, solutions should not be over-complicated, but they also may not be oversimplified. As we have seen, distributed transactions over Microservices are not very complex or difficult; they merely require professional work across individual Microservices, which can mean work across several development teams – and that is not that difficult.

[1] CRBP stands for a Conditional Roll-Back Process. For more information on CRBP see part 2 of the article.
