Event Driven Architecture allows for unprecedented levels of decoupling between a system’s major components. Recently I’ve been blogging about the hidden issues that can appear when adopting this model in your application. The most subtle issue envisioned so far involves potential synchronization problems when encountering a catastrophic failure of one of these decoupled components.
The problem arises directly from the decoupled nature of the components. Because they are so decoupled, it’s no longer necessary to shut down the entire application to make backups; each component can back up its working data on its own schedule. As a consequence, a catastrophic failure of a single component can leave that component out of sync with the others. Once restored, the component is unaware of any messages it received or transmitted after its last backup was taken. The other components, however, assume it still knows about that traffic. The state of the restored component is therefore out of sync with the rest of the application.
For example, consider a shipping component within an ecommerce application. It is responsible for matching customer orders with delivery addresses and is the system of record for tracking shipped orders. The shipping component might depend on receiving messages from the order processing component and, similarly, might transmit messages to a logistics component, which is responsible for actually shipping the products. Now suppose a catastrophic failure of the shipping component occurs an hour or so after its last backup.
When the shipping component is restored from that backup, it is oblivious to up to an hour’s worth of message traffic, both received and transmitted. The order processing component will assume it has successfully delivered its messages to the shipping component. Similarly, the logistics component will have received and acted upon messages that it assumes the shipping component remembers sending. Given the complexity of even this simple scenario, it is easy to see why circular dependencies between components are to be avoided.
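The scenario can be reproduced in a few lines. What follows is only a minimal sketch: the `Component` class, the `deliver` helper, and the message ids are all hypothetical, and real components would exchange messages through a broker rather than by direct calls.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical component that remembers the ids of messages it received and sent."""
    name: str
    received: list = field(default_factory=list)
    sent: list = field(default_factory=list)

    def snapshot(self):
        # Copy the working data, as an independent backup job would.
        return (list(self.received), list(self.sent))

    def restore(self, snap):
        # Replace the working data with the contents of the backup.
        self.received, self.sent = list(snap[0]), list(snap[1])


def deliver(sender, receiver, message_id):
    # Stand-in for a message broker: record the message on both sides.
    sender.sent.append(message_id)
    receiver.received.append(message_id)


orders = Component("order processing")
shipping = Component("shipping")
logistics = Component("logistics")

deliver(orders, shipping, "order-1")
deliver(shipping, logistics, "ship-1")

backup = shipping.snapshot()  # shipping backs up on its own schedule

deliver(orders, shipping, "order-2")    # traffic after the backup
deliver(shipping, logistics, "ship-2")

shipping.restore(backup)  # catastrophic failure, then restore

# Messages upstream believes it delivered but shipping no longer knows about.
missed_upstream = [m for m in orders.sent if m not in shipping.received]
# Messages downstream acted upon but shipping no longer remembers sending.
unknown_downstream = [m for m in logistics.received if m not in shipping.sent]

print(missed_upstream, unknown_downstream)
```

The two lists at the end are the two halves of the inconsistency: one faces upstream, the other downstream.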
Restoring the order processing or logistics components from their last backup will not resolve the issue, because, as we pointed out originally, each component now enjoys its own specific backup schedule.
What is clear is that once a component is restarted, some synchronization effort must always be made before it resumes normal activity. The problem can be broken down into two discrete issues:
- Synchronizing the restored component with upstream message originators
- Synchronizing the downstream message consumers with the restored component
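The requirement that both synchronizations finish before normal activity resumes can be pictured as a simple gate. This is a sketch only; the `RestoredComponent` class, its method names, and the two-state model are assumptions made for illustration, and it says nothing about how each synchronization is actually performed.

```python
from enum import Enum, auto


class State(Enum):
    RESTORED = auto()  # restored from backup, not yet synchronized
    RUNNING = auto()   # both synchronizations finished


class RestoredComponent:
    """Hypothetical component that refuses work until both syncs are done."""

    def __init__(self):
        self.state = State.RESTORED
        self.upstream_synced = False
        self.downstream_synced = False

    def mark_upstream_synced(self):
        self.upstream_synced = True
        self._maybe_resume()

    def mark_downstream_synced(self):
        self.downstream_synced = True
        self._maybe_resume()

    def _maybe_resume(self):
        # Normal activity resumes only when both issues are resolved.
        if self.upstream_synced and self.downstream_synced:
            self.state = State.RUNNING

    def handle(self, message):
        if self.state is not State.RUNNING:
            raise RuntimeError("component has not finished synchronizing")
        return f"processed {message}"
```

With this shape, a call to `handle` raises until both `mark_upstream_synced` and `mark_downstream_synced` have been called, in either order.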
Each of these problems has several prospective solutions, and each will be discussed in its own blog entry. Of the two issues, the former appears the easier to resolve. Please note that an implementation based on Event Sourcing would be immune to these kinds of synchronization issues; however, a different class of issues related to recovering from catastrophic failures can ensue.
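To see why Event Sourcing sidesteps the problem, consider a minimal sketch in which state is never backed up directly but is always derived from an append-only log. The event names, the `apply_event` and `rebuild` functions, and the assumption that the log itself survives the failure are all illustrative.

```python
def apply_event(state, event):
    """Return a new state with one (kind, order_id) event applied."""
    kind, order_id = event
    new_state = dict(state)
    if kind == "order_received":
        new_state[order_id] = "pending"
    elif kind == "shipment_requested":
        new_state[order_id] = "shipped"
    return new_state


def rebuild(log):
    """Recreate the component's state by replaying every event in order."""
    state = {}
    for event in log:
        state = apply_event(state, event)
    return state


log = []    # assumed durable: it outlives the component's failure
live = {}
for event in [
    ("order_received", "A1"),
    ("shipment_requested", "A1"),
    ("order_received", "B2"),
]:
    log.append(event)               # record the event first
    live = apply_event(live, event)  # then update the in-memory state

# After a failure the in-memory state is gone, but replaying the log
# reproduces it exactly, including events after any earlier snapshot.
recovered = rebuild(log)
print(recovered == live)
```

Because the log is the system of record, there is no window between a backup and a failure in which messages can be forgotten, provided the log itself is not lost.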
