Event-Driven Internal Microservice Communication
- TECHNICAL DOCUMENT
- DENKOV, ANTONIO A.M.
- 12/06/2023
Glossary
| Term | Definition |
|---|---|
| Apache JMeter | An open-source software application designed for load testing and performance measurement of software services. |
| API (Application Programming Interface) | An API is a set of rules and protocols that allows different software applications to communicate and interact with each other. It provides a standardized method to access the features or services of a platform or piece of software. |
| ACID (Atomicity, Consistency, Isolation, and Durability) | A group of characteristics of database operations designed to ensure data validity despite errors, power outages, and other calamities. A transaction is a series of database activities that satisfy the ACID criteria and may be thought of as one logical operation on the data. |
| Bounded context | A well-defined domain within a software system with clear boundaries, encapsulating related concepts and rules. It allows independent development and maintains consistency between different contexts. |
| Base64Encoder | A utility in Java for converting binary data into a Base64 string representation. |
| CPU (Central Processing Unit) | The CPU is the "brain" of a computer, responsible for executing instructions and performing calculations. |
| Data structures | Structures that organize and store data for efficient access and manipulation. Examples include arrays, linked lists, stacks, queues, trees, and graphs, each suited for specific operations. |
| Entity | A distinct and identifiable object or concept that represents a unique element within a system or application. It can represent a physical object like a person, place, or thing as well as an abstract idea, such as an order or transaction. Entities usually have attributes that define their properties and relationships with other entities. |
| Endpoint | A specific URL where a web service can be accessed by a client application. |
| Event-driven inter-service data replication | The process of copying and maintaining the same data across multiple systems through the use of events that trigger data synchronization. |
| Eventual consistency | A consistency model in distributed systems where, following updates, data replication may exhibit temporary discrepancies but will eventually reach a consistent state across all nodes, ensuring data coherence. |
| Horizontal scaling | The process of adding more machines to a system or increasing the number of nodes in a cluster to handle increased load. |
| Jackson library | A Java-based library used for processing JSON data format. |
| JPA (Java Persistence API) repository | A specification that describes the management of relational data in applications using Java Platform, Standard Edition and Java Platform, Enterprise Edition. |
| JSON | Lightweight, human-readable data interchange format that employs text-based structures for encoding data in the format of key-value pairs, facilitating seamless data exchange between systems. |
| Latency | The time it takes for data to travel from one point to another. In the context of message brokers, it is the delay before a transfer of data begins following an instruction for its transfer. |
| Liquibase | An open-source, database-independent library for tracking, managing, and applying database schema changes. |
| Microservice architecture | An architectural style that structures an application as a collection of loosely coupled sub-systems. |
| Mirroring (RabbitMQ) | A concept in RabbitMQ that provides similar behavior to replication in Kafka, but it is not on an architectural level. |
| NoSQL (Not Only SQL) | A type of database that provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases (RDBMS). |
| Publish/Subscribe (pub/sub) | A messaging pattern where senders of messages, called publishers, do not program the messages to be sent directly to specific receivers, called subscribers, but instead categorize published messages into classes without knowledge of which subscribers, if any, there may be. |
| RDBMS (Relational Database Management System) | Also known as SQL databases, these are databases that store data in a structured format, using rows and columns. |
| REST communication | A web communication protocol that enables the exchange of data between clients and servers, particularly when requesting or delivering webpages or other resources. |
| Scalability | A system's ability to handle an increasing amount of work or its potential to be enlarged to accommodate that growth. In the context of databases, scalability relates to how well the system can handle increasing amounts of data and simultaneous users. |
| Sharding | A type of database partitioning that separates very large databases into smaller, faster, more easily managed parts called data shards. |
| Snapshot (of databases) | A point-in-time copy of data, often used for backups or as a point to revert back to in the event of data loss or corruption. |
| Schema evolution | The ability to adapt your data's schema as your application's requirements change. |
| State-transition database | A database system that stores transitions of data states over time. Each transition, or event, records what has changed and why the change occurred. |
| Tight coupling | The linking of two systems or programs so closely that they depend on each other. |
| Throughput | The amount of data that can be transferred from one place to another in a given amount of time. It is a key measure of the performance of a system. |
| UI (User Interface) | The visual and interactive components that enable users to interact with a software application. |
| Vertical scaling | The process of adding more power (CPU, RAM) to an existing machine. |
| XML (Extensible Markup Language) | A widely used markup language for storing and exchanging structured data. It uses tags to define elements and their relationships, allowing for flexible representation. XML is appropriate for data sharing and storage since it can be read by both humans and machines. |
1. Introduction
This technical document is an integral component of the Event-Driven Internal Microservice Communication project for Limax B.V. Its primary objective is to give a complete and in-depth depiction of the research and implementation activities undertaken throughout the project. To accomplish this, the document covers a wide range of information, ranging from the research approach, key research questions, and thorough explanations of the implementation process to extensive detailing of technical aspects such as diagrams, testing methodologies, results, and additional technical components.
The document begins with an overview in the Introduction, followed by a clarification of the Approach for the research and implementation.
The next part, System Design and Architecture, offers a high-level view of the system's architecture, along with more detailed views through container diagrams and diagrams showcasing components and their interactions. This is further supplemented by sequence and state diagrams to provide a complete picture of the system's structure and workflow.
In the Implementation Details section, readers will find comprehensive coverage of the technologies used during the project, along with the data structures applied. This provides a detailed understanding of the technical choices made during the project and their implications.
The Results section, divided into Research and Implementation subsections, provides an extensive account of the findings and outcomes of the project. This is where the research questions are directly answered. This section further examines the realization of event-driven inter-service data replication and event sourcing within the project's scope. It also elaborates on the importance of the Saga pattern and event sourcing, revealing their influence on the project's direction.
The Testing and Validation section, a crucial element of the paper, offers precise information about the testing procedures utilized, the features evaluated, the pass/fail criteria, and the testing environment. A transparent perspective into the efficacy and performance of the adopted solution is provided by the results of various tests carried out, including unit tests, contract tests, performance tests, and end-to-end tests, which are also shared in this area.
The Conclusion section revisits the main research question and offers a comprehensive summary of the project's outcomes. The conclusion is supported by a detailed rationale based on the evidence gathered throughout the project. This helps to validate the conclusion and provides the reader with a strong closing statement on the success and impact of the project.
Lastly, a list of References is provided to acknowledge the sources of information and inspirations that guided the project.
In summary, the technical document serves not only as an informative resource, detailing each phase of the Event-Driven Internal Microservice Communication project, but also as a guidebook, offering a complete understanding of the process, outcomes, and the strategic decisions made throughout the course of the project.
2. Approach
2.1 Research
The research has been conducted by following the DOT framework (The DOT Framework, n.d.) and adhering to its respective methodologies and strategies. This is the approved and preferred research framework recommended by Fontys. The framework is based on defining relevant research questions, consisting of one main research question that covers the general core of the assignment, and sub-questions that focus on specifics. The aim is to answer all the sub-questions resulting in a conclusion for the main research question.
To achieve this, the researcher utilizes various research methods, such as library, field, lab, showroom, and workshop. Each method is accompanied by specific strategies that are used to gather and analyze data. The more methods and strategies the researcher uses, the stronger the research conclusions become.
Adhering to the DOT framework and utilizing various research methods and strategies results in comprehensive and thorough research, leading to well-supported conclusions.
Research questions
The respective research questions driving the research of this assignment are listed in this section.
Main research question
How can Flowcontrol's microservice architecture be optimized for speed and reliability in internal communication?
Sub-questions
- What are the key factors that affect the performance of the current synchronous REST communication system, and how can they be optimized?
- How does asynchronous message-based communication via a message bus compare to synchronous REST communication in terms of performance and scalability?
- What are the architectural and design considerations for transitioning from a synchronous REST communication system to an asynchronous message-based communication system?
- What are the best practices (conventions) for implementing a message-based communication system using a message bus, and how can they be tailored to the specific context of the system being developed?
- What are the potential risks and challenges of transitioning from a synchronous REST communication system to an asynchronous message-based communication system, and how can they be mitigated?
2.2 Implementation
The implementation is research-based, meaning that prior to implementing any code, thorough research on the relevant topics has been conducted. Throughout the project, Agile principles have been applied with a particular emphasis on the Scrum framework. This approach has been fundamental in driving the project's progress in incremental phases, allowing a structured, professional, and effective execution of the assignment.
3. Preliminary Design and Technology Choices
3.1 Technologies
In this project, a variety of technologies were leveraged to achieve the desired outcomes. To start, Java and the Spring Boot framework have been used as the primary backend language and framework, respectively, for the microservices setup. For inter-service communication, which has been the primary objective of the assignment, Apache Kafka, an open-source message bus, is leveraged. Kafka was chosen for its robustness, scalability, and its support for delayed subscriptions and message delivery, which is crucial for the context it is used in: data replication with eventual consistency.
For data persistence, SQL databases are being used as primary data stores, specifically Microsoft SQL Server (MSSQL Server) due to its scalability, robustness, and rich feature set.
Docker is also utilized for containerization, both locally and in the production environments. This makes it possible to isolate the microservices and other relevant systems in separate containers, facilitating scalability and fault tolerance.
3.2 Data structures
The data structures used in Flowcontrol have been chosen for their suitability to the tasks at hand and their compatibility with the chosen technologies.
The replicated data models are designed to include only the attributes that are relevant to the specific services. For instance, within a given microservice, if a data model consists of 10 attributes, only 5 of these attributes may be relevant to a consumer microservice. For specific details regarding the distribution of replicated data models and the attributes they carry, the reader may refer to the Database Diagrams folder and inspect the database diagrams for each microservice. Replicated data models' table names conclude with the suffix "_external", e.g., "articles_external".
For the event store, a relational data model within the MSSQL Server database has been employed. Each event that is produced through Kafka is stored as a row in the event table, with columns for the event ID, the event type, the payload, and the timestamp, to aid in data consistency and data recovery. The event payload is stored as a serialized binary via Protobuf, in combination with JSON serialization for nested data models, leveraging the Jackson serialization library. This structure facilitates efficient querying and analyzing of the event data.
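The following is a minimal sketch of how such an event-store row could be mapped as a JPA entity. The entity name, table name, and column names (StoredEvent, events, and so on) are assumptions for illustration only and may differ from the actual Flowcontrol code; depending on the Spring Boot version, the javax.persistence package may apply instead of jakarta.persistence.

```java
import jakarta.persistence.*;
import java.time.Instant;

// Illustrative sketch of an event-store row as described above;
// the real entity and column names may differ.
@Entity
@Table(name = "events")
public class StoredEvent {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;                 // event ID

    @Column(nullable = false)
    private String eventType;        // e.g. "ARTICLE_CREATED"

    @Lob
    @Column(nullable = false)
    private byte[] payload;          // Protobuf-serialized payload (nested models as JSON via Jackson)

    @Column(nullable = false)
    private Instant timestamp;       // when the event occurred

    protected StoredEvent() {}       // no-args constructor required by JPA

    public StoredEvent(String eventType, byte[] payload, Instant timestamp) {
        this.eventType = eventType;
        this.payload = payload;
        this.timestamp = timestamp;
    }

    // getters omitted for brevity
}
```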
4. System Design and Architecture
Research strategies utilized while designing the diagrams include:
- Workshop – IT architecture sketching
4.1 High-level system architecture (C1)
The high-level system architecture diagram provides an overview of the system, showing how it interacts with other internal and external (sub) systems and types of users, including high-level components. C1 diagrams are by nature the highest abstraction level of the software architecture diagrams.
From the diagram illustrated in Figure 1 below, it is evident that Users interact with this system through a client interface built with VueJs, Vite, Pinia, Tailwind, DaisyUI, Router, and TypeScript.
The core of Flowcontrol consists of several services orchestrated via a Gateway built with SpringBoot. The services include the Article service, Transport service, Production service, Farmer service, and an Authentication service, each running on Springboot.
Each service has its dedicated MySQL/MariaDB database for storing relevant data. The Article, Transport, and Farmer services also have an MS SQL Server database for logging events. Interaction with these databases is handled through an Object-Relational Mapping (ORM) framework, Hibernate JPA.
Outside of the core system, there is a Kafka cluster with three brokers, responsible for facilitating inter-service communication. The Farmer, Production, Transport, and Article services all interact with the Kafka cluster to send and receive messages.
Finally, the Gateway serves as a bridge, allowing communication between the client interface and all the services using JSON. This setup provides a central entry point, simplifying the interactions between the client and multiple services within the Flowcontrol API.
todo add image
4.2 Container diagram (C2)
The container diagram goes into further detail, breaking down the system into containers for the Article microservice. Users interact with the system through a VueJs-based client interface, which communicates with a SpringBoot Gateway. The Gateway centralizes requests and enables interaction with multiple services. Inside the API application, requests first encounter a controller, which forwards them to a service. This service collaborates with a repository to manage CRUD operations with a MySQL/MariaDB database and a Kafka Producer to relay messages to a Kafka broker. Furthermore, the service works with an Event Repository to manage domain events with a MS SQL Server-based event store database. The architecture ensures a smooth flow of operations within the Article service. todo add image
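As a hypothetical sketch of the flow just described, the service layer could be wired roughly as follows: the entity is persisted through the repository, the domain event is appended to the event store (reusing the StoredEvent sketch from section 3.2), and the event is published to Kafka. Class names, the topic name, and the repository types are assumptions, not the actual Flowcontrol code, and Article, ArticleRepository, and EventRepository are presumed to exist elsewhere.

```java
import java.time.Instant;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

// Illustrative sketch of the container-level flow: persist, log the domain
// event in the event store, and publish the event to the Kafka broker.
@Service
public class ArticleService {

    private final ArticleRepository articleRepository;        // MySQL/MariaDB (CRUD)
    private final EventRepository eventRepository;            // MS SQL Server event store
    private final KafkaTemplate<String, byte[]> kafkaTemplate; // assumes a byte[] value serializer

    public ArticleService(ArticleRepository articleRepository,
                          EventRepository eventRepository,
                          KafkaTemplate<String, byte[]> kafkaTemplate) {
        this.articleRepository = articleRepository;
        this.eventRepository = eventRepository;
        this.kafkaTemplate = kafkaTemplate;
    }

    @Transactional
    public Article create(Article article) {
        Article saved = articleRepository.save(article);

        byte[] payload = serialize(saved);
        eventRepository.save(new StoredEvent("ARTICLE_CREATED", payload, Instant.now()));

        // Publish the event so consuming services can replicate the entity.
        kafkaTemplate.send("articles", String.valueOf(saved.getId()), payload);
        return saved;
    }

    private byte[] serialize(Article article) {
        // Placeholder: the real code uses Protobuf + Jackson, see section 3.2.
        return new byte[0];
    }
}
```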
4.3 Components and their interactions (C3)
The component diagram dives further into each container and breaks it down into various components. The component diagram for the Article microservice can be seen below: todo add image
The component diagram for the Farmer microservice: todo add image
4.4 Sequence diagram
The sequence (also commonly referred to as flow) diagram below provides a visual context of the data replication and event sourcing (logging) process within the Flowcontrol system. It demonstrates the flow of creating an Article entity and using Kafka to replicate it across the Farmer and Transport microservices' respective databases. It also showcases the event being captured and logged in the services' respective event stores. todo add image
4.5 State diagrams
The state diagram below represents the state of the Article object as it goes through the event sourcing and replication process: todo add image
5. Process and Results
5.1 Research
5.1.1 Research question 1.
What are the key factors that affect the performance of the current synchronous REST communication system, and how can they be optimized?
Research strategies utilized to facilitate the research include:
- Library – Literature study, design pattern research: Library research was used to dive into the specific traits of the problem (HTTP requests), as well as to research relevant patterns as possible solutions.
- Field – Interview, problem analysis: Interviewed the IT team with questions that would give me enough information to begin performing my research on the problem, such as:
- Is there a high volume of requests that could be a cause of the slow performance?
- Is there a lot of processing occurring behind the requests that could cause the slow performance?
- Which are the endpoints that do not meet the performance expectations?
- What are the synchronous requests used for?
- Workshop – Code review: A code review was performed initially to facilitate the problem analysis. The code was thoroughly reviewed to gain an understanding of the application's functions and, more importantly, of the problematic endpoints where performance was slow, as well as of the synchronous requests themselves.
Code analysis and findings
During the initial code analysis, the primary focus was to understand the purpose of the HTTP communication within the Flowcontrol system. In summary, the reason is to share data between the microservices.
Flowcontrol is designed with a microservice architecture. Microservices is an architectural style that structures an application as a collection of services that are organized around business capabilities, independently deployable, and loosely coupled. (What are microservices?, n.d.)
As such, the application consists of four primary services, each managing a distinct aspect of the application's functionality: Article, Farmer, Transport, and Production services. Each service also has its own dedicated database to adhere to one of the main principles of utilizing such an architecture: independence of the microservices. However, the interaction and collaboration between these modules is critical for the overall operation of the application. For instance, the Article service is responsible for creating and managing Articles. These represent various products that are integral to the business process. Once an Article is created, it is necessary to associate it with a Farmer, managed by the Farmer service. This assignment represents a real-world scenario where a particular Farmer is tasked with producing or handling a specific Article.
The Farmer service, therefore, needs to be aware of the creation of new Articles to assign Farmers appropriately. In the current architecture, this knowledge is achieved through synchronous HTTP requests, indicating a strong dependency between the two services. In practical terms, when the Farmer service needs to assign a farmer to an article, it sends an HTTP request to the Article service to fetch the required article. The figure below shows a code snippet illustrating this interaction.
todo insert image
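Purely as an illustration (the actual Flowcontrol snippet is shown in the figure above), a blocking inter-service call of this kind typically looks like the sketch below, here using Spring's RestTemplate. The client class, DTO, endpoint path, and service URL are assumptions, not the real code.

```java
import org.springframework.web.client.RestTemplate;

// Hypothetical illustration of a blocking inter-service call:
// the Farmer service fetches an Article from the Article service over HTTP.
public class ArticleClient {

    private final RestTemplate restTemplate = new RestTemplate();

    public ArticleDto fetchArticle(long articleId) {
        // The calling thread blocks here until the Article service responds
        // (or the call fails if that service is slow or unavailable).
        return restTemplate.getForObject(
                "http://article-service/api/articles/" + articleId,
                ArticleDto.class);
    }
}
```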
The Transport service further extends this chain of interdependencies. It needs to be aware of the Farmers and the Articles they are assigned, as well as various other entities owned by the two services to facilitate the transportation logistics.
The Transport service performs a significant number of high-volume requests. It is responsible for creating pallet labels for the articles and solely for this task may need to execute up to five separate requests to obtain various entities owned by different services, which are necessary for this process. The code snippet below shows these distinct requests, highlighted by yellow lines.
todo insert image
The findings from the analysis raise the following logical questions:
How may synchronous inter-service communication affect performance?
Having previously mentioned that microservices should be loosely coupled, is this approach of tight coupling a problem?
Synchronous requests & performance
In synchronous requests, the client (or requestee) blocks the execution of code and waits for a response to the remote request before continuing execution. (Understanding synchronous and asynchronous requests, 2017) (Synchronous and asynchronous requests, n.d.)
The Hypertext Transfer Protocol (HTTP) is a synchronous communication protocol that facilitates distributed and collaborative communication within hypermedia information systems. Since it was adopted in 1990, HTTP has acted as the Internet's main protocol for exchanging data. It provides a standardized approach for facilitating communication between computers. The specification of HTTP outlines the guidelines for constructing and transmitting requests to servers, as well as defining the format and content of server responses. (HTTP - Overview, n.d.)
In HTTP inter-service communication, since the caller is locked in the interaction until a response is received, performance might be reduced. This is because the caller might receive the response in a mere millisecond or in a few seconds. Regardless of the application latency, the caller cannot move forward to the next task until the response is received. This would be particularly detrimental if one of the services becomes unavailable. (Reselman, 2021)
Synchronous communication & coupling
The microservice architectural pattern distributes the system into smaller, independent subsystems with the goals of increasing scalability, resilience, ease of deployment, and optimization for replaceability. Independence and self-sustainability are at the core of the microservice architecture and design ethos. Therefore, loose coupling is one of its key traits. (Kruisz, 2021) (Simpson, 2022)
Two different types of coupling exist in the context of microservices:
- Design-time coupling – influences development velocity.
- Runtime coupling – influences availability.
Design-time coupling
Design-time coupling refers to the likelihood that services need to change together for the same reason. If two services are loosely coupled, then a change to one service rarely requires a change to the other service. However, if two services are tightly coupled, then a change to one service often requires a change to the other service.
Runtime coupling
The degree to which the availability of one service is influenced by the availability of another service is known as runtime coupling. Particularly, it is the extent to which the availability of one service's implementation of an operation affects the availability of another service. (Essential characteristics of the microservice architecture: loosely coupled, n.d.)
HTTP requests inherently bring tight runtime coupling. Operations that employ it for inter-service communication have reduced availability since all services involved in the communication need to be available and respond in a timely manner. (Essential characteristics of the microservice architecture: loosely coupled, n.d.)
Conclusion
In conclusion, synchronous inter-service communication that facilitates data model sharing between services may adversely affect the performance of the application and significantly reduce availability, which is one of the core principles of a microservice architecture. Thus, it is commonly labelled as an anti-pattern within such architectures, and tightly coupled microservice systems are referred to as distributed monoliths. A distributed monolith is a system that resembles the microservice architecture but is tightly coupled within itself like a monolithic application. (Mani, 2022) (Soares, 2023) (Hameed, 2022)
5.1.2 Research question 2.
How does asynchronous message-based communication via a message bus compare to synchronous REST communication in terms of performance and scalability?
Research strategies utilized to facilitate the research include:
- Library – Literature study: Used to research the possible solutions online.
- Field – Interview: Presented the research findings to the stakeholders and asked for input into what approach should be taken.
Before proceeding specifically with message-based asynchronous communication, it is worth mentioning that asynchronous calls can also be HTTP-based. The three levels of asynchronous communication are covered below:
- Input/Output level: In this context, asynchronous means that requests sent to other services do not block the main executing thread until there is a response from the service. This allows the thread to complete other tasks simultaneously, thereby increasing the efficiency of the CPU by facilitating the handling of more requests. Such asynchronization can be incorporated in both RESTful calls and message queues, where the former simply completes the request in a separate thread and does not block the main thread. The figure below provides a visualization of how this works. (3 Common Misunderstanding of Inter-Service Communication in Microservices, n.d.) todo insert image
- Protocol level via a message queue: Most message queue systems are fundamentally designed around asynchronous communication principles. In a typical message queuing system, a producer submits a message to the queue without waiting for a consumer to be ready to receive it, and a consumer reads from the queue when it is prepared without needing to communicate in real time with the producer. (3 Common Misunderstanding of Inter-Service Communication in Microservices, n.d.)
- Service integration level: The objective of asynchronous communication at this level is to structure microservices so that they do not necessitate interaction with other services during their request/response cycle. This approach is adopted because the ultimate aim for any service is to remain accessible to the final user, regardless of the operational status of other services within the overall system - whether they are offline or not functioning properly. If a service needs to trigger an action in another service, this should be done outside of the request/response cycle. Moreover, in scenarios where a service is dependent on data located in another service, that data should be replicated across the services using eventual consistency. This also has the added benefit of enabling data conversion into the language of the specific Bounded Context. (3 Common Misunderstanding of Inter-Service Communication in Microservices, n.d.) todo insert image
Level 1 – input/output
In terms of performance and scalability, at level 1 – input/output (I/O) – everything depends on the application's performance. Because the main thread is essentially idle during the waiting period, which can be seen as a waste of resources, performance would be sub-optimal with synchronous communication in situations where the requested service takes a long time to respond or where the volume of requests is high.
On the other hand, asynchronous communication enables the primary running thread to carry on with other tasks while awaiting a response from a different service. As a result, the CPU can be used more effectively as waiting times do not require it to stay idle. This could considerably boost the application's throughput, especially in I/O-intensive scenarios. (3 Common Misunderstanding of Inter-Service Communication in Microservices, n.d.)
todo insert image
In the context of Flowcontrol, in situations where the system needs to retrieve data from multiple microservices to fulfill a single user request, in a synchronous model, the total time to complete the request would be the sum of the response times from all services. One service's response time will directly affect the whole response time if it is slow. In an asynchronous model, however, these requests can be sent concurrently. The total time to complete the request would be determined by the slowest response, not the sum of all responses. Thus, the application can potentially serve more clients within the same time frame, improving its ability to scale and the overall performance. (Asynchronous and synchronous HTTP request on server side, performance comparison, n.d.)
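The sketch below illustrates that point: three requests are issued concurrently, so the total wait is roughly the slowest response rather than the sum of all three. It uses the standard Java HttpClient purely for illustration; the URLs are placeholders, not actual Flowcontrol endpoints.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;

// Illustrative sketch: issuing several requests concurrently so the total
// latency is bounded by the slowest response, not the sum of all responses.
public class ConcurrentRequestsExample {

    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();

        CompletableFuture<String> articles = get(client, "http://article-service/api/articles/1");
        CompletableFuture<String> farmers  = get(client, "http://farmer-service/api/farmers/1");
        CompletableFuture<String> pallets  = get(client, "http://transport-service/api/pallets/1");

        // The calling thread is free to do other work here; it only waits
        // once all three responses are actually needed.
        CompletableFuture.allOf(articles, farmers, pallets).join();

        System.out.println(articles.join());
    }

    private static CompletableFuture<String> get(HttpClient client, String url) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                     .thenApply(HttpResponse::body);
    }
}
```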
However, with asynchronous I/O communication, services are still coupled: if one of the services that needs to respond with data to fulfill the request is unavailable, the whole operation would fail.
Level 2 – protocol level
Message queue systems facilitate scalability and improve system performance. With message queue systems, the sender (producer) and receiver (consumer) do not have to be available at the same time. The producer sends messages to the queue without waiting for an acknowledgment from the consumer. Likewise, the consumer processes messages from the queue at its own pace, without needing to coordinate with the producer. (Benefits of Message Queues, n.d.)
todo insert image
This decoupling of the producer and consumer allows each to operate independently, accommodating different processing rates and tolerating temporary unavailability. As a result, the system can easily scale to accommodate increased load by simply adding more consumers or producers, as needed. This also allows for improved resource usage, as processes are not left idle waiting for responses, leading to enhanced overall performance. (Karanam, 2019) (Moráveis, 2019)
Level 3 – service integration level
In the situation of Flowcontrol, where one microservice may need to interact with numerous other microservices to complete a single user request, the use of eventual consistency for data replication would make it possible to avoid this interaction inside the request/response cycle. This approach will not only improve performance and scalability, but the system will also reach the desired decoupling between the microservices. It will enable each microservice to have its own copy of the data it needs, eliminating the need for time-consuming and resource-intensive REST calls to other services to acquire data. Therefore, adopting this pattern will prove to be optimal when it comes to performance, scalability, and decoupling. (3 Common Misunderstanding of Inter-Service Communication in Microservices, n.d.)
todo insert image
Conclusion
Within the specific situation of Flowcontrol, the performance of requests is likely adversely affected by the synchronous inter-service communication. Moreover, this design brings tight coupling between the microservices, which undermines one of the primary purposes of employing a microservice architecture in the first place, namely availability and service independence, achieved through loose coupling. Additionally, recommendations found online for cases where microservices share data with each other suggest employing a data replication pattern via eventual consistency (Microsoft Learn, 2022) (Microsoft Learn, 2023) (3 Common Misunderstanding of Inter-Service Communication in Microservices, n.d.) (Zaeimi, 2020). Therefore, after interviewing the stakeholders and sharing the research outcomes, an agreement was reached to adhere to the recommended practices and replicate data reused across microservices, effectively employing asynchronous communication at the service integration level.
5.1.3 Research question 3.
What are the architectural and design considerations for transitioning from a synchronous REST communication system to an asynchronous message-based communication system?
Research strategies utilized to facilitate the research include:
- Field – Problem analysis
- Workshop – Brainstorm: Performed internally within the IT team to rule out the specific architectural and design considerations for the use case in Flowcontrol.
Having reached a consensus to replicate data used interchangeably between microservices, which would be facilitated through the integration of a separate software system (a message bus), it is important to carefully examine the architectural and design considerations when doing so.
Message bus selection
First and foremost, it is important to select the appropriate message bus.
Message bus definition
A message bus is a software system that enables systems to communicate in a publish/subscribe style (known as pub/sub). The context in which it is intended to be used within Flowcontrol is as a separate infrastructure through which messages will be sent to the separate microservices. The pub/sub pattern means that there is a publisher, which produces messages, and a subscriber, which consumes messages. An event must trigger the publishing of messages. Within the specific system, an event can be the creation, update, or deletion of an entity. Once the event occurs, the entity is produced as the de facto message payload, and the subscriber that listens for incoming messages consumes the message containing that entity, in turn persisting it (as a copy) in its own database.
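A minimal consumer-side sketch of this flow is shown below, using Spring Kafka (the bus chosen in section 3.1) and assuming a byte[] value deserializer is configured. The listener, repository, entity, topic, and group names are illustrative assumptions, not the actual Flowcontrol code.

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

// Hypothetical consumer side of the pub/sub flow described above: an entity
// event is consumed and its payload is persisted as a local copy,
// e.g. into an "articles_external" table.
@Component
public class ArticleReplicationListener {

    private final ExternalArticleRepository externalArticleRepository;

    public ArticleReplicationListener(ExternalArticleRepository externalArticleRepository) {
        this.externalArticleRepository = externalArticleRepository;
    }

    @KafkaListener(topics = "articles", groupId = "farmer-service")
    public void onArticleEvent(byte[] payload) {
        ExternalArticle copy = deserialize(payload);
        externalArticleRepository.save(copy);   // replicate only the relevant attributes
    }

    private ExternalArticle deserialize(byte[] payload) {
        // Placeholder: the real code parses the Protobuf/Jackson payload, see section 3.2.
        return new ExternalArticle();
    }
}
```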
Requirements
To effectively perform the abovementioned tasks, the selected message bus must adhere to specific requirements:
- Scalability: The message bus should be able to handle increasing loads and adapt to the growth of the system without performance degradation, as new services and events are expected to be introduced.
- Fault tolerance: The message bus should provide mechanisms to handle failures and ensure message delivery at all costs because consistency is of primary importance.
- Durability: The message bus should be capable of persisting messages to ensure that they are not lost in case of system crashes or failures, again to ensure consistency.
- Availability: To ensure the systemโs reliability, having a highly available message bus is an important factor.
- Ease of deployment and management: The system should be easy to deploy, configure, and manage, reducing operational overhead for the IT team.
- Community and support: The message bus should have an active community to help troubleshoot issues and provide guidance when needed.
To narrow down the search and adhere to the final requirement for having a strong community, support and resource availability, the research will cover the two most popular infrastructures.
RabbitMQ
RabbitMQ is a free and open-source message broker that facilitates effective communication between message producers and consumers by acting as an intermediate party. It is built to be distributed and scalable, accommodating varying loads and demands. (Jadon, 2022)
The operational foundation of RabbitMQ is the Advanced Message Queuing Protocol (AMQP), a messaging protocol that ensures reliable asynchronous communication. Its reliability is ensured through a mechanism of message delivery confirmations, where the broker confirms receipt of messages to the message producer, and likewise, consumers confirm to the broker that they have received the messages. (Jadon, 2022)
Crucial elements of RabbitMQ's architecture include:
- Producer: The entity that initiates message transmission to a specific queue based on its designated name. (Jadon, 2022)
- Consumer: This entity subscribes to a broker, receives messages from it, and subsequently utilizes these messages for various pre-defined tasks. (Jadon, 2022)
- Broker: This is a message broker that provides a storage facility for produced data. This data is intended for consumption by another application that links to the broker using certain parameters or connection strings. (Jadon, 2022)
- Queue: A linear structure enabling item enqueuing at the end and dequeuing from the front, adhering to a FIFO (First In, First Out) principle. It also offers advanced features like requeueing and prioritization. Applications identify queues by assigned unique names or by requesting the broker for name generation. (Jadon, 2022)
- Exchange: Messages are not dispatched directly to a queue. Instead, they are sent to an exchange by the producer. The exchange manages the routing of these messages to different queues, using bindings and routing keys. A binding is a connection between a queue and an exchange, and the role of a routing key is to forward messages to a specific queue. (Jadon, 2022)
- Channel: Channels establish a streamlined connection to a broker through a shared TCP connection, offering a lightweight connection option. (Jadon, 2022)
Features of a RabbitMQ Queue:
- Must have a name.
- Auto-deletes when the last consumer unsubscribes.
- Can survive a broker restart.
- Can be exclusive โ meaning it can only be used by its declaring connection and when the connection is closed/gone, the queue is deleted.
- Has optional arguments like queue length limit (RabbitMQ, n.d.).
The general flow of messages through RabbitMQ's queues is as follows:
- The producer initiates the process by publishing a message to a specified type of exchange.
- Upon receiving the message, the exchange is tasked with routing it, guided by parameters like exchange type and routing key.
- Bindings are then established from the exchange to the RabbitMQ queues, each having a unique name for distinction. The exchange directs the message into the queues based on the message's attributes.
- The messages reside in the queue until a consumer processes them. Once consumed, they are deleted.
- Lastly, the consumer retrieves and manages the message from the RabbitMQ Queue (RabbitMQ, n.d.).
Features of exchanges in RabbitMQ:
todo insert image
Four types of exchanges exist in RabbitMQ, listed below:
- Direct: The message is routed to the queues whose binding key exactly matches the routing key of the message. todo insert image
- Fanout: Routes messages to all of the queues bound to it. todo insert image
- Topic: Operates similarly to a direct exchange, but uses a routing pattern and wildcards in place of a routing key. Messages are routed to one or many queues based on a match between the message routing key and the pattern (a code sketch of this routing follows the list). The routing key should be delimited by "." and the binding key follows the same convention, but it can also use two wildcards:
- A star "*", which can substitute for exactly one word.
- A hash "#", which can substitute for zero or more words.
- Headers: Message header attributes are used for routing. (Tripathi, 2021) todo insert image
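To make the routing mechanics concrete, the sketch below declares a topic exchange, binds a queue with a wildcard pattern, and publishes a message whose routing key matches that pattern, using the RabbitMQ Java client. The exchange, queue, and routing-key names are placeholders chosen for illustration.

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

// Illustrative sketch of topic-exchange routing with wildcard binding keys.
public class TopicExchangeExample {

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {

            channel.exchangeDeclare("logistics", "topic");

            // Binding with a wildcard: '*' matches exactly one word.
            channel.queueDeclare("transport-queue", true, false, false, null);
            channel.queueBind("transport-queue", "logistics", "article.*.created");

            // Routed to transport-queue because "article.fruit.created"
            // matches the binding pattern "article.*.created".
            channel.basicPublish("logistics", "article.fruit.created", null,
                    "payload".getBytes());
        }
    }
}
```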
Apache Kafka
Kafka is a distributed streaming platform. Streaming is infinite data that keeps arriving and can be processed in real-time. Distributed in the context of Kafka means that it works in a cluster, where each node is called a broker. Those brokers are servers executing a "copy" of Apache Kafka, providing high scalability and fault tolerance: if any of the servers fail, the other servers will take over their work to ensure continuous operations without any data loss. Therefore, Kafka can be classified as a set of machines working together to be able to handle and process real-time infinite data. Moreover, Kafka has the capability to store streams of events durably and reliably for as long as desired. (Sczip, 2021)
Kafka is also a publish-subscribe based messaging system.
Main concepts
- Events, also known as records, record the fact that something happened. When data is written to or read from Kafka, this is done in the form of events. A record is a byte array that can store any object in any format. Conceptually, an event has a key, value, timestamp, and optional metadata headers. The value can be whatever needs to be sent, for instance JSON or plain text. (Everything you need to know about Kafka in 10 minutes , n.d.)
- Producers are applications that publish (write) events to Kafka brokers. Consumers are those that subscribe to (read) these events. Producers and consumers are fully decoupled and agnostic of one another โ a key design element to achieve the high scalability Kafka is famous for. For instance, producers never need to wait for consumers, because Kafka provides various guarantees, like the ability to process events exactly-once. (Everything you need to know about Kafka in 10 minutes , n.d.)
- Topics are where events are organized and stored. A topic can have zero, one, or many producers and consumers that write to it and subscribe to it. Unlike in traditional messaging systems (e.g., RabbitMQ), events in Kafka are not deleted after consumption. Their retention period is configured per-topic, and long-term data persistence is acceptable. The records in topics are retained as a log, and consumers are responsible for tracking their position in the log, which is also known as the "offset". The offset is typically advanced by the consumer in a linear manner as messages are read. However, consumers can reset to an older offset when reprocessing records. (Everything you need to know about Kafka in 10 minutes , n.d.)
- Partitions are what Kafka topics are divided into. Each partition contains records in an unchangeable sequence. Every record in a partition is assigned and identified by its unique offset. Partitioning is another contributing factor to the scalability of Kafka, as it allows client applications to read the data from and write the data to many brokers at the same time. When new events are published to a topic, they are appended to one of the topic's partitions. Events with the same event key are written to the same partition. Consumers are guaranteed to read a partition's events in exactly the same order as they were written. (Everything you need to know about Kafka in 10 minutes , n.d.)
- Replication factor is configured per topic and facilitates fault tolerance and high availability of the produced data, so that there are always multiple brokers that have a copy of the data in case something goes wrong or availability is compromised. Every replica has one server acting as a leader and the rest are followers. The leader replica handles all read-write requests and the followers replicate the leader. If the lead server fails, one of the follower servers becomes the leader by default. A common practice in production environments is a replication factor of 3, meaning three brokers must be available to persist three copies of the data. (Everything you need to know about Kafka in 10 minutes , n.d.)
- Zookeeper is a system external to Kafka, also developed by Apache, which is leveraged to keep important metadata about the state of the cluster, such as available brokers, topics, etc. (Everything you need to know about Kafka in 10 minutes , n.d.)
todo insert image
Message flow in Kafka
Producers send data in the form of messages to the Kafka cluster. Messages contain a Topic ID, which is used to forward them to the leader broker for that topic. The messages are then stored in the leader broker and replicated across its followers. Consumers can join at any time and subscribe to specific topics and consume messages from the brokers. They keep track of the last consumed message using the offsets. Kafka retains messages for a configurable period of time. They are stored on disk, ensuring data durability and recovery from failures. This allows consumers to read historical messages even if they join the system later. (Johansson, 2020)
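A short producer sketch of this flow is given below, using the plain Kafka Java client. It also illustrates the keying behaviour described under "Partitions": records with the same key (here an article ID) are appended to the same partition and therefore keep their relative order. Broker addresses, the topic name, and the key/value contents are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Illustrative producer: keyed records go to the same partition of the topic,
// so consumers read them in the order they were written.
public class ArticleEventProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Both events share the key "article-42", so they land on the same
            // partition of the "articles" topic and keep their relative order.
            producer.send(new ProducerRecord<>("articles", "article-42", "ARTICLE_CREATED"));
            producer.send(new ProducerRecord<>("articles", "article-42", "ARTICLE_UPDATED"));
        }
    }
}
```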
Kafka vs RabbitMQ comparison
Performance
According to a test performed by Confluent (Confluent, 2020), a platform focused on full-scale data streaming, RabbitMQ's peak throughput reached 38 megabytes per second (MB/s), whereas Kafka reached up to 605 MB/s.
In terms of latency, under a 200 MB/s load, Kafka's latency amounted to 5 milliseconds (ms). RabbitMQ's latency was 1 ms, but that was achieved with a reduced load of only 30 MB/s, as RabbitMQ's latencies degrade significantly at throughputs higher than that number.
On another note, Kafka can retain large amounts of data with very little overhead as it is designed for holding and distributing large volumes of data, while RabbitMQ's queues are fastest when they are empty. (When to use RabbitMQ over Kafka?, n.d.)
Durability
Kafka is built to retain messages for as long as necessary with forever being a possibility, whereas with RabbitMQ messages get deleted after being consumed by default. Therefore, Kafka allows applications to process and re-process streamed data/messages on disk, making it classifiable as a durable message broker. Its routing approach is very simple, whereas RabbitMQ is more suitable for cases where complex ways of routing might be required. Kafka allows for delayed subscription, enabling the consumption of older messages even if they were not initially subscribed to. (When to use RabbitMQ over Kafka?, n.d.) (What's the difference between RabbitMQ and kafka?, n.d.)
Availability
Kafka is a truly distributed system: data is sharded and replicated, providing the possibility to tune durability guarantees and availability. (What are advantages of Kafka over RabbitMQ?, 2016) RabbitMQ provides similar behavior through a concept called mirroring, but not on an architectural level.
Scalability
Kafka is designed for horizontal scaling, which means scaling by adding more machines, whereas RabbitMQ is mostly designed for vertical scaling, i.e., scaling by adding more power. (When to use RabbitMQ over Kafka?, n.d.)
Community and support
While RabbitMQ is an older system and has an abundance of learning resources and a community to provide support, Kafka has a very active and fast-growing community and is the most popular solution. (Why Apache kafka is more popular than RabbitMQ?, n.d.)
Conclusion
Apache Kafka is the superior option for the use case in Flowcontrol due to its advantages over RabbitMQ in performance, scalability, durability, and community support. Kafka can handle increasing workloads and adapts well to system growth, while providing robust fault-tolerance mechanisms through replication and partitioning. It also persists messages on disk, ensuring data consistency and durability. Kafka's large and mature community offers a wealth of resources and expertise.
However, the main advantage that made Kafka the preferred choice is its option for delayed subscription. As RabbitMQ is not designed for long-term storage of messages, if a service is down for a while, it might miss some messages. Given the importance of data consistency for Flowcontrol, Kafka is the superior option.
5.1.4 Research question 4.
What are the best practices (conventions) for implementing a message-based communication system using a message bus, and how can they be tailored to the specific context of the system being developed?
When it comes to event-driven asynchronous message bus communication for inter-service data sharing, there are several aspects that have to be considered when employing relevant conventions. One such aspect is the choice of the correct message bus, one that carries characteristics suitable to the specific requirements of the system. Regarding the message bus, it is important to follow recommended practices for configuration to ensure reliable message delivery, appropriate event schemas, scalability, and high availability, as well as appropriate testing of the producer and consumer functionalities. Such practices include employing a sufficient number of brokers and selecting an appropriate acknowledgement strategy, partitioning, and replication factor to ensure reliability, scalability, and high availability.
Beyond Kafka, it is important to handle data carefully when replicating. Conventions suggest replicating only the attributes of the data that are relevant to the service consuming the data. It is crucial for Flowcontrol to ensure data consistency and integrity. In edge cases, data consistency can be effectively achieved by implementing event sourcing and the Saga pattern.
Reliable message delivery, high availability and scalability in Kafka
Replication
Configuring three brokers to allow for a replication factor of three should provide sufficient fault tolerance, availability and reliable message delivery for Flowcontrol. This way each message is produced to all three brokers, and if a broker is unavailable, one of the other two can reliably take over.
Acknowledgments
An acknowledgement strategy refers to the number of acknowledgements a producer is required to receive from the brokers for considering a message as successfully written. There are three configurations for this (AmiyaRanjanRout, n.d.):
- Acks=0: The producer does not wait for any acknowledgement from the broker. Regardless of whether the messages are successful or unsuccessful, the system sends them and then moves on to the next. Despite having the highest throughput and smallest latency, this method cannot guarantee dependability because the producer cannot tell if a message was correctly written or not. (AmiyaRanjanRout, n.d.)
- Acks=1: A producer only waits for acknowledgement from the partition leader. The leader notifies the producer when the message has been successfully written to its local log. This offers a strategy that balances throughput and reliability. However, there is a chance of data loss if the leader collapses right away after sending the acknowledgment but before the followers have copied the data. (AmiyaRanjanRout, n.d.)
- Acks=all: In this instance, the producer is waiting for the leader and all follower replicas (brokers) to return an acknowledgement. The acknowledgment is sent to the producer after the leader has waited for all in-sync replicas (ISRs) to log the message. Because it guarantees that the message is copied to every in-sync replica, this option offers the best dependability guarantee against data loss. This, however, results in a decrease in throughput and an increase in delay. (AmiyaRanjanRout, n.d.)
Minimum in-sync replicas
min.insync.replicas is a topic-level configuration that dictates how many replicas have to write the message before the write is considered successful.
Retries
Retries is a producer configuration that dictates how many times the producer can retry sending messages if the initial send fails. This could be set to a high number like Integer.MAX_VALUE to let it keep retrying almost indefinitely. (Kafka Producer Retries, n.d.)
Auto offset reset
This is a consumer-side configuration that determines what happens if there is no initial offset or if the current offset no longer exists. Setting it to "earliest" will ensure that the consumer starts from the beginning of the data if it has not committed any offset. (Golder, Kafka Consumer Auto Offset Reset, 2022)
Unclean leader election
This is a broker configuration which, if set to false, ensures only in-sync replicas (ISRs) are elected as leader. This is important for avoiding data loss. Since version 0.11.0.0 of Kafka this configuration is by default set to false, prioritizing durability over availability. (Golder, Kafka Unclean Leader Election, 2022)
Conclusion
The primary goal within the Kafka implementation in Flowcontrol is ensuring durable data replication, so data consistency is of utmost significance. Thus, opting for maximum reliability is the optimal configuration choice for Kafka. Therefore, three brokers should be employed and a replication factor of 3 should be set for all messages, acknowledgements should be configured to "all", min.insync.replicas should be set to the number of brokers, retries should be set to Integer.MAX_VALUE, auto offset reset should be "earliest", and unclean leader election should be kept at the default setting of false.
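A sketch of how these choices could be expressed as Kafka client properties is shown below. Where each setting actually lives (broker configuration, topic configuration, or producer/consumer configuration) depends on the deployment, so the class and method names here are purely illustrative.

```java
import java.util.Properties;

// Sketch of the reliability-oriented settings summarized in the conclusion above.
public class ReliabilityConfig {

    public static Properties producerProps() {
        Properties props = new Properties();
        props.put("acks", "all");                                   // wait for all in-sync replicas
        props.put("retries", Integer.toString(Integer.MAX_VALUE));  // keep retrying almost indefinitely
        return props;
    }

    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put("auto.offset.reset", "earliest");                 // start from the beginning if no offset is committed
        return props;
    }

    // Topic/broker-level settings, listed here only for completeness:
    //   replication.factor = 3
    //   min.insync.replicas = 3
    //   unclean.leader.election.enable = false   (the default since Kafka 0.11.0.0)
}
```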
Partitioning strategy for Kafka
Maintaining the order of messages that are produced and consumed within Flowcontrol might be important for having the persisted data in the same order between producers and consumers. This can be achieved by employing a single partition per topic, as messages in the same partition are guaranteed to keep the original order.
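With Spring Kafka, such a topic could be declared as in the sketch below: a single partition to preserve ordering and a replication factor of 3 to match the broker setup described earlier. The topic name is a placeholder.

```java
import org.apache.kafka.clients.admin.NewTopic;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.TopicBuilder;

// Sketch of declaring a topic with one partition (to keep message order)
// and a replication factor of 3 (one copy on each of the three brokers).
@Configuration
public class TopicConfig {

    @Bean
    public NewTopic articlesTopic() {
        return TopicBuilder.name("articles")
                .partitions(1)
                .replicas(3)
                .build();
    }
}
```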
5.1.5 Research question 5.
What are the potential risks and challenges of transitioning from a synchronous REST communication system to an asynchronous message-based communication system, and how can they be mitigated?
Research strategies utilized to facilitate the research include:
- Field – Interview: Interviewed the stakeholders to narrow down which criteria are relevant and which are not.
- Workshop – Multi-criteria decision making
Implementing event-based data replication via a message bus introduces several shortcomings. One such shortcoming is its reliance on eventual consistency, meaning data may not be immediately consistent across all microservices, leading to temporary discrepancies. This could cause problems in scenarios where immediate consistency is crucial.
Another drawback is data duplication across different microservices, which can increase storage requirements and complicate data management. Additionally, implementing data replication with a message bus adds another layer of complexity to the system, as it involves handling events, managing the message bus, and ensuring proper data propagation.
Additionally, data inconsistencies may arise if, for instance, one of the consuming services does not correctly persist the data received within the message, e.g., because of an error. The most effective mitigation step for any data inconsistencies that may arise during the process of producing and consuming messages is the implementation of a pattern known as the Saga pattern. Research into how to implement the Saga pattern can be found in the Saga research section.
Another important data consistency challenge can arise in cases of unexpected data loss in a specific datastore. Flowcontrol uses snapshots of the databases to restore them to a previous point if data loss occurs. However, data can be out of synchronization between the services if a snapshot does not have the latest data. Data must be synchronized and up-to-date at all times. A mitigation step is a pattern known as event sourcing, which is researched and documented in the Event sourcing research section.
Finally, while event-based replication aims to reduce coupling between microservices, it can introduce a new form of coupling to the message broker itself, and through shared events and message schemas. Coordinated updates across multiple microservices may be necessary when there are changes to the event schema or message format.
After interviewing the stakeholders in combination with brainstorming, the following conclusions were drawn:
- Immediate data consistency is not deemed critical. Crucial entity updates are not expected to happen frequently (if at all). Therefore, a situation of delayed entity update would not be of big concern. The truth is event-based data replication is not the best approach for situations where immediate consistency is important. Though, in Flowcontrol's case, this is not the primary factor. Moreover, the implementation will be tested thoroughly to evaluate whether the performance (in this case of message delivery) is sufficient.
- The data to be replicated is not large, and databases of different services are not densely populated. Therefore, storage is not expected to be a burden.
- The additional complexity factor is a valid concern but not a primary one.
- To address data consistency concerns in edge cases of data loss, producer events can be logged in an independent datastore, ensuring a log of the latest version of the data is always available. This is referred to as event sourcing and is a common practice within event-driven architectures. In addition to the fault-mitigation benefits, it provides the opportunity to keep track of all changes that have occurred to the data throughout its lifespan.
- For addressing data inconsistencies that may arise due to processing failures of individual consumers, the Saga pattern can be implemented.
- Finally, coupling to the message broker, shared events, and message schemas is preferred to a tight coupling between most of the services, where failure in one service hinders the operation of others. Additionally, Kafka provides fault-tolerance capabilities, such as running multiple brokers that replicate the produced messages. Thus, if one broker fails, others can take over its functions and ensure a fault-tolerant workflow.
In conclusion, there is no perfect solution without potential shortcomings; that holds for most (if not all) cases within software development, and perhaps for most things in general. The correct approach is to acknowledge the potential edge cases, consider how relevant they are, and decide whether the risks are worth the potential (expected) rewards.
5.1.5 Saga pattern โ
Definition โ
A saga is a sequence of local transactions. A transaction, by definition, is a single unit of logic which can be made up of multiple operations (Saga distributed transactions pattern, n.d.). Within sagas, each local transaction updates the datastore and in turn triggers the next transaction. If one of the local transactions fails, the saga executes compensating transactions that undo the changes made by the preceding local transactions, effectively rolling back the changes. (Atamel, 2022)
There are two primary implementation styles of the Saga pattern:
Choreography Saga Pattern โ
Choreography coordinates sagas by applying publish-subscribe principles. Each microservice runs its own local transaction and publishes events to a message bus, in turn triggering local transactions in other microservices. (Ozkaya, Saga Pattern for Orchestrate Distributed Transactions using AWS Step Functions, 2022)
todo insert image
This is a suitable approach for simple workflows that do not require too many microservice transaction steps. One advantage of this approach is that it effectively decouples direct dependencies between microservices when managing transactions. However, as the number of steps increases, managing transactions can become confusing and cumbersome. (Ozkaya, Saga Pattern for Orchestrate Distributed Transactions using AWS Step Functions, 2022)
The practical implementation in the Flowcontrol system would be an extension of the data replication pattern that is already in place. When one of the consumer services consumes an event containing data and successfully persists it in its own database, it would publish another event to a Kafka topic indicating that the data has been replicated. Other consuming services could then listen to this topic and perform their respective part of the saga.
The consuming services would also need to handle any exceptions that might occur when trying to persist the data. If an exception occurs, they could publish a compensating event to a different Kafka topic. Other consumers could listen to this event and rollback their own operations if necessary.
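As an illustration, a consuming service's part in such a choreography might look like the following sketch, which assumes Spring Kafka, the shared ArticleDTO, and hypothetical topic names (articleReplicated, articleReplicationFailed) and persistence classes rather than the actual Flowcontrol implementation:

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

// Hypothetical sketch: ArticleDTO and ArticleExternalService are stand-ins for the real classes.
@Component
public class ArticleReplicationChoreography {

    private final ArticleExternalService articleExternalService;
    private final KafkaTemplate<String, Object> kafkaTemplate;

    public ArticleReplicationChoreography(ArticleExternalService articleExternalService,
                                          KafkaTemplate<String, Object> kafkaTemplate) {
        this.articleExternalService = articleExternalService;
        this.kafkaTemplate = kafkaTemplate;
    }

    // Local transaction of this saga participant: persist the replicated article,
    // then signal success or failure to the rest of the choreography.
    @KafkaListener(topics = "articleSaved", groupId = "farmer-service")
    public void onArticleSaved(ArticleDTO article) {
        try {
            articleExternalService.save(article);
            // Success: publish the next event in the saga.
            kafkaTemplate.send("articleReplicated", article);
        } catch (Exception e) {
            // Failure: publish a compensating event so other participants can roll back.
            kafkaTemplate.send("articleReplicationFailed", article);
        }
    }
}
```

The other participants would listen to the compensating topic and undo their own writes, completing the rollback.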
Orchestration Saga pattern โ
Orchestration coordinates sagas with a centralized controller microservice, which orchestrates the workflows and invokes microservices to execute local transactions sequentially (Ozkaya, Saga Pattern for Orchestrate Distributed Transactions using AWS Step Functions, 2022).
todo insert image
While this approach is suitable for complex workflows, it introduces a single point of failure within the centralized controller microservice and is more complex to implement and manage (Ozkaya, Saga Pattern for Orchestrate Distributed Transactions using AWS Step Functions, 2022).
The practical implementation within Flowcontrol includes developing an orchestrator service which tells each service what to do and when to do it, as opposed to each service deciding what to do on its own (as in the choreography approach). The steps may look like the following:
- Define the saga: e.g., the Article creation saga starts when the Article service creates an article, and the article then needs to be replicated to the Farmer and Transport services.
- Create the Orchestrator service. It will be responsible for managing the flow of the saga, meaning, within the abovementioned example, it will tell the Article service to create an article and then tell the Farmer and Transport services (sequentially) to replicate it. It will also keep track of the state of the saga (whether it is completed or if an error has occurred).
- The Orchestrator can use the message bus to send commands to other services. For instance, once it receives a request to create an article, it will first save a new "ArticleSaga" with its state marked as "STARTED". It will then send a command to create an article.
- The Article service will receive these commands, perform the necessary actions, then publish events indicating whether they were successful or not.
- The Orchestrator service will listen for the events. When it receives an "ArticleCreatedEvent", it will update the "ArticleSaga" status to "ARTICLE_CREATED", then send a "ReplicateArticle" command to the Farmer and Transport services. Once the Farmer and Transport services have replicated the article, they will publish an "ArticleReplicatedEvent", and the Orchestrator service will update the "ArticleSaga" to "FARMER_REPLICATED" and "TRANSPORT_REPLICATED", and finally "COMPLETED".
If any errors occur, the Orchestrator service listens for them, marks the corresponding "ArticleSaga" as "FAILED", and then sends compensating events that effectively roll back the previous actions.
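A simplified sketch of the Orchestrator's event handling, mirroring the steps above, could look as follows, assuming Spring Kafka and hypothetical event, command, entity, and repository types (ArticleCreatedEvent, ArticleSaga, ArticleSagaRepository, and so on):

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

// Hypothetical sketch of a saga orchestrator; all domain types are stand-ins.
@Component
public class ArticleSagaOrchestrator {

    private final ArticleSagaRepository sagaRepository;        // JPA repository holding saga state
    private final KafkaTemplate<String, Object> kafkaTemplate;

    public ArticleSagaOrchestrator(ArticleSagaRepository sagaRepository,
                                   KafkaTemplate<String, Object> kafkaTemplate) {
        this.sagaRepository = sagaRepository;
        this.kafkaTemplate = kafkaTemplate;
    }

    @KafkaListener(topics = "articleCreatedEvent", groupId = "orchestrator")
    public void onArticleCreated(ArticleCreatedEvent event) {
        ArticleSaga saga = sagaRepository.findBySagaId(event.getSagaId());
        saga.setStatus("ARTICLE_CREATED");
        sagaRepository.save(saga);
        // Tell the next participants to replicate the article.
        kafkaTemplate.send("replicateArticleCommand", event.getArticle());
    }

    @KafkaListener(topics = "articleReplicationFailed", groupId = "orchestrator")
    public void onReplicationFailed(ArticleReplicationFailedEvent event) {
        ArticleSaga saga = sagaRepository.findBySagaId(event.getSagaId());
        saga.setStatus("FAILED");
        sagaRepository.save(saga);
        // Trigger compensation in the services that already completed their step.
        kafkaTemplate.send("rollbackArticleCommand", event.getArticle());
    }
}
```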
5.1.6 Event sourcing โ
Research strategies utilized while researching event sourcing include:
- Library - Literature study, design pattern research
- Field - Task analysis
- Workshop - Requirement prioritization
What is event sourcing? โ
Event sourcing is a powerful architectural pattern used in software design for storing the history of events that have occurred within a system. In event sourcing, every change to the application state is recorded as an immutable event and stored in an append-only event store in sequential order. The event store acts as a single source of truth for the data, and the current state of an entity can be derived by replaying the events related to that entity. (Microsoft, n.d.) (A Beginnerโs Guide to Event Sourcing, n.d.)
Event sourcing can be used for auditability and traceability of the changes to the application state, and for debugging because it can provide the sequence of events that led to a problem. Moreover, it serves as a reliable method for disaster recovery, as replaying all events from the event store can accurately reconstruct lost data. Similarly, it helps ensure data consistency in distributed systems by providing a mechanism to keep all system components in sync by replaying events and applying them in the correct order. (Microsoft, n.d.) (A Beginnerโs Guide to Event Sourcing, n.d.)
In the context of Flowcontrol, event sourcing will primarily be used to ensure consistency for data replicated between the services and to establish a reliable backup for data restoration in case of data loss.
Requirements & considerations โ
When implementing event sourcing in this context, several considerations must be defined, researched beforehand, and taken into account. These include:
- Versioning and evolution strategy: As the data schema evolves over time, new versions of the data may be introduced. Therefore, it is vital to define a clear versioning strategy to ensure backward and forward compatibility of the data the event holds. (Maison, 2019)
- Serialization: The entity data each event holds must be persisted in some way to allow for state management. Therefore, it is important to decide whether the event schema should contain separate fields for each attribute of the entity data in the respective tables, or the entity data should be serialized and persisted in a single field. If it is to be serialized, an appropriate serialization strategy must be defined. (Maison, 2019)
- Defining event schema: Defining an event schema that aligns with the specific system requirements is a crucial architectural consideration. This includes choosing between using a unified event schema with a single table in the event store or defining a separate schema for each event type and subsequently utilizing separate tables for them. (Maison, 2019)
- Selecting an event store: The event store must be designed in a way that allows efficient storage and retrieval of events. Expected event volume, scalability, performance, and querying capabilities should be taken into consideration. (Maison, 2019)
- Event backup and recovery: As the event store is the source of truth for the system, it is critical to have backup and recovery mechanisms in place to prevent data loss in case of failures or disasters. (Maison, 2019)
The following user stories will aid the research serving as de facto requirements:
- As a developer, I must be able to log events that hold entity data in an immutable, append-only event store based on an appropriately defined event schema, so that I can have a reliable means of data recovery that can also ensure consistency of the data replicated between producing and consuming services.
- As a developer, I must be able to update the entity data schema, including entity names, field names, and field types, so that evolving requirements can be adhered to.
- As a developer, I must be able to replay the events logged in the event store, so that I can recover data and ensure data consistency.
- As a developer, I must be able to replay events that contain evolved entity data and effectively map the old data model to the new one, so that I can recover data and ensure data consistency.
- As a developer, I must ensure that when events are replayed, no duplicate entities are stored in the respective databases, so that I can ensure reliable data consistency.
Versioning and evolution strategy โ
Given that events will be logged in an append-only storage, versioning must be considered on two separate levels. Firstly, events must be ordered correctly for proper replay and restoration of data to its latest state. To achieve this, the event schema should include a timestamp attribute that captures the exact date and time up to the highest precision available.
Secondly, when the data schema of the logged entity changes over time, it is important to distinguish between different versions of the data. To address this, the event schema should include a version attribute that allows differentiation between separate data versions.
Serialization โ
Serialization or no serialization โ
To decide whether serialization would be required, multiple factors must be accounted for. What are the benefits and the disadvantages to both options, and how do they align with the system requirements?
Serializing entity data offers advantages such as flexibility, simpler schema evolution, and reduced table complexity. It allows storing the entire entity state in a compact format. Querying specific attributes within serialized data becomes more complex; however, this is not expected to be needed for the specific purpose the event store will serve.
On the other hand, using separate fields for entity attributes enables efficient querying and easy data access without deserialization. However, schema evolution would require modifying the table schema and could lead to downtime. Additionally, as the number of attributes increases, the table structure may become more complex.
Conclusion โ
Considering the system requirements, serialization is the clear choice for handling entity data when event sourcing. This is because entity data typically consists of numerous attributes, which can result in an unnecessarily large event data model if stored as individual attributes. Serialization allows for a more compact representation of the data. Furthermore, as Flowcontrol is a rapidly developed distributed system, schema evolution is anticipated and must be taken into consideration. By utilizing serialization, handling schema changes becomes more manageable, as the serialized data can adapt to evolving entity structures without requiring modifications to the event data model.
Additionally, given the append-only nature of the event store and the presence of complex data structures within entity data models (such as nested custom data model attributes), it is essential to minimize the size overhead. Serialization enables a more efficient use of storage space, reducing the storage requirements and optimizing the overall performance of the system.
Serialization strategy โ
As concluded, every event in Flowcontrol has an associated payload that needs to be serialized to be persisted as an attribute of the event data model. Serialization is the process of converting a data object into a series of bytes that saves the object's state in a form that can be easily transmitted. There are many options for data serialization, each with its own set of benefits and drawbacks.
It is critical for Flowcontrol to carefully evaluate and select the serialization method that best meets its specific requirements. By doing so, it can ensure optimal performance and efficiency in terms of data transmission and storage. The selection process should consider factors such as data size, serialization and deserialization speed and complexity, and the ability to support schema evolution. By weighing these factors against its specific needs, a serialization method can be chosen for Flowcontrol that will enable it to effectively manage and process its event data.
Non-binary vs binary serialization โ
When it comes to serializing data specifically for logging events in a data store, the most popular choice among non-binary options is JSON. It is a plaintext format and arguably the most human-readable and widely used option. It is the obvious choice for many developers because it is easy to implement, is supported natively by the most popular database options, both SQL and NoSQL, and can be parsed or produced by any technology stack. Additionally, its human-readable format makes it convenient for analyzing data directly from query results or from database tables using inspection tools. (Ludwikowski, 2019)
However, JSON has some drawbacks, such as considerable size overhead and lack of schema evolution support when compared to binary serialization options. On the other hand, binary serialization options like Avro, Protobuf, and Thrift offer more compressed data storage, with some developers reporting data savings of up to 70%. Additionally, these options provide built-in support for schema evolution. For example, Avro is designed with schema evolution in mind and automatically handles added or missing fields and changed field order. (Ludwikowski, 2019)
The following sections will focus on the specific binary serialization formats mentioned above.
Protocol Buffers โ
Protocol Buffers, known as Protobuf in short, is an open-source project developed by Google that offers a language-neutral, platform-neutral, and customizable approach for serializing structured data. It is compatible with a variety of programming languages, including C++, C#, Dart, Go, Java, and Python. Additionally, Protocol Buffers provide out-of-the-box schema evolution support with backward and forward compatibility. (Serialiazing your data with Protobuf, 2019)
The neutral language used by Protobuf allows modelling messages in a structured format through .proto files which look like the depiction in Figure 22:
todo insert image
After defining the .proto file, the next step is to run the protocol buffer compiler protoc to compile the file; the compiler must be installed on the system separately. This generates a subdirectory with the required files, which are then used to serialize and deserialize the data (Protobuf Dev, n.d.).
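Purely as an illustration of this workflow, serialization and deserialization with a protoc-generated Java class could look roughly as follows; the ArticleProto.Article class and its fields are assumptions that depend entirely on the actual .proto definition:

```java
import com.google.protobuf.InvalidProtocolBufferException;

public class ProtobufRoundTripExample {

    public static void main(String[] args) throws InvalidProtocolBufferException {
        // Build a message using the builder generated by protoc (class and field names assumed).
        ArticleProto.Article article = ArticleProto.Article.newBuilder()
                .setId(42)
                .setName("Tomatoes")
                .build();

        // Serialize to a compact binary representation.
        byte[] bytes = article.toByteArray();

        // Deserialize the bytes back into a message instance.
        ArticleProto.Article restored = ArticleProto.Article.parseFrom(bytes);
        System.out.println(restored.getName());
    }
}
```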
Thrift โ
Thrift is a library built by Facebook and currently maintained by Apache. It is similar in functionality to Google's Protocol Buffers, although the latter tends to be (subjectively) easier to use. Thrift provides extensive, highly technical documentation. Thrift's support for Java is sufficiently good, and it supports considerably more languages than Protobuf. Moreover, it provides built-in support for "exceptions", whereas Protobuf does not. (Data Serialization – Protocol Buffers vs Thrift vs Avro, 2019), (Biggest differences of Thrift vs Protocol Buffers? [closed], n.d.), (Baeldung, 2018)
Thrift uses a special Interface Description Language (IDL) to define data types that are stored as .thrift files and later used as input by the compiler for serialization and deserialization, in the same way Protobuf operates. The schema definition is depicted in figure 23 (Ludwikowski, 2019):
todo insert image
Avro โ
Apache Avro was created within the Hadoop project as a schema-based serialization system. It was developed with a deep understanding of Protocol Buffers and with the explicit goal of addressing some of the challenges and limitations associated with Protobuf. Most notably, Avro was designed to enable modifications to the data schema at runtime, without the need for code generation or recompilation. This feature sets Avro apart from Protocol Buffers, which require schema changes to be made at compile time. (Nayar, 2022)
When Avro data is serialized, its schema is also included in the serialization process. As a result, the schema is always available when reading the serialized Avro data. Additionally, when Avro data is stored, its schema is stored alongside it. This makes it possible for any program to process the Avro data files at a later time, since the schema is always present. (Nayar, 2022)
The functionalities of Apache Avro are comparable to those of Thrift and Protocol Buffers. It offers support for sophisticated data structures, a compressed binary data format, a file format for persistent storage, and seamless integration with dynamic programming languages. What truly sets Avro apart is its robust schema evolution capabilities, which enable developers to make changes to their data models without breaking backward compatibility. While this flexibility requires defining separate serialization and deserialization schemas, it ultimately allows for more seamless and efficient data management in evolving systems. It is important to note, however, that when schema evolution involves changes in data naming, this advantage is irrelevant, because such changes require schema recompilation, just like in the other two options. (Data Serialization – Protocol Buffers vs Thrift vs Avro, 2019), (Nayar, 2022)
As with Thrift and Protocol Buffers, Avro has its own schema definition, which is defined in an .avsc file and follows a JSON format. An example depiction of an Avro schema can be seen in figure 24:
todo insert image
Complex data structures such as maps and sets are also supported with some limitations in Avro.
JSON โ
JSON provides by far the simplest and most straightforward way to serialize and deserialize data. There are many libraries in the Java ecosystem that allow for seamless serialization within a few lines of code. As mentioned previously, though, the downside is that this is a non-binary serialization format, which, while human-readable, bears a considerably larger size overhead compared to binary options.
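For comparison, a minimal Jackson-based round trip might look like the following sketch; the Article record is a stand-in for any entity rather than an actual Flowcontrol class:

```java
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonSerializationExample {

    // Stand-in entity; Jackson maps records natively from version 2.12 onward.
    public record Article(long id, String name) {}

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // Serialize the object to a JSON string...
        String json = mapper.writeValueAsString(new Article(42, "Tomatoes"));

        // ...and deserialize it back into an object.
        Article restored = mapper.readValue(json, Article.class);
        System.out.println(json + " -> " + restored.name());
    }
}
```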
Conclusion โ
Flowcontrol's data volume is not particularly large, with only 130,000 entries in its most heavily used data model over the past four years of usage. However, much of the replicated data between services contains complex data types, such as sets, and is expected to grow over time. To conserve space in the event store, a binary serialization format is preferred.
After interviewing the stakeholders and evaluating the various binary serialization options, including Apache Avro and Protocol Buffers, a unanimous agreement to use Protobuf was reached. Avro provides easier schema evolution when it comes to changing data field types, as no schema recompilation is required (a separate schema is supplied upon deserialization, which enables automatic type conversion). When it comes to changing data names, however, both libraries require recompiling the schema and going through a similar deserialization process. Since Flowcontrol expects both types of changes to occur, and in view of the considerably wider adoption of Protobuf, Protobuf came out as the preferred choice.
Event schema โ
Having selected serialization and versioning strategies, the next step is to define an appropriate event schema to enable events to be saved in a data store. For the specific use case of Flowcontrol, the data model must contain the following attributes:
- Event ID to track its position in the event stream.
- Event Type to describe the action that occurred.
- Version for tracking the evolution of the domain model over time.
- Timestamp to distinguish the sequence in which events occurred, ensuring that events are applied in the correct order.
- Topic - the Kafka topic to which the specific event was produced.
- Entity ID - the ID of the entity the event is produced for, which can be useful for easily identifying a specific entity, for example when querying the database for events related to a particular entity.
- Serialized_data - the actual payload of the event, i.e., the domain model.
A depiction of an example event schema is visualized in the figure below:
todo insert image
When events need to be processed or the state of an entity needs to be rebuilt, firstly the serialized_data will be deserialized and then the changes described in the event will be (re)applied.
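A minimal JPA sketch of such an event record could look as follows, assuming Spring Boot with Jakarta Persistence; the table, column, and type choices are illustrative rather than the actual Flowcontrol schema:

```java
import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;
import jakarta.persistence.Lob;
import jakarta.persistence.Table;
import java.time.Instant;

@Entity
@Table(name = "article_event")
public class ArticleEvent {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long eventId;           // position in the event stream

    private String eventType;       // e.g. CREATED, UPDATED, DELETED
    private int version;            // version of the serialized data schema
    private Instant timestamp;      // used to order events on replay
    private String topic;           // Kafka topic the event was produced to
    private Long entityId;          // identifier of the entity the event relates to

    @Lob
    @Column(name = "serialized_data")
    private byte[] serializedData;  // serialized domain model (the event payload)

    protected ArticleEvent() {
        // no-args constructor required by JPA
    }

    // getters and setters omitted for brevity
}
```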
Once the event schema is defined, the next consideration is whether to store events in a single table or distribute them across multiple tables. Taking into account factors such as schema evolution, efficient querying, and the anticipated high volume of events, it is advisable to distribute the events across multiple tables. In this approach, each table corresponds to a specific entity for which the events are produced, ensuring better organization and optimized handling of the events.
The following section will focus on selecting an appropriate event store given the system requirements.
Selecting an event store โ
When considering the selection of an event store for Flowcontrolโs event sourcing needs, certain requirements are essential to ensure the effectiveness and reliability of the event storage. The following are key factors to consider:
- Scalability - Since event sourcing effectively maintains a history of events, where each new event is appended and old events are not deleted, scalability in terms of data volume is one of the main requirements.
- Durability and consistency - An event store is meant to be an unambiguous source of truth for the data, and as such it should provide strong durability and consistency guarantees, ensuring events are stored safely and reliably.
- Data retention - The data store must have the capacity for long-term storage.
- Query capabilities and event ordering - Considering the historical characteristic of the event store, the events could amount to thousands, if not more. Therefore, the event store must be able to return an ordered stream of the events in a timely manner when queried. Read performance is not crucial for the given context, but querying should not take hours when the event store contains a large amount of data.
With the requirements clearly defined, the next step is to look into which data stores meet them. Popular options among systems utilizing event sourcing are:
- NoSQL (MongoDB/EventStoreDB)
- Relational Database Management System (RDBMS), also known as SQL
- Message bus (Kafka/RabbitMQ)
The following sections examine each individual option and draw a conclusion about which choice is most suitable for Flowcontrol.
MongoDB (NoSQL) โ
MongoDB will be used as an example of a document-based database, as it is one of the most popular NoSQL choices. Its main characteristics include high scalability and schema-less storage. Due to the latter, events can easily be saved to MongoDB. However, when attempting to store multiple events simultaneously, MongoDB encounters a limitation: it natively supports only single-document transactions. Consequently, there are two options: save multiple events in a single document or save multiple events across multiple documents. Both approaches have their challenges: the former increases complexity when searching for an entity's events, while the latter necessitates using MongoDB's multiple-document transactions feature, which also adds complexity. (Steimle, 2021)
Additionally, MongoDB lacks support for a global sequence number, meaning that implementing a full sequential read of all events requires custom logic (Steimle, 2021).
In conclusion, using MongoDB as an event store demands extra effort and consideration due to its inherent limitations and complexities.
EventStoreDB โ
EventStoreDB is an open-source state-transition database purpose-built to serve as an event store for event sourcing. State-transition means it stores the different transitions data goes through over time; in other words, state transitions are events that record what has changed and why, in the order the changes occurred. (Event Store, n.d.)
EventStoreDB can run as a server on a variety of platforms, from desktop operating systems like Windows, Linux, and macOS, to Windows and Linux servers, Docker containers, and orchestration tools such as Kubernetes. There is also a Cloud option, though it is not free. (EventStoreDB: The database for Event Sourcing, n.d.)
Clients can connect to the server over HTTP or via gRPC.
EventStoreDB's main advantages are its optimization for storing and processing event data, meaning it can handle high-volume data streams and provide fast, real-time data processing. It uses a log-structured data model, making it horizontally scalable by adding additional nodes. Furthermore, it features data replication and partitioning, making it resilient and fault-tolerant. This guarantees data is always accessible and the system can continue running even if one or more cluster nodes fail. Additionally, it provides transactional support, ACID compliance, and querying capabilities. Moreover, there is a sizable user and developer community that can provide support. (EventStoreDB: The database for Event Sourcing, n.d.)
RDBMS (SQL) โ
Relational database management systems (RDBMS) have been in use for a long time and are the most familiar database types to most developers. They are not only easy to use but also have a straightforward data structure for event sourcing. A single table containing a global sequence number, an entity identifier, a sequence number for each entity, and the event payload is all that is required. With this structure, as shown in Figure 26, queries can efficiently be executed for a full sequential read or search for all events related to a specific entity ID. To maintain an append-only policy, simply avoid using UPDATE or DELETE statements. RDBMS is a mature technology that offers ACID transactions and scalability options (Steimle, 2021).
todo insert image
Furthermore, an RDBMS (MSSQL Server) is the database of choice that Flowcontrol uses across all of its services, which means there will not be additional overhead for developers related to getting familiarized with a new type of system.
In summary, RDBMS can effectively be used as an event store because it satisfies all the requirements natively.
Kafka โ
Since event sourcing is usually applied within architectures that use a message bus, and most message buses have the capability to act as an event store, a message bus is a popular choice among developers for this purpose. Moreover, Kafka is already used within Flowcontrol. Due to its distributed nature and its focus on messages, it might be a good match for a tool that needs to cope with events and needs to be scalable. However, there are some downsides.
For instance, using a single topic to store all events allows for a full sequential read of every event, but such a full read also becomes necessary when trying to access all events associated with a specific entity. There are a few ways to optimize event reading for individual entities. One method is to create a separate topic for each entity, although this necessitates manual restoration of the global event order. Another strategy involves using partitions for the topic, which ensures that messages with the same key are stored together, allowing Kafka clients to read from specific partitions. However, this method requires guaranteeing an even distribution of entities across partitions, which involves knowledge about the entities, creating a partitioning function based on that information, and having hardware capable of efficiently computing it.
Furthermore, relying on Kafka as an event store goes against a core principle of event-sourcing-based applications: events should be stored before being published. With Kafka, storing and publishing events are no longer separate actions, creating a potential issue where an event could be lost and irretrievable if a Kafka instance fails during the process. In contrast, non-Kafka-based applications can detect and address storage failures before publishing events. Although Kafka acknowledgements can help minimize this risk, they necessitate replication and additional hardware resources.
In relation to Flowcontrol's requirements for an event store, Kafka has the following characteristics:
- Scalability: Kafka is designed for high throughput and can scale horizontally, making it suitable for handling large volumes of event data.
- Durability and consistency: Kafka provides durability by replicating events across multiple brokers, but its consistency guarantees are weaker than traditional databases or EventStoreDB due to its distributed nature.
- Data retention: Kafka supports configurable retention policies, but it is primarily designed for real-time data streaming rather than long-term storage.
- Query capabilities and event ordering: Kafka maintains event ordering within partitions but requires additional components like Kafka Streams or ksqlDB for querying and processing events.
Conclusion โ
In conclusion, MongoDB and Kafka can be deemed unsuitable event stores for Flowcontrol due to their inherent limitations and complexities. Therefore, the choice comes down to EventStoreDB and SQL. Although EventStoreDB presents certain benefits, such as being optimized for handling event data, these advantages are not decisive for Flowcontrol's use case. Specifically, the system's operations are not characterized by write-intensive tasks, nor is exceptional read performance a pivotal necessity. Within Flowcontrol, the principal goal of event sourcing is to secure a robust data backup system, enabling reliable data restoration and eventual consistency across all services in the face of potential data loss.
On the other hand, SQL databases meet the primary needs for scalability, durability, consistency, data retention, query capabilities, and event ordering. Moreover, SQL databases are ingrained within Flowcontrol's technology stack, serving as the go-to database system across all microservices. This familiarity significantly lessens the learning curve and facilitates the event store's management and maintenance.
Given these factors, using an SQL database as an event store for Flowcontrol is the most appropriate choice. By leveraging the team's familiarity with SQL and its ability to satisfy Flowcontrol's requirements, the development and maintenance of the event store will be more streamlined and efficient.
Event store backup and recovery โ
In the realm of event sourcing, where the event store serves as the authoritative source of truth for the system's event data, it is crucial to have robust backup and recovery mechanisms in place. The ability to safeguard and restore event data is essential for ensuring data integrity, fault tolerance, and business continuity.
The research on event store backup and recovery focuses on exploring strategies to securely store and replicate event data to prevent loss and to facilitate efficient recovery in case of unexpected data loss scenarios.
Fortunately, backup and recovery tools are built into the most popular SQL databases, and there are third-party tools like PGDump created for the same purpose (How do you implement event store backups and disaster recovery strategies?, n.d.). Given that Flowcontrol will be using MSSQL Server as an event store, its built-in backup capabilities are examined specifically. They are quite comprehensive and straightforward:
- The UI provides a simple backup capability. The figure below provides a comprehensive visualization of how this works:
- todo insert image
- Back up using a transactional statement (Transact-SQL) - again a straightforward process, a visualization of which can be seen in the figure below:
- todo insert image
- Back up the full database using PowerShell. The example illustrated below in figure 29 backs up the database to the default backup location of the server instance:
- todo insert image
By using these built-in backup tools, regular snapshots of the event stores can be created and used for recovery in the eventual case of data loss.
5.2 Implementation โ
In the following section the reader can find key parts of the implementation process, including design choices tailored specifically to the systemโs requirements and direct code implementations in the form of code snippets.
5.2.1 Message-based inter-service data replication โ
Research strategies utilized while integrating the Kafka-facilitated data replication across microservices include:
- Workshop - Prototyping
Architecture โ
To ensure reliability, three Kafka brokers are deployed (as evident from the C3 diagram portrayed in Figure 3) alongside single instances of Apache Zookeeper and Kafdrop in the Docker Compose setup, which spins up all microservices and other relevant systems of Flowcontrol. Kafdrop is a separate system that provides a web UI for viewing Kafka producers, brokers, consumers, and topics, and for browsing them to view messages. Zookeeper is used to manage important metadata about the state of the cluster, such as available brokers, topics, partitions, etc. The brokers are configured to communicate both within the Docker network and on a local network level (e.g., localhost). This aids in debugging sessions, where local service instances are usually employed. Their respective configurations within the docker-compose file can be seen in the figures below.
todo insert image
Separate topics are configured for each distinct type of event. Topics carry clear, specific names representing a particular domain event. The partition count is fixed at 1 per topic to guarantee the order of messages. Each topic has a replication factor of 3 to distribute the messages across all brokers. Moreover, the minimum in-sync replicas setting is set to 3 for all topics to ensure that all messages are replicated across all three brokers. The snippet below provides a sample of this configuration for a topic called "articleSaved", which is responsible for transmitting messages about newly created or updated Article objects to consumers.
todo insert image
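Under the assumptions stated above (1 partition, replication factor 3, minimum in-sync replicas 3), an approximate equivalent sketch using Spring Kafka's TopicBuilder could look as follows; the bean name is illustrative:

```java
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.TopicBuilder;

@Configuration
public class TopicConfiguration {

    @Bean
    public NewTopic articleSavedTopic() {
        return TopicBuilder.name("articleSaved")
                .partitions(1)                                        // single partition to preserve message order
                .replicas(3)                                          // replicate the topic across all three brokers
                .config(TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "3") // require all replicas to be in sync
                .build();
    }
}
```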
Additionally, specific configuration classes exist for both producers and consumers. These classes manage the acknowledgement settings, along with the serialization and deserialization of message keys and values. A shared Data Transfer Object (DTO) module, integrated as a dependency into the otherwise isolated folder structures of the producers and the consumers, provides DTO counterparts for each produced data model and eliminates the need for specific type mappings. As a result, generic Kafka configurations meant for producing JSON objects could be created. Notably, these objects can represent any data model defined in the shared DTO module. An explanation of how this works can be found in the section Type Mappings.
todo insert image
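A generic JSON producer configuration along these lines could be sketched as follows; the broker addresses and the acknowledgement setting shown here are illustrative assumptions rather than the documented values:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.DefaultKafkaProducerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.core.ProducerFactory;
import org.springframework.kafka.support.serializer.JsonSerializer;

@Configuration
public class KafkaProducerConfiguration {

    @Bean
    public ProducerFactory<String, Object> producerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092,kafka-2:9092,kafka-3:9092"); // illustrative
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, JsonSerializer.class); // any shared DTO as JSON
        props.put(ProducerConfig.ACKS_CONFIG, "all");                                  // wait for all in-sync replicas
        return new DefaultKafkaProducerFactory<>(props);
    }

    @Bean
    public KafkaTemplate<String, Object> kafkaTemplate() {
        return new KafkaTemplate<>(producerFactory());
    }
}
```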
Similarly, the code snippet below shows the generic consumer configuration for JSON messages. It includes additional configuration for error handling: Spring cannot handle deserialization errors that occur before the listener is invoked, so the ErrorHandlingDeserializer class has to be leveraged to solve this problem.
todo insert image
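Correspondingly, a generic JSON consumer configuration in which the ErrorHandlingDeserializer wraps the value deserializer could be sketched as follows; the broker addresses and the trusted-packages setting are illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.annotation.EnableKafka;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;
import org.springframework.kafka.support.serializer.ErrorHandlingDeserializer;
import org.springframework.kafka.support.serializer.JsonDeserializer;

@EnableKafka
@Configuration
public class KafkaConsumerConfiguration {

    @Bean
    public ConsumerFactory<String, Object> consumerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092,kafka-2:9092,kafka-3:9092"); // illustrative
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        // Wrap the JSON deserializer so that malformed messages do not kill the listener container.
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ErrorHandlingDeserializer.class);
        props.put(ErrorHandlingDeserializer.VALUE_DESERIALIZER_CLASS, JsonDeserializer.class);
        props.put(JsonDeserializer.TRUSTED_PACKAGES, "*"); // in practice, restrict to the shared DTO package
        return new DefaultKafkaConsumerFactory<>(props);
    }

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, Object> kafkaListenerContainerFactory() {
        ConcurrentKafkaListenerContainerFactory<String, Object> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory());
        return factory;
    }
}
```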
The events themselves are produced in the Service classes when an event takes place (e.g., creating or deleting an Article) using a middleware class with the respective configurations and topics to ensure separation of concerns. A JPA repository is used as an object-relational mapper, which guarantees idempotency with respect to the state of the data in the database, and errors are handled gracefully, guaranteeing idempotency in terms of interaction with Kafka. Idempotency refers to ensuring an operation can be repeated multiple times without changing the outcome beyond the initial application. (Siddharth, 2023)
todo insert image
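A hedged sketch of such a service-layer producer could look as follows; the repository, mapper, and entity types are hypothetical stand-ins, and the error handling is reduced to a placeholder comment:

```java
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

// Hypothetical sketch: Article, ArticleRepository, ArticleMapper, and ArticleDTO are stand-ins.
@Service
public class ArticleService {

    private final ArticleRepository articleRepository;
    private final KafkaTemplate<String, Object> kafkaTemplate;
    private final ArticleMapper articleMapper;

    public ArticleService(ArticleRepository articleRepository,
                          KafkaTemplate<String, Object> kafkaTemplate,
                          ArticleMapper articleMapper) {
        this.articleRepository = articleRepository;
        this.kafkaTemplate = kafkaTemplate;
        this.articleMapper = articleMapper;
    }

    public Article saveArticle(Article article) {
        // Saving through the JPA repository is idempotent with respect to the entity state.
        Article saved = articleRepository.save(article);
        try {
            // Publish the domain event only after the local write has succeeded.
            kafkaTemplate.send("articleSaved", saved.getId().toString(), articleMapper.toDto(saved));
        } catch (Exception e) {
            // Handle the failure gracefully (e.g. log and retry) so the operation stays idempotent.
        }
        return saved;
    }
}
```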
On the consumers' side, there are consumer classes where listeners are employed. Each listener listens for incoming messages on a specific topic and uses the defined custom configurations to consume the messages. When messages are consumed (e.g., a message on a topic named "articleCreated" containing an Article entity), the object is mapped to a native entity and persisted in the database in a table named entity_external (e.g., article_external), so that it is clear the entity comes from another service and is not native to the current one.
todo insert image
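A compact sketch of such a listener, assuming the generic JSON consumer configuration described earlier and hypothetical repository and mapper names, could look as follows:

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

// Hypothetical sketch: ArticleDTO, ArticleMapper, and ArticleExternalRepository are stand-ins.
@Component
public class ArticleConsumer {

    private final ArticleExternalRepository articleExternalRepository; // writes to the article_external table
    private final ArticleMapper articleMapper;

    public ArticleConsumer(ArticleExternalRepository articleExternalRepository,
                           ArticleMapper articleMapper) {
        this.articleExternalRepository = articleExternalRepository;
        this.articleMapper = articleMapper;
    }

    @KafkaListener(topics = "articleSaved", groupId = "transport-service")
    public void onArticleSaved(ArticleDTO dto) {
        // Map the shared DTO onto this service's own external entity and persist it.
        articleExternalRepository.save(articleMapper.toExternalEntity(dto));
    }
}
```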
Additional "service" and "repository" classes are created in the consuming microservices to accommodate the persistence of the replicated data. The "repository" classes perform the read/write operations to/from the databases, and the "service" classes are responsible for handling the communication with the repository classes to achieve separation of concerns. Moreover, the logic for the previously used synchronous calls is entirely removed and replaced by the data replication pattern.
Type mappings โ
As the data being shared between the microservices is only native to the producer services, it has to be mapped to an object that is also known to the consumer services. In other words, the entity classes that define the data are specific to the producer services and are not directly accessible to the consumer services due to the isolation of their folder structures.
To effectively share data between microservices while maintaining loose coupling, practices suggest creating a common representation of the data, often referred to as a Data Transfer Object (DTO). The DTOs contain only the minimum necessary attributes required by the consumer services and are defined in a shared module accessible to both the producer and consumer services. As mentioned before, the folder structures of the microservices are isolated, so the DTOs are defined in a folder structure which is known to both the producer and consumer microservices. To achieve this, a new module has been created and imported as a dependency into the producer and consumer microservices, effectively sharing the module which contains the DTOs between them.
When the producer service generates an event containing data to be shared, it uses a custom-defined mapper class to map the native entity to the DTO. For instance, the Article entity includes many attributes which are not directly relevant to the consuming services. That is why the ArticleDTO data model only includes the attributes that may be relevant to them, and any remaining mismatches are handled by the custom mapper classes in the respective consumers.
todo insert image
On the consumer side, the microservice receives the DTO and maps it back to its local data structure, which can be different from the producer's structures. This is also facilitated by a custom mapper class.
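A minimal sketch of such a mapper pair could look as follows; the attribute names are illustrative rather than the actual Flowcontrol fields:

```java
// Hypothetical sketch: Article, ArticleDTO, and ArticleExternal are stand-ins for the real classes.
public class ArticleMapper {

    // Producer side: map the native entity to the shared DTO, keeping only attributes relevant to consumers.
    public ArticleDTO toDto(Article article) {
        return new ArticleDTO(article.getId(), article.getName());
    }

    // Consumer side: map the shared DTO onto the consumer's own (external) data model.
    public ArticleExternal toExternalEntity(ArticleDTO dto) {
        ArticleExternal external = new ArticleExternal();
        external.setId(dto.getId());
        external.setName(dto.getName());
        return external;
    }
}
```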
This type mapping approach allows microservices to share data without being tightly coupled to the internal data structures of the producer service, promoting the following characteristics in the overall system architecture:
- Modularity - Via DTOs, each microservice can have its own data model that is suited to its specific needs and functions.
- Flexibility - DTOs enable services to communicate without needing to know about each other's internal data structures. This allows developers to change the internal data structures of their microservices without affecting communication between services. For example, if one service's data model needs to change due to new business requirements, as long as the DTO and the mapping logic are updated accordingly, other services can continue to interact with it as before.
- Maintainability - With DTOs, each microservice only needs to worry about mapping between its internal data models and the shared DTOs. This means that when modifications are required, developers only need to understand and modify a small, localized part of the system (Data Transfer Objects Between Microservices, 2022) (MicroServices - How To Share DTO (Data Transfer Objects), 2020).
5.2.2 Event sourcing โ
Research strategies utilized while integrating event sourcing include:
- Workshop - Prototyping
Architecture โ
The initial step in the integration of event sourcing within Flowcontrol was creating separate databases responsible for storing the events, as evident from the C1 diagram illustrated in Figure 1. This was done per producer service.
todo insert image
Afterward, each producer microservice had to be manually configured to work with two databases. This involved creating configuration classes and applying folder structure changes so that each datastore reads only the folders relevant to it, i.e., standard data models and repositories are kept in a folder structure separate from event data models and repositories.
Next, the actual event data models had to be created. They employ a generalized data structure with the only difference being the name of the table, e.g., Article_Event, Farmer_Event, etc.
Next, database migrations had to be configured and created. Flowcontrol uses Liquibase to apply migrations, which does not natively support multiple databases, so manual configuration classes had to be employed to facilitate this. The figure below shows sample migrations for creating tables for Article and Pallet Type events:
todo insert image
The next step was to configure the Protobuf files that define the payload to be serialized. The figure below showcases a sample - the Article.proto file, which represents the Article entity along with all of its attributes. Some of the attributes that are nested custom data models are represented as Strings, so that they can be serialized to JSON string structures via the Jackson library. This strategy was implemented to facilitate a simpler mapping and overall serialization and deserialization process; otherwise, separate proto files would have had to be defined and custom mapping logic employed for every nested entity, which would be considerably time-consuming and, in this case, unnecessarily cumbersome.
todo insert image
Afterward, the custom mappers were developed, which are used to map standard entities to binary form using the Protobuf definitions and, in reverse, binary data back to standard entities. The code snippets below show the process for the Article mapping.
todo insert image
Additional logic had to be developed for mapping the nested custom data models to a JSON string via the Jackson library. The code snippet below shows this process:
todo insert image
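A rough sketch of such a helper, assuming Jackson's ObjectMapper and a hypothetical nested LinkCode model, could look as follows:

```java
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;

// Hypothetical sketch: LinkCode is a stand-in for any nested custom data model.
public class NestedModelJsonMapper {

    private final ObjectMapper objectMapper = new ObjectMapper();

    // Serialize a nested collection (e.g. an Article's link codes) to a JSON string
    // so it can be stored in a single string field of the Protobuf message.
    public String toJson(List<LinkCode> linkCodes) throws JsonProcessingException {
        return objectMapper.writeValueAsString(linkCodes);
    }

    // Deserialize the JSON string back into the nested models when an event is replayed.
    public List<LinkCode> fromJson(String json) throws JsonProcessingException {
        return objectMapper.readValue(json,
                objectMapper.getTypeFactory().constructCollectionType(List.class, LinkCode.class));
    }
}
```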
To facilitate the storing and restoring logic for the events, service classes were developed, as the one shown in the snippet below:
todo insert image
A separate service class was created to hold the logic specific to restoring the events, such as grouping them by timestamp, event type, etc. This logic can be accessed by calling a custom API endpoint.
todo insert image
The diagram below provides visual context on how the event logging process takes place:
todo insert image
When an event occurs, e.g., creating an article, the event is logged to the relevant event store table through the service layer and ultimately via the repository layer.
6. Testing and Validation โ
This testing strategy outlines the scope and plan for testing the Kafka-based data replication feature in Flowcontrol. The goal is to ensure reliable messaging, efficient data transfer, and improved performance.
6.1 Test items โ
- Kafka Producers
- Kafka Consumers
6.2 Features to be tested โ
- Message Production and Consumption: Verify that messages are being sent and received correctly.
- Data Integrity: Ensure that the data sent and received is as expected, ensuring integrity in the replication process.
- Performance Improvement: Validate that the endpoint completes requests quicker due to having replicated data locally, reducing the need for multiple internal service calls.
- Throughput and Latency: Ensure the system can maintain a high message throughput and low latency under load.
6.3 Approach โ
The following tests will be performed:
- Unit tests: Testing the correct operation of individual Kafka producers and consumers.
- Contract tests: Testing the correct operation and interaction of the Kafka producers and consumers as a whole system.
- Performance tests: Testing the system's speed improvement when using Kafka for data replication.
- End-to-end tests: Testing the correct and expected operation of the entire system end to end.
Contract testing has been chosen over integration testing due to their similarities and operational advantages of the former. Integration tests, while closely replicating real operational scenarios by minimizing mock components, demand significant resources and time, requiring all microservices and associated systems to be functional during testing.
In contrast, contract testing concentrates on validating the interactions at the boundary of external services, ensuring their alignment with the expected contracts by the consumer service. The "contract" here signifies the agreement between services regarding the format and content of Kafka messages.
Essentially, contract tests are a more focused version of integration tests, centered specifically on service contracts. They offer the advantage of requiring fewer infrastructural components for execution, leading to a more streamlined and efficient testing process.
6.4 Pass/Fail Criteria & Test Environment โ
Each test will have specific pass/fail criteria based on the expected behavior of the system. For example, in the data integrity test, a pass means the replicated data matches the original data exactly; any deviation constitutes a failure.
The tests will be performed in a staging environment that mirrors the production environment.
6.5 Test results and analysis โ
Research strategies utilized during testing include:
- Lab - Unit test, System test, Component test, Computer simulation: leveraged to effectively test the Kafka-facilitated data replication across the microservices
Unit tests โ
Objective โ
The objective of the unit tests is to validate the functionality and behavior of the Kafka producers and consumers in the context of a microservices architecture, where these components are responsible for the propagation of data changes across services via event-driven communication. Given the asynchronous and distributed nature of this architecture, it is crucial to ensure that any changes to an entity are accurately captured in events, correctly serialized, and successfully published to the correct Kafka topics. On the other side, consumers should correctly deserialize and process these events, triggering the appropriate operations and ensuring data consistency across microservices.
Steps โ
To meet this objective, the unit tests should cover both the Kafka producer and the Kafka consumer components.
Producer unit tests: โ
- For each operation that should generate an event (such as entity creation, update, deletion), trigger the operation and assert that the event is produced with the correct payload. Verify that the Kafka producer is called with the expected payload.
- Check boundary cases where the operations might be called with empty or incomplete data. Verify that the producer behaves as expected (e.g., it may not produce an event, or it might produce an event with a default or error payload).
- When an operation involves the manipulation of relationships between entities, ensure that the resulting event is correctly produced, with the appropriate identifiers for the relationship.
- In each test case, ensure that the correct interactions with the Kafka producer or any other relevant Kafka classes are taking place.
Consumer unit tests: โ
- For each type of event that the service should consume, send a mocked message to the consumer. Ensure that the appropriate service method is called with the correct data transformed from the event payload.
- Validate that exceptions are correctly handled. For instance, when an event with malformed data is consumed, ensure that the service handles the error.
- Similar to the producer tests, ensure the correct interactions with the Kafka classes in each test case.
Collectively, these tests provide comprehensive coverage of the event-driven data replication process in a microservices architecture. They help ensure that data changes are correctly produced as events, successfully consumed by the appropriate services, and accurately transformed and persisted in their respective databases.
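As an illustration of the producer-side checks described above, a minimal JUnit/Mockito test could look like the following sketch; the service, repository, mapper, entity, and DTO types are hypothetical stand-ins for the real Flowcontrol classes:

```java
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.ArgumentMatchers.eq;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.when;

import org.junit.jupiter.api.Test;
import org.springframework.kafka.core.KafkaTemplate;

class ArticleServiceTest {

    @Test
    @SuppressWarnings("unchecked")
    void creatingAnArticlePublishesAnArticleSavedEvent() {
        // Arrange: mock the collaborators of the (hypothetical) ArticleService.
        ArticleRepository repository = mock(ArticleRepository.class);
        KafkaTemplate<String, Object> kafkaTemplate = mock(KafkaTemplate.class);
        ArticleMapper mapper = mock(ArticleMapper.class);
        ArticleService service = new ArticleService(repository, kafkaTemplate, mapper);

        Article article = new Article(42L, "Tomatoes");
        when(repository.save(any(Article.class))).thenReturn(article);
        when(mapper.toDto(article)).thenReturn(new ArticleDTO(42L, "Tomatoes"));

        // Act: trigger the operation that should produce an event.
        service.saveArticle(article);

        // Assert: the producer was called once with the expected topic and payload type.
        verify(kafkaTemplate).send(eq("articleSaved"), any(String.class), any(ArticleDTO.class));
    }
}
```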
Results โ
Sample producer tests: โ
todo insert image
Sample consumer tests: โ
todo insert image
Contract tests โ
Objective โ
The objective of the contract tests is to validate the correct functioning of the Kafka producers and consumers. These tests are designed to verify whether the Kafka producers generate the expected messages and adhere to the appropriate type mappings. Additionally, the contract tests ensure that the Kafka Consumers correctly process incoming messages and that changes in a producer microservice are accurately reflected in a consumer through Kafka events. By performing these contract tests, the proper functioning of the Kafka-based data replication system can be ensured and the end-to-end integrity of message production, consumption, and data persistence can be validated.
Steps โ
Producer tests verify that data is correctly produced to the appropriate topic:
- Produce an event to the respective topic.
- Consume the message from the same topic.
- Assert that the consumed data equals the original data.
Consumer tests mock the behavior of the respective service method:
- Send data to the respective topic.
- Set up a Kafka consumer to consume the message from the topic.
- Assert that the consumed data equals the original data.
- Verify that the service correctly stores the data via the repository when the message is consumed.
Results โ
Sample of producer tests: โ
todo insert image
Sample of consumer tests: โ
todo insert image
Performance tests โ
Objective โ
Evaluate the performance improvements offered by the Kafka-facilitated data replication system for the pallet label creation endpoint. This will be done by benchmarking the execution time of the endpoint in the old and new systems under a specific load pattern. The specific endpoint was chosen because it is the most inter-service-request intensive and thus the worst performing. The testing goal is to evaluate the request-response time of single requests, rather than performance under load of, for instance, multiple users sending requests at the same time. This is because the system is to be used internally within the company, so excessive request load is not expected.
Setup โ
Ensure Apache JMeter is correctly configured to execute 300 consecutive requests on the pallet label creation endpoint. In the old system, ensure all the services involved in synchronous calls are available and functioning correctly. In the new system, the Kafka producers, consumers, and brokers should be running and properly connected. The databases for the consumer services must be set up and ready to store data. Ensure that the data required to fulfill the requests is correctly transmitted and exists in the Transport serviceโs database.
Steps โ
- Start Apache JMeter and configure it to make 300 requests to the pallet label creation endpoint in the old system with a delay of 5 seconds between each request, using a single thread. This simulates the load of a single user making frequent requests.
- Start the JMeter test and monitor the execution time of each request.
- Once the JMeter test completes, note the average execution time of all 300 requests.
- Record and store this result for later comparison.
- Repeat steps 1-4 for the new system, where Kafka-facilitated data replication has been implemented. Ensure the same load pattern (300 requests, 5-second delay, single thread) is used for the test.
- Compare the average execution time of the requests in the old system and the new system. Analyze the results to determine the performance improvements brought by Kafka-facilitated data replication.
- Document the test results, providing insights into the overall effect on performance and reliability when shifting from synchronous inter-service calls to Kafka-based data replication.
Results โ
Old system (synchronous inter-service communication) โ
The average execution time for all 300 requests is equal to five seconds (5008ms). This means each request took on average approximately five seconds to complete. The deviation amounts to 1259ms, indicating considerable variability or dispersion in the response times.
todo insert image
New system (Asynchronous data replication) โ
The average execution time for the 300 consecutive requests equals two and a half seconds (2408ms). Moreover, the deviation is significantly lower at 352ms, indicating a vastly improved performance consistency.
todo insert image
Conclusion โ
The performance benefits that Kafka-facilitated data replication brings to the system for the most intensive API endpoint amount to over fifty percent (>50%) improvement and considerably more consistent response times when compared to the old system of synchronous inter-service communication.
End-to-end tests โ
Objective โ
Validate that the producer sends the correct message and that the consumer receives the message and persists the correct data.
Setup โ
Configure and start the Kafka producer and consumer services. Ensure the Kafka brokers are running and that producers and consumers have a successful connection to the expected Kafka topics. Ensure the databases for the consumer services are set up and ready to store data.
Steps โ
Single requests โ
- Execute an API request to a producer endpoint that triggers the production of a message to a Kafka topic.
- Monitor the Kafka topic to confirm that the message has been published.
- Wait for the consumer to process the message.
- Query the consumer's database to verify the data is stored correctly.
- Repeat for each replicated entity.
Scope:
- Create/Update/Delete an Article from producer (Article service) โ ensure operations complete in consumers (Farmer, Transport services)
- Create/Update/Delete a PalletType from producer (Article service) โ ensure operations complete in consumer (Transport service)
- Create/Update/Delete a Farmer from producer (Farmer service) โ ensure operations complete in consumer (Transport service)
- Create/Update/Delete a Cell from producer (Farmer service) โ ensure operations complete in consumer (Transport service)
- Create/Update/Delete a FarmerArticle from producer (Farmer service) โ ensure operations complete in consumer (Transport service)
- Create/Update/Delete a PalletLabel from producer (Transport service) โ ensure operations complete in consumer (Production service)
Results โ
The results will showcase sample (singular) tests of each defined kind to avoid unnecessary documentation cluttering.
Test Input โ
The payload sent for creating an Article can be found in Figure 48:
todo insert image
Expected Output โ
ArticleDTO produced with the same payload as seen in Test Input and sent to the Kafka broker(s) on topic "articleSaved". The ArticleDTO is then expected to be consumed by the Farmer and Transport services, where it would be converted to the relevant native Article entity and persisted in their individual databases in the Articles_External table.
Actual Output โ
Upon inspecting the articleSaved topic through Kafdrop, the expected ArticleDTO entity was discovered, as seen in Figure 49 below:
todo insert image
Upon further inspection, the Farmer service has correctly received the ArticleDTO, successfully mapped the relevant attributes, and persisted the entity as Article_External, along with its relevant LinkCodes in the Link_Codes_External table of its own database, as seen in Figure 57 and Figure 58:
todo insert image
Furthermore, the Transport service has also correctly received and processed the ArticleDTO and its LinkCodeDTO objects, persisting their relevant attributes in its own datastore, as seen in Figure 59 and Figure 60:
todo insert image
Pass/Fail โ
The test passed successfully, meeting all expected criteria.
Note: each of the other single-request test scenarios was tested individually and passed successfully, achieving the expected results.
7. Conclusion โ
7.1. Main research question & conclusion โ
Why is the internal microservice communication slow, and how can it be optimized for speed while ensuring sufficient reliability?
Based on the research results for the sub-questions, which examine the possible causes in technical detail, and on performance testing that showed improvements of over 50% on average after synchronous inter-service communication was removed, it is reasonable to conclude that synchronous communication was a significant contributor to the reduced throughput of certain user-performed API requests. These requests were identified as underperforming by the stakeholders at the beginning of the project.
Furthermore, the final product demonstrates the effective use of asynchronous data replication, a technique that aligns with the broader goal of microservice architectures to create highly decoupled and independently scalable services. This strategy enhances the independence of the subsystems, contributing to their scalability and robustness.
The reliability aspect of the solution concerns ensuring effective data consistency. Multiple strategies are known for ensuring data consistency, one of which has already been integrated: logging the events that trigger the data replication, also known as event sourcing. Event sourcing provides numerous benefits, such as auditability, problem traceability, and, most relevantly here, the ability to reliably reconstruct data in edge cases of data loss or data inconsistency between microservices. Moreover, the database-interacting methods and Kafka-related operations are idempotent, which further strengthens reliability.
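As a minimal illustration of what idempotency buys here, the sketch below models a consumer whose handler is keyed by the entity's id and performs an upsert, so a redelivered Kafka message leaves the replicated data unchanged. The event shape and the in-memory map standing in for the Articles_External table are assumptions made purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class IdempotentReplicationSketch {

    // Hypothetical replication event, keyed by the producer-assigned article id.
    public record ArticleEvent(long articleId, String name) { }

    // Stands in for the consumer-side Articles_External store, keyed by the entity's id.
    private final Map<Long, String> articlesExternal = new HashMap<>();

    // Idempotent handler: processing the same event once or several times leaves the
    // store in the same state, because the write is an upsert rather than a blind insert.
    public void onArticleSaved(ArticleEvent event) {
        articlesExternal.put(event.articleId(), event.name());
    }

    public static void main(String[] args) {
        IdempotentReplicationSketch consumer = new IdempotentReplicationSketch();
        ArticleEvent event = new ArticleEvent(1L, "Test article");

        // A redelivered Kafka message triggers the handler twice...
        consumer.onArticleSaved(event);
        consumer.onArticleSaved(event);

        // ...but the replicated data is unchanged: exactly one entry for article 1.
        System.out.println(consumer.articlesExternal); // {1=Test article}
    }
}
```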
Another strategy with the potential to greatly boost reliability is turning the data replication events into distributed transactions, also known as the Saga pattern. This is a failure management pattern that helps establish consistency in distributed applications. Essentially, the event that triggers the data replication across microservices is the initial transaction in the Saga, and each consumer of the event is a subsequent transaction in it. When all consumers complete their part of the transaction, that is, replicate the incoming data in their databases, the saga completes successfully. If one of the participants in the saga (i.e., a consumer) encounters an error and fails to persist the incoming data, the saga fails, and all preceding transactions are rolled back.
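The rollback behaviour described above can be sketched in-process as follows. This is only an illustration of the saga semantics, not the actual or proposed implementation: a real choreography-based saga would coordinate the participants and their compensating actions through Kafka events rather than a local loop, and the participant names are assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ReplicationSagaSketch {

    // A saga step: the local transaction plus the compensating action that undoes it.
    interface SagaStep {
        void execute() throws Exception;
        void compensate();
    }

    public static void main(String[] args) {
        Deque<SagaStep> completed = new ArrayDeque<>();

        SagaStep farmerReplication = new SagaStep() {
            public void execute() { System.out.println("Farmer service: Article replicated"); }
            public void compensate() { System.out.println("Farmer service: replication rolled back"); }
        };
        SagaStep transportReplication = new SagaStep() {
            public void execute() throws Exception { throw new Exception("Transport service: persistence failed"); }
            public void compensate() { System.out.println("Transport service: nothing to roll back"); }
        };

        // Execute the participants in sequence; on the first failure, compensate the
        // already-completed steps in reverse order so the system converges back to a
        // consistent state.
        for (SagaStep step : new SagaStep[] { farmerReplication, transportReplication }) {
            try {
                step.execute();
                completed.push(step);
            } catch (Exception failure) {
                System.out.println("Saga failed: " + failure.getMessage());
                while (!completed.isEmpty()) {
                    completed.pop().compensate();
                }
                break;
            }
        }
    }
}
```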
In summary, it can be concluded that the transition from synchronous to asynchronous inter-service communication has been instrumental in drastically improving the performance of the microservice architecture. The asynchronous approach, implemented through data replication, has promoted a high degree of decoupling and independent scalability among services, a cornerstone of effective microservice architecture. This strategy has not only augmented the independence of subsystems but also contributed significantly to their robustness and scalability.
Reliability is ensured via strategies like event sourcing, which aids in data replication and enables effective data reconstruction during edge cases of data loss. Further, the idempotency of the database-interacting methods and Kafka-related operations strengthens reliability.
The potential incorporation of the Saga pattern promises to bolster the system's reliability even more. This would allow for comprehensive rollback in case of individual transaction failure, enhancing data consistency.
In conclusion, the refined system architecture optimizes performance and reliability, illustrating the significant benefits of asynchronous communication and advanced reliability strategies in a microservice design.
References โ
- (n.d.). Retrieved from Event Store: https://www.eventstore.com/
- 3 Common Misunderstanding of Inter-Service Communication in Microservices. (n.d.). Retrieved from Hadii: https://www.hadii.ca/insights/microservice-communication
- A Beginner's Guide to Event Sourcing. (n.d.). Retrieved from Event Store: https://www.eventstore.com/event-sourcing
- AmiyaRanjanRout. (n.d.). Apache Kafka - Producer Acknowledgement and min.insync.replicas. Retrieved from GeeksForGeeks: https://www.geeksforgeeks.org/apache-kafka-producer-acknowledgement-and-min-insync-replicas/
- Apache Avro. (n.d.). Getting Started (Java). Retrieved from Apache Avro: https://avro.apache.org/docs/1.11.1/getting-started-java/
- Asynchronous and synchronous HTTP request on server side, performance comparison. (n.d.). Retrieved from Stack Overflow: https://stackoverflow.com/questions/55863127/asynchronous-and-synchronous-http-request-on-server-side-performance-comparison
- Atamel, M. (2022). Implementing the saga pattern in Workflows. Retrieved from Google Cloud: https://cloud.google.com/blog/topics/developers-practitioners/implementing-saga-pattern-workflow
- Baeldung. (2018). Working with Apache Thrift. Retrieved from Baeldung: https://www.baeldung.com/apache-thrift
- Benefits of Message Queues. (n.d.). Retrieved from Amazon Web Services: https://aws.amazon.com/message-queue/benefits/
- Biggest differences of Thrift vs Protocol Buffers? [closed]. (n.d.). Retrieved from Stack Overflow: https://stackoverflow.com/questions/69316/biggest-differences-of-thrift-vs-protocol-buffers
- Confluent. (2020). Benchmarking Apache Kafka, Apache Pulsar, and RabbitMQ: Which is the Fastest? Retrieved from Confluent: https://www.confluent.io/blog/kafka-fastest-messaging-system/
- Data Serialization โ Protocol Buffers vs Thrift vs Avro. (2019). Retrieved from Bizety: https://www.bizety.com/2019/04/02/data-serialization-protocol-buffers-vs-thrift-vs-avro/
- Data Transfer Objects Between Microservices. (2022). Retrieved from Softensity: https://www.softensity.com/blog/data-transfer-objects-between-microservices/
- Essential characteristics of the microservice architecture: loosely coupled. (n.d.). Retrieved from Microservices: https://microservices.io/post/architecture/2023/03/28/microservice-architecture-essentials-loose-coupling.html
- EventStoreDB: The database for Event Sourcing. (n.d.). Retrieved from Event Store: https://www.eventstore.com/eventstoredb
- Everything you need to know about Kafka in 10 minutes. (n.d.). Retrieved from Apache Kafka: https://kafka.apache.org/intro
- Golder, R. (2022). Kafka Consumer Auto Offset Reset. Retrieved from Medium: https://medium.com/lydtech-consulting/kafka-consumer-auto-offset-reset-d3962bad2665
- Golder, R. (2022). Kafka Unclean Leader Election. Retrieved from Medium: https://medium.com/lydtech-consulting/kafka-unclean-leader-election-13ac8018f176
- Hameed, E. (2022). Microservices anti-patterns. Retrieved from Dev: https://dev.to/evanhameed99/microservices-anti-patterns-5gdg
- How do you implement event store backups and disaster recovery strategies? (n.d.). Retrieved from LinkedIn: https://www.linkedin.com/advice/0/how-do-you-implement-event-store-backups-disaster
- HTTP - Overview. (n.d.). Retrieved from Tutorials Point: https://www.tutorialspoint.com/http/http_overview.ht
- Implementing event-based communication between microservices (integration events). (2023). Retrieved from Microsoft Learn: https://learn.microsoft.com/en-us/dotnet/architecture/microservices/multi-container-microservice-net-applications/integration-event-based-microservice-communications
- Jadon, A. (2022). Understanding RabbitMQ Queue & Messaging Simplified 101. Retrieved from HEVO Data: https://hevodata.com/learn/rabbitmq-queue/
- Johansson, L. (2020). Part 1: Apache Kafka for beginners - What is Apache Kafka? Retrieved from Cloudkarafka: https://www.cloudkarafka.com/blog/part1-kafka-for-beginners-what-is-apache-kafka.html
- Kafka Producer Retries. (n.d.). Retrieved from Conduktor Kafkademy: https://www.conduktor.io/kafka/kafka-producer-retries/
- Kafka vs RabbitMQ - A Head-to-Head Comparison for 2023. (n.d.). Retrieved from Project Pro: https://www.projectpro.io/article/kafka-vs-rabbitmq/451
- Karanam, R. (2019). Microservice Architecture Best Practices: Messaging Queues. Retrieved from DZone: https://dzone.com/articles/microservice-architecture-best-practices-messaging
- Kleppmann, M. (2012). Schema evolution in Avro, Protocol Buffers and Thrift. Retrieved from Martin Kleppmann: https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
- Kruisz, M. (2021). Microservice Architecture Goals. Retrieved from Manuel Kruisz: https://www.manuelkruisz.com/blog/posts/goals-of-microservice-architectures
- Levy, E. (2022). Kafka vs. RabbitMQ: Architecture, Performance & Use Cases. Retrieved from Upsolver: https://www.upsolver.com/blog/kafka-versus-rabbitmq-architecture-performance-use-case
- Ludwikowski, A. (2019). The best serialization strategy for Event Sourcing. Retrieved from softwaremill: https://blog.softwaremill.com/the-best-serialization-strategy-for-event-sourcing-9321c299632b
- Maison, M. (2019). An introduction to event sourcing. Retrieved from IBM: https://developer.ibm.com/articles/event-sourcing-introduction/
- Mani, G. (2022). Distributed Monoliths vs. Microservices: Which Are You Building? Retrieved from Scout APM: https://scoutapm.com/blog/distributed-monoliths-vs-microservices
- MicroServices - How To Share DTO (Data Transfer Objects). (2020). Retrieved from Vinsguru: https://www.vinsguru.com/microservices-architecture-how-to-share-dto-data-transfer-objects/
- Microsoft. (2023). Create a Full Database Backup. Retrieved from Learn Microsoft: https://learn.microsoft.com/en-us/sql/relational-databases/backup-restore/create-a-full-database-backup-sql-server?view=sql-server-ver16
- Microsoft. (n.d.). Event Sourcing pattern. Retrieved from Learn Microsoft: https://learn.microsoft.com/en-us/azure/architecture/patterns/event-sourcing
- Microsoft Learn. (2022). Communication in a microservice architecture. Retrieved from Microsoft Learn: https://learn.microsoft.com/en-us/dotnet/architecture/microservices/architect-microservice-container-applications/communication-in-microservice-architecture#asynchronous-microservice-integration-enforces-microservices-autonomy
- Microsoft Learn. (2023). How to design communication across microservice boundaries. Retrieved from Microsoft Learn: https://learn.microsoft.com/en-us/dotnet/architecture/microservices/architect-microservice-container-applications/distributed-data-management#challenge-4-how-to-design-communication-across-microservice-boundaries
- Moráveis, J. (2019). Message Queues: Even Microservices Want to Chit Chat. Retrieved from Avenue Code: https://blog.avenuecode.com/message-queue-even-microservices-want-to-chit-chat
- Nayar, V. (2022). Data Serialization: Apache Avro vs. Google Protobuf. Retrieved from Funnel-Labs: https://www.funnel-labs.io/2022/08/26/data-serialization-apache-avro-vs-google-protobuf/
- Ozkaya, M. (2021). Microservices Asynchronous Message-Based Communication. Retrieved from Medium: https://medium.com/design-microservices-architecture-with-patterns/microservices-asynchronous-message-based-communication-6643bee06123
- Ozkaya, M. (2022). Saga Pattern for Orchestrate Distributed Transactions using AWS Step Functions. Retrieved from Medium: https://medium.com/aws-lambda-serverless-developer-guide-with-hands/saga-pattern-for-orchestrate-distributed-transactions-using-aws-step-functions-2513db0de84e
- Pöschl, R. (n.d.). Retrieved from Research Gate: https://www.researchgate.net/figure/Difference-between-Synchronous-and-Asynchronous-I-O_fig4_282847216
- Protobuf Dev. (n.d.). Protocol Buffer Basics: Java. Retrieved from Protobuf Dev: https://protobuf.dev/getting-started/javatutorial/
- RabbitMQ. (n.d.). RabbitMQ Queues. Retrieved from RabbitMQ: https://www.rabbitmq.com/queues.html
- Reselman, B. (2021). Synchronous vs. asynchronous microservices communication patterns. Retrieved from The Server Side: https://www.theserverside.com/answer/Synchronous-vs-asynchronous-microservices-communication-patterns
- Saga distributed transactions pattern. (n.d.). Retrieved from Microsoft Learn: https://learn.microsoft.com/en-us/azure/architecture/reference-architectures/saga/saga
- Sczip, J. G. (2021). Apache Kafka: What Is and How It Works. Retrieved from Medium: https://medium.com/swlh/apache-kafka-what-is-and-how-it-works-e176ab31fcd5
- Serializing your data with Protobuf. (2019). Retrieved from blog.conan.io: https://blog.conan.io/2019/03/06/Serializing-your-data-with-Protobuf.html
- Siddharth. (2023). What are idempotent operations? Retrieved from Codementor: https://www.codementor.io/@sidverma32/the-power-of-idempotency-understanding-its-significance-22zkyc7ci1
- Simpson, J. (2022). How to Design Loosely Coupled Microservices. Retrieved from Nordic APIs: https://nordicapis.com/how-to-design-loosely-coupled-microservices/
- Soares, L. (2023). Distributed Monoliths: Recognizing, Addressing, and Preventing. Retrieved from LinkedIn: https://www.linkedin.com/pulse/distributed-monoliths-recognizing-addressing-luis-soares-m-sc-
- Steimle, F. (2021). The Good, the Bad and the Ugly: How to choose an Event Store? Retrieved from medium.com: https://medium.com/digitalfrontiers/the-good-the-bad-and-the-ugly-how-to-choose-an-event-store-f1f2a3b70b2d
- Synchronous and asynchronous requests. (n.d.). Retrieved from Mozilla for Developers: https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest/Synchronous_and_Asynchronous_Requests
- The DOT Framework. (n.d.). Retrieved from The DOT Framework: https://ictresearchmethods.nl/The_DOT_Framework
- Tripathi, A. (2021). RabbitMQ Exchanges Explained. Retrieved from Medium: https://medium.com/koko-networks/rabbitmq-exchanges-explained-3df008f4134
- Understanding synchronous and asynchronous requests. (2017). Retrieved from Progress: https://docs.progress.com/bundle/openedge-classic-appserver-development-117/page/Understanding-synchronous-and-asynchronous-requests.html
- What are advantages of Kafka over RabbitMQ? (2016). Retrieved from Stack Overflow: https://stackoverflow.com/questions/40553300/what-are-advantages-of-kafka-over-rabbitmq
- What are microservices? (n.d.). Retrieved from https://microservices.io/
- What's the difference between RabbitMQ and kafka? (n.d.). Retrieved from Stack Overflow: https://stackoverflow.com/questions/44452199/whats-the-difference-between-rabbitmq-and-kafka
- When to use RabbitMQ over Kafka? (n.d.). Retrieved from https://stackoverflow.com/questions/42151544/when-to-use-rabbitmq-over-kafka
- Why Apache kafka is more popular than RabbitMQ? (n.d.). Retrieved from Quora: https://www.quora.com/Why-Apache-kafka-is-more-popular-than-RabbitMQ
- Zaeimi, A. (2020). Synchronous vs Asynchronous communication in microservices integration. Retrieved from Medium: https://medium.com/@mmz.zaeimi/synchronous-vs-asynchronous-communication-in-microservices-integration-f4dd36478fd2