From the course: Microservices: Asynchronous Messaging
Welcome to stream data platform
- [Instructor] One of my favorite topics in asynchronous messaging in microservices architectures is the stream data platform. The need for real-time data about system utilization is critical to a well-oiled machine, and the stream data platform can address many of these cases. So what is this stream data platform all about? At its core, it handles streams of data. Usually, that data comes from structured log messages that indicate every action, every event on the system. The beauty of this model is that while it does increase the need for disk space across the system, both on the compute boxes and on the platform itself, its activity is all handled asynchronously. When you have hundreds of microservices at play in a significantly sized system, this ability to capture, aggregate, analyze, and act on the data is critical. This, of course, increases the complexity of the system, but as you'll see, it's all done for very specific reasons.

A common architecture for a stream data platform is actually quite straightforward. Once again, you start with a message broker. Often, you will choose a persistent message broker for this model, such as Apache Kafka. You have a set of producers and a set of consumers, and they all know about the broker. The producers send messages. The consumers react to those messages as they pertain to them. This is really a pub/sub model with multiple publishers on the same set of topics, as the short sketch below illustrates.

So in a stream data platform, who are the producers, you may ask. The first is applications, because they, of course, produce logs, or at least they should, and they are ultimately the producers of these messages. While usually the application itself doesn't transmit the message to the broker, it produces the logs, and an intermediary like Filebeat ships the log messages. Databases also produce logs and events. This data is critical to a good stream data platform, and as such, all logs and events from databases should be included. Of course, servers themselves produce logs, and those can often help tell a complete story of what's going on. Consider an application being compromised. The server logs often have critical information. And really, everything can be a producer in this model. If it creates events or logs, those messages could be pertinent to some process, and if they can be included, they should be.

So what systems are usually the consumers, then, you may ask. There are definite use cases, and we will look at the consumers that serve them. Log aggregators are a key one. Log aggregators can help paint a real picture of what's going on in the system. Correlation or tracing IDs help the log aggregators assemble full pictures of events in the system by linking the logs together. In addition, these tools usually help produce a common logging format across different systems through transformations. Analytics engines are a great use case for streaming data. Since the logs, taken together, paint a full picture of what's going on, data scientists can use this information to determine what's needed for long-term and short-term actions. Long-term storage vehicles are another model. Many use cases in so-called big data drive their data flow from a stream data platform. Once this set of data has been identified through analytics or other learning mechanisms, the data can be collected and shipped to a lake for historical analysis and other uses. One of the coolest uses that I have seen stems from analytics, and that is eventing engines.
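To make that pub/sub flow concrete, here is a minimal sketch in Python, assuming a Kafka broker reachable at localhost:9092 and the kafka-python client. The topic name "system-events", the log fields, and the consumer group name are illustrative assumptions, not part of the course material.

    # Minimal sketch of the stream data platform's pub/sub flow.
    # Assumes: a Kafka broker at localhost:9092 and the kafka-python package.
    # Topic and field names below are illustrative.
    import json
    from datetime import datetime, timezone

    from kafka import KafkaProducer, KafkaConsumer

    # Producer side: an application (or a log shipper acting on its behalf)
    # publishes each structured log event to the broker.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda event: json.dumps(event).encode("utf-8"),
    )

    event = {
        "service": "checkout",                  # which microservice emitted this
        "correlation_id": "7f3d2c",             # tracing ID that links related logs
        "action": "order.placed",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    producer.send("system-events", value=event)
    producer.flush()

    # Consumer side: a log aggregator (or analytics engine) subscribes to the
    # same topic and reacts only to the messages that pertain to it.
    consumer = KafkaConsumer(
        "system-events",
        bootstrap_servers="localhost:9092",
        group_id="log-aggregator",              # each consumer group reads the full stream
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
        auto_offset_reset="earliest",
    )

    for message in consumer:
        record = message.value
        print(f"[{record['correlation_id']}] {record['service']} -> {record['action']}")

Because each consumer group tracks its own offsets, the log aggregator, the analytics engine, and any other consumer can read the same stream independently, which is what lets one set of producers serve many downstream use cases.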
These engines key off of analytical signals and trigger downstream events, usually through orchestration. Consider a rogue user on your system. If analytics over historical data can establish standard usage patterns of the system, the outlying behavior that most rogue users exhibit can trigger an event to trace the user or even lock them out; a sketch of such an engine appears at the end of this section. This power goes beyond intrusion detection, but we'll talk more about that later.

So why go through all this trouble? If the consumer use cases aren't reason enough, let's look at the big picture. First, you need to accept that data is king. Every business decision needs data to be made effectively, and your system's actual utilization is a big part of that flow of data, for both user and internal purposes. Your business can and should drive off of that data. Internal data can help in procurement and resource allocation. External or user data can help drive decisions that make the company more profitable. In addition, not knowing something can be dangerous. If you make a change to your company's website, for instance, and you don't know the historical trends, future changes may cost you dollars and engagement. But if you do know, you can roll back or further enhance your offering based on a negative or positive trend, respectively. Ultimately, it's all about enabling good decision making.
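As a closing illustration, here is a minimal sketch of the eventing-engine idea, again assuming the kafka-python client, the broker, and the "system-events" topic from the earlier example. The "security-alerts" topic, the baseline threshold, and the per-user counting are illustrative stand-ins for what a real analytics-derived baseline would provide.

    # Minimal sketch of an eventing engine: consume usage events, compare them
    # against an assumed baseline, and publish a downstream event when a user
    # looks like an outlier. All names and numbers here are illustrative.
    import json
    from collections import Counter

    from kafka import KafkaConsumer, KafkaProducer

    REQUESTS_BASELINE = 120   # assumed "normal" ceiling derived from historical analytics

    consumer = KafkaConsumer(
        "system-events",
        bootstrap_servers="localhost:9092",
        group_id="eventing-engine",
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda event: json.dumps(event).encode("utf-8"),
    )

    requests_by_user = Counter()

    for message in consumer:
        event = message.value
        user = event.get("user_id", "anonymous")
        requests_by_user[user] += 1

        # An outlier against the baseline triggers a downstream event, which an
        # orchestrator could use to trace the user or lock the account.
        if requests_by_user[user] > REQUESTS_BASELINE:
            producer.send("security-alerts", value={
                "type": "suspicious-usage",
                "user_id": user,
                "observed_count": requests_by_user[user],
            })
            requests_by_user[user] = 0   # reset so we alert once per burst

A production engine would window these counts by time and take its baseline from the analytics engine's output; the point here is simply to show a consumer turning around and producing a new event for downstream orchestration.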