Infobip messaging platform handles about 250M transactions per day, that is 250M SMS messages received from the client, billed, routed, adapted if necessary, submitted to a telecom operator, delivery report processed and submitted back to the client. That is at least 6 pairs of requests and responses over the network using various APIs on (sometimes) 4 completely different protocols just to process a single message. And that is if everything goes well and retry attempts are not necessary. And this is only the extrenal communication - we're not considering inter-service communication. Moreover, some messages are related, and looking at individual messages doesn't provide a complete picture.
The challenge of keeping track of all those network operations is huge. Support department needs to resolve both client's and operator's inquiries promptly. They need a tool to dig up all request/response pairs which were issued while processing any message, because seeing what exactly was submitted by the client and how exactly it was forwarded to the operator, is the starting point for any debugging or problem solving job.
Now, "distributed" sounds nice. It's a buzzword. Somehow we instinctively want everything to be distributed. But let's see what a distributed logging system actually looks like.
What we've got here is a separate service for each messaging protocol on the client and operator side. Each service handles its own logs, typically storing them in memory, writing them to the disk in a circular buffer fashion. Since the services are developed by different teams, it takes some effort to unify the appearance of web interfaces for managing logs. To make the supporter's life a bit easier, services expose logs through their APIs and a centralized administration interface is introduced where one can look at all logs from all services. Making all services push logs to a logging storage system like Graylog is a nice step towards unification - and centralization. Graylog is nice. Scales well, does the job. It sits on top of an elasticsearch cluster. Stores a log message with indexed key-value metadata. The web interface is a bit clumsy at first, but you get used to it. It also has some handy analytic stuff.
The problem is these logs don't relate. The 6 pairs of request/response are totally unrelated to each other. It takes cumbersome analyzing, manual extraction of correlation identifiers from internal processing platform's database just to connect several communication logs of a single message. Not to mention finding communication logs for related messages. This is because client side is decoupled from the operator side. They don't share a single correlation identifier.
Moreover, by having nicely defined boundaries between service scopes, operator-side services know nothing about the client-side and vice-versa. That is, operator side services don't know from which client the message they are processing came from. It's called separation of concern, and it's generally a good thing. It just doesn't work well for logging because operator-side services cannot enrich logs with some client metadata which would help to relate logs. Also, these services know nothing about message relations - because they don't need to. Only the internal processing platform does, but it has no external communication, hence it doesn't produce any logs. It means our logs can't have any relation metadata associated with them.
One way to solve this would be to make services produce logs and piggyback them on internal communication towards the internal processing platform. The platform can then enrich logs with all metadata and push them on a log storage. This would work but it's plain wrong. It takes precious processing resources and "distracts" the platform from its business jobs.
Well, it's really nothing revolutionary. We figured each message needs a unique ID - UUID (Universally unique identifier) associated at the very beginning, when it enters the system, and we need to pass this ID with the message to the next service in the processing chain. Then, each service producing logs can attach this ID to log metadata. Now that's easy! We now can relate a bunch of log records to a single message. If only...
Pairing delivery reports
It doesn't take long to realize that you have a problem. All it takes is to look at the logs a notice that delivery reports are missing. Again, the problem is in the separation of concerns, and again, we don't want to break it because we still agree it's a good thing. Services that handle a specific protocol can submit a message they received from the platform and handle a delivery report. Pairing a delivery report to a message is a more complex job than one would guess. It is clearly the platform's concern. Still, we want the service which received the delivery report to produce a log record for it - that's its concern. And yet, at the moment of delivery reception it is still unknown to which message it will be paired (if it is going to be paired at all!) hence the service doesn't know which UUID to attach to the log.
Connecting related messages
As said before, some messages are related. Typically, these are long SMS messages which are transferred individually, but processed and eventually displayed on the cellphone as a single message. Sometimes there is splitting and merging involved, so we have one-to-many, many-to-one, and many-to-many relations. It would be nice to also have logs for these messages related because sometimes we need to look at all of these messages as one. The problem is that only the internal processing platform is aware of the relation between messages, and that is where splitting and merging happens. Services handling external communication know nothing about relations. They handle messages individually and each message has its own UUID.
Another source of relations is bulk processing. The client can simply submit several messages in one HTTP request. The service handling HTTP will parse the request and forward several messages into the platform, but it will log only one request and one response. It will generate one UUID to log the HTTP request and a new UUID for each message generated by this request.
Graphs to the rescue
It's clear that we need to store relations between UUIDs somewhere. We need to be able to fetch all related logs when querying for logs of a single message. Let's try to visualize a more complex, but not uncommon relation. We have one submit from the client (UUID-100) containing two long messages (UUID-200 and UUID-210) which were later split into more messages and submitted to the operator (UUID-201, UUID-202, UUID-211, UUID-212). Then one bulk delivery came, containing delivery reports for all submitted message parts (UUID-300).
Unsurprisingly, it's a graph. Sure, one could invest some effort and store it into a relational database, or transform it into a json for a document database, but we have graph databases nowadays. Let's use the right tool for the job! Wherever a new relations happens and splitting or merging occurs, we push a little graph to a graph database. We choose Neo4j because it has great marketing - they are friendly on the conferences, give away free books and shirts. The software is not bad either. You have to pay for clustering and HA, but for logs even the free version will do. Single Neo4j instance on a 16 CPU machine takes care of 100M nodes with 150M relations which take just 15GB of storage. The problem with Neo4j here is that deleting nodes doesn't release memory or disk usage, so you have to purge the data once in a while.
Now, when we need to fetch logs for a message, we take this message's UUID, fetch all related UUIDs from Neo4j and then fetch logs from Graylog by these UUIDs. It's important to fetch only relevant UUIDs, not the entire graph. For example, if we want communication logs for message UUID-200, we only need to fetch log records UUID-100, UUID-200, UUID-201, UUID-202 and UUID-300. The UUID-21x branch is irrelevant. The trick is to use different relation types when connecting bulks and when connecting message parts. Then a simple Neo4j cypher query is used to fetch related UUIDs.
The result is that each service pushes its own logs of the communication it handles. Also, whichever part of the system created a new relation is responsible for publishing this information. Concerns are well separated, logs and relations are stored on dedicated systems where they can eventually be fetched. ( By Milan Mimica, Software Engineer / Team Leader )