Recently we were brought in by a customer to investigate a problem with their microservices architecture: specific transactions weren’t yielding the expected results. Since some of their microservices rely heavily on one another, it wasn’t immediately clear which one was misbehaving.

After a few days of combing through their code, we finally found the culprit. These are some of the lessons we learned while debugging.

With a microservices infrastructure where at least a dozen Docker containers are required to debug a specific part of a transaction, you might be inclined to delay setting up a local development environment as long as possible. Sure, it’s a hassle, but being able to change the code directly and SSH into boxes to debug and read logs makes it much easier to reproduce a transaction in a way that lets you find the root cause of the problem. In the end, it’s much faster than only poking around in production logs and reading through code.

Don’t try to understand what the full lifecycle of a transaction is through the myriad of API calls that connect microservices to each other. Instead, go for a one-by-one approach. Understand what a certain service does with your input data, look at the logs for that one service when you replay the transaction, verify the output contains what you would expect it to contain, and move on to the next service.
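
To make the one-by-one approach concrete, here is a minimal sketch that replays a payload through a chain of stages and reports the first one whose output does not look right. The two "services" are hypothetical in-process stand-ins (`pricing_service`, `invoice_service`); in a real setup each would be a call to the actual service in your local environment, and the checks would encode whatever you expect from that service.

```python
from typing import Callable, Optional

# Hypothetical stand-ins for two services. In practice each would be an
# HTTP call (or similar) to the real service running locally.
def pricing_service(order: dict) -> dict:
    return {**order, "total": order["quantity"] * order["unit_price"]}

def invoice_service(order: dict) -> dict:
    # Deliberate bug for the demo: the total is lost on the way out.
    return {"order_id": order["order_id"], "total": None}

def first_failing_stage(payload: dict, stages: list) -> Optional[str]:
    """Feed the payload through each stage in order and return the name of
    the first stage whose output fails its check, or None if all pass."""
    for name, call, looks_right in stages:
        payload = call(payload)
        if not looks_right(payload):
            return name
    return None

# Each entry: (stage name, callable to run, check on that stage's output).
stages = [
    ("pricing", pricing_service, lambda out: out.get("total") == 30),
    ("invoice", invoice_service, lambda out: out.get("total") == 30),
]

culprit = first_failing_stage(
    {"order_id": "A1", "quantity": 3, "unit_price": 10}, stages
)
print(culprit)  # prints "invoice"
```

The point of the structure is that you only ever reason about one service's input and output at a time, and you stop as soon as something is off.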

In cases where microservices don’t talk directly to each other but use message buses or queues, make sure to shut down the consumer service so that you have time to inspect the messages being put into the queue before they are processed. That goes more or less back to the first point, since you probably shouldn’t be shutting down services in the production environment 🙃.
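
As a rough illustration of inspecting pending messages while the consumer is stopped, the sketch below uses Python's in-process `queue.Queue` as a stand-in for a real broker. That is an assumption made purely for the example: a real message bus has its own client library and its own way of peeking at or re-queueing messages. The helper `peek_all` is hypothetical and is only safe when nothing else is reading from or writing to the queue, which is exactly the situation you create by shutting the consumer down locally.

```python
import json
import queue

def peek_all(q: queue.Queue) -> list:
    """Take every pending message off the queue, then put them all back
    in the same order. Only safe while no consumer or producer is running."""
    pending = []
    while True:
        try:
            pending.append(q.get_nowait())
        except queue.Empty:
            break
    for message in pending:
        q.put(message)
    return pending

# Stand-in for messages a producer service left on the queue.
orders = queue.Queue()
orders.put(json.dumps({"order_id": "A1", "total": 30}))
orders.put(json.dumps({"order_id": "A2", "total": None}))

snapshot = [json.loads(m) for m in peek_all(orders)]
suspicious = [m["order_id"] for m in snapshot if m["total"] is None]
print(suspicious)      # prints ['A2']
print(orders.qsize())  # prints 2, nothing was consumed
```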

Having production services send their logs to multiple places is not only really annoying; it can also make it very hard to get a full picture of what happened to a certain transaction at the different stages of its processing.
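
One way to get that full picture, sketched below with Python's standard `logging` module, is to send every service's logs to a single destination and tag each line with a transaction identifier. The identifier field (`transaction_id`), the service names, and the in-memory `shared` stream standing in for a central log store are all assumptions for the sake of the example.

```python
import io
import logging

# Stand-in for a central log store that all services write to.
shared = io.StringIO()
handler = logging.StreamHandler(shared)
handler.setFormatter(
    logging.Formatter("%(name)s txn=%(transaction_id)s %(levelname)s %(message)s")
)

def logger_for(service: str, transaction_id: str) -> logging.LoggerAdapter:
    """Return a logger for one service that stamps every record with the
    given transaction ID and writes to the shared destination."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if handler not in logger.handlers:
        logger.addHandler(handler)
    return logging.LoggerAdapter(logger, {"transaction_id": transaction_id})

logger_for("pricing", "txn-42").info("total computed")
logger_for("invoice", "txn-42").warning("total missing")
logger_for("pricing", "txn-43").info("total computed")

# With one destination and one ID, a transaction's trail is a simple filter.
trail = [line for line in shared.getvalue().splitlines() if "txn=txn-42 " in line]
print("\n".join(trail))
```

Filtering on the ID returns the lines from both services for that one transaction, in the order they were written.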

Also, save logs of all severities. Nothing is more painful than realising you’re missing an important part of the puzzle because you didn’t save verbose/debug logs. Or, better yet: build a mechanism to temporarily enable debug-level logs for all production services when necessary.
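
A minimal sketch of such a mechanism, for a single Python process using the standard `logging` module, is a context manager that raises verbosity and then restores the previous level. How the switch is triggered across all production services (a config flag, an admin endpoint, a signal) is left open here; the logger name `checkout` and the `ListHandler` used to capture output are made up for the demo.

```python
import contextlib
import logging

@contextlib.contextmanager
def debug_logging(logger: logging.Logger):
    """Temporarily lower the logger's threshold to DEBUG, then restore it."""
    previous = logger.level
    logger.setLevel(logging.DEBUG)
    try:
        yield logger
    finally:
        logger.setLevel(previous)

class ListHandler(logging.Handler):
    """Demo-only handler that keeps formatted messages in a list."""
    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())

log = logging.getLogger("checkout")
log.setLevel(logging.WARNING)  # normal production verbosity
log.propagate = False
capture = ListHandler()
log.addHandler(capture)

log.debug("dropped: below the WARNING threshold")
with debug_logging(log):
    log.debug("cart contents: 3 items")
log.debug("dropped again: level has been restored")

print(capture.messages)  # prints ['cart contents: 3 items']
```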

It is really easy to be led astray by confusing or vague log messages. Make sure you understand what code gets executed when you send a certain request to a service, and only then run it and explore the logs. That way you know exactly what to look out for, as well as where potential pain points in the code might be located.

When you have finally found the cause of your issue, don’t just patch it and go on with your life. Write a few unit or integration tests, so that if someone else breaks the same thing again, they won’t have to waste as many hours as you just did finding out what went wrong.
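
As an illustration, a regression test written with Python's built-in `unittest` could look like the sketch below. The function `build_invoice` and the bug it guards against (a total getting lost between two services) are hypothetical, invented only to show the shape of such a test.

```python
import unittest

def build_invoice(order: dict) -> dict:
    """Hypothetical fixed code path: the total must survive the hand-off."""
    return {"order_id": order["order_id"], "total": order["total"]}

class InvoiceRegressionTest(unittest.TestCase):
    def test_total_is_carried_over(self):
        # Pins down the exact behaviour that was broken, so a future
        # change that drops the total fails here instead of in production.
        invoice = build_invoice({"order_id": "A1", "total": 30})
        self.assertEqual(invoice["total"], 30)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(InvoiceRegressionTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Naming the test after the failure you just debugged also leaves a breadcrumb for whoever hits it next.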

Entrepreneur tech kid, co-founder of NearSt, Londoner, open source enthusiast and aspiring spare time literature geek.
