Distributed Tracing Is Powerful…But Also Effectively D.O.A.?

 

We saw this tweet and were quite intrigued. And also curious: is this true? Are the majority of developers avoiding it? Do people really not care about diagnosing problems across a whole system anymore?

The basic idea behind distributed tracing is not complicated: identify the key request inflection points along a (typically user-initiated) path through a system, and instrument them. The trace data is then coordinated and collated to provide beautifully precise system insights for debugging…if you know what you are looking for.
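To make that concrete, here is a minimal, hand-rolled sketch of the mechanism: each instrumented point opens a "span" carrying a shared trace ID and a pointer to its parent, and finished spans are collated by a collector. All names here are illustrative, not from any real tracing library.

```python
import time
import uuid
from contextlib import contextmanager
from contextvars import ContextVar

# Illustrative stand-ins: a real system would propagate context across
# process boundaries and ship spans to a tracing backend.
_current_span = ContextVar("current_span", default=None)
collected_spans = []  # stand-in for the collector that collates trace data

@contextmanager
def span(name):
    """Instrument one inflection point: record timing and parent/child links."""
    parent = _current_span.get()
    record = {
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex,
        "parent_id": parent["span_id"] if parent else None,
        "name": name,
        "start": time.time(),
    }
    token = _current_span.set(record)
    try:
        yield record
    finally:
        record["duration"] = time.time() - record["start"]
        _current_span.reset(token)
        collected_spans.append(record)

# One user-initiated path passing through two hypothetical "services":
with span("checkout"):
    with span("payment-service"):
        pass
```

After this runs, both spans share one trace ID and the inner span points at the outer one as its parent, which is exactly the structure a tracing backend stitches back into a per-request timeline.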

Distributed tracing has become — or at least has the capacity to become — an essential component for observing distributed systems and microservices applications. So what could be holding everyone back from embracing it?

I think the likely explanation has more to do with distributed systems themselves, than with distributed tracing — for the simple reason that distributed systems are just really, really hard.

Distributed tracing has been around a good long while. Back in the ’90s, when we had far fewer tools on our workbench, distributed tracing was very important: particularly useful for diagnosing bugs and working out why your system wasn’t performant.

But you were mostly using it to scale better on the limited machines you had, and even so, you still had to think really, really hard to solve the problems the tracing system was describing.

Now fast-forward a couple of decades. If using distributed tracing back then was an intellectual challenge, the complexity of modern distributed systems elevates that difficulty to rocket science level.

Not only are distributed systems convoluted — they contain coincidences that are very difficult to get your head around.

Frankly, the entire point of distributed systems in the first place is for people not to have to think about these things or to be concerned with the whole system. In distributed systems, everyone just worries about their one piece of it.

The problem is, the more instances of anything you have, the higher the likelihood that something will break. Probably more than one something, actually, and likely all at the same time. Services will fall over, the communication network will lose packets, disks will fail, VMs will inexplicably terminate.

Trying to figure out what went wrong where, and why, in a distributed system is a seriously labyrinthine undertaking. Distributed tracing can provide a thread to follow through that labyrinth, but even with tracing distributed systems are still very difficult to debug.

Distributed tracing is a must for diagnosing distributed systems problems, but it is not a solution for those problems.

In order to work, that is, to actually track down a diagnosis, distributed tracing has to be hypothesis-led. Think of it like the television series House, where each week the medical team encountered a new and mystifying malady.

There is a particular episode where one of his disciples suggests ordering a full-body scan, and the ever-irascible Dr. House snaps back that you should never order a full-body scan, because it will turn up all sorts of new problems, problems that then obscure the one you are actually searching for.

Debugging with distributed tracing in distributed systems means you have to do the House thing and come up with at least a guess before you go looking, lest you drown in an ocean of information.

In fact, distributed tracing by its very nature must be hypothesis-led: ‘This is my theory; what do I need to see to disprove it?’ And you look until you find that thing, which tells you your theory was incorrect. Unless you don’t find it, in which case congratulations, you have correctly identified your system’s malady and can now apply the correct remedies.

Pursued this way, distributed tracing can be terrifically effective. And so people who actually care about diagnosing problems across whole systems genuinely love it. These are also the people willing to put in the phenomenally complicated work it requires.

Everyone else? Well, the unpretty truth is that these days most of us just throw more boxes on the system until the problem goes away.

Over-provisioning is the easy hack (unless of course you are a hyperscale company like Google or Netflix or Facebook, but we are talking about the average enterprise org here).

That hack is not free, but the financial cost is manageable and the time savings considerable. We just throw hosting at everything because we are not in an era of efficiency; we are in an era of expansion.

Case in point: Amazon’s EC2 T2 instance family, which is essentially about selling you the same server ten times over. T2s assume you are only going to use a low baseline amount of CPU and occasionally burst above it, and they are very low cost.
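The arithmetic behind the burst model is worth seeing. As commonly documented for the smallest T2 sizes, one CPU credit is one vCPU-minute at 100%, and a t2.micro earns roughly 6 credits per hour, which works out to a 10% sustained baseline; treat the figures below as illustrative rather than authoritative.

```python
# Illustrative CPU-credit arithmetic for a burstable instance.
# Figures are the commonly documented t2.micro values, used as assumptions.
CREDITS_EARNED_PER_HOUR = 6  # 1 credit = 1 vCPU-minute at 100% CPU

def baseline_utilisation(earned_per_hour=CREDITS_EARNED_PER_HOUR):
    # There are 60 vCPU-minutes in an hour, so earning 6 credits/hour
    # sustains 10% CPU indefinitely.
    return earned_per_hour / 60

def credit_balance(hours, avg_utilisation, start=0.0):
    """Net credit balance after running at avg_utilisation for `hours`."""
    spent = hours * avg_utilisation * 60  # vCPU-minutes consumed
    return start + hours * CREDITS_EARNED_PER_HOUR - spent

print(baseline_utilisation())            # sustainable fraction of one vCPU
print(credit_balance(1, 0.05))           # a quiet hour banks credits
print(credit_balance(1, 1.0, start=60))  # a full-tilt hour burns them down
```

Which is the whole business model: if your average utilisation sits under the baseline, Amazon can pack many such instances onto one physical server, and over-provisioning customers are exactly the ones who stay under it.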

Everyone over-provisions because it is cheap, and it gives the reassurance that the capacity is there if and when they need it.

Amazon knows about this and takes the chance to make some money. (Which I don’t begrudge — it’s not wrong that they profit from the fact that we all use our data centers so inefficiently and can’t be bothered to tidy them up, but that is a topic for another day).

But the hyperscale people can’t really do that, because their hosting costs are so very high and they can’t endlessly throw boxes at a problem. They do, however, also have a lot of engineers, and these tend to comprise the small percentage of folks who currently do love and embrace distributed tracing.

So, circling all the way back to Cindy Sridharan’s original point: is distributed tracing ever going to take off?

For the time being, perhaps not. As Charity Majors, founder of Honeycomb.io tweeted about the future of distributed tracing, “The tech needs the industry to grow up and into it, and it’s missing some key components.”

Distributed tracing is currently too complicated to be used directly, so it needs to be hidden from the user behind a fully managed platform. Identifying the relevant part and hiding the rest is the key.

These factors may eventually evolve, but in all honesty, we will likely move to managed platforms before the average engineer ever gets around to actually learning to diagnose distributed systems.

 


Read more about what we think about observability in our white paper.


2 Comments

  1. Very insightful article, Pini. Thanks. We’ve been in that situation where our distributed system did not have a tracing component and we had to rely on logs to debug issues. Not pretty. Introducing tracing definitely helped, but instrumenting code for tracing came with its own challenges. In a multi-team polyglot env, things around code instrumentation get even worse. Hope that these kinds of concerns can be addressed with a service mesh. Anyways, very good blog post.

    • Thanks Murphy,

      A service mesh may help to hide some of the complexity, but it is still only one additional piece of the puzzle.
      When the ecosystem becomes more mature, we should be able to buy an off-the-shelf solution that would include it all out of the box, without the configuration hassle.
      Until then, we will keep playing with a variety of tools to find the best solution for each particular case.
