Latency, Jitter, Cost, and Queueing Theory

Michael Stone, July 27, 2014

If Engineering a Safer World is the most valuable book that I’ve read in the last five years, then The Principles of Product Development Flow by Don Reinertsen is a strong contender for the position of “next-most valuable”, simply for having introduced me to the now-fundamental concept of “cost of delay”.

That said, in learning to apply Reinertsen’s work, I’ve found myself repeatedly turning to three other resources, to which I’d also like to give credit:

  1. Douglas Hubbard’s book, How to Measure Anything, for its incredible approachability and clarity of exposition,

  2. Mor Harchol-Balter’s book, Performance Modeling and Design of Computer Systems, for so cogently elaborating the mathematical theory underlying Reinertsen’s book, and

  3. Brendan Gregg’s description of his Utilization-Saturation-Errors method for debugging systems performance issues.

Now, why do I think these resources go together so well? Here’s an illustration, drawn from my personal (and therefore likely flawed?) understanding of the Toyota Production System’s commitment to the principle of jidoka, which, among other things, authorizes and requires line workers in lean manufacturing systems to “stop the line” when they discover a defect or a previously unknown defect-generating process.

In tech terms, jidoka demands that the production line and its components implement “fail-fast” and “fail-stop” semantics. Okay, so why do these matter?
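To make those terms concrete, here is a toy sketch of my own (not drawn from the TPS literature) contrasting a stage that tolerates defects with one that fails fast and stops; the function names and the use of `None` as a stand-in for a defective part are illustrative assumptions:

```python
# Illustrative contrast (my example): a pipeline stage with fail-fast /
# fail-stop semantics halts at the first defective item, instead of
# quietly passing bad work along to downstream stations.

def tolerant_stage(items):
    # Keeps going on defects -- rework quietly accumulates downstream.
    return [x for x in items if x is not None]

def fail_fast_stage(items):
    # "Stops the line": surfaces the defect immediately and produces nothing.
    out = []
    for i, x in enumerate(items):
        if x is None:
            raise ValueError(f"defect at position {i}: stopping the line")
        out.append(x)
    return out

batch = [1, 2, None, 4]
print(tolerant_stage(batch))   # [1, 2, 4] -- the defect is silently dropped
try:
    fail_fast_stage(batch)
except ValueError as e:
    print(e)                   # defect at position 2: stopping the line
```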

One simplistic answer is that “choosing to stop the line is an economic choice, and if fail-fast and fail-stop make economic sense, then we should use them”. However, Gregg’s USE method highlights a more subtle point. Prompt detection of errors apparently does make economic sense in manufacturing for its first-order effect of bounding the cost of required rework, but there are also important second-order effects. For example (further applying Gregg), it is striking to me how controlling production-station utilization and the volume of work-in-progress (what Gregg calls “saturation”) not only helps to kill defect-generating processes (and therefore bound rework costs) but also makes it possible to damp the shockwaves of congestion produced by discovered defects, keeping them from propagating far enough to cause greater economic (or human) damage!
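To see that second-order effect in numbers, here is a rough back-of-the-envelope sketch using the textbook M/M/1 queue (one of the models Harchol-Balter develops); the service rate and the utilization levels are made-up values, chosen only to show how latency and work-in-progress blow up as utilization creeps toward 100%:

```python
# Rough illustration (not from the post): in an M/M/1 queue, the mean time a
# job spends in the system is W = 1 / (mu - lambda), where mu is the service
# rate and lambda is the arrival rate.  Utilization is rho = lambda / mu.
mu = 1.0  # one unit of work served per unit time

for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    lam = rho * mu
    mean_time_in_system = 1.0 / (mu - lam)           # W = 1 / (mu - lambda)
    mean_jobs_in_system = lam * mean_time_in_system  # Little's Law: L = lambda * W
    print(f"utilization {rho:.0%}: latency {mean_time_in_system:6.1f}, "
          f"WIP {mean_jobs_in_system:6.1f}")
```

In this toy model, a station run at 99% utilization has fifty times the latency (and far more work-in-progress) than one run at 50%, which is exactly the kind of congestion that jidoka-style stopping and WIP limits are meant to keep from spreading.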

Thus, to summarize: Gregg and Hubbard give us simple tools with which to analyze the throughput and latency of any queueing system, including product-development, release, and manufacturing systems. Next, Harchol-Balter gives us a lucid explanation of the math required to understand the system interactions that explain why Gregg’s approach works, with a helpful focus on the use of that math to analyze systems performance. Finally, Reinertsen gives us powerful language for explaining to everyone around us why they should care.
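As a parting illustration of that language, here is a minimal sketch of the kind of cost-of-delay arithmetic Reinertsen advocates (scheduling by cost of delay divided by duration, in the spirit of his “weighted shortest job first” rule); the features, dollar figures, and durations below are made-up assumptions, not numbers from his book:

```python
# Illustrative only: hypothetical features with guessed cost-of-delay rates.
# Cost of delay = value lost per week that the feature remains unshipped.
features = [
    # (name, cost_of_delay_per_week, estimated_duration_weeks)
    ("checkout-redesign", 50_000, 4),
    ("invoice-export",    10_000, 1),
    ("search-tuning",     30_000, 2),
]

# Schedule by cost of delay divided by duration, so cheap-to-finish,
# expensive-to-delay work ships first.
ordered = sorted(features, key=lambda f: f[1] / f[2], reverse=True)

elapsed = 0
total_delay_cost = 0
for name, cod, weeks in ordered:
    elapsed += weeks
    # Each item pays its weekly cost of delay for every week until it ships.
    total_delay_cost += cod * elapsed
    print(f"{name}: ships at week {elapsed}, "
          f"cumulative delay cost ${total_delay_cost:,}")
```

Reordering the same three items by gut feel instead of by this ratio changes the total delay cost by hundreds of thousands of (hypothetical) dollars, which is the sort of argument that tends to get everyone’s attention.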