Performance Engineering

Michael Stone, October 14, 2018, , (src)

Contents

Introduction

MIT’s 6.172 Performance Engineering course is a first course on (software) performance engineering.

For many students, it is their first encounter with software development considerations that go beyond “make it work, by yourself, however you can” and, as such, it is also a wonderful microcosm of software development.

Over the course of mentoring roughly 10 students participating in this class over the past 6 or so years, I have observed a couple of recurring challenges, inspiring stories, and lessons learned that I’d like share more widely because I think that the topics that they touch on have relevance far beyond MIT’s class.

Student preparation

Almost all of the students I have worked with appear to be at the beginning of their studies of software engineering, programming, and software development, by which I mean something like:

software engineering
“what makes software demonstrably fit for use?”
programming
“how does one write software skillfully?”
software development
(1) “how do organizations make, maintain, and retire software at scale?”
(2) “‘programming integrated over time’”

I make this judgment in part from my observations of their struggles.

Common issues

Many of the students I’ve worked with appear to be struggling as much or more with general unfamiliarity as with performance; for example, with:

  1. basic issues of correctness; specifically, which invariants should be true about their code and how to identify additional invariants

  2. fast-enough strategies to achieve correctness

  3. in one case, language familiarity

  4. how to evaluate the adequacy of their testing efforts

  5. a wide range of relevant testing strategies and their tradeoffs

  6. debugger operation and technology for assisting with systematic debugging

  7. how to comprehend and document unfamiliar or obscure code

  8. how to safely rewrite unfamiliar or obscure code to be more readable or measurable

  9. Polya’s advice to “Make a plan. Follow the plan.”

As a consequence, I think all of my students were in at least one case effectively prevented from trying out various potential optimizations (really, depessimizations) they were interested in.

Code usability issues

One of the first issues that MITPOSSE members are asked to give students feedback on are issues related to what I’ll call “code usability”, by which I mean:

  1. readability
  2. comprehensibility
  3. malleability, discussed more below
  4. modularity and documentation of intent, to support working together in groups on improving the code

Now, what challenges tend to slow students down here?

Some are fairly tactical, like:

But some are broader: the importance of code readability, build system ergonomics, etc. don’t necessarily stand out for everyone the same way, especially if you haven’t (yet) had the experience of trying to read, debug, edit, build, test, and ship someone else’s code at 3 AM in the morning your local time in order to solve some urgent performance, availability, or secrecy problem that is hurting you and your customers now, irreversibly, and in production?

Hence, the general formula: care, believe, act: share stories to help people care, give examples to help them believe that better truly is possible here and now, then summon the spirit of warmth and intellectual curiosity to help motivate them to act (e.g., to try out ideas that seem a little bit unlikely, at least at first blush.

Debugging tips

Moving beyond some of the static code ergonomics issues mentioned above, debugging skills and strategies loom large in my mind as one of the next earliest topics that needs to be addressed both to equip students to efficiently reduce their own confusion (and, thereby, to get the most out of the course) as well as to help them learn more about how to shape their code and supporting materials for greater maintainability and ease-of-use.

A smattering of tactical debugging-related topics that I often introduce students to and then later help them to experience and to learn to use include:

Code malleability

After having addressed code style and any urgent tactical debugging hangups, it’s fairly typical in my experience for students to get into a rhythm of experimentation based on an optimization strategy that looks like this:

  1. use perf to find a hotspot
  2. squash the hotspot
  3. loop

In my experience, this strategy has a number of advantages (including simplicity!) but it also has some important gaps both on the performance-improvement side (see Brendan Gregg’s, Cary Millsap’s, and Raj Jain’s discussions in their respective performance engineering books) and on the “working efficiently with code” side that I’d like to mention.

On the code side, one difficulty that I’d like to highlight concerns the squash the hotspot step. Basically, given both their varying levels of preparation and available time and attention, plus some of the code-bases that they’re starting from, quite a few students seem to have a lot of difficulty executing on their ideas for how to modify the code they’re trying to speed up simply because they’re unable to do so in a way that keeps the code running.

Some common code manipulation problems that I see students run into at this stage include:

Here are some anecdotes that illustrate how these issues tend to come out and some past advice that I’ve given about how to deal with them:

In an effort to simplify their code, A’s team was using a globally locked free-list. Unfortunately, her code was deadlocking in malloc() because her realloc() implementation (which falls through to malloc() in some cases) had already acquired the global lock. I showed her a couple of ways to fix the problem (recursive locks, refactoring the lock-acquisition code) and she was able to move on. (W also made a great suggestion about using a watchpoint on the mutex to halt execution at relevant times.)

W’s code was segfaulting due to corruption of his allocation metadata headers probably resulting from use of uninitialized memory. We talked through a couple of debugging strategies like assert() and printf() debugging, use of valgrind --track-origins=yes, test-case minimization, and educated guessing based on the location of the failure in the failing trace.

Some less fun notes:

Despite the fact that they’re writing a memory allocator, neither group had run their code through valgrind. Doing so immediately revealed memory leaks in A’s groups code and use of uninitialized memory in W’s code.

W’s group had turned off compiler optimization in their release build but hadn’t noticed!

Neither group has really figured out (and I haven’t yet managed to communicate) the debugging value of “atomic” commits from “known good state” to “testably known good state”. :-(

OK, so, zooming out, what are some tools and approaches that help me with these issues and where did I learn about them from?

  1. “systematic debugging” techniques from Why Programs Fail
  2. the idea of “code seams” and input/output control from Michael Feathers’ Working Effectively with Legacy Code
  3. “reverse debugging” tools like Mozilla rr and other record-replay tools for segfault provenance tracking
  4. valgrind for memory debugging: leak tracking, uninitialized memory usage, etc.
  5. practice with printf-debugging
  6. klee for program equivalence checking, learned from Brian Sniffen
  7. cscope for code search, especially cscope -kRp3 + editor integration, learned from Herbert Pötzl
  8. regexes, sed, vim global substitutions, sublime find & replace, and coccinelle for code transformation
  9. ggplot2, pgf-tikZ, vega-lite, HTML, SVG, and NetPBM output for visualization, learned in part from Noah Ilinsky and Bruce Maxwell
  10. Bret Victor’s “Learnable Programming” essay for inspiration on how to visualize program traces.

Thinking about performance

Despite my sense that students are most often held back by problems related to code reading, debugging, refactoring, collaboration, or experimental design & setup, I am nevertheless still often asked by students for help thinking more clearly about performance.

Here’s a typical anecdote:

One time, I asked my students after our project #1 meetings how I could be more helpful to them. The major request I received was to come prepared with more suggestions for speedups. As a result, I decided to try spending more time on “thinking about performance” and on practical tips for measurement than on assurance & quality issues in this set of meetings. After they arrived, we started talking about their code.

Unsurprisingly for the assignment at hand, both W’s and A’s groups had implemented pointer-based quad-trees. Both were experimenting with parallelism (W having had more success) and both were within a factor of 2-3 of the target time of 0.9s.

To make progress, I asked them for the output of perf report. Since both had reduced the number intersection tests via their quad-trees and since neither had done any work on their intersect() routine, they quickly realized that intersect() might be a good thing to speed up. To that end, I had them take a close look at the animation of the screensaver where I asked them to look for special structure that they might be able to exploit – i.e., to the fact that the lines translate but don’t rotate and that they hold only a small number of poses.

But where to go from here?

One useful answer that students are prompted to consider early in the course is low-level consideration of what hardware is available, what hardware is being used, and how is it being used, especially with respect to caching, instruction- and word-level parallelism, and processor front-end/back-end utilization.

Another useful answer that the students are prompted to consider slightly later concerns Bentley’s rules.

Below, I will discuss several challenges that students seem to regularly run into along both lines of consideration.

Caches, inlining, and bithacks

Caches are are surprising beasts that, in my experience, often generate student confusions. Here is one anecdote to illustrate one flavor of confusion in this area:

A2 thought of table lookups as taking a fixed (and small) number of cycles, rather than taking time dependent on the state of the memory hierarchy. (This led to a nice conversation about cache hits, cache misses, lookup-table sizes, and related interactions with set-associative caches.) He left interested in exploring the byte reversal bithacks that he had previously skipped over in favor of his LUT.

However, caches aren’t the only thing confusing A2. As the conversation progressed, we discovered another opportunity to learn to think more clearly, both via additional mental models and by getting the computer to behave more usefully:

A2 asked about whether or not he should change algorithms. This led to a discussion of asymptotic and polynomial cost models and, from there, to the discovery that he had discovered perf stat but not perf annotate.

This conversation then led into an even deeper discussion of how computers can be made to assist with many other aspects of the optimization process with another student:

About here, A2 had to depart for another obligation. However, fortunately, A3 was able to stay for a few minutes longer. His major question was on how to generate new ideas for speeding up his code. To that end, we discussed:

  1. mental models for thinking about whether inlining is helping you
  2. how to measure whether inlining is helping you
  3. hardware counters (e.g., for L1-icache-loads) and perf list as a way to discover available counters
  4. how to find new bithacks: e.g., as a circuit design problem, as a search problem over a space of expression trees, via induction principles, …
  5. coaxing gcc into vectorizing loops

Finally, I recapped my early comments about communication with reviewers. :-)

Bentley’s rules

As you may have noticed, one of my default modes of operation is to try to help students not by directly solving their problems for them but, rather, by helping them to experience, conceptualize, and mentally organize both their confusions and, later, their developing understanding of a wide range of tools, models, and approaches that are relevant to their problems.

In this vein, Bentley’s rules provide an interesting body of performance improvement thought and advice.

While reflecting on them, I have often been led to questions like these while discussing students’ code with them:

Based on these questions, the following (seemingly) crazy hypothetical scheme once occurred to me:

  1. The relevant graphics subsystem turns out to use 16 bits of precision per axis. Therefore, we can certainly store less than 64 bits of precision.

  2. If we store the coordinates in fixed-point rather than floating point then we can use our integer units rather than our floating point units.

  3. Even better, if we use a fixed-point representation and also code the edges of the viewport as (0, USHRT_MAX)) then we wind up having organized the embedding space: the most significant bit of the coordinate tells us which half of the indicated axis of the embedding space (viewport) we’re in, the next most significant bit further subdivides that section of the axis, and so on.

  4. As a result, if we interleave the bits of a pair of these fixed-point coordinates, we get a code-word for the path in a k-d tree of dimension 2 – i.e., in our quad-tree – to the indicated point.

  5. Using the resulting coded paths, we can change the pointer-based quad tree to a packed array of sorted items. That is, (in pseudo-Haskell):

     data QT a = Branch (QT a) (QT a) (QT a) (QT a)
               | Leaf a

    becomes

     data QT a = Array (CodeWord a)
     data CodeWord a = CodeWord { path :: Word32, item :: a }
  6. To build the quad-tree, just code all the paths using the relevant bit-hacks and sort the resulting array on the coded paths. This will bring objects that share path prefixes in the quad tree into contiguous segments of the array, ready to be loaded into L1 caches for fast parallel intersection testing.

  7. Conclusion: data density, locality of reference, and hardware utilization may be able to be substantially improved even beyond packing the line objects.

(Also interesting: Appendix C of

http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html

!)

Measurement and experimentation

Beyond simply understanding or manipulating code in order to try out new performance improvement ideas, another place many students struggle is to be able to efficiently define experimentally testable hypotheses, design experiments to test their hypotheses, run the experiments, keep track of information and insights/new hypotheses learned from their experiments, and then fold experimentally obtained insights back into their main codebase.

Many of the difficulties that they encounter here stem from challenges that we’ve already touched on including with:

but some are more novel, including:

Here, I wish I had better suggestions and advice to give so I’ll turn things around: do you have any advice that works well for you or the people you work with?

Collaboration between students

On the advice of peer mentors, I have almost always found it valuable to inquire about how the collaboration between students is going and about how my students are feeling about that collaboration.

A typical resulting report that I think it’s nice to know about as context for other conversations:

I started off by inquiring about the division of labor. When I did, W expressed some frustration with his partner C who had (according to W and the commit log) contributed some style fixes and tests and little more.

I met with A3 two days later on the evening of October 26th. He expressed that he was sorry to have been unable to meet earlier with A2 but that, at that point in the semester, he was “pretty hosed”. As before, I asked how things were going in the partnership. He seemed fairly happy with his partner’s work; he also commented that his partner this time had pointed out (and he agreed) that he needed more work thinking about the machine-level details of the code they were writing.

I met with A and W this morning for about 60 minutes. A’s group was having lots of trouble with the assignment; W seemed to be in somewhat better shape; however, both were stuck on correctness / debugging issues that were preventing them from moving on to the performance parts of the assignment.

A2 described his collaboration with his partner, C, as “pretty bad”. When I inquired further, A2 expressed frustration that his partner’s contributions had dropped off after they completed their initial implementation of the reversal algorithm.

Neither A2 nor A3 were working from a written plan.

Written plans

Speaking of written plans and checklists: I have often found something like the following outline to be of use in organizing my conversations, both as a tool to structure them and as a convenient form on which to record comments and feedback:

The Outline:

working style?
written plans?
communicate with your reviewer!
  commit messages
  pictures
  copyright & license declarations!
asymptotics?
  the (illegal) lazy solution... :-)
measurement & threats to validity
  sample sizes?
  input sizes?
  target architectures?
  measurement error?
compiler & compiler flag selection?
  inlining, -fdump-ipa-inline
  -Wextra, genetic algorithms
  supercop  --  http://bench.cr.yp.to/supercop.html
  icc? clang?
coverage?
  statement, branch, exhaustive, and MC/DC coverage
better testing?
  equivalence checking?
    klee  --  http://mstone.info/posts/klee_quickcheck/
  test-case minimization
    QuickCheck -- http://en.wikipedia.org/wiki/QuickCheck
  the Small Scope Hypothesis
    http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=10928
plotting?
  ggplot2  --  http://mstone.info/posts/plots-20120225/
questions?
  A: faster reversal?
    me: cost of loads, instruction-level data dependencies, ...
  W: pointer casting?
    me: sure, also check out gcc's tree-vectorization code

Another written plan:

Coding issues
  Macros
  Commit histories & interactive rebase
  Comments
  Copyright declarations
  (Future work: naming, interfaces, seams & testability, complexity, ...)

Challenges the students faced + requests for feedback?

Testing strategies
  Manual testing
  Table-driven testing
  Choice of units to test
  Regression tests
  Property-based testing
  Coverage and coverage-guided testing
  Coverage criteria: statement, branch, and MC/DC coverage
  Model-checking and symbolic execution
  Test-case minimization
  “Working Effectively with Legacy Code”

Debugging strategies
  mozilla rr & time-travel debugging
  gdb tips (specifically `gdb -x`,  breakpoint commands, conditional breakpoints, watchpoints)
  printf-based / log-based debugging
  visualization of logged data
  “Why Programs Fail”

Build tooling
  CI pipelines
  Running gdb

Stories
  Losing months on a project by not unit-testing simple algebraic beliefs that I "knew" were true
  Using Klee to find bugs at incident tempo in a proposed patch for a security issue affecting some important DNS and HTTP servers
  Being called at 3 AM and needing to be able read, debug, patch, build, and test other people’s code
  Using visualization to solve a tricky bug (a missing `-` sign) in some robotics code

Working with reviewers

In preparation for our meeting, I read my students’ commit shortlogs, patches, writeups, and code. I also used gcc with extra flags to examine unfixed compiler warnings, inlining decisions, and generated assembly for key routines.

To help accomplish these tasks more efficiently, I’ve developed some custom tooling which I hope to keep improving over time:

$ cat ~/.gitconfig
[user]
  email = ...
  name = Michael Stone

[push]
  default = simple

[color]
  diff = auto
  branch = auto
  interactive = auto
  status = auto

[alias]
  lol = log --graph --decorate --pretty=oneline --abbrev-commit
  lola = log --graph --decorate --pretty=oneline --abbrev-commit --all
  ds = diff --stat
  dcs = diff --cached --stat
  dc = diff --cached
  dw = diff -w --word-diff-regex=[^[:space:]]
  fm = log --min-parents=2 -p --cc
  sm = log -p -m
  lm = log -m --name=only --pretty='format:' -1
  find-merge = "!sh -c 'commit=$0 && branch=${1:-HEAD} && (git rev-list $commit..$branch --ancestry-path | cat -n; git rev-list $commit..$branch --first-parent | cat -n) | sort -k2 | uniq -f1 -d | sort -n | tail -1 | cut -f2'"
  show-merge = "!sh -c 'merge=$(git find-merge $0 $1) && [ -n \"$merge\" ] && git show $merge'"
  cp = cherry-pick -x
  dsr = diff --stat reference
  dr = diff reference
  cs = checkout -b submission origin/submission

[merge]
  tool = meld
  conflictstyle = diff3

P.S. - As final notes for myself for future meetings:

In addition to their source code, bring a copy of the perf annotate report for each student’s code!

Also, bring more pens! :-)

Coordination and meetings

One of the first of many of these external considerations beyond “implementation” is coordination and meeting planning.

I have met with students both in-person and via remote teleconferencing tools like webex.

I have coordinated with students to plan meeting times via email/text, Doodle, and Etherpad.

In general, (and both to my surprise and in contrast to my experience doing 1:1 tutoring and mentoring in the past), I have found webex meetings to work better than in-person meetings for my purposes here, I think because of the difficulty of finding comfortable quiet spaces on the MIT campus that everyone can get to on time from wherever they happen to be before. By contrast, Webex/Hangouts/… make it easy to use whatever quiet space is convenient and require much less travel time to make contact.

Interesting questions students have asked?

Q: “Have you gotten anything useful out of reading CS research papers?”

A: “Yes, but mostly as a way of knowing what’s possible rather than for any given specific project at hand.”