Performance Engineering

Michael Stone, October 14, 2018, , (src)

Introduction

MIT’s 6.172 Performance Engineering course is a first course on (software) performance engineering.

For many students, it is their first encounter with software development considerations that go beyond “make it work, by yourself, however you can” and, as such, it is also a wonderful microcosm of software development.

Over the course of mentoring roughly 10 students participating in this class over the past 6 or so years, I have observed a couple of recurring challenges, inspiring stories, and lessons learned that I’d like share more widely because I think that the topics that they touch on have relevance far beyond MIT’s class.

Student preparation

Almost all of the students I have worked with appear to be at the beginning of their studies of software engineering, programming, and software development, by which I mean something like:

software engineering
“what makes software demonstrably fit for use?”
programming
“how does one write software skillfully?”
software development
(1) “how do organizations make, maintain, and retire software at scale?”
(2) “‘programming integrated over time’”

I make this judgment in part from my observations of their struggles.

Common issues

Many of the students I’ve worked with appear to be struggling as much or more with general unfamiliarity as with performance; for example, with:

1. basic issues of correctness; specifically, which invariants should be true about their code and how to identify additional invariants

2. fast-enough strategies to achieve correctness

3. in one case, language familiarity

4. how to evaluate the adequacy of their testing efforts

5. a wide range of relevant testing strategies and their tradeoffs

6. debugger operation and technology for assisting with systematic debugging

7. how to comprehend and document unfamiliar or obscure code

8. how to safely rewrite unfamiliar or obscure code to be more readable or measurable

As a consequence, I think all of my students were in at least one case effectively prevented from trying out various potential optimizations (really, depessimizations) they were interested in.

Code usability issues

One of the first issues that MITPOSSE members are asked to give students feedback on are issues related to what I’ll call “code usability”, by which I mean:

2. comprehensibility
3. malleability, discussed more below
4. modularity and documentation of intent, to support working together in groups on improving the code

Now, what challenges tend to slow students down here?

Some are fairly tactical, like:

• How to define C preprocessor symbols on the command-line – e.g., -DMAGIC_VALUE=3 in order to set up experiments in a Makefile

• unfamiliarity with scripting – e.g., bash for-loops and shell globs – for programming with both the definitions and results of multiple experiments.

• history rewriting and commit structuring techniques to break big changes into small meaningful units (git rebase -i, git add -p, git commit —amend, git tag, git reset --soft, git bisect, branch-by-abstraction, interfaces and adapters for enabling side-by-side comparison of alternate implementations)

• macros and static inline helper functions to abstract out repetitive low-level functionality

• input checking, refactoring large routines into small testable pieces, variable naming + function interface documentation, more asserts, fancy debug macros, variable name length as a function of number of referrers + to document intent, and explaining what arguments are used for

But some are broader: the importance of code readability, build system ergonomics, etc. don’t necessarily stand out for everyone the same way, especially if you haven’t (yet) had the experience of trying to read, debug, edit, build, test, and ship someone else’s code at 3 AM in the morning your local time in order to solve some urgent performance, availability, or secrecy problem that is hurting you and your customers now, irreversibly, and in production?

Hence, the general formula: care, believe, act: share stories to help people care, give examples to help them believe that better truly is possible here and now, then summon the spirit of warmth and intellectual curiosity to help motivate them to act (e.g., to try out ideas that seem a little bit unlikely, at least at first blush.

Debugging tips

Moving beyond some of the static code ergonomics issues mentioned above, debugging skills and strategies loom large in my mind as one of the next earliest topics that needs to be addressed both to equip students to efficiently reduce their own confusion (and, thereby, to get the most out of the course) as well as to help them learn more about how to shape their code and supporting materials for greater maintainability and ease-of-use.

A smattering of tactical debugging-related topics that I often introduce students to and then later help them to experience and to learn to use include:

• printf-style debugging + machine-readable logs
• fancy macros to control the use of printf-style debugging
• how to run gdb, including controlling program input and arguments, setting breakpoints, setting watchpoints, setting conditional breakpoints
• how to script gdb via gdb -x
• how to write new test fixtures (and, later, how to apply other testing strategies)
• how to speed up edit-compile-test cycles (e.g., via scripting, build system tweaks, filesystem watchers, while inotifywait -e close_write **/*.c; do make; done)

Code malleability

After having addressed code style and any urgent tactical debugging hangups, it’s fairly typical in my experience for students to get into a rhythm of experimentation based on an optimization strategy that looks like this:

1. use perf to find a hotspot
2. squash the hotspot
3. loop

In my experience, this strategy has a number of advantages (including simplicity!) but it also has some important gaps both on the performance-improvement side (see Brendan Gregg’s, Cary Millsap’s, and Raj Jain’s discussions in their respective performance engineering books) and on the “working efficiently with code” side that I’d like to mention.

On the code side, one difficulty that I’d like to highlight concerns the squash the hotspot step. Basically, given both their varying levels of preparation and available time and attention, plus some of the code-bases that they’re starting from, quite a few students seem to have a lot of difficulty executing on their ideas for how to modify the code they’re trying to speed up simply because they’re unable to do so in a way that keeps the code running.

Some common code manipulation problems that I see students run into at this stage include:

• missing a call-site that needs to be updated
• getting derailed inefficiently debugging segfaults,
• getting confused and frustrated by non-determinism due to use of uninitialized memory

Here are some anecdotes that illustrate how these issues tend to come out and some past advice that I’ve given about how to deal with them:

In an effort to simplify their code, A’s team was using a globally locked free-list. Unfortunately, her code was deadlocking in malloc() because her realloc() implementation (which falls through to malloc() in some cases) had already acquired the global lock. I showed her a couple of ways to fix the problem (recursive locks, refactoring the lock-acquisition code) and she was able to move on. (W also made a great suggestion about using a watchpoint on the mutex to halt execution at relevant times.)

W’s code was segfaulting due to corruption of his allocation metadata headers probably resulting from use of uninitialized memory. We talked through a couple of debugging strategies like assert() and printf() debugging, use of valgrind --track-origins=yes, test-case minimization, and educated guessing based on the location of the failure in the failing trace.

Some less fun notes:

Despite the fact that they’re writing a memory allocator, neither group had run their code through valgrind. Doing so immediately revealed memory leaks in A’s groups code and use of uninitialized memory in W’s code.

W’s group had turned off compiler optimization in their release build but hadn’t noticed!

Neither group has really figured out (and I haven’t yet managed to communicate) the debugging value of “atomic” commits from “known good state” to “testably known good state”. :-(

OK, so, zooming out, what are some tools and approaches that help me with these issues and where did I learn about them from?

1. “systematic debugging” techniques from Why Programs Fail
2. the idea of “code seams” and input/output control from Michael Feathers’ Working Effectively with Legacy Code
3. “reverse debugging” tools like Mozilla rr and other record-replay tools for segfault provenance tracking
4. valgrind for memory debugging: leak tracking, uninitialized memory usage, etc.
5. practice with printf-debugging
6. klee for program equivalence checking, learned from Brian Sniffen
7. cscope for code search, especially cscope -kRp3 + editor integration, learned from Herbert Pötzl
8. regexes, sed, vim global substitutions, sublime find & replace, and coccinelle for code transformation
9. ggplot2, pgf-tikZ, vega-lite, HTML, SVG, and NetPBM output for visualization, learned in part from Noah Ilinsky and Bruce Maxwell
10. Bret Victor’s “Learnable Programming” essay for inspiration on how to visualize program traces.

Despite my sense that students are most often held back by problems related to code reading, debugging, refactoring, collaboration, or experimental design & setup, I am nevertheless still often asked by students for help thinking more clearly about performance.

Here’s a typical anecdote:

One time, I asked my students after our project #1 meetings how I could be more helpful to them. The major request I received was to come prepared with more suggestions for speedups. As a result, I decided to try spending more time on “thinking about performance” and on practical tips for measurement than on assurance & quality issues in this set of meetings. After they arrived, we started talking about their code.

Unsurprisingly for the assignment at hand, both W’s and A’s groups had implemented pointer-based quad-trees. Both were experimenting with parallelism (W having had more success) and both were within a factor of 2-3 of the target time of 0.9s.

To make progress, I asked them for the output of perf report. Since both had reduced the number intersection tests via their quad-trees and since neither had done any work on their intersect() routine, they quickly realized that intersect() might be a good thing to speed up. To that end, I had them take a close look at the animation of the screensaver where I asked them to look for special structure that they might be able to exploit – i.e., to the fact that the lines translate but don’t rotate and that they hold only a small number of poses.

But where to go from here?

One useful answer that students are prompted to consider early in the course is low-level consideration of what hardware is available, what hardware is being used, and how is it being used, especially with respect to caching, instruction- and word-level parallelism, and processor front-end/back-end utilization.

Another useful answer that the students are prompted to consider slightly later concerns Bentley’s rules.

Below, I will discuss several challenges that students seem to regularly run into along both lines of consideration.

Caches, inlining, and bithacks

Caches are are surprising beasts that, in my experience, often generate student confusions. Here is one anecdote to illustrate one flavor of confusion in this area:

A2 thought of table lookups as taking a fixed (and small) number of cycles, rather than taking time dependent on the state of the memory hierarchy. (This led to a nice conversation about cache hits, cache misses, lookup-table sizes, and related interactions with set-associative caches.) He left interested in exploring the byte reversal bithacks that he had previously skipped over in favor of his LUT.

However, caches aren’t the only thing confusing A2. As the conversation progressed, we discovered another opportunity to learn to think more clearly, both via additional mental models and by getting the computer to behave more usefully:

A2 asked about whether or not he should change algorithms. This led to a discussion of asymptotic and polynomial cost models and, from there, to the discovery that he had discovered perf stat but not perf annotate.

This conversation then led into an even deeper discussion of how computers can be made to assist with many other aspects of the optimization process with another student:

About here, A2 had to depart for another obligation. However, fortunately, A3 was able to stay for a few minutes longer. His major question was on how to generate new ideas for speeding up his code. To that end, we discussed:

1. mental models for thinking about whether inlining is helping you
2. how to measure whether inlining is helping you
3. hardware counters (e.g., for L1-icache-loads) and perf list as a way to discover available counters
4. how to find new bithacks: e.g., as a circuit design problem, as a search problem over a space of expression trees, via induction principles, …
5. coaxing gcc into vectorizing loops

Bentley’s rules

As you may have noticed, one of my default modes of operation is to try to help students not by directly solving their problems for them but, rather, by helping them to experience, conceptualize, and mentally organize both their confusions and, later, their developing understanding of a wide range of tools, models, and approaches that are relevant to their problems.

In this vein, Bentley’s rules provide an interesting body of performance improvement thought and advice.

While reflecting on them, I have often been led to questions like these while discussing students’ code with them:

• What sort of data density are you getting?

• How much locality of reference are you getting?

• How many load operations are necessary to solve the problem?

• Can you change representations to speed things up? In particular:

• How much precision is actually required to solve the problem?
• Is floating-point math needed or would integer math do?
• Which algorithmic steps are parallelizable?

Based on these questions, the following (seemingly) crazy hypothetical scheme once occurred to me:

1. The relevant graphics subsystem turns out to use 16 bits of precision per axis. Therefore, we can certainly store less than 64 bits of precision.

2. If we store the coordinates in fixed-point rather than floating point then we can use our integer units rather than our floating point units.

3. Even better, if we use a fixed-point representation and also code the edges of the viewport as (0, USHRT_MAX)) then we wind up having organized the embedding space: the most significant bit of the coordinate tells us which half of the indicated axis of the embedding space (viewport) we’re in, the next most significant bit further subdivides that section of the axis, and so on.

4. As a result, if we interleave the bits of a pair of these fixed-point coordinates, we get a code-word for the path in a k-d tree of dimension 2 – i.e., in our quad-tree – to the indicated point.

5. Using the resulting coded paths, we can change the pointer-based quad tree to a packed array of sorted items. That is, (in pseudo-Haskell):

 data QT a = Branch (QT a) (QT a) (QT a) (QT a)
| Leaf a

becomes

 data QT a = Array (CodeWord a)
data CodeWord a = CodeWord { path :: Word32, item :: a }
6. To build the quad-tree, just code all the paths using the relevant bit-hacks and sort the resulting array on the coded paths. This will bring objects that share path prefixes in the quad tree into contiguous segments of the array, ready to be loaded into L1 caches for fast parallel intersection testing.

7. Conclusion: data density, locality of reference, and hardware utilization may be able to be substantially improved even beyond packing the line objects.

(Also interesting: Appendix C of

http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html

!)

Measurement and experimentation

Beyond simply understanding or manipulating code in order to try out new performance improvement ideas, another place many students struggle is to be able to efficiently define experimentally testable hypotheses, design experiments to test their hypotheses, run the experiments, keep track of information and insights/new hypotheses learned from their experiments, and then fold experimentally obtained insights back into their main codebase.

Many of the difficulties that they encounter here stem from challenges that we’ve already touched on including with:

• debugging techniques (asserts, unit tests, visualization approaches) and some code reading techniques to help deal with segfaults.

• commits, controlling commit contents, and testing & debugging technology.

• collaboration, debugging strategies, and using refactoring + internal interfaces to gradually introduce new algorithms + data structure representations to an existing code base.

but some are more novel, including:

• statistics, measurement techniques, automation, build system changes, profiling technology, and team dynamics.

• allocation of time between writing, refactoring, debugging, and optimization — all students reported spending significant time debugging segfaults and correctness problems — and therefore, on the value of better up-front correctness + more time-efficient debugging techniques (valgrind, early detection of invariant violations, testing smaller units)

Here, I wish I had better suggestions and advice to give so I’ll turn things around: do you have any advice that works well for you or the people you work with?

Collaboration between students

On the advice of peer mentors, I have almost always found it valuable to inquire about how the collaboration between students is going and about how my students are feeling about that collaboration.

A typical resulting report that I think it’s nice to know about as context for other conversations:

I started off by inquiring about the division of labor. When I did, W expressed some frustration with his partner C who had (according to W and the commit log) contributed some style fixes and tests and little more.

I met with A3 two days later on the evening of October 26th. He expressed that he was sorry to have been unable to meet earlier with A2 but that, at that point in the semester, he was “pretty hosed”. As before, I asked how things were going in the partnership. He seemed fairly happy with his partner’s work; he also commented that his partner this time had pointed out (and he agreed) that he needed more work thinking about the machine-level details of the code they were writing.

I met with A and W this morning for about 60 minutes. A’s group was having lots of trouble with the assignment; W seemed to be in somewhat better shape; however, both were stuck on correctness / debugging issues that were preventing them from moving on to the performance parts of the assignment.

A2 described his collaboration with his partner, C, as “pretty bad”. When I inquired further, A2 expressed frustration that his partner’s contributions had dropped off after they completed their initial implementation of the reversal algorithm.

Neither A2 nor A3 were working from a written plan.

Written plans

Speaking of written plans and checklists: I have often found something like the following outline to be of use in organizing my conversations, both as a tool to structure them and as a convenient form on which to record comments and feedback:

The Outline:

working style?
written plans?
commit messages
pictures
asymptotics?
the (illegal) lazy solution... :-)
measurement & threats to validity
sample sizes?
input sizes?
target architectures?
measurement error?
compiler & compiler flag selection?
inlining, -fdump-ipa-inline
-Wextra, genetic algorithms
supercop  --  http://bench.cr.yp.to/supercop.html
icc? clang?
coverage?
statement, branch, exhaustive, and MC/DC coverage
better testing?
equivalence checking?
klee  --  http://mstone.info/posts/klee_quickcheck/
test-case minimization
QuickCheck -- http://en.wikipedia.org/wiki/QuickCheck
the Small Scope Hypothesis
http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=10928
plotting?
ggplot2  --  http://mstone.info/posts/plots-20120225/
questions?
A: faster reversal?
me: cost of loads, instruction-level data dependencies, ...
W: pointer casting?
me: sure, also check out gcc's tree-vectorization code

Another written plan:

Coding issues
Macros
Commit histories & interactive rebase
(Future work: naming, interfaces, seams & testability, complexity, ...)

Challenges the students faced + requests for feedback?

Testing strategies
Manual testing
Table-driven testing
Choice of units to test
Regression tests
Property-based testing
Coverage and coverage-guided testing
Coverage criteria: statement, branch, and MC/DC coverage
Model-checking and symbolic execution
Test-case minimization
“Working Effectively with Legacy Code”

Debugging strategies
mozilla rr & time-travel debugging
gdb tips (specifically gdb -x,  breakpoint commands, conditional breakpoints, watchpoints)
printf-based / log-based debugging
visualization of logged data
“Why Programs Fail”

Build tooling
CI pipelines
Running gdb

Stories
Losing months on a project by not unit-testing simple algebraic beliefs that I "knew" were true
Using Klee to find bugs at incident tempo in a proposed patch for a security issue affecting some important DNS and HTTP servers
Being called at 3 AM and needing to be able read, debug, patch, build, and test other people’s code
Using visualization to solve a tricky bug (a missing - sign) in some robotics code

Working with reviewers

In preparation for our meeting, I read my students’ commit shortlogs, patches, writeups, and code. I also used gcc with extra flags to examine unfixed compiler warnings, inlining decisions, and generated assembly for key routines.

To help accomplish these tasks more efficiently, I’ve developed some custom tooling which I hope to keep improving over time:

$cat ~/.gitconfig [user] email = ... name = Michael Stone [push] default = simple [color] diff = auto branch = auto interactive = auto status = auto [alias] lol = log --graph --decorate --pretty=oneline --abbrev-commit lola = log --graph --decorate --pretty=oneline --abbrev-commit --all ds = diff --stat dcs = diff --cached --stat dc = diff --cached dw = diff -w --word-diff-regex=[^[:space:]] fm = log --min-parents=2 -p --cc sm = log -p -m lm = log -m --name=only --pretty='format:' -1 find-merge = "!sh -c 'commit=$0 && branch=${1:-HEAD} && (git rev-list$commit..$branch --ancestry-path | cat -n; git rev-list$commit..$branch --first-parent | cat -n) | sort -k2 | uniq -f1 -d | sort -n | tail -1 | cut -f2'" show-merge = "!sh -c 'merge=$(git find-merge $0$1) && [ -n \"$merge\" ] && git show$merge'"
cp = cherry-pick -x
dsr = diff --stat reference
dr = diff reference
cs = checkout -b submission origin/submission

[merge]
tool = meld
conflictstyle = diff3

P.S. - As final notes for myself for future meetings:

In addition to their source code, bring a copy of the perf annotate report for each student’s code!

Also, bring more pens! :-)

Coordination and meetings

One of the first of many of these external considerations beyond “implementation” is coordination and meeting planning.

I have met with students both in-person and via remote teleconferencing tools like webex.

I have coordinated with students to plan meeting times via email/text, Doodle, and Etherpad.

In general, (and both to my surprise and in contrast to my experience doing 1:1 tutoring and mentoring in the past), I have found webex meetings to work better than in-person meetings for my purposes here, I think because of the difficulty of finding comfortable quiet spaces on the MIT campus that everyone can get to on time from wherever they happen to be before. By contrast, Webex/Hangouts/… make it easy to use whatever quiet space is convenient and require much less travel time to make contact.