DTrace Class

23 May 2013

Last week I attended the DTrace: Core Features and Performance Analysis class at Joyent's San Francisco HQ. DTrace is one of the key technologies inside SmartOS (also illumos, Solaris, Mac OS X and FreeBSD) that I had read about, but I felt I needed some personalized training to really understand how I could apply it to problems I face, now and in the future. DTrace stands for Dynamic Tracing; it was the brainchild of Bryan Cantrill while he worked at Sun, and he is now VP of Engineering at Joyent.

After 22 hours of flying from Singapore, I arrived Monday evening at SFO and made it to my hotel. In an effort to beat the intense jet lag caused by the 16-hour time difference, I intentionally disrupted my sleep schedule starting a few days before. Whether it was that strategy or the engaging style of Max Bruning, the course instructor, I managed to stay awake through it all. Brendan Gregg came in during the DTrace labs and gave plenty of hands-on help to the class of 15.

The lecture and hands-on lab format was similar to the type of delivery I do myself when teaching Bioinformatics. The labs consisted of about 20 tiny black-box compiled programs with no source code, each with some run-time bug or flaw that could be discovered by systematically applying DTrace and the USE (Utilization, Saturation and Errors) method, which is described in detail on Brendan Gregg's blog here. At the end of the course, Brendan's preview of DTrace on Ubuntu, showing the differences in network stack call depth (Linux > SmartOS), was quite interesting.

At the end, our "commencement" ceremony came with sage words of wisdom from the always enthusiastic Bryan Cantrill, who gave us a thumbnail history of DTrace. He explained that being a kernel engineer for the operating system is like owning a bear cub raised from birth: nice and cuddly looking, but there are moments when it will growl and take your arm off. According to Cantrill, the only way you truly understand DTrace is the moment you realize that DTrace has just saved your ass. His final nugget of wisdom was that DTrace users self-select to learn how to instrument their code and systems. They have a critical understanding that failures within the black box of computing should not be left to guesswork.

While I have not yet had the moment when DTrace saved my ass, I can see it clearly in retrospect. Over the years I have had far too many hand-waving arguments from developers about latency and performance problems that involve blaming some black box, either a hardware component or a library. Howard Feldman, who was one of my graduate students at the time, wrote an O'Reilly DevCenter article in 2004 that explains some of the finger-pointing problems we had in performance tuning. He describes the systematic and unassuming analysis carried out with embedded timing code to report precise event timing, which was the key to pinpointing and fixing the API and parsing problems we had. Today the problems described in Feldman's 2004 article could have been systematically and quantitatively discovered with a few 1-line DTrace commands. For me this retrospective is all I need to recall to motivate myself to learn and understand DTrace.

At the end of the 3-day class, Bryan handed out the beloved DTrace ponycorn stickers, while Deirdré Straughan, the Joyent Head of Educational Programs and illumos community organizer, provided the DTrace t-shirts and certificates. So, after 30 years of computing, I can now, with DTrace, quantitatively measure just about everything that goes on inside the software and hardware I operate. I no longer have to guess what is causing BLAST to take so long. I can quantitate the timing of two competing libraries I link into my code and figure out which one is faster without ever touching the code. Recompiling in debug mode, or inserting printf statements and timing variables, is no longer necessary.

 
DTrace Flame Graphs

Brendan Gregg's Flame Graphs are awesome snippets of visualization.

During the class I made 3 Flame Graphs of classic Bioinformatics programs as examples: FormatDB, BLAST 2.2.24, and TraDES-2.

BLAST is a protein or DNA sequence similarity search engine. Before it can be run, raw sequence data needs to be indexed with a program called FormatDB.

Since I wanted to do a Flame Graph of BLAST, I had to run FormatDB first, and since that took over an hour to finish, I decided to throw DTrace at it and see what it looked like. The FormatDB and BLAST 2.2.24 binaries are from the illumos build of BLAST I posted here (it is not BLAST+). I chose this version because I already had it compiled on SmartOS, and because BLAST has some well-known failure modes when there is not enough RAM for paging in database files. Back at NCBI this was called "thrashing", which is what it sounds like when you hear the disk access, but it is better known as "memory contention".

First up is the Flame Graph of FormatDB, captured while creating the BLAST index for the nr protein database. FormatDB takes over an hour to process the approximately 15 GB of nr FASTA. The Flame Graph data was captured during this run with a short DTrace one-liner sampling stack traces over 60 seconds. In this case FormatDB is run with additional index information, with runtime flags as follows:

# ./bin/formatdb -i nr -p T -o T

About a half-hour into the run, as it was chugging away, I ran a DTrace profile on it. This one-liner collects 1 minute's worth of C function call stack traces at 997 Hz and outputs them to a file, which is then fed to Brendan Gregg's Flame Graph Perl scripts (available on GitHub) for conversion into the Flame Graph.

# dtrace -x ustackframes=100 -n 'profile-997 /execname =="formatdb" && arg1/ { @[ustack()] = count(); } tick-60s { exit(0); }' -o formatdb.out

 
This is the data collection step: running the program and using DTrace to extract its executing stack traces at sampled timepoints. The profile-997 probe does exactly this, firing 997 times a second (997 Hz). The predicate /execname == "formatdb" && arg1/ directs DTrace to sample stack traces only when that executable is on CPU in user mode. DTrace aggregates the userland stacks gathered by the ustack() action inside an awk-like associative array. By default, and like awk, this is printed to stdout; here the -o flag directs it into the file named "formatdb.out". The second probe, tick-60s, stops the sampling after 60 seconds, writes the data to the file, and exits DTrace.

So this one-liner gathers up all the stack traces it sees, at a rate of 997 times a second, for 60 seconds, focusing only on the running formatdb processes (if there is more than one it will trace all of them; individual instances can be specified by process ID).
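For completeness, the conversion from the raw DTrace output into the interactive SVG goes roughly like this; this is a sketch using the stackcollapse.pl and flamegraph.pl scripts from Brendan Gregg's FlameGraph repository on GitHub, with the file names simply carried over from the one-liner above:

# git clone https://github.com/brendangregg/FlameGraph
# ./FlameGraph/stackcollapse.pl formatdb.out > formatdb.folded
# ./FlameGraph/flamegraph.pl formatdb.folded > formatdb.svg

The first script folds the sampled stacks into one line per unique stack with a count; the second renders those folded stacks as the SVG Flame Graph.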

The win here is that this does not need debug mode, or any other alteration to the compiled executable.

Here is the FormatDB result:

Click on the image to zoom in to the interactive SVG where you can mouse over each segment and see each C function call.

I used the truncate command on nr to limit the file to 10 GB exactly, cutting a sequence arbitrarily at the end of the file, as this run was just for profiling.
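For anyone reproducing this step, the truncation is a single command along these lines (I am assuming the GNU coreutils truncate here, so treat the exact invocation as a sketch):

# truncate -s 10G nr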

Next up I ran a BLAST query of the protein sequence of oncomodulin (PDB 1OMD), a sentimental favorite as it is the first protein I ever purified in the lab. The SmartMachine I ran this on had 30 GB of storage, 1 CPU and 1 GB of RAM. So while the BLAST query of this protein against the tiny PDB database is fast, the query against nr is incredibly slow. To the uninitiated, this is unwanted black-box behavior, but to those of us who have studied or hacked BLAST, the problem is memory contention. FormatDB makes searchable chunks that are too big to fit into this little RAM. BLAST uses memory-mapped IO, which also contributes to the problem, as the operating system struggles to reload data off disk while the program scans the query. Most of this unwanted behavior can be seen in the flame graph. The DTrace one-liner is exactly as above for FormatDB, but with "blastall" as the program name.
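For reference, that substituted one-liner looks like this (the output file name is just my choice for this post):

# dtrace -x ustackframes=100 -n 'profile-997 /execname == "blastall" && arg1/ { @[ustack()] = count(); } tick-60s { exit(0); }' -o blastall.out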

BLASTALL blastp protein query - Flame Graph under memory contention:

Click on the image to zoom in to the interactive SVG where you can mouse over each segment and see each C function call.

The broad plateau on the right points to this piece of BLAST code, where most of the stack samples arise. FormatDB took so long that I didn't spend much time digging further into this, but it nicely illustrates the rule of thumb for Flame Graphs: long plateaus indicate problems. It would be interesting to spend more time on this example and experiment with incrementally larger amounts of RAM, but for now, this will have to suffice.

The last example for Flame Graphs is the current TraDES-2 program, which makes 3D protein structure samples and writes out files in Asn.1 and PDB format. As this is code I have developed and maintained over the years, I know pretty much exactly what is going on, but can you tell?

 
Click on the image to zoom in to the interactive SVG where you can mouse over each segment and see each C function call.

So far, this is unique among the posted Flame Graphs I have seen. It got a double take from Brendan Gregg and Bryan Cantrill, who had never seen stack calls piled so high. The reason is that the code uses a lot of recursion, for 3D neighbor lookups and for protein chain growth, both of which contribute to very deep stack traces. Getting this Flame Graph right required some help from Brendan, who pointed out that it needs a hefty user call stack setting, ustackframes=500, so that the flame graph does not get top-truncated.

# dtrace -x ustackframes=500 -n 'profile-997 /execname =="trades" && arg1/ { @[ustack()] = count(); } tick-60s { exit(0); }' -o trades_prot.out

That broad swath of stack frames with the NextResidue() function call represents the protein 3D structure build-up calls. As the protein gets longer, crashes into itself and backtracks, a lot of stack operations keep track of the chain growth. The fine "hairs" on top of this arise from a recursive 3D collision testing algorithm that checks whether an atom is running into anything in O(N log N) time. If all the space is clear, memory is allocated to place an atom, at the top of the hairs.

The TraDES-2 code includes several libraries, one of which produces the plateau on the left. This is a floating-point, numerically intense computation of protein surface area. The point here is that sometimes plateaus are actually hard-working functions, as this one is. The next-tallest spire in the middle corresponds to a calculation that adds water molecules onto the surface of a finished protein. It uses the same 3D collision testing algorithm as the protein growth code, as does the next tallest spire to its left. The water addition code produces images like the one on the left: a small protein with 10 layers of water added, such that a crystalline form begins to appear.

One of the things I spotted was that the PDB file write operations, which produce FORTRAN-style output, occupy a swath of the Flame Graph about twice the size of the Asn.1 write operations (PDB writes are at the far right of the Flame Graph). While I know the Asn.1 write operations push fewer bits to disk, there seems to be some room for improvement. To investigate further, I used the DTrace quantize() aggregating function to see what the distribution of write sizes was while TraDES wrote 200 files, under 3 separate conditions: Asn.1 only, PDB only, and both the Asn.1 and PDB file writers operational.
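The quantize one-liner was along these lines (reconstructed here as a sketch, so treat the exact predicate and label as assumptions); it aggregates the byte count (arg2) passed to each write() system call made by the trades process into a power-of-two histogram:

# dtrace -n 'syscall::write:entry /execname == "trades"/ { @["write size (bytes)"] = quantize(arg2); }'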

DTrace quantize write operations. Run of 200 structures, writing Asn.1 files and log files.

DTrace quantize write operations. Run of 200 structures, writing PDB files and log files.

DTrace quantize write operations. Run of 200 structures, writing Asn.1, PDB and log files.

DTrace quantize emits a text graph of all the file write operations, sorted by size, over the complete run. This comparison shows rather clearly that the PDB file writers, which I wrote back in 1995, do a lot of small writes compared to the Asn.1 binary stream writers. This is something I have thought about fixing for a while: writing the PDB file strings into a ByteStore memory buffer, then writing that to disk in larger chunks. It will never be as fast as the Asn.1 writers, but eliminating the large number of tiny writes should make a big difference.

I left the class with a lot of new skills and useful insights. DTrace everything. I spent a lot of time staring at the TraDES Flame Graph; it is like an MRI image of my own bear cub. It makes it easier for others to understand just what goes on inside the TraDES-2 code, as it plainly shows the different computational strategies used for dealing with 3D information within the system.
 
More importantly, the ability to quantitatively look at running code offers the key data required to support performance improvement ideas, and there is always room for improvement. For anyone who wants to recreate the TraDES flame graph, the raw DTrace data is attached at the bottom.

Comments to chogue {at} blueprint {dot} org  or @cwvhogue on Twitter.

For more on DTrace and performance optimization follow:
Brendan Gregg @brendangregg
Max Bruning @mrbruning
Bryan Cantrill @bcantrill



Attachment: trades_prot.out.gz (628k)