March 24, 2013
In my research lab at the National University of Singapore (NUS), I am Principal Investigator, Principal Sys-admin and Principal Coder. While my formal training and degree are in Biochemistry, I have been developing software since 1986, and web services and REST APIs for Bioinformatics since 1996.
What compute jobs do we run?
Currently the main software package I maintain and develop is an obscure but rather large bit of science code called TraDES which is used by scientists who study protein molecules with Nuclear Magnetic Resonance. TraDES generates protein 3D structures and is most often downloaded as a dependency for another program called ENSEMBLE.
Like much specialty software in science, TraDES has a very small user base, numbering in the hundreds, and it will never get much bigger than that. As scientific computing goes, those hundreds of users all pick different computers, operating systems and distributions.
TraDES' job is simple. Make hundreds of thousands to billions of 3D protein structures chosen from random sampling and expected structure behavior.
This is full-throttle CPU code making big data. How big?
The database of known structures as of today has 88,837 structures of varying sizes.
We routinely create sample sizes of 300,000 from a single TraDES process on a single core, which is about 8 GB of data after compression. Pipelines for structure filtering and collision detection to winnow down results can operate on several million samples of some very large proteins. In one case, only 11 out of 4.5 million ACTA test structures passed a membrane plane barrier and docked 4 binding proteins.
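As a rough back-of-envelope check on those numbers, the sketch below works out the average compressed size per structure and projects it onto a multi-million-sample run. It assumes that 8 gigabytes means 8e9 bytes and that compressed size grows linearly with sample count; neither is guaranteed, and the large proteins in the multi-million runs will not have the same per-structure size. The function names are made up for illustration.

```python
# Back-of-envelope storage estimate from the figures quoted above.
# Assumptions (not from TraDES itself): 8 gigabytes = 8e9 bytes, and
# compressed size grows linearly with the number of structures.

SAMPLES_PER_RUN = 300_000        # structures from one TraDES process
BYTES_PER_RUN = 8e9              # roughly 8 gigabytes after compression


def bytes_per_structure(samples: int = SAMPLES_PER_RUN,
                        total_bytes: float = BYTES_PER_RUN) -> float:
    """Average compressed size of one structure, in bytes."""
    return total_bytes / samples


def estimated_gigabytes(n_structures: int) -> float:
    """Projected compressed size, in gigabytes, for n_structures samples."""
    return n_structures * bytes_per_structure() / 1e9


if __name__ == "__main__":
    print(f"{bytes_per_structure() / 1e3:.1f} kB per structure")
    print(f"{estimated_gigabytes(4_500_000):.0f} GB for 4.5 million structures")
    print(f"hit rate for 11 of 4.5 million: {11 / 4_500_000:.2e}")
```

Under those assumptions, a 4.5-million-structure run lands in the low hundreds of gigabytes, which is why the filtering pipelines care about disk as much as CPU.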
TraDES is not parallel code (with good reason), but instances of the TraDES engine scale very nicely with cores/hyperthreads. A Xeon E3-1225 (3.1 GHz) processor in an HP Z210 workstation can make 600,000 3D structures of a 306 amino acid protein in 11 hours on SmartOS with 2x 2 TB drives in the zpool. That runs as 4 single-thread processes, each at 98-100% CPU. (For comparison, blastall uses about 80% CPU per process.) In general, TraDES scales nicely with the number of cores/hyperthreads, provided disk I/O is not saturated first.
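To put the workstation figures in per-process terms, here is a small sketch that derives the implied throughput and uses it to estimate wall-clock time for other job sizes. It assumes the 4 processes contribute equally, that throughput is constant over the run, and that scaling stays linear with process count; that last assumption only holds while disk I/O is not saturated. The numbers apply to the 306 amino acid protein only, and the function names are illustrative.

```python
# Throughput implied by the HP Z210 run described above.
# Assumptions: equal contribution from each process, constant rate,
# and perfectly linear scaling with the number of processes.

STRUCTURES = 600_000   # structures produced in the run
HOURS = 11             # wall-clock duration of the run
PROCESSES = 4          # single-threaded TraDES instances


def per_process_rate(structures: int = STRUCTURES,
                     hours: float = HOURS,
                     processes: int = PROCESSES) -> float:
    """Structures per second produced by one single-threaded process."""
    return structures / (hours * 3600) / processes


def hours_needed(n_structures: int, processes: int) -> float:
    """Estimated wall-clock hours to make n_structures with `processes`
    instances, ignoring disk I/O saturation."""
    return n_structures / (per_process_rate() * processes) / 3600


if __name__ == "__main__":
    print(f"{per_process_rate():.2f} structures/s per process")
    print(f"{hours_needed(1_000_000, 8):.1f} h for 1 million structures on 8 processes")
```

That works out to a little under 4 structures per second per process for this protein.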
What have we tried to run this code on before?
Back in 2001, we cross-compiled the TraDES software package for this set of 14 separate operating system/CPU combinations: Compaq-alpha, HPUX-parisc, Irix 6.x-MIPS, Solaris-sparc, Solaris64-sparc, Linux-i386, Linux-ppc, Linux-ps2, Linux-parisc, FreeBSD-x86, BeOS-x86, MacOSX-ppc, QNX-x86, and Windows98. This was driven by our own C cross-compilation system, which triggered nightly builds.
Today TraDES is still distributed in 14 different builds, but 9 of these are now Linux-on-Intel distros, and only one (MacOSX10.4-PPC) is a non-Intel, big-endian build. The full list: BioSlax-x86 (SlaxWare/Debian), CentOS5-x86, CentOS5-x86-64, CentOS6-x86-64, Fedora14-x86, Fedora14-x86-64, FreeBSD9-x86-64, illumOS-x86-64, MacOSX10.4-PPC, MacOSX10.6-x86, OpenSUSE11.3-x86-64, Ubuntu11-x86, Ubuntu11-x86-64, and Windows-XP/7-32bit.
TraDES has an internal benchmark/unit test program called "benchtraj" which was adapted from code we used to benchmark our Distributed Folding systems back in 2003, and there are many scores posted from back then, so we know pretty well how far computers have come in running our code. The benchmark has scores for software-specific I/O operations that create a database file (Maketrj) and compute performance in making 3D protein structures (Foldtraj).
OK, so my code runs on everything. Why SmartOS?
In a nutshell: storage failure fatigue, as I briefly chronicled here. I want an uncomplicated and fast route from CPU to storage with as few storage widgets between them as possible. I want a compute/storage server that is as fast to set up as an iPhone. I want computational capability as close to my data as possible.
Simply put, enough experience with hardware RAID failures, fibre-channel quirks, NAS box failures, fsck-ing, and the time it takes to initialize new storage pushed me over the edge and away from Linux as the core OS on my servers. Migrating infrastructure OSes is not new for me: I have gone from Irix (95-2000) to Linux (2000-2003) to Solaris (2000-2006 - see PDF about our SUN COE attached at the bottom), and back to Linux (2007-2013 - see infrastructure.blueprint.org).
Cross-compiling means I am never locked-in, and sometimes change is good.
ZFS is the solution to my storage failure fatigue. It is the core filesystem of FreeBSD, Solaris and Illumos. Wikipedia's ZFS article is a great place to start to understand how ZFS was engineered to prevent silent data corruption, scale to as big as you can imagine, and much more. For someone who has seen filesystems fail for a myriad of reasons, the engineering behind ZFS is truly sensible.
I am not alone. Linux users are also demanding ZFS. ZFS has been ported to Linux via FUSE (Filesystem in Userspace) and can be used on something as small as a Raspberry Pi, and a native port has brought Lustre on ZFS to the LLNL Sequoia computer as its primary 55PB filesystem and object store.
What is SmartOS?
SmartOS is an illumOS distribution. What is illumOS? It is the community fork of the CDDL-licensed OpenSolaris code originally released as open source by Sun Microsystems, and it is currently unsupported by Oracle. Rather remarkably, a host of key OS engineers who worked on Solaris within Sun Microsystems are working outside of the Oracle tent on illumOS, and several companies are supporting the effort, including Joyent, which produces SmartOS as well as Node.js.
SmartOS is an extremely lean version of illumOS, and it is not intended to occupy space on the filesystem. Rather, it boots from a <2 GB image (USB key, PXE network boot, or CDR image) and occupies space in RAM. The OS is kept off the filesystem to simplify upgrades - attach a new image and reboot.
SmartOS USB key boot.
A fresh install of SmartOS takes less than 10 minutes on a server-class machine to get to a working ZFS filesystem. Most of that time is spent de-configuring hardware RAID (configure as JBOD or pass-through disks) and answering SmartOS config questions. Reboot and the server is ready to use.
It is a true open source OS, and you can download and compile it yourself. SmartOS is released on a bi-weekly schedule and there is a lot of innovative activity being put into practice by the developers both from SmartOS and the illumOS community.
SmartOS is a bare OS intended to provide the infrastructure to run Zones. Zones are native OS containers (like chroot or BSD jails). One does not use the root system to work on the machine, only to configure and install Zones from which work is done. Zones persist on the ZFS filesystem between reboots, and can be imaged, copied and redistributed.
There are two kinds of Zones on SmartOS. The first kind runs a virtual SmartOS operating system. The second kind runs a KVM virtual machine image of another operating system such as CentOS, FreeBSD or Windows. In terms of speed, a SmartOS Zone has bare-metal access to networking, whereas a non-SmartOS KVM operating system image has to shunt through an additional layer of networking abstraction.
Brendan Gregg has done an excellent job of explaining the difference between an OS VM, a KVM VM, and a Xen VM, and why I/O is much faster in an OS VM, as in a SmartOS Zone, and why the bare-metal KVM VM outpaces the Xen hypervisor.
The bottom line is - for high-performance networking and filesystem access, a SmartOS Zone is always preferred, provided you can build your software to run on it. Yet the system is flexible enough to support any operating system as a KVM guest, and an additional layer of security is provided because KVM runs in a Zone. Kernel-optimized versions of Linux KVM images are provided by Joyent, and these are preferred.
Package management for SmartOS is handled by the pkgsrc system (from NetBSD) - the command is "pkgin". There are 9518 packages currently, including MySQL, R, and of course Perl. The base USB operating system does not ship with a package management system, but simple installs can be done quickly. Zones of the base SmartOS images have the package manager installed, as that is where one is supposed to access packages from. NFS mounts cannot be made to a Zone, however, only to the global zone. I tend to discourage my students from building spiderwebs out of NFS mounts. If you prefer an illumOS distribution that presents itself more like a CentOS server, you can try OmniOS.
Some third-party or commercial software support for SmartOS is challenging due to its bleeding-edge nature. For example, I cannot get the NCBI-supported Aspera command line client "ascp" working on SmartOS; this is the command line tool buried in the web plugin. The Oracle Solaris standalone command line client costs $750 SGD and there is no guarantee it runs on SmartOS.
However, I can run it from CentOS6 within a SmartOS KVM zone. Problem solved. Currently Blast+ Solaris x86-64 does not work on SmartOS due to a libc link divergence between Oracle Solaris and Illumos. I am working on it, as time permits.
Another key feature of SmartOS is DTrace, which provides some remarkable debugging and instrumentation systems for monitoring what is going on at all levels. I have not made it into DTrace yet, but I have had a deep look at the instrumentation system calls, and they are very useful for detecting I/O saturation; the performance monitoring is amazing. SmartOS images carry a lot of neat features, including load balancers and a variety of fully instantiated web stack setups, ready to use. SmartOS Chef installs are also explained well, and SmartOS zone images for Chef servers are supported, but Chef solo looks really useful without the baggage.
How Do I Administer SmartOS?
I'm only a part-time sysadmin and I don't tend to remember everything because my brain must task-switch between biology and computing. So I write everything down in sometimes painful and obvious detail. For my previous physical Linux infrastructure, we posted much of our stack setup info on infrastructure.blueprint.org. My notes that do not expose security details are maintained here on smartos.blueprint.org. The SmartOS infrastructure is still experimental, and it is being rolled out one zone at a time. My first priority is to get Git and my own source code management set up, and to give our students some zones to use for reorganizing our stored lab data. I test stuff out at home on my wee Lenovo IdeaCentre and beefier HP-Z210 boxes, then deploy on my lab server.
Key documents I refer to when administering SmartOS are:
The Linux to SmartOS Cheat Sheet
Brendan's USE method - a table of commands for monitoring your system - illumos or Linux.
The SmartOS Wiki
The management model I use is pretty simple. My graduate students have root access to just about everything in my stack. That is how I roll, that is how one learns devops in my lab. Right now I have one SmartOS server running in the lab. Each student gets a zone with their own password, a CPU and filesystem quota. They log in to their zone as root and muck around running TraDES and whatever pipelines they have scripted.
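One way to express a per-student zone with a CPU cap and filesystem quota is as a JSON manifest for SmartOS's vmadm tool. The sketch below builds such a manifest in Python. The property names (brand, image_uuid, quota, cpu_cap, max_physical_memory, nics) follow the vmadm manifest format as I understand it and should be checked against the vmadm man page on your release; the alias, sizes, and all-zero image UUID are hypothetical placeholders, not values from my lab.

```python
import json

# Placeholder only: substitute the UUID of an image you have imported.
PLACEHOLDER_IMAGE_UUID = "00000000-0000-0000-0000-000000000000"


def zone_manifest(alias: str, image_uuid: str, quota_gib: int,
                  cpu_cap: int, ram_mib: int) -> dict:
    """Build a manifest for a native (OS-virtualized) SmartOS zone."""
    return {
        "brand": "joyent",               # native zone; KVM guests use "kvm"
        "alias": alias,
        "hostname": alias,
        "image_uuid": image_uuid,
        "quota": quota_gib,              # ZFS quota for the zone, in GiB
        "cpu_cap": cpu_cap,              # percent of one CPU; 100 = one core
        "max_physical_memory": ram_mib,  # RAM limit, in MiB
        "nics": [{"nic_tag": "admin", "ip": "dhcp"}],
    }


if __name__ == "__main__":
    # Hypothetical student zone: one core, 100 GiB of disk, 2 GiB of RAM.
    manifest = zone_manifest("student1", PLACEHOLDER_IMAGE_UUID,
                             quota_gib=100, cpu_cap=100, ram_mib=2048)
    print(json.dumps(manifest, indent=2))
```

Saved to a file, a manifest like this would be passed to `vmadm create -f` from the global zone.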
I have two other CentOS servers that I want to migrate onto SmartOS. We have already lost about a month of time due to a RAID rebuild on one of these, and my students are getting keen about changing up. Files need to be moved first to clear them off. The multiserver GUI zone administration system that is rapidly maturing in Project FIFO, which has a wiki, architecture diagrams and full roadmap here, will provide us an in-house cloud across the three servers, which my students can use to the end of their term at NUS. The physical stack running my Linux infrastructure goes for recycling, after serving well for the past 5 years.
The HP Z210s will become my cross-platform TraDES compile boxes running KVMs, which will stay with me after I leave NUS when my contract ends. They get backed up onto a FreeNAS ZFS system I hacked together.