Manta Unleashed

26 June 2013
Computing on Storage - It's All About The Implementation.
Today Joyent, the makers of SmartOS and Node.js, released the Manta Storage Service. It made quite a splash in the tech press today. I had an early view of Manta back in May, when I visited Joyent HQ for the DTrace course. I got to spend some time talking with the engineering team discussing the possibilities.
Manta is a cloud storage platform that uses extensions at the operating system's file system layer to present a remarkably scalable storage system that can directly support POSIX compliant compute tasks. The coverage of the Manta release in the cloud computing and IT press has been broad, and Manta is well explained in the links section of this post (bottom on mobile, left on desktop).
The implications for life sciences research and genomic computing are things I want to draw out in this blog post.
Now, cutting through the everyday IT tech hype is not always easy. I have spent about a year and a half going over the technology offerings of companies that offer cloud-based genomics and bioinformatics analysis, and have taken a hard look at the infrastructure they are built on.
But let me go back a little further first.
Implementation is the key to scale, as I explained in 2001. Back in 2000 we naively wrapped blast2seqs to do NxN protein sequence comparisons to generate pre-computed lookup tables, on what was a very large cluster at the time. We first tried embedding REST I/O calls into blast2seqs via an elaborate homebrew API called MoBiDiCK, forming "moblast". Well, we tossed it out - it was not up to the performance demands. The projection for moblast NxN job completion was 355 days. Oops.
So Michel Dumontier went in and hacked the blastp core to stripe searches on the lower-triangle of the nr BLAST protein database and push output directly onto a local storage database that we linked into it, instead of funnelling it over the network to the centralized server. Then a map-reduce style aggregation collected up all the NxN stripes appending them into the complete database. For input, there was a BLAST database that was updated on every node, and for output, a scalable hash used for pairwise result lookups. The nblast paper is here, and it worked, completing the NxN job in 24 hours on the same cluster as the failed moblast.
More importantly it was capable of running in short bursts to keep up with daily updates. So our second implementation worked far better than the first. This made us the only group other than NCBI to have solved the NxN BLAST problem of computing and storing lookup tables, the secret-sauce behind sequence neighboring in the SeqHound API service. (What happened to it? Thomson-Reuters now owns it, let it bit-rot, tucked it in a shoe box, and left it under the bed somewhere, don't get me started...)
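For readers who want the shape of that second approach, here is a minimal Python sketch of striping the lower triangle of an NxN comparison across nodes and then merging the per-node results. It is an illustration only, not the nblast code: the function names, the round-robin row assignment, the exclusion of the diagonal and the toy `compare` scoring function are all assumptions made for the example, and a plain dictionary stands in for the local storage database on each node.

```python
def stripe(n, num_nodes, node):
    """Return the (i, j) pairs with j < i that one node is responsible for.

    Rows of the lower triangle are dealt out round-robin (an assumption),
    so every unordered pair is computed exactly once across all nodes.
    """
    return [(i, j) for i in range(node, n, num_nodes) for j in range(i)]


def compare(a, b):
    """Toy stand-in for a pairwise alignment score: count identical positions."""
    return sum(x == y for x, y in zip(a, b))


def run_node(seqs, num_nodes, node):
    """Compute one node's stripe and keep the results locally (here, a dict)."""
    return {(i, j): compare(seqs[i], seqs[j])
            for i, j in stripe(len(seqs), num_nodes, node)}


def aggregate(partials):
    """Reduce step: append every node's stripe into one lookup table."""
    table = {}
    for partial in partials:
        table.update(partial)
    return table


# Hypothetical input: five short sequences spread over three nodes.
seqs = ["ACDE", "ACDF", "WCDE", "ACDE", "MKLV"]
partials = [run_node(seqs, 3, node) for node in range(3)]
table = aggregate(partials)
```

The point of the layout is that each node only ever writes to its own store while it computes, and the network is used once, at aggregation time.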
So what does this have to do with Manta? 
Well, in this crude and early case, the winning solution was POSIX compliant code running on nodes with all I/O performed on local storage, eliminating the bottleneck of network transfer. It really looked nothing like Manta, but running big data computations on local storage nodes became a prominent fixation for me. I got very sensitive to the idea of moving data around to compute on it. And I got sensitive to replacing POSIX I/O with homebrew APIs. This is, after all, the job of the operating system.
So, fast forward to 2010.
The fundamental idea behind Manta, moving computing to local data storage, is not new. Appistry released an implementation of the idea with the trademarked name Computational Storage(tm) back in 2010, and have built their SaaS platform on it for genome analysis. You cannot run an arbitrary job on it, because it is simply not IaaS, but it is out there. Appistry made the SDK docs available for light reading, so I went through them last year. The first thing I realized is that their implementation demands a rewrite of application source code to replace POSIX compliant I/O calls with proprietary CloudIQ API calls (supporting only Java, Perl, .NET and C/C++ programs). That is a lot of customization work across a pipeline, and forget running R code. As noted above, with moblast/nblast, been there with the I/O API layer, done that, lessons learned.
Now Appistry's genome analysis throughput numbers validate the concept, and the Computational Storage(tm) implementation seems to have served them quite well in terms of SaaS performance.
But Manta is a game-changing implementation of this idea in part because it is wide open for any application you want to spin up on IaaS.
Manta's "move-compute-to-storage" works at the level of the cloud operating system, not as an API layer. The innovation is down the stack, meaning POSIX compliant I/O works without replacing conventional calls with APIs. This also means it is not limited to only a few programming language implementations. It will work with any UNIX compilable or interpretable code that can run on SmartOS/illumOS. More technical details are provided in the Joyent Developer Blogs listed here (below on mobile, to the right on desktop).
So what does it take to get started on Manta to run bioinformatics/genomics codes? 
Conventional UNIX and GNU tools (awk, sed, grep), Perl, R scripts and other interpreted languages aside, platform dependent executables (e.g. C/C++) must be compiled to run on SmartOS. Manta distributes compute jobs inside SmartOS illumOS zone based VMs (and these are not possible with Linux KVMs).
SmartOS/illumOS support for compiled sequence assemblers and other bioinformatics packages is, at the moment, sparse to nonexistent. Aside from me, not a lot of other scientists are testing/building on illumOS platforms. Joyent has been steadily porting UNIX/Linux apps onto SmartOS and there are over 10,000 now on the pkgsrc repository, including most major programming languages. So there is definitely enough to get started with. Additional bioinformatics programs and packages that require cross-compilation onto illumOS/SmartOS will need some porting and package integration work. This will take some time, but I have started compiling and prioritizing a target list, so stay tuned, and if you need help, ask.
Manta - So What Can You Do With It?
A few initial thoughts not already discussed with regard to Manta, just to get you thinking.
Rolls Royce and GE both have extensive live monitoring systems to check on the ongoing health of their jet engines, collecting data and looking for anomalies across the entire fleet as you fly. If you are building a small internet-aware medical device for mass market, a mobile e-health application or some massive Raspberry Pi / Arduino based sensor network (aka The Internet of Things), and you want to push device data onto storage where you can carry out fast distributed analysis across your own fleet with simple Unix tools, Manta is what you are looking for.
If I push my collection of research paper PDF files onto Manta, what happens when I allow others to run data mining across my collection? Watch the Manta video run an analysis on Shakespeare's collection of plays. So, then - what happens when a group of researchers federate their collections together and allow anyone to run a Manta analysis that spans multiple collections? 
Yes, I am wondering out loud whether Manta can be used to provide the full-text data mining service that paywall scientific publishers keep denying to scientists.  If we can store our private PDFs in our own collections in the cloud (like a Mendeley collection), and they don't MOVE and the copyright owned PDF bits are not themselves exported out of the analytical VM that visits the collection...  Hmm. Well. I suspect federated data mining on Manta can be done without violation of terms of use that are already established on cloud publication collection services. But hey, don't take legal advice from me.
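To make the shape of that idea concrete, here is a small Python sketch of a map-then-reduce analysis in which only derived term counts leave each document, never the document text itself. It is a thought experiment in code, not Manta's job interface: the collection names, the sample strings and the `map_phase` / `reduce_phase` functions are hypothetical, and in-memory strings stand in for stored PDFs.

```python
import re
from collections import Counter


def map_phase(text):
    """Run next to a single document and emit only its term counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def reduce_phase(partial_counts):
    """Merge the per-document counts from every participating collection."""
    totals = Counter()
    for counts in partial_counts:
        totals.update(counts)
    return totals


# Hypothetical federated collections, each owned by a different group.
collections = {
    "lab_a": ["Manta stores objects.", "Objects stay put."],
    "lab_b": ["Compute visits the objects."],
}

# The map step visits each document where it lives; only counts are returned.
partials = [map_phase(doc) for docs in collections.values() for doc in docs]
totals = reduce_phase(partials)
```

Whether emitting counts like these stays within a given publisher's terms of use is exactly the open question raised above; the sketch only shows that the analysis itself does not need the documents to move.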
Finally, like Appistry, there is a small crowd of startup companies offering to do medical and personal genomics analysis with combinations of in-house and cloud based computing: 23andMe, Bina, Complete Genomics, DNAnexus, Foundation Medicine, GoodStart Genetics, Genome Health Solutions, InVitae and Knome. Some focus on technology platforms and price, others on relevant clinical expertise. Academics in the open-source community are also competing in this space, and they are penny-pinchers, usually gravitating to the sweet spot in price/performance. All of these groups have the same set of long-term problems to solve with regard to genomic data storage, reference data storage, analysis, re-analysis and dealing with the future of data scale. With bigger bioinformatics application support still to come, Manta will enable small players to put together cost-effective, auditable and high-performance pipelines that previously required several million dollars in startup costs. How will this landscape shake out? Watch this space.
Manta is only 24 hours out of the aquarium and it is already very interesting.
Manta photos taken at the South East Asia (SEA) Aquarium.
Manta Coverage, How-to's and Use Cases
Joyent Manta home page with video
Manta Pricing
Joyent Engineering Team Blogs about Manta.
Mark Cavage, the feature lead, on his big idea: Hello, Manta: Bringing Unix to Big Data
Bryan Cantrill on the genesis of Manta and the list of credits: Manta: From revelation to product
Brendan Gregg on driving R image analysis off Manta: Manta: Unix Meets Map Reduce
Keith Wesolowski on the hardware: Manta - What Lies Beneath
Dave Pacheco on Zones, ZFS and hyperlofs: Inside Manta: Distributing the Unix shell
manta-compute-bin is a collection of utilities that are on the $PATH in a Manta compute job.
Manta Press/Blog Coverage:
Ericsson, with video, on their beta-testing with voice data and on the GuardTime tamper-evident auditing model deployed by Joyent.