From 30 Years to a Matter of Weeks
Hyak’s Compute Power and Speed Push Research Forward
David Beck is an eScientist, a member of the UW Chemical Engineering faculty, and a bioinformaticist who works with the Lidstrom and Hackett labs. The Lidstrom lab focuses on genomics and transcriptomics research; the Hackett lab focuses on proteomics and transcriptomics.
Much of the work of both labs emphasizes genome-wide investigations, which benefit from the creative application of large-scale compute resources. In simple terms, the work is to collect raw data from the environment or in vitro mimics of environmental conditions, then to collate, assemble and analyze it. All of these tasks are computationally intensive and require significant data storage space.
Beck is also the creator of pdqBLAST, a software tool developed through a collaboration between the eScience Institute and the Lidstrom and Hackett labs, using Hyak. pdqBLAST has enabled the Lidstrom lab to run “all against all” queries on databases containing tens of millions of sequences.
The Lidstrom lab is interested in biogeochemical cycling: how methane gets into the atmosphere, how it is removed from the atmosphere, and how CO2 can be incorporated from the atmosphere. Beyond understanding these processes from a microbiological perspective, there is potential to harness some of these biological entities for biofuel production.
The Hackett lab performs a range of proteomics and transcriptomics work, from dental pathogen research to biogeochemical recycling.
The Computational Challenge
To illustrate the power of Hyak, Beck described a recent, typical job that he ran with pdqBLAST for the Lidstrom lab. The job involved two data sets: 60 million unattributed RNA sequences, which had to be run against 10 million known protein sequences. The goal was to identify proteins that may be derived from the RNA sequences. pdqBLAST provides an infrastructure for running this query on Hyak, taking advantage of Hyak’s significant storage capacity, bandwidth and computational power.
Superior Storage and Bandwidth
Both data sets reside in the central data store and are pushed from there to the compute nodes, 70 in this case. Each node received a copy of the complete set of known protein sequences and a subset of the unattributed RNA sequences. Hyak’s bandwidth accommodates this simultaneous push from central storage to all 70 nodes.
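The article does not say how pdqBLAST divides the RNA sequences among nodes, so the following is only a minimal sketch of one plausible scheme: an even, contiguous split of the query indices, with every node assumed to hold the full protein set. The function name partition_queries is hypothetical and is not part of pdqBLAST.

```python
def partition_queries(num_queries: int, num_nodes: int) -> list[range]:
    """Split query indices 0..num_queries-1 into num_nodes contiguous ranges
    whose sizes differ by at most one.

    Hypothetical helper for illustration; not pdqBLAST's actual scheme.
    """
    base, extra = divmod(num_queries, num_nodes)
    chunks = []
    start = 0
    for node in range(num_nodes):
        # The first `extra` nodes each take one additional query.
        size = base + (1 if node < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


# Figures from the article: 60 million RNA sequences spread over 70 nodes.
chunks = partition_queries(60_000_000, 70)
largest = max(len(c) for c in chunks)
smallest = min(len(c) for c in chunks)
```

Under this assumed split, each node would handle roughly 857,000 RNA sequences against its local copy of the 10 million protein sequences.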
The query ran on these 70 nodes, with 8 cores per node, for 20 days. Multiplying 70 nodes by 8 cores, by 20 days and by 24 hours per day gives 268,800 core-hours: the total compute time the job required. If it were possible (or desirable) to run such a query on a single core, the job would take about 30 years to complete. Hyak made that computation possible in a matter of weeks due to its enormous storage capacity, its bandwidth and its compute power.
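The arithmetic behind the 30-year figure can be checked with a few lines of Python. This is a rough sketch that assumes a 365-day year and that the work would run on one core at the same per-core speed; the constant names are illustrative.

```python
NODES = 70
CORES_PER_NODE = 8
DAYS = 20
HOURS_PER_DAY = 24

# Total compute time summed over every core that took part in the job.
core_hours = NODES * CORES_PER_NODE * DAYS * HOURS_PER_DAY

# Time the same work would take on one core (assumes a 365-day year).
HOURS_PER_YEAR = 365 * HOURS_PER_DAY
single_core_years = core_hours / HOURS_PER_YEAR

print(core_hours)                   # 268800
print(round(single_core_years, 1))  # 30.7
```

The result, a little under 31 years, matches the article's rounded figure of 30 years.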
pdqBLAST queries of this kind are well suited to Hyak, which can run week-long queries across large numbers of nodes. Such jobs are difficult to arrange on existing systems at national supercomputer centers. As these bioinformatics workloads scale up, Hyak serves as a valuable development platform for the next generation of petascale data-intensive computers now in development.