Information technology tools and resources at the UW
Enabling Collaborative Research: Data Management with SQLShare
UW-IT and the eScience Institute are pleased to announce recent improvements to SQLShare, a database service that helps researchers understand and use their data by removing obstacles to using relational databases. SQLShare was a project initiated by Bill Howe and a team of researchers in both the eScience Institute and the Database Group in the Computer Science & Engineering Department to address a significant pain point among UW researchers: spending a large part of their dedicated research time “handling data” rather than advancing their science.
Too Much Data
Some researchers work with data in “desktop-scale” tools such as spreadsheets and CSV files and manipulate them with simple scripts or manual copy and paste. This approach made sense five years ago when data volumes were still manageable. But if you are working with 50-100 spreadsheets, each of which might contain a million rows, you start looking for new approaches. Simultaneously, researchers are working more collaboratively and need a way to easily share data with their colleagues and publish data online; relying on email attachments when the data is changing almost daily is unsustainable. Solutions that exist, including databases, tend to incur significant setup and management costs, and most researchers have little interest in operating and maintaining a data system.
SQLShare is a web platform that allows researchers to upload their data and immediately begin to query it using SQL¹ without the friction of working with conventional databases. SQLShare frees researchers from the burden of designing a database and from ancillary tasks such as installation, configuration, schema design, tuning, data ingest, and even application design.
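To give a rough feel for the "upload and query" workflow, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in. It is not SQLShare's own interface; the `load_csv` helper, the `casts` table, and the sample data are hypothetical and exist only for illustration. Column names are taken from the CSV header, so no schema has to be designed up front.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, text):
    """Create a table from CSV text, taking column names from the header row.

    Hypothetical helper for illustration only; not part of SQLShare.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    cols = ", ".join(f'"{name}"' for name in header)
    marks = ", ".join("?" for _ in header)
    # No column types are declared: the table is created straight from the header.
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', data)


# Made-up sample data standing in for an uploaded file.
CSV_TEXT = "station,depth_m,temp_c\nA,5,12.1\nA,50,8.4\nB,5,13.0\n"

conn = sqlite3.connect(":memory:")
load_csv(conn, "casts", CSV_TEXT)

# The data can be queried as soon as it is loaded.
counts = conn.execute(
    "SELECT station, COUNT(*) FROM casts GROUP BY station ORDER BY station"
).fetchall()
print(counts)
```

In this sketch every value is stored as text, which is enough for counting and grouping; numeric work would need a cast or declared column types.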
In addition to letting researchers leapfrog time-consuming database-preparation tasks, SQLShare adds value to data sets it hosts by allowing researchers to build on their own results: the output of one query can be saved as a virtual dataset and used as the input to another query. It also facilitates collaboration by simplifying sharing and reuse through centralization: no need to distribute new versions of a processed dataset each time the processing routines change. And SQLShare provides provenance, effectively creating a digital trail of how results were obtained.
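Saving one query's output as a virtual dataset and building on it resembles what relational databases call a view. The sketch below illustrates the idea with Python's built-in sqlite3 module rather than SQLShare itself; the `samples` table, the view names, and the measurements are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (station TEXT, depth_m REAL, temp_c REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [("A", 5.0, 12.1), ("A", 50.0, 8.4), ("B", 5.0, 13.0), ("B", 50.0, 7.9)],
)

# Step 1: save a query as a virtual dataset (a view). No data is copied.
conn.execute(
    "CREATE VIEW shallow AS "
    "SELECT station, temp_c FROM samples WHERE depth_m < 10"
)

# Step 2: a second virtual dataset that takes the first one as its input.
conn.execute(
    "CREATE VIEW shallow_mean AS SELECT AVG(temp_c) AS mean_temp FROM shallow"
)


def shallow_mean(connection):
    """Return the current mean temperature of the shallow samples."""
    return connection.execute("SELECT mean_temp FROM shallow_mean").fetchone()[0]


mean_before = shallow_mean(conn)

# New data flows through both views automatically; nothing is redistributed.
conn.execute("INSERT INTO samples VALUES ('C', 5.0, 14.0)")
mean_after = shallow_mean(conn)

print(mean_before, mean_after)
```

Because each view stores only its defining query, the chain of definitions also records how a result was derived, which is a simple form of the provenance described above.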
Academic & Collaborative Applications developed the initial web-based application for SQLShare and has continued to collaborate with Bill Howe’s team to re-architect SQLShare so that it supports larger individual databases and runs on a more extensible Python/Django framework. While these initial changes will be effectively invisible to the end user, the change in foundation will make it easier going forward to support SQLShare for many more users and to add new features. Current work, slated for this spring and summer, focuses on improving performance, simplifying the user interface and user experience, and providing additional functionality. UW-IT hopes to complete these tangible improvements by the end of the fiscal year. A beta release is currently being provided to a limited set of users before SQLShare is made available more broadly.
1 Structured Query Language, the intergalactic lingua franca for manipulating data at scale. SQL is declarative, meaning that researchers need to describe only the result they want, not how that result is obtained. Like other database systems, SQLShare automatically chooses an efficient algorithm when executing a particular query, freeing the researcher from most issues of scale and efficiency. For example, unlike a typical R or Python script, which loads its data into memory, a SQL query does not normally fail with “out of memory” errors as the data grows, because the database manages memory and disk itself.
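The declarative point can be seen in a small sketch, again using Python's built-in sqlite3 module as a stand-in for SQLShare. The `samples` table, the `idx_station` index, and the `query_plan` helper are hypothetical. The query text never changes, yet the database picks a different execution strategy once an index exists.

```python
import sqlite3

# The researcher states only the desired result, not how to compute it.
QUERY = "SELECT COUNT(*) FROM samples WHERE station = 'A'"


def query_plan(connection, sql):
    """Return SQLite's description of how it intends to run a query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the human-readable plan step.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (station TEXT, temp_c REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("A", 12.1), ("B", 13.0), ("A", 8.4)],
)

plan_before = query_plan(conn, QUERY)  # a full scan of the table
count_before = conn.execute(QUERY).fetchone()[0]

# Add an index; the query itself is left untouched.
conn.execute("CREATE INDEX idx_station ON samples (station)")

plan_after = query_plan(conn, QUERY)  # now a search that uses the index
count_after = conn.execute(QUERY).fetchone()[0]

print(plan_before)
print(plan_after)
print(count_before, count_after)
```

The exact wording of the plan text varies between SQLite versions, so it is best read as a diagnostic rather than a stable format. The answer is the same in both cases; only the strategy chosen by the database differs.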