rCAD: The RNA Comparative Analysis Database (3)
Doshi K.J., Gutell R.R., and Ozer S. (2010) rCAD: The RNA Comparative Analysis Database. Some of the images on this site are adapted from that manuscript.
With rapid advances in sequencing technology, the size of the RNA datasets to be analyzed with Comparative Analysis is growing quickly. The scientific community lacks adequate tools to apply Comparative Analysis to large RNA datasets. The Gutell Lab is actively developing a software infrastructure to support the Comparative Analysis of large RNA datasets, a task for which it is well suited given its experience in the Comparative Analysis of Ribosomal RNA (1).
The architecture of the software infrastructure under development by the Gutell Lab is radically different from most Computational Biology and Bioinformatics applications. Generally, Computational Biology and Bioinformatics applications follow a "pipeline" architecture and operate in-memory, loading their data from flat files. The burden falls to the application developer to manage resources in their application (threading, memory, and input/output). The architecture for our infrastructure follows a layered, service-oriented approach. We eliminate flat files and replace them with a data management and analysis system (rCAD) implemented inside the enterprise relational database, Microsoft SQL Server 2008. The rCAD database schema is divided into dimensions analogous to the major data elements in an RNA dataset used for Comparative Analysis. The biological entities stored in the different dimensions of the rCAD database are: sequences, sequence alignments (as sparse matrices), structure models and evolutionary relationships. The relationships between the different dimensions mimic the biological relationships between the entities.
Comparative Analysis algorithms execute inside rCAD and are implemented as either declarative SQL queries or compiled applications that execute within the database engine memory space, a feature of SQL Server. The enterprise database engine is left to manage the threading, memory, input/output, replication and clustering. Given the large size of RNA datasets now and anticipated in the future, the ability to develop analysis algorithms that do not have to load all the data into memory a priori is a significant advantage for our infrastructure.
Layered on top of the rCAD system is a graphical user interface, the Comparative Analysis Toolkit User Interface (CATUI). The CATUI enables the Biologist to perform the Comparative Analysis in an interactive manner. CATUI and rCAD interact in a client/server manner. The advantage of this implementation is that the CATUI can be executed on a separate computer from the rCAD system. For example, rCAD could be deployed in a cloud such as Azure while the CATUI executes on the Biologist's laptop. Normally, a Biologist's laptop would not have enough resources (processing power, memory) to visualize an alignment of 100,000 sequences if the whole alignment had to be loaded into memory. With our infrastructure, the CATUI only renders "viewports" into the alignment, obtained through SQL queries, which significantly reduces the memory footprint of the CATUI.
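The viewport idea can be illustrated with a minimal sketch. It assumes a hypothetical sparse table named alignment_cell with row_idx, col_idx and base columns, and it uses SQLite through Python's sqlite3 module as a stand-in for SQL Server so that the example is self-contained. None of these names come from the actual rCAD schema or the CATUI code.

```python
import sqlite3


def build_demo_alignment(n_rows=1000, n_cols=50):
    """Create an in-memory sparse alignment; gaps are simply absent rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE alignment_cell ("
        " row_idx INTEGER, col_idx INTEGER, base TEXT,"
        " PRIMARY KEY (row_idx, col_idx))"
    )
    bases = "ACGU"
    cells = [
        (r, c, bases[(r + c) % 4])
        for r in range(n_rows)
        for c in range(n_cols)
        if (r + c) % 3 != 0  # synthetic data: every third cell is a gap, not stored
    ]
    conn.executemany("INSERT INTO alignment_cell VALUES (?, ?, ?)", cells)
    return conn


def fetch_viewport(conn, first_row, first_col, n_rows, n_cols):
    """Return only the requested window of the alignment as gapped strings."""
    # Start from a window full of gaps, then overwrite the stored nucleotides.
    grid = [["-"] * n_cols for _ in range(n_rows)]
    query = (
        "SELECT row_idx, col_idx, base FROM alignment_cell"
        " WHERE row_idx BETWEEN ? AND ? AND col_idx BETWEEN ? AND ?"
    )
    bounds = (first_row, first_row + n_rows - 1,
              first_col, first_col + n_cols - 1)
    for row_idx, col_idx, base in conn.execute(query, bounds):
        grid[row_idx - first_row][col_idx - first_col] = base
    return ["".join(line) for line in grid]


conn = build_demo_alignment()
for line in fetch_viewport(conn, first_row=500, first_col=10, n_rows=3, n_cols=5):
    print(line)
```

The client only ever holds the cells inside the requested window, so its memory use depends on the size of the viewport rather than on the size of the alignment.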
The rCAD database schema is split into four dimensions that mimic the dimensions in an RNA dataset for Comparative Analysis: Sequence Metadata, Sequence Alignment, Structural Relationships and Evolutionary Relationships.
2a. Sequence Metadata Dimension/Compartment
The Sequence Metadata dimension contains information about all the sequences stored in an rCAD system. This includes the RNA type (e.g., 5S Ribosomal RNA), the Cell Location (e.g., Nucleus), and the NCBI GenBank Accession identifier. The Sequence Metadata dimension is related to the Evolutionary Relationships dimension so that each sequence is associated with its taxonomy.
2b. Evolutionary Relationships Dimension/Compartment
The Evolutionary Relationships dimension provides the set of evolutionary relationships that correspond to the Tree of Life, as disseminated at NCBI Taxonomy (http://www.ncbi.nlm.nih.gov/Taxonomy/). Each sequence in the Sequence Metadata dimension is mapped onto the Tree of Life through the TaxID.
2c. Sequence Alignment Dimension/Compartment
The Sequence Alignment dimension stores the physical sequences organized as alignments. Sequence alignments are stored as sparse matrices at the resolution of individual nucleotides, and gaps are inferred at query time (3). An indirection mechanism is used to support global alignment topology modifications (e.g., column insertions) within the database (3). The SeqID is used as a key to relate metadata from the Sequence Metadata dimension to any sequence in any alignment. A view (vAlignmentGrid) provides access to a sequence alignment through a common row/column interface. A second view (vAlignmentGridUngapped) provides an interface for direct access to the sequence data only.
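Sparse storage with gaps inferred at query time, combined with column indirection, can be sketched as follows. This is an illustration under stated assumptions, not the rCAD implementation: the tables alignment_column and alignment_cell, their columns, and the helper functions are hypothetical, and SQLite (via Python's sqlite3 module) stands in for SQL Server.

```python
import sqlite3


def build_demo():
    """Two aligned sequences, 'GC-U' and 'G-AU', stored without their gaps."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Indirection: a stable column id mapped to its current display position.
        CREATE TABLE alignment_column (
            col_id INTEGER PRIMARY KEY,
            position INTEGER NOT NULL
        );
        -- Sparse matrix: one row per nucleotide, no rows for gaps.
        CREATE TABLE alignment_cell (
            seq_id INTEGER, col_id INTEGER, base TEXT,
            PRIMARY KEY (seq_id, col_id)
        );
    """)
    conn.executemany("INSERT INTO alignment_column VALUES (?, ?)",
                     [(1, 1), (2, 2), (3, 3), (4, 4)])
    conn.executemany("INSERT INTO alignment_cell VALUES (?, ?, ?)",
                     [(1, 1, "G"), (1, 2, "C"), (1, 4, "U"),
                      (2, 1, "G"), (2, 3, "A"), (2, 4, "U")])
    return conn


def render_row(conn, seq_id):
    """Rebuild the gapped row: a column with no stored cell becomes '-'."""
    rows = conn.execute("""
        SELECT COALESCE(c.base, '-')
        FROM alignment_column AS col
        LEFT JOIN alignment_cell AS c
               ON c.col_id = col.col_id AND c.seq_id = ?
        ORDER BY col.position
    """, (seq_id,)).fetchall()
    return "".join(r[0] for r in rows)


def insert_column_at(conn, position):
    """Insert an empty column by touching only the small indirection table."""
    conn.execute(
        "UPDATE alignment_column SET position = position + 1 WHERE position >= ?",
        (position,))
    cur = conn.execute(
        "INSERT INTO alignment_column (position) VALUES (?)", (position,))
    return cur.lastrowid


conn = build_demo()
print(render_row(conn, 1), render_row(conn, 2))
insert_column_at(conn, 2)
print(render_row(conn, 1), render_row(conn, 2))
```

In this sketch a column insertion rewrites rows in alignment_column only. The nucleotide rows in alignment_cell are untouched, which is the point of routing column order through an indirection table.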
The integration between the Sequence Metadata, Sequence Alignment and Evolutionary Relationships dimensions of the rCAD database can be used to create declarative SQL routines that extract specific subsets of an alignment and then use them for analysis or visualization. The example query below selects a subset of an alignment that contains only Bacterial sequences:
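A minimal sketch of such a subset query, assuming a simplified schema. The table and column names (taxonomy, sequence_metadata, alignment_cell) are hypothetical rather than the actual rCAD schema, the lineage is abbreviated, the accession labels are placeholders, and SQLite (through Python's sqlite3 module) stands in for SQL Server 2008. The TaxID values follow NCBI Taxonomy (2 for Bacteria). SQLite requires the RECURSIVE keyword here; T-SQL writes the same kind of common table expression without it.

```python
import sqlite3

# Walk the taxonomy tree down from a root taxon, then keep only the
# alignment cells of sequences whose TaxID falls inside that subtree.
SUBSET_SQL = """
WITH RECURSIVE lineage(tax_id) AS (
    SELECT tax_id FROM taxonomy WHERE name = :root
    UNION ALL
    SELECT t.tax_id
    FROM taxonomy AS t
    JOIN lineage AS l ON t.parent_tax_id = l.tax_id
)
SELECT s.seq_id, c.col_idx, c.base
FROM sequence_metadata AS s
JOIN lineage AS l ON l.tax_id = s.tax_id
JOIN alignment_cell AS c ON c.seq_id = s.seq_id
WHERE c.aln_id = :aln_id
ORDER BY s.seq_id, c.col_idx
"""


def build_demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE taxonomy (tax_id INTEGER PRIMARY KEY,
                               parent_tax_id INTEGER, name TEXT);
        CREATE TABLE sequence_metadata (seq_id INTEGER PRIMARY KEY,
                                        tax_id INTEGER, accession TEXT);
        CREATE TABLE alignment_cell (aln_id INTEGER, seq_id INTEGER,
                                     col_idx INTEGER, base TEXT);
    """)
    conn.executemany("INSERT INTO taxonomy VALUES (?, ?, ?)", [
        (1, None, "root"),
        (2, 1, "Bacteria"),
        (1224, 2, "Proteobacteria"),
        (562, 1224, "Escherichia coli"),  # intermediate ranks omitted
        (2759, 1, "Eukaryota"),
        (9606, 2759, "Homo sapiens"),     # intermediate ranks omitted
    ])
    conn.executemany("INSERT INTO sequence_metadata VALUES (?, ?, ?)", [
        (1, 562, "PLACEHOLDER_A"),
        (2, 9606, "PLACEHOLDER_B"),
        (3, 1224, "PLACEHOLDER_C"),
    ])
    conn.executemany("INSERT INTO alignment_cell VALUES (?, ?, ?, ?)", [
        (1, 1, 1, "G"), (1, 1, 2, "C"),
        (1, 2, 1, "G"), (1, 2, 2, "U"),
        (1, 3, 1, "A"), (1, 3, 2, "C"),
    ])
    return conn


def select_subset(conn, root_name, aln_id):
    """Return (seq_id, col_idx, base) rows for sequences under root_name."""
    return conn.execute(SUBSET_SQL, {"root": root_name, "aln_id": aln_id}).fetchall()


for row in select_subset(build_demo(), "Bacteria", 1):
    print(row)
```

Changing the root name passed to select_subset selects a different branch of the tree without any change to the query itself.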
2d. Structural Relationships Dimension/Compartment
The Structural Relationships dimension includes any known secondary structure model for any individual sequence. The secondary structure base pairs are stored in their own table. The other structural elements (helices, hairpin loops, internal loops, multistem loops, bulge loops and pseudoknots) are stored in a separate table. All structural relationships can be projected across a sequence alignment through the SeqID and _AlnID keys.
3. Comparative Analysis Algorithms
Arguably the most important feature of the rCAD database in SQL Server 2008 is the ability to implement different Comparative Analysis algorithms as declarative SQL queries or C# compiled applications that execute within the database memory space. Executing within the database memory space is a key benefit of our architecture because it eliminates the need to transfer large amounts of data from the database execution context to a separate analysis application execution context.
One category of RNA Comparative Analyses that directly utilizes the integrated dimensions of the rCAD database is RNA structural statistics. A comprehensive structural statistics presentation is provided at http://www.rna.ccbb.utexas.edu/SAE/2D (Gardner DP, manuscript in preparation). In structural statistics analyses, different RNA structural elements such as a base pair, a helix or a hairpin loop are projected across either an entire sequence alignment or a subset of a sequence alignment, and statistics are compiled on the frequency of the different sequence combinations observed.
The example query below demonstrates a base pair frequency structural statistic as a declarative SQL query. The base pair is projected across the alignment by identifying the columns of the alignment associated with the 5' and 3' nucleotides of the base pair in a specific reference sequence. The example query projects the base pair across all Bacterial sequences in the alignment.
For this example structural statistic, SQL Server handles all the memory management, threading and I/O. It is not necessary to develop a standalone application that loads the entire alignment into memory (which may be impractical without access to significant hardware resources).
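The base pair frequency statistic described above can be sketched as a single declarative query. This is an illustration under assumptions: the tables alignment_cell and base_pair and their columns are hypothetical, the taxonomy restriction to Bacterial sequences is left out to keep the sketch short, and SQLite (through Python's sqlite3 module) stands in for SQL Server.

```python
import sqlite3

# Map the 5' and 3' positions of the reference sequence to alignment columns,
# then count the nucleotide combinations found in those two columns.
PAIR_FREQUENCY_SQL = """
SELECT five.base || ':' || three.base AS pair, COUNT(*) AS n
FROM base_pair AS bp
JOIN alignment_cell AS ref5
  ON ref5.seq_id = bp.ref_seq_id AND ref5.seq_pos = bp.pos5
JOIN alignment_cell AS ref3
  ON ref3.seq_id = bp.ref_seq_id AND ref3.seq_pos = bp.pos3
JOIN alignment_cell AS five
  ON five.col_idx = ref5.col_idx
JOIN alignment_cell AS three
  ON three.col_idx = ref3.col_idx AND three.seq_id = five.seq_id
WHERE bp.ref_seq_id = ? AND bp.pos5 = ? AND bp.pos3 = ?
GROUP BY five.base, three.base
ORDER BY n DESC, pair
"""


def build_demo(aligned):
    """Load gapped strings as sparse cells, tracking each ungapped position."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE alignment_cell (seq_id INTEGER, col_idx INTEGER,
                                     seq_pos INTEGER, base TEXT);
        CREATE TABLE base_pair (ref_seq_id INTEGER, pos5 INTEGER, pos3 INTEGER);
    """)
    for seq_id, text in aligned.items():
        seq_pos = 0
        for col_idx, ch in enumerate(text, start=1):
            if ch == "-":
                continue  # gaps are not stored
            seq_pos += 1
            conn.execute("INSERT INTO alignment_cell VALUES (?, ?, ?, ?)",
                         (seq_id, col_idx, seq_pos, ch))
    return conn


def pair_frequencies(conn, ref_seq_id, pos5, pos3):
    return conn.execute(PAIR_FREQUENCY_SQL, (ref_seq_id, pos5, pos3)).fetchall()


# Toy alignment; sequence 1 is the reference and its nucleotides 1 and 4 pair.
demo = build_demo({1: "G-AAC", 2: "GUAAC", 3: "A-AAU", 4: "G-AAU", 5: "G-AA-"})
demo.execute("INSERT INTO base_pair VALUES (1, 1, 4)")
for pair, n in pair_frequencies(demo, 1, 1, 4):
    print(pair, n)
```

Because the joins are inner joins, a sequence with a gap in either projected column (sequence 5 in the toy data) does not contribute to the counts. Whether gapped sequences should be reported separately is a design choice this sketch does not address.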
The rCAD Utilities are a set of software applications that facilitate the provisioning and maintenance of an rCAD database. The rCAD Utilities should be installed on the same computer as SQL Server 2008. All of the utilities can use SQL Server Integration Services to further automate their tasks, but only the rCAD Taxonomy Updater utility requires it. The tasks automated by the rCAD Utilities are creating a new rCAD database, loading RNA alignments and updating the taxonomy.
The individual applications within the rCAD Utilities are:
4a. rCAD Creator Utility
The rCAD Creator Utility creates a new rCAD database in an instance of SQL Server 2008. It creates all the database tables, views, indices and two user-defined functions needed for a fully functional rCAD database. If you have Integration Services installed and SQL Server 2008 Standard or higher, the process is completely automated. If you have SQL Server 2008 Express, the rCAD Creator Utility outputs a SQL script that must be executed with the SQLCMD command line application to create the rCAD database. See the documentation here.
4b. rCAD Alignment Loader Utility
The rCAD Alignment Loader loads pre-formatted RNA datasets into an rCAD database. Currently, a few Ribosomal RNA datasets are provided at the CRW Site. In the future, the rCAD Alignment Loader will support loading ad hoc RNA datasets. As with the rCAD Creator Utility, the process is completely automated if you have Integration Services and SQL Server 2008 installed. Otherwise, the rCAD Alignment Loader utility outputs a SQL script and a set of formatted flat files. The script is executed in SQLCMD to load the alignment from the formatted flat files. See the documentation here.
4c. rCAD Taxonomy Updater Utility
The rCAD Taxonomy Updater automatically synchronizes the Evolutionary Relationships dimension with the NCBI Taxonomy Database (http://www.ncbi.nlm.nih.gov/Taxonomy/). The rCAD Creator Utility includes a set of Evolutionary Relationships that is installed when the rCAD database is created; however, these relationships will be out of date. The rCAD Taxonomy Updater can be used to update any installed rCAD database. It requires SQL Server 2008 Standard or higher, Integration Services, and an Internet connection (with the FTP port open). See the documentation here.