Technologies Offered by the
Innovation Centre
The Genome Alberta Bioinformatics Innovation Centre (IBP) currently
supports a number of critical bioinformatics activities, including
sequence clustering and assembly, EST and metagenome annotation,
metabolomic data processing and interpretation, genome annotation,
custom programming, and proteomic and protein structure analysis.
As with past initiatives, there will likely be heavy demands on the
IBP’s Help Desk for database support and custom programming tasks. We
also foresee a need to continue the IBP’s training program to train
new Highly Qualified Personnel and create Power Users who can use the
IBP tools to the full. A solid foundation exists on which the ABC
projects can build. As the last hardware refresh occurred several
years ago, we plan to replace some end-of-life infrastructure so that
the systems can be operated until 2014, when the ABC grants are
projected to terminate.
The Genome Alberta Innovation Center provides the following
technologies and services:
1. Data Infrastructure (relevant to all ABC grants, as well as the
existing Comp. III grants)
The hardware component of the IBP is used for various tasks, including
data storage management; high-throughput (High Performance Computing)
analysis; custom programming and testing; web server support; and data
integration, interoperability, and visualization. These core
technologies are deployed on a robust and secure computational hardware
installation (mostly UNIX based), and are supplemented by access to
computational resources worldwide (especially via BioMOBY) [Wilkinson &
Links, 2002; The BioMoby Consortium, 2008].
1.1 Computational Hardware.
The Bioinformatics Innovation Center has substantial computing
resources, currently including a Sun Fire 6800 server powered by twenty
1.2GHz UltraSPARC IV processors and four 1.5GHz UltraSPARC IV+
processors, a computing cluster consisting of 24 Sun Fire V210 servers
with dual UltraSPARC IIIi processors, and eight TimeLogic DeCypher
engines for high-throughput bioinformatics database searches. The
server infrastructure is located at the Sun Center of Excellence (COE)
for Visual Genomics in Calgary (see Figure 1 below for the current
infrastructure in place at the COE). In addition, the HelpDesk in
Edmonton maintains a Linux cluster of twelve dual-processor 1.5 GHz
nodes (24 CPUs) to support more than 30 web services and databases.
1.2 Long-term data storage. (Relevant to all Genome Canada funded
Projects)
The Bioinformatics Platform provides long-term data storage and
retrieval facilities for data, including:
Sequence and quality information for raw reads.
Sequence assembly versions.
Image data (FISH, microarrays, etc.).
Other electronic records (e.g. logs for well-bore core samples used).
The archival storage of the Bioinformatics Innovation Center is based
on SAM-QFS, a high-performance archival system that provides both
long-term retention and high availability. The SAM-QFS policy creates a
total of three file archives at pre-defined intervals to ensure that
backups exist for both recent and long-term changes. The three copies
reside on the SATA array, on tape in a StorageTek L700 library, and in
a VPN-connected remote data center on a fibre-attached StorageTek
L700e, respectively. This triple redundancy protects us against a major
data center or building disaster. The storage facility can be upgraded
to meet the needs of the ABC projects without a server outage.
Dedicated storage components are included in several of the ABC
proposals; these will be used for the data of the respective ABC grants
that requested them. It is our policy to store on an ongoing basis only
data that cannot be deposited in public repositories (in accordance
with Genome Canada data release policies). This minimizes the amount of
long-term storage needed.
2. Database design and management (All Genome Canada funded
projects)
Databases are the organizational and communications backbone of any
genomics project, giving workers a "big picture" view of the project as
a whole. All critical raw data are entered into the database by lab
workers, either through web-based data entry forms or by uploading
spreadsheets. In turn, all members and projects will have secure access
to the unified dataset through web-based clients. In this way, the
database will become the first point of communication across the
platform. In a sense, if a lab worker has to pick up the phone to find
a strain, primer, or sequence, the database has failed.
The integration of data from all subprojects allows workers in all labs
to get a better picture of what the other labs are doing, and promotes
the sharing of data and samples across subprojects. Databases take raw
data, and structure them into knowledge by describing relations between
data objects. For example, an engineered strain contains a construct
within a bacterial species, and the construct may contain a gene
encoding a protein that produces a metabolite. Importantly, because
these relations, and therefore knowledge, are encoded in the database,
the database ensures a continuity of knowledge as people cycle in and
out of the lab, and preserves that knowledge beyond the duration of
the project. Databases will be implemented using standard engines such
as MySQL (http://www.mysql.org). In accordance with Genome Canada’s
data release policy, all data for which public repositories already
exist (such as GenBank for sequence data) will be submitted to these
repositories according to the timeline of the respective projects.
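
The strain, construct, gene, protein, and metabolite example above can
be sketched as a small relational schema. The following is a minimal
illustration only: it uses Python's built-in sqlite3 module in place of
MySQL so that it runs without a server, and every table name, column
name, and sample row is hypothetical rather than taken from the
platform's actual databases.

```python
import sqlite3

# Hypothetical schema: all table and column names are assumptions made
# for illustration, not the platform's actual design.
SCHEMA = """
CREATE TABLE species    (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE metabolite (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE protein    (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                         metabolite_id INTEGER REFERENCES metabolite(id));
CREATE TABLE gene       (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                         protein_id INTEGER REFERENCES protein(id));
CREATE TABLE construct  (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                         gene_id INTEGER REFERENCES gene(id));
CREATE TABLE strain     (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                         species_id INTEGER REFERENCES species(id),
                         construct_id INTEGER REFERENCES construct(id));
"""


def build_demo_db():
    """Create an in-memory database holding one made-up engineered strain."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.execute("INSERT INTO species VALUES (1, 'Example species')")
    db.execute("INSERT INTO metabolite VALUES (1, 'metabolite-X')")
    db.execute("INSERT INTO protein VALUES (1, 'protein-X', 1)")
    db.execute("INSERT INTO gene VALUES (1, 'geneX', 1)")
    db.execute("INSERT INTO construct VALUES (1, 'pDEMO-geneX', 1)")
    db.execute("INSERT INTO strain VALUES (1, 'strain-001', 1, 1)")
    return db


def strains_producing(db, metabolite_name):
    """Follow strain -> construct -> gene -> protein -> metabolite."""
    rows = db.execute(
        """SELECT strain.name
             FROM strain
             JOIN construct  ON strain.construct_id = construct.id
             JOIN gene       ON construct.gene_id = gene.id
             JOIN protein    ON gene.protein_id = protein.id
             JOIN metabolite ON protein.metabolite_id = metabolite.id
            WHERE metabolite.name = ?
            ORDER BY strain.name""",
        (metabolite_name,),
    )
    return [name for (name,) in rows]


if __name__ == "__main__":
    print(strains_producing(build_demo_db(), "metabolite-X"))
```

Because the relations are stored as foreign keys, a question such as
"which strains produce this metabolite?" becomes a single query rather
than a phone call to the lab that built the strain.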
2.1 Data Integration (Essential Platform Core Technology)
Our use of BioMoby as the common framework through which data and
analytical tools are made available to researchers makes it feasible
and straightforward to construct project-specific, comprehensive
portals, providing customized views on the data for each
platform-supported project [Chan et al. 2008; Song et al, 2007; Good et
al, 2006]. Each portal will also incorporate Wiki-like functionality,
allowing project researchers to communicate with each other, and with
the Platform, in a controlled and catalogued environment. BioMOBY is an
essential Platform Core Technology. It has been implemented worldwide
at major bioinformatics data centres [Schoof et al., 2005; Wilkinson et
al, 2005; Neerincx & Leunissen, 2005; Carrere & Gouzy, 2006;
Bruskiewich et al. 2008] and as the core technology behind other
National Bioinformatics Platforms such as Genoma España [Ismael
Navas-Delgado et al. 2006].
As Web technologies rapidly evolve, and as new data-types, sources, and
tools emerge, BioMoby will need to be further developed in support of
the ABC projects.
3. High-throughput data pipelines
3.1 EST and Genome Annotation (Facchini)
The Sprockets pipeline (Gordon et al., 2006) clusters ESTs according to
their phylogenetic context at high stringency. This provides
high-resolution separation of gene family members. Clusters are then
annotated using the fully automated MAGPIE annotation system
(http://magpie.ucalgary.ca). Because dedicated HPC hardware is
available, the analysis and annotation pipeline is configured to use
both heuristic searches (e.g. the BLAST family of search tools) and
exhaustive databank searches (e.g. Smith-Waterman searches and Markov
model searches against GenBank). Our dedicated hardware systems
(TimeLogic
DeCypher boards, http://www.timelogic.com) are capable of completing
clustered EST similarity searches within two to three days per run.
These analyses will be iterated over time to ensure that the most
up-to-date annotations are available to the Platform users. This
functionality will be essential for the ABC projects that include
high-throughput genome sequence creation and annotation as part of
their plans. All identified gene functions will be initially categorized
using Gene Ontology classification (http://www.geneontology.org), and
these annotations will be further enhanced by “drill-down” BioMoby
annotation pipelines [Kerhornou et al, 2007] – recovering information
from e.g. PubMed, KEGG, and dbSNP; such pipelines have already been
designed for the AllerGen Network Centre of Excellence [Song et al,
2006], and these will be enhanced as needed for use by our partner
projects. All annotation information will be available via both a human
Web interface and an automatable BioMoby interface, enabling both
low-throughput expert validation and high-throughput analytical
pipelines.
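
To illustrate why exhaustive searches such as Smith-Waterman benefit
from dedicated hardware, the sketch below computes a Smith-Waterman
local alignment score by filling in every cell of the dynamic
programming matrix, which heuristic tools such as BLAST avoid doing.
It is a minimal teaching example, not the pipeline's implementation:
the scoring values (match +2, mismatch -1, linear gap -2) are
assumptions, whereas production searches typically use substitution
matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    The scoring parameters are illustrative assumptions. Only two rows
    of the matrix are kept in memory, since the score alone is needed.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The work grows with the product of the two sequence lengths for every
query and database entry pair, which is the cost that hardware
acceleration is meant to absorb.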
3.2 Genome Assembly and Annotation (Tsang, Voordouw)
The assembly of next-generation sequencing data is evolving as quickly
as the sequencing technology itself. On top of the software provided by
the hardware vendors, literally dozens of packages have appeared or
been repurposed to assemble short reads or map them to a reference
genome, such as Bowtie (http://bowtie-bio.sourceforge.net/), Exonerate
(Slater and Birney, 2005), MAQ (http://maq.sourceforge.net/), and
Mosaik (Hillier et al., 2008). For de novo assembly of 454 reads
(likely the case for the Tsang and Voordouw applications), open source
assemblers that have been used successfully include MIRA2 (Chevreux et
al., 2004) and Velvet (Zerbino and Birney, 2008). The platform could
provide the resources to rerun these tools should projects wish to
reanalyze their assemblies as the technology improves.
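
As background on how one of the assemblers named above works, Velvet
is built on de Bruijn graphs, in which reads are broken into
overlapping k-mers and contigs are read off as paths through the
graph. The sketch below is a toy illustration of that idea under
strong simplifying assumptions (error-free reads, a single strand, no
coverage information, and branching checked on outgoing edges only);
the function names are hypothetical and it does not reproduce Velvet's
actual algorithms.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched(graph, start):
    """Extend a contig from start while each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop rather than loop forever on a cycle
            break
        contig += nxt[-1]  # each step adds one new base
        seen.add(nxt)
        node = nxt
    return contig


if __name__ == "__main__":
    # Two made-up overlapping reads.
    reads = ["ATGGCGT", "GCGTGCA"]
    print(walk_unbranched(de_bruijn_graph(reads, 4), "ATG"))
```

Real data introduce sequencing errors, repeats, and both strands, which
is part of why different assemblers can give different results on the
same reads.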
Our focus is on more traditional Sanger and next-gen sequencing tools
(e.g. Velvet and ALLPATHS (Butler et al., 2008)), which will be used on
raw sequence subsets that have been ‘binned’ into closely related
phylogenetic groupings using PhyloPythia (McHardy et al., 2007) in
metagenomic applications, or directly in traditional cases. There is no
one-size-fits-all solution to next-gen assembly, and the platform will
take a comparative approach to each assembly project. We will use the
platform’s expertise in pipelining and configuring existing software to
optimize the assembly of metagenomic projects.
We plan to install analysis tools, especially for metagenomes, using
software developed by the J. Craig Venter Institute and the DOE Joint
Genome Institute. A full list of the programs being considered is
available in the SOWs. The assembled genomic sequences will be
annotated, on a recurring basis, using MAGPIE; coding regions and gene
models will be initially identified by EST-cluster alignment, and the
genome sequence will in turn be used to resolve artificial EST
clusters. In metagenomic projects, a Sprockets clustering will provide an
alternative gene-function-centric rather than genome-centric view of
the dataset, as microbes may be working in cooperation in metabolic
processes. MAGPIE metabolic pathway analysis will link discovered
genes to metabolic pathways, and a BioMoby-compliant interface to the
KEGG pathway database will enable deeper ad-hoc data mining of
metabolic processes for genes of interest, with these deep annotations
being fed back into the primary database. Finally, we will enable
automated submission of these richly annotated genomic sequences to
GenBank and make them available to platform researchers through our
visualization and automated BioMoby high-throughput interfaces. Several
ABC projects will use this capability of the Innovation Center and will
need enhancements to the existing pipelines.
3.3 Microarray Design (Rowland-Cloutier)
The Sprockets clusters will guide the design of an oligomer-based
microarray using OSPREY (Gordon and Sensen, 2004). OSPREY has been used
successfully to design oligonucleotide-based microarrays for Candida
albicans, Sulfolobus solfataricus P2 and Desulfovibrio vulgaris.
Microarray analyses will be performed using the Innovation Center’s
novel Merlin software (manuscript submitted), which was developed for
previous GC and other projects requiring quick and concise microarray
data normalization and summarization.
3.4 HelpDesk - Metabolomics, Proteomics, and Web Services
(Wishart-related Projects)
The IBP HelpDesk maintains more than 30 publicly accessible web servers
that support metabolomic data analysis (MetaboAnalyst, HMDB, and
DrugBank), proteomic data analysis (GelScape, Proteus2, PPT-DB), genome
annotation (BASys, PlasMapper), protein structure analysis (CS23D,
VADAR, SuperPose) and text mining (PolySearch, BioSpider). The Help
Desk also supports and routinely updates more than a dozen
web-accessible databases, including the Human Metabolome Database
(HMDB), DrugBank, FooDB, DrugMet, PPT-DB and RefDB. These databases
receive more than 3 million hits a year. The IBP Help Desk also
provides custom programming support, a program repository (with nearly
80 programs), custom data analysis, bioinformatics advice and community
updates (through a biweekly newsletter). Currently the Help Desk
handles about 10 programming/analysis requests per week. The staff at
the IBP Help Desk are internationally recognized for their knowledge
and expertise in the bioinformatics of metabolomics, and they will play
a key role in handling nearly all of the requested metabolomics data
processing for the ABC initiative. Indeed, one third of the requests
for ABC services to the IBP involve metabolomics data analysis. The
Help Desk’s familiarity with chemometric and statistical processing
software, with numerous MS and NMR spectral analysis tools, and with
metabolomic LIMS development, together with its expertise in metabolic
pathway analysis and annotation, is unique. Likewise, because the
Help Desk maintains many of the world’s primary metabolomics databases
(HMDB, DrugBank, FooDB, DrugMet), it is in an ideal position to exploit
these resources for Genome Canada researchers. Through continuing
integration with other tools and services within the platform (BioMoby,
MAGPIE, Bluejay) it should be possible to add considerable value to
these metabolomic analyses and to spread this metabolomic expertise
throughout all nodes in the IBP Innovation Center.
4. Data analysis
The wide range of software applications and services offered by the IBP
will be available to the ABC and other projects through web interfaces,
remote desktops, and downloads. We will continue to develop these tools
to meet the needs of the ABC grants (as outlined in the respective
SOWs).
4.1 Bluejay Genome Browser (Facchini, Tsang, Voordouw)
All MAGPIE annotations (and also other XML files in TIGR or GenBank XML
format) can be explored through the Bluejay genomic browser (Gordon and
Sensen, 2000) (http://bluejay.ucalgary.ca), allowing easy navigation
through complete genomes, together with EST alignments and other
sequence annotations. Drill-out functionality is provided through
Bluejay’s integrated BioMoby capabilities, allowing contextualized
import of data and/or annotations from the >1600 BioMoby analytical
services worldwide, including information from genetic mapping
experiments undertaken by platform-supported projects. Further
information can be imported from the TIGR Multiexperiment Viewer
(http://www.tm4.org/mev.html) to allow visualization of gene expression
information, including time-series, in the context of genome location.
4.2 bioLegato (Fristensky Project)
BioLegato is a graphic interface designed to make it easier for
biologists to utilize bioinformatics software. BioLegato decreases the
learning curve by hiding details such as file formats and parameter
syntax. BioLegato can reuse output from one program as input to the
next program, allowing the user to do ad hoc data pipelining. More
importantly, bioLegato is programmable. Most functionality is provided
through external program calls, and the menus that call these programs
are read at run time. The programmability of BioLegato makes it
practical to
rapidly create interfaces for many types of data and programs. Thus,
BioLegato interfaces are planned for BioMoby web services and EMBOSS,
as well as new applications working with data generated by our client
projects. For example, the bioLegato database client will be a
user-friendly interface both for data entry and for performing queries
and data mining for projects such as Microbial Genomics.
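
The general pattern described above, in which menus are read at run
time, each menu entry calls an external program, and the output of one
program feeds the next, can be sketched as follows. This is a minimal
illustration of the pattern and not BioLegato's implementation: the
JSON menu format, the menu labels, and the function names are all
hypothetical, and the "external programs" are small Python one-liners
so that the example runs anywhere Python is installed.

```python
import json
import subprocess
import sys

# Hypothetical menu definition, parsed at run time. The real BioLegato
# menu file format is not described in this document.
MENU_TEXT = json.dumps({
    "Uppercase": [sys.executable, "-c",
                  "import sys; sys.stdout.write(sys.stdin.read().upper())"],
    "Reverse string": [sys.executable, "-c",
                       "import sys; "
                       "sys.stdout.write(sys.stdin.read().strip()[::-1])"],
})


def load_menu(text):
    """Parse menu entries: label -> external command as an argument list."""
    return json.loads(text)


def run_item(menu, label, data):
    """Run the external program behind a menu label, passing data on stdin."""
    result = subprocess.run(menu[label], input=data, capture_output=True,
                            text=True, check=True)
    return result.stdout


def run_pipeline(menu, labels, data):
    """Chain menu items so that each program's output feeds the next."""
    for label in labels:
        data = run_item(menu, label, data)
    return data


if __name__ == "__main__":
    menu = load_menu(MENU_TEXT)
    print(run_pipeline(menu, ["Uppercase", "Reverse string"], "acgt"))
```

Adding a tool in this scheme means adding a menu entry rather than
changing the interface code, which is the property that makes rapid
creation of new interfaces practical.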
5. Activities for Competition III projects (Designing Oilseeds for
Tomorrow’s Markets)
These ongoing activities have continued into the year 2010 (with no
cost extensions). They are funded via the Innovation Centre extension.
The Protein Expression Profiling Platform for Heart Disease Biomarker
Discovery; Dynactome: Mapping Spatio-Temporal Dynamic Systems in
Humans; Structural and Functional Annotation of the Human Genome for
Disease Study and Integrative Biology projects also continue.
6. Training (Genome Canada funded and ABC projects)
It is critical to bridge the bioinformatics knowledge gap that still
limits the ability of biologists to work with their data. The training
component of the Bioinformatics Innovation Center, which consists of
two courses per year, has continued to be an important part of the
Center’s services, as evidenced by the continuing popularity of the
Applied Computational Genomics Courses (ACGC) and the uniformly
enthusiastic feedback we receive from attendees. Detailed information
about ACGC is available at http://www.gcbioinformatics.ca/training.
While attendees have generally found the topics covered to be of value,
there are always suggestions for topics that they would like to see in
future courses. To address these needs, we are considering the creation
of a new series of 2-day Training Outreach courses focused on specific
topics, which will be delivered in addition to the basic training
schedule.