Technologies Offered by the
Innovation Centre


The Genome Alberta Bioinformatics Innovation Centre (IBP) currently supports a number of critical bioinformatics activities, including sequence clustering and assembly, EST and metagenome annotation, metabolomic data processing and interpretation, genome annotation, custom programming, and proteomic and protein structure analysis. As with past initiatives, there will likely be heavy demands on the IBP’s Help Desk for database support and custom programming tasks. We also foresee a need to continue the IBP’s training program, both to train new Highly Qualified Personnel and to create Power Users who can use the IBP tools to their full extent. A solid foundation exists on which the ABC projects can build. As the last hardware refresh occurred several years ago, we plan to replace some end-of-life infrastructure so that the systems can be operated until 2014, when the ABC grants are projected to terminate.

The Genome Alberta Innovation Center provides the following technologies and services:


1. Data Infrastructure (relevant to all ABC grants, as well as the existing Comp. III grants)

The hardware component of the IBP is used for various tasks, including data storage management; high-throughput (High Performance Computing) analysis; custom programming and testing; web server support; and data integration, interoperability, and visualization. These core technologies are deployed on a robust and secure computational hardware installation (mostly UNIX-based), supplemented by access to computational resources worldwide (especially via BioMoby) [Wilkinson & Links, 2002; The BioMoby Consortium, 2008].




1.1 Computational Hardware.

The Bioinformatics Innovation Center currently has substantial computing resources, including a Sun Fire 6800 server powered by twenty 1.2 GHz UltraSPARC IV processors and four 1.5 GHz UltraSPARC IV+ processors, a computing cluster consisting of 24 Sun Fire V210 servers with dual UltraSPARC IIIi processors, and eight TimeLogic DeCypher engines for high-throughput bioinformatics database searches. The server infrastructure is located at the Sun Center of Excellence (COE) for Visual Genomics in Calgary (see Figure 1 below for the current infrastructure in place at the COE). In addition, the Help Desk in Edmonton maintains a Linux cluster of twelve 1.5 GHz dual-processor nodes (24 CPUs) to support more than 30 web services and databases.

1.2 Long-term data storage. (Relevant to all Genome Canada funded Projects)

The Bioinformatics Platform provides long-term data storage and retrieval facilities for data, including:
    Sequence and quality information for raw reads.
    Sequence assembly versions.
    Image data (FISH, microarrays, etc.).
    Other electronic records (e.g. logs for well-bore core samples used).

The archival storage of the Bioinformatics Innovation Center is based on SAM-QFS, a high-performance archival system that provides both long-term retention and high availability. The SAM-QFS policy creates a total of three file archives at pre-defined intervals to ensure that backups exist for both recent and long-term changes. The three copies reside on the SATA array, on tape in a StorageTek L700 library, and in a VPN-connected remote data center on a fibre-attached StorageTek L700e, respectively. This triple redundancy protects us against a major data center or building disaster. The storage facility can be upgraded for the needs of the ABC projects without a server outage. Dedicated storage components are included in several of the ABC proposals; these will be used for the data of the respective ABC grants that requested them. It is our policy to store on an ongoing basis only data that cannot be deposited in public repositories (in accordance with Genome Canada data release policies). This minimizes the amount of long-term storage needed.

2. Database design and management (All Genome Canada funded projects)

Databases are the organizational and communications backbone of any genomics project, giving workers a "big picture" view of the project as a whole. All critical raw data are entered into the database by lab workers, either through web-based data entry forms or by uploading spreadsheets. In turn, all members and projects will have secure access to the unified dataset through web-based clients. In this way, the database will become the first point of communication across the platform. In a sense, if a lab worker has to pick up the phone to find a strain, primer, or sequence, the database has failed.

The integration of data from all subprojects allows workers in all labs to get a better picture of what the other labs are doing, and promotes the sharing of data and samples across subprojects. Databases take raw data and structure them into knowledge by describing relations between data objects. For example, an engineered strain contains a construct within a bacterial species, and the construct may contain a gene encoding a protein that produces a metabolite. Importantly, because these relations, and therefore the knowledge, are encoded in the database, the database ensures continuity of knowledge as people cycle in and out of the lab, and preserves that knowledge beyond the duration of the project. Databases will be implemented using standard engines such as MySQL (http://www.mysql.org). In accordance with Genome Canada’s data release policy, all data for which public repositories already exist (such as GenBank for sequence data) will be submitted to these repositories according to the timeline of the respective projects.
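
The relations in the strain example (strain, construct, species, gene, protein, metabolite) could be encoded along the following lines. This is a minimal sketch, not the platform's actual schema: every table name, column name, and sample row is hypothetical, and Python's built-in sqlite3 module stands in for MySQL so that the example is self-contained.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
# Each foreign key encodes one relation from the example in the text.
SCHEMA = """
CREATE TABLE species    (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE metabolite (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE protein    (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                         metabolite_id INTEGER REFERENCES metabolite(id));
CREATE TABLE gene       (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                         protein_id INTEGER REFERENCES protein(id));
CREATE TABLE construct  (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                         gene_id INTEGER REFERENCES gene(id));
CREATE TABLE strain     (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                         species_id INTEGER REFERENCES species(id),
                         construct_id INTEGER REFERENCES construct(id));
"""


def build_demo_db():
    """Create an in-memory database holding one made-up engineered strain."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO species VALUES (1, 'Escherichia coli')")
    conn.execute("INSERT INTO metabolite VALUES (1, 'metabolite-X')")
    conn.execute("INSERT INTO protein VALUES (1, 'enzyme-A', 1)")
    conn.execute("INSERT INTO gene VALUES (1, 'geneA', 1)")
    conn.execute("INSERT INTO construct VALUES (1, 'pDEMO-geneA', 1)")
    conn.execute("INSERT INTO strain VALUES (1, 'DEMO-001', 1, 1)")
    conn.commit()
    return conn


def strains_producing(conn, metabolite_name):
    """Follow strain -> construct -> gene -> protein -> metabolite."""
    query = """
        SELECT strain.name, species.name, construct.name, gene.name
        FROM strain
        JOIN species    ON species.id = strain.species_id
        JOIN construct  ON construct.id = strain.construct_id
        JOIN gene       ON gene.id = construct.gene_id
        JOIN protein    ON protein.id = gene.protein_id
        JOIN metabolite ON metabolite.id = protein.metabolite_id
        WHERE metabolite.name = ?
        ORDER BY strain.name
    """
    return conn.execute(query, (metabolite_name,)).fetchall()
```

With such a structure, a question like "which strains produce this metabolite, and through which construct and gene?" becomes a single query, e.g. strains_producing(build_demo_db(), "metabolite-X"), rather than a phone call to another lab.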

2.1 Data Integration (Essential Platform Core Technology)

Our use of BioMoby as the common framework through which data and analytical tools are made available to researchers makes it feasible and straightforward to construct project-specific, comprehensive portals that provide customized views on the data for each platform-supported project [Chan et al., 2008; Song et al., 2007; Good et al., 2006]. Each portal will also incorporate Wiki-like functionality, allowing project researchers to communicate with each other, and with the Platform, in a controlled and catalogued environment. BioMoby is an essential Platform Core Technology. It has been implemented worldwide at major bioinformatics data centres [Schoof et al., 2005; Wilkinson et al., 2005; Neerincx & Leunissen, 2005; Carrere & Gouzy, 2006; Bruskiewich et al., 2008] and serves as the core technology behind other National Bioinformatics Platforms such as that of Genoma España [Navas-Delgado et al., 2006].

As Web technologies rapidly evolve, and as new data-types, sources, and tools emerge, BioMoby will need to be further developed in support of the ABC projects.

3. High-throughput data pipelines


3.1 EST and Genome Annotation (Facchini)

The Sprockets pipeline (Gordon et al., 2006) clusters ESTs according to their phylogenetic context at high stringency, which provides high-resolution separation of gene family members. Clusters are then annotated using the fully automated MAGPIE annotation system (http://magpie.ucalgary.ca). Because dedicated HPC hardware is available, the analysis and annotation pipeline is configured to use both heuristic databank searches (e.g. the BLAST family of search tools) and exhaustive ones (e.g. Smith-Waterman searches and Markov model searches against GenBank). Our dedicated hardware systems (TimeLogic DeCypher boards, http://www.timelogic.com) are capable of completing clustered EST similarity searches within two to three days per run. These analyses will be iterated over time to ensure that the most up-to-date annotations are available to the Platform users. This functionality will be essential for the ABC projects that include high-throughput genome sequence creation and annotation as part of their plan. All identified gene functions will initially be categorized using Gene Ontology classification (http://www.geneontology.org), and these annotations will be further enhanced by “drill-down” BioMoby annotation pipelines [Kerhornou et al., 2007] that recover information from, e.g., PubMed, KEGG, and dbSNP. Such pipelines have already been designed for the AllerGen Network Centre of Excellence [Song et al., 2006], and they will be enhanced as needed for use by our partner projects. All annotation information will be available via both a human Web interface and an automatable BioMoby interface, enabling both low-throughput expert validation and high-throughput analytical pipelines.
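
To illustrate why exhaustive searches call for dedicated hardware, the following minimal sketch computes a Smith-Waterman local alignment score in pure Python. It is an illustration only, not the DeCypher or MAGPIE implementation: the scoring values (match 2, mismatch -1, linear gap -2) and the function name are assumptions, and production searches use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Every cell of the len(a) x len(b) dynamic-programming matrix is filled,
    which is what makes the search exhaustive (and slow) compared with
    seed-based heuristics such as BLAST. Scoring values are assumptions.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def rank_hits(query, database):
    """Sort (name, sequence) database entries by descending score."""
    scored = [(smith_waterman_score(query, seq), name) for name, seq in database]
    return [name for score, name in sorted(scored, key=lambda t: (-t[0], t[1]))]
```

The running time grows with the product of the query length and the databank size, so a search of clustered ESTs against a large databank quickly becomes impractical on general-purpose CPUs.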

3.2 Genome Assembly and Annotation (Tsang, Voordouw)

The assembly of next-generation sequencing data is evolving as quickly as the sequencing technology itself. On top of the software provided by the hardware vendors, dozens of packages have appeared or been repurposed to assemble short reads or map them to a reference genome, such as Bowtie (http://bowtie-bio.sourceforge.net/), Exonerate (Slater and Birney, 2005), MAQ (http://maq.sourceforge.net/), and Mosaik (Hillier et al., 2008). For de novo assembly of 454 reads (likely the case for the Tsang and Voordouw applications), open-source assemblers that have been used successfully include MIRA2 (Chevreux et al., 2004) and Velvet (Zerbino and Birney, 2008). The platform could provide the resources to run these tools if projects were to reanalyze their assemblies as the technology improves.

Our focus is on more traditional Sanger and next-gen sequencing tools (e.g. Velvet and ALLPATHS (Butler et al., 2008)). In metagenomic applications, these tools will be run on raw sequence subsets that have been ‘binned’ into closely related phylogenetic groupings using PhyloPythia (McHardy et al., 2007); in traditional cases, they will be run on the raw sequences directly. There is no one-size-fits-all solution to next-gen assembly, and the platform will take a comparative approach to each assembly project. We will use the platform’s expertise in pipelining and configuring existing software to optimize the assembly of metagenomic projects.
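
As a rough sketch of what binning followed by a comparative approach could look like in practice, the Python below groups reads by a bin label and then compares candidate assemblies by their N50 contig length. The choice of N50 as the comparison metric is our assumption for illustration (the text does not prescribe one), all function names are hypothetical, and a real comparison would weigh several metrics (e.g. total assembled length and misassembly checks) rather than N50 alone.

```python
from collections import defaultdict


def group_reads_by_bin(assignments):
    """Group read identifiers by bin label.

    assignments: iterable of (read_id, bin_label) pairs, e.g. as produced
    by a phylogenetic binning step. Each bin can then be assembled alone.
    """
    bins = defaultdict(list)
    for read_id, label in assignments:
        bins[label].append(read_id)
    return dict(bins)


def n50(contig_lengths):
    """Largest length L such that contigs of length >= L together cover
    at least half of the total assembled bases. Returns 0 if empty."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0


def best_assembler(results):
    """results: dict mapping an assembler name to its list of contig
    lengths for the same read set. Returns the name with the highest N50."""
    return max(results, key=lambda name: n50(results[name]))
```

For example, best_assembler({"assembler_a": [100, 100, 100], "assembler_b": [250, 25, 25]}) picks "assembler_b", which also shows the weakness of the metric: a single long contig dominates the score.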

We plan to install analysis tools, especially for metagenomes, using software developed by the J. Craig Venter Institute and the DOE Joint Genome Institute. A full list of the programs being considered is available in the SOWs. The assembled genomic sequences will be annotated, on a recurring basis, using MAGPIE; coding regions and gene models will initially be identified by EST-cluster alignment, and the genome sequence will in turn be used to resolve artificial EST clusters. In metagenomic projects, a Sprockets clustering will provide an alternative, gene-function-centric rather than genome-centric, view of the dataset, as microbes may be working in cooperation in metabolic processes. MAGPIE metabolic pathway analysis will link discovered genes to metabolic pathways, and a BioMoby-compliant interface to the KEGG pathway database will enable deeper ad hoc data mining of metabolic processes for genes of interest, with these deep annotations being fed back into the primary database. Finally, we will enable automated submission of these richly annotated genomic sequences to GenBank, and will make them available to platform researchers through our visualization and automated BioMoby high-throughput interfaces. Several ABC projects will be using this capability of the Innovation Center and will need enhancements to the existing pipelines.

3.3 Microarray Design (Rowland-Cloutier)

The Sprockets clusters will guide the design of an oligomer-based microarray, using OSPREY (Gordon and Sensen, 2004). OSPREY has successfully been used to design oligonucleotide-based microarrays for Candida albicans, Sulfolobus solfataricus P2 and Desulfovibrio vulgaris. Microarray analyses will be carried out using the Innovation Center’s novel Merlin software (manuscript submitted), which was developed for previous GC and other projects requiring quick and concise microarray data normalization and summarization.

3.4 HelpDesk - Metabolomics, Proteomics, and Web Services (Wishart-related Projects)

The IBP Help Desk maintains more than 30 publicly accessible web servers that support metabolomic data analysis (MetaboAnalyst, HMDB, and DrugBank), proteomic data analysis (GelScape, Proteus2, PPT-DB), genome annotation (BASys, PlasMapper), protein structure analysis (CS23D, VADAR, SuperPose) and text mining (PolySearch, BioSpider). The Help Desk also supports and routinely updates more than a dozen web-accessible databases, including the Human Metabolome Database (HMDB), DrugBank, FooDB, DrugMet, PPT-DB and RefDB. These databases receive more than 3 million hits a year. The IBP Help Desk also provides custom programming support, a program repository (with nearly 80 programs), custom data analysis, bioinformatics advice and community updates (through a biweekly newsletter). Currently the Help Desk handles about 10 programming/analysis requests per week.

The staff at the IBP Help Desk are internationally recognized for their knowledge and expertise in the bioinformatics of metabolomics, and they will play a key role in handling nearly all of the requested metabolomics data processing for the ABC initiative. Indeed, one third of the requests for ABC services to the IBP involve metabolomics data analysis. The Help Desk’s combination of familiarity with chemometric and statistical processing software, with numerous MS and NMR spectral analysis tools, and with metabolomic LIMS development, together with its expertise in metabolic pathway analysis and annotation, is unique. Likewise, because the Help Desk maintains many of the world’s primary metabolomics databases (HMDB, DrugBank, FooDB, DrugMet), it is in an ideal position to exploit these resources for Genome Canada researchers. Through continuing integration with other tools and services within the platform (BioMoby, MAGPIE, Bluejay), it should be possible to add considerable value to these metabolomic analyses and to spread this metabolomic expertise throughout all nodes in the IBP Innovation Center.

4. Data analysis

The wide range of software applications and services offered by the IBP will be available to the ABC and other projects through web interfaces, remote desktops, and downloads. We will continue to develop these tools to meet the needs of the ABC grants (as outlined in the respective SOWs).

4.1 Bluejay Genome Browser (Facchini, Tsang, Voordouw)

All MAGPIE annotations (as well as other XML files in TIGR or GenBank XML format) can be explored through the Bluejay genomic browser (Gordon and Sensen, 2000) (http://bluejay.ucalgary.ca), which allows easy navigation through complete genomes, together with EST alignments and other sequence annotations. Drill-out functionality is provided through Bluejay’s integrated BioMoby capabilities, allowing contextualized import of data and/or annotations from the more than 1600 BioMoby analytical services worldwide, including information from genetic mapping experiments undertaken by platform-supported projects. Further information can be imported from the TIGR Multiexperiment Viewer (http://www.tm4.org/mev.html) to allow visualization of gene expression information, including time series, in the context of genome location.

4.2 BioLegato (Fristensky Project)

BioLegato is a graphical interface designed to make it easier for biologists to use bioinformatics software. BioLegato flattens the learning curve by hiding details such as file formats and parameter syntax. BioLegato can reuse output from one program as input to the next program, allowing the user to do ad hoc data pipelining. More importantly, BioLegato is programmable. Most functionality is provided through calls to external programs, and the menus that call these programs are read at run time. This programmability makes it practical to rapidly create interfaces for many types of data and programs. Thus, BioLegato interfaces are planned for BioMoby web services and EMBOSS, as well as for new applications working with data generated by our client projects. For example, the BioLegato database client will be a user-friendly interface both for data entry and for performing queries and data mining for projects such as Microbial Genomics.
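
The general idea of menus that are read at run time and that call external programs can be sketched as follows. This is not BioLegato's actual menu format or code: the JSON layout, the menu item names, and the function names are all hypothetical, and the two "external programs" are trivial Python one-liners chosen only so that the example runs anywhere.

```python
import json
import subprocess
import sys

# Hypothetical menu definition. In a real tool this text would live in a
# file that is read at start-up, so new menu items need no recompilation.
MENU_JSON = json.dumps({
    "Uppercase": [sys.executable, "-c",
                  "import sys; sys.stdout.write(sys.stdin.read().upper())"],
    "Reverse": [sys.executable, "-c",
                "import sys; sys.stdout.write(sys.stdin.read()[::-1])"],
})


def load_menu(text):
    """Parse a menu definition mapping item names to command lines."""
    return json.loads(text)


def run_item(menu, item, data):
    """Run the external program behind one menu item.

    The data is passed on standard input and the result is read from
    standard output, so the caller never deals with the program's files
    or parameter syntax.
    """
    result = subprocess.run(menu[item], input=data, capture_output=True,
                            text=True, check=True)
    return result.stdout


def run_pipeline(menu, items, data):
    """Ad hoc pipelining: feed each item's output into the next item."""
    for item in items:
        data = run_item(menu, item, data)
    return data
```

Under these assumptions, run_pipeline(load_menu(MENU_JSON), ["Uppercase", "Reverse"], "acgt") chains two external calls, and adding a new program to the interface means adding one entry to the menu definition.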

5. Activities for Competition III projects (Designing Oilseeds for Tomorrow’s Markets)

These ongoing activities have continued into the year 2010 (with no-cost extensions) and are funded via the Innovation Centre extension. The Protein Expression Profiling Platform for Heart Disease Biomarker Discovery; Dynactome: Mapping Spatio-Temporal Dynamic Systems in Humans; Structural and Functional Annotation of the Human Genome for Disease Study and Integrative Biology projects also continue.

6. Training (Genome Canada funded and ABC projects)

It is critical to bridge the bioinformatics knowledge gap that still limits the ability of biologists to work with their data. The training component of the Bioinformatics Innovation Center, which consists of two courses per year, has continued to be an important part of the Center’s services, as evidenced by the continuing popularity of the Applied Computational Genomics Courses (ACGC) and by the uniformly enthusiastic feedback we have received from attendees. Detailed information about ACGC is available at http://www.gcbioinformatics.ca/training. While attendees have generally found the topics covered to be of value, they regularly suggest topics that they would like to see in future courses. To address these needs, we are considering the creation of a new series of 2-day Training Outreach courses focused on specific topics, which will be delivered in addition to the basic training schedule.