1.5 Visualization and data repositories for genomics

There are ~100 animal genomes sequenced as of 2016. On top of these, many research projects from individual labs or consortia produce petabytes of auxiliary genomics data, such as ChIP-seq and RNA-seq.

There are two requirements for visualizing genomes and their associated data: 1) you need to work with a species that has a sequenced genome, and 2) you need annotation on that genome, meaning that, at the very least, you want to know where the genes are. Most genomes are annotated soon after sequencing, either with gene predictions or by mapping known gene sequences onto them, and conservation with other species can also be used to filter for functional elements. If you are working with a model organism or human, you will also have a lot of auxiliary information to help demarcate functional regions, such as regulatory regions, ncRNAs, and SNPs that are common in the population. You might also have disease- or tissue-specific data available. The more an organism is studied, the more auxiliary data you will have.

1.5.0.1 Accessing genome sequences and annotations via genome browsers

As someone who intends to work with genomics, you will need to visualize a large amount of data to make biological inferences or simply check regions of interest in the genome visually. Looking at the genome case by case with all the additional datasets is a necessary step to develop a hypothesis and understand the data.

Many genomes and their associated data are available through genome browsers. A genome browser is a website or an application that helps you visualize the genome and all the available data associated with it. Via genome browsers, you will be able to see where genes are in relation to each other and to other functional elements, inspect gene structure, and view auxiliary data such as conservation, repeat content, and SNPs. Here we review some of the popular browsers.

UCSC genome browser: This is an online browser hosted by the University of California, Santa Cruz at http://genome.ucsc.edu/. It is an interactive website that contains genomes and annotations for many species. You can search for genes or genome coordinates in the species of interest. It is usually very responsive and allows you to visualize large amounts of data. In addition, it has multiple other tools that can be used in connection with the browser. One of the most useful is the UCSC Table Browser, which lets you download all the data you see on the browser, including sequence data, in multiple formats. Users can also upload their own data, or provide links to it, to visualize it alongside the existing annotation.
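Besides the Table Browser, UCSC also offers a REST API for programmatic downloads. The following is a minimal sketch of how a sequence request could be put together with the Python standard library. It assumes the `getData/sequence` endpoint at `api.genome.ucsc.edu` and a JSON response that carries the sequence under a `dna` key. The function names `ucsc_sequence_url` and `fetch_sequence` are made up for this illustration and are not part of any UCSC tool.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

UCSC_API = "https://api.genome.ucsc.edu"


def ucsc_sequence_url(genome, chrom, start, end):
    """Build a request URL for a stretch of DNA.

    Assumption: start is 0-based and end is exclusive, as in UCSC
    browser-internal coordinates.
    """
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    params = {"genome": genome, "chrom": chrom, "start": start, "end": end}
    # The API documentation separates parameters with semicolons.
    query = ";".join(f"{key}={quote(str(value))}" for key, value in params.items())
    return f"{UCSC_API}/getData/sequence?{query}"


def fetch_sequence(genome, chrom, start, end, timeout=30):
    """Download the sequence; needs network access (hypothetical helper)."""
    url = ucsc_sequence_url(genome, chrom, start, end)
    with urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    # Assumption: the response JSON stores the sequence under "dna".
    return payload["dna"]
```

Calling `fetch_sequence("hg38", "chrM", 4321, 5678)` would request part of the human mitochondrial genome. Keeping the URL construction separate from the download makes it easy to inspect the request before sending it.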

Ensembl: This is another online browser, maintained by the European Bioinformatics Institute and the Wellcome Trust Sanger Institute in the UK, at http://www.ensembl.org. As with the UCSC browser, users can visualize genes or genomic coordinates from multiple species, and it also comes with auxiliary data. Ensembl is associated with the BioMart tool, which is similar to the UCSC Table Browser and can be used to download genome data, including all the auxiliary data sets, in multiple formats.
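Ensembl data can also be queried programmatically. As a rough sketch, the snippet below builds a gene lookup request for the Ensembl REST service at `rest.ensembl.org` and turns the answer into a coordinate string that can be pasted into a browser search box. It assumes the `lookup/symbol` endpoint and the `seq_region_name`, `start` and `end` fields in its JSON response. The helper names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

ENSEMBL_REST = "https://rest.ensembl.org"


def ensembl_lookup_url(species, symbol):
    """Build a URL that looks up a gene by its symbol in a given species."""
    return (
        f"{ENSEMBL_REST}/lookup/symbol/{quote(species)}/{quote(symbol)}"
        "?content-type=application/json"
    )


def region_from_lookup(record):
    """Format a lookup record as 'chromosome:start-end'.

    Assumption: the record has the keys seq_region_name, start and end.
    """
    return f"{record['seq_region_name']}:{record['start']}-{record['end']}"


def fetch_gene_region(species, symbol, timeout=30):
    """Look up a gene and return its region; needs network access."""
    with urlopen(ensembl_lookup_url(species, symbol), timeout=timeout) as response:
        return region_from_lookup(json.load(response))
```

For example, `fetch_gene_region("homo_sapiens", "BRCA2")` would return the location of BRCA2 in the current human assembly served by Ensembl.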

IGV: Integrative Genomics Viewer (IGV) is a desktop application developed by the Broad Institute (https://www.broadinstitute.org/igv/). It was developed to deal with large amounts of high-throughput sequencing data, which is harder to view in online browsers. IGV can integrate your local sequencing results with online annotation on your desktop machine. This is useful when viewing sequencing data, especially alignments. The other browsers mentioned above have similar features, but you will need to make your large sequencing data available online somewhere before those browsers can display it.

1.5.0.2 Data repositories for high-throughput assays

Genome browsers contain lots of auxiliary high-throughput data. However, there are many more public high-throughput data sets, and not all of them are available through genome browsers. Normally, every high-throughput data set associated with a publication should be deposited in a public archive. There are two major public archives we use to deposit data. One of them is Gene Expression Omnibus (GEO), hosted at http://www.ncbi.nlm.nih.gov/geo/, and the other is the European Nucleotide Archive (ENA), hosted at http://www.ebi.ac.uk/ena. These repositories accept high-throughput data sets, and users can freely download and use these public data sets for their own research. Many data sets in these repositories are in their raw format, which in most cases is the format the sequencer provides. Some data sets will also have processed data, but that is not the norm.
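Raw files in these archives can be located without clicking through the websites. The sketch below shows one possible way to list the FASTQ files of a study through the ENA Portal API. It assumes the `filereport` endpoint with the `read_run` result type and a tab-separated response in which the `fastq_ftp` column holds semicolon-separated file locations. The function names and the accession in the usage note are placeholders for illustration.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import urlopen

ENA_PORTAL = "https://www.ebi.ac.uk/ena/portal/api"


def ena_filereport_url(accession):
    """Build a file report URL for a study, sample or run accession."""
    query = urlencode(
        {
            "accession": accession,
            "result": "read_run",
            "fields": "run_accession,fastq_ftp",
            "format": "tsv",
        },
        safe=",",  # keep the comma in the field list readable
    )
    return f"{ENA_PORTAL}/filereport?{query}"


def parse_filereport(tsv_text):
    """Map each run accession to its list of FASTQ file locations."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    files = {}
    for row in reader:
        # Paired-end runs list several files separated by semicolons.
        files[row["run_accession"]] = [p for p in row["fastq_ftp"].split(";") if p]
    return files


def fetch_fastq_locations(accession, timeout=30):
    """Download and parse the file report; needs network access."""
    with urlopen(ena_filereport_url(accession), timeout=timeout) as response:
        return parse_filereport(response.read().decode("utf-8"))
```

With a real accession in place of the placeholder, `fetch_fastq_locations("PRJ_PLACEHOLDER")` would return a dictionary of runs and their raw read files, which can then be downloaded with any FTP or HTTP client.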

Apart from these repositories, there are multiple multi-national consortia dedicated to certain genome biology or disease-related problems and they maintain their own databases and provide access to processed and raw data. Some of these consortia are mentioned below.

| Consortium | What is it for? |
|---|---|
| ENCODE | Transcription factor binding sites, gene expression and epigenomics data for cell lines |
| Epigenomics Roadmap | Epigenomics data for multiple cell types |
| The Cancer Genome Atlas (TCGA) | Expression, mutation and epigenomics data for multiple cancer types |
| 1000 Genomes Project | Human genetic variation data obtained by sequencing thousands of individuals |