[ The following is a repost from anderssonr.wordpress.com ]
FANTOM (Functional Annotation of the Mammalian Genome) is an international research consortium established by Dr. Hayashizaki and his colleagues at RIKEN in Tokyo, Japan. Founded in 2000 to functionally annotate the mouse DNA sequence with advanced sequencing techniques, FANTOM has since expanded to encompass the regulation of genes, networks of genes and their impact on disease. The FANTOM project includes over 500 scientists from more than 20 countries. FANTOM5 is the fifth phase of the project.
In FANTOM5 we have used Cap Analysis of Gene Expression (CAGE) to map the sets of transcripts, transcription factors, promoters and enhancers active in the majority of mammalian primary cell types. We have complemented this with profiles from cancer cell lines and tissues. The results are published in two articles in Nature describing the promoterome (FANTOM Consortium et al. 2014) and enhancerome (Andersson et al. 2014) of mammalian cells, along with several more focused papers in various journals.
In this post, I will explain the computational strategy I used in the enhancer paper (Andersson et al. 2014) for predicting the locations of transcriptional enhancers and quantifying their usage across 808 human FANTOM CAGE libraries.
Bidirectional (divergent) transcription at enhancers
We observed that enhancers, as defined from chromatin features (H3K4me1 and H3K27ac, see e.g. Bulger and Groudine 2011 for an overview of these features), were bidirectionally transcribed, producing capped RNAs emanating (divergently) outwards from the central nucleosome-deficient region (NDR) (Figure 1A). The observation of bidirectional transcription at enhancers is not new. Tae-Kyung Kim and colleagues (Kim et al. 2010) observed bidirectional transcription at active enhancers in mouse cortical neurons and coined the term eRNAs (enhancer RNAs) for the products.
Although functional roles have been suggested for enhancer RNAs (see Lam et al. 2014 for an extensive review), such attribution remains debatable. Nevertheless, the production of eRNAs does provide insight into functional regulatory elements.
We found that the characteristics of enhancer transcription, detected using CAGE, are sufficiently distinct from those of gene promoters to permit accurate genome-wide inference of enhancers from eRNAs: while transcription is mainly unidirectional at mRNA promoters, enhancers initiate bidirectional transcription (Figure 1B). Importantly, using in vitro assays, we showed that enhancer transcription is a much better predictor of enhancer activity than chromatin characteristics (3-fold increase in validation rate) (Figure 1C). These observations form the basis of my approach to infer the locations of putative enhancers genome-wide.
Identification of bidirectionally transcribed loci
The code implementing the computational strategy to predict enhancer locations is available on GitHub. Below, I describe the procedure.
Bidirectionally transcribed loci were defined from a set of 1,714,047 forward and 1,597,186 reverse strand CAGE tag clusters (TCs) supported by at least two CAGE tags in at least one sample (TCs defined in FANTOM Consortium et al. 2014). Only TCs not overlapping antisense TCs were used. The identification of bidirectional loci involves the following steps:
- We identified 1,261,036 divergent (reverse-forward) TC pairs separated by at most 400 bp (step 1 in Figure 2)
- We merged all such pairs containing the same TC, while at the same time avoiding overlapping forward and reverse strand transcribed regions (prioritization by expression ranking), which resulted in 200,171 bidirectional loci (step 2 in Figure 2). A center position was defined for each bidirectional locus as the mid position between the rightmost reverse strand TC and leftmost forward strand TC included in the merged bidirectional pair.
- Each bidirectional locus was further associated with two 200 bp regions immediately flanking the center position, one (left) for reverse strand transcription and one (right) for forward strand transcription, in a divergent manner. The merged bidirectional pairs were further required to be bidirectionally transcribed (CAGE tags supporting both windows flanking the center) in at least one individual sample, and to have a greater aggregate of reverse CAGE tags (over all FANTOM5 samples) than forward CAGE tags in the 200 bp region associated with reverse strand transcription, and vice versa. These filtering steps resulted in 78,555 bidirectionally transcribed loci.
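The pairing step and the center and window definitions above can be sketched in a few lines of Python. This is only an illustration, not the code from the GitHub repository: the `TC` type and the function names are made up for this example, coordinates are assumed to be 0-based and half-open, and the 400 bp separation is assumed to be measured from the end of the reverse strand TC to the start of the forward strand TC. The merging of pairs (step 2) and the expression-based filters are not covered.

```python
from typing import List, NamedTuple, Tuple


class TC(NamedTuple):
    """A CAGE tag cluster (hypothetical type; 0-based, half-open coordinates)."""
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


def divergent_pairs(tcs: List[TC], max_gap: int = 400) -> List[Tuple[TC, TC]]:
    """Pair each reverse strand TC with forward strand TCs that lie
    downstream of it on the same chromosome, at most max_gap bp away."""
    reverse = [t for t in tcs if t.strand == "-"]
    forward = [t for t in tcs if t.strand == "+"]
    pairs = []
    # Quadratic scan: fine for a sketch, too slow for millions of TCs.
    for r in reverse:
        for f in forward:
            if f.chrom == r.chrom and 0 <= f.start - r.end <= max_gap:
                pairs.append((r, f))
    return pairs


def center_and_windows(rightmost_reverse: TC, leftmost_forward: TC,
                       window: int = 200):
    """Mid position between the two TCs, plus the flanking windows
    for reverse (left) and forward (right) strand transcription."""
    center = (rightmost_reverse.end + leftmost_forward.start) // 2
    left = (max(0, center - window), center)
    right = (center, center + window)
    return center, left, right


# Invented toy data: one reverse TC and two forward TCs on chr1.
tcs = [
    TC("chr1", 100, 120, "-"),
    TC("chr1", 300, 330, "+"),
    TC("chr1", 900, 920, "+"),  # too far from the reverse TC to pair
]
pairs = divergent_pairs(tcs)
center, left, right = center_and_windows(*pairs[0])
print(pairs, center, left, right)
```

For real data, an interval tool or a sorted sweep over the TCs would replace the nested loop.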
We quantified the expression of bidirectional loci for each strand and 200 bp flanking window in each of the 808 FANTOM libraries separately by counting the CAGE tags whose 5′ ends were located within these windows. The expression values of both flanking windows were normalized by converting tag counts to tags per million mapped reads (TPM). The normalized expression values from both windows were used to calculate a sample-set wide directionality score, D, for each bidirectional locus from the normalized reverse, R, and forward, F, strand expression values aggregated across all samples (Figure 2):
D = (F-R) / (F+R).
D ranges between -1 and 1, where negative values indicate a bias towards reverse strand expression and positive values a bias towards forward strand expression (D=0 means 50% reverse and 50% forward strand expression, while abs(D) close to 1 indicates unidirectional transcription). Each bidirectional locus was assigned one expression value for each sample by summing the normalized expression of the two flanking windows.
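As a small worked example of the normalization and the directionality score, here is a Python sketch for a single locus. The tag counts and library sizes are invented, the function names are my own for this illustration, and returning None when a locus has no tags at all is an assumption (the formula is undefined when F+R is zero).

```python
from typing import Optional


def tpm(tag_count: int, mapped_tags: int) -> float:
    """Convert a raw tag count to tags per million mapped reads."""
    return tag_count * 1_000_000 / mapped_tags


def directionality(forward: float, reverse: float) -> Optional[float]:
    """D = (F - R) / (F + R); None if the locus has no expression."""
    total = forward + reverse
    if total == 0:
        return None
    return (forward - reverse) / total


# Invented data for one locus in three libraries:
# (forward window tags, reverse window tags, mapped tags in library)
libraries = [(6, 2, 2_000_000), (3, 1, 1_000_000), (0, 0, 4_000_000)]

fwd_tpm = [tpm(f, n) for f, _, n in libraries]
rev_tpm = [tpm(r, n) for _, r, n in libraries]

# Aggregate normalized expression over all samples, then score.
F = sum(fwd_tpm)
R = sum(rev_tpm)
D = directionality(F, R)

# One expression value per sample: the sum of both flanking windows.
per_sample_expression = [f + r for f, r in zip(fwd_tpm, rev_tpm)]

print(F, R, D, per_sample_expression)
```

Here the locus has three times as much forward as reverse strand expression, which gives D = 0.5.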
Prediction of enhancers from bidirectionally transcribed loci
Bidirectional loci were filtered to have low, non-promoter-like, directionality scores (abs(D) < 0.8, Figure 1B) and to be located distal to TSSs and exons of protein-coding and non-coding genes. This resulted in a final set of 43,011 putative enhancers.
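The final filter amounts to a simple predicate per locus. In the Python sketch below, the proximity to TSSs and exons is assumed to have been computed elsewhere and is passed in as booleans, since the distance cutoffs are not given in this post. The function name and the example loci are invented for illustration.

```python
def is_putative_enhancer(d: float, near_tss: bool, near_exon: bool,
                         max_abs_d: float = 0.8) -> bool:
    """Keep loci with balanced transcription that are not close to
    annotated TSSs or exons (proximity flags computed elsewhere)."""
    return abs(d) < max_abs_d and not near_tss and not near_exon


# Invented example loci: (name, D, near a TSS, near an exon)
loci = [
    ("locus_a", 0.10, False, False),   # balanced and distal: kept
    ("locus_b", 0.95, False, False),   # promoter-like directionality
    ("locus_c", -0.20, True, False),   # too close to a TSS
    ("locus_d", -0.79, False, False),  # just within the cutoff: kept
]
enhancers = [name for name, d, tss, exon in loci
             if is_putative_enhancer(d, tss, exon)]
print(enhancers)
```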
Each predicted enhancer is described in BED12 format with two blocks denoting the merged regions of transcription initiation on the minus and plus strands. The thickStart and thickEnd columns denote the inferred mid position (Figure 2). The score column gives the maximum pooled expression of the TCs used to construct each bidirectional locus.
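To read such a file, the standard BED12 columns can be unpacked as in the Python sketch below. The example line is invented, not taken from the real enhancer set, and the code assumes that the first block is the minus strand region and the second the plus strand region. Block starts in BED12 are relative to chromStart.

```python
def parse_enhancer_bed12(line: str) -> dict:
    """Parse one tab-separated BED12 line into absolute coordinates."""
    f = line.rstrip("\n").split("\t")
    chrom, start, end = f[0], int(f[1]), int(f[2])
    # blockSizes and blockStarts are comma-separated lists that may
    # carry a trailing comma.
    sizes = [int(x) for x in f[10].rstrip(",").split(",")]
    starts = [int(x) for x in f[11].rstrip(",").split(",")]
    blocks = [(start + s, start + s + n) for s, n in zip(starts, sizes)]
    return {
        "chrom": chrom,
        "start": start,
        "end": end,
        "name": f[3],
        "score": float(f[4]),
        "mid": int(f[6]),          # thickStart holds the mid position
        "minus_block": blocks[0],  # assumption: first block is minus
        "plus_block": blocks[1],   # assumption: second block is plus
    }


# Invented example line in BED12 layout.
example = "\t".join([
    "chr1", "1000", "1500", "chr1:1000-1500", "11", ".",
    "1225", "1226", "0,0,0", "2", "100,150", "0,350",
])
enhancer = parse_enhancer_bed12(example)
print(enhancer)
```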