Author: Robin Andersson

Open peer review of Young et al., BioRxiv, April 13 2016, DOI: 10.1101/048629

The recent BioRxiv preprint by Robert Young et al. entitled “Bidirectional transcription marks accessible chromatin and is not specific to enhancers” [1] is a rather controversial manuscript with strong claims. It caught my attention because it touches upon the research we are conducting in my lab, and refers frequently in its comparisons to one of the FANTOM consortium papers inferring enhancer activity from enhancer transcription [2], on which I am a first author.

In their preprint, Young et al. draw two major conclusions:

  1. That “bidirectional transcription initiation from accessible chromatin is not sufficient for, nor specific to, enhancer activity
  2. That the majority of eRNAs are likely not functional, based upon lack of evolutionary constraints and no evidence of purifying selection in eRNA exons

In this post I mainly focus on the first claim. The second claim very much agrees with our own previous analyses on evolutionary conservation of enhancer transcription units (TUs) and also with the fact that most eRNAs are rapidly degraded in the nucleus [2] [3]. (For an overview of how the fate of these RNAs are determined and how the cell deals with pervasive transcription see these recent reviews [4] [5].) Just as a minor comment I want to point out that the focus on multi-exonic eRNAs bias the analysis to a very small fraction of eRNAs (see our previous work on this [2] [3]) and therefore does not support general conclusions about eRNAs as a group. However, assuming functional eRNAs are more stable, the fraction of eRNAs with potential function likely belong to this minority group as splicing is strongly related to RNA stability [3].

The current preprint joins the debate on how to best predict active enhancers in mammalian genomes. During the last ten years the standard practice has mainly been to predict activity based on histone modifications and/or TF or co-activator binding (see my previous review [6] for an overview of some approaches). Several recent studies (e.g. [2] and [7]) have, however, suggested that such predictions contain a large fraction of false positives. Despite this, the field is today highly dogmatized on what constitutes a proper enhancer. This is apparent in the manuscript by Young et al. in their trust in chromatin state predictions of types of regulatory regions and that these states represent functionally distinct units.

While core promoter regions are quite straightforward to identify with current accurate sequencing approaches, e.g. CAGE (5’ end sequencing of cap-selected RNAs) and GRO-cap (5’ end sequencing of cap-selected nascent RNAs), enhancer identification is harder. Our approach is based on the identification of loci with divergent transcription initiation producing balanced amount of eRNAs on both strands. This is challenging because eRNAs are often low abundant and rapidly degraded in the nucleus by the 3’-5’ exoribonucleolytic exosome complex. I therefore welcome studies in which our approach is challenged and may point us in directions on how to improve. However, we first have to assess the reliability of methods and claims by Young et al.

Young et al. claim that bidirectional transcription is not specific to enhancers. I strongly agree with this claim and would even propose that the vast majority of human transcription initiation events are coupled with proximal upstream transcription initiation in the other direction (we have previously examined this property of human gene promoters [8]). So where do our results disagree? It boils down to whether non-gene promoter transcription initiation events are limited to enhancers or whether all open chromatin loci show transcription initiation.

Young et al, utilized CAGE, GRO-cap, GRO-seq, and PRO-seq data to determine transcription initiation levels and directionality at regulatory regions predicted by chromatin state segmentations. Predictions of stability of transcripts were inspired by a recent study by Core et al. [9]. In the present study, the authors state that absence of CAGE signal but presence of GRO-cap, PRO-seq or GRO-seq signal identifies unstable RNAs, while presence of CAGE and any of the other identifies stable RNAs. At comparable sequencing depths and comparable genomic positional distributions, the identification of unstable transcripts by presence of GRO-cap and absence of CAGE reads is rather elegant. In its original form, Core et al. compared TSS-focused reads (GRO-cap) with CAGE, while Young et al. also include TSS-unspecific reads (GRO-seq and PRO-seq). This may easily affect results when intragenic reads from the latter two sequencing methods (but not GRO-cap) overlap putative regulatory regions. For instance, CTCF-promoted RNA polymerase II pausing (e.g. [10]) in introns will likely generate local accumulation of non-TSS GRO-seq or PRO-seq reads. Secondly, presence of CAGE reads and reads from nascent RNA sequencing approaches (even if TSS specific) does not necessarily mean that the RNA is stable. Note that the CAGE data used by Young et al. is the very same that was used by us to identify enhancers, whose produced RNAs are mostly unstable. I would recommend the authors to restrict their TSS analyses to GRO-cap and CAGE and for stability measures use data in which an RNA decay pathway has been perturbed (like the HeLa-S3 exosome knockdown data [3] produced by us).

So, when their analysis shows that bidirectional transcription is not specific to chromatin state predicted enhancers and that bidirectional transcription is for instance present also at CTCF-bound loci I don’t really trust the results. Transcription initiation at CTCF-bound regions could simply be due to intragenic PRO-seq or GRO-seq signal, i.e. not at real TSSs as mentioned above. In addition, other studies (e.g. [11]) suggest that most CTCF sites are in fact untranscribed. Inaccurate classification of regulatory regions from chromatin data or the fact that the regulatory regions considered by Young et al. were defined in a hierarchical manner (i.e. enhancers could be predicted to be promoters as well) could also bias the results. If these predicted regulatory regions are really functionally different could we then really trust the results? Furthermore, Young et al. only required a single CAGE read on each strand at a regulatory locus to call it bidirectionally transcribed. Although CAGE (and GRO-cap) is a quite excellent method, all sequencing techniques are error prone and studies using such data therefore need to carefully assess the signal with respect to sequencing library noise levels. A single CAGE tag is not sufficient to reliably distinguish true signal from noise. What is the expected sequencing noise level in the CAGE (and GRO-cap, GRO-seq, PRO-seq) libraries investigated?

Young et al. further claim that bidirectional transcription is “a by-product of an opening of chromatin at all types of regulatory regions”. This claim is based on two observations. First they observe that the fraction of DHSs with detected transcription is proportional to the strength of the DHS signal. Is this really unexpected? DNase-seq signal is in individual cells binary (indicating open or close) and the signal over a population of cells measures the fraction of cells in agreement, i.e. DNase-seq signal reflects sample cell population heterogeneity in chromatin openness: the fraction of cells in a sample with a locus having open chromatin and being transcribed will determine both the expression and DNase-seq signal of that locus in a measured cell population. They later state that DHS signal shows clear discrimination between chromatin states in their correlation with (most proximal) genic transcription but that enhancer transcription level does not. How can the DHS signal be strongly correlated with local transcription levels and with genic expression levels when its local transcription level is not correlated with genic expression level? Some other apparent issues are 1) the now well-established fact that the most proximal promoter cannot be assumed to be regulated by an enhancer (ideally for such kind of an analysis chromatin interaction data is needed); 2) far from all DHSs are transcribed; 3) RNA polymerase II is not uniformly distributed in the nucleus (tend to co-localize in transcription factories together with multiple active regulatory elements). Furthermore, the order of events (transcription -> opening of DHSs) is not experimentally shown by Young et al., making this an unsupported claim.

Now, let’s get back to the burning point of disagreement: can enhancer RNAs be used to reliably predict enhancer activities? Young et al. state “transcription initiation provides positive predictive value for accessible DNA but no power to discriminate enhancer from non-enhancer”. Since in vitro enhancer reporter assays can only investigate enhancer potential in a quite artificial manner (enhancer close to promoter), regions that positively validate in such assays can be inactive in vivo. Furthermore, a negative validation result doesn’t necessarily mean that the region doesn’t have enhancer potential, due to putative enhancer-promoter preference (but only one minimal promoter is tested for multiple candidate enhancers) and the high background observed at many minimal promoters, which could mask a weak enhancement and result in a false negative. Therefore, statistics (in particular sensitivity) of any approach will be hard to assess when based on in vitro reporter assays. It is also important to point out that candidate sequences need to be selected randomly to not bias the validation results. The practice of testing a list of top candidates will therefore likely favor the validation results, making a comparison of statistics between studies close to meaningless. A technical note on their own assays: if I understand the method section correctly the candidate enhancer is inserted just upstream of the minimal promoter, i.e. there is no polyA site in between that could hinder transcription originating from the candidate enhancer. How will this affect the results?

Despite the issues listed above, one can still test the potential of predicted positives (positive predicted value (PPR): the fraction of predicted positives that validates positively) with enhancer reporter assays. We tested this previously [2] on randomly selected candidate enhancers and found that, as Young et al. report, around 70% of candidate enhancers predicted from bidirectional transcription validated. In contrast, we found that only around 25% of non-transcribed enhancers predicted from chromatin data did. Hence, their finding that the fraction of chromatin-defined enhancers without detected transcription initiation show significantly lower reporter activity than transcribed candidate enhancers agrees with FANTOM. They further find that bidirectional transcription also occurs at other regulatory regions that do not validate in reporter assays. This would indicate that bidirectional transcription as a marker of enhancers has a high false discovery rate (FDR). Can we trust that the bidirectional transcription initiation events considered in this study are OK (and not noise or intragenic non-TSS reads as discussed above)? I am a bit hesitant to trust the results mainly because of their use of GRO-seq and PRO-seq data to determine TSSs and our previous results [3] showing that repressive chromatin states tend to be untranscribed (quite a large fraction of repressed regions are in contrast found to be transcribed in the study by Young et al.). I would recommend that the authors provided PPR and FDR statistics for the different approaches. Regardless of chromatin-state prediction, what are the statistics of predictions based on bidirectional transcription alone and according to varying expression cutoffs or signal to noise ratios? In addition, I would recommend that they reproduce their analyses using alternative massive reporter assay data (e.g. [7]). Will their claims still hold?

Robin Andersson

[UPDATE: May 10, 2016]

On April 22, 2016, Robert Young and Martin Taylor responded to my concerns. In summary, PRO/GRO-seq data were not used for identification of transcription initiation sites (Methods section will be updated to clarify this), the authors still claim that bidirectional transcription is pervasive at all chromatin states and does not specifically identify active enhancers. Here is my reply.

References

  1. Young et al., Bidirectional transcription marks accessible chromatin and is not specific to enhancers, 2016
  2. Andersson et al., An atlas of active enhancers across human cell types and tissues, 2014
  3. Andersson et al., Nuclear stability and transcriptional directionality separate functionally distinct RNA species, 2014
  4. Andersson et al., A unified architecture of transcriptional regulatory elements, 2015
  5. Jensen et al., Dealing with Pervasive Transcription, 2013
  6. Andersson, Promoter or enhancer, what’s the difference? Deconstruction of established distinctions and presentation of a unifying model, 2014
  7. Kheradpour et al., Systematic dissection of regulatory motifs in 2000 predicted human enhancers using a massively parallel reporter assay, 2013
  8. Andersson et al., Human Gene Promoters Are Intrinsically Bidirectional, 2015
  9. Core et al., Analysis of nascent RNA identifies a unified architecture of initiation regions at mammalian promoters and enhancers, 2014
  10. Shukla et al., CTCF-promoted RNA polymerase II pausing links DNA methylation to splicing, 2011
  11. Danko et al., Identification of active transcriptional regulatory elements from GRO-seq data, 2015

Emerging similarities between regulatory elements

Mammalian transcription is controlled by a complex interplay of regulatory events. Together, these events determine the correct spatio-temporal initiation and rate of RNA polymerase II (RNAPII) gene transcription.

Several decades of research on transcriptional regulation have identified different modes of regulation that have been ascribed to distinct regulatory elements – stretches of DNA with specific information and genomic locations needed for their tasks. Core promoters are located at transcription initiation sites and carry binding sites to general transcription factors that help recruit and assemble the RNAPII machinery. The activation, rate of initiation and elongation are further influenced by promoter-distal regulatory elements called enhancers – elements that are generally believed to be the main determinants of cell-type-specific and precise developmental gene expression. Silencers – elements that bind repressors that block transcription – or insulators – boundary elements that block the interplay between promoters and enhancers or silencers, further add to the complex landscape of transcriptional regulation.

Incorrect spatio-temporal regulation of gene expression may lead to disease. This is one of the main motivations behind increased efforts to characterize the locations and cell-type-specific activities of regulatory elements in mammalian genomes. While the genomic locations of gene promoters may be inferred using various RNA sequencing techniques, other kinds of regulatory elements have been harder to localize. Major consortia such as the The Encyclopedia of DNA Elements (ENCODE) and the NIH Roadmap Epigenomics Mapping Consortium have therefore invested a considerable amount of time and resources to delineate rules on how to localize and infer the activities of regulatory elements.

However, recent studies have generated data that question the fundamental separation of regulatory elements into distinct entities. Transcription initiation occurs at enhancers (see also related post). This property does not only enable precise localization of enhancers; it is also a good predictor of their regulatory activities. Hence, the activity of an enhancer may be inferred by a property that is generally ascribed promoters. Adding to their similarities, gene promoters can have enhancer activities and a considerable fraction of DNA sequences with enhancer potential overlap core promoters in Drosophila.

I discuss the similarities between enhancers and promoters in a recent opinion paper [1]. Together with Albin Sandelin and Charles Danko, we take this one step further and suggest a unified architecture of transcriptional regulatory elements [2]. I am now happy to see that also Tae-Kyung Kim and Ramin Shiekhattar recognize the similarities between enhancers and promoters and the problems with their distinctions [3].

Similarities between regulatory elements have been discussed also elsewhere in the literature. Jesse Raab and Rohinton Kamakaka discuss the apparent similarities between insulators and promoters [4]. Alexander Feuerborn and Peter Cook suggest a unifying view of regulatory elements and that regulatory function is determined by the three-dimensional structure of chromatin and selective tethering of transcription units to transcription factories [5].

Taken together, recent observations call for a reconsideration of current discriminatory rules of regulatory elements. In light of recent data, regulatory elements should not generally be considered distinct entities but rather elements with varying functions. The functions of regulatory elements seem to be context dependent and determined by the physical proximity of other regulatory elements and their bound factors.

  1. Andersson R. Promoter or enhancer, what’s the difference? Deconstruction of established distinctions and presentation of a unifying model. Bioessays. 2015 Mar;37(3):314-23.
  2. Andersson R, Sandelin A, Danko CG. A unified architecture of transcriptional regulatory elements. Trends Genet. 2015 Aug;31(8):426-33.
  3. Kim T-K, Shiekhattar R. Architectural and Functional Commonalities between Enhancers and Promoters. Cell. 2015;162(5): 948-59.
  4. Raab JR, Kamakaka RT. Insulators and promoters: closer than we think. Nat Rev Genet. 2010 Jun;11(6):439-46.
  5. Feuerborn A, Cook PR. Why the activity of a gene depends on its neighbors. Trends Genet. 2015 Aug 7.

Identification of bidirectionally transcribed loci and prediction of enhancers [repost]

[ The following is a repost from anderssonr.wordpress.com ]

FANTOM5 (Functional Annotation of the Mammalian Genome) is an international research consortium established by Dr. Hayashizaki and his colleagues at RIKEN in Tokyo, Japan. Founded in 2000 to functionally annotate the mouse DNA sequence with advanced sequencing techniques, FANTOM has since developed and expanded over time to encompass the regulation of genes, networks of genes and their impact in disease. The FANTOM project includes over 500 scientists from more than 20 countries over the whole world.

In FANTOM5 we have used Cap Analysis of Gene Expression (CAGE) to map the sets of transcripts, transcription factors, promoters and enhancers active in the majority of mammalian primary cell types. We have also complemented this with profiles from cancer cell lines, and tissues. The results are published in two articles in Nature describing the promoterome (FANTOM Consortium et al. 2014) and enhancerome (Andersson et al. 2014) of mammalian cells along with several more focused papers in various journals.

In this post, I will explain the computational strategy I used in the enhancer paper (Andersson et al. 2014) for predicting the locations of transcriptional enhancers and quantifying their usage across 808 human FANTOM CAGE libraries.

Bidirectional (divergent) transcription at enhancers

We observed that enhancers, as defined from chromatin features (H3K4me1 and H3K27ac, see e.g. Bulger and Groudine 2011 for an overview of these features), were bidirectionally transcribed producing capped RNAs emanating (divergently) outwards from the center nucleosome deficient region (NDR) (Figure 1A). The observation of bidirectional transcription at enhancers is not new. Tae-Kyung Kim and colleagues (Kim et al. 2010) observed bidirectional transcription at active enhancers in mouse cortical neurons and coined the products eRNAs (enhancer RNAs).

Enhancer transcription as a marker of regulatory activity

Figure 1: Enhancer transcription as a marker of regulatory activity. A, Enhancers identified by chromatin marks were overlaid with CAGE data, revealing a bidirectional transcription pattern. B, Density plot illustrating the difference in directionality of transcription at transcription start sites of protein-coding genes and center positions of chromatin-defined enhancers. C, Success rates of in vitro enhancer assays in HeLa cells. Vertical axis shows the fraction of active enhancers (success defined by Student’s t-test, P<0.05, vs. random regions). Numbers of successful assays are shown on the respective bar. Figures are modified, with permission, from Andersson et al. © (2014) Macmillan Publishers Ltd. All rights reserved.

Although functional roles have been suggested for enhancer RNAs (see Lam et al. 2014 for an extensive review), such attribution remains debatable. Nevertheless, the production of eRNAs does provide insight into functional regulatory elements.

We found that the characteristics of enhancer transcription, detected using CAGE are sufficiently distinct from those of gene promoters to permit accurate genome-wide inference of enhancers from eRNAs – while transcription is mainly unidirectional at mRNA promoters, enhancers initiate bidirectional transcription (Figure 1B). Importantly, by in vitro assays, we showed that enhancer transcription is a much better predictor of enhancer activity than chromatin characteristics (3-fold increase in validation rate) (Figure 1C). These observations constitute the fundamental basis of my approach to infer the genomic locations of putative enhancers genome-wide.

Identification of bidirectionally transcribed loci

The computational strategy to predict enhancer locations is made available on Github. Below, I describe the procedure.

Identification of bidirectionally transcribed loci

Figure2: Identification of bidirectionally transcribed loci

Bidirectionally transcribed loci were defined from a set of 1,714,047 forward and 1,597,186 reverse strand CAGE tag clusters (TCs) supported by at least two CAGE tags in at least one sample (TCs defined in FANTOM Consortium et al. 2014). Only TCs not overlapping antisense TCs were used. The identification of bidirectional loci involves the following steps:

  1. We identified 1,261,036 divergent (reverse-forward) TC pairs separated by at most 400 bp (step 1 in Figure 2)
  2. We merged all such pairs containing the same TC, while at the same time avoiding overlapping forward and reverse strand transcribed regions (prioritization by expression ranking), which resulted in 200,171 bidirectional loci (step 2 in Figure 2). A center position was defined for each bidirectional locus as the mid position between the rightmost reverse strand TC and leftmost forward strand TC included in the merged bidirectional pair.
  3. Each bidirectional locus was further associated with two 200 bp regions immediately flanking the center position, one (left) for reverse strand transcription and one (right) for forward strand transcription, in a divergent manner. The merged bidirectional pairs were further required to be bidirectionally transcribed (CAGE tags supporting both windows flanking the center) in at least one individual sample, and to have a greater aggregate of reverse CAGE tags (over all FANTOM5 samples) than forward CAGE tags in the 200 bp region associated with reverse strand transcription, and vice versa. These filtering steps resulted in 78,555 bidirectionally transcribed loci.

We quantified the expression of bidirectional loci for each strand and 200 bp flanking window in each of the 808 FANTOM libraries separately by counting the CAGE tags whose 5′ ends were located within these windows. The expression values of both flanking windows were normalized by converting tag counts to tags per million mapped reads (TPM). The normalized expression values from both windows were used to calculate a sample-set wide directionality score, D, for each enhancer over aggregated normalized reverse, R, and forward, F, strand expression values across all samples (Figure 2);

D = (F-R) / (F+R).

D ranges between -1 and 1 and specifies the bias in expression to reverse and forward strand, respectively (D=0 means 50% reverse and 50% forward strand expression, while abs(D) close to 1 indicates unidirectional transcription). Each bidirectional locus was assigned one expression value for each sample by summing the normalized expression of the two flanking windows.

Prediction of enhancers from bidirectionally transcribed loci

Bidirectional loci were filtered to have low, non-promoter-like, directionality scores (abs(D) < 0.8, Figure 1B) and to be located distant to TSSs and exons of protein- and non-coding genes. This resulted in a final set of 43,011 putative enhancers.

The predicted enhancers and the expression of enhancers across FANTOM5 libraries are available at http://enhancer.binf.ku.dk (direct link to BED file: here).

Each predicted enhancer is described in BED12 format with two blocks denoting the merged regions of transcription initiation on the minus and plus strands. The thickStart and thickEnd columns denote the inferred mid position (Figure 2). The score column gives the maximum pooled expression of TCs used to construct each bidirectional loci.