An R package for regulatory element analysis using transcription initiation data • PRIME

Quantitative analysis of transcription initiation data

PRIME is an R package for quantitative analysis of transcription initiation data from CAGE and related TSS assays. It extends CAGEfightR with tools for regulatory element characterization, including transcription initiation complexity, signal saturation, background noise, divergent transcription, and core promoter architecture. PRIME also provides an interface to a LightGBM model for data-driven regulatory element scoring (via PRIMEmodel).

PRIME is designed to work alongside CAGEfightR and uses the same Bioconductor data structures (SummarizedExperiment / RangedSummarizedExperiment and GRanges). PRIME adds additional quantitative and modeling utilities while staying fully compatible with Bioconductor genomic infrastructure.

Core functionality

TSS-level quantification & processing

TSS-level quantification (via CAGEfightR::quantifyCTSSs())
expression summary, normalization, subsampling
transcription initiation metrics
- complexity (e.g., dispersion/entropy-style summaries depending on analysis)
- saturation / downsampling-based analyses
- background noise estimation

Regulatory element characterization

divergent transcription detection and quantification
tag cluster analysis utilities (decomposition and downstream quantification)
strand balance and other profile-derived summaries (depending on workflow)

Core promoter analysis

promoter decomposition
positional dispersion summaries (e.g., width/dispersion-style measures)
initiator sequence patterns (INR-like classifications)

Profile-based analysis

signal aggregation around genomic features
window-based summarization and heatmap-style matrices

Machine learning interface

interfaces to score candidate regulatory elements using the PRIME LightGBM model (distributed via PRIMEmodel)

PRIME toolkit

The PRIME toolkit consists of three interconnected tools for the analysis of transcription initiation data (e.g., CAGE):

Tool	Type	Purpose
PRIMEprep	Bash pipeline	Raw FASTQ → QC → trimming → mapping → BigWig
PRIME	R package	TSS quantification, divergent loci, promoter decomposition, normalization, noise estimation
PRIMEmodel	R package + Python	Genome-wide prediction / scoring of regulatory elements

Input data

PRIME works with standard Bioconductor genomic data structures:

TSS-level data: typically a RangedSummarizedExperiment produced by CAGEfightR (row ranges are TSS positions; assays contain counts/TPM).
Tag clusters / loci / regions: typically GRanges (or RangedSummarizedExperiment objects where rowRanges() are regions).
PRIME functions are generally compatible with SummarizedExperiment + GRanges workflows and integrate with the Bioconductor ecosystem.

Getting started

PRIME is a toolbox (not a single rigid pipeline). The fastest way to get started is to follow the vignettes on the PRIME website:

Articles index: https://anderssonlab.org/PRIME/articles/
Installing PRIME: https://anderssonlab.org/PRIME/articles/installation.html
TSS processing & QC: https://anderssonlab.org/PRIME/articles/ctss-processing.html
Tag cluster decomposition: https://anderssonlab.org/PRIME/articles/tag-cluster-decomposition.html
Divergent loci: https://anderssonlab.org/PRIME/articles/divergent-loci.html
Normalization & batches: https://anderssonlab.org/PRIME/articles/normalization-batches.html
Noise estimation: https://anderssonlab.org/PRIME/articles/noise-estimation.html
Regulatory element prediction (PRIMEmodel): https://anderssonlab.org/PRIME/articles/prediction.html
End-to-end workflow: https://anderssonlab.org/PRIME/articles/end-to-end-workflow.html

Relationship to CAGEfightR

PRIME is designed to work alongside CAGEfightR. It uses the same data structures (SummarizedExperiment and GRanges) and extends CAGEfightR with additional quantitative and modeling utilities for regulatory element analysis.

CAGEfightR repository: https://github.com/MalteThodberg/CAGEfightR

PRIME model (PRIMEmodel)

PRIME (this package) is the analysis toolkit (TSS QC, quantification helpers, complexity/noise utilities, tag cluster and promoter analysis).
PRIMEmodel distributes the trained LightGBM model and provides genome-wide (or focal) scoring of candidate regulatory elements from TSS signal profiles.

PRIMEmodel repository: https://github.com/anderssonlab/PRIMEmodel
PRIMEmodel website: https://anderssonlab.org/PRIMEmodel/

Example applications

Compare transcription initiation complexity across conditions
Identify and characterize divergently transcribed loci
Analyze core promoter architecture via decomposition and dispersion measures
Perform profile-based analyses around genomic features (heatmap-style and window summaries)
Score candidate regulatory elements using the PRIME model (via PRIMEmodel)

Installation

PRIME depends on a mix of CRAN, Bioconductor, and GitHub packages. For detailed instructions, see:

https://anderssonlab.org/PRIME/articles/installation.html

Documentation

Website: https://anderssonlab.org/PRIME/
Articles (vignettes): https://anderssonlab.org/PRIME/articles/
Reference: https://anderssonlab.org/PRIME/reference/

PRIME