Run the full PRIMEmodel pipeline for regulatory element prediction

This function executes the complete PRIMEmodel pipeline to predict regulatory elements (enhancers and promoters) from CTSS CAGE data. It integrates CAGE-derived tag clustering, feature preparation, prediction using a pre-trained LightGBM model, and post-processing to output high-confidence non-overlapping loci.

Usage

predict(
  ctss_rse,
  python_path = NULL,
  score_threshold = 0.75,
  score_diff = 0.1,
  num_cores = NULL,
  keep_tmp = FALSE,
  log_dir = NULL,
  ...
)

Arguments

ctss_rse

A `RangedSummarizedExperiment` object containing CTSS-level expression data.

python_path

Path to the Python environment to use. This can be one of the following:

A full path to a Python binary (e.g., `"/usr/bin/python3"`),
A path to a virtualenv directory (must contain `bin/`),
The name of a conda environment (e.g., `"prime-env"`),
A full path to a conda environment directory.

The specified path must exist and be valid. If `NULL`, the function tries to locate Python with `reticulate::py_config()`. Default is `NULL`.
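The four accepted forms can be told apart with a few file-system checks. The sketch below is only an illustration in base R: `classify_python_path()` is a hypothetical helper, not a PRIMEmodel function, and the `conda-meta/` test is an assumption about how a conda environment directory is recognised.

```r
# Hypothetical helper (not part of PRIMEmodel): guess which of the four
# accepted forms a python_path value takes.
classify_python_path <- function(python_path) {
  if (is.null(python_path)) {
    return("auto")  # defer to reticulate's own discovery
  }
  stopifnot(is.character(python_path), length(python_path) == 1L)

  if (dir.exists(python_path)) {
    # Assumption: a conda environment directory contains conda-meta/.
    # It is checked first because conda environments may also contain bin/.
    if (dir.exists(file.path(python_path, "conda-meta"))) return("conda_dir")
    # A virtualenv directory must contain bin/.
    if (dir.exists(file.path(python_path, "bin"))) return("virtualenv")
    stop("Directory is neither a virtualenv nor a conda environment: ", python_path)
  }
  if (file.exists(python_path)) {
    return("binary")  # an existing file is taken to be a Python binary
  }
  # Nothing on disk and no path separator: treat it as a conda environment name.
  if (!grepl("[/\\\\]", python_path)) {
    return("conda_name")
  }
  stop("Path does not exist: ", python_path)
}

# Usage: build a directory that looks like a virtualenv and classify it.
venv_dir <- tempfile("venv")
dir.create(file.path(venv_dir, "bin"), recursive = TRUE)
path_kind <- classify_python_path(venv_dir)
```

A check of this kind fails early with a readable message instead of surfacing a less specific error once Python is initialized.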

Important: If Python is already initialized (e.g., in RStudio or a long-running session), changing the Python environment from within the function has no effect. To guarantee that the correct Python is used (especially when pointing to a system binary such as `"/usr/bin/python3"`), set the environment variable `RETICULATE_PYTHON` before starting R or RStudio. Alternatively, call `plc_configure_python()` early in the session.

score_threshold

Minimum score threshold for core region predictions. Must be between 0 and 1. Default is `0.75`.

score_diff

Minimum score difference required between merged regions. Must be non-negative and less than `score_threshold`. Default is `0.1`.
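The constraints on `score_threshold` and `score_diff` can be expressed as a small validation step. This is a sketch under stated assumptions: `validate_scores()` is a hypothetical helper, and whether the bounds 0 and 1 are inclusive in the package is not stated, so they are treated as inclusive here.

```r
# Hypothetical helper (not part of PRIMEmodel): check the documented constraints.
validate_scores <- function(score_threshold = 0.75, score_diff = 0.1) {
  stopifnot(
    is.numeric(score_threshold), length(score_threshold) == 1L,
    is.numeric(score_diff), length(score_diff) == 1L
  )
  # Assumption: the interval [0, 1] is closed.
  if (score_threshold < 0 || score_threshold > 1) {
    stop("score_threshold must be between 0 and 1")
  }
  # score_diff must be non-negative and strictly less than score_threshold.
  if (score_diff < 0 || score_diff >= score_threshold) {
    stop("score_diff must be non-negative and less than score_threshold")
  }
  invisible(TRUE)
}

# Usage: the defaults pass, while score_diff >= score_threshold is rejected.
defaults_ok <- validate_scores()
bad_diff <- tryCatch(validate_scores(0.5, 0.5), error = function(e) conditionMessage(e))
```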

num_cores

Number of cores to use for parallel processing. Must be a positive integer or `NULL` to auto-detect. Default is `NULL`.
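How `NULL` is resolved to a core count is not described here. One plausible approach, shown only as a sketch, uses `parallel::detectCores()`; the helper `resolve_num_cores()` and the choice to leave one core free are assumptions, not the package's documented behaviour.

```r
# Hypothetical helper (not part of PRIMEmodel): turn num_cores into an integer.
resolve_num_cores <- function(num_cores = NULL) {
  if (is.null(num_cores)) {
    detected <- parallel::detectCores()
    # detectCores() can return NA on some platforms; fall back to one core.
    if (is.na(detected)) return(1L)
    # Assumption: keep one core free for the rest of the system.
    return(max(1L, detected - 1L))
  }
  valid <- is.numeric(num_cores) && length(num_cores) == 1L &&
    !is.na(num_cores) && num_cores >= 1 && num_cores == round(num_cores)
  if (!valid) stop("num_cores must be a positive integer or NULL")
  as.integer(num_cores)
}

# Usage: an explicit value is passed through, NULL is auto-detected.
explicit_cores <- resolve_num_cores(2)
auto_cores <- resolve_num_cores(NULL)
```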

keep_tmp

Logical. Whether to keep intermediate files (e.g., profiles and temp folders). Default is `FALSE`.

log_dir

Optional. Directory path where a log file named `"PRIMEmodel.log"` will be written. If `NULL`, logs will be printed to the R console. Default is `NULL`.
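The switch between console and file logging can be pictured as follows. This is a minimal sketch: `log_message()` is a hypothetical helper and the timestamp format is an assumption; only the file name `"PRIMEmodel.log"` comes from the description above.

```r
# Hypothetical helper (not part of PRIMEmodel): write to the console when
# log_dir is NULL, otherwise append to <log_dir>/PRIMEmodel.log.
log_message <- function(msg, log_dir = NULL) {
  line <- sprintf("[%s] %s", format(Sys.time(), "%Y-%m-%d %H:%M:%S"), msg)
  if (is.null(log_dir)) {
    message(line)
  } else {
    if (!dir.exists(log_dir)) dir.create(log_dir, recursive = TRUE)
    cat(line, "\n", sep = "", file = file.path(log_dir, "PRIMEmodel.log"), append = TRUE)
  }
  invisible(line)
}

# Usage: log one line to a temporary directory and read it back.
log_dir <- tempfile("prime_logs")
log_message("Pipeline started", log_dir = log_dir)
log_lines <- readLines(file.path(log_dir, "PRIMEmodel.log"))
```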

Value

A `GRanges` or `GRangesList` object containing the final predicted loci after post-processing.

Details

The function is designed for users who prefer a single-command workflow from a `RangedSummarizedExperiment` input to the final genomic predictions.

The pipeline was originally developed for human genome (hg38) CAGE data, but can be adapted for other genomes with similar CAGE annotations.

The PRIMEmodel pipeline includes the following steps:

Identifying Tag Clusters (TCs): Derive TCs from the CTSS data using the CAGEfightR package.
Sliding Through TCs: Slide through the identified TCs (default window size = 20) to create tiled regions for downstream processing.
Creating Normalized Profiles: Generate normalized transcriptional profiles suitable for input into the prediction model.
Predicting Profile Probabilities: Use pre-trained PRIME LightGBM models to assign probabilities to each region, indicating likelihood of being a regulatory element.
Post-Processing: Refine and filter model predictions using score thresholds and additional criteria. Output non-overlapping core regulatory loci as a `GRanges` or `GRangesList` object and, optionally, in BED format for further analysis.
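To make the last step more concrete, the sketch below filters scored regions by a threshold and then keeps the highest-scoring region among any that overlap. It is an illustration in base R only, not PRIME's actual post-processing: `select_core_loci()` and the toy data are hypothetical, `score_diff` and the additional criteria are not modelled, and coordinates are treated as closed intervals.

```r
# Hypothetical helper (not part of PRIMEmodel): threshold filter followed by
# greedy selection of non-overlapping regions, best score first.
select_core_loci <- function(regions, score_threshold = 0.75) {
  kept <- regions[regions$score >= score_threshold, , drop = FALSE]
  kept <- kept[order(-kept$score), , drop = FALSE]
  chosen <- kept[0, , drop = FALSE]
  for (i in seq_len(nrow(kept))) {
    cand <- kept[i, , drop = FALSE]
    # Closed intervals overlap when each one starts before the other ends.
    overlaps <- chosen$chrom == cand$chrom &
      chosen$start <= cand$end & chosen$end >= cand$start
    if (!any(overlaps)) chosen <- rbind(chosen, cand)
  }
  chosen[order(chosen$chrom, chosen$start), , drop = FALSE]
}

# Toy input: two overlapping chr1 regions, one low-scoring region, one chr2 region.
regions <- data.frame(
  chrom = c("chr1", "chr1", "chr1", "chr2"),
  start = c(100, 150, 400, 100),
  end   = c(300, 350, 600, 300),
  score = c(0.90, 0.80, 0.60, 0.95),
  stringsAsFactors = FALSE
)
core <- select_core_loci(regions)
# core keeps chr1:100-300 (0.90) and chr2:100-300 (0.95).
```

The greedy order means a lower-scoring region survives only if it does not touch any region already selected.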

Python Environment

This function attempts to configure Python using the `python_path` argument. However, reticulate binds an R session to a single Python interpreter once Python is initialized, so the environment must be chosen before initialization. If using a system Python path (e.g., `"/usr/bin/python3"`), set `RETICULATE_PYTHON` before launching R. For virtualenvs and conda environments, configuration within the session usually works unless Python has already been initialized.
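As a rough illustration of the ordering constraint, the snippet below sets `RETICULATE_PYTHON` only if Python has not been initialized yet. The path is just the example used above and should be replaced with your own. `reticulate::py_available()` is called only when reticulate is already loaded, so the snippet also runs in a session without it.

```r
# Assumption: "/usr/bin/python3" is an example path; substitute your own binary.
python_bin <- "/usr/bin/python3"

# py_available() reports whether Python has been initialized in this session
# and does not itself trigger initialization.
python_initialized <- "reticulate" %in% loadedNamespaces() &&
  reticulate::py_available()

if (python_initialized) {
  message("Python is already initialized; restart R for the change to take effect.")
} else {
  # Must happen before the first call that initializes Python.
  Sys.setenv(RETICULATE_PYTHON = python_bin)
}
```

For a setting that persists across sessions, the same variable can instead be placed in `~/.Renviron` as a line of the form `RETICULATE_PYTHON=/usr/bin/python3`, which R reads at startup.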

If `keep_tmp = FALSE`, temporary files will be removed after the pipeline completes.