ORFcompare¶
ORFcompare is a companion tool to ORFanage that compares CDS annotations between query and template transcripts. While ORFanage is designed to annotate new ORFs, ORFcompare evaluates and quantifies the differences between existing CDS annotations in two GTF/GFF files.
This tool is particularly useful for:
Validating ORFanage results against known annotations
Comparing CDS annotations from different sources
Quantifying frame preservation between transcript isoforms
Identifying transcripts with matching or divergent coding regions
Basic Usage¶
At minimum, ORFcompare requires a query file, a template file, and an output file:
$ orfcompare --query query.gtf --template template.gtf --output comparison.tsv
To include start and stop codon information, provide a reference genome:
$ orfcompare --reference genome.fa --query query.gtf --template template.gtf --output comparison.tsv
Output Format¶
ORFcompare outputs a tab-separated file containing metrics for each query/template pair comparison. For each query transcript that overlaps with a template transcript, the following metrics are computed:
Length metrics: CDS lengths for both query and template
Overlap metrics: Matching, in-frame, out-of-frame, extra, and missing bases
Percent identity metrics: LPI, MLPI, and ILPI scores
Codon information: Start and stop codons for both transcripts (requires --reference)
For detailed column descriptions, please refer to the ORFcompare Stats Output section in File Formats.
Description of Options¶
--query¶
Path to the GTF/GFF file containing query transcripts with CDS annotations to be compared.
--template¶
Path to the GTF/GFF file containing template (reference) transcripts with CDS annotations.
--output¶
Path to the output TSV file where comparison results will be written.
--reference¶
Path to the reference genome in FASTA format. When provided, enables extraction of the amino acids encoded by the start and stop codons of both query and template transcripts. This allows verification of proper translation initiation (M for methionine) and termination (. for a stop codon).
--threads¶
Number of threads to use for parallel processing. Similar to ORFanage, transcripts are grouped by coordinate overlap or gene ID and processed independently.
--use_id¶
When enabled, transcripts are grouped by gene ID rather than coordinate overlap. This is useful when gene IDs are consistent between query and template files.
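Before enabling --use_id, it can help to check how many gene IDs the two files actually share. The following is a minimal sketch, not part of ORFcompare itself. It assumes GTF-style attributes of the form gene_id "X"; (GFF3 files use a different attribute syntax and would need a different pattern). The toy files and their records are made up for illustration.

```shell
# Toy GTF files standing in for query.gtf and template.gtf (made-up records).
printf 'chr1\ttoy\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "q1";\n' > toy_query.gtf
printf 'chr1\ttoy\ttranscript\t2000\t2900\t.\t+\t.\tgene_id "G2"; transcript_id "q2";\n' >> toy_query.gtf
printf 'chr2\ttoy\ttranscript\t500\t1500\t.\t-\t.\tgene_id "G3"; transcript_id "q3";\n' >> toy_query.gtf

printf 'chr1\ttoy\ttranscript\t100\t950\t.\t+\t.\tgene_id "G1"; transcript_id "t1";\n' > toy_template.gtf
printf 'chr1\ttoy\ttranscript\t2000\t2800\t.\t+\t.\tgene_id "G2"; transcript_id "t2";\n' >> toy_template.gtf
printf 'chr3\ttoy\ttranscript\t10\t700\t.\t+\t.\tgene_id "G4"; transcript_id "t4";\n' >> toy_template.gtf

# Print the sorted, unique gene_id values found in a GTF file.
extract_gene_ids() {
    grep -v '^#' "$1" | grep -o 'gene_id "[^"]*"' | cut -d'"' -f2 | sort -u
}

extract_gene_ids toy_query.gtf > toy_query_ids.txt
extract_gene_ids toy_template.gtf > toy_template_ids.txt

# comm -12 keeps only the lines present in both (sorted) inputs.
comm -12 toy_query_ids.txt toy_template_ids.txt > toy_shared_ids.txt

echo "query gene IDs:    $(wc -l < toy_query_ids.txt | tr -d ' ')"
echo "template gene IDs: $(wc -l < toy_template_ids.txt | tr -d ' ')"
echo "shared gene IDs:   $(wc -l < toy_shared_ids.txt | tr -d ' ')"
```

If most query gene IDs also appear in the template file, grouping by gene ID is likely to be meaningful. If few are shared, grouping by coordinate overlap is probably the safer choice.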
Interpreting Results¶
The key metrics to evaluate CDS similarity are:
- ILPI (In-frame Length Percent Identity)
The most important metric for coding sequence comparison. High ILPI (>90%) indicates the query and template share most of their coding sequence in the same reading frame.
- Start/Stop Codon Match
M for start codon indicates canonical translation initiation
. for stop codon indicates proper termination
Other amino acid letters indicate incomplete or alternative codons
- len_extra / len_missing
These values quantify how the query CDS differs from the template:
High len_extra: Query has additional coding sequence not in template
High len_missing: Query is missing coding sequence present in template
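The two values can be read together to label how each query CDS differs from its template. The sketch below is one possible way to do this with awk. It assumes the output file has a header row and that the columns are literally named len_extra and len_missing. The label names (none, extra_only, missing_only, both) and the toy rows are made up for illustration and are not produced by ORFcompare.

```shell
# Toy stand-in for an ORFcompare output file (made-up rows, reduced columns).
printf 'query_id\tlen_extra\tlen_missing\n' > toy_lengths.tsv
printf 'tx1\t0\t0\n'   >> toy_lengths.tsv
printf 'tx2\t30\t0\n'  >> toy_lengths.tsv
printf 'tx3\t0\t45\n'  >> toy_lengths.tsv
printf 'tx4\t12\t9\n'  >> toy_lengths.tsv

# Locate both columns by header name, then append a label column to each row.
awk -F'\t' '
    NR == 1 {
        for (i = 1; i <= NF; i++) {
            if ($i == "len_extra")   e = i
            if ($i == "len_missing") m = i
        }
        if (!e || !m) {
            print "expected columns len_extra and len_missing" > "/dev/stderr"
            exit 1
        }
        print $0 "\tcds_change"
        next
    }
    {
        extra   = $e + 0   # force numeric interpretation
        missing = $m + 0
        if (extra > 0 && missing > 0) label = "both"
        else if (extra > 0)           label = "extra_only"
        else if (missing > 0)         label = "missing_only"
        else                          label = "none"
        print $0 "\t" label
    }
' toy_lengths.tsv > toy_lengths_labeled.tsv

cat toy_lengths_labeled.tsv
```

Looking columns up by name rather than by position keeps the script working if the column order differs from what is assumed here.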
Example Analysis¶
Compare ORFanage output against the original reference annotation:
$ orfcompare --reference genome.fa \
--query orfanage_output.gtf \
--template original_annotation.gtf \
--output validation.tsv
Then filter for high-confidence matches:
# Find transcripts with >95% in-frame identity
awk -F'\t' '$12 > 95' validation.tsv > high_confidence.tsv
# Find pairs where both the query and the template start with methionine (M)
awk -F'\t' '$13 == "M" && $14 == "M"' validation.tsv > matching_starts.tsv
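The awk filters above select columns by position, so they depend on the exact column order of the output file. A more defensive variant finds the column by its header name and keeps the header row in the filtered file. The sketch below assumes the output has a header row and uses ilpi as the column name, which is a hypothetical placeholder; check the actual header of your file. The toy rows are made up for illustration.

```shell
# Toy stand-in for validation.tsv (made-up rows; "ilpi" is an assumed column name).
printf 'query_id\ttemplate_id\tilpi\n' > toy_validation.tsv
printf 'tx1\tref1\t99.2\n' >> toy_validation.tsv
printf 'tx2\tref1\t40.0\n' >> toy_validation.tsv
printf 'tx3\tref2\t96.5\n' >> toy_validation.tsv

# Find the column index from the header, print the header, then filter numerically.
awk -F'\t' -v col="ilpi" -v min=95 '
    NR == 1 {
        for (i = 1; i <= NF; i++) if ($i == col) idx = i
        if (!idx) {
            print "column not found: " col > "/dev/stderr"
            exit 1
        }
        print
        next
    }
    ($idx + 0) > min
' toy_validation.tsv > toy_high_confidence.tsv

cat toy_high_confidence.tsv
```

To apply the same filter to real output, replace toy_validation.tsv with validation.tsv and set col to the name shown in that file's header.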