Core Types
The main types of the package are ORF (Open Reading Frame) and ORFCollection. An ORF stores coordinates and metadata, while ORFCollection bundles ORFs with their source sequence for clean sequence extraction.
GeneFinder.GeneFinderMethod Type
abstract type GeneFinderMethodAbstract base type for different ORF finding methods/algorithms.
Subtypes should implement the calling interface to find ORFs in a sequence, returning an ORFCollection.
GeneFinder.ORFCollection Type
struct ORFCollection{F<:GeneFinderMethod, S<:NucleicSeqOrView}A collection of Open Reading Frames (ORFs) bundled with a view of their source sequence.
This is the primary return type for all GeneFinderMethod implementations. It provides a clean API for accessing ORFs and their corresponding sequences without relying on global state.
The source is always stored as a view (LongSubSeq) to avoid unnecessary copying while maintaining a reference to the original sequence data.
Type Parameters
F<:GeneFinderMethod: The gene finding algorithm used to identify these ORFs.S<:NucleicSeqOrView: The type of the source sequence view.
Fields
source::S: A view of the source DNA sequence containing the ORFs.orfs::Vector{ORF{F}}: The vector of ORFs found in the source sequence.
Example
using BioSequences, GeneFinder
seq = dna"ATGATGCATGCATGCATGCTAG"
collection = findorfs(seq, finder=NaiveFinder)
# Source is stored as a view
typeof(source(collection)) # LongSubSeq{DNAAlphabet{4}}
# Iteration
for orf in collection
println(orf)
end
# Sequence extraction
orfseq = sequence(collection, 1)See also: ORF, sequence, source
GeneFinder.OpenReadingFrame Type
struct OpenReadingFrame{F<:GeneFinderMethod}The OpenReadingFrame (aliased as ORF) struct represents an Open Reading Frame in genomics.
An ORF is a lightweight coordinate-based structure that stores the location and metadata of a potential coding region. ORFs are typically contained within an ORFCollection, which provides access to the source sequence.
Type Parameter
F<:GeneFinderMethod: The gene finding algorithm used to identify this ORF.
Fields
range::UnitRange{Int64}: The position range (start:stop) of the ORF on the sequence.strand::Strand: The strand orientation (PSTRANDorNSTRAND).frame::Int8: The reading frame (1, 2, or 3).features::NamedTuple: Additional features/metadata associated with the ORF.
Example
using BioSequences, GeneFinder
# ORFs are typically obtained from an ORFCollection
seq = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"
collection = findorfs(seq)
# Access individual ORF
orf = collection[1]
# Extract sequence through the collection
orfseq = sequence(collection, 1)See also: ORFCollection, sequence, features
GeneFinder.Strand Type
StrandAn enumeration type representing DNA strand orientation.
Values
PSTRAND = 1: Positive/forward strand (+)NSTRAND = 2: Negative/reverse strand (-)
Example
strand = PSTRAND # Positive strand
strand = NSTRAND # Negative strandGeneFinder._extract_sequence Method
_extract_sequence(seq::NucleicSeqOrView, orf::ORF) -> DNA sequenceInternal function to extract the DNA sequence for an ORF from a source sequence.
Handles strand orientation:
PSTRAND: Returns a view (no allocation)NSTRAND: Returns reverse complement (allocates)
Includes bounds checking and codon validation.
sourceGeneFinder._isvalidorf Method
_isvalidorf(range, strand, frame, features) -> BoolValidate ORF parameters and throw descriptive errors for invalid inputs.
sourceGeneFinder.features Method
features(orf::ORF{F}) where {F}Extract the features/metadata from an ORF.
Arguments
orf::ORF{F}: The ORF from which to extract features.
Returns
NamedTuple: The features associated with the ORF (may be empty).
Example
orf = ORF{NaiveFinder}(1:33, PSTRAND, Int8(1), (score=0.95, gc=0.52))
feats = features(orf) # Returns (score = 0.95, gc = 0.52)
feats.score # Access individual feature: 0.95See also: ORF
GeneFinder.finder Method
finder(collection::ORFCollection{F}) where {F}Get the gene finding method type used for this collection.
Returns
Type{F}: The gene finder method type (e.g.,NaiveFinder).
GeneFinder.finder Method
finder(orf::ORF{F}) where {F}Get the gene finding method type used to identify this ORF.
Arguments
orf::ORF{F}: The ORF to query.
Returns
Type{F}: The gene finder method type (e.g.,NaiveFinder).
Example
orf = ORF{NaiveFinder}(1:33, PSTRAND, Int8(1))
finder(orf) # Returns NaiveFinderSee also: GeneFinderMethod, ORF
GeneFinder.frame Method
frame(orf::ORF{F}) where {F}Get the reading frame of the ORF.
Arguments
orf::ORF{F}: The ORF to query.
Returns
Int8: The reading frame (1, 2, or 3).
Example
orf = ORF{NaiveFinder}(1:33, PSTRAND, Int8(2))
frame(orf) # Returns 2GeneFinder.leftposition Method
leftposition(orf::ORF{F}) where {F}Get the left (start) position of the ORF range.
Arguments
orf::ORF{F}: The ORF to query.
Returns
Int: The first position of the ORF range.
Example
orf = ORF{NaiveFinder}(10:42, PSTRAND, Int8(1))
leftposition(orf) # Returns 10See also: rightposition, ORF
GeneFinder.orfvector Method
orfvector(collection::ORFCollection)Get the vector of ORFs from a collection.
Arguments
collection::ORFCollection: The collection to query.
Returns
Vector{ORF{F}}: The vector of ORFs.
Example
collection = findorfs(seq)
orf_vector = orfvector(collection)GeneFinder.rightposition Method
rightposition(orf::ORF{F}) where {F}Get the right (end) position of the ORF range.
Arguments
orf::ORF{F}: The ORF to query.
Returns
Int: The last position of the ORF range.
Example
orf = ORF{NaiveFinder}(10:42, PSTRAND, Int8(1))
rightposition(orf) # Returns 42See also: leftposition, ORF
GeneFinder.sequence Method
sequence(collection::ORFCollection, i::Int)Extract the DNA sequence for the ORF at index i in the collection.
Arguments
collection::ORFCollection: The collection containing the ORF and source sequence.i::Int: The index of the ORF.
Returns
LongSubSeq{DNAAlphabet{4}}: The DNA sequence corresponding to the ORF as a view.
Example
collection = findorfs(seq)
orfseq = sequence(collection, 1) # Get sequence of first ORFExample
For getting the sequence of all ORFs are several alternatives:
collection = findorfs(seq)
# Using a for loop with push!
# Using sequence methods
orfseqs = sequence(collection, 1:length(collection)) # All ORF sequences
# Using broadcasting
orfseqs = sequence.(Ref(collection), 1:length(collection)) # All ORF sequencesSee also: sequences, ORFCollection
GeneFinder.sequence Method
sequence(collection::ORFCollection, orf::ORF)Extract the DNA sequence for a specific ORF using the collection's source.
Arguments
collection::ORFCollection: The collection containing the source sequence.orf::ORF: The ORF for which to extract the sequence.
Returns
- The DNA sequence (LongSubSeq{DNAAlphabet{4}}) corresponding to the ORF.
Example
collection = findorfs(seq)
orf = collection[1]
orfseq = sequence(collection, orf)GeneFinder.source Method
source(collection::ORFCollection)Get the source sequence associated with an ORF collection.
Arguments
collection::ORFCollection: The collection to query.
Returns
- The source DNA sequence.
Example
collection = findorfs(seq)
src = source(collection) # Returns the original sequenceGeneFinder.strand Method
strand(orf::ORF{F}) where {F}Get the strand orientation of the ORF.
Arguments
orf::ORF{F}: The ORF to query.
Returns
Strand: EitherPSTRAND(positive/forward) orNSTRAND(negative/reverse).
Example
orf = ORF{NaiveFinder}(1:33, PSTRAND, Int8(1))
strand(orf) # Returns PSTRANDFinding ORFs
The findorfs function serves as a unified interface for different gene finding methods. All methods return an ORFCollection.
GeneFinder.findorfs Method
findorfs(sequence::NucleicSeqOrView{DNAAlphabet{N}}; finder::Type{F}, kwargs...) where {N, F<:GeneFinderMethod}Main interface for finding Open Reading Frames (ORFs) in a DNA sequence.
Returns an ORFCollection containing the found ORFs bundled with the source sequence, providing a clean API for sequence extraction.
Arguments
sequence::NucleicSeqOrView{DNAAlphabet{N}}: The DNA sequence to search for ORFs.
Keywords
finder::Type{F}=NaiveFinder: The algorithm to use (NaiveFinder,NaiveFinderLazy, etc.).alternative_start::Bool=false: Whether to consider alternative start codons (GTG, TTG).minlen::Int=6: Minimum ORF length in nucleotides.
Returns
ORFCollection{F}: Collection of ORFs bundled with the source sequence.
Example
using BioSequences, GeneFinder
seq = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"
collection = findorfs(seq)
# Access ORFs sequences
orfseqs = sequence.(Ref(collection), collection.orfs) # Get sequences of all ORFs
# Use a different finder
collection = findorfs(seq, finder=NaiveFinderLazy)See also: NaiveFinder, NaiveFinderLazy, ORFCollection
ORF Finding Algorithms
NaiveFinder
Uses regular expression matching to find ORFs.
GeneFinder.NaiveFinder Type
NaiveFinder <: GeneFinderMethodA simple ORF finding method that detects all Open Reading Frames in a DNA sequence using regular expression matching.
See also: NaiveFinderLazy, GeneFinderMethod
GeneFinder.NaiveFinder Method
NaiveFinder(seq::NucleicSeqOrView{DNAAlphabet{N}}; kwargs...) where {N} -> ORFCollection{NaiveFinder}Find all Open Reading Frames (ORFs) in a DNA sequence using regular expression matching.
Returns an ORFCollection containing the source sequence and all found ORFs, providing a clean API for sequence extraction.
Arguments
seq::NucleicSeqOrView{DNAAlphabet{N}}: The DNA sequence to search for ORFs.
Keywords
alternative_start::Bool=false: Iftrue, uses extended start codons (ATG, GTG, TTG).minlen::Int64=6: Minimum ORF length in nucleotides.
Returns
ORFCollection{NaiveFinder}: Collection of ORFs bundled with the source sequence.
Example
using BioSequences, GeneFinder
seq = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"
collection = NaiveFinder(seq)
# Iterate over ORFs
for orf in collection
println(sequence(collection, orf))
end
# Index access
first_seq = sequence(collection, 1)See also: NaiveFinderLazy, ORFCollection, sequence
GeneFinder.NaiveFinderLazy Type
NaiveFinderLazy <: GeneFinderMethodA memory-optimized variant of NaiveFinder that pre-allocates the ORF vector based on estimated start codon counts.
See also: NaiveFinder, GeneFinderMethod
GeneFinder.NaiveFinderLazy Method
NaiveFinderLazy(seq::NucleicSeqOrView{DNAAlphabet{N}}; kwargs...) where {N} -> ORFCollection{NaiveFinderLazy}Memory-optimized ORF finder with smart pre-allocation.
Returns
ORFCollection{NaiveFinderLazy}: Collection of ORFs bundled with the source sequence.
GeneFinder._estimate_orf_count Method
_estimate_orf_count(seq::NucleicSeqOrView{DNAAlphabet{N}}) where {N} -> IntEstimate the number of ORFs in a sequence for vector pre-allocation.
Counts ATG start codons (and their reverse complement CAT) using a k-mer iterator to provide a heuristic estimate for the expected number of ORFs.
Arguments
seq::NucleicSeqOrView{DNAAlphabet{N}}: The DNA sequence to analyze.
Returns
Int: Estimated ORF count (minimum of 10 to avoid zero allocation).
Implementation
Uses FwRvIterator with 3-mers to efficiently count start codon occurrences on both strands in a single pass.
See also: NaiveFinderLazy
GeneFinder._locationiterator Method
_locationiterator(seq::NucleicSeqOrView{DNAAlphabet{N}}; alternative_start::Bool=false) where {N}Create an iterator that yields ORF location ranges in a DNA sequence.
Arguments
seq::NucleicSeqOrView{DNAAlphabet{N}}: The DNA sequence to search.
Keywords
alternative_start::Bool=false: Iftrue, matches NTG (ATG, GTG, TTG) as start codons; iffalse, only matches ATG.
Returns
An iterator yielding UnitRange{Int64} objects representing ORF locations.
Implementation Details
The function uses a regular expression to find ORFs:
Pattern:
ATG(?:[N]{3})*?T(AG|AA|GA)(orNTG...with alternative starts)ATGorNTG: Start codon(?:[N]{3})*?: Non-greedy match of any number of 3-nucleotide codonsT(AG|AA|GA): Stop codons (TAA, TAG, TGA)
The iterator uses findfirst with progressive offsets to find non-overlapping ORFs.
See also: NaiveFinder
GeneFinder._search_strand! Method
_search_strand!(orfs, seq, strand, seqlen, alternative_start, minlen)Search for ORFs on a single strand and append results to the ORF vector.
This is an internal helper function that avoids code duplication between forward and reverse strand searching in NaiveFinderLazy.
Arguments
orfs::Vector{ORF{NaiveFinderLazy}}: Vector to append found ORFs to (mutated).seq::NucleicSeqOrView{DNAAlphabet{N}}: The DNA sequence to search.strand::Strand: The strand being searched (PSTRANDorNSTRAND).seqlen::Int: Length of the original sequence (for coordinate transformation).alternative_start::Bool: Whether to use alternative start codons.minlen::Int64: Minimum ORF length filter.
Coordinate Handling
For
PSTRAND: Coordinates are used directly.For
NSTRAND: Coordinates are transformed from reverse complement positions back to original sequence positions.
See also: NaiveFinderLazy
Writing ORFs to Files
Export ORFs in various formats (FASTA, BED, GFF3).
GeneFinder.write_orfs_bed Method
write_orfs_bed(input, output; kwargs...)Write ORF coordinates in BED format.
BED (Browser Extensible Data) format is a tab-delimited text format commonly used for genomic annotations. Each line contains: start, stop, strand, and frame.
Arguments
input::NucleicSeqOrView{DNAAlphabet{N}}: Input DNA sequence to search for ORFs.output::Union{IOStream, IOBuffer, String}: Output destination (file path or IO stream).
Keywords
finder::Type{F}=NaiveFinder: ORF finding algorithm to use.alternative_start::Bool=false: Use alternative start codons (GTG, TTG) in addition to ATG.minlen::Int64=6: Minimum ORF length in nucleotides.
Output Format
start stop strand frame
35 79 + 2
120 180 - 3Example
using BioSequences, GeneFinder
seq = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"
# Write to file
write_orfs_bed(seq, "output.bed"; finder=NaiveFinder)
# Write to IO stream
open("output.bed", "w") do io
write_orfs_bed(seq, io; finder=NaiveFinder)
endSee also: write_orfs_gff, write_orfs_fna, findorfs
GeneFinder.write_orfs_faa Method
write_orfs_faa(input, output; kwargs...)Write translated ORF sequences in FASTA format (amino acids).
Outputs the protein sequences of all detected ORFs, translating each ORF using the specified genetic code. Headers contain the same metadata as write_orfs_fna.
Arguments
input::NucleicSeqOrView{DNAAlphabet{N}}: Input DNA sequence to search for ORFs.output::Union{IOStream, IOBuffer, String}: Output destination (file path or IO stream).
Keywords
finder::Type{F}=NaiveFinder: ORF finding algorithm to use.code::GeneticCode=ncbi_trans_table[1]: Genetic code for translation (default: standard code).alternative_start::Bool=false: Use alternative start codons (GTG, TTG) in addition to ATG.minlen::Int64=6: Minimum ORF length in nucleotides.
Output Format
>ORF01 id=01 start=35 stop=79 strand=+ frame=2 features=[]
MHACA*
>ORF02 id=02 start=120 stop=180 strand=- frame=3 features=[]
MLALA*Example
using BioSequences, GeneFinder
seq = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"
# Write to file with standard genetic code
write_orfs_faa(seq, "output.faa"; finder=NaiveFinder)
# Use bacterial genetic code (table 11)
write_orfs_faa(seq, "output.faa"; finder=NaiveFinder, code=ncbi_trans_table[11])
# Write to IO stream
open("output.faa", "w") do io
write_orfs_faa(seq, io; finder=NaiveFinder)
endSee also: write_orfs_fna, write_orfs_gff, findorfs, BioSequences.translate
GeneFinder.write_orfs_fna Method
write_orfs_fna(input, output; kwargs...)Write ORF nucleotide sequences in FASTA format.
Outputs the DNA sequences of all detected ORFs, with headers containing metadata (ID, coordinates, strand, frame, and features).
Arguments
input::NucleicSeqOrView{DNAAlphabet{N}}: Input DNA sequence to search for ORFs.output::Union{IOStream, IOBuffer, String}: Output destination (file path or IO stream).
Keywords
finder::Type{F}=NaiveFinder: ORF finding algorithm to use.alternative_start::Bool=false: Use alternative start codons (GTG, TTG) in addition to ATG.minlen::Int64=6: Minimum ORF length in nucleotides.
Output Format
>ORF01 id=01 start=35 stop=79 strand=+ frame=2 features=[]
ATGCATGCATGCATGCTAG
>ORF02 id=02 start=120 stop=180 strand=- frame=3 features=[]
ATGCTAGCTAGCTAGCTAAExample
using BioSequences, GeneFinder
seq = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"
# Write to file
write_orfs_fna(seq, "output.fna"; finder=NaiveFinder)
# Write to IO stream
open("output.fna", "w") do io
write_orfs_fna(seq, io; finder=NaiveFinder, minlen=9)
endSee also: write_orfs_faa, write_orfs_gff, findorfs
GeneFinder.write_orfs_gff Method
write_orfs_gff(input, output; kwargs...)Write ORF annotations in GFF3 (General Feature Format version 3).
GFF3 is a standard format for genomic annotations, compatible with genome browsers like IGV, JBrowse, and the UCSC Genome Browser.
Arguments
input::NucleicSeqOrView{DNAAlphabet{N}}: Input DNA sequence to search for ORFs.output::Union{IOStream, IOBuffer, String}: Output destination (file path or IO stream).
Keywords
finder::Type{F}=NaiveFinder: ORF finding algorithm to use.seqname::String="Chr": Sequence/chromosome name for the first GFF column.alternative_start::Bool=false: Use alternative start codons (GTG, TTG) in addition to ATG.minlen::Int64=6: Minimum ORF length in nucleotides.
Output Format
##gff-version 3
##sequence-region Chr 1 1000
Chr . ORF 35 79 . + . ID=ORF01;Name=ORF01;Frame=2;Features=[]
Chr . ORF 120 180 . - . ID=ORF02;Name=ORF02;Frame=3;Features=[]Example
using BioSequences, GeneFinder
seq = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"
# Write to file
write_orfs_gff(seq, "output.gff"; finder=NaiveFinder)
# Custom chromosome name for genome browser visualization
write_orfs_gff(seq, "output.gff"; finder=NaiveFinder, seqname="scaffold_1")
# Write to IO stream
open("output.gff", "w") do io
write_orfs_gff(seq, io; finder=NaiveFinder)
endSee also: write_orfs_bed, write_orfs_fna, findorfs