Skip to content

Core Types

The main types of the package are ORF (Open Reading Frame) and ORFCollection. An ORF stores coordinates and metadata, while ORFCollection bundles ORFs with their source sequence for clean sequence extraction.

GeneFinder.GeneFinderMethod Type
julia
abstract type GeneFinderMethod

Abstract base type for different ORF finding methods/algorithms.

Subtypes should implement the calling interface to find ORFs in a sequence, returning an ORFCollection.

source
GeneFinder.ORFCollection Type
julia
struct ORFCollection{F<:GeneFinderMethod, S<:NucleicSeqOrView}

A collection of Open Reading Frames (ORFs) bundled with a view of their source sequence.

This is the primary return type for all GeneFinderMethod implementations. It provides a clean API for accessing ORFs and their corresponding sequences without relying on global state.

The source is always stored as a view (LongSubSeq) to avoid unnecessary copying while maintaining a reference to the original sequence data.

Type Parameters

  • F<:GeneFinderMethod: The gene finding algorithm used to identify these ORFs.

  • S<:NucleicSeqOrView: The type of the source sequence view.

Fields

  • source::S: A view of the source DNA sequence containing the ORFs.

  • orfs::Vector{ORF{F}}: The vector of ORFs found in the source sequence.

Example

julia
using BioSequences, GeneFinder

seq = dna"ATGATGCATGCATGCATGCTAG"
collection = findorfs(seq, finder=NaiveFinder)

# Source is stored as a view
typeof(source(collection))  # LongSubSeq{DNAAlphabet{4}}

# Iteration
for orf in collection
    println(orf)
end

# Sequence extraction
orfseq = sequence(collection, 1)

See also: ORF, sequence, source

source
GeneFinder.OpenReadingFrame Type
julia
struct OpenReadingFrame{F<:GeneFinderMethod}

The OpenReadingFrame (aliased as ORF) struct represents an Open Reading Frame in genomics.

An ORF is a lightweight coordinate-based structure that stores the location and metadata of a potential coding region. ORFs are typically contained within an ORFCollection, which provides access to the source sequence.

Type Parameter

  • F<:GeneFinderMethod: The gene finding algorithm used to identify this ORF.

Fields

  • range::UnitRange{Int64}: The position range (start:stop) of the ORF on the sequence.

  • strand::Strand: The strand orientation (PSTRAND or NSTRAND).

  • frame::Int8: The reading frame (1, 2, or 3).

  • features::NamedTuple: Additional features/metadata associated with the ORF.

Example

julia
using BioSequences, GeneFinder

# ORFs are typically obtained from an ORFCollection
seq = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"
collection = findorfs(seq)

# Access individual ORF
orf = collection[1]

# Extract sequence through the collection
orfseq = sequence(collection, 1)

See also: ORFCollection, sequence, features

source
GeneFinder.Strand Type
julia
Strand

An enumeration type representing DNA strand orientation.

Values

  • PSTRAND = 1: Positive/forward strand (+)

  • NSTRAND = 2: Negative/reverse strand (-)

Example

julia
strand = PSTRAND  # Positive strand
strand = NSTRAND  # Negative strand
source
GeneFinder._extract_sequence Method
julia
_extract_sequence(seq::NucleicSeqOrView, orf::ORF) -> DNA sequence

Internal function to extract the DNA sequence for an ORF from a source sequence.

Handles strand orientation:

  • PSTRAND: Returns a view (no allocation)

  • NSTRAND: Returns reverse complement (allocates)

Includes bounds checking and codon validation.

source
GeneFinder._isvalidorf Method
julia
_isvalidorf(range, strand, frame, features) -> Bool

Validate ORF parameters and throw descriptive errors for invalid inputs.

source
GeneFinder.features Method
julia
features(orf::ORF{F}) where {F}

Extract the features/metadata from an ORF.

Arguments

  • orf::ORF{F}: The ORF from which to extract features.

Returns

  • NamedTuple: The features associated with the ORF (may be empty).

Example

julia
orf = ORF{NaiveFinder}(1:33, PSTRAND, Int8(1), (score=0.95, gc=0.52))

feats = features(orf)  # Returns (score = 0.95, gc = 0.52)
feats.score            # Access individual feature: 0.95

See also: ORF

source
GeneFinder.finder Method
julia
finder(collection::ORFCollection{F}) where {F}

Get the gene finding method type used for this collection.

Returns

  • Type{F}: The gene finder method type (e.g., NaiveFinder).
source
GeneFinder.finder Method
julia
finder(orf::ORF{F}) where {F}

Get the gene finding method type used to identify this ORF.

Arguments

  • orf::ORF{F}: The ORF to query.

Returns

  • Type{F}: The gene finder method type (e.g., NaiveFinder).

Example

julia
orf = ORF{NaiveFinder}(1:33, PSTRAND, Int8(1))
finder(orf)  # Returns NaiveFinder

See also: GeneFinderMethod, ORF

source
GeneFinder.frame Method
julia
frame(orf::ORF{F}) where {F}

Get the reading frame of the ORF.

Arguments

  • orf::ORF{F}: The ORF to query.

Returns

  • Int8: The reading frame (1, 2, or 3).

Example

julia
orf = ORF{NaiveFinder}(1:33, PSTRAND, Int8(2))
frame(orf)  # Returns 2

See also: strand, ORF

source
GeneFinder.leftposition Method
julia
leftposition(orf::ORF{F}) where {F}

Get the left (start) position of the ORF range.

Arguments

  • orf::ORF{F}: The ORF to query.

Returns

  • Int: The first position of the ORF range.

Example

julia
orf = ORF{NaiveFinder}(10:42, PSTRAND, Int8(1))
leftposition(orf)  # Returns 10

See also: rightposition, ORF

source
GeneFinder.orfvector Method
julia
orfvector(collection::ORFCollection)

Get the vector of ORFs from a collection.

Arguments

  • collection::ORFCollection: The collection to query.

Returns

  • Vector{ORF{F}}: The vector of ORFs.

Example

julia
collection = findorfs(seq)
orf_vector = orfvector(collection)
source
GeneFinder.rightposition Method
julia
rightposition(orf::ORF{F}) where {F}

Get the right (end) position of the ORF range.

Arguments

  • orf::ORF{F}: The ORF to query.

Returns

  • Int: The last position of the ORF range.

Example

julia
orf = ORF{NaiveFinder}(10:42, PSTRAND, Int8(1))
rightposition(orf)  # Returns 42

See also: leftposition, ORF

source
GeneFinder.sequence Method
julia
sequence(collection::ORFCollection, i::Int)

Extract the DNA sequence for the ORF at index i in the collection.

Arguments

  • collection::ORFCollection: The collection containing the ORF and source sequence.

  • i::Int: The index of the ORF.

Returns

  • LongSubSeq{DNAAlphabet{4}}: The DNA sequence corresponding to the ORF as a view.

Example

julia
collection = findorfs(seq)
orfseq = sequence(collection, 1)  # Get sequence of first ORF

Example

For getting the sequence of all ORFs are several alternatives:

julia
collection = findorfs(seq)
# Using a for loop with push!

# Using sequence methods

orfseqs = sequence(collection, 1:length(collection))  # All ORF sequences

# Using broadcasting
orfseqs = sequence.(Ref(collection), 1:length(collection))  # All ORF sequences

See also: sequences, ORFCollection

source
GeneFinder.sequence Method
julia
sequence(collection::ORFCollection, orf::ORF)

Extract the DNA sequence for a specific ORF using the collection's source.

Arguments

  • collection::ORFCollection: The collection containing the source sequence.

  • orf::ORF: The ORF for which to extract the sequence.

Returns

  • The DNA sequence (LongSubSeq{DNAAlphabet{4}}) corresponding to the ORF.

Example

julia
collection = findorfs(seq)
orf = collection[1]
orfseq = sequence(collection, orf)
source
GeneFinder.source Method
julia
source(collection::ORFCollection)

Get the source sequence associated with an ORF collection.

Arguments

  • collection::ORFCollection: The collection to query.

Returns

  • The source DNA sequence.

Example

julia
collection = findorfs(seq)
src = source(collection)  # Returns the original sequence
source
GeneFinder.strand Method
julia
strand(orf::ORF{F}) where {F}

Get the strand orientation of the ORF.

Arguments

  • orf::ORF{F}: The ORF to query.

Returns

  • Strand: Either PSTRAND (positive/forward) or NSTRAND (negative/reverse).

Example

julia
orf = ORF{NaiveFinder}(1:33, PSTRAND, Int8(1))
strand(orf)  # Returns PSTRAND

See also: frame, Strand, ORF

source

Finding ORFs

The findorfs function serves as a unified interface for different gene finding methods. All methods return an ORFCollection.

GeneFinder.findorfs Method
julia
findorfs(sequence::NucleicSeqOrView{DNAAlphabet{N}}; finder::Type{F}, kwargs...) where {N, F<:GeneFinderMethod}

Main interface for finding Open Reading Frames (ORFs) in a DNA sequence.

Returns an ORFCollection containing the found ORFs bundled with the source sequence, providing a clean API for sequence extraction.

Arguments

  • sequence::NucleicSeqOrView{DNAAlphabet{N}}: The DNA sequence to search for ORFs.

Keywords

  • finder::Type{F}=NaiveFinder: The algorithm to use (NaiveFinder, NaiveFinderLazy, etc.).

  • alternative_start::Bool=false: Whether to consider alternative start codons (GTG, TTG).

  • minlen::Int=6: Minimum ORF length in nucleotides.

Returns

  • ORFCollection{F}: Collection of ORFs bundled with the source sequence.

Example

julia
using BioSequences, GeneFinder

seq = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"
collection = findorfs(seq)

# Access ORFs sequences
orfseqs = sequence.(Ref(collection), collection.orfs)  # Get sequences of all ORFs

# Use a different finder
collection = findorfs(seq, finder=NaiveFinderLazy)

See also: NaiveFinder, NaiveFinderLazy, ORFCollection

source

ORF Finding Algorithms

NaiveFinder

Uses regular expression matching to find ORFs.

GeneFinder.NaiveFinder Type
julia
NaiveFinder <: GeneFinderMethod

A simple ORF finding method that detects all Open Reading Frames in a DNA sequence using regular expression matching.

See also: NaiveFinderLazy, GeneFinderMethod

source
GeneFinder.NaiveFinder Method
julia
NaiveFinder(seq::NucleicSeqOrView{DNAAlphabet{N}}; kwargs...) where {N} -> ORFCollection{NaiveFinder}

Find all Open Reading Frames (ORFs) in a DNA sequence using regular expression matching.

Returns an ORFCollection containing the source sequence and all found ORFs, providing a clean API for sequence extraction.

Arguments

  • seq::NucleicSeqOrView{DNAAlphabet{N}}: The DNA sequence to search for ORFs.

Keywords

  • alternative_start::Bool=false: If true, uses extended start codons (ATG, GTG, TTG).

  • minlen::Int64=6: Minimum ORF length in nucleotides.

Returns

  • ORFCollection{NaiveFinder}: Collection of ORFs bundled with the source sequence.

Example

julia
using BioSequences, GeneFinder

seq = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"
collection = NaiveFinder(seq)

# Iterate over ORFs
for orf in collection
    println(sequence(collection, orf))
end

# Index access
first_seq = sequence(collection, 1)

See also: NaiveFinderLazy, ORFCollection, sequence

source
GeneFinder.NaiveFinderLazy Type
julia
NaiveFinderLazy <: GeneFinderMethod

A memory-optimized variant of NaiveFinder that pre-allocates the ORF vector based on estimated start codon counts.

See also: NaiveFinder, GeneFinderMethod

source
GeneFinder.NaiveFinderLazy Method
julia
NaiveFinderLazy(seq::NucleicSeqOrView{DNAAlphabet{N}}; kwargs...) where {N} -> ORFCollection{NaiveFinderLazy}

Memory-optimized ORF finder with smart pre-allocation.

Returns

  • ORFCollection{NaiveFinderLazy}: Collection of ORFs bundled with the source sequence.
source
GeneFinder._estimate_orf_count Method
julia
_estimate_orf_count(seq::NucleicSeqOrView{DNAAlphabet{N}}) where {N} -> Int

Estimate the number of ORFs in a sequence for vector pre-allocation.

Counts ATG start codons (and their reverse complement CAT) using a k-mer iterator to provide a heuristic estimate for the expected number of ORFs.

Arguments

  • seq::NucleicSeqOrView{DNAAlphabet{N}}: The DNA sequence to analyze.

Returns

  • Int: Estimated ORF count (minimum of 10 to avoid zero allocation).

Implementation

Uses FwRvIterator with 3-mers to efficiently count start codon occurrences on both strands in a single pass.

See also: NaiveFinderLazy

source
GeneFinder._locationiterator Method
julia
_locationiterator(seq::NucleicSeqOrView{DNAAlphabet{N}}; alternative_start::Bool=false) where {N}

Create an iterator that yields ORF location ranges in a DNA sequence.

Arguments

  • seq::NucleicSeqOrView{DNAAlphabet{N}}: The DNA sequence to search.

Keywords

  • alternative_start::Bool=false: If true, matches NTG (ATG, GTG, TTG) as start codons; if false, only matches ATG.

Returns

An iterator yielding UnitRange{Int64} objects representing ORF locations.

Implementation Details

The function uses a regular expression to find ORFs:

  • Pattern: ATG(?:[N]{3})*?T(AG|AA|GA) (or NTG... with alternative starts)

  • ATG or NTG: Start codon

  • (?:[N]{3})*?: Non-greedy match of any number of 3-nucleotide codons

  • T(AG|AA|GA): Stop codons (TAA, TAG, TGA)

The iterator uses findfirst with progressive offsets to find non-overlapping ORFs.

See also: NaiveFinder

source
GeneFinder._search_strand! Method
julia
_search_strand!(orfs, seq, strand, seqlen, alternative_start, minlen)

Search for ORFs on a single strand and append results to the ORF vector.

This is an internal helper function that avoids code duplication between forward and reverse strand searching in NaiveFinderLazy.

Arguments

  • orfs::Vector{ORF{NaiveFinderLazy}}: Vector to append found ORFs to (mutated).

  • seq::NucleicSeqOrView{DNAAlphabet{N}}: The DNA sequence to search.

  • strand::Strand: The strand being searched (PSTRAND or NSTRAND).

  • seqlen::Int: Length of the original sequence (for coordinate transformation).

  • alternative_start::Bool: Whether to use alternative start codons.

  • minlen::Int64: Minimum ORF length filter.

Coordinate Handling

  • For PSTRAND: Coordinates are used directly.

  • For NSTRAND: Coordinates are transformed from reverse complement positions back to original sequence positions.

See also: NaiveFinderLazy

source

Writing ORFs to Files

Export ORFs in various formats (FASTA, BED, GFF3).

GeneFinder.write_orfs_bed Method
julia
write_orfs_bed(input, output; kwargs...)

Write ORF coordinates in BED format.

BED (Browser Extensible Data) format is a tab-delimited text format commonly used for genomic annotations. Each line contains: start, stop, strand, and frame.

Arguments

  • input::NucleicSeqOrView{DNAAlphabet{N}}: Input DNA sequence to search for ORFs.

  • output::Union{IOStream, IOBuffer, String}: Output destination (file path or IO stream).

Keywords

  • finder::Type{F}=NaiveFinder: ORF finding algorithm to use.

  • alternative_start::Bool=false: Use alternative start codons (GTG, TTG) in addition to ATG.

  • minlen::Int64=6: Minimum ORF length in nucleotides.

Output Format

start	stop	strand	frame
35	79	+	2
120	180	-	3

Example

julia
using BioSequences, GeneFinder

seq = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"

# Write to file
write_orfs_bed(seq, "output.bed"; finder=NaiveFinder)

# Write to IO stream
open("output.bed", "w") do io
    write_orfs_bed(seq, io; finder=NaiveFinder)
end

See also: write_orfs_gff, write_orfs_fna, findorfs

source
GeneFinder.write_orfs_faa Method
julia
write_orfs_faa(input, output; kwargs...)

Write translated ORF sequences in FASTA format (amino acids).

Outputs the protein sequences of all detected ORFs, translating each ORF using the specified genetic code. Headers contain the same metadata as write_orfs_fna.

Arguments

  • input::NucleicSeqOrView{DNAAlphabet{N}}: Input DNA sequence to search for ORFs.

  • output::Union{IOStream, IOBuffer, String}: Output destination (file path or IO stream).

Keywords

  • finder::Type{F}=NaiveFinder: ORF finding algorithm to use.

  • code::GeneticCode=ncbi_trans_table[1]: Genetic code for translation (default: standard code).

  • alternative_start::Bool=false: Use alternative start codons (GTG, TTG) in addition to ATG.

  • minlen::Int64=6: Minimum ORF length in nucleotides.

Output Format

>ORF01 id=01 start=35 stop=79 strand=+ frame=2 features=[]
MHACA*
>ORF02 id=02 start=120 stop=180 strand=- frame=3 features=[]
MLALA*

Example

julia
using BioSequences, GeneFinder

seq = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"

# Write to file with standard genetic code
write_orfs_faa(seq, "output.faa"; finder=NaiveFinder)

# Use bacterial genetic code (table 11)
write_orfs_faa(seq, "output.faa"; finder=NaiveFinder, code=ncbi_trans_table[11])

# Write to IO stream
open("output.faa", "w") do io
    write_orfs_faa(seq, io; finder=NaiveFinder)
end

See also: write_orfs_fna, write_orfs_gff, findorfs, BioSequences.translate

source
GeneFinder.write_orfs_fna Method
julia
write_orfs_fna(input, output; kwargs...)

Write ORF nucleotide sequences in FASTA format.

Outputs the DNA sequences of all detected ORFs, with headers containing metadata (ID, coordinates, strand, frame, and features).

Arguments

  • input::NucleicSeqOrView{DNAAlphabet{N}}: Input DNA sequence to search for ORFs.

  • output::Union{IOStream, IOBuffer, String}: Output destination (file path or IO stream).

Keywords

  • finder::Type{F}=NaiveFinder: ORF finding algorithm to use.

  • alternative_start::Bool=false: Use alternative start codons (GTG, TTG) in addition to ATG.

  • minlen::Int64=6: Minimum ORF length in nucleotides.

Output Format

>ORF01 id=01 start=35 stop=79 strand=+ frame=2 features=[]
ATGCATGCATGCATGCTAG
>ORF02 id=02 start=120 stop=180 strand=- frame=3 features=[]
ATGCTAGCTAGCTAGCTAA

Example

julia
using BioSequences, GeneFinder

seq = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"

# Write to file
write_orfs_fna(seq, "output.fna"; finder=NaiveFinder)

# Write to IO stream
open("output.fna", "w") do io
    write_orfs_fna(seq, io; finder=NaiveFinder, minlen=9)
end

See also: write_orfs_faa, write_orfs_gff, findorfs

source
GeneFinder.write_orfs_gff Method
julia
write_orfs_gff(input, output; kwargs...)

Write ORF annotations in GFF3 (General Feature Format version 3).

GFF3 is a standard format for genomic annotations, compatible with genome browsers like IGV, JBrowse, and the UCSC Genome Browser.

Arguments

  • input::NucleicSeqOrView{DNAAlphabet{N}}: Input DNA sequence to search for ORFs.

  • output::Union{IOStream, IOBuffer, String}: Output destination (file path or IO stream).

Keywords

  • finder::Type{F}=NaiveFinder: ORF finding algorithm to use.

  • seqname::String="Chr": Sequence/chromosome name for the first GFF column.

  • alternative_start::Bool=false: Use alternative start codons (GTG, TTG) in addition to ATG.

  • minlen::Int64=6: Minimum ORF length in nucleotides.

Output Format

##gff-version 3
##sequence-region Chr 1 1000
Chr	.	ORF	35	79	.	+	.	ID=ORF01;Name=ORF01;Frame=2;Features=[]
Chr	.	ORF	120	180	.	-	.	ID=ORF02;Name=ORF02;Frame=3;Features=[]

Example

julia
using BioSequences, GeneFinder

seq = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"

# Write to file
write_orfs_gff(seq, "output.gff"; finder=NaiveFinder)

# Custom chromosome name for genome browser visualization
write_orfs_gff(seq, "output.gff"; finder=NaiveFinder, seqname="scaffold_1")

# Write to IO stream
open("output.gff", "w") do io
    write_orfs_gff(seq, io; finder=NaiveFinder)
end

See also: write_orfs_bed, write_orfs_fna, findorfs

source