The Main ORF type
The main type of the package is ORFI, which represents an Open Reading Frame Interval. It is a subtype of the AbstractGenomicInterval type from the GenomicFeatures package.
GeneFinder.OpenReadingFrameInterval Type
struct ORFI{N,F} <: AbstractGenomicInterval{F}
The ORFI struct represents an Open Reading Frame Interval (ORFI) in genomics.
Fields
- groupname::String: The name of the group to which the ORFI belongs.
- first::Int64: The starting position of the ORFI.
- last::Int64: The ending position of the ORFI.
- strand::Strand: The strand on which the ORFI is located.
- frame::Int: The reading frame of the ORFI.
- seq::LongSubSeq{DNAAlphabet{N}}: The DNA sequence of the ORFI.
- features::Features: The features associated with the ORFI.
Main Constructor
ORFI{N,F}(
    groupname::String,
    first::Int64,
    last::Int64,
    strand::Strand,
    frame::Int,
    features::Features,
    seq::LongSubSeq{DNAAlphabet{N}}
)
Example
A full ORFI instance:
ORFI{4,NaiveFinder}("seq01", 1, 33, STRAND_POS, 1, Features((score = 0.0,)), nothing)
A partial ORFI instance:
ORFI{NaiveFinder}(1:33, '+', 1)
GeneFinder.features Method
features(i::ORFI{N,F})
Extracts the features from an ORFI object.
Arguments
- i::ORFI{N,F}: An ORFI object.
Returns
The features of the ORFI object. These can be defined by each GeneFinderMethod.
GeneFinder.sequence Method
sequence(i::ORFI{N,F})
Extracts the DNA sequence corresponding to the given open reading frame (ORFI).
Arguments
- i::ORFI{N,F}: The open reading frame (ORFI) for which the DNA sequence needs to be extracted.
Returns
- The DNA sequence corresponding to the given open reading frame (ORFI).
GeneFinder.source Method
source(i::ORFI{N,F})
Get the source sequence associated with the given ORFI object.
Arguments
- i::ORFI{N,F}: The ORFI object for which to retrieve the source sequence.
Returns
The source sequence associated with the ORFI object.
Examples
seq = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"
orfs = findorfs(seq)
source(orfs[1])
44nt DNA Sequence:
ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA
Warning
The source method works only if the sequence is defined in the global scope; otherwise it throws an error. For instance, a common failure is to define a simple ORFI, which by default has "unnamedsource" as its groupname, and then try to get the source sequence.
orf = ORFI{NaiveFinder}(1:33, '+', 1)
source(orf)
ERROR: UndefVarError: `unnamedsource` not defined
Stacktrace:
[1] source(i::ORFI{4, NaiveFinder})
@ GeneFinder ~/.julia/dev/GeneFinder/src/types.jl:192
[2] top-level scope
@ REPL[12]:1
Finding ORFIs
The function findorfs serves as a common interface: it is a generic method that can handle different gene finding methods.
GeneFinder.findorfs Method
findorfs(sequence::NucleicSeqOrView{DNAAlphabet{N}}; ::F, kwargs...) where {N, F<:GeneFinderMethod}
This is the main interface method for finding open reading frames (ORFIs) in a DNA sequence.
It takes the following required arguments:
- sequence: The nucleic acid sequence to search for ORFIs.
- finder: The algorithm used to find ORFIs. It can be either NaiveFinder, NaiveCollector, or another implementation.
Keyword arguments, regardless of the finder method:
- alternative_start::Bool: A boolean indicating whether to consider alternative start codons. Default is false.
- minlen::Int: The minimum length of an ORFI. Default is 6.
- scheme::Function: The scoring scheme to use for scoring the sequence from the ORFI. Default is nothing.
Returns
A vector of ORFI objects representing the found ORFIs.
Example
sequence = randdnaseq(120)
120nt DNA Sequence:
GCCGGACAGCGAAGGCTAATAAATGCCCGTGCCAGTATC…TCTGAGTTACTGTACACCCGAAAGACGTTGTACGCATTT
findorfs(sequence, finder=NaiveFinder)
1-element Vector{ORFI}:
ORFI{NaiveFinder}(77:118, '-', 2)
Finding ORFs using BioRegex
GeneFinder.NaiveFinder Method
NaiveFinder(seq::NucleicSeqOrView{DNAAlphabet{N}}; kwargs...) -> Vector{ORFI{N,F}} where {N,F}
A simple implementation that finds ORFIs in a DNA sequence.
The NaiveFinder method takes a LongSequence{DNAAlphabet{4}} sequence and returns a Vector{ORFI} containing the ORFIs found in the sequence. It uses a regular expression to search for entire CDSs, adding each ORFI it finds to the vector. The function also searches the reverse complement of the sequence, so it finds ORFIs on both strands. Setting alternative_start = true extends the start codons, so the search covers ATG, GTG, and TTG. Some studies have shown that in E. coli (K-12 strain), ATG, GTG and TTG are used as start codons 83%, 14% and 3% of the time, respectively.
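The two ideas in the paragraph above (searching both strands and optionally widening the start codon set) can be sketched with Base Julia alone. This is a simplified illustration, not the package's implementation: it works on plain Strings rather than BioSequences types, `revcomp`, `orf_regex` and `naive_orfs` are hypothetical helper names, `[ACGT]` replaces the IUPAC symbol N, and reporting reverse-strand hits in forward-strand coordinates is an assumption.

```julia
# Hypothetical helpers for illustration; plain Strings stand in for BioSequences types.
revcomp(seq::AbstractString) =
    reverse(replace(seq, 'A' => 'T', 'T' => 'A', 'C' => 'G', 'G' => 'C'))

# With alternative starts the pattern accepts ATG, GTG and TTG.
orf_regex(alternative_start::Bool) =
    alternative_start ? r"[AGT]TG(?:[ACGT]{3})*?T(?:AG|AA|GA)" :
                        r"ATG(?:[ACGT]{3})*?T(?:AG|AA|GA)"

function naive_orfs(seq::AbstractString; alternative_start::Bool = false)
    found = Tuple{UnitRange{Int},Char}[]
    n = length(seq)
    for (strand, s) in (('+', seq), ('-', revcomp(seq)))
        for m in eachmatch(orf_regex(alternative_start), s; overlap = true)
            stop = m.offset + length(m.match) - 1
            # Assumption: reverse-strand hits are reported in forward-strand coordinates.
            loc = strand == '+' ? (m.offset:stop) : ((n - stop + 1):(n - m.offset + 1))
            push!(found, (loc, strand))
        end
    end
    return found
end

minus_strand_orfs = naive_orfs("GGCTATTTCATCC")                 # ORF only on the '-' strand
alt_start_orfs = naive_orfs("GTGAAATAA"; alternative_start = true)  # GTG accepted as a start
```

In this sketch the second sequence yields no ORF unless `alternative_start = true`, because it contains no ATG on either strand.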
Note
By default this function has neither an ORFI scoring scheme nor length constraints. Thus it might consider aa"M*" a possible encoded protein from the resulting ORFIs.
Required Arguments
- seq::NucleicSeqOrView{DNAAlphabet{N}}: The nucleic acid sequence to search for ORFIs.
Keyword Arguments
- alternative_start::Bool: If true, the extended set of start codons is passed to the search. This will increase the execution time 3x. Default is false.
- minlen::Int64=6: Minimum length of an allowed ORFI. The default value allows aa"M*" as a possible encoded protein from the resulting ORFIs.
Note
As the scheme is generally a scoring function that requires at least a sequence, one simple scheme is the log-odds ratio score, which compares the probability of the sequence under a coding model with its probability under a non-coding model. If the log-odds ratio exceeds a given threshold (η), the sequence is considered likely to be coding. See lordr for more information about coding criteria.
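The comparison described in this note can be written compactly. The notation is an assumption made for illustration and is not taken from the package: $x$ is the candidate sequence, and $P_C(x)$ and $P_N(x)$ are its probabilities under the coding and non-coding models.

```latex
S(x) = \log \frac{P_C(x)}{P_N(x)}, \qquad x \text{ is considered coding when } S(x) > \eta
```

A positive $S(x)$ means the coding model explains the sequence better than the non-coding one; the threshold $\eta$ sets how much better it has to be.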
GeneFinder._locationiterator Method
_locationiterator(seq::NucleicSeqOrView{DNAAlphabet{N}}; kwargs...) where {N}
This is an iterator function that uses regular expressions to search for the entire ORFI (instead of separate start and stop codons) in a LongSequence{DNAAlphabet{4}} sequence. It uses an anonymous function that finds the first ORFI matching the regular expression, and then builds an iterator that applies this function repeatedly until no other CDS is found.
Note
As an implementation note, here is how the ORFIs are found:
The expression (?:[N]{3})*? serves as the boundary between the start and stop codons. Within this expression, [N]{3} matches exactly three nucleotides, where N is the IUPAC code for any nucleotide, so each repetition matches one regular codon. Since it is enclosed in a non-capturing group (?:) and followed by the lazy quantifier *?, it matches any number of intermediate codons but prefers the smallest number of repetitions.
In summary, the regular expression ATG(?:[N]{3})*?T(AG|AA|GA) identifies patterns that start with "ATG", continue with any number of three-nucleotide codons (each nucleotide represented by "N" in the IUPAC code), and end with a stop codon "TAG", "TAA", or "TGA". This pattern is commonly used to identify potential protein-coding regions within genetic sequences.
See more about the discussion here
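The effect of the lazy quantifier can be checked with a small sketch in Base Julia. It is only an approximation of the pattern described in this note: it runs Julia's built-in regex engine on a plain String rather than a BioRegex on a BioSequences type, and `[ACGT]` stands in for the IUPAC symbol N.

```julia
# Base regex on a plain String; [ACGT] stands in for the IUPAC symbol N.
orf_regex = r"ATG(?:[ACGT]{3})*?T(AG|AA|GA)"
seq = "ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"

m = match(orf_regex, seq)                    # leftmost match only
orf = m.match
location = m.offset:(m.offset + length(orf) - 1)

# Split the match into codons to check that it ends at the first in-frame stop codon.
codons = [orf[i:i+2] for i in 1:3:length(orf)]
println(location, "  ", join(codons, " "))
```

Because `*?` prefers as few repetitions as possible, the match ends at the first stop codon that is in frame with the start codon, so only the last codon of the match is a stop codon.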
GeneFinder.NaiveCollector Method
NaiveCollector(seq::NucleicSeqOrView{DNAAlphabet{N}}; kwargs...) -> Vector{ORFI{N,F}} where {N,F}
The NaiveCollector function searches for open reading frames (ORFs) in a DNA sequence. It takes the following arguments:
Required Arguments
- seq::NucleicSeqOrView{DNAAlphabet{N}}: The nucleic acid sequence to search for ORFs.
Keyword Arguments
- alternative_start::Bool: A flag indicating whether to consider alternative start codons. Default is false.
- minlen::Int64: The minimum length of an ORF. Default is 6.
- overlap::Bool: A flag indicating whether to allow overlapping ORFs. Default is false.
The function returns a sorted vector of ORFI{N,NaiveCollector} objects, representing the identified ORFs.
Note
This method finds, by default, non-overlapping ORFs in the given sequence. It is much faster than the NaiveFinder method. Although it uses the same regular expression to find ORFs in a source sequence, it leverages the eachmatch function to find all the ORFs in the sequence.
Warning
Using the overlap = true flag will increase the runtime of the function significantly, and some of the ORFs found may display premature stop codons.
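The difference between overlapping and non-overlapping searches can be illustrated with Base Julia's own `eachmatch`, which accepts an `overlap` keyword. This is a sketch on a plain String, not the package's code path: `[ACGT]` replaces the IUPAC symbol N, and `location` is a hypothetical helper.

```julia
# Base regex on a plain String; [ACGT] stands in for the IUPAC symbol N.
orf_regex = r"ATG(?:[ACGT]{3})*?T(?:AG|AA|GA)"
seq = "ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"

# Hypothetical helper: 1-based coordinates covered by a match.
location(m::RegexMatch) = m.offset:(m.offset + length(m.match) - 1)

# Default behaviour: the search resumes after the end of the previous match.
nonoverlapping = [location(m) for m in eachmatch(orf_regex, seq)]

# overlap = true: the search resumes one position after the previous match start,
# so ORFs nested inside or overlapping an earlier ORF are reported as well.
overlapping = [location(m) for m in eachmatch(orf_regex, seq; overlap = true)]

println(length(nonoverlapping), " vs ", length(overlapping), " matches")
```

On this sequence the non-overlapping search returns a single interval, while the overlapping search also reports the ORFs that begin at the internal ATG codons.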
Writing ORFs to files
GeneFinder.write_orfs_bed Method
write_orfs_bed(input::NucleicSeqOrView{DNAAlphabet{N}}, output::Union{IOStream, IOBuffer}, finder::F; kwargs...)
write_orfs_bed(input::NucleicSeqOrView{DNAAlphabet{N}}, output::String, finder::F; kwargs...)
Write BED data to a file.
Arguments
- input: The input DNA sequence (NucSeq) or a view.
- output: The output destination, either a file path (String) or a buffer (IOStream or IOBuffer).
- finder: The algorithm used to find ORFIs. It can be either NaiveFinder() or NaiveFinderScored().
Keywords
- alternative_start::Bool=false: If true, alternative start codons will be used when identifying CDSs. Default is false.
- minlen::Int64=6: The minimum length that a CDS must have in order to be included in the output file. Default is 6.
GeneFinder.write_orfs_faa Method
write_orfs_faa(input::NucleicSeqOrView{DNAAlphabet{4}}, output::Union{IOStream, IOBuffer}, finder::F; kwargs...)
write_orfs_faa(input::NucleicSeqOrView{DNAAlphabet{4}}, output::String, finder::F; kwargs...)
Write the protein sequences encoded by the coding sequences (CDSs) of a given DNA sequence to the specified file.
Arguments
- input: The input DNA sequence (NucSeq) or a view.
- output: The output destination, either a file path (String) or a buffer (IOStream or IOBuffer).
- finder: The algorithm used to find ORFIs. It can be either NaiveFinder() or NaiveFinderScored().
Keywords
- code::GeneticCode=BioSequences.standard_genetic_code: The genetic code by which codons will be translated. See BioSequences.ncbi_trans_table for more info.
- alternative_start::Bool=false: If true, the extended set of start codons is passed to the search. This will increase the execution time 3x.
- minlen::Int64=6: Minimum length of an allowed ORFI. The default value allows aa"M*" as a possible encoded protein from the resulting ORFIs.
Examples
filename = "output.faa"
seq = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"
open(filename, "w") do file
write_orfs_faa(seq, file, NaiveFinder())
end
GeneFinder.write_orfs_fna Method
write_orfs_fna(input::NucleicSeqOrView{DNAAlphabet{N}}, output::Union{IOStream, IOBuffer}, finder::F; kwargs...)
write_orfs_fna(input::NucleicSeqOrView{DNAAlphabet{N}}, output::String, finder::F; kwargs...)
Write the coding sequences (CDSs) of a given DNA sequence to the specified file.
Arguments
- input::NucleicSeqOrView{DNAAlphabet{N}}: The input DNA sequence.
- output: The output destination, either a file path (String) or a buffer (IOStream or IOBuffer).
- finder::F: The algorithm used to find ORFIs. It can be either NaiveFinder() or NaiveFinderScored().
Keywords
- alternative_start::Bool=false: If true, alternative start codons will be used when identifying CDSs. Default is false.
- minlen::Int64=6: The minimum length that a CDS must have in order to be included in the output file. Default is 6.
Examples
filename = "output.fna"
seq = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"
open(filename, "w") do file
write_orfs_fna(seq, file, NaiveFinder())
end
GeneFinder.write_orfs_gff Method
write_orfs_gff(input::NucleicSeqOrView{DNAAlphabet{N}}, output::Union{IOStream, IOBuffer}, finder::F; kwargs...)
write_orfs_gff(input::NucleicSeqOrView{DNAAlphabet{N}}, output::String, finder::F; kwargs...)
Write GFF data to a file.
Arguments
- input: The input DNA sequence (NucSeq) or a view.
- output: The output destination, either a file path (String) or a buffer (IOStream or IOBuffer).
- finder: The algorithm used to find ORFIs. It can be either NaiveFinder() or NaiveFinderScored().
Keywords
- code::GeneticCode=BioSequences.standard_genetic_code: The genetic code by which codons will be translated. See BioSequences.ncbi_trans_table for more info.
- alternative_start::Bool=false: If true, the extended set of start codons is passed to the search. This will increase the execution time 3x.
- minlen::Int64=6: Minimum length of an allowed ORFI. The default value allows aa"M*" as a possible encoded protein from the resulting ORFIs.