GeneFinder.jl

The Main ORF type

The main type of the package is ORFI, which represents an Open Reading Frame Interval. It is a subtype of the AbstractGenomicInterval type from the GenomicFeatures package.

GeneFinder.OpenReadingFrameInterval Type

julia

struct ORFI{N,F} <: AbstractGenomicInterval{F}

The ORFI struct represents an Open Reading Frame Interval (ORFI) in genomics.

Fields

groupname::String: The name of the group to which the ORFI belongs.
first::Int64: The starting position of the ORFI.
last::Int64: The ending position of the ORFI.
strand::Strand: The strand on which the ORFI is located.
frame::Int: The reading frame of the ORFI.
seq::LongSubSeq{DNAAlphabet{N}}: The DNA sequence of the ORFI.
features::Features: The features associated with the ORFI.

Main Constructor

julia

ORFI{N,F}(
    groupname::String,
    first::Int64,
    last::Int64,
    strand::Strand,
    frame::Int,
    features::Features,
    seq::LongSubSeq{DNAAlphabet{N}}
)

Example

A full ORFI instance

julia

ORFI{4,NaiveFinder}("seq01", 1, 33, STRAND_POS, 1, Features((score = 0.0,)), nothing)

A partial ORFI instance

julia

ORFI{NaiveFinder}(1:33, '+', 1)


GeneFinder.features Method

julia

features(i::ORFI{N,F})

Extracts the features from an ORFI object.

Arguments

i::ORFI{N,F}: An ORFI object.

Returns

The features of the ORFI object. Each GeneFinderMethod may define its own features.


GeneFinder.sequence Method

julia

sequence(i::ORFI{N,F})

Extracts the DNA sequence corresponding to the given open reading frame (ORFI).

Arguments

i::ORFI{N,F}: The open reading frame (ORFI) for which the DNA sequence needs to be extracted.

Returns

The DNA sequence corresponding to the given open reading frame (ORFI).


GeneFinder.source Method

julia

source(i::ORFI{N,F})

Get the source sequence associated with the given ORFI object.

Arguments

i::ORFI{N,F}: The ORFI object for which to retrieve the source sequence.

Returns

The source sequence associated with the ORFI object.

Examples

julia

seq = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"
orfs = findorfs(seq)
source(orfs[1])

44nt DNA Sequence:
ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA

Warning

The source method works only if the sequence is defined in the global scope; otherwise it throws an error. A common failure is to define a simple ORFI, which by default has "unnamedsource" as its groupname, and then try to get the source sequence.

julia

orf = ORFI{NaiveFinder}(1:33, '+', 1)
source(orf)

ERROR: UndefVarError: `unnamedsource` not defined
Stacktrace:
 [1] source(i::ORFI{4, NaiveFinder})
   @ GeneFinder ~/.julia/dev/GeneFinder/src/types.jl:192
 [2] top-level scope
   @ REPL[12]:1
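
The failure above can be mimicked with plain Julia. The sketch below is only an illustration of a name-based lookup that is consistent with the error message; the helper toy_source is hypothetical and is not the package's implementation.

```julia
# Hypothetical lookup: resolve a stored name to a global variable of this module.
toy_source(groupname::AbstractString) = getfield(@__MODULE__, Symbol(groupname))

# A global sequence whose variable name matches the stored group name.
my_seq = "ATGATGCATGCATGCATGCTAGTAA"
found = toy_source("my_seq")   # returns the global `my_seq`

# A name with no matching global raises UndefVarError.
missing_name_fails = try
    toy_source("unnamedsource")
    false
catch err
    err isa UndefVarError
end
```

Under this assumption, an ORFI whose groupname does not match any global variable cannot be traced back to its sequence.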


Finding ORFIs

The function findorfs serves as a method interface: it is a generic method that can handle different gene finding methods.

GeneFinder.findorfs Method

julia

findorfs(sequence::NucleicSeqOrView{DNAAlphabet{N}}; ::F, kwargs...) where {N, F<:GeneFinderMethod}

This is the main interface method for finding open reading frames (ORFIs) in a DNA sequence.

It takes the following required arguments:

sequence: The nucleic acid sequence to search for ORFIs.
finder: The algorithm used to find ORFIs. It can be NaiveFinder, NaiveCollector, or another implementation.

Keyword arguments accepted regardless of the finder method:

alternative_start::Bool: A boolean indicating whether to consider alternative start codons. Default is false.
minlen::Int: The minimum length of an ORFI. Default is 6.
scheme::Function: The scoring scheme to use for scoring the sequence from the ORFI. Default is nothing.

Returns

A vector of ORFI objects representing the found ORFIs.

Example

julia

sequence = randdnaseq(120)

120nt DNA Sequence:
 GCCGGACAGCGAAGGCTAATAAATGCCCGTGCCAGTATC…TCTGAGTTACTGTACACCCGAAAGACGTTGTACGCATTT

findorfs(sequence, finder=NaiveFinder)

1-element Vector{ORFI}:
 ORFI{NaiveFinder}(77:118, '-', 2)
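
The idea of one generic entry point that forwards to a chosen finder can be sketched in plain Julia. All names below (ToyFinderMethod, ToyRegexFinder, toy_findorfs) are hypothetical stand-ins, and the sketch works on plain strings rather than BioSequences types; it only illustrates the dispatch pattern, not the package's code.

```julia
# Hypothetical stand-ins for the package's finder hierarchy.
abstract type ToyFinderMethod end
struct ToyRegexFinder <: ToyFinderMethod end

# Calling the finder type on a sequence runs that finder and returns ORF ranges.
function ToyRegexFinder(seq::AbstractString; minlen::Int=6)
    pattern = r"ATG(?:[ACGT]{3})*?T(?:AG|AA|GA)"
    return [m.offset:(m.offset + length(m.match) - 1)
            for m in eachmatch(pattern, seq) if length(m.match) >= minlen]
end

# Generic entry point: forwards the sequence and keywords to the chosen finder.
toy_findorfs(seq::AbstractString; finder::Type{<:ToyFinderMethod}=ToyRegexFinder, kwargs...) =
    finder(seq; kwargs...)

default_result = toy_findorfs("CCATGAAATGATTTTAACC")
filtered_result = toy_findorfs("CCATGAAATGATTTTAACC"; finder=ToyRegexFinder, minlen=12)
```

With this layout, supporting a new finder only requires a new subtype with its own callable method; the entry point stays unchanged.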


Finding ORFs using BioRegex

GeneFinder.NaiveFinder Method

julia

NaiveFinder(seq::NucleicSeqOrView{DNAAlphabet{N}}; kwargs...) -> Vector{ORFI{N,F}} where {N,F}

A simple implementation that finds ORFIs in a DNA sequence.

The NaiveFinder method takes a LongSequence{DNAAlphabet{4}} sequence and returns a Vector{ORFI} containing the ORFIs found in the sequence. It uses a regular expression to match each entire CDS and adds every ORFI it finds to the vector. The function also searches the reverse complement of the sequence, so it finds ORFIs on both strands. Setting alternative_start = true extends the start codons to ATG, GTG, and TTG. Some studies have shown that in E. coli (K-12 strain), ATG, GTG and TTG are used 83 %, 14 % and 3 % of the time, respectively.
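
The two-strand search and the extended start codons can be sketched with plain strings and Base regular expressions. The helpers revcomp, orf_pattern and toy_orfs are hypothetical names introduced for illustration, and [ACGT] stands in for the IUPAC class used by BioRegex; this is a simplified sketch, not the package's implementation.

```julia
# Complement table for plain DNA strings (A, C, G, T only).
const COMPLEMENT = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')
revcomp(s::AbstractString) = reverse(map(c -> COMPLEMENT[c], s))

# Build the ORF pattern; with alternative starts, ATG, GTG and TTG are accepted.
function orf_pattern(alternative_start::Bool)
    start = alternative_start ? "[AGT]TG" : "ATG"
    return Regex(start * "(?:[ACGT]{3})*?T(?:AG|AA|GA)")
end

# Return (first, last, strand) tuples in forward-strand coordinates.
function toy_orfs(seq::AbstractString; alternative_start::Bool=false)
    pattern = orf_pattern(alternative_start)
    n = length(seq)
    found = Tuple{Int,Int,Char}[]
    for m in eachmatch(pattern, seq)
        push!(found, (m.offset, m.offset + length(m.match) - 1, '+'))
    end
    for m in eachmatch(pattern, revcomp(seq))
        stop = m.offset + length(m.match) - 1
        # Position p on the reverse complement maps to n - p + 1 on the forward strand.
        push!(found, (n - stop + 1, n - m.offset + 1, '-'))
    end
    return sort!(found)
end

both_strands = toy_orfs("ATGAAATAACCCTTATTTCAT")
gtg_default = toy_orfs("GTGAAATAA")
gtg_extended = toy_orfs("GTGAAATAA"; alternative_start=true)
```

In this toy input, one ORF is found on each strand, and the GTG-initiated ORF is reported only when alternative_start is true.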

Note

This function has neither an ORFI scoring scheme nor length constraints by default. Thus it may report aa"M*" as a possible encoded protein among the resulting ORFIs.

Required Arguments

seq::NucleicSeqOrView{DNAAlphabet{N}}: The nucleic acid sequence to search for ORFIs.

Keyword Arguments

alternative_start::Bool: If true, the extended start codons are included in the search. This will increase the execution time 3x. Default is false.
minlen::Int64=6: The minimum allowed length of an ORFI. The default value allows aa"M*" as a possible encoded protein among the resulting ORFIs.

Note

As the scheme is generally a scoring function that at least requires a sequence, one simple scheme is the log-odds ratio score. This score is a log-odds ratio that compares the probability of the sequence generated by a coding model to the probability of the sequence generated by a non-coding model:

S(x) = \sum_{i=1}^{L} \beta_{x_{i-1} x_{i}} = \sum_{i=1}^{L} \log \frac{a^{m_{1}}_{x_{i-1} x_{i}}}{a^{m_{2}}_{x_{i-1} x_{i}}}

If the log-odds ratio exceeds a given threshold (η), the sequence is considered likely to be coding. See lordr for more information about coding criteria.
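
A minimal sketch of such a log-odds score, assuming first-order Markov models given as 4x4 transition matrices over A, C, G, T. The function logodds_score, the toy matrices and the threshold value are illustrative assumptions, and the initial-state term is ignored for simplicity; see lordr for the package's own scoring.

```julia
# Map nucleotides to matrix indices (order: A, C, G, T).
const NT_INDEX = Dict('A' => 1, 'C' => 2, 'G' => 3, 'T' => 4)

# Sum the log ratio of coding to non-coding transition probabilities
# over every pair of adjacent nucleotides in the sequence.
function logodds_score(seq::AbstractString, coding::AbstractMatrix, noncoding::AbstractMatrix)
    score = 0.0
    for (prev, cur) in zip(seq, Iterators.drop(seq, 1))
        i, j = NT_INDEX[prev], NT_INDEX[cur]
        score += log(coding[i, j] / noncoding[i, j])
    end
    return score
end

# Toy models: the "coding" model favors repeating the previous nucleotide,
# the "non-coding" model is uniform.
coding = [0.4 0.2 0.2 0.2;
          0.2 0.4 0.2 0.2;
          0.2 0.2 0.4 0.2;
          0.2 0.2 0.2 0.4]
noncoding = fill(0.25, 4, 4)

η = 0.0                                   # illustrative threshold
score_repeat = logodds_score("AAAA", coding, noncoding)
score_mixed = logodds_score("ACGT", coding, noncoding)
likely_coding = score_repeat > η
```

A positive score means the coding model explains the sequence better than the non-coding model; the threshold η sets how much better it must be.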


GeneFinder._locationiterator Method

julia

_locationiterator(seq::NucleicSeqOrView{DNAAlphabet{N}}; kwargs...) where {N}

This is an iterator function that uses regular expressions to search for entire ORFIs (instead of separate start and stop codons) in a LongSequence{DNAAlphabet{4}} sequence. It uses an anonymous function that finds the first ORFI matching the regular expression, then builds an iterator that applies this function repeatedly until no further CDS is found.

Note

As a note of the implementation we want to expand on how the ORFIs are found:

The expression (?:[N]{3})*? serves as the boundary between the start and stop codons. Within this expression, the character class [N]{3} captures exactly three occurrences of any character (representing nucleotides using IUPAC codes). This portion functions as the regular codon matches. Since it is enclosed within a non-capturing group (?:) and followed by *?, it allows for the matching of intermediate codons, but with a preference for the smallest number of repetitions.

In summary, the regular expression ATG(?:[N]{3})*?T(AG|AA|GA) identifies patterns that start with "ATG," followed by any number of three-character codons (represented by "N" in the IUPAC code), and ends with a stop codon "TAG," "TAA," or "TGA." This pattern is commonly used to identify potential protein-coding regions within genetic sequences.
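
The effect of the lazy quantifier can be checked with Base regular expressions on a plain string. In this sketch [ACGT] stands in for the IUPAC class [N] used by BioRegex, and the greedy variant is included only for contrast; it is not part of the package.

```julia
# Lazy pattern: stops at the first in-frame stop codon after the start.
lazy_pattern = r"ATG(?:[ACGT]{3})*?T(AG|AA|GA)"
# Greedy variant: extends to the last in-frame stop codon it can reach.
greedy_pattern = r"ATG(?:[ACGT]{3})*T(AG|AA|GA)"

# Layout: CC | ATG AAA TGA TTT TAA | CC, with two in-frame stop codons.
seq = "CCATGAAATGATTTTAACC"

lazy = match(lazy_pattern, seq)
greedy = match(greedy_pattern, seq)

println(lazy.match, " starts at ", lazy.offset)
println(greedy.match, " starts at ", greedy.offset)
```

The lazy match ends at the first in-frame TGA, while the greedy match runs through it to the later TAA and therefore contains an internal stop codon.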

See more about the discussion here


GeneFinder.NaiveCollector Method

julia

NaiveCollector(seq::NucleicSeqOrView{DNAAlphabet{N}}; kwargs...) -> Vector{ORFI{N,F}} where {N,F}

The NaiveCollector function searches for open reading frames (ORFs) in a DNA sequence. It takes the following arguments:

Required Arguments

seq::NucleicSeqOrView{DNAAlphabet{N}}: The nucleic sequence to search for ORFs.

Keyword Arguments

alternative_start::Bool: A flag indicating whether to consider alternative start codons. Default is false.
minlen::Int64: The minimum length of an ORF. Default is 6.
overlap::Bool: A flag indicating whether to allow overlapping ORFs. Default is false.

The function returns a sorted vector of ORFI{N,NaiveCollector} objects, representing the identified ORFs.

Note

By default, this method finds non-overlapping ORFs in the given sequence. It is much faster than the NaiveFinder method. Although it uses the same regular expression to find ORFs in a source sequence, it leverages the eachmatch function to find all the ORFs in the sequence.

Warning

Using the overlap = true flag will increase the runtime of the function significantly, and some of the ORFs found may display premature stop codons.
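
The difference between the two modes can be seen with Base's eachmatch on a plain string. This sketch assumes the same behavior as the overlap keyword of Base.eachmatch and uses [ACGT] in place of the IUPAC class; it does not call the package.

```julia
pattern = r"ATG(?:[ACGT]{3})*?T(?:AG|AA|GA)"

# Two start codons share one stop codon: ATG ATG AAA TAA
seq = "ATGATGAAATAA"

# Default: the search resumes after the end of the previous match.
nonoverlapping = collect(eachmatch(pattern, seq))

# overlap=true: the search resumes one position after the previous match start.
overlapping = collect(eachmatch(pattern, seq; overlap=true))

# Convert matches to 1-based inclusive ranges.
to_range(m) = m.offset:(m.offset + length(m.match) - 1)
nonoverlapping_ranges = to_range.(nonoverlapping)
overlapping_ranges = to_range.(overlapping)
```

Here the non-overlapping search reports only the outer ORF, while the overlapping search also reports the nested ORF that starts at the second ATG.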


Writing ORFs to files

GeneFinder.write_orfs_bed Method

julia

write_orfs_bed(input::NucleicSeqOrView{DNAAlphabet{N}}, output::Union{IOStream, IOBuffer}, finder::F; kwargs...)
write_orfs_bed(input::NucleicSeqOrView{DNAAlphabet{N}}, output::String, finder::F; kwargs...)

Write BED data to a file.

Arguments

input: The input DNA sequence (NucSeq) or a view of one.
output: The output destination; either a file path (String) or a buffer (IOStream or IOBuffer).
finder: The algorithm used to find ORFIs. It can be either NaiveFinder() or NaiveFinderScored().

Keywords

alternative_start::Bool=false: If true, alternative start codons will be used when identifying CDSs. Default is false.
minlen::Int64=6: The minimum length that a CDS must have in order to be included in the output file. Default is 6.
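
For orientation, the sketch below writes ORF coordinates as standard six-column BED lines into an IOBuffer using plain Julia. The helper toy_write_bed, the record layout and the column choices are assumptions for illustration; the exact columns produced by write_orfs_bed are not specified here and may differ.

```julia
# Toy ORF records: (start, stop, strand) in 1-based inclusive coordinates.
orfs = [(1, 9, '+'), (13, 21, '-')]

# Write one tab-separated BED6 line per ORF:
# chrom, chromStart (0-based), chromEnd (exclusive), name, score, strand.
function toy_write_bed(io::IO, chrom::AbstractString, orfs)
    for (k, (start, stop, strand)) in enumerate(orfs)
        println(io, join((chrom, start - 1, stop, "ORF$(k)", 0, strand), '\t'))
    end
end

io = IOBuffer()
toy_write_bed(io, "seq01", orfs)
bed_text = String(take!(io))
print(bed_text)
```

Writing to an IO argument keeps the same function usable for both an in-memory buffer and a file opened with open(filename, "w").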


GeneFinder.write_orfs_faa Method

julia

write_orfs_faa(input::NucleicSeqOrView{DNAAlphabet{4}}, output::Union{IOStream, IOBuffer}, finder::F; kwargs...)
write_orfs_faa(input::NucleicSeqOrView{DNAAlphabet{4}}, output::String, finder::F; kwargs...)

Write the protein sequences encoded by the coding sequences (CDSs) of a given DNA sequence to the specified file.

Arguments

input: The input DNA sequence (NucSeq) or a view of one.
output: The output destination; either a file path (String) or a buffer (IOStream or IOBuffer).
finder: The algorithm used to find ORFIs. It can be either NaiveFinder() or NaiveFinderScored().

Keywords

code::GeneticCode=BioSequences.standard_genetic_code: The genetic code by which codons will be translated. See BioSequences.ncbi_trans_table for more info.
alternative_start::Bool=false: If true, the extended start codons are included in the search. This will increase the execution time 3x.
minlen::Int64=6: The minimum allowed length of an ORFI. The default value allows aa"M*" as a possible encoded protein among the resulting ORFIs.

Examples

julia

filename = "output.faa"

seq = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"

open(filename, "w") do file
     write_orfs_faa(seq, file, NaiveFinder())
end


GeneFinder.write_orfs_fna Method

julia

write_orfs_fna(input::NucleicSeqOrView{DNAAlphabet{N}}, output::Union{IOStream, IOBuffer}, finder::F; kwargs...)
write_orfs_fna(input::NucleicSeqOrView{DNAAlphabet{N}}, output::String, finder::F; kwargs...)

Write the coding sequences (CDSs) of a given DNA sequence to the specified file.

Arguments

input::NucleicSeqOrView{DNAAlphabet{N}}: The input DNA sequence.
output::IO: The output destination; either a file path (String) or a buffer (IOStream or IOBuffer).
finder::F: The algorithm used to find ORFIs. It can be either NaiveFinder() or NaiveFinderScored().

Keywords

alternative_start::Bool=false: If true, alternative start codons will be used when identifying CDSs. Default is false.
minlen::Int64=6: The minimum length that a CDS must have in order to be included in the output file. Default is 6.

Examples

julia

filename = "output.fna"

seq = dna"ATGATGCATGCATGCATGCTAGTAACTAGCTAGCTAGCTAGTAA"

open(filename, "w") do file
     write_orfs_fna(seq, file, NaiveFinder())
end


GeneFinder.write_orfs_gff Method

julia

write_orfs_gff(input::NucleicSeqOrView{DNAAlphabet{N}}, output::Union{IOStream, IOBuffer}, finder::F; kwargs...)
write_orfs_gff(input::NucleicSeqOrView{DNAAlphabet{N}}, output::String, finder::F; kwargs...)

Write GFF data to a file.

Arguments

input: The input DNA sequence (NucSeq) or a view of one.
output: The output destination; either a file path (String) or a buffer (IOStream or IOBuffer).
finder: The algorithm used to find ORFIs. It can be either NaiveFinder() or NaiveFinderScored().

Keywords

code::GeneticCode=BioSequences.standard_genetic_code: The genetic code by which codons will be translated. See BioSequences.ncbi_trans_table for more info.
alternative_start::Bool=false: If true, the extended start codons are included in the search. This will increase the execution time 3x.
minlen::Int64=6: The minimum allowed length of an ORFI. The default value allows aa"M*" as a possible encoded protein among the resulting ORFIs.
