screen thresholds and decision logic
This page documents how phu screen decides whether a hit, protein, or contig passes filtering.
It is meant as an implementation-aligned reference for PFAM and KOfam behavior.
Scope
These rules apply to the screening workflow in phu screen after protein prediction and HMM search.
Core pass/fail flow
For each hit emitted by pyHMMER:
- Start from hits marked as included by pyHMMER.
- Compute an effective score.
- Compute an effective minimum bitscore threshold.
- Apply score and E-value filters.
- Group remaining hits by contig and apply --combine-mode rules.
Only contigs with at least one remaining hit after all filters can be kept.
Which score is used
Each hit has:
- Full-sequence bitscore (HMMER "full sequence score").
- Domain bitscore derived as the maximum score among included domains for that hit.
Effective score selection:
- Default: use full-sequence bitscore.
- KOfam model with score_type = domain: use domain bitscore when available.
- KOfam model with score_type = full: use full-sequence bitscore.
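The selection rule above can be sketched as follows. This is a minimal illustration, not the phu source: HitScores and effective_score are hypothetical names, and falling back to the full-sequence bitscore when no domain score is available is an assumption about what "when available" implies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HitScores:
    """Hypothetical container for the two scores described above."""
    full_score: float                     # full-sequence bitscore
    domain_score: Optional[float] = None  # max bitscore among included domains, if any


def effective_score(hit: HitScores, score_type: Optional[str] = None) -> float:
    """Pick the score used for thresholding.

    score_type is None for models without KOfam metadata (for example PFAM).
    """
    if score_type == "domain" and hit.domain_score is not None:
        return hit.domain_score
    # Default, score_type == "full", or no domain score available (assumed fallback).
    return hit.full_score
```

For a hit with a full-sequence score of 50.0 and a domain score of 80.0, this returns 80.0 only when score_type is "domain".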
Threshold precedence
Let:
- min_bitscore be the CLI value from --min-bitscore (can be unset).
- ko_threshold be the KOfam threshold from ko_list for that KO (can be missing).
If --use-kofam-thresholds is enabled and ko_threshold exists:
- If min_bitscore is unset: effective minimum bitscore is ko_threshold.
- If min_bitscore is set: effective minimum bitscore is max(min_bitscore, ko_threshold).
If KOfam thresholds are disabled (or no KO threshold exists), effective minimum bitscore is just min_bitscore.
Because the larger of the two values wins, --min-bitscore can raise the bar above the KOfam threshold but can never lower it while KOfam thresholds are active.
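The precedence rules can be written as one small function. This is a sketch under stated assumptions: effective_min_bitscore is a hypothetical name, and None stands in for an unset --min-bitscore or a missing KO threshold.

```python
from typing import Optional


def effective_min_bitscore(
    min_bitscore: Optional[float],
    ko_threshold: Optional[float],
    use_kofam_thresholds: bool = True,
) -> Optional[float]:
    """Return the minimum bitscore to enforce, or None if neither value is set."""
    if use_kofam_thresholds and ko_threshold is not None:
        if min_bitscore is None:
            return ko_threshold
        # Both are set: the stricter (larger) value wins.
        return max(min_bitscore, ko_threshold)
    # KOfam thresholds disabled or no KO threshold: only the CLI value applies.
    return min_bitscore
```

With ko_threshold = 136.43, passing min_bitscore = 50.0 still yields 136.43, while min_bitscore = 200.0 yields 200.0.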
E-value behavior
--max-evalue is always applied to the hit-level E-value taken from the top-level hit.
Important: even when KOfam score_type is domain, the E-value filter is still based on the hit-level E-value, not domain i-Evalue or c-Evalue from domtblout rows.
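A minimal sketch of the per-hit check, showing that the E-value test uses the hit-level value regardless of which score was selected. The function and parameter names are hypothetical. The inclusive comparisons (score >= threshold, E-value <= maximum) and skipping the bitscore test when no threshold is set are assumptions rather than confirmed behavior. The default of 1e-5 is the --max-evalue default stated on this page.

```python
from typing import Optional


def passes_filters(
    score: float,                   # effective score (full or domain, chosen earlier)
    hit_evalue: float,              # hit-level E-value, never a domain i-Evalue or c-Evalue
    min_bitscore: Optional[float],  # effective minimum bitscore, None if unset
    max_evalue: float = 1e-5,
) -> bool:
    """Apply the score filter, then the E-value filter, to a single hit."""
    if min_bitscore is not None and score < min_bitscore:
        return False
    return hit_evalue <= max_evalue
```

A hit whose domain score clears the threshold can therefore still be rejected when its hit-level E-value exceeds --max-evalue.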
PFAM behavior
PFAM accessions are resolved to local models, then screened like any other HMM model.
Threshold behavior for PFAM depends on CLI options:
- --cut-ga on (default): pyHMMER applies profile GA gathering cutoffs during search.
- --no-cut-ga: no model GA cutoff is forced by pyHMMER; filtering relies on --min-bitscore and --max-evalue.
PFAM does not use KOfam ko_list thresholds.
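One way the two CLI settings could map onto pyHMMER search options is sketched below. pyHMMER's search pipeline accepts a bit_cutoffs option whose "gathering" value selects the GA cutoffs stored in each profile. The helper name pipeline_options and the way phu actually wires the flag are assumptions.

```python
from typing import Dict


def pipeline_options(cut_ga: bool = True) -> Dict[str, str]:
    """Build keyword options for a pyHMMER search (hypothetical helper)."""
    options: Dict[str, str] = {}
    if cut_ga:
        # Ask pyHMMER to use each profile's gathering (GA) cutoffs.
        options["bit_cutoffs"] = "gathering"
    # With --no-cut-ga nothing is forced; --min-bitscore and --max-evalue
    # are applied afterwards by the screening logic.
    return options
```

The resulting dictionary would be unpacked into the search call, for example pyhmmer.hmmsearch(hmms, sequences, **pipeline_options(cut_ga=True)).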
KOfam behavior
KOfam models are resolved by KO ID and enriched with metadata parsed from ko_list, including:
- threshold
- score_type (full or domain)
When --use-kofam-thresholds is enabled (default), KOfam thresholding is applied per KO using the KO score_type logic above.
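The metadata lookup could look roughly like the sketch below. It assumes ko_list is a tab-separated table with a header row containing knum, threshold, and score_type columns, and that a "-" marks a missing value. KoMeta, parse_ko_list, and the sample rows are hypothetical and for illustration only.

```python
import csv
import io
from typing import Dict, NamedTuple, Optional


class KoMeta(NamedTuple):
    threshold: Optional[float]  # None when ko_list has no threshold for the KO
    score_type: Optional[str]   # "full", "domain", or None when missing


def parse_ko_list(text: str) -> Dict[str, KoMeta]:
    """Parse ko_list-style text into a KO ID -> metadata mapping."""
    meta: Dict[str, KoMeta] = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        raw_threshold = row["threshold"]
        raw_type = row["score_type"]
        threshold = float(raw_threshold) if raw_threshold not in ("", "-") else None
        score_type = raw_type if raw_type in ("full", "domain") else None
        meta[row["knum"]] = KoMeta(threshold, score_type)
    return meta


# Made-up rows for illustration; real ko_list files carry more columns.
SAMPLE = (
    "knum\tthreshold\tscore_type\n"
    "K_EXAMPLE_A\t136.43\tfull\n"
    "K_EXAMPLE_B\t-\t-\n"
)
ko_meta = parse_ko_list(SAMPLE)
```

A KO whose threshold parses to None falls under the "no KO threshold exists" case described in the threshold precedence section.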
domtblout interpretation
--keep-domtbl keeps raw domtblout files for inspection and audit.
In the current implementation, pass/fail filtering does not re-parse the domtblout text. Selection is performed on the in-memory hit objects produced by pyHMMER.
Use domtblout as an audit artifact to interpret why a hit likely passed or failed.
combine mode after filtering
After score/E-value filtering, contigs are retained by combine mode:
- any: keep contigs with at least one passing model; keep top hits per model per contig.
- all: keep contigs that match all models.
- threshold: keep contigs with at least --min-hmm-hits distinct matching models.
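The three modes can be sketched as a contig-selection function. This covers only which contigs are kept, not the "top hits per model" step of the any mode. select_contigs and its arguments are hypothetical names, and the input is assumed to map each contig to the set of models with at least one hit that passed filtering.

```python
from typing import Dict, Set


def select_contigs(
    models_by_contig: Dict[str, Set[str]],  # contig -> models with a passing hit
    all_models: Set[str],                   # every model requested for the screen
    mode: str = "any",
    min_hmm_hits: int = 1,
) -> Set[str]:
    """Return the contigs retained under the given combine mode."""
    kept: Set[str] = set()
    for contig, models in models_by_contig.items():
        if mode == "any":
            keep = len(models) >= 1
        elif mode == "all":
            keep = all_models <= models  # every requested model matched
        elif mode == "threshold":
            keep = len(models) >= min_hmm_hits
        else:
            raise ValueError(f"unknown combine mode: {mode!r}")
        if keep:
            kept.add(contig)
    return kept
```

For example, with two requested models, a contig matching only one of them is kept under any, dropped under all, and dropped under threshold when --min-hmm-hits is 2.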
Worked interpretation example
If KO metadata says:
- threshold = 136.43
- score_type = full
Then hits with full-sequence scores around 15-20 fail thresholding, even if domain scores look reasonable in domtblout.
If --max-evalue remains default (1e-5), many such hits also fail the E-value filter.
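The same reasoning, traced in code for a single made-up hit. The hit values are hypothetical; only the threshold, score_type, and default --max-evalue come from this page, and the inclusive comparisons are an assumption.

```python
# KO metadata from the example above.
ko_threshold = 136.43
score_type = "full"

# CLI settings: --min-bitscore unset, --max-evalue at its default.
min_bitscore = None
max_evalue = 1e-5

# Hypothetical hit: weak full-sequence score, better-looking domain score.
hit = {"full_score": 18.0, "domain_score": 40.0, "evalue": 1e-3}

# score_type = full, so the domain score is ignored.
score = hit["domain_score"] if score_type == "domain" else hit["full_score"]
threshold = ko_threshold if min_bitscore is None else max(min_bitscore, ko_threshold)

passes_score = score >= threshold            # 18.0 is below 136.43
passes_evalue = hit["evalue"] <= max_evalue  # 1e-3 is above 1e-5
passes = passes_score and passes_evalue
```

Here the hit fails both checks, so it never reaches the combine-mode step.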