screen thresholds and decision logic

This page documents how phu screen decides whether a hit, protein, or contig passes filtering.

It is meant as an implementation-aligned reference for PFAM and KOfam behavior.

Scope

These rules apply to the screening workflow in phu screen after protein prediction and HMM search.

Core pass/fail flow

Hits emitted by pyHMMER are processed in the following order:

Start from hits marked as included by pyHMMER.
Compute an effective score.
Compute an effective minimum bitscore threshold.
Apply score and E-value filters.
Group remaining hits by contig and apply --combine-mode rules.

Only contigs with at least one remaining hit after all filters can be kept.
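The flow above can be sketched as a small loop. This is only an illustration under stated assumptions: hits are represented as plain dicts with hypothetical keys `contig`, `model`, and `included`, and the score and E-value checks are passed in as a hypothetical `passes` callable rather than taken from the phu code base.

```python
from collections import defaultdict


def screen_hits(hits, passes):
    """Group passing hits by contig (illustrative sketch, not phu's code).

    hits:   iterable of dicts with hypothetical keys 'contig', 'model', 'included'
    passes: callable implementing the score and E-value filters for one hit
    """
    models_by_contig = defaultdict(set)
    for hit in hits:
        if not hit["included"]:   # step 1: start from included hits only
            continue
        if not passes(hit):       # steps 2-4: effective score, threshold, filters
            continue
        models_by_contig[hit["contig"]].add(hit["model"])
    # Step 5 (combine-mode rules) would then run on this mapping.
    return dict(models_by_contig)


example_hits = [
    {"contig": "c1", "model": "M1", "included": True, "score": 150.0},
    {"contig": "c1", "model": "M2", "included": False, "score": 300.0},
    {"contig": "c2", "model": "M1", "included": True, "score": 10.0},
]
grouped = screen_hits(example_hits, passes=lambda h: h["score"] >= 100.0)
```

A contig that ends up with no entry in the returned mapping has no remaining hit and cannot be kept.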

Which score is used

Each hit has:

Full-sequence bitscore (HMMER "full sequence score").
Domain bitscore derived as the maximum score among included domains for that hit.

Effective score selection:

Default: use full-sequence bitscore.
KOfam model with score_type = domain: use domain bitscore when available.
KOfam model with score_type = full: use full-sequence bitscore.
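A minimal sketch of the selection rule above. The `HitScores` container and the function name are hypothetical, and falling back to the full-sequence bitscore when no domain bitscore is available is an assumption about what "when available" implies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HitScores:
    """Hypothetical container for the two scores described above."""
    full_bitscore: float
    domain_bitscore: Optional[float]  # None when the hit has no included domain


def effective_score(hit: HitScores, score_type: Optional[str] = None) -> float:
    """Return the score that is compared against the threshold.

    score_type is the KOfam value ("full" or "domain"); None stands for a
    model without KOfam metadata.
    """
    if score_type == "domain" and hit.domain_bitscore is not None:
        return hit.domain_bitscore
    # Default case, score_type == "full", or no domain score available
    # (the last case is an assumption, see the lead-in).
    return hit.full_bitscore
```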

Threshold precedence

Let:

min_bitscore be the CLI value from --min-bitscore (can be unset).
ko_threshold be the KOfam threshold from ko_list for that KO (can be missing).

If --use-kofam-thresholds is enabled and ko_threshold exists:

If min_bitscore is unset: effective minimum bitscore is ko_threshold.
If min_bitscore is set: effective minimum bitscore is max(min_bitscore, ko_threshold).

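If KOfam thresholds are disabled, or no KO threshold exists for the model, the effective minimum bitscore is min_bitscore alone (which may itself be unset).

In other words, when KOfam thresholds are active, a user-supplied --min-bitscore can raise the effective threshold above ko_threshold but can never lower it below ko_threshold.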
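The precedence rules can be written as one small function. This is a sketch with a hypothetical function name; `None` stands for an unset --min-bitscore or a missing KO threshold.

```python
from typing import Optional


def effective_min_bitscore(
    min_bitscore: Optional[float],
    ko_threshold: Optional[float],
    use_kofam_thresholds: bool = True,
) -> Optional[float]:
    """Combine the CLI threshold and the KOfam threshold as described above."""
    if use_kofam_thresholds and ko_threshold is not None:
        if min_bitscore is None:
            return ko_threshold
        # The stricter (higher) of the two thresholds wins.
        return max(min_bitscore, ko_threshold)
    # KOfam thresholds disabled or unavailable: only the CLI value applies.
    return min_bitscore
```

For example, `effective_min_bitscore(50.0, 136.43)` yields 136.43, while `effective_min_bitscore(200.0, 136.43)` yields 200.0.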

E-value behavior

--max-evalue is always applied to the hit-level E-value, that is, the E-value reported on the top-level hit.

Important: even when KOfam score_type is domain, the E-value filter is still based on the hit-level E-value, not domain i-Evalue or c-Evalue from domtblout rows.
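A sketch of how the two filters might be applied to one hit. The function name is hypothetical, and two details are assumptions rather than documented behavior: the comparisons are inclusive, and an unset minimum bitscore skips the bitscore check. The point it illustrates is that the E-value argument is always the hit-level value, regardless of which bitscore was chosen as the effective score.

```python
from typing import Optional


def passes_filters(
    effective_score: float,
    hit_evalue: float,
    min_bitscore: Optional[float],
    max_evalue: float = 1e-5,
) -> bool:
    """Apply the bitscore and E-value filters to a single hit.

    effective_score may be a full-sequence or a domain bitscore;
    hit_evalue is the hit-level E-value in both cases.
    """
    if min_bitscore is not None and effective_score < min_bitscore:
        return False
    return hit_evalue <= max_evalue
```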

PFAM behavior

PFAM accessions are resolved to local models, then screened like any other HMM model.

Threshold behavior for PFAM depends on CLI options:

--cut-ga on (default): pyHMMER applies each profile's GA (gathering) cutoffs during search.
--no-cut-ga: no model GA cutoff is forced by pyHMMER; filtering relies on --min-bitscore and --max-evalue.

PFAM does not use KOfam ko_list thresholds.

KOfam behavior

KOfam models are resolved by KO ID and enriched with metadata parsed from ko_list, including:

threshold
score_type (full or domain)

When --use-kofam-thresholds is enabled (default), KOfam thresholding is applied per KO using the KO score_type logic above.
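As an illustration of the metadata step, the sketch below reads the two fields from a ko_list-style table. It assumes a tab-separated layout with a header row that names `knum`, `threshold`, and `score_type` columns, and it treats any non-numeric threshold as missing; the function name, the KO IDs, and the sample rows are made up for the example and do not describe how phu parses the file.

```python
import csv
import io
from typing import Dict, Optional, Tuple


def parse_ko_list(handle) -> Dict[str, Tuple[Optional[float], Optional[str]]]:
    """Map KO ID -> (threshold, score_type) from a tab-separated table."""
    metadata = {}
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        try:
            threshold = float(row["threshold"])
        except ValueError:
            threshold = None  # assumption: non-numeric means "no threshold"
        score_type = row["score_type"]
        if score_type not in ("full", "domain"):
            score_type = None
        metadata[row["knum"]] = (threshold, score_type)
    return metadata


# Made-up sample rows, for illustration only.
sample = (
    "knum\tthreshold\tscore_type\tdefinition\n"
    "K00001\t136.43\tfull\texample entry\n"
    "K99999\t-\t-\tentry without a threshold\n"
)
ko_meta = parse_ko_list(io.StringIO(sample))
```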

domtblout interpretation

--keep-domtbl keeps raw domtblout files for inspection and audit.

In the current implementation, pass/fail filtering does not re-parse domtblout text. Selection is performed on the in-memory hit objects produced by pyHMMER.

Use domtblout as an audit artifact to interpret why a hit likely passed or failed.

combine mode after filtering

After score/E-value filtering, contigs are retained by combine mode:

any: keep contigs with at least one passing model; keep top hits per model per contig.
all: keep contigs that match all models.
threshold: keep contigs with at least --min-hmm-hits distinct matching models.
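The three modes reduce to a check on the set of models with passing hits on a contig. The sketch below uses hypothetical names and covers only the keep/drop decision; the "top hits per model per contig" selection in any mode is not shown.

```python
from typing import Iterable


def keep_contig(
    matched_models: Iterable[str],
    all_models: Iterable[str],
    combine_mode: str,
    min_hmm_hits: int,
) -> bool:
    """Decide whether a contig is retained under the given combine mode."""
    wanted = set(all_models)
    matched = set(matched_models) & wanted
    if not matched:
        return False  # a contig with no remaining hit is never kept
    if combine_mode == "any":
        return True
    if combine_mode == "all":
        return matched == wanted
    if combine_mode == "threshold":
        return len(matched) >= min_hmm_hits
    raise ValueError(f"unknown combine mode: {combine_mode!r}")
```

With three screened models and a contig matching two of them, the contig is kept under any, dropped under all, and kept under threshold only if --min-hmm-hits is at most 2.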

Worked interpretation example

If KO metadata says:

threshold = 136.43
score_type = full

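Then hits with full-sequence scores around 15-20 fail the bitscore threshold of 136.43, even if their domain scores look reasonable in domtblout, because score_type = full means the full-sequence score is the one compared.

If --max-evalue remains at its default (1e-5), many such hits also fail the E-value filter.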
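The same example as arithmetic. The two hit scores are illustrative values chosen for this sketch, not measured output, and the inclusive comparison is an assumption.

```python
# KO metadata from the example above.
ko_threshold = 136.43
score_type = "full"

# Illustrative hit scores (made up for this sketch).
full_bitscore = 18.0
domain_bitscore = 45.0

# score_type = full selects the full-sequence score, so the domain
# score plays no role in the comparison.
score = domain_bitscore if score_type == "domain" else full_bitscore
passes_bitscore = score >= ko_threshold
print(f"effective score {score} vs threshold {ko_threshold}: pass={passes_bitscore}")
```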