cluster¶
Sequence clustering wrapper around the external vclust
tool with three preconfigured modes for common viral genomics workflows.
Overview¶
The phu cluster
command provides a simplified interface to vclust
, implementing common use cases (see the vclust wiki) while allowing advanced customization via --vclust-params
.
Synopsis¶
phu cluster --mode <MODE> --input-contigs <FASTA_FILE> [OPTIONS]
Outputs are placed in an output folder (default clustered-contigs/
) and include cluster assignment TSVs and representative FASTA files, for example:
clustered-contigs/
├── ani.ids.tsv
├── ani.tsv
├── cluster_representatives_ids.txt
├── fltr.txt
├── representatives.fna
└── species.tsv
Modes¶
-
dereplication
— remove redundant sequences while keeping representatives (cd-hit/ANI-based). -
votu
— cluster into viral Operational Taxonomic Units (leiden/ANI-based) following MIUViG-style defaults. -
species
— classify sequences into species (complete/TANI-based) following ICTV-style defaults.
Default parameters by mode¶
Command Options¶
Sequence clustering wrapper around external 'vclust' with three modes.
For advanced usage, provide custom vclust parameters as a quoted string.
See the vclust wiki for parameter details:
https://github.com/refresh-bio/vclust/wiki
Example:
phu cluster --mode votu --input-contigs genomes.fna
╭─ Options ─────────────────────────────────────────────────────────────────╮
│ * --mode [dereplication|votu|s dereplication | votu | │
│ pecies] species │
│ [required] │
│ * --input-contigs PATH Input FASTA [required] │
│ --output-folder PATH Output directory │
│ [default: │
│ clustered-contigs] │
│ --threads INTEGER RANGE [x>=0] 0=all cores; otherwise │
│ N threads │
│ [default: 0] │
│ --vclust-params TEXT Custom vclust │
│ parameters: │
│ "--min-kmers 20 │
│ --outfmt lite --ani │
│ 0.97" │
│ --help -h Show this message and │
│ exit. │
╰───────────────────────────────────────────────────────────────────────────╯
Examples¶
# Dereplicate viral contigs (default outputs in clustered-contigs/)
phu cluster --mode dereplication --input-contigs viral_contigs.fna
# Cluster into vOTUs following MIUViG standards
phu cluster --mode votu --input-contigs viral_contigs.fna
# Species classification following ICTV standards
phu cluster --mode species --input-contigs complete_genomes.fna
Advanced: pass custom vclust parameters¶
phu cluster --mode species --input-contigs genomes.fna \
--vclust-params="--metric tani --tani 0.70"
Treat species clustering as genus-level by lowering similarity to 70%
Notes¶
--vclust-params
provides full control overvclust
parameters; use it to tune behavior for large or highly-redundant datasets.- For reproducibility, save the full vclust command and parameters used (the tool often logs them in the output folder).
See also¶
- vclust wiki: https://github.com/refresh-bio/vclust/wiki/6-Use-cases