simplify-taxa¶
Convert vContact3 taxonomy prediction strings into compact lineage codes for downstream analysis.
Overview¶
phu simplify-taxa transforms verbose vContact3 *_prediction columns into compact, standardized lineage codes (e.g. Caudoviricetes:NF2:NG1) to make taxonomy easier to filter, visualize, and analyze.
Synopsis¶
phu simplify-taxa -i <INPUT_FILE> -o <OUTPUT_FILE> [OPTIONS]
The input is a CSV or TSV file such as vContact3's final_assignments.csv. The output format is detected automatically from the output file's extension.
Input/Output Formats¶
Supported Formats¶
- Input: CSV, TSV (auto-detected from extension or --sep parameter)
- Output: CSV, TSV (auto-detected from file extension)
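The extension-based detection described above can be sketched in Python as follows. This is an illustrative sketch, not phu's actual code: the function name pick_delimiter is hypothetical, and raising an error for an unrecognized extension is an assumption (the real tool may fall back to a default instead).

```python
from pathlib import Path


def pick_delimiter(path, sep=None):
    """Return the field delimiter for `path`, honouring an explicit override.

    Hypothetical helper: mirrors the documented rule (extension first,
    --sep as override), not phu's real implementation.
    """
    if sep is not None:
        # A shell argument "\t" arrives as backslash + t, so accept both forms.
        return "\t" if sep in ("\t", "\\t") else sep
    suffix = Path(path).suffix.lower()
    if suffix == ".tsv":
        return "\t"
    if suffix == ".csv":
        return ","
    # Assumption: refuse to guess when the extension is not recognized.
    raise ValueError(f"Cannot infer delimiter from {path!r}; pass sep explicitly")


print(repr(pick_delimiter("final_assignments.csv")))
print(repr(pick_delimiter("data.txt", sep="\\t")))
```

Under this sketch a file such as data.txt needs an explicit separator, which matches the --sep example in the usage section.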
Expected Input Columns¶
The command automatically detects and processes any columns matching the pattern *_prediction:
- kingdom_prediction
- phylum_prediction
- class_prediction
- order_prediction
- family_prediction
- subfamily_prediction
- genus_prediction
- realm_prediction (if present)
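Matching columns against *_prediction could look like the sketch below. It uses only the Python standard library; the function name prediction_columns and the non-taxonomy column names (GenomeName, Size) are assumptions made for illustration.

```python
import csv
import io
from fnmatch import fnmatchcase


def prediction_columns(header):
    """Return the header names that match the *_prediction pattern, in order."""
    return [name for name in header if fnmatchcase(name, "*_prediction")]


# Hypothetical two-line file; real final_assignments.csv files have more columns.
sample = io.StringIO(
    "GenomeName,class_prediction,genus_prediction,Size\n"
    "contig_1,Caudoviricetes,novel_genus_1_of_Caudoviricetes,45000\n"
)
header = next(csv.reader(sample))
print(prediction_columns(header))
```

Because the match is by pattern rather than by a fixed list, an optional column such as realm_prediction is picked up whenever it is present.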
Transformation Logic¶
Before Transformation¶
novel_genus_1_of_novel_family_2_of_novel_order_3_of_Caudoviricetes
After Transformation¶
Caudoviricetes:NO3:NF2:NG1
Compact Code Format¶
The transformation uses standardized rank codes:
- NK = Novel Kingdom
- NP = Novel Phylum
- NC = Novel Class
- NO = Novel Order
- NF = Novel Family
- NSF = Novel Subfamily
- NG = Novel Genus
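The before/after example and the rank codes above suggest a simple rule: split the verbose string on _of_, map each novel_<rank>_<n> piece to its code, and emit the codes from the highest rank down after the known taxon. The following Python sketch implements that rule under stated assumptions. The function name simplify_prediction is hypothetical, and returning unparseable strings unchanged is an assumed behaviour, not something confirmed for phu.

```python
import re

# Rank codes as listed in this section.
RANK_CODES = {
    "kingdom": "NK", "phylum": "NP", "class": "NC", "order": "NO",
    "family": "NF", "subfamily": "NSF", "genus": "NG",
}

_NOVEL = re.compile(r"novel_([a-z]+)_(\d+)")


def simplify_prediction(value):
    """Convert one verbose prediction string into a compact lineage code."""
    parts = value.split("_of_")
    codes = []
    # Verbose strings list the deepest rank first, so consume from the left.
    while parts:
        match = _NOVEL.fullmatch(parts[0])
        if match is None or match.group(1) not in RANK_CODES:
            break
        rank, number = match.groups()
        codes.append(f"{RANK_CODES[rank]}{number}")
        parts.pop(0)
    # Assumption: leave the value untouched if there is nothing to convert
    # or no known taxon remains to anchor the codes.
    if not codes or not parts:
        return value
    anchor = "_of_".join(parts)
    # Compact codes run from the highest rank down to the deepest.
    return ":".join([anchor, *reversed(codes)])


print(simplify_prediction(
    "novel_genus_1_of_novel_family_2_of_novel_order_3_of_Caudoviricetes"
))
```

Numbers are copied as-is, so a 0 designation such as novel_phylum_0 simply becomes NP0 under this rule.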
Command Options¶
Simplify vContact taxonomy prediction columns into compact lineage codes.
Transforms verbose vContact taxonomy strings like
'novel_genus_1_of_novel_family_2_of_Caudoviricetes' into compact codes like
'Caudoviricetes:NF2:NG1'.
Example:
phu simplify-taxa -i final_assignments.csv -o simplified.csv
--add-lineage
╭─ Options ─────────────────────────────────────────────────────────────────╮
│ * --input-file -i PATH Input vContact final_assignments.csv │
│ [required] │
│ * --output-file -o PATH Output path (.csv or .tsv) [required] │
│ --add-lineage FLAG Append compact_lineage column from deepest │
│ simplified rank │
│ --lineage-col TEXT Name of the lineage column │
│ [default: compact_lineage] │
│ --sep TEXT Override delimiter: ',' or '\t'. │
│ Auto-detected from extension if not set │
│ --help -h Show this message and exit. │
╰───────────────────────────────────────────────────────────────────────────╯
Examples¶
Basic Usage¶
# Simplify CSV
phu simplify-taxa -i final_assignments.csv -o simplified_taxonomy.csv
# Process TSV format with automatic detection
phu simplify-taxa -i final_assignments.tsv -o simplified_taxonomy.tsv
# Override input delimiter detection
phu simplify-taxa -i data.txt -o output.csv --sep "\t"
Advanced Usage¶
# Add compact lineage column with deepest available classification
phu simplify-taxa -i final_assignments.csv -o simplified.csv --add-lineage
# Customize lineage column name
phu simplify-taxa -i final_assignments.csv -o simplified.csv \
--add-lineage --lineage-col "best_taxonomy"
Lineage Column Feature¶
The --add-lineage option creates an additional column containing the deepest (most specific) available taxonomic classification for each sequence.
Priority Order (Most → Least Specific)¶
1. genus_prediction
2. subfamily_prediction
3. family_prediction
4. order_prediction
5. class_prediction
6. phylum_prediction
7. kingdom_prediction
8. realm_prediction
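Selecting the deepest available rank amounts to walking this priority list and taking the first non-empty value. A minimal Python sketch follows; the function name compact_lineage and the treatment of empty or missing cells as "no prediction" are assumptions, since the source does not say how phu represents missing values.

```python
# Priority order from this section, most specific first.
PRIORITY = [
    "genus_prediction", "subfamily_prediction", "family_prediction",
    "order_prediction", "class_prediction", "phylum_prediction",
    "kingdom_prediction", "realm_prediction",
]


def compact_lineage(row, priority=PRIORITY):
    """Return the first non-empty prediction in priority order, else ''."""
    for column in priority:
        # Assumption: missing keys, None, and blank strings all mean "no value".
        value = (row.get(column) or "").strip()
        if value:
            return value
    return ""


row = {
    "genus_prediction": "",
    "family_prediction": "Caudoviricetes:NF2",
    "class_prediction": "Caudoviricetes",
}
print(compact_lineage(row))
```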
Example Output¶
Special Cases Handled¶
Edge Cases for "0" Chains¶
The tool handles vContact3's special "0" designations by carrying the 0 through into the compact code:
# Input
novel_class_0_of_novel_phylum_0_of_novel_kingdom_5_of_Duplodnaviria
# Output
Duplodnaviria:NK5:NP0:NC0
Multiple Candidates¶
When vContact3 provides multiple taxonomic candidates (separated by ||), each is processed independently:
# Input
novel_genus_2_of_novel_family_1_of_Caudoviricetes||novel_genus_4_of_novel_family_3_of_Caudoviricetes
# Output
Caudoviricetes:NF1:NG2||Caudoviricetes:NF3:NG4
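Per-candidate processing can be sketched as a split on ||, a transformation of each piece, and a rejoin with the same separator. In the Python sketch below, simplify_candidates and simplify_one are hypothetical names, and simplify_one is a deliberately small stand-in that assumes well-formed input; it is not phu's implementation.

```python
CODES = {
    "kingdom": "NK", "phylum": "NP", "class": "NC", "order": "NO",
    "family": "NF", "subfamily": "NSF", "genus": "NG",
}


def simplify_one(value):
    """Stand-in simplifier for a single, well-formed candidate string."""
    *novel, anchor = value.split("_of_")
    codes = []
    # Walk from the highest novel rank (rightmost) down to the deepest.
    for part in reversed(novel):
        _, rank, number = part.split("_")
        codes.append(CODES[rank] + number)
    return ":".join([anchor, *codes])


def simplify_candidates(value, simplify=simplify_one):
    """Apply `simplify` to each '||'-separated candidate independently."""
    return "||".join(simplify(part) for part in value.split("||"))


print(simplify_candidates(
    "novel_genus_2_of_novel_family_1_of_Caudoviricetes"
    "||novel_genus_4_of_novel_family_3_of_Caudoviricetes"
))
```

Keeping the split and rejoin separate from the single-candidate logic means a value with no || passes through the same code path as a value with several candidates.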
Quality Assessment¶
After processing, the command provides a summary showing remaining novel_ strings for quality control:
QA Summary:
genus_prediction: 45 remaining 'novel_' strings
family_prediction: 12 remaining 'novel_' strings
order_prediction: 3 remaining 'novel_' strings
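A check of this kind can be reproduced on any table of rows. The Python sketch below counts, per column, how many cells still contain the substring novel_. Whether phu counts cells or individual occurrences is not stated in the source, so counting cells is an assumption here, as are the function name remaining_novel and the sample rows.

```python
def remaining_novel(rows, columns):
    """Count cells per column that still contain an unconverted 'novel_' string."""
    return {
        column: sum("novel_" in (row.get(column) or "") for row in rows)
        for column in columns
    }


# Hypothetical rows: the second genus value has no known taxon to anchor on.
rows = [
    {"genus_prediction": "Caudoviricetes:NF2:NG1", "family_prediction": "Caudoviricetes:NF2"},
    {"genus_prediction": "novel_genus_7", "family_prediction": ""},
]

counts = remaining_novel(rows, ["genus_prediction", "family_prediction"])
print("QA Summary:")
for column, count in counts.items():
    print(f"{column}: {count} remaining 'novel_' strings")
```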
Workflow Integration¶
Typical Bioinformatics Pipeline¶
# 1. Run vContact3 (external)
vcontact3 --nucleotide <viral-genome.fasta> --output-dir <vcontact-output>
# 2. Simplify taxonomy predictions
phu simplify-taxa -i <vcontact-output>/final_assignments.csv \
-o taxonomy_simplified.csv --add-lineage
# 3. Use simplified taxonomy for downstream analysis
# - Phylogenetic visualization
# - Diversity analysis
# - Taxonomic filtering
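As one illustration of the taxonomic filtering step, the Python sketch below keeps the rows whose lineage is rooted at a given taxon. It assumes the file was produced with --add-lineage and the default compact_lineage column name. The GenomeName column, the sample rows, and the helper in_taxon are hypothetical.

```python
import csv
import io


def in_taxon(lineage, taxon):
    """True if any '||'-separated candidate is rooted at `taxon`."""
    return any(candidate.split(":")[0] == taxon for candidate in lineage.split("||"))


# Hypothetical excerpt of a simplified output file.
sample = io.StringIO(
    "GenomeName,compact_lineage\n"
    "contig_1,Caudoviricetes:NF2:NG1\n"
    "contig_2,Duplodnaviria:NK5:NP0:NC0\n"
)

kept = [
    row["GenomeName"]
    for row in csv.DictReader(sample)
    if in_taxon(row["compact_lineage"], "Caudoviricetes")
]
print(kept)
```

Because the compact codes are colon-separated, the same split can be reused to filter on a specific novel group, for example by testing whether "NF2" appears among the pieces.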
Comparison with Manual Processing¶
Output File Structure¶
The output file preserves the original structure while transforming taxonomy columns:
Original columns + Simplified *_prediction columns [+ compact_lineage column]
All non-taxonomy columns remain unchanged, ensuring compatibility with existing workflows.