simplify-taxa¶
Convert vContact taxonomy prediction strings into compact lineage codes for downstream analysis.
Overview¶
phu simplify-taxa transforms verbose vContact3 *_prediction columns into compact, standardized lineage codes (e.g. Caudoviricetes:NF2:NG1) to make taxonomy easier to filter, visualize, and analyze.
Synopsis¶
phu simplify-taxa -i <INPUT_FILE> -o <OUTPUT_FILE> [OPTIONS]
Input accepts CSV or TSV files from vContact3's final_assignments.csv output. Output format is automatically detected from file extension.
Input/Output Formats¶
Supported Formats¶
- Input: CSV, TSV (auto-detected from extension or 
--sepparameter) - Output: CSV, TSV (auto-detected from file extension)
 
Expected Input Columns¶
The command automatically detects and processes any columns matching the pattern *_prediction:
- kingdom_prediction
- phylum_prediction 
- class_prediction
- order_prediction
- family_prediction
- subfamily_prediction
- genus_prediction
- realm_prediction (if present)
Transformation Logic¶
Before Transformation¶
novel_genus_1_of_novel_family_2_of_novel_order_3_of_Caudoviricetes
After Transformation¶
Caudoviricetes:NO3:NF2:NG1
Compact Code Format¶
The transformation uses standardized rank codes:
- NK = Novel Kingdom
- NP = Novel Phylum
- NC = Novel Class
- NO = Novel Order
- NF = Novel Family
- NSF = Novel Subfamily
- NG = Novel Genus
Command Options¶
 Simplify vContact taxonomy prediction columns into compact lineage codes.   
 Transforms verbose vContact taxonomy strings like                           
 'novel_genus_1_of_novel_family_2_of_Caudoviricetes' into compact codes like 
 'Caudoviricetes:NF2:NG1'.                                                   
 Example:                                                                    
     phu simplify-taxa -i final_assignments.csv -o simplified.csv           
     --add-lineage                                                           
╭─ Options ─────────────────────────────────────────────────────────────────╮
│ *  --input-file       -i  PATH  Input vContact final_assignments.csv       │
│                              [required]                                   │
│ *  --output-file      -o  PATH  Output path (.csv or .tsv) [required]      │
│    --add-lineage         FLAG  Append compact_lineage column from deepest │
│                              simplified rank                              │
│    --lineage-col         TEXT  Name of the lineage column                 │
│                              [default: compact_lineage]                   │
│    --sep                 TEXT  Override delimiter: ',' or '\t'.           │
│                              Auto-detected from extension if not set      │
│    --help            -h        Show this message and exit.                │
╰───────────────────────────────────────────────────────────────────────────╯
Examples¶
Basic Usage¶
# Simplify CSV
phu simplify-taxa -i final_assignments.csv -o simplified_taxonomy.csv
# Process TSV format with automatic detection
phu simplify-taxa -i final_assignments.tsv -o simplified_taxonomy.tsv
# Override input delimiter detection
phu simplify-taxa -i data.txt -o output.csv --sep "\t"
Advanced Usage¶
# Add compact lineage column with deepest available classification
phu simplify-taxa -i final_assignments.csv -o simplified.csv --add-lineage
# Customize lineage column name
phu simplify-taxa -i final_assignments.csv -o simplified.csv \
  --add-lineage --lineage-col "best_taxonomy"
Lineage Column Feature¶
The --add-lineage option creates an additional column containing the deepest (most specific) available taxonomic classification for each sequence.
Priority Order (Most → Least Specific)¶
genus_predictionsubfamily_predictionfamily_predictionorder_predictionclass_predictionphylum_predictionkingdom_predictionrealm_prediction
Example Output¶
| Sequence | genus_prediction | family_prediction | compact_lineage | 
|---|---|---|---|
| seq1 | Caudoviricetes:NF2:NG1 | Caudoviricetes:NF2 | Caudoviricetes:NF2:NG1 | 
| seq2 | - | Caudoviricetes:NF5 | Caudoviricetes:NF5 | 
| seq3 | - | - | - | 
Special Cases Handled¶
Edge Cases for "0" Chains¶
The tool correctly handles vContact2's special "0" designation patterns:
# Input
novel_class_0_of_novel_phylum_0_of_novel_kingdom_5_of_Duplodnaviria
# Output  
Duplodnaviria:NK5:NP0:NC0
Multiple Candidates¶
When vContact2 provides multiple taxonomic candidates (separated by ||), each is processed independently:
# Input
Caudoviricetes:NF1:NG2||Caudoviricetes:NF3:NG4
# Output
Caudoviricetes:NF1:NG2||Caudoviricetes:NF3:NG4
Quality Assessment¶
After processing, the command provides a summary showing remaining novel_ strings for quality control:
QA Summary:
  genus_prediction: 45 remaining 'novel_' strings
  family_prediction: 12 remaining 'novel_' strings
  order_prediction: 3 remaining 'novel_' strings
Workflow Integration¶
Typical Bioinformatics Pipeline¶
# 1. Run vContact3 (external)
vcontact3 --nucleotide <viral-genome.fasta> --output-dir <vcontact-output>
# 2. Simplify taxonomy predictions
phu simplify-taxa -i vcontact_output/final_assignments.csv \
  -o taxonomy_simplified.csv --add-lineage
# 3. Use simplified taxonomy for downstream analysis
# - Phylogenetic visualization
# - Diversity analysis  
# - Taxonomic filtering
Comparison with Manual Processing¶
| Task | phu simplify-taxa | Manual Processing | 
|---|---|---|
| Complexity | Single command | Custom scripts/regex | 
| Edge cases | Automatically handled | Error-prone | 
| Consistency | Standardized format | Variable approaches | 
| Speed | Optimized pandas operations | Slower loops | 
| Maintenance | Built-in updates | Manual fixes needed | 
Output File Structure¶
The output file preserves the original structure while transforming taxonomy columns:
Original columns + Simplified *_prediction columns [+ compact_lineage column]
All non-taxonomy columns remain unchanged, ensuring compatibility with existing workflows.