Libraries
Downloading genomes
The first steps of this challenge requires genome data. A very practical way to download data programmatically is to use ncbi-genome-downlad
program. Here a you can see how to use it:
Then using we will download complete genomes from the Pseudomonas syringae and drop them into the Genome folder. In the following line we select ~37 genomes from the organism.
ngd -n -s refseq -F fasta --genera "Pseudomonas syringae" -l complete bacteria
Considering the following 47 assemblies for download:
GCF_004006335.1 Pseudomonas syringae inb918
GCF_014524645.1 Pseudomonas syringae CAS02
GCF_016694755.2 Pseudomonas syringae BIM B-268
GCF_018388485.1 Pseudomonas syringae KF529
GCF_018388505.1 Pseudomonas syringae Susan762
GCF_018388525.1 Pseudomonas syringae U643
GCF_018394375.1 Pseudomonas syringae Susan2139
GCF_023278085.1 Pseudomonas syringae PA-2-9E
GCF_900235815.1 Pseudomonas syringae CFBP3840
GCF_900289125.1 Pseudomonas syringae CFBP 2116
GCF_002905815.2 Pseudomonas syringae pv. syringae Pss9097
GCF_023277945.1 Pseudomonas syringae pv. syringae Pss9644
GCF_900235825.1 Pseudomonas syringae pv. syringae CFBP4215
GCF_900235865.1 Pseudomonas syringae pv. syringae CFBP2118
GCF_002966555.1 Pseudomonas syringae pv. tomato B13-200
GCF_009800225.1 Pseudomonas syringae pv. tomato delta X
GCF_016599655.1 Pseudomonas syringae pv. maculicola MAFF 302723
GCF_001913215.1 Pseudomonas syringae pv. actinidiae ICMP 20586
GCF_001913235.1 Pseudomonas syringae pv. actinidiae NZ-47
GCF_002024285.1 Pseudomonas syringae pv. actinidiae CRAFRU 12.29
GCF_002024305.1 Pseudomonas syringae pv. actinidiae CRAFRU 14.08
GCF_002763655.1 Pseudomonas syringae pv. actinidiae MAFF212063
GCF_003665415.1 Pseudomonas syringae pv. actinidiae P220
GCF_023167685.1 Pseudomonas syringae pv. actinidiae MAFF613020
GCF_022557255.1 Pseudomonas syringae pv. tagetis ICMP 4091
GCF_003047185.1 Pseudomonas syringae pv. atrofaciens LMG5095
GCF_001482725.1 Pseudomonas syringae pv. lapsa ATCC 10859
GCF_000012245.1 Pseudomonas syringae pv. syringae B728a B728a
GCF_000007805.1 Pseudomonas syringae pv. tomato str. DC3000 DC3000
GCF_022557235.1 Pseudomonas syringae pv. helianthi LMG 5067
GCF_900235905.1 Pseudomonas syringae group genomosp. 3 CFBP6411
GCF_900235885.1 Pseudomonas syringae pv. cerasicola CFBP6109
GCF_000145825.2 Pseudomonas syringae Cit 7 Cit 7
GCF_000145845.2 Pseudomonas syringae pv. maculicola str. ES4326 ES4326
GCF_900235835.1 Pseudomonas syringae pv. avii CFBP3846
GCF_000648735.3 Pseudomonas syringae pv. actinidiae ICMP 18884 ICMP 18884
GCF_000344335.2 Pseudomonas syringae pv. actinidiae ICMP 9853 ICMP 9853
GCF_000344355.2 Pseudomonas syringae pv. actinidiae ICMP 18708 ICMP 18708
GCF_000344475.3 Pseudomonas syringae pv. actinidiae str. Shaanxi_M228 Shaanxi_M228
GCF_000988485.1 Pseudomonas syringae pv. syringae B301D B301D
GCF_000988395.1 Pseudomonas syringae pv. syringae HS191 HS191
GCF_001281365.1 Pseudomonas syringae UMAF0158 UMAF0158
GCF_000452705.1 Pseudomonas syringae CC1557 CC1557
GCF_000452605.2 Pseudomonas syringae CC440 CC440
GCF_000452565.2 Pseudomonas syringae UB303 UB303
GCF_000452525.3 Pseudomonas syringae USA011 USA011
GCF_000452445.2 Pseudomonas syringae pv. pisi str. PP1 PP1
ngd -s refseq\
-F fasta\
--genera "Pseudomonas syringae"\
-l complete\
--flat-output bacteria\
-o Data/Genomes
gzip -d Data/Genomes/*
Filenames cleaning
python ~/Programs/Bioinf_tools/bit-dedupe-fasta-headers.py -h
usage: bit-dedupe-fasta-headers.py [-h] -i INPUT_FASTA [-o OUTPUT_FASTA_NAME]
This script will append a number to headers if that exact ID has already
appeared in the fasta file. For version info, run `bit-version`.
optional arguments:
-h, --help show this help message and exit
-o OUTPUT_FASTA_NAME, --output_fasta_name OUTPUT_FASTA_NAME
Output fasta file (default: "Renamed.fasta").
required arguments:
-i INPUT_FASTA, --input_fasta INPUT_FASTA
Starting fasta file
for F in Genomes/*.fna ; do
N=$(basename $F .fna) ;
python /Users/camilogarcia/Programs/Bioinf_tools/bit-dedupe-fasta-headers.py -i $F -o $N_renamed.mfa ;
done
for F in Genomes/*.fna; do
mv -- "$F" \
"$(awk 'NR==1{printf("%s_%s_%s\n",$2,$3,substr($1,2));exit}' "$F")".fna
done
K-mers scanning on genomes
Script help & parallel execution
python ~/Projects/Biolibrary/kmer_count.py -h
usage: kmer_count.py [-h] [-k K] input output
Count the frecuency of a k-mer set given its size (k) along a genome or a set
of genomes
positional arguments:
input Path to input fasta file
output Path to put file/folder output
optional arguments:
-h, --help show this help message and exit
-k K size of k-mer (default: 1)
l ~/Projects/Random/Genomes/Pseudomonas_syringae_NZ* | parallel "python kmer_count.py -k 2 {} {/.}.csv"
Pseudomonas_syringae_NZ_CP005969.1_9mers.csv
Pseudomonas_syringae_NZ_CP005970.1_9mers.csv
Pseudomonas_syringae_NZ_CP006256.1_9mers.csv
Pseudomonas_syringae_NZ_CP007014.1_9mers.csv
Pseudomonas_syringae_NZ_CP011972.2_9mers.csv
Pseudomonas_syringae_NZ_CP012179.1_9mers.csv
Pseudomonas_syringae_NZ_CP013183.1_9mers.csv
Pseudomonas_syringae_NZ_CP017007.1_9mers.csv
Pseudomonas_syringae_NZ_CP017009.1_9mers.csv
Pseudomonas_syringae_NZ_CP018202.1_9mers.csv
Pseudomonas_syringae_NZ_CP019730.1_9mers.csv
Pseudomonas_syringae_NZ_CP019732.1_9mers.csv
Pseudomonas_syringae_NZ_CP019871.1_9mers.csv
Pseudomonas_syringae_NZ_CP024646.1_9mers.csv
Pseudomonas_syringae_NZ_CP024712.1_9mers.csv
Pseudomonas_syringae_NZ_CP026568.1_9mers.csv
Pseudomonas_syringae_NZ_CP028490.1_9mers.csv
Pseudomonas_syringae_NZ_CP032459.1_9mers.csv
Pseudomonas_syringae_NZ_CP032631.1_9mers.csv
Pseudomonas_syringae_NZ_CP032871.1_9mers.csv
Pseudomonas_syringae_NZ_CP034078.1_9mers.csv
Pseudomonas_syringae_NZ_CP045799.1_9mers.csv
Pseudomonas_syringae_NZ_CP047073.1_9mers.csv
Pseudomonas_syringae_NZ_CP047260.1_9mers.csv
Pseudomonas_syringae_NZ_CP047267.1_9mers.csv
Pseudomonas_syringae_NZ_CP067024.1_9mers.csv
Pseudomonas_syringae_NZ_CP068034.1_9mers.csv
Pseudomonas_syringae_NZ_LT962480.1_9mers.csv
Pseudomonas_syringae_NZ_LT962481.1_9mers.csv
Pseudomonas_syringae_NZ_LT963391.1_9mers.csv
Pseudomonas_syringae_NZ_LT963402.1_9mers.csv
Pseudomonas_syringae_NZ_LT963408.1_9mers.csv
Pseudomonas_syringae_NZ_LT963409.1_9mers.csv
Pseudomonas_syringae_NZ_LT985192.1_9mers.csv
Importing & cleaning
# posts/2021-01-25-Kmer-analysis/
files <- dir_ls("data/Kmers")
dataset <- files |>
map_df(read_csv)
dataset
# A tibble: 20,709,376 × 3
...1 kmer kmer_value
<dbl> <chr> <dbl>
1 0 AAAAAAAAA 5
2 1 AAAAAAAAC 21
3 2 AAAAAAAAT 17
4 3 AAAAAAAAG 17
5 4 AAAAAAACA 23
6 5 AAAAAAACC 44
7 6 AAAAAAACT 15
8 7 AAAAAAACG 29
9 8 AAAAAAATA 26
10 9 AAAAAAATC 40
# … with 20,709,366 more rows
total <- dataset |>
mutate(kmer = as_factor(kmer)) |>
group_by(kmer) |>
summarise(total = sum(kmer_value))
summary(total) |>
kable()
kmer | total | |
---|---|---|
AAAAAAAAA: 1 | Min. : 3 | |
AAAAAAAAC: 1 | 1st Qu.: 235 | |
AAAAAAAAT: 1 | Median : 509 | |
AAAAAAAAG: 1 | Mean : 819 | |
AAAAAAACA: 1 | 3rd Qu.: 1047 | |
AAAAAAACC: 1 | Max. :15920 | |
(Other) :262138 | NA |
K-mers counts frequency plot
kmers_plot <- total |>
ggplot(aes(total)) +
geom_histogram(stat = "count", color = "#8A8A8A") +
geom_label_repel(aes(label = paste(kmer, "appears", max(total), "\n times in all genomes"), y = 120), data = . %>% filter(total == max(total)), min.segment.length = 0, segment.colour = "black") +
geom_vline(xintercept = mean(total$total), linetype = "dashed", color = "#C60000") +
annotate("label", label = expression(paste(mu, "=", 818)), x = 1100, y = 350, size = 4, colour = "#C60000") +
scale_x_continuous(guide = "prism_offset_minor", trans = "log10", limits = c(1, 18000), n.breaks = 10) +
theme(
axis.line.x = element_line(color = "black")
) +
labs(
x = "K-mer value [n]",
y = "Absolute frequency [n]"
) +
coord_cartesian(expand = F)
kmers_plot_intervals <- total |>
mutate(intervals = cut(total, breaks = 19)) |>
ggplot(aes(y = intervals)) +
geom_histogram(stat = "count") +
scale_x_continuous(guide = "prism_offset_minor") +
theme(
axis.line.x = element_line(color = "white")
) +
labs(
x = "Absolute frequency [n]",
y = ""
) +
coord_cartesian(expand = F)
# ggsave("kmers_freq_intervals_01.png", width = 10, height = 10, dpi = 400)
kmers_plot / kmers_plot_intervals
# ggsave("kmers_freq_plots_01.png", width = 10, height = 12, dpi = 400)
Citation
BibTeX citation:
@online{garcía-botero2021,
author = {García-Botero, Camilo},
title = {K-Mer Counts on Genomes},
date = {2021-02-25},
langid = {en}
}
For attribution, please cite this work as:
García-Botero, Camilo. 2021. “K-Mer Counts on Genomes.”
February 25, 2021.