Microbes made me do it

Searching for shared sequence between Mycobacterium tuberculosis and Homo sapiens

2023-06-21T00:00:00+00:00

Motivation
Shared k-mer content
Aligning reads
Summary
References

Motivation

We are in the early stages of planning a Mycobacterium tuberculosis (MTB) analysis pipeline for a research project in Papua New Guinea. We’ll be sequencing sputum samples with Oxford Nanopore Technologies (ONT) devices and were thinking of different ways of decontaminating the data - i.e. removing anything non-MTB. Sputum samples typically contain a lot of host (human) reads, along with reads from a variety of bacteria, and the MTB component is usually quite small¹. One component of this pipeline will be to upload sequencing reads to a remote/cloud server, so any reduction in file size will make uploads faster. As human reads are not used in any analysis steps, and will need to be removed prior to making any data available, we thought we could simplify things by removing human data as the first step. Our idea was to align reads to the human genome and just remove anything that aligns. However, one concern with this approach was whether any MTB reads could be lost in the process. This effectively boils down to the question: Do Mycobacterium tuberculosis and Homo sapiens share genomic sequence? After a literature search, I was unable to find an answer - which seemed quite surprising. My suspicion is that most people just assume they do not. (Or my literature searching skills are poor.) So let’s take a look.

Shared k-mer content

The first thing I thought to check was whether there are shared k-mers between the two reference genomes for MTB and human. As an aside, after struggling to install/run multiple tools for this job I wrote a simple Rust program - skc - to do this comparison.
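To make the idea of "shared k-mers" concrete, here is a minimal Python sketch. It is not how skc is implemented; it assumes k-mers are compared in canonical form (the lexicographically smaller of a k-mer and its reverse complement) and that k-mers containing N are skipped. The function names are made up for illustration, and holding whole k-mer sets in memory like this would not scale to a human genome.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


def canonical_kmers(seq, k):
    """Collect the set of canonical k-mers in seq (assumption: skip any k-mer with an N)."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" in kmer:
            continue
        # canonical form: the smaller of the k-mer and its reverse complement
        kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def shared_kmers(seq_a, seq_b, k):
    """Return the canonical k-mers present in both sequences."""
    return canonical_kmers(seq_a, k) & canonical_kmers(seq_b, k)


# toy example: the second sequence is the reverse complement of the first
print(len(shared_kmers("AAACCC", "GGGTTT", 3)))
```

Using canonical k-mers means a match on either strand is counted, which is what we want when asking whether two genomes share sequence.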

The human genome used is the Telomere-to-Telomere (T2T) Consortium CHM13 v2.0 assembly (accession: GCA_009914755.4)². The MTB reference genome used is H37Rv (accession: NC_000962.3)³. In addition to the CHM13 human genome, I looked at the shared k-mer content between MTB and a collection of other closely- and distantly-related genomes to give some background expectations. The other genomes are:

The previous human reference genome GRCh38.p14 (hg38)
The Mus musculus (mouse) reference genome GRCm39 (mm39)
The Arabidopsis thaliana (thale cress) reference genome TAIR10.1
The Human immunodeficiency virus 1 (HIV-1) reference genome NC_001802.1
The Escherichia coli strain K-12 substr. MG1655 reference genome ASM584v2
The Mycobacterium avium subsp. hominissuis strain OCU889s_P11_4s reference genome NZ_CP018019.1

I ran skc with all k from 13 to 31 and plotted the number of shared k-mers at each k for each of the genomes listed above.

From this figure we can see that the largest shared content we get between MTB and human (CHM13) is 29-mers - for which there are 2. Interestingly, there is only one match between hg38 and MTB - meaning one of the matches with CHM13 is in new sequence generated by the T2T consortium. The two matches between MTB and CHM13 are

NC_000962.3:2357258, which is in the PE_PGRS36 gene, and chr20:44924007 which begins at the last base of the PTPRT-207 gene.
NC_000962.3:837317, which is in the PE_PGRS9 gene, and chrX:86236022, which is in RP6-43L17.2 (mitochondrial ribosomal protein S22 pseudogene 1)

Both of these 29-mers also match in soft-masked regions of the CHM13 assembly - indicating they’re likely in repeats discovered by the T2T team.

Unsurprisingly, the most 31-mer matches were with M. avium, followed by E. coli. There are also 46 31-mers that match in the A. thaliana genome, which I was quite surprised about initially. But on further inspection, those hits are in the 16S rRNA of the chloroplast.

Next, I looked at the GC content distribution of the matching k-mers between MTB and CHM13. This was to convince myself these matches with the human genome were likely due to chance given they come from repetitive regions in both genomes.

The GC content of the MTB genome is ~65%. However, we see that the bulk of the shared k-mers (78%) have a GC content over 65% - with 46% being over 90%. Importantly, our two 29-mers have GC content of 93.1% and 96.6%.
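As a quick check on those last two figures: a 29-mer with 27 G/C bases has a GC content of 27/29 ≈ 93.1%, and one with 28 has 28/29 ≈ 96.6%. A small Python sketch of the calculation follows. The two sequences are made-up placeholders with the right base counts, not the actual shared 29-mers.

```python
def gc_percent(kmer):
    """GC content of a k-mer as a percentage of its length."""
    kmer = kmer.upper()
    gc = kmer.count("G") + kmer.count("C")
    return 100 * gc / len(kmer)


# hypothetical 29-mers with 27 and 28 G/C bases respectively
kmer_27 = "GC" * 13 + "GAT"
kmer_28 = "GC" * 14 + "A"

print(round(gc_percent(kmer_27), 1))
print(round(gc_percent(kmer_28), 1))
```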

Aligning reads

Given the whole point of this analysis was to see if I would lose MTB reads when aligning a sputum sample to just the human genome, it’s fair to argue I should have just tested that approach and been done with it. But I can be a little paranoid so I looked at shared k-mers first because…well why not?

My main interest for this project was ONT data, so I simulated ONT reads from the MTB reference genome (H37Rv) to 5x depth with Badread⁴. I then aligned these reads to CHM13 with minimap2⁵ (-x map-ont), but got no alignments. As the shared k-mers I observed were found in repetitive regions, I also tried aligning the MTB reads to CHM13 using Winnowmap⁶, which is designed for aligning long reads to repetitive reference sequences. Still no alignments.

As an additional analysis, as I assume this will be of interest to others, I also checked the alignment of Illumina reads. I simulated paired MTB reads from H37Rv with ART⁷ to a depth of 20x from a HiSeq 2500 (-ss HS25) and MiSeq v3 (-ss MSv3). I aligned these to CHM13 with minimap2 (-x sr) and got a very small number of alignments - 46 reads for HiSeq 2500 and 32 reads from MiSeq v3 (none of these alignments were near where the 29-mer matches are).
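For anyone wanting to count aligned reads themselves, here is a rough Python sketch. It assumes the alignments are in PAF format (what minimap2 writes when -a is not given), where the first tab-separated column is the query (read) name. The example lines and read names are invented for illustration, and this is not necessarily how I counted them.

```python
def aligned_read_names(paf_lines):
    """Return the set of read names that have at least one alignment in the PAF lines."""
    names = set()
    for line in paf_lines:
        line = line.strip()
        if not line:
            continue
        # PAF column 1 is the query sequence name
        names.add(line.split("\t")[0])
    return names


def drop_aligned(reads, aligned):
    """Keep only (name, sequence) pairs whose name is not in the aligned set."""
    return [(name, seq) for name, seq in reads if name not in aligned]


# hypothetical PAF records: read1 has two alignments, read2 has one
paf = [
    "read1\t150\t0\t150\t+\tchr1\t248387328\t100\t250\t140\t150\t60",
    "read1\t150\t0\t150\t-\tchr2\t242696752\t500\t650\t120\t150\t0",
    "read2\t150\t10\t140\t+\tchrX\t154259566\t900\t1030\t110\t130\t12",
]
reads = [("read1", "ACGT"), ("read2", "GGCC"), ("read3", "TTAA")]

aligned = aligned_read_names(paf)
print(len(aligned))
print(drop_aligned(reads, aligned))
```

Counting distinct read names, rather than lines, avoids counting a read twice when it has more than one alignment.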

Summary

For my part, I’m pretty happy to conclude that aligning ONT sputum data to the human genome will not remove any MTB reads. Doing the same for Illumina data will result in a negligible number of MTB reads being lost. While there are some shared k-mers between MTB and the human genome, these are likely repeat artifacts. I’m going to boldly conclude that there is no shared sequence between M. tuberculosis and Homo sapiens - at least nothing that is evolutionarily meaningful. I would love to be proven wrong though.

References

1. Nilgiriwala K, Rabodoarivelo M-S, Hall MB, Patel G, Mandal A, Mishra S, et al. Genomic sequencing from sputum for tuberculosis disease diagnosis, lineage determination, and drug susceptibility prediction. J Clin Microbiol. 2023;61: e0157822. doi:10.1128/jcm.01578-22
2. Rhie A, Nurk S, Cechova M, Hoyt SJ, Taylor DJ, Altemose N, et al. The complete sequence of a human Y chromosome. bioRxiv. 2022. doi:10.1101/2022.12.01.518724
3. Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, Harris D, et al. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature. 1998;393: 537–544. doi:10.1038/31159
4. Wick R. Badread: simulation of error-prone long reads. J Open Source Softw. 2019;4: 1316. doi:10.21105/joss.01316
5. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34: 3094–3100. doi:10.1093/bioinformatics/bty191
6. Jain C, Rhie A, Hansen NF, Koren S, Phillippy AM. Long-read mapping to repetitive reference sequences using Winnowmap2. Nat Methods. 2022;19: 705–710. doi:10.1038/s41592-022-01457-8
7. Huang W, Li L, Myers JR, Marth GT. ART: a next-generation sequencing read simulator. Bioinformatics. 2012;28: 593–594. doi:10.1093/bioinformatics/btr708

You can cite this post as

Hall, Michael B. Searching for shared sequence between Mycobacterium tuberculosis and Homo sapiens. Zenodo; 2023. doi:10.5281/zenodo.8068147

Cheap Parallelisation

2020-06-22T00:00:00+00:00

What is fd?
Baby steps: finding our files
Constructing execution commands
Putting it all together: Parallel MSA
Benchmark
- Results
- Conclusion
Final Remarks

Motivation

I was recently creating a snakemake pipeline and needed to write a rule/process that would perform a multiple sequence alignment (MSA) on 2,582 fasta files. Usually, it is easy to parallelise this kind of task using snakemake. To cut a long story short, using snakemake to parallelise across the files was not feasible in this case. I knew there were ways of doing this kind of thing with tools such as parallel, xargs, and find, but I had never really invested the time to get comfortable with them. This post is an attempt to document that process using one of my favourite CLI tools: fd. We’ll see how fd can be used to execute multiple MSAs (with MAFFT) simultaneously, and benchmark how much faster it is than a conventional “synchronous” approach.

What is `fd`?

Quoting from its GitHub repository:

fd is a simple, fast and user-friendly alternative to find.

and I would certainly agree with that. For the sake of brevity, I won’t dive into the full range of what fd does, but please take a minute to have a quick look at the README before reading on any further.

For the purposes of this post, the functionality of fd we are most interested in is -x, --exec and -j, --threads.

    -x, --exec <cmd>
            Execute a command for each search result.
            All arguments following --exec are taken to be arguments to the command until the argument ';' is encountered.
            Each occurrence of the following placeholders is substituted by a path derived from the current search result before the command is
            executed:
              '{}':   path
              '{/}':  basename
              '{//}': parent directory
              '{.}':  path without file extension
              '{/.}': basename without file extension
              
    -j, --threads <num>
            Set number of threads to use for searching & executing (default: number of available CPU cores)

Using --exec and --threads together will allow us to run a given command on however many threads we like.
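If the placeholders are hard to keep straight, the following Python sketch mimics what each one expands to for a given path, based on the help text above. It is only an illustration of the substitution and not fd's actual implementation. The function name is made up, and returning "." as the parent of a bare file name is my assumption.

```python
import os.path


def expand_placeholders(template, path):
    """Expand fd-style placeholders in template for a single search result."""
    basename = os.path.basename(path)
    substitutions = {
        "{/.}": os.path.splitext(basename)[0],  # basename without file extension
        "{//}": os.path.dirname(path) or ".",   # parent directory (assumed "." if none)
        "{/}": basename,                        # basename
        "{.}": os.path.splitext(path)[0],       # path without file extension
        "{}": path,                             # path
    }
    # naive sequential replacement; a path that itself contains a placeholder
    # token would confuse this sketch
    for token, value in substitutions.items():
        template = template.replace(token, value)
    return template


for token in ("{}", "{/}", "{//}", "{.}", "{/.}"):
    print(token, "->", expand_placeholders(token, "fd_example/f1.fa"))
```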

Baby steps: finding our files

The first thing we need to do before executing a command on multiple threads is to find the files we want to operate on. The toy dataset I’ll be using is eight small fasta files. If you’re going to follow along at home, you can download the data from here.

The directory tree we are working with is

.
├── fd_example
│  ├── f1.fa
│  ├── f2.fa
│  ├── f3.fa
│  ├── f4.fa
│  ├── f5.fa
│  ├── f6.fa
│  ├── f7.fa
│  └── f8.fa
└── msas

To find the fasta files with fd, we can use the following command

$ fd --extension fa . fd_example
fd_example/f1.fa
fd_example/f2.fa
fd_example/f3.fa
fd_example/f4.fa
fd_example/f5.fa
fd_example/f6.fa
fd_example/f7.fa
fd_example/f8.fa

--extension tells fd we want to filter the search results by the given extension - fa in this case. (I’ll use the abbreviated -e for this in subsequent commands). Next, we tell fd the pattern we are looking for is . - i.e. anything. Lastly, we want to search (recursively) under the fd_example directory.

An alternate solution for the above would be skipping the use of --extension and using the pattern on its own

$ fd '\.fa$' fd_example

The pattern is a regular expression this time that says “anything ending in .fa”.
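To see what that pattern does and does not match, here is the same regular expression in Python's re module. This is only an illustration. fd uses its own regex engine, and as far as I know it matches against the file name rather than the full path by default, but for a pattern this simple the behaviour should be the same.

```python
import re

# the dot is escaped so it matches a literal "."; "$" anchors to the end of the name
FASTA_SUFFIX = re.compile(r"\.fa$")


def has_fa_extension(name):
    """Return True if the file name ends in .fa"""
    return FASTA_SUFFIX.search(name) is not None


for name in ("f1.fa", "f1.fasta", "sofa"):
    print(name, has_fa_extension(name))
```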

Personally, I like to make commands like this as easy as possible for others, and myself in the future, to understand. So I’ll stick with the --extension method.

Constructing execution commands

Now that we have a list of file paths, we need to do something with them. From earlier, we can see that --exec takes a command, <cmd>. The fd docs have examples of different conventions for writing these commands. This command will be executed for each file path that fd finds.

Simple: counting characters in files

We’ll start with something easy like counting characters in each file using wc. wc is a command-line utility that counts lines, words, and bytes (characters) in files.

$ fd -e fa --exec wc -c '{}' \; . fd_example
fd_example/f5.fa
fd_example/f8.fa
fd_example/f1.fa
fd_example/f2.fa
fd_example/f7.fa
fd_example/f3.fa
fd_example/f6.fa
fd_example/f4.fa

Let’s break down the juicy bit - --exec wc -c '{}' \;. We tell fd to execute the command wc -c on {}. If we refresh our memories on the --exec docs, {} refers to the path returned by fd (note: we add quotes around {} to prevent more complex paths failing). The \; bit at the end just tells fd where the execution command ends. Anything else after this is for fd, or the shell, to deal with.

An even more concise version of this command would be

$ fd -e fa -x wc -c \; . fd_example

as fd appends the path {} to the end of the command if we don’t add it ourselves.

Ok, so that was pretty easy. Let’s look at something a little more complicated.

Intermediate: changing file extensions

As a (somewhat contrived) second example, we’ll use the same approach to change the file extensions on all our fasta files from .fa to .fasta.

$ fd -e fa --exec mv '{}' '{.}'.fasta \; . fd_example
$ ls fd_example
f1.fasta  f2.fasta  f3.fasta  f4.fasta  f5.fasta  f6.fasta  f7.fasta  f8.fasta

The only new thing here is the use of '{.}'. Again, from the help menu, {.} gives us the path without the file extension. So fd_example/f1.fa becomes fd_example/f1.

Advanced: redirecting stdout to a file within the command

Trying to redirect the standard output of a command to a file turns out to be a little more involved. For example, going with what we’ve seen so far, if we wanted to get the sequence identifier for each sequence in a fasta file and write them to file, our initial attempt might look like this

$ fd -e fa \
    --exec rg -P '^>(?P<id>[^\s]+)\s.*' --replace '$id' > '{/.}'.ids \; \
    . fd_example

In this command, we use ripgrep (rg) to grab the identifiers for the file and write them out to a file called '{/.}'.ids. The help menu for fd tells us that {/.} is the path basename without the file extension. So fd_example/f1.fa becomes f1.ids. However, when we run this, we get an error message like zsh: no such file or directory: {/.}.ids. What is effectively happening here is that before this command gets passed to fd, it is first interpreted by the shell. The shell thinks we are trying to write the output from everything up until the redirection operator > (i.e. fd -e fa --exec rg -P '^>(?P<id>[^\s]+)\s.*' --replace '$id') to a file called '{/.}'.ids \; . fd_example. The shell is correct, there is no file called {/.}.ids - this syntax is only valid inside an fd --exec command.

We can fix this in one of two ways: inline bash command execution or a shell script.

Inline bash command

This option is pretty ugly - unless you’re into one-liners…

$ fd -e fa \
    --exec sh -c "rg -P '^>(?P<id>[^\s]+)\s.*' --replace '\$id' '{}' > '{/.}'.ids" \; \
    . fd_example

As you can see, we need to invoke an inline shell command with sh -c. Part of the awkwardness here is also having to escape some characters to prevent the shell from trying to evaluate them, amongst other things. These inline commands can get very cumbersome very fast!

Shell script

Recommended. For complex examples, such as the one we are working with, I would suggest using this approach. The fd command itself does become slightly more obscure, but if you name the script accordingly, it should still be self-explanatory.

To replicate the example above, we create a script called extract_seq_id.sh

#!/usr/bin/env sh
rg -P '^>(?P<id>[^\s]+)\s.*' --replace '$id' "$1" > "$2"

and then execute the script with fd.

$ fd -e fa \
   --exec sh extract_seq_id.sh '{}' '{/.}'.ids \; \
   . fd_example

Much neater!

If we take a look inside one of the output files, we should see the identifiers

$ cat f1.ids
fadD23+Rv3827c+Rv3828c
76c33157-d262-467c-960b-c21f8fa16991
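For comparison, the same identifier extraction can be sketched in Python. This is an alternative to the rg command, not a translation of it: the pattern below also accepts header lines with no description after the identifier, whereas the rg pattern above requires whitespace after it. The description text in the example records is made up.

```python
import re

# identifier = everything after ">" up to the first whitespace character
HEADER = re.compile(r"^>(?P<id>\S+)")


def sequence_ids(fasta_lines):
    """Return the sequence identifiers from the header lines of a fasta file."""
    ids = []
    for line in fasta_lines:
        match = HEADER.match(line)
        if match:
            ids.append(match.group("id"))
    return ids


# hypothetical fasta content using the identifiers shown above
fasta = [
    ">fadD23+Rv3827c+Rv3828c some description",
    "ACGTACGT",
    ">76c33157-d262-467c-960b-c21f8fa16991",
    "GGCCGGCC",
]
print(sequence_ids(fasta))
```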

Putting it all together: Parallel MSA

Ok, armed with our fd --exec knowledge, let’s parallelise multiple sequence alignment on our samples and benchmark how much of a difference throwing more threads at fd makes.

To perform the MSA, we are going to use MAFFT. MAFFT writes its output to standard output. Luckily we know how to deal with this 😎. We’ll write a script and then execute that script with fd.

msa.sh

#!/usr/bin/env sh
mafft --thread 1 --auto "$1" > "$2"

Note: we are only specifying one thread for benchmarking purposes.

Our fd command will then be

$ fd -e fa --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . fd_example

Our directory tree now looks like

.
├── fd_example
│  ├── f1.fa
│  ├── f2.fa
│  ├── f3.fa
│  ├── f4.fa
│  ├── f5.fa
│  ├── f6.fa
│  ├── f7.fa
│  └── f8.fa
├── msa.sh
└── msas
   ├── f1.msa.fa
   ├── f2.msa.fa
   ├── f3.msa.fa
   ├── f4.msa.fa
   ├── f5.msa.fa
   ├── f6.msa.fa
   ├── f7.msa.fa
   └── f8.msa.fa

Benchmark

How much of a difference do more threads make? We’ll benchmark various numbers of threads using the tool hyperfine and its neat parameter scan functionality. The data used in this benchmarking is a little different to the toy dataset. I have restricted the files to those above 9kb in size (80 files in total) to try and see the full effects of multiple threads. The 100 threads option is just a roundabout way of saying “however many threads you can find.” The machine I ran this benchmark on has 16.

$ hyperfine -L threads 1,2,4,8,100 --export-markdown results.md \
    "fd -j {threads} -S +9k -e fa --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data"

Results

Benchmark #1: fd -j 1 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/
  Time (mean ± σ):     16.708 s ±  0.475 s    [User: 13.742 s, System: 6.550 s]
  Range (min … max):   15.972 s … 17.666 s    10 runs

Benchmark #2: fd -j 2 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/
  Time (mean ± σ):     10.003 s ±  0.064 s    [User: 15.850 s, System: 7.498 s]
  Range (min … max):    9.896 s … 10.115 s    10 runs

Benchmark #3: fd -j 4 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/
  Time (mean ± σ):      7.232 s ±  0.331 s    [User: 20.556 s, System: 10.231 s]
  Range (min … max):    6.943 s …  7.751 s    10 runs

Benchmark #4: fd -j 8 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/
  Time (mean ± σ):      6.118 s ±  0.233 s    [User: 22.905 s, System: 11.137 s]
  Range (min … max):    5.461 s …  6.250 s    10 runs

Benchmark #5: fd -j 100 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/
  Time (mean ± σ):      5.913 s ±  0.028 s    [User: 24.093 s, System: 11.508 s]
  Range (min … max):    5.880 s …  5.954 s    10 runs

Summary
  'fd -j 100 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/' ran
    1.03 ± 0.04 times faster than 'fd -j 8 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/'
    1.22 ± 0.06 times faster than 'fd -j 4 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/'
    1.69 ± 0.01 times faster than 'fd -j 2 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/'
    2.83 ± 0.08 times faster than 'fd -j 1 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/'

Command	Mean [s]	Min [s]	Max [s]	Relative
`fd -j 1 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/`	16.708 ± 0.475	15.972	17.666	2.83 ± 0.08
`fd -j 2 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/`	10.003 ± 0.064	9.896	10.115	1.69 ± 0.01
`fd -j 4 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/`	7.232 ± 0.331	6.943	7.751	1.22 ± 0.06
`fd -j 8 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/`	6.118 ± 0.233	5.461	6.250	1.03 ± 0.04
`fd -j 100 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/`	5.913 ± 0.028	5.880	5.954	1.00
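The Relative column is expressed against the fastest run. It can be easier to read the numbers the other way around, as a speedup over the single-threaded run. A short Python sketch using the mean times from the table above (the function name is just for illustration):

```python
# mean wall-clock time in seconds for each value of -j, copied from the results above
mean_seconds = {1: 16.708, 2: 10.003, 4: 7.232, 8: 6.118, 100: 5.913}


def speedups(means, baseline=1):
    """Speedup of each thread count relative to the baseline thread count."""
    base = means[baseline]
    return {threads: base / seconds for threads, seconds in means.items()}


for threads, speedup in speedups(mean_seconds).items():
    print(f"-j {threads}: {speedup:.2f}x")
```

Going from 1 to 2 threads gives roughly a 1.67x speedup, but the gain tails off quickly after that, which fits with the files being small.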

Conclusion

The benchmark I have run here is a little contrived and doesn’t really reflect the full benefit of multiple threads in fd. The fasta files I am performing MSA on are quite small/simple, so the benefit of multiple threads isn’t quite as drastic as it would be for more complex alignments. Part of the problem with benchmarking on more complex files is that the single-threaded commands run far too slowly, and this is not intended to be a full-scale benchmarking post. For a real-world (anecdotal) example, it took ~3.5 minutes to run MAFFT on my 2,582 fasta files using 16 threads. You can see an example of how I embedded this in a snakemake pipeline here.

Final Remarks

I hope this post was insightful and useful. In the future, I hope to write more posts like this one (provided people find this one helpful) on whatever comes up in my work that I think might be interesting to others. In the meantime, feel free to get in touch if you have any questions or comments or complaints about this post.

Benchmarking Guppy algorithms

2019-02-01T00:00:00+00:00

Methods
Results
Conclusions
Supplementary code

ONT’s basecaller Guppy has recently been released to the masses. And with the announcement of the new “flip-flop” basecalling algorithm there is now the choice of two different algorithms for basecalling.

ONT has obviously been singing flip-flop’s praises, and understandably so, as the initial results look like a decent step up in read accuracy.

For an upcoming project I am going to be doing a lot of basecalling of Mycobacterium tuberculosis and given the project will involve assessing metrics heavily reliant on read accuracy I thought it best to invest some time in deciding which algorithm to go with. Another reason for my indecision came when I read a recent blog from Keith Robison which showed that maybe the new flip-flop algorithm doesn’t work well with organisms that have a higher GC content.

As M. tuberculosis has a GC content around 65% I thought it best to do a little benchmarking of the two basecalling algorithms first. Unfortunately for me, I couldn’t really rely on the results from Ryan Wick’s wonderful basecalling comparison due to the species he used, E. coli, having a roughly even GC content.

Note: Just before publishing this post Ryan released an updated version of the comparison as a preprint. In the test set there was one bacterium, Stenotrophomonas maltophilia, with a GC content similar to M. tuberculosis. Figure 2 in that paper shows flip-flop as having a higher read identity than the default Guppy algorithm.

What I will do here is walk through a small-scale basecalling algorithm comparison of the default Guppy algorithm and the flip-flop algorithm that comes as a config option with Guppy.

The data I am using to run this analysis was sequenced on an R9.4.1 flowcell. It was also a multiplexed run with 5 clinical samples of M. tuberculosis.

I’ll add in some code snippets for how I ran this analysis so you can recreate at home with your own data too. If you aren’t interested and just want to see some results then feel free to skip ahead.

Methods

Basecall

The only thing we need to change in order to use the flip-flop algorithm is to change the config file used.

Default config

cd Guppy_testing/normal
input=../fast5
output=basecalled_fastq/

guppy_basecaller --input_path "$input" \
    --save_path "$output" \
    --recursive \
    --verbose_logs \
    --worker_threads 32 \
    --config dna_r9.4.1_450bps.cfg

Basecalling took 120077.25 CPU seconds. As there are 1009917 reads total, that is approximately 505 reads/min.

Flip-flop config

cd Guppy_testing/flipflop
input=../fast5
output=basecalled_fastq/

guppy_basecaller --input_path "$input" \
    --save_path "$output" \
    --recursive \
    --verbose_logs \
    --worker_threads 32 \
    --config dna_r9.4.1_450bps_flipflop.cfg

Basecalling took 4051443 CPU seconds. As there are 1009917 reads total, that is approximately 15 reads/min.

At the time of writing this, I have not been able to run Guppy on the GPUs here. But once I have done that I will add the runtime figures for that too.

In terms of wall clock time, I ran the default config on 32 cores and it completed in 4.33 hours. For the flip-flop, I also ran it on 32 cores and it completed in 35.33 hours.
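The arithmetic behind those throughput figures is simple enough to check. A small Python sketch using the numbers quoted above (the function name is made up):

```python
def reads_per_minute(n_reads, seconds):
    """Throughput in reads per minute."""
    return n_reads / (seconds / 60)


n_reads = 1009917
default_cpu_seconds = 120077.25
flipflop_cpu_seconds = 4051443

print(round(reads_per_minute(n_reads, default_cpu_seconds)))
print(round(reads_per_minute(n_reads, flipflop_cpu_seconds)))

# how much slower flip-flop is, by CPU time and by wall clock time
print(round(flipflop_cpu_seconds / default_cpu_seconds, 1))
print(round(35.33 / 4.33, 1))
```

So flip-flop used roughly 34 times the CPU time of the default config, and took roughly 8 times as long by the wall clock.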

Barcode demultiplexing

As this is a 5x multiplexed sample I chose to use Ryan Wick’s Deepbinner tool for demultiplexing. From the results in the Deepbinner paper, and from my own personal testing, Deepbinner saves a lot more reads from the dreaded “unknown” bin.

Deepbinner classification

fast5_dir=../fast5
output=classification

deepbinner classify --native "$fast5_dir" > "$output"

I ran the deepbinner classification step on a GPU and it took 10 hours to classify all 1009917 reads - so approximately 1683 reads/min.

Deepbinner binning

Split the reads into separate fastq files for each barcode based on the classifications learned.

cd Guppy_testing/normal
classifications=../classification
out_dir=barcode_bins/
reads_dir=basecalled_fastq/

# combine all the fastq files into a single one
# (-exec ... + avoids word splitting and overly long argument lists)
find "$reads_dir" -name '*.fastq' -exec cat {} + > tmp_reads.fastq

deepbinner bin \
    --classes "$classifications" \
    --reads tmp_reads.fastq \
    --out_dir "$out_dir"

rm tmp_reads.fastq

Do the same thing for Guppy_testing/flipflop.

Adapter trimming

Chop off adapter sequences using another of Ryan Wick’s tools, Porechop.

cd Guppy_testing/normal
outdir=barcode_bins/

# I only expect barcodes 1-5
for f in $(find barcode_bins/ -type f | grep -E 'barcode0[1-5]\.fastq\.gz')
do
  name=$(basename "$f")
  porechop --input "$f" \
          --output "$outdir"/"${name%%.*}".trimmed.fastq.gz \
          --discard_middle
done

Do the same thing for Guppy_testing/flipflop.

Map

The accuracy of the reads will be based on alignment to the reference genome. The alignment is done using minimap2.

cd Guppy_testing/normal
reference=../NC_000962.3.fa
outdir=mapped/
for f in $(find barcode_bins/ -name '*trimmed*')
do
  sample=$(basename ${f%%.*})
  output="$outdir"/"$sample".sorted.bam
  minimap2 -ax map-ont "$reference" "$f" | samtools sort -o "$output" -
done

Do the same thing for Guppy_testing/flipflop.

Plotting

I did some quality control plotting using a python package I developed called Pistis.

cd /hps/nobackup/research/zi/mbhall/Guppy_testing/normal
for i in {1..5}
do
  bam=mapped/barcode0"$i".sorted.bam
  reads=barcode_bins/barcode0"$i".trimmed.fastq.gz
  output=reports/barcode0"$i"_pistis.pdf
  pistis --fastq "$reads" --bam "$bam" \
      --output "$output" --downsample 0
done

Do the same thing for Guppy_testing/flipflop.

Results

Quality vs Read length

Probably the most startling thing for me initially was the difference in Phred quality scores the two algorithms were producing.

Figure 1: Guppy default basecalling algorithm quality vs read length. The y-axis shows the average Phred quality score for each read. The x-axis is the read length in base pairs.

We can see from Figure 1 above that the Phred scores for the default algorithm are centred around 14. However, when we look at the same plot for the flip-flop algorithm (Figure 2), we see a very different story in terms of quality scores.

Figure 2: Guppy flip-flop basecalling algorithm quality vs read length. The y-axis shows the average Phred quality score for each read. The x-axis is the read length in base pairs.

As you can see, flip-flop seems to rate itself very highly. The densest part of the kernel being around Phred score 42….yes, 42.
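To put those numbers in perspective, a Phred score Q corresponds to an error probability of 10^(-Q/10). A quick Python sketch of the conversion (the function name is just for illustration):

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability that a base call is wrong."""
    return 10 ** (-q / 10)


for q in (14, 42):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.2e}, implied accuracy {100 * (1 - p):.3f}%")
```

A score of 14 implies roughly a 4% chance of a base being wrong, whereas 42 implies about 6 errors in every 100,000 bases, which is a very bold claim for nanopore reads.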

At the end of the day though, I don’t generally pay much attention to the quality scores. I am more interested in how well the reads match what I expect them to, i.e. the “truth”. As I don’t have an absolute truth for this particular dataset, I am going to use the M. tuberculosis reference, NC_000962.3, as a decent approximation. I know, it’s not ideal, but it’s the best I have access to at the moment.

Figures 1 & 2 were produced from my package Pistis. For the following plots, I will post the code for the functions I used to prepare the data at the end of this post.

import pysam
import matplotlib.pyplot as plt
from pathlib import Path
from collections import Counter
import seaborn as sns
import pandas as pd

# get the paths for the bam files
normal_bams = list(Path('../normal').rglob('*.bam'))
flipflop_bams = list(Path('../flipflop').rglob('*.bam'))

# gather all the required info into a dataframe
normal_df = stats_for_bams(normal_bams)
normal_df["model"] = "normal"
flipflop_df = stats_for_bams(flipflop_bams)
flipflop_df["model"] = "flipflop"
df = pd.concat([normal_df, flipflop_df])

Total yield

Let’s see if there is a major difference in the raw number of base pairs we get from each basecalling algorithm.

# faster to sum all the bases for each barcode/model into a dataframe
yield_df = df.groupby(by=['model', 'barcode']).sum()
yield_df.reset_index(level=['model', 'barcode'], inplace=True)

fig, ax = plt.subplots(figsize=(15, 9))
p = sns.barplot(data=yield_df, x="barcode", y="aligned_bases",
                hue="model", hue_order=['normal', 'flipflop'], ax=ax)
p = p.set(title="Total yield", ylabel="aligned bases (bp)")

Figure 3: Total number of bases produced by the Guppy default (blue) and flip-flop (orange) algorithms for each barcode.

As you can see, flip-flop consistently produces more bases. The impact of this will be seen when we look at the relative read lengths for both algorithms.

GC content

As mentioned earlier, M. tuberculosis has a GC content around 65%. Will this have an impact on the new basecaller as Keith Robison seemed to suspect?

sns.set_style("whitegrid")
fig, ax = plt.subplots(figsize=(15, 9))
p = sns.violinplot(x='barcode', y='gc_content', data=df, split=True, inner="quartile",
                   hue='model', hue_order=['normal', 'flipflop'], ax=ax)
p = p.set(title="GC content", ylabel="GC proportion per read (%)")

Figure 4: GC content for each barcode calculated on a per-read basis for both the default (blue) and flip-flop (orange) algorithms of Guppy.

I plotted this many different ways and the distributions were nearly identical every way I looked at it. So I guess the flip-flop algorithm may have changed a bit since Keith looked at it, or potentially ONT has some M. tuberculosis in their training dataset?

Read identity

This is the plot I was most interested in, and for me the most important one. How identical are the reads to the section of the reference they map to? As I mentioned already, we don’t have absolute truth here, but it is a pretty close approximation. This metric is effectively asking: for the reads that align (I ignore unmapped reads and secondary/supplementary alignments), how similar is the sequence to the reference at that location? I have cut off the axis at 50% to get a clearer view of the bulk of the distribution, but the tails extend below 50%.

sns.set_style("whitegrid")
fig, ax = plt.subplots(figsize=(15, 9))
p = sns.violinplot(y='barcode', x='pid', data=df, split=True, inner="quartile",
                   hue='model', hue_order=['normal', 'flipflop'], ax=ax)
# percent identity is on the x-axis in this plot, so label that axis
p = p.set(title="Read identity", xlabel="Read percent identity (%)")
_ = ax.set_xlim((50, 100))
_ = plt.legend(loc='lower right')

Figure 5: Read percent identity for primary alignments to the M. tuberculosis reference, NC_000962.3. Blue shows the default algorithm for Guppy and orange shows the flip-flop algorithm. The dashed lines within the violins show the quartiles of the data.

Wow! That is a pretty good improvement. On average, flip-flop has about 2% higher read identity compared to Guppy’s default algorithm.

Relative read length

To see whether the algorithms are causing insertions and/or deletions we can look at the relative read length. That is, we take the length of the aligned part of the read and divide it by the length of the aligned part of the reference. Below 1.0 means there have been some deletions, above 1.0 means we’ve had some insertions - compared to the reference of course.

sns.set_style("whitegrid")
fig, ax = plt.subplots(figsize=(15, 9))
p = sns.violinplot(x='barcode', y='rel_len', data=df, split=True, inner="quartile",
                   hue='model', hue_order=['normal', 'flipflop'], ax=ax)
p = p.set(title="Relative read length", ylabel="read alignment length / ref alignment length")
_ = ax.set_ylim((0.75, 1.25))

Figure 6: Relative read length for Guppy’s default (blue) and flip-flop (orange) algorithms. Relative read length is calculated as the length of the aligned part of the read divided by the length of the aligned part of the reference.

So it appears that flip-flop, on average, causes more deletions than insertions, but it is definitely an improvement on the default algorithm. As we saw from the total yield plot, flip-flop produces more bases and the outcome of that, at least for M. tuberculosis in this case, is fewer deletions.
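To make the relative read length metric concrete, here is a minimal Python sketch that computes it from a CIGAR string. It is an illustration only and not necessarily how the values in the plots were calculated. It assumes soft- and hard-clipped bases are not part of the aligned read, and follows the SAM convention for which operations consume the read and which consume the reference. The function name is made up.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# operations that consume aligned read bases (clipping is deliberately excluded)
QUERY_OPS = frozenset("MI=X")
# operations that consume reference bases
REF_OPS = frozenset("MDN=X")


def relative_read_length(cigar):
    """Aligned read length divided by aligned reference length."""
    read_len = 0
    ref_len = 0
    for length, op in CIGAR_OP.findall(cigar):
        if op in QUERY_OPS:
            read_len += int(length)
        if op in REF_OPS:
            ref_len += int(length)
    if ref_len == 0:
        raise ValueError(f"no reference-consuming operations in CIGAR: {cigar!r}")
    return read_len / ref_len


# a deletion pushes the ratio below 1.0, an insertion pushes it above 1.0
print(relative_read_length("10M2D5M"))
print(relative_read_length("10M2I5M"))
```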

Conclusions

So in conclusion, given the results from Ryan Wick on S. maltophilia and those presented here on M. tuberculosis, you can make a strong argument for using the flip-flop algorithm over the default for GC-rich genomes without much concern regarding accuracy. You get more accurate reads with fewer deletions. But the big caveat is time. Flip-flop is much slower than the default algorithm. At least on CPUs, it is probably only feasible to use flip-flop if you have access to a computing cluster with at least 16 cores, unless you want to smash your laptop for a week or so. As I said earlier, I have not been able to run Guppy on GPUs yet, so I am interested to see how much faster flip-flop is on a GPU compared to the CPU version.

I hope someone finds this useful. And of course, if you have any problems with anything I have done please do get in touch.

Supplementary code

This code was used for preparing the data for plotting.

import pysam
from pathlib import Path
from collections import Counter
import pandas as pd

def gc_content(sequence, as_decimal=True):
    """Returns the GC content for the sequence.
    Notes:
        This method ignores N when calculating the length of the sequence.
        It does not, however, ignore other ambiguous bases. Of the ambiguous
        bases, only S (G or C) is counted towards GC. In this sense, the
        method is conservative with its calculation.
    Args:
        sequence (str): A DNA string.
        as_decimal (bool): Return the result as a decimal. Setting to False
            will return as a percentage, i.e. for the sequence GCAT it will
            return 0.5 by default and 50.00 if set to False.
    Returns:
        float: GC content calculated as the number of G, C, and S divided
            by the number of (non-N) bases (length).
    """
    gc_total = 0.0
    num_bases = 0.0
    n_tuple = tuple('nN')
    accepted_bases = tuple('cCgGsS')

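    # Counter tallies each distinct character. It is case-sensitive, which is
    # why both cases are listed in the tuples above.
    for base, count in Counter(sequence).items():

        # don't count N in the number of bases
        if base not in n_tuple:
            num_bases += count

            if base in accepted_bases:  # S is a G or C
                gc_total += count

    # GC content is undefined when there are no non-N bases; returning NaN
    # avoids a ZeroDivisionError on empty or all-N sequences
    if num_bases == 0:
        return float("nan")

    result = gc_total / num_bases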

    if not as_decimal:  # return as percentage
        result *= 100

    return result

def get_percent_identity(read):
    """Calculates the percent identity of a read based on the NM tag if
    present. If not, it is calculated from the MD tag and CIGAR string.
    Args:
        read (pysam.AlignedSegment): A pysam read alignment record.
    Returns:
        The percent identity or None if required fields are not present.
    """
    aligned_length = read.query_alignment_length
    # nothing aligned means identity is undefined. Checking up front also
    # covers the MD/CIGAR fallback, where a ZeroDivisionError raised inside
    # the except block would otherwise not be caught.
    if not aligned_length:
        return None

    try:
        edit_distance = read.get_tag("NM")
    except KeyError:
        try:
            edit_distance = (_parse_md_flag(read.get_tag("MD")) +
                             _parse_cigar(read.cigartuples))
        except KeyError:
            return None

    return 100 * (1 - edit_distance / aligned_length)

def relative_read_length(read):
    """Calculates the relative read length of the given read.
    That is, read aligned length/reference aligned length.

    Args:
        read (pysam.AlignedSegment): A pysam read alignment record.

    Returns:
        Relative read length as a float.

    """
    return read.query_alignment_length / read.reference_length

def sam_read_stats(filepath):
    """Opens a SAM/BAM file and extracts read stats for all mapped reads
    that are not supplementary or secondary alignments.
    Args:
        filepath (Path): Path to SAM/BAM file.
    Returns:
        A pandas dataframe with one row per read and the columns:
            1. 'pid' - read percent identity.
            2. 'rel_len' - relative read length.
            3. 'aligned_bases' - length of query aligned segment.
            4. 'gc_content' - GC content of the read as a percentage.
            5. 'read_id' - the read id.
    """
    # get pysam read mode depending on whether file is sam or bam
    file_ext = filepath.suffix
    read_opt = 'rb' if file_ext == '.bam' else 'r'

    stats = dict()
    # open file; the context manager closes it once we are done
    with pysam.AlignmentFile(str(filepath), read_opt) as samfile:
        for record in samfile:
            # make sure read is mapped, and is not a suppl. or secondary
            # alignment
            if (record.is_unmapped or
                    record.is_supplementary or
                    record.is_secondary):
                continue
            pid = get_percent_identity(record)
            relative_len = relative_read_length(record)

            stats[record.query_name] = {
                "pid": pid,
                "rel_len": relative_len,
                "aligned_bases": record.query_alignment_length,
                "gc_content": gc_content(record.query_sequence,
                                         as_decimal=False)
            }

    # one row per read, one column per stat (keeps numeric dtypes)
    df = pd.DataFrame.from_dict(stats, orient="index")
    df["read_id"] = df.index
    df.reset_index(inplace=True, drop=True)
    return df

def stats_for_bams(bams):
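    """Collates stats for a given list of {s,b}am files.
    Args:
        bams (list[Path]): A list of Path objects for {s,b}am files.
    Returns:
        A pandas dataframe of BAM stats where each row is a read and
        the columns are:
            1. 'barcode' - nanopore barcode (taken from the file name).
            2. 'pid' - read percent identity.
            3. 'rel_len' - relative read length.
            4. 'aligned_bases' - length of query aligned segment.
            5. 'gc_content' - GC content of the read as a percentage.
            6. 'read_id' - the read id.
    Note:
        The 'model' column used for plotting is not set by this function
        as written; it needs to be added to the result before plotting.
    """
    stats = []
    for bam in bams:
        # barcode is everything before the first '.' in the file name
        barcode = bam.name.split('.')[0]
        df = sam_read_stats(bam)
        df['barcode'] = barcode
        stats.append(df)
    # ignore_index avoids duplicate row labels across files
    return pd.concat(stats, ignore_index=True)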
