<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="3.9.3">Jekyll</generator><link href="https://mbh.sh/feed.xml" rel="self" type="application/atom+xml" /><link href="https://mbh.sh/" rel="alternate" type="text/html" hreflang="en" /><updated>2023-06-22T06:01:43+00:00</updated><id>https://mbh.sh/feed.xml</id><title type="html">Microbes made me do it</title><subtitle>Posts about microbes, genomics, bioinformatics, and anything else relevant (or not) to my research.
</subtitle><author><name>Michael Hall</name><email>michael@mbh.sh</email></author><entry><title type="html">Searching for shared sequence between Mycobacterium tuberculosis and Homo sapiens</title><link href="https://mbh.sh/2023-06-21-mtb-human-shared-sequence/" rel="alternate" type="text/html" title="Searching for shared sequence between Mycobacterium tuberculosis and Homo sapiens" /><published>2023-06-21T00:00:00+00:00</published><updated>2023-06-21T00:00:00+00:00</updated><id>https://mbh.sh/mtb-human-shared-sequence</id><content type="html" xml:base="https://mbh.sh/2023-06-21-mtb-human-shared-sequence/">&lt;h1 id=&quot;table-of-contents&quot;&gt;Table of Contents&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#motivation&quot;&gt;Motivation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#shared-k-mer-content&quot;&gt;Shared &lt;em&gt;k&lt;/em&gt;-mer content&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#aligning-reads&quot;&gt;Aligning reads&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#summary&quot;&gt;Summary&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#references&quot;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href=&quot;https://doi.org/10.5281/zenodo.8068147&quot;&gt;&lt;img src=&quot;https://zenodo.org/badge/DOI/10.5281/zenodo.8068147.svg&quot; alt=&quot;DOI&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1 id=&quot;motivation&quot;&gt;Motivation&lt;/h1&gt;

&lt;p&gt;We are in the early stages of planning a &lt;em&gt;Mycobacterium tuberculosis&lt;/em&gt; (MTB) analysis pipeline for a research project in Papua New Guinea. We’ll be sequencing sputum samples with Oxford Nanopore Technologies (ONT) devices and were thinking of different ways of decontaminating the data - i.e. removing anything non-MTB. Sputum samples typically contain a lot of host (human) reads and reads from a variety of bacteria, and the MTB component is usually quite small&lt;sup&gt;1&lt;/sup&gt;. One component of this pipeline will be to upload sequencing reads to a remote/cloud server, so any reduction in file size will make uploads faster. As human reads are not used in any analysis steps, and will need to be removed prior to making any data available, we thought we could simplify things by removing human data as the first step. Our idea was to align reads to the human genome and just remove anything that aligns. However, one concern with this approach was whether any MTB reads could be lost in the process. This effectively boils down to the question: &lt;strong&gt;Do &lt;em&gt;Mycobacterium tuberculosis&lt;/em&gt; and &lt;em&gt;Homo sapiens&lt;/em&gt; share genomic sequence&lt;/strong&gt;? After a literature search, I was unable to find an answer - which seemed quite surprising. My suspicion is that most people just assume they do not. (Or my literature searching skills are poor.) So let’s take a look.&lt;/p&gt;

&lt;h1 id=&quot;shared-k-mer-content&quot;&gt;Shared &lt;em&gt;k&lt;/em&gt;-mer content&lt;/h1&gt;

&lt;p&gt;The first thing I thought to check was whether there are shared &lt;em&gt;k&lt;/em&gt;-mers between the two reference genomes for MTB and human. As an aside, after struggling to install/run multiple tools for this job I wrote a simple Rust program - &lt;a href=&quot;https://github.com/mbhall88/skc&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;skc&lt;/code&gt;&lt;/a&gt; - to do this comparison.&lt;/p&gt;

&lt;p&gt;The human genome used is the &lt;a href=&quot;https://github.com/marbl/CHM13#t2t-chm13v20-t2t-chm13y&quot;&gt;Telomere-to-Telomere (T2T) Consortium CHM13 v2.0 assembly&lt;/a&gt; (accession: &lt;a href=&quot;https://www.ncbi.nlm.nih.gov/assembly/GCA_009914755.4&quot;&gt;GCA_009914755.4&lt;/a&gt;)&lt;sup&gt;2&lt;/sup&gt;. The MTB reference genome used is H37Rv (accession: &lt;a href=&quot;https://www.ncbi.nlm.nih.gov/nuccore/NC_000962.3&quot;&gt;NC_000962.3&lt;/a&gt;)&lt;sup&gt;3&lt;/sup&gt;. In addition to the CHM13 human genome, I looked at the shared &lt;em&gt;k&lt;/em&gt;-mer content between MTB and a collection of other closely- and distantly-related genomes to give some background expectations. The other genomes are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The previous human reference genome &lt;a href=&quot;https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.40/&quot;&gt;GRCh38.p14 (hg38)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;The &lt;em&gt;Mus musculus&lt;/em&gt; (mouse) reference genome &lt;a href=&quot;https://www.ncbi.nlm.nih.gov/assembly/GCF_000001635.27/&quot;&gt;GRCm39 (mm39)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;The &lt;em&gt;Arabidopsis thaliana&lt;/em&gt; (thale cress) reference genome &lt;a href=&quot;https://www.ncbi.nlm.nih.gov/assembly/GCF_000001735.4&quot;&gt;TAIR10.1&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;The Human immunodeficiency virus 1 (HIV-1) reference genome &lt;a href=&quot;https://www.ncbi.nlm.nih.gov/nuccore/NC_001802.1&quot;&gt;NC_001802.1&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;The &lt;em&gt;Escherichia coli&lt;/em&gt; strain K-12 substr. MG1655 reference genome &lt;a href=&quot;https://www.ncbi.nlm.nih.gov/assembly/GCF_000005845.2/&quot;&gt;ASM584v2&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;The &lt;em&gt;Mycobacterium avium subsp. hominissuis&lt;/em&gt; strain OCU889s_P11_4s reference genome &lt;a href=&quot;https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP018019.1&quot;&gt;NZ_CP018019.1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I ran &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;skc&lt;/code&gt; with all &lt;em&gt;k&lt;/em&gt; from 13 to 31 and plotted the number of shared &lt;em&gt;k&lt;/em&gt;-mers at each &lt;em&gt;k&lt;/em&gt; for each of the genomes listed above.&lt;/p&gt;
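
&lt;p&gt;To make the idea of shared &lt;em&gt;k&lt;/em&gt;-mers concrete, here is a minimal Python sketch of the comparison. It is only an illustration and not how &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;skc&lt;/code&gt; is implemented: the function names and toy sequences are hypothetical, and it assumes a &lt;em&gt;k&lt;/em&gt;-mer and its reverse complement are treated as the same (canonical) &lt;em&gt;k&lt;/em&gt;-mer.&lt;/p&gt;

```python
def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    complement = str.maketrans('ACGT', 'TGCA')
    revcomp = kmer.translate(complement)[::-1]
    return min(kmer, revcomp)


def kmers(seq, k):
    """Collect the set of canonical k-mers in a sequence (assumes only A, C, G, T)."""
    seq = seq.upper()
    return {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}


def shared_kmers(target, query, k):
    """Return the canonical k-mers present in both sequences."""
    return kmers(target, k).intersection(kmers(query, k))


# Toy sequences, purely hypothetical
target = 'ACGTTGCA'
query = 'GGACGTTC'
shared = shared_kmers(target, query, 5)
print(len(shared), sorted(shared))
```

&lt;p&gt;Holding every &lt;em&gt;k&lt;/em&gt;-mer of a human genome in a Python set like this would need far too much memory, so treat it as a description of the question being asked rather than a practical tool.&lt;/p&gt;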

&lt;p&gt;&lt;img src=&quot;/assets/img/posts/shared-seq/shared-count.png&quot; alt=&quot;plot of shared k-mer counts for each genome&quot; /&gt;&lt;/p&gt;

&lt;p&gt;From this figure we can see that the longest &lt;em&gt;k&lt;/em&gt;-mers shared between MTB and human (CHM13) are 29-mers - of which there are two. Interestingly, there is only one match between hg38 and MTB - meaning one of the matches with CHM13 is in new sequence generated by the T2T consortium. The two matches between MTB and CHM13 are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://mycobrowser.epfl.ch/jbrowse/index.html?data=data_mycobrowser%2FM.tuberculosis_H37Rv&amp;amp;loc=NC_000962.3%3A2357235..2357280&amp;amp;tracks=DNA%2CAnnotation&amp;amp;highlight=&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NC_000962.3:2357258&lt;/code&gt;&lt;/a&gt;, which is in the PE_PGRS36 gene, and &lt;a href=&quot;https://genome.ucsc.edu/cgi-bin/hgTracks?db=hub_3671779_hs1&amp;amp;lastVirtModeType=default&amp;amp;lastVirtModeExtraState=&amp;amp;virtModeType=default&amp;amp;virtMode=0&amp;amp;nonVirtPosition=&amp;amp;position=chr20%3A44924002%2D44924011&amp;amp;hgsid=1649864754_k9kpCSJ4SEii1AHLlN9L5tzDJi0Y&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chr20:44924007&lt;/code&gt;&lt;/a&gt; which begins at the last base of the PTPRT-207 gene.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://mycobrowser.epfl.ch/jbrowse/index.html?data=data_mycobrowser%2FM.tuberculosis_H37Rv&amp;amp;loc=NC_000962.3%3A837294..837339&amp;amp;tracks=DNA%2CAnnotation&amp;amp;highlight=&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NC_000962.3:837317&lt;/code&gt;&lt;/a&gt;, which is in the PE_PGRS9 gene, and &lt;a href=&quot;https://genome.ucsc.edu/cgi-bin/hgTracks?db=hub_3671779_hs1&amp;amp;lastVirtModeType=default&amp;amp;lastVirtModeExtraState=&amp;amp;virtModeType=default&amp;amp;virtMode=0&amp;amp;nonVirtPosition=&amp;amp;position=chrX%3A86236017%2D86236026&amp;amp;hgsid=1649864754_k9kpCSJ4SEii1AHLlN9L5tzDJi0Y&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chrX:86236022&lt;/code&gt;&lt;/a&gt;, which is in &lt;a href=&quot;https://genome.ucsc.edu/cgi-bin/hgc?hgsid=1649865962_pnhY9tSh3kU7tqDllTU3Pg2qE6iy&amp;amp;db=hg38&amp;amp;c=chrX&amp;amp;l=87808740&amp;amp;r=87808931&amp;amp;o=87807548&amp;amp;t=87809883&amp;amp;g=gtexGeneV8&amp;amp;i=RP6%2D43L17.2&quot;&gt;RP6-43L17.2&lt;/a&gt; (mitochondrial ribosomal protein S22 pseudogene 1)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both of these 29-mers also match in soft-masked regions of the CHM13 assembly - indicating they’re likely in repeats discovered by the T2T team.&lt;/p&gt;

&lt;p&gt;Unsurprisingly, the most 31-mer matches were with &lt;em&gt;M. avium&lt;/em&gt;, followed by &lt;em&gt;E. coli&lt;/em&gt;. There are also 46 31-mers that match in the &lt;em&gt;A. thaliana&lt;/em&gt; genome, which I was quite surprised about initially. But on further inspection, those hits are in the 16S rRNA of the chloroplast.&lt;/p&gt;

&lt;p&gt;Next, I looked at the GC content distribution of the matching &lt;em&gt;k&lt;/em&gt;-mers between MTB and CHM13. This was to convince myself these matches with the human genome were likely due to chance given they come from repetitive regions in both genomes.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/img/posts/shared-seq/gc-content.png&quot; alt=&quot;plot of shared k-mer GC distribution&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The GC content of the MTB genome is ~65%. However, we see that the bulk of the shared &lt;em&gt;k&lt;/em&gt;-mers (78%) have a GC content over 65% - with 46% being over 90%. Importantly, our two 29-mers have GC content of 93.1% and 96.6%.&lt;/p&gt;
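
&lt;p&gt;As a quick sanity check on those percentages, GC content is just the fraction of bases in a &lt;em&gt;k&lt;/em&gt;-mer that are G or C. The sketch below uses a hypothetical helper and synthetic 29-mers (not the real shared sequences): 27 G/C bases out of 29 gives 93.1%, and 28 out of 29 gives 96.6%.&lt;/p&gt;

```python
def gc_percent(kmer):
    """Percentage of bases in the k-mer that are G or C."""
    kmer = kmer.upper()
    gc = kmer.count('G') + kmer.count('C')
    return 100 * gc / len(kmer)


# Synthetic 29-mers, not the real shared sequences
almost_all_gc = 'G' * 27 + 'AT'
nearly_pure_gc = 'C' * 28 + 'A'
print(round(gc_percent(almost_all_gc), 1))
print(round(gc_percent(nearly_pure_gc), 1))
```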

&lt;!-- TODO check human vs all other species --&gt;

&lt;h1 id=&quot;aligning-reads&quot;&gt;Aligning reads&lt;/h1&gt;

&lt;p&gt;Given the whole point of this analysis was to see if I would lose MTB reads when aligning a sputum sample to just the human genome, it’s fair to argue I should have just tested that approach and been done with it. But I can be a little paranoid so I looked at shared &lt;em&gt;k&lt;/em&gt;-mers first because…well why not?&lt;/p&gt;

&lt;p&gt;My main interest for this project was ONT data, so I simulated ONT reads from the MTB reference genome (H37Rv) to 5x depth with &lt;a href=&quot;https://github.com/rrwick/Badread&quot;&gt;Badread&lt;/a&gt;&lt;sup&gt;4&lt;/sup&gt;. I then aligned these reads to CHM13 with &lt;a href=&quot;https://github.com/lh3/minimap2&quot;&gt;minimap2&lt;/a&gt;&lt;sup&gt;5&lt;/sup&gt; (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-x map-ont&lt;/code&gt;), but got no alignments. As the shared &lt;em&gt;k&lt;/em&gt;-mers I observed were found in repetitive regions, I also tried aligning the MTB reads to CHM13 using &lt;a href=&quot;https://github.com/marbl/Winnowmap&quot;&gt;Winnowmap&lt;/a&gt;&lt;sup&gt;6&lt;/sup&gt;, which is designed for aligning long reads to repetitive reference sequences. Still no alignments.&lt;/p&gt;

&lt;p&gt;As an additional analysis, as I assume this will be of interest to others, I also checked the alignment of Illumina reads. I simulated paired MTB reads from H37Rv with &lt;a href=&quot;https://www.niehs.nih.gov/research/resources/software/biostatistics/art/index.cfm&quot;&gt;ART&lt;/a&gt;&lt;sup&gt;7&lt;/sup&gt; to a depth of 20x from a HiSeq 2500 (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-ss HS25&lt;/code&gt;) and MiSeq v3 (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-ss MSv3&lt;/code&gt;). I aligned these to CHM13 with minimap2 (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-x sr&lt;/code&gt;) and got a very small number of alignments - 46 reads for HiSeq 2500 and 32 reads from MiSeq v3 (none of these alignments were near where the 29-mer matches are).&lt;/p&gt;

&lt;h1 id=&quot;summary&quot;&gt;Summary&lt;/h1&gt;

&lt;p&gt;For my part, I’m pretty happy to conclude that aligning ONT sputum data to the human genome will not remove any MTB reads. Doing the same for Illumina data will result in a negligible number of MTB reads being lost. While there are &lt;em&gt;some&lt;/em&gt; shared &lt;em&gt;k&lt;/em&gt;-mers between MTB and the human genome, these are likely repeat artifacts. I’m going to boldly conclude that there &lt;strong&gt;is no shared sequence between &lt;em&gt;M. tuberculosis&lt;/em&gt; and &lt;em&gt;Homo sapiens&lt;/em&gt;&lt;/strong&gt; - at least nothing that is evolutionarily meaningful. I would love to be proven wrong though.&lt;/p&gt;

&lt;h1 id=&quot;references&quot;&gt;References&lt;/h1&gt;

&lt;ol&gt;
  &lt;li&gt;Nilgiriwala K, Rabodoarivelo M-S, Hall MB, Patel G, Mandal A, Mishra S, et al. Genomic sequencing from sputum for tuberculosis disease diagnosis, lineage determination, and drug susceptibility prediction. J Clin Microbiol. 2023;61: e0157822. doi:&lt;a href=&quot;https://doi.org/10.1128/jcm.01578-22&quot;&gt;10.1128/jcm.01578-22&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Rhie A, Nurk S, Cechova M, Hoyt SJ, Taylor DJ, Altemose N, et al. The complete sequence of a human Y chromosome. bioRxiv. 2022. doi:&lt;a href=&quot;https://doi.org/10.1101/2022.12.01.518724&quot;&gt;10.1101/2022.12.01.518724&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, Harris D, et al. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature. 1998;393: 537–544. doi:&lt;a href=&quot;https://doi.org/10.1038/31159&quot;&gt;10.1038/31159&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Wick R. Badread: simulation of error-prone long reads. J Open Source Softw. 2019;4: 1316. doi:&lt;a href=&quot;https://doi.org/10.21105/joss.01316&quot;&gt;10.21105/joss.01316&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34: 3094–3100. doi:&lt;a href=&quot;https://doi.org/10.1093/bioinformatics/bty191&quot;&gt;10.1093/bioinformatics/bty191&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Jain C, Rhie A, Hansen NF, Koren S, Phillippy AM. Long-read mapping to repetitive reference sequences using Winnowmap2. Nat Methods. 2022;19: 705–710. doi:&lt;a href=&quot;https://doi.org/10.1038/s41592-022-01457-8&quot;&gt;10.1038/s41592-022-01457-8&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Huang W, Li L, Myers JR, Marth GT. ART: a next-generation sequencing read simulator. Bioinformatics. 2012;28: 593–594. doi:&lt;a href=&quot;https://doi.org/10.1093/bioinformatics/btr708&quot;&gt;10.1093/bioinformatics/btr708&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can cite this post as&lt;/p&gt;

&lt;p&gt;Hall, Michael B. Searching for shared sequence between Mycobacterium tuberculosis and Homo sapiens. Zenodo; 2023. doi:&lt;a href=&quot;https://doi.org/10.5281/zenodo.8068146&quot;&gt;10.5281/zenodo.8068147&lt;/a&gt;&lt;/p&gt;</content><author><name>Michael Hall</name><email>michael@mbh.sh</email></author><category term="bioinformatics" /><category term="shared-sequence" /><category term="kmers" /><category term="tuberculosis" /><category term="human" /><summary type="html"></summary></entry><entry><title type="html">Cheap Parallelisation</title><link href="https://mbh.sh/2020-06-22-cheap-parallelisation/" rel="alternate" type="text/html" title="Cheap Parallelisation" /><published>2020-06-22T00:00:00+00:00</published><updated>2020-06-22T00:00:00+00:00</updated><id>https://mbh.sh/cheap-parallelisation</id><content type="html" xml:base="https://mbh.sh/2020-06-22-cheap-parallelisation/">&lt;h1 id=&quot;table-of-contents&quot;&gt;Table of Contents&lt;/h1&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;#what-is-fd&quot;&gt;What is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt;?&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#baby-steps-finding-our-files&quot;&gt;Baby steps: finding our files&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#constructing-execution-commands&quot;&gt;Constructing execution commands&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#simple-counting-characters-in-files&quot;&gt;Simple: counting characters in files&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#intermediate-changing-file-extensions&quot;&gt;Intermediate: changing file extensions&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#advanced-redirecting-stdout-to-a-file-within-the-command&quot;&gt;Advanced: redirecting stdout to a file within the command&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#putting-it-all-together-parallel-msa&quot;&gt;Putting it all together: Parallel MSA&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#benchmark&quot;&gt;Benchmark&lt;/a&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#results&quot;&gt;Results&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#final-remarks&quot;&gt;Final Remarks&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;motivation&quot;&gt;Motivation&lt;/h1&gt;

&lt;p&gt;I was recently creating a &lt;a href=&quot;https://snakemake.readthedocs.io/en/stable/&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;snakemake&lt;/code&gt;&lt;/a&gt; pipeline and needed to write a
rule/process that would perform a multiple sequence alignment (MSA) on 2,582 fasta
files. Usually, it is easy to parallelise this kind of task using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;snakemake&lt;/code&gt;. To cut a
long story short, using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;snakemake&lt;/code&gt; to parallelise across the files was not feasible in this case. I
knew there were ways of doing this kind of thing with tools such as
&lt;a href=&quot;https://www.gnu.org/software/parallel/&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parallel&lt;/code&gt;&lt;/a&gt;, &lt;a href=&quot;https://www.man7.org/linux/man-pages/man1/xargs.1.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;xargs&lt;/code&gt;&lt;/a&gt;, and &lt;a href=&quot;https://www.gnu.org/software/findutils/&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;find&lt;/code&gt;&lt;/a&gt;, but I had never really
invested the time to get comfortable with them. This post is an attempt to document that
process using one of my favourite CLI tools: &lt;a href=&quot;https://github.com/sharkdp/fd&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt;&lt;/a&gt;. We’ll see how &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt; can be used
to execute multiple MSAs (with MAFFT) simultaneously, and benchmark how much faster it is than
a conventional “synchronous” approach.&lt;/p&gt;

&lt;h2 id=&quot;what-is-fd&quot;&gt;What is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt;?&lt;/h2&gt;

&lt;p&gt;Quoting from its GitHub repository:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt; is a simple, fast and user-friendly alternative to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;find&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and I would certainly agree with that. For the sake of brevity, I won’t dive into the
full range of what &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt; does, but please take a minute to have a quick look at &lt;a href=&quot;https://github.com/sharkdp/fd/blob/master/README.md&quot;&gt;the
README&lt;/a&gt; before reading on any further.&lt;/p&gt;

&lt;p&gt;For the purposes of this post, the functionality of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt; we are most interested in is
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-x, --exec&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-j, --threads&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;    -x, --exec &amp;lt;cmd&amp;gt;
            Execute a command for each search result.
            All arguments following --exec are taken to be arguments to the command until the argument ';' is encountered.
            Each occurrence of the following placeholders is substituted by a path derived from the current search result before the command is
            executed:
              '{}':   path
              '{/}':  basename
              '{//}': parent directory
              '{.}':  path without file extension
              '{/.}': basename without file extension
              
    -j, --threads &amp;lt;num&amp;gt;
            Set number of threads to use for searching &amp;amp; executing (default: number of available CPU cores)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--exec&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--threads&lt;/code&gt; together will allow us to run a given command on however
many threads we like.&lt;/p&gt;
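
&lt;p&gt;If the placeholders in the help text feel abstract, the following Python sketch mimics what each one resolves to for a single search result. It is a rough emulation for illustration only: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;expand&lt;/code&gt; function is hypothetical, it is not how &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt; does the substitution internally, and edge cases (such as a path with no parent directory) may behave differently.&lt;/p&gt;

```python
import posixpath


def expand(template, path):
    """Substitute fd-style placeholders in a command template for one path."""
    basename = posixpath.basename(path)
    replacements = {
        '{/.}': posixpath.splitext(basename)[0],  # basename without extension
        '{//}': posixpath.dirname(path),          # parent directory
        '{/}': basename,                          # basename
        '{.}': posixpath.splitext(path)[0],       # path without extension
        '{}': path,                               # full path
    }
    for placeholder, value in replacements.items():
        template = template.replace(placeholder, value)
    return template


path = 'fd_example/f1.fa'
for template in ['{}', '{/}', '{//}', '{.}', '{/.}']:
    print(template, expand(template, path))
```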

&lt;h2 id=&quot;baby-steps-finding-our-files&quot;&gt;Baby steps: finding our files&lt;/h2&gt;

&lt;p&gt;The first thing we need to do before executing a command on multiple threads is to
&lt;em&gt;find&lt;/em&gt; the files we want to operate on. The toy dataset I’ll be using is eight small
fasta files. If you’re going to follow along at home, you can download the data from
&lt;a href=&quot;https://github.com/mbhall88/mbhall88.github.io/tree/master/assets/data/fd_example.tar.gz&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The directory tree we are working with is&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;.
├── fd_example
│  ├── f1.fa
│  ├── f2.fa
│  ├── f3.fa
│  ├── f4.fa
│  ├── f5.fa
│  ├── f6.fa
│  ├── f7.fa
│  └── f8.fa
└── msas
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To find the fasta files with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt;, we can use the following command&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;fd &lt;span class=&quot;nt&quot;&gt;--extension&lt;/span&gt; fa &lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt; fd_example
fd_example/f1.fa
fd_example/f2.fa
fd_example/f3.fa
fd_example/f4.fa
fd_example/f5.fa
fd_example/f6.fa
fd_example/f7.fa
fd_example/f8.fa
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--extension&lt;/code&gt; tells &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt; we want to keep only the search results with the given extension -
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fa&lt;/code&gt; in this case. (I’ll use the abbreviated &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-e&lt;/code&gt; for this in subsequent commands).
Next, we tell &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt; the pattern we are looking for is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.&lt;/code&gt; - i.e. anything. Lastly, we
want to search (recursively) under the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd_example&lt;/code&gt; directory.&lt;/p&gt;

&lt;p&gt;An alternate solution for the above would be skipping the use of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--extension&lt;/code&gt; and using
the pattern on its own&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;fd &lt;span class=&quot;s1&quot;&gt;'\.fa$'&lt;/span&gt; fd_example
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The pattern is a &lt;a href=&quot;https://en.wikipedia.org/wiki/Regular_expression&quot;&gt;regular expression&lt;/a&gt; this time that says “anything ending in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.fa&lt;/code&gt;”.&lt;/p&gt;
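
&lt;p&gt;To see what that pattern does and does not match, here is a small check using Python’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;re&lt;/code&gt; module. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt; uses its own regex engine, so this is only an approximation that holds for a simple pattern like this one, and the file names are made up.&lt;/p&gt;

```python
import re

# Escaped dot, then 'fa', anchored to the end of the name
pattern = re.compile(r'\.fa$')

# Hypothetical file names
names = ['f1.fa', 'f1.fasta', 'f1.fa.gz', 'sofa']
matches = [name for name in names if pattern.search(name)]
print(matches)
```

&lt;p&gt;Only the name that ends in a literal &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.fa&lt;/code&gt; is kept. Without the backslash, the dot would match any character and the last name would match too.&lt;/p&gt;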

&lt;p&gt;Personally, I like to make commands like this as easy for others, and myself in the
future, to understand. So I’ll stick with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--extension&lt;/code&gt; method.&lt;/p&gt;

&lt;h2 id=&quot;constructing-execution-commands&quot;&gt;Constructing execution commands&lt;/h2&gt;

&lt;p&gt;Now that we have a list of file paths, we need to do something with them. From earlier,
we can see that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--exec&lt;/code&gt; takes a command, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;cmd&amp;gt;&lt;/code&gt;. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt; docs have
&lt;a href=&quot;https://github.com/sharkdp/fd/blob/master/README.md#parallel-command-execution&quot;&gt;examples&lt;/a&gt; of different conventions for writing these commands. This command
will be executed for &lt;em&gt;each&lt;/em&gt; file path that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt; finds.&lt;/p&gt;

&lt;h3 id=&quot;simple-counting-characters-in-files&quot;&gt;Simple: counting characters in files&lt;/h3&gt;

&lt;p&gt;We’ll start with something easy like counting characters in each file using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wc&lt;/code&gt;. &lt;a href=&quot;https://linux.die.net/man/1/wc&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wc&lt;/code&gt;&lt;/a&gt;
is a command-line utility that counts lines, words, and bytes (characters) in files.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;fd &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; fa &lt;span class=&quot;nt&quot;&gt;--exec&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;wc&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'{}'&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt; fd_example
3689 fd_example/f5.fa
2155 fd_example/f8.fa
7324 fd_example/f1.fa
2503 fd_example/f2.fa
1433 fd_example/f7.fa
5701 fd_example/f3.fa
1766 fd_example/f6.fa
2530 fd_example/f4.fa
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Let’s break down the juicy bit - &lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--exec wc -c '{}' \;&lt;/code&gt;&lt;/strong&gt;. We tell &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt; to execute the
command &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wc -c&lt;/code&gt; on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{}&lt;/code&gt;. If we refresh our memories on the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--exec&lt;/code&gt; docs, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{}&lt;/code&gt; refers to
the path returned by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt; (note: we add quotes around &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{}&lt;/code&gt; to prevent more complex paths
failing). The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;\;&lt;/code&gt; bit at the end just tells &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt; where the execution command ends.
Anything else after this is for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt;, or the shell, to deal with.&lt;/p&gt;

&lt;p&gt;An even more concise version of this command would be&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;fd &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; fa &lt;span class=&quot;nt&quot;&gt;-x&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;wc&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt; fd_example
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt; appends the path &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{}&lt;/code&gt; to the end of the command if we don’t add it ourselves.&lt;/p&gt;
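
&lt;p&gt;Note that the output above is not in file order. The command is run for each search result in parallel, so lines are printed as each job finishes. The Python sketch below imitates that behaviour with a thread pool. It is an analogy rather than a description of how &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt; works internally, and the function names and toy files are hypothetical.&lt;/p&gt;

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed


def count_bytes(path):
    """Stand-in for `wc -c`: return the file size in bytes and the path."""
    return os.path.getsize(path), path


def run_parallel(paths, threads):
    """Apply count_bytes to every path on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(count_bytes, p) for p in paths]
        # Results are collected as they finish, so the order is not guaranteed
        return [f.result() for f in as_completed(futures)]


# Toy files created just for this example
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, size in enumerate([10, 20, 30], start=1):
        path = os.path.join(tmp, f'f{i}.fa')
        with open(path, 'w') as handle:
            handle.write('A' * size)
        paths.append(path)
    results = run_parallel(paths, threads=2)

for size, path in sorted(results):
    print(size, os.path.basename(path))
```

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;threads&lt;/code&gt; argument here plays the role of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--threads&lt;/code&gt;: it caps how many jobs run at once.&lt;/p&gt;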

&lt;p&gt;Ok, so that was pretty easy. Let’s look at something a little more complicated.&lt;/p&gt;

&lt;h3 id=&quot;intermediate-changing-file-extensions&quot;&gt;Intermediate: changing file extensions&lt;/h3&gt;

&lt;p&gt;As a (somewhat contrived) second example, we’ll use the same approach to change the file
extensions on all our fasta files from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.fa&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.fasta&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;fd &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; fa &lt;span class=&quot;nt&quot;&gt;--exec&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;mv&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'{}'&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'{.}'&lt;/span&gt;.fasta &lt;span class=&quot;se&quot;&gt;\;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt; fd_example
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;ls &lt;/span&gt;fd_example
f1.fasta  f2.fasta  f3.fasta  f4.fasta  f5.fasta  f6.fasta  f7.fasta  f8.fasta
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The only new thing here is the use of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;'{.}'&lt;/code&gt;. Again, from the help menu, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{.}&lt;/code&gt; gives
us the path &lt;em&gt;without&lt;/em&gt; the file extension. So &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd_example/f1.fa&lt;/code&gt; becomes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd_example/f1&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id=&quot;advanced-redirecting-stdout-to-a-file-within-the-command&quot;&gt;Advanced: redirecting stdout to a file within the command&lt;/h3&gt;

&lt;p&gt;Trying to redirect the standard output of a command to a file turns out to be a little
more involved. For example, going with what we’ve seen so far, if we wanted to get the
sequence identifier for each sequence in a fasta file and write them to file, our
initial attempt might look like this&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;fd &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; fa &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--exec&lt;/span&gt; rg &lt;span class=&quot;nt&quot;&gt;-P&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'^&amp;gt;(?P&amp;lt;id&amp;gt;[^\s]+)\s.*'&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--replace&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'$id'&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'{/.}'&lt;/span&gt;.ids &lt;span class=&quot;se&quot;&gt;\;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt; fd_example
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In this command, we use &lt;a href=&quot;https://github.com/BurntSushi/ripgrep&quot;&gt;ripgrep&lt;/a&gt; (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rg&lt;/code&gt;) to grab the identifiers for the file and
write them out to a file called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;'{/.}'.ids&lt;/code&gt;. The help menu for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt; tells us that
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{/.}&lt;/code&gt; is the path basename without the file extension. So &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd_example/f1.fa&lt;/code&gt; becomes
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f1.ids&lt;/code&gt;. However, when we run this, we get an error message like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zsh: no such file or
directory: {/.}.ids&lt;/code&gt;. What is happening here is that before this command
gets passed to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt;, it is first interpreted by the shell. The shell handles the
&lt;a href=&quot;https://pubs.opengroup.org/onlinepubs/9699919799.2016edition/basedefs/V1_chap03.html#tag_03_318&quot;&gt;redirection operator&lt;/a&gt;
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;gt;&lt;/code&gt; itself: it tries to send the output of the whole &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt; command to a file
literally called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{/.}.ids&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt; never sees the redirection at all. Because that name contains a
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/&lt;/code&gt;, the shell reads it as a file &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.}.ids&lt;/code&gt; inside a directory called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;{&lt;/code&gt;, and no such directory exists. The
placeholder syntax only means something inside an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd --exec&lt;/code&gt; command.&lt;/p&gt;
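
&lt;p&gt;The failure can be reproduced without &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt; at all. The following is a small sketch, assuming a POSIX-like shell with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mktemp&lt;/code&gt; available; the variable names are mine and the exact error text differs between shells, so it is silenced here.&lt;/p&gt;

```sh
# Work in a scratch directory so nothing in the current directory is touched.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

# The shell opens the redirection target itself, before echo ever runs.
# The name contains a '/', so it means "file .}.ids inside directory {".
# That directory does not exist, so the redirection fails.
if { echo "some_id" > '{/.}'.ids; } 2>/dev/null; then
    redirect_status="ok"
else
    redirect_status="failed"
fi

echo "redirect $redirect_status"
```

&lt;p&gt;The redirection fails and no output file is created, which is the same thing the shell runs into with the command above.&lt;/p&gt;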

&lt;p&gt;We can fix this in one of two ways: inline bash command execution or a shell script.&lt;/p&gt;

&lt;h4 id=&quot;inline-bash-command&quot;&gt;Inline bash command&lt;/h4&gt;

&lt;p&gt;This option is pretty ugly - unless you’re into one-liners…&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;fd &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; fa &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--exec&lt;/span&gt; sh &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;rg -P '^&amp;gt;(?P&amp;lt;id&amp;gt;[^&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\s&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;]+)&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\s&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;.*' --replace '&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\$&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;id' '{}' &amp;gt; '{/.}'.ids&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt; fd_example
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As you can see, we need to invoke an inline shell command with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sh -c&lt;/code&gt;. Part of the
awkwardness here is also having to escape some characters to prevent the shell from
trying to evaluate them, amongst other things. These inline commands can get very
cumbersome very fast!&lt;/p&gt;
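
&lt;p&gt;One way to cut down on the escaping is to keep the inline script in single quotes and hand the paths over as positional arguments. The sketch below shows the idea without &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rg&lt;/code&gt; so that it runs anywhere. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sed&lt;/code&gt; expression is a stand-in I chose for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rg&lt;/code&gt; pattern (it is slightly looser, as it does not require whitespace after the identifier), and the toy fasta file is made up.&lt;/p&gt;

```sh
# Toy input in a scratch directory (made-up sequences).
scratch=$(mktemp -d)
printf '>seq1 first record\nACGT\n>seq2 second record\nGGCC\n' > "$scratch/f1.fa"

# The script is single-quoted, so the outer shell leaves it alone.
# The word after the script ("sh") becomes $0; the two paths become $1 and $2.
sh -c 'sed -n "s/^>\([^[:space:]]*\).*/\1/p" "$1" > "$2"' sh \
    "$scratch/f1.fa" "$scratch/f1.ids"

cat "$scratch/f1.ids"
```

&lt;p&gt;With &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd --exec&lt;/code&gt;, the placeholders &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;'{}'&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;'{/.}'.ids&lt;/code&gt; would take the place of the two paths.&lt;/p&gt;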

&lt;h4 id=&quot;shell-script&quot;&gt;Shell script&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Recommended&lt;/strong&gt;. For complex examples, such as the one we are working with, I would
suggest using this approach. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt; command itself does become slightly more obscure,
but if you name the script accordingly, it should still be self-explanatory.&lt;/p&gt;

&lt;p&gt;Replicating the example above, we create a script &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;extract_seq_id.sh&lt;/code&gt;&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;#!/usr/bin/env sh&lt;/span&gt;
rg &lt;span class=&quot;nt&quot;&gt;-P&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'^&amp;gt;(?P&amp;lt;id&amp;gt;[^\s]+)\s.*'&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--replace&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'$id'&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$1&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$2&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and then execute the script with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;fd &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; fa &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
   &lt;span class=&quot;nt&quot;&gt;--exec&lt;/span&gt; sh extract_seq_id.sh &lt;span class=&quot;s1&quot;&gt;'{}'&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'{/.}'&lt;/span&gt;.ids &lt;span class=&quot;se&quot;&gt;\;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
   &lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt; fd_example
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Much neater!&lt;/p&gt;

&lt;p&gt;If we take a look inside one of the output files, we should see the identifiers&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat &lt;/span&gt;f1.ids
fadD23+Rv3827c+Rv3828c
76c33157-d262-467c-960b-c21f8fa16991
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;putting-it-all-together-parallel-msa&quot;&gt;Putting it all together: Parallel MSA&lt;/h2&gt;

&lt;p&gt;Ok, armed with our &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd --exec&lt;/code&gt; knowledge, let’s parallelise multiple sequence alignment
on our samples and benchmark how much of a difference throwing more threads at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt;
makes.&lt;/p&gt;

&lt;p&gt;To perform the MSA, we are going to use &lt;a href=&quot;https://mafft.cbrc.jp/alignment/software/&quot;&gt;MAFFT&lt;/a&gt;. MAFFT writes its output to
standard output. Luckily we know how to deal with this 😎. We’ll write a script and then
execute that script with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;msa.sh&lt;/code&gt;&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;#!/usr/bin/env sh&lt;/span&gt;
mafft &lt;span class=&quot;nt&quot;&gt;--thread&lt;/span&gt; 1 &lt;span class=&quot;nt&quot;&gt;--auto&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$1&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$2&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Note: we are only specifying one thread for benchmarking purposes.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Our &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt; command will then be&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;fd &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; fa &lt;span class=&quot;nt&quot;&gt;--exec&lt;/span&gt; sh msa.sh &lt;span class=&quot;s1&quot;&gt;'{}'&lt;/span&gt; msas/&lt;span class=&quot;s1&quot;&gt;'{/.}'&lt;/span&gt;.msa.fa &lt;span class=&quot;se&quot;&gt;\;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt; fd_example
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
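
&lt;p&gt;One thing to watch: the redirection inside &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;msa.sh&lt;/code&gt; fails if the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;msas&lt;/code&gt; directory does not already exist, so it has to be created before running the command. Alternatively, the script could create the output directory itself. Here is a sketch of that idea, where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ensure_parent_dir&lt;/code&gt; is a hypothetical helper name and an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;echo&lt;/code&gt; stands in for the MAFFT call so that it runs without MAFFT installed.&lt;/p&gt;

```sh
# Hypothetical helper: create the directory an output path lives in.
ensure_parent_dir() {
    mkdir -p "$(dirname "$1")"
}

# Demo in a scratch directory; the echo stands in for the mafft call.
scratch=$(mktemp -d)
out="$scratch/msas/f1.msa.fa"
ensure_parent_dir "$out"
echo ">placeholder_alignment" > "$out"

ls "$scratch/msas"
```

&lt;p&gt;In &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;msa.sh&lt;/code&gt; this would amount to one extra line, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mkdir -p &quot;$(dirname &quot;$2&quot;)&quot;&lt;/code&gt;, before the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mafft&lt;/code&gt; call.&lt;/p&gt;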

&lt;p&gt;Our directory tree now looks like&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;.
├── fd_example
│  ├── f1.fa
│  ├── f2.fa
│  ├── f3.fa
│  ├── f4.fa
│  ├── f5.fa
│  ├── f6.fa
│  ├── f7.fa
│  └── f8.fa
├── msa.sh
└── msas
   ├── f1.msa.fa
   ├── f2.msa.fa
   ├── f3.msa.fa
   ├── f4.msa.fa
   ├── f5.msa.fa
   ├── f6.msa.fa
   ├── f7.msa.fa
   └── f8.msa.fa
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;benchmark&quot;&gt;Benchmark&lt;/h2&gt;

&lt;p&gt;How much of a difference do more threads make? We’ll benchmark various numbers of
threads using the tool &lt;a href=&quot;https://github.com/sharkdp/hyperfine&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hyperfine&lt;/code&gt;&lt;/a&gt; and its neat &lt;a href=&quot;https://github.com/sharkdp/hyperfine#parameterized-benchmarks&quot;&gt;parameter
scan&lt;/a&gt; functionality. The data used in this benchmarking is a little
different to the toy dataset. I have restricted the files to those above 9kb in size (80
files in total) to try and see the full effects of multiple threads. The 100 threads
option is just a roundabout way of saying “however many threads you can find.” The
machine I ran this benchmark on has 16.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;hyperfine &lt;span class=&quot;nt&quot;&gt;-L&lt;/span&gt; threads 1,2,4,8,100 &lt;span class=&quot;nt&quot;&gt;--export-markdown&lt;/span&gt; results.md &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;s2&quot;&gt;&quot;fd -j {threads} -S +9k -e fa --exec sh msa.sh '{}' msas/'{/.}'.msa.fa &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt; . bench_data&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;results&quot;&gt;Results&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Benchmark #1: fd -j 1 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/
  Time (mean ± σ):     16.708 s ±  0.475 s    [User: 13.742 s, System: 6.550 s]
  Range (min … max):   15.972 s … 17.666 s    10 runs

Benchmark #2: fd -j 2 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/
  Time (mean ± σ):     10.003 s ±  0.064 s    [User: 15.850 s, System: 7.498 s]
  Range (min … max):    9.896 s … 10.115 s    10 runs

Benchmark #3: fd -j 4 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/
  Time (mean ± σ):      7.232 s ±  0.331 s    [User: 20.556 s, System: 10.231 s]
  Range (min … max):    6.943 s …  7.751 s    10 runs

Benchmark #4: fd -j 8 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/
  Time (mean ± σ):      6.118 s ±  0.233 s    [User: 22.905 s, System: 11.137 s]
  Range (min … max):    5.461 s …  6.250 s    10 runs

Benchmark #5: fd -j 100 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/
  Time (mean ± σ):      5.913 s ±  0.028 s    [User: 24.093 s, System: 11.508 s]
  Range (min … max):    5.880 s …  5.954 s    10 runs

Summary
  'fd -j 100 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/' ran
    1.03 ± 0.04 times faster than 'fd -j 8 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/'
    1.22 ± 0.06 times faster than 'fd -j 4 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/'
    1.69 ± 0.01 times faster than 'fd -j 2 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/'
    2.83 ± 0.08 times faster than 'fd -j 1 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Command&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Mean [s]&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Min [s]&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Max [s]&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Relative&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd -j 1 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;16.708 ± 0.475&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;15.972&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;17.666&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;2.83 ± 0.08&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd -j 2 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;10.003 ± 0.064&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;9.896&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;10.115&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;1.69 ± 0.01&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd -j 4 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;7.232 ± 0.331&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;6.943&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;7.751&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;1.22 ± 0.06&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd -j 8 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;6.118 ± 0.233&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;5.461&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;6.250&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;1.03 ± 0.04&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd -j 100 -e fa -S +9k --exec sh msa.sh '{}' msas/'{/.}'.msa.fa \; . bench_data/&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;5.913 ± 0.028&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;5.880&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;5.954&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;1.00&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
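
&lt;p&gt;To put the numbers another way, the mean times can be turned into a speedup relative to the single-threaded run, and a rough parallel efficiency (speedup divided by the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-j&lt;/code&gt; value). This is a quick sketch with the means copied from the table above. The efficiency figure is my own addition, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-j 100&lt;/code&gt; run is left out because the number of threads actually used there is not a clean divisor.&lt;/p&gt;

```sh
# Each input line is "<jobs> <mean seconds>", copied from the results table.
speedup_table() {
    printf '%s\n' "1 16.708" "2 10.003" "4 7.232" "8 6.118" |
        awk 'NR == 1 { base = $2 }  # the first row is the single-threaded baseline
             { printf "-j %d: speedup %.2f, efficiency %.0f%%\n", $1, base / $2, 100 * base / $2 / $1 }'
}

speedup_table
```

&lt;p&gt;Going from one to eight threads is roughly 2.7 times faster on this dataset, so each extra thread is buying less and less, which fits with the files being small.&lt;/p&gt;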

&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;The benchmark I have run here is a little contrived and doesn’t really reflect the full
benefit of multiple threads in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fd&lt;/code&gt;. The fasta files I am performing MSA on are quite
small/simple, so the benefit of multiple threads isn’t quite as drastic as it would be
for more complex alignments. Part of the problem with benchmarking on more complex files
is that the single-threaded commands run far too slowly, and this is not intended to be a
full-scale benchmarking post. For a real-world (anecdotal) example, it took ~3.5 minutes to
run MAFFT on my 2,582 fasta files using 16 threads. You can see an example of how I
embedded this in a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;snakemake&lt;/code&gt; pipeline &lt;a href=&quot;https://github.com/mbhall88/head_to_head_pipeline/blob/9af773c7e5e8f861dbe05cd1e71300a856325ea2/data/H37Rv_PRG/Snakefile#L352-L379&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;final-remarks&quot;&gt;Final Remarks&lt;/h2&gt;

&lt;p&gt;I hope this post was insightful and useful. In the future, I hope to write more posts
like this one (provided people find this one helpful) on whatever comes up in my work
that I think might be interesting to others. In the meantime, feel free to get in touch
if you have any questions or comments or complaints about this post.&lt;/p&gt;</content><author><name>Michael Hall</name><email>michael@mbh.sh</email></author><category term="bioinformatics" /><category term="benchmark" /><category term="command-line-utilities" /><category term="speed" /><summary type="html"></summary></entry><entry><title type="html">Benchmarking Guppy algorithms</title><link href="https://mbh.sh/2019-02-01-benchmark-guppy-algorithms/" rel="alternate" type="text/html" title="Benchmarking Guppy algorithms" /><published>2019-02-01T00:00:00+00:00</published><updated>2019-02-01T00:00:00+00:00</updated><id>https://mbh.sh/benchmark-guppy-algorithms</id><content type="html" xml:base="https://mbh.sh/2019-02-01-benchmark-guppy-algorithms/">&lt;ul id=&quot;markdown-toc&quot;&gt;
  &lt;li&gt;&lt;a href=&quot;#methods&quot; id=&quot;markdown-toc-methods&quot;&gt;Methods&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#basecall&quot; id=&quot;markdown-toc-basecall&quot;&gt;Basecall&lt;/a&gt;        &lt;ul&gt;
          &lt;li&gt;&lt;a href=&quot;#default-config&quot; id=&quot;markdown-toc-default-config&quot;&gt;Default config&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;#flip-flop-config&quot; id=&quot;markdown-toc-flip-flop-config&quot;&gt;Flip-flop config&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#barcode-demultiplexing&quot; id=&quot;markdown-toc-barcode-demultiplexing&quot;&gt;Barcode demultiplexing&lt;/a&gt;        &lt;ul&gt;
          &lt;li&gt;&lt;a href=&quot;#deepbinner-classification&quot; id=&quot;markdown-toc-deepbinner-classification&quot;&gt;Deepbinner classification&lt;/a&gt;&lt;/li&gt;
          &lt;li&gt;&lt;a href=&quot;#deepbinner-binning&quot; id=&quot;markdown-toc-deepbinner-binning&quot;&gt;Deepbinner binning&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#adapter-trimming&quot; id=&quot;markdown-toc-adapter-trimming&quot;&gt;Adapter trimming&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#map&quot; id=&quot;markdown-toc-map&quot;&gt;Map&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#plotting&quot; id=&quot;markdown-toc-plotting&quot;&gt;Plotting&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#results&quot; id=&quot;markdown-toc-results&quot;&gt;Results&lt;/a&gt;    &lt;ul&gt;
      &lt;li&gt;&lt;a href=&quot;#quality-vs-read-length&quot; id=&quot;markdown-toc-quality-vs-read-length&quot;&gt;Quality vs Read length&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#total-yield&quot; id=&quot;markdown-toc-total-yield&quot;&gt;Total yield&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#gc-content&quot; id=&quot;markdown-toc-gc-content&quot;&gt;GC content&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#read-identity&quot; id=&quot;markdown-toc-read-identity&quot;&gt;Read identity&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a href=&quot;#relative-read-length&quot; id=&quot;markdown-toc-relative-read-length&quot;&gt;Relative read length&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#conclusions&quot; id=&quot;markdown-toc-conclusions&quot;&gt;Conclusions&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;#supplementary-code&quot; id=&quot;markdown-toc-supplementary-code&quot;&gt;Supplementary code&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ONT’s basecaller Guppy has recently been released to the masses. And with the announcement of the new “&lt;a href=&quot;https://community.nanoporetech.com/posts/pre-release-of-stand-alone&quot;&gt;flip-flop&lt;/a&gt;” basecalling algorithm there is now the choice of two different algorithms for basecalling.&lt;/p&gt;

&lt;p&gt;ONT has obviously been singing flip-flop’s praises, and understandably so, as the &lt;a href=&quot;https://community.nanoporetech.com/posts/pre-release-of-stand-alone&quot;&gt;initial results&lt;/a&gt; look like a decent step up in read accuracy.&lt;/p&gt;

&lt;p&gt;For an upcoming project I am going to be doing &lt;em&gt;a lot&lt;/em&gt; of basecalling of &lt;em&gt;Mycobacterium tuberculosis&lt;/em&gt;. Given that the project will involve assessing metrics heavily reliant on read accuracy, I thought it best to invest some time in deciding which algorithm to go with. Another reason for my indecision was a &lt;a href=&quot;https://omicsomics.blogspot.com/2018/12/flappie-vs-albacore-via-counterr.html&quot;&gt;recent blog post from Keith Robison&lt;/a&gt;, which suggested that the new flip-flop algorithm may not work well with organisms that have a higher GC content.&lt;/p&gt;

&lt;p&gt;As &lt;em&gt;M. tuberculosis&lt;/em&gt; has a GC content around 65% I thought it best to do a little benchmarking of the two basecalling algorithms first. Unfortunately for me, I couldn’t really rely on the results from &lt;a href=&quot;https://github.com/rrwick/Basecalling-comparison&quot;&gt;Ryan Wick’s wonderful basecalling comparison&lt;/a&gt; due to the species he used, &lt;em&gt;E. coli&lt;/em&gt;, having a roughly even GC content.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: Just before publishing this post Ryan &lt;a href=&quot;https://www.biorxiv.org/node/164533.full&quot;&gt;released an updated version of the comparison as a preprint&lt;/a&gt;. In the test set there was one bacterium, &lt;em&gt;Stenotrophomonas maltophilia&lt;/em&gt;, with a &lt;a href=&quot;https://www.ncbi.nlm.nih.gov/genome/?term=Stenotrophomonas%20maltophilia[Organism]&amp;amp;cmd=DetailsSearch&quot;&gt;GC content similar&lt;/a&gt; to &lt;em&gt;M. tuberculosis&lt;/em&gt;. Figure 2 in that paper shows flip-flop as having a higher read identity than the default Guppy algorithm.&lt;/p&gt;

&lt;p&gt;What I will do here is walk through a small-scale basecalling algorithm comparison of the default Guppy algorithm and the flip-flop algorithm that comes as a config option with Guppy.&lt;/p&gt;

&lt;p&gt;The data I am using to run this analysis was sequenced on an R9.4.1 flowcell. It was also a multiplexed run with 5 clinical samples of &lt;em&gt;M. tuberculosis&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I’ll add in some code snippets for how I ran this analysis so you can recreate it at home with your own data too. If you aren’t interested and just want to see some results then feel free to &lt;a href=&quot;#results&quot;&gt;skip ahead&lt;/a&gt;.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;methods&quot;&gt;Methods&lt;/h1&gt;

&lt;h2 id=&quot;basecall&quot;&gt;Basecall&lt;/h2&gt;

&lt;p&gt;The only thing we need to change in order to use the flip-flop algorithm is the config file.&lt;/p&gt;

&lt;h3 id=&quot;default-config&quot;&gt;Default config&lt;/h3&gt;
&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;Guppy_testing/normal
&lt;span class=&quot;nv&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;../fast5
&lt;span class=&quot;nv&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;basecalled_fastq/

guppy_basecaller &lt;span class=&quot;nt&quot;&gt;--input_path&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$input&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--save_path&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$output&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--recursive&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--verbose_logs&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--worker_threads&lt;/span&gt; 32 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--config&lt;/span&gt; dna_r9.4.1_450bps.cfg
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Basecalling took 120077.25 CPU seconds. As there are 1009917 reads total, that is approximately 505 reads/min.&lt;/p&gt;

&lt;h3 id=&quot;flip-flop-config&quot;&gt;Flip-flop config&lt;/h3&gt;
&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;Guppy_testing/flipflop
&lt;span class=&quot;nv&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;../fast5
&lt;span class=&quot;nv&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;basecalled_fastq/

guppy_basecaller &lt;span class=&quot;nt&quot;&gt;--input_path&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$input&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--save_path&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$output&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--recursive&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--verbose_logs&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--worker_threads&lt;/span&gt; 32 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--config&lt;/span&gt; dna_r9.4.1_450bps_flipflop.cfg
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Basecalling took 4051443 CPU seconds. As there are 1009917 reads total, that is approximately 15 reads/min.&lt;/p&gt;
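
&lt;p&gt;For anyone wanting to check the arithmetic, here is a small sketch of how the two rates fall out of the figures quoted above. Note these are reads per CPU-minute, not per wall-clock minute. The function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reads_per_cpu_min&lt;/code&gt; is just something I made up for illustration.&lt;/p&gt;

```sh
# $1 = total number of reads, $2 = CPU seconds used for basecalling
reads_per_cpu_min() {
    awk -v reads="$1" -v secs="$2" 'BEGIN { printf "%.0f\n", reads / (secs / 60) }'
}

default_rate=$(reads_per_cpu_min 1009917 120077.25)
flipflop_rate=$(reads_per_cpu_min 1009917 4051443)

# How many times more CPU time flip-flop needed than the default config.
cpu_ratio=$(awk 'BEGIN { printf "%.1f\n", 4051443 / 120077.25 }')

echo "default:   $default_rate reads per CPU-minute"
echo "flip-flop: $flipflop_rate reads per CPU-minute"
echo "flip-flop used $cpu_ratio times the CPU time"
```

&lt;p&gt;So on CPU, flip-flop cost a bit over 30 times as much compute as the default config for this run.&lt;/p&gt;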

&lt;p&gt;At the time of writing this, I have not been able to run Guppy on the GPUs here. But once I have done that I will add the runtime figures for that too.&lt;/p&gt;

&lt;p&gt;In terms of wall clock time, I ran the default config on 32 cores and it completed in 4.33 hours. For the flip-flop, I also ran it on 32 cores and it completed in 35.33 hours.&lt;/p&gt;

&lt;h2 id=&quot;barcode-demultiplexing&quot;&gt;Barcode demultiplexing&lt;/h2&gt;

&lt;p&gt;As this is a 5x multiplexed sample I chose to use Ryan Wick’s &lt;a href=&quot;https://github.com/rrwick/Deepbinner&quot;&gt;Deepbinner&lt;/a&gt; tool for demultiplexing. From the results in the Deepbinner &lt;a href=&quot;https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006583&quot;&gt;paper&lt;/a&gt;, and from my own personal testing, Deepbinner saves a lot more reads from the dreaded “unknown” bin.&lt;/p&gt;

&lt;h3 id=&quot;deepbinner-classification&quot;&gt;Deepbinner classification&lt;/h3&gt;
&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;fast5_dir&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;../fast5
&lt;span class=&quot;nv&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;classification

deepbinner classify &lt;span class=&quot;nt&quot;&gt;--native&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$fast5_dir&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$output&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I ran the deepbinner classification step on a GPU and it took 10 hours to classify all 1009917 reads - so approximately 1683 reads/min.&lt;/p&gt;

&lt;h3 id=&quot;deepbinner-binning&quot;&gt;Deepbinner binning&lt;/h3&gt;

&lt;p&gt;Split the reads into separate fastq files for each barcode based on the classifications learned.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;Guppy_testing/normal
&lt;span class=&quot;nv&quot;&gt;classifications&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;../classification
&lt;span class=&quot;nv&quot;&gt;out_dir&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;barcode_bins/
&lt;span class=&quot;nv&quot;&gt;reads_dir&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;basecalled_fastq/

&lt;span class=&quot;c&quot;&gt;# combine all the fastq files into a single one&lt;/span&gt;
find &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$reads_dir&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-name&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'*.fastq'&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-exec&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;cat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{}&lt;/span&gt; + &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; tmp_reads.fastq

deepbinner bin &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--classes&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$classifications&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--reads&lt;/span&gt; tmp_reads.fastq &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--out_dir&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$out_dir&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;

&lt;span class=&quot;nb&quot;&gt;rm &lt;/span&gt;tmp_reads.fastq
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Do the same thing for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Guppy_testing/flipflop&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&quot;adapter-trimming&quot;&gt;Adapter trimming&lt;/h2&gt;
&lt;p&gt;Chop off adapter sequences using another of Ryan Wick’s tools, &lt;a href=&quot;https://github.com/rrwick/Porechop&quot;&gt;Porechop&lt;/a&gt;.&lt;/p&gt;
&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;Guppy_testing/normal
&lt;span class=&quot;nv&quot;&gt;outdir&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;barcode_bins/

&lt;span class=&quot;c&quot;&gt;# I only expect barcodes 1-5&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;f &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;find barcode_bins/ &lt;span class=&quot;nt&quot;&gt;-type&lt;/span&gt; f | &lt;span class=&quot;nb&quot;&gt;grep&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-E&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'barcode0[1-5].fastq.gz'&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;do
  &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;basename&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$f&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;
  porechop &lt;span class=&quot;nt&quot;&gt;--input&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$f&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
          &lt;span class=&quot;nt&quot;&gt;--output&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$outdir&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;/&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;%%.*&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;.trimmed.fastq.gz &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
          &lt;span class=&quot;nt&quot;&gt;--discard_middle&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Do the same thing for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Guppy_testing/flipflop&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&quot;map&quot;&gt;Map&lt;/h2&gt;
&lt;p&gt;Read accuracy will be assessed by aligning the reads to the reference genome. The alignment is done using &lt;a href=&quot;https://github.com/lh3/minimap2&quot;&gt;minimap2&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;Guppy_testing/normal
&lt;span class=&quot;nv&quot;&gt;reference&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;../NC_000962.3.fa
&lt;span class=&quot;nv&quot;&gt;outdir&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;mapped/
&lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;f &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;find barcode_bins/ &lt;span class=&quot;nt&quot;&gt;-name&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'*trimmed*'&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;do
  &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;sample&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;basename&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;%%.*&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;nv&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$outdir&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;/&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$sample&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;.sorted.bam
  minimap2 &lt;span class=&quot;nt&quot;&gt;-ax&lt;/span&gt; map-ont &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$reference&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$f&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; | samtools &lt;span class=&quot;nb&quot;&gt;sort&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$output&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; -
&lt;span class=&quot;k&quot;&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Do the same thing for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Guppy_testing/flipflop&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&quot;plotting&quot;&gt;Plotting&lt;/h2&gt;
&lt;p&gt;I did some quality control plotting using a python package I developed called &lt;a href=&quot;https://github.com/mbhall88/pistis&quot;&gt;Pistis&lt;/a&gt;.&lt;/p&gt;
&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;cd&lt;/span&gt; /hps/nobackup/research/zi/mbhall/Guppy_testing/normal
&lt;span class=&quot;k&quot;&gt;for &lt;/span&gt;i &lt;span class=&quot;k&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;1..5&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;do
  &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;bam&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;mapped/barcode0&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$i&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;.sorted.bam
  &lt;span class=&quot;nv&quot;&gt;reads&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;barcode_bins/barcode0&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$i&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;.trimmed.fastq.gz
  &lt;span class=&quot;nv&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;reports/barcode0&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$i&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;_pistis.pdf
  pistis &lt;span class=&quot;nt&quot;&gt;--fastq&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$reads&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--bam&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$bam&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
      &lt;span class=&quot;nt&quot;&gt;--output&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$output&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--downsample&lt;/span&gt; 0
&lt;span class=&quot;k&quot;&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Do the same thing for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Guppy_testing/flipflop&lt;/code&gt;.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;results&quot;&gt;Results&lt;/h1&gt;

&lt;h2 id=&quot;quality-vs-read-length&quot;&gt;Quality vs Read length&lt;/h2&gt;

&lt;p&gt;Probably the most startling thing for me initially was the difference in Phred quality scores the two algorithms were producing.&lt;/p&gt;

&lt;p class=&quot;figure&quot;&gt;&lt;img src=&quot;/assets/img/posts/guppy/default_quality_vs_len.png&quot; alt=&quot;Pistis quality vs read length plot for default algorithm&quot; /&gt;
Figure 1: Guppy default basecalling algorithm quality vs read length. The y-axis shows the average Phred quality score for each read. The x-axis is the read length in base pairs.&lt;/p&gt;

&lt;p&gt;We can see from Figure 1 above that the Phred scores for the default algorithm are centred around 14. However, when we look at the same plot for the flip-flop algorithm (Figure 2), we see a very different story in terms of quality scores.&lt;/p&gt;

&lt;p class=&quot;figure&quot;&gt;&lt;img src=&quot;/assets/img/posts/guppy/flipflop_quality_vs_len.png&quot; alt=&quot;Pistis quality vs read length plot for flip-flop algorithm&quot; /&gt;
Figure 2: Guppy flip-flop basecalling algorithm quality vs read length. The y-axis shows the average Phred quality score for each read. The x-axis is the read length in base pairs.&lt;/p&gt;

&lt;p&gt;As you can see, flip-flop seems to rate itself &lt;em&gt;very&lt;/em&gt; highly. The densest part of the distribution sits around a Phred score of 42. Yes, &lt;strong&gt;42&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At the end of the day though, I don’t generally pay much attention to the quality scores. I am more interested in how well the reads match what I expect them to, i.e. the “truth”. As I don’t have an absolute truth for this particular dataset, I am going to use the &lt;a href=&quot;https://www.ncbi.nlm.nih.gov/nuccore/NC_000962.3&quot;&gt;&lt;em&gt;M. tuberculosis&lt;/em&gt; reference, NC_000962.3&lt;/a&gt;, as a decent approximation. I know, it’s not ideal, but it’s the best I have access to at the moment.&lt;/p&gt;

&lt;p&gt;Figures 1 &amp;amp; 2 were produced from my package &lt;a href=&quot;https://github.com/mbhall88/pistis&quot;&gt;Pistis&lt;/a&gt;. For the following plots, I will post the code for the functions I used to prepare the data at the end of this post.&lt;/p&gt;

&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pysam&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plt&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pathlib&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Path&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;collections&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Counter&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;seaborn&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sns&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pandas&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# get the paths for the bam files
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normal_bams&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'../normal'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rglob&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'*.bam'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;flipflop_bams&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'../flipflop'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rglob&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'*.bam'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# gather all the required info into a dataframe
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normal_df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stats_for_bams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normal_bams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;normal_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;model&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;normal&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;flipflop_df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stats_for_bams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;flipflop_bams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;flipflop_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;model&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;flipflop&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;concat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;normal_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;flipflop_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;total-yield&quot;&gt;Total yield&lt;/h2&gt;

&lt;p&gt;Let’s see if there is a major difference in the raw number of base pairs we get from each basecalling algorithm.&lt;/p&gt;

&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# faster to sum all the bases for each barcode/model into a dataframe
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;yield_df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;groupby&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;by&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'model'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'barcode'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]).&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;yield_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reset_index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;level&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'model'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'barcode'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inplace&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;fig&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ax&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subplots&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;figsize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sns&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;barplot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;yield_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;barcode&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;aligned_bases&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;hue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;model&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hue_order&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'normal'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'flipflop'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ax&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Total yield&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ylabel&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;aligned bases (bp)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p class=&quot;figure&quot;&gt;&lt;img src=&quot;/assets/img/posts/guppy/total_yield.png&quot; alt=&quot;Total Yield&quot; /&gt;
Figure 3: Total number of aligned bases produced by the Guppy default (blue) and flip-flop (orange) algorithms for each barcode.&lt;/p&gt;

&lt;p&gt;As you can see, flip-flop consistently produces more bases. The impact of this will be seen when we look at the relative read lengths for both algorithms.&lt;/p&gt;

&lt;h2 id=&quot;gc-content&quot;&gt;GC content&lt;/h2&gt;

&lt;p&gt;As mentioned earlier, &lt;em&gt;M. tuberculosis&lt;/em&gt; has a GC content around 65%. Will this have an impact on the new basecaller as &lt;a href=&quot;https://omicsomics.blogspot.com/2018/12/flappie-vs-albacore-via-counterr.html&quot;&gt;Keith Robison seemed to suspect&lt;/a&gt;?&lt;/p&gt;

&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;sns&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_style&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;whitegrid&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;fig&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ax&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subplots&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;figsize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sns&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;violinplot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'barcode'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'gc_content'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;quartile&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                   &lt;span class=&quot;n&quot;&gt;hue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'model'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hue_order&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'normal'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'flipflop'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ax&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;GC content&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ylabel&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;GC proportion per read (%)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p class=&quot;figure&quot;&gt;&lt;img src=&quot;/assets/img/posts/guppy/gc_content.png&quot; alt=&quot;GC content&quot; /&gt;
Figure 4: GC content for each barcode calculated on a per-read basis for both the default (blue) and flip-flop (orange) algorithms of Guppy.&lt;/p&gt;

&lt;p&gt;I plotted this many different ways and the distributions were nearly identical every way I looked at it. So I guess the flip-flop algorithm may have changed a bit since Keith looked at it, or potentially ONT has some &lt;em&gt;M. tuberculosis&lt;/em&gt; in their training dataset?&lt;/p&gt;
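
&lt;p&gt;For context, per-read GC content is simple to compute. The sketch below is only an illustration and is not the code used to build the dataframe behind Figure 4. The helper name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gc_content&lt;/code&gt; is chosen to mirror the column name, and the example sequence is made up.&lt;/p&gt;

&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def gc_content(sequence):
    # percentage of G and C bases in a read; an empty read is reported as 0.0
    if not sequence:
        return 0.0
    seq = sequence.upper()
    gc = seq.count('G') + seq.count('C')
    return 100 * gc / len(seq)


# made-up read with 6 G/C bases out of 8
print(gc_content('GGCCGATC'))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;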

&lt;h2 id=&quot;read-identity&quot;&gt;Read identity&lt;/h2&gt;

&lt;p&gt;This is the plot I was most interested in, and for me it is the most important one. How identical are the reads to the section of the reference they map to? As I mentioned already, we don’t have absolute truth here, but the reference is a pretty close approximation. In effect, the metric asks: for the reads that align (I ignore unmapped reads and secondary/supplementary alignments), how similar is the sequence to the reference at that location? I have cut off the axis at 50% to get a clearer view of the bulk of the distribution, but the tails extend below 50%.&lt;/p&gt;

&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;sns&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_style&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;whitegrid&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;fig&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ax&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subplots&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;figsize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sns&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;violinplot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'barcode'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'pid'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;quartile&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                   &lt;span class=&quot;n&quot;&gt;hue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'model'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hue_order&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'normal'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'flipflop'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ax&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Read identity&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;xlabel&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Read percent identity (%)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_xlim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;legend&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'lower right'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p class=&quot;figure&quot;&gt;&lt;img src=&quot;/assets/img/posts/guppy/pid.png&quot; alt=&quot;Read percent identity&quot; /&gt;
Figure 5: Read percent identity for primary alignments to the &lt;em&gt;M. tuberculosis&lt;/em&gt; reference, NC_000962.3. Blue shows the default algorithm for Guppy and orange shows the flip-flop algorithm. The dashed lines within the violins show the quartiles of the data.&lt;/p&gt;

&lt;p&gt;Wow! That is a pretty good improvement. On average, flip-flop has about 2% higher read identity compared to Guppy’s default algorithm.&lt;/p&gt;
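
&lt;p&gt;For anyone wondering how a number like this can be derived from an alignment, here is a minimal sketch of one common (BLAST-like) definition: matching columns divided by total alignment columns. This is an assumption for illustration and not necessarily how the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pid&lt;/code&gt; column above was computed. It takes the CIGAR as (operation, length) tuples using the SAM operation codes, plus the NM edit distance. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;percent_identity&lt;/code&gt; is a hypothetical helper and the example alignment is made up.&lt;/p&gt;

&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# SAM CIGAR operation codes that form alignment columns:
# 0 (M), 1 (I), 2 (D), 7 (=), 8 (X)
ALIGNMENT_COLUMN_OPS = {0, 1, 2, 7, 8}


def percent_identity(cigar_tuples, edit_distance):
    # every aligned, inserted or deleted base counts as one alignment column;
    # clipped bases (codes 4 and 5) are ignored
    columns = sum(n for op, n in cigar_tuples if op in ALIGNMENT_COLUMN_OPS)
    if columns == 0:
        return 0.0
    # NM counts mismatches plus inserted and deleted bases
    matches = columns - edit_distance
    return 100 * matches / columns


# made-up alignment: 5 soft-clipped, 90 aligned, 5 inserted, 5 deleted bases
# NM = 12, i.e. 2 mismatches plus 10 indel bases
print(percent_identity([(4, 5), (0, 90), (1, 5), (2, 5)], 12))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With pysam, inputs of this shape should be available from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;read.cigartuples&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;read.get_tag('NM')&lt;/code&gt;, provided the aligner wrote the NM tag.&lt;/p&gt;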

&lt;h2 id=&quot;relative-read-length&quot;&gt;Relative read length&lt;/h2&gt;

&lt;p&gt;To see whether the algorithms are causing insertions and/or deletions we can look at the relative read length. That is, we take the length of the &lt;em&gt;aligned&lt;/em&gt; part of the read and divide it by the length of the &lt;em&gt;aligned&lt;/em&gt; part of the reference. A value below 1.0 means the read has more deletions than insertions, and a value above 1.0 means it has more insertions than deletions - compared to the reference of course.&lt;/p&gt;

&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;sns&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_style&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;whitegrid&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;fig&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ax&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subplots&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;figsize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sns&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;violinplot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'barcode'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'rel_len'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inner&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;quartile&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                   &lt;span class=&quot;n&quot;&gt;hue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'model'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hue_order&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'normal'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'flipflop'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ax&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;title&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Relative read length&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ylabel&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;read alignment length / ref alignment length&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_ylim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.75&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1.25&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p class=&quot;figure&quot;&gt;&lt;img src=&quot;/assets/img/posts/guppy/rel_len.png&quot; alt=&quot;Relative read length&quot; /&gt;
Figure 6: Relative read length for Guppy’s default (blue) and flip-flop (orange) algorithms. Relative read length is calculated as the length of the aligned part of the read divided by the length of the aligned part of the reference.&lt;/p&gt;

&lt;p&gt;So it appears that flip-flop, on average, causes more deletions than insertions, but it is definitely an improvement on the default algorithm. As we saw from the total yield plot, flip-flop produces more bases and the outcome of that, at least for &lt;em&gt;M. tuberculosis&lt;/em&gt; in this case, is fewer deletions.&lt;/p&gt;
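
&lt;p&gt;To make the ratio concrete, here is a minimal sketch that computes it from a CIGAR string in plain Python, without pysam. The function name &lt;code&gt;relative_length_from_cigar&lt;/code&gt; and the example CIGAR strings are made up for illustration; the supplementary code further down uses pysam’s &lt;code&gt;query_alignment_length&lt;/code&gt; and &lt;code&gt;reference_length&lt;/code&gt; instead. Soft-clipped bases are left out, on the assumption that only the aligned part of the read should count.&lt;/p&gt;

```python
import re

# CIGAR operations that consume read (query) bases inside the alignment.
# Soft clips (S) are deliberately excluded: they are not aligned.
QUERY_OPS = set("MI=X")
# CIGAR operations that consume reference bases.
REF_OPS = set("MDN=X")


def relative_length_from_cigar(cigar):
    """Return aligned read length / aligned reference length (hypothetical helper)."""
    query_len = 0
    ref_len = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op in QUERY_OPS:
            query_len += int(length)
        if op in REF_OPS:
            ref_len += int(length)
    return query_len / ref_len


# Illustrative CIGAR strings, not taken from the real data.
print(relative_length_from_cigar("5S50M10D40M"))  # read with a 10 bp deletion
print(relative_length_from_cigar("50M10I40M"))    # read with a 10 bp insertion
```

&lt;p&gt;The first read is missing 10 reference bases, giving 90/100 = 0.9; the second carries 10 extra bases, giving 100/90, or roughly 1.11.&lt;/p&gt;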

&lt;hr /&gt;

&lt;h1 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h1&gt;

&lt;p&gt;So in conclusion, given the results from Ryan Wick on &lt;em&gt;S. maltophilia&lt;/em&gt; and those presented here on &lt;em&gt;M. tuberculosis&lt;/em&gt;, you can make a strong argument for using the flip-flop algorithm over the default for GC-rich genomes without much concern regarding accuracy. You get more accurate reads with fewer deletions. But the big caveat is time. Flip-flop is much slower than the default algorithm. At least on CPUs, it is probably only feasible to use flip-flop if you have a computing cluster with at least 16 cores you can grab, unless you want to smash your laptop for a week or so. As I said earlier, I have not been able to run Guppy on GPUs yet, so I am interested to see how much faster flip-flop on a GPU is compared to the CPU version.&lt;/p&gt;

&lt;p&gt;I hope someone finds this useful. And of course, if you have any problems with anything I have done please do get in touch.&lt;/p&gt;

&lt;hr /&gt;

&lt;h1 id=&quot;supplementary-code&quot;&gt;Supplementary code&lt;/h1&gt;

&lt;p&gt;This code was used for preparing the data for plotting.&lt;/p&gt;

&lt;div class=&quot;language-py highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pysam&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pathlib&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Path&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;collections&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Counter&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pandas&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;gc_content&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sequence&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as_decimal&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;Returns the GC content for the sequence.
    Notes:
        This method ignores N when calculating the length of the sequence.
        It does not, however, ignore other ambiguous bases, and the only
        ambiguous base it counts towards GC is S (G or C). In this sense,
        the method is conservative with its calculation.
    Args:
        sequence (str): A DNA string.
        as_decimal (bool): Return the result as a decimal. Setting to False
        will return as a percentage. i.e. for the sequence GCAT it will
        return 0.5 by default and 50.00 if set to False.
    Returns:
        float: GC content calculated as the number of G, C, and S divided
        by the number of (non-N) bases (length).
    &quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;gc_total&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.0&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;num_bases&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.0&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;n_tuple&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;tuple&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'nN'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;accepted_bases&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;tuple&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'cCgGsS'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Counter tallies each distinct character; the tuples above cover both cases.
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;base&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sequence&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;items&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;

        &lt;span class=&quot;c1&quot;&gt;# don't count N in the number of bases
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;base&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n_tuple&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;num_bases&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;

            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;base&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;accepted_bases&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# S is a G or C
&lt;/span&gt;                &lt;span class=&quot;n&quot;&gt;gc_total&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;

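    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_bases&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# empty or all-N sequence, avoid dividing by zero
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.0&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gc_total&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_bases&lt;/span&gt;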

    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as_decimal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# return as percentage
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;get_percent_identity&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;Calculates the percent identity of a read from the NM tag if present;
    if not, calculates it from the MD tag and CIGAR string.
    Args:
        read (pysam.AlignedSegment): A pysam read alignment record.
    Returns:
        The percent identity or None if required fields are not present.
    &quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_tag&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;NM&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;query_alignment_length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
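    &lt;span class=&quot;k&quot;&gt;except&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;KeyError&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# No NM tag: fall back to the MD tag and CIGAR string. The helpers
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# _parse_md_flag and _parse_cigar are not shown in this listing.
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;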
            &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                    &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_parse_md_flag&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_tag&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;MD&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;
                         &lt;span class=&quot;n&quot;&gt;_parse_cigar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cigartuples&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;query_alignment_length&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;except&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;KeyError&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;except&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;ZeroDivisionError&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;relative_read_length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;Calculates the relative read length of the given read.
    That is, read aligned length/reference aligned length.

    Args:
        read (pysam.AlignedSegment): A pysam read alignment record.

    Returns:
        Relative read length as a float.

    &quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;query_alignment_length&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reference_length&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;sam_read_stats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filepath&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;Opens a SAM/BAM file and extracts the read percent identity for all
    mapped reads that are not supplementary or secondary alignments.
    Args:
        filepath (Path): Path to SAM/BAM file.
    Returns:
        A pandas dataframe with one row per read and the columns:
            1. 'pid' - read percent identity.
            2. 'rel_len' - relative read length.
            3. 'aligned_bases' - length of query aligned segment.
            4. 'gc_content' - GC content of the read as a percentage.
            5. 'read_id' - the read identifier.
    &quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# get pysam read option depending on whether file is sam or bam
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;file_ext&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;filepath&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;suffix&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;read_opt&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'rb'&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;file_ext&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'.bam'&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'r'&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# open file
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;samfile&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pysam&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AlignmentFile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filepath&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;read_opt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;stats&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;record&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;samfile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# make sure read is mapped, and is not a suppl. or secondary alignment
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;is_unmapped&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;or&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;is_supplementary&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;or&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;is_secondary&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;continue&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;pid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_percent_identity&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;relative_len&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relative_read_length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;stats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;query_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;s&quot;&gt;&quot;pid&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;s&quot;&gt;&quot;rel_len&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relative_len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;s&quot;&gt;&quot;aligned_bases&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;query_alignment_length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;s&quot;&gt;&quot;gc_content&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gc_content&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;query_sequence&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as_decimal&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DataFrame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;read_id&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reset_index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inplace&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;drop&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;stats_for_bams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;Collates stats for a given list of {s,b}am files.
    Args:
        bams (list[Path]): A list of Path objects for {s,b}am files.
    Returns:
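        A pandas dataframe of BAM stats where each row is a read and
        the columns are:
            1. 'barcode' - nanopore barcode, taken from the file name.
            2. 'pid' - read percent identity.
            3. 'rel_len' - relative read length.
            4. 'aligned_bases' - length of query aligned segment.
            5. 'gc_content' - GC content of the read as a percentage.
            6. 'read_id' - the read identifier.
        The 'model' column used for plotting is not added by this function.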
    &quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;stats&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bam&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;barcode&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bam&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'.'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sam_read_stats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bam&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'barcode'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;barcode&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;stats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;concat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;</content><author><name>Michael Hall</name><email>michael@mbh.sh</email></author><category term="bioinformatics" /><category term="nanopore" /><category term="guppy" /><category term="benchmark" /><summary type="html"></summary></entry></feed>