scr⁴eam - single-cell RNA-seq realistic random read emitting Awk mess

scr⁴eam is a small collection of scripts (that prominently feature the use of the Awk programming language) to generate realistic scRNA-seq reads based on an (isoform-level) DGE, read and fragment mapping statistics of a reference dataset, and a FASTA file with spliced transcript sequences.

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see http://www.gnu.org/licenses/.

Usage

The CLI is still a work in progress and will be added soon.

However, the scripts themselves are already documented below.

Installation

Manual

Detailed installation instructions are still a work in progress and will be added soon.

For now, simply download the scripts and try to run them. Error messages should help you identify missing dependencies.

Using conda

The conda package is still a work in progress and will be added soon.

For now, follow the manual installation instructions above.

Citation

The scr⁴eam paper is still a work in progress and will be published as a pre-print soon.

Meanwhile, if you use scr⁴eam, please reference this repository and consider citing the following paper that uses (and briefly introduces) scr⁴eam:

Quantification of transcript isoforms at the single-cell level using SCALPEL
Franz Ake, Marcel Schilling, Sandra M. Fernández-Moya, Akshay Jaya Ganesh, Ana Gutiérrez-Franco, Lei Li, Mireya Plass
Nat Commun 16, 6402 (2025). doi://10.1038/s41467-025-61118-0

Documentation of individual scripts

`extract_distributions.awk`

Extract the distributions required for read generation using scr⁴eam from the read and fragment mapping summary TSV generated by SCALPEL.

Usage

[gawk --file=]extract_distributions.awk \
  [--assign=multi_isoform_genes_txt=<multi-isoform-genes-txt>] \
  [--assign=read_dists_anchors_skipped_txt=<read-dists-anchors-skipped-txt>] \
  [--assign=n_reads_per_fragment_txt=<n-reads-per-fragment-txt>] \
  [--assign=read_distance_col=<read-distance-column>] \
  [--assign=gene_col=<gene-column>] \
  [--assign=isoform_col=<isoform-column>] \
  [--assign=fragment_col=<fragment-column>] \
  [--assign=read_col=<read-column>] \
  [--assign=fragment_distance_col=<fragment-distance-column>] \
  [< <fragments-and-reads-stats-tsv>] \
  [> <unambiguous-fragment-dists-per-gene-tsv>]

Command line arguments

<fragments-and-reads-stats-tsv>: TSV file/stream from which to read the read and fragment mapping summary data (e.g. as generated by SCALPEL); default: STDIN
<unambiguous-fragment-dists-per-gene-tsv>: TSV file/stream to which to write, for each unambiguously assigned fragment, the assigned gene and the fragment's distance (0-based) to the transcript's (3') end; default: STDOUT
multi_isoform_genes_txt: Text file to write genes expressing more than just a single isoform to (one gene per line); default: genes.multiple_expressed_isoforms.txt
read_dists_anchors_skipped_txt: Text file to write distances (0-based) to the fragment's 3' end for all non-anchor reads (i.e. excluding the first zero-distance read of each fragment) to (one integer per line); default: read_dists.anchors_skipped.txt
n_reads_per_fragment_txt: File name/path to write total read counts for each fragment to; default: n_reads.per_fragment.txt
read_distance_col: Column index (1-based) in the input TSV containing the read's distance (0-based) from the transcript's (3') end; default: 9
gene_col: Column index (1-based) in the input TSV specifying the gene the read was assigned to; default: 11
isoform_col: Column index (1-based) in the input TSV specifying the isoform the read was assigned to; default: 12
fragment_col: Column index (1-based) in the input TSV specifying the fragment the read was assigned to; default: 13
read_col: Column index (1-based) in the input TSV identifying the read the input line belongs to; default: 14
fragment_distance_col: Column index (1-based) in the input TSV containing the fragment's distance (0-based) from the transcript's (3') end; default: 15
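To make the column arguments above more concrete, here is a minimal Python sketch of two of the documented summaries: the genes with more than one isoform and the read count per fragment. It is an illustration only and not the logic of `extract_distributions.awk`. It assumes a headerless TSV, fragment IDs that are unique across the whole input, and that "read count" means the number of distinct read IDs per fragment. The function names `summarize` and `make_row` and the toy identifiers are hypothetical.

```python
import io
from collections import defaultdict

# Default 1-based column indices, copied from the argument list above.
GENE_COL, ISOFORM_COL, FRAGMENT_COL, READ_COL = 11, 12, 13, 14


def summarize(tsv_lines):
    """Return (sorted multi-isoform genes, {fragment: distinct read count}).

    Assumptions (not confirmed by the README): no header line, and
    fragment IDs are unique across the whole input.
    """
    isoforms_by_gene = defaultdict(set)
    reads_by_fragment = defaultdict(set)
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        # Convert the documented 1-based indices to Python's 0-based ones.
        isoforms_by_gene[fields[GENE_COL - 1]].add(fields[ISOFORM_COL - 1])
        reads_by_fragment[fields[FRAGMENT_COL - 1]].add(fields[READ_COL - 1])
    multi_isoform_genes = sorted(
        gene for gene, isoforms in isoforms_by_gene.items() if len(isoforms) > 1
    )
    n_reads = {frag: len(reads) for frag, reads in reads_by_fragment.items()}
    return multi_isoform_genes, n_reads


def make_row(gene, isoform, fragment, read):
    """Build a hypothetical 15-column line; columns not used here hold '.'."""
    fields = ["."] * 15
    fields[GENE_COL - 1] = gene
    fields[ISOFORM_COL - 1] = isoform
    fields[FRAGMENT_COL - 1] = fragment
    fields[READ_COL - 1] = read
    return "\t".join(fields) + "\n"


# Toy input with made-up identifiers (not real SCALPEL output).
toy = io.StringIO(
    make_row("G1", "T1", "f1", "r1")
    + make_row("G1", "T2", "f2", "r2")
    + make_row("G2", "T3", "f3", "r3")
    + make_row("G2", "T3", "f3", "r4")
)
genes, n_reads = summarize(toy)
print(genes)    # ['G1']
print(n_reads)  # {'f1': 1, 'f2': 1, 'f3': 2}
```

Subtracting 1 from each index mirrors the 1-based convention of the `*_col` arguments, so the constants can be compared directly with the defaults listed above.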

`sample_reads.R`

Sample (relative) read coordinates for scr⁴eam based on a given iDGE and the read/fragment distributions extracted from reference scRNA-seq data.

Usage

[Rscript ]sample_reads.R [-h/--help] \
  [-f/--fragment-dists \
    <unambiguous-fragment-dists-single-isoform-expressed-genes-txt>] \
  [-n/--reads-per-fragment <n-reads-per-fragment-txt>] \
  [-r/--read-distances <read-dists-anchors-skipped-txt>] \
  [-o/--simulated-reads <simulated-reads-tsv>] \
  <simulated-idge-rds>

Command line arguments

Positional

<simulated-idge-rds>: R object (rds) file from which to read the (sparse) Matrix object holding the simulated iDGE (cells in columns, transcripts in rows)

Optional

--help/-h: Show help message and exit
--fragment-dists/-f: (Compressed) text file to read the fragments' (0-based) distances to their corresponding transcript's (3') ends from (one integer per line); default: unambiguous_fragment_dists.single_isoform_expressed_genes.txt.gz
--reads-per-fragment/-n: (Compressed) text file to read the read counts per fragment from (one integer per line); default: n_reads.per_fragment.txt
--read-distances/-r: (Compressed) text file to read the reads' distances (0-based) to their corresponding fragment's 3' end for non-anchor reads (i.e. excluding the first zero-distance read of the fragment) from (one integer per line); default: read_dists.anchors_skipped.txt
--simulated-reads/-o: (Compressed) TSV file to write synthetic read data to; default: simulated_reads.tsv.gz. Columns: (1) transcript name/ID (from the input iDGE row names); (2) read ID, built by concatenating the cell ID (from the input iDGE column names), the fragment ID (fragment_N, with N increasing, 1-based), and a read component (read_M, with M increasing, 1-based) with underscores (_) as separators; (3) (relative) distance (0-based) of the read to the corresponding transcript(!)'s (3') end
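The read ID layout described for the output TSV can be sketched in Python as follows. This is only an illustration of the documented format (cell ID, fragment_N, read_M, joined by underscores) and not code taken from `sample_reads.R`. The helper names `make_read_id`, `parse_read_id`, and `format_row`, as well as the example cell and transcript IDs, are hypothetical.

```python
def make_read_id(cell_id, fragment_n, read_m):
    """Join cell ID, fragment_N, and read_M with underscores."""
    return f"{cell_id}_fragment_{fragment_n}_read_{read_m}"


def parse_read_id(read_id):
    """Split a read ID back into (cell ID, N, M).

    Splitting from the right keeps cell IDs intact even if they
    contain underscores themselves.
    """
    cell_id, frag_tag, n, read_tag, m = read_id.rsplit("_", 4)
    if (frag_tag, read_tag) != ("fragment", "read"):
        raise ValueError(f"unexpected read ID layout: {read_id!r}")
    return cell_id, int(n), int(m)


def format_row(transcript, read_id, distance):
    """Format one output line: transcript, read ID, 0-based distance."""
    return f"{transcript}\t{read_id}\t{distance}"


# Made-up example values.
rid = make_read_id("CELL_A", 3, 2)
print(rid)                            # CELL_A_fragment_3_read_2
print(parse_read_id(rid))             # ('CELL_A', 3, 2)
print(format_row("TX1", rid, 120))    # TX1<TAB>CELL_A_fragment_3_read_2<TAB>120
```

Parsing from the right is a design choice for downstream tools: the fragment and read components have a fixed shape, while cell IDs taken from the iDGE column names may contain arbitrary characters.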
