---
title: AWAggregator
author: Jiahua Tan, Gian L. Negri, Gregg B. Morin, David D. Y. Chen
output:
  html_document:
    keep_md: yes
    self_contained: no
---
# Introduction
The `AWAggregator` package implements an attribute-weighted aggregation
algorithm which leverages peptide-spectrum match (PSM) attributes to provide a
more accurate estimate of protein abundance compared to conventional
aggregation methods. This algorithm employs pre-trained random forest models to
predict the quantitative inaccuracy of PSMs based on their attributes. PSMs are
then aggregated to the protein level using a weighted average, taking the
predicted inaccuracy into account. Additionally, the package allows users to
construct their own training sets that are more relevant to their specific
experimental conditions if desired.
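
To make the weighting idea concrete, here is a minimal base-R sketch. It assumes that each PSM's weight is the reciprocal of its predicted inaccuracy; the weighting actually used by `aggregateByAttributes()` may differ. The toy values and the names `psm`, `predictedInaccuracy` and `weightedAggregate()` are hypothetical and are not part of the package.

```r
# Toy PSM table: three PSMs from one protein, two samples (hypothetical values)
psm <- data.frame(
    Protein=c('P1', 'P1', 'P1'),
    `Sample 1`=c(1.0, 1.2, 3.0),
    `Sample 2`=c(2.0, 2.2, 1.0),
    # Assumed model output: a larger value means a less trustworthy PSM
    predictedInaccuracy=c(0.1, 0.2, 1.0),
    check.names=FALSE,
    stringsAsFactors=FALSE
)

# Hypothetical helper: inverse-inaccuracy weighted mean per protein and sample
weightedAggregate <- function(psm, sampleCols) {
    w <- 1 / psm$predictedInaccuracy
    rows <- lapply(split(seq_len(nrow(psm)), psm$Protein), function(idx) {
        vapply(sampleCols, function(s) {
            weighted.mean(psm[[s]][idx], w[idx])
        }, numeric(1))
    })
    # One row per protein, one column per sample
    do.call(rbind, rows)
}

agg <- weightedAggregate(psm, c('Sample 1', 'Sample 2'))
print(agg)
```

In this toy case the third PSM, which has the largest predicted inaccuracy, contributes least, so the weighted estimate stays close to the two PSMs that agree with each other.
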

`ExperimentHub` can retrieve data from the `AWAggregatorData` package only with
Bioconductor version 3.21 or later. If you are using an earlier Bioconductor
version, please use the legacy version of the `AWAggregator` package:
https://github.com/Tan-Jiahua/AWAggregator-compat
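
As a small convenience, the version requirement can be checked programmatically. The helper `needsLegacyVersion()` below is hypothetical (it is not part of `AWAggregator`); it uses base R's `package_version()` so that versions are compared component by component rather than as decimal numbers.

```r
# Hypothetical helper: TRUE if the given Bioconductor version predates 3.21
needsLegacyVersion <- function(biocVersion) {
    package_version(biocVersion) < package_version('3.21')
}

needsLegacyVersion('3.20')
needsLegacyVersion('3.21')
```

If `BiocManager` is installed, the installed version could be passed as `needsLegacyVersion(as.character(BiocManager::version()))`.
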
## Overview of Package Functions
Functions available in the `AWAggregator` package:
* `getDistMetric()`: Calculates the distance metric for PSMs. The distance
metric reflects whether the quantified ratio of each pair of samples for a PSM
diverges from those of other PSMs in the same redundant/unique group. The
redundant group, unique group, and distance metric were originally defined in
the iPQF method. Please refer to "iPQF: a new peptide-to-protein summarization
method using peptide spectra characteristics to improve protein quantification"
for more details.
* `getPSMAttributes()`: Retrieves attributes required for training or test
sets.
* `getAvgScaledErrorOfLog2FC()`: Calculates the Average Scaled Error of
log2FC values required for training sets.
* `mergeTrainingSets()`: Extracts a similar number of PSMs from each input
dataset and merges them into a single training set.
* `fitQuantInaccuracyModel()`: Trains a random forest model to predict the
level of quantitative inaccuracy of PSMs.
* `aggregateByAttributes()`: Aggregates PSMs using a random forest model.
* `convertPDFormat()`: Converts output from Proteome Discoverer into the
input format required by `AWAggregator`.
Function available in the associated `AWAggregatorData` package:
* `loadQuantInaccuracyModel()`: Loads a pre-trained random forest model for
predicting the level of quantitative inaccuracy of PSMs.
## Overview of Package Data
Data available in the `AWAggregator` package:
* `sample.PSM.FP`: represents sample PSMs mapped to the proteins A0AV96,
A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, obtained from the `psm.tsv` output file
generated by FragPipe. Columns unnecessary for the `AWAggregator` have been
removed from the sample data.
* `sample.prot.PD`: represents sample proteins A0AV96, A0AVF1, A0AVT1,
A0FGR8, and A0M8Q6, obtained from the TXT export of the proteins page in the
Proteome Discoverer search results. Columns unnecessary for the `AWAggregator`
have been removed from the sample data.
* `sample.PSM.PD`: represents sample PSMs mapped to the proteins A0AV96,
A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, obtained from the TXT export of the PSMs
page in the Proteome Discoverer search results. Columns unnecessary for the
`AWAggregator` have been removed from the sample data.
Data available in the associated `AWAggregatorData` package:
* `regr`: represents the pre-trained random forest model that incorporates the
average coefficient of variation (CV) as a feature.
* `regr.no.CV`: represents the pre-trained random forest model that does not
include the average CV as a feature.
* `benchmark.set.1`, `benchmark.set.2`, `benchmark.set.3`: represent PSMs in
Benchmark Sets 1 to 3, derived from the `psm.tsv` output files generated by
FragPipe, which are used to train the random forest model. Columns unnecessary
for the `AWAggregator` have been removed from the datasets.
# Installation
The `AWAggregator` package and the associated `AWAggregatorData` package can be
installed from Bioconductor.
```r
if (!requireNamespace('BiocManager', quietly=TRUE))
    install.packages('BiocManager')
BiocManager::install('AWAggregator')
BiocManager::install('AWAggregatorData')
```
They can also be directly installed using the `devtools` package.
```r
install.packages('devtools')
library(devtools)
install_github("Tan-Jiahua/AWAggregator")
install_github("Tan-Jiahua/AWAggregatorData")
```
# Workflow Examples
Load the `AWAggregator` package and the `AWAggregatorData` package.
```r
library(AWAggregator)
library(AWAggregatorData)
```
## Ex.1: Aggregate PSMs from FragPipe Using the Pre-Trained Model.
In this example, we aggregate the reporter ion intensities of PSMs to the
protein level. We use the sample dataset `sample.PSM.FP`, included in the
`AWAggregator` package and derived from the `psm.tsv` output file generated by
FragPipe. This dataset includes reporter ion intensities from nine samples,
labeled from `Sample 1` to `Sample 9`, without replicates. The PSMs are mapped
to the following proteins: A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, with
unnecessary columns removed for clarity.
This example demonstrates the basic functionality of the `AWAggregator` package
using the default pre-trained model.
```r
# Load the pre-trained random forest model that does not use the average CV as
# a feature. The average CV is the average coefficient of variation (in
# percent) of processed PSM reporter ion intensities across replicate groups.
# Use the model with the average CV when replicates are available; otherwise,
# use the model without the average CV
data(sample.PSM.FP)
regr <- loadQuantInaccuracyModel(useAvgCV=FALSE)
# Load sample names (Sample 1 ~ Sample 9)
samples <- colnames(sample.PSM.FP)[grep('Sample', colnames(sample.PSM.FP))]
groups <- samples
df <- getPSMAttributes(
    PSM=sample.PSM.FP,
    # TMT tag (229.1629) and carbamidomethylation (57.0214) are applied as
    # fixed post-translational modifications (PTMs)
    fixedPTMs=c('229.1629', '57.0214'),
    colOfReporterIonInt=samples,
    groups=groups,
    setProgressBar=TRUE
)
aggregated_results <- aggregateByAttributes(
    PSM=df,
    colOfReporterIonInt=samples,
    ranger=regr,
    ratioCalc=FALSE
)
```
The output data frame provides the estimated protein abundances.
```
Protein Sample 1 Sample 2 Sample 3 Sample 4 ...
sp|A0AV96|RBM47_HUMAN 0.9292177 1.0111264 0.7933874 0.9606382 ...
sp|A0AVF1|IFT56_HUMAN 0.6646691 0.6600642 0.6696656 0.7984397 ...
sp|A0AVT1|UBA6_HUMAN 1.1883116 1.1752203 1.0482381 1.0910095 ...
sp|A0FGR8|ESYT2_HUMAN 0.9304190 0.8504465 1.0550898 0.7952998 ...
sp|A0M8Q6|IGLC7_HUMAN 0.4205675 0.6393757 0.7475482 0.6968704 ...
```
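
As a brief follow-up, the sketch below shows one way such a table might be used downstream. It rebuilds a small data frame from the first two sample columns of the output printed above (three proteins only), then derives a log2 fold change and the UniProt accession. The object `res` and the added columns `log2FC` and `Accession` are illustrative names, not output of the package.

```r
# Values copied from the first two sample columns of the output shown above
res <- data.frame(
    Protein=c(
        'sp|A0AV96|RBM47_HUMAN', 'sp|A0AVF1|IFT56_HUMAN',
        'sp|A0AVT1|UBA6_HUMAN'
    ),
    `Sample 1`=c(0.9292177, 0.6646691, 1.1883116),
    `Sample 2`=c(1.0111264, 0.6600642, 1.1752203),
    check.names=FALSE,
    stringsAsFactors=FALSE
)

# log2 fold change of Sample 2 relative to Sample 1
res$log2FC <- log2(res[['Sample 2']] / res[['Sample 1']])

# The accession is the second '|'-separated field of the FragPipe protein ID
res$Accession <- vapply(
    strsplit(res$Protein, '|', fixed=TRUE), `[`, character(1), 2
)
print(res[, c('Accession', 'log2FC')])
```
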
## Ex.2: Aggregate PSMs from Proteome Discoverer Using the Pre-Trained Model.
In this example, we convert the search result from Proteome Discoverer to the
format required by `AWAggregator` and aggregate the reporter ion intensities of
PSMs to the protein level. We use the sample dataset `sample.PSM.PD`, alongside
its corresponding protein table `sample.prot.PD`, both included in the
`AWAggregator` package. These files are derived from the TXT exports of the
proteins and PSMs pages in the search results from Proteome Discoverer. This
dataset includes reporter ion intensities from nine samples, labeled from
`Sample 1` to `Sample 9`, without replicates. The PSM and protein tables
contain the following proteins: A0AV96, A0AVF1, A0AVT1, A0FGR8, and A0M8Q6, with
unnecessary columns removed for clarity.
```r
# Load the pre-trained random forest model that does not use the average CV as
# a feature. The average CV is the average coefficient of variation (in
# percent) of processed PSM reporter ion intensities across replicate groups.
# Use the model with the average CV when replicates are available; otherwise,
# use the model without the average CV
data(sample.PSM.PD)
data(sample.prot.PD)
regr <- loadQuantInaccuracyModel(useAvgCV=FALSE)
# Load sample names (Sample 1 ~ Sample 9)
samples <- colnames(sample.PSM.PD)[grep('Sample', colnames(sample.PSM.PD))]
groups <- samples
df <- convertPDFormat(
    PSM=sample.PSM.PD,
    protein=sample.prot.PD,
    colOfReporterIonInt=samples
)
df <- getPSMAttributes(
    PSM=df,
    # TMT tag and carbamidomethylation are applied as static PTMs
    fixedPTMs=c('TMT6plex', 'Carbamidomethyl'),
    colOfReporterIonInt=samples,
    groups=groups,
    setProgressBar=TRUE
)
aggregated_results <- aggregateByAttributes(
    PSM=df,
    colOfReporterIonInt=samples,
    ranger=regr,
    ratioCalc=FALSE
)
```
The output data frame provides the estimated protein abundances.
```
Protein Sample 1 Sample 2 Sample 3 Sample 4 ...
A0AV96_Homo sapiens 0.9392033 0.9514846 0.7096284 0.9393484 ...
A0AVF1_Homo sapiens 0.6591366 0.6534372 0.7121089 0.7741971 ...
A0AVT1_Homo sapiens 1.2035820 1.1647425 1.0494833 1.1121796 ...
A0FGR8_Homo sapiens 0.9664924 0.8391658 1.0946545 0.7832414 ...
A0M8Q6_Homo sapiens 0.3516833 0.4695273 0.7225070 0.6042526 ...
```
## Ex.3: Build a Merged Training Set and Retrain the Model.
Retraining the AWA model with additional spike-in datasets can increase the
number of quantified PSMs in the merged training set, and hence make the
learned correlation more robust. In addition, retraining with
experiment-specific in-house spike-in datasets may further benefit the machine
learning model by better representing the hardware and acquisition modes
employed.
In this example, we create a training set by merging three benchmark spike-in
datasets (`benchmark.set.1`, `benchmark.set.2`, and `benchmark.set.3`), all
included in the `AWAggregatorData` package and derived from the `psm.tsv`
output files generated by FragPipe. This combined training set is then used to
train a random forest model.
### Step 1: Load Spike-in Datasets
We load the spike-in datasets using the `ExperimentHub` package. These datasets
correspond to the sets described in the `AWAggregator` publication. You may
substitute your own spike-in datasets if desired.
```r
library(ExperimentHub)
eh <- ExperimentHub()
benchmarkSet1 <- eh[['EH9637']] # Benchmark Set 1
benchmarkSet2 <- eh[['EH9638']] # Benchmark Set 2
benchmarkSet3 <- eh[['EH9639']] # Benchmark Set 3
```
### Step 2: Calculate PSM Attributes and Average Scaled Error of log~2~FC
First, we calculate the PSM attributes and the Average Scaled Error of
log~2~FC values for `benchmark.set.1`.
```r
library(stringr)
# Load sample names (Sample 'H1+E1_1' ~ Sample 'H1+E6_3')
samples <- colnames(benchmarkSet1)[
    grep('H1[+]E[0-9]+_[1-4]', colnames(benchmarkSet1))
]
groups <- str_match(samples, 'H1[+]E[0-9]+')[, 1]
PSM1 <- getPSMAttributes(
    PSM=benchmarkSet1,
    # TMT tag (229.1629) and carbamidomethylation (57.0214) are applied as
    # fixed PTMs
    fixedPTMs=c('229.1629', '57.0214'),
    colOfReporterIonInt=samples,
    groups=groups
)
PSM1 <- getAvgScaledErrorOfLog2FC(
    PSM=PSM1,
    colOfReporterIonInt=samples,
    groups=groups,
    # As the original work indicates, the actual protein fold change may
    # deviate from the intended value after TMT labelling when H1+E6 is
    # involved; therefore, H1+E6 is not used in the calculation of the
    # Average Scaled Error of log2FC
    expectedRelativeAbundance=list(`H1+E1`=1, `H1+E2`=2, `H1+E6`=NA),
    speciesAtConstLevel='HUMAN'
)
```
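
The exact definition of the Average Scaled Error of log~2~FC is given in the `AWAggregator` publication and is not reproduced here. As a rough illustration of the underlying idea only, the base-R sketch below computes a plain, unscaled mean absolute error between observed and expected log~2~ fold changes for a single spike-in PSM, skipping groups whose expected abundance is `NA`. The function `meanAbsLog2FCError()` and the toy intensities are hypothetical, and the result is not the value returned by `getAvgScaledErrorOfLog2FC()`.

```r
# Hypothetical helper: mean absolute error of log2FC over all group pairs
meanAbsLog2FCError <- function(groupMeans, expected) {
    # Drop groups with an unknown expected abundance (NA), as for H1+E6 above
    keep <- names(expected)[!is.na(unlist(expected))]
    pairs <- combn(keep, 2, simplify=FALSE)
    errs <- vapply(pairs, function(p) {
        observed <- log2(groupMeans[[p[2]]] / groupMeans[[p[1]]])
        target <- log2(expected[[p[2]]] / expected[[p[1]]])
        abs(observed - target)
    }, numeric(1))
    mean(errs)
}

# Toy spike-in PSM: mean reporter ion intensity per group (hypothetical)
groupMeans <- c(`H1+E1`=100, `H1+E2`=160, `H1+E6`=450)
expected <- list(`H1+E1`=1, `H1+E2`=2, `H1+E6`=NA)

err <- meanAbsLog2FCError(groupMeans, expected)
print(err)
```

Here only the `H1+E1` versus `H1+E2` pair contributes: the observed ratio is 1.6 against an expected ratio of 2, so the error is `log2(2 / 1.6)`.
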
Second, we calculate the PSM attributes and the Average Scaled Error of
log~2~FC values for `benchmark.set.2`. `benchmark.set.2` consists of three
separate mass spectrometry runs, indicated by the `Replicate` column. Because
of potential run-specific differences, each run is processed individually with
`lapply()`, and the results are merged with `bind_rows()`.
```r
library(dplyr)
# Load sample names (Sample 'H1+Y1_1' ~ Sample 'H1+Y10_3')
samples <- colnames(benchmarkSet2)[
    grep('H1[+]Y[0-9]+_[1-3]', colnames(benchmarkSet2))
]
groups <- str_match(samples, 'H1[+]Y[0-9]+')[, 1]
# Process each replicate separately using lapply()
# lapply() loops over all unique replicate IDs in benchmarkSet2.
# 'X' is the current replicate ID.
tmp <- lapply(unique(benchmarkSet2$Replicate), FUN=function(X){
    # Select PSMs from the current replicate X
    df <- benchmarkSet2[benchmarkSet2$Replicate == X, ]
    df <- getPSMAttributes(
        PSM=df,
        fixedPTMs=c('229.1629', '57.0214'),
        colOfReporterIonInt=samples,
        groups=groups,
        setProgressBar=FALSE
    )
    df <- getAvgScaledErrorOfLog2FC(
        PSM=df,
        colOfReporterIonInt=samples,
        groups=groups,
        expectedRelativeAbundance=list(`H1+Y1`=1, `H1+Y4`=4, `H1+Y10`=10),
        speciesAtConstLevel='HUMAN'
    )
    # Return the processed PSMs from the current replicate
    return(df)
})
# Combine results from all replicates into one dataframe
PSM2 <- bind_rows(tmp)
```
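
The per-run pattern used above (split by `Replicate`, process each subset, then recombine) can be illustrated without the package. The sketch below uses base R only; the per-run step, dividing by the run median, is a hypothetical stand-in for the `AWAggregator` calls, and `toy`, `perRun` and `combined` are illustrative names.

```r
# Toy table with two runs (hypothetical values)
toy <- data.frame(
    Replicate=c(1, 1, 2, 2),
    Intensity=c(10, 30, 100, 300)
)

# Process each run on its own so that run-specific scales do not mix
perRun <- lapply(unique(toy$Replicate), function(r) {
    df <- toy[toy$Replicate == r, ]
    # Stand-in for run-specific processing: scale by the run median
    df$Scaled <- df$Intensity / median(df$Intensity)
    df
})

# Stack the per-run results into one data frame
combined <- do.call(rbind, perRun)
print(combined)
```

When all pieces share the same columns, `do.call(rbind, ...)` plays the same role here as `dplyr::bind_rows()` does in the vignette code.
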
Third, we calculate the PSM attributes and the Average Scaled Error of
log~2~FC values for `benchmark.set.3`.
```r
# Load sample names (Sample 'H1+Y1_1' ~ Sample 'H1+Y10_2')
samples <- colnames(benchmarkSet3)[
    grep('H1[+]Y[0-9]+_[1-2]', colnames(benchmarkSet3))
]
groups <- str_match(samples, 'H1[+]Y[0-9]+')[, 1]
PSM3 <- getPSMAttributes(
    PSM=benchmarkSet3,
    fixedPTMs=c('304.2071', '125.0476'),
    colOfReporterIonInt=samples,
    groups=groups,
    # The signals for yeast PSMs in group H1+Y0 come entirely from noise, so
    # this group is not used for calculating the Average CV
    groupsExcludedFromCV='H1+Y0'
)
PSM3 <- getAvgScaledErrorOfLog2FC(
    PSM=PSM3,
    colOfReporterIonInt=samples,
    groups=groups,
    expectedRelativeAbundance=list(
        `H1+Y0`=0, `H1+Y1`=1, `H1+Y5`=5, `H1+Y10`=10
    ),
    speciesAtConstLevel='HUMAN'
)
```
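
To illustrate what an average CV across replicate groups looks like, and why a noise-only group is worth excluding, here is a base-R sketch. It computes the CV (in percent) of raw intensities within each group and averages over groups; the package works on processed intensities and its exact procedure may differ. The column names, values and the helper `averageCV()` are hypothetical.

```r
# One toy PSM with two replicates per group (hypothetical values)
intensities <- c(
    `H1+Y0_1`=2, `H1+Y0_2`=9,
    `H1+Y1_1`=90, `H1+Y1_2`=110,
    `H1+Y5_1`=480, `H1+Y5_2`=520
)

# Derive group labels by stripping the replicate suffix
groups <- sub('_[0-9]+$', '', names(intensities))

# Hypothetical helper: mean of per-group CVs (%), skipping excluded groups
averageCV <- function(x, groups, excluded=character(0)) {
    keep <- !(groups %in% excluded)
    cvs <- tapply(x[keep], groups[keep], function(v) 100 * sd(v) / mean(v))
    mean(cvs)
}

cvAll <- averageCV(intensities, groups)
cvNoY0 <- averageCV(intensities, groups, excluded='H1+Y0')
print(c(all=cvAll, withoutY0=cvNoY0))
```

In this toy PSM the noise-only `H1+Y0` group has a very large CV, so including it inflates the average; excluding it leaves a value that reflects the groups with real signal.
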
### Step 3: Merge Spike-in Datasets as a New Training Set
Next, we merge these three datasets into a new training set. The minimum number
of PSMs to extract from each dataset is set to the number of PSMs in the
smallest dataset. Because complete sets of PSMs mapped to the selected proteins
are extracted, the final PSM count from each dataset is equal to or slightly
larger than this preset value.
```r
set.seed(1000)
PSM <- mergeTrainingSets(
    PSMList=list(
        `Benchmark Set 1`=PSM1,
        `Benchmark Set 2`=PSM2,
        `Benchmark Set 3`=PSM3
    ),
    numPSMs=min(nrow(PSM1), nrow(PSM2), nrow(PSM3))
)
```
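
The reason the extracted counts can slightly exceed `numPSMs` is that proteins are taken whole. The base-R sketch below shows one way such protein-wise sampling could work; it is an assumption about the general approach, not the implementation of `mergeTrainingSets()`, and `sampleWholeProteins()` and the toy table are hypothetical.

```r
# Hypothetical helper: draw whole proteins until at least numPSMs are selected
sampleWholeProteins <- function(psm, numPSMs) {
    counts <- table(psm$Protein)
    # Visit proteins in random order
    shuffled <- sample(names(counts))
    cum <- cumsum(as.vector(counts[shuffled]))
    # Number of proteins needed to reach the target
    n <- which(cum >= numPSMs)[1]
    # If the target exceeds the dataset size, keep everything
    if (is.na(n)) n <- length(shuffled)
    psm[psm$Protein %in% shuffled[seq_len(n)], , drop=FALSE]
}

set.seed(1000)
# Toy dataset: 14 PSMs spread over four proteins
toy <- data.frame(
    Protein=rep(c('A', 'B', 'C', 'D'), times=c(3, 5, 2, 4)),
    stringsAsFactors=FALSE
)
subset10 <- sampleWholeProteins(toy, numPSMs=10)
nrow(subset10)
```

Whatever the random order, the result contains at least 10 PSMs and overshoots by less than the PSM count of the last protein added.
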
### Step 4: Train a New Random Forest Model
Train a new random forest model using Average CV as an attribute.
```r
regr <- fitQuantInaccuracyModel(PSM, useAvgCV=TRUE, seed=3979)
```