add function dotproduct and unit tests #17

tnaake · 2019-08-29T09:10:10Z

Hi @jorainer, @sgibb, @lgatto

as discussed here (rformassspectrometry/Spectra#49) dotproduct should move to MsCoreUtils.

sgibb

Dear Thomas thanks for this PR.

Before looking into it in more detail I have a few questions:

1. We already have a dotproduct function in MSnbase: https://github.com/lgatto/MSnbase/blob/604440ae5dd7c86b685e404f22c874cedea19642/R/matching.R#L102-L123. It ignores the mz values. Are the mz values important? IMHO after matching the spectra (e.g. by joinPeaks in Spectra) they should be fairly similar and it should be more or less a multiplication by a constant factor, right?
2. Are the arguments m and n really needed? Are there real use cases for changing them?
3. Is a non-normalized similarity useful? So do we need the normalize argument at all?

jorainer

Thanks for the PR @tnaake !

@sgibb, regarding the normalize parameter, that was my fault. I suggested adding this parameter because I thought that the normalized dot product is different from the dotproduct that is available in MSnbase (although it is not exported!).

R/dotproduct.R

tnaake · 2019-09-02T07:36:48Z

Hi @sgibb,

Concerning your questions:

1. Especially for small molecules I like to give also a weight on the m/z values to accommodate that shared fragments with higher m/z are less likely and will mean that molecules might be more similar.
2. If we set m and n to 1, the two functions will return the same value. Otherwise, the proposed function with arguments m and n will add a bit more flexibility for calculating the normalized dot product.
3. I would say it's not very useful and it is misleading when we compare non-normalized values across pairwise similarities. I can remove this.

@jorainer Sure, I can make a line break after every @....

sgibb · 2019-09-02T10:29:27Z

> Especially for small molecules I like to give also a weight on the m/z values to accommodate that shared fragments with higher m/z are less likely and will mean that molecules might be more similar.

I have never worked with small molecules, so I don't know whether weighting by m/z is useful.
But currently I get a larger dot product for lower m/z values.
Is the following expected and intended?

x <- data.frame(mz = c(1, NA, 100), intensity = c(1, 1, 1))
y <- data.frame(mz = c(1, 2, 101), intensity = c(1, 1, 1))
dotproduct(x, y)
# [1] 0.9999998

x <- data.frame(mz = c(101, NA, 200), intensity = c(1, 1, 1))
y <- data.frame(mz = c(101, 102, 201), intensity = c(1, 1, 1))
dotproduct(x, y)
# [1] 0.9413118

> If we set m and n to 1, the two functions will return the same value. Otherwise, the proposed function with arguments m and n will add a bit more flexibility for calculating the normalized dot product.

I see but is there a real use case? Otherwise I would vote for a simpler implementation and API (and remove both arguments).

> I would say it's not very useful and it is misleading when we compare non-normalized values across pairwise similarities. I can remove this.

👍

tnaake · 2019-09-06T12:16:45Z

Dear @sgibb
thanks for pointing this out.

Concerning 1. Given the input, this is expected (but not what we intend). With the exponent n=2, differences in m/z are amplified: 200^n vs. 201^n and 100^n vs. 101^n change the weights ws1 and ws2 tremendously (the dotproduct function calculates the similarity between ws1 and ws2). If the function gets "aligned" m/z values, the similarity is calculated as follows and as expected (this means the function should only get the same/aligned m/z values when n != 0):

x1 <- data.frame(mz = c(101, NA, 201), intensity = c(1, 0, 1))
y1 <- data.frame(mz = c(101, 102, 201), intensity = c(1, 1, 1))
dotproduct(x1, y1, m=0.3, n=1) ## 0.8294594
dotproduct(x1, y1, m=0.3, n=2) ## 0.9413171

x2 <- data.frame(mz = c(1, NA, 201), intensity = c(1, 0, 1))
y2 <- data.frame(mz = c(1, 102, 201), intensity = c(1, 1, 1))
dotproduct(x2, y2, m=0.3, n=1) ## 0.795221
dotproduct(x2, y2, m=0.3, n=2) ## 0.9378086

For two spectra with "lower" m/z value, we get a lower similarity score. If we set a higher n we will get a higher similarity score.
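
For reference, the scores above can be reproduced with a minimal sketch of the weighted normalized dot product. It assumes that the weights are `intensity^m * mz^n`, that a peak missing in one spectrum (`NA` m/z) gets a weight of zero, and that the score is `sum(wx * wy)^2 / (sum(wx^2) * sum(wy^2))`. The name `ndp_sketch` is hypothetical; this is not the code from the PR.

```r
# Hypothetical sketch of a weighted normalized dot product (not the PR's code).
# x, y: data.frames with columns "mz" and "intensity", rows already matched.
ndp_sketch <- function(x, y, m = 0.5, n = 2) {
    wx <- x$intensity^m * x$mz^n
    wy <- y$intensity^m * y$mz^n
    # assumption: a peak that is missing in one spectrum (NA) gets weight 0
    wx[is.na(wx)] <- 0
    wy[is.na(wy)] <- 0
    sum(wx * wy)^2 / (sum(wx^2) * sum(wy^2))
}

x1 <- data.frame(mz = c(101, NA, 201), intensity = c(1, 0, 1))
y1 <- data.frame(mz = c(101, 102, 201), intensity = c(1, 1, 1))
ndp_sketch(x1, y1, m = 0.3, n = 1) # ~0.8294594
ndp_sketch(x1, y1, m = 0.3, n = 2) # ~0.9413171
```

Under these assumptions the unmatched peak at m/z 102 only enters the denominator, and its relative contribution shrinks as n grows, which is why the score rises with n.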

Concerning 2. There was a small error in my text before: n has to be 0 (all m/z values will be 1).

x1 <- data.frame(mz = c(101, NA, 201), intensity = c(1, 0, 1))
y1 <- data.frame(mz = c(101, 102, 201), intensity = c(1, 1, 1))
x2 <- data.frame(mz = c(1, NA, 201), intensity = c(1, 0, 1))
y2 <- data.frame(mz = c(1, 102, 201), intensity = c(1, 1, 1))
dotproduct(x1, y1, m=0.3, n=0) ## 0.6666667
dotproduct(x2, y2, m=0.3, n=0) ## 0.6666667

This means 2 of 3 shared peaks. I have used the formula implemented in this dotproduct function (containing weights for m/z and intensities) before, and it was also used in some publications. What we could do is set n=0 and m=1 as defaults, i.e. m/z values will not be used, only the intensity values as they are.

x1 <- data.frame(mz = c(101, NA, 201), intensity = c(1, 0, 1))
y1 <- data.frame(mz = c(101, 102, 201), intensity = c(1, 1, 1))
x2 <- data.frame(mz = c(101, NA, 201), intensity = c(3, 0, 5)) 
y2 <- data.frame(mz = c(101, 102, 201), intensity = c(3,4, 5))
dotproduct(x1, y1, m=1, n=0) ## 0.6666667
dotproduct(x2, y2, m=1, n=0) ## 0.68
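
With m = 1 and n = 0 the weights reduce to the raw intensities, so the score should equal the squared cosine of the angle between the two intensity vectors. The short check below assumes the normalized dot product is defined as `sum(wx * wy)^2 / (sum(wx^2) * sum(wy^2))`; it uses the intensities of `x2` and `y2` from the example above.

```r
# Intensities of two matched spectra (values taken from x2 and y2 above).
ix <- c(3, 0, 5)
iy <- c(3, 4, 5)

# Cosine of the angle between the two intensity vectors.
cosine <- sum(ix * iy) / sqrt(sum(ix^2) * sum(iy^2))

# With m = 1 and n = 0 the weights equal the intensities, so the
# normalized dot product is the squared cosine.
cosine^2 # 34^2 / (34 * 50) = 0.68
```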

sgibb

@tnaake Thanks for the clarification! I have now reviewed your PR and added some comments. A little bit of refactoring is needed before we can accept this PR.

Please follow our coding style and it would be nice if you could use conventional commit messages as well.

Please add yourself as contributor to the DESCRIPTION and README.md.

sgibb · 2019-09-12T19:40:47Z

R/dotproduct.R

+#' @param n `numeric(1)`, exponent for m/z-based weights
+#' @param normalize `logical` whether to calculate the DP (FALSE) or NDP (TRUE)
+#' @details 
+#' `x` and `y` have to be spectrally aligned. Each row in `x` corresponds to the


Is "spectrally aligned" a valid phrase?

tnaake · 2019-09-17

I changed this to:

Each row in x corresponds to the respective row in y, i.e. the peaks
(entries "mz") per spectrum have to match.

@sgibb, do you think it is clearer this way?

sgibb · 2019-09-12T19:46:21Z

R/dotproduct.R

+#'         intensity=c(2, 0, 3, 1, 4, 0.4))
+#' dotproduct(x, y, m=0.5, n=2, normalize=TRUE) 
+#' @export
+dotproduct <- function(x, y, m=0.5, n=2, normalize=TRUE) {


Several points:

I am not sure that data.frame/list input is a good idea. I know the output of join is a list so it would be convenient. But what about mz1, int1, mz2, int2 or x1, y1, x2, y2? @jorainer : what do you think?

You didn't check m and n for (in)valid input.

Please remove the normalize argument as discussed before.

Please follow our coding style: m = 0.5 (see https://rformassspectrometry.github.io/RforMassSpectrometry/articles/RforMassSpectrometry.html#coding-style)

tnaake · 2019-09-17

Yes, the data.frame/list input was written with the output of join in mind. I can change to vector input as outlined above. Maybe we can wait for @jorainer's ideas on this.

I added checks for m and n now.

I removed the normalize argument.

I updated the code to follow the coding style.
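
The kind of input check requested for m and n could look like the sketch below. The helper name `check_exponent` and the error message are hypothetical; the checks actually added in the PR may differ.

```r
# Hypothetical argument check; the checks added in the PR may differ.
check_exponent <- function(value, name) {
    if (!is.numeric(value) || length(value) != 1L || is.na(value))
        stop("'", name, "' has to be a numeric of length 1")
    invisible(TRUE)
}

check_exponent(0.5, "m")                            # passes silently
res <- try(check_exponent("a", "n"), silent = TRUE) # invalid input
inherits(res, "try-error")                          # TRUE
```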

codecov-io · 2019-09-17T13:03:37Z

Codecov Report

Merging #17 into master will decrease coverage by 0.63%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master      #17      +/-   ##
==========================================
- Coverage    99.2%   98.56%   -0.64%     
==========================================
  Files          15       17       +2     
  Lines         250      418     +168     
==========================================
+ Hits          248      412     +164     
- Misses          2        6       +4

Impacted Files	Coverage Δ
R/dotproduct.R	`100% <100%> (ø)`
R/graphPeaks.R	`97.29% <0%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4aabfe2...b263787. Read the comment docs.

sgibb

Dear @tnaake, thanks for the refactorisation of your PR and following my suggestions. We are nearly there. Just a few minor points left.
I will ping @jorainer to get his opinion about the arguments (two list/data.frames vs four numeric vectors).

And sorry for the slow review, I just started a new job in September and my spare time was very limited.

sgibb · 2019-10-03T17:41:02Z

DESCRIPTION

 		            comment = c(ORCID = "0000-0001-7406-4443")),
-             person(given = "Sigurdur", family = "Smarason", role = "ctb")
+             person(given = "Sigurdur", family = "Smarason", role = "ctb"),
+	     person(given = "Thomas", family = "Naake", role = "ctb")


Suggested change

person(given = "Thomas", family = "Naake", role = "ctb")

person(given = "Thomas", family = "Naake", role = "ctb")

jorainer

Nice contribution @tnaake !

On the question whether the input parameters x and y should be a data.frame or individual numeric vectors: I would opt against the data.frame. What we get from the upstream functions (be it peaks or joinPeaks, which will match the peaks of two spectra) is a two-column matrix. So I would either use 2 matrices or 4 numeric vectors.

Thinking a little ahead on defining a common API for peak similarity calculation functions, I think that x and y being a matrix might be better. So, parameter FUN in compareSpectra would have to be a function taking two parameters x and y of type matrix representing the (matched) peak matrices of the two spectra with m/z values in its first, and intensity values in its second column.
@tnaake @sgibb , what do you think?
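
To make the proposed signature concrete, here is a minimal sketch of a similarity function that takes two matched peak matrices, with m/z values in the first and intensities in the second column. The function `shared_peaks` and the wrapper `compare_sketch` are hypothetical stand-ins and are not part of Spectra or MsCoreUtils.

```r
# Hypothetical similarity function following the proposed signature:
# x, y are numeric matrices, m/z in column 1 and intensity in column 2,
# with rows already matched between the two spectra.
shared_peaks <- function(x, y) {
    stopifnot(is.matrix(x), is.matrix(y), nrow(x) == nrow(y))
    # fraction of rows in which both spectra have a non-zero intensity
    both <- x[, 2L] > 0 & y[, 2L] > 0
    mean(both, na.rm = TRUE)
}

# Stand-in for a caller that accepts any function with this signature.
compare_sketch <- function(x, y, FUN) FUN(x, y)

x <- cbind(mz = c(101, NA, 201), intensity = c(1, 0, 1))
y <- cbind(mz = c(101, 102, 201), intensity = c(1, 1, 1))
compare_sketch(x, y, FUN = shared_peaks) # 2 of 3 rows are shared
```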

Another question, since the peaks have to be mapped - do you actually need the mz1 and mz2 (I suppose they would have to be similar/identical)?

lgatto · 2019-10-04T07:13:23Z

I also vote for 2 matrices.

In the interest of standardisation, should we define/document somewhere the column names and/or the order of the m/z and intensity columns? We could then have a helper function that checks the data (it is easy enough to tell m/z and intensities apart) and the column names.

jorainer · 2019-10-04T07:19:14Z

You mean you want that check function in MsCoreUtils? I'm not so sure if such a check function on a matrix is not a little too much. We have all the checks already in the Spectra package, and they are not performed on the matrix but rather on the data as it is stored there (i.e. mz and intensity as a NumericList, so we don't have to check for is.numeric; we only check that the mz values are increasingly ordered).

I agree on having it documented (it's documented already in Spectra if I'm not mistaken; it's what as.list,Spectra returns).
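
For illustration only, the kind of helper under discussion might look like the sketch below. The name `is_peak_matrix` is hypothetical, and the column names "mz" and "intensity" (in that order) are an assumption, not a documented convention.

```r
# Hypothetical helper, not part of MsCoreUtils or Spectra.
# Assumes the columns are named "mz" and "intensity", in that order.
is_peak_matrix <- function(x) {
    is.matrix(x) && is.numeric(x) &&
        identical(colnames(x), c("mz", "intensity")) &&
        !is.unsorted(x[, "mz"], na.rm = TRUE)
}

ok <- cbind(mz = c(101, 102, 201), intensity = c(1, 1, 1))
bad <- cbind(mz = c(201, 102, 101), intensity = c(1, 1, 1))
is_peak_matrix(ok)  # TRUE
is_peak_matrix(bad) # FALSE, m/z values are not increasing
```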

tnaake · 2019-10-06T18:42:08Z

I changed the function so that it accepts two matrices (actually this was changed in the second pull request, but I didn't know how to deal with the situation).

I have two questions still.

> You mean you want that check function in MsCoreUtils? I'm not so sure if such a check function on a matrix is not a little too much.

@jorainer do you mean that I should remove the two !is.matrix statements in the first lines of dotproduct?

> Another question, since the peaks have to be mapped - do you actually need the mz1 and mz2 (I suppose they would have to be similar/identical)?

@jorainer I think we do need this at the moment. E.g. graphPeaks will return a (list of) two matrices with a column mz. The rows of the two matrices correspond to each other and thus the mz values. However it's possible (and will happen in most cases) that the two columns with the mz are not identical. If we want to store that information properly, then in dotproduct that information has to be assigned to mz1 and mz2. What I could think of is adding an argument that specifies how dotproduct will deal with aligned m/z values (e.g. averaging the two corresponding m/z from each spectrum). Indeed, if n != 0 that would become a valid enhancement to calculate similarities properly.
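
The averaging idea mentioned above could be sketched as follows. The helper `average_mz` is hypothetical and only illustrates one possible way to collapse two aligned m/z columns into one; it is not part of the PR.

```r
# Hypothetical handling of aligned but non-identical m/z values:
# take the row-wise mean, ignoring NA for peaks present in one spectrum only.
average_mz <- function(mz1, mz2) {
    stopifnot(length(mz1) == length(mz2))
    rowMeans(cbind(mz1, mz2), na.rm = TRUE)
}

average_mz(c(101, NA, 200), c(101, 102, 201)) # 101.0 102.0 200.5
```

Rows where both values are `NA` would yield `NaN` in this sketch, so a real implementation would need to decide how to treat them.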

jorainer

Thanks @tnaake ! All fine from my side.

> do you mean that I should remove the two !is.matrix statements in the first lines of dotproduct?

No, all fine. This was more about having a function that evaluates whether a numeric matrix contains correct peak data (i.e. a column named "mz" and one named "intensity", and mz values that are ordered increasingly). I would however not check that within the dotproduct function. This has to be done (and is done) upstream in Spectra.

And thanks for the clarification why we need mz1 and mz2!

Great work!

sgibb

@tnaake thanks again for this nice PR and I am sorry that the review took so long and was a little bit cumbersome.

I will merge the PR and apply the change regarding the DOI myself.

sgibb · 2019-10-08T11:33:16Z

R/dotproduct.R

+#' @references 
+#' Li et al. (2015): Navigating natural variation in herbivory-induced
+#' secondary metabolism in coyote tobacco populations using MS/MS structural 
+#' analysis. PNAS, E4147--E4155, DOI: 10.1073/pnas.1503106112.


Should work with

Suggested change

#' analysis. PNAS, E4147--E4155, DOI: 10.1073/pnas.1503106112.

#' analysis. PNAS, E4147--E4155, [DOI: 10.1073/pnas.1503106112](https://doi.org/10.1073/pnas.1503106112).

sgibb · 2019-10-08T13:46:24Z

I manually rebased & merged it into the devel branch.

lgatto · 2019-10-11T10:22:47Z

Sorry, merged by accident :-/

sgibb · 2019-10-11T10:29:47Z

@lgatto I already merged this into devel. So I am closing it again. But it looks like you merged #20 by accident. I will undo this as well.

add function dotproduct and unit tests

ba57a62

sgibb reviewed Aug 30, 2019

jorainer reviewed Sep 2, 2019

sgibb requested changes Sep 12, 2019

tnaake added 13 commits September 17, 2019 11:45

style: add spaces between = in function arguments

24e496a

refactor: remove normalize as an argument

6ed0d3e

refactor: check if mz values are identical and raise a warning if the…

0b5c377

…y are not identical and n != 0

docs: rewrite details section, add references section

f48fe4b

docs: remove tags name and usage in documenation

23b6e12

docs: explain better arguments m and n, add explanation to details se…

75e113d

…ction

test: remove argument normalize, add tests with different length to u…

9a8530c

…nit test

feat: write dotproduct to NAMESPACE

1411ae8

Merge remote-tracking branch 'upstream/master'

cea33a7

docs: add contributor

565bc0f

docs: add dotproduct in NEWS file

5ac3015

test: add expect_warning since warnings are treated as errors

04adafa

docs: change n=2 to n=0 in order to not create a WARNING

827112a

test: add tests for m and n are numeric and length identical of x, x,…

45314e7

… y, y

sgibb requested changes Oct 3, 2019

sgibb requested a review from jorainer October 3, 2019 18:23

sgibb mentioned this pull request Oct 3, 2019

graphPeaks for matching two spectra and returning a matched spectra #20

Merged

jorainer requested changes Oct 4, 2019

tnaake added 2 commits October 6, 2019 19:20

docs: dotproduct function, change manual

9582dcc

docs: improve the documentation after review by sgibb and do some ref…

cde8444

…actoring

test: add regexpr to expect_warning and expect_error according to sug…

17cf6f8

…gestions by sgibb

docs: update DESCRIPTION, intend person()

b263787

Co-Authored-By: Sebastian Gibb <[email protected]>

jorainer approved these changes Oct 8, 2019

sgibb approved these changes Oct 8, 2019

sgibb closed this Oct 8, 2019

lgatto reopened this Oct 11, 2019

sgibb closed this Oct 11, 2019
