-
Notifications
You must be signed in to change notification settings - Fork 12
add function dotproduct and unit tests #17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dear Thomas thanks for this PR.
Before looking into it in a more detail I have a few questions:
-
We have already a
dotproductfunction inMSnbase: https://github.com/lgatto/MSnbase/blob/604440ae5dd7c86b685e404f22c874cedea19642/R/matching.R#L102-L123. It ignores the mz values. Are the mz values important? IMHO after matching the spectra (e.g. byjoinPeaksinSpectra) they should be fairly similar and it should be more or less a multiplication by a constant factor, or? -
Are the arguments
mandnreally needed? Are there real use cases for changing them? -
Is a non-normalized similarity useful? So do we need the
normalizeargument at all?
jorainer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Hi @sgibb, Concerning your questions:
@jorainer Sure, I can make a line break after every |
I am never worked with small molecules so I don't know whether weighting by m/z is useful. x <- data.frame(mz = c(1, NA, 100), intensity = c(1, 1, 1))
y <- data.frame(mz = c(1, 2, 101), intensity = c(1, 1, 1))
dotproduct(x, y)
# [1] 0.9999998
x <- data.frame(mz = c(101, NA, 200), intensity = c(1, 1, 1))
y <- data.frame(mz = c(101, 102, 201), intensity = c(1, 1, 1))
dotproduct(x, y)
# [1] 0.9413118
I see but is there a real use case? Otherwise I would vote for a simpler implementation and API (and remove both arguments).
👍 |
|
Dear @sgibb
For two spectra with "lower" m/z value, we get a lower similarity score. If we set a higher
This means 2 of 3 shared peaks. I used the formula implemented in this |
sgibb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tnaake Thanks for the clarification! Now I reviewed your PR and add some comments. There is a little bit of refactoring needed before we could accept this PR.
Please follow our coding style and it would be nice if you could use conventional commit messages as well.
Please add yourself as contributor to the DESCRIPTION and README.md.
R/dotproduct.R
Outdated
| #' @param n `numeric(1)`, exponent for m/z-based weights | ||
| #' @param normalize `logical` whether to calculate the DP (FALSE) or NDP (TRUE) | ||
| #' @details | ||
| #' `x` and `y` have to be spectrally aligned. Each row in `x` corresponds to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is "spectrally aligned" a valid phrase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this to:
Each row in x corresponds to the respective row in y, i.e. the peaks
(entries "mz") per spectrum have to match.
@sgibb, do you think it is clearer this way?
R/dotproduct.R
Outdated
| #' intensity=c(2, 0, 3, 1, 4, 0.4)) | ||
| #' dotproduct(x, y, m=0.5, n=2, normalize=TRUE) | ||
| #' @export | ||
| dotproduct <- function(x, y, m=0.5, n=2, normalize=TRUE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Several points:
- I am not sure that
data.frame/listinput is a good idea. I know the output ofjoinis alistso it would be convenient. But what aboutmz1, int1, mz2, int2orx1, y1, x2, y2? @jorainer : what do you think? - You didn't check
mandnfor (in)valid input. - Please remove the
normalizeargument as discussed before. - Please follow our coding style:
m = 0.5(see https://rformassspectrometry.github.io/RforMassSpectrometry/articles/RforMassSpectrometry.html#coding-style)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Yes, input of
data.frame/listwas written with having the output ofjoinin mind. I can change to vector input as outlined above. Maybe we can wait for @jorainer 's ideas on this. - I added checks for
mandnnow. - I removed the
normalizeargument. - I updates for coding style.
…y are not identical and n != 0
Codecov Report
@@ Coverage Diff @@
## master #17 +/- ##
==========================================
- Coverage 99.2% 98.56% -0.64%
==========================================
Files 15 17 +2
Lines 250 418 +168
==========================================
+ Hits 248 412 +164
- Misses 2 6 +4
Continue to review full report at Codecov.
|
sgibb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dear @tnaake, thanks for the refactorisation of your PR and following my suggestions. We are nearly there. Just a few minor points left.
I will ping @jorainer to get his opinion about the arguments (two list/data.frames vs four numeric vectors).
And sorry for the slow review, I just started a new job in September and my spare time was very limited.
DESCRIPTION
Outdated
| comment = c(ORCID = "0000-0001-7406-4443")), | ||
| person(given = "Sigurdur", family = "Smarason", role = "ctb") | ||
| person(given = "Sigurdur", family = "Smarason", role = "ctb"), | ||
| person(given = "Thomas", family = "Naake", role = "ctb") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| person(given = "Thomas", family = "Naake", role = "ctb") | |
| person(given = "Thomas", family = "Naake", role = "ctb") |
jorainer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice contribution @tnaake !
On the question whether the input parameters x and y should be a data.frame or individual numeric: I would opt against the data.frame. What we get from the upstream functions (being it peaks or joinPeaks, which will match the peaks of two spectra) are a two-column matrix. So I would either use 2 matrix or 4 numeric.
Thinking a little ahead on defining a common API for peak similarity calculation functions, I think that x and y being a matrix might be better. So, parameter FUN in compareSpectra would have to be a function taking two parameters x and y of type matrix representing the (matched) peak matrices of the two spectra with m/z values in its first, and intensity values in its second column.
@tnaake @sgibb , what do you think?
Another question, since the peaks have to be mapped - do you actually need the mz1 and mz2 (I suppose they would have to be similar/identical)?
|
I also vote for 2 matrices. In the interest of standardisation, should we define/document somewhere what the column names and or the order of the m/z and intensity in these columns? We could then have a helper function that checks if the data (easy enough to tell m/z and intensities apart) and column names? |
|
You mean you want that check function in I agree on having it documented (it's documented already in |
…gestions by sgibb
|
I changed the function that it accepts two matrices (actually this was changed in the second pull request, but I didn't know to deal with the sitatuation). I have two questions still.
@jorainer do you mean that I should remove the two
@jorainer I think we do need this at the moment. E.g. |
Co-Authored-By: Sebastian Gibb <[email protected]>
jorainer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tnaake ! All fine from my side.
do you mean that I should remove the two !is.matrix statements in the first lines of dotproduct?
No, all fine. This was more related to have a function that evaluates whether a numeric matrix has correct peak data (i.e. a column names "mz", and one "intensity" and that the mz values are ordered increasingly) - I would however not check that within the dotproduct function. This has to be done (and is done) upstream in Spectra.
And thanks for the clarification why we need mz1 and mz2!
Great work!
sgibb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tnaake thanks again for this nice PR and I am sorry that the review took so long and was a little bit cumbersome.
I will merge the PR and apply the change regarding the DOI myself.
| #' @references | ||
| #' Li et al. (2015): Navigating natural variation in herbivory-induced | ||
| #' secondary metabolism in coyote tobacco populations using MS/MS structural | ||
| #' analysis. PNAS, E4147--E4155, DOI: 10.1073/pnas.1503106112. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should work with
| #' analysis. PNAS, E4147--E4155, DOI: 10.1073/pnas.1503106112. | |
| #' analysis. PNAS, E4147--E4155, [DOI: 10.1073/pnas.1503106112](https://doi.org/10.1073/pnas.1503106112). |
|
I manually rebased & merged it into the devel branch. |
|
Sorry, merged by accident :-/ |
Hi @jorainer, @sgibb, @lgatto
as discussed here (rformassspectrometry/Spectra#49)
dotproductshould move toMsCoreUtils.