Bits and Brains

Writing Effective Pull Requests

2020-10-05T00:00:00+00:00

So, you have changes to code that the world would be better off having. Great! This checklist describes how to make sure that your pull request (PR) sails through the review process.

Why care?

Developing and maintaining open source software is a lot of work. Ralf Gommers has a blog post on this topic. Part of the “cost” of each open source contribution is that someone has to review each change, which is often time consuming.

Following the checklist below for your PR (prior to submitting it or requesting a review) will make life easier for you as the author and for anyone reviewing your code by reducing the number of modifications that are necessary to merge your PR. This will also likely help your PR get merged faster so you can move on to other things.

The checklist

Note: the guidelines below were written explicitly for our package graspologic. Every project will have its own policies and practices that may be slightly different from what you see here - for instance, the location of documentation files may vary, or the project may use some service other than Netlify for hosting documentation, or they may have slightly different style conventions, etc. Most repos will have a contribution guideline that should get you up to speed on these things.

Before requesting a review on your pull request, make sure that:

PR itself (on GitHub)

PR has a descriptive and succinct title.
PR has a brief description of what was changed.
PR is addressing a change that has already been brought up in an issue. This will ensure that the change you propose is desired by the maintainers, and may provide a chance to discuss implementation details prior to PRing which will save everyone time.
PR cites any related issues in the PR description and uses closing keywords for any issues that should be closed as a result of the PR.
PR does not have any extraneous file changes associated with it. One common example of an extraneous change is one that was already made in the target branch, but is showing up as part of this PR under Files changed.
PR does not have any unresolved merge conflicts.
If implementing a major new feature or algorithm, a notebook demonstrating the proof of effectiveness is linked in the PR description. Depending on the feature, this kind of validation may take too long to run as a test or in a tutorial notebook.

Style

Code follows Python variable naming conventions; one description is here. For example, most variables should be snake_case and classes should be defined as UpperCamelCase.
Code uses descriptive variable names throughout.
Code has no commented out “junk code.”
Line lengths are short (recommended <88 chars). This includes docstrings. Note: black tries to shorten line lengths but sometimes it is unable to do so automatically, especially for docstrings.
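As an illustration of the style items above, here is a minimal sketch. All names in it (MAX_ITERATIONS, GraphSummary, count_nonzero_edges) are hypothetical and chosen only for this example; they are not taken from graspologic.

```python
# Hypothetical names for illustration only; none of these come from graspologic.

MAX_ITERATIONS = 100  # module-level constants: UPPER_SNAKE_CASE


class GraphSummary:  # classes: UpperCamelCase
    """Hold a few descriptive statistics about a graph."""

    def __init__(self, n_nodes, n_edges):
        # arguments and attributes: descriptive snake_case
        self.n_nodes = n_nodes
        self.n_edges = n_edges


def count_nonzero_edges(adjacency_rows):  # functions: snake_case
    """Count the nonzero entries in a list-of-lists adjacency matrix."""
    return sum(1 for row in adjacency_rows for weight in row if weight != 0)


summary = GraphSummary(n_nodes=2, n_edges=count_nonzero_edges([[0, 1], [1, 0]]))
print(summary.n_edges)
```

Descriptive names such as n_nodes and adjacency_rows cost a few extra characters but usually remove the need for an explanatory comment.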

Documentation

Any new public classes/functions have docstrings.
Any new public classes/functions have been added to the appropriate .rst file here to be rendered by Sphinx.
Any major new features (like a new algorithm) are accompanied by a succinct tutorial notebook. Or, if it is agreed upon with the maintainers that this is out of scope for the current PR, a new issue has been created specifying that we need a tutorial notebook for this functionality.
Any new tutorial notebooks have been added to the appropriate folder here and included in the .rst file here to be rendered by Sphinx.
Modifications to the documentation render appropriately in the Netlify build.
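For the docstring item above, a sketch of what a documented public function might look like. It assumes NumPy-style docstring sections (which Sphinx can render through its napoleon or numpydoc extensions); check the contribution guideline of the project you are working on for its actual convention. The function edge_density is a hypothetical helper, not part of graspologic.

```python
def edge_density(n_edges, n_nodes):
    """Compute the density of an undirected graph without self-loops.

    Hypothetical helper used only to illustrate docstring layout.

    Parameters
    ----------
    n_edges : int
        Number of edges in the graph.
    n_nodes : int
        Number of nodes in the graph. Must be at least 2.

    Returns
    -------
    density : float
        Fraction of possible edges that are present, between 0 and 1.

    Raises
    ------
    ValueError
        If ``n_nodes`` is less than 2, or if ``n_edges`` is negative or
        exceeds the number of possible edges.
    """
    if n_nodes < 2:
        raise ValueError("n_nodes must be at least 2")
    max_edges = n_nodes * (n_nodes - 1) // 2
    if not 0 <= n_edges <= max_edges:
        raise ValueError(f"n_edges must be between 0 and {max_edges}")
    return n_edges / max_edges
```

Note that every docstring line stays under the recommended line length, since black will not rewrap docstrings for you.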

Testing

New public classes/functions are tested to ensure they achieve the desired output.
New public classes/functions are tested to ensure proper errors are thrown when invalid inputs are passed.
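The two testing items above might look like this in practice. This is a sketch using the standard library's unittest module; the function under test, edge_density, is a hypothetical helper defined here only so the example is self-contained, and your project may prefer a different test runner.

```python
import unittest


def edge_density(n_edges, n_nodes):
    """Density of an undirected graph without self-loops (hypothetical helper)."""
    if n_nodes < 2:
        raise ValueError("n_nodes must be at least 2")
    max_edges = n_nodes * (n_nodes - 1) // 2
    if not 0 <= n_edges <= max_edges:
        raise ValueError(f"n_edges must be between 0 and {max_edges}")
    return n_edges / max_edges


class TestEdgeDensity(unittest.TestCase):
    def test_known_outputs(self):
        # Desired output: cases small enough to verify by hand.
        self.assertEqual(edge_density(3, 3), 1.0)
        self.assertEqual(edge_density(0, 4), 0.0)
        self.assertAlmostEqual(edge_density(3, 4), 0.5)

    def test_invalid_inputs_raise(self):
        # Proper errors: each invalid input should fail loudly.
        with self.assertRaises(ValueError):
            edge_density(1, 1)
        with self.assertRaises(ValueError):
            edge_density(-1, 3)
        with self.assertRaises(ValueError):
            edge_density(10, 3)
```

Saved in a test file, this can be run with `python -m unittest`; pytest also collects unittest-style test cases.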

Acknowledgements

Thanks to Adam Li and Ariel Rokem for feedback on an earlier version of this checklist (see Twitter).

10 Simple Rules to Developing a New Method

2019-11-21T00:00:00+00:00

So you want to develop a new method. I recommend you do the following:

Find something about the world that you’d like to be better. My recommendation is to read how to choose a project, and find the project that maximizes feasibility and significance for you, given your intrinsic motivation. This is the hardest part, but once you have done this, you’ll hopefully have both purpose and direction.
Write down the mathematical problem you are trying to solve that will improve the world for the above described project. Using as formal notation as possible, write down the goal of your method. Is it a classification problem? Are there any particular constraints? Which of these things do you particularly care about, and how much, and what constraints do you have? I particularly like the example provided in my signal subgraph paper.
Identify Reference Methods Identify the current state-of-the-art (SoA) methods for solving this problem. Make a list of references (up to five articles) describing SoA, and highlight those for which you can identify reference implementations. For Python users, compare to at least one relevant algorithm in sklearn, and probably a few others (sklearn only includes algorithms that are at least three years old, so it never has the most up-to-date algorithms). For R users, search through the CRAN Task Views.
Identify Benchmark Settings The previous SoA must have run their code on some datasets, including both simulated and real. Identify those settings. We will be comparing our algorithm’s performance on those settings, so as to not be cherry-picking settings. Also consider a few (like one or two) very simple real datasets, such as sklearn’s toy datasets.
Identify Metrics The best metrics will depend on your goals. For example, in classification, held-out classification error might be the best metric, but other metrics can also be useful, for example, see the metrics we used in our sporf paper, which include a normalized effect size variant of Cohen’s kappa.
Understand Geometry Describe in detail (as formally as you can) the simplest parametric model that you will be using to evaluate this algorithm. This model should have the property that you can change a one-dimensional parameter to vary the performance of your algorithm from better than the current SoA to worse than the current SoA. For example, k-means will work better than GMM when two clusters are both spherically symmetric Gaussians and each cluster is equally likely, but will work worse than GMM when one cluster is much more likely than the other. Therefore, the probability of a cluster is the one-dimensional parameter that illustrates the phase transition for the algorithms. If you can’t figure out a single simple illustration for this, come up with two different simple settings: one for which you expect your new algorithm to outperform the SoA, and one for which you expect your new algorithm to under-perform the SoA. In this case, vary some property of both settings (for example, number of dimensions, or variance), to demonstrate the robustness of the result across this dimension. To the extent possible, use previously proposed settings, such as the ones in sklearn’s clustering or sample generators; we have also proposed several in our papers, such as mgc, sporf, and geodesic forests.
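To make the k-means versus GMM example concrete, here is a rough sketch of such a one-parameter sweep. It is a simplified one-dimensional, standard-library-only stand-in, not the implementation you would use in practice (sklearn's KMeans and GaussianMixture would be the natural choice). The sample size, separation, and swept probabilities are assumptions picked for illustration.

```python
import math
import random


def simulate(n_samples, p_first, separation, rng):
    """Two unit-variance 1D Gaussians; p_first is the swept parameter."""
    xs, labels = [], []
    for _ in range(n_samples):
        label = 0 if rng.random() < p_first else 1
        center = -separation / 2 if label == 0 else separation / 2
        xs.append(rng.gauss(center, 1.0))
        labels.append(label)
    return xs, labels


def kmeans_1d(xs, n_iter=50):
    """Two-cluster Lloyd's algorithm in 1D; returns hard assignments."""
    low, high = min(xs), max(xs)
    for _ in range(n_iter):
        assignments = [0 if abs(x - low) <= abs(x - high) else 1 for x in xs]
        group0 = [x for x, a in zip(xs, assignments) if a == 0]
        group1 = [x for x, a in zip(xs, assignments) if a == 1]
        low, high = sum(group0) / len(group0), sum(group1) / len(group1)
    return assignments


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def gmm_1d(xs, n_iter=100):
    """Two-component Gaussian mixture fit by EM; returns hard assignments."""
    n = len(xs)
    means, variances, weights = [min(xs), max(xs)], [1.0, 1.0], [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each point.
        resp = []
        for x in xs:
            p0 = weights[0] * normal_pdf(x, means[0], variances[0])
            p1 = weights[1] * normal_pdf(x, means[1], variances[1])
            total = p0 + p1
            resp.append(p1 / total if total > 0 else 0.5)
        # M-step: re-estimate weights, means, and variances.
        n1 = max(sum(resp), 1e-12)
        n0 = max(n - n1, 1e-12)
        means = [
            sum((1 - r) * x for r, x in zip(resp, xs)) / n0,
            sum(r * x for r, x in zip(resp, xs)) / n1,
        ]
        var0 = sum((1 - r) * (x - means[0]) ** 2 for r, x in zip(resp, xs)) / n0
        var1 = sum(r * (x - means[1]) ** 2 for r, x in zip(resp, xs)) / n1
        variances = [max(var0, 1e-3), max(var1, 1e-3)]  # floor avoids collapse
        weights = [n0 / n, n1 / n]
    return [1 if r > 0.5 else 0 for r in resp]


def accuracy(predicted, truth):
    """Clustering accuracy, invariant to swapping the two cluster labels."""
    agree = sum(p == t for p, t in zip(predicted, truth)) / len(truth)
    return max(agree, 1 - agree)


def sweep(probabilities, n_samples=2000, separation=3.0, seed=0):
    rng = random.Random(seed)
    rows = []
    for p_first in probabilities:
        xs, labels = simulate(n_samples, p_first, separation, rng)
        rows.append(
            (p_first, accuracy(kmeans_1d(xs), labels), accuracy(gmm_1d(xs), labels))
        )
    return rows


results = sweep([0.5, 0.7, 0.9])
for p_first, kmeans_acc, gmm_acc in results:
    print(f"p={p_first:.1f}  k-means={kmeans_acc:.3f}  GMM={gmm_acc:.3f}")
```

If the reasoning in the text holds for your setting, the two accuracies should be close when p is 0.5 and drift apart as p moves toward 1; plotting both against p is what reveals where the transition happens.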
Simulate Write code to simulate those scenarios, and look at the results. Confirm that your simulations are in fact simulating what the equations describing the simulations meant to describe.
Run Previous SoA
1. Make a Jupyter notebook running previous SoA on the simulations and data from their paper. Make sure your results at least approximately match their reported results (like, within 1%); ideally your results are identical to the published results, but that is not typical. This is especially important if you are re-implementing the previous SoA, rather than simply using their results.
2. Make a Jupyter notebook on any additional simulation settings.
3. PR the part of this Jupyter notebook that includes an sklearn algorithm.
Pseudocode, Code, and Test
1. Write pseudocode (as formally as you can) for your algorithm. I particularly like the text explanation pseudocode that we wrote for our mgc paper.
2. Write the code for your algorithm. If you are in Python, to the extent possible, follow the sklearn API.
3. Test your code on very simple experimental settings, which should produce an obvious result. I recommend the following three tests:
  1. A setting in which it is so easy that for sure any reasonable algorithm will get the answer perfect.
  2. A setting which is impossible for your algorithm to get right.
  3. A reasonable setting, in which as sample size increases, accuracy should improve. If your implementation works as expected in all of these settings, proceed. Otherwise, iterate.
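The three sanity checks above can be sketched as follows. The nearest-centroid classifier is only a stand-in for "your algorithm", and the separations, sample sizes, and trial counts are assumptions chosen so the checks run quickly with the standard library.

```python
import random


def draw(n_per_class, separation, rng):
    """Two unit-variance 1D Gaussian classes centered at +/- separation / 2."""
    data = []
    for label, center in ((0, -separation / 2), (1, separation / 2)):
        data += [(rng.gauss(center, 1.0), label) for _ in range(n_per_class)]
    return data


def fit_predict(train, test_xs):
    """Nearest-centroid classifier: a stand-in for the algorithm under test."""
    centroids = {}
    for label in (0, 1):
        values = [x for x, y in train if y == label]
        centroids[label] = sum(values) / len(values)
    return [min(centroids, key=lambda c: abs(x - centroids[c])) for x in test_xs]


def mean_accuracy(n_train, separation, n_trials=200, n_test=200, seed=0):
    """Average held-out accuracy over repeated train/test draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        train = draw(n_train, separation, rng)
        test = draw(n_test, separation, rng)
        predicted = fit_predict(train, [x for x, _ in test])
        total += sum(p == y for p, (_, y) in zip(predicted, test)) / len(test)
    return total / n_trials


# 1. Easy: classes are so far apart that any reasonable method is near perfect.
easy = mean_accuracy(n_train=20, separation=20.0)
# 2. Impossible: identical class distributions, so accuracy should sit at chance.
impossible = mean_accuracy(n_train=20, separation=0.0)
# 3. Reasonable: accuracy should improve as the training sample grows.
small_n = mean_accuracy(n_train=2, separation=1.0)
large_n = mean_accuracy(n_train=200, separation=1.0)

print(f"easy={easy:.3f} impossible={impossible:.3f}")
print(f"small_n={small_n:.3f} large_n={large_n:.3f}")
```

If the easy setting is not near perfect, or the impossible setting is far from chance, the bug is more likely in the implementation or the evaluation code than in the idea.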
Evaluate Algorithm on Simulations Make a Jupyter notebook evaluating all the algorithms on the simulations. For each simulation setting:
1. Generate 10x random samples of the data
2. Run all relevant algorithms.
3. Compare performance (using the above established metrics) on each run of the simulation. That is, do not compare the average, but rather, make 10 comparisons. Comparing each run is more powerful and informative.
4. Plot all 10x comparisons for each algorithm compared to yours.
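The per-run comparison described above can be sketched like this. The two "algorithms" here (sample mean and sample median as location estimators on contaminated data) are hypothetical stand-ins, and the plot is replaced by a printed table to keep the example dependency-free.

```python
import random
import statistics


def simulate(seed, n_samples=200):
    """Contaminated normal sample: true location 0, roughly 10% wide outliers."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n_samples):
        scale = 10.0 if rng.random() < 0.1 else 1.0
        sample.append(rng.gauss(0.0, scale))
    return sample


def absolute_errors(sample):
    """Metric: absolute error of each estimator against the true location 0."""
    return {
        "mean": abs(statistics.fmean(sample)),
        "median": abs(statistics.median(sample)),
    }


# Both algorithms see the same 10 simulated datasets, so runs are paired.
runs = [absolute_errors(simulate(seed)) for seed in range(10)]
differences = [run["mean"] - run["median"] for run in runs]

for seed, (run, diff) in enumerate(zip(runs, differences)):
    print(
        f"run {seed}: mean={run['mean']:.3f} "
        f"median={run['median']:.3f} diff={diff:+.3f}"
    )

median_wins = sum(diff > 0 for diff in differences)
print(f"median had lower error in {median_wins} of {len(differences)} runs")
```

Keeping the ten paired differences, rather than two averages, removes the run-to-run variability that both algorithms share, which is why the per-run comparison is the more powerful one.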
Evaluate Algorithm on Real Data Make a Jupyter notebook evaluating all the algorithms on the real data. For each real data setting:
1. Run 5-fold or 10-fold cross-validation. When appropriate, use stratified cross-validation. Do cross-validation such that the data are divided into k (approximately) equally sized subsets of data. Train on the k-1 subsets and test on the remaining subset, so that each sample is used for testing exactly once.
2. Run all relevant algorithms.
3. Compare performance (using the above established metrics) on each fold. That is, do not compare the average, but rather, make k comparisons. Comparing each run is more powerful and informative.
4. Plot all k comparisons for each algorithm compared to yours.
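The cross-validation split described above can be sketched as follows. This is a plain, unstratified version written with the standard library for illustration; in practice sklearn's KFold and StratifiedKFold do the same job. The function names are hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices and deal them into k folds of (almost) equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_samples, k, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


splits = list(train_test_splits(n_samples=23, k=5))
for fold_number, (train, test) in enumerate(splits):
    print(f"fold {fold_number}: {len(train)} train, {len(test)} test")
```

Every algorithm being compared should be run on these same splits, so that the k per-fold comparisons are paired.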

If you follow the above plan, you will have carefully demonstrated the value of your new algorithm in an unambiguous fashion.

An Email to Andrew Gelman About Evaluating Journals

2019-08-03T00:00:00+00:00

I wrote the below transcribed email to Andrew Gelman today. He often publishes my emails on his blog with a 6 month time delay, so I figured I’d post it here just in case. If he blogs about it, I’ll link to it here.

i noticed you disparage a number of journals quite frequently on your blog. i wonder what metric you are using implicitly to make such evaluations? is it the number of articles that they publish that end up being bogus? or the fraction of articles that they publish that end up being bogus? or the fraction of articles that get through their review process that end up being bogus? or the number of articles that they publish that end up being bogus AND enough people read them and care about them to identify the problems in those articles.

my guess (without actually having any data), is that Nature, Science, and PNAS are the best journals when scored on the metric of fraction of bogus articles that pass through their review process. in other words, i bet all the other journals publish a larger fraction of the false claims that are sent to them than Nature, Science, or PNAS.

the only data i know on it is described here (https://www.nature.com/articles/s41562-018-0399-z). according to the article, 62% of social-science articles in Science and Nature published from 2010-2015 replicated. an earlier paper from the same group found that 61% of papers from specialty journals published between 2011 and 2014 replicated.

i’d suspect that the fraction of articles on social sciences that pass the review criteria for Science and Nature is much smaller than that of the specialty journals, implying that the fraction of articles that get through peer review in Science and Nature that replicate is much higher than the specialty journals.

curious to hear your thoughts….

How to Join Us

2019-07-28T00:00:00+00:00

NeuroData is literally always hiring exceptional individuals. If you think you’d like to work with us in any capacity, please email us now! Some thoughts that might be useful to consider before emailing us:

We are interested in all kinds of people with all kinds of backgrounds, we do not discriminate for any reason, and we are all students at different stages. Although this should go without saying, this includes women and minorities.
We believe that to maximally flourish requires being in the appropriate environment specifically for you. Each individual benefits from unique circumstances. We wrote this blog post which describes our thoughts on choosing projects, but it is also relevant for choosing teams. If you are reading this far, it seems likely that the kind of work we do is intrinsically motivating for you. Even if not, and you think we could help find a team better aligned with your dreams, please email us.
To create the most peaceful and productive environment, our team has constructed these agreements, which we all adhere to. This includes treating everyone with respect on the team, and at any team events. See our code of conduct for details. If it seems like that kind of environment would suit you well, please email us.
If you are interested in pursuing graduate studies with us, there are many different programs that are available. The Johns Hopkins University Biomedical Engineering graduate programs are the most common approach, which includes the MSE and PhD programs, as well as a few others (Tsinghua JHU-BME dual degree and MD/PhD). Note that BME@JHU is the #1 rated BME graduate school in the world according to U.S. News & World Report. I also sometimes advise students in Computer Science, Applied Mathematics & Statistics, Biostatistics, and Neuroscience. If you are interested in graduate studies with us in a different program (e.g., JHU/Janelia, or ECE@JHU), please also feel free to email us.
If you are already enrolled at JHU and are interested in joining the lab, the simplest route is to enroll in the year-long project-based course we offer called NeuroData Design. A key motivation for offering this course was to ensure that we could work with as many amazing individuals as possible. If for some reason, enrolling in the course does not seem feasible for you, please email us and we will try to figure out some other arrangement.
If you are a visiting or exchange student, please see JHU’s official exchange program overview, and go through the official process; after you do this, email us to say you applied. It is complicated for legal and bureaucratic reasons to accept visiting/exchange students outside these official programs, although it is possible. So, if those programs do not work for any reason, please email us.
In general, we find part time work, and remote work, inefficient, and unlikely to be in your or our best interest (though there are exceptions). If that is the only option for you, we’d be happy to discuss your interests with you to find a more suitable position given your constraints.

OK, given all the above, we recommend that you email us. Below we provide an example email. Note that we will respond within a week in almost all cases.

Dear Prof. X,

I have been studying [something quite relevant to your research] for the last [X duration], and I expect to graduate from [Y institution] in [Z expected month]. I recently became interested in your work, specifically, the article “[some article from our group]”.
I am particularly intrigued by this work because of [X], which is one of the things I am most excited about studying next. I would therefore be thrilled to do [X] for a period of [Y duration] in your group. For your convenience, I have attached my CV, and an unofficial copy of my transcript.

I look forward to hearing from you at your earliest convenience.

Best, [Some Awesome Person.]

In conclusion, here is a brief checklist of what to include in your email to faculty:

Excitement about working on something specific relevant to the individual you are contacting
Relevant background experience
Desired duration and title of position
Up-to-date CV
Remember, the shorter the better, this is just to get your foot in the door.

Good luck!

How to Succeed in a Biomedical Data Science Lab (specifically ours)

2019-07-27T00:00:00+00:00

The degree of success one can achieve in any environment, specifically a lab, depends upon appropriately matching individual talents with an environment. There are many web resources that describe how to succeed in graduate school, or in a job [in a subsequent update to this post, I’ll include links to my favorite ones]. Here, we try to list the qualities of individuals that we believe tend to be positively related to “success” in our lab. We believe these qualities are also likely to be positively related to success in other positions, but are relatively data poor for the other positions. On the other hand, we’ve interacted with between 50 and 100 individuals so far, and have therefore amassed a fair bit of anecdotal evidence.

Mission Aligned Our mission is to Understand and improve animal and machine intelligences worldwide. If you are motivated to solve these problems, over and above any other problems, you are in the right place. Read our mission for more details.
Interest in Data Science Our work is at the intersection of statistical machine learning, brain science, and mental health. If you are maximally intrinsically motivated to study this intersection to achieve our mission, read on.
Personality Traits Each institution/lab is a unique environment, with unique quirks, ours included. The traits that seem most important to us include:
1. enthusiasm for solving these problems;
2. humility to realize that we are all often wrong, and the goal is to learn and understand, rather than be “correct”;
3. gratitude, both for the opportunity for learning and growth that comes from our errors, and because we get to work in the best environment with the best in the world;
4. patience, because these things are hard and take time;
5. simplicity, because the fewer parts the easier to understand and the less that can go wrong; and
6. trust in each other, as each member of the team has unique talents.
Data Science Chops To succeed, you’ll essentially need to learn or already know graduate level probability, statistics, matrix analysis, network science, and some numerical programming and brain science. If you plan on developing methods, strong numerical programming skills are required. If you plan on applying methods, strong brain science understanding is required. If you plan on proving theorems, more theory is required.
Technical Communication Something that is particularly difficult to quantify, but as important as the above properties, is an ability to communicate technical content to us effectively. This is important because we work on teams, and the efficiency of our team is partially determined by our ability to communicate effectively with one another. We acknowledge that this is highly subjective; a given individual might be able to communicate effectively with some people, and not others. This does not require fluency in English per se, but does require being able to speak, make slides, and write reports about technical content in a coherent fashion. This is something we all learn by doing; it is not expected that anybody joins the team knowing these things.
Agreements Our team made a set of agreements, that we continue to update as appropriate. Agreeing to these agreements is a prerequisite to joining the team, and therefore, is required for succeeding on the team. If you have recommendations for how to further improve our agreements, please let us know. Note that these agreements include both personal and professional activities.
Learn Context To make a meaningful contribution to the literature, it is helpful to know what is already known in the literature, and what are the biggest feasible open challenges. There are many kinds of activities that help you gain context including:
1. Take relevant classes. We specifically recommend classes taught by Jovo, Carey Priebe, Randal Burns, Rene Vidal, and Raman Arora.
2. Attend conferences. One to two conferences a year, for example, one large general conference and one smaller specialized conference, can be invaluable for learning. At those conferences, literally attend all possible conference activities, including all talks, posters, social activities, and workshops. For multi-track conferences, as long as one track has an activity, go to it. Expect to be learning essentially 12 hours a day. Take breaks as appropriate, probably nearly hourly, to decompress. It helps to attend talks/posters/etc. with other people who are more senior, so that you can discuss the contents afterwards and digest the most important points. Go a day early, or stay a day late, to be able to enjoy the city you are in; there is no need to do those activities during the conference, there is plenty of time after. If at all possible, do the fun activities with other conference attendees, ideally not from your lab.
3. Read papers. There is no limit to the number of papers one can read. If you find a really interesting article, read all the references of that paper. A good rule is to read an average of one paper a day. The depth with which you read it can vary as appropriate, from only 30 minutes to get the gist, to 8 hours to understand the details. If you think you want to read a paper for 8 hours, first read it for 30 minutes, and then read the key references each for 30 minutes. Then do the 8 hour read.
Plan on Being Here a While Many people approach us and are interested in a summer internship, or even a year long fellowship. It is quite difficult to make a positive contribution in less than 1 year, for the simple reason that there is a lot of specific background knowledge and context that the lab has that is nearly impossible to get anywhere else (this is likely true for most labs). PhDs take about 5 years, which is long enough to be able to make a contribution. Realistically, after about two years, expect to make a useful contribution, that is, your research output more than exceeds the output the team would have had had you not been there (which means you’ve overcome the research debt accrued by virtue of the energy spent training you). The efficiency with which you can make contributions continues to increase with training. Ideally, the best trainees would never leave, meaning, they would either take a postdoc position with us, and/or get a faculty position at JHU, or a neighboring university if that makes more sense for various reasons.
Meet the Experts There are many brilliant people working on related questions all over the world. To the extent possible, meet them. I know of three ways:
1. Sign up for relevant seminars and attend them. At JHU, this includes the following seminars: CIS, data, CS, BME, CS theory, AMS. Please email us if we forgot any. At the seminars, if you don’t understand something, raise your hand and ask a question. The speaker’s job is to make you understand, either during the talk, or if they deem it appropriate, after the talk. Let them choose when to answer, rather than choosing for them. Also, when possible, sign up to speak with the speakers. For many speakers there are lunches with students; go to as many of them as you can. If there is no lunch, either sign up to speak with them yourself, or with a small team of other students, or ask your PI to join their meeting.
2. At conferences. Prior to the conference, say about 1 week before, find out which faculty that interest you will be at the conference (if they are speaking, they will be there). Email them all, offer to meet them at literally any time and place that is convenient for them. Offer to meet them at the airport, or go with them to the airport, if necessary. Faculty tend to love this kind of thing, as it caters to our egos. Also, it makes the rides more enjoyable. After their talk, go up to them and ask them questions, or simply tell them how much you learned and how much you appreciate their work. Don’t forget to introduce yourself, and in 30 seconds, tell them whose lab you are in, and what your main project is focused on.
3. Invite faculty of interest to seminars. In many seminars there are slots for student invites. In others, faculty decide, but should always be willing/interested in inviting faculty that are interesting to the students. So, bring the faculty you want to you, and then be their host for the day, take them around to their meetings, join them for lunch and dinner, etc.

I imagine there are other qualities that I’ve forgotten. If so, please let us know in the comments below, and we can update this list.

10 Simple Rules to Write a Paper from Start to Finish

2019-02-10T18:27:57+00:00

Learn how to write by reading how to structure a paper and Writing Science (sorry that I am recommending purchasing a book, though it should be available for free from any academic library, and I recommend all PIs own a copy for themselves and their students, so it does not cost students to read it at a minimum).
Main result write draft paper title and draft killer fig that visually makes the main point of your story
Abstract write title and abstract based on OCAR story structure (described in Writing Science)
Figures draft the remaining figures
Outline generate a 1 sentence per paragraph outline
Venue choose the publication venue
Draft flesh out details of paper
Revision revise
Submission post to arxiv, submit to journal, tweet to world
Rebuttal respond to reviewers

Upon believing that you have completed work sufficient to write a peer reviewed manuscript, follow the below steps in order. If you are simply writing an abstract (for a conference, for example), just do the section entitled “Outline and Abstract”.

Learn how to write Before you write any paper, I recommend reading How to structure a paper. I also recommend reading Writing Science, if you care about writing clear compelling stories. You only need to do this once ever, not once per paper. Once you’ve read these things, follow these steps:
Main Result
1. Write a one sentence summary of your work (will become your title; ~5 min). This sentence describes the main take home message / main result. Avoid jargon to the extent possible, and grab the attention of your readership with strong substantiated claims. Titles are often required to be fewer than 88 characters.
2. Draw (or make) the “killer fig” that makes the point as clearly and concisely as possible. Guidance for making paper quality figures is here. (~5 min)
3. Get feedback from your PI and other principal authors. If the author list is unclear, send this summary and figure to everyone you think might believe they deserve to be authors on the manuscript, and invite them to be co-authors. Tell them that if they agree (by responding to you and the corresponding author), they can expect to get emails from you with updates, each of which will request feedback. And if they provide feedback, it will be carefully considered. Explain to them that the goal of feedback at this particular stage is that you’d like to know what they gathered was the main point of your paper, based on reading the title and looking at the killer fig (without sending them the caption). If they don’t get it exactly right, iterate until they do (don’t tell them the answer, just update the title/figure).
Abstract
1. Choose a writing medium. I recommend google docs unless there are a large number of equations, in which case I recommend overleaf. In either case, you want people to be able to comment easily directly on the draft, in real time, so as to avoid the possibility of collisions. Prior to sending it to anybody, make a local copy, so they can’t screw it up (and update your local copy prior to each round of feedback).
2. Describe the other results, typically 3-5 additional figures or theorems. The goal of each of these is to support the main claim, for example, by further refining, adding controls, etc. Ideally, they are sequenced together in a logical chain, like a proof, each building on the next, to tell the story. (~1 hr)
3. Write a one paragraph summary (will become your abstract; ~30 min). This will be about 250 - 300 words, more than 500 words is a page, not an abstract. To include:
  1. Big opportunity sentence: what is the grandest opportunity that this work is addressing? In other words, what is wrong in the world, impacting people (not just scientists), that you are working on righting? Potential answers include disease, world hunger, data deluge, etc.
  2. Specific opportunity: what opportunity specifically will this manuscript address? This is filtering down from the above; we are not going to try to solve all of world hunger, maybe just identify the primary causal factors contributing to hunger in India, for example.
  3. Challenge sentence: what is hard about addressing this opportunity? Think about the limitations of approaches that others have taken, they are great in some ways, but incomplete because something was a challenge that they did not overcome. Make sure to comment how great the previous approaches are, the people who developed them are our readers, reviewers, etc., and we all have egos.
  4. Gap sentence: what is currently missing? What is the key insight/innovation that enables us to overcome the challenge that others did not yet overcome.
  5. Action sentence: what did you do to address the gap, overcome the challenge, and therefore meet the opportunity? it should provide the key intuition/insight, the magic that makes this work, where others failed. This is the place where one can write something like “In this paper, we….”, although I would instead simply write “We….” (in this paper should be obvious from context).
  6. Resolution sentence: what changes for the reader now that you have met this challenge?

Note that the above structure is called “OCAR” in Writing Science, for Opportunity, Challenge, Action, and Resolution (technically they use “Opening” instead of “Opportunity”, but I like “Opportunity” more). Steps 1 and 2 form the opportunity, steps 3 and 4 form the challenge, 5 is action and 6 is resolution.

When considering which results to mention in the abstract, consider the following: the killer fig gets two sentences. Any other figure could get up to one sentence, and any result that is not a figure does not get mentioned.

Generate a “take home message” for each of the figures you plan to create. These become the first sentence of the figure caption.
Ask for feedback from senior and other co-first authors. Explain that the role of feedback at this stage is for them to tell you whether they think there are any glaring flaws with the basic setup, e.g., do they have a different idea about what is the biggest challenge. There is no need to nitpick about grammar at this stage.
Figures
1. Make a first draft of all the figures and tables with detailed captions (~ 1 week). Captions should each be about a paragraph long. At this point, the figures need not be “camera ready”, but should have all the main points made.
2. Get feedback on figures from co-authors and close colleagues (~1 week). Show them the figures, do not show them the captions, and ask them to tell you what the main point of each figure is. If they don’t get it right, don’t worry about it, take notes on what they thought, and ask them to go to the next figure. Then, spend another week updating the figures, and repeat.
Outline
1. Write a “long-form” outline of the intro (~1 hr). This is essentially an expansion of the abstract, and is therefore structured as follows:
  1. a bulleted list of ~3-5 main factors that create an opportunity for your work, filtering from most general to most specific, and not including anything ancillary (~20 min)
  2. bulleted list of the ~3-5 main challenges that must be overcome (~20 min)
  3. 1 sentence summary of the gap, that is, the key ingredient that is missing (~5 min)
  4. 2-3 sentence summary of what you did (~5 min)
  5. 2-3 sentence summary on how your work changes the world (~5 min)
2. Outline the methods. (~20 min)
3. Outline the results (~20 min). Allow for at least one paragraph per figure, table, and/or theorem. These paragraphs take the following form (no need for them to be proper “paragraphs” at this point):
  1. a sentence describing the setting,
  2. a few sentences providing additional details, and
  3. a concluding sentence providing the take home summary.
4. Outline the discussion (~1 hr), to include (not a summary)
  1. bulleted list of previous related work (~20 min)
  2. bulleted list of potential extensions (~20 min)
5. Get feedback from your co-authors. At this stage, the question is whether the logic is sound, meaning that, based on the challenge you proposed, the sequence of results provides compelling evidence that you’ve satisfactorily addressed it. If they disagree, iterate until there is agreement.
Venue
1. Read the top 3-5 most cited/downloaded articles from each potential venue to submit to, to see what the readership of that community likes. This will help you both choose the appropriate venue, and write effectively for that venue.
2. Choose the actual journal/conference you will be submitting to. Until this point, it is irrelevant, since you’ve only been focusing on the logical structure of your story. However, before you start writing, you want to make sure you are writing for a particular target audience, and write for them. This is because different venues have different expectations, which you’d like to meet.
3. Note the structure and approximate length of the venue of choice for the top articles published there. Do not worry about length at this time, just note down how many pages and figures they have, if they have an explicit methods section, if they have a preferred outline structure.
Draft
1. Expand the outline into a full draft. Do not concern yourself at this point with the details of what the publisher wants, we will deal with that later. The point of this is to simply have a draft at all.
2. Check that paper follows all relevant checklists, including:
3. Read the paper carefully out loud, and remove any words that are unnecessary.
4. Make sure each display item (e.g., figure, table, or theorem) is enumerated and explicitly referred to in the main text. If you are using LaTeX, use \ref{fig:<informative_name>} to refer to each display item. Recall that when referring to sections or figures, etc., the name of the section/figure is a proper noun, and is therefore capitalized.
5. Make sure you have sufficiently cited the literature to place your work in context. For conferences, it is typical to have about 1 page of citations (10-20). For journal articles, 30-50 is more typical. Recall, the authors of these papers are likely to be the reviewers and readers for this paper. So, it is important that you highlight all the important work, and say how great it is. In particular, you want your readers to feel good about themselves while they are reading your work, which you can facilitate by citing their work, and explaining why it is so great and important. This is, of course, actually true, since you are building on this work, and your work would likely not even be possible without the work you are citing. If you are using Google docs, I recommend using the Paperpile “Add-On”.
6. Get feedback again, ideally from a professional editor. At this stage, feedback on logic/etc. is no longer appropriate, so be clear when asking that you want feedback on whether the individual sentences/paragraphs are clear, and on stray grammar/spelling mistakes. The opportunity to provide feedback on logic has passed.
Revision
1. Update abstract and introduction to finalize draft on text (~1 day).
2. Revise manuscript addressing each and every concern you were made aware of by any of your readers at this point (~1 week). This does not necessarily mean making new figures, rather, it might mean clarifying various points of confusion.
3. Do another round of feedback, give them another week.
Submission
1. Finalize manuscript (~1 wk). I recommend that you fully ignore all guidelines provided by the journal, and you submit that which you believe will be easiest for the editors/reviewers to read and understand. I have almost always done it this way, and the submission is almost always reviewed anyway. If it gets good reviews, we can modify it so it fits their rules. An exception is conferences, where they actually care.
2. If your code is not yet open source, make it so, following the FIRM principles.
3. If your data are not yet anonymized and open access, make it so, following the FAIR principles.
4. Draft a cover letter (in google docs if you are writing in google docs, or in the same overleaf repo as the manuscript if you are using overleaf). The cover letter has the following form:
  1. It is on institutional letterhead
  2. Dear [name of editor that will be reading the letter],
  3. Paragraph 1
    1. We are delighted to submit to you our manuscript submission titled, “[title],” for publication as a [type of article, assuming there are multiple types in the journal].
    2. Establish the gap that this paper is filling in clear non-technical terms.
    3. State how one could address this gap.
  4. Paragraph 2:
    1. Summary of main contributions of the manuscript, in non-technical terms. One sentence per contribution.
    2. Conclude by stating what we expect the implications will be for their readership, and more broadly
  5. Signed by the corresponding author, “on behalf of my co-authors”
  6. Add a ps - that all the code is open source and all the data is open access, in accordance with the above mentioned principles.
  7. The language can be more flowery/confident about the expected importance of the contribution than you would write in the article itself.
5. Submit to journal.
6. Post to pre-print server.
Rebuttal

Great, you’ve now submitted the work, waited some number of weeks/months, and received detailed feedback from the editor and reviewers. How to respond? Note, no matter what they say with regard to accept, minor revision, major revision, or reject, we essentially respond in the same way.

Respond immediately. Unless you have a grant deadline, revising this paper is now your #1 top priority, bar none. All upcoming conference deadlines, and anything else you are working on now, take a back seat. The sooner the reviewers get your responses, the sooner they read them, the less they forget what they read/wrote, and the less likely they are to re-read your whole paper and give us a whole new set of complaints.
Make a google doc, copy the entirety of the comments you received into it. Change the font color for all of their comments to red.
After each “complaint”, make a line break, and respond directly to it. (Basically) Never disagree with them. The game now is simply to address the comments (ego) of the reviewers. Any comment they made is useful, at a minimum, to inform us as to places where readers might be confused about why what we did is so awesome. Sometimes this will mean making your paper slightly worse to appease them. I find it is typically worthwhile. In the response,
1. Tell them how great they are for identifying this problem with the manuscript, and express gratitude for them finding it, really try to feel grateful, your response will be more productive;
2. Tell them how you have altered the manuscript to address the issue. Yes, this means modifying the manuscript to address each and every one of their complaints, no matter how big/small they are. This includes certainly citing everything they recommend you cite, and also praising that work in the actual text.
3. For any big changes, such as a modification/addition of a figure, or a new paragraph, directly append the new/revised content into this document. The goal is to make sure the reviewer does not look back at the original manuscript, lest they find additional limitations that they want addressed. Quotes from the paper should be in blue, to highlight that they are quotes. Note: send the google doc proposing changes (but without implementing any yet) to your PI to get the green light on how to respond; once you have the green light, proceed apace.
Send google doc to your co-authors with track changes on. Often a good idea to also make a back-up copy of the rebuttal prior to sharing with them just in case. Specifically ask for feedback of the following form: “if you were the referee, and you saw that I responded thusly, would you be satisfied or not?”

If you follow the above plan, you will have a manuscript ready to submit two to three months after you start writing. Note that it includes many stages of feedback, and at each stage, a specific kind of feedback is explicitly requested. This procedure helps streamline the amount of work you do between iterations, and streamlines the entire process. Good luck!

What We Do

2019-02-10T18:27:57+00:00

This blog post is our attempt to characterize what we, at NeuroData, do. All of our actions are motivated by our mission, to understand and improve intelligences; the motivation behind our approach is described in our about page. This post summarizes the bulk of our actual work, which can be divided into three complementary threads:

big data systems
statistics / machine learning / artificial intelligence
applications

Big Data Systems

A big data system is a computational ecosystem, including hardware and software, designed to support analysis of “big data”, operationally defined as data too big to fit into a single machine’s working memory. The operations of any big data system include: storing, interfacing, pipelining, and visualizing. Existing solutions have been inadequate to support efficient hypothesis generation and scientific discovery. We therefore build, extend, and deploy systems that generalize the functionality of existing systems to support scientific inquiry. We first developed the Open Connectome Project stack in 2011 to host the first ever big dataset in neuroscience, a 10 terabyte electron microscopy dataset from Davi Bock and Clay Reid (ocp). As the data size and complexity increased, the development needs extended beyond our capabilities, so we began collaborating extensively with other teams, including scientists and engineers at Google, Allen Institute for Brain Science, and Janelia Research Campus. This collaboration led to our existing open source, community developed, computational ecosystem (ndcloud). The core components of this work include bossDB for big data storage, NDWebtools for interfacing with the data, neuroglancer for visualizing, and various pipelines for different modalities, including ndmg for functional, structural, and diffusion magnetic resonance imaging and reg for registration of 2D and 3D volumes. We continue to extend our collaboration network and ecosystem, to support an ever growing need for big data systems in brain sciences across scales and modalities, ranging from electron microscopy, to whole clear brains, to human and non-human magnetic resonance imaging.

Statistics / Machine Learning / Artificial Intelligence

Statistics, machine learning, and artificial intelligence are terms that describe complementary approaches to solving overlapping sets of problems. Central to all of them is the existence of some data samples, and a question; these approaches then build tools designed to learn an answer to the question from the data. Existing tools, however, have severe limitations that we must overcome in order to obtain answers to the questions of interest. First, raw data in neuroscience tends to be very high-dimensional (e.g., images can be petabytes), and exhibit many nonlinear relationships (e.g., the input/output function of a neuron). We have therefore developed a number of computational statistics tools for such settings. This includes state of the art methods for dimensionality reduction (LOL), classification (rerf), time-series modeling (mr. sid, https://doi.org/10.1016/J.PATREC.2016.12.012), clustering (eclust), and hypothesis testing (mgc). Second, the eventual representation of data most interesting to us is networks in the brain, or connectomes. We have therefore spent much of the last 10 years building foundational statistical estimators, theories, and algorithms for modeling populations of networks with graph, vertex, and edge attributes. A survey summarizing much of our work was recently published in JMLR (rdpg). We subsequently developed a python package that implements all of our theoretical developments (graspy), and wrote a review article on our approach to modeling connectomes, called connectal coding.

Applications

The potential applications of our work are widespread. We illustrate a couple of applications spanning our most relevant work, including model (non-human) systems and human variation. First, we characterized which sets of neurons were causally involved in which behaviors, in larval Drosophila. This led to an exploratory analysis of the mushroom body of the larval Drosophila, which we describe in detail in a technical report, with further analysis available in our survey JMLR paper, and our connectal coding paper. Second, to characterize human variation, we developed a cloud pipeline for estimation and analysis of human connectomes (sic), and used it to quantify variability present within and across individuals and studies (ndmg).

Open Science

A key principle underlying our work is the democratization of science: we desire that anybody, regardless of resources, is able to both access and contribute to the cutting edge of scientific discovery. To that end, all of the work we do is open science, including both open source code and open access data. Over the last 8 years, our website, https://neurodata.io, has grown from about 100 unique visitors each week to >1,000 unique visitors each month. In the last calendar year, >13,000 unique visitors browsed the site. These visitors span every inhabited continent, including over 2,000 different cities (see image below). Over the 8 years, approximately 80,000 unique individuals have visited our site, over double the number of people at a typical Society for Neuroscience conference, suggesting that many non-neuroscientists have visited our site. We hope that those people visiting the site feel the same sense of awe and inspiration as we do, associated with unraveling the secrets of mental function in these beautiful images of brains.

Some thoughts on building interdisciplinary biological/data science teams

2019-02-10T18:27:57+00:00

I recently was asked a few questions on this topic, and thought I’d share my answers. I did, tweeted it, and then updated this post based on public and private responses. For the twitter discussion, including contributors, please see here. I’m grateful to everyone who has contributed to this dialog. If anybody has contrary opinions, I’d love to know about them.

What do effective biologist/data scientist teams look like?

While I’m sure there are many different flavors of successful teams, we have been most successful when teams have the following properties:

A relatively concrete biological question, with a careful experimental design. Good data scientists can find patterns in anything, but not all patterns are scientifically interesting.
The data scientist is integrated into the scientific process as early as possible. For example, there are often choices the experimentalist could make that would result in much easier or more difficult challenges for the data scientist. Only by virtue of the data scientist being engaged before the data are collected is it even possible for them to provide input into these decisions.
Both the data scientist and the neurobiologist have deep respect for the other’s time and expertise. In particular, both realize that the challenges the other faces are difficult to overcome, and may take time. At the same time, both realize that they are working to answer a particular question, rather than solve some “generic” problem. For example, the data scientist must develop an algorithm that detects cell bodies with sufficiently high sensitivity and specificity to answer the particular neurobiological question of interest, rather than “the best cell body detection method ever”. This may require either or both individuals to go outside their comfort zones at times for the success of the project.
If novelty and complexity are not required for the methods, even if such properties would be sexy, neither party insists on pursuing such approaches, which can be quite time consuming and typically not a dramatic improvement on existing technology.
Often making authorship agreements up front can avoid future clashes. One workable approach is a single manuscript approach, with first-author and corresponding-author rights shared by both labs. Another reasonable approach is a two-manuscript approach, with one manuscript describing the data science technique, and another describing the biology. A potential downside to this approach is that often the data science technique does not have a natural “peer reviewed” home, because it is a “trivial” extension of previous work. That is not to say that it did not require a Herculean amount of work to get it working sufficiently well, but rather, that many stats/AI/ML peer-reviewed venues do not highly value this kind of work.

One caveat to the above is that it assumes a legitimate collaboration. Sometimes, the biologist simply needs a consultant, to check their work, suggest a complementary strategy, etc. Those can often be quite fruitful for the biologist, although the data scientist rarely gets any “academic credit” for such efforts. On the other hand, for serious collaborations, getting the data scientist up to speed on the biology can often take quite some time, for example one to two years even. Doing so is a considerable energetic investment for all parties involved. In these cases, it can be advantageous for everyone if there are smaller “wins” in the interim period that indicate to all parties that continued investment is likely worthwhile.

How deeply are they integrated?

In our experience, weekly (potentially remote) meetings between both neurobiologist and data scientist are incredibly helpful to discuss current results and get further direction for next steps. The data scientist also visits the neurobiologist at least once in the beginning of the project to witness a day of experiment, to understand what the data are. Quarterly or semi-annual in person visits are also quite beneficial, in our experience.

What is the management structure?

Usually one trainee acquiring data and one trainee developing/applying methods, each supervised by their respective PI. Once the method is fairly well established and can be routinely run by the data science trainee, it is often helpful to engage a software engineer on the data science side, to transition the code from “gradware” to at least “labware”, meaning that other people in the data science lab could also run the analysis, thereby freeing up the trainee to write a manuscript, further develop the method, or transition to other projects. Another complementary option is to engage a younger trainee to transition the project to them.

Do you have a sense for if the data scientists in the teams will continue in tackling tough biological problems, or is this perceived more as a training opportunity for junior researchers/scientist who may be looking at industry careers?

With the artificial intelligence industry skyrocketing, a large fraction of trainees are interested in transitioning to industry. Industry can be start-ups, big tech companies, the financial sector, or biotech. In each of those industries, there are now opportunities to work in data science as applied to biological problems, particularly in healthcare. Those trainees that are motivated specifically to tackle the biology question are more likely to continue doing so when they graduate. And in fact, those that are inspired to actually solve the problem, rather than transition to a different industry, I find to be much more effective at solving the biological question.

Bilateral Homology

2019-02-10T18:27:57+00:00

Homology means having the same or similar relation, relative position, or structure. In biology, homology usually refers to comparison across taxa, like human vs non-human primate. Two biological structures are homologous in different taxa if they are derived from a common ancestor, such as feet and flippers.

Within an individual, people speak of bilateral symmetry, which means there is an approximate reflection symmetry about some body axis. For example, the human cortex is bilaterally symmetric. This is despite the fact that the left side has regions of the brain that are not present in the right side.
In other words, if one were to quantify the degree of differences between the two sides, bilateral symmetry only requires that the difference between the two sides is less than some threshold. Where one puts the threshold is a subjective matter.

Bilateral homology is a relatively new concept. For now, we say that two biological structures are bilaterally homologous if, under “normal” environmental conditions, the degree of bilateral symmetry is $c$, and under certain perturbations of the conditions, the degree of bilateral symmetry is $c'$, and $c' > c$. This means that, by definition, any “normal” connectome cannot exhibit bilateral homology.

To be a bit more concrete, for a particular individual, we observe $n$ points from one side, $x_1,\ldots, x_n$, and $m$ points from the other side, $y_1,\ldots, y_m$. Assume that each observation is sampled independently from the others, and that those from the first side are sampled from some true but unknown distribution $F$, and those from the other side are sampled from some true but unknown distribution $G$. We can then posit a formal hypothesis test:

$H_0: F = G, \qquad H_A: F \neq G$

Implementing this test requires choosing a test statistic

$t = t(x_1,\ldots,x_n, y_1,\ldots,y_m) \in \mathbb{R},$

and we reject bilateral homology if and only if the observed $t$ is more extreme than $100(1 - \alpha)\%$ of the values that $t$ takes under the null.

This test, like all other tests, requires a distribution of the test statistic $t$ under the null. One way to proceed would be to acquire $S$ “typical” individuals, and compute $t$ for each of them. Then, given one individual that experienced an experimental perturbation, we could accurately ascertain how extreme this individual’s test statistic is relative to the “typical” ones. This approach, of course, is sensitive to both the choice of test statistic and model: $F, G \in \mathcal{F}$. Nonetheless, reasonable choices for both are available in many situations.
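As a rough sketch of this first approach, the snippet below builds an empirical null from simulated “typical” individuals and scores one perturbed individual against it. Everything specific here is an assumption made for illustration: the test statistic (absolute difference of side means), the Gaussian toy data, the sample sizes, and the helper names `mean_difference` and `empirical_p_value`.

```python
import random
import statistics


def mean_difference(x, y):
    """Illustrative test statistic t: absolute difference of the two side means."""
    return abs(statistics.fmean(x) - statistics.fmean(y))


def empirical_p_value(t_observed, t_null):
    """Fraction of null statistics at least as extreme as the observed one.

    The +1 terms keep the estimate strictly positive when S is finite.
    """
    n_extreme = sum(1 for t in t_null if t >= t_observed)
    return (n_extreme + 1) / (len(t_null) + 1)


rng = random.Random(0)

# Hypothetical data: S "typical" individuals whose two sides share one distribution.
S, n, m = 200, 50, 60
t_null = []
for _ in range(S):
    left = [rng.gauss(0.0, 1.0) for _ in range(n)]
    right = [rng.gauss(0.0, 1.0) for _ in range(m)]
    t_null.append(mean_difference(left, right))

# Hypothetical perturbed individual: one side is shifted relative to the other.
left = [rng.gauss(0.0, 1.0) for _ in range(n)]
right = [rng.gauss(1.5, 1.0) for _ in range(m)]
t_perturbed = mean_difference(left, right)

alpha = 0.05
p = empirical_p_value(t_perturbed, t_null)
print(f"t = {t_perturbed:.3f}, p = {p:.4f}, reject at alpha = {alpha}: {p <= alpha}")
```

A different statistic (for example, a distance between empirical distributions) can be swapped in for `mean_difference` without changing the rest of the procedure.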

On the other hand, in the absence of $S$ typical individuals, how might one proceed? One could sample $S$ times from an estimated null. This would depend strongly on $\mathcal{F}$, but is still possible.
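One way to picture sampling from an estimated null is a parametric simulation. The sketch below assumes, purely for illustration, that $\mathcal{F}$ is the Gaussian family and that the null ($F = G$) is estimated by fitting a single Gaussian to the pooled observations from both sides; the function name `gaussian_null` and the toy data are hypothetical.

```python
import random
import statistics


def mean_difference(x, y):
    """Illustrative test statistic t: absolute difference of the two side means."""
    return abs(statistics.fmean(x) - statistics.fmean(y))


def gaussian_null(x, y, n_draws, rng):
    """Simulate t under a Gaussian null fitted to the pooled data (an assumed model)."""
    pooled = list(x) + list(y)
    mu = statistics.fmean(pooled)
    sigma = statistics.stdev(pooled)
    draws = []
    for _ in range(n_draws):
        # Under the null both sides come from the same fitted distribution.
        sim_x = [rng.gauss(mu, sigma) for _ in x]
        sim_y = [rng.gauss(mu, sigma) for _ in y]
        draws.append(mean_difference(sim_x, sim_y))
    return draws


rng = random.Random(1)

# Hypothetical individual whose two sides differ in mean.
x = [rng.gauss(0.0, 1.0) for _ in range(80)]
y = [rng.gauss(1.0, 1.0) for _ in range(80)]

S = 500
t_obs = mean_difference(x, y)
t_null = gaussian_null(x, y, S, rng)
p_value = (1 + sum(t >= t_obs for t in t_null)) / (S + 1)
print(f"t = {t_obs:.3f}, p = {p_value:.4f}")
```

The result is only as trustworthy as the assumed family: if the data are far from Gaussian, the simulated null can be badly miscalibrated, which is the dependence on $\mathcal{F}$ noted above.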

Now, assume we observe only a single typical individual. The question we are interested in asking is: are the left and the right “more different than one would expect by chance?”. Since we’ve only observed one individual, how might we go about estimating the null? One option would be to fit a series of increasingly complex models, $\mathcal{F}_1, \mathcal{F}_2, \ldots$, to one side (say, the left), and assess how well each fits the other side. Then, do the same in reverse, training the models on the right and assessing them on the left. Assuming for simplicity that both experiments revealed that the same model was the best fit across sides, we can operate under the assumption that this is a reasonable model. Now, we can proceed as proposed above: using this selected model, estimate the null distribution by sampling, which provides a p-value.
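The single-individual procedure can be sketched end to end as follows. This is only one possible instantiation, with several assumptions that are not in the text above: the “increasingly complex models” are smoothed histograms with more and more bins, model fit is measured by held-out average log-likelihood, the selected model is refit to the pooled data to represent the null, and if the two directions disagree the simpler model is used. All function names are hypothetical.

```python
import math
import random
import statistics


def mean_difference(x, y):
    """Illustrative test statistic t: absolute difference of the two side means."""
    return abs(statistics.fmean(x) - statistics.fmean(y))


def bin_index(v, n_bins, lo, hi):
    """Index of the equal-width bin on [lo, hi] that contains v."""
    return min(int((v - lo) / (hi - lo) * n_bins), n_bins - 1)


def fit_histogram(data, n_bins, lo, hi):
    """Bin probabilities with add-one smoothing so that no bin has zero mass."""
    counts = [1.0] * n_bins
    for v in data:
        counts[bin_index(v, n_bins, lo, hi)] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def avg_log_likelihood(data, probs, lo, hi):
    """Average log-density of data under a fitted histogram."""
    width = (hi - lo) / len(probs)
    logs = [math.log(probs[bin_index(v, len(probs), lo, hi)] / width) for v in data]
    return sum(logs) / len(logs)


def best_model(train, test, candidates, lo, hi):
    """Bin count whose fit on `train` best explains `test`."""
    return max(
        candidates,
        key=lambda k: avg_log_likelihood(test, fit_histogram(train, k, lo, hi), lo, hi),
    )


def sample_histogram(probs, size, lo, hi, rng):
    """Draw points from a histogram: pick a bin, then a uniform point inside it."""
    width = (hi - lo) / len(probs)
    bins = rng.choices(range(len(probs)), weights=probs, k=size)
    return [lo + (b + rng.random()) * width for b in bins]


rng = random.Random(2)

# Hypothetical single "typical" individual.
left = [rng.gauss(0.0, 1.0) for _ in range(300)]
right = [rng.gauss(0.0, 1.0) for _ in range(300)]
lo, hi = min(left + right), max(left + right)

candidates = [1, 2, 4, 8, 16, 32, 64]  # increasing model complexity
k_left_to_right = best_model(left, right, candidates, lo, hi)
k_right_to_left = best_model(right, left, candidates, lo, hi)
# The text assumes both directions agree; taking the simpler one otherwise is an assumption.
k = min(k_left_to_right, k_right_to_left)

# Estimate the null by sampling from the selected model fitted to the pooled data.
S = 500
probs = fit_histogram(left + right, k, lo, hi)
t_obs = mean_difference(left, right)
t_null = [
    mean_difference(
        sample_histogram(probs, len(left), lo, hi, rng),
        sample_histogram(probs, len(right), lo, hi, rng),
    )
    for _ in range(S)
]
p_value = (1 + sum(t >= t_obs for t in t_null)) / (S + 1)
print(f"selected bins = {k}, t = {t_obs:.3f}, p = {p_value:.4f}")
```

Histograms are used here only because they form a simple nested family; any sequence of models that can be fit, scored on held-out data, and sampled from would serve the same role.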

Tips for Getting into a Top Graduate Program

2018-10-21T18:27:57+00:00

As faculty in the Department of Biomedical Engineering at Johns Hopkins University, the best BME department in the world, both in terms of undergraduate and graduate schools, I have learned what I and other faculty are looking for in applicants. Before getting into what the most important factors are, I believe it is important to understand our goals, which motivate those factors. From the perspective of the graduate admissions committee, our goal is to estimate whether we believe that BME@JHU is the best place for you to thrive and achieve your ultimate dreams. In other words, we try to ascertain whether the environment that we create at JHU will be maximally supportive of both your strengths and weaknesses. As it turns out, this does not mean necessarily that we accept the best students in some abstract sense (as defined by some arbitrary metric), but rather, we try to accept the students for whom we believe we will be the best mentors. Of course, this is a complicated objective function, and one for which we will most likely sometimes make errors. Nonetheless, it is our goal. To make such estimations, we look for the following:

Research Experience: First and foremost, we are a research university. So, the best way for us to determine whether our research environment will support you to flourish is to understand your previous research experience, and in which settings you flourished more than others. Although successful research is difficult to quantify, research artifacts provide some data with which we can evaluate your achievements. Such artifacts include poster presentations, conference proceedings, pre-prints, journal publications, numerical packages, and even patents sometimes. If you are the first (or co-first) author on any of these, we typically assume that much of the work is yours, and thus first author research artifacts are most informative. Middle author works are also informative, especially if you clarify your role in the research in your personal statement. Note, however, strong research experience is not a pre-requisite for admission. Rather, it is an information-rich piece of data for us.
Grades: JHU is not just a research university, we are also a teaching university, and we take our teaching responsibilities quite seriously. Moreover, many of our graduate level BME courses are also serious and time consuming. It is important to us that you perform well in them, because they provide the necessary background upon which our research programs are based. It is not important to us that you got straight A’s, very few applicants have. Rather, we care that you perform well in the courses that will be the most relevant for your research during your PhD, typically quantitative and biology classes for us. It is also not crucial that you performed well in every semester. The grades in the most recent semesters, in the most relevant courses, are most important. Aim for getting A’s in them, but GPA alone gets students neither admission nor rejection. We understand that life happens, and certain things are more important than coursework (family, health, well-being, etc.). Finally, we appreciate that not everybody gets the same opportunities, in life, in high school, etc., and therefore not everybody is equally well prepared for our coursework. That is ok, we are trying to estimate whether you will be successful in our program.
Recommendations: While these do not come directly from you, they are quite important to us. BME@JHU is like a big extended family. We work closely with one another, sit near each other (typically), we have been doing so for a long time, and plan to continue doing so for many years to come. Therefore, our community is quite important to us, and our success comes largely from surrounding ourselves not just with the smartest people in the world, but more importantly, really good people. So, the recommendation letters are a way for us to get information about how pleasant it is to work with you. I particularly look for recommendations from other faculty with successful research programs, as they are the most informative with regards to what it takes to have a successful PhD. Recommendations from industry can be somewhat informative, but less so. In other words, the number of PhD students somebody has mentored matters in our assessment. In terms of content, we are looking for recommendations that write that you are pleasant to work with, and amongst the best of his/her previous students along some dimensions, such as productivity, passion, drive, creativity, organization, etc. In other words, you excel in the kinds of personality traits that we think contribute to successful PhDs. Much like research experience and grades, a good or bad recommendation cannot determine your acceptance.
Personal Statement: Your personal statement is your opportunity to express yourself. The most important aspect of a personal statement for me is passion. Success in our field, I believe, is strongly correlated with passion. Even if that is not the case, it is more fun for me to work with people that are passionate about solving some problems. So, express yourself freely and passionately. And be specific. Find a few faculty members in the department that you are applying to, and write about what, in particular, you find most exciting about their work. In this way, we’ll be able to align your passions with ours in the review process. If you’ve reached out to any of the faculty, or anybody else associated with the department prior to application, mention it, and how it has informed your decision to apply. I recommend that you do reach out to faculty in advance, if possible. And don’t forget to spell/grammar check it.

There are a few things that people invest a bunch of energy in that hardly matter at all. First is the GRE. Evidence is building that it is classist, racist, and sexist (see for example, here, though see counter-points here). Several schools have stopped using them, but not all (there are lists online). As it currently stands at BME@JHU, only if somebody does quite poorly on the quantitative aspect of the GRE (say, below 70%), does his/her GRE score even typically come up for discussion. In certain cases, the GRE can be waived, so we encourage you to email and ask. Second is fellowships. In general, if we have not heard of them, do not understand the criteria for winning them, who applies, or what is achieved, it is hard to evaluate their value. I’ve literally never heard them come up in discussing any applicant, and I’ve now been privy to discussions of literally hundreds, maybe over 1,000, applicants across multiple different departments.

My lab, as well as many other successful labs, is always accepting exceptional graduate students. The success of our labs depends on the success of excellent students, so we are always searching for and hoping to find people whose passions align with ours, and whose abilities either align with or complement our own.

Finally, we strongly encourage applications from diverse individuals. We try our best to evaluate each individual in the context of his/her/their background. We believe that a more diverse and inclusive academic environment leads to better science, and a better society.

I hope this is helpful. If anybody disagrees with my assessment, or has other recommendations, or further questions, I’d love to hear from you in the comments.