Field of Science

Showing posts with label machine learning. Show all posts

Rutherford on tools and theories (and machine learning)


Ernest Rutherford was the consummate master of experiment, disdaining theoreticians for playing around with their symbols while he and his fellow experimentalists discovered the secrets of the universe. He was said to have used theory and mathematics only twice - once when he discovered the law of radioactive decay and again when he used the theory of scattering to interpret his seminal discovery of the atomic nucleus. But that's where his tinkering with formulae stopped.

Time and time again Rutherford used relatively simple equipment and tools to pull off seemingly miraculous feats. He had already won the Nobel Prize for chemistry by the time he discovered the nucleus - a rare and curious case of a scientist making their most important discovery after they won a Nobel Prize. The nucleus clearly deserved another Nobel, but so did his fulfillment of the dreams of the alchemists when he transmuted nitrogen to oxygen by artificial disintegration of the nitrogen atom in 1919. These achievements fully justified Rutherford's stature as perhaps one of the two greatest experimental physicists in modern history, the other being Michael Faraday. But they also justified the primacy of tools in engineering scientific revolutions.

However, Rutherford was shrewd and wise enough to recognize the importance of theory - he famously mentored Niels Bohr, presumably because "Bohr was different; he was a football player." And he was on good terms with both Einstein and Eddington, the doyens of relativity theory in Europe. So it's perhaps not surprising that he made a quite interesting observation about the discovery of radioactivity, one that attests to the importance of theoretical ideas.

As everyone knows, radioactivity in uranium was discovered by Henri Becquerel in 1896, then taken to great heights by the Curies. But as Rutherford points out in a revealing paragraph (Brown, Pais and Pippard, "Twentieth Century Physics", Vol. 1; 1995), it could potentially have been discovered a hundred years earlier. More accurately, it could have been experimentally discovered a hundred years earlier.

Image

Rutherford's basic point is that unless there's an existing theoretical framework for interpreting an experiment - providing the connective tissue, in some sense - the experiment remains merely an observation. Depending only on experiments to automatically uncover correlations and new facts about the world is therefore tantamount to hanging on to a tenuous, risky and uncertain thread that might lead you in the right direction only occasionally, by pure chance. In some ways Rutherford here is echoing Karl Popper's refrain when Popper said that even unbiased observations are "theory laden"; in the absence of the right theory, there's nothing to ground them.

It strikes me that Rutherford's caveat applies well to machine learning. One goal of machine learning - at least as believed by its most enthusiastic proponents - is to find patterns in the data, whether the data is dips and rises in the stock market or signals from biochemical networks, by blindly letting the algorithms discover correlations. But simply letting the algorithm loose on data would be like letting gold leaf electroscopes and other experimental apparatus loose on uranium. Even if they find some correlations, these won't mean much in the absence of a good intellectual framework connecting them to basic facts. You could find a correlation between two biological responses, for instance, but in the absence of a holistic understanding of how the components responsible for these responses fit within the larger framework of the cell and the organism, the correlations would stay just that - correlations without a deeper understanding.

What's needed to get to that understanding is machine learning plus theory, whether it's a theory of the mind for neuroscience or a theory of physics for modeling the physical world. It's why efforts that try to supplement machine learning by embedding knowledge of the laws of physics or biology in the algorithms are likely to work, while efforts blindly using machine learning to try to discover truths about natural and artificial systems using correlations alone would be like Rutherford's fictitious uranium salts from 1806 giving off mysterious radiation that's detected without interpretation, posing a question waiting for an explanation.

A new paper on kinase inhibitor discovery: not one on "drugs", and not one on an "AI breakthrough"

There is a new multicenter study on the discovery of some new kinase inhibitor compounds for the kinase DDR1 that has been making the rounds. Using a particular flavor of generative models, the authors derive a few potent and selective inhibitors for DDR1, a kinase target that has been implicated in fibrosis.

The paper is an interesting application of generative deep learning models to kinase inhibitor discovery. The authors start with six training datasets including ZINC and several patents along with a negative dataset of non-kinase inhibitors. After using their generative reinforcement learning model and filtering out reactives and clustering, they select 40 random molecules that have a Tanimoto similarity of less than 0.5 to vendor stocks and the patent literature, and pick 6 out of these for testing. Four of the six compounds are indicated as showing an improvement in the potency against DDR1, although it seems that for two of these, the potency is little improved relative to the parent compound (10 and 21 nM vs 15 nM, which is well within the two or threefold margin of error in most biological assays). The selectivity of two of the compounds for the undesirable isoform DDR2 is also essentially the same (649 nM vs 1000 nM and 278 nM vs 162 nM; again within the twofold error margin of the assay). So from a potency standpoint, the algorithm seems to find equipotent inhibitors at best; given that these four molecules were culled from a starting set of 30,000, that indicates a hit rate of roughly 0.01%. Good selectivity against a small kinase panel is demonstrated, but selectivity against a larger panel of off-targets is not indicated. There also don't seem to be tests for aggregation or non-specific behavior; computational techniques in drug discovery are well known to produce a surfeit of false positives. It would also be really helpful to get some SAR for these compounds to know if they are one-off non-specific binders or actual lead compounds.
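The arithmetic behind these comparisons is easy to check. The following is a minimal sketch, not anything from the paper: it treats the twofold figure quoted above as the assay noise floor (an assumption; the threefold end of the range would be even more forgiving), and the function names are made up for illustration.

```python
# Assumed noise floor: the lower (twofold) end of the error margin quoted above.
ASSAY_FOLD_ERROR = 2.0


def fold_change(a_nm: float, b_nm: float) -> float:
    """Symmetric fold difference between two IC50 values (always >= 1)."""
    return max(a_nm, b_nm) / min(a_nm, b_nm)


def within_assay_error(a_nm: float, b_nm: float, fold: float = ASSAY_FOLD_ERROR) -> bool:
    """True if two potencies cannot be told apart at the assumed assay noise."""
    return fold_change(a_nm, b_nm) <= fold


# DDR1 potencies quoted in the post: two new compounds vs the 15 nM parent.
parent_ddr1_nm = 15.0
for candidate_nm in (10.0, 21.0):
    print(candidate_nm, "nM vs parent:",
          round(fold_change(candidate_nm, parent_ddr1_nm), 2), "fold,",
          "within error" if within_assay_error(candidate_nm, parent_ddr1_nm) else "distinct")

# DDR2 value pairs quoted in the post.
ddr2_pairs_nm = [(649.0, 1000.0), (278.0, 162.0)]
for a, b in ddr2_pairs_nm:
    print(a, "vs", b, "nM:", round(fold_change(a, b), 2), "fold")

# Four reported hits out of a starting set of 30,000 molecules.
hit_rate_percent = 100 * 4 / 30_000
print("hit rate:", round(hit_rate_percent, 3), "%")
```

Every pair above comes out below twofold, which is the sense in which the compounds are equipotent at best.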

Now, even equipotent inhibitors can be useful if they show good ADME properties or evidence scaffold hops. The group tested the inhibitors in liver microsomal assays, and they seem to have similar stability as a group of non-kinase inhibitor controls, although it would be good to see some accompanying data for DDR inhibitors next to this data. They also tested one of the compounds in a rodent model, and it seems to show satisfactory half lives; it's again not clear how these compare to other DDR inhibitors. Finally, they build a pharmacophore-based binding model of the inhibitor and compare it to a similar quantum mechanical model, but there is no experimental data (from NMR or mutagenesis for instance) which would allow a good experimental validation of this binding pose. Pharmacophore models are again notorious for producing false positives, and it's important to demonstrate that the pharmacophore in fact does not also fit the negative data.

The paper claims to have discovered the inhibitors "in 21 days" and tested them in 46. The main issue here - and this is by no means a critique of just this paper - is not that the discovered inhibitors show very modest improvement at best over the reference; it's that there is no baseline comparison, no null models, that can tell us what the true value of the technique is. This has been a longstanding complaint in the computational community. For instance, could regular docking followed by manual picking have found the same compounds in the same time? What about simple comparisons with property-based metrics or 2D metrics? And could a team of expert medicinal chemists brainstorming over beer have looked at the same data and come up with the same conclusions much sooner? I am glad that the predictions were actually tested - even this simple follow-up is often missing from computational papers - but 21 days is not as short as it sounds if you start with a vast amount of already-existing and curated data from databases and patents, and if simpler techniques can find the same results sooner. And the reliance on vast amounts of data is of course a well-known Achilles heel for deep learning techniques, so these techniques will almost certainly not work well on new targets with a paucity of data.

Inhibitor discovery is hardly a new problem for computational techniques, and any new method is up against a whole phalanx of structure and ligand-based methods that have been developed over the last 30+ years. There's a pretty steep curve to surmount if you actually want to proclaim your latest and greatest AI technique as a novel application. As it stands, the issue is not that the generative methods didn't discover anything, it's that it's impossible to actually judge their value because of an absence of baseline comparisons.

The AI hype machine is out in absolute full force on this one (see here, here and especially here for instance). I simply don't understand this great desire to proclaim every advance in a field as a breakthrough without simply calling it a useful incremental step or constructively criticizing it. And when respected sources like WIRED and Forbes proclaim that there's been a breakthrough in new drug discovery, the non-scientific public which is unfamiliar with IC50 curves or selectivity profiles or the fact that there's a huge difference between a drug and a lead will likely think that a new age of drug discovery is upon us. There's enough misleading hype about AI to go around, and adding more to the noise does both the scientific and the non-scientific community a disservice.

Longtime cheminformatics expert Andreas Bender has some similar thoughts here, and of course, Derek at In the Pipeline has an excellent, detailed take here.

The three horsemen of the machine learning apocalypse

My colleague Patrick Riley from Google has a good piece in Nature in which he describes three very common errors in applying machine learning to real world problems. The errors are general enough to apply to all uses of machine learning irrespective of field, so they certainly apply to a lot of machine learning work that has been going on in drug discovery and chemistry.

The first kind of error is an incomplete split between training and test sets. People who do ML in drug discovery have encountered this problem often; the test set can be very similar to the training set, or - as Patrick mentions here - the training and test sets aren't really picked at random. There should be a clear separation between the two sets, and the impressive algorithms are the ones which extrapolate non-trivially from the former to the latter. Only careful examination of the training and test sets can ensure that the differences are real.
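One rough way to examine a split is to ask, for each test item, how similar its nearest neighbor in the training set is. The sketch below is only an illustration under stated assumptions: fingerprints are represented as plain sets of integer feature ids (a real project would compute them with a cheminformatics toolkit), the 0.7 cutoff is an arbitrary choice, and the function names are hypothetical.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest_train_similarity(test_fp: frozenset, train_fps: list) -> float:
    """Similarity of a test item to its closest training item."""
    return max(tanimoto(test_fp, fp) for fp in train_fps)


def leaky_test_items(train_fps: list, test_fps: list, threshold: float = 0.7) -> list:
    """Indices of test items that are near-duplicates of something in training.

    The threshold is an assumed, tunable cutoff, not a standard value.
    """
    return [
        i for i, fp in enumerate(test_fps)
        if nearest_train_similarity(fp, train_fps) >= threshold
    ]


# Toy fingerprints: sets of made-up feature ids.
train = [frozenset({1, 2, 3, 4}), frozenset({5, 6, 7, 8})]
test = [frozenset({1, 2, 3, 4, 9}), frozenset({10, 11, 12})]

flagged = leaky_test_items(train, test)
print("test items too close to training data:", flagged)
```

A test set dominated by flagged items measures interpolation; performance on the remaining items is the more honest estimate of extrapolation.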

Another more serious problem with training data is of course the many human biases that have been exposed over the last few years, biases arising in fields ranging from hiring to facial recognition. The problem is that it's almost impossible to find training data that doesn't have some sort of human bias (in that context, general image data usually works pretty well because of the sheer number of random images human beings capture), and it's very likely that this hidden bias is what your model will then capture. Even chemistry is not immune from such biases; for instance, if your training data contains compounds synthesized using metal-catalyzed coupling reactions and is therefore enriched in biaryls, you will be training an algorithm that is excellent at identifying biaryls, drug scaffolds that are known to have issues with stability and clearance in the body.

The second problem is that of hidden variables, and this is especially the case with unsupervised learning where you let loose your algorithm on a bunch of data and expect it to learn relevant features. The problem is that there are a very large number of features in the data that your algorithm could potentially learn and find correlations with, and a good number of these might be noise or random features that would give you a good correlation while being physically irrelevant. A couple of years ago there was a good example of an algorithm used to classify tumors learning nothing about the tumors per se but instead learning features of rulers; it turns out that oncologists often keep rulers next to malignant tumors to measure their dimensions, and these were visible in the pictures. 

Closer to the world of chemistry, there was a critique last year of an algorithm that was supposed to pick an optimal combination of reaction conditions for a synthetic Buchwald-Hartwig reaction. This is a rather direct application of machine learning in chemistry, and one of the most promising ones in my view, partly because reaction optimization is still very much a trial-and-error art and it is far more deterministic than, say, finding a new drug target based on sparse genomic correlations. After the paper was published there was a critique pointing out that you could get the same results if you randomized the data or fit the model on noise. That doesn't mean the original model was wrong, it means that it wasn't unique and wasn't likely causative. Basically asking what exactly your model is fitting to is always a good idea.
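The randomization control mentioned above is cheap to run. Here is a minimal sketch on synthetic data (not the reaction dataset from the paper): the score on the real labels is compared with scores obtained after shuffling the labels, and a model is only believable if it clearly beats the shuffled runs. The one-variable least-squares fit, the number of rounds and the seeds are all assumptions chosen to keep the example self-contained.

```python
import random
import statistics


def r_squared(xs, ys):
    """R^2 of a one-variable least-squares fit (equals the squared Pearson r)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def y_scramble_scores(xs, ys, n_rounds=200, seed=0):
    """Scores obtained after destroying the x-y relationship by shuffling y."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = ys[:]          # copy so the real labels stay untouched
        rng.shuffle(shuffled)
        scores.append(r_squared(xs, shuffled))
    return scores


# Synthetic data with a genuine linear signal plus noise.
data_rng = random.Random(1)
xs = [float(i) for i in range(30)]
ys = [2.0 * x + data_rng.gauss(0, 3.0) for x in xs]

real_score = r_squared(xs, ys)
null_scores = y_scramble_scores(xs, ys)

# Empirical p-value: how often does a scrambled run match the real score?
p_value = (1 + sum(s >= real_score for s in null_scores)) / (1 + len(null_scores))
print("real R^2:", round(real_score, 3))
print("best scrambled R^2:", round(max(null_scores), 3))
print("empirical p-value:", round(p_value, 4))
```

If the scrambled runs scored about as well as the real one, that would be the situation the critique describes: the model is fitting something other than the chemistry.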

As Patrick's article points out, there are other examples like an algorithm latching on to edge effects of plates in a biological assay or in image analysis in phenotypic screening; two other applications very relevant to drug discovery. The remedy here again is to run many different models while asking many different questions, a process that needs patience and foresight. Another strategy which I increasingly like would be to not do unsupervised learning but instead do constrained learning, with the constraints coming from the laws of science.

The last problem is a bit more subtle and involves using the wrong objective or "loss" function. A lot of this boils down to asking the right question. Patrick cites the example of using ML to diagnose diabetic retinopathy using images of the back of the eye. It turns out that if the question they asked was focused more on diagnosing a single disease rather than whether the patient needs to see a doctor, the models were thrown into disarray.

So what's the solution? What it always has been. As the article says,


"First, machine-learning experts need to hold themselves and their colleagues to higher standards. When a new piece of lab equipment arrives, we expect our lab mates to understand its functioning, how to calibrate it, how to detect errors and to know the limits of its capabilities. So, too, with machine learning. There is no magic involved, and the tools must be understood by those using them."
Would you expect the developer of a new NMR technique or a new quantum chemistry calculation algorithm to not know what lies under the hood? Would you expect these developers to not run tests using many different parameters and under different controls and conditions? For that matter, would you expect a soldier to go into battle without understanding the traps the enemy has laid? Then why expect developers of machine learning to operate otherwise?
Some of it is indeed education, but much of it involves the same standards and values that have been part of the scientific and engineering disciplines since antiquity. Unfortunately, too often machine learning, especially because of its black-box nature, is regarded as magic. But there is no magic (Arthur Clarke quotes notwithstanding). It's all careful, meticulous investigation; it's about going into the field knowing that there almost certainly will be a few mines scattered here and there. Be careful if you don't want your model to get blown up.

Open Borders

The traveler comes to a divide. In front of him lies a forest. Behind him lies a deep ravine. He is sure about what he has seen but he isn’t sure what lies ahead. The mostly barren shreds of expectations or the glorious trappings of lands unknown, both are up for grabs in the great casino of life.
First came the numbers, then the symbols encoding the numbers, then symbols encoding the symbols. A festive smattering of metamaniacal creations from the thicket of conjectures populating the hive mind of creative consciousness. Even Kurt Gödel could not grasp the final import of the generations of ideas his self-consuming monster creation would spawn in the future. It would plough a deep, indestructible furrow through biology and computation. Before and after that it would lay men’s ambitions of conquering knowledge to final rest, like a giant thorn that splits open dreams along their wide central artery.
Code. Growing mountains of self-replicating code. Scattered like gems in the weird and wonderful passage of spacetime, stupefying itself with its endless bifurcations. Engrossed in their celebratory outbursts of draconian superiority, humans hardly noticed it. Bits and bytes wending and winding their way through increasingly Byzantine corridors of power, promise and pleasure. Riding on the backs of great expectations, bellowing their heart out without pondering the implications. What do they expect when they are confronted, finally, with the picture-perfect contours of their creations, when the stagehands have finally taken care of the props and the game is finally on? Shantih, shantih, shantih, I say.
Once the convoluted waves of inflated rational expectations subside, the reality kicks in in ways that only celluloid delivered in the past. Machines learning, loving, loving the learning that other machines love to do was only a great charade. The answer arrives in a hurry, whispered and then proudly proclaimed by the stewards of possibility. We can never succeed because we don’t know what success means. How doth the crocodile eat the tasty bits if he can never know where red flesh begins and the sweet lilies end? Who will tell the bards what to sing if the songs of Eden are indistinguishable from the last gasps of death? We must brook no certainty here, for the fruit of the tree can sow the seeds of murderous doubt.
Just so often, although not as often as our eager minds would like, science uncovers connections between seemingly unrelated phenomena that point to wholly new ways forward. Last week, a group of mathematicians and computer scientists uncovered a startling connection between logic, set theory and machine learning. Logic and set theory are the purest of mathematics. Machine learning is the most applied of mathematics and statistics. The scientists found a connection between two very different entities in these very different fields – the continuum hypothesis in set theory and the theory of learnability in machine learning.
The continuum hypothesis is related to two different kinds of infinities found in mathematics. When I first heard the fact that infinities can actually be compared, it was as if someone had cracked my mind open by planting a firecracker inside it. There is the first kind of infinity, the “countable infinity”, which is defined as an infinite set that maps one-to-one with the set of natural numbers. Then there’s the second kind of infinity, the “uncountable infinity”, a gnarled forest of limitless complexity, defined as an infinity that cannot be so mapped. Real numbers are an example of such an uncountable infinity. One of the staggering results of mathematics is that the infinite set of real numbers is somehow “larger” than the infinite set of natural numbers. The German mathematician Georg Cantor supplied the proof of the uncountable nature of the real numbers, sometimes called the “diagonal proof”. It is like a beautiful gem that has suddenly fallen from the sky into our lap; reading it gives one intense pleasure.
The continuum hypothesis asks whether there is an infinity whose size is between the countable infinity of the natural numbers and the uncountable infinity of the real numbers. The mathematicians Kurt Gödel and – more notably – Paul Cohen were unable to prove whether the hypothesis is correct or not, but they were able to prove something equally or even more interesting: that the continuum hypothesis cannot be decided one way or another within the standard axioms of set theory. Thus, there is a world of mathematics in which the hypothesis is true, and there is one in which it is false. And our current understanding of mathematics is consistent with both these worlds.
Fifty years later, the computational mathematicians have found a startling and unexpected connection between the truth or lack thereof of the continuum hypothesis and the idea of learnability in machine learning. Machine learning seeks to learn the details of a small set of data and make correlative predictions for larger datasets based on these details. Learnability means that an algorithm can learn parameters from a small subset of data and accurately make extrapolations to the larger dataset based on these parameters. The recent study found that whether learnability is possible or not for arbitrary, general datasets depends on whether the continuum hypothesis is true. If it is true, then one will always find a subset of data that is representative of the larger, true dataset. If the hypothesis is false, then one will never be able to pick such a dataset. In fact in that case, only the true dataset represents the true dataset, much as only an accused man can best represent himself.
This new result extends both set theory and machine learning into urgent and tantalizing territory. If the continuum hypothesis is false, it means that we will never be able to guarantee being able to train our models on small data and extrapolate to large data. Specific models will still be able to be built, but the general problem will remain unsolvable. This result can have significant implications for the field of artificial intelligence. We are entering an age where it’s possible to seriously contemplate machines controlling other machines, with human oversight impossible not just in practice but also in principle. As code flows through the superhighway of other code and groups and regroups to control other pieces of code, machine learning algorithms will be in charge of building models based on existing data as well as generating new data for new models. Results like the current result might make it impossible for such self-propagating intelligent algorithms to ensure being able to solve all our problems, or solve their own problems to imprison us. The robot apocalypse might be harder than we think.
As Jacob Bronowski memorably put it in his “The Ascent of Man”, one of the major goals of science in the 20th century was to establish the certainty of scientific knowledge. One of the major achievements of science in the 20th century was to prove that this goal is unattainable. In physics, Heisenberg’s uncertainty principle put a fundamental limit on measurement in the world of elementary particles. Einstein’s theory of relativity put a fundamental limit on the speed of light. But most significantly, it was Gödel’s famous incompleteness theorem that put a fundamental limit on what we could prove and know even in the seemingly impregnable world of pure, logical mathematics. Even in logic, that bastion of pure thought, where conjectures and refutations don’t depend on any quantity in the real world, we found that there are certain statements whose truth might forever remain undecidable.
Now the same Gödel has thrown another wrench in the machine, asking us whether we can indeed hold inevitability and eternity in the palm of our hands. As long as the continuum hypothesis remains undecidable, so will the ability of machine learning to transform our world and seize power from human beings. And if we cannot accomplish that feat of extending our knowledge into the infinite unknown, instead of despair we should be filled with the ecstatic joy of living in an open world, a world where all the answers can never be known, a world forever open to exploration and adventure by our children and grandchildren. The traveler comes to a divide, and in front of him lies an unyielding horizon.

What areas of chemistry could AI impact the most? An opinion poll

The other day I asked the following as a survey question regarding potential areas of chemistry where AI could have the biggest impact.

Image

There were 163 responses which wasn't a bad representative sample. The responses are in fact in line with my own thinking: synthesis planning and automation emerge as the leading candidates. 

I think synthesis planning AI will have the biggest impact on everyday lab operations during the next decade. Synthesis planning, while still challenging, is a relatively deterministic protocol based on a few good reactions and a large but digestible number of data points. Reliable reactions like olefin metathesis and metal-mediated coupling have now become fairly robust and heavily used, generating thousands of machine-readable data points and demonstrating reliability and relative predictability; there are now fewer surprises, and whatever surprises exist are well-documented. As recent papers make clear, synthesis planning had been waiting in the wings for several years for the curation of millions of examples pertaining to successful and unsuccessful reactions and chemotypes as well as for better neural networks and computing power. Without the first development the second wouldn't have made a big difference, and it seems like we are finally getting there with good curation.

I was a bit more surprised that materials science did not rank higher. Quantum chemical calculations for estimating optical, magnetic, electronic and other properties have been successful in materials science and have started enabling high-throughput studies in areas like MOF and battery technology, so I expect this field to expand quite a bit during the next few years. Similar to how computation has worked in drug discovery, AI approaches don't need to accurately predict every material property to three decimal places; they will have a measurable impact even if they can qualitatively rank different options and narrow down the pool so that chemists have to spend fewer resources making them.
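The difference between predicting values and ranking options can be made concrete with a rank correlation. The sketch below uses invented numbers (hypothetical measured and predicted property values for five materials, not data from any study): the predictions are off by about one unit in absolute terms yet order the candidates perfectly. The simple Spearman formula used here assumes no tied values.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(a, b):
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(a)
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def mean_absolute_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical property values for five candidate materials.
measured = [1.2, 2.9, 0.4, 3.6, 2.1]
predicted = [2.0, 4.1, 1.5, 4.4, 3.2]   # systematically too high

rank_agreement = spearman(measured, predicted)
abs_error = mean_absolute_error(measured, predicted)
print("Spearman rank correlation:", rank_agreement)
print("mean absolute error:", round(abs_error, 2))
```

A model like this would be useless for quoting a property value but perfectly adequate for deciding which candidates to make first.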

Drug design, while already a beneficiary of compute, will see mixed results in my opinion over the next decade. For one thing, "drug design" is a catchall phrase that can include everything from basic protein-ligand structure prediction to toxicity prediction, with the latter being at the challenging end of the spectrum. Structure-based design will likely benefit from deep learning methods that learn basic intermolecular interactions which are transferable across target classes, so that they are less limited by the paucity of training data.

Areas like synthesis planning do contribute to drug design, but the real crux of successful drug design will be multiparameter optimization and SAR prediction, where an algorithm is able to successfully calculate multiple properties of interest like affinity, PK/PD and toxicity. PK/PD and toxicity are systemic effects that are complex and emergent, and I think the field will still not be able to make a significant dent in predicting idiosyncratic toxicity except for obvious cases. One area in which I see AI having a bigger impact is any field of drug discovery involving image recognition; for instance phenotypic screening, and perhaps the processing of images in cryo-EM and standard x-ray crystallography.

Finally, automation is one area where I do think AI will make substantial progress. This is partly due to better seamless integration of hardware and software and partly because of better data generation and recording that will enable machine learning and related models to improve. This development, combined with reaction planning that allows scientists to test multiple hypotheses, will contribute, in my opinion, to automation making heavy inroads in the day-to-day work of chemists fairly soon.

Another area which I did not mention in the survey but which will impact all of the above areas is text mining. There the problem is one of discovering relationships between different entities (drugs and potential targets, for instance) that are not novel per se but that are just hidden in a thicket of patents, papers and other text sources which are too complicated for humans to parse. Ideally, one would be able to combine text mining with intelligent natural language processing algorithms to enable new discovery through voice commands.

Domains of Applicability (DOA) in top-down and bottom-up drug discovery

You don’t use a hammer to do impressionistic painting. And although you technically could, you won’t use a spoon for drinking beer. The domains of applicability of these tools are different, in terms of quality and quantity.

The idea of domains of applicability (DOA) is somehow both blatantly simple and easily forgotten. As the examples above indicate, the definition is apparent; every tool, every idea, every protocol, has a certain reach. There are certain kinds of data for which it works well and certain others for which it fails miserably. Then there are the most interesting cases; pieces of data on the boundary between applicable and non-applicable. These often serve as real testing grounds for your tool or idea.

Often the DOA of a tool becomes clear only when it’s been used for a long time on a sufficient number of test cases. Sometimes the DOA reveals itself accidentally, when you are trying to use the tool on data for which it’s not really designed. That way can lie much heartbreak. It’s better instead to be constantly aware of the DOA for your techniques and deliberately stress-test their range. The DOA can also inform you about the sensitivity of your model; for instance, for a certain model a small change from a methyl to a hydroxy might fall within its DOA, while for another it might exceed it.

The development and use of molecular docking, an important part of bottom-up drug discovery, makes the idea of DOA clear. By now there’s an extensive body of knowledge about docking, developed over at least twenty years, which makes it clear when docking works well and when you can trust it less. For example, docking works quite well in reproducing known crystal poses and generating new poses when the protein is well resolved and relatively rigid; when there are no large-scale conformational changes; when there are no unusual interactions in the binding site; when water molecules aren’t playing any weird or special role in the binding. On the other hand, if you are doing docking on a homology model built on sparse homology that features a highly flexible loop and several bridging water molecules as key binding elements, all bets are off. You have probably stepped way outside the DOA of docking. Then there are the intermediate and in many ways the most interesting cases; somewhat rigid proteins, just one or two water molecules, a good knowledge base around that protein that tells you what works. In these cases, one can be cautiously optimistic and make some testable hypotheses.

Fortunately there are ways to pressure-test the DOA of a favorite technique. If you suspect that the system under consideration does not fall within the DOA, there are simple tests you can run and questions you can ask. The first set of questions concerns the quality and quantity of data that is available. This data falls into two categories: data that was used for training the method and the data that you actually have in your test case. If the test data closely matches the training data then there’s a fair chance that your DOA is covered. If not, you ask the second important question: What’s the quickest way I can actually test the DOA? Usually the quickest way to test any hypothesis in early stage drug discovery is to propose a set of molecules that your model suggests as top candidates. As always, the easier these are to make, the faster you can test them and the better you can convince chemists to make them in the first place. It might also be a good idea to sneak in a molecule that your model says has no chance in hell of working. If neither of these predictions comes true within a reasonable margin, you clearly have a problem, either with the data itself or with your DOA.
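The first question, whether a test case resembles the training data, can be roughed out in a few lines. The sketch below uses one of the crudest possible checks, a per-descriptor bounding box: a candidate is flagged if any of its descriptors falls outside the range seen in training. The descriptors (molecular weight, logP, hydrogen-bond donors), their values and the function names are all hypothetical, and a range check like this says nothing about sparsely populated regions inside the box.

```python
def descriptor_ranges(training_rows):
    """Per-descriptor (min, max) over the training set."""
    columns = list(zip(*training_rows))
    return [(min(col), max(col)) for col in columns]


def out_of_domain(row, ranges, tolerance=0.0):
    """Indices of descriptors in `row` that fall outside the training range.

    `tolerance` widens each range by a fraction of its span (assumed knob).
    """
    flagged = []
    for i, (value, (low, high)) in enumerate(zip(row, ranges)):
        margin = tolerance * (high - low)
        if value < low - margin or value > high + margin:
            flagged.append(i)
    return flagged


# Hypothetical training set: (molecular weight, logP, H-bond donors).
training = [
    (310.0, 2.1, 1),
    (420.5, 3.8, 2),
    (365.2, 1.5, 3),
]
ranges = descriptor_ranges(training)

inside = (350.0, 2.5, 2)     # looks like the training data
outside = (610.0, 5.9, 2)    # heavier and greasier than anything seen

print("inside candidate flags:", out_of_domain(inside, ranges))
print("outside candidate flags:", out_of_domain(outside, ranges))
```

An empty list is only weak reassurance; a non-empty one is a good reason to fall back on the second question and test the prediction experimentally.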

There are also ways to fix the DOA of a technique, but because that task involves generating more training data and tweaking the code accordingly, it’s not something that most end users can do. In case of docking for instance, a DOA failure might result from inadequate sampling or inadequate scoring. Both of these issues can be fixed in principle through better data and better force fields, but that’s really something only a methods developer can do.

When a technique is new it always struggles to establish its DOA. Unfortunately both technical users and management don’t understand this and can immediately start proclaiming the method as a cure for all your problems; they think that just because it has worked well on certain cases it will do so on most others. The lure of publicity, funding and career advancement can further encourage this behavior. That certainly happened with docking and other bottom-up drug design tools in the Wild West of the late 80s and early 90s. I believe that something similar is happening with machine learning and deep learning now.

For instance it’s well known that when it comes to problems like image recognition and natural language processing (NLP), machine learning can do extremely well. In that case one is clearly operating well within the DOA. But what about modeling traffic patterns or brain activity or social networks or SAR data for that matter? What is the DOA of machine learning in these areas? The honest answer is that we don’t know. Now some users and developers of machine learning acknowledge this and are actually trying to circumscribe the right DOA by pressure-testing the algorithms. Others unfortunately simply take it for granted that more data must translate to better accuracy; in other words, they assume that the DOA is purely dictated by data quantity. This is true only in a narrow sense. Yes, less data can certainly hamper your efforts, but more data is neither always necessary nor ever sufficient on its own. You can have as much data as possible, but your technique can still be operating in the wrong DOA. For example, the presence of a discontinuous landscape of molecular activity places limitations on using machine learning in medicinal chemistry. Would more data ameliorate this problem? We don’t know yet, but this kind of thinking would not be inconsistent with the new religion of “dataism” which says that data is everything.

There are many opportunities to test the DOA of top-down approaches like deep learning in drug discovery and beyond. But to do this, both scientists and management must have realistic goals about the efficacy of the techniques, and more importantly must honestly acknowledge that they don’t know the DOA in the first place. In other words, they need to honestly acknowledge that they don’t yet know whether the technique will work for their specific problem. Unfortunately these kinds of decisions and proclamations are severely subject to hype and the enticement of dollars and drama. Machine learning is seen as a technique with such an outsize potential impact on diverse areas of our lives, that many err on the side of wishful thinking. Companies have sunk billions of dollars into the technology; how many of them would be willing to admit that the investment was really based on hope rather than reality?

In this context, machine learning can draw some useful lessons from the cautionary tale of drug design in the 80s, when companies were throwing money from all directions at molecular modeling. Did that money result in important lessons learnt and egos burnt? Indeed it did, but one might argue that computational chemists are still suffering from the negative effects of that hype, both in accurately using their techniques and in communicating the true value of those techniques to what seem like perpetually skeptical Nervous Nellies and Debbie Downers. Machine learning could go down the same route and it would be a real tragedy, not only because the technique is promising but because it could potentially impact many other aspects of science, technology, engineering and business and not just pharmaceutical development. And it might all happen because we were unable or unwilling to acknowledge the DOA of our methods.

Whether it’s top-down or bottom-up approaches, we can all ultimately benefit from Feynman’s words: “For a successful technology, reality has to take precedence over public relations, for Nature cannot be fooled.” For starters, let’s try not to fool each other.

Want to know if you are depressed? Don't ask Siri just yet.

Image
"Tell me more about your baseline calibration, Siri"
There's no dearth of articles claiming that the "wearables revolution" is around the corner and that we aren't far from the day when every aspect of our health is recorded every second, analyzed and sent to the doctor for rapid diagnosis and treatment. That's why it was especially interesting for me to read this new analysis from computer scientists at Berkeley and Penn that should temper the soaring enthusiasm that riddles pretty much all things "AI" these days.

The authors are asking a very simple question in the context of machine learning (ML) algorithms that claim to predict your mood - and by proxy mental health issues like depression - based on GPS and other data. What's this simple question? It's one about baselines. When any computer algorithm makes a prediction, one of the key questions is how much better this prediction is compared to some baseline. Another name for baselines is "null models". Yet another is "controls", although controls themselves can be artificially inflated. 

In this case the baseline can be of two kinds: personal baselines (self-reported individual moods) or population baselines (the mood of a population). What the study finds is not too pretty. They analyze a variety of literature on mood-reporting ML algorithms and find that in about 77% of cases the studies use meaningless baselines that overestimate the performance of the ML models with respect to predicting mood swings. The reason is that the baselines used in most studies are population baselines rather than the more relevant personal baselines. The population baseline assumes a single constant average state for all individuals, while the personal baseline assumes a constant average state for each individual but allows that state to differ between individuals.

Clearly doing better than the population baseline is not very useful for tracking individual mood changes, and this is especially true since the authors find greater errors for population baselines compared to individual ones; these larger baseline errors can flatter a model and obscure its true performance. The authors also consider two datasets and try to figure out how to improve the performance of models on these datasets using a metric which they call "user lift" that determines how much better the model is compared to the baseline.
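The effect of the baseline choice is easy to see with a toy calculation. The sketch below is not the paper's definition of "user lift"; it assumes a simple form, the relative reduction in mean squared error of a model versus a baseline, and all the numbers are invented. It also computes the personal mean from the same scores it is evaluated on, which a real study would avoid by using only past data.

```python
from statistics import mean


def mse(truth, predicted):
    """Mean squared error between two equal-length sequences."""
    return mean((t - p) ** 2 for t, p in zip(truth, predicted))


def lift_over_baseline(truth, model_pred, baseline_pred):
    """Relative error reduction of the model versus a baseline (assumed form)."""
    base = mse(truth, baseline_pred)
    if base == 0:
        return 0.0
    return (base - mse(truth, model_pred)) / base


# Hypothetical self-reported mood scores for one user, plus model predictions.
truth = [6.0, 7.0, 6.0, 5.0]
model_pred = [6.0, 6.0, 6.5, 5.5]

population_mean = 4.0        # assumed average mood across all users
personal_mean = mean(truth)  # this user's own average mood

lift_vs_population = lift_over_baseline(truth, model_pred, [population_mean] * len(truth))
lift_vs_personal = lift_over_baseline(truth, model_pred, [personal_mean] * len(truth))

print(f"Lift over population baseline: {lift_vs_population:.2f}")
print(f"Lift over personal baseline:   {lift_vs_personal:.2f}")
```

With these made-up numbers the same model removes over 90% of the error relative to the population baseline but only a quarter of it relative to the personal one, which is the gap between a headline result and a modest one.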

I will let the abstract speak for itself:

"A new trend in medicine is the use of algorithms to analyze big datasets, e.g. using everything your phone measures about you for diagnostics or monitoring. However, these algorithms are commonly compared against weak baselines, which may contribute to excessive optimism. To assess how well an algorithm works, scientists typically ask how well its output correlates with medically assigned scores. Here we perform a meta-analysis to quantify how the literature evaluates their algorithms for monitoring mental wellbeing. We find that the bulk of the literature (∼77%) uses meaningless comparisons that ignore patient baseline state. For example, having an algorithm that uses phone data to diagnose mood disorders would be useful. However, it is possible to explain over 80% of the variance of some mood measures in the population by simply guessing that each patient has their own average mood - the patient-specific baseline. Thus, an algorithm that just predicts that our mood is like it usually is can explain the majority of variance, but is, obviously, entirely useless. Comparing to the wrong (population) baseline has a massive effect on the perceived quality of algorithms and produces baseless optimism in the field. To solve this problem we propose “user lift” that reduces these systematic errors in the evaluation of personalized medical monitoring."

That statement about being able to explain 80% of the variance in mood scores simply by guessing an average mood for every individual should stand out. It means that simple informed guesswork based on an average "feeling" can look about as good as a model on this measure while being eminently useless in practice, since it predicts no variability over time.
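Here is a small sketch of how a do-nothing predictor can "explain" most of the variance. It assumes the usual definition of explained variance (one minus the residual sum of squares over the total sum of squares around the population mean) and uses invented mood scores for two hypothetical users; it is not the paper's data or code.

```python
from statistics import mean


def variance_explained_by_personal_mean(moods_by_user):
    """Fraction of total variance captured by predicting each user's own mean."""
    all_scores = [s for scores in moods_by_user.values() for s in scores]
    grand_mean = mean(all_scores)
    total_ss = sum((s - grand_mean) ** 2 for s in all_scores)
    if total_ss == 0:
        return 0.0
    # Residuals are measured against each user's own average mood.
    residual_ss = sum(
        (s - mean(scores)) ** 2
        for scores in moods_by_user.values()
        for s in scores
    )
    return 1 - residual_ss / total_ss


# Hypothetical users: one habitually low, one habitually high, both fairly stable.
moods_by_user = {
    "user_a": [2.0, 3.0, 2.0, 3.0],
    "user_b": [7.0, 8.0, 7.0, 8.0],
}

r2 = variance_explained_by_personal_mean(moods_by_user)
print(f"Variance explained by guessing each user's average: {r2:.2f}")
```

Almost all of the variance in this toy dataset is between users rather than within them, so guessing "you feel the way you usually feel" scores very well without ever predicting a single change in mood.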

I find this paper important because it should put a dent in what is often inflated enthusiasm about wearables these days. It also illustrates the dangers of what is called "technological solutionism": simply because you can strap a watch or device onto your body to measure various parameters and simply because you have enough computing power to analyze the resulting stream of data does not mean the results will be significant. You record because you can, you analyze because you can, you conclude because you can. What the authors find about tracking moods can apply to tracking other kinds of important variables like blood pressure and sleep duration. Every time the question must be: am I using the right baseline for comparison? And am I doing better than the baseline? Hopefully the authors can use larger and more diverse datasets and find out similar facts about other such metrics.

I also found this study interesting because it reminds me of a whole lot of valid criticism in the field of molecular modeling that we have seen over the last few years. One of the most important questions there is about null models. Whenever your latest and greatest FEP/MD/informatics/docking study is claimed to have done exceptionally well on a dataset, the first question should be: is it better than the null model? And have you defined the null model correctly to begin with? Is your model doing better than a simpler method? And if it's not, why use it, and why assign a causal connection between your technique and the relevant result?
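One simple way to put the "is it better than the null model?" question to a dataset is a paired bootstrap on the error gap. The sketch below is only an illustration under stated assumptions: the null model here just predicts a constant (standing in for a training-set mean), the affinities are invented, and `bootstrap_error_gap` is a made-up helper rather than anything from a modeling package.

```python
import random
from statistics import mean


def bootstrap_error_gap(truth, model_pred, null_pred, n_resamples=2000, seed=0):
    """Approximate 95% bootstrap interval for (null abs. error - model abs. error)."""
    rng = random.Random(seed)
    # Positive gap means the model beat the null model on that compound.
    gaps = [abs(t - n) - abs(t - m) for t, m, n in zip(truth, model_pred, null_pred)]
    means = sorted(
        mean(rng.choices(gaps, k=len(gaps))) for _ in range(n_resamples)
    )
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return lower, upper


# Hypothetical measured affinities and predictions for six compounds.
truth = [5.1, 6.3, 7.2, 8.0, 6.8, 5.9]
model_pred = [5.4, 6.25, 7.0, 7.6, 6.9, 6.0]
null_pred = [6.5] * len(truth)  # null model: always predict an assumed training mean

lower, upper = bootstrap_error_gap(truth, model_pred, null_pred)
print(f"Error reduction over the null model: {lower:.2f} to {upper:.2f}")
```

If the interval comfortably excludes zero, the method is at least earning its keep against the null model; if it straddles zero, the fancier technique has not shown that it adds anything.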

In science there are seldom absolutes. Studies like this show us that every new method needs to be compared with what came before it. When old facts have already paved the way, new facts are compelled to do better. Otherwise they can create the illusion of doing well.

Lab automation using machine learning? Hold on to your pipettes for now.

Image
There is an interesting article on using machine learning and AI for lab automation in Science that generally puts a positive spin on the use of smart computer algorithms for automating routine experiments in biology. The idea is that at some point in the near future, a scientist could design, execute and analyze the results of experiments on her MacBook Air from a Starbucks.

There's definitely a lot of potential for automating routine lab protocols like pipetting and plate transfers, but this has already been done by robots for decades. What the current crop of computational improvements plans to do is potentially much more consequential though; it is to conduct entire suites of biological experiments with a few mouse clicks. The CEO of Zymergen, a company profiled in the piece, says that the ultimate objective is to "get rid of human intuition"; his words, not mine.

I must say I am deeply skeptical of that statement. There is no doubt that parts of experiment planning and execution will indeed become more efficient because of machine learning, but I don't see human biologists being replaced or even significantly augmented anytime soon. The reason is simple: most of research, and biological research in particular, is not about generating and rapidly testing answers (something which a computer excels at), but about asking questions (something which humans typically excel at). A combination of machine learning and robotics may well be very efficient at laying out a whole list of possible solutions and testing them, but it will all come to naught if the question that's being asked is the wrong one.

Machine learning will certainly have an impact, but only in a narrowly circumscribed set of experimental space. Thus, I don't think it's just a coincidence that the article focuses on Zymergen, a company which is trying to produce industrial chemicals by tweaking bacterial genomes. This process involves mutating thousands of genes in bacteria and then picking combinations that will increase the fitness of the resulting organism. It is exactly the kind of procedure that is well-adapted to machine learning (to try to optimize and rank mutations for instance) and robotics (to then perform the highly repetitive experiments). But that's a niche application, working well in areas like directed evolution; as the article itself says, "Maybe Zymergen has stumbled on the rare part of biology that is well-suited to computer-controlled experimentation."

In most of biological research, we start with figuring out what question to ask and what hypotheses to generate. This process is usually the result of combining intuition with experience and background knowledge. As far as we know, only human beings excel in this kind of coarse-grained, messy data gathering and thinking. Take drug discovery for instance; most drug discovery projects start with identifying a promising target or phenotype. This identification is usually quite complicated and comes from a combination of deep expertise, knowledge of the literature and careful decisions on what are the right experiments to do. Picking the right variables to test and knowing what the causal relationships between them are is paramount. In fact, most drug discovery fails because the biological hypothesis that you begin with is the wrong one, not because it was too expensive or slow to test the hypothesis. Good luck teaching a computer to tell you whether the hypothesis is the right one.

It's very hard for me to see how to teach a machine this kind of multi-layered, interdisciplinary analysis. Once we have the right question or hypothesis of course we can potentially design an automated protocol to carry out the relevant experiments, but reaching that point is going to take a lot more than just rapid trial and error and culling of less promising possibilities.

This latest wave of machine learning optimism therefore looks very similar to the old waves. It will have some impact, but the impact will be modest and likely limited to particular kinds of projects and goals. The whole business reminds me of the story - sometimes attributed to Lord Kelvin - about the engineer who was recruited by a company to help them with building a bridge. After thinking for about an hour, he made a mark with a piece of chalk on the ground, told the company's engineers to start building the bridge at that location, and then billed them for ten thousand dollars. When they asked what on earth he expected so much money for, he replied, "A dollar for making that mark. Nine thousand nine hundred and ninety nine for knowing where to make it." 

I am still waiting for that algorithm which tells me where to make the mark.

Bottom-up and top-down in drug discovery

Image
There are two approaches to discovering new drugs. In one approach drugs fall in your lap from the sky. In the other you scoop them up from the ocean. Let’s call the first the top-down approach and the second the bottom-up approach.

The bottom-up approach assumes that you can discover drugs by thinking hard about them, by understanding what makes them tick at the molecular level, by deconstructing the dance of atoms orchestrating their interactions with the human body. The top-down approach assumes that you can discover drugs by looking at their effects on biological systems, by gathering enough data about them without understanding their inner lives, by generating numbers through trial and error, by listening to what those numbers are whispering in your ear.

To a large extent, the bottom-up approach assumes knowledge while the top-down approach assumes ignorance. Since human beings have been ignorant for most of their history, for most of the recorded history of drug discovery they have pursued the top-down approach. When you don't know what works, you try things out randomly. The South Americans found out by accident that chewing the bark of the Cinchona tree relieved them of the afflictions of malaria. Through the Middle Ages and beyond, people who called themselves physicians prescribed a witches' brew of substances ranging from sulfur to mercury to arsenic to try to cure a corresponding witches' brew of maladies, from consumption to the common cold. More often than not these substances killed patients as readily as the diseases themselves.

The top-down approach may seem crude and primitive, and it was primitive, but it worked surprisingly well. For the longest time it was exemplified by the ancient medical systems of China and India – one of these systems delivered an antimalarial medicine that helped its discoverer bag a Nobel Prize for Medicine. Through fits and starts, scores of failures and a few solid successes, the ancients discovered many treatments that were often lost to the dust of ages. But the philosophy endured. It endured right up to the early 20th century when the German physician Paul Ehrlich tested hundreds of chemical compounds - products of the burgeoning dye industry pioneered by the Germans - and found that compound 606 worked against syphilis. Syphilis was a disease that had so bedeviled people since medieval times that it was often a default diagnosis of death, and cures were desperately needed. Ehrlich's 606 was arsenic-based, unstable and had severe side effects, but the state of medicine was such back then that anything was regarded as a significant improvement over the previous mercury-based compounds.

It was with Ehrlich's discovery that drug discovery started to transition to a more bottom-up discipline, systematically trying to make and test chemical compounds and understand how they worked at the molecular level. But it still took decades before the approach bore fruit. For that we had to await a nexus of great and concomitant advances in theoretical and synthetic organic chemistry, spectroscopy and cell and molecular biology. These advances helped us figure out the structure of druglike organic molecules, they revealed the momentous fact that drugs work by binding to specific target proteins, and they also allowed us to produce these proteins in useful quantity and uncover their structures. Finally at the beginning of the 80s, we thought we had enough understanding of chemistry to design drugs by bottom-up approaches, "rationally", as if everything that had gone on before was simply the product of random flashes of unstructured thought. The advent of personal computers (Apple and Microsoft had launched in the mid-70s) and their immense potential left people convinced that it was only a matter of time before drugs were "designed with computers". What the revolution probably found inconvenient to discuss much was that it was the top-down analysis which had preceded it that had produced some very good medicines, from penicillin to Thorazine.

Thus began the era of structure-based drug design which tries to design drugs atom by atom from scratch by knowing the protein glove in which these delicate molecular fingers fit. The big assumption is that the hand that fits the glove can deliver the knockout punch to a disease largely on its own. An explosion of scientific knowledge, startups, venture capital funding and interest from Wall Street fueled those heady times, with the upbeat understanding that once we understood the physics of drug binding well and had access to more computing power, we would be on our way to designing drugs more efficiently. Barry Werth's book "The Billion-Dollar Molecule" captured this zeitgeist well; the book is actually quite valuable since it's a rare as-it-happens study and not a more typical retrospective one, and therefore displays the same breathless and naive enthusiasm as its subjects.

And yet, 30 years after the prophecy was enunciated in great detail and to great fanfare, where are we? First, the good news. The bottom-up approach did yield great dividends - most notably in the field of HIV protease inhibitor drugs against AIDS. I actually believe that this contribution from the pharmaceutical industry is one of the greatest public services that capitalism has performed for humanity. Important drugs for lowering blood pressure and controlling heartburn were also the beneficiaries of bottom-up thinking.

The bad news is that the paradigm fell short of the wild expectations that we had from it. Significantly short in fact. And the reason is what it always has been in the annals of human technological failure: ignorance. Human beings simply don't know enough about perturbing a biological system with a small organic molecule. Biological systems are emergent and non-linear, and we simply don't understand how simple inputs result in complex outputs. Ignorance was compounded with hubris in this case. We thought that once we understood how a molecule binds to a particular protein and optimized this binding, we had a drug. But what we had was simply a molecule that bound better to that protein; we still worked on the assumption that that protein was somehow critical for a disease. Also, a molecule that binds well to a protein has to overcome enormous other hurdles of oral bioavailability and safety before it can be called a drug. So even if - and that's a big if - we understood the physics of drug-protein binding well, we still wouldn't be any closer to a drug, because designing a drug involves understanding its interactions with an entire biological system and not just with one or two proteins.

In reality, diseases like cancer manifest themselves through subtle effects on a host of physiological systems involving dozens if not hundreds of proteins. Cancer especially is a wily disease because it activates cells for uncontrolled growth through multiple pathways. Even if one or two proteins were the primary drivers of this process, simply designing a molecule to block their actions would be too simplistic and reductionist. Ideally we would need to block a targeted subset of proteins to produce optimum effect. In reality, either our molecule would not bind even one favored protein sufficiently and lack efficacy, or it would bind the wrong proteins and show toxicity. In fact the reason why no drug can escape at least a few side effects is precisely because it binds to many proteins other than the one we intended it to.

Faced with this wall of biological complexity, what do we do? Ironically, what we had done for hundreds of years, only this time armed with far more data and smarter data analysis tools. Simply put, you don't worry about understanding how exactly your molecule interacts with a particular protein; you worry instead only about its visible effects, about how much it impacts your blood pressure or glucose levels, or how much it increases urine output or metabolic activity. These endpoints are agnostic of knowledge of the detailed mechanism of action of a drug. You can also compare these results across a panel of drugs to try to decipher similarities and differences.

This is top-down drug design and discovery, writ large in the era of Big Data and techniques from computer science like machine learning and deep learning. The field is fundamentally steeped in data analysis and takes advantage of new technology that can measure umpteen effects of drugs on biological systems, greatly improved computing power and hardware to analyze these effects, and refined statistical techniques that can separate signal from noise and find trends.

The top-down approach is today characterized mainly by phenotypic screening and machine learning. Phenotypic screening involves simply throwing a drug at a cell, organ or animal and observing its effects. In its primitive form it was used to discover many of today's important drugs; in the field of anxiety medicine for instance, new drugs were discovered by giving them to mice and simply observing how much fear the mice exhibited toward cats. Today's phenotypic screening can be more fine-grained, looking at drug effects on cell size, shape and elasticity. One study I saw looked at potential drugs for wound healing; the most important tool in that study was a high-resolution camera, and the top-down approach manifested itself through image analysis techniques that quantified subtle changes in wound shape, depth and appearance. In all these cases, the exact protein target the drug might be interacting with was a distant horizon and an unknown. The large scale, often visible, effects were what mattered. And finding patterns and subtle differences in these effects - in images, in gene expression data, in patient responses - is what the universal tool of machine learning is supposed to do best. No wonder that every company and lab from Boston to Berkeley is trying feverishly to recruit data and machine learning scientists and build burgeoning data science divisions. These companies have staked their fortunes on a future that is largely imaginary for now.

Currently there seems to be, if not a war, at least a simmering and uneasy peace between top-down and bottom-up approaches in drug discovery. And yet this seems to be mainly a fight where opponents set up false dichotomies and straw men rather than find complementary strengths and limitations. First and foremost, the ultimate proof of the pudding is in the eating, and machine learning's impact on the number of approved new drugs still has to be demonstrated; the field is simply too new. The constellation of techniques has also proven itself to be much better at solving certain problems (mainly image recognition and natural language processing) than others. A lot of early stage medicinal chemistry data contains messy assay results and unexpected structure-activity relationships (SAR) containing "activity cliffs" in which a small change in structure leads to a large change in activity. Machine learning struggles with these discontinuous stimulus-response landscapes. Secondly, there are still technical issues in machine learning such as working with sparse data and noise that have to be resolved. Thirdly, while the result of a top-down approach may be a simple image or change in cell type, the number of potential factors that can lead to that result can be hideously tangled and multifaceted. Finally, there is the perpetual paradigm of garbage-in-garbage-out (GIGO). Your machine learning algorithm is only as good as the data you feed it, and chemical and biological data are notoriously messy and ill-curated; chemical structures might be incorrect, assay conditions might differ in space and time, patient reporting and compliance might be sporadic and erroneous, human error riddles data collection, and there might be very little data to begin with. The machine learning mill can only turn data grist into gold if what it's provided with is grist in the first place.
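The "activity cliffs" mentioned above can be made concrete with the Structure-Activity Landscape Index (SALI), which divides the activity difference between two compounds by their structural dissimilarity, so that near-identical molecules with very different activities score highest. The sketch below is a minimal illustration: the feature sets stand in for real fingerprints, the compound names and activities are invented, and `rank_cliffs` is a made-up helper.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two sets of structural feature ids."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def sali(act_i, act_j, similarity):
    """Structure-Activity Landscape Index: |A_i - A_j| / (1 - similarity)."""
    if similarity >= 1.0:
        return float("inf") if act_i != act_j else 0.0
    return abs(act_i - act_j) / (1.0 - similarity)


def rank_cliffs(compounds):
    """Return (SALI, name_i, name_j) tuples sorted with the steepest cliffs first."""
    scored = []
    for (name_i, (fp_i, act_i)), (name_j, (fp_j, act_j)) in combinations(compounds.items(), 2):
        scored.append((sali(act_i, act_j, tanimoto(fp_i, fp_j)), name_i, name_j))
    return sorted(scored, reverse=True)


# Hypothetical compounds: (feature set, activity on a log scale).
compounds = {
    "cpd_1": (frozenset({1, 2, 3, 4, 5}), 8.5),
    "cpd_2": (frozenset({1, 2, 3, 4, 6}), 5.0),  # near twin of cpd_1, far weaker
    "cpd_3": (frozenset({7, 8, 9}), 6.0),
}

ranked = rank_cliffs(compounds)
for score, name_i, name_j in ranked:
    print(f"{name_i} vs {name_j}: SALI = {score:.1f}")
```

Pairs at the top of such a ranking are exactly where a model that assumes similar structures have similar activities is most likely to be wrong, which makes them useful test cases before trusting predictions on a new series.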

In contrast to some of these problems with the top-down paradigm, bottom-up drug design has some distinct advantages. First of all, it has worked, and nothing speaks like success. Also operationally, since you are usually looking at the interactions between a single molecule and protein, the system is much simpler and cleaner, and the techniques to study it are less prone to ambiguous interpretation. Unlike machine learning which can be a black box, here you can understand exactly what's going on. The amount of data might be smaller, but it may also be more targeted, manageable and reproducible. You don't usually have to deal with the intricacies of data fitting and noise reduction or the curation of data from multiple sources. Ultimately, if like HIV protease your target does turn out to be the Achilles heel of a deadly disease like AIDS, your atom-by-atom design can be as powerful as Thor's hammer. There is little doubt that bottom-up approaches have worked in selected cases, where the relevance of the target has been validated, and there is little doubt that this will continue to be the case.

Now it's also true that just like with top-downers, bottom-uppers have had their burden of computational problems and failures, and both paradigms have been subjected to their fair share of hype. Starting from that "designing drugs using computers" headline in 1981, people have understood that there are fundamental problems in modeling intermolecular interactions: some of these problems are computational and in principle can be overcome with better hardware and software, but others like the poor understanding of water molecules and electrostatic interactions are fundamentally scientific in nature. The downplaying of these issues and the emphasizing of occasional anecdotal successes has led to massive hype in computer-aided drug design. But in case of machine learning it's even worse in some sense since hype from applications of the field in other human endeavors is spilling over in drug discovery too; it seems hard for some to avoid claiming that their favorite machine learning system is going to soon cure cancer if it's making inroads in trendy applications like self-driving cars and facial recognition. Unlike machine learning though, the bottom-up take has at least had 20 years of successes and failures to draw on, so there is a sort of lid on hype that is constantly waved by skeptics.

Ultimately, the biggest advantage of machine learning is that it allows us to bypass detailed understanding of complex molecular interactions and biological feedback and work from the data alone. It's like a system of psychology that studies human behavior purely based on stimuli and responses of human subjects, without understanding how the brain works at a neuronal level. The disadvantage is that the approach can remain a black box; it can lead to occasional predictive success but at the expense of understanding. And a good open question is to ask how long we can keep on predicting without understanding. Knowing how many unexpected events or "Black Swans" exist in drug discovery, how long can top-down approaches keep performing well?

The fact of the matter is that both top-down and bottom-up approaches to drug discovery have strengths and limitations and should therefore be part of an integrated approach to drug discovery. In fact they can hopefully work well together, like members of a relay team. I have heard of at least one successful major project in a leading drug firm in which top-down phenotypic screening yielded a valuable hit which then, midstream, was handed over to a bottom-up team of medicinal chemists, crystallographers and computational chemists who deconvoluted the target and optimized the hit all the way to an NDA (New Drug Application). At the same time, it was clear that the latter would not have been made possible without the former. In my view, the old guard of the bottom-up school has been reluctant and cynical in accepting membership in the guild for the young Turks of the top-down school, while the Turks have been similarly guilty of dismissing their predecessors as antiquated and irrelevant. This is a dangerous game of all-or-none in the very complex and challenging landscape of drug discovery and development, where only multiple and diverse approaches are going to allow us to discover the proverbial needle in the haystack. Only together will the two schools thrive, and there are promising signs that they might in fact be stronger together. But we'll never know until we try.

(Image: BenevolentAI)