Deep Learning, Nature and Data Leakage, Reproducibility and Academic Engineering

2019 Jun 25

This piece of Rajiv Shah called “Stand up for Best Practices” that involves a well known scientific journal Nature shows the academic rigor failed during several layers down and why reproducibility matters.

The letter called Deep learning of aftershock patterns following large earthquakes from DeVries Et al. that was published in Nature, according to Shah, shows a basic problem of Data Leakage and this problem could invalidate all the experiments. Shah tried to replicate the results and found the scenario of Data Leakage and after he tried to communicate the authors and Nature about the error got some harsh responses (some will be at the end of this post).

Paper abstract - Source: https://www.nature.com/articles/s41586-018-0438-y

The repository with all analysis it is here.

Of course that a letter it’s a small piece that communicates in a brief way a larger research and sometimes the authors need to suppress some information to the matter of clarity of journal limitations. And for this point, I can understand the authors, and since they gentle provided the source code (here) more skeptical minds can check the ground truth.

As a said before in my 2019 mission statement: “In god we trust, others must bring the raw data with the source code of the extraction in the GitHub”.

The main question here it’s not about if the authors made a mistake or not (that did, because they incorporated a part of an earthquake to train the model, and this for itself can explain the AUC bigger in test than in training set) but how this academic engineering it’s killing the Machine Learning field and inflating a bubble of expectations.

But first I’ll borrow the definition of Academic Engineering provided by Filip Piekniewski in his classic called Autopsy Of A Deep Learning Paper:

_I read a lot of deep learning papers, typically a few/week. I’ve read probably several thousands of papers. My general problem with papers in machine learning or deep learning is that often they sit in some strange no man’s land between science and engineering, I call it “academic engineering”. Let me describe what I mean:
_

_1) A scientific paper IMHO, should convey an idea that has the ability to explain something. For example a paper that proves a mathematical theorem, a paper that presents a model of some physical phenomenon. Alternatively a scientific paper could be experimental, where the result of an experiment tells us something fundamental about the reality. Nevertheless the central point of a scientific paper is a relatively concisely expressible idea of some nontrivial universality (and predictive power) or some nontrivial observation about the nature of reality.
_

_2) An engineering paper shows a method of solving a particular problem. Problems may vary and depend on an application, sometimes they could be really uninteresting and specific but nevertheless useful for somebody somewhere. For an engineering paper, things that matter are different than for a scientific paper: the universality of the solution may not be of paramount importance. What matters is that the solution works, could be practically implemented e.g. given available components, is cheaper or more energy efficient than other solutions and so on. The central point of an engineering paper is an application, and the rest is just a collection of ideas that allow to solve the application.
_

Machine learning sits somewhere in between. There are examples of clear scientific papers (such as e.g. the paper that introduced the backprop itself) and there are examples of clearly engineering papers where a solution to a very particular practical problem is described. But the majority of them appear to be engineering, only they engineer for a synthetic measure on a more or less academic dataset. In order to show superiority some ad-hoc trick is being pulled out of nowhere (typically of extremely limited universality) and after some statistically non significant testing a victory is announced.

One thing that I noticed in this Academic Engineering phenomena it’s that a lot of people (well-intentioned) are doing a lot of experiments, using nice tools and put their code available and this is very cool. However one thing that I noticed it’s that some of this Academic Engineering papers brings tons of methodological problems regarding of Machine Learning part.

I tackled one example of this some months ago related with a systematic review from Christodoulou Et al. called “A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models” that the authors want to start a confirmatory study without a clear understanding of the methodology behind of Machine Learning and Deep Learning papers (you can read the full post here).

In Nature’s letter from DeVries Et al. it’s not different. Let’s check, for example, HOW they end it up with the right architecture. The paper only made the following consideration about the architecture:

The neural networks used here are fully connected and have six hidden layers with 50 neurons each and hyperbolic tangent activation functions (13,451 weights and biases in total). The first layer corresponds to the inputs to the neural network; in this case, these inputs are the magnitudes of the six independent components of the co-seismically generated static elastic stress-change tensor calculated at the centroid of a grid cell and their negative values.

DeVries, P. M. R., Viégas, F., Wattenberg, M., & Meade, B. J. (2018)

The code available in GitHub shows the architecture:

Only the aspect of choosing the right architecture can rises tons of questions regarding the methodological rigor as:

Why 6 layers and not 10 or 20? How did they get in this number of layers?
What the criteria to choose 50 as number of neurons? What’s the processed to identify that number?
There’s substantial differences between the activation functions in terms of the convergence of the network. Regarding of that, why they choose tahn instead relu or sigmoid? That’s the results provided with the use of the other 2 activation functions?
Another aspect in terms of convergence of the network it’s the optimizer. What’s the criteria to use Adadelta instead Adam or Adagrad or the very good and old Stocastic Gradient Descent?
For every layer there’s a Dropout layer afterwards. How they get the 50% of Dropout for every layer?
All layers uses the lecun_uniform as a Kernel Initializer. Why this initializer it’s most suitable for this problem/data? Other options were tested? If yes, how was the results? And why the seed for the lecun_uniform was not set?

These questions I raised in only 8 minutes (and believe me, even junior reviewers from B-class journals would make those questions), and for the bottom of my heart, I would like to believe that Nature it’s doing the same.

After that a question arises: If even a very well know scientific journal it’s rewarding this kind of academic engineering - even with all code available - and not even considering to review the letter, what could happen in this moment in several papers that do not have this kind of mechanism of verification and the research itself it’s a completely a black box?

Final thoughts

There’s an eagerness to believe in almost every journal that has a huge impact and spread the word about the good results, but if you cannot explain HOW that result was made in a sense to have a methodological rigor, IMHO the result it’s meaningless.

Keep sane from hype, keep skeptic.

Below all the letters exchanged:

FIRST LETTER FROM MR. SHAH

Dear Editors:

A recent paper you published by DeVries, et al., Deep learning of aftershock patterns following large Earthquakes, contains significant methodological errors that undermine its conclusion. These errors should be highlighted, as data science is still an emerging field that hasn’t yet matured to the rigor of other fields. Additionally, not correcting the published results will stymie research in the area, as it will not be possible for others to match or improve upon the results. We have contacted the author and shared with them the problems around data leakage, learning curves, and model choice. They have not yet responded back.

First, the results published in the paper, AUC of 0.849, are inflated because of target leakage. The approach in the paper used part of an earthquake to train the model, which then was used again to test the model. This form of target leakage can lead to inflated results in machine learning. To prevent against this, a technique called group partitioning is used. This requires ensuring an earthquake appears either in the train portion of the data or the test portion. This is not an unusual methodological mistake, for example a recent paper by Rajpurkar et. al on chest x-rays made the same mistake, where x-rays for an individual patient could be found in both the train and test set. These authors later revised their paper to correct this mistake.

In this paper, several earthquakes, including 1985NAHANN01HART, 1996HYUGAx01YAGI, 1997COLFIO01HERN, 1997KAGOSH01HORI, 2010NORTHE01HAYE were represented in both the train and test part of the dataset. For example, in 1985 two large magnitude earthquakes occurred near the North Nahanni River in the northeast Cordillera, Northwest Territories, Canada, on 5 October (MS 6.6) and 23 December (MS 6.9). In this dataset, one of the earthquakes is in the train set and the other in the test set. To ensure the network wasn’t learning the specifics about the regions, we used group partitioning, this ensures an earthquake’s data only was in test or in train and not in both. If the model was truly learning to predict aftershocks, such a partitioning should not affect the results.

We applied group partitioning of earthquakes randomly across 10 different runs with different random seeds for the partitioning. I am happy to share/post the group partitioning along with the revised datasets. We found the following results as averaged across the 10 runs (~20% validation):

Method

Mean AUC

Coulomb failure stress-change

0.60

Maximum change in shear stress

0.77

von Mises yield criterion

0.77

Random Forest

0.76

Neural Network

0.77

In terms of predictive performance, the machine learning methods are not an improvement over traditional techniques of the maximum change in shear stress or the von Mises yield criterion. To assess the value of the deep learning approach, we also compared the performance to a baseline Random Forest algorithm (basic default parameters - 100 trees) and found only a slight improvement.

It is crucial that the results in the paper will be corrected. The published results provide an inaccurate portrayal of the results of machine learning / deep learning to predict aftershocks. Moreover, other researchers will have trouble sharing or publishing results because they cannot meet these published benchmarks. It is in the interest of progress and transparency that the AUC performance in the paper will be corrected.

The second problem we noted is not using learning curves. Andrew Ng has popularized the notion of learning curves as a fundamental tool in error analysis for models. Using learning curves, one can find that training a model on just a small sample of the dataset is enough to get very good performance. In this case, when I run the neural network with a batch size of 2,000 and 8 steps for one epoch, I find that 16,000 samples are enough to get a good performance of 0.77 AUC. This suggests that there is a relatively small signal in the dataset that can be found very quickly by the neural network. This is an important insight and should be noted. While we have 6 million rows, you can get the insights from just a small portion of that data.

The third issue is jumping straight to a deep learning model without considering baselines. Most mainstream machine learning papers will use benchmark algorithms, say logistic regression or random forest when discussing new algorithms or approaches. This paper did not have that. However, we found that a simple random forest model was able to achieve similar performance to neural network. This is an important point when using deep learning approaches. In this case, really any simple model (e.g. SVM, GAM) will provide comparable results. The paper gives the misleading impression that only deep learning is capable of learning the aftershocks.

As practicing data scientists, we see these sorts of problems on a regular basis. As a field, data science is still immature and there isn’t the methodological rigor of other fields. Addressing these errors will provide the research community with a good learning example of common issues practitioners can run into when using machine learning. The only reason we can learn from this is that the authors were kind enough to share their code and data. This sort of sharing benefits everyone in the long run.

At this point, I have not publicly shared or posted any of these concerns. I have shared them with the author and she did not reply back after two weeks. I thought it would be best to privately share them with you first. Please let me know what you think. If we do not hear back from you by November 20th, we will make our results public.

Thank you

Rajiv Shah

University of Illinois at Chicago

Lukas Innig

DataRobot

NATURE COMMENTS

Referee’s Comments:

In this proposed Matters Arising contribution, Shah and Innig provide critical commentary on the paper “Deep learning aftershock patterns following large earthquakes”, authored by Devries et al. and published in Nature in 2018. While I think that Shah and Innig raise make several valid and interesting points, I do not endorse publication of the comment-and-reply in Matters Arising. I will explain my reasoning for this decision in more detail below, but the upshot of my thinking is that (1) I do not feel that the central results of the study are compromised in any way, and (2) I am not convinced that the commentary is of interest to audience of non-specialists (that is, non machine learning practicioners).

Shah and Innig’s comment (and Devries and Meade’s response) centers on three main points of contention: (1) the notion of data leakage, (2) learning curve usage, and (3) the choice of deep learning approach in lieu of a simpler machine learning method. Point (1) is related to the partitioning of earthquakes into training and testing datasets. In the ideal world, these datasets should be completely independent, such that the latter constitutes a truly fair test of the trained model’s performance on data that it has never seen before. Shah and Innig note that some of the ruptures in the training dataset are nearly collocated in space and time with ruptures in the testing dataset, and thus a subset of aftershocks are shared mutually. This certainly sets up the potential for information to transfer from the training to testing datasets (violating the desired independence described above), and it would be better if the authors had implemented grouping or pooling to safeguard against this risk. However, I find Devries and Meade’s rebuttal to the point to be compelling, and would further posit that the potential data leakage between nearby ruptures is a somewhat rare occurrence that should not modify the main results significantly.

Shah and Innig’s points (2) and (3) are both related, and while they are interesting to me, they are not salient to the central focus of the paper. It is neat (and perhaps worth noting in a supplement), that the trainable parameters in the neural network, the network biases and weights, can be adequately trained using a small batch of the full dataset. Unfortunately, this insight from the proposed learning curve scheme would likely shoot over the heads of the 95% of the general Nature audience that are unfamiliar with the mechanics of neural networks and how they are trained. Likewise, most readers wouldn’t have the foggiest notion of what a Random Forest is, nor how it differs from a deep neural network, nor why it is considered simpler and more transparent. The purpose of the paper (to my understanding) was not to provide a benchmark machine learning algorithm so that future groups could apply more advanced techniques (GANs, Variational Autoencoders, etc.) to boost AUC performance by 5%. Instead, the paper showed that a relatively simple, but purely data-driven approach could predict aftershock locations better than Coulomb stress (the metric used in most studies to date) and also identify stress-based proxies (max shear stress, von Mises stress) that have physical significance and are better predictors than the classical Coulomb stress. In this way, the deep learning algorithm was used as a tool to remove our human bias toward the Coulomb stress criterion, which has been ingrained in our psyche by more than 20 years of published literature.

To summarize: regarding point (1), I wish the Devries et al. study had controlled for potential data leakage, but do not feel that the main results of the paper are compromised by doing so. As for point (2), I think it is interesting (though not surprising) that the neural network only needs a small batch of data to be adequately trained, but this is certainly a minor point of contention, relative to the key takeaways of the paper, which Shah and Innig may have missed. Point (3) follows more or less directly from (2), and it is intuitive that a simpler and more transparent machine learning algorithm (like a Random Forest) would give comparable performance to a deep neural network. Again, it would have been nice to have noted in the manuscript that the main insights could have been derived from a different machine learning approach, but this detail is of more interest to a data science or machine learning specialist than to a general Nature audience. I think the disconnect between the Shah and Innig and Devries et al. is a matter of perspective. Shah and Innig are concerned primarily with machine learning best practices methodology, and with formulating the problem as “Kaggle”-like machine learning challenge with proper benchmarking. Devries et al. are concerned primarily with using machine learning as tool to extract insight into the natural world, and not with details of the algorithm design.

AUTHORS RESPONSE