Two weeks ago we submitted a grant application to the National Institutes of Health in the USA, and among the questions we asked was: Will our dual-modality Aceso system obtain false-positive scores, recall rates, cancer detection rates and positive predictive values that are equivalent to those of the digital breast tomosynthesis device currently used at Groote Schuur Hospital? Our first blog for 2015 was about biological bad luck and explored the question: What proportion of cancer is caused by environmental factors and how much is due to genetics?
Published in Science today is an article entitled “What is the question?” in which the authors state: “We have found that the most frequent failure in data analysis is mistaking the type of question being considered.” Their figure below (© Science by AAAS) illustrates how the analysis of data may be broadly classified into one of six types. The first level is descriptive, in which a set of measurements is summarized without any effort to interpret the data. In exploratory analysis, you search for trends or relationships between measurements for the purpose of generating hypotheses.
Data analysis that quantifies whether an observed pattern is likely to hold beyond the data set in hand is known as inferential, and it is the most common type of statistical analysis in the scientific literature. It is important to remember the old adage, though, that “correlation does not imply causation.” As an extension of inferential data analysis, which documents relationships at the population level, predictive analysis uses a set of measured features to forecast an outcome for a single person. This type of analysis allows you to predict one measurement from another but does not explain how the prediction works.
The purpose of a causal data analysis is to establish what happens, on average, to one particular measurement if you change another measurement, and it is able to identify both the magnitude and direction of that change. A good example is the well-documented causal relationship between smoking and cancer: if you smoke, then on average your risk of developing cancer increases. Finally, mechanistic data analysis is employed to demonstrate that by altering one parameter you will always get a specific and deterministic behaviour in another parameter. Interestingly, mechanistic data analysis is rare and occurs primarily in the field of engineering.
The authors conclude: “The solution is to ensure that data analytic education is a key component in research training, and the most important step in that direction is to know the question.”