Public health workers and medical practitioners are frequently required to make predictions regarding various health outcomes. For example, they may be required to predict whether a 65-year-old woman with a 50-year history of alcohol drinking, betel nut chewing, and cigarette smoking will develop oral cancer within 1 year. They may also be required to predict whether a young man without previous medical history will survive an emergency operation for a ruptured dissecting aortic aneurysm. Many risk or prognostic factors have been found for nearly every health outcome. On the basis of such factors, a risk prediction model (e.g., the Framingham score for cardiovascular risk[1]) or a disease-staging system (e.g., the International Federation of Gynecology and Obstetrics staging for cervical cancer prognosis[2]) can be constructed to make accurate predictions. However, a prediction with approximately 100% certainty is seldom possible.
We report a new avenue for health outcome prediction. The method hinges on collecting multiple "perturbations"[3,4] of the health outcome of interest. Notably, if a person has the health outcome of concern (e.g., clinically diagnosed liver cancer) or is in the process of developing the outcome (e.g., a small malignant liver tumor not yet manifested clinically), many attributes of that person may undergo subtle changes (referred to as the "perturbed" attributes in this paper): for example, the person's physical/emotional characteristics, physiological/biochemical profiles, various "omics" profiles (e.g., epigenomics, transcriptomics, proteomics, metabolomics, and exposomics), behavior patterns, social activities, and other data pertaining to that person. The changes induced by the health outcome may be nondeterministic (i.e., they vary only in probability), and their magnitudes may be minuscule. Harnessing such health big data, we propose a method named "prediction using multiple perturbations" (PUMP). This is a proof-of-concept study; in this paper, we investigate the asymptotic properties of PUMP as the number of attributes tends to infinity.
2.1 Training samples
To train a PUMP to predict a health outcome (Y), we need a case sample (people with Y; with a sample size of $n_1$, indexed by $j$) and 2 independent control samples (people without Y; with sample sizes of $n_{0,1}$ and $n_{0,2}$, indexed by $k_1$ and $k_2$, respectively). Information on a total of m attributes (indexed by i) is accrued for all people in the case sample ($x_{ij}^{(1)}$) and the 2 control samples ($x_{ik_1}^{(0,1)}$ and $x_{ik_2}^{(0,2)}$). The attributes are assumed to be binary (0 or 1); otherwise, we censor values that are excessively high to an (arbitrarily defined) high limit and those that are excessively low to a low limit, and then map all values linearly to the unit interval. The point of such censoring is to bound the range of every attribute between zero and one: $0 \le x_{ij}^{(1)}, x_{ik_1}^{(0,1)}, x_{ik_2}^{(0,2)} \le 1$, for all i, j, $k_1$, and $k_2$. Next, we calculate the mean attributes for the 3 training samples: $\bar{x}_i^{(1)} = \frac{1}{n_1}\sum_{j=1}^{n_1} x_{ij}^{(1)}$, $\bar{x}_i^{(0,1)} = \frac{1}{n_{0,1}}\sum_{k_1=1}^{n_{0,1}} x_{ik_1}^{(0,1)}$, and $\bar{x}_i^{(0,2)} = \frac{1}{n_{0,2}}\sum_{k_2=1}^{n_{0,2}} x_{ik_2}^{(0,2)}$, respectively, for i = 1, 2, …, m.
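The censoring-and-rescaling step and the training-sample means can be sketched as follows (a minimal illustration; the array layout and function names are my own, not the paper's):

```python
import numpy as np

def preprocess(raw, low, high):
    """Censor values below `low` / above `high` (the arbitrarily defined
    limits, one pair per attribute) and map linearly onto [0, 1]."""
    clipped = np.clip(raw, low, high)          # censor extreme values
    return (clipped - low) / (high - low)      # rescale to the unit interval

def attribute_means(sample):
    """Mean of each attribute (columns) across people (rows)."""
    return sample.mean(axis=0)
```

Applying `attribute_means` to the preprocessed case sample and the 2 control samples yields the 3 vectors of training means used in the prediction step.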
2.2 The prediction method
To make a prediction for a new person, we accrue information on the corresponding attributes of that person: $x_i^{\mathrm{new}} \in [0,1]$, for i = 1, 2, …, m. We calculate a perturbation score at each and every attribute for the new person: $s_i = (\bar{x}_i^{(1)} - \bar{x}_i^{(0,1)}) \times (x_i^{\mathrm{new}} - \bar{x}_i^{(0,2)})$, for i = 1, 2, …, m. These scores are then averaged across the attributes to yield $\bar{s} = \frac{1}{m}\sum_{i=1}^{m} s_i$. We predict that the new person will eventually develop Y if his/her average score is larger than a certain threshold value t (a small positive number near zero, to be discussed later); otherwise, we predict that the person will not develop Y.
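A minimal sketch of the prediction rule, assuming the per-attribute perturbation score is the product of 2 mean differences (case mean minus first-control mean, and the new person's value minus the second-control mean), which is one form consistent with the asymptotic analysis in the next section:

```python
import numpy as np

def pump_predict(x_new, xbar_case, xbar_ctrl1, xbar_ctrl2, threshold):
    """PUMP prediction for one new person.

    Per-attribute score: (case mean - control-1 mean) times
    (new person's value - control-2 mean).  Predict the person will
    develop Y when the average score exceeds the threshold (a small
    positive number near zero)."""
    scores = (xbar_case - xbar_ctrl1) * (x_new - xbar_ctrl2)
    return scores.mean() > threshold
```

For a person drawn from the control population, every score has expectation zero, so the average hovers near zero and the prediction is negative; for a person developing Y, the perturbed attributes push the average score above the threshold.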
2.3 Ethical review
This paper is a methodological study and does not involve the enrollment of study subjects. Ethical approval is not necessary.
As the number of attributes tends to infinity (m → ∞), the probability of a correct prediction using the proposed method tends to 1 if the following 4 conditions are met:

(C1) Perturbation: the health outcome Y, or the process of developing it, perturbs the means of at least some of the attributes.

(C2) Signal prevalence: there exist positive constants π and M such that the proportion of perturbed attributes is no smaller than π whenever m ≥ M.

(C3) Signal strength: there exists a positive constant ξ such that, for every perturbed attribute, the deviations of the case mean and the outcome-developing mean from the control mean are in the same direction, with a product no smaller than ξ.

(C4) Bounded correlation: each attribute is correlated with at most a fixed number C of other attributes.
Note that (C1), (C2), and (C3) above concern the relation between the many attributes to be collected and the health outcome of concern, whereas (C4) above concerns the relation between the attributes themselves.
3.1 A proof that the probability of a correct prediction tends to one
Let the mean (variance) of the ith attribute be denoted by $\mu_i^{(1)}$ ($\sigma_i^{2(1)}$) for the people with Y, $\mu_i^{*}$ ($\sigma_i^{2*}$) for the people in the process of developing Y, and $\mu_i^{(0)}$ ($\sigma_i^{2(0)}$) for the people without Y, respectively. Because all attributes are between zero and one inclusive, we have $0 \le \mu_i^{(1)}, \mu_i^{*}, \mu_i^{(0)} \le 1$, for all i. For attributes bounded between zero and one, the variances are largest when the attributes are Bernoulli distributed (either 0 or 1, but nothing in between). Therefore, we have $\sigma_i^{2(1)} \le \mu_i^{(1)}(1-\mu_i^{(1)}) \le \frac{1}{4}$, $\sigma_i^{2*} \le \mu_i^{*}(1-\mu_i^{*}) \le \frac{1}{4}$, and $\sigma_i^{2(0)} \le \mu_i^{(0)}(1-\mu_i^{(0)}) \le \frac{1}{4}$, respectively, for all i.
From C1, let $\delta_i = 1$ indicate that the ith attribute is a perturbed attribute ($\mu_i^{(1)} \neq \mu_i^{(0)}$ or $\mu_i^{*} \neq \mu_i^{(0)}$), and $\delta_i = 0$ otherwise. The number of perturbed attributes is $m_p = \sum_{i=1}^{m} \delta_i$, which is a function of m. From C2, there exist positive constants π and M such that $\frac{m_p}{m} \ge \pi$ for all $m \ge M$. From C3, there exists a positive constant ξ such that $(\mu_i^{(1)} - \mu_i^{(0)})(\mu_i^{*} - \mu_i^{(0)}) \ge \xi$ for all i with $\delta_i = 1$.
Because of the independence between $\bar{x}_i^{(1)} - \bar{x}_i^{(0,1)}$ and $x_i^{\mathrm{new}} - \bar{x}_i^{(0,2)}$ (the former and the latter being based on different people), the mean and variance of the perturbation score at the ith attribute for a new person are

$E(s_i) = (\mu_i^{(1)} - \mu_i^{(0)}) \times (\mu_i^{\mathrm{new}} - \mu_i^{(0)})$

and

$\mathrm{Var}(s_i) = V_{A,i} V_{B,i} + V_{A,i} (\mu_i^{\mathrm{new}} - \mu_i^{(0)})^2 + V_{B,i} (\mu_i^{(1)} - \mu_i^{(0)})^2,$

where $\mu_i^{\mathrm{new}}$ denotes the mean of the ith attribute for the new person ($\mu_i^{(0)}$ if he/she is without Y; $\mu_i^{*}$ if he/she is developing Y), $V_{A,i} = \mathrm{Var}(\bar{x}_i^{(1)} - \bar{x}_i^{(0,1)}) = \frac{\sigma_i^{2(1)}}{n_1} + \frac{\sigma_i^{2(0)}}{n_{0,1}}$, and $V_{B,i} = \mathrm{Var}(x_i^{\mathrm{new}} - \bar{x}_i^{(0,2)})$ is the variance of the new person's attribute plus $\frac{\sigma_i^{2(0)}}{n_{0,2}}$. Because the attributes are binary, $\mathrm{Var}(s_i)$ is a function of the means of the attributes and the sample sizes of the training data; moreover, because $|s_i| \le 1$, we have $\mathrm{Var}(s_i) \le 1$ for all i.
From C4, assume that any attribute can correlate with at most C other attributes, whether among the people with Y, the people in the process of developing Y, or the people without Y. Then we have $\mathrm{Cov}(s_i, s_{i'}) = 0$ for all but at most C indices $i' \neq i$, and $|\mathrm{Cov}(s_i, s_{i'})| \le \sqrt{\mathrm{Var}(s_i)\mathrm{Var}(s_{i'})} \le 1$, for all i.
We now calculate the mean of the average score for the new person as a function of m. We have that

$E(\bar{s}) = \frac{1}{m}\sum_{i=1}^{m} (\mu_i^{(1)} - \mu_i^{(0)})(\mu_i^{\mathrm{new}} - \mu_i^{(0)}) = 0$

for all m if the new person is without Y (because then $\mu_i^{\mathrm{new}} = \mu_i^{(0)}$ for every i), and that

$E(\bar{s}) = \frac{1}{m}\sum_{i:\,\delta_i=1} (\mu_i^{(1)} - \mu_i^{(0)})(\mu_i^{*} - \mu_i^{(0)}) \ge \frac{m_p}{m}\,\xi \ge \pi\xi$

for all m ≥ M if the new person is developing Y. As for the variance of the average score for the new person, irrespective of whether he/she is without Y or developing Y, we have that

$\mathrm{Var}(\bar{s}) = \frac{1}{m^2}\sum_{i=1}^{m}\sum_{i'=1}^{m} \mathrm{Cov}(s_i, s_{i'}) \le \frac{1}{m^2} \times m \times (1 + C) = \frac{1+C}{m}$

for all m.
For any t in $(0, \pi\xi)$ and any m ≥ M, by Chebyshev's inequality the probability of a correct prediction for a person without Y is

$\Pr(\bar{s} \le t) \ge 1 - \frac{1+C}{m t^2},$

and the probability of a correct prediction for a person developing Y is

$\Pr(\bar{s} > t) \ge 1 - \frac{1+C}{m (\pi\xi - t)^2}.$
A simple numerical analysis shows that setting $t = \frac{\pi\xi}{2}$ makes the two lower bounds coincide at $1 - \frac{4(1+C)}{m\pi^2\xi^2}$. Therefore, we see that as the number of attributes tends to infinity (m → ∞), the probability of a correct prediction tends to one.
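The convergence can be illustrated by a small Monte Carlo experiment under simplified, purely hypothetical assumptions: independent Bernoulli attributes, a 20% signal prevalence, a 0.2 perturbation shift, and the product-of-mean-differences score sketched in Section 2.2:

```python
import numpy as np

rng = np.random.default_rng(12345)

def accuracy(m, n=200, prev=0.2, shift=0.2, reps=400):
    """Empirical probability of a correct prediction with m independent
    binary attributes; a fraction `prev` of them has mean 0.5 + shift
    for people with / developing Y and mean 0.5 for people without Y."""
    mu0 = np.full(m, 0.5)                 # attribute means, people without Y
    mu1 = mu0.copy()                      # attribute means, with / developing Y
    mu1[: int(prev * m)] += shift         # the perturbed attributes
    draw = lambda mu, rows: rng.binomial(1, mu, size=(rows, m))
    xb_case = draw(mu1, n).mean(axis=0)   # training-sample means
    xb_c1 = draw(mu0, n).mean(axis=0)
    xb_c2 = draw(mu0, n).mean(axis=0)
    a = xb_case - xb_c1
    t = prev * shift**2 / 2               # threshold midway between 0 and prev*shift^2
    s_dev = ((draw(mu1, reps) - xb_c2) * a).mean(axis=1)   # new persons developing Y
    s_no = ((draw(mu0, reps) - xb_c2) * a).mean(axis=1)    # new persons without Y
    return ((s_dev > t).mean() + (s_no <= t).mean()) / 2
```

With these settings, the empirical accuracy climbs markedly as m grows from 100 to 2000, mirroring the behavior of the lower bound.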
3.2 Probability of a correct prediction and number of attributes needed
Assuming $t = \frac{\pi\xi}{2}$, we have that the probability of a correct prediction is at least $1 - \frac{4(1+C)}{m\pi^2\xi^2}$. Let $m_{[1,1]} = \frac{4(1+C)}{(\pi\xi)^2}$; the probability of a correct prediction is then at least $1 - \frac{m_{[1,1]}}{m}$ (assuming m ≥ M). Figure 1 shows that as the number of attributes increases, the lower bound for the probability of a correct prediction increases.
To control the false positive rate (the probability of a wrong prediction for a person without the outcome) to be no larger than α (0 < α < 1) and the false negative rate (the probability of a wrong prediction for a person developing the outcome) to be no larger than β (0 < β < 1), we can set the threshold value at $t = \frac{\sqrt{\beta}}{\sqrt{\alpha}+\sqrt{\beta}}\,\pi\xi$. The total number of attributes needed is then $m_{[\alpha,\beta]} = \frac{(1+C)\left(\frac{1}{\sqrt{\alpha}} + \frac{1}{\sqrt{\beta}}\right)^2}{(\pi\xi)^2}$ (assuming $m_{[\alpha,\beta]} \ge M$). Table 1 shows that an extremely large number of attributes ($m_{[0.01,0.01]}$) is required to control the false positive and false negative rates both at no larger than 0.01. In Fig. 1, the lower bound for the probability of a correct prediction is larger than zero for $m > m_{[1,1]}$ (assuming $m_{[1,1]} \ge M$).
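The attribute-count calculation can be packaged as a small calculator. The formula below, $m_{[\alpha,\beta]} = (1+C)(\alpha^{-1/2}+\beta^{-1/2})^2/(\pi\xi)^2$, is a reconstruction following from Chebyshev-type bounds, and the parameter values in the example are purely hypothetical:

```python
import math

def attributes_needed(alpha, beta, prevalence, strength, C=0):
    """Chebyshev-style bound on the number of attributes needed to keep
    the false positive rate <= alpha and the false negative rate <= beta,
    given signal prevalence pi, signal strength xi, and correlation
    bound C.  (Reconstructed formula; illustrative only.)"""
    return math.ceil((1 + C) * (alpha**-0.5 + beta**-0.5) ** 2
                     / (prevalence * strength) ** 2)
```

For example, with a hypothetical signal prevalence π = 0.001, signal strength ξ = 0.01, and C = 0, controlling both error rates at 0.01 requires on the order of 4 × 10¹² attributes, in line with the "billions or trillions" noted in the Discussion.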
Conventional asymptotic analysis assumes that the number of subjects tends to infinity. Hall et al[5] and Ahn et al[6] proposed an alternative approach in which the number of "dimensions" (corresponding to "attributes" in this paper) tends to infinity instead. Previously, we built on this alternative asymptotics to develop new methods for detecting weak associations,[3] detecting and correcting the bias of unmeasured factors,[4] and testing treatment effects in randomized controlled trials.[7] Along this line of inquiry, in this paper we propose a new avenue for health outcome prediction.
The C1 condition is the fundamental assumption of our method; it stipulates that the natural course of the health outcome of concern should produce such perturbations. The C1 condition also implies that there is an intermediate state between the absence and the presence of the outcome: the outcome-developing state. For a health outcome that develops very quickly, the C1 condition may not apply. The C2 condition signifies that collecting numerous attributes does not always help; only the perturbed attributes count as informative "signals," whereas the non-perturbed attributes are uninformative "noises." To meet this condition, as the number of attributes tends to infinity, the proportion of perturbed attributes (the signal prevalence) must remain no smaller than a certain positive value. The C3 condition requires that the perturbations, though possibly small, be non-negligible; the perturbation magnitude (the signal strength), as measured by the deviation of the mean of those developing the outcome from that of those without the outcome, relative to the distance between the high and low limits set for that attribute, must be no smaller than a certain positive value. Finally, the C4 condition stipulates that diverse types of attributes must be collected to minimize the correlations between them; as the number of attributes tends to infinity, the number of other attributes with which an attribute correlates must remain bounded.
The current version of the PUMP accepts only binary attributes. Additional work is needed to expand its range of applicability to categorical/continuous attributes, and to attributes that may change over time, whether because of the outcome-developing process or of their own nature. The PUMP is also very naive; it makes no attempt to differentiate the signals (the perturbed attributes) from the noises (the non-perturbed attributes) before taking them all in. Methods for selecting attributes for the PUMP need further development. From an artificial intelligence perspective, the PUMP is the simplest possible "machine learner": it takes in one layer of attributes, performs a simple linear combination of them, and outputs an average perturbation score. Recent "deep learners"[8] allow multiple processing layers and complex nonlinear combinations between the input attributes and each and every node in the deep layers. Further studies along this line are also warranted. From Table 1, we see that the current version of the PUMP requires an extremely large number of attributes. The future updates of the PUMP envisioned above may reduce the number of attributes needed for a near-perfect prediction to a practically feasible level.
At present, it is still not possible to collect billions or trillions of attributes of a person. But as we move into the big data era,[9–11] we are gradually closing that gap. This paper shows that if such personal big data can be obtained and the C1–C4 conditions are met, theoretically we can use a PUMP to predict the health outcome of a person to near certainty.
This is a single-authorship paper by WCL.
Conceptualization: Wen-Chung Lee.
Data curation: Wen-Chung Lee.
Formal analysis: Wen-Chung Lee.
Funding acquisition: Wen-Chung Lee.
Investigation: Wen-Chung Lee.
Methodology: Wen-Chung Lee.
Project administration: Wen-Chung Lee.
Resources: Wen-Chung Lee.
Software: Wen-Chung Lee.
Supervision: Wen-Chung Lee.
Validation: Wen-Chung Lee.
Visualization: Wen-Chung Lee.
Writing – original draft: Wen-Chung Lee.
Writing – review & editing: Wen-Chung Lee.
1. Wilson PWF, D'Agostino RB, Levy D, et al. Prediction of coronary heart disease using risk factor categories. Circulation 1998;97:1837–47.
2. Benedet JL, Bender H, Jones H, et al. FIGO staging classifications and clinical practice guidelines in the management of gynecologic cancers. Int J Gynaecol Obstet 2000;70:209–62.
3. Lo MT, Lee WC. Detecting a weak association by testing its multiple perturbations: a data mining approach. Sci Rep 2014;4:5081.
4. Lee WC. Detecting and correcting the bias of unmeasured factors using perturbation analysis: a data-mining approach. BMC Med Res Methodol 2014;14:18.
5. Hall P, Marron JS, Neeman A. Geometric representation of high dimension, low sample size data. J R Stat Soc Series B 2005;67:427–44.
6. Ahn J, Marron JS, Muller KM, et al. The high-dimension, low-sample-size geometric representation holds under mild conditions. Biometrika 2007;94:760–6.
7. Lee WC, Lin JH. A test for treatment effects in randomized controlled trials, harnessing the power of ultrahigh dimensional big data. Medicine 2019;98:e17630.
8. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436–44.
9. Murdoch TB, Detsky AS. The inevitable application of big data to health care. J Am Med Assoc 2013;309:1351–2.
10. Krumholz HM. Big data and new knowledge in medicine: the thinking, training, and tools needed for a learning health system. Health Aff (Millwood) 2014;33:1163–70.
11. Alyass A, Turcotte M, Meyre D. From big data analysis to personalized medicine for all: challenges and opportunities. BMC Med Genomics 2015;8:33.