Source of Support: None, Conflict of Interest: None. DOI: 10.4103/0028-3886.293445
In any research study, missing data is practically inevitable. Missing data (or a missing value) is a data value that has not been recorded for a variable in the observation of interest. It may arise because data collection is time-sensitive or resource-intensive, or because repeated longitudinal measurements are tedious. Common examples are missing information in source documents (in retrospective studies), unavailability of a variable (for example, laboratory tests that were not performed), clinical situations where some variables cannot be collected (for example, missing coma scale data in sedated patients), missed study visits, and patients lost to follow-up. Despite the best efforts to prevent it at inception, missing data becomes the Achilles heel when publishing research. We will briefly discuss how missing data can be classified, what its implications are, and what a reader should know when a study has missing data.
Types of missing data
Before we embark on the hunt for what to do with missing data, we should know what is missing and classify it correctly, so that we understand whether it can be made up for or whether we are at a true loss.
Missing completely at random (MCAR)
Data are MCAR when the probability that a value is missing is unrelated to any of the study values, observed or unobserved. For example, a sample could not be processed on a given day because the machine broke down and the samples had a limited shelf life, leaving that variable missing. The results from the remaining data would still be unbiased, but less precise.
Missing at random (MAR)
Data are MAR when the probability that a value is missing is related to the observed values of other measured variables. An example would be finding that a higher proportion of the non-salaried group than of the salaried group refuses to share monthly income details. The probability that the variable 'income' is missing therefore depends on another variable (salaried versus non-salaried in this example). Such missing data cannot be ignored and should be accounted for. Here, the missingness of a value depends on observed data but not on unobserved data.
Not missing at random (NMAR)
This is the simplest to understand and the most difficult to tackle! NMAR data do not fall into either of the above two categories: the missingness depends on unobserved data, often on the missing value itself. The only way to obtain an unbiased estimate of the parameters in such a case is to model the missing data mechanism. The model may then be incorporated into a more complex one for estimating the missing values.
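The three mechanisms can be illustrated with a small simulation built on the income example. This is only a sketch under stated assumptions: the population, the income figures, and the refusal probabilities are all hypothetical, and non-salaried respondents are assumed to earn less on average. It compares the mean income among respondents with the true mean, and shows that under MAR the observed variable ('salaried') can be used to correct the estimate.

```python
import random
from statistics import mean

rng = random.Random(0)

# Hypothetical population: half salaried (mean income 50), half non-salaried (mean 30).
people = []
for _ in range(20000):
    salaried = rng.random() < 0.5
    people.append((salaried, rng.gauss(50 if salaried else 30, 10)))

true_mean = mean(income for _, income in people)

def respondents(p_missing):
    """Keep each person with probability 1 - p_missing(salaried, income)."""
    return [(s, x) for s, x in people if rng.random() >= p_missing(s, x)]

# Assumed refusal probabilities, chosen only for illustration.
mcar = respondents(lambda s, x: 0.3)                     # unrelated to any value
mar = respondents(lambda s, x: 0.1 if s else 0.5)        # depends on observed 'salaried'
nmar = respondents(lambda s, x: 0.6 if x > 40 else 0.1)  # depends on income itself

def cc_mean(rows):
    """Mean income among respondents only (complete cases)."""
    return mean(x for _, x in rows)

def stratified_mean(rows):
    """Re-weight respondent stratum means by the full-sample stratum sizes."""
    total = 0.0
    for flag in (True, False):
        share = sum(1 for s, _ in people if s == flag) / len(people)
        total += share * mean(x for s, x in rows if s == flag)
    return total

print(f"true mean           {true_mean:.1f}")
print(f"MCAR respondents    {cc_mean(mcar):.1f}")
print(f"MAR respondents     {cc_mean(mar):.1f}")
print(f"MAR, stratified     {stratified_mean(mar):.1f}")
print(f"NMAR respondents    {cc_mean(nmar):.1f}")
```

In this sketch the MCAR estimate stays close to the truth, the MAR estimate is biased until the observed variable is taken into account, and the NMAR estimate remains biased because the missingness depends on the very value that is missing.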
How does missing data affect the results of the study?
Now that we understand that all types of missing data other than MCAR have the potential to bias the results of the study, we should also know in what way they affect the results.
If a complete case analysis has been done
Here, we delete the cases that have missing data. This is simple to explain and compute. If the data are MCAR, the results are unbiased, and the P values, standard error estimates, and hypothesis tests are correct. If the data are MAR or NMAR, the results will be biased. The statistical power may be considerably lower with this method because the sample size shrinks. If we do not do anything about missing data, the statistical tool gives us a complete case analysis [Figure 1].
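A minimal sketch of complete case analysis (listwise deletion) is given below. The records and variable names are hypothetical, and `None` is assumed to mark a missing value.

```python
# Hypothetical patient records; None marks a missing value.
records = [
    {"age": 54, "gcs": 13, "sodium": 138},
    {"age": 61, "gcs": None, "sodium": 141},
    {"age": 47, "gcs": 15, "sodium": None},
    {"age": None, "gcs": 9, "sodium": 135},
    {"age": 70, "gcs": 11, "sodium": 140},
    {"age": 58, "gcs": 14, "sodium": 137},
]

def complete_cases(rows):
    """Listwise deletion: keep only rows in which no variable is missing."""
    return [r for r in rows if all(v is not None for v in r.values())]

kept = complete_cases(records)
print(f"{len(kept)} of {len(records)} records retained")
```

Each variable here is missing in only one of six records, yet half of the records are discarded, which illustrates how quickly a complete case analysis can lose statistical power.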
If imputation is done
Imputation simply means replacing a missing value with another value. If done right, it can make use of the precious data that a complete case analysis throws away. There are several methods of imputation: mean/median imputation, regression imputation, likelihood imputation, hot deck imputation, and cold deck imputation. Describing these is beyond the scope of this article.
Last observation carried forward (LOCF)
This needs special mention because it is commonly encountered in reports of randomized controlled trials (RCTs). It is a form of hot deck imputation in which a missing value is replaced with the last observed value for the same subject. It is used when a subject leaves the study before completion. LOCF is often assumed to produce a conservative estimate of the treatment effect; however, it may produce spurious results.
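A minimal sketch of LOCF for one subject's repeated measurements follows. The function name and the pain scores are hypothetical, and `None` is assumed to mark a visit that was missed.

```python
def locf(visits):
    """Replace each missing visit (None) with the most recent observed value.

    Leading missing values stay None because there is nothing to carry forward.
    """
    filled, last = [], None
    for value in visits:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# Hypothetical pain scores; the subject dropped out after the third visit.
pain_scores = [8, 6, 5, None, None]
print(locf(pain_scores))  # [8, 6, 5, 5, 5]
```

The carried-forward values assume that the subject's condition did not change after dropout, which is why the method can mislead when the outcome is still evolving.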
Which missing data can be imputed and how much of the missing data can be imputed?
Computationally, all types of variables can be imputed. However, we need to know that the determinant and outcome variables should not be imputed; only covariates may be imputed. [Figure 2] shows an example where the dataset used in [Figure 1] has been analysed for missing values. How much missing data can be imputed depends on the missingness mechanism. Up to 20-25% of MCAR or MAR data can be handled, especially when more sophisticated imputation methods are used. Imputation should be accompanied by a sensitivity analysis.
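Before deciding on imputation, it helps to tabulate how much of each variable is missing. The sketch below is one way to do this; the dataset and variable names are hypothetical, and the 25% cut-off is simply the upper figure quoted above, not a fixed rule.

```python
def missing_fraction(rows):
    """Return {variable: fraction of rows in which that variable is None}."""
    names = rows[0].keys()
    return {n: sum(r[n] is None for r in rows) / len(rows) for n in names}

# Hypothetical covariates for eight patients; None marks a missing value.
rows = [
    {"bmi": 24.1, "smoker": "yes"},
    {"bmi": 27.3, "smoker": None},
    {"bmi": None, "smoker": "no"},
    {"bmi": 22.8, "smoker": None},
    {"bmi": 30.2, "smoker": "no"},
    {"bmi": 25.5, "smoker": None},
    {"bmi": 26.0, "smoker": "yes"},
    {"bmi": 23.4, "smoker": "no"},
]

fractions = missing_fraction(rows)
for name, frac in fractions.items():
    note = "above 25%, interpret with caution" if frac > 0.25 else "within 25%"
    print(f"{name}: {frac:.1%} missing ({note})")
```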
The essential steps of multiple imputation are as follows. First, random variation is introduced into the process of imputing missing values to generate several data sets, each with slightly different imputed values. Then, the analysis is performed on each of the data sets. Finally, the results are combined into a single set of parameter estimates, standard errors, and test statistics.
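The three steps can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the data are hypothetical, the imputation model is deliberately simplistic (each missing value is drawn from a normal distribution fitted to the observed values), and the combining step is assumed to use Rubin's rules, which the text above does not name. Dedicated statistical software uses more careful imputation models.

```python
import random
from statistics import mean, variance

def impute_once(values, rng):
    """Step 1: fill each None with a random draw based on the observed values."""
    observed = [v for v in values if v is not None]
    mu, sd = mean(observed), variance(observed) ** 0.5
    return [v if v is not None else rng.gauss(mu, sd) for v in values]

def analyse(values):
    """Step 2: the analysis of interest, here a mean and its squared standard error."""
    return mean(values), variance(values) / len(values)

def pool_rubin(estimates, variances):
    """Step 3: combine m results into one estimate and standard error (Rubin's rules)."""
    m = len(estimates)
    q_bar = mean(estimates)            # pooled point estimate
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, total ** 0.5

rng = random.Random(42)
data = [4.1, 5.0, None, 6.2, 5.5, None, 4.8, 5.9]  # hypothetical measurements
m = 5                                               # number of imputed data sets
results = [analyse(impute_once(data, rng)) for _ in range(m)]
pooled_mean, pooled_se = pool_rubin([r[0] for r in results],
                                    [r[1] for r in results])
print(f"pooled mean {pooled_mean:.2f}, pooled SE {pooled_se:.2f}")
```

The between-imputation term is what distinguishes multiple imputation from a single imputation: it adds the uncertainty about the imputed values to the pooled standard error.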
In clinical studies, the most common source of missing data is loss to follow-up. It is easy to understand this in the context of a clinical trial. If one or a few patients are lost to follow-up in a trial with hundreds of patients, there is no problem. The problem arises when the patients lost to follow-up are too many to allow valid conclusions. This raises the question: how many are too many? It depends on the total sample size of the study as well as the size of the observed difference in outcomes between the two study groups. Certainly, fewer than five in hundreds, or fewer than tens in thousands, are unlikely to be too many in any circumstance. But the best thing is to ask: do the losses to follow-up threaten the validity of the results? This can be assessed by re-analysing the results under certain assumptions about those lost to follow-up. Such reanalyses are called sensitivity analyses. The type of assumption depends on the study conclusions. If the conclusion favours the new treatment, assume a worst-case scenario and re-analyse. If the conclusion does not favour the new treatment, assume a best-case scenario and re-analyse. (If the comparison is between two active treatments, do both.) The details are as follows:
(i) Worst-case analysis
Consider a hypothetical randomised study, which has 100 patients in each of the treatment and control groups, of whom 10 (10%) are lost to follow-up in each group. Of the remaining 90 patients in each group, 40 (44%) die in the control group and 20 (22%) in the experimental treatment group. The difference (40/90 vs 20/90) between the two is statistically significant (P = 0.001). The conclusion is that treatment works. The worst-case scenario analysis will count 40 (40%) deaths in the control group (all the ten patients lost to follow-up are assumed to have survived in this group) and 30 (30%) deaths in the treatment group (all the ten patients lost to follow-up in this group died). The reanalysis (40/100 vs 30/100) shows the difference is statistically non-significant (P = 0.18). The conclusion now would be that the difference might be due to chance. Since the two conclusions (one without counting the losses to follow-up and second with worst-case scenario) differ, the losses to follow up (10% in this example) are too many to allow strong conclusions. (There is no need to do best case scenario here because this will lead to the same conclusion as the analysis presented in the study).
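The reanalysis above can be sketched in a few lines. The choice of test is an assumption: the text does not say which test produced the quoted P values, and a chi-square test with continuity correction is used here. The exact P value depends on the test chosen, so small differences from the figures quoted in the text are to be expected; what matters is on which side of 0.05 each result falls.

```python
from math import erfc, sqrt

def chi2_p(deaths_a, n_a, deaths_b, n_b):
    """P value of a continuity-corrected chi-square test on a 2x2 table (1 df)."""
    a, b = deaths_a, n_a - deaths_a
    c, d = deaths_b, n_b - deaths_b
    n = n_a + n_b
    num = n * max(abs(a * d - b * c) - n / 2, 0) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    # For 1 degree of freedom, the chi-square tail probability is erfc(sqrt(x / 2)).
    return erfc(sqrt(num / den / 2))

control_deaths, control_followed, control_lost = 40, 90, 10
treat_deaths, treat_followed, treat_lost = 20, 90, 10

# Analysis as reported: only patients who completed follow-up.
p_reported = chi2_p(control_deaths, control_followed,
                    treat_deaths, treat_followed)

# Worst case for the treatment: lost controls all survived,
# lost treatment patients all died.
p_worst = chi2_p(control_deaths, control_followed + control_lost,
                 treat_deaths + treat_lost, treat_followed + treat_lost)

print(f"as reported: P = {p_reported:.3f}")
print(f"worst case:  P = {p_worst:.2f}")
```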
Obviously, this analysis is based on an extreme assumption, which is unlikely to be true. In other words, this is a stringent test. If the study passes this test, there is no question that the conclusions are valid despite losses to follow-up. If the study fails this test, the conclusions may or may not be valid; we do not know.
(ii) Best-case analysis
If a study concludes that the new treatment does not make any difference but had losses to follow-up, you can do a best-case scenario analysis to determine if the losses were too many to allow valid conclusions.
Again, consider a hypothetical randomised study with 100 patients in each of the two arms. Let us say 25 patients in the treatment arm and 20 in the placebo arm were lost to follow-up, and thirty died in each group. The result is that there is no statistically significant difference between the two groups (30/75, 40%, in the treatment group vs 30/80, 37.5%, in the control group); in fact, percentage-wise, the control group is marginally better. The best-case scenario would assume that all those lost to follow-up in the treatment group survived, whereas all those lost in the control group died. The final figures would be 30 deaths in the treatment group and 50 in the control group (now out of 100 in each group). A reanalysis gives a P value of 0.006, a statistically highly significant difference. This means that the losses to follow-up are so many that the results reported in the study cannot be called robust.
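Both scenarios can be wrapped in one small helper, sketched below. The function names are hypothetical, and the chi-square test with continuity correction is again an assumption, since the text does not name the test used.

```python
from math import erfc, sqrt

def yates_p(d1, n1, d2, n2):
    """P value of a continuity-corrected chi-square test (1 df) on deaths d out of n."""
    a, b, c, d = d1, n1 - d1, d2, n2 - d2
    n = n1 + n2
    num = n * max(abs(a * d - b * c) - n / 2, 0) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return erfc(sqrt(num / den / 2))

def reanalyse(treat, control, scenario):
    """Re-test with all randomised patients under an extreme assumption.

    treat and control are (deaths, followed_up, lost) tuples; scenario is
    'best' or 'worst' from the point of view of the new treatment.
    """
    (td, tf, tl), (cd, cf, cl) = treat, control
    if scenario == "best":     # lost treatment patients survive, lost controls die
        cd += cl
    elif scenario == "worst":  # lost treatment patients die, lost controls survive
        td += tl
    else:
        raise ValueError("scenario must be 'best' or 'worst'")
    return yates_p(td, tf + tl, cd, cf + cl)

treat = (30, 75, 25)    # 30 deaths among 75 followed up, 25 lost
control = (30, 80, 20)  # 30 deaths among 80 followed up, 20 lost

p_observed = yates_p(30, 75, 30, 80)
p_best = reanalyse(treat, control, "best")
print(f"as observed: P = {p_observed:.2f}")
print(f"best case:   P = {p_best:.3f}")
```

Because the conclusion changes from non-significant to significant under the best-case assumption, the losses to follow-up in this example are too many for the original conclusion to be called robust.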
Of course, the worst- and best-case scenarios are based on extreme assumptions, which are probably implausible. They have value if they do not change the conclusions of the study. (This means the conclusions are robust and the losses to follow-up do not invalidate the results.) If they do, then the results cannot be said to be robust. The extent to which the validity is compromised depends on the degree to which the outcome of treatment patients lost to follow-up differs from that of control patients lost to follow-up. Akl et al. have proposed methods for handling participants excluded from analyses of randomized trials based on a range of plausible assumptions.
Missing data is common in clinical research and should be avoided at the source. Any missing data should be accounted for, and the resulting limitations understood. While reading a paper dealing with missing data, the method of imputation should be interpreted in light of its limitations and the sensitivity analysis. In your own research with missing data, the type and quantum of missing data should be known. The variables selected for imputation and the limitations should be addressed in the manuscript, along with a sensitivity analysis. However, it is important to note that no statistical method will ever be able to replace missing information completely; thus, 'the optimal solution to the problem of missing values is not to have any'. Therefore, every effort must be made to keep missing values to a minimum. More details are beyond the scope of this paper. Interested readers may consult the references for further reading.
Financial support and sponsorship
Conflicts of interest
There are no conflicts of interest.
[Figure 1], [Figure 2]