Neurology India
NI FEATURE: KNOW YOUR VITAL STATISTICS
Year : 2020  |  Volume : 68  |  Issue : 4  |  Page : 886-888

How to Deal with Missing Data?


Department of Neurology, All India Institute of Medical Sciences, New Delhi, India

Date of Web Publication: 26-Aug-2020

Correspondence Address:
Prof. Kameshwar Prasad
Professor, Room No. 2, 6th Floor, Neurosciences Center, Department of Neurology, All India Institute of Medical Sciences, New Delhi
India

Source of Support: None, Conflict of Interest: None


DOI: 10.4103/0028-3886.293445




How to cite this article:
Vibha D, Prasad K. How to Deal with Missing Data?. Neurol India 2020;68:886-8

How to cite this URL:
Vibha D, Prasad K. How to Deal with Missing Data?. Neurol India [serial online] 2020 [cited 2020 Oct 24];68:886-8. Available from: https://www.neurologyindia.com/text.asp?2020/68/4/886/293445




In any research study, missing data is practically inevitable. A missing value is defined as a data value that is not stored for a variable in the observation of interest.[1] Causes include time-sensitive or resource-intensive measurements and tedious repeated longitudinal measures. Common examples are missing information in source documents (in retrospective studies), unavailability of a variable (e.g., laboratory tests that were not performed), clinical situations in which some variables cannot be collected (e.g., coma scale scores in sedated patients), missed study visits, and patients lost to follow-up. Despite the best efforts to prevent it at inception, missing data becomes the Achilles' heel when publishing research. We briefly discuss how missing data can be classified, what its implications are, and what a reader should know when a study has missing data.

Types of missing data

Before deciding what to do with missing data, we should know what is missing and classify it correctly,[2] so that we understand whether it can be made up for, or whether it is a true loss.

Missing completely at random (MCAR)

Here, the missingness is unrelated to any value, observed or unobserved. For example, a sample could not be processed one day because the machine broke down and the samples had a limited shelf life. This leads to missing data for that variable, but the remaining data are still unbiased, merely less precise.

Missing at random (MAR)

This missing data arises when the missingness is related to observed values of other measured variables. An example would be a higher proportion of the non-salaried group refusing to share their monthly income details vis-à-vis the salaried group. The probability that the variable 'income' is missing thus depends on another variable (salaried versus non-salaried in this example). Such missing data cannot be ignored and should be accounted for. Here, the missingness of a value depends on observed data but not on unobserved data.

Not missing at random (NMAR)

This is the simplest to understand and the most difficult to tackle! Here, the missingness depends on the unobserved value itself (for example, patients with the worst outcomes dropping out), so it falls into neither of the above two categories. The only way to obtain unbiased parameter estimates in such a case is to model the missing data.[3] That model may then be incorporated into a more complex one for estimating the missing values.
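The three mechanisms can be illustrated with a small simulation. The following is a minimal sketch in Python, built on the hypothetical income-survey example above; the variable names and all probabilities are invented purely for illustration:

```python
import random

random.seed(0)

# Hypothetical survey: each record has a 'salaried' flag and an 'income' value.
data = [{"salaried": random.random() < 0.5,
         "income": random.gauss(50_000, 10_000)} for _ in range(1000)]

def make_missing(records, mechanism):
    """Return a copy of the records with 'income' deleted according to
    the chosen missingness mechanism (illustrative probabilities only)."""
    out = []
    for r in records:
        r = dict(r)
        if mechanism == "MCAR":
            # Missingness is pure chance, unrelated to any variable.
            drop = random.random() < 0.2
        elif mechanism == "MAR":
            # Missingness depends on an *observed* variable:
            # non-salaried respondents refuse more often.
            drop = random.random() < (0.4 if not r["salaried"] else 0.05)
        else:  # "NMAR"
            # Missingness depends on the *unobserved* value itself:
            # high earners refuse to disclose their income.
            drop = r["income"] > 60_000 and random.random() < 0.7
        if drop:
            r["income"] = None
        out.append(r)
    return out

mcar = make_missing(data, "MCAR")
print(sum(r["income"] is None for r in mcar), "values missing under MCAR")
```

Note that only MCAR can be diagnosed from the data alone; distinguishing MAR from NMAR requires knowledge of why the values are missing.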

How does missing data affect the results of the study?

Now that we understand that except for the MCAR missing data, all other types have the potential to affect the results of the study, we should also know in what way they affect the results.

If a complete case analysis has been done

Here, we delete or remove cases with any missing data. This is simple to explain and compute. If the data is MCAR, the results are unbiased: the P values, standard error estimates, and hypothesis tests are correct. If the data is MAR or NMAR, the results are biased. The statistical power may also be significantly smaller with this method, because cases are discarded. If we do nothing about missing data, statistical software performs a complete case analysis by default [Figure 1].
Figure 1: Complete case analysis


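Complete case (listwise) deletion amounts to dropping every record with any missing value. A minimal sketch, assuming hypothetical patient records in which None marks a missing value:

```python
# Hypothetical patient records; None marks a missing value.
records = [
    {"age": 54, "gcs": 13, "outcome": 1},
    {"age": 61, "gcs": None, "outcome": 0},   # coma scale not recorded (sedated)
    {"age": 47, "gcs": 15, "outcome": 1},
    {"age": 70, "gcs": 9,  "outcome": None},  # lost to follow-up
]

# Complete case analysis: keep only rows with no missing value at all.
complete_cases = [r for r in records if None not in r.values()]

print(f"{len(records)} records, {len(complete_cases)} complete cases")
```

Half of this toy dataset is discarded even though each incomplete record is missing only one of three variables, which is exactly how complete case analysis erodes statistical power.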

If imputation is done

Imputation simply means replacing a missing value with another value. If done right, it utilizes the precious data that complete case analysis throws away. There are several methods of imputation: mean/median imputation, regression imputation, likelihood-based imputation, and hot deck and cold deck imputation. Describing these in detail is beyond the scope of this article.
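As a flavour of the simplest of these methods, mean/median imputation replaces each missing value with the mean or median of the observed values of that variable. A sketch with hypothetical ages:

```python
from statistics import mean, median

# Hypothetical ages for one variable; None marks a missing value.
ages = [54, 61, None, 47, None, 70, 58]

observed = [a for a in ages if a is not None]

# Replace each missing value with the mean (or median) of the observed values.
mean_imputed = [a if a is not None else mean(observed) for a in ages]
median_imputed = [a if a is not None else median(observed) for a in ages]

print(mean_imputed)
```

Note that filling every gap with the same constant artificially shrinks the variable's variance, which is one reason single imputation understates uncertainty.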

Last observation carried forward (LOCF)

This needs special mention because it is a term commonly encountered in randomized controlled trials (RCTs). It is a form of hot deck imputation in which missing data are substituted with the subject's last recorded observation. It is used when a subject leaves the study, and LOCF is often assumed to produce a conservative estimate of the treatment effect. However, it may produce spurious results.
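LOCF can be sketched in a few lines; the visit scores below are hypothetical:

```python
def locf(values):
    """Last observation carried forward: replace each missing (None)
    entry with the most recent observed value for that subject."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

# Monthly scores for one trial subject who dropped out after visit 3;
# the visit-3 value is carried forward to the end of the trial.
scores = [7, 5, 4, None, None, None]
print(locf(scores))
```

The sketch also makes the hazard visible: if this subject was improving (7 to 4), LOCF freezes them at their best observed state and assumes no further change, which may or may not be conservative depending on the disease course.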

Which missing data can be imputed and how much of the missing data can be imputed?

All types of variables can be imputed by computational methods, but the determinant (exposure) and outcome variables should not be imputed; only covariates may be imputed. [Figure 2] shows an example in which the dataset used in [Figure 1] has been analysed for missing values. As regards how much missing data can be imputed, it depends on the missingness mechanism: up to 20-25% missingness of the MCAR or MAR type can be handled, especially when more sophisticated imputation methods are used, and the imputation should be accompanied by a sensitivity analysis.
Figure 2: Example of missing case analysis output by statistical software

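The kind of per-variable missing-value summary that statistical software reports (as in [Figure 2]) is easy to reproduce by hand. A sketch over hypothetical records:

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 54,   "income": None,   "bp": 120},
    {"age": None, "income": 42_000, "bp": 118},
    {"age": 61,   "income": None,   "bp": None},
]

# Count and percentage of missing values for each variable.
for var in records[0]:
    n_missing = sum(r[var] is None for r in records)
    pct = 100 * n_missing / len(records)
    print(f"{var}: {n_missing} missing ({pct:.0f}%)")
```

A summary like this is the first step in judging whether the 20-25% threshold mentioned above has been exceeded for any covariate.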


Multiple imputation

The essential steps of multiple imputation are: introduce random variation into the imputation of missing values to generate several data sets, each with slightly different imputed values; perform the analysis on each data set; and finally combine the results into a single set of parameter estimates, standard errors, and test statistics.
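These three steps can be sketched as follows. The imputation model here (random draws from the observed values) is a deliberately crude stand-in for the model-based imputations real software uses; the pooling follows Rubin's rules, where the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance:

```python
import random
from statistics import mean, variance

random.seed(1)

# Hypothetical measurements; None marks a missing value.
values = [5.1, 4.8, None, 6.0, None, 5.5, 4.9, None, 5.2]
observed = [v for v in values if v is not None]

m = 20  # number of imputed data sets
estimates, within_vars = [], []
for _ in range(m):
    # Step 1: impute with random variation (crude stand-in for a real model).
    completed = [v if v is not None else random.choice(observed) for v in values]
    # Step 2: analyse each completed data set (here: estimate the mean).
    estimates.append(mean(completed))
    within_vars.append(variance(completed) / len(completed))

# Step 3: pool the m analyses with Rubin's rules.
pooled = mean(estimates)        # combined point estimate
W = mean(within_vars)           # average within-imputation variance
B = variance(estimates)         # between-imputation variance
total_var = W + (1 + 1 / m) * B
print(f"pooled mean = {pooled:.2f}, standard error = {total_var ** 0.5:.2f}")
```

The between-imputation term B is what single imputation omits: it is the extra uncertainty due to not knowing the missing values, and including it is why multiple imputation yields honest standard errors.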

Sensitivity analysis

In clinical studies, the most common source of missing data is loss to follow-up. This is easiest to understand in the context of a clinical trial. If only one or a few patients are lost to follow-up in a study with hundreds of patients, there is no problem; the problem arises when the patients lost to follow-up are too many to allow valid conclusions. This requires answering the question: how many are too many? It depends on the total sample size of the study as well as on the degree of the observed difference in outcomes between the two study groups. Certainly, fewer than five out of hundreds, or fewer than tens out of thousands, are unlikely to be too many in any circumstance. But the best approach is to ask: do the losses to follow-up threaten the validity of the results? Re-analysing the results under certain assumptions about those lost to follow-up can assess this; such reanalyses are called sensitivity analyses. The type of assumption depends on the study conclusion. If the conclusion favours the new treatment, assume a worst-case scenario and re-analyse. If the conclusion does not favour the new treatment, assume a best-case scenario and re-analyse. (If the comparison is between two active treatments, do both.) The details are as follows:

(i) Worst-case analysis

Consider a hypothetical randomised study with 100 patients in each of the treatment and control groups, of whom 10 (10%) are lost to follow-up in each group. Of the remaining 90 patients in each group, 40 (44%) die in the control group and 20 (22%) in the experimental treatment group. The difference (40/90 vs 20/90) between the two is statistically significant (P = 0.001). The conclusion is that the treatment works. The worst-case scenario analysis counts 40 (40%) deaths in the control group (all ten patients lost to follow-up in this group are assumed to have survived) and 30 (30%) deaths in the treatment group (all ten patients lost to follow-up in this group are assumed to have died). The reanalysis (40/100 vs 30/100) shows that the difference is statistically non-significant (P = 0.18), and the conclusion now would be that the difference might be due to chance. Since the two conclusions (one without counting the losses to follow-up, and the second under the worst-case scenario) differ, the losses to follow-up (10% in this example) are too many to allow strong conclusions. (There is no need to do a best-case scenario here, because it would lead to the same conclusion as the analysis presented in the study.)
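This reanalysis can be checked with a chi-squared test on the 2 × 2 table. The sketch below (the helper name `chi2_p` is ours) uses Yates' continuity correction, which reproduces the worst-case P of 0.18; the article does not state which test produced its P values, so the first figure comes out slightly different (about 0.003 rather than 0.001):

```python
from math import erfc, sqrt

def chi2_p(a, b, c, d):
    """P value for the 2x2 table [[a, b], [c, d]] using the chi-squared
    test with Yates' continuity correction (1 degree of freedom)."""
    n = a + b + c + d
    chi2 = n * max(abs(a * d - b * c) - n / 2, 0) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d))
    return erfc(sqrt(chi2 / 2))  # survival function of chi-squared, 1 df

# Reported analysis: 40/90 deaths (control) vs 20/90 deaths (treatment).
print(f"reported analysis: P = {chi2_p(40, 50, 20, 70):.3f}")

# Worst case: control deaths stay 40/100; treatment deaths become 30/100.
print(f"worst-case analysis: P = {chi2_p(40, 60, 30, 70):.2f}")  # P = 0.18
```

Swapping one line of assumptions turns a significant result into a non-significant one, which is the entire point of the worst-case exercise.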

Obviously, this analysis rests on an extreme assumption, which is unlikely to be true; in other words, it is a stringent test. If the study passes this test, there is no question that the conclusions are valid despite the losses to follow-up. If the study fails this test, the conclusions may or may not be valid; we do not know.

(ii) Best-case analysis

If a study concludes that the new treatment does not make any difference but had losses to follow-up, you can do best-case scenario analysis to determine if losses were too many to allow valid conclusions.

Again, consider a hypothetical randomised study with 100 patients in each of the two arms. Let us say 25 patients in the treatment arm and 20 in the placebo arm were lost to follow-up, and thirty patients died in each group. The result: there is no statistically significant difference between the two groups; in fact, percentage-wise, the control group fares marginally better. The best-case scenario assumes that all those lost to follow-up in the treatment group survived, whereas all those in the control group died. The final figures would be 30 deaths in the treatment group and 50 in the control group (now out of 100 in each group). Reanalysis gives a P value of 0.006, a statistically very significant difference. This means that the losses to follow-up are so many that the results reported in the study cannot be called robust.
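The best-case figures can be verified the same way (the Yates-corrected helper from the worst-case sketch is repeated here so this block runs on its own):

```python
from math import erfc, sqrt

def chi2_p(a, b, c, d):
    """Yates-corrected chi-squared P value for a 2x2 table (1 df)."""
    n = a + b + c + d
    chi2 = n * max(abs(a * d - b * c) - n / 2, 0) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d))
    return erfc(sqrt(chi2 / 2))

# Best case: 30/100 deaths in the treatment arm vs 50/100 in the control arm.
p = chi2_p(30, 70, 50, 50)
print(f"best-case analysis: P = {p:.3f}")  # P = 0.006
```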

Of course, the worst- and best-case scenarios are based on extreme assumptions, which are probably implausible. They have value when they do not change the conclusions of the study (this means the conclusions are robust, and the losses to follow-up do not invalidate the results). If they do change the conclusions, then the results cannot be called robust. The extent to which validity is compromised depends on the degree to which the outcome of treatment patients lost to follow-up differs from that of control patients lost to follow-up. Akl et al.[4] have proposed methods for handling participants excluded from analyses of randomized trials based on a range of plausible assumptions.


Conclusion


Missing data is common in clinical research and should be avoided at the source. Any missing data should be accounted for, and the limitations it introduces understood. When reading a paper dealing with missing data, the imputation mechanism should be interpreted in light of the limitations of the methods and the sensitivity analysis. In your own research, the type and quantum of missing data should be known, and the variables selected for imputation and the limitations should be addressed in the manuscript along with a sensitivity analysis. However, it is important to note that no statistical method will ever be able to replace missing information completely; thus, 'the optimal solution to the problem of missing values is not to have any'.[5] Therefore, every effort must be made to keep missing values to a minimum. Further details are beyond the scope of this paper; interested readers may consult further references.[6]

Financial support and sponsorship

Nil.

Conflicts of interest

There are no conflicts of interest.



 
References

1. Mack C, Su Z, Westreich D. Types of Missing Data [Internet]. Agency for Healthcare Research and Quality (US); 2018. Available from: https://www.ncbi.nlm.nih.gov/books/NBK493614/. [Last accessed on 2020 Jun 19].
2. Donders ART, van der Heijden GJMG, Stijnen T, Moons KGM. Review: A gentle introduction to imputation of missing values. J Clin Epidemiol 2006;59:1087-91.
3. Scheffer J. Dealing with missing data. Res Lett Inf Math Sci 2002;3:153-60.
4. Akl EA, Johnston BC, Alonso-Coello P, Neumann I, Ebrahim S, Briel M, et al. Addressing dichotomous data for participants excluded from trial analysis: A guide for systematic reviewers. PLoS ONE 2013;8:e57132.
5. Lachin JM. Fallacies of last observation carried forward analyses. Clin Trials 2016;13:161-8.
6. Little RJ, Rubin DB. Statistical Analysis with Missing Data. 3rd ed. Wiley-Blackwell; 2019.


    Figures

  [Figure 1], [Figure 2]



 
