missing values – your advice is needed

by Urs E. Gattiker on 2008/11/24

    When data are missing, the appropriate missing data analysis procedures do not generate something out of nothing but do make the most out of the data available.

One of the most common forms of analysis with missing data involves simply substituting the mean for the variable whenever a value is missing. Unfortunately, mean substitution can produce seriously wrong estimates of variances and covariances. In general, substituting the mean for the missing value has the effect of underestimating the magnitude of both variances and covariances.

    “1. Whenever possible, use the EM algorithm (or other maximum likelihood procedure, including the multiple-group structural equation-modeling procedure or, where appropriate, multiple imputation) for analyses involving missing data.

    2. If other analyses must be used, keep in mind that they produce biased results and should not be relied upon for final analyses. Recommending:

    a. Never use mean substitution, even for preliminary analyses.

    b. With minimal missing data, analysis of complete cases may be a reasonable solution.

    c. If data are missing completely at random, pairwise deletion or complete cases analysis may be a reasonable solution.

    d. If data are not missing completely at random and the cause of missingness has been measured, complete cases may produce unbiased estimates, although it is a generally less powerful approach than the EM algorithm or multiple-group procedure.”

John W. Graham, Scott M. Hofer, and Andrea M. Piccinin (1994). Analysis With Missing Data in Drug Prevention Research. In L. M. Collins & L. A. Seitz (Eds.), Advances in Data Analysis for Prevention Intervention Research (pp. 13-63). NIDA Research Monograph 142. Bethesda, MD: U.S. Department of Health and Human Services.

Accordingly, mean substitution:

1. artificially decreases the variation of scores. For each variable, this decrease grows with the amount of missing data: the more values are missing, the more “perfectly average scores” are artificially added to the data set; and

2. substitutes missing data with artificially created “average” data points, which can considerably change the values of correlations.
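To make the first point concrete, here is a minimal sketch in Python using only the standard library. The scores are made up for illustration, and the helper name mean_substitute is our own, not taken from any package. It compares the sample variance of the observed values with the sample variance after mean substitution.

```python
from statistics import mean, variance

def mean_substitute(values):
    """Replace each None with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical scores for one variable; None marks a missing value.
scores = [2.0, 4.0, None, 8.0, None, 10.0]

observed = [v for v in scores if v is not None]
filled = mean_substitute(scores)

var_observed = variance(observed)  # sample variance of the 4 observed scores
var_filled = variance(filled)      # sample variance after filling 2 gaps with the mean

print(var_observed, var_filled)
```

The filled-in values sit exactly on the mean, so they add nothing to the sum of squared deviations while still increasing the number of cases. That is why the variance after substitution comes out smaller.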

We have tried to minimize this issue by calculating the mean only across blogs with the same Google PageRank (e.g., for variable x, take the values from all blogs with Google PageRank 4, calculate their average, and use it for the missing values in that group). In turn, this reduces the impact that outliers, high and low, may have on our results. For more, see also:
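The group-wise substitution described above could be sketched as follows in Python. This is an illustration under assumptions, not our production code: the field names pagerank and x, the sample rows, and the function name group_mean_substitute are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

def group_mean_substitute(rows, group_key, value_key):
    """Fill missing values with the mean of observed values in the same group."""
    observed = defaultdict(list)
    for row in rows:
        if row[value_key] is not None:
            observed[row[group_key]].append(row[value_key])
    group_means = {group: mean(vals) for group, vals in observed.items()}

    filled = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        if new_row[value_key] is None:
            # Stays None when no blog in this group has an observed value.
            new_row[value_key] = group_means.get(new_row[group_key])
        filled.append(new_row)
    return filled

# Hypothetical blogs: variable x is missing for two of them.
blogs = [
    {"pagerank": 4, "x": 10.0},
    {"pagerank": 4, "x": 14.0},
    {"pagerank": 4, "x": None},
    {"pagerank": 6, "x": 50.0},
    {"pagerank": 6, "x": None},
]

filled = group_mean_substitute(blogs, "pagerank", "x")
print(filled[2]["x"], filled[4]["x"])
```

Note that this is still mean substitution within each group, so the variance inside each PageRank group shrinks in the same way as before.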

what we do with missing values

We are currently looking for ways to do multiple imputation. If you want to support our efforts, get in touch and/or leave a comment below. We are wondering how we can integrate this program:

Schafer, J. Software for Multiple Imputation

with our work. Suggestions are welcome.

We need to find a way to run it without relying on a statistical package to do the job for us. Any advice? Please leave a comment: we need your expertise and appreciate any help we can get.
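One part of multiple imputation that is easy to program without a statistical package is the final combining step, usually called Rubin's rules. The sketch below, in Python with only the standard library, assumes the imputed data sets have already been created and analysed, so that we have one estimate and one standard error per data set. Creating the imputations themselves is the hard part and is not shown. The function name pool_estimates and the numbers are made up for illustration.

```python
import math
from statistics import mean, variance

def pool_estimates(estimates, standard_errors):
    """Combine results from m imputed data sets using Rubin's rules.

    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(standard_errors):
        raise ValueError("need one estimate and one standard error per imputation, m >= 2")

    pooled = mean(estimates)                          # average of the m estimates
    within = mean([se ** 2 for se in standard_errors])  # average within-imputation variance
    between = variance(estimates)                     # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between            # total variance
    return pooled, math.sqrt(total)

# Hypothetical results from m = 3 imputed data sets.
pooled_estimate, pooled_se = pool_estimates([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(pooled_estimate, pooled_se)
```

The pooled standard error is larger than any single standard error here because the estimates disagree across imputations. That extra uncertainty is exactly what mean substitution hides.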

other resources

Allen, E. I., & Sharpe, N. R. (2005). Demonstration of Ranking Issues for Students: A Case Study. Journal of Statistics Education, Volume 13, Number 3.

Karen Grace-Martin writes The Analysis Factor blog. You should subscribe: it is refreshing and very helpful indeed.




  • http://www.analysisfactor.com Karen Grace-Martin

    Thanks for the kind words. Refreshing-wow!

    Here’s my advice:

    Get on the Impute mailing list: http://lists.utsouthwestern.edu/mailman/listinfo/impute. All the theoretical statisticians who work with multiple imputation (the guys who derive the equations) seem to be there.

    A couple good books on missing data you might want to start with are Allison and Little & Rubin. Allison’s is not equation heavy, but I do know there is an equation in there about how to combine the standard errors for multiple imputation. Little and Rubin is very equation heavy, so probably has much of what you need. Full citations are at my site under Resources: Books.

  • urs

    Dear Karen

    The nice words I put down about your blog are well deserved, I meant and still mean it :-)

    Thanks for the input, I am getting on the list as per your suggestion…

    Equations are fine for me… but I am trying to find a program that can help us deal with this and do it right.

    Maybe the mailing list will give me the info, or else looking at these books? If you know of a program that we can use (we program in PHP), let me know please, I really would love to know.

    Thanks Urs

  • http://www.analysisfactor.com Karen Grace-Martin

    Hi Urs–I am totally not a programmer, so can’t help you there. The books will help you with the equations to program, but I suspect someone on that Impute list will know about programming it.

    You might want to look into R as well. It’s free statistical software, and I believe open source. I’m pretty sure it has multiple imputation.

    Good luck–Karen
