- When data are missing, the appropriate missing data analysis procedures do not generate something out of nothing but do make the most out of the data available.
One of the most common forms of analysis with missing data involves simply substituting the mean for the variable whenever a value is missing. Unfortunately, mean substitution can produce very wrong estimates of variances and covariances. In general, substituting the mean for the missing value has the effect of underestimating the magnitude of both variances and covariances.
- “1. Whenever possible, use the EM algorithm (or other maximum likelihood procedure, including the multiple-group structural equation-modeling procedure or, where appropriate, multiple imputation) for analyses involving missing data.
2. If other analyses must be used, keep in mind that they produce biased results and should not be relied upon for final analyses. Recommendations:
a. Never use mean substitution, even for preliminary analyses.
b. With minimal missing data, analysis of complete cases may be a reasonable solution.
c. If data are missing completely at random, pairwise deletion or complete cases analysis may be a reasonable solution.
d. If data are not missing completely at random and the cause of missingness has been measured, complete cases may produce unbiased estimates, although it is a generally less powerful approach than the EM algorithm or multiple-group procedure.”
Accordingly, mean substitution:
1. artificially decreases the variation of scores; the decrease for each variable is proportional to the amount of missing data, because the more data are missing, the more “perfectly average scores” are artificially added to the data set; and
2. substitutes missing data with artificially created “average” data points, which can considerably change the values of correlations.
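The variance shrinkage described in the two points above can be checked with a small simulation. This is only a sketch on made-up data: the sample size, the 30% missingness rate, and the helper name `mean_substitute` are assumptions chosen for illustration, not values from our blog data.

```python
import random
import statistics

def mean_substitute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]

# Simulated scores (hypothetical data, not taken from any real data set)
random.seed(0)
complete = [random.gauss(50, 10) for _ in range(200)]

# Remove roughly 30% of the values completely at random
with_missing = [None if random.random() < 0.3 else v for v in complete]

observed = [v for v in with_missing if v is not None]
filled = mean_substitute(with_missing)

var_observed = statistics.variance(observed)  # sample variance, observed cases only
var_filled = statistics.variance(filled)      # sample variance after mean substitution

print(f"observed cases: {len(observed)} of {len(filled)}")
print(f"variance (observed only):     {var_observed:.2f}")
print(f"variance (mean-substituted):  {var_filled:.2f}")
```

Every substituted value sits exactly on the mean, so it adds nothing to the sum of squared deviations while still increasing the count. The substituted variance is therefore the observed variance multiplied by (n_observed - 1) / (n_total - 1), which is always below 1 when anything is missing.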
We have tried to minimize this issue by calculating the mean only across blogs with the same Google PageRank (e.g., for variable x, take the values from all blogs with Google PageRank 4, calculate their average, and use that as the substitute). In turn, this reduces the impact that outliers, high and low, may have on our results. For more, see also:
what we do with missing values
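The group-wise substitution by Google PageRank described above might look roughly like the following. It is a minimal sketch: the field names `pagerank` and `x`, the function name `group_mean_substitute`, and the sample rows are all hypothetical, and it still carries the drawbacks of mean substitution within each group.

```python
from collections import defaultdict
from statistics import fmean

def group_mean_substitute(rows, group_key, value_key):
    """Fill missing values (None) with the mean of rows in the same group.

    Returns new row dicts; the input rows are left unchanged.
    """
    observed = defaultdict(list)
    for row in rows:
        if row[value_key] is not None:
            observed[row[group_key]].append(row[value_key])
    group_means = {group: fmean(vals) for group, vals in observed.items()}

    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[value_key] is None:
            # If no blog in the group has an observed value, the gap stays None.
            new_row[value_key] = group_means.get(new_row[group_key])
        filled.append(new_row)
    return filled

# Hypothetical blogs: "pagerank" is the grouping variable, "x" the measured one
blogs = [
    {"pagerank": 4, "x": 10.0},
    {"pagerank": 4, "x": 14.0},
    {"pagerank": 4, "x": None},
    {"pagerank": 6, "x": 30.0},
    {"pagerank": 6, "x": None},
    {"pagerank": 7, "x": None},
]

filled_blogs = group_mean_substitute(blogs, "pagerank", "x")
for row in filled_blogs:
    print(row)
```

A group with no observed values at all (PageRank 7 in this made-up example) is left missing rather than filled with the overall mean, so that the gap stays visible.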
We are currently looking for ways to do multiple imputation. If you want to support our efforts, get in touch and/or leave a comment below. We are wondering how we can integrate this program:
Schafer, J. Software for Multiple Imputation
with our work. Suggestions are welcome.
We need to find a way to run it without relying on any of the statistical packages to do the job for us. Any advice is welcome: please leave a comment. We need your expertise and appreciate any help we can get.
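As a starting point for discussion, here is a rough sketch of what multiple imputation could look like using only the Python standard library. It is not the algorithm used in Schafer's software and is not what we currently run. It assumes a single incomplete variable y, one fully observed predictor x, a linear relationship between them, and a bootstrap step to carry parameter uncertainty into the imputations. The function names `fit_line` and `multiply_impute_mean` and the simulated data are hypothetical. The pooling step follows Rubin's rules: total variance = within-imputation variance + (1 + 1/m) × between-imputation variance.

```python
import random
from statistics import fmean, variance

def fit_line(xs, ys):
    """Least-squares slope, intercept and residual standard deviation."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)  # assumes the x values are not all equal
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return slope, intercept, sd

def multiply_impute_mean(pairs, m=20, rng=None):
    """Estimate the mean of y from (x, y) pairs where y may be None.

    Returns the pooled mean and its total variance across m imputed data sets.
    """
    rng = rng or random.Random()
    complete = [(x, y) for x, y in pairs if y is not None]
    estimates, within = [], []
    for _ in range(m):
        # Resample complete cases so each imputation uses different parameters.
        boot = [rng.choice(complete) for _ in complete]
        slope, intercept, sd = fit_line([x for x, _ in boot], [y for _, y in boot])
        # Impute each missing y as prediction plus random residual noise.
        ys = [y if y is not None else intercept + slope * x + rng.gauss(0, sd)
              for x, y in pairs]
        estimates.append(fmean(ys))
        within.append(variance(ys) / len(ys))  # squared standard error of the mean
    pooled = fmean(estimates)
    total = fmean(within) + (1 + 1 / m) * variance(estimates)  # Rubin's rules
    return pooled, total

# Simulated data (hypothetical): true mean of y is 5, and y is more likely
# to be missing when the observed x is high.
rng = random.Random(1)
data = []
for _ in range(300):
    x = rng.gauss(0, 1)
    y = 2.0 * x + 5.0 + rng.gauss(0, 1)
    data.append((x, None if x > 0.5 and rng.random() < 0.7 else y))

pooled_mean, total_var = multiply_impute_mean(data, m=20, rng=rng)
print(f"pooled mean of y: {pooled_mean:.2f} (standard error {total_var ** 0.5:.2f})")
```

Unlike mean substitution, each imputed value includes random noise, so the filled-in data keep a realistic spread, and the between-imputation variance makes the reported standard error reflect the uncertainty caused by the missing values. A real analysis with several incomplete variables would need a more general model than this one-predictor sketch.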
other resources