How raw data are normalized


    It is not just the data that matters but, as importantly, how do you deal with missing values in your data set?

Often variables are not expressed in the same units (e.g., instead of measuring everything in centimeters or everything in time, we use both centimeters and time).

In such a case, one way to reduce the risk of comparing apples with oranges when building composite indices is to use normalization. Normalization serves the purpose of bringing the indicators into the same unit.

A summary of the various methods available is provided here:
Table 5.5 Summary of normalisation methods p. 50

The above table summarises the various techniques available for transforming indicators into pure, dimensionless numbers, a process called normalization.

Standardization
Standardization, or converting values to z-scores, is the most commonly used method. It converts all indicators to a common scale with a mean of zero and a standard deviation of one.

The mean of zero avoids introducing aggregation distortions stemming from differences in the indicators' means. The scaling factor is the standard deviation of the indicator across, for instance, the countries, companies or blogs being ranked. Thus, an indicator with extreme values will intrinsically have a greater effect on the composite indicator.

Such a greater effect on the composite indicator might be desirable if the intention is to reward exceptional behavior or performance. This could be the case if an extremely good result on a few indicators is thought to be better than a lot of average scores.

Nevertheless, this effect can be corrected in the aggregation methodology. For instance, one can exclude the best and worst sub-indicator scores from the index. Another option is to assign differential weights based on the "desirability" of the indicators used to arrive at the composite scores.
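The first correction mentioned above (dropping the best and worst sub-indicator scores) could be sketched roughly as follows. This is only an illustration, not the procedure used for any particular ranking: the function name `trimmed_composite` and the sample z-scores are hypothetical, and it assumes that exactly one lowest and one highest score are dropped before taking a plain average.

```python
def trimmed_composite(sub_scores):
    """Average sub-indicator z-scores after dropping one lowest and one highest.

    Assumption: exactly one score is trimmed at each end.
    """
    if len(sub_scores) < 3:
        raise ValueError("need at least three sub-indicator scores to trim both ends")
    kept = sorted(sub_scores)[1:-1]  # drop the minimum and the maximum
    return sum(kept) / len(kept)


# Hypothetical z-scores for one blog on five sub-indicators
blog_z = [-0.5, 0.0, 0.5, 1.0, 4.0]

print(trimmed_composite(blog_z))   # 0.5 (the extreme 4.0 no longer dominates)
print(sum(blog_z) / len(blog_z))   # 1.0 (plain average, for comparison)
```

With the extreme score of 4.0 removed, the composite falls from 1.0 to 0.5, which shows how much a single outlying indicator can pull an untrimmed average.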

How is it done
The rankings are calculated using z-scores (calculated for each criterion as the actual value minus the mean of the criterion, divided by the standard deviation of the criterion).

The raw score on each measure is converted to a z-score: (score - mean score) / standard deviation of scores.

By taking account of the standard deviation within any one criterion, this method aims to provide a more sophisticated analysis of the differences, and indeed the similarities in some measures, between blogs or webpages.

The z-score of any one criterion is calculated as (actual value - mean of criterion) / standard deviation of criterion.
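As a minimal sketch of this calculation for a single criterion, using only the Python standard library: the function name `z_scores` and the sample numbers are made up for illustration, and the code assumes the population standard deviation (`statistics.pstdev`), since the text does not say whether the population or the sample formula is used.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return (value - mean) / standard deviation for every value of one criterion."""
    mu = mean(values)
    sd = pstdev(values)  # assumption: population SD; statistics.stdev gives the sample SD
    if sd == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(v - mu) / sd for v in values]


# Hypothetical raw scores of eight blogs on one criterion (mean 5, SD 2)
raw = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_scores(raw))
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

The resulting scores have a mean of zero and a standard deviation of one, whatever unit the raw criterion was measured in.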

z-scores – things to consider
z-scores do have their own drawbacks, especially if they are used with weights for the measures (see Z-scores – their effects on league tables).

We refrain from using weights in the default version of our rankings, scores and tables. However, the user may want to use weightings.

The calculated z-score [z-score = (value - mean)/SD] describes where a value is located in the distribution. For instance, a z-score of 0 is at the mean of the distribution, and a z-score of 2.0 or beyond is in the tails of the distribution. A negative z-score means that the original score was below the mean. If the distribution is roughly symmetric, then as the sample size becomes large, approximately half of the z-scores should be negative and half should be positive.

Based on statistical logic, for a composite standardised (z) score, the individual variables that make up the composite must each be standardised before they are combined.

The total weighted z-score (e.g., total z-score / number of criteria = composite z-score) is then scaled so that the highest outcome is 100 and the rest are expressed as a proportion of the highest.
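Read literally, this scaling step might look like the sketch below. The function name `scale_to_100` and the sample values are hypothetical. One caveat: under this literal reading, a composite z-score below zero yields a negative index, and the text does not say how that case is handled, so treat the code as an illustration of the idea only.

```python
def scale_to_100(composites):
    """Express each composite z-score as a percentage of the highest one."""
    top = max(composites)
    if top <= 0:
        # A non-positive maximum cannot serve as the 100-point reference.
        raise ValueError("highest composite score must be positive")
    return [100 * c / top for c in composites]


# Hypothetical composite z-scores for four blogs
composites = [2.0, 1.0, 0.5, -0.5]
print(scale_to_100(composites))
# [100.0, 50.0, 25.0, -25.0]
```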

To allow for missing data, the sum of the weighted z-scores is divided by the sum of the weights of the measures with available data.
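A small sketch of this missing-data rule, under the assumption that a missing measure is recorded as `None`: the measure names, weights and the function name `weighted_composite` are invented for the example.

```python
def weighted_composite(z_by_measure, weights):
    """Sum of weighted z-scores divided by the sum of weights of available measures.

    Assumption: a missing measure is stored as None and is skipped entirely.
    """
    available = [(m, z) for m, z in z_by_measure.items() if z is not None]
    total_weight = sum(weights[m] for m, _ in available)
    if total_weight == 0:
        raise ValueError("no weighted measures with available data")
    return sum(weights[m] * z for m, z in available) / total_weight


# Hypothetical weights and z-scores; "comments" is missing for this blog
weights = {"links": 2, "comments": 1, "posts": 1}
blog = {"links": 1.0, "comments": None, "posts": -0.5}

print(weighted_composite(blog, weights))  # (2*1.0 + 1*(-0.5)) / (2 + 1) = 0.5
```

Because the divisor only counts the weights of measures that are present, a blog is not penalised as if its missing measure had scored zero.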

Nonetheless, the z-score method of analyzing the criteria used does not address the general issues regarding the rationale behind any measure’s inclusion in the first place.

Why we prefer z-scores
Normalizing data with z-scores overcomes the objections of relativism that can be raised against methods that allocate points pro rata to the top blogs or webpages, or against whichever ranking method is used.

For instance, the AdAge Power 150 uses the Technorati ranking grouped into ranges (i.e., top 9,000, top 10,000, top 20,000, etc.). Each range is then assigned a number from 1 to 30 to get the value for each blog on that criterion.

However, the AdAge Power 150 does not publish in its methodology the rationale used to arrive at the ranges taken. This makes such a method for calculating the rankings, and hence the table, opaque.

In addition, z-scores take into account the difference in score between each blog in the ranking and the spread between the top and bottom blogs. The z-scores can then be converted to indices, so all scores for all rankings are within the same range.
