Greetings,
Many software packages have a feature that generates descriptive statistics for a set of data. These statistics include the mean, standard deviation, variation, count, and a host of other numbers. Two other numbers often encountered are the results for the skewness and kurtosis.
This month's newsletter covers these two statistics. These two statistics are meant to be "shape" statistics, i.e., they describe the shape of the distribution. What do the skewness and kurtosis really represent? And do they help you understand your process any better? Are they useful statistics? You might be surprised at the answer.
Best regards,
Bill
Introduction
Are your data normally distributed? This is a very common question today. There are many good ways of determining whether your data come from a normal distribution. Many books say that the skewness and kurtosis statistics give you insights into the shape of the distribution. Skewness is a measure of symmetry. A normal distribution has a skewness equal to 0; i.e. perfect symmetry. Of course, that symmetry does not exist too often in real life. Kurtosis is a measure of the flatness of the distribution. The kurtosis will be 3 for a normal distribution. Some calculations (including that in Microsoft Excel) subtract 3 from the result, so the kurtosis will be 0 for a normal distribution. Skewness and kurtosis are defined in the next two sections.
Skewness
In his book The Six Sigma Handbook, Thomas Pyzdek has the following definition for skewness:
"A measure of asymmetry. Zero indicates perfect symmetry; the normal distribution has a skewness of zero. Positive skewness indicates that the "tail" of the distribution is more stretched on the side above the mean. Negative skewness indicates that the tail of the distribution is more stretched on the side below the mean."
The equation for skewness is:
where X are the individual values, Xbar is the overall average, n is the sample size and s is the standard deviation. This equation is from Dr. Donald Wheeler's excellent book Advanced Topics in Statistical Process Control (www.spcpress.com).
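With those symbols, the usual third-moment form of the statistic can be written as shown here. This is a reconstruction that assumes the standard moment definition, not a quotation from the book, and the label a_3 is an assumed name.

```latex
a_3 = \frac{\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^3}{n \, s^3}
```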
Microsoft Excel, on the other hand, uses the following to define skewness:
"Skewness characterizes the degree of asymmetry of a distribution around its mean. Positive skewness indicates a distribution with an asymmetric tail extending toward more positive values. Negative skewness indicates a distribution with an asymmetric tail extending toward more negative values."
This is very close to Pyzdek's definition. Excel uses a slightly different equation for skewness:
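For reference, Microsoft's documentation for the SKEW function gives the formula below, where s is the sample standard deviation (n - 1 in the denominator) and at least three data points are required.

```latex
\text{SKEW} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3
```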
Does the difference in equations cause a difference in the results? Just a little. For the most part, the results are pretty close. Three examples of distributions, each with a different skewness, are shown below.
Distribution Number 1: Skewness > 0; Tail Extends to the Right
Distribution Number 2: Skewness = 0; Normal Distribution
Distribution Number 3: Skewness < 0; Tail Extends to the Left
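The size of the difference between the two skewness calculations can be checked with a short script. This is a minimal sketch, not the calculation used in this newsletter: `moment_skewness` assumes the third-moment form sum((X - Xbar)^3) / (n s^3) with s as the sample standard deviation, `excel_skewness` follows the formula documented for Excel's SKEW function, and the data values are made up for illustration.

```python
import math


def sample_std(data):
    """Sample standard deviation with n - 1 in the denominator."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))


def moment_skewness(data):
    """Assumed third-moment form: sum((x - mean)^3) / (n * s^3)."""
    n = len(data)
    mean = sum(data) / n
    s = sample_std(data)
    return sum((x - mean) ** 3 for x in data) / (n * s ** 3)


def excel_skewness(data):
    """Formula documented for Excel's SKEW function (needs n >= 3)."""
    n = len(data)
    mean = sum(data) / n
    s = sample_std(data)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)


# Hypothetical data with a tail to the right, made up for illustration.
right_tail = [2, 3, 3, 4, 4, 5, 6, 8, 12, 20]

print("moment form:", moment_skewness(right_tail))
print("Excel form: ", excel_skewness(right_tail))
```

Under these assumptions the two results differ only by the factor n^2 / ((n - 1)(n - 2)), which approaches 1 as n grows. That is why the two equations agree closely except for very small samples.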
Kurtosis
Pyzdek defines kurtosis as follows:
"Kurtosis is a measure of flatness of the distribution. Heavier tailed distributions have larger kurtosis measures. The normal distribution has a kurtosis of 3"
The equation for kurtosis (from Wheeler's book) is:
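A fourth-moment form consistent with the symbols defined for skewness is shown here. This is a reconstruction that assumes the standard moment definition, which yields a value near 3 for a normal distribution. It is not a quotation from the book, and the label a_4 is an assumed name.

```latex
a_4 = \frac{\sum_{i=1}^{n} \left( X_i - \bar{X} \right)^4}{n \, s^4}
```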
Microsoft Excel has the following definition:
"Kurtosis characterizes the relative peakedness or flatness of a distribution compared with the normal distribution. Positive kurtosis indicates a relatively peaked distribution. Negative kurtosis indicates a relatively flat distribution."
Excel uses the following equation to calculate kurtosis:
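For reference, Microsoft's documentation for the KURT function gives the formula below, where s is the sample standard deviation and at least four data points are required. The subtracted term is the built-in correction that centers the result on 0 for a normal distribution.

```latex
\text{KURT} = \left[ \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 \right] - \frac{3(n-1)^2}{(n-2)(n-3)}
```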
This is where it can get confusing. The first equation for kurtosis above (from Wheeler) will give a kurtosis of 3 for a normal distribution. Excel's equation has a built-in correction that will give a kurtosis of 0 for a normal distribution. The different equations do give slightly different results (when accounting for the difference of 3). Two examples of distributions with differing values of kurtosis are shown below.
Distribution with Too Much Peak (Kurtosis > 0)
Too Flat Distribution (Kurtosis < 0)
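The two kurtosis conventions can be compared the same way. This is a minimal sketch under stated assumptions: `moment_kurtosis` uses the fourth-moment form sum((X - Xbar)^4) / (n s^4), which is centered on 3, `excel_kurtosis` follows the formula documented for Excel's KURT function, which is centered on 0, and both data sets are made up for illustration.

```python
import math


def mean_and_std(data):
    """Return the mean and the sample standard deviation (n - 1 denominator)."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return mean, s


def moment_kurtosis(data):
    """Assumed fourth-moment form: about 3 for a normal distribution."""
    n = len(data)
    mean, s = mean_and_std(data)
    return sum((x - mean) ** 4 for x in data) / (n * s ** 4)


def excel_kurtosis(data):
    """Formula documented for Excel's KURT function (needs n >= 4)."""
    n = len(data)
    mean, s = mean_and_std(data)
    fourth = sum(((x - mean) / s) ** 4 for x in data)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


# Hypothetical data sets, made up for illustration.
flat = list(range(1, 21))                    # evenly spread values
peaked = [-8, -1, 0, 0, 0, 0, 0, 0, 1, 8]    # tight center, two far-out values

for name, data in (("flat", flat), ("peaked", peaked)):
    print(name, moment_kurtosis(data), excel_kurtosis(data))
```

When comparing numbers from different packages, first check which convention is in use: subtracting 3 from the moment form puts it on roughly the same scale as the Excel result.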
The Population
Are the skewness and kurtosis of any value to you? To explore this, a data set of 5,000 points was randomly generated. The goal was to have a mean of 100 and a standard deviation of 10. The random generation resulted in a data set with a mean of 99.95 and a standard deviation of 10.01. The histogram for this data is shown below and looks fairly bell-shaped.
The skewness of the data is 0.007. The kurtosis is 0.03. Both values are close to 0 as you would expect for a normal distribution. These two numbers represent the "true" value for the skewness and kurtosis since they were calculated from all the data. In real life, you don't know the real skewness and kurtosis because you have to sample the process. This is where the problem begins for skewness and kurtosis. Sample size has a big impact on the results.
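A similar population can be generated to repeat the experiment. This is a minimal sketch, not the newsletter's data: the seed is arbitrary, the helper name `excel_skew_kurt` is hypothetical, and it assumes the formulas documented for Excel's SKEW and KURT functions, so the exact values will differ from those quoted above.

```python
import math
import random


def excel_skew_kurt(data):
    """Return (skewness, kurtosis) using the formulas documented for Excel's SKEW and KURT."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    z3 = sum(((x - mean) / s) ** 3 for x in data)
    z4 = sum(((x - mean) / s) ** 4 for x in data)
    skew = n / ((n - 1) * (n - 2)) * z3
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return skew, kurt


rng = random.Random(2024)  # arbitrary seed so the run is repeatable
population = [rng.gauss(100, 10) for _ in range(5000)]  # target mean 100, std dev 10

pop_mean = sum(population) / len(population)
pop_skew, pop_kurt = excel_skew_kurt(population)
print("mean:", round(pop_mean, 2))
print("skewness:", round(pop_skew, 3), "kurtosis:", round(pop_kurt, 3))
```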
Impact of Sample Size on Skewness and Kurtosis
The 5,000-point data set above was used to explore what happens to skewness and kurtosis as the sample size changes. For example, to determine the skewness and kurtosis for a sample size of 50, 50 results were randomly selected from the data set and the two statistics were calculated. This was done for a range of sample sizes. The results are shown in the table below.
Sample Size | Skewness | Kurtosis |
5 | 1.983 | 3.974 |
10 | -0.078 | -1.468 |
15 | -0.384 | 0.127 |
25 | -0.356 | -0.025 |
50 | -0.169 | -0.752 |
75 | -0.489 | 0.615 |
100 | -0.346 | 0.671 |
250 | 0.089 | 0.061 |
500 | 0.186 | 0.232 |
750 | -0.02 | 0.042 |
1000 | -0.138 | 0.062 |
1250 | 0.085 | 0.079 |
1500 | -0.017 | 0.001 |
2000 | -0.059 | -0.009 |
2500 | 0.037 | 0.096 |
3000 | 0.009 | 0.005 |
3500 | -0.015 | 0.004 |
4000 | -0.015 | -0.009 |
4500 | 0.009 | 0.036 |
5000 | 0.007 | 0.030 |
Notice how different the small-sample results are from the "true" skewness and kurtosis for the 5,000 results. For a sample size of 25, the skewness was -0.356 compared to the true value of 0.007, while the kurtosis was -0.025 compared to the true value of 0.030. Both signs are opposite to those of the true values, which would lead to wrong conclusions about the shape of the distribution. The results vary a lot with sample size. The two charts below show how the skewness and kurtosis changed with sample size.
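The sample-size experiment can be rerun on a simulated population. This is a minimal sketch, not the original calculation: the seed, the list of sizes, and the helper name `excel_skew_kurt` are assumptions, and it uses the formulas documented for Excel's SKEW and KURT functions. The numbers will not match the table, but the pattern should be similar.

```python
import math
import random


def excel_skew_kurt(data):
    """Return (skewness, kurtosis) using the formulas documented for Excel's SKEW and KURT."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    z3 = sum(((x - mean) / s) ** 3 for x in data)
    z4 = sum(((x - mean) / s) ** 4 for x in data)
    skew = n / ((n - 1) * (n - 2)) * z3
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return skew, kurt


rng = random.Random(7)  # arbitrary seed so the run is repeatable
population = [rng.gauss(100, 10) for _ in range(5000)]

sizes = [5, 10, 25, 50, 100, 500, 1000, 5000]
results = {}
for n in sizes:
    sample = rng.sample(population, n)  # draw n points without replacement
    results[n] = excel_skew_kurt(sample)

print("Sample Size | Skewness | Kurtosis |")
for n in sizes:
    skew, kurt = results[n]
    print(f"{n} | {skew:.3f} | {kurt:.3f} |")
```

Changing the seed gives a different table each time, which is the point: at small sample sizes the two statistics move around far more than the population values would suggest.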
Thirty samples is a common number used for process capability studies. A sample of 30 was randomly selected from the data set, and this was repeated 100 times. The skewness varied from -1.327 to 1.275 while the kurtosis varied from -1.12 to 2.978. What kind of decisions can you make about the shape of the distribution when the skewness and kurtosis vary so much? Essentially, none.
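The repeated draws of 30 can be simulated in the same way. This is a minimal sketch on a generated population, with an arbitrary seed and a hypothetical helper `excel_skew_kurt` based on the formulas documented for Excel's SKEW and KURT functions. The ranges it prints will differ from the ones reported above.

```python
import math
import random


def excel_skew_kurt(data):
    """Return (skewness, kurtosis) using the formulas documented for Excel's SKEW and KURT."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    z3 = sum(((x - mean) / s) ** 3 for x in data)
    z4 = sum(((x - mean) / s) ** 4 for x in data)
    skew = n / ((n - 1) * (n - 2)) * z3
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return skew, kurt


rng = random.Random(30)  # arbitrary seed so the run is repeatable
population = [rng.gauss(100, 10) for _ in range(5000)]

# Draw 100 independent samples of 30 points and record both statistics for each.
trials = [excel_skew_kurt(rng.sample(population, 30)) for _ in range(100)]
skews = [t[0] for t in trials]
kurts = [t[1] for t in trials]

print(f"skewness range: {min(skews):.3f} to {max(skews):.3f}")
print(f"kurtosis range: {min(kurts):.3f} to {max(kurts):.3f}")
```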
Conclusions
The skewness and kurtosis statistics appear to be very dependent on the sample size. The table above shows the variation. In fact, even several hundred data points didn't give very good estimates of the true kurtosis and skewness. Smaller sample sizes can give results that are very misleading. Dr. Wheeler said it correctly in his book mentioned above:
"In short, skewness and kurtosis are practically worthless. Shewhart made this observation in his first book. The statistics for skewness and kurtosis simply do not provide any useful information beyond that already given by the measures of location and dispersion."
Walter Shewhart was the "Father" of SPC. So, don't put much emphasis on skewness and kurtosis values you may see. And remember, the more data you have, the better you can describe the shape of the distribution.
Attachment | Size |
---|---|
Data for Kurtosis and Skewness.xls | 718.5 KB |