April 2008

 

Greetings,

Many software packages have a feature that generates descriptive statistics for a set of data. These statistics include the mean, standard deviation, variance, count, and a host of other numbers. Two other numbers often encountered are the skewness and kurtosis.

This month's newsletter covers these two statistics. These two statistics are meant to be "shape" statistics, i.e., they describe the shape of the distribution. What do the skewness and kurtosis really represent? And do they help you understand your process any better? Are they useful statistics? You might be surprised at the answer.

 

Best regards,

Bill

 

Introduction

Are your data normally distributed? This is a very common question today, and there are many good ways of determining whether your data come from a normal distribution. Many books say that the skewness and kurtosis statistics give you insights into the shape of the distribution. Skewness is a measure of symmetry. A normal distribution has a skewness equal to 0, i.e., perfect symmetry. Of course, perfect symmetry rarely exists in real life. Kurtosis is usually described as a measure of the flatness of the distribution, although it is driven largely by how heavy the tails are. The kurtosis will be 3 for a normal distribution. Some calculations (including the one in Microsoft Excel) subtract 3 from the result or apply an equivalent correction, so the kurtosis will be 0 for a normal distribution. Skewness and kurtosis are defined in the next two sections.

 

Skewness

In his book The Six Sigma Handbook, Thomas Pyzdek has the following definition for skewness:

"A measure of asymmetry. Zero indicates perfect symmetry; the normal distribution has a skewness of zero. Positive skewness indicates that the "tail" of the distribution is more stretched on the side above the mean. Negative skewness indicates that the tail of the distribution is more stretched on the side below the mean."

The equation for skewness is:

$$\text{Skewness} = \frac{\sum (x - \bar{X})^3}{n s^3}$$

where x are the individual values, Xbar is the overall average, n is the sample size and s is the standard deviation. This equation is from Dr. Donald Wheeler's excellent book Advanced Topics in Statistical Process Control (www.spcpress.com).

 

Microsoft Excel, on the other hand, uses the following to define skewness:

"Skewness characterizes the degree of asymmetry of a distribution around its mean. Positive skewness indicates a distribution with an asymmetric tail extending toward more positive values. Negative skewness indicates a distribution with an asymmetric tail extending toward more negative values."

This is very close to Pyzdek's definition. Excel uses a slightly different equation for skewness:

$$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left( \frac{x - \bar{X}}{s} \right)^3$$

 

Does the difference in equations cause a difference in the results? Just a little. For the most part, the results are pretty close. Three examples of distributions, each with a different skewness, are shown below.

Distribution Number 1: Skewness > 0; Tail Extends to the Right

[Figure: histogram of a distribution with the tail extending to the right]

 

Distribution Number 2: Skewness = 0; Normal Distribution

[Figure: histogram of a normal distribution]

 

Distribution Number 3: Skewness < 0; Tail Extends to the Left

[Figure: histogram of a distribution with the tail extending to the left]
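
To get a feel for how much the two skewness equations can differ, here is a minimal Python sketch. It assumes that both equations use the standard deviation with the n - 1 divisor, and that the first equation has the form sum((x - Xbar)^3) / (n * s^3); the second follows the formula documented for Excel's SKEW function. The function names and the small data set are made up for illustration only.

```python
import math

def sample_std(data):
    """Standard deviation with the n - 1 divisor (assumed for both equations)."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

def moment_skewness(data):
    """Skewness as sum((x - Xbar)^3) / (n * s^3)."""
    n = len(data)
    mean = sum(data) / n
    s = sample_std(data)
    return sum((x - mean) ** 3 for x in data) / (n * s ** 3)

def excel_skewness(data):
    """Skewness using the formula documented for Excel's SKEW function."""
    n = len(data)
    mean = sum(data) / n
    s = sample_std(data)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)

# Hypothetical right-skewed data, made up for illustration only.
data = [2, 3, 3, 4, 4, 4, 5, 9, 12, 20]
print("moment form:", round(moment_skewness(data), 3))
print("Excel form: ", round(excel_skewness(data), 3))

# Under these assumptions the two forms differ only by a factor that depends on n.
for n in (10, 30, 100, 1000):
    print(n, round(n * n / ((n - 1) * (n - 2)), 4))
```

Under these assumptions the Excel result is the first result multiplied by n^2 / ((n - 1)(n - 2)), so the gap is noticeable for very small samples and fades as n grows.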

 

Kurtosis

Pyzdek defines the following:

"Kurtosis is a measure of flatness of the distribution. Heavier tailed distributions have larger kurtosis measures. The normal distribution has a kurtosis of 3"

The equation for kurtosis (from Wheeler's book) is:

$$\text{Kurtosis} = \frac{\sum (x - \bar{X})^4}{n s^4}$$

 

Microsoft Excel has the following definition:

"Kurtosis characterizes the relative peakedness or flatness of a distribution compared with the normal distribution. Positive kurtosis indicates a relatively peaked distribution. Negative kurtosis indicates a relatively flat distribution."

Excel uses the following equation to calculate kurtosis:

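$$\text{Kurtosis} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum \left( \frac{x - \bar{X}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$$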

This is where it can get confusing. The first equation for kurtosis above (from Wheeler) will give a kurtosis of 3 for a normal distribution. Excel's equation has a built-in correction that will give a kurtosis of 0 for a normal distribution. The different equations do give slightly different results (when accounting for the difference of 3). Two examples of distributions with differing values of kurtosis are shown below.

Distribution with Too Much Peak (Kurtosis > 0)

[Figure: histogram of a peaked distribution]

 

Too Flat Distribution (Kurtosis < 0)

[Figure: histogram of a flat distribution]
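
The offset of 3 between the two kurtosis equations can be checked with a short Python sketch. As an assumption, the first function uses the form sum((x - Xbar)^4) / (n * s^4) with the n - 1 standard deviation; the second follows the formula documented for Excel's KURT function, which needs at least four values. The function names, the seed and the simulated data are illustrative only.

```python
import math
import random

def sample_std(data):
    """Standard deviation with the n - 1 divisor (assumed for both equations)."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

def moment_kurtosis(data):
    """Kurtosis as sum((x - Xbar)^4) / (n * s^4); about 3 for normal data."""
    n = len(data)
    mean = sum(data) / n
    s = sample_std(data)
    return sum((x - mean) ** 4 for x in data) / (n * s ** 4)

def excel_kurtosis(data):
    """Kurtosis using the formula documented for Excel's KURT; about 0 for normal data."""
    n = len(data)
    mean = sum(data) / n
    s = sample_std(data)
    z4 = sum(((x - mean) / s) ** 4 for x in data)
    scaled = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scaled - correction

# Simulated normal data with an arbitrary seed, for illustration only.
rng = random.Random(42)
normal_data = [rng.gauss(0, 1) for _ in range(5000)]
print("moment form:", round(moment_kurtosis(normal_data), 3))
print("Excel form: ", round(excel_kurtosis(normal_data), 3))

# Evenly spaced values are about as flat as data can be, so the Excel form is negative.
print("flat data:  ", round(excel_kurtosis(list(range(1, 21))), 3))
```

For a large normal sample the first value should land near 3 and the second near 0, which is the difference described above.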

 

The Population

Are the skewness and kurtosis of any value to you? To explore this, a data set of 5,000 points was randomly generated. The goal was to have a mean of 100 and a standard deviation of 10. The random generation resulted in a data set with a mean of 99.95 and a standard deviation of 10.01. The histogram for this data is shown below and looks fairly bell-shaped.

[Figure: histogram of the 5,000 data points]

 

The skewness of the data is 0.007. The kurtosis is 0.03. Both values are close to 0 as you would expect for a normal distribution. These two numbers represent the "true" value for the skewness and kurtosis since they were calculated from all the data. In real life, you don't know the real skewness and kurtosis because you have to sample the process. This is where the problem begins for skewness and kurtosis. Sample size has a big impact on the results.
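
A population like the one described above can be approximated with a few lines of Python. This is only a sketch: the newsletter's data came from an Excel workbook, so the seed, the random number generator and the resulting values here are assumptions and will not reproduce the numbers quoted above. The skewness and kurtosis use the formulas documented for Excel's SKEW and KURT functions.

```python
import math
import random
import statistics

def shape_stats(data):
    """Return (skewness, kurtosis) using the formulas documented for Excel's SKEW and KURT."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    z3 = sum(((x - mean) / s) ** 3 for x in data)
    z4 = sum(((x - mean) / s) ** 4 for x in data)
    skew = n / ((n - 1) * (n - 2)) * z3
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return skew, kurt

rng = random.Random(2008)  # arbitrary seed so the run is repeatable
population = [rng.gauss(100, 10) for _ in range(5000)]  # target mean 100, sd 10

mean = statistics.mean(population)
std_dev = statistics.stdev(population)
skew, kurt = shape_stats(population)

print(f"mean     = {mean:.2f}")
print(f"std dev  = {std_dev:.2f}")
print(f"skewness = {skew:.3f}")
print(f"kurtosis = {kurt:.3f}")
```

With 5,000 points, all four numbers should come out close to their targets of 100, 10, 0 and 0.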

 

Impact of Sample Size on Skewness and Kurtosis

The 5,000 point data set above was used to explore what happens to skewness and kurtosis as the sample size changes. For example, to determine the skewness and kurtosis for a sample size of 50, 50 results were randomly selected from the data set above and the two statistics were calculated. This was done for sample sizes from 5 to 5,000. The results are shown in the table below.

 

Sample Size    Skewness    Kurtosis
5               1.983       3.974
10             -0.078      -1.468
15             -0.384       0.127
25             -0.356      -0.025
50             -0.169      -0.752
75             -0.489       0.615
100            -0.346       0.671
250             0.089       0.061
500             0.186       0.232
750            -0.02        0.042
1000           -0.138       0.062
1250            0.085       0.079
1500           -0.017       0.001
2000           -0.059      -0.009
2500            0.037       0.096
3000            0.009       0.005
3500           -0.015       0.004
4000           -0.015      -0.009
4500            0.009       0.036
5000            0.007       0.030

 

Notice how different the results for small sample sizes are from the "true" skewness and kurtosis for the 5,000 results. For a sample size of 25, the skewness was -0.356 compared to the true value of 0.007, while the kurtosis was -0.025 compared to the true value of 0.030. Both signs are the opposite of the true values, which would lead to wrong conclusions about the shape of the distribution. There appears to be a lot of variation in the results based on sample size. The two charts below show how the skewness and kurtosis changed with sample size.

[Figure: skewness versus sample size]

 

[Figure: kurtosis versus sample size]

 

Thirty is a common sample size for process capability studies. A subgroup of 30 values was randomly selected from the data set, and this was repeated 100 times. The skewness varied from -1.327 to 1.275 while the kurtosis varied from -1.12 to 2.978. What kind of decisions can you make about the shape of the distribution when the skewness and kurtosis vary so much? Essentially, no decisions.
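
Both the sample size experiment and the repeated subgroups of 30 can be sketched in Python. This is a rough imitation rather than the macro from the workbook: the population is regenerated with an arbitrary seed, samples are drawn without replacement, and the Excel-style formulas for skewness and kurtosis are assumed, so the printed numbers will differ from those reported here.

```python
import math
import random

def shape_stats(data):
    """Return (skewness, kurtosis) using the formulas documented for Excel's SKEW and KURT."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    z3 = sum(((x - mean) / s) ** 3 for x in data)
    z4 = sum(((x - mean) / s) ** 4 for x in data)
    skew = n / ((n - 1) * (n - 2)) * z3
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return skew, kurt

rng = random.Random(2008)  # arbitrary seed so the run is repeatable
population = [rng.gauss(100, 10) for _ in range(5000)]

# One random sample per sample size, as in the table.
print("size   skewness  kurtosis")
for size in (5, 10, 25, 50, 100, 500, 1000, 5000):
    skew, kurt = shape_stats(rng.sample(population, size))
    print(f"{size:>4}  {skew:>9.3f} {kurt:>9.3f}")

# 100 random subgroups of 30, as in the process capability example.
results = [shape_stats(rng.sample(population, 30)) for _ in range(100)]
skews = [r[0] for r in results]
kurts = [r[1] for r in results]
print(f"skewness range: {min(skews):.3f} to {max(skews):.3f}")
print(f"kurtosis range: {min(kurts):.3f} to {max(kurts):.3f}")
```

As a rough guide, for normal data the standard error of the sample skewness is approximately sqrt(6/n), or about 0.45 for n = 30. A spread of roughly plus or minus 1.3 across 100 subgroups is therefore about what chance alone would produce.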

 

Conclusions

The skewness and kurtosis statistics appear to be very dependent on the sample size. The table above shows the variation. In fact, even several hundred data points didn't give very good estimates of the true kurtosis and skewness. Smaller sample sizes can give results that are very misleading. Dr. Wheeler said it correctly in his book mentioned above:

"In short, skewness and kurtosis are practically worthless. Shewhart made this observation in his first book. The statistics for skewness and kurtosis simply do not provide any useful information beyond that already given by the measures of location and dispersion."

Walter Shewhart was the "Father" of SPC. So, don't put much emphasis on skewness and kurtosis values you may see. And remember, the more data you have, the better you can describe the shape of the distribution.

 

To download the workbook containing the macro and results that generated the above tables, please click here.

 

Attachment: Data for Kurtosis and Skewness.xls (718.5 KB)
