June 2011

You have a set of data and would like to know whether it fits a certain distribution, for example, the normal distribution. Perhaps there are a number of statistical tests you want to apply to the data, but those tests assume your data are normally distributed. How can you determine if the data are normally distributed? You can construct a histogram and see if it looks like a normal distribution. You could also make a normal probability plot and see if the data fall along a straight line.  We have past newsletters on histograms and making a normal probability plot.  There is an additional test you can apply. It is called the Anderson-Darling test and is the subject of this month's newsletter.

We have included an Excel workbook that you can download to perform the Anderson-Darling test for up to 200 data points.  It includes a normal probability plot.  We have also included a link to a VBA function macro that you can use to calculate the Anderson-Darling statistic and associated p value.

You can download the workbook containing the data at this link.

The Anderson-Darling Test Hypotheses

The Anderson-Darling Test was developed in 1952 by Theodore Anderson and Donald Darling. It is a statistical test of whether or not a dataset comes from a certain probability distribution, e.g., the normal distribution. The test involves calculating the Anderson-Darling statistic.  You can use the Anderson-Darling statistic to compare how well a data set fits different distributions.

The two hypotheses for the Anderson-Darling test for the normal distribution are given below:

H0: The data follow the normal distribution

H1: The data do not follow the normal distribution

The null hypothesis is that the data are normally distributed; the alternative hypothesis is that the data are non-normal.

In many cases (but not all), you can determine a p value for the Anderson-Darling statistic and use that value to help you determine whether the test is significant or not. Remember that the p ("probability") value is the probability of getting a result at least as extreme as the one observed if the null hypothesis is true. If the p value is low (e.g., <= 0.05), you conclude that the data do not follow the normal distribution. Remember that you choose the significance level, even though many people just use 0.05 the vast majority of the time. We will look at two different data sets and apply the Anderson-Darling test to both sets.

Two Data Sets

The two data sets come from another website that has statistical data you can use for examples. Click here for the link.  The first data set comes from Mater Mother's Hospital in Brisbane, Australia. It contains the birth weight, gender, and time of birth of 44 babies born in the 24-hour period of 18 December 1997. The data were explained using four different distributions. We will focus on the normal distribution, which was applied to the birth weights. The birth weights are shown in the table below.

Table of Birth Weights (Grams) 

3837    3480
3334    3116
3554    3428
3838    3783
3625    3345
2208    3034
1745    2184
2846    3300
3166    2383
3520    3428
3380    4162
3294    3630
2576    3406
3208    3402
3521    3500
3746    3736
3523    3370
2902    2121
2635    3150
3920    3866
3690    3542
3430    3278

 

The second set of data involves measuring the lengths of forearms in adult males. The 140 data values are in inches. The data are given in the table below.

Table of Forearm Lengths

17.3    20.9    18.7    17.9    18.3
19.0    18.1    18.8    19.1    17.9
18.2    19.4    19.4    17.3    18.3
19.0    20.5    18.5    19.4    19.6
19.0    20.4    18.6    18.3    19.6
20.4    16.1    19.6    19.3    21.0
18.3    18.7    18.5    17.2    18.0
19.9    18.8    20.0    17.5    17.9
18.7    17.3    17.8    19.6    18.1
20.9    18.1    19.8    17.6    19.5
17.7    19.9    16.6    20.0    17.1
19.1    19.6    19.4    19.9    18.9
19.7    18.4    19.3    16.9    18.5
18.1    19.5    20.1    19.5    19.2
18.4    16.8    20.5    20.4    20.5
17.5    17.1    20.0    19.1    18.3
18.9    18.9    20.8    18.5    19.4
19.0    19.7    17.7    18.3    21.4
20.5    19.7    19.9    19.8    19.0
17.3    19.2    18.8    19.1    18.6
18.3    20.6    16.4    17.5    19.5
18.4    20.1    18.5    18.5    17.4
18.6    18.8    19.0    19.3    18.5
19.8    17.1    20.6    19.1    18.4
20.2    18.6    19.2    17.4    18.3
18.5    18.0    17.1    16.3    20.7
18.5    18.7    16.3    18.2    19.3
18.0    20.3    17.2    18.8    17.7

 

The Anderson-Darling Test

The Anderson-Darling Test will determine if a data set comes from a specified distribution, in our case, the normal distribution. The test makes use of the cumulative distribution function. The Anderson-Darling statistic is given by the following formula:

$$A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)\left[\ln F(X_i) + \ln\left(1 - F(X_{n-i+1})\right)\right]$$

where n = sample size, F(X) = cumulative distribution function for the specified distribution and i = the ith sample when the data are sorted in ascending order.  You will often see this statistic called A².
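As a rough illustration of the formula, here is a minimal Python sketch that computes the statistic for any fully specified cumulative distribution function. The function name `anderson_darling` and the small uniform example are our own choices for illustration; they are not part of the workbook.

```python
import math
from typing import Callable, Sequence


def anderson_darling(data: Sequence[float], cdf: Callable[[float], float]) -> float:
    """Return the Anderson-Darling statistic for data against a given CDF."""
    x = sorted(data)                     # the formula assumes ascending order
    n = len(x)
    total = 0.0
    for i in range(1, n + 1):
        f_i = cdf(x[i - 1])              # F(X_i)
        f_mirror = cdf(x[n - i])         # F(X_{n-i+1}), counted from the top
        total += (2 * i - 1) * (math.log(f_i) + math.log(1.0 - f_mirror))
    return -n - total / n


# Illustrative check against the uniform(0, 1) distribution, whose CDF is F(x) = x.
a2 = anderson_darling([0.75, 0.25, 0.5], lambda v: v)
print(round(a2, 4))
```

The CDF is passed in as a function, so the same sketch can be reused to compare how well one data set fits different distributions.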

To demonstrate the calculation using Microsoft Excel and to introduce the workbook, we will use the first five results from the baby weight data. Those five weights are 3837, 3334, 3554, 3838, and 3625 grams. You definitely want to have more data points than this to determine if your data are normally distributed. We will walk through the steps here.  You can download the Excel workbook which will do this for you automatically here: download workbook. Of course, the Anderson-Darling test is included in the SPC for Excel software.

The data are placed in column E in the workbook. After entering the data, the workbook determines the average, standard deviation and number of data points present. The workbook can handle up to 200 data points.

 

Workbook output

 

The next step is to number the data from 1 to n as shown below.

Workbook output

The formula in Cell F2 is "=IF(ISBLANK(E2),"",1)". The formula in cell F3 is "=IF(ISBLANK(E3),"",F2+1)". The formula in cell F3 is copied down the column.

To calculate the Anderson-Darling statistic, you need to sort the data in ascending order. This is done in column G using the Excel function SMALL(array, k). This function returns the kth smallest number in the array. The sorted data are placed in column G.

Workbook output

The formula in cell G2 is "=IF(ISBLANK(E2), NA(),SMALL(E$2:E$201,F2))". This formula is copied down the column.  The NA() is used so that Excel will not plot points with no data.

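Now we are ready to calculate F(Xi). Remember, this is the cumulative distribution function. In Excel, you can determine this using either the NORMDIST or NORMSDIST functions. They give the same result as long as NORMSDIST is applied to the standardized value, (Xi - average)/standard deviation, since NORMSDIST assumes an average of 0 and a standard deviation of 1. We will use the NORMDIST function. The workbook places these results in column H.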

Workbook output

The formula in cell H2 is "=IF(ISBLANK(E2),"",NORMDIST(G2, $B$3, $B$4, TRUE))". This formula is copied down column H. The average is in cell B3; the standard deviation in cell B4. Using "TRUE" returns the cumulative distribution function.
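For readers who want to check column H outside of Excel, the following Python sketch does the same calculation for the five baby weights using the standard library's `statistics.NormalDist`. It assumes the workbook's standard deviation is the sample (n-1) standard deviation; the variable names are ours.

```python
from statistics import NormalDist, mean, stdev

weights = [3837, 3334, 3554, 3838, 3625]   # first five birth weights (grams)

avg = mean(weights)
sd = stdev(weights)                         # sample (n-1) standard deviation, an assumption
fitted = NormalDist(mu=avg, sigma=sd)       # normal distribution fitted to the data

sorted_weights = sorted(weights)            # plays the role of column G
f_values = [fitted.cdf(x) for x in sorted_weights]   # plays the role of column H

for x, f in zip(sorted_weights, f_values):
    print(x, round(f, 4))
```

`fitted.cdf(x)` plays the role of NORMDIST(x, average, standard deviation, TRUE).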

Take a look again at the Anderson-Darling statistic equation:

$$A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)\left[\ln F(X_i) + \ln\left(1 - F(X_{n-i+1})\right)\right]$$

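We have F(X_i). The equation shows we also need 1 - F(X_{n-i+1}), that is, one minus the cumulative probability for the data point counted down from the largest value. It takes two steps to get this in the workbook. First, the value of 1 - F(X_i) is calculated in column I. Then those results are sorted in ascending order in column J. Because F(X_i) increases with i, sorting 1 - F(X_i) from smallest to largest reverses the order, so the ith entry in column J is 1 - F(X_{n-i+1}). The results are shown below.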

 

workbook output

 

The formula in cell I2 is "=IF(ISBLANK(E2), "", 1-H2)" and the formula in cell J2 is "=IF(ISBLANK(E2),"",SMALL(I$2:I$201,F2))". These are copied down those two columns.
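The two-step approach in columns I and J can be sketched in Python as follows. The F values below are made-up illustrative numbers, not the workbook's output. The point is only that sorting 1 - F in ascending order gives the same list as reading it from the bottom up.

```python
# Illustrative (hypothetical) F(X_i) values, already in ascending order.
f_values = [0.08, 0.35, 0.48, 0.82, 0.83]

column_i = [1.0 - f for f in f_values]   # 1 - F(X_i), as in column I
column_j = sorted(column_i)              # ascending sort, as in column J

# Reading column I in reverse gives 1 - F(X_{n-i+1}) for i = 1..n.
mirrored = column_i[::-1]

print(column_j == mirrored)
```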

We are now ready to calculate the summation portion of the equation. So, define the following for the summation term in the Anderson-Darling equation:

$$S_i = (2i-1)\left[\ln F(X_i) + \ln\left(1 - F(X_{n-i+1})\right)\right]$$

This result is placed in column K in the workbook.

workbook output

The formula in cell K2 is "=IF(ISBLANK(E2),"",(2*F2-1)*(LN(H2)+LN(J2)))". This formula is copied down the column.

We are now ready to calculate the Anderson-Darling statistic. This is given by:

$$AD = -n - \frac{1}{n}\sum_{i=1}^{n} S_i$$

where S_i is the summation term in column K for the ith sorted data point.

The value of AD needs to be adjusted for small sample sizes. The adjusted AD value is given by:

$$AD^* = AD\left(1 + \frac{0.75}{n} + \frac{2.25}{n^2}\right)$$

For these 5 data points, AD* = 0.357. The workbook has the following output in columns A and B:
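The whole calculation for the five baby weights can be sketched in a few lines of Python. This is a minimal sketch rather than the workbook's actual implementation. It assumes the sample (n-1) standard deviation and the small-sample adjustment AD* = AD(1 + 0.75/n + 2.25/n^2), which is consistent with the AD and AD* values reported in this newsletter. The function name `ad_normal` is ours.

```python
import math
from statistics import NormalDist, mean, stdev


def ad_normal(data):
    """Return (AD, AD*) for a normality test with estimated average and stdev."""
    x = sorted(data)
    n = len(x)
    fitted = NormalDist(mean(x), stdev(x))      # sample stdev is an assumption
    f = [fitted.cdf(v) for v in x]              # F(X_i) for the sorted data
    s = sum(
        (2 * i - 1) * (math.log(f[i - 1]) + math.log(1.0 - f[n - i]))
        for i in range(1, n + 1)
    )
    ad = -n - s / n
    ad_star = ad * (1 + 0.75 / n + 2.25 / n ** 2)   # small-sample adjustment
    return ad, ad_star


ad, ad_star = ad_normal([3837, 3334, 3554, 3838, 3625])
print(round(ad, 3), round(ad_star, 3))
```

The adjusted value printed for these five weights should land close to the 0.357 quoted above.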

Workbook Output

The last entry is the p value.  That depends on the value of AD*.

The p Value for the Adjusted Anderson-Darling Statistic

The calculation of the p value is not straightforward. The reference most people use is R.B. D'Agostino and M.A. Stephens, Eds., 1986, Goodness-of-Fit Techniques, Marcel Dekker. There are different equations depending on the value of AD*. These are given by:

  • If AD* >= 0.6, then p = exp(1.2937 - 5.709(AD*) + 0.0186(AD*)^2)
  • If 0.34 < AD* < 0.6, then p = exp(0.9177 - 4.279(AD*) - 1.38(AD*)^2)
  • If 0.2 < AD* <= 0.34, then p = 1 - exp(-8.318 + 42.796(AD*) - 59.938(AD*)^2)
  • If AD* <= 0.2, then p = 1 - exp(-13.436 + 101.14(AD*) - 223.73(AD*)^2)

The workbook (and the SPC for Excel software) uses these equations to determine the p value for the Anderson-Darling statistic.
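As a sketch of how these equations translate into code, the Python function below picks the equation based on AD*. The function name `ad_p_value` is ours. A value of exactly 0.34 is assigned to the third equation here, which is an assumption, since references differ on how the boundaries are stated.

```python
import math


def ad_p_value(ad_star: float) -> float:
    """Approximate p value for the adjusted Anderson-Darling statistic (normal case)."""
    a = ad_star
    if a >= 0.6:
        return math.exp(1.2937 - 5.709 * a + 0.0186 * a ** 2)
    if a > 0.34:
        return math.exp(0.9177 - 4.279 * a - 1.38 * a ** 2)
    if a > 0.2:   # includes a == 0.34 (assumption about the boundary)
        return 1.0 - math.exp(-8.318 + 42.796 * a - 59.938 * a ** 2)
    return 1.0 - math.exp(-13.436 + 101.14 * a - 223.73 * a ** 2)


# AD* values reported in this newsletter for the two data sets.
p_baby = ad_p_value(1.748)
p_forearm = ad_p_value(0.238)
print(p_baby, p_forearm)
```

Because the reported AD* values are rounded to three decimals, the p values from this sketch agree with the workbook's output only to about two or three significant figures.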

Applying the Anderson-Darling Test

Now let's apply the test to the two sets of data, starting with the baby weights. The question we are asking is: are the baby weight data normally distributed? The results for that set of data are given below.

AD = 1.717
AD* =  1.748
p Value = 0.000179

The p value is less than 0.05. Since the p value is low, we reject the null hypothesis that the data are from a normal distribution. You can construct a normal probability plot of the data. How to do this is explained in our June 2009 newsletter.   The normal probability plot is included in the workbook. If the data come from a normal distribution, the points should fall in a fairly straight line. You can see that this is not the case for these data, which confirms that the data do not come from a normal distribution.

baby weight normal probability plot

Now consider the forearm length data.  Again, we are asking the question: are the data normally distributed?  The results for the forearm lengths are given below.

AD = 0.237
AD* =  0.238
p Value =  0.782045

Since the p value is large, we fail to reject the null hypothesis that the data are from a normal distribution; the data are consistent with a normal distribution. The normal probability plot shown below supports this.

normal probability plot for forearm length

The workbook contains all you need to do the Anderson-Darling test and to see the normal probability plot.  If you prefer to use VBA code, this link gives you the VBA code for an Anderson-Darling function.   You can copy this code into a workbook and use it to calculate AD, AD* and the p value for the data.

Summary

The Anderson-Darling test is used to determine if a data set follows a specified distribution.  In this newsletter, we applied this test to the normal distribution.  The test involves calculating the Anderson-Darling statistic and then determining the p value for the statistic.  It is often used with the normal probability plot.

Thanks so much for reading our publication. We hope you find it informative and useful. Happy charting and may the data always support your position.

Sincerely,

Dr. Bill McNeese
BPI Consulting, LLC

Comments (16)

  • anon

    awesome article !

    Jun 30, 2011
  • anon

    Very Illustrative, Easy to adopt and enables any to tackle similar issues irrespective of age, education & position

    Jul 02, 2011
  • anon

    Thanks for the info.

    You will never know how much you helped!

    May 29, 2012
  • anon

    Well explained topic, thanks

    Sep 08, 2012
  • anon

    Very well explained in places, slightly ambiguous in others. Shame about the grammar used throughout the piece!

    Apr 14, 2015
  • anon

    And what is wrong with the grammar?  Ready fine to me!

    Apr 15, 2015
  • anon

    How Anderson-Darling test is different from Shapiro Wilk test for normality?  

    Aug 03, 2015
  • anon

    I have seen varying data on which approach is better - have seen where Shapiro-Wilk has more power.  But, I have not looked too much into the Shapiro-Wilk test.

    Aug 03, 2015
  • anon

    Hi. This is really usefull thank you. However is there any way to increase the amount of data that can be analysed in this workbook? I've got 750 samples. I did change the maximum values in the formulas to include a bigger data sample but wasn’t sure if the formulas would be compromised.e.g  E$701 =IF(ISBLANK(E2), NA(),SMALL(E$2:E$1000,F2))

    Sep 23, 2015
  • anon

    You can use the workbook with larger sample sizes.  You just need to be sure that it is changed in all formulas, including Avg, stdev, n, S and the ones containing SMALL. 

    Sep 24, 2015
  • anon

    Hi, Thanks for the info. I'm reproducing the steps in Excel but I don't want to compare with a Normal distribution, I have my own set of data and I want to check it with my own distribution. In this case how do generate F(Xi) using 10,000 data points I have for the distribution? 

    Jan 14, 2016
  • anon

    I am not sure I understand what you want to do.  Maybe this:

    • Sort your data in a column (say column A) from smallest to largest.
      In Column B, put the numbers from 1 to 10,000
      In cell C1, enter = B1/10000
      Copy that from C2 to C10000
      Plot A vs C to generate the CDF

    Jan 15, 2016
  • anon

    Is it possible to explain the correction in the calculation of the Z-value (see column L of sheet 2 in the embedded excel-sheet). The P value is not calculated as i/n. But corrected and is now calculated as (i-0,3)/(n+0.4) Is it possible to give some substantiation of the used 0.3 and 0.4.

    Mar 29, 2016
  • anon

    The method used is median rank method for uncensored data. This gives p = (i-0.3)/(n+.4).   There are other methods that could be used.  For example,  you could use (i-0.5)/n; or i/(n+1) or simply i/n. 

    Mar 29, 2016
  • anon

    Thanks

    Mar 30, 2016
  • anon

    This is really very informative article.I come to know about this useful test.thanks  

    May 18, 2016
