Process Capability and Non-Normal Data


July 2014

This month’s publication takes a look at process capability calculations and the impact non-normal data has on the results. The most common method of expressing process capability involves calculating a Cpk value, e.g., a process has a Cpk = 1.54. In our April publication, we explained why a Cpk value by itself is not sufficient for defining process capability, and that holds even when your data are somewhat normally distributed. If your data are not normally distributed, then forget it. Cpk, applied to the raw data, is pretty much worthless as a measure of process capability.

Remember, not all data are normally distributed. There are many naturally occurring distributions. For example, the exponential distribution is often used to describe the time it takes to answer a telephone inquiry, how long a customer has to wait in line to be served, or the time to failure for a component with a constant failure rate. These types of data have many short time periods with occasional long time periods. These data are not described by a normal distribution.

So, how can you handle these types of data when it comes to process capability? This publication examines how this is done using the exponential distribution as an example. In this issue:

Exponential Example Data
The Wrong Approach: Cpk
The Correct Approach: Non-Normal Ppk
Previous Process Capability Publications
Summary
Quick Links

Exponential Example Data

We will use the same data set that we used last month to take a look at the impact of non-normal data on control charts. Our data set consists of 100 random numbers that were generated for an exponential distribution with a scale = 1.5. The scale is what determines the shape of the exponential distribution. Suppose these data describe how long it takes for a customer to be greeted by a salesperson in a store. Usually a customer is greeted very quickly. Sometimes it is crowded in the store and it takes longer. The data are shown in Table 1.

Table 1: Exponential Data

2.26	2.68	4.17	0.03	2.02
7.77	0.13	4.05	0.04	5.28
3.67	1.37	5.12	0.21	0.03
0.91	0.65	2.24	2.67	0.75
0.11	0.13	0.53	0.6	0.43
0.36	0.14	0.29	2.95	1.53
0.06	1.35	2.09	0.54	2.22
0.31	1.46	2.82	3.54	0.19
0.91	0.01	1.24	3.43	0.75
1.01	0.18	1.03	2.65	2.99
3.21	1.98	0.5	1.7	0.3
0.24	0.82	2.02	0.16	2.41
3.84	1.77	0.86	0.16	2.07
2.28	2.49	0.51	4.06	1.31
1.75	0.53	2.17	2.04	1
1.45	0.4	0.11	3.56	2.15
1.81	1.67	0.8	6.1	1.3
0.3	1.02	3.63	0.77	5.25
0.63	0.81	0.6	0.87	2.44
2.22	0.15	0.13	4.74	0.76
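For readers who want a comparable data set of their own, the sketch below draws 100 values from an exponential distribution with scale = 1.5 using only the Python standard library. The seed is a hypothetical choice made so the sketch is repeatable; it will not reproduce the values in Table 1, and the article does not say which tool generated them.

```python
import random

SCALE = 1.5   # scale (mean) of the exponential distribution, from the article
N = 100       # number of observations, from the article
SEED = 2014   # hypothetical seed, chosen only so the sketch is repeatable

rng = random.Random(SEED)

# random.expovariate expects the rate (lambda), which is 1 / scale
sample = [round(rng.expovariate(1.0 / SCALE), 2) for _ in range(N)]

print(len(sample), min(sample), max(sample))
```

Most of the generated values are small with an occasional large one, the same pattern described for the waiting times above.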

Management has set the goal that every customer must be greeted by a salesperson within six minutes of when they enter the store. This is the upper specification limit (USL) for our process. There is no lower specification limit (LSL). You have collected the data in Table 1 and now want to determine if the process is capable of meeting the specification set by management.

To perform a Cpk calculation, two things need to be true. First, the process must be in statistical control. Second, the data must be somewhat normally distributed. Last month’s publication examined these data as a control chart. The process is in statistical control. So, the first assumption is satisfied.

Are these data normally distributed? Last month’s publication also showed the histogram and normal probability plot for these data. These two techniques demonstrated quite clearly that the data are not normally distributed. So, the second assumption of normality is not satisfied. But suppose you don’t take the time to find that out – that you just merely ignore the assumptions used for determining process capability and move forward with calculating Cpk based on the data in Table 1.

The Wrong Approach: Cpk

One thing is for sure – calculating Cpk from non-normal data is just plain wrong. Yet, it is amazing how often people do that. Our USL is 6 minutes. We have our software program or Excel template on process capability. We plug in our numbers and get our Cpk. It equals 1.1. It is greater than 1 so essentially nothing is out of specification! We are happy! (Don’t look at the data in Table 1. A couple points are above the USL. Just ignore that little fact.)

Again, a number by itself means nothing. You need to look at the relationship between the histogram, the specifications and your assumed distribution. This can be done by examining the process capability chart shown in Figure 1.

Figure 1: Process Capability Chart – Normal Distribution

This chart shows the histogram of the raw data, the normal distribution based on the average and standard deviation of the raw data, and the specification limits (only a USL in this case). What do you notice about this chart? A few things stand out.

The normal distribution does not appear to fit the histogram, so a basic assumption of normality for calculating the Cpk value is not valid.
The normal distribution has values below zero – clearly not the case for waiting time. No matter how quickly you greet customers, the time will not be a negative number.
The tail of the normal distribution approaches zero at a time a little less than 6, implying there are no values above that.
There are some points (2 to be precise) above the USL.
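The count of points above the USL can be checked directly from the raw data. The sketch below is one way to do that in Python; the list is Table 1 read row by row, and the variable names are illustrative only.

```python
# Waiting times from Table 1 (minutes), read row by row
waiting_times = [
    2.26, 2.68, 4.17, 0.03, 2.02,
    7.77, 0.13, 4.05, 0.04, 5.28,
    3.67, 1.37, 5.12, 0.21, 0.03,
    0.91, 0.65, 2.24, 2.67, 0.75,
    0.11, 0.13, 0.53, 0.60, 0.43,
    0.36, 0.14, 0.29, 2.95, 1.53,
    0.06, 1.35, 2.09, 0.54, 2.22,
    0.31, 1.46, 2.82, 3.54, 0.19,
    0.91, 0.01, 1.24, 3.43, 0.75,
    1.01, 0.18, 1.03, 2.65, 2.99,
    3.21, 1.98, 0.50, 1.70, 0.30,
    0.24, 0.82, 2.02, 0.16, 2.41,
    3.84, 1.77, 0.86, 0.16, 2.07,
    2.28, 2.49, 0.51, 4.06, 1.31,
    1.75, 0.53, 2.17, 2.04, 1.00,
    1.45, 0.40, 0.11, 3.56, 2.15,
    1.81, 1.67, 0.80, 6.10, 1.30,
    0.30, 1.02, 3.63, 0.77, 5.25,
    0.63, 0.81, 0.60, 0.87, 2.44,
    2.22, 0.15, 0.13, 4.74, 0.76,
]

USL = 6.0  # upper specification limit in minutes, from the article

# Observations that exceed the upper specification limit
above_usl = [t for t in waiting_times if t > USL]
observed_fraction = len(above_usl) / len(waiting_times)

print(f"n = {len(waiting_times)}")
print(f"points above USL: {above_usl}")
print(f"observed fraction above USL: {observed_fraction:.2%}")
```

The two values above the USL are 7.77 and 6.1, so the observed fraction out of specification is 2 out of 100.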

Some of the calculated statistics from this analysis are given below. These calculations assume that the data are normally distributed.

Within Capabilities:
- Cp=N/A
- Cpk=1.1
- Cpu=1.1 (0.05%)
- Cpl=N/A (0%)
- Est. Sigma (σ) =1.316
- PPM>USL=485.35
- PPM<LSL=0
- Total PPM=485.35
Overall Capabilities:
- Pp=N/A
- Ppk=0.94
- Ppu=0.94 (0.24%)
- Ppl=N/A (0%)
- Sigma (s) = 1.545
- PPM>USL=2466.74
- PPM<LSL=0
- Total PPM=2466.74
- Average (X)= 1.658

The within capabilities (Cpk calculations) are based on the estimated standard deviation from the average range (σ), while the overall capabilities are based on the calculated standard deviation (s). There are no values for Cp, Cpl, etc. since there is no LSL.

The equations for Cp, Cpk, Cpu, and Cpl are given below. The equations for Pp, Ppk, Ppu, and Ppl are identical except that the estimated standard deviation, σ, is replaced by the calculated standard deviation, s.

Cp=(USL-LSL)/(6σ)

Cpu=(USL-X)/(3σ)

Cpl=(X-LSL)/(3σ)

Cpk is the minimum of Cpu and Cpl. The calculations don’t care if the data are normally distributed or not. Plug in the numbers and you get results for each of the values. But these calculations don’t mean much if the data are not normally distributed. For example, consider the value of Cpu above. Cpu = 1.1 and the theoretical % above the USL is 0.05%. But there are 2 points out of 100 that are above the USL. This is 2%, considerably different than 0.05%. This is a result of the assumptions (a normal distribution in this case) not being valid.
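The normal-based figures can be reproduced approximately with a few lines of Python. This is a minimal sketch that assumes the rounded statistics reported above (average 1.658, estimated sigma 1.316, calculated s 1.545); because those inputs are rounded, the PPM values come out close to, but not exactly equal to, the reported ones. The function name is illustrative.

```python
from statistics import NormalDist

USL = 6.0             # upper specification limit, minutes
average = 1.658       # reported average (X)
sigma_within = 1.316  # reported estimated sigma (from the average range)
s_overall = 1.545     # reported calculated standard deviation (s)


def upper_capability(usl, mean, sd):
    """Return (index, fraction above USL) assuming a normal distribution."""
    index = (usl - mean) / (3 * sd)           # Cpu or Ppu
    z = (usl - mean) / sd                     # distance to the USL in sigmas
    fraction_above = 1 - NormalDist().cdf(z)  # normal tail area beyond the USL
    return index, fraction_above


cpu, cpu_fraction = upper_capability(USL, average, sigma_within)
ppu, ppu_fraction = upper_capability(USL, average, s_overall)

print(f"Cpu = {cpu:.2f}, PPM > USL = {cpu_fraction * 1e6:.0f}")
print(f"Ppu = {ppu:.2f}, PPM > USL = {ppu_fraction * 1e6:.0f}")
```

Either way, the normal model predicts far fewer results above the USL than the 2% actually observed.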

So, how do you handle non-normal data and process capability?

The Correct Approach: Non-Normal Ppk

If you have non-normal data, you have two options. First, you can transform the data (using something like the Box-Cox or Johnson transformations), if possible, so that the transformed data follow a normal distribution. Then perform the Cpk calculations on the transformed data. The problem with this approach is that the original data format is lost by the transformation. A waiting time of 4.5 is no longer 4.5, but a transformed number. The transformed data no longer reflect the times to greet a customer. This confuses things for someone else looking at the results.
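To make the loss of the original units concrete, here is a small sketch of the Box-Cox transformation applied to a single waiting time and to the USL. The lambda value of 0.25 is a hypothetical choice for illustration only; in practice it would be estimated from the data by the software.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of a positive value x for a given lambda."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam


def inverse_box_cox(y, lam):
    """Map a transformed value back to the original units."""
    if lam == 0:
        return math.exp(y)
    return (lam * y + 1) ** (1 / lam)


LAMBDA = 0.25        # hypothetical lambda, not estimated from Table 1
waiting_time = 4.5   # minutes, in original units
USL = 6.0            # minutes, in original units

transformed_time = box_cox(waiting_time, LAMBDA)
transformed_usl = box_cox(USL, LAMBDA)  # the spec limit must be transformed too

print(f"4.5 minutes becomes {transformed_time:.3f} on the transformed scale")
print(f"USL of 6 minutes becomes {transformed_usl:.3f}")
print(f"back-transformed: {inverse_box_cox(transformed_time, LAMBDA):.2f} minutes")
```

Any capability chart built on the transformed values has to be read on this new scale, which is why results from this route are harder to explain to others.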

Second, you can select a distribution that fits your situation, both theoretically and based on the raw data. This is the best approach because it maintains the data format and helps you with your understanding of your process. This is what we will cover below. With this approach, Cpk disappears from the picture. It is not calculated at all since the equations that produce Cpk depend on the assumption that you have a normal distribution.

The first step is to determine what distribution your data follows. This is an important step. You can easily put your data into a software package that will test many different distributions to find out which distribution fits your data best. But you should have a reason for using a certain distribution – it must make sense in terms of your process. For example, with the waiting time data in Table 1, it makes sense that it follows an exponential distribution. Will a three-parameter gamma distribution fit your data better? Maybe based on the numbers, but not on the process.

Once you know your distribution, you can generate a process capability chart based on the non-normal data. This chart is shown in Figure 2.

Figure 2: Process Capability Chart – Exponential Distribution

This process capability chart has the exponential distribution (with scale = 1.5) superimposed on the histogram. It is easy to see that the distribution fits the histogram well. Also note that the exponential distribution has a very long tail to the right, approaching zero at a waiting time above 10, which is completely different from the normal distribution in Figure 1.

The calculated statistics with this chart are shown below.

Overall Capabilities:
- Pp=N/A
- Ppk=0.56
- Ppu=0.56 (1.83%)
- Ppl=N/A
- PPM>USL=18315.64
- PPM<LSL=N/A
- Total PPM=18315.64

Note that the value for Ppk (0.56) is considerably less than the value from the normal-distribution calculation, which gave Ppk = 0.94. In addition, note that the estimated % above the USL is 1.83% for the non-normal process capability chart, much more in line with the 2% actually observed. This is the preferred approach for non-normal process capability calculations. You still interpret the numbers as before. Values of Ppk greater than 1.0 are desirable, and the higher the better.

The formulas for Ppk look different than for the normal distribution. The values depend on the type of the distribution you are using. The formulas are given below.

Pp=(USL-LSL)/(X_.99865-X_.00135)

where USL = upper specification limit, LSL = lower specification limit, X_.99865 = 99.865th percentile of the exponential distribution (or your specified distribution), and X_.00135 = 0.135th percentile of that same distribution.

Ppl=(X_.5-LSL)/(X_.5-X_.00135)

where X_.5 = 50th percentile of the exponential distribution.

Ppu=(USL-X_.5)/(X_.99865-X_.5)

Ppk is the minimum of Ppu and Ppl. The 0.135th and 99.865th percentiles are used because, for a normal distribution, they correspond to the average minus three sigma and the average plus three sigma. The 50th percentile (the median) takes the place of the average.
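As a check on the formulas, the sketch below evaluates Ppu and the fraction above the USL for an exponential distribution with scale = 1.5 and USL = 6. It assumes the standard exponential percentile function, x_p = -scale × ln(1 - p), and tail area, exp(-x/scale). The function and variable names are illustrative and not taken from any particular software package.

```python
import math

SCALE = 1.5  # scale of the exponential distribution, from the article
USL = 6.0    # upper specification limit, minutes


def exponential_percentile(p, scale):
    """Value below which a fraction p of an exponential distribution falls."""
    return -scale * math.log(1 - p)


x_50 = exponential_percentile(0.5, SCALE)         # median
x_99865 = exponential_percentile(0.99865, SCALE)  # 99.865th percentile

# Non-normal Ppu: distance from the median to the USL, relative to the
# distance from the median to the 99.865th percentile
ppu = (USL - x_50) / (x_99865 - x_50)

# Fraction above the USL is 1 - CDF(USL) = exp(-USL / scale)
fraction_above = math.exp(-USL / SCALE)

print(f"X_.5 = {x_50:.3f}, X_.99865 = {x_99865:.3f}")
print(f"Ppu = {ppu:.2f}")
print(f"% above USL = {fraction_above:.2%}, PPM = {fraction_above * 1e6:.2f}")
```

With no LSL, Ppk equals Ppu, so this agrees with the Ppk of 0.56 and the PPM of 18315.64 reported with Figure 2.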

Previous Process Capability Publications

If you are new to process capability, please see some of our previous publications on process capability. All these publications assume that the data follow a normal distribution.

Process Capability – Part 1 (October 2004)
Process Capability – Part 2 (November 2004)
Process Capability – Part 3 (December 2004)
Cpk Improvement Methodology (February 2012)
An Interactive Look at Process Capability (March 2014)
Cpk Alone is Not Sufficient (April 2014)
Cpk vs Ppk: Who Wins (May 2014)

You can access these publications here.

Summary

This publication examines how non-normal data impacts process capability calculations and results. With non-normal data, it is wrong to calculate a Cpk based on the raw data. A better approach is to determine what distribution best fits your process and data and then use the non-normal Ppk approach. The equations for Ppk are different for non-normal data than for normally distributed data.

Quick Links

Thanks so much for reading our SPC Knowledge Base. We hope you find it informative and useful. Happy charting and may the data always support your position.

Sincerely,

Dr. Bill McNeese
BPI Consulting, LLC


30 Comments


Anonymous

11 years ago

How did you arrive at the theoretical % above the USL? I am referring to the statement above: "For example consider the value of Cpu above. Cpu = 1.1 and the theoretical % above the USL is 0.05%." How did you arrive at 0.05%? Also, for Ppu=0.56 (1.83%), how did you arrive at 1.83%?

Bill

Admin

Reply to Anonymous

11 years ago

For the normal distribution (Cpu), the % out of spec is estimated by calculating a z value. z is (USL – Average)/Estimated Sigma. For the data above z = 3.299392. Then use 1 – Normsdist(z) in Excel to find the fraction of results beyond that value of z (or above the USL in this case). That gives about 0.0005, which translates to 0.05%.
For the non-normal, you use the cumulative distribution function for the specified distribution to find the % above the USL. In this example, that was the exponential distribution. The CDF for the exponential distribution is 1 – Exp(–X/Scale) where X = USL in this case and Scale = 1.5. This equates to 0.981684. Subtracting from 1 gives 0.0183.

rol

6 years ago

A very useful article! Very clear! thanks

Sigal

6 years ago

Thank you for the article. What can be done to calculate Ppk if no distribution is found that fits the data?

Bill

Admin

Reply to Sigal

6 years ago

In this case, I would simply use the histogram and compare it to the specs for a long period of time. And then just note the PPM out of spec. No calculation really for Ppk can be done.

Jimmy Hood

5 years ago

I have a Good CP and CPK yet still shows law is not normal, Distribution is Good,Process OK and Setting Ok and yet still telling me law is not normal.

Bill

Admin

Reply to Jimmy Hood

5 years ago

I am not sure I understand what you mean by "law". Please send me the data to look at. bill@www.spcforexcel.com

Ramachandiran

5 years ago

If my Cpk is <1 but the data are normally distributed for one of the processes, can we start monitoring with a control chart such as X-bar R or I-MR, or do we need to get the process to a Cpk > 1.33 before we can start SPC monitoring? Please confirm. Also, if SPC monitoring is in place, how do we know whether Cpk is improving or remains the same? What is your suggestion?

Bill

Admin

Reply to Ramachandiran

5 years ago

We discussed this on your post on LinkedIn. Let me know if you have more questions.

Chennakesava Reddy

5 years ago

Hi, good evening. As a rule of thumb, before calculating process capability (Cpk), data should be normally distributed. I have a case where the data points are not normally distributed (the p-value is less than 0.05 in the probability plot), but the process capability (Cpk) is very high, i.e., 50 (more than 1.33). In this case, I do not have a lower specification limit and the histogram falls on the left side. Can you interpret the above, please?

Bill

Admin

Reply to Chennakesava Reddy

5 years ago

Hello, please send me the data so I see what it looks like. bill@www.spcforexcel.com

Salah

5 years ago

Hello, I want to calculate the Ppk of a truncated Gaussian distribution. (The lower spec of the distribution is 0 and the upper spec is 0.8.) Do you know the formula to calculate the Ppk in this case? Thanks

Bill

Admin

Reply to Salah

5 years ago

Is 0 a natural boundary? If so, you would have no lower specification and would just calculate it based on the upper spec.

Salah

Reply to Bill

5 years ago

Sorry for the late response. Do you mean I can use the classical formula Ppk = (USL-Average)/(3*Soverall)? Thank you so much for your answer!

Bill

Admin

Reply to Salah

5 years ago

Yes I think so. Send me the data if you would like me to look at it. bill@www.spcforexcel.com

Anonymous

5 years ago

General query: In the above scenario, my assumption is that, as the shopkeeper suggested, 6 minutes is the maximum time the customer has to wait. I start collecting data and verify that my data follow an exponential distribution. Then I go to the normal capability analysis chart in Minitab, select exponential as the distribution, and check whether the Cpk value is fine for our estimation. What is the need for the non-normal capability analysis chart in Minitab as calculated above? I can do it in normal capability analysis by selecting exponential as the distribution in the drop-down list. Can you please clarify? Thanks in advance.

Bill

Admin

Reply to Anonymous

5 years ago

Not sure I understand you. In SPC for Excel, like Minitab, if you select the Exponential distribution, you are not doing a capability analysis with the normal distribution, but with the exponential distribution. So it is not a "normal" distribution process capability. You get different results than you would by assuming a normal distribution.

Gaurav

4 years ago

If my data have a largest extreme value distribution (a non-normal distribution) with a mean of 50, maximum value = 55, and minimum value = 45, and I calculate Ppk, it comes out to be 0.83 (even when my specification limits are 40 and 60). A Ppk of 0.83 means the process is not able to produce within the specification limits of 40 and 60. How is that possible if all my data points fall within [45, 55]?

Bill

Admin

Reply to Gaurav

4 years ago

Please send me the data and I will look at it (bill@www.spcforexcel.com). All the data can be in specs, but not the entire theoretical distribution.

Brendan

4 years ago

Hi, I have variable data that do not meet normality and fit a largest extreme value distribution. So, now to determine Ppk. I am using only a one-sided lower spec, which is a good bit below my data set. Minitab cannot give a Ppk, just an asterisk. I think it's because the lower spec is so far from the distribution. Anyway, what is the Ppk formula for the non-normal largest extreme value distribution? Any help appreciated.

Bill

Admin

Reply to Brendan

4 years ago

Is there a reason you believe the data should fit the largest extreme value distribution? How many data points do you have? Please send me the data and I will show you how the calculation is done. You simply need the pdf for the largest extreme value distribution so you can calculate the limits. But if your data are far from the lower spec, why does it matter? You are capable. Please send it to bill@www.spcforexcel.com

Leo Yang

4 years ago

How to find a non-normal distribution can be fitted with my data? Do we still need to have a p value >0.05 for the goodness of fit test?

Bill

Admin

Reply to Leo Yang

4 years ago

Please see this two-part series starting with this link:
https://www.spcforexcel.com/knowledge/basic-statistics/distribution-fitting
https://www.spcforexcel.com/knowledge/basic-statistics/deciding-which-distribution-fits-your-data-best

Leo Yang

Reply to Bill

4 years ago

Hi Bill, Thanks for your reply. I have used Minitab to analyse my data and assessed 15 distribution models (Weibull, Exponential, etc.). However, in the Goodness of Fit Test for the 15 distribution models, all of the p values are small, and Weibull achieved the highest p value, at <0.01. Do I have to get a p value >0.05 for the distribution fitting before calculating the Cpk or Ppk? Is SPC for Excel able to solve my issue? I can send my data to you if you do not mind.

Bill

Admin

Reply to Leo Yang

4 years ago

Send me the data and I will look at it. It is possible none of the distributions fit the data. In that case, you just compare the histogram to the specs. Not much else to do there. bill@www.spcforexcel.com

Steve Jones

3 years ago

For MS Excel analysis, what is a very simple and fairly reliable method for determining good normal distribution of your data? If Cp and Pp comparison is a good method (they should be almost identical), then how close is good enough… 95%, 90%, 85%, etc? Thanks.

Bill

Admin

Reply to Steve Jones

3 years ago

If your process is in control, the Cpk and Ppk values will be close for the normal distribution. Not sure I have a number that means close.

Anonymous

2 years ago

Thank you so much for this article. But I want to ask something. In my case, my data follow a negative binomial distribution, which is not included in the 14 distributions in Minitab. How can I solve that problem? Is there any other method? Thank you very much.

2 years ago

Thanks for the article. What if, in my case, the data follow a negative binomial distribution, which is not included in the 14 distributions in Minitab for non-normal data? Or is there a different method for this problem? Thank you so much.

Bill

Admin

2 years ago

Hello, in this case, I think you just have to construct a histogram and compare the specifications to the histogram.
