July 2008
In this Issue:
Greetings,
This month's newsletter is the second part of a series on linear regression. Linear regression can be used to mathematically define the relationship between two variables. We often want to know how the changes in one variable affect another variable. There is sometimes a straight line relationship between two variables. Linear regression helps us define this relationship. The major objective is to determine if one variable can be controlled by controlling another variable. Linear regression helps us build a model of the process. This is one method of decreasing process variation.
In the first part of the series, we saw how to develop the best fit equation for a series of data. This part is available on our website (click here to see the first part). In this part, we will explore how to determine if the equation is statistically significant.
We are also happy to announce in this issue the release of the 4th version of SPC for Excel set for September 1, 2008. Please check out this version's new features which are described in the last section of this newsletter.
Best regards,
Bill
The example we used in Part 1 is from the book Introduction to Linear Regression Analysis (Montgomery, Peck and Vinning, 4th edition, Wiley & Sons, 2006). The shear strength of the bond between two types of propellant is important in the manufacturing of a rocket motor. Someone asked the question, "Is the age of the propellant related to the shear strength?" To answer this question, twenty paired observations of shear strength and age of the propellant were collected. The data is shown in the table below. You can download the workbook containing the data and the results from our website (click here).
| Observation | Shear Strength, Y (psi) | Age of Propellant, X (weeks) |
| 1 | 2158.70 | 15.50 |
| 2 | 1678.15 | 23.75 |
| 3 | 2316.00 | 8.00 |
| 4 | 2061.30 | 17.00 |
| 5 | 2207.50 | 5.50 |
| 6 | 1708.30 | 19.00 |
| 7 | 1784.70 | 24.00 |
| 8 | 2575.00 | 2.50 |
| 9 | 2357.90 | 7.50 |
| 10 | 2256.70 | 11.00 |
| 11 | 2165.20 | 13.00 |
| 12 | 2399.55 | 3.75 |
| 13 | 1779.80 | 25.00 |
| 14 | 2336.75 | 9.75 |
| 15 | 1765.30 | 22.00 |
| 16 | 2053.50 | 18.00 |
| 17 | 2414.40 | 6.00 |
| 18 | 2200.50 | 12.50 |
| 19 | 2654.20 | 2.00 |
| 20 | 1753.70 | 21.50 |
The data were used to develop the best fit equation for data. The best fit equation has the following form:
where Y is the predicted value of y (shear strength) for a given value of X (age of propellant) once b0 and b1 have been determined. b1 is the slope of the line. b0 is the y-intercept (where the line crosses the y axis). The calculations for the best fit equation are given in Part 1 of this series.
The best-fit equation and the chart are shown below.
Y = 2627.82 -37.15X
The slope, b1, means that, on average, the shear rate decreases by 37.15 for each additional week of age for the propellant. Note that this model can be used to improve the process. For example, if the minimum shear strength is 2100, the age of the propellant can't be more than 14 weeks.
The model appears to fit the data fairly well. The analysis below is used to see if this is really a good model for the data.
Our software performs multiple linear regression. Click here for more information.
Most software programs generate output that can be used to determine if the model is statistically significant. Microsoft Excel has a built-in regression function. The output for this data from Excel is shown below.
Regression Statistics
| Multiple R | 0.9497 |
| RSquare | 0.9018 |
| Adjusted R Square | 0.8964 |
| Standard Error | 96.1061 |
| Observations | 20 |
ANOVA
| df | SS | MS | F | Significance F | |
| Regression | 1 | 1527482.7 | 1527482.7 | 165.4 | 1.64334E-10 |
| Residual | 18 | 166254.9 | 9236.4 | ||
| Total | 19 | 1693737.6 |
| Coeff. | Stand. Err. | t Stat | P-value | Lower 95% | Upper 95% | |
| Intercept | 2627.82 | 44.18 | 59.47 | 4.06E-22 | 2535.00 | 2720.65 |
| Propellant Age | -37.15 | 2.89 | -12.86 | 1.64E-10 | -43.22 | -31.08 |
This output is fairly typical of most software packages. Our SPC for Excel software has a regression module that provides much more information than Excel does, but we will use the output from Excel here. Version 4, due out September 1, 2008, has even more for regression (see the last section in this newsletter). Our objective here is to understand what this output is telling us about the model.
We begin by examining the Analysis of Variance (ANOVA) table. This table tells us the precision of the regression. The general format of the ANOVA table is given below. This will be used to explain the regression results.
Analysis of Variance (ANOVA) Table
| Source | DF | Sum of Squares | Mean Square | F |
| Regression | 1 | SSreg = Sum (Yhat_i - Ybar)^2 | SSreg/1 | MSreg/MSres |
| Residual | n-2 | SSres = Sum (Y_i - Yhat_i)^2 | SSres/(n-2) | |
| Total | n-1 | SStotal = Sum (Y_i - Ybar)^2 |
Yhat is the estimated value of Y using the best fit equation (the model). Ybar is the average of the Y results. n is the number of observations. A subscript "i" denotes the ith observation for Y.
There are three sources of variation listed in the regression ANOVA table. These are regression, the residual, and the total. We will start with the third column: the sum of squares.
For regression, you can see from the table that the sum of squares due to regression (SSreg) is the sum of the squares of the predicted values (Yhat) minus the overall average (Ybar). This is a measure of how far (how much variation) each predicted value is from the overall average.
SSreg = Sum (Yhat_i - Ybar)^2
The residual sum of squares (SSres) is the sum of the square of the observed value (Y) minus predicted value (Yhat).
SSres = Sum (Y_i - Yhat_i)^2
The residual is the "error" term. It measures how "close" the model predicts the actual results. If the model is good, the sum of squares due to regression will be much larger than the sum of squares due to the residual (error).
The third sum of squares is the total sum of squares (SStotal). This is the sum of the squares of each observed value (Y) minus the overall average (Ybar). It is a measure of how far (how much variation) each observed value is from the overall average.
SStotal = Sum (Y_i - Ybar)^2
Note that the following is true:
SStotal = SSreg + SSres
It is important to remember that the sums of squares are measures of variation. We want the sum of squares due to regression to be much larger than the sum of squares about the regression. If this is the case, it means that the model is statistically significant.
The calculations for the ANOVA table are contained in the workbook that can be downloaded from our website (click here to download the workbook).
The next column to consider in the ANOVA table is the "df" column. "df" is the degrees of freedom associated with each sum of squares. It is a measure of how much independent information you have. There are n-1 degrees of freedom for the total sum of squares, one for the regression sum of squares (since there is only one X), and n -2 for the residual sum of squares. Since there are 20 observations in this example, the following is true:
The mean square column is obtained by dividing each sum of squares by its degrees of freedom. The mean square for the residuals is often called the Mean Square Error (MSE). When the model is correct, MSE provides an estimate of the variance about the regression. Then for this example,
Mean Square Regression = SSreg/DFreg = 1527482.7/1 = 1527482.7
Mean Square Residual (Error) = SSres/DFres = 166254.9/18 = 9236.4
The F value is obtained by dividing the Mean Square for regression by the Mean Square for the residual. This F value is the ratio of two variances: the mean square for regression and the mean square residual. The F distribution is a probability distribution based on the ratio of variances. We will use the value of F to determine if the model fits the data. For this example,
F value = Mean Square Regression/Mean Square Residual = 1527482.74/9236.4 = 165.4
The column in the ANOVA table labeled "Significance F" gives us the probability that the mean square regression and the mean square residual are the "same." If the probability listed in this column is 0.05 or less, it is customary to assume that there is evidence that the two variances are not the same - that the model is significant because there is a significant difference between the two.
You can obtain this probability using the FDIST function in Excel. The form of the function for this data is FDIST(F value, DFreg, DFres) = FDIST(165.4, 1, 18). The significance for F listed in the ANOVA table is 1.6433E-10 which is less than 0.05. Thus we conclude that the model explains a significant portion of the variation in Y and is statistically significant. The general guidelines for interpreting the Significance of F column are given below.
You may apply the following guidelines for the p values:
Click here to access all our previous newsletters.
1-(SSres/dfres)/(SStotal/dftotal)
Of course, it still contains all the great version 3 features. With it you can easily split control limits, add comments to charts, delete points from the calculations, make multiple individual charts or process capability charts at once. It has a greatly enhanced measurement systems analysis component. The list goes on and on.
This affordable software is easy to learn, easy to use, and fits the needs of the SPC novice or SPC expert. It is the premier Excel-based SPC program. We have reached this position by listening to what our users say they need. It is used in over 30 countries world-wide.
Customer Complaint SPC Software