Linear Regression (Part 1)

June 2008

Greetings,

This month's newsletter is the first part of a series on linear regression. Linear regression is closely related to one of the basic SPC tools: the scatter diagram. A scatter diagram examines the relationship between two variables. It may be that one variable increases as the other increases or decreases. Our February 2005 newsletter explores scatter diagrams in more detail.

Linear regression can be used to mathematically define the relationship between two variables. We often want to know how the changes in one variable affect another variable. There is sometimes a straight line relationship between two variables. Linear regression helps us define this relationship. The major objective is to determine if one variable can be controlled by controlling another variable. Linear regression helps us build a model of the process. This is one method of decreasing process variation.

This month we will explore how the best fit linear equation is developed. Next month we will explore how to tell if the relationship is significant.

If you have difficulty viewing this newsletter, it is on our website along with all past newsletters.

 

Best regards,

Bill

 

Regression Example

The following example demonstrates how linear regression works. It is taken from the book Introduction to Linear Regression Analysis (Montgomery, Peck and Vining, 4th edition, Wiley & Sons, 2006), an excellent reference for those of you who want to learn much more about regression.

The shear strength of the bond between two types of propellant is important in the manufacturing of a rocket motor. Someone asked the question, "Is the age of the propellant related to the shear strength?" To answer this question, twenty paired observations of shear strength and age of the propellant were collected. These are shown below.

 

Observation   Shear Strength, Y (psi)   Age of Propellant, X (weeks)
 1            2158.70                   15.50
 2            1678.15                   23.75
 3            2316.00                    8.00
 4            2061.30                   17.00
 5            2207.50                    5.50
 6            1708.30                   19.00
 7            1784.70                   24.00
 8            2575.00                    2.50
 9            2357.90                    7.50
10            2256.70                   11.00
11            2165.20                   13.00
12            2399.55                    3.75
13            1779.80                   25.00
14            2336.75                    9.75
15            1765.30                   22.00
16            2053.50                   18.00
17            2414.40                    6.00
18            2200.50                   12.50
19            2654.20                    2.00
20            1753.70                   21.50
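
For readers who want to follow along in code, the table can be entered as two Python lists. This is only a sketch: the variable names (strength, age and so on) are our own choices for illustration, and plain Python is used rather than any particular statistics package.

```python
# Shear strength (psi) and propellant age (weeks), copied from the table above.
strength = [2158.70, 1678.15, 2316.00, 2061.30, 2207.50,
            1708.30, 1784.70, 2575.00, 2357.90, 2256.70,
            2165.20, 2399.55, 1779.80, 2336.75, 1765.30,
            2053.50, 2414.40, 2200.50, 2654.20, 1753.70]
age = [15.50, 23.75, 8.00, 17.00, 5.50,
       19.00, 24.00, 2.50, 7.50, 11.00,
       13.00, 3.75, 25.00, 9.75, 22.00,
       18.00, 6.00, 12.50, 2.00, 21.50]

n = len(age)                       # number of paired observations
mean_age = sum(age) / n            # average age (weeks)
mean_strength = sum(strength) / n  # average shear strength (psi)

print(f"n = {n}, mean age = {mean_age:.4f}, mean strength = {mean_strength:.3f}")
```

The two averages are the same ones used later in the best fit calculations.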

 

The scatter diagram for this data is shown below. As can be seen from the scatter diagram, there does appear to be a relationship. It appears that as the age of the propellant increases, the shear strength decreases. We can use regression analysis to quantify that relationship. This will also allow us to predict the shear strength based on the age of the propellant. We start by determining the best fit linear equation.

[Figure: scatter diagram of shear strength (psi) versus age of propellant (weeks)]

 

 

 

Method of Least Squares

In regression, y is the variable we want to predict (the dependent variable). In this example, y is the shear strength. x is the independent variable -- in this case, the age of the propellant.

We begin by assuming that there is a linear relationship between the age of the propellant and the shear strength. Remember, making this assumption does not mean the relationship really is linear: the methodology below will generate a best fit linear equation for almost any set of data, and we must eventually decide whether the model is useful to us. The assumed relationship is described by the model below.

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where x is the propellant age and y is the shear strength. β0 and β1 are called the parameters of the model. y consists of two parts:

  • the value of β0 + β1x, which is the point on the regression line for that x
  • ε, the error, which is the distance any value of y may fall off the regression line

The parameters in the model cannot be known exactly. However, we can use the data collected to estimate them, just as we use process samples to estimate the average for control charts. If b0 and b1 are our estimates of β0 and β1, respectively, the model becomes:

$$y_p = b_0 + b_1 x$$

where yp is the predicted value of y for a given value of x once b0 and b1 have been determined. b1 is the slope of the line. b0 is the y-intercept (where the line crosses the y axis).

The method of least squares is used to determine the best fit line between x and y. Suppose, as in the table above, we have n paired observations of x and y, denoted (x1, y1), (x2, y2), ..., (xn, yn). Once b0 and b1 are determined, the model can be used to predict the value of y for a given value of x. The difference between what we actually see (yi) and what the model predicts (yp) for the same xi is called the residual (ei).

$$e_i = y_i - y_p$$
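
As a small illustration of the residual, the sketch below computes ei for the first three observations in the table. The trial line (intercept 2600, slope -35) is a hypothetical guess chosen only for illustration; it is not the best fit line, and the function name residuals is our own.

```python
def residuals(x, y, b0, b1):
    """Return e_i = y_i - y_p for each pair, where y_p = b0 + b1 * x_i."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# First three observations from the table (age in weeks, strength in psi).
age = [15.50, 23.75, 8.00]
strength = [2158.70, 1678.15, 2316.00]

# Hypothetical trial line, picked by eye for illustration only.
trial_b0, trial_b1 = 2600.0, -35.0

e = residuals(age, strength, trial_b0, trial_b1)
for xi, ei in zip(age, e):
    print(f"age = {xi:5.2f} weeks, residual = {ei:8.2f} psi")
```

A positive residual means the observed point lies above the trial line; a negative residual means it lies below.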

The method of least squares is based on choosing b0 and b1 so that the sum of the squares of the deviations (residuals) is a minimum. If SSR is used to denote the sum of the squares of the deviations, we are trying to minimize the following:

$$SSR = \sum_{i=1}^{n} (y_i - y_p)^2$$

This is shown graphically in the figure below. A line fitted using the method of least squares minimizes the vertical dotted-line distances shown in the figure.

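[Figure: least squares fit, with vertical dotted lines showing the distance from each data point to the fitted line]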
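
The quantity being minimized can also be computed directly. The sketch below, under our own naming (the function ssr is illustrative), compares the sum of squared residuals for a hypothetical trial line with that of the rounded best fit coefficients reported later in this newsletter (b0 = 2627.82, b1 = -37.15).

```python
def ssr(x, y, b0, b1):
    """Sum of squared residuals for the line y_p = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Data from the table: age (weeks) and shear strength (psi).
age = [15.50, 23.75, 8.00, 17.00, 5.50, 19.00, 24.00, 2.50, 7.50, 11.00,
       13.00, 3.75, 25.00, 9.75, 22.00, 18.00, 6.00, 12.50, 2.00, 21.50]
strength = [2158.70, 1678.15, 2316.00, 2061.30, 2207.50,
            1708.30, 1784.70, 2575.00, 2357.90, 2256.70,
            2165.20, 2399.55, 1779.80, 2336.75, 1765.30,
            2053.50, 2414.40, 2200.50, 2654.20, 1753.70]

# Hypothetical trial line versus the rounded least squares line.
ssr_trial = ssr(age, strength, 2600.0, -35.0)
ssr_best = ssr(age, strength, 2627.82, -37.15)

print(f"SSR, trial line:    {ssr_trial:.1f}")
print(f"SSR, best fit line: {ssr_best:.1f}")
```

Any other straight line tried in place of the trial line should give a larger SSR than the least squares line.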

 

The next section shows how to calculate the best fit line for the example data.

 

 

 

 

Best Fit Equation

The best fit equation can be determined using the following equations:

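$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$

$$S_{xy} = \sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}$$

$$b_1 = \frac{S_{xy}}{S_{xx}}$$

$$b_0 = \bar{y} - b_1 \bar{x}$$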
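
For readers curious where the expressions for b1 and b0 come from, here is a sketch of one standard derivation (it is not part of the original example). It sets the partial derivatives of SSR with respect to b0 and b1 to zero, using Sxx = Σxi² - n(xbar)², which equals Σxi² - (Σxi)²/n, and Sxy = Σxiyi - n(xbar)(ybar):

```latex
\begin{aligned}
SSR &= \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \\[4pt]
\frac{\partial SSR}{\partial b_0} &= -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0
  \;\Rightarrow\; b_0 = \bar{y} - b_1 \bar{x} \\[4pt]
\frac{\partial SSR}{\partial b_1} &= -2 \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0 \\[4pt]
\text{substituting } b_0:\quad
\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}
  &= b_1 \left( \sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2 \right)
  \;\Rightarrow\; b_1 = \frac{S_{xy}}{S_{xx}}
\end{aligned}
```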

 

A variable with a bar over it means it is the average. The table below summarizes the calculations, which are normally done using software.

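[Table: sums of x, x² and xy for the 20 observations. The values used in the calculations below are Σx = 267.25, Σx² = 4677.6880, Σxy = 528,492.6, xbar = 13.3625 and ybar = 2131.358.]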

 

Using the data in the table, the following calculations can be done:

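$$S_{xx} = 4677.6880 - \frac{(267.25)^2}{20} = 1106.56$$

$$S_{xy} = 528{,}492.6 - (20)(13.3625)(2131.358) = -41{,}112.65$$

$$b_1 = \frac{S_{xy}}{S_{xx}} = \frac{-41{,}112.65}{1106.56} = -37.15$$

$$b_0 = \bar{y} - b_1 \bar{x} = 2131.358 - (-37.15)(13.3625) = 2627.82$$

The inputs shown are rounded for display; the results were computed from the unrounded values, so redoing the arithmetic with the rounded inputs gives slightly different answers.

Thus, the best fit equation for this data is:

$$y_p = 2627.82 - 37.15x$$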
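
The same calculation can be reproduced from the raw data in a few lines of code. The sketch below is one way to do it in plain Python; the variable names are our own, and no statistics library is assumed.

```python
# Data from the table: age (weeks) and shear strength (psi).
age = [15.50, 23.75, 8.00, 17.00, 5.50, 19.00, 24.00, 2.50, 7.50, 11.00,
       13.00, 3.75, 25.00, 9.75, 22.00, 18.00, 6.00, 12.50, 2.00, 21.50]
strength = [2158.70, 1678.15, 2316.00, 2061.30, 2207.50,
            1708.30, 1784.70, 2575.00, 2357.90, 2256.70,
            2165.20, 2399.55, 1779.80, 2336.75, 1765.30,
            2053.50, 2414.40, 2200.50, 2654.20, 1753.70]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(strength) / n

# Sxx = sum(x^2) - (sum(x))^2 / n
s_xx = sum(x * x for x in age) - sum(age) ** 2 / n
# Sxy = sum(x*y) - n * xbar * ybar
s_xy = sum(x * y for x, y in zip(age, strength)) - n * x_bar * y_bar

b1 = s_xy / s_xx         # slope
b0 = y_bar - b1 * x_bar  # y-intercept

print(f"Sxx = {s_xx:.2f}, Sxy = {s_xy:.2f}")
print(f"b1 = {b1:.2f}, b0 = {b0:.2f}")
```

To two decimal places, the results should agree with the hand calculation above.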

 

Conclusions

The best fit line has been added to the scatter diagram as shown below. The slope, b1, means that, on average, the shear strength decreases by 37.15 psi for each additional week of age of the propellant. Note that this model can be used to improve the process. For example, if the minimum shear strength is 2100 psi, the model predicts that the propellant can't be more than about 14 weeks old.
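
The 14-week figure comes from solving the best fit equation for x. A minimal sketch, using the rounded coefficients from this newsletter and two illustrative helper functions of our own naming:

```python
# Rounded best fit coefficients from the calculation in this newsletter.
b0 = 2627.82   # intercept (psi)
b1 = -37.15    # slope (psi per week)

def predict_strength(age_weeks):
    """Predicted shear strength (psi) for a given propellant age (weeks)."""
    return b0 + b1 * age_weeks

def max_age_for(min_strength):
    """Age (weeks) at which the predicted strength equals min_strength."""
    return (min_strength - b0) / b1

print(f"Predicted strength at 14 weeks: {predict_strength(14):.2f} psi")
print(f"Age at which prediction reaches 2100 psi: {max_age_for(2100):.2f} weeks")
```

The second value is a little over 14 weeks, which is where the limit quoted above comes from.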

 

[Figure: scatter diagram of shear strength versus age of propellant with the best fit line added]

 

A linear equation can almost always be fit to data. This does not mean that the regression is statistically significant or of any practical use. Next month's newsletter will explore this issue.
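
To see that a fitted line by itself proves little, the sketch below applies the same formulas to five made-up points with no obvious pattern. The data and the function name fit_line are hypothetical and used only for illustration.

```python
def fit_line(x, y):
    """Least squares intercept and slope from the Sxx and Sxy formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum(xi * xi for xi in x) - n * x_bar ** 2
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Made-up data with no clear relationship between x and y.
x = [1, 2, 3, 4, 5]
y = [2, 5, 1, 5, 3]

b0, b1 = fit_line(x, y)
print(f"y_p = {b0:.2f} + ({b1:.2f})x")
```

The formulas still return an intercept and a slope, even though the points scatter widely around the line.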

 

 


 

 

Thanks so much for reading our newsletter. We hope you find it informative and useful. Happy charting and may the data always support your position.

 

 

 

 

Sincerely,

William McNeese

BPI Consulting, LLC