Bivariate Regression Using Deviation Scores

Deviation Score Matrix

The deviation score matrix is calculated by subtracting the column mean for each variable from every score on that variable in your data set. Notice that the columns in this matrix are now labeled with lower case x and y to indicate their deviation score status. For our data, where the mean of both variables is 6, this gives the first table below, which, after doing the elementary math, reduces to the second.

  1. Deviation Score Matrix

| Individuals | x | y |
|---|---|---|
| Pete | 2-6 | 2-6 |
| Jeanette | 4-6 | 4-6 |
| Julie | 6-6 | 8-6 |
| John | 8-6 | 6-6 |
| Karl | 10-6 | 10-6 |

  1. Deviation Score Matrix

| Individuals | x | y |
|---|---|---|
| Pete | -4 | -4 |
| Jeanette | -2 | -2 |
| Julie | 0 | 2 |
| John | 2 | 0 |
| Karl | 4 | 4 |
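As a minimal sketch of the subtraction step, the snippet below recomputes the deviation scores from the raw scores implied by the first table (X = 2, 4, 6, 8, 10 and Y = 2, 4, 8, 6, 10). The function name `deviation_scores` is just a label chosen for this illustration.

```python
# Raw scores read off the first table above (each entry is "raw score - 6").
names = ["Pete", "Jeanette", "Julie", "John", "Karl"]
X = [2, 4, 6, 8, 10]
Y = [2, 4, 8, 6, 10]


def deviation_scores(scores):
    """Subtract the column mean from every score in the column."""
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]


x = deviation_scores(X)
y = deviation_scores(Y)

# Print one row per person, mirroring the layout of the second table.
for name, dx, dy in zip(names, x, y):
    print(f"{name:<10}{dx:>6.1f}{dy:>6.1f}")
```

A quick check on any deviation score column is that it sums to zero, which is why the means of x and y are both 0.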

When the data are in deviation score form, you can tell whether a given person is above or below the mean on the variables in your study merely by looking at their score. If you see a negative number there, you know that the person scored below the mean on that variable.

When the data are in deviation score form, it's also easy to calculate summary statistics that you may have seen in analysis of variance (known as sums of squares), and it's also easy to calculate three of the statistics that are important for conducting a regression test of whether there is a relationship between X and Y. These three statistics are the variance of X, the variance of Y, and the covariance of X and Y. To calculate these, we'll need to compute three new columns, which correspond to the square of x, the square of y, and the cross-product of x and y. These columns are computed and shown in this table:

  1. Deviation Score Matrix

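| Individuals | x | y | $x^2$ | $y^2$ | xy |
|---|---|---|---|---|---|
| Pete | -4 | -4 | 16 | 16 | 16 |
| Jeanette | -2 | -2 | 4 | 4 | 4 |
| Julie | 0 | 2 | 0 | 4 | 0 |
| John | 2 | 0 | 4 | 0 | 0 |
| Karl | 4 | 4 | 16 | 16 | 16 |
| Sum | 0 | 0 | 40 | 40 | 36 |
| Sum/(N-1) | 0 | 0 | 10 | 10 | 9 |
| Sum/N | 0 | 0 | 8 | 8 | 7.2 |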

At the bottom of this table I add up each column and then divide either by N-1 (to get unbiased estimates of the population variances and covariance) or by N (to get the variances and covariance of the sample itself). Here N = 5.
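The bottom three rows of the table can be reproduced with a few lines of Python. This is only a sketch; the variable names (`ss_x`, `sp_xy`, and so on) are labels I've picked for the illustration.

```python
# Deviation scores for the five people in the example.
x = [-4, -2, 0, 2, 4]
y = [-4, -2, 2, 0, 4]
n = len(x)

# "Sum" row: sums of squares and the sum of cross-products.
ss_x = sum(v * v for v in x)
ss_y = sum(v * v for v in y)
sp_xy = sum(a * b for a, b in zip(x, y))

# "Sum/(N-1)" row: estimates of the population variances and covariance.
var_x_est = ss_x / (n - 1)
var_y_est = ss_y / (n - 1)
cov_xy_est = sp_xy / (n - 1)

# "Sum/N" row: variances and covariance of the sample itself.
var_x_sample = ss_x / n
var_y_sample = ss_y / n
cov_xy_sample = sp_xy / n

print("Sum      ", ss_x, ss_y, sp_xy)
print("Sum/(N-1)", var_x_est, var_y_est, cov_xy_est)
print("Sum/N    ", var_x_sample, var_y_sample, cov_xy_sample)
```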

Calculation of regression statistics using the deviation score matrix

For the moment, know that a correlation coefficient can be calculated from a covariance by dividing the covariance by the product of the standard deviations of the two variables, i.e., the correlation of X and Y in this study can be calculated as:

$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{9}{\sqrt{10}\,\sqrt{10}} = \frac{7.2}{\sqrt{8}\,\sqrt{8}} = 0.9$$

Notice that you get the same answer for the correlation regardless of whether you use the population or sample estimates.
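To see that the choice of divisor really doesn't matter, here is a small sketch that computes the correlation twice, once from the Sum/(N-1) row of the table and once from the Sum/N row. The dictionary keys and the function name `correlation` are illustrative choices, not anything from the book.

```python
import math

# Variances and covariance taken from the table above.
estimates = {"var_x": 10, "var_y": 10, "cov_xy": 9}   # Sum/(N-1) row
sample = {"var_x": 8, "var_y": 8, "cov_xy": 7.2}      # Sum/N row


def correlation(stats):
    """Covariance divided by the product of the two standard deviations."""
    sd_x = math.sqrt(stats["var_x"])
    sd_y = math.sqrt(stats["var_y"])
    return stats["cov_xy"] / (sd_x * sd_y)


r_est = correlation(estimates)
r_sample = correlation(sample)
print(round(r_est, 4), round(r_sample, 4))
```

The divisor cancels because it appears once in the covariance on top and once (as two square roots) in the standard deviations on the bottom.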

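It's also possible to calculate the regression weight which predicts Y from X. You can use the formula in the book which uses deviation sums and cross products (formula 2.7 on page 19), or you can use variances and covariances, i.e.:

$$b = \frac{\sum xy}{\sum x^2} = \frac{36}{40} = 0.9 \qquad \text{or} \qquad b = \frac{s_{xy}}{s_x^2} = \frac{9}{10} = 0.9$$

The regression intercept for use when you use raw scores is calculated as:

$$a = \bar{Y} - b\bar{X} = 6 - (0.9)(6) = 0.6$$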

When you are doing a regression using deviation scores, the regression equation changes slightly in that the intercept is no longer present, i.e., y = bx + error. The predicted deviation score for these data is expressed by $\hat{y} = 0.9x$.
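The sketch below works through the regression weight and the raw-score intercept for these data and then wraps the two prediction equations in small helper functions. The helper names `predict_raw` and `predict_deviation` are hypothetical, and the means of 6 come from the "raw score - 6" entries in the first table.

```python
# Quantities taken from the tables above.
mean_X, mean_Y = 6, 6        # raw-score means
sum_xy, sum_x2 = 36, 40      # sum of cross-products and sum of squared x

# Regression weight from deviation sums, and intercept for raw scores.
b = sum_xy / sum_x2
a = mean_Y - b * mean_X


def predict_raw(X):
    """Predicted raw Y score: intercept plus weight times raw X."""
    return a + b * X


def predict_deviation(x):
    """Predicted deviation score: no intercept is needed."""
    return b * x


print(round(b, 4), round(a, 4))
print(round(predict_raw(10), 4), round(predict_deviation(4), 4))
```

Both functions describe the same line. Karl's raw score of 10 and his deviation score of 4 lead to predictions that differ by exactly the mean of Y.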

Graphical representation of the regression line.

The regression line looks like this:

Notice that, as before, the regression line must pass through the point representing the means of the two variables. Now the means are 0 for both variables, though, so that point is (0, 0), also known as the origin. The interpretation of the regression weight is the same as that given above for the original regression line.

Partitioning the Sums of Squares of Y (or, equivalently, the variance of Y) - what does it mean in terms of predicted scores and graphically?

It is slightly easier to do the math, though, using the deviation score matrix, so I'll choose to do that here. Notice that the prediction equation for y is $\hat{y} = 0.9x$. Notice that there is no intercept because, as mentioned above, we're doing this in the deviation score matrix. Consider the following table, which shows the deviation scores for the variables, the squared terms for x and y, the predicted value of y for each observation, the error of prediction, and the square of that error.

  1. Deviation Score Matrix

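| Individuals | x | y | $x^2$ | $y^2$ | $\hat{y}$ | $e = y - \hat{y}$ | $e^2$ |
|---|---|---|---|---|---|---|---|
| Pete | -4 | -4 | 16 | 16 | -3.6 | -.4 | .16 |
| Jeanette | -2 | -2 | 4 | 4 | -1.8 | -.2 | .04 |
| Julie | 0 | 2 | 0 | 4 | 0 | 2 | 4 |
| John | 2 | 0 | 4 | 0 | 1.8 | -1.8 | 3.24 |
| Karl | 4 | 4 | 16 | 16 | 3.6 | .4 | .16 |
| Sum | 0 | 0 | 40 | 40 | 0 | 0 | 7.6 |
| Sum/(N-1) | 0 | 0 | 10 | 10 | 0 | 0 | 1.9 |
| Sum/N | 0 | 0 | 8 | 8 | 0 | 0 | 1.52 |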

From this we can now talk about the way that knowledge of variable x in this study tells us something about variable y. More specifically, variability in y is expressed as the variance of y (the Sum/(N-1) row entry for the $y^2$ column). If the variable x tells us nothing about y, our best guess for any particular person's y score would be 0 (because the mean of y in a deviation score matrix is zero). If x tells us everything we need to know about y, then we would expect to see that everyone's y score is equal to their predicted score $\hat{y}$.
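One way to make the two extremes concrete is to compare how badly we do under each guessing rule. The sketch below adds up the squared errors when we always guess 0 (the mean of y) and when we guess 0.9 times x. The names `ss_guess_mean` and `ss_guess_line` are labels invented for this illustration.

```python
# Deviation scores and the regression weight for these data.
x = [-4, -2, 0, 2, 4]
y = [-4, -2, 2, 0, 4]
b = 0.9

# Predicted deviation scores and errors of prediction.
y_hat = [b * v for v in x]
errors = [obs - pred for obs, pred in zip(y, y_hat)]

# Squared error if we ignore x and always guess the mean of y (0).
ss_guess_mean = sum(obs ** 2 for obs in y)

# Squared error if we use x and guess b * x.
ss_guess_line = sum(e ** 2 for e in errors)

print(round(ss_guess_mean, 2), round(ss_guess_line, 2))
```

For these data, knowing x cuts the squared error from the sum of the $y^2$ column down to the sum of the $e^2$ column, so x tells us a good deal about y but not everything.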

Unfortunately, the picture is a little less clear cut than this in practice, because a regression line is the best possible fitting line for a sample of data. This means that it also capitalizes on chance variability in the sample. Consider the case where there are only two data points in your study. For example, suppose that your data contain two observations: one person received a score of -1 on x and -5 on y, and the other person received a score of 1 on x and 5 on y. If I then tell you that one person received a score of -1 on x and ask you to guess what their score on y is, a moment's reflection will tell you that -5 is the only value you could guess (assuming that you are limiting your consideration to this data set).

The correlation between x and y with two observations is usually going to be 1 or -1. The exceptions are the degenerate cases. If the two y values in your study happen to be identical, the variance of y is 0, the regression weight is 0, and the correlation is conventionally treated as zero (strictly speaking, the correlation formula then divides by zero). If the two x values in your study happen to be identical, both statistics are undefined, because you have to divide by zero in the formula above for a regression weight.
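The two-observation cases can be tried out directly. In the sketch below, `slope_and_r` is a hypothetical helper that takes deviation scores and returns the regression weight and the correlation. As an assumption of this sketch, it reports `None` whenever a formula would divide by zero, so a zero-variance y shows up as a weight of 0 with no computable correlation.

```python
import math


def slope_and_r(x, y):
    """Return (b, r) for deviation scores; None marks a division by zero."""
    ss_x = sum(v * v for v in x)
    ss_y = sum(v * v for v in y)
    sp_xy = sum(a * b for a, b in zip(x, y))
    b = sp_xy / ss_x if ss_x else None
    r = sp_xy / math.sqrt(ss_x * ss_y) if ss_x and ss_y else None
    return b, r


perfect_pos = slope_and_r([-1, 1], [-5, 5])   # the example in the text
perfect_neg = slope_and_r([-1, 1], [5, -5])   # same data with y reversed
same_y = slope_and_r([-1, 1], [0, 0])         # two identical y values
same_x = slope_and_r([0, 0], [-5, 5])         # two identical x values

for label, result in [("positive", perfect_pos), ("negative", perfect_neg),
                      ("same y", same_y), ("same x", same_x)]:
    print(label, result)
```

With only two points, a line can always be drawn through both of them, which is why the fit is perfect whenever both variables actually vary.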

Graphically, the discussion of how predicted scores and actual scores are related to the variance of Y can be shown below. (You'll have to cut me some slack in the resolution of this picture because some of the lines I want to draw your attention to overlap.) As you can see, the general picture is the same as it was for the prediction equation for raw data described above, except that the intercept is located at the origin.

The observed variance of y (or Y; they're the same thing) can be calculated from such a picture by measuring the vertical distance of each data point from the mean (0 for a deviation score metric), squaring this number, adding all these squares together, and then dividing by N-1 (or N). The vertical lengths of interest for calculating the variance are given in the picture by the red dashed lines.

The variance of y due to x can be calculated by drawing lines from the predicted y score for each observation in your data set to the mean, squaring these lengths, adding them up, and dividing by N-1 (or N). The lengths of interest here are given by the blue dashed lines in the figure.

The error variance, or variability in the criterion which is not accounted for by the predictor, can be calculated in the same way from the distance between the predicted score for every observation and its respective observed score (as indicated by the green lines in the figure): square these distances, add them up, and divide by N-1 (or N).

The phrase "partitioning the variance of the criterion" only means that the variance of y can be divided into two parts, one which is the variance of y predicted on the basis of x, and another which is unpredicted variability, also known as the variability in y due to error. Rather than use the term "error," which is somewhat pejorative, it is sometimes helpful to just refer to these error scores as "the variability in y left over after adjusting for x" or "y scores adjusted for the effect of x."
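For the five people in this example, the partition can be written out with the numbers from the tables. This is a worked sketch rather than a general proof: the symbols $\hat{y}$ (predicted deviation score) and $e$ (error of prediction) are my labels, and the value 32.4 for the predicted part is not in the tables but follows from squaring and summing the predicted scores -3.6, -1.8, 0, 1.8, and 3.6.

```latex
\begin{align*}
\sum y^2 &= \sum \hat{y}^2 + \sum e^2
  & 40 &= 32.4 + 7.6 \\
\frac{\sum y^2}{N-1} &= \frac{\sum \hat{y}^2}{N-1} + \frac{\sum e^2}{N-1}
  & 10 &= 8.1 + 1.9 \\
\frac{\sum y^2}{N} &= \frac{\sum \hat{y}^2}{N} + \frac{\sum e^2}{N}
  & 8 &= 6.48 + 1.52 \\
\frac{\sum \hat{y}^2}{\sum y^2} &= \frac{32.4}{40} = 0.81 = (0.9)^2
\end{align*}
```

The last line shows that the proportion of the variance of y predicted from x equals the square of the correlation for these data, whichever divisor you use.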