University of San Francisco

Non-Experimental Data

In this section...

Correlation
Simple Linear Regression
Suggested Further Reading

If you have not worked with SPSS before, it is recommended that you learn some navigation and data entry techniques before beginning this section. To learn about navigation click here.

For all users, it is suggested that you have an active SPSS data set to use while working through this section. It can be your actual data or one of the data sets available in SPSS (to learn how to access SPSS data sets click here). It is further suggested that you run descriptives on your data before continuing. You can refresh your memory on how to run descriptive statistics by clicking here.

If you find yourself in this section, you have decided that your data is non-experimental in nature. The designation "non-experimental" could mean several things. First, it might reference the way in which the data were collected. You may have used a survey or other self-report instrument to gather data rather than a more structured approach, such as a pretest/posttest control group design. Or, it could mean that you are trying to understand more abstract concepts, such as people's beliefs, values, or feelings. "Non-experimental" could also refer to the nature of your investigation. You may not be interested in mean differences between groups on a dependent variable. In fact, you may not be interested in differences at all.

Sometimes researchers are more interested in the relationships between variables than they are in differences between groups. To illustrate this fundamentally different approach, let's go back to the fictitious compensation study we have been using. You may not wish to examine, say, the mean difference in current salary between high school graduates, college graduates and those with graduate degrees. You're not interested in the answer to the question, "Do holders of graduate degrees earn more money than those with high school diplomas?" Rather, your interest may lie in understanding the more global relationship between education and salary level. Your research question may be more along the lines of, "Is there a relationship between education level and current salary?" Or, more specifically, "Does salary level generally increase or decrease with additional education?"

Understanding the relationship between variables can be of enormous value. SPSS has several options that can help you study relationships. Two of the most common are correlation and regression. Before moving on to more detailed discussion of these techniques, it is necessary to make a very important point. When interpreting the results of correlation and regression analyses, you must be careful not to confuse association (relationship) with causation. The results of correlation and regression techniques show that two variables under study are in some way related; they do not show, however, that one variable causes the other.


Correlation

Pearson Product-Moment Correlation

The simplest way to examine the relationship between two or more variables is to run a correlation. There is a variety of correlations you can choose from depending upon the nature of the variables under consideration. The most common correlation is the Pearson product-moment correlation. To run a Pearson correlation you have to have at least two normally distributed, quantitative variables.

The degree of relationship between variables is captured by the "correlation coefficient," abbreviated "r" and sometimes called "Pearson r". The range of values for the correlation coefficient is -1.00 to +1.00. If the relationship between two variables is expressed by a positive correlation coefficient, the variables are said, logically, to be "positively correlated." This means one of two things: either that an increase in the first variable is associated with a corresponding increase in the second, or a decrease in the first variable is associated with a corresponding decrease in the second. Height and weight, for example, are positively correlated since as height increases so too does weight.

If the relationship between two variables is expressed by a negative correlation coefficient, the variables are "negatively correlated" with one another. This can also mean one of two things: either that an increase in the first variable is accompanied by a corresponding decrease in the second, or a decrease in the first is accompanied by a corresponding increase in the second. An example of negatively correlated variables might be the relationship often found between noise level and task performance. In general, as noise level increases, the ability to perform tasks well decreases (and vice versa).

Last but not least is the notion of completely unrelated variables. If there is truly no relationship between two variables, they are said to be uncorrelated. This fact is expressed by a correlation coefficient of "0," specifically r = 0.
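As a rough illustration of what the correlation coefficient measures, here is a minimal Python sketch that computes r by hand. Python is not part of this tutorial, and SPSS does this work for you; the function name pearson_r and all of the scores below are made up for illustration and do not come from any SPSS data set.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) if either variable never varies
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores, invented for this sketch
height = [60, 62, 65, 68, 70, 72]
weight = [115, 120, 140, 155, 160, 180]
noise = [1, 2, 3, 4, 5, 6]
performance = [95, 90, 80, 72, 65, 50]

print(round(pearson_r(height, weight), 2))      # a positive value
print(round(pearson_r(noise, performance), 2))  # a negative value
```

Whatever data you feed it, the result always falls between -1.00 and +1.00, and its sign tells you whether the variables are positively or negatively correlated.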

Sometimes it's not enough merely to know that two variables are related. Sometimes you want to know how strong the relationship between them is. The strength of the relationship between variables is reflected in the size of the correlation coefficient. Squaring the correlation coefficient provides an indication of how much variance (or variability) the two variables have in common. The greater the amount of shared variance between two variables, the stronger their relationship. Wouldn't you expect variables with 80% shared variance to be more closely associated with one another than two variables that had only 40% of their variance in common? (To convince yourself, take the square root of .80 and .40 to get the correlation coefficients and compare the results.) Consider this method when evaluating the strength of relationship between variables.
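The square-root comparison suggested above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function names are hypothetical and nothing here is an SPSS feature.

```python
import math

def r_from_shared_variance(shared_variance):
    """Size of the correlation coefficient implied by a proportion of shared variance."""
    return math.sqrt(shared_variance)

def shared_variance_from_r(r):
    """Proportion of variance two variables have in common, given their correlation."""
    return r ** 2

for proportion in (0.80, 0.40):
    r = r_from_shared_variance(proportion)
    print(f"{proportion:.0%} shared variance -> r = {r:.2f}")
# prints:
# 80% shared variance -> r = 0.89
# 40% shared variance -> r = 0.63
```

Note that the square root gives only the size of the coefficient. Squaring discards the sign, so shared variance alone cannot tell you whether a correlation is positive or negative.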

The command string to run a Pearson correlation is:

Analyze --> Correlate --> Bivariate (two variables)

Remember, it is recommended that you have an active SPSS data set to use while working through this section. If you don't have a data set of your own, you should access one of the data sets available in SPSS. To learn how to access SPSS data sets, click here.

From the complete list of variables displayed at the left of the dialog box, choose the two variables whose relationship you wish to investigate and move them to the "variables" box on the right. (You can run correlations on any number of variables in your dataset. The process and analysis are the same. The only difference is in the size of the output.) Click OK.

Your output should show a single table. If you correlated only two variables the table consists of two rows and two columns. The rows and columns are labelled with the names of the variables you chose to correlate. The intersection of a row and a column is a "cell" that contains the information you are seeking about the relationship between the variables. The first value in any cell is the Pearson correlation (r). This is what you're after, but don't stop here. Below the Pearson "r" is the significance level of the correlation (which you want to be equal to or less than your predetermined level of significance, either .05 or .01). This is a critical component of your interpretation of the data. The third value is the number of cases that were correlated. If the value of the correlation coefficient is significant, you can say with confidence that the two variables are related.

To illustrate, let's return to our fictitious compensation study. Let's assume that the correlation coefficient between education level and current salary is .81. This is a strong positive correlation, suggesting that the two variables are very closely related. But beyond that, what does it mean? It could be interpreted that as education level increases, so too does current salary. It could also mean that as education level decreases, so too does current salary level. What it does not mean is that more education causes higher salaries.

Go now to the SPSS Viewer showing the output for your data. Study the columns and rows of your table. Did you find a statistically significant relationship between your two variables? If so, decide for yourself if it is a small, moderate, or strong relationship. How would you interpret your result? (Hint: how much variance is shared between them?)

A note on correlation tables (aka correlation matrixes). If you look closely at the table you just produced you're likely to notice two things. First, you'll see two cases of perfect correlation, that is, where r = 1.00. Perfect correlations are rare, so this should surprise you. Until, that is, you notice that SPSS has correlated each variable with itself in addition to the other variable you selected. The relationship between a variable and itself is always a perfect one. If your results indicate something to the contrary, first check your data and then re-run the correlation.

The second thing you'll notice in the correlation table is that the same information is presented twice. The values found in the lower lefthand cell are identical to the values in the upper righthand cell, because the correlation between the first variable and the second is the same as the correlation between the second and the first. This is a feature common to all correlation matrixes. As you work with correlation matrixes you will learn to read them diagonally. Go now to your correlation table. With your mind's eye draw a diagonal line from the upper left to the lower right, through the cells in which each variable is correlated with itself. The information contained on either side of this imaginary diagonal is the same (to confirm this compare the values in the lower lefthand cell to those in the upper righthand one). Because this is so, you can legitimately disregard anything above or below your diagonal without losing any insight into the relationship between the variables. By ignoring this data you are not doing anything wrong--on the contrary, it's standard practice. In fact, most professional journals eliminate this duplication in the correlation matrix by printing only results above or below the diagonal.
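The mirror-image layout of a correlation table can be sketched in Python with three variables. This is an illustration only: the function names and all of the scores are hypothetical, and the table SPSS prints also includes significance levels and case counts that are not computed here.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Correlate every variable with every variable, including itself."""
    names = list(columns)
    return {row: {col: pearson_r(columns[row], columns[col]) for col in names}
            for row in names}

# Hypothetical scores, invented for this sketch
data = {
    "education": [12, 12, 14, 16, 16, 18, 20],
    "salary": [28000, 31000, 35000, 42000, 40000, 52000, 61000],
    "experience": [60, 24, 36, 12, 48, 6, 18],
}
matrix = correlation_matrix(data)

names = list(data)
print(" " * 12 + "".join(f"{name:>12}" for name in names))
for row in names:
    cells = "".join(f"{matrix[row][col]:>12.2f}" for col in names)
    print(f"{row:<12}{cells}")
```

In the printed table every cell where a variable meets itself shows 1.00, and each value on one side of that line of 1.00s reappears on the other side.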

A true test of your ability to run a Pearson correlation and to interpret the results would be to run one with three or more variables. Go now to your data set and try this.


Simple Linear Regression (bivariate regression)

Regression, like correlation, focuses on the relationship between variables, and for this reason you will sometimes hear regression called a "correlational technique." Although there are similarities between regression and correlation, there are also considerable differences. Whereas correlation attempts to determine the extent to which two variables are related, regression is used to examine how well one variable can predict a second variable or how well one variable can explain observed differences in a second variable.

Knowing only this much about regression, you can perhaps begin to appreciate what an enormously powerful (and complex) tool it is. In this section we will concentrate on the most basic regression technique, simple (bivariate) linear regression. Simple linear regression is used mostly for prediction, so the focus of the discussion will be on that topic. To help set the stage for this discussion, here's a hypothetical example of a question regression can answer: how well do GRE scores predict grade point average in the doctoral program at USF? If your research questions are similar, it's likely you will want to run a regression.

In order to run a simple linear regression you need two normally distributed quantitative variables. By now you have probably gotten used to referring to variables as "independent" or "dependent." The variable which you suspect is able to predict the other is called the "predictor variable." The variable which you suspect is able to be predicted is called the "criterion variable." You can draw your own conclusions as to which corresponds to an independent variable and which to a dependent variable. Although SPSS makes use of the more familiar terms, you are encouraged to begin thinking in terms of predictor and criterion variables when working with regression.

A second semantic quirkiness unique to regression is the way in which researchers talk about the procedure. You will frequently hear (or read) the statement "Y (a variable) was regressed onto X (another variable)." It may take some getting used to, but the statement simply means "the criterion variable was regressed onto the predictor variable." Let's illustrate with our hypothetical GRE/GPA question from above. We are interested in determining if GRE score predicts grade point average in the USF doctoral program. GRE score is the predictor variable because we suspect it predicts GPA. GPA is the criterion variable because we suspect it can be predicted by GRE score. So, in order to answer the question, "how well does GRE score predict grade point average in the doctoral program at USF?", one would need to regress GPA onto GRE score. All this means is you will run a statistical test (regression) to determine if GRE score predicts grade point average.

The mechanics underlying regression's ability to predict the outcome of one variable based on what is available in another are a bit complicated. Briefly, regression computes an equation which includes three key elements: (1) a calculated constant that aids in prediction; (2) a known value on the predictor variable; and (3) the slope of a line running through the middle of all the data points, both criterion and predictor, in the data set. This equation is called the regression equation and looks like this (incidentally, this is the standard equation for a straight line):

Y = a + bX

where "Y" are the criterion variable scores you're trying to predict; "a" is the calculated constant; "b" is the slope of the line running through all the data points in your data set; and "X" are the known score values on the predictor variable. Regression will determine how accurately this equation predicts the criterion variable based upon what is known in the predictor variable.
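To make the constant and the slope less mysterious, here is a minimal Python sketch of how they can be calculated with ordinary least squares, the usual method for fitting a straight line in simple linear regression. The function names and the GRE and GPA scores are hypothetical, invented only for this illustration.

```python
def fit_line(x, y):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # slope of the line
    a = mean_y - b * mean_x  # calculated constant (intercept)
    return a, b

def predict(a, b, x_value):
    """Plug a known predictor score into the regression equation."""
    return a + b * x_value

# Hypothetical predictor (GRE) and criterion (GPA) scores
gre = [300, 305, 310, 315, 320, 325, 330]
gpa = [3.1, 3.3, 3.2, 3.5, 3.6, 3.6, 3.9]

a, b = fit_line(gre, gpa)
print(f"constant a = {a:.3f}, slope b = {b:.4f}")
print(f"predicted GPA for a GRE score of 318: {predict(a, b, 318):.2f}")
```

The line always passes through the point made up of the two means, which is why the constant is computed from the mean of Y, the slope, and the mean of X.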

The command string to run a simple linear regression is:

Analyze --> Regression --> Linear

At this point we normally suggest that learners have access to an active SPSS data set to practice the procedures as they work through this section. That reminder is still fundamentally sound. However, regression is the most complex statistical technique contained in these modules. For beginners it might be wise to simply read through this section and attempt to grasp the larger concepts. Those already familiar with the basics of regression may wish to attempt an analysis. At the end of this section is a list of references which might be useful for anyone studying regression.

Still, if you wish to run a regression analysis, it is recommended that you have an active data set available. If you don't have a data set of your own, you should access one of the data sets available in SPSS. To learn how to access SPSS data sets, click here. Be careful when choosing a data set for this section. It must contain two quantitative interval variables, one of which you suspect predicts the other.

To illustrate this technique from the fictitious compensation study we have been using, consider the following. Up to this point we have considered gender, ethnicity, and education level as they relate to current salary level. We have found significant differences in mean salary level by gender and ethnicity (using ANOVA), and discovered a strong positive correlation between education and salary (by running a simple Pearson r). Still, we suspect there is more that contributes to the difference we see in salary level of employees. One possible source of the difference is previous job experience (measured in months). Our new question, one appropriate to regression, is this: does previous job experience predict salary level?

Now, execute the command string Analyze --> Regression --> Linear.

From the complete list of variables displayed at the left of the dialog box, choose the criterion variable and move it to the "dependent" box on the right (from our compensation study, this would be "current salary"). Next, choose the predictor variable from the list and move it to the "independent" box ("previous experience in months" from the compensation study). Click on the "statistics" button in the lower portion of the dialog box. You'll see that "estimates" and "model fit" are already checked. Click on "descriptives" and then hit the "continue" button. Click OK.

If you haven't been working in your own data up to this point, it is recommended that you do so before continuing.

In the Output Editor you'll see several tables. The first table provides common descriptive statistics with which you are by now familiar. Look over the means and standard deviations to see if anything strikes you as unusual.

The second table contains the correlation coefficients for the two variables you are studying. Although this table organizes data in a slightly different way than described in the preceding section on correlation, you should be able to determine the strength of relationship between the two variables.

The third table, Variables Entered/Removed, is the first regression-specific table in this output. You will not want to skim over this table. Rather, use the information contained here to confirm that you have entered the variables correctly. Until you are comfortable with the terminology specific to regression, it's easy to confuse predictor and criterion variables, and to regress them in the wrong order. Look to this table to remind yourself what the criterion and predictor variables are. The criterion variable is identified in footnote "b" (in our example study, current salary) and the predictor variable is listed in the "variables entered" column (previous job experience in months).

The fourth table, called Model Summary, is important. But before examining the results presented in table four, go to the ANOVA table immediately below it. Information in the ANOVA table tells you whether the predictor variable in fact predicts the criterion variable. If the F value indicated in the fifth column of the ANOVA table is significant, you can conclude that the predictor variable does indeed contain information that is useful in predicting the criterion variable. What's more, you can conclude that the regression equation for your two variables is valid. In other words, plugging the values known and available to you from the predictor variable into the regression equation will produce a fairly good estimate of the value of the criterion variable (more on this later). If the F value is not significant, however, you have determined that variable X does not significantly or reliably predict variable Y. If in our compensation study, for example, we have a significant F value we can safely say that months of previous experience is predictive of salary level.
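For readers curious about where the F value comes from, the following Python sketch computes it from the sums of squares for a simple regression with one predictor. It is a simplified illustration under stated assumptions: the function name and the scores are hypothetical, and the sketch does not compute the significance level, which you should still read from the ANOVA table in your SPSS output.

```python
def regression_anova(x, y):
    """Sums of squares, degrees of freedom and F for a one-predictor regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    predicted = [a + b * xi for xi in x]
    # Variability the regression line accounts for, and what is left over
    ss_regression = sum((p - mean_y) ** 2 for p in predicted)
    ss_residual = sum((yi - p) ** 2 for yi, p in zip(y, predicted))
    df_regression = 1      # one predictor variable
    df_residual = n - 2    # cases minus the two estimated values (a and b)
    f_value = (ss_regression / df_regression) / (ss_residual / df_residual)
    return {
        "ss_regression": ss_regression,
        "ss_residual": ss_residual,
        "df_regression": df_regression,
        "df_residual": df_residual,
        "f_value": f_value,
    }

# Hypothetical scores: months of previous experience and current salary
months = [0, 12, 24, 36, 48, 60]
salary = [30000, 33500, 31000, 35000, 34000, 36500]

result = regression_anova(months, salary)
print(f"F({result['df_regression']}, {result['df_residual']}) = {result['f_value']:.2f}")
```

A larger F value means the regression line accounts for more variability relative to what it leaves unexplained.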

Although it is clearly important to know that one variable predicts another, it is often not enough. Most researchers are not content with answering their research questions simply "yes" or "no." Think about it. Let's say that based on a significant F value we report that number of months of previous experience is in fact predictive of current salary level. Wouldn't you want to know how much months of previous experience contributes to predicting salary? That is, wouldn't you be curious if previous experience accounts for 25%, 50% or 75% of the difference in current salary level? To get a sense of the relative weight of the predictor variable, look back to the model summary table and focus your attention on the third column, labelled "R square." The "R square" value is the proportion of variance in the criterion variable (current salary) that is explained by the predictor variable (months of previous experience); multiply it by 100 to read it as a percentage. It's likely that a beginning researcher running a regression is interested in large "R square" values because you want your predictor to do its job in a meaningful way. In our compensation study we'll assume that the "R square" value is only .09. This means that 9% of the variance in salary level is explained by number of months of previous job experience. While it's nice to account for this 9%, there remains 91% of the variance in salary that is unexplained by previous job experience! (More advanced regression techniques, ones involving multiple variables, would help identify the other factors involved.)
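The idea of "R square" as explained variance can be sketched in Python. The sketch below computes it as one minus the share of variability left unexplained by the regression line, and also shows that with a single predictor it equals the squared Pearson correlation. The function names and scores are hypothetical, invented for illustration.

```python
import math

def r_square(x, y):
    """Proportion of variance in y explained by the least-squares line on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return 1 - ss_residual / ss_total

def pearson_r(x, y):
    """Pearson correlation of two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores: months of previous experience and current salary
months = [0, 12, 24, 36, 48, 60]
salary = [30000, 33500, 31000, 35000, 34000, 36500]

explained = r_square(months, salary)
print(f"R square = {explained:.2f} ({explained:.0%} of variance explained)")
print(f"r squared = {pearson_r(months, salary) ** 2:.2f}")
```

The two printed values agree, which is why the hint in the correlation section (square the coefficient) and the "R square" column in the regression output describe the same quantity when there is only one predictor.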

The final table in this output, called simply "coefficients," contains all the elements you need for plugging into the regression equation. Recall from above that regression equations look like this:

Y = a + bX

where "Y" is the criterion variable you're trying to predict; "a" is the calculated constant; "b" is the slope of the line running through all the data points in your data set; and "X" is the predictor variable. Beyond telling you that the equation is useful in predicting the value of the criterion variable from what is contained in the predictor variable, regression also provides you with the missing pieces of the equation--the constant and the slope. This information is contained in the second column of the final table in the output under "unstandardized coefficients". The first line in that column is the constant, and the second line is the slope. Since you already know the value of the predictor variable, you have all the pieces you need to calculate the value of the criterion variable.

Let's return to our salary study to illustrate this in a concrete way. Since we're trying to predict salary level from months of previous experience, the regression equation for our research question would look like this when written in words:

current salary = constant + slope*months of previous experience

We know from the results presented in the coefficient table that the constant is $32,409.00 and that the slope is 15.9. Thus:

current salary = $32,409 + 15.9*months of previous experience

Now, to make a prediction simply do the math. If you were trying to predict the salary level of an individual with 36 months of previous experience, you would predict salary level as follows:

current salary = $32,409 + 15.9(36)

= $32,409 + 572.4

= $32,981.40

Similarly, if you were trying to predict salary for an applicant with no previous experience, you would follow the same procedure:

current salary = $32,409 + 15.9(0)

= $32,409 + 0

= $32,409

These results make intuitive sense: we would expect someone with job related experience to earn more than someone without such experience (even if we feel it should be more than $572.40 a year).
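The two predictions above can be reproduced with a few lines of Python. The constant and slope are the values given for the fictitious compensation study; the function name predict_salary is a hypothetical label chosen for this sketch.

```python
CONSTANT = 32409.00  # "a": the constant from the coefficients table
SLOPE = 15.9         # "b": the slope for months of previous experience

def predict_salary(months_of_experience):
    """Apply the regression equation: current salary = a + b * months."""
    return CONSTANT + SLOPE * months_of_experience

for months in (36, 0):
    print(f"{months} months -> ${predict_salary(months):,.2f}")
# prints:
# 36 months -> $32,981.40
# 0 months -> $32,409.00
```

Changing the number of months passed to the function gives the predicted salary for any other applicant, provided the value lies within the range of experience found in the data set used to build the equation.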

Regression is a powerful but complex tool. Although it is unlikely that a novice researcher will use regression often, you will encounter it in the professional journals in your field. For that reason, it is important to have a general understanding of this procedure. The more you read about regression, the more you will understand it, and the better able you will be to use it when your time comes.


Suggested further reading:

  1. Green, S. & Salkind, N. (2003). Using SPSS for Windows and Macintosh. Analyzing and Understanding Data (3rd edition). New Jersey: Prentice Hall.

    This hands-on book is useful for those who want to analyze data right away. Unit 8 (pp. 237-282) covers correlational techniques. A good review of correlation is given on pages 238-246. Simple linear regression is presented beginning on page 257.

  2. Huck, S. (2000). Reading Statistics and Research (3rd edition). New York: Longman.

    This book is recommended for those interested in a more general overview of regression without the hands-on portion. Pages 565-578 are especially helpful.
 