# Section 4.1: Scatter Diagrams and Correlation

## Objectives

By the end of this lesson, you will be able to...

1. draw and interpret scatter diagrams
2. describe the properties of the linear correlation coefficient (LCC)
3. estimate the LCC based on a scatter diagram
4. compute and interpret the LCC
5. explain the difference between correlation and causation


In Chapter 3, we looked at numerically summarizing data from one variable (univariate data), but newspaper articles and studies frequently describe the relationship between two variables (bivariate data). It's this second class that we'll be focusing on in Chapter 4.

There are plenty of variables which seem to be related. The links below are articles from various news sources, all discussing relationships between two variables.

In each case, there's a response variable (GPA, newborn's health, cancer levels) whose value can be explained at least in part by a predictor variable (SAT score, proximity to highways, weight-loss pill consumption).

Remember, unless we perform a designed experiment, we can only claim an association between the predictor and response variables, not causation.

Our goal in this chapter will be to find ways to describe relationships like the one between a student's SAT score and their GPA, and to describe the strength of that relationship.

First, we need a new type of graph.

## Scatter Diagrams

Scatter diagrams are the easiest way to graphically represent the relationship between two quantitative variables. They're just x-y plots, with the predictor variable as the x and the response variable as the y.

Example 1

The data below are heart rates of students from a Statistics I class at ECC during the Spring semester of 2008. Students measured their heart rates (in beats per minute), then took a brisk walk and measured their heart rates again.

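| before | after | before | after | before | after |
|--------|-------|--------|-------|--------|-------|
| 86 | 98 | 58 | 128 | 60 | 70 |
| 62 | 70 | 64 | 74 | 80 | 92 |
| 52 | 56 | 74 | 106 | 66 | 70 |
| 90 | 110 | 76 | 84 | 80 | 92 |
| 66 | 76 | 56 | 96 | 78 | 116 |
| 80 | 96 | 72 | 82 | 74 | 114 |
| 78 | 86 | 72 | 78 | 90 | 116 |
| 74 | 84 | 68 | 90 | 76 | 94 |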

We can see that the heart rate before going on the walk is the predictor (x), and the heart rate after the walk is the response (y).
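To get a rough feel for what the scatter diagram of these data looks like, here is a minimal Python sketch that draws a text-only version without any plotting library. The function name `text_scatter` and the grid size are choices made for this illustration; StatCrunch or a real plotting package will produce a much nicer picture.

```python
# Heart rates (beats per minute) from Example 1, as (before, after) pairs.
pairs = [
    (86, 98), (62, 70), (52, 56), (90, 110), (66, 76), (80, 96), (78, 86), (74, 84),
    (58, 128), (64, 74), (74, 106), (76, 84), (56, 96), (72, 82), (72, 78), (68, 90),
    (60, 70), (80, 92), (66, 70), (80, 92), (78, 116), (74, 114), (90, 116), (76, 94),
]


def text_scatter(points, width=40, height=15):
    """Return a text scatter diagram: x runs left to right, y bottom to top."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in points:
        # Scale each value to a column/row index inside the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top line of the picture
    lines = ["|" + "".join(r) for r in grid]
    lines.append("+" + "-" * width)
    lines.append(f" before: {x_min} to {x_max} (x), after: {y_min} to {y_max} (y)")
    return "\n".join(lines)


print(text_scatter(pairs))
```

Each `*` marks one student (two students share the pair 80, 92, so they land on the same spot). The predictor goes on the horizontal axis and the response on the vertical axis, just as described above.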

Here's an excellent video showing a scatter diagram on steroids created by the BBC:

## Technology

Here's a quick overview of the steps for creating scatter diagrams in StatCrunch.

1. Select Graphics > Scatter plot.
2. Select quantitative variables for the X & Y axes.

You can also go to the video page for links to see videos in either Quicktime or iPod format.

## Types of Relationships

Not all relationships have to be linear, like the before/after heart rate data. The images below show some of the possibilities for the relationship (or lack thereof) between two variables.

[Figure: four scatter diagrams, labeled in order "Linear", "Linear", "Nonlinear", and "No relation"]

The price of a manufactured item and the profit the company gains from it, for example, do not have a linear relationship. When prices are low, sales are high, but profit is still low since very little is made from each sale. As prices increase, profits increase, but at some point, sales will start to drop, until eventually too steep of a price will drive sales down so far as to not be profitable. This might be represented by the third, "Nonlinear" image.
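To see how a curve like that can arise, here is a minimal sketch with made-up numbers. The unit cost of $5 and the demand rule (1,000 items sold at a price of $0, dropping by 50 items for every extra dollar) are assumptions chosen only for illustration; they are not data from any real company.

```python
UNIT_COST = 5.0  # hypothetical cost to make one item, in dollars


def units_sold(price):
    """Hypothetical demand: sales fall steadily as the price rises."""
    return max(0.0, 1000.0 - 50.0 * price)


def profit(price):
    """Profit = (price - cost per item) * number of items sold."""
    return (price - UNIT_COST) * units_sold(price)


for price in [5, 8, 10, 12.5, 15, 18, 20]:
    print(f"price ${price:>5}: sold {units_sold(price):6.0f}, profit ${profit(price):8.2f}")
```

Under these assumptions, profit climbs from $0 at a price of $5 to $2,812.50 at $12.50, then falls back to $0 at $20. Plotting price against profit would give a hill shape, not a line.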

## Positive and Negative Association

The next thing we need to do is somehow quantify the strength and direction of the relationship between two variables.

Here's how we'll describe the direction:

In general, we say two linearly related variables are positively associated if an increase in one is associated with an increase in the other (first "Linear" image). We say two linearly related variables are negatively associated if an increase in one is associated with a decrease in the other (second "Linear" image).

The images below show some examples of what scatter plots might look like for two positively associated variables.

[Figure: example scatter plots of positively associated variables]

And these are some examples of what scatter plots might look like for two negatively associated variables.

## The Linear Correlation Coefficient

As we can see from these examples, knowing the direction isn't enough - we need to quantify the strength of the relationship as well. What we'll use to do that is a new statistic called the linear correlation coefficient. (In this class, we'll be dealing solely with linear relationships, so we usually just call it the correlation.)

The linear correlation coefficient is a measure of the strength of the linear relationship between two variables.

$$r = \frac{\sum \left(\dfrac{x_i - \bar{x}}{s_x}\right)\left(\dfrac{y_i - \bar{y}}{s_y}\right)}{n - 1}$$

where $\bar{x}$ is the sample mean of the predictor variable, $s_x$ is the sample standard deviation of the predictor variable, $\bar{y}$ is the sample mean of the response variable, $s_y$ is the sample standard deviation of the response variable, and $n$ is the sample size.
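The definition translates into a few lines of Python. This is a minimal sketch, not how StatCrunch does it internally: it standardizes each value using the sample mean and sample standard deviation, multiplies the paired results, adds them up, and divides by n - 1. The function name `linear_correlation` and the small data set are made up for illustration.

```python
from statistics import mean, stdev  # stdev is the sample standard deviation (divides by n - 1)


def linear_correlation(xs, ys):
    """Linear correlation coefficient r of paired data xs (predictor) and ys (response)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    x_bar, s_x = mean(xs), stdev(xs)
    y_bar, s_y = mean(ys), stdev(ys)
    # Sum of products of the standardized values, one product per data pair.
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


# Made-up data: hours studied (x) and quiz score (y).
hours = [1, 2, 3, 4, 5]
score = [52, 60, 61, 70, 77]
print(round(linear_correlation(hours, score), 3))
```

Note that the sketch does not guard against a variable whose values are all identical; in that case the standard deviation is 0 and the correlation is undefined.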

I know that's quite a mouthful, but we'll be using technology to calculate it. Here's a quick summary of some of the properties of the linear correlation coefficient, as described in your text.

## Properties of the Linear Correlation Coefficient

1. The linear correlation coefficient is always between -1 and 1.
2. If r = +1, there is a perfect positive linear relation between the two variables.
3. If r = -1, there is a perfect negative linear relation between the two variables.
4. The closer r is to +1, the stronger is the evidence of positive association between the two variables.
5. The closer r is to -1, the stronger is the evidence of negative association between the two variables.
6. If r is close to 0, there is little or no evidence of a linear relation between the two variables - this does not mean there is no relation, only that there is no linear relation.

Source: Statistics: Informed Decisions Using Data
Author: Michael Sullivan III
© 2007, All rights reserved.
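A few small made-up data sets can illustrate these properties. The sketch below defines a helper `r` (a name chosen here) that applies the definition of the linear correlation coefficient, then tries it on points that lie exactly on a rising line, exactly on a falling line, and on a symmetric curve.

```python
from statistics import mean, stdev


def r(xs, ys):
    """Linear correlation coefficient computed from standardized values."""
    x_bar, s_x, y_bar, s_y = mean(xs), stdev(xs), mean(ys), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


xs = [-3, -2, -1, 0, 1, 2, 3]

rising = [2 * x + 5 for x in xs]    # every point on a line with positive slope
falling = [10 - 4 * x for x in xs]  # every point on a line with negative slope
curved = [x ** 2 for x in xs]       # a clear relation, but not a linear one

print(f"rising line:  r = {r(xs, rising):.3f}")
print(f"falling line: r = {r(xs, falling):.3f}")
print(f"curve y=x^2:  r = {r(xs, curved):.3f}")
```

Up to floating-point rounding, the first two values are +1 and -1, and the third is 0. The third case is property 6 in action: y is completely determined by x, yet r gives no evidence of a linear relation.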

Next, I'd like you to visit two web sites that offer Java applets. These will help you interact with data to get a sense of the linear correlation coefficient.

Example 2

This first applet was created for use with another textbook, Introduction to the Practice of Statistics, by David S. Moore and George P. McCabe.

The applet is designed to allow you to add your own points and watch it calculate the linear correlation coefficient for you. (There are other capabilities as well, but we'll get to those in the next section.)

Example 3

This second applet was designed as part of the Rossman/Chance Applet Collection at California Polytechnic State University.

This applet generates scatter plots for you and asks you to guess the correlation for each. Click on "New Sample" to start, enter your answer, and then "Enter" to see if you're correct.

Applet: Guess the Correlation

Example 4

Let's try to calculate a correlation ourselves. To make our data set a bit more manageable, let's use the before/after data from Example 1 in Section 4.1, but let's just use the first 8 as our sample.

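| before ($x_i$) | after ($y_i$) | $\dfrac{x_i - \bar{x}}{s_x}$ | $\dfrac{y_i - \bar{y}}{s_y}$ | product |
|----|-----|----------|----------|----------|
| 86 | 98 | 0.97865 | 0.78657 | 0.76978 |
| 62 | 70 | -0.90036 | -0.84484 | 0.76065 |
| 52 | 56 | -1.68327 | -1.66054 | 2.79514 |
| 90 | 110 | 1.29181 | 1.48575 | 1.91931 |
| 66 | 76 | -0.58719 | -0.49525 | 0.29080 |
| 80 | 96 | 0.50890 | 0.67004 | 0.34098 |
| 78 | 86 | 0.35231 | 0.08740 | 0.03079 |
| 74 | 84 | 0.03915 | -0.02913 | -0.00114 |
| | | | sum | 6.90632 |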

Using computer software, we find the following values:

$\bar{x}$ = 73.5
$s_x$ ≈ 12.77274
$\bar{y}$ = 84.5
$s_y$ ≈ 17.16308

Note: We don't want to round these values here, since they'll be used in the calculation for the correlation coefficient - only round at the very last step.

Since we have a sample size of 8, we divide the sum by 7: 6.90632 / 7 ≈ 0.987, so the correlation coefficient is about 0.99. That seems fairly high, but a scatter plot of these eight points shows why it's so strong: they fall very close to a straight line.
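The arithmetic in Example 4 can be double-checked with a short Python sketch. It recomputes each row of the table (the two standardized values and their product) with the standard library's `statistics` module, then divides the sum of the products by n - 1. The variable names are chosen here for illustration.

```python
from statistics import mean, stdev

before = [86, 62, 52, 90, 66, 80, 78, 74]   # predictor x
after = [98, 70, 56, 110, 76, 96, 86, 84]   # response y

x_bar, s_x = mean(before), stdev(before)    # sample mean and sample standard deviation
y_bar, s_y = mean(after), stdev(after)

total = 0.0
for x, y in zip(before, after):
    z_x = (x - x_bar) / s_x                 # standardized "before" value
    z_y = (y - y_bar) / s_y                 # standardized "after" value
    total += z_x * z_y
    print(f"{x:3d} {y:4d} {z_x:9.5f} {z_y:9.5f} {z_x * z_y:9.5f}")

n = len(before)
r = total / (n - 1)                         # round only at this last step
print(f"sum of products = {total:.5f}, r = {r:.4f}")
```

The last line reports a sum of about 6.90632 and r ≈ 0.9866, which rounds to the 0.99 found in the example.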

## Technology

Here's a quick overview of the steps for finding the linear correlation coefficient in StatCrunch.

1. Select Stat > Regression > Simple Linear.
2. Select the predictor variable for X & the response variable for Y.
3. Select Calculate.

Here's one for you to try.

Example 5

Researchers at General Motors collected data on 60 U.S. Standard Metropolitan Statistical Areas (SMSAs) in a study of whether air pollution contributes to mortality. The dependent (response) variable for analysis is age-adjusted mortality (called "Mortality").

The data below show the age-adjusted mortality rate (deaths per 100,000) and the sulfur dioxide pollution potential for each area. Use StatCrunch to calculate the linear correlation coefficient. Round your answer to three decimal places.

| City | Mortality\* | SO2 potential\*\* |
|------|-------------|-------------------|
| Akron, OH | 921.87 | 59 |
| Albany, NY | 997.87 | 39 |
| Allentown, PA | 962.35 | 33 |
| Atlanta, GA | 982.29 | 24 |
| Baltimore, MD | 1071.29 | 206 |
| Birmingham, AL | 1030.38 | 72 |
| Boston, MA | 934.7 | 62 |
| Bridgeport, CT | 899.53 | 4 |
| Buffalo, NY | 1001.9 | 37 |
| Canton, OH | 912.35 | 20 |
| Chattanooga, TN | 1017.61 | 27 |
| Chicago, IL | 1024.89 | 278 |
| Cincinnati, OH | 970.47 | 146 |
| Cleveland, OH | 985.95 | 64 |
| Columbus, OH | 958.84 | 15 |
| Dallas, TX | 860.1 | 1 |
| Dayton, OH | 936.23 | 16 |
| Denver, CO | 871.77 | 28 |
| Detroit, MI | 959.22 | 124 |
| Flint, MI | 941.18 | 11 |
| Fort Worth, TX | 891.71 | 1 |
| Grand Rapids, MI | 871.34 | 10 |
| Greensboro, NC | 971.12 | 5 |
| Hartford, CT | 887.47 | 10 |
| Houston, TX | 952.53 | 1 |
| Indianapolis, IN | 968.67 | 33 |
| Kansas City, MO | 919.73 | 4 |
| Lancaster, PA | 844.05 | 32 |
| Los Angeles, CA | 861.26 | 130 |
| Louisville, KY | 989.26 | 193 |
| Memphis, TN | 1006.49 | 34 |
| Miami, FL | 861.44 | 1 |
| Milwaukee, WI | 929.15 | 125 |
| Minneapolis, MN | 857.62 | 26 |
| Nashville, TN | 961.01 | 78 |
| New Haven, CT | 923.23 | 8 |
| New Orleans, LA | 1113.16 | 1 |
| New York, NY | 994.65 | 108 |
| Philadelphia, PA | 1015.02 | 161 |
| Pittsburgh, PA | 991.29 | 263 |
| Portland, OR | 893.99 | 44 |
| Providence, RI | 938.5 | 18 |
| Reading, PA | 946.19 | 89 |
| Richmond, VA | 1025.5 | 48 |
| Rochester, NY | 874.28 | 18 |
| St. Louis, MO | 953.56 | 68 |
| San Diego, CA | 839.71 | 20 |
| San Francisco, CA | 911.7 | 86 |
| San Jose, CA | 790.73 | 3 |
| Seattle, WA | 899.26 | 20 |
| Springfield, MA | 904.16 | 20 |
| Syracuse, NY | 950.67 | 25 |
| Toledo, OH | 972.46 | 25 |
| Utica, NY | 912.2 | 11 |
| Washington, DC | 967.8 | 102 |
| Wichita, KS | 823.76 | 1 |
| Wilmington, DE | 1003.5 | 42 |
| Worcester, MA | 895.7 | 8 |
| York, PA | 911.82 | 49 |
| Youngstown, OH | 954.44 | 39 |

\* Age-adjusted mortality (deaths per 100,000)

\*\* Sulfur dioxide pollution potential

Source: StatLib


Rounded to three decimal places, r ≈ 0.426.
