Data Snooping 101

The purpose of this note is to look at how we can analyze data that we believe might be described using either a linear model or an exponential model. We are interested in three kinds of models:

We are often interested in fitting one of these families of models to a set of data -- that is, in seeing whether one of these models can describe a relationship between two variables in a data set. The example that illustrates this discussion is a data set of world records for the one mile run, given both as a table and as a live figure showing the same data as a scatter plot. We would like to describe this data using a model of the form:

record = m * (year - 1860) + b

This is a variation on the form we used above for a linear model. The parameter m still denotes the slope, but now the parameter b denotes the model's value for the world record for the one mile run in the year 1860. We use this form because the left edge of the graph is at the year 1860. Use the parameter controls to change the values of the two parameters, m and b, so that the linear function is as close as possible to the data.
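
As a minimal sketch of this form, the Python function below evaluates the model for a given year. The name predicted_record and the trial values m_trial and b_trial are hypothetical, chosen only for illustration (they are not fitted to the data), and records are assumed to be measured in seconds.

```python
def predicted_record(year, m, b):
    """Evaluate the linear model record = m * (year - 1860) + b."""
    return m * (year - 1860) + b


# Hypothetical trial values for illustration only, not fitted to the data.
m_trial = -0.3   # assumed slope, in seconds per year
b_trial = 270.0  # assumed model value for the year 1860, in seconds

# At year 1860 the term (year - 1860) is zero, so the model returns b.
value_at_1860 = predicted_record(1860, m_trial, b_trial)

# Moving one year forward changes the model value by the slope m.
one_year_change = predicted_record(1861, m_trial, b_trial) - value_at_1860

print(value_at_1860)
print(one_year_change)
```

Trying different values of m_trial and b_trial here plays the same role as moving the parameter controls.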

Now we want to look at how we can estimate the slope -- that is, the value of the parameter m -- by looking at the data. This process is often called data snooping. We will work with the form:

y = m * x + b

Suppose that (x1, y1) and (x2, y2) are two points on the graph of this function. That is,

y1 = m * x1 + b

and

y2 = m * x2 + b

Subtracting the first equation from the second eliminates b and gives y2 - y1 = m * (x2 - x1). Dividing by x2 - x1 then determines the value of m:

m = (y2 - y1) / (x2 - x1)

Our data does not usually give us points that are exactly on this line but it does give us points that are close to the line. Thus, using two data points we can get an estimate that is close to the exact value of m.
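
The two-point estimate can be sketched in Python as follows. This is only an illustration: the function name two_point_slope and the sample points are hypothetical, and the points were made up to lie on or near the line y = 2x + 1.

```python
def two_point_slope(p1, p2):
    """Estimate the slope m of y = m * x + b from two data points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("the two points must have different x values")
    return (y2 - y1) / (x2 - x1)


# Made-up points exactly on the line y = 2x + 1 recover the slope exactly.
exact = two_point_slope((0.0, 1.0), (3.0, 7.0))

# Made-up points that are only close to that line give an estimate close to 2.
noisy = two_point_slope((0.0, 1.1), (3.0, 6.8))

print(exact)
print(noisy)
```

The first call returns the exact slope because both points satisfy the equation of the line. The second call returns a value near 2 but not equal to it, which is the situation we face with real data.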

Example:

We will use the world record for the mile run in 1880 and the world record for the mile run in 1945 as our two data points. Thus,

(x1, y1) = (1880, 4:23.2)

and

(x2, y2) = (1945, 4:01.4)

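This leads to the estimate

m = (241.4 - 263.2) / (1945 - 1880) = -21.8 / 65, or approximately -0.34 seconds per year

where the times have been converted to seconds (4:23.2 is 263.2 seconds and 4:01.4 is 241.4 seconds).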
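
The arithmetic for this example can be checked with a short Python sketch. The helper name time_to_seconds is hypothetical, and it assumes that every time is written as minutes:seconds, as in the two data points above.

```python
def time_to_seconds(text):
    """Convert a time written as 'minutes:seconds' (e.g. '4:23.2') to seconds."""
    minutes, seconds = text.split(":")
    return 60 * int(minutes) + float(seconds)


# The two data points used in the example.
x1, y1 = 1880, time_to_seconds("4:23.2")
x2, y2 = 1945, time_to_seconds("4:01.4")

# Two-point estimate of the slope, in seconds per year.
m_estimate = (y2 - y1) / (x2 - x1)

print(round(m_estimate, 2))  # prints -0.34
```

Converting both times to seconds before subtracting avoids mistakes with the minutes and seconds columns.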


Questions:

  1. Find another estimate for m using the world record for the mile run in 1874 and in 1875. Compare this estimate with the preceding one. Which estimate do you think is better? Why?

  2. Find another estimate of m using the world record for the mile run in 1865 and in 1999. Be careful when you find the difference in the numerator. The times are given in minutes and seconds, so subtraction is a bit tricky.

  3. Notice that choosing different points generally results in different estimates for m. Which choices are more likely to result in better estimates? Justify your answer.

  4. After estimating the value of the parameter m as described above, it is easy to estimate the value of the parameter b. Explain how.

  5. We often use the form y = m(x - a) + b for the family of linear models. With this form m is still called the slope and b is the value of y when x = a. Show that with this form we can still use the formula

    m = (y2 - y1) / (x2 - x1)

    to estimate the value of the slope. This shows that we are justified in using the same letter and the same terminology for m in both forms.


Linear models are appropriate when the change in the dependent variable, y, is a constant multiple (namely, the slope m) of the change in the independent variable, x. We use words like -- "The world record for the one mile run is changing at the constant rate of -0.34 seconds per year." In many cases the change in the dependent variable cannot be described in this way. For many common situations words like -- "The level of pollution is dropping at the constant rate of 7 percent per year" -- are more accurate. In these situations we use simple exponential models. We look at data snooping in these situations next.
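
The difference between the two kinds of change can be illustrated with a small Python sketch. The function names and the starting values (250.0 seconds and 100.0 pollution units) are hypothetical and chosen only for illustration. The rates of -0.34 per year and 7 percent per year are the ones quoted above.

```python
def linear_values(start, rate, years):
    """Values that change by the same amount `rate` every year."""
    return [start + rate * t for t in range(years + 1)]


def exponential_values(start, percent, years):
    """Values that change by the same percentage every year."""
    factor = 1 + percent / 100
    return [start * factor ** t for t in range(years + 1)]


# Hypothetical starting values, for illustration only.
linear = linear_values(250.0, -0.34, 3)
exponential = exponential_values(100.0, -7.0, 3)

# Year-to-year differences are constant for the linear values.
differences = [b - a for a, b in zip(linear, linear[1:])]

# Year-to-year ratios are constant for the exponential values.
ratios = [b / a for a, b in zip(exponential, exponential[1:])]

print(differences)
print(ratios)
```

Up to floating-point rounding, every difference is -0.34 and every ratio is 0.93. A constant difference is the signature of a linear model, and a constant ratio is the signature of a simple exponential model.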
