
Dealing with Data: A 'Simple' Linear Fit

Author(s): Larry Gladney and Dennis DeTurck

[Note: The activities in this module refer to the computer algebra system (CAS) Maple. Any other CAS (e.g., Mathematica or Mathcad) can be used instead, as long as the user is familiar with it. In other words, Maple is preferred here but not required.]

Larry Gladney is Associate Professor of Physics and Dennis DeTurck is Professor of Mathematics, both at the University of Pennsylvania.

The first true test of any scientific theory is whether or not people can use it to make accurate predictions. Calculus, being the study of quantities that change, provides the language and the mathematical tools to discuss and understand change in a precise, quantitative way. An important prerequisite to using calculus to analyze "real-world" situations is having a good understanding of the basic "elementary" functions: polynomials, logarithms, trigonometric functions, and all their compositions, inverses, etc.

With an understanding of the calculus of the basic functions, it is often possible to formulate a mathematical model of (an idealized version of) a phenomenon in one of two ways: First, enough might be understood about the phenomenon so that a mathematical formulation of it is directly attainable. For example, Newton's second law of motion -- force is the derivative of momentum, where momentum is the product of mass and velocity -- is such a model. At the other extreme are models which are derived purely empirically -- data are collected, and one searches for an appropriate formula to match the data with reasonable accuracy. Many economic models are derived in this manner.

More often, however, mathematical models are developed with a combination of the two approaches: one has some basic understanding of a phenomenon, enough to restrict the class of functions appropriate to model it. Very often, one knows enough so that the functions are determined except for a few parameters, such as the coefficients of a polynomial, or some other kind of multiplicative factor. Then, experimental data are used to determine the values of the missing parameters. Many of the "constants", "coefficients" and "numbers" one encounters in science (e.g., rate constants of chemical reactions, half-lives of radioactive elements, coefficients of thermal conductivity, the gravitational constant, etc.) started out as the last unknown parameters in a mathematical model, which had to be determined by collecting experimental data.

Linear fits: In many situations, researchers want to understand how one quantity changes when another is varied. A simple example is the following sports-physics experiment: A basketball is dropped from different heights, and the height of the first bounce is measured each time. What is the relationship between the height of the drop and the height of the bounce?

We can try a simple mathematical experiment to look at the problem. Below is an interactive program that lets you enter several data points (possibly non-physical) representing the starting heights and subsequent bounce heights of a basketball. Enter the x,y values for any points you like, then use the mouse to click on any two positions inside the graph area. The program draws a line between the two points and marks the endpoints and the midpoint with circles. By clicking near the center of any of the circles, you can drag the line around. As you do so, you will see a display of the distance from each data point to the line at its closest approach. At the bottom you will also see a number that characterizes how "badly" the line fits the points. The smaller the "badness" number, the better the line should appear to represent the points. Try it! [This applet no longer functions. Ed.]

  

Dr. DeTurck collected the following data by dropping a basketball in his garage. After each drop, he measured the height of the first bounce:

Height of Drop (in.)    Height of 1st Bounce (in.)
36                      25
40                      29
40                      28.5
44                      31.5
44                      32
48                      35
52                      38
56                      42
60                      46

To get ready for our subsequent analysis, we use Maple to make a list of the drop heights and the corresponding bounce heights:

# Make ordered lists of the drop heights and the bounce heights (in inches)
drop:=[36,40,40,44,44,48,52,56,60]:
bounce:=[25,29,28.5,31.5,32,35,38,42,46]:

The square brackets indicate to Maple that the set of numbers is an ordered list. The two statements end with colons, rather than semicolons, so that there will be no output from them (because in this case, Maple would just parrot back the input).

It will be helpful to have Maple make what statisticians call a "scatter plot" of the data points. To plot points, Maple expects a list of points, each written as a pair [x, y]: the x- and y-coordinates of the first point, then those of the second point, and so on. To transform our drop and bounce lists into such ordered pairs, we enter the following command (this is pretty advanced Maple-speak, so don't worry if you wouldn't have thought of it):

points:=convert(linalg[transpose]([drop,bounce]),listlist);

This defines the variable points to be the list of points we want to plot. When you type this statement, distinguish carefully between parentheses and square brackets.
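If the convert/transpose idiom feels opaque, the same list of pairs can be built more directly. The following is a minimal sketch, not the authors' method. It assumes only the drop and bounce lists defined above (repeated here so the block stands alone), and the name points2 is introduced purely for illustration.

```maple
# Drop and bounce heights in inches, repeated from the text
drop := [36,40,40,44,44,48,52,56,60]:
bounce := [25,29,28.5,31.5,32,35,38,42,46]:

# Build the [x, y] pairs one index at a time; nops counts the list entries.
# points2 is a hypothetical name, chosen so that points is not overwritten.
points2 := [seq([drop[i], bounce[i]], i = 1 .. nops(drop))];
```

The first entry, points2[1], should be [36, 25], and the list can be passed to plot in the same way as points.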

The variable points has each drop height right next to its corresponding bounce height, ready for use with plot. So try plotting the data ("style=POINT" keeps Maple from connecting the dots, and "symbol=cross" marks each point with a cross):

plot(points, style=POINT, symbol=cross);

 

The data look pretty linear, but how do we find the line that "best" describes them? There are several different definitions of "best" in use. We will be using the so-called "least-squares" fit, which chooses the line that minimizes the sum of the squares of the vertical distances between the data points and the line. For our drop-bounce data, the least-squares line is obtained as follows:

# Load the least-squares routine from the stats package, then fit y = a*x + b
with(stats,fit);
with(fit,leastsquare);
leastsquare[[x,y],y=a*x+b]([drop,bounce]);

Maple's result, which a good fit from our interactive program should approximate, is

y = .8502604294 x - 5.567708941.
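To see where these numbers come from, the slope and intercept can also be computed directly from the standard closed-form least-squares formulas, without the stats package. This is a sketch under stated assumptions, not the module's own method. The names n, Sx, Sy, Sxx, Sxy, afit, bfit, and sumsq are introduced here for illustration, and sumsq takes the misfit to be the sum of squared vertical distances. The names a and b are deliberately left unassigned so that the leastsquare command above still works.

```maple
# Drop and bounce heights in inches, repeated from the text
drop := [36,40,40,44,44,48,52,56,60]:
bounce := [25,29,28.5,31.5,32,35,38,42,46]:

# Sums that appear in the least-squares formulas
n   := nops(drop):
Sx  := add(drop[i], i = 1 .. n):
Sy  := add(bounce[i], i = 1 .. n):
Sxx := add(drop[i]^2, i = 1 .. n):
Sxy := add(drop[i]*bounce[i], i = 1 .. n):

# Closed-form slope and intercept of the line y = afit*x + bfit
afit := (n*Sxy - Sx*Sy)/(n*Sxx - Sx^2);
bfit := (Sy - afit*Sx)/n;

# Sum of squared vertical distances from the data to the line y = a*x + b
sumsq := proc(a, b) local i;
    add((bounce[i] - (a*drop[i] + b))^2, i = 1 .. nops(drop))
end proc:

sumsq(afit, bfit);          # misfit of the least-squares line
sumsq(afit + 0.01, bfit);   # a slightly steeper line
sumsq(afit, bfit + 0.5);    # the same line shifted upward
```

The values of afit and bfit should agree with the equation above to about six decimal places (small differences come from floating-point roundoff), and the last two sums should both be larger than the first.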

Now we can plot the data and the line to see how well Maple did with fitting the data. Since we want to combine two different kinds of plots, we will be using the display command.

First, let's assign a name to the equation Maple returned.

fitlin := .8502604294*x - 5.567708941;

Now we can plot both the line and the data and store those plots as variables. The names stand for what Maple calls "plot structures", which are Maple's internal directions for making plots. It is very important to use colons at the end of statements that assign plots to names. You definitely do not want to see the plot structures!

fitplot := plot(fitlin,x=35..60):
pointplot := plot(points, style=POINT, symbol=cross):

We can now display the plots we've saved. Before we do that, we have to have Maple load the display command.

with(plots,display):
display({fitplot,pointplot});


Problem 1. The following table (from the World Almanac) gives the winning heights (in inches) in the Olympic pole vault:
 

Year    Height (in.)    Year    Height (in.)
1896    130             1952    179
1900    130             1956    179.5
1904    137.75          1960    185
1908    146             1964    200.75
1912    155.5           1968    212.5
1920    161             1972    216.5
1924    155.5           1976    216.5
1928    165.25          1980    227.5
1932    169.75          1984    226.25
1936    171.25          1988    237.25
1948    169.25          1992    228.25

Fit these data with a least-squares line. What interpretation do you give to the slope of your line? Using your linear model, predict the winning height in the 2000 Olympics and in the 2096 Olympics. According to your model, in what year will pole vaulters be able to "leap tall buildings in a single bound"? (The Empire State Building is 1250 feet tall.)

Comment on the reasonableness of your model (including comments about the residuals).
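The prediction questions in Problem 1 come down to two operations on a fitted line: evaluating it at a given x, and solving for the x at which it reaches a given y. Here is a sketch of both, using the basketball line from the text rather than the pole-vault data, so that the problem itself is left to you. The sample values of 50 and 40 inches are arbitrary choices made here for illustration.

```maple
# Least-squares line for the basketball data, as found in the text
fitlin := .8502604294*x - 5.567708941:

# Predicted bounce height (in.) for a 50-inch drop
eval(fitlin, x = 50);

# Drop height (in.) at which the predicted bounce reaches 40 inches
solve(fitlin = 40, x);
```

Keep the units in mind when you adapt this to Problem 1: the table is in inches, while the building height is given in feet.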

Problem 2. In a physics experiment, students measure the period of a pendulum (i.e., the amount of time the pendulum takes to swing back and forth) as a function of its length. One group of students obtained the following data:
 

Length (cm)    Period (sec)    Length (cm)    Period (sec)
6.5            0.51            24.4           1.01
11.0           0.67            26.5           1.08
13.2           0.73            30.6           1.13
15.0           0.79            34.3           1.25
18.1           0.89            37.5           1.28
23.0           0.98            41.5           1.33

As you did in the first problem, find the least-squares line that best fits these data. Compute and plot the residuals -- these are the differences between the measured values of the period and the values predicted by the least-squares equation for each measurement. Explain why these indicate that a different model is needed.
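The residual computation described here can be sketched as follows, again on the basketball data so that the pendulum analysis remains an exercise. The names res and respoints are assumptions introduced for illustration.

```maple
# Basketball data and fitted line, repeated from the text
drop := [36,40,40,44,44,48,52,56,60]:
bounce := [25,29,28.5,31.5,32,35,38,42,46]:
fitlin := .8502604294*x - 5.567708941:

# Residual = measured bounce minus the bounce predicted by the line
res := [seq(bounce[i] - eval(fitlin, x = drop[i]), i = 1 .. nops(drop))];

# Pair each drop height with its residual and plot the pairs as points
respoints := [seq([drop[i], res[i]], i = 1 .. nops(drop))]:
plot(respoints, style=POINT, symbol=cross);
```

As a general guide, residuals that scatter without pattern about zero suggest that a line is an adequate model, while a systematic pattern suggests that it is not.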


 
Published July 2001
© 2001 by Larry Gladney and Dennis DeTurck

Larry Gladney and Dennis DeTurck, "Dealing with Data: A 'Simple' Linear Fit," Convergence (November 2004)