Figure 1. Columbia River velocity vs. depth

Modeling data is an important mathematical process for analyzing scientific information. Scientists investigate systems by collecting data and producing a mathematical model to try to understand the system.
Measurements in many natural systems can produce data with considerable noise or scatter. Getting students into this modeling process involves a number of steps that have been incorporated in many mathematics textbooks from algebra through calculus, as recommended by AMATYC (2004) and NCTM (2000). At the beginning level, mathematics textbooks have done a much better job at introducing modeling than the science textbooks.
Scott A. Sinex is Professor and Chair of Physical Sciences and Engineering at Prince George's Community College.
Assessing goodness of fit comes into the picture as students start to model data. The basic measures of the goodness of fit are the coefficient of determination and the residual. The coefficient of determination r ^{2} is the fraction of the yvariable that is explained by the variation of the xvariable; it ranges from 0 to 1. The residual is the difference between the actual ydatum and the yvalue calculated from the regression equation.
How do we address these measures of goodness of fit using technology available to almost everyone? Graphing calculators and spreadsheet applications such as Excel calculate r ^{2} and easily produce residuals plots. In this article, I show how to use an interactive Excel spreadsheet to help students discover the goodness of fit concept and develop their analysis and interpretation skills. Discovery learning follows the recommendations of NSTA (2001).
For the graph in Figure 2, ask your students, "Does this show a trend of the ozone concentration over time?" Too often students do not know how to deal with scatter in the data and presume that scatter eliminates the possibility of a trend; hence, they may respond that there is no trend. Figure 3 shows the same data with the regression line and its equation.
Figure 1. Columbia River velocity vs. depth

Modeling data is an important mathematical process for analyzing scientific information. Scientists investigate systems by collecting data and producing a mathematical model to try to understand the system.
Measurements in many natural systems can produce data with considerable noise or scatter. Getting students into this modeling process involves a number of steps that have been incorporated in many mathematics textbooks from algebra through calculus, as recommended by AMATYC (2004) and NCTM (2000). At the beginning level, mathematics textbooks have done a much better job at introducing modeling than the science textbooks.
Scott A. Sinex is Professor and Chair of Physical Sciences and Engineering at Prince George's Community College.
Assessing goodness of fit comes into the picture as students start to model data. The basic measures of the goodness of fit are the coefficient of determination and the residual. The coefficient of determination r ^{2} is the fraction of the yvariable that is explained by the variation of the xvariable; it ranges from 0 to 1. The residual is the difference between the actual ydatum and the yvalue calculated from the regression equation.
How do we address these measures of goodness of fit using technology available to almost everyone? Graphing calculators and spreadsheet applications such as Excel calculate r ^{2} and easily produce residuals plots. In this article, I show how to use an interactive Excel spreadsheet to help students discover the goodness of fit concept and develop their analysis and interpretation skills. Discovery learning follows the recommendations of NSTA (2001).
For the graph in Figure 2, ask your students, "Does this show a trend of the ozone concentration over time?" Too often students do not know how to deal with scatter in the data and presume that scatter eliminates the possibility of a trend; hence, they may respond that there is no trend. Figure 3 shows the same data with the regression line and its equation.
I provide the basics of developing a computational interactive Excel spreadsheet, including an Excel tutorial, in Sinex (2004). In this example the new twist is the use of the RANDBETWEEN function (available in the function editor) to add noise or scatter to the data. (See the Microsoft Excel Help, and enter the keyword “randbetween” for instructions to load this function, which is part of the Analysis ToolPak.)
This function returns a random number between two set limits placed in parentheses, such as RANDBETWEEN (10, 10). The RANDBETWEEN function is a perfect way to add noise, as random variation, to the data, and the noise is recalculated each time the function is applied. The effect or size of the error can be controlled by an adjustable variable ($H$2 in Figure 4), controlled by the scroll bar and multiplied by the RANDBETWEEN function. Figure 4 shows a sample formula.
I use the interactive spreadsheet scatter.xls to have students discover some basic characteristics of scatter in data and how they influence the goodness of fit. The spreadsheet includes four separate worksheets to explore scatter, outliers, and curvature. My mode of questioning with students is embedded in the following discussion as the worksheets are introduced. The individual worksheets are keyed to the tabs at the bottom of the Excel screen as shown in Figure 5.
The spreadsheet also includes introduction of residuals, since the value of r^{2} fails to take into account the effect of minor curvature in data. I have placed Comment boxes within the worksheets to explain the various operations. These comment boxes are denoted by a little red triangle in upper right corner of a cell; if you place the cursor on the cell, the comments appear.
I also provide a sample student activity.
If you have not already opened the scatter.xls spreadsheet, please do so now. Then click on the Scatter tab. The worksheet should look something like Figure 6.
This worksheet allows students to vary the scatter (or noise) level, by adjusting the scroll bar or by clicking on the arrows, to see how the slope and intercept of line respond to the addition of scatter to the data, while monitoring the value of r^{2}. Here the r^{2} value is treated as a mystery variable. Getting science students to use the language of mathematics is also a hurdle to overcome here, but it should not be avoided. How does r^{2} vary with the scatter of the data about the regression line? Students should quickly discover that as scatter increases the value of r^{2} decreases. (Because the scatter cannot be set to zero in this worksheet, they will only see r^{2} get close to 1, not actually reach it.)
The use of r^{2} as a measure of the goodness of fit is realistic because it describes the fraction (or if multiplied by 100, the percentage) of the yvariable that is explained by the variation of the xvariable. You can also discuss statistical significance at this point (For 10 data points, an r^{2} > 0.795 is needed at the 95% confidence level to have significance.)
The Outlier worksheet (see Figure 7) starts with a perfect set of data for the y = x line (no scatter). What is the value of r^{2} for a perfect fit? Student now vary the scatter of the last datum point shown in red on the graph. How does it influence the regression line? There is a drastic influence on the line as a whole, especially to the value of the slope. This is the leverage effect, and it can be explored further using the downloadable JAVA simulation of Lohninger et al. (undated) or the Putting Points applet at the University of Illinois at UrbanaChampaign Department of Statistics.
The Outlier II tab is a repeat of the Outlier worksheet, except that the scatter is introduced to a midrange datum point. Vary the scatter of the datum point shown in red on the graph. How does the scatter influence the regression line? Here the effect is, for the most part, on the intercept until large scatter occurs.
Overall, outliers can have a large effect on your model results. How is your ability to make a prediction when a model is produced containing an outlier? Think about the models produced with and without the outlier. As you have seen with the Outlier and Outlier II worksheets, the predictive capabilities of the model can be very distorted or biased by the presence of an outlier.
Consider this assessment question: If the point indicated in Figure 9 with the arrow were to be measured again and found to be (1.4, 0.10) instead of (1.4, 0.20), how would the regression line respond?
The answer is shown in Figure 10.
The fourth worksheet is similar to the scatter worksheet but now permits the slope to be changed by entering a value in the yellow box. (See Figure 11, and click on the image to see a fullsize picture.)
A residuals plot is also included in this worksheet (Figure 12).
Both scatter and curvature of the data are introduced in the generated data, the latter by an x^{2} term in the y column. Explore this worksheet one variable at a time. How does the residuals plot behave? If a small amount of curvature occurs, would the value of r^{2} alert you to it?
Using only r^{2}, you will not discover the presence of curvature. It is seen only in the residuals plot, as in Figure 13, which shows a nice trend.
Ideally, residual plots should show a random distribution of points, i.e., no trends, and the sum of the squared residuals should be kept to a minimum. Now, with a contribution of curvature, as seen in the residuals plot (or in Figure 13), increase the noise/scatter. What happens? You find that a nonlinear model can be very difficult to uncover if scatter is apparent in the data. So, some system models can start out linear and, as measurement techniques for these systems improve, may evolve into nonlinear models.
All too often, the statement “it’s a lousy datum point  let’s just throw it out” is made in reference to an unusual datum value. This is not the proper way to handle a potential outlier. If you have recentlycollected experimental data, you can repeat the measurement or experiment to try to determine if the outlier was a fluke. Is the outlier a real anomalous datum point? Some outliers are simply data recording errors. But, for example, with annual stream discharge (volume/sec), you can have a very wet or dry year that stands out from more typical or average years. It just pays to be cautious with outliers and use methods that deal with them, such as running averages. See Motulsky (2002) for more information on dealing with outliers, including the Grubbs' test for detecting outliers on replicates of measurement.
Could it be that your linear model is breaking down? Some linear models do not hold at high xvalues. You can see how some calibration curves in chemistry start to develop curvature at high concentrations by using the Beer’s Law Simulator (click on the “Bad Cal” tab). Points at higher concentrations might appear to be outliers.
Another example illustrating emerging curvature is the "lightweighting" of the aluminum can, as shown in Figure 14. The possible future mass of the can is shown in Figure 15.
Students can use the scatter.xls spreadsheet to discover the influence of scatter or noise on the goodness of fit through a directed mode of questioning. Many realworld data sets, especially environmental data, can show considerable scatter. You and your students can explore the many data sets of Langkamp and Hull (2003).
We need to expose students to a wider range of data and even experimental situations where noise can be a problem. The use of r^{2} and residuals are good first steps that take advantage of available technology for students just starting to explore linear models. Ideal models are great as an early component of learning. However, examining real data puts the process of modeling into the real world of scientific study. This is just the rudimentary beginning of mathematical modeling, but it introduces students to science process skills and the thrill of discovery. Further limitations and other measures of the goodness of fit can be introduced to students as other nonlinear models are encountered.
AMATYC (2004) Beyond Crossroads Draft Version 6.0, American Mathematical Association of TwoYear Colleges (accessed 18 March 2005)
Langkamp, G., and J. Hull (2003) ClassroomReady Data Sets in Environmental Mathematics, Journal of Online Mathematics and its Applications 3 (accessed 18 March 2005)
Lohninger, H., et al. (undated) Learning by Simulations , Electronic Media Group, Institute of Chemical Technology and Analytics, Vienna University of Technology, Austria (accessed 18 March 2005)
Motulsky, H. (2002) Detecting Outliers , GraphPad Software (accessed 18 March 2005)
NCTM (2000) Principles and Standards for School Mathematics , National Council of Teachers of Mathematics (accessed 18 March 2005)
NSTA (2001) College Pathways to the Science Education Standards, National Science Teachers Association (accessed 18 March 2005)
Sinex, S. A. (2004) An Interactive HigherOrder Thinking Tool, TechLearning Educator's Outlook (accessed 18 March 2005)
Scatter Excel spreadsheet, used on pages 38
Constructing an Interactive Excel Spreadsheet tutorial from Sinex (2004), referenced on page 2 (accessed 18 March 2005)
Sample student activity using the Scatter spreadsheet  see page 3 (accessed 18 March 2005)
Beer's Law Simulator, referenced on page 10 (accessed 18 March 2005)
QELP data sets, Quantitative Environmental Learning Project  see also Langkamp and Hull (2003) and pages 1 and 11 (accessed 18 March 2005)
Leverage Effect in Regression , downloadable Java simulation, #2003 in Lohninger et al. (undated), referenced on page 5 (accessed 18 March 2005)
Putting Points Java applet, Department of Statistics, University of Illinois at UrbanaChampaign, referenced on page 5 (accessed 18 March 2005)
Grubbs' Test, online calculator for outlier detection, Graphpad Software QuickCalcs  see also Motulsky (2002) and page 9 (accessed 18 March 2005)
Detecting Outliers with Grubbs' Test, interactive Excel spreadsheet to accompany Motulsky (2002), Graphpad Software  see page 9 (accessed 18 March 2005)