Exploring the Goodness of Fit in Linear Models

Author(s): 
Scott A. Sinex

 
 Figure 1. Columbia River velocity vs. depth
Source: QELP Data Set 011
How well does the linear mathematical model fit the data shown in Figure 1? How will a novice learner approach this question?

Modeling data is an important mathematical process for analyzing scientific information. Scientists investigate systems by collecting data and producing a mathematical model to try to understand the system.

Measurements in many natural systems can produce data with considerable noise or scatter. Getting students into this modeling process involves a number of steps that have been incorporated in many mathematics textbooks from algebra through calculus, as recommended by AMATYC (2004) and NCTM (2000). At the beginning level, mathematics textbooks have done a much better job at introducing modeling than the science textbooks.

Scott A. Sinex is Professor and Chair of Physical Sciences and Engineering at Prince George's Community College.

Assessing goodness of fit comes into the picture as students start to model data. The basic measures of the goodness of fit are the coefficient of determination and the residual. The coefficient of determination r 2 is the fraction of the y-variable that is explained by the variation of the x-variable; it ranges from 0 to 1. The residual is the difference between the actual y-datum and the y-value calculated from the regression equation.

How do we address these measures of goodness of fit using technology available to almost everyone? Graphing calculators and spreadsheet applications such as Excel calculate r 2 and easily produce residuals plots. In this article, I show how to use an interactive Excel spreadsheet to help students discover the goodness of fit concept and develop their analysis and interpretation skills. Discovery learning follows the recommendations of NSTA (2001).

For the graph in Figure 2, ask your students, "Does this show a trend of the ozone concentration over time?" Too often students do not know how to deal with scatter in the data and presume that scatter eliminates the possibility of a trend; hence, they may respond that there is no trend. Figure 3 shows the same data with the regression line and its equation.


Figure 2. Ozone at Halley Bay (1956-2000)
Source: British Antarctic Survey

 


Figure 3. Ozone at Halley Bay (1956-2000), with regression line
Source: British Antarctic Survey

Published March, 2005
© 2005, Scott A. Sinex

Exploring the Goodness of Fit in Linear Models - Introduction

Author(s): 
Scott A. Sinex

 
 Figure 1. Columbia River velocity vs. depth
Source: QELP Data Set 011
How well does the linear mathematical model fit the data shown in Figure 1? How will a novice learner approach this question?

Modeling data is an important mathematical process for analyzing scientific information. Scientists investigate systems by collecting data and producing a mathematical model to try to understand the system.

Measurements in many natural systems can produce data with considerable noise or scatter. Getting students into this modeling process involves a number of steps that have been incorporated in many mathematics textbooks from algebra through calculus, as recommended by AMATYC (2004) and NCTM (2000). At the beginning level, mathematics textbooks have done a much better job at introducing modeling than the science textbooks.

Scott A. Sinex is Professor and Chair of Physical Sciences and Engineering at Prince George's Community College.

Assessing goodness of fit comes into the picture as students start to model data. The basic measures of the goodness of fit are the coefficient of determination and the residual. The coefficient of determination r 2 is the fraction of the y-variable that is explained by the variation of the x-variable; it ranges from 0 to 1. The residual is the difference between the actual y-datum and the y-value calculated from the regression equation.

How do we address these measures of goodness of fit using technology available to almost everyone? Graphing calculators and spreadsheet applications such as Excel calculate r 2 and easily produce residuals plots. In this article, I show how to use an interactive Excel spreadsheet to help students discover the goodness of fit concept and develop their analysis and interpretation skills. Discovery learning follows the recommendations of NSTA (2001).

For the graph in Figure 2, ask your students, "Does this show a trend of the ozone concentration over time?" Too often students do not know how to deal with scatter in the data and presume that scatter eliminates the possibility of a trend; hence, they may respond that there is no trend. Figure 3 shows the same data with the regression line and its equation.


Figure 2. Ozone at Halley Bay (1956-2000)
Source: British Antarctic Survey

 


Figure 3. Ozone at Halley Bay (1956-2000), with regression line
Source: British Antarctic Survey

Published March, 2005
© 2005, Scott A. Sinex

Exploring the Goodness of Fit in Linear Models - The Excel RANDBETWEEN Function

Author(s): 
Scott A. Sinex

I provide the basics of developing a computational interactive Excel spreadsheet, including an Excel tutorial, in Sinex (2004). In this example the new twist is the use of the RANDBETWEEN function (available in the function editor) to add noise or scatter to the data. (See the Microsoft Excel Help, and enter the keyword “randbetween” for instructions to load this function, which is part of the Analysis ToolPak.)

This function returns a random number between two set limits placed in parentheses, such as RANDBETWEEN (-10, 10). The RANDBETWEEN function is a perfect way to add noise, as random variation, to the data, and the noise is recalculated each time the function is applied. The effect or size of the error can be controlled by an adjustable variable ($H$2 in Figure 4), controlled by the scroll bar and multiplied by the RANDBETWEEN function. Figure 4 shows a sample formula.


Figure 4. Sample Excel formula using RANDBETWEEN
to add noise to a linear function

 

Exploring the Goodness of Fit in Linear Models - The SCATTER Spreadsheet

Author(s): 
Scott A. Sinex

I use the interactive spreadsheet scatter.xls to have students discover some basic characteristics of scatter in data and how they influence the goodness of fit.  The spreadsheet includes four separate worksheets to explore scatter, outliers, and curvature. My mode of questioning with students is embedded in the following discussion as the worksheets are introduced. The individual worksheets are keyed to the tabs at the bottom of the Excel screen as shown in Figure 5. 


Figure 5. Tabbed worksheets within the scatter.xls spreadsheet

The spreadsheet also includes introduction of residuals, since the value of r2 fails to take into account the effect of minor curvature in data.  I have placed Comment boxes within the worksheets to explain the various operations. These comment boxes are denoted by a little red triangle in upper right corner of a cell; if you place the cursor on the cell, the comments appear.

I also provide a sample student activity.

Exploring the Goodness of Fit in Linear Models - Introducing Noise or Scatter: The Scatter Tab

Author(s): 
Scott A. Sinex

If you have not already opened the scatter.xls spreadsheet, please do so now. Then click on the Scatter tab. The worksheet should look something like Figure 6.


Figure 6. The Scatter worksheet
(Click on image for larger view.)

This worksheet allows students to vary the scatter (or noise) level, by adjusting the scroll bar or by clicking on the arrows, to see how the slope and intercept of line respond to the addition of scatter to the data, while monitoring the value of r2. Here the r2 value is treated as a mystery variable. Getting science students to use the language of mathematics is also a hurdle to overcome here, but it should not be avoided. How does r2 vary with the scatter of the data about the regression line? Students should quickly discover that as scatter increases the value of r2 decreases. (Because the scatter cannot be set to zero in this worksheet, they will only see r2 get close to 1, not actually reach it.)

The use of r2 as a measure of the goodness of fit is realistic because it describes the fraction (or if multiplied by 100, the percentage) of the y-variable that is explained by the variation of the x-variable. You can also discuss statistical significance at this point (For 10 data points, an r2 > 0.795 is needed at the 95% confidence level to have significance.)

Exploring the Goodness of Fit in Linear Models - Outlier Data Point at End of Range: Outlier Tab

Author(s): 
Scott A. Sinex

The Outlier worksheet (see Figure 7) starts with a perfect set of data for the y = x line (no scatter). What is the value of r2 for a perfect fit? Student now vary the scatter of the last datum point shown in red on the graph. How does it influence the regression line? There is a drastic influence on the line as a whole, especially to the value of the slope. This is the leverage effect, and it can be explored further using the downloadable JAVA simulation of Lohninger et al.  (undated) or the Putting Points applet at the University of Illinois at Urbana-Champaign Department of Statistics.


Figure 7. The Outlier worksheet
(Click on image for a larger view.)

 

Exploring the Goodness of Fit in Linear Models - Outlier Data Point at Mid-range: Outlier II Tab

Author(s): 
Scott A. Sinex

The Outlier II tab is a repeat of the Outlier worksheet, except that the scatter is introduced to a mid-range datum point. Vary the scatter of the datum point shown in red on the graph. How does the scatter influence the regression line? Here the effect is, for the most part, on the intercept until large scatter occurs.


Figure 8. The Outlier II worksheet
(Click on image for a larger view.)

 

Exploring the Goodness of Fit in Linear Models - Assessing the Effect of Outliers

Author(s): 
Scott A. Sinex

Overall, outliers can have a large effect on your model results. How is your ability to make a prediction when a model is produced containing an outlier? Think about the models produced with and without the outlier. As you have seen with the Outlier and Outlier II worksheets, the predictive capabilities of the model can be very distorted or biased by the presence of an outlier.

Consider this assessment question: If the point indicated in Figure 9 with the arrow were to be measured again and found to be (1.4, 0.10) instead of (1.4, 0.20), how would the regression line respond?


Figure 9. An end-point outlier

The answer is shown in Figure 10.


Figure 10. No longer an outlier

Exploring the Goodness of Fit in Linear Models - Scatter and Curvature: The S-C-R Tab

Author(s): 
Scott A. Sinex

The fourth worksheet is similar to the scatter worksheet but now permits the slope to be changed by entering a value in the yellow box. (See Figure 11, and click on the image to see a full-size picture.)


Figure 11. The Scatter, Curvature, and Residuals worksheet
(shown with slope = 2, no scatter, and no curvature)

A residuals plot is also included in this worksheet (Figure 12).


Figure 12. A residuals plot with small scatter and no curvature

Both scatter and curvature of the data are introduced in the generated data, the latter by an x2 term in the y column. Explore this worksheet one variable at a time. How does the residuals plot behave? If a small amount of curvature occurs, would the value of r2 alert you to it?

Using only r2, you will not discover the presence of curvature. It is seen only in the residuals plot, as in Figure 13, which shows a nice trend.


Figure 13. A residuals plot with small scatter and small curvature

Ideally, residual plots should show a random distribution of points, i.e., no trends, and the sum of the squared residuals should be kept to a minimum. Now, with a contribution of curvature, as seen in the residuals plot (or in Figure 13), increase the noise/scatter. What happens? You find that a non-linear model can be very difficult to uncover if scatter is apparent in the data. So, some system models can start out linear and, as measurement techniques for these systems improve, may evolve into non-linear models.

Exploring the Goodness of Fit in Linear Models - What Should I Do with the Outlier?

Author(s): 
Scott A. Sinex

All too often, the statement “it’s a lousy datum point - let’s just throw it out” is made in reference to an unusual datum value. This is not the proper way to handle a potential outlier. If you have recently-collected experimental data, you can repeat the measurement or experiment to try to determine if the outlier was a fluke. Is the outlier a real anomalous datum point? Some outliers are simply data recording errors. But, for example, with annual stream discharge (volume/sec), you can have a very wet or dry year that stands out from more typical or average years. It just pays to be cautious with outliers and use methods that deal with them, such as running averages. See Motulsky (2002) for more information on dealing with outliers, including the Grubbs' test for detecting outliers on replicates of measurement.

Exploring the Goodness of Fit in Linear Models - Is the Trend Changing?

Author(s): 
Scott A. Sinex

Could it be that your linear model is breaking down? Some linear models do not hold at high x-values. You can see how some calibration curves in chemistry start to develop curvature at high concentrations by using the Beer’s Law Simulator (click on the “Bad Cal” tab). Points at higher concentrations might appear to be outliers.

Another example illustrating emerging curvature is the "lightweighting" of the aluminum can, as shown in Figure 14. The possible future mass of the can is shown in Figure 15.


Figure 14. Aluminum can lightweighting (mass as a function of time)
Source: The Aluminum Association


Figure 15. Another possible conclusion from the same data

Exploring the Goodness of Fit in Linear Models - Some Final Thoughts

Author(s): 
Scott A. Sinex

Students can use the scatter.xls spreadsheet to discover the influence of scatter or noise on the goodness of fit through a directed mode of questioning. Many real-world data sets, especially environmental data, can show considerable scatter. You and your students can explore the many data sets of Langkamp and Hull (2003).

We need to expose students to a wider range of data and even experimental situations where noise can be a problem. The use of r2 and residuals are good first steps that take advantage of available technology for students just starting to explore linear models. Ideal models are great as an early component of learning. However, examining real data puts the process of modeling into the real world of scientific study. This is just the rudimentary beginning of mathematical modeling, but it introduces students to science process skills and the thrill of discovery. Further limitations and other measures of the goodness of fit can be introduced to students as other non-linear models are encountered.

Exploring the Goodness of Fit in Linear Models - References and Resources

Author(s): 
Scott A. Sinex

AMATYC (2004) Beyond Crossroads Draft Version 6.0, American Mathematical Association of Two-Year Colleges (accessed 18 March 2005)

Langkamp, G., and J. Hull (2003) Classroom-Ready Data Sets in Environmental Mathematics, Journal of Online Mathematics and its Applications 3 (accessed 18 March 2005)

Lohninger, H., et al. (undated) Learning by Simulations , Electronic Media Group, Institute of Chemical Technology and Analytics, Vienna University of Technology, Austria (accessed 18 March 2005)

Motulsky, H. (2002) Detecting Outliers , GraphPad Software (accessed 18 March 2005)

NCTM (2000) Principles and Standards for School Mathematics , National Council of Teachers of Mathematics (accessed 18 March 2005)

NSTA (2001) College Pathways to the Science Education Standards, National Science Teachers Association (accessed 18 March 2005)

Sinex, S. A. (2004) An Interactive Higher-Order Thinking Tool, TechLearning Educator's Outlook (accessed 18 March 2005)

Additional resources

Scatter Excel spreadsheet, used on pages 3-8

Constructing an Interactive Excel Spreadsheet tutorial from Sinex (2004), referenced on page 2 (accessed 18 March 2005)

Sample student activity  using the Scatter spreadsheet -- see page 3 (accessed 18 March 2005)

Beer's Law Simulator, referenced on page 10 (accessed 18 March 2005)

QELP data sets, Quantitative Environmental Learning Project -- see also Langkamp and Hull (2003) and pages 1 and 11 (accessed 18 March 2005)

Leverage Effect in Regression , downloadable Java simulation, #2003 in Lohninger et al. (undated), referenced on page 5 (accessed 18 March 2005)

Putting Points Java applet, Department of Statistics, University of Illinois at Urbana-Champaign, referenced on page 5 (accessed 18 March 2005)

Grubbs' Test, online calculator for outlier detection, Graphpad Software QuickCalcs  -- see also Motulsky (2002) and page 9 (accessed 18 March 2005)

Detecting Outliers with Grubbs' Test, interactive Excel spreadsheet to accompany Motulsky (2002), Graphpad Software -- see page 9 (accessed 18 March 2005)