Regression Analysis
Introduction
The Line Fitting Tool provides a sneak peek into the art of regression analysis: finding the equation that best describes the relationship between a variable whose value you would like to predict (a dependent variable) and the variables whose values you believe can be used to predict the value of the dependent variable (the independent variables). This page answers the question: "What is Regression Analysis?" To learn how to use the Line Fitting Tool, visit Using the Line Fitting Tool; to fit some lines, visit The Line Fitting Tool.
Regression analysis can be used for both prediction and testing
- As a kayaker I use it to predict the levels of rivers I might want to paddle based on information available on the internet, like rainfall or the level of a nearby river upon which the USGS has placed a river gage. A business owner, on the other hand, might use it to predict how many more kayaks they could sell if they charged less for them.
- As an economist I use it to test economic theories. Economic theory predicts that, all else equal, the number of gallons of gasoline sold over a particular period of time in a particular place is inversely related to its price. If economic theory is correct, then if I plot the actual numbers of gallons sold on one axis of a graph and the corresponding prices charged on the other, the line best fitting these combinations of points should be downward sloping. If it isn't, there is a problem either with economic theory or with the way I'm testing it (all else may not be equal, for example).
Regression analysis is an extraordinarily useful and widely used tool. It is a fundamental tool in disciplines as diverse as epidemiology, hydrology, environmental science, engineering, and financial analysis, just to name a few. Even a relatively small college like that at which I teach is likely to offer several courses in several departments on its methods and use (ECON 365: Econometrics, taught in the Economics Department and STAT 325: Introduction to Regression Models, taught in the Mathematics & Statistics Department, in our case).
The derogatory term for the indiscriminate and unthinking use of regression analysis is line fitting. Because my online tool is so simple and my description of its use so basic, I have chosen to call it "The Line Fitting Tool." To move beyond mere line fitting, take the time to think about the reasons the values of the variables you include in your regressions are likely to be related to each other, take a course in regression analysis and use a statistical package with more oomph than The Line Fitting Tool provides. Many spreadsheet programs include regression packages that can be stretched a step or two beyond The Line Fitting Tool, including Microsoft Excel; practitioners use any of the multitude of statistical software packages that include it.
Data sets
A data set is a collection of values of variables among which a researcher believes a relationship exists. Data sets are composed of series of observations, each of which includes the value of a dependent variable and the values of the independent variables associated with it.
Suppose, for example, that you suspect the scores students receive on quizzes and tests are related to the number of hours they study. Suppose as well your classmates are willing to share with you their scores on the Statistics 101 midterm and the number of hours they studied. Here is what they report:
| Test Score | Hours Studied |
|---|---|
| 62 | 3.7 |
| 73 | 5.9 |
| 63 | 3.6 |
| 82 | 4.5 |
| 78 | 4.4 |
| 81 | 6.6 |
| 66 | 4.0 |
| 86 | 8.0 |
| 95 | 8.5 |
| 74 | 4.9 |
| 90 | 7.2 |
| 91 | 8.1 |
| 83 | 6.2 |
| 79 | 7.3 |
| 80 | 7.1 |
| 91 | 8.1 |
| 73 | 4.0 |
| 65 | 2.4 |
| 79 | 6.3 |
| 84 | 7.6 |
| 94 | 6.3 |
| 79 | 8.9 |
| 87 | 6.9 |
| 70 | 4.9 |
| 81 | 7.2 |
| 99 | 7.0 |
| 79 | 5.1 |
| 84 | 6.3 |
| 75 | 4.9 |
| 78 | 4.8 |
The contents of the table above constitute a data set; each of its rows represents an observation.
Scatter diagrams
Suppose you are intrigued by the relationship between test scores and the number of hours studied and wish to understand it better. To picture it, draw a graph in which the number of hours studied (the independent variable) is represented as the distance from the origin in the horizontal direction and test scores as the distance from the origin in the vertical direction (the origin is the point at which the graph's two axes cross). The point representing the number of hours studied and the test score earned by the first student in your data set looks like this:
A Single Data Point
If you plot the combinations of hours studied and test scores earned for every student in your data set you'll get a graph that looks like this:
A Scatter Diagram of Data Points
Graphs like this are called scatter diagrams.
Lines, graphs, data sets and equations
Think for a moment about the relationship between graphs and equations. A graph is a drawing of an equation. Except in the extreme case of a vertical line, straight lines have y-intercepts and slopes. A line's y-intercept describes where it crosses the vertical axis of its graph; its slope describes how steep it is. Specifically, a line's slope tells us by how much the value represented as a distance along the graph's vertical axis changes when the value represented as a distance along the graph's horizontal axis increases by 1. A horizontal line has a slope equal to zero; a downward-slanting line has a negative slope and an upward-slanting line a positive slope.
Equations describing straight lines are called linear. Linear equations are frequently written in the form Y = mX + b, where Y is the value of the variable represented as a distance moved along the vertical (or y) axis, X the value of the variable represented as a distance moved along the horizontal (or x) axis, m the slope of the line and b the line's y-intercept.
The Line Fitting Tool finds the linear equation that best describes the relationship between the values of a data set's dependent variable and the values of the independent variables associated with them. One of the names for the statistical method The Line Fitting Tool uses to fit the "best" line is ordinary least squares. This name describes the criterion used to judge the "best" fit: by "best" statisticians mean the equation that minimizes the sum of the squared differences in the vertical direction between the line and the points in the scatter diagram.
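For readers who like to see the arithmetic, here is a minimal sketch of the ordinary least squares calculation for a single independent variable, using the test score data from the table above. It is not the Line Fitting Tool's own code; the function names `fit_line` and `sum_squared_errors` are made up for this illustration, and the formulas are the standard textbook ones for the slope and intercept of the least-squares line.

```python
# Hours studied (X) and test scores (Y), copied from the table above.
hours = [3.7, 5.9, 3.6, 4.5, 4.4, 6.6, 4.0, 8.0, 8.5, 4.9,
         7.2, 8.1, 6.2, 7.3, 7.1, 8.1, 4.0, 2.4, 6.3, 7.6,
         6.3, 8.9, 6.9, 4.9, 7.2, 7.0, 5.1, 6.3, 4.9, 4.8]
scores = [62, 73, 63, 82, 78, 81, 66, 86, 95, 74,
          90, 91, 83, 79, 80, 91, 73, 65, 79, 84,
          94, 79, 87, 70, 81, 99, 79, 84, 75, 78]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line Y = mX + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how X and Y vary together, divided by how much X varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_errors(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


m, b = fit_line(hours, scores)
print(f"slope = {m:.3f}, intercept = {b:.3f}")
print(f"sum of squared errors = {sum_squared_errors(hours, scores, m, b):.1f}")
```

Nudging the slope or intercept away from the fitted values and calling `sum_squared_errors` again should always give a larger number, which is exactly what "best" means here.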
Visualizing what the line fitting tool does
I find it easier to draw a picture of what The Line Fitting Tool does than to describe it. Imagine that you are trying to use a ruler to fit the "best" line through the cloud of points that represents your data set. The line you draw will look something like this:
A Regression Line and Data Points
Note that the line doesn't go through every point; given the data your classmates have provided a straight line can't. The distance between the line and the score a particular student earned represents the amount you would be off if you tried to use your line and the number of hours the student studied to predict their score. We call this distance the data point's estimation error.
Using a regression line to make predictions
Once again imagine that you use a ruler to fit the "best" line through the cloud of points that represents combinations of hours studied and scores earned. The line you fit will look like this:
A Single Variable (Simple) Regression Line
If you are incredibly meticulous -- or if you use the Line Fitting Tool -- you will find that the line that fits your data set best has a y-intercept = 54.140 and a slope = 4.299. The data needed to confirm this is included in the Line Fitting Tool's Sample Data Menu. Select the Simple Regression Test Score Example. That the line that fits your data set best has a y-intercept = 54.140 and a slope = 4.299 means that as a best guess your line predicts that a student who doesn't study at all will score 54.14 points and that with each hour studied her score will increase by 4.299 points. Indeed, you can predict any student's score using the equation:
Test Score = 54.14 + 4.299*Hours Studied
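As a quick illustration, the equation above can be turned into a tiny prediction function. The name `predict_score` and the sample inputs are my own choices for this sketch; the intercept and slope are the values reported above.

```python
INTERCEPT = 54.14  # predicted score for a student who doesn't study at all
SLOPE = 4.299      # predicted points gained per additional hour studied


def predict_score(hours_studied):
    """Predict a test score from hours studied using the fitted line."""
    return INTERCEPT + SLOPE * hours_studied


for h in (0, 3.7, 8):
    print(f"{h} hours studied -> predicted score {predict_score(h):.1f}")
```

The first student in the data set studied 3.7 hours and scored 62, so the prediction of roughly 70 misses by about 8 points.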
The problem is that the line doesn't fit the data all that well . . . there is a fair amount of estimation error, one measure of which is the coefficient of determination, or R². R² measures the fraction of the total variation in the dependent variable (test scores) explained by the variation in the independent variable (hours studied). Its value can range from 0 (variation in the independent variable explains none of the variation in the dependent variable) to 1 (variation in the independent variable explains all the variation in the dependent variable). Note that less than 60% of the variation in test scores is explained by differences in hours studied (R² = .584), leaving over 40% of the variation unaccounted for.
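The coefficient of determination can be computed directly from the estimation errors. The sketch below assumes the usual definition (one minus the ratio of unexplained variation to total variation) and uses the rounded intercept and slope reported above; the variable names are mine.

```python
hours = [3.7, 5.9, 3.6, 4.5, 4.4, 6.6, 4.0, 8.0, 8.5, 4.9,
         7.2, 8.1, 6.2, 7.3, 7.1, 8.1, 4.0, 2.4, 6.3, 7.6,
         6.3, 8.9, 6.9, 4.9, 7.2, 7.0, 5.1, 6.3, 4.9, 4.8]
scores = [62, 73, 63, 82, 78, 81, 66, 86, 95, 74,
          90, 91, 83, 79, 80, 91, 73, 65, 79, 84,
          94, 79, 87, 70, 81, 99, 79, 84, 75, 78]

INTERCEPT, SLOPE = 54.14, 4.299  # fitted values reported in the text

# Estimation error for each student: actual score minus predicted score.
predicted = [INTERCEPT + SLOPE * h for h in hours]
errors = [y - p for y, p in zip(scores, predicted)]

mean_score = sum(scores) / len(scores)
unexplained = sum(e ** 2 for e in errors)                # squared errors
total = sum((y - mean_score) ** 2 for y in scores)       # total variation
r_squared = 1 - unexplained / total

print(f"R^2 = {r_squared:.3f}")
```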
Estimation error
There are a number of reasons to expect estimation errors whenever we attempt to fit lines. These include:
- Measurement errors: many variables are difficult to measure with precision and we have to make do with approximations. Just how many hours -- and fractions of hours -- did each student really study, taking into account telephone calls, domestic crises and other interruptions?
- Inherent randomness: some relationships have random elements that can't be measured or controlled. Ever flip a coin as you were running out of time and wind up reviewing the "wrong" concept?
- Omitted variables: a dependent variable may be influenced by many independent variables and we either haven't or can't take them all into account. Quantity and quality of study time may determine test scores, for example, and we either haven't or can't measure quality.
- Non-linear relationships: many relationships are non-linear, that is, changes in the independent variable have effects on the value of the dependent variable that change as the magnitude of the independent variable grows. Is every hour spent studying equally productive, or do the first few hours have a bigger impact on performance than the last few?
As long as we are measuring as carefully as we can there is not much we can do about measurement error and inherent randomness, but there is a lot we can do about omitted variables and non-linear relationships. Specifically, we can employ multiple regression and choose alternative functional forms.
Multivariable (multiple) regression
Suppose that upon reflection we realize that there are at least two more variables that influence the scores students receive on quizzes and tests: the number of hours they sleep the night before each test and their aptitude for the test's subject. If we can take these additional independent variables into account we should be able to improve our predictions.
Multiple regression is the statistical tool that allows us to simultaneously estimate the relationship between a dependent variable and multiple independent variables in a manner similar to that provided by single variable regression. Visually it is a little harder to explain what regression analysis does (with two independent variables we're fitting the "best" plane through a three dimensional cloud of points), but we wind up in a similar place -- with an equation that can be used to predict test scores. The difference is that the equation we end up with is of the form $Y = a + b_1X_1 + b_2X_2 + \dots + b_nX_n$, where $X_1$, $X_2$ and $X_n$ represent different independent variables -- hours studied, hours slept and aptitude, for example.
To demonstrate, suppose that in a streak of great luck your classmates are not only willing to share their scores on the Statistics 101 midterm and the number of hours they studied, but they have kept track of the number of hours they slept the night before the test and recorded their scores on the math aptitude test the professor administered on the first day of class. Here is what they report:
| Test Score | Hours Studied | Hours Slept | Aptitude |
|---|---|---|---|
| 62 | 3.7 | 5.0 | 9 |
| 73 | 5.9 | 5.0 | 5 |
| 63 | 3.6 | 7.5 | 6 |
| 82 | 4.5 | 7.6 | 16 |
| 78 | 4.4 | 11.2 | 10 |
| 81 | 6.6 | 3.7 | 12 |
| 66 | 4.0 | 4.6 | 12 |
| 86 | 8.0 | 4.6 | 6 |
| 95 | 8.5 | 4.4 | 14 |
| 74 | 4.9 | 5.7 | 11 |
| 90 | 7.2 | 6.2 | 10 |
| 91 | 8.1 | 5.3 | 9 |
| 83 | 6.2 | 8.9 | 4 |
| 79 | 7.3 | 2.3 | 12 |
| 80 | 7.1 | 3.2 | 10 |
| 91 | 8.1 | 6.6 | 5 |
| 73 | 4.0 | 9.0 | 9 |
| 65 | 2.4 | 9.6 | 13 |
| 79 | 6.3 | 5.4 | 7 |
| 84 | 7.6 | 5.3 | 4 |
| 94 | 6.3 | 7.4 | 17 |
| 79 | 8.9 | 0.7 | 11 |
| 87 | 6.9 | 5.2 | 12 |
| 70 | 4.9 | 4.6 | 10 |
| 81 | 7.2 | 3.5 | 9 |
| 99 | 7.0 | 9.8 | 14 |
| 79 | 5.1 | 5.4 | 14 |
| 84 | 6.3 | 7.0 | 8 |
| 75 | 4.9 | 7.4 | 8 |
| 78 | 4.8 | 8.4 | 9 |
This data is included in the Line Fitting Tool's Sample Data Menu. Select the Multiple Regression Test Score Example and you'll find that the plane (the hyperplane in four dimensional space if you are picky . . .) that best fits your data is described by the equation:
Test Score = 16.959 + 6.439*Hours Studied + 2.519*Hours Slept + .925*Aptitude
Two aspects of this new equation are noteworthy. First, as expected, taking into account two important omitted variables significantly raised our R²: our expanded equation appears to explain over 95% of the variation in test scores (R² = .951). And second, it appears that studying has a bigger impact on test scores than we had originally thought: each additional hour spent studying appears to raise test scores approximately 6.4 points, not the 4.3 points we first thought. It appears our overly simple initial model was a bit misleading!
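If you would like to check the multiple regression yourself without the Line Fitting Tool, here is a minimal sketch in plain Python. It assumes the textbook approach of solving the "normal equations" for the least-squares coefficients; I am not claiming this is how the Line Fitting Tool does it internally, and the helper names `solve` and `fit_ols` are invented for this example. Compare the printed coefficients and R² with the values reported above.

```python
rows = [  # (test score, hours studied, hours slept, aptitude)
    (62, 3.7, 5.0, 9), (73, 5.9, 5.0, 5), (63, 3.6, 7.5, 6),
    (82, 4.5, 7.6, 16), (78, 4.4, 11.2, 10), (81, 6.6, 3.7, 12),
    (66, 4.0, 4.6, 12), (86, 8.0, 4.6, 6), (95, 8.5, 4.4, 14),
    (74, 4.9, 5.7, 11), (90, 7.2, 6.2, 10), (91, 8.1, 5.3, 9),
    (83, 6.2, 8.9, 4), (79, 7.3, 2.3, 12), (80, 7.1, 3.2, 10),
    (91, 8.1, 6.6, 5), (73, 4.0, 9.0, 9), (65, 2.4, 9.6, 13),
    (79, 6.3, 5.4, 7), (84, 7.6, 5.3, 4), (94, 6.3, 7.4, 17),
    (79, 8.9, 0.7, 11), (87, 6.9, 5.2, 12), (70, 4.9, 4.6, 10),
    (81, 7.2, 3.5, 9), (99, 7.0, 9.8, 14), (79, 5.1, 5.4, 14),
    (84, 6.3, 7.0, 8), (75, 4.9, 7.4, 8), (78, 4.8, 8.4, 9),
]


def solve(a, b):
    """Solve the linear system a*x = b by Gaussian elimination."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry into place.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_ols(design, y):
    """Least-squares coefficients from the normal equations (X'X)b = X'y."""
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


scores = [r[0] for r in rows]
# A leading 1.0 in each row lets the first coefficient act as the intercept.
design = [[1.0, r[1], r[2], r[3]] for r in rows]
coefs = fit_ols(design, scores)

fitted = [sum(c * v for c, v in zip(coefs, d)) for d in design]
residuals = [y - f for y, f in zip(scores, fitted)]
mean_score = sum(scores) / len(scores)
r_squared = 1 - (sum(e ** 2 for e in residuals)
                 / sum((y - mean_score) ** 2 for y in scores))

for name, c in zip(["Intercept", "Hours Studied", "Hours Slept", "Aptitude"],
                   coefs):
    print(f"{name}: {c:.3f}")
print(f"R^2 = {r_squared:.3f}")
```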
Though regression analysis is a fundamental tool in many disciplines and professions, it is unlikely that any discipline uses it more than my own, economics. The reason for this is that economists work in extremely dirty laboratories -- that is, the laboratories provided by the real world, laboratories in which we are interested in testing relationships between dependent variables and particular independent variables but can't hold the levels of all the other independent variables influencing the relationships constant. Multiple regression analysis helps us see through the clutter.
Functional form
The final area to consider when trying to improve a line's fit is the choice of the line's functional form. "Functional form" is just a fancy way to say "the shape of the curve the Line Fitting Tool fits." The Line Fitting Tool always fits the "best" straight line. The original data can be transformed, however, in ways that make a straight line fitted to the transformed data look like a curved line when viewed from the perspective of the original data. Since real world relationships are often "curved," the best fits often come with curved "lines."
If, for example, the effect of changes in the value of an independent variable on the value of a dependent variable either gets greater and greater or smaller and smaller as the value of the independent variable increases, the exponential functional form may be appropriate. To estimate the exponential functional form, the Line Fitting Tool creates a new data set in which the values of the dependent variable are the natural logs of the values of the original dependent variable. The "best" straight line is then fitted through the transformed data set. Naturally this straight line will have an intercept (b0) and a slope (b1). To plot this "line" through the original data, the Line Fitting Tool uses the following formula:
$Y = ae^{bX}$, where $b_0 = \ln(a)$ and $b_1 = b$.
Yuck! Fortunately you don't have to understand any of this. All you need to do is go to the Line Fitting Tool's Sample Data and Select Options menus, pick a data set, and then play with the different functional forms. Not all data sets are appropriate for all functional forms, but if you pick an inappropriate functional form the Line Fitting Tool will tell you so. You will find that from the perspective of the original data, the linear functional form plots a straight line, the semi-logarithmic form a curved line, the quadratic polynomial form a curved line with one bend, the cubic polynomial form a curved line with two bends, the power functional form yet another curved line and the exponential functional form yet another curved line once again.
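For the curious, the log transformation described above for the exponential form can be sketched in a few lines. The data here are hypothetical: I generate them from an exponential curve with a = 2 and b = 0.5 and no noise, so that we can confirm the procedure recovers the values we started with. Real data would not fit this neatly.

```python
import math

# Hypothetical, noise-free data generated from Y = 2 * e^(0.5 * X).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

# Step 1: transform the dependent variable by taking natural logs.
log_ys = [math.log(y) for y in ys]

# Step 2: fit the "best" straight line through (X, ln Y).
n = len(xs)
mean_x = sum(xs) / n
mean_ly = sum(log_ys) / n
b1 = (sum((x - mean_x) * (ly - mean_ly) for x, ly in zip(xs, log_ys))
      / sum((x - mean_x) ** 2 for x in xs))
b0 = mean_ly - b1 * mean_x

# Step 3: convert back. Since b0 = ln(a), a = e^b0; and b = b1.
a = math.exp(b0)
b = b1
print(f"Y = {a:.3f} * exp({b:.3f} * X)")  # prints Y = 2.000 * exp(0.500 * X)
```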
Here's a comparison of our single variable regression line and a quadratic polynomial regression line fitted through our original data set. Two aspects of the polynomial's curve are noteworthy. First, the curve gets less and less steep as the number of hours studied rises. This implies that the first few hours studied have a bigger impact on test scores than the last few hours. And second, just as was the case with multivariable regression, it appears that studying has a bigger impact on test scores than we had originally thought: note that while the fitted curve gets less and less steep as the number of hours studied rises, it is steeper than the original line over almost the entire range of hours studied. Once again, it appears our overly simple initial model was a bit misleading when it comes to the impact of studying on test scores!
A Quadratic Polynomial Regression Line
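One common way to fit a quadratic polynomial is to treat the square of hours studied as a second independent variable and run an ordinary multiple regression. The sketch below takes that approach; it is an assumption on my part that this matches what the Line Fitting Tool does, and the helper names are invented. Because the straight line is a special case of the quadratic, the quadratic's R² can never be lower than the line's.

```python
hours = [3.7, 5.9, 3.6, 4.5, 4.4, 6.6, 4.0, 8.0, 8.5, 4.9,
         7.2, 8.1, 6.2, 7.3, 7.1, 8.1, 4.0, 2.4, 6.3, 7.6,
         6.3, 8.9, 6.9, 4.9, 7.2, 7.0, 5.1, 6.3, 4.9, 4.8]
scores = [62, 73, 63, 82, 78, 81, 66, 86, 95, 74,
          90, 91, 83, 79, 80, 91, 73, 65, 79, 84,
          94, 79, 87, 70, 81, 99, 79, 84, 75, 78]


def solve(a, b):
    """Solve the linear system a*x = b by Gaussian elimination."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_ols(design, y):
    """Least-squares coefficients from the normal equations (X'X)b = X'y."""
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def r_squared(design, y, coefs):
    """Fraction of the variation in y explained by the fitted equation."""
    fitted = [sum(c * v for c, v in zip(coefs, d)) for d in design]
    mean_y = sum(y) / len(y)
    sse = sum((yi - f) ** 2 for yi, f in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst


linear_design = [[1.0, h] for h in hours]        # Y = b0 + b1*X
quad_design = [[1.0, h, h * h] for h in hours]   # Y = c0 + c1*X + c2*X^2

lin = fit_ols(linear_design, scores)
quad = fit_ols(quad_design, scores)
r2_lin = r_squared(linear_design, scores, lin)
r2_quad = r_squared(quad_design, scores, quad)

print(f"linear:    {lin[0]:.3f} + {lin[1]:.3f}*X, R^2 = {r2_lin:.3f}")
print(f"quadratic: {quad[0]:.3f} + {quad[1]:.3f}*X + {quad[2]:.3f}*X^2, "
      f"R^2 = {r2_quad:.3f}")
```

A negative coefficient on the squared term would correspond to a curve that flattens as hours studied rise.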
Caveat emptor
As I stated earlier, despite the fact that it can do some cool things like multiple regression and alternate functional forms, the Line Fitting Tool is actually pretty simplistic. For example, it is very limited in its ability to combine multiple regression with alternate functional forms. Worse, I haven't said anything about hypothesis testing -- the most important part of regression analysis from a scientific perspective -- because the Line Fitting Tool doesn't estimate the statistics needed to do it. If this introduction intrigues you, you owe it to yourself to take one of the many courses on regression analysis taught at almost any college or university.

