Regression analysis

A regression model in plain words

Awesome! A regression model can help you predict the future, but exactly how? 

In a nutshell, linear regression analysis is a method used to establish a model of the relationship between two or more variables.

Concept of regression analysis

The point of a regression analysis is to predict the evolution of one variable based on trends in another variable.

Example


A regression model can establish the relationship between ice cream sales and the temperature. More specifically, the regression model will estimate how many ice creams you will sell on Venice Beach when the temperature hits 30 degrees.

When we are working to establish relationships in our data, we can distinguish between three different methods:

1. Correlation analysis

2. Simple linear regression model

3. Multiple linear regression model

Correlation analysis can be seen as the precursor to regression analysis. It measures only the strength of the relationship between two variables, which is classified as either strong or weak. This could be the relationship between, for example, outdoor temperature and ice cream sales.

Simple linear regression is a bit more advanced. It establishes a linear function that is used to estimate the value of the dependent variable (Y) using a given value of the independent variable (X).

For example, how many ice cream cones are sold (Y) when it is 25 degrees Celsius (X)? The formula for a straight line (a linear function) is: Y = a + b · X.

Multiple regression uses more than one independent variable. For example, we can estimate how many ice cream cones are sold when the temperature is X1 and the price is X2.
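As a minimal sketch of the two kinds of function, the example below uses coefficients that are invented purely for illustration; they are not estimated from any data in this text.

```python
# Hypothetical coefficients, chosen only to illustrate the form of the functions.
A = 20.0         # intercept: cones sold at 0 degrees
B_TEMP = 4.0     # extra cones sold per additional degree Celsius
B_PRICE = -10.0  # change in cones sold per unit increase in price


def simple_model(temperature):
    """Simple linear regression function: Y = a + b * X."""
    return A + B_TEMP * temperature


def multiple_model(temperature, price):
    """Multiple linear regression function: Y = a + b1 * X1 + b2 * X2."""
    return A + B_TEMP * temperature + B_PRICE * price


print(simple_model(25))         # 120.0
print(multiple_model(25, 3.0))  # 90.0
```

The only difference between the two models is the number of independent variables that enter the function.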

Correlation analysis

For a practical demonstration, consider an IT company that lets consumers build their computers over the Internet. Let us assume that the company wants to expand by providing business-to-business solutions. In this context, the management wants to identify the factors that contribute most to business-to-business telephone sales. From experience, the management assumes that telephone sales are particularly affected by two factors: the number of daily sales calls (call rate) and the salesperson’s experience (in months). The question now is: which of these two factors affects sales the most? Can it be estimated from a graph? Let us try.

The graph immediately shows that the relationship between calls made and sales is a strong, positive one. The more calls, the more sales.

Similarly, we see a clear connection between sales and the salesperson’s experience. Can we use the graphs to determine which of these two factors has the greatest impact on sales?

The answer is not entirely obvious because different scales are used. One chart measures phone calls by number and the other measures sales experience in terms of time.

To make the two variables comparable, we must have a uniform scale. It is precisely here that correlation analysis is useful.

Depending on whether a relationship is positive or negative, or shows only a partial correlation, the correlation coefficient (ρ for the population and r for the sample) varies from −1 to +1. The following relationships are both linear, but one is positive and one is negative.
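The sketch below illustrates these properties with invented numbers: the coefficient stays between −1 and +1, it does not change when a variable is measured on a different scale, and its sign follows the direction of the relationship. The data and the function name `correlation` are assumptions made for the example.

```python
import math


def correlation(xs, ys):
    """Sample correlation coefficient r between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations and sums of squared deviations.
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp_xy / math.sqrt(ss_x * ss_y)


# Hypothetical data: daily calls and sales for five salespeople.
calls = [10, 20, 30, 40, 50]
sales = [12, 25, 31, 38, 55]

r_calls = correlation(calls, sales)
# Measuring the same calls in hundreds changes the scale, but not r.
r_rescaled = correlation([c / 100 for c in calls], sales)
# Reversing the direction of the relationship flips the sign of r.
r_negative = correlation(calls, [-s for s in sales])

print(round(r_calls, 3), round(r_rescaled, 3), round(r_negative, 3))
```

Because r is unaffected by the unit of measurement, it can be used to compare variables such as number of calls and months of experience.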

The correlation coefficient is calculated as follows:
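
Assuming that SAP denotes the sum of the products of the deviations and SAK the sum of the squared deviations, the sample correlation coefficient takes the standard form:

$$
r = \frac{SAP_{xy}}{\sqrt{SAK_x \cdot SAK_y}},
\qquad
SAP_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
\qquad
SAK_x = \sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
SAK_y = \sum_{i=1}^{n} (y_i - \bar{y})^2
$$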

The formulas used to calculate SAP and SAK are found here

Returning to the example of the IT business, the correlation coefficients for sales/calls (0.831) and sales/experience (0.774) can both be calculated using the Statlearn program.

Both correlation coefficients are positive, suggesting that both phone calls and experience have a positive impact on sales.

However, we cannot yet conclude that the number of calls has the greatest impact, since the correlation coefficients are only point estimates.

In the section about hypothesis tests, we said that the value of a point estimate should be tested before one tries to generalize the results. The same is true for the point estimates used for correlation coefficients. You can learn how to test this in the section titled “Procedure for Regression”.

Extreme Observations: Outliers

It can be a good idea to supplement your correlation analysis with a graph. There are basically two reasons to show the results graphically.

One is that using a graph makes it possible to spot significant deviations, also known as “outliers”.

As shown in the graph below, outliers are observations that deviate radically from the normal observations. Without a graph, outliers can remain hidden in your data and distort the value of the correlation coefficient.

Even though they distort the correlation coefficient, outliers may contain valuable information. In the case of the IT company, an outlier might be a person who has very little sales experience, but still manages to sell more than a person with many years of experience. He or she may have extraordinary sales talent or a method that may be interesting to study.
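The effect of a single outlier on the correlation coefficient can be sketched as follows. The figures are invented for illustration and are not the IT company's data; the outlier is a newcomer with two months of experience and unusually high sales.

```python
import math


def correlation(xs, ys):
    """Sample correlation coefficient r between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp_xy / math.sqrt(ss_x * ss_y)


# Hypothetical data: experience in months and sales for six salespeople.
experience = [6, 12, 18, 24, 30, 36]
sales = [10, 14, 19, 22, 27, 31]
r_clean = correlation(experience, sales)

# Add one outlier: very little experience, but the highest sales of all.
r_with_outlier = correlation(experience + [2], sales + [40])

print(round(r_clean, 3), round(r_with_outlier, 3))
```

With these numbers, one extra observation is enough to turn a very strong correlation into a weak one, which is why a graph is a useful supplement to the coefficient.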

Simple linear regression

In the previous section, we identified the call rate (number of sales calls per day) as the variable that affected sales the most. In this light, it would be interesting to calculate how many calls have to be made to reach specific sales targets.

This is where simple linear regression is suitable. With a simple linear regression, we seek to create a linear function that describes the relationship between two variables, as shown to the left.

The notation for the simple linear regression model depends on whether you are working with data for an entire population or just a sample.

Obtaining data for an entire population is resource intensive; therefore, a sample-based regression model will almost always be used.

The linear regression models for the population and the sample, respectively, are as follows:
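
Assuming the usual convention of Greek letters for the population and Latin letters for the sample, the two models can be written as:

$$
\text{Population: } Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
\qquad\qquad
\text{Sample: } Y_i = b_0 + b_1 X_i + e_i, \quad \hat{Y}_i = b_0 + b_1 X_i
$$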

The term “ε” (written e for a sample) represents the residual, which is the deviation between the regression model estimate (Ŷi) and the actual observation (Yi). It is worth noting that observations are rarely spoken of in regression analysis. Instead, the residuals are often referenced when talking about a regression model’s precision, or lack thereof.

Least Squares Method (MKM)

As we initially said, simple linear regression is used to establish a linear relationship between the dependent (Y) and independent variable (X).

But why is there really a need to do a simple linear regression? Isn’t it relatively simple to understand whether a data plot demonstrates consistency by drawing a line that follows the observations?

Let us do an experiment. The following shows two graphs based on the same data set. In each graph, we have attempted to draw the line that best represents the relationship between advertising and sales.

Despite the fact that the two lines have different intercepts and slopes, both appear to be relatively good at illustrating the development of sales compared to advertising costs.

The example should hopefully illustrate that it is not a matter of simply choosing the line that best describes the relationship between sales and advertising. What impact can this lack of precision have?

To answer this question, we can estimate what sales would look like with advertising expenditures of 25 million.

For line A, the estimate is 55 million, and it is 70 million for line B. Thus, there is a deviation of 15 million.

A deviation of this magnitude could mean the difference between success and failure, so it is important to determine the line which most accurately describes the relationship between X and Y.

Now that we have seen that the location of the regression line is crucial to the value of the regression estimate, this raises the question of how to calculate the regression line that most accurately describes the relationship between X and Y.

It would be logical to choose the line that minimizes the distance to all the observations. Let us build on this approach, as illustrated by the graphs below. In this situation, the best line is the one that cuts through both points.

This always applies when there are only two observations.

If we add two further observations, we can move the line so that it now minimizes the distance between all four observations.

So far, the approach based on minimizing the distance between all of the observations seems to work quite well.

If we look at the new set of observations, we suddenly run into trouble if we use the same approach to minimize the distance between all observations.

Line A has minimized the distance between the observations by cutting through two points and “ignoring” the last point.

However, it seems that line B best describes the relationship between X and Y. This means that a method that seeks to minimize the distance between all observations may paradoxically result in a line which is not necessarily the most accurate.

To handle this situation using regression analysis, the least squares method may be used. 

The purpose of the least squares method is to calculate the regression line so that the sum of the squared deviations between the individual observations and the regression line is as small as possible.
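
In symbols, and assuming the sample notation b0 for the intercept and b1 for the slope, the least squares method chooses the values of b0 and b1 that minimize:

$$
\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( Y_i - b_0 - b_1 X_i \right)^2
$$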

Using the least squares method on lines A and B, we see that line B gives the smaller sum of squared deviations and is, therefore, a better choice than line A.
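The difference between minimizing plain distances and minimizing squared deviations can be sketched with three invented observations (these are not the data behind the graphs). In this sketch, line A passes exactly through two of the points, while line B is the least squares line for the same three points.

```python
# Three hypothetical observations.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 6.0]


def deviations(intercept, slope):
    """Vertical deviations between each observation and the line."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def sum_abs(devs):
    """Sum of the absolute deviations (plain distances)."""
    return sum(abs(d) for d in devs)


def sum_squares(devs):
    """Sum of the squared deviations (the least squares criterion)."""
    return sum(d ** 2 for d in devs)


# Line A passes exactly through the first and the last observation.
line_a = deviations(intercept=-1.5, slope=2.5)
# Line B is the least squares line for these three observations.
line_b = deviations(intercept=-2.0, slope=2.5)

print("A:", sum_abs(line_a), sum_squares(line_a))  # A: 1.5 2.25
print("B:", sum_abs(line_b), sum_squares(line_b))  # B: 2.0 1.5
```

Line A has the smaller sum of plain distances, but line B has the smaller sum of squared deviations, because squaring penalizes the single large deviation that line A leaves behind.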

The least squares method is the foundation for the calculations used in a regression model. All of the formulas underlying the calculations used in a regression model can be found here.

Method for doing regression analysis

This section discusses the procedure used for simple linear regression analysis. In short, the procedure can be outlined by the following points:

  • Select the appropriate formula
  • Validate the assumptions of the model
  • Calculate the regression intercept and slope
  • Interpret the coefficient of determination
  • Test the coefficients for the model
  • Validate that the model’s assumptions are met (residual analysis)

The following elaborates on the individual points in the procedure:

STEP 1: STATE THE FORMULA

These are the formulas for a simple linear regression model for a sample and a population, respectively:

Here ŷ and Y are the regression estimates, and b0 and β0 represent the intersection with the y-axis (the intercept). Additionally, b1 and β1 represent the slope, and e and ε represent the residual. A residual is, as mentioned earlier, an expression for the difference between the regression estimates and the observed values.

STEP 2: STATE THE ASSUMPTIONS

A. The relationship between X and Y must be linear

B. The residuals must be normally distributed, with a mean of zero

C. The variance of the residuals must be constant

D. The residuals should be independent of each other

The regression model can only be trusted if assumptions A-D are all fulfilled.

STEP 2 (CONTINUED): VALIDATE THE ASSUMPTIONS

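A. So, is the relationship between X and Y linear? A scatter plot of X against Y answers this: the observations should follow a roughly straight line.

[Graph missing]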

B.  Are the residuals normally distributed, with a mean of zero?

Graph B shows that the overwhelming share of the residuals are below 0, which means that the distribution is skewed, not normal. This results in negative residual averages.

C.  Is the variance of the residuals constant? 

The residuals should have a constant variance. If the variance rises or falls with X, the precision of the regression estimates is no longer the same across the data, and the tests of the coefficients become unreliable.


D.  Are the residuals independent of each other?  

The residuals should be independent of each other. If the opposite is true, there may be patterns in the dataset which would break the assumption of a linear relationship. In practical terms, this would reduce the accuracy of the regression estimates.
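Checks B to D are normally made by looking at graphs of the residuals, but simple numerical summaries can support them. The following is a rough sketch on invented data; the split into a lower and an upper half of the X values and the use of the Durbin-Watson statistic are choices made for this example, not part of the procedure described above.

```python
# Hypothetical observations.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least squares slope and intercept.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Residual = observed value minus regression estimate.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# B. The mean of the residuals should be (close to) zero.
mean_residual = sum(residuals) / n

# C. Rough check of constant variance: compare the average squared residual
#    for the lower and the upper half of the X values.
half = n // 2
spread_low = sum(r ** 2 for r in residuals[:half]) / half
spread_high = sum(r ** 2 for r in residuals[half:]) / (n - half)

# D. Durbin-Watson statistic (between 0 and 4): values near 2 suggest that
#    neighbouring residuals are independent of each other.
durbin_watson = (
    sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, n))
    / sum(r ** 2 for r in residuals)
)

print(mean_residual, spread_low, spread_high, durbin_watson)
```

Normality itself is easier to judge from a histogram of the residuals than from a single number.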


STEP 3: CALCULATE REGRESSION SLOPE AND INTERCEPT

To calculate the regression coefficients, use the Statlearn program. The formulas for the individual coefficients are shown in the table below.

Note that SAKy is used to calculate the coefficient of determination. See the following section for more on SAKy.
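
As a sketch of what such a calculation involves, the code below assumes the standard least squares formulas b1 = SAPxy / SAKx and b0 = ȳ − b1 · x̄, reading SAP as the sum of the products of the deviations and SAK as the sum of the squared deviations. The function name `fit_line` and the data are invented; the data lie exactly on the line Y = 1 + 2X so that the result can be checked by eye.

```python
def fit_line(xs, ys):
    """Return (intercept b0, slope b1) of the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # SAPxy: sum of the products of the deviations.
    sap_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # SAKx: sum of the squared deviations of X.
    sak_x = sum((x - mean_x) ** 2 for x in xs)
    b1 = sap_xy / sak_x
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical data lying exactly on Y = 1 + 2 * X.
x_values = [1, 2, 3, 4, 5]
y_values = [3, 5, 7, 9, 11]

b0, b1 = fit_line(x_values, y_values)
print(b0, b1)
```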

STEP 4:  CALCULATE COEFFICIENT OF DETERMINATION

The coefficient of determination (R²) gives us our first indication of the usability of the regression model.

The formulas used to calculate the sizes of SAPxy, SAKx and SAKy are shown in the previous section.

The coefficient of determination is an overall measure of how much of the total variation in Y is explained by the regression model.

The value of the coefficient of determination ranges from 0 to 1. The nearer it is to 1, the more accurately the regression model describes the relationship between X and Y.

It is important not to confuse the coefficient of determination with the precision of the regression estimate. A coefficient of determination equaling 90% does not mean that the regression model estimates will be 90% accurate.

The 90% is only a measure of the model’s overall ability to explain the variation in Y.
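For a simple linear regression, the coefficient of determination can be computed in two equivalent ways: as SAPxy² / (SAKx · SAKy), or as one minus the share of the variation in Y that the line leaves unexplained. The sketch below uses invented data and assumes that SAP and SAK stand for the sum of the products of the deviations and the sum of the squared deviations.

```python
# Hypothetical observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

sap_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sak_x = sum((x - mean_x) ** 2 for x in xs)
sak_y = sum((y - mean_y) ** 2 for y in ys)

# Version 1: directly from the sums of deviations.
r_squared = sap_xy ** 2 / (sak_x * sak_y)

# Version 2: one minus unexplained variation over total variation.
b1 = sap_xy / sak_x
b0 = mean_y - b1 * mean_x
unexplained = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
r_squared_check = 1 - unexplained / sak_y

print(round(r_squared, 3), round(r_squared_check, 3))
```

Here the line explains 60% of the variation in Y, which says nothing about how far an individual estimate may lie from the observed value.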

The following shows how the coefficient of determination reflects the correlation between X and Y:

STEP 5:  TEST THE REGRESSION MODEL

Besides a high R², a usable model also requires that the slope (b1) is significant, that is, different from 0. Remember that a regression slope is usually calculated from a sample, so a non-zero sample slope is no guarantee that there is a real slope in the population.

Therefore, always test whether the slope is significant, which in this context means different from 0. If the slope is not significant, we have no evidence of a linear relationship between X and Y, which means we must reject the regression model.

A slope is generally assumed to be significant if the p-value is less than 5%. In the example below, the regression output has been found using Excel. It shows that the slope is clearly significant, which indicates that there is a real linear relationship between the X and Y variables.
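The test behind that p-value can be sketched by hand. The code below computes the t statistic for the slope on invented data and compares it with the two-sided 5% critical value of the t-distribution, which is equivalent to checking whether the p-value is below 5%. The data are hypothetical, and the critical value of 2.306 for 8 degrees of freedom is taken from a standard t-table.

```python
import math

# Hypothetical sample of ten observations.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.3, 4.1, 6.2, 7.7, 10.4, 11.8, 14.3, 15.9, 18.2, 19.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sak_x = sum((x - mean_x) ** 2 for x in xs)
sap_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

b1 = sap_xy / sak_x
b0 = mean_y - b1 * mean_x

# Sum of squared residuals and standard error of the slope.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
se_b1 = math.sqrt(sse / (n - 2) / sak_x)

# t statistic for the null hypothesis that the true slope is 0.
t_stat = b1 / se_b1

# Two-sided 5% critical value for n - 2 = 8 degrees of freedom (t-table).
T_CRITICAL = 2.306
slope_is_significant = abs(t_stat) > T_CRITICAL

print(round(b1, 3), round(t_stat, 1), slope_is_significant)
```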

Example using simple linear regression

The following provides a regression analysis based on 20 randomly selected salespeople who sell consulting services in the financial sector. The model looks at the relationship between seniority and sales.

The analysis follows the approach outlined in the previous section. The data underlying the calculations can be found here.

All calculations can be performed using the Statlearn program.

1. Regression model

2. Assumptions

  • The relationship between X and Y must be linear
  • The residuals must be normally distributed, with a mean of zero
  • The residuals must have constant variance
  • The residuals should be independent of each other

The assumptions are described here.

3. Calculating the regression coefficients

(Calculations for the regression analysis can be seen at the end of this chapter, in Appendix 3.)

Intercept 758,151.647

Slope 279,558.058

4. Interpretation of the coefficient of determination

The coefficient of determination (R²) is 83%, which means that the regression model explains 83% of the total variation in sales on the basis of seniority. Since it explains most of the total variation, the model is good at describing the relationship being analyzed.
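
Taken together, the intercept and slope reported above define the estimated regression line. The following is a minimal sketch of using that line for prediction. It assumes that X is seniority and Y is sales, that the reported figures are European-formatted numbers (758151.647 and 279558.058), and that the units are those of the underlying data; the seniority value of 5 is an arbitrary illustration.

```python
# Coefficients as reported in the example, read as European-formatted numbers
# (assumption: "758.151,647" means 758151.647).
INTERCEPT = 758151.647
SLOPE = 279558.058


def predict_sales(seniority):
    """Regression estimate of sales for a given seniority."""
    return INTERCEPT + SLOPE * seniority


print(predict_sales(5))
```

Estimates of this kind are only reliable for seniority values within the range covered by the 20 salespeople in the sample.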

Calculations for the regression analysis can be seen here.