Thursday, 12 July 2018

Towards Data Science: DAY 6 to 8

As I progressed through the Edward W. Frees book, these three days were about regression with multiple explanatory variables. This was not much different from before, since more or less the same concepts carry over to multiple linear regression. However, my infatuation with correlations and scatter plots took a hard hit as I learned how deceiving looks can be.

The book explains this by way of a dataset that lists the prices of 37 refrigerators along with their details and features. The regression attempts to fit refrigerator prices to the following explanatory variables:

  1. ECOST: Energy Cost;
  2. RSIZE: Refrigerator Compartment Size;
  3. FSIZE: Freezer Compartment Size;
  4. SHELVES: Number of Shelves;
  5. FEATURES: Number of Features.

I admittedly love looking at scatter plots, so a scatter plot matrix just makes things even more ******.


Scatter Plot Matrix
The regression equation came out like this:


             Coefficients   Standard Error    t Stat
Intercept        -797.808          271.409    -2.940
ECOST              -6.958            2.275    -3.058
RSIZE              76.497           19.442     3.935
FSIZE             137.381           23.763     5.781
SHELVES            37.937            9.886     3.837
FEATURES           23.764            4.512     5.267
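
As a quick sanity check on the output, each t Stat should simply be the coefficient divided by its standard error. Here is a minimal sketch in plain Python that recomputes them from the numbers reported above; the dictionary and function names are my own and not from the book, and negative values are entered with a minus sign.

```python
# Coefficients and standard errors copied from the regression output
# (negative values written with a minus sign).
estimates = {
    "Intercept": (-797.808, 271.409),
    "ECOST": (-6.958, 2.275),
    "RSIZE": (76.497, 19.442),
    "FSIZE": (137.381, 23.763),
    "SHELVES": (37.937, 9.886),
    "FEATURES": (23.764, 4.512),
}


def t_stat(coefficient, standard_error):
    """t statistic for the null hypothesis that the true coefficient is zero."""
    return coefficient / standard_error


# Recompute the t Stat column from the other two columns.
t_stats = {name: t_stat(b, se) for name, (b, se) in estimates.items()}

for name, t in t_stats.items():
    print(f"{name:<10} t = {t:7.3f}")
```

The recomputed values should agree with the t Stat column up to rounding, which is a cheap way to catch a mistyped coefficient or standard error.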

The coefficient for ECOST is of particular importance because of its negative sign. It makes sense: a higher energy cost should lower demand for the refrigerator, while lower energy costs increase consumer surplus, so consumers would indeed want to pay more for it. However, simple correlations disagree:
Correlation Plot using the corrplot package on R
The data implies a positive correlation of about 0.52 (52%) between Energy Cost and Prices.

The book resolves this by introducing the concept of an “Added Variable Plot”.

Added Variable Plot: Regressions of PRICES and ECOST against 4 explanatory variables
The plot shows the residuals of two regressions against each other: the first uses PRICES as the response variable and the second uses ECOST as the response variable, with both regressions using the four remaining explanatory variables, i.e. RSIZE, FSIZE, SHELVES and FEATURES. The correlation between the two sets of residuals is the partial correlation, i.e. the correlation after controlling for the effect of the other explanatory variables, which in this case worked out to be -0.48, almost the same magnitude in the opposite direction. Thus, it is possible that the positive relationship between PRICE and ECOST is due not to a causal relationship but rather to one or more additional variables that cause both variables to be large (exactly quoting the book).
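
To convince myself of the mechanics, here is a rough sketch of the residual-on-residual idea in plain Python. Big caveat: the data below is synthetic and made up by me (it is not the refrigerator dataset), and it uses a single hypothetical control variable called `size` instead of the four in the book. The last part cross-checks the -0.48 using a standard identity that recovers the partial correlation from a coefficient's t statistic, r = t / sqrt(t^2 + n - (k + 1)), with the t Stat for ECOST from the regression output, n = 37 refrigerators and k = 5 explanatory variables.

```python
import math


def mean(values):
    return sum(values) / len(values)


def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def residuals(y, x):
    """Residuals from a least-squares fit of y on x with an intercept."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


# Synthetic, made-up data (NOT the refrigerator dataset): bigger units cost
# more to run and sell for more, but for a given size a higher energy cost
# lowers the price.
size = [1, 2, 3, 4, 5, 6, 7, 8]
wiggle = [1, -1, 1, -1, -1, 1, -1, 1]
noise = [0.5, 0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5]
ecost = [2 * s + w for s, w in zip(size, wiggle)]
price = [10 * s - 3 * e + u for s, e, u in zip(size, ecost, noise)]

# The simple correlation ignores size and comes out positive.
raw_corr = correlation(ecost, price)

# Added-variable logic: regress each variable on the control, then correlate
# the two sets of residuals. With size held fixed, the sign flips.
partial_corr = correlation(residuals(ecost, size), residuals(price, size))

print(f"simple correlation:  {raw_corr:.2f}")
print(f"partial correlation: {partial_corr:.2f}")


def partial_corr_from_t(t, n, k):
    """Partial correlation implied by a coefficient's t statistic,
    for n observations and k explanatory variables."""
    return t / math.sqrt(t ** 2 + n - (k + 1))


# Cross-check with the ECOST t Stat reported in the regression output.
r_ecost = partial_corr_from_t(-3.058, n=37, k=5)
print(f"ECOST partial correlation from t: {r_ecost:.2f}")
```

In the toy data the simple correlation is positive while the residual correlation is negative, which is the same sign flip seen with ECOST, and the t-based shortcut lands on the same -0.48 without running the two auxiliary regressions.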


This case study was an enlightening experience in how regression coefficients and correlation coefficients can paint different pictures altogether. There is still much to be learned as I move towards categorical variables and explanatory variable transformations.
