Friday, 6 July 2018

Towards Data Science: DAY 1 to 5

As a life actuary working mainly in valuations and financial reporting, in a very underdeveloped insurance market, I never get to practise data analysis but hear about it more than I would like. Buzzwords like “BIGDATA” and “R-Programming” only add to that sense of inferiority.

In response to this growing skills gap, I have resolved to revisit statistics through Edward W. Frees's book on regression applications in actuarial science and finance, with the aid of R. My first impression of R is that it's extremely convenient and quick to use, with just a few lines of code needed for all kinds of statistical work. As for the book, it relates closely to the usual problems in actuarial science.

I am no stranger to regression analysis, but being out of practice meant I found myself fascinated by things as simple as scatter plots and qq plots. The book teaches these concepts through a dataset aimed at understanding what drives lottery sales in the state of Wisconsin. The dataset consists of average lottery sales (SALES) over a 40-week period, April 1998 through January 1999, from fifty randomly selected areas identified by postal (ZIP) code within the state. The example attempts to draw conclusions from a regression of sales on the population of the area.

A scatter plot is the natural first step towards visualizing correlations, and adding a regression line is just as straightforward. The R² statistic of 78.5% indicates a good fit: 78.5% of the total variation in sales is explained by the regression line.
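
For anyone who wants to see the mechanics behind the regression line and the R² figure, here is a minimal sketch. It is written in Python rather than R purely for illustration, and the numbers are made up; they are not the Wisconsin lottery data. The function name fit_line is my own.

```python
# Made-up illustrative data, NOT the Wisconsin lottery dataset.
population = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical explanatory variable
sales = [2.1, 3.9, 6.2, 7.8, 10.1]       # hypothetical response


def fit_line(x, y):
    """Least-squares intercept, slope and R-squared for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                # slope
    b0 = y_bar - b1 * x_bar       # intercept
    # R-squared = 1 - (unexplained variation / total variation)
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_error = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - ss_error / ss_total
    return b0, b1, r_squared


b0, b1, r2 = fit_line(population, sales)
print(f"intercept={b0:.3f} slope={b1:.3f} R^2={r2:.3f}")
```

In R the same fit is a one-liner with lm(), which is a large part of why the software feels so quick to work with.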

However, it was upon reaching the part of residual analysis that I really began to learn something. Residual analysis is the exercise of checking the residuals for patterns.


Scatter Plot (with Regression Line): Lottery Sales vs Population

The book defines five types of model misspecification:
  1. Lack of Independence;
  2. Heteroscedasticity: when the variability differs across observations;
  3. Relationships between Model Deviations and Explanatory Variables;
  4. Non-normal Distributions;
  5. Unusual Points.
The exercise identified an unusual point in the dataset. 


“An observation that is unusual in the vertical direction is called an outlier. An observation that is unusual in the horizontal direction is called a high leverage point. An observation may be both an outlier and a high leverage point.”
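
To make the two definitions concrete, here is a rough Python sketch that flags both kinds of unusual point in a one-variable regression. The cutoffs are assumptions on my part, taken from common rules of thumb rather than from anything stated in this post: an absolute standardized residual above 2 for outliers, and a leverage above three times the average for high leverage points. The data are made up, and diagnose and flag_unusual are hypothetical names.

```python
import math


def diagnose(x, y):
    """Return (leverages, standardized_residuals) for a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    # Leverage measures how far x_i sits from the centre of the x values
    leverages = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    standardized = [e / (s * math.sqrt(1 - h)) for e, h in zip(residuals, leverages)]
    return leverages, standardized


def flag_unusual(x, y):
    """Indices of outliers (vertical) and high leverage points (horizontal)."""
    leverages, standardized = diagnose(x, y)
    leverage_cutoff = 3 * 2 / len(x)  # assumed rule: 3 x average leverage (2 / n)
    outliers = [i for i, r in enumerate(standardized) if abs(r) > 2]  # assumed rule
    high_leverage = [i for i, h in enumerate(leverages) if h > leverage_cutoff]
    return outliers, high_leverage


# Made-up data: the last point lies far to the right of the others.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 15.9, 18.2, 30.0]
outliers, high_leverage = flag_unusual(x, y)
print("outliers:", outliers)
print("high leverage points:", high_leverage)
```

The leverage depends only on the x values, which is why it captures "unusual in the horizontal direction", while the standardized residual captures "unusual in the vertical direction".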


The book also introduces the concept of sensitivity testing in regression analysis. The unusual point is an outlier in the data. The outlier ZIP code corresponds to the village of Bristol in Kenosha County, a very small village indeed, with an estimated population of a little over 5,000 in 2016. The author believes that residents of neighboring areas bought lottery tickets in this village, pushing sales unusually high.

Removing this outlier from the dataset did not do much to the regression coefficient. However, the R² statistic jumped to 88.3%, a dramatic increase. The qq plots below, before and after the removal of the outlier, show how the non-normality of the residuals caused by a single outlier is corrected.
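
The sensitivity test itself amounts to fitting the regression twice, once with and once without the suspect observation, and comparing the results. A minimal Python sketch of that idea follows, on made-up numbers that only mimic the pattern described above; it does not use the lottery data and will not reproduce the book's figures. The function names are my own.

```python
def slope_and_r2(x, y):
    """Least-squares slope and R-squared for a one-variable regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_error = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return b1, 1 - ss_error / ss_total


def sensitivity(x, y, drop_index):
    """Fit with all points, then again with one observation removed."""
    full = slope_and_r2(x, y)
    x_kept = [xi for i, xi in enumerate(x) if i != drop_index]
    y_kept = [yi for i, yi in enumerate(y) if i != drop_index]
    reduced = slope_and_r2(x_kept, y_kept)
    return full, reduced


# Made-up data: observation 5 sits well above an otherwise tidy line.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [0.2, 0.9, 2.1, 3.0, 3.8, 15.0, 6.1, 7.0, 7.9, 9.1, 10.0]
(full_slope, full_r2), (red_slope, red_r2) = sensitivity(x, y, drop_index=5)
print(f"all points:      slope={full_slope:.3f} R^2={full_r2:.3f}")
print(f"without point 5: slope={red_slope:.3f} R^2={red_r2:.3f}")
```

Because the unusual point here sits at the centre of the x values, removing it barely moves the slope while R² improves sharply, which is the same qualitative pattern as in the lottery example.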


qq plots of the residuals before and after removal of the outlier
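
For the curious, a qq plot simply pairs the sorted residuals with the quantiles a normal distribution would produce at the same ranks; points hugging a straight line suggest normal residuals. Below is a small Python sketch of how those pairs can be computed. The plotting positions (k - 0.5) / n are one common convention that I am assuming here, so other software may use slightly different ones, and the residuals are made up.

```python
from statistics import NormalDist


def qq_points(residuals):
    """Pair each sorted residual with a standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting positions: (k - 0.5) / n for k = 1..n
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


# Made-up residuals with one very large value, standing in for an outlier.
residuals = [-0.9, -0.4, 0.1, 0.3, 9.0]
for theoretical, observed in qq_points(residuals):
    print(f"{theoretical:7.3f}  {observed:7.3f}")
```

A single large residual shows up as one point far off the line at the end of the plot, which is what the "before" plot displays.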


There is a long way to go and I have barely scratched the surface. However, my experience so far with R has been a very pleasant one, and in the process I am closing some very embarrassing gaps in my knowledge through hands-on practice.
