Thursday, 26 July 2018

Towards Data Science: DAY 9 to 13

Moving forward in the book, these days covered categorical variables and significance testing (t-ratios, F-ratios, ANOVA) before moving on to variable selection, and I struggled to maintain interest throughout. Still, this was crucial knowledge, and it stressed the importance of intuition to support the math. Luckily, I was able to sum it all up through an example on modeling stock liquidity using various performance measures.

The case study begins by defining the following criteria by which investors choose their stocks:
  1. Expected Return;
  2. Riskiness (Volatility);
  3. Length of Time of Investment (Varies for Growth and Income Stocks);
  4. Liquidity (Ease with which a stock can be sold, measured by its VOLUME in the stock market)
The data set covers 123 companies: it originally contained 126, but 3 were excluded because of unusually large volumes or prices. The data represent the period December 3, 1984, to February 28, 1985. The variables examined are:
  • The three-month total trading volume (VOLUME, in millions of shares)
  • The three-month total number of transactions (NTRAN)
  • The average time between transactions (AVGT, measured in minutes)
  • The opening stock price on January 2, 1985 (PRICE)
  • The number of outstanding shares on December 31, 1984 (SHARE, in millions of shares)
  • The market equity value (VALUE, in billions of dollars), obtained by taking the product of PRICE and SHARE
  • The debt-to-equity ratio (DEB_EQ, shown as DEBEQ in the tables below)
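
The unit bookkeeping behind VALUE can be sketched in a few lines. This assumes PRICE is quoted in dollars per share, which the list above does not state explicitly; the function name and the sample numbers are made up for illustration.

```python
def market_value_billions(price_dollars, shares_millions):
    """Market equity value in billions of dollars.

    dollars/share * millions of shares = millions of dollars,
    so divide by 1000 to express the product in billions.
    """
    return price_dollars * shares_millions / 1000.0


# Hypothetical company: $40 per share, 250 million shares outstanding.
value = market_value_billions(40.0, 250.0)
print(f"VALUE = {value:.1f} billion dollars")
```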

The model attempts to explain liquidity, measured by VOLUME, using the 6 explanatory variables. A scatterplot matrix illustrates the relationships between the variables.

Scatterplot Matrix
VOLUME appears to be most strongly related to AVGT (average time between transactions) and NTRAN (number of transactions), while AVGT and NTRAN appear to have a perfect inversely proportional relationship with each other.

Taking a rather different route than the book, I started by fitting all the explanatory variables to the response variable (VOLUME), which gave an R² statistic of 84.9%. Judging by their t-ratios, only 2 variables were significant:




Variable     Coefficients   Standard Error    t Stat   P-value
Intercept          6.1571           1.9567    3.1467    0.0021
AVGT              -0.4436           0.1499   -2.9601    0.0037
NTRAN              0.0014           0.0002    7.8732    0.0000
PRICE             -0.0138           0.0228   -0.6061    0.5456
SHARE              0.0090           0.0073    1.2316    0.2206
VALUE              0.0812           0.1135    0.7152    0.4759
DEBEQ              0.0609           0.0593    1.0272    0.3065
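
As a quick sanity check on the table above, each t-ratio is just the coefficient divided by its standard error. The sketch below recomputes them from the rounded values in the table, so small differences are expected (most visibly for NTRAN, whose coefficient and standard error are rounded to very few significant digits). The cutoff of 1.98 is an approximation of the two-sided 5% critical value for roughly 116 degrees of freedom (123 observations minus 7 estimated coefficients), not a number taken from the book.

```python
# (coefficient, standard error) as reported in the full-model output.
full_model = {
    "Intercept": (6.1571, 1.9567),
    "AVGT": (-0.4436, 0.1499),
    "NTRAN": (0.0014, 0.0002),
    "PRICE": (-0.0138, 0.0228),
    "SHARE": (0.0090, 0.0073),
    "VALUE": (0.0812, 0.1135),
    "DEBEQ": (0.0609, 0.0593),
}

T_CRITICAL = 1.98  # assumed approximate 5% two-sided cutoff, ~116 d.f.

# t-ratio = coefficient / standard error
t_ratios = {name: coef / se for name, (coef, se) in full_model.items()}
significant = [name for name, t in t_ratios.items() if abs(t) > T_CRITICAL]

for name, t in t_ratios.items():
    flag = "significant" if name in significant else "not significant"
    print(f"{name:<10} t = {t:8.4f}  {flag}")
```

Apart from the intercept, only AVGT and NTRAN clear the cutoff, which matches the 2 significant variables noted above.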


The next regression used NTRAN as the only explanatory variable (the R² statistic was 83.43%).


Variable     Coefficients   Standard Error     t Stat   P-value
Intercept         1.65128          0.61730    2.67501   0.00851
NTRAN             0.00183          0.00007   24.68049   0.00000
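
With a single explanatory variable, the t-ratio of the slope and R² carry the same information: t² = R²(n - 2)/(1 - R²). This is a standard identity for simple linear regression rather than something printed in the regression output, so treat the snippet as a cross-check that uses only the reported R² of 83.43% and the sample size of 123.

```python
import math

n = 123             # companies in the sample
r_squared = 0.8343  # reported R² of the NTRAN-only regression

# Identity for simple regression: t^2 = R^2 * (n - 2) / (1 - R^2)
t_slope = math.sqrt(r_squared * (n - 2) / (1.0 - r_squared))

# Compare with the t Stat reported for NTRAN in the table (24.68049).
print(f"implied |t| for NTRAN: {t_slope:.4f}")
```

The small gap between the implied and the reported value comes from R² being rounded to four digits.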

Correlations of the residuals with the remaining explanatory variables revealed that there is still some information contained in the variable AVGT:

AVGT      PRICE    SHARE   VALUE   DEBEQ
-15.90%   -1.40%   6.42%   1.80%   7.79%
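
The step above (fit a model, then correlate its residuals with the variables left out) can be sketched on made-up data. Everything in the block is an assumption for illustration: the data are simulated rather than the 123-company sample, and `fit_simple` and `pearson` are hypothetical helper names, not functions from the book or from a library.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_simple(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1


random.seed(0)
# Simulated stand-ins: x plays the role of the included variable,
# z of a left-out variable that still affects y, w of an irrelevant one.
x = [random.uniform(0, 10) for _ in range(200)]
z = [random.uniform(0, 10) for _ in range(200)]
w = [random.uniform(0, 10) for _ in range(200)]
y = [2.0 + 1.5 * a - 0.8 * b + random.gauss(0, 1) for a, b in zip(x, z)]

b0, b1 = fit_simple(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# A sizeable correlation suggests the left-out variable still carries
# information about y that the current model has not captured.
print(f"corr(residuals, z) = {pearson(residuals, z):+.3f}")
print(f"corr(residuals, w) = {pearson(residuals, w):+.3f}")
```

In this simulation the residuals correlate strongly with z and only weakly with w, which is the same kind of signal that singled out AVGT as the next candidate.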

Finally, the third regression, on AVGT and NTRAN together, resulted in a model whose residuals have negligible correlations with the remaining explanatory variables (the R² statistic was 84.18%).


Variable     Coefficients   Standard Error     t Stat   P-value
Intercept          4.4087           1.3012     3.3882    0.0010
AVGT              -0.3222           0.1346    -2.3942    0.0182
NTRAN              0.0017           0.0001    17.1337    0.0000

Correlations of the residuals with the remaining explanatory variables:

PRICE    SHARE   VALUE   DEBEQ
-0.015   0.100   0.074   0.089

And so concluded the exercise of determining the significant explanatory variables for the regression model. However, I was amazed that AVGT, which appears to be a perfect function of NTRAN, could be significant alongside it. I was under the impression that both would add the same information to the regression.
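
One possible explanation for the surprise above, offered as an assumption rather than a conclusion from the book: a linear regression only picks up the linear part of a relationship. If AVGT behaves roughly like a constant divided by NTRAN, the two are perfectly related but far from perfectly linearly correlated, so AVGT can still add information to a model that already contains NTRAN. The numbers below are invented to show the effect, and `pearson` is a hypothetical helper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented transaction counts and an exactly inverse "average time".
ntran = list(range(1, 101))
avgt = [390.0 / k for k in ntran]  # 390 is an arbitrary constant

r_linear = pearson(ntran, avgt)
r_inverse = pearson([1.0 / k for k in ntran], avgt)

print(f"corr(NTRAN, AVGT)   = {r_linear:+.3f}")   # well short of -1
print(f"corr(1/NTRAN, AVGT) = {r_inverse:+.3f}")  # essentially +1
```

Only after transforming NTRAN to 1/NTRAN does the correlation become perfect, which is why a model that is linear in NTRAN can leave room for AVGT.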


The key takeaway from this exercise was the mathematics behind variable selection, and how some variables, although they add to the R² statistic, may have very little statistical significance. The exercise also got me excited about non-linear regression, though it may take a while to get there.
