Monday, September 29, 2014

Data transformation tips


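  1. If a variable's distribution has a long right tail (a right-skewed, i.e. positively skewed, distribution), apply a Box-Cox transformation. Taking log() is the quick & dirty version: it is the Box-Cox transform with lambda = 0. Both require strictly positive values. When in doubt, standardize all variables (subtract the mean, divide by the standard deviation); it rarely hurts, although tree-based models do not need it.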
  2. We want to turn categorical features into count features, because one-hot encoding them would curse us with dimensionality and render tree-based models unmanageable. We store the counts from the train set for every unique hash in a dictionary, and use that dictionary to replace each hash with its number of occurrences.
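
A minimal sketch of the first tip, using only the Python standard library. The helper names (box_cox, standardize, skewness) and the sample values are made up for illustration, and lambda is fixed by hand here; in practice a library routine would usually estimate it from the data.

```python
import math
import statistics


def box_cox(values, lam):
    """Box-Cox transform with a fixed lambda; lam == 0 is the plain log."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mean) / std for v in values]


def skewness(values):
    """Population skewness; positive means a long right tail."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


# Hypothetical right-skewed sample: most values are small, a few are huge.
raw = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 8.0, 20.0, 100.0]

logged = box_cox(raw, 0)      # the "quick & dirty" log transform
scaled = standardize(logged)  # zero mean, unit variance

print("skewness before:", round(skewness(raw), 2))
print("skewness after log:", round(skewness(logged), 2))
```

The skewness helper is only there to check that the transform actually shortened the tail; it is not part of the transformation itself.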
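
The count encoding in the second tip can be sketched as follows. The function names and the hash strings are hypothetical, and mapping unseen test values to 0 is an assumption the post does not specify.

```python
from collections import Counter


def fit_counts(train_values):
    """Count how often each category value (hash) occurs in the train set."""
    return Counter(train_values)


def count_encode(values, counts, default=0):
    """Replace each value with its train-set count.

    Values never seen in training get `default` (an assumption of this sketch).
    """
    return [counts.get(v, default) for v in values]


# Hypothetical hashed categorical column.
train_hashes = ["a1f3", "9bc2", "a1f3", "77d0", "a1f3", "9bc2"]
test_hashes = ["9bc2", "ffff", "a1f3"]

counts = fit_counts(train_hashes)  # learned from the train set only
train_encoded = count_encode(train_hashes, counts)
test_encoded = count_encode(test_hashes, counts)

print(train_encoded)  # [3, 2, 3, 1, 3, 2]
print(test_encoded)   # [2, 0, 3]
```

Each categorical column becomes a single numeric column, so the feature count stays fixed no matter how many distinct hashes appear.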


References:

  1. http://stats.stackexchange.com/questions/18844/when-and-why-to-take-the-log-of-a-distribution-of-numbers/18852#18852
  2. http://itl.nist.gov/div898/handbook/eda/section3/eda336.htm