
Tuesday, September 30, 2014

Criteo Labs Kaggle challenge


  1. Take a random sample of the train data
  2. Decision trees work worse than even a random solution in this case 
  3. Logistic regression with only the independent variables works better
  4. The data is huge and we can't load it all into memory. We probably don't need all of it to learn a model, but we do need more insight into the categorical variables.
  5. Perl script to compute statistics of the categorical variables : https://github.com/novieq/kaggle/blob/master/test/stats.pl
  6. Convert unknown values into NA so that they are treated as missing values, or convert all the categorical variables to CTR values (see the sketch after this list)
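A minimal pandas sketch of the CTR-value idea from point 6; the column names C1 and click are hypothetical stand-ins for the Criteo columns:

import pandas as pd

train = pd.DataFrame({"C1": ["a", "a", "b", "b", "b"], "click": [1, 0, 0, 1, 1]})
test = pd.DataFrame({"C1": ["a", "c"]})

global_ctr = train["click"].mean()
ctr = train.groupby("C1")["click"].mean()                # per-category click-through rate
train["C1_ctr"] = train["C1"].map(ctr)
test["C1_ctr"] = test["C1"].map(ctr).fillna(global_ctr)  # unseen hashes fall back to the global CTR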

Categorical Variables




Bias Variance Tradeoff

import numpy as np

# Assumed setup (not shown here): y_test and y_predict hold the test targets and the model
# predictions collected over repeated fits, one column per repetition, and f(X_test)
# returns the true noise-free target values.
y_noise = np.var(y_test, axis=1)                        # irreducible noise
y_bias = (f(X_test) - np.mean(y_predict, axis=1)) ** 2  # squared bias
y_var = np.var(y_predict, axis=1)                       # variance of the estimator

MSE = y_noise + y_bias + y_var                          # expected squared-error decomposition

Kaggle Solutions

Must read kaggle forum links

Monday, September 29, 2014

Data transformation tips


  1. If a variable's distribution has a long tail (right-skewed distribution), apply a Box-Cox transformation (taking log() is a quick & dirty way). Standardize all variables when in doubt (it does not hurt anyway)
  2. We want to turn categorical features into count features because one-hot encoding would curse us with dimensionality, rendering tree-based models unmanageable. We store the counts from the train set for every unique hash inside a dictionary, and use this to replace hashes with their occurrence count (see the sketch below).
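A small sketch of both tips, assuming pandas and SciPy are available; the column names and toy values are made up:

import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({"income": [1, 2, 3, 500, 10000],   # long-tailed numeric feature
                   "C1": ["a", "a", "b", "b", "c"]})  # hashed categorical feature

# Tip 1: tame the long tail; log1p is the quick & dirty fix, Box-Cox needs positive values
df["income_log"] = np.log1p(df["income"])
df["income_boxcox"], _ = stats.boxcox(df["income"])

# Tip 2: replace each hash with its occurrence count from the train set
counts = df["C1"].value_counts().to_dict()
df["C1_count"] = df["C1"].map(counts)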


References :

  1. http://stats.stackexchange.com/questions/18844/when-and-why-to-take-the-log-of-a-distribution-of-numbers/18852#18852
  2. http://itl.nist.gov/div898/handbook/eda/section3/eda336.htm

Python Tools

Naive Bayes Classifier tips

In a nutshell, the Gaussian Naive Bayes model is generally used for continuous data (where each feature is a real number), where the underlying data distribution is assumed to be a Gaussian (Normal) distribution.
The Multinomial Naive Bayes model counts how often a certain event occurs in the dataset (for example how often a certain word occurs in a document).
The Bernoulli Naive Bayes model is similar to the Multinomial Naive Bayes model, but instead of counting how often an event occurred, it only describes whether or not an event occurred (for example whether or not a certain word occurs in a document, where it doesn't matter if it occurs once or 100,000 times).
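A minimal scikit-learn sketch of the three variants on made-up toy arrays, just to show which kind of input each model expects:

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])
X_real = np.array([[1.2, 0.7], [0.9, 1.1], [3.3, 2.8], [2.9, 3.1]])  # continuous features
X_counts = np.array([[3, 0], [2, 1], [0, 5], [1, 4]])                # event counts (e.g. word counts)
X_binary = (X_counts > 0).astype(int)                                # did the event occur at all?

print(GaussianNB().fit(X_real, y).predict(X_real))
print(MultinomialNB().fit(X_counts, y).predict(X_counts))
print(BernoulliNB().fit(X_binary, y).predict(X_binary))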

Saturday, September 27, 2014

Logistic Regression


Decision Trees tips


Random Forests tips

Extremely Randomized Trees : In Random Forests, each split is chosen to maximize the information gain (i.e. to reduce impurity such as entropy). In extremely randomized trees, candidate split thresholds are drawn at random and the best of those random candidates is kept. This reduces the variance and increases the bias.

Regularized Greedy Forest : an additive model built greedily, in the spirit of boosting; it reduces bias and relies on explicit regularization to keep the variance in check.
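A quick scikit-learn sketch comparing the two tree ensembles above on synthetic data (only to show the API, not a benchmark):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
for model in (RandomForestClassifier(n_estimators=100, random_state=0),
              ExtraTreesClassifier(n_estimators=100, random_state=0)):
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())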

Support Vector Machine tips


  1. Works better on high dimensional data than logistic regression (features >> samples)
  2. Creates a maximum-margin classifier - hence it is sensitive to outliers 
  3. Outlier removal can give a better model
  4. C works in the opposite direction to lambda as a regularization parameter (roughly C = 1/lambda)
  5. A high value of C forces the error (slack) term towards zero, leaving effectively only the regularization part in the objective (a hard-margin SVM) - so this will be a high-variance model
  6. On extremely high-dimensional data, PCA followed by SVM sometimes works well (see the sketch after this list).
  7. My rank 57 in the African Soil Challenge competition on Kaggle was an ensemble of PCA + SVM and SVM on the raw data, and I used a Box-Cox (log) transformation for the P target. P was the hardest to predict and its distribution had a long tail, so a log transform helped. This was also used as an input to the blend. 
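A rough scikit-learn sketch of points 4-6 on synthetic data; the particular parameter values are only illustrative:

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=200, n_informative=10, random_state=0)

# PCA to cut the dimensionality, then an RBF SVM; larger C = weaker regularization
model = make_pipeline(PCA(n_components=20), SVC(C=1.0, kernel="rbf"))
print(cross_val_score(model, X, y, cv=5).mean())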

Tuesday, September 23, 2014

Posterior, Prior and Likelihood

Bayes Theorem is a very common and fundamental theorem used in Data mining and Machine learning. Its formula is pretty simple:
P(X|Y) = ( P(Y|X) * P(X) ) / P(Y), which is Posterior = ( Likelihood * Prior ) /  Evidence
So I was wondering why each term is named the way it is.
Let’s use an example to find out their meanings.

Example

Suppose we have 100 movies and 50 books.
There are 3 different movie types: Action, Sci-fi, Romance,
2 different book types: Sci-fi, Romance
20 of those 100 movies are Action.
30 are Sci-fi
50 are Romance.

15 of those 50 books are Sci-fi
35 are Romance
So given an unclassified object,
The probability that it's a movie is 100/150, 50/150 for book.
The probability that it's a Sci-fi type is 45/150, 20/150 for Action and 85/150 for Romance.
If we already know it's a movie, then the probability that it's an action movie is 20/100, 30/100 for Sci-fi and 50/100 for Romance.
If we already know it's a book, then the probability that it's a Sci-fi book is 15/50, 35/50 for Romance.
Now, given an object of type Sci-fi, what is the probability that it's a movie?
Using Bayes theorem, we know that the formula is:
P(movie|Sci-fi) = P(Sci-fi| Movie) * P(Movie) / P(Sci-fi)
Here, P(movie|Sci-fi) is called Posterior,
P(Sci-fi|Movie) is Likelihood,
P(movie) is Prior,
P(Sci-fi) is Evidence.
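Plugging the numbers from the example into the formula, just to make it concrete:
P(movie|Sci-fi) = ( (30/100) * (100/150) ) / (45/150) = (30/150) / (45/150) = 30/45 ≈ 0.67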
Now let’s see why they are called like that.
Prior: Before we observe that it's a Sci-fi type, the object is completely unknown to us. Our goal is to find out the probability that it's a movie, and we actually have this data prior to (or before) our observation - the probability that it's a movie when it's a completely unknown object: P(movie).
Posterior: After we observe that it's a Sci-fi type, we know something about the object. Because it comes post (or after) the observation, we call it the posterior: P(movie|Sci-fi).
Evidence: Because we already know it's a Sci-fi type, what has happened has happened. We witness its appearance, so to us it is evidence, and the chance of getting this evidence is P(Sci-fi).
Likelihood: The dictionary meaning of this word is the chance or probability that something will happen. Here it means: when it's a movie, what is the chance that it is also a Sci-fi type. This term is very important in Machine Learning.


So the timing of the observation (before or after) is the main reason these probabilities are named the way they are.

Thursday, September 18, 2014

KDD Cup

Random Forest Titanic

This particular Python function requires floats for the input variables, so all strings need to be converted and any missing data needs to be filled.
Not all types of data can be converted into floats. For example, Names would be very difficult. In these cases let's decide to neglect those columns. Although they are strings, categorical variables like male and female can be converted to 1 and 0, and the port of embarkation, which has three categories, can be converted to 0, 1 or 2 (Cherbourg, Southampton and Queenstown). This may seem like a nonsensical way of encoding, since Queenstown is not twice the value of Southampton, but random forests are somewhat robust as long as the number of distinct categories is not too large.

Converting from categorical strings to floats is intuitive. However, filling in data can be trickier. Some data cannot be trivially filled (such as Cabin) without complete knowledge of every cabin and ticket price for the entire ship. Nonetheless, fare can be estimated if you know the class, and the age of a passenger can be estimated using the median age of the people on board. Fortunately for us, the amount of missing data here is not too large, so the method you choose to fill the data shouldn't have too much of an effect on your predictive result.
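A minimal pandas sketch of these conversions and fills; the tiny frame below just stands in for the real Titanic columns, and the fill choices are one reasonable option, not the tutorial's exact code:

import numpy as np
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "female"],
                   "Embarked": ["C", "S", None],
                   "Age": [22.0, np.nan, 30.0],
                   "Fare": [7.25, 71.28, np.nan],
                   "Pclass": [3, 1, 3]})

df["Sex"] = df["Sex"].map({"female": 0, "male": 1})
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0]).map({"C": 0, "S": 1, "Q": 2})
df["Age"] = df["Age"].fillna(df["Age"].median())
# fill a missing Fare with the median fare of that passenger's class
df["Fare"] = df.groupby("Pclass")["Fare"].transform(lambda s: s.fillna(s.median()))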


Python Notes


  1. If you want to access a specific column of the data, say the gender column, you can just type data[0::,4], remembering that "0::" means all rows (from start to end), and Python starts indices from 0 (not 1).
  2. Be aware that the csv reader works with strings by default, so you will need to convert to floats in order to do numerical calculations. For example, you can turn the Pclass variable into floats by using data[0::,2].astype(np.float) (see the sketch after this list).
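A self-contained sketch of both notes; the inline CSV below just mimics the Titanic column layout (on newer NumPy versions use float or np.float64 instead of np.float):

import csv
import io
import numpy as np

raw = "PassengerId,Survived,Pclass,Name,Sex,Age\n1,0,3,Braund,male,22\n2,1,1,Cumings,female,38\n"
reader = csv.reader(io.StringIO(raw))
header = next(reader)
data = np.array([row for row in reader])

gender = data[0::, 4]                       # all rows of the fifth (gender) column
pclass = data[0::, 2].astype(np.float64)    # csv yields strings; cast to float for arithmetic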

Wednesday, September 17, 2014

Ensemble Learning with R

Interpreting Logistic Regression Results

Call:
glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = binomial(logit), 
    data = menarche)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0363  -0.9953  -0.4900   0.7780   1.3675  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -21.22639    0.77068  -27.54   <2e-16 ***
Age           1.63197    0.05895   27.68   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3693.884  on 24  degrees of freedom
Residual deviance:   26.703  on 23  degrees of freedom
AIC: 114.76

Number of Fisher Scoring iterations: 4

The following requests also produce useful results: glm.out$coef, glm.out$fitted, glm.out$resid, glm.out$effects, and anova(glm.out).

Recall that the response variable is log odds, so the coefficient of "Age" can be interpreted as "for every one year increase in age the odds of having reached menarche increase by exp(1.632) = 5.11 times."
To evaluate the overall performance of the model, look at the null deviance and residual deviance near the bottom of the print out. Null deviance shows how well the response is predicted by a model with nothing but an intercept (grand mean). This is essentially a chi square value on 24 degrees of freedom, and indicates very little fit (a highly significant difference between fitted values and observed values). Adding in our predictors--just "Age" in this case--decreased the deviance by 3667 points on 1 degree of freedom. Again, this is interpreted as a chi square value and indicates a highly significant decrease in deviance. The residual deviance is 26.7 on 23 degrees of freedom. We use this to test the overall fit of the model by once again treating this as a chi square value. A chi square of 26.7 on 23 degrees of freedom yields a p-value of 0.269. The null hypothesis (i.e., the model) is not rejected. The fitted values are not significantly different from the observed values.
http://ww2.coastal.edu/kingw/statistics/R-tutorials/logistic.html
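As a quick numerical check of the 0.269 p-value quoted above, here is a SciPy one-liner (the original tutorial works in R):

from scipy import stats

# residual deviance treated as a chi-square statistic on its degrees of freedom
print(round(stats.chi2.sf(26.703, df=23), 3))   # ~0.269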

Logarithmic Loss function in the Criteo Labs competition - Kaggle

The logarithm of the likelihood function for a Bernoulli random distribution.
In plain English, this error metric is typically used where you have to predict whether something is true or false with a probability (likelihood) ranging from definitely true (1) through equally likely true or false (0.5) to definitely false (0).
Taking the log of the error provides extreme punishment for being both confident and wrong. In the worst possible case, a single prediction that something is definitely true (1) when it is actually false will add an infinite amount to your error score and make every other entry pointless. In Kaggle competitions, predictions are bounded away from the extremes by a small value in order to prevent this.



LogLoss = -(1/n) * Σ_{i=1..n} [ y_i * log(ŷ_i) + (1 - y_i) * log(1 - ŷ_i) ]
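A small NumPy sketch of the metric, with the clipping mentioned above so that a single confident mistake does not blow up the score:

import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    p = np.clip(y_pred, eps, 1 - eps)   # bound predictions away from 0 and 1
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(log_loss(np.array([1, 0, 1]), np.array([0.9, 0.1, 0.5])))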

Tuesday, September 16, 2014

Reading data in R

There are a couple of simple things to try, whether you use read.table or scan.
  1. Set nrows=the number of records in your data (nmax in scan).
  2. Make sure that comment.char="" to turn off interpretation of comments.
  3. Explicitly define the classes of each column using colClasses in read.table.
  4. Setting multi.line=FALSE may also improve performance in scan.
If none of these things works, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of read.table based on the results.
The other alternative is filtering your data before you read it into R.
Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with save, then next time you can retrieve it faster with load.

How to replace missing values with the median

f = function(x){
   x <- as.numeric(as.character(x))      # first convert the column to numeric if it is a factor
   x[is.na(x)] <- median(x, na.rm=TRUE)  # replace NA entries with the column median
   x                                     # return the column
}
ss = data.frame(apply(df, 2, f))

GLM course in R

Tuning and training a model

control=optional parameters for controlling tree growth. For example, control=rpart.control(minsplit=30, cp=0.001) requires that the minimum number of observations in a node be 30 before attempting a split and that a split must decrease the overall lack of fit by a factor of 0.001 (cost complexity factor) before being attempted.
http://www.inside-r.org/node/87027

Classification and Regression Trees : rpart

How to prune a decision tree?
Prune back the tree to avoid overfitting the data. Typically, you will want to select a tree size that minimizes the cross-validated error, the xerror column printed by printcp( ).
Prune the tree to the desired size using
prune(fit, cp= )
Specifically, use printcp( ) to examine the cross-validated error results, select the complexity parameter associated with minimum error, and place it into the prune( ) function. Alternatively, you can use the code fragment
     fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"]
to automatically select the complexity parameter associated with the smallest cross-validated error. Thanks to HSAUR for this idea.
http://www.mayo.edu/hsr/techrpt/61.pdf
http://www.statmethods.net/advstats/cart.html - classification, regression trees, random forests

Monday, September 15, 2014

When should you do centering and scaling?

In regression, it is often recommended to center the variables so that the predictors have mean 0. This makes the intercept term interpretable as the expected value of Y_i when the predictor values are set to their means. Otherwise, the intercept is interpreted as the expected value of Y_i when the predictors are set to 0, which may not be a realistic or interpretable situation (e.g. what if the predictors were height and weight?). Another practical reason for scaling in regression is when one variable has a very large scale, e.g. if you were using population size of a country as a predictor. In that case, the regression coefficients may be on a very small order of magnitude (e.g. 10^-6), which can be a little annoying when you're reading computer output, so you may convert the variable to, for example, population size in millions. The convention that you standardize predictors primarily exists so that the units of the regression coefficients are the same.
As @gung alludes to and @MånsT shows explicitly (+1 to both, btw), centering/scaling does not affect your statistical inference in regression models - the estimates are adjusted appropriately and the p-values will be the same.
Other situations where centering and/or scaling may be useful:
  • when you're trying to sum or average variables that are on different scales, perhaps to create a composite score of some kind. Without scaling, it may be the case that one variable has a larger impact on the sum due purely to its scale, which may be undesirable.
  • To simplify calculations and notation. For example, the sample covariance matrix of a matrix of values centered by their sample means is simply X'X / (n - 1). Similarly, if a univariate random variable X has been mean-centered, then var(X) = E(X^2) and the variance can be estimated from a sample by looking at the sample mean of the squares of the observed values.
  • Related to the above, PCA can only be interpreted as the singular value decomposition of a data matrix when the columns have first been centered by their means.
Note that scaling is not necessary in the last two bullet points I mentioned, and centering may not be necessary in the first bullet I mentioned, so the two do not need to go hand in hand at all times.
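A small NumPy sketch of the point above that centering only changes the intercept's interpretation, not the slope estimate (synthetic data, least squares by hand):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 100)                  # predictor on an arbitrary scale
y = 3.0 * x + rng.normal(0, 1, 100)
xc = x - x.mean()                            # centered predictor

A = np.column_stack([np.ones(100), x])
Ac = np.column_stack([np.ones(100), xc])
print(np.linalg.lstsq(A, y, rcond=None)[0])   # intercept near 0 (value of y at x = 0), slope ~ 3
print(np.linalg.lstsq(Ac, y, rcond=None)[0])  # intercept equals mean(y), slope unchanged (~ 3)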