My experiments with Big Data
This blog is about my learnings in big data, product management and digital advertising.
Wednesday, October 29, 2014
Adaptive Learning rate in gradient descent
Depending on the cost function F we select, we can face different problems. When the Sum of Squared Errors is chosen as the cost function, the gradient ∂F(Wj)/∂Wj grows larger and larger as the size of the training dataset increases, so the learning rate λ must be adapted to significantly smaller values.
One way to resolve this problem is to divide λ by N, where N is the size of the training data. The update step of the algorithm can then be rewritten as:
Wj = Wj - (λ/N) * ∂F(Wj)/∂Wj
You can read more about this in the paper by Wilson et al., “The general inefficiency of batch training for gradient descent learning”.
Finally, another way to resolve this problem is to select a cost function that is not affected by the number of training examples, such as the Mean Squared Error.
This technique was used in the online gradient descent code by tinrtu in the Criteo Ad Click Competition organized by Kaggle.
Reference : http://blog.datumbox.com/tuning-the-learning-rate-in-gradient-descent/
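As a sketch of the scaled update (a minimal illustrative implementation, not the code from the competition; here F is the Sum of Squared Errors for a linear model):

```python
import numpy as np

def gradient_descent_sse(X, y, lr=0.1, epochs=500):
    """Batch gradient descent on F = 0.5 * ||Xw - y||^2.

    The SSE gradient grows with the number of training examples N,
    so the step is scaled by lr/N -- equivalent to running with the
    Mean Squared Error cost at the unscaled rate."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        grad = X.T @ (X @ w - y)   # dF/dWj, grows with N
        w -= (lr / n) * grad       # the (lambda/N) update step
    return w

w = gradient_descent_sse(np.array([[1.], [2.], [3.], [4.]]),
                         np.array([2., 4., 6., 8.]))
# recovers the slope of y = 2x
```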
Tuesday, October 28, 2014
How to score your model using different scoring functions in Python
The scoring parameter can be a callable that takes model predictions and ground truth.
However, if you want to use a scoring function that takes additional parameters, such as fbeta_score, you need to build an appropriate scoring object. The simplest way to generate a callable object for scoring is make_scorer, which converts score functions into callables that can be used for model evaluation.
One typical use case is to wrap an existing scoring function from the library with non-default values for its parameters, such as the beta parameter of the fbeta_score function:
>>> from sklearn.metrics import fbeta_score, make_scorer
>>> ftwo_scorer = make_scorer(fbeta_score, beta=2)
>>> from sklearn.grid_search import GridSearchCV
>>> from sklearn.svm import LinearSVC
>>> grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]}, scoring=ftwo_scorer)
The second use case is to build a completely new and custom scorer object from a simple python function:
>>> import numpy as np
>>> def my_custom_loss_func(ground_truth, predictions):
...     diff = np.abs(ground_truth - predictions).max()
...     return np.log(1 + diff)
...
>>> my_custom_scorer = make_scorer(my_custom_loss_func, greater_is_better=False)
>>> grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]}, scoring=my_custom_scorer)
make_scorer takes as parameters:
- the function you want to use
- whether it is a score (greater_is_better=True) or a loss (greater_is_better=False)
- whether the function takes predictions as input (needs_threshold=False) or confidence scores (needs_threshold=True)
- any additional parameters, such as beta in fbeta_score.
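Put together as a runnable sketch (note: in recent scikit-learn versions GridSearchCV lives in sklearn.model_selection rather than sklearn.grid_search; the dataset here is a synthetic stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older releases
from sklearn.svm import LinearSVC

# Synthetic binary classification data as a stand-in
X, y = make_classification(n_samples=200, random_state=0)

# Wrap fbeta_score with a non-default beta, then hand it to grid search
ftwo_scorer = make_scorer(fbeta_score, beta=2)
grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]}, scoring=ftwo_scorer)
grid.fit(X, y)
best_C = grid.best_params_['C']
```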
Tuesday, September 30, 2014
Criteolab kaggle challenge
- Take a random sample of the train data
- Decision trees worked worse than even a random solution in this case
- Logistic regression with only the independent variables works better
- The data is huge and we can't load it all into memory; we probably don't need all of it to learn a model, but we do need more insight into the categorical variables.
- Perl script to find statistics of the categorical variables : https://github.com/novieq/kaggle/blob/master/test/stats.pl
- Convert unknown values into NA so that they are treated as missing values, or convert all the categorical variables to CTR values
Categorical Variables
Bias Variance Tradeoff
# Bias-variance decomposition of the MSE; averages are taken over
# repeated fits (axis=1), f is the noise-free true function, and
# y_predict holds one column of predictions per fit
y_noise = np.var(y_test, axis=1)                        # irreducible noise
y_bias = (f(X_test) - np.mean(y_predict, axis=1)) ** 2  # squared bias
y_var = np.var(y_predict, axis=1)                       # variance of the predictions
MSE = y_noise + y_bias + y_var
Kaggle Solutions
- Don't overfit : https://www.kaggle.com/c/overfitting/forums/t/593/results-auc
- Predicting biological response : https://github.com/emanuele/kaggle_pbr : emanuele's solution
- Criteo 3rd Place : http://www.kaggle.com/c/criteo-display-ad-challenge/forums/t/10547/document-and-code-for-the-3rd-place-finish
Monday, September 29, 2014
Data transformations tips
- If a variable's distribution has a long right tail (a right-skewed distribution), apply a Box-Cox transformation (taking log() is a quick & dirty way). Standardize all variables when in doubt (it does not hurt anyway).
- We want to turn categorical features into count features, because one-hot encoding would curse us with dimensionality and render tree-based models unmanageable. We store the counts from the train set for every unique hash in a dictionary, and use it to replace each hash with its occurrence count.
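A minimal sketch of this count encoding (the helper name is hypothetical; plain Python):

```python
from collections import Counter

def count_encode(train_col, test_col):
    """Replace each categorical value with its occurrence count in the
    training column; values never seen in training map to 0."""
    counts = Counter(train_col)
    return ([counts[v] for v in train_col],
            [counts.get(v, 0) for v in test_col])

train_enc, test_enc = count_encode(["a", "b", "a", "c", "a"], ["a", "d", "b"])
# train_enc == [3, 1, 3, 1, 3], test_enc == [3, 0, 1]
```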
Python Tools
- http://web.stanford.edu/~mwaskom/software/seaborn/
- https://pypi.python.org/pypi/joblib
- Matplotlib provides Matlab-like plotting: http://matplotlib.org/users/pyplot_tutorial.html
- Numpy Tutorial http://wiki.scipy.org/Tentative_NumPy_Tutorial
Naive Bayes Classifier tips
In a nutshell, the Gaussian Naive Bayes model is generally used for continuous data (where each feature is a real number), where the underlying data distribution is assumed to be a Gaussian (Normal) distribution.
The Multinomial Naive Bayes model counts how often a certain event occurs in the dataset (for example how often a certain word occurs in a document).
The Bernoulli Naive Bayes model is similar to the Multinomial model, but instead of counting how often an event occurred, it only records whether or not the event occurred (for example, whether a certain word occurs in a document, no matter if it occurs once or 100,000 times).
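The three variants side by side, sketched on tiny made-up data (scikit-learn estimators; the numbers are arbitrary):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Continuous real-valued features: Gaussian Naive Bayes
X_cont = np.array([[1.0, 2.1], [1.2, 1.9], [8.0, 9.1], [7.9, 8.8]])
g_pred = GaussianNB().fit(X_cont, y).predict([[1.1, 2.0]])

# Event counts (e.g. word counts per document): Multinomial Naive Bayes
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 2], [0, 3, 1]])
m_pred = MultinomialNB().fit(X_counts, y).predict([[2, 0, 1]])

# Binary occurrence (did the word appear at all?): Bernoulli Naive Bayes
b_pred = BernoulliNB().fit((X_counts > 0).astype(int), y).predict([[1, 0, 1]])
```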
Saturday, September 27, 2014
Random Forests tips
Extremely Randomized Trees : In Random Forests, each split is chosen to maximize the information gain (reduce the entropy); in extremely randomized trees, split thresholds are drawn at random. This reduces the variance and increases the bias.
Regularized Greedy Forest : An additive model trained with boosting-style greedy optimization; the boosting reduces the bias while the regularization keeps the variance in check.
Support Vector Machine tips
- Works better on high dimensional data than logistic regression (features >> samples)
- Creates a maximum-margin classifier, hence it is sensitive to outliers
- Outlier removal can give a better model
- C works in the reverse direction of lambda as a regularization parameter
- A high value of C forces the training error term toward zero, so the model fits the training data very closely: a high-variance model
- On extremely high dimensional data sometimes, PCA followed by SVM works well.
- My rank-57 finish in the African Soil Challenge competition on Kaggle was an ensemble of PCA+SVM and SVM on the raw data, and I used a Box-Cox (log) transformation for the P target. P was the hardest to predict, and its distribution was skewed, so a log transform helped. This was also used as an input to the blend.
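The PCA-then-SVM idea can be sketched with a scikit-learn Pipeline on synthetic wide data (the component count and C here are illustrative, not my competition settings):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Wide data: far more features than samples
X, y = make_classification(n_samples=60, n_features=300, n_informative=10,
                           random_state=0)

pca_svm = Pipeline([("pca", PCA(n_components=20)),  # compress 300 -> 20 dims
                    ("svm", SVC(C=1.0))])
pca_svm.fit(X, y)
train_acc = pca_svm.score(X, y)
```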
Tuesday, September 23, 2014
Posterior, Prior and Likelihood
Bayes Theorem is a very common and fundamental theorem used in Data mining and Machine learning. Its formula is pretty simple:
P(X|Y) = ( P(Y|X) * P(X) ) / P(Y), which is Posterior = ( Likelihood * Prior ) / Evidence
So I was wondering why they are named that way.
Let’s use an example to find out their meanings.
Example
Suppose we have 100 movies and 50 books.
There are 3 different movie types: Action, Sci-fi, Romance,
2 different book types: Sci-fi, Romance
20 of those 100 movies are Action, 30 are Sci-fi, and 50 are Romance. 15 of the 50 books are Sci-fi and 35 are Romance.
So, given an unclassified object:
The probability that it's a movie is 100/150, and 50/150 that it's a book. The probability that it's Sci-fi is 45/150, Action 20/150, and Romance 85/150.
If we already know it's a movie, the probability that it's an Action movie is 20/100, 30/100 for Sci-fi and 50/100 for Romance. If we already know it's a book, the probability that it's a Sci-fi book is 15/50 and 35/50 for Romance.
Now we want to know: given an object of type Sci-fi, what is the probability that it's a movie?
Using Bayes theorem, we know that the formula is:
P(movie|Sci-fi) = P(Sci-fi| Movie) * P(Movie) / P(Sci-fi)
Here, P(movie|Sci-fi) is called Posterior,
P(Sci-fi|Movie) is Likelihood,
P(movie) is Prior,
P(Sci-fi) is Evidence.
Now let’s see why they are called like that.
Prior: Before we observe that it's a Sci-fi type, the object is completely unknown to us. Our goal is to find the probability that it's a movie, and we already have that probability prior to (before) our observation: the probability that a completely unknown object is a movie, P(movie).
Posterior: After we observe that it's a Sci-fi type, we know something about the object. Because this comes post (after) the observation, we call it the posterior: P(movie|Sci-fi).
Evidence: We have already seen that it's a Sci-fi type; we witness its appearance, so to us it is evidence, and the chance of obtaining this evidence is P(Sci-fi).
Likelihood: The dictionary meaning of this word is the chance or probability that something will happen. Here it means: given that it's a movie, what is the chance that it is also a Sci-fi type? This term is very important in Machine Learning.
So the time of observation, before or after, is the key reason these probabilities are named the way they are.
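The numbers from the example make Bayes theorem concrete:

```python
# Quantities from the movies/books example above
p_scifi_given_movie = 30 / 100   # likelihood
p_movie = 100 / 150              # prior
p_scifi = 45 / 150               # evidence

p_movie_given_scifi = p_scifi_given_movie * p_movie / p_scifi  # posterior
# equals 30/45 = 2/3: of the 45 Sci-fi objects, 30 are movies
```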
Thursday, September 18, 2014
Random Forest Titanic
This particular python function requires floats for the input variables, so all strings need to be converted, and any missing data needs to be filled.
Not all types of data can be converted into floats. For example, Names would be very difficult, so let's decide to neglect those columns. Although they are strings, categorical variables like male and female can be converted to 1 and 0, and the port of embarkation, which has three categories, can be converted to 0, 1 or 2 (Cherbourg, Southampton and Queenstown). This may seem like a nonsensical way of encoding, since Queenstown is not twice the value of Southampton, but random forests are somewhat robust when the number of distinct values is not too large.
Converting from categorical strings to floats is intuitive. Filling in missing data, however, is trickier. Some data cannot be trivially filled (such as Cabin) without complete knowledge of every cabin and ticket price for the entire ship. Nonetheless, Fare can be estimated if you know the class, and the age of a passenger can be estimated using the median age of the people on board. Fortunately for us, the amount of missing data here is not too large, so the method you choose to fill it shouldn't have much of an effect on your predictive result.
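A pandas sketch of these two steps on a hypothetical slice of the data (column names follow the Titanic dataset; the rows are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["C", "S", "Q", "S"],
    "Age": [22.0, None, 26.0, 35.0],
})

# Categorical strings -> numbers
df["Sex"] = df["Sex"].map({"female": 0, "male": 1})
df["Embarked"] = df["Embarked"].map({"C": 0, "S": 1, "Q": 2})

# Fill missing ages with the median age on board
df["Age"] = df["Age"].fillna(df["Age"].median())
```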
Python Notes
- If you want to call a specific column of data, say the gender column, just type data[0::,4]. Remember that "0::" means all rows (from start to end), and Python starts indices from 0, not 1.
- The csv reader works with strings by default, so you will need to convert to floats in order to do numerical calculations. For example, you can turn the Pclass variable into floats with data[0::,2].astype(np.float).
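For instance (a made-up two-row array; note that modern NumPy uses the builtin float, since np.float has been removed):

```python
import numpy as np

# csv readers hand back strings; slice column 0 of all rows and convert
data = np.array([["1", "male", "22"],
                 ["3", "female", "26"]])
pclass = data[0::, 0].astype(float)
# pclass is now an array of floats: 1.0 and 3.0
```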
Wednesday, September 17, 2014
Interpreting Logistic Regression Results
Call:
glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = binomial(logit),
data = menarche)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0363 -0.9953 -0.4900 0.7780 1.3675
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -21.22639    0.77068  -27.54   <2e-16 ***
Age           1.63197    0.05895   27.68   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3693.884  on 24  degrees of freedom
Residual deviance:   26.703  on 23  degrees of freedom
AIC: 114.76

Number of Fisher Scoring iterations: 4

The following requests also produce useful results: glm.out$coef, glm.out$fitted, glm.out$resid, glm.out$effects, and anova(glm.out).
Recall that the response variable is log odds, so the coefficient of "Age" can be interpreted as "for every one year increase in age the odds of having reached menarche increase by exp(1.632) = 5.11 times."
To evaluate the overall performance of the model, look at the null deviance and residual deviance near the bottom of the print out. Null deviance shows how well the response is predicted by a model with nothing but an intercept (grand mean). This is essentially a chi square value on 24 degrees of freedom, and indicates very little fit (a highly significant difference between fitted values and observed values). Adding in our predictors--just "Age" in this case--decreased the deviance by 3667 points on 1 degree of freedom. Again, this is interpreted as a chi square value and indicates a highly significant decrease in deviance. The residual deviance is 26.7 on 23 degrees of freedom. We use this to test the overall fit of the model by once again treating this as a chi square value. A chi square of 26.7 on 23 degrees of freedom yields a p-value of 0.269. The null hypothesis (i.e., the model) is not rejected. The fitted values are not significantly different from the observed values.
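The same chi-square checks can be reproduced in Python with scipy (values taken from the printout above):

```python
from scipy.stats import chi2

# Residual deviance 26.703 on 23 df: overall goodness of fit
p_fit = chi2.sf(26.703, df=23)            # about 0.269, model not rejected

# Drop in deviance from the null model, on 1 df: significance of Age
p_age = chi2.sf(3693.884 - 26.703, df=1)  # vanishingly small
```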
http://ww2.coastal.edu/kingw/statistics/R-tutorials/logistic.html
Logarithmic Loss function in Criteo competition - Kaggle
The negative logarithm of the likelihood function for a Bernoulli random variable.
In plain English, this error metric is typically used where you have to predict that something is true or false with a probability (likelihood) ranging from definitely true (1) through completely uncertain (0.5) to definitely false (0).
The use of the log on the error provides extreme punishment for being both confident and wrong. In the worst possible case, a single prediction that something is definitely true (1) when it is actually false adds infinity to your error score and makes every other entry pointless. In Kaggle competitions, predictions are therefore bounded away from the extremes by a small value.
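A sketch of the metric with the clipping described above (the eps value is an illustrative choice):

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    """Bernoulli log loss; predictions are clipped away from 0 and 1 so a
    single confident-and-wrong entry cannot blow the score up to infinity."""
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1])
good = log_loss(y_true, np.array([0.9, 0.1, 0.8]))  # modest loss
bad = log_loss(y_true, np.array([0.0, 0.1, 0.8]))   # confident and wrong: large but finite
```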
Tuesday, September 16, 2014
Reading data in R
There are a couple of simple things to try, whether you use read.table or scan.
- Set nrows = the number of records in your data (nmax in scan).
- Make sure that comment.char="" to turn off interpretation of comments.
- Explicitly define the classes of each column using colClasses in read.table.
- Setting multi.line=FALSE may also improve performance in scan.
If none of these things work, use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of read.table based on the results.
The other alternative is filtering your data before you read it into R.
Or, if the problem is that you have to read it in regularly, use these methods to read the data in once, then save the data frame as a binary blob with save; next time you can retrieve it faster with load.
How to replace the missing values with the median
f = function(x) {
  x = as.numeric(as.character(x))       # first convert the column to numeric if it is a factor
  x[is.na(x)] = median(x, na.rm=TRUE)   # replace NA entries with the column median
  x                                     # return the column
}
ss = data.frame(apply(df, 2, f))