My experiments with Big Data
This blog is about my learnings in big data, product management and digital advertising.
Wednesday, October 29, 2014
Adaptive Learning rate in gradient descent
Depending on the cost function F we select, we can face different problems. When the Sum of Squared Errors is chosen as the cost function, the gradient ∂F(Wj)/∂Wj grows larger and larger as the size of the training dataset increases, so the learning rate λ must be adapted to significantly smaller values.
One way to resolve this problem is to divide λ by N, where N is the size of the training data. The update step of the algorithm can then be rewritten as:
Wj = Wj - (λ/N) * ∂F(Wj)/∂Wj
You can read more about this in the paper by Wilson et al., “The general inefficiency of batch training for gradient descent learning”.
Finally, another way to resolve this problem is to select a cost function that is not affected by the number of training examples, such as the Mean Squared Error.
This technique was used in the online gradient descent code by tinrtu in the Criteo Ad Click Competition organized by Kaggle.
Reference : http://blog.datumbox.com/tuning-the-learning-rate-in-gradient-descent/
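As a sketch of the scaled update (a minimal illustrative implementation, not the code from the competition; here F is the Sum of Squared Errors for a linear model):

```python
import numpy as np

def gradient_descent_sse(X, y, lr=0.1, epochs=500):
    """Batch gradient descent on F = 0.5 * ||Xw - y||^2.

    The SSE gradient grows with the number of training examples N,
    so the step is scaled by lr/N -- equivalent to running with the
    Mean Squared Error cost at the unscaled rate."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        grad = X.T @ (X @ w - y)   # dF/dWj, grows with N
        w -= (lr / n) * grad       # the (lambda/N) update step
    return w

w = gradient_descent_sse(np.array([[1.], [2.], [3.], [4.]]),
                         np.array([2., 4., 6., 8.]))
# recovers the slope of y = 2x
```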
Tuesday, October 28, 2014
How to score your model using different scoring functions in Python
The scoring parameter can be a callable that takes model predictions and ground truth.
However, if you want to use a scoring function that takes additional parameters, such as fbeta_score, you need to build an appropriate scoring object. The simplest way to generate a callable object for scoring is make_scorer, which converts score functions into callables that can be used for model evaluation.
One typical use case is to wrap an existing scoring function from the library with non-default values for its parameters, such as the beta parameter of the fbeta_score function:
>>> from sklearn.metrics import fbeta_score, make_scorer
>>> ftwo_scorer = make_scorer(fbeta_score, beta=2)
>>> from sklearn.grid_search import GridSearchCV
>>> from sklearn.svm import LinearSVC
>>> grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]}, scoring=ftwo_scorer)
The second use case is to build a completely new and custom scorer object from a simple python function:
>>> import numpy as np
>>> def my_custom_loss_func(ground_truth, predictions):
...     diff = np.abs(ground_truth - predictions).max()
...     return np.log(1 + diff)
...
>>> my_custom_scorer = make_scorer(my_custom_loss_func, greater_is_better=False)
>>> grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]}, scoring=my_custom_scorer)
make_scorer takes as parameters:
- the function you want to use
- whether it is a score (greater_is_better=True) or a loss (greater_is_better=False)
- whether the function takes predictions as input (needs_threshold=False) or confidence scores (needs_threshold=True)
- any additional parameters, such as beta in fbeta_score.
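Put together as a runnable sketch (note: in recent scikit-learn versions GridSearchCV lives in sklearn.model_selection rather than sklearn.grid_search; the dataset here is a synthetic stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older releases
from sklearn.svm import LinearSVC

# Synthetic binary classification data as a stand-in
X, y = make_classification(n_samples=200, random_state=0)

# Wrap fbeta_score with a non-default beta, then hand it to grid search
ftwo_scorer = make_scorer(fbeta_score, beta=2)
grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]}, scoring=ftwo_scorer)
grid.fit(X, y)
best_C = grid.best_params_['C']
```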
Tuesday, September 30, 2014
Criteolab kaggle challenge
- Take a random sample of the train data
- Decision trees worked worse than even a random solution in this case
- Logistic regression with only the independent variables works better
- The data is huge and we can't load it all into memory; we probably don't need all of it to learn a model, but we do need more insight into the categorical variables.
- Perl script to find statistics of the categorical variables : https://github.com/novieq/kaggle/blob/master/test/stats.pl
- Convert unknown values into NA so that they are treated as missing values, or convert all the categorical variables to CTR values
Categorical Variables
Bias Variance Tradeoff
# Bias-variance decomposition of the MSE; averages are taken over
# repeated fits (axis=1), f is the noise-free true function, and
# y_predict holds one column of predictions per fit
y_noise = np.var(y_test, axis=1)                        # irreducible noise
y_bias = (f(X_test) - np.mean(y_predict, axis=1)) ** 2  # squared bias
y_var = np.var(y_predict, axis=1)                       # variance of the predictions
MSE = y_noise + y_bias + y_var
Kaggle Solutions
- Don't overfit : https://www.kaggle.com/c/overfitting/forums/t/593/results-auc
- Predicting biological response : https://github.com/emanuele/kaggle_pbr : emanuele's solution
- Criteo 3rd Place : http://www.kaggle.com/c/criteo-display-ad-challenge/forums/t/10547/document-and-code-for-the-3rd-place-finish
Monday, September 29, 2014
Data transformations tips
- If a variable's distribution has a long right tail (a right-skewed distribution), apply a Box-Cox transformation (taking log() is a quick & dirty way). Standardize all variables when in doubt (it does not hurt anyway).
- We want to turn categorical features into count features, because one-hot encoding would curse us with dimensionality and render tree-based models unmanageable. We store the counts from the train set for every unique hash in a dictionary, and use it to replace each hash with its occurrence count.
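A minimal sketch of this count encoding (the helper name is hypothetical; plain Python):

```python
from collections import Counter

def count_encode(train_col, test_col):
    """Replace each categorical value with its occurrence count in the
    training column; values never seen in training map to 0."""
    counts = Counter(train_col)
    return ([counts[v] for v in train_col],
            [counts.get(v, 0) for v in test_col])

train_enc, test_enc = count_encode(["a", "b", "a", "c", "a"], ["a", "d", "b"])
# train_enc == [3, 1, 3, 1, 3], test_enc == [3, 0, 1]
```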
Python Tools
- http://web.stanford.edu/~mwaskom/software/seaborn/
- https://pypi.python.org/pypi/joblib
- Matplotlib provides Matlab-like plotting: http://matplotlib.org/users/pyplot_tutorial.html
- Numpy Tutorial http://wiki.scipy.org/Tentative_NumPy_Tutorial
Naive Bayes Classifier tips
In a nutshell, the Gaussian Naive Bayes model is generally used for continuous data (where each feature is a real number), where the underlying data distribution is assumed to be a Gaussian (Normal) distribution.
The Multinomial Naive Bayes model counts how often a certain event occurs in the dataset (for example how often a certain word occurs in a document).
The Bernoulli Naive Bayes model is similar to the Multinomial model, but instead of counting how often an event occurred, it only records whether or not the event occurred (for example, whether a certain word occurs in a document, no matter if it occurs once or 100,000 times).
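The three variants side by side, sketched on tiny made-up data (scikit-learn estimators; the numbers are arbitrary):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Continuous real-valued features: Gaussian Naive Bayes
X_cont = np.array([[1.0, 2.1], [1.2, 1.9], [8.0, 9.1], [7.9, 8.8]])
g_pred = GaussianNB().fit(X_cont, y).predict([[1.1, 2.0]])

# Event counts (e.g. word counts per document): Multinomial Naive Bayes
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 2], [0, 3, 1]])
m_pred = MultinomialNB().fit(X_counts, y).predict([[2, 0, 1]])

# Binary occurrence (did the word appear at all?): Bernoulli Naive Bayes
b_pred = BernoulliNB().fit((X_counts > 0).astype(int), y).predict([[1, 0, 1]])
```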
Saturday, September 27, 2014
Random Forests tips
Extremely Randomized Trees : In Random Forests, each split is chosen to maximize the information gain (reduce the entropy); in extremely randomized trees, split thresholds are drawn at random. This reduces the variance and increases the bias.
Regularized Greedy Forest : An additive model trained with boosting-style greedy optimization; the boosting reduces the bias while the regularization keeps the variance in check.
Support Vector Machine tips
- Works better on high dimensional data than logistic regression (features >> samples)
- Creates a maximum-margin classifier, hence it is sensitive to outliers
- Outlier removal can give a better model
- C works in the reverse direction of lambda as a regularization parameter
- A high value of C forces the training error term toward zero, so the model fits the training data very closely: a high-variance model
- On extremely high dimensional data sometimes, PCA followed by SVM works well.
- My rank-57 finish in the African Soil Challenge competition on Kaggle was an ensemble of PCA+SVM and SVM on the raw data, and I used a Box-Cox (log) transformation for the P target. P was the hardest to predict, and its distribution was skewed, so a log transform helped. This was also used as an input to the blend.
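The PCA-then-SVM idea can be sketched with a scikit-learn Pipeline on synthetic wide data (the component count and C here are illustrative, not my competition settings):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Wide data: far more features than samples
X, y = make_classification(n_samples=60, n_features=300, n_informative=10,
                           random_state=0)

pca_svm = Pipeline([("pca", PCA(n_components=20)),  # compress 300 -> 20 dims
                    ("svm", SVC(C=1.0))])
pca_svm.fit(X, y)
train_acc = pca_svm.score(X, y)
```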
Tuesday, September 23, 2014
Posterior, Prior and Likelihood
Bayes Theorem is a very common and fundamental theorem used in Data mining and Machine learning. Its formula is pretty simple:
P(X|Y) = ( P(Y|X) * P(X) ) / P(Y), which is Posterior = ( Likelihood * Prior ) / Evidence
So I was wondering why they are named that way.
Let’s use an example to find out their meanings.
Example
Suppose we have 100 movies and 50 books.
There are 3 different movie types: Action, Sci-fi, Romance,
2 different book types: Sci-fi, Romance
20 of those 100 movies are Action, 30 are Sci-fi, and 50 are Romance. 15 of the 50 books are Sci-fi and 35 are Romance.
So, given an unclassified object:
The probability that it's a movie is 100/150, and 50/150 that it's a book. The probability that it's Sci-fi is 45/150, Action 20/150, and Romance 85/150.
If we already know it's a movie, the probability that it's an Action movie is 20/100, 30/100 for Sci-fi and 50/100 for Romance. If we already know it's a book, the probability that it's a Sci-fi book is 15/50 and 35/50 for Romance.
Now we want to know: given an object of type Sci-fi, what is the probability that it's a movie?
Using Bayes theorem, we know that the formula is:
P(movie|Sci-fi) = P(Sci-fi| Movie) * P(Movie) / P(Sci-fi)
Here, P(movie|Sci-fi) is called Posterior,
P(Sci-fi|Movie) is Likelihood,
P(movie) is Prior,
P(Sci-fi) is Evidence.
Now let’s see why they are called like that.
Prior: Before we observe that it's a Sci-fi type, the object is completely unknown to us. Our goal is to find the probability that it's a movie, and we already have that probability prior to (before) our observation: the probability that a completely unknown object is a movie, P(movie).
Posterior: After we observe that it's a Sci-fi type, we know something about the object. Because this comes post (after) the observation, we call it the posterior: P(movie|Sci-fi).
Evidence: We have already seen that it's a Sci-fi type; we witness its appearance, so to us it is evidence, and the chance of obtaining this evidence is P(Sci-fi).
Likelihood: The dictionary meaning of this word is the chance or probability that something will happen. Here it means: given that it's a movie, what is the chance that it is also a Sci-fi type? This term is very important in Machine Learning.
So the time of observation, before or after, is the key reason these probabilities are named the way they are.
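The numbers from the example make Bayes theorem concrete:

```python
# Quantities from the movies/books example above
p_scifi_given_movie = 30 / 100   # likelihood
p_movie = 100 / 150              # prior
p_scifi = 45 / 150               # evidence

p_movie_given_scifi = p_scifi_given_movie * p_movie / p_scifi  # posterior
# equals 30/45 = 2/3: of the 45 Sci-fi objects, 30 are movies
```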
Thursday, September 18, 2014
Random Forest Titanic
This particular python function requires floats for the input variables, so all strings need to be converted, and any missing data needs to be filled.
Not all types of data can be converted into floats. For example, Names would be very difficult, so let's decide to neglect those columns. Although they are strings, categorical variables like male and female can be converted to 1 and 0, and the port of embarkation, which has three categories, can be converted to 0, 1 or 2 (Cherbourg, Southampton and Queenstown). This may seem like a nonsensical way of encoding, since Queenstown is not twice the value of Southampton, but random forests are somewhat robust when the number of distinct values is not too large.
Converting from categorical strings to floats is intuitive. Filling in missing data, however, is trickier. Some data cannot be trivially filled (such as Cabin) without complete knowledge of every cabin and ticket price for the entire ship. Nonetheless, Fare can be estimated if you know the class, and the age of a passenger can be estimated using the median age of the people on board. Fortunately for us, the amount of missing data here is not too large, so the method you choose to fill it shouldn't have much of an effect on your predictive result.
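A pandas sketch of these two steps on a hypothetical slice of the data (column names follow the Titanic dataset; the rows are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["C", "S", "Q", "S"],
    "Age": [22.0, None, 26.0, 35.0],
})

# Categorical strings -> numbers
df["Sex"] = df["Sex"].map({"female": 0, "male": 1})
df["Embarked"] = df["Embarked"].map({"C": 0, "S": 1, "Q": 2})

# Fill missing ages with the median age on board
df["Age"] = df["Age"].fillna(df["Age"].median())
```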
Python Notes
- If you want to call a specific column of data, say the gender column, just type data[0::,4]. Remember that "0::" means all rows (from start to end), and Python starts indices from 0, not 1.
- The csv reader works with strings by default, so you will need to convert to floats in order to do numerical calculations. For example, you can turn the Pclass variable into floats with data[0::,2].astype(np.float).
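For instance (a made-up two-row array; note that modern NumPy uses the builtin float, since np.float has been removed):

```python
import numpy as np

# csv readers hand back strings; slice column 0 of all rows and convert
data = np.array([["1", "male", "22"],
                 ["3", "female", "26"]])
pclass = data[0::, 0].astype(float)
# pclass is now an array of floats: 1.0 and 3.0
```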
Wednesday, September 17, 2014
Interpreting Logistic Regression Results
Call:
glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = binomial(logit),
data = menarche)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0363 -0.9953 -0.4900 0.7780 1.3675
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -21.22639    0.77068  -27.54   <2e-16 ***
Age           1.63197    0.05895   27.68   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3693.884  on 24  degrees of freedom
Residual deviance:   26.703  on 23  degrees of freedom
AIC: 114.76

Number of Fisher Scoring iterations: 4

The following requests also produce useful results: glm.out$coef, glm.out$fitted, glm.out$resid, glm.out$effects, and anova(glm.out).
Recall that the response variable is log odds, so the coefficient of "Age" can be interpreted as "for every one year increase in age the odds of having reached menarche increase by exp(1.632) = 5.11 times."
To evaluate the overall performance of the model, look at the null deviance and residual deviance near the bottom of the print out. Null deviance shows how well the response is predicted by a model with nothing but an intercept (grand mean). This is essentially a chi square value on 24 degrees of freedom, and indicates very little fit (a highly significant difference between fitted values and observed values). Adding in our predictors--just "Age" in this case--decreased the deviance by 3667 points on 1 degree of freedom. Again, this is interpreted as a chi square value and indicates a highly significant decrease in deviance. The residual deviance is 26.7 on 23 degrees of freedom. We use this to test the overall fit of the model by once again treating this as a chi square value. A chi square of 26.7 on 23 degrees of freedom yields a p-value of 0.269. The null hypothesis (i.e., the model) is not rejected. The fitted values are not significantly different from the observed values.
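The same chi-square checks can be reproduced in Python with scipy (values taken from the printout above):

```python
from scipy.stats import chi2

# Residual deviance 26.703 on 23 df: overall goodness of fit
p_fit = chi2.sf(26.703, df=23)            # about 0.269, model not rejected

# Drop in deviance from the null model, on 1 df: significance of Age
p_age = chi2.sf(3693.884 - 26.703, df=1)  # vanishingly small
```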
http://ww2.coastal.edu/kingw/statistics/R-tutorials/logistic.html
Logarithmic Loss function in Criteo competition - Kaggle
The negative logarithm of the likelihood function for a Bernoulli random variable.
In plain English, this error metric is typically used where you have to predict that something is true or false with a probability (likelihood) ranging from definitely true (1) through completely uncertain (0.5) to definitely false (0).
The use of the log on the error provides extreme punishment for being both confident and wrong. In the worst possible case, a single prediction that something is definitely true (1) when it is actually false adds infinity to your error score and makes every other entry pointless. In Kaggle competitions, predictions are therefore bounded away from the extremes by a small value.
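A sketch of the metric with the clipping described above (the eps value is an illustrative choice):

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    """Bernoulli log loss; predictions are clipped away from 0 and 1 so a
    single confident-and-wrong entry cannot blow the score up to infinity."""
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1])
good = log_loss(y_true, np.array([0.9, 0.1, 0.8]))  # modest loss
bad = log_loss(y_true, np.array([0.0, 0.1, 0.8]))   # confident and wrong: large but finite
```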
Tuesday, September 16, 2014
Reading data in R
There are a couple of simple things to try, whether you use read.table or scan.
- Set nrows = the number of records in your data (nmax in scan).
- Make sure that comment.char="" to turn off interpretation of comments.
- Explicitly define the classes of each column using colClasses in read.table.
- Setting multi.line=FALSE may also improve performance in scan.
If none of these things work, use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of read.table based on the results.
The other alternative is filtering your data before you read it into R.
Or, if the problem is that you have to read it in regularly, use these methods to read the data in once, then save the data frame as a binary blob with save; next time you can retrieve it faster with load.
How to replace the missing values with the median
f = function(x) {
  x = as.numeric(as.character(x))       # first convert the column to numeric if it is a factor
  x[is.na(x)] = median(x, na.rm=TRUE)   # replace NA entries with the column median
  x                                     # return the column
}
ss = data.frame(apply(df, 2, f))