My experiments with Big Data: December 2013

Tuesday, December 24, 2013

Python Tips and Tricks - learning on the fly

Monday, December 23, 2013

Personalization of Search results - State of the art

There are two types of personalization strategies:

Profile based - both long-term and short-term contexts are very important for profile-based personalization strategies.
Click Based -
Personalization brings significant search accuracy improvements on the queries with large click entropy and has little effect on queries with small click entropy.
Personalization can even harm the search accuracy on some queries.
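
The notes above do not define click entropy. Here is a minimal sketch, assuming the usual definition: the Shannon entropy of the distribution of clicked results for a query. The function name click_entropy and the sample URLs are my own illustration.

```python
import math
from collections import Counter

def click_entropy(clicked_urls):
    """Shannon entropy (in bits) of the clicked results for a single query."""
    total = len(clicked_urls)
    if total == 0:
        return 0.0
    counts = Counter(clicked_urls)
    # Sum of p * log2(1/p) over the distinct clicked URLs, with p = c / total.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Everyone clicks the same result: low entropy, little room for personalization.
print(click_entropy(["a.com", "a.com", "a.com"]))
# Clicks spread over several results: high entropy, personalization may help.
print(click_entropy(["a.com", "b.com", "c.com", "d.com"]))
```

A query where users all click the same URL scores 0, and the score grows as clicks spread over more results.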

Thursday, December 19, 2013

Data Preparation Tricks

Replace every standalone NULL with 0 in place, keeping a .bak backup of the original file:
perl -p -i.bak -e 's/\bNULL\b/0/g' filename
Print all rows, dividing one column by another:
awk '{print $1"\t"$3/$2}' merchant_ctr > merchant_ctr_final
Print the rows where the ctr value is greater than 0:
awk '{if($2>0) print;}' merchant_ctr_final > merchant_ctr_final_pos
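
The same three steps can be sketched in Python. This is only a rough equivalent of the perl and awk commands, assuming whitespace-separated columns. The function names are made up for illustration, and rows with a zero denominator are skipped, which is my assumption (some awk implementations abort on division by zero).

```python
import re

def replace_null(line):
    """Replace every standalone NULL token with 0 (like the perl command)."""
    return re.sub(r"\bNULL\b", "0", line)

def ratio_rows(lines):
    """Yield (column 1, column 3 / column 2) for each row."""
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip malformed rows
        denominator = float(fields[1])
        if denominator == 0:
            continue  # assumption: drop rows that would divide by zero
        yield fields[0], float(fields[2]) / denominator

def positive_rows(rows):
    """Keep only the rows whose ratio is greater than 0."""
    return [(key, value) for key, value in rows if value > 0]

sample = ["m1 10 2", "m2 NULL 3", "m3 5 0"]
cleaned = [replace_null(line) for line in sample]
for key, value in positive_rows(ratio_rows(cleaned)):
    print(key + "\t" + str(value))
```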

http://www.ibm.com/developerworks/library/l-p102/index.html

What is the run time complexity of linear regression ?

http://stackoverflow.com/questions/1955088/what-is-the-bigo-of-linear-regression
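
A rough operation count, assuming the regression is solved with the normal equations on an n x p design matrix X (other solvers, such as QR decomposition or gradient descent, have different costs):

```latex
\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y, \qquad X \in \mathbb{R}^{n \times p}

\underbrace{O(n p^{2})}_{\text{form } X^{\top} X}
+ \underbrace{O(n p)}_{\text{form } X^{\top} y}
+ \underbrace{O(p^{3})}_{\text{solve the } p \times p \text{ system}}
= O(n p^{2} + p^{3})
```

With many more rows than columns, the n p^2 term dominates.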

Monday, December 16, 2013

How to get started with scikit-learn in python

I have had difficulties managing the versions of different libraries like scikit-learn (imported as sklearn), numpy and matplotlib. The best way I could figure out was to build scikit-learn from source and add it to PYTHONPATH. The steps are documented below.

git clone https://github.com/scikit-learn/scikit-learn.git

cd scikit-learn

export PYTHONPATH="/home/yourname/bin/scikit-learn"

python setup.py build_ext --inplace

make
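
To check that Python is really picking up the build from PYTHONPATH and not an older installed copy, something like the following can help. It is a small sketch: module_info is a name I made up, and not every module defines __version__.

```python
import importlib

def module_info(name):
    """Return the version and file location of an importable module, or None."""
    try:
        module = importlib.import_module(name)
    except Exception:
        # A missing or half-built package can fail in several ways.
        return None
    return {
        "name": name,
        "version": getattr(module, "__version__", None),
        "file": getattr(module, "__file__", None),
    }

for name in ("sklearn", "numpy", "matplotlib"):
    print(name, module_info(name))
```

If the printed file path for sklearn does not point inside the directory you put on PYTHONPATH, the source build is not the one being imported.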

Resources :

http://scikit-learn.org/dev/developers/index.html#retrieving-the-latest-code
http://stackoverflow.com/questions/12219657/upgrade-version-of-scikit-learn-included-in-enthought-distribution

Friday, December 13, 2013

Data Preparation : kaggle Facebook Recruiting competition III

Each record in the data ends with \r, so you can replace all the \n with spaces and replace all the \r with \n.

#!/bin/bash

# Both the input and the output file are required.
if [ -z "$1" ] || [ -z "$2" ] ; then
    echo "First replaces all the \\n with spaces then replaces all the \\r with \\n"
    echo "usage: $0 input.csv output.csv"
    exit 1
fi

tr '\n' ' ' < "$1" | tr '\r' '\n' > "$2"

<post from kaggle forum>
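
The same replacement can be sketched in Python. This is my own rough equivalent of the tr pipeline, not part of the forum post. The function names are made up, and the UTF-8 encoding is an assumption about the file.

```python
def normalize_records(text):
    """Turn embedded \\n into spaces, then turn record-ending \\r into \\n."""
    return text.replace("\n", " ").replace("\r", "\n")

def normalize_file(src_path, dst_path):
    # newline="" stops Python from translating \r and \n while reading or writing.
    # encoding="utf-8" is an assumption about the input file.
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        # Read in 1 MB chunks so a large file does not have to fit in memory.
        for chunk in iter(lambda: src.read(1 << 20), ""):
            dst.write(normalize_records(chunk))

print(repr(normalize_records('1,"first line\nsecond line"\r2,"other"\r')))
```

Because each character is replaced independently of its neighbours, processing the file chunk by chunk gives the same result as processing it in one go.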

head -n [number of lines] Train.csv > sample_train.csv

Python script to parse the data :

import csv
import sys

# Print the first n records of a comma-separated CSV file (Python 3).
if len(sys.argv) != 3:
    print('Wrong number of arguments. This tool will print first n records from a comma separated CSV file.', file=sys.stderr)
    print('Usage:', file=sys.stderr)
    print(' python', sys.argv[0], '<file> <number-of-lines>', file=sys.stderr)
    sys.exit(1)

fileName = sys.argv[1]
n = int(sys.argv[2])

# Quote every non-numeric field on output.
out = csv.writer(sys.stdout, delimiter=',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)

# newline='' lets the csv module handle line breaks inside quoted fields.
with open(fileName, newline='') as csvfile:
    for i, row in enumerate(csv.reader(csvfile, delimiter=',', quotechar='"'), start=1):
        if i > n:
            break
        out.writerow(row)

Wednesday, December 11, 2013

How to interpret logistic regression data ?

In a previous post, we analyzed which version of logistic regression to use depending on what our data looks like.

This post is aimed at interpreting the model that you have built using logistic regression.
http://www.stat.wisc.edu/~mchung/teaching/MIA/reading/GLM.logistic.Rpackage.pdf
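
One standard way to read a fitted logistic regression: each coefficient is a change in log-odds, so exponentiating it gives an odds ratio. A minimal sketch with made-up coefficients (the numbers and function names are illustrative assumptions, not from a real model):

```python
import math

def odds_ratio(coefficient):
    """Multiplicative change in the odds for a one-unit increase in the predictor."""
    return math.exp(coefficient)

def predicted_probability(intercept, coefficients, values):
    """Probability of the positive class from the linear predictor."""
    z = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted model: intercept -2.0 and two predictors.
intercept, coefficients = -2.0, [0.8, -0.3]
for b in coefficients:
    print("coefficient", b, "odds ratio", round(odds_ratio(b), 3))
print(round(predicted_probability(intercept, coefficients, [1.0, 2.0]), 3))
```

An odds ratio above 1 means the predictor raises the odds of the positive outcome, and one below 1 means it lowers them.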

How do I know that I have enough data for my logistic regression ?
I think I need more data and more positive signals in the data.

Count Data

http://en.wikipedia.org/wiki/Count_data
http://www.ats.ucla.edu/stat/stata/seminars/count_presentation/count.htm

Tuesday, December 10, 2013

When to use logistic regression and exact logistic regression

The general logistic regression process does not work very well for small sample sets. The general logistic regression process is described here : http://www.ats.ucla.edu/stat/r/dae/logit.htm
For small sample sets, use exact logistic regression : http://www.ats.ucla.edu/stat/r/dae/exlogit.htm
To understand how the maximum likelihood estimation for logistic regression is biased for rare events, read : http://www.statisticalhorizons.com/logistic-regression-for-rare-events and http://www.cscu.cornell.edu/news/statnews/stnews82.pdf

When to use exact logistic regression instead of regular logistic regression?

It is used when the sample size is too small for a regular logistic regression (which uses the standard maximum-likelihood-based estimator) and/or when some of the cells formed by the outcome and categorical predictor variable have no observations. The estimates given by exact logistic regression do not depend on asymptotic results.

What is separation in the data ?

http://en.wikipedia.org/wiki/Separation_(statistics)
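
For a single numeric predictor, complete separation means some threshold splits the two outcome classes perfectly. A minimal sketch of that check (the function name and data are made up, and it only covers the one-predictor case with 0/1 outcomes):

```python
def is_completely_separated(xs, ys):
    """True if a threshold on xs splits the 0 and 1 outcomes perfectly."""
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    if not zeros or not ones:
        return False  # need both classes to talk about separation
    return max(zeros) < min(ones) or max(ones) < min(zeros)

# Every x below 3 is a non-event and every x from 3 up is an event.
print(is_completely_separated([1, 2, 3, 4], [0, 0, 1, 1]))
# The classes overlap, so no threshold separates them.
print(is_completely_separated([1, 3, 2, 4], [0, 0, 1, 1]))
```

When this returns True, the maximum likelihood estimate for that predictor does not exist (the coefficient drifts toward infinity), which is why the alternatives discussed here come up.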

When there is separation in the data, do we use exact logistic regression or Firth's logistic regression ?

When the data set is small, use exact logistic regression. When you have a lot of non-events, use Firth's logistic regression, as suggested here : http://sas-and-r.blogspot.com/2010/11/example-815-firth-logistic-regression.html

When do you use a mixed effects logistic regression model ?
When there are both fixed and random effects in the data. When the data has rank bias or kadu quality score bias, you use mixed effects logistic regression. Some of the other biases might be variable and random.
http://www2.hawaii.edu/~kdrager/MixedEffectsModels.pdf
