This blog is about my learnings in big data, product management and digital advertising.
Tuesday, December 24, 2013
Python Tips and Tricks - learning on the fly
Monday, December 23, 2013
Personalization of Search results - State of the art
There are two types of personalization strategies :
- Profile based - both long-term and short-term contexts are very important for profile-based personalization strategies
- Click based
- Personalization brings significant search accuracy improvements on the queries with large click entropy and has little effect on queries with small click entropy.
- Personalization can even harm the search accuracy on some queries.
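Click entropy is usually defined per query as the entropy of the click distribution over the clicked pages: the sum over pages of P(page | query) * log2(1 / P(page | query)). The snippet below is a minimal sketch of that definition. The function name and the input format (a plain list of clicked URLs for one query) are my own assumptions, not taken from any particular paper or log schema.

```python
import math
from collections import Counter


def click_entropy(clicked_pages):
    """Entropy (in bits) of the click distribution for a single query.

    clicked_pages: one entry per click, naming the page that was clicked
    (hypothetical input format, chosen for illustration).
    """
    total = len(clicked_pages)
    if total == 0:
        raise ValueError("need at least one click to compute click entropy")
    counts = Counter(clicked_pages)
    # p * log2(1 / p) summed over every distinct clicked page
    return sum((c / total) * math.log2(total / c) for c in counts.values())


if __name__ == "__main__":
    # Everyone clicks the same result: low entropy, little room for personalization.
    print(click_entropy(["a.com", "a.com", "a.com", "a.com"]))
    # Clicks are spread over several results: higher entropy.
    print(click_entropy(["a.com", "b.com", "c.com", "d.com"]))
```

A query where all users click the same page has entropy 0, and a query whose clicks are spread evenly over many pages has high entropy, which is where personalization is reported to help.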
Thursday, December 19, 2013
Data Preparation Tricks
- Replace every whole-word NULL with 0 in place, keeping a .bak backup : perl -p -i.bak -e 's/\bNULL\b/0/g' filename
- Print all rows, dividing the third column by the second : awk '{print $1"\t"$3/$2}' merchant_ctr > merchant_ctr_final
- Print the rows where the ctr value is greater than 0 : awk '{if($2>0) print;}' merchant_ctr_final > merchant_ctr_final_pos
http://www.ibm.com/developerworks/library/l-p102/index.html
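The two awk one-liners above can be sketched in Python as follows. This assumes whitespace-separated rows with a key in column 1, the denominator in column 2 and the numerator in column 3 (the column meanings are my guess). Unlike awk, the sketch skips rows with a zero denominator instead of failing.

```python
def compute_ratio(lines):
    """Return (key, column3 / column2) for each whitespace-separated line."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip malformed rows
        key = fields[0]
        denominator = float(fields[1])
        numerator = float(fields[2])
        if denominator == 0:
            continue  # avoid division by zero
        rows.append((key, numerator / denominator))
    return rows


def keep_positive(rows):
    """Keep only the rows whose ratio is greater than 0."""
    return [(key, ratio) for key, ratio in rows if ratio > 0]


if __name__ == "__main__":
    sample = ["m1 10 2", "m2 5 0", "m3 0 1"]
    for key, ratio in keep_positive(compute_ratio(sample)):
        print(key + "\t" + str(ratio))
```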
What is the run time complexity of linear regression ?
http://stackoverflow.com/questions/1955088/what-is-the-bigo-of-linear-regression
Monday, December 16, 2013
How to get started with scikit-learn in python
I have had difficulties managing the versions of different libraries like scikit-learn, numpy and matplotlib. The best way I could figure out was to build scikit-learn from source and add it to PYTHONPATH. The steps are documented below.
Steps :
git clone https://github.com/scikit-learn/scikit-learn.git
cd scikit-learn
export PYTHONPATH="/home/yourname/bin/scikit-learn"
python setup.py build_ext --inplace
make
Resources :
- http://scikit-learn.org/dev/developers/index.html#retrieving-the-latest-code
- http://stackoverflow.com/questions/12219657/upgrade-version-of-scikit-learn-included-in-enthought-distribution
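After a source build like this, it is worth checking which copy of each library Python actually imports. The helper below is a small sketch (the function name is mine). It reports the version and file path of a module, or None if the module cannot be imported.

```python
import importlib


def describe_module(name):
    """Return (version, path) for an importable module, or None if the import fails."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    version = getattr(module, "__version__", "unknown")
    path = getattr(module, "__file__", "built-in")
    return version, path


if __name__ == "__main__":
    for name in ("sklearn", "numpy", "matplotlib"):
        print(name, describe_module(name))
```

If the path printed for sklearn is not inside the directory you put on PYTHONPATH, an older installed copy is still taking precedence.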
Friday, December 13, 2013
Data Preparation : kaggle Facebook Recruiting competition III
Each record in the data ends with \r, so you can replace all the \n with spaces and replace all the \r with \n.
#!/bin/bash
# Replace every \n with a space, then every \r with \n.
if [ -z "$1" ] || [ -z "$2" ] ; then
    echo "First replaces all the \\n with spaces then replaces all the \\r with \\n"
    echo "usage: $0 input.csv output.csv"
    exit 1
fi
tr '\n' ' ' < "$1" | tr '\r' '\n' > "$2"
<post from kaggle forum>
head -n [number of lines] Train.csv > sample_train.csv
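The newline fix done by the shell script above can also be sketched in Python. It works on bytes, like tr does, so it does not depend on the file encoding. The function names and the chunk size are my own choices.

```python
def fix_record_separators(data):
    """Replace every \\n with a space, then every \\r with \\n (bytes in, bytes out)."""
    return data.replace(b"\n", b" ").replace(b"\r", b"\n")


def convert_file(src, dst, chunk_size=1 << 20):
    """Stream src to dst in chunks so that large files do not need to fit in memory."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            # Each byte is mapped independently, so chunk boundaries are safe.
            fout.write(fix_record_separators(chunk))
```

Usage would be something like convert_file("Train.csv", "Train_fixed.csv"), where the output file name is only an example.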
Python script to parse the data :
# Python 3: print the first n records of a comma separated CSV file to stdout.
import csv
import sys

if len(sys.argv) != 3:
    print('Wrong number of arguments. This tool will print first n records from a comma separated CSV file.', file=sys.stderr)
    print('Usage:', file=sys.stderr)
    print(' python', sys.argv[0], '<file> <number-of-lines>', file=sys.stderr)
    sys.exit(1)

fileName = sys.argv[1]
n = int(sys.argv[2])

out = csv.writer(sys.stdout, delimiter=',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
# newline='' lets the csv module handle line breaks inside quoted fields
with open(fileName, newline='') as csvfile:
    for i, row in enumerate(csv.reader(csvfile, delimiter=',', quotechar='"'), start=1):
        if i > n:
            break
        out.writerow(row)
Wednesday, December 11, 2013
How to interpret logistic regression data ?
In a previous post, we analyzed which version of logistic regression to use depending on what our data looks like.
This current post is aimed at interpreting the model that you have built using logistic regression.
http://www.stat.wisc.edu/~mchung/teaching/MIA/reading/GLM.logistic.Rpackage.pdf
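As a small illustration of what interpreting the model means in practice: a logistic regression coefficient is a change in log-odds, so exponentiating it gives an odds ratio, and the fitted log-odds can be converted back to a probability with the logistic function. The sketch below uses made-up coefficient values and hypothetical function names. It is not tied to any particular library's output.

```python
import math


def odds_ratio(coefficient):
    """Multiplicative change in the odds for a one-unit increase in the predictor."""
    return math.exp(coefficient)


def predicted_probability(intercept, coefficients, features):
    """Convert the linear predictor (log-odds) into a probability."""
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    # Made-up model: intercept -2.0, one predictor with coefficient 0.7
    print(odds_ratio(0.7))
    print(predicted_probability(-2.0, [0.7], [1.0]))
```

A coefficient of 0 gives an odds ratio of 1 (no effect), a positive coefficient gives an odds ratio above 1, and a negative one gives an odds ratio below 1.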
How do I know that I have enough data for my logistic regression ?
I think I need more data and more positive signals in the data.
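One rough way to put a number on this is the events-per-variable (EPV) heuristic: count the rarer outcome and divide by the number of predictors. A commonly quoted rule of thumb asks for roughly 10 or more, but that threshold is only a guideline and is my assumption here, not something derived from this data. A minimal sketch:

```python
def events_per_variable(labels, n_predictors):
    """Count of the rarer outcome (0 or 1) divided by the number of predictors."""
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return min(positives, negatives) / n_predictors


if __name__ == "__main__":
    labels = [1] * 20 + [0] * 980  # made-up data: 20 positives out of 1000 rows
    print(events_per_variable(labels, n_predictors=5))
```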
Count Data
http://en.wikipedia.org/wiki/Count_data
http://www.ats.ucla.edu/stat/stata/seminars/count_presentation/count.htm
Tuesday, December 10, 2013
When to use logistic regression and exact logistic regression
- The general logistic regression process does not work very well for small sample sets. The general process is described here : http://www.ats.ucla.edu/stat/r/dae/logit.htm
- For small sample sets, use exact logistic regression : http://www.ats.ucla.edu/stat/r/dae/exlogit.htm
- To understand how the maximum likelihood estimation for logistic regression is biased for rare events, read : http://www.statisticalhorizons.com/logistic-regression-for-rare-events and http://www.cscu.cornell.edu/news/statnews/stnews82.pdf
When to use exact logistic regression instead of regular logistic regression?
It is used when the sample size is too small for a regular logistic regression (which uses the standard maximum-likelihood-based estimator) and/or when some of the cells formed by the outcome and categorical predictor variable have no observations. The estimates given by exact logistic regression do not depend on asymptotic results.
What is separation in the data ?
http://en.wikipedia.org/wiki/Separation_(statistics)
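For a single numeric predictor, complete separation means that some threshold on the predictor splits the 0 outcomes from the 1 outcomes perfectly. The check below is a minimal sketch of that one-predictor case only. The function name is mine, and it does not detect separation caused by a combination of several predictors.

```python
def has_complete_separation(x, y):
    """True if a single numeric predictor x perfectly separates binary outcomes y."""
    zeros = [xi for xi, yi in zip(x, y) if yi == 0]
    ones = [xi for xi, yi in zip(x, y) if yi == 1]
    if not zeros or not ones:
        raise ValueError("both outcome classes must be present")
    # Separated when all x values of one class lie strictly below those of the other.
    return max(zeros) < min(ones) or max(ones) < min(zeros)


if __name__ == "__main__":
    print(has_complete_separation([1, 2, 3, 4], [0, 0, 1, 1]))  # classes do not overlap
    print(has_complete_separation([1, 3, 2, 4], [0, 0, 1, 1]))  # classes overlap
```

When this returns True, the maximum likelihood estimate for that coefficient does not exist (it drifts towards infinity), which is why the alternatives discussed next are needed.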
When there is separation in the data, do we use exact logistic regression or Firth's logistic regression ?
When the data is small, use exact logistic regression. When you have a lot of non-events, use Firth's logistic regression, as suggested here http://sas-and-r.blogspot.com/2010/11/example-815-firth-logistic-regression.html
When do you use mixed effects logistic regression model ?
When there are both fixed and random effects in the data. When the data has rank bias or kadu quality score bias, use mixed effects logistic regression. Some of the other biases might be variable and random.
http://www2.hawaii.edu/~kdrager/MixedEffectsModels.pdf