This blog is about my learnings in big data, product management and digital advertising.
Tuesday, December 24, 2013
Python Tips and Tricks - learning on the fly
Monday, December 23, 2013
Personalization of Search results - State of the art
There are two types of personalization strategies :
- Profile based - both long-term and short-term contexts are very important for profile-based personalization strategies.
- Click based - two key findings (a small click-entropy sketch follows this list):
- Personalization brings significant search accuracy improvements on queries with large click entropy, and has little effect on queries with small click entropy.
- Personalization can even harm search accuracy on some queries.
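To make "click entropy" concrete, here is a minimal sketch (the click log is made up purely for illustration): a query whose clicks spread over many results has high entropy, while a navigational query where everyone clicks the same result has entropy close to zero.
import math
from collections import Counter

# Hypothetical click log for a single query: the URLs users clicked.
clicks = ['url1', 'url1', 'url2', 'url3', 'url1']

counts = Counter(clicks)
total = float(sum(counts.values()))
# Click entropy: -sum over clicked URLs of p(url) * log2 p(url).
entropy = -sum((c / total) * math.log(c / total, 2) for c in counts.values())
print(entropy)  # 0 when all clicks hit one result; larger when clicks spread out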
Thursday, December 19, 2013
Data Preparation Tricks
- Replace every standalone NULL token with 0, editing the file in place (a .bak backup is kept) : perl -p -i.bak -e 's/\bNULL\b/0/g' filename
- Print all rows, dividing one column by another : awk '{print $1"\t"$3/$2}' merchant_ctr > merchant_ctr_final
- Print only the rows where the CTR value is greater than 0 : awk '{if($2>0) print;}' merchant_ctr_final > merchant_ctr_final_pos
http://www.ibm.com/developerworks/library/l-p102/index.html
What is the runtime complexity of linear regression ?
http://stackoverflow.com/questions/1955088/what-is-the-bigo-of-linear-regression
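For intuition, here is a minimal sketch of the textbook approach, ordinary least squares via the normal equations, with the cost of each step annotated (for n samples and d features the total is O(n*d^2 + d^3)):
import numpy as np

def ols_fit(X, y):
    # Forming X'X costs O(n * d^2), forming X'y costs O(n * d),
    # and solving the d x d linear system costs O(d^3).
    XtX = X.T.dot(X)
    Xty = X.T.dot(y)
    return np.linalg.solve(XtX, Xty)

X = np.random.rand(1000, 5)
true_w = np.arange(5.0)
y = X.dot(true_w)
print(ols_fit(X, y))  # recovers weights close to [0, 1, 2, 3, 4]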
Monday, December 16, 2013
How to get started with scikit-learn in Python
I have had difficulties managing the versions of different libraries like scikit-learn, numpy and matplotlib. The best way I could figure out was to build scikit-learn from source and add it to PYTHONPATH. The steps are documented below.
Steps :
git clone git://github.com/scikit-learn/scikit-learn.git
export PYTHONPATH="/home/yourname/bin/scikit-learn"
python setup.py build_ext --inplace
make
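Once PYTHONPATH points at the checkout, a quick sanity check (a minimal sketch) confirms that Python picks up the source build rather than a system install:
import sklearn
print(sklearn.__version__)  # should match the version in the checkout
print(sklearn.__file__)     # should live under your scikit-learn directory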
Resources :
- http://scikit-learn.org/dev/developers/index.html#retrieving-the-latest-code
- http://stackoverflow.com/questions/12219657/upgrade-version-of-scikit-learn-included-in-enthought-distribution
Friday, December 13, 2013
Data Preparation : kaggle Facebook Recruiting competition III
Each record in the data ends with \r, so you can replace all the \n with spaces and replace all the \r with \n.
#!/bin/bash
if [ "$#" -ne 2 ] ; then
    echo "First replaces all the \\n with spaces, then replaces all the \\r with \\n"
    echo "usage: $0 input.csv output.csv"
    exit 1;
fi
tr '\n' ' ' < "$1" | tr '\r' '\n' > "$2"
(adapted from a post on the Kaggle forum)
head -n [number of lines] Train.csv > sample_train.csv
Python script to parse the data :
import csv, sys

if len(sys.argv) != 3:
    print >>sys.stderr, 'Wrong number of arguments. This tool will print first n records from a comma separated CSV file.'
    print >>sys.stderr, 'Usage:'
    print >>sys.stderr, ' python', sys.argv[0], '<file> <number-of-lines>'
    sys.exit(1)

fileName = sys.argv[1]
n = int(sys.argv[2])
i = 0
out = csv.writer(sys.stdout, delimiter=',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
with open(fileName, 'rb') as csvfile:
    for row in csv.reader(csvfile, delimiter=',', quotechar='"'):
        i += 1
        if i > n:
            break
        out.writerow(row)
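If you save the script as, say, head_csv.py (any name works), taking a 1000-record sample looks like this :
python head_csv.py Train.csv 1000 > sample_train.csv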
Wednesday, December 11, 2013
How to interpret logistic regression data ?
In a previous post, we analyzed which version of logistic regression we should be using depending on what our data looks like.
This current post is aimed at interpreting the model that you have built using logistic regression.
http://www.stat.wisc.edu/~mchung/teaching/MIA/reading/GLM.logistic.Rpackage.pdf
How do I know that I have enough data for my logistic regression ?
I think I need more data and more positive signals in the data.
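As a minimal sketch of the interpretation step (the data below is random, just to make the snippet runnable): exponentiating a logistic regression coefficient gives the multiplicative change in the odds of the outcome for a unit increase in that predictor.
import numpy as np
import statsmodels.api as sm

# Toy data purely for illustration.
np.random.seed(0)
X = sm.add_constant(np.random.rand(200, 2))
y = (np.random.rand(200) > 0.5).astype(int)

model = sm.Logit(y, X).fit()
print(model.summary())       # coefficients, standard errors, p-values
print(np.exp(model.params))  # odds ratios: exp(coef) per unit increase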
Count Data
http://en.wikipedia.org/wiki/Count_data
http://www.ats.ucla.edu/stat/stata/seminars/count_presentation/count.htm
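Count data is usually modeled with Poisson regression rather than ordinary least squares; here is a minimal statsmodels sketch on toy data :
import numpy as np
import statsmodels.api as sm

# Toy count outcome, purely for illustration.
np.random.seed(1)
X = sm.add_constant(np.random.rand(100, 1))
counts = np.random.poisson(lam=3, size=100)

# Poisson GLM with the default log link.
model = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print(model.summary())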
Tuesday, December 10, 2013
When to use logistic regression and exact logistic regression
- The general logistic regression process does not work very well for small sample sets. The general process is described here : http://www.ats.ucla.edu/stat/r/dae/logit.htm
- For small sample sets, use exact logistic regression : http://www.ats.ucla.edu/stat/r/dae/exlogit.htm
- To understand how the maximum likelihood estimation for logistic regression is biased for rare events, read : http://www.statisticalhorizons.com/logistic-regression-for-rare-events and http://www.cscu.cornell.edu/news/statnews/stnews82.pdf
When to use exact logistic regression instead of regular logistic regression?
It is used when the sample size is too small for a regular logistic regression (which uses the standard maximum-likelihood-based estimator) and/or when some of the cells formed by the outcome and categorical predictor variable have no observations. The estimates given by exact logistic regression do not depend on asymptotic results.
What is separation in the data ?
http://en.wikipedia.org/wiki/Separation_(statistics)
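A minimal illustration of complete separation: the predictor perfectly splits the outcome, so the maximum likelihood estimate does not exist (the coefficient wants to run off to infinity) and standard fitting breaks down.
import numpy as np
import statsmodels.api as sm

x = sm.add_constant(np.array([1., 2., 3., 4., 5., 6.]))
y = np.array([0, 0, 0, 1, 1, 1])  # y is 1 exactly when the predictor exceeds 3.5

# Depending on the statsmodels version, this raises a perfect separation
# error or returns huge coefficients with enormous standard errors.
try:
    print(sm.Logit(y, x).fit().params)
except Exception as e:
    print('fit failed:', e)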
When there is separation in the data, do we use exact logistic regression or Firth's logistic regression ?
When the data set is small, use exact logistic regression. When you have a lot of non-events, use Firth's logistic regression, as suggested here : http://sas-and-r.blogspot.com/2010/11/example-815-firth-logistic-regression.html
When do you use mixed effects logistic regression model ?
When there are fixed and random effects in the data. When the data has rank bias or kadu quality score bias, you use a mixed effects logistic regression. Some of the other biases might be variable and random.
http://www2.hawaii.edu/~kdrager/MixedEffectsModels.pdf
Monday, November 25, 2013
Python for Data Analysis
- Install pip on Mac OS : sudo easy_install pip
Tuesday, November 12, 2013
Big Data in news
- Big Data in airlines and travel
- http://www.informationweek.com/big-data/
Saturday, November 9, 2013
Why should you view your data graphically before jumping to conclusions
http://upload.wikimedia.org/wikipedia/commons/e/ec/Anscombe%27s_quartet_3.svg
http://en.wikipedia.org/wiki/Anscombe's_quartet
Wednesday, October 9, 2013
Tim Minchin's speech at UWA graduation
Tim Minchin gave this awesome graduation speech at UWA recently. This speech is really a classic.
Some excerpts from the speech :
"A famous bon mot asserts that opinions are like arse-holes, in that everyone has one. There is great wisdom in this… but I would add that opinions differ significantly from arse-holes, in that yours should be constantly and thoroughly examined.
We must think critically, and not just about the ideas of others. Be hard on your beliefs. Take them out onto the verandah and beat them with a cricket bat.
Be intellectually rigorous. Identify your biases, your prejudices, your privilege.
Most of society’s arguments are kept alive by a failure to acknowledge nuance. We tend to generate false dichotomies, then try to argue one point using two entirely different sets of assumptions, like two tennis players trying to win a match by hitting beautifully executed shots from either end of separate tennis courts.
By the way, while I have science and arts grads in front of me: please don’t make the mistake of thinking the arts and sciences are at odds with one another. That is a recent, stupid, and damaging idea. You don’t have to be unscientific to make beautiful art, to write beautiful things.
If you need proof: Twain, Adams, Vonnegut, McEwan, Sagan, Shakespeare, Dickens. For a start.
You don’t need to be superstitious to be a poet. You don’t need to hate GM technology to care about the beauty of the planet. You don’t have to claim a soul to promote compassion.
Science is not a body of knowledge nor a system of belief; it is just a term which describes humankind’s incremental acquisition of understanding through observation. Science is awesome.
The arts and sciences need to work together to improve how knowledge is communicated. The idea that many Australians – including our new PM and my distant cousin Nick – believe that the science of anthropogenic global warming is controversial, is a powerful indicator of the extent of our failure to communicate. The fact that 30% of this room just bristled is further evidence still. The fact that that bristling is more to do with politics than science is even more despairing.”
Thursday, October 3, 2013
Product Management and Big Data Case : Cool feature in Flipkart’s user reviews trying to match a review to a particular item attribute - collaborative filtering
The Flipkart product management and research team have come up with a cool idea: matching a user review for a particular item to a specific attribute of that item. They call it "product features users are talking about".
As you can see, they have identified operating systems, games, value for money and apps as the features for the iPhone.
Now, based on the feature you choose, you can see all the reviews that are clustered under that feature.
And then, you can select a particular review and read that review in detail.
This is a really cool feature and will massively improve the buyer's experience. It will also lead the way for more granular recommendations in the future. If Flipkart knows what features of a product you are looking for, it can recommend products that are strong on that feature, based on the recommendations of users who have used it. A strong case for collaborative filtering. Recommendations will get better in the future, once they have a good data set and more money.
I think this is a nice example of the product management team and the research (NLP and machine learning) team coming together to bring out a new feature for Flipkart.
It would be interesting to see how many other products or categories Flipkart shows this feature for.
For watches, they are not showing it.
Some other cool features on their website are certified buyer reviews, which add authenticity to a review and make it more credible to the reader, and labels indicating when a review comes from a first-time reviewer.
What does the US government shutdown mean for the economy ?
The government shutdown means that from tomorrow, government employees will not be paid, or in some cases will have to work without pay, for an indefinite time.
This means a lot of hardship for their families, though President Obama has signed a law confirming that citizens on military duty will be paid their salaries and will continue to be on duty.
Nevertheless, what this means for the economy is that the spending power of the consumer is going to go down, and so is investor confidence. The real impact of this shutdown will depend on how long it lasts: the longer it stays, the worse it is going to get. Businesses are going to shrink as consumer spending power goes down, companies will stop new investment and hiring plans, and this spiral will begin. This may potentially slash some GDP points off the economy.
Now, what makes it scarier is the timing, because of the looming debt ceiling issue. When the debt ceiling will be reached is a difficult question to answer, but according to predictions it is going to be somewhere between October 18 and November 5.
If both these issues are not handled with care, they could have huge recessionary impacts on the US economy.
What it means for various US services :
Links on Deep Learning
- http://www.forbes.com/sites/netapp/2013/08/19/what-is-deep-learning/
- http://www.technologyreview.com/news/519411/facebook-launches-advanced-ai-effort-to-find-meaning-in-your-posts/
- http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial
Benchmark Bond Trade Price Challenge - Kaggle
This post was long overdue. I participated in the Benchmark Bond Trade Price Challenge and used a regression based approach to predict bond prices. Here is an outline of the approach.
- Build on the training set and predict on the test set. The dependent variable we are trying to predict is the bond price, and the independent variables are the last 10 trade prices (a minimal sketch follows this list).
- Prepare frequency charts based on whether callability is 0 or 1.
- Divide the data into 12 parts – callability, price > 100 or price < 100 (the bond price will always converge to 100, so the curve will look different), and the type of trade in the bond (dealer to dealer, dealer to client, client to client – a quote-driven market).
- Some of the values were missing – missing value treatment (based on exponential weights).
- Run regressions on these sub-data sets and analyze the results.
- Some of the t-tests failed – bond ids and time to delay – the p-value cut-off for rejecting coefficients was kept at 3%.
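A minimal sketch of the core regression step described above (the file and column names are hypothetical; the actual competition data has its own schema) :
import pandas as pd
import statsmodels.api as sm

# Hypothetical schema: 'trade_price' plus the last 10 observed trade prices.
df = pd.read_csv('train.csv')
features = ['trade_price_last%d' % i for i in range(1, 11)]

X = sm.add_constant(df[features])
model = sm.OLS(df['trade_price'], X).fit()
print(model.summary())  # inspect coefficients and p-values (3% cut-off above)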
Notes :
- R2 is a statistic that will give some information about the goodness of fit of a model. In regression, the R2 coefficient of determination is a statistical measure of how well the regression line approximates the real data points. An R2 of 1.0 indicates that the regression line perfectly fits the data.
- R2 measures goodness of fit. But it will not detect overfit because it will increase with any new predictor (unless it has already reached 1).
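For reference, R2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals around the regression line and SS_tot is the total sum of squares around the mean. Adding any predictor can only decrease SS_res, which is why R2 never goes down.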
- Assumptions of regression (the LINE acronym) :
- L : Linear relationship
- I : Independent observations
- N : Normally distributed around the line
- E : Equal variance across X's
- Multicollinearity : when two independent variables are correlated; it is detected using the Variance Inflation Factor (see the sketch after these notes)
- A p-value is the probability of a result as extreme as, or more extreme than, the one observed arising by chance under the null hypothesis. So, if p < .03, that probability is quite small, and hence we can keep that independent variable.
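To make the VIF check concrete, a minimal sketch on toy data (VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors) :
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

np.random.seed(2)
X = pd.DataFrame({'x1': np.random.rand(100), 'x2': np.random.rand(100)})
X['x3'] = 0.9 * X['x1'] + 0.1 * np.random.rand(100)  # nearly collinear with x1

# Rule of thumb: a VIF above roughly 5-10 signals problematic multicollinearity.
vifs = [variance_inflation_factor(X.values, j) for j in range(X.shape[1])]
print(dict(zip(X.columns, vifs)))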
I have been away from data science for a while because of a job change and interviewing. New Year resolution : get back to Kaggle.
Joke On Data Scientists
Today I started reading Moneyball. Back to Michael Lewis after almost two years. And guess what, the current buzzword in the valley is "Data Science".
I was having a discussion with my manager regarding hiring a candidate for an open position. During the review meeting, we reached the conclusion that the candidate was not so ok on machine learning and not so ok on programming. So somebody in the room cracked a joke: "sounds like a data scientist".
But jokes apart, statistics, machine learning and programming put together is a formidable skillset in the industry today. So, I have decided to start a series of blog posts as a statistics refresher for myself.
And guess what, 2013 is also the international year of statistics. Sounds coincidental.
Serendipity !
Descriptive Statistics – starting with the data
These are the kinds of analysis you can do when you start with any data set. This is the starting point of most data science projects, and it gives insights about the data. It is essential both for statisticians and for consumers of statistical reports.
For quantitative variables (a pandas sketch follows this list) :
- minimum, maximum
- median, quartiles, interquartile range
- box plots
- mean
- spread of the data – standard deviation. Sometimes there may be gaps in the data when we plot it as a histogram – outliers. When there are underlying special rules in the way the data is generated, there will be outliers in the data. For example : some football clubs can pay foreign players salaries above the salary cap, which will produce outlier salaries for those players. Another example : the top deal or product on an e-commerce site gets the highest clicks by virtue of its position, which will create an outlier if deals are ranked by CTR. Cleaning the data is an important first step in any statistical analysis, and it is important to understand the reasons behind the outliers. In some cases it is good to remove the outliers, and in some cases it is not, as we might lose valuable data signals. It is not unusual to report findings both with and without outliers.
- shape of the data – histograms
- skewed vs non-skewed, symmetric vs non-symmetric
- left skewed or negatively skewed – where it has a long left tail – mean < median < mode – the difference between the 3rd quartile and the median is smaller than the difference between the 1st quartile and median
- right skewed or positively skewed – where it has a long right tail
- extreme values or outliers – sometimes the data has a much better uniform shape when the outliers are removed
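A minimal pandas sketch of these quantitative summaries (the salary column is made up; the 250 value plays the outlier) :
import pandas as pd

df = pd.DataFrame({'salary': [30, 35, 40, 42, 45, 50, 250]})

print(df['salary'].describe())  # count, mean, std, min, quartiles, max
print(df['salary'].skew())      # positive here: right skewed (long right tail)

iqr = df['salary'].quantile(0.75) - df['salary'].quantile(0.25)
print('interquartile range:', iqr)

df['salary'].plot(kind='box')   # box plot; use df['salary'].hist() for the shape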
For categorical variables :
- bar charts
- pie charts
- Examining the relationship between a quantitative variable and a categorical variable involves comparing the values of the quantitative variable among the groups defined by the categorical variable.
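A minimal sketch of that comparison (the column names are hypothetical) : group the quantitative variable by the categorical one and compare the per-group summaries or side-by-side box plots.
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'value': [1.0, 2.0, 1.5, 10.0, 12.0, 11.0]})

print(df.groupby('group')['value'].describe())  # per-group summaries
df.boxplot(column='value', by='group')          # side-by-side box plots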
Missing Values
We must understand why the data for some of the variables is missing; the fact that it is missing might bias the results of our work.