Tuesday, August 5, 2014

Types of Questions asked to Data Scientists


  1. Descriptive - describe a set of data
    • Descriptions cannot be generalized without adding statistical modelling
  2. Exploratory - find relationships you didn't know about
    • Exploratory analysis alone should not be used for generalizing or predicting
    • Correlation doesn't imply causation
  3. Inferential - use a relatively small sample of data to say something about the bigger population
    • Inference is the common goal of statistical analysis
    • Inference involves estimating both the quantity we care about and the certainty of that estimate
    • Inference depends heavily on both the population and the sampling scheme
  4. Predictive - use the data on some objects to predict values for other objects
    • If X predicts Y, it doesn't mean that X causes Y
    • More data combined with a reasonably simple model often works well
  5. Causal - find what happens to one variable when you change another variable
    • Randomized studies are usually required to identify causation
    • There are approaches to inferring causation in non-randomized studies, but they are complicated and sensitive to assumptions
    • Causal relationships are identified as average effects, but may not apply to every individual
    • Causal models are usually the gold standard for data analysis
  6. Mechanistic - understand exactly how changes in one variable lead to changes in another, as in physics
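As a rough illustration of the inferential case, the sketch below draws a small sample from a simulated population and reports both an estimate and its uncertainty. The population, its parameters, and the variable names are made up for this example.

```r
# Hypothetical population: 100,000 simulated measurements (not real data).
set.seed(42)
population <- rnorm(100000, mean = 50, sd = 10)

# Draw a relatively small simple random sample from it.
sample_values <- sample(population, size = 100)

# Estimate the quantity we care about (the population mean) ...
estimate <- mean(sample_values)

# ... and quantify the certainty of that estimate with a 95% confidence interval.
ci <- t.test(sample_values, conf.level = 0.95)$conf.int

cat("estimate:", round(estimate, 2), "\n")
cat("95% CI:  ", round(ci[1], 2), "to", round(ci[2], 2), "\n")
```

A different sampling scheme (for example, sampling only large values) would make the same calculation misleading, which is why inference depends on how the sample was drawn.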

Introduction to R

Concepts 
  1. Data types : character, numeric, integer, complex, logical
  2. A vector can only contain objects of the same class
  3. A list is represented as a vector but can contain objects of different classes
  4. Numbers in R are generally represented as numeric objects
  5. If you explicitly want an integer, you need to specify the L suffix
  6. Ex : entering 1 will be treated as a numeric object and 1L will be treated as an integer
  7. R objects can have attributes - names, dimensions, class, other user defined attributes
  8. The : operator is used to create integer sequences, e.g. x <- 1:20
  9. The function c() is used to create vectors of objects
  10. Objects can be coerced from one class to another using the as.* functions
    1. x <- 1:20; as.character(x)
  11. Nonsensical coercion results in NA
  12. Matrices are vectors with a dimension attribute.
  13. Matrices are filled column-wise, so entries start at the upper left corner and run down the columns
  14. Matrices can also be created from vectors by adding the dimension attribute.
    1. e.g. m <- 1:10; dim(m) <- c(2, 5)
  15. Matrices can also be created by column binding or row binding 
    1. x <- 2:4; y <- 10:12; m <- cbind(x, y); m <- rbind(x, y)
  16. Factors - categorical data
  17. Missing values: NA and NaN. NA values have a class too, so there are integer NAs, character NAs, and so on.
  18. A NaN value is also NA, but the converse is not true
  19. Data Frames 
    1. Special type of list where every element of the list has to be the same length
    2. Each element of the list can be thought of as a column and the length of each element of the list is the number of rows
    3. Data frames can store a different class of object in each column. In a matrix, all elements have to be of the same class
    4. Data frames also have a special attribute called row.names
    5. Data frames are usually created by calling read.table() or read.csv()
  20. R Objects can have names
    1. m <- matrix(nrow=2, ncol=2)
    2. dimnames(m) <- list(c("a", "b"), c("c", "d"))
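The concepts above can be tried out in a few lines of R. This is only a small sketch: the object names (v, bad, cb, rb, f, l, df) and the values are chosen here for illustration and do not come from the original notes.

```r
# Numbers are numeric (double) by default; the L suffix gives an integer.
class(1)      # "numeric"
class(1L)     # "integer"

# c() builds vectors; mixing classes coerces everything to a common class.
v <- c(1.5, TRUE, "a")
class(v)      # "character"

# Explicit coercion with as.*; nonsensical coercion yields NA (with a warning).
x <- 1:20
chars <- as.character(x)
bad <- suppressWarnings(as.numeric(c("1", "two")))   # second element is NA

# A matrix is a vector with a dim attribute, filled column-wise.
m <- 1:6
dim(m) <- c(2, 3)
m[2, 1]       # 2, because values run down the first column

# Column binding and row binding.
a <- 1:3
b <- 10:12
cb <- cbind(a, b)   # 3 rows, 2 columns
rb <- rbind(a, b)   # 2 rows, 3 columns

# Factors hold categorical data.
f <- factor(c("yes", "no", "yes"))
levels(f)     # "no" "yes"

# NaN counts as NA, but NA is not NaN.
is.na(NaN)    # TRUE
is.nan(NA)    # FALSE

# A list can mix classes; a data frame is a list of equal-length columns.
l <- list(1L, "a", TRUE)
df <- data.frame(id = 1:3, label = c("x", "y", "z"))
nrow(df)      # 3
sapply(df, class)
```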
Examples
  1. How to get a list of available packages in R?
    available.packages()  # a matrix with one row per package available from the configured repositories
  2. How to install packages?
    install.packages("slidify")
    install.packages(c("slidify", "ggplot2"))
    source("http://www.bioconductor.org/biocLite.R")
    biocLite()
    # Place the names of the packages in a vector
    biocLite(c("GenomicFeatures", "AnnotationDbi"))
  3. How do you load R packages?
    library("slidify")
    After loading a package, its functions are attached near the top of the search list, right after the global environment (see search())
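To see what loading a package does to the search list, the sketch below attaches splines, chosen here only because it ships with standard R installations and needs no download. The variable names before and after are made up for the example.

```r
# splines ships with standard R installations, so no download is needed here.
before <- search()
library("splines")
after <- search()

# The newly attached package sits right after the global environment.
print(setdiff(after, before))
print(after[1:2])

# Check whether a package is installed before trying to load it.
# "slidify" is the package named in the examples; it may not be installed.
if (requireNamespace("slidify", quietly = TRUE)) {
  library("slidify")
} else {
  message("slidify is not installed")
}
```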

1) How to read a csv file in R ?
data <- read.csv(filename, header=TRUE)
2) How to display the first n lines of the file ?
head(data, n)  # the default value of n is 6
3) How to display the last n lines of the file ?
tail(data, n)
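The three steps above can be tried end to end without an existing file. In this sketch the CSV is a small made-up data set written to a temporary file, so the column names id and value are hypothetical.

```r
# Write a small hypothetical data set to a temporary CSV file.
filename <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, value = (1:10)^2), filename, row.names = FALSE)

# header = TRUE (the default for read.csv) uses the first line as column names.
data <- read.csv(filename, header = TRUE)

head(data)       # first 6 rows
head(data, 3)    # first 3 rows
tail(data, 2)    # last 2 rows

unlink(filename) # remove the temporary file
```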
4) How to count the missing values in each column of the data set?
colSums(is.na(data))
Other functions that can be used for this purpose are sapply and apply.
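Here is one way the three approaches might look side by side. The sketch uses R's built-in airquality data set as a stand-in because it has the same column names as the output shown in this post. Whether it is identical to the author's CSV file is an assumption.

```r
# Stand-in for the CSV used in the post (assumption).
data <- airquality

# is.na() gives a logical matrix; colSums() counts the TRUE values per column.
na_counts <- colSums(is.na(data))
print(na_counts)

# The same counts using sapply over the columns.
na_counts_sapply <- sapply(data, function(col) sum(is.na(col)))

# And with apply over the column margin (2) of the logical matrix.
na_counts_apply <- apply(is.na(data), 2, sum)
```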
5) Calculate the mean of a column without the missing values ?
colMeans(data,na.rm=TRUE)
     Ozone    Solar.R       Wind       Temp      Month        Day
 42.129310 185.931507   9.957516  77.882353   6.993464  15.803922
 colMeans(data)
    Ozone   Solar.R      Wind      Temp     Month       Day
       NA        NA  9.957516 77.882353  6.993464 15.803922
 colMeans(data["Ozone"],na.rm=TRUE)
   Ozone
42.12931
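For a single column, mean() with na.rm = TRUE is a common alternative to colMeans(). The sketch below again assumes the built-in airquality data set as a stand-in for the author's file.

```r
# Stand-in for the CSV used in the post (assumption).
data <- airquality

# mean() on a single column; na.rm = TRUE drops missing values first.
ozone_mean <- mean(data$Ozone, na.rm = TRUE)

# Without na.rm the result is NA as soon as one value is missing.
ozone_mean_raw <- mean(data$Ozone)

# Equivalent manual version: keep only the non-missing values.
ozone_manual <- mean(data$Ozone[!is.na(data$Ozone)])

print(ozone_mean)
```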
6) Extract the subset of rows of the data frame where Ozone values are above 31 and Temp values are above 90. What is the mean of Solar.R in this subset?
colMeans(subset(data,(Ozone>31 & Temp>90)))
 Ozone Solar.R    Wind    Temp   Month     Day
 89.5   212.8     5.6    93.4     8.2    14.5
7) Find the mean temperature in month n? (The output below is for n = 6.)
colMeans(subset(data,Month==n))
    Ozone   Solar.R      Wind      Temp     Month       Day
    NA 190.16667  10.26667  79.10000   6.00000  15.50000
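The last two questions ask for a single number, so it can be clearer to pull out just the column of interest instead of printing every column mean. This sketch assumes the built-in airquality data set as a stand-in for the author's file. The names hot_high_ozone, solar_mean, temp_mean and temp_by_month are made up for the example.

```r
# Stand-in for the CSV used in the post (assumption).
data <- airquality

# Rows where both conditions hold; which() drops rows where the test is NA.
hot_high_ozone <- data[which(data$Ozone > 31 & data$Temp > 90), ]
solar_mean <- mean(hot_high_ozone$Solar.R, na.rm = TRUE)

# Mean temperature for one month (n is a hypothetical month number).
n <- 6
temp_mean <- mean(data$Temp[data$Month == n])

# Mean temperature for every month at once.
temp_by_month <- tapply(data$Temp, data$Month, mean)

print(solar_mean)
print(temp_mean)
print(temp_by_month)
```

Logical indexing with which() and subset() both leave out rows where the condition evaluates to NA, so the two approaches select the same rows here.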
Additional Resources :
1) Filling in nas with column medians in R