My experiments with Big Data: Counting missing values in R

Friday, September 12, 2014

Counting missing values in R

Counting missing values

# Counting the occurrence of a particular value
x = c(3.14, 98, 0, 99, 7, NA, 0, 99)
sum(x==99, na.rm=TRUE) # Count the occurrences of 99 in x, omitting any NA

# Counting the occurrence of NA
sum(x==NA) # Don't do this! Comparing anything with NA yields NA, so this returns NA rather than a count
sum(is.na(x)) # Do this to count the NA in x

# Counting the occurrence of any of a set of particular values
sum(x==c(98,99), na.rm=TRUE) # Don't do this! c(98,99) is recycled along x and compared element by element, so matches are missed
sum(x %in% c(98,99)) # Do this to count the occurrence of any 98 or 99 in x
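
To see why the recycled comparison undercounts, it can help to look at the two results side by side. This is only a minimal sketch using the same x as above; the names recycled and in_set are made up for illustration:

```r
x = c(3.14, 98, 0, 99, 7, NA, 0, 99)

# c(98, 99) is recycled to 98, 99, 98, 99, ... so each element of x is
# compared with only one of the two values, depending on its position
recycled = x == c(98, 99)
sum(recycled, na.rm = TRUE)  # 2: the 98 in position 2 is compared with 99 and missed

# %in% tests each element of x against the whole set and never returns NA
in_set = x %in% c(98, 99)
sum(in_set)                  # 3
```

Because %in% returns FALSE rather than NA for missing elements, no na.rm argument is needed in the second call.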

# Counting incomplete cases (rows of a data frame where one or more columns contain NA)

sum(complete.cases(data)) # Count of complete cases in a data frame named 'data'
sum(!complete.cases(data)) # Count of incomplete cases
which(!complete.cases(data)) # Which cases (row numbers) are incomplete?

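Calling summary on a data frame also reports the number of NA in each numeric, logical, or factor column (character columns are summarised only by length and class, so their NA are not counted there).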
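
As a minimal sketch of per-column counting, assuming a small made-up data frame (the name data and its columns are purely illustrative), colSums applied to is.na gives the NA count for every column, whatever its type, as a plain named vector:

```r
# Hypothetical example data; not taken from any real data set
data = data.frame(
  age    = c(23, NA, 31, 45),
  height = c(1.70, 1.82, NA, NA),
  sex    = c("m", "f", "f", NA)
)

summary(data)  # Prints the NA count for the numeric columns

# is.na(data) is a logical matrix; column sums count the TRUE values
na_per_column = colSums(is.na(data))
na_per_column
```

The named vector is convenient when the counts are needed for further computation rather than just for reading off the console.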

With numerical data that contain numerically coded missing values, a scatter plot often helps to identify them, especially if the missing-value codes are sometimes entered incorrectly:

x = c(3.14, 98, 0, 99, 7, NA, 0, 99)
plot(x)

Similarly, with character data a contingency table is helpful:

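x = c("male", "female", "female", "male", 999) # 999 is coerced to the string "999"
factor(x) # Show the distinct levels of x
table(x) # Count the occurrences of each distinct value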
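
One caveat worth knowing: by default table leaves NA out of the counts. A minimal sketch, assuming a variant of the vector above with one genuine NA added, shows how the useNA argument brings them back in (the name counts is illustrative):

```r
x = c("male", "female", NA, "male", 999)

table(x)  # NA is silently left out of the counts

# useNA = "ifany" adds an NA entry whenever x contains any NA
counts = table(x, useNA = "ifany")
counts
```

This puts stray codes such as "999" and true NA side by side in one table.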

Re-coding particular values as NA (or any other value)

# Re-coding the occurrence of a particular value
x = c(3.14, 98, 0, 99, 7, NA, 0, 99)
x[x==99] = NA # Re-code all 99 in x as NA

# Re-coding the occurrence of NA
x[is.na(x)] = -1 # Recode all NA in x as -1. (Don't do this: x[x==NA] = -1)

# Re-coding the occurrence of any of a set of particular values
x = c(3.14, 98, 0, 99, 7, NA, 0, 99)
x[x %in% c(98,99)] = NA # Re-code any 98 or 99 in x as NA
x = c(2,5,3,5,2,0,1,3,4,4,0)
x[x %in% 0:3] = 0 # Re-code any 0, 1, 2, or 3 in x as 0

Removing NA values

# Removing NA values from a vector
x = c(3.14, 98, 0, 99, 7, NA, 0, 99)
x = x[!is.na(x)] # Drop NA from x

# Removing incomplete cases from a data frame named 'data'
data = na.omit(data) # Drop incomplete rows; in effect the same as: data[complete.cases(data), ]
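
A minimal sketch of the two approaches on a made-up data frame (the column names and the names by_na_omit and by_index are illustrative). Both return a new data frame and leave the original untouched, so the result has to be assigned to be kept:

```r
# Hypothetical example data
data = data.frame(
  id    = 1:4,
  score = c(10, NA, 30, 40),
  group = c("a", "b", NA, "b")
)

by_na_omit = na.omit(data)                 # Rows 2 and 3 are dropped
by_index   = data[complete.cases(data), ]  # Same rows kept

nrow(data)        # Still 4: the original is unchanged
nrow(by_na_omit)  # 2
```

The two results contain the same rows; na.omit additionally records which rows were dropped in an attribute named na.action.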

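The read.table function has an optional argument named na.strings, which takes a character vector of strings that are read in as NA (in addition to replacing the default "NA", so include "NA" in the vector if it should still be recognised). For example: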

data = read.table("foo.txt", header=TRUE, sep = "\t", na.strings=c("999", "-999") )

However, this recodes occurrences of the given na.strings in all the variables, which can cause mistakes if a code for a missing value in one variable is valid data in another.
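
One way around this is to read the file without na.strings and then recode each variable separately. The following is only a sketch under assumed codings: the inline text stands in for a file such as foo.txt, and the columns weight and income with their missing-value codes are invented for illustration.

```r
# Inline text stands in for a tab-separated file
raw = "id\tweight\tincome
1\t70\t999
2\t999\t25000
3\t82\t-999"
data = read.table(text = raw, header = TRUE, sep = "\t")

# Assumed coding: 999 means missing for weight only, -999 for income only
data$weight[data$weight %in% c(999)]  = NA
data$income[data$income %in% c(-999)] = NA

data  # The income of 999 in row 1 is kept as valid data
```

Using %in% rather than == keeps the recoding safe even when a column already contains NA.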

http://forums.psy.ed.ac.uk/R/P01582/essential-1/
