Counting missing values
# Counting the occurrence of a particular value
x = c(3.14, 98, 0, 99, 7, NA, 0, 99)
sum(x==99, na.rm=TRUE) # Count the occurrences of 99 in x (ignoring any NA)
# Counting the occurrence of NA
sum(x==NA) # Don't do this! Any comparison with NA yields NA (NA is a special logical value), so the sum is NA as well
sum(is.na(x)) # Do this to count the NA in x
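To see why the first form fails, here is a minimal sketch that reuses the same x and compares the two results side by side. The variable name na_count is only for illustration.

```r
x = c(3.14, 98, 0, 99, 7, NA, 0, 99)

# Comparing anything with NA yields NA, so every element of x == NA is NA
x == NA          # a logical vector of eight NA values
sum(x == NA)     # NA, not a count

# is.na() returns TRUE or FALSE for each element, which sum() can count
is.na(x)
na_count = sum(is.na(x))
na_count         # 1
```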
# Counting the occurrence of any of a set of particular values
sum(x==c(98,99), na.rm=TRUE) # Don't do this! It recycles c(98,99) along x and compares element by element, so matches can be missed
sum(x %in% c(98,99)) # Do this to count the occurrence of any 98 or 99 in x
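The difference between the two forms is easiest to see by running both on the same vector. The sketch below uses the x from above; the names pairwise and in_set are illustrative only.

```r
x = c(3.14, 98, 0, 99, 7, NA, 0, 99)

# == recycles c(98, 99) along x: x[1] is compared with 98, x[2] with 99,
# x[3] with 98, and so on. The 98 in position 2 is compared with 99 and missed.
pairwise = x == c(98, 99)
sum(pairwise, na.rm = TRUE)   # 2

# %in% tests each element of x against the whole set and never returns NA
in_set = x %in% c(98, 99)
sum(in_set)                   # 3
```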
# Counting incomplete cases (rows of a data frame where one or more columns contain NA)
sum(complete.cases(data)) # Count of complete cases in a data frame named 'data'
sum(!complete.cases(data)) # Count of incomplete cases
which(!complete.cases(data)) # Which cases (row numbers) are incomplete?
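Because 'data' is not defined above, the following sketch builds a small hypothetical data frame (the column names and values are made up) so that the three calls can be run as they stand.

```r
# Hypothetical example data; column names and values are for illustration only
data = data.frame(
  age   = c(23, NA, 31, 45),
  sex   = c("male", "female", NA, "female"),
  score = c(7, 5, 9, NA)
)

n_complete   = sum(complete.cases(data))     # 1: only row 1 has no NA
n_incomplete = sum(!complete.cases(data))    # 3
bad_rows     = which(!complete.cases(data))  # 2 3 4
```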
The summary function, applied to a data frame, also reports the number of NA values in each numeric, logical, or factor column.
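As a rough sketch, and assuming a small made-up data frame, the per-column counts can be obtained either from summary or directly with colSums(is.na(...)), which gives a count for every column whatever its type.

```r
# Hypothetical example data
data = data.frame(
  age   = c(23, NA, 31, 45),
  score = c(7, 5, NA, NA)
)

summary(data)   # each numeric column containing NA gets an "NA's" entry

# is.na() on a data frame returns a logical matrix; colSums() counts TRUE per column
na_per_column = colSums(is.na(data))
na_per_column   # age 1, score 2
```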
With numerical data that contain numerically coded missing values, a scatter plot is often helpful for spotting them, especially if the missing-value codes are sometimes entered incorrectly:
x = c(3.14, 98, 0, 99, 7, NA, 0, 99)
plot(x)
Similarly, with character data a contingency table is helpful:
x = c("male", "female", "female", "male", 999)
factor(x)
table(x)
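One thing to watch for: by default table leaves NA out of its counts. A minimal sketch, using a variant of the vector above with an NA added for illustration, shows the useNA argument that includes it.

```r
x = c("male", "female", NA, "male", 999)   # 999 is coerced to the string "999"

table(x)                             # NA is dropped from the counts
counts = table(x, useNA = "ifany")   # adds an <NA> entry when x contains NA
counts
```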
Re-coding particular values as NA (or any other value)
# Re-coding the occurrence of a particular value
x = c(3.14, 98, 0, 99, 7, NA, 0, 99)
x[x==99] = NA # Re-code all 99 in x as NA
# Re-coding the occurrence of NA
x[is.na(x)] = -1 # Recode all NA in x as -1. (Don't do this: x[x==NA] = -1)
# Re-coding the occurrence of any of a set of particular values
x = c(3.14, 98, 0, 99, 7, NA, 0, 99)
x[x %in% c(98,99)] = NA # Re-code any 98 or 99 in x as NA
x = c(2,5,3,5,2,0,1,3,4,4,0)
x[x %in% 0:3] = 0 # Re-code any 0, 1, 2, or 3 in x as 0
Removing NA values
# Removing NA values from a vector
x = c(3.14, 98, 0, 99, 7, NA, 0, 99)
x = x[!is.na(x)] # Drop NA from x
# Removing incomplete cases from a data frame named 'data'
na.omit(data) # In effect the same as: data[complete.cases(data), ]
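Note that na.omit returns a new data frame rather than changing 'data' in place, so the result has to be assigned to be kept. The sketch below assumes a small made-up data frame and compares the two approaches; the names cleaned and cleaned2 are illustrative.

```r
# Hypothetical example data
data = data.frame(
  age   = c(23, NA, 31, 45),
  score = c(7, 5, 9, NA)
)

cleaned  = na.omit(data)                  # keeps rows 1 and 3
cleaned2 = data[complete.cases(data), ]   # same rows, selected by indexing

nrow(cleaned)                  # 2
attr(cleaned, "na.action")     # na.omit also records which rows were dropped (2 and 4)
```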
The read.table function has an optional argument, na.strings, which takes a character vector of values to be read as NA. For example:
data = read.table("foo.txt", header=TRUE, sep = "\t", na.strings=c("999", "-999") )
However, this recodes occurrences of the given na.strings in every variable, which can cause mistakes if a missing-value code for one variable is valid data for another.
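The problem, and one possible workaround, can be sketched with a small inline example. The column names and values here are hypothetical, and the text argument of read.table stands in for a real file such as foo.txt.

```r
# Hypothetical tab-separated input: 999 means "missing" for age,
# but is a legitimate value for income
raw = "age\tincome\n25\t999\n999\t52000\n40\t31000\n"

# na.strings applies to every column, so the valid income of 999 is lost
data_all = read.table(text = raw, header = TRUE, sep = "\t", na.strings = c("NA", "999"))
sum(is.na(data_all$income))   # 1

# Alternative: read the codes as ordinary data, then recode one variable at a time
data = read.table(text = raw, header = TRUE, sep = "\t")
data$age[data$age %in% 999] = NA
sum(is.na(data$age))          # 1
sum(is.na(data$income))       # 0
```

Recoding per variable takes a line per column, but it keeps each missing-value code tied to the variable it belongs to.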
http://forums.psy.ed.ac.uk/R/P01582/essential-1/