Data collected in RDS studies is often messy, especially when information on coupons given out and received is in hard copy, then entered into a computer later. A while ago, I wrote some code to process RDS data to check for some common errors. This doesn't get rid of all of them, by any means, but will catch quite a few. This code, and the dataset, are available as a gist from my Github site. from my Github site.
The data handling code
Firstly, here is the processing code. It's fairly ugly and inefficient, as I wasn't as familiar with R as I am now.
prepareDataset <- function(data, idColumnName, netsizeColumnName, couponColumnName,
seedCoupon, couponList) {
result <- matchIds(data, idColumnName, couponColumnName, seedCoupon, couponList)
result$rds.size <- as.double(data[netsizeColumnName][, 1])
if (length(result$rds.size[result$rds.size < 1]) > 0) {
print("Missing values/Network size of less than one found")
}
result
}
convertToFactor <- function(data, x) {
data[x] <- as.factor(data[x])
}
makeCouponFactor <- function(data, couponColumnName) {
couponVector <- data[couponColumnName][, 1]
f <- rep(NA, length(couponVector))
for (i in 1:length(couponVector)) {
f[i] <- strsplit(as.character(couponVector[i]), split = "")[[1]][1]
}
as.factor(f)
}
checkForDups <- function(couponVector, seedCoupon) {
tb <- table(couponVector)
dupCoupons <- names(tb)[tb > 1]
dupCoupons[dupCoupons != as.character(seedCoupon)]
}
matchIds <- function(data, idColumnName, couponColumnName, seedCoupon, couponList) {
# Extract ID numbers; I convert to characters, as the code will fail if
# factors are used
toId <- as.character(data[idColumnName][, 1])
numIds <- length(toId)
# Extract couponnumbers
couponVector <- data[couponColumnName][, 1]
# Find seeds and replace their coupon number with missing
couponRecvdList = list()
for (i in 1:numIds) {
if (paste(couponVector[i]) == paste(seedCoupon)) {
couponRecvdList[paste(toId[i])] = as.character(NA)
} else {
couponRecvdList[paste(toId[i])] = paste(couponVector[i])
}
}
# Generate a list of all coupon ids, and the participant ids they were given
# to
couponGivenList = list()
for (i in 1:length(couponList)) {
cupList = data[couponList[i]][, 1]
for (j in 1:numIds) {
cupNum = cupList[j][1]
if (is.na(cupNum) == F) {
couponGivenList[[paste(cupNum)]] = toId[j]
}
}
}
# Cross-match coupons for each individual Sanity check; make sure that each
# coupon received was actually given out by someone Need to add check that
# coupon wasn't received twice
badCoupons = c()
matchedIdList = list()
seedList = list()
for (i in 1:length(couponRecvdList)) {
tmpCoupon = couponRecvdList[i]
if (!is.na(tmpCoupon)) {
matchedCouponNumber = intersect(tmpCoupon, names(couponGivenList))
validCoupon = (matchedCouponNumber == tmpCoupon)
if (length(validCoupon) == 0) {
badCoupons = c(badCoupons, tmpCoupon)
} else {
matchedId = couponGivenList[matchedCouponNumber]
matchedIdList[names(tmpCoupon)] = matchedId
seedList[names(tmpCoupon)] = 0
}
} else {
matchedIdList[names(tmpCoupon)] = as.character(NA)
seedList[names(tmpCoupon)] = 1
}
}
if (length(badCoupons) > 0) {
print("Bad coupons")
print(badCoupons)
print("Invalid coupons detected")
return()
}
# Assemble results
result = data
result$rds.toId = as.character(toId)
result$rds.fromId = as.character(matchedIdList)
result$rds.seed = as.integer(seedList)
result
}
This will return a data frame with three additional columns, representing the subject identifiers for the recruit (rds.toId
), recruiter (rds.fromID
), and a flag indicating whether the subject is a seed or not (rds.seed
).
An example: the New York Jazz dataset
These data are from the well-known NYC jazz musician dataset by Heckathorn and Jeffri. These data are a subset of the full data, in a tab-delimited file that was generated by hand from the original RDSAT file.
nyjazz <- read.table("nyjazz.txt", header = T, row.names = NULL, sep = "\t")
head(nyjazz)
## ID netsize mycoupon coupon1 coupon2 coupon3 coupon4 coupon5 coupon6
## 1 1 350 0 14250004 14250005 14250006 14256002 0 0
## 2 2 0 0 14250007 14250008 14250009 14256003 0 0
## 3 3 585 0 14250010 14250011 14250012 14256004 0 0
## 4 4 400 0 14250025 14250026 14250027 14256009 0 0
## 5 5 150 0 14250022 14250023 14250023 14256008 0 0
## 6 6 100 0 14250028 14250029 14250030 14256010 0 0
## coupon7 coupon8 gender.mf age airplay.yn
## 1 0 901 1 40 1
## 2 912 902 1 64 1
## 3 0 903 2 41 1
## 4 0 904 2 77 0
## 5 0 905 1 33 1
## 6 916 906 1 31 2
I then attempt to process the data. The arguments are as follows:
- The name of the dataset
- The name of the ID of the respondent
- The name of the column with network size
- The column of the coupon received}
- The numeric code for a seed coupon
- A vector of column names corresponding to the coupons given out to each participant
nyjazz.prep <- prepareDataset(nyjazz, "ID", "netsize", "mycoupon", 0, c("coupon1",
"coupon2", "coupon3", "coupon4", "coupon5", "coupon6", "coupon7", "coupon8"))
## [1] "Missing values/Network size of less than one found"
Note that this doesn't identify any bad coupons, but does raise a warning about some undesirable values for network size.
table(nyjazz$netsize, exclude = NULL)
##
## 0 20 25 30 35 40 50 60 70 75 80 90 100 120 125
## 21 4 5 7 1 2 13 4 5 4 2 1 45 1 3
## 150 200 220 250 300 350 375 383 400 450 500 585 600 700 800
## 25 35 1 8 25 1 1 1 12 1 24 1 4 4 2
## 850 <NA>
## 1 0
As numeric codes are used to represent missing data in RDSAT, one has to be careful. These may represent missing network sizes or real reports of no ties.
To see what happens when individuals come in with a coupon that is not given out, let's omit one of the columns.
prepareDataset(nyjazz, "ID", "netsize", "mycoupon", 0, c("coupon1", "coupon2",
"coupon3", "coupon4", "coupon5", "coupon6", "coupon7"))
## [1] "Bad coupons"
## $`45`
## [1] "902"
##
## $`60`
## [1] "9050"
##
## $`69`
## [1] "9037"
##
## $`71`
## [1] "9061"
##
## $`73`
## [1] "9071"
##
## $`92`
## [1] "9070"
##
## $`93`
## [1] "906"
##
## $`97`
## [1] "9018"
##
## $`115`
## [1] "9081"
##
## $`117`
## [1] "90112"
##
## $`118`
## [1] "9038"
##
## $`119`
## [1] "9099"
##
## $`130`
## [1] "9055"
##
## $`134`
## [1] "9034"
##
## $`153`
## [1] "90141"
##
## $`168`
## [1] "90160"
##
## $`206`
## [1] "90155"
##
## $`219`
## [1] "90210"
##
## $`229`
## [1] "90187"
##
## [1] "Invalid coupons detected"
## [1] "Missing values/Network size of less than one found"
## $rds.size
## [1] 350 0 585 400 150 100 300 700 300 200 200 300 100 383 700 80 300
## [18] 200 100 400 200 150 100 150 200 50 300 850 0 500 30 200 300 500
## [35] 500 25 150 50 500 150 150 300 300 300 0 200 250 400 150 100 0
## [52] 250 100 200 0 100 150 150 150 600 60 300 50 500 300 100 700 100
## [69] 300 200 200 100 150 20 0 200 500 100 35 700 70 300 200 500 125
## [86] 500 500 250 100 200 75 30 0 150 150 200 400 0 0 100 0 100
## [103] 100 200 20 100 250 250 50 60 100 500 100 150 500 200 200 100 20
## [120] 150 100 90 0 100 150 200 0 20 70 400 25 50 200 300 250 50
## [137] 200 100 500 150 150 600 0 100 30 500 150 200 300 200 150 300 100
## [154] 500 200 200 100 100 200 125 50 500 50 200 500 500 100 300 200 100
## [171] 70 300 75 300 0 500 100 100 60 125 500 0 200 200 500 600 25
## [188] 80 150 120 100 150 400 60 0 500 250 0 150 25 30 50 0 30
## [205] 200 0 30 70 0 100 100 375 50 100 100 100 50 50 100 400 25
## [222] 200 200 800 30 400 300 100 100 150 40 250 300 100 100 200 0 200
## [239] 75 300 200 800 400 150 400 100 500 450 600 220 300 300 70 500 100
## [256] 400 100 100 50 300 75 500 40 400
This now gives a list of IDs, with their coupon numbers, which one should then go and check. This code doesn't check for whether two people come in with the same coupon, but this should be apparent when one comes to plotting out the RDS recruitment tree.