Wednesday, June 18, 2014

Processing RDS data

Data collected in RDS studies is often messy, especially when information on coupons given out and received is in hard copy, then entered into a computer later. A while ago, I wrote some code to process RDS data to check for some common errors. This doesn't get rid of all of them, by any means, but will catch quite a few. This code, and the dataset, are available as a gist from my Github site. from my Github site.

The data handling code

Firstly, here is the processing code. It's fairly ugly and inefficient, as I wasn't as familiar with R as I am now.

prepareDataset <- function(data, idColumnName, netsizeColumnName, couponColumnName, 
    seedCoupon, couponList) {
    result <- matchIds(data, idColumnName, couponColumnName, seedCoupon, couponList)
    result$rds.size <- as.double(data[netsizeColumnName][, 1])
    if (length(result$rds.size[result$rds.size < 1]) > 0) {
        print("Missing values/Network size of less than one found")
    }
    result
}

convertToFactor <- function(data, x) {
    data[x] <- as.factor(data[x])
}

makeCouponFactor <- function(data, couponColumnName) {
    couponVector <- data[couponColumnName][, 1]
    f <- rep(NA, length(couponVector))
    for (i in 1:length(couponVector)) {
        f[i] <- strsplit(as.character(couponVector[i]), split = "")[[1]][1]
    }
    as.factor(f)
}

checkForDups <- function(couponVector, seedCoupon) {
    tb <- table(couponVector)
    dupCoupons <- names(tb)[tb > 1]
    dupCoupons[dupCoupons != as.character(seedCoupon)]
}

matchIds <- function(data, idColumnName, couponColumnName, seedCoupon, couponList) {
    # Extract ID numbers; I convert to characters, as the code will fail if
    # factors are used
    toId <- as.character(data[idColumnName][, 1])
    numIds <- length(toId)
    # Extract couponnumbers
    couponVector <- data[couponColumnName][, 1]
    # Find seeds and replace their coupon number with missing
    couponRecvdList = list()
    for (i in 1:numIds) {
        if (paste(couponVector[i]) == paste(seedCoupon)) {
            couponRecvdList[paste(toId[i])] = as.character(NA)
        } else {
            couponRecvdList[paste(toId[i])] = paste(couponVector[i])
        }
    }
    # Generate a list of all coupon ids, and the participant ids they were given
    # to
    couponGivenList = list()
    for (i in 1:length(couponList)) {
        cupList = data[couponList[i]][, 1]
        for (j in 1:numIds) {
            cupNum = cupList[j][1]
            if (is.na(cupNum) == F) {
                couponGivenList[[paste(cupNum)]] = toId[j]
            }
        }
    }
    # Cross-match coupons for each individual Sanity check; make sure that each
    # coupon received was actually given out by someone Need to add check that
    # coupon wasn't received twice
    badCoupons = c()
    matchedIdList = list()
    seedList = list()
    for (i in 1:length(couponRecvdList)) {
        tmpCoupon = couponRecvdList[i]
        if (!is.na(tmpCoupon)) {
            matchedCouponNumber = intersect(tmpCoupon, names(couponGivenList))
            validCoupon = (matchedCouponNumber == tmpCoupon)
            if (length(validCoupon) == 0) {
                badCoupons = c(badCoupons, tmpCoupon)
            } else {
                matchedId = couponGivenList[matchedCouponNumber]
                matchedIdList[names(tmpCoupon)] = matchedId
                seedList[names(tmpCoupon)] = 0
            }
        } else {
            matchedIdList[names(tmpCoupon)] = as.character(NA)
            seedList[names(tmpCoupon)] = 1
        }
    }
    if (length(badCoupons) > 0) {
        print("Bad coupons")
        print(badCoupons)
        print("Invalid coupons detected")
        return()
    }
    # Assemble results
    result = data
    result$rds.toId = as.character(toId)
    result$rds.fromId = as.character(matchedIdList)
    result$rds.seed = as.integer(seedList)
    result
}

This will return a data frame with three additional columns, representing the subject identifiers for the recruit (rds.toId), recruiter (rds.fromID), and a flag indicating whether the subject is a seed or not (rds.seed).

An example: the New York Jazz dataset

These data are from the well-known NYC jazz musician dataset by Heckathorn and Jeffri. These data are a subset of the full data, in a tab-delimited file that was generated by hand from the original RDSAT file.

nyjazz <- read.table("nyjazz.txt", header = T, row.names = NULL, sep = "\t")
head(nyjazz)
##   ID netsize mycoupon  coupon1  coupon2  coupon3  coupon4 coupon5 coupon6
## 1  1     350        0 14250004 14250005 14250006 14256002       0       0
## 2  2       0        0 14250007 14250008 14250009 14256003       0       0
## 3  3     585        0 14250010 14250011 14250012 14256004       0       0
## 4  4     400        0 14250025 14250026 14250027 14256009       0       0
## 5  5     150        0 14250022 14250023 14250023 14256008       0       0
## 6  6     100        0 14250028 14250029 14250030 14256010       0       0
##   coupon7 coupon8 gender.mf age airplay.yn
## 1       0     901         1  40          1
## 2     912     902         1  64          1
## 3       0     903         2  41          1
## 4       0     904         2  77          0
## 5       0     905         1  33          1
## 6     916     906         1  31          2

I then attempt to process the data. The arguments are as follows:

  1. The name of the dataset
  2. The name of the ID of the respondent
  3. The name of the column with network size
  4. The column of the coupon received}
  5. The numeric code for a seed coupon
  6. A vector of column names corresponding to the coupons given out to each participant
nyjazz.prep <- prepareDataset(nyjazz, "ID", "netsize", "mycoupon", 0, c("coupon1", 
    "coupon2", "coupon3", "coupon4", "coupon5", "coupon6", "coupon7", "coupon8"))
## [1] "Missing values/Network size of less than one found"

Note that this doesn't identify any bad coupons, but does raise a warning about some undesirable values for network size.

table(nyjazz$netsize, exclude = NULL)
## 
##    0   20   25   30   35   40   50   60   70   75   80   90  100  120  125 
##   21    4    5    7    1    2   13    4    5    4    2    1   45    1    3 
##  150  200  220  250  300  350  375  383  400  450  500  585  600  700  800 
##   25   35    1    8   25    1    1    1   12    1   24    1    4    4    2 
##  850 <NA> 
##    1    0

As numeric codes are used to represent missing data in RDSAT, one has to be careful. These may represent missing network sizes or real reports of no ties.

To see what happens when individuals come in with a coupon that is not given out, let's omit one of the columns.

prepareDataset(nyjazz, "ID", "netsize", "mycoupon", 0, c("coupon1", "coupon2", 
    "coupon3", "coupon4", "coupon5", "coupon6", "coupon7"))
## [1] "Bad coupons"
## $`45`
## [1] "902"
## 
## $`60`
## [1] "9050"
## 
## $`69`
## [1] "9037"
## 
## $`71`
## [1] "9061"
## 
## $`73`
## [1] "9071"
## 
## $`92`
## [1] "9070"
## 
## $`93`
## [1] "906"
## 
## $`97`
## [1] "9018"
## 
## $`115`
## [1] "9081"
## 
## $`117`
## [1] "90112"
## 
## $`118`
## [1] "9038"
## 
## $`119`
## [1] "9099"
## 
## $`130`
## [1] "9055"
## 
## $`134`
## [1] "9034"
## 
## $`153`
## [1] "90141"
## 
## $`168`
## [1] "90160"
## 
## $`206`
## [1] "90155"
## 
## $`219`
## [1] "90210"
## 
## $`229`
## [1] "90187"
## 
## [1] "Invalid coupons detected"
## [1] "Missing values/Network size of less than one found"
## $rds.size
##   [1] 350   0 585 400 150 100 300 700 300 200 200 300 100 383 700  80 300
##  [18] 200 100 400 200 150 100 150 200  50 300 850   0 500  30 200 300 500
##  [35] 500  25 150  50 500 150 150 300 300 300   0 200 250 400 150 100   0
##  [52] 250 100 200   0 100 150 150 150 600  60 300  50 500 300 100 700 100
##  [69] 300 200 200 100 150  20   0 200 500 100  35 700  70 300 200 500 125
##  [86] 500 500 250 100 200  75  30   0 150 150 200 400   0   0 100   0 100
## [103] 100 200  20 100 250 250  50  60 100 500 100 150 500 200 200 100  20
## [120] 150 100  90   0 100 150 200   0  20  70 400  25  50 200 300 250  50
## [137] 200 100 500 150 150 600   0 100  30 500 150 200 300 200 150 300 100
## [154] 500 200 200 100 100 200 125  50 500  50 200 500 500 100 300 200 100
## [171]  70 300  75 300   0 500 100 100  60 125 500   0 200 200 500 600  25
## [188]  80 150 120 100 150 400  60   0 500 250   0 150  25  30  50   0  30
## [205] 200   0  30  70   0 100 100 375  50 100 100 100  50  50 100 400  25
## [222] 200 200 800  30 400 300 100 100 150  40 250 300 100 100 200   0 200
## [239]  75 300 200 800 400 150 400 100 500 450 600 220 300 300  70 500 100
## [256] 400 100 100  50 300  75 500  40 400

This now gives a list of IDs, with their coupon numbers, which one should then go and check. This code doesn't check for whether two people come in with the same coupon, but this should be apparent when one comes to plotting out the RDS recruitment tree.


Thursday, June 20, 2013

Generating nice-looking RDS networks,

I often see recruitment networks from RDS studies laid out like this:

However, this sort of layout makes it hard to see who are 'seeds', how many waves there are, etc.. I prefer a layered Sugiyama type format like this:

Here's how you can make such a network using R. You can download the data, sim.rdsat, from here.

In order to read RDSAT data into R, we skip over the first couple of rows.

rdsdata <- read.table("sim.rdsat", skip = 2, row.names = NULL)

This misses out on other information, that we can get from the first couple of lines. Our data is tab-delimited, so we split the second line into separate fields based on the tab character.

# Read the second line
rdsdata.headers <- readLines("sim.rdsat", n = 2)[2]
# Split the line into fields
rdsdata.headers <- unlist(strsplit(rdsdata.headers, "\t")[[1]])
# Extract the number of coupons
ncoupons <- as.integer(rdsdata.headers[2])
# Extract the missing code
misscode <- as.integer(rdsdata.headers[3])
# Now generate new headers
newheaders <- c("id", "netsize", "mycoupon")
for (i in 1:ncoupons) {
    newheaders <- c(newheaders, paste("coupon", i, sep = "."))
}
newheaders <- c(newheaders, rdsdata.headers[(4 + ncoupons):length(rdsdata.headers)])
names(rdsdata) <- newheaders

Now we have a table with standardized headers for identifier, network size, and coupons.

We have to convert the RDSAT data, such that we find the recruiter for each individual. First we make a list of coupons, and who they were given to. Note that I turn everything into a character string.

couponList <- list()
couponsGiven <- rdsdata[, grep("coupon.", names(rdsdata))]
# Now collapse coupons into a single column
couponsGiven <- stack(couponsGiven)
couponIDs <- rep(rdsdata$id, ncoupons)
# Cycle through each coupon
for (i in 1:dim(couponsGiven)[[1]]) {
    thiscoupon <- as.character(couponsGiven$values[i])
    if (thiscoupon != as.character(misscode)) {
        couponList[[thiscoupon]] <- as.character(couponIDs[i])
    }
}

Now we can find the recruiter for each individual.

recruiters <- rep("", dim(rdsdata)[[1]])
for (i in 1:length(recruiters)) {
    thiscoupon <- as.character(rdsdata$mycoupon[i])
    if (thiscoupon == as.character(misscode)) {
        recruiters[i] <- NA
    } else {
        recruiters[i] <- couponList[[thiscoupon]]
    }
}

To plot out the network in a nice Sugiyama-style format, I use Rgraphviz. To install this, set the repositories to include BioConductor; you'll also have to have Graphviz installed.

setRepositories()

To get RGraphviz to render the network, I generate an adjacency matrix, then convert to a graph object.

library(graph)
allids <- sort(unique(c(as.character(rdsdata$id), recruiters)))
numids <- length(allids)
adj <- matrix(0, numids, numids)
dimnames(adj) <- list(allids, allids)
for (j in 1:dim(rdsdata)[1]) {
    adj[match(recruiters[j], allids), match(rdsdata$id[j], allids)] <- 1
}
rdsdata.graph <- graphAM(adjMat = adj, edgemode = "directed")

The resulting network can be plotted very easily.

library(Rgraphviz)
plot(rdsdata.graph, "dot")

However, we can improve this plot by adding in covariate information. I set the colour to be red if x2=1, and the width of the symbol to 2 rather than 1 if x1>0. To do this, we make a list of attributes, which we will pass to the plotting routine. A full list of these attributes can be found here.

nodeNames <- nodes(rdsdata.graph)
numNodes <- length(nodeNames)
nAttrs <- list()
nAttrs$label <- rep("", numNodes)
names(nAttrs$label) <- nodeNames
nAttrs$fillcolor <- c()
nAttrs$width <- c()
for (i in 1:numNodes) {
    if (rdsdata$x2[i] == 0) {
        nAttrs$fillcolor <- c(nAttrs$fillcolor, "blue")
    } else {
        nAttrs$fillcolor <- c(nAttrs$fillcolor, "red")
    }
    if (rdsdata$x1[i] > 0) {
        nAttrs$width <- c(nAttrs$width, 2)
    } else {
        nAttrs$width <- c(nAttrs$width, 1)
    }
}
names(nAttrs$fillcolor) <- as.character(nodeNames)
names(nAttrs$width) <- as.character(nodeNames)

Now we can pass the attributes to the plotting routine, to get the network shown at the beginning of the post.

plot(rdsdata.graph, nodeAttrs = nAttrs, "dot")