Princy Parsana, Markus Riester, Curtis Huttenhower, Levi Waldron
curatedCRCData: Clinically Annotated Data for the Colorectal Cancer Transcriptome
This package represents a manually curated data collection for gene expression meta-analysis of patients with colorectal cancer. This resource provides uniformly prepared microarray data with curated and documented clinical metadata. It allows a computational user to efficiently identify studies and patient subgroups of interest for analysis and to run such analyses immediately without the challenges posed by harmonizing heterogeneous microarray technologies, study designs, expression data processing methods, and clinical data formats.
In this vignette, we give a short tour of the package and show how to use it efficiently.
Loading a single dataset is easy. First we load the package:
> library(curatedCRCData)
To get a listing of all the datasets, use the data function:
> data(package="curatedCRCData")
Now to load a single dataset, we use the data function again:
> data(TCGA.COAD_eset)
> TCGA.COAD_eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 17814 features, 130 samples
sampleNames: TCGA.AA.3520 TCGA.AA.3532 ... TCGA.A6.2685 (130 total)
varLabels: unique_patient_ID alt_sample_name ... moltherapy (60 total)
varMetadata: labelDescription
featureNames: 15E1.2 2'-PDE ... ZZZ3 (17814 total)
fvarLabels: probeset gene
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: agilent-014850 whole human genome microarray 4x44k g4112f
The datasets are provided as Bioconductor ExpressionSet objects and we refer to the Bioconductor documentation for users unfamiliar with this data structure.
For a meta-analysis, we typically want to filter datasets and patients to get a population of patients we are interested in. We provide a short but powerful R script that does the filtering and provides the data as a list of ExpressionSet objects. One can use this script within R by first sourcing a config file which specifies the filters, such as the minimum number of patients in each dataset. It is also possible to filter samples by annotation, for example to remove early stage and normal samples.
> source(system.file("extdata",
+ "patientselection_all.config",package="curatedCRCData"))
> ls()
[4] "min.number.of.events" "min.sample.size"
Inspect the values of the variables we have just loaded. The variable names are fairly descriptive, but note that "rule.1" is a character vector of length 2, where the first entry is the name of a clinical data variable and the second entry is a regular expression stating a requirement for that variable. Any number of rules can be added, with increasing identifiers, e.g. "rule.2", "rule.3", etc.
Here strict.checking is FALSE, meaning that samples not annotated for the variables in these rules are allowed to pass the filter. If strict.checking == TRUE, samples missing this annotation will be removed.
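The interaction between a rule, strict.checking and a minimum sample size can be illustrated with a minimal base-R sketch. It is not the code of createEsetList.R, which is not shown in this vignette; the helper names applyRule and filterAll, the variable sample_type, its values, and the two toy studies are all hypothetical and stand in for real clinical annotation.

```r
## Hypothetical stand-ins for the config variables described above.
rule.1 <- c("sample_type", "^tumor$")   # variable name, regular expression
min.sample.size <- 2

## Toy clinical annotation for two made-up datasets (not real package data).
pheno <- list(
  study_A = data.frame(sample_type = c("tumor", "adjacentnormal", NA, "tumor"),
                       stringsAsFactors = FALSE),
  study_B = data.frame(sample_type = c("tumor", NA),
                       stringsAsFactors = FALSE)
)

## Keep rows whose annotation matches the rule; rows with missing
## annotation pass only when strict.checking is FALSE.
applyRule <- function(pd, rule, strict.checking = FALSE) {
  values <- pd[[rule[1]]]
  matches <- grepl(rule[2], values)      # grepl() returns FALSE for NA
  keep <- ifelse(is.na(values), !strict.checking, matches)
  pd[keep, , drop = FALSE]
}

## Apply the rule to every dataset, then drop datasets that are too small.
filterAll <- function(pheno, rule, strict.checking, min.sample.size) {
  kept <- lapply(pheno, applyRule, rule = rule,
                 strict.checking = strict.checking)
  Filter(function(pd) nrow(pd) >= min.sample.size, kept)
}

lenient <- filterAll(pheno, rule.1, FALSE, min.sample.size)
strict  <- filterAll(pheno, rule.1, TRUE,  min.sample.size)
sapply(lenient, nrow)   # samples kept per dataset, unannotated samples pass
sapply(strict, nrow)    # samples kept per dataset, unannotated samples removed
```

In this toy example the lenient setting keeps both studies, while the strict setting removes the unannotated samples and study_B then falls below the minimum sample size.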
$TCGA.COAD_eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 17814 features, 130 samples
sampleNames: TCGA.AA.3520 TCGA.AA.3532 ... TCGA.A6.2685 (130 total)
varLabels: unique_patient_ID alt_sample_name ... moltherapy (60 total)
varMetadata: labelDescription
featureNames: 15E1.2 2'-PDE ... ZZZ3 (17814 total)
fvarLabels: probeset gene
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: agilent-014850 whole human genome microarray 4x44k g4112f
Now that we have defined the sample filter, we create a list of ExpressionSets by sourcing the createEsetList.R file:
> source(system.file("extdata", "createEsetList.R", package =
+ "curatedCRCData"))
2013-12-14 09:37:55 INFO::Inside script createEsetList.R - inputArgs =
2013-12-14 09:37:55 INFO::
2013-12-14 09:37:55 INFO::Loading curatedCRCData 0.99.1
2013-12-14 09:41:00 INFO::Clean up the esets.
2013-12-14 09:41:00 INFO::including GSE11237_eset
2013-12-14 09:41:00 INFO::including GSE12225.GPL3676_eset
2013-12-14 09:41:01 INFO::including GSE12945_eset
2013-12-14 09:41:02 INFO::including GSE13067_eset
2013-12-14 09:41:02 INFO::including GSE13294_eset
2013-12-14 09:41:02 INFO::including GSE14095_eset
2013-12-14 09:41:03 INFO::including GSE14333_eset
2013-12-14 09:41:03 INFO::including GSE16125.GPL5175_eset
2013-12-14 09:41:04 INFO::including GSE17536_eset
2013-12-14 09:41:05 INFO::including GSE17537_eset
2013-12-14 09:41:05 INFO::including GSE17538.GPL570_eset
2013-12-14 09:41:05 INFO::including GSE18105_eset
2013-12-14 09:41:05 INFO::including GSE2109_eset
2013-12-14 09:41:06 INFO::including GSE21510_eset
2013-12-14 09:41:06 INFO::including GSE21815_eset
2013-12-14 09:41:07 INFO::including GSE24549.GPL5175_eset
2013-12-14 09:41:07 INFO::including GSE24550.GPL5175_eset
2013-12-14 09:41:07 INFO::including GSE2630_eset
2013-12-14 09:41:07 INFO::including GSE26682.GPL570_eset
2013-12-14 09:41:07 INFO::including GSE26682.GPL96_eset
2013-12-14 09:41:07 INFO::including GSE26906_eset
2013-12-14 09:41:07 INFO::including GSE27544_eset
2013-12-14 09:41:07 INFO::including GSE28702_eset
2013-12-14 09:41:07 INFO::including GSE3294_eset
2013-12-14 09:41:07 INFO::including GSE33113_eset
2013-12-14 09:41:08 INFO::including GSE39582_eset
2013-12-14 09:41:09 INFO::including GSE3964_eset
2013-12-14 09:41:09 INFO::including GSE4045_eset
2013-12-14 09:41:09 INFO::including GSE4526_eset
2013-12-14 09:41:09 INFO::including GSE45270_eset
2013-12-14 09:41:09 INFO::including TCGA.COAD_eset
2013-12-14 09:41:09 INFO::including TCGA.READ_eset
2013-12-14 09:41:09 INFO::including TCGA.RNASeqV2.READ_eset
2013-12-14 09:41:10 INFO::including TCGA.RNASeqV2_eset
2013-12-14 09:41:10 INFO::Ids with missing data: GSE3294_eset, TCGA.COAD_eset, TCGA.READ_eset
It is also possible to run the script from the command line and then load the R data file within R:
R --vanilla "--args patientselection.config crc.eset.rda tmp.log"
Now we have 34 datasets with samples that passed our filter in a list of ExpressionSets called esets:
In the standard version of curatedCRCData (the version available on Bioconductor), we collapse manufacturer probesets to official HGNC symbols using the Biomart database. Some probesets are mapped to multiple HGNC symbols in this database. For these probesets, we provide all the symbols. For example, 220159_at maps to ABCA11P and ZNF721, and we provide ABCA11P///ZNF721 as the probeset name. If you have an array of gene symbols for which you want to access the expression data, "ABCA11P" would not be found in curatedCRCData in this example. The following function will create a new ExpressionSet in which both ZNF721 and ABCA11P are features with identical expression data:
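> expandProbesets <- function (eset, sep = "///")
+ {
+     x <- lapply(featureNames(eset), function(x) strsplit(x, sep)[[1]])
+     eset <- eset[order(sapply(x, length)), ]
+     x <- lapply(featureNames(eset), function(x) strsplit(x, sep)[[1]])
+     idx <- unlist(sapply(1:length(x), function(i) rep(i, length(x[[i]]))))
+     ## Assumed completion (the listing is cut off here): keep the first
+     ## occurrence of each symbol and use the symbols as feature names.
+     xx <- !duplicated(unlist(x))
+     idx <- idx[xx]
+     x <- unlist(x)[xx]
+     eset <- eset[idx, ]
+     featureNames(eset) <- x
+     eset
+ }
> X <- TCGA.COAD_eset[head(grep("AA", featureNames(TCGA.COAD_eset))),]
> exprs(X)[,1:3]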
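When only a few symbols need to be looked up, an alternative to building an expanded ExpressionSet is to match the symbol against the components of each composite feature name. The following base-R sketch illustrates this on a toy matrix; the helper rowsForSymbol and the expression values are hypothetical, and with real data the matrix would come from exprs() of a dataset.

```r
## Toy expression matrix; values are made up for illustration.
expr <- matrix(c(5.1, 7.3, 2.2,
                 4.8, 7.9, 2.5),
               nrow = 3,
               dimnames = list(c("ABCA11P///ZNF721", "TP53", "KRAS"),
                               c("sample1", "sample2")))

## Return the rows whose (possibly composite) name contains the symbol.
rowsForSymbol <- function(expr, symbol, sep = "///") {
  ## Split each feature name into its component symbols.
  symbols <- strsplit(rownames(expr), sep, fixed = TRUE)
  hit <- vapply(symbols, function(s) symbol %in% s, logical(1))
  expr[hit, , drop = FALSE]
}

"ABCA11P" %in% rownames(expr)     # exact matching misses the composite name
rowsForSymbol(expr, "ABCA11P")    # component matching finds it
rowsForSymbol(expr, "ZNF721")     # the same row is found for the other symbol
```

This leaves the feature names untouched, at the cost of splitting the names on every lookup.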
Figure 1: Available clinical annotation. This heatmap visualizes for each curated clinical characteristic (rows) the availability in each dataset (columns). Red indicates that the corresponding characteristic is available for at least one sample in the dataset.
This example provides a table summarizing the datasets being used, and is useful when publishing analyses based on curatedCRCData. First, define some useful functions for this purpose:
> source(system.file("extdata", "summarizeEsets.R", package =+
Optionally write this table to a file (for example, replace myfile <- tempfile() with something like myfile <- "nicetable.csv"):
[1] "/tmp/RtmpmcH6wF/file75ff6f7d35ae"
> write.table(summary.table, file=myfile, row.names=FALSE, quote=TRUE, sep=",")
If you are not doing your analysis in R, and just want to get some data you have identified from the curatedCRCData manual, here is a simple way to do it. For one dataset:
> library(curatedCRCData)
> library(affy)
> data(TCGA.COAD_eset)
> write.csv(exprs(TCGA.COAD_eset), file="TCGA.COAD_eset_exprs.csv")
> write.csv(pData(TCGA.COAD_eset), file="TCGA.COAD_eset_clindata.csv")
For several datasets:
> data.to.fetch <- c("TCGA.COAD_eset", "GSE37317_eset")
> for (onedata in data.to.fetch){
+     print(paste("Fetching", onedata))
+     data(list=onedata)  # load the dataset by name so that get() can find it
+     write.csv(exprs(get(onedata)), file=paste(onedata, "_exprs.csv", sep=""))
+     write.csv(pData(get(onedata)), file=paste(onedata, "_clindata.csv", sep=""))
+ }
• R Under development (unstable) (2013-11-03 r64145), x86_64-unknown-linux-gnu
• Locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=C, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8, LC_IDENTIFICATION=C
• Base packages: base, datasets, grDevices, graphics, methods, parallel, splines, stats, utils
• Other packages: Biobase 2.23.3, BiocGenerics 0.9.1, corpcor 1.6.6, curatedCRCData 0.99.1, genefilter 1.45.1, logging 0.7-103, mgcv 1.7-27, nlme 3.1-113, survival 2.37-4, sva 3.9.1, xtable 1.7-1
• Loaded via a namespace (and not attached): AnnotationDbi 1.25.9, BiocStyle 1.1.11, DBI 0.2-7, IRanges 1.21.15, Matrix 1.1-0, RSQLite 0.11.4, XML 3.98-1.1, annotate 1.41.1, grid 3.1.0, lattice 0.20-24, stats4 3.1.0, tools 3.1.0