Description of the dataset
The cell line methylation profiles were published by Onuchic et al. in Cell Reports [1].
A set of four methylation profiles, generated with the targeted bisulfite sequencing assay described in the Onuchic et al. paper, was used to build simulated mixtures. The profiles come from:
- a breast cancer cell line (MCF-7),
- a normal breast cell line (HMEC),
- a cancer-associated fibroblast (CAF) cell line,
- and purified T-cells.
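A minimal sketch of how such an in silico mixture can be built, assuming a linear mixing model in which each simulated sample is a proportion-weighted average of the four reference profiles. The reference values, the number of sites, and the random sampling of proportions below are illustrative assumptions, not the procedure actually used for this dataset.

```r
set.seed(1)

n_sites    <- 100  # hypothetical number of CpG sites (toy value)
n_samples  <- 30   # cohort size stated in this document
cell_types <- c("MCF-7", "HMEC", "CAF", "T-cell")

# Hypothetical reference profiles: one column of beta values per cell type.
ref <- matrix(runif(n_sites * length(cell_types)),
              nrow = n_sites,
              dimnames = list(paste0("site_", seq_len(n_sites)), cell_types))

# Random proportions: normalised gamma draws (a flat Dirichlet), so that
# each column is non-negative and sums to 1.
raw <- matrix(rgamma(length(cell_types) * n_samples, shape = 1),
              nrow = length(cell_types),
              dimnames = list(cell_types, paste0("sample_", seq_len(n_samples))))
prop <- sweep(raw, 2, colSums(raw), "/")

# Linear mixing: (sites x cell types) %*% (cell types x samples).
mix <- ref %*% prop

dim(prop)
dim(mix)
```

Because every column of `prop` sums to 1, each mixed value stays within the range of the reference beta values at that site.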
Omics_type
= methylome (targeted bisulfite sequencing)
Cancer_type
= brca
Cohort_size
= 30
Patient_metadata
= No
Sample_type
= In silico mixture of cell lines
Preparation of the data
We use the following GEO samples:
- GSM2327390
- GSM2327392
- GSM2327393
- GSM2327397
Here is the information retrieved from the corresponding GEO pages:
Frozen samples were pulverized with mortar and pestle under liquid nitrogen conditions. DNA from cell culture, tumor tissue, normal tissue, and buffy coat was isolated using Qiagen’s DNeasy Blood and Tissue kit and bisulfite converted with the EpiTect Bisulfite Kit (Qiagen). A set of 1000 target regions of around 300bp in length were preselected for targeted bisulfite sequencing. Primer pairs designed to specifically amplify each selected target region were designed by RainDance Technologies. The ThunderStorm BS-seq assay using that set of primer pairs was performed at RainDance Technologies according to the manufacturer’s specification. That assay uses a microfluidic chip to perform multiplex amplification of bisulfite treated DNA using the set of primers designed to amplify the selected set of genomic regions. This step is followed by sequencing of PCR product. Target regions were sequenced on average to 200X coverage.
Read mapping and methylation level calling was performed using Bismark (Krueger and Andrews, 2011). Average level of methylation over each targeted region was computed (Genome_build: hg19)
Normalisation
= none
Transformation
= beta_value
Aggregation
= median
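For reference, a beta value in bisulfite sequencing is the fraction of reads supporting methylation at a site. A minimal sketch with made-up read counts follows; the variable names and numbers are hypothetical, and because this page does not state what the median aggregation is taken over, that step is not illustrated.

```r
# Hypothetical read counts at three CpG sites.
methylated   <- c(12, 150, 0)
unmethylated <- c(188, 50, 200)

# Beta value = methylated reads / total reads, bounded in [0, 1].
beta <- methylated / (methylated + unmethylated)
beta
```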
Composition of the test dataset
Methylome dataset
## [1] 5
## [1] 34883 30
colnames(test_data[[1]]) <- paste0("sample_", seq_len(ncol(test_data[[1]])))
knitr::kable(head(test_data[[1]][, 1:5], 10))
CpG site | sample_1 | sample_2 | sample_3 | sample_4 | sample_5 |
---|---|---|---|---|---|
chr1_948671_948671 | 0.1479821 | 0.0058608 | 0.0052021 | 0.0211274 | 0.0054362 |
chr1_948675_948675 | 0.0649753 | 0.0971333 | 0.0878547 | 0.0112516 | 0.0130588 |
chr1_948682_948682 | 0.0257134 | 0.0078960 | 0.2026140 | 0.0335864 | 0.0059482 |
chr1_948691_948691 | 0.0140645 | 0.1028755 | 0.0529727 | 0.0256354 | 0.4408728 |
chr1_948717_948717 | 0.1195694 | 0.0112640 | 0.1565115 | 0.2616982 | 0.3279626 |
chr1_948725_948725 | 0.1471507 | 0.0068616 | 0.3199015 | 0.1208587 | 0.0057165 |
chr1_948740_948740 | 0.0605371 | 0.3562729 | 0.1435489 | 0.3643015 | 0.0157693 |
chr1_948761_948761 | 0.0214107 | 0.1802633 | 0.1867907 | 0.1267173 | 0.0135065 |
chr1_948775_948775 | 0.1999722 | 0.1383489 | 0.0159732 | 0.0180771 | 0.1904371 |
chr1_948803_948803 | 0.0309058 | 0.1324888 | 0.0288953 | 0.0416293 | 0.0243427 |
Composition of the solution dataset (ground truth)
Source
= in silico simulations
Number of expected cell types
= 4
Five independent proportion matrices and the corresponding complex methylation matrices were generated to score algorithm performance.
## [1] 5
## [1] 4 30
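A sketch of how an estimated proportion matrix could be compared with one ground-truth matrix of this shape (4 cell types by 30 samples). The mean absolute error used here is an illustrative choice, not necessarily the metric used for scoring, and `truth`, `estimate`, and `to_proportions` are hypothetical names filled with random values.

```r
set.seed(2)

# Scale each column of a non-negative matrix so that it sums to 1.
to_proportions <- function(m) sweep(m, 2, colSums(m), "/")

# Hypothetical ground truth: 4 cell types x 30 samples.
truth <- to_proportions(matrix(runif(4 * 30), nrow = 4))

# Hypothetical estimate: the truth plus small noise, clipped at 0 and renormalised.
estimate <- to_proportions(pmax(truth + rnorm(4 * 30, sd = 0.02), 0))

# Mean absolute error over all entries (0 means a perfect estimate).
mae <- mean(abs(estimate - truth))
mae
```

With five ground-truth matrices, the same comparison would be repeated once per matrix and the resulting errors summarised.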