Dataset Factsheet

Description of the dataset

The cell line methylation profiles were published¹ by Onuchic et al. in Cell Reports.

Four cell line methylation profiles, generated with the targeted bisulfite sequencing assay described in Onuchic et al., were used to build simulated mixtures. The profiles come from:

a breast cancer cell line (MCF-7),
a normal breast cell line (HMEC),
a cancer-associated fibroblast (CAF) cell line,
and purified T-cells.

Omics_type = methylome (targeted bisulfite sequencing)

Cancer_type = brca

Cohort_size = 30

Patient_metadata = No

Sample_type = In silico mixture of cell lines
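
As a rough illustration of what an in silico mixture means here, the sketch below combines four reference profiles using random proportions. It is not the simulation code used to build this dataset: the `ref_profiles` matrix, the way proportions are sampled (uniform weights rescaled to sum to 1), and the toy number of rows are all assumptions made for the example.

```r
set.seed(1)

n_probes   <- 100  # toy value chosen for the example
n_samples  <- 30   # matches Cohort_size
cell_types <- c("MCF-7", "HMEC", "CAF", "T-cell")

# Hypothetical reference profiles: one column of beta values per cell type.
ref_profiles <- matrix(runif(n_probes * length(cell_types)),
                       nrow = n_probes,
                       dimnames = list(NULL, cell_types))

# Hypothetical proportions: random weights rescaled so each sample sums to 1.
weights <- matrix(runif(length(cell_types) * n_samples),
                  nrow = length(cell_types),
                  dimnames = list(cell_types,
                                  paste0("sample_", seq_len(n_samples))))
proportions <- sweep(weights, 2, colSums(weights), "/")

# Each mixed sample is a proportion-weighted average of the reference profiles.
mixtures <- ref_profiles %*% proportions

dim(mixtures)
```

Because every column of `proportions` sums to 1, each simulated value stays within the range of the reference beta values.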

Preparation of the data

We use the following GEO samples:

GSM2327390
GSM2327392
GSM2327393
GSM2327397

Here is the protocol information retrieved from the corresponding GEO records:

Frozen samples were pulverized with mortar and pestle under liquid nitrogen conditions. DNA from cell culture, tumor tissue, normal tissue, and buffy coat was isolated using Qiagen’s DNeasy Blood and Tissue kit and bisulfite converted with the EpiTect Bisulfite Kit (Qiagen). A set of 1000 target regions of around 300bp in length were preselected for targeted bisulfite sequencing. Primer pairs designed to specifically amplify each selected target region were designed by RainDance Technologies. The ThunderStorm BS-seq assay using that set of primer pairs was performed at RainDance Technologies according to the manufacturer’s specification. That assay uses a microfluidic chip to perform multiplex amplification of bisulfite treated DNA using the set of primers designed to amplify the selected set of genomic regions. This step is followed by sequencing of PCR product. Target regions were sequenced on average to 200X coverage.

Read mapping and methylation level calling was performed using Bismark (Krueger and Andrews, 2011). Average level of methylation over each targeted region was computed (Genome_build: hg19).

Normalisation = none

Transformation = beta_value

Aggregation = median
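
For readers unfamiliar with the beta_value transformation, the following minimal sketch shows how a beta value is usually derived from bisulfite sequencing read counts. The counts are invented for the example and are not taken from the dataset; the exact calling procedure used upstream (Bismark) is not reproduced here.

```r
# Hypothetical read counts at three CpG sites (not taken from the dataset).
methylated   <- c(30, 2, 150)
unmethylated <- c(170, 198, 50)

# Beta value: fraction of reads supporting methylation, bounded in [0, 1].
beta <- methylated / (methylated + unmethylated)
beta
## [1] 0.15 0.01 0.75
```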

Composition of the test dataset

Methylome dataset

test_data = readRDS("test_data.rds")
length(test_data)

## [1] 5

dim(test_data[[1]])

## [1] 34883    30

colnames(test_data[[1]]) = paste0("sample_",1:dim(test_data[[1]])[2])
knitr::kable(head(test_data[[1]][,1:5], 10))

	sample_1	sample_2	sample_3	sample_4	sample_5
chr1_948671_948671	0.1479821	0.0058608	0.0052021	0.0211274	0.0054362
chr1_948675_948675	0.0649753	0.0971333	0.0878547	0.0112516	0.0130588
chr1_948682_948682	0.0257134	0.0078960	0.2026140	0.0335864	0.0059482
chr1_948691_948691	0.0140645	0.1028755	0.0529727	0.0256354	0.4408728
chr1_948717_948717	0.1195694	0.0112640	0.1565115	0.2616982	0.3279626
chr1_948725_948725	0.1471507	0.0068616	0.3199015	0.1208587	0.0057165
chr1_948740_948740	0.0605371	0.3562729	0.1435489	0.3643015	0.0157693
chr1_948761_948761	0.0214107	0.1802633	0.1867907	0.1267173	0.0135065
chr1_948775_948775	0.1999722	0.1383489	0.0159732	0.0180771	0.1904371
chr1_948803_948803	0.0309058	0.1324888	0.0288953	0.0416293	0.0243427
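
The row names in the table above appear to encode a chromosome, a start and an end position separated by underscores. The sketch below splits such identifiers into genomic coordinates. It assumes that every identifier follows this three-field pattern and that chromosome names contain no underscore; the `ids` vector is copied from the first rows shown above.

```r
ids <- c("chr1_948671_948671", "chr1_948675_948675", "chr1_948682_948682")

# Split each identifier on underscores into chromosome, start and end.
parts <- do.call(rbind, strsplit(ids, "_", fixed = TRUE))

coords <- data.frame(
  chr   = parts[, 1],
  start = as.integer(parts[, 2]),
  end   = as.integer(parts[, 3]),
  stringsAsFactors = FALSE
)
coords
```

With the real data, `ids` would be replaced by `rownames(test_data[[1]])`.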

Expected number of cell types

print(readRDS("input_k_value.rds"))

## [1] 4

Cancer type

print(readRDS("cancer_type.rds"))

## [1] "brca"

Composition of the solution dataset (ground truth)

Source = in silico simulations

Number of expected cell types = 4

Five independent proportion matrices and the corresponding mixture methylation matrices were generated to score algorithm performance.

test_solution = readRDS("test_solution.rds")
length(test_solution)

## [1] 5

dim(test_solution[[1]])

## [1]  4 30
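
A quick sanity check can confirm that each mixture matrix and its ground-truth proportion matrix describe the same samples and the expected number of cell types. The helper `check_pair` below is hypothetical (it is not part of the challenge code), and the toy matrices only reproduce the dimensions reported above.

```r
# Hypothetical helper: stops with an error if the two matrices disagree on the
# number of samples, or if the solution does not have k rows (cell types).
check_pair <- function(data_mat, solution_mat, k) {
  stopifnot(
    ncol(data_mat) == ncol(solution_mat),  # same number of samples
    nrow(solution_mat) == k                # one row per cell type
  )
  invisible(TRUE)
}

# Toy stand-ins with the dimensions reported above (34883 x 30 and 4 x 30).
toy_data     <- matrix(0, nrow = 34883, ncol = 30)
toy_solution <- matrix(0.25, nrow = 4, ncol = 30)
check_pair(toy_data, toy_solution, k = 4)

# With the real files, the same check could be run on every pair:
# test_data     <- readRDS("test_data.rds")
# test_solution <- readRDS("test_solution.rds")
# k             <- readRDS("input_k_value.rds")
# for (i in seq_along(test_data)) {
#   check_pair(test_data[[i]], test_solution[[i]], k)
# }
```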

¹ https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE87297