Dataset Factsheet

Description of the dataset

The dataset was published¹ by Kang et al. in PLOS Computational Biology.

In brief, total mRNA was prepared from Namalwa (Burkitt’s lymphoma), Hs343T (fibroblast line derived from a mammary gland adenocarcinoma), hTERT-HME1 (normal mammary epithelial cells immortalized with hTERT), and MCF7 (estrogen receptor positive breast cancer cell line). The RNA samples were profiled by RNA sequencing in duplicate.

Omics_type = transcriptome

Cancer_type = brca

Cohort_size = 30

Patient_metadata = No

Sample_type = In silico mixture of cell lines

Preparation of the data

RNA-seq expression data were collected, normalized together using edgeR, and transformed using log2(x + 1).

Normalisation = edgeR

Transformation = log2(x + 1) (pseudo-log2)

Aggregation = median
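
The transformation and aggregation settings above can be illustrated with a minimal base-R sketch. It assumes a toy count matrix (`counts`, invented here) in which one gene symbol appears on two rows. Because the exact edgeR call used for this dataset is not given, a plain counts-per-million scaling is used as a stand-in for the normalisation step; the log2(x + 1) transformation and the per-gene median follow the fields listed above.

```r
# Toy raw counts: 4 rows (one gene symbol duplicated), 3 samples.
# All values and names are illustrative, not taken from the dataset.
counts <- matrix(
  c( 10,  20,  30,
      0,   5,  15,
    100, 200, 300,
     50,  60,  70),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, paste0("sample_", 1:3))
)
gene_symbol <- c("GENE_A", "GENE_B", "GENE_A", "GENE_C")

# Stand-in normalisation: counts per million (NOT the edgeR procedure itself).
cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6

# Pseudo-log2 transformation: log2(x + 1) maps a zero count to zero.
log_expr <- log2(cpm + 1)

# Median aggregation: collapse rows that share a gene symbol.
agg <- aggregate(log_expr, by = list(gene = gene_symbol), FUN = median)
expr <- as.matrix(agg[, -1])
rownames(expr) <- as.character(agg$gene)

print(round(expr, 3))
```

With two rows per symbol the median equals the mean of the two values; with three or more rows it is more robust to a single outlying row, which is one plausible reason for preferring it.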

Composition of the test dataset

Transcriptome dataset

test_data = readRDS("test_data.rds")
length(test_data)

## [1] 5

dim(test_data[[1]])

## [1] 56646    30

colnames(test_data[[1]]) = paste0("sample_", seq_len(ncol(test_data[[1]])))
knitr::kable(head(test_data[[1]][,1:5], 10))

	sample_1	sample_2	sample_3	sample_4	sample_5
BHLHE40 /// DELEC1	0.1389209	0.2015841	0.2440109	0.6057359	0.8770143
MTARC1 /// MARCHF1	2.0820315	2.1339103	2.3622637	2.2660872	2.4993167
SEPTIN1	3.1283017	2.8055923	3.0293702	3.1058714	2.9434853
MARCHF10	0.8054583	0.9071065	0.9917149	0.9196117	1.1236826
SEPTIN10	5.6377001	5.8659071	5.7005703	5.6473292	5.4588523
MARCHF11	0.0508587	0.0259265	0.0528873	0.2903529	0.0513815
SEPTIN11	7.4596111	7.6482334	7.4603445	7.4340067	7.5398126
SEPTIN12	0.0402248	0.0461664	0.0397009	0.0406292	0.0315253
SEPTIN14	0.0073609	0.2188129	0.3195756	0.0155361	0.0110615
MTARC2 /// MARCHF2	3.1910142	3.3989906	3.2921603	3.2310591	3.3790429
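
Since `test_data` is a list of five matrices, a quick consistency check across the list can be useful before running an algorithm on it. The sketch below builds a small mock list (`mock_data`, hypothetical, with 100 genes instead of 56646) so that it runs without the `.rds` file; with the real object, `mock_data` would simply be replaced by `test_data`.

```r
set.seed(1)

# Hypothetical stand-in for test_data: 5 matrices of 100 genes x 30 samples.
mock_data <- lapply(1:5, function(i) {
  m <- matrix(runif(100 * 30, min = 0, max = 10), nrow = 100, ncol = 30)
  rownames(m) <- paste0("gene_", 1:100)
  m
})

# One row per matrix: number of genes and number of samples.
dims <- t(vapply(mock_data, dim, integer(2)))
colnames(dims) <- c("n_genes", "n_samples")

# All matrices should share the same shape and the same gene order.
same_shape <- nrow(unique(dims)) == 1
same_genes <- all(vapply(
  mock_data,
  function(m) identical(rownames(m), rownames(mock_data[[1]])),
  logical(1)
))

# log2(x + 1) values of non-negative data should be finite and non-negative.
all_finite_nonneg <- all(vapply(
  mock_data,
  function(m) all(is.finite(m)) && all(m >= 0),
  logical(1)
))

print(dims)
```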

Expected number of cell types

print(readRDS("input_k_value.rds"))

## [1] 4

Cancer type

print(readRDS("cancer_type.rds"))

## [1] "brca"

Composition of the solution dataset (ground truth)

Source = in silico simulations

Number of expected cell types = 4

Five independent proportion matrices and the corresponding complex expression matrices were generated to score algorithm performance.

test_solution = readRDS("test_solution.rds")
length(test_solution)

## [1] 5

dim(test_solution[[1]])

## [1]  4 30
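
To make the relationship between a proportion matrix and its mixed expression matrix concrete, here is a minimal sketch of an in silico mixture under a linear mixing assumption (mixture = reference profiles %*% proportions, on the non-log scale). The reference profiles `ref`, the way proportions are drawn, and the mean absolute error used as a score are all assumptions made for illustration; the factsheet does not state how the proportions were sampled or which metric is used for scoring.

```r
set.seed(42)
n_genes   <- 200   # toy size; the real matrices have 56646 rows
n_types   <- 4     # expected number of cell types
n_samples <- 30    # cohort size

# Hypothetical reference profiles (genes x cell types), non-log scale.
ref <- matrix(
  rexp(n_genes * n_types, rate = 0.1), nrow = n_genes,
  dimnames = list(paste0("gene_", seq_len(n_genes)),
                  paste0("type_", seq_len(n_types)))
)

# Proportion matrix (cell types x samples): each column sums to 1.
prop <- matrix(runif(n_types * n_samples), nrow = n_types)
prop <- sweep(prop, 2, colSums(prop), "/")
dimnames(prop) <- list(colnames(ref), paste0("sample_", seq_len(n_samples)))

# Linear mixture: every sample is a weighted sum of the reference profiles.
mix <- ref %*% prop

# Illustrative score: mean absolute error between estimate and ground truth.
mae <- function(estimate, truth) mean(abs(estimate - truth))

# Naive baseline that assigns equal proportions to every cell type.
uniform <- matrix(1 / n_types, nrow = n_types, ncol = n_samples)

perfect_score  <- mae(prop, prop)
baseline_score <- mae(uniform, prop)
```

Here `prop` has the same 4 x 30 shape as `test_solution[[1]]`, and `mix` plays the role of one element of `test_data` before any log transformation. Note that a real deconvolution output may order cell types differently from the ground truth, so rows usually need to be matched before scoring.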

¹ https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE123604