Dataset Factsheet

Description of the dataset

The dataset was published¹ by Kang et al. in PLOS Computational Biology.

In brief, total mRNA was prepared from Namalwa (Burkitt’s lymphoma), Hs343T (fibroblast line derived from a mammary gland adenocarcinoma), hTERT-HME1 (normal mammary epithelial cells immortalized with hTERT), and MCF7 (estrogen receptor-positive breast cancer cell line). The RNA samples were profiled by RNA sequencing in duplicate.

Omics_type = transcriptome

Cancer_type = brca

Cohort_size = 30

Patient_metadata = No

Sample_type = In silico mixture of cell lines

Preparation of the data

RNA-seq expression data were collected and normalized together using edgeR. No log transformation was applied: values are kept on a linear scale.

Normalisation = edgeR

Transformation = none, linear scale

Aggregation = median
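
The factsheet does not say what is aggregated. As a minimal sketch, assuming "Aggregation = median" means that rows sharing the same gene symbol are collapsed to their per-sample median, the step could look like this in base R. The toy matrix and the helper name `aggregate_by_median` are hypothetical and are not part of the dataset.

```r
# Toy expression matrix with a duplicated gene symbol (values are made up).
expr <- matrix(
  c(1, 2,
    3, 6,
    5, 10),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("GENE_A", "GENE_A", "GENE_B"), c("sample_1", "sample_2"))
)

# Hypothetical helper: collapse rows that share a gene symbol
# to their per-sample median (assumed meaning of "Aggregation = median").
aggregate_by_median <- function(m) {
  groups <- split(seq_len(nrow(m)), rownames(m))
  do.call(rbind, lapply(groups, function(idx) {
    apply(m[idx, , drop = FALSE], 2, median)
  }))
}

agg <- aggregate_by_median(expr)
print(agg)
```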

Composition of the test dataset

Transcriptome dataset

test_data = readRDS("test_data.rds")
length(test_data)

## [1] 5

dim(test_data[[1]])

## [1] 56646    30

colnames(test_data[[1]]) = paste0("sample_",1:dim(test_data[[1]])[2])
knitr::kable(head(test_data[[1]][,1:5], 10))

	sample_1	sample_2	sample_3	sample_4	sample_5
BHLHE40 /// DELEC1	0.6828031	0.9473202	0.5441742	0.1245747	0.8655587
MTARC1 /// MARCHF1	6.6497089	6.9352935	7.7606798	7.3697662	8.6457317
SEPTIN1	15.9085347	12.7724058	14.0699822	15.0968837	13.9262235
MARCHF10	0.6418168	1.3160400	0.9932912	0.7749461	1.4025259
SEPTIN10	91.3654553	97.4414434	85.2664554	86.3254491	77.3517607
MARCHF11	0.0268714	0.1374658	0.0407050	0.0279064	0.1238828
SEPTIN11	189.3891755	247.8283085	195.6794356	185.6063814	225.4692756
SEPTIN12	0.0289296	0.0809382	0.0284863	0.1438283	0.0225955
SEPTIN14	0.2670606	0.0037391	0.2080682	0.1044387	0.2188631
MTARC2 /// MARCHF2	8.6032742	10.7252242	9.2278913	8.7357587	10.3554631
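
Only the first element of the list is inspected above. The sketch below shows one way to run the same structural checks over every element. It uses a small simulated stand-in so that it runs without the data file, and the helper `check_expression_list` is hypothetical. The non-negativity check assumes linear-scale values, as stated in the preparation section.

```r
# Toy stand-in for test_data: a list of 5 genes x samples matrices.
# With the real file, use instead: test_data <- readRDS("test_data.rds")
set.seed(1)
test_data <- replicate(
  5,
  matrix(rexp(20 * 30), nrow = 20, ncol = 30,
         dimnames = list(paste0("gene_", 1:20), NULL)),
  simplify = FALSE
)

# Hypothetical helper: stop with an error if the list is not made of
# numeric matrices of identical dimensions without NA or negative values.
check_expression_list <- function(x) {
  stopifnot(
    is.list(x),
    all(vapply(x, is.matrix, logical(1))),
    all(vapply(x, is.numeric, logical(1)))
  )
  dims <- vapply(x, dim, integer(2))
  stopifnot(
    all(dims[1, ] == dims[1, 1]),   # same number of genes everywhere
    all(dims[2, ] == dims[2, 1]),   # same number of samples everywhere
    !any(vapply(x, anyNA, logical(1))),
    all(vapply(x, function(m) all(m >= 0), logical(1)))
  )
  invisible(dims[, 1])
}

d <- check_expression_list(test_data)
print(d)
```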

Expected number of cell types

print(readRDS("input_k_value.rds"))

## [1] 4

Cancer type

print(readRDS("cancer_type.rds"))

## [1] "brca"

Composition of the solution dataset (ground truth)

Source = in silico simulations

Number of expected cell types = 4

Five independent proportion matrices and the corresponding complex expression matrices were generated to score algorithm performance.
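
The relationship between a proportion matrix and its complex expression matrix can be illustrated with a minimal sketch. It assumes a linear mixing model in which each sample is a weighted sum of cell-type reference profiles. The factsheet does not describe the actual simulation procedure, so the reference profiles, the flat Dirichlet sampling of proportions, and all variable names below are assumptions made for illustration.

```r
set.seed(42)
n_genes <- 100
n_types <- 4     # expected number of cell types
n_samples <- 30  # cohort size

# Hypothetical reference profiles (genes x cell types), linear scale.
ref_profiles <- matrix(rexp(n_genes * n_types, rate = 0.1),
                       nrow = n_genes, ncol = n_types)

# Random proportions: gamma draws normalised per sample,
# which is equivalent to sampling from a flat Dirichlet distribution.
raw <- matrix(rgamma(n_types * n_samples, shape = 1), nrow = n_types)
proportions <- sweep(raw, 2, colSums(raw), "/")

# Complex expression matrix: each sample is a weighted sum of the profiles.
mixture <- ref_profiles %*% proportions

dim(proportions)
dim(mixture)
```

Under this model the proportion matrix has one row per cell type and one column per sample, which matches the 4 x 30 shape of each solution matrix.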

test_solution = readRDS("test_solution.rds")
length(test_solution)

## [1] 5

dim(test_solution[[1]])

## [1]  4 30
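
The factsheet does not state which metric is used for scoring. As a hedged illustration only, the sketch below compares an estimated proportion matrix with a ground-truth matrix of the same shape using the mean absolute error. Both matrices are simulated here, and `mean_abs_error` is a hypothetical helper, not the scoring function of the challenge.

```r
# Toy ground truth (4 cell types x 30 samples) and a noisy estimate.
set.seed(7)
raw <- matrix(rgamma(4 * 30, shape = 1), nrow = 4)
truth <- sweep(raw, 2, colSums(raw), "/")

# Perturb the truth, clip at zero, and renormalise each sample to sum to 1.
noisy <- pmax(truth + rnorm(length(truth), sd = 0.05), 0)
estimate <- sweep(noisy, 2, colSums(noisy), "/")

# Hypothetical metric: mean absolute error over all entries.
mean_abs_error <- function(est, ref) {
  stopifnot(identical(dim(est), dim(ref)))
  mean(abs(est - ref))
}

mae_perfect <- mean_abs_error(truth, truth)
mae_noisy <- mean_abs_error(estimate, truth)
print(c(perfect = mae_perfect, noisy = mae_noisy))
```

With an unsupervised method the estimated cell types may come out in an arbitrary order, so rows would need to be matched to the ground truth before a comparison of this kind.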

¹ https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE123604