Dataset Factsheet

Description of the dataset

The dataset was published¹ by Kang et al. in PLOS Computational Biology.

In brief, total mRNA was prepared from Namalwa (Burkitt’s lymphoma), Hs343T (fibroblast line derived from a mammary gland adenocarcinoma), hTERT-HME1 (normal mammary epithelial cells immortalized with hTERT), and MCF7 (estrogen receptor-positive breast cancer cell line). mRNA samples were diluted to 100 ng/μl and mixed in different proportions. The mixed RNA samples were profiled by RNA-sequencing. Sequencing libraries were prepared using the TruSeq RNA Sample Preparation Kit v2 (Illumina).
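To make "mixed in different proportions" concrete, here is a minimal sketch that assumes a mixed profile is the proportion-weighted sum of the pure cell-line profiles. The factsheet does not state this linear-mixing model, and every expression value and proportion below is invented for illustration, not taken from the dataset.

```r
# Hypothetical pure profiles: 3 genes x 4 cell lines (illustrative values only).
pure <- matrix(
  c(10, 0, 5,    # Namalwa
     2, 8, 1,    # Hs343T
     4, 4, 4,    # hTERT-HME1
     0, 6, 9),   # MCF7
  nrow = 3,
  dimnames = list(paste0("gene_", 1:3),
                  c("Namalwa", "Hs343T", "hTERT-HME1", "MCF7"))
)

# Hypothetical mixing proportions: 4 cell lines x 2 mixtures.
# Each column sums to 1.
props <- matrix(
  c(0.25, 0.25, 0.25, 0.25,
    0.10, 0.20, 0.30, 0.40),
  nrow = 4,
  dimnames = list(colnames(pure), c("mix_1", "mix_2"))
)

# Under the linear-mixing assumption, each mixture is a weighted sum
# of the pure profiles: genes x mixtures = (genes x lines) %*% (lines x mixtures).
mixed <- pure %*% props
print(mixed)
```

In this toy setting, `props` plays the role of the ground-truth solution (cell types in rows, samples in columns) and `mixed` plays the role of the expression matrix.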

Omics_type = transcriptome

Cancer_type = brca

Cohort_size = 32

Patient_metadata = No

Sample_type = In vitro mixture of cell lines

Preparation of the data

Expression data from RNA-sequencing were collected and normalized together using edgeR. No transformation was applied; values are on a linear scale.

Normalisation = edgeR

Transformation = none, linear scale

Composition of the test dataset

Transcriptome dataset

test_data = readRDS("test_data.rds")
length(test_data)

## [1] 1

dim(test_data[[1]])

## [1] 56646    32

colnames(test_data[[1]]) = paste0("sample_", seq_len(ncol(test_data[[1]])))
knitr::kable(head(test_data[[1]][,1:5], 10))

	sample_1	sample_2	sample_3	sample_4	sample_5
BHLHE40 /// DELEC1	0.2723886	0.6524212	0.562481	0.0701092	1.428305
MTARC1 /// MARCHF1	15.8763625	15.8086664	14.033900	14.7930397	13.407634
SEPTIN1	8.1327445	7.5279364	10.743386	11.1824163	4.929955
MARCHF10	1.3619429	1.5055873	1.490574	1.8228390	1.981197
SEPTIN10	34.0874842	30.4128629	24.946030	30.0768438	60.127021
MARCHF11	0.0000000	0.0000000	0.000000	0.0350546	0.000000
SEPTIN11	151.2923955	161.7000733	159.238358	143.1629723	251.427702
SEPTIN12	0.0000000	0.1003725	0.028124	0.0000000	0.000000
SEPTIN14	0.0000000	0.0501862	0.028124	0.0350546	0.000000
MTARC2 /// MARCHF2	11.2846694	11.2417183	9.702796	10.7267065	14.467344
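Some identifiers in the table above join two symbols with " /// " (for example "BHLHE40 /// DELEC1"). The sketch below flags such rows. The helper name `split_symbols` is hypothetical, and reading " /// " as a separator between alternative symbols is an assumption the factsheet does not confirm.

```r
# Hypothetical helper: split identifiers such as "BHLHE40 /// DELEC1" into
# their individual symbols. Assumes " /// " separates alternative symbols.
split_symbols <- function(ids) {
  strsplit(ids, " /// ", fixed = TRUE)
}

# Identifiers copied from the table above.
ids <- c("BHLHE40 /// DELEC1", "MTARC1 /// MARCHF1", "SEPTIN1", "MARCHF10")

symbols <- split_symbols(ids)
ambiguous <- lengths(symbols) > 1  # TRUE where a row carries several symbols

print(ids[ambiguous])
```

If the identifiers are stored as row names, as the table suggests, the same call could be applied to `rownames(test_data[[1]])` to count how many of the 56646 rows are affected.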

Expected number of cell types

print(readRDS("input_k_value.rds"))

## [1] 4

Cancer type

print(readRDS("cancer_type.rds"))

## [1] "brca"

Composition of the solution dataset (ground truth)

Source = in vitro mixtures

Number of expected cell types = 4

test_solution = readRDS("test_solution.rds")
length(test_solution)

## [1] 1

dim(test_solution[[1]])

## [1]  4 32
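The dimensions reported above (32 samples in both objects, 4 rows in the solution, and an expected k of 4) can be cross-checked programmatically. The following is a minimal sketch: the helper name `check_consistency` is hypothetical, and the idea that each solution column holds proportions summing to 1 is an assumption, so that property is reported rather than enforced. The toy matrices only mimic the reported shapes; their values are invented.

```r
# Hypothetical helper: cross-check an expression matrix against its solution.
check_consistency <- function(expr, solution, k) {
  stopifnot(
    ncol(expr) == ncol(solution),  # one solution column per sample
    nrow(solution) == k            # one solution row per expected cell type
  )
  # Assumption: solution columns hold proportions; report whether they sum to 1.
  sums_to_one <- isTRUE(all.equal(unname(colSums(solution)),
                                  rep(1, ncol(solution))))
  list(n_samples = ncol(expr),
       n_cell_types = nrow(solution),
       sums_to_one = sums_to_one)
}

# Toy stand-ins with the reported numbers of samples and cell types.
toy_expr <- matrix(1, nrow = 10, ncol = 32)
toy_solution <- matrix(0.25, nrow = 4, ncol = 32)

res <- check_consistency(toy_expr, toy_solution, k = 4)
str(res)
```

On the real files, the equivalent call would be `check_consistency(test_data[[1]], test_solution[[1]], readRDS("input_k_value.rds"))`.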

¹ https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE123604