Dataset Factsheet

Description of the dataset

The dataset was published¹ by Kang et al. in PLOS Computational Biology.

In brief, total mRNA was prepared from Namalwa (Burkitt’s lymphoma), Hs343T (fibroblast line derived from a mammary gland adenocarcinoma), hTERT-HME1 (normal mammary epithelial cells immortalized with hTERT), and MCF7 (estrogen receptor-positive breast cancer cell line). The mRNA samples were diluted to 100 ng/μl and mixed in different proportions. The mixed RNA samples were profiled by RNA-sequencing. Sequencing libraries were prepared using the TruSeq RNA Sample Preparation Kit v2 (Illumina).
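To illustrate what "mixed in different proportions" implies for the expression profiles, the sketch below builds a toy mixture as a linear combination of pure cell-line profiles. This linear model is a common approximation and an assumption here, not something stated by the factsheet; the expression values, the proportions and the names `profiles`, `proportions` and `mixture` are all hypothetical.

```r
# Hypothetical pure profiles: 5 genes x 4 cell lines (values are made up)
cell_lines = c("Namalwa", "Hs343T", "hTERT-HME1", "MCF7")
profiles = matrix(c(10, 0, 2, 5, 1,
                    0, 8, 3, 5, 1,
                    1, 1, 9, 5, 0,
                    2, 4, 0, 5, 6),
                  nrow = 5,
                  dimnames = list(paste0("gene_", 1:5), cell_lines))

# Hypothetical mixing proportions for 2 samples (each column sums to 1)
proportions = matrix(c(0.25, 0.25, 0.25, 0.25,
                       0.70, 0.10, 0.10, 0.10),
                     nrow = 4,
                     dimnames = list(cell_lines, c("sample_1", "sample_2")))

# Assumed linear mixing: bulk profile = profiles %*% proportions
mixture = profiles %*% proportions
print(mixture)
```

In this toy example, a gene with the same expression in every cell line (`gene_4`) keeps that value in every mixture, whatever the proportions are.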

Omics_type = transcriptome

Cancer_type = brca

Cohort_size = 32

Patient_metadata = No

Sample_type = In vitro mixture of cell lines

Preparation of the data

Expression data from RNA-sequencing were collected, normalized together using edgeR and transformed using log2(x + 1).

Normalisation = edgeR

Transformation = log2(x + 1) (pseudo-log2)
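As a small illustration of the pseudo-log2 transformation, the sketch below applies log2(x + 1) to a toy vector and then inverts it. The values and the names `counts`, `logged` and `restored` are hypothetical, and the edgeR normalisation step is not reproduced here.

```r
# Toy normalised expression values (hypothetical, not taken from the dataset)
counts = c(0, 1, 3, 7, 1023)

# Pseudo-log2: adding 1 maps zero expression to 0 instead of -Inf
logged = log2(counts + 1)
print(logged)    # 0 1 2 3 10

# Inverse transformation, back to the linear scale
restored = 2^logged - 1
print(restored)  # 0 1 3 7 1023
```

This is why genes with no detected expression appear as exactly 0 in the transformed matrix.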

Composition of the test dataset

Transcriptome dataset

test_data = readRDS("test_data.rds")
length(test_data)

## [1] 1

dim(test_data[[1]])

## [1] 56646    32

colnames(test_data[[1]]) = paste0("sample_", seq_len(ncol(test_data[[1]])))
knitr::kable(head(test_data[[1]][,1:5], 10))

	sample_1	sample_2	sample_3	sample_4	sample_5
BHLHE40 /// DELEC1	0.3475393	0.7245814	0.6438386	0.0977580	1.279950
MTARC1 /// MARCHF1	4.0769321	4.0711334	3.9101474	3.9812170	3.848762
SEPTIN1	3.1910485	3.0921967	3.5537766	3.6067284	2.568021
MARCHF10	1.2399741	1.3251488	1.3164786	1.4971469	1.575892
SEPTIN10	5.1328846	4.9732835	4.6974419	4.9577681	5.933738
MARCHF11	0.0000000	0.0000000	0.0000000	0.0497069	0.000000
SEPTIN11	7.2507001	7.3460711	7.3240757	7.1715569	7.979726
SEPTIN12	0.0000000	0.1379920	0.0400143	0.0000000	0.000000
SEPTIN14	0.0000000	0.0706452	0.0400143	0.0497069	0.000000
MTARC2 /// MARCHF2	3.6187871	3.6137342	3.4199159	3.5517260	3.951154

Expected number of cell types

print(readRDS("input_k_value.rds"))

## [1] 4

Cancer type

print(readRDS("cancer_type.rds"))

## [1] "brca"

Composition of the solution dataset (ground truth)

Source = in vitro mixtures

Number of expected cell types = 4

test_solution = readRDS("test_solution.rds")
length(test_solution)

## [1] 1

dim(test_solution[[1]])

## [1]  4 32
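A quick consistency check between the test data and the solution might look like the sketch below. It uses toy matrices (`toy_data`, `toy_solution`, both hypothetical) in place of `test_data[[1]]` and `test_solution[[1]]`. It also assumes that the solution holds one proportion per cell type and sample, with each column summing to 1, which the factsheet does not state explicitly.

```r
# Toy stand-ins with the same orientation as the real objects:
# genes x samples for the expression matrix, cell types x samples for the solution
set.seed(1)
n_samples = 32
toy_data = matrix(runif(10 * n_samples), nrow = 10)
toy_solution = matrix(runif(4 * n_samples), nrow = 4)

# Assumption: rescale so that each column holds proportions summing to 1
toy_solution = sweep(toy_solution, 2, colSums(toy_solution), "/")

# Both objects should describe the same number of samples
stopifnot(ncol(toy_data) == ncol(toy_solution))

# The number of rows should match the expected number of cell types (4)
stopifnot(nrow(toy_solution) == 4)

# Assumed property: the proportions of each sample sum to 1
stopifnot(isTRUE(all.equal(colSums(toy_solution), rep(1, n_samples))))
```

To run the same checks on the real files, the toy matrices would be replaced with `test_data[[1]]` and `test_solution[[1]]`; the last check only applies if the solution is indeed expressed as proportions.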

¹ https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE123604