Dataset Factsheet

Description of the dataset

Five different cell populations present in pancreatic tumors were considered. The dataset is described in the deconbench project¹ and published on bioRxiv.

Raw transcriptome and methylome profiles of these cell populations were extracted from various sources (PDX models, tissues, or isolated cells).

Cell-type proportions were drawn in silico from a Dirichlet distribution, based on realistic proportions defined by the expertise of an anatomopathologist (Jerome Cros).

The transcriptomes of in silico mixtures from pancreatic tumors were obtained as \(D = T A\), where \(T\) holds the cell-type profiles (matrix of size \(M \times K\), with \(M\) the number of features and \(K=5\) the number of cell types) and \(A\) holds the cell-type proportions per patient (matrix of size \(K \times N\), with \(N=30\) the number of samples). The same proportion matrix \(A\) is shared by both omics.
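The mixing model can be illustrated with a minimal base-R sketch. It is not the pipeline used to build this dataset: the concentration parameters `alpha`, the reduced feature count `M = 100`, and the random profiles in `T_mat` are all made-up assumptions, and the helper `rdirichlet_cols` is a hypothetical name. Dirichlet draws are obtained by normalizing independent Gamma draws, which avoids any extra package.

```r
set.seed(1)  # reproducible draws

M <- 100  # number of features (assumption; the real dataset has 21566)
K <- 5    # number of cell types, as in the factsheet
N <- 30   # number of samples, as in the factsheet

# Hypothetical Dirichlet concentration parameters, one per cell type.
alpha <- c(4, 3, 2, 0.5, 0.5)

# Draw n Dirichlet vectors as columns: normalize Gamma(alpha_k, 1) draws.
rdirichlet_cols <- function(n, alpha) {
  g <- matrix(rgamma(n * length(alpha), shape = alpha),
              nrow = length(alpha))
  sweep(g, 2, colSums(g), "/")
}

A <- rdirichlet_cols(N, alpha)                   # K x N proportions
# Named T_mat because T is shorthand for TRUE in R.
T_mat <- matrix(rexp(M * K, rate = 0.1), M, K)   # M x K made-up profiles
D <- T_mat %*% A                                 # M x N in silico mixtures

dim(D)
range(colSums(A))  # each column of A sums to 1
```

Each column of `A` is one simulated patient, so each column of `D` is a convex combination of the five profiles in `T_mat`.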

Omics_type = transcriptome

Cancer_type = paad

Cohort_size = 30

Patient_metadata = No

Sample_type = In silico mixture of cell lines, PDX-derived cells and FFPE tissues

Preparation of the data

Raw cell-type profile matrices were preprocessed (feature filtering, normalization, signal transformation, sample aggregation) to limit batch effects.

Feature filtering = selection of protein coding genes (hg38)

Normalisation = edgeR

Transformation = log2(x + 1) (pseudo-log2)

Aggregation = median
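As a rough illustration of how these steps chain together for a single cell type, here is a base-R sketch on toy data. The count matrix `raw` and the `is_protein_coding` flags are invented for the example, and the edgeR normalization step is deliberately left out because the exact edgeR settings are not given in this factsheet.

```r
set.seed(2)

# Hypothetical raw counts: 6 genes x 4 replicate samples of one cell type.
raw <- matrix(rpois(24, lambda = 50), nrow = 6,
              dimnames = list(paste0("gene_", 1:6), paste0("rep_", 1:4)))

# Hypothetical annotation: which genes are protein coding.
is_protein_coding <- c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)

# 1. Feature filtering: keep protein-coding genes only.
filtered <- raw[is_protein_coding, , drop = FALSE]

# 2. Normalization: edgeR would be applied here (omitted in this sketch).

# 3. Transformation: pseudo-log2, i.e. log2(x + 1).
transformed <- log2(filtered + 1)

# 4. Aggregation: per-gene median across replicates gives one profile.
profile <- apply(transformed, 1, median)

profile
```

The result is one named vector per cell type; binding the five vectors column-wise would give a profile matrix of the shape described for \(T\).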

Composition of the test dataset

Transcriptome dataset

test_data = readRDS("test_data.rds")
length(test_data)

## [1] 5

dim(test_data[[1]])

## [1] 21566    30

colnames(test_data[[1]]) = paste0("sample_", seq_len(ncol(test_data[[1]])))
knitr::kable(head(test_data[[1]][,1:5], 10))

	sample_1	sample_2	sample_3	sample_4	sample_5
TSPAN6	4.5341296	4.6452434	4.4611777	4.6936323	4.7787183
TNMD	0.0175608	0.0573812	0.0215422	0.0022801	0.0692128
DPM1	5.0149146	5.3950986	5.1285956	5.1316449	5.3757923
SCYL3	4.2729605	3.6416795	3.7765321	3.8567533	3.9413723
C1orf112	3.4707383	3.1002907	2.9295319	2.6556491	3.4656401
FGR	1.2251029	1.9415766	2.3156956	1.3959612	0.5588283
CFH	4.6528447	5.1692813	5.0425851	4.9694335	5.0464563
FUCA2	5.9950538	5.9406127	5.8277486	5.8282870	6.1412137
GCLC	4.9321188	4.7676211	4.8162724	4.6943392	4.7133185
NFYA	4.7740554	4.6811932	4.7117780	4.6565642	4.7368636

Expected number of cell types

print(readRDS("input_k_value.rds"))

## [1] 5

Cancer type

print(readRDS("cancer_type.rds"))

## [1] "paad"

Composition of the solution dataset (ground truth)

Source = in silico simulations

Number of expected cell types = 5

Five independent proportion matrices and the corresponding complex expression matrices were generated to score algorithm performance.

test_solution = readRDS("test_solution.rds")
length(test_solution)

## [1] 5

dim(test_solution[[1]])

## [1]  5 30
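This factsheet does not state which scoring metrics are used. As one plausible illustration only, the sketch below compares an estimated proportion matrix with a ground-truth matrix using RMSE and Pearson correlation. The matrices `A_true` and `A_est` are simulated stand-ins, and `score_proportions` is a hypothetical helper, not part of the benchmark code.

```r
set.seed(3)
K <- 5
N <- 30

# Stand-in for one element of test_solution (K x N, columns sum to 1).
A_true <- matrix(runif(K * N), K, N)
A_true <- sweep(A_true, 2, colSums(A_true), "/")

# Stand-in for an algorithm's estimate: truth plus noise, renormalized.
A_est <- pmax(A_true + rnorm(K * N, sd = 0.02), 0)
A_est <- sweep(A_est, 2, colSums(A_est), "/")

# Hypothetical scoring helper: global RMSE and Pearson correlation.
score_proportions <- function(est, truth) {
  stopifnot(identical(dim(est), dim(truth)))
  c(rmse = sqrt(mean((est - truth)^2)),
    pearson = cor(as.vector(est), as.vector(truth)))
}

scores <- score_proportions(A_est, A_true)
scores
```

This assumes the rows of the estimate are already matched to the cell types of the ground truth; an unsupervised method would first need its components aligned to the reference rows.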

¹ https://cancer-heterogeneity.github.io/deconbench.html