Description of the dataset

Five cell populations present in pancreatic tumors were considered. The dataset is described in the deconbench project and published on bioRxiv.

Raw transcriptome and methylome profiles of these different cell populations were extracted from various sources (PDX model, tissues or isolated cells).

Cell-type proportions were simulated in silico from a Dirichlet distribution, based on realistic proportions defined with the expertise of an anatomopathologist (Jerome Cros).
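As an illustration of the sampling step described above, here is a minimal sketch that draws one proportion vector per sample from a Dirichlet distribution using normalized gamma draws. The concentration parameters in `alpha` are hypothetical placeholders: the values actually derived from the anatomopathologist's proportions are not given in this document.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one proportion vector from Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical concentration parameters for K = 5 cell types (assumption:
# the real values are not stated in the dataset description).
alpha = [4.0, 3.0, 1.5, 1.0, 0.5]
n_samples = 30  # N = 30 samples, as in the dataset description

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# One Dirichlet draw per sample, then transpose to a K x N layout:
# A[k][n] is the proportion of cell type k in sample n.
columns = [sample_dirichlet(alpha, rng) for _ in range(n_samples)]
A = [[columns[n][k] for n in range(n_samples)] for k in range(len(alpha))]
```

Each column of `A` sums to 1 by construction, which is the property expected of a cell-type proportion matrix.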

The transcriptomes of the in silico mixtures of pancreatic tumors were obtained as \(D = T A\), where \(T\) is the matrix of cell-type profiles (size \(M \times K\), with \(M\) the number of features and \(K = 5\) the number of cell types) and \(A\) is the matrix of cell-type proportions per patient (size \(K \times N\), with \(N = 30\) the number of samples). The proportion matrix \(A\) is common to both omics.
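The mixing model \(D = T A\) is a plain matrix product. The sketch below works through it on a toy example with made-up values (3 features, 2 cell types, 2 samples) rather than the real \(21566 \times 5\) profiles. The helper `matmul` is written here for illustration only.

```python
def matmul(T, A):
    """Multiply an M x K matrix by a K x N matrix, both given as lists of rows."""
    K = len(A)
    if any(len(row) != K for row in T):
        raise ValueError("inner dimensions of T and A must match")
    N = len(A[0])
    return [[sum(row[k] * A[k][n] for k in range(K)) for n in range(N)]
            for row in T]

# Toy cell-type profiles T (M = 3 features, K = 2 cell types), made-up values.
T = [[10.0, 0.0],
     [2.0, 8.0],
     [0.0, 4.0]]

# Toy proportions A (K = 2 cell types, N = 2 samples); each column sums to 1.
A = [[0.75, 0.25],
     [0.25, 0.75]]

# Mixed expression D (M x N): each sample is a proportion-weighted
# combination of the cell-type profiles.
D = matmul(T, A)
print(D)  # [[7.5, 2.5], [3.5, 6.5], [1.0, 3.0]]
```

With the dimensions of this dataset, the same product would give a \(21566 \times 30\) matrix, one column per simulated sample.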

Omics_type = transcriptome

Cancer_type = paad

Cohort_size = 30

Patient_metadata = No

Sample_type = In silico mixture of cell lines, PDX derived cells and FFPE tissues

Preparation of the data

Raw cell-type profile matrices were preprocessed (feature filtering, normalization, signal transformation, sample aggregation) to avoid batch effects.

Feature filtering = selection of protein coding genes (hg38)

Normalisation = edgeR

Transformation = log2(x + 1) (pseudo-log2)

Aggregation = median
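The filtering, transformation and aggregation steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline actually used: the edgeR normalization is not reproduced (inputs are assumed to be already normalized), the gene list and expression values are made up, and the transformation is applied before the median, following the order in which the steps are listed.

```python
import math
from statistics import median

def pseudo_log2(x):
    """log2(x + 1), so that a zero value maps to zero."""
    return math.log2(x + 1.0)

def aggregate_profile(replicates, keep_genes):
    """Build one cell-type profile: filter genes, transform, take per-gene median.

    replicates: list of dicts mapping gene symbol -> normalized expression.
    keep_genes: gene symbols to retain (e.g. protein-coding genes).
    """
    profile = {}
    for gene in keep_genes:
        values = [pseudo_log2(rep[gene]) for rep in replicates if gene in rep]
        if values:
            profile[gene] = median(values)
    return profile

# Made-up normalized values for three replicates of one cell type.
replicates = [
    {"TSPAN6": 15.0, "DPM1": 31.0, "MIR21": 100.0},
    {"TSPAN6": 7.0, "DPM1": 31.0, "MIR21": 90.0},
    {"TSPAN6": 3.0, "DPM1": 63.0, "MIR21": 80.0},
]
protein_coding = ["TSPAN6", "DPM1"]  # hypothetical filter list

profile = aggregate_profile(replicates, protein_coding)
print(profile)  # {'TSPAN6': 3.0, 'DPM1': 5.0}
```

The non-coding entry `MIR21` is dropped by the filter, and each retained gene ends up with a single value per cell type.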

Composition of the test dataset

Transcriptome dataset

## [1] 5
## [1] 21566    30
          sample_1   sample_2   sample_3   sample_4   sample_5
TSPAN6    4.5341296  4.6452434  4.4611777  4.6936323  4.7787183
TNMD      0.0175608  0.0573812  0.0215422  0.0022801  0.0692128
DPM1      5.0149146  5.3950986  5.1285956  5.1316449  5.3757923
SCYL3     4.2729605  3.6416795  3.7765321  3.8567533  3.9413723
C1orf112  3.4707383  3.1002907  2.9295319  2.6556491  3.4656401
FGR       1.2251029  1.9415766  2.3156956  1.3959612  0.5588283
CFH       4.6528447  5.1692813  5.0425851  4.9694335  5.0464563
FUCA2     5.9950538  5.9406127  5.8277486  5.8282870  6.1412137
GCLC      4.9321188  4.7676211  4.8162724  4.6943392  4.7133185
NFYA      4.7740554  4.6811932  4.7117780  4.6565642  4.7368636

Expected number of cell types

## [1] 5

Cancer type

## [1] "paad"

Composition of the solution dataset (ground truth)

Source = in silico simulations

Number of expected cell types = 5

Five independent proportion matrices and the corresponding complex expression matrices were generated to score algorithm performance.

## [1] 5
## [1]  5 30
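This document does not state which metric is used for scoring. As a hypothetical example only, the sketch below compares an estimated proportion matrix with a ground-truth one using the mean absolute error. The function name, the toy matrices and the choice of metric are all assumptions. It also assumes that both matrices list the cell types in the same row order.

```python
def mean_absolute_error(A_true, A_pred):
    """Average |true - predicted| over all entries of two K x N matrices."""
    same_rows = len(A_true) == len(A_pred)
    if not same_rows or any(len(r) != len(p) for r, p in zip(A_true, A_pred)):
        raise ValueError("matrices must have the same shape")
    diffs = [abs(t - p)
             for true_row, pred_row in zip(A_true, A_pred)
             for t, p in zip(true_row, pred_row)]
    return sum(diffs) / len(diffs)

# Toy 2 x 2 matrices with made-up values (the real ones are 5 x 30).
A_true = [[0.75, 0.25],
          [0.25, 0.75]]
A_pred = [[0.5, 0.25],
          [0.5, 0.75]]

score = mean_absolute_error(A_true, A_pred)
print(score)  # 0.125
```

A lower value means the estimated proportions are closer to the ground truth. With five ground-truth matrices, such a score could be computed once per matrix.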