Description of the dataset
Five different cell populations present in pancreatic tumors were considered. The dataset has been described in the deconbench project [1] and published on bioRxiv.
Raw transcriptome and methylome profiles of these different cell populations were extracted from various sources (PDX model, tissues or isolated cells).
Cell-type proportions were drawn in silico from a Dirichlet distribution, based on realistic proportions defined by the expertise of an anatomopathologist (Jerome Cros).
Transcriptomes of in silico mixtures from pancreatic tumors were obtained as \(D = T A\), with \(T\) the cell-type profiles (matrix of size \(M \times K\), with \(M\) the number of features and \(K=5\) the number of cell types) and \(A\) the cell-type proportions per patient (matrix of size \(K \times N\), with \(N=30\) the number of samples). The proportion matrix \(A\) is common to both omics.
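The mixing model above can be sketched in base R as follows. This is a minimal illustration, not the pipeline used to build the dataset: the reference profiles `T_ref` are random placeholders, the concentration vector `alpha` is an arbitrary assumption (the actual values came from pathologist expertise and are not given here), and `rdirichlet_cols` is a hypothetical helper that draws Dirichlet samples by normalizing independent gamma variates.

```r
set.seed(1)

M <- 100  # number of features (placeholder, far smaller than the real dataset)
K <- 5    # number of cell types
N <- 30   # number of samples

# Hypothetical helper: draw n Dirichlet(alpha) vectors, returned as a K x n matrix.
# Normalizing independent Gamma(alpha_k, 1) variates yields a Dirichlet sample.
rdirichlet_cols <- function(n, alpha) {
  g <- matrix(rgamma(n * length(alpha), shape = alpha), nrow = length(alpha))
  sweep(g, 2, colSums(g), "/")
}

alpha <- c(4, 3, 2, 1, 1)                         # assumed concentration parameters
A <- rdirichlet_cols(N, alpha)                    # K x N proportions, columns sum to 1
T_ref <- matrix(rexp(M * K), nrow = M, ncol = K)  # M x K placeholder profiles
D <- T_ref %*% A                                  # M x N simulated mixtures
```

Because each column of `A` sums to 1, every simulated sample in `D` is a convex combination of the columns of `T_ref`.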
Omics_type
= transcriptome
Cancer_type
= paad
Cohort_size
= 30
Patient_metadata
= No
Sample_type
= In silico mixture of cell lines, PDX-derived cells and FFPE tissues
Preparation of the data
Raw cell-type profile matrices were preprocessed (feature filtering, normalization, signal transformation, sample aggregation) to limit batch effects.
Feature filtering
= selection of protein coding genes (hg38)
Normalisation
= edgeR
Transformation
= log2(x + 1) (pseudo-log2)
Aggregation
= median
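As a rough illustration of the steps listed above, the sketch below uses base R only. The count matrix `raw`, the replicate labels `cell_type` and the `protein_coding` gene vector are made-up placeholders, and a simple counts-per-million scaling stands in for the edgeR normalization, which is not reproduced here. The order of the steps is also an assumption.

```r
set.seed(1)

# Placeholder raw counts: 6 genes x 6 replicate profiles, two replicates per cell type
genes <- paste0("gene_", 1:6)
raw <- matrix(rpois(36, lambda = 50), nrow = 6,
              dimnames = list(genes, paste0("rep_", 1:6)))
cell_type <- rep(c("type_A", "type_B", "type_C"), each = 2)

# Feature filtering: keep genes flagged as protein coding (hypothetical list)
protein_coding <- genes[1:4]
filtered <- raw[rownames(raw) %in% protein_coding, , drop = FALSE]

# Normalization stand-in: counts per million (the source used edgeR instead)
cpm <- sweep(filtered, 2, colSums(filtered), "/") * 1e6

# Transformation: pseudo-log2, i.e. log2(x + 1)
logged <- log2(cpm + 1)

# Aggregation: per-gene median across the replicates of each cell type
profiles <- sapply(unique(cell_type), function(ct) {
  apply(logged[, cell_type == ct, drop = FALSE], 1, median)
})
```

The result `profiles` has one row per retained gene and one column per cell type, which is the shape expected for the reference matrix \(T\).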
Composition of the test dataset
Transcriptome dataset
## [1] 5
## [1] 21566 30
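# Name the columns sample_1 ... sample_N, then display the first 10 genes of the first 5 samples
colnames(test_data[[1]]) <- paste0("sample_", seq_len(ncol(test_data[[1]])))
knitr::kable(head(test_data[[1]][, 1:5], 10))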
Gene | sample_1 | sample_2 | sample_3 | sample_4 | sample_5 |
---|---|---|---|---|---|
TSPAN6 | 4.5341296 | 4.6452434 | 4.4611777 | 4.6936323 | 4.7787183 |
TNMD | 0.0175608 | 0.0573812 | 0.0215422 | 0.0022801 | 0.0692128 |
DPM1 | 5.0149146 | 5.3950986 | 5.1285956 | 5.1316449 | 5.3757923 |
SCYL3 | 4.2729605 | 3.6416795 | 3.7765321 | 3.8567533 | 3.9413723 |
C1orf112 | 3.4707383 | 3.1002907 | 2.9295319 | 2.6556491 | 3.4656401 |
FGR | 1.2251029 | 1.9415766 | 2.3156956 | 1.3959612 | 0.5588283 |
CFH | 4.6528447 | 5.1692813 | 5.0425851 | 4.9694335 | 5.0464563 |
FUCA2 | 5.9950538 | 5.9406127 | 5.8277486 | 5.8282870 | 6.1412137 |
GCLC | 4.9321188 | 4.7676211 | 4.8162724 | 4.6943392 | 4.7133185 |
NFYA | 4.7740554 | 4.6811932 | 4.7117780 | 4.6565642 | 4.7368636 |
Composition of the solution dataset (ground truth)
Source
= in silico simulations
Number of expected cell types
= 5
Five independent proportion matrices and the corresponding complex expression matrices have been generated to score algorithm performance.
## [1] 5
## [1] 5 30
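One way an estimated proportion matrix could be compared with a ground-truth matrix of this shape (5 cell types by 30 samples) is sketched below. The metrics, root-mean-square error and per-cell-type Pearson correlation, are assumptions chosen for illustration and are not stated to be the benchmark's actual scoring. `A_true` and `A_est` are simulated placeholders.

```r
set.seed(1)
K <- 5   # number of cell types
N <- 30  # number of samples

# Placeholder ground truth: random proportions, each column sums to 1
A_true <- matrix(runif(K * N), nrow = K)
A_true <- sweep(A_true, 2, colSums(A_true), "/")

# Placeholder estimate: ground truth plus noise, clipped at 0 and re-normalized
A_est <- A_true + matrix(rnorm(K * N, sd = 0.02), nrow = K)
A_est[A_est < 0] <- 0
A_est <- sweep(A_est, 2, colSums(A_est), "/")

# Global error over all entries, and agreement across samples for each cell type
rmse <- sqrt(mean((A_est - A_true)^2))
cor_per_type <- sapply(seq_len(K), function(k) cor(A_est[k, ], A_true[k, ]))
```

With five ground-truth matrices, the same comparison would be repeated for each one and the results summarized across them.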