Explainable multi-omics deep clustering reveals an important role of DNA methylation in pancreatic ductal adenocarcinoma

Patients with pancreatic ductal adenocarcinoma (PDAC) have the lowest survival rate among all cancer patients in Europe. Since western societies have the highest incidence of pancreatic cancer, it has been projected that PDAC will soon become the second leading cause of cancer-related deaths. The main challenge of PDAC treatment is that patients with similar somatic genotypes exhibit a wide range of disease phenotypes. Artificial Intelligence (AI) is currently transforming the field of healthcare and represents a promising technology for integrating various datasets and optimizing evidence-based decision making. However, the interpretability of most AI models is limited and it is challenging to understand how and why a decision is made. In this study, we developed a deep clustering model for PDAC patient stratification using integrated methylation and gene expression data. We placed a specific emphasis on model explainability, with the aim of understanding hidden multi-modal patterns learned by the model. The model resulted in two subgroups of PDAC patients with different prognoses and biological factors. We performed several follow-up analyses to measure the relative contribution of each modality to the clustering solution. This multi-omics profile analysis revealed an important role of DNA methylation, partially supported by previous experimental studies. We also show how the model learned the underlying patterns in a multi-modal setting, where individual hidden neurons are specialized either in single data modalities or their combinations. We hope this study will help to promote more explainable AI in real-world clinical applications, where the knowledge of the decision factors is crucial. The code of this project is publicly available on GitHub (https://github.com/albertolzs/edc_mo_pdac).


Introduction
PDAC is an extremely lethal and aggressive disease that accounts for 85% of all pancreatic cancers. In 2020, 495,773 new cases and 466,003 deaths worldwide were related to pancreatic cancer, corresponding to 2.6% of all new cancer diagnoses and 4.7% of all cancer deaths, respectively [1]. PDAC ranks as the third-leading cause of cancer-related deaths in the United States and sixth in China [2,3]. Experts predict that, by 2030, PDAC will reach second place in terms of cancer mortality [4]. Once the tumor appears, it tends to spread aggressively, leading to poor responses to treatments [5]. In fact, the five-year survival rate in the United States is 11%, which is the lowest among all cancers.
The poor prognosis of PDAC is primarily due to the lack of clear and distinctive symptoms, as well as reliable biomarkers for early detection, with only about 20% of patients being diagnosed at an early stage [6]. Genetic and molecular alterations, such as mutations in KRAS [7,8], CDKN2A [9], SMAD4 [7,10,9], and TP53 [7,10], have been shown to be recurrent in PDAC. It has also been found that DNA methylation can play a key role in PDAC progression, given the different methylation patterns between PDAC and normal tissue and even among PDAC subtypes [11,7,12,13,14]. A good example is the hypermethylation of CpG islands in the promoter regions of tumor suppressor genes [15].
Unfortunately, despite advancements in pancreatic cancer research over the past few decades, there has been little improvement in the ratio of deaths to new cases [2,16]. Given these challenges, there is an urgent need to gain a deeper understanding of the molecular characteristics involved in PDAC development. There is a lack of systematic approaches to study the relative contribution of omics modalities, such as gene expression and methylation changes, to disease progression and patient stratification. This knowledge is crucial for the development of innovative diagnostic and therapeutic strategies for treating this aggressive disease.
Numerous researchers are currently working on developing and applying AI in the field of oncology and healthcare [17]. AI offers greater adaptability and accuracy than traditional methods, provided large enough training data are available, hence enabling its use in various biomedical applications. One such task is patient stratification, in which patients are grouped into subtypes with distinct biological or clinical phenotypes. The current definition of PDAC subtypes is based only on histology and genotype [18]. We argue that a wider molecular perspective would enable a more personalized diagnosis and treatment, and thereby more effective and safe therapeutic regimens.
To date, most PDAC patient stratification approaches have been based on single-omics analysis. Bailey et al. defined four subtypes through a gene expression analysis [8]. Three subgroups of pancreatic cancer patients were found by Mishra et al., based on a DNA methylation analysis [19]. However, analyzing each type of omics data separately may yield inconsistent results [20], and multi-omics studies have been shown to outperform single data type analysis [21]. This is because each omics data type carries distinct information and is associated with different mechanisms related to oncogenesis and tumor development. To further enhance our understanding of the disease biology, methods that can identify cohesive subtypes using multiple omics data sources are required.
Deep Learning (DL) has significantly impacted various fields of machine learning and artificial intelligence [22,23].
It has brought transformational changes and gained popularity in many areas, mostly related to supervised prediction tasks. The success of supervised DL has also motivated the development of DL-based unsupervised clustering algorithms [24]. These clustering approaches leverage the capabilities of DL to preprocess high-dimensional data, thereby improving the quality of the clustering results and the possibility of obtaining biomedically more meaningful clusters [25].
However, despite its undeniable potential, AI, and especially DL, struggles to fulfill its promise in the medical domain. DL methods are perceived as black-box approaches due to the limited understanding of how they work internally, and this lack of explainability has been criticized [26]. Thus, eXplainable Artificial Intelligence (XAI) is an emerging research topic that focuses on opening the black box of how AI systems make their decisions [27]. Specifically in medicine, explainability can reveal new insights and shared characteristics among the most important biomarkers (or features), and this information can contribute to recommending more personalized treatments. As an example of an XAI application to feature importance analysis, Sokač et al. used integrated gradients to assign an attribution score to each feature for predicting metastasis, patient age, chromosomal instability, cancer type, and loss of TP53 [28]. Similarly, Withnell et al. studied the contribution of each gene and latent dimension for both supervised and unsupervised tasks [29].
On the other hand, quantifying the relative contribution of each omics data type in multi-modal analysis is a more recent development. To the best of our knowledge, this was done for the first time in 2022, when Lee et al. analyzed the contribution ratio of gene expression, somatic mutation, and copy number data in cancer drug response prediction [30]. Later, Hauptmann et al. computed the relative importance of various omics (gene expression, mutation and copy number) via attribution methods for drug response prediction in cancer [31]. However, there have been no studies of omics contributions to unsupervised patient stratification.
In this study, we developed a patient stratification model using a DL-based clustering algorithm with multi-omics data (here, RNA-seq and DNA methylation data). We used a publicly available PDAC data cohort to investigate the clustering solution in terms of model explainability and biological interpretation. By understanding the underlying multi-omics patterns learned by the model, we identified patient risk groups that can be used to stratify patients into prognostically different PDAC subtypes.
The key contributions of our work are:
• The multi-omics profile analysis revealed an important role of DNA methylation in PDAC. There are no previous studies that have investigated the relative contribution of different omics in an unsupervised DL model for patient stratification.
• We studied the relative contribution of each omics modality to individual neurons, and found that some hidden neurons specialized in one of the omics data types, whereas others were able to integrate both modalities. To the best of our knowledge, no previous studies have reported such findings.

Data
We obtained PAAD-TCGA data from the Firehose Broad GDAC using the R packages curatedTCGAData and TCGAutils [32].We analyzed RNA-seq data (Illumina HiSeq, upper quartile normalized RSEM TPM gene expression values) and DNA methylation data (Illumina Human Methylation 450).
We selected patients with both omics profiles, restricting the analysis to primary tumor samples of PDAC histological type. In total, 147 patients had both omics profiles.

Preprocessing
The omics data were preprocessed using a Scikit-learn pipeline [33]. See Figure S1 for the preprocessing pipelines for the RNA-seq and DNA methylation data. Similar preprocessing steps and thresholds have been used in other studies, although we set somewhat more stringent thresholds in the MAD filter because of the subsequent feature selection [35,36,37]. For the NMF-based feature selection, we followed a strategy similar to that of Carmona-Saez et al. [38]. This approach has been reported as an unsupervised way to choose important features [39,40]. The goal of applying NMF is to find two non-negative matrices whose product approximates the original matrix. The resulting weight matrix, with dimensions equal to the number of components times the number of final features, is composed of vectors called basis components. We first arranged the features of each NMF basis component in descending order of significance, and then selected the features with the most influence on each basis component. The specific number of features taken from each basis component was determined through hyperparameter optimization (explained below). To determine the number of NMF components, we experimented with different values (8, 16, 32, 64, 128, 256 and 512) and computed the reconstruction error (using beta-divergence). The more components are selected, the smaller the error becomes, but also the more computational resources are required. We found that 256 and 512 components achieved a reconstruction error below 1.0. Based on the preliminary results of our hyperparameter optimization, which aimed to find the best set of features, we finally used 512 components in our analysis.
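A minimal sketch of this NMF-based selection step, using a toy non-negative matrix and illustrative component/feature counts (the real pipeline applies it to the preprocessed omics matrices with 512 components):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(60, 200)))  # toy non-negative matrix: 60 samples x 200 features

k, top_m = 8, 5  # illustrative: 8 basis components, top-5 features per component
nmf = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)   # samples x components
H = nmf.components_        # components x features: rows are the basis components

# Rank the features of each basis component by weight and keep the most influential ones
selected = set()
for comp in H:
    top = np.argsort(comp)[::-1][:top_m]
    selected.update(top.tolist())

X_reduced = X[:, sorted(selected)]  # reduced feature matrix passed on to the model
```

In the actual pipeline, the number of features kept per basis component (here `top_m`) is the hyperparameter tuned during the optimization described below.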

Clustering algorithm
High-dimensional data, especially when combined with small patient cohorts, which is the typical case in omics profiling, limits the learning capacity of machine learning models. Therefore, it is usually recommended to reduce the data dimensionality before clustering; here, we used a deep autoencoder. A multi-omics DL-based clustering algorithm was built using a multi-view deep autoencoder (employing PyTorch [41] and Lightning [42]) and K-Means. A deep autoencoder (DAE) is an artificial neural network used for feature extraction by mapping the input data into a hidden representation. Thus, we reduce the dimensionality of the input data in a non-linear manner.
The layers that create the hidden representation (or embedding) are called the encoder, whereas the layers that try to reconstruct the original input from it are called the decoder. The branches of the encoder were concatenated before the latent representation. This kind of approach, which maps multi-omics data to a joint latent representation, is called middle integration. Each block was composed of a linear, a PReLU activation and a batch normalization layer. We used a joint loss function defined as the weighted (with a lambda coefficient) sum of the reconstruction error (mean absolute error, MAE) of the DAE and the sum of squared distances of the samples to their closest cluster center (inertia):

$$\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n}\lvert x_i - \hat{x}_i\rvert + \lambda\sum_{i=1}^{n}\lVert z_i - \mu_{c(i)}\rVert^2,$$

where $n$ is the number of samples, $x_i$ is an input sample, $\hat{x}_i$ is the corresponding reconstructed sample, $z_i$ is its embedding, $\mu_{c(i)}$ is the center of the cluster that contains the sample, and $\lambda$ is the coefficient that controls the trade-off between the autoencoder and clustering loss functions.
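A minimal PyTorch sketch of this two-branch architecture and joint loss, with illustrative layer sizes (the actual dimensions come from the hyperparameter optimization described below):

```python
import torch
import torch.nn as nn

def block(d_in, d_out):
    # Each block: a linear, a PReLU activation and a batch normalization layer
    return nn.Sequential(nn.Linear(d_in, d_out), nn.PReLU(), nn.BatchNorm1d(d_out))

class MultiViewAE(nn.Module):
    """Toy two-branch autoencoder with middle integration (illustrative dimensions)."""
    def __init__(self, d_rna=100, d_meth=100, d_latent=16):
        super().__init__()
        self.enc_rna = block(d_rna, 32)
        self.enc_meth = block(d_meth, 32)
        self.embed = nn.Linear(64, d_latent)  # branches concatenated before the latent layer
        self.dec = nn.Sequential(block(d_latent, 64), nn.Linear(64, d_rna + d_meth))

    def forward(self, x_rna, x_meth):
        z = self.embed(torch.cat([self.enc_rna(x_rna), self.enc_meth(x_meth)], dim=1))
        return z, self.dec(z)

def joint_loss(x, x_hat, z, centers, lam):
    """MAE reconstruction error plus lambda-weighted inertia in the embedding space."""
    recon = torch.mean(torch.abs(x - x_hat))
    d2 = torch.cdist(z, centers) ** 2           # squared distances to all cluster centers
    inertia = d2.min(dim=1).values.sum()        # distance of each sample to its closest center
    return recon + lam * inertia
```

Names such as `MultiViewAE` and `joint_loss` are our own for illustration; the real model additionally supports one or two blocks per branch, as selected by the optimization.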
Both the multi-view deep autoencoder and the cluster centers were initialized with a prior independent pretraining. This means that the deep autoencoder was first trained alone. Then, the embedding layer was used to obtain the centroids of a K-Means. These centroids were the initial cluster centers in the final training. This step enables starting the training with a better weight initialization and better initial cluster centroids. This is a common practice with small datasets [43], as the K-Means algorithm is very sensitive to initialization. The workflow of the algorithm is illustrated in Figure 1.
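The centroid initialization step can be sketched as follows (a toy embedding stands in for the output of the pretrained autoencoder):

```python
import numpy as np
from sklearn.cluster import KMeans

# Step 1 (pretraining) yields an embedding of the cohort; here a toy stand-in
Z = np.random.default_rng(1).normal(size=(147, 50))  # 147 patients x 50 latent units

# Step 2: fit K-Means on the embedding to obtain the initial cluster centers
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(Z)
init_centers = km.cluster_centers_  # used to seed the cluster centers in the joint training
```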
A hyperparameter optimization was carried out with Optuna [44], with the Silhouette score as the objective function. The Silhouette score is a widely employed clustering metric, computed for each sample as (b − a)/max(a, b), where a is the mean intra-cluster distance and b is the mean nearest-cluster distance, and averaged over all samples [45]. We executed the Tree-structured Parzen Estimator algorithm (a Bayesian optimization method) for 1225 trials, of which the first 1000 were a random search, using a 5×5 nested cross-validation [46]. To avoid data leakage, both the preprocessing pipeline and the clustering model were fitted within each cross-validation iteration.
The hyperparameters to optimize were: the number of hidden layers (one or two before merging the branches of each modality), the number of neurons (proportional to the previous layer), the latent space dimensionality, the number of input features (from 1 to 10 individual features for each NMF component), the number of epochs for both pretraining and training (20-100), the lambda coefficient that controls the joint loss function (0.001-1), and the number of clusters (2-6). The fANOVA hyperparameter importance evaluation algorithm was used to assign a score to each hyperparameter [47]. In addition, the learning rate was automatically determined using a learning rate finder strategy [48]. The Wilcoxon test was used to evaluate significant differences between training and validation, and between validation and testing. After the hyperparameter optimization, once we knew the expected error and that the model was not overfitted, we trained our model using all the samples. As a last step, the hard clustering assignments were transformed into soft assignments by computing the probability of a sample belonging to a cluster with inverse distance weighting.
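The final hard-to-soft conversion by inverse distance weighting can be sketched as follows (a minimal implementation; the `eps` guard against division by zero is our addition):

```python
import numpy as np

def soft_assignments(Z, centers, eps=1e-12):
    """Convert hard cluster labels to membership probabilities by inverse distance weighting.

    Z: samples x latent dims embedding; centers: clusters x latent dims.
    Returns a samples x clusters matrix of probabilities that sum to 1 per sample.
    """
    # Euclidean distance of every sample to every cluster center
    d = np.linalg.norm(Z[:, None, :] - centers[None, :, :], axis=-1)
    w = 1.0 / (d + eps)                       # closer centers get larger weights
    return w / w.sum(axis=1, keepdims=True)   # normalize to probabilities
```

A sample lying exactly on a centroid receives a membership probability close to 1 for that cluster, recovering the hard assignment as a limiting case.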

Statistical analysis
Although the model training was unsupervised, meaning that no class label information was used when training the model, the TCGA dataset contains some additional clinical labels. Therefore, to evaluate whether the clusters were biologically or clinically meaningful, we applied a survival analysis and an enrichment analysis of clinical labels as post-hoc analyses. For the survival analysis, the log-rank test (lifelines Python package [49]) was used to indicate statistically significant differences in survival profiles between patient groups.
A Cox proportional hazards model was also fitted to compute the risk coefficients. For the enrichment analysis of clinical labels (using the SciPy package [50]), we applied the Kruskal-Wallis test on the age at initial diagnosis, as well as the chi-square test on sex and three discrete clinicopathological parameters quantifying the progression of the tumor (tumor stage), cancer in lymph nodes (neoplasm histologic grade) and metastases (metastasis stage).
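For reference, the two-sample log-rank statistic used in this survival comparison can be computed directly; the sketch below is a minimal NumPy/SciPy implementation equivalent in spirit to the lifelines call, with illustrative toy survival times:

```python
import numpy as np
from scipy.stats import chi2

def logrank_test(time_a, event_a, time_b, event_b):
    """Two-sample log-rank test: chi-square statistic and p-value (df = 1)."""
    times = np.concatenate([time_a, time_b])
    events = np.concatenate([event_a, event_b]).astype(bool)
    groups = np.concatenate([np.zeros(len(time_a)), np.ones(len(time_b))])
    o_minus_e, var = 0.0, 0.0
    for t in np.unique(times[events]):          # iterate over distinct event times
        at_risk = times >= t
        n = at_risk.sum()                       # total subjects still at risk
        n1 = (at_risk & (groups == 0)).sum()    # at risk in group A
        d = (events & (times == t)).sum()       # events at time t
        d1 = (events & (times == t) & (groups == 0)).sum()
        e1 = d * n1 / n                         # expected events in group A
        o_minus_e += d1 - e1
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    stat = o_minus_e ** 2 / var
    return stat, chi2.sf(stat, df=1)
```

With clearly separated toy survival times, the statistic exceeds the 3.84 threshold for significance at the 0.05 level.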
The results were compared with standard unsupervised methods, including K-Means and hierarchical clustering [33], and with state-of-the-art multi-omics clustering algorithms: SNF [21,51], intNMF [52] and jNMF [53]. For the standard methods, the preprocessed matrices were concatenated; for intNMF and jNMF, z-score normalization was replaced by scaling each feature to a 0-1 range.

Multi-omics contribution analysis
To measure the contribution of each omics modality to the final clustering solution, we applied MM-SHAP, a performance-agnostic multi-view score based on Shapley values that quantifies the extent to which a multi-view model uses individual modalities [54]. We adapted the authors' code, originally developed for images and text, to tabular data and used Deep SHAP (SHAP library [55]) to compute SHAP (SHapley Additive exPlanations) values. We also used the Mann-Whitney U-test to assess whether there was a difference in the contributions between the clusters.
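The per-sample contribution ratio underlying this score can be sketched as follows (a hypothetical helper of our own; it assumes the SHAP values of the two modalities are concatenated column-wise, methylation features first):

```python
import numpy as np

def modality_contributions(shap_vals, split):
    """MM-SHAP-style contribution ratios: share of total |SHAP| mass per modality.

    shap_vals: samples x features array of SHAP values;
    split: column index where the methylation block ends and the RNA-seq block begins.
    """
    abs_vals = np.abs(shap_vals)
    meth = abs_vals[:, :split].sum(axis=1)   # attribution mass on methylation features
    rna = abs_vals[:, split:].sum(axis=1)    # attribution mass on RNA-seq features
    total = meth + rna
    return meth / total, rna / total         # per-sample contribution ratios
```

Averaging these per-sample ratios over the cohort yields the overall modality contributions reported in the Results.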

Feature importance analysis
We used post-hoc interpretation methods to assign an importance score to each input feature. Attribution methods determine, for a specific neuron, the contribution of each input feature to the output, i.e., the impact of a feature on the predictions. To compute the attributions with respect to the inputs of the model, we applied several algorithms available in Captum [56], with their respective default parameters. We used two gradient-based methods and one perturbation-based method [57]: integrated gradients, which calculates gradients along a straight-line path in the neural network from baseline inputs to the actual inputs [58]; gradient SHAP, which approximates SHAP values by computing the expected values of gradients × (inputs − baselines) [55]; and feature ablation, which replaces each input feature with a baseline and computes the difference in predictions. We chose zero-valued baselines, which correspond to the average of the features after the z-normalization. After computing the attributions of every feature in all samples using these three methods, we computed the mean of the absolute values across the methods and sorted the features. Thus, we were able to select the top-25 features, i.e., those with the biggest impact on the neural network. These features were analyzed with the Kruskal-Wallis test to identify potential biomarkers that could separate the samples in the clusters. False discovery rate (FDR) correction for multiple tests was applied.
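The aggregation across attribution methods can be sketched as follows (a hypothetical helper; each array in `attr_list` holds one method's attributions as a samples × features matrix):

```python
import numpy as np

def top_features(attr_list, k=25):
    """Average |attribution| per feature across methods and return the top-k feature indices.

    attr_list: list of samples x features attribution arrays, one per method
               (e.g. integrated gradients, gradient SHAP, feature ablation).
    """
    # Mean absolute attribution per feature, one row per method
    per_method = np.stack([np.abs(a).mean(axis=0) for a in attr_list])
    scores = per_method.mean(axis=0)            # consensus score across methods
    order = np.argsort(scores)[::-1][:k]        # indices of the k highest-scoring features
    return order, scores[order]
```

Averaging across methods hedges against the disagreement between attribution algorithms discussed in the Results.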

Neuron importance analysis
In addition to evaluating the importance of the inputs, we also analyzed the layers of the model to understand which neurons appeared to be most important. The conductance can be used to evaluate the importance of a hidden neuron for a specific prediction [59,60]. The conductance of a hidden neuron is the flow of (integrated gradients) attributions through this hidden neuron, i.e., it represents how active a neuron is. We computed the layer conductance with respect to the embedding layer of the model to assess neuron importance. The neuron conductance of the 3 most important neurons, based on this method, was studied to identify the biomarkers that activate these neurons the most. We followed the same logic as in the feature importance analysis, studying the top-25 features for each neuron.

Hyperparameter optimization and model training
The hyperparameter optimization results are shown in Figure 2a and Figure S2.
We can conclude that the most important hyperparameter is the number of clusters (Figure S2b), and two is the optimal number of clusters based on the Silhouette score. The second most important hyperparameter was the number of epochs, for which the minimum possible value of 20 was the best. The other hyperparameters had a smaller influence on the Silhouette score, and led to the following selections: 50 units in the embedding layer, 5120 features from each data modality (10240 in total), and 0.014 as the lambda coefficient. The final model was composed of 5 hidden layers and had approximately 72.4M parameters.
Figure 2 shows the decomposed autoencoder and clustering loss functions, the joint loss function and the Silhouette score, which was used as an independent evaluation metric for the clustering solution. For each of these metrics, we used the results obtained in the hyperparameter optimization process to establish a baseline for comparisons (dotted red lines): the worst autoencoder found in the nested cross-validation process; the average of the Silhouette scores among the nested cross-validation folds when the number of clusters was two; the highest value of the joint loss function when the lambda coefficient was within ±0.02 of the best lambda coefficient; and the average Silhouette score obtained during the optimization process. These results indicate that the clustering model outperformed a random clustering solution, and that the best solution was not overfitted to the training data; even though there is a significant difference between training and validation data in the autoencoder loss function, the Silhouette score shows similar values across the training, validation and test datasets. The inertia (clustering loss function) was lower in the validation and test datasets than in the training data (Figure 2c). This may be due to the relatively small dataset and the fact that this metric computes distances across the samples, making it highly sensitive to outliers.

Survival analysis and clinical associations
The model identified two clusters: cluster 0 with 97 samples, and cluster 1 with 50 samples. The log-rank test revealed that the patients in these two clusters have significantly different survival probabilities (Figure 3). The coefficient of the Cox proportional hazards model corresponded to a hazard ratio of 1.75 (95% confidence interval: 1.24-2.49), meaning that patients in cluster 0 have a 75% higher risk compared to cluster 1; that is, this cluster of patients corresponds to a more aggressive disease. No other significant associations with clinical parameters were found, indicating that these clusters are based on novel biology. This could be used for patient stratification in future studies.
Hierarchical clustering and K-Means, as standard clustering methods, and more advanced multi-omics clustering algorithms, including SNF, intNMF and jNMF, were applied as reference comparison methods. The methodology from the original SNF paper also returned two as the optimal number of clusters. Thus, to enable direct comparisons, we used two clusters in each of the methods. The clusters found by the multi-omics clustering algorithms were associated only with sex, while none of the reference methods yielded clusters with a significant association with the remaining clinical parameters, including overall survival (Table 1).

Evaluation of multi-omics contributions
To better understand the disease biology of PDAC, we were interested in determining how gene expression and DNA methylation patterns affect the clustering solutions. The relative contribution of each modality across the samples was surprisingly homogeneous, as shown in Figure 4a. On average, the DNA methylation patterns contribute 71% to the final predictions, while the remaining 29% originates from the RNA-seq profiles. This demonstrates the important role of DNA methylation in PDAC patient stratification. We further studied whether there was a difference in the contributions between the two clusters (Figure 4b). The Mann-Whitney U test returned a non-significant p-value. In addition, we did not find any difference in the DNA methylation or gene expression value distributions between the two clusters (Figure 4c and d). This could mean that it is not the individual features themselves that determine the prognosis, but rather the interaction between the entities (gene expression and methylation features) that is more associated with the phenotype. Thus, relationships between two or more variables are not simply additive, meaning that the combined effects play a significant role in cancer aggressiveness.

Feature importance analysis
We next identified the most important features for the model. A visualization of the feature scores of the top-25 features is shown in Figure 5. The figure also compares the importance scores with the average learned model weights of the first layer (light blue bars). In linear models, the estimated values of the model weights provide insights into the relationships between clusters and features. A weight of zero would indicate no correlation, positive weights would suggest positive correlations, and negative weights would imply negative correlations between clusters and features. However, because of the multiple layers in the network, these weights may have more complex connections with the cluster labels.
One of the main observations is that all of the most important features are methylation features. This again indicates the importance of DNA methylation in the clustering solution and patient stratification. Second, the attribution methods disagree on assigning importance scores. In fact, we can observe that, while integrated gradients and feature ablation look more similar, gradient SHAP behaves differently. This is because gradient SHAP is calculated with baselines drawn from the training distribution, whereas the other methods used a reference baseline of zero. In general, there is no consensus on the best feature importance method [61,62]. We tried to address this problem by using the average of the three methods to evaluate feature importance. Furthermore, the importance metrics are not correlated with the weights, perhaps because of the multiple layers in the network. In fact, some of the neurons showed a strong negative weight, in contrast to the positive feature scores provided by the attribution methods. There was no significant difference in the values of the top-25 features between the clusters (Figure 5b). These top-25 features were not highly correlated. Most values in the correlation matrix had absolute values lower than 0.8, with only a few of them showing a strong positive correlation (Figure 6). This suggests the model learned relatively non-redundant features.

Neuron importance analysis
As the function of the whole neural network is a combination of the different neurons in each layer working together to produce the predictions, we were also interested in understanding which neurons appear to be most important for the clustering solution. For this purpose, we analyzed the conductance of the latent representation layer with respect to the inputs for cluster 0, which showed a more aggressive disease based on Figure 3. Both the estimated aggregated weights and the attributions exhibit high fluctuations across the neurons (Figure 7a). Although there are some neurons whose aggregated weights are negative, such as 3 and 11, only neurons 1 and 47 had a negative influence in predicting cluster 0. This result suggests that neurons 10, 29 and 22 are the three most important neurons for the clustering solution. The distribution of the weights in these three neurons is plotted in Figure S3, and compared with neurons 12, 24 and 15, which seemed to be the least important ones. It can be concluded that the important neurons learned more positive weights, whereas the weights of the least important neurons were closer to zero.
After we identified the important neurons in the embedding layer, we computed the neuron attributions through the conductance, with the aim of understanding which input features activated the neurons. Figure S4 shows the most important features for the selected neurons. The most important neuron, 10, is activated mainly by DNA methylation features, whereas neuron 22 is activated mainly by RNA-seq features. By contrast, both modalities are more balanced for neuron 29. The influence of each data modality on every neuron in the embedding layer is shown in Figure 7b. Overall, methylation data contributes 54% on average, and RNA-seq the remaining 46%, to the embedding layer. When we take the neuron importance into account, the difference becomes bigger; for instance, methylation contributes 63% to the first eight neurons. In general, the most important neurons were highly activated by methylation patterns, while the less important neurons were more influenced by gene expression. We found that 8 out of the 50 neurons were activated by only a single modality: 6 by methylation and 2 by RNA-seq data. There were no large differences in the feature scores among the top-25 features of any neuron (Figure S4). Similar to the earlier analyses to find possible biomarkers (Figure 5), the top features of the important neurons were not able to separate the two clusters when analyzed individually (Figure S5).
This indicates that single features, even if very important for the individual neurons, cannot be used for accurate patient stratification; rather multi-feature signatures are required for multi-modal patient clustering, and here DL-based algorithms are critically needed.

Discussion
We presented a deep clustering model based on multi-omics data to identify subgroups of PDAC patients. In this study, we analyzed RNA-seq and methylation data together, but the algorithm can also be applied to other high-dimensional omics data, even in low-sample patient cohorts. The model is composed of a multi-branch deep autoencoder and K-Means applied on the embedding layer. Our results indicated the beneficial effect of the deep autoencoder, as shown also in previous studies [63]. DL-based clustering has been shown to outperform other clustering algorithms in several use cases in the bioinformatics field, including bioimaging, cancer genomics and biomedical text mining [24]. However, in most of the previous works, architectures such as autoencoders or variational autoencoders have been applied either as a two-step model (training the DL part independently, followed by the clustering step) or on single-omics data alone [64,65,66].
In contrast, the objective of our algorithm was to optimize a joint loss function for both the autoencoder and the clustering. Interestingly, the best value for the lambda coefficient placed both the autoencoder and the clustering loss in a similar range, with the clustering loss having a greater impact. This type of deep clustering was applied to multi-omics data for the first time in 2022, in the context of cancer category recognition, survival analysis and clinical parameter enrichment in the pan-cancer dataset [67]. Its applicability and accuracy have also been reported on single-cell multi-omics data [43]. Therefore, it is likely that DL-based approaches for clustering biomedical data could make a big impact in future clinical applications.
Through an optimization process, we were able to select the best combination of hyperparameters for the model. Other objective functions could be considered in future studies, but we recommend also using independent cluster evaluation metrics, such as the Silhouette score, to study potential model overfitting. The hyperparameter optimization allowed us to find a model that avoids overfitting and to analyze the importance of each hyperparameter. The number of clusters was the most relevant hyperparameter for the optimal model. Most models in the nested cross-validation reached the best metrics when the number of clusters was two, which indicates that this is the robust number of subtypes in the current dataset. This value agrees with previous studies that analyzed multi-omics pancreatic cancer data from TCGA [20,68,69]. In contrast, Roy et al. found five subgroups [70]. A range of 2-5 clusters was found when other PDAC patient cohorts were analyzed [71,72,8,7,73,74,75]. The number of clusters depends on the omics used and the patient cohort, and therefore using multiple cohorts could provide more robust results.
Based on statistical testing, the model found clusters that were associated with overall survival, and one of the clusters (cluster 0) had a considerably worse prognosis, indicating a more aggressive disease cluster. In general, it is difficult to find clusters that are associated with multiple clinical parameters, as illustrated in earlier studies [76,35]. In those studies, most multi-omics models found only 1-2 significant associations with clinical parameters, even across multiple cancer types, when using a testing methodology similar to ours. This is because clustering algorithms try to find novel biological insights, not biased by known clinical associations. Overall survival is an objective clinical endpoint, along with age, whereas the other clinical parameters are more subjective. As a comparison, hierarchical clustering and K-Means did not obtain any significant associations with the clinical parameters, including overall survival, and the clusters found by SNF, intNMF and jNMF were only associated with sex. This highlights the impact of DL on patient stratification and the discovery of clusters with prognostic differences.
We combined transcriptomics and epigenomics data, which is the most common combination in multi-omics cancer studies [77]. Overall, transcriptomics is the most common omics modality in multi-omics studies of both cancer and non-cancer diseases. However, although gene expression has been considered to provide the highest predictive power, predictive models can be based on different omics profiles and cancer types, given that there is no universal best omics data type and the results depend on the cancer type [78]. Interestingly, we showed that the DNA methylation patterns were more important for our clustering solution. Both attribution methods, applied to the entire model, and neuron conductance, applied to specific units in the network, agreed on the important role of methylation data. It has been demonstrated that changes in the DNA methylation profile are a driving force of tumor progression in PDAC [79,15]. A recent study found DNA methylation to be the most predictive omics data type, among DNA methylation, mRNA and miRNA data, for accurately discriminating PDAC from chronic pancreatitis [80]. However, most studies that compare the performance of different omics data types have used single-omics analyses. To the best of our knowledge, this is the first study that analyzes the relative contributions of multiple omics types in an unsupervised DL model.
The contribution of each data modality to the neurons in the embedding layer showed that some units were activated by a single modality, whereas others were able to integrate different omics profiles. To the best of our knowledge, this is the first study reporting such a finding in AI. In biological neural networks, it is not clear how different areas of the brain connect and communicate with each other to integrate inputs from multiple modalities [81]. Still, there are terms like multimodal neurons or "concept cells", i.e., neurons that respond to clusters of abstract concepts centered around a common high-level term, rather than to any single modality (convergence) [82]. Similar concepts have also been reported in artificial neural networks [83]. Conversely, one modality could trigger another modality when the two are correlated (divergence). This could be a key to maintaining a balance between relevance (consistency) and independence (complementarity) across multiple data modalities. In bioinformatics, the analysis of the relative contribution of omics data across neurons is a novel topic and warrants more research in the future.
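The per-neuron modality analysis reduces to summing absolute attributions (e.g., from neuron conductance) over each modality's feature block and comparing the shares. A numpy sketch with hypothetical attribution values; the neuron count, feature counts and the 95% dominance threshold are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical |attribution| matrix: (n_neurons, n_features). The first
# 3000 columns stand in for methylation features, the rest for expression.
attr = np.abs(rng.normal(size=(50, 5000)))
n_meth = 3000

meth_share = attr[:, :n_meth].sum(axis=1)
expr_share = attr[:, n_meth:].sum(axis=1)
frac_meth = meth_share / (meth_share + expr_share)

# Neurons dominated (>95%) by a single modality vs. those integrating both.
single_modality = (frac_meth > 0.95) | (frac_meth < 0.05)
```

In the trained model this computation yields the stacked bars of Figure 7b, where 8 of the 50 embedding neurons turn out to be activated by a single modality.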
The most important features were identified using several XAI methods.
For instance, the CpG site cg10794257 was associated with Hox genes, a gene family with an important role in organogenesis and animal development [84,85]. These genes have been related to tumor-promoting activity in pancreatic cancer [86,19]. The methylation feature cg03306374 (a CpG in the first exon/5'-UTR of PRKCB) was previously included in a DNA methylation signature for identifying PDAC [80]. A positive correlation was found between sites located in the CpG island at the 5'-end and the PDAC-hypermethylated cg03306374. Regarding gene expression, the second most important neuron (neuron 29) was highly affected by SATB1. This protein has been previously associated with pancreatic cancer tumorigenesis and poor patient survival in several studies [87,88,89]. The protein CBLC, related to neuron 22, has also been shown to have prognostic significance [90]. These features, derived from feature importance analysis of both the whole model and specific neurons, were evaluated as possible predictors of the cancer subtypes. However, individual biomarkers did not achieve good predictive power. Discovering a robust molecular signature in PDAC is known to be a challenging task. The results of previous molecular signatures are controversial and inconsistent across different datasets [91]. In fact, despite numerous publications reporting potential biomarkers, no single candidate biomarker has been translated into clinical practice to date [92].
In conclusion, we have identified two prognostically relevant subtypes of PDAC patients using a deep clustering model. XAI techniques allowed us to better understand how the model used the multi-omics data to make predictions. A limitation of this study is the lack of validation of the clusters in an external dataset, and the TCGA dataset itself is relatively small in comparison to the datasets commonly used for DL in other fields. Therefore, additional samples and datasets are required to make our conclusions more robust. Unfortunately, although several molecular signatures have been proposed, they are far from being well-suited as a basis for clinical decisions. Therefore, more work is needed to discover a robust biomarker panel for stratifying PDAC patients. This would promote more personalized medicine and potentially more effective treatment of a disease that is a leading cause of cancer mortality.

Figure 1 :
Figure 1: Deep clustering algorithm architecture. The model is composed of an autoencoder and a K-Means with a joint loss function. Both methylation and gene expression profiles are encoded into a reduced matrix by the multi-branch autoencoder. The embedding layer is used in the K-Means algorithm to cluster the patient samples. The joint loss function allows one to optimize both the dimensionality reduction and clustering processes simultaneously.
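The joint objective in the caption can be sketched as a weighted sum of a reconstruction term and a clustering term. This is a minimal illustration assuming mean-squared reconstruction error and K-Means within-cluster distances, with the lambda coefficient as the weighting hyperparameter; the exact loss form used in the paper may differ.

```python
import numpy as np

def joint_loss(x, x_rec, z, centroids, labels, lam):
    """Weighted sum of autoencoder and K-Means losses (illustrative sketch)."""
    rec = np.mean((x - x_rec) ** 2)                                # reconstruction
    clust = np.mean(np.sum((z - centroids[labels]) ** 2, axis=1))  # clustering
    return rec + lam * clust

# Toy check: 4 samples, 3 input features, 2-D embedding, 2 clusters.
x = np.ones((4, 3)); x_rec = np.zeros((4, 3))
z = np.array([[0., 0.], [0., 0.], [1., 1.], [1., 1.]])
centroids = np.array([[0., 0.], [1., 1.]])
labels = np.array([0, 0, 1, 1])
loss = joint_loss(x, x_rec, z, centroids, labels, lam=0.5)
```

Minimizing both terms at once lets the embedding stay faithful to the inputs while becoming cluster-friendly, which is what the joint loss in Figure 1 achieves.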

Figure 2 :
Figure 2: Hyperparameter optimization through a nested cross-validation process, where the silhouette score was used as the optimization metric. Hyperparameters included the number of clusters, the number of input features, the lambda coefficient in the weighted loss function, the dimensionality of the embedding layer in the neural network, the number of epochs, the number of layers, and the number of neurons in each layer of the model (inverse proportion of units). a) Slice plot scoring every possible option of the hyperparameters; the x-axis represents the options, and the y-axis the silhouette score; the color corresponds to the trial of the process (the first 1000 were random). Boxplots of the b) autoencoder, c) clustering, d) joint loss function and e) silhouette metrics obtained for the different folds on training, validation and testing. p-values were calculated with a Wilcoxon test between training and validation, and between validation and testing, respectively. The metrics in a) and e) should be maximized, while the others should be minimized. The red lines represent the baseline for each metric: the worst autoencoder found in the nested cross-validation process; the average of the clustering metrics among the nested cross-validation folds when the number of clusters was two; the highest value of the joint loss function when the lambda coefficient is in the interval {best lambda coefficient ± 0.02}; and the average silhouette score obtained during the optimization process. Green triangles show the means. MAE: mean absolute error; HP: hyperparameter.

Figure 3 :
Figure 3: Kaplan-Meier analysis of the survival differences between clusters, with 95% confidence intervals. The log-rank test was applied for calculating the p-value. The Cox proportional hazards model returned a hazard ratio of 1.75 (95% confidence interval: 1.24-2.49).

Figure 4 :
Figure 4: Omics contribution analysis and distribution by cluster. a) Percentage contribution of DNA methylation and gene expression per sample (n=147). b) Stacked barplot of the relative contribution of DNA methylation and gene expression features by cluster. The y-axis represents the percentage contribution of the omics data type. The standard deviation and the p-value of the Mann-Whitney U test comparing the relative contribution between the clusters are also shown. c) Violin plot of DNA methylation feature means by cluster. d) Violin plot of RNA-seq feature means by cluster. The log2 transform was applied before computing the mean in panel d). p-values of the Kolmogorov-Smirnov test for the methylation and RNA-seq distributions are shown. Green triangles indicate the means.

Figure 5 :
Figure 5: Feature importance analysis. a) Top-25 input features across feature importance methods and weights estimated in the training data. The order of the features corresponds to the consensus among all the methods. The feature importance is represented as attribution values (×10−4) on the y-axis. The weights indicate the average neuron weights in the first layer for the particular feature. b) Boxplots of the top-25 input features grouped by cluster. The numbers on top correspond to p-values from the Kruskal-Wallis test comparing the two clusters. Green triangles indicate the means.
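The consensus ordering in panel a) can be obtained by averaging per-method ranks across the attribution methods. A numpy sketch with hypothetical attribution scores; the feature count and the three method names mirror those used in the paper, but the values are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical |attribution| scores from three methods for 100 features.
attrs = {m: np.abs(rng.normal(size=100))
         for m in ("integrated_gradients", "gradient_shap", "feature_ablation")}

# Rank features within each method (0 = most important), then average
# the ranks across methods to obtain a consensus ordering.
ranks = np.mean([np.argsort(np.argsort(-a)) for a in attrs.values()], axis=0)
top25 = np.argsort(ranks)[:25]
```

Rank averaging makes the consensus robust to differences in the attribution scales of the individual methods.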

Figure 6 :
Figure 6: Pearson correlation coefficient matrix of the top-25 input features inferred by the attribution methods (integrated gradients, gradient SHAP and feature ablation). Positive correlations are shown in red, while blue indicates negative correlations.

Figure 7 :
Figure 7: a) Aggregated neuron importances and estimated weights in the embedding layer of the model. The attributions, which provide the neuron importance, were computed using layer conductance. The bars representing the weights show the average of the weights on each neuron. b) Stacked bars showing the omics relative contribution to the neurons in the embedding layer, ordered by importance. The contribution was measured using neuron conductance. 8 out of the 50 neurons are activated by only a single modality: 6 by methylation and 2 by RNA-seq data.

Table 1 :
p-values from testing the association of the clusters with clinical parameters. The log-rank test was used for overall survival; the Kruskal-Wallis test was applied to the age at diagnosis; and the chi-square test to sex, tumor stage, metastasis stage and neoplasm histologic grade. HC: hierarchical clustering; AJCC: American Joint Committee on Cancer.