Interpretable machine learning model for cardiovascular disease risk prediction: a feature decomposition-based study | BMC Public Health
Model building
The decomposition model proposed in this article is shown in Fig. 1. The model consists of a decomposition network composed of several residual blocks, an attention module and a fully connected layer.

Decomposition network
For patients’ physiological data, a decomposition network is first employed to perform feature extraction. The residual decomposition architecture isolates and disentangles raw physiological features, which captures key information in the data more efficiently. Additionally, physiological features often exhibit mixed effects due to mutual interference between different characteristics. The decomposition network can disentangle these interactions, rendering each feature independent and clearly defined. Through this feature decomposition process, causal relationships and latent structures within the data become more interpretable, thereby enhancing the model’s overall explainability.
First, the patient’s physiological characteristic data are fed into the decomposition network for feature extraction. Each block receives an input\(\:{\:x}_{i}\) and produces an output \(\:{B}_{i}\), which can be formally expressed as \(\:{B}_{i}={Block}_{i}(W;{x}_{i})\).
Within the block, multiple fully connected layers are composed, which can effectively model the relationships between input features by learning their linear dependencies. The calculation formula for the linear layer is \(\:{y}^{{\prime\:}}=w\bullet\:{x}^{{\prime\:}}+b\), where x is the input vector, y is the output vector, w is the weight matrix, and b is the bias vector.
The blocks are connected using residual connections, meaning that the input to the next block is the difference between the input and output of the previous block. Through this residual approach, the feature data are decomposed, separating features that interfere with each other, thereby making the relationships between related features more independent and clearer.
The output of each block is stacked, combining feature vectors into higher-dimensional feature vectors, which are then used for further analysis of the relationships and importance of different feature combinations.
Attention module
The attention module assigns adaptive weights to the distinct feature combinations derived from decomposition, thereby enabling the model to focus more effectively on the most critical feature combinations. This mechanism facilitates the efficient identification and utilization of significant feature information, which in turn enhances the model’s performance and generalizability. The data features obtained from the decomposition network are denoted as \(\:Stack\in\:{R}^{D\times\:E}\), where D represents the number of decomposition layers and E denotes the dimensionality of the patient feature data. The stacked features generated by the decomposition network are then subjected to a dot product operation with the memory matrix Memory to compute the score for each layer’s vector. The calculation formula is as follows: \(\:score=dot\:(Stack,memory)\).
After the attention scores are calculated, normalization is applied to assign weights dynamically to the features at each layer. This process enables the model to prioritize important feature combinations while reducing the influence of potentially noisy or irrelevant data combinations. The calculation formula is as follows:\(\:w\left(i\right)=\frac{\text{e}\text{x}\text{p}\left(score\right(i\left)\right)}{{\sum\:}_{i}^{I}exp\left(score\right(i\left)\right)}\)
The calculated feature weights are subsequently multiplied by the stacked features, and the resulting weighted feature vectors are added together to obtain a new feature vector. The formula for this process is as follows:\(\:feature={\sum\:}_{i}^{I}w\left(i\right)\ast\:score\left(i\right)\)
Finally, the obtained feature vector feature is sent into a fully connected layer and classified by the function \(\:\text{s}\text{o}\text{f}\text{t}\text{m}\text{a}\text{x}\). The formula is as follows: \(\:\widehat{\text{y}}=\text{S}\text{o}\text{f}\text{t}\text{m}\text{a}\text{x}({\text{w}}_{\text{f}\text{c}}\bullet\:\text{f}\text{e}\text{a}\text{t}\text{u}\text{r}\text{e}+{\text{b}}_{\text{f}\text{c}})\).
Model training
We model the prediction of CVD mortality as a classification problem and use the binary cross-entropy loss function. Let y represent the data label and \(\:\widehat{\text{y}}\) represent the value predicted from the model.
We model the prediction of cardiovascular disease mortality as a classification problem and thus employ the binary cross-entropy loss function. This function is computationally efficient and straightforward when calculating gradients, and it is directly related to probability, making it highly suitable for evaluating the quality of model outputs. The formula is shown below, where y represents the data labels and where \(\:\widehat{\text{y}}\) denotes the values predicted by the model.
$$\:\text{L}\left(\text{y},\widehat{\text{y}}\right)=-(\text{y}\text{l}\text{o}\text{g}\left(\widehat{\text{y}}\right)+\left(1-\text{y}\right)\text{log}\left(1-\widehat{\text{y}}\right))$$
To optimize the model, we use the gradient descent algorithm. It calculates the gradient of the loss function with respect to the parameters and updates the parameters in the direction opposite to the gradient. This method can find the global optimal solution in convex functions, where α is the learning rate and \(\:\theta\:\) is the model parameter.\(\:{\theta\:}^{{\prime\:}}=\theta\:-\alpha\:{\nabla\:}_{\theta\:}L\left(y,\widehat{y}\right)\)
Data Source
The dataset used in the experiment is an open-source CVD dataset from Kaggle, comprising medical records of 68,205 patients. Each patient’s record includes 11 features, categorized into three groups: (1) factual information (e.g., age, weight, height), (2) examination measurements (e.g., systolic blood pressure, diastolic blood pressure, blood glucose levels), and (3) subjective indicators (e.g., smoking status, alcohol consumption, and exercise habits). The dataset can be accessed at https://www.kaggle.com/code/sulianova/eda-cardiovascular-data.
Data preprocessing
Before training the model, the data underwent a thorough preprocessing process. First, the numerical features were standardized using StandardScaler, which transforms all the input features into a standard normal distribution with a mean of 0 and a standard deviation of 1, thereby eliminating the impact of varying feature scales on model performance. Categorical variables were encoded as numeric values using LabelEncoder to ensure that they could be used as inputs to the neural network. Additionally, missing values in the dataset were identified and handled using mean imputation for numerical features, ensuring the integrity of the data. Finally, the data were split into training and testing sets using train_test_split, with an 80–20 split. Stratified sampling (stratify parameter) was applied to maintain the class distribution (i.e., the presence or absence of cardiovascular disease) across both the training and testing sets, enhancing the model’s ability to generalize.
To solve the problem of imbalanced data categories, we adopted the following methods: (1) Weighted loss function: A class-weighted cross-entropy loss was used to penalize misclassification of the minority class more heavily. The loss weights were empirically set to [1.0, 1.1], emphasizing positive (cardio = 1) cases. (2) Synthetic oversampling: We employed the synthetic minority oversampling technique (SMOTE) to artificially increase the presence of minority class samples in the training set. SMOTE was applied only after the train–test split to avoid data leakage. To improve model generalizability further, Gaussian noise N(0, 0.01) was added to numeric features such as systolic/diastolic blood pressure, age, and cholesterol. Additionally, feature permutation within the same class was performed on a small percentage (5%) of training samples to simulate plausible physiological variability. These steps were implemented during the training pipeline and combined with normalization, encoding, and stratified sampling.
Model explanation
The interpretation of the prediction model is performed by SHapley Additive exPlanation (SHAP), which precisely calculates the contribution and influence of each feature on the final prediction. In addition, each observation in the dataset can be explained by a specific set of SHAP values.
Statistical analysis
Six machine learning models, including decision tree (DT), logistic regression (LR), support vector machine (SVM), random forest (RF), backpropagation neural network (BPNN), and light gradient boosting machine (LightGBM), were compared with our proposed predictive model. To ensure the repeatability and clarity of the proposed FDDL model, its key architecture and training parameters are listed in Supplementary Table 1. The performance evaluation metrics included the classification accuracy, precision, recall, F1 score, and area under the ROC curve (AUC). The dataset was partitioned into an 80–20 split for the training and validation sets, with model training conducted on the former and performance validation on the latter.
The data were analyzed using IBM SPSS Statistics v25.0. The Kolmogorov‒Smirnov test was used for continuous variables. Continuous variables with a normal distribution are presented as the means ± standard deviations and were compared using t tests. Continuous variables with skewed distributions are presented as medians with interquartile ranges and were compared with the Mann‒Whitney U test. Categorical variables are presented as numbers with percentages and were compared with the chi-square test. P < 0.05 was considered statistically significant.
link
