When the number of classes is more than two, evaluating the performance of a classifier requires computing the above equations for each class separately, such that each class in turn is treated as the first class and all other classes as the second class. After computing the above equations (Eqs. 10–12) for every class, we take the average of these values as the final result.
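The one-vs-rest averaging described above can be sketched as follows. This is a minimal illustration, assuming that Eqs. 10–12 are the usual accuracy, sensitivity and specificity definitions based on TP, TN, FP and FN counts; the function name `one_vs_rest_metrics` and the toy labels are hypothetical and not taken from the paper.

```python
def one_vs_rest_metrics(y_true, y_pred):
    """Average accuracy, sensitivity and specificity over all classes.

    Each class is treated in turn as the first class and all remaining
    classes are pooled into the second class (assumed standard definitions).
    """
    classes = set(y_true) | set(y_pred)
    per_class = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        tn = sum(1 for t, p in pairs if t != c and p != c)
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        # Guard against classes that never occur as positives or negatives.
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
        specificity = tn / (tn + fp) if (tn + fp) else 0.0
        per_class.append((accuracy, sensitivity, specificity))
    n = len(per_class)
    # Final result: plain average of the per-class values.
    return tuple(sum(m[i] for m in per_class) / n for i in range(3))


# Toy three-class example (labels are illustrative only).
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
accuracy, sensitivity, specificity = one_vs_rest_metrics(y_true, y_pred)
```

The unweighted average gives every class the same influence on the final score regardless of its size; a weighted average would be a different design choice that the text does not mention.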
The proposed missing data imputation approach is compared with six imputation methods: mean, Hot-deck, K-NN, Weighted K-NN, Tensor-based imputation (Dauwels et al., 2012) and Bayesian network imputation (Rancoita, 2014). In the literature, these approaches are applied to numerical and categorical attributes in the same way; they do not use different paradigms for different variable types. Therefore, we have applied them identically to discrete and numeric values. We first removed 5%, 10% and 15% of the values of a whole dataset. Then, we estimated the missing values with each imputation method and compared the resulting values with the actual ones using the NRMSE measure (Eq. (9)).
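The evaluation protocol above (remove a fraction of the values, impute them, compare against the actual values) can be sketched as follows. This is only an illustration under stated assumptions: Eq. (9) is not reproduced in this section, so the normalization used here (RMSE divided by the range of the removed actual values) is an assumption, and mean imputation stands in for any of the compared methods. The names `mask_cells`, `mean_impute` and `nrmse` are hypothetical.

```python
import math
import random


def mask_cells(data, rate, seed=0):
    """Return a copy of `data` with a fraction `rate` of cells set to None,
    together with the list of (row, column) positions that were removed."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(data) for j in range(len(row))]
    masked = rng.sample(cells, round(rate * len(cells)))
    incomplete = [list(row) for row in data]
    for i, j in masked:
        incomplete[i][j] = None
    return incomplete, masked


def mean_impute(data):
    """Baseline imputation: fill each missing cell with its column mean."""
    means = []
    for j in range(len(data[0])):
        observed = [row[j] for row in data if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in data]


def nrmse(actual, imputed, masked):
    """RMSE over the removed cells, divided by the range of their actual
    values (assumed normalization; Eq. (9) may define it differently)."""
    squared = [(imputed[i][j] - actual[i][j]) ** 2 for i, j in masked]
    rmse = math.sqrt(sum(squared) / len(squared))
    values = [actual[i][j] for i, j in masked]
    spread = max(values) - min(values)
    return rmse / spread if spread else rmse


# Synthetic numeric data, used only to exercise the protocol.
data = [[float(i + j) for j in range(5)] for i in range(20)]
for rate in (0.05, 0.10, 0.15):
    incomplete, masked = mask_cells(data, rate, seed=1)
    error = nrmse(data, mean_impute(incomplete), masked)
```

Keeping the list of removed positions lets the error be computed only over cells whose true values are known, which is what makes the comparison with the actual values possible.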
Parameters of some of the imputation and prediction models.
Parameter | Method | Task
Number of nearest neighbors = 5; Distance = Standardized Euclidean | K-NN, W-KNN | Imputation
Kernel function = RBF; Order of the RBF kernel = 4 | SVM | Prediction
Number of nearest neighbors = 5; Distance = Pearson | K-NN | Prediction
NRMSE of imputation methods on three datasets with 5–15% missing rates.
Datasets Imputation Methods NRMSE (lower value is better)
In this work, a 5-fold cross-validation procedure is used to evaluate the predictive models. The dataset is split randomly into 5 folds; one fold is used for testing and all the others for training. To ensure the stability of the results, the experiment is repeated five times. We report only the average results for each experiment. Table 3 lists the parameter settings for the imputation and classification methods.
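The repeated 5-fold procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `k_fold_indices` and `repeated_cv` are hypothetical, and `evaluate` stands for any user-supplied function that trains a model on the training indices and returns a score on the test indices.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for a random k-fold split."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def repeated_cv(n_samples, evaluate, k=5, repeats=5):
    """Run k-fold cross-validation `repeats` times with different shuffles
    and return the average score of each run."""
    run_means = []
    for r in range(repeats):
        scores = [evaluate(train, test)
                  for train, test in k_fold_indices(n_samples, k, seed=r)]
        run_means.append(sum(scores) / len(scores))
    return run_means


# Placeholder scoring function: returns the test fraction, only to show the
# call shape; a real `evaluate` would fit and score a classifier.
run_means = repeated_cv(100, lambda train, test: len(test) / 100)
```

Using a different seed for each repetition changes the random partition, so the spread of the per-run averages gives a rough sense of how stable the results are.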
5.4. Experimental results
Imputation of the missing data with the proposed method is conducted to improve breast cancer recurrence prediction. Table 4 shows the NRMSE results of the estimation approaches on the three datasets (Omid, Wisconsin and Cleveland). Fig. 4 also illustrates the results as curves. In these results, the proposed method obtained the lowest error rates and outperformed the other methods (the NRMSE is 0.12 for the Omid dataset, 0.10 for the Wisconsin dataset and 0.11 for the Cleveland dataset).
Fig. 4. NRMSE of imputation methods for Omid, Wisconsin and Cleveland datasets.
As expected, since the number of discrete categorical attributes is greater than the number of continuous attributes, Bayesian network based imputation outperforms Tensor imputation. On these datasets, W-KNN, KNN and Hot-deck do not perform well on continuous missing values. Table 5 shows the classification results (Eqs. 10–12) on the Omid dataset. The proposed method achieved the best result with the C4.5 classifier: an average accuracy of 89.29%, a sensitivity of 78.55% and a specificity of 92.83%, which is a higher accuracy than that obtained with Tensor-based imputation and Bayesian network imputation. The same results are also depicted in Fig. 5.
6. Discussion and conclusion
The recurrence of breast cancer affects the lives of patients even many years after surgery. In recent years, machine learning and data mining methods have increasingly improved predictions and helped medical professionals. Extracting verified information from collected medical data is considered a major challenge. Due to the increase in the number of cancer patients, especially breast cancer patients, research in this field is very important. The existence of missing values in medical data is a main challenge in this field. A precise estimation of missing values, which leads to better decisions, or at least assists experts for this purpose, is valuable in cancer diagnosis and recurrence prediction. In this paper, a new approach for missing value imputation is proposed with respect to the dependencies among variables and the type of the incomplete variable, which significantly affects imputation, using Tensor and Bayesian networks for both categorical and numerical