Breast Cancer Prediction using Machine Learning Algorithms

1. Understanding Breast Cancer

1.1 What Is Breast Cancer? 

Breast cancer begins when some breast cells start to grow abnormally. These cells divide more quickly than healthy cells and continue to accumulate, forming a lump or mass. Research indicates that the condition is likely to affect 1 in every 28 women.

1.2 Breast Cancer Symptoms 

In some individuals, breast cancer can develop without any symptoms. When present, the signs and symptoms of breast cancer include:

• A lump or thickening in the breast or under the arm 

• Change in the size and shape of the breast 

• Breast, nipple, or armpit pain 

• Nipple discharge other than breast milk

• Redness of the breast skin or nipple 

• Dimpling or pitting of breast skin

1.3 Stages of Breast Cancer

1.4 Risk Factors of Breast Cancer

1.5 Diagnosis of Breast Cancer

The following are the tests and procedures used to diagnose breast cancer: 

• Breast exam: Checking the breasts for lumps or other abnormalities

• Mammogram: An X-ray of the breast used to screen for breast cancer

• Breast Ultrasound

• Breast MRI

• FNAC (fine-needle aspiration cytology): A small sample of cells is removed with a thin needle for testing

• Biopsy: A piece of breast tissue is removed for further testing

2. Dataset

2.1 Dataset Description

A network of cancer registries spread across the country tracks every case of cancer reported in 19 U.S. geographic areas. Incidence data for all cancer types are available in the SEER (Surveillance, Epidemiology, and End Results) database (1972–2012). For this project, we examined 740506 records of breast cancer patients with 146 features.

2.2 Data Cleaning 

The SEER dataset encodes unknown values for several features as "999" or "99". Such records were removed from the Excel file using the 'FILTER' command, so that only records with known values were kept for later analysis.

Imputing missing values using algorithms: features with fewer than 20 percent null values were imputed using machine learning algorithms. Categorical features were imputed using KNN, and continuous features were imputed using linear regression.
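The two cleaning steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the column names (`age`, `tumor_size`, `grade`), the use of `None` for nulls, and `k=3` are hypothetical, and the original filtering was done in Excel rather than in code. Only the KNN step for a categorical feature is shown.

```python
from collections import Counter

# Sentinel codes that mark "unknown" values (per the description above).
UNKNOWN_CODES = {"99", "999"}


def drop_unknown(records, columns):
    """Keep only records whose listed columns all hold known values."""
    return [
        r for r in records
        if all(str(r[c]) not in UNKNOWN_CODES for c in columns)
    ]


def knn_impute_categorical(records, target, numeric_cols, k=3):
    """Fill missing (None) `target` values with the majority label among
    the k nearest complete records (squared Euclidean distance)."""
    complete = [r for r in records if r[target] is not None]
    filled = []
    for r in records:
        if r[target] is not None:
            filled.append(dict(r))
            continue
        nearest = sorted(
            complete,
            key=lambda c: sum((c[col] - r[col]) ** 2 for col in numeric_cols),
        )[:k]
        label = Counter(n[target] for n in nearest).most_common(1)[0][0]
        filled.append({**r, target: label})
    return filled


# Hypothetical toy records; real SEER rows have many more fields.
records = [
    {"age": 45, "tumor_size": 12, "grade": "1"},
    {"age": 47, "tumor_size": 14, "grade": "1"},
    {"age": 70, "tumor_size": 40, "grade": "3"},
    {"age": 72, "tumor_size": 42, "grade": "3"},
    {"age": 46, "tumor_size": 13, "grade": None},  # null value to impute
    {"age": 60, "tumor_size": 999, "grade": "2"},  # unknown size, dropped
]

known = drop_unknown(records, ["tumor_size"])
cleaned = knn_impute_categorical(known, "grade", ["age", "tumor_size"], k=3)
```

The sentinel check is applied only to the columns passed in, since a value such as 99 can be legitimate in other fields.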

2.3 Feature Selection 

The SEER dataset has 146 features. Including all of them in the analysis may lead to overfitting, so the following feature selection techniques are used:

(i) Removing OBJECT datatype 

If features of the OBJECT datatype are critical to the problem, they can be encoded for analysis; if they are insignificant, they can be ignored. In this project, four OBJECT-type columns were removed from our SEER dataset.

(ii) Dropping Features

Features with more than 20% null values, and features with a unique value for every record, were dropped (85 features). The features are:

Table 1 - Features which have null values of more than 20%
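The dropping rule can be expressed as a short function. This is a minimal sketch, assuming records are dictionaries and nulls are `None`; the column names are hypothetical, and only the 20% threshold comes from the text.

```python
def columns_to_drop(records, null_threshold=0.20):
    """Return column names that are mostly null or unique for every record."""
    n = len(records)
    drop = []
    for col in records[0]:
        values = [r[col] for r in records]
        null_fraction = sum(v is None for v in values) / n
        non_null = [v for v in values if v is not None]
        # "Unique per record": no nulls and no repeated value (e.g. an ID).
        all_unique = len(non_null) == n and len(set(non_null)) == n
        if null_fraction > null_threshold or all_unique:
            drop.append(col)
    return drop


# Hypothetical toy rows: `patient_id` is unique per record,
# `rare_field` is 60% null, `stage` is kept.
rows = [
    {"patient_id": 1, "rare_field": None, "stage": "I"},
    {"patient_id": 2, "rare_field": None, "stage": "II"},
    {"patient_id": 3, "rare_field": "x", "stage": "I"},
    {"patient_id": 4, "rare_field": None, "stage": "III"},
    {"patient_id": 5, "rare_field": "y", "stage": "II"},
]

to_drop = columns_to_drop(rows)
```

Applied blindly, the uniqueness rule would also flag a continuous measurement that never repeats, so in practice it is best restricted to identifier-like columns.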

(iii) Forward Feature Selection 

This is an iterative method in which we start with the single variable that performs best against the target. Next, we select the variable that gives the best performance in combination with the first one. This process continues until a preset criterion is met.

Table 2 - Features from Forward Feature Selection
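The greedy loop behind forward feature selection can be sketched as follows. The scoring function is a hypothetical stand-in for a real model evaluation (for example, cross-validated accuracy), and the feature names and stopping rule (`max_features`, `min_gain`) are assumptions made for illustration.

```python
def forward_select(candidates, score_fn, max_features, min_gain=0.0):
    """Greedy forward selection.

    candidates: list of feature names.
    score_fn: takes a list of features and returns a score (higher is better).
    Stops at max_features, or when the best addition gains <= min_gain.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(candidates)
    while remaining and len(selected) < max_features:
        # Try adding each remaining feature to the current selection.
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        score, feature = max(scored)
        if score - best_score <= min_gain:
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Hypothetical per-feature usefulness, standing in for a model score.
USEFULNESS = {"stage": 0.30, "tumor_size": 0.20, "grade": 0.10, "noise": 0.0}


def toy_score(features):
    score = 0.5 + sum(USEFULNESS[f] for f in features)
    # Two partly redundant features add less together than separately.
    if "stage" in features and "tumor_size" in features:
        score -= 0.15
    return score


chosen, final_score = forward_select(list(USEFULNESS), toy_score, max_features=3)
```

In this toy run `grade` is picked before `tumor_size`, because the second pick is judged in combination with `stage` rather than on its own.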

(iv) Variance Inflation Factor 

Multicollinearity can be detected using various techniques, one of which is the Variance Inflation Factor (VIF). In the VIF method, we pick each feature in turn and regress it against all of the other features. ERSTATUS, AGE_DX, and MAR_STAT have very high VIF values, indicating that these three variables are highly correlated. Hence, considering these three features together leads to a model with high multicollinearity.
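For a feature j, the VIF is 1 / (1 - R²), where R² comes from regressing feature j on all the other features. The sketch below implements that definition with the standard library only. The feature names and values are made up, and in practice a library routine such as statsmodels' `variance_inflation_factor` would normally be used instead.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def r_squared(y, predictors):
    """R^2 of a least-squares fit of y on the predictors plus an intercept."""
    rows = [[1.0] + [col[i] for col in predictors] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(columns):
    """Map each feature name to its VIF; `columns` maps name -> values."""
    result = {}
    for name, y in columns.items():
        others = [v for k, v in columns.items() if k != name]
        r2 = r_squared(y, others)
        result[name] = float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return result


# Hypothetical data: b is roughly 2 * a, c is unrelated to both.
data = {
    "a": [1, 2, 3, 4, 5, 6, 7, 8],
    "b": [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1],
    "c": [5, 1, 4, 2, 8, 3, 7, 6],
}
vifs = vif(data)
```

A common rule of thumb treats a VIF above 5 or 10 as a sign of problematic multicollinearity; here `a` and `b` land far above that range while `c` stays close to 1.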

3. Model Fitting 

Seventy percent of the data was used for training and thirty percent for testing. To predict whether a cancer patient will survive, the following ensemble learning approaches were used:


• XGBM

• Light GBM

• CatBoost 
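The 70/30 split described above can be sketched with the standard library. The fixed seed and the absence of stratification are assumptions for illustration; scikit-learn's `train_test_split` with `test_size=0.3` is the usual tool for this.

```python
import random


def split_70_30(records, test_fraction=0.30, seed=42):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical stand-in for the cleaned patient records.
train_set, test_set = split_70_30(range(1000))
```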

3.1 XGBM 

XGBoost (XGBM) is a modified version of the GBM (gradient boosting machine) algorithm. Trees in XGBoost are built sequentially, each one aiming to correct the errors of the earlier trees. XGBM is quicker than GBM because it uses parallel processing at the node level. XGBoost also includes a wide range of regularisation methods to reduce overfitting and improve overall performance.
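The idea of each tree correcting the errors of the earlier ones can be shown with a tiny example. This is plain gradient boosting for squared error with one-split trees (stumps) on a single feature, not XGBoost itself, which adds second-order gradient information, regularisation, and parallelism. The data and hyperparameters are made up.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Fit stumps one after another, each to the residuals left so far."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
        # Track the training mean squared error after each round.
        history.append(
            sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)
        )
    return base, stumps, history


# Hypothetical 1-D regression data with three rough plateaus.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
base, stumps, mse_history = boost(xs, ys)
```

The training error recorded in `mse_history` never increases from one round to the next, because every new stump is fitted to what the previous ones left unexplained.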

3.2 Light GBM 

LightGBM is a distributed, high-performance gradient boosting framework for ranking, classification, and many other machine learning tasks. It is based on decision tree algorithms. Whereas other boosting algorithms grow trees level-wise (depth-wise), LightGBM grows them leaf-wise, always splitting the leaf that gives the best fit.
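Because growth is leaf-wise, tree complexity in LightGBM is governed mainly by the number of leaves rather than by depth. The parameter dictionary below is a hypothetical configuration sketch: the keys are LightGBM parameter names, but the values are illustrative and are not the settings used in this project.

```python
# Hypothetical LightGBM parameters (illustrative values only).
lgbm_params = {
    "objective": "binary",  # two-class target, e.g. survived or not
    "num_leaves": 31,       # cap on leaves per tree under leaf-wise growth
    "max_depth": 7,         # optional depth limit to curb overfitting
    "learning_rate": 0.1,
}

# A tree of depth d can hold at most 2 ** d leaves, so num_leaves should
# stay below that bound for the depth limit to leave it any room.
max_possible_leaves = 2 ** lgbm_params["max_depth"]
leaves_within_bound = lgbm_params["num_leaves"] < max_possible_leaves
```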

3.3 CatBoost 

CatBoost is an open-source gradient boosting library developed by Yandex. It can be integrated with deep learning frameworks such as Google's TensorFlow and Apple's Core ML, and it works with a variety of data types, which helps address a wide range of problems that businesses face. It is also reported to deliver best-in-class accuracy.

4. Results & Discussion 

The results and graphs of the classification algorithms are presented after pre-processing, followed by a discussion of the existing and proposed solutions. The data come from the SEER breast cancer database, which has 740506 entries with 146 features. The dataset is pre-processed using several feature selection techniques to decrease its size:

• Removing Object Datatype

• Removing Missing Values

• Removing variables with Forward Feature Selection 

• Variance Inflation Factor 

Once all the pre-processing techniques have been applied, the updated dataset is passed on to modelling. The dataset is modelled with three different ensemble algorithms: XGBM, LightGBM, and CatBoost.

4.1 EDA (Exploratory Data Analysis)

Fig 4.1 The Count Plot of STAT_REC

4.2 Classification of Statistical Results

All of the algorithms were evaluated statistically and compared using the classification report, which includes each model's accuracy, precision, recall, and F1 score. Based on this comparison, we conclude that LightGBM has a higher accuracy than the other ensemble algorithms.

Table 3 - Comparison of Algorithms

4.3 Comparison of Existing Solution and Proposed Solution

SEER breast cancer data was used in both the existing and the proposed solutions. The existing solution used several feature extraction techniques to reduce the dataset to 464889 records with 7 features, then applied a Decision Tree with the CHAID (Chi-square Automatic Interaction Detection) model and obtained 77 percent accuracy. The proposed solution used various feature selection methods and the classification algorithms LightGBM, XGBoost, and CatBoost. LightGBM has the highest accuracy at 78 percent, followed by CatBoost, also at about 78 percent. Comparing the existing and proposed solutions, accuracy increased from 77% to 78%. By adding some new features to the existing model, we have improved on its accuracy.

Table 4 - Best Algorithm
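The accuracy, precision, recall, and F1 score used to compare the models in this section can be computed from true and predicted labels as sketched below. The labels are made up for illustration, and the encoding (1 for the positive class) is an assumption.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels for ten patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

When the two classes are imbalanced, accuracy alone can be misleading, which is why the classification report also lists precision, recall, and F1.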

5. Conclusion 

Breast cancer survival prediction on the SEER dataset was performed with various feature selection and modelling techniques and compared against an existing solution. The existing classification algorithm did not provide satisfactory accuracy. To address this, a new classification approach was proposed. According to the classification and comparison results, the proposed algorithms (LightGBM, XGBoost, CatBoost) perform better than the algorithm used in the existing solution. Pre-processing the breast cancer dataset enables faster execution and better performance. Across the data examined, the proposed algorithms provide better performance than the others.


6. Future Enhancements

The proposed ensemble technique can easily be extended to other applications. In future, it could be combined with evolutionary algorithms to obtain a new and more powerful ensemble algorithm. Different pre-processing techniques could be applied to balance the dataset, and more advanced machine learning techniques could be explored. Accuracy could be improved further by adopting newer feature selection methods. The model could also be deployed to the AWS cloud.


  • Ghladin Shebac
  • Dec, 27 2022
