Binary Classification of Wine dataset using Support Vector Machines, Random Forest, and Gradient Boosting Classifier

Binary Classification of Wine Dataset

In this tutorial, you will learn how to develop a binary classification model to differentiate Red Wine from White Wine using separate datasets for the two wine types.

Firstly, the two datasets will be combined into a single dataset containing both wine types, and exploratory data analysis will be performed to detect patterns in the data.

Secondly, binary classification models will be developed, with hyperparameter tuning, for the Support Vector Machine, Random Forest, and Gradient Boosting classifiers using the full feature set.

Finally, important features will be selected from the full feature set based on their correlation with the wine type, and the performance of these stripped-down binary classification models will be compared with that of the full-feature models.

This tutorial uses the following tools/libraries to classify wine types with three machine learning algorithms, namely Support Vector Machine (SVM), Random Forest (RF), and Gradient Boosting classifiers:

  • Python
  • Jupyter Notebook
  • scikit-learn
  • pandas
  • seaborn
  • matplotlib

Let’s first make sense of the Wine data itself…

Wine Dataset

The wine data comprises two datasets, one for each wine type, i.e., Red and White. Both datasets contain numeric features of the relevant wine type. To frame a binary classification problem, we will combine the two datasets and add an extra column indicating the wine type.

Both datasets are readily available at the UCI Machine Learning Repository, and each instance has 12 attributes. The Red Wine dataset contains 1599 instances, whereas the White Wine dataset contains 4898 instances. We will use these continuous attributes to develop a binary classification model for Red and White wine.

Wine Dataset Link: https://archive.ics.uci.edu/ml/datasets/wine+quality

Input Features

All the input features of the wine dataset are numeric in nature:

  • Fixed Acidity
  • Volatile Acidity
  • Citric Acid
  • Residual Sugar
  • Chlorides
  • Free Sulfur Dioxide
  • Total Sulfur Dioxide
  • Density
  • pH
  • Sulphates
  • Alcohol
  • Quality

Output Features

class: ‘red’ [for red wine] or ‘white’ [for white wine]

Article Notebook

Code: Red and White Wine binary classification using Support Vector Machines, Random Forest, and Gradient Boosting

Dataset Link: https://archive.ics.uci.edu/ml/datasets/wine+quality

Prepared By: Awais Naeem (awais.naeem@embedded-robotics.com)

Copyrights: www.embedded-robotics.com

Disclaimer: This code may be distributed with proper attribution to the copyright owner.

Notebook Link: https://github.com/embedded-robotics/datascience/blob/master/wine_dataset_binaryclass_ml/SVM_GB_RF_wine.ipynb

Data Reading and Exploratory Analysis

Let’s first import all python-based libraries necessary to read the data, process it, and then perform further exploratory analysis on it:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import StratifiedKFold, GridSearchCV
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Now, read both of the datasets into two different pandas data frames. Values are separated by a semi-colon (';'), so you will need to specify it as the delimiter to read correctly formatted columns/values:

red_wine_data = pd.read_csv('winequality-red.csv', delimiter=';')
white_wine_data = pd.read_csv('winequality-white.csv', delimiter=';')

Once you have read the data, it’s a good practice to take a first look at some of the starting rows of the data:

red_wine_data.head()

Output:

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076
1            7.8              0.88         0.00             2.6      0.098
2            7.8              0.76         0.04             2.3      0.092
3           11.2              0.28         0.56             1.9      0.075
4            7.4              0.70         0.00             1.9      0.076

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56
1                 25.0                  67.0   0.9968  3.20       0.68
2                 15.0                  54.0   0.9970  3.26       0.65
3                 17.0                  60.0   0.9980  3.16       0.58
4                 11.0                  34.0   0.9978  3.51       0.56

   alcohol  quality
0      9.4        5
1      9.8        5
2      9.8        5
3      9.8        6
4      9.4        5

white_wine_data.head()

Output:

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.0              0.27         0.36            20.7      0.045
1            6.3              0.30         0.34             1.6      0.049
2            8.1              0.28         0.40             6.9      0.050
3            7.2              0.23         0.32             8.5      0.058
4            7.2              0.23         0.32             8.5      0.058

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 45.0                 170.0   1.0010  3.00       0.45
1                 14.0                 132.0   0.9940  3.30       0.49
2                 30.0                  97.0   0.9951  3.26       0.44
3                 47.0                 186.0   0.9956  3.19       0.40
4                 47.0                 186.0   0.9956  3.19       0.40

   alcohol  quality
0      8.8        6
1      9.5        6
2     10.1        6
3      9.9        6
4      9.9        6

Add an extra column named ‘class’ to both datasets to specify the wine type, and combine the two datasets vertically using pd.concat():

red_wine_data['class'] = 'red'
white_wine_data['class'] = 'white'
wine_data = pd.concat([red_wine_data, white_wine_data], axis=0, ignore_index=True)

Here, you need to set ignore_index=True to get a unique set of index values after combining both the data sets.
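
If you want to see what ignore_index=True actually changes, here is a quick optional check of index uniqueness, reusing the two data frames read above:

# Without ignore_index, the row labels 0..1598 and 0..4897 of the two frames overlap.
print(pd.concat([red_wine_data, white_wine_data], axis=0).index.is_unique)                      # False
print(pd.concat([red_wine_data, white_wine_data], axis=0, ignore_index=True).index.is_unique)   # True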

Now that you have a combined dataset containing the instances of both wine types, the first thing you need to do is look for any missing values and deal with them like pros:

wine_data.isnull().sum()

Output:

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
class                   0
dtype: int64

Well… Well… Lucky us! No missing values.

Let’s now evaluate the datatypes of each feature to make sure they are ‘numeric’:

wine_data.dtypes

Output:

fixed acidity           float64
volatile acidity        float64
citric acid             float64
residual sugar          float64
chlorides               float64
free sulfur dioxide     float64
total sulfur dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                   int64
class                    object
dtype: object

The ‘class’ feature contains ‘white’ or ‘red’, and is therefore classified as ‘object’ type. Let’s convert our ‘class’ feature into a numeric one so that our algorithms can understand the underlying types as well:

wine_data['target'] = np.where(wine_data['class']=='white', 1, 0)

Find the input features that correlate most strongly with the output variable. Since wine_data still contains the non-numeric ‘class’ column at this point, pass numeric_only=True (required on recent pandas versions; older versions simply drop non-numeric columns):

wine_data.corr(numeric_only=True)['target'].sort_values(ascending=True)

Output:

volatile acidity       -0.653036
chlorides              -0.512678
sulphates              -0.487218
fixed acidity          -0.486740
density                -0.390645
pH                     -0.329129
alcohol                 0.032970
quality                 0.119323
citric acid             0.187397
residual sugar          0.348821
free sulfur dioxide     0.471644
total sulfur dioxide    0.700357
target                  1.000000
Name: target, dtype: float64

It seems like ‘volatile acidity’, ‘chlorides’, ‘sulphates’ and ‘fixed acidity’ have the strongest negative correlations, whereas ‘free sulfur dioxide’ and ‘total sulfur dioxide’ have the strongest positive correlations with the ‘target’ value.

Now draw a pair plot for all the input features and try to spot any more patterns among the data:

sns.pairplot(wine_data, hue='class')
Pair Plot of Wine Dataset Features

You can also look at the numeric correlation values between the features using a seaborn heatmap (again passing numeric_only=True because of the non-numeric ‘class’ column):

plt.figure(figsize=(15,10))
sns.heatmap(wine_data.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, cmap='PiYG')
plt.show()
SNS Heatmap of Wine Dataset Features

Here are some salient patterns which I have spotted just by looking at the above two plots:

  • Free Sulfur Dioxide and Total Sulfur Dioxide are positively correlated, i.e., if you have more of one, you tend to have more of the other
  • Density and Alcohol exhibit an inverse relationship, i.e., the higher the alcohol concentration, the lower the density
  • Density exhibits a positive correlation with Fixed Acidity and Residual Sugar
  • Quality is also positively linked with the amount of Alcohol, i.e., the more alcohol a wine has, the higher its perceived quality. Right, makes sense!
  • Volatile Acidity shows a negative correlation with Total Sulfur Dioxide

The above relationships are visually plotted as follows:

fig, axes = plt.subplots(2,3, figsize=(18,10))
sns.scatterplot(data=wine_data, x='alcohol', y='density', hue='class', ax=axes[0, 0])
sns.scatterplot(data=wine_data, x='free sulfur dioxide', y='total sulfur dioxide', hue='class', ax=axes[0, 1])
sns.scatterplot(data=wine_data, x='alcohol', y='quality', hue='class', ax=axes[0, 2])
sns.scatterplot(data=wine_data, x='residual sugar', y='density', hue='class', ax=axes[1, 0])
sns.scatterplot(data=wine_data, x='volatile acidity', y='total sulfur dioxide', hue='class', ax=axes[1, 1])
sns.scatterplot(data=wine_data, x='alcohol', y='density', hue='class', ax=axes[1, 2])
plt.show()
Relation between highly correlated features of Wine Dataset

If you can identify more patterns in the data set, do not hesitate to comment and let others know.

Let’s now explore the relation between the input features and the output class. To keep things clean, I will only plot the Kernel Density Estimation of the most important features differentiated by the target class:

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
sns.kdeplot(data=wine_data, x='total sulfur dioxide', hue='class', fill=True, ax=axes[0, 0])
sns.kdeplot(data=wine_data, x='free sulfur dioxide', hue='class', fill=True, ax=axes[0, 1])
sns.kdeplot(data=wine_data, x='volatile acidity', hue='class', fill=True, ax=axes[0, 2])
sns.kdeplot(data=wine_data, x='chlorides', hue='class', fill=True, ax=axes[1, 0])
sns.kdeplot(data=wine_data, x='sulphates', hue='class', fill=True, ax=axes[1, 1])
sns.kdeplot(data=wine_data, x='fixed acidity', hue='class', fill=True, ax=axes[1, 2])
plt.show()
Relation between Input Features and Output Class for Wine Dataset

Looking at the KDE plots of these features, the following salient points can be noted:

  • All the features are, more or less, normally distributed
  • The feature distributions for the ‘Red’ class are right-skewed
  • The KDE curves for ‘White’ wine are taller than those for ‘Red’ wine, which reflects that the majority of samples in the combined dataset are White wine
  • The distributions of Total Sulfur Dioxide, Volatile Acidity, and Chlorides are significantly different for the two wines and can greatly impact the performance of the binary classification models (see the quick check after this list)
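
Here is a minimal optional check that backs up the last point with numbers, using the combined wine_data frame from above (the three columns are just the most visibly separated features, an illustrative choice):

# Per-class means of the most visibly separated features.
key_features = ['total sulfur dioxide', 'volatile acidity', 'chlorides']
print(wine_data.groupby('class')[key_features].mean())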

Now let’s do some pre-processing on the data before we can move onto building the binary classification models using machine learning algorithms.

Data Pre-processing

Firstly, drop the ‘class’ feature from the dataset, since we have already substituted its ‘object’ values with numeric ones via the ‘target’ feature:

wine_data.drop('class', axis=1, inplace=True)

Specify a uniform datatype for all the features, and separate the Input features of the data from its output class:

wine_data = wine_data.astype('float32')
data_X = wine_data.drop('target', axis=1)
data_y = wine_data['target']

Split the dataset into Training Data (75%) and Testing Data (25%). Training data will be used by the models for training, whereas Testing Data will be used to gauge the performance of the trained models on the unseen data.

X_train, X_test, y_train, y_test = train_test_split(data_X, data_y, test_size=0.25, random_state=1)

Since the input features are numeric and differ in their individual scales, they need to be scaled so that each feature’s distribution has mean = 0 and standard deviation = 1.

StandardScaler() is the industry’s go-to transformer for such scaling, which, in turn, improves training for many machine learning models, since they then see every feature on approximately the same scale.

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

We only need to fit the StandardScaler() on the Training Data and not the Testing Data, since the fitting of scaler on the Testing Data will lead to data leakage.
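
As a quick optional sanity check, you can confirm that the scaled training features now have (approximately) zero mean and unit standard deviation:

# Each column of the scaled training set should have mean ~0 and std ~1.
print(X_train.mean(axis=0).round(3))
print(X_train.std(axis=0).round(3))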

After the necessary pre-processing, let’s build our first binary classification model using a Support Vector Machine (SVM) and evaluate its performance…

Data Classification using Support Vector Machines

To train the SVM model on the wine dataset, the Support Vector Classifier (SVC) from the scikit-learn library is employed. At first, the SVC() model is fitted on the training data with its default parameters:

svm_model = SVC(C=1, kernel='rbf')
svm_model.fit(X_train, y_train)

‘C’ is the regularization parameter which controls underfitting/overfitting. A higher value of ‘C’ means less regularization, which can result in overfitting, and vice versa.

‘kernel’ determines the shape of the decision boundary used to separate the classes. Besides ‘rbf’, other options include ‘linear’, ‘poly’ and ‘sigmoid’.
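
Before running a full grid search, you can optionally get a feel for the effect of ‘C’ with a small cross-validated sweep on the training data; the values below are illustrative rather than tuned:

from sklearn.model_selection import cross_val_score

# Illustrative sweep: higher C means less regularization.
for c in [0.01, 0.1, 1, 10, 100]:
    scores = cross_val_score(SVC(C=c, kernel='rbf'), X_train, y_train, cv=5, scoring='accuracy')
    print(f'C={c}: mean CV accuracy = {scores.mean():.4f}')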

Once the model is trained, we need to predict the performance of the model on the testing data set:

svm_y_pred = svm_model.predict(X_test)
print('Accuracy Score:', accuracy_score(y_test, svm_y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, svm_y_pred))
print('Classification Report:\n', classification_report(y_test, svm_y_pred))

Output:

Accuracy Score: 0.9944615384615385
Confusion Matrix:
 [[ 376    6]
 [   3 1240]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.99      0.98      0.99       382
         1.0       1.00      1.00      1.00      1243

    accuracy                           0.99      1625
   macro avg       0.99      0.99      0.99      1625
weighted avg       0.99      0.99      0.99      1625

So, we have got almost perfect accuracy, recall, and precision scores. Nice, isn’t it?

Let’s now tune hyper-parameters for SVM and evaluate whether we can further enhance the performance of the SVM binary classification model.

Hyperparameter tuning for SVM

In this approach, we will train the model over a grid of hyperparameter values and then select the best combination based on a scoring metric, i.e., accuracy:

svm_param_grid = {'C':[0.01, 0.1, 1, 10, 100],
                'kernel': ['linear', 'rbf'],
                'gamma' :[0.001, 0.01, 0.1, 1, 10]}
svm_cv = StratifiedKFold(n_splits=5)
svm_grid = GridSearchCV(SVC(), svm_param_grid, cv=svm_cv, scoring='accuracy')
svm_grid.fit(X_train, y_train)

The SVC() model will be trained using each combination of the hyperparameters specified in ‘svm_param_grid’.

During each iteration, the model will be evaluated using 5-fold cross-validation: four folds of the data are used to train the model, and the remaining fold is used to test the accuracy of the trained model.

Also note that we are using ‘StratifiedKFold’ here instead of the legendary ‘KFold’. The main reason is that the White (4898) and Red (1599) wine samples are not equally represented in the dataset. When making folds during cross-validation, ‘StratifiedKFold’ will ensure that each fold contains the same ratio of white and red wine samples as the original dataset.
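
A minimal optional sketch to verify this behaviour, assuming the X_train/y_train split created earlier:

# Each validation fold should contain roughly the same proportion of white-wine
# samples (target = 1) as the full training set.
check_cv = StratifiedKFold(n_splits=5)
print('Overall white-wine ratio:', round(y_train.mean(), 3))
for i, (_, val_idx) in enumerate(check_cv.split(X_train, y_train)):
    print(f'Fold {i}: white-wine ratio = {y_train.iloc[val_idx].mean():.3f}')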

Hyperparameters resulting in the maximum cross-validated accuracy will then be used to refit the model on the complete training data in the final stage of ‘GridSearchCV’.
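
Once the grid search has finished, you can optionally inspect how every parameter combination performed via the cv_results_ attribute; a small sketch reusing the fitted svm_grid:

# Mean cross-validated accuracy for every parameter combination, best ranked first.
cv_results = pd.DataFrame(svm_grid.cv_results_)
print(cv_results[['params', 'mean_test_score', 'rank_test_score']].sort_values('rank_test_score').head())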

We can get the parameters which resulted in the maximum cross-validated accuracy as follows:

print('SVM best Params:', svm_grid.best_params_)
print('SVM best Score:', svm_grid.best_score_)

For this dataset, the search settles on the ‘rbf’ kernel with C=10 and gamma=0.01 (these are the values we will reuse later in this tutorial). Let’s evaluate the refitted best model on the test data:

svm_y_pred = svm_grid.predict(X_test)
print('Accuracy Score:', accuracy_score(y_test, svm_y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, svm_y_pred))
print('Classification Report:\n', classification_report(y_test, svm_y_pred))

Output:

Accuracy Score: 0.9944615384615385
Confusion Matrix:
 [[ 376    6]
 [   3 1240]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.99      0.98      0.99       382
         1.0       1.00      1.00      1.00      1243

    accuracy                           0.99      1625
   macro avg       0.99      0.99      0.99      1625
weighted avg       0.99      0.99      0.99      1625

So, the optimized hyperparameters give essentially the same (already near-perfect) accuracy as the default SVM on this dataset. With much larger or noisier datasets (millions of instances), hyperparameter tuning typically makes a far more noticeable difference.

Let’s now train Random Forest Classifier model on the same data set…

Data Classification using Random Forest Classifier

A Random Forest classifier is essentially an ensemble of many decision tree classifiers whose individual predictions are combined, with the majority vote deciding the class. You can train a Random Forest classifier using the scikit-learn library in Python:

rf_model = RandomForestClassifier(n_estimators=500)
rf_model.fit(X_train, y_train)

‘n_estimators’ specifies the number of decision trees which will be used to train the Random Forest classifier.
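
To make the ‘majority vote’ idea concrete, here is a small optional sketch that collects each individual tree’s prediction from rf_model.estimators_ and compares a hard vote against the model’s own predictions (scikit-learn actually averages the trees’ class probabilities in predict(), which usually coincides with a hard majority vote):

# Stack every tree's prediction: shape (n_trees, n_test_samples).
tree_preds = np.stack([tree.predict(X_test) for tree in rf_model.estimators_])
# Hard majority vote over the trees, mapped back to the class labels.
majority_vote = rf_model.classes_[(tree_preds.mean(axis=0) >= 0.5).astype(int)]
print('Agreement with rf_model.predict():', (majority_vote == rf_model.predict(X_test)).mean())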

Let’s observe the accuracy of the Random Forest model on the test data set:

rf_y_pred = rf_model.predict(X_test)
print('Accuracy Score:', accuracy_score(y_test, rf_y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, rf_y_pred))
print('Classification Report:\n', classification_report(y_test, rf_y_pred))

Output:

Accuracy Score: 0.992
Confusion Matrix:
 [[ 372   10]
 [   3 1240]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.99      0.97      0.98       382
         1.0       0.99      1.00      0.99      1243

    accuracy                           0.99      1625
   macro avg       0.99      0.99      0.99      1625
weighted avg       0.99      0.99      0.99      1625
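
Before tuning, it is also worth an optional glance at which inputs the default forest relies on most; impurity-based importances are exposed via feature_importances_ and complement the correlation analysis we will use later for feature selection:

# Impurity-based feature importances of the default Random Forest model.
importances = pd.Series(rf_model.feature_importances_, index=data_X.columns)
print(importances.sort_values(ascending=False))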

Hyperparameter tuning for Random Forest

Similar to what we did for the SVM model, we will tune the hyperparameters of the Random Forest Classifier using 5-fold cross-validation via GridSearchCV() and StratifiedKFold():

rf_param_grid = {'max_samples': [0.1, 0.2, 0.3, 0.4],
                'max_features': [5, 6, 7],
                'n_estimators' :[50, 100, 500, 1000],
                'max_depth': [10, 11, 12]
                }
rf_cv = StratifiedKFold(n_splits=5)
rf_grid = GridSearchCV(RandomForestClassifier(), rf_param_grid, cv=rf_cv)
rf_grid.fit(X_train, y_train)

‘max_samples’ denotes the fraction of the training samples drawn to build each individual decision tree.

Since there are only 12 input features in the wine dataset, we will restrict ‘max_features’ to ‘5’, ‘6’ or ‘7’ for more robust model training.

‘max_depth’ denotes the maximum depth of each individual decision tree in the Random Forest model.

After the model is trained, you can get the best model parameters as well as evaluate the performance of the trained model on the test data set:

print('RF best Parameters:', rf_grid.best_estimator_)
print('RF best Score:', rf_grid.best_score_)

Output:

RF best Parameters: RandomForestClassifier(max_depth=10, max_features=5, max_samples=0.4,
                       n_estimators=500)
RF best Score: 0.9940470699731481

Let’s evaluate the performance of the optimized Random Forest model:

rf_y_pred = rf_grid.predict(X_test)
print('Accuracy:', accuracy_score(y_test, rf_y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, rf_y_pred))
print('Classification Report:\n', classification_report(y_test, rf_y_pred))

Output:

Accuracy: 0.9907692307692307
Confusion Matrix:
 [[ 373    9]
 [   6 1237]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.98      0.98      0.98       382
         1.0       0.99      1.00      0.99      1243

    accuracy                           0.99      1625
   macro avg       0.99      0.99      0.99      1625
weighted avg       0.99      0.99      0.99      1625

Data Classification using Gradient Boosting Classifier

A Gradient Boosting classifier is a sequential ensemble of decision trees, where each new tree is trained to correct the errors of the trees built before it.

‘learning_rate’ specifies how much each tree’s contribution is shrunk before being added to the ensemble; you can use it to control underfitting/overfitting of the gradient boosting classifier.

gb_model = GradientBoostingClassifier(n_estimators=500, learning_rate=0.1)
gb_model.fit(X_train, y_train)

‘n_estimators’ specifies the number of boosting stages (decision trees) used to train the gradient boosting classifier.

Once the model is trained, evaluate its performance on the test data set:

gb_y_pred = gb_model.predict(X_test)
print('Accuracy Score:', accuracy_score(y_test, gb_y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, gb_y_pred))
print('Classification Report:\n', classification_report(y_test, gb_y_pred))

Output:

Accuracy Score: 0.9926153846153846
Confusion Matrix:
 [[ 374    8]
 [   4 1239]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.99      0.98      0.98       382
         1.0       0.99      1.00      1.00      1243

    accuracy                           0.99      1625
   macro avg       0.99      0.99      0.99      1625
weighted avg       0.99      0.99      0.99      1625
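
Since boosting adds trees one stage at a time, you can optionally watch how the test accuracy evolves as stages are added using staged_predict; a minimal sketch with the gb_model trained above:

# Test accuracy after every 100th boosting stage (500 stages in total).
for i, staged_pred in enumerate(gb_model.staged_predict(X_test), start=1):
    if i % 100 == 0:
        print(f'{i} trees: test accuracy = {accuracy_score(y_test, staged_pred):.4f}')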

Hyperparameter tuning for Gradient Boosting Classifier

Tuning the hyperparameters for this classifier is more or less similar to that of Random Forest Classifier:

gb_grid_param = {'learning_rate': [0.01, 0.05, 0.1],
                'n_estimators' : [10, 50, 100],
                'max_depth': [10, 11, 12],
                'max_features': [5, 6, 7]}
gb_cv = StratifiedKFold(n_splits=5)
gb_grid = GridSearchCV(GradientBoostingClassifier(), gb_grid_param, cv=gb_cv)
gb_grid.fit(X_train, y_train)

Best parameters and performance of the gradient boosting classifier on the testing set can be evaluated as follows:

print('GB best Parameters:', gb_grid.best_estimator_)
print('GB best Score:', gb_grid.best_score_)

Output:

GB best Parameters: GradientBoostingClassifier(max_depth=10, max_features=5, n_estimators=50)
GB best Score: 0.9963049544569053

Let's evaluate the performance of the optimized Gradient Boosting model on the test data:

gb_y_pred = gb_grid.predict(X_test)
print('Accuracy:', accuracy_score(y_test, gb_y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, gb_y_pred))
print('Classification Report:\n', classification_report(y_test, gb_y_pred))

Output:

Accuracy: 0.9932307692307693
Confusion Matrix:
 [[ 373    9]
 [   2 1241]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.99      0.98      0.99       382
         1.0       0.99      1.00      1.00      1243

    accuracy                           0.99      1625
   macro avg       0.99      0.99      0.99      1625
weighted avg       0.99      0.99      0.99      1625

Important Features of Wine Data

Previously, we trained our classification models using the full feature set of the wine dataset, which is acceptable in some cases. Most of the time, however, we only want to consider the features that play a significant role in identifying the target values, and train the models using those features alone.

This can substantially reduce training time as well as the consumption of computational resources. With cloud-based distributed training platforms now the norm, identifying the most relevant features out of the full feature set is a critical requirement.

For our dataset, we have already identified the critical features during exploratory data analysis i.e., “total sulfur dioxide”, “volatile acidity”, “chlorides”, “sulphates”, “fixed acidity”, “free sulfur dioxide”.
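
If you prefer, essentially the same list can be derived programmatically from the correlations with the target; a minimal sketch assuming an absolute-correlation threshold of 0.45 (an illustrative cut-off, not part of the original analysis):

# Select the input features whose absolute correlation with the target exceeds the threshold.
corr_with_target = wine_data.corr()['target'].drop('target')
features_by_corr = corr_with_target[corr_with_target.abs() > 0.45].index.tolist()
print(features_by_corr)   # matches the six features listed above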

Let’s now extract a dataset consisting of these features only:

features = ["total sulfur dioxide", "volatile acidity", "chlorides", "sulphates", "fixed acidity", "free sulfur dioxide"]
data_X = wine_data[features]
data_y = wine_data['target']

Split the data into Training (75%) and Testing (25%) instances, and standardize the input ‘numeric’ features:

X_train, X_test, y_train, y_test = train_test_split(data_X, data_y, test_size=0.25, random_state=1)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Binary Classification using the Important Features

We will now train the machine learning models on this stripped-down data set, and compare their performance with the models trained on the full-fledged feature set.

So, let’s start with the support vector machines (SVM)…

Support Vector Machine (SVM) Training

Specify the parameters of the SVM model which were obtained after hyperparameter tuning, and let the model train on the new feature set:

svm_model = SVC(C = 10, gamma = 0.01, kernel='rbf')
svm_model.fit(X_train, y_train)

Now, evaluate the performance of the trained SVM model using the Test instances of the dataset:

svm_y_pred = svm_model.predict(X_test)
print('Accuracy Score:', accuracy_score(y_test, svm_y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, svm_y_pred))
print('Classification Report:\n', classification_report(y_test, svm_y_pred))

Output:

Accuracy Score: 0.9815384615384616
Confusion Matrix:
 [[ 369   13]
 [  17 1226]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.96      0.97      0.96       382
         1.0       0.99      0.99      0.99      1243

    accuracy                           0.98      1625
   macro avg       0.97      0.98      0.97      1625
weighted avg       0.98      0.98      0.98      1625

Full-Feature SVM Model Accuracy: 0.994

Important-Features SVM Model Accuracy: 0.982

So, we have got almost the same accuracy with only half the features.

Interesting, right?

Let’s move on to the next training model, i.e., the Random Forest Classifier…

Random Forest Classifier Training

Specify the parameters of the Random Forest model based on the hyperparameter optimization (adjusted slightly for the reduced feature set), and train the model on the important features:

rf_model = RandomForestClassifier(max_depth=10, max_features=6, max_samples=0.4, n_estimators=50)
rf_model.fit(X_train, y_train)

Once the model is trained, evaluate its performance on the test dataset as follows:

rf_y_pred = rf_model.predict(X_test)
print('Accuracy Score:', accuracy_score(y_test, rf_y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, rf_y_pred))
print('Classification Report:\n', classification_report(y_test, rf_y_pred))

Output:

Accuracy Score: 0.9833846153846154
Confusion Matrix:
 [[ 372   10]
 [  17 1226]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.96      0.97      0.96       382
         1.0       0.99      0.99      0.99      1243

    accuracy                           0.98      1625
   macro avg       0.97      0.98      0.98      1625
weighted avg       0.98      0.98      0.98      1625

Full-Feature Random Forest Model Accuracy: 0.990

Important-Features Random Forest Model Accuracy: 0.983

So, we have got almost the same accuracy again. Our magic with the important features seems to be working.

Let’s now train our final model i.e., Gradient Boosting Classifier…

Gradient Boosting Classifier Training

Train the model using parameters based on the hyperparameter optimization (adjusted for the reduced feature set), and evaluate the performance of the model on the testing data set:

gb_model = GradientBoostingClassifier(learning_rate=0.001, n_estimators=10, max_depth=10, max_features=5)
gb_model.fit(X_train, y_train)
gb_y_pred = gb_model.predict(X_test)
print('Accuracy Score:', accuracy_score(y_test, gb_y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, gb_y_pred))
print('Classification Report:\n', classification_report(y_test, gb_y_pred))

Output:

Accuracy Score: 0.9889230769230769
Confusion Matrix:
 [[ 375    7]
 [  11 1232]]
Classification Report:
               precision    recall  f1-score   support

         0.0       0.97      0.98      0.98       382
         1.0       0.99      0.99      0.99      1243

    accuracy                           0.99      1625
   macro avg       0.98      0.99      0.98      1625
weighted avg       0.99      0.99      0.99      1625

Full-Feature Gradient Boosting Model Accuracy: 0.993

Important-Features Gradient Boosting Model Accuracy: 0.988

The Gradient Boosting classifier also performs well even with half of the features stripped off. Nice, isn’t it?

Conclusion

In this tutorial, we developed Support Vector Machine, Random Forest, and Gradient Boosting classification models for the two-class wine dataset. These classification models helped us identify the Red and White wine types using numeric input features. The hyperparameters of all the models were also tuned to optimize classification performance for each algorithm.

Moreover, the models were trained using the most correlated features of the dataset (6 features), and their performance was almost the same as that of the models developed using the full feature set (12 features). These results show that robust models can be trained using the important features only, which is critical for saving computational resources as well as training time.

Models were trained on the training data (75%), and their performance was evaluated on the testing data (25%). An accuracy of approximately 99% showed that all of the trained models were able to differentiate Red Wine from White Wine.
