In this tutorial, you will learn how to process, analyze, and classify three species of Iris plant using one of the most famous data sets in machine learning, the “Iris Data Set”.
Multi-class prediction models will be trained using Support Vector Machines (SVM), Random Forest, and Gradient Boosting algorithms. Not only that, hyper-parameters of all these machine learning algorithms will also be optimized to achieve the best model performance on the given data set.
This tutorial uses the following tools/libraries to classify Iris species with three machine learning algorithms, namely Support Vector Machine (SVM), Random Forest (RF), and Gradient Boosting classifiers:
- Python
- Jupyter Notebook
- scikit-learn
- pandas
- seaborn
- matplotlib
Let’s first make sense of the Iris data itself…
Iris Dataset
The Iris data set contains information about 3 different species of Iris plant, with 50 instances of each species (150 in total). It is a multivariate data set commonly used for classification tasks with numeric input features and a multi-class output.
This dataset is readily available at the UCI Machine Learning Repository and contains 4 input attributes for each instance. We will use these continuous attributes to develop a classification model for the 3 species of Iris plant.
Iris Dataset Link: https://archive.ics.uci.edu/ml/datasets/iris
Input Features
- Sepal Length: continuous; depicts the sepal length of Iris plant in cm
- Sepal Width: continuous; depicts the sepal width of Iris plant in cm
- Petal Length: continuous; depicts the petal length of Iris plant in cm
- Petal Width: continuous; depicts the petal width of Iris plant in cm
Output Features
class: discrete – Iris Setosa, Iris Versicolor, Iris Virginica; depicts the species of Iris plant
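As a quick sanity check of the figures above, here is a minimal sketch that counts the instances, features, and species. It assumes the copy of the data bundled with scikit-learn (loaded through load_iris) rather than the UCI file used in this tutorial, so individual values may differ slightly from the downloaded file:

```python
from collections import Counter

from sklearn.datasets import load_iris

# load_iris() returns the copy of the data set shipped with scikit-learn;
# using it instead of the UCI file is an assumption made for this sketch
iris = load_iris()

n_samples, n_features = iris.data.shape

# Count how many instances belong to each species name
class_counts = Counter(str(iris.target_names[label]) for label in iris.target)

print(n_samples, n_features)
print(dict(class_counts))
```

The counts should agree with the description: 150 instances, 4 input features, and 50 instances for each of the 3 species.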
Article Notebook
Code: Iris Flower Data Classification using Support Vector Machines, Random Forest and Gradient Boosting Algorithms
Dataset Link: https://archive.ics.uci.edu/ml/datasets/iris
Prepared By: Awais Naeem
Copyrights: www.embedded-robotics.com
Disclaimer: This code can be distributed with the proper mention of the owner copyrights
Notebook Link: https://github.com/embedded-robotics/datascience/blob/master/iris_dataset_multiclass_ml/SVM_GB_RF_iris.ipynb
Data Reading and Exploratory Analysis
Let’s first import all python-based libraries necessary to read the data, process it, and then perform further exploratory analysis on it:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt
To embed the matplotlib graphs in the notebook, set the backend of matplotlib to ‘inline’ backend:
%matplotlib inline
Specify the column names of the dataset, and read the dataset using pd.read_csv():
iris_data_columns = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
iris_data = pd.read_csv("iris.data", header=None, names=iris_data_columns)
iris_data.head()
Output:
sepal-length sepal-width petal-length petal-width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
Once you have the data in a proper pandas data frame, the first thing to do is to deal with missing values. The Iris data set is already clean and contains no missing values; however, let’s check it anyway for our own satisfaction:
iris_data.isnull().sum()
Output:
sepal-length 0
sepal-width 0
petal-length 0
petal-width 0
class 0
dtype: int64
Nice! No missing values, eh?
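Had the check reported missing values, one common option would be to drop the affected rows. The sketch below illustrates this on a hypothetical toy frame (it is not part of the Iris data); whether dropping or imputing is the better choice depends on your data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing measurement (not part of the Iris data)
toy = pd.DataFrame({
    "sepal-length": [5.1, np.nan, 4.7],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
})

missing_per_column = toy.isnull().sum()  # number of missing values per column
toy_clean = toy.dropna()                 # drop every row that has a missing value

print(missing_per_column)
print(len(toy_clean))
```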
Let’s now check if all the features have been assigned proper data-types:
iris_data.dtypes
Output:
sepal-length float64
sepal-width float64
petal-length float64
petal-width float64
class object
dtype: object
All the numeric features are assigned a datatype of ‘float64’, and the class feature is assigned ‘object’, which reflects that it holds discrete values in the form of 3 distinct strings.
It seems we are good to perform exploratory data analysis of the Iris data. So, here we go with the pair plot:
sns.pairplot(iris_data, hue='class')
The pair plot shows the pairwise relationships between all the numeric features, with a kernel density estimate (KDE) plot for each individual feature on the diagonal.
Let’s go through plots and try to extract some useful insights/patterns from the dataset:
- Looking at the kde plots for Petal Width and Petal Length, it is evident that Iris Setosa can be differentiated from Iris Versicolor and Iris Virginica by using either of these two features
- The scatter plot between Petal Width and Petal Length depicts a roughly linear relation. Moreover, the classes can be separated almost linearly from each other using these two features
- Petal Width/Length when plotted against Sepal Width/Length clearly separates the Iris setosa from Iris Versicolor or Iris Virginica
- Sepal Width and Sepal Length are weaker separators: they help to distinguish Iris Setosa, but Iris Versicolor and Iris Virginica overlap heavily on them and are not linearly separable using the sepal features alone
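The first observation in the list can also be checked numerically. This is a minimal sketch, again assuming the copy of the data bundled with scikit-learn, in which column 2 holds the petal length and label 0 stands for Iris Setosa:

```python
from sklearn.datasets import load_iris

# Bundled scikit-learn copy of the data; an assumption for this sketch
iris = load_iris()
petal_length = iris.data[:, 2]  # column 2 is the petal length in cm
is_setosa = iris.target == 0    # label 0 is setosa in this copy

setosa_max = petal_length[is_setosa].max()
others_min = petal_length[~is_setosa].min()

# If the longest setosa petal is shorter than the shortest petal of the other
# two species, a single threshold on petal length isolates setosa
print(setosa_max, others_min, setosa_max < others_min)
```

Any threshold between the two printed values separates Iris Setosa from the other two species on petal length alone.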
There are many more insights which you can extract from this data, however, you will need to look deep into it. Let me know in the comments about what you found…
Now is the time to compute the correlation between each numeric feature and the class variable, which helps to tell the most informative features apart from those that matter little. The higher the absolute value of the correlation between a numeric feature and the class variable, the more strongly that feature is (linearly) related to the class. Keep in mind that the encoded class is just an integer ordering of the species, so treat these values as a rough indicator.
So, let’s convert the class variable from ‘object’ to numeric values using a label encoder. The label encoder maps each distinct string to an integer, which the mathematical machine learning algorithms can work with:
class_labels = iris_data['class'].unique()
encoder = LabelEncoder()
iris_data['class'] = encoder.fit_transform(iris_data['class'])
class_labels_encoded = encoder.transform(class_labels)
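To see which integer each species ends up with, you can inspect the fitted encoder. The sketch below uses a separate, hypothetical demo_encoder on a handful of labels so that it does not interfere with the notebook variables:

```python
from sklearn.preprocessing import LabelEncoder

# Separate encoder used only for this illustration
demo_encoder = LabelEncoder()
encoded = demo_encoder.fit_transform(
    ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]
)

# classes_ holds the sorted original labels; the position of a label is its code
mapping = {str(label): code for code, label in enumerate(demo_encoder.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the integer codes
decoded = demo_encoder.inverse_transform(encoded)
print(list(decoded))
```

LabelEncoder assigns the codes in sorted order of the labels, which is why Iris-setosa maps to 0 regardless of where it first appears.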
Now find the correlation between each input feature and the output feature:
iris_data.corr()['class']
Output:
sepal-length 0.782561
sepal-width -0.419446
petal-length 0.949043
petal-width 0.956464
class 1.000000
Name: class, dtype: float64
As already suggested by the exploratory data analysis, Petal Width and Petal Length are the features most strongly correlated with the class.
Since there are only 4 numeric features in this case, we will use all of them to train the machine learning algorithms. For larger datasets, you can drop the unimportant features and keep the ones with the higher ‘absolute’ correlation values, as this reduces your training time as well as the required computation resources.
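For a larger dataset, such a selection could look like the sketch below. The threshold of 0.5 is a hypothetical cut-off chosen only for illustration, and the data again comes from the copy bundled with scikit-learn, so the correlation values may differ slightly from the ones printed above:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()  # bundled copy of the data; an assumption for this sketch
columns = ["sepal-length", "sepal-width", "petal-length", "petal-width"]
frame = pd.DataFrame(iris.data, columns=columns)
frame["class"] = iris.target  # already integer-encoded as 0, 1, 2

# Absolute correlation of every input feature with the encoded class
abs_corr = frame.corr()["class"].drop("class").abs()

threshold = 0.5  # hypothetical cut-off, chosen only for illustration
selected_features = abs_corr[abs_corr >= threshold].index.tolist()
print(selected_features)
```

With this cut-off, only the sepal width is dropped. Correlation with an integer-encoded class is a rough heuristic, so it is worth confirming any such selection with cross-validation.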
Let’s now observe the relationship between each input ‘numeric’ feature and the encoded output feature:
fig,((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2, figsize=(15,9))
ax1.scatter(x = iris_data['class'], y = iris_data['sepal-length'])
ax1.set_ylabel("Sepal Length")
ax1.set_xticks(class_labels_encoded)
ax1.set_xticklabels(class_labels)
ax2.scatter(x = iris_data['class'], y = iris_data['sepal-width'])
ax2.set_ylabel("Sepal Width")
ax2.set_xticks(class_labels_encoded)
ax2.set_xticklabels(class_labels)
ax3.scatter(x = iris_data['class'], y = iris_data['petal-length'])
ax3.set_ylabel("Petal Length")
ax3.set_xticks(class_labels_encoded)
ax3.set_xticklabels(class_labels)
ax4.scatter(x = iris_data['class'], y = iris_data['petal-width'])
ax4.set_ylabel("Petal Width")
ax4.set_xticks(class_labels_encoded)
ax4.set_xticklabels(class_labels)
Now let’s move onto building the actual training models for the data, but before we do that, we need to do some pre-processing on the data.
Data Pre-processing
Firstly, separate the Input features of the data from its output class:
data_X = iris_data.drop('class', axis=1)
data_y = iris_data['class']
Now, split the dataset into Training Data (80%) and Testing Data (20%). Training data will be used by the models for training, whereas Testing Data will be used to gauge the performance of the trained models on the unseen data.
X_train, X_test, y_train, y_test = train_test_split(data_X, data_y, test_size=0.2, random_state=1)
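With a plain random split, the classes are not guaranteed to be equally represented in the test set (the classification reports later in this tutorial show supports of 11, 13, and 6). If you prefer balanced parts, train_test_split accepts a stratify argument. This is an optional variation, sketched here on the copy of the data bundled with scikit-learn, and not what the rest of the tutorial uses:

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # bundled copy; an assumption for this sketch

# stratify=y keeps the class proportions equal in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

test_counts = Counter(y_te.tolist())
print(len(X_tr), len(X_te), dict(test_counts))
```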
Since the input features are numeric and differ in scale, they need to be standardized so that each feature has mean=0 and std=1.
StandardScaler() is a widely used tool for this kind of scaling. It helps scale-sensitive models such as SVM, since every feature then has approximately the same scale.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
We fit the StandardScaler() on the Training Data only and merely apply it to the Testing Data, since fitting the scaler on the Testing Data would lead to data leakage.
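The effect is easy to verify on made-up numbers. The sketch below uses hypothetical random data with two features on very different scales: after scaling, the training part has mean 0 and std 1 exactly, while the test part is only close to that because it reuses the training statistics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: two features on very different scales
train = rng.normal(loc=[5.0, 100.0], scale=[1.0, 25.0], size=(80, 2))
test = rng.normal(loc=[5.0, 100.0], scale=[1.0, 25.0], size=(20, 2))

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = demo_scaler.transform(test)        # reuses the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled.mean(axis=0), test_scaled.std(axis=0))
```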
After the necessary pre-processing, let’s build our first multi-class training model using Support Vector Machine (SVM) and evaluate its performance…
Data Classification using Support Vector Machines
To train the SVM model on the Iris data set, the Support Vector Classifier (SVC) from the scikit-learn library is employed. At first, the SVC() model is fitted on the training data with C=1 and kernel='rbf', which are also the default values:
svm_model = SVC(C=1, kernel='rbf')
svm_model.fit(X_train, y_train)
‘C’ is the regularization parameter which controls underfitting/overfitting. A higher value of ‘C’ means less regularization, which can result in overfitting, and vice versa.
‘kernel’ determines the shape of the decision boundary: ‘rbf’ (the default) can model non-linear boundaries, while other possible values include ‘linear’, ‘poly’, and ‘sigmoid’.
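If you are unsure which kernel suits your data, a quick cross-validated comparison can help. The sketch below is one way to do it, assuming the copy of the data bundled with scikit-learn; it wraps the scaler and the classifier in a pipeline (a choice made for this sketch) so that the scaling is refit on every training fold:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # bundled copy; an assumption for this sketch

kernel_scores = {}
for kernel in ["linear", "rbf", "poly", "sigmoid"]:
    # Scaling sits inside the pipeline so it is refit on each training fold
    model = make_pipeline(StandardScaler(), SVC(C=1, kernel=kernel))
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    kernel_scores[kernel] = scores.mean()

for kernel, score in kernel_scores.items():
    print(f"{kernel}: {score:.3f}")
```

On a dataset this small, differences of one or two percentage points between kernels are within the noise, so treat the ranking as indicative only.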
Once the model is trained, we need to evaluate its performance on the testing data set:
svm_y_pred = svm_model.predict(X_test)
print('Accuracy Score:', accuracy_score(y_test, svm_y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, svm_y_pred))
print('Classification Report:\n', classification_report(y_test, svm_y_pred))
Output:
Accuracy Score: 0.9666666666666667
Confusion Matrix:
[[11 0 0]
[ 0 12 1]
[ 0 0 6]]
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 11
1 1.00 0.92 0.96 13
2 0.86 1.00 0.92 6
accuracy 0.97 30
macro avg 0.95 0.97 0.96 30
weighted avg 0.97 0.97 0.97 30
So, we get an overall accuracy of 96.67%; not bad, eh…
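It is worth knowing how the numbers in the classification report follow from the confusion matrix. The sketch below recomputes accuracy, precision, and recall by hand from the matrix printed above, with rows as true classes and columns as predicted classes:

```python
import numpy as np

# Confusion matrix from the output above: rows are true classes,
# columns are predicted classes
cm = np.array([[11, 0, 0],
               [0, 12, 1],
               [0, 0, 6]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / predicted as class
recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / actually in class

print(round(accuracy, 4))
print(precision.round(2), recall.round(2))
```

The single mistake is one instance of class 1 predicted as class 2, which lowers the recall of class 1 (12 of 13) and the precision of class 2 (6 of 7).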
Hyperparameter tuning for SVM
In the previous approach, we used default parameters of the SVC() model which might not be the best parameters to train the model.
Alternatively, we can evaluate the model over a grid of candidate hyperparameter values and select the best combination based on a metric (accuracy, in this case):
svm_param_grid = {'C':[0.001, 0.01, 0.1, 1, 10, 100, 1000],
'kernel': ['linear', 'rbf','poly'],
'gamma' :[0.0001, 0.001, 0.01, 0.1, 1, 10, 100]}
svm_cv = KFold(n_splits=5)
svm_grid = GridSearchCV(SVC(), svm_param_grid, cv=svm_cv, scoring='accuracy')
svm_grid.fit(X_train, y_train)
SVC() model will be trained using each combination of the hyperparameters specified in ‘svm_param_grid’.
Each combination is evaluated using 5-fold cross-validation: in turn, 4 folds of the training data are used to train the model, and the remaining fold is used to measure the accuracy of the trained model.
By default, ‘GridSearchCV’ then refits the model on the complete training set using the hyperparameters that gave the highest mean cross-validation accuracy.
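To get a feeling for the cost of such a search, you can count the candidates before fitting anything. The sketch below uses ParameterGrid on the same values as ‘svm_param_grid’. The second, split grid is a suggestion of ours rather than part of the original notebook: it relies on the fact that the linear kernel does not use ‘gamma’:

```python
from sklearn.model_selection import ParameterGrid

full_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
             "kernel": ["linear", "rbf", "poly"],
             "gamma": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]}

n_candidates = len(ParameterGrid(full_grid))  # 7 * 3 * 7 combinations
n_fits = n_candidates * 5                     # one fit per fold, before the final refit

# The linear kernel ignores gamma, so a list of grids avoids repeating it
split_grid = [
    {"kernel": ["linear"], "C": full_grid["C"]},
    {"kernel": ["rbf", "poly"], "C": full_grid["C"], "gamma": full_grid["gamma"]},
]
n_candidates_split = len(ParameterGrid(split_grid))

print(n_candidates, n_fits, n_candidates_split)
```

GridSearchCV accepts such a list of dictionaries directly as its parameter grid.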
We can get the parameters which resulted in the maximum accuracy as follows:
print('SVM best Params:', svm_grid.best_params_)
print('SVM best Score:', svm_grid.best_score_)
Output:
SVM best Params: {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}
SVM best Score: 0.9666666666666668
Let’s evaluate the performance of the optimized model on the test dataset:
svm_y_pred = svm_grid.predict(X_test)
print('Accuracy Score:', accuracy_score(y_test, svm_y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, svm_y_pred))
print('Classification Report:\n', classification_report(y_test, svm_y_pred))
Output:
Accuracy Score: 0.9666666666666667
Confusion Matrix:
[[11 0 0]
[ 0 12 1]
[ 0 0 6]]
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 11
1 1.00 0.92 0.96 13
2 0.86 1.00 0.92 6
accuracy 0.97 30
macro avg 0.95 0.97 0.96 30
weighted avg 0.97 0.97 0.97 30
Although our accuracy stays the same even for the optimized model, tuning can make a noticeable difference on larger or more difficult datasets.
Let’s now train a Random Forest Classifier model on the same data set…
Data Classification using Random Forest Classifier
A Random Forest Classifier is an ensemble of many decision tree classifiers whose predictions are combined into a vote in favor of a class (scikit-learn does this by averaging the class probabilities predicted by the trees). You can train a Random Forest classifier using the scikit-learn library in Python:
rf_model = RandomForestClassifier(n_estimators=100)
rf_model.fit(X_train, y_train)
‘n_estimators’ specifies the number of decision trees in the Random Forest classifier.
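To make the combination of the individual trees more tangible, the sketch below reproduces the forest's prediction by hand: it averages the class probabilities of every tree in estimators_ and picks the most probable class. It assumes the copy of the data bundled with scikit-learn and a hypothetical forest of 25 trees:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)  # bundled copy; an assumption for this sketch

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Average the class probabilities predicted by every tree in the forest
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
averaged = tree_probas.mean(axis=0)

# The class with the highest averaged probability is the forest's prediction
manual_pred = forest.classes_[averaged.argmax(axis=1)]
matches_forest = np.array_equal(manual_pred, forest.predict(X))
print(matches_forest)
```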
Let’s observe the accuracy of the Random Forest model on the test data set:
rf_y_pred = rf_model.predict(X_test)
print('Accuracy Score:', accuracy_score(y_test, rf_y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, rf_y_pred))
print('Classification Report:\n', classification_report(y_test, rf_y_pred))
Output:
Accuracy Score: 0.9666666666666667
Confusion Matrix:
[[11 0 0]
[ 0 12 1]
[ 0 0 6]]
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 11
1 1.00 0.92 0.96 13
2 0.86 1.00 0.92 6
accuracy 0.97 30
macro avg 0.95 0.97 0.96 30
weighted avg 0.97 0.97 0.97 30
Hyperparameter tuning for Random Forest
Similar to what we did for the SVM model, we will tune the hyperparameters of the Random Forest Classifier using 5-fold cross-validation via GridSearchCV():
rf_param_grid = {'max_samples': [0.1, 0.2, 0.3, 0.4],
'max_features': [1, 2],
'n_estimators':[10, 50, 100],
'max_depth': [8, 9, 10]
}
rf_cv = KFold(n_splits=5)
rf_grid = GridSearchCV(RandomForestClassifier(), rf_param_grid, cv=rf_cv)
rf_grid.fit(X_train, y_train)
‘max_samples’ denotes the fraction of the training samples that is drawn (with replacement) to train each individual decision tree.
‘max_features’ is the number of features considered when looking for the best split. Since there are only 4 input features in the Iris dataset, we restrict it to ‘1’ or ‘2’ so that the individual trees are less correlated with each other.
‘max_depth’ denotes the maximum depth of the decision steps for each individual decision tree in the Random Forest model.
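Note that neither the classifier nor the grid search above sets ‘random_state’, so the selected parameters and scores may change from run to run. If you need reproducible results, you can fix the seed. The sketch below shows the idea on the copy of the data bundled with scikit-learn; the helper fit_and_predict is hypothetical and exists only for this illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)  # bundled copy; an assumption for this sketch


def fit_and_predict(seed):
    # max_samples and max_features mirror values from the grid above
    model = RandomForestClassifier(
        n_estimators=50, max_samples=0.4, max_features=2, random_state=seed
    )
    return model.fit(X, y).predict_proba(X)


# Two fits with the same seed give identical probability estimates
same_seed_equal = np.array_equal(fit_and_predict(42), fit_and_predict(42))
print(same_seed_equal)
```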
After the model is trained, you can get the best model parameters as well as evaluate the performance of the trained model on the test data set:
print('RF best Parameters:', rf_grid.best_estimator_)
print('RF best Score:', rf_grid.best_score_)
Output:
RF best Parameters: RandomForestClassifier(max_depth=8, max_features=1, max_samples=0.1,
n_estimators=50)
RF best Score: 0.9666666666666668
rf_y_pred = rf_grid.predict(X_test)
print('Accuracy:', accuracy_score(y_test, rf_y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, rf_y_pred))
print('Classification Report:\n', classification_report(y_test, rf_y_pred))
Output:
Accuracy: 0.9666666666666667
Confusion Matrix:
[[11 0 0]
[ 0 12 1]
[ 0 0 6]]
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 11
1 1.00 0.92 0.96 13
2 0.86 1.00 0.92 6
accuracy 0.97 30
macro avg 0.95 0.97 0.96 30
weighted avg 0.97 0.97 0.97 30
Data Classification using Gradient Boosting Classifier
A Gradient Boosting classifier is a sequential ensemble of Decision Trees, where each new tree is trained to correct the errors made by the trees before it.
‘learning_rate’ scales the contribution of each tree to the ensemble. Together with the number of trees, it lets you control underfitting/overfitting of the gradient boosting classifier.
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
gb_model.fit(X_train, y_train)
‘n_estimators’ specifies the number of boosting stages to perform. For a multi-class problem such as this one, scikit-learn fits one regression tree per class at each stage.
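To see how the number of boosting stages affects the result, you can track the test accuracy stage by stage with staged_predict. The sketch below assumes the copy of the data bundled with scikit-learn and unscaled features, which is fine for tree-based models:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # bundled copy; an assumption for this sketch
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

booster = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, random_state=0
).fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after each boosting stage
staged_accuracy = [
    accuracy_score(y_te, pred) for pred in booster.staged_predict(X_te)
]

for n_stages in (1, 10, 50, 100):
    print(n_stages, round(staged_accuracy[n_stages - 1], 3))
```

If the accuracy stops improving long before the last stage, a smaller ‘n_estimators’ is likely sufficient.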
gb_y_pred = gb_model.predict(X_test)
print('Accuracy Score:', accuracy_score(y_test, gb_y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, gb_y_pred))
print('Classification Report:\n', classification_report(y_test, gb_y_pred))
Output:
Accuracy Score: 0.9666666666666667
Confusion Matrix:
[[11 0 0]
[ 0 12 1]
[ 0 0 6]]
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 11
1 1.00 0.92 0.96 13
2 0.86 1.00 0.92 6
accuracy 0.97 30
macro avg 0.95 0.97 0.96 30
weighted avg 0.97 0.97 0.97 30
Hyperparameter tuning for Gradient Boosting Classifier
Tuning the hyperparameters for this classifier is more or less similar to that of Random Forest Classifier:
gb_grid_param = {'learning_rate': [0.01, 0.05, 0.1, 1],
'n_estimators' : [10, 50, 100, 500, 1000],
'max_depth': [2, 5, 8, 11],
'max_features': [1,2]}
gb_cv = KFold(n_splits=5)
gb_grid = GridSearchCV(GradientBoostingClassifier(), gb_grid_param, cv=gb_cv)
gb_grid.fit(X_train, y_train)
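The grid above contains 4 x 5 x 4 x 2 = 160 combinations, i.e. 800 model fits with 5-fold cross-validation, some of them with 1000 boosting stages, so it can take a while. A cheaper alternative (not used in the original notebook) is RandomizedSearchCV, which evaluates only a random subset of the combinations. Here is a minimal sketch with a smaller, hypothetical search space on the copy of the data bundled with scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, RandomizedSearchCV

X, y = load_iris(return_X_y=True)  # bundled copy; an assumption for this sketch

# Smaller hypothetical search space so the sketch runs quickly
param_space = {"learning_rate": [0.01, 0.05, 0.1, 1],
               "n_estimators": [10, 50, 100],
               "max_depth": [2, 5, 8],
               "max_features": [1, 2]}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_space,
    n_iter=5,  # evaluate only 5 randomly chosen combinations
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

n_evaluated = len(search.cv_results_["params"])
print(n_evaluated, search.best_params_, round(search.best_score_, 3))
```

A random search may miss the single best combination, but it often gets close at a fraction of the cost.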
Best parameters and performance of the gradient boosting classifier on the testing set can be evaluated as follows:
print('GB best Parameters:', gb_grid.best_estimator_)
print('GB best Score:', gb_grid.best_score_)
Output:
GB best Parameters: GradientBoostingClassifier(learning_rate=1, max_depth=5, max_features=1,
n_estimators=500)
GB best Score: 0.95
gb_y_pred = gb_grid.predict(X_test)
print('Accuracy:', accuracy_score(y_test, gb_y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, gb_y_pred))
print('Classification Report:\n', classification_report(y_test, gb_y_pred))
Output:
Accuracy: 0.9666666666666667
Confusion Matrix:
[[11 0 0]
[ 0 12 1]
[ 0 0 6]]
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 11
1 1.00 0.92 0.96 13
2 0.86 1.00 0.92 6
accuracy 0.97 30
macro avg 0.95 0.97 0.96 30
weighted avg 0.97 0.97 0.97 30
Conclusion
In this tutorial, we developed Support Vector Machine, Random Forest, and Gradient Boosting classification models for the multi-class Iris data set. These classification models helped us identify the three species of Iris plant using four input features. The hyperparameters of all the models were also tuned using 5-fold cross-validation.
Models were trained on the training data (80%), and their performance was evaluated using the testing data (20%). An accuracy of approximately 97% (29 of 30 test instances) showed that all of the trained models classified the Iris species correctly in all but one test case.