A Comprehensive Guide to Machine Learning: Regression and Classification with the MNIST Dataset
Introduction to Supervised Learning: Regression and Classification
In the realm of machine learning, supervised learning involves training a model on a labeled dataset, which means the dataset includes both input data and the corresponding output labels. Supervised learning tasks can be broadly categorized into two types: regression and classification.
Regression tasks aim to predict continuous numerical values. For example, predicting house prices based on various features such as location, size, and number of bedrooms. The output is a continuous value that can range over an infinite set of possible values. Common regression algorithms include linear regression, decision trees, and support vector regression.
Classification, on the other hand, deals with predicting discrete categorical values. The goal is to assign input data to one of several predefined classes. For instance, classifying emails as either spam or not spam, or recognizing handwritten digits as one of the digits from 0 to 9. The output is a discrete value representing the class label. Popular classification algorithms include logistic regression, support vector machines, decision trees, and neural networks.
The MNIST Dataset: A Benchmark for Classification
The MNIST dataset is a widely used benchmark in the field of machine learning. It contains 70,000 images of handwritten digits, with each image being a 28×28 pixel grid. This results in 784 features per image. The dataset is split into 60,000 training images and 10,000 test images, each labeled with the corresponding digit (0-9). The MNIST dataset is often used to evaluate and compare the performance of different classification algorithms.
Loading and Exploring the MNIST Dataset
To begin working with the MNIST dataset, we first load the data and inspect its structure. This involves checking the dimensions of the data arrays to ensure they match the expected format (70,000 samples, each with 784 features for the pixel values). Visualizing the data is also crucial. By plotting some of the images, we can confirm that the data is correctly loaded and understand the visual patterns associated with different digits.
Training a Binary Classifier
Binary classification involves distinguishing between two classes. To illustrate this, we can train a classifier to identify the digit ‘5’ from all other digits. The process involves the following steps:
Preparing the Data: We create a binary target vector where the label is 1 if the digit is ‘5’ and 0 otherwise. This transforms our multiclass problem into a binary classification problem.
Training the Classifier: We use a machine learning algorithm, such as stochastic gradient descent (SGD), to train the model. The algorithm iteratively adjusts the model parameters to minimize the difference between the predicted and actual labels. During training, the model learns patterns and features that are indicative of the digit ‘5’, such as the loop and vertical line structure.
Making Predictions: Once trained, the model can predict whether a new image is a ‘5’ or not by applying the learned decision boundary. This boundary separates the feature space into two regions, one for each class.
Evaluating the Classifier
To ensure the classifier performs well, we need to evaluate its performance using various metrics and techniques.
Cross-Validation
This technique assesses the model’s ability to generalize to an independent dataset. The data is split into multiple folds, and the model is trained and evaluated on different combinations of these folds. This helps in obtaining a more reliable estimate of the model’s performance.
Confusion Matrix
A confusion matrix provides a detailed breakdown of the classifier’s performance by comparing actual and predicted labels. It includes counts of true positives, false positives, true negatives, and false negatives. This matrix helps us understand the types of errors the classifier is making and how often they occur.
Precision and Recall
Precision measures the accuracy of positive predictions, i.e., the proportion of true positive predictions out of all positive predictions. Recall measures the ability of the classifier to identify all positive instances, i.e., the proportion of true positive predictions out of all actual positives. These metrics are particularly useful for imbalanced datasets where some classes are much more frequent than others.
Precision/Recall Trade-off
By adjusting the decision threshold, we can find a balance between precision and recall. Lowering the threshold increases recall but decreases precision, while raising the threshold does the opposite. Precision-recall curves help visualize this trade-off and select an optimal threshold based on the specific requirements of the application.
ROC Curve and AUC Score
The ROC (Receiver Operating Characteristic) curve plots the true positive rate (recall) against the false positive rate at various threshold settings. The area under the ROC curve (AUC) provides a single number to summarize the model’s performance. A higher AUC indicates better performance. The ROC curve and AUC score are particularly useful for comparing different classifiers.
Multiclass Classification
After mastering binary classification, we extend the model to distinguish between all ten digits (0-9). This is known as multiclass classification. Two common strategies for multiclass classification are:
One-vs-All (OvA)
This approach involves training a separate binary classifier for each digit. Each classifier distinguishes one digit from all other digits. During prediction, the classifier with the highest score determines the final class label.
One-vs-One (OvO)
This approach involves training a binary classifier for every pair of digits. For ten digits, this results in 45 classifiers. During prediction, a voting scheme determines the final class label based on the outputs of all classifiers.
Error Analysis
Analyzing errors is crucial for improving model performance. We use confusion matrices to understand the types of errors the classifier makes. For example, a confusion matrix might reveal that the classifier frequently confuses ‘3’ with ‘8’. By visualizing these misclassified images, we can identify patterns and features that lead to errors. This insight helps us refine the model and improve its accuracy.
Multilabel and Multioutput Classification
In some tasks, each instance can belong to multiple classes simultaneously (multilabel classification) or have multiple outputs (multioutput classification).
Multioutput Classification
This involves predicting multiple outputs for each instance. For example, predicting both the digit and the intensity level of the handwriting. The model outputs a vector of predictions, and the performance is evaluated using appropriate metrics for each output.
Conclusion
In this comprehensive guide, we have explored the essential steps involved in building, evaluating, and improving a machine learning classifier using the MNIST dataset. We started with an introduction to regression and classification, moved on to exploring and visualizing the dataset, and then delved into training classifiers and evaluating their performance using various metrics. Finally, we discussed advanced topics like multiclass, multilabel, and multioutput classification, along with error analysis.
By understanding these concepts and applying these techniques, you can develop robust machine learning models and tackle various classification problems effectively. Whether you are working with simple binary classification tasks or complex multiclass and multioutput problems, the methods and insights discussed here will serve as a solid foundation for your machine learning journey.
If you have any questions or need further clarification, feel free to ask in the comments. Happy learning and coding!
Now it’s time to code!
# Part 1: Importing Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import confusion_matrix, precision_score, recall_score, precision_recall_curve, roc_curve, roc_auc_score
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multioutput import ClassifierChain
from sklearn.svm import SVC # Importing SVC
# Part 2: Loading and Exploring the MNIST Dataset
# Load the MNIST dataset
mnist = fetch_openml('mnist_784', as_frame=False)
X, y = mnist.data, mnist.target
# Check the shape of the data
print(X.shape) # Output: (70000, 784)
print(y.shape) # Output: (70000,)
# Visualize one of the digits
some_digit = X[0]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap='binary')
plt.axis('off')
plt.show()
# Part 3: Training a Binary Classifier
# Convert the string labels to integers
y = y.astype(np.int8)
# Split the data into training and test sets
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]
# Create binary target vectors: True for digit '5', False for all other digits
y_train_5, y_test_5 = (y_train == 5), (y_test == 5)
# Train a SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
# Part 4: Evaluating the Classifier
# Perform cross-validation
cross_val_scores = cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring='accuracy')
print("Cross-validation accuracy:", cross_val_scores)
# Make cross-validated predictions
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
# Confusion Matrix
conf_mx = confusion_matrix(y_train_5, y_train_pred)
print("Confusion Matrix:\n", conf_mx)
# Precision and Recall
precision = precision_score(y_train_5, y_train_pred)
recall = recall_score(y_train_5, y_train_pred)
print("Precision:", precision)
print("Recall:", recall)
# Part 5: Precision/Recall Trade-off
# Get decision scores
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")
# Compute precision-recall curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
# Plot precision-recall curve
plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
plt.xlabel("Threshold")
plt.legend(loc="center right")
plt.ylim([0, 1])
plt.show()
# Part 6: ROC Curve and AUC Score
# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
# Plot ROC curve
plt.plot(fpr, tpr, linewidth=2)
plt.plot([0, 1], [0, 1], 'k--')
plt.axis([0, 1, 0, 1])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
# Compute ROC AUC score
roc_auc = roc_auc_score(y_train_5, y_scores)
print("ROC AUC Score:", roc_auc)
# Part 7: Multiclass Classification
# One-vs-All (OvA) Strategy
ovr_clf = OneVsRestClassifier(SVC(random_state=42))
ovr_clf.fit(X_train, y_train)
print("OvA Classifier predictions for first digit:", ovr_clf.predict([some_digit]))
# One-vs-One (OvO) Strategy
ovo_clf = OneVsOneClassifier(SVC(random_state=42))
ovo_clf.fit(X_train, y_train)
print("OvO Classifier predictions for first digit:", ovo_clf.predict([some_digit]))
# Part 8: Error Analysis
# Confusion Matrix for multiclass classifier
y_train_pred_full = cross_val_predict(ovr_clf, X_train, y_train, cv=3)
conf_mx_full = confusion_matrix(y_train, y_train_pred_full)
plt.matshow(conf_mx_full, cmap=plt.cm.gray)
plt.show()
# Part 9: Multilabel and Multioutput Classification
# Multilabel classification: Creating multiple binary labels
y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]
# Training a KNeighborsClassifier for multilabel classification
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
# Multioutput classification: Predicting multiple outputs
chain_clf = ClassifierChain(SVC(random_state=42))
chain_clf.fit(X_train, y_multilabel)
print("Predictions for first digit (multilabel):", knn_clf.predict([some_digit]))
print("Predictions for first digit (multioutput):", chain_clf.predict([some_digit]))
Let’s explain the code step by step so you can understand why each part is used:
Part 1: Code with Detailed Explanation (Steps 1 to 6)
Importing Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import confusion_matrix, precision_score, recall_score, precision_recall_curve, roc_curve, roc_auc_score
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multioutput import ClassifierChain
from sklearn.svm import SVC # Importing SVC
Explanation:
In this section, we import essential libraries and modules for numerical operations, data visualization, and machine learning tasks:
numpy: A library for numerical operations, providing support for arrays and matrices.
matplotlib.pyplot: A plotting library for creating static, animated, and interactive visualizations.
sklearn.datasets: Provides functions to fetch datasets like MNIST.
sklearn.linear_model: Contains the SGDClassifier for training linear models using stochastic gradient descent.
sklearn.model_selection: Offers tools like cross_val_score and cross_val_predict for model evaluation using cross-validation.
sklearn.metrics: Provides functions for evaluating model performance, including confusion matrix, precision, recall, precision-recall curve, ROC curve, and AUC score.
sklearn.multiclass: Contains strategies for multiclass classification, such as OneVsRestClassifier and OneVsOneClassifier.
sklearn.neighbors: Includes the KNeighborsClassifier for k-nearest neighbors classification.
sklearn.multioutput: Provides tools for multioutput classification, including ClassifierChain.
sklearn.svm: Contains the SVC (Support Vector Classifier) for classification tasks.
Loading and Exploring the MNIST Dataset
# Load the MNIST dataset
mnist = fetch_openml('mnist_784', as_frame=False)
X, y = mnist.data, mnist.target
# Check the shape of the data
print(X.shape) # Output: (70000, 784)
print(y.shape) # Output: (70000,)
# Visualize one of the digits
some_digit = X[0]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap='binary')
plt.axis('off')
plt.show()
Explanation:
We load the MNIST dataset using fetch_openml. The dataset consists of 70,000 images of handwritten digits, each represented as a 28×28 pixel grid (resulting in 784 features per image).
We print the shape of the data to ensure it matches the expected dimensions.
We visualize one of the digits to verify the data is loaded correctly and to understand the visual patterns associated with different digits.
Training a Binary Classifier
# Convert the string labels to integers
y = y.astype(np.int8)
# Split the data into training and test sets
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]
# Create binary target vectors: True for digit '5', False for all other digits
y_train_5, y_test_5 = (y_train == 5), (y_test == 5)
# Train a SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
Explanation:
We convert the string labels to integers and split the dataset into training and test sets: the first 60,000 samples are used for training, and the remaining 10,000 are used for testing.
We then create binary target vectors where the label is True if the digit is ‘5’ and False otherwise, transforming our multiclass problem into a binary classification problem.
We train an SGDClassifier (Stochastic Gradient Descent Classifier) on the training data. This classifier is well-suited for large datasets and performs online learning, adjusting model parameters iteratively.
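To make the online-learning point concrete, here is a minimal sketch (not part of the original pipeline) that feeds the training data to an SGDClassifier in mini-batches via partial_fit; the batch size is an arbitrary choice for illustration:
# Minimal online-learning sketch (illustrative): feed the training data in mini-batches
online_clf = SGDClassifier(random_state=42)
batch_size = 1000  # arbitrary batch size for demonstration
for start in range(0, len(X_train), batch_size):
    X_batch = X_train[start:start + batch_size]
    y_batch = y_train_5[start:start + batch_size]
    # classes must be passed on the first call so the model knows all labels up front
    online_clf.partial_fit(X_batch, y_batch, classes=[False, True])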
How SGDClassifier Works (a simplified sketch follows these steps):
Initialization: Initialize the model parameters (weights) to small random values.
Training: For each training sample, the classifier calculates the prediction, computes the error (difference between predicted and actual label), and updates the weights to minimize this error. This process is repeated for each sample in the training set.
Iterations: The training process involves multiple iterations (epochs) over the training data to refine the model parameters.
Gradient Descent: The classifier uses gradient descent to find the optimal parameters by moving in the direction that reduces the error.
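The following pure-NumPy sketch shows a single SGD update for a linear model, assuming a logistic loss for simplicity (the actual SGDClassifier defaults to a hinge loss and adds regularization and a learning-rate schedule):
# Simplified sketch of one SGD step for a linear model with logistic loss (illustrative only)
w = np.zeros(784)          # weights, initialized (here to zeros for simplicity)
b = 0.0                    # bias term
eta = 0.1                  # learning rate (arbitrary choice)
x_i = X_train[0] / 255.0   # one training sample, scaled to [0, 1]
t_i = float(y_train_5[0])  # its binary label (1.0 if '5', else 0.0)
p = 1 / (1 + np.exp(-(np.dot(w, x_i) + b)))  # predicted probability (sigmoid)
error = p - t_i            # gradient of the logistic loss w.r.t. the raw score
w -= eta * error * x_i     # move the weights against the gradient
b -= eta * error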
Cross-Validation: We use cross_val_score to assess the model’s performance. Cross-validation splits the training data into multiple folds, trains the model on different combinations of these folds, and evaluates its accuracy.
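To see what cross_val_score does under the hood, here is a rough sketch of the equivalent manual loop using StratifiedKFold; each fold trains a fresh clone of the classifier and measures accuracy on the held-out fold:
# Manual cross-validation sketch: roughly what cross_val_score does internally
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3)
for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)  # fresh, unfitted copy of the classifier
    clone_clf.fit(X_train[train_index], y_train_5[train_index])
    y_pred = clone_clf.predict(X_train[test_index])
    accuracy = (y_pred == y_train_5[test_index]).mean()
    print("Fold accuracy:", accuracy)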
Cross-Validated Predictions: We use cross_val_predict to obtain predictions for each training instance. This function returns predictions that are generated by a model trained on different folds than the instance being predicted.
Confusion Matrix: We compute the confusion matrix using confusion_matrix to understand the types of errors the classifier makes. The confusion matrix includes counts of true positives (correctly identified ‘5’s), false positives (incorrectly identified as ‘5’s), true negatives (correctly identified as not ‘5’s), and false negatives (incorrectly identified as not ‘5’s).
Precision and Recall: We calculate precision and recall scores to evaluate the model’s ability to identify positive instances (digits ‘5’) and the accuracy of its positive predictions. Precision is the ratio of true positive predictions to the total positive predictions, while recall is the ratio of true positive predictions to the total actual positives.
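As a quick sanity check, precision and recall can also be derived by hand from the confusion matrix computed earlier; this assumes scikit-learn’s binary layout [[TN, FP], [FN, TP]]:
# Deriving precision and recall by hand from the confusion matrix (sanity check)
tn, fp, fn, tp = conf_mx.ravel()   # scikit-learn's binary layout: [[TN, FP], [FN, TP]]
manual_precision = tp / (tp + fp)  # accuracy of the positive predictions
manual_recall = tp / (tp + fn)     # share of actual '5's that were found
print("Manual precision:", manual_precision)
print("Manual recall:", manual_recall)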
Decision Scores: We obtain the decision scores for each instance using cross_val_predict with method="decision_function". These scores represent the confidence level of the classifier for each prediction.
Precision-Recall Curve: We compute the precision-recall curve using precision_recall_curve, which returns precision, recall, and thresholds. This curve helps visualize the trade-off between precision and recall for different decision thresholds.
Plotting: We plot the precision-recall curve to observe how precision and recall change with varying thresholds. Lowering the threshold increases recall but decreases precision, and vice versa.
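For example, if an application requires at least 90% precision, a sketch like the following picks the lowest threshold that reaches that target (the 90% figure is purely illustrative):
# Sketch: choose the lowest threshold that yields at least 90% precision (target is illustrative)
idx_90_precision = (precisions >= 0.90).argmax()        # first index reaching the target
threshold_90_precision = thresholds[idx_90_precision]
y_train_pred_90 = (y_scores >= threshold_90_precision)  # apply the custom threshold manually
print("Precision at threshold:", precision_score(y_train_5, y_train_pred_90))
print("Recall at threshold:", recall_score(y_train_5, y_train_pred_90))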
ROC Curve: We compute the ROC curve using roc_curve, which plots the true positive rate (recall) against the false positive rate at various threshold settings. The ROC curve helps visualize the trade-off between true positive rate and false positive rate.
Plotting: We plot the ROC curve along with a diagonal line representing a random classifier. The closer the ROC curve is to the top-left corner, the better the classifier’s performance.
AUC Score: We calculate the area under the ROC curve (AUC) using roc_auc_score. A higher AUC indicates better performance, summarizing the model’s ability to distinguish between positive and negative instances.
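Since AUC is mainly useful for comparing models, here is a hedged sketch comparing our SGD classifier with a KNeighborsClassifier (already imported above); KNN has no decision_function, so its cross-validated positive-class probabilities serve as scores. Note that KNN is slow on all 60,000 training images:
# Sketch: comparing classifiers by ROC AUC (KNN is slow on 60,000 samples)
knn_for_roc = KNeighborsClassifier()
y_probas_knn = cross_val_predict(knn_for_roc, X_train, y_train_5, cv=3, method="predict_proba")
y_scores_knn = y_probas_knn[:, 1]  # probability of the positive class as a score
print("SGD ROC AUC:", roc_auc_score(y_train_5, y_scores))
print("KNN ROC AUC:", roc_auc_score(y_train_5, y_scores_knn))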
You might now ask a question: we already drew the number 5 at the beginning, so why did we continue? Let’s explain further for better clarification:
MNIST Number 5 Visualization and Classification
When we visualize a sample image of the number 5 from the MNIST dataset and see that it displays correctly, it serves a few specific purposes, but it doesn’t mean that our classifier is already trained or evaluated. Here’s a detailed explanation of why we continue with the subsequent steps even after visualizing the number 5 correctly:
Purpose of Initial Visualization
Data Integrity Check: Visualizing the number 5 ensures that the data has been loaded correctly and the image is properly formatted.
Understanding the Dataset: It helps us understand the structure of the data we’re working with (28×28 pixel images, grayscale values, etc.).
Human Intuition: It gives us a visual intuition of what the model will be learning to recognize.
Why We Continue
Visualizing the number 5 is just the first step. Here’s why we need to continue with the training and evaluation steps:
Model Training: We need to train the model on the entire dataset, not just one sample, so that it learns to generalize from the data. Training involves feeding thousands of labeled images to the model and adjusting its parameters to minimize prediction errors.
Model Evaluation: After training, we need to evaluate how well the model performs on unseen data. This involves checking its accuracy, precision, recall, and other metrics on a test set.
Generalization: Just because the visualization of a number 5 looks correct doesn’t mean our model can correctly identify all instances of the number 5. We need to ensure the model can generalize well to new, unseen images of 5.
Error Analysis: We need to understand and analyze the errors the model makes to improve it further. This involves looking at confusion matrices and specific misclassifications.
Detailed Process After Visualization
Here’s a step-by-step breakdown of what we do after visualizing the number 5 and why each step is necessary:
1. Preparing the Data for Binary Classification
We convert the string labels to integers so they can be compared numerically.
# Convert the string labels to integers
y = y.astype(np.int8)
Why?
The labels are loaded as strings, so converting them to integers lets us work with them numerically in the steps below.
2. Splitting the Data
We split the data into training and test sets and then create the binary target vectors for each split, so the model can be evaluated on unseen data.
# Split the data into training and test sets
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]
# Create binary target vectors: True for digit '5', False for all other digits
y_train_5, y_test_5 = (y_train == 5), (y_test == 5)
Why?
To prevent overfitting and to keep a separate set of data for evaluation; creating the binary vectors per split transforms our multiclass problem into a simpler binary classification problem.
3. Training the Model
We train a Stochastic Gradient Descent (SGD) classifier on the training data.
# Train a SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
Why?
To create a model that can learn the patterns distinguishing the digit 5 from other digits.
4. Evaluating the Model
We perform cross-validation to evaluate the model’s performance and compute precision and recall.
Why?
To evaluate the model’s performance across different thresholds and to compare it with other models.
The visualization step is just a preliminary check. The real work involves training the model on a large dataset, evaluating its performance using various metrics, and making improvements based on detailed error analysis. This ensures that the model can accurately and reliably identify the digit 5 from a variety of handwritten digits.
Let’s continue with a detailed explanation of why we used additional codes after the initial steps and how they contribute to our overall goal of building, evaluating, and improving a machine learning model using the MNIST dataset.
Steps After Initial Evaluation
After we evaluated the binary classifier for identifying the number 5, we proceeded with several additional steps to enhance our understanding and expand the capabilities of our model. Here’s why we continued and what each subsequent step entails:
7. Multiclass Classification
Objective: Extend the model to recognize all digits (0-9), not just the digit 5.
One-vs-All (OvA) Strategy
# One-vs-All (OvA) Strategy
ovr_clf = OneVsRestClassifier(SVC(random_state=42))
ovr_clf.fit(X_train, y_train)
print("OvA Classifier predictions for first digit:", ovr_clf.predict([some_digit]))
Why?
One-vs-All (OvA): This strategy involves training a separate binary classifier for each class. Each classifier distinguishes one digit from all other digits. During prediction, the classifier with the highest score determines the final class label.
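To observe the “highest score wins” rule directly, we can inspect the ten per-class scores behind a single OvA prediction (this assumes ovr_clf has been fitted as above):
# Sketch: inspect the ten per-class decision scores behind an OvA prediction
some_digit_scores = ovr_clf.decision_function([some_digit])
print("Per-class scores:", some_digit_scores)
print("Predicted class:", some_digit_scores.argmax())  # index of the highest score = class label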
One-vs-One (OvO) Strategy
# One-vs-One (OvO) Strategy
ovo_clf = OneVsOneClassifier(SVC(random_state=42))
ovo_clf.fit(X_train, y_train)
print("OvO Classifier predictions for first digit:", ovo_clf.predict([some_digit]))
Why?
One-vs-One (OvO): This strategy involves training a binary classifier for every pair of digits, resulting in 45 classifiers for ten digits. A voting scheme determines the final class label based on the outputs of all classifiers.
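A one-line check (after fitting) confirms the 45-classifier claim, since 10 classes yield 10 × 9 / 2 = 45 pairs:
# Sketch: verify that OvO trained one classifier per pair of classes
print("Number of OvO classifiers:", len(ovo_clf.estimators_))  # 10 * 9 / 2 = 45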
8. Error Analysis
Objective: Understand and visualize the types of errors made by the multiclass classifier to improve its performance.
Error Analysis: We perform error analysis by visualizing the confusion matrix for the multiclass classifier. This helps us understand which digits are being confused with each other and identify patterns or areas where the model can be improved.
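A common refinement, sketched below (it is not in the original listing), is to normalize each row of the confusion matrix by the number of images in that class and zero out the diagonal, so only the error rates remain visible:
# Sketch: plot error rates instead of raw counts to highlight confusions
row_sums = conf_mx_full.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx_full / row_sums       # fraction of each true class per predicted class
np.fill_diagonal(norm_conf_mx, 0)            # hide correct predictions to emphasize errors
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)  # brighter cells = more frequent confusions
plt.show()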
9. Multilabel and Multioutput Classification
Objective: Handle tasks where each instance can belong to multiple classes simultaneously (multilabel classification) or have multiple outputs (multioutput classification).
Multilabel Classification: Each instance can belong to multiple classes simultaneously. For example, we create binary labels indicating whether a digit is large (7, 8, 9) and whether it is odd. We train a KNeighborsClassifier for this task.
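One reasonable way to evaluate this multilabel model is a macro-averaged F1 score over cross-validated predictions; the averaging strategy is a judgment call, and this run is slow on the full training set:
# Sketch: evaluate the multilabel classifier with a macro-averaged F1 score (slow on 60,000 samples)
from sklearn.metrics import f1_score

y_multilabel_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
print("Macro F1:", f1_score(y_multilabel, y_multilabel_pred, average="macro"))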
Multioutput Classification
# Multioutput classification: Predicting multiple outputs
chain_clf = ClassifierChain(SVC(random_state=42))
chain_clf.fit(X_train, y_multilabel)
print("Predictions for first digit (multilabel):", knn_clf.predict([some_digit]))
print("Predictions for first digit (multioutput):", chain_clf.predict([some_digit]))
Why?
Multioutput Classification: Each instance can have multiple outputs. We use ClassifierChain with SVC to predict multiple outputs for each instance. This method chains together multiple classifiers, with each classifier making predictions based on the input features and all previous classifiers in the chain.
Detailed Explanation of Key Terms
SGDClassifier
Stochastic Gradient Descent (SGD): An iterative method for optimizing an objective function with suitable smoothness properties. In the context of SGDClassifier, it’s used to fit linear classifiers like logistic regression or linear SVMs. SGD updates the model parameters iteratively for each training sample, which makes it suitable for large-scale datasets.
Cross-Validation
Purpose: To evaluate the model’s ability to generalize to an independent dataset. This is done by splitting the data into multiple folds and training/testing the model on different combinations of these folds.
Method: cross_val_score and cross_val_predict functions help perform cross-validation, providing a robust estimate of model performance.
Confusion Matrix
Purpose: To provide a detailed breakdown of the classifier’s performance by comparing actual and predicted labels.
Precision and Recall
Precision: The proportion of true positive predictions out of all positive predictions. High precision means a low false positive rate.
Recall: The proportion of true positive predictions out of all actual positives. High recall means a low false negative rate.
Precision-Recall Trade-off: By adjusting the decision threshold, we can balance precision and recall according to the specific needs of the application.
ROC Curve and AUC Score
ROC Curve: Plots the true positive rate (recall) against the false positive rate for various threshold settings.
AUC Score: The area under the ROC curve, providing a single metric to summarize the model’s performance. Higher AUC indicates better performance.
Summary
The visualization of the number 5 is just an initial check. The real work involves:
Training the model on a large dataset.
Evaluating its performance using cross-validation, confusion matrices, precision, recall, and ROC curves.
Extending the model to handle multiclass, multilabel, and multioutput classification tasks.
Finally, let’s look at the screenshots of the results; the full run took 1 hour and 52 minutes on Google Colab using a CPU runtime:
Don’t forget to check out our iOS apps in the Applications section :) Thanks in advance for your support!