Supervised Machine Learning

What is Supervised Machine Learning?

Supervised machine learning is a type of machine learning in which the model is trained on labeled data: every input data point comes paired with its correct output (the label). The goal is for the model to learn the mapping from inputs to outputs so that it can predict the output for new, unseen data.

How Supervised Learning Works

Collect Labeled Data:
- Gather a dataset where each input is paired with the correct output.
Split the Data:
- Divide the dataset into:
  - Training Set: Used to train the model.
  - Validation Set: Used to tune the model's hyperparameters.
  - Test Set: Used to evaluate the model's performance.
Train the Model:
- Feed the training data into the model. The model learns the relationship between inputs and outputs by minimizing a loss function (e.g., mean squared error for regression, cross-entropy for classification).
Evaluate the Model:
- Test the model on unseen data (the test set) and measure its performance with metrics such as accuracy, precision, and recall for classification, or mean squared error for regression.
Make Predictions:
- Use the trained model to predict outputs for new inputs.
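
The five steps above can be sketched end to end. This is a minimal illustration under stated assumptions, not a definitive recipe: the data is synthetic, Ridge regression is swapped in only because it has a hyperparameter (alpha) to tune on the validation set, and the 60/20/20 split and the candidate alpha values are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Step 1: labeled data. Synthetic here (an assumption): y is roughly 3*x + 5 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + 5 + rng.normal(0, 1, size=100)

# Step 2: split into 60% training, 20% validation, 20% test.
# The first call holds out the test set; the second carves validation out of the rest.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)

# Step 3: train one model per candidate hyperparameter value and
# keep the value with the lowest error on the validation set.
best_alpha, best_val_mse = None, float("inf")
for alpha in [0.01, 0.1, 1.0, 10.0]:
    candidate = Ridge(alpha=alpha).fit(X_train, y_train)
    val_mse = mean_squared_error(y_val, candidate.predict(X_val))
    if val_mse < best_val_mse:
        best_alpha, best_val_mse = alpha, val_mse

# Step 4: evaluate the chosen model once on the untouched test set.
model = Ridge(alpha=best_alpha).fit(X_train, y_train)
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"Best alpha: {best_alpha}, test MSE: {test_mse:.3f}")

# Step 5: predict the output for a new input.
prediction = model.predict(np.array([[4.0]]))
print(f"Prediction for x=4.0: {prediction[0]:.2f}")
```

Keeping the test set out of the tuning loop is the point of the three-way split: the test error then estimates performance on data the model selection never saw.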

Types of Supervised Learning

Supervised learning can be divided into two main categories:

Regression:
- Used when the output is a continuous value (e.g., predicting house prices, temperature, or stock prices).
- Example Algorithms: Linear Regression, Decision Trees, Support Vector Regression (SVR).
Classification:
- Used when the output is a category or class (e.g., classifying emails as spam/not spam, identifying handwritten digits).
- Example Algorithms: Logistic Regression, Decision Trees, Support Vector Machines (SVM), Neural Networks.
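
Decision trees appear in both lists above, so they make a convenient illustration of the difference between the two categories. The sketch below uses made-up toy values; only the type of the labels changes between the two models.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy inputs (made up for illustration): one feature per sample
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Regression: the labels are continuous numbers
y_continuous = np.array([1.5, 2.5, 3.5, 10.5, 11.5, 12.5])
regressor = DecisionTreeRegressor(random_state=0).fit(X, y_continuous)
print("Regression output:", regressor.predict([[2.0]]))

# Classification: the labels are categories
y_class = np.array(["low", "low", "low", "high", "high", "high"])
classifier = DecisionTreeClassifier(random_state=0).fit(X, y_class)
print("Classification output:", classifier.predict([[11.0]]))
```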

1. Regression

Regression is used when the output is a continuous value. The goal is to predict a numerical value based on input features.

Example: Predicting House Prices

Input: Features of a house (e.g., size, number of bedrooms).
Output: Price of the house (a continuous value).
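
Mean squared error, the regression loss mentioned earlier, is the average of the squared differences between the true and predicted values. A small worked example, with made-up prices in $1000s, computes it by hand and checks the result against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted prices (in $1000s), for illustration only
y_true = np.array([150.0, 200.0, 250.0])
y_pred = np.array([160.0, 190.0, 250.0])

# Errors are -10, 10, 0; their squares are 100, 100, 0; the mean is 200 / 3
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

print(f"Manual MSE:       {mse_manual:.2f}")
print(f"scikit-learn MSE: {mse_sklearn:.2f}")
```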

Python Script: Linear Regression

# Import libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import numpy as np

# Sample data: House sizes (in sq. ft.) and prices (in $1000s)
X = np.array([[750], [1000], [1200], [1500], [1800]])  # Features (size)
y = np.array([150, 200, 250, 300, 350])               # Labels (price)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make a prediction
new_house_size = np.array([[1300]])  # Predict price for a 1300 sq. ft. house
predicted_price = model.predict(new_house_size)
print(f"Predicted price for a 1300 sq. ft. house: ${predicted_price[0]:.2f} thousand")

What the Script Does:

Import Libraries:
- train_test_split for splitting the data.
- LinearRegression for creating the regression model.
- numpy for numerical operations.
Prepare the Data:
- X contains the house sizes (features).
- y contains the corresponding house prices (labels).
Split the Data:
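- The data is split into training and testing sets using train_test_split. With five samples and test_size=0.2, four houses are used for training and one is held out; this short script does not go on to use the held-out house.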
Train the Model:
- A LinearRegression model is created and trained using the training data.
Make a Prediction:
- The model predicts the price of a house with a size of 1300 sq. ft.
Output:
- The script prints the predicted price, in thousands of dollars, for a 1300 sq. ft. house.
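
The script above creates X_test and y_test but never uses them. A possible extension, sketched here with the same data and split, inspects the fitted line and measures the error on the held-out house. With a single test sample the error figure is only a rough indication, not a reliable estimate.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same sample data as the script above
X = np.array([[750], [1000], [1200], [1500], [1800]])
y = np.array([150, 200, 250, 300, 350])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

# The learned relationship is a straight line: price = slope * size + intercept
slope = model.coef_[0]
intercept = model.intercept_
print(f"price ~= {slope:.4f} * size + {intercept:.2f}")

# Evaluate on the held-out data that training never saw
y_test_pred = model.predict(X_test)
test_mse = mean_squared_error(y_test, y_test_pred)
print(f"Held-out size: {X_test[0][0]} sq. ft., actual: {y_test[0]}, predicted: {y_test_pred[0]:.2f}")
print(f"Test MSE: {test_mse:.2f}")
```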

2. Classification

Classification is used when the output is a category or class. The goal is to predict the class label of the input data.
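
Cross-entropy, the classification loss mentioned earlier, penalizes a model for assigning low probability to the correct class. The sketch below computes the binary form by hand and compares it with scikit-learn's log_loss; the labels and predicted probabilities are made up for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up true labels (1 = spam) and predicted probabilities of spam
y_true = np.array([0, 1, 1, 0])
p_spam = np.array([0.1, 0.8, 0.6, 0.3])

# Binary cross-entropy: average of -log(probability assigned to the true class)
ce_manual = -np.mean(y_true * np.log(p_spam) + (1 - y_true) * np.log(1 - p_spam))
ce_sklearn = log_loss(y_true, p_spam)

print(f"Manual cross-entropy: {ce_manual:.4f}")
print(f"scikit-learn log_loss: {ce_sklearn:.4f}")
```

Confident correct predictions (such as 0.1 for a non-spam email) contribute little to the loss, while hesitant ones (such as 0.6 for a spam email) contribute more.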

Example: Classifying Emails as Spam or Not Spam

Input: Features of an email (e.g., words, sender).
Output: Class label (e.g., "spam" or "not spam").
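
One common way to turn the words of an email into numeric features is a bag-of-words count. The sketch below uses scikit-learn's CountVectorizer on three invented emails; it is one possible featurization, not the one used to produce the sample values in the script that follows.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented example emails, for illustration only
emails = [
    "win a free prize now",
    "meeting agenda for monday",
    "free free prize inside",
]

# Each column counts one vocabulary word; each row is one email.
# The default tokenizer ignores single-character words such as "a".
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(emails)

print("Vocabulary:", sorted(vectorizer.vocabulary_))
print("Feature matrix:")
print(X_counts.toarray())
```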

Python Script: Logistic Regression

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample data: email features (input) and labels (output: 0 for not spam, 1 for spam)
X = np.array([[0.1, 0.4], [0.3, 0.7], [0.4, 0.5], [0.6, 0.8], [0.9, 0.2]])  # email features
y = np.array([0, 0, 0, 1, 1])  # labels

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Display final predictions
print(f"Final Predictions: {y_pred}")

# Compare actual vs predicted labels
comparison = np.vstack((y_test, y_pred)).T
print(f"Actual vs Predicted Labels:\n{comparison}")

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

What the Script Does:

Import Libraries:
- numpy: A library for numerical operations in Python.
- train_test_split: A function from sklearn.model_selection to split the dataset into training and testing sets.
- LogisticRegression: A classification algorithm from sklearn.linear_model.
- accuracy_score: A function from sklearn.metrics to calculate the accuracy of the model.
Prepare the Data:
- X: A 2D array where each row represents an email and each column represents a feature (e.g., word frequency, sender score).
  - Example: [0.1, 0.4] means the first email has feature values 0.1 and 0.4.
- y: A 1D array containing the labels for each email.
  - 0 means "not spam."
  - 1 means "spam."
Split the Data:
- train_test_split: Splits the dataset into:
  - Training Set (X_train, y_train): Used to train the model (80% of the data, i.e., 4 of the 5 emails).
  - Testing Set (X_test, y_test): Used to evaluate the model (20% of the data, i.e., 1 of the 5 emails).
- test_size=0.2: 20% of the data is used for testing.
- random_state=42: Ensures the split is reproducible (same split every time you run the code).
Create and Train the Model:
- LogisticRegression(): Creates a logistic regression model.
- model.fit(X_train, y_train): Trains the model using the training data (X_train and y_train).
  - The model learns the relationship between the email features (X_train) and their labels (y_train).
Predict on the Test Set:
- model.predict(X_test): Uses the trained model to predict the labels for the test set (X_test).
- y_pred: Contains the predicted labels for the test set.
Display Final Predictions:
- Prints the predicted labels (y_pred) for the test set.
Compare Actual vs Predicted Labels:
- np.vstack((y_test, y_pred)).T: Combines the actual labels (y_test) and predicted labels (y_pred) into a 2D array for easy comparison.
- Prints the comparison of actual vs predicted labels.
Evaluate the Model:
- accuracy_score(y_test, y_pred): Calculates the accuracy of the model by comparing the actual labels (y_test) with the predicted labels (y_pred).
- accuracy: A value between 0 and 1, where 1 means 100% accuracy. Because the test set here holds a single email, the value can only be 0.0 or 1.0.
- Prints the accuracy of the model.
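
Accuracy is only one of the metrics mentioned earlier. The sketch below computes precision, recall, and a confusion matrix on a larger set of made-up labels (the five-email example leaves too little test data for these to be informative). The label arrays are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Made-up actual and predicted labels (0 = not spam, 1 = spam)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

accuracy = accuracy_score(y_true, y_pred)    # share of all predictions that are correct
precision = precision_score(y_true, y_pred)  # of the emails flagged as spam, share that are spam
recall = recall_score(y_true, y_pred)        # of the actual spam emails, share that were flagged

# Rows are actual classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
matrix = confusion_matrix(y_true, y_pred)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"Confusion matrix:\n{matrix}")
```

Here 3 of the 4 flagged emails are spam (precision 0.75), 3 of the 4 spam emails are flagged (recall 0.75), and 8 of the 10 predictions are correct (accuracy 0.80).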

Summary

Supervised Learning: The model is trained on labeled data to predict outputs for new inputs.
Regression: Predicts continuous values (e.g., house prices).
Classification: Predicts categorical labels (e.g., spam or not spam).