Unsupervised Machine Learning

What is Unsupervised Machine Learning?

Unsupervised Machine Learning is a type of machine learning where the model is trained on unlabeled data. Unlike supervised learning, there are no predefined output labels. Instead, the goal is to find hidden patterns, structures, or relationships in the data. Unsupervised learning is often used for exploratory data analysis, dimensionality reduction, or clustering.


How Unsupervised Learning Works

  1. Collect Unlabeled Data:
    • Gather a dataset where there are no predefined output labels.
  2. Train the Model:
    • Feed the unlabeled data into the model, which learns to identify patterns or groupings on its own.
  3. Discover Patterns:
    • The model organizes the data into clusters, reduces its dimensionality, or identifies anomalies (see the anomaly-detection sketch after this list).
  4. Evaluate the Model:
    • Since there are no labels, evaluation is often qualitative (e.g., visualizing clusters) or based on domain knowledge.
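Anomaly detection, mentioned in step 3 above, is worth a quick illustration. The following is a minimal sketch using scikit-learn's IsolationForest; the sample data and the contamination value are illustrative assumptions, not something drawn from the examples later in this article:

import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative (assumed) sample: four tightly grouped points plus one outlier
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [1.1, 1.2], [8.0, 8.0]])

# contamination is our assumed fraction of anomalies in the data
model = IsolationForest(contamination=0.2, random_state=0)
labels = model.fit_predict(X)  # returns 1 for normal points, -1 for anomalies

print(labels)  # the isolated point [8.0, 8.0] should be flagged as -1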

Classifications for Unsupervised Learning

Unsupervised Learning can be broadly categorized into two main types:

  1. Clustering

  2. Dimensionality Reduction

Let’s explore each type with a small sample Python script and explain what each script does.


1. Clustering

Clustering involves grouping data points into clusters based on their similarity, so that points within a cluster are more alike than points in different clusters.

Example: Customer Segmentation

Here's a small Python script using K-Means clustering with scikit-learn:

import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Sample data: customer spending in two categories
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# Create and train the KMeans model
# (n_init is set explicitly because its default changed across scikit-learn versions)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

# Predict the clusters
y_kmeans = kmeans.predict(X)

# Plot the clusters
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75)
plt.xlabel('Category 1 Spending')
plt.ylabel('Category 2 Spending')
plt.title('Customer Segmentation')
plt.show()

Explanation:

  • Start with a sample dataset of customer spending in two categories.

  • Create and train a K-Means clustering model with 2 clusters.

  • Predict the clusters for the data points and plot them.

  • The clusters are visualized with different colors, and the cluster centers are marked in red.
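Evaluation here need not be purely visual: clustering also has label-free quantitative checks. Below is a minimal sketch using scikit-learn's silhouette score, reusing X and y_kmeans from the script above (the score ranges from -1 to 1, with higher values indicating better-separated clusters):

from sklearn.metrics import silhouette_score

# Mean silhouette over all points: compares each point's average
# distance to its own cluster against its average distance to the
# nearest other cluster
score = silhouette_score(X, y_kmeans)
print(f"Silhouette score: {score:.2f}")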

2. Dimensionality Reduction

Dimensionality reduction involves reducing the number of features (dimensions) in the dataset while preserving as much information as possible.

Example: Visualizing High-Dimensional Data with PCA

Here's a small Python script using Principal Component Analysis (PCA) with scikit-learn:


import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Sample data: 3-dimensional data points
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9],
              [2, 3, 4], [5, 6, 7], [8, 9, 10]])

# Create and train the PCA model
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Plot the reduced dimensions
plt.scatter(X_reduced[:, 0], X_reduced[:, 1])
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of 3D Data')
plt.show()

Explanation:

  • Start with a sample dataset of 3-dimensional data points.

  • Create and train a PCA model to reduce the dimensions to 2.

  • Transform the data to the reduced dimensions and plot the results.

  • The reduced data points are visualized in a 2D plot. Because this particular sample lies exactly on a straight line in 3D (every point has the form (x, x+1, x+2)), essentially all of the variance falls on the first principal component, and the plotted points line up along the horizontal axis.
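To quantify how much information the reduction preserves, you can inspect the fitted model's explained_variance_ratio_ attribute. A minimal sketch reusing the pca object from the script above:

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
# For this collinear sample, expect approximately [1.0, 0.0]:
# the first component carries essentially all of the variance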


Summary

  • Unsupervised Learning: The model is trained on unlabeled data to discover hidden patterns or structures.

  • Clustering: Groups similar data points together (e.g., customer segmentation).

  • Dimensionality Reduction: Reduces the number of features while preserving important information (e.g., visualizing high-dimensional data).