
From MLP to CNN: Neural Networks for MNIST Digit Recognition

We build and compare four neural network architectures in PyTorch, visualize performance, explore complexity vs. accuracy, and show why CNNs excel at image classification.

Daniel Gustaw

Introduction

The MNIST dataset is a classic benchmark in computer vision, consisting of 70,000 grayscale images of handwritten digits (28×28 pixels). It’s small enough to train quickly but complex enough to reveal differences in model performance—perfect for neural network experiments.

While Multi-Layer Perceptrons (MLPs) can technically classify image data, they treat pixels as flat vectors, ignoring spatial patterns. Convolutional Neural Networks (CNNs), on the other hand, are designed to exploit local structures in images—edges, curves, textures—making them far more effective for visual tasks.

In this post, I compare four architectures: a simple MLP, a minimal TinyCNN, a balanced CNN, and a heavier StrongCNN. We’ll look at accuracy, training time, and parameter counts to understand the trade-offs.

Dataset Preparation

As mentioned earlier, we’re using the MNIST dataset, conveniently available through torchvision.datasets. With just a few lines of code, we download and load the data, apply a basic transformation, and prepare it for training:

from torchvision import datasets, transforms

transform = transforms.ToTensor()

train_data = datasets.MNIST(
    root="./data", train=True, download=True, transform=transform
)
test_data = datasets.MNIST(
    root="./data", train=False, download=True, transform=transform
)

The only preprocessing step here is transforms.ToTensor(), which converts each image to a PyTorch tensor and normalizes its pixel values to the [0.0, 1.0] range.
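
A quick sanity check (a hypothetical snippet, not part of the original pipeline) confirms the shape and value range of a converted sample:

image, label = train_data[0]                   # first training sample
print(image.shape)                             # torch.Size([1, 28, 28]): channel, height, width
print(image.min().item(), image.max().item())  # pixel values lie inside [0.0, 1.0]
print(label)                                   # the class as a plain integer (this sample is a 5)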

from torch.utils.data import DataLoader

BATCH_SIZE = 64

train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_data, batch_size=BATCH_SIZE)

Shuffling the training data prevents the model from memorizing the order of the digits. For the test set, we skip shuffling but still use batching for efficiency.

We can display some sample images to visualize the dataset:

import matplotlib.pyplot as plt

images, labels = next(iter(train_loader))

plt.figure(figsize=(6, 6))
for i in range(9):
    plt.subplot(3, 3, i + 1)
    plt.imshow(images[i][0], cmap="gray")
    plt.title(f"Label: {labels[i].item()}")
    plt.axis("off")
plt.tight_layout()
plt.savefig("mnist_digits.svg")
plt.show()

Training and Evaluation

Now that our data is ready, it’s time to teach our models how to read handwritten digits. To do this, we define a standard training and evaluation loop using PyTorch’s idiomatic structure. We’ll also track model complexity using a simple parameter counter—useful when comparing different architectures.

Device Setup and Epochs

First, we detect whether a GPU is available. If so, training will happen on CUDA; otherwise, we fall back to CPU. We also set a reasonable training duration:

import torch

EPOCHS = 5
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Five epochs may not sound like much, but on MNIST, it’s often enough to get surprisingly good results—even with basic models.

Training Loop

Here’s our train() function. It’s as boilerplate as it gets: set the model to training mode, loop over batches, calculate the loss, and update the weights.

def train(model, loader, optimizer, criterion):
    model.train()
    for x, y in loader:
        x, y = x.to(DEVICE), y.to(DEVICE)
        optimizer.zero_grad()
        output = model(x)
        loss = criterion(output, y)
        loss.backward()
        optimizer.step()

This function doesn’t return anything—it just updates the model’s internal parameters. During training, we don’t care about accuracy yet. We’ll check that later.
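
If you want to watch the loss fall from epoch to epoch, a lightly extended variant (an optional addition, not used in the experiments below) can return the mean batch loss:

def train_verbose(model, loader, optimizer, criterion):
    model.train()
    running_loss = 0.0
    for x, y in loader:
        x, y = x.to(DEVICE), y.to(DEVICE)
        optimizer.zero_grad()
        output = model(x)
        loss = criterion(output, y)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    return running_loss / len(loader)  # mean batch loss for this epoch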

Evaluation Loop

After training, we evaluate on the test set. The model is set to eval() mode, gradients are disabled, and we collect two metrics: accuracy and average cross-entropy loss.

import torch.nn.functional as F

def test(model, loader):
    model.eval()
    correct = 0
    total = 0
    total_loss = 0.0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(DEVICE), y.to(DEVICE)
            output = model(x)
            loss = F.cross_entropy(output, y)
            total_loss += loss.item()
            preds = output.argmax(dim=1)
            correct += (preds == y).sum().item()
            total += y.size(0)
    avg_loss = total_loss / len(loader)  # average over batches
    return correct / total, avg_loss

Notice that we take the mean loss over batches—not individual examples. It’s a good balance between performance tracking and simplicity.
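
Strictly speaking, the batch-level mean is exact only when every batch has the same size, and the last test batch is usually smaller. The difference is negligible on MNIST, but for reference, here is a variant (my addition, not from the original code) that averages per example:

def test_exact(model, loader):
    model.eval()
    correct, total, total_loss = 0, 0, 0.0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(DEVICE), y.to(DEVICE)
            output = model(x)
            # weight each batch's loss by its size so the final average is per example
            total_loss += F.cross_entropy(output, y).item() * y.size(0)
            correct += (output.argmax(dim=1) == y).sum().item()
            total += y.size(0)
    return correct / total, total_loss / total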

Parameter Count

Before we compare architectures, it’s helpful to know how many trainable parameters each one has. This tiny utility gives us the count:

def count_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

Spoiler: the StrongCNN has over 450,000 parameters, while TinyCNN manages with just a few thousand. That’s a huge difference—and a great starting point for deeper analysis.
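
As a quick sanity check (my own example, not from the post), the count for a single linear layer matches the usual weights-plus-biases arithmetic:

import torch.nn as nn

layer = nn.Linear(784, 32)
print(count_params(layer))  # 25120 = 784 * 32 weights + 32 biases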

Experiment Runner

Finally, we put everything together into a single function that trains a model, times the process, evaluates on the test set, and prints a short summary:

import time
import torch.optim as optim
import torch.nn as nn

def run_experiment(model_class, name):
    model = model_class().to(DEVICE)
    optimizer = optim.Adam(model.parameters())
    criterion = nn.CrossEntropyLoss()

    print(f"\n{name} ({count_params(model)} parameters)")
    start = time.time()
    for epoch in range(EPOCHS):
        train(model, train_loader, optimizer, criterion)
    duration = time.time() - start

    acc, loss = test(model, test_loader)
    print(f"Test Accuracy: {acc * 100:.2f}% | Loss: {loss:.2f} | Learning time: {duration:.1f}s")

This structure is flexible enough to work with any model class you pass in—from simple MLPs to deep convolutional beasts.
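
For instance, once the four classes from the next sections are defined, the whole comparison can be driven by one loop (a sketch using the display names that appear in the results below):

for model_class, name in [
    (MLP, "MLP"),
    (TinyCNN, "Tiny CNN"),
    (CNN, "CNN"),
    (StrongCNN, "Strong CNN"),
]:
    run_experiment(model_class, name)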

In the next section, we’ll define and analyze the four architectures: MLP, TinyCNN, CNN, and StrongCNN.

Model 1: Multi-Layer Perceptron (MLP)

The simplest architecture we consider is the classic Multi-Layer Perceptron (MLP). It treats each 28×28 image as a flat vector of 784 pixels, ignoring spatial structure but still able to learn useful features through fully connected layers.

import torch.nn as nn

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        h = 32  # number of hidden units
        self.model = nn.Sequential(
            nn.Flatten(),          # Flatten 28x28 image into a vector of length 784
            nn.Linear(28 * 28, h), # Fully connected layer: 784 → 32
            nn.ReLU(),             # Non-linear activation
            nn.Linear(h, 10)       # Output layer: 32 → 10 classes
        )

    def forward(self, x):
        return self.model(x)

Explanation

This small MLP has relatively few parameters and trains quickly, but it does not capture the spatial relationships between pixels, limiting its accuracy on image data.
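
The count is easy to verify by hand: the hidden layer holds 784 × 32 weights plus 32 biases (25,120 parameters) and the output layer 32 × 10 weights plus 10 biases (330), giving 25,450 in total.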

Calling

run_experiment(MLP, "MLP")

you should see:

MLP (25450 parameters)
Test Accuracy: 95.96% | Loss: 0.14 | Learning time: 8.7s

It will be our point of reference for comparing the CNN models.

Model 2: TinyCNN — A Minimal Convolutional Neural Network

Next, we introduce a simple TinyCNN architecture that leverages convolutional layers to capture spatial patterns in images. This model is lightweight but far more powerful than the MLP for image tasks.

The TinyCNN architecture is defined as follows:

import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 4, kernel_size=3, padding=1),   # 1x28x28 → 4x28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 4x14x14
            nn.Conv2d(4, 8, kernel_size=3, padding=1),   # 8x14x14
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 8x7x7
            nn.Flatten(),
            nn.Linear(8 * 7 * 7, 10)                     # Direct to output layer
        )

    def forward(self, x):
        return self.net(x)

Architecture Overview
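
The summary below looks like output from the torchinfo package; assuming that is what was used, it can be reproduced with:

from torchinfo import summary

summary(TinyCNN(), input_size=(BATCH_SIZE, 1, 28, 28))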

======================================================================
Layer (type:depth-idx)                   Output Shape          Param #
======================================================================
TinyCNN                                  [64, 10]                  --
├─Sequential: 1-1                        [64, 10]                  --
│    └─Conv2d: 2-1                       [64, 4, 28, 28]           40
│    └─ReLU: 2-2                         [64, 4, 28, 28]           --
│    └─MaxPool2d: 2-3                    [64, 4, 14, 14]           --
│    └─Conv2d: 2-4                       [64, 8, 14, 14]           296
│    └─ReLU: 2-5                         [64, 8, 14, 14]           --
│    └─MaxPool2d: 2-6                    [64, 8, 7, 7]             --
│    └─Flatten: 2-7                      [64, 392]                 --
│    └─Linear: 2-8                       [64, 10]                3,930
======================================================================
Total params: 4,266
Trainable params: 4,266
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 5.97
======================================================================
Input size (MB): 0.20
Forward/backward pass size (MB): 2.41
Params size (MB): 0.02
Estimated Total Size (MB): 2.63
======================================================================
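
Where do these numbers come from? Each Conv2d entry follows (in_channels × kernel_height × kernel_width + 1) × out_channels: (1 × 3 × 3 + 1) × 4 = 40 for the first convolution and (4 × 3 × 3 + 1) × 8 = 296 for the second, while the final linear layer contributes 392 × 10 + 10 = 3,930.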

What is most interesting is that we beat the MLP's result with just 4,266 parameters instead of 25,450.

Tiny CNN (4266 parameters)
Test Accuracy: 97.96% | Loss: 0.06 | Learning time: 12.3s

With a network several times smaller, we make roughly half as many mistakes as the previous model: the test error drops from 4.04% to 2.04%.

Let’s check how much more the network improves if we scale it up to roughly the same parameter count as the original MLP.

Model 3: CNN — A Balanced Convolutional Neural Network

Now that we’ve seen what a minimal convolutional model can do, let’s scale things up a bit.

The CNN model below is designed to maintain a balanced trade-off between parameter count and performance. It expands the feature extraction capabilities of the TinyCNN by using more filters and a hidden linear layer before the final output.

class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # 1x28x28 → 8x28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 8x14x14
            nn.Conv2d(8, 16, kernel_size=3, padding=1),  # 16x14x14
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 16x7x7
            nn.Flatten(),
            nn.Linear(16 * 7 * 7, 32),                   # Dense layer with 32 units
            nn.ReLU(),
            nn.Linear(32, 10)                            # Final output layer
        )

    def forward(self, x):
        return self.net(x)

Architecture Breakdown

Compared to TinyCNN, this model:

- doubles the number of filters in each convolutional block (8 and 16 instead of 4 and 8),
- adds a hidden 32-unit linear layer before the final output.

The table below lists every layer with its output shape and parameter count (batch dimension omitted):

======================================================
Layer                 Output Shape        Parameters
======================================================
Conv2d (1→8, 3×3)     8×28×28                     80
ReLU                  8×28×28                      0
MaxPool2d             8×14×14                      0
Conv2d (8→16, 3×3)    16×14×14                 1,168
ReLU                  16×14×14                     0
MaxPool2d             16×7×7                       0
Flatten               784                          0
Linear (784 → 32)     32                      25,120
ReLU                  32                           0
Linear (32 → 10)      10                         330
======================================================
Total                 —                       26,698
======================================================

With 26,698 parameters, this CNN is similar in size to the MLP (25,450) yet significantly more powerful.

CNN (26698 parameters)
Test Accuracy: 98.22% | Loss: 0.05 | Learning time: 14.3s

Key Observations

This model demonstrates a sweet spot: good depth, reasonable parameter size, and excellent accuracy. But what if we didn’t care about size at all and wanted to push performance even further?

Let’s find out in the next section.

Model 4: StrongCNN — A Deep Convolutional Powerhouse

So far, we’ve looked at models that balance performance and simplicity. But what if we remove the constraints and go all-in on performance?

The StrongCNN is a deeper, more expressive architecture that brings in multiple convolutional layers, higher channel counts, and regularization techniques like Dropout to prevent overfitting. It’s inspired by best practices from larger vision models but still compact enough to train quickly on MNIST.

class StrongCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1),   # 1x28x28 → 32x28x28
            nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1),  # 32x28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                  # 32x14x14
            nn.Dropout(0.25),

            nn.Conv2d(32, 64, 3, padding=1),  # 64x14x14
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1),  # 64x14x14
            nn.ReLU(),
            nn.MaxPool2d(2),                  # 64x7x7
            nn.Dropout(0.25),

            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        return self.net(x)

Architecture Breakdown

This model stacks four convolutional layers in two blocks, with increasing filter counts (32 → 64). After each block:

- MaxPool2d halves the spatial resolution,
- Dropout (p=0.25) randomly disables a quarter of the activations to regularize training.

======================================================================
Layer (type:depth-idx)                   Output Shape          Param #
======================================================================
StrongCNN                                [64, 10]                  --
├─Sequential: 1-1                        [64, 10]                  --
│    └─Conv2d: 2-1                       [64, 32, 28, 28]          320
│    └─ReLU: 2-2                         [64, 32, 28, 28]          --
│    └─Conv2d: 2-3                       [64, 32, 28, 28]        9,248
│    └─ReLU: 2-4                         [64, 32, 28, 28]          --
│    └─MaxPool2d: 2-5                    [64, 32, 14, 14]          --
│    └─Dropout: 2-6                      [64, 32, 14, 14]          --
│    └─Conv2d: 2-7                       [64, 64, 14, 14]       18,496
│    └─ReLU: 2-8                         [64, 64, 14, 14]          --
│    └─Conv2d: 2-9                       [64, 64, 14, 14]       36,928
│    └─ReLU: 2-10                        [64, 64, 14, 14]          --
│    └─MaxPool2d: 2-11                   [64, 64, 7, 7]            --
│    └─Dropout: 2-12                     [64, 64, 7, 7]            --
│    └─Flatten: 2-13                     [64, 3136]                --
│    └─Linear: 2-14                      [64, 128]             401,536
│    └─ReLU: 2-15                        [64, 128]                 --
│    └─Dropout: 2-16                     [64, 128]                 --
│    └─Linear: 2-17                      [64, 10]                1,290
======================================================================
Total params: 467,818
Trainable params: 467,818
Non-trainable params: 0
Total mult-adds (Units.GIGABYTES): 1.20
======================================================================
Input size (MB): 0.20
Forward/backward pass size (MB): 38.61
Params size (MB): 1.87
Estimated Total Size (MB): 40.68
======================================================================

With nearly half a million parameters, this model dwarfs the others in capacity. But it pays off.

Strong CNN (467818 parameters)
Test Accuracy: 99.09% | Loss: 0.03 | Learning time: 75.0s

Key Observations

This model is overkill for MNIST—but that’s the point. It illustrates how far you can go when accuracy is the only goal.

Recap: Comparing All Four Models

Let’s wrap up with a side-by-side summary:

==================================================================
Model        Parameters    Test Accuracy    Loss    Training Time
==================================================================
MLP          25,450        95.96%           0.14    8.7s
TinyCNN      4,266         97.96%           0.06    12.3s
CNN          26,698        98.22%           0.05    14.3s
StrongCNN    467,818       99.09%           0.03    75.0s
==================================================================

Conclusion

This experiment demonstrates how architecture choices affect performance in neural networks. Even for a simple dataset like MNIST:

- a tiny CNN (4,266 parameters) clearly beat an MLP six times its size (25,450),
- at a comparable parameter budget, the CNN reached 98.22% accuracy against the MLP’s 95.96%,
- extra capacity (StrongCNN) pushed accuracy further, but training time grew from 14.3s to 75.0s.

Convolutional models outperform MLPs not because they’re “deeper” or “fancier,” but because they understand how images work.

These results mirror broader trends seen in state-of-the-art research.

Our experiments reaffirm that CNNs are not only more accurate than MLPs—they are also more efficient and better suited to image-based tasks. While SOTA models continue to push the boundary, our practical models already achieve high accuracy with a fraction of the complexity.

The full code for this post is available at https://github.com/gustawdaniel/cnn-mnist.
