From MLP to CNN. Neural Networks for MNIST Digit Recognition
We build and compare four neural network architectures in PyTorch, visualize performance, explore complexity vs. accuracy, and show why CNNs excel at image classification.

Daniel Gustaw
• 13 min read

Introduction
The MNIST dataset is a classic benchmark in computer vision, consisting of 70,000 grayscale images of handwritten digits (28×28 pixels). It’s small enough to train quickly but complex enough to reveal differences in model performance—perfect for neural network experiments.
While Multi-Layer Perceptrons (MLPs) can technically classify image data, they treat pixels as flat vectors, ignoring spatial patterns. Convolutional Neural Networks (CNNs), on the other hand, are designed to exploit local structures in images—edges, curves, textures—making them far more effective for visual tasks.
In this post, I compare four architectures: a simple MLP, a minimal TinyCNN, a balanced CNN, and a heavier StrongCNN. We’ll look at accuracy, training time, and parameter counts to understand the trade-offs.
Dataset Preparation
As mentioned earlier, we’re using the MNIST dataset, conveniently available through torchvision.datasets. With just a few lines of code, we download and load the data, apply a basic transformation, and prepare it for training:
from torchvision import datasets, transforms

transform = transforms.ToTensor()

train_data = datasets.MNIST(
    root="./data", train=True, download=True, transform=transform
)
test_data = datasets.MNIST(
    root="./data", train=False, download=True, transform=transform
)
The only preprocessing step here is transforms.ToTensor(), which converts each image to a PyTorch tensor and normalizes its pixel values to the [0.0, 1.0] range.
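To confirm the transform behaves as expected, you can inspect a single sample (a quick check of my own, not part of the original training code):
img, label = train_data[0]
print(img.shape)                           # torch.Size([1, 28, 28]): (channels, height, width)
print(img.min().item(), img.max().item())  # pixel values lie within [0.0, 1.0]
print(label)                               # an integer class between 0 and 9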
from torch.utils.data import DataLoader
BATCH_SIZE = 64
train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_data, batch_size=BATCH_SIZE)
Shuffling the training data prevents the model from memorizing the order of the digits. For the test set, we skip shuffling but still use batching for efficiency.
We can display some sample images to visualize the dataset:
import matplotlib.pyplot as plt
images, labels = next(iter(train_loader))
plt.figure(figsize=(6, 6))
for i in range(9):
    plt.subplot(3, 3, i + 1)
    plt.imshow(images[i][0], cmap="gray")
    plt.title(f"Label: {labels[i].item()}")
    plt.axis("off")
plt.tight_layout()
plt.savefig("mnist_digits.svg")
plt.show()
Training and Evaluation
Now that our data is ready, it’s time to teach our models how to read handwritten digits. To do this, we define a standard training and evaluation loop using PyTorch’s idiomatic structure. We’ll also track model complexity using a simple parameter counter—useful when comparing different architectures.
Device Setup and Epochs
First, we detect whether a GPU is available. If so, training will happen on CUDA; otherwise, we fall back to CPU. We also set a reasonable training duration:
import torch
EPOCHS = 5
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Five epochs may not sound like much, but on MNIST, it’s often enough to get surprisingly good results—even with basic models.
Training Loop
Here’s our train() function. It’s as boilerplate as it gets: set the model to training mode, loop over batches, calculate the loss, and update the weights.
def train(model, loader, optimizer, criterion):
    model.train()
    for x, y in loader:
        x, y = x.to(DEVICE), y.to(DEVICE)
        optimizer.zero_grad()
        output = model(x)
        loss = criterion(output, y)
        loss.backward()
        optimizer.step()
This function doesn’t return anything—it just updates the model’s internal parameters. During training, we don’t care about accuracy yet. We’ll check that later.
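If you do want to watch the training loss fall from epoch to epoch, a minimal variant (my own sketch, not used in the experiments below) can return the average batch loss:
def train_with_loss(model, loader, optimizer, criterion):
    # Same loop as train(), but also reports the mean batch loss for the epoch.
    model.train()
    running_loss = 0.0
    for x, y in loader:
        x, y = x.to(DEVICE), y.to(DEVICE)
        optimizer.zero_grad()
        output = model(x)
        loss = criterion(output, y)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    return running_loss / len(loader)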
Evaluation Loop
After training, we evaluate on the test set. The model is set to eval() mode, gradients are disabled, and we collect two metrics: accuracy and average cross-entropy loss.
import torch.nn.functional as F

def test(model, loader):
    model.eval()
    correct = 0
    total = 0
    total_loss = 0.0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(DEVICE), y.to(DEVICE)
            output = model(x)
            loss = F.cross_entropy(output, y)
            total_loss += loss.item()
            preds = output.argmax(dim=1)
            correct += (preds == y).sum().item()
            total += y.size(0)
    avg_loss = total_loss / len(loader)  # average over batches
    return correct / total, avg_loss
Notice that we take the mean loss over batches—not individual examples. It’s a good balance between performance tracking and simplicity.
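If you prefer an exact per-example average (it matters when the last batch is smaller than BATCH_SIZE), a hypothetical variant could sum the loss over each batch instead:
def test_per_example(model, loader):
    # Variant of test() that averages the loss per example rather than per batch.
    model.eval()
    correct, total, total_loss = 0, 0, 0.0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(DEVICE), y.to(DEVICE)
            output = model(x)
            total_loss += F.cross_entropy(output, y, reduction="sum").item()  # sum, not mean
            correct += (output.argmax(dim=1) == y).sum().item()
            total += y.size(0)
    return correct / total, total_loss / total  # divide by the number of examples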
Parameter Count
Before we compare architectures, it’s helpful to know how many trainable parameters each one has. This tiny utility gives us the count:
def count_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
Spoiler: the StrongCNN has over 450,000 parameters, while TinyCNN manages with just a few thousand. That’s a huge difference—and a great starting point for deeper analysis.
Experiment Runner
Finally, we put everything together into a single function that trains a model, times the process, evaluates on the test set, and prints a short summary:
import time
import torch.optim as optim
import torch.nn as nn
def run_experiment(model_class, name):
    model = model_class().to(DEVICE)
    optimizer = optim.Adam(model.parameters())
    criterion = nn.CrossEntropyLoss()
    print(f"\n{name} ({count_params(model)} parameters)")
    start = time.time()
    for epoch in range(EPOCHS):
        train(model, train_loader, optimizer, criterion)
    duration = time.time() - start
    acc, loss = test(model, test_loader)
    print(f"Test Accuracy: {acc * 100:.2f}% | Loss: {loss:.2f} | Learning time: {duration:.1f}s")
This structure is flexible enough to work with any model class you pass in—from simple MLPs to deep convolutional beasts.
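Once the four model classes from the following sections are defined, the whole comparison can be run in a single loop (a small convenience sketch; the post runs each experiment separately):
# Runs all four experiments back to back; assumes the classes below are already defined.
for model_class, name in [(MLP, "MLP"), (TinyCNN, "TinyCNN"), (CNN, "CNN"), (StrongCNN, "StrongCNN")]:
    run_experiment(model_class, name)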
In the following sections, we’ll define and analyze the four architectures: MLP, TinyCNN, CNN, and StrongCNN.
Model 1: Multi-Layer Perceptron (MLP)
The simplest architecture we consider is the classic Multi-Layer Perceptron (MLP). It treats each 28×28 image as a flat vector of 784 pixels, ignoring spatial structure but still able to learn useful features through fully connected layers.
import torch.nn as nn
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        h = 32  # number of hidden units
        self.model = nn.Sequential(
            nn.Flatten(),           # Flatten 28x28 image into a vector of length 784
            nn.Linear(28 * 28, h),  # Fully connected layer: 784 → 32
            nn.ReLU(),              # Non-linear activation
            nn.Linear(h, 10)        # Output layer: 32 → 10 classes
        )

    def forward(self, x):
        return self.model(x)
Explanation
- Flatten converts the 2D input image into a 1D vector.
- The first Linear layer projects this input vector into a 32-dimensional hidden space.
- The ReLU activation introduces non-linearity to learn complex patterns.
- The final Linear layer outputs logits for the 10 digit classes.
This small MLP has relatively few parameters and trains quickly, but it does not capture the spatial relationships between pixels, limiting its accuracy on image data.
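As a sanity check, the parameter count follows directly from the layer shapes:
hidden = 28 * 28 * 32 + 32  # weights + biases of the hidden layer = 25,120
output = 32 * 10 + 10       # weights + biases of the output layer = 330
print(hidden + output)      # 25,450, matching count_params(MLP())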
Calling run_experiment(MLP, "MLP"), you should see:
MLP (25450 parameters)
Test Accuracy: 95.96% | Loss: 0.14 | Learning time: 8.7s
It will be our point of reference for comparing the CNN models.
Model 2: TinyCNN — A Minimal Convolutional Neural Network
Next, we introduce a simple TinyCNN architecture that leverages convolutional layers to capture spatial patterns in images. This model is lightweight but far more powerful than the MLP for image tasks.
The figure below illustrates the TinyCNN architecture:
import torch.nn as nn
class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 4, kernel_size=3, padding=1),  # 1x28x28 → 4x28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                            # 4x14x14
            nn.Conv2d(4, 8, kernel_size=3, padding=1),  # 8x14x14
            nn.ReLU(),
            nn.MaxPool2d(2),                            # 8x7x7
            nn.Flatten(),
            nn.Linear(8 * 7 * 7, 10)                    # Direct to output layer
        )

    def forward(self, x):
        return self.net(x)
Architecture Overview
- The network begins with a convolutional layer transforming the input from 1 channel to 4 channels, preserving spatial dimensions with padding.
- A ReLU activation adds non-linearity.
- MaxPooling halves the spatial size to 14×14, reducing computational cost and providing spatial invariance.
- A second convolution expands feature maps from 4 to 8 channels.
- Another ReLU and max-pooling reduce the feature map size to 7×7.
- Finally, the features are flattened and passed directly to a linear layer outputting logits for the 10 digit classes.
======================================================================
Layer (type:depth-idx) Output Shape Param #
======================================================================
TinyCNN [64, 10] --
├─Sequential: 1-1 [64, 10] --
│ └─Conv2d: 2-1 [64, 4, 28, 28] 40
│ └─ReLU: 2-2 [64, 4, 28, 28] --
│ └─MaxPool2d: 2-3 [64, 4, 14, 14] --
│ └─Conv2d: 2-4 [64, 8, 14, 14] 296
│ └─ReLU: 2-5 [64, 8, 14, 14] --
│ └─MaxPool2d: 2-6 [64, 8, 7, 7] --
│ └─Flatten: 2-7 [64, 392] --
│ └─Linear: 2-8 [64, 10] 3,930
======================================================================
Total params: 4,266
Trainable params: 4,266
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 5.97
======================================================================
Input size (MB): 0.20
Forward/backward pass size (MB): 2.41
Params size (MB): 0.02
Estimated Total Size (MB): 2.63
======================================================================
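A layer-by-layer summary like the one above can be generated with the torchinfo package (an assumption on my side; it is not part of the code shown earlier and requires pip install torchinfo):
from torchinfo import summary  # pip install torchinfo

model = TinyCNN().to(DEVICE)
summary(model, input_size=(BATCH_SIZE, 1, 28, 28))  # a batch of 64 grayscale 28x28 images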
CNNs are sometimes presented graphically as the following pipeline of feature maps:
What is most interesting is that we beat the MLP’s results with just 4,266 parameters instead of 25,450.
Tiny CNN (4266 parameters)
Test Accuracy: 97.96% | Loss: 0.06 | Learning time: 12.3s
With a network several times smaller, we can expect roughly half as many mistakes as the previous model made.
Let’s check how the network improves when we keep a parameter count similar to the original MLP’s.
Model 3: CNN — A Balanced Convolutional Neural Network
Now that we’ve seen what a minimal convolutional model can do, let’s scale things up a bit.
The CNN model below is designed to maintain a balanced trade-off between parameter count and performance. It expands the feature extraction capabilities of the TinyCNN by using more filters and a hidden linear layer before the final output.
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # 1x28x28 → 8x28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 8x14x14
            nn.Conv2d(8, 16, kernel_size=3, padding=1),  # 16x14x14
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 16x7x7
            nn.Flatten(),
            nn.Linear(16 * 7 * 7, 32),                   # Dense layer with 32 units
            nn.ReLU(),
            nn.Linear(32, 10)                            # Final output layer
        )

    def forward(self, x):
        return self.net(x)
Architecture Breakdown
Compared to TinyCNN, this model:
- Doubles the number of convolutional filters in each block (4 → 8 and 8 → 16), allowing it to capture richer visual patterns.
- Adds a hidden fully connected layer with 32 neurons before the output. This extra layer improves the model’s ability to combine extracted features before making the final classification.
- Still uses only two convolutional layers and two pooling layers—keeping it reasonably lightweight and fast.
The table below lists all layers, output shapes, and parameter counts (batch dimension omitted):
| Layer | Output Shape | Parameters |
|---|---|---|
| Conv2d (1 → 8, 3×3) | 8×28×28 | 80 |
| ReLU | 8×28×28 | 0 |
| MaxPool2d | 8×14×14 | 0 |
| Conv2d (8 → 16, 3×3) | 16×14×14 | 1,168 |
| ReLU | 16×14×14 | 0 |
| MaxPool2d | 16×7×7 | 0 |
| Flatten | 784 | 0 |
| Linear (784 → 32) | 32 | 25,120 |
| ReLU | 32 | 0 |
| Linear (32 → 10) | 10 | 330 |
| Total | — | 26,698 |
With 26,698 parameters, this CNN has a similar size to the MLP (25,450 parameters), yet it is significantly more powerful.
CNN (26698 parameters)
Test Accuracy: 98.22% | Loss: 0.05 | Learning time: 14.3s
Key Observations
- Accuracy boost: The model jumps to 98.22% accuracy, improving over both the MLP and the TinyCNN.
- Parameter efficiency: Despite a similar parameter count to the MLP, this CNN leverages spatial patterns through convolution to achieve better performance.
- Inference-ready: The size and speed of this model make it suitable for lightweight applications and real-time digit recognition.
This model demonstrates a sweet spot: good depth, reasonable parameter size, and excellent accuracy. But what if we didn’t care about size at all and wanted to push performance even further?
Let’s find out in the next section.
Model 4: StrongCNN — A Deep Convolutional Powerhouse
So far, we’ve looked at models that balance performance and simplicity. But what if we remove the constraints and go all-in on performance?
The StrongCNN is a deeper, more expressive architecture that brings in multiple convolutional layers, higher channel counts, and regularization techniques like Dropout to prevent overfitting. It’s inspired by best practices from larger vision models but still compact enough to train quickly on MNIST.
class StrongCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1),   # 1x28x28 → 32x28x28
            nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1),  # 32x28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                  # 32x14x14
            nn.Dropout(0.25),
            nn.Conv2d(32, 64, 3, padding=1),  # 64x14x14
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1),  # 64x14x14
            nn.ReLU(),
            nn.MaxPool2d(2),                  # 64x7x7
            nn.Dropout(0.25),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        return self.net(x)
Architecture Breakdown
This model stacks four convolutional layers in two blocks, with increasing filter counts (32 → 64). After each block:
- We apply MaxPool2d(2) to downsample.
- We apply Dropout to reduce overfitting.
Finally, the features are flattened and passed through two fully connected layers, with a 128-neuron hidden layer and another dropout in between.
======================================================================
Layer (type:depth-idx) Output Shape Param #
======================================================================
StrongCNN [64, 10] --
├─Sequential: 1-1 [64, 10] --
│ └─Conv2d: 2-1 [64, 32, 28, 28] 320
│ └─ReLU: 2-2 [64, 32, 28, 28] --
│ └─Conv2d: 2-3 [64, 32, 28, 28] 9,248
│ └─ReLU: 2-4 [64, 32, 28, 28] --
│ └─MaxPool2d: 2-5 [64, 32, 14, 14] --
│ └─Dropout: 2-6 [64, 32, 14, 14] --
│ └─Conv2d: 2-7 [64, 64, 14, 14] 18,496
│ └─ReLU: 2-8 [64, 64, 14, 14] --
│ └─Conv2d: 2-9 [64, 64, 14, 14] 36,928
│ └─ReLU: 2-10 [64, 64, 14, 14] --
│ └─MaxPool2d: 2-11 [64, 64, 7, 7] --
│ └─Dropout: 2-12 [64, 64, 7, 7] --
│ └─Flatten: 2-13 [64, 3136] --
│ └─Linear: 2-14 [64, 128] 401,536
│ └─ReLU: 2-15 [64, 128] --
│ └─Dropout: 2-16 [64, 128] --
│ └─Linear: 2-17 [64, 10] 1,290
======================================================================
Total params: 467,818
Trainable params: 467,818
Non-trainable params: 0
Total mult-adds (Units.GIGABYTES): 1.20
======================================================================
Input size (MB): 0.20
Forward/backward pass size (MB): 38.61
Params size (MB): 1.87
Estimated Total Size (MB): 40.68
======================================================================
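Most of that capacity sits in a single layer. A quick calculation based on the summary above shows that the first fully connected layer holds the bulk of the weights:
fc_params = 64 * 7 * 7 * 128 + 128        # 3136 -> 128 layer: weights + biases = 401,536
total_params = 467_818                    # total from the summary above
print(f"{fc_params / total_params:.0%}")  # roughly 86% of all parameters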
With nearly half a million parameters, this model dwarfs the others in capacity. But it pays off.
Strong CNN (467818 parameters)
Test Accuracy: 99.09% | Loss: 0.03 | Learning time: 75.0s
Key Observations
- Top-tier accuracy: The StrongCNN reaches 99.09% test accuracy, getting very close to human-level performance on MNIST.
- Regularization matters: With this many parameters, dropout is crucial to avoid overfitting.
- Cost of performance: Training takes almost 9× as long as the MLP, and the model has roughly 17× more parameters than the balanced CNN.
This model is overkill for MNIST—but that’s the point. It illustrates how far you can go when accuracy is the only goal.
Recap: Comparing All Four Models
Let’s wrap up with a side-by-side summary:
| Model | Parameters | Test Accuracy | Loss | Training Time |
|---|---|---|---|---|
| MLP | 25,450 | 95.96% | 0.14 | 8.7s |
| TinyCNN | 4,266 | 97.96% | 0.06 | 12.3s |
| CNN | 26,698 | 98.22% | 0.05 | 14.3s |
| StrongCNN | 467,818 | 99.09% | 0.03 | 75.0s |
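To make the trade-off visible, a short matplotlib sketch (using the numbers from the table above; the variable names are mine) plots accuracy against parameter count:
import matplotlib.pyplot as plt

# Results from the recap table above.
names = ["MLP", "TinyCNN", "CNN", "StrongCNN"]
params = [25_450, 4_266, 26_698, 467_818]
accuracy = [95.96, 97.96, 98.22, 99.09]

plt.figure(figsize=(6, 4))
plt.scatter(params, accuracy)
for name, p, acc in zip(names, params, accuracy):
    plt.annotate(name, (p, acc), textcoords="offset points", xytext=(5, 5))
plt.xscale("log")  # parameter counts span two orders of magnitude
plt.xlabel("Trainable parameters (log scale)")
plt.ylabel("Test accuracy [%]")
plt.title("Accuracy vs. model size on MNIST")
plt.tight_layout()
plt.savefig("accuracy_vs_params.svg")
plt.show()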
Conclusion
This experiment demonstrates how architecture choices affect performance in neural networks. Even for a simple dataset like MNIST:
- MLPs work but ignore image structure.
- CNNs leverage spatial features for better results.
- Depth and width improve accuracy but increase training cost and overfitting risk.
- Regularization is essential for deeper networks.
Convolutional models outperform MLPs not because they’re “deeper” or “fancier,” but because they understand how images work.
These results mirror broader trends seen in state-of-the-art research:
- Convolutional models remain the backbone of MNIST classification, offering strong inductive biases for image structure.
- Techniques like dropout, data augmentation, and deep architectures are key to improving performance.
- More advanced models, such as capsule networks, transformer hybrids, and ensembles, have pushed accuracy beyond 99.85%, although these methods are often overkill for MNIST and require far more compute.
Our experiments reaffirm that CNNs are not only more accurate than MLPs—they are also more efficient and better suited to image-based tasks. While SOTA models continue to push the boundary, our practical models already achieve high accuracy with a fraction of the complexity.