Deep learning often feels “mystical” because many learners start with high-level libraries that hide the maths. If you want real confidence, it helps to build a small Multi-Layer Perceptron (MLP) yourself and implement backpropagation step by step. This approach makes concepts like gradients, weight updates, and activation functions intuitive—and it also prepares you for practical model debugging in production. Many learners who take an AI course in Delhi eventually realise that understanding the internal mechanics is what separates “library usage” from “model engineering”.
What an MLP really is (and what it is not)
An MLP is a feed-forward neural network made of fully connected layers. Each layer performs a linear transformation followed by a non-linear activation:
- Linear step: z = XW + b
- Activation: a = f(z)
A minimal MLP for classification typically has:
- Input layer (features)
- One or more hidden layers (learn representations)
- Output layer (class scores or probabilities)
The “multi-layer” part matters because stacking layers allows the model to learn hierarchical patterns. The non-linearity is essential; without it, multiple layers collapse into a single linear model.
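That collapse is easy to verify directly. A quick NumPy sketch (shapes and variable names are illustrative) showing that two stacked linear layers with no activation equal a single linear layer with merged parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))  # 4 samples, 3 features

# Two stacked linear layers with no activation in between...
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 2)), np.zeros(2)
two_linear = (X @ W1 + b1) @ W2 + b2

# ...are equivalent to one linear layer with merged parameters.
W = W1 @ W2
b = b1 @ W2 + b2
one_linear = X @ W + b

print(np.allclose(two_linear, one_linear))  # True
```

Inserting a non-linearity between the two layers breaks this equivalence, which is exactly why activations matter.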
Forward pass: computing predictions
The forward pass is the easy part: you start with inputs, apply each layer’s weights and biases, run an activation, and finally produce an output. For binary classification, a common setup is:
- Hidden layers: ReLU or tanh
- Output layer: sigmoid
- Loss: binary cross-entropy
In practice, you store intermediate values (like z and a) during the forward pass. Backpropagation needs them to compute gradients efficiently. If you are learning through an AI course in Delhi, try to treat the forward pass as a “data flow graph”: each operation creates a value that later receives a gradient signal.
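A minimal sketch of this caching pattern (the `forward` helper, the `params` dictionary layout, and the layer sizes are illustrative choices, not fixed by the article):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def forward(X, params):
    """Forward pass through one hidden layer, caching every intermediate
    value (Z1, A1, Z2, A2) that backpropagation will later need."""
    cache = {"A0": X}
    cache["Z1"] = X @ params["W1"] + params["b1"]
    cache["A1"] = relu(cache["Z1"])
    cache["Z2"] = cache["A1"] @ params["W2"] + params["b2"]
    cache["A2"] = sigmoid(cache["Z2"])  # predicted probabilities
    return cache["A2"], cache

rng = np.random.default_rng(0)
params = {
    "W1": rng.normal(scale=0.01, size=(3, 4)), "b1": np.zeros((1, 4)),
    "W2": rng.normal(scale=0.01, size=(4, 1)), "b2": np.zeros((1, 1)),
}
yhat, cache = forward(rng.normal(size=(5, 3)), params)
print(yhat.shape)  # (5, 1)
```

The cache is the “data flow graph” in dictionary form: each key is a node whose stored value gets reused when gradients flow backward.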
Backpropagation: the chain rule in action
Backpropagation is simply the chain rule applied repeatedly from the loss back to each parameter.
For a single layer:
- z = a_prev W + b
- a = f(z)
You want gradients of the loss L w.r.t. the parameters:
- ∂L/∂W
- ∂L/∂b
Key idea:
1. Compute the “error signal” at the output (how wrong the prediction is).
2. Propagate that error backward through each layer using derivatives.
3. Use gradients to update weights via gradient descent.
Example intuition (binary classification):
If ŷ is the prediction and y is the target, the output-layer gradient often starts from something like (ŷ − y), which is then multiplied by activation derivatives and previous activations as it moves backward.
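This starting point can be verified numerically. A small check, assuming the sigmoid-plus-binary-cross-entropy setup described earlier, in which the gradient of the loss with respect to the pre-activation z simplifies to ŷ − y:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(z, y):
    """Binary cross-entropy of sigmoid(z) against target y."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z, y, eps = 0.7, 1.0, 1e-6

# Analytic gradient claimed in the text: dL/dz = sigmoid(z) - y
analytic = sigmoid(z) - y

# Numeric estimate via central finite differences
numeric = (bce(z + eps, y) - bce(z - eps, y)) / (2 * eps)

print(abs(analytic - numeric) < 1e-6)  # True
```

The sigmoid and cross-entropy derivatives cancel neatly, which is why implementations start the backward pass from this compact (ŷ − y) term rather than differentiating the two steps separately.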
Common activation derivatives (you must implement these)
- Sigmoid: σ′(z) = σ(z)(1 − σ(z))
- tanh: tanh′(z) = 1 − tanh²(z)
- ReLU: derivative is 1 when z > 0, else 0
ReLU is popular because it reduces vanishing-gradient issues compared to sigmoid/tanh in deeper networks.
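These derivatives are easy to get subtly wrong, so it is worth comparing each analytic form against a finite-difference estimate before using it in backprop (ReLU is skipped here because its kink at z = 0 makes the numeric check unreliable at that point):

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def sigmoid_grad(z): s = sigmoid(z); return s * (1.0 - s)
def tanh_grad(z): return 1.0 - np.tanh(z) ** 2

# Compare each analytic derivative with a central finite difference.
z = np.linspace(-2.0, 2.0, 9)
eps = 1e-6
ok = []
for f, g in [(sigmoid, sigmoid_grad), (np.tanh, tanh_grad)]:
    numeric = (f(z + eps) - f(z - eps)) / (2 * eps)
    ok.append(np.allclose(g(z), numeric, atol=1e-6))
print(ok)  # [True, True]
```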
A tiny MLP from scratch (NumPy-only)
Below is a compact reference implementation for a 1-hidden-layer MLP. It focuses on clarity over speed.
import numpy as np
def sigmoid(x): return 1 / (1 + np.exp(-x))
def relu(x): return np.maximum(0, x)
def relu_grad(x): return (x > 0).astype(x.dtype)
# X: (n, d) features; y: (n, 1) labels in {0, 1}
n, d, h = X.shape[0], X.shape[1], 16
W1, b1 = np.random.randn(d, h) * 0.01, np.zeros((1, h))
W2, b2 = np.random.randn(h, 1) * 0.01, np.zeros((1, 1))
lr = 0.1

for _ in range(1000):
    # Forward pass (cache Z1, A1 for the backward pass)
    Z1 = X @ W1 + b1
    A1 = relu(Z1)
    Z2 = A1 @ W2 + b2
    yhat = sigmoid(Z2)

    # Loss gradient at the output (binary cross-entropy with sigmoid)
    dZ2 = (yhat - y) / n

    # Backpropagate through the output layer
    dW2 = A1.T @ dZ2
    db2 = dZ2.sum(axis=0, keepdims=True)

    # Backpropagate through the hidden layer
    dA1 = dZ2 @ W2.T
    dZ1 = dA1 * relu_grad(Z1)
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0, keepdims=True)

    # Gradient-descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
What to observe:
- Shapes must align at every step.
- You cache Z1, A1, and Z2 because gradients depend on them.
- Dividing by n keeps gradients scale-stable across batch sizes.
Practical checks to avoid silent mistakes
When implementing from scratch, most errors are not “syntax errors”—they are math errors that still run. Use these checks:
- Gradient checking (finite differences): numerically approximate gradients and compare with backprop results on a tiny batch.
- Overfit a tiny dataset: your model should reach near-zero loss on a small set if gradients are correct.
- Track loss curve: if loss explodes, lower learning rate; if it never decreases, check derivative logic and label shapes.
- Watch saturation: sigmoid outputs stuck near 0 or 1 can cause slow learning if earlier layers push values too far.
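The first check above can be sketched concretely for the 1-hidden-layer network. The `loss_and_grads` helper below is an illustrative restructuring of the reference code (not the article's exact loop), and the ReLU kink at zero can, very rarely, make a single perturbed entry disagree:

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def relu(x): return np.maximum(0.0, x)

def loss_and_grads(params, X, y):
    """Forward + backward pass for a 1-hidden-layer MLP with BCE loss."""
    W1, b1, W2, b2 = params
    n = X.shape[0]
    Z1 = X @ W1 + b1
    A1 = relu(Z1)
    yhat = sigmoid(A1 @ W2 + b2)
    loss = -np.mean(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))
    dZ2 = (yhat - y) / n               # BCE-with-sigmoid shortcut
    dW2 = A1.T @ dZ2
    db2 = dZ2.sum(axis=0, keepdims=True)
    dZ1 = (dZ2 @ W2.T) * (Z1 > 0)      # ReLU derivative mask
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0, keepdims=True)
    return loss, [dW1, db1, dW2, db2]

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = (rng.random((8, 1)) > 0.5).astype(float)
params = [rng.normal(scale=0.5, size=(3, 4)), np.zeros((1, 4)),
          rng.normal(scale=0.5, size=(4, 1)), np.zeros((1, 1))]

loss, grads = loss_and_grads(params, X, y)

# Perturb each parameter entry and compare the resulting central
# finite difference with the analytic backprop gradient.
eps = 1e-6
checks = []
for p, g in zip(params, grads):
    num = np.zeros_like(p)
    for idx in np.ndindex(p.shape):
        old = p[idx]
        p[idx] = old + eps
        lp, _ = loss_and_grads(params, X, y)
        p[idx] = old - eps
        lm, _ = loss_and_grads(params, X, y)
        p[idx] = old
        num[idx] = (lp - lm) / (2 * eps)
    checks.append(np.allclose(g, num, atol=1e-5))
print(checks)
```

Finite differences are far too slow for training, but on a tiny batch they are the single most reliable way to catch a wrong sign, a dropped transpose, or a missing derivative term.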
These habits are commonly emphasised in a good AI course in Delhi, because they mirror real debugging workflows used by ML engineers.
Conclusion
Building an MLP and backpropagation from scratch is one of the fastest ways to understand deep learning at a structural level. Once you can derive gradients, manage tensor shapes, and verify learning behaviour, you can trust yourself with more complex architectures and training pipelines. If your goal is to move beyond “using frameworks” into genuine model building, practising this exercise alongside an AI course in Delhi can give you a strong, practical foundation.

