In this project, I built a Transformer model entirely from scratch, including both the architecture and the underlying automatic differentiation engine. I implemented everything using only Python and NumPy—without relying on PyTorch, TensorFlow, or any external ML libraries.

Custom Autodiff System

I started by writing an autodiff framework built around an explicit computational graph (a minimal sketch follows the list below). This included:

  • A Variable and Node system to represent inputs and operations
  • An Evaluator that runs forward computation over the graph in topological order
  • A gradients() function that symbolically constructs the backward graph
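
To make that concrete, here is a minimal sketch of the same idea in plain NumPy. It keeps the Variable, Node, Evaluator, and gradients() names from the list above, but the exact signatures, the tiny op set (elementwise Add and Mul only), and the usage example are simplifications for illustration rather than the project's actual code.

```python
import numpy as np

class Node:
    """One vertex of the computational graph: a leaf Variable or the output of an op."""
    def __init__(self, op=None, inputs=(), name=""):
        self.op, self.inputs, self.name = op, list(inputs), name
    def __add__(self, other): return AddOp()(self, other)
    def __mul__(self, other): return MulOp()(self, other)

def Variable(name):
    """Leaf node whose value is supplied to the Evaluator at run time."""
    return Node(name=name)

class Op:
    def __call__(self, *inputs): return Node(op=self, inputs=inputs)
    def compute(self, node, vals): raise NotImplementedError    # numeric forward pass
    def gradient(self, node, grad): raise NotImplementedError   # symbolic backward pass

class AddOp(Op):
    def compute(self, node, vals): return vals[0] + vals[1]
    def gradient(self, node, g): return [g, g]                  # d(a+b)/da = d(a+b)/db = 1

class MulOp(Op):
    def compute(self, node, vals): return vals[0] * vals[1]
    def gradient(self, node, g):
        a, b = node.inputs
        return [g * b, g * a]                                   # product rule, built as new graph nodes

class OnesLikeOp(Op):
    def compute(self, node, vals): return np.ones_like(vals[0])  # seed gradient dL/dL = 1

def topo_sort(outputs):
    order, seen = [], set()
    def visit(n):
        if id(n) in seen: return
        seen.add(id(n))
        for i in n.inputs: visit(i)
        order.append(n)
    for o in outputs: visit(o)
    return order

class Evaluator:
    """Runs the forward computation over the graph in topological order."""
    def __init__(self, outputs): self.outputs = outputs
    def run(self, feed):                                        # feed maps leaf Variables to arrays
        vals = dict(feed)
        for n in topo_sort(self.outputs):
            if n not in vals:
                vals[n] = n.op.compute(n, [vals[i] for i in n.inputs])
        return [vals[o] for o in self.outputs]

def gradients(output, wrt):
    """Symbolically builds backward-graph nodes for d(output)/d(node) for each node in wrt."""
    grads = {output: OnesLikeOp()(output)}
    for n in reversed(topo_sort([output])):
        if n.op is None: continue                               # leaf Variable: nothing to propagate
        for inp, g in zip(n.inputs, n.op.gradient(n, grads[n])):
            grads[inp] = grads[inp] + g if inp in grads else g  # accumulate symbolically
    return [grads[w] for w in wrt]

# usage: y = x1 * x2 + x1, then evaluate y, dy/dx1, dy/dx2 in one forward pass
x1, x2 = Variable("x1"), Variable("x2")
y = x1 * x2 + x1
dy_dx1, dy_dx2 = gradients(y, [x1, x2])
print(Evaluator([y, dy_dx1, dy_dx2]).run({x1: np.array([2.0]), x2: np.array([3.0])}))
# -> [array([8.]), array([4.]), array([2.])]
```

The key point is that gradients() returns new graph nodes rather than numbers, so the same Evaluator that runs the forward pass can also evaluate the derivatives.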

I implemented the forward computation and the backward (gradient) rule by hand for every operation the Transformer needs, including:

  • MatMul, Add, Div, Transpose, Power, Sqrt, Mean
  • Softmax and LayerNorm, both of which required careful differentiation logic due to their internal broadcasting and normalization steps (see the backward-pass sketch after this list)
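
To show why, here are the two backward rules written out as plain NumPy functions. This is a numeric sketch of the math only; in the project these expressions are built as nodes of the backward graph. The normalization axis is assumed to be the last one, and the LayerNorm version leaves out the learned scale and shift for brevity.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax along the last axis
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_backward(s, grad_out):
    # given s = softmax(x): dL/dx = s * (dL/ds - sum(dL/ds * s, axis=-1))
    return s * (grad_out - (grad_out * s).sum(axis=-1, keepdims=True))

def layernorm_backward(x, grad_out, eps=1e-5):
    # backward of y = (x - mean) / sqrt(var + eps) over the last axis
    # (learned scale/shift omitted for brevity)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    g = grad_out
    return (g - g.mean(axis=-1, keepdims=True)
              - x_hat * (g * x_hat).mean(axis=-1, keepdims=True)) / np.sqrt(var + eps)
```

The subtracted sum and mean terms are exactly where the coupling shows up: every output of a softmax or layer-norm row depends on every input of that row, unlike the purely elementwise ops.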

Transformer Architecture

I used my autodiff engine to build a simplified Transformer model (a forward-only NumPy sketch follows the list) composed of:

  • A custom Linear Layer, implemented from matmul and add
  • A Single-Head Attention mechanism with scaled dot-product attention
  • A Feed-Forward Network with ReLU
  • A full Encoder Layer combining the above components
  • A custom Softmax Loss function for training
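
For reference, this is roughly what those components look like as a forward-only NumPy sketch. The parameter layout, the omission of learned scale/shift in the layer norm, and the exact placement of layer norm inside the encoder layer are illustrative assumptions, not a faithful copy of the project code.

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # normalize each position over the feature axis (scale/shift omitted)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def linear(x, W, b):
    # the custom Linear layer: a matmul followed by a broadcast add
    return x @ W + b

def single_head_attention(x, Wq, bq, Wk, bk, Wv, bv):
    # scaled dot-product attention over a (seq_len, d_model) input
    Q, K, V = linear(x, Wq, bq), linear(x, Wk, bk), linear(x, Wv, bv)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (seq_len, seq_len)
    return softmax(scores) @ V                # attention weights applied to the values

def feed_forward(x, W1, b1, W2, b2):
    # position-wise feed-forward network with ReLU
    return linear(np.maximum(0.0, linear(x, W1, b1)), W2, b2)

def encoder_layer(x, p):
    # one simplified encoder layer: attention, then FFN, each followed by layer norm
    # (no residual connections, matching the simplified model described here)
    h = layer_norm(single_head_attention(x, *p["attn"]))
    return layer_norm(feed_forward(h, *p["ffn"]))
```

Dividing the attention scores by sqrt(d_k) keeps the dot products from growing with the key dimension, which keeps the softmax away from its saturated, near-zero-gradient region.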

Each of these components was built using only primitive operations that I implemented and differentiated myself. No PyTorch autograd was used—every gradient path was explicitly constructed through symbolic backward graphs.

Training on MNIST

I trained the Transformer on the MNIST dataset, treating each 28×28 image as a sequence of flattened patches. I implemented the following, sketched after the list:

  • transformer() to define the full model
  • softmax_loss() using manual log-softmax and one-hot target logic
  • sgd_epoch() to update weights using manually calculated gradients
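
A minimal sketch of the last two pieces is shown below. The function names match the list above, but the signatures, the grads_fn callable (standing in for evaluating the symbolic backward graph), and the hyperparameter defaults are assumptions made for illustration.

```python
import numpy as np

def softmax_loss(logits, labels, num_classes=10):
    # mean cross-entropy via a manual, numerically stable log-softmax and one-hot targets
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    one_hot = np.eye(num_classes)[labels]
    return -(one_hot * log_probs).sum(axis=-1).mean()

def sgd_epoch(params, grads_fn, X, y, lr=0.1, batch_size=64):
    # one pass over the training set: per mini-batch, evaluate the gradients
    # (grads_fn is a stand-in for running the backward graph) and take an SGD step
    for start in range(0, len(X), batch_size):
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        grads = grads_fn(params, xb, yb)
        for name in params:
            params[name] -= lr * grads[name]
    return params
```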

Even without residual connections or multi-head attention, the model achieved over 50% test accuracy, well above the 10% chance level for MNIST's ten classes, which served as a sanity check that both the architecture and the gradients produced by the autodiff engine were wired correctly.

What I Learned

This project taught me how neural networks really work under the hood:

  • How autodiff systems compute gradients symbolically, not just numerically
  • How to differentiate complex ops like Softmax and LayerNorm correctly
  • How to build reusable, composable model components using only primitive mathematical ops

Building a Transformer from scratch—starting from nothing but NumPy arrays—was one of the most rewarding technical challenges I’ve tackled. It gave me a much deeper understanding of how modern ML frameworks operate internally and how to bridge low-level autodiff with high-level model design.