In this project, I built a Transformer model entirely from scratch, including both the architecture and the underlying automatic differentiation engine. I implemented everything using only Python and NumPy—without relying on PyTorch, TensorFlow, or any external ML libraries.
Custom Autodiff System
I started by writing a computational graph-based autodiff framework. This included:
- A VariableandNodesystem to represent inputs and operations
- A topological Evaluatorthat performs forward computation over the graph
- A gradients()function that symbolically constructs the backward graph
I manually implemented forward and backward modes for every operation needed by the Transformer, including:
- MatMul,- Add,- Div,- Transpose,- Power,- Sqrt,- Mean
- Softmaxand- LayerNorm, both of which required careful differentiation logic due to their internal broadcasting and normalization steps
Transformer Architecture
I used my autodiff engine to build a simplified Transformer model composed of:
- A custom Linear Layer, implemented from matmul and add
- A Single-Head Attention mechanism with scaled dot-product attention
- A Feed-Forward Network with ReLU
- A full Encoder Layer combining the above components
- A custom Softmax Loss function for training
Each of these components was built using only primitive operations that I implemented and differentiated myself. No PyTorch autograd was used—every gradient path was explicitly constructed through symbolic backward graphs.
Training on MNIST
I trained the Transformer on the MNIST dataset, treating each image as a flattened sequence of patches. I implemented:
- transformer()to define the full model
- softmax_loss()using manual log-softmax and one-hot target logic
- sgd_epoch()to update weights using manually calculated gradients
Despite the model not using residual connections or multi-head attention, it achieved over 50% test accuracy, validating the correctness of both the architecture and the autodiff engine.
What I Learned
This project taught me how neural networks really work under the hood:
- How autodiff systems compute gradients symbolically, not just numerically
- How to differentiate complex ops like Softmax and LayerNorm correctly
- How to build scalable model components using only primitive mathematical ops
Building a Transformer from scratch—starting from nothing but NumPy arrays—was one of the most rewarding technical challenges I’ve tackled. It gave me a much deeper understanding of how modern ML frameworks operate internally and how to bridge low-level autodiff with high-level model design.

