In this project, I built a Transformer model entirely from scratch, including both the architecture and the underlying automatic differentiation engine. I implemented everything using only Python and NumPy—without relying on PyTorch, TensorFlow, or any external ML libraries.

Custom Autodiff System

I started by writing an autodiff framework built around an explicit computational graph (a minimal sketch follows the list below). This included:

  • A Variable and Node system to represent inputs and operations
  • An Evaluator that runs forward computation over the graph in topological order
  • A gradients() function that symbolically constructs the backward graph
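
To make that concrete, here is a minimal sketch of the same idea in plain NumPy. It keeps the Variable, Node, Evaluator, and gradients() names from the list above, but the exact signatures, the tiny op set (elementwise Add and Mul only), and the usage example are simplifications for illustration rather than the project's actual code.

```python
import numpy as np

class Node:
    """One vertex of the computational graph: a leaf Variable or the output of an op."""
    def __init__(self, op=None, inputs=(), name=""):
        self.op, self.inputs, self.name = op, list(inputs), name
    def __add__(self, other): return AddOp()(self, other)
    def __mul__(self, other): return MulOp()(self, other)

def Variable(name):
    """Leaf node whose value is supplied to the Evaluator at run time."""
    return Node(name=name)

class Op:
    def __call__(self, *inputs): return Node(op=self, inputs=inputs)
    def compute(self, node, vals): raise NotImplementedError    # numeric forward pass
    def gradient(self, node, grad): raise NotImplementedError   # symbolic backward pass

class AddOp(Op):
    def compute(self, node, vals): return vals[0] + vals[1]
    def gradient(self, node, g): return [g, g]                  # d(a+b)/da = d(a+b)/db = 1

class MulOp(Op):
    def compute(self, node, vals): return vals[0] * vals[1]
    def gradient(self, node, g):
        a, b = node.inputs
        return [g * b, g * a]                                   # product rule, built as new graph nodes

class OnesLikeOp(Op):
    def compute(self, node, vals): return np.ones_like(vals[0])  # seed gradient dL/dL = 1

def topo_sort(outputs):
    order, seen = [], set()
    def visit(n):
        if id(n) in seen: return
        seen.add(id(n))
        for i in n.inputs: visit(i)
        order.append(n)
    for o in outputs: visit(o)
    return order

class Evaluator:
    """Runs the forward computation over the graph in topological order."""
    def __init__(self, outputs): self.outputs = outputs
    def run(self, feed):                                        # feed maps leaf Variables to arrays
        vals = dict(feed)
        for n in topo_sort(self.outputs):
            if n not in vals:
                vals[n] = n.op.compute(n, [vals[i] for i in n.inputs])
        return [vals[o] for o in self.outputs]

def gradients(output, wrt):
    """Symbolically builds backward-graph nodes for d(output)/d(node) for each node in wrt."""
    grads = {output: OnesLikeOp()(output)}
    for n in reversed(topo_sort([output])):
        if n.op is None: continue                               # leaf Variable: nothing to propagate
        for inp, g in zip(n.inputs, n.op.gradient(n, grads[n])):
            grads[inp] = grads[inp] + g if inp in grads else g  # accumulate symbolically
    return [grads[w] for w in wrt]

# usage: y = x1 * x2 + x1, then evaluate y, dy/dx1, dy/dx2 in one forward pass
x1, x2 = Variable("x1"), Variable("x2")
y = x1 * x2 + x1
dy_dx1, dy_dx2 = gradients(y, [x1, x2])
print(Evaluator([y, dy_dx1, dy_dx2]).run({x1: np.array([2.0]), x2: np.array([3.0])}))
# -> [array([8.]), array([4.]), array([2.])]
```

The key point is that gradients() returns new graph nodes rather than numbers, so the same Evaluator that runs the forward pass can also evaluate the derivatives.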

I implemented the forward computation and the backward (gradient) rule by hand for every operation the Transformer needs, including:

  • MatMul, Add, Div, Transpose, Power, Sqrt, Mean
  • Softmax and LayerNorm, both of which required careful differentiation logic due to their internal broadcasting and normalization steps (see the backward-pass sketch after this list)
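
To show why, here are the two backward rules written out as plain NumPy functions. This is a numeric sketch of the math only; in the project these expressions are built as nodes of the backward graph. The normalization axis is assumed to be the last one, and the LayerNorm version leaves out the learned scale and shift for brevity.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax along the last axis
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_backward(s, grad_out):
    # given s = softmax(x): dL/dx = s * (dL/ds - sum(dL/ds * s, axis=-1))
    return s * (grad_out - (grad_out * s).sum(axis=-1, keepdims=True))

def layernorm_backward(x, grad_out, eps=1e-5):
    # backward of y = (x - mean) / sqrt(var + eps) over the last axis
    # (learned scale/shift omitted for brevity)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    g = grad_out
    return (g - g.mean(axis=-1, keepdims=True)
              - x_hat * (g * x_hat).mean(axis=-1, keepdims=True)) / np.sqrt(var + eps)
```

The subtracted sum and mean terms are exactly where the coupling shows up: every output of a softmax or layer-norm row depends on every input of that row, unlike the purely elementwise ops.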

Transformer Architecture

I used my autodiff engine to build a simplified Transformer model (a forward-only NumPy sketch follows the list) composed of:

  • A custom Linear Layer, implemented from matmul and add
  • A Single-Head Attention mechanism with scaled dot-product attention
  • A Feed-Forward Network with ReLU
  • A full Encoder Layer combining the above components
  • A custom Softmax Loss function for training
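
For reference, this is roughly what those components look like as a forward-only NumPy sketch. The parameter layout, the omission of learned scale/shift in the layer norm, and the exact placement of layer norm inside the encoder layer are illustrative assumptions, not a faithful copy of the project code.

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # normalize each position over the feature axis (scale/shift omitted)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def linear(x, W, b):
    # the custom Linear layer: a matmul followed by a broadcast add
    return x @ W + b

def single_head_attention(x, Wq, bq, Wk, bk, Wv, bv):
    # scaled dot-product attention over a (seq_len, d_model) input
    Q, K, V = linear(x, Wq, bq), linear(x, Wk, bk), linear(x, Wv, bv)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (seq_len, seq_len)
    return softmax(scores) @ V                # attention weights applied to the values

def feed_forward(x, W1, b1, W2, b2):
    # position-wise feed-forward network with ReLU
    return linear(np.maximum(0.0, linear(x, W1, b1)), W2, b2)

def encoder_layer(x, p):
    # one simplified encoder layer: attention, then FFN, each followed by layer norm
    # (no residual connections, matching the simplified model described here)
    h = layer_norm(single_head_attention(x, *p["attn"]))
    return layer_norm(feed_forward(h, *p["ffn"]))
```

Dividing the attention scores by sqrt(d_k) keeps the dot products from growing with the key dimension, which keeps the softmax away from its saturated, near-zero-gradient region.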

Each of these components was built using only primitive operations that I implemented and differentiated myself. No PyTorch autograd was used—every gradient path was explicitly constructed through symbolic backward graphs.

Training on MNIST

I trained the Transformer on the MNIST dataset, treating each 28×28 image as a sequence of flattened patches. I implemented the following, sketched after the list:

  • transformer() to define the full model
  • softmax_loss() using manual log-softmax and one-hot target logic
  • sgd_epoch() to update weights using manually calculated gradients
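
A minimal sketch of the last two pieces is shown below. The function names match the list above, but the signatures, the grads_fn callable (standing in for evaluating the symbolic backward graph), and the hyperparameter defaults are assumptions made for illustration.

```python
import numpy as np

def softmax_loss(logits, labels, num_classes=10):
    # mean cross-entropy via a manual, numerically stable log-softmax and one-hot targets
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    one_hot = np.eye(num_classes)[labels]
    return -(one_hot * log_probs).sum(axis=-1).mean()

def sgd_epoch(params, grads_fn, X, y, lr=0.1, batch_size=64):
    # one pass over the training set: per mini-batch, evaluate the gradients
    # (grads_fn is a stand-in for running the backward graph) and take an SGD step
    for start in range(0, len(X), batch_size):
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        grads = grads_fn(params, xb, yb)
        for name in params:
            params[name] -= lr * grads[name]
    return params
```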

Even without residual connections or multi-head attention, the model achieved over 50% test accuracy, well above the 10% chance level for MNIST's ten classes, which served as a sanity check that both the architecture and the gradients produced by the autodiff engine were wired correctly.

What I Learned

This project taught me how neural networks really work under the hood:

  • How autodiff systems compute gradients symbolically, not just numerically
  • How to differentiate complex ops like Softmax and LayerNorm correctly
  • How to build reusable, composable model components using only primitive mathematical ops

Building a Transformer from scratch—starting from nothing but NumPy arrays—was one of the most rewarding technical challenges I’ve tackled. It gave me a much deeper understanding of how modern ML frameworks operate internally and how to bridge low-level autodiff with high-level model design.