SmolGPT: A minimal PyTorch implementation for training a small LLM from scratch

377 points by amrrs a day ago | 50 comments

This is cool, and timely (I wanted a neat repo like that).

I have also been working from last 2 weeks on a gpt implementation in C. Eventually it turned out to be really slow (without CUDA). But it taught me how much memory management and data management there is when implementing these systems. You are running like a loop billions of times so you need to preallocate the computational graph and stuff. If anyone wanna check out it's ~1500 LOC single file:

https://github.com/attentionmech/gpt.c/blob/main/gpt.c

sitkack a day ago | prev | next |

Neat, I love projects like these.

The next level down is to do it directly in numpy.

And then from there, write a minimal numpy work-a-like to support the model above.

You start with a working system using the most powerful abstractions. Then you iteratively remove abstractions, lowering your solution, then when you get low enough but still riding on an external abstraction, you rewrite that, but ONLY to support the layers above you.

Following the above pattern, you can bootstrap yourself to have full system understanding. This is not unlike RL+distillation that human persons do learn complex topics.

bee_rider a day ago | root | parent | next |

Numpy can use the chipmaker’s BLAS (Intel MKL or AMD’s Blis fork). Trying to replace it could be a good academic exercise but I think most people wisely leave that to the vendors.

sitkack a day ago | root | parent |

It is a purely pedagogical device, like building a go kart.

sitkack 2 hours ago | root | parent |

A lightweight, pure Python, numpy compliant ndarray class.

https://github.com/wadetb/tinynumpy

> but still riding on an external abstraction, you rewrite that, but ONLY to support the layers above you.

i don't get it. Why do i stop before stripping all abstractions?

byteknight a day ago | root | parent |

Where do you get that? He is postulating the external abstraction you are using has more features than you use. He is saying implement only the parts you use.

lagrange77 a day ago | root | parent |

> Where do you get that?

From "when you get low enough but still riding on an external abstraction".

> He is saying implement only the parts you use.

Thanks.

sitkack a day ago | root | parent |

Correct, I should proof read my posts.

tomrod a day ago | root | parent | prev |

Likewise. And your comment reminded me of real programmers*

* https://xkcd.com/378/

numba888 a day ago | prev | next |

github has a bunch of them for years, the most known from Andrej Karpathy:

https://github.com/karpathy/nanoGPT

some other have MoE implemented.

syassami a day ago | root | parent | next |

Personal fave: https://github.com/karpathy/llama2.c

benreesman a day ago | root | parent | prev |

nanoGPT is awesome (and I highly recommend his videos on it), but it’s closer to a direct reproduction of GPT-2, so it’s cool to have a really clean implementation of some newer ideas.

Nimitz14 a day ago | root | parent |

nanoGPT contains some new ideas. https://github.com/karpathy/minGPT is more plain

c0wb0yc0d3r 21 hours ago | prev | next |

Can someone help me understand what I’m looking at here? This repository allows me to train a specific model on a specific data set, and finally test the result? Is that correct?

I am interested in how large and small language models are trained, but as someone who has little knowledge in this world I find it hard to cut through the noise to find useful information.

Really I’m looking for an open source project that helps a person gain this knowledge. Something like a docker container that encapsulates all the dependencies. When training it will use any available gpu or tell me why my gpu can’t be used and then fall back to cpu. Then had a simple interface to test the training results. Finally you can easily pull back the curtain to understand the process in better detail and maybe even adapt it to different model to experiment.

Does something like that exist?

MacTea 17 hours ago | root | parent | next |

https://course.fast.ai/ is the best. From their site: " A free course designed for people with some coding experience, who want to learn how to apply deep learning and machine learning to practical problems. "

c0wb0yc0d3r 16 hours ago | root | parent |

This is at the top of my lunch time learning list. Not quite what I’ve been envisioning but it’s in the right direction. Thanks!

If you are looking for something that actually explains most concepts behind it, Karpathy’s series will teach you what you want to know. (Mentioned elsewhere) If you are looking for command line tools to fine tune and evaluate models on known datasets, this article is a good take!

https://www.philschmid.de/fine-tune-llms-in-2025

As opposed to inference (like generating text and images), training requires some more math (fp16 or bf16) and a single CPU generally won't cut it.

The prepare/train/generate instructions in the github linked are pretty much it for the 'how' of training a model. You give it a task and it does it for 1 billion trillion epochs and saves the changes incrementally (or not).

Training a LoRA for an image model may be more approachable, there's more blog entries etc on this, and the process is largely similar, except you're doing it for a single slice instead of the whole network.

[edit] I'm also learning so correct me if I'm off, hn!

sva_ 12 hours ago | root | parent |

> You give it a task and it does it for 1 billion trillion epochs and saves the changes incrementally (or not).

Somewhat confusingly, big LLM are most just trained for 1 epoch afaik.

_joel 11 hours ago | root | parent |

I've seen 3 epochs on some of the finetuning R1 blog posts. It's not my field so not sure how valid that is.

SJC_Hacker 19 hours ago | root | parent | prev |

Do you have a good theoretical foundation in ML ? You will also need some linear algebra.

If not would invest the time in a decent course, there are plenty online, even offline if you are close enough to where its offered. I took one from Andrew Ng on Coursera years ago, which used Matlab. There are probably much better, more up-to-date options now, especially now that LLMs are very in-vogue. The fundamentals such as gradient descent, ANNs and back-propagation however, is still relevant, and hasn't changed much.

Trying to understand what code is doing without that foundation will be an exercise in futility.

c0wb0yc0d3r 16 hours ago | root | parent |

I don’t have a solid ML foundation, and it’s been a decade or more since I’ve worked with linear algebra.

For now I think that might be too deep for what I’m after. I’m at the beach looking out at the vast ocean that is machine learning and LLMs.

barrenko 13 hours ago | root | parent |

You're probably having the right hunch, it takes a crapload of time, especially if you want to implement and not just "get an intuition".

Lerc a day ago | prev | next |

The example story is interesting.

I have made my own implementation from scratch with my own multi-channel tokeniser, each channel gets its own embedding table 32768, 256,256, 64, and 4. Which are summed along with the position encoding.

Yet with all of those differences, my stories have Lily as a protagonist often enough that I thought I had a bug somewhere.

Might have to check tinystories for name distribution.

Most questionable output from mine so far:

"one day, a naughty man and a little boy went to the park place to find some new things."

brap a day ago | prev | next |

It’s interesting that technology so transformative is only a few hundred lines of code (excluding underlying frameworks and such).

How big would you guess state of the art models are, in terms of lines of code?

miki123211 a day ago | root | parent | next |

Llama2 inference can be implemented in 900-ish lines of dependency-free C89, with no code golfing[1]. More modern architectures (at least the dense, non-MoE models) aren't that much more complicated.

That code is CPU only, uses float32 everywhere and doesn't do any optimizations, so it's not realistically usable for models beyond 100m params, but that's how much it takes to run the core algorithm.

[1] https://github.com/karpathy/llama2.c

hatthew 20 hours ago | root | parent | prev |

A minimal hardcoded definition of the structure: probably a few hundred lines.

The actual definition, including reusable components, optional features, and flexibility for experimentation: probably a few thousand.

The code needed to train the model, including all the data pipelines and management, training framework, optimization tricks, etc.: tens of thousands.

The whole codebase, including experiments, training/inference monitoring, modules that didn't make it into the final architecture, unit tests, and all custom code written to support everything mentioned so far: hundreds of thousands.

febin 15 hours ago | prev | next |

Here's a google collab notebook built from this. It takes ~2 hours on A100 GPU if you have collab pro. Might work on free account as well.

https://colab.research.google.com/drive/1dklqzK8TDPfbPbyHrk3...

OmAlve 10 hours ago | prev | next |

Thanks a lot for posting this here! I can't believe it went viral, makes all the efforts feel worth it now! - Om Alve

ideashower 6 hours ago | prev | next |

Can anyone share what a training dataset would look like for something like this? What are some use cases?

efm 4 hours ago | root | parent |

Karpathy's nanoGPT has a full training pipeline using Shakespeare. [1]

The use case for this is learning in simple example.

[1] https://github.com/karpathy/nanoGPT

ks2048 a day ago | prev | next |

So, this has nothing to do with "SmolLM" - a set of models (with data, training recipes, etc) released by HuggingFace? https://huggingface.co/blog/smollm

mkagenius 16 hours ago | prev | next |

Looks like a rip off of - https://github.com/PraveenRaja42/Tiny-Stories-GPT

without any credits to above or TinyStories paper.

yorwba 14 hours ago | root | parent |

The implementations are different, so I don't think you can consider it a rip-off.

mkagenius 13 hours ago | root | parent |

Why do you say implementations are different?

yorwba 12 hours ago | root | parent |

Because I read the code.

18 hours ago | prev | next |

[deleted]

imdsm 12 hours ago | prev | next |

Is there a corresponding article for this? I'd love to read through it!

Diffused_asi 7 hours ago | prev | next |

How many parameters model is this ?

antirez a day ago | prev | next |

No cpu / mps support to train on Macs, apparently.

leopoldj a day ago | root | parent |

You should be able to make a few small changes to support "mps".

In TrainingConfig set the device to "mps". The run training.

In sample.py modify parse_args() and add support for mps as a possible value for the --device argument.

antirez a day ago | root | parent |

Thanks! I'll try. I didn't bother believing that if this was developed heavily on CUDA, it was likely going to use kernels that were missing in MPS.

nostradumbasp a day ago | prev | next |

Cute! Keep making fun things.

spidermonkey23 a day ago | prev | next |

Is there anything that can run locally on mobile in temrux

quantadev a day ago | prev | next |

I noticed several people mentioned Karpathy already, but I wanted to include that his tiny "Micrograd" project (see Youtube Video and GitHub) is a great introduction to Neural Nets (Multilayer Peceptron), which is at the core of [most] machine learning of course.

the_real_cher a day ago | prev |

Any body have any good readings they read and liked to kind of understand what is going on with how this works?

leopoldj 7 hours ago | root | parent | next |

This is a faithful reproduction of the original Transformer paper [1]. Except, these days we use trainable parameters for positional embedding. The paper used a static calculation for positional embedding using sine and cosine.

Figure 1 in the paper can be seen implemented in the forward() method of the GPT class in model.py. Here are the rough steps:

1. Tokens are embedded using a nn.Embedding layer. 2. Tokens are positionally embedded using a nn.Embedding layer. 3. The two embedding values are added to make the input x. 4. A sequence of N number of transformer blocks are then executed. This is the grey box in the left of the Figure 1. This is where all the magic happens. Chiefly in the self attention calculation. You can see this in the forward() method of the CausalSelfAttention class. 5. A regular nn.Linear layer is executed. 6. Finally the output token probabilities are calculated using F.cross_entropy (shown as softmax in the figure).

I hope this helps a little. Please feel free to suggest improvements and additions.

[1] https://arxiv.org/pdf/1706.03762

fragmede a day ago | root | parent | prev |

https://spreadsheets-are-all-you-need.ai/

ianand 19 hours ago | root | parent |

hey, creator of spreadsheets-are-all-you-need.ai here. Thanks for mentioning!

I now have a web version of GPT2 implemented in pure JavaScript for web developers at https://spreadsheets-are-all-you-need.ai/gpt2/.

The best part is that you can debug and step through it in the browser dev tools: https://youtube.com/watch?v=cXKJJEzIGy4 (100 second demo). Every single step is is in plain vanilla client side JavaScript (even the matrix multiplications). You don't need python, etc. Heck, you don't even have to leave your browser.

I recently did an updated version of my talk with it for JavaScript developers here: https://youtube.com/watch?v=siGKUyTk9M0 (52 min). That should give you a basic grounding on what's happening inside a Transformer.