There has been a lot of well-deserved attention paid to the latest advances in machine learning lately. I feel like I see a new paper or model every week that promises the Earth, moon, and stars.

Perhaps it’s a new approach that will finally™ solve the problem of quadratic scaling of transformers w.r.t. context length, be it via clever tweaks inspired by convolutions, literally using convolutions, more clever utilization of accelerators, or working around various memory bottlenecks.1 Perhaps it’s any number of new models that have been fine-tuned by hobbyists, perhaps using leaked LLaMA weights or ChatGPT/ShareGPT data.2

But there is another thing that hasn’t gotten as much mainstream attention: just how easy it has become to experiment with some seriously advanced models, models that until quite recently would have been state of the art and required non-trivial capital to train. Of course, researchers have been publishing models and code for a while now, but the current state of affairs with easy-to-use APIs, reasonably good documentation, and an emphasis on community interaction and contribution? That feels rather new.3

As an example, I wanted to walk through a small language model that I trained for my own amusement.

Goal

I wanted to train a model that could talk a bit more informally, and perhaps a bit more like how I text with friends. It’s no secret that LLMs tend to output text that errs on the side of the formal and verbose. My suspicion is that this is due to some combination of

  • Instructions to human annotators when generating text data, e.g. for supervised fine-tuning datasets
  • Instructions to human annotators when ranking text data, i.e. generating reward model data for RLHF by ranking which outputs are better (more aligned with user preferences)
  • Preference for “high quality” text sources during training, e.g. Wikipedia, news articles, or well-upvoted comments on Reddit.4

My hope (inspired by this paper) was that it would take a relatively minimal amount of fine-tuning to get a language model to chat more like me. That is, talk a bit more informally, use different punctuation (e.g. newlines instead of periods), etc.

Training data

Getting training data for this was fairly simple. I’ve been using Facebook Messenger for a long time now, and Facebook provides a convenient way to download all of your data as a bunch of JSONs. From there it’s just a bit of Python to parse the messages, yielding the ones where I’m responding. That generator can then be used directly to create a Dataset via Hugging Face’s datasets API, specifically from_generator.

To be precise, the training data was in the format of “prompts” and “responses”, where prompts were contiguous blocks of messages from anybody that wasn’t me, joined with a pipe char (|). Responses were of the same format, just comprising messages that I sent. I used a pipe char to avoid any preprocessing shenanigans regarding newlines that some tokenizers may perform (e.g., transforming newlines into spaces). I chose a character that wouldn’t normally come up in texting, since substituting other punctuation (e.g. a period) can change perceived tone.5
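For the curious, here’s a minimal sketch of what that pairing step might look like. It assumes the messages have already been parsed into chronological (sender, text) tuples; MY_NAME and parsed_messages are placeholders rather than my actual parsing code.

from datasets import Dataset

MY_NAME = "James"  # placeholder: the sender name as it appears in the Messenger export

def prompt_response_pairs(messages):
    """Group chronological (sender, text) tuples into pipe-joined prompt/response pairs."""
    prompt, response = [], []
    for sender, text in messages:
        if sender == MY_NAME:
            if prompt:  # only keep my messages that follow someone else's block
                response.append(text)
        else:
            if prompt and response:  # a completed pair; emit it and start a new prompt block
                yield {"prompt": "|".join(prompt), "response": "|".join(response)}
                prompt, response = [], []
            prompt.append(text)
    if prompt and response:
        yield {"prompt": "|".join(prompt), "response": "|".join(response)}

# parsed_messages would come from the downloaded JSONs
dataset = Dataset.from_generator(lambda: prompt_response_pairs(parsed_messages))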

Model

On Hugging Face, there are many language models appropriate for conversational interaction. I chose to run things using a 400M-parameter distilled BlenderBot, which uses a standard seq2seq (i.e. encoder-decoder) transformer. The paper came out in 2020, which is comparatively old, but these models are convenient since they’ve already been fine-tuned on conversational prompts. In particular, the 400M-parameter model isn’t so large that one needs to start thinking about model parallelism yet and has the added benefit of knowledge distillation from the bigger versions. I.e., it ought to be fine for some quick/fun hacking.
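Loading it is the standard transformers two-liner (this isn’t my exact code, just the usual from_pretrained API, plus a quick sanity-check generation):

from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration

checkpoint = "facebook/blenderbot-400M-distill"
tokenizer = BlenderbotTokenizer.from_pretrained(checkpoint)
model = BlenderbotForConditionalGeneration.from_pretrained(checkpoint)

# Sanity check: the base model should already hold a passable conversation.
inputs = tokenizer("hey, how was your weekend?", return_tensors="pt")
reply_ids = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.batch_decode(reply_ids, skip_special_tokens=True)[0])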

Training

I used a bone-stock PyTorch Lightning training loop, which is almost a one-liner. Of course you can implement your own checkpointing, epoch loop, etc., but why reinvent the wheel, especially for a one-off just-for-fun training run?
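Concretely, it boils down to something like the following sketch. The hyperparameters are illustrative, train_loader is assumed to be a DataLoader over the tokenized prompt/response pairs, and Imitator is the LightningModule shown in the next section.

import pytorch_lightning as pl

# train_loader is assumed: a DataLoader over the tokenized prompt/response pairs.
trainer = pl.Trainer(max_epochs=3, accelerator="auto", devices=1)  # illustrative settings
trainer.fit(Imitator(), train_dataloaders=train_loader)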

Compute

As far as compute goes, my 2019-era MacBook is woefully underpowered, but Colab Pro is cheap and good enough here. An instance with 50GB of RAM and a GPU with 16GB of VRAM, albeit an old one, was plenty for my purposes and cost roughly $0.20/hr.6

Parameter efficiency

I didn’t want to fine-tune the entire model, since that would’ve taken a while on the admittedly slower NVIDIA T4 that I was using. However, the peft library makes it surprisingly easy to leverage SOTA fine-tuning methods. For me, using AdaLoRA, an improvement on low-rank adaptation that was published only in March, was three operations: an import, a config initialization, and then an assignment.

One quirk of the library is that it has a hard-coded mapping from transformer model architectures (as strings!) to the modules that actually get adapted via AdaLoRA; see the source here.7 That mapping doesn’t cover BlenderBot out of the box, but you can just inspect the module names8 and then it’s no problem:

import pytorch_lightning as pl
from peft import AdaLoraConfig, TaskType, get_peft_model
from transformers import BlenderbotForConditionalGeneration


class Imitator(pl.LightningModule):

    def __init__(self):
        super().__init__()
        # Load the pre-trained distilled BlenderBot checkpoint.
        pretrained_model = BlenderbotForConditionalGeneration.from_pretrained("facebook/blenderbot-400M-distill")
        # Configure AdaLoRA; target_modules lists the BlenderBot submodules to adapt
        # (attention projections and feed-forward layers), found by inspecting module names.
        peft_config = AdaLoraConfig(
            peft_type="ADALORA",
            task_type=TaskType.SEQ_2_SEQ_LM,
            inference_mode=False,
            r=16,
            lora_alpha=32,
            target_modules=["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"],
            lora_dropout=0.1,
        )
        # Wrap the base model so only the low-rank adapter weights are trained.
        self.model = get_peft_model(pretrained_model, peft_config)
...
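As for how to find those target module names (footnote 8), one way, roughly what I did, is to dump the model’s unique leaf-module names and pick out the attention and feed-forward layers; a sketch:

# Discover candidate target_modules by printing the unique leaf-module names
# of the pre-trained model loaded above.
leaf_names = {name.split(".")[-1] for name, _ in pretrained_model.named_modules() if name}
print(sorted(leaf_names))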

Amusing non-scientific results

Example 1

Prompt

> dude i need your advice
> she said she wanted to "take a break"

JamesBot response

> LOL
> i'm not sure what to say to that
> i don't really know what advice to give

Alright, perhaps a bit unempathetic and not much of a specific response, but it definitely talks more like me. In particular, the model has learned that I tend to encapsulate thoughts in distinct messages, often without ending punctuation.

Example 2

Prompt

> dude
> gels are such a game changer
> felt amazing at mile 15 today

JamesBot response

> oh wow
> that's a lot of running
> i'm impressed
> i can't run a mile

This isn’t perfectly accurate since I can indeed run a mile and also advocate for gels9 on runs longer than 90 minutes, but this is plausibly something I would’ve said before getting into running.
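For reference, generating these replies is just the usual generate call on the fine-tuned model, with the prompt pipe-joined the same way as during training. Here imitator stands in for the trained Imitator instance and tokenizer for the BlenderBot tokenizer loaded earlier.

prompt = "dude|gels are such a game changer|felt amazing at mile 15 today"
inputs = tokenizer(prompt, return_tensors="pt")
reply_ids = imitator.model.generate(**inputs, max_new_tokens=60)
print(tokenizer.batch_decode(reply_ids, skip_special_tokens=True)[0])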

The punchline

The punchline isn’t that I took a pre-trained model and then fine-tuned it on another dataset; after all, that’s been done before.

The punchline is that, not including the script I used to parse my Facebook messages, this was only ~50 lines of code. Code that, by my assessment, is rather explicit and not reminiscent of code golf.

It’s incredible that the open source ecosystem has advanced to the stage where you can experiment with very modern techniques (transformers, parameter-efficient fine-tuning, etc.) in just O(dozens) of LoC.

This lowers the barrier to entry not only for hobbyists and enthusiasts, but also for professionals who have requirements that aren’t met by the existing ecosystem of model inference APIs, or who simply prefer driving stick.

Notes

  1. What’s even more fascinating than this research outright is just how much other research continues to be done on top of bone-stock transformers, even pretty hype research. E.g., DeepMind’s Gato was trained over a decoder-only transformer “for simplicity and scalability”. 

  2. What’s also interesting here is that both of these seem to sit in a legal gray area, since the LLaMA weights were leaked and OpenAI’s terms of service prohibit using their products to train competitor models, and yet both are incredibly popular approaches. 

  3. This is perhaps biased by my experience, which has been mostly with heavy duty infrastructure that originated from within Alphabet. 

  4. Some may not consider Reddit comments to be “high quality”, but it’s important to compare them to internet text en masse. Seriously, take a look at some examples from these canonical large web scrapes. There’s an amusing amount of SEO spam and websites that just aren’t parsed correctly, e.g. “This website requires JavaScript …”. 

  5. Here’s a fun article from the New York Times discussing this in more detail. 

  6. This is an estimate, since the Colab pricing model depends on “compute credits” per hour, and it’s not entirely clear how those rates are calculated. Regardless, you can get 100 credits per month for $10 (so roughly $0.10 per credit), and high-RAM instances with an NVIDIA T4 were consistently around 2.xx credits per hour, hence the ~$0.20/hr figure. 

  7. Quite a few of these are also commented out, for reasons that aren’t super clear to me. Perhaps it’s the library maintainers being strict about tests? 

  8. This is easy, via nn.Module.modules(). It’s always nice when something does exactly what it says on the tin. 

  9. For the uninitiated, energy gels are portable and easy-to-digest carbs for endurance athletes.