<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.5">Jekyll</generator><link href="https://jmschndev.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://jmschndev.github.io/" rel="alternate" type="text/html" /><updated>2024-03-10T02:13:04+00:00</updated><id>https://jmschndev.github.io/feed.xml</id><title type="html">Sparse Notes</title><author><name>James Chen</name></author><entry><title type="html">Mamba No. 5 (A Little Bit Of…)</title><link href="https://jmschndev.github.io/jekyll/update/2024/02/12/mamba.html" rel="alternate" type="text/html" title="Mamba No. 5 (A Little Bit Of…)" /><published>2024-02-12T19:50:08+00:00</published><updated>2024-02-12T19:50:08+00:00</updated><id>https://jmschndev.github.io/jekyll/update/2024/02/12/mamba</id><content type="html" xml:base="https://jmschndev.github.io/jekyll/update/2024/02/12/mamba.html"><![CDATA[<p>In this post, I attempt to provide a walkthrough of the essence of the Mamba state space model architecture, occasionally sacrificing some rigor for intuition and overall pedagogical friendliness.</p>

<p>I don’t assume readers have any familiarity with state space models, but I do assume some familiarity with machine learning and mathematical notation.</p>

<p>If at any point you spot any errors, typos, or confusing wording, please let me know!</p>

<ul>
  <li><a href="#tldr">TL;DR</a></li>
  <li><a href="#setting-the-stage">Setting the stage</a>
    <ul>
      <li><a href="#exhibit-a-the-rnn">Exhibit A: the RNN</a>
        <ul>
          <li><a href="#pros">Pros</a></li>
          <li><a href="#cons">Cons</a></li>
        </ul>
      </li>
      <li><a href="#exhibit-b-the-transformer">Exhibit B: the transformer</a>
        <ul>
          <li><a href="#pros-1">Pros</a></li>
          <li><a href="#cons-1">Cons</a></li>
        </ul>
      </li>
      <li><a href="#why-mamba-why-now">Why Mamba? Why now?</a></li>
    </ul>
  </li>
  <li><a href="#linear-time-invariant-state-space-models">Linear time-invariant state space models</a>
    <ul>
      <li><a href="#continuous-form">Continuous form</a>
        <ul>
          <li><a href="#components">Components</a></li>
        </ul>
      </li>
      <li><a href="#discrete-form">Discrete form</a>
        <ul>
          <li><a href="#parameter-discretization">Parameter discretization</a></li>
        </ul>
      </li>
      <li><a href="#structured-ssms-à-la-s4s4d"><em>Structured</em> SSMs, à la S4/S4D</a>
        <ul>
          <li><a href="#structure-for-a-ie-initialization">Structure for <strong>A</strong>, i.e. initialization</a></li>
          <li><a href="#inputs-and-shapes">Inputs and shapes</a></li>
        </ul>
      </li>
    </ul>
  </li>
  <li><a href="#mamba-ssm">Mamba SSM</a>
    <ul>
      <li><a href="#a-departure-from-s4s-linear-time-invariance">A departure from S4’s linear time invariance</a></li>
      <li><a href="#fast-implementation">Fast implementation</a></li>
    </ul>
  </li>
  <li><a href="#hardware-aware-resource-management">Hardware-aware resource management</a>
    <ul>
      <li><a href="#a-simple-gpu-program">A simple GPU program</a></li>
      <li><a href="#hardware-aware-mamba">Hardware-aware Mamba</a></li>
    </ul>
  </li>
  <li><a href="#the-blelloch-parallel-prefix-scan">The Blelloch parallel prefix scan</a>
    <ul>
      <li><a href="#warm-up-the-parallel-reduce">Warm up: the parallel reduce</a></li>
      <li><a href="#the-blelloch-parallel-scan">The Blelloch parallel scan</a>
        <ul>
          <li><a href="#first-the-up-sweep">First, the up sweep</a></li>
          <li><a href="#next-the-down-sweep">Next, the down sweep</a></li>
        </ul>
      </li>
    </ul>
  </li>
  <li><a href="#a-binary-associative-operator-for-mamba">A binary associative operator for Mamba</a>
    <ul>
      <li><a href="#prerequisites">Prerequisites</a></li>
      <li><a href="#defining-the-operator">Defining the operator</a></li>
      <li><a href="#some-proofs">Some proofs</a>
        <ul>
          <li><a href="#proof-of-part-1">Proof of part 1</a></li>
          <li><a href="#proof-of-part-2">Proof of part 2</a></li>
        </ul>
      </li>
      <li><a href="#sanity-checking">Sanity checking</a>
        <ul>
          <li><a href="#miscellaneous-setup">Miscellaneous setup</a></li>
          <li><a href="#the-operator-implementation-and-test-logic">The operator implementation and test logic</a></li>
        </ul>
      </li>
    </ul>
  </li>
  <li><a href="#closing-thoughts">Closing thoughts</a>
    <ul>
      <li><a href="#summary-of-above-topics">Summary of above topics</a></li>
      <li><a href="#remaining-topics">Remaining topics</a></li>
      <li><a href="#musings-on-breadth">Musings on breadth</a></li>
    </ul>
  </li>
  <li><a href="#assorted-references">Assorted references</a></li>
  <li><a href="#footnotes">Footnotes</a></li>
</ul>

<h2 id="tldr">TL;DR</h2>

<p>Mamba is a <em>state space model</em> (SSM) architecture that improves upon the S4 architecture. Sometimes known as <em>S6</em>, it makes two important modifications to S4:</p>

<ul>
  <li><em>Selective</em> SSM parameters</li>
  <li>Efficient implementation via parallel <em>scan</em></li>
</ul>

<p>Mamba parallelizes well during training, scales well with context length, performs inference efficiently, and most importantly, displays <a href="https://twitter.com/_albertgu/status/1731727672286294400">strong empirical results</a>.</p>

<h2 id="setting-the-stage">Setting the stage</h2>

<p>Sequence models can be placed on a spectrum based on their approach to information representation, from highly compressed (e.g. RNNs) to highly explicit (e.g. transformers).</p>

<h3 id="exhibit-a-the-rnn">Exhibit A: the RNN</h3>

<p>Consider a vanilla RNN:</p>

\[\begin{aligned}h_t &amp;= \tanh(W_{hh}h_{t-1} + W_{xh}x_t)\\y_t &amp;= W_{hy}h_t\end{aligned}\]

<p>The fixed size state \(h_{t-1}\) represents all prior context in a sequence at time \(t\). This underpins the core tradeoffs associated with RNNs:</p>
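<p>Concretely, one step of this recurrence can be sketched in a few lines of NumPy (a toy illustration with made-up shapes, not any particular library’s API):</p>

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, W_hy):
    """One step of the vanilla RNN recurrence above."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)  # all prior context squeezed into h_t
    y_t = W_hy @ h_t                           # output read off the fixed-size state
    return h_t, y_t

# Toy shapes: state size 4, input size 3, output size 2 (all hypothetical).
rng = np.random.default_rng(0)
W_hh = rng.normal(size=(4, 4))
W_xh = rng.normal(size=(4, 3))
W_hy = rng.normal(size=(2, 4))

h = np.zeros(4)
for x_t in rng.normal(size=(5, 3)):  # a length-5 input sequence
    h, y = rnn_step(h, x_t, W_hh, W_xh, W_hy)
```

<p>No matter how long the input sequence grows, <code class="language-plaintext highlighter-rouge">h</code> stays the same size.</p>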

<h4 id="pros">Pros</h4>

<ul>
  <li><strong>Efficient autoregressive inference:</strong> Since \(h_{t}\) encapsulates prior inputs, the model only needs to consider a small and constant set of new information for each subsequent input.</li>
  <li><strong>No limits to context length:</strong> There is nothing in the formulation that explicitly constrains the model to a maximal sequence length.</li>
</ul>

<h4 id="cons">Cons</h4>

<ul>
  <li><strong>Ineffective modeling of complex dependencies:</strong> All prior context must be compressed, via static<sup id="fnref:static" role="doc-noteref"><a href="#fn:static" class="footnote" rel="footnote">1</a></sup> updates, into a fixed amount of bits.</li>
  <li><strong>Slow training:</strong> Training requires <a href="https://en.wikipedia.org/wiki/Backpropagation_through_time">sequential backpropagation through time</a>, making poor use of hardware accelerators, e.g. GPUs or TPUs. Accelerators have enormous throughput for parallel computation, but are otherwise surprisingly slow at sequential computation.</li>
</ul>

<h3 id="exhibit-b-the-transformer">Exhibit B: the transformer</h3>

<p>On the other end of that spectrum, consider a decoder-only transformer model, à la GPT-3. In particular, let’s focus on its scaled dot-product self-attention layer:</p>

\[\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
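<p>As a minimal sketch (single head, no causal mask, hypothetical names), this layer can be written as:</p>

```python
import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product attention as in the formula above (no mask, no heads)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (L, L): every token vs. every token
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
L, d_k = 6, 8  # toy sequence length and head dimension
Q, K, V = (rng.normal(size=(L, d_k)) for _ in range(3))
out = self_attention(Q, K, V)
```

<p>Note that <code class="language-plaintext highlighter-rouge">scores</code> is an \(\mathtt{L} \times \mathtt{L}\) matrix: every token interacts with every other token, which is both the source of the transformer’s modeling power and its quadratic cost. A decoder-only model would additionally mask out future positions.</p>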

<h4 id="pros-1">Pros</h4>

<ul>
  <li><strong>Unreasonably effective at modeling complex dependencies:</strong> Every token gets to <em>explicitly</em> attend to all other prior tokens, instead of relying on a fixed-sized state as a “summary”.</li>
  <li><strong>Highly parallel training:</strong> There are no dependencies along the time dimension, and the core operations are matrix multiplications, which hardware accelerators have been excellent at parallelizing for decades.</li>
</ul>

<h4 id="cons-1">Cons</h4>

<ul>
  <li><strong>Quadratic scaling with context length:</strong> Since every input attends to all prior inputs, the total amount of computation required grows quadratically with the number of tokens.</li>
  <li><strong>Autoregressive inference is expensive<sup id="fnref:expensive" role="doc-noteref"><a href="#fn:expensive" class="footnote" rel="footnote">2</a></sup>:</strong> Unlike RNNs, there is no fixed-sized compressed representation of the prior tokens; each new token must explicitly attend to all prior tokens.</li>
</ul>

<h3 id="why-mamba-why-now">Why Mamba? Why now?</h3>

<p>The contrasting tradeoffs of transformers and RNNs highlight the crux of sequence modeling research: how can we improve model quality within the constraints of available compute?</p>

<p>Recently, the industry has made rapid forward progress not from algorithmic breakthroughs but instead from dramatic increases in compute, due to both increased funding and continual improvements in hardware development.</p>

<p>Which is to say, scaling isn’t particularly <em>clever</em>, but oh boy is it <em>effective</em><sup id="fnref:effective" role="doc-noteref"><a href="#fn:effective" class="footnote" rel="footnote">3</a></sup>.</p>

<p>Perhaps Rich Sutton said it best in <a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html">The Bitter Lesson</a>:</p>

<blockquote>
  <p>One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great.</p>
</blockquote>

<p>Speculatively, RNNs and transformers have a limited lifespan because they make poor use of increasingly abundant compute. It’s critical that we design models that better leverage compute while also maintaining or improving fundamental model quality. Hence, the interest in Mamba.</p>

<h2 id="linear-time-invariant-state-space-models">Linear time-invariant state space models</h2>

<p>Mamba is based on S4, which is a <strong><em>linear time-invariant (LTI) state space model (SSM)</em></strong>, a common and useful subset of state space models more generally. For now, let’s focus on LTI SSMs, starting with their continuous form:</p>

<h3 id="continuous-form">Continuous form</h3>

\[\begin{aligned}
\mathbf{h}'(t) &amp;= \mathbf{A} \mathbf{h}(t) + \mathbf{B}\mathbf{x}(t) \\
\mathbf{y}(t) &amp;= \mathbf{C}\mathbf{h}(t) + \mathbf{D}\mathbf{x}(t)
\end{aligned}\]

<p>Ah yes, a chunk of <a href="https://i.imgur.com/yAU72Pl.png">\(\LaTeX\)</a>, how intuitive. Let’s break it down.</p>

<h4 id="components">Components</h4>

<ul>
  <li>\(t\) represents time, and is a scalar real number, i.e. \(t \in \mathbb{R}\).
    <ul>
      <li>Although the integers are a subset of the real numbers (\(\mathbb{Z} \subset \mathbb{R}\)), there is some specific handling and notation for the discrete case. We’ll discuss this further in the next section.</li>
    </ul>
  </li>
  <li>\(\mathbf{x}(t) \in \mathbb{R}^{\mathtt{D}}\) is the input to our model at time \(t\), which has dimensionality \(\mathtt{D}\) (i.e., has \(\mathtt{D}\) channels).
    <ul>
      <li>E.g., if you are doing modeling over raw audio files, then \(\mathtt{D} = 1\), \(t \in \mathbb{R}^+\), and \(\mathbf{x}(t)\) is the amplitude from the microphone’s recording at time \(t\).</li>
      <li>E.g., if you are doing modeling over text token embeddings, then \(\mathtt{D} = \text{embedding dimensionality}\), \(t \in \mathbb{Z}^+\), and \(\mathbf{x}(t)\) is the \(\mathtt{D}\)-length embedding vector for the token at position index \(t\).</li>
    </ul>
  </li>
  <li>\(\mathbf{y}(t) \in \mathbb{R}^{\mathtt{V}}\) is the corresponding output of our model, at time \(t\).
    <ul>
      <li>E.g. for binary music/not-music classification over raw audio files: \(\mathtt{V} = 1\), \(t \in \mathbb{R}^+\), and \(\mathbf{y}(t) \in [0, 1]\) is the predicted probability that all prior audio context was music at time \(t\).</li>
      <li>E.g. for next-token language modeling: \(\mathtt{V}=\text{tokenizer vocab size}\), \(t \in \mathbb{Z}^+\), and \(\mathbf{y}(t) \in [0, 1]^\mathtt{V}\) is the predicted probability distribution over all tokens in the vocabulary.</li>
    </ul>
  </li>
  <li>Similar to a vanilla RNN, the state vector \(\mathbf{h}(t) \in \mathbb{R}^\mathtt{N}\) encapsulates all prior inputs at time \(t\).</li>
  <li>The matrices \(\mathbf{A} \in \mathbb{R}^{\texttt{N}\times\texttt{N}}\), \(\mathbf{B} \in \mathbb{R}^{\texttt{N}\times\texttt{D}}\), \(\mathbf{C} \in \mathbb{R}^{\texttt{V}\times\texttt{N}}\), and \(\mathbf{D} \in \mathbb{R}^{\texttt{V}\times\texttt{D}}\), known as the state matrix, input matrix, output matrix, and feedthrough matrix respectively, comprise the actual parameters of the SSM. These parameters, along with inputs and state, determine the outputs. These parameters also determine how the state itself evolves over the sequence of inputs.</li>
</ul>

<h3 id="discrete-form">Discrete form</h3>

<p>In the continuous case, the dynamics of how \(\mathbf{h}\) evolves over time are determined via the differential equation \(\mathbf{h}'(t) = \mathbf{A} \mathbf{h}(t) + \mathbf{B}\mathbf{x}(t)\). That is, the current value of \(\mathbf{h}\) itself determines how \(\mathbf{h}\) is changing at that moment in time.</p>

<p>The discrete case is similar, but because we <a href="https://images.squarespace-cdn.com/content/v1/57a9d8dcd482e9bbf179f445/1477647080565-2BE7N7RA4YLGNUPWWFEK/The+limit+does+not+exist.jpg?format=1500w">cannot differentiate discrete functions</a>, we notate this self-modifying recursive behavior with, well, a recurrence<sup id="fnref:recurrence" role="doc-noteref"><a href="#fn:recurrence" class="footnote" rel="footnote">4</a></sup>. We also use subscript notation instead of function notation to help emphasize this distinction:</p>

\[\begin{aligned}
\mathbf{h}_{t} &amp;= \mathbf{\overline{A}} \mathbf{h}_{t-1}+ \mathbf{\overline{B}}\mathbf{x}_t \\
\mathbf{y}_t &amp;= \mathbf{\overline{C}}\mathbf{h}_t + \mathbf{\overline{D}}\mathbf{x}_t
\end{aligned}\]

<p>Furthermore, you may now also notice some horizontal bars over our SSM parameters, which is indicative of <strong><em>discretized</em></strong> parameters:</p>

<h4 id="parameter-discretization">Parameter discretization</h4>

<p>Because SSMs’ most general formulation is continuous, when working with discrete data there is (usually) a discretization step, where the “discretized” parameters are annotated with an overline and depend on a <strong>learned</strong> “step size” parameter \(\Delta\) and <strong>fixed</strong> choices for discretization functions \(f_\mathbf{A}\), \(f_\mathbf{B}\), \(f_\mathbf{C}\), and \(f_\mathbf{D}\).</p>

\[\begin{aligned}
\mathbf{\overline{A}} &amp;= f_{\mathbf{A}}(\Delta, \mathbf{A}) \\
\mathbf{\overline{B}} &amp;= f_{\mathbf{B}}(\Delta, \mathbf{A}, \mathbf{B})  \\
\mathbf{\overline{C}} &amp;= f_{\mathbf{C}}(\Delta, \mathbf{C}) \\
\mathbf{\overline{D}} &amp;= f_{\mathbf{D}}(\Delta, \mathbf{D}) \\
\end{aligned}\]

<p>There are several reasonable strategies for discretization, but an intuitive exposition here would be longer than I’d like for this post, so for now you’ll have to just trust me (or go read Albert Gu’s <a href="https://stacks.stanford.edu/file/druid:mb976vf9362/gu_dissertation-augmented.pdf">330 page thesis</a>).</p>
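<p>That said, one common recipe, the zero-order-hold (ZOH) method, is compact enough to sketch for the diagonal-\(\mathbf{A}\) case discussed below. This is my own toy code, not the reference implementation:</p>

```python
import numpy as np

def zoh_discretize(delta, A_diag, B):
    """Zero-order-hold (ZOH) discretization, assuming a diagonal state matrix.

    With A diagonal (stored as a length-N vector), the matrix exponential
    reduces to an elementwise exp:
        A_bar = exp(delta * A)
        B_bar = (delta * A)^{-1} (exp(delta * A) - I) * delta * B
              = ((exp(delta * A) - 1) / A) * B   # the deltas cancel
    """
    A_bar = np.exp(delta * A_diag)
    B_bar = ((A_bar - 1.0) / A_diag)[:, None] * B
    return A_bar, B_bar

A_diag = -np.arange(1.0, 5.0)  # negative real parts keep exp(delta * A) stable (< 1)
B = np.ones((4, 1))
A_bar, B_bar = zoh_discretize(0.1, A_diag, B)
```

<p>Intuitively, \(\Delta\) controls how much continuous “time” elapses between discrete steps: as \(\Delta \to 0\), \(\mathbf{\overline{A}} \to \mathbf{I}\) and the state barely changes per step.</p>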

<p><strong>Discretization fine print and personal musings</strong></p>

<ul>
  <li>The authors prefer to drop \(\mathbf{\overline{D}}\) entirely from their exposition of SSMs, because it “<a href="https://stacks.stanford.edu/file/druid:mb976vf9362/gu_dissertation-augmented.pdf">can be viewed as a skip connection that does not interact with the state…, the most important part of the SSM.</a>” To be clear, it is still explicitly parametrized in their actual reference <a href="https://github.com/state-spaces/s4/blob/a246043077dbff8563a6b172426443dced9a9d96/models/s4/s4.py#L1676">S4</a> and <a href="https://github.com/state-spaces/mamba/blob/86a3a902ca4189689aabf1c09174235024c7aede/mamba_ssm/modules/mamba_simple.py#L114">Mamba</a> implementations, so IMHO<sup id="fnref:imho" role="doc-noteref"><a href="#fn:imho" class="footnote" rel="footnote">5</a></sup> it’s a bit more clear to leave it in the exposition even at the cost of verbosity.</li>
  <li>\(\mathbf{C}\) is not actually discretized per se, because \(\mathbf{A}\) and \(\mathbf{B}\) are already discretized, so in the Mamba paper there is no \(\mathbf{\overline{C}}\). Equivalently, in the S4 paper, \(\mathbf{C}\) is in fact discretized, where its discretization function is the identity. I personally prefer the notation of the latter, but it’s not really a big deal.</li>
  <li>There are at least a few valid methods of discretization: the Euler method, the zero-order-hold (ZOH) method, and the bilinear method. The Euler method is the weakest, but choosing between the latter two is nuanced. In fact, the S4 paper goes with the bilinear method, but Mamba highlights ZOH instead.
    <ul>
      <li>I am no mathematician, but I find the setup of predefined discretization functions to be a touch surprising, because it goes against my personal deep-learning intuition of “let the model learn everything”. I’m curious how well learned discretization <strong><em>functions</em></strong> would perform, in addition to simply a learned step size parameter.</li>
    </ul>
  </li>
  <li>Whether or not discretization is necessary at all is also an interesting question. There are various nice properties and interpretations, such as viewing \(\mathbf{x}_t\) as being sampled from continuous data, i.e. \(\mathbf{x}_t = \mathbf{x}(t\Delta)\). Yet, the authors also mention how other SSM research, e.g. <a href="https://arxiv.org/abs/2303.09489">Zhang et al., 2023</a>, <em>Effectively Modeling Time Series with Simple<sup id="fnref:simple" role="doc-noteref"><a href="#fn:simple" class="footnote" rel="footnote">6</a></sup> Discrete State Spaces</em>, did not explicitly discretize and still achieved good results.</li>
</ul>

<h3 id="structured-ssms-à-la-s4s4d"><em>Structured</em> SSMs, à la S4/S4D</h3>

<h4 id="structure-for-a-ie-initialization">Structure for <strong>A</strong>, i.e. initialization</h4>

<p>S4 (or rather, <a href="https://arxiv.org/pdf/2206.11893.pdf">S4D</a>) goes a step further and parametrizes \(\mathbf{A}\) as a diagonal matrix, initialized via <a href="https://arxiv.org/pdf/2206.12037.pdf">very clever approximations of the HiPPO matrix</a><sup id="fnref:hippo" role="doc-noteref"><a href="#fn:hippo" class="footnote" rel="footnote">7</a></sup> used in the original S4 paper. Some high level pointers:</p>

<ul>
  <li>HiPPO initialization was required to get good performance from the S4 architecture. A random initialization with S4 produced a model with middling performance.</li>
  <li>Matrix-vector multiplications are, of course, much faster/cheaper if your matrix is diagonal.</li>
  <li>S4D style initialization continues to work well with Mamba, but now random initialization also performs pretty well!</li>
</ul>

<h4 id="inputs-and-shapes">Inputs and shapes</h4>

<p>Earlier, we described LTI SSMs as being able to handle inputs with multiple channels. In particular, we parametrized the SSM matrices as \(\mathbf{A} \in \mathbb{R}^{\texttt{N}\times\texttt{N}}\), \(\mathbf{B} \in \mathbb{R}^{\texttt{N}\times\texttt{D}}\), \(\mathbf{C} \in \mathbb{R}^{\texttt{V}\times\texttt{N}}\), and \(\mathbf{D} \in \mathbb{R}^{\texttt{V}\times\texttt{D}}\), where</p>

<ul>
  <li>\(\mathtt{N}\) is the state space dimension size<sup id="fnref:size" role="doc-noteref"><a href="#fn:size" class="footnote" rel="footnote">8</a></sup></li>
  <li>\(\mathtt{D}\) is the number of input channels</li>
  <li>\(\mathtt{V}\) is the number of output channels</li>
</ul>

<p>However for S4 (and Mamba), the SSM is parametrized like so:</p>

<blockquote>
  <p>In this case, the \(\mathbf{A} \in \mathbb{R}^{\mathtt{N} \times \mathtt{N}}, \mathbf{B} \in \mathbb{R}^{\mathtt{N}\times 1}, \mathbf{C} \in \mathbb{R}^{1\times\mathtt{N}}\) matrices can all be represented by \(\mathtt{N}\) numbers. To operate over an input sequence \(\mathbf{x}\) of batch size \(\mathtt{B}\) and length \(\mathtt{L}\) with \(\mathtt{D}\) channels, the SSM is applied independently to each channel.</p>
</blockquote>
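<p>A toy sketch of that per-channel application (hypothetical names and shapes, with the \(\mathbf{\overline{D}}\) skip term dropped for brevity):</p>

```python
import numpy as np

def ssm_per_channel(A_bar, B_bar, C, x):
    """Apply a diagonal SSM independently to each of D input channels (sketch).

    x: (L, D) input sequence; A_bar, B_bar: (D, N) per-channel diagonal
    dynamics; C: (D, N) per-channel readout. Returns y: (L, D).
    """
    L, D = x.shape
    N = A_bar.shape[-1]
    h = np.zeros((D, N))                       # one length-N state per channel
    y = np.empty((L, D))
    for t in range(L):
        h = A_bar * h + B_bar * x[t][:, None]  # elementwise: channels never mix
        y[t] = (C * h).sum(-1)                 # per-channel scalar readout
    return y

x = np.array([[1.0, 2.0], [3.0, 4.0]])  # L=2 steps, D=2 channels
y = ssm_per_channel(np.zeros((2, 3)), np.ones((2, 3)), np.ones((2, 3)), x)
```

<p>Because every operation here is elementwise over the channel dimension, the \(\mathtt{D}\) independent SSMs batch together trivially.</p>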

<p><strong>Structure fine print and personal musings</strong></p>

<ul>
  <li>You can confirm this parametrization by inspecting the authors’ reference Mamba implementation <a href="https://github.com/state-spaces/mamba/blob/2a3704fd47ba817b415627b06fd796b971fdc137/mamba_ssm/modules/mamba_simple.py#L104">here</a> (or their <a href="https://github.com/state-spaces/s4/blob/main/models/s4/s4.py#L596">S4</a> / <a href="https://github.com/state-spaces/s4/blob/a246043077dbff8563a6b172426443dced9a9d96/models/s4/s4d.py#L27">S4D</a> implementations) which simply repeats the same instantiation across all channel dimensions.</li>
  <li>I’m curious how the models perform if one does not restrict to this parametrization. As in, the Mamba paper assumes that \(\mathtt{N}\) is an expansion factor for a single channel, but how would things behave if all channels were handled together? This is less meaningful if we’re doing an S4D-Real style of initialization, but how about with random initialization?</li>
</ul>

<h2 id="mamba-ssm">Mamba SSM</h2>

<p>To recap, we discussed</p>

<ol>
  <li><strong><em>LTI</em></strong> SSMs in their more general continuous form</li>
  <li><strong><em>Discrete</em></strong> LTI SSMs</li>
  <li>Discrete LTI <strong><em>structured</em></strong> SSMs (à la S4)</li>
</ol>

<p>The only actual model architecture change Mamba makes is the removal of linear time invariance for \(\mathbf{B}\), \(\mathbf{C}\), and \(\Delta\). They are now <em>functions</em> of \(\mathbf{x}_t\), i.e. the parameters are “selective”.</p>

\[\begin{aligned}
\mathbf{h}_{t} &amp;= \mathbf{\overline{A}} \mathbf{h}_{t-1}+ \overline{\mathbf{B}(\mathbf{x}_t)}\mathbf{x}_t \\
\mathbf{y}_t &amp;= \overline{\mathbf{C}(\mathbf{x}_t)}\mathbf{h}_t + \mathbf{\overline{D}}\mathbf{x}_t
\end{aligned}\]

<h3 id="a-departure-from-s4s-linear-time-invariance">A departure from S4’s linear time invariance</h3>

<p>Linear time invariance was critical to S4 because it provided the foundation for its efficient implementation. The sequential (slow) implementation is the obvious one: just use a for-loop, applying the parameters to each input one at a time.</p>
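<p>For reference, that for-loop implementation might look like this (a toy single-channel sketch of the discrete recurrence from earlier, not the reference code):</p>

```python
import numpy as np

def ssm_sequential(A_bar, B_bar, C_bar, D_bar, xs):
    """Discrete SSM recurrence, one input at a time (the slow path)."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x in xs:
        h = A_bar @ h + B_bar @ x         # state update
        ys.append(C_bar @ h + D_bar @ x)  # output readout
    return np.stack(ys)

# Toy 1-d SSM: an input impulse decays geometrically through the state.
A_bar = np.array([[0.5]])
B_bar = C_bar = np.array([[1.0]])
D_bar = np.array([[0.0]])
ys = ssm_sequential(A_bar, B_bar, C_bar, D_bar, np.array([[1.0], [0.0], [0.0]]))
```
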

<p>The parallel (fast) implementation is more complex. In brief, because the updates from the SSM matrices are time-invariant, you can compute the outputs by constructing a particular kernel and then performing a full-width convolution. Since full-width convolutions can be computed quickly with the <a href="https://www.analog.com/media/en/technical-documentation/dsp-book/dsp_book_ch18.pdf">FFT trick</a>, this ends up being a very efficient implementation.</p>
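<p>A toy sketch of the idea for a single-channel SSM follows. Caveats: the kernel is materialized naively here, whereas actual S4 computes it far more cleverly, and the \(\mathbf{\overline{D}}\) skip term is dropped:</p>

```python
import numpy as np

def ssm_conv(A_bar, B_bar, C_bar, xs):
    """LTI SSM as a causal convolution (toy sketch; kernel built naively)."""
    L = len(xs)
    K = np.empty(L)
    v = B_bar.astype(float)
    for t in range(L):  # K_t = C_bar @ A_bar^t @ B_bar
        K[t] = C_bar @ v
        v = A_bar @ v
    n = 2 * L           # pad to avoid circular wrap-around
    return np.fft.irfft(np.fft.rfft(K, n) * np.fft.rfft(xs, n), n)[:L]

# Same toy 1-d SSM as the sequential version: the impulse response is the kernel.
A_bar = np.array([[0.5]])
B_bar = np.array([1.0])
C_bar = np.array([1.0])
ys = ssm_conv(A_bar, B_bar, C_bar, np.array([1.0, 0.0, 0.0, 0.0]))
```

<p>The key point: because the parameters never change over time, the entire output sequence is one fixed kernel convolved with the input, and the FFT computes that in \(O(\mathtt{L} \log \mathtt{L})\).</p>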

<p>However, LTI caps the expressivity of the model: every update to the state is handled identically, no matter what needs to be updated or whether the current input is even relevant. With linear time <em>variance</em>, instead of learning fixed matrix parameters, Mamba learns <em>functions</em> which ingest the input and state at time \(t\) and <em>then</em> control the output dynamics.</p>

<p>This idea of not naively applying the same simple state update to every input is also why gated RNNs (e.g. LSTM, GRU) are much more effective than vanilla RNNs, and indeed there is a strong theoretical connection between gating heuristics and selectivity in Mamba.</p>

<h3 id="fast-implementation">Fast implementation</h3>

<p>Since Mamba is not a linear time-invariant SSM, we can no longer rely on a full-width convolution as the basis of a fast (i.e. parallel) implementation. We need other strategies; otherwise we are still stuck with slow training and limited practical utility.</p>

<p>The most critical techniques the authors employ are the <em>Blelloch work-efficient<sup id="fnref:work-efficient" role="doc-noteref"><a href="#fn:work-efficient" class="footnote" rel="footnote">9</a></sup> parallel scan</em> and hardware-aware memory management, which together facilitate the fast (in real world wall-clock time) training of Mamba.</p>

<p><strong>Pedagogical fine print and personal musings</strong></p>

<ul>
  <li>The above notation is my own, but I’ve tried to clarify some things, e.g.
    <ul>
      <li>\(\mathbf{A}\) and \(\mathbf{D}\) remain linear time-invariant, although I am curious about ablations that also parametrize those two parameters.</li>
      <li>\(\mathbf{B}\) and \(\mathbf{C}\) are parametrized by \(\mathbf{x}_t\), and <strong><em>then</em></strong> discretized.</li>
    </ul>
  </li>
  <li>All pedagogically friendly implementations of Mamba I was able to find actually omit the parallel scan. I’m personally conflicted about the utility of this, because it does simplify the implementation, but <strong><em>how</em></strong> you can implement something like this efficiently is half of the reason why Mamba is valuable in the first place.</li>
</ul>

<h2 id="hardware-aware-resource-management">Hardware-aware resource management</h2>

<p>In most of computer science, we lean pretty hard on a computation model that hand waves away a lot of real world performance characteristics in favor of “constant time”. The random access machine is one popular model, but far from the only one. In any case, these models are akin to idealized models in an introductory physics class: they are exceedingly useful, but they also lie.</p>

<p>The real world is messier, with different hierarchies of performance in compute, memory, and bandwidth, all thanks to the very real implications of squeezing transistors on a silicon wafer of limited size.</p>

<h3 id="a-simple-gpu-program">A simple GPU program</h3>

<p>Most GPU programs follow the same recipe:</p>

<ol>
  <li>Load relevant data from host DRAM to GPU HBM.</li>
  <li>Load it into GPU SRAM.</li>
  <li>Perform desired computation.<sup id="fnref:kernel" role="doc-noteref"><a href="#fn:kernel" class="footnote" rel="footnote">10</a></sup></li>
  <li>Load it back into GPU HBM.</li>
  <li>Load it back into CPU DRAM.</li>
</ol>

<p>Loading is <em>by far</em> the slowest part of this process, or at least for most neural network applications. The more you can minimize memory loading the better, even if you need to spend some compute to do so.<sup id="fnref:kvcache" role="doc-noteref"><a href="#fn:kvcache" class="footnote" rel="footnote">11</a></sup></p>

<p>One of the Mamba authors, Tri Dao, is well-known for his work on FlashAttention, which introduced hardware-aware computation of self-attention, bringing memory requirements from quadratic to linear and also providing dramatic wall-clock time speedups.</p>

<p>The core realization with FlashAttention was that the size of the intermediate computations dwarfed the actual size of the inputs and outputs. I.e., the \(QK^T\) matrix has size \(O(\mathtt{L}^2)\), even though the inputs themselves are only \(O(\mathtt{L})\).</p>

<h3 id="hardware-aware-mamba">Hardware-aware Mamba</h3>

<p>The same principle is applicable here. The intermediate computations (the actual state machine mechanics) are again larger than the inputs and outputs, i.e. \(O(\mathtt{BLDN}) &gt; O(\mathtt{BLD} + \mathtt{DN})\). Thus a similar approach works well here, where computations are done in a blockwise fashion, maximizing the amount of computation that can occur in SRAM before needing to load to/from HBM.</p>

<p>If this still feels a bit hand-wavey, then please go read Horace He’s excellent blog post <a href="https://horace.io/brrr_intro.html">Making Deep Learning Go Brrrr From First Principles</a>, which is perhaps the best technical blog post I’ve read in the past few years.</p>

<p>And if you’re feeling brave, then I’d encourage you to dive into the core implementation in the authors’ reference implementation, i.e. <a href="https://github.com/state-spaces/mamba/blob/34076d664838588a3c97727b263478ab9f621a07/csrc/selective_scan/selective_scan_fwd_kernel.cuh#L82"><code class="language-plaintext highlighter-rouge">selective_scan_fwd_kernel.cuh</code></a>. In particular, pay attention to anything with <code class="language-plaintext highlighter-rouge">smem</code> or <code class="language-plaintext highlighter-rouge">shared</code>.</p>

<h2 id="the-blelloch-parallel-prefix-scan">The Blelloch parallel prefix scan</h2>

<p>Of course, a very memory-efficient implementation isn’t particularly useful if you can’t compute it in parallel, thanks to how modern hardware accelerators are built. Hence the desire for parallelism.</p>

<p>To be clear, parallel computation of linear RNNs is not something the Mamba authors invented, nor is it even particularly recent<sup id="fnref:recent" role="doc-noteref"><a href="#fn:recent" class="footnote" rel="footnote">12</a></sup>. But it’s critical to an efficient implementation for Mamba, hence discussing it here. At a high level, there are two key pieces to gaining intuition here for Mamba:</p>

<ol>
  <li>Understanding how the Blelloch parallel prefix scan algorithm works for simple binary associative operators, e.g. summation.</li>
  <li>Understanding how to represent the core bits of Mamba, or any linear RNN, as a binary operator.</li>
</ol>

<p>This section will focus on the former.</p>

<h3 id="warm-up-the-parallel-reduce">Warm up: the parallel reduce</h3>

<p>Before discussing the parallel scan, let’s first think about something simpler, the parallel reduce.</p>

<p>Suppose I have a list with \(k\) elements and I want to perform a <strong><em>reduce</em></strong> operation over some <strong><em>binary associative operator</em></strong> \(\oplus\) where:</p>

<ul>
  <li>An <strong><em>operator</em></strong> is a function that takes in multiple elements of one type and returns a single element of the same type. A <strong><em>binary</em></strong> operator takes in two elements.</li>
  <li>A binary operator is <strong><em>associative</em></strong> if \(x \oplus (y \oplus z)= (x \oplus y) \oplus z\) for all \(x,y,z\).</li>
  <li>A <strong><em>reduce</em></strong>, also sometimes known as a <strong><em>fold</em></strong>, is a function that recursively applies a binary operation to aggregate a sequence of values into a single cumulative result. E.g., \((((x_1 + x_2) + x_3) + ... + x_{k-1}) + x_k\) is the reduction of the plus operator over a list of inputs \(x_1\) to \(x_k\).</li>
</ul>

<p>One possible naive implementation is just looping through the inputs in order. For example, with inputs <code class="language-plaintext highlighter-rouge">[3, 1, 7, 0, 4, 1, 6, 3]</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">naive_reduce</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">identity</span><span class="p">,</span> <span class="n">inputs</span><span class="p">):</span>
    <span class="n">result</span> <span class="o">=</span> <span class="n">identity</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">inputs</span><span class="p">:</span>
        <span class="n">result</span> <span class="o">=</span> <span class="n">op</span><span class="p">(</span><span class="n">result</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">result</span>

<span class="o">&gt;&gt;&gt;</span> <span class="n">naive_reduce</span><span class="p">(</span><span class="n">op</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="p">,</span> <span class="n">identity</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">inputs_</span><span class="o">=</span><span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span>
<span class="mi">25</span>
</code></pre></div></div>

<p>This has linear time complexity w.r.t. the number of inputs, which is as good as we can hope for without parallelism. But what if we had more workers? We could split our input into pairs, compute those sums in parallel, and then recursively use those results as the inputs to another reduce, forming a computation tree:</p>

<div class="mermaid">
graph BT;
    00[3] --&gt; 10[4]
    01[1] --&gt; 10[4]
    02[7] --&gt; 11[7]
    03[0] --&gt; 11[7]
    04[4] --&gt; 12[5]
    05[1] --&gt; 12[5]
    06[6] --&gt; 13[9]
    07[3] --&gt; 13[9]

  10[4] --&gt; 20[11]
  11[7] --&gt; 20[11]
  12[5] --&gt; 21[14]
  13[9] --&gt; 21[14]

  20[11] --&gt; 30[25]
  21[14] --&gt; 30[25]
</div>
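<p>To make the tree concrete, here is a plain-Python sketch of this pairwise reduction (a hypothetical helper, not from any Mamba codebase; each level’s pair-combines are independent, so with enough workers each level takes constant time and there are only logarithmically many levels):</p>

```python
def tree_reduce(op, identity, inputs):
    # Each recursive call processes one level of the computation tree.
    if not inputs:
        return identity
    if len(inputs) == 1:
        return inputs[0]
    # Combine adjacent pairs; these ops are independent, hence parallelizable.
    paired = [op(inputs[i], inputs[i + 1])
              for i in range(0, len(inputs) - 1, 2)]
    if len(inputs) % 2:  # carry any leftover element up unchanged
        paired.append(inputs[-1])
    return tree_reduce(op, identity, paired)
```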

<h3 id="the-blelloch-parallel-scan">The Blelloch parallel scan</h3>

<p>A <strong><em>scan</em></strong> aggregates results from a sequence of elements by applying the operator cumulatively. For example, given the same inputs, the scan output would be <code class="language-plaintext highlighter-rouge">[3, 4, 11, 11, 15, 16, 22, 25]</code>.</p>

<p>Closely related is the <strong><em>prescan</em></strong>, which shifts the outputs right by one, starting with the identity for the given operator (e.g., zero for summation): <code class="language-plaintext highlighter-rouge">[0, 3, 4, 11, 11, 15, 16, 22]</code>. The Blelloch algorithm technically computes a prescan, but that is sufficient for our use case since it’s easy to convert between a scan and a prescan.</p>
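<p>As a quick illustration (with hypothetical helper names), a sequential scan and the scan-to-prescan shift look like this:</p>

```python
def naive_scan(op, identity, inputs):
    # output[i] = inputs[0] op inputs[1] op ... op inputs[i]
    result, acc = [], identity
    for x in inputs:
        acc = op(acc, x)
        result.append(acc)
    return result

def naive_prescan(op, identity, inputs):
    # Shift the scan right by one, prepending the identity.
    return [identity] + naive_scan(op, identity, inputs)[:-1]
```

<p>For the running example, <code class="language-plaintext highlighter-rouge">naive_scan</code> returns <code class="language-plaintext highlighter-rouge">[3, 4, 11, 11, 15, 16, 22, 25]</code> and <code class="language-plaintext highlighter-rouge">naive_prescan</code> returns <code class="language-plaintext highlighter-rouge">[0, 3, 4, 11, 11, 15, 16, 22]</code>, matching the sequences above.</p>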

<h4 id="first-the-up-sweep">First, the up sweep</h4>

<p>The first step to a fast scan computation is to compute a parallel reduction, as we described in the previous section. However, this time, we preserve intermediate computations.</p>

<h4 id="next-the-down-sweep">Next, the down sweep</h4>

<p>In the down sweep, we will maintain the invariant that <em>every node contains the sum of all prior leaf nodes</em>, as determined by visit order in a pre-order traversal, e.g.:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">pre_order_traversal</span><span class="p">(</span><span class="n">node</span><span class="p">:</span> <span class="n">Node</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">node</span><span class="p">:</span>
        <span class="k">return</span>
    <span class="n">visit</span><span class="p">(</span><span class="n">node</span><span class="p">)</span>
    <span class="n">pre_order_traversal</span><span class="p">(</span><span class="n">node</span><span class="p">.</span><span class="n">left</span><span class="p">)</span>
    <span class="n">pre_order_traversal</span><span class="p">(</span><span class="n">node</span><span class="p">.</span><span class="n">right</span><span class="p">)</span>
</code></pre></div></div>

<p>If every node contains the sum of all prior leaf nodes, then the values of the leaves themselves will be exactly the results of a prescan!</p>

<p><strong>Stepping through the down sweep with a concrete example:</strong></p>

<p>When there are no prior leaf nodes, we use the identity value, e.g. 0 for summation.</p>

<div class="mermaid">
graph TD;

     10[?] --&gt; 00[?]
     10[?] --&gt; 01[?]
     11[?] --&gt; 02[?]
     11[?] --&gt; 03[?]
     12[?] --&gt; 04[?]
     12[?] --&gt; 05[?]
     13[?] --&gt; 06[?]
     13[?] --&gt; 07[?]

     20 --&gt; 10[?]
     20 --&gt; 11[?]
     21[?] --&gt; 12[?]
     21[?] --&gt; 13[?]

     30 --&gt; 20[?]
     30[0] --&gt; 21[?]
</div>

<p>We now need to be careful about maintaining the invariant. Filling in the down-sweep level by level, for any particular node <code class="language-plaintext highlighter-rouge">N</code>:</p>

<ul>
  <li><code>downsweep[N].left.value = downsweep[N].value</code>
    <ul>
      <li>For the following diagrams, a <span style="background-color: #5bd5ff;"><strong>blue</strong></span> node indicates the contribution from the parent.</li>
    </ul>
  </li>
  <li><code>downsweep[N].right.value = downsweep[N].value + upsweep[N].left.value</code>
    <ul>
      <li>For the following diagrams, a <span style="background-color: #ff5b5b;"><strong>red</strong></span> node indicates a contribution from the downsweep tree, and the <span style="background-color: #ffff5b;"><strong>yellow</strong></span> node indicates the contribution from the upsweep tree, and <span style="background-color: #ffad5b;"><strong>orange</strong></span> indicates the combined result.</li>
    </ul>
  </li>
</ul>

<div style="display: flex; justify-content: space-between;">
  <!-- Left Column for Odd Diagrams -->
  <div style="width: 48%; background-color: #ededed;">
    <center><b>Up sweep</b></center>
    <div class="mermaid">
      graph BT;
        classDef orange fill:#ffad5b,stroke:#333,stroke-width:2px;
        classDef yellow fill:#ffff5b,stroke:#333,stroke-width:2px;
        classDef red fill:#ff5b5b,stroke:#333,stroke-width:2px;

          00[3] --&gt; 10[4]
          01[1] --&gt; 10[4]
          02[7] --&gt; 11[7]
          03[0] --&gt; 11[7]
          04[4] --&gt; 12[5]
          05[1] --&gt; 12[5]
          06[6] --&gt; 13[9]
          07[3] --&gt; 13[9]

        10[4] --&gt; 20[11]
        11[7] --&gt; 20[11]
        12[5] --&gt; 21[14]
        13[9] --&gt; 21[14]

        20[11] --&gt; 30[25]
        21[14] --&gt; 30[25]

        class 20 yellow
    </div>

    <div class="mermaid">
      graph BT;
        classDef orange fill:#ffad5b,stroke:#333,stroke-width:2px;
        classDef yellow fill:#ffff5b,stroke:#333,stroke-width:2px;
        classDef red fill:#ff5b5b,stroke:#333,stroke-width:2px;

          00[3] --&gt; 10[4]
          01[1] --&gt; 10[4]
          02[7] --&gt; 11[7]
          03[0] --&gt; 11[7]
          04[4] --&gt; 12[5]
          05[1] --&gt; 12[5]
          06[6] --&gt; 13[9]
          07[3] --&gt; 13[9]

        10[4] --&gt; 20[11]
        11[7] --&gt; 20[11]
        12[5] --&gt; 21[14]
        13[9] --&gt; 21[14]

        20[11] --&gt; 30[25]
        21[14] --&gt; 30[25]

        class 10,12 yellow
    </div>

    <div class="mermaid">
      graph BT;
        classDef orange fill:#ffad5b,stroke:#333,stroke-width:2px;
        classDef yellow fill:#ffff5b,stroke:#333,stroke-width:2px;
        classDef red fill:#ff5b5b,stroke:#333,stroke-width:2px;

          00[3] --&gt; 10[4]
          01[1] --&gt; 10[4]
          02[7] --&gt; 11[7]
          03[0] --&gt; 11[7]
          04[4] --&gt; 12[5]
          05[1] --&gt; 12[5]
          06[6] --&gt; 13[9]
          07[3] --&gt; 13[9]

        10[4] --&gt; 20[11]
        11[7] --&gt; 20[11]
        12[5] --&gt; 21[14]
        13[9] --&gt; 21[14]

        20[11] --&gt; 30[25]
        21[14] --&gt; 30[25]

        class 00,02,04,06 yellow
    </div>
  </div>

  <!-- Right Column for Even Diagrams -->
  <div style="width: 48%; background-color: #ededed;">
    <center><b>Down sweep</b></center>
    <div class="mermaid">
      graph TD;
        classDef orange fill:#ffad5b,stroke:#333,stroke-width:2px;
        classDef yellow fill:#ffff5b,stroke:#333,stroke-width:2px;
        classDef red fill:#ff5b5b,stroke:#333,stroke-width:2px;
        classDef blue fill:#5bd5ff,stroke:#333,stroke-width:2px;

           10[?] --&gt; 00[?]
           10[?] --&gt; 01[?]
           11[?] --&gt; 02[?]
           11[?] --&gt; 03[?]
           12[?] --&gt; 04[?]
           12[?] --&gt; 05[?]
           13[?] --&gt; 06[?]
           13[?] --&gt; 07[?]

           20 --&gt; 10[?]
           20 --&gt; 11[?]
           21[?] --&gt; 12[?]
           21[?] --&gt; 13[?]

           30 --&gt; 20[0]
           30[0] --&gt; 21[11]

          class 30 red
          class 21 orange
          class 20 blue
    </div>

    <div class="mermaid">
      graph TD;
        classDef orange fill:#ffad5b,stroke:#333,stroke-width:2px;
        classDef yellow fill:#ffff5b,stroke:#333,stroke-width:2px;
        classDef red fill:#ff5b5b,stroke:#333,stroke-width:2px;
        classDef blue fill:#5bd5ff,stroke:#333,stroke-width:2px;

           10[?] --&gt; 00[?]
           10[?] --&gt; 01[?]
           11[?] --&gt; 02[?]
           11[?] --&gt; 03[?]
           12[?] --&gt; 04[?]
           12[?] --&gt; 05[?]
           13[?] --&gt; 06[?]
           13[?] --&gt; 07[?]

           20 --&gt; 10[0]
           20 --&gt; 11[4]
           21[?] --&gt; 12[11]
           21[?] --&gt; 13[16]

           30 --&gt; 20[0]
           30[0] --&gt; 21[11]

          class 20,21 red
          class 11,13 orange
          class 10,12 blue
    </div>

    <div class="mermaid">
      graph TD;
        classDef orange fill:#ffad5b,stroke:#333,stroke-width:2px;
        classDef yellow fill:#ffff5b,stroke:#333,stroke-width:2px;
        classDef red fill:#ff5b5b,stroke:#333,stroke-width:2px;
        classDef blue fill:#5bd5ff,stroke:#333,stroke-width:2px;

           10[?] --&gt; 00[0]
           10[?] --&gt; 01[3]
           11[?] --&gt; 02[4]
           11[?] --&gt; 03[11]
           12[?] --&gt; 04[11]
           12[?] --&gt; 05[15]
           13[?] --&gt; 06[16]
           13[?] --&gt; 07[22]

           20 --&gt; 10[0]
           20 --&gt; 11[4]
           21[?] --&gt; 12[11]
           21[?] --&gt; 13[16]

           30 --&gt; 20[0]
           30[0] --&gt; 21[11]

        class 10,11,12,13 red
        class 01,03,05,07 orange
        class 00,02,04,06 blue
    </div>
  </div>
</div>

<p>Voila! For a sketch of the proof of why this works, consider a node in a pre-order traversal.</p>

<ul>
  <li>It is either a left child, or a right child (or the root).</li>
  <li>If it is a left child, then it and its parent have the same number of prior leaf nodes, so it can get its value just by copying its parent’s value.</li>
  <li>If it is a right child, then its prior leaves come from two possible regions:
    <ul>
      <li>all the leaves under its left sibling, i.e. those visited after the parent but before the right child (the sibling’s up-sweep value)</li>
      <li>all leaves that come entirely “before” the parent node (the parent’s own down-sweep value)</li>
    </ul>
  </li>
</ul>
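<p>Putting the up sweep and the two down-sweep rules together, here is a minimal array-based sketch of the whole Blelloch prescan (hypothetical code, assuming the input length is a power of two and the operator is associative):</p>

```python
def blelloch_prescan(op, identity, inputs):
    a = list(inputs)
    n = len(a)  # assumed to be a power of two

    # Up sweep: a parallel reduction, keeping the intermediate sums in place.
    step = 1
    while step < n:
        for i in range(2 * step - 1, n, 2 * step):  # parallelizable level
            a[i] = op(a[i - step], a[i])
        step *= 2

    # Down sweep: the root has no prior leaves, so it gets the identity.
    a[n - 1] = identity
    step = n // 2
    while step >= 1:
        for i in range(2 * step - 1, n, 2 * step):  # parallelizable level
            left_sum = a[i - step]       # up-sweep value of the left child
            parent = a[i]                # down-sweep value of this node
            a[i - step] = parent         # left child copies its parent
            a[i] = op(parent, left_sum)  # right child: parent, then left subtree
        step //= 2
    return a
```

<p>On the running example <code class="language-plaintext highlighter-rouge">[3, 1, 7, 0, 4, 1, 6, 3]</code> with addition, this returns <code class="language-plaintext highlighter-rouge">[0, 3, 4, 11, 11, 15, 16, 22]</code>, the prescan from earlier.</p>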

<h2 id="a-binary-associative-operator-for-mamba">A binary associative operator for Mamba</h2>

<p>The above examples used “sum”, but the Blelloch parallel scan works for any binary associative operator. For Mamba, we can define an associative operator that, when given the proper inputs, will compute the prescan of state vectors in parallel!</p>

<h3 id="prerequisites">Prerequisites</h3>

<p>What we’ll discuss here is contextualized by Mamba, but it is actually valid for any first-order recurrence of the following form<sup id="fnref:form" role="doc-noteref"><a href="#fn:form" class="footnote" rel="footnote">13</a></sup>:</p>

\[h_t= \begin{cases}
      b_0 &amp; t = 0 \\
      (a_t\otimes h_{t-1}) \oplus b_t &amp; t&gt;0 \\
   \end{cases}\]

<p>Where \(\oplus\) and \(\otimes\) meet the following criteria:</p>

<ul>
  <li>\(\oplus\) must be associative, i.e. \((x \oplus y) \oplus z = x \oplus (y \oplus z)\)
    <ul>
      <li>Notice that vector-vector addition satisfies this!</li>
    </ul>
  </li>
  <li>\(\otimes\) must be semiassociative, i.e. there exists a binary associative operator \(\odot\) such that \(x \otimes (y \otimes z) = (x \odot y) \otimes z\)
    <ul>
      <li>Notice that \(\odot\) as matrix-matrix multiplication and \(\otimes\) as matrix-vector multiplication satisfies this!</li>
    </ul>
  </li>
  <li>\(\otimes\) distributes over \(\oplus\): \(x \otimes (y \oplus z) = (x \otimes y) \oplus (x \otimes z)\)
    <ul>
      <li>Notice that above matrix/vector addition/multiplication operators satisfy this!</li>
    </ul>
  </li>
</ul>

<h3 id="defining-the-operator">Defining the operator</h3>

<p>We massage our inputs into the sequence of \(c_t \equiv [a_t, b_t] \text{ for } t = 1, 2, \ldots, \mathtt{L}\), where</p>

\[\begin{aligned}
a_t &amp;= \mathbf{\overline{A}} \\
b_t &amp;= \overline{\mathbf{B}(\mathbf{x}_t)}\,\mathbf{x}_t \\
\end{aligned}\]

<p>And define a new operator \(\bullet\) as follows:</p>

\[\begin{aligned}
c_i \bullet c_j &amp;\equiv [c_{j,a} c_{i,a}, \,  c_{j,a} c_{i,b}  + c_{j,b}] \\
&amp;\equiv [a_j a_i, \,  a_j b_i  + b_j]

\end{aligned}\]

<p>The “identity” for this operator is \(c_0 = \left[\mathbf{\overline{A}}, \, \mathbf{h}_0 \right]\), where \(\mathbf{h}_0\) is the initial state vector, before having seen any inputs. This is analogous to \(0\) for the \(\text{add}\) operator.</p>

<p>We then apply the operator in parallel with the Blelloch (pre)scan algorithm, and the outputs at the second index will be the desired \(\mathbf{h}_t\) results, computed in an efficient manner!</p>
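<p>Before the full tensor version below, here is a tiny scalar sketch (hypothetical code; the real model uses matrices for \(a_t\) and vectors for \(b_t\), but scalars show the mechanics) checking that folding with the operator reproduces the recurrence \(h_t = a_t h_{t-1} + b_t\):</p>

```python
from functools import reduce

def bullet(c_i, c_j):
    # c = (a, b); c_i • c_j = (a_j * a_i, a_j * b_i + b_j)
    a_i, b_i = c_i
    a_j, b_j = c_j
    return (a_j * a_i, a_j * b_i + b_j)

a_s = [0.5, 2.0, 1.5]
b_s = [1.0, 3.0, 2.0]

# Sequential recurrence h_t = a_t * h_{t-1} + b_t, starting from h_0 = 0.
h = 0.0
for a, b in zip(a_s, b_s):
    h = a * h + b

# Folding left-to-right from the identity-like element (1, h_0) agrees.
c = reduce(bullet, zip(a_s, b_s), (1.0, 0.0))
assert c[1] == h  # both equal 9.5
```

<p>Because the operator is associative, this left-to-right fold can be regrouped into a balanced tree, which is exactly what the Blelloch scan exploits.</p>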

<h3 id="some-proofs">Some proofs</h3>

<p>From the definition of the operator and how we set up \(a_t\) and \(b_t\), the second component is essentially guaranteed to compute the proper state vectors. Recall the first line of the Mamba SSM:</p>

\[\mathbf{h}_{t} = \mathbf{\overline{A}} \mathbf{h}_{t-1}+ \overline{\mathbf{B}(\mathbf{x}_t)}\mathbf{x}_t\]

<p>A sketch of the high level intuition here is:</p>

<ol>
  <li>We can show that the \(\bullet\) operator with our specified initializations computes \(\mathbf{h}_t\) with a sequential scan.</li>
  <li>We can show that the \(\bullet\) operator is <strong><em>associative</em></strong>, so that we may use the <strong><em>Blelloch parallel scan</em></strong> algorithm in particular.</li>
</ol>

<h4 id="proof-of-part-1">Proof of part 1</h4>

<ol>
  <li>We initialize \(b_0 = \mathbf{h}_0\).</li>
  <li>For time \(t \ge 1\), if \(b_{t-1}\) is equal to \(\mathbf{h}_{t-1}\), then  \(a_t b_{t-1}  + b_t = \mathbf{h}_t\). This is because we have set \(a_t = \mathbf{\overline{A}}\) and \(b_t = \overline{\mathbf{B}(\mathbf{x}_t)}\mathbf{x}_t\)  for all \(t\).</li>
  <li>Via induction, since \(b_0 = \mathbf{h}_0\) is initialized correctly and subsequent \(b_t = \mathbf{h}_t\) are computed correctly if provided \(b_{t-1} = \mathbf{h}_{t-1}\), then scanning with this operator and these initializations correctly computes the desired state vectors.</li>
</ol>

<h4 id="proof-of-part-2">Proof of part 2</h4>

<p>This, unfortunately, is simply a wall of algebra:</p>

\[\begin{aligned}
&amp;\text{Apply the definition of } \bullet \text{:} \\
(c_i \bullet c_j) \bullet c_k &amp;= [c_{j,a} \odot c_{i,a}, \; (c_{j,a} \otimes c_{i,b}) \oplus c_{j,b}] \bullet c_k \\

&amp;\text{Apply the definition of } \bullet \text{ again:} \\
&amp;= [ c_{k,a} \odot (c_{j,a} \odot c_{i,a} ) , \;   (c_{k,a} \otimes ((c_{j,a} \otimes c_{i,b}) \oplus c_{j,b})) \oplus c_{k,b}] \\

&amp;\text{Associativity of } \odot \text{:} \\
&amp;= [(c_{k,a} \odot c_{j,a}) \odot c_{i,a}, \;   (c_{k,a} \otimes ((c_{j,a} \otimes c_{i,b}) \oplus c_{j,b})) \oplus c_{k,b}] \\

&amp;\otimes \text{ distributes } c_{k,a} \text{ over } \oplus \text{:} \\
&amp;= [(c_{k,a} \odot c_{j,a}) \odot c_{i,a}, \;   ((c_{k,a} \otimes (c_{j,a} \otimes c_{i,b})) \oplus (c_{k,a} \otimes c_{j,b})) \oplus c_{k,b}] \\

&amp;\text{Associativity of } \oplus \text{:} \\
&amp;= [(c_{k,a} \odot c_{j,a}) \odot c_{i,a}, \;   (c_{k,a} \otimes (c_{j,a} \otimes c_{i,b})) \oplus ((c_{k,a} \otimes c_{j,b}) \oplus c_{k,b})] \\

&amp;\text{Semiassociativity of } \otimes \text{:} \\
&amp;= [(c_{k,a} \odot c_{j,a}) \odot c_{i,a}, \;   ((c_{k,a} \odot c_{j,a}) \otimes c_{i,b}) \oplus ((c_{k,a} \otimes c_{j,b}) \oplus c_{k,b})] \\

&amp;\text{Apply the operator definition:} \\
&amp;= c_i \bullet [c_{k,a} \odot c_{j,a}, \;  ( c_{k,a} \otimes c_{j,b} ) \oplus c_{k,b}] \\

&amp;\text{Apply the operator definition again:} \\
(c_i \bullet c_j) \bullet c_k &amp;= c_i \bullet (c_j \bullet c_k)
\end{aligned}\]

<h3 id="sanity-checking">Sanity checking</h3>

<p>The above seems to make sense, but perhaps you prefer <code class="language-plaintext highlighter-rouge">python</code> to \(\LaTeX\) (I wouldn’t blame you).</p>

<p>As a first sanity check, if you’ve gotten this far, now is perhaps a good time to start perusing some reference implementations. E.g., in the author’s CUDA code, you can see the above operator implemented quite literally at <a href="https://github.com/state-spaces/mamba/blob/009bec5ee37f586844a3fc89c040a9c1a9d8badf/csrc/selective_scan/selective_scan_common.h#L113"><code class="language-plaintext highlighter-rouge">selective_scan_common.h:113</code></a>.</p>

<p>But also, let’s write our own “unit tests” as an additional sanity check. FYI, this will leverage <code class="language-plaintext highlighter-rouge">jax.lax.associative_scan</code>, a batteries-included implementation<sup id="fnref:implementation" role="doc-noteref"><a href="#fn:implementation" class="footnote" rel="footnote">14</a></sup> of Blelloch’s algorithm.</p>

<h4 id="miscellaneous-setup">Miscellaneous setup</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Various imports
</span><span class="kn">from</span> <span class="nn">einops</span> <span class="kn">import</span> <span class="n">einsum</span>
<span class="kn">import</span> <span class="nn">jax</span>
<span class="c1"># Jax.lax already has a convenient parallel scan implementation.
</span><span class="kn">import</span> <span class="nn">jax.lax</span> <span class="k">as</span> <span class="n">lax</span>
<span class="kn">import</span> <span class="nn">jax.numpy</span> <span class="k">as</span> <span class="n">jnp</span>

<span class="n">key</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">PRNGKey</span><span class="p">(</span><span class="n">seed</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>

<span class="n">B</span> <span class="o">=</span> <span class="mi">1</span>  <span class="c1"># batch size
</span><span class="n">L</span> <span class="o">=</span> <span class="mi">8192</span> <span class="c1"># context length
</span><span class="n">N</span> <span class="o">=</span> <span class="mi">64</span>  <span class="c1"># hidden state size
</span><span class="n">D</span> <span class="o">=</span> <span class="mi">2</span>  <span class="c1"># num in channels
</span><span class="n">V</span> <span class="o">=</span> <span class="mi">1</span>  <span class="c1"># num out channels
</span>
<span class="c1"># Gets the various fake x_t inputs.
</span><span class="k">def</span> <span class="nf">generate_random_xs</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">num_inputs</span><span class="o">=</span><span class="n">L</span><span class="p">,</span> <span class="n">num_channels</span><span class="o">=</span><span class="n">D</span><span class="p">):</span>
    <span class="n">key</span><span class="p">,</span> <span class="n">subkey</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
    <span class="n">xs</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">lognormal</span><span class="p">(</span><span class="n">subkey</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="n">num_inputs</span><span class="p">,</span> <span class="n">num_channels</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">key</span><span class="p">,</span> <span class="n">xs</span>

<span class="c1"># Gets various fake A matrices. This is actually constant in the paper,
# but it doesn't have to be.
</span><span class="k">def</span> <span class="nf">generate_random_As</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">num_inputs</span><span class="o">=</span><span class="n">L</span><span class="p">,</span> <span class="n">state_size</span><span class="o">=</span><span class="n">N</span><span class="p">):</span>
    <span class="n">key</span><span class="p">,</span> <span class="n">subkey</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
    <span class="n">As</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">lognormal</span><span class="p">(</span><span class="n">subkey</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="n">num_inputs</span><span class="p">,</span> <span class="n">state_size</span><span class="p">,</span> <span class="n">state_size</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">key</span><span class="p">,</span> <span class="n">As</span>

<span class="c1"># Gets various fake B(x_t) matrices.
</span><span class="k">def</span> <span class="nf">generate_random_Bxs</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">num_inputs</span><span class="o">=</span><span class="n">L</span><span class="p">,</span> <span class="n">state_size</span><span class="o">=</span><span class="n">N</span><span class="p">,</span> <span class="n">num_channels</span><span class="o">=</span><span class="n">D</span><span class="p">):</span>
    <span class="n">key</span><span class="p">,</span> <span class="n">subkey</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
    <span class="n">Bxs</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">lognormal</span><span class="p">(</span><span class="n">subkey</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">(</span><span class="n">num_inputs</span><span class="p">,</span> <span class="n">state_size</span><span class="p">,</span> <span class="n">num_channels</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">key</span><span class="p">,</span> <span class="n">Bxs</span>

<span class="c1"># Gets the b_t term.
</span><span class="k">def</span> <span class="nf">get_bs</span><span class="p">(</span><span class="n">xs</span><span class="p">,</span> <span class="n">Bxs</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">einsum</span><span class="p">(</span><span class="n">Bxs</span><span class="p">,</span> <span class="n">xs</span><span class="p">,</span> <span class="s">"l n d, l d -&gt; l n"</span><span class="p">)</span>

<span class="c1"># Jax plays nicest with jnp.arrays, so we'll stuff the values inside a
# single array and just unpack things here. I suppose I could use PyTrees
# but please forgive a bit of laziness/hackiness on my part.
</span><span class="k">def</span> <span class="nf">extract</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">state_size</span><span class="p">):</span>
    <span class="k">assert</span> <span class="n">c</span><span class="p">.</span><span class="n">ndim</span> <span class="o">==</span> <span class="mi">1</span>
    <span class="k">assert</span> <span class="n">c</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">==</span> <span class="n">state_size</span> <span class="o">*</span> <span class="n">state_size</span> <span class="o">+</span> <span class="n">state_size</span>
    <span class="k">return</span> <span class="p">(</span>
        <span class="n">c</span><span class="p">[:</span><span class="n">state_size</span> <span class="o">*</span> <span class="n">state_size</span><span class="p">].</span><span class="n">reshape</span><span class="p">((</span><span class="n">state_size</span><span class="p">,</span> <span class="n">state_size</span><span class="p">)),</span>
        <span class="n">c</span><span class="p">[</span><span class="o">-</span><span class="n">state_size</span><span class="p">:].</span><span class="n">reshape</span><span class="p">((</span><span class="n">state_size</span><span class="p">,))</span>
    <span class="p">)</span>
</code></pre></div></div>

<h4 id="the-operator-implementation-and-test-logic">The operator implementation and test logic</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">operator</span><span class="p">(</span><span class="n">c_prev</span><span class="p">,</span> <span class="n">c_curr</span><span class="p">,</span> <span class="n">num_inputs</span><span class="o">=</span><span class="n">L</span><span class="p">,</span> <span class="n">state_size</span><span class="o">=</span><span class="n">N</span><span class="p">,</span> <span class="n">num_channels</span><span class="o">=</span><span class="n">D</span><span class="p">):</span>
    <span class="n">prev_a</span><span class="p">,</span> <span class="n">prev_b</span> <span class="o">=</span> <span class="n">extract</span><span class="p">(</span><span class="n">c_prev</span><span class="p">,</span> <span class="n">state_size</span><span class="p">)</span>
    <span class="n">curr_a</span><span class="p">,</span> <span class="n">curr_b</span> <span class="o">=</span> <span class="n">extract</span><span class="p">(</span><span class="n">c_curr</span><span class="p">,</span> <span class="n">state_size</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">jnp</span><span class="p">.</span><span class="n">concatenate</span><span class="p">([</span>
        <span class="n">jnp</span><span class="p">.</span><span class="n">ravel</span><span class="p">(</span><span class="n">curr_a</span> <span class="o">@</span> <span class="n">prev_a</span><span class="p">),</span> 
        <span class="n">jnp</span><span class="p">.</span><span class="n">ravel</span><span class="p">(</span><span class="n">curr_a</span> <span class="o">@</span> <span class="n">prev_b</span> <span class="o">+</span> <span class="n">curr_b</span><span class="p">)</span>
    <span class="p">])</span>
<span class="n">vectorized_operator</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">vmap</span><span class="p">(</span><span class="n">operator</span><span class="p">,</span> <span class="n">in_axes</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span> <span class="n">out_axes</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>

<span class="c1"># Actually generate some fake test data.
</span><span class="n">key</span><span class="p">,</span> <span class="n">xs</span> <span class="o">=</span> <span class="n">generate_random_xs</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
<span class="n">key</span><span class="p">,</span> <span class="n">Bxs</span> <span class="o">=</span> <span class="n">generate_random_Bxs</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
<span class="n">key</span><span class="p">,</span> <span class="n">As</span> <span class="o">=</span> <span class="n">generate_random_As</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>

<span class="n">bs</span> <span class="o">=</span> <span class="n">get_bs</span><span class="p">(</span><span class="n">xs</span><span class="p">,</span> <span class="n">Bxs</span><span class="p">)</span>
<span class="n">cs</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">As</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">N</span> <span class="o">*</span> <span class="n">N</span><span class="p">),</span> <span class="n">bs</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="c1"># %%timeit results on a freebie Google Colab VM: 
# 283 ms ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
</span><span class="n">lax_scanned</span> <span class="o">=</span> <span class="n">lax</span><span class="p">.</span><span class="n">associative_scan</span><span class="p">(</span><span class="n">vectorized_operator</span><span class="p">,</span> <span class="n">cs</span><span class="p">)[:,</span> <span class="o">-</span><span class="n">N</span><span class="p">:]</span>

<span class="k">def</span> <span class="nf">naive_scan_hs</span><span class="p">(</span><span class="n">h_0</span><span class="p">,</span> <span class="n">As</span><span class="p">,</span> <span class="n">Bxs</span><span class="p">,</span> <span class="n">xs</span><span class="p">):</span>
    <span class="n">output</span> <span class="o">=</span> <span class="p">[</span><span class="n">h_0</span><span class="p">]</span>
    <span class="k">for</span> <span class="n">a</span><span class="p">,</span> <span class="n">bx</span><span class="p">,</span> <span class="n">x</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">As</span><span class="p">,</span> <span class="n">Bxs</span><span class="p">,</span> <span class="n">xs</span><span class="p">):</span>
        <span class="n">b</span> <span class="o">=</span> <span class="n">einsum</span><span class="p">(</span><span class="n">bx</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="s">"n d, d -&gt; n"</span><span class="p">)</span>
        <span class="n">output</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">a</span> <span class="o">@</span> <span class="n">output</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">b</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">output</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span>

<span class="c1"># %%timeit results on a freebie Google Colab VM:
# 3.34 s ± 313 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
</span><span class="n">naive_hs</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">vstack</span><span class="p">(</span>
    <span class="n">naive_scan_hs</span><span class="p">(</span><span class="n">jnp</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">N</span><span class="p">,)),</span> <span class="n">As</span><span class="p">,</span> <span class="n">Bxs</span><span class="p">,</span> <span class="n">xs</span><span class="p">)</span>
<span class="p">)</span>

<span class="c1"># The following returns Array(True, dtype=bool)! Which means that
# we're getting identical results, (allowing for some floating point
# imprecision), regardless of if we're using the naive iterative
# implementation, or the fast parallel implementation.
</span><span class="n">jnp</span><span class="p">.</span><span class="n">allclose</span><span class="p">(</span><span class="n">naive_hs</span><span class="p">,</span> <span class="n">lax_scanned</span><span class="p">)</span>
</code></pre></div></div>

<p>Hooray, we’re getting identical results as well as a much faster wall time. These trends only get more extreme as you leverage more capable hardware (the above results were on a freebie Google Colaboratory VM, i.e. not hardware accelerated and using only two CPU cores).</p>

<h2 id="closing-thoughts">Closing thoughts</h2>

<p>Whew, that was a lot. Let’s do a quick recap of what we’ve covered:</p>

<h3 id="summary-of-above-topics">Summary of above topics</h3>

<ul>
  <li>Mamba is a state space model and is, at its core, recurrent/sequential.</li>
  <li>Because Mamba relies on a fixed-size state representation, it is efficient at inference time, unlike transformers.</li>
  <li>Mamba, like FlashAttention, maximizes efficiency in a real-world wall-clock sense by doing as much work as possible in SRAM.</li>
  <li>Mamba can be computed in parallel via the Blelloch parallel scan algorithm, making it efficient at training time, unlike RNNs.</li>
  <li>Mamba’s parameters are selective, making it far better at modeling long-term dependencies than vanilla RNNs.</li>
</ul>

<p>In particular, we paid extra attention to</p>

<ol>
  <li>Mamba’s specific SSM formulation.</li>
  <li>How linear RNNs, Mamba included, can be computed efficiently.</li>
</ol>

<h3 id="remaining-topics">Remaining topics</h3>

<p>There is a bunch of other important material that I’ve glossed over here. Conceptually, I believe the selective SSM formulation and the scan implementation are the most important contributions, but a thorough discussion would also cover</p>

<ul>
  <li>The hardware-aware computation.<sup id="fnref:hardware-aware" role="doc-noteref"><a href="#fn:hardware-aware" class="footnote" rel="footnote">15</a></sup></li>
  <li>Mamba’s fused “one block” architecture.</li>
  <li>Other fairly standard optimization tricks, e.g. recomputation, kernel fusion, etc.</li>
  <li>Theoretical ties between heuristic gating and selectivity.</li>
</ul>

<p>And of course, I provided a bit of code to sanity check the discussion around parallel computation of the state vectors, but this is not a full implementation 🙂. Perhaps I’ll get around to writing a more thorough treatise in the future, but for now, I hope you’ve found this interesting!</p>

<h3 id="musings-on-breadth">Musings on breadth</h3>

<p>Mamba is one of the most impressive papers I’ve read in years. It’s particularly impressive because the core concepts (selectivity and efficient implementation) are simple (where simple ≠ easy) yet effective.</p>

<p>In fact, throughout the paper there are few individual insights that demand galaxy-brain intellect. For example:</p>

<ul>
  <li>Non-LTI SSMs are really just the more general version of SSMs you might learn about in a control theory class.</li>
  <li>There have been other examples of fused architectures.</li>
  <li>The Blelloch scan algorithm is old, originally published in 1990, and is something you’re very likely to encounter if you’ve taken some parallel programming courses. There’s even a <a href="https://www.youtube.com/watch?v=mmYv3Haj6uc">Udacity course</a> that touches on this!</li>
  <li>The “selectivity” functions are very simple, comprising just matmuls, a softplus, and a broadcast.</li>
  <li>I am inexperienced at writing kernel code, but I assume that, like any other engineering skill, it is very doable with some practice.</li>
  <li>There is a bit of insight that comes with knowing how hardware accelerators are built. Becoming an expert at this is, like all things, very difficult. But you can get some pretty useful knowledge by doing some high bang-for-buck reading, e.g. <a href="https://horace.io/brrr_intro.html">Making Deep Learning Go Brrrr From First Principles</a>.</li>
  <li>Legendre polynomials, à la HiPPO initialization, could very feasibly appear in one’s undergraduate coursework if they enjoy math (e.g., <a href="https://math.illinois.edu/resources/syllabus-math-442">Math 442</a> at my alma mater).</li>
</ul>

<p>But on the other hand, combining a useful knowledge of all of the above and distilling that into a cohesive and novel body of research? Super cool. I suppose that’s why both authors are younger than me but also tenure-track professors at top institutions.</p>

<h2 id="assorted-references">Assorted references</h2>

<ul>
  <li><a href="https://stacks.stanford.edu/file/druid:mb976vf9362/gu_dissertation-augmented.pdf">https://stacks.stanford.edu/file/druid:mb976vf9362/gu_dissertation-augmented.pdf</a></li>
  <li><a href="https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf">https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf</a></li>
  <li><a href="https://srush.github.io/annotated-s4/">https://srush.github.io/annotated-s4/</a></li>
  <li><a href="https://arxiv.org/abs/2312.00752">https://arxiv.org/abs/2312.00752</a></li>
  <li><a href="https://arxiv.org/abs/2111.00396">https://arxiv.org/abs/2111.00396</a></li>
</ul>

<h2 id="footnotes">Footnotes</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:static" role="doc-endnote">
      <p>I.e. post-training, the parameters involved in the updates are always the same. <a href="#fnref:static" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:expensive" role="doc-endnote">
      <p>There are some tricks like KV caching which help here, but they unfortunately do not change asymptotic behavior. <a href="#fnref:expensive" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:effective" role="doc-endnote">
      <p>Not to say that scaling is easy either; HPC is a fiendishly difficult thing to get right. It’s just a very simple concept to understand: <strong><em>use moar computer</em></strong> <a href="#fnref:effective" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:recurrence" role="doc-endnote">
      <p>At the risk of offending proper mathematicians, you can kinda think of a first-order differential equation as an infinitesimally precise recurrence. <a href="#fnref:recurrence" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:imho" role="doc-endnote">
      <p>You should take any of my opinions with a large grain of salt. I, for one, have not been awarded any ICLR Outstanding Paper Awards 🙂 <a href="#fnref:imho" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:simple" role="doc-endnote">
      <p><a href="https://i.imgur.com/Jjb1WsZ.png">“Simple”</a> 🫠 <a href="#fnref:simple" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:hippo" role="doc-endnote">
      <p>IMHO, this is where most people get stuck when trying to understand S4. I mean, how often does the typical computer scientist encounter Legendre polynomials? <a href="#fnref:hippo" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:size" role="doc-endnote">
      <p>It is technically possible but quite silly to use a value smaller than \(\mathtt{D}\). Remember that the state itself is only \(\mathtt{N}\) dimensional, i.e. \(\mathbf{h}_t \in \mathbb{R}^\mathtt{N}\). It would be very challenging for a model to compress multiple \(\mathbb{R}^{\mathtt{D}}\) inputs into a single \(\mathbb{R}^{\mathtt{N}}\) state if \(\mathtt{N} &lt; \mathtt{D}\). <a href="#fnref:size" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:work-efficient" role="doc-endnote">
      <p>There are other parallel scan implementations, e.g. the <a href="https://en.wikipedia.org/wiki/Prefix_sum#Algorithm_1:_Shorter_span,_more_parallel">Hillis-Steele scan</a>. In a nutshell, the Blelloch scan is more “work-efficient” yet less “step-efficient”. At a high level, when the amount of work to do exceeds the amount of available parallelism, the Blelloch implementation is more efficient. This matches the typical situation in most ML problems, where even a lot of parallel compute is dwarfed by the scale of data/input demands. <a href="#fnref:work-efficient" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:kernel" role="doc-endnote">
      <p>This is where the GPU <em>kernel</em> comes in, as in “kernel fusion”, a scary name for a simple thing. Also, is there a word in STEM more overloaded than kernel? Kernel trick, GPU kernel, Linux kernel, convolution kernel, etc. Enough to warrant a fairly long <a href="https://en.wikipedia.org/wiki/Kernel">Kernel (disambiguation)</a> Wikipedia article. Maybe <em>entropy</em> or <em>set</em> are more overloaded… <a href="#fnref:kernel" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:kvcache" role="doc-endnote">
      <p>I mean, to an extent; the KV-cache is still a thing. <a href="#fnref:kvcache" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:recent" role="doc-endnote">
      <p>Guy Blelloch published his work in 1988 and 1990, building upon work by Hillis and Steele from 1986. In doing so, he explicitly highlighted the capability of the parallel scan to efficiently compute certain first-order (and higher-order) recurrences.<br /><br />To my knowledge, Eric Martin and Chris Cundy were the first to publish work connecting the Blelloch parallel scan algorithm to linear RNNs in early 2018. The publication was successful by usual metrics, appearing in ICLR and garnering 43 citations as of early 2024, but it was perhaps overshadowed at the time by the waning popularity of recurrent architectures following the then-recent introduction of the transformer architecture. <a href="#fnref:recent" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:form" role="doc-endnote">
      <p>Note that this is slightly different from what you will find in the OG Blelloch paper, since the first-order recurrence referred to in that paper swaps the position of \(h_t\) and \(a_t\). All following proofs have been modified accordingly. Rest assured, they are the same in spirit, with only some very minor algebra changes. <a href="#fnref:form" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:implementation" role="doc-endnote">
      <p>I’m a bit curious about whether the authors considered using JAX, its <code class="language-plaintext highlighter-rouge">lax</code> module and corresponding <code class="language-plaintext highlighter-rouge">associative_scan</code> implementation with <a href="https://jax.readthedocs.io/en/latest/pallas/index.html">Pallas</a> whenever necessary. I suppose the implementation of the Blelloch scan is not actually <a href="https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda#:~:text=39.2%20Implementation">that many lines of code</a>, so perhaps it just comes down to preference. <a href="#fnref:implementation" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:hardware-aware" role="doc-endnote">
      <p>I.e., do as much as possible in SRAM, since moving data between SRAM and HBM is slow. <a href="#fnref:hardware-aware" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>James Chen</name></author><category term="jekyll" /><category term="update" /><summary type="html"><![CDATA[In this post, I attempt to provide a walkthrough of the essence of the Mamba state space model architecture, occasionally sacrificing some rigor for intuition and overall pedagogical friendliness.]]></summary></entry><entry><title type="html">Nano Perceiver</title><link href="https://jmschndev.github.io/jekyll/update/2023/08/13/nano-perceiver.html" rel="alternate" type="text/html" title="Nano Perceiver" /><published>2023-08-13T19:50:08+00:00</published><updated>2023-08-13T19:50:08+00:00</updated><id>https://jmschndev.github.io/jekyll/update/2023/08/13/nano-perceiver</id><content type="html" xml:base="https://jmschndev.github.io/jekyll/update/2023/08/13/nano-perceiver.html"><![CDATA[<h2 id="tldr">Tl;dr</h2>

<p>The Perceiver family of models from DeepMind decouples context length from memory and compute requirements. Perceiver AR extends this with support for autoregressive generation. It also has a refreshingly simple implementation, since at its core it is just a small variation on top of an otherwise standard decoder-only transformer.</p>

<p>I’ve provided a lightweight implementation <a href="https://github.com/jmschndev/nano-perceiver/tree/main">here</a> and provide additional context in this post.</p>

<h2 id="background">Background</h2>

<h3 id="notation">Notation</h3>

<p>Let’s set some consistent notation:</p>

<ul>
  <li>For inputs, consider
    <ul>
      <li>Index (i.e., first) dimensionality as \(M\), also known as context size</li>
      <li>Channel (i.e., second) dimensionality as \(C\)</li>
    </ul>
  </li>
  <li>For a transformer model, consider:
    <ul>
      <li>A model with \(L\) transformer “blocks”, each with
        <ul>
          <li>an attention layer</li>
          <li>a relatively shallow MLP</li>
        </ul>
      </li>
    </ul>
  </li>
</ul>

<p>For example, these inputs could be</p>

<ul>
  <li>\(M\) token embeddings, each with an embedding size of \(C\)</li>
  <li>\(M\) raw pixels from a color image, with \(C = 3\) for RGB.</li>
</ul>

<h3 id="transformers-and-scale">Transformers and scale</h3>

<p>Transformers are increasingly powerful and expressive with increasing training data and parameter count, as demonstrated by large transformers like GPT-4 and PaLM. However, transformers scale poorly w.r.t. both compute and memory as context size increases.</p>

<h4 id="self-attention-is-quadratic">Self-attention is quadratic</h4>

<p>The most important operation in a transformer model is the <em>attention</em> operation, hence the seminal paper’s title being <a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a>.</p>

<p>If you’re unfamiliar with attention, then there are an inordinate number of intuitive explanations out there, but here’s my abbreviated take that highlights the scaling: at its core, self-attention tries to determine, for each input token, how it relates to every other input token. This is basically a doubly nested for-loop:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">self_attention</span><span class="p">(</span><span class="n">some_inputs</span><span class="p">):</span>
    <span class="n">scores</span> <span class="o">=</span> <span class="p">[</span>
        <span class="p">[</span><span class="mi">0</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">some_inputs</span><span class="p">))]</span>
        <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">some_inputs</span><span class="p">))</span>
    <span class="p">]</span>
    <span class="k">for</span> <span class="n">r</span><span class="p">,</span> <span class="n">query_tok</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">some_inputs</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">c</span><span class="p">,</span> <span class="n">key_tok</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">some_inputs</span><span class="p">):</span>
            <span class="n">score_qk</span> <span class="o">=</span> <span class="n">relation</span><span class="p">(</span><span class="n">get_query_vector</span><span class="p">(</span><span class="n">query_tok</span><span class="p">),</span> <span class="n">get_key_vector</span><span class="p">(</span><span class="n">key_tok</span><span class="p">))</span>
            <span class="n">scores</span><span class="p">[</span><span class="n">r</span><span class="p">][</span><span class="n">c</span><span class="p">]</span> <span class="o">=</span> <span class="n">score_qk</span>
    <span class="p">...</span>  <span class="c1"># Normalize, combine with value vectors, etc
</span></code></pre></div></div>

<p>This is obviously quadratic.</p>

<p>If you’re familiar with basic matrix math, you’ll observe that you can rewrite this more compactly with some matrix operations. I.e. given \(Q, K, V \in \mathbb{R}^{M \times C}\), the attention operation is:</p>

\[\text{softmax} \left( \frac { QK^T }{\sqrt C} \right) V\]

<p>Here you can see that attention scales quadratically via observing that \(QK^T \in \mathbb{R}^{M \times M}\).</p>
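<p>For concreteness, the loop above collapses into a few vectorized lines. This is a toy sketch of my own (made-up sizes, single head, no masking), written with NumPy rather than PyTorch purely to keep it self-contained:</p>

```python
import numpy as np

M, C = 8, 4  # toy context and channel sizes
rng = np.random.default_rng(0)
Q = rng.standard_normal((M, C))
K = rng.standard_normal((M, C))
V = rng.standard_normal((M, C))

scores = Q @ K.T / np.sqrt(C)  # shape (M, M): the quadratic-in-M part
# Row-wise softmax (subtracting the row max for numerical stability)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ V              # shape (M, C)
```

<p>The \((M, M)\) <code class="language-plaintext highlighter-rouge">scores</code> matrix is where both the quadratic compute and the quadratic memory come from.</p>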

<h4 id="but-lets-not-forget-about-linear-scaling-either">But let’s not forget about “linear” scaling either</h4>

<p>A standard transformer has \(L\) blocks, which means both the self-attention and the subsequent MLP are run \(L\) times, bringing overall complexity to \(O(M^2L)\) and \(O(ML)\) for the attention and MLP operations respectively.</p>

<p>Suppose by some miracle we are able to reduce self-attention’s complexity down to \(O(M)\), perhaps via some <a href="https://arxiv.org/abs/2004.05150">clever approximations</a>. Throwing the math under the rug of Big-O notation, this would imply that a transformer model scales with \(O(ML)\). This is of course better than \(O(M^2L)\), but for large M (e.g. long context models), this is still rather poor scaling.</p>

<p>This incentivizes model designers to come up with clever and bespoke ways of reducing context length (e.g. the <a href="https://arxiv.org/abs/2010.11929">ViT</a>, which famously converts an image into patches first). These are effective, but it’s not always clear how to handle other domains, and it’s tremendously easy to get input sizes well into the hundreds of thousands, if not larger. For example, 10 seconds of time-domain audio (\(10s \times 44.1 \text{kHz} = 441 \text{ thousand}\)) or 10 seconds of low resolution (e.g. 240p) grayscale video without any audio (\(\text{240 px high} \times \text{362 px wide} \times 24 \text{Hz} \times 10s \approx 20 \text{ million}\)).</p>

<p>Therefore, there’s a desire to not only handle the quadratic nature of self-attention, but also to further decouple context length from computation (past processing the inputs, of course).</p>

<h2 id="perceiver-architecture">Perceiver architecture</h2>

<p>The crux of the Perceiver architecture is to use <strong>cross</strong>-attention for the first block. The only difference between cross-attention and self-attention is that in cross-attention, the query vectors attend to key and value vectors that may come from a different underlying input<sup id="fnref:shape" role="doc-noteref"><a href="#fn:shape" class="footnote" rel="footnote">1</a></sup>:</p>

\[Q \in \mathbb{R} ^ {N \times C}; \; K, V \in \mathbb{R} ^{M \times C}\]

<p>Consider \(N\) to be a hyperparameter, selected such that \(N \ll M\). Now \(QK^T \in \mathbb{R}^{N \times M}\); as in, the first attention layer is now an \(O(NM)\) operation! And, since the output latents are of length \(N\), each subsequent self-attention is \(O(N^2)\), and each MLP layer is also only \(O(N)\) now.</p>
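<p>Here is a shape-only NumPy sketch of that first cross-attention block, with toy sizes of my own choosing; the \(N\) queries are random stand-ins for however the model actually produces them:</p>

```python
import numpy as np

M, N, C = 1024, 64, 32  # hypothetical context, latent, and channel sizes; N << M
rng = np.random.default_rng(0)
Q = rng.standard_normal((N, C))  # stand-in queries
K = rng.standard_normal((M, C))  # keys come from the full input
V = rng.standard_normal((M, C))  # values come from the full input

scores = Q @ K.T / np.sqrt(C)    # shape (N, M): O(N*M), not O(M^2)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)  # row-wise softmax
latents = weights @ V            # shape (N, C)
```

<p>Every later block then runs self-attention over the \((N, C)\) latents, so only this first layer ever touches all \(M\) inputs.</p>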

<h3 id="getting-the-queries">Getting the queries</h3>

<p>This is all well and good, but it depends on our ability to actually get reasonable queries.</p>

<p>For the initial query, both the original Perceiver and Perceiver IO used a fixed latent, analogous to the initial hidden state in an RNN. Perceiver AR uses an arguably simpler approach, where the initial query just comes from the last \(N\) inputs. This way, you can still apply causal masking for autoregressive generation!</p>

<p><img src="/assets/images/perceiver_ar.png" alt="Perceiver AR diagram" /></p>

<p>Since \(N\) is now a hyperparameter choice, if one uses \(N=M\), then this is actually equivalent to a standard decoder-only transformer, which I find particularly elegant.</p>

<h3 id="just-show-me-the-code">Just show me the code</h3>

<p>Conceptually this makes sense, so what do we actually need to modify? As you can see from the above diagram, there’s not too much to change. You need to</p>

<ol>
  <li>Ensure your targets are the same size as your queries.</li>
  <li>Explicitly choose queries as the tail of your full context input.</li>
  <li>Ensure your “triangular” causal attention masking is shifted accordingly.</li>
</ol>

<h4 id="targets">Targets</h4>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">stack</span><span class="p">([</span><span class="n">data</span><span class="p">[</span><span class="n">i</span> <span class="p">:</span> <span class="n">i</span> <span class="o">+</span> <span class="n">block_size</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">ix</span><span class="p">])</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">stack</span><span class="p">(</span>
    <span class="p">[</span><span class="n">data</span><span class="p">[</span><span class="n">i</span> <span class="o">+</span> <span class="n">block_size</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">query_size</span> <span class="p">:</span> <span class="n">i</span> <span class="o">+</span> <span class="n">block_size</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">ix</span><span class="p">]</span>
<span class="p">)</span>
</code></pre></div></div>

<p>For a standard transformer, <code class="language-plaintext highlighter-rouge">query_size</code> is identical to <code class="language-plaintext highlighter-rouge">block_size</code>.</p>
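<p>A toy trace of that slicing, with hypothetical numbers standing in for real data (integers playing the role of token ids, so token \(i+1\) always follows token \(i\)):</p>

```python
data = list(range(10))          # toy "token" stream
block_size, query_size = 5, 2   # hypothetical sizes
i = 0                           # one sampled start index

x = data[i : i + block_size]                                    # [0, 1, 2, 3, 4]
y = data[i + block_size + 1 - query_size : i + block_size + 1]  # [4, 5]
# y holds next-token targets for only the last `query_size` positions
# of x: token 3 -> 4, token 4 -> 5.
```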

<h4 id="attention-module">Attention module</h4>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">inputs_q</span><span class="p">,</span> <span class="n">inputs_kv</span><span class="p">):</span>
    <span class="p">...</span>
    <span class="c1"># Causal masking.
</span>    <span class="n">mask</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tril</span><span class="p">(</span>
        <span class="n">torch</span><span class="p">.</span><span class="n">ones</span><span class="p">(</span><span class="n">q_time</span><span class="p">,</span> <span class="n">kv_time</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="n">attention</span><span class="p">.</span><span class="n">device</span><span class="p">),</span>
        <span class="n">diagonal</span><span class="o">=</span><span class="n">kv_time</span> <span class="o">-</span> <span class="n">q_time</span><span class="p">,</span>
    <span class="p">)</span>
    <span class="n">attention</span> <span class="o">=</span> <span class="n">attention</span><span class="p">.</span><span class="n">masked_fill</span><span class="p">(</span><span class="o">~</span><span class="n">mask</span><span class="p">.</span><span class="nb">bool</span><span class="p">(),</span> <span class="nb">float</span><span class="p">(</span><span class="s">"-inf"</span><span class="p">))</span>
</code></pre></div></div>

<p>For a standard transformer, the <code class="language-plaintext highlighter-rouge">diagonal</code> argument is just <code class="language-plaintext highlighter-rouge">0</code>, the default value.</p>
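<p>To see what that <code class="language-plaintext highlighter-rouge">diagonal</code> offset does, here is the mask for a toy case, with <code class="language-plaintext highlighter-rouge">np.tri</code> standing in for <code class="language-plaintext highlighter-rouge">torch.tril(torch.ones(...), diagonal=k)</code> just to keep the sketch dependency-free:</p>

```python
import numpy as np

q_time, kv_time = 3, 5  # toy sizes: 3 queries over 5 keys/values
# Entry (i, j) is 1 iff j <= i + (kv_time - q_time): query i, which sits
# at absolute position (kv_time - q_time) + i in the full context, may
# attend to keys up to and including its own position.
mask = np.tri(q_time, kv_time, k=kv_time - q_time)
# [[1. 1. 1. 0. 0.]
#  [1. 1. 1. 1. 0.]
#  [1. 1. 1. 1. 1.]]
```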

<h4 id="perceiver-block">Perceiver block</h4>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">):</span>
    <span class="n">inputs_q</span><span class="p">,</span> <span class="n">inputs_kv</span> <span class="o">=</span> <span class="n">x</span><span class="p">[:,</span> <span class="o">-</span><span class="bp">self</span><span class="p">.</span><span class="n">query_size</span> <span class="p">:,</span> <span class="p">:],</span> <span class="n">x</span>
    <span class="n">normed_q</span><span class="p">,</span> <span class="n">normed_kv</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">ln1</span><span class="p">(</span><span class="n">inputs_q</span><span class="p">),</span> <span class="bp">self</span><span class="p">.</span><span class="n">ln1</span><span class="p">(</span><span class="n">inputs_kv</span><span class="p">)</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">inputs_q</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">attn</span><span class="p">(</span><span class="n">inputs_q</span><span class="o">=</span><span class="n">normed_q</span><span class="p">,</span> <span class="n">inputs_kv</span><span class="o">=</span><span class="n">normed_kv</span><span class="p">)</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">mlp</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">ln2</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">x</span>
</code></pre></div></div>

<p>For a standard transformer, the main <code class="language-plaintext highlighter-rouge">Block</code>’s <code class="language-plaintext highlighter-rouge">forward</code> function only passes the input to attention, since it’s doing <strong>self</strong>-attention, but since we’re now doing <strong>cross</strong>-attention on inputs, we need to handle <code class="language-plaintext highlighter-rouge">inputs_q</code> and <code class="language-plaintext highlighter-rouge">inputs_kv</code> separately.</p>

<p>The attention operations in the middle of the transformer are self-attentions over the \(\mathbb{R} ^ {N \times C}\) latent space, so for these layers <code class="language-plaintext highlighter-rouge">inputs_q == inputs_kv</code>, since taking the last <code class="language-plaintext highlighter-rouge">self.query_size</code> elements of the latent just returns the full latent anyway.</p>

<h4 id="is-that-all">Is that all?</h4>

<p>And… that’s about it. Feel free to see the full repo <a href="https://github.com/jmschndev/nano-perceiver/tree/main">here</a>.<sup id="fnref:assumptions" role="doc-noteref"><a href="#fn:assumptions" class="footnote" rel="footnote">2</a></sup> The repo takes inspiration (and code) from Karpathy’s <a href="https://github.com/karpathy/ng-video-lecture">NanoGPT repo</a>, and reuses the core logic around a simple training script over a single plain text file Shakespeare “corpus”. I encourage you to mess around with some of the parameters in the <code class="language-plaintext highlighter-rouge">train.py</code> script. I was able to train surprisingly long-context models, with the caveat that I was running locally on a 2019 MacBook, but of course you’re free to use a proper accelerator of your choice.</p>

<p>If you want to see a more fully fleshed out implementation, then you’ll be pleased to know that DeepMind has actually open sourced a repo <a href="https://github.com/google-research/perceiver-ar">here</a>. It’s a research codebase so it definitely was not designed for pedagogical friendliness, but it is supremely flexible<sup id="fnref:jax" role="doc-noteref"><a href="#fn:jax" class="footnote" rel="footnote">3</a></sup>.</p>

<h2 id="notes">Notes</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:shape" role="doc-endnote">
      <p>The original Perceiver paper goes further and uses \(Q \in \mathbb{R} ^ {N \times D}\), which requires further projections. Fun fact, the <a href="https://github.com/deepmind/deepmind-research/blob/f5de0ede8430809180254ee957abf36ed62579ef/perceiver/perceiver.py#L81">original implementation</a> refers to this projection as <code class="language-plaintext highlighter-rouge">conv_1d</code> even though it’s just a <code class="language-plaintext highlighter-rouge">Linear</code>. I’m not exactly sure why they’d call it this; perhaps due to other patterns in computer vision where a 1x1 convolution has been used to reduce channel size? <a href="#fnref:shape" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:assumptions" role="doc-endnote">
      <p>This repo doesn’t experiment with any more sophisticated optimizers, nonlinearities, or further-optimized attention chunking/computation. This repo also assumes you are not meddling with channel dimensions, e.g. not projecting into larger or smaller channel sizes. <a href="#fnref:assumptions" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:jax" role="doc-endnote">
      <p>It’s also implemented in JAX, which is a pro or con depending on whether or not you’re a Googler (jk, kind of 🙂). <a href="#fnref:jax" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>James Chen</name></author><category term="jekyll" /><category term="update" /><summary type="html"><![CDATA[Tl;dr]]></summary></entry><entry><title type="html">Filling in the middle for great good</title><link href="https://jmschndev.github.io/jekyll/update/2023/07/05/fim.html" rel="alternate" type="text/html" title="Filling in the middle for great good" /><published>2023-07-05T19:50:08+00:00</published><updated>2023-07-05T19:50:08+00:00</updated><id>https://jmschndev.github.io/jekyll/update/2023/07/05/fim</id><content type="html" xml:base="https://jmschndev.github.io/jekyll/update/2023/07/05/fim.html"><![CDATA[<h2 id="tldr">Tl;dr</h2>

<p>My favorite research findings are ones that make me go “Why didn’t I think of that?” due to a paradoxical combination of simplicity and cleverness. The OpenAI paper “<a href="https://arxiv.org/abs/2207.14255">Efficient Training of Language Models to Fill in the Middle</a>” (FIM) is the most recent thing I’ve read that makes me feel this way.</p>

<h2 id="language-modeling-101">Language modeling 101</h2>

<p>For the uninitiated, modern language model (pre)training involves a very small number of steps:</p>

<ol>
  <li>Download the internet.</li>
  <li>Choose a model architecture.</li>
  <li>Given an imperfect view of some tokens, predict a perfect view.</li>
</ol>

<p>I’m being a bit facetious here because each step is complex, and also because this doesn’t discuss things like fine-tuning, inference, etc… But for <em>pretraining</em>, this is a reasonable approximation.</p>

<p>Some research focuses on step 1, e.g. how to construct the most useful training dataset, given the expected tradeoffs between quality and quantity. An enormous amount of research focuses on step 2, e.g. how to architect your model to improve its performance in any number of ways (memory efficiency, modeling distant relationships, etc). And some research focuses on step 3, e.g. how to set up your task. BART has a denoising approach (input text is corrupted), decoder-only transformers tend to do next token prediction (i.e., all tokens are visible except the last one), etc.</p>

<p><a href="https://arxiv.org/abs/2207.14255">Efficient Training of Language Models to Fill in the Middle</a> tackles step 3. It maintains the standard “next token” autoregressive task of most decoder-only transformers, but with the simple twist that some fraction of those tokens have their order modified.</p>

<h2 id="fim-in-a-nutshell">FIM in a nutshell</h2>

<p>For simplicity, we’ll assume that we’re doing whitespace/punctuation tokenization.</p>
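<p>For grounding, here’s a toy version of such a tokenizer. This is purely illustrative (real models use subword tokenizers like BPE), and <code class="language-plaintext highlighter-rouge">simple_tokenize</code> is a name of my own invention:</p>

```python
import re

def simple_tokenize(text):
    # Split into word tokens and single punctuation characters.
    return re.findall(r"\w+|[^\w\s]", text)

simple_tokenize("What I cannot create, I do not understand")
# → ['What', 'I', 'cannot', 'create', ',', 'I', 'do', 'not', 'understand']
```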

<p>Now consider the sentence “What I cannot create, I do not understand”. Under the typical autoregressive training setup, your input and target would look something like</p>

<pre><code class="language-txt">Input: ["&lt;bos&gt;", "What", "I", "cannot", "create", ",", "I", "do", "not", "understand"]
Target: ["What", "I", "cannot", "create", ",", "I", "do", "not", "understand", "&lt;eos&gt;"]
</code></pre>

<p>Where <code class="language-plaintext highlighter-rouge">&lt;bos&gt;</code> and <code class="language-plaintext highlighter-rouge">&lt;eos&gt;</code> are special tokens representing the beginning and end of an input sequence. For FIM, the model maintains the same mechanics around training and optimization, but the data is transformed a bit: it’s chunked into a prefix, middle, and suffix, represented with corresponding special tokens:</p>

<pre><code class="language-txt">Input: ["&lt;pre&gt;", "What", "I", "&lt;suf&gt;", "do", "not", "understand", "&lt;mid&gt;", "cannot", "create", ",", "I"]
Target: ["What", "I", "&lt;suf&gt;", "do", "not", "understand", "&lt;mid&gt;", "cannot", "create", ",", "I", "&lt;eot&gt;"]
</code></pre>

<p>This way, during inference time, you can prompt a model to fill in the middle very naturally by just providing context up to <code class="language-plaintext highlighter-rouge">&lt;mid&gt;</code>.  And… that’s all folks.</p>
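<p>To make the transformation concrete, here’s a minimal sketch of that rearrangement in Python. The sentinel strings and the <code class="language-plaintext highlighter-rouge">fim_transform</code> name are stand-ins for the dedicated special tokens a real tokenizer would use, and the paper has more to say about how the cut points should be chosen:</p>

```python
import random

def fim_transform(tokens, rng=None):
    """Rearrange a token list into prefix/suffix/middle order.

    The "<pre>"/"<suf>"/"<mid>"/"<eot>" strings stand in for real
    sentinel tokens; this is an illustrative sketch, not the paper's code.
    """
    rng = rng or random.Random(0)
    # Pick two distinct cut points at random to define the middle span.
    i, j = sorted(rng.sample(range(len(tokens) + 1), 2))
    prefix, middle, suffix = tokens[:i], tokens[i:j], tokens[j:]
    return ["<pre>", *prefix, "<suf>", *suffix, "<mid>", *middle, "<eot>"]
```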

<p>Well, not exactly. There are other details and context which, of course, you can find in the paper. But the crux of the paper really is that simple!</p>

<h2 id="putting-my-pm-hat-on">Putting my PM hat on</h2>

<p>Chatbots are a natural extension of autoregressive models, and relatively convenient for the engineers implementing them. But not every product or task is most naturally represented as an extension of “predict the most probable next word”. Compared to an off-the-shelf autoregressively pretrained LM, filling in the middle seems like a more intuitive way to generate text for certain types of products, e.g.:<sup id="fnref:difficulty" role="doc-noteref"><a href="#fn:difficulty" class="footnote" rel="footnote">1</a></sup></p>

<ul>
  <li>A docstring, given a function definition and a function body.</li>
  <li>A blog article’s content, given its title and a conclusion paragraph.</li>
  <li>A road trip itinerary, given that it starts in California and ends in New York City.</li>
</ul>

<p>Admittedly, there is a grab bag of overlapping techniques to address the problem of alignment. That is, there is ongoing research into turning something that predicts the most <em>probable</em> next token into something that is <em>useful</em>. These techniques may include standard supervised fine-tuning, instruction tuning, RLHF<sup id="fnref:rlhf" role="doc-noteref"><a href="#fn:rlhf" class="footnote" rel="footnote">2</a></sup>, or simply prompt engineering.<sup id="fnref:prompt" role="doc-noteref"><a href="#fn:prompt" class="footnote" rel="footnote">3</a></sup></p>

<p>But there is no reason these techniques wouldn’t work with a FIM pretrained model. To top it off, FIM also seems to grant these new middle-filling capabilities without harming overall model performance, making it just about the closest thing I’ve seen to a <a href="https://en.wikipedia.org/wiki/There_ain%27t_no_such_thing_as_a_free_lunch">free lunch</a> in a little while.</p>

<h2 id="notes">Notes</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:difficulty" role="doc-endnote">
      <p>To be fair, it seems that filling in the middle may be a fundamentally more challenging problem than left-to-right (i.e. typical) sampling, perhaps because generated text needs to flow naturally with both a prefix and suffix, as opposed to just a prefix. <a href="#fnref:difficulty" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:rlhf" role="doc-endnote">
      <p>RLHF comes with its own grab bag of challenges, such as detecting/preventing over-optimization of the reward model, the high VRAM requirements of actor-critic policy-gradient methods, and the overarching sample inefficiency of basically all RL techniques. <a href="#fnref:rlhf" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:prompt" role="doc-endnote">
      <p>“Prompt engineering” always felt like kind of a bizarre phrase to me. “Engineering” ideally implies some sort of systematic understanding and predictability, but prompt engineering seems more like an art at best and voodoo magic at worst. <a href="#fnref:prompt" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>James Chen</name></author><category term="jekyll" /><category term="update" /><summary type="html"><![CDATA[Tl;dr]]></summary></entry><entry><title type="html">Training a chatbot to talk like me</title><link href="https://jmschndev.github.io/jekyll/update/2023/06/26/lowering-the-floor.html" rel="alternate" type="text/html" title="Training a chatbot to talk like me" /><published>2023-06-26T19:50:08+00:00</published><updated>2023-06-26T19:50:08+00:00</updated><id>https://jmschndev.github.io/jekyll/update/2023/06/26/lowering-the-floor</id><content type="html" xml:base="https://jmschndev.github.io/jekyll/update/2023/06/26/lowering-the-floor.html"><![CDATA[<p>There has been much well-deserved attention paid towards the latest advances in machine learning these days. I feel like I see a new paper or model every week that promises the Earth, moon, and stars.</p>

<p>Perhaps it’s a new approach that will finally™ solve the problem of quadratic scaling of transformers w.r.t. context length, be it via <a href="https://arxiv.org/pdf/2305.07185.pdf">clever tweaks inspired by convolutions</a>, <a href="https://arxiv.org/pdf/2302.10866.pdf">literally using convolutions</a>, <a href="https://arxiv.org/pdf/2205.14135.pdf">more clever utilization of accelerators</a>, or various <a href="https://arxiv.org/pdf/2305.19370.pdf">memory</a> <a href="https://arxiv.org/pdf/2103.03206.pdf">bottlenecks</a>.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> Perhaps it’s any number of new models that have been fine-tuned by hobbyists, perhaps using leaked LLaMA weights or ChatGPT/ShareGPT data.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>

<p>But there is another thing that hasn’t gotten as much mainstream attention: just how <em>easy</em> it has become to experiment with some seriously advanced models, models that would quite recently have been state of the art and required non-trivial capital to train. Of course, researchers have been publishing models and code for a while now, but the current state of affairs with easy-to-use APIs, reasonably good documentation, and emphasis placed on community interaction and contribution? That feels rather new.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>

<p>As an example, I wanted to walk through a small language model that I trained for my own amusement.</p>

<h2 id="goal">Goal</h2>

<p>I wanted to train a model that could talk a bit more informally, perhaps a bit more like how I text with friends. It’s no secret that LLMs tend to output text that errs on the formal/verbose side. My suspicion is that this is due to some combination of</p>

<ul>
  <li>Instructions to human annotators when generating text data, e.g. for supervised fine tuning datasets</li>
  <li>Instructions to human annotators when ranking text data, i.e. generating reward-model data for RLHF by ranking which outputs are better aligned with user preferences.</li>
  <li>Preference for “high quality” text sources during training, e.g. Wikipedia, news articles, or well upvoted comments on Reddit.<sup id="fnref:reddit" role="doc-noteref"><a href="#fn:reddit" class="footnote" rel="footnote">4</a></sup></li>
</ul>

<p>My hope (<a href="https://openai.com/research/improving-language-model-behavior">inspired by this paper</a>) was that it would take a relatively minimal amount of fine tuning to get a language model to chat more like me. That is, talk a bit more informally, use different punctuation (e.g. newlines instead of periods), etc.</p>

<h2 id="training-data">Training data</h2>

<p>Getting training data for this was fairly simple. I’ve been using Facebook Messenger for a long time now, and Facebook provides a convenient way to <a href="https://www.facebook.com/dyi">download all of your data</a> as a bunch of JSONs. Then it’s just a bit of Python to parse the messages, yielding the ones where I’m responding. That generator then can be used directly to create a Dataset, via Hugging Face’s dataset API; specifically <a href="https://huggingface.co/docs/datasets/v2.13.1/en/package_reference/main_classes#datasets.Dataset.from_generator"><code class="language-plaintext highlighter-rouge">from_generator</code></a>.</p>

<p>To be precise, the training data was in the format of “prompts” and “responses”, where prompts were contiguous blocks of messages from anybody that wasn’t me, joined with a pipe char (<code class="language-plaintext highlighter-rouge">|</code>). Responses were of the same format, just comprising messages that I sent. I used a pipe char to avoid any preprocessing shenanigans regarding newlines that some tokenizers may perform (e.g., transforming newlines into spaces). I used a character that wouldn’t normally come up in texting, since substituting other punctuation (e.g. a period) can change perceived tone.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">5</a></sup></p>
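<p>That grouping-and-pairing step can be sketched roughly as follows. The export schema here (each message dict having <code class="language-plaintext highlighter-rouge">sender_name</code> and <code class="language-plaintext highlighter-rouge">content</code> keys) is from memory, so treat the field names as assumptions:</p>

```python
def to_pairs(messages, me="James Chen"):
    """Turn an ordered message list into prompt/response training pairs.

    Illustrative sketch; assumes each message dict has "sender_name"
    and "content" keys, as in Facebook's JSON export at the time.
    """
    # Group contiguous messages into runs by whether I sent them.
    runs = []
    for m in messages:
        mine = m["sender_name"] == me
        if runs and runs[-1][0] == mine:
            runs[-1][1].append(m["content"])
        else:
            runs.append((mine, [m["content"]]))
    # Pair each run from other people with my run that follows it,
    # joining messages with a pipe to preserve message boundaries.
    for (a_mine, a), (b_mine, b) in zip(runs, runs[1:]):
        if not a_mine and b_mine:
            yield {"prompt": "|".join(a), "response": "|".join(b)}
```

<p>A thin wrapper around a generator like this is what gets handed to <code class="language-plaintext highlighter-rouge">from_generator</code>.</p>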

<h2 id="model">Model</h2>

<p>On Hugging Face, there are many language models appropriate for conversational interaction. I chose to run things using a <a href="https://huggingface.co/facebook/blenderbot-400M-distill">400M-parameter distilled BlenderBot</a>, which uses a standard seq2seq (i.e. encoder-decoder) transformer. The <a href="https://arxiv.org/abs/2004.13637">paper</a> came out in 2020, which is comparatively old, but these models are convenient since they’ve already been fine-tuned on conversational prompts. In particular, the 400M-parameter model isn’t so large that one needs to start thinking about model parallelism yet, and it has the added benefit of knowledge distillation from the bigger versions. In other words, it ought to be fine for some quick, just-for-fun hacking.</p>

<h2 id="training">Training</h2>

<p>I used a bone-stock <a href="https://www.pytorchlightning.ai/index.html">PyTorch Lightning</a> training loop, which is almost a one-liner. Of course one can choose to implement their own checkpointing, looping over epochs, etc., but why bother reinventing the wheel, especially for a one-off just-for-fun training run?</p>

<h3 id="compute">Compute</h3>

<p>As far as compute goes, my 2019-era MacBook is woefully underpowered, but Colab Pro is cheap and good enough here. An instance with 50GB of RAM and a GPU with 16GB of VRAM, albeit an old one, was plenty for my purposes and cost roughly $0.20/hr.<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">6</a></sup></p>
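<p>For the curious, the back-of-the-envelope arithmetic behind that figure, using the credit numbers from the footnote:</p>

```python
# 100 compute credits cost $10, and a high-RAM T4 instance ran
# around 2 credits per hour (rough; Colab doesn't publish exact rates).
dollars_per_credit = 10 / 100
credits_per_hour = 2.0
cost_per_hour = dollars_per_credit * credits_per_hour  # ≈ $0.20/hr
```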

<h3 id="parameter-efficiency">Parameter efficiency</h3>

<p>I didn’t want to fine-tune the entire model, since that would’ve taken a while with the admittedly slower NVIDIA T4 that I was using. However, the <a href="https://huggingface.co/blog/peft">peft</a> library makes it surprisingly easy to leverage SOTA fine-tuning methods. For me, using <a href="https://arxiv.org/abs/2303.10512">AdaLoRA</a>, an improvement on low-rank adaptation that was published only in March, was three operations: an import, a config initialization, and then an assignment.</p>

<p>One quirk of the library is that it has a hard-coded mapping from transformer model architectures (as strings!) to the modules that should actually be adapted via AdaLoRA; see the source <a href="https://github.com/huggingface/peft/blob/86290e9660d24ef0d0cedcf57710da249dd1f2f4/src/peft/utils/other.py#L246C1-L246C54">here</a>.<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">7</a></sup> These mappings don’t work out of the box for BlenderBot, but you can just inspect the module names<sup id="fnref:tin" role="doc-noteref"><a href="#fn:tin" class="footnote" rel="footnote">8</a></sup> and then it’s no problem:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Imitator</span><span class="p">(</span><span class="n">pl</span><span class="p">.</span><span class="n">LightningModule</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="n">pretrained_model</span> <span class="o">=</span> <span class="n">BlenderbotForConditionalGeneration</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"facebook/blenderbot-400M-distill"</span><span class="p">)</span>
        <span class="n">peft_config</span> <span class="o">=</span> <span class="n">AdaLoraConfig</span><span class="p">(</span>
            <span class="n">peft_type</span><span class="o">=</span><span class="s">"ADALORA"</span><span class="p">,</span>
            <span class="n">task_type</span><span class="o">=</span><span class="n">TaskType</span><span class="p">.</span><span class="n">SEQ_2_SEQ_LM</span><span class="p">,</span>
            <span class="n">inference_mode</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
            <span class="n">r</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span>
            <span class="n">lora_alpha</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>
            <span class="n">target_modules</span><span class="o">=</span><span class="p">[</span><span class="s">"q_proj"</span><span class="p">,</span> <span class="s">"k_proj"</span><span class="p">,</span> <span class="s">"v_proj"</span><span class="p">,</span> <span class="s">"out_proj"</span><span class="p">,</span> <span class="s">"fc1"</span><span class="p">,</span> <span class="s">"fc2"</span><span class="p">],</span>
            <span class="n">lora_dropout</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span>
        <span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">model</span> <span class="o">=</span> <span class="n">get_peft_model</span><span class="p">(</span><span class="n">pretrained_model</span><span class="p">,</span> <span class="n">peft_config</span><span class="p">)</span>
<span class="p">...</span>
</code></pre></div></div>

<h2 id="amusing-non-scientific-results">Amusing non-scientific results</h2>

<h3 id="example-1">Example 1</h3>

<h4 id="prompt">Prompt</h4>

<pre><code class="language-txt">&gt; dude i need your advice
&gt; she said she wanted to "take a break"
</code></pre>

<h4 id="jamesbot-response">JamesBot response</h4>

<pre><code class="language-txt">&gt; LOL
&gt; i'm not sure what to say to that
&gt; i don't really know what advice to give
</code></pre>

<p>Alright, perhaps a bit unempathetic and not a very specific response, but it definitely talks more like me. In particular, the model’s learned that I tend to split thoughts into distinct messages, often without ending punctuation.</p>

<h3 id="example-2">Example 2</h3>

<h4 id="prompt-1">Prompt</h4>

<pre><code class="language-txt">&gt; dude
&gt; gels are such a game changer
&gt; felt amazing at mile 15 today
</code></pre>

<h4 id="jamesbot-response-1">JamesBot response</h4>

<pre><code class="language-txt">&gt; oh wow
&gt; that's a lot of running
&gt; i'm impressed
&gt; i can't run a mile
</code></pre>

<p>This isn’t perfectly accurate since I can indeed run a mile and also advocate for gels<sup id="fnref:gels" role="doc-noteref"><a href="#fn:gels" class="footnote" rel="footnote">9</a></sup> on runs longer than 90 minutes, but this is plausibly something I would’ve said before getting into running.</p>

<h2 id="the-punchline">The punchline</h2>

<p>The punchline isn’t that I took a pre-trained model and then fine-tuned it on another dataset; after all, that’s been done before.</p>

<p>The punchline is that, not including the script I used to parse my Facebook messages, this was only ~50 lines of code. Code that, by my assessment, is rather explicit and not reminiscent of code golf.</p>

<p>It’s incredible that the open source ecosystem has advanced to the stage where you can experiment with very modern techniques (transformers, parameter-efficient fine-tuning, etc.) in just O(dozens) of LoC.</p>

<p>This lowers the barrier of entry for not only hobbyists and enthusiasts, but also for professionals who have requirements that aren’t met by the existing ecosystem of model inference APIs, or simply prefer driving stick.</p>

<h2 id="notes">Notes</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>What’s even more fascinating than this research outright, is just how much other research continues to be done on top of bone stock transformers, even pretty hype research. E.g., DeepMind’s <a href="https://www.deepmind.com/publications/a-generalist-agent">Gato</a> was trained over a decoder-only transformer “for simplicity and scalability”. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>What’s also interesting here is that both of these seem to be in a legal gray area: LLaMA weights were leaked, and OpenAI’s terms of service prohibit its products from being used to train competitor models. Despite that, both are incredibly popular approaches. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>This is perhaps biased by my experience, which has been mostly with heavy duty infrastructure that originated from within Alphabet. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:reddit" role="doc-endnote">
      <p>Some may not consider Reddit comments to be “high quality”, but it’s important to compare them to internet text <em>en masse</em>. Seriously, take a look at some examples from these canonical large web scrapes. There’s an amusing amount of SEO spam and websites that just aren’t even parsed correctly, e.g. “This website requires JavaScript …”. <a href="#fnref:reddit" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>Here’s a <a href="https://www.nytimes.com/2021/06/29/crosswords/texting-punctuation-period.html">fun article</a> from the New York Times discussing this in more detail. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p>This is an estimate, since the Colab pricing model depends on “compute credits” per hour, and it’s not entirely clear how those rates are calculated. Regardless, you can get 100 credits per month for $10, and high-ram instances with an NVIDIA T4 were consistently around 2.xx credits per hour. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:6" role="doc-endnote">
      <p>Quite a few of these are also commented out, for reasons that aren’t super clear to me. Perhaps it’s the library maintainers being strict about tests? <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:tin" role="doc-endnote">
      <p>This is easy, via <a href="https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.modules"><code class="language-plaintext highlighter-rouge">nn.Module.modules()</code></a>. It’s always nice when something does exactly what it says on the tin. <a href="#fnref:tin" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:gels" role="doc-endnote">
      <p>For the uninitiated, energy gels are portable and easy-to-digest carbs for endurance athletes. <a href="#fnref:gels" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>James Chen</name></author><category term="jekyll" /><category term="update" /><summary type="html"><![CDATA[There has been much well-deserved attention paid towards the latest advances in machine learning these days. I feel like I see a new paper or model every week that promises the Earth, moon, and stars.]]></summary></entry></feed>