What is RWKV?
RWKV is a neural network architecture that combines the strengths of recurrent neural networks (RNNs) and Transformers. RWKV was first introduced in the paper "RWKV: Reinventing RNNs for the Transformer Era" in 2023 and has since attracted attention for its efficiency and scalability in large language models. RWKV is used in text generation, audio generation, and other sequence modeling tasks, demonstrating versatility across various domains.
Like Transformer-based models, RWKV can process sequences and capture long-range dependencies, but it does so with a different architecture that avoids the quadratic complexity of self-attention. This makes RWKV suitable for efficient inference and training on long sequences.
RWKV-4 (small) is a representative example of a text-generative RWKV model. RWKV Explainer is powered by the RWKV-4 (small) model, which has 169 million parameters. While it is not the largest RWKV model, it contains all the essential architectural components and mechanisms found in state-of-the-art RWKV models, making it an ideal starting point for understanding the basics of this architecture.
RWKV Architecture
Every text-generative RWKV model consists of these three key components:
- Embedding: Text input is divided into smaller units called tokens, which can be words or subwords. These tokens are converted into numerical vectors called embeddings, which capture the semantic meaning of words.
- RWKV Block: The fundamental building block of the model that processes and transforms the input data. Each block includes:
- Timemix Mechanism: RWKV uses a time-mixing mechanism to capture dependencies between tokens, serving a similar purpose to attention in Transformers.
- ChannelMix Mechanism: A feed-forward network that operates on each token independently, refining each token's representation.
- Output Probabilities: The final linear and softmax layers transform the processed embeddings into probabilities, enabling the model to make predictions about the next token in a sequence.
Embedding
Let's say you want to generate text using a RWKV model, starting from the prompt “Data visualization empowers users to”. This input needs to be converted into a format that the model can understand and process. That is where embedding comes in: it transforms the text into a numerical representation that the model can work with. To convert a prompt into an embedding, we first tokenize the input and then obtain a vector representation for each token. Let’s see how each of these steps is done.

Step 1: Tokenization
Tokenization is the process of breaking down the input text into smaller, more manageable
pieces called tokens. Each token can be a word or a subword. The words "Data"
and "visualization" correspond to unique tokens, while the word
"empowers"
is split into two tokens. The full vocabulary of tokens is decided before training the model:
RWKV-4's vocabulary has 50,277 unique tokens. Now that we have split our input text into tokens with distinct IDs, we can obtain their vector representations from the embedding matrix.
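To make this step concrete, here is a minimal tokenization sketch. It assumes the Hugging Face tokenizers package and the GPT-NeoX-style vocabulary file distributed with RWKV-4; the file name below is an assumption, not something defined by this page.

```python
# A minimal tokenization sketch, assuming the Hugging Face `tokenizers` package
# and the RWKV-4 vocabulary file (file name assumed here).
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("20B_tokenizer.json")  # ~50,277-token vocabulary

encoding = tokenizer.encode("Data visualization empowers users to")
print(encoding.tokens)  # subword pieces, e.g. "empowers" may split into two tokens
print(encoding.ids)     # integer IDs used to index the embedding matrix
```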
Step 2: Token Embedding
RWKV-4 (small) represents each token in the vocabulary as a 768-dimensional vector; the
dimension of the vector depends on the model. These embedding vectors are stored in a matrix
of shape (50,277, 768), containing approximately 39 million parameters! This
extensive matrix allows the model to assign semantic meaning to each token.
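As a rough sketch with NumPy (random numbers stand in for the learned weights), the embedding lookup amounts to simple row indexing:

```python
import numpy as np

vocab_size, d_model = 50_277, 768
emb_matrix = np.random.randn(vocab_size, d_model).astype(np.float32)  # learned in practice

token_ids = [1234, 5678, 910]   # IDs produced by the tokenizer
x = emb_matrix[token_ids]       # shape: (3, 768), one vector per token
print(x.shape)
```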
RWKV Block
The core of RWKV's processing lies in the RWKV block, which consists of two main mechanisms: Timemix and ChannelMix. Unlike Transformers that use multi-head self-attention, RWKV uses the Timemix mechanism to capture temporal dependencies between tokens efficiently, and ChannelMix as a feed-forward network to refine token representations. Most RWKV models stack multiple such blocks sequentially, allowing token representations to evolve through layers and enabling the model to build up a deep understanding of each token in context. For example, RWKV-4 (small) consists of 12 such blocks.
Timemix: Temporal Mixing Mechanism
In RWKV, the Timemix mechanism replaces the self-attention of Transformers. Timemix enables the model to capture dependencies between tokens across time steps, but does so with a recurrent and efficient computation. Instead of computing attention scores between all pairs of tokens, Timemix mixes the current token's representation with a weighted sum of previous tokens, allowing the model to efficiently model long-range dependencies without quadratic complexity.
Step 1: Receptance, Key, and Value Matrices
Each token's embedding vector is transformed into three vectors: Receptance (R), Key (K), and Value (V). These vectors are derived by multiplying the input embedding matrix with learned weight matrices for R, K, and V. Here's an analogy to help build intuition:
- Receptance (R) determines how much the current token should accept information from the past. It acts as a gate, controlling the flow of historical context.
- Key (K) represents the information the current token offers to the future.
- Value (V) is the actual content or information carried by the current token.
Using these R, K, and V vectors, the model computes a weighted summary of the past, which determines how much focus each token should receive when generating predictions.
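Here is a simplified sketch of these projections with NumPy. In the released RWKV-4 code each input is also blended with the previous token's input ("token shift") before the projections; that detail is omitted here for brevity, and all weights below are random placeholders.

```python
import numpy as np

d = 768
W_r = np.random.randn(d, d).astype(np.float32) * 0.02  # learned in practice
W_k = np.random.randn(d, d).astype(np.float32) * 0.02
W_v = np.random.randn(d, d).astype(np.float32) * 0.02

x_t = np.random.randn(d).astype(np.float32)  # current token's (normalized) embedding
r_t = x_t @ W_r   # receptance: how much past information to accept
k_t = x_t @ W_k   # key: what this token offers
v_t = x_t @ W_v   # value: the content this token carries
```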
Step 2: WKV Computation


The core of the Timemix mechanism is the WKV (Weighted Key-Value) computation, a form of time-weighted averaging. For each token t, it computes a value wkv_t by combining the token's own Key-Value pair (k_t, v_t) with an exponentially decayed sum of all previous Key-Value pairs (k_i, v_i for i < t).
A crucial element is the time-decay factor w, a learnable parameter. The term e^{-(t-1-i)w} ensures that the influence of tokens from the distant past decays exponentially, allowing the model to prioritize more recent information while still retaining long-term memory. This recurrent formulation avoids the quadratic complexity of standard attention, making it highly efficient for long sequences.
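A naive NumPy sketch of this recurrence is shown below. It follows the published RWKV-4 formulation, which also adds a learned per-channel bonus u to the current token's key (not discussed above), and it skips the numerical-stability tricks that real implementations use.

```python
import numpy as np

def wkv(keys, values, w, u):
    """Naive per-channel WKV: keys, values have shape (T, d); w, u have shape (d,).

    wkv_t = (sum_{i<t} e^{-(t-1-i)w + k_i} v_i + e^{u + k_t} v_t)
            / (sum_{i<t} e^{-(t-1-i)w + k_i} + e^{u + k_t})
    """
    T, d = keys.shape
    out = np.zeros((T, d), dtype=np.float32)
    for t in range(T):
        num = np.exp(u + keys[t]) * values[t]   # current token, boosted by bonus u
        den = np.exp(u + keys[t])
        for i in range(t):                      # exponentially decayed past tokens
            decay = np.exp(-(t - 1 - i) * w + keys[i])
            num += decay * values[i]
            den += decay
        out[t] = num / den
    return out

T, d = 5, 768
k = np.random.randn(T, d).astype(np.float32) * 0.1
v = np.random.randn(T, d).astype(np.float32)
w = np.abs(np.random.randn(d)).astype(np.float32) * 0.1   # time-decay, kept positive
u = np.random.randn(d).astype(np.float32) * 0.1
print(wkv(k, v, w, u).shape)   # (5, 768)
```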
Step 3: RWKV Computation
The computed wkv vector is then gated by the token's Receptance vector (R): R is passed through a sigmoid and multiplied element-wise with wkv. This allows the model to selectively decide which information from the time-mixed context is relevant for the current token.

Step 4: Output Projection
This gated result is then passed through a final linear projection layer (the Output matrix) to produce the final output of the Timemix block. This output now contains rich contextual information from previous tokens.
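Continuing the sketch, Steps 3 and 4 might look as follows. The sigmoid on the Receptance matches the published RWKV formulation; the weights are random placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d = 768
W_out = np.random.randn(d, d).astype(np.float32) * 0.02  # learned output projection

r_t = np.random.randn(d).astype(np.float32)     # receptance from Step 1
wkv_t = np.random.randn(d).astype(np.float32)   # time-mixed context from Step 2

gated = sigmoid(r_t) * wkv_t   # Step 3: gate the context with the receptance
timemix_out = gated @ W_out    # Step 4: project back into the model dimension
print(timemix_out.shape)       # (768,)
```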

ChannelMix: Feed-Forward Network

After the Timemix mechanism captures the temporal relationships between the input
tokens, the outputs are passed through the ChannelMix layer to
further enhance the model's representational capacity. The ChannelMix block consists of two linear transformations with a squared-ReLU activation in between (i.e., ReLU(x)^2 = (max(0, x))^2). The first linear transformation
increases the dimensionality of the input four-fold from 768
to 3072. The second linear transformation reduces the dimensionality back to the
original size of 768, ensuring that the subsequent layers receive inputs of
consistent dimensions. Unlike Timemix, ChannelMix processes tokens
independently and simply maps them from one representation to another.
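A minimal sketch of ChannelMix for a single token is shown below. The full RWKV-4 ChannelMix additionally gates its output with a sigmoid of a receptance projection and uses token shift; both are omitted here to match the simplified description above, and the weights are random placeholders.

```python
import numpy as np

d, hidden = 768, 3072
W_in = np.random.randn(d, hidden).astype(np.float32) * 0.02   # 768 -> 3072
W_out = np.random.randn(hidden, d).astype(np.float32) * 0.02  # 3072 -> 768

def channel_mix(x):
    h = x @ W_in                  # expand the dimensionality four-fold
    h = np.maximum(h, 0.0) ** 2   # squared ReLU: (max(0, x))^2
    return h @ W_out              # project back to the original 768 dimensions

x_t = np.random.randn(d).astype(np.float32)
print(channel_mix(x_t).shape)     # (768,)
```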
Output Probabilities
After the input has been processed through all RWKV blocks, the output is passed
through the final linear layer to prepare it for token prediction. This layer projects the
final representations into a 50,277
dimensional space, where every token in the vocabulary has a corresponding value called
logit. Any token can be the next word, so this process allows us to simply rank
these tokens by their likelihood of being that next word. We then apply the softmax function
to convert the logits into a probability distribution that sums to one. This will allow us to
sample the next token based on its likelihood.
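In code, the final projection and softmax might look like this sketch (random placeholder weights, shapes as in RWKV-4 small):

```python
import numpy as np

d, vocab_size = 768, 50_277
W_head = np.random.randn(d, vocab_size).astype(np.float32) * 0.02  # final linear layer

h_last = np.random.randn(d).astype(np.float32)  # representation of the last token
logits = h_last @ W_head                        # one score (logit) per vocabulary token

probs = np.exp(logits - logits.max())           # softmax (shifted for numerical stability)
probs /= probs.sum()
print(probs.shape, probs.sum())                 # (50277,), sums to 1
```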

The final step is to generate the next token by sampling from this distribution. The temperature hyperparameter plays a critical role in this process. Mathematically speaking, it is a very simple operation: the model's output logits are simply divided by the temperature:
- temperature = 1: Dividing logits by one has no effect on the softmax outputs.
- temperature < 1: A lower temperature makes the model more confident and deterministic by sharpening the probability distribution, leading to more predictable outputs.
- temperature > 1: A higher temperature creates a softer probability distribution, allowing for more randomness in the generated text – what some refer to as model “creativity”.
In addition, the sampling process can be further refined using top-k
and
top-p parameters:
- top-k sampling: Limits the candidate tokens to the top k tokens with the highest probabilities, filtering out less likely options.
- top-p sampling: Considers the smallest set of tokens whose cumulative probability exceeds a threshold p, ensuring that only the most likely tokens contribute while still allowing for diversity.
By tuning temperature, top-k, and top-p, you can
balance between deterministic and diverse outputs, tailoring the model's behavior to your
specific needs.
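The sketch below illustrates one way to combine temperature, top-k, and top-p when sampling the next token; it is an illustrative implementation, not the exact code used by RWKV Explainer.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=50, top_p=0.9):
    """Sample a token ID from logits using temperature, top-k, and top-p (nucleus) filtering."""
    logits = logits / temperature                    # temperature scaling
    probs = np.exp(logits - logits.max())            # softmax
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]                  # tokens sorted by probability
    order = order[:top_k]                            # keep only the top-k candidates

    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # smallest set with cumulative prob >= top_p
    order = order[:cutoff]

    kept = probs[order] / probs[order].sum()         # renormalize remaining probabilities
    return int(np.random.choice(order, p=kept))

logits = np.random.randn(50_277).astype(np.float32)
print(sample_next_token(logits, temperature=0.8, top_k=40, top_p=0.95))
```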
Advanced Architectural Features
There are several advanced architectural features that enhance the performance of RWKV models. While they matter for the model's overall performance, they are not essential for understanding the core concepts of the architecture. Layer Normalization and Residual Connections are crucial components in RWKV models, particularly during the training phase. Layer Normalization stabilizes training and helps the model converge faster. Residual Connections allow gradients to flow directly through the network and help to prevent the vanishing gradient problem.
Layer Normalization
Layer Normalization helps to stabilize the training process and improves convergence. It works by normalizing the inputs across the features, ensuring that the mean and variance of the activations are consistent. This normalization helps mitigate issues related to internal covariate shift, allowing the model to learn more effectively and reducing the sensitivity to the initial weights. Layer Normalization is applied twice in each RWKV block, once before the Timemix mechanism and once before the ChannelMix layer.
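A minimal sketch of Layer Normalization applied to one token's 768-dimensional feature vector (the scale and shift parameters are learned in practice):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance, then scale and shift."""
    mean = x.mean()
    var = x.var()
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

d = 768
x = np.random.randn(d).astype(np.float32)
gamma, beta = np.ones(d, dtype=np.float32), np.zeros(d, dtype=np.float32)

out = layer_norm(x, gamma, beta)
print(float(out.mean()), float(out.std()))  # approximately 0 and 1
```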
Residual Connections
Residual connections were first introduced in the ResNet model in 2015. This architectural innovation revolutionized deep learning by enabling the training of very deep neural networks. Essentially, residual connections are shortcuts that bypass one or more layers, adding the input of a layer to its output. This helps mitigate the vanishing gradient problem, making it easier to train deep networks with multiple RWKV blocks stacked on top of each other. In RWKV-4, residual connections are used twice within each RWKV block: once around the Timemix mechanism and once around the ChannelMix layer, ensuring that gradients flow more easily and that earlier layers receive sufficient updates during backpropagation.
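Putting the pieces together, the data flow through one RWKV block can be sketched as follows, with toy stand-ins for the sub-layers described above; this mirrors the pre-normalization layout described in this section, not the exact implementation.

```python
import numpy as np

def rwkv_block(x, time_mix, channel_mix, ln1, ln2):
    """One RWKV block: each sub-layer acts on a normalized input and is added back (residual)."""
    x = x + time_mix(ln1(x))      # residual connection around the Timemix mechanism
    x = x + channel_mix(ln2(x))   # residual connection around the ChannelMix layer
    return x

# Toy stand-ins so the sketch runs; the real sub-layers are the mechanisms described above.
d = 768
toy_ln = lambda x: (x - x.mean()) / (x.std() + 1e-5)
toy_time_mix = lambda x: 0.5 * x
toy_channel_mix = lambda x: 0.5 * x

x = np.random.randn(d).astype(np.float32)
print(rwkv_block(x, toy_time_mix, toy_channel_mix, toy_ln, toy_ln).shape)  # (768,)
```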
Interactive Features
RWKV Explainer is built to be interactive and allows you to explore the inner workings of the RWKV model. Here are some of the interactive features you can play with:
- Input your own text sequence to see how the model processes it and predicts the next word. Explore timemix weights, intermediate computations, and see how the final output probabilities are calculated.
- Use the temperature slider to control the randomness of the model’s predictions. Explore how you can make the model output more deterministic or more creative by changing the temperature value.
- Select top-k and top-p sampling methods to adjust sampling behavior during inference. Experiment with different values and see how the probability distribution changes and influences the model's predictions.
How is RWKV Explainer Implemented?
RWKV Explainer features a live RWKV-4 (small) model running directly in the browser. This model is derived from the PyTorch implementation of RWKV by BlinkDL and has been converted to ONNX Runtime for seamless in-browser execution. The interface is built using JavaScript, with Svelte as a front-end framework and D3.js for creating dynamic visualizations. Numerical values are updated live following the user input.
Who developed the RWKV Explainer?
RWKV Explainer was developed by Ding Wang, based on the open-source transformer-explainer and rwkv-v4-web. It adapts the original by replacing GPT with RWKV and updating the diagrams and logic, while keeping a similar UI and interactive features.