The context window – a bottleneck for using LLMs

What is a context window?

A "context window" refers to the maximum length or number of tokens from the prior textual input that a language model, such as a Transformer, can consider and process. It determines the model's ability to handle long sequences and is constrained by factors such as the model's architecture, time, and memory complexity.

Why is it important?

The size of the context window matters for many practical tasks. Think about comparing several large contracts to one another: all of them have to fit into the model's input at the same time, otherwise the comparison is incomplete.

In code generation, the size of the context window is a critical factor: it determines how well the model can understand complex code structures, maintain consistency across files, perform transformations, integrate with existing code, and more.

Why can't we just make bigger context windows?

Attention-based LLMs have quadratic time and memory complexity with respect to the length of the input. Increasing the context window size therefore makes the required computations grow rapidly, leading to longer training and inference times.

Memory consumption grows just as rapidly, which limits the feasibility of very large context windows on typical GPUs and even on high-end hardware.
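A back-of-the-envelope calculation makes this tangible. The sketch below estimates the naive size of the attention score matrices for an assumed model shape (32 layers, 32 heads, 16-bit values); optimized kernels such as FlashAttention avoid materializing these matrices in full, but the compute still grows quadratically.

```python
# Back-of-the-envelope estimate of attention-matrix memory for an assumed model shape.
def attention_matrix_gib(seq_len: int, layers: int = 32, heads: int = 32,
                         bytes_per_value: int = 2) -> float:
    """Naive size of all (seq_len x seq_len) attention score matrices, in GiB."""
    values = seq_len * seq_len * layers * heads
    return values * bytes_per_value / 1024**3

for n in (2_000, 8_000, 32_000, 128_000):
    print(f"{n:>7,} tokens -> ~{attention_matrix_gib(n):,.0f} GiB")
# Quadrupling the context length multiplies this term by sixteen.
```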

What can we do about this?

We can enhance the model via fine-tuning. This way we bake part of the would-be context into the LLM as parametric knowledge.
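A common way to do this is supervised fine-tuning on question–answer pairs distilled from the documents in question. The sketch below only covers the data-preparation step, writing examples to a JSONL file; the field names, the file path, and the sample pair are assumptions, and the actual training run (with whatever framework or API you use) is not shown.

```python
# A minimal sketch of preparing fine-tuning data from source documents.
# The prompt/completion field names and the file path are assumed, not prescribed.
import json

examples = [
    {
        "prompt": "What is the notice period in the supplier contract?",
        "completion": "The supplier contract specifies a notice period of 90 days.",
    },
    # ... more pairs distilled from the documents the model should absorb
]

with open("finetune_data.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```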

Another approach is Retrieval-Augmented Generation (RAG), where we add source knowledge to the input/context in a "just-in-time" manner [1].
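A minimal sketch of the RAG idea: embed the document chunks, retrieve the most similar ones for a question, and put them into the prompt. The `embed` and `generate` functions here are dummy placeholders for a real embedding model and LLM, and the documents are made-up snippets.

```python
# A toy RAG loop: retrieve the most relevant chunks, then put them into the prompt.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: swap in a real embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=64)

def generate(prompt: str) -> str:
    """Placeholder LLM call: swap in a real model or API here."""
    return f"[an LLM would answer based on:\n{prompt[:120]}...]"

documents = ["Contract A: notice period is 90 days ...",
             "Contract B: notice period is 30 days ...",
             "Contract C: governed by Swiss law ..."]
doc_vectors = np.stack([embed(d) for d in documents])

def answer(question: str, top_k: int = 2) -> str:
    q = embed(question)
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    best = np.argsort(scores)[::-1][:top_k]
    context = "\n\n".join(documents[i] for i in best)
    prompt = (f"Answer using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return generate(prompt)

print(answer("Which contract has the shorter notice period?"))
```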

Other approaches aim at using the context window as efficiently as possible; the aider project is one example.
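One generic technique in this direction is to keep only as much history as fits into a fixed token budget. The sketch below illustrates that idea (it is not aider's actual mechanism); the budget value is an assumption, and tokens are counted with tiktoken.

```python
# Keep the newest messages that still fit into an assumed token budget.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def trim_to_budget(messages: list[str], budget: int = 3000) -> list[str]:
    """Drop the oldest messages until the remaining ones fit into `budget` tokens."""
    kept, used = [], 0
    for msg in reversed(messages):          # walk from newest to oldest
        n = len(encoding.encode(msg))
        if used + n > budget:
            break
        kept.append(msg)
        used += n
    return list(reversed(kept))             # restore chronological order
```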

What trends do we see?

In the recent paper "LongNet: Scaling Transformers to 1,000,000,000 Tokens" the authors propose dilated attention, a sparse attention pattern for handling long contexts. This seems to make very long context windows computationally feasible [2].
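To give an intuition for why this scales better, the toy sketch below builds a single dilated pattern (one assumed segment length and dilation rate) and compares the number of attended token pairs with full attention; it is only an intuition pump, not the paper's actual algorithm or implementation.

```python
# Toy illustration: a single dilated-attention pattern vs. full attention.
# Segment length and dilation rate are arbitrary example values.
import numpy as np

def dilated_mask(n: int, segment: int = 64, dilation: int = 4) -> np.ndarray:
    """Allow attention only within segments and only between every `dilation`-th token."""
    mask = np.zeros((n, n), dtype=bool)
    for start in range(0, n, segment):
        idx = np.arange(start, min(start + segment, n), dilation)
        mask[np.ix_(idx, idx)] = True
    return mask

n = 4096
print(f"full attention:      {n * n:,} pairs")
print(f"one dilated pattern: {dilated_mask(n).sum():,} pairs")
```

In the paper, several such patterns with different segment lengths and dilation rates are mixed so that nearby tokens are still covered densely while distant tokens are covered sparsely.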

However, the paper "Lost in the Middle: How Language Models Use Long Contexts" shows that models often fail to use long contexts well: quality drops noticeably when the relevant information sits in the middle of a long input [3].

So we will have to see how this plays out; in the meantime, fine-tuning and RAG remain our most generally usable approaches.

References

[1]: James Briggs, ‘Better Llama 2 with Retrieval Augmented Generation (RAG)’, James Briggs's YouTube channel, 2023 <https://www.youtube.com/watch?v=ypzmPwLH_Q4>.

[2]: Jiayu Ding and others, ‘LongNet: Scaling Transformers to 1,000,000,000 Tokens’ (arXiv, 2023) <http://arxiv.org/abs/2307.02486> [accessed 2 August 2023].

[3]: Nelson F. Liu and others, ‘Lost in the Middle: How Language Models Use Long Contexts’ (arXiv, 2023) <http://arxiv.org/abs/2307.03172> [accessed 2 August 2023].