Effortless tests with cached stubs

14 Jan 2025

Imagine you have just written a complex piece of code (maybe an agent?) that interacts with several APIs and possibly uses large language models (LLMs). Now you want to write tests, but calling the APIs or LLMs during tests would make them slow, non-deterministic, and dependent on network availability.

What is the solution?

You could write mock implementations for every external call, but that is time-consuming, difficult to maintain, and prone to breaking with code changes. Wouldn’t it be great if you could simply record one execution of your code and reuse that data to stub external calls during tests?

Enter cached stubs.

What are cached stubs?

Cached stubs allow you to record the results of function calls once and use those cached results in your tests. This approach eliminates the need to create and maintain complex mocks manually. It is fast, reliable, and integrates seamlessly with testing workflows.

I came up with the name “cached stubs” myself as I have not seen this pattern used anywhere else. If you end up using this approach in your code, feel free to link back to this post! And if you have seen something similar before, I would love to hear about it.

A simple example

Let us start with a typical example:

from openai import OpenAI

client = OpenAI()

def generate_response(prompt):
    print("Calling LLM")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def test_generate_response():
    prompt = "What is the capital of France?"
    response = generate_response(prompt)
    assert "Paris" in response.strip()

Running this code with pytest -s main.py prints "Calling LLM" and executes the API call. However, this test is slow and network-dependent.

On such a toy example, you might not see the benefit of cached stubs, as you could mock generate_response with a simple lambda prompt: "Paris". But imagine that generate_response is called dozens of times with different arguments, each producing a different response, in a test suite, and that every time you change a prompt template you have to update several mocks. Or worse, you have several chained calls, so updating all the mocks requires digging through long logs and copy-pasting outputs back into the mocks.
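
For contrast, here is roughly what the manual-mock route looks like (a minimal sketch using pytest's built-in monkeypatch fixture; the module name main, the helper fake_generate_response, and the hard-coded replies are illustrative assumptions):

import main  # the module containing generate_response

def fake_generate_response(prompt):
    # Hand-maintained replies that must be kept in sync with the real prompts.
    replies = {
        "What is the capital of France?": "Paris",
        "What is the capital of Italy?": "Rome",
    }
    return replies[prompt]

def test_generate_response_with_manual_mocks(monkeypatch):
    # Replace the real function with the fake for the duration of the test.
    monkeypatch.setattr(main, "generate_response", fake_generate_response)
    assert "Paris" in main.generate_response("What is the capital of France?")

Every new prompt, and every change to a prompt template, means editing dictionaries like this one by hand, which is exactly the maintenance burden cached stubs remove.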

This is where cached stubs come in.

Implementing cached stubs

We will use three libraries:

  - joblib to cache the function's results on disk,
  - unittest.mock to swap the original function for its cached version,
  - pytest to expose the patch as a fixture.
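
If you have not used joblib.Memory before, the core behaviour the recipe builds on is disk-backed memoization (a minimal standalone sketch; slow_square is just an illustrative function):

import joblib

memory = joblib.Memory("cached_stubs", verbose=0)

@memory.cache
def slow_square(x):
    print("computing")  # only printed on a cache miss
    return x * x

slow_square(3)  # first call: runs the function and writes the result to disk
slow_square(3)  # second call: returns the stored result without running the body
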
Here is a simple implementation of cached stubs:

import os
from unittest import mock

import joblib
import pytest

def create_cached_stub(func):
    # Wrap func with a joblib disk cache stored in the cached_stubs/ directory.
    memory = joblib.Memory("cached_stubs", verbose=0)
    cached_func = memory.cache(func)

    if not os.getenv("RECORD"):
        # Outside recording mode, make any cache miss fail loudly by overriding
        # joblib's internal method that would execute the real function.
        def error(*args, **kwargs):
            raise Exception("Cache miss")
        cached_func._call = error

    # Swap the original function for its cached version in the module where
    # it is defined.
    patch = mock.patch(
        f"{func.__module__}.{func.__name__}",
        cached_func,
    )

    @pytest.fixture
    def fixture():
        # Activate the patch for the duration of each test that uses the fixture.
        patch.start()
        yield
        patch.stop()

    return fixture

Using cached stubs

Here is how to use cached stubs:

fixture = create_cached_stub(generate_response)

def test_generate_response(fixture):
    prompt = "What is the capital of France?"
    response = generate_response(prompt)
    assert "Paris" in response.strip()

  1. Record the API responses: run the tests with the RECORD environment variable set:

        RECORD=1 pytest -s main.py

     This executes generate_response (prints "Calling LLM") and saves the result in the cached_stubs/ directory.

  2. Commit your changes: the outputs of cached functions are now stored as files in the cached_stubs/ directory (see the sketch after this list for roughly what gets written). You can commit these files to your git repository and reuse them on other machines, including CI. The only caveat of using joblib is that it relies on the pickle module, which is insecure, so you should never load cache files from untrusted sources.

  3. Use the cached stubs: run the tests a second time without the RECORD variable:

        pytest -s main.py

     The cached results are used instead of making actual API calls.

     If a value is missing from the cache and RECORD is not set, an exception is raised automatically. This way, you can easily check locally that all your stubs are cached before pushing to CI.
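
For reference, here is roughly what joblib writes under cached_stubs/ after recording (the exact layout is an internal detail of joblib and may vary between versions; main and the hash directory name are placeholders):

    cached_stubs/
    └── joblib/
        └── main/
            └── generate_response/
                ├── func_code.py              # snapshot of the function's source, used for invalidation
                └── <hash of the arguments>/
                    ├── metadata.json
                    └── output.pkl            # the pickled return value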

As a bonus, thanks to joblib, any changes to the generate_response function (but not its dependencies) automatically invalidate the cache. To remove old values, you can periodically delete the whole cache folder and re-record the stubs.
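
If you would rather do that from code than delete the directory by hand, joblib's Memory object also exposes a clear() method (a small sketch; run it from a throwaway script or a pre-record step):

import joblib

# Remove every recorded stub; the next RECORD=1 run will repopulate the cache.
joblib.Memory("cached_stubs", verbose=0).clear()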

A full example with malib

Our implementation above is fine for explanatory purposes, but it does not handle several edge cases.

This is the reason why I have integrated cached stubs into my utility library, malib. Here is the same example using malib.cached_stubs:

from malib import cached_stubs
from openai import OpenAI

client = OpenAI()

def generate_response(prompt):
    print("Calling LLM")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

fixture = cached_stubs.create(generate_response)

def test_generate_response(fixture):
    prompt = "What is the capital of France?"
    response = generate_response(prompt)
    assert "Paris" in response.strip()

A simple one-liner fixture = cached_stubs.create(generate_response) is all you need to use cached stubs. I hope you find this concept useful!