Imagine you have just written a complex piece of code (maybe an agent?) that interacts with several APIs and possibly uses large language models (LLMs). Now you want to write tests, but calling the APIs or LLMs during tests would make them slow, non-deterministic, and dependent on network availability.
What is the solution?
You could write mock implementations for every external call, but that is time-consuming, difficult to maintain, and prone to breaking with code changes. Wouldn’t it be great if you could simply record one execution of your code and reuse that data to stub external calls during tests?
Enter cached stubs.
What are cached stubs?
Cached stubs allow you to record the results of function calls once and use those cached results in your tests. This approach eliminates the need to create and maintain complex mocks manually. It is fast, reliable, and integrates seamlessly with testing workflows.
I came up with the name “cached stubs” myself as I have not seen this pattern used anywhere else. If you end up using this approach in your code, feel free to link back to this post! And if you have seen something similar before, I would love to hear about it.
A simple example
Let us start with a typical example:
    from openai import OpenAI

    client = OpenAI()


    def generate_response(prompt):
        print("Calling LLM")
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content


    def test_generate_response():
        prompt = "What is the capital of France?"
        response = generate_response(prompt)
        assert "Paris" in response.strip()
Running this code with pytest -s main.py prints "Calling LLM" and executes the API call. However, this test is slow and network-dependent.
On such a toy example, you might not see the benefit of cached stubs, as you could mock generate_response with a simple lambda prompt: "Paris". But imagine that generate_response is called dozens of times in a test suite with different arguments, each producing a different response, and that every time you change a prompt template you have to update several mocks. Or worse, you have several chained calls, so updating all the mocks requires digging through long logs and copy-pasting the outputs back into the mocks.
This is where cached stubs come in.
Implementing cached stubs
We will use three libraries:
- joblib to cache the results of generate_response
- unittest.mock to replace generate_response with its cached version during tests
- pytest fixtures to manage the stubbing lifecycle
Here is a simple implementation of cached stubs:
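    import os
    from unittest import mock

    import joblib
    import pytest


    def create_cached_stub(func):
        # Results are stored on disk in the cached_stubs/ directory.
        memory = joblib.Memory("cached_stubs", verbose=0)
        cached_func = memory.cache(func)

        if not os.getenv("RECORD"):
            # Outside of recording mode, a cache miss must fail loudly
            # instead of silently calling the real function.
            def error(*args, **kwargs):
                raise Exception("Cache miss")

            # Note: _call is an underscore-prefixed (private) joblib attribute,
            # so this override may need adjusting for other joblib versions.
            cached_func._call = error

        # Replace the function in its defining module with the cached version.
        patch = mock.patch(
            f"{func.__module__}.{func.__name__}",
            cached_func,
        )

        @pytest.fixture
        def fixture():
            patch.start()
            yield
            patch.stop()

        return fixture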
Using cached stubs
Here is how to use cached stubs:
    fixture = create_cached_stub(generate_response)


    def test_generate_response(fixture):
        prompt = "What is the capital of France?"
        response = generate_response(prompt)
        assert "Paris" in response.strip()
- Record the API responses: Run the tests with the RECORD environment variable set:
      RECORD=1 pytest -s main.py
  This executes generate_response (prints "Calling LLM") and saves the result in the cached_stubs/ directory.
- Commit your changes: The outputs of cached functions are now stored as files in the cached_stubs/ directory. You can commit these files to your git repository and reuse them on other machines, including CI. The only caveat of using joblib is that it relies on the pickle module, which is insecure, so you should never use cache files from untrusted sources.
- Use the cached stubs: Run the tests a second time without the RECORD variable:
      pytest -s main.py
  The cached results will be used instead of making actual API calls.
If a value is missing in the cache and RECORD is not set, an exception will be automatically raised. This way, you can easily check locally that all your stubs are cached before pushing to CI.
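The record and replay modes described above can be sketched with the standard library alone. This is not the joblib-based implementation from this post: cached_stub, CacheMiss, and fake_llm are hypothetical names, the record flag stands in for the RECORD environment variable, and results are stored as JSON, which sidesteps pickle but only works when arguments and return values are JSON-serializable.

```python
import json
import tempfile
from pathlib import Path


class CacheMiss(Exception):
    """Raised in replay mode when no recorded result exists."""


def cached_stub(func, cache_dir, record):
    """Wrap func so results are recorded to, or replayed from, a JSON file."""
    path = Path(cache_dir) / f"{func.__name__}.json"

    def wrapper(*args, **kwargs):
        store = json.loads(path.read_text()) if path.exists() else {}
        # Assumption: arguments are JSON-serializable in this sketch.
        key = json.dumps([args, kwargs], sort_keys=True)
        if key in store:
            return store[key]
        if not record:
            raise CacheMiss(f"No recorded result for {func.__name__}{args}")
        # Recording mode: call the real function and persist its result.
        store[key] = func(*args, **kwargs)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(store, indent=2))
        return store[key]

    return wrapper


calls = []


def fake_llm(prompt):
    """Stand-in for a slow API call; tracks how often it really runs."""
    calls.append(prompt)
    return f"echo: {prompt}"


cache_dir = tempfile.mkdtemp()

# Recording run (what RECORD=1 would enable): the real function executes.
recording = cached_stub(fake_llm, cache_dir, record=True)
first = recording("hello")

# Replay run: the stored value is returned, the real function is not called.
replaying = cached_stub(fake_llm, cache_dir, record=False)
second = replaying("hello")
```

Calling replaying with a prompt that was never recorded raises CacheMiss, which mirrors the "Cache miss" exception in the joblib version.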
As a bonus, thanks to joblib, any changes to the generate_response function (but not its dependencies) automatically invalidate the cache. To remove old values, you can periodically delete the whole cache folder and re-record the stubs.
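To see why a change to the function invalidates its cached results while a change to a dependency does not, here is a rough sketch of the idea. It is an illustration only, not joblib's actual mechanism: code_fingerprint and cache_key are hypothetical helpers that hash the function's own bytecode and constants into the cache key.

```python
import hashlib


def code_fingerprint(func):
    """Hash a function's own bytecode and constants (hypothetical helper).

    Only suitable for simple functions: nested functions would need
    a recursive treatment of their code objects.
    """
    code = func.__code__
    payload = code.co_code + repr(code.co_consts).encode()
    return hashlib.sha256(payload).hexdigest()


def cache_key(func, *args):
    """Combine the function name, its fingerprint, and the call arguments."""
    return (func.__name__, code_fingerprint(func), args)


def helper():
    return "Paris"


def generate_response(prompt):
    return prompt + " -> " + helper()


key_before = cache_key(generate_response, "capital of France?")


# Editing a dependency leaves the caller's own code, and so its key, unchanged.
def helper():
    return "Lyon"


key_after_helper_change = cache_key(generate_response, "capital of France?")


# Editing the function body itself produces a different fingerprint.
def generate_response(prompt):
    return prompt + " => " + helper()


key_after_body_change = cache_key(generate_response, "capital of France?")
```

In this sketch the key survives the helper edit but not the body edit, which is the same blind spot mentioned above: stale results can linger when only a dependency changes, so re-recording from a clean cache folder remains useful.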
A full example with malib
Our implementation above is nice for explanatory purposes but lacks the handling of several edge cases:
- generators and particularly async generators
- function arguments that shouldn’t be cached
- patching functions that appear in multiple modules
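As an illustration of the second edge case, arguments such as an API client object usually should not take part in the cache key. joblib exposes an ignore parameter on Memory.cache for this purpose; the sketch below shows the underlying idea with the standard library only. It is not malib's implementation: make_key and FakeClient are hypothetical names, and the remaining arguments are assumed to be JSON-serializable.

```python
import inspect
import json


def make_key(func, args, kwargs, ignore=()):
    """Build a cache key from the call's arguments, skipping ignored names."""
    # Binding normalizes positional and keyword forms of the same call.
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()
    kept = {k: v for k, v in bound.arguments.items() if k not in ignore}
    return json.dumps(kept, sort_keys=True)


class FakeClient:
    """Stand-in for an API client that should not affect the cache key."""


def generate_response(client, prompt, temperature=0.0):
    return f"response to {prompt}"


# Two different client instances and two call styles give the same key.
key_a = make_key(generate_response, (FakeClient(), "Hi"), {}, ignore=("client",))
key_b = make_key(
    generate_response, (FakeClient(),), {"prompt": "Hi"}, ignore=("client",)
)

# A different prompt gives a different key.
key_c = make_key(generate_response, (FakeClient(), "Bye"), {}, ignore=("client",))
```

Without the ignore list, every new client instance would look like a new argument and trigger a cache miss, even though the prompt is unchanged.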
This is the reason why I have integrated cached stubs into my utility library, malib. Here is the same example using malib.cached_stubs:
    from malib import cached_stubs
    from openai import OpenAI

    client = OpenAI()


    def generate_response(prompt):
        print("Calling LLM")
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content


    fixture = cached_stubs.create(generate_response)


    def test_generate_response(fixture):
        prompt = "What is the capital of France?"
        response = generate_response(prompt)
        assert "Paris" in response.strip()
A simple one-liner, fixture = cached_stubs.create(generate_response), is all you need to use cached stubs. I hope you find this concept useful!