When a chain runs often with the same inputs, you can avoid repeating work. The trick: identify which steps are pure functions (same input → same output) and store each result under a hash of its input. A hash only matches when the input is exactly the same, so merely similar inputs still miss the cache.
The next time the same input arrives, you hit the cache, return the stored output, and skip the LLM call. Latency: around 5 ms. Cost: zero for the LLM call.
- id: classify_incident
  cache:
    kind: hash
    key: hash(input.text + prompt_version)
    ttl: forever
  prompt: |
    Classify the following report ...

Before calling the LLM, the runtime computes hash(input.text + prompt_version). If it finds an entry, it returns it. If not, it runs the step and stores the result.
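The lookup-then-store behavior described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the runtime's actual implementation: the `HashCache` class, its `run` method, and the `fake_llm` stub are hypothetical names, and SHA-256 is an assumed choice of hash. The sketch also inserts a separator between the two key parts, which plain concatenation lacks, so that different (text, version) pairs cannot produce the same key.

```python
import hashlib
from typing import Callable, Dict


class HashCache:
    """Hypothetical in-memory cache keyed by hash(prompt_version, input text)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(text: str, prompt_version: str) -> str:
        # Assumption: a NUL separator between the parts, so that
        # ("ab", "c") and ("a", "bc") never hash to the same key.
        payload = f"{prompt_version}\x00{text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def run(self, text: str, prompt_version: str,
            call_llm: Callable[[str], str]) -> str:
        key = self.make_key(text, prompt_version)
        if key in self._store:
            # Cache hit: return the stored output and skip the LLM call.
            self.hits += 1
            return self._store[key]
        # Cache miss: run the step once and store the result.
        self.misses += 1
        result = call_llm(text)
        self._store[key] = result
        return result


def fake_llm(text: str) -> str:
    """Stand-in for a real LLM call (hypothetical)."""
    return "label:" + text


cache = HashCache()
first = cache.run("disk full on db-1", "v1", fake_llm)   # miss, calls fake_llm
second = cache.run("disk full on db-1", "v1", fake_llm)  # hit, no call
```

After the two calls, `cache.misses` is 1 and `cache.hits` is 1, and both calls return the same value. A production cache would normally live in a shared store rather than a process-local dictionary.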
If you cache by input alone and change the prompt tomorrow, the cache keeps returning outputs produced by the OLD prompt. This is one of the most common bugs with LLM caches. The key must include the prompt version so that a prompt change invalidates old entries automatically.
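One way to make that invalidation hard to forget is to derive the version from the prompt text itself, so any edit to the prompt changes the key. The sketch below assumes this scheme; the source does not say how prompt_version is produced, and the function names and both prompt strings are hypothetical placeholders.

```python
import hashlib


def prompt_version(prompt_template: str) -> str:
    """Assumed scheme: the version is a short digest of the prompt text."""
    return hashlib.sha256(prompt_template.encode("utf-8")).hexdigest()[:12]


def cache_key(text: str, prompt_template: str) -> str:
    """Key over (prompt version, input text), joined with a NUL separator."""
    version = prompt_version(prompt_template)
    payload = f"{version}\x00{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Placeholder prompts, purely illustrative.
old_prompt = "Placeholder prompt, first wording: {text}"
new_prompt = "Placeholder prompt, second wording: {text}"
report = "Database latency spiked at 09:14"

key_old = cache_key(report, old_prompt)
key_new = cache_key(report, new_prompt)

# Same report, different prompt: the keys differ, so entries written
# under the old prompt are never returned for the new one.
keys_differ = key_old != key_new
```

The trade-off is that even a whitespace-only edit to the prompt empties the cache in effect. A manually bumped version string avoids that, at the cost of relying on someone remembering to bump it.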
On the right are six steps of your pipeline. Some are pure (deterministic, ideal for a hash cache), some depend on the state of the world (cache them with a short TTL), and some can NEVER be cached (side effects). Connect each step to its correct type.
The "never cache" criterion: ask yourself, "What happens if I run it twice?" If both runs are equivalent, the step is cacheable. If the second run does something new in the world, never cache it.
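The three categories and the run-it-twice test can be written down as a small decision rule. This is a sketch under stated assumptions: the `Step` fields, the `choose_policy` function, and every step name except classify_incident are hypothetical and are not the six steps from the exercise.

```python
from dataclasses import dataclass
from enum import Enum


class CachePolicy(Enum):
    HASH = "hash"    # pure step: cache under a hash of the input
    TTL = "ttl"      # depends on the world: cache with a short TTL
    NEVER = "never"  # side effects: always run


@dataclass(frozen=True)
class Step:
    name: str
    has_side_effects: bool      # does a second run do something new in the world?
    reads_external_state: bool  # can the correct answer change over time?


def choose_policy(step: Step) -> CachePolicy:
    # Side effects win over everything else: a skipped run means a
    # skipped email, payment, or write.
    if step.has_side_effects:
        return CachePolicy.NEVER
    if step.reads_external_state:
        return CachePolicy.TTL
    return CachePolicy.HASH


# Hypothetical example steps.
steps = [
    Step("classify_incident", has_side_effects=False, reads_external_state=False),
    Step("fetch_service_status", has_side_effects=False, reads_external_state=True),
    Step("send_alert_email", has_side_effects=True, reads_external_state=False),
]

policies = {step.name: choose_policy(step).value for step in steps}
```

Here `policies` maps classify_incident to "hash", fetch_service_status to "ttl", and send_alert_email to "never". Checking side effects first is deliberate: a step that both reads the world and writes to it must still run every time.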