Caching
This page explains how caching makes Fused more responsive & some best practices for getting the most out of it
Caching Basics
The goal of Fused is to make developing & running code faster for data scientists. This is done by using efficient file formats and making UDFs simple to run. On top of those, Fused relies heavily on caching to make recurring calls much faster.
At a high level, caching means storing the output of a function called with a given input, so the next time that function is called with the same input the result can be returned directly rather than re-computed, saving time & processing cost.

The first run of a [Function + Input] is processed, but the next time that same combination is called, the result is retrieved much faster
As soon as either the function or the inputs change, however, the output needs to be processed again (as the result of this new combination has not been computed before)

Fused uses a few different types of cache, but they all work in this same manner.
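Conceptually (this is a plain-Python sketch, not Fused's actual implementation), a cache keyed on both the function's code and its inputs behaves like this:

```python
import hashlib
import pickle

_cache = {}

def cached_call(fn, **kwargs):
    # Key on the function's code AND its inputs:
    # changing either one invalidates the cached entry.
    key = hashlib.sha256(
        fn.__code__.co_code + pickle.dumps(sorted(kwargs.items()))
    ).hexdigest()
    if key not in _cache:
        _cache[key] = fn(**kwargs)  # first call: compute and store
    return _cache[key]              # repeat call: return the stored result

calls = []
def slow_double(x):
    calls.append(x)
    return x * 2

cached_call(slow_double, x=1)  # computed
cached_call(slow_double, x=1)  # served from cache, not recomputed
cached_call(slow_double, x=2)  # new input: computed again
```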
Caching a function (inside a UDF): @fused.cache
Any function inside a UDF can be cached by wrapping it with the `@fused.cache` decorator:
```python
@fused.udf
def udf():
    import pandas as pd

    @fused.cache
    def load_data(i):
        # Do heavy processing here
        return pd.DataFrame({'id': [i]})

    df_first = load_data(i=1)
    df_first_repeat = load_data(i=1)
    df_second = load_data(i=2)
    return pd.concat([df_first, df_first_repeat, df_second])
```
Under the hood:
- The first time Fused sees the function code and parameters, Fused runs the function and stores the return value in a cache. In the example above, this is the first call: `df_first = load_data(i=1)`.
- The next time the function is called with the same parameters and code, Fused skips running the function and returns the cached value. In the example, `df_first_repeat` is the same call as `df_first`, so the result is retrieved from cache rather than computed.
- As soon as the function or the input changes, Fused re-computes the function. In the example, `df_second` passes `i=2`, which is different from the previous calls.
Implementation Details
A function cached with `@fused.cache` is:
- Cached for 12h by default (can be changed with `cache_max_age`)
- Stored as a pickle file on `mount/`
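A pickle-on-disk cache with an age check can be sketched in plain Python (hypothetical layout, not Fused's real storage on `mount/`):

```python
import hashlib
import os
import pickle
import tempfile
import time

CACHE_DIR = tempfile.mkdtemp()  # stand-in for the mount/ disk

def disk_cached(fn, max_age_s=12 * 3600, **kwargs):
    # One pickle file per (function name, inputs) combination
    key = hashlib.sha256(
        (fn.__name__ + repr(sorted(kwargs.items()))).encode()
    ).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".pkl")
    # Serve from disk only while the file is younger than max_age_s
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < max_age_s:
        with open(path, "rb") as f:
            return pickle.load(f)
    result = fn(**kwargs)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []
def load(i):
    calls.append(i)
    return i * 2

disk_cached(load, i=3)  # computed and written to disk
disk_cached(load, i=3)  # read back from the pickle file
```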
Benchmark: With / without @fused.cache
Using `@fused.cache` is mostly helpful for functions with long, repetitive calls, such as loading data from slow file formats.
Here are 2 simple UDFs to demonstrate the impact:
- `without_cache_loading_udf` -> Doesn't use cache
- `with_cache_loading_udf` -> Caches the loading of a CSV
```python
@fused.udf
def without_cache_loading_udf(
    ship_length_meters: int = 100,
    ais_path: str = "s3://fused-users/fused/file_format_demo/AIS_2024_01_01_100k_points.csv"
):
    # @fused.cache
    def load_ais_data(ais_path: str):
        import pandas as pd
        return pd.read_csv(ais_path)

    ais = load_ais_data(ais_path)
    return ais[ais.Length > ship_length_meters]
```
and the same UDF, this time caching the load:
```python
@fused.udf
def with_cache_loading_udf(
    ship_length_meters: int = 100,
    ais_path: str = "s3://fused-users/fused/file_format_demo/AIS_2024_01_01_100k_points.csv"
):
    @fused.cache
    def load_ais_data(ais_path: str):
        import pandas as pd
        return pd.read_csv(ais_path)

    ais = load_ais_data(ais_path)
    return ais[ais.Length > ship_length_meters]
```
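As an aside, the speedup is easy to reproduce locally with Python's built-in `functools.lru_cache`, a rough stand-in for `@fused.cache` (in-memory only, no cross-run persistence):

```python
import time
from functools import lru_cache

@lru_cache(maxsize=None)
def load_data(path):
    time.sleep(0.5)  # stand-in for a slow CSV read
    return f"data from {path}"

t0 = time.perf_counter()
load_data("ais.csv")  # first call pays the full cost
first = time.perf_counter() - t0

t0 = time.perf_counter()
load_data("ais.csv")  # repeat call is near-instant
repeat = time.perf_counter() - t0

print(f"first: {first:.2f}s, repeat: {repeat:.4f}s")
```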
Comparing the 2:

Best Practices: @fused.cache
Caching inside a UDF works best for:
- Loading data from slow formats (CSV, Shapefile)
- Repetitive operations that take a long time to process
However, be wary of relying on `@fused.cache` to load very large (>10GB) datasets: the cache is only stored for a few hours by default and is overwritten each time you change the cached function or its inputs.
Look into ingesting your data in partitioned, cloud-native formats if you're working with large datasets.
The line between when to ingest your data and when to use `@fused.cache` is a bit blurry. Check this section for more.
Caching a UDF
Implementation Details
Cached UDFs are:
- Stored for 90d by default (see Python SDK for more details)
- Stored on S3
Calling a UDF with a token
While `@fused.cache` allows you to cache functions inside UDFs, UDFs run with tokens are cached by default.
This is enabled by default when sharing a token from Workbench:

You can create a token for your UDF in Python by first saving your UDF to the Fused server:
```python
@fused.udf
def slow_caching_udf():
    import time
    import pandas as pd

    time.sleep(5)
    return pd.DataFrame({"output": ["I'm done running my long task!"]})

slow_caching_udf.to_fused()
token = slow_caching_udf.create_access_token()
fused.run(token)
```
We can demonstrate this caching with a UDF that has a `time.sleep(5)` in it. Running this same UDF twice:

This means that UDFs that are repeatedly called with `fused.run(token)` become much more responsive. Remember once again that a UDF is re-computed whenever anything in the UDF function or its inputs changes!
Calling a UDF from object or name
🚧 Under Construction 🚧
Advanced
Caching & bbox
Pass `bbox` to make the output unique to each Tile:
```python
@fused.udf
def udf(bbox: fused.types.TileGDF = None):
    @fused.cache
    def fn(bbox):
        return bbox

    return fn(bbox)
```
Note that this means that if you're running your Tile UDF in Workbench, every time you pan around on the map you will cache a new file.
For this reason, it's recommended to reserve caching for tasks that don't depend on `bbox` when possible, for example:
```python
@fused.udf
def udf(bbox: fused.types.TileGDF = None, data_path: str = None):
    @fused.cache
    def loading_slow_geodataframe(data_path):
        ...
        return gdf

    # Loading our slow data does not depend on bbox, so it stays cached as we pan around
    gdf = loading_slow_geodataframe(data_path)
    gdf_in_bbox = gdf[gdf.geometry.within(bbox.iloc[0].geometry)]
    return gdf_in_bbox
```
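To see why keying the cache on `bbox` multiplies cache entries as you pan, here is a toy dictionary cache (plain Python, not the Fused cache itself):

```python
cache = {}

def cached(fn, *args):
    # One cache entry per (function, arguments) combination
    key = (fn.__name__, args)
    if key not in cache:
        cache[key] = fn(*args)
    return cache[key]

def load(path):
    return f"all rows of {path}"

def clip(tile):
    return f"rows in tile {tile}"

# Keyed on a stable path: one entry no matter how many tiles render
for tile in range(100):
    cached(load, "data.parquet")

# Keyed on the tile itself: one new entry per tile
for tile in range(100):
    cached(clip, tile)

print(len(cache))  # 1 entry for load + 100 entries for clip
```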
Defining your cache lifetime: cache_max_age
You can define how long to keep your cached data with `cache_max_age`:

```python
@fused.udf
def udf(data_path: str = None):
    @fused.cache(
        cache_max_age="24h"  # Your cache will stay available for 24h
    )
    def loading_slow_geodataframe(data_path):
        ...
        return gdf

    gdf = loading_slow_geodataframe(data_path)
    return gdf
```
This also works with `@fused.udf()` & `fused.run()`:

```python
@fused.udf(cache_max_age='24h')  # This UDF will be cached for 24h after its initial run
def udf(path):
    gdf = gpd.read_file(path)
    return gdf
```
This UDF will be cached from the moment it's executed with `fused.run(udf)`, for as long as `cache_max_age` defines:

```python
fused.run(udf)
```
If you run `fused.run(udf)` again with no changes to `udf`, then for the next 24h `fused.run(udf)` will return a cached result. This is both faster & cheaper (saving on compute) while giving you control over how long to keep your cache.
You can also override the `cache_max_age` defined in `udf` when running your UDF:

```python
fused.run(udf, cache_max_age='12h')
```

`udf` results will now only be cached for 12h, even though `udf` was defined with a `cache_max_age` of 24h.
The age of your cache is determined as follows:
- By default, a UDF is cached for 90 days.
- If `@fused.udf(cache_max_age)` is defined, this new cache age overwrites the default.
- If `fused.run(udf, cache_max_age)` is passed, this cache age takes priority over both the default & `@fused.udf(cache_max_age)`.