
Caching

This page explains how caching makes Fused more responsive, and some best practices for making the most of it

Caching Basics

The goal of Fused is to make developing & running code faster for data scientists. This is done by using efficient file formats and making UDFs simple to run. On top of those, Fused relies heavily on caching to make recurring calls much faster.

At a high level, caching means storing the output of a function run with a given input so that, the next time the function is called with the same input, the result can be returned directly instead of being re-computed, saving both time and processing cost.

Function + Input run

The first run of a [Function + Input] combination is processed, but the next time that same combination is called, the result is retrieved much faster.

As soon as either the function or the input changes, however, the output needs to be processed again (as the result of this new combination has not been computed before).

Different Function + Input run

Fused uses a few different types of cache, but they all work in this same manner.

Caching a function (inside a UDF): @fused.cache

Any function inside a UDF can be cached by applying the @fused.cache decorator to it:

```python
@fused.udf
def udf():
    import pandas as pd

    @fused.cache
    def load_data(i):
        # Do heavy processing here
        return pd.DataFrame({'id': [i]})

    df_first = load_data(i=1)
    df_first_repeat = load_data(i=1)
    df_second = load_data(i=2)

    return pd.concat([df_first, df_first_repeat, df_second])
```

Under the hood:

  • The first time Fused sees the function code and parameters, Fused runs the function and stores the return value in a cache.
    • This is what happens with the first call in the example above: load_data(i=1)
  • The next time the function is called with the same parameters and code, Fused skips running the function and returns the cached value.
    • Example above: df_first_repeat is the same call as df_first, so the result is simply retrieved from cache, not computed
  • As soon as the function code or the input changes, Fused re-computes the function.
    • Example above: df_second is called with i=2, which differs from the previous calls

Implementation Details

A function cached with @fused.cache is:

Benchmark: With / without @fused.cache

Using @fused.cache is most helpful for caching functions with long, repetitive calls, such as loading data from slow file formats.

Here are two simple UDFs to demonstrate the impact:

  • without_cache_loading_udf -> Doesn't use cache
  • with_cache_loading_udf -> Caches the loading of a CSV

```python
@fused.udf
def without_cache_loading_udf(
    ship_length_meters: int = 100,
    ais_path: str = "s3://fused-users/fused/file_format_demo/AIS_2024_01_01_100k_points.csv"
):
    # @fused.cache
    def load_ais_data(ais_path: str):
        import pandas as pd
        return pd.read_csv(ais_path)

    ais = load_ais_data(ais_path)

    return ais[ais.Length > ship_length_meters]
```

and the same UDF, this time with caching enabled:

```python
@fused.udf
def with_cache_loading_udf(
    ship_length_meters: int = 100,
    ais_path: str = "s3://fused-users/fused/file_format_demo/AIS_2024_01_01_100k_points.csv"
):
    @fused.cache
    def load_ais_data(ais_path: str):
        import pandas as pd
        return pd.read_csv(ais_path)

    ais = load_ais_data(ais_path)

    return ais[ais.Length > ship_length_meters]
```

Comparing the two:

Function + Input run
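The same effect can be reproduced outside Fused with plain memoization. This sketch uses functools.lru_cache as a stand-in for @fused.cache, with a short time.sleep simulating the slow CSV read (the names and timings here are illustrative, not part of the Fused SDK):

```python
import time
from functools import lru_cache

def slow_load(path: str):
    # Stand-in for a slow pd.read_csv call
    time.sleep(0.2)
    return f"data from {path}"

@lru_cache(maxsize=None)
def cached_load(path: str):
    time.sleep(0.2)
    return f"data from {path}"

# First calls pay the full loading cost either way
t0 = time.time(); slow_load("ais.csv"); uncached_first = time.time() - t0
t0 = time.time(); cached_load("ais.csv"); cached_first = time.time() - t0

# Repeat call: only the cached version skips the work
t0 = time.time(); cached_load("ais.csv"); cached_repeat = time.time() - t0
print(cached_repeat < uncached_first)  # the repeat hit is near-instant
```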

Best Practices: @fused.cache

Caching inside a UDF works best for:

  • Loading data from slow formats (CSV, Shapefile)
  • Repetitive operations that can take a long amount of processing

However, be wary of relying on @fused.cache to load very large (>10GB) datasets, as the cache is only stored for a few hours by default and is overwritten each time you change the cached function or its inputs.

Look into ingesting your data in partitioned cloud native formats if you're working with large datasets.

tip

The line between when to ingest your data and when to use @fused.cache is a bit blurry. Check this section for more details.

Caching a UDF

Implementation Details

Cached UDFs are:

  • Stored for 90d by default (see Python SDK for more details)
  • Stored on S3

Calling a UDF with a token

While @fused.cache allows you to cache functions inside UDFs, UDFs run with tokens are cached by default.

This is enabled by default when sharing a token from Workbench:

Workbench token saving

We can demonstrate this caching with a UDF that has a time.sleep(5) in it. Running this same UDF twice:

Token fused.run() caching

This means that UDFs repeatedly called with fused.run(token) become much more responsive. Do remember, once again, that a UDF is re-computed whenever anything in the UDF code or its inputs changes!

Calling a UDF from object or name

🚧 Under Construction 🚧

Advanced

Caching & bbox

Pass bbox to make the output unique to each Tile.

```python
@fused.udf
def udf(bbox: fused.types.TileGDF = None):

    @fused.cache
    def fn(bbox):
        return bbox

    return fn(bbox)
```

Note that this means that if you're running your Tile UDF in Workbench, every time you pan around on the map you will cache a new file.

For this reason, it's recommended to keep caching for tasks that don't depend on your bbox when possible, for example:

```python
@fused.udf
def udf(bbox: fused.types.TileGDF = None, data_path: str = None):

    @fused.cache
    def loading_slow_geodataframe(data_path):
        ...
        return gdf

    # Loading of our slow data does not depend on bbox so can be cached even if we pan around
    gdf = loading_slow_geodataframe(data_path)
    gdf_in_bbox = gdf[gdf.geometry.within(bbox.iloc[0].geometry)]

    return gdf_in_bbox
```

Defining your cache lifetime: cache_max_age

You can define how long to keep your cache data for with cache_max_age:

```python
@fused.udf
def udf(data_path: str = None):

    @fused.cache(
        cache_max_age="24h"  # Your cache will stay available for 24h
    )
    def loading_slow_geodataframe(data_path):
        ...
        return gdf

    gdf = loading_slow_geodataframe(data_path)

    return gdf
```

This also works with @fused.udf() & fused.run():

```python
@fused.udf(cache_max_age='24h')  # This UDF will be cached for 24h after its initial run
def udf(path):
    import geopandas as gpd

    gdf = gpd.read_file(path)

    return gdf
```

This UDF will be cached from the moment it's executed with fused.run(udf) for as long as is defined in cache_max_age:

```python
fused.run(udf)
```

If you run fused.run(udf) again with no changes to udf, then for the next 24h fused.run(udf) will return a cached result. This is both faster & cheaper (saving on compute) while giving you control over how long to keep your cache for.

You can also overwrite the cache_max_age defined in udf when running your UDF:

```python
fused.run(udf, cache_max_age='12h')
```

udf results will now only be cached for 12h, even if udf was defined with a cache_max_age of 24h.

The age of your cache is defined as follows:

  • By default, a UDF is cached for 90 days.
  • If @fused.udf(cache_max_age) is defined, this cache age overwrites the default.
  • If fused.run(udf, cache_max_age) is passed, this cache age takes priority over both the default and @fused.udf(cache_max_age).
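The precedence above can be sketched as a small resolver. This is a hypothetical helper, not part of the Fused SDK; it parses age strings like "24h" or "90d" (the only formats shown on this page) and lets the fused.run() argument win:

```python
from datetime import timedelta

def parse_age(age: str) -> timedelta:
    # Handles the "<number><unit>" strings used above, e.g. "24h" or "90d"
    value, unit = int(age[:-1]), age[-1]
    return timedelta(**{{"h": "hours", "d": "days"}[unit]: value})

def effective_max_age(udf_age=None, run_age=None) -> timedelta:
    # Priority: fused.run(...) argument > @fused.udf(...) argument > 90-day default
    return parse_age(run_age or udf_age or "90d")

print(effective_max_age())                              # default: 90 days
print(effective_max_age(udf_age="24h"))                 # decorator overrides default
print(effective_max_age(udf_age="24h", run_age="12h"))  # run argument wins
```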