Small UDF run

Fused UDF functions really shine once you start calling them from anywhere. You can call small jobs in 2 main ways:

fused.run() in Python. All you need is the fused Python package installed
- Useful when wanting to run a UDF as part of another pipeline, inside another UDF or anywhere in Python / code.
HTTPS call from anywhere
- Useful when you want to call a UDF outside of Python. For example, receiving dataframes into Google Sheets or plotting points and images in a Felt map

Defining "Small" job

"Small" jobs are defined as any job being:

Less than 120s to execute
Using less than a few Gb of RAM to run

These jobs run in "real-time" with no start-up time, so are quick to run, but with limited resources and time-out if taking too long.

`fused.run()`

fused.run() is the simplest & most common way to execute a UDF from any Python script or notebook.

The simplest way to call a public UDF is using a public UDF name and calling it as: UDF_ + name. Let's take this UDF that returns the location of the Eiffel Tower in a GeoDataFrame as an example:

import fused
fused.run("UDF_Single_point_Eiffel_Tower")

Simple UDF fused.run() returning a geodataframe

There are a few other ways to run a UDF:

Name (from your account)

When to use: When calling a UDF you made, from your own account.

You can call any UDFs you have made simply by referencing it by name (given when you save a UDF).

(Note: This requires authentication)

Hello World UDF

This UDF can then be run in a notebook locally (granted that you have authenticated):

fused.run("Hello_World_bbox")

Running Hello World UDF

Name (from your teammate's account)

When to use: When calling a UDF someone on your team made, from your own account.

Similarly, you can reference by name and run any UDFs under your teammates' accounts. Simply prefix the UDF name with the person's email address, separated by a /.

fused.run("teammate@fused.io/Hello_World_bbox")

Note that both your and your teammate's accounts must belong to the same organization.

Public UDF Name

When to run: Whenever you want to run a public UDF for free from anywhere

Any UDF saved in the public UDF repo can be run for free.

Reference them by prefixing their name with UDF_. For example, the public UDF Get_Isochrone is run with UDF_Get_Isochrone:

fused.run('UDF_Get_Isochrone')

Token

When to use: Whenever you want someone to be able to execute a UDF but might not want to share the code with them.

You can get the token from a UDF either in Workbench (Save your UDF then click "Share") or by returning the token in Python.

Here's a toy UDF that we want others to be able to run, but we don't want them to see the code:

import fused

@fused.udf()
def my_super_duper_private_udf(my_udf_input):
    import pandas as pd
    # This code is so private I don't want anyone to be able to read it
    return pd.DataFrame({"input": [my_udf_input]})

We then need to save this UDF to Fused server to make it accessible from anywhere.

my_super_duper_private_udf.to_fused()

note

my_udf.to_fused() saves your UDF to your personal user UDFs. These are private to you and your team. You can create a token that anyone (even outside your team) can use to run your UDF but by default these UDFs are private.

We can create a token for this my_super_duper_private_udf and share it:

Python
Workbench

token = my_super_duper_private_udf.create_access_token()
print(token)

This would return something like: 'fsh_**********q6X' (You can recognise this to be a shared token because it starts with fsh_)

Once you have your 'fsh_***' token, you can use it to run your UDF:

fused.run(token, my_udf_input="I'm directly using the token object")

or directly:

fused.run('fsh_**********q6X', my_udf_input="I can't see your private UDF but can still run it")

UDF object

When to run: When you're writing your UDF in the same Python file / jupyter notebook and want to refer to the Python object directly. You might want to do this to test your UDF works locally for example

You may also pass a UDF Python object to fused.run:

# Running a local UDF
@fused.udf
def local_udf():
    import pandas as pd
    return pd.DataFrame({})

# Note that by default fused.run() will run your UDF on the Fused serverless server so we pass engine='local' to run this as a normal Python function
fused.run(local_udf, engine='local')

Github URL

When to use: [Not recommended] This is useful if you're working on a branch that you control. This method always points to the last commit on a branch so your UDF can break without you knowing if someone else pushes a new commit or merges & deletes your branch

gh_udf = fused.load("https://github.com/fusedio/udfs/tree/main/public/REM_with_HyRiver/")
fused.run(gh_udf)

warning

We do NOT recommend you use this approach as your UDF might break if changes are done to it

Especially using a URL pointing to a main branch means that your UDF will change if someone else pushes to it, in a way that isn't visible to you.

For that reason we recommend using a git commit hash instead

Git commit hash (recommended for most stable use cases)

When to use: Whenever you want to rely on a UDF such as in production or when using a UDF as a building block for another UDF.

This is the safest way to use a UDF. Since you're pointing to a specific git commit hash, you won't end up with changes breaking your UDF.

Using a git commit hash is the safest, and thus recommended way to call UDFs from Github.

This does mean you need to update the commit where your UDFs are being called if you want to propagate updates. But this gives you the most amount of control.

Let's again take the example of the Simple Eiffel Tower UDF:

Running Hello World UDF

commit_hash = "bdfb4d0"
commit_udf = fused.load(f"https://github.com/fusedio/udfs/tree/{commit_hash}/public/Single_point_Eiffel_Tower/")
fused.run(commit_udf)

Team UDF Names

Team UDFs can be loaded or run by specifying the name "team", as in:

fused.load("team/udf_name")

This can be helpful when collaborating with team members as this does not require making a shared token

Execution engines

fused.run can run the UDF in various execution modes, as specified by the engine parameter either local or remote

local: Run in the current process.
remote: Run in the serverless Fused cloud engine (this is the default).

# By default, fused.run will use the remote engine
fused.run(my_udf)

# To run locally, explicitly specify engine="local"
fused.run(my_udf, engine="local")

warning

⚠️ Important change: fused.run() now defaults to engine="remote" in all cases, even when users are not authenticated. Previously, it would default to engine="local" for unauthenticated users. If you are not authenticated, you must explicitly specify engine="local" to run UDFs locally.

Set sync=False to run a UDF asynchronously.

Passing arguments in `fused.run()`

A typical fused.run() call of a UDF looks like this:

@fused.udf
def my_udf(inputs: str):
    import pandas as pd
    return pd.DataFrame({"output": [inputs]})

fused.run(my_udf, inputs="hello world")

A fused.run() call will require the following arguments:

[Mandatory] The first argument needs to be the UDF to run (name, object, token, etc as seen above)
[Optional] Any arguments of the UDF itself (if it has any). In the example above that's inputs because my_udf takes inputs as an argument.
[Optional] Any protected arguments as seen in the dedicated API docs page (if applicable). These include for example:
- bounds -> A geographical bounding box (as a list of 4 points: [min_x, min_y, max_x, max_y]) defining the area of interest.
- cache_max_age -> The maximum age of the UDF's cache.
- dtype_out_raster -> The output raster format. Defaults to "tif". Change this to "png" for a simple preview of your image.
- dtype_out_vector -> The output vector format. Defaults to "parquet".

Examples of using parameters

Changing the default cache max age:

fused.run("UDF_CDLs_Tile_Example", cache_max_age='1d')

Calling a public UDF that returns a raster, this might be helpful if you want a simple visualization of your raster without needing other requirements like rasterio & rioxxaray:

fused.run("UDF_CDLs_Tile_Example", dtype_out_raster='png')

Running jobs in parallel: `fused.submit()`

Sometimes you want to run a UDF over a list of inputs (for example running a UDF that unzips a file over a list of files). If each run itself is quite small, then you can run a batch of UDFs over a list of inputs.

Let's use a simple UDF to demonstrate:

@fused.udf
def udf(val):
    import pandas as pd
    return pd.DataFrame({'val':[val]})

Say we wanted to run this udf 10 times over:

inputs = [0,1,2,3,4,5,6,7,8,9]

`fused.submit()`

Fused is built to help you scale your processing to huge datasets and the core of this ability is fused.submit(). You can run 1 UDF over a large number of arguments:

results = fused.submit(udf, inputs)

>>> 100%|██████████| 10/10 

fused.submit() runs all these jobs in parallel and defaults to directly returning the results back to you as a dataframe:

results

>>>
 		val
val
0	0	0
1	0	1
...

Tips for using `fused.submit()`

Check that your parameters are correctly setup with debug_mode=True (more detail below)

single_run = fused.submit(udf, inputs, debug_mode=True)

Start with a small number of jobs, then scale up

# Assuming inputs is a list
results = fused.submit(udf, inputs[:10])

Check the runtime of each job:

# Run only 10 jobs and see how long each one took
results = fused.submit(udf, inputs[:10], collect=False)
results.times()

Job length rule of thumb: 30-45s

Aim for a single UDF job that takes 30-45s. This gives you a "safety" margin as UDFs will timeout after 120s. So jobs can still take a bit longer and not time out.

Read the Best Practices page for more tips on writing efficient UDFs

Advanced `fused.submit()` options

Blocking vs non-blocking calls

By default we've set up fused.submit() to be blocking, meaning it will wait for all the jobs to finish before returning the results.

However, you can set collect=False and then track the progress of jobs as they run:

results = fused.submit(udf, inputs, collect=False)

Real time logs

Show a progress bar of number of jobs completed:

results.wait()

100%|██████████| 10/10 [00:01<00:00, 9.31it/s]

Show total time it took to run all the jobs:

results.total_time()

>>> datetime.timedelta(seconds=1, microseconds=96764)

Check the first error that occurred:

results.first_error()

>>> fused.types.UdfRuntimeError("[Run #0 {'val': 3'}] my error message here...")

Get your data back as a dataframe:

results.collect()

>>>
 		val
val
0	0	0
1	0	1
...

Debug mode

Sometimes you might just want to make sure your code is running correctly before kicking off a large number of jobs. That's what Debug Mode allows you to do:

results = fused.submit(udf, inputs, debug_mode=True)

This will run the first item in inputs directly using fused.run() (equivalent to fused.run(udf, inputs[0])) and then return the results:

>>>
	val
0	0

You can then set debug_mode back to False and be more confident that your UDF is working as expected!

Execution parameters

fused.submit() also have parameters giving you more control over the execution. See the Python SDK docs page for more details:

max_workers: The number of workers to use for the job pool.
engine: local or remote (default is remote). Just like fused.run(), by default fused.submit() will run the UDF in the Fused server (engine='remote'). You can set engine='local' to run udf locally either on your machine or inside a large machine that you spin up.
max_retry: The maximum number of retries for a job.

Benchmarking

Simple fused.submit() Benchmark

fused.submit(udf) runs all the UDF calls in parallel, making it a helpful tool to run multiple UDFs all at once.

We can demonstrate this by adding a simple time.sleep(1) in our original UDF:

@fused.udf
def udf(val):
    import pandas as pd
    import time
    time.sleep(1)
    return pd.DataFrame({'val':[val]})

In a notebook, we can time how long each cell takes to execute with the %%time magic command

# In a jupyter notebook
%%time
fused.run(udf, val=1)

Singe run

This takes 2s: A few ms of overhead to send the UDF to Fused server & run + 1s of time.sleep(1)

Now using fused.submit() to run this over 50 UDFs:

30 runs

This takes a few more seconds, but not 100s. fused.submit() is a helpful way to scale a single UDF to many inputs in a timely manner.

Example use cases

fused.submit() is used in many places across our docs, here are some examples:

⛴️ In the Dark Vessel Detection example to scale retrieving daily AIS .zip files from NOAA over 30 days.
🛰️ Retrieving all of Maxar's Open Data STAC Catalogs across every events they have imagery for.
💡 Check the Best Practices for more on when to use submit() and when to use other methods.

HTTPS requests

You can call Saved UDFs via HTTPs calls, effecitvaly turning your data in an API!

The same can be done with the Fused Python SDK.

Shared token

When you save a UDF for the first time, it by default creates a shared token for you, something like:

fsh_**********h4t

This is a unique token that is used to identify your UDF and call it via HTTPS requests

Here is what a UDF endpoint looks like:

https://www.fused.io/server/v1/realtime-shared/******/run/file?dtype_out_raster=png

Manage your account's shared tokens in here.

Passing parameters

You can also pass parameters to your UDF via the URL by adding query parameters (&param=value) to the URL.

For example a UDF that takes lat and lon as parameters:

@fused.udf
def udf(lat, lon):
    return pd.DataFrame({'lat': [lat], 'lon': [lon]})

Could be called with:

https://www.fused.io/...&lat=37.7749&lon=-121.4194

Tiling

You can integrate your UDF as a vector or raster tile server y adding tiles path parameter, followed by templated /{z}/{x}/{y} path parameters.

You need to make sure:

Your UDF is set to Tile mode
You properly pass /{z}/{x}/{y} instead of default value
Depending on your server type, you might need to add dtype_out_raster or dtype_out_vector parameters. For example QGIS requires you to use dtype_out_vector=mvt for vector tile servers

https://www.fused.io/server/v1/realtime-shared/******/run/tiles/{z}/{x}/{y}?dtype_out_raster=png

See our integration guides for examples.

Private token

Calling UDFs with Bearer authentication requires an account's private token. The URL structure to run UDFs with the private token varies slightly, as the URL specifies the UDF's name and the owner's user account.

curl -XGET "https://app.fused.io/server/v1/realtime/fused/api/v1/run/udf/saved/user@fused.io/caltrain_live_location?dtype_out_raster=png" -H "Authorization: Bearer $ACCESS_TOKEN"

Do not share your Bearer token

Do not share your Bearer token with anyone. These allow to impersonate your account and should be treated as such.

Response data types

The dtype_out_vector and dtype_out_raster parameters define the output data type for vector tables and raster arrays, respectively.

The supported types for vector tables are parquet, geojson, json, feather, csv, mvt, html, excel, and xml.
For raster array: png, gif, jpg, jpeg, webp, tif, and tiff.

https://www.fused.io/server/v1/realtime-shared/****/run/file?dtype_out_raster=png

Read how to structure HTTPS endpoints to call the UDF as a Map Tile & File.

Caching responses

If a UDF's cache is enabled, its endpoints cache outputs for each combination of code and parameters. The first call runs and caches the UDF, and subsequent calls return cached data.

Defining "Small" job​

fused.run()​

Name (from your account)​

Name (from your teammate's account)​

Public UDF Name​

Token​

UDF object​

Github URL​

Git commit hash (recommended for most stable use cases)​

Team UDF Names​

Execution engines​

Passing arguments in fused.run()​

Examples of using parameters​

Running jobs in parallel: fused.submit()​

fused.submit()​

Tips for using fused.submit()​

Advanced fused.submit() options​

Blocking vs non-blocking calls​

Real time logs​

Debug mode​

Execution parameters​

Benchmarking​

Example use cases​

HTTPS requests​

Shared token​

Passing parameters​

Tiling​

Private token​

Response data types​

Caching responses​