How to run in parallel

Need to process hundreds or thousands of inputs? fused.submit() runs a UDF over multiple inputs in parallel, spinning up a separate instance for each job.

This guide covers best practices for scaling parallel workloads. For the full fused.submit() reference, see Run UDFs in parallel.

@fused.udf
def udf(n: int):
    return n * 2

inputs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
results = fused.submit(udf, inputs)

Best practices

Start small, then scale

Don't immediately spin up 1000 jobs. Test progressively:

# First test with 5 inputs
results = fused.submit(udf, inputs[:5])

# Then 10, then 50, then scale up
results = fused.submit(udf, inputs[:50])

Target 30-45s per job

Each parallel job has a 120s timeout. Aim for 30-45s per job to leave safety margin for slower runs.

If your jobs consistently hit the timeout, either reduce the work each job does (for example, pass fewer inputs per job) or switch to batch instances (see "When to use batch instances" below).
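One way to shrink the work per job is to split the input list into chunks and submit one chunk per job. A minimal sketch — the chunk helper and chunk size below are illustrative, not part of the fused API:

```python
def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# 1000 inputs -> 100 jobs of 10 inputs each; pick the size so one
# chunk finishes in roughly 30-45s.
chunks = chunk(list(range(1000)), 10)
# results = fused.submit(udf, chunks)  # each job processes one chunk
```

Tune the chunk size empirically: time a single chunk first, then adjust so each job lands in the 30-45s sweet spot.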

Test first with debug mode

Run only the first input to verify your setup works:

result = fused.submit(udf, inputs, debug_mode=True)

Check timing

Monitor how long each job takes:

job = fused.submit(udf, inputs[:5], collect=False)
job.wait()
print(job.times()) # Time per job
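Assuming the timings come back as per-job durations in seconds (check the fused.submit() reference for the exact shape), you can flag jobs running too close to the 120s limit:

```python
# Hypothetical per-job durations in seconds, as returned by job.times()
times = [31.2, 38.5, 44.9, 97.0, 36.1]

# Flag anything above the 30-45s sweet spot; jobs near 120s risk timing out
slow = [t for t in times if t > 45]
print(f"{len(slow)} of {len(times)} jobs exceeded 45s: {slow}")
```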

Error handling

By default, errors aren't cached. If a job fails (e.g., API timeout), it will retry fresh on the next run. See Caching for more on how caching works.

However, if you wrap errors in try/except and return a result, that result gets cached:

@fused.udf
def udf(url: str):
    try:
        return fetch_data(url)
    except Exception as e:
        return {"error": str(e)}  # This gets cached!
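To see why this is a trap, here is the same pattern with a plain functools.lru_cache — not fused's cache, but the semantics are analogous:

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)
def fetch(x):
    """Simulate a fetch that fails transiently on the first call."""
    global calls
    calls += 1
    if calls == 1:
        return {"error": "timeout"}  # transient failure, returned instead of raised
    return {"data": x}

fetch(1)           # first call fails; the error dict is cached as the "result"
result = fetch(1)  # retry returns the cached error and never re-fetches
```

Because the error was returned rather than raised, the cache treats it as a valid result, so `result` is still `{"error": "timeout"}` on every subsequent call.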
Tip: Use ignore_exceptions=True to skip failed jobs when collecting results:

results = fused.submit(udf, inputs, ignore_exceptions=True)

When to use batch instances

If your jobs need more than 120s or ~4GB RAM, use instance_type:

results = fused.submit(
udf,
inputs,
instance_type="large",
collect=False
)

See How to run a batch job for details on batch instances.
