How to run in parallel
Need to process hundreds or thousands of inputs? fused.submit() runs a UDF over multiple inputs in parallel, spinning up separate instances for each job.
This guide covers best practices for scaling parallel workloads. For the full fused.submit() reference, see Run UDFs in parallel.
inputs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
results = fused.submit(udf, inputs)
Best practices
Start small, then scale
Don't immediately spin up 1000 jobs. Test progressively:
# First test with 5 inputs
results = fused.submit(udf, inputs[:5])
# Then 10, then 50, then scale up
results = fused.submit(udf, inputs[:50])
Target 30-45s per job
Each parallel job has a 120s timeout. Aim for 30-45s per job to leave safety margin for slower runs.
If your jobs consistently hit the timeout, either:
- Break them into smaller chunks
- Use
instance_typefor larger dedicated machines (see How to run a batch job)
Test first with debug mode
Run only the first input to verify your setup works:
result = fused.submit(udf, inputs, debug_mode=True)
Check timing
Monitor how long each job takes:
job = fused.submit(udf, inputs[:5], collect=False)
job.wait()
print(job.times()) # Time per job
Error handling
By default, errors aren't cached. If a job fails (e.g., API timeout), it will retry fresh on the next run. See Caching for more on how caching works.
However, if you wrap errors in try/except and return a result, that result gets cached:
@fused.udf
def udf(url: str):
try:
return fetch_data(url)
except Exception as e:
return {"error": str(e)} # This gets cached!
Use ignore_exceptions=True to skip failed jobs when collecting results:
results = fused.submit(udf, inputs, ignore_exceptions=True)
When to use batch instances
If your jobs need more than 120s or ~4GB RAM, use instance_type:
results = fused.submit(
udf,
inputs,
instance_type="large",
collect=False
)
See How to run a batch job for details on batch instances.
Example use cases
- Climate Dashboard — Processing 20TB of data in minutes
- Dark Vessel Detection — Retrieving 30 days of AIS data
- Satellite Imagery — Processing Maxar's Open Data STAC Catalogs
See also
- Run UDFs in parallel — full
fused.submit()reference - Caching — how caching works