submit
submit(
udf: AnyBaseUdf | FunctionType | str,
arg_list: list | pd.DataFrame,
*,
engine: Literal["remote", "local"] | None = "remote",
instance_type: InstanceType | None = None,
max_workers: int | None = None,
n_processes_per_worker: int | None = None,
max_retry: int = 2,
debug_mode: bool = False,
collect: bool = True,
execution_type: ExecutionType = "thread_pool",
cache_max_age: str | None = None,
cache: bool = True,
ignore_exceptions: bool = False,
flatten: bool = True,
**kwargs
) -> JobPool | ResultType | pd.DataFrame
Executes a user-defined function (UDF) multiple times for a list of input parameters. When collect=True (the default), it waits for all runs to complete and returns the collected results; when collect=False, it immediately returns a "lazy" JobPool object that lets you inspect the jobs and wait on their results.
Each individual UDF run is cached following the same caching logic as fused.run(), honoring the specified cache_max_age. Additionally, when collect=True (the default), the collected results are cached locally for the duration of cache_max_age, or 12h by default.
Parameters
- udf (AnyBaseUdf | FunctionType | str) – The UDF to execute. See fused.run for more details on how to specify the UDF.
- arg_list (list | pd.DataFrame) – The input parameters for the UDF. Can be specified as:
  - a list of values for parametrizing over a single parameter
  - a list of dictionaries for parametrizing over multiple parameters
  - a DataFrame for parametrizing over multiple parameters, where each row is one set of parameters (see the DataFrame example below)
- engine (Literal['remote', 'local'] | None) – The execution engine to use. Defaults to 'remote'.
- instance_type (InstanceType | None) – The type of instance to use for remote execution ('realtime', 'small', 'medium', 'large', or one of the whitelisted instance types). Defaults to 'realtime'.
- max_workers (int | None) – The maximum number of workers to use. Defaults to 32 for realtime instances (max 1024) and 1 for batch instances (max 5).
- n_processes_per_worker (int | None) – The number of processes to use per worker. Defaults to 1 for realtime instances and to the number of cores for batch instances.
- max_retry (int) – The maximum number of retries for failed jobs. Defaults to 2.
- debug_mode (bool) – If True, executes only the first item in arg_list directly using fused.run(), which is useful for debugging. Defaults to False.
- collect (bool) – If True, waits for all jobs to complete and returns the collected results. If False, returns a JobPool object. Defaults to True.
- execution_type (ExecutionType) – The type of batching to use: either "thread_pool" (the default) or "async_loop".
- cache_max_age (str | None) – The maximum age when returning a result from the cache.
- cache (bool) – Set to False to disable caching.
- ignore_exceptions (bool) – Set to True to ignore exceptions when collecting results. Defaults to False.
- flatten (bool) – Set to True to receive a flat DataFrame of results. Defaults to True.
- **kwargs – Additional (constant) keyword arguments to pass to the UDF (see the example below).
Returns
JobPool | ResultType | pd.DataFrame – A JobPool when collect=False; otherwise the collected results, a pd.DataFrame by default, depending on the collect, debug_mode, and flatten parameters.
Examples
Run a UDF multiple times for the values 0 to 9:
df = fused.submit("username@fused.io/my_udf_name", range(10))
Using the async_loop execution type:
df = fused.submit(udf, range(10), execution_type="async_loop")
Being explicit about the parameter name:
df = fused.submit(udf, [dict(n=i) for i in range(10)])
Get the pool of ongoing tasks:
pool = fused.submit(udf, [dict(n=i) for i in range(10)], collect=False)
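Parametrizing over multiple parameters with a DataFrame, where each row is one set of keyword arguments (a sketch; the column names x and y are hypothetical and must match the UDF's parameter names):
import pandas as pd

args = pd.DataFrame({"x": range(10), "y": range(10)})  # each row runs udf(x=..., y=...)
df = fused.submit(udf, args)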
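Passing a constant keyword argument to every run via **kwargs (threshold here is a hypothetical UDF parameter):
df = fused.submit(udf, range(10), threshold=0.5)  # every run receives threshold=0.5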
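Controlling the cache, as described above (the "24h" duration string is an assumption modeled on the 12h default):
df = fused.submit(udf, range(10), cache_max_age="24h")  # reuse cached results up to a day old
df = fused.submit(udf, range(10), cache=False)  # bypass the cache entirely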
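Running on larger batch instances with an explicit worker count (within the batch maximum of 5):
df = fused.submit(udf, range(10), instance_type="large", max_workers=5)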
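Debugging the first item with fused.run() before launching the full pool, then collecting while skipping failed runs:
result = fused.submit(udf, [dict(n=i) for i in range(10)], debug_mode=True)  # runs only dict(n=0)
df = fused.submit(udf, [dict(n=i) for i in range(10)], ignore_exceptions=True)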
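A sketch of consuming the pool afterwards; wait() and df() are assumptions about the JobPool interface, so check the JobPool reference for the exact method names:
pool = fused.submit(udf, [dict(n=i) for i in range(10)], collect=False)
pool.wait()  # assumed: block until all jobs finish
df = pool.df()  # assumed: gather results into a DataFrame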