Top-Level Functions

@fused.udf

udf(
    fn: Optional[Callable] = None,
    *,
    name: Optional[str] = None,
    cache_max_age: Optional[str] = None,
    instance_type: Optional[str] = None,
    default_parameters: Optional[Dict[str, Any]] = None,
    headers: Optional[Sequence[Union[str, Header]]] = None,
    **kwargs: dict[str, Any]
) -> Callable[..., Udf]

A decorator that transforms a function into a Fused UDF.

Parameters:

name (Optional[str]) – The name of the UDF object. Defaults to the name of the function.
cache_max_age (Optional[str]) – The maximum age when returning a result from the cache.
instance_type (Optional[str]) – The default type of instance to use for remote execution ('realtime' or 'batch').
default_parameters (Optional[Dict[str, Any]]) – Parameters to embed in the UDF object, separately from the arguments list of the function. Defaults to None for empty parameters.
headers (Optional[Sequence[Union[str, Header]]]) – A list of files to include as modules when running the UDF. For example, when specifying headers=['my_header.py'], inside the UDF function it may be referenced as:
```
import my_header
my_header.my_function()
```
Defaults to None for no headers.

Returns:

Callable[..., Udf] – A callable that represents the transformed UDF. This callable can be used within GeoPandas workflows to apply the defined operation on geospatial data.

Examples:

To create a simple UDF that calls a utility function (table_to_tile) to load the part of a table that falls within bbox and return it as a GeoDataFrame:

@fused.udf
def udf(bbox, table_path="s3://fused-asset/infra/building_msft_us"):
    ...
    gdf = table_to_tile(bbox, table=table_path)
    return gdf

@fused.cache

cache(
    func: Callable[..., Any] | None = None,
    cache_max_age: str | int = DEFAULT_CACHE_MAX_AGE,
    cache_folder_path: str = "tmp",
    concurrent_lock_timeout: str | int = 120,
    cache_reset: bool | None = None,
    cache_storage: StorageStr | None = None,
    cache_key_exclude: Iterable[str] = None,
    cache_verbose: bool | None = None,
    **kwargs: Any
) -> Callable[..., Any]

Decorator to cache the return value of a function.

This function serves as a decorator that can be applied to any function to cache its return values. The cache behavior can be customized through keyword arguments.

Parameters:

func (Callable) – The function to be decorated. If None, this returns a partial decorator with the passed keyword arguments.
cache_max_age (str | int) – A string with a numbered component and units. Supported units are seconds (s), minutes (m), hours (h), and days (d) (e.g. "48h", "10s", etc.).
cache_folder_path (str) – Folder to append to the configured cache directory.
concurrent_lock_timeout (str | int) – Max amount of time in seconds for subsequent concurrent calls to wait for a previous concurrent call to finish execution and to write the cache file.
cache_reset (bool | None) – Ignore cache_max_age and overwrite cached result.
cache_storage (StorageStr | None) – Set where the cache data is stored. Supported values are "auto", "mount" and "local". Auto will automatically select the storage location defined in options (mount if it exists, otherwise local) and ensures that it exists and is writable. Mount gets shared across executions where local will only be shared within the same execution.
cache_key_exclude (Iterable[str]) – An iterable of parameter names to exclude from the cache key calculation. Useful for arguments that do not affect the result of the function and could cause unintended cache expiry (e.g. database connection objects).
cache_verbose (bool | None) – Print a message when a cached result is returned.

Returns:

Callable[..., Any] – A decorator that, when applied to a function, caches its return values according to the specified keyword arguments.
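
The interaction between cache_max_age and cache_key_exclude can be illustrated with a small in-memory stand-in. This is a sketch only, not the implementation of fused.cache: toy_cache and parse_max_age are hypothetical names, the "1h" default is arbitrary, integers are assumed to mean seconds, and results are kept in a dictionary rather than on disk.

```python
import functools
import inspect
import time

_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_max_age(value):
    """Convert an int (assumed seconds) or a string such as "48h" into seconds."""
    if isinstance(value, int):
        return value
    number, unit = value[:-1], value[-1]
    if unit not in _UNITS or not number.isdigit():
        raise ValueError(f"unsupported max age: {value!r}")
    return int(number) * _UNITS[unit]


def toy_cache(cache_max_age="1h", cache_key_exclude=(), cache_reset=False):
    """In-memory stand-in for a caching decorator (hypothetical, not fused.cache)."""
    max_age = parse_max_age(cache_max_age)
    excluded = set(cache_key_exclude)

    def decorator(func):
        store = {}  # cache key -> (timestamp, value)
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            # Only arguments that are not excluded contribute to the key.
            key = tuple(
                (name, repr(value))
                for name, value in bound.arguments.items()
                if name not in excluded
            )
            hit = store.get(key)
            if hit and not cache_reset and time.time() - hit[0] < max_age:
                return hit[1]
            value = func(*args, **kwargs)
            store[key] = (time.time(), value)
            return value

        return wrapper

    return decorator


calls = []


@toy_cache(cache_max_age="10s", cache_key_exclude=["connection"])
def query(table, connection=None):
    calls.append(table)
    return f"rows from {table}"


query("buildings", connection=object())
query("buildings", connection=object())  # different connection object, same key
```

Because connection is excluded from the key, the second call is served from the cache even though a different object is passed.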

Examples:

Use the @cache decorator to cache the return value of a function in a custom folder under the configured cache directory.

@cache(cache_folder_path="custom_path")
def expensive_function():
    # Function implementation goes here
    result = ...  # placeholder for the computed value
    return result

If the output of a cached function changes, for example if remote data is modified, it can be reset by running the function with the cache_reset keyword argument. Afterward, the argument can be cleared.

@cache(cache_folder_path="custom_path", cache_reset=True)
def expensive_function():
    # Function implementation goes here
    result = ...  # placeholder for the computed value
    return result

fused.load

load(
    url_or_udf: Union[str, Path],
    /,
    *,
    cache_key: Any = None,
    import_globals: bool = True,
) -> AnyBaseUdf

Loads a UDF from one of several sources: a GitHub URL, a Fused platform-specific identifier, a local file path, or raw UDF code.

This function supports loading UDFs from a GitHub repository URL or from a Fused platform-specific identifier composed of an email and a UDF name. It determines the source type from the format of the input and retrieves the UDF accordingly.

Parameters:

url_or_udf (Union[str, Path]) – A string representing the location of the UDF, or the raw code of the UDF. The location can be a GitHub URL starting with "https://github.com", a Fused platform-specific identifier in the format "email/udf_name", or a local file path pointing to a Python file.
cache_key (Any) – An optional key used for caching the loaded UDF. If provided, the function will attempt to load the UDF from cache using this key before attempting to load it from the specified source. Defaults to None, indicating no caching.
import_globals (bool) – Expose the globals defined in the UDF's context as attributes on the UDF object (default True). This requires executing the code of the UDF. To globally configure this behavior, use fused.options.never_import.
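
As a rough illustration of how an input could be classified by its format, the sketch below distinguishes the four documented forms. It is not how fused.load is implemented: classify_udf_source is a hypothetical helper, and the heuristics (a newline meaning raw code, an "@" before the slash meaning an email-style identifier) are assumptions.

```python
from pathlib import Path


def classify_udf_source(url_or_udf):
    """Return a label for the kind of source a string or Path appears to be."""
    if isinstance(url_or_udf, Path):
        return "local_path"
    text = url_or_udf.strip()
    if text.startswith("https://github.com"):
        return "github_url"
    # Assumption: multi-line text is raw UDF code rather than a location.
    if "\n" in text:
        return "raw_code"
    head, sep, tail = text.partition("/")
    # Assumption: "email/udf_name" has an "@" before the only slash.
    if sep and "@" in head and tail and "/" not in tail:
        return "fused_identifier"
    return "local_path"


print(classify_udf_source("username@fused.io/REM_with_HyRiver"))
```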

Returns:

AnyBaseUdf – An instance of the loaded UDF.

Raises:

ValueError – If the URL or Fused platform-specific identifier format is incorrect or cannot be parsed.
Exception – For errors related to network issues, file access permissions, or other unforeseen errors during the loading process.

Examples:

Load a UDF from a GitHub URL:

udf = fused.load("https://github.com/fusedio/udfs/tree/main/public/REM_with_HyRiver/")

Load a UDF using a Fused platform-specific identifier:

udf = fused.load("username@fused.io/REM_with_HyRiver")

fused.run

run(
    udf: Union[str, None, UdfJobStepConfig, Udf, UdfAccessToken] = None,
    *,
    x: Optional[int] = None,
    y: Optional[int] = None,
    z: Optional[int] = None,
    sync: bool = True,
    engine: Optional[Literal["remote", "local"]] = None,
    instance_type: Optional[InstanceType] = None,
    type: Optional[Literal["tile", "file"]] = None,
    max_retry: int = 0,
    cache_max_age: Optional[str] = None,
    cache: bool = True,
    parameters: Optional[Dict[str, Any]] = None,
    _return_response: Optional[bool] = False,
    _ignore_unknown_arguments: bool = False,
    _cancel_callback: Callable[[], bool] | None = None,
    **kw_parameters: Callable[[], bool] | None
) -> Union[
    ResultType,
    Coroutine[ResultType, None, None],
    UdfEvaluationResult,
    Coroutine[UdfEvaluationResult, None, None],
]

Executes a user-defined function (UDF) with various execution and input options.

This function supports executing UDFs in different environments (local or remote), with different types of inputs (tile coordinates, geographical bounding boxes, etc.), and allows for both synchronous and asynchronous execution. It dynamically determines the execution path based on the provided parameters.

Parameters:

udf (str, Udf or UdfJobStepConfig) – the UDF to execute. The UDF can be specified in several ways:
- A string representing a UDF name or UDF shared token.
- A UDF object.
- A UdfJobStepConfig object for detailed execution configuration.
x, y, z (int) – Tile coordinates for tile-based UDF execution.
sync (bool) – If True, execute the UDF synchronously. If False, execute asynchronously.
engine (Optional[Literal['remote', 'local']]) – The execution engine to use ('remote' or 'local').
instance_type (Optional[InstanceType]) – The type of instance to use for remote execution ('realtime', 'small', 'medium', 'large', or one of the whitelisted instance types). If not specified, the default is taken from the UDF (if specified in the @fused.udf() decorator and the UDF is not run as a shared token); otherwise it defaults to 'realtime'.
type (Optional[Literal['tile', 'file']]) – The type of UDF execution ('tile' or 'file').
max_retry (int) – The maximum number of retries to attempt if the UDF fails. By default does not retry.
cache_max_age (Optional[str]) – The maximum age when returning a result from the cache. Supported units are seconds (s), minutes (m), hours (h), and days (d) (e.g. "48h", "10s", etc.). Default is None so a UDF run with fused.run() will follow cache_max_age defined in @fused.udf() unless this value is changed.
cache (bool) – Set to False as a shortcut for cache_max_age='0s' to disable caching.
verbose – Set to False to suppress any print statements from the UDF.
parameters (Optional[Dict[str, Any]]) – Additional parameters to pass to the UDF.
**kw_parameters – Additional parameters to pass to the UDF.
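
To make the x, y, z tile coordinates concrete, the sketch below converts a tile index into a WGS84 bounding box. It assumes the common XYZ ("slippy map") Web Mercator tiling scheme; the source does not state which scheme Fused uses, and tile_bounds is a hypothetical helper rather than part of the Fused API.

```python
import math


def tile_bounds(x, y, z):
    """Return (west, south, east, north) in degrees for an XYZ Web Mercator tile."""
    n = 2 ** z  # number of tiles along each axis at zoom level z

    def lon(col):
        return col / n * 360.0 - 180.0

    def lat(row):
        # Inverse Web Mercator projection of the tile row's upper edge.
        return math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * row / n))))

    return (lon(x), lat(y + 1), lon(x + 1), lat(y))


west, south, east, north = tile_bounds(0, 0, 0)
print(west, south, east, north)
```

At zoom 0 the single tile spans the full longitude range and the Web Mercator latitude limits of roughly ±85.05 degrees.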

Raises:

ValueError – If the UDF is not specified or is specified in more than one way.
TypeError – If the first parameter is not of an expected type.
Warning – Various warnings are issued for ignored parameters based on the execution path chosen.

Returns:

Union[ResultType, Coroutine[ResultType, None, None], UdfEvaluationResult, Coroutine[UdfEvaluationResult, None, None]] – The result of the UDF execution, which varies based on the UDF and execution path.

Examples:

Run a UDF saved in the Fused system:

fused.run("username@fused.io/my_udf_name")

Run a UDF saved in GitHub:

loaded_udf = fused.load("https://github.com/fusedio/udfs/tree/main/public/Building_Tile_Example")
fused.run(loaded_udf, bbox=bbox)

Run a UDF saved in a local directory:

loaded_udf = fused.load("/Users/local/dir/Building_Tile_Example")
fused.run(loaded_udf, bbox=bbox)

Note

This function dynamically determines the execution path and parameters based on the inputs. It is designed to be flexible and support various UDF execution scenarios.

fused.submit

submit(
    udf: AnyBaseUdf | FunctionType | str,
    arg_list: AnyBaseUdf | FunctionType | str,
    /,
    *,
    engine: Literal["remote", "local"] | None = "remote",
    instance_type: InstanceType | None = None,
    max_workers: int | None = None,
    n_processes_per_worker: int | None = None,
    max_retry: int = 2,
    debug_mode: bool = False,
    collect: bool = True,
    execution_type: ExecutionType = "thread_pool",
    cache_max_age: str | None = None,
    cache: bool = True,
    ignore_exceptions: bool = False,
    flatten: bool = True,
    _before_run: float | None = None,
    _before_submit: float | None = 0.01,
    **kwargs: float | None,
) -> Union[BaseJobPool, ResultType, pd.DataFrame]

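Executes a user-defined function (UDF) once for each entry in a list of input parameters. With collect=True (the default), it waits for all runs and returns the collected results; with collect=False, it immediately returns a "lazy" JobPool object that lets you inspect the jobs and wait on the results.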

Each individual UDF run will be cached following the standard caching logic as with fused.run() and the specified cache_max_age. Additionally, when collect=True (the default), the collected results are cached locally for the duration of cache_max_age or 12h by default.

See fused.run for more details on the UDF execution.

Parameters:

udf (AnyBaseUdf | FunctionType | str) – the UDF to execute. See fused.run for more details on how to specify the UDF.
arg_list – a list of input parameters for the UDF. Can be specified as:
- a list of values for parametrizing over a single parameter, i.e. the first parameter of the UDF
- a list of dictionaries for parametrizing over multiple parameters
- A DataFrame for parametrizing over multiple parameters where each row is a set of parameters
engine (Literal['remote', 'local'] | None) – The execution engine to use. Defaults to 'remote'.
instance_type (InstanceType | None) – The type of instance to use for remote execution ('realtime', 'small', 'medium', 'large', or one of the whitelisted instance types). If not specified, the default is taken from the UDF (if specified in the @fused.udf() decorator and the UDF is not run as a shared token); otherwise it defaults to 'realtime'.
max_workers (int | None) – The maximum number of workers to use. Defaults to 32 for the realtime engine (with a maximum of 1024), and 1 for other batch instances (with a maximum of 5).
n_processes_per_worker (int | None) – The number of processes to use per worker. Always 1 for the realtime engine, but defaults to the number of cores for batch engines. Specify this keyword to reduce the number of processes.
max_retry (int) – The maximum number of retries for failed jobs. Defaults to 2.
debug_mode (bool) – If True, executes only the first item in arg_list directly using fused.run(), useful for debugging UDF execution. Default is False.
collect (bool) – If True, waits for all jobs to complete and returns the collected DataFrame containing the results. If False, returns a JobPool object, which is non-blocking and allows you to inspect the individual results and logs. Default is True.
execution_type (ExecutionType) – The type of batching to use. Either "thread_pool" (default) for ThreadPoolExecutor-based concurrency or "async_loop" for asyncio-based concurrency.
cache_max_age (str | None) – The maximum age when returning a result from the cache. Supported units are seconds (s), minutes (m), hours (h), and days (d) (e.g. "48h", "10s", etc.). Default is None so a UDF run with fused.run() will follow cache_max_age defined in @fused.udf() unless this value is changed.
cache (bool) – Set to False as a shortcut for cache_max_age='0s' to disable caching.
ignore_exceptions (bool) – Set to True to ignore exceptions when collecting results. Runs that result in exceptions will be silently ignored. Defaults to False.
flatten (bool) – Set to True to receive a DataFrame of results, without nesting of a results column, when collecting results. When False, results will be nested in a results column. If the UDF does not return a DataFrame (e.g. it returns a string instead), results will be nested in a results column regardless of this setting. Defaults to True.
**kwargs – Additional (constant) keyword arguments to pass to the UDF.
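
The accepted shapes of arg_list can be pictured as a normalization step followed by a fan-out. The sketch below is an illustration under assumptions, not Fused's implementation: normalize_arg_list and toy_submit are hypothetical names, execution is local in a thread pool, and results come back as a plain list. A DataFrame is not handled here; with pandas it could be reduced to the list-of-dictionaries form via df.to_dict("records").

```python
import inspect
from concurrent.futures import ThreadPoolExecutor


def normalize_arg_list(func, arg_list):
    """Turn arg_list into a list of keyword-argument dictionaries."""
    items = list(arg_list)
    if all(isinstance(item, dict) for item in items):
        return [dict(item) for item in items]
    # Plain values parametrize the first parameter of the function.
    first_param = next(iter(inspect.signature(func).parameters))
    return [{first_param: item} for item in items]


def toy_submit(func, arg_list, max_workers=4, **constant_kwargs):
    """Run func once per entry in threads and return results in input order."""
    calls = normalize_arg_list(func, arg_list)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(func, **call, **constant_kwargs) for call in calls]
        return [future.result() for future in futures]


def scale(n, factor=1):
    return n * factor


print(toy_submit(scale, range(3), factor=10))
print(toy_submit(scale, [dict(n=i) for i in range(3)]))
```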

Returns:

Union[BaseJobPool, ResultType, DataFrame] – JobPool, AsyncJobPool, or DataFrame depending on execution_type and collect parameters

Examples:

Run a UDF multiple times for the values 0 to 9, each passed as the first positional argument of the UDF:

df = fused.submit("username@fused.io/my_udf_name", range(10))

Using async batch type:

df = fused.submit(udf, range(10), execution_type="async_loop")

Being explicit about the parameter name:

df = fused.submit(udf, [dict(n=i) for i in range(10)])

Get the pool of ongoing tasks:

pool = fused.submit(udf, [dict(n=i) for i in range(10)], collect=False)

fused.download

download(url: str, file_path: str, storage: StorageStr = 'auto') -> str

Download a file.

May be called from multiple processes with the same inputs to get the same result.

Fused runs UDFs from top to bottom each time code changes. This means objects in the UDF are recreated each time, which can slow down a UDF that downloads files from a remote server.

💡 Downloaded files are written to a mounted volume shared across all UDFs in an organization. This means that a file downloaded by one UDF can be read by other UDFs.

Fused addresses the latency of downloading files with the download utility function. It stores files in the mounted filesystem so they only download the first time.

💡 Because a Tile UDF runs multiple chunks in parallel, the download function sets a signal lock during the first download attempt, to ensure the download happens only once.
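
The download-once behavior described above can be sketched with a lock file. This is only an illustration: the source does not describe how Fused implements its lock, so the .lock and .part files, the download_once helper, and the injected fetch callable (used here to avoid real network access) are all assumptions.

```python
import os
import tempfile
import time


def download_once(url, file_path, fetch, timeout=120.0):
    """Write fetch(url) to file_path unless it already exists; return file_path."""
    lock_path = file_path + ".lock"
    deadline = time.monotonic() + timeout
    while not os.path.exists(file_path):
        try:
            # O_EXCL makes creation atomic: only one caller obtains the lock.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"timed out waiting for {lock_path}")
            time.sleep(0.05)  # another caller is downloading; wait and re-check
            continue
        try:
            os.close(fd)
            if not os.path.exists(file_path):
                tmp_path = file_path + ".part"
                with open(tmp_path, "wb") as handle:
                    handle.write(fetch(url))
                os.replace(tmp_path, file_path)  # publish the finished file atomically
        finally:
            os.remove(lock_path)
    return file_path


fetched = []


def fake_fetch(url):
    """Stand-in for a real HTTP or S3 request."""
    fetched.append(url)
    return b"payload"


target = os.path.join(tempfile.mkdtemp(), "data.bin")
download_once("https://example.com/data.bin", target, fake_fetch)
download_once("https://example.com/data.bin", target, fake_fetch)  # no second fetch
```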

Parameters:

url (str) – The URL to download.
file_path (str) – The local path where to save the file.
storage (StorageStr) – Set where the cache data is stored. Supported values are "auto", "mount" and "local". Auto will automatically select the storage location defined in options (mount if it exists, otherwise local) and ensures that it exists and is writable. Mount gets shared across executions where local will only be shared within the same execution.

Returns:

str – The path of the downloaded file. The file is downloaded only on the first execution.

Examples:

@fused.udf
def geodataframe_from_geojson():
    import geopandas as gpd
    url = "s3://sample_bucket/my_geojson.zip"
    path = fused.core.download(url, "tmp/my_geojson.zip")
    gdf = gpd.read_file(path)
    return gdf

fused.ingest

ingest(
    input: str | Path | Sequence[str | Path] | gpd.GeoDataFrame,
    output: str | None = None,
    *,
    output_metadata: str | None = None,
    schema: Schema | None = None,
    file_suffix: str | None = None,
    load_columns: Sequence[str] | None = None,
    remove_cols: Sequence[str] | None = None,
    explode_geometries: bool = False,
    drop_out_of_bounds: bool | None = None,
    partitioning_method: Literal["area", "length", "coords", "rows"] = "rows",
    partitioning_maximum_per_file: int | float | None = None,
    partitioning_maximum_per_chunk: int | float | None = None,
    partitioning_max_width_ratio: int | float = 2,
    partitioning_max_height_ratio: int | float = 2,
    partitioning_force_utm: Literal["file", "chunk", None] = "chunk",
    partitioning_split_method: Literal["mean", "median"] = "mean",
    subdivide_method: Literal["area", None] = None,
    subdivide_start: float | None = None,
    subdivide_stop: float | None = None,
    split_identical_centroids: bool = True,
    target_num_chunks: int = 500,
    lonlat_cols: tuple[str, str] | None = None,
    partitioning_schema_input: str | pd.DataFrame | None = None,
    gdal_config: GDALOpenConfig | dict[str, Any] | None = None,
    overwrite: bool = False,
    as_udf: bool = False
) -> GeospatialPartitionJobStepConfig

Ingest a dataset into the Fused partitioned format.

Parameters:

input (str | Path | Sequence[str | Path] | gpd.GeoDataFrame) – A GeoPandas GeoDataFrame or a path to file or files on S3 to ingest. Files may be Parquet or another geo data format.
output (str | None) – Location on S3 to write the main table to.
output_metadata (str | None) – Location on S3 to write the fused table to.
schema (Schema | None) – Schema of the data to be ingested. This is optional and will be inferred from the data if not provided.
file_suffix (str | None) – Filter which files are used for ingestion. If input is a directory on S3, all files under that directory will be listed and used for ingestion. If file_suffix is not None, it will be used to filter paths by checking the trailing characters of each filename. E.g. pass file_suffix=".geojson" to include only GeoJSON files inside the directory.
load_columns (Sequence[str] | None) – Read only this set of columns when ingesting geospatial datasets. Defaults to all columns.
remove_cols (Sequence[str] | None) – The named columns to drop when ingesting geospatial datasets. Defaults to not drop any columns.
explode_geometries (bool) – Whether to unpack multipart geometries to single geometries when ingesting geospatial datasets, saving each part as its own row. Defaults to False.
drop_out_of_bounds (bool | None) – Whether to drop geometries outside of the expected WGS84 bounds. Defaults to True.
partitioning_method (Literal['area', 'length', 'coords', 'rows']) – The method to use for grouping rows into partitions. Defaults to "rows".
- "area": Construct partitions where all contain a maximum total area among geometries.
- "length": Construct partitions where all contain a maximum total length among geometries.
- "coords": Construct partitions where all contain a maximum total number of coordinates among geometries.
- "rows": Construct partitions where all contain a maximum number of rows.
partitioning_maximum_per_file (int | float | None) – Maximum value for partitioning_method to use per file. If None, defaults to 1/10th of the total value of partitioning_method. So if the value is None and partitioning_method is "area", then each file will have no more than 1/10th the total area of all geometries. Defaults to None.
partitioning_maximum_per_chunk (int | float | None) – Maximum value for partitioning_method to use per chunk. If None, defaults to 1/100th of the total value of partitioning_method. So if the value is None and partitioning_method is "area", then each chunk will have no more than 1/100th the total area of all geometries. Defaults to None.
partitioning_max_width_ratio (int | float) – The maximum ratio of width to height of each partition to use in the ingestion process. So for example, if the value is 2, then if the width divided by the height is greater than 2, the box will be split in half along the horizontal axis. Defaults to 2.
partitioning_max_height_ratio (int | float) – The maximum ratio of height to width of each partition to use in the ingestion process. So for example, if the value is 2, then if the height divided by the width is greater than 2, the box will be split in half along the vertical axis. Defaults to 2.
partitioning_force_utm (Literal['file', 'chunk', None]) – Whether to force partitioning within UTM zones. If set to "file", this will ensure that the centroid of all geometries per file are contained in the same UTM zone. If set to "chunk", this will ensure that the centroid of all geometries per chunk are contained in the same UTM zone. If set to None, then no UTM-based partitioning will be done. Defaults to "chunk".
partitioning_split_method (Literal['mean', 'median']) – How to split one partition into children. Defaults to "mean" (this may change in the future).
- "mean": Split each axis according to the mean of the centroid values.
- "median": Split each axis according to the median of the centroid values.
subdivide_method (Literal['area', None]) – The method to use for subdividing large geometries into multiple rows. Currently the only option is "area", where geometries will be subdivided based on their area (in WGS84 degrees).
subdivide_start (float | None) – The value above which geometries will be subdivided into smaller parts, according to subdivide_method.
subdivide_stop (float | None) – The value below which geometries will not be subdivided into smaller parts, according to subdivide_method. Recommended to be equal to subdivide_start. If None, geometries will be subdivided up to a recursion depth of 100 or until the subdivided geometry is rectangular.
split_identical_centroids (bool) – If True, split a partition whose rows have identical centroids (for example, when all geometries in the partition are the same) when it contains more rows than allowed by partitioning_maximum_per_file and partitioning_maximum_per_chunk.
target_num_chunks (int) – The target for the number of files if partitioning_maximum_per_file is None. Note that this number is only a target and the actual number of files generated can be higher or lower than this number, depending on the spatial distribution of the data itself. Defaults to 500, a rough default that suits most cases.
lonlat_cols (tuple[str, str] | None) – Names of longitude, latitude columns to construct point geometries from.

If your point columns are named "x" and "y", then pass:
```
fused.ingest(
    ...,
    lonlat_cols=("x", "y")
)
```
This only applies to reading from Parquet files. For reading from CSV files, pass options to gdal_config.
gdal_config (GDALOpenConfig | dict[str, Any] | None) – Configuration options to pass to GDAL for how to read these files. For all files other than Parquet files, Fused uses GDAL as a step in the ingestion process. For some inputs, like CSV files or zipped shapefiles, you may need to pass some parameters to GDAL to tell it how to open your files.

This config is expected to be a dictionary with up to two keys:
- layer: str. Define the layer of the input file you wish to read when the source contains multiple layers, as in GeoPackage.
- open_options: Dict[str, str]. Pass in key-value pairs with GDAL open options. These are defined on each driver's page in the GDAL documentation. For example, the CSV driver defines these open options you can pass in.
For example, if you're ingesting a CSV file with two columns "longitude" and "latitude" denoting the coordinate information, pass
```
fused.ingest(
    ...,
    gdal_config={
        "open_options": {
            "X_POSSIBLE_NAMES": "longitude",
            "Y_POSSIBLE_NAMES": "latitude",
        }
    }
)
```
overwrite (bool) – If True, overwrite the output directory if it already exists (by first removing all existing content of the directory, i.e. it does not only overwrite conflicting files). Defaults to False.
as_udf (bool) – Return the ingestion workflow as a UDF that can be executed using fused.run(). Local files or python objects passed to input or partitioning_schema_input are still uploaded to S3 first such that those are available when executing the UDF remotely. Defaults to False.
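
To give a feel for how partitioning_method="rows", partitioning_split_method, and split_identical_centroids relate, the sketch below recursively splits a set of centroids until each group is small enough. It is a simplified illustration, not Fused's algorithm: partition_points is a hypothetical function, it works on bare (x, y) tuples, and it always splits along the axis with the larger extent instead of applying the documented width/height ratio thresholds.

```python
from statistics import mean, median


def partition_points(points, max_rows, split_method="mean"):
    """Recursively split (x, y) points until each group has at most max_rows."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    if len(points) <= max_rows:
        return [points]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Split along the axis with the larger extent to keep boxes roughly square.
    axis = 0 if (max(xs) - min(xs)) >= (max(ys) - min(ys)) else 1
    values = xs if axis == 0 else ys
    cut = mean(values) if split_method == "mean" else median(values)
    left = [p for p in points if p[axis] <= cut]
    right = [p for p in points if p[axis] > cut]
    if not left or not right:
        # Identical coordinates cannot be separated spatially; split by count.
        half = len(points) // 2
        left, right = points[:half], points[half:]
    return (partition_points(left, max_rows, split_method)
            + partition_points(right, max_rows, split_method))


grid = [(x, y) for x in range(10) for y in range(10)]
# Mirror the documented default of 1/10th of the total when no maximum is given.
groups = partition_points(grid, max_rows=len(grid) // 10)
print(len(groups), max(len(g) for g in groups))
```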

Returns:

GeospatialPartitionJobStepConfig – Configuration object describing the ingestion process. Call .execute on this object to start a job.

Examples:

For example, to ingest the California Census dataset for the year 2022:

job = fused.ingest(
    input="https://www2.census.gov/geo/tiger/TIGER_RD18/STATE/06_CALIFORNIA/06/tl_rd22_06_bg.zip",
    output="s3://fused-sample/census/ca_bg_2022/main/",
    output_metadata="s3://fused-sample/census/ca_bg_2022/fused/",
    explode_geometries=True,
    partitioning_maximum_per_file=2000,
    partitioning_maximum_per_chunk=200,
).execute()

job.run_batch

run_batch(
    output_table: Optional[str] = ...,
    instance_type: Optional[WHITELISTED_INSTANCE_TYPES] = None,
    *,
    region: str | None = None,
    disk_size_gb: int | None = None,
    additional_env: List[str] | None = None,
    image_name: Optional[str] = None,
    ignore_no_udf: bool = False,
    ignore_no_output: bool = False,
    validate_imports: Optional[bool] = None,
    validate_inputs: bool = True,
    overwrite: Optional[bool] = None
) -> RunResponse

Begin execution of the ingestion job by calling run_batch on the job object.

Parameters:

output_table (Optional[str]) – The name of the table to write to. Defaults to None.
instance_type (Optional[WHITELISTED_INSTANCE_TYPES]) – The AWS EC2 instance type to use for the job. Acceptable strings are m5.large, m5.xlarge, m5.2xlarge, m5.4xlarge, m5.8xlarge, m5.12xlarge, m5.16xlarge, r5.large, r5.xlarge, r5.2xlarge, r5.4xlarge, r5.8xlarge, r5.12xlarge, or r5.16xlarge. Defaults to None.
region (str | None) – The AWS region in which to run. Defaults to None.
disk_size_gb (int | None) – The disk size to specify for the job. Defaults to None.
additional_env (List[str] | None) – Any additional environment variables to be passed into the job. Defaults to None.
image_name (Optional[str]) – Custom image name to run. Defaults to None for default image.
ignore_no_udf (bool) – Ignore validation errors about not specifying a UDF. Defaults to False.
ignore_no_output (bool) – Ignore validation errors about not specifying output location. Defaults to False.

Monitor and manage job

Calling run_batch returns a RunResponse object with helper methods.

# Declare ingest job
job = fused.ingest(
  input="https://www2.census.gov/geo/tiger/TIGER_RD18/STATE/06_CALIFORNIA/06/tl_rd22_06_bg.zip",
  output="s3://fused-sample/census/ca_bg_2022/main/"
)

# Start ingest job
job_id = job.run_batch()

Fetch the job status.

job_id.get_status()

Fetch and print the job's logs.

job_id.print_logs()

Determine the job's execution time.

job_id.get_exec_time()

Continuously print the job's logs.

job_id.tail_logs()

Cancel the job.

job_id.cancel()

job.run_remote

Alias of job.run_batch for backwards compatibility. See job.run_batch above for details.

fused.ingest_nongeospatial

ingest_nongeospatial(
    input: str | Path | Sequence[str | Path] | pd.DataFrame | gpd.GeoDataFrame,
    output: str | None = None,
    *,
    output_metadata: str | None = None,
    partition_col: str | None = None,
    partitioning_maximum_per_file: int = 2500000,
    partitioning_maximum_per_chunk: int = 65000
) -> NonGeospatialPartitionJobStepConfig

Ingest a dataset into the Fused partitioned format.

Parameters:

input (str | Path | Sequence[str | Path] | pd.DataFrame | gpd.GeoDataFrame) – A pandas DataFrame, a GeoPandas GeoDataFrame, or a path to a file or files on S3 to ingest. Files may be Parquet or another geo data format.
output (str | None) – Location on S3 to write the main table to.
output_metadata (str | None) – Location on S3 to write the fused table to.
partition_col (str | None) – Partition along this column for nongeospatial datasets.
partitioning_maximum_per_file (int) – Maximum number of items to store in a single file. Defaults to 2,500,000.
partitioning_maximum_per_chunk (int) – Maximum number of items to store in a single chunk. Defaults to 65,000.

Returns:

NonGeospatialPartitionJobStepConfig – Configuration object describing the ingestion process. Call .execute on this object to start a job.

Examples:

job = fused.ingest_nongeospatial(
    input=gdf,
    output="s3://sample-bucket/file.parquet",
).execute()

fused.file_path

file_path(
    file_path: str, mkdir: bool = True, storage: StorageStr = "auto"
) -> str

Creates a directory in a predefined temporary directory.

This gives users the ability to manage directories during the execution of a UDF. It takes a relative file_path, creates the corresponding directory structure, and returns its absolute path.

This is useful for UDFs that temporarily store intermediate results as files, such as when writing intermediary files to disk when processing large datasets. file_path ensures that necessary directories exist. The directory is kept for 12h.

Parameters:

file_path (str) – The relative file path to locate.
mkdir (bool) – If True, create the directory if it doesn't already exist. Defaults to True.
storage (StorageStr) – Set where the cache data is stored. Supported values are "auto", "mount" and "local". Auto will automatically select the storage location defined in options (mount if it exists, otherwise local) and ensures that it exists and is writable. Mount gets shared across executions where local will only be shared within the same execution.

Returns:

str – The located file path.
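
A minimal sketch of the path handling described above, under stated assumptions: toy_file_path and BASE_DIR are hypothetical, the real base directory and 12h cleanup are not modeled, and the last component of file_path is assumed to be a file name, so only its parent directories are created.

```python
import tempfile
from pathlib import Path

# Hypothetical base directory; the actual location used by Fused is not given here.
BASE_DIR = Path(tempfile.gettempdir()) / "toy_fused_cache"


def toy_file_path(file_path, mkdir=True):
    """Resolve a relative path under BASE_DIR and optionally create its parent."""
    # Strip a leading slash so the result always stays under BASE_DIR.
    resolved = (BASE_DIR / file_path.lstrip("/")).resolve()
    if mkdir:
        resolved.parent.mkdir(parents=True, exist_ok=True)
    return str(resolved)


path = toy_file_path("intermediate/results.parquet")
print(path)
```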

fused.get_chunks_metadata

get_chunks_metadata(url: str) -> gpd.GeoDataFrame

Returns a GeoDataFrame with each chunk in the table as a row.

Parameters:

url (str) – URL of the table.

fused.get_chunk_from_table

get_chunk_from_table(
    url: str,
    file_id: Union[str, int, None],
    chunk_id: Optional[int],
    *,
    columns: Optional[Iterable[str]] = None
) -> gpd.GeoDataFrame

Returns a chunk from a table, given the chunk's coordinates (file_id and chunk_id).

This can be called with file_id and chunk_id from get_chunks_metadata.

Parameters:

url (str) – URL of the table.
file_id (Union[str, int, None]) – File ID to read.
chunk_id (Optional[int]) – Chunk ID to read.
columns (Optional[Iterable[str]]) – Read only the specified columns.
