Skip to main content

fused.h3

persist_hex_table_metadata

persist_hex_table_metadata(
dataset_path: str,
output_path: str | None = None,
metadata_path: str | None = None,
verbose: bool = False,
row_group_size: int = 500,
indexed_columns: list[str] | None = None,
pool_size: int = 10,
metadata_compression_level: int | None = None,
) -> str

Persist metadata for a hex table to enable serverless queries.

Scans all parquet files in the dataset, extracts metadata needed for fast reconstruction, and writes one .metadata.parquet file per source file.

Each metadata file contains:

  • Row group metadata (offsets, H3 ranges, etc.)
  • Full metadata_json stored as custom metadata in the parquet schema

Requires a 'hex' column (or variant like h3, h3_index) in all files. Raises ValueError if no hex column is found.

Parameters:

  • dataset_path (str) – Path to the dataset directory (e.g., "s3://bucket/dataset/")
  • output_path (str | None) – Deprecated - use metadata_path instead
  • metadata_path (str | None) – Directory path where metadata files should be written. If None, writes to {source_dir}/.fused/{source_filename}.metadata.parquet for each source file. If provided, writes to {metadata_path}/{full_source_path}.metadata.parquet using the full source path as the filename. This allows storing metadata in a different location when you don't have write access to the dataset directory.
  • verbose (bool) – If True, print timing/progress information.
  • row_group_size (int) – Number of rows per row group in output parquet file. Default is 500. Larger values reduce file size but may increase memory usage during writes.
  • indexed_columns (list[str] | None) – List of column names to index. If None (default), indexes only the first identified indexed column (typically the hex column). If an empty list, indexes all detected indexed columns.
  • pool_size (int) – Number of parallel workers to use for processing files. Default is 10 (parallel processing enabled by default). Set to 1 for sequential processing. This can significantly speed up metadata extraction for large datasets with many files.
  • metadata_compression_level (int | None) – Optional zstd compression level for metadata parquet files. If None, uses PyArrow's default for zstd. Valid levels are typically in the range [1, 22].

Returns:

  • [str](#str) – Path to metadata directory (if metadata_path provided) or first metadata file path

Raises:

  • [ImportError](#ImportError) – If job2 package is not available
  • [ValueError](#ValueError) – If no hex column is found in any file
  • [FileNotFoundError](#FileNotFoundError) – If no parquet files are found in the dataset

Example:

fused.h3.persist_hex_table_metadata("s3://my-bucket/my-dataset/")
# 's3://my-bucket/my-dataset/.fused/file1.metadata.parquet'
fused.h3.persist_hex_table_metadata("s3://my-bucket/my-dataset/", metadata_path="s3://my-bucket/metadata/")
# 's3://my-bucket/metadata/'
fused.h3.persist_hex_table_metadata("s3://my-bucket/my-dataset/", pool_size=4)
# 's3://my-bucket/my-dataset/.fused/file1.metadata.parquet'

read_hex_table

read_hex_table(
dataset_path: str,
hex_ranges_list: List[List[int]],
columns: Optional[List[str]] = None,
base_url: Optional[str] = None,
verbose: bool = False,
return_timing_info: bool = False,
batch_size: Optional[int] = None,
max_concurrent_downloads: Optional[int] = None,
) -> "pa.Table" | tuple["pa.Table", Dict[str, Any]]

Read data from an H3-indexed dataset by querying hex ranges.

This is an optimized version that assumes the server always provides full metadata (start_offset, end_offset, metadata, and row_group_bytes) for all row groups. If any row group is missing required metadata, this function will raise an error indicating that the dataset needs to be re-indexed.

This function eliminates all metadata API calls by using prefetched metadata from the /datasets/items-with-metadata endpoint.

This function imports the implementation from job2 at runtime, similar to how read_parquet_row_group works.

Parameters:

  • dataset_path (str) – Path to the H3-indexed dataset (e.g., "s3://bucket/dataset/")
  • hex_ranges_list (List\[List[int]\]) – List of [min_hex, max_hex] pairs as integers. Example: [[622236719905341439, 622246719905341439]]
  • columns (Optional\[List[str]\]) – Optional list of column names to read. If None, reads all columns.
  • base_url (Optional[str]) – Base URL for API. If None, uses current environment.
  • verbose (bool) – If True, print progress information. Default is False.
  • return_timing_info (bool) – If True, return a tuple of (table, timing_info) instead of just the table. Default is False for backward compatibility.
  • batch_size (Optional[int]) – Target size in bytes for combining row groups. If None, uses fused.options.row_group_batch_size (default: 32KB).
  • max_concurrent_downloads (Optional[int]) – Maximum number of simultaneous download operations. If None, uses a default based on the number of files. Default is None.

Returns:

  • 'pa.Table' | [tuple](#tuple)\['pa.Table', [Dict](#typing.Dict)\[[str](#str), [Any](#typing.Any)\]\] – PyArrow Table containing the concatenated data from all matching row groups.
  • 'pa.Table' | [tuple](#tuple)\['pa.Table', [Dict](#typing.Dict)\[[str](#str), [Any](#typing.Any)\]\] – If return_timing_info is True, returns a tuple of (table, timing_info dict).

Raises:

  • [ValueError](#ValueError) – If any row group is missing required metadata (start_offset, end_offset, metadata, or row_group_bytes). This indicates the dataset needs to be re-indexed.

Example:

import fused

# Read data for a specific H3 hex range
table = fused.h3.read_hex_table(
dataset_path="s3://my-bucket/my-h3-dataset/",
hex_ranges_list=[[622236719905341439, 622246719905341439]]
)
df = table.to_pandas()

read_hex_table_slow

read_hex_table_slow(
dataset_path: str,
hex_ranges_list: List[List[int]],
columns: Optional[List[str]] = None,
base_url: Optional[str] = None,
verbose: bool = False,
return_timing_info: bool = False,
metadata_batch_size: int = 50,
) -> "pa.Table" | tuple["pa.Table", Dict[str, Any]]

Read data from an H3-indexed dataset by querying hex ranges.

This function queries the dataset index for row groups that match the given H3 hex ranges, downloads them in parallel with optimized batching, and returns a concatenated table.

Adjacent row groups from the same file are combined into single downloads for better S3 performance. The batch size is controlled by fused.options.row_group_batch_size (default: 32KB).

Parameters:

  • dataset_path (str) – Path to the H3-indexed dataset (e.g., "s3://bucket/dataset/")
  • hex_ranges_list (List\[List[int]\]) – List of [min_hex, max_hex] pairs as integers. Example: [[622236719905341439, 622246719905341439]]
  • columns (Optional\[List[str]\]) – Optional list of column names to read. If None, reads all columns.
  • base_url (Optional[str]) – Base URL for API. If None, uses current environment.
  • verbose (bool) – If True, print progress information. Default is False.
  • return_timing_info (bool) – If True, return a tuple of (table, timing_info) instead of just the table. Default is False for backward compatibility.
  • metadata_batch_size (int) – Maximum number of row group metadata requests to batch together in a single API call. Larger batches reduce API overhead. Default is 50. Consider MongoDB's 16KB document limit when adjusting this value.

Returns:

  • 'pa.Table' | [tuple](#tuple)\['pa.Table', [Dict](#typing.Dict)\[[str](#str), [Any](#typing.Any)\]\] – PyArrow Table containing the concatenated data from all matching row groups.
  • 'pa.Table' | [tuple](#tuple)\['pa.Table', [Dict](#typing.Dict)\[[str](#str), [Any](#typing.Any)\]\] – If return_timing_info is True, returns a tuple of (table, timing_info dict).

Example:

import fused

# Read data for a specific H3 hex range
table = fused.h3.read_hex_table_slow(
dataset_path="s3://my-bucket/my-h3-dataset/",
hex_ranges_list=[[622236719905341439, 622246719905341439]]
)
df = table.to_pandas()

read_hex_table_with_persisted_metadata

read_hex_table_with_persisted_metadata(
dataset_path: str,
hex_ranges_list: List[List[int]],
columns: Optional[List[str]] = None,
metadata_path: Optional[str] = None,
verbose: bool = False,
return_timing_info: bool = False,
batch_size: Optional[int] = None,
max_concurrent_downloads: Optional[int] = None,
max_concurrent_metadata_reads: Optional[int] = None,
) -> "pa.Table" | tuple["pa.Table", Dict[str, Any]]

Read data from an H3-indexed dataset using persisted metadata parquet.

This function reads from per-file metadata parquet files instead of querying a server. Each source parquet file has a corresponding .metadata.parquet file stored at: {source_dir}/.fused/{source_filename}.metadata.parquet

Or at the specified metadata_path if provided: {metadata_path}/{full_source_path}.metadata.parquet

Supports subdirectory queries - if dataset_path points to a subdirectory, the function will look for metadata files for files in that subdirectory.

Parameters:

  • dataset_path (str) – Path to the H3-indexed dataset (e.g., "s3://bucket/dataset/") Can also be a subdirectory path for filtering (e.g., "s3://bucket/dataset/year=2024/")
  • hex_ranges_list (List\[List[int]\]) – List of [min_hex, max_hex] pairs as integers. Example: [[622236719905341439, 622246719905341439]]
  • columns (Optional\[List[str]\]) – Optional list of column names to read. If None, reads all columns.
  • metadata_path (Optional[str]) – Directory path where metadata files are stored. If None, looks for metadata at {source_dir}/.fused/ for each source file. If provided, reads metadata files from this location using full source paths. This allows reading metadata from a different location when you don't have access to the original dataset directory.
  • verbose (bool) – If True, print progress information. Default is False.
  • return_timing_info (bool) – If True, return a tuple of (table, timing_info) instead of just the table. Default is False for backward compatibility.
  • batch_size (Optional[int]) – Target size in bytes for combining row groups. If None, uses fused.options.row_group_batch_size (default: 32KB).
  • max_concurrent_downloads (Optional[int]) – Maximum number of simultaneous download operations. If None, uses a default based on the number of files. Default is None.
  • max_concurrent_metadata_reads (Optional[int]) – Maximum number of simultaneous metadata file reads. If None, defaults to min(10, len(source_files)).

Returns:

  • 'pa.Table' | [tuple](#tuple)\['pa.Table', [Dict](#typing.Dict)\[[str](#str), [Any](#typing.Any)\]\] – PyArrow Table containing the concatenated data from all matching row groups.
  • 'pa.Table' | [tuple](#tuple)\['pa.Table', [Dict](#typing.Dict)\[[str](#str), [Any](#typing.Any)\]\] – If return_timing_info is True, returns a tuple of (table, timing_info dict).

Raises:

  • [FileNotFoundError](#FileNotFoundError) – If the metadata parquet file is not found.
  • [ValueError](#ValueError) – If any row group is missing required metadata.

Example:

import fused

# First, persist metadata (one-time operation)
fused.h3.persist_hex_table_metadata("s3://my-bucket/my-dataset/")

# Then read using persisted metadata (no server required)
table = fused.h3.read_hex_table_with_persisted_metadata(
dataset_path="s3://my-bucket/my-dataset/",
hex_ranges_list=[[622236719905341439, 622246719905341439]]
)
df = table.to_pandas()

# Read from subdirectory (filters by path prefix)
table = fused.h3.read_hex_table_with_persisted_metadata(
dataset_path="s3://my-bucket/my-dataset/year=2024/",
hex_ranges_list=[[622236719905341439, 622246719905341439]]
)

# Read with metadata in alternate location
table = fused.h3.read_hex_table_with_persisted_metadata(
dataset_path="s3://my-bucket/my-dataset/",
metadata_path="s3://my-bucket/metadata/",
hex_ranges_list=[[622236719905341439, 622246719905341439]]
)

run_ingest_raster_to_h3

run_ingest_raster_to_h3(
input_path: str | list[str],
output_path: str,
metrics: str | list[str] = "cnt",
res: int | None = None,
k_ring: int = 1,
res_offset: int = 1,
chunk_res: int | None = None,
file_res: int | None = None,
overview_res: list[int] | None = None,
overview_chunk_res: int | list[int] | None = None,
max_rows_per_chunk: int = 0,
include_source_url: bool = True,
target_chunk_size: int | None = None,
nodata: int | float | None = None,
debug_mode: bool = False,
remove_tmp_files: bool = True,
tmp_path: str | None = None,
overwrite: bool = False,
steps: list[str] | None = None,
extract_engine: str | None = None,
extract_max_workers: int = 256,
partition_engine: str | None = None,
partition_max_workers: int = 256,
overview_engine: str | None = None,
overview_max_workers: int = 256,
**kwargs: int
)

Run the raster to H3 ingestion process.

This process involves multiple steps:

  • extract pixels values and assign to H3 cells in chunks (extract step)
  • combine the chunks per partition (file) and prepare metadata (partition step)
  • create the metadata _sample file and overviews files

Parameters:

  • input_path (str, list) – Path(s) to the input raster data. When this is a single path, the file is chunked up for processing based on target_chunk_size. When this is a list of paths, each file is processed as one chunk.
  • output_path (str) – Path for the resulting Parquet dataset.
  • metrics (str or list of str) – The metrics to compute per H3 cell. Supported metrics are either "cnt" or a list containing any of "avg", "min", "max", "stddev", "mode" (i.e. most common value), and "sum".
  • res (int) – The resolution of the H3 cells in the output data. The pixel values are assigned to H3 cells at resolution res + res_offset and then aggregated to res. By default, this is inferred based on the resolution of the input data ensuring the H3 cell size is close to the pixel size (e.g. for a raster with pixel size of 30x30m, a resolution of 11 is inferred).
  • k_ring (int) – The k-ring distance at resolution res + res_offset to which the pixel value is assigned (in addition to the center cell). Defaults to 1.
  • res_offset (int) – Offset to child resolution (relative to res) at which to assign the raw pixel values to H3 cells.
  • file_res (int) – The H3 resolution to chunk the resulting files of the Parquet dataset. By default will be inferred based on the target resolution res. You can specify file_res=-1 to have a single output file.
  • chunk_res (int) – The H3 resolution to chunk the row groups within each file of the Parquet dataset (ignored when max_rows_per_chunk is specified). By default will be inferred based on the target resolution res.
  • overview_res (list of int) – The H3 resolutions for which to create overview files. By default, overviews are created for resolutions 3 to 7 (or capped at a lower value if the res of the output dataset is lower).
  • overview_chunk_res (int or list of int) – The H3 resolution(s) to chunk the row groups within each overview file of the Parquet dataset. By default, each overview file is chunked at the overview resolution minus 5 (clamped between 0 and the res of the output dataset).
  • max_rows_per_chunk (int) – The maximum number of rows per chunk in the resulting data and overview files. If 0 (the default), chunk_res and overview_chunk_res are used to determine the chunking.
  • include_source_url (bool) – If True, include a "source_url" column in the output dataset that contains a list of source URLs that contributed data to each H3 cell. Defaults to True, set to False to omit this column.
  • target_chunk_size (int) – The approximate number of pixel values to process per chunk in the first "extract" step. Defaults to 10,000,000 for ingesting a single file or a few files. If ingesting more than 20 files, each file is processed as a single chunk by default, but you can override this by specifying a specific target_chunk_size value, or by specifying target_chunk_size=0 to always process each file as a single chunk.
  • nodata (int or float) – By default, the "nodata" value indicated in the raster metadata will be used to ignore pixels with this value when ingesting the data. If this information is missing or wrong in the raster metadata, you can override the value by specifying this keyword.
  • debug_mode (bool) – If True, run only the first two chunks for debugging purposes. Defaults to False.
  • remove_tmp_files (bool) – If True, remove the temporary files after ingestion is complete. Defaults to True.
  • tmp_path (str) – Optional path to use for the temporary files. If specified, the extract step is skipped and it is assumed that the temporary files are already present at this path.
  • overwrite (bool) – If True, overwrite the output path if it already exists, by first removing the existing content before writing the new files. Defaults to False, in which case an error is raised if the output_path is not empty.
  • steps (list of str) – The processing steps to run. Can include "extract", "partition", "metadata", and "overview". By default, all steps are run.
  • extract_engine (str) – The engine to use for the extract step. By default, tries first with "realtime" and then falls back to "large".
  • extract_max_workers (int) – The maximum number of workers to use for the extract step. Defaults to 256.
  • partition_engine (str) – The engine to use for the partition step. By default, tries first with "realtime" and then falls back to "large".
  • partition_max_workers (int) – The maximum number of workers to use for the partition step. Defaults to 256.
  • overview_engine (str) – The engine to use for the overview step. By default, tries first with "realtime" and then falls back to "large".
  • overview_max_workers (int) – The maximum number of workers to use for the overview step. Defaults to 256.

The extract, partition and overview steps are run in parallel using fused.submit(). By default, the function will first attempt to run this using "realtime" instances, and retry any failed runs using "large" instances.

You can override this behavior by specifying the engine, max_workers, worker_concurrency, etc parameters as additional keyword arguments to this function (**kwargs). If you want to specify those per step, use extract_engine, partition_engine, etc. For example, to run everything locally on the same machine where this function runs, use:

run_ingest_raster_to_h3(..., engine="local")

run_partition_to_h3

run_partition_to_h3(
input_path: str | list[str],
output_path: str,
metrics: str | list[str] = "cnt",
groupby_cols: list[str] = ["hex"],
data_cols: list[str] = ["data"],
window_cols: list[str] = ["hex"],
additional_cols: list[str] = [],
input_rows_are_counts: bool = True,
res: int | None = None,
chunk_res: int | None = None,
file_res: int | None = None,
overview_res: list[int] | None = None,
overview_chunk_res: int | list[int] | None = None,
max_rows_per_chunk: int = 0,
remove_tmp_files: bool = True,
overwrite: bool = False,
extract_kwargs: bool = {},
partition_kwargs: bool = {},
overview_kwargs: bool = {},
**kwargs: bool
)

Run the H3 partitioning process.

This pipeline assumes that the input data already has a "hex" column with H3 cell IDs at the target resolution. It will then repartition the data spatially and add overview files.

By default, it is assumed that each row in the input data has a cnt column with counts of the raster pixel values (in the "data" column).

If your data already has actual statistics per H3 cell, you can set input_rows_are_counts=False. If you then specify, for example, a metric like "avg", it will take just take the average of the "data" column for each H3 cell. At the target resolution, this will typically be a single row, not actually further aggregating. For overview files, it will then create aggregated data.

Parameters:

  • input_path (str, list) – Path(s) to the input data. Can be a path to file or directory, or list of file paths). The first step of the processing is parallelized per input file.
  • output_path (str) – Path for the resulting Parquet dataset.
  • metrics (str or list of str) – The metrics to compute per H3 cell. Supported metrics are either "cnt" or a list containing any of "avg", "min", "max", "stddev", and "sum".
  • groupby_cols (list) – The columns indicating the groups for which the aggregated metrics are calculated. By default, for the "cnt" metric, this is the "hex" column, which is further combined with the data_cols (by default "data"). For the default "cnt" metric, this means counts are calculated (summed) per unique combination of "hex" and "data". For the other metrics, this should typically always be the "hex" column.
  • data_cols (list[str]) – Value column names (default ["data"]). For "cnt", this column contains the categories that are counted and is appended to the GROUP BY clause. For non-"cnt" metrics, the list of columns that will be aggregated according to the metrics (the result contains the cartesian product of data columns and metrics, i.e. all {data_col}_{metric} combinations).
  • window_cols (list) – The columns for which to calculate total counts over using a window function. This will add an additional column to the output (one for each of the columns in this list) with the total counts per unique value of the specified column. By default, this is only done for the "hex" column. This only applies when the "cnt" metric is specified.
  • additional_cols (list) – Additional columns from the input data to include in the output dataset. These columns are assumed to have a unique value per group defined by the groupby_cols. If this is not the case, the output will contain arbitrary values from one of the rows in each group. By default, this is empty. This only applies when the "cnt" metric is specified.
  • res (int) – The resolution at which to assign the pixel values to H3 cells.
  • file_res (int) – The H3 resolution to chunk the resulting files of the Parquet dataset. By default will be inferred based on the target resolution res. You can specify file_res=-1 to have a single output file.
  • chunk_res (int) – The H3 resolution to chunk the row groups within each file of the Parquet dataset (ignored when max_rows_per_chunk is specified). By default will be inferred based on the target resolution res.
  • overview_res (list of int) – The H3 resolutions for which to create overview files. By default, overviews are created for resolutions 3 to 7 (or capped at a lower value if the res of the output dataset is lower).
  • overview_chunk_res (int or list of int) – The H3 resolution(s) to chunk the row groups within each overview file of the Parquet dataset. By default, each overview file is chunked at the overview resolution minus 5 (clamped between 0 and the res of the output dataset).
  • max_rows_per_chunk (int) – The maximum number of rows per chunk in the resulting data and overview files. If 0 (the default), chunk_res and overview_chunk_res are used to determine the chunking.
  • remove_tmp_files (bool) – If True, remove the temporary files after ingestion is complete. Defaults to True.
  • overwrite (bool) – If True, overwrite the output path if it already exists, by first removing the existing content before writing the new files. Defaults to False, in which case an error is raised if the output_path is not empty.
  • extract_kwargs (dict) – Additional keyword arguments to pass to fused.submit for the extract step.
  • partition_kwargs (dict) – Additional keyword arguments to pass to fused.submit for the partition step.
  • overview_kwargs (dict) – Additional keyword arguments to pass to fused.submit for the overview step.

See the docstring of run_ingest_raster_to_h3 for more details on the processing steps and how to configure the execution.