fused.h3
persist_hex_table_metadata
persist_hex_table_metadata(
dataset_path: str,
output_path: str | None = None,
metadata_path: str | None = None,
verbose: bool = False,
row_group_size: int = 500,
indexed_columns: list[str] | None = None,
pool_size: int = 10,
metadata_compression_level: int | None = None,
) -> str
Persist metadata for a hex table to enable serverless queries.
Scans all parquet files in the dataset, extracts metadata needed for fast reconstruction, and writes one .metadata.parquet file per source file.
Each metadata file contains:
- Row group metadata (offsets, H3 ranges, etc.)
- Full metadata_json stored as custom metadata in the parquet schema
Requires a 'hex' column (or variant like h3, h3_index) in all files. Raises ValueError if no hex column is found.
Parameters:
- dataset_path (
str) – Path to the dataset directory (e.g., "s3://bucket/dataset/") - output_path (
str | None) – Deprecated - use metadata_path instead - metadata_path (
str | None) – Directory path where metadata files should be written. If None, writes to {source_dir}/.fused/{source_filename}.metadata.parquet for each source file. If provided, writes to {metadata_path}/{full_source_path}.metadata.parquet using the full source path as the filename. This allows storing metadata in a different location when you don't have write access to the dataset directory. - verbose (
bool) – If True, print timing/progress information. - row_group_size (
int) – Number of rows per row group in output parquet file. Default is 500. Larger values reduce file size but may increase memory usage during writes. - indexed_columns (
list[str] | None) – List of column names to index. If None (default), indexes only the first identified indexed column (typically the hex column). If an empty list, indexes all detected indexed columns. - pool_size (
int) – Number of parallel workers to use for processing files. Default is 10 (parallel processing enabled by default). Set to 1 for sequential processing. This can significantly speed up metadata extraction for large datasets with many files. - metadata_compression_level (
int | None) – Optional zstd compression level for metadata parquet files. If None, uses PyArrow's default for zstd. Valid levels are typically in the range [1, 22].
Returns:
[str](#str)– Path to metadata directory (if metadata_path provided) or first metadata file path
Raises:
[ImportError](#ImportError)– If job2 package is not available[ValueError](#ValueError)– If no hex column is found in any file[FileNotFoundError](#FileNotFoundError)– If no parquet files are found in the dataset
Example:
fused.h3.persist_hex_table_metadata("s3://my-bucket/my-dataset/")
# 's3://my-bucket/my-dataset/.fused/file1.metadata.parquet'
fused.h3.persist_hex_table_metadata("s3://my-bucket/my-dataset/", metadata_path="s3://my-bucket/metadata/")
# 's3://my-bucket/metadata/'
fused.h3.persist_hex_table_metadata("s3://my-bucket/my-dataset/", pool_size=4)
# 's3://my-bucket/my-dataset/.fused/file1.metadata.parquet'
read_hex_table
read_hex_table(
dataset_path: str,
hex_ranges_list: List[List[int]],
columns: Optional[List[str]] = None,
base_url: Optional[str] = None,
verbose: bool = False,
return_timing_info: bool = False,
batch_size: Optional[int] = None,
max_concurrent_downloads: Optional[int] = None,
) -> "pa.Table" | tuple["pa.Table", Dict[str, Any]]
Read data from an H3-indexed dataset by querying hex ranges.
This is an optimized version that assumes the server always provides full metadata (start_offset, end_offset, metadata, and row_group_bytes) for all row groups. If any row group is missing required metadata, this function will raise an error indicating that the dataset needs to be re-indexed.
This function eliminates all metadata API calls by using prefetched metadata from the /datasets/items-with-metadata endpoint.
This function imports the implementation from job2 at runtime, similar to how read_parquet_row_group works.
Parameters:
- dataset_path (
str) – Path to the H3-indexed dataset (e.g., "s3://bucket/dataset/") - hex_ranges_list (
List\[List[int]\]) – List of [min_hex, max_hex] pairs as integers. Example: [[622236719905341439, 622246719905341439]] - columns (
Optional\[List[str]\]) – Optional list of column names to read. If None, reads all columns. - base_url (
Optional[str]) – Base URL for API. If None, uses current environment. - verbose (
bool) – If True, print progress information. Default is False. - return_timing_info (
bool) – If True, return a tuple of (table, timing_info) instead of just the table. Default is False for backward compatibility. - batch_size (
Optional[int]) – Target size in bytes for combining row groups. If None, usesfused.options.row_group_batch_size(default: 32KB). - max_concurrent_downloads (
Optional[int]) – Maximum number of simultaneous download operations. If None, uses a default based on the number of files. Default is None.
Returns:
'pa.Table' | [tuple](#tuple)\['pa.Table', [Dict](#typing.Dict)\[[str](#str), [Any](#typing.Any)\]\]– PyArrow Table containing the concatenated data from all matching row groups.'pa.Table' | [tuple](#tuple)\['pa.Table', [Dict](#typing.Dict)\[[str](#str), [Any](#typing.Any)\]\]– If return_timing_info is True, returns a tuple of (table, timing_info dict).
Raises:
[ValueError](#ValueError)– If any row group is missing required metadata (start_offset, end_offset, metadata, or row_group_bytes). This indicates the dataset needs to be re-indexed.
Example:
import fused
# Read data for a specific H3 hex range
table = fused.h3.read_hex_table(
dataset_path="s3://my-bucket/my-h3-dataset/",
hex_ranges_list=[[622236719905341439, 622246719905341439]]
)
df = table.to_pandas()
read_hex_table_slow
read_hex_table_slow(
dataset_path: str,
hex_ranges_list: List[List[int]],
columns: Optional[List[str]] = None,
base_url: Optional[str] = None,
verbose: bool = False,
return_timing_info: bool = False,
metadata_batch_size: int = 50,
) -> "pa.Table" | tuple["pa.Table", Dict[str, Any]]
Read data from an H3-indexed dataset by querying hex ranges.
This function queries the dataset index for row groups that match the given H3 hex ranges, downloads them in parallel with optimized batching, and returns a concatenated table.
Adjacent row groups from the same file are combined into single downloads
for better S3 performance. The batch size is controlled by
fused.options.row_group_batch_size (default: 32KB).
Parameters:
- dataset_path (
str) – Path to the H3-indexed dataset (e.g., "s3://bucket/dataset/") - hex_ranges_list (
List\[List[int]\]) – List of [min_hex, max_hex] pairs as integers. Example: [[622236719905341439, 622246719905341439]] - columns (
Optional\[List[str]\]) – Optional list of column names to read. If None, reads all columns. - base_url (
Optional[str]) – Base URL for API. If None, uses current environment. - verbose (
bool) – If True, print progress information. Default is False. - return_timing_info (
bool) – If True, return a tuple of (table, timing_info) instead of just the table. Default is False for backward compatibility. - metadata_batch_size (
int) – Maximum number of row group metadata requests to batch together in a single API call. Larger batches reduce API overhead. Default is 50. Consider MongoDB's 16KB document limit when adjusting this value.
Returns:
'pa.Table' | [tuple](#tuple)\['pa.Table', [Dict](#typing.Dict)\[[str](#str), [Any](#typing.Any)\]\]– PyArrow Table containing the concatenated data from all matching row groups.'pa.Table' | [tuple](#tuple)\['pa.Table', [Dict](#typing.Dict)\[[str](#str), [Any](#typing.Any)\]\]– If return_timing_info is True, returns a tuple of (table, timing_info dict).
Example:
import fused
# Read data for a specific H3 hex range
table = fused.h3.read_hex_table_slow(
dataset_path="s3://my-bucket/my-h3-dataset/",
hex_ranges_list=[[622236719905341439, 622246719905341439]]
)
df = table.to_pandas()
read_hex_table_with_persisted_metadata
read_hex_table_with_persisted_metadata(
dataset_path: str,
hex_ranges_list: List[List[int]],
columns: Optional[List[str]] = None,
metadata_path: Optional[str] = None,
verbose: bool = False,
return_timing_info: bool = False,
batch_size: Optional[int] = None,
max_concurrent_downloads: Optional[int] = None,
max_concurrent_metadata_reads: Optional[int] = None,
) -> "pa.Table" | tuple["pa.Table", Dict[str, Any]]
Read data from an H3-indexed dataset using persisted metadata parquet.
This function reads from per-file metadata parquet files instead of querying a server. Each source parquet file has a corresponding .metadata.parquet file stored at: {source_dir}/.fused/{source_filename}.metadata.parquet
Or at the specified metadata_path if provided: {metadata_path}/{full_source_path}.metadata.parquet
Supports subdirectory queries - if dataset_path points to a subdirectory, the function will look for metadata files for files in that subdirectory.
Parameters:
- dataset_path (
str) – Path to the H3-indexed dataset (e.g., "s3://bucket/dataset/") Can also be a subdirectory path for filtering (e.g., "s3://bucket/dataset/year=2024/") - hex_ranges_list (
List\[List[int]\]) – List of [min_hex, max_hex] pairs as integers. Example: [[622236719905341439, 622246719905341439]] - columns (
Optional\[List[str]\]) – Optional list of column names to read. If None, reads all columns. - metadata_path (
Optional[str]) – Directory path where metadata files are stored. If None, looks for metadata at {source_dir}/.fused/ for each source file. If provided, reads metadata files from this location using full source paths. This allows reading metadata from a different location when you don't have access to the original dataset directory. - verbose (
bool) – If True, print progress information. Default is False. - return_timing_info (
bool) – If True, return a tuple of (table, timing_info) instead of just the table. Default is False for backward compatibility. - batch_size (
Optional[int]) – Target size in bytes for combining row groups. If None, usesfused.options.row_group_batch_size(default: 32KB). - max_concurrent_downloads (
Optional[int]) – Maximum number of simultaneous download operations. If None, uses a default based on the number of files. Default is None. - max_concurrent_metadata_reads (
Optional[int]) – Maximum number of simultaneous metadata file reads. If None, defaults to min(10, len(source_files)).
Returns:
'pa.Table' | [tuple](#tuple)\['pa.Table', [Dict](#typing.Dict)\[[str](#str), [Any](#typing.Any)\]\]– PyArrow Table containing the concatenated data from all matching row groups.'pa.Table' | [tuple](#tuple)\['pa.Table', [Dict](#typing.Dict)\[[str](#str), [Any](#typing.Any)\]\]– If return_timing_info is True, returns a tuple of (table, timing_info dict).
Raises:
[FileNotFoundError](#FileNotFoundError)– If the metadata parquet file is not found.[ValueError](#ValueError)– If any row group is missing required metadata.
Example:
import fused
# First, persist metadata (one-time operation)
fused.h3.persist_hex_table_metadata("s3://my-bucket/my-dataset/")
# Then read using persisted metadata (no server required)
table = fused.h3.read_hex_table_with_persisted_metadata(
dataset_path="s3://my-bucket/my-dataset/",
hex_ranges_list=[[622236719905341439, 622246719905341439]]
)
df = table.to_pandas()
# Read from subdirectory (filters by path prefix)
table = fused.h3.read_hex_table_with_persisted_metadata(
dataset_path="s3://my-bucket/my-dataset/year=2024/",
hex_ranges_list=[[622236719905341439, 622246719905341439]]
)
# Read with metadata in alternate location
table = fused.h3.read_hex_table_with_persisted_metadata(
dataset_path="s3://my-bucket/my-dataset/",
metadata_path="s3://my-bucket/metadata/",
hex_ranges_list=[[622236719905341439, 622246719905341439]]
)
run_ingest_raster_to_h3
run_ingest_raster_to_h3(
input_path: str | list[str],
output_path: str,
metrics: str | list[str] = "cnt",
res: int | None = None,
k_ring: int = 1,
res_offset: int = 1,
chunk_res: int | None = None,
file_res: int | None = None,
overview_res: list[int] | None = None,
overview_chunk_res: int | list[int] | None = None,
max_rows_per_chunk: int = 0,
include_source_url: bool = True,
target_chunk_size: int | None = None,
nodata: int | float | None = None,
debug_mode: bool = False,
remove_tmp_files: bool = True,
tmp_path: str | None = None,
overwrite: bool = False,
steps: list[str] | None = None,
extract_engine: str | None = None,
extract_max_workers: int = 256,
partition_engine: str | None = None,
partition_max_workers: int = 256,
overview_engine: str | None = None,
overview_max_workers: int = 256,
**kwargs: int
)
Run the raster to H3 ingestion process.
This process involves multiple steps:
- extract pixels values and assign to H3 cells in chunks (extract step)
- combine the chunks per partition (file) and prepare metadata (partition step)
- create the metadata
_samplefile and overviews files
Parameters:
- input_path (
str, list) – Path(s) to the input raster data. When this is a single path, the file is chunked up for processing based ontarget_chunk_size. When this is a list of paths, each file is processed as one chunk. - output_path (
str) – Path for the resulting Parquet dataset. - metrics (
str or list of str) – The metrics to compute per H3 cell. Supported metrics are either "cnt" or a list containing any of "avg", "min", "max", "stddev", "mode" (i.e. most common value), and "sum". - res (
int) – The resolution of the H3 cells in the output data. The pixel values are assigned to H3 cells at resolutionres + res_offsetand then aggregated tores. By default, this is inferred based on the resolution of the input data ensuring the H3 cell size is close to the pixel size (e.g. for a raster with pixel size of 30x30m, a resolution of 11 is inferred). - k_ring (
int) – The k-ring distance at resolutionres + res_offsetto which the pixel value is assigned (in addition to the center cell). Defaults to 1. - res_offset (
int) – Offset to child resolution (relative tores) at which to assign the raw pixel values to H3 cells. - file_res (
int) – The H3 resolution to chunk the resulting files of the Parquet dataset. By default will be inferred based on the target resolutionres. You can specifyfile_res=-1to have a single output file. - chunk_res (
int) – The H3 resolution to chunk the row groups within each file of the Parquet dataset (ignored whenmax_rows_per_chunkis specified). By default will be inferred based on the target resolutionres. - overview_res (
list of int) – The H3 resolutions for which to create overview files. By default, overviews are created for resolutions 3 to 7 (or capped at a lower value if theresof the output dataset is lower). - overview_chunk_res (
int or list of int) – The H3 resolution(s) to chunk the row groups within each overview file of the Parquet dataset. By default, each overview file is chunked at the overview resolution minus 5 (clamped between 0 and theresof the output dataset). - max_rows_per_chunk (
int) – The maximum number of rows per chunk in the resulting data and overview files. If 0 (the default),chunk_resandoverview_chunk_resare used to determine the chunking. - include_source_url (
bool) – If True, include a"source_url"column in the output dataset that contains a list of source URLs that contributed data to each H3 cell. Defaults to True, set to False to omit this column. - target_chunk_size (
int) – The approximate number of pixel values to process per chunk in the first "extract" step. Defaults to 10,000,000 for ingesting a single file or a few files. If ingesting more than 20 files, each file is processed as a single chunk by default, but you can override this by specifying a specifictarget_chunk_sizevalue, or by specifyingtarget_chunk_size=0to always process each file as a single chunk. - nodata (
int or float) – By default, the "nodata" value indicated in the raster metadata will be used to ignore pixels with this value when ingesting the data. If this information is missing or wrong in the raster metadata, you can override the value by specifying this keyword. - debug_mode (
bool) – If True, run only the first two chunks for debugging purposes. Defaults to False. - remove_tmp_files (
bool) – If True, remove the temporary files after ingestion is complete. Defaults to True. - tmp_path (
str) – Optional path to use for the temporary files. If specified, the extract step is skipped and it is assumed that the temporary files are already present at this path. - overwrite (
bool) – If True, overwrite the output path if it already exists, by first removing the existing content before writing the new files. Defaults to False, in which case an error is raised if theoutput_pathis not empty. - steps (
list of str) – The processing steps to run. Can include "extract", "partition", "metadata", and "overview". By default, all steps are run. - extract_engine (
str) – The engine to use for the extract step. By default, tries first with "realtime" and then falls back to "large". - extract_max_workers (
int) – The maximum number of workers to use for the extract step. Defaults to 256. - partition_engine (
str) – The engine to use for the partition step. By default, tries first with "realtime" and then falls back to "large". - partition_max_workers (
int) – The maximum number of workers to use for the partition step. Defaults to 256. - overview_engine (
str) – The engine to use for the overview step. By default, tries first with "realtime" and then falls back to "large". - overview_max_workers (
int) – The maximum number of workers to use for the overview step. Defaults to 256.
The extract, partition and overview steps are run in parallel using
fused.submit(). By default, the function will first attempt to run this using
"realtime" instances, and retry any failed runs using "large" instances.
You can override this behavior by specifying the engine, max_workers,
worker_concurrency, etc parameters as additional
keyword arguments to this function (**kwargs). If you want to specify
those per step, use extract_engine, partition_engine, etc.
For example, to run everything locally on the same machine where this
function runs, use:
run_ingest_raster_to_h3(..., engine="local")
run_partition_to_h3
run_partition_to_h3(
input_path: str | list[str],
output_path: str,
metrics: str | list[str] = "cnt",
groupby_cols: list[str] = ["hex"],
data_cols: list[str] = ["data"],
window_cols: list[str] = ["hex"],
additional_cols: list[str] = [],
input_rows_are_counts: bool = True,
res: int | None = None,
chunk_res: int | None = None,
file_res: int | None = None,
overview_res: list[int] | None = None,
overview_chunk_res: int | list[int] | None = None,
max_rows_per_chunk: int = 0,
remove_tmp_files: bool = True,
overwrite: bool = False,
extract_kwargs: bool = {},
partition_kwargs: bool = {},
overview_kwargs: bool = {},
**kwargs: bool
)
Run the H3 partitioning process.
This pipeline assumes that the input data already has a "hex" column
with H3 cell IDs at the target resolution. It will then repartition
the data spatially and add overview files.
By default, it is assumed that each row in the input data has a cnt column
with counts of the raster pixel values (in the "data" column).
If your data already has actual statistics per H3 cell, you can set
input_rows_are_counts=False. If you then specify, for example, a metric
like "avg", it will take just take the average of the "data" column for each H3 cell.
At the target resolution, this will typically be a single row, not actually
further aggregating. For overview files, it will then create aggregated data.
Parameters:
- input_path (
str, list) – Path(s) to the input data. Can be a path to file or directory, or list of file paths). The first step of the processing is parallelized per input file. - output_path (
str) – Path for the resulting Parquet dataset. - metrics (
str or list of str) – The metrics to compute per H3 cell. Supported metrics are either "cnt" or a list containing any of "avg", "min", "max", "stddev", and "sum". - groupby_cols (
list) – The columns indicating the groups for which the aggregated metrics are calculated. By default, for the "cnt" metric, this is the "hex" column, which is further combined with thedata_cols(by default "data"). For the default "cnt" metric, this means counts are calculated (summed) per unique combination of "hex" and "data". For the other metrics, this should typically always be the "hex" column. - data_cols (
list[str]) – Value column names (default["data"]). For "cnt", this column contains the categories that are counted and is appended to theGROUP BYclause. For non-"cnt" metrics, the list of columns that will be aggregated according to the metrics (the result contains the cartesian product of data columns and metrics, i.e. all{data_col}_{metric}combinations). - window_cols (
list) – The columns for which to calculate total counts over using a window function. This will add an additional column to the output (one for each of the columns in this list) with the total counts per unique value of the specified column. By default, this is only done for the "hex" column. This only applies when the "cnt" metric is specified. - additional_cols (
list) – Additional columns from the input data to include in the output dataset. These columns are assumed to have a unique value per group defined by thegroupby_cols. If this is not the case, the output will contain arbitrary values from one of the rows in each group. By default, this is empty. This only applies when the "cnt" metric is specified. - res (
int) – The resolution at which to assign the pixel values to H3 cells. - file_res (
int) – The H3 resolution to chunk the resulting files of the Parquet dataset. By default will be inferred based on the target resolutionres. You can specifyfile_res=-1to have a single output file. - chunk_res (
int) – The H3 resolution to chunk the row groups within each file of the Parquet dataset (ignored whenmax_rows_per_chunkis specified). By default will be inferred based on the target resolutionres. - overview_res (
list of int) – The H3 resolutions for which to create overview files. By default, overviews are created for resolutions 3 to 7 (or capped at a lower value if theresof the output dataset is lower). - overview_chunk_res (
int or list of int) – The H3 resolution(s) to chunk the row groups within each overview file of the Parquet dataset. By default, each overview file is chunked at the overview resolution minus 5 (clamped between 0 and theresof the output dataset). - max_rows_per_chunk (
int) – The maximum number of rows per chunk in the resulting data and overview files. If 0 (the default),chunk_resandoverview_chunk_resare used to determine the chunking. - remove_tmp_files (
bool) – If True, remove the temporary files after ingestion is complete. Defaults to True. - overwrite (
bool) – If True, overwrite the output path if it already exists, by first removing the existing content before writing the new files. Defaults to False, in which case an error is raised if theoutput_pathis not empty. - extract_kwargs (
dict) – Additional keyword arguments to pass tofused.submitfor the extract step. - partition_kwargs (
dict) – Additional keyword arguments to pass tofused.submitfor the partition step. - overview_kwargs (
dict) – Additional keyword arguments to pass tofused.submitfor the overview step.
See the docstring of run_ingest_raster_to_h3 for more details on
the processing steps and how to configure the execution.