Zonal stats with Fused: 10 minute guide
A step-by-step guide for data scientists.
1. Using Fused for a Zonal Statistics Example
In this guide, we'll show you how to:
- Bring in your data,
- Write a UDF to process the data
- Run the UDF remotely & in parallel
- Create an app that shows your results and can be shared with anyone
2. Bring in your data
In this guide, we'll estimate how much alfalfa grows in areas defined by polygon geometries. You'll first upload your own vector table with fused.ingest
. This spatially partitions the table and writes it in your specified S3 bucket as a GeoParquet file. You'll then calculate zonal stats over a raster array of alfalfa crops in the USDA Cropland Data Layer (CDL) dataset.
This example shows how to geo partition polygons of Census Block Groups for Utah, which is a Parquet table with a geometry
column. You can follow along with this file or any vector table you'd like. Read about other supported formats in Ingest your own data.
First, set up a Python environment, install the latest Fused SDK with pip install fused
, and authenticate.
Now, write the following script to geo partition your data. Pass the URL of the table to fused.ingest
. When you kick off an ingest job with run_remote
, Fused spins up a server to geo partition your table and writes the output to the path specified by the output
parameter. In the codeblock below, fd://tl_2023_49_bg/
is the base path to your account's S3 bucket.
import fused
job = fused.ingest(
input="https://www2.census.gov/geo/tiger/TIGER2023/BG/tl_2023_49_bg.zip",
output="fd://tl_2023_49_bg/"
)
job_id = job.run_remote()
After running the preceeding code, open fused.io/jobs to view the job status and logs.
Once the job is complete, you can preview the output dataset in the File Explorer.
You can also ingest data without installing anything by using this Fused App.
For the next step you can use the path to the data you just ingested or, if you prefer, this public sample table: s3://fused-asset/data/tl_2023_49_bg/
.
3. Write a UDF to process the data
Now we will write a UDF that reads your data, prints a log to the console, and returns an output. To write a UDF simply wrap a Python function with the decorator @fused.udf
.
To run this, first log in at fused.io/workbench. As you write code in the UDF Builder you'll see how visualization results, logs, and errors show up immediately.
The first parameter of this UDF, bbox
. It is reserved for Fused to pass a GeoDataFrame
which the UDF may use to spatially filter datasets, and usually corresponds to a web map tile. This enables Fused to run the UDF for each tile in the viewport to distribute processing across multiple workers.
The year
parameter is used to structure the S3 path of the CDL GeoTiff which the utility function read_tiff
reads for the area defined by bbox
. The crop_id
parameter 36 corresponds to alfalfa the CDL colormap, which the UDF uses to mask the raster array.
Fused lets you import utility Modules from other UDFs with fused.utils
. Their code lives in the public UDFs repo.
read_tiff
loads an array of the CDL dataset for the specifiedbbox
extent andyear
table_to_tile
loads the table you geo partitioned for the specifiedbbox
extentgeom_stats
calculates zonal statistics by aggregating thearr
variable over the geometries specified by thegdf
@fused.udf
def udf(
bbox: fused.types.TileGDF = None,
year: int = 2020,
crop_id: int = 36
):
import numpy as np
# Load CDLS data
arr = fused.utils.common.read_tiff(
bbox,
input_tiff_path=f"s3://fused-asset/data/cdls/{year}_30m_cdls.tif"
)
# Mask for crop
arr = np.where(np.isin(arr, [crop_id]), 1, 0)
# Load polygons
gdf = fused.utils.common.table_to_tile(
bbox,
table='s3://fused-asset/data/tl_2023_49_bg/',
min_zoom=5,
use_columns=['NAMELSAD', 'GEOID', 'MTFCC', 'FUNCSTAT', 'geometry']
)
gdf.crs = 4326
# Calculate zonal stats
return fused.utils.common.geom_stats(gdf, arr)
Try running the UDF in the UDF Builder and visually inspect the output on the map. See what happens when you change year
. Try introducing print statements such as print(arr)
and print(gdf)
to show logs in the console.
You can also style the map layer by setting this JSON in the Visualize tab.
Details
{
"tileLayer": {
"@@type": "TileLayer",
"minZoom": 0,
"maxZoom": 19,
"tileSize": 256,
"pickable": true
},
"rasterLayer": {
"@@type": "BitmapLayer",
"pickable": true
},
"vectorLayer": {
"@@type": "GeoJsonLayer",
"stroked": true,
"filled": true,
"pickable": true,
"lineWidthMinPixels": 1,
"pointRadiusMinPixels": 1,
"getFillColor": {
"@@function": "colorContinuous",
"attr": "count",
"domain": [
0,
100
],
"colors": "Tropic",
"nullColor": [
184,
14,
184
]
},
"getLineColor": [
208,
208,
208,
40
]
}
}
4. Run the UDF remotely and in parallel
Now that you have a UDF, let's see three ways you can Run UDFs remotely:
- via an HTTP endpoint,
- in a Python application,
- in parallel.
a. HTTP endpoint
Save the UDF and greate an HTTP endpoint for the UDF by clicking "Share" in the UDF Settings tab. You'll see snippets to call the UDF from anywhere using a shared token. Try this by changing the HTTP endpoint's query parameter as shown. You should see the output in GeoJSON format.
https://www.fused.io/server/v1/realtime-shared/fsh_46eSFZaR3q3SnoVB28pN0g/run/tiles/12/778/1548?dtype_out_vector=geojson&crop_id=36&year=2020
What just happened?
When you called the HTTP endpoint, Fused ran the UDF then sent back the output table and debug logs. In the URL above, /12/778/1548
specifies the ZXY tile index to structure bbox
and dtype_out_vector=geojson
specifies the output format. Fused passes the year
& crop_type
parameters to the UDF as int
based on their types in the function definition.
Why does this matter?
You called the HTTP endpoint with a shared token, which means any application may call the UDF and get data back without needing to configure credentials. You also passed parameters to the UDF, which enables you to dynamically generate data and define its output format.
To start seeing the full power of Fused, change the UDF code and call the endpoint again. You'll see the UDF automatically updates. When you call the UDF again it should run even faster.
b. Python SDK
The share token also enables you to run the UDF in a Python environment. You can specify bbox
as the same map tile as above by passing x
, y
, and z
parameters.
import fused
fused.run("fsh_46eSFZaR3q3SnoVB28pN0g", x=778, y=1548, z=12)
You can also pass a GeoDataFrame
to explicitly define a custom bbox
and other parameters specific to the UDF.
import geopandas as gpd
# Square AOI near Utah Lake
bbox = gpd.GeoDataFrame.from_features({"type":"FeatureCollection","features":[{"type":"Feature","properties":{"shape":"Rectangle"},"geometry":{"type":"Polygon","coordinates":[[[-112.01315665811222,40.13628586159681],[-111.89330615564467,40.13628586159681],[-111.89330615564467,40.004073791892196],[-112.01315665811222,40.004073791892196],[-112.01315665811222,40.13628586159681]]]}}]})
fused.run("fsh_46eSFZaR3q3SnoVB28pN0g", bbox=bbox, year=2020, crop_id=36)
c. Parallelization
Invoke the UDF in parallel with run_pool
to increase performance when you want to run it repeatedly with different input parameters. In this case, we'll use it to call the UDF across a set of years. For a deeper dive, read how to Call UDFs asynchronously.
import pandas as pd
def run_udf(year):
gdf = fused.run("fsh_46eSFZaR3q3SnoVB28pN0g", bbox=bbox, year=year, crop_id=36)
gdf['year'] = year
return gdf
gdfs = fused.utils.common.run_pool(run_udf, [2019, 2020, 2021])
gdf_final = pd.concat(gdfs)
gdf_final
These examples show how you can easily integrate Fused into your analytics workflows. For example, you can group gdf_final
by GEOID
and year
and calculate aggregates of the count
and stats
columns.
gdfs.groupby(['GEOID', 'year']).agg({'count':'sum', 'stats': 'mean'}).reset_index()
5. Create an app
Now that you've created a UDF and explored different ways to invoke it, you can create a data app to share your results. We'll structure the HTTP endpoint you created above to act as a Tile server (with /run/tiles/{z}/{x}/{y}
), allowing it to be called from a pydeck TileLayer
within the Fused App Builder. We'll also create a Streamlit dropdown for users to set the year parameter.
Click "Copy shareable link" to share the app with others or embed it in Notion!
6. Conclusion and next steps
We've shown how you can use Fused to develop a distributed Python workflow to power an app. Through a simple sequence of steps we loaded data, wrote analytics code, and created an app to interact with the data. With a single click you went from experimental development code to a live application.
We hope this overview gives you a glimpse of what you can build with Fused. You can continue to learn how to get data in, transform data, and integrate with other applications.
Find inspiration for your next project, ask questions, or share your work with the Fused community.