Step 2: Joining Multiple Geospatial Datasets

After creating all 3 of our datasets at a H3 cell grid size of 3, we can now join them together.

Joining logic

Here's a quick summary of what each of our datasets contains:

Crop

Dataset size: ~700k rows

hex	crop_type	area
599658332378103807	4	1713
599669866512777215	50	20854
599657528145477631	122	17505
599699078732840959	121	10368586
599697034328408063	122	17754

Elevation

Dataset size: ~68k rows

hex	mean_value
599647257804931071	453.75
599647264247382015	375.32
599647267468607487	392.52
599647276058542079	401.93
599647280353509375	410.52

Temperature

Dataset size: ~68k rows

hex	daily_mean
599740827962441727	16.46
599740794676445183	16.51
599740824741216255	16.40
599740837626118143	14.99
599740839773601791	14.93

Notice a few things:

We have ~10x more rows in the Crop dataset than the Elevation and Temperature datasets, which both have the same number of H3 cells.
The Elevation & Temperature datasets each have 1 H3 cell < - > 1 value. The Crop dataset has multiple rows per each H3 cell.

We can write a simple UDF to return all the different dataset at once for a single H3 cell to see this:

@fused.udf
def udf():
    # Assuming you have named your UDFs as follows:
    elevation_data = fused.run('elevation_hex_mean_height')
    cdl_data = fused.run("cdl_data_all_crops_staging")
    temp_data = fused.run("era5_hex_mean_temp_foss4g")

    # Random h3 cell within the US
    random_hex = 599691796615790591
    
    # Filter all 3 dataframes by the same hex value
    filtered_elevation = elevation_data[elevation_data['hex'] == random_hex]
    filtered_cdl = cdl_data[cdl_data['hex'] == random_hex]
    filtered_temp = temp_data[temp_data['hex'] == random_hex]
    
    # Print the contents of all 3 filtered dataframes
    print("Elevation Data:")
    print(filtered_elevation)
    print("\nCDL Data:")
    print(filtered_cdl)
    print("\nTemperature Data:")
    print(filtered_temp)
    
    return filtered_elevation

This returns:

Elevation Data:
                      hex  mean_value
46561  599691796615790591  307.809524

CDL Data:
        crop_type                 hex      area            chunk_id  year
355599        122  599691796615790591   7631733  577199624117288959  2024
355600        123  599691796615790591   2211934  577199624117288959  2024
355601         68  599691796615790591  18305536  577199624117288959  2024
355602         37  599691796615790591   5207407  577199624117288959  2024
355603        131  599691796615790591     60333  577199624117288959  2024
...           ...                 ...       ...                 ...   ...
365701          6  599691796615790591     10317  577199624117288959  2024
365702        143  599691796615790591     12347  577199624117288959  2024
365703         29  599691796615790591     14997  577199624117288959  2024
365704        221  599691796615790591     15215  577199624117288959  2024
365705         38  599691796615790591     16505  577199624117288959  2024

[68 rows x 5 columns]

Temperature Data:
                      hex  daily_mean
41582  599691796615790591       15.31

1 H3 cell < - > 1 value for Elevation and Temperature
1 H3 cell < - > 68 values for Crop!

This is because the Crop dataset creates a new row for each crop type. Recall that a resolution 5 H3 cell is 252 km², so a single cell can contain multiple crop types.

To join our datasets we need to decide on a logic. We'll decide to:

Aggregate the Crop dataset by most dominant crop type for each H3 cell.

Aggregating crop by most common

Aggregating crop dataset by most common crop

@fused.udf
def udf():
    cdl_data = fused.run("cdl_data_all_crops_staging")

    # Aggregate CDL by most common crop by area. 1 h3 cell <-> 1 row: crop_rank_1 and crop_area
    common = fused.load("https://github.com/fusedio/udfs/tree/56ec615/public/common/")
    con = common.duckdb_connect()
    
    # Group by hex cell, rank crops by area, keep top crop per cell
    crop_agg = con.execute("""
        SELECT 
            hex,
            crop_type as crop_rank_1,
            SUM(area) as crop_area,
            year
        FROM cdl_data
        GROUP BY hex, crop_type, year
        QUALIFY ROW_NUMBER() OVER (PARTITION BY hex, year ORDER BY SUM(area) DESC) = 1
    """).df()

    print(f"{crop_agg.T=}")
    

    return crop_agg

We can refine this further by:

Filtering out any crops we're not interested in like shrubs, grass, forest, etc.
Taking neighboring cells into account to get more accurate statistics using k-ring

@fused.udf
def udf(k_ring_size: int = 1):
    cdl_data = fused.run("cdl_data_all_crops_staging")

    # Aggregate CDL by most common crop by area. 1 h3 cell <-> 1 row: crop_rank_1 and crop_area
    common = fused.load("https://github.com/fusedio/udfs/tree/56ec615/public/common/")
    con = common.duckdb_connect()

    # Removing 'crop types' from CDL dataset we're not interested in
    crops_to_remove = [
        64,  # Shrubland
        152, # Shrubland
        176, # Pasture/Hay
        63,  # Forest
        142, # Forest
        141, # Forest
        190, # Woody Wetlands
        111, # Perennial Ice/Snow
        195, # Water
        131, # Barren
        121, # Developed/Open Space
        143, # Mixed Forest
    ]
    cdl_data = cdl_data[~cdl_data['crop_type'].isin(crops_to_remove)]

    crop_agg = con.execute("""
        WITH k_ring_expanded AS (
            SELECT 
                hex as center_hex,
                unnest(h3_grid_disk(hex, ?)) as ring_hex
            FROM cdl_data
        ),
        k_ring_data AS (
            SELECT 
                k.center_hex,
                c.crop_type as crop_type,
                c.area as crop_area
            FROM k_ring_expanded k
            JOIN cdl_data c ON CAST(c.hex AS BIGINT) = CAST(k.ring_hex AS BIGINT)
        ),
        aggregated_crops AS (
            SELECT 
                center_hex,
                crop_type,
                SUM(crop_area) as total_area
            FROM k_ring_data
            GROUP BY center_hex, crop_type
        ),
        ranked_crops AS (
            SELECT 
                center_hex,
                crop_type,
                total_area,
                ROW_NUMBER() OVER (PARTITION BY center_hex ORDER BY total_area DESC) as rank
            FROM aggregated_crops
        )
        SELECT 
            center_hex as hex,
            crop_type as crop_rank_1,
            total_area as area_rank_1
        FROM ranked_crops
        WHERE rank = 1
    """, [k_ring_size]).df()

    print(f"{crop_agg.T=}")
    

    return crop_agg

This returns:

hex	crop_rank_1	area_rank_1
599230208964296703	0	8.98e+08
599238790308954111	0	2.03e+09
...	...	...
599238794603921407	0	2.02e+09
599238816078757887	0	2.02e+09

Shape: (51299, 3)

Joining datasets

We now have 3 datasets, with unique hex values each, so we can join them together:

@fused.udf
def udf(k_ring_size: int = 1):
    elevation_data = fused.run('elevation_hex_mean_height')
    cdl_data = fused.run("cdl_data_all_crops_staging")
    temp_data = fused.run("era5_hex_mean_temp_foss4g")

    # Aggregate CDL by most common crop by area. 1 h3 cell <-> 1 row: crop_rank_1 and crop_area
    common = fused.load("https://github.com/fusedio/udfs/tree/56ec615/public/common/")
    con = common.duckdb_connect()

    # Removing 'crop types' from CDL dataset we're not interested in
    crops_to_remove = [
        64,  # Shrubland
        152, # Shrubland
        176, # Pasture/Hay
        63,  # Forest
        142, # Forest
        141, # Forest
        190, # Woody Wetlands
        111, # Perennial Ice/Snow
        195, # Water
        131, # Barren
        121, # Developed/Open Space
        143, # Mixed Forest
    ]
    cdl_data = cdl_data[~cdl_data['crop_type'].isin(crops_to_remove)]

    crop_agg = con.execute("""
        WITH k_ring_expanded AS (
            SELECT 
                hex as center_hex,
                unnest(h3_grid_disk(hex, ?)) as ring_hex
            FROM cdl_data
        ),
        k_ring_data AS (
            SELECT 
                k.center_hex,
                c.crop_type as crop_type,
                c.area as crop_area
            FROM k_ring_expanded k
            JOIN cdl_data c ON CAST(c.hex AS BIGINT) = CAST(k.ring_hex AS BIGINT)
        ),
        aggregated_crops AS (
            SELECT 
                center_hex,
                crop_type,
                SUM(crop_area) as total_area
            FROM k_ring_data
            GROUP BY center_hex, crop_type
        ),
        ranked_crops AS (
            SELECT 
                center_hex,
                crop_type,
                total_area,
                ROW_NUMBER() OVER (PARTITION BY center_hex ORDER BY total_area DESC) as rank
            FROM aggregated_crops
        )
        SELECT 
            center_hex as hex,
            crop_type as crop_rank_1,
            total_area as area_rank_1
        FROM ranked_crops
        WHERE rank = 1
    """, [k_ring_size]).df()

    # Join elevation with aggregated CDL crop data
    merged = elevation_data.merge(
        crop_agg,
        on='hex',
        how='left'
    ).rename(columns={'mean_value': 'elevation'})
    
    # Join with ERA5 temperature data
    merged = merged.merge(
        temp_data,
        on='hex',
        how='left'
    )

    # Round elevation to 1 decimal place
    merged['elevation'] = merged['elevation'].round(1)
    
    return merged

Returns:

hex	elevation	crop_rank_1	area_rank_1	daily_mean
599650153686630399	357.4	1.0	9.152000e+09	13.90
599650161202823167	413.5	1.0	1.124163e+10	13.45
...	...	...	...	...
599747753597206527	255.9	0.0	2.030604e+09	8.93
599747766482108415	283.6	0.0	2.031832e+09	8.15

Shape: (68044, 5)

Using fused common functions for simpler H3 operations

Note that we're using the Fused common functions to simplify DuckDB queries using H3 functions:

# inside our udf
common = fused.load("https://github.com/fusedio/udfs/tree/56ec615/public/common/")
con = common.duckdb_connect()
crop_agg = con.execute("""SELECT ...""").df()
# rest of logic

This function provides some benefits:

Pre-installed dependencies for H3 operations
No need to register data as tables in DuckDB as a preliminary step

Summary & Next Steps

We now have a joined dataset that contains, for each H3 res 5 cell:

Elevation data
Crop data by most common crop + area of said crop
Temperature data

Step 3: Interactive Visualization

Joining logic​

Aggregating crop by most common​

Joining datasets​

Summary & Next Steps​

Joining logic

Aggregating crop by most common

Joining datasets

Summary & Next Steps