DuckDB

The combination of Fused serverless operations and DuckDB offers blazing fast data processing. Fused has long embraced Python for its serverless User Defined Functions (UDFs). Now, with the help of DuckDB, Fused enables developers to leverage the ease and familiarity of SQL in these functions as well — without compromising on performance and parallelism.

This notebook is support material for the blog post DuckDB + Fused: Fly beyond the serverless horizon.

1. Run DuckDB in a Fused UDF

As an example of running DuckDB within a Fused UDF, take the case of loading a geospatial Parquet dataset. The "DuckDB H3" sample UDF runs an SQL query with DuckDB on the NYC Taxi Trip Record Dataset. It uses the bbox argument to spatially filter the dataset and automatically parallelize the operation.

To try this example, you can run the cell below. You can find the code of the UDF in the Fused public UDF repo.

Alternatively, you can import the "DuckDB H3 Example Tile" UDF into your Fused Workbench environment.

This pattern gives DuckDB easy parallel operations. Fused spatially filters the dataset via the bbox parameter, runs the operation for each encompassing tile, and stitches the results together. Because Fused breaks down operations to only a fraction of the dataset, it's easy to transition between SQL and Python.

# !pip install fused "duckdb>=0.10.1" -q

import fused

udf = fused.load("https://github.com/fusedio/udfs/tree/main/public/DuckDB_H3_Example_Tile")
gdf = fused.run(udf=udf, x=2412, y=3078, z=13, engine='local')
gdf.head()

	cell_id	cell_lat	cell_lng	cnt	geometry	fused_index
0	8b2a100d692cfff	40.7691	-73.9611	6219	POLYGON ((-73.96110 40.76935, -73.96140 40.769...	0
1	8b2a100d6796fff	40.7586	-73.9858	6132	POLYGON ((-73.98577 40.75886, -73.98607 40.758...	1
2	8b2a100d2922fff	40.7509	-73.9766	2324	POLYGON ((-73.97660 40.75118, -73.97691 40.751...	2
3	8b2a100d6754fff	40.7562	-73.9786	2174	POLYGON ((-73.97864 40.75641, -73.97894 40.756...	3
4	8b2a100d2d95fff	40.7527	-73.9932	4826	POLYGON ((-73.99319 40.75299, -73.99349 40.752...	4

2. Call Fused UDFs from DuckDB

Any database that supports querying data via HTTP can call and load data from Fused UDF endpoints using common formats like Parquet or CSV. This means that DuckDB can dispatch operations to Fused that otherwise would be too complex or impossible to express with SQL, or would be unsupported in the local runtime.

As an example of calling a Fused endpoint from within DuckDB, take an operation to vectorize a raster dataset. This might be necessary to determine the bounds of areas with pixel value within a certain threshold range in an Earth observation image - such as a Digital Elevation Model. SQL is not geared to support raster operations, but these are easy to do in Python.

In this example, a Fused UDF returns a table where each record is a polygon generated from the contour of a raster provided by the Copernicus Digital Elevation Model as a Cloud Optimized GeoTIFF. DuckDB can easily trigger a UDF and load its output with this simple query, which specifies that the UDF endpoint returns a Parquet file.

This SQL query uses DuckDB's read_parquet function to call an endpoint of a UDF instance of the "DEM Raster to Vector" UDF.

You can find the code of the UDF in the Fused public UDF repo.

To try this example, simply run the following SQL query on the cell below or in a DuckDB shell. Change the min_elevation parameter to run the UDF for parts of California at different elevations. (Note: for DuckDB WASM, the file will be requested as CSV.)

import duckdb

con = duckdb.connect()

con.sql("""
    SELECT 
        wkt, 
        ROUND(area,1) AS area
    FROM read_parquet('https://www.fused.io/server/v1/realtime-shared/1e35c9b9cadf900265443073b0bd99072f859b8beddb72a45e701fb5bcde807d/run/file?min_elevation=500&dtype_out_vector=parquet')
    LIMIT 5
""")

Out:

┌────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────────┐

│ wkt │ area │

│ varchar │ double │

├────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤

│ POLYGON ((-124.1827015175656186 42.2400766748579528, -124.1827015175656186 42.1992724119226850, -124… │ 15409410.3 │

│ POLYGON ((-124.3052742356768618 42.2808809377932349, -124.1827015175656186 42.2808809377932207, -124… │ 138751711.8 │

│ POLYGON ((-124.3869893810843479 42.1176638860521493, -124.3869893810843479 42.0768596231168956, -124… │ 15447026.2 │

│ POLYGON ((-124.2235590902693616 42.1176638860521564, -124.2235590902693616 42.0360553601816278, -124… │ 30891651.3 │

│ POLYGON ((-124.1827015175656186 41.7912297825700350, -124.1827015175656186 41.7504255196347742, -124… │ 15518755.9 │

└────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────┘

This pattern enables DuckDB to address use cases and data formats that it doesn't natively support or would otherwise see high data transfer cost, such as raster operations, API calls, and control flow logic.

3. Integrate DuckDB in applications using Fused

Fused is the glue layer between DuckDB and apps. This enables seamless integrations that trigger Fused UDFs and load their results with simple parameterized HTTP calls.

DuckDB is an embedded database engine and doesn't have built-in capability to share results other than writing out files. As a corollary of the preceding example, it's possible to query and transform data with DuckDB and seamlessly integrate the results of queries into any workflow or app.

As an example, take the case of loading the output of a DuckDB query into Google Sheets. Sheets can easily structure the Fused UDF endpoint to pass parameters defined in specific cells as URL query parameters. In this example, the importData command calls the endpoint for the UDF above and loads its output data in CSV format. It constructs the endpoint string by concatenating a base URL with the B2, B3, and B4 cell values as query parameters.

=importData(CONCATENATE("https://www.fused.io/server/v1/realtime-shared/aba7b238d9445d576e15b2d6b780dc353bfdee55f02a285a85a3917b72835600/run/file?dtype_out_vector=csv&resolution=", B2, "&min_count=", B3, "&head=", B4))

To try this example simply make a copy of this Google Sheets spreadsheet (File > Make a copy) and click, and modify the parameters in B2:4 to trigger the Fused UDF endpoint and load data.

You can learn more about the Google Sheets integration in the documentation.

This pattern brings the power of the DuckDB analytical engine into non-analytical and no-code software like Google Sheets, Retool, and beyond - without the need to build bespoke integrations with closed-source systems. With this, a Python developer can abstract away the UDF and deliver data to end users. This removes the need to even install DuckDB.

DuckDB

1. Run DuckDB in a Fused UDF​

2. Call Fused UDFs from DuckDB​

3. Integrate DuckDB in applications using Fused​

1. Run DuckDB in a Fused UDF

2. Call Fused UDFs from DuckDB

3. Integrate DuckDB in applications using Fused