Skip to main content

DuckDB

The combination of Fused serverless operations and DuckDB offers blazing fast data processing. Fused has long embraced Python for its serverless User Defined Functions (UDFs). Now, with the help of DuckDB, Fused enables developers to leverage the ease and familiarity of SQL in these functions as well β€” without compromising on performance and parallelism.

This notebook is support material for the blog post DuckDB + Fused: Fly beyond the serverless horizon.

1. Run DuckDB in a Fused UDF​

As an example of running DuckDB within a Fused UDF, take the case of loading a geospatial Parquet dataset. The "DuckDB H3" sample UDF runs an SQL query with DuckDB on the NYC Taxi Trip Record Dataset. It uses the bbox argument to spatially filter the dataset and automatically parallelize the operation.

To try this example, you can run the cell below. You can find the code of the UDF in the Fused public UDF repo.

Alternatively, you can import the "DuckDB H3 Example Tile" UDF into your Fused Workbench environment.

This pattern gives DuckDB easy parallel operations. Fused spatially filters the dataset via the bbox parameter, runs the operation for each encompassing tile, and stitches the results together. Because Fused breaks down operations to only a fraction of the dataset, it's easy to transition between SQL and Python.

overture
# !pip install fused "duckdb>=0.10.1" -q
import fused

udf = fused.load("https://github.com/fusedio/udfs/tree/main/public/DuckDB_H3_Example_Tile")
gdf = fused.run(udf=udf, x=2412, y=3078, z=13, engine='local')
gdf.head()
cell_idcell_latcell_lngcntgeometryfused_index
08b2a100d692cfff40.7691-73.96116219POLYGON ((-73.96110 40.76935, -73.96140 40.769...0
18b2a100d6796fff40.7586-73.98586132POLYGON ((-73.98577 40.75886, -73.98607 40.758...1
28b2a100d2922fff40.7509-73.97662324POLYGON ((-73.97660 40.75118, -73.97691 40.751...2
38b2a100d6754fff40.7562-73.97862174POLYGON ((-73.97864 40.75641, -73.97894 40.756...3
48b2a100d2d95fff40.7527-73.99324826POLYGON ((-73.99319 40.75299, -73.99349 40.752...4

2. Call Fused UDFs from DuckDB​

Any database that supports querying data via HTTP can call and load data from Fused UDF endpoints using common formats like Parquet or CSV. This means that DuckDB can dispatch operations to Fused that otherwise would be too complex or impossible to express with SQL, or would be unsupported in the local runtime.

As an example of calling a Fused endpoint from within DuckDB, take an operation to vectorize a raster dataset. This might be necessary to determine the bounds of areas with pixel value within a certain threshold range in an Earth observation imageβ€Š-β€Šsuch as a Digital Elevation Model. SQL is not geared to support raster operations, but these are easy to do in Python.

overture

In this example, a Fused UDF returns a table where each record is a polygon generated from the contour of a raster provided by the Copernicus Digital Elevation Model as a Cloud Optimized GeoTIFF. DuckDB can easily trigger a UDF and load its output with this simple query, which specifies that the UDF endpoint returns a Parquet file.

This SQL query uses DuckDB's read_parquet function to call an endpoint of a UDF instance of the "DEM Raster to Vector" UDF.

You can find the code of the UDF in the Fused public UDF repo.

To try this example, simply run the following SQL query on the cell below or in a DuckDB shell. Change the min_elevation parameter to run the UDF for parts of California at different elevations. (Note: for DuckDB WASM, the file will be requested as CSV.)

import duckdb

con = duckdb.connect()

con.sql("""
SELECT
wkt,
ROUND(area,1) AS area
FROM read_parquet('https://www.fused.io/server/v1/realtime-shared/1e35c9b9cadf900265443073b0bd99072f859b8beddb72a45e701fb5bcde807d/run/file?min_elevation=500&dtype_out_vector=parquet')
LIMIT 5
""")

Out:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”

β”‚ wkt β”‚ area β”‚

β”‚ varchar β”‚ double β”‚

β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€

β”‚ POLYGON ((-124.1827015175656186 42.2400766748579528, -124.1827015175656186 42.1992724119226850, -124… β”‚ 15409410.3 β”‚

β”‚ POLYGON ((-124.3052742356768618 42.2808809377932349, -124.1827015175656186 42.2808809377932207, -124… β”‚ 138751711.8 β”‚

β”‚ POLYGON ((-124.3869893810843479 42.1176638860521493, -124.3869893810843479 42.0768596231168956, -124… β”‚ 15447026.2 β”‚

β”‚ POLYGON ((-124.2235590902693616 42.1176638860521564, -124.2235590902693616 42.0360553601816278, -124… β”‚ 30891651.3 β”‚

β”‚ POLYGON ((-124.1827015175656186 41.7912297825700350, -124.1827015175656186 41.7504255196347742, -124… β”‚ 15518755.9 β”‚

β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

This pattern enables DuckDB to address use cases and data formats that it doesn't natively support or would otherwise see high data transfer cost, such as raster operations, API calls, and control flow logic.

3. Integrate DuckDB in applications using Fused​

Fused is the glue layer between DuckDB and apps. This enables seamless integrations that trigger Fused UDFs and load their results with simple parameterized HTTP calls.

DuckDB is an embedded database engine and doesn't have built-in capability to share results other than writing out files. As a corollary of the preceding example, it's possible to query and transform data with DuckDB and seamlessly integrate the results of queries into any workflow or app.

As an example, take the case of loading the output of a DuckDB query into Google Sheets. Sheets can easily structure the Fused UDF endpoint to pass parameters defined in specific cells as URL query parameters. In this example, the importData command calls the endpoint for the UDF above and loads its output data in CSV format. It constructs the endpoint string by concatenating a base URL with the B2, B3, and B4 cell values as query parameters.

=importData(CONCATENATE("https://www.fused.io/server/v1/realtime-shared/aba7b238d9445d576e15b2d6b780dc353bfdee55f02a285a85a3917b72835600/run/file?dtype_out_vector=csv&resolution=", B2, "&min_count=", B3, "&head=", B4))
overture

To try this example simply make a copy of this Google Sheets spreadsheet (File > Make a copy) and click, and modify the parameters in B2:4 to trigger the Fused UDF endpoint and load data.

You can learn more about the Google Sheets integration in the documentation.

This pattern brings the power of the DuckDB analytical engine into non-analytical and no-code software like Google Sheets, Retool, and beyondβ€Š-β€Šwithout the need to build bespoke integrations with closed-source systems. With this, a Python developer can abstract away the UDF and deliver data to end users. This removes the need to even install DuckDB.