Skip to main content

DuckDB

Fused interfaces DuckDB with the Python duckdb library.

This example UDF shows how to query the NYC Taxi dataset with a DuckDB.

@fused.udf
def udf(agg_factor=3, min_count=5):
common = fused.load("https://github.com/fusedio/udfs/tree/b41216d/public/common/").utils
con = common.duckdb_connect()

df = con.sql(f"""
SELECT
round(pickup_longitude*{agg_factor},3)/{agg_factor} lng,
round(pickup_latitude*{agg_factor},3)/{agg_factor} lat,
count(1) cnt
FROM
read_parquet('s3://fused-users/fused/asset/docs//user_guide/yellow_tripdata_2010-01.parquet')
GROUP BY
round(pickup_longitude*{agg_factor},3),
round(pickup_latitude*{agg_factor},3)
HAVING
cnt>{min_count} AND
lat>40 AND
lat<41
""").df()

print("Number of trips:", df.cnt.sum())
gdf = common.to_gdf(df)
return gdf
note

We have done 2 things to simplify this example:

  1. Uploaded the NYC Taxi dataset to S3.
  2. Using the common.duckdb_connect() helper function, which we built to simplify setting up a DuckDB connection. Check the code here.