Join and Expose Messy Data to AI Agents
This example shows how to join data scattered across Snowflake, Google Sheets, and S3 into a single Fused pipeline, then expose the combined result to AI agents so they can answer business questions across all sources at once.
Walkthrough: Letting AI agents talk to your data no matter where it's from. Try the Canvas →
Real-world data rarely lives in one place. Supplier lists sit in Snowflake, store metadata in Google Sheets, and review scores land in S3 as CSVs. Manually consolidating these sources before every analysis is slow and brittle. With Fused, you can join disparate datasets inside a UDF pipeline and expose the combined result to AI agents — no manual consolidation required.
Building the canvas
1. Connect multiple data sources
Each data source is loaded by its own UDF in the Canvas. Fused can pull data from virtually anywhere — a database, a spreadsheet, a file on S3 — and the examples below are just three of the many connectors you can use.
Snowflake — supplier records
The all_suppliers UDF reads supplier records (name, address, phone, account balance) from Snowflake's TPCH_SF1 sample dataset — anyone with a Snowflake account can reproduce this. Store your Snowflake credentials using fused.secrets[] so the UDF can authenticate. See the Snowflake guide for setup details.
Show UDF code
@fused.udf
def udf(limit: int = 100):
    import snowflake.connector

    conn = snowflake.connector.connect(
        user="DEMO_APP_USER",  # change this to your user ID
        password=fused.secrets["snowflake_demo_access_token"],  # make sure this secret is set in Fused Secrets
        account="DINFVZH-WOB67667",
        warehouse="COMPUTE_WH",
        database="SNOWFLAKE_SAMPLE_DATA",
        schema="TPCH_SF1",
    )
    cur = conn.cursor()
    cur.execute("SELECT * FROM SUPPLIER LIMIT %s;", (limit,))
    results = cur.fetch_pandas_all()
    cur.close()
    conn.close()
    return results
Google Sheets — customer feedback
In this example, we imagine that supplier feedback comes from another team in our organization that works in the field and therefore only inputs data through a Google Sheet.
The suppliers_feedback_gdrive UDF reads customer feedback per supplier (ratings, sentiment, NPS, audit results) from a public Google Sheet. Replace sheet_id with your own sheet's ID. The sheet must be shared publicly for now (Share → Anyone with the link).
Show UDF code
@fused.udf
def udf(
    sheet_id: str = "1_utccObv7uSk-Ew92Yu3tW3roYCaZ8Shn3xhMelk24A",  # replace with your sheet ID
    sheet_name: str = "supplier_feedback",  # replace with your sheet name
):
    import pandas as pd

    url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}"
    df = pd.read_csv(url)
    return df
AWS S3 — store locations CSV
The supplier_locations_csv UDF reads physical store/facility locations (coordinates, store type, staff headcount, capacity) from a CSV on S3. To use your own data, update the S3 path. See the S3 reading guide for more details.
Show UDF code
@fused.udf
def udf():
    import pandas as pd
    import geopandas as gpd
    from shapely.geometry import Point

    path = 's3://fused-asset/demos/supplier_customer_joining/supplier_locations.csv'  # Demo data available on public Fused bucket
    df = pd.read_csv(path)
    # Coordinate column names vary across exports, so look them up case-insensitively
    col_map = {c.lower(): c for c in df.columns}
    lat_col = col_map.get('latitude') or col_map.get('lat')
    lon_col = col_map.get('longitude') or col_map.get('lon') or col_map.get('lng')
    geometry = [Point(xy) for xy in zip(df[lon_col], df[lat_col])]
    gdf = gpd.GeoDataFrame(df, geometry=geometry, crs='EPSG:4326')
    return gdf
2. Join datasets in Python
A downstream UDF (join_store_infos) uses fused.load() — which loads a UDF by name so you can call it from another UDF — to pull data from the three sources:
suppliers_udf = fused.load("all_suppliers")
feedback_udf = fused.load("suppliers_feedback_gdrive")
locations_udf = fused.load("supplier_locations_csv")
df_sup = suppliers_udf() # When no parameters are passed this executes all_suppliers UDF with its defaults
df_fb = feedback_udf()
df_loc = locations_udf()
The data arrives with different column names and ID formats across sources — Snowflake uses an integer S_SUPPKEY, the Google Sheet has prefixed strings like S00000001 or SUPP-3, and the CSV uses SUP0000001. The join UDF normalizes all of these to a common integer key, then merges the three DataFrames into a single GeoDataFrame.
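The normalization described above can be sketched in isolation. A minimal example (using made-up IDs in the three formats mentioned) showing how every format collapses to the same integer key:

```python
import re

def extract_int(s):
    """Pull the first run of digits out of a supplier ID, whatever its prefix."""
    m = re.search(r"\d+", str(s))
    return int(m.group()) if m else None

# All three ID styles reduce to the same integer join key
print(extract_int(1))             # Snowflake integer S_SUPPKEY
print(extract_int("S00000001"))   # Google Sheet prefixed string
print(extract_int("SUP0000001"))  # CSV style
# All three print 1
```

Because the key is a plain integer after normalization, a standard pandas merge on that column joins the sources regardless of how each system formatted its IDs.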
The joined result exposes filtering parameters (rating, sentiment, NPS, store type, staff headcount) that an agent can set directly.
Show join UDF code
@fused.udf
def udf(
    num_reviews_min: float = 0,
    num_reviews_max: float = 9999,
    avg_rating_min: float = 0.0,
    avg_rating_max: float = 5.0,
    sentiment: str = "All",
    nps_min: float = -100.0,
    nps_max: float = 100.0,
    store_type: str = "All",
    staff_headcount_min: float = 0,
    staff_headcount_max: float = 9999,
):
    import pandas as pd
    import geopandas as gpd
    import re

    suppliers_udf = fused.load("all_suppliers")
    feedback_udf = fused.load("suppliers_feedback_gdrive")
    locations_udf = fused.load("supplier_locations_csv")
    df_sup = suppliers_udf()
    df_fb = feedback_udf()
    df_loc = locations_udf()

    # Normalize supplier IDs to a plain integer join key
    def extract_int(s):
        m = re.search(r"\d+", str(s))
        return int(m.group()) if m else None

    df_sup["_join_key"] = df_sup["S_SUPPKEY"].astype(int)
    df_fb["_join_key"] = df_fb["supplier_id"].apply(extract_int)
    df_loc["_join_key"] = df_loc["supplier_ref"].apply(extract_int)

    # Merge all three on the common key
    df_merged = (
        df_sup
        .merge(df_fb, on="_join_key", how="left", suffixes=("", "_fb"))
        .merge(df_loc, on="_join_key", how="left", suffixes=("", "_loc"))
    )
    df_merged = df_merged.drop(columns=["_join_key", "S_ADDRESS"])
    gdf = gpd.GeoDataFrame(df_merged, geometry="geometry", crs="EPSG:4326")

    # Apply filters
    mask = (
        (gdf["num_reviews"].fillna(0) >= num_reviews_min) &
        (gdf["num_reviews"].fillna(0) <= num_reviews_max) &
        (gdf["avg_rating"].fillna(0) >= avg_rating_min) &
        (gdf["avg_rating"].fillna(0) <= avg_rating_max) &
        (gdf["net_promoter_score"].fillna(0) >= nps_min) &
        (gdf["net_promoter_score"].fillna(0) <= nps_max) &
        (gdf["staff_headcount"].fillna(0) >= staff_headcount_min) &
        (gdf["staff_headcount"].fillna(0) <= staff_headcount_max)
    )
    if sentiment != "All":
        mask &= gdf["sentiment"].str.lower() == sentiment.lower()
    if store_type != "All":
        mask &= gdf["store_type"] == store_type
    return gdf[mask]
3. Expose the joined data as an MCP tool
The Canvas publishes the joined result as an agent-callable endpoint via its OpenAPI specification. The tool description tells the agent what data is available and what parameters it accepts.
The OpenAPI spec only lists UDFs that are visible on the canvas — hidden UDFs will not appear in the spec. You can still call hidden UDFs directly, but agents won't discover them through the API listing. This lets you control exactly which tools an agent can see.
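To see exactly what an agent would discover, you can walk the spec's paths and collect each endpoint's parameters. A minimal sketch, assuming the Canvas publishes a standard OpenAPI 3 document (the inline spec here is an illustrative stand-in, not real Canvas output):

```python
def list_agent_tools(spec: dict) -> dict:
    """Map each endpoint in an OpenAPI spec to its declared parameter names."""
    tools = {}
    for path, methods in spec.get("paths", {}).items():
        for method, op in methods.items():
            params = [p["name"] for p in op.get("parameters", [])]
            tools[f"{method.upper()} {path}"] = params
    return tools

# Illustrative stand-in for a Canvas spec fetched from the .api.json URL
spec = {
    "paths": {
        "/join_store_infos": {
            "get": {
                "parameters": [
                    {"name": "avg_rating_min", "in": "query"},
                    {"name": "sentiment", "in": "query"},
                ]
            }
        }
    }
}
print(list_agent_tools(spec))
# {'GET /join_store_infos': ['avg_rating_min', 'sentiment']}
```

Only the paths present in the spec show up in this listing, which is why hiding a UDF on the canvas removes it from the agent's toolbox.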

4. Connect an AI agent
Once the Canvas is shared and its OpenAPI spec is available, you can connect an AI agent to it. Here's how to set up Claude Code:
- Teach Claude Code about Fused. Paste the Fused skills into Claude Code so it understands how to interact with Fused endpoints. Tell it: "always use these skills when working with Fused."
- Give it the OpenAPI spec. In the Canvas, click Share → OpenAPI and copy the .api.json URL (it looks like https://udf.ai/fc_<your_token>.api.json). Paste it into Claude Code so the agent knows which tools are available and what parameters they accept.
- Ask questions. The agent can now query the combined dataset directly.

For more details, see the Expose your Canvas to agents guide.
5. Ask business questions
Once connected, an AI agent can query the combined dataset directly. For example:
- "Which stores have the highest ratings?"
- "Show me the bottom-performing suppliers."
- "Compare average review scores across regions."
The agent draws answers from all three sources at once — no manual data wrangling needed.
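Behind the scenes, each question becomes an HTTP call to the joined endpoint with filter parameters filled in. A hedged sketch of what "show me the bottom-performing suppliers" might translate to — the base URL keeps the fc_<your_token> placeholder, and the chosen thresholds are illustrative; only the parameter names come from the join UDF:

```python
from urllib.parse import urlencode

BASE_URL = "https://udf.ai/fc_<your_token>/join_store_infos"  # placeholder token

def build_query(filters: dict) -> str:
    """Turn agent-chosen filter values into a query URL for the endpoint."""
    return f"{BASE_URL}?{urlencode(filters)}"

# "Bottom-performing" interpreted as low rating and non-positive NPS
url = build_query({"avg_rating_max": 2.5, "nps_max": 0})
print(url)
```

The agent picks the filter values itself from the parameter descriptions in the OpenAPI spec, so the quality of those descriptions directly shapes how well it answers.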
Try it out
Open the Fused Canvas to explore the live pipeline. The Canvas connects to all three data sources, joins them, and exposes the result to an AI agent — ready for you to query.
First, make a copy of the Canvas. Then follow the Share Modal guide to publish it. Once shared, click Share → OpenAPI to get the OpenAPI specification you can hand to any AI agent.