
Engineering & ETL

Connect Data to Fused

Connect your own data sources

You can directly connect your data buckets to Fused:

Bring data directly inside Fused

Quickly bring any data not on the cloud into Fused:

  1. Drag & Drop in File Explorer!

Drag and drop files directly into Workbench

  2. Use fused.api.upload()

Install the fused Python package, authenticate, and run:

import fused

# Copy a local file to the team's fd:// storage
fused.api.upload("my_local_file.csv", "fd://my_data/file.csv")

Note: fd:// is the private S3 path that Fused provisions for your team.
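
When several local files need uploading, it can help to derive each fd:// destination from the local file name first. The sketch below only builds the destination strings. fd_target and the my_data folder default are hypothetical names chosen for illustration, and the file list is made up.

```python
from pathlib import Path


def fd_target(local_file: str, folder: str = "my_data") -> str:
    """Build an fd:// destination for a local file (hypothetical helper)."""
    name = Path(local_file).name  # keep only the file name, drop local directories
    return f"fd://{folder.strip('/')}/{name}"


# Map each local file to the destination it would be uploaded to.
local_files = ["exports/my_local_file.csv", "exports/sites.parquet"]
targets = {f: fd_target(f) for f in local_files}
```

Each local and remote pair in targets could then be passed to the upload call shown above.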

Optimize data loading

For files < 1GB:

Use the caching built into Fused to speed up data loading:

import fused

@fused.udf
def udf(path: str = "s3://fused-sample/demo_data/housing_2024.csv"):
    import pandas as pd

    # Cache the slow read so repeated runs reuse the loaded DataFrame
    @fused.cache
    def load_data(path):
        return pd.read_csv(path)

    # Some processing

    return load_data(path)

  • As you make changes inside your UDF, calls to load_data() are served from the cache instead of re-reading the file.
  • This is especially useful for slow formats (CSV, Excel, etc.) or for files that are poorly partitioned.
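
The effect of caching a loader can be illustrated without Fused at all. The sketch below uses the standard library's functools.lru_cache as an in-process analogy. It is not how @fused.cache is implemented, and unlike the cache described above it lives only for the lifetime of the Python process. SAMPLE_CSV and parse_count are made-up names and data for the demonstration.

```python
import csv
import io
from functools import lru_cache

# In-memory stand-in for a slow CSV source (made-up sample rows).
SAMPLE_CSV = "price,area\n1200000,85\n950000,60\n"
parse_count = 0  # counts how often the expensive parse really runs


@lru_cache(maxsize=None)
def load_data(path: str) -> tuple:
    """Parse the CSV once per distinct path; later calls reuse the result."""
    global parse_count
    parse_count += 1
    rows = csv.DictReader(io.StringIO(SAMPLE_CSV))
    return tuple((int(r["price"]), int(r["area"])) for r in rows)


first = load_data("demo.csv")   # parses the data
second = load_data("demo.csv")  # same argument: served from the cache
```

After both calls, parse_count is still 1, because the second call never reaches the function body.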

For files > 1GB:

Use fused.ingest() to convert large datasets into cloud-optimized, partitioned files.

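import fused

# Placeholder: replace with your own user ID before running
user_id = "my_user_id"

job = fused.ingest(
    input="https://www2.census.gov/geo/tiger/TIGER_RD18/LAYER/TRACT/tl_rd22_11_tract.zip",
    output=f"s3://fused-users/{user_id}/census/dc_tract/",
)

# Run the ingestion job on Fused's remote infrastructure
job.run_remote()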

Read more about how to ingest your data.
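
The size rule above can be written as a small helper that picks between the two approaches. This is a minimal sketch: loading_strategy and ONE_GB are hypothetical names, the decimal definition of a gigabyte is an assumption, and the text does not say what to do with a file of exactly 1 GB, so the sketch arbitrarily treats it as large.

```python
ONE_GB = 1_000_000_000  # assumption: decimal gigabyte; the text does not define it


def loading_strategy(size_bytes: int) -> str:
    """Return which approach from the text applies to a file of this size."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    # Exactly 1 GB is not covered by the text; this sketch sends it to ingest.
    return "cache" if size_bytes < ONE_GB else "ingest"
```

For a local file, the size could come from os.path.getsize(path) before calling loading_strategy.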

Turn your data into an API

Share your data with the world by turning it into an API:

@fused.udf
def udf(path: str = "s3://fused-sample/demo_data/housing_2024.csv"):
    import pandas as pd

    df = pd.read_csv(path)

    # Only return the relevant data for my team
    df = df[df['price'] > 1000000]
    return df[['price', 'area']]

In Workbench:

  • Save your UDF
  • Click "Share"
  • Create a shared token
  • You now have an HTTPS endpoint for calling this UDF, which returns data in your desired format:
https://fused.io/.../run/file?

Learn more about creating a shared token.
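
The trailing "?" in the endpoint suggests that UDF arguments are passed as URL query parameters. Under that assumption, which this page does not confirm, a request URL could be assembled as sketched below. build_udf_url is a hypothetical helper and the example.invalid address is a placeholder, not a real Fused endpoint.

```python
from urllib.parse import urlencode


def build_udf_url(endpoint: str, **params: str) -> str:
    """Append UDF arguments to a shared endpoint as a URL query string."""
    base = endpoint.rstrip("?")  # tolerate an endpoint copied with a trailing "?"
    if not params:
        return base
    return f"{base}?{urlencode(params)}"


# Placeholder endpoint; use the full URL that Workbench shows for your token.
url = build_udf_url(
    "https://example.invalid/run/file?",
    path="s3://fused-sample/demo_data/housing_2024.csv",
)
```

urlencode percent-encodes the s3:// path, so the value survives as a single query parameter.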

Infrastructure (GitHub / Secrets / On-Prem)

You can use Fused with your own infrastructure:

Examples