File Formats
This page describes which file formats we prefer to work with at Fused, for both raster and vector data, and why.
What this page is about
Fused works with any file format you might have, since all UDFs run pure Python. This means you can process your data in whatever format you want. That said, the goal of Fused is to significantly speed up and simplify the workflow of data scientists by leveraging modern cloud compute infrastructure.
Some formats such as Shapefile, CSV, and JSON, while incredibly versatile, aren't well suited to large datasets (anything above a few GB) and are slow to read and write (we consider anything that takes tens of seconds or more to read to be extremely slow).
Take a look at our benchmark, which compares loading a CSV, a GeoParquet, and a Fused-partitioned GeoParquet, for a concrete example of this.
To make the most out of Fused, we recommend ingesting your data into the following file formats:
For rasters (images)
For images (such as satellite imagery), we recommend using Cloud Optimized GeoTIFFs (COGs). To paraphrase the Cloud Native Geo guide on them:
Cloud-Optimized GeoTIFF (COG), a raster format, is a variant of the TIFF image format that specifies a particular layout of internal data in the GeoTIFF specification to allow for optimized (subsetted or aggregated) access over a network for display or data reading
Fused does not (yet) have a built-in tool to ingest raster data. We suggest you create COGs yourself, for example by using gdal's built-in options or rio-cogeo.
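As a rough sketch of the GDAL route, the snippet below builds a gdal_translate command that writes through GDAL's COG output driver (available from GDAL 3.1) and only runs it when GDAL is installed and the input file exists. The helper name build_cog_command, the file names, and the DEFLATE compression choice are illustrative assumptions, not Fused recommendations.

```python
import shutil
import subprocess
from pathlib import Path


def build_cog_command(src: str, dst: str, compress: str = "DEFLATE") -> list[str]:
    """Return the gdal_translate argv that converts `src` into a COG at `dst`."""
    return [
        "gdal_translate",
        "-of", "COG",                   # GDAL's Cloud Optimized GeoTIFF driver (GDAL >= 3.1)
        "-co", f"COMPRESS={compress}",  # creation option: compression codec (assumed choice)
        src,
        dst,
    ]


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    cmd = build_cog_command("input.tif", "output_cog.tif")
    print(" ".join(cmd))
    # Only run the conversion when GDAL is installed and the input exists.
    if shutil.which("gdal_translate") and Path("input.tif").exists():
        subprocess.run(cmd, check=True)
```

Building the argument list separately from running it makes it easy to inspect or log the exact command before converting a large file.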
Cloud Optimized GeoTIFFs have several features that make them particularly well suited to cloud native applications, namely:
- Tiling: Images are split into smaller tiles that can be accessed individually, so fetching only part of the data is a lot faster.
- Overviews: Pre-rendered, lower-resolution versions of the image. These make displaying images at different zoom levels a lot faster.
A simple visual of COG tiling: if we only need the top-left part of the image, we can fetch only those tiles (green arrows). Image courtesy of Element 84's blog on COGs.
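To make the tiling idea concrete, here is a minimal sketch of the arithmetic involved: given a pixel window, it works out which internal tiles overlap it. The 512-pixel tile size, the 4096 x 4096 image, and the helper name tiles_for_window are assumptions chosen for illustration; real COGs can use other tile sizes.

```python
from math import ceil


def tiles_for_window(col_off, row_off, width, height, tile_size=512):
    """Return (tile_row, tile_col) indices of the internal tiles overlapping a pixel window."""
    first_col = col_off // tile_size
    first_row = row_off // tile_size
    # The last pixel covered by the window is at offset + size - 1.
    last_col = (col_off + width - 1) // tile_size
    last_row = (row_off + height - 1) // tile_size
    return [
        (r, c)
        for r in range(first_row, last_row + 1)
        for c in range(first_col, last_col + 1)
    ]


# Hypothetical 4096 x 4096 image stored as 512 x 512 tiles.
image_width, image_height = 4096, 4096
total_tiles = ceil(image_width / 512) * ceil(image_height / 512)

# Request only the top-left 1000 x 1000 pixels.
needed = tiles_for_window(0, 0, 1000, 1000)
print(f"{len(needed)} of {total_tiles} tiles needed")  # prints "4 of 64 tiles needed"
```

In this example a reader fetches 4 tiles instead of all 64, which is where the speedup for partial reads comes from.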
- Element 84 wrote a simple explainer of what Cloud Optimized GeoTIFFs are, with great visuals
- Dedicated website for the Cloud Optimized GeoTIFF spec
- Cloud Optimized GeoTIFF page on the Cloud Native Geo guide
For vectors (tables)
To handle vector data such as pandas DataFrames or geopandas GeoDataFrames, we recommend using GeoParquet files. To (once again) paraphrase the Cloud Native Geo guide:
GeoParquet is an encoding for how to store geospatial vector data (point, lines, polygons) in Apache Parquet, a popular columnar storage format for tabular data.
Figure: A simple overview of GeoParquet benefits. Image credit: the Cloud Native Geo slideshow.
Refer to the next section for all the details of how to ingest your data with Fused's built-in fused.ingest() to make the most of GeoParquet.
- geoparquet GitHub repo
- geoparquet 1 page website with a list of companies & projects involved
- GeoParquet page on Cloud Native Geo guide
Additional resources
- Read the Cloud-Optimized Geospatial Formats Guide written by the Cloud Native Geo Org about why we need Cloud Native formats
- Friend of Fused Kyle Barron did an interview about Cloud Native Geospatial Formats. Kyle provides simple introductions to some cloud native formats like GeoParquet.