Fused is a toolkit to enable interoperability between geospatial datasets and tools in the modern data stack. Fused is the glue layer that integrates data platforms with data tools via a managed serverless API.

Current limitations with geospatial data processing

Today, there is a fragmented ecosystem around scalable geospatial data processing. Python geospatial libraries like GeoPandas, Shapely, and Rasterio make it easy to do small jobs, but are single-threaded and operate entirely in-memory. For bigger jobs, there are Python parallel processing tools like Dask that require complex installations and are prone to memory pressure errors. Spark-based tools like Apache Sedona and RasterFrames have a steep learning curve and are hard to debug and orchestrate. Postgres and its geospatial extension PostGIS operate on larger-than-memory datasets, but are hard to scale beyond the disk of one machine, aren’t designed for OLAP workloads, and can be hard to administer. Cloud data warehouses like Databricks and Snowflake are monolithic systems that tend to bring lock-in and pricing that is hard to anticipate.

Spatial SQL is a great way to run scalable operations on tables with vector data, but it falls short on raster data and lacks native access to the libraries that data scientists rely on for fine-grained operations. In fact, geospatial data science teams largely use Python and would prefer to use it in both development and production, but tooling fragmentation forces them to juggle languages and frameworks. The present paradigm accepts the inefficiencies of complexity as a necessary evil because there hasn’t been a better way to work with both raster and vector data at scale. Data teams have an unaddressed need for a friendly Python API that scales. To increase development velocity, it’s convenient for most code to run in Python, with only computationally heavy code moved into specialized frameworks as efficiently as possible. Additionally, scaling Python from local development to massive cloud workloads calls for efficient parallelization.

Seizing the moment

The last several years have seen a commoditization of modular building blocks of OLAP systems and increased adoption of geospatial cloud-native data formats. With the convergence and popularity of columnar memory formats like Apache Arrow and Apache Parquet, easy-to-use columnar OLAP databases like DuckDB, and broader adoption of geospatial cloud-native data formats like Cloud-Optimized GeoTIFF and GeoParquet, we believe there’s a window for a serverless geospatial OLAP engine. Moreover, serverless computing has emerged as a prominent trend, delegating the management of infrastructure and dynamically scaling resources in response to demand, leading to heightened flexibility and cost efficiency. Leveraging serverless cloud infrastructure like AWS Lambda, Azure Functions, Google Cloud Functions, or Cloudflare Workers enables event-driven processing closer to the data source.

Parquet files have become the standard file format for columnar data and have helped to commoditize the decoupling of storage and compute by enabling queries directly on object storage like AWS S3. GeoParquet – a specification for storing point, line, and polygon geometries in Parquet – has seen recent momentum as a fast storage format for geospatial vector data and has started to be integrated into industry-standard tools like GDAL. Moreover, with spatial partitioning, operations can be broken down into small independent parts that execute simultaneously in multiple processes. For geospatial array data like satellite imagery, Cloud-Optimized GeoTIFF – an extension to GeoTIFF that enables chunked access via HTTP range requests – has taken hold as the standard way to store geospatial image data, with petabytes publicly available from AWS’ open data program and buy-in from major data providers like USGS and Planet.
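To make the spatial partitioning idea concrete, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not how Fused or GeoParquet implement partitioning: the function names (`partition_by_grid`, `summarize`, `run_partitioned`) and the fixed square grid are hypothetical, and threads stand in for the separate processes or serverless invocations a real system would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_by_grid(points, cell_size):
    """Group (x, y) points by the square grid cell that contains them."""
    partitions = defaultdict(list)
    for x, y in points:
        key = (int(x // cell_size), int(y // cell_size))
        partitions[key].append((x, y))
    return dict(partitions)


def summarize(partition):
    """Per-partition work: count the points and compute their bounding box."""
    xs = [x for x, _ in partition]
    ys = [y for _, y in partition]
    return {"count": len(partition), "bbox": (min(xs), min(ys), max(xs), max(ys))}


def run_partitioned(points, cell_size, max_workers=4):
    """Process every partition independently and collect results by cell."""
    partitions = partition_by_grid(points, cell_size)
    keys = sorted(partitions)
    # Assumption: threads keep this sketch runnable anywhere. Each partition is
    # independent, so the same map could run on a process pool or on
    # separate serverless invocations instead.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(summarize, (partitions[k] for k in keys)))
    return dict(zip(keys, results))


points = [(0.5, 0.5), (1.5, 0.2), (0.1, 0.9), (9.9, 9.9), (-0.5, 0.5)]
summary = run_partitioned(points, cell_size=1.0)
```

Because no partition needs data from any other, the work scales by adding workers rather than by growing a single machine.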

Apache Arrow has become the universal in-memory format for columnar, analytic data because its language-independent specification makes it easier to move data between languages and frameworks. Moreover, GeoArrow – an incubating specification for storing geospatial data in Arrow – gives us a way to move geospatial data from Python to compiled code at essentially no serialization cost, and will likely serve as the foundation for an ecosystem of large-data geospatial tools. Already in the frontend, deck.gl can use GeoArrow-style data buffers to visualize millions of coordinates with no serialization costs.
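The idea behind such buffers can be sketched with the standard library alone. The layout below, one flat array of interleaved x, y values plus an offsets array marking where each linestring begins and ends, is a simplified illustration of the general approach, not the GeoArrow specification itself; the helper names `linestring_view` and `length` are hypothetical. Slicing a `memoryview` returns a window onto the same memory, so no coordinates are copied or serialized.

```python
import math
from array import array

# Two linestrings stored as one flat buffer of interleaved x, y values.
coords = array("d", [0.0, 0.0, 1.0, 1.0, 2.0, 0.0,   # line 0: three vertices
                     5.0, 5.0, 6.0, 7.0])            # line 1: two vertices
# offsets[i]..offsets[i + 1] is the vertex range of linestring i.
offsets = array("i", [0, 3, 5])


def linestring_view(coords, offsets, i):
    """Return a zero-copy view of the i-th linestring's x, y values."""
    start, end = offsets[i], offsets[i + 1]
    return memoryview(coords)[2 * start : 2 * end]


def length(view):
    """Sum the segment lengths of one linestring read straight from the buffer."""
    total = 0.0
    for k in range(0, len(view) - 2, 2):
        dx = view[k + 2] - view[k]
        dy = view[k + 3] - view[k + 1]
        total += math.hypot(dx, dy)
    return total


lengths = [length(linestring_view(coords, offsets, i))
           for i in range(len(offsets) - 1)]
```

A consumer that understands the same layout, whether compiled code or a GPU renderer, can read these buffers directly instead of parsing one geometry object at a time.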

As a result of all these trends, smaller chunks of data can be transferred to and processed on serverless cloud services in ways that were never possible before. Public clouds enable event-driven compute services that automatically scale, which makes for simple infrastructure and dependency management. Managed offerings reduce the complexity of data pipelines, enabling geospatial workloads of any size to run on demand and empowering users to go from code to map instantly.
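As a rough illustration of how little data needs to move, the chunked access that Cloud-Optimized GeoTIFF enables via HTTP range requests can be sketched as follows. This is a sketch under assumptions: the helper names, the 512-pixel tile size, the URL, and the byte offset are all hypothetical, and the request is built but never sent. In a real file, each tile's byte offset and length come from the TIFF `TileOffsets` and `TileByteCounts` tags.

```python
import urllib.request


def tiles_for_window(col_off, row_off, width, height, tile_size=512):
    """Return the (tile_row, tile_col) indices that intersect a pixel window."""
    first_col = col_off // tile_size
    last_col = (col_off + width - 1) // tile_size
    first_row = row_off // tile_size
    last_row = (row_off + height - 1) // tile_size
    return [(r, c)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]


def range_request(url, offset, length):
    """Build (but do not send) a GET request for `length` bytes at `offset`."""
    last_byte = offset + length - 1  # HTTP byte ranges are inclusive
    return urllib.request.Request(
        url, headers={"Range": f"bytes={offset}-{last_byte}"}
    )


# A 100 x 100 pixel window that straddles two tiles of a hypothetical image.
needed = tiles_for_window(col_off=500, row_off=0, width=100, height=100)
# Hypothetical URL and byte range for one of those tiles.
request = range_request("https://example.com/scene.tif", offset=1024, length=2048)
```

Only the tiles that intersect the window are requested, which is what lets a short-lived serverless function read a small part of a very large image.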

Why Fused?

Fused instantly converts users’ Python code into workflows and maps that run in Jupyter notebooks, low-code web apps, the Fused Workbench web app, ETL pipelines, or any tool that consumes HTTP API endpoints. Fused lets developers run real-time serverless operations at any scale and build responsive maps, dashboards, and reports. Developers can develop directly in production and run on data of any scale without infrastructure friction, using serverless parallel computing powered by advanced caching of geo-partitioned data. This brings interoperable workflows, apps, and maps to the user’s preferred stack and avoids vendor lock-in.

With Fused, users find, reuse, and share User Defined Functions (UDFs) in Fused’s vibrant community. UDFs are building blocks of serverless geospatial operations that integrate across the stack with Planetary Computer, Google Earth Engine, BigQuery, Snowflake, DuckDB, and more. They load datasets from across the cloud ecosystem, from sources such as NASA, NOAA, the US Census, and Overture. Fused’s serverless API turns these UDFs into live HTTP endpoints that load their output into any tool.
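The general pattern of exposing a Python function as an HTTP endpoint can be sketched with the standard library. To be clear, this is not the Fused API: the function `points_in_bbox`, the `/run/...` URL scheme, the `REGISTRY` dictionary, and the sample points are all hypothetical, and a plain dispatcher function stands in for a real web server.

```python
import json
from urllib.parse import parse_qs, urlparse


def points_in_bbox(min_x, min_y, max_x, max_y):
    """Hypothetical UDF-style function: sample points in a bbox, as GeoJSON."""
    sample = [(-122.4, 37.8), (-73.9, 40.7), (2.35, 48.85)]
    features = [
        {
            "type": "Feature",
            "properties": {},
            "geometry": {"type": "Point", "coordinates": [x, y]},
        }
        for x, y in sample
        if min_x <= x <= max_x and min_y <= y <= max_y
    ]
    return {"type": "FeatureCollection", "features": features}


# Hypothetical registry that maps endpoint names to functions.
REGISTRY = {"points_in_bbox": points_in_bbox}


def handle(url):
    """Dispatch a URL like /run/points_in_bbox?bbox=a,b,c,d to a function."""
    parsed = urlparse(url)
    name = parsed.path.rsplit("/", 1)[-1]
    if name not in REGISTRY:
        return 404, json.dumps({"error": f"unknown function {name!r}"})
    query = parse_qs(parsed.query)
    bbox = [float(v) for v in query["bbox"][0].split(",")]
    return 200, json.dumps(REGISTRY[name](*bbox))


status, body = handle("/run/points_in_bbox?bbox=-130,30,-60,50")
```

Because the response is ordinary GeoJSON over HTTP, any tool that can fetch a URL can consume the function's output without knowing it was written in Python.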

Fused allows people, for the first time, to easily work with geospatial data and integrate it with modern data tools. This is a radical departure from the days when teams had to manually run multistep processes fragmented across tools and data standards, with the help of an army of data engineers and infrastructure staff (if they were lucky), just to render data on a map. Fused is built to be the interoperable glue between geospatial data systems, and we’re excited to bring best-in-class cloud infrastructure and distributed computing to this industry.

Join us in our journey to break from old geospatial infrastructure. Let's revolutionize geospatial technology together! 🌎🚀

  • The Fused Founding Team