Skip to main content

Creating cloud-free composite HLS imagery with Fused

ยท 4 min read
Marie Hoeger
Staff Software Engineer @ Pachama
Plinio Guzman
Founding Engineer @ Fused

High-quality satellite imagery is essential to assess the carbon impact of nature-based forest conservation and restoration projects [1]. However, getting that high quality imagery is uniquely difficult in areas that need carbon financing the most: tropical forests. Tropical forests present a unique challenge for satellite imagery analysis due to persistent cloud cover, which often renders optical imagery unusable and creates data gaps.

File

Example composites highlight how the HLS-L30 product alone can have gaps when attempting to make a seasonal composite, as fewer cloud-free observations.

This blog post explores how Pachama's engineering team partnered with Fused to generate cloud-free seasonal composites using Harmonized Landsat Sentinel-2 (HLS) data, enabling higher quality optical imagery and better canopy height map creating ML model performance.

Obstacles to create a cloud-free HLS image compositeโ€‹

The HLS dataset is an exciting development put forward by NASA's Satellite Needs Working Group. It provides consistent surface reflectance data with global observations every 2-3 days at a 30-meter resolution. The dataset harmonizes data from Landsats 8 & 9 with the European Space Agency's Sentinel-2A & 2B satellites such that the results are high quality, standardized, and able to be combined [2].

The HLS dataset consists of scene-level harmonized data, and does not create any cloud-free composite images by default. A significant amount of compute power is needed to process and combine this data, which contains multiple petabytes of data. Iteration on the compositing algorithm is also essential to quickly experiment and refine the process.

File

Example of HLS image for a region in Brazil with clouds.

One common solution to this problem is to use Google Earth Engine (GEE). However, only the Landsat portion of this dataset (HLS-L30) is available on GEE. Without the Sentinel-2 portion of this dataset (HLS-S30), we do not get a 2-3 day temporal resolution that is required for cloud-free imagery in frequently cloudy areas.

With Fusedโ€‹

Pachama turned to Fused to create scalable workflows for quickly iterating on a compositing algorithm. Fused's UDF model allowed Pachama to design algorithms that parallelize image processing, generate cloud-free composites, and run these workflows at scale.

Pachama's UDF workflowโ€‹

Here's the workflow we created with a Fused User Defined Function (UDF) to generate cloud-free composite HLS imagery.

1. Write a UDF to load imageryโ€‹

This sample UDF loads data for the Landsat and Sentinel2 data products. It queries for a specific date range and does a first pass at filtering out images with too many clouds.

# To Get your username and password, Please visit https://urs.earthdata.nasa.gov
@fused.udf
def udf(
bbox: fused.types.TileGDF,
mask_url: str,
band_url: str,
username="<INSERT USERNAME>",
password="<INSERT PASSWORD>",
env="earthdata",
):
import numpy as np
utils = fused.load("https://github.com/fusedio/udfs/tree/f928ee1/public/common/").utils
# Authenticate
aws_session = utils.earth_session(cred={"env": env, "username": username, "password": password})
cred = {"env": env, "username": username, "password": password}
overview_level = max(0, 12 - bbox.z[0])

# Read band data
band_arr = utils.read_tiff(
bbox,
band_url,
overview_level=overview_level,
cred=cred,
)

# Read and apply cloud mask
mask_arr = utils.read_tiff(
bbox,
mask_url,
overview_level=overview_level,
cred=cred,
)
cloud_mask = (mask_arr & 0b00000010) >> 1
band_arr = np.where(cloud_mask == 1, np.nan, band_arr)

# Filter nan's and convert to RGB values
band_arr = np.where(band_arr == -9999, np.nan, band_arr)
band_arr = band_arr / 10
band_arr += 1 # workaround for uint8 and nan values
band_arr = band_arr.astype("uint8")

return np.array(band_arr)

2. Call the UDF asynchronouslyโ€‹

This UDF queries the LP DAAC STAC catalog for data that matches the time and location of interest. This UDF then calls the previous one in parallel asynchronously to fetch each cloud-free image in parallel. It then combines the outputs, taking the median of each band to create a cloud-free composite.


@fused.udf
async def udf(
bbox: fused.types.TileGDF,
date_range="2023-05/2023-06"
):
import numpy as np
import asyncio
from collections import defaultdict

RGB_BANDS = ["B04", "B03", "B02"]
F_MASK_BAND = "Fmask"

# Query STAC catalog
band_urls = get_band_urls(bbox, date_range)

# Call the image loading/masking UDF in parallel
tasks = defaultdict(list)
for band in RGB_BANDS:
for mask_url, band_url in zip(band_urls[F_MASK_BAND], band_urls[band]):
arr_task = fused.run(
"<INSERT UDF TOKEN>",
bbox=bbox,
sync=False,
parameters={
"mask_url": mask_url,
"band_url": band_url,
"date_range": date_range,
})
tasks[band].append(arr_task)

# Combine each band
rgb = []
for band in RGB_BANDS:
task_results = await asyncio.gather(*tasks[band])
composite_values = []

# Convert back to format with nan's
for arr in task_results:
arr = arr.image.values.astype("uint8")
arr = np.where(arr == 0, np.nan, arr)
arr += 1
composite_values.append(arr)

# Take median of the composite values
band_composite = np.nanmedian(composite_values, axis=0)
band_composite = band_composite.astype("uint8")
rgb.append(band_composite)

return np.array(rgb)

The UDF above generates a cloud-free composite image and gives Pachama control and transparency over the image inputs.

File

Example of cloud-free HLS image composite for the same region in Brazil.

Benefits of using Fusedโ€‹

The best part is that Pachama's Data Science team can design UDF while looking at a specific area, and to run it for a different region by simply changing the input bounding box (bbox). This flexibility allows Pachama to create individual image tiles for any location worldwide. They can easily experiment and generate composites for different date ranges by adjusting the input parameters.

  • Easy parallelization with simple Python function calls, no need to manage clusters
  • Iterate on both UDFs in the same code editor with the UDF Builder
  • Instant feedback during algorithm development, no need to wait for pipelines to run
  • Invoke UDF and load its data into a Jupyter Notebook with fused.run for downstream analysis

Conclusionโ€‹

Thanks to Fused, Pachama's scientists and engineers can quickly iterate and experiment with different algorithms to optimize their image composites. Scaling the algorithm to apply to a larger area also becomes trivial by using Fused. Pachama can more efficiently improve transparency into forest carbon projects through better data and better insights, faster.

Referencesโ€‹