The Strength in Weak Data: Part 1

· 2 min read
Kristin Scholten
Data Scientist @ Nationwide

Ever tried to make sense of the myriad file types in spatial data science and felt like you've wandered into a linguistic labyrinth? Trust me, you're not alone. As a data scientist who's spent more time wrangling datasets than I care to admit, I thought I'd take a casual stroll down memory lane with an old high school friend: regression models. Just a simple plot of actual vs. predicted, right? But when spatial data's involved, you can't just sit back and relax—you've got to keep one eye on the geometries.

I'm currently working on an agricultural project, and growing up on a farm gives me a personal stake in this. This blog illustrates my solution to the geometry debacle. I'll first take you to the area where I grew up: Lyon County.

The Geometry Challenge: A Look at Lyon County

The resolution differences are huge, ranging from 30 square meters up to 5 billion square meters! Traditional tools would have you pulling your hair out, but Fused lets you turn this “weak” data into something powerful.

Actual Variable: Handling the Data Mismatch

When dealing with data that doesn't quite match up—like trying to combine different resolutions—you need to align everything to the coarsest resolution. In this case, that's the county level.
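To make "align to the coarsest resolution" concrete, here is a minimal sketch in plain Python. It assumes each fine-resolution cell has already been tagged with the county it falls in; the county codes, the values, and the `aggregate_to_county` helper are all hypothetical and not part of the project's data or of Fused.

```python
from collections import defaultdict


def aggregate_to_county(cells):
    """Average fine-resolution cell values within each county.

    `cells` is an iterable of (county_code, value) pairs. Both the codes
    and the values used below are made up for illustration.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for county_code, value in cells:
        sums[county_code] += value
        counts[county_code] += 1
    # One mean per county: every layer now shares the county-level grain.
    return {code: sums[code] / counts[code] for code in sums}


# Hypothetical cells: two fall in county "A", one in county "B".
cells = [("A", 0.2), ("A", 0.4), ("B", 1.0)]
county_means = aggregate_to_county(cells)
```

A mean is only one possible choice here; a sum or an area-weighted average may suit other variables better.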

Here's how I tackled it: I grabbed a CSV file of county ANSI codes along with my actual variable data. Using Fused's File Explorer, I plotted the data easily. Just a quick visit to the File Explorer S3 bucket, a double-click on the file, and the entire map rendered instantly.

Remember the days of wrestling with shapefile resolutions? No more. I edited the UDF to pull my actual data CSV straight from my S3 bucket in under 30 seconds. Boom.
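As a rough illustration of that CSV-to-county join, here is a sketch using only the standard library. The column names (`county_ansi`, `actual_value`), the sample rows, and the county lookup are assumptions for the example; the real file may look different. One detail worth guarding against: county codes are five digits (two for the state, three for the county), and the leading zero is easily lost when a tool reads them as integers.

```python
import csv
import io

# Hypothetical CSV contents; the real file's columns and values may differ.
CSV_TEXT = """county_ansi,actual_value
1001,41.5
1003,38.2
"""


def load_actuals(csv_text):
    """Map zero-padded 5-digit county codes to float values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["county_ansi"].zfill(5): float(row["actual_value"])
        for row in reader
    }


# Hypothetical stand-in for a county geometry table keyed by the same code.
county_names = {"01001": "County A", "01003": "County B"}

actuals = load_actuals(CSV_TEXT)
# Keep only the codes that exist on both sides of the join.
joined = {
    code: (county_names[code], value)
    for code, value in actuals.items()
    if code in county_names
}
```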

Predictor Variable: Navigating the NetCDF

Now, let's get into the predictor variable—a NetCDF file from 5 degrees off the equator, covering around 25 square kilometers. NetCDF files can be a bit tricky to work with due to their complex formats, but Fused's utility modules make it easier. I imported some key functions directly into my UDF to clip the array, convert it into an image, and add a colormap.

import fused

@fused.udf
def udf(bbox, path: str='s3://fused-users/fused/sina/Kristin/sif_ann_201409a.nc'):
    xy_cols = ['lon', 'lat']
    # Get the data array using the constructed path
    da = fused.utils.common.get_da(path, coarsen_factor=3, variable_index=0, xy_cols=xy_cols)
    # Clip the array based on the bounding box
    arr_aoi = fused.utils.common.clip_arr(
        da.values,
        bounds_aoi=bbox.total_bounds,
        bounds_total=fused.utils.common.get_da_bounds(da, xy_cols=xy_cols),
    )
    # Convert the clipped array to an image with the specified colormap
    return fused.utils.common.arr_to_plasma(arr_aoi, min_max=(0, 1), colormap="rainbow", include_opacity=False, reverse=True)
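For readers curious what clipping an array to a bounding box involves, here is a small plain-Python sketch of the general idea. It is not Fused's implementation of `clip_arr`; the `clip_grid` helper, the grid, and the bounds are hypothetical, and it assumes a regular grid whose first row is the northern edge.

```python
import math


def clip_grid(grid, bounds_total, bounds_aoi):
    """Return the rows and columns of `grid` that overlap `bounds_aoi`.

    `grid` is a list of rows, with row 0 at the northern edge.
    Bounds are (min_x, min_y, max_x, max_y) in the same units.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    min_x, min_y, max_x, max_y = bounds_total
    cell_w = (max_x - min_x) / n_cols
    cell_h = (max_y - min_y) / n_rows

    a_min_x, a_min_y, a_max_x, a_max_y = bounds_aoi
    # Convert coordinates to cell indices, clamped to the grid extent.
    col_start = max(0, math.floor((a_min_x - min_x) / cell_w))
    col_stop = min(n_cols, math.ceil((a_max_x - min_x) / cell_w))
    # Rows count downward from the top (max_y) edge.
    row_start = max(0, math.floor((max_y - a_max_y) / cell_h))
    row_stop = min(n_rows, math.ceil((max_y - a_min_y) / cell_h))
    return [row[col_start:col_stop] for row in grid[row_start:row_stop]]


# Hypothetical 4x4 grid covering (0, 0) to (4, 4), one unit per cell.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
clipped = clip_grid(grid, bounds_total=(0, 0, 4, 4), bounds_aoi=(1, 1, 3, 3))
```

Because the indices are rounded outward, any cell the area of interest touches is kept, so the clipped window can be slightly larger than the box itself.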

Once I saved the UDF and created an HTTP endpoint, I visualized the data interactively in the App Builder.

The Variable That is Going to Make this Weak Data Strong

Okay, I have prepped my actual and predictor variables. Now, I will focus on how to fuse the geometries together using the variable that is going to make this Weak Data Strong (30 square meters). For that, stay tuned for Part 2, where I'll dive into the techniques for aligning and merging these spatial layers into a cohesive analysis. See you in the next installment!