Table of Contents

Title

R&D

BY

Pushkar Kopparla and Cyril juliani

Feb 3, 2025

6

MIN READ

Building Country-Scale Basemaps with Sentinel-2 Imagery: A Memory-Efficient Approach

See through the clouds: a practical guide for creating unobstructed basemaps.

Share

Share

Share

Introduction

Satellite imagery is often utilized for monitoring and decision making in sectors like agriculture, forestry, mining and defense. The majority of objects of interest for such lie on the Earth’s surface. Thus, clouds are mostly a hindrance that cause gaps in the data that we want to see (see Fig. 1). For this reason, basemaps are generated from large stacks of satellite imagery to exclude clouds and cloud shadows and give an unobstructed view of the ground. These basemaps then feed into further downstream analyses, such as segmentation or change detection.

Fig. 1: A Sentinel-2 satellite image of Gaborone, capital of Botswana, as captured on 13 Jan 2025. (data source: Planetary Computer Explorer)

Challenges in Generating Cloud-Free Basemaps

We at Solafune R&D have had to repeatedly create cloud free base maps for multiple countries and time periods of interest. This led to a lot of trial and error on consistent, easy ways to generate base maps on big data, as we found that pipelines that worked for smaller countries sometimes failed on larger ones. Often, the input imagery is of order several hundred gigabytes to several terabytes, which needs to be processed into a basemap of order several tens of gigabytes. We’ve tried making them with xarray and dask (as you can see on our other open source repo, solafune_tools: https://github.com/Solafune-Inc/solafune-tools/blob/main/docs/STAC_download.md), on big cloud machines and with various tiling approaches. After much experimentation, and several setbacks, we present here our latest approach uses tiling to only read small portions of the dataset into memory at any given time. We have used this approach to generate basemaps of several large country sized areas, and found it works well.

Generating a Cloud-Free Basemap for Botswana

We will walk through the process of generating a cloud-free basemap using Sentinel-2 data with working code. For this write-up, we will make a basemap for the country of Botswana in Southern Africa. Botswana has a land area of 581,730 sq km, and with a pixel resolution of 10m for Sentinel-2 we need to generate a basemap of size 6 billion pixels.

Selecting the Right Time Period

The first step is choosing a good time period, if you pick time spans during the rainy season, you will need to gather far more Sentinel-2 scenes to find a cloud-free pixel. So we choose to get imagery in the middle of the dry season, May, when clouds are scarce. The size of the basemap we create from this procedure will be ~ 30 GB and the amount of raw imagery downloaded will be between 1-1.5 TB.

Fig. 2: The country of Botswana in Southern Africa. The capital city of Gaborone on the south-eastern border. (tiles credit: OpenStreetMap)

Implementation: Step-by-Step Guide

Step 1: Downloading Sentinel-2 Data

The code for the process described in this article can be found here: https://github.com/Solafune-Inc/solafune_country_basemaps

In the first script of the above repo (1_download_sentinel_bands.py), we define a bounding box for Botswana and download all available imagery for the red, green and blue bands (B04, B03 and B02 respectively) and the cloud mask band (SCL) from the Planetary Computer Sentinel-2 repository for May in the year 2022. We create a data directory, with subdirectories for each month and band where the downloaded data is stored. You could also use streaming to download only data that you need from the cloud server as you process each sub-area. But we found that while experimenting, it was both quicker and cleaner just to have a local repo of the imagery. Also we do not want to hit the imagery API too many times since it’s free, and overuse will sooner or later lead to it becoming closed or a paid service.


Step 2: Creating RGB Stacks

The next step (in the script 2_stack_bands.py) is to create RGB images for each scene that we have downloaded. We stack the bands and store the data in a new RGB directory under data. A little multiprocessing here is fine, as writing to each RGB file is independent of processing other RGB files.


Step 3: Generating the Cloud-Free Basemap

The final step is creating the basemap (3_generate_cloudfree_basemap_tiled.py). What we want to do is to stack all relevant RGB images that lie over our area of interest, and select the middle-most value to be used in our basemap. The reason we use the middle value is to avoid the highest brightness values (which are likely to be clouds) and the lowest ones (likely to be cloud shadows). Typically a median is used as it is effective for handling outliers. But calculating a median requires sorting all values, which increases memory and processing demands. In contrast, the mean is more computationally efficient, as it can be calculated in-stream using running sums. You can use whatever measure suits your use case, but in this code we use a mean. For large countries, it is likely that images have different CRS depending on their longitude range.

Before doing the stacking, it is important to bring all images onto a common CRS. "We use the CRS "EPSG:32735” as our common CRS, as the majority of our Botswana Sentinel-2 scenes are in this CRS and it minimizes reprojecting which is a computationally expensive operation. Next, the entire area of interest is broken up into tiles of size 5000x5000 pixels. Here is a small code snippet that creates the tiles.

def generate_tiles(
    overall_bbox: Tuple[float, float, float, float], tile_size_geo: Tuple[float, float]
) -> List[Tuple[float, float, float, float]]:
    """
    Generate a list of tile bounding boxes covering the overall extent.

    Args:
    overall_bbox (Tuple[float, float, float, float]): Overall bounding box (minx, miny, maxx, maxy).
    tile_size_geo (Tuple[float, float]): Tile size in geographical units.

    Returns:
    List[Tuple[float, float, float, float]]: List of tile bounding boxes.
    """
    tiles = []
    x = overall_bbox[0]
    while x < overall_bbox[2]:
        y = overall_bbox[1]
        while y < overall_bbox[3]:
            tile_bbox = (
                x,
                y,
                min(x + tile_size_geo[0], overall_bbox[2]),
                min(y + tile_size_geo[1], overall_bbox[3]),
            )
            tiles.append(tile_bbox)
            y += tile_size_geo[1]
        x += tile_size_geo[0]
    return tiles

Fig. 3: An approximate visualization of the 5000x5000 pixel tiles covering Botswana

At any given time, only the pixels within this tile are read using windows from the Sentinel-2 scenes into memory for the mean calculation. Thus, the stack depth of pixels in memory is at most 4, assuming a weekly revisit time for May 2022 and the total pixel number is 5000x5000x4. Below is a simplified code snippet showing the mean calculation for one tile.

def process_tile(
    tile_bbox: Tuple[float, float, float, float],
    intersecting_files: List[Dict],
) -> xr.DataArray:
    """
    Process a single tile by calculating the mean of intersecting rasters using rioxarray for 3 bands

    Args:
    tile_bbox (Tuple[float, float, float, float]): Bounding box of the tile (minx, miny, maxx, maxy).
    intersecting_files (List[Dict]): List of files intersecting with the tile.

    Returns:
    xr.DataArray: Processed tile data with 3 bands, or None if no valid data.
    """
    tile_data = []
    # print(f"Number of intersecting files: {len(intersecting_files)}")
    for file in intersecting_files:
		    with rioxarray.open_rasterio(file["filepath"]) as src:
				    clipped = src.rio.clip_box(*tile_bbox)
	      tile_data.append(clipped)

    # Stack all tile data along a new dimension
    stacked_data = xr.concat(tile_data, dim="source")

    # All zero values are considered nodata
    stacked_data = stacked_data.where(stacked_data !=0, other=np.nan)
    # Calculate mean across all valid data for each band separately
    mean_tile = stacked_data.mean(dim="source", skipna=True)

    return mean_tile

This is comfortably within the memory capacity of most PCs, which is why we can add in multiprocessing to speed things along a little more. The alternative is to increase the size of the tile till you use up all memory with a single process. You can play with these parameters to optimize runtime on your system. Generally speaking, larger tiles are preferable because you can leverage array operations for speed in calculating means, however you could run into memory issues if you go too large. Since avoiding crashes from exceeding memory was our primary concern, we have been conservative with our tile size.

Final Result

The output of the script is a folder full of RGB tif tiles of size 5000x5000 and a vrt file that can used to convert them into a single image for further processing. We find that combining all tiles into a single full size geotif file once again causes memory problems when reading with various libraries. Our downstream workflow often ends up using gdal_translate to convert this vrt file into an mbtiles file, and then using a tileserver to serve tiles to web applications.

The final result is a cloud-free basemap of Botswana at 10m resolution and looks as below. The white regions towards the north-east of the country are the Makgadikgadi salt pans, not clouds. There is a lot of wasted basemap generated outside of our areas of interest, this can be optimized by keeping the bounding box tighter or when calculating the tiles in the script above to check if the tile intersects with the Botswana national boundaries (rather than the bounding box).

The images in the tile can be further compressed by converting the data type to byte or through various image compression schemes found in GDAL or elsewhere.

Fig. 4: A small sized preview of the generated basemap for Botswana

We hope you find this article useful for making your own basemaps. For further questions, please contact the authors at pushkar.kopparla@solafune.com and cyril.juliani@solafune.com

Credits: Thanks to Mohamed Yasser and Charles Genki Shimokura for comments on the draft of this article which improved it.

Read others

Got a question? We’d love to hear from you.

Got a question? We’d love to hear from you.

Got a question? We’d love to hear from you.

Company Name

Solafune, Inc.

President & CEO

Ren Uechi

Address

Solix Shibuya 401, 3-chōme-6-15, Shibuya, Shibuya City, Tokyo, 150-0002

© 2023 Solafune.Inc. All rights reserved

Company Name

Solafune, Inc.

President & CEO

Ren Uechi

Address

Solix Shibuya 401, 3-chōme-6-15, Shibuya, Shibuya City, Tokyo, 150-0002

© 2023 Solafune.Inc. All rights reserved

Company Name

Solafune, Inc.

President & CEO

Ren Uechi

Address

Solix Shibuya 401, 3-chōme-6-15, Shibuya, Shibuya City, Tokyo, 150-0002

© 2023 Solafune.Inc. All rights reserved