Note

You can view & download the original notebook on GitHub.

Running on a cluster#

We’ll use a Dask cluster in the cloud—in this case, using Coiled—to use many machines to process the data in parallel. We can also run in the data center where the data is stored for better performance.

To use Coiled, you’ll need a Coiled account (free), and you’ll need to connect Coiled to your cloud provider account (AWS or GCP) so it can launch clusters on your behalf.

Another option for Dask clusters is Microsoft’s Planetary Computer.

[2]:

import coiled
import distributed

cluster = coiled.Cluster(
    name="stackstac",
    n_workers=8,
    package_sync=True,
    backend_options={"region": "us-west-2"},  # where the data is
)
client = distributed.Client(cluster)
client

[2]:

Client

Client-6a095c68-6f7d-11ed-91e2-acde48001122

Connection method: Cluster object	Cluster type: coiled.ClusterBeta
Dashboard: http://35.84.199.219:8787

Cluster Info

ClusterBeta

stackstac

Dashboard: http://35.84.199.219:8787	Workers: 8
Total threads: 32	Total memory: 119.50 GiB

Scheduler Info

Scheduler

Scheduler-fa983ff2-954b-4ba3-86c5-cce21c22ae64

Comm: tls://10.0.59.124:443	Workers: 8
Dashboard: http://10.0.59.124:8787/status	Total threads: 32
Started: 7 minutes ago	Total memory: 119.50 GiB

Workers

Worker: stackstac-worker-16c4ddc9d5

Comm: tls://10.0.61.165:35151	Total threads: 4
Dashboard: http://10.0.61.165:8787/status	Memory: 14.94 GiB
Nanny: tls://10.0.61.165:41703
Local directory: /scratch/dask-worker-space/worker-5nkdzcyn

Worker: stackstac-worker-4ed28f88a0

Comm: tls://10.0.62.234:38469	Total threads: 4
Dashboard: http://10.0.62.234:8787/status	Memory: 14.93 GiB
Nanny: tls://10.0.62.234:37993
Local directory: /scratch/dask-worker-space/worker-j55olfrb

Worker: stackstac-worker-4f02208daa

Comm: tls://10.0.54.53:33853	Total threads: 4
Dashboard: http://10.0.54.53:8787/status	Memory: 14.95 GiB
Nanny: tls://10.0.54.53:41165
Local directory: /scratch/dask-worker-space/worker-5z9ryrkf

Worker: stackstac-worker-809a097aa0

Comm: tls://10.0.56.248:37853	Total threads: 4
Dashboard: http://10.0.56.248:8787/status	Memory: 14.94 GiB
Nanny: tls://10.0.56.248:44977
Local directory: /scratch/dask-worker-space/worker-e9s0e83a

Worker: stackstac-worker-99c2273579

Comm: tls://10.0.58.248:35043	Total threads: 4
Dashboard: http://10.0.58.248:8787/status	Memory: 14.94 GiB
Nanny: tls://10.0.58.248:45287
Local directory: /scratch/dask-worker-space/worker-knkc82rc

Worker: stackstac-worker-9a1bfca04f

Comm: tls://10.0.61.200:39239	Total threads: 4
Dashboard: http://10.0.61.200:8787/status	Memory: 14.93 GiB
Nanny: tls://10.0.61.200:42685
Local directory: /scratch/dask-worker-space/worker-mqc9vu2j

Worker: stackstac-worker-c988b2fbcc

Comm: tls://10.0.54.156:33293	Total threads: 4
Dashboard: http://10.0.54.156:8787/status	Memory: 14.94 GiB
Nanny: tls://10.0.54.156:37565
Local directory: /scratch/dask-worker-space/worker-9b2cqr8y

Worker: stackstac-worker-f7375ac6b2

Comm: tls://10.0.50.184:35375	Total threads: 4
Dashboard: http://10.0.50.184:8787/status	Memory: 14.94 GiB
Nanny: tls://10.0.50.184:45477
Local directory: /scratch/dask-worker-space/worker-i63od3yi

[3]:

import stackstac
import pystac_client
from rasterio.enums import Resampling

Search for a full year of Sentinel-2 data over Santa Fe, New Mexico.#

We’ll look at 2019-2020.

[4]:

%%time
items = pystac_client.Client.open(
    "https://earth-search.aws.element84.com/v1"
).search(
    intersects=dict(type="Point", coordinates=[-106, 35.7]),
    collections=["sentinel-2-l2a"],
    datetime="2019-01-01/2020-01-01",
    max_items=10_000,
).item_collection()
len(items)

CPU times: user 1.16 s, sys: 81.1 ms, total: 1.24 s
Wall time: 7.91 s

[4]:

Set a coarser resolution to speed up the computation#

We’ll have stackstac retrieve the data at 100m resolution, instead of its native 10m. Since the data is stored in Cloud-Optimized GeoTIFFs with internal overviews, fetching lower-resolution data is very efficient and requires processing an order of magnitude less data.

(Internally, stackstac is just telling rasterio/GDAL to build a VRT at this resolution. GDAL then automatically figures out which overview level to fetch data from.)

We also set bounds_latlon to just the area we want to look at (additionally, this drops any items that don’t intersect that bounding box), and set the resampling method to bilinear to produce a nicer-looking image.

[5]:

%%time
stack = stackstac.stack(
    items,
    resolution=100,
    bounds_latlon=(-106.2, 35.6, -105.6, 36),
    resampling=Resampling.bilinear
)

CPU times: user 279 ms, sys: 9.37 ms, total: 288 ms
Wall time: 295 ms

[6]:

stack

[6]:

<xarray.DataArray 'stackstac-efb5ba84853ae19e5a8b478f6f73560f' (time: 292,
                                                                band: 32,
                                                                y: 450, x: 547)>
dask.array<fetch_raster_window, shape=(292, 32, 450, 547), dtype=float64, chunksize=(1, 1, 450, 547), chunktype=numpy.ndarray>
Coordinates: (12/52)
  * time                                     (time) datetime64[ns] 2019-01-02...
    id                                       (time) <U24 'S2A_13SDV_20190102_...
  * band                                     (band) <U12 'aot' ... 'wvp-jp2'
  * x                                        (x) float64 3.913e+05 ... 4.459e+05
  * y                                        (y) float64 3.985e+06 ... 3.94e+06
    s2:unclassified_percentage               (time) object 0.828183 ... 15.96...
    ...                                       ...
    title                                    (band) <U31 'Aerosol optical thi...
    gsd                                      (band) object None 10 ... None None
    common_name                              (band) object None 'blue' ... None
    center_wavelength                        (band) object None 0.49 ... None
    full_width_half_max                      (band) object None 0.098 ... None
    epsg                                     int64 32613
Attributes:
    spec:        RasterSpec(epsg=32613, bounds=(391300, 3939700, 446000, 3984...
    crs:         epsg:32613
    transform:   | 100.00, 0.00, 391300.00|\n| 0.00,-100.00, 3984700.00|\n| 0...
    resolution:  100

For comparison, this is how much data we’d be processing if we’d used full 10m resolution:

[7]:

import dask
dask.utils.format_bytes(stackstac.stack(items).nbytes)

[7]:

'15.66 TiB'

Prepare monthly RGB composites#

Now, use standard xarray methods to select out the red, green, and blue bands, then make monthly median composites.

[8]:

rgb = stack.sel(band=["red", "green", "blue"])
monthly_rgb = rgb.resample(time="MS").median(dim="time")
monthly_rgb

[8]:

<xarray.DataArray 'stackstac-efb5ba84853ae19e5a8b478f6f73560f' (time: 12,
                                                                band: 3,
                                                                y: 450, x: 547)>
dask.array<stack, shape=(12, 3, 450, 547), dtype=float64, chunksize=(1, 2, 450, 547), chunktype=numpy.ndarray>
Coordinates: (12/22)
  * band                                     (band) <U12 'red' 'green' 'blue'
  * x                                        (x) float64 3.913e+05 ... 4.459e+05
  * y                                        (y) float64 3.985e+06 ... 3.94e+06
    proj:epsg                                int64 32613
    constellation                            <U10 'sentinel-2'
    mgrs:latitude_band                       <U1 'S'
    ...                                       ...
    gsd                                      (band) object 10 10 10
    common_name                              (band) object 'red' 'green' 'blue'
    center_wavelength                        (band) object 0.665 0.56 0.49
    full_width_half_max                      (band) object 0.038 0.045 0.098
    epsg                                     int64 32613
  * time                                     (time) datetime64[ns] 2019-01-01...
Attributes:
    spec:        RasterSpec(epsg=32613, bounds=(391300, 3939700, 446000, 3984...
    crs:         epsg:32613
    transform:   | 100.00, 0.00, 391300.00|\n| 0.00,-100.00, 3984700.00|\n| 0...
    resolution:  100

Compute in parallel on the cluster#

[9]:

client.wait_for_workers(8)

[10]:

%time rgb_ = monthly_rgb.compute()

CPU times: user 3.32 s, sys: 1.22 s, total: 4.54 s
Wall time: 1min 42s

Using 8 Coiled workers (2 CPU, 8 GiB memory each), we processed the ~10 GB of data into median composites in a couple minutes.

Go to the dashboard at https://cloud.coiled.io to watch the computation in progress.

[11]:

rgb_.plot.imshow(row="time", rgb="band", robust=True, size=6);

According to the Coiled dashboard, this cost about 74 cents. Was this picture worth 74 cents to you?