Note

You can run this notebook interactively on Binder, or view and download the original on GitHub.

Running on a cluster

We’ll use a Dask cluster in the cloud (in this case, via Coiled) to process the data in parallel across many machines. Running in the same cloud region where the data is stored also improves performance.

If you use Coiled (which is both easy to use and currently free!), you can set software="gjoseph92/stackstac" to get a software environment with the latest version of stackstac already installed.

[1]:
import coiled
import distributed

cluster = coiled.Cluster(
    name="stackstac",
    software="gjoseph92/stackstac",  # environment with stackstac pre-installed
    backend_options={"region": "us-west-2"},  # run in the same region as the data
    n_workers=8,
    protocol="wss",  # remove this line when not running on Binder
)
client = distributed.Client(cluster)
client
[1]:

Client: Client-da12cf68-5268-11ec-981e-acde48001122
Connection method: Cluster object (coiled.Cluster)
Dashboard: http://34.215.222.29:8787
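
If you don’t have a Coiled account, the same workflow runs on any Dask cluster. As a minimal sketch (the worker counts here are illustrative, and pulling data from us-west-2 to your own machine will be slower), you could start a local cluster instead:

import distributed

# A local Dask cluster on this machine; size it to your cores and memory.
client = distributed.Client(n_workers=4, threads_per_worker=2, memory_limit="4GiB")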

[2]:
import stackstac
import pystac_client
from rasterio.enums import Resampling

Search for a full year of Sentinel-2 data over Santa Fe, New Mexico.

We’ll look at all of 2019.

[3]:
%%time
items = pystac_client.Client.open(
    "https://earth-search.aws.element84.com/v0"
).search(
    intersects=dict(type="Point", coordinates=[-106, 35.7]),  # near Santa Fe, NM
    collections=["sentinel-s2-l2a-cogs"],
    datetime="2019-01-01/2020-01-01",  # all of 2019
    limit=10_000,
).get_all_items()
len(items)
CPU times: user 156 ms, sys: 41.4 ms, total: 198 ms
Wall time: 5.11 s
[3]:
294
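
To sanity-check the search, you can inspect one of the returned STAC items; for example (a quick look, assuming the returned collection is indexable, as the len() call above suggests):

# Peek at the first item: its ID and estimated scene cloud cover.
item = items[0]
print(item.id, item.properties.get("eo:cloud_cover"))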

Set a coarser resolution to speed up the computation

We’ll have stackstac retrieve the data at 100m resolution instead of its native 10m. Since the data is stored in Cloud-Optimized GeoTIFFs with internal overviews, fetching lower-resolution data is very efficient: reading at 10× coarser resolution in each dimension means processing roughly 100× less data.

(Internally, stackstac is just telling rasterio/GDAL to build a VRT at this resolution. GDAL then automatically figures out which overview level to fetch data from.)

We also set bounds_latlon to just the area we want to look at (additionally, this drops any items that don’t intersect that bounding box), and set the resampling method to bilinear to produce a nicer-looking image.

[4]:
%%time
stack = stackstac.stack(
    items,
    resolution=100,  # meters in the output CRS; native resolution is 10m
    bounds_latlon=(-106.2, 35.6, -105.6, 36),  # (west, south, east, north) in WGS84
    resampling=Resampling.bilinear,
)
CPU times: user 214 ms, sys: 7.4 ms, total: 222 ms
Wall time: 231 ms
[5]:
stack
[5]:
<xarray.DataArray 'stackstac-d4c09e1ea603a35bfecbad52a5781090' (time: 294, band: 17, y: 450, x: 547)>
dask.array<fetch_raster_window, shape=(294, 17, 450, 547), dtype=float64, chunksize=(1, 1, 450, 547), chunktype=numpy.ndarray>
Coordinates:
  * time                        (time) datetime64[ns] 2019-01-02T18:04:01 ......
    id                          (time) <U24 'S2A_13SDV_20190102_0_L2A' ... 'S...
  * band                        (band) <U8 'overview' 'visual' ... 'WVP' 'SCL'
  * x                           (x) float64 3.913e+05 3.914e+05 ... 4.459e+05
  * y                           (y) float64 3.985e+06 3.985e+06 ... 3.94e+06
    created                     (time) <U24 '2020-09-23T19:20:47.956Z' ... '2...
    sentinel:grid_square        (time) <U2 'DV' 'CV' 'DV' ... 'CV' 'DV' 'CV'
    eo:cloud_cover              (time) float64 54.0 50.23 1.02 ... 0.5 0.14
    sentinel:utm_zone           int64 13
    gsd                         (band) object 10 10 60 10 ... 20 None None None
    sentinel:product_id         (time) <U60 'S2A_MSIL2A_20190102T175731_N0211...
    view:off_nadir              int64 0
    sentinel:valid_cloud_cover  (time) bool True True True ... True True True
    sentinel:sequence           (time) <U1 '0' '0' '0' '0' ... '0' '0' '0' '0'
    constellation               <U10 'sentinel-2'
    data_coverage               (time) object 33.92 None 100 ... 100 100 42.39
    platform                    (time) <U11 'sentinel-2a' ... 'sentinel-2b'
    updated                     (time) <U24 '2020-09-23T19:20:47.956Z' ... '2...
    proj:epsg                   int64 32613
    instruments                 <U3 'msi'
    sentinel:latitude_band      <U1 'S'
    sentinel:data_coverage      (time) object 33.92 100 100 ... 100 100 42.39
    title                       (band) <U31 'True color image' ... 'Scene Cla...
    common_name                 (band) object None None 'coastal' ... None None
    center_wavelength           (band) object None None 0.4439 ... None None
    full_width_half_max         (band) object None None 0.027 ... None None None
    epsg                        int64 32613
Attributes:
    spec:        RasterSpec(epsg=32613, bounds=(391300, 3939700, 446000, 3984...
    crs:         epsg:32613
    transform:   | 100.00, 0.00, 391300.00|\n| 0.00,-100.00, 3984700.00|\n| 0...
    resolution:  100
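
Nothing has been fetched yet: stack is a lazy dask array with one chunk per item per band, as the chunksize=(1, 1, 450, 547) in the repr shows. You can confirm the chunk layout directly:

stack.data.chunksize    # (1, 1, 450, 547)
stack.data.npartitions  # 294 items × 17 bands = 4998 chunks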

For comparison, this is how much data we’d be processing if we’d used full 10m resolution:

[6]:
import dask
dask.utils.format_bytes(stackstac.stack(items).nbytes)
[6]:
'8.38 TiB'
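
For reference, the coarsened, clipped stack above is far smaller. The exact figure depends on the bounds we chose, but it comes out to roughly 9 GiB:

dask.utils.format_bytes(stack.nbytes)  # ~9 GiB at 100m over our bounding box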

Prepare monthly RGB composites

Now, use standard xarray methods to select the red, green, and blue bands, then build monthly median composites.

[7]:
rgb = stack.sel(band=["B04", "B03", "B02"])  # red, green, blue
monthly_rgb = rgb.resample(time="MS").median(dim="time")  # bin by month start
monthly_rgb
[7]:
<xarray.DataArray 'stackstac-d4c09e1ea603a35bfecbad52a5781090' (time: 12, band: 3, y: 450, x: 547)>
dask.array<stack, shape=(12, 3, 450, 547), dtype=float64, chunksize=(1, 2, 450, 547), chunktype=numpy.ndarray>
Coordinates:
  * time                    (time) datetime64[ns] 2019-01-01 ... 2019-12-01
  * band                    (band) <U8 'B04' 'B03' 'B02'
  * x                       (x) float64 3.913e+05 3.914e+05 ... 4.459e+05
  * y                       (y) float64 3.985e+06 3.985e+06 ... 3.94e+06
    sentinel:utm_zone       int64 13
    gsd                     (band) object 10 10 10
    view:off_nadir          int64 0
    constellation           <U10 'sentinel-2'
    proj:epsg               int64 32613
    instruments             <U3 'msi'
    sentinel:latitude_band  <U1 'S'
    title                   (band) <U31 'Band 4 (red)' ... 'Band 2 (blue)'
    common_name             (band) object 'red' 'green' 'blue'
    center_wavelength       (band) object 0.6645 0.56 0.4966
    full_width_half_max     (band) object 0.038 0.045 0.098
    epsg                    int64 32613

Compute in parallel on the cluster

[8]:
client.wait_for_workers(8)  # block until all 8 workers have connected
[9]:
%time rgb_ = monthly_rgb.compute()
CPU times: user 1.85 s, sys: 475 ms, total: 2.33 s
Wall time: 1min 59s

Using 8 Coiled workers (2 CPU, 8 GiB memory each), we processed the ~10 GB of data into monthly median composites in about two minutes.

Go to the dashboard at https://cloud.coiled.io to watch the computation in progress.
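
If you’d rather keep the composites in cluster memory for further analysis instead of pulling them back to your machine, you could persist instead of compute (a sketch):

# Alternative: materialize the result on the workers, not the client.
monthly_rgb_persisted = monthly_rgb.persist()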

[10]:
rgb_.plot.imshow(row="time", rgb="band", robust=True, size=6);
[Figure: twelve monthly RGB composites over the Santa Fe area, one panel per month of 2019]

According to the Coiled dashboard, this cost about 74 cents. Was this picture worth 74 cents to you?