Running on a cluster

We’ll use a Dask cluster in the cloud (in this case, on Coiled) to process the data in parallel across many machines. We can also run in a data center near* where the data is stored, which improves performance.

*These Sentinel-2 COGs are in AWS’s us-west-2; Coiled currently only offers us-west-1, but that’s close enough.

[1]:
import coiled
import distributed

cluster = coiled.Cluster(
    name="stackstac",
    software="gjoseph92/stackstac",  # Coiled software environment for this demo
    backend_options={"region": "us-west-1"},  # run near the data (which lives in us-west-2)
)
client = distributed.Client(cluster)
client
Using existing cluster: stackstac
/Users/gabe/Library/Caches/pypoetry/virtualenvs/stackstac-FdRcOknL-py3.8/lib/python3.8/site-packages/distributed/client.py:1134: VersionMismatchWarning: Mismatched versions found

+---------+---------------+---------------+---------------+
| Package | client        | scheduler     | workers       |
+---------+---------------+---------------+---------------+
| python  | 3.8.7.final.0 | 3.8.8.final.0 | 3.8.8.final.0 |
+---------+---------------+---------------+---------------+
  warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))
[1]:

Client / Cluster
  • Workers: 4
  • Cores: 16
  • Memory: 68.72 GB
[2]:
# TODO remove once stackstac is released: upload the local copy of the
# package to the workers, since it isn't in the software environment yet
!rm -rf stackstac/__pycache__
!zip -r stackstac.zip stackstac/
client.upload_file("stackstac.zip")
!rm stackstac.zip

def check_import():
    # Verify the uploaded package is importable on every worker
    import stackstac
    return "ok"

client.run(check_import)
  adding: stackstac/ (stored 0%)
  adding: stackstac/reader_protocol.py (deflated 61%)
  adding: stackstac/accumulate_metadata.py (deflated 65%)
  adding: stackstac/rio_env.py (deflated 70%)
  adding: stackstac/raster_spec.py (deflated 56%)
  adding: stackstac/timer.py (deflated 50%)
  adding: stackstac/__init__.py (deflated 41%)
  adding: stackstac/rio_reader.py (deflated 67%)
  adding: stackstac/stac_types.py (deflated 67%)
  adding: stackstac/stack.py (deflated 62%)
  adding: stackstac/to_dask.py (deflated 62%)
  adding: stackstac/prepare.py (deflated 72%)
[2]:
{'tls://10.4.11.31:33089': 'ok',
 'tls://10.4.11.38:35991': 'ok',
 'tls://10.4.12.193:43513': 'ok',
 'tls://10.4.12.232:36411': 'ok'}
[2]:
import stackstac
import satsearch

from rasterio.enums import Resampling

Search for a full year of Sentinel-2 data over Santa Fe, New Mexico.

We’ll look at 2019 (January 1, 2019 through January 1, 2020).

[3]:
%%time
items = satsearch.Search(
    url="https://earth-search.aws.element84.com/v0",
    intersects=dict(type="Point", coordinates=[-106, 35.7]),
    collections=["sentinel-s2-l2a-cogs"],
    datetime="2019-01-01/2020-01-01"
).items()
len(items)
CPU times: user 82.8 ms, sys: 21.1 ms, total: 104 ms
Wall time: 2.46 s
[3]:
294
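
To spot-check what came back, we can peek at one of the returned STAC items. This is a quick sketch assuming satstac’s Item API (which is what satsearch returns), using its id and properties attributes:

# Inspect the first STAC item's ID and cloud cover
first = items[0]
print(first.id)                            # e.g. 'S2A_13SDV_20190102_0_L2A'
print(first.properties["eo:cloud_cover"])  # per-scene cloud cover, in percent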

Set a coarser resolution to speed up the computation

We’ll have stackstac retrieve the data at 100m resolution, instead of its native 10m. Since the data is stored in Cloud-Optimized GeoTIFFs with internal overviews, fetching lower-resolution data is very efficient: at 10x coarser resolution, there are 100x fewer pixels to transfer and process.

(Internally, stackstac is just telling rasterio/GDAL to build a VRT at this resolution. GDAL then automatically figures out which overview level to fetch data from.)
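
As a rough sketch of what this looks like at the rasterio level (not stackstac’s actual code; the URL below is hypothetical), you could wrap a single COG in a WarpedVRT at 100m resolution and let GDAL pick the overview:

import rasterio
from affine import Affine
from rasterio.enums import Resampling
from rasterio.vrt import WarpedVRT

# Hypothetical Sentinel-2 COG URL, for illustration only
with rasterio.open("https://example.com/sentinel2-B04.tif") as src:
    scale = 100 / src.res[0]  # e.g. 10m native -> 100m target
    with WarpedVRT(
        src,
        crs=src.crs,
        transform=src.transform * Affine.scale(scale),
        width=round(src.width / scale),
        height=round(src.height / scale),
        resampling=Resampling.bilinear,
    ) as vrt:
        data = vrt.read(1)  # GDAL reads from the nearest overview, not full-resolution pixels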

We also set bounds_latlon to just the area we want to look at (additionally, this drops any items that don’t intersect that bounding box), and set the resampling method to bilinear to produce a nicer-looking image.

[6]:
%%time
stack = stackstac.stack(
    items,
    resolution=100,
    bounds_latlon=[-106.2, 35.6, -105.6, 36],
    resampling=Resampling.bilinear
)
CPU times: user 247 ms, sys: 11.3 ms, total: 258 ms
Wall time: 269 ms
[7]:
stack
[7]:
<xarray.DataArray 'stackstac-ae169f4201b4c70e2f19d70d036371ff' (time: 294, band: 17, y: 450, x: 547)>
dask.array<fetch_raster_window, shape=(294, 17, 450, 547), dtype=float64, chunksize=(1, 1, 450, 547), chunktype=numpy.ndarray>
Coordinates: (12/23)
  * time                        (time) datetime64[ns] 2019-01-02T18:04:01 ......
    id                          (time) <U24 'S2A_13SDV_20190102_0_L2A' ... 'S...
  * band                        (band) <U8 'overview' 'visual' ... 'WVP' 'SCL'
  * x                           (x) float64 3.914e+05 3.914e+05 ... 4.46e+05
  * y                           (y) float64 3.985e+06 3.985e+06 ... 3.94e+06
    instruments                 <U3 'msi'
    ...                          ...
    constellation               <U10 'sentinel-2'
    eo:cloud_cover              (time) float64 54.0 50.23 1.02 ... 0.5 0.14
    updated                     (time) <U24 '2020-09-23T19:20:47.956Z' ... '2...
    sentinel:data_coverage      (time) object 33.92 100 100 ... 100 100 42.39
    sentinel:sequence           (time) object None None None ... '0' '0' '0'
    title                       (band) object None ... 'Scene Classification ...
Attributes:
    spec:        RasterSpec(epsg=32613, bounds=(391300, 3939700, 446000, 3984...
    crs:         epsg:32613
    transform:   | 100.00, 0.00, 391300.00|\n| 0.00,-100.00, 3984700.00|\n| 0...
    resolution:  100

For comparison, this is how much data we’d be processing if we’d used the full 10m resolution:

[8]:
import dask
dask.utils.format_bytes(stackstac.stack(items).nbytes)
[8]:
'9.21 TB'
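
Note that nbytes is the theoretical in-memory size of the array (shape × dtype), not data that’s actually been downloaded. As a sanity check, the size of our 100m stack follows directly from its shape and float64 dtype:

# time × band × y × x pixels, 8 bytes each (float64)
294 * 17 * 450 * 547 * 8 / 1e9  # ≈ 9.84 GB for the 100m stack

That matches the ~10 GB we end up processing below.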

Prepare monthly RGB composites

Now, use standard xarray methods to select out the red, green, and blue bands, then resample by month ("MS" = month-start frequency) and take the median to make monthly composites.

[9]:
rgb = stack.sel(band=["B04", "B03", "B02"])
monthly_rgb = rgb.resample(time="MS").median(dim="time")
monthly_rgb
[9]:
<xarray.DataArray 'stackstac-ae169f4201b4c70e2f19d70d036371ff' (time: 12, band: 3, y: 450, x: 547)>
dask.array<stack, shape=(12, 3, 450, 547), dtype=float64, chunksize=(1, 2, 450, 547), chunktype=numpy.ndarray>
Coordinates:
  * time                    (time) datetime64[ns] 2019-01-01 ... 2019-12-01
  * band                    (band) <U8 'B04' 'B03' 'B02'
  * x                       (x) float64 3.914e+05 3.914e+05 ... 4.46e+05
  * y                       (y) float64 3.985e+06 3.985e+06 ... 3.94e+06
    instruments             <U3 'msi'
    proj:epsg               int64 32613
    sentinel:utm_zone       int64 13
    sentinel:latitude_band  <U1 'S'
    view:off_nadir          int64 0
    gsd                     int64 10
    constellation           <U10 'sentinel-2'
    title                   (band) object 'Band 4 (red)' ... 'Band 2 (blue)'

Compute in parallel on the cluster

[31]:
%time rgb_ = monthly_rgb.compute()
CPU times: user 1.63 s, sys: 324 ms, total: 1.96 s
Wall time: 3min 26s

Using 4 Coiled workers, we processed the ~10 GB of data into median composites in just a few minutes.

[38]:
rgb_.plot.imshow(row="time", rgb="band", robust=True, size=6)
[38]:
<xarray.plot.facetgrid.FacetGrid at 0x133633070>
[Figure: facet grid of the 12 monthly RGB median composites over the Santa Fe area, one row per month]

According to the Coiled dashboard, this cost about 74 cents. Was this picture worth 74 cents to you?