Create custom images for geospatial analysis with Amazon SageMaker Distribution in Amazon SageMaker Studio

Amazon SageMaker Studio offers a comprehensive set of fully managed integrated development environments (IDEs) for machine learning (ML), including JupyterLab, Code Editor (based on Code-OSS), and RStudio. It supports all stages of ML development, from data preparation to deployment, and lets you launch a preconfigured JupyterLab IDE for efficient coding within seconds. Additionally, its flexible interface and artificial intelligence (AI) powered coding assistant simplify and enhance ML workflow configuration, debugging, and code testing.

Geospatial data such as satellite images, coordinate traces, or aerial maps that are enriched with characteristics or attributes from other business and environmental datasets is becoming increasingly available. This unlocks valuable use cases in fields such as environmental monitoring, urban planning, agriculture, disaster response, transportation, and public health.

To effectively use the wealth of information contained in such datasets for ML and analytics, access to the right tools for geospatial data handling is crucial. This is especially relevant given that geospatial data often comes in specialized file formats such as Cloud Optimized GeoTIFF (COG), Zarr files, GeoJSON, and GeoParquet that require dedicated software tools and libraries to work with.

To address these specific needs within SageMaker Studio, this post shows you how to extend Amazon SageMaker Distribution with additional dependencies to create a custom container image tailored for geospatial analysis. Although the example in this post focuses on geospatial data science, the methodology presented can be applied to any kind of custom image based on SageMaker Distribution.

SageMaker Distribution images are Docker images that come with preinstalled data science packages and are preconfigured with a JupyterLab IDE. You can use the same image in the SageMaker Studio UI and in non-interactive workflows such as processing or training jobs, which facilitates a seamless transition from local experimentation to batch execution while you maintain only a single Docker image.

In this post, we provide step-by-step guidance on how to build and use custom container images in SageMaker Studio. Specifically, we demonstrate how to customize SageMaker Distribution for geospatial workflows by extending it with open source geospatial Python libraries. We explain how to build and deploy the image on AWS using continuous integration and delivery (CI/CD) tools and how to make the deployed image available in SageMaker Studio. All code used in this post, including the Dockerfile and infrastructure as code (IaC) templates for quick deployment, is available as a GitHub repository.

Solution overview

You can build a custom container image and use it in SageMaker Studio with the following steps:

Create a Dockerfile that includes the additional Python libraries and tools.
Build a custom container image from the Dockerfile.
Push the custom container image to a private repository on Amazon Elastic Container Registry (Amazon ECR).
Attach the image to your Amazon SageMaker Studio domain.
Access the image from your JupyterLab space.

The following diagram illustrates the solution architecture.

The solution uses AWS CodeBuild, a fully managed service that compiles source code and produces deployable software artifacts, to build a new container image from a Dockerfile. CodeBuild supports a broad selection of Git version control sources like AWS CodeCommit, GitHub, and GitLab. For this post, we host our build files on Amazon Simple Storage Service (Amazon S3) and use it as the source provider for the CodeBuild project. You can extend this solution to work with alternative CI/CD tooling, including GitLab, Jenkins, Harness, or other tools.

CodeBuild retrieves the build files from Amazon S3, runs a Docker build, and pushes the resulting container image to a private ECR repository. Amazon ECR is a managed container registry that facilitates the storage, management, and deployment of container images.

The custom image is then attached to a SageMaker Studio domain and can be used by data scientists and data engineers as an IDE or as a runtime for SageMaker processing or training jobs.

Prerequisites

This post covers the default approach for SageMaker Studio, which involves a managed network interface that allows internet communication. We also include steps to adapt this for use within a private virtual private cloud (VPC).

Before you get started, verify that you have the following prerequisites:

If you intend to follow this post and deploy the CodeBuild project and the ECR repository using IaC, you also need to install the AWS Cloud Development Kit (AWS CDK) on your local machine. For instructions, see Getting started with the AWS CDK. If you’re using a cloud-based IDE like AWS Cloud9, the AWS CDK will usually come preinstalled.

If you want to securely deploy your custom container using your private VPC, you also need the following:

A VPC with a private subnet
VPC endpoints for the following services:

To set up a SageMaker Studio domain with a private VPC, see Connect Studio notebooks in a VPC to external resources.

Extend SageMaker Distribution

By default, SageMaker Studio provides a number of curated pre-built Docker images as part of SageMaker Distribution. These images include popular frameworks for ML, data science, and visualization, including deep learning frameworks like PyTorch, TensorFlow, and Keras; popular Python packages like NumPy, scikit-learn, and pandas; and IDEs like JupyterLab and Code Editor. All installed libraries and packages are mutually compatible and are provided with their latest compatible versions. Each distribution version is available in two variants, CPU and GPU, and is hosted on the Amazon ECR Public Gallery. To be able to work with geospatial data in SageMaker Studio, you need to extend SageMaker Distribution by adding the required geospatial libraries like gdal, geopandas, leafmap, or rioxarray and make it accessible to users through SageMaker Studio.

Let’s first review how to extend SageMaker Distribution for geospatial analyses and ML. To do so, we largely follow the provided template for creating custom Docker files in SageMaker, with a few subtle but important differences specific to the geospatial libraries we want to install. The full Dockerfile is as follows:

# set distribution type (cpu or gpu)
ARG DISTRIBUTION_TYPE

# get SageMaker Distribution base image
# use fixed version for reproducibility, use "latest" for most recent version
FROM public.ecr.aws/sagemaker/sagemaker-distribution:1.8.0-$DISTRIBUTION_TYPE

# set SageMaker specific parameters and arguments
# see here for supported values: https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-jl-image-specifications.html#studio-updated-jl-admin-guide-custom-images-user-and-filesystem
ARG NB_USER="sagemaker-user"
ARG NB_UID=1000
ARG NB_GID=100

ENV MAMBA_USER=$NB_USER

# switch to root to install system packages
USER root

# set environment variables required for GDAL
ARG CPLUS_INCLUDE_PATH=/usr/include/gdal
ARG C_INCLUDE_PATH=/usr/include/gdal

# install GDAL and other required Linux packages
RUN apt-get --allow-releaseinfo-change update -y -qq \
   && apt-get update \
   && apt install -y software-properties-common \
   && add-apt-repository --yes ppa:ubuntugis/ppa \
   && apt-get update \
   && apt-get install -qq -y groff unzip libgdal-dev gdal-bin ffmpeg libsm6 libxext6 \
   && apt-get install -y --reinstall build-essential \
   && apt-get clean \
   && rm -fr /var/lib/apt/lists/*

# use micromamba package manager to install required geospatial python packages
USER $MAMBA_USER

RUN micromamba install gdal==3.6.4 --yes --channel conda-forge --name base \
   && micromamba install geopandas==0.13.2 rasterio==1.3.8 leafmap==0.31.3 rioxarray==0.15.1 --yes --channel conda-forge --name base \
   && micromamba clean -a

# set entrypoint and jupyter server args
ENTRYPOINT ["jupyter-lab"]
CMD ["--ServerApp.ip=0.0.0.0", "--ServerApp.port=8888", "--ServerApp.allow_origin=*", "--ServerApp.token=''", "--ServerApp.base_url=/jupyterlab/default"]

Let’s break down the key geospatial-specific modifications.

First, you install the Geospatial Data Abstraction Library (GDAL) on Linux. GDAL is an open source library that provides drivers for reading and writing raster and vector geospatial data formats. It provides the backbone for many open source and proprietary GIS applications, including the libraries used in this post. This is done as follows (see Install GDAL for Python for more details):

# install GDAL and other required Linux packages
RUN apt-get --allow-releaseinfo-change update -y -qq \
   && apt-get update \
   && apt install -y software-properties-common \
   && add-apt-repository --yes ppa:ubuntugis/ppa \
   && apt-get update \
   && apt-get install -qq -y groff unzip libgdal-dev gdal-bin ffmpeg libsm6 libxext6 \
   && apt-get install -y --reinstall build-essential \
   && apt-get clean \
   && rm -fr /var/lib/apt/lists/*

You also need to set the following GDAL-specific environment variables:

ARG CPLUS_INCLUDE_PATH=/usr/include/gdal
ARG C_INCLUDE_PATH=/usr/include/gdal

With GDAL installed, you can now install the required geospatial Python libraries using the recommended micromamba package manager. This is done in the following code block:

# use micromamba package manager to install required geospatial python packages
USER $MAMBA_USER

RUN micromamba install gdal==3.6.4 --yes --channel conda-forge --name base \
   && micromamba install geopandas==0.13.2 rasterio==1.3.8 leafmap==0.31.3 rioxarray==0.15.1 --yes --channel conda-forge --name base \
   && micromamba clean -a

The versions defined here have been tested with the underlying SageMaker Distribution. You can freely add additional libraries that you may need. Identifying the right version may require some level of experimentation.
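When experimenting with versions, it can help to confirm from a notebook or terminal inside the image which versions were actually resolved. The following is a minimal sketch that uses only the Python standard library; the helper name find_version_mismatches and the EXPECTED_VERSIONS dictionary are illustrative assumptions, and the pins are copied from the Dockerfile shown earlier.

```python
from importlib import metadata

# Pins copied from the Dockerfile; adjust them if you change the image.
EXPECTED_VERSIONS = {
    "geopandas": "0.13.2",
    "rasterio": "1.3.8",
    "leafmap": "0.31.3",
    "rioxarray": "0.15.1",
}


def find_version_mismatches(expected):
    """Return {package: (expected, installed)} for every pin that does not match.

    The installed value is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    # Print one line per package whose installed version differs from the pin.
    for name, (wanted, installed) in find_version_mismatches(EXPECTED_VERSIONS).items():
        print(f"{name}: expected {wanted}, found {installed}")
```

An empty result suggests that the environment matches the pins; any printed line points at a package worth revisiting in the Dockerfile.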

Now that you have created your custom geospatial Dockerfile, you can build it and push the image to Amazon ECR.

Build a custom geospatial image

To build the Docker image, you need a build environment equipped with Docker and the AWS Command Line Interface (AWS CLI). This environment can be set up on your local machine, in a cloud-based IDE like AWS Cloud9, or as part of a continuous integration service like CodeBuild.

Before you build the Docker image, identify the ECR repository where you will push the image. Your image must be tagged in the following format: <your-aws-account-id>.dkr.ecr.<your-aws-region>.amazonaws.com/<your-repository-name>:<tag>. Without this tag, pushing it to an ECR repository is not possible. If you’re deploying the solution using the AWS CDK, an ECR repository is automatically created, and a CodeBuild project is configured to use this repository as the target for pushing the image. When you initiate the CodeBuild build, the image is built, tagged, and then pushed to the previously created ECR repository.
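If you script your builds, the tag format described above can be composed programmatically. The following is a small sketch under stated assumptions: the function name build_ecr_image_uri is hypothetical, and the validation is deliberately minimal (it only checks that the account ID is a 12-digit number and that the other parts are non-empty).

```python
import re


def build_ecr_image_uri(account_id, region, repository, tag):
    """Compose an image URI in the ECR tag format described in this post."""
    # AWS account IDs are 12-digit numbers.
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("account_id must be a 12-digit string")
    if not region or not repository or not tag:
        raise ValueError("region, repository, and tag must be non-empty")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(build_ecr_image_uri("123456789012", "eu-west-1", "geospatial", "latest-cpu"))
```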

The following steps are applicable only if you choose to perform these actions manually.

To build the image manually, run the following command in the same directory as the Dockerfile:

docker build --build-arg DISTRIBUTION_TYPE=cpu -t ${ECR_ACCOUNT_ID}.dkr.ecr.${ECR_REGION}.amazonaws.com/${ECR_REPO_NAME}:latest-cpu .

After building your image, you must log in to the ECR repository with this command before pushing the image:

aws ecr get-login-password --region ${ECR_REGION} | docker login --username AWS --password-stdin ${ECR_ACCOUNT_ID}.dkr.ecr.${ECR_REGION}.amazonaws.com

Next, push your Docker image using the following command:

docker push ${ECR_ACCOUNT_ID}.dkr.ecr.${ECR_REGION}.amazonaws.com/${ECR_REPO_NAME}:latest-cpu

Your image has now been pushed to the ECR repository and you can proceed to attach it to SageMaker.

Attach the custom geospatial image to SageMaker Studio

After your custom image has been successfully pushed to Amazon ECR, you need to attach it to a SageMaker domain to be able to use it within SageMaker Studio.

On the SageMaker console, choose Domains under Admin configurations in the navigation pane.

If you don’t have a SageMaker domain set up yet, you can create one.

From the list of available domains, choose the domain to which you want to attach the geospatial image.
On the Domain details page, choose the Environment tab.
In the Custom images for personal Studio apps section, choose Attach image.

Choose New image and enter the ECR image URI from the build pipeline output. This should have the following format: <your-aws-account-id>.dkr.ecr.<your-aws-region>.amazonaws.com/<your-repository-name>:<tag>
Choose Next.
For Image name, enter a custom image name (for this post, we use custom-geospatial-sm-dist).
For Image display name, enter a custom display name (for this post, we use Geospatial SageMaker Distribution (CPU)).
For Description, enter an image description.

Choose JupyterLab image as the application type and choose Submit.

When you return to the Environment tab on the Domain details page, you should now see your image listed under Custom images for personal Studio apps.

Attach the custom geospatial image using the AWS CLI

You can also automate the process using the AWS CLI.

First, register the image in SageMaker and create an image version:

SAGEMAKER_IMAGE_NAME=sagemaker-dist-custom-geospatial # adapt with your image name
ECR_IMAGE_URL='<account_id>.dkr.ecr.<region>.amazonaws.com/<ecr-repo-name>:latest-cpu' # replace with your ECR repository url
ROLE_ARN='The ARN of an IAM role for the execution role you want to use' # replace with the desired execution role

aws sagemaker create-image \
    --image-name ${SAGEMAKER_IMAGE_NAME} \
    --role-arn ${ROLE_ARN}

aws sagemaker create-app-image-config \
    --app-image-config-name ${SAGEMAKER_IMAGE_NAME}-app-image-config \
    --jupyter-lab-app-image-config {}

aws sagemaker create-image-version \
    --image-name ${SAGEMAKER_IMAGE_NAME} \
    --base-image ${ECR_IMAGE_URL}

Next, create a file containing the following content. You can add multiple custom images by adding additional entries to the CustomImages list.

{
  "DefaultUserSettings": {
    "JupyterLabAppSettings": {
      "CustomImages": [
        {
          "ImageName": "sagemaker-dist-custom-geospatial",
          "ImageVersionNumber": 1,
          "AppImageConfigName": "sagemaker-dist-custom-geospatial-app-image-config"
        }
      ]
    }
  }
}
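If you attach several custom images, writing this file by hand becomes error-prone. As an illustrative sketch, the following standard-library script generates the same structure; the function name build_default_user_settings is hypothetical, and it assumes each app image config is named <image-name>-app-image-config, matching the AWS CLI commands earlier in this section.

```python
import json


def build_default_user_settings(images):
    """Build the settings document from (image_name, version_number) pairs."""
    custom_images = [
        {
            "ImageName": name,
            "ImageVersionNumber": version,
            # Naming convention assumed from the create-app-image-config call.
            "AppImageConfigName": f"{name}-app-image-config",
        }
        for name, version in images
    ]
    return {
        "DefaultUserSettings": {
            "JupyterLabAppSettings": {"CustomImages": custom_images}
        }
    }


if __name__ == "__main__":
    settings = build_default_user_settings([("sagemaker-dist-custom-geospatial", 1)])
    # Write the file in the current working directory.
    with open("default-user-settings.json", "w") as f:
        json.dump(settings, f, indent=2)
```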

The next step assumes that you named the file from the previous step default-user-settings.json. The following command attaches the SageMaker image to the specified Studio domain:

DOMAIN_ID=d-####### # replace with your SageMaker Studio domain id
aws sagemaker update-domain --domain-id ${DOMAIN_ID} --cli-input-json file://default-user-settings.json

Use the custom geospatial image in the JupyterLab app

In the previous section, we demonstrated how to attach the image to a SageMaker domain. When you create a new (or modify an existing) JupyterLab space inside this domain, the newly created custom image will now be available. You can choose it on the Image dropdown menu, where it now appears under Custom alongside the default AWS curated SageMaker Distribution image versions.

To run a space using the custom geospatial image, choose Geospatial SageMaker Distribution (CPU) as your image, then choose Run space.

After the space has been provisioned and is in Running state, choose Open JupyterLab. This will bring up the JupyterLab IDE in a new browser tab. Select a notebook with Python3 (ipykernel) to start up a new Jupyter notebook running on top of the custom geospatial image.

Run interactive geospatial data analyses and large-scale processing jobs in SageMaker

After you build the custom geospatial image and attach it to your SageMaker domain, you can use it in one of two main ways:

You can use the image as the base to run a JupyterLab notebook kernel to perform in-notebook interactive development and geospatial analytics.
You can use the image in a SageMaker processing job to run highly parallelized geospatial processing pipelines. Reusing the interactive kernel image for asynchronous batch processing can be advantageous because only a single image must be maintained, and routines developed interactively in a notebook can be expected to work seamlessly in the processing job. If startup latency caused by longer image load times is a concern, you can choose to build a dedicated, more lightweight image just for processing (see Build Your Own Processing Container for details).

For hands-on examples of both approaches, refer to the accompanying GitHub repository.

In-notebook interactive development using a custom image

After you choose the custom geospatial image as the base image for your JupyterLab space, SageMaker provides you with access to many geospatial libraries that can now be imported without the need for additional installs. For example, you can run the following code to initialize a geometry object and plot it on a map within the familiar environment of a notebook:

import shapely
import leafmap
import geopandas

coords = [[-102.00723310488662,40.596123257503024],[-102.00723310488662,40.58168585757733],[-101.9882214495914,40.58168585757733],[-101.9882214495914,40.596123257503024],[-102.00723310488662,40.596123257503024]]
polygon = shapely.Polygon(coords)
gdf = geopandas.GeoDataFrame(index=[0], crs="epsg:4326", geometry=[polygon])
Map = leafmap.Map(center=[40.596123257503024, -102.00723310488662], zoom=13)
Map.add_basemap("USGS NAIP Imagery")
Map.add_gdf(gdf, layer_name="test", style={"color": "yellow", "fillOpacity": 0.3, "clickable": True,})
Map

Highly parallelized geospatial processing pipelines using a SageMaker processing job and a custom image

You can specify the custom image as the image to run a SageMaker processing job. This lets you use specialist geospatial processing frameworks to run large-scale distributed data processing pipelines with just a few lines of code. The following code snippet initializes and then runs a SageMaker ScriptProcessor object that uses the custom geospatial image (specified using the geospatial_image_uri variable) to run a geospatial processing routine (specified in a processing script) on 20 ml.m5.2xlarge instances:

import sagemaker
from sagemaker import get_execution_role
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor

region = sagemaker.Session().boto_region_name
role = get_execution_role()

geospatial_image_uri = "<GEOSPATIAL-IMAGE-URI>" #<-- set to uri of the custom geospatial image

# bucket_name, bucket_prefix_aoi_meta, bucket_prefix_sentinel2_meta, and execution_id
# are assumed to be defined earlier in the notebook

processor_geospatial_data_cube = ScriptProcessor(
    command=['python3'],
    image_uri=geospatial_image_uri,
    role=role,
    instance_count=20,
    instance_type="ml.m5.2xlarge",
    base_job_name="aoi-data-cube"
)

processor_geospatial_data_cube.run(
    code="scripts/generate_aoi_data_cube.py", #<-- processing script
    inputs=[
        ProcessingInput(
            source=f"s3://{bucket_name}/{bucket_prefix_aoi_meta}/",
            destination='/opt/ml/processing/input/aoi_meta/', #<-- meta data (incl. geography) of the area of observation
            s3_data_distribution_type="FullyReplicated" #<-- sharding strategy for distribution across nodes
        ),        
        ProcessingInput(
            source=f"s3://{bucket_name}/{bucket_prefix_sentinel2_meta}/",
            destination='/opt/ml/processing/input/sentinel2_meta/', #<-- Sentinel-2 scene metadata (1 file per scene)
            s3_data_distribution_type="ShardedByS3Key" #<-- sharding strategy for distribution across nodes
        ),
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output/",
            destination=f"s3://{bucket_name}/processing/geospatial-data-cube/{execution_id}/output/" #<-- output S3 path
        )
    ]
)
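The processing script itself is not shown in this post. As a rough, hypothetical sketch (not the repository's generate_aoi_data_cube.py), a script running inside the job could discover the scene metadata files that SageMaker placed under the input destination configured above and write one result per file to the output directory. With ShardedByS3Key, each instance sees only its own subset of the input files, so the same script runs unchanged on every instance. The function names and the placeholder processing step are assumptions for illustration.

```python
from pathlib import Path

# Container paths taken from the ProcessingInput/ProcessingOutput configuration.
SCENE_META_DIR = Path("/opt/ml/processing/input/sentinel2_meta")
OUTPUT_DIR = Path("/opt/ml/processing/output")


def list_scene_files(input_dir):
    """Return all files under input_dir, sorted for a reproducible order."""
    input_dir = Path(input_dir)
    if not input_dir.is_dir():
        return []
    return sorted(p for p in input_dir.rglob("*") if p.is_file())


def process_scene(scene_file, output_dir):
    """Placeholder step: record the file size. Replace with real geospatial logic."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    out_path = output_dir / f"{scene_file.stem}.txt"
    out_path.write_text(f"{scene_file.name}: {scene_file.stat().st_size} bytes\n")
    return out_path


def main(input_dir=SCENE_META_DIR, output_dir=OUTPUT_DIR):
    """Process every input file visible to this instance and return the output paths."""
    return [process_scene(f, output_dir) for f in list_scene_files(input_dir)]


if __name__ == "__main__":
    main()
```

Files written to the output directory are uploaded to the configured S3 destination when the job completes.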

A typical processing routine involving raster file loading, clipping to an area of observation, resampling specific bands, and masking clouds, among other steps, across 134 110x110 km Sentinel-2 scenes completes in under 15 minutes, as can be seen in the following Amazon CloudWatch dashboard.

Clean up

After you’re done running the notebook, don’t forget to stop the SageMaker Studio JupyterLab application to avoid incurring unnecessary costs. If you deployed the additional infrastructure using the AWS CDK, you can delete the deployed stack by running the following command in your local code checkout:

cd <path to repository>
cd deployment && cdk destroy

Conclusion

This post has equipped you with the knowledge and tools to build and use custom container images tailored for geospatial analysis in SageMaker Studio. By extending SageMaker Distribution with specialized geospatial libraries, you can customize your environment for specialized use cases. This empowers you to unlock the vast potential of geospatial data for applications such as environmental monitoring, urban planning, and precision agriculture, all within the familiar and user-friendly environment of SageMaker Studio.

Although this post focused on geospatial workflows, the methodology presented is broadly applicable. You can use the same principles to tailor container images for any domain requiring specific libraries or tools beyond the scope of SageMaker Distribution. This empowers you to create a truly customized development experience within SageMaker Studio, catering to your unique project needs.

The provided resources, including sample code and IaC templates, offer a solid foundation for building your own custom images. Experiment and explore how this approach can streamline your ML workflows involving geospatial data or any other specialized domain. To get started, visit the accompanying GitHub repository.

About the Authors

Janosch Woschitz is a Senior Solutions Architect at AWS, specializing in AI/ML. With over 15 years of experience, he supports customers globally in leveraging AI and ML for innovative solutions and building ML platforms on AWS. His expertise spans machine learning, data engineering, and scalable distributed systems, augmented by a strong background in software engineering and industry expertise in domains such as autonomous driving.

Dr. Karsten Schroer is a Senior Machine Learning (ML) Prototyping Architect at AWS, focused on helping customers leverage artificial intelligence (AI), ML, and generative AI technologies. With deep ML expertise, he collaborates with companies across industries to design and implement data- and AI-driven solutions that generate business value. Karsten holds a PhD in applied ML.

Anirudh Viswanathan is a Senior Product Manager, Technical, at AWS with the SageMaker team, where he focuses on Machine Learning. He holds a Master’s in Robotics from Carnegie Mellon University and an MBA from the Wharton School of Business. Anirudh is a named inventor on more than 50 AI/ML patents. He enjoys long-distance running, exploring art galleries, and attending Broadway shows.