Kubernetes is a popular orchestration platform for managing containers. Its scalability and load-balancing capabilities make it ideal for handling the variable workloads typical of machine learning (ML) applications. DevOps engineers often use Kubernetes to manage and scale ML applications, but before an ML model is available, it must be trained and evaluated and, if the quality of the obtained model is satisfactory, uploaded to a model registry.
Amazon SageMaker provides capabilities to remove the undifferentiated heavy lifting of building and deploying ML models. SageMaker simplifies the process of managing dependencies, container images, auto scaling, and monitoring. Specifically for the model building stage, Amazon SageMaker Pipelines automates the process by managing the infrastructure and resources needed to process data, train models, and run evaluation tests.
A challenge for DevOps engineers is the additional complexity that comes from using Kubernetes to manage the deployment stage while resorting to other tools (such as the AWS SDK or AWS CloudFormation) to manage the model building pipeline. One alternative to simplify this process is to use AWS Controllers for Kubernetes (ACK) to manage and deploy a SageMaker training pipeline. ACK allows you to take advantage of managed model building pipelines without needing to define resources outside of the Kubernetes cluster.
In this post, we introduce an example to help DevOps engineers manage the entire ML lifecycle, including training and inference, using the same toolkit.
Solution overview
We consider a use case in which an ML engineer configures a SageMaker model building pipeline using a Jupyter notebook. This configuration takes the form of a Directed Acyclic Graph (DAG) represented as a JSON pipeline definition. The JSON document can be stored and versioned in an Amazon Simple Storage Service (Amazon S3) bucket. If encryption is required, it can be implemented using an AWS Key Management Service (AWS KMS) managed key for Amazon S3. A DevOps engineer with access to fetch this definition file from Amazon S3 can load the pipeline definition into an ACK service controller for SageMaker, which runs as part of an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. The DevOps engineer can then use the Kubernetes APIs provided by ACK to submit the pipeline definition and initiate one or more pipeline runs in SageMaker. This entire workflow is shown in the following solution diagram.
Prerequisites
To follow along, you should have the following prerequisites:
- An EKS cluster where the ML pipeline will be created.
- A user with access to an AWS Identity and Access Management (IAM) role that has IAM permissions (iam:CreateRole, iam:AttachRolePolicy, and iam:PutRolePolicy) to allow creating roles and attaching policies to roles.
- The following command line tools on the local machine or cloud-based development environment used to access the Kubernetes cluster, as checked in the snippet after this list.
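At a minimum, this walkthrough assumes kubectl, helm, and the AWS CLI are available (yq is also assumed here as a convenience for editing YAML specifications; your environment may need others). You can verify the tools are installed with the following commands:

```bash
kubectl version --client   # Kubernetes CLI, used to apply and inspect resources
helm version               # Helm, used to install the ACK controller
aws --version              # AWS CLI, used for IAM and S3 setup
yq --version               # Optional YAML processor for editing specifications
```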
Install the SageMaker ACK service controller
The SageMaker ACK service controller makes it simple for DevOps engineers to use Kubernetes as their control plane to create and manage ML pipelines. To install the controller in your EKS cluster, complete the following steps:
- Configure IAM permissions to make sure the controller has access to the appropriate AWS resources.
- Install the controller using a SageMaker Helm Chart to make it available on the client machine.
The following tutorial provides step-by-step instructions with the required commands to install the ACK service controller for SageMaker.
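For orientation, installing the controller with Helm typically looks like the following sketch. The chart location follows the pattern used in the ACK documentation; the release version, namespace, and Region are placeholders that you should replace with the values from the tutorial:

```bash
export SERVICE=sagemaker
export RELEASE_VERSION="<chart-version>"   # placeholder; use a released chart version
export ACK_SYSTEM_NAMESPACE=ack-system
export AWS_REGION="<aws-region>"           # placeholder; for example, us-west-2

# Install the ACK service controller for SageMaker from the public ECR Helm repository
helm install -n $ACK_SYSTEM_NAMESPACE --create-namespace \
  ack-$SERVICE-controller \
  oci://public.ecr.aws/aws-controllers-k8s/$SERVICE-chart \
  --version=$RELEASE_VERSION \
  --set=aws.region=$AWS_REGION
```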
Generate a pipeline JSON definition
In most companies, ML engineers are responsible for creating the ML pipeline in their organization. They often work with DevOps engineers to operate these pipelines. In SageMaker, ML engineers can use the SageMaker Python SDK to generate a pipeline definition in JSON format. A SageMaker pipeline definition must follow the provided schema, which includes base images, dependencies, steps, and instance types and sizes that are needed to fully define the pipeline. This definition then gets retrieved by the DevOps engineer for deploying and maintaining the infrastructure needed for the pipeline.
The following is a sample pipeline definition with one training step:
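The definition below is a minimal sketch that follows the published SageMaker pipeline definition schema (Version 2020-12-01), in which a Training step's arguments mirror the CreateTrainingJob API request. The image URI, S3 paths, account ID, role name, and KMS key ID are placeholders:

```json
{
  "Version": "2020-12-01",
  "Metadata": {},
  "Parameters": [],
  "Steps": [
    {
      "Name": "TrainingStep",
      "Type": "Training",
      "Arguments": {
        "AlgorithmSpecification": {
          "TrainingImage": "<training-image-uri>",
          "TrainingInputMode": "File"
        },
        "InputDataConfig": [
          {
            "ChannelName": "train",
            "DataSource": {
              "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://<bucket>/train/"
              }
            }
          }
        ],
        "OutputDataConfig": {
          "S3OutputPath": "s3://<bucket>/output/",
          "KmsKeyId": "<optional-kms-key-id>"
        },
        "ResourceConfig": {
          "InstanceCount": 1,
          "InstanceType": "ml.m5.xlarge",
          "VolumeSizeInGB": 30
        },
        "StoppingCondition": { "MaxRuntimeInSeconds": 3600 },
        "RoleArn": "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"
      }
    }
  ]
}
```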
With SageMaker, ML model artifacts and other system artifacts are encrypted in transit and at rest. SageMaker encrypts these by default using the AWS managed key for Amazon S3. You can optionally specify a custom key using the KmsKeyId property of the OutputDataConfig argument. For more information on how SageMaker protects data, see Data Protection in Amazon SageMaker.
Additionally, we recommend restricting access to the pipeline artifacts, such as model outputs and training data, to a specific set of IAM roles created for data scientists and ML engineers. This can be achieved by attaching an appropriate bucket policy. For more information on best practices for securing data in Amazon S3, see Top 10 security best practices for securing data in Amazon S3.
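As an illustration only (the bucket name and role ARNs are hypothetical, and you must list every principal that legitimately needs access, including the pipeline execution role, or you will lock it out), a bucket policy of this shape denies access to all but the approved roles:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAllExceptApprovedRoles",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::<artifact-bucket>",
        "arn:aws:s3:::<artifact-bucket>/*"
      ],
      "Condition": {
        "ArnNotLike": {
          "aws:PrincipalArn": [
            "arn:aws:iam::<account-id>:role/DataScientistRole",
            "arn:aws:iam::<account-id>:role/MLEngineerRole",
            "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"
          ]
        }
      }
    }
  ]
}
```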
Create and submit a pipeline YAML specification
In the Kubernetes world, objects are the persistent entities in the Kubernetes cluster used to represent the state of your cluster. When you create an object in Kubernetes, you must provide the object specification that describes its desired state, as well as some basic information about the object (such as a name). Then, using tools such as kubectl, you provide the information in a manifest file in YAML (or JSON) format to communicate with the Kubernetes API.
Refer to the following Kubernetes YAML specification for a SageMaker pipeline. DevOps engineers need to modify the .spec.pipelineDefinition key in the file and add the ML engineer-provided pipeline JSON definition. They then prepare and submit a separate pipeline execution YAML specification to run the pipeline in SageMaker. There are two ways to submit a pipeline YAML specification:
- Pass the pipeline definition inline as a JSON object to the pipeline YAML specification.
- Convert the JSON pipeline definition into String format using the command line utility jq, as shown in the example after this list.
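For example, assuming the definition was saved locally as pipeline.json (a placeholder file name), the following command converts it to a JSON-encoded string:

```bash
# Emit the pipeline definition as a single JSON-encoded string
jq -r tojson < pipeline.json
```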
In this post, we use the first option and prepare the YAML specification (my-pipeline.yaml) as follows:
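The following is a sketch of the specification, assuming the v1alpha1 ACK SageMaker custom resources; the names, role ARN, and truncated inline definition are placeholders, and field names should be verified against your controller version:

```yaml
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Pipeline
metadata:
  name: my-kubernetes-pipeline
spec:
  pipelineName: my-kubernetes-pipeline
  pipelineDescription: SageMaker pipeline submitted through the ACK controller
  # Paste the ML engineer-provided JSON definition as the value of this key
  pipelineDefinition: |
    {
      "Version": "2020-12-01",
      "Steps": []
    }
  roleARN: arn:aws:iam::<account-id>:role/<sagemaker-execution-role>
```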
Submit the pipeline to SageMaker
To submit your prepared pipeline specification, apply the specification to your Kubernetes cluster as follows:
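Using the file name from the sketch above:

```bash
kubectl apply -f my-pipeline.yaml
```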
Create and submit a pipeline execution YAML specification
Refer to the following Kubernetes YAML specification for a SageMaker pipeline execution. Prepare the pipeline execution YAML specification (pipeline-execution.yaml) as follows:
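Again a sketch under the same v1alpha1 assumption; the execution name and description are placeholders, and spec.pipelineName must match the Pipeline resource created earlier:

```yaml
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: PipelineExecution
metadata:
  name: my-kubernetes-pipeline-execution
spec:
  pipelineName: my-kubernetes-pipeline
  pipelineExecutionDescription: Pipeline run started from the ACK controller
```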
To start a run of the pipeline, use the following code:
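Applying the execution specification starts the run:

```bash
kubectl apply -f pipeline-execution.yaml
```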
Review and troubleshoot the pipeline run
To list all pipelines created using the ACK controller, use the following command:
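With the controller's custom resource definitions installed, the standard kubectl listing works:

```bash
kubectl get pipeline
```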
To list all pipeline runs, use the following command:
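```bash
kubectl get pipelineexecution
```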
To get more details about the pipeline after it's submitted, like checking the status, errors, or parameters of the pipeline, use the following command:
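Using the placeholder resource name from the earlier sketch:

```bash
kubectl describe pipeline my-kubernetes-pipeline
```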
To troubleshoot a pipeline run by reviewing more details about the run, use the following command:
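```bash
kubectl describe pipelineexecution my-kubernetes-pipeline-execution
```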
Clean up
Use the following command to delete any pipelines you created:
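You can delete by resource name or by the manifest you applied:

```bash
kubectl delete pipeline my-kubernetes-pipeline
# or, equivalently:
kubectl delete -f my-pipeline.yaml
```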
Use the following command to cancel any pipeline runs you started:
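Deleting the execution resource asks the controller to stop the corresponding run in SageMaker (behavior worth verifying against your controller version):

```bash
kubectl delete pipelineexecution my-kubernetes-pipeline-execution
```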
Conclusion
In this post, we presented an example of how ML engineers familiar with Jupyter notebooks and SageMaker environments can efficiently work with DevOps engineers familiar with Kubernetes and related tools to design and maintain an ML pipeline with the right infrastructure for their organization. This lets DevOps engineers manage all the steps of the ML lifecycle with the same set of tools and environment they are used to, which allows organizations to innovate faster and more efficiently.
Explore the GitHub repository for ACK and the SageMaker controller to get started managing your ML operations with Kubernetes.
About the Authors
Pratik Yeole is a Senior Solutions Architect working with global customers, helping customers build value-driven solutions on AWS. He has expertise in the MLOps and containers domains. Outside of work, he enjoys time with friends, family, music, and cricket.
Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.