Amazon SageMaker HyperPod is designed to support large-scale machine learning (ML) operations, providing a robust environment for training foundation models (FMs) over extended periods. Multiple users, such as ML researchers, software engineers, data scientists, and cluster administrators, can work concurrently on the same cluster, each managing their own jobs and files without interfering with others.
When using HyperPod, you can use familiar orchestration options such as Slurm or Amazon Elastic Kubernetes Service (Amazon EKS). This blog post specifically applies to HyperPod clusters that use Slurm as the orchestrator. In these clusters, the concept of login nodes is available, which cluster administrators can add to facilitate user access. These login nodes serve as the entry point through which users interact with the cluster's computational resources. By using login nodes, users can separate their interactive activities, such as browsing files, submitting jobs, and compiling code, from the cluster's head node. This separation helps prevent any single user's actions from affecting the overall performance of the cluster.
However, although HyperPod provides the capability to use login nodes, it doesn't offer an integrated mechanism for load balancing user activity across these nodes. As a result, users manually select a login node, which can lead to imbalances where some nodes are overutilized while others remain underutilized. This not only impacts the efficiency of resource usage but can also lead to an uneven performance experience for different users.
In this post, we explore a solution for implementing load balancing across login nodes in Slurm-based HyperPod clusters. By distributing user activity evenly across all available nodes, this approach provides more consistent performance, better resource utilization, and a smoother experience for all users. We guide you through the setup process, providing practical steps to achieve effective load balancing in your HyperPod clusters.
Solution overview
In HyperPod, login nodes serve as access points for users to interact with the cluster's computational resources so they can manage their tasks without impacting the head node. Although the default method for accessing these login nodes is through AWS Systems Manager, there are cases where direct Secure Shell (SSH) access is more suitable. SSH provides a more traditional and flexible way of managing interactions, especially for users who require specific networking configurations or need features such as TCP load balancing, which Systems Manager doesn't support.
Given that HyperPod is typically deployed in a virtual private cloud (VPC) using private subnets, direct SSH access to the login nodes requires secure network connectivity into the private subnet. There are several options to achieve this:
- AWS Site-to-Site VPN – Establishes a secure connection between your on-premises network and your VPC, suitable for enterprise environments
- AWS Direct Connect – Provides a dedicated network connection for high-throughput and low-latency needs
- AWS Client VPN – A software-based solution that remote users can use to securely connect to the VPC, providing flexible and easy access to the login nodes
This post demonstrates how to use AWS Client VPN to establish a secure connection to the VPC. We set up a Network Load Balancer (NLB) within the private subnet to evenly distribute SSH traffic across the available login nodes and use the VPN connection to reach the NLB in the VPC. The NLB makes sure that user sessions are balanced across the nodes, preventing any single node from becoming a bottleneck and thereby improving overall performance and resource utilization.
For environments where VPN connectivity might not be feasible, an alternative option is to deploy the NLB in a public subnet to allow direct SSH access from the internet. In this configuration, the NLB can be secured by restricting access through a security group that allows SSH traffic only from specified, trusted IP addresses. As a result, authorized users can connect directly to the login nodes while maintaining some level of control over access to the cluster. However, this public-facing method is outside the scope of this post and isn't recommended for production environments, because exposing SSH access to the internet can introduce additional security risks.
The following diagram provides an overview of the solution architecture.
Prerequisites
Before following the steps in this post, make sure you have the foundational components of a HyperPod cluster set up. This includes the core infrastructure for the HyperPod cluster and the network configuration required for secure access. Specifically, you need:
- HyperPod cluster – This post assumes you already have a HyperPod cluster deployed. If not, refer to Getting started with SageMaker HyperPod and the HyperPod workshop for guidance on creating and configuring your cluster.
- VPC, subnets, and security group – Your HyperPod cluster should reside within a VPC with associated subnets. To deploy a new VPC and subnets, follow the instructions in the Own Account section of the HyperPod workshop. This process includes deploying an AWS CloudFormation stack to create essential resources such as the VPC, subnets, security group, and an Amazon FSx for Lustre volume for shared storage.
Setting up login nodes for cluster access
Login nodes are dedicated access points that users can use to interact with the HyperPod cluster's computational resources without impacting the head node. By connecting through login nodes, users can browse files, submit jobs, and compile code independently, promoting a more organized and efficient use of the cluster's resources.
If you haven't set up login nodes yet, refer to the Login Node section of the HyperPod workshop, which provides detailed instructions on adding these nodes to your cluster configuration.
Each login node in a HyperPod cluster has an associated network interface within your VPC. A network interface, also known as an elastic network interface, represents a virtual network card that connects each login node to your VPC, allowing it to communicate over the network. These interfaces have assigned IPv4 addresses, which are essential for routing traffic from the NLB to the login nodes.
To proceed with the load balancer setup, you need to obtain the IPv4 addresses of each login node. You can obtain these addresses from the AWS Management Console or by running a command on your HyperPod cluster's head node.
Using the AWS Management Console
To find the login node network interfaces and their IPv4 addresses using the AWS Management Console, follow these steps:
- On the Amazon EC2 console, choose Network interfaces in the navigation pane
- In the search bar, select VPC ID = (Equals) and choose the VPC ID of the VPC containing the HyperPod cluster
- In the search bar, select Description : (Contains) and enter the name of the instance group that includes your login nodes (typically, this is login-group)
For each login node, you will find an entry in the list, as shown in the following screenshot. Note down the IPv4 addresses for all login nodes of your cluster.
Using the HyperPod head node
Alternatively, you can also retrieve the IPv4 addresses by running a command on your HyperPod cluster's head node.
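One possible approach, sketched below, reads the instance metadata that HyperPod writes to /opt/ml/config/resource_config.json on the node; this assumes the login instance group is named login-group and that jq is installed on the head node:

```bash
# List the IPv4 addresses of the instances in the login-group instance group
# (the file path, group name, and field names assume the standard HyperPod resource config)
sudo cat /opt/ml/config/resource_config.json \
  | jq -r '.InstanceGroups[] | select(.Name=="login-group") | .Instances[].CustomerIpAddress'
```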
Create a Network Load Balancer
The next step is to create an NLB to manage traffic across your cluster's login nodes.
For the NLB deployment, you need the IPv4 addresses of the login nodes collected earlier and the appropriate security group configurations. If you deployed your cluster using the HyperPod workshop instructions, a security group that enables communication between all cluster nodes should already be in place.
This security group can be applied to the load balancer, as demonstrated in the following instructions. Alternatively, you can opt to create a dedicated security group that grants access specifically to the login nodes.
Create target group
First, we create the target group that will be used by the NLB.
- On the Amazon EC2 console, choose Target groups in the navigation pane
- Choose Create target group
- Create a target group with the following parameters:
  - For Choose a target type, choose IP addresses
  - For Target group name, enter smhp-login-node-tg
  - For Protocol : Port, choose TCP and enter 22
  - For IP address type, choose IPv4
  - For VPC, choose the SageMaker HyperPod VPC (which was created with the CloudFormation template)
  - For Health check protocol, choose TCP
- Choose Next, as shown in the following screenshot
- In the Register targets section, register the login node IP addresses as the targets
- For Ports, enter 22 and choose Include as pending below, as shown in the following screenshot
- The login node IPs will appear as targets with Pending health status. Choose Create target group, as shown in the following screenshot
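If you prefer to script this step, the following hedged AWS CLI sketch creates an equivalent target group and registers the login nodes; the VPC ID, target group ARN, and IP addresses are placeholders for the values from your environment:

```bash
# Create a TCP/22 target group of type "ip" in the HyperPod VPC
aws elbv2 create-target-group \
  --name smhp-login-node-tg \
  --protocol TCP \
  --port 22 \
  --target-type ip \
  --vpc-id vpc-0123456789abcdef0 \
  --health-check-protocol TCP

# Register the login node IPv4 addresses collected earlier as targets
aws elbv2 register-targets \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/smhp-login-node-tg/0123456789abcdef \
  --targets Id=10.1.0.11,Port=22 Id=10.1.0.12,Port=22
```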
Create load balancer
To create the load balancer, follow these steps:
- On the Amazon EC2 console, choose Load Balancers in the navigation pane
- Choose Create load balancer
- Choose Network Load Balancer and choose Create, as shown in the following screenshot
- Provide a name (for example, smhp-login-node-lb) and choose Internal as the Scheme
- For Network mapping, select the VPC that contains your HyperPod cluster and an associated private subnet, as shown in the following screenshot
- Select a security group that allows access on port 22 to the login nodes. If you deployed your cluster using the HyperPod workshop instructions, you can use the security group from that deployment.
- Select the target group that you created before, choose TCP as the Protocol, and enter 22 for Port, as shown in the following screenshot
- Choose Create load balancer
After the load balancer has been created, you can find its DNS name on the load balancer's detail page, as shown in the following screenshot.
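The same steps can also be scripted. The following hedged AWS CLI sketch creates an internal NLB with a TCP listener on port 22 and retrieves its DNS name; the subnet ID, security group ID, and ARNs are placeholders:

```bash
# Create an internal NLB in the HyperPod VPC's private subnet
aws elbv2 create-load-balancer \
  --name smhp-login-node-lb \
  --type network \
  --scheme internal \
  --subnets subnet-0123456789abcdef0 \
  --security-groups sg-0123456789abcdef0

# Forward TCP traffic on port 22 to the target group created earlier
aws elbv2 create-listener \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/net/smhp-login-node-lb/0123456789abcdef \
  --protocol TCP \
  --port 22 \
  --default-actions Type=forward,TargetGroupArn=arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/smhp-login-node-tg/0123456789abcdef

# Retrieve the load balancer's DNS name
aws elbv2 describe-load-balancers \
  --names smhp-login-node-lb \
  --query 'LoadBalancers[0].DNSName' \
  --output text
```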
Ensuring host keys are consistent across login nodes
When using multiple login nodes in a load-balanced environment, it's crucial to maintain consistent SSH host keys across all nodes. SSH host keys are unique identifiers that each server uses to prove its identity to connecting clients. If each login node has a different host key, users will encounter SSH host key warnings (such as "WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!") whenever they connect to a different node, causing confusion and potentially leading users to question the security of the connection.
To avoid these warnings, configure the same SSH host keys on all login nodes in the load balancing rotation. This setup makes sure that users won't receive host key mismatch alerts when routed to a different node by the load balancer.
You can run the following script on the cluster's head node to copy the SSH host keys from the first login node to the other login nodes in your HyperPod cluster:
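A minimal sketch of such a script is shown below. It assumes the head node can SSH to each login node as a user with passwordless sudo (ubuntu here, as an illustrative default) and that LOGIN_NODES holds the IPv4 addresses you collected earlier; adjust it to your environment before use.

```bash
#!/bin/bash
# Sketch: copy the SSH host keys from the first login node to the other login
# nodes so that every node behind the NLB presents the same host identity.
# Assumptions (adjust for your environment): the head node can SSH to each
# login node as a user with passwordless sudo, and LOGIN_NODES lists the
# login node IPv4 addresses collected earlier.
set -euo pipefail

LOGIN_NODES=("10.1.0.11" "10.1.0.12" "10.1.0.13")  # placeholder IP addresses
SSH_USER="ubuntu"                                   # placeholder remote user

SOURCE_NODE="${LOGIN_NODES[0]}"
ARCHIVE=$(mktemp)

# Bundle the host keys from the first login node
ssh -o StrictHostKeyChecking=accept-new "${SSH_USER}@${SOURCE_NODE}" \
  "sudo tar czf - /etc/ssh/ssh_host_*" > "${ARCHIVE}"

# Distribute the keys to the remaining login nodes and restart the SSH daemon
for NODE in "${LOGIN_NODES[@]:1}"; do
  ssh -o StrictHostKeyChecking=accept-new "${SSH_USER}@${NODE}" \
    "sudo tar xzf - -C /" < "${ARCHIVE}"
  ssh "${SSH_USER}@${NODE}" "sudo systemctl restart ssh || sudo systemctl restart sshd"
done

rm -f "${ARCHIVE}"
```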
Create AWS Client VPN endpoint
Because the NLB has been created with the Internal scheme, it's only accessible from within the HyperPod VPC. To access the VPC and send requests to the NLB, we use AWS Client VPN in this post.
AWS Client VPN is a managed client-based VPN service that enables secure access to your AWS resources and resources in your on-premises network.
We'll set up an AWS Client VPN endpoint that provides clients with access to the HyperPod VPC and uses mutual authentication. With mutual authentication, Client VPN uses certificates to perform authentication between clients and the Client VPN endpoint.
To deploy a Client VPN endpoint with mutual authentication, you can follow the steps outlined in Get started with AWS Client VPN. When configuring the Client VPN endpoint to access the HyperPod VPC and the login nodes, keep these adaptations to the following steps in mind:
- Step 2 (create a Client VPN endpoint) – By default, all client traffic is routed through the Client VPN tunnel. To allow internet access without routing traffic through the VPN, you can enable the Enable split-tunnel option when creating the endpoint. When this option is enabled, only traffic destined for networks matching a route in the Client VPN endpoint route table is routed through the VPN tunnel. For more details, refer to Split-tunnel on Client VPN endpoints.
- Step 3 (target network associations) – Select the VPC and private subnet used by your HyperPod cluster, which contains the cluster login nodes.
- Step 4 (authorization rules) – Choose the Classless Inter-Domain Routing (CIDR) range associated with the HyperPod VPC. If you followed the HyperPod workshop instructions, the CIDR range is 10.0.0.0/16.
- Step 6 (security groups) – Select the security group that you previously used when creating the NLB.
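For reference, those console steps roughly correspond to the following hedged AWS CLI sketch. The certificate ARNs, endpoint, subnet, VPC, and security group IDs are placeholders, and the ACM server and client root certificates for mutual authentication must already exist (as described in the Client VPN getting started guide):

```bash
# Create the Client VPN endpoint with mutual (certificate-based) authentication
# and split-tunnel enabled; the client CIDR must not overlap with the VPC CIDR
aws ec2 create-client-vpn-endpoint \
  --client-cidr-block 172.16.0.0/22 \
  --server-certificate-arn arn:aws:acm:us-east-1:111122223333:certificate/server-cert-id \
  --authentication-options "Type=certificate-authentication,MutualAuthentication={ClientRootCertificateChainArn=arn:aws:acm:us-east-1:111122223333:certificate/client-root-cert-id}" \
  --connection-log-options Enabled=false \
  --split-tunnel

# Associate the HyperPod private subnet and authorize access to the VPC CIDR
aws ec2 associate-client-vpn-target-network \
  --client-vpn-endpoint-id cvpn-endpoint-0123456789abcdef0 \
  --subnet-id subnet-0123456789abcdef0

aws ec2 authorize-client-vpn-ingress \
  --client-vpn-endpoint-id cvpn-endpoint-0123456789abcdef0 \
  --target-network-cidr 10.0.0.0/16 \
  --authorize-all-groups

# Apply the security group used for the NLB to the associated target network
aws ec2 apply-security-groups-to-client-vpn-target-network \
  --client-vpn-endpoint-id cvpn-endpoint-0123456789abcdef0 \
  --vpc-id vpc-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0
```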
Connecting to the login nodes
After the AWS Client VPN is configured, clients can establish a VPN connection to the HyperPod VPC. With the VPN connection in place, clients can use SSH to connect to the NLB, which will route them to one of the login nodes.
ssh -i /path/to/your/private-key.pem user@<NLB-IP-or-DNS>
To allow SSH access to the login nodes, you must create user accounts on the cluster and add their public keys to the authorized_keys file on each login node (or on all nodes, if necessary). For detailed instructions on managing multi-user access, refer to the Multi-User section of the HyperPod workshop.
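As a minimal illustration of what that involves, the sketch below provisions a single hypothetical user (datascientist1) with a placeholder public key on one login node; for real multi-user setups, prefer the approach described in the Multi-User section of the HyperPod workshop.

```bash
# Create a local user account (hypothetical username) and install their public key
sudo useradd -m -s /bin/bash datascientist1
sudo mkdir -p /home/datascientist1/.ssh
echo "ssh-ed25519 AAAA...placeholder... datascientist1@example.com" | \
  sudo tee -a /home/datascientist1/.ssh/authorized_keys > /dev/null
sudo chown -R datascientist1:datascientist1 /home/datascientist1/.ssh
sudo chmod 700 /home/datascientist1/.ssh
sudo chmod 600 /home/datascientist1/.ssh/authorized_keys
```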
In addition to using AWS Client VPN, you can also access the NLB from other AWS services, such as Amazon Elastic Compute Cloud (Amazon EC2) instances, if they meet the following requirements:
- VPC connectivity – The EC2 instances must be either in the same VPC as the NLB or able to reach the HyperPod VPC through a peering connection or similar network setup.
- Security group configuration – The EC2 instance's security group must allow outbound connections on port 22 to the NLB security group. Likewise, the NLB security group should be configured to accept inbound SSH traffic on port 22 from the EC2 instance's security group.
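The inbound part of that configuration can be expressed with a single hedged AWS CLI call; both security group IDs below are placeholders:

```bash
# NLB_SG: security group attached to the NLB and login nodes (placeholder)
# EC2_SG: security group of the EC2 instance that needs SSH access (placeholder)
NLB_SG=sg-0aaaaaaaaaaaaaaaa
EC2_SG=sg-0bbbbbbbbbbbbbbbb

aws ec2 authorize-security-group-ingress \
  --group-id "${NLB_SG}" \
  --protocol tcp \
  --port 22 \
  --source-group "${EC2_SG}"
```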
Clean up
To remove the deployed resources, clean them up in the following order:
- Delete the Client VPN endpoint
- Delete the Network Load Balancer
- Delete the target group associated with the load balancer
If you also want to delete the HyperPod cluster, follow these additional steps:
- Delete the HyperPod cluster
- Delete the CloudFormation stack, which includes the VPC, subnets, security group, and FSx for Lustre volume
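If you prefer the AWS CLI, the cleanup roughly corresponds to the following hedged sketch, where all IDs, ARNs, and names are placeholders:

```bash
# Disassociate the Client VPN target network, then delete the endpoint
aws ec2 disassociate-client-vpn-target-network \
  --client-vpn-endpoint-id cvpn-endpoint-0123456789abcdef0 \
  --association-id cvpn-assoc-0123456789abcdef0
aws ec2 delete-client-vpn-endpoint \
  --client-vpn-endpoint-id cvpn-endpoint-0123456789abcdef0

# Delete the NLB first, then its target group
aws elbv2 delete-load-balancer \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/net/smhp-login-node-lb/0123456789abcdef
aws elbv2 delete-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/smhp-login-node-tg/0123456789abcdef

# Optionally, delete the HyperPod cluster and the CloudFormation stack
aws sagemaker delete-cluster --cluster-name ml-cluster
aws cloudformation delete-stack --stack-name sagemaker-hyperpod
```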
Conclusion
In this post, we explored how to implement login node load balancing for SageMaker HyperPod clusters. By using a Network Load Balancer to distribute user traffic across login nodes, you can optimize resource utilization and enhance the overall multi-user experience, providing seamless access to cluster resources for every user.
This approach represents just one way to customize your HyperPod cluster. Thanks to the flexibility of SageMaker HyperPod, you can adapt configurations to your unique needs while benefiting from a managed, resilient environment. Whether you need to scale foundation model workloads, share compute resources across different tasks, or support long-running training jobs, SageMaker HyperPod offers a versatile solution that can evolve with your requirements.
For more details on making the most of SageMaker HyperPod, dive into the HyperPod workshop and explore further blog posts covering HyperPod.
About the Authors
Janosch Woschitz is a Senior Solutions Architect at AWS, specializing in AI/ML. With over 15 years of experience, he supports customers globally in leveraging AI and ML for innovative solutions and building ML platforms on AWS. His expertise spans machine learning, data engineering, and scalable distributed systems, complemented by a strong background in software engineering and industry expertise in domains such as autonomous driving.
Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing soccer.