This post is courtesy of Dario La Porta, Senior Consultant, HPC.
With high performance computing (HPC) workloads running in the AWS Cloud, customers can scale workloads easily and select from a variety of instance types.
With this additional flexibility, elasticity, and scale, it’s important to track your costs and resource utilization for specific projects or users in HPC environments. You can do this using orchestration tools such as AWS ParallelCluster with AWS Cost Explorer and AWS Budgets. These allow you to manage cost allocation, forecast spending, and set up billing alarms that trigger on defined budget thresholds. You can also analyze usage to reduce cost or optimize price and performance.
AWS ParallelCluster deployment
AWS ParallelCluster is an open source cluster management tool to deploy and manage HPC clusters in the AWS Cloud. This solution relies on the following files, which you store in an Amazon S3 bucket:
- The post_install.sh script configures the Slurm cluster after the deployment. Replace <bucket> with your bucket name.
- The projects_list.conf file contains the list of the projects assigned to each user.
- The slurm-aws-PrologSlurmctld.sh script assigns the required tags to the Amazon EC2 instances of the jobs.
- The sbatch script is used as a wrapper around the Slurm sbatch command. Replace <account_id> with your account ID.
The following “Launch Stack” button deploys a VPC with public (for the cluster’s head node) and private subnets (for the cluster’s compute nodes). You can also specify the CIDR range for the subnets and VPC when you launch this stack. In addition, it creates the policies required for the AWS ParallelCluster’s additional_iam_policies configuration. These are used to apply the tags to track per-project and per-user costs.
The stack creates the VPC, the public subnets, the private subnet, and the additional policies required for the cluster.
After the stack is deployed, you can use the provided AWS ParallelCluster config file to build the cluster.
The template uses the t2.micro instance type for the cluster’s master and compute instances.
For real-world HPC use cases, you most likely want to use a different instance type, such as C5 or C5n.
In the config file, master_subnet_id contains the ID of the created public subnet, compute_subnet_id contains the ID of the private subnet, and vpc_id contains the ID of the cluster’s VPC. You must replace <account_id> with your account ID and <bucket> with the name of the bucket that contains the post_install.sh, projects_list.conf, slurm-aws-PrologSlurmctld.sh, and sbatch scripts.
This tutorial assumes you know how to set up an HPC cluster in AWS ParallelCluster. To learn how to do this, refer to the AWS ParallelCluster documentation, or this getting started blog post. To set up a serverless HPC cluster, refer to using AWS ParallelCluster with a serverless API.
Review your HPC cost
After the cluster deployment, the following tags are created in the environment when you submit a job:
- aws-parallelcluster-jobid – the job ID assigned to the compute instance.
- aws-parallelcluster-username – the owner of the submitted job.
- aws-parallelcluster-partition – the Slurm partition of the job.
- aws-parallelcluster-account – the Slurm user account.
- aws-parallelcluster-jobname – the name of the Slurm job.
- aws-parallelcluster-project – the project assigned to the job.
A tag is a label that you or AWS assigns to an AWS resource. Each tag consists of a key and a value. For each resource, each tag key must be unique, and each tag key can have only one value.
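As a rough illustration of how these tags could be applied to a compute instance, here is a minimal sketch built on the aws ec2 create-tags command. It is not the slurm-aws-PrologSlurmctld.sh script itself. The function name tag_job_instance, its argument order, and the AWS_CLI override (useful for a dry run) are assumptions made for this example.

```shell
#!/bin/sh
# Sketch only: apply the six aws-parallelcluster-* tags to one EC2 instance.
# tag_job_instance and the AWS_CLI override are hypothetical names, not part
# of AWS ParallelCluster. Set AWS_CLI=echo to print the command instead of
# running it.

tag_job_instance() {
    # $1=instance ID, then: job ID, user, partition, account, job name, project
    instance_id="$1"
    shift
    "${AWS_CLI:-aws}" ec2 create-tags --resources "$instance_id" --tags \
        "Key=aws-parallelcluster-jobid,Value=$1" \
        "Key=aws-parallelcluster-username,Value=$2" \
        "Key=aws-parallelcluster-partition,Value=$3" \
        "Key=aws-parallelcluster-account,Value=$4" \
        "Key=aws-parallelcluster-jobname,Value=$5" \
        "Key=aws-parallelcluster-project,Value=$6"
}

# Example (dry run):
#   AWS_CLI=echo tag_job_instance i-0123456789abcdef0 42 ec2-user compute default myjob ProjectA
```

Because each tag is passed in the CLI shorthand syntax, values that contain commas would need the JSON form of --tags instead.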
After the tags are applied to the environment, you can activate the user-defined cost allocation tags for your billing reports.
You can specify the project of a job using the Slurm --comment parameter:
$ sbatch --comment ProjectA script.sh
The sbatch command used here is a wrapper around the Slurm sbatch command. This script extends the standard sbatch functionality to enable project tag management.
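To make the wrapper idea concrete, the following sketch shows one way a script could pull the project name out of the sbatch arguments before handing them to the real command. It is an assumption about the approach, not the code of the provided wrapper, and extract_project is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: read the project from "--comment <name>" or "--comment=<name>".
# extract_project is a hypothetical helper, not part of Slurm.

extract_project() {
    project=""
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --comment=*)
                project="${1#--comment=}"
                ;;
            --comment)
                # The value is the next argument, if there is one.
                if [ "$#" -ge 2 ]; then
                    project="$2"
                    shift
                fi
                ;;
        esac
        shift
    done
    # Prints an empty line when no project was given.
    printf '%s\n' "$project"
}

# Example:
#   extract_project --comment ProjectA script.sh
```

This simplified parser scans every argument, so a fuller wrapper would stop at the job script name to avoid reading options meant for the script itself.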
Usually, a cluster can be used for multiple purposes and projects. When a user specifies the project related to a job, the underlying system adds the correct aws-parallelcluster-project tag to the instances created to run the computation.
In addition, the aws-parallelcluster-username and aws-parallelcluster-account tags link the instances to the user and account. This assignment can be used to bill the correct user or team for the used resources.
The AWS Cost Explorer service is used to visualize, understand, and manage the HPC expenses related to your projects. Note that your account’s spending in Cost Explorer may take up to 24 hours to propagate. The following graphs are examples showing how you can group your costs by Projects and Job IDs.
When the solution is deployed in a multiuser environment, you can also track the expenses of each user by grouping the report by aws-parallelcluster-username.
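The same grouping is also available from the command line. The sketch below wraps the aws ce get-cost-and-usage command so that costs are grouped by one of the tags. The function name cost_by_tag, the choice of monthly granularity and the UnblendedCost metric, and the AWS_CLI override for dry runs are assumptions for this example. The tag must already be activated as a cost allocation tag.

```shell
#!/bin/sh
# Sketch only: query Cost Explorer grouped by a cost allocation tag.
# cost_by_tag and the AWS_CLI override are hypothetical names.
# Set AWS_CLI=echo to print the command instead of running it.

cost_by_tag() {
    # $1=tag key  $2=start date (YYYY-MM-DD)  $3=end date (YYYY-MM-DD, exclusive)
    "${AWS_CLI:-aws}" ce get-cost-and-usage \
        --time-period "Start=$2,End=$3" \
        --granularity MONTHLY \
        --metrics UnblendedCost \
        --group-by "Type=TAG,Key=$1"
}

# Example:
#   cost_by_tag aws-parallelcluster-username 2024-01-01 2024-02-01
```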
You can build a multiuser environment by following the AWS ParallelCluster Wiki page. For an Active Directory integration, see the AWS ParallelCluster with AWS Directory Services authentication blog post.
Creating budgets for HPC spending
Sometimes tracking expenses is not sufficient, because you may also need to set budget limits for users or for specific projects.
To set a custom budget that alerts you if costs or usage exceed a budgeted amount, use the AWS Budgets service. The documentation for creating a budget explains how to create a cost budget. Under Budget parameters, you can choose the AWS ParallelCluster tags that you want to use for the budget creation.
You can also use the create-budget API to create project and user-level budgets.
aws budgets create-budget --account-id 111122223333 --budget file://budget.json --notifications-with-subscribers file://notifications-with-subscribers.json
The budget.json file contains the budget object that you want to create. You can replace <amount> with the budget allocated for the project and <project_name> with the name of the project.
The notifications-with-subscribers.json file contains the notification associated with the budget. The <email> string must be replaced with the email address where you want to receive the budget notifications. You can review the syntax of both files in the create-budget API documentation.
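For orientation, the two files could be generated along the following lines. This is a sketch, not the exact files from this post. The function names, the 80 percent threshold in the usage example, and the monthly USD cost budget are assumptions, and the TagKeyValue filter format (user:<tag key>$<tag value>) should be checked against the create-budget API documentation.

```shell
#!/bin/sh
# Sketch only: print a budget object and a notification list for create-budget.
# write_budget_json and write_notifications_json are hypothetical helpers.

write_budget_json() {
    # $1=project name (also used as the budget name)  $2=monthly amount in USD
    cat <<EOF
{
  "BudgetName": "$1",
  "BudgetLimit": { "Amount": "$2", "Unit": "USD" },
  "CostFilters": {
    "TagKeyValue": ["user:aws-parallelcluster-project\$$1"]
  },
  "TimeUnit": "MONTHLY",
  "BudgetType": "COST"
}
EOF
}

write_notifications_json() {
    # $1=email address  $2=alert threshold, as a percentage of the budget
    cat <<EOF
[
  {
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": $2,
      "ThresholdType": "PERCENTAGE"
    },
    "Subscribers": [
      { "SubscriptionType": "EMAIL", "Address": "$1" }
    ]
  }
]
EOF
}

# Example:
#   write_budget_json ProjectA 100 > budget.json
#   write_notifications_json admin@example.com 80 > notifications-with-subscribers.json
```

The generated files can then be passed to the create-budget command shown above.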
You can limit which projects are assigned to a user by using the /opt/slurm/etc/projects_list.conf configuration file. This restricts each user to the projects allocated to them.
The ec2-user is assigned to ProjectA and ProjectB.
If the user specifies a different project, or omits the project on the submission line, the job is not allowed to run.
$ sbatch --comment ProjectC script.sh
You are not allowed to use the project ProjectC
$ sbatch script.sh
You need to specify a project. "--comment ProjectName"
Using this approach, you can mandate users of the cluster to assign every job to a specific project. This allows expense tracking for each user.
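The check described above might look like the following sketch. It assumes that each line of projects_list.conf has the form username=Project1,Project2, which is a guess at the file format and not something this post confirms. The helper name is_project_allowed is also hypothetical.

```shell
#!/bin/sh
# Sketch only: decide whether a user may submit to a project.
# ASSUMPTION: the projects list has lines like "ec2-user=ProjectA,ProjectB".
# is_project_allowed is a hypothetical helper name.

is_project_allowed() {
    # $1=user  $2=project  $3=path to the projects list
    # An empty project is never allowed.
    [ -n "$2" ] || return 1
    # Take the project list from the first line whose key equals the user.
    allowed="$(awk -F= -v u="$1" '$1 == u { print $2; exit }' "$3")"
    # Wrap both sides in commas so only whole project names match.
    case ",$allowed," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   if ! is_project_allowed "$USER" ProjectC /opt/slurm/etc/projects_list.conf; then
#       echo "You are not allowed to use the project ProjectC"
#   fi
```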
If the budget for a project reaches its limit, you can prevent the user from exceeding the allocated amount. To do so, set the budget variable to yes in the /opt/slurm/bin/sbatch script.
Ensure that the project name defined in the /opt/slurm/etc/projects_list.conf file matches the name of the budget defined in AWS Budgets.
$ sbatch --comment ProjectB script.sh
The Project ProjectB doens not have any associated budget. Please ask the administrator to create it.
$ sbatch --comment ProjectA script.sh
The Project ProjectA doens not have more budget allocated for this month.
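One way such a budget check could work is to read the budget limit and the actual spend with the aws budgets describe-budget command and compare the two. The sketch below illustrates that idea under stated assumptions and is not the logic of the provided sbatch wrapper. The function names fetch_budget and budget_exhausted, and the AWS_CLI override for dry runs, are hypothetical.

```shell
#!/bin/sh
# Sketch only: compare a project's actual spend with its budget limit.
# fetch_budget, budget_exhausted, and AWS_CLI are hypothetical names.

fetch_budget() {
    # $1=account ID  $2=budget name (same as the project name)
    # Prints the limit and the actual spend on one line.
    "${AWS_CLI:-aws}" budgets describe-budget \
        --account-id "$1" --budget-name "$2" \
        --query 'Budget.[BudgetLimit.Amount, CalculatedSpend.ActualSpend.Amount]' \
        --output text
}

budget_exhausted() {
    # $1=limit  $2=actual spend
    # Succeeds (exit status 0) when the spend has reached or passed the limit.
    # awk is used because the amounts are decimal strings.
    awk -v limit="$1" -v spent="$2" 'BEGIN { exit !(spent + 0 >= limit + 0) }'
}

# Example:
#   set -- $(fetch_budget 111122223333 ProjectA)
#   if budget_exhausted "$1" "$2"; then
#       echo "No budget left for ProjectA this month"
#   fi
```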
Often, it’s challenging to attribute specific costs to the relevant HPC project or user. This approach can help you apply cost optimization techniques and analyze the data to find savings.
You can now assign a project to each job and monitor it from AWS Cost Explorer. You can also ensure that users stay within a budget that you have allocated in AWS Budgets.