RStudio is an integrated development environment (IDE) for R, a language and environment for statistical computing and graphics. As a data scientist, you may integrate R and Spark (a big data processing framework) to analyze large datasets. You can use an R package called sparklyr to offload filtering and aggregation of large datasets from your R script to Spark and use R’s native strength to further analyze and visualize the results from Spark.

An R script running in RStudio uses sparklyr to submit Spark jobs to the cluster. Typically, an R script (along with sparklyr) runs in an RStudio environment that is installed on a machine that’s separate from the cluster of machines (in Amazon EMR) that runs Spark. To enable sparklyr to submit Spark jobs, you need to establish network connectivity between the RStudio machine and the cluster running Spark. One way to do that is to run RStudio on an edge node, which is a machine that is part of the cluster’s private network and runs client applications like RStudio. Edge nodes let you run client applications separately from the nodes that run the core Hadoop services. Edge nodes also offer convenient access to local Spark and Hive shells.

However, edge nodes are not easy to deploy. They must have the same versions of Hadoop, Spark, Java, and other tools as the Hadoop cluster, and require the same Hadoop configuration as nodes in the cluster.

This post demonstrates an automated way to use AWS Systems Manager to create an edge node with RStudio installed.

Deploying an edge node for an EMR cluster

One method to deploy an edge node involves creating an Amazon EC2 AMI directly from the EMR master node. For more information, see Launch an edge node for Amazon EMR to run RStudio. This post offers a Systems Manager (SSM) automation document that simplifies on-demand edge node deployment. Systems Manager gives you visibility and control of your AWS infrastructure, and Systems Manager Automation lets you safely automate common and repetitive tasks, like creating edge nodes on demand.

This post walks you through installing the SSM document and using it to create an edge node. For more information about the code, see the GitHub repo.

Creating the automation document

First, you use Terraform to create the automation document. You can download Terraform from the Terraform website. Alternatively, AWS CloudFormation works equally well.

After you install Terraform, go to the directory where you cloned the repo and edit the file vars.tf. For more information, see the GitHub repo. This file defines several input parameters, and the comments in the file should be self-explanatory. You can provide default values in vars.tf or override using one of the other supported techniques. For more information, see Input Variables on the Terraform website.
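One of those techniques is environment variables: Terraform reads a value for an input variable from an environment variable named TF_VAR_ followed by the variable name. The following is a minimal sketch that assumes vars.tf declares the variables region, bucket, environment, and ProjectTag (the names the plan references in this post). The values are placeholders, not defaults from the repo.

```shell
# Hypothetical values; replace them with your own before running Terraform.
# The part after TF_VAR_ must match the variable name exactly, including case.
export TF_VAR_region="us-east-1"
export TF_VAR_bucket="my-edge-node-artifacts"
export TF_VAR_environment="dev"
export TF_VAR_ProjectTag="rstudio-edge-node"

# List the overrides that Terraform will pick up from the environment.
env | grep '^TF_VAR_' | sort
```

Values set this way apply only to the current shell session, which keeps account-specific settings out of vars.tf.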

Next, enter the following code:

terraform init # one time only
terraform apply

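The code runs the Terraform plan to create the document. Your environment should already be configured to access your AWS account with privileges to create the resources that the plan defines: the SSM document, the playbook object in Amazon S3, and the IAM roles and policies.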

To make updates to the Terraform plan going forward, use Terraform’s shared state feature. For more information, see Remote State on the Terraform website.

The Terraform plan loads your automation document from a local template file and registers it with Systems Manager. See the following code:

# Load our document from a template and substitute some variables.
data "template_file" "ssm_doc_edge_node" {
  template = "${file("${path.module}/ssm_doc_edge_node.tpl")}"
  vars = {
    SSMRoleArn         = "${aws_iam_role.ssm_automation_role.arn}"
    InstanceProfileArn = "${aws_iam_instance_profile.edge_node_profile.arn}"
    PlaybookUrl        = "s3://${var.bucket}/init.yaml"
    Environment        = "${var.environment}"
    Project            = "${var.ProjectTag}"
    region             = "${var.region}"
  }
}

# Register the document content with SSM
resource "aws_ssm_document" "create_edge_node" {
  name            = "create_edge_node"
  document_type   = "Automation"
  document_format = "YAML"
  tags = {
    Name        = "create_edge_node"
    Project     = "${var.ProjectTag}"
    Environment = "${var.environment}"
  }
  content = "${data.template_file.ssm_doc_edge_node.rendered}"
}

The rest of the Terraform plan does the following:

  • Uploads an Ansible template for the SSM document to use
  • Sets up IAM roles and policies that let Systems Manager and a new edge node assume the correct privileges

What’s in the automation document?

The automation document has three main steps. First, it creates a new AMI from the existing EMR master node and launches an instance from that AMI. See the following code:

- name: create_ami
  action: aws:createImage
  maxAttempts: 1
  timeoutSeconds: 1200
  onFailure: Abort
  inputs:
    InstanceId: "{{MasterNodeId}}"
    ImageName: AMI Created on{{global:DATE_TIME}}
    NoReboot: true
- name: launch_ami
  action: aws:runInstances
  maxAttempts: 1
  timeoutSeconds: 1200
  onFailure: Abort
  inputs:
    ImageId: "{{create_ami.ImageId}}"
    ...

Next, it updates the SSM agent, installs Ansible with pip, and runs an Ansible playbook to install RStudio. You can examine the Ansible playbook in GitHub; it installs RStudio and dependencies and handles some initial configuration. See the following code:

- name: updateSSMAgent
  action: aws:runCommand
  inputs:
    DocumentName: AWS-UpdateSSMAgent
    InstanceIds:
      - "{{launch_ami.iid}}"
- name: installPip
  action: aws:runCommand
  inputs:
    DocumentName: AWS-RunShellScript
    InstanceIds:
      - "{{launch_ami.iid}}"
    Parameters:
      commands:
        - pip install ansible boto3 botocore
- name: runPlaybook
  action: aws:runCommand
  inputs:
    DocumentName: AWS-RunAnsiblePlaybook
    InstanceIds:
      - "{{launch_ami.iid}}"
    Parameters:
      playbookurl: "${PlaybookUrl}"

Finally, it adds an Amazon CloudWatch alarm to trigger EC2 instance recovery if the edge node fails. See the following code:

- name: add_recovery
  action: aws:executeAwsApi
  inputs:
    Service: cloudwatch
    Api: PutMetricAlarm
    AlarmName: "Recovery for edge node {{ launch_ami.iid }}"
    ActionsEnabled: true
    AlarmActions:
      - "arn:aws:automate:${region}:ec2:recover"
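For comparison, a similar recovery alarm can be created by hand with the AWS CLI. The following is a sketch rather than a copy of the automation step: the excerpt above does not show the metric settings, so the metric (StatusCheckFailed_System), statistic, period, evaluation periods, and threshold here are assumptions based on a typical EC2 recovery alarm. The function names are hypothetical.

```shell
# Build the ARN of the built-in EC2 recover action for a region.
recover_action_arn() {
  printf 'arn:aws:automate:%s:ec2:recover' "$1"
}

# Create a recovery alarm for one instance.
# Usage: add_recovery_alarm <instance-id> <region>
add_recovery_alarm() {
  instance_id="$1"
  region="$2"
  # Metric settings below are assumptions, not values from the SSM document.
  aws cloudwatch put-metric-alarm \
    --region "$region" \
    --alarm-name "Recovery for edge node $instance_id" \
    --namespace AWS/EC2 \
    --metric-name StatusCheckFailed_System \
    --dimensions "Name=InstanceId,Value=$instance_id" \
    --statistic Maximum \
    --period 60 \
    --evaluation-periods 2 \
    --threshold 0 \
    --comparison-operator GreaterThanThreshold \
    --actions-enabled \
    --alarm-actions "$(recover_action_arn "$region")"
}
```

Defining the alarm inside the automation document, as this post does, means every edge node gets the same recovery behavior without a separate manual step.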

Using the automation document

To start using the automation document, complete the following steps:

  1. On the Systems Manager console, choose Automation.
  2. Choose Execute automation.
  3. On the Owned by me tab, choose the document create_edge_node.
  4. Choose Next.

    On the next page, you need to fill in three pieces of information. You may want to get some advice from your cloud operations team, or whoever manages your EMR clusters. For instructions on creating a cluster with the latest EMR version and Spark, see Launch Your Sample Amazon EMR Cluster.
  5. In the Input parameters section, provide the following information:
    • For MasterNodeId, enter the EC2 instance ID of the master node of the EMR cluster you want to connect to. In most cases, your operations team can provide this information, but you can also find the instance ID by going to the Hardware tab of your EMR cluster and drilling into the master node group. Your EMR cluster must have Spark installed because you want to use sparklyr with RStudio. The following screenshot shows where to find your EC2 instance ID on the Hardware tab.
    • For SubnetId, enter the subnet that the edge node should live in. Your operations team should provide this information, or you can see it on the Summary tab of the EMR cluster. The edge node must live in the same VPC as the cluster. It does not need to be in a public subnet because you connect via Session Manager. The following screenshot shows where to find your subnet ID on the Summary tab of your cluster.
    • For QuickIdentifier, enter a user-friendly name to help you remember this edge node; for example, Edge Node with RStudio.
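If you prefer the AWS CLI to the console, the same inputs can be looked up and passed to the document from a shell. The following is a minimal sketch under a few assumptions: the cluster uses instance groups (so the cluster reports a single subnet), your credentials allow the EMR and Systems Manager calls shown, and the parameter values contain no double quotes. The function names are hypothetical; the parameter names and document name come from this post.

```shell
# Look up the EC2 instance ID of the running master node of an EMR cluster.
master_node_id() {
  aws emr list-instances \
    --cluster-id "$1" \
    --instance-group-types MASTER \
    --instance-states RUNNING \
    --query 'Instances[0].Ec2InstanceId' \
    --output text
}

# Look up the subnet that the EMR cluster was launched in.
cluster_subnet_id() {
  aws emr describe-cluster \
    --cluster-id "$1" \
    --query 'Cluster.Ec2InstanceAttributes.Ec2SubnetId' \
    --output text
}

# Build the JSON value for --parameters from the three document inputs.
automation_parameters() {
  printf '{"MasterNodeId":["%s"],"SubnetId":["%s"],"QuickIdentifier":["%s"]}' \
    "$1" "$2" "$3"
}

# Start the create_edge_node automation and print its execution ID.
# Usage: start_edge_node_automation <master-node-id> <subnet-id> <identifier>
start_edge_node_automation() {
  aws ssm start-automation-execution \
    --document-name create_edge_node \
    --parameters "$(automation_parameters "$1" "$2" "$3")" \
    --query 'AutomationExecutionId' \
    --output text
}

# Example usage (the cluster ID is a placeholder):
#   cluster_id="j-XXXXXXXXXXXXX"
#   start_edge_node_automation \
#     "$(master_node_id "$cluster_id")" \
#     "$(cluster_subnet_id "$cluster_id")" \
#     "Edge Node with RStudio"
```

Nothing runs against AWS until you call start_edge_node_automation, so you can check the looked-up values with your operations team first.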

When the execution is finished, you will see the completed steps, as in the following screenshot.

If you choose the last step in the list (step 8), you see the DNS name and EC2 instance ID for your new edge node. See the following screenshot.
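The same progress information is available from the AWS CLI. The sketch below polls an execution until it reaches a final status and then lists each step with its status. It assumes you have the execution ID (for example, from the console), and it treats only Success, Failed, Cancelled, and TimedOut as final, which is a simplification of the full set of automation statuses. The function names and the 30-second polling interval are hypothetical choices.

```shell
# Return success when an automation status is final (no further polling needed).
is_terminal_status() {
  case "$1" in
    Success|Failed|Cancelled|TimedOut) return 0 ;;
    *) return 1 ;;
  esac
}

# Poll an execution until it finishes, then list each step and its status.
# Usage: wait_for_automation <automation-execution-id>
wait_for_automation() {
  execution_id="$1"
  while :; do
    status=$(aws ssm get-automation-execution \
      --automation-execution-id "$execution_id" \
      --query 'AutomationExecution.AutomationExecutionStatus' \
      --output text) || return 1
    echo "Automation status: $status"
    if is_terminal_status "$status"; then
      break
    fi
    sleep 30
  done
  aws ssm describe-automation-step-executions \
    --automation-execution-id "$execution_id" \
    --query 'StepExecutions[].[StepName,StepStatus]' \
    --output table
  # The function succeeds only if the automation itself succeeded.
  [ "$status" = "Success" ]
}
```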

You can now connect to this edge node by using another feature of Systems Manager: Session Manager. Session Manager lets you open a tunnel for port forwarding without having to use SSH keys or expose the SSH port to the internet. For instructions on opening a port forwarding session, see Starting a Session (Port Forwarding). You need the Session Manager plugin installed locally. See the following code:

# Use the instance ID from the output of the automation document as the target
aws ssm start-session \
  --target instance-id \
  --document-name AWS-StartPortForwardingSession \
  --parameters '{"portNumber":["8787"], "localPortNumber":["8787"]}'

For more information, see Install the Session Manager Plugin for the AWS CLI.
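Before opening a port forwarding session, it can help to confirm that both the AWS CLI and the plugin are on your PATH. The following is a small sketch. The function names are hypothetical, and it assumes the plugin is installed under its standard executable name, session-manager-plugin.

```shell
# Return success if a command is available on the PATH.
have_command() {
  command -v "$1" >/dev/null 2>&1
}

# Check the local prerequisites for a Session Manager port forwarding session.
check_session_manager_prereqs() {
  for tool in aws session-manager-plugin; do
    if ! have_command "$tool"; then
      echo "Missing prerequisite: $tool" >&2
      return 1
    fi
  done
  echo "AWS CLI and Session Manager plugin found."
}
```

Running check_session_manager_prereqs first gives a clearer message than the error you would otherwise get when the session fails to start.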

You can now access RStudio at http://localhost:8787. See the following screenshot.

You can also access the node directly and use the local Hive and Spark shells through the Session Manager console.

This post sets up the SSM document to create single-user edge nodes. The default user name to log in to RStudio is ruser. You must set the password by changing the password for the ruser account directly in the operating system, because RStudio uses PAM authentication by default. For more information, see What is my username on my RStudio Server? To change the password, open another Session Manager session and enter the following code:

# On your local machine; get the instance ID from the output of the automation document
aws ssm start-session --target instance-id

# In the session on the edge node; enter a password of your choice and confirm it
sudo passwd ruser

You should keep any valuable files like R scripts in a GitHub repo and store any output data in an S3 bucket for long-term persistence.

Configuring security

The Terraform plan sets up three important IAM roles:

  • A role that Systems Manager assumes when running the automation document. This role needs to perform actions in Amazon EC2 (such as creating new AMIs), CloudWatch, and Systems Manager.
  • An EC2 instance profile for the edge nodes. The profile has the permissions necessary for the SSM agent to run and for the edge node to perform tasks typical of an EMR node.
  • A role for CloudWatch to perform instance recovery.

In your environment, you may want to review the IAM roles and policies and tighten their scope based on tags or other conditions.

Conclusion

This post described an automated way to deploy an EMR edge node with RStudio using an SSM document. EMR edge nodes with RStudio give you a familiar working environment with access to large datasets via Spark and sparklyr. For information about deploying a new edge node and installing the necessary Hadoop libraries with an AWS CloudFormation template, see Launch an edge node for Amazon EMR to run RStudio.

 


About the Author

Randy DeFauw is a principal solutions architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance on database projects, helping them improve the value of their solutions when using AWS.