Create and run EMR on EKS cluster

EMR on EKS is managed entirely from the command line; there is currently no UI for these operations. This article demonstrates how to create and run an EMR on EKS cluster from the command line. The process has two stages: first create an EKS cluster, then create an EMR virtual cluster on top of it. The specific steps follow.

Note: During the process we will obtain a number of values, such as the name of the EKS cluster and the ID of the virtual cluster, that are needed again in later steps. To keep the scripts reusable, we extract these values, assign them to variables, and export them so they can be referenced later. The following variables will be generated and referenced during the process, together with the values used in this example:

  • REGION: us-east-1. The current AWS Region.
  • ZONES: us-east-1a,us-east-1b,us-east-1c. The availability zones assigned to the EKS cluster to be created.
  • EKS_CLUSTER_NAME: it-infrastructure. The name of the EKS cluster to be created.
  • DATALAKE_NAMESPACE: datalake. The Kubernetes namespace for data systems to be created on EKS; the EMR on EKS virtual cluster will be placed in this namespace.
  • VIRTUAL_CLUSTER_NAME: emr-cluster-1. The name of the EMR on EKS virtual cluster to be created.
  • SSH_PUBLIC_KEY: <find it under EC2 -> Key Pairs>. The name of the public key (key pair) that must be specified when creating the EKS cluster.
  • EXECUTION_ROLE_ARN: <find it under the Admin role in IAM>. The ARN of the IAM role used to run EMR on EKS jobs.
  • VIRTUAL_CLUSTER_ID: <generated during the process>. The ID of the EMR on EKS virtual cluster to be created.

The following commands assign values to the above global variables (VIRTUAL_CLUSTER_ID will be generated in a later step and is not assigned yet):

export REGION="us-east-1"
export ZONES="us-east-1a,us-east-1b,us-east-1c"
export EKS_CLUSTER_NAME="it-infrastructure"
export DATALAKE_NAMESPACE="datalake"
export VIRTUAL_CLUSTER_NAME="emr-cluster-1"
export SSH_PUBLIC_KEY="<your-pub-key-name>"
export EXECUTION_ROLE_ARN="<your-admin-role-arn>"

0. Prerequisites

  • Make sure you have a Linux host with the AWS CLI (awscli) installed; a quick check is shown below
  • Make sure the access key configured for the AWS CLI belongs to an Admin account
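
As a quick, optional sanity check, the following commands confirm that the AWS CLI is installed and show which identity it is configured with; the account and role in the output should correspond to the Admin account mentioned above:

aws --version
aws sts get-caller-identity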

1. Install eksctl

eksctl is the command-line tool for operating EKS. We will rely on it throughout, so it must be installed first. The installation commands are as follows:

curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin
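
If the installation succeeded, the following command prints the installed version:

eksctl version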

2. Install kubectl

kubectl is the command-line tool for managing Kubernetes clusters. We will also rely on it, so it must be installed first. The installation commands are as follows:

curl -o kubectl https://amazon-eks.s3.us-west-2.amazonaws.com/1.20.4/2021-04-12/bin/linux/amd64/kubectl
chmod +x ./kubectl
mkdir -p $HOME/bin && cp ./kubectl $HOME/bin/kubectl && export PATH=$PATH:$HOME/bin
echo 'export PATH=$PATH:$HOME/bin' >> ~/.bashrc
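
If the installation succeeded, the following command prints the client version (no cluster connection is required):

kubectl version --client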

3. Create an EKS cluster

Next, we create an EKS cluster named it-infrastructure (the value of $EKS_CLUSTER_NAME) in us-east-1 with the following command:

eksctl create cluster \
    --region $REGION \
    --name $EKS_CLUSTER_NAME \
    --zones $ZONES \
    --node-type m5.xlarge \
    --nodes 5 \
    --with-oidc \
    --ssh-access \
    --ssh-public-key $SSH_PUBLIC_KEY \
    --managed

Note the following points about this command:

  • $SSH_PUBLIC_KEY is the name of your key pair on AWS; you can find it in the EC2 console under Key Pairs, in the Name column;
  • --zones is not required. If it is omitted, availability zones are chosen at random, but a randomly chosen AZ sometimes lacks the capacity to host the requested EKS cluster. In that case you need to specify the zones explicitly to avoid the unavailable one;
  • --node-type and --nodes are not required either. If they are omitted, the cluster is deployed on 2 m5.large nodes by default, which is too small for EMR, so be sure to set these two options explicitly to give the cluster enough resources;

The command takes a long time to finish (about 20 minutes). When it finally prints:

EKS cluster "ABC_IT_INFRASTRUCTURE" in "us-east-1" region is ready

the EKS cluster has been built. Note that while this command runs, a large amount of infrastructure is created through CloudFormation, including IAM roles, a VPC, EC2 instances, and so on. Errors can occur midway, and many operations are not rolled back automatically. You therefore need to keep an eye on the CloudFormation console; if you find an uncleaned stack, delete it manually before re-running the command above.
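
If you prefer the command line to the console, the following is a minimal sketch for spotting and removing a leftover stack. The stack name used below is an assumption (eksctl typically names the cluster stack eksctl-<cluster-name>-cluster); check the actual name in the list-stacks output first:

# list stacks that failed or were left behind
aws cloudformation list-stacks \
    --stack-status-filter CREATE_FAILED ROLLBACK_COMPLETE DELETE_FAILED

# delete the leftover stack before re-running eksctl create cluster
# (replace the name below with the one reported above)
aws cloudformation delete-stack --stack-name "eksctl-$EKS_CLUSTER_NAME-cluster"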

eksctl create cluster has many configurable options; you can view the full description with:

eksctl create cluster -h

4. View EKS cluster status

Once the EKS cluster is created, you can check its status from the command line to make sure it is healthy (this step is optional and can be skipped).

  • View the status of each node in the cluster:
kubectl get nodes -o wide
  • View the status of the pods in all namespaces:
kubectl get pods --all-namespaces -o wide

5. Create Namespace

To make resource management easier, we create a dedicated namespace for data-related systems on the Kubernetes cluster, named datalake (the value of $DATALAKE_NAMESPACE); the EMR virtual cluster created later will be placed under this namespace:

kubectl create namespace $DATALAKE_NAMESPACE

6. Authorize access to Namespace

By default, EMR on EKS is not authorized to access and use namespaces on EKS. We need to create a Kubernetes role, bind that role to a Kubernetes user, and map the service-linked role AWSServiceRoleForAmazonEMRContainers to that user. This bridges the authorization between the Kubernetes side and the EMR on EKS side. Fortunately, we do not have to perform these operations one by one; a single eksctl command does it all:

eksctl create iamidentitymapping \
    --region $REGION \
    --cluster $EKS_CLUSTER_NAME \
    --namespace $DATALAKE_NAMESPACE \
    --service-name "emr-containers"

The console output confirms this:

2021-06-02 12:39:49 [ℹ]  created "datalake:Role.rbac.authorization.k8s.io/emr-containers"
2021-06-02 12:39:49 [ℹ]  created "datalake:RoleBinding.rbac.authorization.k8s.io/emr-containers"
2021-06-02 12:39:49 [ℹ]  adding identity "arn:aws:iam::1234567898765:role/AWSServiceRoleForAmazonEMRContainers" to auth ConfigMap
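
If you want to double-check the result, the mapping that was just added can be listed with:

eksctl get iamidentitymapping --cluster $EKS_CLUSTER_NAME --region $REGION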

7. Create Job Execution Role

Running jobs on EMR on EKS requires an IAM role. In this role you configure the resources that EMR on EKS is authorized to use, such as specific S3 buckets, CloudWatch, and other services. These are the role policies; the official documentation provides a reference configuration at: https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/creating-job-execution-role.html.

For convenience, this article uses the Admin role directly as the job execution role (if you prefer a dedicated role, a sketch is given below).
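
The following is a minimal sketch of creating a dedicated job execution role instead of reusing Admin; the role name, policy name, and bucket below are placeholders for illustration, and the official document linked above remains the authoritative reference for the policy content:

# 1. create the role; the initial trust policy here is only a placeholder,
#    step 8 (update-role-trust-policy) adds the OIDC trust that EMR on EKS actually uses
aws iam create-role \
    --role-name EMRContainers-JobExecutionRole \
    --assume-role-policy-document '{
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Principal": { "Service": "elasticmapreduce.amazonaws.com" },
        "Action": "sts:AssumeRole"
      }]
    }'

# 2. attach an inline policy granting S3 and CloudWatch Logs access
#    (replace your-log-bucket with a real bucket)
aws iam put-role-policy \
    --role-name EMRContainers-JobExecutionRole \
    --policy-name EMRContainers-JobExecutionPolicy \
    --policy-document '{
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
          "Resource": ["arn:aws:s3:::your-log-bucket", "arn:aws:s3:::your-log-bucket/*"]
        },
        {
          "Effect": "Allow",
          "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents", "logs:DescribeLogGroups", "logs:DescribeLogStreams"],
          "Resource": ["arn:aws:logs:*:*:*"]
        }
      ]
    }'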

8. Create Trust Relationship for Role

If you created a dedicated role in step 7, you also need to edit the role and add a trust relationship between it and the EMR managed service account. The EMR managed service account is created automatically when a job is submitted, so the service-account part of the configuration uses a wildcard.

Fortunately, we do not need to edit the role's Trust Relationships section by hand; the trust relationship can be added with the following command:

aws emr-containers update-role-trust-policy \
   --cluster-name $EKS_CLUSTER_NAME \
   --namespace $DATALAKE_NAMESPACE \
   --role-name <Admin or the-job-execution-role-name-you-created>

Replace <Admin or the-job-execution-role-name-you-created> with Admin or the name of the role you created in step 7. When the update succeeds, you will see a generated statement similar to the following on the role's Trust Relationships page:

{
  "Effect": "Allow",
  "Principal": {
    "Federated": "arn:aws:iam::1234567898765:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/1C2DF227CD8E011A693BCF03D7EBD581"
  },
  "Action": "sts:AssumeRoleWithWebIdentity",
  "Condition": {
    "StringLike": {
      "oidc.eks.us-east-1.amazonaws.com/id/1C2DF227CD8E011A693BCF03D7EBD581:sub": "system:serviceaccount:kube-system:emr-containers-sa-*-*-1234567898765-3l0vgne6"
    }
  }
}

Even if you chose to use the Admin role as the job execution role in step 7, this step must still be executed, with --role-name set to Admin; otherwise the job will not be authorized to create a log group or write logs to S3 during execution.
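
To confirm that the statement was actually added, you can print the role's current trust policy (optional):

aws iam get-role \
    --role-name <Admin or the-job-execution-role-name-you-created> \
    --query 'Role.AssumeRolePolicyDocument'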

9. Create EMR virtual cluster on EKS

Next, we create the EMR cluster. Strictly speaking, this step is better described as "registration": after it runs, no EMR cluster actually exists on EKS yet. What is created here is a virtual cluster; the real cluster resources are provisioned when the first job is submitted. The commands are as follows:

# create virtual cluster description file
tee $VIRTUAL_CLUSTER_NAME.json <<EOF
{
  "name": "$VIRTUAL_CLUSTER_NAME",
  "containerProvider": {
    "type": "EKS",
    "id": "$EKS_CLUSTER_NAME",
    "info": {
      "eksInfo": {
        "namespace": "$DATALAKE_NAMESPACE"
      }
    }
  }
}
EOF

# create virtual cluster
aws emr-containers create-virtual-cluster --cli-input-json file://./$VIRTUAL_CLUSTER_NAME.json

The commands above first create a cluster description file, $VIRTUAL_CLUSTER_NAME.json, which specifies the name of the EMR cluster and the EKS cluster and namespace on which it is to be built, and then pass the file to aws emr-containers create-virtual-cluster to create the virtual cluster it describes.

If the command succeeds, a JSON document describing the cluster is printed on the console. The id field is the important one; it is needed when submitting jobs later. If you did not save it, you can query it at any time with:

aws emr-containers list-virtual-clusters

Assign the obtained id to the global variable VIRTUAL_CLUSTER_ID; subsequent operations will refer to this id multiple times:

export VIRTUAL_CLUSTER_ID='<cluster-id>'
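
If you would rather not copy the id by hand, a small sketch using the CLI's --query filter can extract it directly; this assumes the cluster name is unique among clusters in the RUNNING state:

export VIRTUAL_CLUSTER_ID=$(aws emr-containers list-virtual-clusters \
    --query "virtualClusters[?name=='$VIRTUAL_CLUSTER_NAME' && state=='RUNNING'].id" \
    --output text)
echo $VIRTUAL_CLUSTER_ID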

10. Submit jobs to EMR on EKS

Once the virtual cluster is registered, you can submit big data jobs. EMR on EKS is container-based; unlike classic EMR, you do not normally operate it by logging in over SSH (possible, but inconvenient). The normal usage is to treat it as a black-box computing resource and simply submit jobs to it. The following example submits a job to EMR on EKS that runs pi.py, the example program that ships with Spark:

aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name sample-job-name \
--execution-role-arn $EXECUTION_ROLE_ARN \
--release-label emr-6.2.0-latest \
--job-driver '{"sparkSubmitJobDriver": {"entryPoint": "local:///usr/lib/spark/examples/src/main/python/pi.py","sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"}}' \
--configuration-overrides "{\"monitoringConfiguration\": {\"cloudWatchMonitoringConfiguration\": {\"logGroupName\": \"/emr-on-eks/$VIRTUAL_CLUSTER_NAME\", \"logStreamNamePrefix\": \"pi\"}}}"

The most important parameter of start-job-run is --job-driver; all information about the job itself goes into this parameter. According to the documentation, EMR on EKS currently supports only one job submission form, sparkSubmitJobDriver, i.e. jobs can only be submitted in a form acceptable to spark-submit: a jar package plus a class, or a PySpark script. The jar and its dependent jars can be stored on S3.
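
For reference, a jar-based submission differs only in the --job-driver argument; a minimal sketch (the bucket, jar, class, and argument names below are placeholders):

--job-driver '{
  "sparkSubmitJobDriver": {
    "entryPoint": "s3://your-bucket/jars/your-app.jar",
    "entryPointArguments": ["arg1", "arg2"],
    "sparkSubmitParameters": "--class com.example.YourMainClass --jars s3://your-bucket/jars/your-dependency.jar --conf spark.executor.instances=2 --conf spark.executor.memory=2G"
  }
}'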

A more elegant way to submit a job is to provide a job-run description file in JSON, configure all the cluster, job, and configuration-override information in that file, and then execute it with a single command, as shown below:

# create job description file
tee start-job-run-request.json <<EOF
{
  "name": "sample-job-name", 
  "virtualClusterId": "$VIRTUAL_CLUSTER_ID",  
  "executionRoleArn": "$EXECUTION_ROLE_ARN", 
  "releaseLabel": "emr-6.2.0-latest", 
  "jobDriver": {
    "sparkSubmitJobDriver": {
      "entryPoint": "local:///usr/lib/spark/examples/src/main/python/pi.py",
      "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
    }
  }, 
  "configurationOverrides": {
    "applicationConfiguration": [
      {
        "classification": "spark-defaults", 
        "properties": {
          "spark.driver.memory":"2G"
         }
      }
    ], 
    "monitoringConfiguration": {
      "persistentAppUI": "ENABLED", 
      "cloudWatchMonitoringConfiguration": {
    "logGroupName": "/emr-on-eks/$VIRTUAL_CLUSTER_NAME", 
        "logStreamNamePrefix": "pi"
      }, 
      "s3MonitoringConfiguration": {
        "logUri": "s3://glc-emr-on-eks-logs/"
      }
    }
  }
}
EOF
# start job
aws emr-containers start-job-run --cli-input-json file://./start-job-run-request.json

For details on how to write the JSON file, refer to: https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/emr-eks-jobs-CLI.html#emr-eks-jobs-submit
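
After submission, the state of the job can be tracked from the command line (optional):

# list recent job runs on the virtual cluster
aws emr-containers list-job-runs --virtual-cluster-id $VIRTUAL_CLUSTER_ID

# describe a single job run (state, failure reason, etc.)
aws emr-containers describe-job-run --id <job-run-id> --virtual-cluster-id $VIRTUAL_CLUSTER_ID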

Finally, a word on EMR cluster configuration. As with a regular EMR cluster, configuration is submitted through the JSON file, in the applicationConfiguration section, such as the "classification": "spark-defaults" part of the configuration above. Since EMR on EKS currently only supports Spark, only the following classifications can be configured:

  • core-site: Change values in Hadoop's core-site.xml file.
  • emrfs-site: Change EMRFS settings.
  • spark-metrics: Change values in Spark's metrics.properties file.
  • spark-defaults: Change values in Spark's spark-defaults.conf file.
  • spark-env: Change values in the Spark environment.
  • spark-hive-site: Change values in Spark's hive-site.xml file.
  • spark-log4j: Change values in Spark's log4j.properties file.
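
As an illustration, the sketch below adds a spark-log4j entry next to spark-defaults in the applicationConfiguration array; the property keys follow standard Spark/log4j conventions and should be adjusted to your needs:

"applicationConfiguration": [
  {
    "classification": "spark-defaults",
    "properties": {
      "spark.driver.memory": "2G",
      "spark.sql.shuffle.partitions": "64"
    }
  },
  {
    "classification": "spark-log4j",
    "properties": {
      "log4j.rootCategory": "WARN, console"
    }
  }
]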

11. Delete and clean up

Deleting and cleaning up the cluster should follow the reverse order of creation: first delete the EMR virtual cluster, then delete the EKS cluster:

# 1. list all jobs
aws emr-containers list-job-runs --virtual-cluster-id $VIRTUAL_CLUSTER_ID

# 2. cancel running jobs
aws emr-containers cancel-job-run --id <job-run-id> --virtual-cluster-id $VIRTUAL_CLUSTER_ID

# 3. delete virtual cluster
aws emr-containers delete-virtual-cluster --id $VIRTUAL_CLUSTER_ID

# 4. delete eks cluster
eksctl delete cluster --region $REGION --name $EKS_CLUSTER_NAME

Note: In step 4, before the EKS cluster can be deleted successfully, you must locate the NodeInstanceRole resource in the corresponding CloudFormation template and manually detach all policies attached to that role.
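
A minimal command-line sketch of that manual clean-up; the role name below is a placeholder (the actual NodeInstanceRole name is generated by CloudFormation and can be found in the stack's Resources tab), and the policy ARN shown is just one example of what may be attached:

NODE_INSTANCE_ROLE="<your-node-instance-role-name>"

# list the managed policies currently attached to the role
aws iam list-attached-role-policies --role-name $NODE_INSTANCE_ROLE

# detach each policy reported above, then re-run eksctl delete cluster
aws iam detach-role-policy --role-name $NODE_INSTANCE_ROLE \
    --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy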

12. Common mistakes

The EKS cluster created by eksctl create cluster defaults to two m5.large nodes. That configuration can hardly support an EMR cluster, so be sure to specify the number and type of nodes.

If you encounter an error similar to the following when creating an EKS cluster in step 3:

AWS::EKS::Cluster/ControlPlane: CREATE_FAILED – "Cannot create cluster 'my-bigdata-infra-cluster' because us-east-1e, the targeted availability zone, does not currently have sufficient capacity to support the cluster. Retry and choose from these availability zones: us-east-1a, us-east-1b, us-east-1c, us-east-1d, us-east-1f (Service: AmazonEKS; Status Code: 400; Error Code: UnsupportedAvailabilityZoneException; Request ID: 61028748-0cc1-4100-9152-aab79a475fe6; Proxy: null)"

it means that an automatically assigned or explicitly specified AZ is currently unavailable; replace it with one of the other AZs in the --zones parameter.

About the author: Architect with 15 years of experience in IT system development and architecture, with rich hands-on experience in big data, enterprise application architecture, SaaS, distributed storage, and domain-driven design, and a keen interest in functional programming. Has a deep and broad understanding of the Hadoop/Spark ecosystem, has participated in developing a commercial Hadoop distribution, and has led teams that built several complete enterprise data platforms. Personal technical blog: https://laurence.blog.csdn.net/. The author wrote the book "Big Data Platform Architecture and Prototype Implementation: Data Middle Platform Construction in Practice", available on JD.com and Dangdang.
