Spark in Action on Kubernetes: Playground Construction and Architecture Analysis

Keywords: Java Spark Kubernetes git Big Data

Preface

Spark is a very popular big data processing engine. Data scientists use Spark and its surrounding big data ecosystem to carry out data analysis and mining across a wide range of scenarios, and Spark has gradually become a de facto industry standard in data processing. However, Spark was originally designed around static resource management. Although Spark also supports dynamic resource managers such as YARN, these resource managers were not designed for dynamic cloud infrastructure and fall short in areas such as speed, cost, and efficiency. With the rapid development of Kubernetes, data scientists began to consider whether Spark could be combined with the elasticity and cloud-native character of Kubernetes. Spark 2.3 added native Kubernetes support to its resource managers, and in this series we will show how to use Spark for data analysis on clusters in a more Kubernetes-native way. The series does not require rich experience with Spark; the Spark features it uses are explained along the way as the series goes deeper.

Build Playground

Many developers are discouraged by the complexity of the installation process when they first encounter Hadoop. To lower the learning threshold, this series simplifies installation by using spark-on-k8s-operator as the playground. As its name implies, spark-on-k8s-operator is an operator designed to simplify running Spark on Kubernetes. Developers who are unfamiliar with the operator pattern may want to read up on it first, because knowing what an operator does makes it much easier to grasp the key points of spark-on-k8s-operator.

Before explaining the internals, we first set up the environment and run a simple demo end to end.

1. Install spark-on-k8s-operator

The official documentation installs the operator through a Helm chart. Because many developers' environments cannot reach Google's repositories, we install it here from plain YAML manifests.

## Download the repo and enter it (the manifest paths below are relative to the repo root)
git clone git@github.com:AliyunContainerService/spark-on-k8s-operator.git
cd spark-on-k8s-operator

## Install the CRDs
kubectl apply -f manifest/spark-operator-crds.yaml
## Install the service account and authorization policy for the operator
kubectl apply -f manifest/spark-operator-rbac.yaml
## Install the service account and authorization policy for Spark tasks
kubectl apply -f manifest/spark-rbac.yaml
## Install spark-on-k8s-operator
kubectl apply -f manifest/spark-operator.yaml

Verify installation results

Among the stateless applications (Deployments) in the spark-operator namespace, you should see a running Spark operator, which means the component has been installed successfully. Next, we run a demo application to verify that it works properly.
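The same check can be made from the command line. The following is a minimal sketch, assuming the operator runs as a Deployment in the spark-operator namespace mentioned above and that the CRD is registered as sparkapplications.sparkoperator.k8s.io (the plural and group that appear in the resource's self link later in this article). The helper name check_operator is hypothetical and not part of the project.

```shell
# check_operator is a hypothetical helper name, not part of spark-on-k8s-operator.
# It lists the SparkApplication CRD plus the Deployments and Pods in the
# operator namespace (default: spark-operator, as named in the text above).
check_operator() {
  local ns="${1:-spark-operator}"
  kubectl get crd sparkapplications.sparkoperator.k8s.io
  kubectl get deployments -n "$ns"
  kubectl get pods -n "$ns"
}

# Usage (requires a kubectl context that points at the cluster):
#   check_operator
```

If the operator Pod is listed as Running and the CRD is found, the installation steps above have taken effect.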

2. Demo validation

When learning Spark, the first task most of us ran was the Pi estimation example from the official documentation. Today we run it again, this time the Kubernetes way.

## Submit the spark-pi task
kubectl apply -f examples/spark-pi.yaml

Once the task has been submitted successfully, its status can be observed from the command line.

## Query the task
kubectl describe sparkapplication spark-pi

## Task result
Name:         spark-pi
Namespace:    default
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"sparkoperator.k8s.io/v1alpha1","kind":"SparkApplication","metadata":{"annotations":{},"name":"spark-pi","namespace":"defaul...
API Version:  sparkoperator.k8s.io/v1alpha1
Kind:         SparkApplication
Metadata:
  Creation Timestamp:  2019-01-20T10:47:08Z
  Generation:          1
  Resource Version:    4923532
  Self Link:           /apis/sparkoperator.k8s.io/v1alpha1/namespaces/default/sparkapplications/spark-pi
  UID:                 bbe7445c-1ca0-11e9-9ad4-062fd7c19a7b
Spec:
  Deps:
  Driver:
    Core Limit:  200m
    Cores:       0.1
    Labels:
      Version:        2.4.0
    Memory:           512m
    Service Account:  spark
    Volume Mounts:
      Mount Path:  /tmp
      Name:        test-volume
  Executor:
    Cores:      1
    Instances:  1
    Labels:
      Version:  2.4.0
    Memory:     512m
    Volume Mounts:
      Mount Path:         /tmp
      Name:               test-volume
  Image:                  gcr.io/spark-operator/spark:v2.4.0
  Image Pull Policy:      Always
  Main Application File:  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
  Main Class:             org.apache.spark.examples.SparkPi
  Mode:                   cluster
  Restart Policy:
    Type:  Never
  Type:    Scala
  Volumes:
    Host Path:
      Path:  /tmp
      Type:  Directory
    Name:    test-volume
Status:
  Application State:
    Error Message:
    State:          COMPLETED
  Driver Info:
    Pod Name:             spark-pi-driver
    Web UI Port:          31182
    Web UI Service Name:  spark-pi-ui-svc
  Execution Attempts:     1
  Executor State:
    Spark - Pi - 1547981232122 - Exec - 1:  COMPLETED
  Last Submission Attempt Time:             2019-01-20T10:47:14Z
  Spark Application Id:                     spark-application-1547981285779
  Submission Attempts:                      1
  Termination Time:                         2019-01-20T10:48:56Z
Events:
  Type    Reason                     Age                 From            Message
  ----    ------                     ----                ----            -------
  Normal  SparkApplicationAdded      55m                 spark-operator  SparkApplication spark-pi was added, Enqueuing it for submission
  Normal  SparkApplicationSubmitted  55m                 spark-operator  SparkApplication spark-pi was submitted successfully
  Normal  SparkDriverPending         55m (x2 over 55m)   spark-operator  Driver spark-pi-driver is pending
  Normal  SparkExecutorPending       54m (x3 over 54m)   spark-operator  Executor spark-pi-1547981232122-exec-1 is pending
  Normal  SparkExecutorRunning       53m (x4 over 54m)   spark-operator  Executor spark-pi-1547981232122-exec-1 is running
  Normal  SparkDriverRunning         53m (x12 over 55m)  spark-operator  Driver spark-pi-driver is running
  Normal  SparkExecutorCompleted     53m (x2 over 53m)   spark-operator  Executor spark-pi-1547981232122-exec-1 completed
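
The full description is verbose when only the state matters. The sketch below reads just the application state and polls until the task finishes. It assumes the status layout shown in the output above (Status > Application State > State) maps to the JSONPath .status.applicationState.state, and it assumes FAILED is the other terminal state besides COMPLETED. The helper names app_state and wait_for_app are hypothetical.

```shell
# app_state and wait_for_app are hypothetical helper names.
# Print only the application state (for example COMPLETED) of a SparkApplication.
# The JSONPath is an assumption derived from the describe output above.
app_state() {
  kubectl get sparkapplication "$1" -n "${2:-default}" \
    -o jsonpath='{.status.applicationState.state}'
}

# Poll every 5 seconds until the application reaches a terminal state,
# giving up after a number of attempts (default: 60).
wait_for_app() {
  local tries="${2:-60}" state
  while [ "$tries" -gt 0 ]; do
    state="$(app_state "$1")"
    case "$state" in
      COMPLETED|FAILED) echo "$state"; return 0 ;;  # FAILED is an assumed state name
    esac
    tries=$((tries - 1))
    sleep 5
  done
  echo "timed out; last state: ${state:-unknown}" >&2
  return 1
}

# Usage:
#   wait_for_app spark-pi
```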

At this point the task has executed successfully. The log of the driver Pod shows the final result: Pi is roughly 3.1470557352786765. We have now run our first job on Kubernetes. Next, let's look in detail at what these steps actually did.
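
To pull the result line out of the driver log without reading the whole log, something like the following can be used. It is a small sketch that assumes the driver Pod is named spark-pi-driver, as reported under Driver Info in the output above, and that the result line contains the text "Pi is roughly". The helper name driver_result is hypothetical.

```shell
# driver_result is a hypothetical helper name.
# Print the line of the driver Pod's log that carries the Pi estimate.
driver_result() {
  kubectl logs "$1" -n "${2:-default}" | grep 'Pi is roughly'
}

# Usage, with the Pod name taken from the describe output:
#   driver_result spark-pi-driver
```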

Analysis of Spark Operator's Architecture

The accompanying diagram is a flow chart of Spark Operator. The first step above installed the Spark Operator, shown in blue at the center of the diagram, into the cluster. Spark Operator itself is both a CRD controller and a Mutating Admission Webhook controller. When we apply the spark-pi template, it is stored as a custom resource of kind SparkApplication. Spark Operator watches the API server, parses the SparkApplication object, turns it into a spark-submit command, and submits it. The Driver Pod simply runs an image that packages the Spark jars. If the task runs in local mode, it is executed directly in the Driver Pod; if it runs in cluster mode, the Driver Pod creates Executor Pods to execute it. When the task is over, the run log can be viewed through the Driver Pod. In addition, Spark Operator dynamically attaches a Spark UI to the Driver Pod during task execution, so developers can check task status through that UI page.
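
To make the conversion step more concrete, the sketch below prints a spark-submit command that is roughly equivalent to the spark-pi spec shown earlier. It is an approximation for illustration only: the command the operator actually generates carries more options, and the API server URL is a placeholder that you must supply. The helper name print_spark_submit is hypothetical; the option names are standard Spark-on-Kubernetes settings, and the values are copied from the describe output.

```shell
# print_spark_submit is a hypothetical helper name.
# Print (do not run) a spark-submit command that approximates the spark-pi
# SparkApplication. $1 is the Kubernetes API server URL (a placeholder).
print_spark_submit() {
  local apiserver="$1"
  cat <<EOF
spark-submit \\
  --master k8s://${apiserver} \\
  --deploy-mode cluster \\
  --name spark-pi \\
  --class org.apache.spark.examples.SparkPi \\
  --conf spark.executor.instances=1 \\
  --conf spark.executor.memory=512m \\
  --conf spark.driver.memory=512m \\
  --conf spark.kubernetes.container.image=gcr.io/spark-operator/spark:v2.4.0 \\
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \\
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
EOF
}

# Usage (the URL is an example placeholder):
#   print_spark_submit https://my-apiserver:6443
```

Reading the two side by side shows the mapping: Mode becomes --deploy-mode, Main Class becomes --class, Image and Service Account become spark.kubernetes.* settings, and Main Application File is the final argument.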

Summary

In this article, we discussed the original design intention of Spark Operator, how to quickly build a Spark Operator playground, and the basic architecture and workflow of Spark Operator. In the next article, we will go deeper into Spark Operator and explain how it is implemented internally and how it can integrate more seamlessly with Spark.

Posted by stallingjohn on Tue, 22 Jan 2019 08:18:13 -0800
