If your class has been authorized to use clusters, you will see an option in the Workspace tab of the part configuration to select "Spark Cluster".

You need to configure the following options -

  • Instance - select one of the options for the number of CPUs and amount of memory
  • Max data size - this is the total data storage available; it is divided equally among the worker nodes
  • Workers - this is the number of worker nodes to create for this cluster
  • Concurrent jobs - this is the number of jobs that Vocareum should send to the cluster concurrently. All jobs need a wall-clock time limit, and the clock starts running when the job is sent to the cluster. The platform can also use this number to load balance between clusters or to start a new cluster if needed.

Clusters will be created with Spark and Hadoop (HDFS and YARN) installed on them.

Initialization - After the machines are brought up, if you specify a file cluster/init.sh in the ../resource/scripts directory for the part, it will be run after the Spark/Hadoop daemons are started. A typical use of this script is to download data from your S3 bucket to initialize the cluster. The script has access to the following environment variables -

  • Spark master IP and port ($VOC_SPARK_MASTER)
  • Hadoop namenode and port ($VOC_HADOOP_NAMENODE)
  • Spark/Hadoop installation dirs ($SPARK_HOME, $HADOOP_HOME)
  • Spark/Hadoop configuration dirs ($SPARK_CONF_DIR, $HADOOP_CONF_DIR)

The output of this script is stored in the file cluster_init.out, which is saved in the instructor's work directory.
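
As an illustration of the initialization step described above, here is a minimal sketch of what a cluster/init.sh could look like. It is not an official template. It assumes that the AWS CLI is installed on the cluster and has credentials for your bucket, and that the hdfs command lives under $HADOOP_HOME/bin; neither is confirmed here. The S3_URI and HDFS_DIR settings are hypothetical names introduced for this example.

```shell
#!/usr/bin/env bash
# cluster/init.sh - illustrative sketch only, not an official template.
set -eu

# Hypothetical settings: replace with your own bucket and HDFS target.
S3_URI="${S3_URI:-}"            # e.g. s3://my-course-bucket/datasets (made-up name)
HDFS_DIR="${HDFS_DIR:-/data}"

# Print the cluster settings; anything echoed here ends up in cluster_init.out.
describe_cluster() {
  echo "Spark master:    ${VOC_SPARK_MASTER:-unset}"
  echo "Hadoop namenode: ${VOC_HADOOP_NAMENODE:-unset}"
  echo "SPARK_HOME:      ${SPARK_HOME:-unset}"
  echo "HADOOP_HOME:     ${HADOOP_HOME:-unset}"
}

# Copy the S3 data to local scratch space, then load it into HDFS.
# Assumes the AWS CLI is available and has access to the bucket.
load_data() {
  tmp_dir="$(mktemp -d)"
  aws s3 cp --recursive "$S3_URI" "$tmp_dir"
  "$HADOOP_HOME/bin/hdfs" dfs -mkdir -p "$HDFS_DIR"
  "$HADOOP_HOME/bin/hdfs" dfs -put -f "$tmp_dir"/* "$HDFS_DIR"
  rm -rf "$tmp_dir"
}

describe_cluster
if [ -n "$S3_URI" ]; then
  load_data
else
  echo "S3_URI is not set; skipping data download"
fi
```

Echoing the variables first makes cluster_init.out a quick place to check that the script saw the expected cluster settings before any download was attempted.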

Status & Control - There is a "Cluster" drop-down menu in the teacher-IDE that displays the current cluster state. It also has controls for Start/Stop/Re-Init/Terminate. Re-Init is useful when the init.sh script has been changed and the changes need to be tested. In this scenario, start the cluster if it is not already running, and then select Re-Init. This will reformat the namenode, restart all the Hadoop/Spark daemons, and then run your init.sh script.

Submission/Grading - When the student submits their work, your grading scripts can use the above environment variables. The cluster is normally in "Stop" mode and will automatically start (or launch) when a grading job is submitted (in the Student-View, via Autograde-On-Submit or when doing batch Auto-grading). Also, it will be "stopped" automatically after 10 minutes of idle time.
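
To illustrate how a grading script might use these environment variables, here is a hedged sketch that submits a student's Spark job to the cluster under a wall-clock limit. Several things in it are assumptions rather than documented behavior: that $VOC_SPARK_MASTER and $VOC_HADOOP_NAMENODE hold plain host:port values (so the script adds the spark:// and hdfs:// schemes when they are missing), that the submission is a file named student_job.py, and that a 600-second limit is appropriate. STUDENT_JOB, TIME_LIMIT, master_url and hdfs_url are hypothetical names introduced for this example.

```shell
#!/usr/bin/env bash
# Grading-script sketch; file names and the time limit are hypothetical.
set -eu

STUDENT_JOB="${STUDENT_JOB:-student_job.py}"   # hypothetical submission file
TIME_LIMIT="${TIME_LIMIT:-600}"                # seconds; hypothetical limit

# Prefix host:port with spark:// unless a scheme is already present.
master_url() {
  case "$1" in
    *://*) echo "$1" ;;
    *)     echo "spark://$1" ;;
  esac
}

# Build an HDFS URL from the namenode address ($1) and a path ($2).
hdfs_url() {
  case "$1" in
    *://*) echo "$1$2" ;;
    *)     echo "hdfs://$1$2" ;;
  esac
}

if [ -f "$STUDENT_JOB" ]; then
  # GNU timeout exits with status 124 if the job exceeds the limit.
  if timeout "$TIME_LIMIT" "$SPARK_HOME/bin/spark-submit" \
      --master "$(master_url "$VOC_SPARK_MASTER")" \
      "$STUDENT_JOB" "$(hdfs_url "$VOC_HADOOP_NAMENODE" /data)"; then
    echo "job finished"
  else
    echo "job failed or timed out (exit status $?)"
  fi
else
  echo "no submission found at $STUDENT_JOB"
fi
```

The timeout wrapper only guards the grading script's side; it is separate from whatever wall-clock limit the platform itself enforces on the job.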
