[How-To] Run Spark jobs with custom packages (Conda)

Conda environments can be packaged and shipped alongside Apache Spark jobs when deploying on Apache YARN. By bundling your environment for use with Spark, you can make use of all the libraries provided by conda and ensure that they are consistently available on every node. If you are interested, the conda-pack documentation provides further details and useful examples.

Let’s see how to prepare your conda environment with your custom packages.

Summary

  1. Requirements
  2. Create a conda environment
  3. Install packages in conda environment
  4. Activate the environment
  5. Prepare the environment for Spark
  6. Launch Spark job on the cluster
  7. Where can I find new packages?

1. Requirements

This tutorial requires SSH access to the Access Gateway bigdatalab.polito.it.

Refer to this page if you need credentials or instructions on how to access the Big Data cluster.
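
For example, assuming your cluster username is “your_username” (replace it with your own credentials), the connection looks like this:

ssh your_username@bigdatalab.polito.it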

2. Create a conda environment

conda create -y -n conda_example python=3.7

The example creates a new conda environment called “conda_example” (the name is arbitrary) running Python 3.7.

You can specify whichever Python version you prefer. If you like, you can also create the environment and install some packages in a single command:

conda create -y -n conda_example python=3.7 numpy pandas scikit-learn

The example above creates a new conda environment and additionally installs the numpy, pandas and scikit-learn packages.
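
If you want to double-check that the environment has been created, you can list all environments known to conda:

conda env list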

3. Install packages in conda environment

conda install -c conda-forge package_name another_package_name

You can specify one or more packages at the same time.

To install packages inside your conda environment you can also use “pip”, but it is recommended to try conda first.
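
As a sketch, assuming a hypothetical package called “some_package” is not available through conda, you can fall back to pip from inside the activated environment (see step 4):

pip install some_package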

4. Activate the environment

conda activate conda_example
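
To verify that the environment is active, you can check which Python interpreter is in use; the reported path should point inside your conda environments directory (the exact location depends on your conda installation):

which python
python --version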

5. Prepare the environment for Spark

conda pack -o environment.tar.gz

The command creates an archive in the working directory containing the files (e.g. libraries) of your currently activated conda environment.
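
Note that this step relies on the conda-pack tool. If the conda pack command is not available in your environment, it can be installed from conda-forge:

conda install -c conda-forge conda-pack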

6. Launch Spark job on the cluster

PYSPARK_DRIVER_PYTHON=$(which python) \
PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
 --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
 --master yarn \
 --deploy-mode client \
 --archives environment.tar.gz#environment \
 script.py

“environment.tar.gz” is the archive you obtained from the previous step, while “script.py” is your PySpark script. The “#environment” suffix tells YARN to unpack the archive on each node under the alias “environment”, which is why PYSPARK_PYTHON points to ./environment/bin/python.
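
As a variant, and following the pattern documented by conda-pack (adapt it to your setup), the same job can also be submitted in cluster deploy mode, so that the driver runs on YARN and uses the bundled Python as well:

PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
 --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
 --master yarn \
 --deploy-mode cluster \
 --archives environment.tar.gz#environment \
 script.py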

7. Where can I find new packages?

Anaconda Cloud hosts the conda-forge repository, together with a list of all available packages.
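
You can also search for a specific package directly from the command line; “package_name” below is just a placeholder:

conda search -c conda-forge package_name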