In-code comments are not always sufficient if you want to maintain good documentation for your code. Sometimes you would like to add equations, images, rich text formatting and more. Of course, you can create a “wiki” page for your project, but what would really be cool is if you could embed code inside it and execute it on demand to see the results, seamlessly.
Well, if you like that idea, you should definitely try using a notebook. This post will guide you through installing the open-source Jupyter notebook to the point where you can execute distributed Spark code from within it.
Formerly known as IPython, the Jupyter project now supports running Python, Scala, R and more (around 40 languages, via kernels). You can run your code without leaving the notebook, embed widgets (LaTeX equations, for example), write formatted text, generate charts dynamically, and more.
Although not out of the box, it supports running Spark code on a cluster, which makes it a really powerful tool for Spark practitioners as well, and the installation takes only a few simple steps. Give it a go:
Who is this good for?
- Presentations / Demos – as you can add text in markup language, plot images, and run your code live.
- Documenting snippets you’d normally run in the spark-shell REPL.
- Testing, visualization, ad hoc querying and research.
How to install on AWS
Installing the Jupyter notebook exposes a web service that you access from your browser: you write code in the browser, hit CTRL+Enter, and your snippets are executed on your cluster without leaving the notebook. However, the original Jupyter installation ships only with a Python kernel out of the box, so the installation here takes two steps:
Installing Jupyter
To have Jupyter running on your AWS cluster (EMR 4.x.x versions) add the following bootstrap action:
s3://elasticmapreduce.bootstrapactions/ipython-notebook/install-ipython-notebook
On previous EMR versions (AMI 3.x or earlier): download the bootstrap script from JustGiving’s GitHub, save it on S3, and add it as a bootstrap action in your cluster.
Installing Spark Kernel for Jupyter
Jupyter’s Spark Kernel is now part of the Apache Toree incubator project (contributed by IBM). It is a seamless binding that runs your notebook snippets on your Spark cluster. To install it, execute the following on the master node only (no need to run it on all nodes; the paths below assume EMR 4.x.x and differ on previous versions):
sudo pip install --pre toree
sudo /usr/local/bin/jupyter toree install --spark_home=/usr/lib/spark
sudo pip install "ipython[notebook]"
sudo mkdir /mnt/sbt
cd /mnt/sbt
sudo curl https://bintray.com/sbt/rpm/rpm | sudo tee /etc/yum.repos.d/bintray-sbt-rpm.repo
sudo yum install -y sbt git
sudo mkdir /mnt/toree
cd /mnt/toree
sudo git clone -b master https://github.com/apache/incubator-toree
cd /mnt/toree/incubator-toree
sudo make dist
export SPARK_HOME=/usr/lib/spark
Ta-da! The Jupyter service is already running!
UPDATE: on newer versions of Jupyter, a user must have access to the service’s token. To get it, SSH to the master node where Jupyter is running and call:
jupyter notebook list
which will output something similar to:
Currently running servers:
http://localhost:8192/?token=abc…
Copy the token parameter value, open your local web browser, browse to your master node’s address on port 8192, and enter the token on the login screen. You’re in!
Note: if you haven’t already done so, you’ll have to tunnel your SSH connection to the master node (Windows guide, Mac/Linux guide). Until you do, you’ll see error 404 (page not found).
Import Spark packages or your own jars
You might want to include Spark packages such as Databricks’ CSV reader or MLlib. To do so, you have to edit the Jupyter Spark Kernel’s configuration files. By default, Jupyter kernels are installed under /usr/local/share/jupyter/kernels. Navigate to this path and you should find a directory named apache_toree_scala, which contains the kernel.json file, looking similar to this:
{
  "language": "scala",
  "display_name": "Apache Toree - Scala",
  "env": {
    "__TOREE_SPARK_OPTS__": "",
    "SPARK_OPTS": "--packages org.apache.spark:spark-mllib_2.10:1.6.1,com.databricks:spark-csv_2.10:1.4.0 --jars /path/to/your.jar",
    "SPARK_HOME": "/usr/lib/spark",
    "__TOREE_OPTS__": "",
    "DEFAULT_INTERPRETER": "Scala",
    "PYTHONPATH": "/usr/lib/spark/python:/usr/lib/spark/python/lib/py4j-0.9-src.zip",
    "PYTHON_EXEC": "python"
  },
  "argv": ["/usr/local/share/jupyter/kernels/apache_toree_scala/bin/run.sh",
    "--profile",
    "{connection_file}"]
}
except that the default file is missing the "--packages" entry in the SPARK_OPTS line above. Add it, along with any other package you’d like Spark to import whenever you use Jupyter. Likewise, if you want to import a jar (to use its classes/objects) inside your notebook, the "--jars" part of that line is there just for that.
Voilà! And don’t forget to restart the Spark kernel inside your notebook to apply the changes to the configuration files.
UPDATE: the kernel.json above refers to the Spark versions installed on older EMR releases. To conform with the newer emr-5.x.x releases, make sure the PYTHONPATH configuration points to a valid path for the py4j zip file (as of now, emr-5.2.0 uses /usr/lib/spark/python/lib/py4j-0.10.3-src.zip), and that you import the right release of any Spark package your code depends on (as of now, emr-5.2.0 ships Spark 2.0.2 built with Scala 2.11, so org.apache.spark:spark-mllib_2.11:2.0.2 should replace the artifact above).
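For example, on emr-5.2.0 the two relevant env entries in kernel.json would look roughly like this (a sketch based only on the versions quoted above; adjust any other packages, such as spark-csv, to their Scala 2.11 builds as well, and verify the exact paths on your own cluster):
"SPARK_OPTS": "--packages org.apache.spark:spark-mllib_2.11:2.0.2 --jars /path/to/your.jar",
"PYTHONPATH": "/usr/lib/spark/python:/usr/lib/spark/python/lib/py4j-0.10.3-src.zip",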
Some extremely useful interfaces
Apart from embedding basic Scala/Spark code and evaluating it live, the Spark Kernel offers additional useful functionality through kernel magics:
SparkSQL
It’s been a while since the DataFrame API was released to keep the RDD API company. The DataFrame API is actually part of the Spark SQL sub-project, and it aims to expose Spark’s functionality at a higher level of abstraction, through an easier and more natural interface (at least to whoever is familiar with the popular SQL query language).
As a Spark practitioner, you can access DataFrames from Scala/Java, Python (via PySpark), or R (via SparkR). Alternatively, you can register a DataFrame as a temp table and query it with vanilla SQL:
val data = … // Load a new DataFrame
data.registerTempTable("some_alias")
sqlContext.sql("SELECT COUNT(*) FROM some_alias")
To me, embedding plain SQL queries between lines of an object-oriented language looks like messy code. But if you use the Spark Kernel within Jupyter, you can easily turn some cells into SQL-only cells via the Spark Kernel’s %%SQL magic, so your code will actually contain both Scala and SQL and still be pretty.
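For instance, an SQL-only cell over the some_alias table registered above could look like the sketch below (it assumes the %%SQL magic shares the same SQL context as your Scala cells, which is what Toree’s magics are designed to do):
%%SQL
SELECT COUNT(*) AS total_rows FROM some_alias
The result shows up right below the cell, with no sqlContext.sql(...) wrapping in sight.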
HTML and Chart Plotting
Using another magic, %%HTML, you can also embed HTML code in your notebook, which means you can add images, tables, and everything else that can be represented as HTML. And it is as simple as writing an HTML-only cell.
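For example, a cell like the following (a sketch; the heading text and image URL are placeholders) renders as formatted HTML right below it:
%%HTML
<h2>Daily pipeline report</h2>
<p>All jobs finished <b>successfully</b>.</p>
<img src="https://example.com/pipeline-diagram.png" width="400"/>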
Embedding dynamic HTML, however (that is, HTML generated from the results computed in the rest of your code), is less straightforward to achieve. First, evaluate the following snippet in your Spark notebook:
import org.apache.toree.magic.{CellMagicOutput, CellMagic}
import org.apache.toree.kernel.protocol.v5.{Data, MIMEType}

def display_html(html: String) = Left(CellMagicOutput(MIMEType.TextHtml -> html))
Now all you have to do is generate some HTML dynamically in your code and call the following (as the last expression of the cell below which you would like the HTML to be rendered):
display_html(html_code)
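For example, here is a minimal sketch that reuses the some_alias temp table registered in the SparkSQL section above and renders its row count as HTML:
// compute something with Spark, then hand an HTML snippet to display_html
// (must be the last expression of the cell for the kernel to render it)
val rowCount = sqlContext.sql("SELECT COUNT(*) FROM some_alias").first().getLong(0)
display_html(s"<h3>some_alias currently holds <b>$rowCount</b> rows</h3>")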
So, for example, if you use JFreeChart and you’d really like to embed charts in your notebooks, you could use this snippet:
import org.jfree.chart.encoders.EncoderUtil
import org.apache.commons.codec.binary.Base64

// Render the chart to an in-memory PNG, base64-encode it, and display it as an inline <img> tag
def display_jfree(chart: org.jfree.chart.JFreeChart, resolution: (Int, Int)) = {
  val imgBuffer = chart.createBufferedImage(resolution._1, resolution._2, java.awt.image.BufferedImage.TYPE_INT_ARGB, null)
  val imgHTML = Base64.encodeBase64String(EncoderUtil.encode(imgBuffer, "png"))
  display_html(s"""<img width="${resolution._1}" height="${resolution._2}" src="data:image/png;base64,$imgHTML" />""")
}
Now just instantiate a new chart and call:
display_jfree(chart, (600, 400))
to get a 600 x 400 plot of your chart embedded in the notebook, just like that:
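If you have never instantiated a JFreeChart before, here is a minimal sketch (it assumes the JFreeChart and commons-codec jars are already on the kernel’s classpath, e.g. via the --jars / --packages mechanism described above, and the data points are made up for illustration):
import org.jfree.chart.ChartFactory
import org.jfree.chart.plot.PlotOrientation
import org.jfree.data.xy.{XYSeries, XYSeriesCollection}

// build a tiny in-memory dataset
val series = new XYSeries("events per hour")
Seq((1.0, 3.0), (2.0, 7.0), (3.0, 5.0), (4.0, 9.0)).foreach { case (x, y) => series.add(x, y) }
val dataset = new XYSeriesCollection(series)

// create a simple line chart and hand it to the display_jfree helper defined above
val chart = ChartFactory.createXYLineChart(
  "My first notebook chart", "hour", "events", dataset,
  PlotOrientation.VERTICAL, true, false, false)
display_jfree(chart, (600, 400))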
Other magics include dynamic JAR loading, and a lot of other useful stuff. Be sure to check out Toree Magics to make your notebook look even better!
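For instance, the %AddJar and %AddDeps line magics let you pull dependencies into a running notebook without touching kernel.json (a sketch; the jar URL and the artifact coordinates are placeholders):
%AddJar http://some.repo/path/to/your-library.jar
%AddDeps your.organization your-artifact 1.0.0 --transitive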
Liked this post? Subscribe to stay up to date with the newest tools and coolest data science algorithms! Want to choose what I will write about next? Vote!