An Alternative Introduction to Altair Plotting
Common Options and Best Practices
Scope
This article is intended for readers who have some familiarity with at least one plotting library; within the Python ecosystem, that is likely to be matplotlib. When picking up a new plotting library, most people want to know, beyond the basic syntax, how to enable the common options and apply basic best practice: setting titles, defining user-friendly tooltips and separating transformations from plotting elements.
Rather than introducing a new dataset, the article is based on Altair's Example Gallery, but with these common options and best practices applied. It shows both the original and the modified plots to produce a skim-friendly reference.
Simple Bar Chart
The comparison shows the original Chart (on the left) and the modified Chart (on the right). The plot is the simplest example; however, it lacks some basic best practice, such as a plot title and axis titles. Colour coding of the bars was added to create greater contrast.
The following code is directly from the Altair Developers:
The data in the source DataFrame looks like the following:
The new modified chart was produced using the following:
Three significant changes:
- The axes were given more meaningful titles using the alt.X and alt.Y classes.
- Tooltips were added, again with custom titles, using the alt.Tooltip class.
- A title for the whole plot was added to reflect best practice, using the properties method of the Chart.
Simple Heatmap
The comparison between the original and the modified version is provided below:
The code from the Altair Developers generates the source data using a simple relationship between x, y and z:
A preview of the source data is presented below:
The modified Chart was produced with the following code:
Relative to the previous plot, the color channel has been customised using the alt.Color class. The rest of the Chart reflects best practice: the axes, legend and plot are given titles using a similar pattern: alt.X('x:O', title='X Coordinate'). The plot size was also changed; note the use of Integers instead of Strings.
Simple Histogram
The comparison between the original and the new chart hides significant under-the-hood changes. Altair performs a transformation using a single line (bin=True) to create a histogram, whereas manipulating the data with pandas requires much more effort:
The original code contains a single keyword argument that transforms the data with bin=True:
The data is from IMDb and includes a number of film facts:
The histogram is of the IMDB_Rating column, which is made easy by altair using the bin=True option. In contrast, manually binning the data and presenting it as a histogram takes more than a single line!
This example helps illustrate how a big data set containing, say, 10 million rows, which altair would typically struggle to plot directly, could easily be reduced using spark, dask, vaex, etc. to a pandas Series with only 10 rows. The following steps have been adapted from Issue #1691 from the altair team.
The data within the url is stored as a DataFrame (df) using the pd.read_json() function. The first step is to create the bin categories (or edges) and then divide the data accordingly:
The output of the check confirms the categorisation is correct:
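The binning step and its check might look like the sketch below. It uses pd.cut with unit-wide bins on a handful of made-up ratings instead of the real file; the column name rating_bin and the choice of left-closed intervals are assumptions for the example.

```python
import pandas as pd

# Illustrative ratings; the article reads the real data with pd.read_json().
df = pd.DataFrame({"IMDB_Rating": [1.5, 3.2, 6.1, 6.8, 7.4, 9.9, None]})

# Eleven edges give ten unit-wide bins: [0, 1), [1, 2), ..., [9, 10).
bin_edges = list(range(0, 11))
df["rating_bin"] = pd.cut(df["IMDB_Rating"], bins=bin_edges, right=False)

# Check: every non-missing rating falls inside its assigned interval.
check = df.dropna(subset=["IMDB_Rating"])
in_bin = [row.IMDB_Rating in row.rating_bin for row in check.itertuples()]
print(all(in_bin))
```

With right=False a rating of exactly 10 would fall outside the last bin, so the edges would need adjusting if the data contained one.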
The next steps are:
- aggregate the data using groupby
- create label columns for both the minimum and maximum rating per category
- remove the index as it is no longer needed
From the whole DataFrame, only the IMDB_Rating column is required. It is aggregated (groupby) using the binning categories, counted, then converted to a DataFrame and renamed. The range of each bin is then extracted using list indexing into separate columns and the existing index removed. The data is now of this format:
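The aggregation steps above can be sketched as follows, again on made-up ratings. The column names count, bin_min and bin_max are assumptions chosen to match the names used later in the article, and the edges are read from the interval attributes, which is one of several ways to extract them.

```python
import pandas as pd

# Illustrative ratings standing in for the IMDB_Rating column.
df = pd.DataFrame({"IMDB_Rating": [1.5, 3.2, 6.1, 6.8, 7.4, 9.9]})
bins = pd.cut(df["IMDB_Rating"], bins=list(range(0, 11)), right=False)

# Count ratings per bin; observed=False keeps empty bins with a count of 0.
binned = (
    df["IMDB_Rating"]
    .groupby(bins, observed=False)
    .count()
    .to_frame(name="count")
)

# Pull the lower and upper edge of each interval into label columns.
binned["bin_min"] = [interval.left for interval in binned.index]
binned["bin_max"] = [interval.right for interval in binned.index]

# The interval index is no longer needed once the edges are columns.
binned = binned.reset_index(drop=True)
print(binned)
```

The result has one row per bin, which is the only data altair needs to draw the histogram.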
A pre-binned example is not found in the Gallery and the approach is not explicitly documented, but the following code is the minimum needed to create a Chart with pre-binned data:
Note the keyword argument bin='binned' is required, which enables the upper limit of the bin range to be defined using the x2 (or y2) encoding channel:
The tooltip contains both the upper and lower range for each bin; this could be improved using Vega Expressions and altair's transform_calculate method:
The code shows a number of altair features at once. The data is transformed to create a new field called rating_range: the two label columns are converted to strings using Vega expressions and then combined using JavaScript syntax to create a label of the format: 6–7. This is displayed as a tooltip. The Chart is also assigned a slightly larger size for better comparison.
However, the plot does not reflect best practice as illustrated by the side-by-side comparison below:
The code of the modified chart (on the right) is presented below:
The number of altair lines has doubled for the benefit of using pre-binned data. The transform_calculate method used to generate a new field has already been covered. To remove the second axis title, it must explicitly be set to None, i.e. alt.X2('bin_max', title=None). Tooltips are added with well-formatted labels using the title parameter, as is a plot title.
Simple Line Chart
The comparison between the original and new plots is shown below with some minor changes to reflect best practice:
The code used to generate the original plot is a very simple example of a sine plot:
The data is of the format:
The modifications to the original plot are minor: more explicit titles for the axes and a title for the overall plot:
Simple Scatter Plot with Tooltips
The static comparison below between the original and the new scatter plot hides some changes to the Tooltips as well as the application of best practice:
The original code uses the cars dataset from Vega and shows a modified size for the circles:
The data is of the format:
The modified plot contains a number of best practices in terms of making it easier to read and understand:
A plot title has been added, along with an explicit title for the y channel, given that it reflects fuel efficiency. The Tooltip now contains an abbreviation for the Miles_per_Gallon encoding to obstruct the Chart less when viewing interactively.
Simple Stacked Area Chart
A comparison between the original and the modified:
The original code uses the Iowa, USA energy generation data set and the area mark:
The data is in the long form:
The modified code reflects the best practice discussed previously using explicit titles for all channels:
Simple Strip Plot
A comparison between the original and the modified:
The data is again based on the cars dataset from Vega; note the explicit use of encoding data types:
- Q: Quantitative
- O: Ordinal
The data is of the format:
The modified Chart adds the color channel for added contrast:
The plot reflects best practice with the addition of the color channel as a Nominal data type.
Summary
A first set of plots from the Altair example gallery has been reproduced with best practice applied wherever possible, and with transformations separated from the plotting components. The modified plots have illustrated the following additional features:
- Adding a plot title
- Explicitly labelling axes with alt.X(<column>, title='Meaningful Axis')
- Explicitly labelling other channels and Tooltips
- Generating new supplementary data using transform_calculate for display purposes
- Binning data to generate aggregations for subsequent plotting with altair
Together, these examples form an alternative introduction to Altair, showcasing common options and best practices that are missing from the Example Gallery in the Altair user guide.
Attribution
All gists, notebooks and terminal casts are by the author. All of the artwork is based on assets that are explicitly CC0, Public Domain or SIL OFL licensed and is therefore non-infringing. The theme is inspired by and based on my favourite vim theme: Gruvbox.