Visualizing Programming Language Popularity Data with D3
I will start with the result, a visualization of the number of questions tagged with java or javascript on StackOverflow.com. The number of questions is aggregated for each month. An overall time period of about three years is covered.
Java vs JavaScript
Click on the grayish framed links to toggle the opacity of the corresponding values.
Interpretation
Interpretation of data is hardly objective. As they say: there are lies, damn lies, and statistics. This is what I read out of the diagram: JavaScript gained popularity compared to Java over the last three years. This is probably no surprise for most of us. I was somewhat surprised by the fact that JavaScript roughly achieves the same number of questions per month as Java does.
More to come
You can stop reading here if you are purely interested in the result. I plan to publish more comparisons of programming languages and frameworks based on the same[1] data in the future.
The following sections focus on the technical details of producing such a diagram.
Scaling Values
The scale for the absolute counts is shown on the left axis. These unmodified values are great for seeing the evolution and assessing the overall relevance of a language. However, the relative behaviour of two languages can be hard to see from absolute values alone.
The normalized values always add up to one for each month, i.e. the normalized value for the $i$-th language is $v_{n,i} = c_i / \sum_j c_j$, where $c_i$ is the frequency count of language $i$ in that month. Normalized values take out the absolute growth and show only the relative comparison. However, if there are only two subjects under consideration, as in our case, the two curves always mirror each other about the center line, which can look strange.
The relative evolution of two languages is most easily discovered when one of them serves as the reference, i.e. $v_i = c_i / c_k$ for a fixed $k$. We use java as the reference, and hence its own curve is a constant 1.
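The two scalings can be sketched in a few lines. The following is only an illustration in plain JavaScript (not the CoffeeScript used elsewhere in this post, so that it runs under Node.js without a compile step); the function names normalize and relativeTo and the sample counts are assumptions made up for this example, not taken from the actual script.

```javascript
// Hypothetical counts for a single month; the numbers are made up.
const counts = { java: 400, javascript: 100 };

// Normalized value: v_{n,i} = c_i / sum_j c_j (the values add up to one).
function normalize(counts) {
  const total = Object.values(counts).reduce((sum, c) => sum + c, 0);
  const result = {};
  for (const [name, c] of Object.entries(counts)) {
    result[name] = c / total;
  }
  return result;
}

// Relative value: v_i = c_i / c_k for a fixed reference language k.
function relativeTo(counts, reference) {
  const base = counts[reference];
  const result = {};
  for (const [name, c] of Object.entries(counts)) {
    result[name] = c / base;
  }
  return result;
}

console.log(normalize(counts));          // { java: 0.8, javascript: 0.2 }
console.log(relativeTo(counts, "java")); // { java: 1, javascript: 0.25 }
```

Note how the reference language always maps to 1, which is why its curve is a flat line in the relative view.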
The D3 (Data Driven Documents) Library
The Data Driven Documents library, or D3.js for short, was created by Mike Bostock. He also created the Protovis library, which is in many ways a predecessor[2] to D3.
What D3 is and what D3 is not
D3 is primarily a framework for connecting data with elements of an HTML5 document. The library provides various methods to create further elements and to set their attributes.
D3 is often used for visualizations. While this isn't exactly the main goal of D3 (there are almost no high level graphical helpers), it is a very convenient tool for creating visualizations.
D3 is not a statistics or even a general analytics toolkit. It is not a replacement for R or Mathematica. A few years ago I would have used R for the task at hand. R has strong capabilities for processing and analyzing data. However, it is not trivial to produce good visualizations with it that integrate easily with HTML5 documents. In very simple cases, D3 can act as a replacement for R and the like. In less trivial cases, one would probably stick with R, or use R for the analytic part and then D3 for the presentation work.
General Concepts
D3 is a document-data binding library. From the perspective of the programming interface, it is very similar to jQuery. The following expression, for example, selects all circle elements with the class java inside the svg element with the id vis:
circles = d3.select("svg#vis").selectAll("circle.java")
We can use the same class and id selectors that we are used to from jQuery. D3 goes beyond the concepts of jQuery in its data-binding capabilities. The .data and .enter methods associate each object in the array of data with exactly one circle returned by the .selectAll matcher. If no circles exist, they will be created on the fly[3], e.g. by:
circles.data(data).enter()
All the methods I am going to discuss from now on are chainable. Each of them returns an object that is linked to the same prototype[4] as the object on which it has been invoked. This is also very similar to how jQuery behaves with most of its methods. However, there is an important conceptual difference: jQuery getters, such as .attr called with only a name, read from the first element of a collection, whereas in D3 setting a value with a method like .attr is built around invoking it for each element of the collection together with the datum bound to that element. To this end, D3 accepts higher-order functions that compute a value for each element from its datum. The following code[5] shows an example that sets the x-coordinate for each of the circles:
circles
  .data(data).enter()
  .append("circle")
  .attr("cx", (d) -> time_scale(d.time))
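The snippet relies on a time_scale function that is not shown in this post. As a rough idea of what such a scale does, here is a hand-rolled stand-in in plain JavaScript, assuming a simple linear mapping from dates to pixels. The helper name linearScale, the date range, and the pixel range are assumptions for illustration only; in practice one would more likely use the scale helpers that ship with D3.

```javascript
// Build a function that maps values from [d0, d1] linearly onto [r0, r1].
function linearScale([d0, d1], [r0, r1]) {
  return (value) => r0 + ((value - d0) / (d1 - d0)) * (r1 - r0);
}

// Hypothetical domain (dates) and range (pixels), chosen for illustration.
const start = new Date(2009, 0, 1);
const end = new Date(2011, 0, 1);
const time_scale = linearScale([start, end], [0, 600]);

// Date objects are coerced to millisecond timestamps by the subtraction.
console.log(time_scale(start)); // 0
console.log(time_scale(end));   // 600
```

Passing (d) -> time_scale(d.time) to .attr then turns the time stored in each datum into an x-coordinate for its circle.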
Drawing Lines
Connecting the circles with lines isn't as smooth an experience as drawing the circles. The recommended way to connect them is the SVG path primitive. While path can be used with the .data and .enter methods, it doesn't connect the points according to the given data. Instead, the data can, and should, be passed as an argument to the line generator returned by the .line method:
line = d3.svg.line()
  .x((d) -> time_scale(d.time))
  .y((d) -> value_scale(value(d, field)))
svg.append("svg:path")
  .attr("class", "#{classname(field)} #{value_method}")
  .attr("d", line(data))
Finally, the SVG path primitive is a multipurpose geometric shape. Its styling should be set to {fill: none} to draw just the lines.
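To make the role of the d attribute more concrete, the following plain JavaScript sketch builds the kind of path description that a line generator hands to the path element: a move-to command for the first point, followed by line-to commands for the rest. This is a simplified illustration, not D3's implementation; the function name linePath and the sample points are assumptions.

```javascript
// Turn an array of data points into an SVG path description
// of the form "M x,y L x,y ...", using accessor functions x and y.
function linePath(data, x, y) {
  return data
    .map((d, i) => (i === 0 ? "M" : "L") + x(d) + "," + y(d))
    .join("");
}

// Hypothetical points with trivial accessors, for illustration only.
const points = [{ t: 0, v: 10 }, { t: 5, v: 20 }, { t: 10, v: 5 }];
const d = linePath(points, (p) => p.t, (p) => p.v);
console.log(d); // M0,10L5,20L10,5
```

Because the whole series ends up in a single attribute of a single element, the data is passed to the generator in one go rather than bound element by element.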
Extracting the Data
StackOverflow.com publishes the underlying data roughly every three months in the form of an XML dump. Before describing the extraction of the data, I have to explain what else I use it for.
I built a (j)Ruby on Rails application that mimics the StackOverflow site. This allows me to query the whole dataset offline from my laptop computer. I normalized the schema to some extent, so that I can pose more complex queries[6] than the original StackOverflow site offers. In particular, there exists a join table between the questions and the tags.
This schema turned out to be quite handy for querying the data efficiently and in such a way that it can easily be brought into a format to be consumed with D3. The query also takes advantage of PostgreSQL's date_trunc function[7] to aggregate the counts for each month. An example of a query used for the visualization could thus look like the following:
SELECT tags.name, count(questions.id), date_trunc('month',creation_date) as month
FROM questions
INNER JOIN questions_tags ON questions.id = questions_tags.question_id
INNER JOIN tags ON tags.id = questions_tags.tag_id
WHERE tags.name = 'javascript'
GROUP BY month, tags.name
ORDER BY month ;
I chose Node.js and the node-postgres client to fetch, aggregate and output the data in JSON. The script is included with this post[8].
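As a rough sketch of the reshaping step, the plain JavaScript below groups flat result rows into one series per tag and serializes them as JSON. It is not the script that accompanies this post. The row shape (name, count, month) simply mirrors the columns of the example query, the sample rows are made up, and the assumption that counts may arrive as strings is mine; the database call itself is left out so that the example runs without a database.

```javascript
// Group flat query rows into one array of { month, count } per tag name.
function toSeries(rows) {
  const series = new Map();
  for (const row of rows) {
    if (!series.has(row.name)) series.set(row.name, []);
    series.get(row.name).push({
      month: row.month,
      count: Number(row.count), // assumption: counts may arrive as strings
    });
  }
  return Object.fromEntries(series);
}

// Hypothetical rows in the shape of the example query's result.
const rows = [
  { name: "java", count: "3", month: "2011-01-01" },
  { name: "javascript", count: "2", month: "2011-01-01" },
  { name: "java", count: "5", month: "2011-02-01" },
];

console.log(JSON.stringify(toSeries(rows)));
```

Keeping one array per language means each series can be handed directly to the drawing code for circles and lines.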
Choosing Colors
Choosing adequate colors in any diagram is a mundane, yet annoying task. CSS3 adds the HSLA color model. It simplifies the task of dividing the colorspace into any number of distinct colors immensely by using a single dimension for the hue value. We could, for example, write something like the following to set the ith of n colors:
"hsla(#{i*360/n},100%,25%,0.5)"
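The same expression can be wrapped in a small function. The following plain JavaScript version is only a sketch; the function name color and the choice of three colors are assumptions made for this example.

```javascript
// Return the i-th of n evenly spaced hues as a CSS hsla() color string.
function color(i, n) {
  return `hsla(${(i * 360) / n},100%,25%,0.5)`;
}

// Three colors, spaced 120 degrees apart on the hue circle.
const palette = [0, 1, 2].map((i) => color(i, 3));
console.log(palette.join(" "));
// hsla(0,100%,25%,0.5) hsla(120,100%,25%,0.5) hsla(240,100%,25%,0.5)
```

Only the hue varies between the colors, so saturation, lightness and opacity stay consistent across all series.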
Remarks
StackOverflow.com publishes the dataset under a Creative Commons license; see their initial blog post from 2009. The derived dataset used for the visualization is, therefore, subject to the same license.
[1] The corresponding dataset is based on a release from December 2011. There should be a new release relatively soon, and I might wait for that one.
[2] It is, in my opinion, often a good sign if a library has been rewritten from scratch. There are always some design decisions that turn out not to have been such a great idea the first time around.
[3] "On the fly" is only an abstract description of what is actually happening.
[4] JavaScript employs prototypal inheritance, which is distinctly different from the more conventional class-based inheritance.
[5] I use the more concise CoffeeScript notation, not pure JavaScript.
[6] PostgreSQL has a built-in (full) text search engine. It is thus possible to combine all imaginable traditional SQL queries with those using the text indexes.
[7] See the official documentation, Date/Time Functions and Operators.
[8] You might find that the SQL query looks very different from the one given above. However, it is equivalent.