Have you ever heard of confidence intervals? Probably. Have you ever made them? Probably not. If you're like me, you studied confidence intervals in Stats 101, you're convinced of their importance, but when it comes to plotting data they have always seemed like too much trouble.

Getting into R, the popular language for computing statistics, I thought there would be some built-in function to make it easy -- maybe plot.with.confidence. No such luck. But I discovered it is easy in R *if* you learn how to use the language to its best advantage.

Below is my solution for binomial data. It comes from an analysis of clicks and impressions over time. The code could easily be adapted for other distributions, like a normal distribution.

First, here is a plotting function to draw the statistic of interest plus lines above and below for confidence.

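plot.conf = function(x, y, upper, lower, ...) {
    # Main line; cap the y-axis at mean + 3 sd so wide intervals don't flatten the plot
    plot(x, y, type='l', lwd=2, ylim=c(min(lower), mean(y) + 3*sd(y)), ...)
    # Thin lines for the upper and lower confidence bounds
    lines(x, upper, type='l', lty=1, lwd=.5)
    lines(x, lower, type='l', lty=1, lwd=.5)
}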

Notice the "..." for variable keyword arguments (similar to Python's **kwargs). Also notice that the upper y-limit is based on the y values rather than on the upper confidence bound. Confidence intervals can become very wide and scale a graph beyond comprehension.

Now the part that calculates the upper and lower confidence lines. The way to get a binomial confidence interval in R is to perform a binomial test. This returns an object which has a confidence interval attached, a common pattern in R.
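
To make the "object with a confidence interval attached" pattern concrete, here is a minimal sketch. The counts (7 successes in 20 trials) and the simulated sample are made up for illustration, and t.test is shown only as one example of another test that follows the same pattern.

```r
# binom.test() returns an "htest" object; the interval lives in $conf.int.
# 7 successes out of 20 trials is made-up data for illustration.
b = binom.test(7, 20, conf.level=0.95)
class(b)                   # "htest"
ci = b$conf.int            # numeric vector: lower bound, then upper bound
attr(ci, "conf.level")     # 0.95

# t.test() follows the same pattern for a mean, assuming roughly normal data.
set.seed(1)
x = rnorm(30, mean=10, sd=2)
t.test(x)$conf.int
```

Swapping in a different test like this is probably the simplest route to adapting the approach for other distributions.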

We can loop implicitly over all our data points because R is a functional language.

CI = mapply(
    function(s, t) {
        b = binom.test(s, t, conf.level=0.95)
        b$conf.int[1:2]
    },
    successes, trials
)

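mapply applies a function of n parameters to n vectors, element by element. The return values are collected in a matrix, here with one column per data point: row 1 holds the lower bounds and row 2 the upper bounds. Remember that R returns the last expression in a function as its value.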
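
A small sketch of what comes back from mapply, using three made-up data points so the shape of the matrix is easy to see:

```r
# Made-up counts: three data points.
successes = c(5, 12, 30)
trials    = c(50, 60, 100)

CI = mapply(
    function(s, t) binom.test(s, t, conf.level=0.95)$conf.int[1:2],
    successes, trials
)

dim(CI)     # 2 rows, 3 columns: one column per data point
CI[1,]      # lower bounds
CI[2,]      # upper bounds
```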

Finally we call our plotting function with the appropriate arguments.

plot.conf(xvalues, successes / trials, CI[2,], CI[1,], main='Plot with confidence')
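
To try the whole thing without real data, here is a self-contained sketch with simulated clicks and impressions. The 30 days, 1000 impressions per day and 5% click rate are all assumptions for the demo, and plot.conf is repeated only so the snippet runs on its own.

```r
set.seed(42)                                     # reproducible fake data
xvalues   = 1:30                                 # day number (assumed)
trials    = rep(1000, 30)                        # impressions per day (assumed)
successes = rbinom(30, size=trials, prob=0.05)   # clicks at an assumed 5% rate

# Same plotting function as above, repeated so this runs standalone.
plot.conf = function(x, y, upper, lower, ...) {
    plot(x, y, type='l', lwd=2, ylim=c(min(lower), mean(y) + 3*sd(y)), ...)
    lines(x, upper, type='l', lty=1, lwd=.5)
    lines(x, lower, type='l', lty=1, lwd=.5)
}

# One binomial test per day; row 1 = lower bounds, row 2 = upper bounds.
CI = mapply(
    function(s, t) binom.test(s, t, conf.level=0.95)$conf.int[1:2],
    successes, trials
)

plot.conf(xvalues, successes / trials, CI[2,], CI[1,],
          main='Plot with confidence', xlab='day', ylab='click rate')
```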

 
 

What do serious statisticians use for doing their work? Many of them use R.

R is an interactive programming environment designed for data analysis. It has its own language (which can confusingly be called S for historical reasons), its own large library of basic and statistical functions, its own quality-controlled repository of contributed libraries, its own interactive shell with integrated plotting. In its own domain, it is as complete a working language as Python, Perl or PHP. (It is certainly more mature than JavaScript!)

Here are some features of the language to get programmers excited:

Functions are objects.
rmean = function(x=50) mean(rnorm(x))

Inline anonymous functions are easy.
boot(data, function(d, i){ mean(d[i]) }, R=1000)  # boot() is from the boot package; i holds the resampled indices

Numbers are always vectors.
mean(1) == 1

Arithmetic is vectorized.
c(1,2,3) * 2 == c(2,4,6)

Boolean operations are vectorized.
(c(1,2,3) == 3) ==  c(FALSE, FALSE, TRUE)

Object-oriented support with a simple class-based dispatch system (S3).
df = data.frame(c(1,2,3))
class(df) == "data.frame"
print.data.frame(df)
print(df) #same as previous because of method lookup

Code blocks are objects.
plot(c(1,2,3))
# graph shows "c(1,2,3)" as axis label... so cool!
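
That axis label trick works because a function can look at the code it was called with, not just the value. A rough sketch of the mechanism, where show.expr is a hypothetical helper written for this example and not part of R:

```r
# A function can capture the unevaluated expression passed as its argument.
# show.expr is a hypothetical helper, not part of R.
show.expr = function(x) {
    expr = substitute(x)      # the code the caller wrote, not its value
    paste(deparse(expr), "=", paste(x, collapse=" "))
}

show.expr(c(1,2,3) * 2)       # the text of the call, then its value

# quote() builds the same kind of object directly, and eval() runs it.
e = quote(1 + 2)
class(e)     # "call"
eval(e)      # 3
```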

More excitement: Django + R graphing.

 
 

Working on SnapAds reports, I ran into the UI problem that the effects of colors (red, green, blue) in an advertisement were being graphed with random colors assigned by the graphing software (amcharts) -- very confusing! The real underlying problem was that the names of the colors are entered into the system by outside users. There is no way of knowing ahead of time if the thing being graphed is a color. Unless...

There are only so many words to describe colors. Could there be an 80% solution here? Indeed, I found Stoyan Stefanov had already written a color parser in JavaScript. I rewrote it quickly in Python so it could run server-side (amazing how closely those two languages can map to one another) and voila -- SnapAds reports now show a red bar if the thing being graphed is the effect of a red button.
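
The idea itself is small enough to sketch. This is not the Python port described above; it is a rough illustration in R (to match the other snippets here) that leans on R's built-in list of color names. label.color is a hypothetical helper, and the lookup only covers names that R happens to know.

```r
# label.color is a hypothetical helper: it returns a hex color when the label
# is one of R's built-in color names, and NA otherwise.
label.color = function(label) {
    key = gsub("[[:space:]]+", "", tolower(label))   # "Light Blue" -> "lightblue"
    if (!(key %in% colors())) return(NA_character_)
    rgb(t(col2rgb(key)), maxColorValue=255)
}

label.color("Red")          # "#FF0000"
label.color("Light Blue")   # hex string for R's "lightblue"
label.color("Buy now")      # NA: not a color, so let the chart pick one
```

Anything that comes back NA falls through to whatever color the graphing software would have assigned anyway, which is what makes this an 80% solution.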

 
 

Over the past year, I've worked on a couple of sites that needed a lot of graphs. The landscape of web friendly graphing software is pretty large and technically diverse. By far the best article I found on the topic was this one over at smashingmagazine.com. Rather than repeat their excellent survey, I'll just list the aspects of these packages that have really stuck with me.

What I wanted:

Good time plotting
Handle large datasets (>1000 points)
Interactivity, annotation (highlight interesting data points)
Smart labeling, range setting and defaults

What I found:

amcharts (12+ hours experience)
  Very impressive overall
  Depth + documentation = happy
  XML config file is huge (~500 lines!)
  Smart defaults, organized
  New to market (from 2007?)
  Responsive support

google charts (12+ hours experience)
  clean and readable
  fast
  good examples
  a bitch for anything complicated
  URL string API -- cool but hacky
  static image -- no interactivity
  gchartphp and pygooglechart -- very thin layers

fusioncharts (0.5 hours experience)
  Nicest animations
  Relatively expensive
  Popular choice -- active forums
  Documentation is too hard to search
  No 100% stacked line chart

Open Flash Chart (0.1 hours experience)
  feature-full
  ugly (why make the examples so f**cking ugly?)
  PHP API
 
plotr (0.1 hours experience)
  nice looking
  limited
 
PlotKit (0.1 hours experience)
  nice looking
  limited


Interesting but peripheral packages:

processing
  could this be the future?

timeplot
  specialized for time series
  impressive

matplotlib
  scientific plots... many more ways to visualize data