Have you ever heard of confidence intervals? Probably. Have you ever made them? Probably not. If you're like me, you studied confidence intervals in Stats 101, you're convinced of their importance, but when it comes to plotting data they have always seemed like too much trouble.
Getting into R, the popular language for computing statistics, I thought there would be some built-in function to make it easy -- maybe plot.with.confidence. No such luck. But I discovered it is easy in R *if* you learn how to use the language to its best advantage.
Below is my solution for binomial data. It comes from an analysis of clicks and impressions over time. The code could easily be adapted for other distributions, like a normal distribution.
First, here is a plotting function to draw the statistic of interest plus lines above and below for confidence.
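plot.conf = function(x, y, upper, lower, ...) {
  # Main line; cap the top of the y-axis near the data so wide intervals don't flatten it.
  plot(x, y, type='l', lwd=2, ylim=c(min(lower), mean(y) + 3*sd(y)), ...)
  # Thin lines for the upper and lower confidence bounds.
  lines(x, upper, type='l', lty=1, lwd=.5)
  lines(x, lower, type='l', lty=1, lwd=.5)
}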
Notice the "..." for variable keyword arguments (similar to Python's **kwargs). Also notice that the upper y-limit is based on the y values (the mean of y plus three standard deviations) rather than on the upper confidence bounds. Confidence intervals can become very wide and scale a graph beyond comprehension.
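To see how "..." forwards arguments on its own, here is a minimal sketch. The wrapper name trimmed.mean and the sample values are made up for illustration; only mean() and its trim and na.rm arguments come from base R.

```r
# Hypothetical wrapper: any extra named arguments are passed through to mean().
trimmed.mean <- function(x, ...) {
  mean(x, ...)
}

# Made-up values with one outlier and one missing entry.
values <- c(1, 2, 3, 4, 100, NA)

# trim and na.rm are not declared by trimmed.mean; they travel through "...".
result <- trimmed.mean(values, trim = 0.2, na.rm = TRUE)
```

plot.conf uses the same mechanism, so arguments such as main or xlab reach plot() without being listed in the function signature.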
Now the part that calculates the upper and lower confidence lines. The way to get a binomial confidence interval in R is to perform a binomial test. This returns an object which has a confidence interval attached, a common pattern in R.
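As a quick illustration of that pattern, here is a single binomial test on made-up numbers (12 clicks out of 200 impressions are assumptions, not data from the analysis).

```r
# Illustrative numbers: 12 clicks out of 200 impressions.
b <- binom.test(12, 200, conf.level = 0.95)

class(b)          # the test result is an object of class "htest"
names(b)          # its components include "estimate" and "conf.int"
b$estimate        # observed proportion of successes
ci <- b$conf.int  # numeric vector: lower bound, then upper bound
```

The interval also carries a conf.level attribute, so the level used travels with the result.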
We can loop implicitly over all our data points because R is a functional language.
CI = mapply(
  function(s, t) {
    # binom.test returns an object; conf.int holds the lower and upper bounds
    b = binom.test(s, t, conf.level=0.95)
    b$conf.int[1:2]
  },
  successes, trials
)
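The mapply function applies a function of n parameters to n vectors, element by element. The return values are put in a matrix with one column per data point; here row 1 holds the lower bounds and row 2 the upper bounds. Remember that R returns the last expression in a function as its value.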
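The shape of that matrix is easiest to see on a tiny example. The successes and trials below are made-up values for three time periods, not the original data.

```r
# Made-up data: successes and trials for three time periods.
successes <- c(5, 12, 30)
trials    <- c(100, 150, 400)

# One binomial test per (successes, trials) pair.
CI <- mapply(
  function(s, t) binom.test(s, t, conf.level = 0.95)$conf.int[1:2],
  successes, trials
)

dim(CI)   # 2 rows (lower, upper) by 3 columns (one per data point)
CI[1, ]   # lower bounds
CI[2, ]   # upper bounds
```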
Finally we call our plotting function with the appropriate arguments.
plot.conf(xvalues, successes / trials, CI[2,], CI[1,], main='Plot with confidence')
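For roughly normal data, one way the same pattern might be adapted is to swap binom.test for t.test, which also returns an object with a conf.int component. This is only a sketch: the samples list is simulated, and the plotting is written out with base plot() and lines() so the block runs on its own.

```r
set.seed(1)  # simulated data, purely for illustration

# Ten time points, each with 30 hypothetical measurements.
samples <- lapply(1:10, function(i) rnorm(30, mean = 50 + i, sd = 5))

means <- sapply(samples, mean)

# t.test() returns an "htest" object with a conf.int component,
# just as binom.test() does.
CI <- sapply(samples, function(s) t.test(s, conf.level = 0.95)$conf.int[1:2])

# Draw the mean with thin lines for the lower and upper bounds.
plot(1:10, means, type = 'l', lwd = 2, ylim = range(CI),
     main = 'Normal data with confidence')
lines(1:10, CI[1, ], lwd = 0.5)
lines(1:10, CI[2, ], lwd = 0.5)
```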
What do serious statisticians use for doing their work? Many of them use R.
R is an interactive programming environment designed for data analysis. It has its own language (which can confusingly be called S for historical reasons), its own large library of basic and statistical functions, its own quality-controlled repository of contributed libraries, and its own interactive shell with integrated plotting. In its own domain, it is as complete a working language as Python, Perl or PHP. (It is certainly more mature than JavaScript!)
Here are some features of the language to get programmers excited:
Functions are objects.
rmean = function(x=50) mean(rnorm(x))
Inline anonymous functions are easy.
boot(data, function(d, i) mean(d[i]), R=1000)  # boot package; i holds the resampled indices
Numbers are always vectors (a single number is a vector of length one).
mean(1) == 1
Arithmetic is vectorized.
c(1,2,3) * 2 == c(2,4,6)
Boolean operations are vectorized.
(c(1,2,3) == 3) == c(FALSE, FALSE, TRUE)
Object-oriented support with a simple class-based dispatch system (S3).
df = data.frame(c(1,2,3))
class(df) == "data.frame"
print.data.frame(df)
print(df) #same as previous because of method lookup
Code blocks are objects.
plot(c(1,2,3))
# graph shows "c(1, 2, 3)" as the y-axis label... so cool!
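Two of the features above, method lookup and code as objects, can be sketched in a few lines. The class name "reading" and the helper label.of are made up for this example; structure(), cat(), substitute() and deparse() are base R.

```r
# S3 dispatch: print() looks for a method named print.<class>.
# "reading" is a made-up class name for this sketch.
r <- structure(list(value = 42), class = "reading")

print.reading <- function(x, ...) {
  cat("reading:", x$value, "\n")
  invisible(x)
}

print(r)  # dispatches to print.reading

# Unevaluated code: substitute() captures the expression the caller typed,
# and deparse() turns it into text. This is roughly how plot() can arrive
# at a default axis label that shows the code.
label.of <- function(x) deparse(substitute(x))
lab <- label.of(c(1,2,3))
```

Note that deparse() normalizes spacing, which is why the label reads "c(1, 2, 3)" even when the call was typed without spaces.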
More excitement: Django + R graphing.