Have you ever heard of confidence intervals? Probably. Have you ever made them? Probably not. If you're like me, you studied confidence intervals in Stats 101, you're convinced of their importance, but when it comes to plotting data they have always seemed like too much trouble.
Getting into R, the popular language for computing statistics, I thought there would be some built-in function to make it easy -- maybe plot.with.confidence. No such luck. But I discovered it is easy in R *if* you learn how to use the language to its best advantage.
Below is my solution for binomial data. It comes from an analysis of clicks and impressions over time. The code could easily be adapted for other distributions, such as the normal distribution.
First, here is a plotting function to draw the statistic of interest plus lines above and below for confidence.
plot.conf = function(x, y, upper, lower, ...) {
    plot(x, y, type='l', lwd=2, ylim=c(min(lower), mean(y) + 3*sd(y)), ...)
    lines(x, upper, type='l', lty=1, lwd=.5)
    lines(x, lower, type='l', lty=1, lwd=.5)
}
Notice the "..." for variable keyword arguments (similar to Python's **kwargs). Also notice that the upper y limit is based on the y values rather than on the upper confidence line. Confidence intervals can become very wide and scale a graph beyond comprehension.
Now the part that calculates the upper and lower confidence lines. The way to get a binomial confidence interval in R is to perform a binomial test. This returns an object which has a confidence interval attached, a common pattern in R.
We can loop implicitly over all our data points because R is a functional language.
CI = mapply(
    function(s, t) {
        b = binom.test(s, t, conf.level=0.95)
        b$conf.int[1:2]
    },
    successes, trials
)
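mapply applies a function of n parameters to n vectors, element by element. The return values are collected in a matrix, here with one column per data point and the lower and upper bounds in rows 1 and 2. Remember that R returns the last expression in a function as its value.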
Finally we call our plotting function with the appropriate arguments.
plot.conf(xvalues, successes / trials, CI[2,], CI[1,], main='Plot with confidence')
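For readers who think in Python, here is a rough sketch of the same "apply a function over paired vectors" idea. Note the swap: it uses the Wilson score interval, a closed-form approximation, where binom.test computes an exact interval, so the bounds will be close but not identical. The function name, the constant and the sample counts are all made up for illustration.

```python
import math

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def wilson_interval(successes, trials, z=Z_95):
    """Return (lower, upper) Wilson score bounds for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical click and impression counts, one pair per time period.
successes = [12, 30, 50]
trials = [100, 200, 100]

# map() over two sequences plays the role of mapply() over two vectors.
intervals = list(map(wilson_interval, successes, trials))
lower = [lo for lo, _ in intervals]
upper = [hi for _, hi in intervals]
```

The lower and upper lists line up with the time periods, so they can be handed straight to whatever plotting routine draws the two thin lines.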

The common advice out there on how to scale your framework of choice -- whether Rails, Django or CakePHP -- is caching, caching and more caching. These frameworks come with caching functions that make it dead easy.
But let's not forget that a webserver like Apache can always deliver a static file faster (and more reliably) than any framework, no matter how much the framework caches. As soon as the language interpreter loads, some time has been spent.
I ran into a problem trying to serve dynamically generated images through Django. With built-in caching, performance was pretty good, averaging a few hundred requests a second, enough for most apps but not an ad server. Instead of trying to wring out more performance from Django, I configured Apache to act as a cache handler, avoiding Django entirely on cache hits.
The technique is to rewrite URIs to reference a static file. If the file doesn't exist, then run your script and have it save its output to a predetermined location. A similar technique is using a 404 handler to generate static files for the next request to a URI.
Here is my Apache config code (inspired by hardcoder.com):
#cache handling
RewriteEngine On
RewriteCond %{QUERY_STRING} !=miss
RewriteRule composite/([0-9]+)/$ /media/comp_cache/$1.jpeg
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule media/comp_cache/([0-9]+)\.jpeg$ /composite/$1/?miss
These rules will transform the URI
/composite/84/ -> /media/comp_cache/84.jpeg
If /media/comp_cache/84.jpeg does not exist, it will then transform back
/media/comp_cache/84.jpeg -> /composite/84/?miss
so the script at /composite/84/ can run and produce the static file /media/comp_cache/84.jpeg.
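To make the round trip concrete, here is a small Python model of it. This is a sketch of the logic only, not how Apache evaluates rules internally, and it is not my Django view: rewrite() mimics the two rules against a document root, and save_to_cache() stands in for whatever the script does on a miss. Both function names and the docroot parameter are invented for the example.

```python
import os
import re
import tempfile

RULE_TO_CACHE = re.compile(r"composite/([0-9]+)/$")
RULE_TO_SCRIPT = re.compile(r"media/comp_cache/([0-9]+)\.jpeg$")


def rewrite(uri, query, docroot):
    """Model the two rules: return the (uri, query) pair that finally gets served."""
    match = RULE_TO_CACHE.search(uri)
    if match and query != "miss":
        uri = "/media/comp_cache/%s.jpeg" % match.group(1)
    match = RULE_TO_SCRIPT.search(uri)
    if match and not os.path.isfile(os.path.join(docroot, uri.lstrip("/"))):
        uri, query = "/composite/%s/" % match.group(1), "miss"
    return uri, query


def save_to_cache(docroot, image_id, data):
    """What the script does on a miss: write its output where the first rule looks."""
    path = os.path.join(docroot, "media", "comp_cache", "%s.jpeg" % image_id)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    os.chmod(tmp, 0o644)   # mkstemp creates private files; the web server needs read access
    os.replace(tmp, path)  # atomic rename, so a half-written file is never served
    return path
```

Writing to a temporary file and renaming it matters more than it looks: without it, a second request arriving mid-write would pass the file-exists check and get a truncated image.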
By the way, here is a great online service that lets you test your rewrite rules without messing with Apache logs.
What do serious statisticians use for doing their work? They all use R.
R is an interactive programming environment designed for data analysis. It has its own language (which can confusingly be called S for historical reasons), its own large library of basic and statistical functions, its own quality-controlled repository of contributed libraries, and its own interactive shell with integrated plotting. In its own domain, it is as complete a working language as Python, Perl or PHP. (It is certainly more mature than JavaScript!)
Here are some features of the language to get programmers excited:
Functions are objects.
rmean = function(x=50) mean(rnorm(x))
Inline anonymous functions are easy.
boot(data, function(data, x){ mean(data) - mean(x) })
Numbers are always arrays.
mean(1) == 1
Arithmetic is vectorized.
c(1,2,3) * 2 == c(2,4,6)
Boolean operations are vectorized.
(c(1,2,3) == 3) == c(FALSE, FALSE, TRUE)
Object-oriented support with a simple dispatch system: generic functions look up methods by class.
df = data.frame(c(1,2,3))
class(df) == "data.frame"
print.data.frame(df)
print(df) #same as previous because of method lookup
Code blocks are objects.
plot(c(1,2,3))
# graph shows "c(1,2,3)" as axis label... so cool!
More excitement: Django + R graphing.
Working on SnapAds reports, I ran into the UI problem that the effects of colors (red, green, blue) in an advertisement were being graphed with random colors assigned by the graphing software (amcharts) -- very confusing! The real underlying problem was that the names of the colors are entered into the system by outside users. There is no way of knowing ahead of time if the thing being graphed is a color. Unless...
There are only so many words to describe colors. Could there be an 80% solution here? Indeed, I found Stoyan Stefanov had already written a color parser in JavaScript. I rewrote it quickly in Python so it could run server-side (amazing how closely those two languages can map to one another) and voila -- SnapAds reports now show a red bar if the thing being graphed is the effect of a red button.
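The idea fits in a few lines. What follows is a minimal sketch, not the port described above and not Stefanov's parser: the color table is deliberately tiny, and parse_color and NAMED_COLORS are names I made up for the example. It returns a hex string when a label looks like a color and None otherwise, so the caller can fall back to the charting software's default.

```python
import re

# A deliberately tiny table; a real parser would carry the full list of CSS color names.
NAMED_COLORS = {
    "red": "#ff0000",
    "green": "#008000",
    "blue": "#0000ff",
    "black": "#000000",
    "white": "#ffffff",
}

# Require the leading "#" so ordinary words like "bed" are not mistaken for hex codes.
HEX_COLOR = re.compile(r"^#([0-9a-f]{3}|[0-9a-f]{6})$")


def parse_color(label):
    """Return '#rrggbb' if the label names a color, otherwise None."""
    text = label.strip().lower()
    if text in NAMED_COLORS:
        return NAMED_COLORS[text]
    match = HEX_COLOR.match(text)
    if match is None:
        return None
    digits = match.group(1)
    if len(digits) == 3:
        # Expand shorthand: "0f0" becomes "00ff00".
        digits = "".join(ch * 2 for ch in digits)
    return "#" + digits
```

Returning None for anything unrecognized is what makes this an 80% solution: labels that are not colors simply keep their old behavior.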

Over the past year, I've worked on a couple of sites that needed a lot of graphs. The landscape of web friendly graphing software is pretty large and technically diverse. By far the best article I found on the topic was this one over at smashingmagazine.com. Rather than repeat their excellent survey, I'll just list the aspects of these packages that have really stuck with me.
What I wanted:
Good time plotting
Handle large datasets (>1000 points)
Interactivity, annotation (highlight interesting data points)
Smart labeling, range setting and defaults
What I found:
amcharts (12+ hours experience)
Very impressive overall
Depth + documentation = happy
XML config file is huge (~500 lines!)
Smart defaults, organized
New to market (from 2007?)
Responsive support
google charts (12+ hours experience)
clean and readable
fast
good examples
a bitch for anything complicated
URL string API -- cool but hacky
static image -- no interactivity
gchartphp and pygooglechart -- very thin layers
fusioncharts (0.5 hours experience)
Nicest animations
Relatively expensive
Popular choice -- active forums
Documentation is too hard to search
No 100% stacked line chart
Open Flash Chart (0.1 hours experience)
feature-full
ugly (why make the examples so f**cking ugly?)
PHP API
plotr (0.1 hours experience)
nice looking
limited
PlotKit (0.1 hours experience)
nice looking
limited
Interesting but peripheral packages:
processing
could this be the future?
timeplot
specialized for time series
impressive
matplotlib
scientific plots... many more ways to visualize data

I've been working on a project that requires a lot of INSERTs, one for every pageview plus a few UPDATEs of summary tables. Since we want this thing to scale up to millions of pageviews a day, I was concerned about all those INSERTs slowing down the client experience. But, since the client doesn't need up-to-the-second data, I knew the problem could be solved by some kind of queueing.
First I thought of writing all the SQL queries to a file, then running a cron job against it every so often. But this approach has a couple of downsides. The data in the DB would always be somewhat stale, even when server load was light, and it would require rewriting some of my existing, proven code.
A better approach is to fork a process and let the OS handle scheduling. Instead of waiting for the DB, the executing web script forks a process to perform the slow INSERT and returns early to the client -- fast, smooth, and little reprogramming.
In PHP, the listed way to fork is pcntl_fork. But few people can do this because it is not compiled into PHP by default. Thankfully, someone found a trick that accomplishes the same thing by using the command shell and the PHP command line interpreter. I like to think of it as poor man's forking. Here it is.
function poor_mans_fork($path, $args=array()) {
    $arg_string = '';
    foreach ($args as $key => $value) {
        $key = escapeshellarg($key);
        $value = escapeshellarg($value);
        // The leading space keeps consecutive arguments separate.
        $arg_string .= " --$key=$value";
    }
    $cmd = "/usr/bin/php " . escapeshellarg(realpath($path)) . $arg_string;
    // Redirect output and background the command so exec() does not wait for it.
    exec($cmd . " > /dev/null &");
}
Previously my recording script took all its input in the $_REQUEST variable. To enable it to operate as a forked process, all I needed to do was assign the command-line input contained in $_SERVER['argv'] to $_REQUEST, and provide a please-fork boolean.
I put this code at the top of my PHP script, before the INSERT takes place.
if (!empty($_REQUEST['fork'])) {
    unset($_REQUEST['fork']);
    poor_mans_fork(__FILE__, $_REQUEST);
    die();
}
else {
    // extract_args() (not shown) turns the --key=value arguments back into an array.
    $_REQUEST = extract_args($_SERVER['argv']);
    // continue normal execution...
}
Now when the web script is called, it does what is needed by the client, forks an instance of itself to take care of the rest, and returns promptly to the client. Meanwhile the forked instance happily completes on its own time.
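The same trick carries over to other languages. Here is a hedged Python sketch of the pattern, assuming a POSIX system: the function names mirror the PHP above but are otherwise invented, and parse_args is only my guess at the shape of a helper like extract_args. Passing the arguments as a list means no shell is involved, so no escaping is needed.

```python
import subprocess
import sys


def build_args(params):
    """Turn {'key': 'value'} into ['--key=value', ...] for the child's command line."""
    return ["--%s=%s" % (key, value) for key, value in params.items()]


def parse_args(argv):
    """Inverse of build_args: recover the dict inside the child process."""
    params = {}
    for arg in argv:
        if arg.startswith("--") and "=" in arg:
            key, value = arg[2:].split("=", 1)
            params[key] = value
    return params


def poor_mans_fork(script_path, params):
    """Start script_path in the background and return without waiting for it."""
    return subprocess.Popen(
        [sys.executable, script_path] + build_args(params),
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # POSIX only: detach the child from the parent's session
    )
```

Inside the child, parse_args(sys.argv[1:]) gives back the same dictionary the parent passed in, which is all the recording code needs to carry on as if it had been called from the web.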
In this blog, I'll try to share the small nuggets of wisdom I've managed to wrangle from my daily toils. I'll polish and shine them so you don't ever need to think about the mountains of dirt they were buried under.