Have you ever heard of confidence intervals? Probably. Have you ever made them? Probably not. If you're like me, you studied confidence intervals in Stats 101, you're convinced of their importance, but when it comes to plotting data they have always seemed like too much trouble.

Getting into R, the popular language for computing statistics, I thought there would be some built-in function to make it easy -- maybe plot.with.confidence. No such luck. But I discovered it is easy in R *if* you learn how to use the language to its best advantage.

Below is my solution for binomial data. It comes from an analysis of clicks and impressions over time. The code could easily be adapted for other distributions, like a normal distribution.

First, here is a plotting function to draw the statistic of interest plus lines above and below for confidence.

plot.conf = function(x, y, upper, lower, ...) {
    plot(x, y, type='l', lwd=2, ylim=c(min(lower), mean(y) + 3*sd(y)), ...)
    lines(x, upper, type='l', lty=1, lwd=.5)
    lines(x, lower, type='l', lty=1, lwd=.5)
}

Notice the "..." for variable keyword arguments (similar to Python's **kwargs). Also notice that the y-axis limits are based on the y values rather than on the upper confidence line. Confidence intervals can become very wide and scale a graph beyond comprehension.
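To make the axis-limit choice concrete, here is a small sketch of the same rule, written in Python purely for illustration. The function name conf_ylim and the sample numbers are made up for this example. It mirrors the ylim expression in plot.conf: the bottom follows the lowest confidence bound, while the top is capped at three standard deviations above the mean of y.

```python
from statistics import mean, stdev

def conf_ylim(y, lower):
    """Return (bottom, top) y-axis limits following the rule used in plot.conf."""
    # stdev() is the sample standard deviation (n - 1 denominator), like R's sd().
    return (min(lower), mean(y) + 3 * stdev(y))

# Made-up sample data: the first interval is much wider than the others.
y = [0.10, 0.20, 0.30]
lower = [0.05, 0.10, 0.20]
upper = [0.90, 0.35, 0.45]

bottom, top = conf_ylim(y, lower)
print(bottom, top)
```

With these numbers the top limit lands well below the widest upper bound, so that one interval gets clipped instead of squashing the rest of the plot.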

Now the part that calculates the upper and lower confidence lines. The way to get a binomial confidence interval in R is to perform a binomial test. This returns an object which has a confidence interval attached, a common pattern in R.

We can loop implicitly over all our data points because R is a functional language.

CI = mapply(
    function(s, t) {
        b = binom.test(s, t, conf.level=0.95)
        b$conf.int[1:2]
    },
    successes, trials
)

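mapply applies a function of n parameters to n vectors, element by element. The return values are collected into a matrix, here with one column per data point: row 1 holds the lower bounds and row 2 the upper bounds. Remember that R returns the last expression in a function as its value.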
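For readers curious about what the interval actually is: the one binom.test reports is the exact Clopper-Pearson interval. Below is a rough sketch of that calculation in plain Python, using bisection on the binomial tail probabilities. The function names and the sample data are invented for this illustration, and this is not how R computes it internally.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def _root(g, iters=60):
    """Bisection on [0, 1] for a function g that rises from negative to positive."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(s, n, conf_level=0.95):
    """Exact (Clopper-Pearson) interval for s successes in n trials."""
    tail = (1 - conf_level) / 2
    # Lower bound: the p at which P(X >= s) equals the tail probability.
    lower = 0.0 if s == 0 else _root(lambda p: (1 - binom_cdf(s - 1, n, p)) - tail)
    # Upper bound: the p at which P(X <= s) equals the tail probability.
    upper = 1.0 if s == n else _root(lambda p: tail - binom_cdf(s, n, p))
    return lower, upper

successes = [5, 0, 10]  # made-up sample data
trials = [10, 10, 10]

# One (lower, upper) pair per data point, like the columns of the mapply() result.
ci = [clopper_pearson(s, t) for s, t in zip(successes, trials)]
lowers, uppers = zip(*ci)
```

The list comprehension over zip plays the role of mapply. This direct summation is only sensible for small counts; for trials in the thousands or more, a statistics library would be the better tool.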

Finally we call our plotting function with the appropriate arguments.

plot.conf(xvalues, successes / trials, CI[2,], CI[1,], main='Plot with confidence')

 
 

What do serious statisticians use for doing their work? They all use R.

R is an interactive programming environment designed for data analysis. It has its own language (confusingly, sometimes called S for historical reasons), its own large library of basic and statistical functions, its own quality-controlled repository of contributed libraries, and its own interactive shell with integrated plotting. In its own domain, it is as complete a working language as Python, Perl or PHP. (It is certainly more mature than JavaScript!)

Here are some features of the language to get programmers excited:

Functions are objects.
rmean = function(x=50) mean(rnorm(x))

Inline anonymous functions are easy.
library(boot)
boot(data, function(d, i) mean(d[i]), R=1000)

Numbers are always vectors.
length(1) == 1

Arithmetic is vectorized.
c(1,2,3) * 2 == c(2,4,6)

Boolean operations are vectorized.
(c(1,2,3) == 3) ==  c(FALSE, FALSE, TRUE)

Object-oriented support with a simple class-based method dispatch system (S3).
df = data.frame(c(1,2,3))
class(df) == "data.frame"
print.data.frame(df)
print(df) #same as previous because of method lookup

Unevaluated expressions are objects.
plot(c(1,2,3))
# graph shows "c(1, 2, 3)" as the y-axis label... so cool!

More excitement: Django + R graphing.

 
 

Working on SnapAds reports, I ran into the UI problem that the effects of colors (red, green, blue) in an advertisement were being graphed with random colors assigned by the graphing software (amcharts) -- very confusing! The real underlying problem was that the names of the colors are entered into the system by outside users. There is no way of knowing ahead of time if the thing being graphed is a color. Unless...

There are only so many words to describe colors. Could there be an 80% solution here? Indeed, I found Stoyan Stefanov had already written a color parser in JavaScript. I rewrote it quickly in Python so it could run server-side (amazing how closely those two languages can map to one another) and voila -- SnapAds reports now show a red bar if the thing being graphed is the effect of a red button.
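The idea can be sketched in a few lines of Python. This is not Stoyan Stefanov's parser or the SnapAds code, just a minimal illustration: the parse_color function and the tiny NAMED_COLORS table are hypothetical, and a real table would list far more names.

```python
import re

# A deliberately small, hypothetical table; a real one would list many more names.
NAMED_COLORS = {
    "red": "#ff0000",
    "green": "#008000",
    "blue": "#0000ff",
    "yellow": "#ffff00",
    "orange": "#ffa500",
    "purple": "#800080",
    "black": "#000000",
    "white": "#ffffff",
}

# Require the leading '#', otherwise ordinary words like "bed" would pass as hex.
HEX_RE = re.compile(r"^#([0-9a-f]{3}|[0-9a-f]{6})$")

def parse_color(label):
    """Return '#rrggbb' if the label looks like a color, else None."""
    text = label.strip().lower()
    if text in NAMED_COLORS:
        return NAMED_COLORS[text]
    match = HEX_RE.match(text)
    if match:
        digits = match.group(1)
        if len(digits) == 3:
            # Expand shorthand such as #0f0 to #00ff00.
            digits = "".join(ch * 2 for ch in digits)
        return "#" + digits
    return None

print(parse_color("Red"))    # a known color name
print(parse_color("large"))  # not a color, so the chart keeps its default
```

Returning None for anything unrecognized is what makes this an 80% solution: labels that are not colors simply fall back to whatever the graphing software would have picked.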

 
 

I've been working on a project that requires a lot of INSERTs, one for every pageview plus a few UPDATEs of summary tables. Since we want this thing to scale up to millions of pageviews a day, I was concerned about all those INSERTs slowing down the client experience. But, since the client doesn't need up-to-the-second data, I knew the problem could be solved by some kind of queueing.

First I thought of writing all the SQL queries to a file, then running a cron job against it every so often. But this approach has a couple of downsides. The data in the DB would always be somewhat stale, even when server load was light, and it would require rewriting some of my existing, proven code.

A better approach is to fork a process and let the OS handle scheduling. Instead of waiting for the DB, the executing web script forks a process to perform the slow INSERT and returns early to the client -- fast, smooth, and little reprogramming.

In PHP, the documented way to fork is pcntl_fork. But few people can use it because the pcntl extension is not compiled into PHP by default. Thankfully, someone found a trick that accomplishes the same thing by using the command shell and the PHP command line interpreter. I like to think of it as poor man's forking. Here it is.


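function poor_mans_fork($path, $args=array()) {
    // Build a "--key=value" string from the arguments, shell-escaping each part.
    $arg_string = '';
    foreach ($args as $key => $value) {
        $key = escapeshellarg($key);
        $value = escapeshellarg($value);
        $arg_string .= " --$key=$value";
    }

    // Run the script with the PHP CLI in the background, discarding its output
    // so that exec() returns immediately.
    $cmd = "/usr/bin/php " . escapeshellarg(realpath($path)) . $arg_string;
    exec($cmd . " > /dev/null 2>&1 &");
}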

Previously my recording script took all its input in the $_REQUEST variable. To enable it to operate as a forked process, all I needed to do was assign the command-line input contained in $_SERVER['argv'] to $_REQUEST, and provide a please-fork boolean.

I put this code at the top of my PHP script, before the INSERT takes place.

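if (!empty($_REQUEST['fork'])) {
    // Web request: hand the work to a background copy of this script and return.
    unset($_REQUEST['fork']);
    poor_mans_fork(__FILE__, $_REQUEST);
    die();
}
elseif (php_sapi_name() === 'cli') {
    // Background copy: rebuild $_REQUEST from the command line.
    // extract_args() is a helper (not shown) that parses --key=value pairs.
    $_REQUEST = extract_args($_SERVER['argv']);
}
// continue normal execution...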

Now when the web script is called, it does what is needed by the client, forks an instance of itself to take care of the rest, and returns promptly to the client. Meanwhile the forked instance happily completes on its own time.
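The extract_args helper is not listed in this post. As a rough sketch of the whole round trip, here is the same pattern in Python, chosen only for illustration. All names and the record.py script are hypothetical, and the argument parsing is a guess at what such a helper needs to do rather than the original PHP.

```python
import subprocess
import sys

def build_command(script_path, args, interpreter=sys.executable):
    """Build the argv list for the background process: one --key=value per entry."""
    return [interpreter, script_path] + [f"--{k}={v}" for k, v in args.items()]

def poor_mans_fork(script_path, args):
    """Start the script in the background and return without waiting for it."""
    # Passing a list (no shell) sidesteps the quoting that escapeshellarg handles in PHP.
    return subprocess.Popen(
        build_command(script_path, args),
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # POSIX only: detach from the parent's session
    )

def extract_args(argv):
    """Turn ['script', '--key=value', ...] back into a dict of strings."""
    args = {}
    for item in argv[1:]:  # argv[0] is the script name
        if item.startswith("--") and "=" in item:
            # Split on the first '=' only, so values may contain '=' themselves.
            key, value = item[2:].split("=", 1)
            args[key] = value
    return args

# Round trip without spawning anything: what the parent sends is what the child reads.
request = {"page": "/home", "note": "a=b c"}
cmd = build_command("record.py", request, interpreter="python3")
print(extract_args(cmd[1:]))
```

Two limits of this sketch: every value comes back as a string, and a key containing "=" would not survive the trip.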