March 2011

Via Alan Boyle of Cosmic Log: this has to be embarrassing.

You create the first synthetic life form. You thoughtfully encode in its DNA wise words from three giants of literature and science: Joyce, Oppenheimer (quoting Felix Adler), and Feynman.

And you get one of the quotes wrong.

At least you can fix it. Imagine if there had been a typo on the Golden Record. Or if you’d flubbed the first words ever spoken from someone standing on the surface of the moon.

(I remember very well being befuddled when I heard Neil’s flub live – well, give or take a second or so, though this calculation is wrong – and I remained befuddled when I read it the next day, and I was unconvinced by the initial explanation a couple of weeks later.)

[To decode this post’s title, use the DNA decoding table here.]

Earlier today, I virtually ran into the husband of one of my choir buddies. Virtually, like on Twitter.

Adding to the amusement, check out the similarity between his (2007) Twitter profile picture and my (2001) Drew web page (sadly in need of an update, I know) picture:

[Screen shot, 2011-03-26: the two pictures side by side]

Briefly, it went like this: I happen to follow Ivan Oransky (of Retraction Watch) on Twitter. Ivan, good scientist that he is, always cites his sources, and his source for this tweet today was David Berreby, with whose wife I sing in The Dessoff Choirs.

Small world, or big Twitter.

Come hear the wonderful music (Bernstein, Dove, Ives, Walker, Brahms, Jameson, Górecki, Barber, and Conte) Dessoff is preparing for its next concert. Or come to our Gala fundraiser. Or both.

I spent a good chunk of the last 24 hours at one of my favorite hangouts, Language Log. My reason for lingering was to pore over (in good company) some interesting graphs Mark Liberman had put up about the ever-controversial adverb “literally.” [Link: Two Breakfast Experiments™: Literally]

The graphs purported to show, inter alia, “a remarkably lawful relationship between the frequency of a verb and the probability of its being modified by literally, as revealed by counts from the 410-million-word COCA corpus.” [Aside: Visit COCA some time. It’s beautiful, it’s open 24/7, and admission is free.]

Sadly (for the researcher whose graphs Mark posted), there was no linguistic revelation; happily (for me and other mathophiles) the graphs highlighted a very interesting statistical artifact. Good stuff was learned.

Instead of rehashing what you can find in the comment thread at Language Log, what I’ll do here is give a non-linguistic example of this statistical artifact. First, a very few general remarks about statistics.

Much of statistics is about making observations or drawing inferences from selected data. In a nutshell, statistical analysis often goes like this: look at some data (such as the COCA corpus), find something interesting (such as an inverse relationship between two measurements), and draw a conclusion (in this case, a general inference about American English, of which COCA is one of the largest samples available in usable form).

Easy as a, b, c. One, two, three. Do, re, mi.

Sometimes. The mathematical underpinnings of statistics often make it possible, given certain assumptions, to draw inferences from selected data with a measurable degree of confidence. Unfortunately, it’s easy to focus so hard on measuring the confidence (Yay, p < 0.05! I might get tenure!) that you forget the assumptions or get careless about how you state an inference or calculation.

When bad statistics happens, there’s usually a scary headline to go with it, but I can’t think up a good one at the moment, so I’ll go straight to the (artifactual) graph.

[Graph: homicide rate vs. city population, cities with at least 10 homicides in 2009]

This graph shows that for not-too-small cities, there’s a modest negative relationship between city size and homicide rate: on average, smaller cities tend to have higher homicide rates.

But the truth is that among not-too-small cities, smaller cities don’t tend to have higher homicide rates than larger ones. Here’s a better graph:

[Graph: homicide rate vs. city population, all cities]

This graph shows almost no relationship between city size and homicide rate.

What’s going on, and what’s wrong with the relationship that shows up (and is real) in the first graph? The titles hold a clue (but don’t count on such clear titles when you see or read about graphs in the news). The first graph only shows cities that had at least 10 homicides in 2009. For that scatterplot, cities were selected for analysis according to a criterion related to the variable under investigation, homicide rate. That’s a no-no.

The 10-homicide cutoff biased the selection of cities used in the analysis. Most very large cities show up simply because they’re large enough to have 10 or more homicides, but the smallest cities that appear are only there because they had high enough homicide rates to reach the 10-homicide threshold despite their relatively small populations. For the first graph, I (pretending not to notice) chose all large cities together with only some smaller cities, specifically smaller cities with unusually high homicide rates for their size. Then I “discovered” that smaller cities had higher homicide rates.
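You can watch the artifact appear without any real data. Here’s a quick simulation sketch (hypothetical numbers throughout, not the cities behind the graphs above): every simulated city gets a true homicide rate drawn independently of its population, so the real size-vs-rate relationship is flat; applying the 10-homicide cutoff then conjures up a negative correlation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cities (made-up numbers): each city's true homicide
# rate is drawn INDEPENDENTLY of its population, so the real
# size-vs-rate relationship is flat by construction.
n = 5000
pop = rng.integers(20_000, 2_000_000, size=n)        # population
true_rate = rng.uniform(2.0, 12.0, size=n)           # homicides per 100,000
homicides = rng.poisson(true_rate * pop / 100_000)   # observed counts
obs_rate = homicides / pop * 100_000                 # observed rate

# Correlation over ALL cities: essentially zero, as constructed.
r_all = np.corrcoef(pop, obs_rate)[0, 1]

# Apply the 10-homicide cutoff, as in the first graph: a small city
# survives the cut only if its observed rate is unusually high.
keep = homicides >= 10
r_cut = np.corrcoef(pop[keep], obs_rate[keep])[0, 1]

print(f"all cities:      r = {r_all:+.3f}")
print(f">=10 homicides:  r = {r_cut:+.3f}")
```

The cutoff keeps every large city but only the unluckiest small ones, so the filtered sample shows a clearly negative correlation even though none exists in the full population.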

Oops. It’s an easy mistake to make, and it wouldn’t surprise me if it happens often. I can easily imagine medical studies that compare the rates of some disease among cities and exclude any city that has “too few” cases of the disease to analyze.

Statistics is a powerful tool. Follow the instructions.