Statistics


During the 13-year period (1995-2007) for which public data is available, 141 babies were born in Los Angeles County and named Unique. Two were boys.

A great many baby names were more unusual than Unique, including Z (boy, 2007), Q (boy, 2005), Abcde (girl, 2005), Awesome (boy, 2007), Yourhighness (girl, 2007), Queenelizabeth (girl, 2005), Unknown (boy, 2005), and Y (boy, 2007).

[Chart: babies named Unique born in Los Angeles County, 1995-2007]

Data source: Los Angeles County Department of Public Health [link]

Update (May 5, 2012)

I updated this post last year when I discovered the Social Security Administration’s baby-name database, but it turns out there was more there than I realized. In his Breakfast Experiment™ of this morning, Language Log’s Mark Liberman mentioned this richer source of SSA name data.

As was the case for the SSA data I found last year (in the earlier update, which appears below), there’s no data for baby names that are unique (or almost so). But in recent years, Unique is not unique, so there’s more to report. The chart below summarizes the latest SSA data I have for Unique boys and girls.

[Chart: SSA counts of babies named Unique, by year and sex]
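
If you'd rather tally the numbers yourself, the SSA distributes the national data as a zip of yobYYYY.txt files, one "name,sex,count" line per name; names given to fewer than five babies of one sex in a year are omitted, which is why truly unique names never appear. A minimal Python sketch, assuming the files have been unzipped into the working directory:

# Tally babies named Unique in the SSA national baby-name data.
# Assumes the yobYYYY.txt files (one "name,sex,count" line per name)
# from the SSA download have been unzipped into the current directory.
import csv

for year in range(1995, 2011):
    counts = {"F": 0, "M": 0}
    with open(f"yob{year}.txt", newline="") as f:
        for name, sex, count in csv.reader(f):
            if name == "Unique":
                counts[sex] += int(count)
    print(year, f"{counts['F']} girls, {counts['M']} boys")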

Update (June 24, 2011)

The Social Security Administration provides a basic query interface to its national baby-name database, which is based on Social Security card applications for births that occurred in the United States. Unique was a common enough girl’s name to make the annual top 1,000 list nine times between 1996 and 2009.

[Chart: Unique among the SSA’s annual top 1,000 girls’ names]

Both the Los Angeles and national figures suggest that as a baby name, Unique is becoming less common.

2 Responses to “Baby Unique, Not So Much [updated again]”

  1. Steve Kass Says:

    Further investigation reveals several similar names worth noting.

    Im’unique (girl, 2007)
    I’munique (girl, 2002)
    Imuniqye (girl, 1999)
    Imuniqee (girl, 1998)
    Imyunique (girl, 1997)
    Imunquine (boy, 1997)
    Imunique (girl, 2007; 2 girls, 2005; 2 girls, 2001; 2 girls, 2000; 5 girls, 1999; 2 girls, 1998; girl, 1997; 6 girls, 1996; 2 girls, 1995)

    Also, Iexyzhen Vyelle Yzhian (girl, 2005).

  2. Mike Says:

    I’d be curious about Uniqua http://en.wikipedia.org/wiki/Uniqua#Uniqua


In today’s New York Times, Ross Douthat opines that

a lack of contraceptive access simply doesn’t seem to be a significant factor in unplanned pregnancy in the United States. When the Alan Guttmacher Institute¹ surveyed more than 10,000 women who had procured abortions in 2000 and 2001, it found that only 12 percent cited problems obtaining birth control as a reason for their pregnancies. A recent Centers for Disease Control and Prevention study of teenage mothers found similar results: Only 13 percent of the teens reported having had trouble getting contraception.

If this makes any sense to you at all, imagine the same argument against requiring seat belts in cars as a way to reduce the number of highway fatalities:

Unavailability of seat belts simply doesn’t seem to be a significant factor in highway fatalities in the United States. When the XYZ Institute examined more than 10,000 highway fatalities, it found that in only 12 percent of the cases was no seat belt available to the deceased. A recent Centers for Accident Prevention study of highway fatalities found similar results: Only 13 percent of passengers killed in traffic accidents were in seats not equipped with a working seat belt.

Or the same argument against just about anything useful that should be readily available:

Unavailability of public waste bins simply doesn’t seem to be a significant factor in the litter problem in the United States. When the XYZ Institute interviewed more than 10,000 people who threw trash on the streets, it found that only 12 percent of them did so because they couldn’t find a waste bin. A recent Centers for a Cleaner America study of littering found similar results: Only 13 percent of people who littered did so more than 20 yards from a public waste bin.


¹ According to the latest Guttmacher survey, released this month, and based on 2008 data, the rates of pregnancy, abortion, and births among teens in the United States are all at historic lows. According to the report, “A large body of research has shown that the long-term decline in teen pregnancy, birth and abortion rates was driven primarily by improved use of contraception among teens.”


News this week of an Adderall shortage, and this report, which calls into question the widely held belief that methamphetamines cause brain damage and cognitive impairment, prompt me to rescue an old statistical parody I wrote (and posted on my now-moribund Drew web page) in 2003, a few years before I had this soapbox. The news links above are also well worth visiting.

Cocaine’s brain effects might be long term [“news”]

Insulin’s metabolic effects might be long term [parody]

BOSTON, March 10, 2003 (UPI) — Cocaine and amphetamines
might cause slight mental impairments in abusers that
persist for at least one year after discontinuing the
drugs, research released Monday reveals.

MADISON (NJ), March 16, 2003 — Insulin might cause metabolic
disorders in abusers that persist for at least one year after
discontinuing the drug, research released Monday reveals.

However, experts outside the study said the findings were inconclusive
and pointed out that although cocaine has been widely abused for decades,
impaired cognitive function is not seen routinely or even known to exist
in former abusers.

"Overall, the abusers were impaired compared to non-abusers on the function of attention and motor skills," Rosemary Toomey, a psychologist at Harvard
Medical School and the study’s lead investigator, told United Press International.

“Overall, the abusers were impaired compared to non-abusers
on tests of sugar metabolism,” Rosemary Toomey, a psychologist
at Harvard Medical School and the study’s lead investigator,
told United Press International.

Previous studies have yielded inconsistent findings on whether
cocaine abuse led to long-term mental deficits. Some studies found
deficits in attention, concentration, learning and memory six months
after quitting. But a study of former abusers who were now in prison
and had abstained from cocaine for three years found no deficit.

Few studies have looked at the long term effects of insulin
abuse, although doctors and scientists generally believe
the drug is harmful. One study of former abusers who
were now in prison and had abstained from insulin for
three years found a higher than normal death rate.

To help clarify these seemingly conflicting results, Toomey’s team,
in a study funded by the National Institute on Drug Abuse, identified
50 sets of male twins, in which only one had abused cocaine or
amphetamines for at least one year. Amphetamine abusers were
included because the drug is similar to cocaine and could have the
same long-term effects on the body.

To address the lack of careful studies, Toomey’s team, funded by
the National Institute on Drug Abuse, identified 50 sets of male
twins, in which only one had abused insulin for at least one year.

Most of the pairs were identical twins, meaning they share the exact
same genetic pattern. This helps minimize the role biological
differences could play in the findings and gives stronger support to the
mental impairments being due to drug abuse.

Most of the pairs were identical twins, meaning they
share the exact same genetic pattern. This helps minimize
the role genetic differences could play in the findings and gives
stronger support to the impairments being due to insulin abuse.

The abusers, who averaged age 46 and had not used drugs for at least
one year, scored significantly worse on tests of motor skills and
attention, Toomey’s team reports in the March issue of The Archives
of General Psychiatry.

The abusers, who averaged age 46 and had not used
insulin for at least one year, scored significantly worse
on tests of sugar metabolism, Toomey’s team reports in
the March issue of The Archives of General Metabolism.

The tests all were timed, which indicates the abusers have
"a motor slowing, which is consistent with what other investigators
have found in other studies," Toomey said.

The tests all were performed after fasting, which indicates the abusers
have “an impaired metabolism unrelated to diet, which is consistent
with the consensus in the medical community,” Toomey said.

Still, the abusers’ scores were within normal limits and they actually
performed better on one cognitive test, called visual vigilance, which
is an indication of the ability to sustain attention over time. This
indicates the mental impairment is minor, Toomey said. "In real life,
it wouldn’t be a big impact on (the abusers’) day-to-day functioning
but there is a difference between them and their brothers," she said.

The finding is significant, she added, because given that the study subjects
are twins and share the same biological make-up, they would be expected
to have about the same mental status. This implicates the drug abuse
as the cause of the mental impairment.

The finding is significant, she added, because given that the study
subjects are twins and share the same biological make-up, they would
be expected to have about the same metabolic status. This
implicates the drug abuse as the cause of the impairment.

Among the abusers, the mental test scores largely did not vary in
relation to the amount of cocaine or amphetamine used. However,
on a few tests the abusers did score better with more stimulant use.

Among the abusers, poorer test scores were consistently associated
with increased levels of insulin abuse. Among the heaviest abusers,
not one scored better than his non-abusing twin.

"The results seem to me to be inconclusive," Greg Thompson,
a pharmacist at the University of Southern California’s
School of Pharmacy in Los Angeles, told UPI.

“The results seem to me to be conclusive,” Greg Thompson,
a pharmacist at the University of Southern California’s
School of Pharmacy in Los Angeles, told UPI.

This is "because both twins are within a normal range
(and) sometimes the cocaine-abusing twin did better than the
non-abusing twin and sometimes not," Thompson said.

This is “because almost without exception, only the non-abusing
twin is within a normal range (and) the insulin-abusing twin did
worse than the non-abusing twin,” Thompson said.

In addition, cocaine has been abused by millions of people, going
back as far as the 1930s and before, he said. "You’d think you’d be
seeing this as a significant clinical problem and we are not," he said.

In addition, insulin has been abused by millions of people,
and poor sugar metabolism among former insulin abusers
has been reported by physicians going back as far as the 1930s
and before, he said. “This is a significant clinical problem,” he said.

Of more concern to Thompson is the effect stimulants such as Ritalin,
which are used to treat attention deficit disorder, are having on
children. "This would be a much bigger problem I would think if
it’s true stimulants impair cognitive function," he said.

Of more concern to Thompson is the effect daily insulin injections
are having on children. Insulin is commonly prescribed to control
diabetes (frequent urination, weight gain, and fatigue syndrome).
“Many of these children will become former insulin abusers, and
poor sugar metabolism will be a major healthcare issue for
them in the years to come,” he said.

"Before I’d worry about the 46 year-old abuser, I’d want to know about the
3 year old being treated for ADD (attention-deficit disorder)," Thompson said.

“Before I’d worry about the 46 year-old abuser, I’d want to
know about the 3 year old being treated for diabetes,” Thompson said.

One Response to “Maybe That Wasn’t Your Brain on Meth”

  1. Terri Says:

    Yea I just realized I can read your blog in 75 degree sunshine! Happy me thanks you!



The possible existence of heteroscedasticity is a major concern in the application of regression analysis, including the analysis of variance, because the presence of heteroscedasticity can invalidate statistical tests of significance that assume the effect and residual (error) variances are uncorrelated and normally distributed. —Wikipedia

Perhaps I’m overeager to use one of my favorite words, but the more I look at Figure 11 of The Neutrino Preprint, the more I think I see a hint of heteroscedasticity in the residuals. If present, it would support the possibility that the model used for the best fit analysis (a one-parameter family of time-shifted scaled copies of the summed proton waveform) was not appropriate. See my previous post for some background.

[Figure: bottom half of Figure 11 of the preprint]

The figure above (which is the bottom half of Figure 11) shows the best fit of the complete summed proton waveform (red) vs. the observed neutrino counts (black), summarized using 150 nanosecond bins. For both extractions (left and right), the residuals of the fit (the distances from the red curve to each black dot) appear possibly heteroscedastic in two ways.

First, they seem to be slightly (negatively) correlated with the time scale — positive residuals are more likely towards the beginning of the pulse, negative residuals towards the end. Second, there may be a slight negative correlation of the variance of the residuals with the time scale as well. The residuals seem to become more consistent — vary less in either direction from zero — from left to right. [I didn’t pull out a ruler and calculate any real statistics.]
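
If someone does pull out a ruler, both effects are easy to quantify. Here's a minimal Python sketch; the residuals below are invented placeholders, not values read off Figure 11:

# Crude checks for the two effects described above: correlate the
# residuals, and their magnitudes, with time. The numbers are invented
# stand-ins for values one would read off Figure 11 with a ruler.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

times = [150 * i for i in range(20)]          # bin centers, ns
residuals = [5, 3, 4, -1, 2, 1, 3, -2, 0, 1,  # invented values
             -1, 0, -2, 1, -1, 0, -1, 1, 0, -1]

print("r(residual, time)   =", round(pearson(times, residuals), 2))
print("r(|residual|, time) =", round(pearson(times, [abs(r) for r in residuals]), 2))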

To be fair, there is little evidence of heteroscedastic residuals in Figure 12 (below), which shows a zoomed-in detail of the beginning and end of each extraction, summarized into 50 nanosecond bins. In all, only about a sixth of the waveform is shown at this resolution. (A data point appears to have been omitted from this figure; between the first two displayed bins in the second extraction, there should probably be a black point to indicate that zero neutrinos were observed in that 50 ns interval.)

[Figure: Figure 12 of the preprint]

The authors report some tests of robustness; for example, they analyzed daytime and nighttime data separately and found no discrepancy. They also calculated and report a reduced chi-square statistic that indicates a good model fit. They may also have measured the heteroscedasticity of the residuals, but they don’t mention it.

They do say a fair bit about how they obtained the summed proton waveform (the red line) used for the fit, but so far I don’t see any indication that they considered the possibility of a systematic process occurring over the length of each proton pulse that caused the ratio of protons to observed neutrinos to vary.

Then again, I don’t understand every sentence in the paper that might be relevant, such as this one: “The way the PDF [the probability density functions for the proton waveform] are built automatically accounts for the beam conditions corresponding to the neutrino interactions detected by OPERA.” And I’m not a physicist or a statistician.

One Response to “Heteroscedasticity in the Residuals?”

  1. Eric Jones Says:

    Here’s an update on the original results, http://news.sciencemag.org/scienceinsider/2011/11/faster-than-light-neutrinos-opera.html, which appears to rule out the statistical argument (which I really liked).


[I’ve posted a follow-up here: Heteroscedasticity in the Residuals?]

When applying statistics to find a “best fit” between your observation and reality, always ask yourself “best among what?”

The CERN result about faster-than-light neutrinos is based on a best fit. If the authors were too restrictive in their meaning of “among what,” they might have missed figuring out what really happened. And what might have really happened was that the neutrinos they detected had not traveled faster than light.

The data for this experiment was, as usual, a bunch of numbers. These numbers were precisely-measured (by portable atomic clocks and other very cool techniques) arrival times of neutrinos at a detector. The neutrinos were created by shooting a beam of protons into a long tube of graphite. This produced neutrinos, some of which were subsequently observed by a detector hundreds of miles away.

Over the course of a few years, the folks at CERN shot a total of about 100,000,000,000,000,000,000 protons into the tube; they observed about 15,000 neutrinos. The protons were fired in pulses, each pulse lasting about 10 microseconds.

A careful statistical analysis of the data, the authors report, indicates that the neutrinos traveled about 0.0025% faster than the speed of light. Whooooooosh! Furthermore, because the experiment looked at a lot of neutrinos and the results were consistent, the experiment indicates that in all likelihood the true speed of neutrinos was very close to 0.0025% faster than the speed of light, and almost without doubt at least somewhat faster than light.

If the experimental design and statistical analysis are correct (and the authors are aware they might not be, though they worked hard to make them correct), this is one of the great experiments of all time.

So far, I haven’t read much scrutiny of the statistical analysis pertaining to the question of “among what?” But Jon Butterworth of The Guardian raised one issue, and I have a similar one.

Look at the graph below, from the preprint.

[Figure: best-fit graph from the preprint]

The statistical analysis of the data was designed to measure how far to slide the red curve (the summed proton waveform) left or right so that the black data points (the neutrino observation data) fit it most closely.

The experiment didn’t detect individual neutrinos at the beginning of the trip. The neutrinos were produced by 10-microsecond proton bursts, and they were expected to appear in 10-microsecond bursts at the other end. The time between the bursts, then, should indicate how fast the individual neutrinos traveled.

To get the time between the bursts, slide the graphs back and forth until they align as closely as they can, and then compare the (atomic) clock times at the beginnings and ends of the bursts.
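
In code, that sliding is just a one-parameter fit. Below is a toy least-squares version on invented, binned data; the preprint's actual analysis is an unbinned maximum-likelihood fit, and the pulse shape and numbers here are made up:

# Toy version of the one-parameter fit: slide a scaled template waveform
# until it best matches the binned arrival counts. Pulse shape, scale,
# and data are invented; OPERA used an unbinned maximum-likelihood fit.
import math

def waveform(t):
    # smooth stand-in for the 10.5-microsecond proton pulse shape
    return math.exp(-((t - 5250.0) / 3000.0) ** 2) if 0 <= t <= 10500 else 0.0

# fake binned neutrino counts: the same shape, scaled down, shifted 1048.5 ns
observed = [(t + 1048.5, 0.3 * waveform(t)) for t in range(0, 10501, 150)]

def sse(shift, scale=0.3):
    # sum of squared residuals between the data and the shifted template
    return sum((count - scale * waveform(t - shift)) ** 2 for t, count in observed)

best_shift = min(range(900, 1200), key=sse)
print("best shift:", best_shift, "ns")  # ~1048 with this fake data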

For this to give the right travel time, and more importantly, to be able to evaluate the statistical uncertainty, the researchers appear to have assumed that the shape of the proton burst upstream of the graphite rod exactly matched the shape of the neutrino burst at the detector (once adjusted for the fact that the detector sees about one neutrino for each 10 million billion or so protons in the initial burst).

Why should the shapes match exactly? If God jiggled the detector right when the neutrinos arrived, for example, the shapes might not match. More scientifically plausibly, though, at least to this somewhat-naïve-about-particle-physics mathematician, what if the protons at the beginning of the burst were more likely to create detectable neutrinos than those at the end of the burst? Maybe the graphite changes properties slightly during the burst. [Update: It does, but whether that might affect the result, I don’t know.] Or maybe the protons are less energetic at the end of the bursts because there’s more proton traffic.

The authors don’t tell us why they assume the shapes match exactly. There might be good theory and previous experimental results to support the assumption, but if so, it’s not mentioned in the paper. The authors do remark that a given “neutrino detected by OPERA” might have been produced by “any proton in the 10.5 microsecond extraction time.” But they don’t say “equally likely by any proton.”

If protons generated early in the burst were slightly more likely to yield detectable neutrinos, then the data points at the left of the figure should be scaled down and those at the right scaled up, if the observational data is expected to indicate the actual proton count across the burst.

If that’s the case, then the adjusted data might not have to be shifted quite so far to best match the red curve. And the calculated speed would be different.

Whether this would make enough of a difference to bring the speed below light-speed, I don’t know and can’t guess from what’s in the preprint. And of course, there may be good reasons for same-shape bursts to be a sound assumption.

[Disclaimer: I’m a mathematician, not a statistician or a physicist.]

7 Responses to “My $0.02 on the FTL Neutrino Thing”

  1. Steve Kass » Heteroscedasticity in the Residuals? Says:

    […] family of time-shifted scaled copies of the summed proton waveform) was not appropriate. See my previous post for some […]

  2. Joe Says:

    You kind of shoot yourself in the leg with your speculations.

    You go on about how uncertainty about the neutrino creation process could have distorted the resulting measurements.

    But if you look at the graph you posted it seems clear that there are multiple peaks within the graph that are shifted by exactly the same amount as the whole graph.

    The red line is a computer prediction based on neutrinos traveling -at- the speed of light.
    Notice that the shape of the red graph pretty much has exactly the same shape as the data points, just shifted.
    This means that the simulation used for the prediction has a very precise understanding of the neutrino generation process and what the resulting measurement amplitude series will be.
    The only discrepancy is the detection time.

    If what you say were true then the arriving data points would have had distorted rise and fall but would otherwise have its peaks match the predicted graph to at least fall on the speed of light instead of faster than the speed of light.

    So based on that graph I think you are thinking in the wrong direction to find the flaw (if there is one).

  3. Steve Says:

    You’ve missed my point.

    “Pretty much exactly the same shape” is not a statistical or mathematical statement. The data (black points) do not fit the red curve exactly when shifted. They come close, and among all possible horizontal shifts, 1048.5 ns gives the closest fit. But the six-sigma statistical claim assumes that the distribution from which the black data points were a random sample is a copy of the shifted red line and not any similar but different shape.

    This assumption is not addressed in the paper. The shifted red line used for the statistics is the shape of the proton waveform hundreds of miles upstream of the detector at Gran Sasso. The data is not a random sample of protons from that waveform. The data is a sample (presumably random) of neutrinos hundreds of miles away, produced from the precisely-understood waveform of protons by several intermediate processes (including pion/kaon production when the proton beam strikes the graphite target and subsequent decay of the particles produced at the target into neutrinos later on). The arrival waveform clearly has a similar shape, but the authors give no theoretical or statistical evidence to suggest it must have an identical shape.

    If the intermediate processes systematically change the shape of the proton waveform even slightly (as it becomes a pion/kaon waveform and then a neutrino waveform), the statistics reported are not valid.

    In addition, the data in the paper is only a summary of the actual data into bins (150 ns wide for Figure 11, and 50 ns wide for Figure 12). The experimental result yields a neutrino speed only 60 ns faster than light-speed, so it’s impossible to “notice” the best fit to such high precision only from the paper’s graphs. In Figure 11, where the multiple peaks are visible, “exactly the same amount” can’t be determined to 60 ns accuracy. Even if the black data points, when shifted by 1048.5 ns, all lay exactly on the red line (and they do not at all), one cannot conclude that the actual data (not given in the paper, which summarizes it into bins) fits just as perfectly.

  4. Philip Meadowcroft Says:

    Is the same true at the detection end? If the first detection in any way compromises the likelihood of another detection in the same burst.

    May be insignificant due to the low number detected per burst.

  5. Gareth Williams Says:

    OK, so add an extra parameter. Scale the red line from 1 at the leading edge to a fraction k at the trailing edge (to crudely model the hypothesis that the later protons, for whatever unknown reason, are less efficient at producing detectable neutrinos), and find what combination of translation and k produces the best fit.

    If there is no such effect we should get the same speed as before and k=1. But if we get speed = c and k = 0.998 (say) then we have an indication where the problem is.

    It would be interesting in any case to just try a few different constant values of k and see how sensitive the result is to that.

    (It also occurs to me that k could arise from a problem with the proton detector, if the sensitivity changes very slightly from the beginning to the end of the pulse you would get the same effect).

    This does not look too hard. I would do it myself but I am busy today [/bluff]

  6. Steve Says:

    Philip: I think there was a similar question at the news conference given by OPERA, and it was answered to the satisfaction of the person who asked.

    Gareth: Yes, absolutely. If the complete neutrino arrival data is posted, I might try this. But I would be happy to see you do it for me!

  7. Gareth Williams Says:

    What you said, I think:

    http://arxiv.org/PS_cache/arxiv/pdf/1110/1110.5275v1.pdf
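
Gareth's two-parameter fit from comment 5 is easy to prototype. A minimal sketch on invented data; the pulse shape, the scale, and the true values of the shift and k are all made up, and a real version would fit OPERA's unbinned arrival data:

# Prototype of the comment-5 suggestion: scale the template linearly
# from 1 at the leading edge down to k at the trailing edge, then fit
# the shift and k jointly by a grid search.
import math

T = 10500.0  # pulse length, ns

def waveform(t):
    # smooth invented stand-in for the proton pulse shape
    return math.exp(-((t - T / 2) / 3000.0) ** 2) if 0 <= t <= T else 0.0

def template(t, k):
    # detection efficiency ramps linearly from 1 down to k across the pulse
    return waveform(t) * (1 + (k - 1) * t / T)

# fake data generated WITH such a ramp (k = 0.8) and a 1000 ns shift
observed = [(t + 1000.0, 0.3 * template(t, 0.8)) for t in range(0, 10501, 150)]

def sse(shift, k, scale=0.3):
    return sum((c - scale * template(t - shift, k)) ** 2 for t, c in observed)

best = min(((sse(s, k / 100), s, k / 100)
            for s in range(900, 1101, 5) for k in range(60, 101, 2)),
           key=lambda p: p[0])
print("best shift:", best[1], "ns; best k:", best[2])  # recovers 1000 and 0.8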


[Image: the Statistical Abstract of the United States]

The authoritative and comprehensive summary of statistics on the social, political, and economic organization of the United States.

Published annually by the United States Census Bureau’s Statistical Compendia branch. Since 1878.

Slated for termination.

The 2012 budget does not include funding for the Statistical Compendia Branch which would mean the elimination of not only the Statistical Abstract, but all titles produced by that branch (State and Metropolitan Area Data Book, County and City Data Book, USA Counties, Quick Facts). No new editions would be produced in print or online. [source]

Cutting the branch will save $2.9 million, or just under $0.01 per American. The branch’s 24 employees will go.


“Killing the publication for the sake of a tiny saving would be a truly gratuitous step toward a dumbed-down country.” —Paul Krugman

“Without the Stat Abstract, statistics will become more hidden, and our collective knowledge will suffer.” —Robert J. Samuelson

“It democratizes knowledge by making enormous amounts of information comprehensible and easily accessible.” —E. J. Dionne

“Never heard of the Statistical Abstract? Go here. You’re in for a treat. ” —Ezra Klein

Video. Article. Write your representatives.


The Soda Police are getting noisier lately, but their concern for public health is a subterfuge. When it comes down to brass tacks (and I doubt brass’s slight lead content is going to kill you when used judiciously in plumbing, by the way), the S.P. don’t care most about the public health or about overweight kids at risk for diabetes and heart disease. They’re hell-bent on demonizing soda, especially soda made by Big Food and sold by the Big Chain Store and Restaurant Corporation.

Demon or not, it probably won’t hurt Americans to drink less soda on average than we do now. It will definitely help the environment if we drink less of anything that comes in individual single-use containers — even water — if there’s an environmentally friendly alternative already in place.

Here’s a simple two-part proposal to bring back running water.

BBRW Part 1. Require public water fountains everywhere.

Schools, parks, subway stations, airports, shopping centers, offices, stores, and more. We already require a lot of things, sensible and otherwise, so the means is in place. Require enough of them so no one has to wait in line. These water fountains (bubblers in Wisconsin and parts of New England) should have good water pressure, and they should be designed so they can fill up a bottle, too — or there should be some faucets for that. Simply making it possible to fill a personal water bottle in an airport — and yes, you can carry one through security so long as it’s empty — will reduce heart disease.

No flow restrictors, either; use spring-loaded knobs to conserve. (I’m not going to say a word about those infrared hand-wavy travesties.) Restrictors belong in kitchens and showers, if anywhere. It doesn’t need to take ten minutes to deliver half a cup of water. ADA compliant, but otherwise basic and solid. Call me nostalgic, but I like porcelain-coated cast iron.

Room-temperature, pure water is already available from every municipal water system. Only a little effort makes it ubiquitous. (If you’re afraid it will give you cancer, carry your own personal PET-free container full of home-purified water.)

BBRW Part 2. Require water to be available everywhere soda is available, for less.

If a restaurant offers a meal that includes soda, require it to offer the same meal with the same size tap water for less money. Less by at least half the restaurant’s own à la carte price for the included soda. Except during water emergencies, require restaurants to offer tap water when patrons are seated.

 

Stop the endless debates over soda vs. fruit juice, sugar vs. high-fructose corn syrup, artificially sweetened beverages vs. sugary ones, and aspartame vs. stevia extract. Bring back running water.

One Response to “Bring Back Running Water”

  1. Jenne Says:

    Right on, Dr. Kass! as a mommy (read: permanent entourage) of a 2 year old, I’m astonished how many public-funded places either don’t have water fountains, or have faulty ones. And getting a cup of water from a retail establishment often involves complicated gyrations, as the standard is selling you a bottle of water.
    Bring back the water fountain, and have a tap on it for cups/bottles! Yes!


Year after year, thousands of Americans are devastated to discover that their community has been stricken by a disease cluster. Some rare and frightening disease of unknown cause has visited their community like a plague. Residents are afflicted at rates many times the national average.

Despite years of study, billions of dollars, massive lawsuits and at least two Hollywood movies, little progress has been made towards understanding, let alone preventing, disease clusters.

The general public continues to suspect and blame environmental causes, especially chemicals with names that are hard to pronounce. The real reason for most disease clusters is likely something else.

Math.

Yes, math. Look at this map.

[Map: 2009 county rates of aleatorum gravis; click for the full US map]

This map shows the 2009 rate of aleatorum gravis, an emerging and debilitating disease that currently affects only one American in 5,000. In some communities, however, the disease is rampant. Counties with rates more than five times the national average are shaded in red, and those with more than twice the national rate are shown in the darkest shade of green. [Click on the image or here for the full U.S. map.]

Nebraska.

Clusters of a. gravis are concentrated in the nation’s heartland, especially Nebraska and neighboring states. Why? If you wanted, you could look for and find potential causes alarmingly close to each cluster. A gas pipeline, a chicken farm, a power plant, a landfill. Or you could have a lawyer look for you.

Look as hard as you want, but the fact is that the cause of these disease clusters is mathematics. There is no such thing as a. gravis. The map shows the result of randomly giving each U.S. resident the disease with a 1-in-5,000 chance. (Mathematica notebook and links to population and geographic data files available on request.)

Math, indomitable math, caused these clusters.
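
My notebook was in Mathematica, but the gist fits in a few lines of anything. Here's a sketch in Python; the county populations are crude invented stand-ins for the real census figures the map used:

# Give every resident a 1-in-5,000 chance of "a. gravis," then count the
# counties that light up as clusters. County populations here are crude
# invented stand-ins for the census data behind the real map.
import numpy as np

rng = np.random.default_rng(2011)
p = 1 / 5000
pops = rng.choice([800, 5_000, 40_000, 900_000], size=3000)  # ~3,000 "counties"
cases = rng.binomial(pops, p)    # one binomial draw per county
rates = cases / pops

clusters = rates > 5 * p         # more than five times the national average
print(clusters.sum(), "disease clusters, purely by chance")
print("largest county among them:", pops[clusters].max())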

Randomness.

Cases of non-communicable disease come in clusters just by chance. So do bags of M&Ms that have more blues than average, but it’s hard to drum up fear about them. Randomness and uniformity are not the same thing.

By chance alone, some counties will end up with higher rates of any randomly-occurring disease than other counties. The Central Limit Theorem proves it. Which counties is anyone’s guess, but because of the Law of Large Numbers (not subject to repeal), small counties are more likely than large ones to end up with unusually high (or low) rates. Sanity check: When was the last time you read about a disease cluster the size of a large city, as opposed to a census tract, county, or neighborhood?
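
Here's that size effect in miniature: hold the disease rate fixed at 1 in 5,000 and vary only the county population (everything invented, as before):

# The same 1-in-5,000 disease everywhere; only county size varies.
# Watch how often a county exceeds merely twice the national rate.
import numpy as np

rng = np.random.default_rng(0)
p = 1 / 5000
for pop in (1_000, 10_000, 100_000, 1_000_000):
    rates = rng.binomial(pop, p, size=10_000) / pop  # 10,000 simulated counties
    print(f"population {pop:>9,}: {np.mean(rates > 2 * p):6.2%} exceed twice the rate")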

Science.

Do some diseases have non-random environmental causes? Sure. Cholera, to give a famous example. That’s why local and national governmental agencies like the CDC and the National Cancer Institute take reports of disease clusters seriously. But the good scientists there also understand the math, and I trust their advice about public health policy more than what I hear on the local newscast, on Oprah, or from yet another celebrity non-scientist.

Reminder: John Snow was a scientist. (He also drew a map to make his point, which was a darn good idea.)

By the way, you don’t even have to be in a red county to jump on the bandwagon of fear and woo. You can still decide your neighborhood is a disease cluster (when it’s not), get everyone riled up, and make a scary video. Or you can write for a shameful woo-purveying media giant. For free. Specifically, the one behind the recent stench of pseudoscience in the air about disease clusters, and who’s getting no link from here. If the miasma theory of disease were true, scientists would be dropping like flies from what they read.

Pseudoscience and pandering to unjustified fear waste society’s resources and sidetrack scientists from research that might make the world a better and less scary place.

3 Responses to “Math Causes Disease Clusters”

  1. Rene Najera Says:

    Thank you for this. It’s exactly what I wanted to say in the follow-up to the “scary video”. I just got sidetracked by anti-vaccine matters. Thanks again.

  2. Deb Leddon Says:

    Hello,
    Your piece on ‘clustering’ due to math artifact/defect?, above is quite interesting. Could you please send the notebook with associated data files mentioned above to me?

    Would like to take a look. Thanks very much for your time and this offer,

    regards,
    Deb

  3. Math 131 Thursday, October 30 | Jeff Morford's Current Courses Says:

    […] interesting blog post by Kass can be found at http://www.stevekass.com/2011/04/13/math-causes-most-disease-clusters/. Why are rural counties more likely to have higher incidences of the […]


I spent a good chunk of the last 24 hours at one of my favorite hangouts, Language Log. My reason for lingering was to pore over (in good company) some interesting graphs Mark Liberman had put up about the ever-controversial adverb “literally.” [Link: Two Breakfast Experiments™: Literally]

The graphs purported to show, inter alia, “a remarkably lawful relationship between the frequency of a verb and the probability of its being modified by literally, as revealed by counts from the 410-million-word COCA corpus.” [Aside: Visit COCA some time. It’s beautiful, it’s open 24/7, and admission is free.]

[Graph: verb frequency vs. probability of modification by “literally”]

Sadly (for the researcher whose graphs Mark posted), there was no linguistic revelation; happily (for me and other mathophiles) the graphs highlighted a very interesting statistical artifact. Good stuff was learned.

Instead of rehashing what you can find in the comment thread at Language Log, what I’ll do here is give a non-linguistic example of this statistical artifact. First, a very few general remarks about statistics.

Much of statistics is about making observations or drawing inferences from selected data. In a nutshell, statistical analysis often goes like this: look at some data (such as the COCA corpus), find something interesting (such as an inverse relationship between two measurements), and draw a conclusion (in this case, a general inference about American English, of which COCA is one of the largest samples available in usable form).

Easy as a, b, c. One, two, three. Do, re, mi.

Sometimes. The mathematical underpinnings of statistics often make it possible, given certain assumptions, to make inferences from selected data with some (measurable) measure of confidence. Unfortunately, it’s easy to focus so hard on measuring the confidence (Yay, p < 0.05! I might get tenure!) that you forget the assumptions or you get careless about how you state an inference or calculation.

When bad statistics happens, there’s often a scary headline, but I can’t think up a good one at the moment, so I’ll go straight to the (artifactual) graph.

[Graph: 2009 homicide rate vs. city size, cities with at least 10 homicides]

This graph shows that for not-too-small cities, there’s a modest negative relationship between city size and homicide rate: on average, smaller cities tend to have higher homicide rates.

But the truth is that among not-too-small cities, smaller cities don’t tend to have higher homicide rates than larger ones. Here’s a better graph:

[Graph: 2009 homicide rate vs. city size, all cities]

This graph shows almost no relationship between city size and homicide rate.

What’s going on, and what’s wrong with the relationship that shows up (and is real) in the first graph? The titles hold a clue (but don’t count on such clear titles when you see or read about graphs in the news). The first graph only shows cities that had at least 10 homicides in 2009. For that scatterplot, cities were selected for analysis according to a criterion related to the variable under investigation, homicide rate. That’s a no-no.

The 10-homicide cutoff biased the selection of cities used in the analysis. Most very large cities show up simply because they’re large enough to have 10 or more homicides, but the smallest cities that appear are only there because they had high enough homicide rates to reach the 10-homicide threshold despite their relatively small populations. For the first graph, I (pretendingly) unwittingly chose all large cities together with only some smaller cities, specifically smaller cities with unusually high homicide rates for their size. Then I “discovered” that smaller cities had higher homicide rates.
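
You can watch the artifact appear without any real data at all. In the sketch below, every simulated city has exactly the same true homicide rate, yet the 10-homicide cutoff still manufactures the negative relationship (all numbers invented):

# Every city here has the SAME true homicide rate. Selecting cities with
# at least 10 homicides still produces a negative size-vs-rate correlation.
import numpy as np

rng = np.random.default_rng(1)
true_rate = 5e-5                      # identical for every city
pops = rng.integers(50_000, 2_000_000, size=2_000)
homicides = rng.binomial(pops, true_rate)
rates = homicides / pops

big = homicides >= 10                 # the biased selection
print("all cities:         r =", round(np.corrcoef(np.log(pops), rates)[0, 1], 3))
print("10+ homicides only: r =", round(np.corrcoef(np.log(pops[big]), rates[big])[0, 1], 3))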

Oops. It’s an easy mistake to make, and it wouldn’t surprise me if it happens often. I can easily imagine medical studies that compare the rates of some disease among cities and exclude any city that has “too few” cases of the disease to analyze.

Statistics is a powerful tool. Follow the instructions.

One Response to “Take (Some of) the Data and Run”

  1. Dan H Says:

    Your statistical analysis is, as far as I can tell, spot on, but I think you’re being a little harsh on the original Language Log post. While you’re right that it’s important to realise that the data *does not* show that there is a blanket negative correlation between frequency of use of a word and its frequency of combination with “literally” it *does* show such a correlation for the specific subset of words analysed (words which are frequently combined with literally).

    Similarly, the correlation shown in your first graph of murder rate vs city size is actually a perfectly legitimate one, as long as you’re clear about what you’re actually looking at. If for some reason I was forced to live in a city which had at least ten murders per year, then I would absolutely want that city to be as large as possible, because otherwise I’d find myself living in a small town with a disproportionately large murder rate.

    To put it another way, I think there *are* legitimate reasons to be interested in studying the specific subset of words that are frequently combined with a particular modifier, it’s just very important not to overgeneralise from the specific case.


The American Stroke Association is having a conference in Los Angeles (near Hollywood) this week. The disease-ridden news coming out of that conference is full of numbers, so reporters are cooking up bigger-than-usual batches of scare.

Yesterday’s stroke news was an unjustified scare about stroke and younger people.

Today’s stroke news: “Is The Oscar Ticket to Heart Attack, Stroke?”

[Public domain images (Source: Wikimedia Commons)]

According to a recent study by UCLA researchers, 7.3% of 409 Oscar nominees for best actor or actress since 1927 had strokes (based on public records), a number the senior study author cautions is “sure to be an underestimate.” Scary?

ABC News wants their article to be scary, so they imply a wrong answer to the questioning headline with this wrong statistic: “The lifetime risk of stroke in the United States is roughly 2.9 percent, according to a 2010 report from the American Heart Association.” Oscar nominees’ higher-than-7.3% stroke rate is now officially scary. It’s several times the average!

Except that it’s not. The 2.9% figure ABC quotes is wrong. The number 2.9% does appear in the American Heart Association report, but it’s not the lifetime risk of stroke among Americans. It’s the prevalence of (having had a) stroke among American adults, young and old combined — the percentage of Americans who had had a stroke before the data-gathering took place, not who will have a stroke before they die. Many of the 97.1 percent who hadn’t had a stroke when surveyed will have a stroke later in their lives.

According to the same AHA report, stroke accounted for about 137,000 deaths in 2006, or one of every 18 deaths in the United States in 2006. One out of 18 is more than 5%, and that’s just the stroke deaths. The lifetime risk of stroke must then be at least 5%, and it’s probably a lot higher. Only about 1 in 6 strokes is fatal, so the lifetime incidence of stroke could be as high as 30%. In any case, it’s considerably higher than 2.9%, the figure ABC gives.
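
Here's the back-of-the-envelope arithmetic in one place (the one-in-six fatality figure is the rough number used above):

# Rough arithmetic behind the paragraph above, using the AHA figures.
stroke_share_of_deaths = 1 / 18  # stroke caused one of every 18 US deaths in 2006
print(f"share of deaths from stroke: {stroke_share_of_deaths:.1%}")  # more than 5%

floor = 0.05            # so the lifetime risk of a fatal stroke is at least 5%
fatal_fraction = 1 / 6  # only about 1 in 6 strokes is fatal
print(f"implied lifetime risk of any stroke: up to about {floor / fatal_fraction:.0%}")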

So. The real news is “Like Other People, Actors Sometimes Have Strokes.” In fact, that’s more or less what the authors of the study set out to say. They wanted to increase public awareness about stroke prevention. When famous people get this or that disease, the general public’s awareness of the disease increases (at least for a while), and those who go to big disease conferences may want more visibility for the specific disease they study.

