Error in Scientist Mom's Vaccine & Autism Data Analysis
Posted Nov 11 2008 7:39am
Back in September there was some noise about a post by someone I'll call "Scientist Mom" (apparently she doesn't use a pseudonym at all) titled The Correlation that Does Indicate Causation. I didn't want to even read the post back then because I had a feeling I would become involved in analyzing the data and spend way too much time that I was supposed to spend doing something else. The obvious critique of such an analysis, without knowing much about it, is that it was a pirates vs. global-warming type of correlation. Orac slammed Scientist Mom for it, and rightly so.
In my last post on the (lack of) association between rainfall and autism I had used birth-year data from California. I thought a natural extension of that work was to apply a detrended cross-correlation analysis to the caseload data and Scientist Mom's vaccine data.
Well, to my disappointment, the post doesn't provide any usable vaccine data. It's more of a qualitative analysis, where Scientist Mom just lists vaccines that were recommended during different time periods.
I noticed a significant error in the analysis, however. It has to do with Scientist Mom's key claim:
Most compelling of all, there was no increase in the percentage of autism cases in 2002-2004, when no vaccines were added to the childhood schedule.
I wonder if the error is obvious to some of my readers. If I mention "left censorship" as a hint, do you see the problem now? What if I mention that in my last post I decided to left-censor California birth-year autism caseload such that I only used data up to 2000?
You see, a series of autism prevalence by birth year always has a hook shape on the right-hand side of the graph. It doesn't matter whether the prevalence is surveyed in 2004 or in 1994; the hook is always there. The following is an IDEA graph representing prevalence by birth year, as reported in 2001, 2002 and 2003.
Not only is there a natural decline in prevalence for recent birth years because some autistics are diagnosed late; it's also the case that prevalence by birth year data is not fixed in time. If we request new data from California DDS next year, the figures can change for every birth year, and they likely change considerably for recent birth years.
This is a common mistake. Mark Blaxill has fallen for it. The Geiers have as well, assuming they didn't know what they were doing.
One way to solve the issue is to left-censor the data. Basically, you only consider the birth year data that is more likely to remain stable in the future.
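To make that concrete, here is a minimal sketch of the cutoff in Python. The function name, the `min_age` parameter and its value of 8 are my own assumptions for illustration, and the counts are made up; they are not California DDS figures.

```python
def drop_recent_birth_years(caseload_by_birth_year, report_year, min_age=8):
    """Keep only birth years at least `min_age` years before the report year.

    `min_age` is an assumed threshold for "likely to remain stable";
    pick it based on how late diagnoses tend to arrive in your data.
    """
    cutoff = report_year - min_age
    return {
        year: count
        for year, count in caseload_by_birth_year.items()
        if year <= cutoff
    }


# Made-up counts, for illustration only.
example = {1996: 100, 1998: 110, 2000: 120, 2002: 90, 2004: 40}
stable = drop_recent_birth_years(example, report_year=2008, min_age=8)
```

With a 2008 report and `min_age=8`, only birth years up to 2000 survive, so the hook on the right-hand side never enters the analysis.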
I believe a much better way to solve the issue (although this is not always feasible) is to use data on prevalence at a given age in a given year, e.g. the prevalence among 3-year-olds in the system in a given year. This data shouldn't change with time. This is the type of data that was used when there was a debate over the expected decline in the California DDS 3-5 caseload. And as you may recall, the 3-5 prevalence continued to increase.
Since California DDS provides birth year data as reported in different years, we can estimate the caseload of autistic 3 year olds from, say, 2000 to 2007. You basically look at each of the 32 files (8 years times 4 quarters) for the years we're interested in, and get the birth-year caseload of the report year minus 3. The resulting graph follows.
This is an approximation, of course. Consider that in the 03/2002 report, fewer children born in 1999 will be in the system than in the 12/2002 report. Hence the seesaw pattern.
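The lookup described above (report year minus 3, once per quarterly file) can be sketched roughly as follows. I'm assuming each quarterly report has already been parsed into a dictionary of birth year to caseload; the `reports` layout and the function name are hypothetical, and the numbers are made up rather than taken from the DDS files.

```python
def caseload_at_age(reports, age=3):
    """Build a time series of the caseload for one age group.

    `reports` maps (report_year, quarter) to {birth_year: caseload}.
    For each report we pick the birth year equal to report_year - age.
    Returns a list of ((report_year, quarter), caseload), oldest first.
    """
    series = []
    for (report_year, quarter), by_birth_year in sorted(reports.items()):
        birth_year = report_year - age
        if birth_year in by_birth_year:
            series.append(((report_year, quarter), by_birth_year[birth_year]))
    return series


# Made-up counts, for illustration only.
reports = {
    (2002, 1): {1998: 50, 1999: 30},
    (2002, 4): {1998: 55, 1999: 42},
    (2003, 1): {1999: 45, 2000: 33},
}
series = caseload_at_age(reports)
```

In this toy input the count rises within 2002 (the 1999 cohort keeps entering the system) and then drops in the first quarter of 2003, when the lookup switches to the 2000 cohort. That switch is where the seesaw comes from.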
The point is that Scientist Mom is mistaken in her finding that the prevalence of autism dropped or was stable after 2002. This completely undermines her analysis, since that was her key claim.