"Leveraging the long tail": Open-access annotation of gene function at Wikipedia

Posted Sep 09 2008 2:08am

In the post-genomic age, annotation serves an essential function: it takes us from a point where we’re struggling with the allegorical phone books of raw sequence to a point where we’re productively accessing comprehensively compiled knowledge about the function of specific genes. Typically, gene annotation is accomplished in part by machines (e.g., computers running hidden Markov models to identify open reading frames, BLASTing to identify known homologs in other organisms, and occasionally bootstrapping their way to predicted functions) and in part by expert human curators, who (in ways that machines still can’t) collate and summarize published studies that have addressed the functions and interactions of specific genes.

The process used to work pretty well, back in the day when the only complete eukaryotic genome was the genetically tractable and well-studied budding yeast S. cerevisiae — but now it’s starting to come apart. The number of genes, while still imprecisely known, is unquestionably finite, but as the number of studies and interactions grows, the problem of annotation is growing exponentially — still finite, certainly (?) but far beyond the ability of the top-down expert-only annotation systems to keep up.

Because the process is bottlenecked by curators (delightful human beings, to be sure, but who have finite energy and time at their disposal), it takes time for new data to be incorporated into existing systems. Therefore, the most recent findings regarding a given gene, which often reflect the most exciting new direction for research, are frequently missing from curated functional annotations.

Another problem with expert curation is that the types of information included may implicitly reflect the priorities and agendas of the experts doing the curating — again, not to bag on curators, but it’s simply impossible for a small group of people to identify all the different ways that everyone in the scientific world might want to grope each of >30,000 different elephants. (For those of you who missed it: This is the biogerontology “hook” for this post. Have you ever been frustrated when a gene annotation fails to include a reference to a piece of aging-related data on your favorite gene, even when you know it’s out there — and then realized what that means about how many other aging-related annotations are also missing?)

Enter the power of the mobilized mob: In July, Huss et al. announced an initiative to democratize the annotation of gene function, using an established web site that happens to be the most widely used open editing system in the world: Wikipedia. (Perhaps you’ve heard of it.) The idea is to kick-start the (arguably inevitable) process of making Wikipedia a central hub for community-generated annotation of gene function.

In any open science initiative, the first (skeptical) question anyone raises is: Who’s going to do the hard part? For instance: open notebooks might be a great concept in the abstract, but it’s beyond the realm of reasonability to expect every lab in the world to take a month or three to whip up a software solution to enable that idea. (Fortunately, in the case of that example, there’s an answer to the question: OpenWetWare.) The same question applies to annotation: Given that the tools for the open annotation of gene function technically exist (just as they do for the open annotation of every episode of Battlestar Galactica), what barrier must be overcome in order to get people to actually use those tools?

Sometimes an enabling technology can be as simple as a thoughtful template for future efforts of the same kind. Here, the initiators of the project simply created a “stub,” a standard format for new gene entries, automatically populated from existing annotations, which other Wikipedia editors are then free to expand and modify:

In principle, a comprehensive gene wiki could have naturally evolved out of the existing Wikipedia framework, and as described above, the beginnings of this process were already underway. However, we hypothesized that growth could be greatly accelerated by systematic creation of gene page stubs, each of which would contain a basal level of gene annotation harvested from authoritative sources. Here we describe an effort to automatically create such a foundation for a comprehensive gene wiki. Moreover, we demonstrate that this effort has begun the positive-feedback loop between readers, contributors, and page utility, which will promote its long-term success. …

Each gene stub consists of a sidebar detailing the symbols and aliases, external identifiers, gene function (as represented in Gene Ontology), and genomic location. Although gene stubs are primarily focused on human genes, links to their mouse orthologs are also provided. When available, links to the Protein Data Bank are displayed under a thumbnail ribbon diagram, and gene expression patterns across diverse human tissues are shown as thumbnail bar charts. Links to the primary databases are included when available. In addition, the central area of the gene stub shows a gene summary and a list of relevant references in the literature, both of which were provided by Entrez Gene.
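
For the programmers in the audience, the stub described in that excerpt is essentially a record with a fixed set of slots, some of which start out empty. Here is a minimal sketch of that idea in Python. To be clear, this is my own illustration and not the template Huss et al. actually used: the class name, the field names, and the `missing_sections` helper are all hypothetical, the example values are placeholders rather than real gene data, and I have left out the expression chart for brevity.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List, Optional


@dataclass
class GeneStub:
    """Hypothetical record mirroring the stub fields listed in the excerpt."""

    symbol: str
    aliases: List[str] = field(default_factory=list)
    external_ids: Dict[str, str] = field(default_factory=dict)  # database name -> identifier
    go_terms: List[str] = field(default_factory=list)           # Gene Ontology function terms
    genomic_location: Optional[str] = None
    mouse_ortholog: Optional[str] = None
    pdb_ids: List[str] = field(default_factory=list)            # Protein Data Bank links
    summary: str = ""                                           # gene summary text
    references: List[str] = field(default_factory=list)         # literature references

    def missing_sections(self) -> List[str]:
        """Return the names of fields that are still empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Placeholder values only; "GENE1" is not a real gene record.
stub = GeneStub(symbol="GENE1", aliases=["G1"], summary="Placeholder summary.")
print(stub.missing_sections())
```

The empty slots are the point: they are the gaps a passing reader can see and fill in.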

As with most good ideas, the rest is obvious: More and more scientists will find themselves turning to Wikipedia as a first-line source of information about genes that catch their eye. When they notice that an entry is incomplete (say, because it’s missing an important biogerontological implication) or unclear, they’ll make a little change. The little changes will add up. And a self-correcting, bottom-up system for gene annotation will be born.

Hey, I just noticed that there’s no reference to Huss et al.’s paper in the “Genome annotation” subsection of the Wikipedia entry for “Genome project” (link is version-specific, for posterity’s sake). Anyone care to fix that?

