Hosting the World’s Scientific Data


The good news is that Google has generously offered to host the world’s scientific data sets on its servers for free. That fits pretty well with their mission to index the world’s information. If the data isn’t accessible, it’s hard to index.

Stefan has posted a video of Michael Jones discussing this at the AGU on Dec 15th.

At the AGU Fall meeting, Google’s Michael Jones outlines Google’s recommendation to the Obama administration to link increased funding for basic scientific research to an obligation to share the resulting data freely. Watch the video of the speech.

For those who didn’t watch the speech, here’s the gist. With scientific knowledge, the lesson of history is: "share it or lose it (to the Visigoths)." The lesson for the future will be: "share it or lose it (to the haystack, to bit rot, or in the walls of ivory towers)."

Alas, Stefan continues with the bad news:

Google’s also been planning to offer to host this data for free, though that project has now been put on the back burner, apparently a victim of the current financial crisis.

This comes just a few days after MTJ’s speech. So what can change so much in a few days?

I don’t quite buy the financial crisis story. Relatively speaking, this would be a very minor expense, and Google could spread it out: call the first 12 months a limited testbed, then scale up to more impressive numbers once the economic situation has stabilized (or not). They also could have pulled the plug long before MTJ’s speech; the economy didn’t just start to dive on December 17, did it? Plus, as Google and Microsoft should know, the time to make investments (if you have the cash) is during the lean times, especially in things that will take several years to bear fruit.

My guess: the political angle is more important here, and not in the way you might think. Obama has yet to announce his pick for national CTO. That person could be a top Googler, or someone openly favorable to Google. So any program that looks like the government favoring Google over others might be problematic without a more measured approach: open competition, open bidding, etc. If so, the program could be reintroduced next year. At least I hope so.

Fortunately, Amazon is still going ahead with their business model that includes hosting scientific data. Their model makes money on the CPU processing that often goes along with it, so they have an incentive to plow ahead regardless.

BTW, I don’t know of any Microsoft effort on this front, but that’s not saying much (for my knowledge of the company’s plans, I mean).

However, it’s clear that MTJ is totally right about the need. Without a way to find all previously collected data, we are doomed to repeat the process again and again.

Frankly, I would call on any serious academic journal to reject any scientific paper that didn’t make its raw data (or source code, in the case of comp sci) publicly available in some reasonable form.*

 

____________

* I’ve previously gone much further than that: Any journal publishing a paper should feel obligated to publish (pending peer review for accuracy) any new paper refuting the original paper’s claims, with equal prominence. All too often, refuting claims of someone "more established" is seen as "career suicide," whereas it should rightly be viewed as heroic and in the true spirit of science. This would go some way to fixing that.

 
