Why Computation is Eating Science
I’m sure many of you are familiar with Marc Andreessen’s seminal essay “Why Software is Eating the World” (another nice related article can be found here). While I agree with much of what Mr. Andreessen says, as a long-time practitioner of scientific computing I believe there is a corollary worth exploring, one with long-term implications for the practice of science: computational methods are increasingly driving science, and so the successful scientists of the future will be those who embrace this new reality.
For many this is a fairly obvious conclusion. Everything from brain imaging to decoding DNA to studying the fundamental components of matter is being driven by computational methods. While the three pillars of science (theory, experiment, and computation) remain central to the pursuit of knowledge, increasingly theoreticians are performing computational studies to validate theories; experimentalists are simulating complex phenomena in order to visualize and understand them; and computational work proceeds apace, gathering and analyzing ever-larger data sets.
The implications of this corollary and of the growing role of computation are profound, and I’m not sure that we as a scientific community fully appreciate them. Two areas in particular have engaged my attention lately. The first, reproducibility, is a central tenet of the scientific method and ensures that our foundational scientific knowledge is rock solid. The second, which I call computational competence, is the treatment of computational skills and infrastructure as a first-class participant in the scientific process. Let me explain.
First, we all know that scientific knowledge is based on the principle of reproducibility. In the earliest days of science this was approached with militant zeal; the motto of the Royal Society (est. 1660), for example, is “Nullius in verba,” which loosely translated means “take nobody’s word for it.” Such societies regularly gathered to perform experiments and ensure that the results were reproducible. Yet despite this core requirement, we still find today that a large amount of scientific work is not reproducible. For example, a recent paper in Nature states that 90% of the results in influential cancer research papers are not reproducible. Several other authors have reported similarly unfortunate results.
The great thing about computation is that it is relatively easy to reproduce results, as long as the data, source code, and publications are available (and you have the necessary computing resources). Thus, with the growing importance of scientific computation, we have a significant opportunity to address the reproducibility crisis through Open Science and its practices of Open Data, Open Source, and Open Access. As computational scientists we can lead by example and hold true to the requirement of reproducibility, and in doing so make a lasting impact on how we practice science.
Second, computational competence is increasingly required in the research process as computation eats science. Back in the day it was pretty typical for researchers to have rudimentary software knowledge and write simple applications. The computing environment was straightforward too, often a standalone computer where data was relatively small and localized. It didn’t take much knowledge to make good use of a computer; rudimentary computing skills went a long way towards benefiting a research program.
But as we all know, times have changed. Scientific computing software is now measured in millions of lines of code, and often multiple systems must be combined to produce a useful workflow. Data sizes are commonly tera-scale, sometimes peta-scale, with exa-scale on the horizon, and the data is often distributed across the web. Computing architectures require complex interactions between multiple computing resources ranging from mobile devices to supercomputers, for example in distributed parallel computing or client-server architectures. Such computing environments require years of training to master and use effectively. In such an environment rudimentary software skills make little impact, and in fact can set back progress through poor software design and implementation.
What’s disturbing to me is that while everyone realizes that computing systems have increased in complexity, the attitudes of too many researchers and funding organizations have not moved much beyond where they were back in the day. For example, I still routinely see research groups (in academic and lab settings) with little or no software version control and little formalized testing, where mostly untrained graduate students are expected to take on complex computing or software maintenance tasks. Or I see large-scale research programs populated with a bunch of scientists, with a few computational types thrown in as an afterthought. This isn’t to disparage the scientists; rather, it’s the research programs that ignore the growing, central role of software, often spending serious money only to toss data on external drives into filing cabinets, or let software rot on a graduate student’s forgotten computer. At the very least this stuff should be shared so others can use it; better yet, research programs should include computational plans that focus on computational processes and competently trained staff. Otherwise the value of the research is diminished.
I suppose I shouldn’t complain; after all, it’s deficiencies like these that enable companies like Kitware to fill the vacuum by providing much-needed software skills coupled with extensive scientific domain knowledge (through in-house experts or collaborative teaming skills). Yet there is so much more that needs to be done. As computation continues to eat science, the practice of science is changing dramatically, becoming ever more reliant on superb software skills and open practices. Research programs that engage computation as a first-class participant will be the ones that succeed in this scientific era, with the added benefit that they honor the reproducibility requirement and hence science itself.