If you’re suffering from Higgs boson overload these days, don’t worry – I’ll only mention it once – I promise. The Large Hadron Collider that supported the search for the Higgs boson is an excellent example of Big Science, an approach that brings together a huge amount of resources and people to tackle key questions in science that could never be addressed by a single scientist in his/her small university lab (which is termed, quite originally, ‘Small Science’). Other examples of Big Science include the Manhattan Project of the 1940s, and the Canadian Light Source synchrotron currently operating in Saskatoon.
While Big Science is a well-known term, I was browsing the NEON website recently and came across the term ‘Big Data’. The concept made intuitive sense, but I hadn’t heard it used in science circles. Shows just how far out of those circles I am. 🙂 The website mentioned they’d be soliciting guest blog posts on the implications of Big Data for various environment-related fields, which got me thinking about how this term might be relevant in hydrology.
From Big Science to Big Data
I spent part of my sabbatical this spring at Oregon State University, doing some work on snow and stream temperature at the HJ Andrews Experimental Forest. To me this site – as well as other LTER sites around the US and non-LTER watershed studies – constitutes a big data project. HJ Andrews was established in 1948, and has many long-term, spatially distributed datasets covering a range of physical and biological processes. These include everything from LiDAR surfaces of the study area to datasets recording the phenological response of various vegetation species within it. One of the neatest talks I saw while at OSU was by PhD student Tuan Pham. He’s developing a data storage and retrieval system that would allow researchers easy access to ecological data, while also highlighting new relationships between datasets that might otherwise be overlooked.
Another project I’ve been involved with is the Southern Rockies Watershed Project (website somewhat out of date unfortunately). This is a multi-catchment watershed study in the Crowsnest Pass region of southwestern Alberta that aims to quantify hydrologic response to wildfire. Just like HJ Andrews, SRWP has generated reams of physical and biological data – but only (‘only’, haha!) for the past 8 years. Keeping track of those datasets – including quality assurance/quality control and providing data access to researchers affiliated with the project – has been a herculean task. And one that hasn’t even begun to tap into the potential for discovering unexplored connections between historically unrelated datasets.
As a final example, I’ve also been (peripherally) involved in the Water and Environmental Hub. This initiative is focused on linking water- and environment-related datasets via one web-based portal, making it easy to find and link datasets both for existing studies and to develop new ones. It currently boasts “1,811 observation datasets and over 146 base feature datasets” – certainly an impressive data collection.
But the key to a successful Big Data strategy lies not just in collecting the data itself. It hinges on knowing how to structure the data streams so they can be ‘mined’ for new insights and ideas that would not be possible without such massive data volumes. In hydrology, this could mean defining linkages between water availability (as a function of spatial snowmelt patterns, soil moisture, canopy interception, etc.) and vegetation distribution, or between vegetation growth and development and topographic attributes such as slope curvature and aspect. Even these ideas seem relatively pedestrian – there’s no telling what new connections could be derived from combining individual datasets into one massive one.
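To make that a bit more concrete, here’s a minimal (and entirely hypothetical) Python sketch of what ‘mining’ linked datasets could look like: joining snowmelt, terrain, and vegetation summaries on a shared grid-cell ID and scanning the correlation matrix for relationships worth chasing. The column names and numbers are invented purely for illustration.

```python
# Hypothetical example: link three datasets on a common spatial key and
# look for cross-dataset correlations. All data here are synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_cells = 500  # imaginary grid cells in a study basin

snow = pd.DataFrame({
    "cell_id": np.arange(n_cells),
    "melt_doy": rng.normal(120, 15, n_cells),   # day of year snow disappears
    "peak_swe_mm": rng.gamma(4, 50, n_cells),   # peak snow water equivalent
})
terrain = pd.DataFrame({
    "cell_id": np.arange(n_cells),
    "aspect_deg": rng.uniform(0, 360, n_cells),
    "curvature": rng.normal(0, 1, n_cells),
})
veg = pd.DataFrame({
    "cell_id": np.arange(n_cells),
    "ndvi_aug": rng.beta(5, 2, n_cells),        # late-summer greenness index
})

# Link the datasets on the shared key, then scan for relationships
merged = snow.merge(terrain, on="cell_id").merge(veg, on="cell_id")
print(merged.drop(columns="cell_id").corr().round(2))
```

In a real project the joins would be spatial and temporal rather than a simple key match, but the basic idea – link first, then look for structure – is the same.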
This is where data visualization and data mining techniques become critical, requiring a new breed of scientist versed in a combination of environmental science processes, computer science techniques, and visual design. We need to employ techniques espoused by Yale’s Edward Tufte – particularly in his book Envisioning Information – which aims to develop ‘visual literacy’ not just in scientists, but also in business people, artists, etc. His key principles are to “Tell the truth. Show the data in its full complexity. Reveal what is hidden.” Sounds like a relevant prescription for Big Data projects.
Big Data is further complicated by the advent of wireless sensor networks, which allow researchers to generate high spatial and temporal resolution datasets at a lower cost than was previously possible. For example, Roger Bales’ group at the University of California Merced has set up a sensor network in the Sierra Nevada to monitor everything from snow volumes to water availability across a large geographic region. This approach used to be considered the future of hydrologic sciences, but is rapidly becoming commonplace. Not only does it generate huge amounts of data that require processing and analysis, but – as with all projects – it requires some finesse in network design to determine what needs to be measured, where, and at what timestep, in order to develop a coherent picture of what’s happening within a research basin.
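As a toy illustration of the timestep side of that design question, here’s a hedged Python sketch showing how aggregating a made-up one-minute snow-depth record to hourly or daily means trades data volume against the variability you can still see. The station and its signal are fabricated for the example.

```python
# Toy example: how the reporting timestep changes what a sensor record shows.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
times = pd.date_range("2012-03-01", periods=24 * 60, freq="min")  # one day, 1-min data
signal = 50 + 5 * np.sin(np.linspace(0, 2 * np.pi, times.size))   # smooth diurnal pattern
depth_cm = signal + rng.normal(0, 1.5, times.size)                # plus sensor noise

series = pd.Series(depth_cm, index=times, name="snow_depth_cm")

# Aggregating to hourly vs. daily means: each step cuts data volume,
# but also smooths out variability a researcher might actually be after.
print(series.resample("1h").mean().describe())
print(series.resample("1D").mean())
```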
The requirement that we analyze and interpret Big Data in new ways in order to fully exploit its potential reminds me of our transition from home computers to smartphones, from point-and-click mice to touch screens and sliding menus. We have to rearrange our thinking and reconfigure our mental maps, to look for surprises and oddities rather than the patterns we’ve become used to. Coleridge talked about the ‘suspension of disbelief’ in relation to our ability to believe the reality of events occurring in a novel – in this case we have to suspend our preconceptions and be open to making new findings that may not match our hypotheses.
Big Data is potentially a huge step forward for hydrology – a way to integrate satellite and field data from a range of water-related disciplines. But can we harness it to our advantage?