There’s More to Big Data than big data
It’s easy to believe that Big Data is just the current technology fad; it seems everyone has jumped on the bandwagon. Various experts cite it as the next frontier for innovation, competition, and productivity, and many governmental and research programs are in full swing. Yet anyone paying even cursory attention to the deluge of information coming from sensors, communication, and computation – not to mention our own personal experience with overrun email accounts, social streams, and media files – knows that the data problem is very real. So while I do not believe that Big Data is merely the latest technology fashion, I do remain uncomfortable with the phrase in that it only vaguely describes the problem, and doesn’t even begin to prescribe a solution. The phrase suggests that the data is large, complex, and hard to handle, and possibly implies that there are real nuggets of value to be found, but it doesn’t address the questions that the practically minded want answered. That is, just exactly how does one go about addressing the Big Data problem, and how does one take advantage of the opportunities it presents?
The bad news is that there are no simple answers. Most organizations find that generic solutions don’t exist, and that off-the-shelf software doesn’t address their needs. The reason is simple: organizations have their own unique workflows and data. Plus, the value of data is necessarily tied to the questions asked of it, and each organization asks different questions. Invariably this means that solutions must be custom built, tailored to the organization’s workflow, the data it collects, and the questions it needs answered. But the good news is that there is a lot of cool technology out there that can enable your organization to make a dent in the problem. With this in mind, here are three important considerations when developing custom solutions, which I refer to as platform, scaling, and focused inquiry, and which I elaborate on in the following paragraphs.
If I had to characterize the Big Data initiative as it currently stands, I would say we are in the platform phase. That is, all sorts of research and solutions are emerging to enable representing large data, computing on it, and exploring it. However, what’s out there now is mostly inadequate: there are a lot of tools that are great for demos, but fall flat due to issues of scale or data complexity. For example, applications that enable exploration of data by simply dragging and dropping table columns and rows into various types of charts are really cool; but for really big data, the likelihood that you can interactively explore yourself into a “eureka” moment is extremely low because, well, it’s too big. It’s essential that a useful Big Data platform support algorithmic means to crunch data and provide an integrated workflow so that users can guide automated analysis, assess progress, and visualize results. Further, due to the complexity of data, the multitude of useful tools, and the imperative (in many cases) for Open Data, the platform is best when it’s open. In this way, such a platform can react quickly to technology changes and support collaborative workflows.
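To make the distinction concrete, here is a minimal sketch of what scripted, algorithmic crunching can look like in Python: instead of dragging raw rows into a chart, the data is streamed in chunks and reduced to a small summary the analyst can actually inspect. The file name, column names, and chunk size are hypothetical stand-ins for whatever your platform actually stores.

```python
# A minimal sketch: stream a (hypothetical) large CSV in chunks and
# accumulate per-sensor summary statistics, so the analyst works with
# a small, digestible result rather than the raw rows.
import pandas as pd

CHUNK_ROWS = 1_000_000          # tune to available memory
totals, counts = {}, {}

for chunk in pd.read_csv("measurements.csv", chunksize=CHUNK_ROWS):
    grouped = chunk.groupby("sensor_id")["value"]
    for sensor, s in grouped.sum().items():
        totals[sensor] = totals.get(sensor, 0.0) + s
    for sensor, n in grouped.count().items():
        counts[sensor] = counts.get(sensor, 0) + n

# Combine the partial aggregates into per-sensor means.
means = {sensor: totals[sensor] / counts[sensor] for sensor in totals}
print(sorted(means.items(), key=lambda kv: kv[1], reverse=True)[:10])
```

The point is not this particular script, but the shape of the workflow: an automated pass reduces the data, and the human looks at the reduction.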
The scale of the problem is still under-appreciated, and it manifests in many unexpected ways. Here’s a simple example: for really big data, do not expect to move, transmit, or copy it once it’s acquired or computed; it’s just not feasible. Even given extreme bandwidth and prodigious storage, copy times can extend into weeks or months for many data sets. This simple fact drives many issues related to scale. For example, computation must travel to where the data resides; it’s a really bad idea to transmit data to a remote computational client. The associated I/O and bandwidth constraints demand all sorts of innovative approaches: high-performance computing (HPC) must be intimately connected to Big Data and become part and parcel of the solution; future algorithms must be born parallel; in-situ approaches are necessary to process and reduce data during computation; new algorithmic techniques, including statistical methods for probing data, are essential; visualization of data from actively running processes will become necessary (since you can’t afford to write out interim data); and data centers will necessarily evolve into HPC systems with lots of computational horsepower. Forget about trying to interactively explore an entire data set; instead, visualization will serve as a magnifying lens through which users can assess, guide, and evaluate the results of data analysis.
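The arithmetic behind that “weeks or months” claim is easy to check. Here is a back-of-the-envelope sketch, assuming a hypothetical 1 PB data set and idealized, uncontended sustained bandwidths:

```python
# Back-of-the-envelope transfer times for a hypothetical 1 PB data set.
# The bandwidth figures are illustrative sustained rates, not benchmarks.
PETABYTE = 10**15  # bytes

for label, gbit_per_s in [("1 Gb/s", 1), ("10 Gb/s", 10), ("100 Gb/s", 100)]:
    bytes_per_s = gbit_per_s * 1e9 / 8   # bits/s -> bytes/s
    days = PETABYTE / bytes_per_s / 86_400
    print(f"{label}: {days:,.1f} days")
```

At 1 Gb/s the copy takes roughly three months; at 10 Gb/s, over a week; even a dedicated 100 Gb/s link needs about a day, and real end-to-end rates are usually far worse. Multiply by a few petabytes and the conclusion is unavoidable: the computation goes to the data.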
Focused inquiry is another important consideration when developing a Big Data solution. This includes providing easy ways to formulate relevant questions and evaluate potential answers. It is only by limiting a query through a well-formulated question that users can hope to carve out useful information. Part of asking the right question is understanding the data context and framing the answer in the right form: it generally makes no sense to use bar charts to visualize an MRI scan (although I suppose the right form of histogram might be useful). Focused inquiry must inherently be part of an integrated workflow that combines automated processing tools with the means to interactively visualize interim results and guide continued processing. The focused inquiry process is thus necessarily iterative, with automated processes providing hints as to where the useful information might lie; this in turn sharpens the driving questions and ultimately leads to useful information.
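As a toy illustration of that loop, here is a sketch in Python, with synthetic random data standing in for a real data set and a deliberately planted anomaly; the scoring function and the top-five cutoff are arbitrary choices for the example, not a prescribed method.

```python
# A minimal sketch of iterative "focused inquiry": a cheap statistical
# probe narrows the search, and only the flagged slices get a close look.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 1_000))   # stand-in for a large data set
data[7_342, 100:200] += 5.0               # a buried anomaly

# Pass 1: coarse, automated probe -- one summary number per row.
row_score = np.abs(data).mean(axis=1)
candidates = np.argsort(row_score)[-5:]   # top 5 rows worth a closer look

# Pass 2: focused follow-up -- drill into the candidates only.
for row in candidates:
    segment = data[row]
    print(f"row {row}: mean={segment.mean():+.3f}, max={segment.max():+.3f}")
# The analyst visualizes just these few rows, refines the question
# ("why is row 7342 hot in columns 100-200?"), and iterates.
```

The automated pass does the heavy lifting; the human applies the magnifying lens only where the hints point, and each pass sharpens the next question.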
As I mentioned earlier, one of the important lessons I’ve learned is that there is no generic solution to the Big Data problem. Expect to build a custom solution on a scalable platform, one that can be tailored around a well-thought-out workflow to address the questions relevant to your organization. In addition, if you want to adapt quickly as technology advances and to collaborate across your team (so as to bring in the expertise of many), you must make sure that you build on open platforms. If you do not, IP barriers and programming limitations will constrain the tools you can use to ask, and answer, important questions.
Big Data is definitely not a fad. We’ll be facing this challenge for the foreseeable future, and it’s only going to get worse in the near term. If your organization is producing large data, there’s a good chance that much of it is going unseen and that you are missing opportunities. Be wary of simple solutions; addressing the Big Data challenge means rethinking your data platform, ensuring that systems scale properly, and building tools and processes that enable you to ask the right questions and collaborate across your organization.