Informatics Survey Sheds Light on Researcher Needs
Kitware, in collaboration with The Ohio State University, recently completed a Phase I small business grant from the Department of Energy titled “Cloud Computing and Visualization Tools for KBase”. This grant involved extending our tool-sets and exploring activities in the areas of bioinformatics and researcher support. As part of a gap analysis, and to broaden our understanding of the needs of this burgeoning research community, we conducted an online survey of current visualization and bioinformatics researchers designed to capture an insider’s view of the field. The survey succeeded in gathering opinions from a wide range of researchers: the 58 participants surveyed as of November 30, 2012 represented nineteen (19) different research topics within bioinformatics and systems biology.
The survey focused on two areas, Analysis Needs and Visualization Needs, and for each area we probed current usage, available tools, and unmet needs. Additional question groups gathered background information and solicited free-text evaluations of existing tools and unmet needs. The ten survey questions are shown in Table 1, organized into these four question groups.
1. In what area(s) of systems biology or bioinformatics are you involved?
2. What computational platform(s) do you use in your work?
3. What sort of systems biology or bioinformatics analyses do you perform?
4. What analysis packages do you use?
5. Describe the major analysis challenges you face in your field.
6. What types of systems biology or bioinformatics visualizations do you use?
7. Which systems biology or bioinformatics visualization tools do you use?
8. Describe the major visualization challenges you face in your field.
9. What types of visualizations would you like to perform that are not available to you now?
10. What are the shortcomings in the tool-set in your research area?
Table 1. The questions asked in our online survey.
As noted earlier, the 58 respondents covered 19 different disciplines. Interestingly, over 50% were involved in Cellular and Molecular Biology, with other significant concentrations (>25%) in Genetics, Transcriptomics, and Genomics/Functional Genomics. These researchers made 137 selections for computational platforms, spanning the space from mobile devices to High Performance Computing (HPC). Most researchers used Linux (70.7%) or Mac (62.1%), but Windows and cloud/distributed computing were also used by significant numbers of researchers (46.6% and 39.7%, respectively). This multiplicity of platforms demonstrates that any broad solution will benefit from multi-platform and scalable capability.
Regarding the analysis questions, a majority of respondents performed Network Modeling/Analysis, but significant numbers (>10%) also used Clustering, Factor Analysis, Canonical Correlations, Time Series Alignment, Hypothesis Testing, Rule-Based Modeling, Phylogenetic Tree Generation, Population Comparisons, Read Alignment, and Dynamics Modeling/Simulation. Free-text responses added graphical models, Bayesian statistics, sequence analysis, branch-site models, and standard image analysis techniques. To accomplish these analyses, respondents most often used R/Bioconductor (81.5%), Galaxy (40.7%), QIIME (22.2%), or MG-RAST (18.5%). Aside from these, a significant number of respondents (29.3%) reported developing their own custom code or using algorithm development languages such as Python or MATLAB. Of the four analysis challenges offered by the survey, three (interface standards, interactivity, and data acquisition) were selected by more than 40% of respondents (61.4%, 63.6%, and 47.7%, respectively), and data upload/transmission garnered a respectable 36.4%. More interesting were the free-text responses, which suggested scalability, model verification, uncertainty in the data, and finding good visualizations.
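Many of the analyses named above, clustering in particular, boil down to a few simple steps. As a rough illustration only (not any respondent's actual pipeline, and real work would more likely use R/Bioconductor or a library implementation), here is a minimal k-means clustering sketch in pure Python; the data points and the choice of k = 2 are invented for the example:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means on 2D points: assign each point to its
    nearest center, then move each center to its cluster mean."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # index of the nearest center (squared Euclidean distance)
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # leave a center in place if its cluster emptied
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers, clusters

# Two well-separated, made-up groups of points.
points = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, clusters = kmeans(points, 2)
```

On data this cleanly separated, the two centers settle near the means of the two groups regardless of initialization.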
The visualization questions provided an interesting contrast. For visualization algorithms, there was wide support for a large number of visualizations, with two (scatter plots and network visualizations) selected by more than 60% of respondents. An additional nine of the 15 offered choices (11 in total) were selected by more than 20%. Free-text responses added model-specific visualizations, 3D volumetrics, and time series plots. The responses for visualization tools showed much narrower use. Only one of the offered selections, UCSC Genome Browser/Integrative Genome Viewer/Ensembl Viewer, was chosen by over 40% of the respondents, and of the 22 offered selections, only seven did not receive at least one vote. The free-text responses continued the trend, and in the end more than 35 different applications were identified. Interestingly, custom visualizations were the second-highest vote-getter. Large data (82.8%), interactivity/real-time visualization (60.3%), and multiple data dimensions (51.7%) dominated the visualization challenges.
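Network visualizations, one of the two most popular categories above, are typically driven by a very simple underlying structure. As a hypothetical sketch (the gene names and edges are invented), here is how node degree, a quantity commonly mapped to node size or color in such views, can be computed from an edge list in pure Python:

```python
from collections import defaultdict

# Hypothetical edge list for a small gene-interaction network.
edges = [("geneA", "geneB"), ("geneA", "geneC"),
         ("geneB", "geneC"), ("geneC", "geneD")]

# Build an undirected adjacency list.
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

# Node degree: the number of neighbors of each node.
degree = {n: len(nbrs) for n, nbrs in adj.items()}
```

A rendering layer (custom or off-the-shelf) would then lay out `adj` and style each node according to `degree`.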
The final section was designed to explore the gap as seen by the researchers. We asked two questions: “What types of visualizations would you like to perform that are not available to you now?” and “What are the shortcomings in the tool-set in your research area?” The responses, shown in Table 2 below, make clear the need for multidimensional, multi-omics, scalable visual analytics.
What types of visualizations would you like to perform that are not available to you now? | What are the shortcomings in the tool-set in your research area?
Dose-response graphs | Interface with instruments for real-time access, analysis, and feedback control, mainly due to proprietary software of the manufacturers
More detailed network visualizations | Interactivity between multiple algorithms is not always easy to achieve
3D visualizations in journal publications | The possibility to insert time delay and logic conditions in the functions governing chemical reaction kinetics
Large data set visualization | Size limited
NURBS/T-Splines, or any spline-based mesh | Visualization research often lacks a comprehensive knowledge of the biological problems, therefore solutions are not appropriate; biological problems are diverse, therefore tools should be adaptable to also meet slightly different tasks
Multidimensional and multivariate data visualization | Nothing I can’t code myself.
More interactive 3D visualizations | CAD has superior mesh vis, but lacks node- or cell-based colorization
Temporal (across a phylogenetic tree) combined with spatial (biogeographic) | A lot of what I do is based on generating visualizations from the perspective of classes of genomic coordinates (i.e., meta-gene analyses from the perspective of TSSs, motifs, etc.). I’ve had to write my own tools to do this.
Exploratory data analysis of integrated, multi-omics analysis | The actual methods for data analysis and visualization (for exploring high-dimensional image-based data) are not mature, and user-friendly tools for carrying out the methods are also not mature.
3D rendering on touchscreens, for protein structures | The complicated nature of analysis and exploration requires many tools to be used, which of course introduces huge import/export problems. Recently, we have been developing our own tools to allow us to perform the analysis, exploration, and visualization in a single environment. However, adding functionality is often slow.
 | Optimisation performance in fine-tuning the model for Metabolic Flux Analysis.
 | Formatting is not universal; wasted time is spent on small formatting changes. Also, sometimes searching for the tool to do the analyses I want takes longer than it should.
 | Ability to perform exploratory data analysis of very large, multidimensional data sets.
 | Data integration of x-omics datasets.
Table 2. Answers to free-form responses on gaps in current research tools.
Based on the survey responses, we believe that the biggest gap remains visualization. The ideal tool will be multi-platform and configurable, able to span deployment platforms from mobile devices to HPC with a consistent interface and user experience. It should “play well with others” by providing hooks for seamlessly importing analyses from multiple external analysis platforms and instruments, possibly by incorporating a data curation tool to allow for “small formatting changes” on data import. It must be scalable, allowing problem sizes appropriate to the target hardware, and should provide interactive visualization for the selected problem size on appropriately sized hardware. Finally, there is a consensus that visualization is not a mature technology: whatever solution arises must be easily reconfigured by the developer or user to allow easy access to novel visualizations and customized views into the data.
As we proceed with our research and development, you can expect that these issues will find their way into our toolkits and applications. For those who want to add their own input to the survey, you are welcome to visit it here and tell us what you think is important; or, if you want to dig in and see the full set of survey results, go right to the summary.