New Data Array Layouts in VTK 7.1
ProTip: For an exhaustive and up-to-date description of all improvements described in this blog, consider reading the current wiki page.
With the upcoming release of VTK 7.1 comes significant improvements to the efficiency and interoperability of VTK datasets. This post will guide you through how to use these tools.
VTK datasets store most of their important information in subclasses of vtkDataArray. Vertex locations (vtkPoints::Data), cell topology (vtkCellArray::Ia), and numeric point, cell, and generic attributes (vtkFieldData::Data) are the dataset features accessed most frequently by VTK algorithms, and these all rely on the vtkDataArray API.
Terminology
ValueType: the element type of an array. For instance, vtkFloatArray has a ValueType of float.
ArrayType: subclass of vtkDataArray which specifies ValueType and an array implementation.
Dispatch: a runtime-resolution of a vtkDataArray’s ArrayType which calls a tailored section of executable code. At compile-time, the possible ArrayTypes are determined and a worker code template is generated for each. At run-time, the type of a specific array is determined and the proper worker instantiation is called.
Template explosion is a sharp increase in the size of a compiled binary that results from instantiating a template function or class on many different types.
vtkDataArray
This non-templated base class of the VTK data array type hierarchy is unique in C++. All arrays containing numeric data inherit the common interface vtkDataArray. Without knowing the underlying ValueType stored in a data array, an algorithm or user may resize, reshape, read, and rewrite the array using a generic API that substitutes double-precision floating point numbers for the array’s actual ValueType. For instance, we can write a simple function that computes the magnitudes for a set of vectors in one array and store the results in another using nothing but the typeless vtkDataArray API:
// 3 component magnitude calculation using the vtkDataArray API. // Inefficient, but easy to write: void calcMagnitude(vtkDataArray *vectors, vtkDataArray *magnitude) { vtkIdType numVectors = vectors->GetNumberOfTuples(); for (vtkIdType tupleIdx = 0; tupleIdx < numVectors; ++tupleIdx) { // What data types are magnitude and vectors using? // We don’t care! These methods all use double. magnitude->SetComponent(tupleIdx, 0, std::sqrt(vectors->GetComponent(tupleIdx, 0) * vectors->GetComponent(tupleIdx, 0) + vectors->GetComponent(tupleIdx, 1) * vectors->GetComponent(tupleIdx, 1) + vectors->GetComponent(tupleIdx, 2) * vectors->GetComponent(tupleIdx, 2)); } }
The Costs of Flexibility
This flexibility comes at a cost. Accuracy is impacted by ValueTypes that cannot be fully expressed as a double; integers with > 52 bits of precision must be truncated. Performance also takes a hit – vtkDataArray calls must be routed through a run-time resolution of ValueTypes using a C++ virtual override, adding overhead to each API call. This indirection also prevents the compiler from making assumptions about the layout of the data in-memory, which could be used to perform advanced optimizations such as vectorization.
So what options are available for fast, optimized, type-safe access to the data stored in a vtkDataArray?
The Current Solution: vtkTemplateMacro
In the past, this was done using the vtkTemplateMacros construct; this is no longer a best practice due to practical issues with array dispatching. While it is still usable and likely to be encountered in the VTK source code, newer code should use the vtkArrayDispatch mechanism.
Before 7.1, most numeric vtkDataArray objects were subclasses of the vtkDataArrayTemplate class, which implements all documented numeric data arrays such as vtkDoubleArray, vtkIdTypeArray, etc. Tuples are stored in memory as a contiguous array-of-structs (AOS), with the components stored in adjacent memory locations. These are accessed with the vtkTemplateMacro, a series of case statements that call a virtual GetDataType() to determine the array’s ValueType, typedef’d to VTK_TT. This allows developers to write generic type-specific code using VTK_TT as a placeholder for the actual ValueType. To perform an operation on multiple arrays, all arrays must have the same ValueType, or nested vtkTemplateMacros calls must be used to determine the ValueTypes of the additional arrays.
This solution has two problems: First, the worker functions assume the AOS adjacent layout; as VTK moves further into the field of in-situ analysis, this assumption may no longer be valid. Second, the number of required dispatches increases exponentially as the number of arrays to be considered grows. Restricted dispatch is needed to avoid template explosion.
The New Solution: Data Array Changes in VTK 7.1
As of VTK 7.1, the Array-Of-Structs (AOS) memory layout is no longer the only vtkDataArray implementation provided. The Struct-Of-Arrays (SOA) memory layout is now available through the vtkSOADataArrayTemplate class. The SOA layout assumes that the components of an array are stored separately, as in:
struct StructOfArraysBuffer { float *x; // Pointer to array containing x components float *y; // Same for y float *z; // Same for z };
The new SOA arrays improve interoperability between VTK and simulation packages for live visualization of in-situ results. Many simulations use the SOA layout for their data, and natively supporting these arrays in VTK allows analysis of live data without explicitly copying it into a VTK data structure.
The legacy GetVoidPointer method, which is typically used with vtkTemplateMacro, must return a pointer to the array data with AOS-ordering. SOA arrays implement this by copying the data into a temporary buffer – this is a waste of both processor time and memory. A replacement mechanism is needed which can abstract storage details while providing performance on-par with raw memory buffer operations, and reduce template explosion by providing more efficient multi-array dispatch.
Best Practices for vtkDataArray Post-7.1
The new tools in VTK 7.1 make managing template instantiations for efficient array access both easy and extensible. These tools replace vtkTemplateMacro by abstracting away things like storage details, while providing performance on-par with raw memory buffer operations. They also give the developer more control over multi-array dispatch, and reduce the template explosion problem. These include:
- vtkGenericDataArray The new templated base interface for all numeric vtkDataArray subclasses.
- vtkArrayDispatch Collection of code generation tools that allow concise and precise specification of restrictable dispatch for up to 3 arrays simultaneously.
- vtkArrayDownCast Access to specialized downcast implementations from code templates.
- vtkDataArrayAccessor Provides Get and Set methods for efficiently accessing/modifying array data using either a specific ArrayType API, or falling back to the slower generic vtkDataArray API when the ArrayType is unknown.
- VTK_ASSUME New abstraction for the compiler __assume directive to provide optimization hints.
Here is the calcMagnitude example implemented using these new tools:
// 3-component magnitude calculation using new concepts in VTK 7.1: // A worker functor. The calculation is implemented in the function template // for operator(). struct CalcMagnitudeWorker { // The worker accepts VTK array objects, not raw memory buffers. template void operator()(VectorArray *vectors, MagnitudeArray *magnitude) { // This allows the compiler to optimize for the AOS array stride. VTK_ASSUME(vectors->GetNumberOfComponents() == 3); VTK_ASSUME(magnitude->GetNumberOfComponents() == 1); // These allow this single worker function to be used with both // the vtkDataArray 'double' API and the more efficient // vtkGenericDataArray APIs, depending on the template parameters: vtkDataArrayAccessor v(vectors); vtkDataArrayAccessor m(magnitude); vtkIdType numVectors = vectors->GetNumberOfTuples(); for (vtkIdType tupleIdx = 0; tupleIdx < numVectors; ++tupleIdx) { // Set and Get compile to inlined optimizable raw memory accesses for // vtkGenericDataArray subclasses. m.Set(tupleIdx, 0, std::sqrt(v.Get(tupleIdx, 0) * v.Get(tupleIdx, 0) + v.Get(tupleIdx, 1) * v.Get(tupleIdx, 1) + v.Get(tupleIdx, 2) * v.Get(tupleIdx, 2))); } } }; void calcMagnitude(vtkDataArray *vectors, vtkDataArray *magnitude) { // Create our worker functor: CalcMagnitudeWorker worker; // Define our dispatcher. We’ll let vectors have any ValueType, but only // consider float/double arrays for magnitudes. These combinations will // use a 'fast-path' implementation generated by the dispatcher: typedef vtkArrayDispatch::Dispatch2ByValueType < vtkArrayDispatch::AllTypes, // ValueTypes allowed by first array vtkArrayDispatch::Reals // ValueTypes allowed by second array > Dispatcher; // Execute the dispatcher: if (!Dispatcher::Execute(vectors, magnitude, worker)) { // If Execute() fails, it means the dispatch failed due to an // unsupported array type. In this case, it’s likely that the magnitude // array is using an integral type. This is an uncommon case, so we won’t // generate a fast path for these, but instead call an instantiation of // CalcMagnitudeWorker::operator()<vtkDataArray, vtkDataArray>. // Through the use of vtkDataArrayAccessor, this falls back to using the // vtkDataArray double API: worker(vectors, magnitude); } }
vtkGenericDataArray
The vtkGenericDataArray class template drives the new vtkDataArray class hierarchy. The ValueType is introduced here, both as a template parameter and a class-scope typedef. As a result, writing a typed API doesn’t require conversion to/from a common type. Instead of implementing storage details, it uses the CRTP idiom to forward key method calls to a derived class without a virtual function call. Without this indirection, vtkGenericDataArray can define an efficient interface for the compiler to to see past the method calls and instead optimize the underlying memory accesses.
The two main subclasses of vtkGenericDataArray are vtkAOSDataArrayTemplate and vtkSOADataArrayTemplate. These implement array-of-structs and struct-of-arrays storage, respectively.
vtkTypeList
Type lists are a metaprogramming construct used to generate a list of C++ types, used in VTK to implement restricted array dispatching. vtkArrayDispatch reduces the number of generated template instantiations by enforcing constraints on the arrays used to dispatch. For instance, if one wanted to only generate templated worker implementations for vtkFloatArray and vtkIntArray, a typelist is used to specify this:
// Create a typelist of 2 types, vtkFloatArray and vtkIntArray: typedef vtkTypeList_Create_2(vtkFloatArray, vtkIntArray) MyArrays; Worker someWorker = ...; vtkDataArray *someArray = ...; // Use vtkArrayDispatch to generate code paths for these arrays: vtkArrayDispatch::DispatchByArray(someArray, someWorker);
There is a set of macros named vtkTypeList_Create_X, where X is the number of types in the created list, and the arguments are the types to place in the list. In the example above, the new type list is bound to a friendlier name using a local typedef, which is a common practice.
The vtkTypeList.h header defines some additional type list operations that may be useful, such as deleting and appending types, looking up indices, etc. vtkArrayDispatch::FilterArraysByValueType may come in handy, too. But for array dispatches, most users will only need to create new ones, or use the predefined vtkTypeLists:
- vtkArrayDispatch::Reals — All floating point ValueTypes.
- vtkArrayDispatch::Integrals — All integral ValueTypes.
- vtkArrayDispatch::AllTypes — Union of Reals and Integrals.
- vtkArrayDispatch::Arrays — Default list of ArrayTypes to use in dispatches.
The last one is special — vtkArrayDispatch::Arrays is a type list of ArrayTypes set application-wide when VTK is built. This vtkTypeList of vtkDataArray subclasses is used for unrestricted dispatches, and is the list that gets filtered when restricting a dispatch to specific ValueTypes. This list contains all AOS arrays by default. The CMake option VTK_DISPATCH_SOA_ARRAYS will enable SOA array dispatch as well. More advanced possibilities exist and are described in VTK/CMake/vtkCreateArrayDispatchArrayList.cmake.
vtkArrayDownCast
In VTK, all subclasses of vtkObject support a downcast method called SafeDownCast, similar to the C++ dynamic_cast: given an object, try to cast it to a more derived type; return NULL if the object is not the requested type. SafeDownCast is implemented as a series of string comparisons on the object’s class name, and this can dominate computational resources. Some vtkDataArrays support a specialized FastDownCast method, which lessons the cost of downcasting arrays by replacing common string comparisons with a single virtual call. However, not all array implementations support the FastDownCast method.
vtkArrayDownCast unifies these casting methods by automatically selecting FastDownCast when it is defined for the ArrayType, and otherwise falls back to the slower SafeDownCast.
template void DoSomeAction(vtkAbstractArray *array) { ArrayType *myArray = vtkArrayDownCast(array); if (myArray) { // ... (do work with myArray) } }
vtkDataArrayAccessor
Array dispatching relies on having templated worker code carry out some operation, such as locating the maximum value in an array. However, because only the arrays in vtkArrayDispatch::Arrays are tested for dispatching, a backup implementation to fall back on the slower vtkDataArray API is needed in case of an unsupported array. Writing the same algorithm twice adds extra debugging, testing, and maintenance.
The vtkDataArrayAccessor removes the need for duplication of effort. It provides component and tuple based Get and Set methods using either the vtkDataArray or vtkGenericDataArray API, depending on the class’s template parameter. It also defines an APIType, which can be used to allocate temporaries, etc. This type is double for vtkDataArrays and vtkGenericDataArray::ValueType for vtkGenericDataArrays.
vtkDataArrayAccessor has a more compact API. The only defined methods are Get and Set, and they’re overloaded to work on tuples and components. Note that non-element access operations such as GetNumberOfTuples should still be called on the array pointer using vtkDataArray API.
Using vtkDataArrayAccessor, we can write a single worker template that works for both vtkDataArray and vtkGenericDataArray, without a loss of performance in the latter case. That worker looks like this:
// Stores the tuple/component coordinates of the maximum value using vtkDataArrayAccessor: struct FindMax { vtkIdType Tuple; // Result int Component; // Result FindMax() : Tuple(-1), Component(-1) {} template void operator()(ArrayT *array) { // Create the accessor: vtkDataArrayAccessor access(array); // Prepare the temporary. We’ll use the accessor's APIType instead of // ArrayT::ValueType, since that is appropriate for the vtkDataArray // fallback: typedef typename vtkDataArrayAccessor::APIType ValueType; ValueType max = std::numeric_limits::min(); // Iterate as before, but use access.Get instead of // array->GetTypedComponent. GetTypedComponent is still used // when ArrayT is a vtkGenericDataArray, but // vtkDataArray::GetComponent is now used as a fallback when ArrayT // is vtkDataArray. vtkIdType numTuples = array->GetNumberOfTuples(); int numComps = array->GetNumberOfComponents(); for (vtkIdType tupleIdx = 0; tupleIdx < numTuples; ++tupleIdx) { for (int compIdx = 0; compIdx < numComps; ++compIdx) { if (max < access.Get(tupleIdx, compIdx)) { max = access.Get(tupleIdx, compIdx); this->Tuple = tupleIdx; this->Component = compIdx; } } } } };
When we call operator() with, say, ArrayT=vtkFloatArray, we’ll get an optimized, efficient code path. But we can also call this same implementation with ArrayT=vtkDataArray and still get a correct result (assuming that the vtkDataArray’s double API represents the data well enough).
Using the vtkDataArray fallback path is straightforward. At the call site:
void someFunction(vtkDataArray *array) { FindMax maxWorker; if (!vtkArrayDispatch::Dispatch::Execute(array, maxWorker)) { maxWorker(array); // Dispatch failed, call vtkDataArray fallback } // Do work using maxWorker.Tuple and maxWorker.Component -- now we know // for sure that they’re initialized! }
Using the above pattern for calling a worker and always going through vtkDataArrayAccessor to Get/Set array elements ensures that any worker implementation can be its own fallback path.
VTK_ASSUME
The new array classes achieve the same performance as raw memory buffers for most cases, using both AOS and SOA array implementations. In fact, with –ffast-math optimizations on GCC 4.9, the optimizer is able to remove all function calls and apply SIMD vectorized instructions in the dispatched worker, showing that the new array API is thin enough that the compiler can see the algorithm in terms of memory access.
But there was one case where performance suffered. If iterating through an AOS data array with a known number of components using GetTypedComponent, the raw pointer implementation initially outperformed the dispatched array. To understand why, note that the AOS implementation of GetTypedComponent is along the lines of:
ValueType vtkAOSDataArrayTemplate::GetTypedComponent(vtkIdType tuple, int comp) const { // AOSData is a ValueType* pointing at the base of the array data. return this->AOSData[tuple * this->NumberOfComponents + comp]; }
Because NumberOfComponents is unknown at compile time, the optimizer cannot assume anything about the stride of the components in the array. This leads to missed optimizations for vectorized read/writes and increased complexity in the instructions used to iterate through the data.
For such cases where the number of components is, in fact, known at compile time (due to a calling function performing some validation, for instance), it is possible to tell the compiler about this fact using VTK_ASSUME.
VTK_ASSUME wraps a compiler-specific __assume statement, which is used to pass such optimization hints. Its argument is an expression of some condition that is guaranteed to always be true. This allows more aggressive optimizations when used correctly, but be forewarned that if the condition is not met at runtime, the results are unpredictable and likely catastrophic.
But if we’re writing a filter that only operates on 3D point sets, we know the number of components in the point array will always be 3. In this case we can write:
VTK_ASSUME(pointsArray->GetNumberOfComponents() == 3);
in the worker function and this instructs the compiler that the array’s internal NumberOfComponents variable will always be 3, and thus the stride of the array is known. Of course, the caller of this worker function should ensure that this is a 3-component array and fail gracefully if it is not.
There are many scenarios where VTK_ASSUME can offer a serious performance boost, the case of known tuple size is a common one that’s really worth remembering.
vtkArrayDispatch
The dispatchers implemented in the vtkArrayDispatch namespace provide array dispatching with customizable restrictions on code generation and a simple syntax that hides the messy details of type resolution and multi-array dispatch. There are several “flavors” of dispatch available that operate on up to three arrays simultaneously.
Components Of A Dispatch
Using the vtkArrayDispatch system requires three elements: the array(s), the worker, and the dispatcher.
The Arrays
All dispatched arrays must be subclasses of vtkDataArray. It is important to identify as many restrictions as possible. Must every ArrayType be considered during dispatch, or is the array’s ValueType (or even the ArrayType itself) restricted? If dispatching multiple arrays at once, are they expected to have the same ValueType? These scenarios are common, and these conditions can be used to reduce the number of instantiations of the worker template.
The Worker
The worker is some generic callable. In C++98, a templated functor is a good choice. In C++14, a generic lambda is a usable option as well. For our purposes, we’ll only consider the functor approach, as C++14 is a long ways off for core VTK code.
At a minimum, the worker functor should define operator() to make it callable. This should be a function template with a template parameter for each array it should handle. For a three array dispatch, it should look something like this:
struct ThreeArrayWorker { template void operator()(Array1T *array1, Array2T *array2, Array3T *array3) { /* Do stuff... */ } };
At runtime, the dispatcher will call ThreeWayWorker::operator() with a set of Array1T, Array2T, and Array3T that satisfy any dispatch restrictions.
Workers can be stateful, too, as seen in the FindMax worker earlier where the worker simply identified the component and tuple id of the largest value in the array. The functor stored them for the caller to use in further analysis:
// Example of a stateful dispatch functor: struct FindMax { // Functor state, holds results that are accessible to the caller: vtkIdType Tuple; int Component; // Set initial values: FindMax() : Tuple(-1), Component(-1) {} // Template method to set Tuple and Component ivars: template void operator()(ArrayT *array) { /* Do stuff... */ } };
The Dispatcher
The dispatcher is the workhorse of the system. It is responsible for applying restrictions, resolving array types, and generating the requested template instantiations. It has responsibilities both at run-time and compile-time.
During compilation, the dispatcher will identify the valid combinations of arrays that can be used according to the restrictions. This is done by starting with a typelist of arrays, either supplied as a template parameter or by defaulting to vtkArrayDispatch::Arrays, and filtering them by ValueType if needed. For multi-array dispatches, additional restrictions may apply, such as forcing the second and third arrays to have the same ValueType as the first. It must then generate the required code for the dispatch — that is, the templated worker implementation must be instantiated for each valid combination of arrays.
At runtime, it tests each of the dispatched arrays to see if they match one of the generated code paths. Runtime type resolution is carried out using vtkArrayDownCast to get the best performance available for the arrays of interest. If it finds a match, it calls the worker’s operator() method with the properly typed arrays. If no match is found, it returns false without executing the worker.
Restrictions: Why They Matter
We’ve made several mentions of using restrictions to reduce the number of template instantiations during a dispatch operation. You may be wondering if it really matters so much. Let’s consider some numbers.
VTK is configured to use 13 ValueTypes for numeric data. These are the standard numeric types float, int, unsigned char, etc. By default, VTK will define vtkArrayDispatch::Arrays to use all 13 types with vtkAOSDataArrayTemplate for the standard set of dispatchable arrays. If enabled during compilation, the SOA data arrays are added to this list for a total of 26 arrays.
Using these 26 arrays in a single, unrestricted dispatch will result in 26 instantiations of the worker template. A double dispatch will generate 676 workers. A triple dispatch with no restrictions creates a whopping 17,576 functions to handle the possible combinations of arrays.
Applying some simple restrictions can reduce this immensely. If the arrays will only contain floats or doubles, the single dispatch drops to 4 instantiations, the double dispatch to 16, and the triple to 64. We could even apply such a restriction to just create some ‘fast-paths’ and let the integral types fallback to using the vtkDataArray API by using vtkDataArrayAccessors.
Another common restriction is that all arrays in a multi-array dispatch have the same ValueType, even if that ValueType is not known at compile time. By specifying this restriction, a double dispatch on all 26 AOS/SOA arrays will only produce 52 worker instantiations, down from 676. The triple dispatch drops to 104 instantiations from 17,576.
Always apply restrictions when they are known, especially for multi-array dispatches. The savings are worth it.
Putting It All Together
Now that we’ve explored the new tools introduced with VTK 7.1 that allow efficient, implementation agnostic array access, let’s look at the calcMagnitude example and identify the key features of the implementation:
// Modern implementation of calcMagnitude using new concepts in VTK 7.1: struct CalcMagnitudeWorker { template void operator()(VectorArray *vectors, MagnitudeArray *magnitude) { VTK_ASSUME(vectors->GetNumberOfComponents() == 3); VTK_ASSUME(magnitude->GetNumberOfComponents() == 1); vtkDataArrayAccessor v(vectors); vtkDataArrayAccessor m(magnitude); vtkIdType numVectors = vectors->GetNumberOfTuples(); for (vtkIdType tupleIdx = 0; tupleIdx < numVectors; ++tupleIdx) { m.Set(tupleIdx, 0, std::sqrt(v.Get(tupleIdx, 0) * v.Get(tupleIdx, 0) + v.Get(tupleIdx, 1) * v.Get(tupleIdx, 1) + v.Get(tupleIdx, 2) * v.Get(tupleIdx, 2))); } } }; void calcMagnitude(vtkDataArray *vectors, vtkDataArray *magnitude) { CalcMagnitudeWorker worker; typedef vtkArrayDispatch::Dispatch2ByValueType < vtkArrayDispatch::AllTypes, vtkArrayDispatch::Reals > Dispatcher; if (!Dispatcher::Execute(vectors, magnitude, worker)) { worker(vectors, magnitude); // vtkDataArray fallback } }
This implementation uses dispatch restrictions to reduce the number of instantiated templated worker functions from down to 104 combinations, where a double-dispatch would create 676 instantiations. When ValueType restrictions aren’t met, the calculation is still carried out at double precision; those other 572 cases won’t have special code generated but can still run if needed. Thanks to vtkDataArrayAccessor, we have a fallback implementation that reuses our templated worker code. In this case, the dispatch is really just a fast-path implementation for floating point output types.
The performance should be identical to iterating through raw memory buffers.The vtkGenericDataArray API is transparent to the compiler. The specialized instantiations of operator() can be heavily optimized since the memory access patterns are known and well-defined. Using VTK_ASSUME tells the compiler that the arrays have known strides, allowing further compile-time optimizations.
Hopefully this has convinced you that the vtkArrayDispatch and related tools are worth using to create flexible, efficient, typesafe implementations for your work with VTK. For information on types of Dispatchers and advanced usage such as accessing memory buffers, check out the wiki. Please direct any additional questions you may have on the subject to the VTK mailing lists.
Awesome. Would it be adapted to tackle this cascade of templating in the VBO creation?
https://gitlab.kitware.com/vtk/vtk/blob/f830ff47b33f7c0c72fe601a6cb762aea00ada1e/Rendering/OpenGL2/vtkOpenGLVertexBufferObject.cxx#L233-341
Yep! Ken has started reworking some of the rendering code to use these new techniques, but there’s still a lot of VTK internals that use the old GetVoidPointer/vtkTemplateMacro approach. If there’s a section of rendering code that’s causing you trouble, feel free to open a bug on the VTK tracker and ping me (@dlonie) and @ken-martin on it.