Wednesday, June 5, 2013

What is Data Virtualization?

“Virtualization” is everywhere but nowhere. The term is virtually ubiquitous. The first computing use I remember was “virtual reality,” when we computationally rendered 3D worlds onto 2D screens, complete with lighting models and all that. The nice thing is that all of those complicated algorithms are now encapsulated in engines that even beginner gamers can use. In those days, we had to calculate every pixel ourselves. But I digress.

First, let’s clarify that “virtualizing data” means putting it in the cloud or elsewhere to eliminate some of the hassles of its existence and maintenance. That has nothing to do with data virtualization, which is a term that I believe is still evolving.

Data virtualization, according to Rick van der Lans, who literally wrote the book, is “the technology that offers data consumers a unified, abstracted, and encapsulated view for querying and manipulating data stored in a heterogeneous set of data stores.”**

As the discipline matures, he is expanding his view, as in his new white paper, Creating an Agile Data Integration Platform Using Data Virtualization. Definitely recommended reading.

The “unified, abstracted, and encapsulated view” from his original definition is, in my opinion, the core concept of data virtualization. In other words, there is a mechanism to bring together, or “federate,” data from many sources virtually, in a way that is useful. The data is federated without creating a physical or cached staging database; it is aligned, transformed, and made available for use. So, for example, you may have a SharePoint BCS application that needs data from SAP, Oracle, and Salesforce.com. Data virtualization provides a mechanism to merge all of that data into the form the end user needs in SharePoint. The data is federated “on the fly” and delivered virtually to a web page, on demand, upon each refresh of the screen.

Think about the security of the backend data that has been accessed: it never actually moves from its original source! Data virtualization also includes writeback to the sources (with end-user security, but that’s for another blog), so that an end user can, for example, correct his phone number or address, sending it as an update directly to the backend source. (See more examples at http://tinyurl.com/a3wkffc.)
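
To make the federation idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the Backend and VirtualView classes, the field names, and the in-memory “sources” merely stand in for real connectors to systems like SAP or Salesforce. It is meant only to show the shape of on-the-fly federation and writeback, not any product’s actual API.

```python
# Minimal sketch of on-the-fly federation. Every class, field, and backend
# below is a hypothetical illustration, not the API of any real product.

class Backend:
    """Stands in for one live data store (SAP, Oracle, Salesforce, ...)."""
    def __init__(self, name, rows):
        self.name = name
        self.rows = rows  # {record_id: {field: value}} kept in the source

    def fetch(self, record_id):
        # A real connector would query the live system here.
        return self.rows.get(record_id, {})

    def update(self, record_id, field, value):
        # Writeback lands directly in the owning source; nothing is copied.
        self.rows.setdefault(record_id, {})[field] = value

class VirtualView:
    """Federates fields across backends at request time, with no staging DB."""
    def __init__(self, field_owners):
        self.field_owners = field_owners  # {field: Backend that owns it}

    def get(self, record_id):
        # Each call re-reads the sources, so the view is always current.
        return {f: b.fetch(record_id).get(f)
                for f, b in self.field_owners.items()}

    def set(self, record_id, field, value):
        # Route the change back to the single source that owns the field.
        self.field_owners[field].update(record_id, field, value)

# Hypothetical wiring: contact data in Salesforce, order totals in SAP.
salesforce = Backend("Salesforce", {42: {"phone": "555-0100"}})
sap = Backend("SAP", {42: {"order_total": 1250.00}})

view = VirtualView({"phone": salesforce, "order_total": sap})
print(view.get(42))                # merged on demand from both sources
view.set(42, "phone", "555-0199")  # update flows straight to Salesforce
print(salesforce.fetch(42))        # the source itself now holds the change
```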

You can see that this description expands the definition to include any sources, not just data stores, although the focus of most data virtualization products is BI, in which case that limitation makes sense. The BI view of data virtualization is usually about federating relational databases for the sole purpose of querying. Tools designed around that constraint have some difficulty accommodating the expanding definition.

In addition to evolving from federating data stores to federating any kind of disparate source, data virtualization is shedding the concept of “on-demand” only. Now federated data is available not just through web services, ADO.NET, ODBC, JDBC, and the like, but for any type of data integration, such as ETL and EAI.
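
Continuing the toy sketch above, the point is that one federated view definition can serve both access styles. The two functions below are again hypothetical, but they show the idea: the same VirtualView that answers a single on-demand request can also feed a scheduled, ETL-style bulk extract, so the alignment and transformation rules are defined once.

```python
# Hypothetical continuation of the VirtualView sketch above: one federated
# view definition serving both on-demand access and batch integration.

def serve_request(view, record_id):
    # On-demand path: what a web service, ODBC, or JDBC endpoint
    # would invoke for each individual query or screen refresh.
    return view.get(record_id)

def bulk_extract(view, record_ids):
    # Batch path: a scheduled ETL job reuses the same view, so the
    # federation logic and transformation rules are not duplicated.
    return [view.get(rid) for rid in record_ids]
```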

In fact, it is this concept of federation that makes data virtualization the kingpin for the “convergence,” as Gartner is wont to say, of all integration modalities in a single toolset, sharing metadata and business rules across all of them.

**Rick F. van der Lans, Data Virtualization for Business Intelligence Systems, Morgan Kaufmann, 2012.