Thursday, July 9, 2015

Agile ETL

It seems now that most enterprise architects are at least aware of the concept and meaning of data virtualization, something I, along with our team, had awaited many years. It really is a fairly significant mind-shift to eliminate dependence on the idea of staging databases and data warehouses in order to make federated data available as needed. Until the term “data virtualization” was coined and analysts began spreading the word, we were a bit stuck, since even they could not understand, or perhaps articulate what our integration platform, Enterprise Enabler®, actually does and how.

The interesting thing is that we started out applying the underlying concepts of live federation to ETL configurations. As far as I can tell, analysts haven’t grabbed onto this concept yet, but the power of this variation on DV makes it worth contemplating.  For now we’ll call this “Agile ETL,” or, I suppose it could be dubbed “Virtual ETL.” Yes, that sounds better. 

What is ETL? As everyone knows, it means:

Extract data from source. Transform it to the destination format. Load it directly to a destination application or to a staging database or data warehouse. For each source, the process is the same. Then when it’s needed elsewhere, the consuming application, dashboard, or other system queries the staging DB or Data Warehouse. So that’s EXTRACT. TRANSFORM. LOAD… Three distinct steps for each source, generally involving considerable custom programming.

In our book, “transformation” and “federation” are always done together, so, any time the “T” in ETL is performed, federation is also invoked if there are multiple sources involved. So Agile, or Virtual ETL inherently involves one or more sources, and is a point-to-point solution only as a special case, i.e., when there’s only one source.

The steps in Data Virtualization

First let’s look at Data Virtualization (DV). What is it that constitutes a DV? See the key elements in the diagram above:
  1. Data accessed live directly from sources
  2. ALL SOURCES are included (e.g., electronic devices, any application, data feed, bus, Big Data, lake, cloud-based, on premise, in the field.  Oh, AND, of course, all databases, and relational or hierarchical formats, which constitute the total domain of other DV software products unless considerable programming is involved.
  3. Data is federated, transformed, and validated live as it come from the sources. (This is not necessarily available without custom coding in competing products.)
  4. No data is stored en route, except where caching is applied for performance or impact on source systems
  5. The target entity or data model is defined up front, but can be easily modified.
  6. Each DV is packaged in one or many consumable and queryable formats.
  7. Each DV may include end user awareness with full CRUD (Create, Read, Update, and Delete) functionality, honoring security permissions. This “write-back” capability has huge implications for simplifying synchronization and enabling users of consuming applications to actually take action on data (e.g., updating or correcting data.) This is not a built-in capability for most DV platforms. 

Most people think of DV as only being for Business Intelligence and analytics. You can see that it is also a powerful tool any on-demand uses, such as portals, dashboards, Enterprise Service Layer (ESL), and a basis for Master Data Management (MDM).

Compare the path of Agile ETL?

Who says that ETL has to be clunky, just because that’s the way it grew up? Who says it must be one-to-one? When we started out more than twelve years ago, our objective was to get data from lots of different sources, combine them (federate) in the process, and deliver the data validated and however and whenever the destination application, data store, or electronic device required it.

Let’s look at what, in my book, constituted, and continues to define, Agile ETL. Hmmm…. Being a strong proponent of reusability, I’ll refer to the list above for brevity and clarity:
  1. Same as (1.) above. Data accessed live
  2. Same as (2.) above. ALL SOURCES
  3. See (3.) above. Data is federated, transformed, and validated live
  4. See (4.) above. No data is stored en route.
  5. Same list as SOURCES (2.) above ALL DESTINATIONS.
  6. Each Agile ETL is associated with one or more data workflows that include triggers, additional validations, business rules, multi-threaded logic, etc., essentially a configured composite application. Triggers are many, including web service call, so the waters do get muddy.
  7. Same as (7.) above.End user awareness for authorization to execute ETLs and/or the associated workflow process.

Since we started out with Agile ETL, the DV became a matter of adding the packaging as services along with complex query capabilities.

Since Enterprise Enabler is a single secure platform where every element of configuration is stored as metadata, you can readily see that reusability becomes natural, and that added benefits abound, such as monitoring for change and analyzing its impact, tracing the actual data lineage, and MDM.

Even if you decide to continue using Data Warehouse architecture rather than going the DV route, isn’t it at least time to add agility to your ETL?