As a good example of Agile Integration Software, Enterprise Enabler's data quality features and capabilities serve a representative discussion. (http://www.enterpriseenabler.com/) In the context of data integration, I tend to think of data cleansing and profiling in two separate categories, "batch" and “in transit," or "real time."
Batch - Often this is performed as a first-step-project to an integration implementation to ensure that any existing data that is being used is as correct as possible. The context of correctness is generally defined by the source for which it exists. When the source is an existing data warehouse, the correctness is usually considered with respect to a pre-defined master data definition.
In-Transit or Real-Time - Once the integration is in place, new data is being generated and flows through the organization and systems via the agile integration framework. This data must be validated as soon as it appears in play, as well as when it is passed to its destination, since the definition of "correctness" is ultimately determined by the target use.
With Agile Integration, the philosophy is to focus on the data required for the purpose of the project at hand. While cleansing/validating an entire database or data warehouse full of data may be important, the chances are that it is not important for any particular integration project. Addressing the subset needed means a more efficient project and faster time-to-value.
Pre-validating existing data
Using the inherent capabilities of Enterprise Enabler to discover data schemas and objects, one can simply "point" the appropriate AppComm (application communicator) to a database or application that is to become a source to the integration, and the schema or services available are presented. Select the tables, fields, objects, etc. of interest, and grab a sample or the full set of data. In a configured process, the data can be cleaned, validated and standardized using pre-built rules, external tools, or special logic for each unit of data, by field, by record, or by other cross-section. Rules for logging, notifications, and mediation are configured as part of the process. With this approach, you are focused specifically on the data that will be used for the subsequent integration, and a staging database is not required. Once this process is configured, it can be triggered to automatically run as desired to ensure ongoing monitoring and validation of new data. The results can be fed to a BI tool or spreadsheet for statistical analysis on the data quality ("profiling").
With the AppComm approach, combined with the ability to easily create virtual relationships across disparate sources, cross validation ("matching") across systems or merging data to enrich it, becomes a reasonable exercise, without having to design and build a consolidated staging database. Of course, if the situation still requires a staging database, there's no more efficient way to populate it than Enterprise Enabler.
After you have completed this step, the chances are that the new data that will be captured from here forward needs to be cleansed, too. This can be done "real-time" as it is being acquired from the source and passed to the federation and transformation steps of an integration.
Validating data on-the-fly
As is the nature of Agile Integration, Enterprise Enabler offers multiple places where data cleansing, validation, and remediation can be managed within the flow of data through an integration. Some amount of detection of erroneous data is done as a natural part the data acquisition by the intelligent AppComm technology. Driven by metadata definitions, AppComms check not only for valid data (type, format, etc.), but also for the expected schema. Additionally,
o Validation/cleansing rules, pre-built processes or 3rd party tools can be dropped in or invoked for detection and mediation at various points in execution:
· As soon as the data has been acquired
· As it is being transformed and merged with other sources
· After it has been transformed
· By the destination's Appcomm before/as the data is being posted (plus transaction rollback and assurance in the case of multiple destinations)
· Anywhere in the data workflow process surrounding the transformations
o Enterprise Master System ensures that the data comes from the correct source when an end user invokes a particular piece of information.
o Since Enterprise Enabler's user interface ("Designer Studio") is tied directly to a copy of the run-time engines, as you design an integration, you can do a trial run from the studio and see a sample of the data for inspection to get an idea of the quality of data you are dealing with.
Still don’t trust your data?
Sometimes there are situations where validation rules just won't cut it. Example: setting hard minimum and maximum values for something coming from a physical processing plant. You may be able to determine a reasonable range, but only with the knowledge of what happened yesterday will you be able to determine that a "way out of whack" set of numbers are actually due to a disruption at some part of the plant yesterday. Enterprise Enabler has a preview/analysis feature that holds the result data (post transformation and process) just before it is posted to the destination, in a virtual store, only to be released and posted after review and approval by an authorized human being. That person can do quick tests on ranges, averages, etc. as a gut feel reality check and then fix it if necessary before releasing the data set.
And for those of you who care about data governance
Only an AIS is a single end-to-end integration solution. This means that security can be maintained throughout the integration infrastructure. Developers and Data Analysts log in with the permissions of their role and group, and anything they build or change is logged with who-what-when stamp. Every object in Enterprise Enabler is locked down in such a way, preventing intentional or accidental diversion or modification of data and their flows through the enterprise.
And what about bad data in your ERP?
My apologies, but I just can't help saying to the ERP vendors, "shame on you" for not taking the responsibility to ensure that the data captured and generated by your system is completely correct. How could you let that happen? People trusted you! Ok. Ok.. I'll stop short of calling for an "Occupy ERP" movement.
With all of the various angles on Data Quality, it’s clear that Agile Integration inherently brings a range of capabilities that are simply not possible with other DQ products. Whether you are looking to correct existing data or ensure the quality of new data as it is created, the fact that the data quality is handled as a natural aspect of integration means a more efficient overall solution.