Monday, November 21, 2011

Data Quality and your Enabled Enterprise

As a good example of Agile Integration Software, Enterprise Enabler's data quality features and capabilities serve as a representative basis for discussion (http://www.enterpriseenabler.com/). In the context of data integration, I tend to think of data cleansing and profiling in two separate categories: "batch" and "in transit," or "real time."

Batch - Often this is performed as a first step of an integration implementation, to ensure that any existing data being used is as correct as possible. The context of correctness is generally defined by the source in which the data exists. When the source is an existing data warehouse, correctness is usually judged against a pre-defined master data definition.

In-Transit or Real-Time - Once the integration is in place, new data is generated and flows through the organization and its systems via the agile integration framework. This data must be validated as soon as it comes into play, as well as when it is passed to its destination, since the definition of "correctness" is ultimately determined by the target use.

With Agile Integration, the philosophy is to focus on the data required for the purpose of the project at hand. While cleansing/validating an entire database or data warehouse full of data may be important, the chances are that it is not important for any particular integration project.  Addressing the subset needed means a more efficient project and faster time-to-value.

Pre-validating existing data

Using the inherent capabilities of Enterprise Enabler to discover data schemas and objects, one can simply "point" the appropriate AppComm (application communicator) at a database or application that is to become a source for the integration, and the schema or available services are presented. Select the tables, fields, objects, etc. of interest, and grab a sample or the full set of data. In a configured process, the data can be cleaned, validated, and standardized using pre-built rules, external tools, or special logic for each unit of data: by field, by record, or by other cross-section. Rules for logging, notifications, and mediation are configured as part of the process. With this approach, you are focused specifically on the data that will be used for the subsequent integration, and a staging database is not required. Once this process is configured, it can be triggered to run automatically as desired to ensure ongoing monitoring and validation of new data. The results can be fed to a BI tool or spreadsheet for statistical analysis of the data quality ("profiling").
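
To make the batch step concrete, here is a minimal sketch in Python of the kind of per-field validation and profiling described above; the field names, rules, and summary format are invented for illustration and are not Enterprise Enabler's actual API.

```python
# Hypothetical sketch: batch-validate a sample of source records field by field,
# then produce a simple data-quality profile that could be fed to a BI tool.
from collections import Counter
import re

# Example validation rules keyed by field name (illustrative only).
RULES = {
    "customer_id": lambda v: bool(re.fullmatch(r"\d{6}", str(v))),
    "email":       lambda v: bool(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", str(v))),
    "country":     lambda v: str(v).strip().upper() in {"US", "CA", "MX"},
}

def validate_batch(records):
    """Return (clean_records, issues) where issues counts failures per field."""
    issues = Counter()
    clean = []
    for rec in records:
        bad_fields = [f for f, rule in RULES.items() if not rule(rec.get(f))]
        if bad_fields:
            issues.update(bad_fields)      # log which fields failed
        else:
            clean.append(rec)              # only fully valid records pass through
    return clean, issues

sample = [
    {"customer_id": "123456", "email": "a@b.com", "country": "us"},
    {"customer_id": "12x456", "email": "not-an-email", "country": "US"},
]
clean, issues = validate_batch(sample)
print(f"{len(clean)} clean of {len(sample)}; failures by field: {dict(issues)}")
```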

With the AppComm approach, combined with the ability to easily create virtual relationships across disparate sources, cross-validation ("matching") across systems, or merging data to enrich it, becomes a reasonable exercise without having to design and build a consolidated staging database. Of course, if the situation still requires a staging database, there's no more efficient way to populate it than Enterprise Enabler.
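
As a hedged illustration of that kind of cross-source matching, the sketch below compares records from two hypothetical systems on a shared key and flags disagreements; the source names, keys, and fields are invented for the example and do not represent Enterprise Enabler.

```python
# Hypothetical sketch: match customer records from two disparate sources on a
# shared key and flag fields whose values disagree, without a staging database.
crm_records = {
    "C-100": {"name": "Acme Corp", "city": "Houston"},
    "C-101": {"name": "Globex",    "city": "Austin"},
}
erp_records = {
    "C-100": {"name": "Acme Corp", "city": "Dallas"},   # city disagrees
    "C-102": {"name": "Initech",   "city": "Plano"},    # missing from CRM
}

def cross_validate(left, right):
    """Yield (key, field, left_value, right_value) for every mismatch."""
    for key in left.keys() & right.keys():
        for field in left[key].keys() & right[key].keys():
            if left[key][field] != right[key][field]:
                yield key, field, left[key][field], right[key][field]

for key, field, a, b in cross_validate(crm_records, erp_records):
    print(f"Mismatch on {key}.{field}: CRM={a!r} vs ERP={b!r}")

unmatched = erp_records.keys() - crm_records.keys()
print(f"Keys present only in ERP: {sorted(unmatched)}")
```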

After you have completed this step, the chances are that the new data that will be captured from here forward needs to be cleansed, too. This can be done "real-time" as it is being acquired from the source and passed to the federation and transformation steps of an integration.

Validating data on-the-fly

As is the nature of Agile Integration, Enterprise Enabler offers multiple places where data cleansing, validation, and remediation can be managed within the flow of data through an integration. Some amount of detection of erroneous data is done as a natural part of data acquisition by the intelligent AppComm technology. Driven by metadata definitions, AppComms check not only for valid data (type, format, etc.), but also for the expected schema. Additionally,

  • Validation/cleansing rules, pre-built processes, or third-party tools can be dropped in or invoked for detection and mediation at various points in execution (a minimal sketch of such hooks follows this list):
      • As soon as the data has been acquired
      • As it is being transformed and merged with other sources
      • After it has been transformed
      • By the destination's AppComm before or as the data is being posted (plus transaction rollback and assurance in the case of multiple destinations)
      • Anywhere in the data workflow process surrounding the transformations
  • Enterprise Master System ensures that the data comes from the correct source when an end user invokes a particular piece of information.
  • Since Enterprise Enabler's user interface ("Designer Studio") is tied directly to a copy of the run-time engines, you can do a trial run from the studio as you design an integration and see a sample of the data, to get an idea of the quality of the data you are dealing with.
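
Here is a minimal, hypothetical sketch of what stage-by-stage validation hooks could look like in a generic pipeline; the stage names, registry, and rules are assumptions made for illustration, not Enterprise Enabler's actual interfaces.

```python
# Hypothetical sketch: register cleansing/validation hooks at the stages named
# in the list above (acquire, transform, post) and run each record through them.
from typing import Callable, Dict, List

Hook = Callable[[dict], dict]          # a hook returns a cleaned record or raises
hooks: Dict[str, List[Hook]] = {"on_acquire": [], "on_transform": [], "before_post": []}

def register(stage: str, hook: Hook) -> None:
    """Attach a validation/cleansing hook to one of the execution points."""
    hooks[stage].append(hook)

def run_pipeline(record: dict) -> dict:
    """Pass a record through every hook, stage by stage."""
    for stage in ("on_acquire", "on_transform", "before_post"):
        for hook in hooks[stage]:
            record = hook(record)
    return record

def require_country(record: dict) -> dict:
    """Destination-side check: reject records missing a country code."""
    if not record.get("country"):
        raise ValueError("country required by destination")
    return record

register("on_acquire", lambda r: {**r, "country": r.get("country", "").strip().upper()})
register("before_post", require_country)

print(run_pipeline({"customer_id": "C-100", "country": " us "}))
# -> {'customer_id': 'C-100', 'country': 'US'}
```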

Still don’t trust your data?

Sometimes there are situations where validation rules just won't cut it. Example: setting hard minimum and maximum values for something coming from a physical processing plant. You may be able to determine a reasonable range, but only with the knowledge of what happened yesterday will you be able to tell that a "way out of whack" set of numbers is actually due to a disruption in some part of the plant. Enterprise Enabler has a preview/analysis feature that holds the result data (post transformation and process) in a virtual store just before it is posted to the destination, to be released and posted only after review and approval by an authorized human being. That person can do quick tests on ranges, averages, etc. as a gut-feel reality check and then fix the data if necessary before releasing the set.
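
As a rough sketch of that kind of human review gate, assuming a simple held batch and invented range limits (this is not the product's actual preview/analysis feature, just the general idea):

```python
# Hypothetical sketch: hold transformed results in a virtual store, surface quick
# range/average checks, and only release the batch after review and approval.
import statistics

held_batch = [98.6, 99.1, 97.4, 512.0, 98.9]   # e.g. plant readings; 512.0 looks suspect

def review_summary(values, low, high):
    """Quick gut-check stats a reviewer might glance at before approving."""
    outliers = [v for v in values if not (low <= v <= high)]
    return {
        "count": len(values),
        "mean": round(statistics.mean(values), 2),
        "min": min(values),
        "max": max(values),
        "outside_expected_range": outliers,
    }

summary = review_summary(held_batch, low=90.0, high=110.0)
print(summary)

approved = not summary["outside_expected_range"]   # or a human decides after inspection
if approved:
    print("Releasing batch to destination")
else:
    print("Batch held for correction before posting")
```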

And for those of you who care about data governance

Only an AIS (Agile Integration Software) is a single, end-to-end integration solution. This means that security can be maintained throughout the integration infrastructure. Developers and data analysts log in with the permissions of their role and group, and anything they build or change is logged with a who-what-when stamp. Every object in Enterprise Enabler is locked down in this way, preventing intentional or accidental diversion or modification of data and its flow through the enterprise.
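
A minimal sketch of the who-what-when stamping described above; the object names, roles, and log format are hypothetical and illustrate the general idea of audit logging rather than Enterprise Enabler's implementation:

```python
# Hypothetical sketch: stamp every change to an integration object with
# who changed it, what changed, and when, for later governance review.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

@dataclass
class AuditEntry:
    who: str          # user making the change
    role: str         # role/group the user logged in with
    what: str         # object and action, e.g. "Map:OrdersToSAP edited"
    when: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

audit_log: List[AuditEntry] = []

def record_change(user: str, role: str, obj: str, action: str) -> None:
    audit_log.append(AuditEntry(who=user, role=role, what=f"{obj} {action}"))

record_change("jdoe", "Data Analyst", "Map:OrdersToSAP", "edited")
record_change("asmith", "Developer", "AppComm:CRM", "created")

for entry in audit_log:
    print(f"{entry.when} | {entry.who} ({entry.role}) | {entry.what}")
```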

And what about bad data in your ERP?

My apologies, but I just can't help saying to the ERP vendors, "shame on you" for not taking the responsibility to ensure that the data captured and generated by your system is completely correct. How could you let that happen? People trusted you! OK, OK. I'll stop short of calling for an "Occupy ERP" movement.

Altogether...

With all of the various angles on Data Quality, it’s clear that Agile Integration inherently brings a range of capabilities that are simply not possible with other DQ products. Whether you are looking to correct existing data or ensure the quality of new data as it is created, the fact that the data quality is handled as a natural aspect of integration means a more efficient overall solution.


Thursday, November 17, 2011

Big Data Quality


Big Data means big data quality issues, right? Well, of course, right. Big data means more data that can be bad or go bad one way or another. Big, bad data could have big, bad consequences. But just think about some of the ways Big Data may be in better shape than other data.

Big Data
  • is usually captured automatically, without manual intervention
  • often has been gathered over many years, so that the framework for capture and validation at the source has improved and been "debugged" over time. Various standards may also play a role in the data capture and ultimate quality. Examples might be weather related data and GIS data.
  • is often used in ways where analytics and conclusions improve with data volume and errors in individual data points become less important. Data quality is essential for Business Intelligence (BI), but from some perspectives, and for some aspects of data quality, DQ may move into the background.
Big Data from Social Media has some additional considerations.
  • Capture mechanisms are well known: Facebook, email, Twitter, etc.
  • We know that the quality of information from these is highly questionable - that's the nature, and the beauty of the beast.
  • We also know that they are well structured. For example, an email has a very easily determined structure: there is the header, the body, attachments, etc. The content of the unstructured data (body, attachments) can be searched for relevant information and keywords (a small parsing sketch follows this list). Bad data might be a corrupted attachment or garbled text in the body, but other than that, errors are, almost by definition, not really bad data.
  • What do we want from Social Media's Big Data? Mostly the trends of the masses. If you clean it up, that very exercise could corrupt the data.
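
To illustrate the point that an email's structure is easily determined even though its content is unstructured, here is a minimal Python sketch using the standard library's email module; the message and keyword list are invented examples.

```python
# Minimal sketch: an email's structure (headers, body, attachments) is easy to
# walk programmatically, even though the body text itself is unstructured.
from email import message_from_string

raw = """\
From: customer@example.com
To: support@example.com
Subject: Outage complaint
Content-Type: text/plain

The service was down all morning and I am very unhappy about the outage.
"""

KEYWORDS = {"outage", "unhappy", "refund"}     # invented keywords of interest

msg = message_from_string(raw)
print("Header fields:", dict(msg.items()))

for part in msg.walk():                         # walk body and any attachments
    if part.get_content_type() == "text/plain":
        body = part.get_payload()
        hits = {w for w in KEYWORDS if w in body.lower()}
        print("Keywords found in body:", hits)
```
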
Senile data forgets its source and loses relevance and accuracy

There is an altogether different situation with many of the nouveau trendy corporate Big Data projects. In this case, big data is likely to be consolidated data coming from a number of sources, including those suffering from data senility. Senile data has been through the wringer: it has moved from residence to residence, been "cleansed," and perhaps never seen the light of day. A data warehouse is usually populated with data from a huge number of sources, and fallible humans have pored through it, run human-defined cleansing and validation algorithms, and then subjected it to manually-programmed integration code. It is incumbent upon the mining and analysis functions to accommodate assumptions about data quality.

So, as you can see, data quality and cleansing become an altogether different problem for Big Data.