operational systems. The operational staff actually
expect their applications to run successfully against this dirty data!
In many cases, modifying an operational system is
not cost-effective. On the other hand, old practices
and bad habits that produce these data quality
problems should be addressed. Ask your business
sponsor to make the owners of the operational
systems and the business executives aware of the
cost and effort it takes to cleanse their bad data for
the BI decision-support environment. Some general
data quality standards should be implemented
across the entire organization to avoid perpetuating
data quality problems.
Organizations devote close to 80 percent of the BI
project time to back-end efforts, including labor-
intensive data cleansing. Although tools can help
with assessing the extent of data quality problems
in the operational systems, they cannot magically
turn bad data into good data.
Cleansing data is a time-intensive and expensive
process. Analyze, prioritize, and then choose your
battles, since cleansing 20 percent of the enterprise
data may satisfy 80 percent of the information needs.
About 80 percent of the data transformation effort is
spent on enforcing business data domain rules and
business data integrity rules, and only about 20
percent of the effort is spent on technical data
conversion rules.
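To make that distinction concrete, here is a minimal sketch in Python; the field names, valid codes, and rules are hypothetical examples, not taken from any particular system:

    # Minimal sketch contrasting the two kinds of transformation rules.
    # All field names, codes, and rules are hypothetical examples.

    from datetime import date

    VALID_STATUS_CODES = {"A", "I", "P"}  # business data domain rule: allowed values

    def transform(raw: dict) -> dict:
        # Technical data conversion rules: type, length, and format changes.
        record = {
            "customer_id": int(raw["CUST-ID"]),             # string to integer
            "name": raw["CUST-NAME"].strip()[:40],          # trim and enforce length
            "status": raw["STATUS"].upper(),
            "order_date": date.fromisoformat(raw["ORD-DT"]),
            "ship_date": date.fromisoformat(raw["SHP-DT"]),
        }

        # Business data domain rule: status must be one of the agreed codes.
        if record["status"] not in VALID_STATUS_CODES:
            raise ValueError(f"invalid status code: {record['status']!r}")

        # Business data integrity rule: an order cannot ship before it is placed.
        if record["ship_date"] < record["order_date"]:
            raise ValueError("ship_date precedes order_date")

        return record

The conversions are mechanical and quick to specify; the business rules require agreement with the data owners, which is where most of the effort goes.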
The most common symptoms of dirty source data
are data inconsistencies and overuse of data
elements, especially in old flat files, where one data
element can explicitly be redefined half a dozen
times or can implicitly have half a dozen different
meanings.
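As a hypothetical illustration (sketched in Python), a single field in a fixed-length flat-file record may carry entirely different content depending on a record-type indicator, which is the kind of implicit redefinition described above:

    # Hypothetical example of an overloaded data element in a fixed-length
    # flat-file record: the same ten bytes mean different things depending
    # on the record-type indicator in the first byte.

    def interpret_field(record: str) -> tuple[str, str]:
        rec_type = record[0]           # first byte: record-type indicator
        field = record[10:20].strip()  # the overloaded data element

        if rec_type == "C":
            return ("customer_phone", field)     # phone number on customer records
        if rec_type == "V":
            return ("vendor_tax_id", field)      # tax ID on vendor records
        if rec_type == "E":
            return ("employee_badge_no", field)  # badge number on employee records
        return ("unknown", field)                # an undocumented meaning

    # One physical field, several logical meanings, and each meaning needs
    # its own cleansing and transformation logic in the BI environment.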
Why use automated software tools for data
transformation? Data-profiling tools can significantly
shorten the time it takes to analyze data domains.
ETL tools can perform data type and length
conversions and code translations in minutes, rather
than hours when done manually. However, writing
data-cleansing algorithms is still a manual effort
that must be completed before the ETL tool can
apply them.
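As a rough sketch of what a data-profiling tool automates, the following Python fragment computes value frequencies and character patterns for one column; the file name and column name are assumptions for illustration only:

    # Rough sketch of one data-profiling step: frequency and pattern
    # analysis of a single column. File and column names are hypothetical.

    import csv
    import re
    from collections import Counter

    def char_pattern(value: str) -> str:
        # Map digits to '9' and letters to 'A', keeping other characters,
        # so "555-1212" and "312-4567" both profile as "999-9999".
        return re.sub(r"[A-Za-z]", "A", re.sub(r"[0-9]", "9", value))

    values = Counter()
    patterns = Counter()
    with open("customers.csv", newline="") as f:
        for row in csv.DictReader(f):
            v = row["PHONE"].strip()
            values[v] += 1
            patterns[char_pattern(v)] += 1

    print("Top values:  ", values.most_common(5))
    print("Top patterns:", patterns.most_common(5))

Many distinct patterns in one column is a classic sign of an overused data element or inconsistent formatting; a profiling tool surfaces this in minutes, but deciding how to cleanse each variant remains a manual task.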
The ETL process will run into fewer problems if
extensive source data analysis is performed ahead
of time. Source data rules are usually discovered
proactively during requirements gathering, data
analysis, and meta data repository analysis. They