Data Mining Activities
The activities for data mining do not need to be performed linearly. Figure 13.2
indicates which activities can be performed concurrently. The list below briefly
describes the activities associated with Step 13, Data Mining.
1. State the business problem.
Set and prioritize goals before starting the data mining effort (for example,
increasing profits, reducing costs, creating innovative product strategies, or
expanding market share). Reaching any of these goals requires an investment of
time and money, as well as a commitment from management to implement a data
mining solution at the organization.
2. Collect the data.
One of the most time-consuming activities of data mining is collecting the
appropriate types and quantities of data. To ensure a correct representation,
first identify all the data needed for analysis. This includes data stored in
the operational databases, data from the BI target databases, and any external
data that will have to be included. Once you have identified the source data,
extract all pertinent data elements from these internal and external data
sources.
3. Consolidate and cleanse the data.
Redundantly stored data is the norm rather than the exception in most
organizations. Therefore, the data from the various sources has to be
consolidated and cleansed. If the internal data is to be supplemented by
acquired external data, match the external data to the internal data, and
determine the correct content.
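The consolidation and matching described above can be sketched in a few lines of Python. This is a minimal illustration and not a prescribed method: the source names (`billing`, `crm`), the field names, and the idea that internal and external data share a `customer_id` key are all assumptions made for the example. In practice, matching external data often requires fuzzier rules than an exact key.

```python
# Hypothetical internal records from two source systems (names and fields are assumptions).
billing = [
    {"customer_id": 101, "name": "Ann Lee ", "zip": None},
    {"customer_id": 102, "name": "BOB RAY", "zip": "30305"},
]
crm = [
    {"customer_id": 101, "name": "ann lee", "zip": "30301"},
]
# Hypothetical acquired external data, keyed by the same customer_id (an assumption).
external = {101: {"household_income": 58000}}


def normalize(record):
    """Trim whitespace and standardize case so equivalent names compare equal."""
    return {**record, "name": record["name"].strip().title()}


def consolidate(*sources):
    """Merge records that share a customer_id.

    Earlier sources take precedence; later sources only fill in
    fields that are still missing (None).
    """
    merged = {}
    for source in sources:
        for record in map(normalize, source):
            current = merged.setdefault(record["customer_id"], {})
            for field, value in record.items():
                if current.get(field) is None:
                    current[field] = value
    return merged


def attach_external(internal, external_data):
    """Add external attributes to each internal record that has a match."""
    return {cid: {**rec, **external_data.get(cid, {})} for cid, rec in internal.items()}


customers = attach_external(consolidate(billing, crm), external)
```

Here the redundant record for customer 101 collapses into one entry, the missing ZIP code is filled from the second source, and the external attribute is attached only where a match exists.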
4. Prepare the data.
Before building an analytical data model, you need to prepare the data. Part of
data preparation is the classification of variables, which can be discrete or
continuous, qualitative or quantitative. Eliminate variables with missing
values, or replace the missing values with the most likely values. Knowing the
maximum, minimum, mean, median, and mode values of quantitative variables
provides great insight. To streamline the preparation process, consider
applying data reduction transformations. The objective of data reduction is to
combine several variables into one in order to keep the result set manageable
for analysis. For example, combine education level, income, marital status, and
ZIP code into one profile variable.
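As a rough illustration of these preparation steps, the sketch below summarizes a quantitative variable, replaces missing values with a likely value, and combines four variables into one profile variable. The records, the choice of median and mode as "most likely" values, the income threshold, and the profile format are all assumptions made for the example. It uses only the Python standard library and assumes Python 3.8 or later.

```python
from statistics import mean, median, mode

# Hypothetical customer records; None marks a missing value.
records = [
    {"education": "college", "income": 52000, "marital": "married", "zip": "30301"},
    {"education": "high school", "income": None, "marital": "single", "zip": "30301"},
    {"education": "college", "income": 61000, "marital": "married", "zip": "30305"},
    {"education": None, "income": 52000, "marital": "single", "zip": "30305"},
]


def present(rows, field):
    """Return the non-missing values of a field."""
    return [r[field] for r in rows if r[field] is not None]


def summarize(values):
    """Descriptive statistics for a quantitative variable."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
    }


def impute(rows, field, fill):
    """Return copies of the rows with missing values of `field` replaced by `fill`."""
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


def income_band(income, threshold=55000):
    """Reduce a continuous income to a discrete band (threshold is an assumption)."""
    return "low" if income < threshold else "high"


def profile(row):
    """Combine four variables into a single profile variable."""
    return "|".join(
        [row["education"], income_band(row["income"]), row["marital"], row["zip"]]
    )


income_stats = summarize(present(records, "income"))

# Use the median for the quantitative variable and the mode for the qualitative one.
prepared = impute(records, "income", income_stats["median"])
prepared = impute(prepared, "education", mode(present(records, "education")))

profiles = [profile(r) for r in prepared]
```

Banding the income before building the profile keeps the number of distinct profile values small, which is the point of data reduction. A raw income figure inside the profile would make nearly every profile unique.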
5. Build the analytical data model.
One of the most important activities of data mining is to build the analytical
data model. An analytical data model represents a structure of consolidated,
integrated, and time-dependent data that was selected and preprocessed from
various internal and external data sources. Once implemented, this model must
be able to continue "learning" while it is repeatedly used by the data mining
6.