ETL Architecture Guide
Purpose
The purpose of this document is to present architectural guidelines for a common-sense approach to supplying the best attainable quality of data for the data mart.
Terms
ETL – an industry-standard abbreviation for “extract, transform and load”, which favors the use of applications to perform data transformations outside of a database on a row-by-row basis. Traditionally referred to as the “hub and spokes” practice.
ELT – an industry-standard abbreviation for “extract, load and transform”, which favors the use of relational databases as the staging point before performing any transformations of source data into target data.
Topic
- When functional information requirements have
been defined for the end products of the data mart, each of these
requirements will be traced back to the appropriate source systems of
record.
- Best practices will be used to determine the most expedient approach to obtaining the most accurate, complete and valid data from each source.
- Flat (non-hierarchical, non-relational) file extracts may be either pulled by an application such as FTP or another mechanism (tape, disk, etc.) or pushed by a source system extract application to a staging area.
- Relational data may also be queried to produce
extract files or to load directly into relational database staging tables.
- An ETL metadata reference table will be defined (data_source_type) to uniquely identify each type of data source (flat file, spreadsheet, hierarchical database, relational database, multi-valued database, comma-separated variable-length records, fixed-length records, etc.).
- An ETL metadata reference table will be defined (data_source) to uniquely identify all data sources (an illustrative definition of both reference tables appears in the sketches following this list).
- All data sources will be required to have event metadata either embedded with the source data itself in standardized header/trailer records, or in a separate file or table which must be accessible by the ETL tools or applications at the same time as the source data itself. At a minimum this will include file or table name, date of creation, unique creator/author identifier, checksum, size in bytes, character set, number of records/rows, data source type (see above) and data source identifier (see above). Additional items may also include a cyclic redundancy check (“CRC”), hash totals and a public-key encryption token (see the metadata-capture sketch following this list).
- Where no secure data channel exists, corporate
data must be encrypted by a standard application before transmission to a
staging location.
- All data mart staging areas must be kept secure
with only those users responsible for operational management and the
applications themselves having the necessary permissions. Corporate data is extremely valuable and
must be kept secure at all times as it is moved from source systems toward
the data mart.
- Where large quantities of data may need to be
moved across a wide area network, the ability to compress the data at the
source prior to transmission and expand the data upon receipt into the
staging area is required.
- The ability to archive data with an associated time-context naming convention is frequently useful and often required. Preferably the naming convention for files will include the format YYYYMMDD_HHmmSS, where YYYY = century and year (e.g., 2007); MM = Gregorian/calendar month with leading zero (e.g., 03 for March); DD = day of month; HH = hour of day in Coordinated Universal Time (Zulu/Greenwich Mean Time; e.g., 6:30 AM Central Standard Time = 12:30 PM UTC); mm = minute of hour; SS = seconds of minute (see the timestamp sketch following this list).
- Any and all database objects developed to support the ETL process (tables, views, functions, stored procedures, dictionary table population scripts) must have individual drop and create/replace scripts created for each object to facilitate rapid deployment of an ETL environment to a new database or server.
- Any and all scripts associated with the ETL process must be placed into a standard version control system to support change management, to enable rollback in case of failure and to assist in root cause determination.
- The version control system will be the single source used to produce a continuous integration environment for unit testing all ETL scripts. Business-rule-driven unit tests will be designed for each ETL component and executed after any change which touches either the data processed by the component or any component immediately upstream or downstream of the component.
- Continuous integration testing of the ETL system will require the creation of a separate ETL data work area within the development environment which can be completely dropped and rebuilt, and in which unit tests can be run on static datasets to produce expected results. These continuous integration tests must produce easily determined pass/fail results and be self-reporting to the ETL development team (see the unit-test sketch following this list).
- All required client and server database, tool or other software (including upgrades, patches, service packs or hot-fixes) must be kept in a separate escrow so that an entire ETL environment can be reproduced exactly as it was prior to any upgrade of any component.
- All database, FTP and other connectivity information for each environment (development, testing/quality assurance and production) must be kept in a centralized, single location accessible by the development team and the ETL applications themselves. Appropriate access controls must be maintained on this information so that only authorized individuals can update it as needed, and so that it provides a single point of maintenance which may be applied to all environments.
- Any component upgrade must be tested in the
continuous integration environment prior to deployment to quality
assurance and must never be applied to production without passing the
continuous integration and quality assurance testing.
- There will be at a minimum three (3) separate environments created to support ETL: 1) development, 2) test/quality assurance and 3) production. All
new development and unit testing activities will begin in the ETL
development environment. Prior to
migration of the ETL component to the test/QA environment, a test plan
must be written and communicated to the quality assurance staff. This test plan must include: all setup
instructions, process execution instructions, expected results and/or
success criteria and any restart instructions.
- All ETL processes will make use of parameters to provide the maximum amount of reusability without re-coding. This also reduces the number of modules to maintain, while increasing the number of tests required per process/module (see the parameter-driven sketch following this list).
- All ETL metadata will reside in a central
repository accessible via a generic search engine by all stakeholders in
the process.
- ETL metadata must be accessible and updateable by all users (i.e., wiki-style) to keep the data current and to add additional perspectives. The metadata should be periodically peer reviewed by metadata stewards for content accuracy and currency.
- Every relational database table, whether source, interim or final, will have a minimum of the following attributes defined: name; classification (process control, process log, dimensional, master/reference, transaction, fact, associative, reject, transformation); primary key; business key; foreign keys; surrogate key; estimated row count; anticipated growth factor per month; average row length; and indexing recommendations.
- Every file, whether source, interim or final, will have a minimum of the following attributes defined: creation date; file creator name; file classification (process control, log, reject, source, final); file type (fixed length, comma-separated values, or other delimited, with the delimiter specified); and a structural layout containing field name, data type (numeric, date, date/time, alpha, GUID), maximum length, required (Y/N), identity, domain of values and range of values. This must supply enough information to construct data edit scripts to assess the validity and general quality of the information contained in the file itself.
- Every relational database table column will have the same metadata requirements as defined for files immediately above; in addition, foreign key columns must be associated with their parent table names where applicable to facilitate further automation of edit checking. Any additional business rules must be defined where applicable to a given column; this will support the addition of data-driven validation of business rules. Each column may have zero or many business rules applied to its validation process. Each business rule for a given column must have a gate, i.e., an event (which may be empty) or a state of the relevant data required in order to fire the business rule edit test. As an example, if address line 1, city and state are supplied we may fire the business rule to validate the ZIP code for addresses in the United States (see the business-rule sketch following this list).
- Each multi-dimensional cube must have the following items defined in the ETL metadata. For each dimension: name, expected number of entries, anticipated percentage of sparse data, and an OLAP query to produce meaningful control totals for the dimension. For the cube itself: one or more OLAP queries to validate the cube being updated (these may be run before building the cube and again after, in order to validate that the new data added had the desired impact).
- Any ETL process that must execute a third-party product must log all parameters sent to the third-party application and the resulting return code received from the product, along with a timestamp for each and a mapping of the return code onto a simple pass/fail status (see the wrapper sketch following this list).
- A centralized dashboard must be developed to monitor and communicate the status of each cycle of the ETL process. All ETL processes must accurately update the appropriate ETL logs upon completion of their cycle to reflect both total data processed and total data rejected. This dashboard must support establishing performance thresholds for:
- Elapsed time required to process a given
quantity of data;
- Data quality assessment as a percent of total
new and as a percent of total overall for master/reference, transaction,
dimension and fact data
The dashboard must also support drilling down to the details of the ETL logs to provide immediate insight into the root causes of any out-of-threshold aggregates.
- All data mart surrogate keys will be assigned via a programmatic approach (not via database auto-assignment / identity) defined for each table as appropriate, to provide an added measure of control and to support portability of data between environments (see the key-assignment sketch following this list).
- All data added to the data mart will be identified by a unique data source identifier, which will have been previously established in the data mart source system control table.
- All data loaded into the data mart during the same ETL processing cycle will be identified by a unique identifier assigned at the beginning of the cycle. This identifier will be generated at the time of the ETL processing cycle and will have additional metadata about the ETL cycle associated with it, including: description of the reason for the cycle, ETL start date and time, and ETL environment identifier (see the cycle-identifier sketch following this list).
- When an ETL processing cycle successfully completes and the resulting data in the data mart has passed both automated and any manual quality assurance checks, the staging data that was successfully loaded will be associated with the unique ETL processing session identifier and archived in either a flat file or a database backup.
- Prior to the start of any ETL processing cycle, the destination data mart, any associated OLAP cubes, and any staging areas to be affected during the process must be backed up. This is necessary to supply a recovery position should a catastrophic failure occur during the process. If, for performance reasons, the backups are done at the end of the previous ETL process, they must be verified to have been successful and to still be viable.
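The sketches below illustrate selected requirements from the list above. They are minimal, non-authoritative examples: aside from the data_source_type and data_source table names defined earlier, every table, column, file, parameter and function name is an assumption chosen for illustration, and Python with SQLite stands in for whatever database and tooling are actually selected.

A possible shape for the two ETL metadata reference tables, data_source_type and data_source:

    import sqlite3

    # Minimal sketch of the two ETL metadata reference tables; column names
    # other than the table names themselves are assumptions.
    DDL = """
    CREATE TABLE IF NOT EXISTS data_source_type (
        data_source_type_id  INTEGER PRIMARY KEY,
        description          TEXT NOT NULL UNIQUE  -- e.g. 'flat file', 'relational database'
    );
    CREATE TABLE IF NOT EXISTS data_source (
        data_source_id       INTEGER PRIMARY KEY,
        data_source_type_id  INTEGER NOT NULL REFERENCES data_source_type,
        name                 TEXT NOT NULL UNIQUE, -- human-readable source system name
        description          TEXT
    );
    """

    conn = sqlite3.connect("etl_metadata.db")
    conn.executescript(DDL)
    conn.executemany(
        "INSERT OR IGNORE INTO data_source_type (description) VALUES (?)",
        [("flat file",), ("spreadsheet",), ("hierarchical database",),
         ("relational database",), ("multi-valued database",),
         ("comma-separated variable-length records",), ("fixed-length records",)],
    )
    conn.commit()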
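The metadata-capture sketch below gathers the minimum event metadata for a flat-file source; the SHA-256 checksum, newline-delimited records and the helper name describe_extract_file are assumptions.

    import hashlib
    import os
    from datetime import datetime, timezone

    def describe_extract_file(path, creator_id, data_source_type_id, data_source_id,
                              character_set="utf-8"):
        """Collect the minimum event metadata required for a flat-file data source."""
        sha256 = hashlib.sha256()
        rows = 0
        with open(path, "rb") as fh:
            for line in fh:          # newline-delimited records assumed
                sha256.update(line)
                rows += 1
        return {
            "file_name": os.path.basename(path),
            "date_of_creation": datetime.fromtimestamp(
                os.path.getmtime(path), tz=timezone.utc).isoformat(),
            "creator_id": creator_id,
            "checksum_sha256": sha256.hexdigest(),
            "size_in_bytes": os.path.getsize(path),
            "character_set": character_set,
            "record_count": rows,
            "data_source_type_id": data_source_type_id,
            "data_source_id": data_source_id,
        }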
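The timestamp sketch below builds an archive file name carrying the YYYYMMDD_HHmmSS time context in UTC, as described above; the prefix and extension are assumed.

    from datetime import datetime, timezone

    def archive_name(prefix="customer_extract", extension="dat"):
        """Build a file name whose time context follows the YYYYMMDD_HHmmSS convention (UTC)."""
        stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
        return f"{prefix}_{stamp}.{extension}"

    # Format example: 'customer_extract_20070301_123000.dat'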
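The unit-test sketch below shows how a business-rule-driven test might run against a static dataset and produce an unambiguous, self-reporting pass/fail result; the transformation under test is a hypothetical placeholder for a real ETL component.

    import unittest

    def standardize_state(value):
        """Hypothetical ETL transformation under test: trim and upper-case a state code."""
        return value.strip().upper()

    class StateCodeRuleTest(unittest.TestCase):
        """Runs against a static dataset and yields a clear pass/fail result."""

        STATIC_INPUT = [" tx", "Mn ", "ca"]
        EXPECTED = ["TX", "MN", "CA"]

        def test_state_codes_are_standardized(self):
            self.assertEqual([standardize_state(v) for v in self.STATIC_INPUT],
                             self.EXPECTED)

    if __name__ == "__main__":
        unittest.main()  # non-zero exit code on failure supports self-reporting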
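The parameter-driven sketch below shows a single reusable load module whose behaviour is controlled entirely by parameters rather than re-coding; all parameter names and defaults are assumptions.

    import argparse

    def run_load(source_file, target_table, reject_table, batch_size):
        """One reusable load module; behaviour is driven entirely by its parameters."""
        print(f"Loading {source_file} into {target_table} "
              f"(rejects -> {reject_table}, batch size {batch_size})")
        # ... actual extract/validate/load logic would go here ...

    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="Parameter-driven ETL load step")
        parser.add_argument("--source-file", required=True)
        parser.add_argument("--target-table", required=True)
        parser.add_argument("--reject-table", default="etl_reject")
        parser.add_argument("--batch-size", type=int, default=10000)
        args = parser.parse_args()
        run_load(args.source_file, args.target_table, args.reject_table, args.batch_size)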
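The business-rule sketch below implements the ZIP-code example: a gate decides whether the rule fires (address line 1, city and state supplied), and the rule itself performs the edit test. The field names and rule-registry structure are assumptions.

    import re

    ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")  # US ZIP or ZIP+4

    def zip_gate(row):
        """Gate: fire the rule only when address line 1, city and state are supplied."""
        return all(row.get(f) for f in ("address_line1", "city", "state"))

    def zip_rule(row):
        """Edit test: validate the ZIP code for United States addresses."""
        return bool(ZIP_PATTERN.match(row.get("zip", "")))

    # Each column may carry zero or many (gate, rule) pairs.
    BUSINESS_RULES = [
        ("zip", zip_gate, zip_rule),
    ]

    def validate_row(row):
        """Return the list of columns whose fired business rules failed."""
        return [column for column, gate, rule in BUSINESS_RULES
                if gate(row) and not rule(row)]

    # validate_row({"address_line1": "1 Main St", "city": "Austin",
    #               "state": "TX", "zip": "7870"}) -> ['zip']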
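The wrapper sketch below executes a third-party product while logging the parameters sent, the return code received, a timestamp for each, and a mapping of the return code onto a simple pass/fail status; the command shown and the return codes treated as success are assumptions.

    import logging
    import subprocess
    from datetime import datetime, timezone

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("etl.third_party")

    def run_third_party(command, args, success_codes=(0,)):
        """Invoke a third-party product and log parameters, return code and timestamps."""
        started = datetime.now(timezone.utc).isoformat()
        log.info("start=%s command=%s args=%s", started, command, args)
        result = subprocess.run([command, *args])
        finished = datetime.now(timezone.utc).isoformat()
        status = "PASS" if result.returncode in success_codes else "FAIL"
        log.info("finish=%s return_code=%s status=%s", finished, result.returncode, status)
        return status

    # e.g. run_third_party("/opt/vendor/bin/cleanse", ["--input", "stage.dat"])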
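The key-assignment sketch below centralizes programmatic surrogate key assignment in a small control table instead of relying on database identity columns; the surrogate_key_control name and single-process usage are assumptions.

    import sqlite3

    def next_surrogate_key(conn, table_name):
        """Assign the next surrogate key for a table from a central control table."""
        conn.execute("""CREATE TABLE IF NOT EXISTS surrogate_key_control (
                            table_name TEXT PRIMARY KEY,
                            last_key   INTEGER NOT NULL)""")
        conn.execute("INSERT OR IGNORE INTO surrogate_key_control VALUES (?, 0)",
                     (table_name,))
        conn.execute("UPDATE surrogate_key_control SET last_key = last_key + 1 "
                     "WHERE table_name = ?", (table_name,))
        (key,) = conn.execute("SELECT last_key FROM surrogate_key_control "
                              "WHERE table_name = ?", (table_name,)).fetchone()
        conn.commit()
        return key

    # conn = sqlite3.connect("etl_metadata.db")
    # next_surrogate_key(conn, "dim_customer") -> 1, then 2, and so on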
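The cycle-identifier sketch below generates the unique identifier assigned at the beginning of an ETL processing cycle, together with the metadata associated with it; the dictionary structure and environment codes are assumptions.

    import uuid
    from datetime import datetime, timezone

    def open_etl_cycle(reason, environment_id):
        """Create the unique cycle identifier plus its associated cycle metadata."""
        return {
            "etl_cycle_id": str(uuid.uuid4()),
            "reason": reason,                  # description of the reason for the cycle
            "etl_start": datetime.now(timezone.utc).isoformat(),
            "environment_id": environment_id,  # e.g. DEV, QA, PROD
        }

    # cycle = open_etl_cycle("monthly refresh", "QA")
    # Every row loaded during this cycle would carry cycle["etl_cycle_id"].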