Performing parallel ETL with Greenplum's gpfdist and external tables

June 08, 2011

One of the coolest features that Greenplum offers to data warehousing
and business intelligence operators, as far as ETL
is concerned, is the combination of read-only external tables
with gpfdist, Greenplum’s parallel file distribution server.

The typical use case for this solution is the parallel loading of text files
(coming from heterogeneous sources – databases or applications) into a Greenplum
data warehouse. For those of you who want to know more about Greenplum, I suggest that
you visit the Greenplum website, download
the Community Edition,
install it and start testing it.

gpfdist is very similar to an HTTP server. It exposes via TCP/IP a local directory which contains
flat text files (typically delimited or CSV), served in read-only mode. These files can be periodically
generated from an operational database via unload/export/dump operations (depending on your business requirements
this could mean once a day or more frequently). On our typical PostgreSQL-driven
operational databases, we normally perform this operation using the
COPY command.
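
For instance, as a minimal sketch assuming a sales table on the operational database (the table and path names are hypothetical), a daily export might look like this:

-- export yesterday's sales as a tab-delimited text file, written
-- server side into the directory that gpfdist will expose
COPY (SELECT * FROM sales WHERE ts >= date '2011-06-07' AND ts < date '2011-06-08')
TO '/var/lib/flat_files/sales_20110607.txt';

The default text format of COPY (tab-delimited) matches the default of the FORMAT 'TEXT' clause used by external tables, as we will see below.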

A very simple startup of the file distribution server is:

gpfdist -d ${FLAT_FILES_DIR} -p 8081 -l /tmp/gpfdist.8081.log

where ${FLAT_FILES_DIR} is the directory that contains the flat files.
Note: for the sake of simplicity I have used TCP port 8081 and a log file in the system temporary directory;
I advise you to read the administration guide for more detailed information on command line options.

It is clear that the extraction process made available by gpfdist
can take advantage of parallelism at multiple levels:

  • host: you might have different hosts where your files reside – one of our typical setups at 2ndQuadrant
    is a distributed operational database with PL/Proxy on multiple servers
  • file system: depending on your physical disk layout and the number of spindles
  • CPUs
  • network: this might be important in large installations where massive data loading is a priority and a single
    network interface might represent a bottleneck (see the sketch right after this list)
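
As a minimal sketch of the file system and network points above, assuming two directories on separate disks (paths and ports are hypothetical), you could run one gpfdist instance per disk:

# one instance per disk and TCP port, both running in the background
gpfdist -d /data1/flat_files -p 8081 -l /tmp/gpfdist.8081.log &
gpfdist -d /data2/flat_files -p 8082 -l /tmp/gpfdist.8082.log &

Each instance can then be listed as a separate location in the external table definition, as we will see below.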

The Greenplum documentation states that a single gpfdist application can serve
files to the Greenplum segments at a rate varying from 200MB/s to 350MB/s (this of course
depends on the server’s setup and the characteristics of the files).

However, file distribution is only one aspect of the described solution: like every server
application, gpfdist needs a client to initiate the extraction process. One way
to do this in Greenplum is to create an external table on the
master, which maps to one or more locations defined with the gpfdist:// protocol. A very
simple example of the definition of such an external table is:

CREATE EXTERNAL TABLE oltp_sales (
ts TIMESTAMP, customer TEXT, product TEXT, quantity REAL, amount REAL
)
LOCATION ('gpfdist://oltp-server:8081/sales_*')
FORMAT 'TEXT';
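
If you run multiple gpfdist instances, as in the earlier sketch, you can list them all in the LOCATION clause (the host names here are hypothetical):

CREATE EXTERNAL TABLE oltp_sales (
ts TIMESTAMP, customer TEXT, product TEXT, quantity REAL, amount REAL
)
LOCATION (
'gpfdist://oltp-server1:8081/sales_*',  -- one URI per gpfdist instance
'gpfdist://oltp-server2:8082/sales_*'
)
FORMAT 'TEXT';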

This statement won’t initiate any extraction by itself. It simply creates the definition of an external table.
Greenplum is now aware that this is a sort of virtual table, whose data is served by one or more
gpfdist servers. A query like the following will initiate parallel extraction:

SELECT * FROM oltp_sales;

Every Greenplum segment will connect directly and concurrently to gpfdist. The server will divide each
file into chunks and distribute the work among the segments. You can easily appreciate the level of parallelism
that this architecture gives to ETL operators (imagine multiple source servers, multiple network cards in a 10Gb
environment, multiple Greenplum segments).

The cool thing about this approach is that it allows you to perform in-database transformation operations
via the SQL language, moving the whole ETL process into Greenplum. You can perform dimension lookups directly in
your SELECT through joins, and then INSERT data into your warehouse fact tables.
Greenplum, like PostgreSQL, allows users to perform INSERT INTO ... SELECT ...
operations.
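
As a minimal sketch, assuming hypothetical fact_sales, dim_customer and dim_product tables in the warehouse, the dimension lookup and the load can happen in a single statement:

-- look up surrogate keys in the dimensions and load the fact table,
-- reading the flat files in parallel through the external table
INSERT INTO fact_sales (ts, customer_key, product_key, quantity, amount)
SELECT s.ts, c.customer_key, p.product_key, s.quantity, s.amount
FROM oltp_sales s
JOIN dim_customer c ON c.customer_name = s.customer
JOIN dim_product p ON p.product_name = s.product;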

I am sure you are now well aware of the potential of Greenplum’s file distribution server and external tables.
I suggest however that you download the Community Edition and read the documentation, as it gives
you more detailed information about the gpfdist application and external tables; it is also
very important – for production environments – to put error handling procedures in place.
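
As a minimal sketch of error handling, assuming the single row error isolation syntax of Greenplum 4.x (check the administration guide for your version), the external table definition can capture malformed rows instead of aborting the whole load:

CREATE EXTERNAL TABLE oltp_sales (
ts TIMESTAMP, customer TEXT, product TEXT, quantity REAL, amount REAL
)
LOCATION ('gpfdist://oltp-server:8081/sales_*')
FORMAT 'TEXT'
LOG ERRORS INTO sales_load_errors  -- malformed rows end up here
SEGMENT REJECT LIMIT 100 ROWS;     -- abort only past 100 rejects per segment

Malformed rows are stored in the sales_load_errors table, and the extraction fails only if more than 100 rows are rejected on a single segment.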

Greenplum’s gpfdist is a flexible and versatile tool for system integrators
and data warehousing experts. Read-only external tables are only one way of using it, but there are
several others, including the gpload utility and the integration with Kettle (Pentaho Data Integration).

For information on Greenplum’s Community Edition, as well as assistance and help, you can subscribe
to the community forums.
