Concepts - EDB Postgres Lakehouse

EDB Postgres Lakehouse is the EDB Postgres® AI platform's solution for running rapid analytics against operational data.

Major concepts

  • Lakehouse nodes query Lakehouse tables in managed storage locations.
  • Lakehouse Sync can create Lakehouse tables from Transactional tables in a source database.

Here's how it fits together:

[Figure: Level 50 basic architecture]

Lakehouse node

A Postgres Lakehouse node is Postgres with a vectorized query engine that's optimized to query Lakehouse tables, while still falling back to standard Postgres execution for full compatibility.

Lakehouse nodes are stateless and ephemeral. Scale them up or down based on workload requirements.
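
Because a Lakehouse node is still Postgres, any Postgres client can query it over the standard wire protocol. Here's a minimal sketch using Python's psycopg2; the hostname, credentials, and table name are all hypothetical placeholders:

```python
import psycopg2

# Hypothetical connection details -- substitute your Lakehouse node's values.
conn = psycopg2.connect(
    host="my-lakehouse-node.example.com",
    port=5432,
    dbname="postgres",
    user="edb_admin",
    password="secret",
)

with conn.cursor() as cur:
    # Analytical queries over Lakehouse tables run through the vectorized
    # engine; anything it can't handle falls back to standard Postgres.
    cur.execute("SELECT count(*) FROM sales")
    print(cur.fetchone()[0])

conn.close()
```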

Lakehouse tables

Lakehouse tables are stored using highly compressible, columnar storage formats that are optimized for analytics and interoperable with the broader analytics ecosystem. Currently, Postgres Lakehouse nodes can read tables stored using the Delta Protocol ("delta tables"), and Lakehouse Sync can write them.
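
Because delta tables are an open format, tools outside Postgres Lakehouse can read them too. As an illustration (not EDB tooling), here's a sketch using the deltalake package (the delta-rs Python bindings) to load a hypothetical table from object storage:

```python
from deltalake import DeltaTable

# Hypothetical table URI; S3 credentials are taken from the environment.
dt = DeltaTable("s3://example-bucket/lakehouse/sales")
df = dt.to_pandas()  # materialize the columnar data as a pandas DataFrame
print(df.head())
```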

Managed storage location

A managed storage location is a location in object storage where EDB controls the file layout and writes Lakehouse tables on your behalf, organized so that Postgres Lakehouse nodes can query them. Technically, it's an implementation detail that these tables are stored in buckets. Managed storage locations are a subset of an upcoming "storage location" feature that will also support "external storage locations," where you bring your own bucket.

Lakehouse Sync

Lakehouse Sync is a data migration service offered as part of the EDB Postgres AI platform. It can "sync" tables from a transactional database to Lakehouse tables in a destination storage location. Currently, it supports source databases hosted in the EDB Postgres AI Cloud Service (formerly known as BigAnimal). In the future, it will support more source databases (such as Oracle) and more deployment locations (such as on-premises databases). It's built using Debezium.

Terminology

Lakehouse

The "Lakehouse architecture" is a data engineering practice; the term is a portmanteau of "data lake" and "data warehouse," and the architecture offers the best of both. Its central tenet is that data is stored in object storage, generally in columnar formats like Parquet, where different query engines can process it for their own specialized purposes, using the optimal compute resources for a given query.
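
To make that interoperability concrete: a Parquet file written once to object storage can be read by any engine that speaks the format. A minimal sketch with pyarrow, using a hypothetical path:

```python
import pyarrow.parquet as pq

# Any Parquet-aware engine can read this same file; the path is hypothetical.
table = pq.read_table("s3://example-bucket/events/part-00000.parquet")
print(table.schema)    # column names and types travel with the data
print(table.num_rows)
```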

Vectorized query engine

A vectorized query engine is a query engine optimized for running queries on columnar data, processing batches of column values at a time rather than one row at a time. Most analytics engines use vectorized query execution. Postgres Lakehouse uses Apache DataFusion.
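
DataFusion is also usable on its own, which makes the idea easy to demonstrate. This sketch uses its Python bindings (the datafusion package) to run a columnar query over a hypothetical Parquet file:

```python
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_parquet("events", "events.parquet")  # hypothetical file

# Execution is vectorized: operators process columnar batches, not rows.
df = ctx.sql("SELECT user_id, count(*) AS n FROM events GROUP BY user_id")
print(df.to_pandas())
```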

Delta tables

We use the term "Lakehouse tables" to avoid overcommitting to a particular format (since we might eventually support Iceberg or Hudi, for example). But technically, we're using Delta Tables. A Delta Table is a well-defined container of Parquet files and JSON metadata, according to the "Delta Lake" spec and open protocol. Delta Lake is a Linux Foundation project.
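
To see the "Parquet files plus JSON metadata" structure for yourself, you can inspect a Delta table with the deltalake package; the table URI below is hypothetical:

```python
from deltalake import DeltaTable

dt = DeltaTable("s3://example-bucket/lakehouse/sales")  # hypothetical URI
print(dt.version())  # latest version committed to the _delta_log/*.json log
print(dt.files())    # Parquet data files referenced by the log
print(dt.schema())   # table schema stored in the JSON metadata
```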

How it works

Postgres Lakehouse is built using a number of technologies:

  • PostgreSQL
  • Seafowl, an analytical database acquired from Splitgraph
  • Apache DataFusion, the query engine used by Seafowl
  • Delta Lake (and specifically delta-rs), for implementing the storage and retrieval layer of Delta Tables

Postgres Lakehouse is built by much of the Splitgraph team, along with Postgres experts from EDB. We have many exciting plans for further integration with existing EDB technologies.

Level 100

The most important thing to understand about Postgres Lakehouse is that it separates storage from compute. This allows you to scale them independently, which is ideal for analytical workloads, where queries can be unpredictable and spiky. You wouldn't want to keep a machine mostly idle just to hold data on its attached hard drives. Instead, you can keep data in object storage (in highly compressible formats) and provision compute only when you need to query it.

[Figure: Level 100 Architecture]

On the compute side, a vectorized query engine is optimized to query Lakehouse tables while still falling back to Postgres for full compatibility.

On the storage side, Lakehouse tables are stored using highly compressible, columnar storage formats optimized for analytics.

Level 200

Here's a slightly more comprehensive diagram of how these services fit together:

[Figure: Level 200 Architecture]

Level 300

Here's a more detailed, zoomed-in view of "what's in the box":

[Figure: Level 300 Architecture]

