Reference - EDB Postgres Lakehouse

Postgres Lakehouse is an early product. Eventually, it will support deployment modes across multiple clouds and on-premises. However, currently it's fairly limited in terms of where you can deploy it and what data you can query with it.

To get the best experience with Postgres Lakehouse, you should follow the "quick start" guide to query benchmarking data. Then you can try loading your own data with Lakehouse Sync. If you're intrigued, reach out to us and we can talk more about your use case and potential opportunities.

This page details some of the important bits to know.

Supported cloud providers and regions

AWS only: Currently, support for all Lakehouse features (Lakehouse nodes, Managed Storage Locations, and Lakehouse Sync) is limited to AWS.

EDB-hosted only: "Bring Your Own Account" (BYOA) regions are NOT currently supported for Lakehouse resources. Support is limited to ONLY EDB Postgres® AI - Hosted environments on AWS (a.k.a. "EDB-Hosted AWS regions").

This means you can select from one of the following regions:

  • North America
    • US East 1
    • US East 2
    • US West 2
  • Europe
    • EU Central 1
    • EU West 1
    • EU West 2
  • Asia
    • AP South 1
  • Australia
    • AP SouthEast 2

To be precise:

  • Lakehouse nodes can only be provisioned in EDB-hosted AWS regions
  • Managed Storage Locations can only be created in EDB-hosted AWS regions
  • Lakehouse Sync can only sync from source databases in EDB-hosted AWS regions

These limitations will be removed as we continue to improve the product. Eventually, we will support BYOA, as well as Azure and GCP, for all Lakehouse use cases. We will also add better support for "external" buckets ("bring your own bucket").

Supported AWS instances

When deploying a Lakehouse node, you must choose an instance type from the m6id family. Importantly, these instances come with locally attached NVMe drives.

Instances are ephemeral. The NVMe drives are used only for spill space while processing queries and for caching Delta Tables on disk. All data on the NVMe drives will be lost when the cluster is shut down.

System tables are persisted. Persistent data in system tables (users, roles, etc.) is stored in an attached block storage device and will survive a pause/resume cycle.

Supported instances

| API Name | Memory | vCPUs | Cores | Storage |
|---|---|---|---|---|
| m6id.large | 8.0 GiB | 2 vCPUs | 1 | 118 GB NVMe SSD |
| m6id.xlarge | 16.0 GiB | 4 vCPUs | 2 | 237 GB NVMe SSD |
| m6id.2xlarge | 32.0 GiB | 8 vCPUs | 4 | 474 GB NVMe SSD |
| m6id.4xlarge | 64.0 GiB | 16 vCPUs | 8 | 950 GB NVMe SSD |
| m6id.8xlarge | 128.0 GiB | 32 vCPUs | 16 | 1900 GB NVMe SSD |
| m6id.12xlarge | 192.0 GiB | 48 vCPUs | 24 | 2850 GB (2 * 1425 GB NVMe SSD) |
| m6id.16xlarge | 256.0 GiB | 64 vCPUs | 32 | 3800 GB (2 * 1900 GB NVMe SSD) |
| m6id.24xlarge | 384.0 GiB | 96 vCPUs | 48 | 5700 GB (4 * 1425 GB NVMe SSD) |
| m6id.32xlarge | 512.0 GiB | 128 vCPUs | 64 | 7600 GB (4 * 1900 GB NVMe SSD) |

Available benchmarking datasets

When you provision a Lakehouse node, it comes pre-configured to point to a public S3 bucket in the same region, which contains sample benchmarking datasets.

You can query tables in these datasets by referencing them with their schema name.

| Schema Name | Dataset |
|---|---|
| tpcds_sf_1 | TPC-DS, Scale Factor 1 |
| tpcds_sf_10 | TPC-DS, Scale Factor 10 |
| tpcds_sf_100 | TPC-DS, Scale Factor 100 |
| tpcds_sf_1000 | TPC-DS, Scale Factor 1000 |
| tpch_sf_1 | TPC-H, Scale Factor 1 |
| tpch_sf_10 | TPC-H, Scale Factor 10 |
| tpch_sf_100 | TPC-H, Scale Factor 100 |
| tpch_sf_1000 | TPC-H, Scale Factor 1000 |
| clickbench | ClickBench, 100 million rows |
| brc_1b | Billion row challenge |
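
For example, to run a quick query against one of these schemas (here, the TPC-H scale factor 1 dataset):

SELECT count(*) FROM tpch_sf_1.lineitem;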

Notes about ClickBench data:

Date columns (EventDate) are integers, not dates.

You must quote ClickBench column names because they contain uppercase letters, while unquoted identifiers in Postgres are folded to lowercase. For example:

select "Title" from clickbench.hits;

🚫 select Title from clickbench.hits;

User management

When you provision a Lakehouse node, you must provide a password. We do not save this password. You will need it to log in as the edb_admin user. This is not a superuser account, but it can create users, roles, and grants. Thus, you can either share the credentials for edb_admin itself, or create other users and distribute those.
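
For example, a minimal sketch of creating an additional user with read access to one of the sample schemas (the role name and password are illustrative, and granting SELECT across an analytics schema this way is an assumption):

-- Run as edb_admin
CREATE USER analyst PASSWORD 'choose-a-strong-password';
GRANT USAGE ON SCHEMA tpch_sf_1 TO analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA tpch_sf_1 TO analyst;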

Gotcha: Do not set search_path

Do not set search_path. Always reference fully qualified table names.

Using search_path makes Postgres Lakehouse fall back to the PostgreSQL executor, dramatically impacting query performance. To avoid this, qualify all table names in your query with a schema.

For example:

🚫 Do NOT do this!

-- DO NOT DO THIS
SET search_path = tpch_sf_10;
SELECT COUNT(*) FROM lineitem;

✅ Do this instead!

SELECT COUNT(*) FROM tpch_sf_10.lineitem;

Supported queries

In general, READ ONLY queries are supported; you cannot write directly to object storage. Supported queries include all Postgres built-in functions, statements, and types, as well as any of those provided by EPAS or PGE, depending on which distribution you choose to deploy.

In general, you cannot insert, update, delete, or otherwise modify data, and you cannot CREATE TABLE. You must load data into the bucket out-of-band, either with your own ETL scripts or with Lakehouse Sync. See "Advanced: Bring your own data" for more details. (In the future, we will make this easier with custom DDL.)
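
For illustration, a sketch of what is and isn't allowed (the exact error messages for rejected statements will vary):

-- Supported: read-only queries against data in object storage
SELECT count(*) FROM tpch_sf_1.lineitem;

-- Not supported: statements that create or modify data in object storage
DELETE FROM tpch_sf_1.lineitem;
CREATE TABLE t (id int);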

One exception is Postgres system tables, such as those used for storing users, roles, and grants. These tables are stored on the local block device, which is included in backups and restores. So you can CREATE USER or CREATE ROLE or GRANT USAGE, and these users/roles/grants will survive restarts and restores.

DirectScan vs. fallback modes and EXPLAIN

Postgres Lakehouse is fastest when it can "push down" your entire query to DataFusion, the vectorized query engine used for handling queries when possible. (In the future, this will be more fine-grained as we add support for partial pushdowns.)

Postgres Lakehouse can execute your query in two modes. First, it attempts to run the entire query using Seafowl (a dedicated columnar database based on DataFusion). If Seafowl can't run the entire query, for example, because it uses PostgreSQL-specific operations like JSON, then Postgres Lakehouse will fall back to using the PostgreSQL executor, with Seafowl streaming full table contents to it.

If your query is extremely slow, it's possible that this fallback is what's happening.

You can check which mode is being used by running an EXPLAIN on the query and making sure that the top-most query node is SeafowlDirectScan. For example:

explain select count from (select count(*) from tpch_sf_1.lineitem);
                                                                                                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=167.52..167.55 rows=1 width=8)
   ->  Append  (cost=0.00..165.01 rows=1001 width=0)
         ->  Seq Scan on lineitem lineitem_1  (cost=0.00..0.00 rows=1 width=0)
         ->  SeafowlScan on "16529" lineitem_2  (cost=100.00..150.00 rows=1000 width=0)
               SeafowlPlan: logical_plan
                 TableScan:  tpch_sf_1.lineitem projection=[l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment]
(6 rows)

In this case, the query is executed by PostgreSQL and Seafowl is only involved when scanning the table (see SeafowlScan at the bottom). The fix in this case is to explicitly name the inner COUNT(*) column, since Seafowl gives it an implicit name count(*) whereas PostgreSQL calls it count:

edb_admin=> explain select count from (select count(*) as count from tpch_sf_1.lineitem);
                             QUERY PLAN
--------------------------------------------------------------------
 SeafowlDirectScan: logical_plan
   Projection:  COUNT(*) AS count
     Aggregate:  groupBy=[[]], aggr=[[COUNT(UInt8(1)) AS COUNT(*)]]
       TableScan:  tpch_sf_1.lineitem projection=[]
(4 rows)

Here, we can see the SeafowlDirectScan at the top, which means that Seafowl is running the entire query.

If you're having trouble rewording your query to make it run fully on Seafowl, open a support ticket.

Load data with Lakehouse Sync

If you have a transactional database running in EDB Postgres AI Cloud Service, then you can sync tables from this database into a Managed Storage Location.

A more detailed guide for this is forthcoming. If you want to try it yourself, look in the UI for "Migrations" or "Sync to Lakehouse."

Advanced: Bring your own data

It's possible to point your Lakehouse node at an arbitrary S3 bucket with Delta Tables inside of it. However, this comes with some major caveats (which will eventually be resolved):

Caveats

  • The bucket must be publicly accessible.
    • If you want to use a private bucket, this is technically possible, but requires some manual action on our side and your side (to assign the correct IAM policies). Let us know if you want to try it. We will be adding proper support for private, external buckets in the near future.
  • The tables must be stored as Delta Tables within the location.
  • A “Delta Table” is a folder of Parquet files along with some JSON metadata.
  • Each table must be prefixed with $schema/$table/, where $schema and $table are valid Postgres identifiers (i.e., < 64 characters).
    • For example, this is a valid Delta Table that will be recognized by Beacon Analytics:
      • my_schema/my_table/{part1.parquet, part2.parquet, _delta_log}
    • These $schema and $table identifiers will be queryable in the Lakehouse node, e.g.:
      • SELECT count(*) FROM my_schema.my_table;
    • This Delta Table will NOT be recognized by Beacon Analytics (missing a schema):
      • my_table/{part1.parquet, part2.parquet, _delta_log}

Loading your own data

  • You can use the deltalake Python library to create Delta Tables and write to the bucket
  • You can also use the lakehouse-loader utility we created for this, to export data from an arbitrary Postgres instance to Lakehouse Tables in a storage bucket.

For example, with the lakehouse-loader utility:

export PGPASSWORD="..."
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
# export other AWS envvars

./lakehouse-loader postgres-to-delta postgres://test-user@localhost:5432/test-db -q "SELECT * FROM some_table" s3://my-bucket/my_schema/my_table

Pointing to your bucket

By default, each Lakehouse node is configured to point to a bucket with benchmarking datasets inside. To point it to a different bucket, you can call the seafowl.set_bucket_location function:

SELECT seafowl.set_bucket_location('{"region": "ap-south-1", "bucket": "my-bucket", "public": true}');

Querying your own data

In the example above, after you've called set_bucket_location, you will be able to query data in my_schema.my_table:

SELECT * FROM my_schema.my_table;

Note that using an S3 bucket that isn't in the same region as your node will 1) be slow because of cross-region latencies, and 2) incur AWS data transfer costs (between $0.01 and $0.02/GB). Currently, these egress costs are not passed through to you, but we do track them and reserve the right to terminate an instance.

Switching back to sample data

To switch the bucket back to the default sample bucket in the same region as your node:

SELECT seafowl.set_bucket_location(NULL);
