Running GPU-Accelerated Queries on EDB Postgres® AI with NVIDIA RAPIDS Accelerator for Apache Spark

March 16, 2026

This hands-on tutorial is a direct follow-up to the release blog introducing GPU-accelerated queries on EDB Postgres using NVIDIA RAPIDS Accelerator for Apache Spark. It is designed for data engineers, data scientists, and solution architects looking to implement high-performance analytics by seamlessly integrating EDB Postgres AI with an accelerated Apache Spark cluster.

To help you get started immediately, we provide a companion GitHub repository containing the necessary Docker Compose stacks and configurations.

The tutorial is split into two main parts:

  • Part 1: Validating Locally with CPU-Only Stack: This section establishes a baseline by guiding you through setting up the full stack on your local machine using only a CPU. This is an excellent way to understand the component architecture and verify the connection with a smaller TPC-DS dataset (SF10) before moving to a production environment.
  • Part 2: Running on NVIDIA Brev: This section walks you through deploying the GPU-accelerated stack. We use an NVIDIA Brev VM with L40S GPUs to demonstrate the full power of spark-rapids acceleration. While the guide focuses on Brev for ease of provisioning, the configurations can be adapted to any GPU environment with CUDA support and the relevant Spark configurations.

Part 1: Validating Locally with CPU-Only Stack

Before scaling to the NVIDIA L40S on Brev, it is best practice to validate your configuration locally. This "CPU-only" phase allows you to test the connection between EDB Postgres AI and the OSS Apache Spark cluster without incurring cloud GPU costs.

Component Architecture

Before launching the stacks, let’s examine how the components interact. We are building a "Separation of Storage and Compute" model where Postgres handles the interface and metadata, while the Apache Spark cluster performs the heavy lifting.

fig 1
Figure 01 - The component boundary among Postgres, PGAA, and the Spark components


No matter where you run the stack—local or on Brev—the core component architecture remains the same.

  • EDB Postgres (The Gateway): A specialized Postgres 17 container built with the EDB Analytics Accelerator (AA) extension. It acts as the entry point for your SQL queries.
  • OSS Apache Spark Master: The control plane that coordinates task distribution. We are using standard OSS Apache Spark 3.5.6, proving that PGAA integrates seamlessly with the broader data ecosystem.
  • OSS Apache Spark Workers: The workers that actually process the data.
  • Spark Connect (The Bridge): This service acts as the interface between Postgres and the Spark cluster, using the modern Spark Connect protocol to submit query plans.

The difference lies in the resources allocated and the presence of GPU acceleration, but the logical structure of EDB Postgres, Spark Master, Spark Workers, and Spark Connect stays consistent across environments.

Prerequisites & Environment Check

Before we begin, ensure your machine meets the following requirements. If you don't have a local GPU, don't worry—you can perform initial development on your laptop and switch to the Brev instance for the heavy lifting.

  • EDB Token: You will need an EDB_TOKEN to pull the EDB Postgres Analytics Accelerator (PGAA) extension. Follow this link to register for an EDB account to obtain a token.
  • Docker Desktop: Version 4.x or higher with Docker Compose V2.
  • System Resources (Local):
    • CPU: 4+ Cores.
    • RAM: 16GB minimum (32GB recommended for TPC-DS SF10).
    • Disk: 50GB free space.
  • Git Client: Git command-line tool or GitHub Desktop installed locally to clone the companion repository.
  • AWS CLI: Install AWS CLI (aws) locally. You will need this to download the TPC-DS SF10 dataset.

To follow this tutorial, you will need access to our companion repository which contains all the docker-compose manifests and PGAA configurations.

Launching the CPU-Only Stack

Since we are not using the GPU, we will use a smaller dataset (TPC-DS SF10) to ensure smooth performance on standard hardware.

In a terminal, export your EDB_TOKEN, clone the companion repository, and download the TPC-DS SF10 dataset into a local directory in the project.

# Export your EDB token
export EDB_TOKEN=<Your EDB_TOKEN goes here>
# Clone the repo and change into the CPU-only stack
git clone https://github.com/EnterpriseDB/spark-rapids-tutorial.git
cd spark-rapids-tutorial/cpu-only
# Download the dataset
aws s3 cp s3://beacon-analytics-demo-data-us-east-1-prod/tpcds_sf_10 ./data/tpcds_sf_10 --recursive --no-sign-request

Start the stack by running:

docker compose up -d --build

This command builds the EDB Postgres container, pulls the Spark images, and starts the OSS Apache Spark 3.5.6 cluster.

The CPU-only version omits the nvidia runtime and spark-rapids jars, relying instead on standard Spark CPU execution. Once the containers are up, run docker ps. You should see four containers running: postgres, spark-master, spark-worker-0, and spark-connect.

Running Queries

Let’s verify that the Postgres container can reach the Spark cluster. Log in to the Postgres container and open a psql shell by running:

docker compose exec postgres psql -U postgres

Activate the EDB Postgres Analytics Accelerator (pgaa).

CREATE EXTENSION IF NOT EXISTS pgaa CASCADE;

After that, run \dx to verify the operation succeeded. You should see both the pgaa and pgfs extensions installed.

Next, set these GUCs (Grand Unified Configuration settings) to switch the execution engine from the default seafowl to spark-connect.

SET pgaa.executor_engine = 'spark_connect';
SET pgaa.spark_connect_url = 'sc://spark-connect:15002';

The pgaa.spark_connect_url setting specifies the connection URL of the spark-connect container listening on port 15002, allowing Postgres to communicate with the Spark cluster.
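
If you plan to apply these settings from a client application rather than an interactive psql session, it can help to generate and validate the SET statements programmatically. The sketch below is an illustrative helper (not part of the tutorial repository); it only builds the SQL strings, and you would send them through whatever Postgres driver you use:

```python
# Sketch: compose the PGAA session settings as SQL strings.
# The engine name and URL mirror the tutorial's values; adjust the
# host/port if your spark-connect container is exposed differently.

def pgaa_session_settings(engine: str, connect_url: str) -> list[str]:
    """Return the SET statements that point PGAA at a Spark Connect endpoint."""
    if not connect_url.startswith("sc://"):
        raise ValueError("Spark Connect URLs must use the sc:// scheme")
    return [
        f"SET pgaa.executor_engine = '{engine}';",
        f"SET pgaa.spark_connect_url = '{connect_url}';",
    ]

for stmt in pgaa_session_settings("spark_connect", "sc://spark-connect:15002"):
    print(stmt)
```

Note that these are session-level settings, so a new psql session needs them applied again.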

Verify that Postgres can connect to Spark Connect:

SELECT pgaa.spark_sql('SELECT version()');

If the connection is successful, you should see the Spark version displayed. Exit the shell by typing \q.

Next, let’s create the TPC-DS tables and populate them.

In the cloned repository, you will find the create_tables_for_spark_sf10.sql script which contains the DDL queries to populate the TPC-DS tables.

Run the script:

cat create_tables_for_spark_sf10.sql | docker exec -i postgres psql -U postgres -d postgres

Let’s unpack what happens when you run this script. First, it creates a storage location that maps to the /data/tpcds_data volume mount of the Postgres container, which in turn maps to the directory where you downloaded the TPC-DS SF10 dataset.

SELECT pgfs.create_storage_location('spark_sf10', 'file:///data/tpcds_data');

Next, the script creates the tables. Each table is configured with pgaa directives that instruct Postgres to use the storage location and load Parquet-formatted files.

For example, here's how the customer_address table is created and loaded with Parquet files.

CREATE TABLE customer_address ()
USING PGAA WITH (pgaa.storage_location = 'spark_sf10', pgaa.path = 'customer_address', pgaa.format = 'parquet');
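
Every TPC-DS table in the script follows this same DDL pattern, so the statements can be generated mechanically. Here is a small illustrative sketch (the table list is a subset for brevity; the script in the repository covers the full TPC-DS schema):

```python
# Sketch: generate PGAA external-table DDL for a few TPC-DS tables.
# Assumes each table's Parquet files live in a directory named after
# the table inside the 'spark_sf10' storage location.

TABLES = ["customer_address", "store_returns", "date_dim"]  # illustrative subset

def pgaa_ddl(table: str, storage_location: str = "spark_sf10") -> str:
    return (
        f"CREATE TABLE {table} () USING PGAA WITH ("
        f"pgaa.storage_location = '{storage_location}', "
        f"pgaa.path = '{table}', "
        f"pgaa.format = 'parquet');"
    )

for t in TABLES:
    print(pgaa_ddl(t))
```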

Finally, run a sample query on the psql terminal to establish your "CPU Baseline." This will serve as the performance point of comparison when you eventually move to the GPU-accelerated environment.

WITH customer_total_return AS (
   SELECT 
       sr_customer_sk AS ctr_customer_sk,
       sr_store_sk AS ctr_store_sk,
       SUM(sr_return_amt) AS ctr_total_return
   FROM store_returns, date_dim
   WHERE sr_returned_date_sk = d_date_sk 
     AND d_year = 2000
   GROUP BY sr_customer_sk, sr_store_sk
)
SELECT c_customer_id
FROM customer_total_return ctr1, store, customer
WHERE ctr1.ctr_total_return > (
   SELECT AVG(ctr_total_return) * 1.2
   FROM customer_total_return ctr2
   WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk
)
AND s_store_sk = ctr1.ctr_store_sk
AND s_state = 'TN'
AND ctr1.ctr_customer_sk = c_customer_sk
ORDER BY c_customer_id
LIMIT 100;

In the CPU-only logs, you will see the Spark cluster processing tasks using standard Java threads. Take note of the execution time; on an SF10 dataset, this usually takes significantly longer than the near-instant results you will see later on the L40S.
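
To make the later GPU comparison fair, record wall-clock times the same way in both environments. psql's \timing meta-command is the simplest option; alternatively, a small harness like the sketch below gives repeatable measurements. The query-runner callable here is a placeholder — wire it to psql via subprocess or to a Postgres driver of your choice:

```python
import statistics
import time

def benchmark(run_query, repeats: int = 3) -> dict:
    """Time a query-running callable several times and report the median in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # placeholder: execute your TPC-DS query here
        samples.append(time.perf_counter() - start)
    return {"median_s": statistics.median(samples), "runs": repeats}

# Example with a stand-in workload:
result = benchmark(lambda: time.sleep(0.01))
print(f"median: {result['median_s']:.3f}s over {result['runs']} runs")
```

Running the query a few times and taking the median smooths out cache warm-up effects, which matter on both the CPU and GPU stacks.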

Once you’ve confirmed that Postgres and Spark are communicating correctly, you are ready to move to the Step-by-Step Deployment on Brev to unlock the full power of spark-rapids.

Part 2: Running on NVIDIA Brev

For large-scale analytics, we use an NVIDIA Brev VM with an L40S GPU. NVIDIA Brev is a cloud platform that provides on-demand GPU instances. We use it to provision powerful L40S GPUs for spark-rapids workloads that need significant computational power and memory bandwidth.

1. Provision the Brev Instance

First, ensure you have created a Brev account with sufficient credits to run an L40S instance. The instance used in this tutorial costs approximately $2–$3 per hour.

Log into the Brev console, select GPUs from the top menu, and click Create Environment to provision a new GPU instance.

fig 2
Figure 02 - With NVIDIA Brev, you can quickly provision a GPU environment.


This tutorial was developed on a Brev instance with the following configuration.

  • Instance Mode: VM Mode.
  • GPU: 2x NVIDIA L40S (48GB VRAM each).
  • Compute: 16 vCPUs.
  • Memory: 294 GiB RAM.
  • Storage: 1.94TB Disk (NVMe recommended for SF100 data).

However, any Brev instance with at least one GPU will suffice. If you use a different configuration, remember to update the docker-compose.yml file to correctly specify the GPU count.

The VM instance will take 2–3 minutes to provision. While you wait, install the Brev CLI on your local machine and follow the instructions under Using Brev CLI (SSH) in the console for your operating system. You'll interact with the VM instance primarily through the CLI throughout this tutorial.

fig 3
Figure 03 - For convenience, install Brev CLI tools locally.


2. Install Required Libraries and Setup Directory Permissions

Once provisioning is complete, log in to the instance using the CLI. Let's call our instance tutorial.

brev shell tutorial

This command opens an SSH connection to the VM instance you just provisioned, allowing you to execute commands and install software directly on the remote machine.

Next, install nvtop and AWS CLI:

sudo apt install awscli nvtop

nvtop is a lightweight monitoring tool that provides real-time visibility into GPU utilization, memory usage, and running processes. We'll use it to observe GPU performance when executing our benchmark queries. AWS CLI (awscli) is required to download the benchmark dataset in the next step.

Next, create the necessary directories and set their permissions. Spark will access these directories later.

sudo mkdir -p /data/spark-app
sudo chmod -R 777 /data/spark-app

Finally, export your EDB_TOKEN as an environment variable:

export EDB_TOKEN=<Your EDB_TOKEN goes here>

3. Download the Code and Benchmark Dataset

Clone the companion GitHub repository into the VM's home folder (/home/ubuntu). This is the same repository you ran locally, but now you're going to run the GPU-enabled Spark cluster with spark-rapids:

cd /home/ubuntu
git clone https://github.com/dunithd/spark-rapids-tutorial.git
cd spark-rapids-tutorial/gpu-2xl40s

Next, let's download the TPC-DS dataset, which is formatted as Parquet files. To demonstrate the true power of spark-rapids, we'll use the dataset with scale factor 100 (SF100), which is several gigabytes in size after compression.

To give the dataset ample disk space, store it in the /ephemeral volume of the instance. If you run df -h, you'll see it has around 1TB of space—more than sufficient.

Run the following command to download the dataset:

aws s3 cp s3://beacon-analytics-demo-data-us-east-1-prod/tpcds_sf_100 /ephemeral/tpcds_sf_100 --recursive --no-sign-request
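
The SF100 download is large, and an interrupted transfer can silently leave files missing. A quick sanity check is to total what landed on disk — `du -sh /ephemeral/tpcds_sf_100` works, or a small script like this sketch (the path is the one used above):

```python
import os

def dir_size_bytes(root: str) -> tuple[int, int]:
    """Return (file_count, total_bytes) for everything under `root`."""
    count, total = 0, 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            count += 1
            total += os.path.getsize(os.path.join(dirpath, name))
    return count, total

files, size = dir_size_bytes("/ephemeral/tpcds_sf_100")
print(f"{files} files, {size / 1e9:.1f} GB")
```

If the reported total is far below the several gigabytes you expect, rerun the aws s3 cp command; it will resume by copying only missing objects.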

With the required libraries set up and the dataset downloaded, we're ready to start the Docker Compose stack and run some queries.

Deep Dive: The spark-rapids Integration

Before launching the GPU-accelerated stack, let’s examine how the components interact. These are essentially the same components we looked at earlier, with the addition of the spark-rapids library and different resource allocations.

Fig 4
Figure 04 - You are effectively running the same Docker Compose stack on Brev, with the addition of spark-rapids jars on the Spark Connect, and Spark workers having access to the GPUs


The Core Components

The ./gpu-2xl40s/docker-compose.yml defines four primary logical units:

  • EDB Postgres (The Gateway): A specialized Postgres 17 container built with the EDB Analytics Accelerator (AA) extension. It acts as the entry point for your SQL queries.
  • OSS Apache Spark Master: The control plane that coordinates task distribution. We are using standard OSS Apache Spark 3.5.6, proving that PGAA integrates seamlessly with the broader data ecosystem.
  • OSS Apache Spark Workers: They are configured with the nvidia runtime to access the L40S hardware directly.
  • Spark Connect (The Bridge): This service acts as the interface between Postgres and the Spark cluster, using the modern Spark Connect protocol to submit query plans.

The spark-connect configuration you see in the compose file is where the magic happens.

spark-connect:
   <<: *spark-common
   container_name: spark-connect
   depends_on: ["spark-master"]
   ports: ["4040:4040", "15002:15002"]
   command: >
     /opt/spark/sbin/start-connect-server.sh --master spark://spark-master:7077
     --packages "org.apache.spark:spark-connect_2.12:3.5.6,com.nvidia:rapids-4-spark_2.12:25.10.0,io.delta:delta-spark_2.12:3.3.1,org.apache.hadoop:hadoop-aws:3.3.4"
     --conf "spark.executor.extraClassPath=/data/spark-app/jars/com.nvidia_rapids-4-spark_2.12-25.10.0.jar"
     --conf "spark.driver.extraJavaOptions=-Divy.cache.dir=/data/spark -Divy.home=/data/spark -Djava.io.tmpdir=/data/spark"
     --conf "spark.executor.extraJavaOptions=-Djava.io.tmpdir=/data/spark"
     --conf "spark.driver.memory=16g"
     --conf "spark.executor.cores=8"
     --conf "spark.executor.instances=2"
     --conf "spark.executor.memory=64g"
     --conf "spark.rapids.filecache.enabled=true"
     --conf "spark.executor.resource.gpu.amount=1"
     --conf "spark.task.resource.gpu.amount=0.0625"
     --conf "spark.scheduler.minRegisteredResourcesRatio=1.0"
     --conf "spark.locality.wait=0s"
     --conf "spark.sql.files.maxPartitionBytes=2GB"
     --conf "spark.rapids.shuffle.multiThreaded.reader.threads=32"
     --conf "spark.rapids.shuffle.multiThreaded.writer.threads=32"
     --conf "spark.rapids.sql.multiThreadedRead.numThreads=32"
     --conf "spark.plugins=com.nvidia.spark.SQLPlugin"
     --conf "spark.rapids.memory.host.spillStorageSize=16g"
     --conf "spark.rapids.memory.pinnedPool.size=8g"
     --conf "spark.rapids.sql.concurrentGpuTasks=3"
     --conf "spark.rapids.sql.enabled=true"
     --conf "spark.shuffle.manager=com.nvidia.spark.rapids.spark356.RapidsShuffleManager"

We aren't just running Spark; we are injecting NVIDIA’s spark-rapids accelerator into the Spark engine via three critical layers:

1. The Plugin Injection

--conf "spark.plugins=com.nvidia.spark.SQLPlugin"
--packages "com.nvidia:rapids-4-spark_2.12:25.10.0"

By defining the SQLPlugin, we tell Spark to intercept standard CPU query plans. If an operation (like a Join or Aggregate) is supported by spark-rapids, Spark will swap the CPU version for a highly parallelized GPU kernel.

2. Intelligent GPU Memory Management

Large-scale datasets like TPC-DS SF100 can easily saturate GPU memory. We’ve tuned the memory settings to ensure stability:

  • spark.rapids.memory.pinnedPool.size=8g: This sets aside "Pinned" (page-locked) host memory, which allows for significantly faster data transfers between the System RAM and GPU VRAM.
  • spark.rapids.sql.concurrentGpuTasks=3: This allows multiple tasks to share a single GPU, maximizing the utilization of the L40S’s massive core count.
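
These task-level settings encode a deliberate ratio. With spark.task.resource.gpu.amount=0.0625 (1/16), GPU resource accounting alone would admit 16 tasks per executor, but the 8 executor cores cap actual task concurrency at 8 (assuming Spark's default of one CPU core per task), and concurrentGpuTasks=3 then limits how many of those run GPU kernels simultaneously. A back-of-the-envelope check:

```python
# Sanity-check the Spark RAPIDS concurrency ratios used in the compose file.
executor_cores = 8                 # spark.executor.cores
executor_gpu_amount = 1.0          # spark.executor.resource.gpu.amount
task_gpu_amount = 0.0625           # spark.task.resource.gpu.amount
concurrent_gpu_tasks = 3           # spark.rapids.sql.concurrentGpuTasks

# Tasks the GPU resource accounting alone would admit per executor:
gpu_slots = int(executor_gpu_amount / task_gpu_amount)

# Actual concurrency is bounded by CPU cores (one core per task by default):
task_concurrency = min(executor_cores, gpu_slots)

# Tasks allowed to execute GPU kernels at the same instant:
gpu_kernel_concurrency = min(task_concurrency, concurrent_gpu_tasks)

print(f"GPU slots per executor: {gpu_slots}")
print(f"Concurrent tasks per executor: {task_concurrency}")
print(f"Tasks on the GPU at once: {gpu_kernel_concurrency}")
```

Keeping task_gpu_amount small relative to the CPU-derived concurrency means scheduling never stalls waiting for GPU "slots", while concurrentGpuTasks prevents too many kernels from contending for VRAM.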

3. The Accelerated Shuffle

--conf "spark.shuffle.manager=com.nvidia.spark.rapids.spark356.RapidsShuffleManager"

In Spark, "Shuffling" (moving data between workers) is often the biggest bottleneck. By using the RapidsShuffleManager, we use the GPU to compress and move data, drastically reducing the time spent on I/O.

Launching the Accelerated Stack

Run the following command to start the services in the background:

docker compose up -d --build

Before running queries, let’s confirm that Spark can "see" the L40S hardware.

First, run nvidia-smi on the host to see two L40S GPUs listed with 48GB VRAM each, similar to this:

fig 5
Figure 05 - Two Spark workers (Java processes) are assigned to the two GPUs


Next, exec into the worker container to ensure the NVIDIA drivers are passed through correctly so that the Spark workers can actually “see” the GPUs:

docker exec -it spark-worker-0 nvidia-smi

You should see the same GPU status table inside the container.

Additionally, you can check the spark-connect logs to verify that the spark-rapids acceleration is enabled.

docker compose logs spark-connect | grep -i "RAPIDS Accelerator"

You should see a message saying RAPIDS Accelerator 25.10.0: enabled. If you see that, the plugin is active.

Finally, run the following on your local terminal to configure port forwarding so you can access the Spark Master UI (port 8080) and the Spark application UI (port 4040):

brev port-forward tutorial -p 8080:8080 -p 4040:4040

You can now access the Spark UI at http://localhost:4040.

Running Queries

Now that we have the spark-rapids setup running, let’s create the TPC-DS tables, populate them, and run analytics queries.

In the cloned repository, you will find the create_tables_for_spark_sf100.sql script which contains the DDL queries to populate the TPC-DS tables.

Run the script:

cat create_tables_for_spark_sf100.sql | docker exec -i postgres psql -U postgres -d postgres

With the stack running on the NVIDIA Brev L40S instance, we will now execute a TPC-DS query. The goal is to observe how the EDB Analytics Accelerator (AA) intercepts the SQL and offloads the heavy lifting to the OSS Apache Spark cluster. Before that, let’s run nvtop in a different terminal so we can easily spot GPU utilization visually.

Log into a new VM instance by running this in a new terminal:

brev shell tutorial

Start nvtop in the new terminal session.

nvtop

While nvtop runs in the new terminal, return to the first terminal, log into psql, and run this query:

WITH customer_total_return AS (
   SELECT 
       sr_customer_sk AS ctr_customer_sk,
       sr_store_sk AS ctr_store_sk,
       SUM(sr_return_amt) AS ctr_total_return
   FROM store_returns, date_dim
   WHERE sr_returned_date_sk = d_date_sk 
     AND d_year = 2000
   GROUP BY sr_customer_sk, sr_store_sk
)
SELECT c_customer_id
FROM customer_total_return ctr1, store, customer
WHERE ctr1.ctr_total_return > (
   SELECT AVG(ctr_total_return) * 1.2
   FROM customer_total_return ctr2
   WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk
)
AND s_store_sk = ctr1.ctr_store_sk
AND s_state = 'TN'
AND ctr1.ctr_customer_sk = c_customer_sk
ORDER BY c_customer_id
LIMIT 100;

We specifically picked this query because it exercises:

  • Complex Joins: It joins store_returns, date_dim, store, and customer.
  • Common Table Expressions (CTE): The WITH clause creates a temporary view that Spark must optimize across the cluster.
  • Subqueries: The AVG(...) * 1.2 subquery forces a broadcast or shuffle join, which is where the RAPIDS Shuffle Manager shines on the L40S.

Observing nvtop will show you a visualization like the following, indicating that both GPUs are being utilized.

fig 6
Figure 06 - While the query executes on the Spark cluster, you can see the fluctuations of the GPU utilization.


To officially confirm the Spark job's query planning and execution details, you can open your browser to http://localhost:4040. Navigate to the SQL tab and click on the running query.

  • Look for "Gpu" nodes: You will see a physical plan where standard nodes are replaced by GPU versions, such as GpuHashJoin, GpuHashAggregate, and GpuColumnarExchange.
  • Transition Nodes: You will notice GpuRowToColumnar at the start (as data is pulled from Postgres) and GpuColumnarToRow at the end (as results are sent back).

fig 7
Figure 07 - Spark execution DAG illustrating various GPU nodes.


By examining these metrics, you can verify that the RAPIDS Shuffle Manager is actively reducing I/O bottlenecks and that both GPUs are being utilized efficiently during query execution.

Conclusion

By successfully completing this tutorial, you have established both a CPU-only baseline and deployed a fully GPU-accelerated EDB Postgres AI and Apache Spark cluster using NVIDIA spark-rapids on an NVIDIA Brev VM. You have validated the seamless integration of the EDB Analytics Accelerator with Spark, confirmed the active use of GPU kernels for complex query planning (GpuHashJoin, GpuAggregate, etc.), and observed the significant performance boost offered by the RAPIDS Shuffle Manager and L40S GPUs, particularly on large-scale datasets like TPC-DS SF100. This hands-on experience should provide a strong foundation for integrating GPU-accelerated analytics into your production environment. For a deeper dive into all configuration options, advanced tuning, and the full feature set of EDB Postgres AI and the Analytics Accelerator, please refer to the official EDB documentation.
