Using Postgres for analytics, search and AI

September 06, 2024

Using AI for analytics is an immensely popular topic in the database world right now. After all, 55% of organizations are either testing AI or already using it in production for some part of their business. For the enterprise of the future, staying competitive in the AI generation means using intelligent tools to get the most out of your data. That’s why we built EDB Postgres AI, the industry’s first intelligent platform for transactional, analytical, and new AI workloads.

These were the topics at hand when I sat down with EDB Technical Fellow Marc Linster. We discussed the relationship between AI and analytics, as well as how to apply AI techniques to enhance or automate analytics. And perhaps most importantly, our discussion ties back to Postgres, the most popular database for developers. This blog provides an in-depth look at those subjects, as well as how Postgres’s architecture can support large-scale analytics in the world of AI. 

Watch the video below to hear our full discussion. 
 

 

You can also watch the video on YouTube by following this link

The relationship between AI and analytics

AI and analytics are two different disciplines with an important and interesting overlap.
Analytics is the science of analyzing raw data and/or historical data to gather insights, discover trends, or help organizations make operational or corporate decisions.

AI covers a broader spectrum including Machine Learning, Deep Learning, and Generative. AI enables machines to perform tasks that would normally require human intelligence, such as image classification, speech recognition, or learning from experience.

The intersection of the two disciplines is what we call “AI Analytics”. In AI analytics, you can apply AI techniques to enhance or automate analytics. The most common example is when you analyze and also train historical data to make predictions. AI analytics uses techniques such as machine learning or natural language processing to transform how the data is analyzed.

Can Postgres’ architecture support large-scale analytics?

Postgres has an interesting and extensible architecture but needs to be improved to support large-scale analytics. Postgres lacks a few important pieces in terms of architecture, so it's necessary to extend Postgres to support large-scale analytics.

  1. Decoupling compute and storage - Decoupling the compute engine with the storage engine to allow independent scalability.
  2. Dynamic scalability - Horizontal and dynamic scalability for compute with a distributed query engine. The same query can be split and run on multiple computer nodes using a distributed query execution engine. 
  3. Columnar storage - Support for Columnar Storage format for the tables is essential for query execution speed to access only relevant columns. This also enables efficient data compression as you can apply compression algorithms per column.
  4. Vectorization - A vectorization query engine makes it possible to leverage the CPU modern chipset. Vectorization means that you can work on multiple rows in 1 CPU instruction. Single Instruction, Multiple Data (SIMD concept).

Data lake, data warehouse, and data lake house. What's the difference?

Data lake, data warehouse, and lakehouse are three distinct data architectures, each with unique capabilities and use cases.

Data Warehouse (DWH):

The concept of the data warehouse (DWH) emerged in the 1990s, aiming to create a system for reporting and data analysis. A DWH consolidates data from multiple sources in batches, transforming and often aggregating it to provide a single version of the truth for reporting purposes.

However, data warehouses have some limitations:

  • Scalability and Cost: Traditional DWHs can be expensive to scale due to their reliance on high-performance hardware.
  • Speed: Initially designed for structured data, they often process data in batches with a delay, making them less suitable for real-time analytics.
  • User Focus: DWHs are primarily built for business intelligence users rather than data scientists or statisticians.

Data Lake:

Around 2010, the concept of the data lake emerged to address many of the common issues associated with data warehouses. Data lakes were built on the idea of the "3Vs":

  • Volume: Utilizing a scalable and cost-effective architecture based on commodity hardware and Hadoop Distributed File System (HDFS), data lakes are designed to store all data, not just a subset like traditional DWHs.
  • Variety: They manage all data formats—structured, semi-structured, and unstructured.
  • Velocity: Data lakes support real-time data acquisition in addition to batch processing.

The goal of the data lake was to open up the platform to a broader range of users, especially data scientists. Despite the benefits, data lakes also have their limitations:

  • Operational Complexity: Due to the variety of frameworks involved, data lakes can be complex to operate and use.
  • Lack of Data Governance: Ensuring data quality, consistency, and security can be challenging.
  • SQL Performance: Data lakes often have limited SQL query performance compared to traditional DWHs.

Modern Data Warehouse:

As a response to the limitations of both data warehouses and data lakes, the "Modern Data Warehouse" concept emerged. It offers features that land between those of a data lake and a traditional DWH, such as separating compute and storage and handling semi-structured data formats more effectively.

Lakehouse:

The lakehouse architecture emerged around 2020 as a hybrid of data lakes and data warehouses. The lakehouse aims to combine the best features of both worlds:

  • Scalability: Offers infinite scalability, like a data lake.
  • Data Management: Handles all types of data (structured, semi-structured, and unstructured) and supports all use cases.
  • Data Governance: Provides robust data governance comparable to traditional DWHs, ensuring data quality and security.

The lakehouse is designed to address the needs of modern data workloads by merging the flexibility of data lakes with the performance and management capabilities of data warehouses.

Why Postgres could emerge as the best analytics solution

There are three key reasons why Postgres could emerge as the best analytics solution. 

First of all, solutions from Databricks, Snowflake, and Google Big Query only operated in the cloud. Cloud-only solutions won’t work for customers who want to keep data on-premises. 

Secondly, in the same way that Linux has become the standard for the operating system, it’s clear today that Postgres is the leader for operational databases. Major organizations today have Postgres as one of their main operational databases. If we can simplify the flow from Operations to Analytics using the same technology, the same skills and give also the opportunity to run in a hybrid approach, it’s a WIN-WIN for everyone.

Last, cost is a major factor for all enterprises. Operating 24h/7 large analytics platforms in the cloud is highly expensive. Because of that cost, we’re seeing a trend of customers moving back data workloads from the cloud.  

By making Postgres suitable for Large-Scale analytics and because of all the reasons we just mentioned, Postgres could become the best analytics platform. 

Understanding Hybrid Transaction/Analytical Processing (HTAP)

Hybrid Transaction/Analytical Processing (HTAP) is a data architecture that combines online transactional processing (OLTP) and online analytical processing (OLAP) workloads into a single system. It means one single platform that can handle transactional workloads as well as analytics workloads. There are some challenges with this approach, as you usually organize data in a different format: row oriented for columnar for analytics. The typical use case for HTAP is real-time analytics scenarios like fraud detection or online recommendation systems. 

Because of these differences, you usually need two different engines, but you need to be able to auto-sync in real-time the data in between the two layers.

How does Postgres support AI?

When Michael Stonebraker designed Postgres 35 years ago, nobody was thinking about the kind of AI we have today. But Postgres works incredibly well for AI. 

Certainly Michael Stonebraker was not focusing on AI when he created Postgres, but nevertheless, he managed to put in place important design principles inside Postgres that made it very well suited for use in the AI generation. 
Here are three reasons why Postgres works well with AI:

  1. Extensibility as Programming Language  - Python, R, and SQL are the most popular programming languages used by AI professionals including data scientists. You can use all the 3 languages directly in Postgres. Postgres’s extensible framework makes it a good fit for AI applications. Allowing developers to execute Pl-python directly in the database enables data-intensive computation to occur close to the data. You can use the most popular OSS library for NLP and AI transformers directly inside Postgres with pl-python functions. 
  2. Seamless data integration - Using Postgres Foreign data wrapper for data integration makes it easier to connect and execute queries across different databases and data storage systems, which is vital for AI applications that require access to diverse data sets. 
  3. Extensibility for additional Data Structures as Vectors - Postgres is the preferred database for the developers and  you have all the AI extensions available within Postgres, most importantly the pgvector extension, which was initially developed in 2021 to support natively vectors inside Postgres. 

Understanding vectors and similarity search 

For more than 30 years, data scientists have struggled to efficiently search through unstructured data. Today, vectors offer a standardized approach to storing and searching this type of data.

Vectors are arrays of numbers that represent data points in a multidimensional space. Each element in a vector corresponds to a specific feature of the data, enabling mathematical operations and comparisons. However, before utilizing vectors, it's necessary to transform the original data into this format, and this is where AI models are incredibly useful.

Similarity search is a technique used to find items most similar to a given item in a dataset. This method involves comparing the vector representations of items using specific distance measures to identify the closest matches based on a query. The primary advantage of this approach is that it converts unstructured data into a standardized vector format, making it easier to search and analyze.

By using similarity search, you can improve search relevance by not only conducting keyword-based searches but also capturing the context of a query. However, there are some challenges. Calculating vectors is computationally intensive and typically results in an approximation rather than an exact match.

A practical solution to this is to combine similarity search with keyword-based search in PostgreSQL, a technique known as "hybrid search."

Major use cases for vector-based AI and Postgres

As previously mentioned, generating vectors is computationally intensive. The best use cases for vectors are those where vector indexes can provide very low latency, particularly when the vector dimension size is lower than 2,000—and ideally, under 1,000. With these parameters, you achieve much more efficient indexing capabilities.

Here are some examples of use cases that benefit from vector indexing:

  • Text and Document Search: Enhances search relevance by understanding the context of the query.
  • RAG (Retrieval Augmented Generation): Improves retrieval of relevant information for generating responses.
  • Image Search: Finds visually similar images based on vector representations.
  • Content Personalization: Tailors content to individual preferences using user behavior vectors.
  • Recommendation Systems: Suggests products or content based on user vector similarities.
  • Fraud Detection: Identifies unusual patterns that may indicate fraudulent activity.

Vector-based AI and PostgreSQL can also be used for voice and video processing. However, for these types of use cases, it is recommended to perform the embedding generation on-platform using a GPU to handle the computational load efficiently.

While PostgreSQL supports vector search and data is central to AI, large language models (LLMs) should not be stored in PostgreSQL because they are extremely large. Instead, these models can be co-located on the same machine where PostgreSQL is running. 
In PostgreSQL, you should store only your data and the data that has been vectorized using LLM models.

What’s the difference between Discriminative AI and Generative AI?

Generative AI is focused on creating new content or data that resembles the training data. It learns the underlying patterns or distribution of the input data and can generate new, similar instances. Common applications of generative AI include image synthesis, text generation, and creating music or art.

On the other hand, Discriminative AI is about making distinctions and categorizing input data into predefined classes. It uses supervised learning algorithms, which involve training a model on labeled data where both the input and the desired output are known. 

A classic example of discriminative AI is spam detection in email systems, where the algorithm assesses whether an email is spam or not based on features such as the title, author, timing, and content of the email.

Want to learn more? Discover EDB Postgres AI

We hope this blog has given you a better understanding of the relationship between AI and analytics, as well as how to apply AI techniques to enhance or automate analytics, and how Postgres’s architecture can support large-scale analytics in the world of AI. You can watch the full video of my conversation with EDB Technical Fellow Marc Linster on YouTube by following this link

Ready to get started? Contact us for a free demo

Share this