I am always fascinated by the “Search” in this digital world. When I think about text as data, I always believe that Text is Data born for search ☺ compared to other data like numeric ( they are born for measuring & calculating ☺ ).
In the digital age, the ability to sift through vast amounts of text data efficiently and effectively is crucial. PostgreSQL, a robust open-source relational database, offers various search functionalities that cater to multiple needs, from simple pattern matching to linguistic search and more complex semantic search understanding. This article explores these search methods—standard search, full-text search, and semantic search—using the example of searching for articles related to "PostgreSQL & vector" to illustrate their differences, use cases, and internal workings.
Let's understand this with some examples. I have sample data from a few articles inserted into PostgreSQL with text as a column. Later, I will create the full-text search with PostgreSQL's tsvector functionality and semantic search using vector embedding with pgvector.
Below is my sample data.
I have inserted six articles into the PostgreSQL table.
Let’s do some searching on this.
Standard Search with the LIKE Operator
Use Cases
Standard search is your go-to for straightforward queries where exact or partial matches are needed. It shines in scenarios like searching for a specific word in a column or filtering data with a specific word. Based on the above data, here is an example.
Example: "PostgreSQL"
Imagine you would like to find the above article with PostgreSQL as a word.
It works like a charm for a single word. But let's assume that you now have two words with some context, like you want to find articles with “PostgreSQL & vector”. Let’s do this.
Opp,!!! There’s no result. There are two words and some similarities you would like to find between them.
How It Works
The `LIKE` operator scans each row against the specified pattern, which can include wildcards (`%`). The default B-tree index can improve the query performance in many ‘LIKE’ operator scenarios, especially prefix and suffix cases. However, the B-tree is less effective if your data set is large and you have middle matching “%PostgreSQL%”.PostgreSQL is extensible; thus, an extension called “pg_trgm” helps to do world-similar searches like full-text search and uses the GIN index, which is preferred for text-based search in PostgreSQL.
Full-Text Search with tsvector and tsquery
Use Cases
Full-text search is indispensable for digging through large text volumes, making it perfect for content-heavy applications needing nuanced search capabilities, such as document libraries or article repositories.
Example: " PostgreSQL & vector"
Let's take our above examples with two words that have some context and find them using PostgreSQL's full-text search capabilities.
Before searching the text in full-text, I must add a new column, “textsearchable_index_col”, as tsvector data type for full-text search. Now, I will update my new column, which converts my text to a token, which is lexemes based on my base data column, ‘title’ and ‘body’. One can also create the trigger for this update, so there is no need to update this externally. Once I have updated this, I will make a special text search index, ‘GIN’, for more complex and large data set searches.
Here is my added data. It looks like
Now, let me search for the phrase “PostgreSQL & vector” using a full-text search method with PostgreSQL.
This approach is more sophisticated. It captures articles mentioning "PostgreSQL & vector," regardless of their order or proximity. It leverages linguistic analysis to provide more relevant results.
How It Works
PostgreSQL's full-text search breaks down text into lexemes (tokens) using `tsvector` and matches these against a `tsquery`. An inverted index maps words to their occurrences in documents, enabling efficient, complex searches that consider word forms, synonyms, and logical combinations. PostgreSQL's special index is called GIN, an inverted index that helps speed up the full-text query over on tsvector.
Semantic Search with pgvector
Use Cases
Semantic search excels in applications where understanding the intent and meaning behind the text is crucial (context), such as in recommendation engines or advanced content discovery platforms. This is why GenAI comes into the picture, where many new applications are based on this similarity search.
Example: " PostgreSQL & vector "
To perform the semantic search, I must use the pgvector extension available with PostgreSQL and embed my article data into a vector data type. I also have to use a Python program to embed; I have used Hugging Face open source lib with model = SentenceTransformer('all-MiniLM-L6-v2'). You can find the complete Python program on my GitHub here.
For a semantic approach to finding similarity articles, you must find the vector similarity between these articles.
So, the above query provides us with the nearest article using Euclidean,<->. One can also use cosine <=> as well as INNER Product <#> and other operators. Cosine similarity measures the cosine of the angle between two vectors in multi-dimensional space. In comparison, Euclidean distance measures the straight-line distance between two points (vectors), and Inner product similarity measures the similarity between two vectors based on their dot product.
But if we want to find the text “PostgreSQL & vector,” I need to convert this search/input data into vectors and then match the input vectors with those I already stored. So, here is an example of the search using Python code.
One can easily differentiate articles similar to “PostgreSQL” and “vector” words with context. This is how similarity search works and can be used for various use cases with AI models.
Conclusion
The evolution from standard to semantic search in PostgreSQL equips developers with powerful tools for implementing nuanced search functionalities. PostgreSQL caters to a broad spectrum of search needs, from the simplicity of standard search to the depth of semantic understanding with `pgvector`. Finding articles related to "PostgreSQL & vector," we've seen how each method serves different scenarios, highlighting the importance of choosing the right approach to enhance user experiences and effectively meet application requirements. Below is the final thought on when to use and when not on each of these methods.
Standard Search.
Advantages:
- Simplicity: The `LIKE` operator is straightforward to use and understand.
- Basic Pattern Matching: It can match patterns within text data using wildcard characters like `%` and `_`.
Disadvantages:
- Performance: `LIKE` queries can be slow, especially when searching large datasets, as they often require full table scans.
- Limited Functionality: Limited support for linguistic analysis, stemming, and relevance ranking compared to full-tTherch.
- Indexing mutations: Indexing on text columns may not significantly improve performance for wildcard searches.
When to Use:
- Suitable for simple pattern matching or when full-text search features are not required.
- When working with small datasets or when performance is not a critical concern.
Full-Text Search with tsvector:
Advantages:
- Fast and Efficient: Full-text search is optimised for searching large volumes of text data and can provide faster results compared to `LIKE` queries.
- Linguistic Analysis: Provides support for stemming, ranking, and linguistic analysis, allowing for more accurate and relevant search results.
- Index Support: tsvector indexes with GIN can significantly improve search performance, especially for complex queries.
Disadvantages:
- Learning Curve: Requires understanding of configuring and using tsvector data types and GIN or GIST indexes.
- Setup Overhead: Initial setup and configuration for full-text search indexing may require more effort than standard search.
- Context-based searching for text requires additional effort and could be more efficient.
When to Use:
- Suitable for applications requiring advanced search capabilities, such as searching within large documents or implementing relevance ranking.
- Performance is a critical consideration when searching large volumes of text data.
Semantic Search with pgvector
Advantages:
- Enhanced Relevance: Semantic search techniques such as cosine similarity or Euclidean distance can provide a more nuanced relevance ranking than basic text matching.
- Similarity Matching: Allows for searching based on the similarity between text fragments, which can be helpful for applications like spell checking, autocomplete, or recommendation systems.
- Context-based searching applications like Chat Boat can use this method.
Disadvantages:
- Complexity: Semantic search may require more effort and expertise than standard or full-text search.
- Resource Intensive: Calculating similarity scores for large datasets can be resource-intensive, potentially impacting performance.
When to Use:
- Suitable for applications requiring precise similarity matching or recommendation systems based on textual data.
- When relevance ranking based on semantic similarity is crucial for search results.
- Context-based searching with RAG is a machine-learning model.