Introduction
The demand for effective and robust database management solutions has never been higher as businesses continue to produce and analyze enormous amounts of data. One of the top relational database management systems, PostgreSQL, offers a broad range of functions to meet different data processing needs. In this blog, we'll explore the ANY_VALUE() function, a brand-new feature in PostgreSQL 16, and discuss how to use it with examples.
EDB’s PostgreSQL 16 Contributions
Vik Fearing, PostgreSQL Expert at EDB has contributed to add this functionality as per the SQL:2023 standard (T626). Peter Eisentraut, Chief Engineer at EDB and Core Team Member of PostgreSQL, actively worked on the SQL:2023 standard and has published details about it in his blog here: http://peter.eisentraut.org/blog/2023/04/04/sql-2023-is-finished-here-is-whats-new.
For more enhancements, check out the following link: https://www.postgresql.org/docs/16/release-16.html#RELEASE-16-FUNCTIONS
Understanding GROUP BY Clause
The GROUP BY clause is used to divide the rows in a table into smaller groups that have the same values in the specified columns. This clause is used with a SELECT statement to combine a group of rows based on the values of a particular column or expression. Aggregate functions are used to return summary information for each group.
Syntax
SELECT column1, column2, ... column_n,
aggregate_function (expression)
FROM tables
WHERE <condition>
GROUP BY column1, column2, ... column_n
HAVING <condition>;
Where:
- column1, column2, ... column_n are those expressions that are not encapsulated within an aggregate function and must be included in the GROUP BY clause.
- Aggregate_function are those functions where the values of multiple rows are processed together to form a single summary value an aggregate function such as sum, count, min, max, or avg functions.
- Table is the relation that you wish to retrieve records from. There must be at least one table listed in the FROM clause.
- WHERE condition is optional and used to filter out records from the table. The conditions that must be met for the records to be selected.
- HAVING condition is optional and is used to filter out aggregated records returned from GROUP BY clause. It serves as a post-filter.
Functionality
Calculations involving aggregates are the main objective of GROUP BY when applied to data subsets. We can compute summary statistics for each group by grouping rows based on particular columns. For example, we can use GROUP BY to figure out the overall sales for each product category, the typical client age across several cities, or the monthly order volume.
GROUP BY enables you to condense huge datasets into digestible, informative bits. As a result, it is simpler to analyze and comprehend the data. It is frequently used when creating reports and performing business intelligence tasks, where you need to present data in a structured and summarized format.
Understanding ANY_VALUE() Function
any_value(anyelement): Returns an arbitrary value from the non-null input values.
The ANY_VALUE() function is an aggregate function in PostgreSQL 16 that helps optimize queries when using GROUP BY clauses. It returns an arbitrary non-null value in a given set of values. Prior to PostgreSQL 16, when we used GROUP BY, all non-aggregated columns from the SELECT statement had to be included in the GROUP BY clause as well. PostgreSQL will throw an error if a non-aggregated column is not added in the GROUP BY clause as shown below in the example section.
The reason for this requirement is that the GROUP BY clause separates the result set into groups based on the unique combinations of values in the specified columns. The query results might not make sense if you include some non-aggregated columns in the SELECT statement that aren't listed in the GROUP BY clause. As a result, the database engine would have to make arbitrary decisions about how to display the data, resulting in non-deterministic and possibly incorrect results that could cause errors.
However, this limitation is relaxed with the introduction of the ANY_VALUE() function.
Syntax
The syntax goes like this:
ANY_VALUE(anyelement)
Where anyelement represents the column or value we want to include in the result set. It can be a column name, a literal value, or an expression involving columns and operators.
Functionality
The primary purpose of the ANY_VALUE() function is to eliminate the ambiguity that arises when non-aggregated columns are included in a GROUP BY query. Without enforcing any ordering criteria, it guarantees that you get a representative value from each group. This is especially helpful when you need any value from the group for reporting or analytical purposes and the value you receive is not critical.
When we apply the ANY_VALUE() function to a nonaggregated column within a GROUP BY query, PostgreSQL selects an arbitrary non-null value from the group to display in the result set. It effectively informs PostgreSQL that any value from the group is acceptable, resolving the ambiguity and allowing the query to execute successfully.
Additionally, performing complex aggregations for each group can be computationally expensive. ANY_VALUE() allows you to retrieve just one representative value from each group, reducing the computational load and boosting query performance. Comparing ANY_VALUE() to the MIN/MAX aggregates that were in use before, we have witnessed significant performance improvement. Check out the example below.
Example
Let's create a city table
CREATE TABLE city_data (
id bigint,
city text,
country text,
population bigint
);
After inserting around 44,000 records, we analyzed the table
ANALYZE city_data;
Suppose we run the following query:
postgres=# select country,city,population from city_data limit 10;
country | city | population
-------------+-------------+------------
Japan | Tokyo | 37732000
Indonesia | Jakarta | 33756000
India | Delhi | 32226000
China | Guangzhou | 26940000
India | Mumbai | 24973000
Philippines | Manila | 24922000
China | Shanghai | 24073000
Brazil | São Paulo | 23086000
South Korea | Seoul | 23016000
Mexico | Mexico City | 21804000
(10 rows)
This query returns a list of cities and their populations, along with the country that each city belongs to.
Now suppose we want to group the results by country, and include the population of each country. And we don’t want to include the city names in the grouping, but we would like to have a sample city returned for each district.
postgres=# SELECT country,city as city_name,SUM(population) as total_population from city_data group by country;
ERROR: column "city_data.city" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT country,city,SUM(population) from city_data group by ...
We get an error, because PostgreSQL rejects queries if the nonaggregated columns are not included in the GROUP BY clause.
Using the MIN/MAX function with the non-aggregated column is the previous workaround for doing this (PostgreSQL 15 and earlier versions).
postgres=# SELECT country,MIN(city) as city_name,SUM(population) as total_population from city_data group by country limit 10;
country | city_name | total_population
--------------------+-----------+------------------------
Bangladesh | Amtali | 43503119
Mayotte | Mamoudzou | 32057
Venezuela | Acarigua | 24389811
Luxembourg | Capellen | 238576
Sweden | Åhus | 7049100
Dominican Republic | Altamira | 11791011
Ireland | An Cabhán | 2944351
Cambodia | Banlung | 3626621
Singapore | Singapore | 5983000
San Marino | Acquaviva | 32191
postgres=# EXPLAIN (analyze,buffers,verbose) SELECT country,MIN(city) as city_name ,SUM(population) as total_population from city_data group by country limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1452.09..1452.22 rows=10 width=72) (actual time=36.296..36.302 rows=10 loops=1)
Output: country, (MIN(city)), (SUM(population))
Buffers: shared hit=670
-> HashAggregate (cost=1452.09..1454.99 rows=232 width=72) (actual time=36.295..36.299 rows=10 loops=1)
Output: country, MIN(city), SUM(population)
Group Key: city_data.country
Batches: 1 Memory Usage: 93kB
Buffers: shared hit=670
-> Seq Scan on public.city_data (cost=0.00..1116.91 rows=44691 width=26) (actual time=0.012..4.892 rows=44691 loops=1)
Output: city, country, population, id
Buffers: shared hit=670
Planning Time: 0.082 ms
Execution Time: 36.347 ms
(13 rows)
Let's check the explain plan with MAX aggregate.
postgres=# EXPLAIN (analyze,buffers,verbose) SELECT country,MAX(city) as city_name,SUM(population) as total_population from city_data group by country limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1452.09..1452.22 rows=10 width=72) (actual time=34.601..34.608 rows=10 loops=1)
Output: country, (MAX(city)), (SUM(population))
Buffers: shared hit=670
-> HashAggregate (cost=1452.09..1454.99 rows=232 width=72) (actual time=34.600..34.605 rows=10 loops=1)
Output: country, MAX(city), SUM(population)
Group Key: city_data.country
Batches: 1 Memory Usage: 93kB
Buffers: shared hit=670
-> Seq Scan on public.city_data (cost=0.00..1116.91 rows=44691 width=26) (actual time=0.011..4.525 rows=44691 loops=1)
Output: city, country, population, id
Buffers: shared hit=670
Planning Time: 0.085 ms
Execution Time: 34.655 ms
(13 rows)
Now, we are going to use ANY_VALUE() aggregate (PostgreSQL 16)
postgres=# SELECT country,ANY_VALUE(city) as city_name, SUM(population) from city_data as total_population group by country ;
country | city_name | total_population
--------------------+---------------+-------------------
Bangladesh | Dhaka | 43503119
Mayotte | Mamoudzou | 32057
Venezuela | Maracaibo | 24389811
Luxembourg | Luxembourg | 238576
Sweden | Stockholm | 7049100
Dominican Republic | Santo Domingo | 11791011
Ireland | Dublin | 2944351
Cambodia | Phnom Penh | 3626621
Singapore | Singapore | 5983000
San Marino | San Marino | 32191
postgres=# EXPLAIN (analyze,buffers,verbose) SELECT country,ANY_VALUE(city) as city_name , SUM(population) as total_population from city_data group by country limit 10 ;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1452.09..1452.22 rows=10 width=72) (actual time=23.043..23.051 rows=10 loops=1)
Output: country, (ANY_VALUE(city)), (SUM(population))
Buffers: shared hit=670
-> HashAggregate (cost=1452.09..1454.99 rows=232 width=72) (actual time=23.042..23.047 rows=10 loops=1)
Output: country, ANY_VALUE(city), SUM(population)
Group Key: city_data.country
Batches: 1 Memory Usage: 93kB
Buffers: shared hit=670
-> Seq Scan on public.city_data (cost=0.00..1116.91 rows=44691 width=26) (actual time=0.013..4.816 rows=44691 loops=1)
Output: city, country, population, id
Buffers: shared hit=670
Planning Time: 0.092 ms
Execution Time: 23.096 ms
(13 rows)
The execution time shown above is about 35% faster than the times for MIN and MAX functions.
Conclusion
The ANY_VALUE() function in PostgreSQL enables the inclusion of non-aggregated columns in the SELECT clause without affecting the grouping behavior. It's a practical choice for increasing query flexibility and resolving issues with non-aggregated columns in GROUP BY queries. Improved query performance and more readable SQL code are further advantages. Despite the presence of less efficient alternatives, ANY_VALUE() can be a useful function for SQL developers and database administrators, particularly in situations where efficiency and clarity are crucial.