We see numerous tech issues and common database symptoms flagged to our team every day. And given the current landscape, fraught with uncertainty, these, along with many new and pressing questions, are arriving in ever greater volumes. Now is the right time to use what we know to help you get the most out of Postgres.
Introducing: Postgres Pulse, a weekly blog and live video forum series built to offer insight into how we analyze problems and develop solutions using Postgres. Our Postgres Pulse Insight blogs and Postgres Pulse Live video sessions will focus on trending questions, common challenges, and how to solve the top issues you may face in your Postgres environment.
Join us for the first Postgres Pulse Live session on Monday, 3/30, at 11am ET! Postgres experts like Bruce Momjian, PostgreSQL co-founder, and Vibhor Kumar, Chief Performance Architect for EDB, will be there to talk about what’s new in the world of Postgres this week, as well as take live questions from listeners in an open forum for attendees. Have a question you’d like to ask in advance? We welcome it! You can reach out to me at marc.linster@enterprisedb.com, and your question may be selected to be tackled on air.
Postgres Pulse Insights: Improving Performance
For our first Insights topic, which we’ll dive into on Monday’s Postgres Pulse Live, let’s explore one of the most widespread issues we see hit almost every database from time to time: poor performance.
Performance is a tough nut to crack in any technology, and databases are no different. In this post, we hope to show you how to look at the problem, recognize some common symptoms, and find your way to the cause and the answer. Performance never has a “press here to fix everything” button, but we can get you started on the right path.
“The database is slow!”
DBAs are often bombarded by end users with complaints about the database being slow, and you’re expected to act quickly to diagnose and fix the problem. The first hurdle in your analysis is to narrow this broad statement down from “IT’S SLOW” to a symptom, or set of symptoms, that can lead to a meaningful diagnosis.
Start with the basics:
What is slow?
- A single query
- Multiple queries
- Queries to a certain relation or a group of relations
Answers to questions like these help create the chain that leads you to action and a solution.
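One way to answer “what is slow?” with data rather than anecdotes is to look at which statements actually account for the most execution time. Below is a minimal sketch that assumes the pg_stat_statements extension is installed and loaded; note that column names vary slightly by version.

```sql
-- Requires pg_stat_statements (added to shared_preload_libraries and
-- created with CREATE EXTENSION pg_stat_statements).
-- Column names shown are for PostgreSQL 13+; older releases use total_time/mean_time.
SELECT query,
       calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 1)  AS mean_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```

If one query dominates the list, you are in the first scenario; if many unrelated queries share the pain, you are likely looking at the second.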
We’ll focus here on the second example, where multiple queries are slow, and outline some key parameters which, if tuned correctly, can have the maximum impact on improving database performance.
Let’s start with what symptoms are reported in this example:
- The majority of user-initiated operations on the database are taking longer than usual to execute. When you take stock of system resources, CPU utilization, often indicated by the load averages in ‘top’ output, is higher than you expect, but OS memory utilization is relatively low. At times you see spikes in disk I/O while queries are running.
- Maintenance operations like autovacuum, vacuum, or reindex are not in progress, and there is no obvious lock contention between sessions, but throughput is still low.
- Even though the application repeats the same operations, you are not seeing any performance improvement from one run to the next.
- You see a large number of temp files created by the majority of your queries that involve sort operations, while your buffers stay underutilized.
- You frequently see Postgres sessions using large chunks of memory, which can lead to an overall system slowdown.
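To put numbers behind symptoms like these, the statistics views are a good first stop. The query below is a minimal sketch against pg_stat_database: a low buffer cache hit ratio and a high temp file count for your database are exactly the kinds of signals described above.

```sql
-- Buffer cache hit ratio and temp file activity for the current database.
SELECT datname,
       round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS cache_hit_pct,
       temp_files,
       pg_size_pretty(temp_bytes) AS temp_file_volume
FROM pg_stat_database
WHERE datname = current_database();
```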
So what is the underlying issue?
To answer this, it’s useful to understand at a very high level how Postgres works. When a query is processed for the first time, the data is read from disk and put into Postgres’s shared_buffers. If the data is small enough to fit into shared_buffers, it’s retained and reused for the next run of the query. However, if the data is too large for shared_buffers, then during the first run of the query the data is loaded until shared_buffers is full, after which the oldest blocks in shared_buffers are replaced. So, when the query runs again, it can’t use what might otherwise have been kept in the buffer and instead has to go back to disk--which is an expensive operation.
Note: The above is a simplified example. It doesn’t cover what happens when multiple sessions work in parallel, how query plans are handled, or the use of indexing.
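If you want to see what is actually sitting in shared_buffers at a given moment, the pg_buffercache contrib extension can show which relations occupy the most buffers. A minimal sketch, assuming the extension is available in your installation:

```sql
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Top relations by number of shared buffers currently held.
SELECT c.relname,
       count(*) AS buffers,
       pg_size_pretty(count(*) * current_setting('block_size')::bigint) AS buffered
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database
                            WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;
```

If your hottest tables and indexes don’t fit, and the same relations keep getting evicted and re-read, that’s the pattern described above.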
The next most prevalent use of memory is for sort operations in a query, which are governed by work_mem. When sorting large amounts of data, Postgres sorts it in batches that are spilled to temp files, and those batches are then merged until all rows are sorted. Increasing your work_mem value reduces the temp files, which makes the sort operations much faster. The cost is that each sort operation can use its own work_mem allocation, so configuring work_mem too large can quickly result in a situation where concurrent queries consume all available memory.
Note: Again, the above is a highly simplified use case of work_mem. Please refer to the Postgres documentation for details.
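You can watch this behavior for yourself: EXPLAIN (ANALYZE) reports whether a sort fit in memory or spilled to disk. A sketch below, using a hypothetical orders table; the output shown in the comments is illustrative, and the exact figures depend on your data.

```sql
-- With a small work_mem the sort spills to disk ("external merge").
SET work_mem = '4MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders ORDER BY created_at;
--   Sort Method: external merge  Disk: 210000kB      (illustrative)

-- With more work_mem the same sort can complete in memory.
SET work_mem = '256MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders ORDER BY created_at;
--   Sort Method: quicksort  Memory: 180000kB         (illustrative)
```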
Let’s start correlating the symptoms with these two parameters.
You talked with the application teams and concluded two things: there was no change in business logic, and the slowness is accompanied by an increase in the number of concurrent user sessions. If you were only seeing symptoms 1, 2, and 3, this would more than likely tell you that the answer is shared_buffers.
You then need to think about increasing the value of shared_buffers. As a rough guideline, you can safely increase shared_buffers to as much as 25% of total RAM--but never use that as your starting value. Not only would it consume a lot of memory in one step, it would also leave you with no idea of where the optimal “sweet spot” is. So, start with a conservative value such as 8GB, and increase it slowly to find the right value. This, of course, can be frustrating; while it would be great to have a set formula for all uses, your needs will be unique to your workload. You also have to be a good neighbor to the other consumers of memory on the same system: system processes are an obvious tenant, but Postgres sessions themselves require some private memory overhead to operate.
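As a concrete sketch of that workflow: check the current value, raise it conservatively, and remember that shared_buffers only takes effect after a restart. The 8GB figure here is just the conservative starting point discussed above; your number depends on your RAM and workload.

```sql
SHOW shared_buffers;                      -- e.g. the 128MB default

ALTER SYSTEM SET shared_buffers = '8GB';  -- conservative starting value
-- shared_buffers cannot be changed by a reload; restart the cluster,
-- e.g. pg_ctl restart -D $PGDATA (or your service manager's restart command).

SHOW shared_buffers;                      -- verify after the restart
```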
Let’s move on to symptom 4. We now understand that work_mem determines the size of the sort batch. If you observe that a query is taking a long time to complete and, at the same time, see an increase in temp file creation under the cluster’s base directory (base/pgsql_tmp by default), then increasing work_mem can help.
Check whether the query is doing extensive sorting: ORDER BY, DISTINCT, or a merge join, for example. If so, it’s advisable to assign more work_mem to the session.
However, as with shared_buffers, don’t rush to set it to a large value. For example, if your worst-performing query only runs two or three times per day, making it the priority may mean that other, more frequently used and more important queries suffer as a consequence. Start with a factor of two: the default is 4MB, so increase it to 8MB, then 16MB, then 32MB--and use that as your high-water mark. Why? Take a situation with 100 parallel sessions. If each is running a query that fully uses one 32MB work_mem area in a single sort operation, you’ve consumed 32MB x 100 sessions, and 3.2GB of memory is gone. High values can quickly exhaust available memory, which is exactly what produces symptom 5.
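Because work_mem can be set per session or per transaction, you often don’t need to raise the global default at all; give the extra memory only to the statements that do the heavy sorting. A minimal sketch, using a hypothetical reporting query, with the memory arithmetic above in mind:

```sql
-- Raise work_mem only for the transaction running the heavy sort;
-- SET LOCAL reverts automatically at COMMIT/ROLLBACK.
BEGIN;
SET LOCAL work_mem = '32MB';
SELECT customer_id, sum(amount) AS total   -- hypothetical reporting query
FROM orders
GROUP BY customer_id
ORDER BY total DESC;
COMMIT;

-- Remember: 100 sessions each using a 32MB sort area is already ~3.2GB,
-- so budget work_mem against your expected concurrency.
```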
As always, these examples are not intended to be exhaustive. They highlight common problems we see users experience, and the most common path that leads from symptom -> diagnosis -> cause -> resolution.
Next week, we’ll be looking at vacuum practices, as well as the most common symptom/solution chain associated with these situations.
In the meantime, mark your calendar for our upcoming Postgres Pulse Live sessions, every Monday at 11am ET! We can’t wait to see you there and talk all things Postgres.