“We need to attain ZERO downtime and high performance in our application,” I was told in a conversation with a line-of-business manager at a conference last year. We hear this often from our customers, and we see it constantly in our conversations on forums.
Most data center managers, sysadmins, and database administrators have heard this one from “The Business” on more than one occasion. What it often means, in reality, is that the database must remain accessible and that problems must be recoverable within the timeframe required by the business.
When you ask someone how to avoid these problems, one of the most common answers is:
“Implement a comprehensive database maintenance strategy.”
What are the common database maintenance tasks?
When we dig a bit deeper into what that means, we get advice like this list of common database maintenance tasks:
- VACUUM ANALYZE every week to keep your statistics updated
- Reindex your heavily updated tables every month
- Plan for VACUUM FULL events
- Monitor your connections and load
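Expressed as SQL, those tasks map to commands along these lines; this is only a sketch, and the table names are placeholders:
-- Refresh planner statistics and clean up dead tuples across the database
VACUUM ANALYZE;
-- Rebuild the indexes of a heavily updated table
REINDEX TABLE my_busy_table;
-- Reclaim space aggressively; VACUUM FULL takes an ACCESS EXCLUSIVE lock,
-- so it needs a planned maintenance window
VACUUM FULL my_bloated_table;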
All of this is good advice and all of it helps, but does doing all of the above guarantee that you will never need downtime? Is downtime truly avoidable, or is there more to it than just maintenance?
Maintenance Can Mislead
In the past, we have come across some interesting situations where routine maintenance tasks were inadvertently compensating for an unoptimized database, and their failure led to significant downtime.
Adjusting PostgreSQL Autovacuum Settings
In one instance, the user had scheduled a script that would vacuum his database every few days. He was getting good performance and things were looking fine until one day the database simply stopped accepting connections and he saw an error in his logs:
ERROR: database is not accepting commands to avoid wraparound data loss in
database "xxxxx"
HINT: Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions, or drop
stale replication slots.
We were able to fix the database by running a ‘VACUUM FULL’ in single-user mode, which took close to 13 hours. While investigating, we found that the user’s database had been growing steadily over time while his autovacuum settings were never adjusted to take this into account.
As a result, autovacuum on some large tables kept getting interrupted by user transactions and was never allowed to complete. All of the vacuum work was being done by his scheduled maintenance script; it just so happened that his cron job got disabled during one of the OS maintenance activities, so no vacuuming was happening at all. This ultimately resulted in the cluster reaching its wraparound threshold.
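To see how close a cluster is getting to that threshold, you can track the transaction ID age of each database; once age(datfrozenxid) approaches autovacuum_freeze_max_age (200 million by default), an aggressive anti-wraparound vacuum is forced. A simple check looks like this:
-- Databases closest to the wraparound threshold come out on top
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;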
One could argue that he should have been more careful with his cron job. However, had he been more aware and adjusted his autovacuum parameters to accommodate the increased size of the database, autovacuum could have prevented a wraparound even without the maintenance script.
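As an illustration of the kind of adjustment that would have helped, autovacuum can be tuned per table so that large, busy tables are vacuumed long before they become a risk. The table name and values below are hypothetical; treat this as a sketch, not a recommendation:
-- Vacuum after roughly 1% of the table changes instead of the default 20%
ALTER TABLE big_busy_table
  SET (autovacuum_vacuum_scale_factor = 0.01,
       autovacuum_vacuum_threshold = 1000);
-- Start anti-wraparound freezing of this table earlier than the global default
ALTER TABLE big_busy_table
  SET (autovacuum_freeze_max_age = 100000000);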
Another interesting case, which didn’t result in an outage but still managed to cause a lot of heartburn, involved intermittent load spikes accompanied by slow query executions. We had checked the query plans and they did not change even when performance was deplorable. A look at pg_stat_activity during one such occurrence showed the following alongside the queries from the application:
 datname | usename  | pid | waiting | xact_runtime | query
---------+----------+-----+---------+--------------+--------------------------------------------------
 xxxxx   | rdsadmin | xxx | f       | 00:00:xxx    | autovacuum: VACUUM xxxxx (to prevent wraparound)
The table undergoing the anti-wraparound vacuum was a heavily used one. Queries that needed to run against it had to wait, which had a snowball effect on system resources. The user had a maintenance script, but it was not run frequently enough to keep up with the workload. The table went through a batch of bulk updates every few hours as part of the business logic, which made the need for frequent vacuums even more pressing.
This raised the question, ‘How often should I do maintenance/vacuum?’ This is not an easy one to answer, as the frequency depends on the rate at which the data in the table changes. If only there were a process that could frequently check the statistics and initiate a vacuum.
Wait, there is one: autovacuum. It just needed its parameters set better.
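A quick way to see whether autovacuum is keeping up with a hot table is to look at pg_stat_user_tables, which tracks dead tuples and the last (auto)vacuum times; a sketch:
-- Tables with the most dead tuples, and when they were last vacuumed
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;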
The users were misled into believing that they had done enough with their vacuum scripts. In both cases, database maintenance tasks alone were not enough to keep the server running optimally.
A Step Back
When it comes to managing PostgreSQL, or any database really, we start by identifying the symptoms and then look to provide remedies for the problems that cause them. Maintenance strategies live on the same continuum, just earlier: the intent is to prevent the pain, or to treat it periodically before it becomes a major hurt. The challenge is that sometimes your maintenance is so good that you never uncover the flaws that might be there in the design itself. In our case, it is the database that might need help: not all slowness is caused by bad statistics, and not all temporary files are generated because of a small work_mem.
What to Do for the Best PostgreSQL Maintenance
- #1 Start by determining what state your database is in
- This is a shorter list of things to work through than it might appear to be on the surface.
- #2 Is your system experiencing frequent load spikes?
- Looking at the CPU load, memory utilization, and I/O load on the machine itself is a good place to start. Observing these, finding the processes that are using the most resources, and seeing which part of the system is being hit gives a good clue to what needs to be addressed. On its own, though, this only indicates a database with issues; it doesn’t answer what the problem is.
- #3 Seeing queries waiting for a lock
- Queries consistently locking each other out, or resource contention, is another sign of a deeper problem that could be related to the business logic or the table structures (see the sample query after this list).
- #4 Performance degradation
- Now, this is a bit subjective, or at the very least it is a gradient. To conclude that you’re suffering a loss of performance, you really need benchmarks: what was your “known good state” where things worked well? A history of query performance during the good times should be available from previous runs so that you can understand what led to the bad times and either prevent or eliminate the risk of these problems cropping up in the future. If you haven’t yet developed that baseline, one helpful approach is explained in Marc’s blog “The database is slow!”
- #5 Read errors, missing files
- I don’t wish this jinx on anyone, but let’s suppose your logs do happen to be filled with errors about missing files or errors reading files. There are errors reading tables due to missing files, corrupted data blocks or tuples, or rows that show up in the table but not in its indexes; there are even errors just connecting to the database itself. This is a scary find, and when we see these things in Technical Support it is often an indication of a corrupt database.
- #6 Gremlins in the database?
- If the answer to one or more of the above is “yes”, then ultimately you have two major efforts ahead of you: address the current problems, and prevent them from occurring in the future. Roll up your sleeves and be ready to dig in before you set about creating a maintenance plan for good health.
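For point #3, a quick way to see who is blocking whom is to combine pg_stat_activity with pg_blocking_pids() (available in PostgreSQL 9.6 and later); this is only a sketch, and the columns you care about may vary:
-- List each blocked session together with the session blocking it
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid));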
Where to Start
Knowing where to start can be difficult. I suspect things are not looking good for the DB; how do I start finding the issues and fixing them?
Go low-tech: Start with the Logs
Your log files can be immensely valuable. With the right parameters set, they can give you exact information about what is happening inside the database, be it query execution, checkpoints, or locks. Below is a set of values which I prefer to set for a short period of time when given a new database to manage:
log_min_duration_statement = 0
log_line_prefix = '%t [%p]: [%l-1] db=%d,user=%u,app=%a,client=%h'
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0
log_autovacuum_min_duration = 0
log_error_verbosity = default
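If you prefer not to edit postgresql.conf by hand for a temporary change like this, the same parameters can be set with ALTER SYSTEM and picked up with a reload; a sketch:
ALTER SYSTEM SET log_min_duration_statement = 0;
ALTER SYSTEM SET log_lock_waits = on;
ALTER SYSTEM SET log_temp_files = 0;
ALTER SYSTEM SET log_autovacuum_min_duration = 0;
-- Reload the configuration; no restart is needed for these parameters
SELECT pg_reload_conf();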
More details about the logging parameters have been beautifully captured by Richard in his blog Identify and Resolve Problems Faster by Logging PostgreSQL Activities.
For ease of reading trends in the database logs, you can use pgBadger; Amit wrote a blog about it, PgBadger: X-Ray Vision for Your Queries.
Also, set up monitoring of your OS resources so that you can generate a report of CPU utilization, memory, and I/O statistics for the time period during which you are capturing the database logs.
When you start going through them and listing out unusual events from both, you will start to see patterns.
As an example, we had a user complain about random CPU spikes. We had the log levels increased so that queries were captured with their session IDs in the DB logs, and, since we knew the time, we also had the output of the ‘top’ command when the spike was observed. Below are some relevant entries which we saw:
top - 09:07:56 up 50 days, 6:43, 2 users, load average: 1.24, 1.32, 1.17
Tasks: 241 total, 4 running, 237 sleeping, 0 stopped, 0 zombie
%Cpu(s): 28.1 us, 1.8 sy, 0.0 ni, 69.5 id, 0.4 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 32782940 total, 258468 free, 1740380 used, 30784092 buff/cache
KiB Swap: 1048572 total, 176 free, 1048396 used. 4785984 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8578 postgres 20 0 28.934g 865816 842396 R 100.0 2.6 29:18.26 postgres: xxxx96: xxx_user xxx_xx.0 xx.xxx.x.xxx[62299] SELECT
19886 postgres 20 0 172604 2704 1804 R 6.2 0.0 0:00.01 top -c -n 1
1 root 20 0 45824 2620 1268 S 0.0 0.0 0:51.25 /usr/lib/systemd/systemd --switched-root --system --deserialize 21
2 root 20 0 0 0 0 S 0.0 0.0 0:00.73 [kthreadd]
Process ID 8578 seemed interesting so we looked for it in the logs and found this:
2018-09-19 10:26:26 CEST postgres[8578]: SELECT xxx<BIG QUERY>xxxxxxxx JOIN xxxxx<BIG QUERY>xxxxx
We had a huge query with joins between large tables which was eating up resources. We then set about optimizing the query and the spikes were gone.
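When you manage to catch the spike live, the PID reported by top can also be mapped straight to the running statement from inside the database; the PID below is the one from the example output above:
-- What is process 8578 doing right now?
SELECT pid, usename, state, wait_event_type, query_start, query
FROM pg_stat_activity
WHERE pid = 8578;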
The above was one of the simpler situations, where we were able to capture the top output just as the spike was happening. In situations where we only have CPU graphs to refer to, I personally like to build a kind of grid, with rows containing the most common messages in the DB logs and columns showing observations from the system. I then mark the intersecting cells for events that happened at the same time. In the case above, I had a CPU spike, and in the logs I had a query associated with the PID that seemed to cause it. For that case, my table would have looked something like this:
| DB log observations \ System observations | CPU spike (top: PID 8578 at ~100% CPU) | High I/O | Memory pressure |
|--------------------------------------------|----------------------------------------|----------|-----------------|
| Large JOIN query logged for PID 8578       | X (same time window)                   |          |                 |
| Checkpoint messages                        |                                        |          |                 |
| Lock wait messages                         |                                        |          |                 |
Note: the above is an approach that I find useful from my own experience; it is not a “best practice”. The idea is to demonstrate that you have to look at more than one piece of data when working on a problem.
If you find individual slow queries, you will have to do some query tuning. Richard again did a great job covering this in his blog, How to Use EXPLAIN ANALYZE; please check it out.
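As a minimal illustration of that tuning loop, run the suspect query under EXPLAIN (ANALYZE, BUFFERS) and compare estimated versus actual row counts; the tables here are placeholders:
-- Large mismatches between estimated and actual rows usually point to
-- stale statistics or a missing index
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.created_at > now() - interval '1 day';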
If you start with a database that works but whose state of health is unknown, you never really know what your baseline should be or how well you can scale performance or availability. While zero downtime forever isn’t a reasonable goal, you can at least improve the situation dramatically with the right approach.
If you find yourself inheriting a solution, either by way of a consultant or a prior DBA, remember:
- Always start with your baseline of behavior
- Implement a monitoring and alerting system
- Understand both the attributes of the existing system...
- ...and the expectations your applications have from the database
- Identify any symptoms of “this doesn’t work” or “this doesn’t work well enough”
- Create a plan that’s more focused on a “get well” approach, and work to address as many individual problems as you can
- Identify the right maintenance behaviors which:
- Allow you to avoid manual or drastic solutions.
- Take into account any unresolved issues from your “get well” work, and allow you to knowingly compensate where you need to, rather than masking the problem from being understood
For more insight into monitoring practices and solutions, you can find out more in Ajay Patel’s blog How To Monitor PostgreSQL, as well as Amit Sharma and Mohini Ogale’s blog, How to utilize pgBadger log analyzer and postgresql.conf to solve query performance issues.