Just as last year was ending, one of our long-time customers came to us because one of their long-existing PostgreSQL queries involving PostGIS geometry calculations was much slower for specific values. We researched the problem and found out how to solve it; read on! What we found as the cause of the problem will surprise you!
The starting observation, reported by our customer, was that running a query involving ST_DistanceSpheroid
took about 7 milliseconds when asked to return the distance to POINT(33.681953 23.155994)
on a specific spheroid, but if that point was moved to POINT(33.681953 23.1559941)
(a difference of just 0.0000001
) then it took 0.13 millisecond. 60 times faster! What on Earth (another spheroid!) could be going on?
Initially, we were unable to reproduce the slowness in our test environments. In our hands, both queries would perform equally quickly, with no slowdown. We dug down to the specific versions of software in use thinking that an update might be needed. We used the versions reported by customer: PostgreSQL 10.11, PostGIS 2.4.4, libproj 4.93. We returned to cave ages by downgrading to those precise versions, with no success.
Eventually we became aware that the customer was using Ubuntu 18.04, so we tried that … and lo and behold, the problem did reproduce there. Suffices to say that we jumped at the opportunity of profiling the query in that machine. We got this:
Samples: 224K of event 'cpu-clock', Event count (approx.): 56043500000
Children Self Command Shared Object Symbol
+ 84.86% 0.00% postgres [unknown] [.] 0xffffffffffffffff
+ 84.59% 0.00% postgres postgres [.] DirectFunctionCall4Coll
+ 84.58% 0.00% postgres postgis-2.5.so [.] geometry_distance_spheroid
+ 84.56% 0.00% postgres liblwgeom-2.5.so.0.0.0 [.] lwgeom_distance_spheroid
+ 84.31% 0.19% postgres libm-2.27.so [.] __sincos
+ 84.18% 0.00% postgres libm-2.27.so [.] __cos_local (inlined)
+ 84.13% 0.00% postgres libm-2.27.so [.] cslow2 (inlined)
+ 84.05% 0.01% postgres libm-2.27.so [.] __mpcos
+ 83.95% 0.32% postgres libm-2.27.so [.] __c32
+ 83.87% 0.00% postgres postgres [.] ExecInterpExpr
+ 83.75% 0.00% postgres postgres [.] standard_ExecutorRun
+ 83.75% 0.00% postgres postgres [.] ExecutePlan (inlined)
+ 83.73% 0.00% postgres postgres [.] ExecProcNode (inlined)
+ 83.73% 0.00% postgres postgres [.] ExecScan
+ 83.67% 0.00% postgres postgres [.] ExecProject (inlined)
+ 83.67% 0.00% postgres postgres [.] ExecEvalExprSwitchContext (inlined)
+ 83.65% 0.00% postgres postgres [.] _SPI_execute_plan
+ 83.60% 0.00% postgres postgres [.] _SPI_pquery (inlined)
+ 83.49% 0.01% postgres plpgsql.so [.] exec_stmts
+ 83.49% 0.00% postgres plpgsql.so [.] exec_stmt (inlined)
+ 83.49% 0.00% postgres plpgsql.so [.] exec_stmt_fori (inlined)
+ 83.48% 0.00% postgres plpgsql.so [.] exec_stmt_perform (inlined)
+ 83.48% 0.00% postgres plpgsql.so [.] exec_run_select
+ 83.47% 0.00% postgres postgres [.] SPI_execute_plan_with_paramlist
+ 81.67% 0.01% postgres liblwgeom-2.5.so.0.0.0 [.] edge_distance_to_point
+ 81.67% 0.00% postgres liblwgeom-2.5.so.0.0.0 [.] 0x00007f2ce1c2c0e6
+ 61.85% 60.82% postgres libm-2.27.so [.] __mul
+ 54.83% 0.01% postgres liblwgeom-2.5.so.0.0.0 [.] sphere_distance
+ 27.14% 0.00% postgres plpgsql.so [.] exec_stmt_block
+ 26.67% 0.01% postgres liblwgeom-2.5.so.0.0.0 [.] geog2cart
+ 19.24% 0.00% postgres libm-2.27.so [.] ss32 (inlined)
+ 18.28% 0.00% postgres libm-2.27.so [.] cc32 (inlined)
+ 12.55% 0.76% postgres libm-2.27.so [.] __sub
+ 11.46% 11.40% postgres libm-2.27.so [.] sub_magnitudes
+ 8.15% 4.89% postgres libm-2.27.so [.] __add
+ 4.94% 0.00% postgres libm-2.27.so [.] add_magnitudes (inlined)
+ 3.18% 3.16% postgres libm-2.27.so [.] __acr
+ 2.66% 0.00% postgres libm-2.27.so [.] mcr (inlined)
+ 1.44% 0.00% postgres liblwgeom-2.5.so.0.0.0 [.] lwgeom_calculate_gbox_geodetic
+ 1.44% 0.00% postgres liblwgeom-2.5.so.0.0.0 [.] ptarray_calculate_gbox_geodetic
Gibberish, you say. However, there’s something mighty curious about this profile … and you have to ignore the first 26 lines and focus on the __mul line there. See that 60.82% of "self" time? (I can hear the sound of realization your mind just made). So why does it take so much time for certain points on the spheroid and not others? And also why does it take long in Ubuntu 18.04 but not in other machines? Why does upgrading PostGIS not fix the problem?
The answer was suggested to me once I realized what was obvious: PostGIS does a lot of trigonometry (sine, cosine, tangent etc) by calling libm
functions. Looking at the glibc changelogs we found some optimizations in trigonometry functions: for certain inputs, trigonometry calculations take shortcuts that cannot be taken for other inputs; and such shortcuts have been optimized over time. Indeed, the glibc announces for both 2.27 and 2.29 both mention optimizations in sine/cosine/etc functions. Apparently, once upon a time there were some optimizations by Intel that were supposed to provide very accurate results, but then somebody realized that the accuracy claim was incorrect, so glibc disabled the use of those optimizations; later on, that stuff was reimplemented in a different but again fast way. Or something like that — for outsiders like me it’s hard to figure out the exact details.
We suspected that upgrading to a newer version of glibc would fix the problem, leaving everything else the same. Our customer tried that, and indeed it was true, and they were happy. We’re not really certain which of these glibc changes were responsible, but one thing is clear: it’s always a good idea to run stuff on up-to-date software.
Do keep in mind that bleeding edges are sharp … so be careful out there.