path: root/src/backend/utils/adt/selfuncs.c
author     Peter Geoghegan <pg@bowt.ie>    2024-04-06 11:47:10 -0400
committer  Peter Geoghegan <pg@bowt.ie>    2024-04-06 11:47:10 -0400
commit     5bf748b86bc6786a3fc57fc7ce296c37da6564b0 (patch)
tree       cdf5b28c807516e1b25c716beb77f7592a1c941b /src/backend/utils/adt/selfuncs.c
parent     ddd9e43a92417dd0c2b60822d6e75862c73b139a (diff)
Enhance nbtree ScalarArrayOp execution.
Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals natively. This works by pushing down the full context (the array keys) to the nbtree index AM, enabling it to execute multiple primitive index scans that the planner treats as one continuous index scan/index path. This earlier enhancement enabled nbtree ScalarArrayOp index-only scans. It also allowed scans with ScalarArrayOp quals to return ordered results (with some notable restrictions, described further down).

Take this general approach a lot further: teach nbtree SAOP index scans to decide how to execute ScalarArrayOp scans (when and where to start the next primitive index scan) based on physical index characteristics. This can be far more efficient. All SAOP scans will now reliably avoid duplicative leaf page accesses (just like any other nbtree index scan). SAOP scans whose array keys are naturally clustered together now require far fewer index descents, since we'll reliably avoid starting a new primitive scan just to get to a later offset from the same leaf page.

The scan's arrays now advance using binary searches for the array element that best matches the next tuple's attribute value. Required scan key arrays (i.e. arrays from scan keys that can terminate the scan) ratchet forward in lockstep with the index scan. Non-required arrays (i.e. arrays from scan keys that can only exclude non-matching tuples) "advance" without the process ever rolling over to a higher-order array.

Naturally, only required SAOP scan keys trigger skipping over leaf pages (non-required arrays cannot safely end or start primitive index scans). Consequently, even index scans of a composite index with a high-order inequality scan key (which we'll mark required) and a low-order SAOP scan key (which we won't mark required) now avoid repeating leaf page accesses -- that benefit isn't limited to simpler equality-only cases.

In general, all nbtree index scans now output tuples as if they were one continuous index scan -- even scans that mix a high-order inequality with lower-order SAOP equalities reliably output tuples in index order. This allows us to remove a couple of special cases that were applied when building index paths with SAOP clauses during planning.

Bugfix commit 807a40c5 taught the planner to avoid generating unsafe path keys: path keys on a multicolumn index path, with a SAOP clause on any attribute beyond the first/most significant attribute. These cases are now all safe, so we go back to generating path keys without regard for the presence of SAOP clauses (just like with any other clause type). Affected queries can now exploit scan output order in all the usual ways (e.g., certain "ORDER BY ... LIMIT n" queries can now terminate early).

Also undo changes from follow-up bugfix commit a4523c5a, which taught the planner to produce alternative index paths, with path keys, but without low-order SAOP index quals (filter quals were used instead). We'll no longer generate these alternative paths, since they can no longer offer any meaningful advantages over standard index qual paths. Affected queries thereby avoid all of the disadvantages that come from using filter quals within index scan nodes. They can avoid extra heap page accesses from using filter quals to exclude non-matching tuples (index quals will never have that problem). They can also skip over irrelevant sections of the index in more cases (though only when nbtree determines that starting another primitive scan actually makes sense).
There is a theoretical risk that removing restrictions on SAOP index paths from the planner will break compatibility with amcanorder-based index AMs maintained as extensions. Such an index AM could have the same limitations around ordered SAOP scans as nbtree had up until now. Adding a pro forma incompatibility item about the issue to the Postgres 17 release notes seems like a good idea.

Author: Peter Geoghegan <pg@bowt.ie>
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com
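To make the array-advancement behavior described in the commit message concrete, here is a minimal standalone C sketch (not from the patch; advance_array_key and all values are hypothetical) of how a required array's keys might ratchet forward via binary search to the element that best matches the next tuple's attribute value:

#include <stdio.h>

/*
 * Hypothetical helper: find the first array element >= tupval, starting
 * from curpos.  Required arrays only ever ratchet forward, so the search
 * never looks at positions before curpos.  A result of nelems means the
 * array (and with it the scan) is exhausted.
 */
static int
advance_array_key(const int *elems, int nelems, int curpos, int tupval)
{
	int			lo = curpos,
				hi = nelems;

	while (lo < hi)
	{
		int			mid = lo + (hi - lo) / 2;

		if (elems[mid] < tupval)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

int
main(void)
{
	int			arr[] = {1, 5, 42, 100};	/* e.g. WHERE a IN (1, 5, 42, 100) */
	int			pos = advance_array_key(arr, 4, 0, 40);

	if (pos < 4)
		printf("next array key: %d\n", arr[pos]);	/* prints 42 */
	return 0;
}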
Diffstat (limited to 'src/backend/utils/adt/selfuncs.c')
-rw-r--r--  src/backend/utils/adt/selfuncs.c | 83
1 files changed, 64 insertions, 19 deletions
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index cea777e9d40..35f8f306ee4 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6572,21 +6572,26 @@ genericcostestimate(PlannerInfo *root,
selectivityQuals = add_predicate_to_index_quals(index, indexQuals);
/*
- * Check for ScalarArrayOpExpr index quals, and estimate the number of
- * index scans that will be performed.
+ * If caller didn't give us an estimate for ScalarArrayOpExpr index scans,
+ * just assume that the number of index descents is the number of distinct
+ * combinations of array elements from all of the scan's SAOP clauses.
*/
- num_sa_scans = 1;
- foreach(l, indexQuals)
+ num_sa_scans = costs->num_sa_scans;
+ if (num_sa_scans < 1)
{
- RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
-
- if (IsA(rinfo->clause, ScalarArrayOpExpr))
+ num_sa_scans = 1;
+ foreach(l, indexQuals)
{
- ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) rinfo->clause;
- double alength = estimate_array_length(root, lsecond(saop->args));
+ RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
- if (alength > 1)
- num_sa_scans *= alength;
+ if (IsA(rinfo->clause, ScalarArrayOpExpr))
+ {
+ ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) rinfo->clause;
+ double alength = estimate_array_length(root, lsecond(saop->args));
+
+ if (alength > 1)
+ num_sa_scans *= alength;
+ }
}
}
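As a concrete illustration of the fallback estimate in the hunk above: a hypothetical scan with quals a IN (1, 2, 3) AND b IN (10, 20) has two SAOP clauses with array lengths 3 and 2, so the default assumption is 3 * 2 = 6 index descents. A minimal standalone sketch of that product (not part of the patch; the lengths stand in for estimate_array_length() results):

#include <stdio.h>

int
main(void)
{
	double		num_sa_scans = 1;
	double		alengths[] = {3, 2};	/* hypothetical array lengths */

	for (int i = 0; i < 2; i++)
	{
		if (alengths[i] > 1)
			num_sa_scans *= alengths[i];
	}
	printf("num_sa_scans = %.0f\n", num_sa_scans);	/* prints 6 */
	return 0;
}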
@@ -6813,9 +6818,9 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* For a RowCompareExpr, we consider only the first column, just as
* rowcomparesel() does.
*
- * If there's a ScalarArrayOpExpr in the quals, we'll actually perform N
- * index scans not one, but the ScalarArrayOpExpr's operator can be
- * considered to act the same as it normally does.
+ * If there's a ScalarArrayOpExpr in the quals, we'll actually perform up
+ * to N index descents (not just one), but the ScalarArrayOpExpr's
+ * operator can be considered to act the same as it normally does.
*/
indexBoundQuals = NIL;
indexcol = 0;
@@ -6867,7 +6872,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
clause_op = saop->opno;
found_saop = true;
- /* count number of SA scans induced by indexBoundQuals only */
+ /* estimate SA descents by indexBoundQuals only */
if (alength > 1)
num_sa_scans *= alength;
}
@@ -6931,9 +6936,47 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
numIndexTuples = btreeSelectivity * index->rel->tuples;
/*
+ * btree automatically combines individual ScalarArrayOpExpr primitive
+ * index scans whenever the tuples covered by the next set of array
+ * keys are close to tuples covered by the current set. That puts a
+ * natural ceiling on the worst case number of descents -- there
+ * cannot possibly be more than one descent per leaf page scanned.
+ *
+ * Clamp the number of descents to at most 1/3 the number of index
+ * pages. This avoids implausibly high estimates with low selectivity
+ * paths, where scans usually require only one or two descents. This
+ * is most likely to help when there are several SAOP clauses, where
+ * naively accepting the total number of distinct combinations of
+ * array elements as the number of descents would frequently lead to
+ * wild overestimates.
+ *
+ * We somewhat arbitrarily don't just make the cutoff the total number
+ * of leaf pages (we make it 1/3 the total number of pages instead) to
+ * give the btree code credit for its ability to continue on the leaf
+ * level with low selectivity scans.
+ */
+ num_sa_scans = Min(num_sa_scans, ceil(index->pages * 0.3333333));
+ num_sa_scans = Max(num_sa_scans, 1);
+
+ /*
* As in genericcostestimate(), we have to adjust for any
* ScalarArrayOpExpr quals included in indexBoundQuals, and then round
* to integer.
+ *
+ * It is tempting to make genericcostestimate behave as if SAOP
+ * clauses work in almost the same way as scalar operators during
+ * btree scans, making the top-level scan look like a continuous scan
+ * (as opposed to num_sa_scans-many primitive index scans). After
+ * all, btree scans mostly work like that at runtime. However, such a
+ * scheme would badly bias genericcostestimate's simplistic approach
+ * to calculating numIndexPages through prorating.
+ *
+ * Stick with the approach taken by non-native SAOP scans for now.
+ * genericcostestimate will use the Mackert-Lohman formula to
+ * compensate for repeat page fetches, even though that definitely
+ * won't happen during btree scans (not for leaf pages, at least).
+ * We're usually very pessimistic about the number of primitive index
+ * scans that will be required, but it's not clear how to do better.
*/
numIndexTuples = rint(numIndexTuples / num_sa_scans);
}
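A worked example of the clamp introduced in the hunk above, as a standalone sketch (all numbers hypothetical; Min and Max stand in for PostgreSQL's macros of the same name): an index of 3,000 pages with three 50-element arrays would naively be estimated at 50^3 = 125,000 descents, which the cap reduces to 1,000.

#include <math.h>
#include <stdio.h>

#define Min(x, y)	((x) < (y) ? (x) : (y))
#define Max(x, y)	((x) > (y) ? (x) : (y))

int
main(void)
{
	double		num_sa_scans = 50.0 * 50.0 * 50.0;	/* naive product: 125000 */
	double		pages = 3000;	/* hypothetical index->pages */

	num_sa_scans = Min(num_sa_scans, ceil(pages * 0.3333333));
	num_sa_scans = Max(num_sa_scans, 1);
	printf("clamped num_sa_scans = %.0f\n", num_sa_scans);	/* prints 1000 */
	return 0;
}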
@@ -6942,6 +6985,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* Now do generic index cost estimation.
*/
costs.numIndexTuples = numIndexTuples;
+ costs.num_sa_scans = num_sa_scans;
genericcostestimate(root, path, loop_count, &costs);
@@ -6952,9 +6996,9 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* comparisons to descend a btree of N leaf tuples. We charge one
* cpu_operator_cost per comparison.
*
- * If there are ScalarArrayOpExprs, charge this once per SA scan. The
- * ones after the first one are not startup cost so far as the overall
- * plan is concerned, so add them only to "total" cost.
+ * If there are ScalarArrayOpExprs, charge this once per estimated SA
+ * index descent. The ones after the first one are not startup cost so
+ * far as the overall plan goes, so just add them to "total" cost.
*/
if (index->tuples > 1) /* avoid computing log(0) */
{
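A standalone sketch of the charging scheme this comment describes, assuming PostgreSQL's default cpu_operator_cost of 0.0025 and otherwise hypothetical values: descending a btree of N tuples costs about ceil(log2(N)) comparisons, charged once toward startup cost and once per estimated SA descent toward total cost.

#include <math.h>
#include <stdio.h>

int
main(void)
{
	double		cpu_operator_cost = 0.0025; /* PostgreSQL's default */
	double		tuples = 1000000;	/* hypothetical index->tuples */
	double		num_sa_scans = 6;	/* hypothetical descent estimate */
	double		descentCost;

	descentCost = ceil(log(tuples) / log(2.0)) * cpu_operator_cost;
	printf("indexStartupCost += %.4f\n", descentCost);	/* first descent */
	printf("indexTotalCost   += %.4f\n", num_sa_scans * descentCost);
	return 0;
}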
@@ -6971,7 +7015,8 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* in cases where only a single leaf page is expected to be visited. This
* cost is somewhat arbitrarily set at 50x cpu_operator_cost per page
* touched. The number of such pages is btree tree height plus one (ie,
- * we charge for the leaf page too). As above, charge once per SA scan.
+ * we charge for the leaf page too). As above, charge once per estimated
+ * SA index descent.
*/
descentCost = (index->tree_height + 1) * DEFAULT_PAGE_CPU_MULTIPLIER * cpu_operator_cost;
costs.indexStartupCost += descentCost;
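Finally, a worked example of the fixed per-page descent charge computed just above (hypothetical tree height and descent count; the 50x page multiplier comes from the comment in the hunk): with a tree height of 2, each descent touches 3 pages and costs 3 * 50 * 0.0025 = 0.375 at the default cpu_operator_cost.

#include <stdio.h>

#define DEFAULT_PAGE_CPU_MULTIPLIER 50.0

int
main(void)
{
	double		cpu_operator_cost = 0.0025; /* PostgreSQL's default */
	int			tree_height = 2;	/* hypothetical index->tree_height */
	double		num_sa_scans = 6;	/* hypothetical descent estimate */
	double		descentCost;

	descentCost = (tree_height + 1) * DEFAULT_PAGE_CPU_MULTIPLIER *
		cpu_operator_cost;
	printf("per-descent page cost = %.4f\n", descentCost);	/* 0.3750 */
	printf("across all descents  = %.4f\n", num_sa_scans * descentCost);
	return 0;
}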