-rw-r--r--   doc/src/sgml/catalogs.sgml  |  22
-rw-r--r--   doc/src/sgml/perform.sgml   | 195
-rw-r--r--   doc/src/sgml/planstats.sgml | 228
3 files changed, 311 insertions(+), 134 deletions(-)
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index ef36e87a720..ed74704b2ad 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -4321,6 +4321,17 @@
      </row>
 
      <row>
+      <entry><structfield>stxkeys</structfield></entry>
+      <entry><type>int2vector</type></entry>
+      <entry><literal><link linkend="catalog-pg-attribute"><structname>pg_attribute</structname></link>.attnum</literal></entry>
+      <entry>
+       This is an array of values that indicate which table columns this
+       statistic covers. For example, a value of <literal>1 3</literal> would
+       mean that the first and the third table columns make up the statistic key.
+      </entry>
+     </row>
+
+     <row>
       <entry><structfield>stxkind</structfield></entry>
       <entry><type>char[]</type></entry>
       <entry></entry>
@@ -4333,17 +4344,6 @@
      </row>
 
      <row>
-      <entry><structfield>stxkeys</structfield></entry>
-      <entry><type>int2vector</type></entry>
-      <entry><literal><link linkend="catalog-pg-attribute"><structname>pg_attribute</structname></link>.attnum</literal></entry>
-      <entry>
-       This is an array of values that indicate which table columns this
-       statistic covers. For example a value of <literal>1 3</literal> would
-       mean that the first and the third table columns make up the statistic key.
-      </entry>
-     </row>
-
-     <row>
       <entry><structfield>stxndistinct</structfield></entry>
      <entry><type>pg_ndistinct</type></entry>
      <entry></entry>
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 8d30fd1384c..a8bebb14363 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -906,6 +906,8 @@ EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000
    <secondary>of the planner</secondary>
   </indexterm>
 
+  <sect2>
+   <title>Single-Column Statistics</title>
   <para>
    As we saw in the previous section, the query planner needs to estimate
    the number of rows retrieved by a query in order to make good choices
@@ -1043,7 +1045,200 @@ WHERE tablename = 'road';
    Further details about the planner's use of statistics can be found in
    <xref linkend="planner-stats-details">.
   </para>
+  </sect2>
+
+  <sect2 id="planner-stats-extended">
+   <title>Extended Statistics</title>
+
+   <indexterm zone="planner-stats-extended">
+    <primary>statistics</primary>
+    <secondary>of the planner</secondary>
+   </indexterm>
+   <indexterm>
+    <primary>correlation</primary>
+    <secondary>in the query planner</secondary>
+   </indexterm>
+
+   <indexterm>
+    <primary>pg_statistic_ext</primary>
+   </indexterm>
+
+   <para>
+    It is common to see slow queries running with bad execution plans because
+    multiple columns used in the query clauses are correlated.
+    The planner normally assumes that multiple conditions
+    are independent of each other,
+    an assumption that does not hold when column values are correlated.
+    Regular statistics, because of their per-individual-column nature,
+    do not capture any knowledge of cross-column correlation;
+    <firstterm>multivariate statistics</firstterm> can be used to instruct
+    the server to obtain statistics across such a set of columns,
+    which are later used by the query optimizer
+    to determine cardinality and selectivity
+    of clauses involving those columns.
+    Multivariate statistics are currently the only use of
+    <firstterm>extended statistics</firstterm>.
+   </para>
+
+   <para>
+    Extended statistics are created using
+    <xref linkend="sql-createstatistics">; see its reference page for more details.
+    Data collection is deferred until the next <command>ANALYZE</command>
+    on the table, after which the stored values can be examined in the
+    <link linkend="catalog-pg-statistic-ext"><structname>pg_statistic_ext</structname></link>
+    catalog.
+   </para>
+
+   <para>
+    The following subsections describe the types of extended statistics
+    that are currently supported.
+   </para>
+
+   <sect3>
+    <title>Functional Dependencies</title>
+
+    <para>
+     The simplest type of extended statistics is functional dependencies,
+     a concept used in definitions of database normal forms.
+     Put simply, it is said that column <literal>b</> is functionally
+     dependent on column <literal>a</> if knowledge of the value of
+     <literal>a</> is sufficient to determine the value of <literal>b</>.
+     In normalized databases, functional dependencies are allowed only on
+     primary keys and superkeys. However, many data sets are in practice not
+     fully normalized for various reasons; intentional denormalization for
+     performance reasons is a common example.
+    </para>
+
+    <para>
+     The existence of functional dependencies directly affects the accuracy
+     of estimates in certain queries.
+     The reason is that conditions on the dependent columns do not further
+     restrict the result set, but the query planner (lacking functional
+     dependency knowledge) considers them independent, resulting in
+     underestimates.
+     To inform the planner about the functional dependencies, we collect
+     measurements of dependency during <command>ANALYZE</>. Assessing
+     the degree of dependency between all sets of columns would be
+     prohibitively expensive, so the search is limited to potential
+     dependencies defined using the <literal>dependencies</> option of
+     extended statistics. It is advisable to create
+     <literal>dependencies</> statistics only where functional
+     dependencies actually exist, to avoid unnecessary overhead on both
+     <command>ANALYZE</> and query planning.
+    </para>
+
+    <para>
+     To inspect the functional dependencies stored in a statistics object
+     <literal>stts</literal>, you may do this:
+<programlisting>
+CREATE STATISTICS stts WITH (dependencies)
+               ON (zip, city) FROM zipcodes;
+ANALYZE zipcodes;
+SELECT stxname, stxkeys, stxdependencies
+  FROM pg_statistic_ext
+  WHERE stxname = 'stts';
+ stxname | stxkeys |             stxdependencies
+---------+---------+--------------------------------------------
+ stts    | 1 5     | [{1 => 5 : 1.000000}, {5 => 1 : 0.423130}]
+(1 row)
+</programlisting>
+     where it can be seen that column 1 (a zip code) fully determines column
+     5 (city), so the coefficient is 1.0, while city only determines zip code
+     about 42% of the time, meaning that there are many cities (58%) that are
+     represented by more than a single ZIP code.
+    </para>
+
+    <para>
+     When computing the selectivity, the planner inspects all conditions and
+     attempts to identify which conditions are already implied by other
+     conditions. The selectivity estimates of any redundant conditions are
+     then ignored. In a query that places conditions on both columns of the
+     example above, the estimate for either one of the conditions may be
+     eliminated, thus improving the overall estimate.
+    </para>
+
+    <sect4>
+     <title>Limitations of Functional Dependencies</title>
+
+     <para>
+      Functional dependencies are a very simple type of statistics, and
+      as such have several limitations. The first limitation is that they
+      only work with simple equality conditions, comparing columns to constant
+      values. It is not possible to use them to eliminate equality conditions
+      comparing two columns or a column to an expression, nor range clauses,
+      <literal>LIKE</>, or any other type of condition.
+     </para>
+
+     <para>
+      When eliminating the implied conditions, the planner assumes that the
+      conditions are compatible. Consider the following example, where
+      this assumption does not hold:
+
+<programlisting>
+EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 10;
+                                  QUERY PLAN
+-----------------------------------------------------------------------------
+ Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual rows=0 loops=1)
+   Filter: ((a = 1) AND (b = 10))
+   Rows Removed by Filter: 10000
+</programlisting>
+
+      While there are no rows with such a combination of values, the planner
+      is unable to verify whether the values match — it only knows that
+      the columns are functionally dependent.
+     </para>
+
+     <para>
+      Whether this assumption holds depends on the queries executed against
+      the database; in many cases it is actually satisfied (e.g. when the
+      application only allows selecting compatible values). But if that is
+      not the case, functional dependencies may not be a viable option.
+     </para>
+    </sect4>
+   </sect3>
+
+   <sect3>
+    <title>Multivariate N-Distinct Coefficients</title>
+
+    <para>
+     Single-column statistics store the number of distinct values in each
+     column. However, estimates of the number of distinct values across more
+     than one column (for example, for <literal>GROUP BY a, b</literal>) are
+     frequently wrong when the planner has only single-column statistical
+     data, causing it to select bad plans.
+     In order to improve n-distinct estimation when multiple columns are
+     grouped together, the <literal>ndistinct</> option of extended statistics
+     can be used, which instructs <command>ANALYZE</> to collect n-distinct
+     estimates for all possible combinations of two or more columns of the set
+     of columns in the statistics object (the per-column estimates are already
+     available in <structname>pg_statistic</>).
+    </para>
+
+    <para>
+     Continuing the above example, the n-distinct coefficients in a ZIP
+     code table may look like the following:
+<programlisting>
+CREATE STATISTICS stts2 WITH (ndistinct)
+               ON (zip, state, city) FROM zipcodes;
+ANALYZE zipcodes;
+SELECT stxkeys AS k, stxndistinct AS nd
+  FROM pg_statistic_ext
+  WHERE stxname = 'stts2';
+-[ RECORD 1 ]---------------------------------------------
+k  | 1 2 5
+nd | [{(b 1 2), 33178.000000}, {(b 1 5), 33178.000000},
+      {(b 2 5), 27435.000000}, {(b 1 2 5), 33178.000000}]
+(1 row)
+</programlisting>
+     which indicates that there are three combinations of columns that
+     have 33178 distinct values: ZIP code and state; ZIP code and city;
+     and ZIP code, city and state (the fact that they are all equal is
+     expected given the nature of ZIP-code data). On the other hand,
+     the combination of city and state only has 27435 distinct values.
+    </para>
+   </sect3>
+  </sect2>
 </sect1>
 
 <sect1 id="explicit-joins">
diff --git a/doc/src/sgml/planstats.sgml b/doc/src/sgml/planstats.sgml
index 124e7e20ced..f4430eb23cf 100644
--- a/doc/src/sgml/planstats.sgml
+++ b/doc/src/sgml/planstats.sgml
@@ -445,159 +445,141 @@ rows = (outer_cardinality * inner_cardinality) * selectivity
     operator-specific selectivity functions are mostly found in
     <filename>src/backend/utils/adt/selfuncs.c</filename>.
    </para>
+ </sect1>
 
-  <sect2 id="functional-dependencies">
-   <title>Functional Dependencies</title>
+ <sect1 id="multivariate-statistics-examples">
+  <title>Multivariate Statistics Examples</title>
 
-  <para>
-   The simplest type of extended statistics are functional dependencies,
-   used in definitions of database normal forms. When simplified, saying that
-   <literal>b</> is functionally dependent on <literal>a</> means that
-   knowledge of value of <literal>a</> is sufficient to determine value of
-   <literal>b</>.
-  </para>
+  <indexterm>
+   <primary>row estimation</primary>
+   <secondary>multivariate</secondary>
+  </indexterm>
+
+  <sect2>
+   <title>Functional Dependencies</title>
 
   <para>
-   In normalized databases, only functional dependencies on primary keys
-   and superkeys are allowed. However, in practice, many data sets are not
-   fully normalized, for example, due to intentional denormalization for
-   performance reasons.
-  </para>
+   Multivariate correlation can be demonstrated with a very simple data set — a
+   table with two columns, both containing the same values:
 
-  <para>
-   Functional dependencies directly affect accuracy of the estimates, as
-   conditions on the dependent column(s) do not restrict the result set,
-   resulting in underestimates.
+<programlisting>
+CREATE TABLE t (a INT, b INT);
+INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i);
+ANALYZE t;
+</programlisting>
+
+   As explained in <xref linkend="planner-stats">, the planner can determine
+   the cardinality of <structname>t</structname> using the number of pages and
+   rows obtained from <structname>pg_class</structname>:
+
+<programlisting>
+SELECT relpages, reltuples FROM pg_class WHERE relname = 't';
+
+ relpages | reltuples
+----------+-----------
+       45 |     10000
+</programlisting>
+
+   The data distribution is very simple; there are only 100 distinct values
+   in each column, uniformly distributed.
  </para>
 
  <para>
-   To inform the planner about the functional dependencies, we collect
-   measurements of dependency during <command>ANALYZE</>. Assessing
-   dependency between all sets of columns would be prohibitively
-   expensive, so we limit our search to potential dependencies defined
-   using the <command>CREATE STATISTICS</> command.
+   The following example shows the result of estimating a <literal>WHERE</>
+   condition on the <structfield>a</> column:
 
 <programlisting>
-CREATE TABLE t (a INT, b INT);
-INSERT INTO t SELECT i/100, i/100 FROM generate_series(1,10000) s(i);
-CREATE STATISTICS s1 WITH (dependencies) ON (a, b) FROM t;
-ANALYZE t;
-EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1;
-                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
- Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual time=0.095..3.118 rows=100 loops=1)
-   Filter: ((a = 1) AND (b = 1))
+EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1;
+                                   QUERY PLAN
+-------------------------------------------------------------------------------
+ Seq Scan on t  (cost=0.00..170.00 rows=100 width=8) (actual rows=100 loops=1)
+   Filter: (a = 1)
    Rows Removed by Filter: 9900
-   Planning time: 0.367 ms
-   Execution time: 3.380 ms
-(5 rows)
 </programlisting>
 
-   The planner is now aware of the functional dependencies and considers
-   them when computing the selectivity of the second condition. Running
-   the query without the statistics would lead to quite different estimates.
+   The planner examines the condition and determines the selectivity
+   of this clause to be 1%. By comparing this estimate with the actual
+   number of rows, we see that the estimate is very accurate
+   (in fact exact, as the table is very small). If the
+   <literal>WHERE</> condition uses the <structfield>b</> column instead, an
+   identical plan is generated. Observe what happens if we apply the same
+   condition to both columns, combining them with <literal>AND</>:
 
 <programlisting>
-DROP STATISTICS s1;
-EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1;
-                                          QUERY PLAN
------------------------------------------------------------------------------------------------
- Seq Scan on t  (cost=0.00..195.00 rows=1 width=8) (actual time=0.000..6.379 rows=100 loops=1)
+EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
+                                  QUERY PLAN
+-----------------------------------------------------------------------------
+ Seq Scan on t  (cost=0.00..195.00 rows=1 width=8) (actual rows=100 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 9900
-   Planning time: 0.000 ms
-   Execution time: 6.379 ms
-(5 rows)
 </programlisting>
-  </para>
 
-  <para>
-   If no dependency exists, the collected statistics do not influence the
-   query plan. The only effect is to slow down <command>ANALYZE</>. Should
-   partial dependencies exist these will also be stored and applied
-   during planning.
+   The planner estimates the selectivity for each condition individually,
+   arriving at the same 1% estimates as above. It then multiplies them,
+   getting a final 0.01% estimate. The <quote>actual</quote> figures, however,
+   show that this results in a significant underestimate, as the actual
+   number of rows matching the conditions (100) is two orders of magnitude
+   higher than the estimated value.
  </para>
 
  <para>
-   Similarly to per-column statistics, extended statistics are stored in
-   a system catalog called <structname>pg_statistic_ext</structname>.
-   To inspect the statistics <literal>s1</literal> defined above,
-   you may do this:
+   This problem can be fixed by creating functional-dependency
+   multivariate statistics on the two columns:
 
 <programlisting>
-SELECT stxname,stxdependencies FROM pg_statistic_ext WHERE stxname = 's1';
- stxname |              stxdependencies
----------+--------------------------------------------
- s1      | [{1 => 2 : 1.000000}, {2 => 1 : 1.000000}]
-(1 row)
+CREATE STATISTICS stts WITH (dependencies) ON (a, b) FROM t;
+ANALYZE t;
+EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
+                                   QUERY PLAN
+-------------------------------------------------------------------------------
+ Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual rows=100 loops=1)
+   Filter: ((a = 1) AND (b = 1))
+   Rows Removed by Filter: 9900
 </programlisting>
-
-   This shows that the statistics are defined on table <structname>t</>,
-   <structfield>attnums</structfield> lists attribute numbers of columns
-   (references <structname>pg_attribute</structname>). It also shows
-   the length in bytes of the functional dependencies, as found by
-   <command>ANALYZE</> when serialized into a <literal>bytea</> column.
  </para>
+ </sect2>
 
+ <sect2>
+  <title>Multivariate N-Distinct Coefficients</title>
  <para>
-   When computing the selectivity, the planner inspects all conditions and
-   attempts to identify which conditions are already implied by other
-   conditions. The selectivity estimates from any redundant conditions are
-   ignored from a selectivity point of view. In the example query above,
-   the selectivity estimates for either of the conditions may be eliminated,
-   thus improving the overall estimate.
+   A similar problem occurs with estimation of the number of distinct values
+   in combinations of multiple columns, which is used to determine the number
+   of groups that would be generated by a <command>GROUP BY</command> clause.
+   When <command>GROUP BY</command> lists a single column, the n-distinct
+   estimate (which can be seen as the estimated number of rows returned by
+   the aggregate execution node) is very accurate:
+<programlisting>
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a;
+                                        QUERY PLAN
+-----------------------------------------------------------------------------------------
+ HashAggregate  (cost=195.00..196.00 rows=100 width=12) (actual rows=100 loops=1)
+   Group Key: a
+   ->  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=4) (actual rows=10000 loops=1)
+</programlisting>
+   But without multivariate statistics, the estimate for the number of
+   groups in a query with two columns in <command>GROUP BY</command>, as
+   in the following example, is off by an order of magnitude:
+<programlisting>
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
+                                         QUERY PLAN
+--------------------------------------------------------------------------------------------
+ HashAggregate  (cost=220.00..230.00 rows=1000 width=16) (actual rows=100 loops=1)
+   Group Key: a, b
+   ->  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)
+</programlisting>
+   By dropping the existing statistics object and re-creating it to also
+   include the n-distinct calculation, the estimate is much improved:
+<programlisting>
+DROP STATISTICS stts;
+CREATE STATISTICS stts WITH (dependencies, ndistinct) ON (a, b) FROM t;
+ANALYZE t;
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
+                                         QUERY PLAN
+--------------------------------------------------------------------------------------------
+ HashAggregate  (cost=220.00..221.00 rows=100 width=16) (actual rows=100 loops=1)
+   Group Key: a, b
+   ->  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)
+</programlisting>
  </para>
 
-  <sect3 id="functional-dependencies-limitations">
-   <title>Limitations of functional dependencies</title>
-
-   <para>
-    Functional dependencies are a very simple type of statistics, and
-    as such have several limitations. The first limitation is that they
-    only work with simple equality conditions, comparing columns and constant
-    values. It's not possible to use them to eliminate equality conditions
-    comparing two columns or a column to an expression, range clauses,
-    <literal>LIKE</> or any other type of conditions.
-   </para>
-
-   <para>
-    When eliminating the implied conditions, the planner assumes that the
-    conditions are compatible. Consider the following example, violating
-    this assumption:
-
-<programlisting>
-EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 10;
-                                           QUERY PLAN
------------------------------------------------------------------------------------------------
- Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual time=2.992..2.992 rows=0 loops=1)
-   Filter: ((a = 1) AND (b = 10))
-   Rows Removed by Filter: 10000
-   Planning time: 0.232 ms
-   Execution time: 3.033 ms
-(5 rows)
-</programlisting>
-
-    While there are no rows with such combination of values, the planner
-    is unable to verify whether the values match - it only knows that
-    the columns are functionally dependent.
-   </para>
-
-   <para>
-    This assumption is more about queries executed on the database - in many
-    cases, it's actually satisfied (e.g. when the GUI only allows selecting
-    compatible values). But if that's not the case, functional dependencies
-    may not be a viable option.
-   </para>
-
-   <para>
-    For additional information about functional dependencies, see
-    <filename>src/backend/statistics/README.dependencies</>.
-   </para>
-
-  </sect3>
-  </sect2>
- </sect1>
-</chapter>
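
For readers experimenting with the feature documented above: the stxkeys column stores bare attribute numbers, so resolving them to column names takes a join against pg_attribute. The following is only an illustrative sketch, not text from the patch; it assumes the zipcodes table and the stts statistics object used in the perform.sgml examples, and relies on pg_statistic_ext.stxrelid (not shown in the hunk above) to link the statistics object to its table.

<programlisting>
-- Illustrative sketch: resolve the attribute numbers in stxkeys to column
-- names, using the same idiom commonly used for pg_index.indkey.
-- Assumes the zipcodes table and the stts statistics object from above.
SELECT s.stxname,
       s.stxkeys,
       array_agg(a.attname ORDER BY a.attnum) AS column_names
  FROM pg_statistic_ext s
  JOIN pg_attribute a
    ON a.attrelid = s.stxrelid
   AND a.attnum = ANY (s.stxkeys)
 WHERE s.stxname = 'stts'
 GROUP BY s.stxname, s.stxkeys;
</programlisting>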
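The planstats.sgml example describes the planner's arithmetic in prose. As a rough back-of-the-envelope illustration (not the planner's actual code path), the independence-assumption estimate and the dependency-aware estimate for the t(a, b) table can be reproduced by hand from reltuples, assuming the table has been created and ANALYZEd as shown in that example:

<programlisting>
-- Rough illustration of the row-count arithmetic discussed above.
-- Each column of t has 100 distinct values, so a single equality clause
-- is estimated at 1% selectivity.
SELECT reltuples                AS total_rows,          -- 10000
       reltuples * 0.01         AS one_clause,          -- a = 1 alone        -> 100
       reltuples * 0.01 * 0.01  AS independent_clauses, -- a = 1 AND b = 1    -> 1
       reltuples * 0.01         AS with_dependency      -- b = 1 implied by a -> 100
  FROM pg_class
 WHERE relname = 't';
</programlisting>

The last column matches the rows=100 estimate shown once the dependencies statistics exist.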
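Finally, the 0.423130 coefficient shown for 5 => 1 (city determining zip) can be given an intuitive reading. The query below is only a conceptual counterpart, written against the hypothetical zipcodes table and following the interpretation used in the text (the share of cities served by exactly one ZIP code); the actual computation performed by ANALYZE works on a row sample and differs in detail.

<programlisting>
-- Conceptual counterpart of the "city => zip" dependency degree:
-- the fraction of cities that map to exactly one ZIP code.
SELECT round((count(*) FILTER (WHERE n_zips = 1))::numeric / count(*), 6)
         AS approx_city_determines_zip
  FROM (SELECT city, count(DISTINCT zip) AS n_zips
          FROM zipcodes
         GROUP BY city) AS per_city;
</programlisting>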