diff options
-rw-r--r-- | doc/src/sgml/perform.sgml | 196 | ||||
-rw-r--r-- | doc/src/sgml/ref/analyze.sgml | 6 |
2 files changed, 199 insertions, 3 deletions
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml index 61c28bd000e..c10be1e34f6 100644 --- a/doc/src/sgml/perform.sgml +++ b/doc/src/sgml/perform.sgml @@ -1,5 +1,5 @@ <!-- -$Header: /cvsroot/pgsql/doc/src/sgml/perform.sgml,v 1.12 2001/10/12 23:32:34 momjian Exp $ +$Header: /cvsroot/pgsql/doc/src/sgml/perform.sgml,v 1.13 2001/10/16 01:13:44 tgl Exp $ --> <chapter id="performance-tips"> @@ -315,6 +315,200 @@ Total runtime: 30.67 msec </para> </sect1> + <sect1 id="planner-stats"> + <title>Statistics used by the Planner</title> + + <para> + As we saw in the previous section, the query planner needs to estimate + the number of rows retrieved by a query in order to make good choices + of query plans. This section provides a quick look at the statistics + that the system uses for these estimates. + </para> + + <para> + One component of the statistics is the total number of entries in each + table and index, as well as the number of disk blocks occupied by each + table and index. This information is kept in + <structname>pg_class</structname>'s <structfield>reltuples</structfield> + and <structfield>relpages</structfield> columns. We can look at it + with queries similar to this one: + +<screen> +regression=# select relname, relkind, reltuples, relpages from pg_class +regression-# where relname like 'tenk1%'; + relname | relkind | reltuples | relpages +---------------+---------+-----------+---------- + tenk1 | r | 10000 | 233 + tenk1_hundred | i | 10000 | 30 + tenk1_unique1 | i | 10000 | 30 + tenk1_unique2 | i | 10000 | 30 +(4 rows) +</screen> + + Here we can see that <structname>tenk1</structname> contains 10000 + rows, as do its indexes, but the indexes are (unsurprisingly) much + smaller than the table. + </para> + + <para> + For efficiency reasons, <structfield>reltuples</structfield> + and <structfield>relpages</structfield> are not updated on-the-fly, + and so they usually contain only approximate values (which is good + enough for the planner's purposes). They are initialized with dummy + values (presently 1000 and 10 respectively) when a table is created. + They are updated by certain commands, presently <command>VACUUM</>, + <command>ANALYZE</>, and <command>CREATE INDEX</>. A stand-alone + <command>ANALYZE</>, that is one not part of <command>VACUUM</>, + generates an approximate <structfield>reltuples</structfield> value + since it does not read every row of the table. + </para> + + <para> + Most queries retrieve only a fraction of the rows in a table, due + to having WHERE clauses that restrict the rows to be examined. + The planner thus needs to make an estimate of the + <firstterm>selectivity</> of WHERE clauses, that is, the fraction of + rows that match each clause of the WHERE condition. The information + used for this task is stored in the <structname>pg_statistic</structname> + system catalog. Entries in <structname>pg_statistic</structname> are + updated by <command>ANALYZE</> and <command>VACUUM ANALYZE</> commands, + and are always approximate even when freshly updated. + </para> + + <para> + Rather than look at <structname>pg_statistic</structname> directly, + it's better to look at its view <structname>pg_stats</structname> + when examining the statistics manually. <structname>pg_stats</structname> + is designed to be more easily readable. Furthermore, + <structname>pg_stats</structname> is readable by all, whereas + <structname>pg_statistic</structname> is only readable by the superuser. + (This prevents unprivileged users from learning something about + the contents of other people's tables from the statistics. The + <structname>pg_stats</structname> view is restricted to show only + rows about tables that the current user can read.) + For example, we might do: + +<screen> +regression=# select attname, n_distinct, most_common_vals from pg_stats where tablename = 'road'; + attname | n_distinct | most_common_vals +---------+------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- + name | -0.467008 | {"I- 580 Ramp","I- 880 Ramp","Sp Railroad ","I- 580 ","I- 680 Ramp","I- 80 Ramp","14th St ","5th St ","Mission Blvd","I- 880 "} + thepath | 20 | {"[(-122.089,37.71),(-122.0886,37.711)]"} +(2 rows) +regression=# +</screen> + + As of <productname>Postgres</productname> 7.2 the following columns exist + in <structname>pg_stats</structname>: + </para> + + <table> + <title>pg_stats Columns</title> + + <tgroup cols=3> + <thead> + <row> + <entry>Name</entry> + <entry>Type</entry> + <entry>Description</entry> + </row> + </thead> + + <tbody> + <row> + <entry>tablename</entry> + <entry><type>name</type></entry> + <entry>Name of table containing column</entry> + </row> + + <row> + <entry>attname</entry> + <entry><type>name</type></entry> + <entry>Column described by this row</entry> + </row> + + <row> + <entry>null_frac</entry> + <entry><type>real</type></entry> + <entry>Fraction of column's entries that are NULL</entry> + </row> + + <row> + <entry>avg_width</entry> + <entry><type>integer</type></entry> + <entry>Average width in bytes of column's entries</entry> + </row> + + <row> + <entry>n_distinct</entry> + <entry><type>real</type></entry> + <entry>If greater than zero, the estimated number of distinct values + in the column. If less than zero, the negative of the number of + distinct values divided by the number of rows. (The negated form + is used when ANALYZE believes that the number of distinct values + is likely to increase as the table grows; the positive form is used + when the column seems to have a fixed number of possible values.) + For example, -1 indicates a unique column in which the number of + distinct values is the same as the number of rows. + </entry> + </row> + + <row> + <entry>most_common_vals</entry> + <entry><type>text[]</type></entry> + <entry>A list of the most common values in the column. (Omitted if + no values seem to be more common than any others.)</entry> + </row> + + <row> + <entry>most_common_freqs</entry> + <entry><type>real[]</type></entry> + <entry>A list of the frequencies of the most common values, + ie, number of occurrences of each divided by total number of rows. + </entry> + </row> + + <row> + <entry>histogram_bounds</entry> + <entry><type>text[]</type></entry> + <entry>A list of values that divide the column's values into + groups of approximately equal population. The + <structfield>most_common_vals</>, if present, are omitted from the + histogram calculation. (Omitted if column datatype does not have a + <literal><</> operator, or if the <structfield>most_common_vals</> + list accounts for the entire population.) + </entry> + </row> + + <row> + <entry>correlation</entry> + <entry><type>real</type></entry> + <entry>Statistical correlation between physical row ordering and + logical ordering of the column values. This ranges from -1 to +1. + When the value is near -1 or +1, an indexscan on the column will + be estimated to be cheaper than when it is near zero, due to reduction + of random access to the disk. (Omitted if column datatype does + not have a <literal><</> operator.) + </entry> + </row> + </tbody> + </tgroup> + </table> + + <para> + The maximum number of entries in the <structfield>most_common_vals</> + and <structfield>histogram_bounds</> arrays can be set on a + column-by-column basis using the <command>ALTER TABLE SET STATISTICS</> + command. The default limit is presently 10 entries. Raising the limit + may allow more accurate planner estimates to be made, particularly for + columns with irregular data distributions, at the price of consuming + more space in <structname>pg_statistic</structname> and slightly more + time to compute the estimates. Conversely, a lower limit may be + appropriate for columns with simple data distributions. + </para> + + </sect1> + <sect1 id="explicit-joins"> <title>Controlling the Planner with Explicit JOINs</title> diff --git a/doc/src/sgml/ref/analyze.sgml b/doc/src/sgml/ref/analyze.sgml index 98f38aefb90..70e172790ba 100644 --- a/doc/src/sgml/ref/analyze.sgml +++ b/doc/src/sgml/ref/analyze.sgml @@ -1,5 +1,5 @@ <!-- -$Header: /cvsroot/pgsql/doc/src/sgml/ref/analyze.sgml,v 1.2 2001/09/03 12:57:49 petere Exp $ +$Header: /cvsroot/pgsql/doc/src/sgml/ref/analyze.sgml,v 1.3 2001/10/16 01:13:44 tgl Exp $ Postgres documentation --> @@ -152,7 +152,9 @@ ANALYZE [ VERBOSE ] [ <replaceable class="PARAMETER">table</replaceable> [ (<rep distribution in each column. One or both of these may be omitted if <command>ANALYZE</command> deems them uninteresting (for example, in a unique-key column, there are no common values) or if the column - datatype does not support the appropriate operators. + datatype does not support the appropriate operators. There is more + information about the statistics in the <citetitle>User's + Guide</citetitle>. </para> <para> |