aboutsummaryrefslogtreecommitdiff
diff options
context:
space:
mode:
-rw-r--r--doc/src/sgml/gin.sgml180
-rw-r--r--doc/src/sgml/indices.sgml11
-rw-r--r--doc/src/sgml/xindex.sgml61
3 files changed, 138 insertions, 114 deletions
diff --git a/doc/src/sgml/gin.sgml b/doc/src/sgml/gin.sgml
index 24517402bcc..c2bbad42bfa 100644
--- a/doc/src/sgml/gin.sgml
+++ b/doc/src/sgml/gin.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.6 2006/11/30 20:50:44 petere Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.7 2006/12/01 23:46:46 tgl Exp $ -->
<chapter id="GIN">
<title>GIN Indexes</title>
@@ -14,8 +14,9 @@
<para>
<acronym>GIN</acronym> stands for Generalized Inverted Index. It is
an index structure storing a set of (key, posting list) pairs, where
- 'posting list' is a set of rows in which the key occurs. Each
- row may contain many keys.
+ a <quote>posting list</> is a set of rows in which the key occurs. Each
+ indexed value may contain many keys, so the same row ID may appear in
+ multiple posting lists.
</para>
<para>
@@ -45,7 +46,7 @@
<para>
The <acronym>GIN</acronym> interface has a high level of abstraction,
- requiring the access method implementer to only implement the semantics of
+ requiring the access method implementer only to implement the semantics of
the data type being accessed. The <acronym>GIN</acronym> layer itself
takes care of concurrency, logging and searching the tree structure.
</para>
@@ -53,26 +54,14 @@
<para>
All it takes to get a <acronym>GIN</acronym> access method working
is to implement four user-defined methods, which define the behavior of
- keys in the tree. In short, <acronym>GIN</acronym> combines extensibility
- along with generality, code reuse, and a clean interface.
- </para>
-
-</sect1>
-
-<sect1 id="gin-implementation">
- <title>Implementation</title>
-
- <para>
- Internally, <acronym>GIN</acronym> consists of a B-tree index constructed
- over keys, where each key is an element of the indexed value
- (element of array, for example) and where each tuple in a leaf page is
- either a pointer to a B-tree over heap pointers (PT, posting tree), or a
- list of heap pointers (PL, posting list) if the tuple is small enough.
+ keys in the tree and the relationships between keys, indexed values,
+ and indexable queries. In short, <acronym>GIN</acronym> combines
+ extensibility with generality, code reuse, and a clean interface.
</para>
<para>
- There are four methods that an index operator class for
- <acronym>GIN</acronym> must provide (prototypes are in pseudocode):
+ The four methods that an index operator class for
+ <acronym>GIN</acronym> must provide are:
</para>
<variablelist>
@@ -80,9 +69,9 @@
<term>int compare(Datum a, Datum b)</term>
<listitem>
<para>
- Compares keys (not indexed values!) and returns an integer less than
- zero, zero, or greater than zero, indicating whether the first key is
- less than, equal to, or greater than the second.
+ Compares keys (not indexed values!) and returns an integer less than
+ zero, zero, or greater than zero, indicating whether the first key is
+ less than, equal to, or greater than the second.
</para>
</listitem>
</varlistentry>
@@ -91,21 +80,26 @@
<term>Datum* extractValue(Datum inputValue, uint32 *nkeys)</term>
<listitem>
<para>
- Returns an array of keys of value to be indexed, nkeys should
- contain the number of returned keys.
+ Returns an array of keys given a value to be indexed. The
+ number of returned keys must be stored into <literal>*nkeys</>.
</para>
</listitem>
</varlistentry>
<varlistentry>
- <term>Datum* extractQuery(Datum query, uint32 nkeys,
- StrategyNumber n)</term>
+ <term>Datum* extractQuery(Datum query, uint32 *nkeys,
+ StrategyNumber n)</term>
<listitem>
<para>
- Returns an array of keys of the query to be executed. n contains the
- strategy number of the operation (see <xref
- linkend="xindex-strategies">). Depending on n, query may be
- different type.
+ Returns an array of keys given a value to be queried; that is,
+ <literal>query</> is the value on the right-hand side of an
+ indexable operator whose left-hand side is the indexed column.
+ <literal>n</> is the strategy number of the operator within the
+ operator class (see <xref linkend="xindex-strategies">).
+ Often, <function>extractQuery</> will need
+ to consult <literal>n</> to determine the data type of
+ <literal>query</> and the key values that need to be extracted.
+ The number of returned keys must be stored into <literal>*nkeys</>.
</para>
</listitem>
</varlistentry>
@@ -114,11 +108,16 @@
<term>bool consistent(bool check[], StrategyNumber n, Datum query)</term>
<listitem>
<para>
- Returns TRUE if the indexed value satisfies the query qualifier with
- strategy n (or may satisfy in case of RECHECK mark in operator class).
- Each element of the check array is TRUE if the indexed value has a
- corresponding key in the query: if (check[i] == TRUE) the i-th key of
- the query is present in the indexed value.
+ Returns TRUE if the indexed value satisfies the query operator with
+ strategy number <literal>n</> (or may satisfy, if the operator is
+ marked RECHECK in the operator class). The <literal>check</> array has
+ the same length as the number of keys previously returned by
+ <function>extractQuery</> for this query. Each element of the
+ <literal>check</> array is TRUE if the indexed value contains the
+ corresponding query key, ie, if (check[i] == TRUE) the i-th key of the
+ <function>extractQuery</> result array is present in the indexed value.
+ The original <literal>query</> datum (not the extracted key array!) is
+ passed in case the <function>consistent</> method needs to consult it.
</para>
</listitem>
</varlistentry>
@@ -127,6 +126,19 @@
</sect1>
+<sect1 id="gin-implementation">
+ <title>Implementation</title>
+
+ <para>
+ Internally, a <acronym>GIN</acronym> index contains a B-tree index
+ constructed over keys, where each key is an element of the indexed value
+ (a member of an array, for example) and where each tuple in a leaf page is
+ either a pointer to a B-tree over heap pointers (PT, posting tree), or a
+ list of heap pointers (PL, posting list) if the list is small enough.
+ </para>
+
+</sect1>
+
<sect1 id="gin-tips">
<title>GIN tips and tricks</title>
@@ -134,44 +146,43 @@
<varlistentry>
<term>Create vs insert</term>
<listitem>
- <para>
- In most cases, insertion into a <acronym>GIN</acronym> index is slow
- due to the likelihood of many keys being inserted for each value.
- So, for bulk insertions into a table it is advisable to to drop the GIN
- index and recreate it after finishing bulk insertion.
- </para>
+ <para>
+ In most cases, insertion into a <acronym>GIN</acronym> index is slow
+ due to the likelihood of many keys being inserted for each value.
+ So, for bulk insertions into a table it is advisable to drop the GIN
+ index and recreate it after finishing bulk insertion.
+ </para>
</listitem>
</varlistentry>
<varlistentry>
- <term>gin_fuzzy_search_limit</term>
+ <term><xref linkend="guc-gin-fuzzy-search-limit"></term>
<listitem>
- <para>
- The primary goal of developing <acronym>GIN</acronym> indices was
- support for highly scalable, full-text search in
- <productname>PostgreSQL</productname> and there are often situations when
- a full-text search returns a very large set of results. Since reading
- tuples from the disk and sorting them could take a lot of time, this is
- unacceptable for production. (Note that the index search itself is very
- fast.)
+ <para>
+ The primary goal of developing <acronym>GIN</acronym> indexes was
+ to create support for highly scalable, full-text search in
+ <productname>PostgreSQL</productname>, and there are often situations when
+ a full-text search returns a very large set of results. Moreover, this
+ often happens when the query contains very frequent words, so that the
+ large result set is not even useful. Since reading many
+ tuples from the disk and sorting them could take a lot of time, this is
+ unacceptable for production. (Note that the index search itself is very
+ fast.)
+ </para>
+ <para>
+ To facilitate controlled execution of such queries
+ <acronym>GIN</acronym> has a configurable soft upper limit on the size
+ of the returned set, the
+ <varname>gin_fuzzy_search_limit</varname> configuration parameter.
+ It is set to 0 (meaning no limit) by default.
+ If a non-zero limit is set, then the returned set is a subset of
+ the whole result set, chosen at random.
+ </para>
+ <para>
+ <quote>Soft</quote> means that the actual number of returned results
+ could differ slightly from the specified limit, depending on the query
+ and the quality of the system's random number generator.
</para>
- <para>
- Such queries usually contain very frequent words, so the results are not
- very helpful. To facilitate execution of such queries
- <acronym>GIN</acronym> has a configurable soft upper limit of the size
- of the returned set, determined by the
- <varname>gin_fuzzy_search_limit</varname> GUC variable. It is set to 0 by
- default (no limit).
- </para>
- <para>
- If a non-zero search limit is set, then the returned set is a subset of
- the whole result set, chosen at random.
- </para>
- <para>
- <quote>Soft</quote> means that the actual number of returned results
- could slightly differ from the specified limit, depending on the query
- and the quality of the system's random number generator.
- </para>
</listitem>
</varlistentry>
</variablelist>
@@ -182,21 +193,30 @@
<title>Limitations</title>
<para>
- <acronym>GIN</acronym> doesn't support full index scans due to their
- extreme inefficiency: because there are often many keys per value,
- each heap pointer will be returned several times.
+ <acronym>GIN</acronym> doesn't support full index scans: because there are
+ often many keys per value, each heap pointer would be returned many times,
+ and there is no easy way to prevent this.
</para>
<para>
When <function>extractQuery</function> returns zero keys,
- <acronym>GIN</acronym> will emit an error: for different opclasses and
- strategies the semantic meaning of a void query may be different (for
- example, any array contains the void array, but they don't overlap the
- void array), and <acronym>GIN</acronym> can't suggest a reasonable answer.
+ <acronym>GIN</acronym> will emit an error. Depending on the operator,
+ a void query might match all, some, or none of the indexed values (for
+ example, every array contains the empty array, but does not overlap the
+ empty array), and <acronym>GIN</acronym> can't determine the correct
+ answer, nor produce a full-index-scan result if it could determine that
+ that was correct.
</para>
<para>
- <acronym>GIN</acronym> searches keys only by equality matching. This may
+ It is not an error for <function>extractValue</> to return zero keys,
+ but in this case the indexed value will be unrepresented in the index.
+ This is another reason why full index scan is not useful &mdash; it would
+ miss such rows.
+ </para>
+
+ <para>
+ <acronym>GIN</acronym> searches keys only by equality matching. This may
be improved in future.
</para>
</sect1>
@@ -206,12 +226,12 @@
<para>
The <productname>PostgreSQL</productname> source distribution includes
- <acronym>GIN</acronym> classes for one-dimensional arrays of all internal
+ <acronym>GIN</acronym> classes for one-dimensional arrays of all internal
types. The following
<filename>contrib</> modules also contain <acronym>GIN</acronym>
- operator classes:
+ operator classes:
</para>
-
+
<variablelist>
<varlistentry>
<term>intarray</term>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index fd8bb7251ed..1edceebd2d5 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/indices.sgml,v 1.65 2006/10/23 18:10:31 petere Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/indices.sgml,v 1.66 2006/12/01 23:46:46 tgl Exp $ -->
<chapter id="indexes">
<title id="indexes-title">Indexes</title>
@@ -116,7 +116,7 @@ CREATE INDEX test1_id_index ON test1 (id);
<para>
<productname>PostgreSQL</productname> provides several index types:
- B-tree, Hash, GIN and GiST. Each index type uses a different
+ B-tree, Hash, GiST and GIN. Each index type uses a different
algorithm that is best suited to different types of queries.
By default, the <command>CREATE INDEX</command> command will create a
B-tree index, which fits the most common situations.
@@ -247,8 +247,8 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
<primary>GIN</primary>
<see>index</see>
</indexterm>
- GIN is a inverted index and it's usable for values which have more
- than one key, arrays for example. Like GiST, GIN may support
+ GIN indexes are inverted indexes which can handle values that contain more
+ than one key, arrays for example. Like GiST, GIN can support
many different user-defined indexing strategies and the particular
operators with which a GIN index can be used vary depending on the
indexing strategy.
@@ -267,7 +267,8 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
(See <xref linkend="functions-array"> for the meaning of
these operators.)
Other GIN operator classes are available in the <literal>contrib</>
- <literal>tsearch2</literal> and <literal>intarray</literal> modules. For more information see <xref linkend="GIN">.
+ <literal>tsearch2</literal> and <literal>intarray</literal> modules.
+ For more information see <xref linkend="GIN">.
</para>
</sect1>
diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml
index 444839399e1..a66dd3c4ee4 100644
--- a/doc/src/sgml/xindex.sgml
+++ b/doc/src/sgml/xindex.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.52 2006/10/23 18:10:32 petere Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.53 2006/12/01 23:46:46 tgl Exp $ -->
<sect1 id="xindex">
<title>Interfacing Extensions To Indexes</title>
@@ -243,15 +243,16 @@
</table>
<para>
- GIN indexes are similar to GiST's in flexibility: they don't have a fixed
- et of strategies. Instead, the <quote>consistency</> support routine
- interprets the strategy numbers accordingly with operator class
- definition. As an example, strategies of operator class over arrays
- is shown in <xref linkend="xindex-gin-array-strat-table">.
+ GIN indexes are similar to GiST indexes in flexibility: they don't have a
+ fixed set of strategies. Instead the support routines of each operator
+ class interpret the strategy numbers according to the operator class's
+ definition. As an example, the strategy numbers used by the built-in
+ operator classes for arrays are
+ shown in <xref linkend="xindex-gin-array-strat-table">.
</para>
<table tocentry="1" id="xindex-gin-array-strat-table">
- <title>GIN Array's Strategies</title>
+ <title>GIN Array Strategies</title>
<tgroup cols="2">
<thead>
<row>
@@ -388,36 +389,35 @@
<tbody>
<row>
<entry>consistent - determine whether key satisfies the
- query qualifier</entry>
+ query qualifier</entry>
<entry>1</entry>
</row>
<row>
- <entry>union - compute union of of a set of given keys</entry>
+ <entry>union - compute union of a set of keys</entry>
<entry>2</entry>
</row>
<row>
- <entry>compress - computes a compressed representation of a key or value
- to be indexed</entry>
+ <entry>compress - compute a compressed representation of a key or value
+ to be indexed</entry>
<entry>3</entry>
</row>
<row>
- <entry>decompress - computes a decompressed representation of a
- compressed key </entry>
+ <entry>decompress - compute a decompressed representation of a
+ compressed key</entry>
<entry>4</entry>
</row>
<row>
<entry>penalty - compute penalty for inserting new key into subtree
- with given subtree's key</entry>
+ with given subtree's key</entry>
<entry>5</entry>
</row>
<row>
<entry>picksplit - determine which entries of a page are to be moved
- to the new page and compute the union keys for resulting pages </entry>
+ to the new page and compute the union keys for resulting pages</entry>
<entry>6</entry>
</row>
<row>
- <entry>equal - compare two keys and returns true if they are equal
- </entry>
+ <entry>equal - compare two keys and return true if they are equal</entry>
<entry>7</entry>
</row>
</tbody>
@@ -441,23 +441,22 @@
<tbody>
<row>
<entry>
- compare - Compare two keys and return an integer less than zero, zero, or
- greater than zero, indicating whether the first key is less than, equal to,
- or greater than the second.
- </entry>
+ compare - compare two keys and return an integer less than zero, zero,
+ or greater than zero, indicating whether the first key is less than,
+ equal to, or greater than the second
+ </entry>
<entry>1</entry>
</row>
<row>
- <entry>extractValue - extract keys from value to be indexed</entry>
+ <entry>extractValue - extract keys from a value to be indexed</entry>
<entry>2</entry>
</row>
<row>
- <entry>extractQuery - extract keys from query</entry>
+ <entry>extractQuery - extract keys from a query condition</entry>
<entry>3</entry>
</row>
<row>
- <entry>consistent - determine whether value matches by the
- query</entry>
+ <entry>consistent - determine whether value matches query condition</entry>
<entry>4</entry>
</row>
</tbody>
@@ -822,12 +821,16 @@ CREATE OPERATOR CLASS polygon_ops
STORAGE box;
</programlisting>
- At present, only the GiST and GIN index method supports a
+ At present, only the GiST and GIN index methods support a
<literal>STORAGE</> type that's different from the column data type.
- The GiST <literal>compress</> and <literal>decompress</> support
+ The GiST <function>compress</> and <function>decompress</> support
routines must deal with data-type conversion when <literal>STORAGE</>
- is used. Functions named <literal>extractValue</> and <literal>extractQuery</>
- do conversation into internally used types for GIN.
+ is used. In GIN, the <literal>STORAGE</> type identifies the type of
+ the <quote>key</> values, which normally is different from the type
+ of the indexed column &mdash; for example, an operator class for
+ integer array columns might have keys that are just integers. The
+ GIN <function>extractValue</> and <function>extractQuery</> support
+ routines are responsible for extracting keys from indexed values.
</para>
</sect2>