aboutsummaryrefslogtreecommitdiff
path: root/contrib/tsearch2/docs/tsearch2-ref.html
diff options
context:
space:
mode:
Diffstat (limited to 'contrib/tsearch2/docs/tsearch2-ref.html')
-rw-r--r--contrib/tsearch2/docs/tsearch2-ref.html448
1 files changed, 448 insertions, 0 deletions
diff --git a/contrib/tsearch2/docs/tsearch2-ref.html b/contrib/tsearch2/docs/tsearch2-ref.html
new file mode 100644
index 00000000000..df0faa47d93
--- /dev/null
+++ b/contrib/tsearch2/docs/tsearch2-ref.html
@@ -0,0 +1,448 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
+<html>
+<head>
+<link type="text/css" rel="stylesheet" href="/~megera/postgres/gist/tsearch/tsearch.css">
+<title>tsearch2 reference</title>
+</head>
+<body>
+<h1 align=center>The tsearch2 Reference</h1>
+
+<p align=center>
+Brandon Craig Rhodes<br>30 June 2003
+<p>
+This Reference documents the user types and functions
+of the tsearch2 module for PostgreSQL.
+An introduction to the module is provided
+by the <a href="tsearch2-guide.html">tsearch2 Guide</a>,
+a companion document to this one.
+You can retrieve a beta copy of the tsearch2 module from the
+<a href="http://www.sai.msu.su/~megera/postgres/gist/">GiST for PostgreSQL</a>
+page &mdash; look under the section entitled <i>Development History</i>
+for the current version.
+
+<h2><a name="vq">Vectors and Queries</h2>
+
+Vectors and queries both store lexemes,
+but for different purposes.
+A <tt>tsvector</tt> stores the lexemes
+of the words that are parsed out of a document,
+and can also remember the position of each word.
+A <tt>tsquery</tt> specifies a boolean condition among lexemes.
+<p>
+Any of the following functions with a <tt><i>configuration</i></tt> argument
+can use either an integer <tt>id</tt> or textual <tt>ts_name</tt>
+to select a configuration;
+if the option is omitted, then the current configuration is used.
+For more information on the current configuration,
+read the next section on Configurations.
+
+<h3>Vector Operations</h3>
+
+<dl>
+<dt>
+ <tt>to_tsvector( <em>[</em><i>configuration</i>,<em>]</em>
+ <i>document</i> TEXT) RETURNS tsvector</tt>
+<dd>
+ Parses a document into tokens,
+ reduces the tokens to lexemes,
+ and returns a <tt>tsvector</tt> which lists the lexemes
+ together with their positions in the document.
+ For the best description of this process,
+ see the section on <a href="tsearch2-guide.html#ps">Parsing and Stemming</a>
+ in the accompanying tsearch2 Guide.
+<dt>
+ <tt>strip(<i>vector</i> tsvector) RETURNS tsvector</tt>
+<dd>
+ Return a vector which lists the same lexemes
+ as the given <tt><i>vector</i></tt>,
+ but which lacks any information
+ about where in the document each lexeme appeared.
+ While the returned vector is thus useless for relevance ranking,
+ it will usually be much smaller.
+<dt>
+ <tt>setweight(<i>vector</i> tsvector, <i>letter</i>) RETURNS tsvector</tt>
+<dd>
+ This function returns a copy of the input vector
+ in which every location has been labelled
+ with either the <tt><i>letter</i></tt>
+ <tt>'A'</tt>, <tt>'B'</tt>, or <tt>'C'</tt>,
+ or the default label <tt>'D'</tt>
+ (which is the default with which new vectors are created,
+ and as such is usually not displayed).
+ These labels are retained when vectors are concatenated,
+ allowing words from different parts of a document
+ to be weighted differently by ranking functions.
+<dt>
+ <tt><i>vector1</i> || <i>vector2</i></tt>
+<dt class=br>
+ <tt>concat(<i>vector1</i> tsvector, <i>vector2</i> tsvector)
+ RETURNS tsvector</tt>
+<dd>
+ Returns a vector which combines the lexemes and position information
+ in the two vectors given as arguments.
+ Position weight labels (described in the previous paragraph)
+ are retained intact during the concatenation.
+ This has at least two uses.
+ First,
+ if some sections of your document
+ need be parsed with different configurations than others,
+ you can parse them separately
+ and concatenate the resulting vectors into one.
+ Second,
+ you can weight words from some sections of you document
+ more heavily than those from others by:
+ parsing the sections into separate vectors;
+ assigning the vectors different position labels
+ with the <tt>setweight()</tt> function;
+ concatenating them into a single vector;
+ and then providing a <tt><i>weights</i></tt> argument
+ to the <tt>rank()</tt> function
+ that assigns different weights to positions with different labels.
+<dt>
+ <tt>tsvector_size(<i>vector</i> tsvector) RETURNS INT4</tt>
+<dd>
+ Returns the number of lexemes stored in the vector.
+<dt>
+ <tt><i>text</i>::tsvector RETURNS tsvector</tt>
+<dd>
+ Directly casting text to a <tt>tsvector</tt>
+ allows you to directly inject lexemes into a vector,
+ with whatever positions and position weights you choose to specify.
+ The <tt><i>text</i></tt> should be formatted
+ like the vector would be printed by the output of a <tt>SELECT</tt>.
+ See the <a href="tsearch2-guide.html#casting">Casting</a>
+ section in the Guide for details.
+</dl>
+
+<h3>Query Operations</h3>
+
+<dl>
+<dt>
+ <tt>to_tsquery( <em>[</em><i>configuration</i>,<em>]</em>
+ <i>querytext</i> text) RETURNS tsvector</tt>
+<dd>
+ Parses a query,
+ which should be single words separated by the boolean operators
+ &ldquo;<tt>&amp;</tt>&rdquo;&nbsp;and,
+ &ldquo;<tt>|</tt>&rdquo;&nbsp;or,
+ and&nbsp;&ldquo;<tt>!</tt>&rdquo;&nbsp;not,
+ which can be grouped using parenthesis.
+ Each word is reduced to a lexeme using the current
+ or specified configuration.
+</ul>
+<dt>
+ <tt>querytree(<i>query</i> tsquery) RETURNS text</tt>
+<dd>
+ This might return a textual representation of the given query.
+<dt>
+ <tt><i>text</i>::tsquery RETURNS tsquery</tt>
+<dd>
+ Directly casting text to a <tt>tsquery</tt>
+ allows you to directly inject lexemes into a query,
+ with whatever positions and position weight flags you choose to specify.
+ The <tt><i>text</i></tt> should be formatted
+ like the query would be printed by the output of a <tt>SELECT</tt>.
+ See the <a href="tsearch2-guide.html#casting">Casting</a>
+ section in the Guide for details.
+</dl>
+
+<h2><a name="configurations">Configurations</a></h2>
+
+A configuration specifies all of the equipment necessary
+to transform a document into a <tt>tsvector</tt>:
+the parser that breaks its text into tokens,
+and the dictionaries which then transform each token into a lexeme.
+Every call to <tt>to_tsvector()</tt> (described above)
+uses a configuration to perform its processing.
+Three configurations come with tsearch2:
+
+<ul>
+<li><b>default</b> &mdash; Indexes words and numbers,
+ using the <i>en_stem</i> English Snowball stemmer for Latin-alphabet words
+ and the <i>simple</i> dictionary for all others.
+<li><b>default_russian</b> &mdash; Indexes words and numbers,
+ using the <i>en_stem</i> English Snowball stemmer for Latin-alphabet words
+ and the <i>ru_stem</i> Russian Snowball dictionary for all others.
+<li><b>simple</b> &mdash; Processes both words and numbers
+ with the <i>simple</i> dictionary,
+ which neither discards any stop words nor alters them.
+</ul>
+
+The tsearch2 modules initially chooses your current configuration
+by looking for your current locale in the <tt>locale</tt> field
+of the <tt>pg_ts_cfg</tt> table described below.
+You can manipulate the current configuration yourself with these functions:
+
+<dl>
+<dt>
+ <tt>set_curcfg( <i>id</i> INT <em>|</em> <i>ts_name</i> TEXT
+ ) RETURNS VOID</tt>
+<dd>
+ Set the current configuration used by <tt>to_tsvector</tt>
+ and <tt>to_tsquery</tt>.
+<dt>
+ <tt>show_curcfg() RETURNS INT4</tt>
+<dd>
+ Returns the integer <tt>id</tt> of the current configuration.
+</dl>
+
+<p>
+Each configuration is defined by a record in the <tt>pg_ts_cfg</tt> table:
+
+<pre>create table pg_ts_cfg (
+ id int not null primary key,
+ ts_name text not null,
+ prs_name text not null,
+ locale text
+);</pre>
+
+The <tt>id</tt> and <tt>ts_name</tt> are unique values
+which identify the configuration;
+the <tt>prs_name</tt> specifies which parser the configuration uses.
+Once this parser has split document text into tokens,
+the type of each resulting token &mdash;
+or, more specifically, the type's <tt>lex_alias</tt>
+as specified in the parser's <tt>lexem_type()</tt> table &mdash;
+is searched for together with the configuration's <tt>ts_name</tt>
+in the <tt>pg_ts_cfgmap</tt> table:
+
+<pre>create table pg_ts_cfgmap (
+ ts_name text not null,
+ lex_alias text not null,
+ dict_name text[],
+ primary key (ts_name,lex_alias)
+);</pre>
+
+Those tokens whose types are not listed are discarded.
+The remaining tokens are assigned integer positions,
+starting with 1 for the first token in the document,
+and turned into lexemes with the help of the dictionaries
+whose names are given in the <tt>dict_name</tt> array for their type.
+These dictionaries are tried in order,
+stopping either with the first one to return a lexeme for the token,
+or discarding the token if no dictionary returns a lexeme for it.
+
+<h2><a name="dictionaries">Parsers</a></h2>
+
+Each parser is defined by a record in the <tt>pg_ts_parser</tt> table:
+
+<pre>create table pg_ts_parser (
+ prs_id int not null primary key,
+ prs_name text not null,
+ prs_start oid not null,
+ prs_getlexem oid not null,
+ prs_end oid not null,
+ prs_headline oid not null,
+ prs_lextype oid not null,
+ prs_comment text
+);</pre>
+
+The <tt>prs_id</tt> and <tt>prs_name</tt> uniquely identify the parser,
+while <tt>prs_comment</tt> usually describes its name and version
+for the reference of users.
+The other items identify the low-level functions
+which make the parser operate,
+and are only of interest to someone writing a parser of their own.
+<p>
+The tsearch2 module comes with one parser named <tt>default</tt>
+which is suitable for parsing most plain text and HTML documents.
+<p>
+Each <tt><i>parser</i></tt> argument below
+must designate a parser with either an integer <tt><i>prs_id</i></tt>
+or a textual <tt><i>prs_name</i></tt>;
+the current parser is used when this argument is omitted.
+
+<dl>
+<dt>
+ <tt>CREATE FUNCTION set_curprs(<i>parser</i>) RETURNS VOID</tt>
+<dd>
+ Selects a current parser
+ which will be used when any of the following functions
+ are called without a parser as an argument.
+<dt>
+ <tt>CREATE FUNCTION lexem_type(
+ <em>[</em> <i>parser</i> <em>]</em>
+ ) RETURNS SETOF lexemtype</tt>
+<dd>
+ Returns a table which defines and describes
+ each kind of token the parser may produce as output.
+ For each token type the table gives the <tt>lexid</tt>
+ which the parser will label each token of that type,
+ the <tt>alias</tt> which names the token type,
+ and a short description <tt>descr</tt> for the user to read.
+<dt>
+ <tt>CREATE FUNCTION parse(
+ <em>[</em> <i>parser</i>, <em>]</em> <i>document</i> TEXT
+ ) RETURNS SETOF lexemtype</tt>
+<dd>
+ Parses the given document and returns a series of records,
+ one for each token produced by parsing.
+ Each token includes a <tt>lexid</tt> giving its type
+ and a <tt>lexem</tt> which gives its content.
+</dl>
+
+<h2><a name="dictionaries">Dictionaries</a></h2>
+
+Dictionaries take textual tokens as input,
+usually those produced by a parser,
+and return lexemes which are usually some reduced form of the token.
+Among the dictionaries which come installed with tsearch2 are:
+
+<ul>
+<li><b>simple</b> simply folds uppercase letters to lowercase
+ before returning the word.
+<li><b>en_stem</b> runs an English Snowball stemmer on each word
+ that attempts to reduce the various forms of a verb or noun
+ to a single recognizable form.
+<li><b>ru_stem</b> runs a Russian Snowball stemmer on each word.
+</ul>
+
+Each dictionary is defined by an entry in the <tt>pg_ts_dict</tt> table:
+
+<pre>CREATE TABLE pg_ts_dict (
+ dict_id int not null primary key,
+ dict_name text not null,
+ dict_init oid,
+ dict_initoption text,
+ dict_lemmatize oid not null,
+ dict_comment text
+);</pre>
+
+The <tt>dict_id</tt> and <tt>dict_name</tt>
+serve as unique identifiers for the dictionary.
+The meaning of the <tt>dict_initoption</tt> varies among dictionaries,
+but for the built-in Snowball dictionaries
+it specifies a file from which stop words should be read.
+The <tt>dict_comment</tt> is a human-readable description of the dictionary.
+The other fields are internal function identifiers
+useful only to developers trying to implement their own dictionaries.
+<p>
+The argument named <tt><i>dictionary</i></tt>
+in each of the following functions
+should be either an integer <tt>dict_id</tt> or a textual <tt>dict_name</tt>
+identifying which dictionary should be used for the operation;
+if omitted then the current dictionary is used.
+
+<dl>
+<dt>
+ <tt>CREATE FUNCTION set_curdict(<i>dictionary</i>) RETURNS VOID</tt>
+<dd>
+ Selects a current dictionary for use by functions
+ that do not select a dictionary explicitly.
+<dt>
+ <tt>CREATE FUNCTION lexize(
+ <em>[</em> <i>dictionary</i>, <em>]</em> <i>word</i> text)
+ RETURNS TEXT[]</tt>
+<dd>
+ Reduces a single word to a lexeme.
+ Note that lexemes are arrays of zero or more strings,
+ since in some languages there might be several base words
+ from which an inflected form could arise.
+</dl>
+
+<h2><a name="ranking">Ranking</a></h2>
+
+Ranking attempts to measure how relevant documents are to particular queries
+by inspecting the number of times each search word appears in the document,
+and whether different search terms occur near each other.
+Note that this information is only available in unstripped vectors &mdash;
+ranking functions will only return a useful result
+for a <tt>tsvector</tt> which still has position information!
+<p>
+Both of these ranking functions
+take an integer <i>normalization</i> option
+that specifies whether a document's length should impact its rank.
+This is often desirable,
+since a hundred-word document with five instances of a search word
+is probably more relevant than a thousand-word document with five instances.
+The option can have the values:
+
+<ul>
+<li><tt>0</tt> (the default) ignores document length.
+<li><tt>1</tt> divides the rank by the logarithm of the length.
+<li><tt>2</tt> divides the rank by the length itself.
+</ul>
+
+The two ranking functions currently available are:
+
+<dl>
+<dt>
+ <tt>CREATE FUNCTION rank(<br>
+ <em>[</em> <i>weights</i> float4[], <em>]</em>
+ <i>vector</i> tsvector, <i>query</i> tsquery,
+ <em>[</em> <i>normalization</i> int4 <em>]</em><br>
+ ) RETURNS float4</tt>
+<dd>
+ This is the ranking function from the old version of OpenFTS,
+ and offers the ability to weight word instances more heavily
+ depending on how you have classified them.
+ The <i>weights</i> specify how heavily to weight each category of word:
+ <pre
+>{<i>D-weight</i>, <i>A-weight</i>, <i>B-weight</i>, <i>C-weight</i>}</pre>
+ If no weights are provided, then these defaults are used:
+ <pre>{0.1, 0.2, 0.4, 1.0}</pre>
+ Often weights are used to mark words from special areas of the document,
+ like the title or an initial abstract,
+ and make them more or less important than words in the document body.
+<dt>
+ <tt>CREATE FUNCTION rank_cd(<br>
+ <em>[</em> <i>K</i> int4, <em>]</em>
+ <i>vector</i> tsvector, <i>query</i> tsquery,
+ <em>[</em> <i>normalization</i> int4 <em>]</em><br>
+ ) RETURNS float4</tt>
+<dd>
+ This function computes the cover density ranking
+ for the given document <i>vector</i> and <i>query</i>,
+ as described in Clarke, Cormack, and Tudhope's
+ &ldquo;<a href="http://citeseer.nj.nec.com/clarke00relevance.html"
+>Relevance Ranking for One to Three Term Queries</a>&rdquo;
+ in the 1999 <i>Information Processing and Management</i>.
+ The value <i>K</i> is one of the values from their formula,
+ and defaults to&nbsp;<i>K</i>=4.
+ The examples in their paper <i>K</i>=16;
+ we can roughly describe the term
+ as stating how far apart two search terms can fall
+ before the formula begins penalizing them for lack of proximity.
+</dl>
+
+<h2><a name="headlines">Headlines</a></h2>
+
+<dl>
+<dt>
+ <tt>CREATE FUNCTION headline(<br>
+ <em>[</em> <i>id</i> int4, <em>|</em> <i>ts_name</i> text, <em>]</em>
+ <i>document</i> text, <i>query</i> tsquery,
+ <em>[</em> <i>options</i> text <em>]</em><br>
+ ) RETURNS text</tt>
+<dd>
+ Every form of the the <tt>headline()</tt> function
+ accepts a <tt>document</tt> along with a <tt>query</tt>,
+ and returns one or more ellipse-separated excerpts from the document
+ in which terms from the query are highlighted.
+ The configuration with which to parse the document
+ can be specified by either its <i>id</i> or <i>ts_name</i>;
+ if none is specified that the current configuration is used instead.
+ <p>
+ An <i>options</i> string if provided should be a comma-separated list
+ of one or more &lsquo;<i>option</i><tt>=</tt><i>value</i>&rsquo; pairs.
+ The available options are:
+ <ul>
+ <li><tt>StartSel</tt>, <tt>StopSel</tt> &mdash;
+ the strings with which query words appearing in the document
+ should be delimited to distinguish them from other excerpted words.
+ <li><tt>MaxWords</tt>, <tt>MinWords</tt> &mdash;
+ limits on the shortest and longest headlines you will accept.
+ <li><tt>ShortWord</tt> &mdash;
+ this prevents your headline from beginning or ending
+ with a word which has this many characters or less.
+ The default value of <tt>3</tt> should eliminate most English
+ conjunctions and articles.
+ </ul>
+ Any unspecified options receive these defaults:
+ <pre>
+StartSel=&lt;b&gt;, StopSel=&lt;/b&gt;, MaxWords=35, MinWords=15, ShortWord=3
+ </pre>
+</dl>
+
+</body>
+</html>