<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head>

<title>tsearch2 reference</title></head>

<body>
<h1 align="center">The tsearch2 Reference</h1>

<p align="center">
Brandon Craig Rhodes<br>30 June 2003 (edited by Oleg Bartunov, 2 Aug 2003).
<br>Massive update for 8.2 release by Oleg Bartunov, October 2006
</p>
<p>
This Reference documents the user types and functions
of the tsearch2 module for PostgreSQL.
An introduction to the module is provided
by the <a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html">tsearch2 Guide</a>,
a companion document to this one.
</p>

<h2>Table of Contents</h2>
<blockquote>
<a href="#vq">Vectors and Queries</a><br>
<a href="#vqo">Vector Operations</a><br>
<a href="#qo">Query Operations</a><br>
<a href="#fts">Full Text Search Operator</a><br>
<a href="#configurations">Configurations</a><br>
<a href="#testing">Testing</a><br>
<a href="#parsers">Parsers</a><br>
<a href="#dictionaries">Dictionaries</a><br>
<a href="#ranking">Ranking</a><br>
<a href="#headlines">Headlines</a><br>
<a href="#indexes">Indexes</a><br>
<a href="#tz">Thesaurus dictionary</a><br>
</blockquote>




<h2><a name="vq">Vectors and Queries</a></h2>

Vectors and queries both store lexemes,
but for different purposes.
A <tt>tsvector</tt> stores the lexemes
of the words that are parsed out of a document,
and can also remember the position of each word.
A <tt>tsquery</tt> specifies a boolean condition among lexemes.
<p>
Any of the following functions with a <tt><i>configuration</i></tt> argument
can use either an integer <tt>id</tt> or textual <tt>ts_name</tt>
to select a configuration;
if the option is omitted, then the current configuration is used.
For more information on the current configuration,
read the next section on Configurations.
</p>

<h3><a name="vqo">Vector Operations</a></h3>

<dl><dt>
<tt>to_tsvector( <em>[</em><i>configuration</i>,<em>]</em>
 <i>document</i> TEXT) RETURNS TSVECTOR</tt>
</dt><dd>
 Parses a document into tokens,
 reduces the tokens to lexemes,
 and returns a <tt>tsvector</tt> which lists the lexemes
 together with their positions in the document.
 For the best description of this process,
 see the section on <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#ps">Parsing and Stemming</a>
 in the accompanying tsearch2 Guide.
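 <p>
 For example (a minimal illustration; the exact lexemes depend on your
 configuration's dictionaries and stop words):
 </p>
 <pre>
=# select to_tsvector('default', 'a fat cat sat on a mat - it ate a fat rats');
                      to_tsvector
-----------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
</pre>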
</dd><dt>
 <tt>strip(<i>vector</i> TSVECTOR) RETURNS TSVECTOR</tt>
</dt><dd>
 Returns a vector which lists the same lexemes
 as the given <tt><i>vector</i></tt>,
 but which lacks any information
 about where in the document each lexeme appeared.
 While the returned vector is thus useless for relevance ranking,
 it will usually be much smaller.
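 For example (illustrative output):
 <pre>
=# select strip('fat:2,4 cat:3 rat:5A'::tsvector);
       strip
-------------------
 'cat' 'fat' 'rat'
</pre>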
</dd><dt>
 <tt>setweight(<i>vector</i> TSVECTOR, <i>letter</i>) RETURNS TSVECTOR</tt>
</dt><dd>
 Returns a copy of the input vector
 in which every position has been labeled
 with the given <tt><i>letter</i></tt>:
 <tt>'A'</tt>, <tt>'B'</tt>, <tt>'C'</tt>,
 or the default label <tt>'D'</tt>
 (the label with which new vectors are created,
 and as such usually not displayed).
 These labels are retained when vectors are concatenated,
 allowing words from different parts of a document
 to be weighted differently by ranking functions.
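 For example (illustrative output):
 <pre>
=# select setweight('fat:2,4 cat:3'::tsvector, 'A');
      setweight
----------------------
 'cat':3A 'fat':2A,4A
</pre>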
</dd>
<dt>
 <tt><i>vector1</i> || <i>vector2</i></tt><BR>
 <tt>concat(<i>vector1</i> TSVECTOR, <i>vector2</i> TSVECTOR)
 RETURNS TSVECTOR</tt>
</dt><dd>
 Returns a vector which combines the lexemes and position information
 in the two vectors given as arguments.
 Position weight labels (described in the previous paragraph)
 are retained intact during the concatenation.
 This has at least two uses.
 First,
 if some sections of your document
 need to be parsed with different configurations than others,
 you can parse them separately
 and concatenate the resulting vectors into one.
 Second,
 you can weight words from some sections of your document
 more heavily than those from others by:
 parsing the sections into separate vectors;
 assigning the vectors different position labels
 with the <tt>setweight()</tt> function;
 concatenating them into a single vector;
 and then providing a <tt><i>weights</i></tt> argument
 to the <tt>rank()</tt> function
 that assigns different weights to positions with different labels.
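 A sketch of that workflow, assuming a hypothetical table <tt>docs</tt>
 with text columns <tt>title</tt> and <tt>body</tt> and a
 <tt>tsvector</tt> column <tt>fts</tt>:
 <pre>
=# update docs set fts =
     setweight( to_tsvector(title), 'A' ) || to_tsvector(body);
</pre>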
</dd><dt>
 <tt>length(<i>vector</i> TSVECTOR) RETURNS INT4</tt>
</dt><dd>
 Returns the number of lexemes stored in the vector.
</dd><dt>
 <tt><i>text</i>::TSVECTOR RETURNS TSVECTOR</tt>
</dt><dd>
 Casting text directly to a <tt>tsvector</tt>
 lets you inject lexemes into a vector,
 with whatever positions and position weights you choose to specify.
 The <tt><i>text</i></tt> should be formatted
 as the vector would be printed by the output of a <tt>SELECT</tt>.
 See the <a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#casting">Casting</a>
 section in the Guide for details.
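 For example (illustrative; position 4 carries weight <tt>A</tt>):
 <pre>
=# select 'fat:2,4A cat:3'::tsvector;
      tsvector
--------------------
 'cat':3 'fat':2,4A
</pre>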
</dd><dt>
 <tt>tsearch2(<i>vector_column_name</i>[, (<i>my_filter_name</i> | <i>text_column_name1</i>) [...] ], <i>text_column_nameN</i>)</tt> 
 </dt><dd>
The <tt>tsearch2()</tt> trigger is used to automatically update <i>vector_column_name</i>;
<i>my_filter_name</i> is the name of a function used to preprocess <i>text_column_name</i>.
Several functions and text columns can be specified in a <tt>tsearch2()</tt> trigger.
The following rule applies:
a function is applied to all subsequent text columns until the next function occurs.
In the example below, the function <tt>dropatsymbol</tt> replaces all occurrences
of the <tt>@</tt> sign with a space.
<pre>
CREATE FUNCTION dropatsymbol(text) RETURNS text 
AS 'select replace($1, ''@'', '' '');'
LANGUAGE SQL;

CREATE TRIGGER tsvectorupdate BEFORE UPDATE OR INSERT 
ON tblMessages FOR EACH ROW EXECUTE PROCEDURE 
tsearch2(tsvector_column,dropatsymbol, strMessage);
</pre>
</dd>

<dt>
<tt>stat(<i>sqlquery</i> text [, <i>weight</i> text]) RETURNS SETOF statinfo</tt>
</dt><dd>
Here <tt>statinfo</tt> is a type defined as 
<tt>
CREATE TYPE statinfo AS (<i>word</i> text, <i>ndoc</i> int4, <i>nentry</i> int4)
</tt> and <i>sqlquery</i> is a query which returns a single <tt>tsvector</tt> column.
<P>
This returns per-word statistics for the collection: the number of documents
(<i>ndoc</i>) containing each <i>word</i>, and its total number of occurrences (<i>nentry</i>).
This is useful for checking how good your configuration is and
for finding stop-word candidates. For example, to find the ten most frequent words:
<pre>
=# select * from stat('select vector from apod') order by ndoc desc, nentry desc,word limit 10;
</pre>
Optionally, one can specify <i>weight</i> to obtain statistics about words with a specific weight.
<pre>
=# select * from stat('select vector from apod','a') order by ndoc desc, nentry desc,word limit 10;
</pre>

</dd>
<dt>
<tt>TSVECTOR &lt; TSVECTOR</tt><BR>
<tt>TSVECTOR &lt;= TSVECTOR</tt><BR>
<tt>TSVECTOR = TSVECTOR</tt><BR>
<tt>TSVECTOR >= TSVECTOR</tt><BR>
<tt>TSVECTOR > TSVECTOR</tt>
</dt><dd>
All btree operations are defined for the <tt>tsvector</tt> type. <tt>tsvector</tt> values
are compared with each other in lexicographical order.
</dd>
</dl>

<h3><a name="qo">Query Operations</a></h3>

<dl>
<dt>
 <tt>to_tsquery( <em>[</em><i>configuration</i>,<em>]</em>
 <i>querytext</i> text) RETURNS TSQUERY</tt>
</dt>
<dd>
 Parses a query,
 which should consist of single words separated by the boolean operators
 "<tt>&amp;</tt>"&nbsp;(and),
 "<tt>|</tt>"&nbsp;(or),
 and&nbsp;"<tt>!</tt>"&nbsp;(not),
 which can be grouped using parentheses.
 Each word is reduced to a lexeme using the current
 or specified configuration. 
 A weight class can be assigned to each lexeme entry
 to restrict the search region
 (see <tt>setweight</tt> for an explanation), for example 
 "<tt>fat:a &amp; rats</tt>".
</dd><dt>
 <tt>plainto_tsquery( <em>[</em><i>configuration</i>,<em>]</em>
 <i>querytext</i> text) RETURNS TSQUERY</tt>
</dt>
<dd>
Transforms unformatted text to a tsquery. It is the same as to_tsquery, 
but assumes the "<tt>&amp;</tt>" boolean operator between words and doesn't 
recognize weight classes.
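 For example, with a configuration that treats 'the' as a stop word:
 <pre>
=# select plainto_tsquery('default', 'The Fat Rats');
 plainto_tsquery
-----------------
 'fat' &amp; 'rat'
</pre>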
</dd><dt>

 <tt>querytree(<i>query</i> TSQUERY) RETURNS text</tt>
</dt><dd>
This returns the part of the query which is actually used in searching with a GiST index.
</dd><dt>
 <tt><i>text</i>::TSQUERY RETURNS TSQUERY</tt>
</dt><dd>
 Casting text directly to a <tt>tsquery</tt>
 lets you inject lexemes into a query,
 with whatever positions and position weight flags you choose to specify.
 The <tt><i>text</i></tt> should be formatted
 as the query would be printed by the output of a <tt>SELECT</tt>.
 See the <a href="http://www.sai.msu.su/%7Emegera/postgres/gist/tsearch/V2/docs/tsearch2-guide.html#casting">Casting</a>
 section in the Guide for details.
</dd>
<dt>
 <tt>numnode(<i>query</i> TSQUERY) RETURNS INTEGER</tt>
</dt><dd>
This returns the number of nodes in the query tree.
</dd><dt>
 <tt>TSQUERY && TSQUERY RETURNS TSQUERY</tt>
</dt><dd>
AND-ed TSQUERY
</dd><dt>
 <tt>TSQUERY || TSQUERY RETURNS TSQUERY</tt>
</dt> <dd>
 OR-ed TSQUERY
</dd><dt>
 <tt>!! TSQUERY RETURNS TSQUERY</tt>
</dt> <dd>
 negation of TSQUERY
</dd>
<dt>
<tt>TSQUERY &lt; TSQUERY</tt><BR>
<tt>TSQUERY &lt;= TSQUERY</tt><BR>
<tt>TSQUERY = TSQUERY</tt><BR>
<tt>TSQUERY >= TSQUERY</tt><BR>
<tt>TSQUERY > TSQUERY</tt>
</dt><dd>
All btree operations are defined for the <tt>tsquery</tt> type. <tt>tsquery</tt> values
are compared with each other in lexicographical order.
</dd>
</dl>

<h3>Query rewriting</h3>
Query rewriting is a set of functions and operators for the tsquery type. 
It allows you to control searches at query time without reindexing (in contrast to the
thesaurus dictionary), for example, to expand a search using synonyms (new york, big apple, nyc, gotham).
<P>
The <tt><b>rewrite()</b></tt> function changes the original <i>query</i> by replacing <i>target</i> with <i>sample</i>.
There are three ways to use the <tt>rewrite()</tt> function. Notice that the arguments of the <tt>rewrite()</tt> 
function can be column names of type <tt>tsquery</tt>.
<pre>
create table rw (q TSQUERY, t TSQUERY, s TSQUERY);
insert into rw values('a & b','a', 'c');
</pre>
<dl>
<dt> <tt>rewrite (<i>query</i> TSQUERY, <i>target</i> TSQUERY, <i>sample</i> TSQUERY) RETURNS TSQUERY</tt>
</dt>
<dd>
<pre>
=# select rewrite('a & b'::TSQUERY, 'a'::TSQUERY, 'c'::TSQUERY);
  rewrite
  -----------
   'c' & 'b'
</pre>
</dd>
<dt> <tt>rewrite (ARRAY[<i>query</i> TSQUERY, <i>target</i> TSQUERY, <i>sample</i> TSQUERY])  RETURNS TSQUERY</tt>
</dt>
<dd>
<pre>
=# select rewrite(ARRAY['a & b'::TSQUERY, t,s]) from rw;
  rewrite
  -----------
   'c' & 'b'
</pre>
</dd>
<dt> <tt>rewrite (<i>query</i> TSQUERY, 'select <i>target</i>, <i>sample</i> from test'::text)  RETURNS TSQUERY</tt>
</dt>
<dd>
<pre>
=# select rewrite('a & b'::TSQUERY, 'select t,s from rw'::text);
  rewrite
  -----------
   'c' & 'b'
</pre>
</dd>
</dl>
Two operators are defined for the <tt>tsquery</tt> type:
<dl>
<dt><tt>TSQUERY @ TSQUERY</tt></dt>
<dd>
  Returns <tt>TRUE</tt> if the right argument might be contained in the left argument.
 </dd>
 <dt><tt>TSQUERY ~ TSQUERY</tt></dt>
 <dd> 
  Returns <tt>TRUE</tt> if the left argument might be contained in the right argument.
 </dd>
</dl>                    
To speed up these operators one can use a GiST index with the <tt>gist_tp_tsquery_ops</tt> opclass.
<pre>
create index qq on test_tsquery using gist (keyword gist_tp_tsquery_ops);
</pre>

<h2><a name="fts">Full Text Search operator</a></h2>

<dl><dt>
<tt>TSQUERY @@ TSVECTOR</tt><br>
<tt>TSVECTOR @@ TSQUERY</tt>
</dt>
<dd>
Returns <tt>TRUE</tt> if the <tt>TSQUERY</tt> is contained in the <tt>TSVECTOR</tt>, and 
<tt>FALSE</tt> otherwise.
<pre>
=# select 'cat & rat':: tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 ?column?
 ----------
  t
=# select 'fat & cow':: tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 ?column?
 ----------
  f
</pre>
</dd>
</dl>

<h2><a name="configurations">Configurations</a></h2>

A configuration specifies all of the equipment necessary
to transform a document into a <tt>tsvector</tt>:
the parser that breaks its text into tokens,
and the dictionaries which then transform each token into a lexeme.
Every call to <tt>to_tsvector()</tt> or <tt>to_tsquery()</tt> (described above)
uses a configuration to perform its processing.
Three configurations come with tsearch2:

<ul>
<li><b>default</b> -- Indexes words and numbers,
 using the <i>en_stem</i> English Snowball stemmer for Latin-alphabet words
 and the <i>simple</i> dictionary for all others.
</li><li><b>default_russian</b> -- Indexes words and numbers,
 using the <i>en_stem</i> English Snowball stemmer for Latin-alphabet words
 and the <i>ru_stem</i> Russian Snowball dictionary for all others. It's the default
 for the <tt>ru_RU.KOI8-R</tt> locale.
</li><li><b>utf8_russian</b> -- the same as <b>default_russian</b> but 
for <tt>ru_RU.UTF-8</tt> locale.
</li><li><b>simple</b> -- Processes both words and numbers
 with the <i>simple</i> dictionary,
 which neither discards any stop words nor alters them.
</li></ul>

The tsearch2 module initially chooses your current configuration
by looking for your current locale in the <tt>locale</tt> field
of the <tt>pg_ts_cfg</tt> table described below.
You can manipulate the current configuration yourself with these functions:

<dl><dt>
 <tt>set_curcfg( <i>id</i> INT <em>|</em> <i>ts_name</i> TEXT
  ) RETURNS VOID</tt>
</dt><dd>
 Set the current configuration used by <tt>to_tsvector</tt>
 and <tt>to_tsquery</tt>.
</dd><dt>
 <tt>show_curcfg() RETURNS INT4</tt>
</dt><dd>
 Returns the integer <tt>id</tt> of the current configuration.
</dd></dl>
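<p>
For example (the <tt>id</tt> returned depends on the contents of your
<tt>pg_ts_cfg</tt> table):
</p>
<pre>
=# select set_curcfg('default');
=# select show_curcfg();
 show_curcfg
-------------
           1
</pre>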

<p>
Each configuration is defined by a record in the <tt>pg_ts_cfg</tt> table:

</p><pre>create table pg_ts_cfg (
	id		int not  null primary key,
	ts_name		text not null,
	prs_name	text not null,
	locale		text
);</pre>

The <tt>id</tt> and <tt>ts_name</tt> are unique values
which identify the configuration;
the <tt>prs_name</tt> specifies which parser the configuration uses.
Once this parser has split document text into tokens,
the type of each resulting token --
or, more specifically, the type's <tt>tok_alias</tt>
or, more specifically, the type's <tt>tok_alias</tt>
as specified in the parser's <tt>token_type()</tt> table --
is searched for together with the configuration's <tt>ts_name</tt>
in the <tt>pg_ts_cfgmap</tt> table:

<pre>create table pg_ts_cfgmap (
	ts_name		text not null,
	tok_alias	text not null,
	dict_name	text[],
	primary key (ts_name,tok_alias)
);</pre>

Those tokens whose types are not listed are discarded.
The remaining tokens are assigned integer positions,
starting with 1 for the first token in the document,
and turned into lexemes with the help of the dictionaries
whose names are given in the <tt>dict_name</tt> array for their type.
These dictionaries are tried in order,
stopping either with the first one to return a lexeme for the token,
or discarding the token if no dictionary returns a lexeme for it.
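<p>
For example, you can inspect or change the dictionary list for a token type
directly in <tt>pg_ts_cfgmap</tt> (a sketch; <tt>en_ispell</tt> is the example
dictionary defined in the Dictionaries section below):
</p>
<pre>
=# select * from pg_ts_cfgmap where ts_name = 'default' and tok_alias = 'lword';
 ts_name | tok_alias | dict_name
---------+-----------+-----------
 default | lword     | {en_stem}

=# update pg_ts_cfgmap set dict_name = '{en_ispell,en_stem}'
   where ts_name = 'default' and tok_alias = 'lword';
</pre>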

<h2><a name="testing">Testing</a></h2>

The <tt>ts_debug</tt> function allows easy testing of your <b>current</b> configuration. 
You can always test another configuration using the <tt>set_curcfg</tt> function.
<p>
Example:
</p><pre>apod=# select * from ts_debug('Tsearch module for PostgreSQL 7.3.3');
 ts_name | tok_type | description |   token    | dict_name |  tsvector    
---------+----------+-------------+------------+-----------+--------------
 default | lword    | Latin word  | Tsearch    | {en_stem} | 'tsearch'
 default | lword    | Latin word  | module     | {en_stem} | 'modul'
 default | lword    | Latin word  | for        | {en_stem} | 
 default | lword    | Latin word  | PostgreSQL | {en_stem} | 'postgresql'
 default | version  | VERSION     | 7.3.3      | {simple}  | '7.3.3'
</pre>
Here: 
<br>
<ul>
<li>ts_name - configuration name 
</li><li>tok_type - token type 
</li><li>description - human readable name of tok_type 
</li><li>token - parser's token 
</li><li>dict_name - dictionary used for the token 
</li><li>tsvector - final result</li>
</ul>


<h2><a name="parsers">Parsers</a></h2>

Each parser is defined by a record in the <tt>pg_ts_parser</tt> table:

<pre>create table pg_ts_parser (
	prs_name	text not null,
	prs_start	regprocedure not null,
	prs_nexttoken	regprocedure not null,
	prs_end		regprocedure not null,
	prs_headline	regprocedure not null,
	prs_lextype	regprocedure not null,
	prs_comment	text
);</pre>

The <tt>prs_name</tt> uniquely identifies the parser,
while <tt>prs_comment</tt> usually describes its name and version
for the reference of users.
The other items identify the low-level functions
which make the parser operate,
and are only of interest to someone writing a parser of their own.
<p>
The tsearch2 module comes with one parser named <tt>default</tt>
which is suitable for parsing most plain text and HTML documents.
</p><p>
Each <tt><i>parser</i></tt> argument below
must designate a parser with <tt><i>prs_name</i></tt>;
the current parser is used when this argument is omitted.

</p><dl><dt>
 <tt>CREATE FUNCTION set_curprs(<i>parser</i>) RETURNS VOID</tt>
</dt><dd>
 Selects a current parser
 which will be used when any of the following functions
 are called without a parser as an argument.
</dd><dt>
 <tt>CREATE FUNCTION token_type(
  <em>[</em> <i>parser</i> <em>]</em>
  ) RETURNS SETOF tokentype</tt>
</dt><dd>
 Returns a table which defines and describes
 each kind of token the parser may produce as output.
 For each token type the table gives the <tt>tokid</tt>
 with which the parser will label each token of that type,
 the <tt>alias</tt> which names the token type,
 and a short description <tt>descr</tt> for the user to read.
</dd><dt>
 <tt>CREATE FUNCTION parse(
  <em>[</em> <i>parser</i>, <em>]</em> <i>document</i> TEXT
  ) RETURNS SETOF tokenout</tt>
</dt><dd>
 Parses the given document and returns a series of records,
 one for each token produced by parsing.
 Each token includes a <tt>tokid</tt> giving its type
 and a <tt>lexem</tt> which gives its content.
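 For example (illustrative; the token ids shown are those reported by
 <tt>token_type()</tt> for the <tt>default</tt> parser):
 <pre>
=# select * from parse('default', 'Fat rats');
 tokid | lexem
-------+-------
     1 | Fat
    12 |
     1 | rats
</pre>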
</dd></dl>

<h2><a name="dictionaries">Dictionaries</a></h2>

A dictionary is a program which accepts a lexeme, usually one produced by a parser, 
on input and returns:
<ul>
<li>an array of lexemes, if the input lexeme is known to the dictionary
<li>an empty array, if the dictionary knows the lexeme but it is a stop word
<li>NULL, if the dictionary doesn't recognize the input lexeme
</ul>
Usually, dictionaries are used for normalization of words (ispell, stemmer dictionaries),
but see, for example, the <tt>intdict</tt> dictionary (available from the 
<a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/">Tsearch2</a> home page), 
which controls indexing of integers.

<P>
Among the dictionaries which come installed with tsearch2 are:

<ul>
<li><b>simple</b> simply folds uppercase letters to lowercase
 before returning the word.
</li>
<li><b>ispell_template</b> - template for ispell dictionaries.
</li>
<li><b>en_stem</b> runs an English Snowball stemmer on each word
 that attempts to reduce the various forms of a verb or noun
 to a single recognizable form.
</li><li><b>ru_stem_koi8</b>, <b>ru_stem_utf8</b> runs a Russian Snowball stemmer on each word.
</li>
<li><b>synonym</b> - simple lexeme-to-lexeme replacement
</li>
<li><b>thesaurus_template</b> - template for <a href="#tz">thesaurus dictionary</a>. It's 
phrase-to-phrase replacement
</li>
</ul>

<P>
Each dictionary is defined by an entry in the <tt>pg_ts_dict</tt> table:

<pre>CREATE TABLE pg_ts_dict (
	dict_name	text not null,
	dict_init	regprocedure,
	dict_initoption	text,
	dict_lexize	regprocedure not null,
	dict_comment	text
);</pre>

The <tt>dict_name</tt>
serves as a unique identifier for the dictionary.
The meaning of the <tt>dict_initoption</tt> varies among dictionaries,
but for the built-in Snowball dictionaries
it specifies a file from which stop words should be read.
The <tt>dict_comment</tt> is a human-readable description of the dictionary.
The other fields are internal function identifiers
useful only to developers trying to implement their own dictionaries.

<blockquote>
<b>WARNING:</b> Data files used by dictionaries should be in the <tt>server_encoding</tt> to 
avoid possible problems!
</blockquote>

<p>
The argument named <tt><i>dictionary</i></tt>
in each of the following functions
should be a <tt>dict_name</tt>
identifying which dictionary should be used for the operation;
if omitted, the current dictionary is used.

</p><dl><dt>
 <tt>CREATE FUNCTION set_curdict(<i>dictionary</i>) RETURNS VOID</tt>
</dt><dd>
 Selects a current dictionary for use by functions
 that do not select a dictionary explicitly.
</dd><dt>
 <tt>CREATE FUNCTION lexize(
 <em>[</em> <i>dictionary</i>, <em>]</em> <i>word</i> text)
 RETURNS TEXT[]</tt>
</dt><dd>
 Reduces a single word to a lexeme.
 Note that the result is an array of zero or more strings,
 since in some languages there might be several base words
 from which an inflected form could arise.
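 For example (assuming 'a' appears in <tt>en_stem</tt>'s stop-word file):
 <pre>
=# select lexize('en_stem', 'stars');
 lexize
--------
 {star}

=# select lexize('en_stem', 'a');
 lexize
--------
 {}
</pre>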
</dd></dl>

<h3>Using dictionary templates</h3>
Templates are used to define new dictionaries, for example:
<pre>
INSERT INTO pg_ts_dict
               (SELECT 'en_ispell', dict_init,
                       'DictFile="/usr/local/share/dicts/ispell/english.dict",'
                       'AffFile="/usr/local/share/dicts/ispell/english.aff",'
                       'StopFile="/usr/local/share/dicts/english.stop"',
                       dict_lexize
               FROM pg_ts_dict
               WHERE dict_name = 'ispell_template');
</pre>

<h3>Working with stop words</h3>
Ispell and Snowball stemmers treat stop words differently:
<ul>
<li>ispell - normalizes the word and then looks up the normalized form in the stop-word file
<li>Snowball stemmer - first looks up the word in the stop-word file and then does its job. 
The reason is to minimize possible 'noise'.
</ul>

<h2><a name="ranking">Ranking</a></h2>

Ranking attempts to measure how relevant documents are to particular queries
by inspecting the number of times each search word appears in the document,
and whether different search terms occur near each other.
Note that this information is only available in unstripped vectors --
ranking functions will only return a useful result
for a <tt>tsvector</tt> which still has position information!
<p>
Notice that the supplied ranking functions are just examples and 
don't belong to the tsearch2 core; you can
write your own ranking function and/or combine additional
factors to fit your specific interest.
</p>

The two ranking functions currently available are:

<dl><dt>
 <tt>CREATE FUNCTION rank(<br>
  <em>[</em> <i>weights</i> float4[], <em>]</em>
  <i>vector</i> TSVECTOR, <i>query</i> TSQUERY,
  <em>[</em> <i>normalization</i> int4 <em>]</em><br>
  ) RETURNS float4</tt>
</dt><dd>
 This is the ranking function from the old version of OpenFTS,
 and offers the ability to weight word instances more heavily
 depending on how you have classified them.
 The <i>weights</i> specify how heavily to weight each category of word:
 <pre>{<i>D-weight</i>, <i>C-weight</i>, <i>B-weight</i>, <i>A-weight</i>}</pre>
 If no weights are provided, then these defaults are used:
 <pre>{0.1, 0.2, 0.4, 1.0}</pre>
 Often weights are used to mark words from special areas of the document,
 like the title or an initial abstract,
 and make them more or less important than words in the document body.
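 A usage sketch, assuming the <tt>apod</tt> table from the examples above,
 with a <tt>tsvector</tt> column <tt>fts</tt> and a <tt>title</tt> column:
 <pre>
=# select title, rank(fts, q) as rank
   from apod, to_tsquery('supernovae &amp; stars') q
   where fts @@ q order by rank desc limit 5;
</pre>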
</dd><dt>
 <tt>CREATE FUNCTION rank_cd(<br>
  <em>[</em> <i>weights</i> float4[], <em>]</em>
  <i>vector</i> TSVECTOR, <i>query</i> TSQUERY,
  <em>[</em> <i>normalization</i> int4 <em>]</em><br>
  ) RETURNS float4</tt>
</dt><dd>
 This function computes the cover density ranking
 for the given document <i>vector</i> and <i>query</i>,
 as described in Clarke, Cormack, and Tudhope's
 "<a href="http://citeseer.nj.nec.com/clarke00relevance.html">Relevance Ranking for One to Three Term Queries</a>"
 in <i>Information Processing and Management</i>, 1999.
</dd>
<dt>
 <tt>CREATE FUNCTION get_covers(vector TSVECTOR, query TSQUERY) RETURNS text</tt>
 </dt>
 <dd>
 Returns <tt>extents</tt>, which are the shortest, non-nested sequences of words satisfying the query. 
 Extents (covers) are used in the <tt>rank_cd</tt> algorithm for fast calculation of proximity ranking.
 In the example below there are two extents - <tt><b>{1</b>...<b>}1</b> and <b>{2</b> ...<b>}2</b></tt>.
 <pre>
=# select get_covers('1:1,2,10 2:4'::tsvector,'1&amp; 2');
get_covers
----------------------
1 {1 1 {2 2 }1 1 }2
</pre>
 </dd>

</dl>

<p>
Both of these (<tt>rank(), rank_cd()</tt>) ranking functions
take an integer <i>normalization</i> option
that specifies whether a document's length should impact its rank.
This is often desirable,
since a hundred-word document with five instances of a search word
is probably more relevant than a thousand-word document with five instances.
The option can take the following values, which can be combined using "|" (e.g. 2|4) to
take several factors into account, as in the example after this list:

</p>
<ul>
<li><tt>0</tt> (the default) ignores document length.</li>
<li><tt>1</tt> divides the rank by 1 + the logarithm of the document length.</li>
<li><tt>2</tt> divides the rank by the length itself.</li>
<li><tt>4</tt> divides the rank by the mean harmonic distance between extents.</li>
<li><tt>8</tt> divides the rank by the number of unique words in the document.</li>
<li><tt>16</tt> divides the rank by 1 + the logarithm of the number of unique words in the document.
</li>
</ul>
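<p>
For example, to normalize by both document length and the mean harmonic
distance between extents (again assuming the <tt>apod</tt> table with a
<tt>tsvector</tt> column <tt>fts</tt>):
</p>
<pre>
=# select title, rank_cd(fts, q, 2|4) as rank
   from apod, to_tsquery('supernovae &amp; stars') q
   where fts @@ q order by rank desc limit 5;
</pre>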

<h2><a name="headlines">Headlines</a></h2>

<dl><dt>
 <tt>CREATE FUNCTION headline(<br>
  <em>[</em> <i>id</i> int4, <em>|</em> <i>ts_name</i> text, <em>]</em>
  <i>document</i> text, <i>query</i> TSQUERY,
  <em>[</em> <i>options</i> text <em>]</em><br>
  ) RETURNS text</tt>
</dt><dd>
 Every form of the <tt>headline()</tt> function
 accepts a <tt>document</tt> along with a <tt>query</tt>,
 and returns one or more ellipsis-separated excerpts from the document
 in which terms from the query are highlighted.
 The configuration with which to parse the document
 can be specified by either its <i>id</i> or <i>ts_name</i>;
 if none is specified then the current configuration is used instead.
 <p>
 An <i>options</i> string, if provided, should be a comma-separated list
 of one or more '<i>option</i><tt>=</tt><i>value</i>' pairs.
 The available options are:
 </p><ul>
  <li><tt>StartSel</tt>, <tt>StopSel</tt> --
   the strings with which query words appearing in the document
   should be delimited to distinguish them from other excerpted words.
  </li><li><tt>MaxWords</tt>, <tt>MinWords</tt> --
   limits on the shortest and longest headlines you will accept.
  </li><li><tt>ShortWord</tt> --
   this prevents your headline from beginning or ending
   with a word which has this many characters or less.
   The default value of <tt>3</tt> should eliminate most English
   conjunctions and articles.
  </li><li><tt>HighlightAll</tt> -- 
   boolean flag; if TRUE, then the whole document will be highlighted.
 </li></ul>
 Any unspecified options receive these defaults:
 <pre>StartSel=&lt;b&gt;, StopSel=&lt;/b&gt;, MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE
 </pre>
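 For example, with the default delimiters (illustrative output):
 <pre>
=# select headline('default',
     'a fat cat sat on a mat and ate a fat rat',
     to_tsquery('fat &amp; rat'));
                            headline
-----------------------------------------------------------------
 a &lt;b&gt;fat&lt;/b&gt; cat sat on a mat and ate a &lt;b&gt;fat&lt;/b&gt; &lt;b&gt;rat&lt;/b&gt;
</pre>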
</dd></dl>


<h2><a name="indexes">Indexes</a></h2>
Tsearch2 supports indexed access to tsvector columns to further speed up FTS. Notice that indexes are not mandatory for FTS!
<ul>
<li> RD-Tree (Russian Doll Tree, matryoshka), based on GiST (Generalized Search Tree) 
<pre>    
    =# create index fts_idx on apod using gist(fts);
</pre>    
<li>GIN - Generalized Inverted Index 
<pre>       
        =# create index fts_idx on apod using gin(fts);
</pre>
</ul>  
A <b>GiST</b> index is very good for online updates, but is not as scalable as a <b>GIN</b> index,
which, in turn, isn't good for updates. Both indexes support concurrency and recovery.

<h2><a name="tz">Thesaurus dictionary</a></h2>

<P>
A thesaurus is a collection of words with information about the relationships of words and phrases, 
i.e., broader terms (BT), narrower terms (NT), preferred terms, non-preferred terms, related terms, etc.</p>
<p>Basically, a thesaurus dictionary replaces all non-preferred terms by one preferred term and, optionally, 
preserves them for indexing. The thesaurus is used during indexing, so any change to the thesaurus requires reindexing.
Tsearch2's <tt>thesaurus</tt> dictionary (TZ) is an extension of the <tt>synonym</tt> dictionary 
with <b>phrase</b> support. A thesaurus is a plain file of the following format: 
<pre>
# this is a comment 
sample word(s) : indexed word(s)
...............................
</pre>
<ul>
<li>The <strong>colon</strong> (:) symbol is used as a delimiter.</li>
<li>Use an asterisk (<b>*</b>) at the beginning of an <tt>indexed word</tt> to skip the subdictionary.
It is still required that <tt>sample words</tt> be known.</li>
<li>The thesaurus dictionary looks for the longest match.</li></ul>
<P>
TZ uses a <strong>subdictionary</strong> (which should be defined in the tsearch2 configuration) 
to normalize the thesaurus text. It is possible to define only <strong>one dictionary</strong>. 
Notice that the subdictionary produces an error if it cannot recognize a word. 
In that case, you should remove the definition line containing this word or teach the subdictionary to recognize it. 
</p>
<p>Stop words recognized by the subdictionary are replaced by a 'stop-word placeholder', i.e., 
only their position is important.
To break possible ties the thesaurus applies the last definition. For example, consider 
thesaurus (with a simple subdictionary) rules with the pattern 'swsw' 
('s' designates a stop word and 'w' a known word): </p>
<pre>
a one the two : swsw
the one a two : swsw2
</pre>
<p>The words 'a' and 'the' are stop words defined in the configuration of the subdictionary. 
The thesaurus considers the texts 'the one the two' and 'that one then two' as equal and will use the definition 
'swsw2'.</p>
<p>As with a normal dictionary, a thesaurus should be assigned to specific lexeme types. 
Since TZ has the capability to recognize phrases, it must remember its state and interact with the parser. 
TZ uses these assignments to check whether it should handle the next word or stop accumulating. 
The compiler of a TZ should take care to configure it properly to avoid confusion. 
For example, if TZ is assigned to handle only the <tt>lword</tt> lexeme type, then a TZ definition like 
'one 1 : 11' will not work, since the lexeme type <tt>digit</tt> isn't assigned to the TZ.</p>

<h3>Configuration</h3>

<p>tsearch2 comes with a thesaurus template, which can be used to define a new dictionary: </p>
<pre class="real">INSERT INTO pg_ts_dict
               (SELECT 'tz_simple', dict_init,
                        'DictFile="/path/to/tz_simple.txt",'
                        'Dictionary="en_stem"',
                       dict_lexize
                FROM pg_ts_dict
                WHERE dict_name = 'thesaurus_template');

</pre>
<p>Here: </p>
<ul>
<li><tt>tz_simple</tt> - is the dictionary name</li>
<li><tt>DictFile="/path/to/tz_simple.txt"</tt> - is the location of thesaurus file</li>
<li><tt>Dictionary="en_stem"</tt> defines dictionary (snowball english stemmer) to use for thesaurus normalization. Notice, that <em>en_stem</em> dictionary has it's own configuration (stop-words, for example).</li>
</ul>
<p>Now it's possible to use <tt>tz_simple</tt> in pg_ts_cfgmap, for example: </p>
<pre>
update pg_ts_cfgmap set dict_name='{tz_simple,en_stem}' where ts_name = 'default_russian' and 
tok_alias in ('lhword', 'lword', 'lpart_hword');
</pre>
<h3>Examples</h3>
<p>tz_simple: </p>
<pre>
one : 1
two : 2
one two : 12
the one : 1
one 1 : 11
</pre>
<p>To see how the thesaurus works, one can use the <tt>to_tsvector</tt>, <tt>to_tsquery</tt> or <tt>plainto_tsquery</tt> functions: </p><pre class="real">=# select plainto_tsquery('default_russian',' one day is oneday');
    plainto_tsquery
------------------------
 '1' &amp; 'day' &amp; 'oneday'

=# select plainto_tsquery('default_russian','one two day is oneday');
     plainto_tsquery
-------------------------
 '12' &amp; 'day' &amp; 'oneday'

=# select plainto_tsquery('default_russian','the one');
NOTICE:  Thesaurus: word 'the' is recognized as stop-word, assign any stop-word (rule 3)
 plainto_tsquery
-----------------
 '1'
</pre>

Additional information about thesaurus dictionary is available from
<a href="http://www.sai.msu.su/~megera/wiki/Thesaurus_dictionary">Wiki</a> page.
</body></html>