aboutsummaryrefslogtreecommitdiff
path: root/contrib/tsearch2/docs/tsearch2-guide.html
blob: d2d764580c7b554c5f0f5ebc124df1903cf5a763 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>tsearch2 guide</title>
</head>
<body>
<h1 align=center>The tsearch2 Guide</h1>

<p align=center>
Brandon Craig Rhodes<br>30 June 2003
<br>Updated to 8.2 release by Oleg Bartunov, October 2006</br>
<p>
This Guide introduces the reader to the PostgreSQL tsearch2 module,
version&nbsp;2.
More formal descriptions of the module's types and functions
are provided in the <a href="tsearch2-ref.html">tsearch2 Reference</a>,
which is a companion to this document.
<p>
First we will examine the <tt>tsvector</tt> and <tt>tsquery</tt> types
and how they are used to search documents;
next, we will use them to build a simple search engine in&nbsp;SQL;
and finally, we will study the internals of document conversion
and how you might tune the internals to accommodate various searching needs.
<p>
Once you have tsearch2 working with PostgreSQL,
you should be able to run the examples here exactly as they are typed.
<p>
<hr>
<h2>Table of Contents</h2>
<blockquote>
<a href="#intro">Introduction to FTS with tsearch2</a><br>
<a href="#vectors_queries">Vectors and Queries</a><br>
<a href="#simple_search">A Simple Search Engine</a><br>
<a href="#weights">Ranking and Position Weights</a><br>
<a href="#casting">Casting Vectors and Queries</a><br>
<a href="#parsing_lexing">Parsing and Lexing</a><br>
<a href="#ref">Additional information</a>
</blockquote>

<hr>


<h2><a name="intro">Introduction to FTS with tsearch2</a></h2>
The purpose of FTS is to
find <b>documents</b>, which satisfy <b>query</b> and optionally return 
them in some <b>order</b>. 
Most common case: Find documents containing all query terms and return them in order 
of their similarity to the query. Document in database can be 
any text attribute, or combination of text attributes from one or many tables
(using joins).
Text search operators existed for years, in PostgreSQL they are
<tt><b>~,~*, LIKE, ILIKE</b></tt>, but they lack linguistic support,
tends to be slow and have no relevance ranking. The idea behind tsearch2 is 
is rather simple - preprocess document at index time to save time at search stage.
Preprocessing includes
<ul>
<li>document parsing onto words
<li>linguistic - normalize words to obtain lexemes
<li>store document in optimized for searching way
</ul>
Tsearch2, in a nutshell, provides FTS operator (contains) for two new data types, 
which represent document and query - <tt>tsquery  @@ tsvector</tt>.

<P>
<h2><a name=vectors_queries>Vectors and Queries</a></h2>

<blockquote>
<i>This section introduces
the two data types upon which tsearch2 search engines are based,
and illustrates their interaction using the simplest possible case.
The complex examples we present later on
are merely variations and elaborations of this basic mechanism.</i>
</blockquote>
<p>
The tsearch2 module allows you to index documents by the words they contain,
and then perform very efficient searches
for documents that contain a given combination of words.
Preparing your document index involves two steps:
<ul>
<li><b>Making a list of the words each document contains.</b>
 You must reduce each document to a <tt>tsvector</tt>
 which lists each word that appears in the document.
 This process offers many options,
 because there is no requirement
 that you must copy words into the vector
 exactly as they appear in the document.
 For example,
 many developers omit frequent and content-free <b>stop words</b>
 like <i>the</i> to reduce the size of their index;
 others reduce different forms of the same word
 (<i>forked</i>, <i>forking</i>, <i>forks</i>)
 to a common form (<i>fork</i>)
 to make search results independent of tense and case.
 Because words are very often stored in a modified form,
 we use the special term <b>lexemes</b>
 for the word forms we actually store in the vector.
<li><b>Creating an index of the documents by lexeme.</b>
 This is managed automatically by tsearch2
 when you creat a <tt>gist()</tt> index
 on the <tt>tsvector</tt> column of a table,
 which implements a form of the Berkeley
 <a href="http://gist.cs.berkeley.edu/"><i>Generalized Search Tree</i></a>.
 Since PostgreSQL 8.2 tsearch2 supports <a href="http://www.sigaev.ru/gin/">Gin</a> index,
 which is an inverted index, commonly used in search engines. It adds scalability to tsearch2.
</ul>
Once your documents are indexed,
performing a search involves:
<ul>
<li><b>Reducing the search terms to lexemes.</b>
 You must express each search you want to perform
 as a <tt>tsquery</tt> specifying a boolean combination of lexemes.
 Note that tsearch2 only finds <i>exact</i> matches
 between the lexemes in your query and the ones in each vector &mdash;
 even capitalization counts as a difference
 (which is why all lexemes are usually kept lowercase).
 So you must process search words the same way you processed document words;
 if <i>forking</i> became <i>fork</i> in the document's <tt>tsvector</tt>,
 then the search term <i>forking</i> must also become <i>fork</i>
 or the search will not find the document.
<li><b>Retrieving the documents that match the query.</b>
 Running a <tt>SELECT</tt> ... <tt>WHERE</tt>
 <tt><i>query</i></tt> <tt>@@</tt> <tt><i>vector</i></tt>
 on the table with the <tt><i>vector</i></tt> column
 will return the documents that match your query.
<li><b>Presenting your results.</b>
 This final stage offers as many options
 as turning documents into vectors.
 You can order documents by how well they matched the search terms;
 create a headline for each document
 showing some of the phrases in which it uses the search terms;
 and restrict the number of results retrieved.
 You will of course want some way to identify each document,
 so the user can ask for the full text of the ones he wants to read.
</ul>
And beyond deciding upon rules for turning documents into vectors
and for presenting search results to users,
you have to decide <i>where</i> to perform these operations &mdash;
whether one database server
will parse documents, perform searches, and prepare search results,
or whether to spread the load of these operations across several machines.
These are complicated design issues
which we will explore later;
in this section and the next,
we will illustrate what can be accomplished
using a single database server.
<p>
The <tt>default</tt> tsearch2 configuration,
which we will learn more about later,
provides a good example of a process for reducing documents to vectors:

<pre>
=# <b>SELECT set_curcfg('default')</b>
=# <b>SELECT to_tsvector('The air smells of sea water.')</b>
             to_tsvector             
-------------------------------------
 'air':2 'sea':5 'smell':3 'water':6
(1 row)
</pre>

Note the complex relationship between this document and its vector.
The vector lists only words from the document &mdash;
spaces and punctuation have disappeared.
Common words like <i>the</i> and <i>of</i> have been eliminated.
The&nbsp;<i>-s</i> that makes <i>smells</i> a plural has been removed,
leaving a lexeme that represents the word in its simplest form.
And finally,
though the vector remembers the positions in which each word appeared,
it does not store the lexemes in that order.
<p>
Keeping word positions in your vectors is optional, by the way.
The positions are necessary for the tsearch2 ranking functions,
which you can use to prioritize documents
based on how often each document uses the search terms
and whether they appear in close proximity.
But if you do not perform ranking,
or use your own process that ignores the word positions stored in the vector,
then you can save space by stripping them from your vectors:

<pre>
=# <b>SELECT strip(to_tsvector('The air smells of sea water.'))</b>
            strip            
-----------------------------
 'air' 'sea' 'smell' 'water'
(1 row)
</pre>

Now that we have a procedure for creating vectors,
we can build an indexed table of vectors very simply:

<pre>
=# <b>CREATE TABLE vectors ( vector tsvector )</b>
=# <b>CREATE INDEX vector_index ON vectors USING gist(vector)</b>
=# <b>INSERT INTO vectors VALUES (to_tsvector('The path forks here'))</b>
=# <b>INSERT INTO vectors VALUES (to_tsvector('A crawl leads west'))</b>
=# <b>INSERT INTO vectors VALUES (to_tsvector('The left fork leads northeast'))</b>
=# <b>SELECT * FROM vectors</b>
                  vector                  
------------------------------------------
 'fork':3 'path':2
 'lead':3 'west':4 'crawl':2
 'fork':3 'lead':4 'left':2 'northeast':5
(3 rows)
</pre>

Now we can search this collection of document vectors
using the <tt>@@</tt> operator and a <tt>tsquery</tt>
that specifies the combination of lexemes we are looking for.
Note that while vectors simply list lexemes,
queries always combine them with the operators
&lsquo;<tt>&amp;</tt>&rsquo;&nbsp;and,
&lsquo;<tt>|</tt>&rsquo;&nbsp;or,
and &nbsp;&lsquo;<tt>!</tt>&rsquo;&nbsp;not,
plus parentheses for grouping.
Some examples of the query syntax:
<table align=center>
<tr>
 <td>&lsquo;find documents with the word <i>forks</i> in them&rsquo;<br>
 <td><tt>'forks'</tt>
<tr>
 <td>&lsquo;... with both <i>forks</i> and <i>leads</i>&rsquo;<br>
 <td><tt>'forks & leads'</tt>
<tr>
 <td>&lsquo;... with either <i>forks</i> or <i>leads</i>&rsquo;<br>
 <td><tt>'forks | leads'</tt>
<tr>
 <td>&lsquo;... with either <i>forks</i> or <i>leads</i>,
  but without <i>crawl</i>&rsquo;<br>
 <td><tt>'(forks|leads) & !crawl'</tt>
</table>
The tsearch2 module
provides a <tt>to_tsquery()</tt> function for creating queries
that uses the same process as <tt>to_tsvector()</tt> uses
to reduce words to lexemes.
For instance,
it will remove the&nbsp;<i>-s</i> from the plurals in the last example above:

<pre>
=# <b>SELECT to_tsquery('(leads|forks) & !crawl')</b>
           to_tsquery           
--------------------------------
 ( 'lead' | 'fork' ) & !'crawl'
(1 row)
</pre>

Again,
this is critically important because the search operator <tt>@@</tt>
only finds <i>exact</i> matches
between the words in a query and the words in a vector;
if the document vector lists the lexeme <i>fork</i>
but the query looks for the plural form <i>forks</i>,
the query would not match that document.
Thanks to the symmetry between our process
for producing vectors and queries, however,
the above searches return correct results:

<pre>
=# <b>SELECT * FROM vectors WHERE vector @@ to_tsquery('(leads|forks) & !crawl')</b>
                  vector                  
------------------------------------------
 'fork':3 'path':2
 'fork':3 'lead':4 'left':2 'northeast':5
(2 rows)
</pre>

You may want to try the other queries shown above,
and perhaps invent some of your own.
<p>
You should not include stop words in a query,
since you cannot search for words you have discarded.
If you throw out the word <i>the</i> when building vectors, for example,
your index will obviously not know which documents included it.
The <tt>to_tsquery()</tt> function will automatically detect this
and give you an error to prevent this mistake:

<pre>
=# <b>SELECT to_tsquery('the')</b>
NOTICE:  Query contains only stopword(s) or doesn't contain lexem(s), ignored
 to_tsquery 
------------
 
(1 row)
</pre>

But if you every build vectors and queries using your own routines,
a possibility we will discuss later,
then you will need to enforce this rule yourself.

<blockquote><i>
Now that you understand how vectors and queries work together,
you are prepared to tackle many additional topics:
how to distribute searching across many servers;
how to customize the process
by which tsearch2 turns documents and queries into lexemes,
or use a process of your own;
and how to sort and display search results to your users.
But before discussing these detailed questions,
we will build a simple search engine
to see how easily its basic features work together.
</i></blockquote>

<h2><a name=simple_search>A Simple Search Engine</a></h2>

<blockquote><i>
In this section we build a simple search engine out of SQL functions
that use the vector and query types described in the previous section.
While this example is simpler
than a search engine that has to interface with the outside world,
it will illustrate the basic principles of building a search engine,
and better prepare you for developing your own.
</i></blockquote>
Building a search engine involves only a few improvements
upon the rudimentary vector searches described in the last section.
<ul>
<li>Because the user wants to read documents, not vectors,
 you must provide some way
 for the full text of each document to be accessed &mdash;
 either by storing the entire text of each document in the database,
 or storing an identifier
 like a URL, file name, or document routing number
 that lets you fetch the document from other storage.
<li>You can make it easier for user interface code to refer to each document
 by providing a unique identifier for each document,
 perhaps with a <tt>SERIAL</tt> column.
<li>Search results should be ordered by relevance.
 If you leave word positions in your vectors,
 you can either have PostgreSQL <tt>ORDER</tt> your results
 <tt>BY</tt> a ranking function,
 or you can fetch the vectors yourself and perform your own sort.
 If you choose to ignore word positions or strip them from your vectors,
 you will have to determine relevance yourself,
 using either the full text of the document
 or other information about each document you may possess.
<li>For each document returned by a search,
 you will usually want to display a summary called a <i>headline</i>
 that shows short excerpts
 illustrating how the document uses the query words.
 Headlines are usually generated from the full text of the document,
 not from position information in the <tt>tsvector</tt>,
 since excerpts lacking stop words, punctuation, and suffixes
 would not be comprehensible.
 If you store the full text of each document in the database,
 headlines can be generated very simply by a tsearch2 function.
 If you store your documents elsewhere,
 then you will either have to transmit each document to the database
 every time you want to run the headline function on it,
 or use your own headline code outside of the database.
</ul>
<p>
We can easily construct a simple search engine
that accomplishes these goals.
First we build a table that, for each document,
stores a unique identifier, the full text of the document,
and its <tt>tsvector</tt>:

<pre>
=# <b>CREATE TABLE docs ( id SERIAL, doc TEXT, vector tsvector )</b>
=# <b>CREATE INDEX docs_index ON docs USING gist(vector);</b>
</pre>

Note that although searches will still work
on tables where you have neglected
to create a <tt>gist()</tt> index over your vectors,
they will run much more slowly
since they will have to compare the query
against every document vector in the table.
<p>
Because the table we have created
stores each document in two different ways &mdash;
both as text and as a vector &mdash;
our <tt>INSERT</tt> statements must provide the document in both forms.
While more advanced PostgreSQL programmers
might accomplish this with a database trigger or rule,
for this simple example we will use a small SQL function:

<pre>
=# <b>CREATE FUNCTION insdoc(text) RETURNS void LANGUAGE sql AS
  'INSERT INTO docs (doc, vector) VALUES ($1, to_tsvector($1));'</b>
</pre>

Now, by calling <tt>insdoc()</tt> several times,
we can populate our table with documents:

<pre>
=# <b>SELECT insdoc('A low crawl over cobbles leads inward to the west.')</b>
=# <b>SELECT insdoc('The canyon runs into a mass of boulders -- dead end.')</b>
=# <b>SELECT insdoc('You are crawling over cobbles in a low passage.')</b>
=# <b>SELECT insdoc('Cavernous passages lead east, north, and south.')</b>
=# <b>SELECT insdoc('To the east a low wide crawl slants up.')</b>
=# <b>SELECT insdoc('You are in the south side chamber.')</b>
=# <b>SELECT insdoc('The passage here is blocked by a recent cave-in.')</b>
=# <b>SELECT insdoc('You are in a splendid chamber thirty feet high.')</b>
</pre>

Now we can build a search function.
Its <tt>SELECT</tt> statement is based upon
the same <tt>@@</tt> operation illustrated in the previous section.
But instead of returning matching vectors,
we return for each document
its <tt>SERIAL</tt> identifier, so the user can retrieve it later;
a headline that illustrates its use of the search terms;
and a ranking with which we also order the results.
Our search operation can be coded as a single <tt>SELECT</tt> statement
returning its own kind of table row,
which we call a&nbsp;<tt>finddoc_t</tt>:

<pre>
=# <b>CREATE TYPE finddoc_t AS (id INTEGER, headline TEXT, rank REAL)</b>
=# <b>CREATE FUNCTION finddoc(text) RETURNS SETOF finddoc_t LANGUAGE sql AS '
   SELECT id, headline(doc, q), rank(vector, q)
     FROM docs, to_tsquery($1) AS q
     WHERE vector @@ q ORDER BY rank(vector, q) DESC'</b>
</pre>

This function is a rather satisfactory search engine.
Here is one example search,
after which the user fetches the top-ranking document itself;
with similar commands you can try queries of your own:

<pre>
=# <b>SELECT * FROM finddoc('passage|crawl')</b>
 id |                       headline                        | rank 
----+-------------------------------------------------------+------
  3 | &lt;b&gt;crawling&lt;/b&gt; over cobbles in a low &lt;b&gt;passage&lt;/b&gt;. | 0.19
  1 | &lt;b&gt;crawl&lt;/b&gt; over cobbles leads inward to the west.   |  0.1
  4 | &lt;b&gt;passages&lt;/b&gt; lead east, north, and south.          |  0.1
  5 | &lt;b&gt;crawl&lt;/b&gt; slants up.                               |  0.1
  7 | &lt;b&gt;passage&lt;/b&gt; here is blocked by a recent  cave-in.  |  0.1
(5 rows)
=# <b>SELECT doc FROM docs WHERE id = 3</b>
                       doc                       
-------------------------------------------------
 You are crawling over cobbles in a low passage.
(1 row)
</pre>

While by default the <tt>headline()</tt> function
surrounds matching words with <tt>&lt;b&gt;</tt> and <tt>&lt;/b&gt;</tt>
in order to distinguish them from the surrounding text,
you can provide options that change its behavior;
consult the tsearch2 Reference for more details about
<a href="tsearch2-ref.html#headlines">Headline Functions</a>.
<p>
Though a search may match hundreds or thousands of documents,
you will usually present only ten or twenty results to the user at a time.
This can be most easily accomplished
by limiting your query with a <tt>LIMIT</tt>
and an <tt>OFFSET</tt> clause &mdash;
to display results ten at a time, for example,
your would generate your first page of results
with <tt>LIMIT</tt> <tt>10</tt> <tt>OFFSET</tt> <tt>0</tt>,
your second page
with <tt>LIMIT</tt> <tt>10</tt> <tt>OFFSET</tt> <tt>10</tt>,
your third page
with <tt>LIMIT</tt> <tt>10</tt> <tt>OFFSET</tt> <tt>20</tt>,
and so forth.
There are two problems with this approach, however.
<p>
The first problem is the strain of running the query over again
for every page of results the user views.
For small document collections or lightly loaded servers,
this may not be a problem;
but the impact can be high
when a search must repeatedly rank and sort
the same ten thousand results
on an already busy server.
So instead of selecting only one page of results,
you will probably use <tt>LIMIT</tt> and <tt>OFFSET</tt>
to return a few dozen or few hundred results,
which you can cache and display to the user one page at a time.
Whether a result cache rewards your effort
will depend principally on the behavior of your users &mdash;
how often they even view the second page of results, for instance.
<p>
The second issue solved by caching involves consistency.
If the database is changing while the user browses their results,
then documents might appear and disappear as they page through them.
In some cases the user might even miss a particular result &mdash;
perhaps the one they were looking for &mdash;
if, say, its rank improves from 31th to 30th
after they load results 21&ndash;30 but before they view results 31&ndash;40.
While many databases are static or infrequently updated,
and will not present this problem,
users searching very dymanic document collections
might benefit from the stable results that caches yield.

<blockquote><i>
Having seen the features of a search engine
implemented entirely within the database,
we will learn about some specific tsearch2 features.
First we will look in more detail at document ranking.
</i></blockquote>

<h2><a name=weights>Ranking and Position Weights</a></h2>

<blockquote><i>
When we built our simple search engine,
we used the </i><tt>rank()</tt><i> function to order our results.
Here we describe tsearch2 ranking in more detail.
</i></blockquote>

There are two functions with which tsearch2 can rank search results.
They both use the lexeme positions listed in the <tt>tsvector</tt>,
so you cannot rank vectors
from which these have been removed with <tt>strip()</tt>.
The <tt>rank()</tt> function existed in older versions of OpenFTS,
and has the feature that you can assign different weights
to words from different sections of your document.
The <tt>rank_cd()</tt> uses a recent technique for weighting results
and also allows  different weight to be given
to different sections of your document (since 8.2).
<p>
Both ranking functions allow you to specify,
as an optional last argument,
whether you want their results <i>normalized</i> &mdash;
whether the rank returned should be adjusted for document length.
Specifying a last argument of <tt>0</tt> (zero) makes no adjustment;
<tt>1</tt> (one) divides the document rank
by the logarithm of the document length;
and <tt>2</tt> divides it by the plain length.
In all of these examples we omit this optional argument,
which is the same as specifying zero &mdash;
we are making no adjustment for document length.
<p>
The <tt>rank_cd()</tt> function uses an experimental measurement
called <i>cover density ranking</i> that rewards documents
when they make frequent use of the search terms
that are close together in the document.
You can read about the algorithm in more detail
in Clarke&nbsp;et&nbsp;al.,
 &ldquo;<a href="http://citeseer.nj.nec.com/clarke00relevance.html"
>Relevance Ranking for One to Three Term Queries</a>.&rdquo;
An optional first argument allows you to tune their formula;
for details
see the <a href="tsearch2-ref.html#ranking">section on ranking</a>
in the Reference.
<p>
Currently tsearch2 supports four different weight labels:
<tt>'D'</tt>, the default weight;
and <tt>'A'</tt>, <tt>'B'</tt>, and <tt>'C'</tt>.
All vectors created with <tt>to_tsvector()</tt>
assign the weight <tt>'D'</tt> to each position,
which as the default is not displayed when you print a vector out.
<p>
If you want positions with weights other than <tt>'D'</tt>,
you have two options:
either you can author a vector directly through the <tt>::tsvector</tt>
casting operation,
as described in the following section,
which lets you give each position whichever weight you want;
or you can pass a vector through the <tt>setweight()</tt> function
which sets all of its position weights to a single value.
An example of the latter:


<pre>
=# <b>SELECT vector FROM docs WHERE id = 3</b>
                 vector                 
----------------------------------------
 'low':8 'cobbl':5 'crawl':3 'passag':9
(1 row)
=# <b>SELECT setweight(vector, 'A') FROM docs WHERE id = 3</b>
                 setweight                  
--------------------------------------------
 'low':8A 'cobbl':5A 'crawl':3A 'passag':9A
(1 row)
</pre>


Merely changing all of the weights in a vector is not very useful,
of course,
since this results still in all words having the same weight.
But if we parse different parts of a document separately,
giving each section its own weight,
and then concatenate the vectors of each part into a single vector,
the result can be very useful.
We can construct a simple example
in which document titles are given greater weight
that text in the body of the document:


<pre>
=# <b>CREATE TABLE tdocs ( id SERIAL, title TEXT, doc TEXT, vector tsvector )</b>
=# <b>CREATE INDEX tdocs_index ON tdocs USING gist(vector);</b>
=# <b>CREATE FUNCTION instdoc(text, text) RETURNS void LANGUAGE sql AS
  'INSERT INTO tdocs (title, doc, vector)
   VALUES ($1, $2, setweight(to_tsvector($1), ''A'') || to_tsvector($2));'</b>
</pre>


Now words from a document title will be weighted differently
than those in the main text
if we provide the title and body as separate arguments:


<pre>
=# <b>SELECT instdoc('Spendid Chamber',
 'The walls are frozen rivers of orange stone.')</b>
 instdoc 
---------
 
(1 row)
=# <b>SELECT vector FROM tdocs</b>
                                    vector                                    
------------------------------------------------------------------------------
 'wall':4 'orang':9 'river':7 'stone':10 'frozen':6 'chamber':2A 'spendid':1A
(1 row)
</pre>


Note that although the necessity is unusual,
you can constrain search terms
to only match words from certain sections
by following them with a colon
and a list of the sections in which the word can occur;
by default this list is <tt>'ABCD'</tt>
so that search terms match words from all sections.
For example,
here we search for a word both generally,
and then looking only for specific weights:


<pre>
=# <b>SELECT title, doc FROM tdocs WHERE vector @@ to_tsquery('spendid')</b>
      title      |                     doc                      
-----------------+----------------------------------------------
 Spendid Chamber | The walls are frozen rivers of orange stone.
(1 row)
=# <b>SELECT title, doc FROM tdocs WHERE vector @@ to_tsquery('spendid:A')</b>
      title      |                     doc                      
-----------------+----------------------------------------------
 Spendid Chamber | The walls are frozen rivers of orange stone.
(1 row)
=# <b>SELECT title, doc FROM tdocs WHERE vector @@ to_tsquery('spendid:D')</b>
 title | doc 
-------+-----
(0 rows)
</pre>




<blockquote><i>
Our examples so far use tsearch2 to parse our documents into vectors.
When your application needs absolute control over vector content,
you will want to use direct type casting,
which is described in the next section.
</i></blockquote>

<h2><a name=casting>Casting Vectors and Queries</a></h2>

<blockquote><i>
While tsearch2 has powerful and flexible ways
to process documents and turn them into document vectors,
you will sometimes want to parse documents on your own
and place the results directly in vectors.
Here we show you how.
</i></blockquote>

In the preceding examples,
we used the <tt>to_tsvector()</tt> function
when we needed a document's text reduced to a document vector.
We saw that the function stripped whitespace and punctuation,
eliminated common words,
and altered suffixes to reduce words to a common form.
While these operations are often desirable,
and while in the sections below
we will gain precise control over this process,
there are occasions on which
you want to avoid the changes that <tt>to_tsvector()</tt> makes to text
and specify explicitly the words that you want in your vectors.
Or you may want to create queries directly
rather than through <tt>to_tsquery()</tt>.
<p>
For example,
you may have already developed your own routine
for reducing your documents to searchable lexemes,
and do not want your carefully generated terms altered
by passing them through <tt>to_tsvector()</tt>.
Or you might be developing and debugging parsing routines of your own
that you are not ready to load into the database.
In either case,
you will find that direct insertion is easily accomplished
if you simply follow some simple rules.
<p>
Vectors are created directly
when you cast a string of whitespace separated lexemes
to the <tt>tsvector</tt> type:


<pre>
=# <b>select 'the only exit is the way you came in'::tsvector</b>
                     tsvector                     
--------------------------------------------------
 'in' 'is' 'the' 'way' 'you' 'came' 'exit' 'only'
(1 row)
</pre>


Notice that the conversion interpreted the string
simply as a list of lexemes to be included in the vector.
Their order was lost,
as was the number of times each lexeme appeared.
You must keep in mind that directly creating vectors with casting
is <i>not</i> an alternate means of parsing;
it is a way of directly entering lexemes into a vector <i>without</i> parsing.
<p>
Queries can also be created through casting,
if you separate lexemes with boolean operators
rather than with whitespace.
When creating your own vectors and queries,
remember that the search operator <tt>@@</tt>
finds only <i>exact</i> matches between query lexemes and vector lexemes
&mdash;
if they are not exactly the same string,
they will not be considered a match.
<p>
To include lexeme positions in your vector,
write the positions exactly the way tsearch2 displays them
when it prints vectors:
by following each lexeme with a colon
and a comma-separated list of integer positions.
If you list a lexeme more than once,
then all the positions listed for it are combined into a single list.
For example,
here are two ways of writing the same vector,
depending on whether you mention &lsquo;<tt>the</tt>&rsquo; twice
or combine its positions into a list yourself:


<pre>
=# <b>select 'the:1 only:2 exit:3 is:4 the:5 way:6 you:7 came:8 in:9'::tsvector</b>
                              tsvector                              
--------------------------------------------------------------------
 'in':9 'is':4 'the':1,5 'way':6 'you':7 'came':8 'exit':3 'only':2
(1 row)
=# <b>select 'the:1,5 only:2 exit:3 is:4 way:6 you:7 came:8 in:9'::tsvector</b>
                              tsvector                              
--------------------------------------------------------------------
 'in':9 'is':4 'the':1,5 'way':6 'you':7 'came':8 'exit':3 'only':2
(1 row)
</pre>


Things can get slightly tricky
if you want to include apostrophes, backslashes, or spaces
inside your lexemes
(wanting to include either of the latter would be unusual,
but they can be included if you follow the rules).
The main problem is that the apostrophe and backslash
are important <i>both</i> to PostgreSQL when it is interpreting a string,
<i>and</i> to the <tt>tsvector</tt> conversion function.
You may want to review section
<a href="http://www.postgresql.org/docs/current/static/sql-syntax.html#SQL-SYNTAX-STRINGS">
&ldquo;String Constants&rdquo;</a>
in the PostgreSQL documentation before proceeding.
<p>
When you cast strings directly into vectors:
<ul>
<li>The string is interpreted as a whitespace-separated list of lexemes,
 any of which can be suffixed with a colon and a list of positions.
<li>A lexeme can be quoted by preceding it with an apostrophe,
 in which case it runs until the next apostrophe;
 otherwise a lexeme ends with the first whitespace or colon encountered.
<li>Any character preceded by a backslash,
 including whitespace, the apostrophe, the colon, and the backslash itself,
 loses its normal meaning and is treated as a letter.
 Backslashes are effective
 both inside and outside of apostrophe-quoted lexemes.
<li>A lexeme can be suffixed with a list of positions
 by appending a colon and a comma-separated list of integers,
 each of which can itself be followed by a letter
 to designate a position weight
 (position weights are <a href="#weights">described below</a>).
</ul>

Here are some example strings,
showing the lexeme you want to insert
together with the string that the <tt>::tsvector</tt> operator
needs to see,
and how you would type that string at the PostgreSQL prompt:

<table align=center>
<tr>
<td><i>For the lexeme...</i>
<td><i>you need the string...</i>
<td><i>which you can type as:</i>
<tr>
<td><tt>nugget</tt>
<td><tt>nugget</tt>
<td><tt>'nugget'</tt>
<tr>
<td><tt>won't</tt>
<td><tt>won't</tt>
<td><tt>'won''t'</tt>
<tr>
<td><tt>pinin'</tt>
<td><tt>pinin'</tt>
<td><tt>'pinin'''</tt>
<tr>
<td><tt>'bout</tt>
<td><tt>\'bout</tt>
<td><tt>'\\''bout'</tt>
<tr>
<td><tt>white mist</tt>
<td><tt>white\ mist</tt>
<td><tt>'white\\ mist'</tt>
<tr>
<td align=right><tt><i>or:</i></tt>
<td><tt>'white mist'</tt>
<td><tt>'''white mist'''</tt>
<tr>
<td><tt>won't budge</tt>
<td><tt>won\'t\ budge</tt>
<td><tt>'won\\''t\\ budge'</tt>
<tr>
<td align=right><tt><i>or:</i></tt>
<td><tt>'won\'t budge'</tt>
<td><tt>'''won\\''t budge'''</tt>
<tr>
<td><tt>back\slashed</tt>
<td><tt>back\\slashed</tt>
<td><tt>'back\\\\slashed'</tt>
</table>

Remember to use the quoted quoting shown at the right
only when typing in strings as part of a PostgreSQL query.
If you are providing strings through a library
that automatically quotes them
or provides them in binary form to PostgreSQL,
then you can use the strings in the middle instead &mdash;
suitably quoted in the language you are using, of course.
<p>
Position weights are <a href="#weights">described below</a>
and can be written exactly as they will be displayed
when you select a weighted vector:

<pre>
=# <b>select 'weighty:1,3A trivial:2B,4'::tsvector</b>
           tsvector            
-------------------------------
 'trivial':2B,4 'weighty':1,3A
(1 row)
</pre>

<p>
Note that if you are composing SQL queries
in a scripting language like Perl or Python,
that itself considers quotes and backslashes special,
then you may have another quoting layer to deal with
on top of the two layers already shown above.
In such cases you may want to write a function
that performs the necessary quoting for you.

<blockquote><i>
Having seen how to create vectors of your own,
it is time to learn how the native tsearch2 parser
reduces documents to vectors.
</i></blockquote>

<h2><a name=parsing_lexing>Parsing and Lexing</a></h2>

<blockquote><i>
The previous section
described how you can bypass the parser provided by tsearch2
and populate your table of documents
with vectors of your own devising.
But for those interested in the native tsearch2 facilities,
we present here an overview of how it goes about
reducing documents to vectors.
</i></blockquote>

The <tt>to_tsvector()</tt> function reduces documents to vectors
in two stages.
First, a <i>parser</i> breaks the input document
into short sequences of text called <i>tokens</i>.
Each token is usually a word, space, or piece of punctuation,
though some parsers return larger and more exotic items
like HTML tags as single tokens.
Each token returned by the parser
is either discarded
or passed to a <i>dictionary</i> that converts it into a lexeme.
The resulting lexemes are collected into a vector and returned.
<p>
The choice of which parser and dictionaries <tt>to_tsvector()</tt> should use
is controlled by your choice of <i>configuration</i>.
The tsearch2 module comes with several configurations,
and you can define more of your own;
in fact the creation of a new configuration is illustrated below,
in the section on position weights.
<p>
To learn about parsing in more detail,
we will study this example:

<pre>
=# <b>select to_tsvector('default',
     'The walls extend upward for well over 100 feet.')</b>
                       to_tsvector                        
----------------------------------------------------------
 '100':8 'feet':9 'wall':2 'well':6 'extend':3 'upward':4
(1 row)
</pre>

Unlike the <tt>to_tsvector()</tt> calls used in the above examples,
this one specifies the <tt>'default'</tt> configuration explicitly.
When we called <tt>to_tsvector()</tt> in earlier examples
with only one argument,
it used the <i>current</i> configuration,
which is chosen automatically based on your <tt>LOCALE</tt>
if that locale is mentioned in the <tt>pg_ts_cfg</tt> table
(which is shown under the first bullet in the description below).
If your locale is not listed in the table,
your attempts to use the current configuration will return:

<pre>
ERROR:  Can't find tsearch2 config by locale
</pre>

You can always change the current configuration manually
by calling the <tt>set_curcfg()</tt> function
described in the section on
<a href="tsearch2-ref.html#configurations">Configurations</a>
in the Reference.
<p>
Each configuration serves as an index into two different tables:
in <tt>pg_ts_cfg</tt> it determines
which parser will break our text into tokens,
and in <tt>pg_ts_cfgmap</tt>
it directs each token to a dictionary for processing.
The steps in detail are:

<ul>
<li class=big>
<p>First, our text is parsed,
using the parser listed for our configuration in the <tt>pg_ts_cfg</tt> table.
We are using the <tt>'default'</tt> configuration,
so the table tells us to use the <tt>'default'</tt> parser:

<pre>
=# <b>SELECT * FROM pg_ts_cfg WHERE ts_name = 'default'</b>
 ts_name | prs_name | locale 
---------+----------+--------
 default | default  | C
(1 row)
</pre>

So our text will be parsed as though we had called:

<pre>
=# <b>select * from parse('default',
     'The walls extend upward for well over 100 feet.')</b>
</pre>

This breaks the text into a list of tokens
which are each labelled with an integer type:
<p align=center>
The<sub>1</sub>&diams;<sub>12</sub
>walls<sub>1</sub>&diams;<sub>12</sub
>extend<sub>1</sub>&diams;<sub>12</sub
>upward<sub>1</sub>&diams;<sub>12</sub
>for<sub>1</sub>&diams;<sub>12</sub
>well<sub>1</sub>&diams;<sub>12</sub
>over<sub>1</sub>&diams;<sub>12</sub
>100<sub>22</sub>&diams;<sub>12</sub
>feet<sub>1</sub>.<sub>12</sub>
<p>
Each word has been assigned type&nbsp;1;
each space (represented here by a diamond) and the period, type&nbsp;12;
and the number one hundred, type&nbsp;22.
We can retrieve the alias for each type
through the <tt>token_type</tt> function:

<pre>
=# <b>select * from token_type('default')
     where tokid = 1 or tokid = 12 or tokid = 22</b>
 tokid | alias |      descr       
-------+-------+------------------
     1 | lword | Latin word
    12 | blank | Space symbols
    22 | uint  | Unsigned integer
(3 rows)
</pre>


<li class=big>
Next, the tokens are assigned to dictionaries
by looking up their type aliases in <tt>pg_ts_cfgmap</tt>
to determine which dictionary should process each token.
Since we are using the <tt>'default'</tt> configuration:

<pre>
=# <b>select * from pg_ts_cfgmap where ts_name = 'default' and
      (tok_alias = 'lword' or tok_alias = 'blank' or tok_alias = 'uint')</b>
 ts_name | tok_alias | dict_name 
---------+-----------+-----------
 default | lword     | {en_stem}
 default | uint      | {simple}
(2 rows)
</pre>

Since this map provides no dictionary for <tt>blank</tt> tokens,
the spaces and period are simply discarded,
leaving nine tokens,
which are then numbered by their position:
<p align=center>
The<sup>1</sup>
walls<sup>2</sup>
extend<sup>3</sup>
upward<sup>4</sup>
for<sup>5</sup>
well<sup>6</sup>
over<sup>7</sup>
100<sup>8</sup>
feet<sup>9</sup>

<li class=big>
Finally, the words are reduced to lexemes by their respective dictionaries.
The <tt>100</tt> is submitted to the <tt>simple</tt> dictionary,
which returns tokens unaltered except for making them lowercase:

<pre>
=# <b>select lexize('simple', '100')</b>
 lexize 
--------
 {100}
(1 row)
</pre>

The other words are submitted to <tt>en_stem</tt>
which reduces each English word to a linguistic stem,
and then discards stems which belong to its list of stop words;
you can see the list of stop words
in the file whose path is in the <tt>dict_initoption</tt> field
of the <tt>pg_ts_dict</tt> table entry for <tt>en_stem</tt>.
The first three words of our text illustrate respectively
an <tt>en_stem</tt> stop word,
a word which <tt>en_stem</tt> alters by stemming,
and a word which <tt>en_stem</tt> leaves alone:

<pre>
=# <b>select lexize('en_stem', 'The')</b>
 lexize 
--------
 {}
(1 row)
=# <b>select lexize('en_stem', 'walls')</b>
 lexize 
--------
 {wall}
(1 row)
=# <b>select lexize('en_stem', 'extend')</b>
  lexize  
----------
 {extend}
(1 row)
</pre>

Once <tt>en_stem</tt> is done discarding stop words and stemming the rest,
we are left with:
<p align=center>
wall<sup>2</sup>
extend<sup>3</sup>
upward<sup>4</sup>
well<sup>6</sup>
100<sup>8</sup>
feet<sup>9</sup>
<p>
Which is precisely the result of the example that began this section.
</ul>
Query words are stemmed by the <tt>to_tsquery()</tt> function
using the same scheme to determine the dictionary for each token,
with the difference that the query parser recognizes as special
the boolean operators that separate query words.


<h2><a name="ref">Additional information</a></h2>
More information about tsearch2 is available from 
<a href="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2">tsearch2</a> page.
Also, it's worth to check 
<a href="http://www.sai.msu.su/~megera/wiki/Tsearch2">tsearch2 wiki</a> pages.


</body>
</html>