path: root/src/backend/access/gist/gistutil.c
author: Tom Lane <tgl@sss.pgh.pa.us> 2010-05-30 21:59:09 +0000
committer: Tom Lane <tgl@sss.pgh.pa.us> 2010-05-30 21:59:09 +0000
commit: f8fc6082b4e7e5e6d7e6e7699f2b77497b1664ae
tree: c12733394f7229a270c3ddec180fff951d789f38 /src/backend/access/gist/gistutil.c
parent: b7d7df63c876df13e66e72618793a206ed4776c4
Fix misuse of Lossy Counting (LC) algorithm in compute_tsvector_stats().

We must filter out hashtable entries with frequencies less than those specified by the algorithm, else we risk emitting junk entries whose actual frequency is much less than other lexemes that did not get tabulated. This is bad enough by itself, but even worse is that tsquerysel() believes that the minimum frequency seen in pg_statistic is a hard upper bound for lexemes not included, and was thus underestimating the frequency of non-MCEs.

Also, set the threshold frequency to something with a little bit of theory behind it, to wit assume that the input distribution is approximately Zipfian. This might need adjustment in future, but some preliminary experiments suggest that it's not too unreasonable.

Back-patch to 8.4, where this code was introduced.

Jan Urbanski, with some editorialization by Tom
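
The filtering step described above can be illustrated with a minimal sketch of textbook Lossy Counting. This is Python for brevity, not the C code in compute_tsvector_stats(), and the function names, the toy stream, and the values of s and epsilon are all assumptions chosen for illustration; in particular they are not the Zipfian-derived threshold this commit introduces. The point it tries to show is that entries inserted late in the stream survive in the table with tiny counts, so the table must be filtered at output time by the (s - epsilon) * N cutoff rather than emitted as-is.

```python
def lossy_count(stream, epsilon):
    """Textbook Lossy Counting pass; returns (table, n).

    table maps item -> [f, delta], where f is the counted frequency and
    delta is the maximum possible undercount for that entry.
    """
    width = int(round(1 / epsilon))      # bucket width w = 1/epsilon (assumed integral here)
    table = {}
    n = 0
    for item in stream:
        n += 1
        bucket = (n + width - 1) // width    # id of the current bucket
        entry = table.get(item)
        if entry is not None:
            entry[0] += 1
        else:
            table[item] = [1, bucket - 1]
        if n % width == 0:
            # Bucket boundary: prune entries that cannot be frequent so far.
            stale = [k for k, (f, d) in table.items() if f + d <= bucket]
            for k in stale:
                del table[k]
    return table, n


def frequent_items(table, n, s, epsilon):
    """Output step: keep only entries with f >= (s - epsilon) * n.

    Skipping this filter is the kind of misuse the commit message describes:
    low-count entries still sitting in the table would be reported as if
    they were among the most common elements.
    """
    cutoff = (s - epsilon) * n
    return {k: f for k, (f, d) in table.items() if f >= cutoff}


# Toy stream: ten blocks of 5 "a", 3 "b" and 2 one-off items, then three
# late one-off items that arrive after the last pruning pass.
stream = []
for i in range(10):
    stream += ["a"] * 5 + ["b"] * 3 + [f"x{i}", f"y{i}"]
stream += ["z1", "z2", "z3"]

table, n = lossy_count(stream, epsilon=0.1)
result = frequent_items(table, n, s=0.3, epsilon=0.1)
```

In this toy run the late items z1, z2 and z3 are still present in the raw table with a count of 1 each, while the filtered result contains only "a" and "b". Emitting the raw table would also drag the reported minimum frequency down to that of the junk entries, which matters for any consumer that treats the minimum as an upper bound for untabulated items.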
Diffstat (limited to 'src/backend/access/gist/gistutil.c')
0 files changed, 0 insertions, 0 deletions