From 85b7efa1cdd63c2fe2b70b725b8285743ee5787f Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Wed, 27 Nov 2024 08:18:35 +0100 Subject: Support LIKE with nondeterministic collations MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This allows for example using LIKE with case-insensitive collations. There was previously no internal implementation of this, so it was met with a not-supported error. This adds the internal implementation and removes the error. The implementation follows the specification of the SQL standard for this. Unlike with deterministic collations, the LIKE matching cannot go character by character but has to go substring by substring. For example, if we are matching against LIKE 'foo%bar', we can't start by looking for an 'f', then an 'o', but instead with have to find something that matches 'foo'. This is because the collation could consider substrings of different lengths to be equal. This is all internal to MatchText() in like_match.c. The changes in GenericMatchText() in like.c just pass through the locale information to MatchText(), which was previously not needed. This matches exactly Generic_Text_IC_like() below. ILIKE is not affected. (It's unclear whether ILIKE makes sense under nondeterministic collations.) This also updates match_pattern_prefix() in like_support.c to support optimizing the case of an exact pattern with nondeterministic collations. This was already alluded to in the previous code. (includes documentation examples from Daniel Vérité and test cases from Paul A Jungwirth) Reviewed-by: Jian He Discussion: https://www.postgresql.org/message-id/flat/700d2e86-bf75-4607-9cf2-f5b7802f6e88@eisentraut.org --- doc/src/sgml/charset.sgml | 2 +- doc/src/sgml/func.sgml | 52 ++++++++++++++++++++++++++++++++++++++++++----- 2 files changed, 48 insertions(+), 6 deletions(-) (limited to 'doc/src') diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml index f5e115e8d6e..00e1986849a 100644 --- a/doc/src/sgml/charset.sgml +++ b/doc/src/sgml/charset.sgml @@ -1197,7 +1197,7 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr to a performance penalty. Note, in particular, that B-tree cannot use deduplication with indexes that use a nondeterministic collation. Also, certain operations are not possible with nondeterministic collations, - such as pattern matching operations. Therefore, they should be used + such as some pattern matching operations. Therefore, they should be used only in cases where they are specifically wanted. diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 13ccbe7d78c..8b81106fa23 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -5414,9 +5414,10 @@ cast(-44 as bit(12)) 111111010100 - The pattern matching operators of all three kinds do not support - nondeterministic collations. If required, apply a different collation to - the expression to work around this limitation. + SIMILAR TO and POSIX-style regular + expressions do not support nondeterministic collations. If required, use + LIKE or apply a different collation to the expression + to work around this limitation. @@ -5462,6 +5463,46 @@ cast(-44 as bit(12)) 111111010100 + + LIKE pattern matching supports nondeterministic + collations (see ), such as + case-insensitive collations or collations that, say, ignore punctuation. + So with a case-insensitive collation, one could have: + +'AbC' LIKE 'abc' COLLATE case_insensitive true +'AbC' LIKE 'a%' COLLATE case_insensitive true + + With collations that ignore certain characters or in general that consider + strings of different lengths equal, the semantics can become a bit more + complicated. Consider these examples: + +'.foo.' LIKE 'foo' COLLATE ign_punct true +'.foo.' LIKE 'f_o' COLLATE ign_punct true +'.foo.' LIKE '_oo' COLLATE ign_punct false + + The way the matching works is that the pattern is partitioned into + sequences of wildcards and non-wildcard strings (wildcards being + _ and %). For example, the pattern + f_o is partitioned into f, _, o, the + pattern _oo is partitioned into _, + oo. The input string matches the pattern if it can be + partitioned in such a way that the wildcards match one character or any + number of characters respectively and the non-wildcard partitions are + equal under the applicable collation. So for example, '.foo.' + LIKE 'f_o' COLLATE ign_punct is true because one can partition + .foo. into .f, o, o., and then + '.f' = 'f' COLLATE ign_punct, 'o' + matches the _ wildcard, and 'o.' = 'o' COLLATE + ign_punct. But '.foo.' LIKE '_oo' COLLATE + ign_punct is false because .foo. cannot be + partitioned in a way that the first character is any character and the + rest of the string compares equal to oo. (Note that + the single-character wildcard always matches exactly one character, + independent of the collation. So in this example, the + _ would match ., but then the rest + of the input string won't match the rest of the pattern.) + + LIKE pattern matching always covers the entire string. Therefore, if it's desired to match a sequence anywhere within @@ -5503,8 +5544,9 @@ cast(-44 as bit(12)) 111111010100 The key word ILIKE can be used instead of - LIKE to make the match case-insensitive according - to the active locale. This is not in the SQL standard but is a + LIKE to make the match case-insensitive according to the + active locale. (But this does not support nondeterministic collations.) + This is not in the SQL standard but is a PostgreSQL extension. -- cgit v1.2.3