aboutsummaryrefslogtreecommitdiff
path: root/src/backend/utils/adt/like_support.c
Commit message (Collapse)AuthorAge
* Update copyright for 2025Bruce Momjian2025-01-01
| | | | Backpatch-through: 13
* Support LIKE with nondeterministic collationsPeter Eisentraut2024-11-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This allows for example using LIKE with case-insensitive collations. There was previously no internal implementation of this, so it was met with a not-supported error. This adds the internal implementation and removes the error. The implementation follows the specification of the SQL standard for this. Unlike with deterministic collations, the LIKE matching cannot go character by character but has to go substring by substring. For example, if we are matching against LIKE 'foo%bar', we can't start by looking for an 'f', then an 'o', but instead with have to find something that matches 'foo'. This is because the collation could consider substrings of different lengths to be equal. This is all internal to MatchText() in like_match.c. The changes in GenericMatchText() in like.c just pass through the locale information to MatchText(), which was previously not needed. This matches exactly Generic_Text_IC_like() below. ILIKE is not affected. (It's unclear whether ILIKE makes sense under nondeterministic collations.) This also updates match_pattern_prefix() in like_support.c to support optimizing the case of an exact pattern with nondeterministic collations. This was already alluded to in the previous code. (includes documentation examples from Daniel Vérité and test cases from Paul A Jungwirth) Reviewed-by: Jian He <jian.universality@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/700d2e86-bf75-4607-9cf2-f5b7802f6e88@eisentraut.org
* Fix missed case for builtin collation provider.Jeff Davis2024-10-11
| | | | | | | | | | | | A missed check for the builtin collation provider could result in falling through to call isalpha(). This does not appear to have practical consequences because it only happens for characters in the ASCII range. Regardless, the builtin provider should not be calling libc functions, so backpatch. Discussion: https://postgr.es/m/1bd5a0a5192f82c22ee7527e825b18ab0028b2c7.camel@j-davis.com Backpatch-through: 17
* Remove lc_ctype_is_c().Jeff Davis2024-09-06
| | | | | | | | | | | | Instead always fetch the locale and look at the ctype_is_c field. hba.c relies on regexes working for the C locale without needing catalog access, which worked before due to a special case for C_COLLATION_OID in lc_ctype_is_c(). Move the special case to pg_set_regex_collation() now that lc_ctype_is_c() is gone. Author: Andreas Karlsson Discussion: https://postgr.es/m/60929555-4709-40a7-b136-bcb44cff5a3c@proxel.se
* Remove lc_collate_is_c().Jeff Davis2024-09-04
| | | | | | | Instead just look up the collation and check collate_is_c field. Author: Andreas Karlsson Discussion: https://postgr.es/m/60929555-4709-40a7-b136-bcb44cff5a3c@proxel.se
* Remove unused #include's from backend .c filesPeter Eisentraut2024-03-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | as determined by include-what-you-use (IWYU) While IWYU also suggests to *add* a bunch of #include's (which is its main purpose), this patch does not do that. In some cases, a more specific #include replaces another less specific one. Some manual adjustments of the automatic result: - IWYU currently doesn't know about includes that provide global variable declarations (like -Wmissing-variable-declarations), so those includes are being kept manually. - All includes for port(ability) headers are being kept for now, to play it safe. - No changes of catalog/pg_foo.h to catalog/pg_foo_d.h, to keep the patch from exploding in size. Note that this patch touches just *.c files, so nothing declared in header files changes in hidden ways. As a small example, in src/backend/access/transam/rmgr.c, some IWYU pragma annotations are added to handle a special case there. Discussion: https://www.postgresql.org/message-id/flat/af837490-6b2f-46df-ba05-37ea6a6653fc%40eisentraut.org
* Update copyright for 2024Bruce Momjian2024-01-03
| | | | | | | | Reported-by: Michael Paquier Discussion: https://postgr.es/m/ZZKTDPxBBMt3C0J9@paquier.xyz Backpatch-through: 12
* Add trailing commas to enum definitionsPeter Eisentraut2023-10-26
| | | | | | | | | | | | | | | | | | | | Since C99, there can be a trailing comma after the last value in an enum definition. A lot of new code has been introducing this style on the fly. Some new patches are now taking an inconsistent approach to this. Some add the last comma on the fly if they add a new last value, some are trying to preserve the existing style in each place, some are even dropping the last comma if there was one. We could nudge this all in a consistent direction if we just add the trailing commas everywhere once. I omitted a few places where there was a fixed "last" value that will always stay last. I also skipped the header files of libpq and ecpg, in case people want to use those with older compilers. There were also a small number of cases where the enum type wasn't used anywhere (but the enum values were), which ended up confusing pgindent a bit, so I left those alone. Discussion: https://www.postgresql.org/message-id/flat/386f8c45-c8ac-4681-8add-e3b0852c1620%40eisentraut.org
* All supported systems have locale_t.Thomas Munro2023-07-09
| | | | | | | | | | | | | | | | | | | | locale_t is defined by POSIX.1-2008 and SUSv4, and available on all targeted systems. For Windows, win32_port.h redirects to a partial implementation called _locale_t. We can now remove a lot of compile-time tests for HAVE_LOCALE_T, and associated comments and dead code branches that were needed for older computers. Since configure + MinGW builds didn't detect locale_t but now we assume that all systems have it, further inconsistencies among the 3 Windows build systems were revealed. With this commit, we no longer define HAVE_WCSTOMBS_L and HAVE_MBSTOWCS_L on any Windows build system, but we have logic to deal with that so that replacements are available where appropriate. Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Tristan Partin <tristan@neon.tech> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/CA%2BhUKGLg7_T2GKwZFAkEf0V7vbnur-NfCjZPKZb%3DZfAXSV1ORw%40mail.gmail.com
* Update copyright for 2023Bruce Momjian2023-01-02
| | | | Backpatch-through: 11
* Defend against stack overrun in a few more places.Tom Lane2022-08-24
| | | | | | | | | | | | | | | | | | | SplitToVariants() in the ispell code, lseg_inside_poly() in geo_ops.c, and regex_selectivity_sub() in selectivity estimation could recurse until stack overflow; fix by adding check_stack_depth() calls. So could next() in the regex compiler, but that case is better fixed by converting its tail recursion to a loop. (We probably get better code that way too, since next() can now be inlined into its sole caller.) There remains a reachable stack overrun in the Turkish stemmer, but we'll need some advice from the Snowball people about how to fix that. Per report from Egor Chindyaskin and Alexander Lakhin. These mistakes are old, so back-patch to all supported branches. Richard Guo and Tom Lane Discussion: https://postgr.es/m/1661334672.728714027@f473.i.mail.ru
* Call pg_newlocale_from_collation() also with default collationPeter Eisentraut2022-01-20
| | | | | | | | | | | | | | | | Previously, callers of pg_newlocale_from_collation() did not call it if the collation was DEFAULT_COLLATION_OID and instead proceeded with a pg_locale_t of 0. Instead, now we call it anyway and have it return 0 if the default collation was passed. It already did this, so we just have to adjust the callers. This simplifies all the call sites and also makes future enhancements easier. After discussion and testing, the previous comment in pg_locale.c about avoiding this for performance reasons may have been mistaken since it was testing a very different patch version way back when. Reviewed-by: Julien Rouhaud <rjuju123@gmail.com> Discussion: https://www.postgresql.org/message-id/ed3baa81-7fac-7788-cc12-41e3f7917e34@enterprisedb.com
* Update copyright for 2022Bruce Momjian2022-01-07
| | | | Backpatch-through: 10
* Add a planner support function for starts_with().Tom Lane2021-11-17
| | | | | | | | | | | | | | | | | | | | | | This fills in some gaps in planner support for starts_with() and the equivalent ^@ operator: * A condition such as "textcol ^@ constant" can now use a regular btree index, not only an SP-GiST index, so long as the index's collation is C. (This works just like "textcol LIKE 'foo%'".) * "starts_with(textcol, constant)" can be optimized the same as "textcol ^@ constant". * Fixed-prefix LIKE and regex patterns are now more like starts_with() in another way: if you apply one to an SPGiST-indexed column, you'll get an index condition using ^@ rather than two index conditions with >= and <. Per a complaint from Shay Rojansky. Patch by me; thanks to Nathan Bossart for review. Discussion: https://postgr.es/m/232599.1633800229@sss.pgh.pa.us
* Avoid divide-by-zero in regex_selectivity() with long fixed prefix.Tom Lane2021-02-12
| | | | | | | | | | | | | | | | | | | | | Given a regex pattern with a very long fixed prefix (approaching 500 characters), the result of pow(FIXED_CHAR_SEL, fixed_prefix_len) can underflow to zero. Typically the preceding selectivity calculation would have underflowed as well, so that we compute 0/0 and get NaN. In released branches this leads to an assertion failure later on. That doesn't happen in HEAD, for reasons I've not explored yet, but it's surely still a bug. To fix, just skip the division when the pow() result is zero, so that we'll (most likely) return a zero selectivity estimate. In the edge cases where "sel" didn't yet underflow, perhaps this isn't desirable, but I'm not sure that the case is worth spending a lot of effort on. The results of regex_selectivity_sub() are barely worth the electrons they're written on anyway :-( Per report from Alexander Lakhin. Back-patch to all supported versions. Discussion: https://postgr.es/m/6de0a0c3-ada9-cd0c-3e4e-2fa9964b41e3@gmail.com
* Update copyright for 2021Bruce Momjian2021-01-02
| | | | Backpatch-through: 9.5
* Improve ineq_histogram_selectivity's behavior for non-default orderings.Tom Lane2020-06-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ineq_histogram_selectivity() can be invoked in situations where the ordering we care about is not that of the column's histogram. We could be considering some other collation, or even more drastically, the query operator might not agree at all with what was used to construct the histogram. (We'll get here for anything using scalarineqsel-based estimators, so that's quite likely to happen for extension operators.) Up to now we just ignored this issue and assumed we were dealing with an operator/collation whose sort order exactly matches the histogram, possibly resulting in junk estimates if the binary search gets confused. It's past time to improve that, since the use of nondefault collations is increasing. What we can do is verify that the given operator and collation match what's recorded in pg_statistic, and use the existing code only if so. When they don't match, instead execute the operator against each histogram entry, and take the fraction of successes as our selectivity estimate. This gives an estimate that is probably good to about 1/histogram_size, with no assumptions about ordering. (The quality of the estimate is likely to degrade near the ends of the value range, since the two orderings probably don't agree on what is an extremal value; but this is surely going to be more reliable than what we did before.) At some point we might further improve matters by storing more than one histogram calculated according to different orderings. But this code would still be good fallback logic when no matches exist, so that is not an argument for not doing this. While here, also improve get_variable_range() to deal more honestly with non-default collations. This isn't back-patchable, because it requires adding another argument to ineq_histogram_selectivity, and because it might have significant impact on the estimation results for extension operators relying on scalarineqsel --- mostly for the better, one hopes, but in any case destabilizing plan choices in back branches is best avoided. Per investigation of a report from James Lucas. Discussion: https://postgr.es/m/CAAFmbbOvfi=wMM=3qRsPunBSLb8BFREno2oOzSBS=mzfLPKABw@mail.gmail.com
* Use query collation, not column's collation, while examining statistics.Tom Lane2020-06-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 5e0928005 changed the planner so that, instead of blindly using DEFAULT_COLLATION_OID when invoking operators for selectivity estimation, it would use the collation of the column whose statistics we're considering. This was recognized as still being not quite the right thing, but it seemed like a good incremental improvement. However, shortly thereafter we introduced nondeterministic collations, and that creates cases where operators can fail if they're passed the wrong collation. We don't want planning to fail in cases where the query itself would work, so this means that we *must* use the query's collation when invoking operators for estimation purposes. The only real problem this creates is in ineq_histogram_selectivity, where the binary search might produce a garbage answer if we perform comparisons using a different collation than the column's histogram is ordered with. However, when the query's collation is significantly different from the column's default collation, the estimate we previously generated would be pretty irrelevant anyway; so it's not clear that this will result in noticeably worse estimates in practice. (A follow-on patch will improve this situation in HEAD, but it seems too invasive for back-patch.) The patch requires changing the signatures of mcv_selectivity and allied functions, which are exported and very possibly are used by extensions. In HEAD, I just did that, but an API/ABI break of this sort isn't acceptable in stable branches. Therefore, in v12 the patch introduces "mcv_selectivity_ext" and so on, with signatures matching HEAD, and makes the old functions into wrappers that assume DEFAULT_COLLATION_OID should be used. That does not match the prior behavior, but it should avoid risk of failure in most cases. (In practice, I think most extension datatypes aren't collation-aware, so the change probably doesn't matter to them.) Per report from James Lucas. Back-patch to v12 where the problem was introduced. Discussion: https://postgr.es/m/CAAFmbbOvfi=wMM=3qRsPunBSLb8BFREno2oOzSBS=mzfLPKABw@mail.gmail.com
* Update copyrights for 2020Bruce Momjian2020-01-01
| | | | Backpatch-through: update all files in master, backpatch legal files through 9.4
* Sync patternsel_common's operator selection logic with pattern_prefix's.Tom Lane2019-11-20
| | | | | | | | | | | | | | | | | | | | | Make patternsel_common() select the comparison operators to use with hardwired logic that matches pattern_prefix()'s new logic, eliminating its dependencies on particular index opfamilies. This shouldn't change any behavior, as it's just replacing runtime operator lookups with the same values hard-wired. But it makes these closely-related functions look more alike, and saving some runtime syscache lookups is worth something. Actually, it's not quite true that this is zero behavioral change: when estimating for a column of type "name", the comparison constant will be kept as "text" not coerced to "name". But that's more correct anyway, and it allows additional simplification of the coercion logic, again syncing this more closely with pattern_prefix(). Per consideration of a report from Manuel Rigger. Discussion: https://postgr.es/m/CA+u7OA7nnGYy8rY0vdTe811NuA+Frr9nbcBO9u2Z+JxqNaud+g@mail.gmail.com
* Reduce match_pattern_prefix()'s dependencies on index opfamilies.Tom Lane2019-11-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Historically, the planner's LIKE/regex index optimizations were only carried out for specific index opfamilies. That's never been a great idea from the standpoint of extensibility, but it didn't matter so much as long as we had no practical way to extend such behaviors anyway. With the addition of planner support functions, and in view of ongoing work to support additional table and index AMs, it seems like a good time to relax this. Hence, recast the decisions in match_pattern_prefix() so that rather than decide which operators to generate by looking at what the index opfamily contains, we decide which operators to generate a-priori and then see if the opfamily supports them. This is much more defensible from a semantic standpoint anyway, since we know the semantics of the chosen operators precisely, and we only need to assume that the opfamily correctly implements operators it claims to support. The existing "pattern" opfamilies put a crimp in this approach, since we need to select the pattern operators if we want those to work. So we still have to special-case those opfamilies. But that seems all right, since in view of the addition of collations, the pattern opfamilies seem like a legacy hack that nobody will be building on. The only immediate effect of this change, so far as the core code is concerned, is that anchored LIKE/regex patterns can be mapped onto BRIN index searches, and exact-match patterns can be mapped onto hash indexes, not only btree and spgist indexes as before. That's not a terribly exciting result, but it does fix an omission mentioned in the ancient comments here. Note: no catversion bump, even though this touches pg_operator.dat, because it's only adding OID macros not changing the contents of postgres.bki. Per consideration of a report from Manuel Rigger. Discussion: https://postgr.es/m/CA+u7OA7nnGYy8rY0vdTe811NuA+Frr9nbcBO9u2Z+JxqNaud+g@mail.gmail.com
* Fix corner-case failure in match_pattern_prefix().Tom Lane2019-11-19
| | | | | | | | | | | | | | | | | | | | | | | | The planner's optimization code for LIKE and regex operators could error out with a complaint like "no = operator for opfamily NNN" if someone created a binary-compatible index (for example, a bpchar_ops index on a text column) on the LIKE's left argument. This is a consequence of careless refactoring in commit 74dfe58a5. The old code in match_special_index_operator only accepted specific combinations of the pattern operator and the index opclass, thereby indirectly guaranteeing that the opclass would have a comparison operator with the same LHS input type as the pattern operator. While moving the logic out to a planner support function, I simplified that test in a way that no longer guarantees that. Really though we'd like an altogether weaker dependency on the opclass, so rather than put back exactly the old code, just allow lookup failure. I have in mind now to rewrite this logic completely, but this is the minimum change needed to fix the bug in v12. Per report from Manuel Rigger. Back-patch to v12 where the mistake came in. Discussion: https://postgr.es/m/CA+u7OA7nnGYy8rY0vdTe811NuA+Frr9nbcBO9u2Z+JxqNaud+g@mail.gmail.com
* Fix planner's test for case-foldable characters in ILIKE with ICU.Tom Lane2019-08-12
| | | | | | | | | | | | | As coded, the ICU-collation path in pattern_char_isalpha() failed to consider regular ASCII letters to be case-varying. This led to like_fixed_prefix treating too much of an ILIKE pattern as being a fixed prefix, so that indexscans derived from an ILIKE clause might miss entries that they should find. Per bug #15892 from James Inform. This is an oversight in the original ICU patch (commit eccfef81e), so back-patch to v10 where that came in. Discussion: https://postgr.es/m/15892-e5d2bea3e8a04a1b@postgresql.org
* Phase 2 pgindent run for v12.Tom Lane2019-05-22
| | | | | | | | | Switch to 2.1 version of pg_bsd_indent. This formats multiline function declarations "correctly", that is with additional lines of parameter declarations indented to match where the first line's left parenthesis is. Discussion: https://postgr.es/m/CAEepm=0P3FeTXRcU5B2W3jv3PgRVZ-kGUXLGfd42FFhUROO3ug@mail.gmail.com
* Initial pgindent run for v12.Tom Lane2019-05-22
| | | | | | | | This is still using the 2.0 version of pg_bsd_indent. I thought it would be good to commit this separately, so as to document the differences between 2.0 and 2.1 behavior. Discussion: https://postgr.es/m/16296.1558103386@sss.pgh.pa.us
* Unbreak index optimization for LIKE on byteaPeter Eisentraut2019-04-15
| | | | | | | | | The same code is used to handle both text and bytea, but bytea is not collation-aware, so we shouldn't call get_collation_isdeterministic() in that case, since that will error out with an invalid collation. Reported-by: Jeevan Chalke <jeevan.chalke@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/CAM2%2B6%3DWaf3qJ1%3DyVTUH8_yG-SC0xcBMY%2BSFLhvKKNnWNXSUDBw%40mail.gmail.com
* Collations with nondeterministic comparisonPeter Eisentraut2019-03-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds a flag "deterministic" to collations. If that is false, such a collation disables various optimizations that assume that strings are equal only if they are byte-wise equal. That then allows use cases such as case-insensitive or accent-insensitive comparisons or handling of strings with different Unicode normal forms. This functionality is only supported with the ICU provider. At least glibc doesn't appear to have any locales that work in a nondeterministic way, so it's not worth supporting this for the libc provider. The term "deterministic comparison" in this context is from Unicode Technical Standard #10 (https://unicode.org/reports/tr10/#Deterministic_Comparison). This patch makes changes in three areas: - CREATE COLLATION DDL changes and system catalog changes to support this new flag. - Many executor nodes and auxiliary code are extended to track collations. Previously, this code would just throw away collation information, because the eventually-called user-defined functions didn't use it since they only cared about equality, which didn't need collation information. - String data type functions that do equality comparisons and hashing are changed to take the (non-)deterministic flag into account. For comparison, this just means skipping various shortcuts and tie breakers that use byte-wise comparison. For hashing, we first need to convert the input string to a canonical "sort key" using the ICU analogue of strxfrm(). Reviewed-by: Daniel Verite <daniel@manitou-mail.org> Reviewed-by: Peter Geoghegan <pg@bowt.ie> Discussion: https://www.postgresql.org/message-id/flat/1ccc668f-4cbc-0bef-af67-450b47cdfee7@2ndquadrant.com
* Move pattern selectivity code from selfuncs.c to like_support.c.Tom Lane2019-02-14
| | | | | | | | | | | | | | | | | | | | | While at it, refactor patternsel() a bit so that it can be used from the LIKE/regex planner support functions as well. This makes the planner able to deal equally well with either operator or function syntax for these operations. I'm not excited about that as a feature in itself, but it provides a nice model for extensions to follow if they want such behavior for their operations. This change localizes the use of pattern_fixed_prefix() and make_greater_string() so that they no longer need be exported. (We might get pushback from extensions about that, perhaps, in which case I'd be inclined to re-export them in a new header file like_support.h.) This reduces the bulk of selfuncs.c a fair amount, removing ~1370 lines or about one-sixth of that file; it's still too big, but this is progress. Discussion: https://postgr.es/m/24537.1550093915@sss.pgh.pa.us
* Allow extensions to generate lossy index conditions.Tom Lane2019-02-11
For a long time, indxpath.c has had the ability to extract derived (lossy) index conditions from certain operators such as LIKE. For just as long, it's been obvious that we really ought to make that capability available to extensions. This commit finally accomplishes that, by adding another API for planner support functions that lets them create derived index conditions for their functions. As proof of concept, the hardwired "special index operator" code formerly present in indxpath.c is pushed out to planner support functions attached to LIKE and other relevant operators. A weak spot in this design is that an extension needs to know OIDs for the operators, datatypes, and opfamilies involved in the transformation it wants to make. The core-code prototypes use hard-wired OID references but extensions don't have that option for their own operators etc. It's usually possible to look up the required info, but that may be slow and inconvenient. However, improving that situation is a separate task. I want to do some additional refactorization around selfuncs.c, but that also seems like a separate task. Discussion: https://postgr.es/m/15193.1548028093@sss.pgh.pa.us