aboutsummaryrefslogtreecommitdiff
path: root/src/test/modules/test_regex
Commit message (Collapse)AuthorAge
* Update copyright for 2025Bruce Momjian2025-01-01
| | | | Backpatch-through: 13
* Remove useless casts to (void *)Peter Eisentraut2024-11-28
| | | | | | | | Many of them just seem to have been copied around for no real reason. Their presence causes (small) risks of hiding actual type mismatches or silently discarding qualifiers Discussion: https://www.postgresql.org/message-id/flat/461ea37c-8b58-43b4-9736-52884e862820@eisentraut.org
* Avoid assertion due to disconnected NFA sub-graphs in regex parsing.Tom Lane2024-11-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 08c0d6ad6 which introduced "rainbow" arcs in regex NFAs, I didn't think terribly hard about what to do when creating the color complement of a rainbow arc. Clearly, the complement cannot match any characters, and I took the easy way out by just not building any arcs at all in the complement arc set. That mostly works, but Nikolay Shaplov found a case where it doesn't: if we decide to delete that sub-NFA later because it's inside a "{0}" quantifier, delsub() suffered an assertion failure. That's because delsub() relies on the target sub-NFA being fully connected. That was always true before, and the best fix seems to be to restore that property. Hence, invent a new arc type CANTMATCH that can be generated in place of an empty color complement, and drop it again later when we start NFA optimization. (At that point we don't need to do delsub() any more, and besides there are other cases where NFA optimization can lead to disconnected subgraphs.) It appears that this bug has no consequences in a non-assert-enabled build: there will be some transiently leaked NFA states/arcs, but they'll get cleaned up eventually. Still, we don't like assertion failures, so back-patch to v14 where rainbow arcs were introduced. Per bug #18708 from Nikolay Shaplov. Discussion: https://postgr.es/m/18708-f94f2599c9d2c005@postgresql.org
* Remove unused #include's from contrib, pl, test .c filesPeter Eisentraut2024-10-28
| | | | | | | | | as determined by IWYU Similar to commit dbbca2cf299, but for contrib, pl, and src/test/. Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://www.postgresql.org/message-id/flat/0df1d5b1-8ca8-4f84-93be-121081bde049%40eisentraut.org
* Update copyright for 2024Bruce Momjian2024-01-03
| | | | | | | | Reported-by: Michael Paquier Discussion: https://postgr.es/m/ZZKTDPxBBMt3C0J9@paquier.xyz Backpatch-through: 12
* Redesign interrupt/cancel API for regex engine.Thomas Munro2023-04-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, a PostgreSQL-specific callback checked by the regex engine had a way to trigger a special error code REG_CANCEL if it detected that the next call to CHECK_FOR_INTERRUPTS() would certainly throw via ereport(). A later proposed bugfix aims to move some complex logic out of signal handlers, so that it won't run until the next CHECK_FOR_INTERRUPTS(), which makes the above design impossible unless we split CHECK_FOR_INTERRUPTS() into two phases, one to run logic and another to ereport(). We may develop such a system in the future, but for the regex code it is no longer necessary. An earlier commit moved regex memory management over to our MemoryContext system. Given that the purpose of the two-phase interrupt checking was to free memory before throwing, something we don't need to worry about anymore, it seems simpler to inject CHECK_FOR_INTERRUPTS() directly into cancelation points, and just let it throw. Since the plan is to keep PostgreSQL-specific concerns separate from the main regex engine code (with a view to bein able to stay in sync with other projects), do this with a new macro INTERRUPT(), customizable in regcustom.h and defaulting to nothing. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CA%2BhUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO%3DnEYKSg%40mail.gmail.com
* meson: Prevent installation of test files during main installPeter Eisentraut2023-03-03
| | | | | | | | | | | | | | | Previously, meson installed modules under src/test/modules/ as part of a normal installation, even though these files are only meant for use by tests. This is because there is no way to set up up the build system to install extra things only when told. This patch fixes that with a workaround: We don't install these modules as part of meson install, but we create a new "test" that runs before the real tests whose action it is to install these files. The installation is done by manual copies using a small helper script. Author: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/2a039e8e-f31f-31e8-afe7-bab3130ad2de%40enterprisedb.com
* Update copyright for 2023Bruce Momjian2023-01-02
| | | | Backpatch-through: 11
* Add copyright notices to meson filesAndrew Dunstan2022-12-20
| | | | Discussion: https://postgr.es/m/222b43a5-2fb3-2c1b-9cd0-375d376c8246@dunslane.net
* meson: Add windows resource filesAndres Freund2022-10-05
| | | | | | | | | | | | | The generated resource files aren't exactly the same ones as the old buildsystems generate. Previously "InternalName" and "OriginalFileName" were mostly wrong / not set (despite being required), but that was hard to fix in at least the make build. Additionally, the meson build falls back to a "auto-generated" description when not set, and doesn't set it in a few cases - unlikely that anybody looks at these descriptions in detail. Author: Andres Freund <andres@anarazel.de> Author: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
* meson: Add initial version of meson based build systemAndres Freund2022-09-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Autoconf is showing its age, fewer and fewer contributors know how to wrangle it. Recursive make has a lot of hard to resolve dependency issues and slow incremental rebuilds. Our home-grown MSVC build system is hard to maintain for developers not using Windows and runs tests serially. While these and other issues could individually be addressed with incremental improvements, together they seem best addressed by moving to a more modern build system. After evaluating different build system choices, we chose to use meson, to a good degree based on the adoption by other open source projects. We decided that it's more realistic to commit a relatively early version of the new build system and mature it in tree. This commit adds an initial version of a meson based build system. It supports building postgres on at least AIX, FreeBSD, Linux, macOS, NetBSD, OpenBSD, Solaris and Windows (however only gcc is supported on aix, solaris). For Windows/MSVC postgres can now be built with ninja (faster, particularly for incremental builds) and msbuild (supporting the visual studio GUI, but building slower). Several aspects (e.g. Windows rc file generation, PGXS compatibility, LLVM bitcode generation, documentation adjustments) are done in subsequent commits requiring further review. Other aspects (e.g. not installing test-only extensions) are not yet addressed. When building on Windows with msbuild, builds are slower when using a visual studio version older than 2019, because those versions do not support MultiToolTask, required by meson for intra-target parallelism. The plan is to remove the MSVC specific build system in src/tools/msvc soon after reaching feature parity. However, we're not planning to remove the autoconf/make build system in the near future. Likely we're going to keep at least the parts required for PGXS to keep working around until all supported versions build with meson. Some initial help for postgres developers is at https://wiki.postgresql.org/wiki/Meson With contributions from Thomas Munro, John Naylor, Stone Tickle and others. Author: Andres Freund <andres@anarazel.de> Author: Nazir Bilal Yavuz <byavuz81@gmail.com> Author: Peter Eisentraut <peter@eisentraut.org> Reviewed-By: Peter Eisentraut <peter.eisentraut@enterprisedb.com> Discussion: https://postgr.es/m/20211012083721.hvixq4pnh2pixr3j@alap3.anarazel.de
* Consistently use named parameters in regex code.Peter Geoghegan2022-09-19
| | | | | | | | | | | | Make regex code consistently use named parameters in function declarations. Also make sure that parameter names from each function's declaration match corresponding definition parameter names. This makes Henry Spencer's regex code follow Postgres coding standards. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAH2-WznJt9CMM9KJTMjJh_zbL5hD9oX44qdJ4aqZtjFi-zA3Tg@mail.gmail.com
* Update copyright for 2022Bruce Momjian2022-01-07
| | | | Backpatch-through: 10
* Fix regexp misbehavior with capturing parens inside "{0}".Tom Lane2021-08-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Regexps like "(.){0}...\1" drew an "invalid backreference number". That's not unreasonable on its face, since the capture group will never be matched if it's iterated zero times. However, other engines such as Perl's don't complain about this, nor do we throw an error for related cases such as "(.)|\1", even though that backref can never succeed either. Also, if the zero-iterations case happens at runtime rather than compile time --- say, "(x)*...\1" when there's no "x" to be found --- that's not an error, we just deem the backref to not match. Making this even less defensible, no error was thrown for nested cases such as "((.)){0}...\2"; and to add insult to injury, those cases could result in assertion failures instead. (It seems that nothing especially bad happened in non-assert builds, though.) Let's just fix it so that no error is thrown and instead the backref is deemed to never match, so that compile-time detection of no iterations behaves the same as run-time detection. Per report from Mark Dilger. This appears to be an aboriginal error in Spencer's library, so back-patch to all supported versions. Pre-v14, it turns out to also be necessary to back-patch one aspect of commits cb76fbd7e/00116dee5, namely to create capture-node subREs with the begin/end states of their subexpressions, not the current lp/rp of the outer parseqatom invocation. Otherwise delsub complains that we're trying to disconnect a state from itself. This is a bit scary but code examination shows that it's safe: in the pre-v14 code, if we want to wrap iteration around the subexpression, the first thing we do is overwrite the atom's begin/end fields with new states. So the bogus values didn't survive long enough to be used for anything, except if no iteration is required, in which case it doesn't matter. Discussion: https://postgr.es/m/A099E4A8-4377-4C64-A98C-3DEDDC075502@enterprisedb.com
* Prevent regexp back-refs from sometimes matching when they shouldn't.Tom Lane2021-08-23
| | | | | | | | | | | | | | | | | | | | | | | | The recursion in cdissect() was careless about clearing match data for capturing parentheses after rejecting a partial match. This could allow a later back-reference to succeed when by rights it should fail for lack of a defined referent. To fix, think a little more rigorously about what the contract between different levels of cdissect's recursion needs to be. With the right spec, we can fix this using fewer rather than more resets of the match data; the key decision being that a failed sub-match is now explicitly responsible for clearing any matches it may have set. There are enough other cross-checks and optimizations in the code that it's not especially easy to exhibit this problem; usually, the match will fail as-expected. Plus, regexps that are even potentially vulnerable are most likely user errors, since there's just not much point in writing a back-ref that doesn't always have a referent. These facts perhaps explain why the issue hasn't been detected, even though it's almost certainly a couple of decades old. Discussion: https://postgr.es/m/151435.1629733387@sss.pgh.pa.us
* Reduce assumptions about locale's behavior in new regex tests.Tom Lane2021-08-17
| | | | | | | | I was overoptimistic to assume that UTF8-based locales would all consider U+1500 to be a member of the [[:alpha:]] char class. Tweak the test cases added by commit 78a843f11 to avoid that assumption. We might need to lobotomize them further, but this should be enough to fix the early buildfarm reports.
* Improve regex compiler's arc moving/copying logic.Tom Lane2021-08-17
| | | | | | | | | | | | | | | | | | | | | | | | The functions moveins(), copyins(), moveouts(), copyouts() are required to preserve the invariant that there are no duplicate arcs in the regex's NFA. Spencer's original implementation of them was O(N^2) since it checked separately for a match to each source arc. In commit 579840ca0 I improved that by adding sort/merge logic to be used if more than a few arcs are to be moved/copied. However, I now realize that that missed a bet. At many call sites, the target state is newly made and cannot have any existing in-arcs (respectively out-arcs) that could be duplicates. So spending any cycles at all on checking for duplicates is wasted effort; in these cases we can just blindly move/copy all the source arcs. Add code paths to do that. It turns out that for copyins()/copyouts(), *all* the call sites have this property, making all the "improved" logic in them flat out unreachable. Perhaps we'll need the full capability again someday, so I just #ifdef'd those paths out rather than removing them entirely. In passing, add a few test cases to improve code coverage in this area as well as in regc_locale.c/regc_pg_locale.c. Discussion: https://postgr.es/m/810272.1629064063@sss.pgh.pa.us
* Avoid determining regexp subexpression matches, when possible.Tom Lane2021-08-09
| | | | | | | | | | | | | | | | | | | | | Identifying the precise match locations for parenthesized subexpressions is a fairly expensive task given the way our regexp engine works, both at regexp compile time (where we must create an optimized NFA for each parenthesized subexpression) and at runtime (where determining exact match locations requires laborious search). Up to now we've made little attempt to optimize this situation. This patch identifies cases where we know at compile time that we won't need to know subexpression match locations, and teaches the regexp compiler to not bother creating per-subexpression regexps for parenthesis pairs that are not referenced by backrefs elsewhere in the regexp. (To preserve semantics, we obviously still have to pin down the match locations of backref references.) Users could have obtained the same results before this by being careful to write "non capturing" parentheses wherever possible, but few people bother with that. Discussion: https://postgr.es/m/2219936.1628115334@sss.pgh.pa.us
* Fix use-after-free issue in regexp engine.Tom Lane2021-08-07
| | | | | | | | | | | | | | | | | | | | | | | | | Commit cebc1d34e taught parseqatom() to optimize cases where a branch contains only one, "messy", atom by getting rid of excess subRE nodes. The way we really should do that is to keep the subRE built for the "messy" child atom; but to avoid changing parseqatom's nominal API, I made it delete that node after copying its fields to the outer subRE made by parsebranch(). It seems that that actually worked at the time; but it became dangerous after ea1268f63, because that later commit allowed the lower invocation of parse() to return a subRE that was also pointed to by some v->subs[] entry. This meant we could wind up with a dangling pointer in v->subs[], allowing a later backref to misbehave, but only if that subRE struct had been reused in between. So the damage seems confined to cases like '((...))...(...\2'. To fix, do what I should have done before and modify parseqatom's API to make it possible for it to remove the caller's subRE instead of the callee's. That's safer because we know that subRE isn't complete yet, so noplace else will have a pointer to it. Per report from Mark Dilger. Back-patch to v14 where the problematic patches came in. Discussion: https://postgr.es/m/0203588E-E609-43AF-9F4F-902854231EE7@enterprisedb.com
* Fix semantics of regular expression back-references.Tom Lane2021-03-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | POSIX defines the behavior of back-references thus: The back-reference expression '\n' shall match the same (possibly empty) string of characters as was matched by a subexpression enclosed between "\(" and "\)" preceding the '\n'. As far as I can see, the back-reference is supposed to consider only the data characters matched by the referenced subexpression. However, because our engine copies the NFA constructed from the referenced subexpression, it effectively enforces any constraints therein, too. As an example, '(^.)\1' ought to match 'xx', or any other string starting with two occurrences of the same character; but in our code it does not, and indeed can't match anything, because the '^' anchor constraint is included in the backref's copied NFA. If POSIX intended that, you'd think they'd mention it. Perl for one doesn't act that way, so it's hard to conclude that this isn't a bug. Fix by modifying the backref's NFA immediately after it's copied from the reference, replacing all constraint arcs by EMPTY arcs so that the constraints are treated as automatically satisfied. This still allows us to enforce matching rules that depend only on the data characters; for example, in '(^\d+).*\1' the NFA matching step will still know that the backref can only match strings of digits. Perhaps surprisingly, this change does not affect the results of any of a rather large corpus of real-world regexes. Nonetheless, I would not consider back-patching it, since it's a clear compatibility break. Patch by me, reviewed by Joel Jacobson Discussion: https://postgr.es/m/661609.1614560029@sss.pgh.pa.us
* Change regex \D and \W shorthands to always match newlines.Tom Lane2021-02-25
| | | | | | | | | | | | | | | | | | | | | | Newline is certainly not a digit, nor a word character, so it is sensible that it should match these complemented character classes. Previously, \D and \W acted that way by default, but in newline-sensitive mode ('n' or 'p' flag) they did not match newlines. This behavior was previously forced because explicit complemented character classes don't match newlines in newline-sensitive mode; but as of the previous commit that implementation constraint no longer exists. It seems useful to change this because the primary real-world use for newline-sensitive mode seems to be to match the default behavior of other regex engines such as Perl and Javascript ... and their default behavior is that these match newlines. The old behavior can be kept by writing an explicit complemented character class, i.e. [^[:digit:]] or [^[:word:]]. (This means that \D and \W are not exactly equivalent to those strings, but they weren't anyway.) Discussion: https://postgr.es/m/3220564.1613859619@sss.pgh.pa.us
* Allow complemented character class escapes within regex brackets.Tom Lane2021-02-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The complement-class escapes \D, \S, \W are now allowed within bracket expressions. There is no semantic difficulty with doing that, but the rather hokey macro-expansion-based implementation previously used here couldn't cope. Also, invent "word" as an allowed character class name, thus "\w" is now equivalent to "[[:word:]]" outside brackets, or "[:word:]" within brackets. POSIX allows such implementation-specific extensions, and the same name is used in e.g. bash. One surprising compatibility issue this raises is that constructs such as "[\w-_]" are now disallowed, as our documentation has always said they should be: character classes can't be endpoints of a range. Previously, because \w was just a macro for "[:alnum:]_", such a construct was read as "[[:alnum:]_-_]", so it was accepted so long as the character after "-" was numerically greater than or equal to "_". Some implementation cleanup along the way: * Remove the lexnest() hack, and in consequence clean up wordchrs() to not interact with the lexer. * Fix colorcomplement() to not be O(N^2) in the number of colors involved. * Get rid of useless-as-far-as-I-can-see calls of element() on single-character character element names in brackpart(). element() always maps these to the character itself, and things would be quite broken if it didn't --- should "[a]" match something different than "a" does? Besides, the shortcut path in brackpart() wasn't doing this anyway, making it even more inconsistent. Discussion: https://postgr.es/m/2845172.1613674385@sss.pgh.pa.us Discussion: https://postgr.es/m/3220564.1613859619@sss.pgh.pa.us
* Fix another ancient bug in parsing of BRE-mode regular expressions.Tom Lane2021-02-18
| | | | | | | | | | | | | | While poking at the regex code, I happened to notice that the bug squashed in commit afcc8772e had a sibling: next() failed to return a specific value associated with the '}' token for a "\{m,n\}" quantifier when parsing in basic RE mode. Again, this could result in treating the quantifier as non-greedy, which it never should be in basic mode. For that to happen, the last character before "\}" that sets "nextvalue" would have to set it to zero, or it'd have to have accidentally been zero from the start. The failure can be provoked repeatably with, for example, a bound ending in digit "0". Like the previous patch, back-patch all the way.
* Make some minor improvements in the regex code.Tom Lane2021-02-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Push some hopefully-uncontroversial bits extracted from an upcoming patch series, to remove non-relevant clutter from the main patches. In compact(), return immediately after setting REG_ASSERT error; continuing the loop would just lead to assertion failure below. (Ask me how I know.) In parseqatom(), remove assertion that moresubs() did its job. When moresubs actually did its job, this is redundant with that function's final assert; but when it failed on OOM, this is an assertion crash. We could avoid the crash by adding a NOERR() check before the assertion, but it seems better to subtract code than add it. (Note that there's a NOERR exit a few lines further down, and nothing else between here and there requires moresubs to have succeeded. So we don't really need an extra error exit.) This is a live bug in assert-enabled builds, but given the very low likelihood of OOM in moresub's tiny allocation, I don't think it's worth back-patching. On the other hand, it seems worthwhile to add an assertion that our intended v->subs[subno] target is still null by the time we are ready to insert into it, since there's a recursion in between. In pg_regexec, ensure we fflush any debug output on the way out, and try to make MDEBUG messages more uniform and helpful. (In particular, ensure that all of them are prefixed with the subre's id number, so one can match up entry and exit reports.) Add some test cases in test_regex to improve coverage of lookahead and lookbehind constraints. Adding these now is mainly to establish that this is indeed the existing behavior. Discussion: https://postgr.es/m/1340281.1613018383@sss.pgh.pa.us
* Add missing array-enlargement logic to test_regex.c.Tom Lane2021-01-17
| | | | | | | | | | | The stanza to report a "partial" match could overrun the initially allocated output array, so it needs its own copy of the array-resizing logic that's in the main loop. I overlooked the need for this in ca8217c10. Per report from Alexander Lakhin. Discussion: https://postgr.es/m/3206aace-50db-e02a-bbea-76d5cdaa2cb6@gmail.com
* Fix ancient bug in parsing of BRE-mode regular expressions.Tom Lane2021-01-08
| | | | | | | | | | | | | | | | | brenext(), when parsing a '*' quantifier, forgot to return any "value" for the token; per the equivalent case in next(), it should return value 1 to indicate that greedy rather than non-greedy behavior is wanted. The result is that the compiled regexp could behave like 'x*?' rather than the intended 'x*', if we were unlucky enough to have a zero in v->nextvalue at this point. That seems to happen with some reliability if we have '.*' at the beginning of a BRE-mode regexp, although that depends on the initial contents of a stack-allocated struct, so it's not guaranteed to fail. Found by Alexander Lakhin using valgrind testing. This bug seems to be aboriginal in Spencer's code, so back-patch all the way. Discussion: https://postgr.es/m/16814-6c5e3edd2bdf0d50@postgresql.org
* Fix bogus link in test comments.Tom Lane2021-01-06
| | | | | I apparently copied-and-pasted the wrong link in commit ca8217c10. Point it where it was meant to go.
* Add a test module for the regular expression package.Tom Lane2021-01-06
This module provides a function test_regex() that is functionally rather like regexp_matches(), but with additional debugging-oriented options and additional output. The debug options are somewhat obscure; they are chosen to match the API of the test harness that Henry Spencer wrote way-back-when for use in Tcl. With this, we can import all the test cases that Spencer wrote originally, even for regex functionality that we don't currently expose in Postgres. This seems necessary because we can no longer rely on Tcl to act as upstream and verify any fixes or improvements that we make. In addition to Spencer's tests, I added a few for lookbehind constraints (which we added in 2015, and Tcl still hasn't absorbed) that are modeled on his tests for lookahead constraints. After looking at code coverage reports, I also threw in a couple of tests to more fully exercise our "high colormap" logic. According to my testing, this brings the check-world coverage for src/backend/regex/ from 71.1% to 86.7% of lines. (coverage.postgresql.org shows a slightly different number, which I think is because it measures a non-assert build.) Discussion: https://postgr.es/m/2873268.1609732164@sss.pgh.pa.us