aboutsummaryrefslogtreecommitdiff
path: root/src/backend/regex
Commit message (Collapse)AuthorAge
* Fix two low-probability memory leaks in regular expression parsing.Tom Lane2014-07-18
| | | | | | | | | | | | | | | | | If pg_regcomp failed after having invoked markst/cleanst, it would leak any "struct subre" nodes it had created. (We've already detected all regex syntax errors at that point, so the only likely causes of later failure would be query cancel or out-of-memory.) To fix, make sure freesrnode knows the difference between the pre-cleanst and post-cleanst cleanup procedures. Add some documentation of this less-than-obvious point. Also, newlacon did the wrong thing with an out-of-memory failure from realloc(), so that the previously allocated array would be leaked. Both of these are pretty low-probability scenarios, but a bug is a bug, so patch all the way back. Per bug #10976 from Arthur O'Dwyer.
* Remove tabs after spaces in C commentsBruce Momjian2014-05-06
| | | | | | | | | This was not changed in HEAD, but will be done later as part of a pgindent run. Future pgindent runs will also do this. Report by Tom Lane Backpatch through all supported branches, but not HEAD
* Allow regex operations to be terminated early by query cancel requests.Tom Lane2014-03-01
| | | | | | | | | | | | | | | | | | | | | | | | | The regex code didn't have any provision for query cancel; which is unsurprising given its non-Postgres origin, but still problematic since some operations can take a long time. Introduce a callback function to check for a pending query cancel or session termination request, and call it in a couple of strategic spots where we can make the regex code exit with an error indicator. If we ever actually split out the regex code as a standalone library, some additional work will be needed to let the cancel callback function be specified externally to the library. But that's straightforward (certainly so by comparison to putting the locale-dependent character classification logic on a similar arms-length basis), and there seems no need to do it right now. A bigger issue is that there may be more places than these two where we need to check for cancels. We can always add more checks later, now that the infrastructure is in place. Since there are known examples of not-terribly-long regexes that can lock up a backend for a long time, back-patch to all supported branches. I have hopes of fixing the known performance problems later, but adding query cancel ability seems like a good idea even if they were all fixed.
* Fix regex match failures for backrefs combined with non-greedy quantifiers.Tom Lane2013-07-18
| | | | | | | | | | | | An ancient logic error in cfindloop() could cause the regex engine to fail to find matches that begin later than the start of the string. This function is only used when the regex pattern contains a back reference, and so far as we can tell the error is only reachable if the pattern is non-greedy (i.e. its first quantifier uses the ? modifier). Furthermore, the actual match must begin after some potential match that satisfies the DFA but then fails the back-reference's match test. Reported and fixed by Jeevan Chalke, with cosmetic adjustments by me.
* Fix crash on compiling a regular expression with more than 32k colors.Heikki Linnakangas2013-04-04
| | | | | | Throw an error instead. Backpatch to all supported branches.
* Fix infinite-loop risk in fixempties() stage of regex compilation.Tom Lane2013-03-07
| | | | | | | | | | | The previous coding of this function could get into situations where it would never terminate, because successive passes would re-add EMPTY arcs that had been removed by the previous pass. Rewrite the function completely using a new algorithm that is guaranteed to terminate, and also seems to be usually faster than the old one. Per Tcl bugs 3604074 and 3606683. Tom Lane and Don Porter
* Add missing error check in regexp parser.Tom Lane2013-02-27
| | | | | | | | | | | | parseqatom() failed to check for an error return (NULL result) from its recursive call to parsebranch(), and in consequence could crash with a null-pointer dereference after an error return. This bug has been there since day one, but wasn't noticed before, probably because most error cases in parsebranch() didn't actually lead to returning NULL. Add the missing error check, and also tweak parsebranch() to exit in a less indirect fashion after a call to parseqatom() fails. Report by Tomasz Karlik, fix by me.
* Prevent corner-case core dump in rfree().Tom Lane2012-07-15
| | | | | | | | | | | | | rfree() failed to cope with the case that pg_regcomp() had initialized the regex_t struct but then failed to allocate any memory for re->re_guts (ie, the first malloc call in pg_regcomp() failed). It would try to touch the guts struct anyway, and thus dump core. This is a sufficiently narrow corner case that it's not surprising it's never been seen in the field; but still a bug is a bug, so patch all active branches. Noted while investigating whether we need to call pg_regfree after a failure return from pg_regcomp. Other than this bug, it turns out we don't, so adjust comments appropriately.
* Back-patch fix for extraction of fixed prefixes from regular expressions.Tom Lane2012-07-10
| | | | | | Back-patch of commits 628cbb50ba80c83917b07a7609ddec12cda172d0 and c6aae3042be5249e672b731ebeb21875b5343010. This has been broken since 7.3, so back-patch to all supported branches.
* Fix regex back-references that are directly quantified with *.Tom Lane2012-02-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The syntax "\n*", that is a backref with a * quantifier directly applied to it, has never worked correctly in Spencer's library. This has been an open bug in the Tcl bug tracker since 2005: https://sourceforge.net/tracker/index.php?func=detail&aid=1115587&group_id=10894&atid=110894 The core of the problem is in parseqatom(), which first changes "\n*" to "\n+|" and then applies repeat() to the NFA representing the backref atom. repeat() thinks that any arc leading into its "rp" argument is part of the sub-NFA to be repeated. Unfortunately, since parseqatom() already created the arc that was intended to represent the empty bypass around "\n+", this arc gets moved too, so that it now leads into the state loop created by repeat(). Thus, what was supposed to be an "empty" bypass gets turned into something that represents zero or more repetitions of the NFA representing the backref atom. In the original example, in place of ^([bc])\1*$ we now have something that acts like ^([bc])(\1+|[bc]*)$ At runtime, the branch involving the actual backref fails, as it's supposed to, but then the other branch succeeds anyway. We could no doubt fix this by some rearrangement of the operations in parseqatom(), but that code is plenty ugly already, and what's more the whole business of converting "x*" to "x+|" probably needs to go away to fix another problem I'll mention in a moment. Instead, this patch suppresses the *-conversion when the target is a simple backref atom, leaving the case of m == 0 to be handled at runtime. This makes the patch in regcomp.c a one-liner, at the cost of having to tweak cbrdissect() a little. In the event I went a bit further than that and rewrote cbrdissect() to check all the string-length-related conditions before it starts comparing characters. It seems a bit stupid to possibly iterate through many copies of an n-character backreference, only to fail at the end because the target string's length isn't a multiple of n --- we could have found that out before starting. The existing coding could only be a win if integer division is hugely expensive compared to character comparison, but I don't know of any modern machine where that might be true. This does not fix all the problems with quantified back-references. In particular, the code is still broken for back-references that appear within a larger expression that is quantified (so that direct insertion of the quantification limits into the BACKREF node doesn't apply). I think fixing that will take some major surgery on the NFA code, specifically introducing an explicit iteration node type instead of trying to transform iteration into concatenation of modified regexps. Back-patch to all supported branches. In HEAD, also add a regression test case for this. (It may seem a bit silly to create a regression test file for just one test case; but I'm expecting that we will soon import a whole bunch of regex regression tests from Tcl, so might as well create the infrastructure now.)
* Pgindent run before 9.1 beta2.Bruce Momjian2011-06-09
|
* Insert dummy "break"s to silence compiler complaints.Tom Lane2011-04-10
| | | | | Apparently some compilers dislike a case label with nothing after it. Per buildfarm.
* Teach regular expression operators to honor collations.Tom Lane2011-04-10
| | | | | | | | | | | | | | This involves getting the character classification and case-folding functions in the regex library to use the collations infrastructure. Most of this work had been done already in connection with the upper/lower and LIKE logic, so it was a simple matter of transposition. While at it, split out these functions into a separate source file regc_pg_locale.c, so that they can be correctly labeled with the Postgres project's license rather than the Scriptics license. These functions are 100% Postgres-written code whereas what remains in regc_locale.c is still mostly not ours, so lumping them both under the same copyright notice was getting more and more misleading.
* pgindent run before PG 9.1 beta 1.Bruce Momjian2011-04-10
|
* Fix comparisons of pointers with zero to compare with NULL instead.Tom Lane2010-10-29
| | | | | | | Per C standard, these are semantically the same thing; but saying NULL when you mean NULL is good for readability. Marti Raudsepp, per results of INRIA's Coccinelle.
* Remove cvs keywords from all files.Magnus Hagander2010-09-20
|
* Tweak a couple of macros in the regex code to suppress compiler warningsTom Lane2010-08-02
| | | | | | | | from "clang". The VERR changes make an assignment unconditional, which is probably easier to read/understand anyway, and one can hardly argue that it's worth shaving cycles off the case of reporting another error when one has already been detected. The INSIST change limits where that macro can be used, but not in a way that creates a problem for any existing call.
* pgindent run for 9.0Bruce Momjian2010-02-26
|
* Change regexp engine's ccondissect/crevdissect routines to perform DFATom Lane2010-02-01
| | | | | | | | | | | | | | | | | matching before recursing instead of after. The DFA match eliminates unworkable midpoint choices a lot faster than the recursive check, in most cases, so doing it first can speed things up; particularly in pathological cases such as recently exhibited by Michael Glaesemann. In addition, apply some cosmetic changes that were applied upstream (in the Tcl project) at the same time, in order to sync with upstream version 1.15 of regexec.c. Upstream apparently intends to backpatch this, so I will too. The pathological behavior could be unpleasant if encountered in the field, which seems to justify any risk of introducing new bugs. Tom Lane, reviewed by Donal K. Fellows of Tcl project
* Fix some comments that got mangled by pgindent.Tom Lane2010-01-30
|
* Teach the regular expression functions to do case-insensitive matching andTom Lane2009-12-01
| | | | | | | | | | | | | | | | | | | | locale-dependent character classification properly when the database encoding is UTF8. The previous coding worked okay in single-byte encodings, or in any case for ASCII characters, but failed entirely on multibyte characters. The fix assumes that the <wctype.h> functions use Unicode code points as the wchar representation for Unicode, ie, wchar matches pg_wchar. This is only a partial solution, since we're still stupid about non-ASCII characters in multibyte encodings other than UTF8. The practical effect of that is limited, however, since those cases are generally Far Eastern glyphs for which concepts like case-folding don't apply anyway. Certainly all or nearly all of the field reports of problems have been about UTF8. A more general solution would require switching to the platform's wchar representation for all regex operations; which is possible but would have substantial disadvantages. Let's try this and see if it's sufficient in practice.
* 8.4 pgindent run, with new combined Linux/FreeBSD/MinGW typedef listBruce Momjian2009-06-11
| | | | provided by Andrew.
* Refactor backend makefiles to remove lots of duplicate codePeter Eisentraut2008-02-19
|
* Sync our regex code with upstream changes since last time we did this, whichTom Lane2008-02-14
| | | | | | | | | | | | | was Tcl 8.4.8. The main changes are to remove the never-fully-implemented code for multi-character collating elements, and to const-ify some stuff a bit more fully. In combination with the recent security patch, this commit brings us into line with Tcl 8.5.0. Note that I didn't make any effort to duplicate a lot of cosmetic changes that they made to bring their copy into line with their own style guidelines, such as adding braces around single-line IF bodies. Most of those we either had done already (such as ANSI-fication of function headers) or there is no point because pgindent would undo the change anyway.
* Fix assorted security-grade bugs in the regex engine. All of these problemsTom Lane2008-01-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | are shared with Tcl, since it's their code to begin with, and the patches have been copied from Tcl 8.5.0. Problems: CVE-2007-4769: Inadequate check on the range of backref numbers allows crash due to out-of-bounds read. CVE-2007-4772: Infinite loop in regex optimizer for pattern '($|^)*'. CVE-2007-6067: Very slow optimizer cleanup for regex with a large NFA representation, as well as crash if we encounter an out-of-memory condition during NFA construction. Part of the response to CVE-2007-6067 is to put a limit on the number of states in the NFA representation of a regex. This seems needed even though the within-the-code problems have been corrected, since otherwise the code could try to use very large amounts of memory for a suitably-crafted regex, leading to potential DOS by driving the system into swap, activating a kernel OOM killer, etc. Although there are certainly plenty of ways to drive the system into effective DOS with poorly-written SQL queries, these problems seem worth treating as security issues because many applications might accept regex search patterns from untrustworthy sources. Thanks to Will Drewry of Google for reporting these problems. Patches by Will Drewry and Tom Lane. Security: CVE-2007-4769, CVE-2007-4772, CVE-2007-6067
* pgindent run for 8.3.Bruce Momjian2007-11-15
|
* Add a useless return statement to suppress a warning seen with someTom Lane2007-10-22
| | | | | | versions of gcc (I'm seeing it with Apple's gcc 4.0.1). I think the reason we did not see this before was that the assert() macros in the regex code were all no-ops till recently.
* Make dumpcolors() have tolerable performance when using 32-bit chr,Tom Lane2007-10-06
| | | | | | as we do (and upstream Tcl doesn't). The loop limit might be subject to negotiation if anyone ever tries to do regex debugging in Far Eastern languages, but for now 1000 seems plenty. CHR_MAX was right out :-(
* Adjust some regex debugging printouts to not give wrong-format-widthTom Lane2007-10-06
| | | | | warnings on a 64-bit machine. Noted while chasing a recent regex bug report.
* Wording cleanup for error messages. Also change can't -> cannot.Bruce Momjian2007-02-01
| | | | | | | | | | | | | | Standard English uses "may", "can", and "might" in different ways: may - permission, "You may borrow my rake." can - ability, "I can lift that log." might - possibility, "It might rain today." Unfortunately, in conversational English, their use is often mixed, as in, "You may use this variable to do X", when in fact, "can" is a better choice. Similarly, "It may crash" is better stated, "It might crash".
* Re-run pgindent, fixing a problem where comment lines after a blankBruce Momjian2005-11-22
| | | | | | | | | comment line where output as too long, and update typedefs for /lib directory. Also fix case where identifiers were used as variable names in the backend, but as typedefs in ecpg (favor the backend for indenting). Backpatch to 8.1.X.
* Standard pgindent run for 8.1.Bruce Momjian2005-10-15
|
* Clean up possibly-uninitialized-variable warnings reported by gcc 4.x.Tom Lane2005-09-24
|
* I made the patch that implements regexp_replace again.Bruce Momjian2005-07-10
| | | | | | | | | | | | | | | | | | | | | | | The specification of this function is as follows. regexp_replace(source text, pattern text, replacement text, [flags text]) returns text Replace string that matches to regular expression in source text to replacement text. - pattern is regular expression pattern. - replacement is replace string that can use '\1'-'\9', and '\&'. '\1'-'\9': back reference to the n'th subexpression. '\&' : entire matched string. - flags can use the following values: g: global (replace all) i: ignore case When the flags is not specified, case sensitive, replace the first instance only. Atsushi Ogawa
* Add parentheses to macros when args are used in computations. WithoutBruce Momjian2005-05-25
| | | | them, the executation behavior could be unexpected.
* Install Tcl regex fixes to sync our regex engine with Tcl 8.4.8 (up fromTom Lane2004-11-24
| | | | | 8.4.1). This corrects some curious regex bugs, though not the greediness issue I was hoping to find a solution for :-(
* Solve the 'Turkish problem' with undesirable locale behavior for caseTom Lane2004-05-07
| | | | | | | | | | | | | conversion of basic ASCII letters. Remove all uses of strcasecmp and strncasecmp in favor of new functions pg_strcasecmp and pg_strncasecmp; remove most but not all direct uses of toupper and tolower in favor of pg_toupper and pg_tolower. These functions use the same notions of case folding already developed for identifier case conversion. I left the straight locale-based folding in place for situations where we are just manipulating user data and not trying to match it to built-in strings --- for example, the SQL upper() function is still locale dependent. Perhaps this will prove not to be what's wanted, but at the moment we can initdb and pass regression tests in Turkish locale.
* $Header: -> $PostgreSQL Changes ...PostgreSQL Daemon2003-11-29
|
* Fix broken definition of :print: character class, per Bruno Wolff.Tom Lane2003-09-29
| | | | | Also, make :alnum: character class directly dependent on isalnum() rather than guessing.
* Another pgindent run with updated typedefs.Bruce Momjian2003-08-08
|
* pgindent run.Bruce Momjian2003-08-04
|
* Replace regular expression package with Henry Spencer's latest versionTom Lane2003-02-05
| | | | | | | (extracted from Tcl 8.4.1 release, as Henry still hasn't got round to making it a separate library). This solves a performance problem for multibyte, as well as upgrading our regexp support to match recent Tcl and nearly match recent Perl.
* This patch removes a bunch of superfluous #include directives: ifBruce Momjian2002-11-08
| | | | | | | | postgres.h or c.h includes a system header (such as stdio.h or stdlib.h), there's no need to specifically include it in any of the .c files in the backend. Neil Conway
* Remove retest Makefile entry because it does not compile.Bruce Momjian2002-09-16
|
* pgindent run.Bruce Momjian2002-09-04
|
* Remove all traces of multibyte and locale options. Clean up commentsPeter Eisentraut2002-09-03
| | | | referring to "multibyte" where it really means character encoding.
* Remove sys/types.h in files that include postgres.h, and hence c.h,Bruce Momjian2002-09-02
| | | | because c.h has sys/types.h.
* Remove #ifdef MULTIBYTE per hackers list discussion.Tatsuo Ishii2002-08-29
|
* Implement SQL99 OVERLAY(). Allows substitution of a substring in a string.Thomas G. Lockhart2002-06-11
| | | | | | | | | | | Implement SQL99 SIMILAR TO as a synonym for our existing operator "~". Implement SQL99 regular expression SUBSTRING(string FROM pat FOR escape). Extend the definition to make the FOR clause optional. Define textregexsubstr() to actually implement this feature. Update the regression test to include these new string features. All tests pass. Rename the regular expression support routines from "pg95_xxx" to "pg_xxx". Define CREATE CHARACTER SET in the parser per SQL99. No implementation yet.
* Fix code to work when isalpha and friends are macros, not functions.Tom Lane2002-05-05
|