Repair bug in regexp split performance improvements.

Commit c8ea87e4b introduced a temporary conversion buffer for substrings extracted during regexp splits. Unfortunately the code that sized it was failing to ignore the effects of ignored degenerate regexp matches, so for regexp_split_* calls it could under-size the buffer in such cases.

Fix, and add some regression test cases (though those will only catch the bug if run in a multibyte encoding).

Backpatch to 9.3 as the faulty code was.

Thanks to the PostGIS project, Regina Obe and Paul Ramsey for the report (via IRC) and assistance in analysis. Patch by me.
author: Andrew Gierth <rhodiumtoad@postgresql.org> 2018-09-12 19:31:06 +0100
committer: Andrew Gierth <rhodiumtoad@postgresql.org> 2018-09-12 19:31:06 +0100
commit: b7f6bcbffcc0b41d783c0c9c61766428159969ff (patch)
tree: 97f6adb12342296dd1d7dba55586141914da4bb1 /src/backend/utils/adt/regexp.c
parent: ba37349cff781120f61b2778257f594f0d10253c (diff)
download: postgresql-b7f6bcbffcc0b41d783c0c9c61766428159969ff.tar.gz
postgresql-b7f6bcbffcc0b41d783c0c9c61766428159969ff.zip
1 files changed, 10 insertions, 6 deletions
diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index d8b69212342..171fcc8a448 100644
--- a/src/backend/utils/adt/regexp.c
+++ b/src/backend/utils/adt/regexp.c
@@ -982,6 +982,7 @@ setup_regexp_matches(text *orig_str, text *pattern, pg_re_flags *re_flags,
 	int			array_len;
 	int			array_idx;
 	int			prev_match_end;
+	int			prev_valid_match_end;
 	int			start_search;
 	int			maxlen = 0;		/* largest fetch length in characters */
 
@@ -1024,6 +1025,7 @@ setup_regexp_matches(text *orig_str, text *pattern, pg_re_flags *re_flags,
 
 	/* search for the pattern, perhaps repeatedly */
 	prev_match_end = 0;
+	prev_valid_match_end = 0;
 	start_search = 0;
 	while (RE_wchar_execute(cpattern, wide_str, wide_len, start_search,
 							pmatch_len, pmatch))
@@ -1076,13 +1078,15 @@ setup_regexp_matches(text *orig_str, text *pattern, pg_re_flags *re_flags,
 			matchctx->nmatches++;
 
 			/*
-			 * check length of unmatched portion between end of previous match
-			 * and start of current one
+			 * check length of unmatched portion between end of previous valid
+			 * (nondegenerate, or degenerate but not ignored) match and start
+			 * of current one
 			 */
 			if (fetching_unmatched &&
 				pmatch[0].rm_so >= 0 &&
-				(pmatch[0].rm_so - prev_match_end) > maxlen)
-				maxlen = (pmatch[0].rm_so - prev_match_end);
+				(pmatch[0].rm_so - prev_valid_match_end) > maxlen)
+				maxlen = (pmatch[0].rm_so - prev_valid_match_end);
+			prev_valid_match_end = pmatch[0].rm_eo;
 		}
 		prev_match_end = pmatch[0].rm_eo;
 
@@ -1108,8 +1112,8 @@ setup_regexp_matches(text *orig_str, text *pattern, pg_re_flags *re_flags,
 	 * input string
 	 */
 	if (fetching_unmatched &&
-		(wide_len - prev_match_end) > maxlen)
-		maxlen = (wide_len - prev_match_end);
+		(wide_len - prev_valid_match_end) > maxlen)
+		maxlen = (wide_len - prev_valid_match_end);
 
 	/*
 	 * Keep a note of the end position of the string for the benefit of
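
The effect of the change on the buffer sizing can be seen in isolation with a small standalone sketch. This is not the PostgreSQL code: the Span type and max_piece_len() are hypothetical names introduced here, the match list is written by hand rather than produced by the regexp engine, and the degenerate-match test is a simplified version of the condition guarding the block patched above. The flag selects between measuring gaps from the end of the last match of any kind (the old rule) and from the end of the last match that was kept (the fixed rule).

```c
#include <assert.h>

/* Hypothetical stand-in for one match: [so, eo) measured in characters. */
typedef struct
{
	int			so;
	int			eo;
} Span;

/*
 * Largest unmatched piece, in characters, when splitting a string of
 * wide_len characters at the given matches.
 *
 * track_valid_only = 0: gaps are measured from the end of the previous
 * match, kept or not (old rule).
 * track_valid_only = 1: gaps are measured from the end of the previous
 * match that was kept (fixed rule).
 */
static int
max_piece_len(const Span *m, int nmatches, int wide_len, int track_valid_only)
{
	int			prev_match_end = 0;
	int			prev_valid_match_end = 0;
	int			maxlen = 0;
	int			base;
	int			i;

	assert(wide_len >= 0);

	for (i = 0; i < nmatches; i++)
	{
		/* keep the match unless it is a degenerate one that a split ignores */
		if (m[i].so < wide_len && m[i].eo > prev_match_end)
		{
			base = track_valid_only ? prev_valid_match_end : prev_match_end;
			if (m[i].so - base > maxlen)
				maxlen = m[i].so - base;
			prev_valid_match_end = m[i].eo;
		}
		prev_match_end = m[i].eo;
	}

	/* piece between the last match and the end of the string */
	base = track_valid_only ? prev_valid_match_end : prev_match_end;
	if (wide_len - base > maxlen)
		maxlen = wide_len - base;

	return maxlen;
}
```

For a 6-character string whose only match is the zero-length span {6, 6} at the end of the string (the kind of match a pattern such as '$' would be expected to produce), the old rule returns 0 while the fixed rule returns 6, which is the length of the single piece the split actually has to fetch.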