aboutsummaryrefslogtreecommitdiff
path: root/src/backend
Commit message (Collapse)AuthorAge
...
* Rename a parse node to be more generalPeter Eisentraut2021-03-25
| | | | | | | | | A WHERE clause will be used for row filtering in logical replication. We already have a similar node: 'WHERE (condition here)'. Let's rename the node to a generic name and use it for row filtering too. Author: Euler Taveira <euler.taveira@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/CAHE3wggb715X+mK_DitLXF25B=jE6xyNCH4YOwM860JR7HarGQ@mail.gmail.com
* Sanitize the term "combo CID" in code commentsMichael Paquier2021-03-25
| | | | | | | | | Combo CIDs were referred in the code comments using different terms across various places of the code, so unify a bit the term used with what is currently in use in some of the READMEs. Author: "Hou, Zhijie" Discussion: https://postgr.es/m/1d42865c91404f46af4562532fdbea31@G08CNEXMBPEKD05.g08.fujitsu.local
* Fix bug in WAL replay of COMMIT_TS_SETTS record.Fujii Masao2021-03-25
| | | | | | | | | | | | | | | | | | Previously the WAL replay of COMMIT_TS_SETTS record called TransactionTreeSetCommitTsData() with the argument write_xlog=true, which generated and wrote new COMMIT_TS_SETTS record. This should not be acceptable because it's during recovery. This commit fixes the WAL replay of COMMIT_TS_SETTS record so that it calls TransactionTreeSetCommitTsData() with write_xlog=false and doesn't generate new WAL during recovery. Back-patch to all supported branches. Reported-by: lx zou <zoulx1982@163.com> Author: Fujii Masao Reviewed-by: Alvaro Herrera Discussion: https://postgr.es/m/16931-620d0f2fdc6108f1@postgresql.org
* Improve connection denied error message during recovery.Fujii Masao2021-03-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously when an archive recovery or a standby was starting and reached the consistent recovery state but hot_standby was configured to off, the error message when a client connectted was "the database system is starting up", which was needless confusing and not really all that accurate either. This commit improves the connection denied error message during recovery, as follows, so that the users immediately know that their servers are configured to deny those connections. - If hot_standby is disabled, the error message "the database system is not accepting connections" and the detail message "Hot standby mode is disabled." are output when clients connect while an archive recovery or a standby is running. - If hot_standby is enabled, the error message "the database system is not yet accepting connections" and the detail message "Consistent recovery state has not been yet reached." are output when clients connect until the consistent recovery state is reached and postmaster starts accepting read only connections. This commit doesn't change the connection denied error message of "the database system is starting up" during normal server startup and crash recovery. Because it's still suitable for those situations. Author: James Coleman Reviewed-by: Alvaro Herrera, Andres Freund, David Zhang, Tom Lane, Fujii Masao Discussion: https://postgr.es/m/CAAaqYe8h5ES_B=F_zDT+Nj9XU7YEwNhKhHA2RE4CFhAQ93hfig@mail.gmail.com
* Fix stray double semicolonsPeter Eisentraut2021-03-24
| | | | Reported-by: John Naylor <john.naylor@enterprisedb.com>
* Change checkpoint_completion_target default to 0.9Stephen Frost2021-03-24
| | | | | | | | | | | | | | | | | | | Common recommendations are that the checkpoint should be spread out as much as possible, provided we avoid having it take too long. This change updates the default to 0.9 (from 0.5) to match that recommendation. There was some debate about possibly removing the option entirely but it seems there may be some corner-cases where having it set much lower to try to force the checkpoint to be as fast as possible could result in fewer periods of time of reduced performance due to kernel flushing. General agreement is that the "spread more" is the preferred approach though and those who need to tune away from that value are much less common. Reviewed-By: Michael Paquier, Peter Eisentraut, Tom Lane, David Steele, Nathan Bossart Discussion: https://postgr.es/m/20201207175329.GM16415%40tamriel.snowman.net
* Tidy up more loose ends related to configurable TOAST compression.Robert Haas2021-03-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | Change the default_toast_compression GUC to be an enum rather than a string. Earlier, uncommitted versions of the patch supported using CREATE ACCESS METHOD to add new compression methods to a running system, but that idea was dropped before commit. So, we can simplify the GUC handling as well, which has the nice side effect of improving the error messages. While updating the documentation to reflect the new GUC type, also move it back to the right place in the list. I moved this while revising what became commit 24f0e395ac5892cd12e8914646fe921fac5ba23d, but apparently the intended ordering is "alphabetical" rather than "whatever Robert thinks looks nice." Rejigger things to avoid having access/toast_compression.h depend on utils/guc.h, so that we don't end up with every file that includes it also depending on something largely unrelated. Move a few inline functions back into the C source file partly to help reduce dependencies and partly just to avoid clutter. A few very minor cosmetic fixes. Original patch by Justin Pryzby, but very heavily edited by me, and reverse reviewed by him and also reviewed by by Tom Lane. Discussion: http://postgr.es/m/CA+TgmoYp=GT_ztUCeZg2i4hkHAQv8o=-nVJ1-TKWTG1zQOmOpg@mail.gmail.com
* Add date_bin functionPeter Eisentraut2021-03-24
| | | | | | | | | | | | Similar to date_trunc, but allows binning by an arbitrary interval rather than just full units. Author: John Naylor <john.naylor@enterprisedb.com> Reviewed-by: David Fetter <david@fetter.org> Reviewed-by: Isaac Morland <isaac.morland@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Artur Zakirov <zaartur@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CACPNZCt4buQFRgy6DyjuZS-2aPDpccRkrJBmgUfwYc1KiaXYxg@mail.gmail.com
* Improve an error messagePeter Eisentraut2021-03-24
| | | | Make it the same as another nearby message.
* Revert "Enable parallel SELECT for "INSERT INTO ... SELECT ..."."Amit Kapila2021-03-24
| | | | | | | | | | | | | | | | | | | To allow inserts in parallel-mode this feature has to ensure that all the constraints, triggers, etc. are parallel-safe for the partition hierarchy which is costly and we need to find a better way to do that. Additionally, we could have used existing cached information in some cases like indexes, domains, etc. to determine the parallel-safety. List of commits reverted, in reverse chronological order: ed62d3737c Doc: Update description for parallel insert reloption. c8f78b6161 Add a new GUC and a reloption to enable inserts in parallel-mode. c5be48f092 Improve FK trigger parallel-safety check added by 05c8482f7f. e2cda3c20a Fix use of relcache TriggerDesc field introduced by commit 05c8482f7f. e4e87a32cc Fix valgrind issue in commit 05c8482f7f. 05c8482f7f Enable parallel SELECT for "INSERT INTO ... SELECT ...". Discussion: https://postgr.es/m/E1lMiB9-0001c3-SY@gemulon.postgresql.org
* Rename wait event WalrcvExit to WalReceiverExit.Fujii Masao2021-03-24
| | | | | | | | | | | Commit de829ddf23 added wait event WalrcvExit. But its name is not consistent with other wait events like WalReceiverMain or WalReceiverWaitStart, etc. So this commit renames WalrcvExit to WalReceiverExit. Author: Fujii Masao Reviewed-by: Thomas Munro Discussion: https://postgr.es/m/cced9995-8fa2-7b22-9d91-3f22a2b8c23c@oss.nttdata.com
* Log when GetNewOidWithIndex() fails to find unused OID many times.Fujii Masao2021-03-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | GetNewOidWithIndex() generates a new OID one by one until it finds one not in the relation. If there are very long runs of consecutive existing OIDs, GetNewOidWithIndex() needs to iterate many times in the loop to find unused OID. Since TOAST table can have a large number of entries and there can be such long runs of OIDs, there is the case where it takes so many iterations to find new OID not in TOAST table. Furthermore if all (i.e., 2^32) OIDs are already used, GetNewOidWithIndex() enters something like busy loop and repeats the iterations until at least one OID is marked as unused. There are some reported troubles caused by a large number of iterations in GetNewOidWithIndex(). For example, when inserting a billion of records into the table, all the backends doing that insertion operation got hang with 100% CPU usage at some point. Previously there was no easy way to detect that GetNewOidWithIndex() failed to find unused OID many times. So, for example, gdb full backtrace of hanged backends needed to be taken, in order to investigate that trouble. This is inconvenient and may not be available in some production environments. To provide easy way for that, this commit makes GetNewOidWithIndex() log that it iterates more than GETNEWOID_LOG_THRESHOLD but have not yet found OID unused in the relation. Also this commit makes it repeat logging with exponentially increasing intervals until it iterates more than GETNEWOID_LOG_MAX_INTERVAL, and makes it finally repeat logging every GETNEWOID_LOG_MAX_INTERVAL unless an unused OID is found. Those macro variables are used not to fill up the server log with the similar messages. In the discusion at pgsql-hackers, there was another idea to report the lots of iterations in GetNewOidWithIndex() via wait event. But since GetNewOidWithIndex() traverses indexes to find unused OID and which will do I/O, acquire locks, etc, which will overwrite the wait event and reset it to nothing once done. So that idea doesn't work well, and we didn't adopt it. Author: Tomohiro Hiramitsu Reviewed-by: Tatsuhito Kasahara, Kyotaro Horiguchi, Tom Lane, Fujii Masao Discussion: https://postgr.es/m/16722-93043fb459a41073@postgresql.org
* Reword slightly logs generated for index stats in autovacuumMichael Paquier2021-03-24
| | | | | | | | | Using "remain" is confusing, as it implies that the index file can shrink. Instead, use "in total". Per discussion with Peter Geoghegan. Discussion: https://postgr.es/m/CAH2-WzkYgHZzpGOwR14CScJsjaQpvJrEkEfkh_=wGhzLb=yVdQ@mail.gmail.com
* Allow composite types in catalog bootstrapTomas Vondra2021-03-24
| | | | | | | | | | When resolving types during catalog bootstrap, try to reload the pg_type contents if a type is not found. That allows catalogs to contain composite types, e.g. row types for other catalogs. Author: Justin Pryzby Reviewed-by: Dean Rasheed, Tomas Vondra Discussion: https://postgr.es/m/ad7891d2-e90c-b446-9fe2-7419143847d7%40enterprisedb.com
* Convert Typ from array to list in bootstrapTomas Vondra2021-03-24
| | | | | | | | | | It's a bit easier and more convenient to free and reload a List, compared to a plain array. This will be helpful when allowing catalogs to contain composite types. Author: Justin Pryzby Reviewed-by: Dean Rasheed, Tomas Vondra Discussion: https://postgr.es/m/ad7891d2-e90c-b446-9fe2-7419143847d7%40enterprisedb.com
* nbtree VACUUM: Cope with buggy opclasses.Peter Geoghegan2021-03-23
| | | | | | | | | | | | | Teach nbtree VACUUM to press on with vacuuming in the event of a page deletion attempt that fails to "re-find" a downlink for its child/target page. There is no good reason to treat this as an irrecoverable error. But there is a good reason not to: pressing on at this point removes any question of VACUUM not making progress solely due to misbehavior from user-defined operator class code. Discussion: https://postgr.es/m/CAH2-Wzma5G9CTtMjbrXTwOym+U=aWg-R7=-htySuztgoJLvZXg@mail.gmail.com
* Avoid possible crash while finishing up a heap rewrite.Tom Lane2021-03-23
| | | | | | | | | | | | | | | | | | | | end_heap_rewrite was not careful to ensure that the target relation is open at the smgr level before performing its final smgrimmedsync. In ordinary cases this is no problem, because it would have been opened earlier during the rewrite. However a crash can be reproduced by re-clustering an empty table with CLOBBER_CACHE_ALWAYS enabled. Although that exact scenario does not crash in v13, I think that's a chance result of unrelated planner changes, and the problem is likely still reachable with other test cases. The true proximate cause of this failure is commit c6b92041d, which replaced a call to heap_sync (which was careful about opening smgr) with a direct call to smgrimmedsync. Hence, back-patch to v13. Amul Sul, per report from Neha Sharma; cosmetic changes and test case by me. Discussion: https://postgr.es/m/CANiYTQsU7yMFpQYnv=BrcRVqK_3U3mtAzAsJCaqtzsDHfsUbdQ@mail.gmail.com
* Add bit_count SQL functionPeter Eisentraut2021-03-23
| | | | | | | | | | | This function for bit and bytea counts the set bits in the bit or byte string. Internally, we use the existing popcount functionality. For the name, after some discussion, we settled on bit_count, which also exists with this meaning in MySQL, Java, and Python. Author: David Fetter <david@fetter.org> Discussion: https://www.postgresql.org/message-id/flat/20201230105535.GJ13234@fetter.org
* Add per-index stats information in verbose logs of autovacuumMichael Paquier2021-03-23
| | | | | | | | | | | | | | | | | | | Once a relation's autovacuum is completed, the logs include more information about this relation state if the threshold of log_autovacuum_min_duration (or its relation option) is reached, with for example contents about the statistics of the VACUUM operation for the relation, WAL and system usage. This commit adds more information about the statistics of the relation's indexes, with one line of logs generated for each index. The index stats were already calculated, but not printed in the context of autovacuum yet. While on it, some refactoring is done to keep track of the index statistics directly within LVRelStats, simplifying some routines related to parallel VACUUMs. Author: Masahiko Sawada Reviewed-by: Michael Paquier, Euler Taveira Discussion: https://postgr.es/m/CAD21AoAy6SxHiTivh5yAPJSUE4S=QRPpSZUdafOSz0R+fRcM6Q@mail.gmail.com
* Fix dangling pointer reference in stream_cleanup_files.Amit Kapila2021-03-23
| | | | | | | We can't access the entry after it is removed from dynahash. Author: Peter Smith Discussion: https://postgr.es/m/CAHut+Ps-pL++f6CJwPx2+vUqXuew=Xt-9Bi-6kCyxn+Fwi2M7w@mail.gmail.com
* Use correct spelling of statistics kindTomas Vondra2021-03-23
| | | | | | | | A couple error messages and comments used 'statistic kind', not the correct 'statistics kind'. Fix and backpatch all the way back to 10, where extended statistics were introduced. Backpatch-through: 10
* Change the type of WalReceiverWaitStart wait event from Client to IPC.Fujii Masao2021-03-23
| | | | | | | | | | | | | Previously the type of this wait event was Client. But while this wait event is being reported, walreceiver process is waiting for the startup process to set initial data for streaming replication. It's not waiting for any activity on a socket connected to a user application or walsender. So this commit changes the type for WalReceiverWaitStart wait event to IPC. Author: Fujii Masao Reviewed-by: Kyotaro Horiguchi Discussion: https://postgr.es/m/cdacc27c-37ff-f1a4-20e2-ce19933abfcc@oss.nttdata.com
* Add macro RelationIsPermanent() to report relation permanenceBruce Momjian2021-03-22
| | | | | | | | | | | | | Previously, to check relation permanence, the Relation's Form_pg_class structure member relpersistence was compared to the value RELPERSISTENCE_PERMANENT ("p"). This commit adds the macro RelationIsPermanent() and is used in appropirate places to simplify the code. This matches other RelationIs* macros. This macro will be used in more places in future cluster file encryption patches. Discussion: https://postgr.es/m/20210318153134.GH20766@tamriel.snowman.net
* Optimize allocations in bringetbitmapTomas Vondra2021-03-23
| | | | | | | | | | | | | | | | The bringetbitmap function allocates memory for various purposes, which may be quite expensive, depending on the number of scan keys. Instead of allocating them separately, allocate one bit chunk of memory an carve it into smaller pieces as needed - all the pieces have the same lifespan, and it saves quite a bit of CPU and memory overhead. Author: Tomas Vondra <tomas.vondra@postgresql.org> Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Mark Dilger <hornschnorter@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Masahiko Sawada <masahiko.sawada@enterprisedb.com> Reviewed-by: John Naylor <john.naylor@enterprisedb.com> Discussion: https://postgr.es/m/c1138ead-7668-f0e1-0638-c3be3237e812@2ndquadrant.com
* Move IS [NOT] NULL handling from BRIN support functionsTomas Vondra2021-03-23
| | | | | | | | | | | | | | | | | | | | The handling of IS [NOT] NULL clauses is independent of an opclass, and most of the code was exactly the same in both minmax and inclusion. So instead move the code from support procedures to the AM. This simplifies the code - especially the support procedures - quite a bit, as they don't need to care about NULL values and flags at all. It also means the IS [NOT] NULL clauses can be evaluated without invoking the support procedure. Author: Tomas Vondra <tomas.vondra@postgresql.org> Author: Nikita Glukhov <n.gluhov@postgrespro.ru> Reviewed-by: Nikita Glukhov <n.gluhov@postgrespro.ru> Reviewed-by: Mark Dilger <hornschnorter@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Masahiko Sawada <masahiko.sawada@enterprisedb.com> Reviewed-by: John Naylor <john.naylor@enterprisedb.com> Discussion: https://postgr.es/m/c1138ead-7668-f0e1-0638-c3be3237e812@2ndquadrant.com
* Pass all scan keys to BRIN consistent function at onceTomas Vondra2021-03-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit changes how we pass scan keys to BRIN consistent function. Instead of passing them one by one, we now pass all scan keys for a given attribute at once. That makes the consistent function a bit more complex, as it has to loop through the keys, but it does allow more elaborate opclasses that can use multiple keys to eliminate ranges much more effectively. The existing BRIN opclasses (minmax, inclusion) don't really benefit from this change. The primary purpose is to allow future opclases to benefit from seeing all keys at once. This does change the BRIN API, because the signature of the consistent function changes (a new parameter with number of scan keys). So this breaks existing opclasses, and will require supporting two variants of the code for different PostgreSQL versions. We've considered supporting two variants of the consistent, but we've decided not to do that. Firstly, there's another patch that moves handling of NULL values from the opclass, which means the opclasses need to be updated anyway. Secondly, we're not aware of any out-of-core BRIN opclasses, so it does not seem worth the extra complexity. Bump catversion, because of pg_proc changes. Author: Tomas Vondra <tomas.vondra@postgresql.org> Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Mark Dilger <hornschnorter@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: John Naylor <john.naylor@enterprisedb.com> Reviewed-by: Nikita Glukhov <n.gluhov@postgrespro.ru> Discussion: https://postgr.es/m/c1138ead-7668-f0e1-0638-c3be3237e812@2ndquadrant.com
* Move bsearch_arg to src/portTomas Vondra2021-03-23
| | | | | | Until now the bsearch_arg function was used only in extended statistics code, so it was defined in that code. But we already have qsort_arg in src/port, so let's move it next to it.
* Short-circuit slice requests that are for more than the object's size.Tom Lane2021-03-22
| | | | | | | | | | | | | | | | substring(), and perhaps other callers, isn't careful to pass a slice length that is no more than the datum's true size. Since toast_decompress_datum_slice's children will palloc the requested slice length, this can waste memory. Also, close study of the liblz4 documentation suggests that it is dependent on the caller to not ask for more than the correct amount of decompressed data; this squares with observed misbehavior with liblz4 1.8.3. Avoid these problems by switching to the normal full-decompression code path if the slice request is >= datum's decompressed size. Tom Lane and Dilip Kumar Discussion: https://postgr.es/m/507597.1616370729@sss.pgh.pa.us
* Mostly-cosmetic adjustments of TOAST-related macros.Tom Lane2021-03-22
| | | | | | | | | | | | | | | | | | | | | | | | The authors of bbe0a81db hadn't quite got the idea that macros named like SOMETHING_4B_C were only meant for internal endianness-related details in postgres.h. Choose more legible names for macros that are intended to be used elsewhere. Rearrange postgres.h a bit to clarify the separation between those internal macros and ones intended for wider use. Also, avoid using the term "rawsize" for true decompressed size; we've used "extsize" for that, because "rawsize" generally denotes total Datum size including header. This choice seemed particularly unfortunate in tests that were comparing one of these meanings to the other. This patch includes a couple of not-purely-cosmetic changes: be sure that the shifts aligning compression methods are unsigned (not critical today, but will be when compression method 2 exists), and fix broken definition of VARATT_EXTERNAL_GET_COMPRESSION (now VARATT_EXTERNAL_GET_COMPRESS_METHOD), whose callers worked only accidentally. Discussion: https://postgr.es/m/574197.1616428079@sss.pgh.pa.us
* Error on invalid TOAST compression in CREATE or ALTER TABLE.Robert Haas2021-03-22
| | | | | | | | | The previous coding treated an invalid compression method name as equivalent to the default, which is certainly not right. Justin Pryzby Discussion: http://postgr.es/m/20210321235544.GD4203@telsasoft.com
* More code cleanup for configurable TOAST compression.Robert Haas2021-03-22
| | | | | | | | | Remove unused macro. Fix confusion about whether a TOAST compression method is identified by an OID or a char. Justin Pryzby Discussion: http://postgr.es/m/20210321235544.GD4203@telsasoft.com
* Fix concurrency issues with WAL segment recycling on WindowsMichael Paquier2021-03-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit is mostly a revert of aaa3aed, that switched the routine doing the internal renaming of recycled WAL segments to use on Windows a combination of CreateHardLinkA() plus unlink() instead of rename(). As reported by several users of Postgres 13, this is causing concurrency issues when manipulating WAL segments, mostly in the shape of the following error: LOG: could not rename file "pg_wal/000000XX000000YY000000ZZ": Permission denied This moves back to a logic where a single rename() (well, pgrename() for Windows) is used. This issue has proved to be hard to hit when I tested it, facing it only once with an archive_command that was not able to do its work, so it is environment-sensitive. The reporters of this issue have been able to confirm that the situation improved once we switched back to a single rename(). In order to check things, I have provided to the reporters a patched build based on 13.2 with aaa3aed reverted, to test if the error goes away, and an unpatched build of 13.2 to test if the error still showed up (just to make sure that I did not mess up my build process). Extra thanks to Fujii Masao for pointing out what looked like the culprit commit, and to all the reporters for taking the time to test what I have sent them. Reported-by: Andrus, Guy Burgess, Yaroslav Pashinsky, Thomas Trenz Reviewed-by: Tom Lane, Andres Freund Discussion: https://postgr.es/m/3861ff1e-0923-7838-e826-094cc9bef737@hot.ee Discussion: https://postgr.es/m/16874-c3eecd319e36a2bf@postgresql.org Discussion: https://postgr.es/m/095ccf8d-7f58-d928-427c-b17ace23cae6@burgess.co.nz Discussion: https://postgr.es/m/16927-67c570d968c99567%40postgresql.org Discussion: https://postgr.es/m/YFBcRbnBiPdGZvfW@paquier.xyz Backpatch-through: 13
* Fix timeline assignment in checkpoints with 2PC transactionsMichael Paquier2021-03-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Any transactions found as still prepared by a checkpoint have their state data read from the WAL records generated by PREPARE TRANSACTION before being moved into their new location within pg_twophase/. While reading such records, the WAL reader uses the callback read_local_xlog_page() to read a page, that is shared across various parts of the system. This callback, since 1148e22a, has introduced an update of ThisTimeLineID when reading a record while in recovery, which is potentially helpful in the context of cascading WAL senders. This update of ThisTimeLineID interacts badly with the checkpointer if a promotion happens while some 2PC data is read from its record, as, by changing ThisTimeLineID, any follow-up WAL records would be written to an timeline older than the promoted one. This results in consistency issues. For instance, a subsequent server restart would cause a failure in finding a valid checkpoint record, resulting in a PANIC, for instance. This commit changes the code reading the 2PC data to reset the timeline once the 2PC record has been read, to prevent messing up with the static state of the checkpointer. It would be tempting to do the same thing directly in read_local_xlog_page(). However, based on the discussion that has led to 1148e22a, users may rely on the updates of ThisTimeLineID when a WAL record page is read in recovery, so changing this callback could break some cases that are working currently. A TAP test reproducing the issue is added, relying on a PITR to precisely trigger a promotion with a prepared transaction still tracked. Per discussion with Heikki Linnakangas, Kyotaro Horiguchi, Fujii Masao and myself. Author: Soumyadeep Chakraborty, Jimmy Yih, Kevin Yeap Discussion: https://postgr.es/m/CAE-ML+_EjH_fzfq1F3RJ1=XaaNG=-Jz-i3JqkNhXiLAsM3z-Ew@mail.gmail.com Backpatch-through: 10
* Fix assorted silliness in ATExecSetCompression().Tom Lane2021-03-21
| | | | | | | | | It's not okay to scribble directly on a syscache entry. Nor to continue accessing said entry after releasing it. Also get rid of not-used local variables. Per valgrind testing.
* Recycle nbtree pages deleted during same VACUUM.Peter Geoghegan2021-03-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Maintain a simple array of metadata about pages that were deleted during nbtree VACUUM's current btvacuumscan() call. Use this metadata at the end of btvacuumscan() to attempt to place newly deleted pages in the FSM without further delay. It might not yet be safe to place any of the pages in the FSM by then (they may not be deemed recyclable), but we have little to lose and plenty to gain by trying. In practice there is a very good chance that this will work out when vacuuming larger indexes, where scanning the index naturally takes quite a while. This commit doesn't change the page recycling invariants; it merely improves the efficiency of page recycling within the confines of the existing design. Recycle safety is a part of nbtree's implementation of what Lanin & Shasha call "the drain technique". The design happens to use transaction IDs (they're stored in deleted pages), but that in itself doesn't align the cutoff for recycle safety to any of the XID-based cutoffs used by VACUUM (e.g., OldestXmin). All that matters is whether or not _other_ backends might be able to observe various inconsistencies in the tree structure (that they cannot just detect and recover from by moving right). Recycle safety is purely a question of maintaining the consistency (or the apparent consistency) of a physical data structure. Note that running a simple serial test case involving a large range DELETE followed by a VACUUM VERBOSE will probably show that any newly deleted nbtree pages are not yet reusable/recyclable. This is expected in the absence of even one concurrent XID assignment. It is an old implementation restriction. In practice it's unlikely to be the thing that makes recycling remain unsafe, at least with larger indexes, where recycling newly deleted pages during the same VACUUM actually matters. An important high-level goal of this commit (as well as related recent commits e5d8a999 and 9f3665fb) is to make expensive deferred cleanup operations in index AMs rare in general. If index vacuuming frequently depends on the next VACUUM operation finishing off work that the current operation started, then the general behavior of index vacuuming is hard to predict. This is relevant to ongoing work that adds a vacuumlazy.c mechanism to skip index vacuuming in certain cases. Anything that makes the real world behavior of index vacuuming simpler and more linear will also make top-down modeling in vacuumlazy.c more robust. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CAH2-Wzk76_P=67iUscb1UN44-gyZL-KgpsXbSxq_bdcMa7Q+wQ@mail.gmail.com
* Suppress various new compiler warnings.Tom Lane2021-03-21
| | | | | | | | Compilers that don't understand that elog(ERROR) doesn't return issued warnings here. In the cases in libpq_pipeline.c, we were not exactly helping things by failing to mark pg_fatal() as noreturn. Per buildfarm.
* Move lwlock-release probe back where it belongsPeter Eisentraut2021-03-21
| | | | | | | | | | The documentation specifically states that lwlock-release fires before any released waiters have been awakened. It worked that way until ab5194e6f617a9a9e7aadb3dd1cee948a42d0755, where is seems to have been misplaced accidentally. Move it back where it belongs. Author: Craig Ringer <craig.ringer@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/CAGRY4nwxKUS_RvXFW-ugrZBYxPFFM5kjwKT5O+0+Stuga5b4+Q@mail.gmail.com
* Use valid compression method in brin_form_tupleTomas Vondra2021-03-21
| | | | | | | | | | | | | | | | When compressing the BRIN summary, we can't simply use the compression method from the indexed attribute. The summary may use a different data type, e.g. fixed-length attribute may have varlena summary, leading to compression failures. For the built-in BRIN opclasses this happens to work, because the summary uses the same data type as the attribute. When the data types match, we can inherit use the compression method specified for the attribute (it's copied into the index descriptor). Otherwise we don't have much choice and have to use the default one. Author: Tomas Vondra Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> Discussion: https://postgr.es/m/e0367f27-392c-321a-7411-a58e1a7e4817%40enterprisedb.com
* Fix memory leak when rejecting bogus DH parameters.Tom Lane2021-03-20
| | | | | | | | | | | | | While back-patching e0e569e1d, I noted that there were some other places where we ought to be applying DH_free(); namely, where we load some DH parameters from a file and then reject them as not being sufficiently secure. While it seems really unlikely that anybody would hit these code paths in production, let alone do so repeatedly, let's fix it for consistency. Back-patch to v10 where this code was introduced. Discussion: https://postgr.es/m/16160-18367e56e9a28264@postgresql.org
* Avoid leaking memory in RestoreGUCState(), and improve comments.Tom Lane2021-03-19
| | | | | | | | | | | | | | | | | | | | | | | | RestoreGUCState applied InitializeOneGUCOption to already-live GUC entries, causing any malloc'd subsidiary data to be forgotten. We do want the effect of resetting the GUC to its compiled-in default, and InitializeOneGUCOption seems like the best way to do that, so add code to free any existing subsidiary data beforehand. The interaction between can_skip_gucvar, SerializeGUCState, and RestoreGUCState is way more subtle than their opaque comments would suggest to an unwary reader. Rewrite and enlarge the comments to try to make it clearer what's happening. Remove a long-obsolete assertion in read_nondefault_variables: the behavior of set_config_option hasn't depended on IsInitProcessingMode since f5d9698a8 installed a better way of controlling it. Although this is fixing a clear memory leak, the leak is quite unlikely to involve any large amount of data, and it can only happen once in the lifetime of a worker process. So it seems unnecessary to take any risk of back-patching. Discussion: https://postgr.es/m/4105247.1616174862@sss.pgh.pa.us
* Provide recovery_init_sync_method=syncfs.Thomas Munro2021-03-20
| | | | | | | | | | | | | | | | | | | | | | | | | Since commit 2ce439f3 we have opened every file in the data directory and called fsync() at the start of crash recovery. This can be very slow if there are many files, leading to field complaints of systems taking minutes or even hours to begin crash recovery. Provide an alternative method, for Linux only, where we call syncfs() on every possibly different filesystem under the data directory. This is equivalent, but avoids faulting in potentially many inodes from potentially slow storage. The new mode comes with some caveats, described in the documentation, so the default value for the new setting is "fsync", preserving the older behavior. Reported-by: Michael Brown <michael.brown@discourse.org> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Paul Guo <guopa@vmware.com> Reviewed-by: Bruce Momjian <bruce@momjian.us> Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> Reviewed-by: David Steele <david@pgmasters.net> Discussion: https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org Discussion: https://postgr.es/m/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com
* Use lfirst_int in cmp_list_len_contents_ascTomas Vondra2021-03-20
| | | | | | | | | | | | | The function added in be45be9c33 is comparing integer lists (IntList) by length and contents, but there were two bugs. Firstly, it used intVal() to extract the value, but that's for Value nodes, not for extracting int values from IntList. Secondly, it called it directly on the ListCell, without doing lfirst(). So just do lfirst_int() instead. Interestingly enough, this did not cause any crashes on the buildfarm, but valgrind rightfully complained about it. Discussion: https://postgr.es/m/bf3805a8-d7d1-ae61-fece-761b7ff41ecc@postgresfriends.org
* Fix use-after-ReleaseSysCache problem in ATExecAlterColumnType.Robert Haas2021-03-19
| | | | | | Introduced by commit bbe0a81db69bd10bd166907c3701492a29aca294. Per buildfarm member prion.
* Allow configurable LZ4 TOAST compression.Robert Haas2021-03-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is now a per-column COMPRESSION option which can be set to pglz (the default, and the only option in up until now) or lz4. Or, if you like, you can set the new default_toast_compression GUC to lz4, and then that will be the default for new table columns for which no value is specified. We don't have lz4 support in the PostgreSQL code, so to use lz4 compression, PostgreSQL must be built --with-lz4. In general, TOAST compression means compression of individual column values, not the whole tuple, and those values can either be compressed inline within the tuple or compressed and then stored externally in the TOAST table, so those properties also apply to this feature. Prior to this commit, a TOAST pointer has two unused bits as part of the va_extsize field, and a compessed datum has two unused bits as part of the va_rawsize field. These bits are unused because the length of a varlena is limited to 1GB; we now use them to indicate the compression type that was used. This means we only have bit space for 2 more built-in compresison types, but we could work around that problem, if necessary, by introducing a new vartag_external value for any further types we end up wanting to add. Hopefully, it won't be too important to offer a wide selection of algorithms here, since each one we add not only takes more coding but also adds a build dependency for every packager. Nevertheless, it seems worth doing at least this much, because LZ4 gets better compression than PGLZ with less CPU usage. It's possible for LZ4-compressed datums to leak into composite type values stored on disk, just as it is for PGLZ. It's also possible for LZ4-compressed attributes to be copied into a different table via SQL commands such as CREATE TABLE AS or INSERT .. SELECT. It would be expensive to force such values to be decompressed, so PostgreSQL has never done so. For the same reasons, we also don't force recompression of already-compressed values even if the target table prefers a different compression method than was used for the source data. These architectural decisions are perhaps arguable but revisiting them is well beyond the scope of what seemed possible to do as part of this project. However, it's relatively cheap to recompress as part of VACUUM FULL or CLUSTER, so this commit adjusts those commands to do so, if the configured compression method of the table happens not to match what was used for some column value stored therein. Dilip Kumar. The original patches on which this work was based were written by Ildus Kurbangaliev, and those were patches were based on even earlier work by Nikita Glukhov, but the design has since changed very substantially, since allow a potentially large number of compression methods that could be added and dropped on a running system proved too problematic given some of the architectural issues mentioned above; the choice of which specific compression method to add first is now different; and a lot of the code has been heavily refactored. More recently, Justin Przyby helped quite a bit with testing and reviewing and this version also includes some code contributions from him. Other design input and review from Tomas Vondra, Álvaro Herrera, Andres Freund, Oleg Bartunov, Alexander Korotkov, and me. Discussion: http://postgr.es/m/20170907194236.4cefce96%40wp.localdomain Discussion: http://postgr.es/m/CAFiTN-uUpX3ck%3DK0mLEk-G_kUQY%3DSNOTeqdaNRR9FMdQrHKebw%40mail.gmail.com
* Fix comments in postmaster.c.Fujii Masao2021-03-19
| | | | | | | | | | | | Commit 86c23a6eb2 changed the option to specify that postgres will stop all other server processes by sending the signal SIGSTOP, from -s to -T. But previously there were comments incorrectly explaining that SIGSTOP behavior is set by -s option. This commit fixes them. Author: Kyotaro Horiguchi Reviewed-by: Fujii Masao Discussion: https://postgr.es/m/20210316.165141.1400441966284654043.horikyota.ntt@gmail.com
* Don't leak malloc'd error string in libpqrcv_check_conninfo().Tom Lane2021-03-18
| | | | | | | | | | | | We leaked the error report from PQconninfoParse, when there was one. It seems unlikely that real usage patterns would repeat the failure often enough to create serious bloat, but let's back-patch anyway to keep the code similar in all branches. Found via valgrind testing. Back-patch to v10 where this code was added. Discussion: https://postgr.es/m/3816764.1616104288@sss.pgh.pa.us
* Don't leak malloc'd strings when a GUC setting is rejected.Tom Lane2021-03-18
| | | | | | | | | | | | | | Because guc.c prefers to keep all its string values in malloc'd not palloc'd storage, it has to be more careful than usual to avoid leaks. Error exits out of string GUC hook checks failed to clear the proposed value string, and error exits out of ProcessGUCArray() failed to clear the malloc'd results of ParseLongOption(). Found via valgrind testing. This problem is ancient, so back-patch to all supported branches. Discussion: https://postgr.es/m/3816764.1616104288@sss.pgh.pa.us
* Don't leak compiled regex(es) when an ispell cache entry is dropped.Tom Lane2021-03-18
| | | | | | | | | | | | | | | | | | The text search cache mechanisms assume that we can clean up an invalidated dictionary cache entry simply by resetting the associated long-lived memory context. However, that does not work for ispell affixes that make use of regular expressions, because the regex library deals in plain old malloc. Hence, we leaked compiled regex(es) any time we dropped such a cache entry. That could quickly add up, since even a fairly trivial regex can use up tens of kB, and a large one can eat megabytes. Add a memory context callback to ensure that a regex gets freed when its owning cache entry is cleared. Found via valgrind testing. This problem is ancient, so back-patch to all supported branches. Discussion: https://postgr.es/m/3816764.1616104288@sss.pgh.pa.us
* Don't run RelationInitTableAccessMethod in a long-lived context.Tom Lane2021-03-18
| | | | | | | | | | | | | | | Some code paths in this function perform syscache lookups, which can lead to table accesses and possibly leakage of cruft into the caller's context. If said context is CacheMemoryContext, we eventually will have visible bloat. But fixing this is no harder than moving one memory context switch step. (The other callers don't have a problem.) Andres Freund and I independently found this via valgrind testing. Back-patch to v12 where this code was added. Discussion: https://postgr.es/m/20210317023101.anvejcfotwka6gaa@alap3.anarazel.de Discussion: https://postgr.es/m/3816764.1616104288@sss.pgh.pa.us
* Don't leak rd_statlist when a relcache entry is dropped.Tom Lane2021-03-18
| | | | | | | | | | | Although these lists are usually NIL, and even when not empty are unlikely to be large, constant relcache update traffic could eventually result in visible bloat of CacheMemoryContext. Found via valgrind testing. Back-patch to v10 where this field was added. Discussion: https://postgr.es/m/3816764.1616104288@sss.pgh.pa.us