aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access
Commit message (Collapse)AuthorAge
* Use pg_pread() and pg_pwrite() for data files and WAL.Thomas Munro2018-11-07
| | | | | | | | | | | | | Cut down on system calls by doing random I/O using offset-based OS routines where available. Remove the code for tracking the 'virtual' seek position. The only reason left to call FileSeek() was to get the file's size, so provide a new function FileSize() instead. Author: Oskari Saarenmaa, Thomas Munro Reviewed-by: Thomas Munro, Jesper Pedersen, Tom Lane, Alvaro Herrera Discussion: https://postgr.es/m/CAEepm=02rapCpPR3ZGF2vW=SBHSdFYO_bz_f-wwWJonmA3APgw@mail.gmail.com Discussion: https://postgr.es/m/b8748d39-0b19-0514-a1b9-4e5a28e6a208%40gmail.com Discussion: https://postgr.es/m/a86bd200-ebbe-d829-e3ca-0c4474b2fcb7%40ohmu.fi
* Rename rbtree.c functions to use "rbt" prefix not "rb" prefix.Tom Lane2018-11-06
| | | | | | | | | | | | | | | | | | | | | | | | | The "rb" prefix is used by Ruby, so that our existing code results in name collisions that break plruby. We discussed ways to prevent that by adjusting dynamic linker options, but it seems that at best we'd move the pain to other cases. Renaming to avoid the collision is the only portable fix anyway. Fortunately, our rbtree code is not (yet?) widely used --- in core, there's only a single usage in GIN --- so it seems likely that we can get away with a rename. I chose to do this basically as s/rb/rbt/g, except for places where there already was a "t" after "rb". The patch could have been made smaller by only touching linker-visible symbols, but it would have resulted in oddly inconsistent-looking code. Better to make it look like "rbt" was the plan all along. Back-patch to v10. The rbtree.c code exists back to 9.5, but rb_iterate() which is the actual immediate source of pain was added in v10, so it seems like changing the names before that would have more risk than benefit. Per report from Pavel Raiskup. Discussion: https://postgr.es/m/4738198.8KVIIDhgEB@nb.usersys.redhat.com
* Fix spelling errors and typos in commentsMagnus Hagander2018-11-02
| | | | Author: Daniel Gustafsson <daniel@yesql.se>
* Fix memory leak in repeated SPGIST index scans.Tom Lane2018-10-31
| | | | | | | | | | | | | | | | | | | | | | | spgendscan neglected to pfree all the memory allocated by spgbeginscan. It's possible to get away with that in most normal queries, since the memory is allocated in the executor's per-query context which is about to get deleted anyway; but it causes severe memory leakage during creation or filling of large exclusion-constraint indexes. Also, document that amendscan is supposed to free what ambeginscan allocates. The docs' lack of clarity on that point probably caused this bug to begin with. (There is discussion of changing that API spec going forward, but I don't think it'd be appropriate for the back branches.) Per report from Bruno Wolff. It's been like this since the beginning, so back-patch to all active branches. In HEAD, also fix an independent leak caused by commit 2a6368343 (allocating memory during spgrescan instead of spgbeginscan, which might be all right if it got cleaned up, but it didn't). And do a bit of code beautification on that commit, too. Discussion: https://postgr.es/m/20181024012314.GA27428@wolff.to
* Fix typo in xlog.c.Andres Freund2018-10-31
| | | | | Author: Daniel Gustafsson Discussion: https://postgr.es/m/A6817958-949E-4A5B-895D-FA421B6640C2@yesql.se
* Add pg_promote functionMichael Paquier2018-10-25
| | | | | | | | | | | | This function is able to promote a standby with this new SQL-callable function. Execution access can be granted to non-superusers so that failover tools can observe the principle of least privilege. Catalog version is bumped. Author: Laurenz Albe Reviewed-by: Michael Paquier, Masahiko Sawada Discussion: https://postgr.es/m/6e7c79b3ec916cf49742fb8849ed17cd87aed620.camel@cybertec.at
* Correctly set t_self for heap tuples in expand_tupleAndrew Dunstan2018-10-24
| | | | | | | | | | | | Commit 16828d5c0 incorrectly set an invalid pointer for t_self for heap tuples. This patch correctly copies it from the source tuple, and includes a regression test that relies on it being set correctly. Backpatch to release 11. Fixes bug #15448 reported by Tillmann Schulz Diagnosis and test case by Amit Langote
* Move TupleTableSlots boolean member into one flag variable.Andres Freund2018-10-15
| | | | | | | | | | | | | | | There's several reasons for this change: 1) It reduces the total size of TupleTableSlot / reduces alignment padding, making the commonly accessed members fit into a single cacheline (but we currently do not force proper alignment, so that's not yet guaranteed to be helpful) 2) Combining the booleans into a flag allows to combine read/writes from memory. 3) With the upcoming slot abstraction changes, it allows to have core and extended flags, in a memory efficient way. Author: Ashutosh Bapat and Andres Freund Discussion: https://postgr.es/m/20180220224318.gw4oe5jadhpmcdnm@alap3.anarazel.de
* Move generic slot support functions from heaptuple.c into execTuples.c.Andres Freund2018-10-15
| | | | | | | | | | | | | | | heaptuple.c was never a particular good fit for slot_getattr(), slot_getsomeattrs() and slot_getmissingattrs(), but in upcoming changes slots will be made more abstract (allowing slots that contain different types of tuples), making it clearly the wrong place. Note that slot_deform_tuple() remains in it's current place, as it clearly deals with a HeapTuple. getmissingattrs() also remains, but it's less clear that that's correct - but execTuples.c wouldn't be the right place. Author: Ashutosh Bapat. Discussion: https://postgr.es/m/20180220224318.gw4oe5jadhpmcdnm@alap3.anarazel.de
* Simplify use of AllocSetContextCreate() wrapper macro.Tom Lane2018-10-12
| | | | | | | | | | | | | | | We can allow this macro to accept either abbreviated or non-abbreviated allocation parameters by making use of __VA_ARGS__. As noted by Andres Freund, it's unlikely that any compiler would have __builtin_constant_p but not __VA_ARGS__, so this gives up little or no error checking, and it avoids a minor but annoying API break for extensions. With this change, there is no reason for anybody to call AllocSetContextCreateExtended directly, so in HEAD I renamed it to AllocSetContextCreateInternal. It's probably too late for an ABI break like that in 11, though. Discussion: https://postgr.es/m/20181012170355.bhxi273skjt6sag4@alap3.anarazel.de
* Remove deprecated abstime, reltime, tinterval datatypes.Andres Freund2018-10-11
| | | | | | | | | | | | These types have been deprecated for a *long* time. Catversion bump, for obvious reasons. Author: Andres Freund Discussion: https://postgr.es/m/20181009192237.34wjp3nmw7oynmmr@alap3.anarazel.de https://postgr.es/m/20171213080506.cwjkpcz3bkk6yz2u@alap3.anarazel.de https://postgr.es/m/25615.1513115237@sss.pgh.pa.us
* Fix logical decoding error when system table w/ toast is repeatedly rewritten.Andres Freund2018-10-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Repeatedly rewriting a mapped catalog table with VACUUM FULL or CLUSTER could cause logical decoding to fail with: ERROR, "could not map filenode \"%s\" to relation OID" To trigger the problem the rewritten catalog had to have live tuples with toasted columns. The problem was triggered as during catalog table rewrites the heap_insert() check that prevents logical decoding information to be emitted for system catalogs, failed to treat the new heap's toast table as a system catalog (because the new heap is not recognized as a catalog table via RelationIsLogicallyLogged()). The relmapper, in contrast to the normal catalog contents, does not contain historical information. After a single rewrite of a mapped table the new relation is known to the relmapper, but if the table is rewritten twice before logical decoding occurs, the relfilenode cannot be mapped to a relation anymore. Which then leads us to error out. This only happens for toast tables, because the main table contents aren't re-inserted with heap_insert(). The fix is simple, add a new heap_insert() flag that prevents logical decoding information from being emitted, and accept during decoding that there might not be tuple data for toast tables. Unfortunately that does not fix pre-existing logical decoding errors. Doing so would require not throwing an error when a filenode cannot be mapped to a relation during decoding, and that seems too likely to hide bugs. If it's crucial to fix decoding for an existing slot, temporarily changing the ERROR in ReorderBufferCommit() to a WARNING appears to be the best fix. Author: Andres Freund Discussion: https://postgr.es/m/20180914021046.oi7dm4ra3ot2g2kt@alap3.anarazel.de Backpatch: 9.4-, where logical decoding was introduced
* Relax transactional restrictions on ALTER TYPE ... ADD VALUE (redux).Thomas Munro2018-10-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Originally committed as 15bc038f (plus some follow-ups), this was reverted in 28e07270 due to a problem discovered in parallel workers. This new version corrects that problem by sending the list of uncommitted enum values to parallel workers. Here follows the original commit message describing the change: To prevent possibly breaking indexes on enum columns, we must keep uncommitted enum values from getting stored in tables, unless we can be sure that any such column is new in the current transaction. Formerly, we enforced this by disallowing ALTER TYPE ... ADD VALUE from being executed at all in a transaction block, unless the target enum type had been created in the current transaction. This patch removes that restriction, and instead insists that an uncommitted enum value can't be referenced unless it belongs to an enum type created in the same transaction as the value. Per discussion, this should be a bit less onerous. It does require each function that could possibly return a new enum value to SQL operations to check this restriction, but there aren't so many of those that this seems unmaintainable. Author: Andrew Dunstan and Tom Lane, with parallel query fix by Thomas Munro Reviewed-by: Tom Lane Discussion: https://postgr.es/m/CAEepm%3D0Ei7g6PaNTbcmAh9tCRahQrk%3Dr5ZWLD-jr7hXweYX3yg%40mail.gmail.com Discussion: https://postgr.es/m/4075.1459088427%40sss.pgh.pa.us
* Advance transaction timestamp for intra-procedure transactions.Tom Lane2018-10-08
| | | | | | | | Per discussion, this behavior seems less astonishing than not doing so. Peter Eisentraut and Tom Lane Discussion: https://postgr.es/m/20180920234040.GC29981@momjian.us
* Restore sane locking behavior during parallel query.Tom Lane2018-10-06
| | | | | | | | | | | | | | Commit 9a3cebeaa changed things so that parallel workers didn't obtain any lock of their own on tables they access. That was clearly a bad idea, but I'd mistakenly supposed that it was the intended end result of the series of patches for simplifying the executor's lock management. Undo that change in relation_open(), and adjust ExecOpenScanRelation() so that it gets the correct lock if inside a parallel worker. In passing, clean up some more obsolete comments about when locks are acquired. Discussion: https://postgr.es/m/468c85d9-540e-66a2-1dde-fec2b741e688@lab.ntt.co.jp
* Propagate xactStartTimestamp and stmtStartTimestamp to parallel workers.Tom Lane2018-10-06
| | | | | | | | | | | | | | | | | | | | | | Previously, a worker process would establish values for these based on its own start time. In v10 and up, this can trivially be shown to cause misbehavior of transaction_timestamp(), timestamp_in(), and related functions which are (perhaps unwisely?) marked parallel-safe. It seems likely that other behaviors might diverge from what happens in the parent as well. It's not as trivial to demonstrate problems in 9.6 or 9.5, but I'm sure it's still possible, so back-patch to all branches containing parallel worker infrastructure. In HEAD only, mark now() and statement_timestamp() as parallel-safe (other affected functions already were). While in theory we could still squeeze that change into v11, it doesn't seem important enough to force a last-minute catversion bump. Konstantin Knizhnik, whacked around a bit by me Discussion: https://postgr.es/m/6406dbd2-5d37-4cb6-6eb2-9c44172c7e7c@postgrespro.ru
* Allow btree comparison functions to return INT_MIN.Tom Lane2018-10-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Historically we forbade datatype-specific comparison functions from returning INT_MIN, so that it would be safe to invert the sort order just by negating the comparison result. However, this was never really safe for comparison functions that directly return the result of memcmp(), strcmp(), etc, as POSIX doesn't place any such restriction on those library functions. Buildfarm results show that at least on recent Linux on s390x, memcmp() actually does return INT_MIN sometimes, causing sort failures. The agreed-on answer is to remove this restriction and fix relevant call sites to not make such an assumption; code such as "res = -res" should be replaced by "INVERT_COMPARE_RESULT(res)". The same is needed in a few places that just directly negated the result of memcmp or strcmp. To help find places having this problem, I've also added a compile option to nbtcompare.c that causes some of the commonly used comparators to return INT_MIN/INT_MAX instead of their usual -1/+1. It'd likely be a good idea to have at least one buildfarm member running with "-DSTRESS_SORT_INT_MIN". That's far from a complete test of course, but it should help to prevent fresh introductions of such bugs. This is a longstanding portability hazard, so back-patch to all supported branches. Discussion: https://postgr.es/m/20180928185215.ffoq2xrq5d3pafna@alap3.anarazel.de
* Change executor to just Assert that table locks were already obtained.Tom Lane2018-10-03
| | | | | | | | | | | | | | | | | | | | | | Instead of locking tables during executor startup, just Assert that suitable locks were obtained already during the parse/plan pipeline (or re-obtained by the plan cache). This must be so, else we have a hazard that concurrent DDL has invalidated the plan. This is pretty inefficient as well as undercommented, but it's all going to go away shortly, so I didn't try hard. This commit is just another attempt to use the buildfarm to see if we've missed anything in the plan to simplify the executor's table management. Note that the change needed here in relation_open() exposes that parallel workers now really are accessing tables without holding any lock of their own, whereas they were not doing that before this commit. This does not give me a warm fuzzy feeling about that aspect of parallel query; it does not seem like a good design, and we now know that it's had exactly no actual testing. I think that we should modify parallel query so that that change can be reverted. Discussion: https://postgr.es/m/468c85d9-540e-66a2-1dde-fec2b741e688@lab.ntt.co.jp
* Use slots more widely in tuple mapping code and make naming more consistent.Andres Freund2018-10-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | It's inefficient to use a single slot for mapping between tuple descriptors for multiple tuples, as previously done when using ConvertPartitionTupleSlot(), as that means the slot's tuple descriptors change for every tuple. Previously we also, via ConvertPartitionTupleSlot(), built new tuples after the mapping even in cases where we, immediately afterwards, access individual columns again. Refactor the code so one slot, on demand, is used for each partition. That avoids having to change the descriptor (and allows to use the more efficient "fixed" tuple slots). Then use slot->slot mapping, to avoid unnecessarily forming a tuple. As the naming between the tuple and slot mapping functions wasn't consistent, rename them to execute_attr_map_{tuple,slot}. It's likely that we'll also rename convert_tuples_by_* to denote that these functions "only" build a map, but that's left for later. Author: Amit Khandekar and Amit Langote, editorialized by me Reviewed-By: Amit Langote, Amit Khandekar, Andres Freund Discussion: https://postgr.es/m/CAJ3gD9fR0wRNeAE8VqffNTyONS_UfFPRpqxhnD9Q42vZB+Jvpg@mail.gmail.com https://postgr.es/m/e4f9d743-cd4b-efb0-7574-da21d86a7f36%40lab.ntt.co.jp Backpatch: -
* Add assertions that we hold some relevant lock during relation open.Tom Lane2018-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | Opening a relation with no lock at all is unsafe; there's no guarantee that we'll see a consistent state of the relevant catalog entries. While use of MVCC scans to read the catalogs partially addresses that complaint, it's still possible to switch to a new catalog snapshot partway through loading the relcache entry. Moreover, whether or not you trust the reasoning behind sometimes using less than AccessExclusiveLock for ALTER TABLE, that reasoning is certainly not valid if concurrent users of the table don't hold a lock corresponding to the operation they want to perform. Hence, add some assertion-build-only checks that require any caller of relation_open(x, NoLock) to hold at least AccessShareLock. This isn't a full solution, since we can't verify that the lock level is semantically appropriate for the action --- but it's definitely of some use, because it's already caught two bugs. We can also assert that callers of addRangeTableEntryForRelation() hold at least the lock level specified for the new RTE. Amit Langote and Tom Lane Discussion: https://postgr.es/m/16565.1538327894@sss.pgh.pa.us
* Fix assertion failure when updating full_page_writes for checkpointer.Amit Kapila2018-09-28
| | | | | | | | | | | | | | | | | | When the checkpointer receives a SIGHUP signal to update its configuration, it may need to update the shared memory for full_page_writes and need to write a WAL record for it. Now, it is quite possible that the XLOG machinery has not been initialized by that time and it will lead to assertion failure while doing that. Fix is to allow the initialization of the XLOG machinery outside critical section. This bug has been introduced by the commit 2c03216d83 which added the XLOG machinery initialization in RecoveryInProgress code path. Reported-by: Dilip Kumar Author: Dilip Kumar Reviewed-by: Michael Paquier and Amit Kapila Backpatch-through: 9.5 Discussion: https://postgr.es/m/CAFiTN-u4BA8KXcQUWDPNgaKAjDXC=C2whnzBM8TAcv=stckYUw@mail.gmail.com
* Fix WAL recycling on standbys depending on archive_modeMichael Paquier2018-09-28
| | | | | | | | | | | | | | | | | | | | | | | | A restart point or a checkpoint recycling WAL segments treats segments marked with neither ".done" (archiving is done) or ".ready" (segment is ready to be archived) in archive_status the same way for archive_mode being "on" or "always". While for a primary this is fine, a standby running a restart point with archive_mode = on would try to mark such a segment as ready for archiving, which is something that will never happen except after the standby is promoted. Note that this problem applies only to WAL segments coming from the local pg_wal the first time archive recovery is run. Segments part of a self-contained base backup are the most common case where this could happen, however even in this case normally the .done markers would be most likely part of the backup. Segments recovered from an archive are marked as .ready or .done by the startup process, and segments finished streaming are marked as such by the WAL receiver, so they are handled already. Reported-by: Haruka Takatsuka Author: Michael Paquier Discussion: https://postgr.es/m/15402-a453c90ed4cf88b2@postgresql.org Backpatch-through: 9.5, where archive_mode = always has been added.
* Minor formatting cleanup for 2a6368343fAlexander Korotkov2018-09-27
|
* Remove extra usage of BoxPGetDatum() macroAlexander Korotkov2018-09-27
| | | | | Author: Mark Dilger Discussion: https://postgr.es/m/B2AEFCD0-836D-4654-9D59-3DF616E0A6F3%40gmail.com
* Rework activation of commit timestamps during recoveryMichael Paquier2018-09-26
| | | | | | | | | | | | | | | | | | | | | | | | | | The activation and deactivation of commit timestamp tracking has not been handled consistently for a primary or standbys at recovery. The facility can be activated at three different moments of recovery: - The beginning, where a primary would use the GUC value for the decision-making, and where a standby relies on the contents of the control file. - When replaying a XLOG_PARAMETER_CHANGE record at redo. - The end, where both primary and standby rely on the GUC value. Using the GUC value for a primary at the beginning of recovery causes problems with commit timestamp access when doing crash recovery. Particularly, when replaying transaction commits, it could be possible that an attempt to read commit timestamps is done for a transaction which committed at a moment when track_commit_timestamp was disabled. A test case is added to reproduce the failure. The test works down to v11 as it takes advantage of transaction commits within procedures. Reported-by: Hailong Li Author: Masahiko Sawasa, Michael Paquier Reviewed-by: Kyotaro Horiguchi Discussion: https://postgr.es/m/11224478-a782-203b-1f17-e4797b39bdf0@qunar.com Backpatch-through: 9.5, where commit timestamps have been introduced.
* Split ExecStoreTuple into ExecStoreHeapTuple and ExecStoreBufferHeapTuple.Andres Freund2018-09-25
| | | | | | | | | | | | | | | | | | | | Upcoming changes introduce further types of tuple table slots, in preparation of making table storage pluggable. New storage methods will have different representation of tuples, therefore the slot accessor should refer explicitly to heap tuples. Instead of just renaming the functions, split it into one function that accepts heap tuples not residing in buffers, and one accepting ones in buffers. Previously one function was used for both, but that was a bit awkward already, and splitting will allow us to represent slot types for tuples in buffers and normal memory separately. This is split out from the patch introducing abstract slots, as this largely consists out of mechanical changes. Author: Ashutosh Bapat Reviewed-By: Andres Freund Discussion: https://postgr.es/m/20180220224318.gw4oe5jadhpmcdnm@alap3.anarazel.de
* Fast default trigger and expand_tuple fixesAndrew Dunstan2018-09-24
| | | | | | | | | | | | | | Ensure that triggers get properly filled in tuples for the OLD value. Also fix the logic of detecting missing null values. The previous logic failed to detect a missing null column before the first missing column with a default. Fixing this has simplified the logic a bit. Regression tests are added to test changes. This should ensure better coverage of expand_tuple(). Original bug reports, and some code and test scripts from Tomas Vondra Backpatch to release 11.
* Defer restoration of libraries in parallel workers.Thomas Munro2018-09-20
| | | | | | | | | | | | | | | | | | | | | | Several users of extensions complained of crashes in parallel workers that turned out to be due to syscache access from their _PG_init() functions. Reorder the initialization of parallel workers so that libraries are restored after the caches are initialized, and inside a transaction. This was reported in bug #15350 and elsewhere. We don't consider it to be a bug: extensions shouldn't do that, because then they can't be used in shared_preload_libraries. However, it's a fairly obscure hazard and these extensions worked in practice before parallel query came along. So let's make it work. Later commits might add a warning message and eventually an error. Back-patch to 9.6, where parallel query landed. Author: Thomas Munro Reviewed-by: Amit Kapila Reported-by: Kieran McCusker, Jimmy Discussion: https://postgr.es/m/153512195228.1489.8545997741965926448%40wrigleys.postgresql.org
* Fix minor error message style guide violation.Tom Lane2018-09-19
| | | | | | | | No periods at the ends of primary error messages, please. Daniel Gustafsson Discussion: https://postgr.es/m/43E004C0-18C6-42B4-A313-003B43EB0571@yesql.se
* Add support for nearest-neighbor (KNN) searches to SP-GiSTAlexander Korotkov2018-09-19
| | | | | | | | | | | | | Currently, KNN searches were supported only by GiST. SP-GiST also capable to support them. This commit implements that support. SP-GiST scan stack is replaced with queue, which serves as stack if no ordering is specified. KNN support is provided for three SP-GIST opclasses: quad_point_ops, kd_point_ops and poly_ops (catversion is bumped). Some common parts between GiST and SP-GiST KNNs are extracted into separate functions. Discussion: https://postgr.es/m/570825e8-47d0-4732-2bf6-88d67d2d51c8%40postgrespro.ru Author: Nikita Glukhov, Alexander Korotkov based on GSoC work by Vlad Sterzhanov Review: Andrey Borodin, Alexander Korotkov
* Attach FPI to the first record after full_page_writes is turned on.Amit Kapila2018-09-13
| | | | | | | | | | | | | | | | XLogInsert fails to attach a required FPI to the first record after full_page_writes is turned on by the last checkpoint. This bug got introduced in 9.5 due to code rearrangement in commits 2c03216d83 and 2076db2aea. Fix it by ensuring that XLogInsertRecord performs a recomputation when the given record is generated with FPW as off but found that the flag has been turned on while actually inserting the record. Reported-by: Kyotaro Horiguchi Author: Kyotaro Horiguchi Reviewed-by: Amit Kapila Backpatch-through: 9.5 where this problem was introduced Discussion: https://postgr.es/m/20180420.151043.74298611.horiguchi.kyotaro@lab.ntt.co.jp
* Repair double-free in SP-GIST rescan (bug #15378)Andrew Gierth2018-09-11
| | | | | | | | | | | | | | | | | | | | | | spgrescan would first reset traversalCxt, and then traverse a potentially non-empty stack containing pointers to traversalValues which had been allocated in those contexts, freeing them a second time. This bug originates in commit ccd6eb49a where traversalValue was introduced. Repair by traversing the stack before the context reset; this isn't ideal, since it means doing retail pfree in a context that's about to be reset, but the freeing of a stack entry is also done in other places in the code during the scan so it's not worth trying to refactor it further. Regression test added. Backpatch to 9.6 where the problem was introduced. Per bug #15378; analysis and patch by me, originally from a report on IRC by user velix; see also PostGIS ticket #4174; review by Alexander Korotkov. Discussion: https://postgr.es/m/153663176628.23136.11901365223750051490@wrigleys.postgresql.org
* Fix past pd_upper write in ginRedoRecompress()Alexander Korotkov2018-09-09
| | | | | | | | | | | | | | | | ginRedoRecompress() replays actions over compressed segments of posting list in-place. However, it might lead to write past pg_upper, because intermediate state during playing the changes can take more space than both original state and final state. This commit fixes that by refuse from in-place modification. Instead page tail is copied once modification is started, and then it's used as the source of original segments. Backpatch to 9.4 where posting list compression was introduced. Reported-by: Sivasubramanian Ramasubramanian Discussion: https://postgr.es/m/1536091151804.6588%40amazon.com Author: Alexander Korotkov based on patch from and ideas by Sivasubramanian Ramasubramanian Review: Sivasubramanian Ramasubramanian Backpatch-through: 9.4
* Remove duplicated words split across lines in commentsMichael Paquier2018-09-08
| | | | | | | | This has been detected using some interesting tricks with sed, and the method used is mentioned in details in the discussion below. Author: Justin Pryzby Discussion: https://postgr.es/m/20180908013109.GB15350@telsasoft.com
* Improve handling of corrupted two-phase state files at recoveryMichael Paquier2018-09-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When a corrupted two-phase state file is found by WAL replay, be it for crash recovery or archive recovery, then the file is simply skipped and a WARNING is logged to the user, causing the transaction to be silently lost. Facing an on-disk WAL file which is corrupted is as likely to happen as what is stored in WAL records, but WAL records are already able to fail hard if there is a CRC mismatch. On-disk two-phase state files, on the contrary, are simply ignored if corrupted. Note that when restoring the initial two-phase data state at recovery, files newer than the horizon XID are discarded hence no files present in pg_twophase/ should be torned and have been made durable by a previous checkpoint, so recovery should never see any corrupted two-phase state file by design. The situation got better since 978b2f6 which has added two-phase state information directly in WAL instead of using on-disk files, so the risk is limited to two-phase transactions which live across at least one checkpoint for long periods. Backups having legit two-phase state files on-disk could also lose silently transactions when restored if things get corrupted. This behavior exists since two-phase commit has been introduced, no back-patch is done for now per the lack of complaints about this problem. Author: Michael Paquier Discussion: https://postgr.es/m/20180709050309.GM1467@paquier.xyz
* Use C99 designated initializers for some structsPeter Eisentraut2018-09-07
| | | | | | | These are just a few particularly egregious cases that were hard to read and write, and error prone because of many similar adjacent types. Discussion: https://www.postgresql.org/message-id/flat/4c9f01be-9245-2148-b569-61a8562ef190%402ndquadrant.com
* During the split, set checksum on an empty hash index page.Amit Kapila2018-09-04
| | | | | | | | | | | | | | | | On a split, we allocate a new splitpoint's worth of bucket pages wherein we initialize the last page with zeros which is fine, but we forgot to set the checksum for that last page. We decided to back-patch this fix till 10 because we don't have an easy way to test it in prior versions. Another reason is that the hash-index code is changed heavily in 10, so it is not advisable to push the fix without testing it in prior versions. Author: Amit Kapila Reviewed-by: Yugo Nagata Backpatch-through: 10 Discussion: https://postgr.es/m/5d03686d-727c-dbf8-0064-bf8b97ffe850@2ndquadrant.com
* Avoid using potentially-under-aligned page buffers.Tom Lane2018-09-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's a project policy against using plain "char buf[BLCKSZ]" local or static variables as page buffers; preferred style is to palloc or malloc each buffer to ensure it is MAXALIGN'd. However, that policy's been ignored in an increasing number of places. We've apparently got away with it so far, probably because (a) relatively few people use platforms on which misalignment causes core dumps and/or (b) the variables chance to be sufficiently aligned anyway. But this is not something to rely on. Moreover, even if we don't get a core dump, we might be paying a lot of cycles for misaligned accesses. To fix, invent new union types PGAlignedBlock and PGAlignedXLogBlock that the compiler must allocate with sufficient alignment, and use those in place of plain char arrays. I used these types even for variables where there's no risk of a misaligned access, since ensuring proper alignment should make kernel data transfers faster. I also changed some places where we had been palloc'ing short-lived buffers, for coding style uniformity and to save palloc/pfree overhead. Since this seems to be a live portability hazard (despite the lack of field reports), back-patch to all supported versions. Patch by me; thanks to Michael Paquier for review. Discussion: https://postgr.es/m/1535618100.1286.3.camel@credativ.de
* Ensure correct minimum consistent point on standbysMichael Paquier2018-08-31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Startup process has improved its calculation of incorrect minimum consistent point in 8d68ee6, which ensures that all WAL available gets replayed when doing crash recovery, and has introduced an incorrect calculation of the minimum recovery point for non-startup processes, which can cause incorrect page references on a standby when for example the background writer flushed a couple of pages on-disk but was not updating the control file to let a subsequent crash recovery replay to where it should have. The only case where this has been reported to be a problem is when a standby needs to calculate the latest removed xid when replaying a btree deletion record, so one would need connections on a standby that happen just after recovery has thought it reached a consistent point. Using a background worker which is started after the consistent point is reached would be the easiest way to get into problems if it connects to a database. Having clients which attempt to connect periodically could also be a problem, but the odds of seeing this problem are much lower. The fix used is pretty simple, as the idea is to give access to the minimum recovery point written in the control file to non-startup processes so as they use a reference, while the startup process still initializes its own references of the minimum consistent point so as the original problem with incorrect page references happening post-promotion with a crash do not show up. Reported-by: Alexander Kukushkin Diagnosed-by: Alexander Kukushkin Author: Michael Paquier Reviewed-by: Kyotaro Horiguchi, Alexander Kukushkin Discussion: https://postgr.es/m/153492341830.1368.3936905691758473953@wrigleys.postgresql.org Backpatch-through: 9.3
* Deduplicate code between slot_getallattrs() and slot_getsomeattrs().Andres Freund2018-08-23
| | | | | | | | | | | | | | | Code in slot_getallattrs() is the same as if slot_getsomeattrs() is called with number of attributes specified in the tuple descriptor. Implement it that way instead of duplicating the code between those two functions. This is part of a patchseries abstracting TupleTableSlots so they can store arbitrary forms of tuples, but is a nice enough cleanup on its own. Author: Ashutosh Bapat Reviewed-By: Andres Freund Discussion: https://postgr.es/m/20180220224318.gw4oe5jadhpmcdnm@alap3.anarazel.de
* doc: Update uses of the word "procedure"Peter Eisentraut2018-08-22
| | | | | | | | | | | | | | | | | | | Historically, the term procedure was used as a synonym for function in Postgres/PostgreSQL. Now we have procedures as separate objects from functions, so we need to clean up the documentation to not mix those terms. In particular, mentions of "trigger procedures" are changed to "trigger functions", and access method "support procedures" are changed to "support functions". (The latter already used FUNCTION in the SQL syntax anyway.) Also, the terminology in the SPI chapter has been cleaned up. A few tests, examples, and code comments are also adjusted to be consistent with documentation changes, but not everything. Reported-by: Peter Geoghegan <pg@bowt.ie> Reviewed-by: Jonathan S. Katz <jonathan.katz@excoventures.com>
* fix typoAlvaro Herrera2018-08-21
|
* Use the built-in float datatypes to implement geometric typesTomas Vondra2018-08-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch makes the geometric operators and functions use the exported function of the float4/float8 datatypes. The main reason of doing so is to check for underflow and overflow, and to handle NaNs consciously. The float datatypes consider NaNs values to be equal and greater than all non-NaN values. This change considers NaNs equal only for equality operators. The placement operators, contains, overlaps, left/right of etc. continue to return false when NaNs are involved. We don't need to worry about them being considered greater than any-NaN because there aren't any basic comparison operators like less/greater than for the geometric datatypes. The changes may be summarised as: * Check for underflow, overflow and division by zero * Consider NaN values to be equal * Return NULL when the distance is NaN for all closest point operators * Favour not-NaN over NaN where it makes sense The patch also replaces all occurrences of "double" as "float8". They are the same, but were used inconsistently in the same file. Author: Emre Hasegeli Reviewed-by: Kyotaro Horiguchi, Tomas Vondra Discussion: https://www.postgresql.org/message-id/CAE2gYzxF7-5djV6-cEvqQu-fNsnt%3DEqbOURx7ZDg%2BVv6ZMTWbg%40mail.gmail.com
* Improve comment in GetNewObjectId().Thomas Munro2018-08-16
| | | | | | | | | | | | The previous comment gave the impression that skipping OIDs before FirstNormalObjectId was merely an optimization to avoid likely collisions. In fact other parts of the system have been relying on this threshold to detect system-created objects since commit 8e18d04d4da, so adjust the wording. Author: Thomas Munro Reviewed-by: Tom Lane Discussion: https://postgr.es/m/CAEepm%3D33JASACeOayr_W3%3DCSjy2jiPxM-k89axu0akFbHdjnjA%40mail.gmail.com
* Update FSM on WAL replay of page all-visible/frozenAlvaro Herrera2018-08-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We aren't very strict about keeping FSM up to date on WAL replay, because per-page freespace values aren't critical in replicas (can't write to heap in a replica; and if the replica is promoted, the values would be updated by VACUUM anyway). However, VACUUM since 9.6 can skip processing pages marked all-visible or all-frozen, and if such pages are recorded in FSM with wrong values, those values are blindly propagated to FSM's upper layers by VACUUM's FreeSpaceMapVacuum. (This rationale assumes that crashes are not very frequent, because those would cause outdated FSM to occur in the primary.) Even when the FSM is outdated in standby, things are not too bad normally, because, most per-page FSM values will be zero (other than those propagated with the base-backup that created the standby); only once the remaining free space is less than 0.2*BLCKSZ the per-page value is maintained by WAL replay of heap ins/upd/del. However, if wal_log_hints=on causes complete FSM pages to be propagated to a standby via full-page images, many too-optimistic per-page values can end up being registered in the standby. Incorrect per-page values aren't critical in most cases, since an inserter that is given a page that doesn't actually contain the claimed free space will update FSM with the correct value, and retry until it finds a usable page. However, if there are many such updates to do, an inserter can spend a long time doing them before a usable page is found; in a heavily trafficked insert-only table with many concurrent inserters this has been observed to cause several second stalls, causing visible application malfunction. To fix this problem, it seems sufficient to have heap_xlog_visible (replay of setting all-visible and all-frozen VM bits for a heap page) update the FSM value for the page being processed. This fixes the per-page counters together with making the page skippable to vacuum, so when vacuum does FreeSpaceMapVacuum, the values propagated to FSM upper layers are the correct ones, avoiding the problem. While at it, apply the same fix to heap_xlog_clean (replay of tuple removal by HOT pruning and vacuum). This makes any space freed by the cleaning available earlier than the next vacuum in the promoted replica. Backpatch to 9.6, where this problem was diagnosed on an insert-only table with all-frozen pages, which were introduced as a concept in that release. Theoretically it could apply with all-visible pages to older branches, but there's been no report of that and it doesn't backpatch cleanly anyway. Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/20180802172857.5skoexsilnjvgruk@alvherre.pgsql
* Make autovacuum more aggressive to remove orphaned temp tablesMichael Paquier2018-08-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit dafa084, added in 10, made the removal of temporary orphaned tables more aggressive. This commit makes an extra step into the aggressiveness by adding a flag in each backend's MyProc which tracks down any temporary namespace currently in use. The flag is set when the namespace gets created and can be reset if the temporary namespace has been created in a transaction or sub-transaction which is aborted. The flag value assignment is assumed to be atomic, so this can be done in a lock-less fashion like other flags already present in PGPROC like databaseId or backendId, still the fact that the temporary namespace and table created are still locked until the transaction creating those commits acts as a barrier for other backends. This new flag gets used by autovacuum to discard more aggressively orphaned tables by additionally checking for the database a backend is connected to as well as its temporary namespace in-use, removing orphaned temporary relations even if a backend reuses the same slot as one which created temporary relations in a past session. The base idea of this patch comes from Robert Haas, has been written in its first version by Tsunakawa Takayuki, then heavily reviewed by me. Author: Tsunakawa Takayuki Reviewed-by: Michael Paquier, Kyotaro Horiguchi, Andres Freund Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8A4DC6@G01JPEXMBYT05 Backpatch: 11-, as PGPROC gains a new flag and we don't want silent ABI breakages on already released versions.
* Handle parallel index builds on mapped relations.Peter Geoghegan2018-08-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 9da0cc35284, which introduced parallel CREATE INDEX, failed to propagate relmapper.c backend local cache state to parallel worker processes. This could result in parallel index builds against mapped catalog relations where the leader process (participating as a worker) scans the new, pristine relfilenode, while worker processes scan the obsolescent relfilenode. When this happened, the final index structure was typically not consistent with the owning table's structure. The final index structure could contain entries formed from both heap relfilenodes. Only rebuilds on mapped catalog relations that occur as part of a VACUUM FULL or CLUSTER could become corrupt in practice, since their mapped relation relfilenode swap is what allows the inconsistency to arise. On master, fix the problem by propagating the required relmapper.c backend state as part of standard parallel initialization (Cf. commit 29d58fd3). On v11, simply disallow builds against mapped catalog relations by deeming them parallel unsafe. Author: Peter Geoghegan Reported-By: "death lock" Reviewed-By: Tom Lane, Amit Kapila Bug: #15309 Discussion: https://postgr.es/m/153329671686.1405.18298309097348420351@wrigleys.postgresql.org Backpatch: 11-, where parallel CREATE INDEX was introduced.
* Fix typo in SP-GiST error messageAlexander Korotkov2018-08-10
| | | | | | | | | Error message didn't match the actual check. Fix that. Compression of leaf SP-GiST values was introduced in 11. So, backpatch. Discussion: https://postgr.es/m/20180810.100742.15469435.horiguchi.kyotaro%40lab.ntt.co.jp Author: Kyotaro Horiguchi Backpatch-through: 11
* Reset properly errno before calling write()Michael Paquier2018-08-05
| | | | | | | | | | | | 6cb3372 enforces errno to ENOSPC when less bytes than what is expected have been written when it is unset, though it forgot to properly reset errno before doing a system call to write(), causing errno to potentially come from a previous system call. Reported-by: Tom Lane Author: Michael Paquier Reviewed-by: Tom Lane Discussion: https://postgr.es/m/31797.1533326676@sss.pgh.pa.us
* Provide separate header file for built-in float typesTomas Vondra2018-07-29
| | | | | | | | | | | | | | | | | | Some data types under adt/ have separate header files, but most simple ones do not, and their public functions are defined in builtins.h. As the patches improving geometric types will require making additional functions public, this seems like a good opportunity to create a header for floats types. Commit 1acf757255 made _cmp functions public to solve NaN issues locally for GiST indexes. This patch reworks it in favour of a more widely applicable API. The API uses inline functions, as they are easier to use compared to macros, and avoid double-evaluation hazards. Author: Emre Hasegeli Reviewed-by: Kyotaro Horiguchi Discussion: https://www.postgresql.org/message-id/CAE2gYzxF7-5djV6-cEvqQu-fNsnt%3DEqbOURx7ZDg%2BVv6ZMTWbg%40mail.gmail.com