aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access
Commit message (Collapse)AuthorAge
* Remove WITH OIDS support, change oid catalog column visibility.Andres Freund2018-11-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously tables declared WITH OIDS, including a significant fraction of the catalog tables, stored the oid column not as a normal column, but as part of the tuple header. This special column was not shown by default, which was somewhat odd, as it's often (consider e.g. pg_class.oid) one of the more important parts of a row. Neither pg_dump nor COPY included the contents of the oid column by default. The fact that the oid column was not an ordinary column necessitated a significant amount of special case code to support oid columns. That already was painful for the existing, but upcoming work aiming to make table storage pluggable, would have required expanding and duplicating that "specialness" significantly. WITH OIDS has been deprecated since 2005 (commit ff02d0a05280e0). Remove it. Removing includes: - CREATE TABLE and ALTER TABLE syntax for declaring the table to be WITH OIDS has been removed (WITH (oids[ = true]) will error out) - pg_dump does not support dumping tables declared WITH OIDS and will issue a warning when dumping one (and ignore the oid column). - restoring an pg_dump archive with pg_restore will warn when restoring a table with oid contents (and ignore the oid column) - COPY will refuse to load binary dump that includes oids. - pg_upgrade will error out when encountering tables declared WITH OIDS, they have to be altered to remove the oid column first. - Functionality to access the oid of the last inserted row (like plpgsql's RESULT_OID, spi's SPI_lastoid, ...) has been removed. The syntax for declaring a table WITHOUT OIDS (or WITH (oids = false) for CREATE TABLE) is still supported. While that requires a bit of support code, it seems unnecessary to break applications / dumps that do not use oids, and are explicit about not using them. The biggest user of WITH OID columns was postgres' catalog. This commit changes all 'magic' oid columns to be columns that are normally declared and stored. To reduce unnecessary query breakage all the newly added columns are still named 'oid', even if a table's column naming scheme would indicate 'reloid' or such. This obviously requires adapting a lot code, mostly replacing oid access via HeapTupleGetOid() with access to the underlying Form_pg_*->oid column. The bootstrap process now assigns oids for all oid columns in genbki.pl that do not have an explicit value (starting at the largest oid previously used), only oids assigned later by oids will be above FirstBootstrapObjectId. As the oid column now is a normal column the special bootstrap syntax for oids has been removed. Oids are not automatically assigned during insertion anymore, all backend code explicitly assigns oids with GetNewOidWithIndex(). For the rare case that insertions into the catalog via SQL are called for the new pg_nextoid() function can be used (which only works on catalog tables). The fact that oid columns on system tables are now normal columns means that they will be included in the set of columns expanded by * (i.e. SELECT * FROM pg_class will now include the table's oid, previously it did not). It'd not technically be hard to hide oid column by default, but that'd mean confusing behavior would either have to be carried forward forever, or it'd cause breakage down the line. While it's not unlikely that further adjustments are needed, the scope/invasiveness of the patch makes it worthwhile to get merge this now. It's painful to maintain externally, too complicated to commit after the code code freeze, and a dependency of a number of other patches. Catversion bump, for obvious reasons. Author: Andres Freund, with contributions by John Naylor Discussion: https://postgr.es/m/20180930034810.ywp2c7awz7opzcfr@alap3.anarazel.de
* Make WAL description output more consistentPeter Eisentraut2018-11-20
| | | | | | | | | | The output for record types XLOG_DBASE_CREATE and XLOG_DBASE_DROP used the order dbid/tablespaceid, whereas elsewhere the order is tablespaceid/dbid[/relfilenodeid]. Flip the order for those two types to make it consistent. Author: Jean-Christophe Arnu <jcarnu@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CAHZmTm18Ln62KW-G8NYvO1wbBL3QU1E76Zep=DuHmg-zS2XFAg@mail.gmail.com/
* PANIC on fsync() failure.Thomas Munro2018-11-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On some operating systems, it doesn't make sense to retry fsync(), because dirty data cached by the kernel may have been dropped on write-back failure. In that case the only remaining copy of the data is in the WAL. A subsequent fsync() could appear to succeed, but not have flushed the data. That means that a future checkpoint could apparently complete successfully but have lost data. Therefore, violently prevent any future checkpoint attempts by panicking on the first fsync() failure. Note that we already did the same for WAL data; this change extends that behavior to non-temporary data files. Provide a GUC data_sync_retry to control this new behavior, for users of operating systems that don't eject dirty data, and possibly forensic/testing uses. If it is set to on and the write-back error was transient, a later checkpoint might genuinely succeed (on a system that does not throw away buffers on failure); if the error is permanent, later checkpoints will continue to fail. The GUC defaults to off, meaning that we panic. Back-patch to all supported releases. There is still a narrow window for error-loss on some operating systems: if the file is closed and later reopened and a write-back error occurs in the intervening time, but the inode has the bad luck to be evicted due to memory pressure before we reopen, we could miss the error. A later patch will address that with a scheme for keeping files with dirty data open at all times, but we judge that to be too complicated to back-patch. Author: Craig Ringer, with some adjustments by Thomas Munro Reported-by: Craig Ringer Reviewed-by: Robert Haas, Thomas Munro, Andres Freund Discussion: https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de
* Remove unnecessary memcpy when reading WAL record fitting on pageMichael Paquier2018-11-19
| | | | | | | | | | | | | When reading a WAL record, its contents are copied into an intermediate buffer. However, doing so is not necessary if the record fits fully into the current page, saving one memcpy for each such record. The allocation handling of the intermediate buffer is also now done only when a record crosses a page boundary, shaving some extra cycles when reading a WAL record. Author: Andrey Lepikhov Reviewed-by: Kyotaro Horiguchi, Heikki Linnakangas Discussion: https://postgr.es/m/c2ea54dd-a1d3-80eb-ddbf-7e6f258e615e@postgrespro.ru
* Make TupleTableSlots extensible, finish split of existing slot type.Andres Freund2018-11-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit completes the work prepared in 1a0586de36, splitting the old TupleTableSlot implementation (which could store buffer, heap, minimal and virtual slots) into four different slot types. As described in the aforementioned commit, this is done with the goal of making tuple table slots extensible, to allow for pluggable table access methods. To achieve runtime extensibility for TupleTableSlots, operations on slots that can differ between types of slots are performed using the TupleTableSlotOps struct provided at slot creation time. That includes information from the size of TupleTableSlot struct to be allocated, initialization, deforming etc. See the struct's definition for more detailed information about callbacks TupleTableSlotOps. I decided to rename TTSOpsBufferTuple to TTSOpsBufferHeapTuple and ExecCopySlotTuple to ExecCopySlotHeapTuple, as that seems more consistent with other naming introduced in recent patches. There's plenty optimization potential in the slot implementation, but according to benchmarking the state after this commit has similar performance characteristics to before this set of changes, which seems sufficient. There's a few changes in execReplication.c that currently need to poke through the slot abstraction, that'll be repaired once the pluggable storage patchset provides the necessary infrastructure. Author: Andres Freund and Ashutosh Bapat, with changes by Amit Khandekar Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
* Introduce notion of different types of slots (without implementing them).Andres Freund2018-11-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Upcoming work intends to allow pluggable ways to introduce new ways of storing table data. Accessing those table access methods from the executor requires TupleTableSlots to be carry tuples in the native format of such storage methods; otherwise there'll be a significant conversion overhead. Different access methods will require different data to store tuples efficiently (just like virtual, minimal, heap already require fields in TupleTableSlot). To allow that without requiring additional pointer indirections, we want to have different structs (embedding TupleTableSlot) for different types of slots. Thus different types of slots are needed, which requires adapting creators of slots. The slot that most efficiently can represent a type of tuple in an executor node will often depend on the type of slot a child node uses. Therefore we need to track the type of slot is returned by nodes, so parent slots can create slots based on that. Relatedly, JIT compilation of tuple deforming needs to know which type of slot a certain expression refers to, so it can create an appropriate deforming function for the type of tuple in the slot. But not all nodes will only return one type of slot, e.g. an append node will potentially return different types of slots for each of its subplans. Therefore add function that allows to query the type of a node's result slot, and whether it'll always be the same type (whether it's fixed). This can be queried using ExecGetResultSlotOps(). The scan, result, inner, outer type of slots are automatically inferred from ExecInitScanTupleSlot(), ExecInitResultSlot(), left/right subtrees respectively. If that's not correct for a node, that can be overwritten using new fields in PlanState. This commit does not introduce the actually abstracted implementation of different kind of TupleTableSlots, that will be left for a followup commit. The different types of slots introduced will, for now, still use the same backing implementation. While this already partially invalidates the big comment in tuptable.h, it seems to make more sense to update it later, when the different TupleTableSlot implementations actually exist. Author: Ashutosh Bapat and Andres Freund, with changes by Amit Khandekar Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
* Initialize TransactionState and user ID consistently at transaction startMichael Paquier2018-11-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a failure happens when a transaction is starting between the moment the transaction status is changed from TRANS_DEFAULT to TRANS_START and the moment the current user ID and security context flags are fetched via GetUserIdAndSecContext(), or before initializing its basic fields, then those may get reset to incorrect values when the transaction aborts, leaving the session in an inconsistent state. One problem reported is that failing a starting transaction at the first query of a session could cause several kinds of system crashes on the follow-up queries. In order to solve that, move the initialization of the transaction state fields and the call of GetUserIdAndSecContext() in charge of fetching the current user ID close to the point where the transaction status is switched to TRANS_START, where there cannot be any error triggered in-between, per an idea of Tom Lane. This properly ensures that the current user ID, the security context flags and that the basic fields of TransactionState remain consistent even if the transaction fails while starting. Reported-by: Richard Guo Diagnosed-By: Richard Guo Author: Michael Paquier Reviewed-by: Tom Lane Discussion: https://postgr.es/m/CAN_9JTxECSb=pEPcb0a8d+6J+bDcOZ4=DgRo_B7Y5gRHJUM=Rw@mail.gmail.com Backpatch-through: 9.4
* Add flag values in WAL description to all heap recordsMichael Paquier2018-11-14
| | | | | | | | | | | Hexadecimal is consistently used as format to not bloat too much the output but keep it readable. This information is useful mainly for debugging purposes with for example pg_waldump. Author: Michael Paquier Reviewed-by: Nathan Bossart, Dmitry Dolgov, Andres Freund, Álvaro Herrera Discussion: https://postgr.es/m/20180413034734.GE1552@paquier.xyz
* Remove volatiles from {procarray,volatile}.c and fix memory ordering issue.Andres Freund2018-11-10
| | | | | | | | | | | | | | | | | | | | | | | The use of volatiles in procarray.c largely originated from the time when postgres did not have reliable compiler and memory barriers. That's not the case anymore, so we can do better. Several of the functions in procarray.c can be bottlenecks, and removal of volatile yields mildly better code. The new state, with explicit memory barriers, is also more correct. The previous use of volatile did not actually deliver sufficient guarantees on weakly ordered machines, in particular the logic in GetNewTransactionId() does not look safe. It seems unlikely to be a problem in practice, but worth fixing. Thomas and I independently wrote a patch for this. Reported-By: Andres Freund and Thomas Munro Author: Andres Freund, with cherrypicked changes from a patch by Thomas Munro Discussion: https://postgr.es/m/20181005172955.wyjb4fzcdzqtaxjq@alap3.anarazel.de https://postgr.es/m/CAEepm=1nff0x=7i3YQO16jLA2qw-F9O39YmUew4oq-xcBQBs0g@mail.gmail.com
* Use pg_pread() and pg_pwrite() for data files and WAL.Thomas Munro2018-11-07
| | | | | | | | | | | | | Cut down on system calls by doing random I/O using offset-based OS routines where available. Remove the code for tracking the 'virtual' seek position. The only reason left to call FileSeek() was to get the file's size, so provide a new function FileSize() instead. Author: Oskari Saarenmaa, Thomas Munro Reviewed-by: Thomas Munro, Jesper Pedersen, Tom Lane, Alvaro Herrera Discussion: https://postgr.es/m/CAEepm=02rapCpPR3ZGF2vW=SBHSdFYO_bz_f-wwWJonmA3APgw@mail.gmail.com Discussion: https://postgr.es/m/b8748d39-0b19-0514-a1b9-4e5a28e6a208%40gmail.com Discussion: https://postgr.es/m/a86bd200-ebbe-d829-e3ca-0c4474b2fcb7%40ohmu.fi
* Rename rbtree.c functions to use "rbt" prefix not "rb" prefix.Tom Lane2018-11-06
| | | | | | | | | | | | | | | | | | | | | | | | | The "rb" prefix is used by Ruby, so that our existing code results in name collisions that break plruby. We discussed ways to prevent that by adjusting dynamic linker options, but it seems that at best we'd move the pain to other cases. Renaming to avoid the collision is the only portable fix anyway. Fortunately, our rbtree code is not (yet?) widely used --- in core, there's only a single usage in GIN --- so it seems likely that we can get away with a rename. I chose to do this basically as s/rb/rbt/g, except for places where there already was a "t" after "rb". The patch could have been made smaller by only touching linker-visible symbols, but it would have resulted in oddly inconsistent-looking code. Better to make it look like "rbt" was the plan all along. Back-patch to v10. The rbtree.c code exists back to 9.5, but rb_iterate() which is the actual immediate source of pain was added in v10, so it seems like changing the names before that would have more risk than benefit. Per report from Pavel Raiskup. Discussion: https://postgr.es/m/4738198.8KVIIDhgEB@nb.usersys.redhat.com
* Fix spelling errors and typos in commentsMagnus Hagander2018-11-02
| | | | Author: Daniel Gustafsson <daniel@yesql.se>
* Fix memory leak in repeated SPGIST index scans.Tom Lane2018-10-31
| | | | | | | | | | | | | | | | | | | | | | | spgendscan neglected to pfree all the memory allocated by spgbeginscan. It's possible to get away with that in most normal queries, since the memory is allocated in the executor's per-query context which is about to get deleted anyway; but it causes severe memory leakage during creation or filling of large exclusion-constraint indexes. Also, document that amendscan is supposed to free what ambeginscan allocates. The docs' lack of clarity on that point probably caused this bug to begin with. (There is discussion of changing that API spec going forward, but I don't think it'd be appropriate for the back branches.) Per report from Bruno Wolff. It's been like this since the beginning, so back-patch to all active branches. In HEAD, also fix an independent leak caused by commit 2a6368343 (allocating memory during spgrescan instead of spgbeginscan, which might be all right if it got cleaned up, but it didn't). And do a bit of code beautification on that commit, too. Discussion: https://postgr.es/m/20181024012314.GA27428@wolff.to
* Fix typo in xlog.c.Andres Freund2018-10-31
| | | | | Author: Daniel Gustafsson Discussion: https://postgr.es/m/A6817958-949E-4A5B-895D-FA421B6640C2@yesql.se
* Add pg_promote functionMichael Paquier2018-10-25
| | | | | | | | | | | | This function is able to promote a standby with this new SQL-callable function. Execution access can be granted to non-superusers so that failover tools can observe the principle of least privilege. Catalog version is bumped. Author: Laurenz Albe Reviewed-by: Michael Paquier, Masahiko Sawada Discussion: https://postgr.es/m/6e7c79b3ec916cf49742fb8849ed17cd87aed620.camel@cybertec.at
* Correctly set t_self for heap tuples in expand_tupleAndrew Dunstan2018-10-24
| | | | | | | | | | | | Commit 16828d5c0 incorrectly set an invalid pointer for t_self for heap tuples. This patch correctly copies it from the source tuple, and includes a regression test that relies on it being set correctly. Backpatch to release 11. Fixes bug #15448 reported by Tillmann Schulz Diagnosis and test case by Amit Langote
* Move TupleTableSlots boolean member into one flag variable.Andres Freund2018-10-15
| | | | | | | | | | | | | | | There's several reasons for this change: 1) It reduces the total size of TupleTableSlot / reduces alignment padding, making the commonly accessed members fit into a single cacheline (but we currently do not force proper alignment, so that's not yet guaranteed to be helpful) 2) Combining the booleans into a flag allows to combine read/writes from memory. 3) With the upcoming slot abstraction changes, it allows to have core and extended flags, in a memory efficient way. Author: Ashutosh Bapat and Andres Freund Discussion: https://postgr.es/m/20180220224318.gw4oe5jadhpmcdnm@alap3.anarazel.de
* Move generic slot support functions from heaptuple.c into execTuples.c.Andres Freund2018-10-15
| | | | | | | | | | | | | | | heaptuple.c was never a particular good fit for slot_getattr(), slot_getsomeattrs() and slot_getmissingattrs(), but in upcoming changes slots will be made more abstract (allowing slots that contain different types of tuples), making it clearly the wrong place. Note that slot_deform_tuple() remains in it's current place, as it clearly deals with a HeapTuple. getmissingattrs() also remains, but it's less clear that that's correct - but execTuples.c wouldn't be the right place. Author: Ashutosh Bapat. Discussion: https://postgr.es/m/20180220224318.gw4oe5jadhpmcdnm@alap3.anarazel.de
* Simplify use of AllocSetContextCreate() wrapper macro.Tom Lane2018-10-12
| | | | | | | | | | | | | | | We can allow this macro to accept either abbreviated or non-abbreviated allocation parameters by making use of __VA_ARGS__. As noted by Andres Freund, it's unlikely that any compiler would have __builtin_constant_p but not __VA_ARGS__, so this gives up little or no error checking, and it avoids a minor but annoying API break for extensions. With this change, there is no reason for anybody to call AllocSetContextCreateExtended directly, so in HEAD I renamed it to AllocSetContextCreateInternal. It's probably too late for an ABI break like that in 11, though. Discussion: https://postgr.es/m/20181012170355.bhxi273skjt6sag4@alap3.anarazel.de
* Remove deprecated abstime, reltime, tinterval datatypes.Andres Freund2018-10-11
| | | | | | | | | | | | These types have been deprecated for a *long* time. Catversion bump, for obvious reasons. Author: Andres Freund Discussion: https://postgr.es/m/20181009192237.34wjp3nmw7oynmmr@alap3.anarazel.de https://postgr.es/m/20171213080506.cwjkpcz3bkk6yz2u@alap3.anarazel.de https://postgr.es/m/25615.1513115237@sss.pgh.pa.us
* Fix logical decoding error when system table w/ toast is repeatedly rewritten.Andres Freund2018-10-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Repeatedly rewriting a mapped catalog table with VACUUM FULL or CLUSTER could cause logical decoding to fail with: ERROR, "could not map filenode \"%s\" to relation OID" To trigger the problem the rewritten catalog had to have live tuples with toasted columns. The problem was triggered as during catalog table rewrites the heap_insert() check that prevents logical decoding information to be emitted for system catalogs, failed to treat the new heap's toast table as a system catalog (because the new heap is not recognized as a catalog table via RelationIsLogicallyLogged()). The relmapper, in contrast to the normal catalog contents, does not contain historical information. After a single rewrite of a mapped table the new relation is known to the relmapper, but if the table is rewritten twice before logical decoding occurs, the relfilenode cannot be mapped to a relation anymore. Which then leads us to error out. This only happens for toast tables, because the main table contents aren't re-inserted with heap_insert(). The fix is simple, add a new heap_insert() flag that prevents logical decoding information from being emitted, and accept during decoding that there might not be tuple data for toast tables. Unfortunately that does not fix pre-existing logical decoding errors. Doing so would require not throwing an error when a filenode cannot be mapped to a relation during decoding, and that seems too likely to hide bugs. If it's crucial to fix decoding for an existing slot, temporarily changing the ERROR in ReorderBufferCommit() to a WARNING appears to be the best fix. Author: Andres Freund Discussion: https://postgr.es/m/20180914021046.oi7dm4ra3ot2g2kt@alap3.anarazel.de Backpatch: 9.4-, where logical decoding was introduced
* Relax transactional restrictions on ALTER TYPE ... ADD VALUE (redux).Thomas Munro2018-10-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Originally committed as 15bc038f (plus some follow-ups), this was reverted in 28e07270 due to a problem discovered in parallel workers. This new version corrects that problem by sending the list of uncommitted enum values to parallel workers. Here follows the original commit message describing the change: To prevent possibly breaking indexes on enum columns, we must keep uncommitted enum values from getting stored in tables, unless we can be sure that any such column is new in the current transaction. Formerly, we enforced this by disallowing ALTER TYPE ... ADD VALUE from being executed at all in a transaction block, unless the target enum type had been created in the current transaction. This patch removes that restriction, and instead insists that an uncommitted enum value can't be referenced unless it belongs to an enum type created in the same transaction as the value. Per discussion, this should be a bit less onerous. It does require each function that could possibly return a new enum value to SQL operations to check this restriction, but there aren't so many of those that this seems unmaintainable. Author: Andrew Dunstan and Tom Lane, with parallel query fix by Thomas Munro Reviewed-by: Tom Lane Discussion: https://postgr.es/m/CAEepm%3D0Ei7g6PaNTbcmAh9tCRahQrk%3Dr5ZWLD-jr7hXweYX3yg%40mail.gmail.com Discussion: https://postgr.es/m/4075.1459088427%40sss.pgh.pa.us
* Advance transaction timestamp for intra-procedure transactions.Tom Lane2018-10-08
| | | | | | | | Per discussion, this behavior seems less astonishing than not doing so. Peter Eisentraut and Tom Lane Discussion: https://postgr.es/m/20180920234040.GC29981@momjian.us
* Restore sane locking behavior during parallel query.Tom Lane2018-10-06
| | | | | | | | | | | | | | Commit 9a3cebeaa changed things so that parallel workers didn't obtain any lock of their own on tables they access. That was clearly a bad idea, but I'd mistakenly supposed that it was the intended end result of the series of patches for simplifying the executor's lock management. Undo that change in relation_open(), and adjust ExecOpenScanRelation() so that it gets the correct lock if inside a parallel worker. In passing, clean up some more obsolete comments about when locks are acquired. Discussion: https://postgr.es/m/468c85d9-540e-66a2-1dde-fec2b741e688@lab.ntt.co.jp
* Propagate xactStartTimestamp and stmtStartTimestamp to parallel workers.Tom Lane2018-10-06
| | | | | | | | | | | | | | | | | | | | | | Previously, a worker process would establish values for these based on its own start time. In v10 and up, this can trivially be shown to cause misbehavior of transaction_timestamp(), timestamp_in(), and related functions which are (perhaps unwisely?) marked parallel-safe. It seems likely that other behaviors might diverge from what happens in the parent as well. It's not as trivial to demonstrate problems in 9.6 or 9.5, but I'm sure it's still possible, so back-patch to all branches containing parallel worker infrastructure. In HEAD only, mark now() and statement_timestamp() as parallel-safe (other affected functions already were). While in theory we could still squeeze that change into v11, it doesn't seem important enough to force a last-minute catversion bump. Konstantin Knizhnik, whacked around a bit by me Discussion: https://postgr.es/m/6406dbd2-5d37-4cb6-6eb2-9c44172c7e7c@postgrespro.ru
* Allow btree comparison functions to return INT_MIN.Tom Lane2018-10-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Historically we forbade datatype-specific comparison functions from returning INT_MIN, so that it would be safe to invert the sort order just by negating the comparison result. However, this was never really safe for comparison functions that directly return the result of memcmp(), strcmp(), etc, as POSIX doesn't place any such restriction on those library functions. Buildfarm results show that at least on recent Linux on s390x, memcmp() actually does return INT_MIN sometimes, causing sort failures. The agreed-on answer is to remove this restriction and fix relevant call sites to not make such an assumption; code such as "res = -res" should be replaced by "INVERT_COMPARE_RESULT(res)". The same is needed in a few places that just directly negated the result of memcmp or strcmp. To help find places having this problem, I've also added a compile option to nbtcompare.c that causes some of the commonly used comparators to return INT_MIN/INT_MAX instead of their usual -1/+1. It'd likely be a good idea to have at least one buildfarm member running with "-DSTRESS_SORT_INT_MIN". That's far from a complete test of course, but it should help to prevent fresh introductions of such bugs. This is a longstanding portability hazard, so back-patch to all supported branches. Discussion: https://postgr.es/m/20180928185215.ffoq2xrq5d3pafna@alap3.anarazel.de
* Change executor to just Assert that table locks were already obtained.Tom Lane2018-10-03
| | | | | | | | | | | | | | | | | | | | | | Instead of locking tables during executor startup, just Assert that suitable locks were obtained already during the parse/plan pipeline (or re-obtained by the plan cache). This must be so, else we have a hazard that concurrent DDL has invalidated the plan. This is pretty inefficient as well as undercommented, but it's all going to go away shortly, so I didn't try hard. This commit is just another attempt to use the buildfarm to see if we've missed anything in the plan to simplify the executor's table management. Note that the change needed here in relation_open() exposes that parallel workers now really are accessing tables without holding any lock of their own, whereas they were not doing that before this commit. This does not give me a warm fuzzy feeling about that aspect of parallel query; it does not seem like a good design, and we now know that it's had exactly no actual testing. I think that we should modify parallel query so that that change can be reverted. Discussion: https://postgr.es/m/468c85d9-540e-66a2-1dde-fec2b741e688@lab.ntt.co.jp
* Use slots more widely in tuple mapping code and make naming more consistent.Andres Freund2018-10-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | It's inefficient to use a single slot for mapping between tuple descriptors for multiple tuples, as previously done when using ConvertPartitionTupleSlot(), as that means the slot's tuple descriptors change for every tuple. Previously we also, via ConvertPartitionTupleSlot(), built new tuples after the mapping even in cases where we, immediately afterwards, access individual columns again. Refactor the code so one slot, on demand, is used for each partition. That avoids having to change the descriptor (and allows to use the more efficient "fixed" tuple slots). Then use slot->slot mapping, to avoid unnecessarily forming a tuple. As the naming between the tuple and slot mapping functions wasn't consistent, rename them to execute_attr_map_{tuple,slot}. It's likely that we'll also rename convert_tuples_by_* to denote that these functions "only" build a map, but that's left for later. Author: Amit Khandekar and Amit Langote, editorialized by me Reviewed-By: Amit Langote, Amit Khandekar, Andres Freund Discussion: https://postgr.es/m/CAJ3gD9fR0wRNeAE8VqffNTyONS_UfFPRpqxhnD9Q42vZB+Jvpg@mail.gmail.com https://postgr.es/m/e4f9d743-cd4b-efb0-7574-da21d86a7f36%40lab.ntt.co.jp Backpatch: -
* Add assertions that we hold some relevant lock during relation open.Tom Lane2018-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | Opening a relation with no lock at all is unsafe; there's no guarantee that we'll see a consistent state of the relevant catalog entries. While use of MVCC scans to read the catalogs partially addresses that complaint, it's still possible to switch to a new catalog snapshot partway through loading the relcache entry. Moreover, whether or not you trust the reasoning behind sometimes using less than AccessExclusiveLock for ALTER TABLE, that reasoning is certainly not valid if concurrent users of the table don't hold a lock corresponding to the operation they want to perform. Hence, add some assertion-build-only checks that require any caller of relation_open(x, NoLock) to hold at least AccessShareLock. This isn't a full solution, since we can't verify that the lock level is semantically appropriate for the action --- but it's definitely of some use, because it's already caught two bugs. We can also assert that callers of addRangeTableEntryForRelation() hold at least the lock level specified for the new RTE. Amit Langote and Tom Lane Discussion: https://postgr.es/m/16565.1538327894@sss.pgh.pa.us
* Fix assertion failure when updating full_page_writes for checkpointer.Amit Kapila2018-09-28
| | | | | | | | | | | | | | | | | | When the checkpointer receives a SIGHUP signal to update its configuration, it may need to update the shared memory for full_page_writes and need to write a WAL record for it. Now, it is quite possible that the XLOG machinery has not been initialized by that time and it will lead to assertion failure while doing that. Fix is to allow the initialization of the XLOG machinery outside critical section. This bug has been introduced by the commit 2c03216d83 which added the XLOG machinery initialization in RecoveryInProgress code path. Reported-by: Dilip Kumar Author: Dilip Kumar Reviewed-by: Michael Paquier and Amit Kapila Backpatch-through: 9.5 Discussion: https://postgr.es/m/CAFiTN-u4BA8KXcQUWDPNgaKAjDXC=C2whnzBM8TAcv=stckYUw@mail.gmail.com
* Fix WAL recycling on standbys depending on archive_modeMichael Paquier2018-09-28
| | | | | | | | | | | | | | | | | | | | | | | | A restart point or a checkpoint recycling WAL segments treats segments marked with neither ".done" (archiving is done) or ".ready" (segment is ready to be archived) in archive_status the same way for archive_mode being "on" or "always". While for a primary this is fine, a standby running a restart point with archive_mode = on would try to mark such a segment as ready for archiving, which is something that will never happen except after the standby is promoted. Note that this problem applies only to WAL segments coming from the local pg_wal the first time archive recovery is run. Segments part of a self-contained base backup are the most common case where this could happen, however even in this case normally the .done markers would be most likely part of the backup. Segments recovered from an archive are marked as .ready or .done by the startup process, and segments finished streaming are marked as such by the WAL receiver, so they are handled already. Reported-by: Haruka Takatsuka Author: Michael Paquier Discussion: https://postgr.es/m/15402-a453c90ed4cf88b2@postgresql.org Backpatch-through: 9.5, where archive_mode = always has been added.
* Minor formatting cleanup for 2a6368343fAlexander Korotkov2018-09-27
|
* Remove extra usage of BoxPGetDatum() macroAlexander Korotkov2018-09-27
| | | | | Author: Mark Dilger Discussion: https://postgr.es/m/B2AEFCD0-836D-4654-9D59-3DF616E0A6F3%40gmail.com
* Rework activation of commit timestamps during recoveryMichael Paquier2018-09-26
| | | | | | | | | | | | | | | | | | | | | | | | | | The activation and deactivation of commit timestamp tracking has not been handled consistently for a primary or standbys at recovery. The facility can be activated at three different moments of recovery: - The beginning, where a primary would use the GUC value for the decision-making, and where a standby relies on the contents of the control file. - When replaying a XLOG_PARAMETER_CHANGE record at redo. - The end, where both primary and standby rely on the GUC value. Using the GUC value for a primary at the beginning of recovery causes problems with commit timestamp access when doing crash recovery. Particularly, when replaying transaction commits, it could be possible that an attempt to read commit timestamps is done for a transaction which committed at a moment when track_commit_timestamp was disabled. A test case is added to reproduce the failure. The test works down to v11 as it takes advantage of transaction commits within procedures. Reported-by: Hailong Li Author: Masahiko Sawasa, Michael Paquier Reviewed-by: Kyotaro Horiguchi Discussion: https://postgr.es/m/11224478-a782-203b-1f17-e4797b39bdf0@qunar.com Backpatch-through: 9.5, where commit timestamps have been introduced.
* Split ExecStoreTuple into ExecStoreHeapTuple and ExecStoreBufferHeapTuple.Andres Freund2018-09-25
| | | | | | | | | | | | | | | | | | | | Upcoming changes introduce further types of tuple table slots, in preparation of making table storage pluggable. New storage methods will have different representation of tuples, therefore the slot accessor should refer explicitly to heap tuples. Instead of just renaming the functions, split it into one function that accepts heap tuples not residing in buffers, and one accepting ones in buffers. Previously one function was used for both, but that was a bit awkward already, and splitting will allow us to represent slot types for tuples in buffers and normal memory separately. This is split out from the patch introducing abstract slots, as this largely consists out of mechanical changes. Author: Ashutosh Bapat Reviewed-By: Andres Freund Discussion: https://postgr.es/m/20180220224318.gw4oe5jadhpmcdnm@alap3.anarazel.de
* Fast default trigger and expand_tuple fixesAndrew Dunstan2018-09-24
| | | | | | | | | | | | | | Ensure that triggers get properly filled in tuples for the OLD value. Also fix the logic of detecting missing null values. The previous logic failed to detect a missing null column before the first missing column with a default. Fixing this has simplified the logic a bit. Regression tests are added to test changes. This should ensure better coverage of expand_tuple(). Original bug reports, and some code and test scripts from Tomas Vondra Backpatch to release 11.
* Defer restoration of libraries in parallel workers.Thomas Munro2018-09-20
| | | | | | | | | | | | | | | | | | | | | | Several users of extensions complained of crashes in parallel workers that turned out to be due to syscache access from their _PG_init() functions. Reorder the initialization of parallel workers so that libraries are restored after the caches are initialized, and inside a transaction. This was reported in bug #15350 and elsewhere. We don't consider it to be a bug: extensions shouldn't do that, because then they can't be used in shared_preload_libraries. However, it's a fairly obscure hazard and these extensions worked in practice before parallel query came along. So let's make it work. Later commits might add a warning message and eventually an error. Back-patch to 9.6, where parallel query landed. Author: Thomas Munro Reviewed-by: Amit Kapila Reported-by: Kieran McCusker, Jimmy Discussion: https://postgr.es/m/153512195228.1489.8545997741965926448%40wrigleys.postgresql.org
* Fix minor error message style guide violation.Tom Lane2018-09-19
| | | | | | | | No periods at the ends of primary error messages, please. Daniel Gustafsson Discussion: https://postgr.es/m/43E004C0-18C6-42B4-A313-003B43EB0571@yesql.se
* Add support for nearest-neighbor (KNN) searches to SP-GiSTAlexander Korotkov2018-09-19
| | | | | | | | | | | | | Currently, KNN searches were supported only by GiST. SP-GiST also capable to support them. This commit implements that support. SP-GiST scan stack is replaced with queue, which serves as stack if no ordering is specified. KNN support is provided for three SP-GIST opclasses: quad_point_ops, kd_point_ops and poly_ops (catversion is bumped). Some common parts between GiST and SP-GiST KNNs are extracted into separate functions. Discussion: https://postgr.es/m/570825e8-47d0-4732-2bf6-88d67d2d51c8%40postgrespro.ru Author: Nikita Glukhov, Alexander Korotkov based on GSoC work by Vlad Sterzhanov Review: Andrey Borodin, Alexander Korotkov
* Attach FPI to the first record after full_page_writes is turned on.Amit Kapila2018-09-13
| | | | | | | | | | | | | | | | XLogInsert fails to attach a required FPI to the first record after full_page_writes is turned on by the last checkpoint. This bug got introduced in 9.5 due to code rearrangement in commits 2c03216d83 and 2076db2aea. Fix it by ensuring that XLogInsertRecord performs a recomputation when the given record is generated with FPW as off but found that the flag has been turned on while actually inserting the record. Reported-by: Kyotaro Horiguchi Author: Kyotaro Horiguchi Reviewed-by: Amit Kapila Backpatch-through: 9.5 where this problem was introduced Discussion: https://postgr.es/m/20180420.151043.74298611.horiguchi.kyotaro@lab.ntt.co.jp
* Repair double-free in SP-GIST rescan (bug #15378)Andrew Gierth2018-09-11
| | | | | | | | | | | | | | | | | | | | | | spgrescan would first reset traversalCxt, and then traverse a potentially non-empty stack containing pointers to traversalValues which had been allocated in those contexts, freeing them a second time. This bug originates in commit ccd6eb49a where traversalValue was introduced. Repair by traversing the stack before the context reset; this isn't ideal, since it means doing retail pfree in a context that's about to be reset, but the freeing of a stack entry is also done in other places in the code during the scan so it's not worth trying to refactor it further. Regression test added. Backpatch to 9.6 where the problem was introduced. Per bug #15378; analysis and patch by me, originally from a report on IRC by user velix; see also PostGIS ticket #4174; review by Alexander Korotkov. Discussion: https://postgr.es/m/153663176628.23136.11901365223750051490@wrigleys.postgresql.org
* Fix past pd_upper write in ginRedoRecompress()Alexander Korotkov2018-09-09
| | | | | | | | | | | | | | | | ginRedoRecompress() replays actions over compressed segments of posting list in-place. However, it might lead to write past pg_upper, because intermediate state during playing the changes can take more space than both original state and final state. This commit fixes that by refuse from in-place modification. Instead page tail is copied once modification is started, and then it's used as the source of original segments. Backpatch to 9.4 where posting list compression was introduced. Reported-by: Sivasubramanian Ramasubramanian Discussion: https://postgr.es/m/1536091151804.6588%40amazon.com Author: Alexander Korotkov based on patch from and ideas by Sivasubramanian Ramasubramanian Review: Sivasubramanian Ramasubramanian Backpatch-through: 9.4
* Remove duplicated words split across lines in commentsMichael Paquier2018-09-08
| | | | | | | | This has been detected using some interesting tricks with sed, and the method used is mentioned in details in the discussion below. Author: Justin Pryzby Discussion: https://postgr.es/m/20180908013109.GB15350@telsasoft.com
* Improve handling of corrupted two-phase state files at recoveryMichael Paquier2018-09-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When a corrupted two-phase state file is found by WAL replay, be it for crash recovery or archive recovery, then the file is simply skipped and a WARNING is logged to the user, causing the transaction to be silently lost. Facing an on-disk WAL file which is corrupted is as likely to happen as what is stored in WAL records, but WAL records are already able to fail hard if there is a CRC mismatch. On-disk two-phase state files, on the contrary, are simply ignored if corrupted. Note that when restoring the initial two-phase data state at recovery, files newer than the horizon XID are discarded hence no files present in pg_twophase/ should be torned and have been made durable by a previous checkpoint, so recovery should never see any corrupted two-phase state file by design. The situation got better since 978b2f6 which has added two-phase state information directly in WAL instead of using on-disk files, so the risk is limited to two-phase transactions which live across at least one checkpoint for long periods. Backups having legit two-phase state files on-disk could also lose silently transactions when restored if things get corrupted. This behavior exists since two-phase commit has been introduced, no back-patch is done for now per the lack of complaints about this problem. Author: Michael Paquier Discussion: https://postgr.es/m/20180709050309.GM1467@paquier.xyz
* Use C99 designated initializers for some structsPeter Eisentraut2018-09-07
| | | | | | | These are just a few particularly egregious cases that were hard to read and write, and error prone because of many similar adjacent types. Discussion: https://www.postgresql.org/message-id/flat/4c9f01be-9245-2148-b569-61a8562ef190%402ndquadrant.com
* During the split, set checksum on an empty hash index page.Amit Kapila2018-09-04
| | | | | | | | | | | | | | | | On a split, we allocate a new splitpoint's worth of bucket pages wherein we initialize the last page with zeros which is fine, but we forgot to set the checksum for that last page. We decided to back-patch this fix till 10 because we don't have an easy way to test it in prior versions. Another reason is that the hash-index code is changed heavily in 10, so it is not advisable to push the fix without testing it in prior versions. Author: Amit Kapila Reviewed-by: Yugo Nagata Backpatch-through: 10 Discussion: https://postgr.es/m/5d03686d-727c-dbf8-0064-bf8b97ffe850@2ndquadrant.com
* Avoid using potentially-under-aligned page buffers.Tom Lane2018-09-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's a project policy against using plain "char buf[BLCKSZ]" local or static variables as page buffers; preferred style is to palloc or malloc each buffer to ensure it is MAXALIGN'd. However, that policy's been ignored in an increasing number of places. We've apparently got away with it so far, probably because (a) relatively few people use platforms on which misalignment causes core dumps and/or (b) the variables chance to be sufficiently aligned anyway. But this is not something to rely on. Moreover, even if we don't get a core dump, we might be paying a lot of cycles for misaligned accesses. To fix, invent new union types PGAlignedBlock and PGAlignedXLogBlock that the compiler must allocate with sufficient alignment, and use those in place of plain char arrays. I used these types even for variables where there's no risk of a misaligned access, since ensuring proper alignment should make kernel data transfers faster. I also changed some places where we had been palloc'ing short-lived buffers, for coding style uniformity and to save palloc/pfree overhead. Since this seems to be a live portability hazard (despite the lack of field reports), back-patch to all supported versions. Patch by me; thanks to Michael Paquier for review. Discussion: https://postgr.es/m/1535618100.1286.3.camel@credativ.de
* Ensure correct minimum consistent point on standbysMichael Paquier2018-08-31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Startup process has improved its calculation of incorrect minimum consistent point in 8d68ee6, which ensures that all WAL available gets replayed when doing crash recovery, and has introduced an incorrect calculation of the minimum recovery point for non-startup processes, which can cause incorrect page references on a standby when for example the background writer flushed a couple of pages on-disk but was not updating the control file to let a subsequent crash recovery replay to where it should have. The only case where this has been reported to be a problem is when a standby needs to calculate the latest removed xid when replaying a btree deletion record, so one would need connections on a standby that happen just after recovery has thought it reached a consistent point. Using a background worker which is started after the consistent point is reached would be the easiest way to get into problems if it connects to a database. Having clients which attempt to connect periodically could also be a problem, but the odds of seeing this problem are much lower. The fix used is pretty simple, as the idea is to give access to the minimum recovery point written in the control file to non-startup processes so as they use a reference, while the startup process still initializes its own references of the minimum consistent point so as the original problem with incorrect page references happening post-promotion with a crash do not show up. Reported-by: Alexander Kukushkin Diagnosed-by: Alexander Kukushkin Author: Michael Paquier Reviewed-by: Kyotaro Horiguchi, Alexander Kukushkin Discussion: https://postgr.es/m/153492341830.1368.3936905691758473953@wrigleys.postgresql.org Backpatch-through: 9.3
* Deduplicate code between slot_getallattrs() and slot_getsomeattrs().Andres Freund2018-08-23
| | | | | | | | | | | | | | | Code in slot_getallattrs() is the same as if slot_getsomeattrs() is called with number of attributes specified in the tuple descriptor. Implement it that way instead of duplicating the code between those two functions. This is part of a patchseries abstracting TupleTableSlots so they can store arbitrary forms of tuples, but is a nice enough cleanup on its own. Author: Ashutosh Bapat Reviewed-By: Andres Freund Discussion: https://postgr.es/m/20180220224318.gw4oe5jadhpmcdnm@alap3.anarazel.de
* doc: Update uses of the word "procedure"Peter Eisentraut2018-08-22
| | | | | | | | | | | | | | | | | | | Historically, the term procedure was used as a synonym for function in Postgres/PostgreSQL. Now we have procedures as separate objects from functions, so we need to clean up the documentation to not mix those terms. In particular, mentions of "trigger procedures" are changed to "trigger functions", and access method "support procedures" are changed to "support functions". (The latter already used FUNCTION in the SQL syntax anyway.) Also, the terminology in the SPI chapter has been cleaned up. A few tests, examples, and code comments are also adjusted to be consistent with documentation changes, but not everything. Reported-by: Peter Geoghegan <pg@bowt.ie> Reviewed-by: Jonathan S. Katz <jonathan.katz@excoventures.com>