aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access
Commit message (Collapse)AuthorAge
...
* pgindent run prior to branchingAndrew Dunstan2018-06-30
|
* Cosmetic improvements for faster column addition.Amit Kapila2018-06-27
| | | | | | | | | | Changed the name of few structure members for the sake of clarity and removed spurious whitespace. Reported-by: Amit Kapila Author: Amit Kapila, based on suggestion by Andrew Dunstan Reviewed-by: Alvaro Herrera Discussion: https://postgr.es/m/CAA4eK1K2znsFpC+NQ9A4vxT4uDxADN4RmvHX0L6Y=aHVo9gB4Q@mail.gmail.com
* Fix upper limit for vacuum_cleanup_index_scale_factorAlexander Korotkov2018-06-26
| | | | | | | | | | | | | | | 6ca33a88 sets upper limit for vacuum_cleanup_index_scale_factor to DBL_MAX. DBL_MAX appears to be platform-dependent. That causes many buildfarm animals to fail, because we check boundaries of vacuum_cleanup_index_scale_factor in regression tests. This commit changes upper limit from DBL_MAX to just "large enough" limit, which was arbitrary selected as 1e10. Author: Alexander Korotkov Reported-by: Tom Lane, Darafei Praliaskouski Discussion: https://postgr.es/m/CAPpHfdvewmr4PcpRjrkstoNn1n2_6dL-iHRB21CCfZ0efZdBTg%40mail.gmail.com Discussion: https://postgr.es/m/CAC8Q8tLYFOpKNaPS_E7V8KtPdE%3D_TnAn16t%3DA3LuL%3DXjfOO-BQ%40mail.gmail.com
* Remove obsolete comment block in nbtsort.c.Peter Geoghegan2018-06-26
| | | | | | | | | | | Building a new nbtree index through incremental insertions would always be slower than our actual approach of sorting using tuplesort, assembling leaf pages from tuplesort output, and writing and WAL-logging whole pages. Remove a comment block from the Berkeley days claiming that incremental insertions might be slightly faster with presorted input. Discussion: https://postgr.es/m/CAH2-WzmKs4mLAoFgJ3yHMRYc849efc=dw+pNRb3NEog2oJoCNw@mail.gmail.com
* Increase upper limit for vacuum_cleanup_index_scale_factorAlexander Korotkov2018-06-26
| | | | | | | | | | | | | | | | | | Upper limits for vacuum_cleanup_index_scale_factor GUC and reloption were initially set to 100.0 in 857f9c36. However, after further discussion, it appears that some users like to disable B-tree cleanup index scan completely (assuming there are no deleted pages). vacuum_cleanup_index_scale_factor is used barely to protect against stalled index statistics. And after detailed consideration it appears that risk of stalled index statistics is low. And it would be nice to allow advanced users setting higher values of vacuum_cleanup_index_scale_factor. So, set upper limit for these GUC and reloption to DBL_MAX. Author: Alexander Korotkov Reviewed-by: Masahiko Sawada Discussion: https://postgr.es/m/CAC8Q8tJCb%3DgxhzcV7T6ctx7PY-Ux1oA-AsTJc6cAVNsQiYcCzA%40mail.gmail.com
* Address set of issues with errno handlingMichael Paquier2018-06-25
| | | | | | | | | | | | | | | | | | | | System calls mixed up in error code paths are causing two issues which several code paths have not correctly handled: 1) For write() calls, sometimes the system may return less bytes than what has been written without errno being set. Some paths were careful enough to consider that case, and assumed that errno should be set to ENOSPC, other calls missed that. 2) errno generated by a system call is overwritten by other system calls which may succeed once an error code path is taken, causing what is reported to the user to be incorrect. This patch uses the brute-force approach of correcting all those code paths. Some refactoring could happen in the future, but this is let as future work, which is not targeted for back-branches anyway. Author: Michael Paquier Reviewed-by: Ashutosh Sharma Discussion: https://postgr.es/m/20180622061535.GD5215@paquier.xyz
* Fix typo in comment of commit_ts.c for incorrect reference to CLOGMichael Paquier2018-06-22
| | | | Author: Shao Bret
* Prevent hard failures of standbys caused by recycled WAL segmentsMichael Paquier2018-06-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a standby's WAL receiver stops reading WAL from a WAL stream, it writes data to the current WAL segment without having priorily zero'ed the page currently written to, which can cause the WAL reader to read junk data from a past recycled segment and then it would try to get a record from it. While sanity checks in place provide most of the protection needed, in some rare circumstances, with chances increasing when a record header crosses a page boundary, then the startup process could fail violently on an allocation failure, as follows: FATAL: invalid memory alloc request size XXX This is confusing for the user and also unhelpful as this requires in the worst case a manual restart of the instance, impacting potentially the availability of the cluster, and this also makes WAL data look like it is in a corrupted state. The chances of seeing failures are higher if the connection between the standby and its root node is unstable, causing WAL pages to be written in the middle. A couple of approaches have been discussed, like zero-ing new WAL pages within the WAL receiver itself but this has the disadvantage of impacting performance of any existing instances as this breaks the sequential writes done by the WAL receiver. This commit deals with the problem with a more simple approach, which has no performance impact without reducing the detection of the problem: if a record is found with a length higher than 1GB for backends, then do not try any allocation and report a soft failure which will force the standby to retry reading WAL. It could be possible that the allocation call passes and that an unnecessary amount of memory is allocated, however follow-up checks on records would just fail, making this allocation short-lived anyway. This patch owes a great deal to Tsunakawa Takayuki for reporting the failure first, and then discussing a couple of potential approaches to the problem. Backpatch down to 9.5, which is where palloc_extended has been introduced. Reported-by: Tsunakawa Takayuki Reviewed-by: Tsunakawa Takayuki Author: Michael Paquier Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05
* Remove AELs from subxids correctly on standbySimon Riggs2018-06-16
| | | | | | | | | | | | | | | | | | | | Issues relate only to subtransactions that hold AccessExclusiveLocks when replayed on standby. Prior to PG10, aborting subtransactions that held an AccessExclusiveLock failed to release the lock until top level commit or abort. 49bff5300d527 fixed that. However, 49bff5300d527 also introduced a similar bug where subtransaction commit would fail to release an AccessExclusiveLock, leaving the lock to be removed sometimes early and sometimes late. This commit fixes that bug also. Backpatch to PG10 needed. Tested by observation. Note need for multi-node isolationtester to improve test coverage for this and other HS cases. Reported-by: Simon Riggs Author: Simon Riggs
* Fix off-by-one bug in XactLogCommitRecordAlvaro Herrera2018-06-15
| | | | | | | | | | | Commit 1eb6d6527aae introduced zeroed alignment bytes in the GID field of commit/abort WAL records. Fixup commit cf5a1890592b later changed that representation into a regular cstring with a single terminating zero byte, but it also introduced an off-by-one mistake. Fix that. Author: Nikhil Sontakke Reported-by: Nikhil Sontakke Discussion: https://postgr.es/m/CAMGcDxey6dG1DP34_tJMoWPcp5sPJUAL4K5CayUUXLQSx2GQpA@mail.gmail.com
* Fail BRIN control functions during recovery explicitlyAlvaro Herrera2018-06-14
| | | | | | | | | | | They already fail anyway, but prior to this patch they raise an ugly error message about a lock that cannot be acquired. This just improves the message. Author: Masahiko Sawada Reported-by: Masahiko Sawada Discussion: https://postgr.es/m/CAD21AoBZau4g4_NUf3BKNd=CdYK+xaPdtJCzvOC1TxGdTiJx_Q@mail.gmail.com Reviewed-by: Kuntal Ghosh, Alexander Korotkov, Simon Riggs, Michaël Paquier, Álvaro Herrera
* Fix function code in error reportAlvaro Herrera2018-06-06
| | | | | | | | | | | | | This bug causes a lseek() failure to be reported as a "could not open" failure in the error message, muddling bug reports. I introduced this copy-and-pasteo in commit 78e122010422. Noticed while reviewing code for bug report #15221, from lily liang. In version 10 the affected function is only used by multixact.c and commit_ts, and only in corner-case circumstances, neither of which are involved in the reported bug (a pg_subtrans failure.) Author: Álvaro Herrera
* Move _bt_upgrademetapage() into critical section.Teodor Sigaev2018-05-30
| | | | | | | | Any changes on page should be done in critical section, so move _bt_upgrademetapage into critical section. Improve comment. Found by Amit Kapila during post-commit review of 857f9c36. Author: Amit Kapila
* printf("%lf") is not portable, so omit the "l".Tom Lane2018-05-20
| | | | | | | | | | | | The "l" (ell) width spec means something in the corresponding scanf usage, but not here. While modern POSIX says that applying "l" to "f" and other floating format specs is a no-op, SUSv2 says it's undefined. Buildfarm experience says that some old compilers emit warnings about it, and at least one old stdio implementation (mingw's "ANSI" option) actually produces wrong answers and/or crashes. Discussion: https://postgr.es/m/21670.1526769114@sss.pgh.pa.us Discussion: https://postgr.es/m/c085e1da-0d64-1c15-242d-c921f32e0d5c@dunslane.net
* Fix error message on short read of pg_controlMagnus Hagander2018-05-18
| | | | | Instead of saying "error: success", indicate that we got a working read but it was too short.
* Message wording and pluralization improvementsPeter Eisentraut2018-05-17
|
* Various improvements of skipping index scan during vacuum technicsTeodor Sigaev2018-05-10
| | | | | | | | | | | | | | | | | - Change vacuum_cleanup_index_scale_factor GUC to PGC_USERSET. vacuum_cleanup_index_scale_factor GUC was defined as PGC_SIGHUP. But this GUC affects not only autovacuum. So it might be useful to change it from user session in order to influence manually runned VACUUM. - Add missing tab-complete support for vacuum_cleanup_index_scale_factor reloption. - Fix condition for B-tree index cleanup. Zero value of vacuum_cleanup_index_scale_factor means that user wants B-tree index cleanup to be never skipped. - Documentation and comment improvements Authors: Justin Pryzby, Alexander Korotkov, Liudmila Mantrova Reviewed by: all authors and Robert Haas Discussion: https://www.postgresql.org/message-id/flat/20180502023025.GD7631%40telsasoft.com
* Fix scenario where streaming standby gets stuck at a continuation record.Heikki Linnakangas2018-05-05
| | | | | | | | | | | | | | | | | | | | | If a continuation record is split so that its first half has already been removed from the master, and is only present in pg_wal, and there is a recycled WAL segment in the standby server that looks like it would contain the second half, recovery would get stuck. The code in XLogPageRead() incorrectly started streaming at the beginning of the WAL record, even if we had already read the first page. Backpatch to 9.4. In principle, older versions have the same problem, but without replication slots, there was no straightforward mechanism to prevent the master from recycling old WAL that was still needed by standby. Without such a mechanism, I think it's reasonable to assume that there's enough slack in how many old segments are kept around to not run into this, or you have a WAL archive. Reported by Jonathon Nelson. Analysis and patch by Kyotaro HORIGUCHI, with some extra comments by me. Discussion: https://www.postgresql.org/message-id/CACJqAM3xVz0JY1XFDKPP%2BJoJAjoGx%3DGNuOAshEDWCext7BFvCQ%40mail.gmail.com
* Don't mark pages all-visible spuriouslyAlvaro Herrera2018-05-04
| | | | | | | | | | | | | | | | | | | | | | | | | Dan Wood diagnosed a long-standing problem that pages containing tuples that are locked by multixacts containing live lockers may spuriously end up as candidates for getting their all-visible flag set. This has the long-term effect that multixacts remain unfrozen; this may previously pass undetected, but since commit XYZ it would be reported as "ERROR: found multixact 134100944 from before relminmxid 192042633" because when a later vacuum tries to freeze the page it detects that a multixact that should have gotten frozen, wasn't. Dan proposed a (correct) patch that simply sets a variable to its correct value, after a bogus initialization. But, per discussion, it seems better coding to avoid the bogus initializations altogether, since they could give rise to more bugs later. Therefore this fix rewrites the logic a little bit to avoid depending on the bogus initializations. This bug was part of a family introduced in 9.6 by commit a892234f830e; later, commit 38e9f90a227d fixed most of them, but this one was unnoticed. Authors: Dan Wood, Pavan Deolasee, Álvaro Herrera Reviewed-by: Masahiko Sawada, Pavan Deolasee, Álvaro Herrera Discussion: https://postgr.es/m/84EBAC55-F06D-4FBE-A3F3-8BDA093CE3E3@amazon.com
* Don't truncate away non-key attributes for leftmost downlinks.Teodor Sigaev2018-05-04
| | | | | | | | | | | | nbtsort.c does not need to truncate away non-key attributes for the minimum key of the leftmost page on a level, since this is only used to build a minus infinity downlink for the level's leftmost page. Truncating away non-key attributes in advance of truncating away all attributes in _bt_sortaddtup() does not affect the correctness of CREATE INDEX, but it is misleading. Author: Peter Geoghegan Discussion: https://www.postgresql.org/message-id/CAH2-WzkAS2M3ussHG-s_Av=Zo6dPjOxyu5fNRkYnxQV+YzGQ4w@mail.gmail.com
* Re-think predicate locking on GIN indexes.Teodor Sigaev2018-05-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The principle behind the locking was not very well thought-out, and not documented. Add a section in the README to explain how it's supposed to work, and change the code so that it actually works that way. This fixes two bugs: 1. If fast update was turned on concurrently, subsequent inserts to the pending list would not conflict with predicate locks that were acquired earlier, on entry pages. The included 'predicate-gin-fastupdate' test demonstrates that. To fix, make all scans acquire a predicate lock on the metapage. That lock represents a scan of the pending list, whether or not there is a pending list at the moment. Forget about the optimization to skip locking/checking for locks, when fastupdate=off. 2. If a scan finds no match, it still needs to lock the entry page. The point of predicate locks is to lock the gabs between values, whether or not there is a match. The included 'predicate-gin-nomatch' test tests that case. In addition to those two bug fixes, this removes some unnecessary locking, following the principle laid out in the README. Because all items in a posting tree have the same key value, a lock on the posting tree root is enough to cover all the items. (With a very large posting tree, it would possibly be better to lock the posting tree leaf pages instead, so that a "skip scan" with a query like "A & B", you could avoid unnecessary conflict if a new tuple is inserted with A but !B. But let's keep this simple.) Also, some spelling fixes. Author: Heikki Linnakangas with some editorization by me Review: Andrey Borodin, Alexander Korotkov Discussion: https://www.postgresql.org/message-id/0b3ad2c2-2692-62a9-3a04-5724f2af9114@iki.fi
* Add HOLD_INTERRUPTS section into FinishPreparedTransaction.Teodor Sigaev2018-05-03
| | | | | | | | | | | | | If an interrupt arrives in the middle of FinishPreparedTransaction and any callback decide to call CHECK_FOR_INTERRUPTS (e.g. RemoveTwoPhaseFile can write a warning with ereport, which checks for interrupts) then it's possible to leave current GXact undeleted. Backpatch to all supported branches Stas Kelvich Discussion: ihttps://www.postgresql.org/message-id/3AD85097-A3F3-4EBA-99BD-C38EDF8D2949@postgrespro.ru
* Clean up warnings from -Wimplicit-fallthrough.Tom Lane2018-05-01
| | | | | | | | | | | | | | | | | | | | | | | | | Recent gcc can warn about switch-case fall throughs that are not explicitly labeled as intentional. This seems like a good thing, so clean up the warnings exposed thereby by labeling all such cases with comments that gcc will recognize. In files that already had one or more suitable comments, I generally matched the existing style of those. Otherwise I went with /* FALLTHROUGH */, which is one of the spellings approved at the more-restrictive-than-default level -Wimplicit-fallthrough=4. (At the default level you can also spell it /* FALL ?THRU */, and it's not picky about case. What you can't do is include additional text in the same comment, so some existing comments containing versions of this aren't good enough.) Testing with gcc 8.0.1 (Fedora 28's current version), I found that I also had to put explicit "break"s after elog(ERROR) or ereport(ERROR); apparently, for this purpose gcc doesn't recognize that those don't return. That seems like possibly a gcc bug, but it's fine because in most places we did that anyway; so this amounts to a visit from the style police. Discussion: https://postgr.es/m/15083.1525207729@sss.pgh.pa.us
* In AtEOXact_Files, complain if any files remain unclosed at commit.Tom Lane2018-04-28
| | | | | | | | | | | This change makes this module act more like most of our other low-level resource management modules. It's a caller error if something is not explicitly closed by the end of a successful transaction, so issue a WARNING about it. This would not actually have caught the file leak bug fixed in commit 231bcd080, because that was in a transaction-abort path; but it still seems like a good, and pretty cheap, cross-check. Discussion: https://postgr.es/m/152056616579.4966.583293218357089052@wrigleys.postgresql.org
* Post-feature-freeze pgindent run.Tom Lane2018-04-26
| | | | Discussion: https://postgr.es/m/15719.1523984266@sss.pgh.pa.us
* Fix wrong validation of top-parent pointer during page deletion in Btree.Teodor Sigaev2018-04-23
| | | | | | | | | | | | | | | | | | | | | After introducing usage of t_tid of inner or page high key for storing number of attributes of tuple, validation of tuple's ItemPointer with ItemPointerIsValid becomes incorrect, it's need to validate only blocknumber of ItemPointer. Missing this causes a incorrect page deletion, fix that. Test is added. BTW, current contrib/amcheck doesn't fail on index corrupted by this way. Also introduce BTreeTupleGetTopParent/BTreeTupleSetTopParent macroses to improve code readability and to avoid possible confusion with page high key: high key is used to store top-parent link for branch to remove. Bug found by Michael Paquier, but bug doesn't exist in previous versions because t_tid was set to P_HIKEY. Author: Teodor Sigaev Reviewer: Peter Geoghegan Discussion: https://www.postgresql.org/message-id/flat/20180419052436.GA16000%40paquier.xyz
* Adjust _bt_insertonpg() commentsTeodor Sigaev2018-04-19
| | | | | | | | Remove an obsolete reference to the 'afteritem' argument, which was removed by commit bc292937. Add a comment that clarifies how _bt_insertonpg() indirectly handles the insertion of high key items. Author: Peter Geoghegan
* Handle XLOG_BTREE_META_CLEANUP in btree_desc() and btree_identify()Teodor Sigaev2018-04-19
| | | | | | | | New WAL record XLOG_BTREE_META_CLEANUP introduced in 857f9c36 has no handling in btree_desc() and btree_identify(). This patch implements corresponding handling. Alexander Korotkov
* Adjust INCLUDE index truncation comments and code.Teodor Sigaev2018-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add several assertions that ensure that we're dealing with a pivot tuple without non-key attributes where that's expected. Also, remove the assertion within _bt_isequal(), restoring the v10 function signature. A similar check will be performed for the page highkey within _bt_moveright() in most cases. Also avoid dropping all objects within regression tests, to increase pg_dump test coverage for INCLUDE indexes. Rather than using infrastructure that's generally intended to be used with reference counted heap tuple descriptors during truncation, use the same function that was introduced to store flat TupleDescs in shared memory (we use a temp palloc'd buffer). This isn't strictly necessary, but seems more future-proof than the old approach. It also lets us avoid including rel.h within indextuple.c, which was arguably a modularity violation. Also, we now call index_deform_tuple() with the truncated TupleDesc, not the source TupleDesc, since that's more robust, and saves a few cycles. In passing, fix a memory leak by pfree'ing truncated pivot tuple memory during CREATE INDEX. Also pfree during a page split, just to be consistent. Refactor _bt_check_natts() to be more readable. Author: Peter Geoghegan with some editorization by me Reviewed by: Alexander Korotkov, Teodor Sigaev Discussion: https://www.postgresql.org/message-id/CAH2-Wz%3DkCWuXeMrBCopC-tFs3FbiVxQNjjgNKdG2sHxZ5k2y3w%40mail.gmail.com
* Fix confusion on the padding of GIDs in on commit and abort records.Heikki Linnakangas2018-04-17
| | | | | | | | | | | Review of commit 1eb6d652: It's pointless to add padding to the GID fields, when the code that follows assumes that there is no alignment, and uses memcpy(). Remove the pointless padding. Update comments to note the new fields in the WAL records. Reviewed-by: Michael Paquier Discussion: https://www.postgresql.org/message-id/33b787bf-dc20-1161-54e9-3f3b607bf59d%40iki.fi
* Fix a few typos in comments and variable names.Heikki Linnakangas2018-04-17
| | | | | Author: Michael Paquier Discussion: https://www.postgresql.org/message-id/20180411075223.GB19732%40paquier.xyz
* Fix broken collation-aware searches in SP-GiST text opclass.Tom Lane2018-04-16
| | | | | | | | | | | | | | | | | | | | | | | | | spg_text_leaf_consistent() supposed that it should compare only Min(querylen, entrylen) bytes of the two strings, and then deal with any excess bytes in one string or the other by assuming the longer string is greater if the prefixes are equal. Quite aside from the fact that that's just wrong in some locales (e.g., 'ch' is not less than 'd' in cs_CZ), it also risked passing incomplete multibyte characters to strcoll(), with ensuing bad results. Instead, just pass the full strings to varstr_cmp, and let it decide what to do about unequal-length strings. Fortunately, this error doesn't imply any index corruption, it's just that searches might return the wrong set of entries. Per report from Emre Hasegeli, though this is not his patch. Thanks to Peter Geoghegan for review and discussion. This code was born broken, so back-patch to all supported branches. In HEAD, I failed to resist the temptation to do a bit of cosmetic cleanup/pgindent'ing on 710d90da1, too. Discussion: https://postgr.es/m/CAE2gYzzb6K51VnTq5i5p52z+j9p2duEa-K1T3RrC_GQEynAKEg@mail.gmail.com
* Prevent segfault in expand_tuple with no missing valuesAndrew Dunstan2018-04-13
| | | | | | | | | | Commit 16828d5c forgot to check that it had a set of missing values before trying to retrieve a value from it. An additional query to add coverage for this code is added to the regression test. Per bug report from Andreas Seltenreich.
* Revert MERGE patchSimon Riggs2018-04-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commits d204ef63776b8a00ca220adec23979091564e465, 83454e3c2b28141c0db01c7d2027e01040df5249 and a few more commits thereafter (complete list at the end) related to MERGE feature. While the feature was fully functional, with sufficient test coverage and necessary documentation, it was felt that some parts of the executor and parse-analyzer can use a different design and it wasn't possible to do that in the available time. So it was decided to revert the patch for PG11 and retry again in the future. Thanks again to all reviewers and bug reporters. List of commits reverted, in reverse chronological order: f1464c5380 Improve parse representation for MERGE ddb4158579 MERGE syntax diagram correction 530e69e59b Allow cpluspluscheck to pass by renaming variable 01b88b4df5 MERGE minor errata 3af7b2b0d4 MERGE fix variable warning in non-assert builds a5d86181ec MERGE INSERT allows only one VALUES clause 4b2d44031f MERGE post-commit review 4923550c20 Tab completion for MERGE aa3faa3c7a WITH support in MERGE 83454e3c2b New files for MERGE d204ef6377 MERGE SQL Command following SQL:2016 Author: Pavan Deolasee Reviewed-by: Michael Paquier
* Ignore nextOid when replaying an ONLINE checkpoint.Tom Lane2018-04-11
| | | | | | | | | | | | | | | The nextOid value is from the start of the checkpoint and may well be stale compared to values from more recent XLOG_NEXTOID records. Previously, we adopted it anyway, allowing the OID counter to go backwards during a crash. While this should be harmless, it contributed to the severity of the bug fixed in commit 0408e1ed5, by allowing duplicate TOAST OIDs to be assigned immediately following a crash. Without this error, that issue would only have arisen when TOAST objects just younger than a multiple of 2^32 OIDs were deleted and then not vacuumed in time to avoid a conflict. Pavan Deolasee Discussion: https://postgr.es/m/CABOikdOgWT2hHkYG3Wwo2cyZJq2zfs1FH0FgX-=h4OLosXHf9w@mail.gmail.com
* Do not select new object OIDs that match recently-dead entries.Tom Lane2018-04-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | When selecting a new OID, we take care to avoid picking one that's already in use in the target table, so as not to create duplicates after the OID counter has wrapped around. However, up to now we used SnapshotDirty when scanning for pre-existing entries. That ignores committed-dead rows, so that we could select an OID matching a deleted-but-not-yet-vacuumed row. While that mostly worked, it has two problems: * If recently deleted, the dead row might still be visible to MVCC snapshots, creating a risk for duplicate OIDs when examining the catalogs within our own transaction. Such duplication couldn't be visible outside the object-creating transaction, though, and we've heard few if any field reports corresponding to such a symptom. * When selecting a TOAST OID, deleted toast rows definitely *are* visible to SnapshotToast, and will remain so until vacuumed away. This leads to a conflict that will manifest in errors like "unexpected chunk number 0 (expected 1) for toast value nnnnn". We've been seeing reports of such errors from the field for years, but the cause was unclear before. The fix is simple: just use SnapshotAny to search for conflicting rows. This results in a slightly longer window before object OIDs can be recycled, but that seems unlikely to create any large problems. Pavan Deolasee Discussion: https://postgr.es/m/CABOikdOgWT2hHkYG3Wwo2cyZJq2zfs1FH0FgX-=h4OLosXHf9w@mail.gmail.com
* minor comment fixes in nbtinsert.cAndrew Dunstan2018-04-10
|
* Adjustments to the btree fastpath optimization.Andrew Dunstan2018-04-10
| | | | | | | | | | | | | | | This optimization was introduced in commit 2b272734. The changes include some additional comments and documentation, and also these more substantive changes: . ensure the optimization is only applied on the leaf node of a tree whose root is on level 2 or more. It's of little value on small trees. . Delay calling RelationSetTargetBlock() until after the critical section of _bt_insertonpg . ensure the optimization is also applied to unlogged tables. Pavan Deolasee and Peter Geoghegan with some very light editing from me. Discussion: https://postgr.es/m/CABOikdO8jhRarNC60nZLktZYhxt+TK8z_V97+Ny499YQdyAfug@mail.gmail.com
* Fix comment on B-tree insertion fastpath condition.Heikki Linnakangas2018-04-10
| | | | | | The comment earlier in the function correctly states "and the insertion key is strictly greater than the first key in this page". That is what we check here, not "greater than or equal".
* Further cleanup of client dependencies on src/include/catalog headers.Tom Lane2018-04-09
| | | | | | | | | | | | | | | | | In commit 9c0a0de4c, I'd failed to notice that catalog/catalog.h should also be considered a frontend-unsafe header, because it includes (and needs) the full form of pg_class.h, not to mention relcache.h. However, various frontend code was depending on it to get TABLESPACE_VERSION_DIRECTORY, so refactoring of some sort is called for. The cleanest answer seems to be to move TABLESPACE_VERSION_DIRECTORY, as well as the OIDCHARS symbol, to common/relpath.h. Do that, and mop up inclusions as necessary. (I found that quite a few current users of catalog/catalog.h don't seem to need it at all anymore, apparently as a result of the refactorings that created common/relpath.[hc]. And initdb.c needed it only as a route to pg_class_d.h.) Discussion: https://postgr.es/m/6629.1523294509@sss.pgh.pa.us
* Revert "Allow on-line enabling and disabling of data checksums"Magnus Hagander2018-04-09
| | | | | | | | This reverts the backend sides of commit 1fde38beaa0c3e66c340efc7cc0dc272d6254bb0. I have, at least for now, left the pg_verify_checksums tool in place, as this tool can be very valuable without the rest of the patch as well, and since it's a read-only tool that only runs when the cluster is down it should be a lot safer.
* Remove unused variable in non-assert-enabled buildTeodor Sigaev2018-04-08
| | | | | | Use field of structure in Assert directly Jeff Janes
* Refactor dir/file permissionsStephen Frost2018-04-07
| | | | | | | | | | | | | | | | | | | Consolidate directory and file create permissions for tools which work with the PG data directory by adding a new module (common/file_perm.c) that contains variables (pg_file_create_mode, pg_dir_create_mode) and constants to initialize them (0600 for files and 0700 for directories). Convert mkdir() calls in the backend to MakePGDirectory() if the original call used default permissions (always the case for regular PG directories). Add tests to make sure permissions in PGDATA are set correctly by the tools which modify the PG data directory. Authors: David Steele <david@pgmasters.net>, Adam Brightwell <adam.brightwell@crunchydata.com> Reviewed-By: Michael Paquier, with discussion amongst many others. Discussion: https://postgr.es/m/ad346fe6-b23e-59f1-ecb7-0e08390ad629%40pgmasters.net
* Raise error when affecting tuple moved into different partition.Andres Freund2018-04-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an update moves a row between partitions (supported since 2f178441044b), our normal logic for following update chains in READ COMMITTED mode doesn't work anymore. Cross partition updates are modeled as an delete from the old and insert into the new partition. No ctid chain exists across partitions, and there's no convenient space to introduce that link. Not throwing an error in a partitioned context when one would have been thrown without partitioning is obviously problematic. This commit introduces infrastructure to detect when a tuple has been moved, not just plainly deleted. That allows to throw an error when encountering a deletion that's actually a move, while attempting to following a ctid chain. The row deleted as part of a cross partition update is marked by pointing it's t_ctid to an invalid block, instead of self as a normal update would. That was deemed to be the least invasive and most future proof way to represent the knowledge, given how few infomask bits are there to be recycled (there's also some locking issues with using infomask bits). External code following ctid chains should be updated to check for moved tuples. The most likely consequence of not doing so is a missed error. Author: Amul Sul, editorialized by me Reviewed-By: Amit Kapila, Pavan Deolasee, Andres Freund, Robert Haas Discussion: http://postgr.es/m/CAAJ_b95PkwojoYfz0bzXU8OokcTVGzN6vYGCNVUukeUDrnF3dw@mail.gmail.com
* Indexes with INCLUDE columns and their support in B-treeTeodor Sigaev2018-04-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch introduces INCLUDE clause to index definition. This clause specifies a list of columns which will be included as a non-key part in the index. The INCLUDE columns exist solely to allow more queries to benefit from index-only scans. Also, such columns don't need to have appropriate operator classes. Expressions are not supported as INCLUDE columns since they cannot be used in index-only scans. Index access methods supporting INCLUDE are indicated by amcaninclude flag in IndexAmRoutine. For now, only B-tree indexes support INCLUDE clause. In B-tree indexes INCLUDE columns are truncated from pivot index tuples (tuples located in non-leaf pages and high keys). Therefore, B-tree indexes now might have variable number of attributes. This patch also provides generic facility to support that: pivot tuples contain number of their attributes in t_tid.ip_posid. Free 13th bit of t_info is used for indicating that. This facility will simplify further support of index suffix truncation. The changes of above are backward-compatible, pg_upgrade doesn't need special handling of B-tree indexes for that. Bump catalog version Author: Anastasia Lubennikova with contribition by Alexander Korotkov and me Reviewed by: Peter Geoghegan, Tomas Vondra, Antonin Houska, Jeff Janes, David Rowley, Alexander Korotkov Discussion: https://www.postgresql.org/message-id/flat/56168952.4010101@postgrespro.ru
* Logical decoding of TRUNCATEPeter Eisentraut2018-04-07
| | | | | | | | | | | | | | Add a new WAL record type for TRUNCATE, which is only used when wal_level >= logical. (For physical replication, TRUNCATE is already replicated via SMGR records.) Add new callback for logical decoding output plugins to receive TRUNCATE actions. Author: Simon Riggs <simon@2ndquadrant.com> Author: Marco Nenciarini <marco.nenciarini@2ndquadrant.it> Author: Peter Eisentraut <peter.eisentraut@2ndquadrant.com> Reviewed-by: Petr Jelinek <petr.jelinek@2ndquadrant.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
* Predicate locking in hash indexes.Teodor Sigaev2018-04-07
| | | | | | | | | | | | Hash index searches acquire predicate locks on the primary page of a bucket. It acquires a lock on both the old and new buckets for scans that happen concurrently with page splits. During a bucket split, a predicate lock is copied from the primary page of an old bucket to the primary page of a new bucket. Author: Shubham Barai, Amit Kapila Reviewed by: Amit Kapila, Alexander Korotkov, Thomas Munro Discussion: https://www.postgresql.org/message-id/flat/CALxAEPvNsM2GTiXdRgaaZ1Pjd1bs+sxfFsf7Ytr+iq+5JJoYXA@mail.gmail.com
* Allow on-line enabling and disabling of data checksumsMagnus Hagander2018-04-05
| | | | | | | | | | | | | | | | | | | This makes it possible to turn checksums on in a live cluster, without the previous need for dump/reload or logical replication (and to turn it off). Enabling checkusm starts a background process in the form of a launcher/worker combination that goes through the entire database and recalculates checksums on each and every page. Only when all pages have been checksummed are they fully enabled in the cluster. Any failure of the process will revert to checksums off and the process has to be started. This adds a new WAL record that indicates the state of checksums, so the process works across replicated clusters. Authors: Magnus Hagander and Daniel Gustafsson Review: Tomas Vondra, Michael Banck, Heikki Linnakangas, Andrey Borodin
* Allow background workers to bypass datallowconnMagnus Hagander2018-04-05
| | | | | | | THis adds a "flags" field to the BackgroundWorkerInitializeConnection() and BackgroundWorkerInitializeConnectionByOid(). For now only one flag, BGWORKER_BYPASS_ALLOWCONN, is defined, which allows the worker to ignore datallowconn.
* Fix handling of non-upgraded B-tree metapagesTeodor Sigaev2018-04-05
| | | | | | | | | | | | 857f9c36 bumps B-tree metapage version while upgrade is performed "on the fly" when needed. However, some asserts fired when old version metapage was cached to rel->rd_amcache. Despite new metadata fields are never used from rel->rd_amcache, that needs to be fixed. This patch introduces metadata upgrade during its caching, which fills unavailable fields with their default values. contrib/pageinspect is also patched to handle non-upgraded metapages in the same way. Author: Alexander Korotkov