aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access
Commit message (Collapse)AuthorAge
* Don't open a WAL segment for writing at end of recovery.Heikki Linnakangas2015-01-07
| | | | | | | | | | | | | | Since commit ba94518a, we used XLogFileOpen to open the next segment for writing, but if the end-of-recovery happens exactly at a segment boundary, the new segment might not exist yet. (Before ba94518a, XLogFileOpen was correct, because we would open the previous segment if the switch happened at the boundary.) Instead of trying to create it if necessary, it's simpler to not bother opening the segment at all. XLogWrite() will open or create it soon anyway, after writing the checkpoint or end-of-recovery record. Reported by Andres Freund.
* Update copyright for 2015Bruce Momjian2015-01-06
| | | | Backpatch certain files through 9.0
* Fix thinko in lock mode enumAlvaro Herrera2015-01-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 0e5680f4737a9c6aa94aa9e77543e5de60411322 contained a thinko mixing LOCKMODE with LockTupleMode. This caused misbehavior in the case where a tuple is marked with a multixact with at most a FOR SHARE lock, and another transaction tries to acquire a FOR NO KEY EXCLUSIVE lock; this case should block but doesn't. Include a new isolation tester spec file to explicitely try all the tuple lock combinations; without the fix it shows the problem: starting permutation: s1_begin s1_lcksvpt s1_tuplock2 s2_tuplock3 s1_commit step s1_begin: BEGIN; step s1_lcksvpt: SELECT * FROM multixact_conflict FOR KEY SHARE; SAVEPOINT foo; a 1 step s1_tuplock2: SELECT * FROM multixact_conflict FOR SHARE; a 1 step s2_tuplock3: SELECT * FROM multixact_conflict FOR NO KEY UPDATE; a 1 step s1_commit: COMMIT; With the fixed code, step s2_tuplock3 blocks until session 1 commits, which is the correct behavior. All other cases behave correctly. Backpatch to 9.3, like the commit that introduced the problem.
* Remove superflous variable from xlogreader's XLogFindNextRecord().Andres Freund2015-01-04
| | | | | | | Pointed out by Coverity. Since this is mere, and debatable, cosmetics I'm not backpatching this.
* Treat negative values of recovery_min_apply_delay as having no effect.Tom Lane2015-01-03
| | | | | | | | | | | | | | | | | | | | | | At one point in the development of this feature, it was claimed that allowing negative values would be useful to compensate for timezone differences between master and slave servers. That was based on a mistaken assumption that commit timestamps are recorded in local time; but of course they're in UTC. Nor is a negative apply delay likely to be a sane way of coping with server clock skew. However, the committed patch still treated negative delays as doing something, and the timezone misapprehension survived in the user documentation as well. If recovery_min_apply_delay were a proper GUC we'd just set the minimum allowed value to be zero; but for the moment it seems better to treat negative settings as if they were zero. In passing do some extra wordsmithing on the parameter's documentation, including correcting a second misstatement that the parameter affects processing of Restore Point records. Issue noted by Michael Paquier, who also provided the code patch; doc changes by me. Back-patch to 9.4 where the feature was introduced.
* Grab heavyweight tuple lock only before sleepingAlvaro Herrera2014-12-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | We were trying to acquire the lock even when we were subsequently not sleeping in some other transaction, which opens us up unnecessarily to deadlocks. In particular, this is troublesome if an update tries to lock an updated version of a tuple and finds itself doing EvalPlanQual update chain walking; more than two sessions doing this concurrently will find themselves sleeping on each other because the HW tuple lock acquisition in heap_lock_tuple called from EvalPlanQualFetch races with the same tuple lock being acquired in heap_update -- one of these sessions sleeps on the other one to finish while holding the tuple lock, and the other one sleeps on the tuple lock. Per trouble report from Andrew Sackville-West in http://www.postgresql.org/message-id/20140731233051.GN17765@andrew-ThinkPad-X230 His scenario can be simplified down to a relatively simple isolationtester spec file which I don't include in this commit; the reason is that the current isolationtester is not able to deal with more than one blocked session concurrently and it blocks instead of raising the expected deadlock. In the future, if we improve isolationtester, it would be good to include the spec file in the isolation schedule. I posted it in http://www.postgresql.org/message-id/20141212205254.GC1768@alvh.no-ip.org Hat tip to Mark Kirkwood, who helped diagnose the trouble.
* Temporarily revert "Move pg_lzcompress.c to src/common."Tom Lane2014-12-25
| | | | | | | This reverts commit 60838df922345b26a616e49ac9fab808a35d1f85. That change needs a bit more thought to be workable. In view of the potentially machine-dependent stuff that went in today, we need all of the buildfarm to be testing those other changes.
* Convert the PGPROC->lwWaitLink list into a dlist instead of open coding it.Andres Freund2014-12-25
| | | | | | | | Besides being shorter and much easier to read it changes the logic in LWLockRelease() to release all shared lockers when waking up any. This can yield some significant performance improvements - and the fairness isn't really much worse than before, as we always allowed new shared lockers to jump the queue.
* Move pg_lzcompress.c to src/common.Fujii Masao2014-12-25
| | | | | | | | | | | | Exposing compression and decompression APIs of pglz makes possible its use by extensions and contrib modules. pglz_decompress contained a call to elog to emit an error message in case of corrupted data. This function is changed to return a status code to let its callers return an error instead. This commit is required for upcoming WAL compression feature so that the WAL reader facility can decompress the WAL data by using pglz_decompress. Michael Paquier
* Revert "Use a bitmask to represent role attributes"Alvaro Herrera2014-12-23
| | | | | | | | | This reverts commit 1826987a46d079458007b7b6bbcbbd852353adbb. The overall design was deemed unacceptable, in discussion following the previous commit message; we might find some parts of it still salvageable, but I don't want to be on the hook for fixing it, so let's wait until we have a new patch.
* Use a bitmask to represent role attributesAlvaro Herrera2014-12-23
| | | | | | | | | | | | | The previous representation using a boolean column for each attribute would not scale as well as we want to add further attributes. Extra auxilliary functions are added to go along with this change, to make up for the lost convenience of access of the old representation. Catalog version bumped due to change in catalogs and the new functions. Author: Adam Brightwell, minor tweaks by Álvaro Reviewed by: Stephen Frost, Andres Freund, Álvaro Herrera
* Use a pairing heap for the priority queue in kNN-GiST searches.Heikki Linnakangas2014-12-22
| | | | | | | This performs slightly better, uses less memory, and needs slightly less code in GiST, than the Red-Black tree previously used. Reviewed by Peter Geoghegan
* Fix file descriptor leak at end of recovery.Heikki Linnakangas2014-12-21
| | | | | | | | XLogFileInit() returns a file descriptor, which needs to be closed. The leak was short-lived, since the startup process exits shortly afterwards, but it was clearly a bug, nevertheless. Per Coverity report.
* Fix timestamp in end-of-recovery WAL records.Heikki Linnakangas2014-12-19
| | | | | | | We used time(null) to set a TimestampTz field, which gave bogus results. Noticed while looking at pg_xlogdump output. Backpatch to 9.3 and above, where the fast promotion was introduced.
* Improve hash_create's API for selecting simple-binary-key hash functions.Tom Lane2014-12-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, if you wanted anything besides C-string hash keys, you had to specify a custom hashing function to hash_create(). Nearly all such callers were specifying tag_hash or oid_hash; which is tedious, and rather error-prone, since a caller could easily miss the opportunity to optimize by using hash_uint32 when appropriate. Replace this with a design whereby callers using simple binary-data keys just specify HASH_BLOBS and don't need to mess with specific support functions. hash_create() itself will take care of optimizing when the key size is four bytes. This nets out saving a few hundred bytes of code space, and offers a measurable performance improvement in tidbitmap.c (which was not exploiting the opportunity to use hash_uint32 for its 4-byte keys). There might be some wins elsewhere too, I didn't analyze closely. In future we could look into offering a similar optimized hashing function for 8-byte keys. Under this design that could be done in a centralized and machine-independent fashion, whereas getting it right for keys of platform-dependent sizes would've been notationally painful before. For the moment, the old way still works fine, so as not to break source code compatibility for loadable modules. Eventually we might want to remove tag_hash and friends from the exported API altogether, since there's no real need for them to be explicitly referenced from outside dynahash.c. Teodor Sigaev and Tom Lane
* Change how first WAL segment on new timeline after promotion is created.Heikki Linnakangas2014-12-18
| | | | | | | | | | | | | | Two changes: 1. When copying a WAL segment from old timeline to create the first segment on the new timeline, only copy up to the point where the timeline switch happens, and zero-fill the rest. This avoids corner cases where we might think that the copied WAL from the previous timeline belong to the new timeline. 2. If the timeline switch happens at a segment boundary, don't copy the whole old segment to the new timeline. It's pointless, because it's 100% identical to the old segment.
* Fix (re-)starting from a basebackup taken off a standby after a failure.Andres Freund2014-12-18
| | | | | | | | | | | | | | | | | | | | | | | | | | When starting up from a basebackup taken off a standby extra logic has to be applied to compute the point where the data directory is consistent. Normal base backups use a WAL record for that purpose, but that isn't possible on a standby. That logic had a error check ensuring that the cluster's control file indicates being in recovery. Unfortunately that check was too strict, disregarding the fact that the control file could also indicate that the cluster was shut down while in recovery. That's possible when the a cluster starting from a basebackup is shut down before the backup label has been removed. When everything goes well that's a short window, but when either restore_command or primary_conninfo isn't configured correctly the window can get much wider. That's because inbetween reading and unlinking the label we restore the last checkpoint from WAL which can need additional WAL. To fix simply also allow starting when the control file indicates "shutdown in recovery". There's nicer fixes imaginable, but they'd be more invasive. Backpatch to 9.2 where support for taking basebackups from standbys was added.
* Fix assorted confusion between Oid and int32.Tom Lane2014-12-11
| | | | | | | | | | | In passing, also make some debugging elog's in pgstat.c a bit more consistently worded. Back-patch as far as applicable (9.3 or 9.4; none of these mistakes are really old). Mark Dilger identified and patched the type violations; the message rewordings are mine.
* Remove duplicate code in heap_prune_chain()Simon Riggs2014-12-08
| | | | | | No need to set tuple tableOid twice Jim Nasby
* Tweaks for recovery_target_actionSimon Riggs2014-12-07
| | | | | | | | | Rename parameter action_at_recovery_target to recovery_target_action suggested by Christoph Berg. Place into recovery.conf suggested by Fujii Masao, replacing (deprecating) earlier parameters, per Michael Paquier.
* Print new track_commit_timestamp in rm_desc of a parameter-change record.Heikki Linnakangas2014-12-05
| | | | Michael Paquier
* Print wal_log_hints in the rm_desc routing of a parameter-change record.Heikki Linnakangas2014-12-05
| | | | | | | | | It was an oversight in the original commit. Also note in the sample config file that changing wal_log_hints requires a restart. Michael Paquier. Backpatch to 9.4, where wal_log_hints was added.
* Keep track of transaction commit timestampsAlvaro Herrera2014-12-03
| | | | | | | | | | | | | | | | | | | | | Transactions can now set their commit timestamp directly as they commit, or an external transaction commit timestamp can be fed from an outside system using the new function TransactionTreeSetCommitTsData(). This data is crash-safe, and truncated at Xid freeze point, same as pg_clog. This module is disabled by default because it causes a performance hit, but can be enabled in postgresql.conf requiring only a server restart. A new test in src/test/modules is included. Catalog version bumped due to the new subdirectory within PGDATA and a couple of new SQL functions. Authors: Álvaro Herrera and Petr Jelínek Reviewed to varying degrees by Michael Paquier, Andres Freund, Robert Haas, Amit Kapila, Fujii Masao, Jaime Casanova, Simon Riggs, Steven Singer, Peter Eisentraut
* Fix typosAlvaro Herrera2014-12-03
|
* Minor cleanup of function declarations for BRIN.Tom Lane2014-12-02
| | | | | | | Get rid of PG_FUNCTION_INFO_V1() macros, which are quite inappropriate for built-in functions (possibly leftovers from testing as a loadable module?). Also, fix gratuitous inconsistency between SQL-level and C-level names of the minmax support functions.
* Update transaction README for persistent multixactsAlvaro Herrera2014-11-28
| | | | | Multixacts are now maintained during recovery, but the README didn't get the memo. Backpatch to 9.3, where the divergence was introduced.
* Fix assertion failure at end of PITR.Heikki Linnakangas2014-11-28
| | | | | | | | | | | | | InitXLogInsert() cannot be called in a critical section, because it allocates memory. But CreateCheckPoint() did that, when called for the end-of-recovery checkpoint by the startup process. In the passing, fix the scratch space allocation in InitXLogInsert to go to the right memory context. Also update the comment at InitXLOGAccess, which hasn't been totally accurate since hot standby was introduced (in a hot standby backend, InitXLOGAccess isn't called at backend startup). Reported by Michael Paquier
* action_at_recovery_target recovery config optionSimon Riggs2014-11-25
| | | | | | | | | action_at_recovery_target = pause | promote | shutdown Petr Jelinek Reviewed by Muhammad Asif Naeem, Fujji Masao and Simon Riggs
* Add a few paragraphs to B-tree README explaining L&Y algorithm.Heikki Linnakangas2014-11-24
| | | | | | | This gives an overview of what Lehman & Yao's paper is all about, so that you can understand the rest of the README without having to read the paper. Per discussion with Peter Geoghegan and others.
* Distinguish XLOG_FPI records generated for hint-bit updates.Heikki Linnakangas2014-11-24
| | | | | | | Add a new XLOG_FPI_FOR_HINT record type, and use that for full-page images generated for hint bit updates, when checksums are enabled. The new record type is replayed exactly the same as XLOG_FPI, but allows them to be tallied separately e.g. in pg_xlogdump.
* No need to call XLogEnsureRecordSpace when the relation is unlogged.Heikki Linnakangas2014-11-21
| | | | Amit Kapila
* Fix bogus comments in XLogRecordAssembleHeikki Linnakangas2014-11-21
| | | | Pointed out by Michael Paquier
* Remove dead code supporting mark/restore in SeqScan, TidScan, ValuesScan.Tom Lane2014-11-20
| | | | | | | | | | | | There seems no prospect that any of this will ever be useful, and indeed it's questionable whether some of it would work if it ever got called; it's certainly not been exercised in a very long time, if ever. So let's get rid of it, and make the comments about mark/restore in execAmi.c less wishy-washy. The mark/restore support for Result nodes is also currently dead code, but that's due to planner limitations not because it's impossible that it could be useful. So I left it in.
* Silence compiler warning about variable being used uninitialized.Heikki Linnakangas2014-11-20
| | | | | | It's a false positive - the variable is only used when 'onleft' is true, and it is initialized in that case. But the compiler doesn't necessarily see that.
* Revamp the WAL record format.Heikki Linnakangas2014-11-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc. There's a whole new API for building WAL records, replacing the XLogRecData chains used previously. The new API consists of XLogRegister* functions, which are called for each buffer and chunk of data that is added to the record. The new API also gives more control over when a full-page image is written, by passing flags to the XLogRegisterBuffer function. This also simplifies the XLogReadBufferForRedo() calls. The function can dig the relation and block number from the WAL record, so they no longer need to be passed as arguments. For the convenience of redo routines, XLogReader now disects each WAL record after reading it, copying the main data part and the per-block data into MAXALIGNed buffers. The data chunks are not aligned within the WAL record, but the redo routines can assume that the pointers returned by XLogRecGet* functions are. Redo routines are now passed the XLogReaderState, which contains the record in the already-disected format, instead of the plain XLogRecord. The new record format also makes the fixed size XLogRecord header smaller, by removing the xl_len field. The length of the "main data" portion is now stored at the end of the WAL record, and there's a separate header after XLogRecord for it. The alignment padding at the end of XLogRecord is also removed. This compansates for the fact that the new format would otherwise be more bulky than the old format. Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera, Fujii Masao.
* Reduce btree scan overhead for < and > strategiesSimon Riggs2014-11-18
| | | | | | | | | | | For <, <=, > and >= strategies, mark the first scan key as already matched if scanning in an appropriate direction. If index tuple contains no nulls we can skip the first re-check for each tuple. Author: Rajeev Rastogi Reviewer: Haribabu Kommi Rework of the code and comments by Simon Riggs
* Fix WAL-logging of B-tree "unlink halfdead page" operation.Heikki Linnakangas2014-11-17
| | | | | | | | | | | There was some confusion on how to record the case that the operation unlinks the last non-leaf page in the branch being deleted. _bt_unlink_halfdead_page set the "topdead" field in the WAL record to the leaf page, but the redo routine assumed that it would be an invalid block number in that case. This commit fixes _bt_unlink_halfdead_page to do what the redo routine expected. This code is new in 9.4, so backpatch there.
* Ensure unlogged tables are reset even if crash recovery errors out.Andres Freund2014-11-15
| | | | | | | | | | | | | | | | | | | | | | | | Unlogged relations are reset at the end of crash recovery as they're only synced to disk during a proper shutdown. Unfortunately that and later steps can fail, e.g. due to running out of space. This reset was, up to now performed after marking the database as having finished crash recovery successfully. As out of space errors trigger a crash restart that could lead to the situation that not all unlogged relations are reset. Once that happend usage of unlogged relations could yield errors like "could not open file "...": No such file or directory". Luckily clusters that show the problem can be fixed by performing a immediate shutdown, and starting the database again. To fix, just call ResetUnloggedRelations(UNLOGGED_RELATION_INIT) earlier, before marking the database as having successfully recovered. Discussion: 20140912112246.GA4984@alap3.anarazel.de Backpatch to 9.1 where unlogged tables were introduced. Abhijit Menon-Sen and Andres Freund
* Clean up includes from RLS patchStephen Frost2014-11-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | The initial patch for RLS mistakenly included headers associated with the executor and planner bits in rewrite/rowsecurity.h. Per policy and general good sense, executor headers should not be included in planner headers or vice versa. The include of execnodes.h was a mistaken holdover from previous versions, while the include of relation.h was used for Relation's definition, which should have been coming from utils/relcache.h. This patch cleans these issues up, adds comments to the RowSecurityPolicy struct and the RowSecurityConfigType enum, and changes Relation->rsdesc to Relation->rd_rsdesc to follow Relation field naming convention. Additionally, utils/rel.h was including rewrite/rowsecurity.h, which wasn't a great idea since that was pulling in things not really needed in utils/rel.h (which gets included in quite a few places). Instead, use 'struct RowSecurityDesc' for the rd_rsdesc field and add comments explaining why. Lastly, add an include into access/nbtree/nbtsort.c for utils/sortsupport.h, which was evidently missed due to the above mess. Pointed out by Tom in 16970.1415838651@sss.pgh.pa.us; note that the concerns regarding a similar situation in the custom-path commit still need to be addressed.
* Allow interrupting GetMultiXactIdMembersAlvaro Herrera2014-11-14
| | | | | | | | | | | This function has a loop which can lead to uninterruptible process "stalls" (actually infinite loops) when some bugs are triggered. Avoid that unpleasant situation by adding a check for interrupts in a place that shouldn't degrade performance in the normal case. Backpatch to 9.3. Older branches have an identical loop here, but the aforementioned bugs are only a problem starting in 9.3 so there doesn't seem to be any point in backpatching any further.
* Fix race condition between hot standby and restoring a full-page image.Heikki Linnakangas2014-11-13
| | | | | | | | | | | | | | | | | | | There was a window in RestoreBackupBlock where a page would be zeroed out, but not yet locked. If a backend pinned and locked the page in that window, it saw the zeroed page instead of the old page or new page contents, which could lead to missing rows in a result set, or errors. To fix, replace RBM_ZERO with RBM_ZERO_AND_LOCK, which atomically pins, zeroes, and locks the page, if it's not in the buffer cache already. In stable branches, the old RBM_ZERO constant is renamed to RBM_DO_NOT_USE, to avoid breaking any 3rd party extensions that might use RBM_ZERO. More importantly, this avoids renumbering the other enum values, which would cause even bigger confusion in extensions that use ReadBufferExtended, but haven't been recompiled. Backpatch to all supported versions; this has been racy since hot standby was introduced.
* Fix XLogReadBufferForRedoExtended to get cleanup lock when asked to do so.Heikki Linnakangas2014-11-13
|
* Rename pending_list_cleanup_size to gin_pending_list_limit.Fujii Masao2014-11-13
| | | | | Since this parameter is only for GIN index, it's better to add "gin" to the parameter name for easier understanding.
* Add GUC and storage parameter to set the maximum size of GIN pending list.Fujii Masao2014-11-11
| | | | | | | | | | | | | | | | Previously the maximum size of GIN pending list was controlled only by work_mem. But the reasonable value of work_mem and the reasonable size of the list are basically not the same, so it was not appropriate to control both of them by only one GUC, i.e., work_mem. This commit separates new GUC, pending_list_cleanup_size, from work_mem to allow users to control only the size of the list. Also this commit adds pending_list_cleanup_size as new storage parameter to allow users to specify the size of the list per index. This is useful, for example, when users want to increase the size of the list only for the GIN index which can be updated heavily, and decrease it otherwise. Reviewed by Etsuro Fujita.
* BRIN: fix bug in xlog backup block countingAlvaro Herrera2014-11-10
| | | | | | | | | | | | | The code that generates the BRIN_XLOG_UPDATE removes the buffer reference when the page that's target for the updated tuple is freshly initialized. This is a pretty usual optimization, but was breaking the case where the revmap buffer, which is referenced in the same WAL record, is getting a backup block: the replay code was using backup block index 1, which is not valid when the update target buffer gets pruned; the revmap buffer gets assigned 0 instead. Make sure to use the correct backup block index for revmap when replaying. Bug reported by Fujii Masao.
* Further code and wording tweaks in BRINAlvaro Herrera2014-11-10
| | | | | | | Besides a couple of typo fixes, per David Rowley, Thom Brown, and Amit Langote, and mentions of BRIN in the general CREATE INDEX page again per David, this includes silencing MSVC compiler warnings (thanks Microsoft) and an additional variable initialization per Coverity scanner.
* Fix compiler warning for non-assert builds.Kevin Grittner2014-11-10
| | | | | Reported by Peter Geoghegan David Rowley
* Fix some coding issues in BRINAlvaro Herrera2014-11-08
| | | | | | | | | | | | | | | Reported by David Rowley: variadic macros are a problem. Get rid of them using a trick suggested by Tom Lane: add extra parentheses where needed. In the future we might decide we don't need the calls at all and remove them, but it seems appropriate to keep them while this code is still new. Also from David Rowley: brininsert() was trying to use a variable before initializing it. Fix by moving the brin_form_tuple call (which initializes the variable) to within the locked section. Reported by Peter Eisentraut: can't use "new" as a struct member name, because C++ compilers will choke on it, as reported by cpluspluscheck.
* Fix building with WAL_DEBUG.Heikki Linnakangas2014-11-07
| | | | | | | | | | | Now that the backup blocks are appended to the WAL record in xloginsert.c, XLogInsert doesn't see them anymore and cannot remove them from the version reconstructed for xlog_outdesc. This makes running with wal_debug=on more expensive, as we now make (unnecessary) temporary copies of the backup blocks, but it doesn't seem worth convoluting the code to keep that optimization. Reported by Alvaro Herrera.
* Use the sortsupport infrastructure in more cases.Robert Haas2014-11-07
| | | | | | This removes some fmgr overhead from cases such as btree index builds. Peter Geoghegan, reviewed by Andreas Karlsson and me.