aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access
Commit message (Collapse)AuthorAge
* Reset properly errno before calling write()Michael Paquier2018-08-05
| | | | | | | | | | | | 6cb3372 enforces errno to ENOSPC when less bytes than what is expected have been written when it is unset, though it forgot to properly reset errno before doing a system call to write(), causing errno to potentially come from a previous system call. Reported-by: Tom Lane Author: Michael Paquier Reviewed-by: Tom Lane Discussion: https://postgr.es/m/31797.1533326676@sss.pgh.pa.us
* Provide separate header file for built-in float typesTomas Vondra2018-07-29
| | | | | | | | | | | | | | | | | | Some data types under adt/ have separate header files, but most simple ones do not, and their public functions are defined in builtins.h. As the patches improving geometric types will require making additional functions public, this seems like a good opportunity to create a header for floats types. Commit 1acf757255 made _cmp functions public to solve NaN issues locally for GiST indexes. This patch reworks it in favour of a more widely applicable API. The API uses inline functions, as they are easier to use compared to macros, and avoid double-evaluation hazards. Author: Emre Hasegeli Reviewed-by: Kyotaro Horiguchi Discussion: https://www.postgresql.org/message-id/CAE2gYzxF7-5djV6-cEvqQu-fNsnt%3DEqbOURx7ZDg%2BVv6ZMTWbg%40mail.gmail.com
* Reduce path length for locking leaf B-tree pages during insertionAlexander Korotkov2018-07-28
| | | | | | | | | | | | | | | | | In our B-tree implementation appropriate leaf page for new tuple insertion is acquired using _bt_search() function. This function always returns leaf page locked in shared mode. In order to obtain exclusive lock, caller have to relock the page. This commit makes _bt_search() function lock leaf page immediately in exclusive mode when needed. That removes unnecessary relock and, in turn reduces lock contention for B-tree leaf pages. Our experiments on multi-core systems showed acceleration up to 4.5 times in corner case. Discussion: https://postgr.es/m/CAPpHfduAMDFMNYTCN7VMBsFg_hsf0GqiqXnt%2BbSeaJworwFoig%40mail.gmail.com Author: Alexander Korotkov Reviewed-by: Yoshikazu Imai, Simon Riggs, Peter Geoghegan
* Fix grammar in README.tuplockAlvaro Herrera2018-07-27
| | | | | Author: Brad DeJong Discussion: https://postgr.es/m/CAJnrtnxrA4FqZi0Z6kGPQKMiZkWv2xxgSDQ+hv1jDrf8WCKjjw@mail.gmail.com
* Fix the buffer release order for parallel index scans.Amit Kapila2018-07-27
| | | | | | | | | | | | | | | | | | | | | | | | | During parallel index scans, if the current page to be read is deleted, we skip it and try to get the next page for a scan without releasing the buffer lock on the current page. To get the next page, sometimes it needs to wait for another process to complete its scan and advance it to the next page. Now, it is quite possible that the master backend has errored out before advancing the scan and issued a termination signal for all workers. The workers failed to notice the termination request during wait because the interrupts are held due to buffer lock on the previous page. This lead to all workers being stuck. The fix is to release the buffer lock on current page before trying to get the next page. We are already doing same in backward scans, but missed it for forward scans. Reported-by: Victor Yegorov Bug: 15290 Diagnosed-by: Thomas Munro and Amit Kapila Author: Amit Kapila Reviewed-by: Thomas Munro Tested-By: Thomas Munro and Victor Yegorov Backpatch-through: 10 where parallel index scans were introduced Discussion: https://postgr.es/m/153228422922.1395.1746424054206154747@wrigleys.postgresql.org
* Fix calculation for WAL segment recycling and removalMichael Paquier2018-07-24
| | | | | | | | | | | | | | Commit 4b0d28de06 has removed the prior checkpoint and related facilities but has left WAL recycling based on the LSN of the prior checkpoint, which causes incorrect calculations for WAL removal and recycling for max_wal_size and min_wal_size. This commit changes things so as the base calculation point is the last checkpoint generated. Reported-by: Kyotaro Horiguchi Author: Kyotaro Horiguchi Reviewed-by: Michael Paquier Discussion: https://postgr.es/m/20180723.135748.42558387.horiguchi.kyotaro@lab.ntt.co.jp Backpatch: 11-, where the prior checkpoint has been removed.
* Add proper errcodes to new error messages for read() failuresMichael Paquier2018-07-23
| | | | | | | | | | | | | | | | | Those would use the default ERRCODE_INTERNAL_ERROR, but for foreseeable failures an errcode ought to be set, ERRCODE_DATA_CORRUPTED making the most sense here. While on the way, fix one errcode_for_file_access missing in origin.c since the code has been created, and remove one assignment of errno to 0 before calling read(), as this was around to fit with what was present before 811b6e36 where errno would not be set when not enough bytes are read. I have noticed the first one, and Tom has pinged me about the second one. Author: Michael Paquier Reported-by: Tom Lane Discussion: https://postgr.es/m/27265.1531925836@sss.pgh.pa.us
* Make more consistent some error messages for file-related operationsMichael Paquier2018-07-23
| | | | | | | | | | | | | | Some error messages which report something about a file operation use as well context which is already provided within the path being worked on, making things rather duplicated. This creates more work for translators, and does not actually bring clarity. More could be done, however in a lot of cases the context used is actually useful, still that patch gets down things with a good cut. Author: Michael Paquier Reviewed-by: Kyotaro Horiguchi, Tom Lane Discussion: https://postgr.es/m/20180718044711.GA8565@paquier.xyz
* Fix handling of empty uncompressed posting list pages in GINAlexander Korotkov2018-07-19
| | | | | | | | | | | | | | PostgreSQL 9.4 introduces posting list compression in GIN. This feature supports online upgrade, so that after pg_upgrade uncompressed posting lists are compressed on-the-fly. Underlying code appears to always expect at least one item on uncompressed posting list page. But there could be completely empty pages, because VACUUM never deletes leftmost and rightmost pages from posting trees. This commit fixes that. Reported-by: Sivasubramanian Ramasubramanian Discussion: https://postgr.es/m/1531867212836.63354%40amazon.com Author: Sivasubramanian Ramasubramanian, Alexander Korotkov Backpatch-through: 9.4
* Use a ResourceOwner to track buffer pins in all cases.Tom Lane2018-07-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Historically, we've allowed auxiliary processes to take buffer pins without tracking them in a ResourceOwner. However, that creates problems for error recovery. In particular, we've seen multiple reports of assertion crashes in the startup process when it gets an error while holding a buffer pin, as for example if it gets ENOSPC during a write. In a non-assert build, the process would simply exit without releasing the pin at all. We've gotten away with that so far just because a failure exit of the startup process translates to a database crash anyhow; but any similar behavior in other aux processes could result in stuck pins and subsequent problems in vacuum. To improve this, institute a policy that we must *always* have a resowner backing any attempt to pin a buffer, which we can enforce just by removing the previous special-case code in resowner.c. Add infrastructure to make it easy to create a process-lifespan AuxProcessResourceOwner and clear out its contents at appropriate times. Replace existing ad-hoc resowner management in bgwriter.c and other aux processes with that. (Thus, while the startup process gains a resowner where it had none at all before, some other aux process types are replacing an ad-hoc resowner with this code.) Also use the AuxProcessResourceOwner to manage buffer pins taken during StartupXLOG and ShutdownXLOG, even when those are being run in a bootstrap process or a standalone backend rather than a true auxiliary process. In passing, remove some other ad-hoc resource owner creations that had gotten cargo-culted into various other places. As far as I can tell that was all unnecessary, and if it had been necessary it was incomplete, due to lacking any provision for clearing those resowners later. (Also worth noting in this connection is that a process that hasn't called InitBufferPoolBackend has no business accessing buffers; so there's more to do than just add the resowner if we want to touch buffers in processes not covered by this patch.) Although this fixes a very old bug, no back-patch, because there's no evidence of any significant problem in non-assert builds. Patch by me, pursuant to a report from Justin Pryzby. Thanks to Robert Haas and Kyotaro Horiguchi for reviews. Discussion: https://postgr.es/m/20180627233939.GA10276@telsasoft.com
* Fix misc typos, mostly in comments.Heikki Linnakangas2018-07-18
| | | | | | | | A collection of typos I happened to spot while reading code, as well as grepping for common mistakes. Backpatch to all supported versions, as applicable, to avoid conflicts when backporting other commits in the future.
* Fix casting in error message for two-phase fileMichael Paquier2018-07-18
| | | | | | | This error from from 811b6e3, which causes compilation warnings with OSX 10.3 and clang. Reported by Tom Lane, via buildfarm member longfin.
* Rework error messages around file handlingMichael Paquier2018-07-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some error messages related to file handling are using the code path context to define their state. For example, 2PC-related errors are referring to "two-phase status files", or "relation mapping file" is used for catalog-to-filenode mapping, however those prove to be difficult to translate, and are not more helpful than just referring to the path of the file being worked on. So simplify all those error messages by just referring to files with their path used. In some cases, like the manipulation of WAL segments, the context is actually helpful so those are kept. Calls to the system function read() have also been rather inconsistent with their error handling sometimes not reporting the number of bytes read, and some other code paths trying to use an errno which has not been set. The in-core functions are using a more consistent pattern with this patch, which checks for both errno if set or if an inconsistent read is happening. So as to care about pluralization when reading an unexpected number of byte(s), "could not read: read %d of %zu" is used as error message, with %d field being the output result of read() and %zu the expected size. This simplifies the work of translators with less variations of the same message. Author: Michael Paquier Reviewed-by: Álvaro Herrera Discussion: https://postgr.es/m/20180520000522.GB1603@paquier.xyz
* Revise BuildIndexValueDescription to simplify itAlvaro Herrera2018-07-16
| | | | | | | | | Getting a pg_index tuple from syscache when the open index relation is available is pointless -- just use the one from relcache. Noticed while reviewing code for cb9db2ab0674. No backpatch.
* Add subtransaction handling for table synchronization workers.Robert Haas2018-07-16
| | | | | | | | | | | Since the old logic was completely unaware of subtransactions, a change made in a subsequently-aborted subtransaction would still cause workers to be stopped at toplevel transaction commit. Fix that by managing a stack of worker lists rather than just one. Amit Khandekar and Robert Haas Discussion: http://postgr.es/m/CAJ3gD9eaG_mWqiOTA2LfAug-VRNn1hrhf50Xi1YroxL37QkZNg@mail.gmail.com
* Improve performance of tuple conversion map generationHeikki Linnakangas2018-07-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | Previously convert_tuples_by_name_map naively performed a search of each outdesc column starting at the first column in indesc and searched each indesc column until a match was found. When partitioned tables had many columns this could result in slow generation of the tuple conversion maps. For INSERT and UPDATE statements that touched few rows, this could mean a very large overhead indeed. We can do a bit better with this loop. It's quite likely that the columns in partitioned tables and their partitions are in the same order, so it makes sense to start searching for each column outer column at the inner column position 1 after where the previous match was found (per idea from Alexander Kuzmenkov). This makes the best case search O(N) instead of O(N^2). The worst case is still O(N^2), but it seems unlikely that would happen. Likewise, in the planner, make_inh_translation_list's search for the matching column could often end up falling back on an O(N^2) type search. This commit also improves that by first checking the column that follows the previous match, instead of the column with the same attnum. If we fail to match here we fallback on the syscache's hashtable lookup. Author: David Rowley Reviewed-by: Alexander Kuzmenkov Discussion: https://www.postgresql.org/message-id/CAKJS1f9-wijVgMdRp6_qDMEQDJJ%2BA_n%3DxzZuTmLx5Fz6cwf%2B8A%40mail.gmail.com
* Fix inadequate buffer locking in FSM and VM page re-initialization.Tom Lane2018-07-13
| | | | | | | | | | | | | | | | | | | When reading an existing FSM or VM page that was found to be corrupt by the buffer manager, the code applied PageInit() to reinitialize the page, but did so without any locking. There is thus a hazard that two backends might concurrently do PageInit, which in itself would still be OK, but the slower one might then zero over subsequent data changes applied by the faster one. Even that is unlikely to be fatal; but it's not desirable, so add locking to prevent it. This does not add any locking overhead in the normal code path where the page is OK. It's not immediately obvious that that's safe, but I believe it is, for reasons explained in the added comments. Problem noted by R P Asim. It's been like this for a long time, so back-patch to all supported branches. Discussion: https://postgr.es/m/CANXE4Te4G0TGq6cr0-TvwP0H4BNiK_-hB5gHe8mF+nz0mcYfMQ@mail.gmail.com
* Clean up temporary WAL segments after an instance crashMichael Paquier2018-07-13
| | | | | | | | | | | | | | | | Temporary WAL segments are created in pg_wal and named as xlogtemp.pid before being renamed to the real deal when creating a new segment. If an instance crashes after the temporary segment is created and before the rename is done, then the server would finish with unremovable data. After an instance crash, scan pg_wal and remove any such segments. With repetitive unlucky crashes this would contribute to disk bloat and presents risks of ENOSPC especially with max_wal_size close to the maximum allowed. Author: Michael Paquier Reviewed-by: Yugo Nagata, Heikki Linnakangas Discussion: https://postgr.es/m/20180514054955.GF1528@paquier.xyz
* Fix more wrong paths in header commentsAlexander Korotkov2018-07-11
| | | | | | | It appears that there are more files, whose header comment paths are wrong. So, fix those paths. No backpatching per proposal of Tom Lane. Discussion: https://postgr.es/m/CAPpHfdsJyYbOj59MOQL%2B4XxdcomLSLfLqBtAvwR%2BpsCqj3ELdQ%40mail.gmail.com
* Rethink how to get float.h in old Windows API for isnan/isinfAlvaro Herrera2018-07-11
| | | | | | | | | | | | | | | | We include <float.h> in every place that needs isnan(), because MSVC used to require it. However, since MSVC 2013 that's no longer necessary (cf. commit cec8394b5ccd), so we can retire the inclusion to a version-specific stanza in win32_port.h, where it doesn't need to pollute random .c files. The header is of course still needed in a few places for other reasons. I (Álvaro) removed float.h from a few more files than in Emre's original patch. This doesn't break the build in my system, but we'll see what the buildfarm has to say about it all. Author: Emre Hasegeli Discussion: https://postgr.es/m/CAE2gYzyc0+5uG+Cd9-BSL7NKC8LSHLNg1Aq2=8ubjnUwut4_iw@mail.gmail.com
* Remove dynamic_shared_memory_type=nonePeter Eisentraut2018-07-10
| | | | | | | | | PostgreSQL nowadays offers some kind of dynamic shared memory feature on all supported platforms. Having the choice of "none" prevents us from relying on DSM in core features. So this patch removes the choice of "none". Author: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
* Avoid emitting a bogus WAL record when recycling an all-zero btree page.Tom Lane2018-07-09
| | | | | | | | | | | | | | | | | | | | Commit fafa374f2 caused _bt_getbuf() to possibly emit a WAL record for a page that it was about to recycle. However, it failed to distinguish all-zero pages from dead pages, which is important because only the latter have valid btpo.xact values, or indeed any special space at all. Recycling an all-zero page with XLogStandbyInfoActive() enabled therefore led to an Assert failure, or to emission of a WAL record containing a bogus cutoff XID, which might lead to unnecessary query cancellations on hot standby servers. Per reports from Antonin Houska and 自己. Amit Kapila was first to propose this fix, and Robert Haas, myself, and Kyotaro Horiguchi reviewed it at various times. This is an old bug, so back-patch to all supported branches. Discussion: https://postgr.es/m/2628.1474272158@localhost Discussion: https://postgr.es/m/48875502.f4a0.1635f0c27b0.Coremail.zoulx1982@163.com
* Flip argument order in XLogSegNoOffsetToRecPtrAlvaro Herrera2018-07-09
| | | | | | | | | Commit fc49e24fa69a added an input argument after the existing output argument. Flip those. Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20180708182345.imdgovmkffgtihhk@alvherre.pgsql
* Rework order of end-of-recovery actions to delay timeline history writeMichael Paquier2018-07-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A critical failure in some of the end-of-recovery actions before the end-of-recovery record is written can cause PostgreSQL to react inconsistently with the rest of the cluster in the event of a crash before the final record is written. Two such failures are for example an error while processing a two-phase state files or when operating on recovery.conf. With this commit, the failures are still considered FATAL, but the write of the timeline history file is delayed as much as possible so as the window between the moment the file is written and the end-of-recovery record is generated gets minimized. This way, in the event of a crash or a failure, the new timeline decided at promotion will not seem taken by other nodes in the cluster. It is not really possible to reduce to zero this window, hence one could still see failures if a crash happens between the history file write and the end-of-recovery record, so any future code should be careful when adding new end-of-recovery actions. The original report from Magnus Hagander mentioned a renamed recovery.conf as original end-of-recovery failure which caused a timeline to be seen as taken but the subsequent processing on the now-missing recovery.conf cause the startup process to issue stop on FATAL, which at follow-up startup made the system inconsistent because of on-disk changes which already happened. Processing of two-phase state files still needs some work as corrupted entries are simply ignored now. This is left as a future item and this commit fixes the original complain. Reported-by: Magnus Hagander Author: Heikki Linnakangas Reviewed-by: Alexander Korotkov, Michael Paquier, David Steele Discussion: https://postgr.es/m/CABUevEz09XY2EevA2dLjPCY-C5UO4Hq=XxmXLmF6ipNFecbShQ@mail.gmail.com
* Correct obsolete unique index insertion comment.Peter Geoghegan2018-07-08
| | | | | | Commit bc292937ae6 failed to update a comment about unique index checking. _bt_insertonpg() is no longer responsible for finding an insertion location while preventing conflicting insertions.
* Prevent references to invalid relation pages after fresh promotionMichael Paquier2018-07-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a standby crashes after promotion before having completed its first post-recovery checkpoint, then the minimal recovery point which marks the LSN position where the cluster is able to reach consistency may be set to a position older than the first end-of-recovery checkpoint while all the WAL available should be replayed. This leads to the instance thinking that it contains inconsistent pages, causing a PANIC and a hard instance crash even if all the WAL available has not been replayed for certain sets of records replayed. When in crash recovery, minRecoveryPoint is expected to always be set to InvalidXLogRecPtr, which forces the recovery to replay all the WAL available, so this commit makes sure that the local copy of minRecoveryPoint from the control file is initialized properly and stays as it is while crash recovery is performed. Once switching to archive recovery or if crash recovery finishes, then the local copy minRecoveryPoint can be safely updated. Pavan Deolasee has reported and diagnosed the failure in the first place, and the base fix idea to rely on the local copy of minRecoveryPoint comes from Kyotaro Horiguchi, which has been expanded into a full-fledged patch by me. The test included in this commit has been written by Álvaro Herrera and Pavan Deolasee, which I have modified to make it faster and more reliable with sleep phases. Backpatch down to all supported versions where the bug appears, aka 9.3 which is where the end-of-recovery checkpoint is not run by the startup process anymore. The test gets easily supported down to 10, still it has been tested on all branches. Reported-by: Pavan Deolasee Diagnosed-by: Pavan Deolasee Reviewed-by: Pavan Deolasee, Kyotaro Horiguchi Author: Michael Paquier, Kyotaro Horiguchi, Pavan Deolasee, Álvaro Herrera Discussion: https://postgr.es/m/CABOikdPOewjNL=05K5CbNMxnNtXnQjhTx2F--4p4ruorCjukbA@mail.gmail.com
* Check for interrupts inside the nbtree page deletion code.Andres Freund2018-07-04
| | | | | | | | | | | | | | | | | | | | When deleting pages the nbtree code has to walk through siblings of a tree node. When those sibling links are corrupted that can lead to endless loops - which are currently not interruptible. This is especially problematic if autovacuum is repeatedly blocked on such indexes, as it can be hard to get out of that situation without resorting to single user mode. Thus add interrupt checks to appropriate places in such loops. Unfortunately in one of the cases it's it's not easy to do so. Between 9.3 and 9.4 the page deletion (and page split) code changed significantly. Before it was significantly less robust against interruptions. Therefore don't backpatch to 9.3. Author: Andres Freund Discussion: https://postgr.es/m/20180627191629.wkunw2qbibnvlz53@alap3.anarazel.de Backpatch: 9.4-
* Improve the performance of relation deletes during recovery.Fujii Masao2018-07-05
| | | | | | | | | | | | | | | | | | | | | | When multiple relations are deleted at the same transaction, the files of those relations are deleted by one call to smgrdounlinkall(), which leads to scan whole shared_buffers only one time. OTOH, previously, during recovery, smgrdounlink() (not smgrdounlinkall()) was called for each file to delete, which led to scan shared_buffers multiple times. Obviously this could cause to increase the WAL replay time very much especially when shared_buffers was huge. To alleviate this situation, this commit changes the recovery so that it also calls smgrdounlinkall() only one time to delete multiple relation files. This is just fix for oversight of commit 279628a0a7, not new feature. So, per discussion on pgsql-hackers, we concluded to backpatch this to all supported versions. Author: Fujii Masao Reviewed-by: Michael Paquier, Andres Freund, Thomas Munro, Kyotaro Horiguchi, Takayuki Tsunakawa Discussion: https://postgr.es/m/CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com
* Add wait event for fsync of WAL segmentsMichael Paquier2018-07-02
| | | | | | | | | | | | This has been visibly a forgotten spot in the first implementation of wait events for I/O added by 249cf07, and what has been missing is a fsync call for WAL segments which is a wrapper reacting on the value of GUC wal_sync_method. Reported-by: Konstantin Knizhnik Author: Konstantin Knizhnik Reviewed-by: Craig Ringer, Michael Paquier Discussion: https://postgr.es/m/4a243897-0ad8-f471-aa40-242591f2476e@postgrespro.ru
* pgindent run prior to branchingAndrew Dunstan2018-06-30
|
* Cosmetic improvements for faster column addition.Amit Kapila2018-06-27
| | | | | | | | | | Changed the name of few structure members for the sake of clarity and removed spurious whitespace. Reported-by: Amit Kapila Author: Amit Kapila, based on suggestion by Andrew Dunstan Reviewed-by: Alvaro Herrera Discussion: https://postgr.es/m/CAA4eK1K2znsFpC+NQ9A4vxT4uDxADN4RmvHX0L6Y=aHVo9gB4Q@mail.gmail.com
* Fix upper limit for vacuum_cleanup_index_scale_factorAlexander Korotkov2018-06-26
| | | | | | | | | | | | | | | 6ca33a88 sets upper limit for vacuum_cleanup_index_scale_factor to DBL_MAX. DBL_MAX appears to be platform-dependent. That causes many buildfarm animals to fail, because we check boundaries of vacuum_cleanup_index_scale_factor in regression tests. This commit changes upper limit from DBL_MAX to just "large enough" limit, which was arbitrary selected as 1e10. Author: Alexander Korotkov Reported-by: Tom Lane, Darafei Praliaskouski Discussion: https://postgr.es/m/CAPpHfdvewmr4PcpRjrkstoNn1n2_6dL-iHRB21CCfZ0efZdBTg%40mail.gmail.com Discussion: https://postgr.es/m/CAC8Q8tLYFOpKNaPS_E7V8KtPdE%3D_TnAn16t%3DA3LuL%3DXjfOO-BQ%40mail.gmail.com
* Remove obsolete comment block in nbtsort.c.Peter Geoghegan2018-06-26
| | | | | | | | | | | Building a new nbtree index through incremental insertions would always be slower than our actual approach of sorting using tuplesort, assembling leaf pages from tuplesort output, and writing and WAL-logging whole pages. Remove a comment block from the Berkeley days claiming that incremental insertions might be slightly faster with presorted input. Discussion: https://postgr.es/m/CAH2-WzmKs4mLAoFgJ3yHMRYc849efc=dw+pNRb3NEog2oJoCNw@mail.gmail.com
* Increase upper limit for vacuum_cleanup_index_scale_factorAlexander Korotkov2018-06-26
| | | | | | | | | | | | | | | | | | Upper limits for vacuum_cleanup_index_scale_factor GUC and reloption were initially set to 100.0 in 857f9c36. However, after further discussion, it appears that some users like to disable B-tree cleanup index scan completely (assuming there are no deleted pages). vacuum_cleanup_index_scale_factor is used barely to protect against stalled index statistics. And after detailed consideration it appears that risk of stalled index statistics is low. And it would be nice to allow advanced users setting higher values of vacuum_cleanup_index_scale_factor. So, set upper limit for these GUC and reloption to DBL_MAX. Author: Alexander Korotkov Reviewed-by: Masahiko Sawada Discussion: https://postgr.es/m/CAC8Q8tJCb%3DgxhzcV7T6ctx7PY-Ux1oA-AsTJc6cAVNsQiYcCzA%40mail.gmail.com
* Address set of issues with errno handlingMichael Paquier2018-06-25
| | | | | | | | | | | | | | | | | | | | System calls mixed up in error code paths are causing two issues which several code paths have not correctly handled: 1) For write() calls, sometimes the system may return less bytes than what has been written without errno being set. Some paths were careful enough to consider that case, and assumed that errno should be set to ENOSPC, other calls missed that. 2) errno generated by a system call is overwritten by other system calls which may succeed once an error code path is taken, causing what is reported to the user to be incorrect. This patch uses the brute-force approach of correcting all those code paths. Some refactoring could happen in the future, but this is let as future work, which is not targeted for back-branches anyway. Author: Michael Paquier Reviewed-by: Ashutosh Sharma Discussion: https://postgr.es/m/20180622061535.GD5215@paquier.xyz
* Fix typo in comment of commit_ts.c for incorrect reference to CLOGMichael Paquier2018-06-22
| | | | Author: Shao Bret
* Prevent hard failures of standbys caused by recycled WAL segmentsMichael Paquier2018-06-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a standby's WAL receiver stops reading WAL from a WAL stream, it writes data to the current WAL segment without having priorily zero'ed the page currently written to, which can cause the WAL reader to read junk data from a past recycled segment and then it would try to get a record from it. While sanity checks in place provide most of the protection needed, in some rare circumstances, with chances increasing when a record header crosses a page boundary, then the startup process could fail violently on an allocation failure, as follows: FATAL: invalid memory alloc request size XXX This is confusing for the user and also unhelpful as this requires in the worst case a manual restart of the instance, impacting potentially the availability of the cluster, and this also makes WAL data look like it is in a corrupted state. The chances of seeing failures are higher if the connection between the standby and its root node is unstable, causing WAL pages to be written in the middle. A couple of approaches have been discussed, like zero-ing new WAL pages within the WAL receiver itself but this has the disadvantage of impacting performance of any existing instances as this breaks the sequential writes done by the WAL receiver. This commit deals with the problem with a more simple approach, which has no performance impact without reducing the detection of the problem: if a record is found with a length higher than 1GB for backends, then do not try any allocation and report a soft failure which will force the standby to retry reading WAL. It could be possible that the allocation call passes and that an unnecessary amount of memory is allocated, however follow-up checks on records would just fail, making this allocation short-lived anyway. This patch owes a great deal to Tsunakawa Takayuki for reporting the failure first, and then discussing a couple of potential approaches to the problem. Backpatch down to 9.5, which is where palloc_extended has been introduced. Reported-by: Tsunakawa Takayuki Reviewed-by: Tsunakawa Takayuki Author: Michael Paquier Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05
* Remove AELs from subxids correctly on standbySimon Riggs2018-06-16
| | | | | | | | | | | | | | | | | | | | Issues relate only to subtransactions that hold AccessExclusiveLocks when replayed on standby. Prior to PG10, aborting subtransactions that held an AccessExclusiveLock failed to release the lock until top level commit or abort. 49bff5300d527 fixed that. However, 49bff5300d527 also introduced a similar bug where subtransaction commit would fail to release an AccessExclusiveLock, leaving the lock to be removed sometimes early and sometimes late. This commit fixes that bug also. Backpatch to PG10 needed. Tested by observation. Note need for multi-node isolationtester to improve test coverage for this and other HS cases. Reported-by: Simon Riggs Author: Simon Riggs
* Fix off-by-one bug in XactLogCommitRecordAlvaro Herrera2018-06-15
| | | | | | | | | | | Commit 1eb6d6527aae introduced zeroed alignment bytes in the GID field of commit/abort WAL records. Fixup commit cf5a1890592b later changed that representation into a regular cstring with a single terminating zero byte, but it also introduced an off-by-one mistake. Fix that. Author: Nikhil Sontakke Reported-by: Nikhil Sontakke Discussion: https://postgr.es/m/CAMGcDxey6dG1DP34_tJMoWPcp5sPJUAL4K5CayUUXLQSx2GQpA@mail.gmail.com
* Fail BRIN control functions during recovery explicitlyAlvaro Herrera2018-06-14
| | | | | | | | | | | They already fail anyway, but prior to this patch they raise an ugly error message about a lock that cannot be acquired. This just improves the message. Author: Masahiko Sawada Reported-by: Masahiko Sawada Discussion: https://postgr.es/m/CAD21AoBZau4g4_NUf3BKNd=CdYK+xaPdtJCzvOC1TxGdTiJx_Q@mail.gmail.com Reviewed-by: Kuntal Ghosh, Alexander Korotkov, Simon Riggs, Michaël Paquier, Álvaro Herrera
* Fix function code in error reportAlvaro Herrera2018-06-06
| | | | | | | | | | | | | This bug causes a lseek() failure to be reported as a "could not open" failure in the error message, muddling bug reports. I introduced this copy-and-pasteo in commit 78e122010422. Noticed while reviewing code for bug report #15221, from lily liang. In version 10 the affected function is only used by multixact.c and commit_ts, and only in corner-case circumstances, neither of which are involved in the reported bug (a pg_subtrans failure.) Author: Álvaro Herrera
* Move _bt_upgrademetapage() into critical section.Teodor Sigaev2018-05-30
| | | | | | | | Any changes on page should be done in critical section, so move _bt_upgrademetapage into critical section. Improve comment. Found by Amit Kapila during post-commit review of 857f9c36. Author: Amit Kapila
* printf("%lf") is not portable, so omit the "l".Tom Lane2018-05-20
| | | | | | | | | | | | The "l" (ell) width spec means something in the corresponding scanf usage, but not here. While modern POSIX says that applying "l" to "f" and other floating format specs is a no-op, SUSv2 says it's undefined. Buildfarm experience says that some old compilers emit warnings about it, and at least one old stdio implementation (mingw's "ANSI" option) actually produces wrong answers and/or crashes. Discussion: https://postgr.es/m/21670.1526769114@sss.pgh.pa.us Discussion: https://postgr.es/m/c085e1da-0d64-1c15-242d-c921f32e0d5c@dunslane.net
* Fix error message on short read of pg_controlMagnus Hagander2018-05-18
| | | | | Instead of saying "error: success", indicate that we got a working read but it was too short.
* Message wording and pluralization improvementsPeter Eisentraut2018-05-17
|
* Various improvements of skipping index scan during vacuum technicsTeodor Sigaev2018-05-10
| | | | | | | | | | | | | | | | | - Change vacuum_cleanup_index_scale_factor GUC to PGC_USERSET. vacuum_cleanup_index_scale_factor GUC was defined as PGC_SIGHUP. But this GUC affects not only autovacuum. So it might be useful to change it from user session in order to influence manually runned VACUUM. - Add missing tab-complete support for vacuum_cleanup_index_scale_factor reloption. - Fix condition for B-tree index cleanup. Zero value of vacuum_cleanup_index_scale_factor means that user wants B-tree index cleanup to be never skipped. - Documentation and comment improvements Authors: Justin Pryzby, Alexander Korotkov, Liudmila Mantrova Reviewed by: all authors and Robert Haas Discussion: https://www.postgresql.org/message-id/flat/20180502023025.GD7631%40telsasoft.com
* Fix scenario where streaming standby gets stuck at a continuation record.Heikki Linnakangas2018-05-05
| | | | | | | | | | | | | | | | | | | | | If a continuation record is split so that its first half has already been removed from the master, and is only present in pg_wal, and there is a recycled WAL segment in the standby server that looks like it would contain the second half, recovery would get stuck. The code in XLogPageRead() incorrectly started streaming at the beginning of the WAL record, even if we had already read the first page. Backpatch to 9.4. In principle, older versions have the same problem, but without replication slots, there was no straightforward mechanism to prevent the master from recycling old WAL that was still needed by standby. Without such a mechanism, I think it's reasonable to assume that there's enough slack in how many old segments are kept around to not run into this, or you have a WAL archive. Reported by Jonathon Nelson. Analysis and patch by Kyotaro HORIGUCHI, with some extra comments by me. Discussion: https://www.postgresql.org/message-id/CACJqAM3xVz0JY1XFDKPP%2BJoJAjoGx%3DGNuOAshEDWCext7BFvCQ%40mail.gmail.com
* Don't mark pages all-visible spuriouslyAlvaro Herrera2018-05-04
| | | | | | | | | | | | | | | | | | | | | | | | | Dan Wood diagnosed a long-standing problem that pages containing tuples that are locked by multixacts containing live lockers may spuriously end up as candidates for getting their all-visible flag set. This has the long-term effect that multixacts remain unfrozen; this may previously pass undetected, but since commit XYZ it would be reported as "ERROR: found multixact 134100944 from before relminmxid 192042633" because when a later vacuum tries to freeze the page it detects that a multixact that should have gotten frozen, wasn't. Dan proposed a (correct) patch that simply sets a variable to its correct value, after a bogus initialization. But, per discussion, it seems better coding to avoid the bogus initializations altogether, since they could give rise to more bugs later. Therefore this fix rewrites the logic a little bit to avoid depending on the bogus initializations. This bug was part of a family introduced in 9.6 by commit a892234f830e; later, commit 38e9f90a227d fixed most of them, but this one was unnoticed. Authors: Dan Wood, Pavan Deolasee, Álvaro Herrera Reviewed-by: Masahiko Sawada, Pavan Deolasee, Álvaro Herrera Discussion: https://postgr.es/m/84EBAC55-F06D-4FBE-A3F3-8BDA093CE3E3@amazon.com
* Don't truncate away non-key attributes for leftmost downlinks.Teodor Sigaev2018-05-04
| | | | | | | | | | | | nbtsort.c does not need to truncate away non-key attributes for the minimum key of the leftmost page on a level, since this is only used to build a minus infinity downlink for the level's leftmost page. Truncating away non-key attributes in advance of truncating away all attributes in _bt_sortaddtup() does not affect the correctness of CREATE INDEX, but it is misleading. Author: Peter Geoghegan Discussion: https://www.postgresql.org/message-id/CAH2-WzkAS2M3ussHG-s_Av=Zo6dPjOxyu5fNRkYnxQV+YzGQ4w@mail.gmail.com
* Re-think predicate locking on GIN indexes.Teodor Sigaev2018-05-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The principle behind the locking was not very well thought-out, and not documented. Add a section in the README to explain how it's supposed to work, and change the code so that it actually works that way. This fixes two bugs: 1. If fast update was turned on concurrently, subsequent inserts to the pending list would not conflict with predicate locks that were acquired earlier, on entry pages. The included 'predicate-gin-fastupdate' test demonstrates that. To fix, make all scans acquire a predicate lock on the metapage. That lock represents a scan of the pending list, whether or not there is a pending list at the moment. Forget about the optimization to skip locking/checking for locks, when fastupdate=off. 2. If a scan finds no match, it still needs to lock the entry page. The point of predicate locks is to lock the gabs between values, whether or not there is a match. The included 'predicate-gin-nomatch' test tests that case. In addition to those two bug fixes, this removes some unnecessary locking, following the principle laid out in the README. Because all items in a posting tree have the same key value, a lock on the posting tree root is enough to cover all the items. (With a very large posting tree, it would possibly be better to lock the posting tree leaf pages instead, so that a "skip scan" with a query like "A & B", you could avoid unnecessary conflict if a new tuple is inserted with A but !B. But let's keep this simple.) Also, some spelling fixes. Author: Heikki Linnakangas with some editorization by me Review: Andrey Borodin, Alexander Korotkov Discussion: https://www.postgresql.org/message-id/0b3ad2c2-2692-62a9-3a04-5724f2af9114@iki.fi