aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access
Commit message (Collapse)AuthorAge
...
* Refactor other replication commands to use DestRemoteSimple.Robert Haas2017-02-01
| | | | | | | | | | | | Commit a84069d9350400c860d5e932b50dfd337aa407b0 added a new type of DestReceiver to avoid duplicating the existing code for the SHOW command, but it turns out we can leverage that new DestReceiver type in a few more places, saving some code. Michael Paquier, reviewed by Andres Freund and by me. Discussion: http://postgr.es/m/CAB7nPqSdFOQC0evc0r1nJeQyGBqjBrR41MC4rcMqUUpoJaZbtQ%40mail.gmail.com Discussion: http://postgr.es/m/CAB7nPqT2K4XFT1JgqufFBjsOc-NUKXg5qBDucHPMbk6Xi1kYaA@mail.gmail.com
* Move comment about test slightly closer to test.Robert Haas2017-01-31
| | | | | | | | The addition of a TestForOldSnapshot() call here has made the referent of this comment slightly less clear, so move the comment to compensate. Amit Kapila (as part of the parallel index scan patch)
* Be more aggressive in avoiding tuple conversion.Robert Haas2017-01-24
| | | | | | | | | | | | | | | | | | | | | | According to the comments in tupconvert.c, it's necessary to perform tuple conversion when either table has OIDs, and this was previously checked by ensuring that the tdtypeid value matched between the tables in question. However, that's overly stringent: we have access to tdhasoid and can test directly whether OIDs are present, which lets us avoid conversion in cases where the type OIDs are different but the tuple descriptors are entirely the same (and neither has OIDs). This is useful to the partitioning code, which can thereby avoid converting tuples when inserting into a partition whose columns appear in the same order as the parent columns, the normal case. It's possible for the tuple routing code to avoid some additional overhead in this case as well, so do that, too. It's not clear whether it would be OK to skip this when both tables have OIDs: do callers count on this to build a new tuple (losing the previous OID) in such instances? Until we figure it out, leave the behavior in that case alone. Amit Langote, reviewed by me.
* Add a SHOW command to the replication command language.Robert Haas2017-01-24
| | | | | | | | | | | | | | This is useful infrastructure for an upcoming proposed patch to allow the WAL segment size to be changed at initdb time; tools like pg_basebackup need the ability to interrogate the server setting. But it also doesn't seem like a bad thing to have independently of that; it may find other uses in the future. Robert Haas and Beena Emerson. (The original patch here was by Beena, but I rewrote it to such a degree that most of the code being committed here is mine.) Discussion: http://postgr.es/m/CA+TgmobNo4qz06wHEmy9DszAre3dYx-WNhHSCbU9SAwf+9Ft6g@mail.gmail.com
* Add a new DestReceiver for printing tuples without catalog access.Robert Haas2017-01-24
| | | | | | | | | | | | | | | If you create a DestReciver of type DestRemote and try to use it from a replication connection that is not bound to a specific daabase, or any other hypothetical type of backend that is not bound to a specific database, it will fail because it doesn't have a pg_proc catalog to look up properties of the types being printed. In general, that's an unavoidable problem, but we can hardwire the properties of a few builtin types in order to support utility commands. This new DestReceiver of type DestRemoteSimple does just that. Patch by me, reviewed by Michael Paquier. Discussion: http://postgr.es/m/CA+TgmobNo4qz06wHEmy9DszAre3dYx-WNhHSCbU9SAwf+9Ft6g@mail.gmail.com
* Extend index AM API for parallel index scans.Robert Haas2017-01-24
| | | | | | | This patch doesn't actually make any index AM parallel-aware, but it provides the necessary functions at the AM layer to do so. Rahila Syed, Amit Kapila, Robert Haas
* Fix interaction of partitioned tables with BulkInsertState.Robert Haas2017-01-24
| | | | | | | | | | | | | | | | | | | | When copying into a partitioned table, the target heap may change from one tuple to next. We must ask ReadBufferBI() to get a new buffer every time such change occurs. To do that, use new function ReleaseBulkInsertStatePin(). This fixes the bug that tuples ended up being inserted into the wrong partition, which occurred exactly because the wrong buffer was used. Amit Langote, per a suggestion from Robert Haas. Some cosmetic adjustments by me. Reports by 高增琦 (Gao Zengqi), Venkata B Nagothi, and Ragnar Ouchterlony. Discussion: http://postgr.es/m/CAFmBtr32FDOqofo8yG-4mjzL1HnYHxXK5S9OGFJ%3D%3DcJpgEW4vA%40mail.gmail.com Discussion: http://postgr.es/m/CAEyp7J9WiX0L3DoiNcRrY-9iyw%3DqP%2Bj%3DDLsAnNFF1xT2J1ggfQ%40mail.gmail.com Discussion: http://postgr.es/m/16d73804-c9cd-14c5-463e-5caad563ff77%40agama.tv Discussion: http://postgr.es/m/CA+TgmoaiZpDVUUN8LZ4jv1qFE_QyR+H9ec+79f5vNczYarg5Zg@mail.gmail.com
* Move some things from builtins.h to new header filesPeter Eisentraut2017-01-20
| | | | This avoids that builtins.h has to include additional header files.
* Logical replicationPeter Eisentraut2017-01-20
| | | | | | | | | | | | | - Add PUBLICATION catalogs and DDL - Add SUBSCRIPTION catalog and DDL - Define logical replication protocol and output plugin - Add logical replication workers From: Petr Jelinek <petr@2ndquadrant.com> Reviewed-by: Steve Singer <steve@ssinger.info> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Erik Rijkers <er@xs4all.nl> Reviewed-by: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
* Fix race condition in reading commit timestampsAlvaro Herrera2017-01-19
| | | | | | | | | | | | | | | | | | | | | | | | | If a user requests the commit timestamp for a transaction old enough that its data is concurrently being truncated away by vacuum at just the right time, they would receive an ugly internal file-not-found error message from slru.c rather than the expected NULL return value. In a primary server, the window for the race is very small: the lookup has to occur exactly between the two calls by vacuum, and there's not a lot that happens between them (mostly just a multixact truncate). In a standby server, however, the window is larger because the truncation is executed as soon as the WAL record for it is replayed, but the advance of the oldest-Xid is not executed until the next checkpoint record. To fix in the primary, simply reverse the order of operations in vac_truncate_clog. To fix in the standby, augment the WAL truncation record so that the standby is aware of the new oldest-XID value and can apply the update immediately. WAL version bumped because of this. No backpatch, because of the low importance of the bug and its rarity. Author: Craig Ringer Reviewed-By: Petr Jelínek, Peter Eisentraut Discussion: https://postgr.es/m/CAMsr+YFhVtRQT1VAwC+WGbbxZZRzNou=N9Ed-FrCqkwQ8H8oJQ@mail.gmail.com
* Improve comment in hashsearch.c.Robert Haas2017-01-18
| | | | Typo fix from Mithun Cy; other improvements by me.
* Generate fmgr prototypes automaticallyPeter Eisentraut2017-01-17
| | | | | | | | | | | | Gen_fmgrtab.pl creates a new file fmgrprotos.h, which contains prototypes for all functions registered in pg_proc.h. This avoids having to manually maintain these prototypes across a random variety of header files. It also automatically enforces a correct function signature, and since there are warnings about missing prototypes, it will detect functions that are defined but not registered in pg_proc.h (or otherwise used). Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
* Fix an assertion failure related to an exclusive backup.Fujii Masao2017-01-17
| | | | | | | | | | | | | | | | | | | | | | Previously multiple sessions could execute pg_start_backup() and pg_stop_backup() to start and stop an exclusive backup at the same time. This could trigger the assertion failure of "FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)". This happend because, even while pg_start_backup() was starting an exclusive backup, other session could run pg_stop_backup() concurrently and mark the backup as not-in-progress unconditionally. This patch introduces ExclusiveBackupState indicating the state of an exclusive backup. This state is used to ensure that there is only one session running pg_start_backup() or pg_stop_backup() at the same time, to avoid the assertion failure. Back-patch to all supported versions. Author: Michael Paquier Reviewed-By: Kyotaro Horiguchi and me Reported-By: Andreas Seltenreich Discussion: <87mvktojme.fsf@credativ.de>
* Improve coding in _hash_addovflpage.Robert Haas2017-01-10
| | | | | | | | Instead of relying on the page contents to know whether we have advanced from the primary bucket page to an overflow page, track that explicitly. Amit Kapila, per a complaint by me.
* Fix ALTER TABLE / SET TYPE for irregular inheritanceAlvaro Herrera2017-01-09
| | | | | | | | | | | | | | | | | | | If inherited tables don't have exactly the same schema, the USING clause in an ALTER TABLE / SET DATA TYPE misbehaves when applied to the children tables since commit 9550e8348b79. Starting with that commit, the attribute numbers in the USING expression are fixed during parse analysis. This can lead to bogus errors being reported during execution, such as: ERROR: attribute 2 has wrong type DETAIL: Table has type smallint, but query expects integer. Since it wouldn't do to revert to the original coding, we now apply a transformation to map the attribute numbers to the correct ones for each child. Reported by Justin Pryzby Analysis by Tom Lane; patch by me. Discussion: https://postgr.es/m/20170102225618.GA10071@telsasoft.com
* BRIN revmap pages are not standard pages ...Alvaro Herrera2017-01-09
| | | | | | | | | | | | | | | | | | | | | | | | ... and therefore we ought not to tell XLogRegisterBuffer the opposite, when writing XLog for a brin update that moves the index tuple to a different page. Otherwise, xlog insertion would try to "compress the hole" when producing a full-page image for it; but since we don't update pd_lower/upper, the hole covers the whole page. On WAL replay, the revmap page becomes empty and so the entire portion of the index is useless and needs to be recomputed. This is low-probability: a BRIN update only moves an index tuple to a different page when the summary tuple is larger than the existing one, which doesn't happen with fixed-width datatypes. Also, the revmap page must be first after a checkpoint. Report and patch: Kuntal Ghosh Bug is alleged to have detected by a WAL-consistency-checking tool. Discussion: https://postgr.es/m/CAGz5QCJ=00UQjScSEFbV=0qO5ShTZB9WWz_Fm7+Wd83zPs9Geg@mail.gmail.com I posted a test case demonstrating the problem, but I'm refraining from adding it to the test suite; if the WAL consistency tool makes it in, that will be a better way to catch this from regressing. (We should definitely have someting that causes not-same-page updates, though.)
* Update copyright via script for 2017Bruce Momjian2017-01-03
|
* Remove triggerable Assert in hashname().Tom Lane2016-12-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | hashname() asserted that the key string it is given is shorter than NAMEDATALEN. That should surely always be true if the input is in fact a regular value of type "name". However, for reasons of coding convenience, we allow plain old C strings to be treated as "name" values in many places. Some SQL functions accept arbitrary "text" inputs, convert them to C strings, and pass them otherwise-untransformed to syscache lookups for name columns, allowing an overlength input value to trigger hashname's Assert. This would be a DOS problem, except that it only happens in assert-enabled builds which aren't recommended for production. In a production build, you'll just get a name lookup error, since regardless of the hash value computed by hashname, the later equality comparison checks can't match. Likewise, if the catalog lookup is done by seqscan or indexscan searches, there will just be a lookup error, since the name comparison functions don't contain any similar length checks, and will see an overlength input as unequal to any stored entry. After discussion we concluded that we should simply remove this Assert. It's inessential to hashname's own functionality, and having such an assertion in only some paths for name lookup is more of a foot-gun than a useful check. There may or may not be a case for the affected callers to do something other than let the name lookup fail, but we'll consider that separately; in any case we probably don't want to change such behavior in the back branches. Per report from Tushar Ahuja. Back-patch to all supported branches. Report: https://postgr.es/m/7d0809ee-6f25-c9d6-8e74-5b2967830d49@enterprisedb.com Discussion: https://postgr.es/m/17691.1482523168@sss.pgh.pa.us
* Remove _hash_chgbufaccess().Robert Haas2016-12-23
| | | | | | | | | This is basically for the same reasons I got rid of _hash_wrtbuf() in commit 25216c98938495fd741bf585dcbef45b3a9ffd40: it's not convenient to have a function which encapsulates MarkBufferDirty(), especially as we move towards having hash indexes be WAL-logged. Patch by me, reviewed (but not entirely endorsed) by Amit Kapila.
* Skip checkpoints, archiving on idle systems.Andres Freund2016-12-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Some background activity (like checkpoints, archive timeout, standby snapshots) is not supposed to happen on an idle system. Unfortunately so far it was not easy to determine when a system is idle, which defeated some of the attempts to avoid redundant activity on an idle system. To make that easier, allow to make individual WAL insertions as not being "important". By checking whether any important activity happened since the last time an activity was performed, it now is easy to check whether some action needs to be repeated. Use the new facility for checkpoints, archive timeout and standby snapshots. The lack of a facility causes some issues in older releases, but in my opinion the consequences (superflous checkpoints / archived segments) aren't grave enough to warrant backpatching. Author: Michael Paquier, editorialized by Andres Freund Reviewed-By: Andres Freund, David Steele, Amit Kapila, Kyotaro HORIGUCHI Bug: #13685 Discussion: https://www.postgresql.org/message-id/20151016203031.3019.72930@wrigleys.postgresql.org https://www.postgresql.org/message-id/CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkYv_45dKwA@mail.gmail.com Backpatch: -
* Fix broken error check in _hash_doinsert.Robert Haas2016-12-22
| | | | | | | | | | | | You can't just cast a HashMetaPage to a Page, because the meta page data is stored after the page header, not at offset 0. Fortunately, this didn't break anything because it happens to find hashm_bsize at the offset at which it expects to find pd_pagesize_version, and the values are close enough to the same that this works out. Still, it's a bug, so back-patch to all supported versions. Mithun Cy, revised a bit by me.
* Fix locking problem in _hash_squeezebucket() / _hash_freeovflpage().Robert Haas2016-12-19
| | | | | | | | | | | | A bucket squeeze operation needs to lock each page of the bucket before releasing the prior page, but the previous coding fumbled the locking when freeing an overflow page during a bucket squeeze operation. Commit 6d46f4783efe457f74816a75173eb23ed8930020 introduced this bug. Amit Kapila, with help from Kuntal Ghosh and Dilip Kumar, after an initial trouble report by Jeff Janes. Reviewed by me. I also fixed a problem with a comment.
* Simplify LWLock tranche machinery by removing array_base/array_stride.Robert Haas2016-12-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | array_base and array_stride were added so that we could identify the offset of an LWLock within a tranche, but this facility is only very marginally used apart from the main tranche. So, give every lock in the main tranche its own tranche ID and get rid of array_base, array_stride, and all that's attached. For debugging facilities (Trace_lwlocks and LWLOCK_STATS) print the pointer address of the LWLock using %p instead of the offset. This is arguably more useful, and certainly a lot cheaper. Drop the offset-within-tranche from the information reported to dtrace and from one can't-happen message inside lwlock.c. The main user-visible impact of this change is that pg_stat_activity will now report all waits for LWLocks as "LWLock" rather than reporting some as "LWLockTranche" and others as "LWLockNamed". The main motivation for this change is that the need to specify an array_base and an array_stride is awkward for parallel query. There is only a very limited supply of tranche IDs so we can't just keep allocating new ones, and if we try to use the same tranche IDs every time then we run into trouble when multiple parallel contexts are use simultaneously. So if we didn't get rid of this mechanism we'd have to make it even more complicated. By simplifying it in this way, we instead reduce the size of the generated code for lwlock.c by about 5%. Discussion: http://postgr.es/m/CA+TgmoYsFn6NUW1x0AZtupJGUAs1UDY4dJtCN47_Q6D0sP80PA@mail.gmail.com
* Fix more hash index bugs around marking buffers dirty.Robert Haas2016-12-16
| | | | | | | | | | | | | | In _hash_freeovflpage(), if we're freeing the overflow page that immediate follows the page to which tuples are being moved (the confusingly-named "write buffer"), don't forget to mark that page dirty after updating its hasho_nextblkno. In _hash_squeezebucket(), it's not necessary to mark the primary bucket page dirty if there are no overflow pages, because there's nothing to squeeze in that case. Amit Kapila, with help from Kuntal Ghosh and Dilip Kumar, after an initial trouble report by Jeff Janes.
* Remove _hash_wrtbuf() in favor of calling MarkBufferDirty().Robert Haas2016-12-16
| | | | | | | | | | | | | | | | | | | | | | The whole concept of _hash_wrtbuf() is that we need to know at the time we're releasing the buffer lock (and pin) whether we dirtied the buffer, but this is easy to get wrong. This patch actually fixes one non-obvious bug of that form: hashbucketcleanup forgot to signal _hash_squeezebucket, which gets the primary bucket page already locked, as to whether it had already dirtied the page. Calling MarkBufferDirty() at the places where we dirty the buffer is more intuitive and lets us simplify the code in various places as well. On top of all that, the ultimate goal here is to make hash indexes WAL-logged, and as the comments to _hash_wrtbuf() note, it should go away when that happens. Making it go away a little earlier than that seems like a good preparatory step. Report by Jeff Janes. Diagnosis by Amit Kapila, Kuntal Ghosh, and Dilip Kumar. Patch by me, after studying an alternative patch submitted by Amit Kapila. Discussion: http://postgr.es/m/CAA4eK1Kf6tOY0oVz_SEdngiNFkeXrA3xUSDPPORQvsWVPdKqnA@mail.gmail.com
* Fix bug in hashbulkdelete.Robert Haas2016-12-13
| | | | | | | | Commit 6d46f4783efe457f74816a75173eb23ed8930020 failed to account for the possibility that hashbulkdelete() might encounter a bucket that has been split since it began scanning the bucket array. Repair. Extracted from a larger pathc by Amit Kapila; I rewrote the comment.
* Remove should_free arguments to tuplesort routines.Robert Haas2016-12-12
| | | | | | | | | | Since commit e94568ecc10f2638e542ae34f2990b821bbf90ac, the answer is always "false", and we do not need to complicate the API by arranging to return a constant value. Peter Geoghegan Discussion: http://postgr.es/m/CAM3SWZQWZZ_N=DmmL7tKy_OUjGH_5mN=N=A6h7kHyyDvEhg2DA@mail.gmail.com
* Log the creation of an init fork unconditionally.Robert Haas2016-12-08
| | | | | | | | | | | | | | | | | Previously, it was thought that this only needed to be done for the benefit of possible standbys, so wal_level = minimal skipped it. But that's not safe, because during crash recovery we might replay XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record which recursively removes the directory that contains the new init fork. So log it always. The user-visible effect of this bug is that if you create a database or tablespace, then create an unlogged table, then crash without checkpointing, then restart, accessing the table will fail, because the it won't have been properly reset. This commit fixes that. Michael Paquier, per a report from Konstantin Knizhnik. Wording of the comments per a suggestion from me.
* Implement table partitioning.Robert Haas2016-12-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Table partitioning is like table inheritance and reuses much of the existing infrastructure, but there are some important differences. The parent is called a partitioned table and is always empty; it may not have indexes or non-inherited constraints, since those make no sense for a relation with no data of its own. The children are called partitions and contain all of the actual data. Each partition has an implicit partitioning constraint. Multiple inheritance is not allowed, and partitioning and inheritance can't be mixed. Partitions can't have extra columns and may not allow nulls unless the parent does. Tuples inserted into the parent are automatically routed to the correct partition, so tuple-routing ON INSERT triggers are not needed. Tuple routing isn't yet supported for partitions which are foreign tables, and it doesn't handle updates that cross partition boundaries. Currently, tables can be range-partitioned or list-partitioned. List partitioning is limited to a single column, but range partitioning can involve multiple columns. A partitioning "column" can be an expression. Because table partitioning is less general than table inheritance, it is hoped that it will be easier to reason about properties of partitions, and therefore that this will serve as a better foundation for a variety of possible optimizations, including query planner optimizations. The tuple routing based which this patch does based on the implicit partitioning constraints is an example of this, but it seems likely that many other useful optimizations are also possible. Amit Langote, reviewed and tested by Robert Haas, Ashutosh Bapat, Amit Kapila, Rajkumar Raghuwanshi, Corey Huinker, Jaime Casanova, Rushabh Lathia, Erik Rijkers, among others. Minor revisions by me.
* Fix race introduced by 6d46f4783efe457f74816a75173eb23ed8930020.Robert Haas2016-12-05
| | | | | | | | | It's possible for the metapage contents to change after we release the lock, so we must read them before releasing the lock. Amit Kapila. Submitted in response to a trouble report from Andreas Seltenreich, though it is not certain this fixes the problem.
* Fix incorrect output from gin_desc().Fujii Masao2016-12-05
| | | | | | | | | | | | | | | | | | | | | Previously gin_desc() displayed incorrect output "unknown action 0" for XLOG_GIN_INSERT and XLOG_GIN_VACUUM_DATA_LEAF_PAGE records with valid actions. The cause of this problem was that gin_desc() wrongly used XLogRecGetData() to extract data from those records. Since they were registered by XLogRegisterBufData(), gin_desc() should have used XLogRecGetBlockData(), instead, like gin_redo(). Also there were other differences about how to treat XLOG_GIN_INSERT record between gin_desc() and gin_redo(). This commit fixes gin_desc() routine so that it treats those records in the same way as gin_redo(). Batch-patch to 9.5 where WAL record format was revamped and XLogRegisterBufData() was added. Reported-By: Andres Freund Reviewed-By: Tom Lane Discussion: <20160509194645.7lewnpw647zegx2m@alap3.anarazel.de>
* Add max_parallel_workers GUC.Robert Haas2016-12-02
| | | | | | | | | | Increase the default value of the existing max_worker_processes GUC from 8 to 16, and add a new max_parallel_workers GUC with a maximum of 8. This way, even if the maximum amount of parallel query is happening, there is still room for background workers that do other things, as originally envisioned when max_worker_processes was added. Julien Rouhaud, reviewed by Amit Kapila and by revised by me.
* Permit dump/reload of not-too-large >1GB tuplesAlvaro Herrera2016-12-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Our documentation states that our maximum field size is 1 GB, and that our maximum row size of 1.6 TB. However, while this might be attainable in theory with enough contortions, it is not workable in practice; for starters, pg_dump fails to dump tables containing rows larger than 1 GB, even if individual columns are well below the limit; and even if one does manage to manufacture a dump file containing a row that large, the server refuses to load it anyway. This commit enables dumping and reloading of such tuples, provided two conditions are met: 1. no single column is larger than 1 GB (in output size -- for bytea this includes the formatting overhead) 2. the whole row is not larger than 2 GB There are three related changes to enable this: a. StringInfo's API now has two additional functions that allow creating a string that grows beyond the typical 1GB limit (and "long" string). ABI compatibility is maintained. We still limit these strings to 2 GB, though, for reasons explained below. b. COPY now uses long StringInfos, so that pg_dump doesn't choke trying to emit rows longer than 1GB. c. heap_form_tuple now uses the MCXT_ALLOW_HUGE flag in its allocation for the input tuple, which means that large tuples are accepted on input. Note that at this point we do not apply any further limit to the input tuple size. The main reason to limit to 2 GB is that the FE/BE protocol uses 32 bit length words to describe each row; and because the documentation is ambiguous on its signedness and libpq does consider it signed, we cannot use the highest-order bit. Additionally, the StringInfo API uses "int" (which is 4 bytes wide in most platforms) in many places, so we'd need to change that API too in order to improve, which has lots of fallout. Backpatch to 9.5, which is the oldest that has MemoryContextAllocExtended, a necessary piece of infrastructure. We could apply to 9.4 with very minimal additional effort, but any further than that would require backpatching "huge" allocations too. This is the largest set of changes we could find that can be back-patched without breaking compatibility with existing systems. Fixing a bigger set of problems (for example, dumping tuples bigger than 2GB, or dumping fields bigger than 1GB) would require changing the FE/BE protocol and/or changing the StringInfo API in an ABI-incompatible way, neither of which would be back-patchable. Authors: Daniel Vérité, Álvaro Herrera Reviewed by: Tomas Vondra Discussion: https://postgr.es/m/20160229183023.GA286012@alvherre.pgsql
* Improve hash index bucket split behavior.Robert Haas2016-11-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, the right to split a bucket was represented by a heavyweight lock on the page number of the primary bucket page. Unfortunately, this meant that every scan needed to take a heavyweight lock on that bucket also, which was bad for concurrency. Instead, use a cleanup lock on the primary bucket page to indicate the right to begin a split, so that scans only need to retain a pin on that page, which is they would have to acquire anyway, and which is also much cheaper. In addition to reducing the locking cost, this also avoids locking out scans and inserts for the entire lifetime of the split: while the new bucket is being populated with copies of the appropriate tuples from the old bucket, scans and inserts can happen in parallel. There are minor concurrency improvements for vacuum operations as well, though the situation there is still far from ideal. This patch also removes the unworldly assumption that a split will never be interrupted. With the new code, a split is done in a series of small steps and the system can pick up where it left off if it is interrupted prior to completion. While this patch does not itself add write-ahead logging for hash indexes, it is clearly a necessary first step, since one of the things that could interrupt a split is the removal of electrical power from the machine performing it. Amit Kapila. I wrote the original design on which this patch is based, and did a good bit of work on the comments and README through multiple rounds of review, but all of the code is Amit's. Also reviewed by Jesper Pedersen, Jeff Janes, and others. Discussion: http://postgr.es/m/CAA4eK1LfzcZYxLoXS874Ad0+S-ZM60U9bwcyiUZx9mHZ-KCWhw@mail.gmail.com
* Bring some clarity to the defaults for the xxx_flush_after parameters.Tom Lane2016-11-25
| | | | | | | | | | | | | | | | | | | | | | | | | | Instead of confusingly stating platform-dependent defaults for these parameters in the comments in postgresql.conf.sample (with the main entry being a lie on Linux), teach initdb to install the correct platform-dependent value in postgresql.conf, similarly to the way we handle other platform-dependent defaults. This won't do anything for existing 9.6 installations, but since it's effectively only a documentation improvement, that seems OK. Since this requires initdb to have access to the default values, move the #define's for those to pg_config_manual.h; the original placement in bufmgr.h is unworkable because that file can't be included by frontend programs. Adjust the default value for wal_writer_flush_after so that it is 1MB regardless of XLOG_BLCKSZ, conforming to what is stated in both the SGML docs and postgresql.conf. (We could alternatively make it scale with XLOG_BLCKSZ, but I'm not sure I see the point.) Copy-edit related SGML documentation. Fabien Coelho and Tom Lane, per a gripe from Tomas Vondra. Discussion: <30ebc6e3-8358-09cf-44a8-578252938424@2ndquadrant.com>
* Fix commit_ts for FrozenXid and BootstrapXidAlvaro Herrera2016-11-24
| | | | | | | | | | | | | | | | Previously, requesting commit timestamp for transactions FrozenTransactionId and BootstrapTransactionId resulted in an error. But since those values can validly appear in committed tuples' Xmin, this behavior is unhelpful and error prone: each caller would have to special-case those values before requesting timestamp data for an Xid. We already have a perfectly good interface for returning "the Xid you requested is too old for us to have commit TS data for it", so let's use that instead. Backpatch to 9.5, where commit timestamps appeared. Author: Craig Ringer Discussion: https://www.postgresql.org/message-id/CAMsr+YFM5Q=+ry3mKvWEqRTxrB0iU3qUSRnS28nz6FJYtBwhJg@mail.gmail.com
* Remove barrier.hRobert Haas2016-11-22
| | | | | | | | A new thing also called a "barrier" is proposed, but whether we decide to take that patch or not, this file seems to have outlived its usefulness. Thomas Munro
* Support condition variables.Robert Haas2016-11-22
| | | | | | | | | | | | | | | | | Condition variables provide a flexible way to sleep until a cooperating process causes an arbitrary condition to become true. In simple cases, this can be accomplished with a WaitLatch/ResetLatch loop; the cooperating process can call SetLatch after performing work that might cause the condition to be satisfied, and the waiting process can recheck the condition each time. However, if the process performing the work doesn't have an easy way to identify which processes might be waiting, this doesn't work, because it can't identify which latches to set. Condition variables solve that problem by internally maintaining a list of waiters; a process that may have caused some waiter's condition to be satisfied must "signal" or "broadcast" on the condition variable. Robert Haas and Thomas Munro
* Remove or reduce verbosity of some debug messages.Robert Haas2016-11-17
| | | | | | | | | | | | | | | | | | | The debug messages that merely print StartTransactionCommand, CommitTransactionCommand, ProcessUtilty, or ProcessQuery with no additional details seem to be useless. Get rid of them. The transaction status messages produced by ShowTransactionState are occasionally useful, but they are extremely verbose, producing multiple lines of log output every time they fire, which can happens multiple times per transaction. So, reduce the level to DEBUG5; avoid emitting an extra line just to explain which debug point is at issue; and tighten up the rest of the message so it doesn't use quite so much horizontal space. With these changes, it's possible to run a somewhat busy system with a log level even as high as DEBUG4, whereas previously anything above DEBUG2 would flood the log with output that probably wasn't really all that useful.
* Avoid pin scan for replay of XLOG_BTREE_VACUUM in all casesAlvaro Herrera2016-11-17
| | | | | | | | | | | | | | | | | | | | | | | Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to require complex interlocking that matched the requirements on the master. This required an O(N) operation that became a significant problem with large indexes, causing replication delays of seconds or in some cases minutes while the XLOG_BTREE_VACUUM was replayed. This commit skips the “pin scan” that was previously required, by observing in detail when and how it is safe to do so, with full documentation. The pin scan is skipped only in replay; the VACUUM code path on master is not touched here. No tests included. Manual tests using an additional patch to view WAL records and their timing have shown the change in WAL records and their handling has successfully reduced replication delay. This is a back-patch of commits 687f2cd7a015, 3e4b7d87988f, b60284261375 by Simon Riggs, to branches 9.4 and 9.5. No further backpatch is possible because this depends on catalog scans being MVCC. I (Álvaro) additionally updated a slight problem in the README, which explains why this touches the 9.6 and master branches.
* Replace uses of SPI_modifytuple that intend to allocate in current context.Tom Lane2016-11-08
| | | | | | | | | | | | | | | | | | | | | | | | | Invent a new function heap_modify_tuple_by_cols() that is functionally equivalent to SPI_modifytuple except that it always allocates its result by simple palloc. I chose however to make the API details a bit more like heap_modify_tuple: pass a tupdesc rather than a Relation, and use bool convention for the isnull array. Use this function in place of SPI_modifytuple at all call sites where the intended behavior is to allocate in current context. (There actually are only two call sites left that depend on the old behavior, which makes me wonder if we should just drop this function rather than keep it.) This new function is easier to use than heap_modify_tuple() for purposes of replacing a single column (or, really, any fixed number of columns). There are a number of places where it would simplify the code to change over, but I resisted that temptation for the moment ... everywhere except in plpgsql's exec_assign_value(); changing that might offer some small performance benefit, so I did it. This is on the way to removing SPI_push/SPI_pop, but it seems like good code cleanup in its own right. Discussion: <9633.1478552022@sss.pgh.pa.us>
* Improve handling of dead tuples in hash indexes.Robert Haas2016-11-08
| | | | | | | | | | | When squeezing a bucket during vacuum, it's not necessary to retain any tuples already marked as dead, so ignore them when deciding which tuples must be moved in order to empty a bucket page. Similarly, when splitting a bucket, relocating dead tuples to the new bucket is a waste of effort; instead, just ignore them. Amit Kapila, reviewed by me. Testing help provided by Ashutosh Sharma.
* Fix silly nil-pointer-dereference bug introduced in commit d5f6f13f8.Tom Lane2016-11-06
| | | | | | | Don't fetch record->xl_info before we've verified that record isn't NULL. Per Coverity. Michael Paquier
* Be more consistent about masking xl_info with ~XLR_INFO_MASK.Tom Lane2016-11-04
| | | | | | | | | | | | Generally, WAL resource managers are only supposed to examine the top 4 bits of a WAL record's xl_info; the rest are reserved for the WAL mechanism itself. A few places were not consistent about doing this with respect to XLOG_CHECKPOINT and XLOG_SWITCH records. There's no bug currently, since no additional bits ever get set in these specific record types, but that might not be true forever. Let's follow the generic coding rule here too. Michael Paquier
* Fix leftover reference to background writer performing checkpoints.Robert Haas2016-10-28
| | | | | This was changed in PostgreSQL 9.2, but somehow this comment never got updated.
* Fix possible pg_basebackup failure on standby with "include WAL".Robert Haas2016-10-27
| | | | | | | | | | | | | | | If a restartpoint flushed no dirty buffers, it could fail to update the minimum recovery point, leading to a minimum recovery point prior to the starting REDO location. perform_base_backup() would interpret that as meaning that no WAL files at all needed to be included in the backup, failing an internal sanity check. To fix, have restartpoints always update the minimum recovery point to just after the checkpoint record itself, so that the file (or files) containing the checkpoint record will always be included in the backup. Code by Amit Kapila, per a design suggestion by me, with some additional work on the code comment by me. Test case by Michael Paquier. Report by Kyotaro Horiguchi.
* Preserve commit timestamps across clean restartAlvaro Herrera2016-10-24
| | | | | | | | | An oversight in setting the boundaries of known commit timestamps during startup caused old commit timestamps to become inaccessible after a server restart. Author and reporter: Julien Rouhaud Review, test code: Craig Ringer
* Fix comment formatting.Robert Haas2016-10-21
|
* Rename "pg_xlog" directory to "pg_wal".Robert Haas2016-10-20
| | | | | | | | | | | | | | | | | | | | | "xlog" is not a particularly clear abbreviation for "write-ahead log", and it sometimes confuses users into believe that the contents of the "pg_xlog" directory are not critical data, leading to unpleasant consequences. So, rename the directory to "pg_wal". This patch modifies pg_upgrade and pg_basebackup to understand both the old and new directory layouts; the former is necessary given the purpose of the tool, while the latter merely avoids an unnecessary backward-compatibility break. We may wish to consider renaming other programs, switches, and functions which still use the old "xlog" naming to also refer to "wal". However, that's still under discussion, so let's do just this much for now. Discussion: CAB7nPqTeC-8+zux8_-4ZD46V7YPwooeFxgndfsq5Rg8ibLVm1A@mail.gmail.com Michael Paquier
* Fix WAL-logging of FSM and VM truncation.Heikki Linnakangas2016-10-19
| | | | | | | | | | | | | | | | | | | | | | | | When a relation is truncated, it is important that the FSM is truncated as well. Otherwise, after recovery, the FSM can return a page that has been truncated away, leading to errors like: ERROR: could not read block 28991 in file "base/16390/572026": read only 0 of 8192 bytes We were using MarkBufferDirtyHint() to dirty the buffer holding the last remaining page of the FSM, but during recovery, that might in fact not dirty the page, and the FSM update might be lost. To fix, use the stronger MarkBufferDirty() function. MarkBufferDirty() requires us to do WAL-logging ourselves, to protect from a torn page, if checksumming is enabled. Also fix an oversight in visibilitymap_truncate: it also needs to WAL-log when checksumming is enabled. Analysis by Pavan Deolasee. Discussion: <CABOikdNr5vKucqyZH9s1Mh0XebLs_jRhKv6eJfNnD2wxTn=_9A@mail.gmail.com>