aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access
Commit message (Collapse)AuthorAge
...
* Add deduplication to nbtree.Peter Geoghegan2020-02-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Deduplication reduces the storage overhead of duplicates in indexes that use the standard nbtree index access method. The deduplication process is applied lazily, after the point where opportunistic deletion of LP_DEAD-marked index tuples occurs. Deduplication is only applied at the point where a leaf page split would otherwise be required. New posting list tuples are formed by merging together existing duplicate tuples. The physical representation of the items on an nbtree leaf page is made more space efficient by deduplication, but the logical contents of the page are not changed. Even unique indexes make use of deduplication as a way of controlling bloat from duplicates whose TIDs point to different versions of the same logical table row. The lazy approach taken by nbtree has significant advantages over a GIN style eager approach. Most individual inserts of index tuples have exactly the same overhead as before. The extra overhead of deduplication is amortized across insertions, just like the overhead of page splits. The key space of indexes works in the same way as it has since commit dd299df8 (the commit that made heap TID a tiebreaker column). Testing has shown that nbtree deduplication can generally make indexes with about 10 or 15 tuples for each distinct key value about 2.5X - 4X smaller, even with single column integer indexes (e.g., an index on a referencing column that accompanies a foreign key). The final size of single column nbtree indexes comes close to the final size of a similar contrib/btree_gin index, at least in cases where GIN's posting list compression isn't very effective. This can significantly improve transaction throughput, and significantly reduce the cost of vacuuming indexes. A new index storage parameter (deduplicate_items) controls the use of deduplication. The default setting is 'on', so all new B-Tree indexes automatically use deduplication where possible. This decision will be reviewed at the end of the Postgres 13 beta period. There is a regression of approximately 2% of transaction throughput with synthetic workloads that consist of append-only inserts into a table with several non-unique indexes, where all indexes have few or no repeated values. The underlying issue is that cycles are wasted on unsuccessful attempts at deduplicating items in non-unique indexes. There doesn't seem to be a way around it short of disabling deduplication entirely. Note that deduplication of items in unique indexes is fairly well targeted in general, which avoids the problem there (we can use a special heuristic to trigger deduplication passes in unique indexes, since we're specifically targeting "version bloat"). Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed. No bump in BTREE_VERSION, since the representation of posting list tuples works in a way that's backwards compatible with version 4 indexes (i.e. indexes built on PostgreSQL 12). However, users must still REINDEX a pg_upgrade'd index to use deduplication, regardless of the Postgres version they've upgraded from. This is the only way to set the new nbtree metapage flag indicating that deduplication is generally safe. Author: Anastasia Lubennikova, Peter Geoghegan Reviewed-By: Peter Geoghegan, Heikki Linnakangas Discussion: https://postgr.es/m/55E4051B.7020209@postgrespro.ru https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru
* Add equalimage B-Tree support functions.Peter Geoghegan2020-02-26
| | | | | | | | | | | | | | | | | | | | | | Invent the concept of a B-Tree equalimage ("equality implies image equality") support function, registered as support function 4. This indicates whether it is safe (or not safe) to apply optimizations that assume that any two datums considered equal by an operator class's order method must be interchangeable without any loss of semantic information. This is static information about an operator class and a collation. Register an equalimage routine for almost all of the existing B-Tree opclasses. We only need two trivial routines for all of the opclasses that are included with the core distribution. There is one routine for opclasses that index non-collatable types (which returns 'true' unconditionally), plus another routine for collatable types (which returns 'true' when the collation is a deterministic collation). This patch is infrastructure for an upcoming patch that adds B-Tree deduplication. Author: Peter Geoghegan, Anastasia Lubennikova Discussion: https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
* Issue properly WAL record for CID of first catalog tuple in multi-insertMichael Paquier2020-02-25
| | | | | | | | | | | | | | Multi-insert for heap is not yet used actively for catalogs, but the code to support this case is in place for logical decoding. The existing code forgot to issue a XLOG_HEAP2_NEW_CID record for the first tuple inserted, leading to failures when attempting to use multiple inserts for catalogs at decoding time. This commit fixes the problem by WAL-logging the needed CID. This is not an active bug, so no back-patch is done. Author: Daniel Gustafsson Discussion: https://postgr.es/m/E0D4CC67-A1CF-4DF4-991D-B3AC2EB5FAE9@yesql.se
* Account explicitly for long-lived FDs that are allocated outside fd.c.Tom Lane2020-02-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The comments in fd.c have long claimed that all file allocations should go through that module, but in reality that's not always practical. fd.c doesn't supply APIs for invoking some FD-producing syscalls like pipe() or epoll_create(); and the APIs it does supply for non-virtual FDs are mostly insistent on releasing those FDs at transaction end; and in some cases the actual open() call is in code that can't be made to use fd.c, such as libpq. This has led to a situation where, in a modern server, there are likely to be seven or so long-lived FDs per backend process that are not known to fd.c. Since NUM_RESERVED_FDS is only 10, that meant we had *very* few spare FDs if max_files_per_process is >= the system ulimit and fd.c had opened all the files it thought it safely could. The contrib/postgres_fdw regression test, in particular, could easily be made to fall over by running it under a restrictive ulimit. To improve matters, invent functions Acquire/Reserve/ReleaseExternalFD that allow outside callers to tell fd.c that they have or want to allocate a FD that's not directly managed by fd.c. Add calls to track all the fixed FDs in a standard backend session, so that we are honestly guaranteeing that NUM_RESERVED_FDS FDs remain unused below the EMFILE limit in a backend's idle state. The coding rules for these functions say that there's no need to call them in code that just allocates one FD over a fairly short interval; we can dip into NUM_RESERVED_FDS for such cases. That means that there aren't all that many places where we need to worry. But postgres_fdw and dblink must use this facility to account for long-lived FDs consumed by libpq connections. There may be other places where it's worth doing such accounting, too, but this seems like enough to solve the immediate problem. Internally to fd.c, "external" FDs are limited to max_safe_fds/3 FDs. (Callers can choose to ignore this limit, but of course it's unwise to do so except for fixed file allocations.) I also reduced the limit on "allocated" files to max_safe_fds/3 FDs (it had been max_safe_fds/2). Conceivably a smarter rule could be used here --- but in practice, on reasonable systems, max_safe_fds should be large enough that this isn't much of an issue, so KISS for now. To avoid possible regression in the number of external or allocated files that can be opened, increase FD_MINFREE and the lower limit on max_files_per_process a little bit; we now insist that the effective "ulimit -n" be at least 64. This seems like pretty clearly a bug fix, but in view of the lack of field complaints, I'll refrain from risking a back-patch. Discussion: https://postgr.es/m/E1izCmM-0005pV-Co@gemulon.postgresql.org
* Factor out InitControlFile() from BootStrapXLOG()Peter Eisentraut2020-02-22
| | | | | | | Right now this only makes BootStrapXLOG() a bit more manageable, but in the future there may be external callers. Discussion: https://www.postgresql.org/message-id/e8f86ba5-48f1-a80a-7f1d-b76bcb9c5c47@2ndquadrant.com
* Reformat code commentPeter Eisentraut2020-02-22
| | | | Discussion: https://www.postgresql.org/message-id/e8f86ba5-48f1-a80a-7f1d-b76bcb9c5c47@2ndquadrant.com
* Fix mesurement of elapsed time during truncating heap in VACUUM.Fujii Masao2020-02-19
| | | | | | | | | | | | | | | | | | | | | | | | | | VACUUM may truncate heap in several batches. The activity report is logged for each batch, and contains the number of pages in the table before and after the truncation, and also the elapsed time during the truncation. Previously the elapsed time reported in each batch was the total elapsed time since starting the truncation until finishing each batch. For example, if the truncation was processed dividing into three batches, the second batch reported the accumulated time elapsed during both first and second batches. This is strange and confusing because the number of pages in the table reported together is not total. Instead, each batch should report the time elapsed during only that batch. The cause of this issue was that the resource usage snapshot was initialized only at the beginning of the truncation and was never reset later. This commit fixes the issue by changing VACUUM so that the resource usage snapshot is reset at each batch. Back-patch to all supported branches. Reported-by: Tatsuhito Kasahara Author: Tatsuhito Kasahara Reviewed-by: Masahiko Sawada, Fujii Masao Discussion: https://postgr.es/m/CAP0=ZVJsf=NvQuy+QXQZ7B=ZVLoDV_JzsVC1FRsF1G18i3zMGg@mail.gmail.com
* Remove obsolete _bt_compare() comment.Peter Geoghegan2020-02-18
| | | | | | btbuild() has nothing to say about how NULL values compare in nbtree. Besides, there are _bt_compare() header comments that describe how NULL values are handled.
* Use pg_pwrite() in more places.Thomas Munro2020-02-11
| | | | | | | | This removes some lseek() system calls. Author: Thomas Munro Reviewed-by: Andres Freund Discussion: https://postgr.es/m/CA%2BhUKGJ%2BoHhnvqjn3%3DHro7xu-YDR8FPr0FL6LF35kHRX%3D_bUzg%40mail.gmail.com
* Fix typos.Amit Kapila2020-02-10
| | | | | | Reported-by: Justin Pryzby Author: Justin Pryzby Discussion: https://postgr.es/m/20200206021432.GA24549@telsasoft.com
* Store the deletion horizon XID for a deleted GIN page on the right page.Tom Lane2020-02-09
| | | | | | | | | | | | | | | | | Commit b10714080 moved the GinPageSetDeleteXid() call to a spot where the "page" variable was pointing to the wrong page, causing the XID to be inserted on a page that's not being deleted, thus allowing later GinPageIsRecyclable tests to recycle the deleted page too soon. It might be a good idea to stop using the single "page" variable for multiple purposes in this function. But for the moment I just moved the GinPageSetDeleteXid() call down beside the GinPageSetDeleted() call, which seems like a more logical place for it anyway. Back-patch to v11, as the faulty patch was. (Fortunately, the bug hasn't made it into any release yet.) Discussion: https://postgr.es/m/21620.1581098806@sss.pgh.pa.us
* Force tuple conversion when the source has missing attributes.Andrew Gierth2020-02-05
| | | | | | | | | | | | | | | | | | | | | Tuple conversion incorrectly concluded that no conversion was needed as long as all the attributes lined up. But if the source tuple has a missing attribute (from addition of a column with default), then the destination tupdesc might not reflect the same default. The typical symptom was that the affected columns would be unexpectedly NULL. Repair by always forcing conversion if the source has missing attributes, which will be filled in by the deform operation. (In theory we could optimize for when the destination has the same default, but that seemed overkill.) Backpatch to 11 where missing attributes were added. Per bug #16242. Vik Fearing (discovery, code, testing) and me (analysis, testcase). Discussion: https://postgr.es/m/16242-d1c9fca28445966b@postgresql.org
* Make vacuum buffer counters 64 bits wideAlvaro Herrera2020-02-05
| | | | | | | | | | | | Using 32 bit counters means they can now realistically wrap around when vacuuming extremely large tables. Because they're signed integers, stats printed by vacuum look very odd when they do. We'd love to backpatch this, but refrain because the variables are exported and could cause third-party code to break. Reviewed-by: Julien Rouhaud, Tom Lane, Michael Paquier Discussion: https://postgr.es/m/20200131205926.GA16367@alvherre.pgsql
* Handle lack of DSM slots in parallel btree build, take 2.Thomas Munro2020-02-05
| | | | | | | | | | Commit 74618e77 added a new check intended to fix a bug, but put it in the wrong place so that parallel btree build was always disabled. Do the check after we've actually tried to create a DSM segment. Back-patch to 11, like the earlier commit. Reviewed-by: Peter Geoghegan Discussion: https://postgr.es/m/CAH2-WzmDABkJzrNnvf%2BOULK-_A_j9gkYg_Dz-H62jzNv4eKQTw%40mail.gmail.com
* Optimizations for integer to decimal output.Andrew Gierth2020-02-01
| | | | | | | | | | Using a lookup table of digit pairs reduces the number of divisions needed, and calculating the length upfront saves some work; these ideas are taken from the code previously committed for floats. David Fetter, reviewed by Kyotaro Horiguchi, Tels, and me. Discussion: https://postgr.es/m/20190924052620.GP31596%40fetter.org
* Handle lack of DSM slots in parallel btree build.Thomas Munro2020-01-31
| | | | | | | | | | | | | If no DSM slots are available, a ParallelContext can still be created, but its seg pointer is NULL. Teach parallel btree build to cope with that by falling back to a regular non-parallel build, to avoid crashing with a segmentation fault. Back-patch to 11, where parallel CREATE INDEX landed. Reported-by: Nicola Contu Reviewed-by: Peter Geoghegan Discussion: https://postgr.es/m/CA%2BhUKGJgJEBnkuODBVomyK3MWFvDBbMVj%3Dgdt6DnRPU-5sQ6UQ%40mail.gmail.com
* Clean up newlines following left parenthesesAlvaro Herrera2020-01-30
| | | | | | | | | | | | We used to strategically place newlines after some function call left parentheses to make pgindent move the argument list a few chars to the left, so that the whole line would fit under 80 chars. However, pgindent no longer does that, so the newlines just made the code vertically longer for no reason. Remove those newlines, and reflow some of those lines for some extra naturality. Reviewed-by: Michael Paquier, Tom Lane Discussion: https://postgr.es/m/20200129200401.GA6303@alvherre.pgsql
* Remove excess parens in ereport() callsAlvaro Herrera2020-01-30
| | | | | | | Cosmetic cleanup, not worth backpatching. Discussion: https://postgr.es/m/20200129200401.GA6303@alvherre.pgsql Reviewed-by: Tom Lane, Michael Paquier
* Fail if recovery target is not reachedPeter Eisentraut2020-01-29
| | | | | | | | | | | | Before, if a recovery target is configured, but the archive ended before the target was reached, recovery would end and the server would promote without further notice. That was deemed to be pretty wrong. With this change, if the recovery target is not reached, it is a fatal error. Based-on-patch-by: Leif Gunnar Erlandsen <leif@lako.no> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/993736dd3f1713ec1f63fc3b653839f5@lako.no
* Fix randAccess setting in ReadRecord()Heikki Linnakangas2020-01-28
| | | | | | | Commit 38a957316d got this backwards. Author: Kyotaro Horiguchi Discussion: https://www.postgresql.org/message-id/20200128.194408.2260703306774646445.horikyota.ntt@gmail.com
* Fix compile error on HP C.Thomas Munro2020-01-28
| | | | Per build farm animal anole, after commit 6f38d4dac3.
* Remove dependency on HeapTuple from predicate locking functions.Thomas Munro2020-01-28
| | | | | | | | | | | | | | | | The following changes make the predicate locking functions more generic and suitable for use by future access methods: - PredicateLockTuple() is renamed to PredicateLockTID(). It takes ItemPointer and inserting transaction ID instead of HeapTuple. - CheckForSerializableConflictIn() takes blocknum instead of buffer. - CheckForSerializableConflictOut() no longer takes HeapTuple or buffer. Author: Ashwin Agrawal Reviewed-by: Andres Freund, Kuntal Ghosh, Thomas Munro Discussion: https://postgr.es/m/CALfoeiv0k3hkEb3Oqk%3DziWqtyk2Jys1UOK5hwRBNeANT_yX%2Bng%40mail.gmail.com
* Refactor XLogReadRecord(), adding XLogBeginRead() function.Heikki Linnakangas2020-01-26
| | | | | | | | | | | | | | | | | | | | The signature of XLogReadRecord() required the caller to pass the starting WAL position as argument, or InvalidXLogRecPtr to continue reading at the end of previous record. That's slightly awkward to the callers, as most of them don't want to randomly jump around in the WAL stream, but start reading at one position and then read everything from that point onwards. Remove the 'RecPtr' argument and add a new function XLogBeginRead() to specify the starting position instead. That's more convenient for the callers. Also, xlogreader holds state that is reset when you change the starting position, so having a separate function for doing that feels like a more natural fit. This changes XLogFindNextRecord() function so that it doesn't reset the xlogreader's state to what it was before the call anymore. Instead, it positions the xlogreader to the found record, like XLogBeginRead(). Reviewed-by: Kyotaro Horiguchi, Alvaro Herrera Discussion: https://www.postgresql.org/message-id/5382a7a3-debe-be31-c860-cb810c08f366%40iki.fi
* Clarify some comments in vacuumlazy.cMichael Paquier2020-01-23
| | | | | Author: Justin Pryzby Discussion: https://postgr.es/m/20200113004542.GA26045@telsasoft.com
* Add GUC ignore_invalid_pages.Fujii Masao2020-01-22
| | | | | | | | | | | | | | | Detection of WAL records having references to invalid pages during recovery causes PostgreSQL to raise a PANIC-level error, aborting the recovery. Setting ignore_invalid_pages to on causes the system to ignore those WAL records (but still report a warning), and continue recovery. This behavior may cause crashes, data loss, propagate or hide corruption, or other serious problems. However, it may allow you to get past the PANIC-level error, to finish the recovery, and to cause the server to start up. Author: Fujii Masao Reviewed-by: Michael Paquier Discussion: https://www.postgresql.org/message-id/CAHGQGwHCK6f77yeZD4MHOnN+PaTf6XiJfEB+Ce7SksSHjeAWtg@mail.gmail.com
* Fix the computation of max dead tuples during the vacuum.Amit Kapila2020-01-22
| | | | | | | | | | | | In commit 40d964ec99, we changed the way memory is allocated for dead tuples but forgot to update the place where we compute the maximum number of dead tuples. This could lead to invalid memory requests. Reported-by: Andres Freund Diagnosed-by: Andres Freund Author: Masahiko Sawada Reviewed-by: Amit Kapila and Dilip Kumar Discussion: https://postgr.es/m/20200121060020.e3cr7s7fj5rw4lok@alap3.anarazel.de
* Fix crash in BRIN inclusion op functions, due to missing datum copy.Heikki Linnakangas2020-01-20
| | | | | | | | | | | | | | | | | | | | | | | | | The BRIN add_value() and union() functions need to make a longer-lived copy of the argument, if they want to store it in the BrinValues struct also passed as argument. The functions for the "inclusion operator classes" used with box, range and inet types didn't take into account that the union helper function might return its argument as is, without making a copy. Check for that case, and make a copy if necessary. That case arises at least with the range_union() function, when one of the arguments is an 'empty' range: CREATE TABLE brintest (n numrange); CREATE INDEX brinidx ON brintest USING brin (n); INSERT INTO brintest VALUES ('empty'); INSERT INTO brintest VALUES (numrange(0, 2^1000::numeric)); INSERT INTO brintest VALUES ('(-1, 0)'); SELECT brin_desummarize_range('brinidx', 0); SELECT brin_summarize_range('brinidx', 0); Backpatch down to 9.5, where BRIN was introduced. Discussion: https://www.postgresql.org/message-id/e6e1d6eb-0a67-36aa-e779-bcca59167c14%40iki.fi Reviewed-by: Emre Hasegeli, Tom Lane, Alvaro Herrera
* Allow vacuum command to process indexes in parallel.Amit Kapila2020-01-20
| | | | | | | | | | | | | | | | | | | | | | | | | | This feature allows the vacuum to leverage multiple CPUs in order to process indexes. This enables us to perform index vacuuming and index cleanup with background workers. This adds a PARALLEL option to VACUUM command where the user can specify the number of workers that can be used to perform the command which is limited by the number of indexes on a table. Specifying zero as a number of workers will disable parallelism. This option can't be used with the FULL option. Each index is processed by at most one vacuum process. Therefore parallel vacuum can be used when the table has at least two indexes. The parallel degree is either specified by the user or determined based on the number of indexes that the table has, and further limited by max_parallel_maintenance_workers. The index can participate in parallel vacuum iff it's size is greater than min_parallel_index_scan_size. Author: Masahiko Sawada and Amit Kapila Reviewed-by: Dilip Kumar, Amit Kapila, Robert Haas, Tomas Vondra, Mahendra Singh and Sergei Kornilov Tested-by: Mahendra Singh and Prabhat Sahu Discussion: https://postgr.es/m/CAD21AoDTPMgzSkV4E3SFo1CH_x50bf5PqZFQf4jmqjk-C03BWg@mail.gmail.com https://postgr.es/m/CAA4eK1J-VoR9gzS5E75pcD-OH0mEyCdp8RihcwKrcuw7J-Q0+w@mail.gmail.com
* Avoid full scan of GIN indexes when possibleAlexander Korotkov2020-01-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The strategy of GIN index scan is driven by opclass-specific extract_query method. This method that needed search mode is GIN_SEARCH_MODE_ALL. This mode means that matching tuple may contain none of extracted entries. Simple example is '!term' tsquery, which doesn't need any term to exist in matching tsvector. In order to handle such scan key GIN calculates virtual entry, which contains all TIDs of all entries of attribute. In fact this is full scan of index attribute. And typically this is very slow, but allows to handle some queries correctly in GIN. However, current algorithm calculate such virtual entry for each GIN_SEARCH_MODE_ALL scan key even if they are multiple for the same attribute. This is clearly not optimal. This commit improves the situation by introduction of "exclude only" scan keys. Such scan keys are not capable to return set of matching TIDs. Instead, they are capable only to filter TIDs produced by normal scan keys. Therefore, each attribute should contain at least one normal scan key, while rest of them may be "exclude only" if search mode is GIN_SEARCH_MODE_ALL. The same optimization might be applied to the whole scan, not per-attribute. But that leads to NULL values elimination problem. There is trade-off between multiple possible ways to do this. We probably want to do this later using some cost-based decision algorithm. Discussion: https://postgr.es/m/CAOBaU_YGP5-BEt5Cc0%3DzMve92vocPzD%2BXiZgiZs1kjY0cj%3DXBg%40mail.gmail.com Author: Nikita Glukhov, Alexander Korotkov, Tom Lane, Julien Rouhaud Reviewed-by: Julien Rouhaud, Tomas Vondra, Tom Lane
* Introduce IndexAM fields for parallel vacuum.Amit Kapila2020-01-15
| | | | | | | | | | | | | | | | | Introduce new fields amusemaintenanceworkmem and amparallelvacuumoptions in IndexAmRoutine for parallel vacuum. The amusemaintenanceworkmem tells whether a particular IndexAM uses maintenance_work_mem or not. This will help in controlling the memory used by individual workers as otherwise, each worker can consume memory equal to maintenance_work_mem. The amparallelvacuumoptions tell whether a particular IndexAM participates in a parallel vacuum and if so in which phase (bulkdelete, vacuumcleanup) of vacuum. Author: Masahiko Sawada and Amit Kapila Reviewed-by: Dilip Kumar, Amit Kapila, Tomas Vondra and Robert Haas Discussion: https://postgr.es/m/CAD21AoDTPMgzSkV4E3SFo1CH_x50bf5PqZFQf4jmqjk-C03BWg@mail.gmail.com https://postgr.es/m/CAA4eK1LmcD5aPogzwim5Nn58Ki+74a6Edghx4Wd8hAskvHaq5A@mail.gmail.com
* Fix comment in heapam.cMichael Paquier2020-01-13
| | | | | | | Improvement per suggestion from Tom Lane. Author: Daniel Gustafsson Discussion: https://postgr.es/m/FED18699-4270-4778-8DA8-10F119A5ECF3@yesql.se
* Delete empty pages in each pass during GIST VACUUM.Amit Kapila2020-01-13
| | | | | | | | | | | | | | | | | | | | | | | Earlier, we use to postpone deleting empty pages till the second stage of vacuum to amortize the cost of scanning internal pages. However, that can sometimes (say vacuum is canceled or errored between first and second stage) delay the pages to be recycled. Another thing is that to facilitate deleting empty pages in the second stage, we need to share the information about internal and empty pages between different stages of vacuum. It will be quite tricky to share this information via DSM which is required for the upcoming parallel vacuum patch. Also, it will bring the logic to reclaim deleted pages closer to nbtree where we delete empty pages in each pass. Overall, the advantages of deleting empty pages in each pass outweigh the advantages of postponing the same. Author: Dilip Kumar, with changes by Amit Kapila Reviewed-by: Sawada Masahiko and Amit Kapila Discussion: https://postgr.es/m/CAA4eK1LGr+MN0xHZpJ2dfS8QNQ1a_aROKowZB+MPNep8FVtwAA@mail.gmail.com
* Reimplement nullification of walsender timestampAlvaro Herrera2020-01-08
| | | | | | | | | | | | | | | Make the value null only at pg_stat_activity-output time, as suggested by Tom Lane, instead of messing with the internal state. This should appease buildfarm members with force_parallel_mode=regress, which are running parallel queries on logical replication walsenders. The fact that walsenders can run parallel queries should perhaps be studied more carefully, but for the moment let's get rid of the red blots in buildfarm. Backpatch to pg10, like the previous commit. Discussion: https://postgr.es/m/30804.1578438763@sss.pgh.pa.us
* pg_stat_activity: show NULL stmt start time for walsendersAlvaro Herrera2020-01-07
| | | | | | | | | | | | | Returning a non-NULL time is pointless, sinc a walsender is not a process that would be running normal transactions anyway, but the code was unintentionally exposing the process start time intermittently, which was not only bogus but it also confused monitoring systems looking for idle transactions. Fix by avoiding all updates in walsenders. Backpatch to 11, where walsenders started appearing in pg_stat_activity. Reported-by: Tomas Vondra Discussion: https://postgr.es/m/20191209234409.exe7osmyalwkt5j4@development
* tableam: New callback relation_fetch_toast_slice.Robert Haas2020-01-07
| | | | | | | | | | | | | | | | Instead of always calling heap_fetch_toast_slice during detoasting, invoke a table AM callback which, when the toast table is a heap table, will be heap_fetch_toast_slice. This makes it possible for a table AM other than heap to be used as a TOAST table. It also completes the series of commits intended to improve the interaction of tableam with TOAST that began with commit 8b94dab06617ef80a0901ab103ebd8754427ef5a; detoast.c is now, hopefully, fully AM-independent. Patch by me, reviewed by Andres Freund and Peter Eisentraut. Discussion: http://postgr.es/m/CA+TgmoZv-=2iWM4jcw5ZhJeL18HF96+W1yJeYrnGMYdkFFnEpQ@mail.gmail.com
* tableam: Allow choice of toast AM.Robert Haas2020-01-07
| | | | | | | | | | | | Previously, the toast table had to be implemented by the same AM that was used for the main table, which was bad, because the detoasting code won't work with anything but heap. This commit doesn't fix the latter problem, although there's another patch coming which does, but it does let you pick something that works (i.e. heap, right now). Patch by me, reviewed by Andres Freund. Discussion: http://postgr.es/m/CA+TgmoZv-=2iWM4jcw5ZhJeL18HF96+W1yJeYrnGMYdkFFnEpQ@mail.gmail.com
* Remove redundant incomplete split assertion.Peter Geoghegan2020-01-05
| | | | | | | The fastpath insert optimization's incomplete split flag Assert() is redundant. We'll reach the more general Assert() within _bt_findinsertloc() in all cases. (Besides, Assert()'ing that the rightmost page doesn't have the flag set never made much sense.)
* Add xl_btree_delete optimization.Peter Geoghegan2020-01-03
| | | | | | | | | | | | | | | | | | | | | | | | | | Commit 558a9165e08 taught _bt_delitems_delete() to produce its own XID horizon on the primary. Standbys no longer needed to generate their own latestRemovedXid, since they could just use the explicitly logged value from the primary instead. The deleted offset numbers array from the xl_btree_delete WAL record was no longer used by the REDO routine for anything other than deleting the items. This enables a minor optimization: We now treat the array as buffer state, not generic WAL data, following _bt_delitems_vacuum()'s example. This should be a minor win, since it allows us to avoid including the deleted items array in cases where XLogInsert() stores the whole buffer anyway. The primary goal here is to make the code more maintainable, though. Removing inessential differences between the two functions highlights the fundamental differences that remain. Also change xl_btree_delete to use uint32 for the size of the array of item offsets being deleted. This brings xl_btree_delete closer to xl_btree_vacuum. Furthermore, it seems like a good idea to use an explicit-width integer type (the field was previously an "int"). Bump XLOG_PAGE_MAGIC because xl_btree_delete changed. Discussion: https://postgr.es/m/CAH2-Wzkz4TjmezzfAbaV1zYrh=fr0bCpzuJTvBe5iUQ3aUPsCQ@mail.gmail.com
* Clear up btree_xlog_split() alignment comment.Peter Geoghegan2020-01-02
| | | | | | | Adjust a comment that describes how alignment of the new left page high key works in btree_xlog_split(), the nbtree page split REDO routine. The wording used before commit 2c03216d831 is much clearer, so go back to that.
* Correct _bt_delitems_vacuum() lock comments.Peter Geoghegan2020-01-02
| | | | | The expectation within _bt_delitems_vacuum() is that caller has a super-exclusive/cleanup buffer lock (not just a pin and a write lock).
* Revise BTP_HAS_GARBAGE nbtree VACUUM comments.Peter Geoghegan2020-01-01
| | | | | | | | | | | | | | | | | | | | | | | | _bt_delitems_vacuum() comments claimed that it isn't worth another scan of the page to avoid falsely unsetting the BTP_HAS_GARBAGE page flag hint (this happens to be the same wording that was removed from _bt_delitems_delete() by my recent commit fe97c61c). The comments made little sense, though. The issue can't have much to do with performing a second scan of the target leaf page, since an LP_DEAD test could easily be performed in the first scan of the page anyway (the scan that takes place in btvacuumpage() caller). Revise the explanation. It makes much more sense to frame this as an issue about recovery conflicts. _bt_delitems_vacuum() cannot easily generate an XID cutoff in the same way that _bt_delitems_delete() is designed to. Falsely unsetting the page flag is not ideal, and is likely to happen more often than was supposed by the original comments. Explain why it usually isn't a problem in practice. There may be an argument for _bt_delitems_vacuum() not clearing the BTP_HAS_GARBAGE bit, removing the question of it being falsely unset by VACUUM (there may even be an argument for not using a page level hint at all). This can be revisited later.
* Update btree_xlog_delete() comments.Peter Geoghegan2020-01-01
| | | | | | | | | | | Commit fe97c61c updated LP_DEAD item deletion comments, but missed a minor discrepancy on the REDO side. Fix it now. In passing, don't talk about the btree_xlog_vacuum() behavior within btree_xlog_delete(). The reliance on XLOG_HEAP2_CLEANUP_INFO records for recovery conflicts is already discussed within btvacuumpage() and mentioned again in passing above btree_xlog_vacuum(), which seems sufficient.
* Update copyrights for 2020Bruce Momjian2020-01-01
| | | | Backpatch-through: update all files in master, backpatch legal files through 9.4
* Revert "Rename files and headers related to index AM"Michael Paquier2019-12-27
| | | | | | | | This follows multiple complains from Peter Geoghegan, Andres Freund and Alvaro Herrera that this issue ought to be dug more before actually happening, if it happens. Discussion: https://postgr.es/m/20191226144606.GA5659@alvherre.pgsql
* Refactor code dedicated to index vacuuming in vacuumlazy.cMichael Paquier2019-12-26
| | | | | | | | | | | | The part in charge of doing the vacuum on all the indexes of a relation was duplicated, with the same handling for progress reporting done. While on it, update the progress reporting for heap vacuuming in the subroutine doing the actual work, keeping the status update local. This way, any future caller of lazy_vacuum_heap() does not have to worry about doing any progress reporting update. Author: Justin Pryzby, Michael Paquier Discussion: https://postgr.es/m/20191120210600.GC30362@telsasoft.com
* Rename files and headers related to index AMMichael Paquier2019-12-25
| | | | | | | | | | | | | | | | | | | | | The following renaming is done so as source files related to index access methods are more consistent with table access methods (the original names used for index AMs ware too generic, and could be confused as including features related to table AMs): - amapi.h -> indexam.h. - amapi.c -> indexamapi.c. Here we have an equivalent with backend/access/table/tableamapi.c. - amvalidate.c -> indexamvalidate.c. - amvalidate.h -> indexamvalidate.h. - genam.c -> indexgenam.c. - genam.h -> indexgenam.h. This has been discussed during the development of v12 when table AM was worked on, but the renaming never happened. Author: Michael Paquier Reviewed-by: Fabien Coelho, Julien Rouhaud Discussion: https://postgr.es/m/20191223053434.GF34339@paquier.xyz
* Avoid splitting C string literals with \-newlineAlvaro Herrera2019-12-24
| | | | | | | | | | | Using \ is unnecessary and ugly, so remove that. While at it, stitch the literals back into a single line: we've long discouraged splitting error message literals even when they go past the 80 chars line limit, to improve greppability. Leave contrib/tablefunc alone. Discussion: https://postgr.es/m/20191223195156.GA12271@alvherre.pgsql
* Update nbtree LP_DEAD item deletion comments.Peter Geoghegan2019-12-22
| | | | | | | | | | | | | | | | Comments about the consequences of clearing the BTP_HAS_GARBAGE page flag bit that apply only to VACUUM were added to code that deals with opportunistic deletion of LP_DEAD items by commit a760893d. The same comment block was added to both _bt_delitems_vacuum() and _bt_delitems_delete(). Correct _bt_delitems_delete()'s copy of the comment block. _bt_delitems_delete() reliably deletes items that were found by caller to have their LP_DEAD bit set. There is no question about whether or not unsetting the BTP_HAS_GARBAGE bit can miss some LP_DEAD items that were set recently. Also tweak a related section of the nbtree README.
* Remove unneeded "pin scan" nbtree VACUUM code.Peter Geoghegan2019-12-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The REDO routine for nbtree's xl_btree_vacuum record type hasn't performed a "pin scan" since commit 3e4b7d87 went in, so clearly there isn't any point in VACUUM WAL-logging information that won't actually be used. Finish off the work of commit 3e4b7d87 (and the closely related preceding commit 687f2cd7) by removing the code that generates this unused information. Also remove the REDO routine code disabled by commit 3e4b7d87. Replace the unneeded lastBlockVacuumed field in xl_btree_vacuum with a new "ndeleted" field. The new field isn't actually needed right now, since we could continue to infer the array length from the overall record length. However, an upcoming patch to add deduplication to nbtree needs to add an "items updated" field to xl_btree_vacuum, so we might as well start being explicit about the number of items now. (Besides, it doesn't seem like a good idea to leave the xl_btree_vacuum struct without any fields; the C standard says that that's undefined.) nbtree VACUUM no longer forces writing a WAL record for the last block in the index. Writing out a WAL record with no items for the final block was supposed to force processing of a lastBlockVacuumed field by a pin scan. Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed. Discussion: https://postgr.es/m/CAH2-WzmY_mT7UnTzFB5LBQDBkKpdV5UxP3B5bLb7uP%3D%3D6UQJRQ%40mail.gmail.com
* revert: Remove meaningless assignments in nbtree codeBruce Momjian2019-12-19
| | | | | | | | | | Reverts commit 05684c8255. Reported-by: Tom Lane Discussion: https://postgr.es/m/404.1576770942@sss.pgh.pa.us Backpatch-through: master