collectMatchBitmap() needs to re-find the index tuple it was previously
looking at, after transiently dropping lock on the index page it's on.
The tuple should still exist and be at its prior position or somewhere
to the right of that, since ginvacuum never removes tuples but
concurrent insertions could add one. However, there was a thinko in
that logic, to the effect of expecting any inserted tuples to have the
same index "attnum" as what we'd been scanning. Since there's no
physical separation of tuples with different attnums, it's not terribly
hard to devise scenarios where this fails, leading to transient "lost
saved point in index" errors. (While I've duplicated this with manual
testing, it seems impossible to make a reproducible test case with our
available testing technology.)
Fix by just continuing the scan when the attnum doesn't match.
While here, improve the error message used if we do fail, so that it
matches the wording used in btree for a similar case.
collectMatchBitmap()'s posting-tree code path was previously not
exercised at all by our regression tests. While I can't make
a regression test that exhibits the bug, I can at least improve
the code coverage here, so do that. The test case I made for this
is an extension of one added by 4b754d6c1, so it only works in
HEAD and v13; didn't seem worth trying hard to back-patch it.
Per bug #16595 from Jesse Kinkead. This has been broken since
multicolumn capability was added to GIN (commit 27cb66fdf),
so back-patch to all supported branches.
Discussion: https://postgr.es/m/16595-633118be8eef9ce2@postgresql.org
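
As an illustration only (not the regression test mentioned above; the table
and column names are made up), this is the kind of multicolumn GIN setup in
which collectMatchBitmap() scans entries for one attribute while concurrent
insertions can add entries for the other attribute to the same entry tree:

    -- hypothetical two-column GIN index; entries for both columns share
    -- one physically unseparated structure
    CREATE TABLE gin_demo (tags text[], labels text[]);
    CREATE INDEX gin_demo_idx ON gin_demo USING gin (tags, labels);

    -- a bitmap scan over one attribute can overlap with another session
    -- inserting rows, which is the window the fix is about
    SELECT count(*) FROM gin_demo WHERE tags @> ARRAY['postgres'];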
We were displaying the wrong phase information for the 'info' message in the
index cleanup phase, because we were switching to the previous phase a bit
too early. We were also not displaying context information for the heap
phase unless the block number was valid; that is fine for error cases, but
for messages at 'info' or lower levels it is inconsistent with the index
phase information.
Reported-by: Sawada Masahiko
Author: Sawada Masahiko
Reviewed-by: Amit Kapila
Backpatch-through: 13, where it was introduced
Discussion: https://postgr.es/m/CA+fd4k4HcbhPnCs7paRTw1K-AHin8y4xKomB9Ru0ATw0UeTy2w@mail.gmail.com
The SimpleLruTruncate() header comment states the new coding rule. To
achieve this, add locktype "frozenid" and two LWLocks. This closes a
rare opportunity for data loss, which manifested as "apparent
wraparound" or "could not access status of transaction" errors. Data
loss is more likely in pg_multixact, due to released branches' thin
margin between multiStopLimit and multiWrapLimit. If a user's physical
replication primary logged ": apparent wraparound" messages, the user
should rebuild standbys of that primary regardless of symptoms. At less
risk is a cluster having emitted "not accepting commands" errors or
"must be vacuumed" warnings at some point. One can test a cluster for
this data loss by running VACUUM FREEZE in every database. Back-patch
to 9.5 (all supported versions).
Discussion: https://postgr.es/m/20190218073103.GA1434723@rfd.leadboat.com
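
The suggested check is ordinary SQL run in every database of the cluster (the
vacuumdb client's --all --freeze options cover them all in one invocation);
this is only a sketch of that procedure, not part of the fix itself:

    -- run once per database; affected clusters surface errors such as
    -- "could not access status of transaction ..."
    VACUUM FREEZE;

    -- the new lock type added by this commit shows up in pg_locks as
    -- locktype 'frozenid'
    SELECT locktype, pid, granted FROM pg_locks WHERE locktype = 'frozenid';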
When a table is scanned by heapam_index_build_range_scan (née
IndexBuildHeapScan) and the table lock being held allows concurrent data
changes, it is possible for new HOT chains to sprout in a page, chains that
were unknown when that page was first scanned.  This leads to an error such
as
ERROR: failed to find parent tuple for heap-only tuple at (X,Y) in table "tbl"
because the root tuple was not present when we first obtained the list
of the page's root tuples. This can be fixed by re-obtaining the list
of root tuples, if we see that a heap-only tuple appears to point to a
non-existing root.
This was reported by Anastasia as occurring for BRIN summarization
(which exists since 9.5), but I think it could theoretically also happen
with CREATE INDEX CONCURRENTLY (much older) or REINDEX CONCURRENTLY
(very recent). It seems a happy coincidence that BRIN forces us to
backpatch this all the way to 9.5.
Reported-by: Anastasia Lubennikova <a.lubennikova@postgrespro.ru>
Diagnosed-by: Anastasia Lubennikova <a.lubennikova@postgrespro.ru>
Co-authored-by: Anastasia Lubennikova <a.lubennikova@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/602d8487-f0b2-5486-0088-0f372b2549fa@postgrespro.ru
Backpatch: 9.5 - master
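
For context, these are the kinds of user-level operations that reach
heapam_index_build_range_scan() while the table lock still allows concurrent
data changes; the object names below are illustrative only:

    -- BRIN summarization of newly filled page ranges (exists since 9.5)
    SELECT brin_summarize_new_values('some_brin_index');

    -- concurrent index builds scan the heap without blocking writers
    CREATE INDEX CONCURRENTLY some_idx ON some_table (some_col);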
If a page range is desummarized at just the right time concurrently with
an index walk, BRIN would raise an error indicating index corruption.
This is scary and unhelpful; silently returning that the page range is
not summarized is sufficient reaction.
This bug was introduced by commit 975ad4e602ff as additional protection
against a bug whose actual fix was elsewhere. Backpatch equally.
Reported-By: Anastasia Lubennikova <a.lubennikova@postgrespro.ru>
Diagnosed-By: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/2588667e-d07d-7e10-74e2-7e1e46194491@postgrespro.ru
Backpatch: 9.5 - master
max_slot_wal_keep_size, which was added in v13, and wal_keep_segments are
the GUC parameters that specify how much WAL to retain for
standby servers. While max_slot_wal_keep_size accepts a number of
bytes of WAL, wal_keep_segments accepts a number of WAL files.
This difference in units between two similar parameters could
be confusing to users.
To alleviate this, this commit renames wal_keep_segments to
wal_keep_size, and makes users specify the amount of WAL as a size instead
of a number of WAL files.
In the discussion there was also the idea of renaming max_slot_wal_keep_size
to max_slot_wal_keep_segments instead. But we have been moving away from
measuring in segments; for example, checkpoint_segments was replaced by
max_wal_size. So we concluded that renaming wal_keep_segments to
wal_keep_size was the better choice.
Back-patch to v13 where max_slot_wal_keep_size was added.
Author: Fujii Masao
Reviewed-by: Álvaro Herrera, Kyotaro Horiguchi, David Steele
Discussion: https://postgr.es/m/574b4ea3-e0f9-b175-ead2-ebea7faea855@oss.nttdata.com
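
A sketch of the user-visible difference, with arbitrary example values; the
old parameter counted 16MB segments, while the new one takes a size:

    -- previously: wal_keep_segments = 64   (a number of segments)
    -- v13 and later: an amount of WAL, accepted with ordinary size units
    ALTER SYSTEM SET wal_keep_size = '1GB';
    SELECT pg_reload_conf();
    SHOW wal_keep_size;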
There is no advantage to attempting deduplication for a unique index
during CREATE INDEX, since there cannot possibly be any duplicates.
Doing so wastes cycles due to unnecessary copying. Make sure that we
avoid it consistently.
We already avoided unique index deduplication in the case where there
were some spool2 tuples to merge. That didn't account for the fact that
spool2 is removed early/unset in the common case where it has no tuples
that need to be merged (i.e. it failed to account for the "spool2 turns
out to be unnecessary" optimization in _bt_spools_heapscan()).
Oversight in commit 0d861bbb, which added nbtree deduplication.
Backpatch: 13-, where nbtree deduplication was introduced.
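
A brief illustration, using made-up names, of where deduplication can and
cannot apply at CREATE INDEX time; the deduplicate_items storage parameter
shown here is the v13 knob for nbtree deduplication:

    -- duplicates are possible, so CREATE INDEX may merge them into
    -- posting list tuples
    CREATE INDEX ordinary_idx ON some_table (some_col)
        WITH (deduplicate_items = on);

    -- a unique index cannot contain duplicates, so there is nothing to
    -- deduplicate while building it
    CREATE UNIQUE INDEX unique_idx ON some_table (id);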
Incorrect function names were referenced.  As this fixes some portions
of tableam.h, which is mentioned in the docs as something to look at when
implementing a table AM, backpatch down to 12, where this was
introduced.
Author: Hironobu Suzuki
Discussion: https://postgr.es/m/8fe6d672-28dd-3f1d-7aed-ac2f6d599d3f@interdb.jp
Backpatch-through: 12
Remove the previous hack in KeepLogSeg that added a case to deal with a
(badly represented) invalid segment number.  This was added for the sake
of GetWALAvailability.  But it's not needed if, in that function, we
initialize the segment number to be retreated to the segment currently
being written, so do that instead.
Per valgrind-running buildfarm member skink, and some sparc64 animals.
Discussion: https://postgr.es/m/1724648.1594230917@sss.pgh.pa.us
This message was being emitted on the grounds that only crashed
summarization could cause it, but in reality even an aborted vacuum
could do it ... which makes it way too noisy, particularly since it
shows up in regression tests and makes them die.
Reported by Tom Lane.
Discussion: https://postgr.es/m/489091.1593534251@sss.pgh.pa.us
Since slot_keep_segs indicates a number of WAL segments, not an LSN,
its datatype should not be XLogRecPtr.
Back-patch to v13 where this issue was added.
Reported-by: Atsushi Torikoshi
Author: Atsushi Torikoshi, tweaked by Fujii Masao
Discussion: https://postgr.es/m/ebd0d674f3e050222238a960cac5251a@oss.nttdata.com
The previous definition of the column was almost universally disliked,
so provide this updated definition which is more useful for monitoring
purposes: a large positive value is good, while zero or a negative value
means danger. This should be operationally more convenient.
Backpatch to 13, where the new column in pg_replication_slots (and the
feature it represents) was added.
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reported-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Discussion: https://postgr.es/m/9ddfbf8c-2f67-904d-44ed-cf8bc5916228@oss.nttdata.com
Author: Vignesh C
Reviewed-by: Sawada Masahiko
Backpatch-through: 13, where it was introduced
Discussion: https://postgr.es/m/CALDaNm3Ppt71NafGY5mk3V2i3Q+mm93pVibDq-0NpW7WU67Jcg@mail.gmail.com
Use separate functions to save and restore error context information, as
that makes the code easier to understand.  Also, make it clear that the
index information required for error context is sane.
Author: Andres Freund, Justin Pryzby, Amit Kapila
Backpatch-through: 13, where it was introduced
Discussion: https://postgr.es/m/CAA4eK1LWo+v1OWu=Sky27GTGSCuOmr7iaURNbc5xz6jO+SaPeA@mail.gmail.com
Commit 0d861bbb, which added deduplication to nbtree, had
_bt_check_unique() pass a TID to table_index_fetch_tuple_check() that
isn't safe to mutate. table_index_fetch_tuple_check()'s tid argument is
modified when the TID in question is not the latest visible tuple in a
hot chain, though this wasn't documented.
To fix, go back to using a local copy of the TID in _bt_check_unique(),
and update comments above table_index_fetch_tuple_check().
Backpatch: 13-, where B-Tree deduplication was introduced.
In pg_replication_slots, change the output from normal/reserved/lost to
reserved/extended/unreserved/lost, which better expresses the possible
states, particularly near the time when segments are no longer safe but
checkpoint has not run yet.
Under the new definition, reserved means the slot is consuming WAL
that's still under the normal WAL size constraints; extended means it's
consuming WAL that's being protected by wal_keep_segments or by the slot
itself, whose size is below max_slot_wal_keep_size; unreserved means the
WAL is no longer safe, but checkpoint has not yet removed those files.
Such a slot is in imminent danger, but can still continue for a little
while and may catch up to the reserved WAL space.
Also, there were some bugs in the calculations used to report the
status; fixed those.
Backpatch to 13.
Reported-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20200616.120236.1809496990963386593.horikyota.ntt@gmail.com
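
A quick way to observe the reported state on a v13 server, assuming at least
one replication slot exists:

    -- wal_status is one of: reserved, extended, unreserved, lost
    SELECT slot_name, wal_status, safe_wal_size
      FROM pg_replication_slots;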
The current definition is dangerous. No bugs exist in our code at
present, but backpatch to 11 nonetheless where it was introduced.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
spg_mask() didn't take into account that pd_lower equal to SizeOfPageHeaderData
is still a valid value.  This commit fixes that.  Backpatch to 11, where
the spg_mask() pd_lower check was introduced.
Reported-by: Michael Paquier
Discussion: https://postgr.es/m/20200615131405.GM52676%40paquier.xyz
Backpatch-through: 11
It was possible for deduplication's single value strategy to mistakenly
believe that a very small duplicate tuple counts as one of the six large
tuples that it aims to leave behind after the page finally splits. This
could cause slightly suboptimal space utilization with very low
cardinality indexes, though only under fairly narrow conditions.
To fix, be particular about what kind of tuple counts as a
maxpostingsize-capped tuple. This avoids confusion in the event of a
small tuple that gets "wedged" between two large tuples, where all
tuples on the page are duplicates of the same value.
Discussion: https://postgr.es/m/CAH2-Wz=Y+sgSFc-O3LpiZX-POx2bC+okec2KafERHuzdVa7-rQ@mail.gmail.com
Backpatch: 13-, where deduplication was introduced (by commit 0d861bbb)
Convert buffile.c error handling to use ereport. This fixes cases where
I/O errors were indistinguishable from EOF or not reported. Also remove
"%m" from error messages where errno would be bogus. While we're
modifying those strings, add block numbers and short read byte counts
where appropriate.
Back-patch to all supported releases.
Reported-by: Amit Khandekar <amitdkhan.pg@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Ibrar Ahmed <ibrar.ahmad@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CA%2BhUKGJE04G%3D8TLK0DLypT_27D9dR8F1RQgNp0jK6qR0tZGWOw%40mail.gmail.com
Reported-By: Tom Lane
Discussion: https://postgr.es/m/841649.1592065060@sss.pgh.pa.us
Rewrite the documentation of these functions, in light of recent bug fix
commit 5940ffb2.
Back-patch to 13 where the check-for-conflict-out code was split up into
AM-specific and generic parts, and new documentation was added that now
looked wrong.
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/db7b729d-0226-d162-a126-8a8ab2dc4443%40jepsen.io
SSI's HeapCheckForSerializableConflictOut() test failed to correctly
handle conditions involving a concurrently inserted tuple which is later
concurrently updated by a separate transaction.  A SELECT statement
that called HeapCheckForSerializableConflictOut() could end up using the
same XID (updater's XID) for both the original tuple, and the successor
tuple, missing the XID of the xact that created the original tuple
entirely. This only happened when neither tuple from the chain was
visible to the transaction's MVCC snapshot.
The observable symptoms of this bug were subtle. A pair of transactions
could commit, with the later transaction failing to observe the effects
of the earlier transaction (because of the confusion created by the
update to the non-visible row). This bug dates all the way back to
commit dafaa3ef, which added SSI.
To fix, make sure that we check the xmin of concurrently inserted tuples
that happen to also have been updated concurrently.
Author: Peter Geoghegan
Reported-By: Kyle Kingsbury
Reviewed-By: Thomas Munro
Discussion: https://postgr.es/m/db7b729d-0226-d162-a126-8a8ab2dc4443@jepsen.io
Backpatch: All supported versions
The redo routines for XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} must acquire
ControlFileLock before modifying ControlFile->checkPointCopy, or the
checkpointer could write out a control file with a bad checksum.
Likewise, XLogReportParameters() must acquire ControlFileLock before
modifying ControlFile and calling UpdateControlFile().
Back-patch to all supported releases.
Author: Nathan Bossart <bossartn@amazon.com>
Author: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/70BF24D6-DC51-443F-B55A-95735803842A%40amazon.com
Since database connections can be used with WAL senders in 9.4, it is
possible to use physical replication. This commit fixes a crash when
starting physical replication with a WAL sender using a database
connection, caused by the refactoring done in 850196b.
There have been discussions about forbidding the use of physical
replication in a database connection, but this is left for later,
taking care only of the crash new to 13.
While on it, add a test to check for a failure when attempting logical
replication if the WAL sender does not have a database connection. This
part is extracted from a larger patch by Kyotaro Horiguchi.
Reported-by: Vladimir Sitnikov
Author: Michael Paquier, Kyotaro Horiguchi
Reviewed-by: Kyotaro Horiguchi, Álvaro Herrera
Discussion: https://postgr.es/m/CAB=Je-GOWMj1PTPkeUhjqQp-4W3=nW-pXe2Hjax6rJFffB5_Aw@mail.gmail.com
Backpatch-through: 13
Commit 624686abcf8 added an assertion that verified that _bt_search
successfully relocated the leaf page undergoing deletion. Page deletion
cannot deal with the case where the descent stack is to the right of the
page, so this seemed critical (deletion can only handle the case where
the descent stack is to the left of the leaf/target page). However, the
assertion went a bit too far.
Since only a buffer pin is held on the leaf page throughout the call to
_bt_search, nothing guarantees that it can't have split during this
small window.  And if it does actually split, _bt_search may end up
"relocating" a page to the right of the original target leaf page. This
scenario seems extremely unlikely, but it must still be considered.
Remove the assertion, and document how we cope in this scenario.
Synchronize the event names for parallel hash join waits with other
event names, by getting rid of the slashes and dropping "-ing"
suffixes. Rename ClogGroupUpdate to XactGroupUpdate, to match the
new SLRU name. Move the ProcSignalBarrier event to the IPC category;
it doesn't belong under IO.
Also a bit more wordsmithing in the wait event documentation tables.
Discussion: https://postgr.es/m/4505.1589640417@sss.pgh.pa.us
This is just to provide a clean basis for comparison of the results
of the new version. I did fix a typo that crept into 242dfcbaf.
Discussion: https://postgr.es/m/20200114221814.GA19630@alvherre.pgsql
Choose names that fit into the conventions for wait event names
(particularly, that multi-word names are in the style MultiWordName)
and hopefully convey more information to non-hacker users than the
previous names did.
Also rename SerializablePredicateLockListLock to
SerializablePredicateListLock; the old name was long enough to cause
table formatting problems, plus the double occurrence of "Lock" seems
confusing/error-prone.
Also change a couple of particularly opaque LWLock field names.
Discussion: https://postgr.es/m/28683.1589405363@sss.pgh.pa.us
_bt_killitems marks btree items dead when a scan leaves the page where
they live, but it does so with only share lock (to improve concurrency).
This was historically okay, since killing a dead item has no
consequences. However, with the advent of data checksums and
wal_log_hints, this action incurs a WAL full-page-image record of the
page. Multiple concurrent processes would write the same page several
times, leading to WAL bloat. The probability of this happening can be
reduced by only killing items if they're not already dead, so change the
code to do that.
The problem could be eliminated completely by having _bt_killitems upgrade
to exclusive lock upon seeing a killable item, but that would reduce
concurrency so it's considered a cure worse than the disease.
Backpatch all the way back to 9.5, since wal_log_hints was introduced in
9.4.
Author: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
Discussion: https://postgr.es/m/CA+fd4k6PeRj2CkzapWNrERkja5G0-6D-YQiKfbukJV+qZGFZ_Q@mail.gmail.com
Originally, the names assigned to SLRUs had no purpose other than
being shmem lookup keys, so not a lot of thought went into them.
As of v13, though, we're exposing them in the pg_stat_slru view and
the pg_stat_reset_slru function, so it seems advisable to take a bit
more care. Rename them to names based on the associated on-disk
storage directories (which fortunately we *did* think about, to some
extent; since those are also visible to DBAs, consistency seems like
a good thing). Also rename the associated LWLocks, since those names
are likewise user-exposed now as wait event names.
For the most part I only touched symbols used in the respective modules'
SimpleLruInit() calls, not the names of other related objects. This
renaming could have been taken further, and maybe someday we will do so.
But for now it seems undesirable to change the names of any globally
visible functions or structs, so some inconsistency is unavoidable.
(But I *did* terminate "oldserxid" with prejudice, as I found that
name both unreadable and not descriptive of the SLRU's contents.)
Table 27.12 needs re-alphabetization now, but I'll leave that till
after the other LWLock renamings I have in mind.
Discussion: https://postgr.es/m/28683.1589405363@sss.pgh.pa.us
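
The renamed user-facing surface can be inspected directly; a sketch on a v13
server (the SLRU names shown are the post-rename, directory-based ones):

    -- per-SLRU statistics, keyed by the new names ("Xact", "MultiXactOffset", ...)
    SELECT name, blks_read, blks_hit, flushes, truncates
      FROM pg_stat_slru;

    -- reset counters for a single SLRU, or pass NULL to reset them all
    SELECT pg_stat_reset_slru('Xact');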
Includes some manual cleanup of places that pgindent messed up,
most of which weren't per project style anyway.
Notably, it seems some people didn't absorb the style rules of
commit c9d297751, because there were a bunch of new occurrences
of function calls with a newline just after the left paren, all
with faulty expectations about how the rest of the call would get
indented.
There is little point in using the LWLockRegisterTranche mechanism for
built-in tranche names. It wastes cycles, it creates opportunities for
bugs (since failing to register a tranche name is a very hard-to-detect
problem), and the lack of any centralized list of names encourages
sloppy nonconformity in name choices. Moreover, since we have a
centralized list of the tranches anyway in enum BuiltinTrancheIds, we're
certainly not buying any flexibility in return for these disadvantages.
Hence, nuke all the backend-internal LWLockRegisterTranche calls,
and instead provide a const array of the builtin tranche names.
(I have in mind to change a bunch of these names shortly, but this
patch is just about getting them into one place.)
Discussion: https://postgr.es/m/9056.1589419765@sss.pgh.pa.us
The additional pain from level 4 is excessive for the gain.
Also revert all the source annotation changes to their original
wordings, to avoid back-patching pain.
Discussion: https://postgr.es/m/31166.1589378554@sss.pgh.pa.us
Instead of re-identifying which statistics bucket to use for a given
SLRU on every counter increment, do it once during shmem initialization.
This saves a fair number of cycles, and there's no real cost because
we could not have a bucket assignment that varies over time or across
backends anyway.
Also, get rid of the ill-considered decision to let pgstat.c pry
directly into SLRU's shared state; it's cleaner just to have slru.c
pass the stats bucket number.
In consequence of these changes, there's no longer any need to store
an SLRU's LWLock tranche info in shared memory, so get rid of that,
making this a net reduction in shmem consumption. (That partly
reverts fe702a7b3.)
This is basically code review for 28cac71bd, so I also cleaned up
some comments, removed a dangling extern declaration, fixed some
things that should be static and/or const, etc.
Discussion: https://postgr.es/m/3618.1589313035@sss.pgh.pa.us
* Have both physical and logical walsender share a 'xlogreader' state
struct for tracking state. This replaces the existing globals sendSeg
and sendCxt.
* Change WALRead not to receive XLogReaderState->seg and ->segcxt as
separate arguments anymore; just use the ones from 'state'. This is
made possible by the above change.
* Have the XLogReader segment_open contract require the callbacks to
install the file descriptor in the state struct themselves instead of
returning it. xlogreader was already ignoring any possible failed
return from the callbacks, relying solely on them never returning.
(This point is not altogether excellent, as it means the callbacks
have to know more of XLogReaderState; but to really improve on that
we would have to pass back error info from the callbacks to
xlogreader. And the complexity would not be saved but instead just
transferred to the callers of WALRead, which would have to learn how
to throw errors from the open_segment callback in addition to, as
currently, from pg_pread.)
* segment_open no longer receives the 'segcxt' as a separate argument,
since it's part of the XLogReaderState argument.
Per comments from Kyotaro Horiguchi.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20200511203336.GA9913@alvherre.pgsql
Use it at level 4, a bit more restrictive than the default level, and
tweak our commanding comments to FALLTHROUGH.
(However, leave zic.c alone, since it's external code; to avoid the
warnings that would appear there, change CFLAGS for that file in the
Makefile.)
Author: Julien Rouhaud <rjuju123@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/20200412081825.qyo5vwwco3fv4gdo@nol
Discussion: https://postgr.es/m/flat/E1fDenm-0000C8-IJ@gemulon.postgresql.org
The existing callers of XLogReadDetermineTimeline() that perform recovery
need to check a replay LSN position when determining on which timeline
to read a WAL page.  A portion of the comment describing this function
said exactly that, yet referred to a routine for fetching a write
LSN, something not available in recovery.
Author: Kyotaro Horiguchi
Discussion: https://postgr.es/m/20200511.101619.2043820539323292957.horikyota.ntt@gmail.com
Restructure the function that locates the root of the to-be-deleted
subtree during nbtree page deletion. Handle the conditions that make
page deletion unsafe in a slightly more uniform way, and acknowledge the
fact that the behavior with incomplete splits on internal pages is
different (as pointed out in the nbtree README as of commit 35bc0ec7).
Also invent new terminology that avoids ambiguity around which pages are
about to be deleted. Consistently use the term "to-be-deleted subtree",
not the ambiguous term "branch".
We were calling the subtree parent page the "top parent page", but that
was quite misleading. The top parent page usually refers to a page
unlinked from its siblings and marked deleted (during the second stage
of page deletion). There was one kind of top parent page that we merely
removed a downlink from, and another kind of top parent page that we
actually marked deleted. Eliminate the ambiguity by inventing a new
term ("subtree parent page") that refers to the former kind of page
only.
Code review for 0dc8ead46363, prompted by a bug closed by 91c40548d5f7.
XLogReader's system for opening and closing segments had gotten too
complicated, with callbacks being passed at both the XLogReaderAllocate
level (read_page) as well as at the WALRead level (segment_open). This
was confusing and hard to follow, so restructure things so that these
callbacks are passed together at XLogReaderAllocate time, and add
another callback to the set (segment_close) to make it a coherent whole.
Also, ensure XLogReaderState is an argument to all the callbacks, so
that they can grab at the ->private data if necessary.
Document the whole arrangement more clearly.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/20200422175754.GA19858@alvherre.pgsql
The TimelineHistoryRead and TimelineHistoryWrite wait events are reported
while waiting for a read or a write of a timeline history file, respectively.
However, previously, the TimelineHistoryRead wait event was not reported
while readTimeLineHistory() was reading a timeline history file.  Also,
TimelineHistoryWrite was not reported while writeTimeLineHistory() was
writing one line with the details of the timeline split at the end.
This commit fixes these issues.
Back-patch to v10, where the wait events for timeline history files were added.
Author: Masahiro Ikeda
Reviewed-by: Michael Paquier, Fujii Masao
Discussion: https://postgr.es/m/d11b0c910b63684424e06772eb844ab5@oss.nttdata.com
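
The corrected wait events can be watched from another session while a history
file is being read or written (for example around a standby's timeline switch
or a promotion); nothing below is specific to the fix:

    SELECT pid, wait_event_type, wait_event
      FROM pg_stat_activity
     WHERE wait_event IN ('TimelineHistoryRead', 'TimelineHistoryWrite');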
Factor out code common to _bt_lock_branch_parent() and _bt_pagedel()
into a new utility function. This new function is used to check that
the left sibling of a deletion target page does not have the
INCOMPLETE_SPLIT page flag set. If it is set then deletion is unsafe;
there won't be a usable pivot tuple (with a downlink) in the parent page
that points to the deletion target page. The page deletion algorithm is
not prepared to deal with that. Also restructure an existing, related
utility function that checks if the right sibling of the target page has
the ISHALFDEAD page flag set.
This organization highlights the symmetry between the two cases. The
goal is to make the design of page deletion clearer. Both functions
involve a sibling page with a flag that indicates that there was an
interrupted operation (a page split or a page deletion) that resulted in
a page pointed to by sibling pages, but not pointed to in the parent.
And, both functions indicate if page deletion is unsafe due to the
absence of a particular downlink in the parent page.
In commit 33e05f89c5, we added the option to display WAL usage
statistics in EXPLAIN and auto_explain.  The display format used two spaces
between fields, which is inconsistent with the buffer usage statistics,
which use one space between fields.  Change the format to make the WAL usage
statistics consistent with the buffer usage statistics.
This commit also changed the usage of "full page writes" to
"full page images" for WAL usage statistics to make it consistent with
other parts of code and docs.
Author: Julien Rouhaud, Amit Kapila
Reviewed-by: Justin Pryzby, Kyotaro Horiguchi and Amit Kapila
Discussion: https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4ZLxWS_Cza3kQQ@mail.gmail.com
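
For reference, the WAL usage line whose format changes here comes from the
WAL option of EXPLAIN added in the same release; a minimal, self-contained
invocation:

    CREATE TABLE wal_demo (n int);

    -- WAL usage is only reported with ANALYZE, since the statement must run
    EXPLAIN (ANALYZE, WAL, COSTS OFF)
        INSERT INTO wal_demo SELECT generate_series(1, 1000);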
Remove one of the arguments to btvacuumpage(), and give up on the idea
that it's a recursive function. We now use the term "backtracking" to
refer to the case where an earlier block must be visited to make sure no
tuples that need to be removed were missed.
Advertising btvacuumpage() as a recursive function was unhelpful. In
reality the function always simulates recursion with a loop (it doesn't
actually call itself). This wasn't just necessary as a precaution (per
the comments mentioning tail recursion), though. There is no reliable
natural limit on the number of times we can backtrack.
There are important behavioral differences when "recursing"/backtracking,
mostly related to page deletion. We don't perform page deletion when
backtracking due to the extra complexity. And when we recurse, we're
not performing a physical order scan anymore, so we expect fairly
different conditions to hold for the page. Structuring the code like
this makes it clearer how _bt_pagedel() cooperates with btvacuumpage()
and btvacuumscan() (as established in commit b0229f26 and commit
73a076b0).
Author: Peter Geoghegan
Reviewed-By: Masahiko Sawada
Discussion: https://postgr.es/m/CAH2-WzmRGMDWiLMcb+zagG9652PboNN4Gfcq1Gc_wJL6A716MA@mail.gmail.com
On further reflection, code comments added by commit b0229f26 slightly
misrepresented how we determine the oldest btpo.xact for the index.
btvacuumpage() does not treat the btpo.xact of a page that it put in the
FSM as a candidate to be the oldest deleted page (the delete-marked page
that has the oldest btpo.xact XID among all pages encountered).
The definition of a deleted page for the purposes of the btpo.xact
calculation is different from the definition used by the bulk delete
statistics. The bulk delete statistics don't distinguish between pages
that were deleted by the current VACUUM, pages deleted by a previous
VACUUM operation but not yet recyclable/reusable, and pages that are
reusable (though reusable pages are counted separately).
Backpatch: 11-, just like commit b0229f26.
The logic for determining how many nbtree pages in an index are deleted
pages sometimes undercounted pages. Pages that were deleted by the
current VACUUM operation (as opposed to some previous VACUUM operation
whose deleted pages have yet to be reused) were sometimes overlooked.
The final count is exposed to users through VACUUM VERBOSE's "%u index
pages have been deleted" output.
btvacuumpage() avoided double-counting when _bt_pagedel() deleted more
than one page by assuming that only one page was deleted, and that the
additional deleted pages would get picked up during a future call to
btvacuumpage() by the same VACUUM operation. _bt_pagedel() can
legitimately delete pages that the btvacuumscan() scan will not visit
again, though, so that assumption was slightly faulty.
Fix the accounting by teaching _bt_pagedel() about its caller's
requirements. It now only reports on pages that it knows btvacuumscan()
won't visit again (including the current btvacuumpage() page), so
everything works out in the end.
This bug has been around forever. Only backpatch to v11, though, to
keep _bt_pagedel() in sync on the branches that have today's bugfix
commit b0229f26da. Note that this commit changes the signature of
_bt_pagedel(), just like commit b0229f26da.
Author: Peter Geoghegan
Reviewed-By: Masahiko Sawada
Discussion: https://postgr.es/m/CAH2-WzkrXBcMQWAYUJMFTTvzx_r4q=pYSjDe07JnUXhe+OZnJA@mail.gmail.com
Backpatch: 11-
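
The undercounted figure is part of ordinary VACUUM VERBOSE output; with a
made-up table name, the per-index report containing the affected line can be
seen with:

    -- look for "%u index pages have been deleted" in the per-index output
    VACUUM VERBOSE some_table;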
Commit 857f9c36cda (which taught nbtree VACUUM to skip a scan of the
index from btvacuumcleanup() in situations where it doesn't seem worth it) made
VACUUM maintain the oldest btpo.xact among all deleted pages for the
index as a whole. It failed to handle all the details surrounding pages
that are deleted by the current VACUUM operation correctly (though pages
deleted by some previous VACUUM operation were processed correctly).
The most immediate problem was that the special area of the page was
examined without a buffer pin at one point. More fundamentally, the
handling failed to account for the full range of _bt_pagedel()
behaviors. For example, _bt_pagedel() sometimes deletes internal pages
in passing, as part of deleting an entire subtree with btvacuumpage()
caller's page as the leaf level page. The original leaf page passed to
_bt_pagedel() might not be the page that it deletes first in cases where
deletion can take place.
It's unclear how disruptive this bug may have been, or what symptoms
users might want to look out for. The issue was spotted during
unrelated code review.
To fix, push down the logic for maintaining the oldest btpo.xact to
_bt_pagedel(). btvacuumpage() is now responsible for pages that were
fully deleted by a previous VACUUM operation, while _bt_pagedel() is now
responsible for pages that were deleted by the current VACUUM operation
(this includes half-dead pages from a previous interrupted VACUUM
operation that become fully deleted in _bt_pagedel()). Note that
_bt_pagedel() should never encounter an existing deleted page.
This commit theoretically breaks the ABI of a stable release by changing
the signature of _bt_pagedel(). However, if any third party extension
is actually affected by this, then it must already be completely broken
(since there are numerous assumptions made in _bt_pagedel() that cannot
be met outside of VACUUM). It seems highly unlikely that such an
extension actually exists, in any case.
Author: Peter Geoghegan
Reviewed-By: Masahiko Sawada
Discussion: https://postgr.es/m/CAH2-WzkrXBcMQWAYUJMFTTvzx_r4q=pYSjDe07JnUXhe+OZnJA@mail.gmail.com
Backpatch: 11-, where the "skip full scan" feature was introduced.
XLogRegisterBufData() does not copy data pointed to by caller's pointer
argument.
Oversight in commit 0d861bbb702.
Author: Peter Eisentraut
Reported-By: Peter Eisentraut
Discussion: https://postgr.es/m/21800dbe-a13e-22f7-d423-b81db9d249f5@2ndquadrant.com