| Commit message (Collapse) | Author | Age |
... | |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
PostgreSQL 9.4 introduces posting list compression in GIN. This feature
supports online upgrade, so that after pg_upgrade uncompressed posting
lists are compressed on-the-fly. Underlying code appears to always
expect at least one item on uncompressed posting list page. But there
could be completely empty pages, because VACUUM never deletes leftmost
and rightmost pages from posting trees. This commit fixes that.
Reported-by: Sivasubramanian Ramasubramanian
Discussion: https://postgr.es/m/1531867212836.63354%40amazon.com
Author: Sivasubramanian Ramasubramanian, Alexander Korotkov
Backpatch-through: 9.4
|
|
|
|
|
|
|
|
| |
A collection of typos I happened to spot while reading code, as well as
grepping for common mistakes.
Backpatch to all supported versions, as applicable, to avoid conflicts
when backporting other commits in the future.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When reading an existing FSM or VM page that was found to be corrupt by the
buffer manager, the code applied PageInit() to reinitialize the page, but
did so without any locking. There is thus a hazard that two backends might
concurrently do PageInit, which in itself would still be OK, but the slower
one might then zero over subsequent data changes applied by the faster one.
Even that is unlikely to be fatal; but it's not desirable, so add locking
to prevent it.
This does not add any locking overhead in the normal code path where the
page is OK. It's not immediately obvious that that's safe, but I believe
it is, for reasons explained in the added comments.
Problem noted by R P Asim. It's been like this for a long time, so
back-patch to all supported branches.
Discussion: https://postgr.es/m/CANXE4Te4G0TGq6cr0-TvwP0H4BNiK_-hB5gHe8mF+nz0mcYfMQ@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit fafa374f2 caused _bt_getbuf() to possibly emit a WAL record for
a page that it was about to recycle. However, it failed to distinguish
all-zero pages from dead pages, which is important because only the
latter have valid btpo.xact values, or indeed any special space at all.
Recycling an all-zero page with XLogStandbyInfoActive() enabled therefore
led to an Assert failure, or to emission of a WAL record containing a
bogus cutoff XID, which might lead to unnecessary query cancellations
on hot standby servers.
Per reports from Antonin Houska and 自己. Amit Kapila was first to
propose this fix, and Robert Haas, myself, and Kyotaro Horiguchi
reviewed it at various times.
This is an old bug, so back-patch to all supported branches.
Discussion: https://postgr.es/m/2628.1474272158@localhost
Discussion: https://postgr.es/m/48875502.f4a0.1635f0c27b0.Coremail.zoulx1982@163.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A critical failure in some of the end-of-recovery actions before the
end-of-recovery record is written can cause PostgreSQL to react
inconsistently with the rest of the cluster in the event of a crash
before the final record is written. Two such failures are for example
an error while processing a two-phase state files or when operating on
recovery.conf. With this commit, the failures are still considered
FATAL, but the write of the timeline history file is delayed as much as
possible so as the window between the moment the file is written and the
end-of-recovery record is generated gets minimized. This way, in the
event of a crash or a failure, the new timeline decided at promotion
will not seem taken by other nodes in the cluster. It is not really
possible to reduce to zero this window, hence one could still see
failures if a crash happens between the history file write and the
end-of-recovery record, so any future code should be careful when
adding new end-of-recovery actions. The original report from Magnus
Hagander mentioned a renamed recovery.conf as original end-of-recovery
failure which caused a timeline to be seen as taken but the subsequent
processing on the now-missing recovery.conf cause the startup process to
issue stop on FATAL, which at follow-up startup made the system
inconsistent because of on-disk changes which already happened.
Processing of two-phase state files still needs some work as corrupted
entries are simply ignored now. This is left as a future item and this
commit fixes the original complain.
Reported-by: Magnus Hagander
Author: Heikki Linnakangas
Reviewed-by: Alexander Korotkov, Michael Paquier, David Steele
Discussion: https://postgr.es/m/CABUevEz09XY2EevA2dLjPCY-C5UO4Hq=XxmXLmF6ipNFecbShQ@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If a standby crashes after promotion before having completed its first
post-recovery checkpoint, then the minimal recovery point which marks
the LSN position where the cluster is able to reach consistency may be
set to a position older than the first end-of-recovery checkpoint while
all the WAL available should be replayed. This leads to the instance
thinking that it contains inconsistent pages, causing a PANIC and a hard
instance crash even if all the WAL available has not been replayed for
certain sets of records replayed. When in crash recovery,
minRecoveryPoint is expected to always be set to InvalidXLogRecPtr,
which forces the recovery to replay all the WAL available, so this
commit makes sure that the local copy of minRecoveryPoint from the
control file is initialized properly and stays as it is while crash
recovery is performed. Once switching to archive recovery or if crash
recovery finishes, then the local copy minRecoveryPoint can be safely
updated.
Pavan Deolasee has reported and diagnosed the failure in the first
place, and the base fix idea to rely on the local copy of
minRecoveryPoint comes from Kyotaro Horiguchi, which has been expanded
into a full-fledged patch by me. The test included in this commit has
been written by Álvaro Herrera and Pavan Deolasee, which I have modified
to make it faster and more reliable with sleep phases.
Backpatch down to all supported versions where the bug appears, aka 9.3
which is where the end-of-recovery checkpoint is not run by the startup
process anymore. The test gets easily supported down to 10, still it
has been tested on all branches.
Reported-by: Pavan Deolasee
Diagnosed-by: Pavan Deolasee
Reviewed-by: Pavan Deolasee, Kyotaro Horiguchi
Author: Michael Paquier, Kyotaro Horiguchi, Pavan Deolasee, Álvaro
Herrera
Discussion: https://postgr.es/m/CABOikdPOewjNL=05K5CbNMxnNtXnQjhTx2F--4p4ruorCjukbA@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When deleting pages the nbtree code has to walk through siblings of a
tree node. When those sibling links are corrupted that can lead to
endless loops - which are currently not interruptible. This is
especially problematic if autovacuum is repeatedly blocked on such
indexes, as it can be hard to get out of that situation without
resorting to single user mode.
Thus add interrupt checks to appropriate places in such
loops. Unfortunately in one of the cases it's it's not easy to do so.
Between 9.3 and 9.4 the page deletion (and page split) code changed
significantly. Before it was significantly less robust against
interruptions. Therefore don't backpatch to 9.3.
Author: Andres Freund
Discussion: https://postgr.es/m/20180627191629.wkunw2qbibnvlz53@alap3.anarazel.de
Backpatch: 9.4-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When multiple relations are deleted at the same transaction,
the files of those relations are deleted by one call to smgrdounlinkall(),
which leads to scan whole shared_buffers only one time. OTOH,
previously, during recovery, smgrdounlink() (not smgrdounlinkall()) was
called for each file to delete, which led to scan shared_buffers
multiple times. Obviously this could cause to increase the WAL replay
time very much especially when shared_buffers was huge.
To alleviate this situation, this commit changes the recovery so that
it also calls smgrdounlinkall() only one time to delete multiple
relation files.
This is just fix for oversight of commit 279628a0a7, not new feature.
So, per discussion on pgsql-hackers, we concluded to backpatch this
to all supported versions.
Author: Fujii Masao
Reviewed-by: Michael Paquier, Andres Freund, Thomas Munro, Kyotaro Horiguchi, Takayuki Tsunakawa
Discussion: https://postgr.es/m/CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
System calls mixed up in error code paths are causing two issues which
several code paths have not correctly handled:
1) For write() calls, sometimes the system may return less bytes than
what has been written without errno being set. Some paths were careful
enough to consider that case, and assumed that errno should be set to
ENOSPC, other calls missed that.
2) errno generated by a system call is overwritten by other system calls
which may succeed once an error code path is taken, causing what is
reported to the user to be incorrect.
This patch uses the brute-force approach of correcting all those code
paths. Some refactoring could happen in the future, but this is let as
future work, which is not targeted for back-branches anyway.
Author: Michael Paquier
Reviewed-by: Ashutosh Sharma
Discussion: https://postgr.es/m/20180622061535.GD5215@paquier.xyz
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When a standby's WAL receiver stops reading WAL from a WAL stream, it
writes data to the current WAL segment without having priorily zero'ed
the page currently written to, which can cause the WAL reader to read
junk data from a past recycled segment and then it would try to get a
record from it. While sanity checks in place provide most of the
protection needed, in some rare circumstances, with chances increasing
when a record header crosses a page boundary, then the startup process
could fail violently on an allocation failure, as follows:
FATAL: invalid memory alloc request size XXX
This is confusing for the user and also unhelpful as this requires in
the worst case a manual restart of the instance, impacting potentially
the availability of the cluster, and this also makes WAL data look like
it is in a corrupted state.
The chances of seeing failures are higher if the connection between the
standby and its root node is unstable, causing WAL pages to be written
in the middle. A couple of approaches have been discussed, like
zero-ing new WAL pages within the WAL receiver itself but this has the
disadvantage of impacting performance of any existing instances as this
breaks the sequential writes done by the WAL receiver. This commit
deals with the problem with a more simple approach, which has no
performance impact without reducing the detection of the problem: if a
record is found with a length higher than 1GB for backends, then do not
try any allocation and report a soft failure which will force the
standby to retry reading WAL. It could be possible that the allocation
call passes and that an unnecessary amount of memory is allocated,
however follow-up checks on records would just fail, making this
allocation short-lived anyway.
This patch owes a great deal to Tsunakawa Takayuki for reporting the
failure first, and then discussing a couple of potential approaches to
the problem.
Backpatch down to 9.5, which is where palloc_extended has been
introduced.
Reported-by: Tsunakawa Takayuki
Reviewed-by: Tsunakawa Takayuki
Author: Michael Paquier
Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05
|
|
|
|
|
|
|
|
|
|
|
| |
They already fail anyway, but prior to this patch they raise an ugly
error message about a lock that cannot be acquired. This just improves
the message.
Author: Masahiko Sawada
Reported-by: Masahiko Sawada
Discussion: https://postgr.es/m/CAD21AoBZau4g4_NUf3BKNd=CdYK+xaPdtJCzvOC1TxGdTiJx_Q@mail.gmail.com
Reviewed-by: Kuntal Ghosh, Alexander Korotkov, Simon Riggs, Michaël Paquier, Álvaro Herrera
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This bug causes a lseek() failure to be reported as a "could not open"
failure in the error message, muddling bug reports. I introduced this
copy-and-pasteo in commit 78e122010422.
Noticed while reviewing code for bug report #15221, from lily liang. In
version 10 the affected function is only used by multixact.c and
commit_ts, and only in corner-case circumstances, neither of which are
involved in the reported bug (a pg_subtrans failure.)
Author: Álvaro Herrera
|
|
|
|
|
| |
Instead of saying "error: success", indicate that we got a working read
but it was too short.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If a continuation record is split so that its first half has already been
removed from the master, and is only present in pg_wal, and there is a
recycled WAL segment in the standby server that looks like it would
contain the second half, recovery would get stuck. The code in
XLogPageRead() incorrectly started streaming at the beginning of the
WAL record, even if we had already read the first page.
Backpatch to 9.4. In principle, older versions have the same problem, but
without replication slots, there was no straightforward mechanism to
prevent the master from recycling old WAL that was still needed by standby.
Without such a mechanism, I think it's reasonable to assume that there's
enough slack in how many old segments are kept around to not run into this,
or you have a WAL archive.
Reported by Jonathon Nelson. Analysis and patch by Kyotaro HORIGUCHI, with
some extra comments by me.
Discussion: https://www.postgresql.org/message-id/CACJqAM3xVz0JY1XFDKPP%2BJoJAjoGx%3DGNuOAshEDWCext7BFvCQ%40mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If an interrupt arrives in the middle of FinishPreparedTransaction
and any callback decide to call CHECK_FOR_INTERRUPTS (e.g.
RemoveTwoPhaseFile can write a warning with ereport, which checks for
interrupts) then it's possible to leave current GXact undeleted.
Backpatch to all supported branches
Stas Kelvich
Discussion: ihttps://www.postgresql.org/message-id/3AD85097-A3F3-4EBA-99BD-C38EDF8D2949@postgrespro.ru
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
spg_text_leaf_consistent() supposed that it should compare only
Min(querylen, entrylen) bytes of the two strings, and then deal with
any excess bytes in one string or the other by assuming the longer
string is greater if the prefixes are equal. Quite aside from the
fact that that's just wrong in some locales (e.g., 'ch' is not less
than 'd' in cs_CZ), it also risked passing incomplete multibyte
characters to strcoll(), with ensuing bad results.
Instead, just pass the full strings to varstr_cmp, and let it decide
what to do about unequal-length strings.
Fortunately, this error doesn't imply any index corruption, it's just
that searches might return the wrong set of entries.
Per report from Emre Hasegeli, though this is not his patch.
Thanks to Peter Geoghegan for review and discussion.
This code was born broken, so back-patch to all supported branches.
In HEAD, I failed to resist the temptation to do a bit of cosmetic
cleanup/pgindent'ing on 710d90da1, too.
Discussion: https://postgr.es/m/CAE2gYzzb6K51VnTq5i5p52z+j9p2duEa-K1T3RrC_GQEynAKEg@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The nextOid value is from the start of the checkpoint and may well be stale
compared to values from more recent XLOG_NEXTOID records. Previously, we
adopted it anyway, allowing the OID counter to go backwards during a crash.
While this should be harmless, it contributed to the severity of the bug
fixed in commit 0408e1ed5, by allowing duplicate TOAST OIDs to be assigned
immediately following a crash. Without this error, that issue would only
have arisen when TOAST objects just younger than a multiple of 2^32 OIDs
were deleted and then not vacuumed in time to avoid a conflict.
Pavan Deolasee
Discussion: https://postgr.es/m/CABOikdOgWT2hHkYG3Wwo2cyZJq2zfs1FH0FgX-=h4OLosXHf9w@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When selecting a new OID, we take care to avoid picking one that's already
in use in the target table, so as not to create duplicates after the OID
counter has wrapped around. However, up to now we used SnapshotDirty when
scanning for pre-existing entries. That ignores committed-dead rows, so
that we could select an OID matching a deleted-but-not-yet-vacuumed row.
While that mostly worked, it has two problems:
* If recently deleted, the dead row might still be visible to MVCC
snapshots, creating a risk for duplicate OIDs when examining the catalogs
within our own transaction. Such duplication couldn't be visible outside
the object-creating transaction, though, and we've heard few if any field
reports corresponding to such a symptom.
* When selecting a TOAST OID, deleted toast rows definitely *are* visible
to SnapshotToast, and will remain so until vacuumed away. This leads to
a conflict that will manifest in errors like "unexpected chunk number 0
(expected 1) for toast value nnnnn". We've been seeing reports of such
errors from the field for years, but the cause was unclear before.
The fix is simple: just use SnapshotAny to search for conflicting rows.
This results in a slightly longer window before object OIDs can be
recycled, but that seems unlikely to create any large problems.
Pavan Deolasee
Discussion: https://postgr.es/m/CABOikdOgWT2hHkYG3Wwo2cyZJq2zfs1FH0FgX-=h4OLosXHf9w@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
| |
The listed numbers disagreed with the ones being used in the symbols;
but instead of just fixing the numbers in the comment, use the symbolic
name instead, which seems clearer.
This has been wrong all along, so apply back to 9.5 where BRIN was
introduced.
Reported-by: Tomas Vondra
Discussion: https://postgr.es/m/5ff514f2-8b1e-6366-b11c-8e2ed442562d@2ndquadrant.com
|
|
|
|
|
|
|
|
|
|
|
|
| |
Count the number of tuples in the index honestly, instead of assuming
that it's the same as the number of tuples in the heap. (It might be
different if the index is partial.)
Back-patch to all supported versions.
Tomas Vondra
Discussion: https://postgr.es/m/3b3d8eac-c709-0d25-088e-b98339a1b28a@2ndquadrant.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
"UPDATE/DELETE WHERE CURRENT OF cursor_name" failed, with an error message
like "cannot extract system attribute from virtual tuple", if the cursor
was using a index-only scan for the target table. Fix it by digging the
current TID out of the indexscan state.
It seems likely that the same failure could occur for CustomScan plans
and perhaps some FDW plan types, so that leaving this to be treated as an
internal error with an obscure message isn't as good an idea as it first
seemed. Hence, add a bit of heaptuple.c infrastructure to let us deliver
a more on-topic message. I chose to make the message match what you get
for the case where execCurrentOf can't identify the target scan node at
all, "cursor "foo" is not a simply updatable scan of table "bar"".
Perhaps it should be different, but we can always adjust that later.
In the future, it might be nice to provide hooks that would let custom
scan providers and/or FDWs deal with this in other ways; but that's
not a suitable topic for a back-patchable bug fix.
It's been like this all along, so back-patch to all supported branches.
Yugo Nagata and Tom Lane
Discussion: https://postgr.es/m/20180201013349.937dfc5f.nagata@sraoss.co.jp
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, it just returned the heap tuple count, which might be only an
estimate, and would be completely the wrong thing if the index is partial.
Since this function scans every index page anyway to find free pages,
it's practically free to count the surviving index tuples. Let's do that
and return an accurate count.
This is easily visible as a wrong reltuples value for a partial GiST
index following VACUUM, so back-patch to all supported branches.
Andrey Borodin, reviewed by Michail Nikolaev
Discussion: https://postgr.es/m/151956654251.6915.675951950408204404.pgcf@coridan.postgresql.org
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 4800f16a7ad0 added some sanity checks to ensure we don't
accidentally corrupt data, but in one of them we failed to consider the
effects of a database upgraded from 9.2 or earlier, where a tuple
exclusively locked prior to the upgrade has a slightly different bit
pattern. Fix that by using the macro that we fixed in commit
74ebba84aeb6 for similar situations.
Reported-by: Alexandre Garcia
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/CAPYLKR6yxV4=pfW0Gwij7aPNiiPx+3ib4USVYnbuQdUtmkMaEA@mail.gmail.com
Andres suspects that this bug may have wider ranging consequences, but I
couldn't find anything.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As src/backend/access/transam/README says, PageGetLSN may only be called
by processes holding either exclusive lock on buffer, or a shared lock
on buffer plus buffer header lock. Therefore any place that only holds
a shared buffer lock must use BufferGetLSNAtomic instead of PageGetLSN,
which internally obtains buffer header lock prior to reading the LSN.
A few callsites failed to comply with this rule. This was detected by
running all tests under a new (not committed) assertion that verifies
PageGetLSN locking contract. All but one of the callsites that failed
the assertion are fixed by this patch. Remaining callsites were
inspected manually and determined not to need any change.
The exception (unfixed callsite) is in TestForOldSnapshot, which only
has a Page argument, making it impossible to access the corresponding
Buffer from it. Fixing that seems a much larger patch that will have to
be done separately; and that's just as well, since it was only
introduced in 9.6 and other bugs are much older.
Some of these bugs are ancient; backpatch all the way back to 9.3.
Authors: Jacob Champion, Asim Praveen, Ashwin Agrawal
Reviewed-by: Michaël Paquier
Discussion: https://postgr.es/m/CABAq_6GXgQDVu3u12mK9O5Xt5abBZWQ0V40LZCE+oUf95XyNFg@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The previous commit has shown that the sanity checks around freezing
aren't strong enough. Strengthening them seems especially important
because the existance of the bug has caused corruption that we don't
want to make even worse during future vacuum cycles.
The errors are emitted with ereport rather than elog, despite being
"should never happen" messages, so a proper error code is emitted. To
avoid superflous translations, mark messages as internal.
Author: Andres Freund and Alvaro Herrera
Reviewed-By: Alvaro Herrera, Michael Paquier
Discussion: https://postgr.es/m/20171102112019.33wb7g5wp4zpjelu@alap3.anarazel.de
Backpatch: 9.3-
|
|
|
|
| |
Reported-by: Masahiko Sawada <sawada.mshk@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch fixes a couple of low-probability bugs that could lead to
reporting an irrelevant errno value (and hence possibly a wrong SQLSTATE)
concerning directory-open or file-open failures. It also fixes places
where we took shortcuts in reporting such errors, either by using elog
instead of ereport or by using ereport but forgetting to specify an
errcode. And it eliminates a lot of just plain redundant error-handling
code.
In service of all this, export fd.c's formerly-static function
ReadDirExtended, so that external callers can make use of the coding
pattern
dir = AllocateDir(path);
while ((de = ReadDirExtended(dir, path, LOG)) != NULL)
if they'd like to treat directory-open failures as mere LOG conditions
rather than errors. Also fix FreeDir to be a no-op if we reach it
with dir == NULL, as such a coding pattern would cause.
Then, remove code at many call sites that was throwing an error or log
message for AllocateDir failure, as ReadDir or ReadDirExtended can handle
that job just fine. Aside from being a net code savings, this gets rid of
a lot of not-quite-up-to-snuff reports, as mentioned above. (In some
places these changes result in replacing a custom error message such as
"could not open tablespace directory" with more generic wording "could not
open directory", but it was agreed that the custom wording buys little as
long as we report the directory name.) In some other call sites where we
can't just remove code, change the error reports to be fully
project-style-compliant.
Also reorder code in restoreTwoPhaseData that was acquiring a lock
between AllocateDir and ReadDir; in the unlikely but surely not
impossible case that LWLockAcquire changes errno, AllocateDir failures
would be misreported. There is no great value in opening the directory
before acquiring TwoPhaseStateLock, so just do it in the other order.
Also fix CheckXLogRemoved to guarantee that it preserves errno,
as quite a number of call sites are implicitly assuming. (Again,
it's unlikely but I think not impossible that errno could change
during a SpinLockAcquire. If so, this function was broken for its
own purposes as well as breaking callers.)
And change a few places that were using not-per-project-style messages,
such as "could not read directory" when "could not open directory" is
more correct.
Back-patch the exporting of ReadDirExtended, in case we have occasion
to back-patch some fix that makes use of it; it's not needed right now
but surely making it global is pretty harmless. Also back-patch the
restoreTwoPhaseData and CheckXLogRemoved fixes. The rest of this is
essentially cosmetic and need not get back-patched.
Michael Paquier, with a bit of additional work by me
Discussion: https://postgr.es/m/CAB7nPqRpOCxjiirHmebEFhXVTK7V5Jvw4bz82p7Oimtsm3TyZA@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Originally, we palloc'd this buffer just barely big enough to hold the
largest xlog record seen so far. It turns out that that can result in
valgrind complaints, because some compilers will emit code that assumes
it can safely fetch padding bytes at the end of a struct, and those
padding bytes were unallocated so far as aset.c was concerned. We can
fix that by MAXALIGN'ing the palloc request size, ensuring that it is big
enough to include any possible padding that might've been omitted from
the on-disk record.
An additional objection to the original coding is that it could result in
many repeated palloc cycles, in the worst case where we see a series of
gradually larger xlog records. We can ameliorate that cheaply by
imposing a minimum buffer size that's large enough for most xlog records.
BLCKSZ/2 was chosen after a bit of discussion.
In passing, remove an obsolete comment in struct xl_heap_new_cid that the
combocid field is free due to alignment considerations. Perhaps that was
true at some point, but it's not now.
Back-patch to 9.5 where this code came in.
Discussion: https://postgr.es/m/E1eHa4J-0006hI-Q8@gemulon.postgresql.org
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If a process is extending a table concurrently with some BRIN
summarization process, it is possible for the latter to miss pages added
by the former because the number of pages is computed ahead of time.
Fix by determining a fresh relation size after inserting the placeholder
tuple: any process that further extends the table concurrently will
update the placeholder tuple, while previous pages will be processed by
the heap scan.
Reported-by: Tomas Vondra
Reviewed-by: Tom Lane
Author: Álvaro Herrera
Discussion: https://postgr.es/m/083d996a-4a8a-0e13-800a-851dd09ad8cc@2ndquadrant.com
Backpatch-to: 9.5
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In some cases the BRIN code releases lock on an index page, and later
re-acquires lock and tries to check that the tuple it was working on is
still there. That check was a couple bricks shy of a load. It didn't
consider that the page might have turned into a "revmap" page. (The
samepage code path doesn't call brin_getinsertbuffer(), so it isn't
protected by the checks for revmap status there.) It also didn't check
whether the tuple offset was now off the end of the linepointer array.
Since commit 24992c6db the latter case is pretty common, but at least
in principle it could have occurred before that. The net result is
that concurrent updates of a BRIN index could fail with errors like
"invalid index offnum" or "inconsistent range map".
Per report from Tomas Vondra. Back-patch to 9.5, since this code is
substantially the same in all versions containing BRIN.
Discussion: https://postgr.es/m/10d2b9f9-f427-03b8-8ad9-6af4ecacbee9@2ndquadrant.com
|
|
|
|
|
|
|
|
|
|
| |
It turns out we misdiagnosed what the real problem was. Revert the
previous changes, because they may have worse consequences going
forward. A better fix is forthcoming.
The simplistic test case is kept, though disabled.
Discussion: https://postgr.es/m/20171102112019.33wb7g5wp4zpjelu@alap3.anarazel.de
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Sloppy loop coding in set_status_by_pages() resulted in fetching one array
element more than it should from the subxids[] array. The odds of this
resulting in SIGSEGV are pretty small, but we've certainly seen that happen
with similar mistakes elsewhere. While at it, we can get rid of an extra
TransactionIdToPage() calculation per loop.
Per report from David Binderman. Back-patch to all supported branches,
since this code is quite old.
Discussion: https://postgr.es/m/HE1PR0802MB2331CBA919CBFFF0C465EB429C710@HE1PR0802MB2331.eurprd08.prod.outlook.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When some tuple versions in an update chain are frozen due to them being
older than freeze_min_age, the xmax/xmin trail can become broken. This
breaks HOT (and probably other things). A subsequent VACUUM can break
things in more serious ways, such as leaving orphan heap-only tuples
whose root HOT redirect items were removed. This can be seen because
index creation (or REINDEX) complain like
ERROR: XX000: failed to find parent tuple for heap-only tuple at (0,7) in table "t"
Because of relfrozenxid contraints, we cannot avoid the freezing of the
early tuples, so we must cope with the results: whenever we see an Xmin
of FrozenTransactionId, consider it a match for whatever the previous
Xmax value was.
This problem seems to have appeared in 9.3 with multixact changes,
though strictly speaking it seems unrelated.
Since 9.4 we have commit 37484ad2a "Change the way we mark tuples as
frozen", so the fix is simple: just compare the raw Xmin (still stored
in the tuple header, since freezing merely set an infomask bit) to the
Xmax. But in 9.3 we rewrite the Xmin value to FrozenTransactionId, so
the original value is lost and we have nothing to compare the Xmax with.
To cope with that case we need to compare the Xmin with FrozenXid,
assume it's a match, and hope for the best. Sadly, since you can
pg_upgrade a 9.3 instance containing half-frozen pages to newer
releases, we need to keep the old check in newer versions too, which
seems a bit brittle; I hope we can somehow get rid of that.
I didn't optimize the new function for performance. The new coding is
probably a bit slower than before, since there is a function call rather
than a straight comparison, but I'd rather have it work correctly than
be fast but wrong.
This is a followup after 20b655224249 fixed a few related problems.
Apparently, in 9.6 and up there are more ways to get into trouble, but
in 9.3 - 9.5 I cannot reproduce a problem anymore with this patch, so
there must be a separate bug.
Reported-by: Peter Geoghegan
Diagnosed-by: Peter Geoghegan, Michael Paquier, Daniel Wood,
Yi Wen Wong, Álvaro
Discussion: https://postgr.es/m/CAH2-Wznm4rCrhFAiwKPWTpEw2bXDtgROZK7jWWGucXeH3D1fmA@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Vacuum calls page-level HOT prune to remove dead HOT tuples before doing
liveness checks (HeapTupleSatisfiesVacuum) on the remaining tuples. But
concurrent transaction commit/abort may turn DEAD some of the HOT tuples
that survived the prune, before HeapTupleSatisfiesVacuum tests them.
This happens to activate the code that decides to freeze the tuple ...
which resuscitates it, duplicating data.
(This is especially bad if there's any unique constraints, because those
are now internally violated due to the duplicate entries, though you
won't know until you try to REINDEX or dump/restore the table.)
One possible fix would be to simply skip doing anything to the tuple,
and hope that the next HOT prune would remove it. But there is a
problem: if the tuple is older than freeze horizon, this would leave an
unfrozen XID behind, and if no HOT prune happens to clean it up before
the containing pg_clog segment is truncated away, it'd later cause an
error when the XID is looked up.
Fix the problem by having the tuple freezing routines cope with the
situation: don't freeze the tuple (and keep it dead). In the cases that
the XID is older than the freeze age, set the HEAP_XMAX_COMMITTED flag
so that there is no need to look up the XID in pg_clog later on.
An isolation test is included, authored by Michael Paquier, loosely
based on Daniel Wood's original reproducer. It only tests one
particular scenario, though, not all the possible ways for this problem
to surface; it be good to have a more reliable way to test this more
fully, but it'd require more work.
In message https://postgr.es/m/20170911140103.5akxptyrwgpc25bw@alvherre.pgsql
I outlined another test case (more closely matching Dan Wood's) that
exposed a few more ways for the problem to occur.
Backpatch all the way back to 9.3, where this problem was introduced by
multixact juggling. In branches 9.3 and 9.4, this includes a backpatch
of commit e5ff9fefcd50 (of 9.5 era), since the original is not
correctable without matching the coding pattern in 9.5 up.
Reported-by: Daniel Wood
Diagnosed-by: Daniel Wood
Reviewed-by: Yi Wen Wong, Michaël Paquier
Discussion: https://postgr.es/m/E5711E62-8FDF-4DCA-A888-C200BF6B5742@amazon.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Stress testing by Andreas Seltenreich disclosed longstanding problems that
occur if a FATAL exit (e.g. due to receipt of SIGTERM) occurs while we are
trying to execute a ROLLBACK of an already-failed transaction. In such a
case, xact.c is in TBLOCK_ABORT state, so that AbortOutOfAnyTransaction
would skip AbortTransaction and go straight to CleanupTransaction. This
led to an assert failure in an assert-enabled build (due to the ROLLBACK's
portal still having a cleanup hook) or without assertions, to a FATAL exit
complaining about "cannot drop active portal". The latter's not
disastrous, perhaps, but it's messy enough to want to improve it.
We don't really want to run all of AbortTransaction in this code path.
The minimum required to clean up the open portal safely is to do
AtAbort_Memory and AtAbort_Portals. It seems like a good idea to
do AtAbort_Memory unconditionally, to be entirely sure that we are
starting with a safe CurrentMemoryContext. That means that if the
main loop in AbortOutOfAnyTransaction does nothing, we need an extra
step at the bottom to restore CurrentMemoryContext = TopMemoryContext,
which I chose to do by invoking AtCleanup_Memory. This'll result in
calling AtCleanup_Memory twice in many of the paths through this function,
but that seems harmless and reasonably inexpensive.
The original motivation for the assertion in AtCleanup_Portals was that
we wanted to be sure that any user-defined code executed as a consequence
of the cleanup hook runs during AbortTransaction not CleanupTransaction.
That still seems like a valid concern, and now that we've seen one case
of the assertion firing --- which means that exactly that would have
happened in a production build --- let's replace the Assert with a runtime
check. If we see the cleanup hook still set, we'll emit a WARNING and
just drop the hook unexecuted.
This has been like this a long time, so back-patch to all supported
branches.
Discussion: https://postgr.es/m/877ey7bmun.fsf@ansel.ydns.eu
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The sole useful effect of this function, to check that no catcache
entries have positive refcounts at transaction end, has really been
obsolete since we introduced ResourceOwners in PG 8.1. We reduced the
checks to assertions years ago, so that the function was a complete
no-op in production builds. There have been previous discussions about
removing it entirely, but consensus up to now was that it had some small
value as a cross-check for bugs in the ResourceOwner logic.
However, it now emerges that it's possible to trigger these assertions
if you hit an assert-enabled backend with SIGTERM during a call to
SearchCatCacheList, because that function temporarily increases the
refcounts of entries it's intending to add to a catcache list construct.
In a normal ERROR scenario, the extra refcounts are cleaned up by
SearchCatCacheList's PG_CATCH block; but in a FATAL exit we do a
transaction abort and exit without ever executing PG_CATCH handlers.
There's a case to be made that this is a generic hazard and we should
consider restructuring elog(FATAL) handling so that pending PG_CATCH
handlers do get run. That's pretty scary though: it could easily create
more problems than it solves. Preliminary stress testing by Andreas
Seltenreich suggests that there are not many live problems of this ilk,
so we rejected that idea.
There are more-localized ways to fix the problem; the most principled
one would be to use PG_ENSURE_ERROR_CLEANUP instead of plain PG_TRY.
But adding cycles to SearchCatCacheList isn't very appealing. We could
also weaken the assertions in AtEOXact_CatCache in some more or less
ad-hoc way, but that just makes its raison d'etre even less compelling.
In the end, the most reasonable solution seems to be to just remove
AtEOXact_CatCache altogether, on the grounds that it's not worth trying
to fix it. It hasn't found any bugs for us in many years.
Per report from Jeevan Chalke. Back-patch to all supported branches.
Discussion: https://postgr.es/m/CAM2+6=VEE30YtRQCZX7_sCFsEpoUkFBV1gZazL70fqLn8rcvBA@mail.gmail.com
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If several sessions are concurrently locking a tuple update chain with
nonconflicting lock modes using an old snapshot, and they all succeed,
it may happen that some of them fail because of restarting the loop (due
to a concurrent Xmax change) and getting an error in the subsequent pass
while trying to obtain a tuple lock that they already have in some tuple
version.
This can only happen with very high concurrency (where a row is being
both updated and FK-checked by multiple transactions concurrently), but
it's been observed in the field and can have unpleasant consequences
such as an FK check failing to see a tuple that definitely exists:
ERROR: insert or update on table "child_table" violates foreign key constraint "fk_constraint_name"
DETAIL: Key (keyid)=(123456) is not present in table "parent_table".
(where the key is observably present in the table).
Discussion: https://postgr.es/m/20170714210011.r25mrff4nxjhmf3g@alvherre.pgsql
|
|
|
|
|
|
|
| |
Fix oversight in 3b97e6823b94 bug fix. Bitwise AND is used instead of OR and
it cleans all bits in t_infomask heap tuple field.
Backpatch to 9.3
|
|
|
|
|
|
|
|
| |
Once upon a time, WAL pointers could be NULL, but no longer. We talk about
"valid" now.
Reported-by: Amit Langote
Discussion: https://postgr.es/m/33e9617d-27f1-eee8-3311-e27af98eaf2b@lab.ntt.co.jp
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When promoting a standby just after a XLOG_SWITCH record was replayed,
and next segment(s) are already are locally available (via walsender,
restore_command + trigger/recovery target), that segment could
accidentally be recycled onto the past of the new timeline. Later
checkpointer would create a .ready file for it, assuming there was an
error during creation, and it would get archived. That causes trouble
if another standby is later brought up from a basebackup from before
the timeline creation, because it would try to read the
segment, because XLogFileReadAnyTLI just tries all possible timelines,
which doesn't have valid contents. Thus replay would fail.
The problem, if already occurred, can be fixed by removing the segment
and/or having restore_command filter it out.
The reason for the creation of such "phantom" segments was, that after
an XLOG_SWITCH record the EndOfLog variable points to the beginning of
the next segment, and RemoveXlogFile() used XLByteToPrevSeg().
Normally RemoveXlogFile() doing so is harmless, because the last
segment will still exist preventing InstallXLogFileSegment() from
causing harm, but just after promotion there's no previous segment on
the new timeline.
Fix that by using XLByteToSeg() instead of XLByteToPrevSeg().
Author: Andres Freund
Reported-By: Greg Burek
Discussion: https://postgr.es/m/20170619073026.zcwpe6mydsaz5ygd@alap3.anarazel.de
Backpatch: 9.2-, bug older than all supported versions
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When the checkpointer writes the shutdown checkpoint, it checks
afterwards whether any WAL has been written since it started and
throws a PANIC if so. At that point, only walsenders are still
active, so one might think this could not happen, but walsenders can
also generate WAL, for instance in BASE_BACKUP and logical decoding
related commands (e.g. via hint bits). So they can trigger this panic
if such a command is run while the shutdown checkpoint is being
written.
To fix this, divide the walsender shutdown into two phases. First,
checkpointer, itself triggered by postmaster, sends a
PROCSIG_WALSND_INIT_STOPPING signal to all walsenders. If the backend
is idle or runs an SQL query this causes the backend to shutdown, if
logical replication is in progress all existing WAL records are
processed followed by a shutdown. Otherwise this causes the walsender
to switch to the "stopping" state. In this state, the walsender will
reject any further replication commands. The checkpointer begins the
shutdown checkpoint once all walsenders are confirmed as
stopping. When the shutdown checkpoint finishes, the postmaster sends
us SIGUSR2. This instructs walsender to send any outstanding WAL,
including the shutdown checkpoint record, wait for it to be replicated
to the standby, and then exit.
Author: Andres Freund, based on an earlier patch by Michael Paquier
Reported-By: Fujii Masao, Andres Freund
Reviewed-By: Michael Paquier
Discussion: https://postgr.es/m/20170602002912.tqlwn4gymzlxpvs2@alap3.anarazel.de
Backpatch: 9.4, where logical decoding was introduced
|
|
|
|
|
| |
Mark our rusage reportage string translatable; remove quotes from type
names; unify formatting of very similar messages.
|
|
|
|
|
| |
This addresses the new warning types -Wformat-truncation
-Wformat-overflow that are part of -Wall, via -Wformat, in GCC 7.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
GiST's getNextNearest() function attempts to pfree the previously-returned
tuple if any (that is, scan->xs_hitup in HEAD, or scan->xs_itup in older
branches). However, if we are rescanning a plan node after ending a
previous scan early, those tuple pointers could be pointing to garbage,
because they would be pointing into the scan's pageDataCxt or queueCxt
which has been reset. In a debug build this reliably results in a crash,
although I think it might sometimes accidentally fail to fail in
production builds.
To fix, clear the pointer field anyplace we reset a context it might
be pointing into. This may be overkill --- I think probably only the
queueCxt case is involved in this bug, so that resetting in gistrescan()
would be sufficient --- but dangling pointers are generally bad news,
so let's avoid them.
Another plausible answer might be to just not bother with the pfree in
getNextNearest(). The reconstructed tuples would go away anyway in the
context resets, and I'm far from convinced that freeing them a bit earlier
really saves anything meaningful. I'll stick with the original logic in
this patch, but if we find more problems in the same area we should
consider that approach.
Per bug #14641 from Denis Smirnov. Back-patch to 9.5 where this
logic was introduced.
Discussion: https://postgr.es/m/20170504072034.24366.57688@wrigleys.postgresql.org
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ProcessTwoPhaseBuffer (formerly StandbyRecoverPreparedTransactions)
mixed up the parent and child XIDs when calling SubTransSetParent to
record the transactions' relationship in pg_subtrans.
Remarkably, analysis by Simon Riggs suggests that this doesn't lead to
visible problems (at least, not in non-Assert builds). That might
explain why we'd not noticed it before. Nonetheless, it's surely wrong.
This code was born broken, so back-patch to all supported branches.
Discussion: https://postgr.es/m/20110.1492905318@sss.pgh.pa.us
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
On EXEC_BACKEND builds, this can fail if ASLR is in use.
Backpatch to 9.5. On master, completely remove the bgw_main field
completely, since there is no situation in which it is safe for an
EXEC_BACKEND build. On 9.6 and 9.5, leave the field intact to avoid
breaking things for third-party code that doesn't care about working
under EXEC_BACKEND. Prior to 9.5, there are no in-core bgworker
entrypoints.
Petr Jelinek, reviewed by me.
Discussion: http://postgr.es/m/09d8ad33-4287-a09b-a77f-77f8761adb5e@2ndquadrant.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We were assigning values near 255 through "char *" pointers. On machines
where char is signed, that's not entirely kosher, and it's reasonable
for compilers to warn about it.
A better solution would be to change the pointer type to "unsigned char *",
but that would be vastly more invasive. For the moment, let's just apply
this simple backpatchable solution.
Aleksander Alekseev
Discussion: https://postgr.es/m/20170220141239.GD12278@e733.localdomain
Discussion: https://postgr.es/m/2839.1490714708@sss.pgh.pa.us
|
|
|
|
| |
From: Josh Soref <jsoref@gmail.com>
|
|
|
|
|
|
|
| |
Both datatypes map to the same underlying one which is why it still
worked, but we should use the correct type.
Author: Kyotaro HORIGUCHI
|