| Commit message | Author | Age |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In writeListPage, never take a full-page image of the page, because we
have all the information required to re-initialize it in the WAL record
anyway. Before this fix, a full-page image was always generated unless
full_page_writes=off, because when the page is initialized its LSN is
always 0. In stable branches, keep the code to restore the backup blocks
if they exist, in case the WAL was generated with an older minor
version, but in master, Assert that there are no full-page images.
In the redo routine, add missing "off++". Otherwise the tuples are added
to the page in reverse order. That happens to be harmless because we
always scan and remove all the tuples together, but it was clearly wrong.
Also, it was masked by the first bug unless full_page_writes=off, because
the page was always restored from a full-page image.
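For illustration, here is a minimal, self-contained sketch (not the actual
redo code) of why the missing increment reverses the order: inserting every
tuple at the same offset pushes the previously added tuples to higher
offsets, roughly the way PageAddItem shifts line pointers when given an
explicit offset.

    #include <stdio.h>
    #include <string.h>

    #define MAX_ITEMS 8

    /* Insert value at 1-based position 'off', shifting later items right. */
    static void
    add_item(int *items, int *nitems, int off, int value)
    {
        memmove(&items[off], &items[off - 1],
                (*nitems - (off - 1)) * sizeof(int));
        items[off - 1] = value;
        (*nitems)++;
    }

    int
    main(void)
    {
        int items[MAX_ITEMS];
        int nitems;
        int off;

        /* Buggy redo loop: 'off' is never advanced, so every tuple is
         * inserted at the same position and ends up in reverse order. */
        nitems = 0;
        off = 1;
        for (int tup = 1; tup <= 3; tup++)
            add_item(items, &nitems, off, tup);
        printf("without off++:");
        for (int i = 0; i < nitems; i++)
            printf(" %d", items[i]);
        printf("\n");                   /* prints: 3 2 1 */

        /* Fixed loop: advancing 'off' preserves the original order. */
        nitems = 0;
        off = 1;
        for (int tup = 1; tup <= 3; tup++)
            add_item(items, &nitems, off++, tup);
        printf("with off++:   ");
        for (int i = 0; i < nitems; i++)
            printf(" %d", items[i]);
        printf("\n");                   /* prints: 1 2 3 */

        return 0;
    }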
Backpatch to all supported versions.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As noted some time ago, the original coding had a typo ("|" for "^")
that made the result less unique than intended. Even the intended
behavior is obsolete since it was based on wanting to produce a
usable value even if we didn't have int64 arithmetic --- a limitation
we stopped supporting years ago. Instead, let's redefine the system
identifier as tv_sec in the upper 32 bits (same as before), tv_usec
in the next 20 bits, and the low 12 bits of getpid() in the remaining
bits. This is still hardly guaranteed-universally-unique, but it's
noticeably better than before. Per my proposal at
<29019.1374535940@sss.pgh.pa.us>
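As a rough illustration of the bit layout described above (a sketch only,
not a copy of the actual BootStrapXLOG code; the helper name is made up):

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/time.h>
    #include <unistd.h>

    /*
     * System-identifier layout described above:
     *   bits 63..32  tv_sec   (same as before)
     *   bits 31..12  low 20 bits of tv_usec
     *   bits 11..0   low 12 bits of getpid()
     */
    static uint64_t
    compute_system_identifier(void)
    {
        struct timeval tv;
        uint64_t       sysid;

        gettimeofday(&tv, NULL);
        sysid = (uint64_t) tv.tv_sec << 32;
        sysid |= ((uint64_t) tv.tv_usec & 0xFFFFF) << 12;
        sysid |= (uint64_t) getpid() & 0xFFF;
        return sysid;
    }

    int
    main(void)
    {
        printf("system identifier: %llu\n",
               (unsigned long long) compute_system_identifier());
        return 0;
    }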
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If a tuple is locked, and this lock is later upgraded either to an
update or to a stronger lock, and in the meantime some other process
tries to lock, update or delete the same tuple, it (the tuple) could end
up being updated twice, or having conflicting locks held.
The reason for this is that the second updater checks for a change in
Xmax value, or in the HEAP_XMAX_IS_MULTI infomask bit, after noticing
the first lock; and if there's a change, it restarts and re-evaluates
its ability to update the tuple. But it neglected to check for changes
in lock strength or in lock-vs-update status when those two properties
stayed the same. This would lead it to take the wrong decision and
continue with its own update, when in reality it shouldn't do so but
instead restart from the top.
This could lead to either an assertion failure much later (when a
multixact containing multiple updates is detected), or duplicate copies
of tuples.
To fix, make sure to compare the other relevant infomask bits alongside
the Xmax value and HEAP_XMAX_IS_MULTI bit, and restart from the top if
necessary.
Also, in the belt-and-suspenders spirit, add a check to
MultiXactCreateFromMembers that a multixact being created does not have
two or more members that are claimed to be updates. This should protect
against other bugs that might cause similar bogus situations.
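To illustrate the shape of the fixed check, here is a small self-contained
sketch; the bit values and names below are made up for the example and are
not the real infomask definitions from htup_details.h:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative stand-ins for heap infomask bits (values are made up). */
    #define XMAX_IS_MULTI     0x0001  /* xmax is a MultiXactId */
    #define XMAX_LOCK_ONLY    0x0002  /* xmax is only a locker, not an updater */
    #define XMAX_KEYSHR_LOCK  0x0004
    #define XMAX_SHR_LOCK     0x0008
    #define XMAX_EXCL_LOCK    0x0010
    #define LOCK_STRENGTH_MASK (XMAX_KEYSHR_LOCK | XMAX_SHR_LOCK | XMAX_EXCL_LOCK)

    typedef uint32_t TransactionId;

    /*
     * Decide whether a concurrent updater must restart from the top.  Before
     * the fix only the xmax value and the IS_MULTI bit were compared; the fix
     * also compares lock strength and lock-vs-update status.
     */
    static bool
    must_restart(TransactionId old_xmax, uint16_t old_infomask,
                 TransactionId new_xmax, uint16_t new_infomask)
    {
        const uint16_t interesting =
            XMAX_IS_MULTI | XMAX_LOCK_ONLY | LOCK_STRENGTH_MASK;

        if (old_xmax != new_xmax)
            return true;
        return (old_infomask & interesting) != (new_infomask & interesting);
    }

    int
    main(void)
    {
        /* Same xmax, but the lock was upgraded from shared to exclusive:
         * the old check would miss this; the fixed check restarts. */
        printf("restart needed: %d\n",
               must_restart(717, XMAX_LOCK_ONLY | XMAX_SHR_LOCK,
                            717, XMAX_LOCK_ONLY | XMAX_EXCL_LOCK));
        return 0;
    }

In the example, the xmax stays the same but the lock strength changed, so
the fixed check forces a restart where the old one would not.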
Backpatch to 9.3, where the possibility of multixacts containing updates
was introduced. (In prior versions it was possible to have the tuple
lock upgraded from shared to exclusive, and an update would not restart
from the top; yet we're protected against a bug there because there's
always a sleep to wait for the locking transaction to complete before
continuing to do anything. Really, the fact that tuple locks always
conflicted with concurrent updates is what protected against bugs here.)
Per report from Andrew Dunstan and Josh Berkus in thread at
http://www.postgresql.org/message-id/534C8B33.9050807@pgexperts.com
Bug analysis by Andres Freund.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Once we've completed a PREPARE, our session is not running a transaction,
so its entry in pg_stat_activity should show xact_start as null, rather
than leaving the value as the start time of the now-prepared transaction.
I think possibly this oversight was triggered by faulty extrapolation
from the adjacent comment that says PrepareTransaction should not call
AtEOXact_PgStat, so tweak the wording of that comment.
Noted by Andres Freund while considering bug #10123 from Maxim Boguk,
although this error doesn't seem to explain that report.
Back-patch to all active branches.
|
|
|
|
| |
We no longer have a TLI field in the page header.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
When marking a branch as half-dead, a pointer to the top of the branch is
stored in the leaf block's high key. During normal operation, the high key
was left in place, and the block number was just stored in the ctid field
of the high key tuple, but in WAL replay, the high key was recreated as a
truncated tuple with zero columns. For the sake of easier debugging, also
truncate the tuple in normal operation, so that the page is identical
after WAL replay. Also, rename the 'downlink' field in the WAL record to
'topparent', as that seems like a more descriptive name. And make sure
it's set to invalid when unlinking the leaf page.
|
|
|
|
|
| |
It's blatantly obvious that commit 4d0d607a454ee832574afd52a3c515099cc85eb3
wasn't tested. The leak's real enough, though.
|
|
|
|
| |
Revert due to contrib/test_decoding regression failure
|
|
|
|
| |
Patch by Ants Aasma
|
|
|
|
|
|
| |
Forgot to update the LSN of the left sibling's page when creating a new root.
I fixed this for regular insertions and page splits earlier, but missed
new root creation.
|
|
|
|
|
|
|
| |
The README incorrectly claimed that GIN posting tree pages contain an array
of uncompressed items in addition to compressed posting lists. Earlier
versions of the GIN posting list compression patch worked that way, but not
the one that was committed.
|
|
|
|
|
| |
When modifying a page, we must hold an exclusive lock. A shared lock is
obviously not good enough.
|
|
|
|
|
| |
It makes no difference to the system, but minimizing the differences
between a master and standby makes debugging simpler.
|
|
|
|
| |
A couple of typos from my refactoring of the page deletion patch.
|
|
|
|
| |
Etsuro Fujita
|
|
|
|
| |
Amit Langote
|
|
|
|
|
|
|
| |
Permissions might prevent the existence of the trigger file from being
checked.
Per report from Andres Freund
|
|
|
|
|
|
| |
I mixed up BLCKSZ and XLOG_BLCKSZ when I changed the way the buffer is
allocated a couple of weeks ago. With the default settings, they are both
8k, but they can be changed at compile-time.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This allows squeezing out the unused space in full-page writes. And more
importantly, it can be a useful debugging aid.
In hindsight we should've done this back when GIN was added - we wouldn't
need the 'maxoff' field in the page opaque struct if we had used pd_lower
and pd_upper like on normal pages. But as long as there can be pages in the
index that have been binary-upgraded from pre-9.4 versions, we can't rely
on that, and have to continue using 'maxoff'.
Most of the code churn comes from renaming some macros, now that they're
used on internal pages, too.
This change is completely backwards-compatible and has no effect on pg_upgrade.
|
|
|
|
|
|
|
|
|
|
|
| |
Make sure we throw an error instead of silently doing the wrong thing when
fed a strategy number we don't recognize. Also, in the places that did
already throw an error, spell the error message in a way more consistent
with our message style guidelines.
Per report from Paul Jones. Although this is a bug, it won't occur unless
a superuser tries to do something he shouldn't, so it doesn't seem worth
back-patching.
|
|
|
|
|
| |
In some places, the function assumes the left page is valid, and in others,
it checks if it is valid. Remove all the checks.
|
|
|
|
|
|
| |
The entry B-tree pages all follow the standard page layout. The 9.3 code has
this right. I inadvertently changed this at some point during the big
refactorings in git master.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There were a couple of bugs here. First, if the fuzzy limit was exceeded,
the loop in entryGetItem might drop out too soon if a whole block needs to
be skipped because it's < advancePast ("continue" in a while-loop checks the
loop condition too). Secondly, the loop checked when stepping to a new page
that there is at least one offset on the page < advancePast, but we cannot
rely on that on subsequent calls of entryGetItem, because advancePast might
change in between. That caused the skipping loop to read bogus items in the
TbmIterateResult's offset array.
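A generic illustration of that "continue" pitfall (not the actual
entryGetItem code): because 'continue' re-evaluates the while condition,
the skip is cut short as soon as the condition turns false, leaving the
scan position parked on an item that was never examined.

    #include <stdio.h>

    int
    main(void)
    {
        int items[] = {7, 1, 2, 8, 9};
        int nitems = 5;
        int advancePast = 4;    /* items <= 4 must be skipped, not returned */
        int fuzzy_limit = 2;    /* stop after examining this many items */
        int nexamined = 0;
        int i = 0;

        while (nexamined < fuzzy_limit && i < nitems)
        {
            int item = items[i++];

            nexamined++;        /* may make the condition false ... */

            if (item <= advancePast)
                continue;       /* ... in which case this ends the loop,
                                 * leaving the remaining small items
                                 * unskipped */

            printf("returning %d\n", item);
        }
        printf("loop ended at index %d\n", i);
        return 0;
    }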
First bug and fix by Alexander Korotkov; the second bug was pointed out by
Fabrízio de Royes Mello, using a small variation of Alexander's test query.
|
|
|
|
| |
Tomonari Katsumata
|
|
|
|
|
|
|
|
|
|
| |
Don't reset the rightlink of a page when replaying a page update record.
This was a leftover from pre-hot standby days, when it was not possible to
have scans concurrent with WAL replay. Resetting the right-link was not
necessary back then either, but it was done for the sake of tidiness. But
with hot standby, it's wrong, because a concurrent scan might still need it.
Backpatch to all versions with hot standby, 9.0 and above.
|
|
|
|
| |
This isn't strictly necessary, but helps debugging.
|
|
|
|
|
|
|
|
| |
Forgot to set the incomplete-split flag on the left page half, in redo of a
page split.
Spotted this by comparing the page contents on master and standby, after
inserting/applying each WAL record.
|
|
|
|
|
|
|
|
| |
Also add a regression test for a GIN index with enough items with the same
key, so that a GIN posting tree gets created. Apparently none of the
existing GIN tests were large enough for that.
This code is new, no backpatching required.
|
|
|
|
| |
Andres Freund
|
|
|
|
|
|
| |
It can fail if you run out of memory.
This call was added in 9.3, so backpatch to 9.3 only.
|
|
|
|
|
|
| |
GetVirtualXIDsDelayingChkpt calls palloc, which isn't safe in a critical
section. I thought I covered this case with the exemption for the
checkpointer, but CreateCheckPoint is also called from the startup process.
|
|
|
|
| |
If a palloc in a critical section fails, it becomes a PANIC.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Memory allocation can fail if you run out of memory, and inside a critical
section that will lead to a PANIC. Use conservatively-sized arrays on the
stack instead.
There was previously no explicit limit on the number of pages a GiST split
can produce; it was only limited by the number of LWLocks that can be held
simultaneously (100 at the moment). This patch adds an explicit limit of 75
pages. That should be plenty; a typical split shouldn't produce more than
2-3 page halves.
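A sketch of the pattern with made-up names (this is not the real GiST
code): check the limit before entering the critical section, where
erroring out is still safe, and use a fixed-size stack array inside it so
that no allocation can fail there.

    #include <stdio.h>

    #define MAX_SPLIT_PAGES 75      /* limit from the commit message */

    typedef struct SplitPage
    {
        int blkno;                  /* stand-in for a real page/buffer */
    } SplitPage;

    static void
    log_split(const SplitPage *pages, int npages)
    {
        /* --- begin "critical section": no allocations allowed here --- */
        SplitPage copies[MAX_SPLIT_PAGES];  /* stack array, cannot fail */

        for (int i = 0; i < npages; i++)
            copies[i] = pages[i];
        for (int i = 0; i < npages; i++)
            printf("WAL-logging split page %d\n", copies[i].blkno);
        /* --- end critical section --- */
    }

    int
    main(void)
    {
        SplitPage pages[] = {{1}, {2}, {3}};
        int npages = 3;

        /* Enforce the limit before the critical section starts. */
        if (npages > MAX_SPLIT_PAGES)
        {
            fprintf(stderr, "split produced too many pages (%d)\n", npages);
            return 1;
        }
        log_split(pages, npages);
        return 0;
    }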
The bug has been there forever, but only backpatch down to 9.1. The code
was changed significantly in 9.1, and it doesn't seem worth the risk or
trouble to adapt this for 9.0 and 8.4.
|
|
|
|
|
|
|
|
| |
Inserting a downlink to an internal page clears the incomplete-split flag
of the child's left sibling, so the left sibling's LSN also needs to be
updated and it needs to be marked dirty. The codepath for an insertion got
this right, but the case where the internal node is split because of
inserting the new downlink missed that.
|
|
|
|
|
| |
We don't use backup blocks with GIN vacuum records anymore; the page is
always recreated from scratch.
|
|
|
|
|
|
| |
Inserting a downlink to an internal page clears the incomplete-split flag
of the child's left sibling, so the left sibling's LSN also needs to be
updated.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Inserting (in retail) into the new 9.4 format GIN posting tree created much
larger WAL records than in 9.3. The previous strategy for WAL-logging was
basically to log the whole page on each change, with the exception of
completely unmodified segments up to the first modified one. That was not
too bad when appending to the end of the page, as only the last segment had
to be WAL-logged, but per Fujii Masao's testing, even that produced 2x the
WAL volume that 9.3 did.
The new strategy is to keep track of changes to the posting lists in a more
fine-grained fashion, and also to make the "repacking" code smarter, to avoid
decoding and re-encoding segments unnecessarily.
|
|
|
|
|
| |
It's more descriptive. Also, get rid of the enum, and use #defines instead,
per Greg Stark's suggestion.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If you compile with WAL_DEBUG and enable it with wal_debug=on, we used to
only pass the first XLogRecData entry to the rm_desc routine. I think the
original assumption was that the first XLogRecData entry contains all the
necessary information for the rm_desc routine, but that's a pretty shaky
assumption. At least standby_redo didn't get the memo.
To fix, piece together all the data in a temporary buffer, and pass that to
the rm_desc routine.
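Roughly, the assembly step looks like the following sketch; the struct is a
stripped-down stand-in for the real XLogRecData chain, which carries more
fields.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct XLogRecDataSketch
    {
        struct XLogRecDataSketch *next;   /* next chunk in the chain */
        const char *data;                 /* start of this chunk */
        size_t      len;                  /* length of this chunk */
    } XLogRecDataSketch;

    /* Concatenate every chunk of the chain into one contiguous buffer. */
    static char *
    assemble_record(const XLogRecDataSketch *rdata, size_t *total_len)
    {
        size_t len = 0;
        size_t off = 0;
        char  *buf;

        for (const XLogRecDataSketch *r = rdata; r != NULL; r = r->next)
            len += r->len;

        buf = malloc(len);
        if (buf == NULL)
            return NULL;
        for (const XLogRecDataSketch *r = rdata; r != NULL; r = r->next)
        {
            memcpy(buf + off, r->data, r->len);
            off += r->len;
        }
        *total_len = len;
        return buf;     /* pass this single buffer to the desc routine */
    }

    int
    main(void)
    {
        XLogRecDataSketch second = {NULL, "world", 5};
        XLogRecDataSketch first = {&second, "hello ", 6};
        size_t len;
        char  *buf = assemble_record(&first, &len);

        if (buf == NULL)
            return 1;
        printf("%.*s (%zu bytes)\n", (int) len, buf, len);
        free(buf);
        return 0;
    }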
It's been like this forever, but the patch didn't apply cleanly to
back-branches. Probably wouldn't be hard to fix the conflicts, but it's
not worth the trouble.
|
|
|
|
| |
Backpatch to 9.0, where the XLOG_PARAMETER_CHANGE record was introduced.
|
|
|
|
|
| |
That seems nicer than making it the caller's responsibility to pass a
suitable-sized array. All the callers were just palloc'ing an array anyway.
|
|
|
|
|
| |
The 'cbuffer' variable was left over from an earlier version of the patch to
rewrite the incomplete split handling.
|
|
|
|
| |
Erik Rijkers
|
|
|
|
|
|
|
|
|
| |
equalTupleDescs() neglected both of these ConstrCheck fields, and
CreateTupleDescCopyConstr() neglected ccnoinherit. At this time, the
only known behavior defect resulting from these omissions is constraint
exclusion disregarding a CHECK constraint validated by an ALTER TABLE
VALIDATE CONSTRAINT statement issued earlier in the same transaction.
Back-patch to 9.2, where these fields were introduced.
|
|
|
|
|
|
|
|
|
|
| |
The special feature the XLogInsert slots had over regular LWLocks is the
insertingAt value that was updated atomically with releasing backends
waiting on it. Add new functions to the LWLock API to do that, and replace
the slots with LWLocks. This reduces the amount of duplicated code.
(There's still some duplication, but at least it's all in lwlock.c now.)
Reviewed by Andres Freund.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With this in place, a session blocking behind another one because of
tuple locks will get a context line mentioning the relation name, tuple
TID, and the operation being done on the tuple. For example:
LOG: process 11367 still waiting for ShareLock on transaction 717 after 1000.108 ms
DETAIL: Process holding the lock: 11366. Wait queue: 11367.
CONTEXT: while updating tuple (0,2) in relation "foo"
STATEMENT: UPDATE foo SET value = 3;
Most usefully, the new line is displayed by log entries due to
log_lock_waits, although of course it will be printed by any other log
message as well.
Author: Christian Kruse, some tweaks by Álvaro Herrera
Reviewed-by: Amit Kapila, Andres Freund, Tom Lane, Robert Haas
|
|
|
|
|
|
|
|
|
| |
It is no longer used; none of the resource managers have multi-record
actions that would make it unsafe to perform a restartpoint.
Also don't allow rm_cleanup to write WAL records; that is no longer
required either. Move the call to the rm_cleanup routines to make it more
symmetric with rm_startup.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Splitting a page consists of two separate steps: splitting the child page,
and inserting the downlink for the new right page to the parent. Previously,
we handled the case that you crash in between those steps with a cleanup
routine that ran after WAL recovery had finished and completed the incomplete
split. However, that doesn't help if the page split is interrupted but the
database doesn't crash, so that you don't perform WAL recovery. That could
happen for example if you run out of disk space.
Remove the end-of-recovery cleanup step. Instead, when a page is split, the
left page is marked with a new INCOMPLETE_SPLIT flag, and when the downlink
is inserted to the parent, the flag is cleared again. If an insertion sees
a page with the flag set, it knows that the split was interrupted for some
reason, and inserts the missing downlink before proceeding.
I used the same approach to fix GIN and GiST split algorithms earlier. This
was the last WAL cleanup routine, so we could get rid of that whole
machinery now, but I'll leave that for a separate patch.
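The protocol, as a toy model (all names and structures below are
illustrative, not the real nbtree code):

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct ToyPage
    {
        int  blkno;
        bool incomplete_split;  /* set on the left half of a split until the
                                 * downlink for the right half is in the
                                 * parent */
    } ToyPage;

    static ToyPage parent = {1, false};
    static ToyPage left_child = {2, false};
    static ToyPage right_child = {3, false};

    /* Step 2 of a split: insert the downlink and clear the flag. */
    static void
    insert_downlink(ToyPage *par, ToyPage *left)
    {
        printf("inserting downlink into page %d\n", par->blkno);
        left->incomplete_split = false;
    }

    /* Step 1 of a split: mark the left half incomplete. */
    static void
    split_page(ToyPage *left, ToyPage *right)
    {
        printf("splitting page %d, new right sibling %d\n",
               left->blkno, right->blkno);
        left->incomplete_split = true;
        /* Normally step 2 follows immediately; if we are interrupted here,
         * the flag stays set and a later insertion finishes the job. */
    }

    static void
    insert_tuple(ToyPage *target)
    {
        if (target->incomplete_split)
        {
            /* Finish the interrupted split before our own insertion. */
            insert_downlink(&parent, target);
        }
        printf("inserting tuple into page %d\n", target->blkno);
    }

    int
    main(void)
    {
        split_page(&left_child, &right_child);
        /* Pretend we ran out of disk space here, so the downlink was never
         * inserted.  The next insertion detects and completes the split. */
        insert_tuple(&left_child);
        return 0;
    }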
Reviewed by Peter Geoghegan.
|
|
|
|
| |
While we're at it, also improve comments in ginlogic.c.
|