| Commit message (Collapse) | Author | Age |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As pointed out by Fujii Masao, we weren't quite there on a standby
behaving sanely: first because we were failing to acquire the correct
state in the case where no XLOG_PARAMETER_CHANGE message was sent
(because a checkpoint had already happened after the setting was changed
in the master, and then the standby was restarted); and second because
promoting the standby with the feature enabled failed to activate it if
the master had the feature disabled.
This patch fixes both those misbehaviors hopefully without
re-introducing any old problems.
Also change the hint emitted in a standby together with the error
message about the feature being disabled, to make it point out that the
place to chance the setting is the master. Otherwise, if the setting is
already enabled in the standby, it is very confusing to have it say that
the setting must be enabled ...
Authors: Álvaro Herrera, Petr Jelínek.
Backpatch to 9.5.
|
|
|
|
| |
Amit Langote
|
|
|
|
|
|
|
|
|
|
| |
At least one of the names was, due to a function renaming late in the
development of ON CONFLICT, wrong. Since including function names in
error messages is against the message style guide anyway, remove them
from the messages.
Discussion: CAM3SWZT8paz=usgMVHm0XOETkQvzjRtAUthATnmaHQQY0obnGw@mail.gmail.com
Backpatch: 9.5, where ON CONFLICT was introduced
|
| |
|
|
|
|
|
|
|
|
|
|
| |
This makes it significantly easier to identify these lwlocks in
LWLOCK_STATS or Trace_lwlocks output. It's also arguably better
from a modularity standpoint, since lwlock.c no longer needs to
know anything about the LWLock needs of the higher-level SLRU
facility.
Ildus Kurbangaliev, reviewd by Álvaro Herrera and by me.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Up until now, the total amount of data that could be passed to a
background worker at startup was one datum, which can be a small as
4 bytes on some systems. That's enough to pass a dsm_handle or an
array index, but not much else. Add a bgw_extra flag to the
BackgroundWorker struct, allowing up to 128 bytes to be passed to
a new worker on any platform.
Use this to fix a problem I recently discovered with the parallel
context machinery added in 9.5: the master assigns each worker an
array index, and each worker subsequently assigns itself an array
index, and there's nothing to guarantee that the two sets of indexes
match, leading to chaos.
Normally, I would not back-patch the change to add bgw_extra, since it
is basically a feature addition. However, since 9.5 is still in beta
and there seems to be no other sensible way to repair the broken
parallel context machinery, back-patch to 9.5. Existing background
worker code can ignore the bgw_extra field without a problem, but
might need to be recompiled since the structure size has changed.
Report and patch by me. Review by Amit Kapila.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
On insert the CheckForSerializableConflictIn() test was performed
before the page(s) which were going to be modified had been locked
(with an exclusive buffer content lock). If another process
acquired a relation SIReadLock on the heap and scanned to a page on
which an insert was going to occur before the page was so locked,
a rw-conflict would be missed, which could allow a serialization
anomaly to be missed. The window between the check and the page
lock was small, so the bug was generally not noticed unless there
was high concurrency with multiple processes inserting into the
same table.
This was reported by Peter Bailis as bug #11732, by Sean Chittenden
as bug #13667, and by others.
The race condition was eliminated in heap_insert() by moving the
check down below the acquisition of the buffer lock, which had been
the very next statement. Because of the loop locking and unlocking
multiple buffers in heap_multi_insert() a check was added after all
inserts were completed. The check before the start of the inserts
was left because it might avoid a large amount of work to detect a
serialization anomaly before performing the all of the inserts and
the related WAL logging.
While investigating this bug, other SSI bugs which were even harder
to hit in practice were noticed and fixed, an unnecessary check
(covered by another check, so redundant) was removed from
heap_update(), and comments were improved.
Back-patch to all supported branches.
Kevin Grittner and Thomas Munro
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit b0b0d84b3d663a148022e900ebfc164284a95f55 purported to make it
possible to relaunch workers using the same parallel context, but it had
an unpleasant race condition: we might reinitialize after the workers
have sent their last control message but before they have dettached the
DSM, leaving to crashes. Repair by introducing a new ParallelContext
operation, ReinitializeParallelDSM.
Adjust execParallel.c to use this new support, so that we can rescan a
Gather node by relaunching workers but without needing to recreate the
DSM.
Amit Kapila, with some adjustments by me. Extracted from latest parallel
sequential scan patch.
|
|
|
|
|
| |
Message style, plurals, quoting, spelling, consistency with similar
messages
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A bug in the original free space computation made it possible to
return a page which wasn't actually able to fit the item. Since the
insertion code isn't prepared to deal with PageAddItem failing, a PANIC
resulted ("failed to add BRIN tuple [to new page]"). Add a macro to
encapsulate the correct computation, and use it in
brin_getinsertbuffer's callers before calling that routine, to raise an
early error.
I became aware of the possiblity of a problem in this area while working
on ccc4c074994d734. There's no archived discussion about it, but it's
easy to reproduce a problem in the unpatched code with something like
CREATE TABLE t (a text);
CREATE INDEX ti ON t USING brin (a) WITH (pages_per_range=1);
for length in `seq 8000 8196`
do
psql -f - <<EOF
TRUNCATE TABLE t;
INSERT INTO t VALUES ('z'), (repeat('a', $length));
EOF
done
Backpatch to 9.5, where BRIN was introduced.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Further tweak commit_ts.c so that on a standby the state is completely
consistent with what that in the master, rather than behaving
differently in the cases that the settings differ. Now in standby and
master the module should always be active or inactive in lockstep.
Author: Petr Jelínek, with some further tweaks by Álvaro Herrera.
Backpatch to 9.5, where commit timestamps were introduced.
Discussion: http://www.postgresql.org/message-id/5622BF9D.2010409@2ndquadrant.com
|
|
|
|
| |
CharSyam
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Using this API, one backend can set up a ParallelHeapScanDesc to
which multiple backends can then attach. Each tuple in the relation
will be returned to exactly one of the scanning backends. Only
forward scans are supported, and rescans must be carefully
coordinated.
This is not exposed to the planner or executor yet.
The original version of this code was written by me. Amit Kapila
reviewed it, tested it, and improved it, including adding support for
synchronized scans, per review comments from Jeff Davis. Extensive
testing of this and related patches was performed by Haribabu Kommi.
Final cleanup of this patch by me.
|
|
|
|
|
|
|
|
|
| |
This may allow some callers to avoid the overhead involved in tearing
down a parallel context and then setting up a new one, which means
releasing the DSM and then allocating and populating a new one. I
suspect we'll want to revise the Gather node to make use of this new
capability, but even if not it may be useful elsewhere and requires
very little additional code.
|
|
|
|
|
|
|
|
|
|
|
|
| |
In order for this to be safe, the code which hands true serializability
will need to taught that the SIRead locks taken by a parallel worker
pertain to the same transaction as those taken by the parallel leader.
Some further changes may be needed as well. Until the necessary
adaptations are made, don't generate parallel plans in serializable
mode, and if a previously-generated parallel plan is used after
serializable mode has been activated, run it serially.
This fixes a bug in commit 7aea8e4f2daa4b39ca9d1309a0c4aadb0f7ed81b.
|
|
|
|
|
|
|
|
|
| |
check_role() tries to verify that the user has permission to become the
requested role, but this is inappropriate in a parallel worker, which
needs to exactly recreate the master's authorization settings. So skip
the check in that case.
This fixes a bug in commit 924bcf4f16d54c55310b28f77686608684734f42.
|
|
|
|
|
|
|
|
|
|
| |
Starting a parallel worker transaction changes our notion of which XIDs
are in-progress or committed, and our notion of the current command
counter ID. Therefore, our view of these caches prior to starting
this transaction may no longer valid. Defend against that by clearing
them.
This fixes a bug in commit 924bcf4f16d54c55310b28f77686608684734f42.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 924bcf4f16d54c55310b28f77686608684734f42 failed to enforce
parallel mode checks during the commit of a parallel worker, because
we exited parallel mode prior to ending the transaction so that we
could pop the active snapshot. Re-establish parallel mode during
parallel worker commit. Without this, it's far too easy for unsafe
actions during the pre-commit sequence to crash the server instead of
hitting the error checks as intended.
Just to be extra paranoid, adjust a couple of the sanity checks in
xact.c to check not only IsInParallelMode() but also
IsParallelWorker().
|
|
|
|
|
|
|
|
| |
Commit 924bcf4f16d54c55310b28f77686608684734f42 correctly forbade
parallel workers to modify the command counter while in parallel mode,
but it inexplicably neglected to actually transfer the current command
counter from leader to workers. This can result in the workers seeing
a different set of tuples from the leader, which is bad. Repair.
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 2bd9e412f92bc6a68f3e8bcb18e04955cc35001d introduced a mechanism
for relaying protocol messages from a background worker to another
backend via a shm_mq. However, there was no provision for shutting
down the communication channel. Therefore, a protocol message sent
late in the shutdown sequence, such as a DEBUG message resulting from
cranking up log_min_messages, could crash the server. To fix, install
an on_dsm_detach callback that disables sending messages to the shm_mq
when the associated DSM is detached.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In 020235a5754 I lowered the autovacuum_*freeze_max_age minimums to
allow for easier testing of wraparounds. I did not touch the
corresponding per-table limits. While those don't matter for the purpose
of wraparound, it seems more consistent to lower them as well.
It's noteworthy that the previous reloption lower limit for
autovacuum_multixact_freeze_max_age was too high by one magnitude, even
before 020235a5754.
Discussion: 26377.1443105453@sss.pgh.pa.us
Backpatch: back to 9.0 (in parts), like the prior patch
|
|
|
|
| |
Bug noticed by Fujii Masao
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Module initialization was still not completely correct after commit
6b61955135e9, per crash report from Takashi Ohnishi. To fix, instead of
trying to monkey around with the value of the GUC setting directly, add
a separate boolean flag that enables the feature on a standby, but only
for the startup (recovery) process, when it sees that its master server
has the feature enabled.
Discussion: http://www.postgresql.org/message-id/ca44c6c7f9314868bdc521aea4f77cbf@MP-MSGSS-MBX004.msg.nttdata.co.jp
Also change the deactivation routine to delete all segment files rather
than leaving the last one around. (This doesn't need separate
WAL-logging, because on recovery we execute the same deactivation
routine anyway.)
In passing, clean up the code structure somewhat, particularly so that
xlog.c doesn't know so much about when to activate/deactivate the
feature.
Thanks to Fujii Masao for testing and Petr Jelínek for off-list discussion.
Back-patch to 9.5, where commit_ts was introduced.
|
|
|
|
|
|
| |
If a transaction or subtransaction creates a ParallelContext but ends
without calling InitializeParallelDSM, the previous code would
seg fault. Fix that.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There are three main changes here:
1. No longer cause a start failure in a standby if the feature is
disabled in postgresql.conf but enabled in the master. This reverts one
part of commit 4f3924d9cd43; what we keep is the ability of the standby
to activate/deactivate the module (which includes creating and removing
segments as appropriate) during replay of such actions in the master.
2. Replay WAL records affecting commitTS even if the feature is
disabled. This means the standby will always have the same state as the
master after replay.
3. Have COMMIT PREPARE record the transaction commit time as well. We
were previously only applying it in the normal transaction commit path.
Author: Petr Jelínek
Discussion: http://www.postgresql.org/message-id/CAHGQGwHereDzzzmfxEBYcVQu3oZv6vZcgu1TPeERWbDc+gQ06g@mail.gmail.com
Discussion: http://www.postgresql.org/message-id/CAHGQGwFuzfO4JscM9LCAmCDCxp_MfLvN4QdB+xWsS-FijbjTYQ@mail.gmail.com
Additionally, I cleaned up nearby code related to replication origins,
which I found a bit hard to follow, and fixed a couple of typos.
Backpatch to 9.5, where this code was introduced.
Per bug reports from Fujii Masao and subsequent discussion.
|
|
|
|
|
|
|
| |
It was introduced alongside replication origins, by commit
5aa2350426c, so backpatch to 9.5.
Pointed out by Fujii Masao
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In 9.5 and master there is no need to support legacy truncation. This is
just committed separately to make it easier to backpatch the WAL logged
multixact truncation to 9.3 and 9.4 if we later decide to do so.
I bumped master's magic from 0xD086 to 0xD088 and 9.5's from 0xD085 to
0xD087 to avoid 9.5 reusing a value that has been in use on master while
keeping the numbers increasing between major versions.
Discussion: 20150621192409.GA4797@alap3.anarazel.de
Backpatch: 9.5
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The fact that multixact truncations are not WAL logged has caused a fair
share of problems. Amongst others it requires to do computations during
recovery while the database is not in a consistent state, delaying
truncations till checkpoints, and handling members being truncated, but
offset not.
We tried to put bandaids on lots of these issues over the last years,
but it seems time to change course. Thus this patch introduces WAL
logging for multixact truncations.
This allows:
1) to perform the truncation directly during VACUUM, instead of delaying it
to the checkpoint.
2) to avoid looking at the offsets SLRU for truncation during recovery,
we can just use the master's values.
3) simplify a fair amount of logic to keep in memory limits straight,
this has gotten much easier
During the course of fixing this a bunch of additional bugs had to be
fixed:
1) Data was not purged from memory the member's SLRU before deleting
segments. This happened to be hard or impossible to hit due to the
interlock between checkpoints and truncation.
2) find_multixact_start() relied on SimpleLruDoesPhysicalPageExist - but
that doesn't work for offsets that haven't yet been flushed to
disk. Add code to flush the SLRUs to fix. Not pretty, but it feels
slightly safer to only make decisions based on actual on-disk state.
3) find_multixact_start() could be called concurrently with a truncation
and thus fail. Via SetOffsetVacuumLimit() that could lead to a round
of emergency vacuuming. The problem remains in
pg_get_multixact_members(), but that's quite harmless.
For now this is going to only get applied to 9.5+, leaving the issues in
the older branches in place. It is quite possible that we need to
backpatch at a later point though.
For the case this gets backpatched we need to handle that an updated
standby may be replaying WAL from a not-yet upgraded primary. We have to
recognize that situation and use "old style" truncation (i.e. looking at
the SLRUs) during WAL replay. In contrast to before, this now happens in
the startup process, when replaying a checkpoint record, instead of the
checkpointer. Doing truncation in the restartpoint is incorrect, they
can happen much later than the original checkpoint, thereby leading to
wraparound. To avoid "multixact_redo: unknown op code 48" errors
standbys would have to be upgraded before primaries.
A later patch will bump the WAL page magic, and remove the legacy
truncation codepaths. Legacy truncation support is just included to make
a possible future backpatch easier.
Discussion: 20150621192409.GA4797@alap3.anarazel.de
Reviewed-By: Robert Haas, Alvaro Herrera, Thomas Munro
Backpatch: 9.5 for now
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit e95680832854cf300e64c10de9cc2f586df558e8 introduces adding pages
to FSM for ordinary insert, but autoanalyze was able just cleanup
pending list without adding to FSM.
Also fix double call of IndexFreeSpaceMapVacuum() during ginvacuumcleanup()
Report from Fujii Masao
Patch by me
Review by Jeff Janes
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 013ebc0a7b7ea9c1b1ab7a3d4dd75ea121ea8ba7 introduces microvacuum for
GiST, deletetion of tuple marked LP_DEAD uses IndexPageMultiDelete while
recovery code uses IndexPageTupleDelete in loop. This causes a difference
in offset numbers of tuples to delete. Patch introduces usage of
IndexPageMultiDelete in GiST except gistplacetopage() where only one tuple is
deleted at once. That also slightly improve performance, because
IndexPageMultiDelete is more effective.
Patch changes WAL format, so bump wal page magic.
Bug report from Jeff Janes
Diagnostic and patch by Anastasia Lubennikova and me
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch changes the log message which is logged when the server
successfully renames backup_label file to *.old but fails to rename
tablespace_map file during the shutdown. Previously the WARNING
message "online backup mode was not canceled" was logged in that case.
However this message is confusing because the backup mode is treated
as canceled whenever backup_label is successfully renamed. So this
commit makes the server log the message "online backup mode canceled"
in that case.
Also this commit changes errdetail messages so that they follow the
error message style guide.
Back-patch to 9.5 where tablespace_map file is introduced.
Original patch by Amit Kapila, heavily modified by me.
|
|
|
|
|
|
|
|
| |
I think this particular branch is actually dead, but the analysis to
prove that is not trivial, so instead take the weasel way.
Reported by Jinyu Zhang
Backpatch to 9.5, where BRIN was introduced.
|
|
|
|
| |
Declaration of varibale inside ÓÝ×Õ
|
|
|
|
|
|
|
|
|
|
|
| |
Mark index tuple as dead if it's pointed by kill_prior_tuple during
ordinary (search) scan and remove it during insert process if there is no
enough space for new tuple to insert. This improves select performance
because index will not return tuple marked as dead and improves insert
performance because it reduces number of page split.
Anastasia Lubennikova <a.lubennikova@postgrespro.ru> with
minor editorialization by me
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit makes postmaster forcibly remove the files signaling
a standby promotion request. Otherwise, the existence of those files
can trigger a promotion too early, whether a user wants that or not.
This removal of files is usually unnecessary because they can exist
only during a few moments during a standby promotion. However
there is a race condition: if pg_ctl promote is executed and creates
the files during a promotion, the files can stay around even after
the server is brought up to new master. Then, if new standby starts
by using the backup taken from that master, the files can exist
at the server startup and should be removed in order to avoid
an unexpected promotion.
Back-patch to 9.1 where promote signal file was introduced.
Problem reported by Feike Steenbergen.
Original patch by Michael Paquier, modified by me.
Discussion: 20150528100705.4686.91426@wrigleys.postgresql.org
|
|
|
|
|
|
|
|
|
|
| |
Per discussion, nowadays it is possible to have tablespaces that have
wildly different I/O characteristics from others. Setting different
effective_io_concurrency parameters for those has been measured to
improve performance.
Author: Julien Rouhaud
Reviewed by: Andres Freund
|
|
|
|
|
|
|
|
|
|
|
| |
Cleanup process could be called by ordinary insert/update and could take a lot
of time. Add vacuum_delay_point() to make this process interruptable. Under
vacuum this call will also throttle a vacuum process to decrease system load,
called from insert/update it will not throttle, and that reduces a latency.
Backpatch for all supported branches.
Jeff Janes <jeff.janes@gmail.com>
|
|
|
|
|
|
|
|
|
| |
Add pages deleted from GIN's pending list during cleanup to free space map
immediately. Clean up process could be initiated by ordinary insert but adding
page to FSM might occur only at vacuum. On some workload like never-vacuumed
insert-only tables it could cause a huge bloat.
Jeff Janes <jeff.janes@gmail.com>
|
|
|
|
| |
Oskari Saarenmaa. Backpatch to stable branches where applicable.
|
|
|
|
|
|
|
|
|
|
|
| |
If the number of heap blocks is not multiples of pages per range, the
summarizing produces wrong summary information for the last brin index
tuple while vacuuming.
Problem reported by Tatsuo Ishii and fixed by Amit Langote.
Discussion at "[HACKERS] BRIN INDEX value (message id :20150903.174935.1946402199422994347.t-ishii@sraoss.co.jp)
Backpatched to 9.5 in which brin index was added.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Formerly, we treated only portals created in the current subtransaction as
having failed during subtransaction abort. However, if the error occurred
while running a portal created in an outer subtransaction (ie, a cursor
declared before the last savepoint), that has to be considered broken too.
To allow reliable detection of which ones those are, add a bookkeeping
field to struct Portal that tracks the innermost subtransaction in which
each portal has actually been executed. (Without this, we'd end up
failing portals containing functions that had called the subtransaction,
thereby breaking plpgsql exception blocks completely.)
In addition, when we fail an outer-subtransaction Portal, transfer its
resources into the subtransaction's resource owner, so that they're
released early in cleanup of the subxact. This fixes a problem reported by
Jim Nasby in which a function executed in an outer-subtransaction cursor
could cause an Assert failure or crash by referencing a relation created
within the inner subtransaction.
The proximate cause of the Assert failure is that AtEOSubXact_RelationCache
assumed it could blow away a relcache entry without first checking that the
entry had zero refcount. That was a bad idea on its own terms, so add such
a check there, and to the similar coding in AtEOXact_RelationCache. This
provides an independent safety measure in case there are still ways to
provoke the situation despite the Portal-level changes.
This has been broken since subtransactions were invented, so back-patch
to all supported branches.
Tom Lane and Michael Paquier
|
|
|
|
|
|
|
|
|
| |
The setting values of some parameters including max_worker_processes
must be equal to or higher than the values on the master. However,
previously max_worker_processes was not listed as such parameter
in the document. So this commit adds it to that list.
Back-patch to 9.4 where max_worker_processes was added.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, in-memory posting list during GIN build process is limited 1GB
because of using repalloc. The patch replaces call of repalloc to repalloc_huge.
It increases limit of posting list from 180 millions
(1GB / sizeof(ItemPointerData)) to 4 billions limited by maxcount/count fields
in GinEntryAccumulator and subsequent calls. Check added.
Also, fix accounting of allocatedMemory during build to prevent integer
overflow with maintenance_work_mem > 4GB.
Robert Abraham <robert.abraham86@googlemail.com> with additions by me
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The code had bugs that would cause crashes if NULL was passed as that
argument (originally intended to mean not to bother returning its
value), and after inspection it turns out that nothing seems interested
in the case that *ts is NULL anyway. Therefore, remove the partial
checks intended to support that case.
Author: Michael Paquier
though I didn't include a proposed Assert.
Backpatch to 9.5.
|
|
|
|
|
|
|
| |
This fixes a bunch of somewhat pedantic warnings with new
compilers. Since by far the majority of other functions definitions use
the (void) style it just seems to be consistent to do so as well in the
remaining few places.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Reduce lock levels down to ShareUpdateExclusiveLock for all autovacuum-related
relation options when setting them using ALTER TABLE.
Add infrastructure to allow varying lock levels for relation options in later
patches. Setting multiple options together uses the highest lock level required
for any option. Works for both main and toast tables.
Fabrízio Mello, reviewed by Michael Paquier, mild edit and additional regression
tests from myself
|
|
|
|
|
|
|
|
|
| |
As complained by clang, reported by Andres Freund. Brown paper bag bug
in ccc4c074994d.
Add some comments, too.
Backpatch to 9.5, like that one.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In some corner cases, it is possible for the BRIN index relation to be
extended by brin_getinsertbuffer but the new page not be used
immediately for anything by its callers; when this happens, the page is
initialized and the FSM is updated (by brin_getinsertbuffer) with the
info about that page, but these actions are not WAL-logged. A later
index insert/update can use the page, but since the page is already
initialized, the initialization itself is not WAL-logged then either.
Replay of this sequence of events causes recovery to fail altogether.
There is a related corner case within brin_getinsertbuffer itself, in
which we extend the relation to put a new index tuple there, but later
find out that we cannot do so, and do not return the buffer; the page
obtained from extension is not even initialized. The resulting page is
lost forever.
To fix, shuffle the code so that initialization is not the
responsibility of brin_getinsertbuffer anymore, in normal cases;
instead, the initialization is done by its callers (brin_doinsert and
brin_doupdate) once they're certain that the page is going to be used.
When either those functions determine that the new page cannot be used,
before bailing out they initialize the page as an empty regular page,
enter it in FSM and WAL-log all this. This way, the page is usable for
future index insertions, and WAL replay doesn't find trying to insert
tuples in pages whose initialization didn't make it to the WAL. The
same strategy is used in brin_getinsertbuffer when it cannot return the
new page.
Additionally, add a new step to vacuuming so that all pages of the index
are scanned; whenever an uninitialized page is found, it is initialized
as empty and WAL-logged. This closes the hole that the relation is
extended but the system crashes before anything is WAL-logged about it.
We also take this opportunity to update the FSM, in case it has gotten
out of date.
Thanks to Heikki Linnakangas for finding the problem that kicked some
additional analysis of BRIN page assignment code.
Backpatch to 9.5, where BRIN was introduced.
Discussion: https://www.postgresql.org/message-id/20150723204810.GY5596@postgresql.org
|