| Commit message (Collapse) | Author | Age |
|
|
|
|
|
|
|
| |
User-defined consistent functions believes the check array
contains at least one true element which was not a true for
scanning pending list.
Per report from Yury Don <yura@vpcit.ru>
|
|
|
|
|
|
| |
that it can scribble on scan->xs_ctup.t_self while following HOT chains,
so we can't rely on that to stay valid between hashgettuple() calls.
Introduce a private variable in HashScanOpaque, instead.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
hash indexes keep entries sorted by hash value. First, the original plans for
concurrency assumed that insertions would happen only at the end of a page,
which is no longer true; this could cause scans to transiently fail to find
index entries in the presence of concurrent insertions. We can compensate
by teaching scans to re-find their position after re-acquiring read locks.
Second, neither the bucket split nor the bucket compaction logic had been
fixed to preserve hashvalue ordering, so application of either of those
processes could lead to permanent corruption of an index, in the sense
that searches might fail to find entries that are present.
This patch fixes the split and compaction logic to preserve hashvalue
ordering, but it cannot do anything about pre-existing corruption. We will
need to recommend reindexing all hash indexes in the 8.4.2 release notes.
To buy back the performance loss hereby induced in split and compaction,
fix them to use PageIndexMultiDelete instead of retail PageIndexDelete
operations. We might later want to do something with qsort'ing the
page contents rather than doing a binary search for each insertion,
but that seemed more invasive than I cared to risk in a back-patch.
Per bug #5157 from Jeff Janes and subsequent investigation.
|
|
|
|
|
|
|
|
|
|
|
| |
tuple size limit. Improve the error message for index-tuple-too-large
so that it includes the actual size, the limit, and the index name.
Sync with the btree occurrences of the same error.
Back-patch to 8.4 because it appears that the out-of-sync problem
is occurring in the field.
Teodor and Tom
|
|
|
|
|
|
| |
only for secondary page split (i.e. for non-first columns of index)
Patch by Paul Ramsey <pramsey@opengeo.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In practice these mistakes were always masked when full_page_writes was on,
because XLogInsert would always choose to log the full page, and then
ginRedoInsertListPage wouldn't try to do anything. But with full_page_writes
off a WAL replay failure was certain.
The GIN_INSERT_LISTPAGE record type could probably be eliminated entirely
in favor of using XLOG_HEAP_NEWPAGE, but I refrained from doing that now
since it would have required a significantly more invasive patch.
In passing do a little bit of code cleanup, including making the accounting
for free space on GIN list pages more precise. (This wasn't a bug as the
errors were always in the conservative direction.)
Per report from Simon. Back-patch to 8.4 which contains the identical code.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
of checkpoint. Although the checkpoint has been written to WAL at that point
already, so that all data is safe, and we'll retry removing the WAL segment at
the next checkpoint, if such a failure persists we won't be able to remove any
other old WAL segments either and will eventually run out of disk space. It's
better to treat the failure as non-fatal, and move on to clean any other WAL
segment and continue with any other end-of-checkpoint cleanup.
We don't normally expect any such failures, but on Windows it can happen with
some anti-virus or backup software that lock files without FILE_SHARE_DELETE
flag.
Also, the loop in pgrename() to retry when the file is locked was broken. If a
file is locked on Windows, you get ERROR_SHARE_VIOLATION, not
ERROR_ACCESS_DENIED, at least on modern versions. Fix that, although I left
the check for ERROR_ACCESS_DENIED in there as well (presumably it was correct
in some environment), and added ERROR_LOCK_VIOLATION to be consistent with
similar checks in pgwin32_open(). Reduce the timeout on the loop from 30s to
10s, on the grounds that since it's been broken, we've effectively had a
timeout of 0s and no-one has complained, so a smaller timeout is actually
closer to the old behavior. A longer timeout would mean that if recycling a
WAL file fails because it's locked for some reason, InstallXLogFileSegment()
will hold ControlFileLock for longer, potentially blocking other backends, so
a long timeout isn't totally harmless.
While we're at it, set errno correctly in pgrename().
Backpatch to 8.2, which is the oldest version supported on Windows. The xlog.c
changes would make sense on other platforms and thus on older versions as
well, but since there's no such locking issues on other platforms, it's not
worth it.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
file handle on it, the file goes into "pending deletion" state where it
still shows up in directory listing, but isn't accessible otherwise. That
confuses RemoveOldXLogFiles(), making it think that the file hasn't been
archived yet, while it actually was, and it was deleted along with the .done
file.
Fix that by renaming the file with ".deleted" extension before deleting it.
Also check the return value of rename() and unlink(), so that if the removal
fails for any reason (e.g another process is holding the file locked), we
don't delete the .done file until the WAL file is really gone.
Backpatch to 8.2, which is the oldest version supported on Windows.
|
|
|
|
|
|
|
|
| |
In the original coding, setting a single reloption would cause default
values to be used for all the other reloptions. This is a problem
particularly for autovacuum reloptions.
Itagaki Takahiro
|
|
|
|
|
|
|
|
|
|
| |
was incorrectly initialized with timeline ID 0. That rendered the WAL page
unrecoverable, making a subsequent archive recovery stop at that point.
ThisTimeLineID needs to be initialized before calling AdvanceXLInsertBuffer().
This fixes bug #5011 reported by James Bardin. Backpatch to 8.4, as the bug
was introduced by the changes to use of bgwriter for writing the
end-of-archive-recovery checkpoint. Patch by Tom Lane.
|
|
|
|
|
|
|
|
|
|
|
| |
"all tuples visible" flag in heap page headers. The flag update *must*
be applied before calling XLogInsert, but heap_update and the tuple
moving routines in VACUUM FULL were ignoring this rule. A crash and
replay could therefore leave the flag incorrectly set, causing rows
to appear visible in seqscans when they should not be. This might explain
recent reports of data corruption from Jeff Ross and others.
In passing, do a bit of editorialization on comments in visibilitymap.c.
|
|
|
|
| |
Per comment from Simon.
|
|
|
|
|
| |
appears to explain the recent reports of "PANIC: cannot make new WAL entries
during recovery".
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
archive recovery. Invent a separate state variable and inquiry function
for XLogInsertAllowed() to clarify some tests and make the management of
writing the end-of-recovery checkpoint less klugy. Fix several places
that were incorrectly testing InRecovery when they should be looking at
RecoveryInProgress or XLogInsertAllowed (because they will now be executed
in the bgwriter not startup process). Clarify handling of bad LSNs passed
to XLogFlush during recovery. Use a spinlock for setting/testing
SharedRecoveryInProgress. Improve quite a lot of comments.
Heikki and Tom
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
during it:
When bgwriter is active, the startup process can't perform mdsync() correctly
because it won't see the fsync requests accumulated in bgwriter's private
pendingOpsTable. Therefore make bgwriter responsible for the end-of-recovery
checkpoint as well, when it's active.
When bgwriter is active (= archive recovery), the startup process must not
accumulate fsync requests to its own pendingOpsTable, since bgwriter won't
see them there when it performs restartpoints. Make startup process drop its
pendingOpsTable when bgwriter is launched to avoid that.
Update minimum recovery point one last time when leaving archive recovery.
It won't be updated by the end-of-recovery checkpoint because XLogFlush()
sees us as out of recovery already.
This fixes bug #4879 reported by Fujii Masao.
|
|
|
|
|
|
| |
acting like runs inside WAL recovery, but it doesn't. I must've copy-pasted
this from a redo-function in the relation forks patch. Noticed by Tom Lane
while he was looking through callers of smgrdounlink().
|
| |
|
|
|
|
| |
visibilitymap.c by me.
|
|
|
|
| |
provided by Andrew.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
node starts from the same place as the first scan did. This avoids surprising
behavior of scrollable and WITH HOLD cursors, as seen in Mark Kirkwood's bug
report of yesterday.
It's not entirely clear whether a rescan should be forced to drop out of the
syncscan mode, but for the moment I left the code behaving the same on that
point. Any change there would only be a performance and not a correctness
issue, anyway.
Back-patch to 8.3, since the unstable behavior was created by the syncscan
patch.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
behavior in cases where we don't know the heap tuple count accurately; in
particular partial vacuum, but this also makes the API a bit more useful
for ANALYZE. This patch adds "estimated_count" flags to both structs so
that an approximate count can be flagged as such, and adjusts the logic
so that approximate counts are not used for updating pg_class.reltuples.
This fixes my previous complaint that VACUUM was putting ridiculous values
into pg_class.reltuples for indexes. The actual impact of that bug is
limited, because the planner only pays attention to reltuples for an index
if the index is partial; which probably explains why beta testers hadn't
noticed a degradation in plan quality from it. But it needs to be fixed.
The whole thing is a bit messy and should be redesigned in future, because
reltuples now has the potential to drift quite far away from reality when
a long period elapses with no non-partial vacuums. But this is as good as
it's going to get for 8.4.
|
|
|
|
|
|
|
|
|
| |
is supposed to remove duplicate heap TIDs, we have to be sure to reduce the
tuple size and posting-item count accordingly in addItemPointersToTuple().
Failing to do so resulted in the effective injection of garbage TIDs into the
index contents, ie, whatever happened to be in the memory palloc'd for the
new tuple. I'm not sure that this fully explains the index corruption
reported by Tatsuo Ishii, but the test case I'm using no longer fails.
|
|
|
|
|
|
|
|
| |
symbolic links with the -l option, and as Fujii Masao pointed out we ended up
overwriting files in the archive directory before this patch. Patch by
Aidan Van Dyk, Fujii Masao and me.
Backpatch to 8.3, where pg_standby was introduced.
|
|
|
|
|
|
| |
all transactions are archived.
Original patch by Guillaume Smet.
|
|
|
|
| |
relopt_kind value in add_reloption_kind(). Per Zdenek Kotala.
|
|
|
|
| |
minimal regression test coverage for matchPartialInPendingList().
|
|
|
|
|
|
|
| |
copies?) to ensure they really don't run proc_exit/shmem_exit callbacks,
as was intended. I broke this behavior recently by installing atexit
callbacks without thinking about the one case where we truly don't want
to run those callback functions. Noted in an example from Dave Page.
|
|
|
|
| |
Per suggestion of Jaime Casanova.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
is run at the end of archive recovery, providing a chance to do external
cleanup. Modify pg_standby so that it no longer removes the trigger file,
that is to be done using the recovery_end_command now.
Provide a "smart" failover mode in pg_standby, where we don't fail over
immediately, but only after recovering all unapplied WAL from the archive.
That gives you zero data loss assuming all WAL was archived before
failover, which is what most users of pg_standby actually want.
recovery_end_command by Simon Riggs, pg_standby changes by Fujii Masao and
myself.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
redirecting libxml's allocations into a Postgres context. Instead, just let
it use malloc directly, and add PG_TRY blocks as needed to be sure we release
libxml data structures in error recovery code paths. This is ugly but seems
much more likely to play nicely with third-party uses of libxml, as seen in
recent trouble reports about using Perl XML facilities in pl/perl and bug
#4774 about contrib/xml2.
I left the code for allocation redirection in place, but it's only
built/used if you #define USE_LIBXMLCONTEXT. This is because I found it
useful to corral libxml's allocations in a palloc context when hunting
for libxml memory leaks, and we're surely going to have more of those
in the future with this type of approach. But we don't want it turned on
in a normal build because it breaks exactly what we need to fix.
I have not re-indented most of the code sections that are now wrapped
by PG_TRY(); that's for ease of review. pg_indent will fix it.
This is a pre-existing bug in 8.3, but I don't dare back-patch this change
until it's gotten a reasonable amount of field testing.
|
|
|
|
|
|
|
|
|
| |
errors when tables are concurrently dropped. To do this we must take lock
on each relation before we check its privileges. The old code was trying
to do that the other way around, which is a bit pointless when there are lots
of other commands that lock relations before checking privileges. I did keep
it checking each relation's privilege before locking the next relation, which
is a detail that ALTER TABLE isn't too picky about.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
you can end up with an unrecoverable backup if you start a new base backup
right after finishing archive recovery. In that scenario, the redo pointer of
the checkpoint that pg_start_backup() writes points to the XLOG segment where
the timeline-changing end-of-archive-recovery checkpoint is. The beginning
of that segment contains pages with the old timeline ID, and we don't accept
that in recovery unless we find a history file covering the old timeline ID.
If you omit pg_xlog from the base backup and clear the archive directory
before starting the backup, there will be no such history file available.
The bug is present in all versions since PITR was introduced in 8.0, but I'm
back-patching only back to 8.2. Earlier versions didn't have XLOG switch
records, making this fix unfeasible. Given the lack of reports until now,
it doesn't seem worthwhile to spend more effort to fix 8.0 and 8.1.
Per report and suggestion by Mikael Krantz
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
points where we step right or left to the next page. This should ensure
reasonable response time to a query cancel request during an unsuccessful
index scan, as seen in recent gripe from Marc Cousin. It's a bit trickier
than it might seem at first glance, because CHECK_FOR_INTERRUPTS() is a no-op
if executed while holding a buffer lock. So we have to do it just at the
point where we've dropped one page lock and not yet acquired the next.
Remove CHECK_FOR_INTERRUPTS calls at the top level of btgetbitmap and
hashgetbitmap, since they're pointless given the added checks.
I think that GIST is okay already --- at least, there's a CHECK_FOR_INTERRUPTS
at a plausible-looking place in gistnext(). I don't claim to know GIN well
enough to try to poke it for this, if indeed it has a problem at all.
This is a pre-existing issue, but in view of the lack of prior complaints
I'm not going to risk back-patching.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
documentation warnings against setting it nonzero unless active use of
prepared transactions is intended and a suitable transaction manager has been
installed. This should help to prevent the type of scenario we've seen
several times now where a prepared transaction is forgotten and eventually
causes severe maintenance problems (or even anti-wraparound shutdown).
The only real reason we had the default be nonzero in the first place was to
support regression testing of the feature. To still be able to do that,
tweak pg_regress to force a nonzero value during "make check". Since we
cannot force a nonzero value in "make installcheck", add a variant regression
test "expected" file that shows the results that will be obtained when
max_prepared_transactions is zero.
Also, extend the HINT messages for transaction wraparound warnings to mention
the possibility that old prepared transactions are causing the problem.
All per today's discussion.
|
|
|
|
|
|
|
| |
ready for archival. It was marked at the next checkpoint anyway, but
waiting for the next checkpoint is an unnecessary delay.
Fujii Masao
|
|
|
|
|
|
| |
the checkpoint in immediate or lazy mode. This is to address complaints
that pg_start_backup() takes a long time even when there's no need to minimize
its I/O consumption.
|
|
|
|
| |
from buggy user-defined picksplit to GiST.
|
|
|
|
|
| |
Improve comments. Now GIN-indexable operators should be strict.
Per Tom's questions/suggestions.
|
|
|
|
|
|
| |
of adding optional namespace and action fields to DefElem. Having three
node types that do essentially the same thing bloats the code and leads
to errors of confusion, such as in yesterday's bug report from Khee Chin.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
To implement this without almost duplicating the reloption table, treat
relopt_kind as a bitmask instead of an integer value. This decreases the
range of allowed values, but it's not clear that there's need for that much
values anyway.
This patch also makes heap_reloptions explicitly a no-op for relation kinds
other than heap and TOAST tables.
Patch by ITAGAKI Takahiro with minor edits from me. (In particular I removed
the bit about adding relation kind to an error message, which I intend to
commit separately.)
|
| |
|
|
|
|
| |
Robert Lor
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
TupleTableSlots. We have functions for retrieving a minimal tuple from a slot
after storing a regular tuple in it, or vice versa; but these were implemented
by converting the internal storage from one format to the other. The problem
with that is it invalidates any pass-by-reference Datums that were already
fetched from the slot, since they'll be pointing into the just-freed version
of the tuple. The known problem cases involve fetching both a whole-row
variable and a pass-by-reference value from a slot that is fed from a
tuplestore or tuplesort object. The added regression tests illustrate some
simple cases, but there may be other failure scenarios traceable to the same
bug. Note that the added tests probably only fail on unpatched code if it's
built with --enable-cassert; otherwise the bug leads to fetching from freed
memory, which will not have been overwritten without additional conditions.
Fix by allowing a slot to contain both formats simultaneously; which turns out
not to complicate the logic much at all, if anything it seems less contorted
than before.
Back-patch to 8.2, where minimal tuples were introduced.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
method to pass extra data to the consistent() and comparePartial() methods.
This is the core infrastructure needed to support the soon-to-appear
contrib/btree_gin module. The APIs are still upward compatible with the
definitions used in 8.3 and before, although *not* with the previous 8.4devel
function definitions.
catversion bump for changes in pg_proc entries (although these are just
cosmetic, since GIN doesn't actually look at the function signature before
calling it...)
Teodor Sigaev and Oleg Bartunov
|
|
|
|
|
|
|
|
|
|
|
|
| |
them from degrading badly when the input is sorted or nearly so. In this
scenario the tree is unbalanced to the point of becoming a mere linked list,
so insertions become O(N^2). The easiest and most safely back-patchable
solution is to stop growing the tree sooner, ie limit the growth of N. We
might later consider a rebalancing tree algorithm, but it's not clear that
the benefit would be worth the cost and complexity. Per report from Sergey
Burladyan and an earlier complaint from Heikki.
Back-patch to 8.2; older versions didn't have GIN indexes.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
multiple index entries in a holding area before adding them to the main index
structure. This helps because bulk insert is (usually) significantly faster
than retail insert for GIN.
This patch also removes GIN support for amgettuple-style index scans. The
API defined for amgettuple is difficult to support with fastupdate, and
the previously committed partial-match feature didn't really work with
it either. We might eventually figure a way to put back amgettuple
support, but it won't happen for 8.4.
catversion bumped because of change in GIN's pg_am entry, and because
the format of GIN indexes changed on-disk (there's a metapage now,
and possibly a pending list).
Teodor Sigaev
|
|
|
|
| |
meant it had to be built on-the-fly at each entry to default_reloptions.
|