| Commit message (Collapse) | Author | Age |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since commit ba94518a, we used XLogFileOpen to open the next segment for
writing, but if the end-of-recovery happens exactly at a segment boundary,
the new segment might not exist yet. (Before ba94518a, XLogFileOpen was
correct, because we would open the previous segment if the switch happened
at the boundary.)
Instead of trying to create it if necessary, it's simpler to not bother
opening the segment at all. XLogWrite() will open or create it soon anyway,
after writing the checkpoint or end-of-recovery record.
Reported by Andres Freund.
|
|
|
|
| |
Backpatch certain files through 9.0
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 0e5680f4737a9c6aa94aa9e77543e5de60411322 contained a thinko
mixing LOCKMODE with LockTupleMode. This caused misbehavior in the case
where a tuple is marked with a multixact with at most a FOR SHARE lock,
and another transaction tries to acquire a FOR NO KEY EXCLUSIVE lock;
this case should block but doesn't.
Include a new isolation tester spec file to explicitely try all the
tuple lock combinations; without the fix it shows the problem:
starting permutation: s1_begin s1_lcksvpt s1_tuplock2 s2_tuplock3 s1_commit
step s1_begin: BEGIN;
step s1_lcksvpt: SELECT * FROM multixact_conflict FOR KEY SHARE; SAVEPOINT foo;
a
1
step s1_tuplock2: SELECT * FROM multixact_conflict FOR SHARE;
a
1
step s2_tuplock3: SELECT * FROM multixact_conflict FOR NO KEY UPDATE;
a
1
step s1_commit: COMMIT;
With the fixed code, step s2_tuplock3 blocks until session 1 commits,
which is the correct behavior.
All other cases behave correctly.
Backpatch to 9.3, like the commit that introduced the problem.
|
|
|
|
|
|
|
| |
Pointed out by Coverity.
Since this is mere, and debatable, cosmetics I'm not backpatching
this.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
At one point in the development of this feature, it was claimed that
allowing negative values would be useful to compensate for timezone
differences between master and slave servers. That was based on a mistaken
assumption that commit timestamps are recorded in local time; but of course
they're in UTC. Nor is a negative apply delay likely to be a sane way of
coping with server clock skew. However, the committed patch still treated
negative delays as doing something, and the timezone misapprehension
survived in the user documentation as well.
If recovery_min_apply_delay were a proper GUC we'd just set the minimum
allowed value to be zero; but for the moment it seems better to treat
negative settings as if they were zero.
In passing do some extra wordsmithing on the parameter's documentation,
including correcting a second misstatement that the parameter affects
processing of Restore Point records.
Issue noted by Michael Paquier, who also provided the code patch; doc
changes by me. Back-patch to 9.4 where the feature was introduced.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We were trying to acquire the lock even when we were subsequently
not sleeping in some other transaction, which opens us up unnecessarily
to deadlocks. In particular, this is troublesome if an update tries to
lock an updated version of a tuple and finds itself doing EvalPlanQual
update chain walking; more than two sessions doing this concurrently
will find themselves sleeping on each other because the HW tuple lock
acquisition in heap_lock_tuple called from EvalPlanQualFetch races with
the same tuple lock being acquired in heap_update -- one of these
sessions sleeps on the other one to finish while holding the tuple lock,
and the other one sleeps on the tuple lock.
Per trouble report from Andrew Sackville-West in
http://www.postgresql.org/message-id/20140731233051.GN17765@andrew-ThinkPad-X230
His scenario can be simplified down to a relatively simple
isolationtester spec file which I don't include in this commit; the
reason is that the current isolationtester is not able to deal with more
than one blocked session concurrently and it blocks instead of raising
the expected deadlock. In the future, if we improve isolationtester, it
would be good to include the spec file in the isolation schedule. I
posted it in
http://www.postgresql.org/message-id/20141212205254.GC1768@alvh.no-ip.org
Hat tip to Mark Kirkwood, who helped diagnose the trouble.
|
|
|
|
|
|
|
| |
This reverts commit 60838df922345b26a616e49ac9fab808a35d1f85.
That change needs a bit more thought to be workable. In view of
the potentially machine-dependent stuff that went in today,
we need all of the buildfarm to be testing those other changes.
|
|
|
|
|
|
|
|
| |
Besides being shorter and much easier to read it changes the logic in
LWLockRelease() to release all shared lockers when waking up any. This
can yield some significant performance improvements - and the fairness
isn't really much worse than before, as we always allowed new shared
lockers to jump the queue.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Exposing compression and decompression APIs of pglz makes possible its
use by extensions and contrib modules. pglz_decompress contained a call
to elog to emit an error message in case of corrupted data. This function
is changed to return a status code to let its callers return an error instead.
This commit is required for upcoming WAL compression feature so that
the WAL reader facility can decompress the WAL data by using pglz_decompress.
Michael Paquier
|
|
|
|
|
|
|
|
|
| |
This reverts commit 1826987a46d079458007b7b6bbcbbd852353adbb.
The overall design was deemed unacceptable, in discussion following the
previous commit message; we might find some parts of it still
salvageable, but I don't want to be on the hook for fixing it, so let's
wait until we have a new patch.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The previous representation using a boolean column for each attribute
would not scale as well as we want to add further attributes.
Extra auxilliary functions are added to go along with this change, to
make up for the lost convenience of access of the old representation.
Catalog version bumped due to change in catalogs and the new functions.
Author: Adam Brightwell, minor tweaks by Álvaro
Reviewed by: Stephen Frost, Andres Freund, Álvaro Herrera
|
|
|
|
|
|
|
| |
This performs slightly better, uses less memory, and needs slightly less
code in GiST, than the Red-Black tree previously used.
Reviewed by Peter Geoghegan
|
|
|
|
|
|
|
|
| |
XLogFileInit() returns a file descriptor, which needs to be closed. The leak
was short-lived, since the startup process exits shortly afterwards, but it
was clearly a bug, nevertheless.
Per Coverity report.
|
|
|
|
|
|
|
| |
We used time(null) to set a TimestampTz field, which gave bogus results.
Noticed while looking at pg_xlogdump output.
Backpatch to 9.3 and above, where the fast promotion was introduced.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, if you wanted anything besides C-string hash keys, you had to
specify a custom hashing function to hash_create(). Nearly all such
callers were specifying tag_hash or oid_hash; which is tedious, and rather
error-prone, since a caller could easily miss the opportunity to optimize
by using hash_uint32 when appropriate. Replace this with a design whereby
callers using simple binary-data keys just specify HASH_BLOBS and don't
need to mess with specific support functions. hash_create() itself will
take care of optimizing when the key size is four bytes.
This nets out saving a few hundred bytes of code space, and offers
a measurable performance improvement in tidbitmap.c (which was not
exploiting the opportunity to use hash_uint32 for its 4-byte keys).
There might be some wins elsewhere too, I didn't analyze closely.
In future we could look into offering a similar optimized hashing function
for 8-byte keys. Under this design that could be done in a centralized
and machine-independent fashion, whereas getting it right for keys of
platform-dependent sizes would've been notationally painful before.
For the moment, the old way still works fine, so as not to break source
code compatibility for loadable modules. Eventually we might want to
remove tag_hash and friends from the exported API altogether, since there's
no real need for them to be explicitly referenced from outside dynahash.c.
Teodor Sigaev and Tom Lane
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Two changes:
1. When copying a WAL segment from old timeline to create the first segment
on the new timeline, only copy up to the point where the timeline switch
happens, and zero-fill the rest. This avoids corner cases where we might
think that the copied WAL from the previous timeline belong to the new
timeline.
2. If the timeline switch happens at a segment boundary, don't copy the
whole old segment to the new timeline. It's pointless, because it's 100%
identical to the old segment.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When starting up from a basebackup taken off a standby extra logic has
to be applied to compute the point where the data directory is
consistent. Normal base backups use a WAL record for that purpose, but
that isn't possible on a standby.
That logic had a error check ensuring that the cluster's control file
indicates being in recovery. Unfortunately that check was too strict,
disregarding the fact that the control file could also indicate that
the cluster was shut down while in recovery.
That's possible when the a cluster starting from a basebackup is shut
down before the backup label has been removed. When everything goes
well that's a short window, but when either restore_command or
primary_conninfo isn't configured correctly the window can get much
wider. That's because inbetween reading and unlinking the label we
restore the last checkpoint from WAL which can need additional WAL.
To fix simply also allow starting when the control file indicates
"shutdown in recovery". There's nicer fixes imaginable, but they'd be
more invasive.
Backpatch to 9.2 where support for taking basebackups from standbys
was added.
|
|
|
|
|
|
|
|
|
|
|
| |
In passing, also make some debugging elog's in pgstat.c a bit more
consistently worded.
Back-patch as far as applicable (9.3 or 9.4; none of these mistakes are
really old).
Mark Dilger identified and patched the type violations; the message
rewordings are mine.
|
|
|
|
|
|
| |
No need to set tuple tableOid twice
Jim Nasby
|
|
|
|
|
|
|
|
|
| |
Rename parameter action_at_recovery_target to
recovery_target_action suggested by Christoph Berg.
Place into recovery.conf suggested by Fujii Masao,
replacing (deprecating) earlier parameters, per
Michael Paquier.
|
|
|
|
| |
Michael Paquier
|
|
|
|
|
|
|
|
|
| |
It was an oversight in the original commit.
Also note in the sample config file that changing wal_log_hints requires a
restart.
Michael Paquier. Backpatch to 9.4, where wal_log_hints was added.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Transactions can now set their commit timestamp directly as they commit,
or an external transaction commit timestamp can be fed from an outside
system using the new function TransactionTreeSetCommitTsData(). This
data is crash-safe, and truncated at Xid freeze point, same as pg_clog.
This module is disabled by default because it causes a performance hit,
but can be enabled in postgresql.conf requiring only a server restart.
A new test in src/test/modules is included.
Catalog version bumped due to the new subdirectory within PGDATA and a
couple of new SQL functions.
Authors: Álvaro Herrera and Petr Jelínek
Reviewed to varying degrees by Michael Paquier, Andres Freund, Robert
Haas, Amit Kapila, Fujii Masao, Jaime Casanova, Simon Riggs, Steven
Singer, Peter Eisentraut
|
| |
|
|
|
|
|
|
|
| |
Get rid of PG_FUNCTION_INFO_V1() macros, which are quite inappropriate
for built-in functions (possibly leftovers from testing as a loadable
module?). Also, fix gratuitous inconsistency between SQL-level and
C-level names of the minmax support functions.
|
|
|
|
|
| |
Multixacts are now maintained during recovery, but the README didn't get
the memo. Backpatch to 9.3, where the divergence was introduced.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
InitXLogInsert() cannot be called in a critical section, because it
allocates memory. But CreateCheckPoint() did that, when called for the
end-of-recovery checkpoint by the startup process.
In the passing, fix the scratch space allocation in InitXLogInsert to go to
the right memory context. Also update the comment at InitXLOGAccess, which
hasn't been totally accurate since hot standby was introduced (in a hot
standby backend, InitXLOGAccess isn't called at backend startup).
Reported by Michael Paquier
|
|
|
|
|
|
|
|
|
| |
action_at_recovery_target = pause | promote | shutdown
Petr Jelinek
Reviewed by Muhammad Asif Naeem, Fujji Masao and
Simon Riggs
|
|
|
|
|
|
|
| |
This gives an overview of what Lehman & Yao's paper is all about, so that
you can understand the rest of the README without having to read the paper.
Per discussion with Peter Geoghegan and others.
|
|
|
|
|
|
|
| |
Add a new XLOG_FPI_FOR_HINT record type, and use that for full-page images
generated for hint bit updates, when checksums are enabled. The new record
type is replayed exactly the same as XLOG_FPI, but allows them to be tallied
separately e.g. in pg_xlogdump.
|
|
|
|
| |
Amit Kapila
|
|
|
|
| |
Pointed out by Michael Paquier
|
|
|
|
|
|
|
|
|
|
|
|
| |
There seems no prospect that any of this will ever be useful, and indeed
it's questionable whether some of it would work if it ever got called;
it's certainly not been exercised in a very long time, if ever. So let's
get rid of it, and make the comments about mark/restore in execAmi.c less
wishy-washy.
The mark/restore support for Result nodes is also currently dead code,
but that's due to planner limitations not because it's impossible that
it could be useful. So I left it in.
|
|
|
|
|
|
| |
It's a false positive - the variable is only used when 'onleft' is true,
and it is initialized in that case. But the compiler doesn't necessarily
see that.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.
There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.
This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.
For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.
The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.
Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
|
|
|
|
|
|
|
|
|
|
|
| |
For <, <=, > and >= strategies, mark the first scan key
as already matched if scanning in an appropriate direction.
If index tuple contains no nulls we can skip the first
re-check for each tuple.
Author: Rajeev Rastogi
Reviewer: Haribabu Kommi
Rework of the code and comments by Simon Riggs
|
|
|
|
|
|
|
|
|
|
|
| |
There was some confusion on how to record the case that the operation
unlinks the last non-leaf page in the branch being deleted.
_bt_unlink_halfdead_page set the "topdead" field in the WAL record to
the leaf page, but the redo routine assumed that it would be an invalid
block number in that case. This commit fixes _bt_unlink_halfdead_page to
do what the redo routine expected.
This code is new in 9.4, so backpatch there.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Unlogged relations are reset at the end of crash recovery as they're
only synced to disk during a proper shutdown. Unfortunately that and
later steps can fail, e.g. due to running out of space. This reset
was, up to now performed after marking the database as having finished
crash recovery successfully. As out of space errors trigger a crash
restart that could lead to the situation that not all unlogged
relations are reset.
Once that happend usage of unlogged relations could yield errors like
"could not open file "...": No such file or directory". Luckily
clusters that show the problem can be fixed by performing a immediate
shutdown, and starting the database again.
To fix, just call ResetUnloggedRelations(UNLOGGED_RELATION_INIT)
earlier, before marking the database as having successfully recovered.
Discussion: 20140912112246.GA4984@alap3.anarazel.de
Backpatch to 9.1 where unlogged tables were introduced.
Abhijit Menon-Sen and Andres Freund
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The initial patch for RLS mistakenly included headers associated with
the executor and planner bits in rewrite/rowsecurity.h. Per policy and
general good sense, executor headers should not be included in planner
headers or vice versa.
The include of execnodes.h was a mistaken holdover from previous
versions, while the include of relation.h was used for Relation's
definition, which should have been coming from utils/relcache.h. This
patch cleans these issues up, adds comments to the RowSecurityPolicy
struct and the RowSecurityConfigType enum, and changes Relation->rsdesc
to Relation->rd_rsdesc to follow Relation field naming convention.
Additionally, utils/rel.h was including rewrite/rowsecurity.h, which
wasn't a great idea since that was pulling in things not really needed
in utils/rel.h (which gets included in quite a few places). Instead,
use 'struct RowSecurityDesc' for the rd_rsdesc field and add comments
explaining why.
Lastly, add an include into access/nbtree/nbtsort.c for
utils/sortsupport.h, which was evidently missed due to the above mess.
Pointed out by Tom in 16970.1415838651@sss.pgh.pa.us; note that the
concerns regarding a similar situation in the custom-path commit still
need to be addressed.
|
|
|
|
|
|
|
|
|
|
|
| |
This function has a loop which can lead to uninterruptible process
"stalls" (actually infinite loops) when some bugs are triggered. Avoid
that unpleasant situation by adding a check for interrupts in a place
that shouldn't degrade performance in the normal case.
Backpatch to 9.3. Older branches have an identical loop here, but the
aforementioned bugs are only a problem starting in 9.3 so there doesn't
seem to be any point in backpatching any further.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There was a window in RestoreBackupBlock where a page would be zeroed out,
but not yet locked. If a backend pinned and locked the page in that window,
it saw the zeroed page instead of the old page or new page contents, which
could lead to missing rows in a result set, or errors.
To fix, replace RBM_ZERO with RBM_ZERO_AND_LOCK, which atomically pins,
zeroes, and locks the page, if it's not in the buffer cache already.
In stable branches, the old RBM_ZERO constant is renamed to RBM_DO_NOT_USE,
to avoid breaking any 3rd party extensions that might use RBM_ZERO. More
importantly, this avoids renumbering the other enum values, which would
cause even bigger confusion in extensions that use ReadBufferExtended, but
haven't been recompiled.
Backpatch to all supported versions; this has been racy since hot standby
was introduced.
|
| |
|
|
|
|
|
| |
Since this parameter is only for GIN index, it's better to
add "gin" to the parameter name for easier understanding.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously the maximum size of GIN pending list was controlled only by
work_mem. But the reasonable value of work_mem and the reasonable size
of the list are basically not the same, so it was not appropriate to
control both of them by only one GUC, i.e., work_mem. This commit
separates new GUC, pending_list_cleanup_size, from work_mem to allow
users to control only the size of the list.
Also this commit adds pending_list_cleanup_size as new storage parameter
to allow users to specify the size of the list per index. This is useful,
for example, when users want to increase the size of the list only for
the GIN index which can be updated heavily, and decrease it otherwise.
Reviewed by Etsuro Fujita.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The code that generates the BRIN_XLOG_UPDATE removes the buffer
reference when the page that's target for the updated tuple is freshly
initialized. This is a pretty usual optimization, but was breaking the
case where the revmap buffer, which is referenced in the same WAL
record, is getting a backup block: the replay code was using backup
block index 1, which is not valid when the update target buffer gets
pruned; the revmap buffer gets assigned 0 instead. Make sure to use the
correct backup block index for revmap when replaying.
Bug reported by Fujii Masao.
|
|
|
|
|
|
|
| |
Besides a couple of typo fixes, per David Rowley, Thom Brown, and Amit
Langote, and mentions of BRIN in the general CREATE INDEX page again per
David, this includes silencing MSVC compiler warnings (thanks Microsoft)
and an additional variable initialization per Coverity scanner.
|
|
|
|
|
| |
Reported by Peter Geoghegan
David Rowley
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Reported by David Rowley: variadic macros are a problem. Get rid of
them using a trick suggested by Tom Lane: add extra parentheses where
needed. In the future we might decide we don't need the calls at all
and remove them, but it seems appropriate to keep them while this code
is still new.
Also from David Rowley: brininsert() was trying to use a variable before
initializing it. Fix by moving the brin_form_tuple call (which
initializes the variable) to within the locked section.
Reported by Peter Eisentraut: can't use "new" as a struct member name,
because C++ compilers will choke on it, as reported by cpluspluscheck.
|
|
|
|
|
|
|
|
|
|
|
| |
Now that the backup blocks are appended to the WAL record in xloginsert.c,
XLogInsert doesn't see them anymore and cannot remove them from the version
reconstructed for xlog_outdesc. This makes running with wal_debug=on more
expensive, as we now make (unnecessary) temporary copies of the backup
blocks, but it doesn't seem worth convoluting the code to keep that
optimization.
Reported by Alvaro Herrera.
|
|
|
|
|
|
| |
This removes some fmgr overhead from cases such as btree index builds.
Peter Geoghegan, reviewed by Andreas Karlsson and me.
|