| Commit message (Collapse) | Author | Age |
|
|
|
|
|
|
|
| |
This comment has been wrong since its introduction in commit
2c03216d8311.
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAD21AoAzz6qipFJBbGEaHmyWxvvNDp8httbwLR9tUQWaTjUs2Q@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Instead of making many block-sized write() calls to fill a new WAL file
with zeroes, make a smaller number of pwritev() calls (or various
emulations). The actual number depends on the OS's IOV_MAX, which
PG_IOV_MAX currently caps at 32. That means we'll write 256kB per call
on typical systems. We may want to tune the number later with more
experience.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGJA%2Bu-220VONeoREBXJ9P3S94Y7J%2BkqCnTYmahvZJwM%3Dg%40mail.gmail.com
|
|
|
|
|
|
| |
We agreed to remove this terminology and use something more descriptive.
Discussion: https://postgr.es/m/20200615182235.x7lch5n6kcjq4aue%40alap3.anarazel.de
|
|
|
|
| |
Backpatch-through: 9.5
|
|
|
|
|
|
|
|
|
|
| |
The patch needs test cases, reorganization, and cfbot testing.
Technically reverts commits 5c31afc49d..e35b2bad1a (exclusive/inclusive)
and 08db7c63f3..ccbe34139b.
Reported-by: Tom Lane, Michael Paquier
Discussion: https://postgr.es/m/E1ktAAG-0002V2-VB@gemulon.postgresql.org
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds a key management system that stores (currently) two data
encryption keys of length 128, 192, or 256 bits. The data keys are
AES256 encrypted using a key encryption key, and validated via GCM
cipher mode. A command to obtain the key encryption key must be
specified at initdb time, and will be run at every database server
start. New parameters allow a file descriptor open to the terminal to
be passed. pg_upgrade support has also been added.
Discussion: https://postgr.es/m/CA+fd4k7q5o6Nc_AaX6BcYM9yqTbC6_pnH-6nSD=54Zp6NBQTCQ@mail.gmail.com
Discussion: https://postgr.es/m/20201202213814.GG20285@momjian.us
Author: Masahiko Sawada, me, Stephen Frost
|
|
|
|
|
|
|
|
| |
This fixes several areas of the documentation and some comments in
matters of style, grammar, or even format.
Author: Justin Pryzby
Discussion: https://postgr.es/m/20201222041153.GK30237@telsasoft.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Revert ac22929a26, as well as the followup fix 113d3591b8. Because it broke
the assumption that the startup process waiting for the recovery conflict
on buffer pin should be waken up only by buffer unpin or the timeout enabled
in ResolveRecoveryConflictWithBufferPin(). It caused, for example,
SIGHUP signal handler or walreceiver process to wake that startup process
up unnecessarily frequently.
Additionally, add the comments about why that dedicated latch that
the reverted patch tried to get rid of should not be removed.
Thanks to Kyotaro Horiguchi for the discussion.
Author: Fujii Masao
Discussion: https://postgr.es/m/d8c0c608-021b-3c73-fffd-3240829ee986@oss.nttdata.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Invent a new flag bit HASH_STRINGS to specify C-string hashing, which
was formerly the default; and add assertions insisting that exactly
one of the bits HASH_STRINGS, HASH_BLOBS, and HASH_FUNCTION be set.
This is in hopes of preventing recurrences of the type of oversight
fixed in commit a1b8aa1e4 (i.e., mistakenly omitting HASH_BLOBS).
Also, when HASH_STRINGS is specified, insist that the keysize be
more than 8 bytes. This is a heuristic, but it should catch
accidental use of HASH_STRINGS for integer or pointer keys.
(Nearly all existing use-cases set the keysize to NAMEDATALEN or
more, so there's little reason to think this restriction should
be problematic.)
Tweak hash_create() to insist that the HASH_ELEM flag be set, and
remove the defaults it had for keysize and entrysize. Since those
defaults were undocumented and basically useless, no callers
omitted HASH_ELEM anyway.
Also, remove memset's zeroing the HASHCTL parameter struct from
those callers that had one. This has never been really necessary,
and while it wasn't a bad coding convention it was confusing that
some callers did it and some did not. We might as well save a few
cycles by standardizing on "not".
Also improve the documentation for hash_create().
In passing, improve reinit.c's usage of a hash table by storing
the key as a binary Oid rather than a string; and, since that's
a temporary hash table, allocate it in CurrentMemoryContext for
neatness.
Discussion: https://postgr.es/m/590625.1607878171@sss.pgh.pa.us
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is done for end-of-recovery and shutdown checkpoints/restartpoints
(end-of-recovery restartpoints don't exist) rather than all types of
checkpoints, in cases where it may not be possible to rely on
pg_stat_activity to get a status from the startup or checkpointer
processes.
For example, at the end of a crash recovery, this is useful to know if a
checkpoint is running in the startup process, while previously the ps
display may only show some information about "recovering" something,
that can be confusing while a checkpoint runs.
Author: Justin Pryzby
Reviewed-by: Nathan Bossart, Kirk Jamison, Fujii Masao, Michael Paquier
Discussion: https://postgr.es/m/20200818225238.GP17022@telsasoft.com
|
|
|
|
|
|
|
|
|
|
|
| |
User-visible log messages should go through ereport(), so they are
subject to translation. Many remaining elog(LOG) calls are really
debugging calls.
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://www.postgresql.org/message-id/flat/92d6f545-5102-65d8-3c87-489f71ea0a37%40enterprisedb.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
While ereport() and elog() themselves are quite cheap when the
error message level is too low to be printed, some places need to do
substantial work before they can call those macros at all. To allow
optimizing away such setup work when nothing is to be printed, make
elog.c export a new function message_level_is_interesting(elevel)
that reports whether ereport/elog will do anything. Make use of that
in various places that had ad-hoc direct tests of log_min_messages etc.
Also teach ProcSleep to use it to avoid some work. (There may well
be other places that could usefully use this; I didn't search hard.)
Within elog.c, refactor a little bit to avoid having duplicate copies
of the policy-setting logic. When that code was written, we weren't
relying on the availability of inline functions; so it had some
duplications in the name of efficiency, which I got rid of.
Alvaro Herrera and Tom Lane
Discussion: https://postgr.es/m/129515.1606166429@sss.pgh.pa.us
|
|
|
|
|
|
| |
Using a macro is ugly and not justified here.
Discussion: https://www.postgresql.org/message-id/flat/4ad69a4c-cc9b-0dfe-0352-8b1b0cd36c7b@2ndquadrant.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, when restore_command claimed to succeed but failed to restore
the file with the right name, for example, due to mis-configuration of
restore_command, no log message was reported. Then the recovery failed
later with an error message not directly related to the issue.
This commit changes the recovery so that a log message is emitted in
this error case. This would enable us to investigate what happened in
this case more easily.
Author: Jeff Janes, Fujii Masao
Reviewed-by: Pavel Borisov, Kyotaro Horiguchi
Discussion: https://postgr.es/m/CAMkU=1xkFs3Omp4JR4wMYWdam_KLuj6LXnTYfU8u3T0h=PLLMQ@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With more flags associated to a PGPROC entry that are not related to
vacuum (currently existing or planned), the name "statusFlags" describes
its purpose better.
(The same is done to the mirroring PROC_HDR->vacuumFlags.)
No functional changes in this commit.
This was suggested first by Hari Babu Kommi in [1] and then by Michael
Paquier at [2].
[1] https://postgr.es/m/CAJrrPGcsDC-oy1AhqH0JkXYa0Z2AgbuXzHPpByLoBGMxfOZMEQ@mail.gmail.com
[2] https://postgr.es/m/20200820060929.GB3730@paquier.xyz
Author: Dmitry Dolgov <9erthalion6@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20201116182446.qcg3o6szo2zookyr@localhost
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit changes the startup process in the standby server so that
it handles the interrupt signals after waiting for wal_retrieve_retry_interval
on the latch and resetting it, before entering another wait on the latch.
This change causes the standby server to promptly handle interrupt signals.
Otherwise, previously, there was the case where the standby needs to
wait extra five seconds to shutdown when the shutdown request arrived
while the startup process was waiting for wal_retrieve_retry_interval
on the latch.
Author: Fujii Masao, but implementation idea is from Soumyadeep Chakraborty
Reviewed-by: Soumyadeep Chakraborty
Discussion: https://postgr.es/m/9d7e6ab0-8a53-ddb9-63cd-289bcb25fe0e@oss.nttdata.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Introduce TimestampDifferenceMilliseconds() to simplify callers
that would rather have the difference in milliseconds, instead of
the select()-oriented seconds-and-microseconds format. This gets
rid of at least one integer division per call, and it eliminates
some apparently-easy-to-mess-up arithmetic.
Two of these call sites were in fact wrong:
* pg_prewarm's autoprewarm_main() forgot to multiply the seconds
by 1000, thus ending up with a delay 1000X shorter than intended.
That doesn't quite make it a busy-wait, but close.
* postgres_fdw's pgfdw_get_cleanup_result() thought it needed to compute
microseconds not milliseconds, thus ending up with a delay 1000X longer
than intended. Somebody along the way had noticed this problem but
misdiagnosed the cause, and imposed an ad-hoc 60-second limit rather
than fixing the units. This was relatively harmless in context, because
we don't care that much about exactly how long this delay is; still,
it's wrong.
There are a few more callers of TimestampDifference() that don't
have a direct need for seconds-and-microseconds, but can't use
TimestampDifferenceMilliseconds() either because they do need
microsecond precision or because they might possibly deal with
intervals long enough to overflow 32-bit milliseconds. It might be
worth inventing another API to improve that, but that seems outside
the scope of this patch; so those callers are untouched here.
Given the fact that we are fixing some bugs, and the likelihood
that future patches might want to back-patch code that uses this
new API, back-patch to all supported branches.
Alexey Kondratov and Tom Lane
Discussion: https://postgr.es/m/3b1c053a21c07c1ed5e00be3b2b855ef@postgrespro.ru
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Specifically, this blocks DECLARE ... WITH HOLD and firing of deferred
triggers within index expressions and materialized view queries. An
attacker having permission to create non-temp objects in at least one
schema could execute arbitrary SQL functions under the identity of the
bootstrap superuser. One can work around the vulnerability by disabling
autovacuum and not manually running ANALYZE, CLUSTER, REINDEX, CREATE
INDEX, VACUUM FULL, or REFRESH MATERIALIZED VIEW. (Don't restore from
pg_dump, since it runs some of those commands.) Plain VACUUM (without
FULL) is safe, and all commands are fine when a trusted user owns the
target object. Performance may degrade quickly under this workaround,
however. Back-patch to 9.5 (all supported versions).
Reviewed by Robert Haas. Reported by Etienne Stalmans.
Security: CVE-2020-25695
|
|
|
|
|
|
|
|
|
| |
Commit dee663f7 intended to drop any queued up fsync requests before
unlinking segment files, but missed a code path. Fix, by centralizing
the forget-and-unlink code into a single function.
Reported-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Discussion: https://postgr.es/m/20201104013205.icogbi773przyny5%40development
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit ac22929a26 changed recoveryWakeupLatch so that it's reset to
NULL at the end of recovery. This change could cause a segmentation fault
in the buildfarm member 'elver'.
Previously the latch was reset to NULL after calling ShutdownWalRcv().
But there could be a window between ShutdownWalRcv() and the actual
exit of walreceiver. If walreceiver set the latch during that window,
the segmentation fault could happen.
To fix the issue, this commit changes walreceiver so that it sets
the latch only when the latch has not been reset to NULL yet.
Author: Fujii Masao
Discussion: https://postgr.es/m/5c1f8a85-747c-7bf9-241e-dd467d8a3586@iki.fi
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit gets rid of the dedicated latch for signaling the startup
process in favor of using its procLatch, since that comports better
with possible generic signal handlers using that latch.
Commit 1e53fe0e70 changed background processes so that they use standard
SIGHUP handler. Like that, this commit also makes the startup process use
standard SIGHUP handler to simplify the code.
Author: Fujii Masao
Reviewed-by: Bharath Rupireddy, Michael Paquier
Discussion: https://postgr.es/m/CALj2ACXPorUqePswDtOeM_s82v9RW32E1fYmOPZ5NuE+TWKj_A@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
| |
Some places were using PG_GETARG_UINT32 where PG_GETARG_TRANSACTIONID
would be more appropriate. (Of course, they are the same internally,
so there is no externally visible effect.) To do that, export
PG_GETARG_TRANSACTIONID outside of xid.c. We also export
PG_RETURN_TRANSACTIONID for symmetry, even though there are currently
no external users.
Author: Ashutosh Bapat <ashutosh.bapat@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/d8f6bdd536df403b9b33816e9f7e0b9d@G08CNEXMBPEKD05.g08.fujitsu.local
|
|
|
|
|
|
|
|
| |
Mark Dilger, reviewed by Peter Geoghegan, Andres Freund, Álvaro Herrera,
Michael Paquier, Amul Sul, and by me. Some last-minute cosmetic
revisions by me.
Discussion: http://postgr.es/m/12ED3DA8-25F0-4B68-937D-D907CFBF08E7@enterprisedb.com
|
|
|
|
|
|
|
|
| |
AtEOXact_MultiXact() was referenced in two places with an incorrect
routine name.
Author: Hou Zhijie
Discussion: https://postgr.es/m/1b41e9311e8f474cb5a360292f0b3cb1@G08CNEXMBPEKD05.g08.fujitsu.local
|
|
|
|
|
|
|
| |
Oversight in 9d0bd95.
Reported-by: Andres Freund
Discussion: https://postgr.es/m/20201006023802.qqfi6m5bw5y77zql@alap3.anarazel.de
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This view shows the statistics about WAL activity. Currently it has only
two columns: wal_buffers_full and stats_reset. wal_buffers_full column
indicates the number of times WAL data was written to the disk because
WAL buffers got full. This information is useful when tuning wal_buffers.
stats_reset column indicates the time at which these statistics were
last reset.
pg_stat_wal view is also the basic infrastructure to expose other
various statistics about WAL activity later.
Bump PGSTAT_FILE_FORMAT_ID due to the change in pgstat format.
Bump catalog version.
Author: Masahiro Ikeda
Reviewed-by: Takayuki Tsunakawa, Kyotaro Horiguchi, Amit Kapila, Fujii Masao
Discussion: https://postgr.es/m/188bd3f2d2233cf97753b5ced02bb050@oss.nttdata.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Providing this information can be useful for example when diagnosing
problems related to recovery conflicts or for recovery issues without
having to go through the output generated by pg_waldump to get some
information about the blocks a WAL record works on.
The block information is printed in the same format as pg_waldump. This
already existed in xlog.c for debugging purposes with -DWAL_DEBUG, so
adding the block information in the callback has required just a small
refactoring.
Author: Bertrand Drouvot
Reviewed-by: Michael Paquier, Masahiko Sawada
Discussion: https://postgr.es/m/c31e2cba-efda-762c-f4ad-5c25e5dac3d0@amazon.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, we called fsync() after writing out individual pg_xact,
pg_multixact and pg_commit_ts pages due to cache pressure, leading to
regular I/O stalls in user backends and recovery. Collapse requests for
the same file into a single system call as part of the next checkpoint,
as we already did for relation files, using the infrastructure developed
by commit 3eb77eba. This can cause a significant improvement to
recovery performance, especially when it's otherwise CPU-bound.
Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer
that it applies to all the SLRU mini-buffer-pools as well as the main
buffer pool. Rearrange things so that data collected in CheckpointStats
includes SLRU activity.
Also remove the Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() functions,
because they were redundant after the shutdown checkpoint that
immediately precedes them. (I'm not sure if they were ever needed, but
they aren't now.)
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (parts)
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Discussion: https://postgr.es/m/CA+hUKGLJ=84YT+NvhkEEDAuUtVHMfQ9i-N7k_o50JmQ6Rpj_OQ@mail.gmail.com
|
|
|
|
|
|
| |
Existing code used various inconsistent ways to printf struct stat's
st_size member. The type of that is off_t, which is in most cases a
signed 64-bit integer, so use the long long int format for it.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Harmonize behavior by moving reponsibility for fsyncing directories down
into slru.c. In 10 and later, only the multixact directories were
missed (see commit 1b02be21), and in older branches all SLRUs were
missed.
Back-patch to all supported releases.
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CA%2BhUKGLtsTUOScnNoSMZ-2ZLv%2BwGh01J6kAo_DM8mTRq1sKdSQ%40mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds a new optional support function to the GiST access method:
sortsupport. If it is defined, the GiST index is built by sorting all data
to the order defined by the sortsupport's comparator function, and packing
the tuples in that order to GiST pages. This is similar to how B-tree
index build works, and is much faster than inserting the tuples one by
one. The resulting index is smaller too, because the pages are packed more
tightly, upto 'fillfactor'. The normal build method works by splitting
pages, which tends to lead to more wasted space.
The quality of the resulting index depends on how good the opclass-defined
sort order is. A good order preserves locality of the input data.
As the first user of this facility, add 'sortsupport' function to the
point_ops opclass. It sorts the points in Z-order (aka Morton Code), by
interleaving the bits of the X and Y coordinates.
Author: Andrey Borodin
Reviewed-by: Pavel Borisov, Thomas Munro
Discussion: https://www.postgresql.org/message-id/1A36620E-CAD8-4267-9067-FB31385E7C0D%40yandex-team.ru
|
|
|
|
|
|
|
|
|
|
| |
Reporting this has been rather useful in some recent recovery speedup
work. It also seems like something that will be useful to the average DBA
too.
Author: David Rowley
Reviewed-by: Thomas Munro
Discussion: https://postgr.es/m/CAApHDvqYVORiZxq2xPvP6_ndmmsTkvr6jSYv4UTNaFa5i1kd%3DQ%40mail.gmail.com
|
| |
|
|
|
|
|
|
|
|
| |
When reading a WAL record fails to find continuation record(s) of the
proper length, report what it expects, for clarity.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/20200903212152.GA15319@alvherre.pgsql
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If a commit or abort record includes "dropped relfilenodes", then replaying
the record will remove data files. That is surely a "special rel update",
but the records were not marked as such. Fix that, teach pg_rewind to
expect and ignore them, and add a test case to cover it.
It's always been like this, but no backporting for fear of breaking
existing applications. If an application parsed the WAL but was not
handling commit/abort records, it would stop working. That might be a good
thing if it really needed to handle the dropped rels, but it will be caught
when the application is updated to work with PostgreSQL v14 anyway.
Discussion: https://www.postgresql.org/message-id/07b33e2c-46a6-86a1-5f9e-a7da73fddb95%40iki.fi
Reviewed-by: Amit Kapila, Michael Paquier
|
|
|
|
|
|
|
|
|
| |
Reuse cautionary language from src/test/ssl/README in
src/test/kerberos/README. SLRUs have had access to six-character
segments names since commit 73c986adde5d73a5e2555da9b5c8facedb146dcd,
and recovery stopped calling HeapTupleHeaderAdvanceLatestRemovedXid() in
commit 558a9165e081d1936573e5a7d576f5febd7fb55a. The other corrections
are more self-evident.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The SimpleLruTruncate() header comment states the new coding rule. To
achieve this, add locktype "frozenid" and two LWLocks. This closes a
rare opportunity for data loss, which manifested as "apparent
wraparound" or "could not access status of transaction" errors. Data
loss is more likely in pg_multixact, due to released branches' thin
margin between multiStopLimit and multiWrapLimit. If a user's physical
replication primary logged ": apparent wraparound" messages, the user
should rebuild standbys of that primary regardless of symptoms. At less
risk is a cluster having emitted "not accepting commands" errors or
"must be vacuumed" warnings at some point. One can test a cluster for
this data loss by running VACUUM FREEZE in every database. Back-patch
to 9.5 (all supported versions).
Discussion: https://postgr.es/m/20190218073103.GA1434723@rfd.leadboat.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Similar to the previous changes this increases the chance that data
frequently needed by GetSnapshotData() stays in l2 cache. In many
workloads subtransactions are very rare, and this makes the check for
that considerably cheaper.
As this removes the last member of PGXACT, there is no need to keep it
around anymore.
On a larger 2 socket machine this and the two preceding commits result
in a ~1.07x performance increase in read-only pgbench. For read-heavy
mixed r/w workloads without row level contention, I see about 1.1x.
Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Similar to the previous commit this increases the chance that data
frequently needed by GetSnapshotData() stays in l2 cache. As we now
take care to not unnecessarily write to ProcGlobal->vacuumFlags, there
should be very few modifications to the ProcGlobal->vacuumFlags array.
Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The new array contains the xids for all connected backends / in-use
PGPROC entries in a dense manner (in contrast to the PGPROC/PGXACT
arrays which can have unused entries interspersed).
This improves performance because GetSnapshotData() always needs to
scan the xids of all live procarray entries and now there's no need to
go through the procArray->pgprocnos indirection anymore.
As the set of running top-level xids changes rarely, compared to the
number of snapshots taken, this substantially increases the likelihood
of most data required for a snapshot being in l2 cache. In
read-mostly workloads scanning the xids[] array will sufficient to
build a snapshot, as most backends will not have an xid assigned.
To keep the xid array dense ProcArrayRemove() needs to move entries
behind the to-be-removed proc's one further up in the array. Obviously
moving array entries cannot happen while a backend sets it
xid. I.e. locking needs to prevent that array entries are moved while
a backend modifies its xid.
To avoid locking ProcArrayLock in GetNewTransactionId() - a fairly hot
spot already - ProcArrayAdd() / ProcArrayRemove() now needs to hold
XidGenLock in addition to ProcArrayLock. Adding / Removing a procarray
entry is not a very frequent operation, even taking 2PC into account.
Due to the above, the dense array entries can only be read or modified
while holding ProcArrayLock and/or XidGenLock. This prevents a
concurrent ProcArrayRemove() from shifting the dense array while it is
accessed concurrently.
While the new dense array is very good when needing to look at all
xids it is less suitable when accessing a single backend's xid. In
particular it would be problematic to have to acquire a lock to access
a backend's own xid. Therefore a backend's xid is not just stored in
the dense array, but also in PGPROC. This also allows a backend to
only access the shared xid value when the backend had acquired an
xid.
The infrastructure added in this commit will be used for the remaining
PGXACT fields in subsequent commits. They are kept separate to make
review easier.
Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
|
|
|
|
| |
Oversight in commit 2c03216d831.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Now that xmin isn't needed for GetSnapshotData() anymore, it leads to
unnecessary cacheline ping-pong to have it in PGXACT, as it is updated
considerably more frequently than the other PGXACT members.
After the changes in dc7420c2c92, this is a very straight-forward change.
For highly concurrent, snapshot acquisition heavy, workloads this change alone
can significantly increase scalability. E.g. plain pgbench on a smaller 2
socket machine gains 1.07x for read-only pgbench, 1.22x for read-only pgbench
when submitting queries in batches of 100, and 2.85x for batches of 100
'SELECT';. The latter numbers are obviously not to be expected in the
real-world, but micro-benchmark the snapshot computation
scalability (previously spending ~80% of the time in GetSnapshotData()).
Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
|
|
|
|
|
|
| |
Time appears to be passing fast.
Reported-By: Peter Geoghegan <pg@bowt.ie>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
To make GetSnapshotData() more scalable, it cannot not look at at each proc's
xmin: While snapshot contents do not need to change whenever a read-only
transaction commits or a snapshot is released, a proc's xmin is modified in
those cases. The frequency of xmin modifications leads to, particularly on
higher core count systems, many cache misses inside GetSnapshotData(), despite
the data underlying a snapshot not changing. That is the most
significant source of GetSnapshotData() scaling poorly on larger systems.
Without accessing xmins, GetSnapshotData() cannot calculate accurate horizons /
thresholds as it has so far. But we don't really have to: The horizons don't
actually change that much between GetSnapshotData() calls. Nor are the horizons
actually used every time a snapshot is built.
The trick this commit introduces is to delay computation of accurate horizons
until there use and using horizon boundaries to determine whether accurate
horizons need to be computed.
The use of RecentGlobal[Data]Xmin to decide whether a row version could be
removed has been replaces with new GlobalVisTest* functions. These use two
thresholds to determine whether a row can be pruned:
1) definitely_needed, indicating that rows deleted by XIDs >= definitely_needed
are definitely still visible.
2) maybe_needed, indicating that rows deleted by XIDs < maybe_needed can
definitely be removed
GetSnapshotData() updates definitely_needed to be the xmin of the computed
snapshot.
When testing whether a row can be removed (with GlobalVisTestIsRemovableXid())
and the tested XID falls in between the two (i.e. XID >= maybe_needed && XID <
definitely_needed) the boundaries can be recomputed to be more accurate. As it
is not cheap to compute accurate boundaries, we limit the number of times that
happens in short succession. As the boundaries used by
GlobalVisTestIsRemovableXid() are never reset (with maybe_needed updated by
GetSnapshotData()), it is likely that further test can benefit from an earlier
computation of accurate horizons.
To avoid regressing performance when old_snapshot_threshold is set (as that
requires an accurate horizon to be computed), heap_page_prune_opt() doesn't
unconditionally call TransactionIdLimitedForOldSnapshots() anymore. Both the
computation of the limited horizon, and the triggering of errors (with
SetOldSnapshotThresholdTimestamp()) is now only done when necessary to remove
tuples.
This commit just removes the accesses to PGXACT->xmin from
GetSnapshotData(), but other members of PGXACT residing in the same
cache line are accessed. Therefore this in itself does not result in a
significant improvement. Subsequent commits will take advantage of the
fact that GetSnapshotData() now does not need to access xmins anymore.
Note: This contains a workaround in heap_page_prune_opt() to keep the
snapshot_too_old tests working. While that workaround is ugly, the tests
currently are not meaningful, and it seems best to address them separately.
Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The reason for doing so is that a subsequent commit will need that to
avoid wraparound issues. As the subsequent change is large this was
split out for easier review.
The reason this is not a perfect straight-forward change is that we do
not want track 64bit xids in the procarray or the WAL. Therefore we
need to advance lastestCompletedXid in relation to 32 bit xids. The
code for that is now centralized in MaintainLatestCompletedXid*.
Author: Andres Freund
Reviewed-By: Thomas Munro, Robert Haas, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
|
|
|
|
|
|
|
|
|
| |
Including Full in variable names duplicates the type information and
leads to overly long names. As FullTransactionId cannot accidentally
be casted to TransactionId that does not seem necessary.
Author: Andres Freund
Discussion: https://postgr.es/m/20200724011143.jccsyvsvymuiqfxu@alap3.anarazel.de
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
They are equivalent, except that StrNCpy() zero-fills the entire
destination buffer instead of providing just one trailing zero. For
all but a tiny number of callers, that's just overhead rather than
being desirable.
Remove StrNCpy() as it is now unused.
In some cases, namestrcpy() is the more appropriate function to use.
While we're here, simplify the API of namestrcpy(): Remove the return
value, don't check for NULL input. Nothing was using that anyway.
Also, remove a few unused name-related functions.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/44f5e198-36f6-6cdb-7fa9-60e34784daae%402ndquadrant.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Instead of serializing the transaction to disk after reaching the
logical_decoding_work_mem limit in memory, we consume the changes we have
in memory and invoke stream API methods added by commit 45fdc9738b.
However, sometimes if we have incomplete toast or speculative insert we
spill to the disk because we can't generate the complete tuple and stream.
And, as soon as we get the complete tuple we stream the transaction
including the serialized changes.
We can do this incremental processing thanks to having assignments
(associating subxact with toplevel xacts) in WAL right away, and
thanks to logging the invalidation messages at each command end. These
features are added by commits 0bead9af48 and c55040ccd0 respectively.
Now that we can stream in-progress transactions, the concurrent aborts
may cause failures when the output plugin consults catalogs (both system
and user-defined).
We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK
sqlerrcode from system table scan APIs to the backend or WALSender
decoding a specific uncommitted transaction. The decoding logic on the
receipt of such a sqlerrcode aborts the decoding of the current
transaction and continue with the decoding of other transactions.
We have ReorderBufferTXN pointer in each ReorderBufferChange by which we
know which xact it belongs to. The output plugin can use this to decide
which changes to discard in case of stream_abort_cb (e.g. when a subxact
gets discarded).
We also provide a new option via SQL APIs to fetch the changes being
streamed.
Author: Dilip Kumar, Tomas Vondra, Amit Kapila, Nikhil Sontakke
Reviewed-by: Amit Kapila, Kuntal Ghosh, Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We have edge-case bugs when assigning values in the last few dozen pages
before the wrap limit. We may introduce similar bugs in the future. At
default BLCKSZ, this makes such bugs unreachable outside of single-user
mode. Also, when VACUUM began to consume mxacts, multiStopLimit did not
change to compensate.
pg_upgrade may fail on a cluster that was already printing "must be
vacuumed" warnings. Follow the warning's instructions to clear the
warning, then run pg_upgrade again. One can still, peacefully consume
98% of XIDs or mxacts, so DBAs need not change routine VACUUM settings.
Discussion: https://postgr.es/m/20200621083513.GA3074645@rfd.leadboat.com
|
|
|
|
|
|
|
|
|
|
| |
This avoids lseek() system calls at every SLRU I/O, as was
done for relation files in commit c24dcd0c.
Reviewed-by: Ashwin Agrawal <aagrawal@pivotal.io>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKG%2Biqke4uTRFj8D8uEUUgj%2BRokPSp%2BCWM6YYzaaamG9Wvg%40mail.gmail.com
Discussion: https://postgr.es/m/CA%2BhUKGJ%2BoHhnvqjn3%3DHro7xu-YDR8FPr0FL6LF35kHRX%3D_bUzg%40mail.gmail.com
|