| Commit message | Author | Age |
When decoding from a logical slot, it's necessary for xlog reading to be
able to read xlog from historical (i.e. not current) timelines;
otherwise, decoding fails after failover, because the archives are in
the historical timeline. This is required to make "failover logical
slots" possible; it currently has no other use, although theoretically
it could be used by an extension that creates a slot on a standby and
continues to replay from the slot when the standby is promoted.
This commit includes a module in src/test/modules with functions to
manipulate the slots (which is not otherwise possible in SQL code) in
order to enable testing, and a new test in src/test/recovery to ensure
that the behavior is as expected.
Author: Craig Ringer
Reviewed-By: Oleksii Kliukin, Andres Freund, Petr Jelínek
Some minor tweaks and comment additions, for cleanliness' sake and to
avoid having the upcoming timeline-following patch be polluted with
unrelated cleanup.
Extracted from a larger patch by Craig Ringer, reviewed by Andres
Freund, with some additions by myself.
During a scan it is sometimes very helpful to know something about the
parent node or about all ancestor nodes. Right now reconstructedValue
could be pressed into service for that, but that is not its intended use
(the range opclass currently abuses it this way). traversalValue is an
arbitrary piece of memory kept in a separate MemoryContext, whereas
reconstructedValue must have the same type as the indexed column.
Subsequent patches for the range opclass and the quad4d tree will use it.
Author: Alexander Lebedev, Teodor Sigaev
In this mode, the master waits for the transaction to be applied on
the remote side, not just written to disk. That means that you can
count on a transaction started on the standby to see all commits
previously acknowledged by the master.
To make this work, the standby sends a reply after replaying each
commit record generated with synchronous_commit >= 'remote_apply'.
This introduces a small inefficiency: the extra replies will be sent
even by standbys that aren't the current synchronous standby. But
previously-existing synchronous_commit levels make no attempt at all
to optimize which replies are sent based on what the primary cares
about, so this is no worse, and at least avoids any extra replies for
people not using the feature at all.
Thomas Munro, reviewed by Michael Paquier and by me. Some additional
tweaks by me.
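A minimal usage sketch (the table name is made up; this assumes a
synchronous standby is already configured via synchronous_standby_names):

    SET synchronous_commit = remote_apply;
    BEGIN;
    INSERT INTO accounts (id, balance) VALUES (1, 100);
    COMMIT;  -- returns only once the standby has replayed the commit record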
I introduced several uses of !! to force bit arithmetic to be boolean,
but per discussion the project prefers != 0/NULL.
Discussion: CA+TgmoZP5KakLGP6B4vUjgMBUW0woq_dJYi0paOz-My0Hwt_vQ@mail.gmail.com
This enables external code to create access methods. This is useful so
that extensions can add their own access methods which can be formally
tracked for dependencies, so that DROP operates correctly. Also, having
explicit support makes pg_dump work correctly.
Currently only index AMs are supported, but we expect different types to
be added in the future.
Authors: Alexander Korotkov, Petr Jelínek
Reviewed-By: Teodor Sigaev, Petr Jelínek, Jim Nasby
Commitfest-URL: https://commitfest.postgresql.org/9/353/
Discussion: https://www.postgresql.org/message-id/CAPpHfdsXwZmojm6Dx+TJnpYk27kT4o7Ri6X_4OSWcByu1Rm+VA@mail.gmail.com
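The user-facing side of this looks roughly as follows (the handler
function name is hypothetical; it must already exist and return type
index_am_handler):

    CREATE ACCESS METHOD myam TYPE INDEX HANDLER my_am_handler;
    CREATE INDEX ON mytab USING myam (mycol);  -- mytab/mycol are made up
    DROP ACCESS METHOD myam CASCADE;           -- also drops dependent indexes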
The distinction between "archive" and "hot_standby" existed only because
at the time "hot_standby" was added, there was some uncertainty about
stability. This is now a long time ago. We would like to move forward
with simplifying the replication configuration, but this distinction is
in the way, because a primary server cannot tell (without asking a
standby or predicting the future) which one of these would be the
appropriate level.
Pick a new name for the combined setting to make it clearer that it
covers all (non-logical) backup and replication uses. The old values
are still accepted but are converted internally.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Reviewed-by: David Steele <david@pgmasters.net>
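For reference, the combined value is spelled "replica" in released
versions; a sketch of switching to it (takes effect after a server
restart):

    ALTER SYSTEM SET wal_level = 'replica';
    -- after restart, the old names map to the new one:
    SHOW wal_level;   -- shows "replica" even if postgresql.conf still
                      -- says "archive" or "hot_standby"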
Oskari Saarenmaa
Commit d88976cfa1302e8d removed this code from ginFreeScanKeys():
- if (entry->list)
- pfree(entry->list);
evidently in the belief that that ItemPointer array is allocated in the
keyCtx and so would be reclaimed by the following MemoryContextReset.
Unfortunately, it isn't and it won't. It'd likely be a good idea for
that to become so, but as a simple and back-patchable fix in the
meantime, restore this code to ginFreeScanKeys().
Also, add a similar pfree to where startScanEntry() is about to zero out
entry->list. I am not sure if there are any code paths where this
change prevents a leak today, but it seems like cheap future-proofing.
In passing, make the initial allocation of so->entries[] use palloc
not palloc0. The code doesn't depend on unused entries being zero;
if it did, the array-enlargement code in ginFillScanEntry() would be
wrong. So using palloc0 initially can only serve to confuse readers
about what the invariant is.
Per report from Felipe de Jesús Molina Bravo, via Jaime Casanova in
<CAJGNTeMR1ndMU2Thpr8GPDUfiHTV7idELJRFusA5UXUGY1y-eA@mail.gmail.com>
Per Amit Kapila.
When a process is waiting for a heavyweight lock, we will now indicate
the type of heavyweight lock for which it is waiting. Also, you can
now see when a process is waiting for a lightweight lock - in which
case we will indicate the individual lock name or the tranche, as
appropriate - or for a buffer pin.
Amit Kapila, Ildus Kurbangaliev, reviewed by me. Lots of helpful
discussion and suggestions from many others, including Alexander
Korotkov and Vladimir Borodin.
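A quick way to look at the new information (column names as exposed in
pg_stat_activity in released versions):

    SELECT pid, state, wait_event_type, wait_event
    FROM pg_stat_activity
    WHERE wait_event IS NOT NULL;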
Previously the 2PC header was fixed at 200 bytes, which in most cases
wasted WAL space for workloads using 2PC heavily.
Pavan Deolasee, reviewed by Petr Jelinek
Fabrízio de Royes Mello and Simon Riggs
Renaming a file using rename(2) is not guaranteed to be durable in the
face of crashes. Use the previously added durable_rename()/durable_link_or_rename()
in various places where we previously just renamed files.
Most of the changed call sites are arguably not critical, but it seems
better to err on the side of too much durability. The most prominent
known case where the previously missing fsyncs could cause data loss is
crashes at the end of a checkpoint. After the actual checkpoint has been
performed, old WAL files are recycled. When they're filled, their
contents are fdatasynced, but we did not fsync the containing
directory. An OS/hardware crash at an unfortunate moment could then end
up leaving that file with its old name, but new content; WAL replay
would thus not replay it.
Reported-By: Tomas Vondra
Author: Michael Paquier, Tomas Vondra, Andres Freund
Discussion: 56583BDD.9060302@2ndquadrant.com
Backpatch: All supported branches
An index search using a row comparison such as ROW(a, b) > ROW('x', 'y')
would stop upon reaching a NULL entry in the "b" column, ignoring the
fact that there might be non-NULL "b" values associated with later values
of "a". This happens because _bt_mark_scankey_required() marks the
subsidiary scankey for "b" as required, which is just wrong: it's for
a column after the one with the first inequality key (namely "a"), and
thus can't be considered a required match.
This bit of brain fade dates back to the very beginnings of our support
for indexed ROW() comparisons, in 2006. Kind of astonishing that no one
came across it before Glen Takahashi, in bug #14010.
Back-patch to all supported versions.
Note: the given test case doesn't actually fail in unpatched 9.1, evidently
because the fix for bug #6278 (i.e., stopping at nulls in either scan
direction) is required to make it fail. I'm sure I could devise a case
that fails in 9.1 as well, perhaps with something involving making a cursor
back up; but it doesn't seem worth the trouble.
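For reference, the affected query shape is roughly the following (table
and index are hypothetical); before the fix the scan could stop at a NULL
in b and miss matching rows with later values of a:

    CREATE INDEX ON t (a, b);
    SELECT * FROM t WHERE ROW(a, b) > ROW('x', 'y') ORDER BY a, b;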
Using this facility, any utility command can report the target relation
upon which it is operating, if there is one, and up to 10 64-bit
counters; the intent of this is that users should be able to figure out
what a utility command is doing without having to resort to ugly hacks
like attaching strace to a backend.
As a demonstration, this adds very crude reporting to lazy vacuum; we
just report the target relation and nothing else. A forthcoming patch
will make VACUUM report a bunch of additional data that will make this
much more interesting. But this gets the basic framework in place.
Vinayak Pokale, Rahila Syed, Amit Langote, Robert Haas, reviewed by
Kyotaro Horiguchi, Jim Nasby, Thom Brown, Masahiko Sawada, Fujii Masao,
and Masanori Oyama.
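In released versions the VACUUM-side data added on top of this framework
is exposed through a view; a hedged illustration of how one would watch a
running vacuum (view and column names are from the follow-on work, not
this commit):

    SELECT pid, relid::regclass AS relation, phase
    FROM pg_stat_progress_vacuum;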
Masahiko Sawada
Commit a892234f830e832110f63fc0a2afce2fb21d1584 added a second bit per
page to the visibility map, which still seems like a good idea, but it
also added a second page-level bit alongside PD_ALL_VISIBLE to track
whether the visibility map bit was set. That no longer seems like a
clever plan, because we don't really need that bit for anything. We
always clear both bits when the page is modified anyway.
Patch by me, reviewed by Kyotaro Horiguchi and Masahiko Sawada.
Previously recovery_min_apply_delay was applied even before recovery
had reached consistency. This could cause us to wait a long time
unexpectedly for read-only connections to be allowed. That was
problematic because the standby was useless during that wait time.
This patch changes recovery_min_apply_delay so that it's applied once
the database has reached the consistent state. That is, even if the delay
is set, the standby tries to replay WAL records as fast as possible until
it has reached consistency.
Author: Michael Paquier
Reviewed-By: Julien Rouhaud
Reported-By: Greg Clough
Backpatch: 9.4, where recovery_min_apply_delay was added
Bug: #13770
Discussion: http://www.postgresql.org/message-id/20151111155006.2644.84564@wrigleys.postgresql.org
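For context, on 9.x-era standbys the setting lives in recovery.conf; with
this fix a delayed standby such as the following still replays as fast as
possible until it reaches consistency:

    # recovery.conf on the standby
    standby_mode = 'on'
    recovery_min_apply_delay = '5min'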
A simple SELECT is handled by PortalRunSelect, not ProcessQuery. Also,
the previous indentation was unclear: change it so that a deeper level
of indentation indicates that the outer function calls the inner one.
Stas Kelvich
Originally, we didn't have nworkers_launched, so code that used parallel
contexts had to be prepared for the possibility that not all of the
workers requested actually got launched. But now we can count on knowing
the number of workers that were successfully launched, which can shave
off a few cycles and simplify some code slightly.
Amit Kapila, reviewed by Haribabu Kommi, per a suggestion from Peter
Geoghegan.
Commit 606c0123d627 attempted to reduce the cost of index scans using >
and < strategies, but got that completely wrong in a few complex cases.
Revert the whole patch until we find a safe optimization.
The new bit indicates whether every tuple on the page is already frozen.
It is cleared only when the all-visible bit is cleared, and it can be
set only when we vacuum a page and find that every tuple on that page is
both visible to every transaction and in no need of any future
vacuuming.
A future commit will use this new bit to optimize away full-table scans
that would otherwise be triggered by XID wraparound considerations. A
page which is merely all-visible must still be scanned in that case, but
a page which is all-frozen need not be. This commit does not attempt
that optimization, although that optimization is the goal here. It
seems better to get the basic infrastructure in place first.
Per discussion, it's very desirable for pg_upgrade to automatically
migrate existing VM forks from the old format to the new format. That,
too, will be handled in a follow-on patch.
Masahiko Sawada, reviewed by Kyotaro Horiguchi, Fujii Masao, Amit
Kapila, Simon Riggs, Andres Freund, and others, and substantially
revised by me.
StartupSUBTRANS() incorrectly handled cases near the max pageid in the subtrans
data structure, which in some cases could lead to errors in startup for Hot
Standby.
This patch wraps the pageids correctly, avoiding any such errors.
Identified by exhaustive crash testing by Jeff Janes.
Jeff Janes
Commit 4de82f7d7 increased the WAL flush rate, mainly to increase the
likelihood that hint bits can be set quickly. More quickly set hint bits
can reduce contention around the clog et al. But unfortunately the
increased flush rate can have a significant negative performance impact;
I have measured slowdowns of up to a factor of ~4. The reason for this
slowdown is that if there are independent writes to the underlying
devices, for example because shared buffers is a lot smaller than the
hot data set, or because a checkpoint is ongoing, the fdatasync() calls
force cache flushes to be emitted to the storage.
This commit addresses that by flushing WAL only if the last flush was
longer than wal_writer_delay ago, or if more than wal_writer_flush_after
(a new GUC) unflushed blocks are pending. Based on some tests the
default for wal_writer_flush_after is 1MB, which seems to work well both
on SSD and rotational media.
To avoid a negative performance impact due to 4de82f7d7, an earlier
commit (db76b1e) made SetHintBits() more likely to succeed, preventing
performance regressions in the pgbench tests I performed.
Discussion: 20160118163908.GW10941@awork2.anarazel.de
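A sketch of adjusting the new knob (both settings can be changed without
a restart; the values shown are the defaults discussed above):

    ALTER SYSTEM SET wal_writer_delay = '200ms';
    ALTER SYSTEM SET wal_writer_flush_after = '1MB';
    SELECT pg_reload_conf();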
NextXID has been rendered in the form of a pg_lsn even though it really
is not one. This can cause confusion, so change the format from
%u/%u to %u:%u, per discussion on hackers.
Complaint by me, patch by me and Bruce, reviewed by Michael Paquier
and Alvaro. Applied to HEAD only.
Author: Joe Conway, Bruce Momjian
Reviewed-by: Michael Paquier, Alvaro Herrera
Backpatch-through: master
Previously, we had a mix of styles.
Amit Kapila
Historically this message has been emitted at the end of ShutdownXLOG().
That's not an insane place for it in a standalone backend, but in the
postmaster environment we've grown a fair amount of stuff that happens
later, including archiver/walsender shutdown, stats collector shutdown,
etc. Recent buildfarm experimentation showed that on slower machines
there could be many seconds' delay between finishing ShutdownXLOG() and
actual postmaster exit. That's fairly confusing, both for testing
purposes and for DBAs. Hence, move the code that prints this message
into UnlinkLockFiles(), so that it comes out just after we remove the
postmaster's pidfile. That is a more appropriate definition of "is shut
down" from the point of view of "pg_ctl stop", for example. In general,
removing the pidfile should be the last externally-visible action of
either a postmaster or a standalone backend; compare commit
d73d14c271653dff10c349738df79ea03b85236c for instance. So this seems
like a reasonably future-proof approach.
This reverts commit 3971f64843b02e4a55d854156bd53e46a0588e45 and a
couple of followon debugging commits; I think we've learned what we can
from them.
Early returns from the buildfarm show that there's a bit of a gap in the
logging I added in 3971f64843b02e4a: the portion of CreateCheckPoint()
after CheckPointGuts() can take a fair amount of time. Add a few more
log messages in that section of code. This too shall be reverted later.
This is a quick hack, due to be reverted when its purpose has been served,
to try to gather information about why some of the buildfarm critters
regularly fail with "postmaster does not shut down" complaints. Maybe they
are just really overloaded, but maybe something else is going on. Hence,
instrument pg_ctl to print the current time when it starts waiting for
postmaster shutdown and when it gives up, and add a lot of logging of the
current time in the server's checkpoint and shutdown code paths.
No attempt has been made to make this pretty. I'm not even totally sure
if it will build on Windows, but we'll soon find out.
When force_parallel_mode = true, we enable the parallel mode restrictions
for all queries for which this is believed to be safe. For the subset of
those queries believed to be safe to run entirely within a worker, we spin
up a worker and run the query there instead of running it in the
original process. When force_parallel_mode = regress, make additional
changes to allow the regression tests to run cleanly even though parallel
workers have been injected under the hood.
Taken together, this facilitates both better user testing and better
regression testing of the parallelism code.
Robert Haas, with help from Amit Kapila and Rushabh Lathia.
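A minimal usage sketch (the GUC accepts off, on, and regress; the table
name is made up):

    SET force_parallel_mode = on;
    EXPLAIN (COSTS OFF) SELECT count(*) FROM mytab;
    -- for a query believed to be parallel-safe, the plan is wrapped in a
    -- Gather node and executed in a background worker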
For locking purposes, we now regard heavyweight locks as mutually
non-conflicting between cooperating parallel processes. There are some
possible pitfalls to this approach that are not to be taken lightly,
but it works OK for now and can be changed later if we find a better
approach. Without this, it's very easy for parallel queries to
silently self-deadlock if the user backend holds strong relation locks.
Robert Haas, with help from Amit Kapila. Thanks to Noah Misch and
Andres Freund for extensive discussion of possible issues with this
approach.
A KNN GiST search with the recheck flag set should return to the executor
values of the same type as the ordering operator's return type. GiST
detects this type by looking at the return type of the function that
implements the ordering operator. But occasionally the detection code runs
after the ordering operator's function has been replaced by the distance
support function. The distance support function always returns float8, so
the detection code then sees float8 instead of the actual return type of
the ordering operator.
All built-in opclasses have ordering operators that return float8, so the
problem cannot be exercised in a regression test, at least for now.
Backpatch to 9.5 where lossy KNN was introduced.
Author: Alexander Korotkov
Report by: Artur Zakirov
This makes the values more stable, which seems like a good thing for
anybody who needs to look at them.
Alexander Korotkov and Amit Kapila
This function cleans up the pending list of the GIN index by
moving entries in it to the main GIN data structure in bulk.
It returns the number of pages cleaned up from the pending list.
This function is useful, for example, when the pending list
needs to be cleaned up *quickly* to improve the performance of
searches using the GIN index. VACUUM can do the same thing, too,
but it may take days to run on a large table.
Jeff Janes,
reviewed by Julien Rouhaud, Jaime Casanova, Alvaro Herrera and me.
Discussion: CAMkU=1x8zFkpfnozXyt40zmR3Ub_kHu58LtRmwHUKRgQss7=iQ@mail.gmail.com
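Usage is a single function call (the index name is hypothetical); the
return value is the number of pending-list pages cleaned up:

    SELECT gin_clean_pending_list('my_gin_index'::regclass);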
The original coding was quite fast so long as objects were always
released in reverse order of addition; otherwise, it degenerated into
O(N^2) behavior due to searching for the array element to delete.
Improve matters by switching to hashed storage when the number of
objects of a given type exceeds 64. (The cutover point is open to
discussion, of course, but some simple performance testing suggests
that hashing has enough overhead to be a loser below there.)
Also, refactor resowner.c so that we don't need N copies of the array
management code. Since all the resource IDs the code currently needs
to deal with are either pointers or integers, it seems sufficient to
create a one-size-fits-all infrastructure in which everything is
converted to a Datum for storage.
Aleksander Alekseev, reviewed by Stas Kelvich, further fixes by me
Given the limited range of i, these shifts should not cause any
problem, but that apparently doesn't stop some compilers from
whining about them.
David Rowley
The amvalidate functions added in commit 65c5fcd353a859da were on the
crude side. Improve them in a few ways:
* Perform signature checking for operators and support functions.
* Apply more thorough checks for missing operators and functions,
where possible.
* Instead of reporting problems as ERRORs, report most problems as INFO
messages and make the amvalidate function return FALSE. This allows
more than one problem to be discovered per run.
* Report object names rather than OIDs, and work a bit harder on making
the messages understandable.
Also, remove a few more opr_sanity regression test queries that are
now superseded by the amvalidate checks.
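Mirroring what opr_sanity now does, the checks can be run from SQL; a
problem shows up as INFO messages and a false result rather than an
error, e.g.:

    SELECT oid, opcname
    FROM pg_opclass
    WHERE NOT amvalidate(oid);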
It's an oversight in commit dc943ad.
Jeff Janes
2PC state info is written only to WAL at PREPARE, then read back from WAL at
COMMIT PREPARED/ABORT PREPARED. Prepared transactions that live past one bufmgr
checkpoint cycle will be written to disk in the same form as previously. Crash
recovery path is not altered. Measured performance gains of 50-100% for short
2PC transactions by completely avoiding writing files and fsyncing. Other
optimizations still available, further patches in related areas expected.
Stas Kelvich and heavily edited by Simon Riggs
Based upon earlier ideas and patches by Michael Paquier and Heikki Linnakangas,
a concrete example of how Postgres-XC has fed back ideas into PostgreSQL.
Reviewed by Michael Paquier, Jeff Janes and Andres Freund
Performance testing by Jesper Pedersen
Previously we didn't have a generic WAL page read callback function,
surprisingly. Logical decoding has logical_read_local_xlog_page(), which was
actually generic, so move that to xlogfunc.c and rename to
read_local_xlog_page().
Maintain logical_read_local_xlog_page() so existing callers still work.
As requested by Michael Paquier, Alvaro Herrera and Andres Freund
The conventions specified by the GiST SGML documentation were widely
ignored. For example, the strategy-number argument for "consistent" and
"distance" functions is specified to be a smallint, but most of the
built-in support functions declared it as an integer, and for that matter
the core code passed it using Int32GetDatum not Int16GetDatum. None of
that makes any real difference at runtime, but it's quite confusing for
newcomers to the code, and it makes it very hard to write an amvalidate()
function that checks support function signatures. So let's try to instill
some consistency here.
Another similar issue is that the "query" argument is not of a single
well-defined type, but could have different types depending on the strategy
(corresponding to search operators with different righthand-side argument
types). Some of the functions threw up their hands and declared the query
argument as being of "internal" type, which surely isn't right ("any" would
have been more appropriate); but the majority position seemed to be to
declare it as being of the indexed data type, corresponding to a search
operator with both input types the same. So I've specified a convention
that that's what to do always.
Also, the result of the "union" support function actually must be of the
index's storage type, but the documentation suggested declaring it to
return "internal", and some of the functions followed that. Standardize
on telling the truth, instead.
Similarly, standardize on declaring the "same" function's inputs as
being of the storage type, not "internal".
Also, somebody had forgotten to add the "recheck" argument to both
the documentation of the "distance" support function and all of their
SQL declarations, even though the C code was happily using that argument.
Clean that up too.
Fix up some other omissions in the docs too, such as documenting that
union's second input argument is vestigial.
So far as the errors in core function declarations go, we can just fix
pg_proc.h and bump catversion. Adjusting the erroneous declarations in
contrib modules is more debatable: in principle any change in those
scripts should involve an extension version bump, which is a pain.
However, since these changes are purely cosmetic and make no functional
difference, I think we can get away without doing that.
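As a concrete instance of the convention described above, a GiST
consistent support function for a hypothetical type would now be declared
along these lines (all names are made up; note the smallint strategy
number and the final recheck argument):

    CREATE FUNCTION my_consistent(internal, mytype, smallint, oid, internal)
    RETURNS boolean
    AS 'MODULE_PATHNAME', 'my_consistent'
    LANGUAGE C STRICT;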
This patch reduces pg_am to just two columns, a name and a handler
function. All the data formerly obtained from pg_am is now provided
in a C struct returned by the handler function. This is similar to
the designs we've adopted for FDWs and tablesample methods. There
are multiple advantages. For one, the index AM's support functions
are now simple C functions, making them faster to call and much less
error-prone, since the C compiler can now check function signatures.
For another, this will make it far more practical to define index access
methods in installable extensions.
A disadvantage is that SQL-level code can no longer see attributes
of index AMs; in particular, some of the crosschecks in the opr_sanity
regression test are no longer possible from SQL. We've addressed that
by adding a facility for the index AM to perform such checks instead.
(Much more could be done in that line, but for now we're content if the
amvalidate functions more or less replace what opr_sanity used to do.)
We might also want to expose some sort of reporting functionality, but
this patch doesn't do that.
Alexander Korotkov, reviewed by Petr Jelínek, and rather heavily
editorialized on by me.
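The slimmed-down catalog is easy to inspect directly; for example
(output abbreviated):

    SELECT amname, amhandler FROM pg_am;
    --  amname | amhandler
    -- --------+-----------
    --  btree  | bthandler
    --  ...    | ...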
Tomas Vondra, reviewed by Michael Paquier and Amit Kapila
Minor edits by me
Teach GetFlushRecPtr() to update the LogwrtResult cache, as is done by all
other functions in xlog.c.