| Commit message (Collapse) | Author | Age |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously tables declared WITH OIDS, including a significant fraction
of the catalog tables, stored the oid column not as a normal column,
but as part of the tuple header.
This special column was not shown by default, which was somewhat odd,
as it's often (consider e.g. pg_class.oid) one of the more important
parts of a row. Neither pg_dump nor COPY included the contents of the
oid column by default.
The fact that the oid column was not an ordinary column necessitated a
significant amount of special case code to support oid columns. That
already was painful for the existing, but upcoming work aiming to make
table storage pluggable, would have required expanding and duplicating
that "specialness" significantly.
WITH OIDS has been deprecated since 2005 (commit ff02d0a05280e0).
Remove it.
Removing includes:
- CREATE TABLE and ALTER TABLE syntax for declaring the table to be
WITH OIDS has been removed (WITH (oids[ = true]) will error out)
- pg_dump does not support dumping tables declared WITH OIDS and will
issue a warning when dumping one (and ignore the oid column).
- restoring an pg_dump archive with pg_restore will warn when
restoring a table with oid contents (and ignore the oid column)
- COPY will refuse to load binary dump that includes oids.
- pg_upgrade will error out when encountering tables declared WITH
OIDS, they have to be altered to remove the oid column first.
- Functionality to access the oid of the last inserted row (like
plpgsql's RESULT_OID, spi's SPI_lastoid, ...) has been removed.
The syntax for declaring a table WITHOUT OIDS (or WITH (oids = false)
for CREATE TABLE) is still supported. While that requires a bit of
support code, it seems unnecessary to break applications / dumps that
do not use oids, and are explicit about not using them.
The biggest user of WITH OID columns was postgres' catalog. This
commit changes all 'magic' oid columns to be columns that are normally
declared and stored. To reduce unnecessary query breakage all the
newly added columns are still named 'oid', even if a table's column
naming scheme would indicate 'reloid' or such. This obviously
requires adapting a lot code, mostly replacing oid access via
HeapTupleGetOid() with access to the underlying Form_pg_*->oid column.
The bootstrap process now assigns oids for all oid columns in
genbki.pl that do not have an explicit value (starting at the largest
oid previously used), only oids assigned later by oids will be above
FirstBootstrapObjectId. As the oid column now is a normal column the
special bootstrap syntax for oids has been removed.
Oids are not automatically assigned during insertion anymore, all
backend code explicitly assigns oids with GetNewOidWithIndex(). For
the rare case that insertions into the catalog via SQL are called for
the new pg_nextoid() function can be used (which only works on catalog
tables).
The fact that oid columns on system tables are now normal columns
means that they will be included in the set of columns expanded
by * (i.e. SELECT * FROM pg_class will now include the table's oid,
previously it did not). It'd not technically be hard to hide oid
column by default, but that'd mean confusing behavior would either
have to be carried forward forever, or it'd cause breakage down the
line.
While it's not unlikely that further adjustments are needed, the
scope/invasiveness of the patch makes it worthwhile to get merge this
now. It's painful to maintain externally, too complicated to commit
after the code code freeze, and a dependency of a number of other
patches.
Catversion bump, for obvious reasons.
Author: Andres Freund, with contributions by John Naylor
Discussion: https://postgr.es/m/20180930034810.ywp2c7awz7opzcfr@alap3.anarazel.de
|
|
|
|
|
|
|
|
|
|
| |
The output for record types XLOG_DBASE_CREATE and XLOG_DBASE_DROP used
the order dbid/tablespaceid, whereas elsewhere the order is
tablespaceid/dbid[/relfilenodeid]. Flip the order for those two types
to make it consistent.
Author: Jean-Christophe Arnu <jcarnu@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAHZmTm18Ln62KW-G8NYvO1wbBL3QU1E76Zep=DuHmg-zS2XFAg@mail.gmail.com/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
On some operating systems, it doesn't make sense to retry fsync(),
because dirty data cached by the kernel may have been dropped on
write-back failure. In that case the only remaining copy of the
data is in the WAL. A subsequent fsync() could appear to succeed,
but not have flushed the data. That means that a future checkpoint
could apparently complete successfully but have lost data.
Therefore, violently prevent any future checkpoint attempts by
panicking on the first fsync() failure. Note that we already
did the same for WAL data; this change extends that behavior to
non-temporary data files.
Provide a GUC data_sync_retry to control this new behavior, for
users of operating systems that don't eject dirty data, and possibly
forensic/testing uses. If it is set to on and the write-back error
was transient, a later checkpoint might genuinely succeed (on a
system that does not throw away buffers on failure); if the error is
permanent, later checkpoints will continue to fail. The GUC defaults
to off, meaning that we panic.
Back-patch to all supported releases.
There is still a narrow window for error-loss on some operating
systems: if the file is closed and later reopened and a write-back
error occurs in the intervening time, but the inode has the bad
luck to be evicted due to memory pressure before we reopen, we could
miss the error. A later patch will address that with a scheme
for keeping files with dirty data open at all times, but we judge
that to be too complicated to back-patch.
Author: Craig Ringer, with some adjustments by Thomas Munro
Reported-by: Craig Ringer
Reviewed-by: Robert Haas, Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When reading a WAL record, its contents are copied into an intermediate
buffer. However, doing so is not necessary if the record fits fully
into the current page, saving one memcpy for each such record. The
allocation handling of the intermediate buffer is also now done only
when a record crosses a page boundary, shaving some extra cycles when
reading a WAL record.
Author: Andrey Lepikhov
Reviewed-by: Kyotaro Horiguchi, Heikki Linnakangas
Discussion: https://postgr.es/m/c2ea54dd-a1d3-80eb-ddbf-7e6f258e615e@postgrespro.ru
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit completes the work prepared in 1a0586de36, splitting the
old TupleTableSlot implementation (which could store buffer, heap,
minimal and virtual slots) into four different slot types. As
described in the aforementioned commit, this is done with the goal of
making tuple table slots extensible, to allow for pluggable table
access methods.
To achieve runtime extensibility for TupleTableSlots, operations on
slots that can differ between types of slots are performed using the
TupleTableSlotOps struct provided at slot creation time. That
includes information from the size of TupleTableSlot struct to be
allocated, initialization, deforming etc. See the struct's definition
for more detailed information about callbacks TupleTableSlotOps.
I decided to rename TTSOpsBufferTuple to TTSOpsBufferHeapTuple and
ExecCopySlotTuple to ExecCopySlotHeapTuple, as that seems more
consistent with other naming introduced in recent patches.
There's plenty optimization potential in the slot implementation, but
according to benchmarking the state after this commit has similar
performance characteristics to before this set of changes, which seems
sufficient.
There's a few changes in execReplication.c that currently need to poke
through the slot abstraction, that'll be repaired once the pluggable
storage patchset provides the necessary infrastructure.
Author: Andres Freund and Ashutosh Bapat, with changes by Amit Khandekar
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Upcoming work intends to allow pluggable ways to introduce new ways of
storing table data. Accessing those table access methods from the
executor requires TupleTableSlots to be carry tuples in the native
format of such storage methods; otherwise there'll be a significant
conversion overhead.
Different access methods will require different data to store tuples
efficiently (just like virtual, minimal, heap already require fields
in TupleTableSlot). To allow that without requiring additional pointer
indirections, we want to have different structs (embedding
TupleTableSlot) for different types of slots. Thus different types of
slots are needed, which requires adapting creators of slots.
The slot that most efficiently can represent a type of tuple in an
executor node will often depend on the type of slot a child node
uses. Therefore we need to track the type of slot is returned by
nodes, so parent slots can create slots based on that.
Relatedly, JIT compilation of tuple deforming needs to know which type
of slot a certain expression refers to, so it can create an
appropriate deforming function for the type of tuple in the slot.
But not all nodes will only return one type of slot, e.g. an append
node will potentially return different types of slots for each of its
subplans.
Therefore add function that allows to query the type of a node's
result slot, and whether it'll always be the same type (whether it's
fixed). This can be queried using ExecGetResultSlotOps().
The scan, result, inner, outer type of slots are automatically
inferred from ExecInitScanTupleSlot(), ExecInitResultSlot(),
left/right subtrees respectively. If that's not correct for a node,
that can be overwritten using new fields in PlanState.
This commit does not introduce the actually abstracted implementation
of different kind of TupleTableSlots, that will be left for a followup
commit. The different types of slots introduced will, for now, still
use the same backing implementation.
While this already partially invalidates the big comment in
tuptable.h, it seems to make more sense to update it later, when the
different TupleTableSlot implementations actually exist.
Author: Ashutosh Bapat and Andres Freund, with changes by Amit Khandekar
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If a failure happens when a transaction is starting between the moment
the transaction status is changed from TRANS_DEFAULT to TRANS_START and
the moment the current user ID and security context flags are fetched
via GetUserIdAndSecContext(), or before initializing its basic fields,
then those may get reset to incorrect values when the transaction
aborts, leaving the session in an inconsistent state.
One problem reported is that failing a starting transaction at the first
query of a session could cause several kinds of system crashes on the
follow-up queries.
In order to solve that, move the initialization of the transaction state
fields and the call of GetUserIdAndSecContext() in charge of fetching
the current user ID close to the point where the transaction status is
switched to TRANS_START, where there cannot be any error triggered
in-between, per an idea of Tom Lane. This properly ensures that the
current user ID, the security context flags and that the basic fields of
TransactionState remain consistent even if the transaction fails while
starting.
Reported-by: Richard Guo
Diagnosed-By: Richard Guo
Author: Michael Paquier
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/CAN_9JTxECSb=pEPcb0a8d+6J+bDcOZ4=DgRo_B7Y5gRHJUM=Rw@mail.gmail.com
Backpatch-through: 9.4
|
|
|
|
|
|
|
|
|
|
|
| |
Hexadecimal is consistently used as format to not bloat too much the
output but keep it readable. This information is useful mainly for
debugging purposes with for example pg_waldump.
Author: Michael Paquier
Reviewed-by: Nathan Bossart, Dmitry Dolgov, Andres Freund, Álvaro
Herrera
Discussion: https://postgr.es/m/20180413034734.GE1552@paquier.xyz
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The use of volatiles in procarray.c largely originated from the time
when postgres did not have reliable compiler and memory
barriers. That's not the case anymore, so we can do better.
Several of the functions in procarray.c can be bottlenecks, and
removal of volatile yields mildly better code.
The new state, with explicit memory barriers, is also more
correct. The previous use of volatile did not actually deliver
sufficient guarantees on weakly ordered machines, in particular the
logic in GetNewTransactionId() does not look safe. It seems unlikely
to be a problem in practice, but worth fixing.
Thomas and I independently wrote a patch for this.
Reported-By: Andres Freund and Thomas Munro
Author: Andres Freund, with cherrypicked changes from a patch by Thomas Munro
Discussion:
https://postgr.es/m/20181005172955.wyjb4fzcdzqtaxjq@alap3.anarazel.de
https://postgr.es/m/CAEepm=1nff0x=7i3YQO16jLA2qw-F9O39YmUew4oq-xcBQBs0g@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Cut down on system calls by doing random I/O using offset-based OS
routines where available. Remove the code for tracking the 'virtual'
seek position. The only reason left to call FileSeek() was to get
the file's size, so provide a new function FileSize() instead.
Author: Oskari Saarenmaa, Thomas Munro
Reviewed-by: Thomas Munro, Jesper Pedersen, Tom Lane, Alvaro Herrera
Discussion: https://postgr.es/m/CAEepm=02rapCpPR3ZGF2vW=SBHSdFYO_bz_f-wwWJonmA3APgw@mail.gmail.com
Discussion: https://postgr.es/m/b8748d39-0b19-0514-a1b9-4e5a28e6a208%40gmail.com
Discussion: https://postgr.es/m/a86bd200-ebbe-d829-e3ca-0c4474b2fcb7%40ohmu.fi
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The "rb" prefix is used by Ruby, so that our existing code results
in name collisions that break plruby. We discussed ways to prevent
that by adjusting dynamic linker options, but it seems that at best
we'd move the pain to other cases. Renaming to avoid the collision
is the only portable fix anyway. Fortunately, our rbtree code is
not (yet?) widely used --- in core, there's only a single usage
in GIN --- so it seems likely that we can get away with a rename.
I chose to do this basically as s/rb/rbt/g, except for places where
there already was a "t" after "rb". The patch could have been made
smaller by only touching linker-visible symbols, but it would have
resulted in oddly inconsistent-looking code. Better to make it look
like "rbt" was the plan all along.
Back-patch to v10. The rbtree.c code exists back to 9.5, but
rb_iterate() which is the actual immediate source of pain was added
in v10, so it seems like changing the names before that would have
more risk than benefit.
Per report from Pavel Raiskup.
Discussion: https://postgr.es/m/4738198.8KVIIDhgEB@nb.usersys.redhat.com
|
|
|
|
| |
Author: Daniel Gustafsson <daniel@yesql.se>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
spgendscan neglected to pfree all the memory allocated by spgbeginscan.
It's possible to get away with that in most normal queries, since the
memory is allocated in the executor's per-query context which is about
to get deleted anyway; but it causes severe memory leakage during
creation or filling of large exclusion-constraint indexes.
Also, document that amendscan is supposed to free what ambeginscan
allocates. The docs' lack of clarity on that point probably caused this
bug to begin with. (There is discussion of changing that API spec going
forward, but I don't think it'd be appropriate for the back branches.)
Per report from Bruno Wolff. It's been like this since the beginning,
so back-patch to all active branches.
In HEAD, also fix an independent leak caused by commit 2a6368343
(allocating memory during spgrescan instead of spgbeginscan, which
might be all right if it got cleaned up, but it didn't). And do a bit
of code beautification on that commit, too.
Discussion: https://postgr.es/m/20181024012314.GA27428@wolff.to
|
|
|
|
|
| |
Author: Daniel Gustafsson
Discussion: https://postgr.es/m/A6817958-949E-4A5B-895D-FA421B6640C2@yesql.se
|
|
|
|
|
|
|
|
|
|
|
|
| |
This function is able to promote a standby with this new SQL-callable
function. Execution access can be granted to non-superusers so that
failover tools can observe the principle of least privilege.
Catalog version is bumped.
Author: Laurenz Albe
Reviewed-by: Michael Paquier, Masahiko Sawada
Discussion: https://postgr.es/m/6e7c79b3ec916cf49742fb8849ed17cd87aed620.camel@cybertec.at
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 16828d5c0 incorrectly set an invalid pointer for t_self for heap
tuples. This patch correctly copies it from the source tuple, and
includes a regression test that relies on it being set correctly.
Backpatch to release 11.
Fixes bug #15448 reported by Tillmann Schulz
Diagnosis and test case by Amit Langote
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There's several reasons for this change:
1) It reduces the total size of TupleTableSlot / reduces alignment
padding, making the commonly accessed members fit into a single
cacheline (but we currently do not force proper alignment, so
that's not yet guaranteed to be helpful)
2) Combining the booleans into a flag allows to combine read/writes
from memory.
3) With the upcoming slot abstraction changes, it allows to have core
and extended flags, in a memory efficient way.
Author: Ashutosh Bapat and Andres Freund
Discussion: https://postgr.es/m/20180220224318.gw4oe5jadhpmcdnm@alap3.anarazel.de
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
heaptuple.c was never a particular good fit for slot_getattr(),
slot_getsomeattrs() and slot_getmissingattrs(), but in upcoming
changes slots will be made more abstract (allowing slots that contain
different types of tuples), making it clearly the wrong place.
Note that slot_deform_tuple() remains in it's current place, as it
clearly deals with a HeapTuple. getmissingattrs() also remains, but
it's less clear that that's correct - but execTuples.c wouldn't be the
right place.
Author: Ashutosh Bapat.
Discussion: https://postgr.es/m/20180220224318.gw4oe5jadhpmcdnm@alap3.anarazel.de
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We can allow this macro to accept either abbreviated or non-abbreviated
allocation parameters by making use of __VA_ARGS__. As noted by Andres
Freund, it's unlikely that any compiler would have __builtin_constant_p
but not __VA_ARGS__, so this gives up little or no error checking, and
it avoids a minor but annoying API break for extensions.
With this change, there is no reason for anybody to call
AllocSetContextCreateExtended directly, so in HEAD I renamed it to
AllocSetContextCreateInternal. It's probably too late for an ABI
break like that in 11, though.
Discussion: https://postgr.es/m/20181012170355.bhxi273skjt6sag4@alap3.anarazel.de
|
|
|
|
|
|
|
|
|
|
|
|
| |
These types have been deprecated for a *long* time.
Catversion bump, for obvious reasons.
Author: Andres Freund
Discussion:
https://postgr.es/m/20181009192237.34wjp3nmw7oynmmr@alap3.anarazel.de
https://postgr.es/m/20171213080506.cwjkpcz3bkk6yz2u@alap3.anarazel.de
https://postgr.es/m/25615.1513115237@sss.pgh.pa.us
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Repeatedly rewriting a mapped catalog table with VACUUM FULL or
CLUSTER could cause logical decoding to fail with:
ERROR, "could not map filenode \"%s\" to relation OID"
To trigger the problem the rewritten catalog had to have live tuples
with toasted columns.
The problem was triggered as during catalog table rewrites the
heap_insert() check that prevents logical decoding information to be
emitted for system catalogs, failed to treat the new heap's toast table
as a system catalog (because the new heap is not recognized as a
catalog table via RelationIsLogicallyLogged()). The relmapper, in
contrast to the normal catalog contents, does not contain historical
information. After a single rewrite of a mapped table the new relation
is known to the relmapper, but if the table is rewritten twice before
logical decoding occurs, the relfilenode cannot be mapped to a
relation anymore. Which then leads us to error out. This only
happens for toast tables, because the main table contents aren't
re-inserted with heap_insert().
The fix is simple, add a new heap_insert() flag that prevents logical
decoding information from being emitted, and accept during decoding
that there might not be tuple data for toast tables.
Unfortunately that does not fix pre-existing logical decoding
errors. Doing so would require not throwing an error when a filenode
cannot be mapped to a relation during decoding, and that seems too
likely to hide bugs. If it's crucial to fix decoding for an existing
slot, temporarily changing the ERROR in ReorderBufferCommit() to a
WARNING appears to be the best fix.
Author: Andres Freund
Discussion: https://postgr.es/m/20180914021046.oi7dm4ra3ot2g2kt@alap3.anarazel.de
Backpatch: 9.4-, where logical decoding was introduced
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Originally committed as 15bc038f (plus some follow-ups), this was
reverted in 28e07270 due to a problem discovered in parallel
workers. This new version corrects that problem by sending the
list of uncommitted enum values to parallel workers.
Here follows the original commit message describing the change:
To prevent possibly breaking indexes on enum columns, we must keep
uncommitted enum values from getting stored in tables, unless we
can be sure that any such column is new in the current transaction.
Formerly, we enforced this by disallowing ALTER TYPE ... ADD VALUE
from being executed at all in a transaction block, unless the target
enum type had been created in the current transaction. This patch
removes that restriction, and instead insists that an uncommitted enum
value can't be referenced unless it belongs to an enum type created
in the same transaction as the value. Per discussion, this should be
a bit less onerous. It does require each function that could possibly
return a new enum value to SQL operations to check this restriction,
but there aren't so many of those that this seems unmaintainable.
Author: Andrew Dunstan and Tom Lane, with parallel query fix by Thomas Munro
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/CAEepm%3D0Ei7g6PaNTbcmAh9tCRahQrk%3Dr5ZWLD-jr7hXweYX3yg%40mail.gmail.com
Discussion: https://postgr.es/m/4075.1459088427%40sss.pgh.pa.us
|
|
|
|
|
|
|
|
| |
Per discussion, this behavior seems less astonishing than not doing so.
Peter Eisentraut and Tom Lane
Discussion: https://postgr.es/m/20180920234040.GC29981@momjian.us
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 9a3cebeaa changed things so that parallel workers didn't obtain
any lock of their own on tables they access. That was clearly a bad
idea, but I'd mistakenly supposed that it was the intended end result
of the series of patches for simplifying the executor's lock management.
Undo that change in relation_open(), and adjust ExecOpenScanRelation()
so that it gets the correct lock if inside a parallel worker.
In passing, clean up some more obsolete comments about when locks
are acquired.
Discussion: https://postgr.es/m/468c85d9-540e-66a2-1dde-fec2b741e688@lab.ntt.co.jp
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, a worker process would establish values for these based on
its own start time. In v10 and up, this can trivially be shown to cause
misbehavior of transaction_timestamp(), timestamp_in(), and related
functions which are (perhaps unwisely?) marked parallel-safe. It seems
likely that other behaviors might diverge from what happens in the parent
as well.
It's not as trivial to demonstrate problems in 9.6 or 9.5, but I'm sure
it's still possible, so back-patch to all branches containing parallel
worker infrastructure.
In HEAD only, mark now() and statement_timestamp() as parallel-safe
(other affected functions already were). While in theory we could
still squeeze that change into v11, it doesn't seem important enough
to force a last-minute catversion bump.
Konstantin Knizhnik, whacked around a bit by me
Discussion: https://postgr.es/m/6406dbd2-5d37-4cb6-6eb2-9c44172c7e7c@postgrespro.ru
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Historically we forbade datatype-specific comparison functions from
returning INT_MIN, so that it would be safe to invert the sort order
just by negating the comparison result. However, this was never
really safe for comparison functions that directly return the result
of memcmp(), strcmp(), etc, as POSIX doesn't place any such restriction
on those library functions. Buildfarm results show that at least on
recent Linux on s390x, memcmp() actually does return INT_MIN sometimes,
causing sort failures.
The agreed-on answer is to remove this restriction and fix relevant
call sites to not make such an assumption; code such as "res = -res"
should be replaced by "INVERT_COMPARE_RESULT(res)". The same is needed
in a few places that just directly negated the result of memcmp or
strcmp.
To help find places having this problem, I've also added a compile option
to nbtcompare.c that causes some of the commonly used comparators to
return INT_MIN/INT_MAX instead of their usual -1/+1. It'd likely be
a good idea to have at least one buildfarm member running with
"-DSTRESS_SORT_INT_MIN". That's far from a complete test of course,
but it should help to prevent fresh introductions of such bugs.
This is a longstanding portability hazard, so back-patch to all supported
branches.
Discussion: https://postgr.es/m/20180928185215.ffoq2xrq5d3pafna@alap3.anarazel.de
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Instead of locking tables during executor startup, just Assert that
suitable locks were obtained already during the parse/plan pipeline
(or re-obtained by the plan cache). This must be so, else we have a
hazard that concurrent DDL has invalidated the plan.
This is pretty inefficient as well as undercommented, but it's all going
to go away shortly, so I didn't try hard. This commit is just another
attempt to use the buildfarm to see if we've missed anything in the plan
to simplify the executor's table management.
Note that the change needed here in relation_open() exposes that
parallel workers now really are accessing tables without holding any
lock of their own, whereas they were not doing that before this commit.
This does not give me a warm fuzzy feeling about that aspect of parallel
query; it does not seem like a good design, and we now know that it's
had exactly no actual testing. I think that we should modify parallel
query so that that change can be reverted.
Discussion: https://postgr.es/m/468c85d9-540e-66a2-1dde-fec2b741e688@lab.ntt.co.jp
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It's inefficient to use a single slot for mapping between tuple
descriptors for multiple tuples, as previously done when using
ConvertPartitionTupleSlot(), as that means the slot's tuple descriptors
change for every tuple.
Previously we also, via ConvertPartitionTupleSlot(), built new tuples
after the mapping even in cases where we, immediately afterwards,
access individual columns again.
Refactor the code so one slot, on demand, is used for each
partition. That avoids having to change the descriptor (and allows to
use the more efficient "fixed" tuple slots). Then use slot->slot
mapping, to avoid unnecessarily forming a tuple.
As the naming between the tuple and slot mapping functions wasn't
consistent, rename them to execute_attr_map_{tuple,slot}. It's likely
that we'll also rename convert_tuples_by_* to denote that these
functions "only" build a map, but that's left for later.
Author: Amit Khandekar and Amit Langote, editorialized by me
Reviewed-By: Amit Langote, Amit Khandekar, Andres Freund
Discussion:
https://postgr.es/m/CAJ3gD9fR0wRNeAE8VqffNTyONS_UfFPRpqxhnD9Q42vZB+Jvpg@mail.gmail.com
https://postgr.es/m/e4f9d743-cd4b-efb0-7574-da21d86a7f36%40lab.ntt.co.jp
Backpatch: -
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Opening a relation with no lock at all is unsafe; there's no guarantee
that we'll see a consistent state of the relevant catalog entries.
While use of MVCC scans to read the catalogs partially addresses that
complaint, it's still possible to switch to a new catalog snapshot
partway through loading the relcache entry. Moreover, whether or not
you trust the reasoning behind sometimes using less than
AccessExclusiveLock for ALTER TABLE, that reasoning is certainly not
valid if concurrent users of the table don't hold a lock corresponding
to the operation they want to perform.
Hence, add some assertion-build-only checks that require any caller
of relation_open(x, NoLock) to hold at least AccessShareLock. This
isn't a full solution, since we can't verify that the lock level is
semantically appropriate for the action --- but it's definitely of
some use, because it's already caught two bugs.
We can also assert that callers of addRangeTableEntryForRelation()
hold at least the lock level specified for the new RTE.
Amit Langote and Tom Lane
Discussion: https://postgr.es/m/16565.1538327894@sss.pgh.pa.us
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When the checkpointer receives a SIGHUP signal to update its configuration,
it may need to update the shared memory for full_page_writes and need to
write a WAL record for it. Now, it is quite possible that the XLOG
machinery has not been initialized by that time and it will lead to
assertion failure while doing that. Fix is to allow the initialization of
the XLOG machinery outside critical section.
This bug has been introduced by the commit 2c03216d83 which added the XLOG
machinery initialization in RecoveryInProgress code path.
Reported-by: Dilip Kumar
Author: Dilip Kumar
Reviewed-by: Michael Paquier and Amit Kapila
Backpatch-through: 9.5
Discussion: https://postgr.es/m/CAFiTN-u4BA8KXcQUWDPNgaKAjDXC=C2whnzBM8TAcv=stckYUw@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A restart point or a checkpoint recycling WAL segments treats segments
marked with neither ".done" (archiving is done) or ".ready" (segment is
ready to be archived) in archive_status the same way for archive_mode
being "on" or "always". While for a primary this is fine, a standby
running a restart point with archive_mode = on would try to mark such a
segment as ready for archiving, which is something that will never
happen except after the standby is promoted.
Note that this problem applies only to WAL segments coming from the
local pg_wal the first time archive recovery is run. Segments part of a
self-contained base backup are the most common case where this could
happen, however even in this case normally the .done markers would be
most likely part of the backup. Segments recovered from an archive are
marked as .ready or .done by the startup process, and segments finished
streaming are marked as such by the WAL receiver, so they are handled
already.
Reported-by: Haruka Takatsuka
Author: Michael Paquier
Discussion: https://postgr.es/m/15402-a453c90ed4cf88b2@postgresql.org
Backpatch-through: 9.5, where archive_mode = always has been added.
|
| |
|
|
|
|
|
| |
Author: Mark Dilger
Discussion: https://postgr.es/m/B2AEFCD0-836D-4654-9D59-3DF616E0A6F3%40gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The activation and deactivation of commit timestamp tracking has not
been handled consistently for a primary or standbys at recovery. The
facility can be activated at three different moments of recovery:
- The beginning, where a primary would use the GUC value for the
decision-making, and where a standby relies on the contents of the
control file.
- When replaying a XLOG_PARAMETER_CHANGE record at redo.
- The end, where both primary and standby rely on the GUC value.
Using the GUC value for a primary at the beginning of recovery causes
problems with commit timestamp access when doing crash recovery.
Particularly, when replaying transaction commits, it could be possible
that an attempt to read commit timestamps is done for a transaction
which committed at a moment when track_commit_timestamp was disabled.
A test case is added to reproduce the failure. The test works down to
v11 as it takes advantage of transaction commits within procedures.
Reported-by: Hailong Li
Author: Masahiko Sawasa, Michael Paquier
Reviewed-by: Kyotaro Horiguchi
Discussion: https://postgr.es/m/11224478-a782-203b-1f17-e4797b39bdf0@qunar.com
Backpatch-through: 9.5, where commit timestamps have been introduced.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Upcoming changes introduce further types of tuple table slots, in
preparation of making table storage pluggable. New storage methods
will have different representation of tuples, therefore the slot
accessor should refer explicitly to heap tuples.
Instead of just renaming the functions, split it into one function
that accepts heap tuples not residing in buffers, and one accepting
ones in buffers. Previously one function was used for both, but that
was a bit awkward already, and splitting will allow us to represent
slot types for tuples in buffers and normal memory separately.
This is split out from the patch introducing abstract slots, as this
largely consists out of mechanical changes.
Author: Ashutosh Bapat
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/20180220224318.gw4oe5jadhpmcdnm@alap3.anarazel.de
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Ensure that triggers get properly filled in tuples for the OLD value.
Also fix the logic of detecting missing null values. The previous logic
failed to detect a missing null column before the first missing column
with a default. Fixing this has simplified the logic a bit.
Regression tests are added to test changes. This should ensure better
coverage of expand_tuple().
Original bug reports, and some code and test scripts from Tomas Vondra
Backpatch to release 11.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Several users of extensions complained of crashes in parallel workers
that turned out to be due to syscache access from their _PG_init()
functions. Reorder the initialization of parallel workers so that
libraries are restored after the caches are initialized, and inside a
transaction.
This was reported in bug #15350 and elsewhere. We don't consider it
to be a bug: extensions shouldn't do that, because then they can't be
used in shared_preload_libraries. However, it's a fairly obscure
hazard and these extensions worked in practice before parallel query
came along. So let's make it work. Later commits might add a warning
message and eventually an error.
Back-patch to 9.6, where parallel query landed.
Author: Thomas Munro
Reviewed-by: Amit Kapila
Reported-by: Kieran McCusker, Jimmy
Discussion: https://postgr.es/m/153512195228.1489.8545997741965926448%40wrigleys.postgresql.org
|
|
|
|
|
|
|
|
| |
No periods at the ends of primary error messages, please.
Daniel Gustafsson
Discussion: https://postgr.es/m/43E004C0-18C6-42B4-A313-003B43EB0571@yesql.se
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, KNN searches were supported only by GiST. SP-GiST also capable to
support them. This commit implements that support. SP-GiST scan stack is
replaced with queue, which serves as stack if no ordering is specified. KNN
support is provided for three SP-GIST opclasses: quad_point_ops, kd_point_ops
and poly_ops (catversion is bumped). Some common parts between GiST and SP-GiST
KNNs are extracted into separate functions.
Discussion: https://postgr.es/m/570825e8-47d0-4732-2bf6-88d67d2d51c8%40postgrespro.ru
Author: Nikita Glukhov, Alexander Korotkov based on GSoC work by Vlad Sterzhanov
Review: Andrey Borodin, Alexander Korotkov
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
XLogInsert fails to attach a required FPI to the first record after
full_page_writes is turned on by the last checkpoint. This bug got
introduced in 9.5 due to code rearrangement in commits 2c03216d83 and
2076db2aea. Fix it by ensuring that XLogInsertRecord performs a
recomputation when the given record is generated with FPW as off but
found that the flag has been turned on while actually inserting the
record.
Reported-by: Kyotaro Horiguchi
Author: Kyotaro Horiguchi
Reviewed-by: Amit Kapila
Backpatch-through: 9.5 where this problem was introduced
Discussion: https://postgr.es/m/20180420.151043.74298611.horiguchi.kyotaro@lab.ntt.co.jp
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
spgrescan would first reset traversalCxt, and then traverse a
potentially non-empty stack containing pointers to traversalValues
which had been allocated in those contexts, freeing them a second
time. This bug originates in commit ccd6eb49a where traversalValue was
introduced.
Repair by traversing the stack before the context reset; this isn't
ideal, since it means doing retail pfree in a context that's about to
be reset, but the freeing of a stack entry is also done in other
places in the code during the scan so it's not worth trying to
refactor it further. Regression test added.
Backpatch to 9.6 where the problem was introduced.
Per bug #15378; analysis and patch by me, originally from a report on
IRC by user velix; see also PostGIS ticket #4174; review by Alexander
Korotkov.
Discussion: https://postgr.es/m/153663176628.23136.11901365223750051490@wrigleys.postgresql.org
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ginRedoRecompress() replays actions over compressed segments of posting list
in-place. However, it might lead to write past pg_upper, because intermediate
state during playing the changes can take more space than both original state
and final state. This commit fixes that by refuse from in-place modification.
Instead page tail is copied once modification is started, and then it's used
as the source of original segments. Backpatch to 9.4 where posting list
compression was introduced.
Reported-by: Sivasubramanian Ramasubramanian
Discussion: https://postgr.es/m/1536091151804.6588%40amazon.com
Author: Alexander Korotkov based on patch from and ideas by Sivasubramanian Ramasubramanian
Review: Sivasubramanian Ramasubramanian
Backpatch-through: 9.4
|
|
|
|
|
|
|
|
| |
This has been detected using some interesting tricks with sed, and the
method used is mentioned in details in the discussion below.
Author: Justin Pryzby
Discussion: https://postgr.es/m/20180908013109.GB15350@telsasoft.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When a corrupted two-phase state file is found by WAL replay, be it for
crash recovery or archive recovery, then the file is simply skipped and
a WARNING is logged to the user, causing the transaction to be silently
lost. Facing an on-disk WAL file which is corrupted is as likely to
happen as what is stored in WAL records, but WAL records are already
able to fail hard if there is a CRC mismatch. On-disk two-phase state
files, on the contrary, are simply ignored if corrupted. Note that when
restoring the initial two-phase data state at recovery, files newer than
the horizon XID are discarded hence no files present in pg_twophase/
should be torned and have been made durable by a previous checkpoint, so
recovery should never see any corrupted two-phase state file by design.
The situation got better since 978b2f6 which has added two-phase state
information directly in WAL instead of using on-disk files, so the risk
is limited to two-phase transactions which live across at least one
checkpoint for long periods. Backups having legit two-phase state files
on-disk could also lose silently transactions when restored if things
get corrupted.
This behavior exists since two-phase commit has been introduced, no
back-patch is done for now per the lack of complaints about this
problem.
Author: Michael Paquier
Discussion: https://postgr.es/m/20180709050309.GM1467@paquier.xyz
|
|
|
|
|
|
|
| |
These are just a few particularly egregious cases that were hard to read
and write, and error prone because of many similar adjacent types.
Discussion: https://www.postgresql.org/message-id/flat/4c9f01be-9245-2148-b569-61a8562ef190%402ndquadrant.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
On a split, we allocate a new splitpoint's worth of bucket pages wherein
we initialize the last page with zeros which is fine, but we forgot to set
the checksum for that last page.
We decided to back-patch this fix till 10 because we don't have an easy
way to test it in prior versions. Another reason is that the hash-index
code is changed heavily in 10, so it is not advisable to push the fix
without testing it in prior versions.
Author: Amit Kapila
Reviewed-by: Yugo Nagata
Backpatch-through: 10
Discussion: https://postgr.es/m/5d03686d-727c-dbf8-0064-bf8b97ffe850@2ndquadrant.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There's a project policy against using plain "char buf[BLCKSZ]" local
or static variables as page buffers; preferred style is to palloc or
malloc each buffer to ensure it is MAXALIGN'd. However, that policy's
been ignored in an increasing number of places. We've apparently got
away with it so far, probably because (a) relatively few people use
platforms on which misalignment causes core dumps and/or (b) the
variables chance to be sufficiently aligned anyway. But this is not
something to rely on. Moreover, even if we don't get a core dump,
we might be paying a lot of cycles for misaligned accesses.
To fix, invent new union types PGAlignedBlock and PGAlignedXLogBlock
that the compiler must allocate with sufficient alignment, and use
those in place of plain char arrays.
I used these types even for variables where there's no risk of a
misaligned access, since ensuring proper alignment should make
kernel data transfers faster. I also changed some places where
we had been palloc'ing short-lived buffers, for coding style
uniformity and to save palloc/pfree overhead.
Since this seems to be a live portability hazard (despite the lack
of field reports), back-patch to all supported versions.
Patch by me; thanks to Michael Paquier for review.
Discussion: https://postgr.es/m/1535618100.1286.3.camel@credativ.de
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Startup process has improved its calculation of incorrect minimum
consistent point in 8d68ee6, which ensures that all WAL available gets
replayed when doing crash recovery, and has introduced an incorrect
calculation of the minimum recovery point for non-startup processes,
which can cause incorrect page references on a standby when for example
the background writer flushed a couple of pages on-disk but was not
updating the control file to let a subsequent crash recovery replay to
where it should have.
The only case where this has been reported to be a problem is when a
standby needs to calculate the latest removed xid when replaying a btree
deletion record, so one would need connections on a standby that happen
just after recovery has thought it reached a consistent point. Using a
background worker which is started after the consistent point is reached
would be the easiest way to get into problems if it connects to a
database. Having clients which attempt to connect periodically could
also be a problem, but the odds of seeing this problem are much lower.
The fix used is pretty simple, as the idea is to give access to the
minimum recovery point written in the control file to non-startup
processes so as they use a reference, while the startup process still
initializes its own references of the minimum consistent point so as the
original problem with incorrect page references happening post-promotion
with a crash do not show up.
Reported-by: Alexander Kukushkin
Diagnosed-by: Alexander Kukushkin
Author: Michael Paquier
Reviewed-by: Kyotaro Horiguchi, Alexander Kukushkin
Discussion: https://postgr.es/m/153492341830.1368.3936905691758473953@wrigleys.postgresql.org
Backpatch-through: 9.3
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Code in slot_getallattrs() is the same as if slot_getsomeattrs() is
called with number of attributes specified in the tuple
descriptor. Implement it that way instead of duplicating the code
between those two functions.
This is part of a patchseries abstracting TupleTableSlots so they can
store arbitrary forms of tuples, but is a nice enough cleanup on its
own.
Author: Ashutosh Bapat
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/20180220224318.gw4oe5jadhpmcdnm@alap3.anarazel.de
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Historically, the term procedure was used as a synonym for function in
Postgres/PostgreSQL. Now we have procedures as separate objects from
functions, so we need to clean up the documentation to not mix those
terms.
In particular, mentions of "trigger procedures" are changed to "trigger
functions", and access method "support procedures" are changed to
"support functions". (The latter already used FUNCTION in the SQL
syntax anyway.) Also, the terminology in the SPI chapter has been
cleaned up.
A few tests, examples, and code comments are also adjusted to be
consistent with documentation changes, but not everything.
Reported-by: Peter Geoghegan <pg@bowt.ie>
Reviewed-by: Jonathan S. Katz <jonathan.katz@excoventures.com>
|