New in 147e3722f7e5.
Reported-by: Alexander Lakhin
Author: Alexander Lakhin
Reviewed-by: Amit Kapila and Tom Lane
Discussion: https://postgr.es/m/7208de98-add8-8537-91c0-f8b089e2928c@gmail.com
Some of the wrapper functions didn't match the callback names, many of
them due to staying "consistent" with the historic naming of the wrapped
functionality. We decided that for most cases it's more important for
tableam to be consistent going forward than with the past.
The one exception is beginscan/endscan/... because it'd have looked
odd to have systable_beginscan/endscan/... with a different naming
scheme, and changing the systable_* APIs would have caused way too
much churn (including breaking a lot of external users).
Author: Ashwin Agrawal, with some small additions by Andres Freund
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CALfoeiugyrXZfX7n0ORCa4L-m834dzmaE8eFdbNR6PMpetU4Ww@mail.gmail.com
Switch to 2.1 version of pg_bsd_indent. This formats
multiline function declarations "correctly", that is with
additional lines of parameter declarations indented to match
where the first line's left parenthesis is.
Discussion: https://postgr.es/m/CAEepm=0P3FeTXRcU5B2W3jv3PgRVZ-kGUXLGfd42FFhUROO3ug@mail.gmail.com
This is still using the 2.0 version of pg_bsd_indent.
I thought it would be good to commit this separately,
so as to document the differences between 2.0 and 2.1 behavior.
Discussion: https://postgr.es/m/16296.1558103386@sss.pgh.pa.us
This allows table AMs to completely suppress TOAST table creation, or
to modify the conditions under which they are created.
Patch by me. Reviewed by Andres Freund.
Discussion: http://postgr.es/m/CA+Tgmoa4O2n=yphqD2pERUnYmUO84bH1SqMsA-nSxBGsZ7gWfA@mail.gmail.com
"segno" is the argument for the function, not "log" and "seg".
Author: Antonin Houska
Discussion: https://postgr.es/m/11863.1558361020@spoje.net
Before this commit, when ANALYZE was run on a table and serializable
was used (either by virtue of an explicit BEGIN TRANSACTION ISOLATION
LEVEL SERIALIZABLE, or default_transaction_isolation being set to
serializable), a null pointer dereference led to a crash.
The analyze scan doesn't need a snapshot (nor predicate locking), but
before this commit a scan only contained information about being a
bitmap or sample scan.
Refactor the option passing to the scan_begin callback to use a
bitmask instead. Alternatively we could have added a new boolean
parameter, but that seems harder to read. Even before this issue,
various people (Heikki, Tom, Robert) had suggested such a refactoring.
These changes don't change the scan APIs outside of tableam. The flags
argument could be exposed, but that's not necessary to fix this
problem. Also the wrapper table_beginscan* functions encapsulate most
of that complexity.
After these changes, fixing the bug is trivial: just don't acquire a
predicate lock for analyze-style scans. That was already done for
bitmap heap scans. Add an assert that a snapshot is passed when
acquiring the predicate lock, so this kind of bug no longer requires
running with serializable to be detected.
Also add a comment about sample scans currently requiring predicate
locking the entire relation, that previously wasn't remarked upon.
Reported-By: Joe Wildish
Author: Andres Freund
Discussion:
https://postgr.es/m/4EA80A20-E9BF-49F1-9F01-5B66CAB21453@elusive.cx
https://postgr.es/m/20190411164947.nkii4gaeilt4bui7@alap3.anarazel.de
https://postgr.es/m/20190518203102.g7peu2fianukjuxm@alap3.anarazel.de
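For illustration, a minimal sketch (table and data are made up here, not
part of the commit) of the scenario that crashed before this fix:

    CREATE TABLE analyze_crash_demo (id int);
    INSERT INTO analyze_crash_demo SELECT generate_series(1, 1000);

    BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    ANALYZE analyze_crash_demo;   -- crashed with a null pointer dereference before this fix
    COMMIT;

    -- the same effect via the GUC:
    -- SET default_transaction_isolation = 'serializable';
    -- ANALYZE analyze_crash_demo;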
Per buildfarm.
Instead add a tableam callback to do so. To avoid adding per-validation
overhead, pass a scan to tuple_tid_valid. In heap's case we'd otherwise
have incurred a RelationGetNumberOfBlocks() call for each tid, which
would have added noticeable overhead to nodeTidscan.c.
Author: Andres Freund
Reviewed-By: Ashwin Agrawal
Discussion: https://postgr.es/m/20190515185447.gno2jtqxyktylyvs@alap3.anarazel.de
Previously various parts of the code routed size requests through
RelationGetNumberOfBlocks[InFork]. That works if md.c is used by the
AM, but not otherwise.
Add a tableam callback to return the size of the table. As not every
AM will use postgres' BLCKSZ, have it return bytes, and have
RelationGetNumberOfBlocksInFork() round the byte size up into blocks.
To allow code outside of the AM to determine the actual relation size,
map InvalidForkNumber to the total size of a relation, as not every AM
might use only the postgres-defined forks.
A few users of RelationGetNumberOfBlocks() ought to be converted away
from that. One case, the use of it to determine whether a tid is
valid, will be fixed in a follow up commit. Others will have to wait
for v13.
Author: Andres Freund
Discussion: https://postgr.es/m/20190423225201.3bbv6tbqzkb5w7cw@alap3.anarazel.de
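The byte-versus-block distinction is also visible from SQL. A sketch
(the table name is a placeholder) of how a byte size rounds up into
BLCKSZ-sized blocks, analogous to what RelationGetNumberOfBlocksInFork()
now does with the callback's result:

    SELECT pg_relation_size('some_table') AS size_bytes,
           ceil(pg_relation_size('some_table')::numeric
                / current_setting('block_size')::int) AS size_blocks;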
It's not safe for nbtree VACUUM to attempt to delete a target page whose
right sibling is already half-dead, since that would fail the
cross-check when VACUUM attempts to re-find a downlink to the right
sibling in the parent page. Logic to prevent this from happening was
added by commit 8da31837803, which addressed a bug in the overhaul of
page deletion that went into PostgreSQL 9.4 (commit efada2b8e92).
VACUUM was made to check the right sibling page, and back off when it
happened to be half-dead already.
However, it is only truly necessary to do the right sibling check on the
leaf level, since that transitively determines if the deletion target's
parent's right sibling page is itself undergoing deletion. Remove the
internal page level check, and add a comment explaining why the leaf
level check alone suffices.
The extra check is also unnecessary due to the fact that internal pages
that are marked half-dead are generally considered corrupt. Commit
efada2b8e92 established the principle that there should never be
half-dead internal pages (internal pages pending deletion are possible,
but that status is never directly represented in the internal page).
VACUUM will complain about corruption when it encounters half-dead
internal pages, so VACUUM is bound to raise an error one way or another
when an nbtree index has a half-dead internal page (contrib/amcheck will
also report that the page is corrupt).
It's possible that a pg_upgrade'd 9.3 database will still have half-dead
internal pages, so it may seem like there is an argument for leaving the
check in place to reliably get a cleaner error message that advises the
user to REINDEX. However, leaf pages are also deleted in the first
phase of deletion prior to PostgreSQL 9.4, so I believe we won't even
attempt to re-find the parent page anyway (we won't have the fully
deleted leaf page as the right sibling of our target page, so we won't
even try to find a downlink for it).
Discussion: https://postgr.es/m/CAH2-Wzm_ntmqJjWLRyKzimFmFvk+BnVAvUpaA4s1h9Ja58woaQ@mail.gmail.com
Remove a Berkeley-era comment above _bt_insertonpg() that admonishes the
reader to grok Lehman and Yao's paper before making any changes. This
made a certain amount of sense back when _bt_insertonpg() was
responsible for most of the things that are now spread across
_bt_insertonpg(), _bt_findinsertloc(), _bt_insert_parent(), and
_bt_split(), but it doesn't work like that anymore.
I believe that this comment alludes to the need to "couple" or "crab"
buffer locks as we ascend the tree as page splits cascade upwards. The
nbtree README already explains this in detail, which seems sufficient.
Besides, the changes to page splits made by commit 40dae7ec537 altered
the exact details of how buffer locks are retained during splits; Lehman
and Yao's original algorithm seems to release the lock on the left child
page/buffer slightly earlier than _bt_insertonpg()/_bt_insert_parent()
can.
Commit fab25024, which taught nbtree to choose candidate split points
more carefully, had _bt_findsplitloc() record all possible split points
in an initial pass over a page that is about to be split. The order
that candidate split points were processed and stored in was assumed to
match the offset number order of split points on an imaginary version of
the page that contains the same items as the original, but also fits
newitem (the item that provoked the split precisely because it didn't
fit).
However, the order of split points in the final array was not quite what
was expected: the split point that makes newitem the firstright item
came after the split point that makes newitem the lastleft item -- not
before. As a result, _bt_findsplitloc() could get confused about the
leftmost and rightmost tuples among all possible split points recorded
for the page. This seems to have no appreciable impact on the quality
of the final split point chosen by _bt_findsplitloc(), but it's still
wrong.
To fix, switch the order in which the newitem candidate split points
are recorded. This also makes it possible to describe candidate split
points in
terms of which pair of adjoining tuples enclose the split point within
_bt_findsplitloc(), making it clearer why it's generally safe for
_bt_split() to expect lastleft and firstright tuples.
For some reason both the callsite and the heapam implementation had
the meaning inverted (i.e. succeeded == true was passed in case of a
conflict). That's confusing.
I (Andres) briefly pondered whether it'd be better to rename
table_complete_speculative's argument to 'bool specConflict' or such,
but decided not to. The 'complete' in the function name for me makes
`succeeded` sound a bit better.
Reported-By: Ashwin Agrawal, Melanie Plageman, Heikki Linnakangas
Discussion:
https://postgr.es/m/CALfoeitk7-TACwYv3hCw45FNPjkA86RfXg4iQ5kAOPhR+F1Y4w@mail.gmail.com
https://postgr.es/m/97673451-339f-b21e-a781-998d06b1067c@iki.fi
As we descend the GiST tree during insertion, we modify any downlinks on
the way down to include the new tuple we're about to insert (if they don't
cover it already). Modifying an existing downlink might cause an internal
page to split, if the new downlink tuple is larger than the old one. If
that happens, we need to back up to the parent and re-choose a page to
insert to. We used to detect that situation, thanks to the NSN-LSN
interlock normally used to detect concurrent page splits, but that got
broken by commit 9155580fd5. With that commit, we now use a dummy constant
LSN value for every page during index build, so the LSN-NSN interlock no
longer works. I thought that was OK because there can't be any other
backends modifying the index during index build, but missed that the
insertion itself can modify the page we're inserting to. The consequence
was that we would sometimes insert the new tuple to an incorrect page, one
whose downlink doesn't cover the new tuple.
To fix, add a flag to the stack that keeps track of the state while
descending the tree, to indicate that a page was split and that we need
to retry the descent from the parent.
Thomas Munro first reported that the contrib/intarray regression test was
failing occasionally on the buildfarm after commit 9155580fd5. The failure
was intermittent, because the gistchoose() function is not deterministic,
and would only occasionally create the right circumstances for this bug to
cause the failure.
Patch by Anastasia Lubennikova, with some changes by me to make it work
correctly also when the internal page split causes the "grandparent" to
be split as well.
Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJRzLo7tZExWfSbwM3XuK7aAK7FhdBV0FLkbUG%2BW0v0zg%40mail.gmail.com
The conditions listed in this comment have changed several times, and at
some point the thing that the "if so" referred to was negated.
The text was OK up to 9.6. It was differently wrong in v10, v11 and
master, so fix in all those versions.
The term "item pointer" should not be used to refer to ItemIdData
variables, since that is needlessly ambiguous. Only
ItemPointerData/ItemPointer variables should be called item pointers.
To fix, establish the convention that ItemIdData variables should always
be referred to either as "item identifiers" or "line pointers". The
term "item identifier" already predominates in docs and translatable
messages, and so should be the preferred alternative there.
Discussion: https://postgr.es/m/CAH2-Wz=c=MZQjUzde3o9+2PLAPuHTpVZPPdYxN=E4ndQ2--8ew@mail.gmail.com
Commit 8fa30f906be reduced the elevel of a number of "can't happen"
_bt_split() errors from PANIC to ERROR. At the same time, the new right
page buffer for the split could continue to be acquired well before the
critical section. This was possible because it was relatively
straightforward to make sure that _bt_split() could not throw an error,
with a few specific exceptions. The exceptional cases were safe because
they involved specific, well understood errors, making it possible to
consistently zero the right page before actually raising an error using
elog(). There was no danger of leaving around a junk page, provided
_bt_split() stuck to this coding rule.
Commit 8224de4f, which introduced INCLUDE indexes, added code to make
_bt_split() truncate away non-key attributes. This happened at a point
that broke the rule around zeroing the right page in _bt_split(). If
truncation failed (perhaps due to palloc() failure), that would result
in an errant right page buffer with junk contents. This could confuse
VACUUM when it attempted to delete the page, and should be avoided on
general principle.
To fix, reorganize _bt_split() so that truncation occurs before the new
right page buffer is even acquired. A junk page/buffer will not be left
behind if _bt_nonkey_truncate()/_bt_truncate() raise an error.
Discussion: https://postgr.es/m/CAH2-WzkcWT_-NH7EeL=Az4efg0KCV+wArygW8zKB=+HoP=VWMw@mail.gmail.com
Backpatch: 11-, where INCLUDE indexes were introduced.
The right way for IsCatalogRelation/Class to behave is to return true
for OIDs less than FirstBootstrapObjectId (not FirstNormalObjectId),
without any of the ad-hoc fooling around with schema membership.
The previous code was wrong because (1) it claimed that
information_schema tables were not catalog relations but their toast
tables were, which is silly; and (2) if you dropped and recreated
information_schema, which is a supported operation, the behavior
changed. That's even sillier. With this definition, "catalog
relations" are exactly the ones traceable to the postgres.bki data,
which seems like what we want.
With this simplification, we don't actually need access to the pg_class
tuple to identify a catalog relation; we only need its OID. Hence,
replace IsCatalogClass with "IsCatalogRelationOid(oid)". But keep
IsCatalogRelation as a convenience function.
This allows fixing some arguably-wrong semantics in contrib/sepgsql and
ReindexRelationConcurrently, which were using an IsSystemNamespace test
where what they really should be using is IsCatalogRelationOid. The
previous coding failed to protect toast tables of system catalogs, and
also was not on board with the general principle that user-created tables
do not become catalogs just by virtue of being renamed into pg_catalog.
We can also get rid of a messy hack in ReindexMultipleTables.
While we're at it, also rename IsSystemNamespace to IsCatalogNamespace,
because the previous name invited confusion with the more expansive
semantics used by IsSystemRelation/Class.
Also improve the comments in catalog.c.
There are a few remaining places in replication-related code that are
special-casing OIDs below FirstNormalObjectId. I'm inclined to think
those are wrong too, and if there should be any special case it should
just extend to FirstBootstrapObjectId. But first we need to debate
whether a FOR ALL TABLES publication should include information_schema.
Discussion: https://postgr.es/m/21697.1557092753@sss.pgh.pa.us
Discussion: https://postgr.es/m/15150.1557257111@sss.pgh.pa.us
Commit dd299df8189, which added suffix truncation to nbtree, simplified
the WAL record format used by page splits. It became necessary to
explicitly WAL-log the new high key for the left half of a split in all
cases, which relieved the REDO routine from having to reconstruct a new
high key for the left page by copying the first item from the right
page. Remove a comment that referred to the previous practice.
The flat file mechanism was removed in PostgreSQL 9.0.
It is no longer possible under any circumstances for nbtree code to
reconstruct a strict lower bound key (parent page's pivot tuple key) for
a right sibling page by retrieving the first item in the right sibling
page.
This commit adds a new parameter to the VACUUM command, TRUNCATE,
which specifies whether VACUUM should attempt to truncate off
any empty pages at the end of the table and allow the disk space
for the truncated pages to be returned to the operating system.
This parameter, if specified, overrides the vacuum_truncate
reloption. If neither the reloption nor the VACUUM option is
used, the default is true, as before.
Author: Fujii Masao
Reviewed-by: Julien Rouhaud, Masahiko Sawada
Discussion: https://postgr.es/m/CAD21AoD+qtrSDL=GSma4Wd3kLYLeRC0hPna-YAdkDeV4z156vg@mail.gmail.com
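A usage sketch, assuming a table named my_table:

    VACUUM (TRUNCATE FALSE) my_table;           -- skip end-of-table truncation for this run
    VACUUM (TRUNCATE TRUE, VERBOSE) my_table;   -- attempt truncation, overriding vacuum_truncate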
This feature was using a process-local map to track the first few blocks
in the relation. The map was reset each time we got a block with enough
free space. It was discussed that it would be better to track this map on
a per-relation basis in the relcache and then invalidate it whenever
vacuum frees up some space in a page or when the FSM is created. The new
design would be better both in terms of API design and performance.
List of commits reverted, in reverse chronological order:
06c8a5090e Improve code comments in b0eaa4c51b.
13e8643bfc During pg_upgrade, conditionally skip transfer of FSMs.
6f918159a9 Add more tests for FSM.
9c32e4c350 Clear the local map when not used.
29d108cdec Update the documentation for FSM behavior..
08ecdfe7e5 Make FSM test portable.
b0eaa4c51b Avoid creation of the free space map for small heap relations.
Discussion: https://postgr.es/m/20190416180452.3pm6uegx54iitbt5@alap3.anarazel.de
Commit 3f342839 corrected obsolete comments about buffer locks at the
main _bt_insert_parent() call site, but missed similar obsolete comments
above _bt_insert_parent() itself. Both sets of comments were rendered
obsolete by commit 40dae7ec537, which made the nbtree page split
algorithm more robust. Fix the comments that were missed the first time
around now.
In passing, refine a related _bt_insert_parent() comment about
re-finding the parent page to insert a new downlink.
Commit d2599ecfcc74 introduced some contorted, confusing code:
readers would think that it's possible for HeapTupleHeaderGetXmin to
return a non-frozen value for some frozen tuples, which would be
disastrous.
There's no actual bug, but it seems better to make it clearer.
Per gripe from Tom Lane and Andres Freund.
Discussion: https://postgr.es/m/30116.1555430496@sss.pgh.pa.us
Commit dd299df8189, which made heap TID a tiebreaker nbtree index
column, introduced new rules on page space management to make suffix
truncation safe. In general, suffix truncation needs to have a small
amount of extra space available on the new left page when splitting a
leaf page. This is needed in case it turns out that truncation cannot
even "truncate away the heap TID column", resulting in a
larger-than-firstright leaf high key with an explicit heap TID
representation.
Despite all this, CREATE INDEX/nbtsort.c did not account for the
possible need for extra heap TID space on leaf pages when deciding
whether or not a new item could fit on the current page. This could lead to
"failed to add item to the index page" errors when CREATE
INDEX/nbtsort.c tried to finish off a leaf page that lacked space for a
larger-than-firstright leaf high key (it only had space for the
firstright tuple, which was just short of what was needed following
"truncation").
Several conditions needed to be met all at once for CREATE INDEX to
fail. The problem was in the hard limit on what will fit on a page,
which tends to be masked by the soft fillfactor-wise limit. The easiest
way to recreate the problem seems to be a CREATE INDEX on a low
cardinality text column, with tuples that are of non-uniform width,
using a fillfactor of 100.
To fix, bring nbtsort.c in line with nbtsplitloc.c, which already
pessimistically assumes that all leaf page splits will have high keys
that have a heap TID appended.
Reported-By: Andreas Joseph Krogh
Discussion: https://postgr.es/m/VisenaEmail.c5.3ee7fe277d514162.16a6d785bea@tc7-visena
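A hypothetical sketch of the kind of workload described above (a
low-cardinality text column with values of non-uniform width, indexed
at fillfactor 100); whether it actually hits the error depends on the
exact page geometry:

    CREATE TABLE wide_dups (val text);
    INSERT INTO wide_dups
    SELECT (ARRAY['a', repeat('b', 300), repeat('c', 700)])[1 + g % 3]
    FROM generate_series(1, 100000) g;
    -- before the fix this could fail with "failed to add item to the index page"
    CREATE INDEX ON wide_dups (val) WITH (fillfactor = 100);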
The new nleft_dead_tuples and nleft_dead_itemids fields are confusing
and do not seem like the correct way forward. One of them is tested
via an assertion that can fail, as it has already done on buildfarm
member topminnow. Remove the assertion and the fields.
Change the logic for the case where a tuple is not initially pruned
by heap_page_prune but later diagnosed HEAPTUPLE_DEAD by
HeapTupleSatisfiesVacuum. Previously, tupgone = true was set in
that case, which leads to treating the tuple as one that will be
removed. In a normal vacuum, that's OK, because we'll remove
index entries for it and then the second heap pass will remove the
tuple itself, but when index cleanup is disabled, those things
don't happen, so we must instead treat it as a recently-dead
tuple that we have voluntarily chosen to keep.
Report and analysis by Tom Lane. This patch is loosely based on one
from Masahiko Sawada, but I changed most of it.
Mistake in ab0dfc961b6a; progress reporting would have wrapped around
for indexes created with more than 2^31 tuples.
Reported-by: Peter Geoghegan
Discussion: https://postgr.es/m/CAH2-Wz=WbNxc5ob5NJ9yqo2RMJ0q4HXDS30GVCobeCvC9A1L9A@mail.gmail.com
Most of these stem from d25f519107 "tableam: relation creation, VACUUM
FULL/CLUSTER, SET TABLESPACE.".
1) To pass data to the relation_set_new_filenode()
RelationSetNewRelfilenode() was made to update RelationData.rd_rel
directly. That's not OK however, as it makes the relcache entries
temporarily inconsistent. Which among other scenarios is a problem
if a REINDEX targets an index on pg_class (via the
CatalogTupleUpdate() in RelationSetNewRelfilenode()). Presumably
that was introduced because other places in the code do so - while
those aren't "good practice" they don't appear to be actively
buggy (e.g. because system tables may not be targeted).
I (Andres) should have caught this while reviewing and significantly
evolving the code in that commit, mea culpa.
Fix that by instead passing in the new RelFileNode as separate
argument to relation_set_new_filenode() and rely on the relcache to
update the catalog entry. Also revert that the
RelationMapUpdateMap() call was changed to immediate, and undo some
other more unnecessary changes.
2) Document that the relation_set_new_filenode cannot rely on the
whole relcache entry to be valid. It might be worthwhile to
refactor the code to never have to rely on that, but given the way
heap_create() is currently coded, that'd be a large change.
3) ATExecSetTableSpace() shouldn't do FlushRelationBuffers() itself. A
table AM might not use shared buffers at all. Move to
index_copy_data() and heapam_relation_copy_data().
4) heapam_relation_set_new_filenode() previously sometimes accessed
rel->rd_rel->relpersistence rather than the `persistence`
argument. Code movement mistake.
5) Previously heapam_relation_set_new_filenode() re-opened the smgr
relation to create the init fork, if necessary. Instead have
RelationCreateStorage() return the SMgrRelation and use it to
create the init fork.
6) Add a note about the danger of modifying the relcache directly to
ATExecSetTableSpace() - it's currently not a bug because there's a
check ERRORing for catalog tables.
Regression tests and assertion improvements that together trigger the
bug described in 1) will be added in a later commit, as there is a
related bug on all branches.
Reported-By: Michael Paquier
Diagnosed-By: Tom Lane and Andres Freund
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20190418011430.GA19133@paquier.xyz
Remove a comment that refers to a coding practice that was fully removed
by commit a8b8f4db, which introduced MarkBufferDirty(). It looks like
the comment was even obsolete before then, since it concerns
write-ordering dependencies with synchronous buffer writes.
This allows table AMs that don't need these horizons. This was already
documented in the tableam relation_set_new_filenode callback, but an
assert prevented it from actually working (the test AM code contained
the change itself). Defang the asserts in the general code, and move
the stronger ones into heap AM.
Relatedly, after CLUSTER/VACUUM, we'd always assign a relfrozenxid /
relminmxid. Change the table_relation_copy_for_cluster() interface to
allow the AM to overwrite the horizons that get set on the pg_class
entry. This'd also in the future allow AMs like heap to compute a
relfrozenxid during rewrite that's the table's actual minimum rather
than a pre-determined value. Arguably it'd have been better to move
the whole computation / setting of those values into the callback, but
it seems likely that for other reasons it'd be better to be able to
use one value to vacuum/cluster multiple tables (e.g. a toast's
horizon shouldn't be different than the table's).
Reported-By: Heikki Linnakangas
Author: Andres Freund
Discussion: https://postgr.es/m/9a7fb9cc-2419-5db7-8840-ddc10c93f122@iki.fi
Commit dd299df8 made nbtree treat heap TID as a tiebreaker column,
establishing the principle that there is only one correct location (page
and page offset number) for every index tuple, no matter what.
Insertions of tuples into non-unique indexes proceed as if heap TID
(scan key's scantid) is just another user-attribute value, but
insertions into unique indexes are more delicate. The TID value in
scantid must initially be omitted to ensure that the unique index
insertion visits every leaf page that duplicates could be on. The
scantid is set once again after unique checking finishes successfully,
which can force _bt_findinsertloc() to step right one or more times, to
locate the leaf page that the new tuple must be inserted on.
Stepping right within _bt_findinsertloc() was assumed to occur no more
frequently than stepping right within _bt_check_unique(), but there was
one important case where that assumption was incorrect: inserting a
"duplicate" with NULL values. Since _bt_check_unique() didn't do any
real work in this case, it wasn't appropriate for _bt_findinsertloc() to
behave as if it was finishing off a conventional unique insertion, where
any existing physical duplicate must be dead or recently dead.
_bt_findinsertloc() might have to grovel through a substantial portion
of all of the leaf pages in the index to insert a single tuple, even
when there were no dead tuples.
To fix, treat insertions of tuples with NULLs into a unique index as if
they were insertions into a non-unique index: never unset scantid before
calling _bt_search() to descend the tree, and bypass _bt_check_unique()
entirely. _bt_check_unique() is no longer responsible for incoming
tuples with NULL values.
Discussion: https://postgr.es/m/CAH2-Wzm08nr+JPx4jMOa9CGqxWYDQ-_D4wtPBiKghXAUiUy-nQ@mail.gmail.com
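A sketch of the affected workload (names are made up): NULLs never
conflict with one another, so a unique index can legitimately hold many
such "duplicates":

    CREATE TABLE null_dups (a int, b int);
    CREATE UNIQUE INDEX ON null_dups (a);
    -- every insertion succeeds; before this fix each one could step through
    -- many leaf pages in _bt_findinsertloc() despite there being no dead tuples
    INSERT INTO null_dups (a, b)
    SELECT NULL, g FROM generate_series(1, 100000) g;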
Due to parallel development, gist added the missing conflict
information in c952eae52a3, while 558a9165e08 moved that computation
to the primary for the index types that already had it. Thus adapt
gist to also compute on the primary, using
index_compute_xid_horizon_for_tuples() instead of its own copy of the
logic.
This also adds pg_waldump support for XLOG_GIST_DELETE records, which
was previously missing.
Bumps WAL version.
Author: Andres Freund
Discussion: https://postgr.es/m/20190406050243.bszosdg4buvabfrt@alap3.anarazel.de
Transient files and wait events normally get cleaned up when seeing an
exception (be it in the context of a transaction for a backend or
another process like the checkpointer), hence there is little point in
complicating error code paths to do this work. This shaves a bit of
code, and removes some extra handling of errno which needed to be
preserved during the cleanup steps.
Reported-by: Masahiko Sawada
Author: Michael Paquier
Reviewed-by: Tom Lane, Masahiko Sawada
Discussion: https://postgr.es/m/CAD21AoDhHYVq5KkXfkaHhmjA-zJYj-e4teiRAJefvXuKJz1tKQ@mail.gmail.com
Checks inside _bt_vacuum_needs_cleanup() allow division by zero to happen when
metad->btm_last_cleanup_num_heap_tuples == 0. This commit adjusts the
expression so that division by zero can no longer happen.
Reported-by: Piotr Stefaniak
Discussion: https://postgr.es/m/DB8PR03MB5931C41F7787A95313F08322F22A0%40DB8PR03MB5931.eurprd03.prod.outlook.com
Reviewed-by: Masahiko Sawada
Backpatch-through: 11
Commit ad308058 switched to returning a FullTransactionId, but
failed to load the potentially updated value in the case where
xidVacLimit is reached and we release and reacquire the lock.
Repair, closing bug #15727.
While reviewing that commit, also fix the size computation used
by EstimateTransactionStateSize() and switch to the mul_size()
macro traditionally used in such expressions.
Author: Thomas Munro
Reported-by: Roman Zharkov
Discussion: https://postgr.es/m/15727-0be246e7d852d229%40postgresql.org
Author: Kirk Jamison
Discussion: https://postgr.es/m/D09B13F772D2274BB348A310EE3027C6493463@g01jpexmbkw24
The transaction that is initiated by the parallel worker to cooperate
with the actual transaction started by the main backend to complete the
query execution should not be counted as a separate transaction. The
other internal transactions started and committed by the parallel worker
are still counted as separate transactions, as that is what we do in
other places like autovacuum.
This will partially fix the bloat in transaction stats due to additional
transactions performed by parallel workers. For a complete fix, we need to
decide how we want to show all the transactions that are started internally
for various operations, and that is a matter for a separate patch.
Reported-by: Haribabu Kommi
Author: Haribabu Kommi
Reviewed-by: Amit Kapila, Jamison Kirk and Rahila Syed
Backpatch-through: 9.6
Discussion: https://postgr.es/m/CAJrrPGc9=jKXuScvNyQ+VNhO0FZk7LLAShAJRyZjnedd2D61EQ@mail.gmail.com
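A rough way to observe the effect (counter values only update once the
stats collector catches up, and big_table is a placeholder):

    SELECT xact_commit FROM pg_stat_database WHERE datname = current_database();
    SELECT count(*) FROM big_table;   -- a query that launches parallel workers
    SELECT xact_commit FROM pg_stat_database WHERE datname = current_database();
    -- before this fix, each worker's cooperating transaction also bumped
    -- xact_commit; now only the leader's transaction is counted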
vacuum_truncate controls whether vacuum tries to truncate off
any empty pages at the end of the table. Previously vacuum always
tried to do the truncation. However, the truncation could cause
some problems; for example, an ACCESS EXCLUSIVE lock needs to
be taken on the table during the truncation and can cause
query cancellations on standbys even if hot_standby_feedback
is true. Setting this reloption to false can be helpful to avoid
such problems.
Author: Tsunakawa Takayuki
Reviewed-By: Julien Rouhaud, Masahiko Sawada, Michael Paquier, Kirk Jamison and Fujii Masao
Discussion: https://postgr.es/m/CAHGQGwE5UqFqSq1=kV3QtTUtXphTdyHA-8rAj4A=Y+e4kyp3BQ@mail.gmail.com
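A usage sketch, assuming a table named my_table:

    ALTER TABLE my_table SET (vacuum_truncate = false);  -- don't truncate during vacuum
    ALTER TABLE my_table RESET (vacuum_truncate);        -- back to the default (truncate)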
This commit fixes three, unfortunately related, issues:
1) Since 5db6df0c01, the introduction of DML via tableam, it was
possible to trigger "ERROR: unexpected table_lock_tuple status: 1"
when updating a row that was previously updated in the same
transaction - but only when the previously updated row had, before
that, been updated in a concurrent transaction (and READ COMMITTED was
used). The reason was that that case simply wasn't expected. Fixing
that led to:
2) Even before the above commit, there were error checks (introduced
in 6868ed7491b7) preventing a row being updated by different
commands within the same statement (say in a function called by an
UPDATE) - but that check wasn't performed when the row was first
updated in a concurrent transaction - instead the second update was
silently skipped in that case. After this change we throw the same
error as we would without the concurrent transaction.
3) The error messages (introduced in 6868ed7491b7) preventing such
updates emitted the same error message for both DELETE and
UPDATE ("tuple to be updated was already modified by an operation
triggered by the current command"). While that could be changed
separately, it made it hard to write tests that verify the correct
behavior of the code.
This commit changes heap's implementation of table_lock_tuple() to
return TM_SelfModified instead of TM_Invisible (previously loosely
modeled after EvalPlanQualFetch), and teaches nodeModifyTable.c to
handle that in response to table_lock_tuple() and not just in response
to table_(delete|update).
Additionally it fixes the wrong error message (see 3 above). The
comment for table_lock_tuple() is also adjusted to state that
TM_Deleted won't return information in TM_FailureData - it'll not
always be available.
This also adds tests to ensure that DELETE/UPDATE correctly error out
when affecting a row that concurrently was modified by another
transaction.
Author: Andres Freund
Reported-By: Tom Lane, when investigating a bug fix to another bug
by Amit Langote
Discussion: https://postgr.es/m/19321.1554567786@sss.pgh.pa.us
Author: David Rowley, Andres Freund
Discussion: https://postgr.es/m/CAKJS1f9=9phmm66diAji4gvHnWSrK7BGFoNct+mEUT_c8pPOjw@mail.gmail.com
Previously it was allowed to set default_table_access_method to an
empty string. That makes sense for default_tablespace, where that was
copied from, as it signals falling back to the database's default
tablespace. As there is no equivalent for table AMs, forbid that.
Also make sure to throw a usable error when creating a table using an
index AM, by using get_am_type_oid() to implement get_table_am_oid()
instead of a separate copy. Previously we'd error out only later, in
GetTableAmRoutine().
Thirdly remove GetTableAmRoutineByAmId() - it was only used in an
earlier version of 8586bf7ed8.
Add tests for the above (some for index AMs as well).
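A usage sketch of the affected settings:

    SET default_table_access_method = 'heap';    -- must name a table AM; '' is now rejected
    CREATE TABLE am_demo (id int);                -- created with the default table AM
    CREATE TABLE am_demo2 (id int) USING heap;    -- or name the AM explicitly
    -- naming an index AM here (e.g. USING btree) now fails with a usable error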
This adds table_multi_insert(), and converts COPY FROM, the only user
of heap_multi_insert, to it.
A simple conversion of COPY FROM to use slots would have yielded a
slowdown when inserting into a partitioned table for some
workloads. Different partitions might need different slots (both slot
types and their descriptors), and dropping / creating slots when
there's constant partition changes is measurable.
Thus instead revamp the COPY FROM buffering for partitioned tables to
allow buffering inserts into multiple tables, flushing only when
limits are reached across all partition buffers. By only dropping
slots when there've been inserts into too many different partitions,
the aforementioned overhead is gone. By allowing larger batches, even
when there are frequent partition changes, we actually speed such cases
up significantly.
By using slots, COPY of very narrow rows into unlogged / temporary
tables might slow down very slightly (due to the indirect function calls).
Author: David Rowley, Andres Freund, Haribabu Kommi
Discussion:
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
https://postgr.es/m/20190327054923.t3epfuewxfqdt22e@alap3.anarazel.de
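A sketch of the kind of workload affected (names, layout and file path
are hypothetical): COPY into a partitioned table whose incoming rows
switch partitions frequently:

    CREATE TABLE measurements (ts date, val int) PARTITION BY RANGE (ts);
    CREATE TABLE measurements_2019_01 PARTITION OF measurements
        FOR VALUES FROM ('2019-01-01') TO ('2019-02-01');
    CREATE TABLE measurements_2019_02 PARTITION OF measurements
        FOR VALUES FROM ('2019-02-01') TO ('2019-03-01');
    COPY measurements FROM '/tmp/measurements.csv' WITH (FORMAT csv);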
The assertions added by commit b04aeb0a0 exposed that there are some
code paths wherein the executor will try to open an index without
holding any lock on it. We do have some lock on the index's table,
so it seems likely that there's no fatal problem with this (for
instance, the index couldn't get dropped from under us). Still,
it's bad practice and we should fix it.
To do so, remove the optimizations in ExecInitIndexScan and friends
that tried to avoid taking a lock on an index belonging to a target
relation, and just take the lock always. In non-bug cases, this
will result in no additional shared-memory access, since we'll find
in the local lock table that we already have a lock of the desired
type; hence, no significant performance degradation should occur.
Also, adjust the planner and executor so that the type of lock taken
on an index is always identical to the type of lock taken for its table,
by relying on the recently added RangeTblEntry.rellockmode field.
This avoids some corner cases where that might not have been true
before (possibly resulting in extra locking overhead), and prevents
future maintenance issues from having multiple bits of logic that
all needed to be in sync. In addition, this change removes all core
calls to ExecRelationIsTargetRelation, which avoids a possible O(N^2)
startup penalty for queries with large numbers of target relations.
(We'd probably remove that function altogether, were it not that we
advertise it as something that FDWs might want to use.)
Also adjust some places in selfuncs.c to not take any lock on indexes
they are transiently opening, since we can assume that plancat.c
did that already.
In passing, change gin_clean_pending_list() to take RowExclusiveLock
not AccessShareLock on its target index. Although it's not clear that
that's actually a bug, it seemed very strange for a function that's
explicitly going to modify the index to use only AccessShareLock.
David Rowley, reviewed by Julien Rouhaud and Amit Langote,
a bit of further tweaking by me
Discussion: https://postgr.es/m/19465.1541636036@sss.pgh.pa.us
This commit adds a new reloption, vacuum_index_cleanup, which
controls whether index cleanup is performed for a particular
relation by default. It also adds a new option to the VACUUM
command, INDEX_CLEANUP, which can be used to override the
reloption. If neither the reloption nor the VACUUM option is
used, the default is true, as before.
Masahiko Sawada, reviewed and tested by Nathan Bossart, Alvaro
Herrera, Kyotaro Horiguchi, Darafei Praliaskouski, and me.
The wording of the documentation is mostly due to me.
Discussion: http://postgr.es/m/CAD21AoAt5R3DNUZSjOoXDUY=naYPUOuffVsRzuTYMz29yLzQCA@mail.gmail.com
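A usage sketch, assuming a table named my_table:

    ALTER TABLE my_table SET (vacuum_index_cleanup = false);  -- skip index cleanup by default
    VACUUM (INDEX_CLEANUP TRUE) my_table;    -- override the reloption for one run
    VACUUM (INDEX_CLEANUP FALSE) my_table;   -- explicitly skip index vacuuming for this run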
_bt_check_unique() failed to invalidate binary search bounds in the
event of a live conflict following commit e5adcb78. This resulted in
problems after waiting for the conflicting xact to commit or abort. The
subsequent call to _bt_check_unique() would restore the initial binary
search bounds, rather than starting a new search. Fix by explicitly
invalidating bounds when it becomes clear that there is a live conflict
that insertion will have to wait to resolve.
Ashutosh Sharma, with a few additional tweaks by me.
Author: Ashutosh Sharma
Reported-By: Ashutosh Sharma
Diagnosed-By: Ashutosh Sharma
Discussion: https://postgr.es/m/CAE9k0PnQp-qr-UYKMSCzdC2FBzdE4wKP41hZrZvvP26dKLonLg@mail.gmail.com
Previously, md.c and checkpointer.c were tightly integrated so that
fsync calls could be handed off and processed in the background.
Introduce a system of callbacks and file tags, so that other modules
can hand off fsync work in the same way.
For now only md.c uses the new interface, but other users are being
proposed. Since there may be use cases that are not strictly SMGR
implementations, use a new function table for sync handlers rather
than extending the traditional SMGR one.
Instead of using a bitmapset of segment numbers for each RelFileNode
in the checkpointer's hash table, make the segment number part of the
key. This requires sending explicit "forget" requests for every
segment individually when relations are dropped, but suits the file
layout schemes of proposed future users better (ie sparse or high
segment numbers).
Author: Shawn Debnath and Thomas Munro
Reviewed-by: Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com