postgresql - postgresql mirror

	Commit message (Collapse)	Author	Age
*	Change SET LOCAL/CONSTRAINTS/TRANSACTION and ABORT behavior	Bruce Momjian	2013-11-25
\| \| \| \| \| \| \|	Change SET LOCAL/CONSTRAINTS/TRANSACTION behavior outside of a transaction block from error (post-9.3) to warning. (Was nothing in <= 9.3.) Also change ABORT outside of a transaction block from notice to warning.
*	Fix Hot-Standby initialization of clog and subtrans.	Heikki Linnakangas	2013-11-22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	These bugs can cause data loss on standbys started with hot_standby=on at the moment they start to accept read only queries, by marking committed transactions as uncommited. The likelihood of such corruptions is small unless the primary has a high transaction rate. 5a031a5556ff83b8a9646892715d7fef415b83c3 fixed bugs in HS's startup logic by maintaining less state until at least STANDBY_SNAPSHOT_PENDING state was reached, missing the fact that both clog and subtrans are written to before that. This only failed to fail in common cases because the usage of ExtendCLOG in procarray.c was superflous since clog extensions are actually WAL logged. f44eedc3f0f347a856eea8590730769125964597/I then tried to fix the missing extensions of pg_subtrans due to the former commit's changes - which are not WAL logged - by performing the extensions when switching to a state > STANDBY_INITIALIZED and not performing xid assignments before that - again missing the fact that ExtendCLOG is unneccessary - but screwed up twice: Once because latestObservedXid wasn't updated anymore in that state due to the earlier commit and once by having an off-by-one error in the loop performing extensions. This means that whenever a CLOG_XACTS_PER_PAGE (32768 with default settings) boundary was crossed between the start of the checkpoint recovery started from and the first xl_running_xact record old transactions commit bits in pg_clog could be overwritten if they started and committed in that window. Fix this mess by not performing ExtendCLOG() in HS at all anymore since it's unneeded and evidently dangerous and by performing subtrans extensions even before reaching STANDBY_SNAPSHOT_PENDING. Analysis and patch by Andres Freund. Reported by Christophe Pettus. Backpatch down to 9.0, like the previous commit that caused this.
*	Avoid acquiring spinlock when checking if recovery has finished, for speed.	Heikki Linnakangas	2013-11-22
\| \| \| \| \| \| \| \| \| \| \| \|	RecoveryIsInProgress() can be called very frequently. During normal operation, it just checks a backend-local variable and returns quickly, but during hot standby, it checks a spinlock-protected shared variable. Those spinlock acquisitions can become a point of contention on a busy hot standby system. Replace the spinlock acquisition with a memory barrier. Per discussion with Andres Freund, Ants Aasma and Merlin Moncure.
*	Support multi-argument UNNEST(), and TABLE() syntax for multiple functions.	Tom Lane	2013-11-21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch adds the ability to write TABLE( function1(), function2(), ...) as a single FROM-clause entry. The result is the concatenation of the first row from each function, followed by the second row from each function, etc; with NULLs inserted if any function produces fewer rows than others. This is believed to be a much more useful behavior than what Postgres currently does with multiple SRFs in a SELECT list. This syntax also provides a reasonable way to combine use of column definition lists with WITH ORDINALITY: put the column definition list inside TABLE(), where it's clear that it doesn't control the ordinality column as well. Also implement SQL-compliant multiple-argument UNNEST(), by turning UNNEST(a,b,c) into TABLE(unnest(a), unnest(b), unnest(c)). The SQL standard specifies TABLE() with only a single function, not multiple functions, and it seems to require an implicit UNNEST() which is not what this patch does. There may be something wrong with that reading of the spec, though, because if it's right then the spec's TABLE() is just a pointless alternative spelling of UNNEST(). After further review of that, we might choose to adopt a different syntax for what this patch does, but in any case this functionality seems clearly worthwhile. Andrew Gierth, reviewed by Zoltán Böszörményi and Heikki Linnakangas, and significantly revised by me
*	More GIN refactoring.	Heikki Linnakangas	2013-11-20
\| \| \| \| \| \| \| \| \| \| \|	Split off the portion of ginInsertValue that inserts the tuple to current level into a separate function, ginPlaceToPage. ginInsertValue's charter is now to recurse up the tree to insert the downlink, when a page split is required. This is in preparation for a patch to change the way incomplete splits are handled, which will need to do these operations separately. And IMHO makes the code more readable anyway.
*	Refactor the internal GIN B-tree interface for forming a downlink.	Heikki Linnakangas	2013-11-20
\| \| \| \| \| \|	This creates a new gin-btree callback function for creating a downlink for a page. Previously, ginxlog.c duplicated the logic used during normal operation.
*	Further GIN refactoring.	Heikki Linnakangas	2013-11-20
\| \| \| \| \|	Merge some functions that were always called together. Makes the code little bit more readable.
*	Fix bug in GIN posting tree root creation.	Heikki Linnakangas	2013-11-13
\| \| \| \| \| \| \| \| \| \|	The root page is filled with as many items as fit, and the rest are inserted using normal insertions. However, I fumbled the variable names, and the code actually memcpy'd all the items on the page, overflowing the buffer. While at it, rename the variable to make the distinction more clear. Reported by Teodor Sigaev. This bug was introduced by my recent refactorings, so no backpatching required.
*	Fix race condition in GIN posting tree page deletion.	Heikki Linnakangas	2013-11-08
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If a page is deleted, and reused for something else, just as a search is following a rightlink to it from its left sibling, the search would continue scanning whatever the new contents of the page are. That could lead to incorrect query results, or even something more curious if the page is reused for a different kind of a page. To fix, modify the search algorithm to lock the next page before releasing the previous one, and refrain from deleting pages from the leftmost branch of the tree. Add a new Concurrency section to the README, explaining why this works. There is a lot more one could say about concurrency in GIN, but that's for another patch. Backpatch to all supported versions.
*	Fix setting of right bound at GIN page split.	Heikki Linnakangas	2013-11-07
\| \| \| \|	Broken by my refactoring.
*	Fix missing argument and function prototypes.	Heikki Linnakangas	2013-11-06
\| \| \| \|	Not sure how I missed these in previous commit.
*	Misc GIN refactoring.	Heikki Linnakangas	2013-11-06
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Merge the isEnoughSpace and placeToPage functions in the b-tree interface into one function that tries to put a tuple on page, and returns false if it doesn't fit. Move createPostingTree function to gindatapage.c, and change its contract so that it can be passed more items than fit on the root page. It's in a better position than the callers to know how many items fit. Move ginMergeItemPointers out of gindatapage.c, into a separate file. These changes make no difference now, but reduce the footprint of Alexander Korotkov's upcoming patch to pack item pointers more tightly.
*	Prevent memory leaks from accumulating across printtup() calls.	Tom Lane	2013-11-03
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Historically, printtup() has assumed that it could prevent memory leakage by pfree'ing the string result of each output function and manually managing detoasting of toasted values. This amounts to assuming that datatype output functions never leak any memory internally; an assumption we've already decided to be bogus elsewhere, for example in COPY OUT. range_out in particular is known to leak multiple kilobytes per call, as noted in bug #8573 from Godfried Vanluffelen. While we could go in and fix that leak, it wouldn't be very notationally convenient, and in any case there have been and undoubtedly will again be other leaks in other output functions. So what seems like the best solution is to run the output functions in a temporary memory context that can be reset after each row, as we're doing in COPY OUT. Some quick experimentation suggests this is actually a tad faster than the retail pfree's anyway. This patch fixes all the variants of printtup, except for debugtup() which is used in standalone mode. It doesn't seem worth worrying about query-lifespan leaks in standalone mode, and fixing that case would be a bit tedious since debugtup() doesn't currently have any startup or shutdown functions. While at it, remove manual detoast management from several other output-function call sites that had copied it from printtup(). This doesn't make a lot of difference right now, but in view of recent discussions about supporting "non-flattened" Datums, we're going to want that code gone eventually anyway. Back-patch to 9.2 where range_out was introduced. We might eventually decide to back-patch this further, but in the absence of known major leaks in older output functions, I'll refrain for now.
*	Retry after buffer locking failure during SPGiST index creation.	Tom Lane	2013-11-02
\| \| \| \| \| \| \| \| \|	The original coding thought this case was impossible, but it can happen if the bgwriter or checkpointer processes decide to write out an index page while creation is still proceeding, leading to a bogus "unexpected spgdoinsert() failure" error. Problem reported by Jonathan S. Katz. Teodor Sigaev
*	Use appendStringInfoString instead of appendStringInfo where possible.	Robert Haas	2013-10-31
\| \| \| \| \| \| \|	This shaves a few cycles, and generally seems like good programming practice. David Rowley
*	Prevent using strncpy with src == dest in TupleDescInitEntry.	Tom Lane	2013-10-28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The C and POSIX standards state that strncpy's behavior is undefined when source and destination areas overlap. While it remains dubious whether any implementations really misbehave when the pointers are exactly equal, some platforms are now starting to force the issue by complaining when an undefined call occurs. (In particular OS X 10.9 has been seen to dump core here, though the exact set of circumstances needed to trigger that remain elusive. Similar behavior can be expected to be optional on Linux and other platforms in the near future.) So tweak the code to explicitly do nothing when nothing need be done. Back-patch to all active branches. In HEAD, this also lets us get rid of an exception in valgrind.supp. Per discussion of a report from Matthias Schmitt.
*	Fix typos in comments.	Heikki Linnakangas	2013-10-24
\|
*	Consistently use unsigned arithmetic for alignment calculations.	Noah Misch	2013-10-20
\| \| \| \| \| \| \| \|	This avoids an assumption about the signed number representation. It is anticipated to have no functional changes on supported configurations; many two's complement assumptions remain elsewhere. Per a suggestion from Andres Freund.
*	TYPEALIGN doesn't work on int64 on 32-bit platforms.	Heikki Linnakangas	2013-10-08
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The TYPEALIGN macro, and the related ones like MAXALIGN, don't work with values larger than intptr_t, because TYPEALIGN casts the argument to intptr_t to do the arithmetic. That's not a problem when dealing with pointers or lengths or offsets related to pointers, but the XLogInsert scaling patch added a call to MAXALIGN with an XLogRecPtr argument. To fix, add wider variants of the macros, called TYPEALIGN64 and MAXALIGN64, which are just like the existing variants but work with uint64 instead of intptr_t. Report and patch by David Rowley, analysis by Andres Freund.
*	Fix bugs in SSI tuple locking.	Heikki Linnakangas	2013-10-08
\| \| \| \| \| \| \| \| \| \| \| \| \|	1. In heap_hot_search_buffer(), the PredicateLockTuple() call is passed wrong offset number. heapTuple->t_self is set to the tid of the first tuple in the chain that's visited, not the one actually being read. 2. CheckForSerializableConflictIn() uses the tuple's t_ctid field instead of t_self to check for exiting predicate locks on the tuple. If the tuple was updated, but the updater rolled back, t_ctid points to the aborted dead tuple. Reported by Hannu Krosing. Backpatch to 9.1.
*	Minor GIN code refactoring.	Heikki Linnakangas	2013-10-03
\| \| \| \| \| \| \| \| \| \| \|	It makes for cleaner code to have separate Get/Add functions for PostingItems and ItemPointers. A few callsites that have to deal with both types need to be duplicated because of this, but all the callers have to know which one they're dealing with anyway. Overall, this reduces the amount of casting required. Extracted from Alexander Korotkov's larger patch to change the data page format.
*	Fix pgindent comment breakage	Alvaro Herrera	2013-09-24
\|
*	Typo fix.	Robert Haas	2013-09-18
\| \| \| \|	Etsuro Fujita
*	Rename various "freeze multixact" variables	Alvaro Herrera	2013-09-16
\| \| \| \| \| \| \| \| \|	It seems to make more sense to use "cutoff multixact" terminology throughout the backend code; "freeze" is associated with replacing of an Xid with FrozenTransactionId, which is not what we do for MultiXactIds. Andres Freund Some adjustments by Álvaro Herrera
*	Add a GUC to report whether data page checksums are enabled.	Heikki Linnakangas	2013-09-16
\| \| \| \|	Bernd Helmle
*	Introduce InvalidCommandId.	Robert Haas	2013-09-09
\| \| \| \| \| \| \|	This allows a 32-bit field to represent an optional command ID without a separate flag bit. Andres Freund
*	Revert WAL posix_fallocate() patches.	Jeff Davis	2013-09-04
\| \| \| \| \| \| \| \| \| \|	This reverts commit 269e780822abb2e44189afaccd6b0ee7aefa7ddd and commit 5b571bb8c8d2bea610e01ae1ee7bc05adcfff528. Unfortunately, the initial patch had insufficient performance testing, and resulted in a regression. Per report by Thom Brown.
*	Keep heavily-contended fields in XLogCtlInsert on different cache lines.	Heikki Linnakangas	2013-09-04
\| \| \| \| \| \| \|	Performance testing shows that if the insertpos_lck spinlock and the fields that it protects are on the same cache line with other variables that are frequently accessed, the false sharing can hurt performance a lot. Keep them apart by adding some padding.
*	Rename the "fast_promote" file to just "promote".	Heikki Linnakangas	2013-08-19
\| \| \| \| \| \| \| \| \| \|	This keeps the usual trigger file name unchanged from 9.2, avoiding nasty issues if you use a pre-9.3 pg_ctl binary with a 9.3 server or vice versa. The fallback behavior of creating a full checkpoint before starting up is now triggered by a file called "fallback_promote". That can be useful for debugging purposes, but we don't expect any users to have to resort to that and we might want to remove that in the future, which is why the fallback mechanism is undocumented.
*	Fix pg_upgrade failure from servers older than 9.3	Alvaro Herrera	2013-08-19
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	When upgrading from servers of versions 9.2 and older, and MultiXactIds have been used in the old server beyond the first page (that is, 2048 multis or more in the default 8kB-page build), pg_upgrade would set the next multixact offset to use beyond what has been allocated in the new cluster. This would cause a failure the first time the new cluster needs to use this value, because the pg_multixact/offsets/ file wouldn't exist or wouldn't be large enough. To fix, ensure that the transient server instances launched by pg_upgrade extend the file as necessary. Per report from Jesse Denardo in CANiVXAj4c88YqipsyFQPboqMudnjcNTdB3pqe8ReXqAFQ=HXyA@mail.gmail.com
*	Message punctuation and pluralization fixes	Peter Eisentraut	2013-08-09
\|
*	Add SQL Standard WITH ORDINALITY support for UNNEST (and any other SRF)	Greg Stark	2013-07-29
\| \| \| \| \|	Author: Andrew Gierth, David Fetter Reviewers: Dean Rasheed, Jeevan Chalke, Stephen Frost
*	Message style improvements	Peter Eisentraut	2013-07-28
\|
*	Use InvalidSnapshot, now SnapshotNow, as the default snapshot.	Robert Haas	2013-07-23
\| \| \| \| \| \| \| \| \| \| \| \|	As far as I can determine, there's no code in the core distribution that fails to explicitly set the snapshot of a scan or executor state. If there is any such code, this will probably cause it to seg fault; friendlier suggestions were discussed on pgsql-hackers, but there was no consensus that anything more than this was needed. This is another step towards the hoped-for complete removal of SnapshotNow.
*	Adjust HeapTupleSatisfies* routines to take a HeapTuple.	Robert Haas	2013-07-22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Previously, these functions took a HeapTupleHeader, but upcoming patches for logical replication will introduce new a new snapshot type under which the tuple's TID will be used to lookup (CMIN, CMAX) for visibility determination purposes. This makes that information available. Code churn is minimal since HeapTupleSatisfiesVisibility took the HeapTuple anyway, and deferenced it before calling the satisfies function. Independently of logical replication, this allows t_tableOid and t_self to be cross-checked via assertions in tqual.c. This seems like a useful way to make sure that all callers are setting these values properly, which has been previously put forward as desirable. Andres Freund, reviewed by Álvaro Herrera
*	WITH CHECK OPTION support for auto-updatable VIEWs	Stephen Frost	2013-07-18
\| \| \| \| \| \| \| \| \| \| \| \| \|	For simple views which are automatically updatable, this patch allows the user to specify what level of checking should be done on records being inserted or updated. For 'LOCAL CHECK', new tuples are validated against the conditionals of the view they are being inserted into, while for 'CASCADED CHECK' the new tuples are validated against the conditionals for all views involved (from the top down). This option is part of the SQL specification. Dean Rasheed, reviewed by Pavel Stehule
*	Fix variable names mentioned in comment to match the code.	Heikki Linnakangas	2013-07-17
\| \| \| \| \| \| \|	Also, in another comment, explain why holding an insertion slot is a critical section. Per review by Amit Kapila.
*	Fix assert failure at end of recovery, broken by XLogInsert scaling patch.	Heikki Linnakangas	2013-07-17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Initialization of the first XLOG buffer at end-of-recovery was broken for the case that the last read WAL record ended at a page boundary. Instead of trying to copy the last full xlog page to the buffer cache in that case, just set shared state so that the next page is initialized when the first WAL record after startup is inserted. (that's what we did in earlier version, too) To make the shared state required for that case less surprising, replace the XLogCtl->curridx variable, which was the index of the latest initialized buffer, with an XLogRecPtr of how far the buffers have been initialized. That also allows us to get rid of the XLogRecEndPtrToBufIdx macro. While we're at it, make a similar change for XLogCtl->Write.curridx, getting rid of that variable and calculating the next buffer to write from XLogCtl->LogwrtResult instead.
*	Fix systable_recheck_tuple() for MVCC scan snapshots.	Noah Misch	2013-07-16
\| \| \| \| \| \| \| \|	Since this function assumed non-MVCC snapshots, it broke when commit 568d4138c646cd7cd8a837ac244ef2caf27c6bb8 switched its one caller from SnapshotNow scans to MVCC-snapshot scans. Reviewed by Robert Haas, Tom Lane and Andres Freund.
*	Fix Windows build.	Heikki Linnakangas	2013-07-08
\| \| \| \| \| \|	Was broken by my xloginsert scaling patch. XLogCtl global variable needs to be initialized in each process, as it's not inherited by fork() on Windows.
*	Improve scalability of WAL insertions.	Heikki Linnakangas	2013-07-08
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch replaces WALInsertLock with a number of WAL insertion slots, allowing multiple backends to insert WAL records to the WAL buffers concurrently. This is particularly useful for parallel loading large amounts of data on a system with many CPUs. This has one user-visible change: switching to a new WAL segment with pg_switch_xlog() now fills the remaining unused portion of the segment with zeros. This potentially adds some overhead, but it has been a very common practice by DBA's to clear the "tail" of the segment with an external pg_clearxlogtail utility anyway, to make the WAL files compress better. With this patch, it's no longer necessary to do that. This patch adds a new GUC, xloginsert_slots, to tune the number of WAL insertion slots. Performance testing suggests that the default, 8, works pretty well for all kinds of worklods, but I left the GUC in place to allow others with different hardware to test that easily. We might want to remove that before release. Reviewed by Andres Freund.
*	Handle posix_fallocate() errors.	Jeff Davis	2013-07-06
\| \| \| \| \| \| \| \| \|	On some platforms, posix_fallocate() is available but may still return EINVAL if the underlying filesystem does not support it. So, in case of an error, fall through to the alternate implementation that just writes zeros. Per buildfarm failure and analysis by Tom Lane.
*	Update messages, comments and documentation for materialized views.	Noah Misch	2013-07-05
\| \| \| \| \|	All instances of the verbiage lagging the code. Back-patch to 9.3, where materialized views were introduced.
*	Use posix_fallocate() for new WAL files, where available.	Jeff Davis	2013-07-05
\| \| \| \| \| \| \| \|	This function is more efficient than actually writing out zeroes to the new file, per microbenchmarks by Jon Nelson. Also, it may reduce the likelihood of WAL file fragmentation. Jon Nelson, with review by Andres Freund, Greg Smith and me.
*	Fix typo in comment.	Fujii Masao	2013-07-05
\| \| \| \|	Michael Paquier
*	Add new GUC, max_worker_processes, limiting number of bgworkers.	Robert Haas	2013-07-04
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In 9.3, there's no particular limit on the number of bgworkers; instead, we just count up the number that are actually registered, and use that to set MaxBackends. However, that approach causes problems for Hot Standby, which needs both MaxBackends and the size of the lock table to be the same on the standby as on the master, yet it may not be desirable to run the same bgworkers in both places. 9.3 handles that by failing to notice the problem, which will probably work fine in nearly all cases anyway, but is not theoretically sound. A further problem with simply counting the number of registered workers is that new workers can't be registered without a postmaster restart. This is inconvenient for administrators, since bouncing the postmaster causes an interruption of service. Moreover, there are a number of applications for background processes where, by necessity, the background process must be started on the fly (e.g. parallel query). While this patch doesn't actually make it possible to register new background workers after startup time, it's a necessary prerequisite. Patch by me. Review by Michael Paquier.
*	Get rid of pg_class.reltoastidxid.	Fujii Masao	2013-07-04
\| \| \| \| \| \| \| \| \| \|	Treat TOAST index just the same as normal one and get the OID of TOAST index from pg_index but not pg_class.reltoastidxid. This change allows us to handle multiple TOAST indexes, and which is required infrastructure for upcoming REINDEX CONCURRENTLY feature. Patch by Michael Paquier, reviewed by Andres Freund and me.
*	Add support for multiple kinds of external toast datums.	Robert Haas	2013-07-02
\| \| \| \| \| \| \| \| \| \| \|	To that end, support tags rather than lengths for external datums. As an example of how this can be used, add support or "indirect" tuples which point to some externally allocated memory containing a toast tuple. Similar infrastructure could be used for other purposes, including, perhaps, support for alternative compression algorithms. Andres Freund, reviewed by Hitoshi Harada and myself
*	Use an MVCC snapshot, rather than SnapshotNow, for catalog scans.	Robert Haas	2013-07-02
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	SnapshotNow scans have the undesirable property that, in the face of concurrent updates, the scan can fail to see either the old or the new versions of the row. In many cases, we work around this by requiring DDL operations to hold AccessExclusiveLock on the object being modified; in some cases, the existing locking is inadequate and random failures occur as a result. This commit doesn't change anything related to locking, but will hopefully pave the way to allowing lock strength reductions in the future. The major issue has held us back from making this change in the past is that taking an MVCC snapshot is significantly more expensive than using a static special snapshot such as SnapshotNow. However, testing of various worst-case scenarios reveals that this problem is not severe except under fairly extreme workloads. To mitigate those problems, we avoid retaking the MVCC snapshot for each new scan; instead, we take a new snapshot only when invalidation messages have been processed. The catcache machinery already requires that invalidation messages be sent before releasing the related heavyweight lock; else other backends might rely on locally-cached data rather than scanning the catalog at all. Thus, making snapshot reuse dependent on the same guarantees shouldn't break anything that wasn't already subtly broken. Patch by me. Review by Michael Paquier and Andres Freund.
*	Retry short writes when flushing WAL.	Heikki Linnakangas	2013-07-01
\| \| \| \| \| \| \| \| \| \| \| \| \|	We don't normally bother retrying when the number of bytes written by write() is short of what was requested. It is generally assumed that a write() to disk doesn't return short, unless you run out of disk space. While writing the WAL, however, it seems prudent to try a bit harder, because a failure leads to PANIC. The write() is also much larger than most write()s in the backend (up to wal_buffers), so there's more room for surprises. Also retry on EINTR. All signals used in the backend are flagged SA_RESTART nowadays, so it shouldn't happen, but better to be defensive.