aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access
Commit message (Collapse)AuthorAge
...
* Rework handling of OOM when allocating record buffer in XLOG reader.Fujii Masao2015-04-03
| | | | | | | | | | Commit 2c03216 changed allocate_recordbuf() so that it uses a palloc to allocate the read buffer and fails immediately when an out-of-memory error shows up, even though its callers still expect that NULL is returned in that case. This bug is fixed making allocate_recordbuf() use a palloc_extended with MCXT_ALLOC_NO_OOM flag and return NULL in OOM case. Michael Paquier
* Define integer limits independently from the system definitions.Andres Freund2015-04-02
| | | | | | | | | | | | | | | | | | | | | In 83ff1618 we defined integer limits iff they're not provided by the system. That turns out not to be the greatest idea because there's different ways some datatypes can be represented. E.g. on OSX PG's 64bit datatype will be a 'long int', but OSX unconditionally uses 'long long'. That disparity then can lead to warnings, e.g. around printf formats. One way to fix that would be to back int64 using stdint.h's int64_t. While a good idea it's not that easy to implement. We would e.g. need to include stdint.h in our external headers, which we don't today. Also computing the correct int64 printf formats in that case is nontrivial. Instead simply prefix the integer limits with PG_ and define them unconditionally. I've adjusted all the references to them in code, but not the ones in comments; the latter seems unnecessary to me. Discussion: 20150331141423.GK4878@alap3.anarazel.de
* Fix bogus concurrent use of _hash_getnewbuf() in bucket split code.Tom Lane2015-03-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | _hash_splitbucket() obtained the base page of the new bucket by calling _hash_getnewbuf(), but it held no exclusive lock that would prevent some other process from calling _hash_getnewbuf() at the same time. This is contrary to _hash_getnewbuf()'s API spec and could in fact cause failures. In practice, we must only call that function while holding write lock on the hash index's metapage. An additional problem was that we'd already modified the metapage's bucket mapping data, meaning that failure to extend the index would leave us with a corrupt index. Fix both issues by moving the _hash_getnewbuf() call to just before we modify the metapage in _hash_expandtable(). Unfortunately there's still a large problem here, which is that we could also incur ENOSPC while trying to get an overflow page for the new bucket. That would leave the index corrupt in a more subtle way, namely that some index tuples that should be in the new bucket might still be in the old one. Fixing that seems substantially more difficult; even preallocating as many pages as we could possibly need wouldn't entirely guarantee that the bucket split would complete successfully. So for today let's just deal with the base case. Per report from Antonin Houska. Back-patch to all active branches.
* Make ginbuild's funcCtx be independent of its tmpCtx.Tom Lane2015-03-29
| | | | | | | | | | Previously the funcCtx was a child of the tmpCtx, but that was broken by commit eaa5808e8ec4e82ce1a87103a6b6f687666e4e4c, which made MemoryContextReset() delete, not reset, child contexts. The behavior of having a tmpCtx reset also clear the other context seems rather dubious anyway, so let's just disentangle them. Per report from Erik Rijkers. In passing, fix badly-inaccurate comments about these contexts.
* Fix GiST index-only scans for opclasses with different storage type.Heikki Linnakangas2015-03-26
| | | | | | | | | We cannot use the index's tuple descriptor directly to describe the index tuples returned in an index-only scan. That's because the index might use a different datatype for the values stored on disk than the type originally indexed. As long as they were both pass-by-ref, it worked, but will not work for pass-by-value types of different sizes. I noticed this as a crash when I started hacking a patch to add fetch methods to btree_gist.
* Tweak __attribute__-wrapping macros for better pgindent results.Tom Lane2015-03-26
| | | | | | | | | | | | | | | | | | | | | This improves on commit bbfd7edae5aa5ad5553d3c7e102f2e450d4380d4 by making two simple changes: * pg_attribute_noreturn now takes parentheses, ie pg_attribute_noreturn(). Likewise pg_attribute_unused(), pg_attribute_packed(). This reduces pgindent's tendency to misformat declarations involving them. * attributes are now always attached to function declarations, not definitions. Previously some places were taking creative shortcuts, which were not merely candidates for bad misformatting by pgindent but often were outright wrong anyway. (It does little good to put a noreturn annotation where callers can't see it.) In any case, if we would like to believe that these macros can be used with non-gcc compilers, we should avoid gratuitous variance in usage patterns. I also went through and manually improved the formatting of a lot of declarations, and got rid of excessively repetitive (and now obsolete anyway) comments informing the reader what pg_attribute_printf is for.
* Add support for index-only scans in GiST.Heikki Linnakangas2015-03-26
| | | | | | | | | | | | | | This adds a new GiST opclass method, 'fetch', which is used to reconstruct the original Datum from the value stored in the index. Also, the 'canreturn' index AM interface function gains a new 'attno' argument. That makes it possible to use index-only scans on a multi-column index where some of the opclasses support index-only scans but some do not. This patch adds support in the box and point opclasses. Other opclasses can added later as follow-on patches (btree_gist would be particularly interesting). Anastasia Lubennikova, with additional fixes and modifications by me.
* Minor cleanup of GiST code, for readability.Heikki Linnakangas2015-03-26
| | | | | Remove the gistcentryinit function, inlining the relevant part of it into the only caller.
* Centralize definition of integer limits.Andres Freund2015-03-25
| | | | | | | | | | | | | | | | | Several submitted and even committed patches have run into the problem that C89, our baseline, does not provide minimum/maximum values for various integer datatypes. C99's stdint.h does, but we can't rely on it. Several parts of the code defined limits locally, so instead centralize the definitions to c.h. This patch also changes the more obvious usages of literal limit values; there's more places that could be changed, but it's less clear whether it's beneficial to change those. Author: Andrew Gierth Discussion: 87619tc5wc.fsf@news-spur.riddles.org.uk
* Reduce pinning and buffer content locking for btree scans.Kevin Grittner2015-03-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Even though the main benefit of the Lehman and Yao algorithm for btrees is that no locks need be held between page reads in an index search, we were holding a buffer pin on each leaf page after it was read until we were ready to read the next one. The reason was so that we could treat this as a weak lock to create an "interlock" with vacuum's deletion of heap line pointers, even though our README file pointed out that this was not necessary for a scan using an MVCC snapshot. The main goal of this patch is to reduce the blocking of vacuum processes by in-progress btree index scans (including a cursor which is idle), but the code rearrangement also allows for one less buffer content lock to be taken when a forward scan steps from one page to the next, which results in a small but consistent performance improvement in many workloads. This patch leaves behavior unchanged for some cases, which can be addressed separately so that each case can be evaluated on its own merits. These unchanged cases are when a scan uses a non-MVCC snapshot, an index-only scan, and a scan of a btree index for which modifications are not WAL-logged. If later patches allow all of these cases to drop the buffer pin after reading a leaf page, then the btree vacuum process can be simplified; it will no longer need the "super-exclusive" lock to delete tuples from a page. Reviewed by Heikki Linnakangas and Kyotaro Horiguchi
* Don't delay replication for less than recovery_min_apply_delay's resolution.Andres Freund2015-03-23
| | | | | | | | | | | | | | | | | Recovery delays are implemented by waiting on a latch, and latches take milliseconds as a parameter. The required amount of waiting was computed using microsecond resolution though and the wait loop's abort condition was checking the delay in microseconds as well. This could lead to short spurts of busy looping when the overall wait time was below a millisecond, but above 0 microseconds. Instead just formulate the wait loop's abort condition in millisecond granularity as well. Given that that's recovery_min_apply_delay resolution, it seems harmless to not wait for less than a millisecond. Backpatch to 9.4 where recovery_min_apply_delay was introduced. Discussion: 20150323141819.GH26995@alap3.anarazel.de
* Fix copy & paste error in 4f1b890b137.Andres Freund2015-03-23
| | | | | | | | | Due to the bug delayed standbys would not delay when applying prepared transactions. Discussion: CAB7nPqT6BO1cCn+sAyDByBxA4EKZNAiPi2mFJ=ANeZmnmewRyg@mail.gmail.com Michael Paquier via Coverity.
* Merge the various forms of transaction commit & abort records.Andres Freund2015-03-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since 465883b0a two versions of commit records have existed. A compact version that was used when no cache invalidations, smgr unlinks and similar were needed, and a full version that could deal with all that. Additionally the full version was embedded into twophase commit records. That resulted in a measurable reduction in the size of the logged WAL in some workloads. But more recently additions like logical decoding, which e.g. needs information about the database something was executed on, made it applicable in fewer situations. The static split generally made it hard to expand the commit record, because concerns over the size made it hard to add anything to the compact version. Additionally it's not particularly pretty to have twophase.c insert RM_XACT records. Rejigger things so that the commit and abort records only have one form each, including the twophase equivalents. The presence of the various optional (in the sense of not being in every record) pieces is indicated by a bits in the 'xinfo' flag. That flag previously was not included in compact commit records. To prevent an increase in size due to its presence, it's only included if necessary; signalled by a bit in the xl_info bits available for xact.c, similar to heapam.c's XLOG_HEAP_OPMASK/XLOG_HEAP_INIT_PAGE. Twophase commit/aborts are now the same as their normal counterparts. The original transaction's xid is included in an optional data field. This means that commit records generally are smaller, except in the case of a transaction with subtransactions, but no other special cases; the increase there is four bytes, which seems acceptable given that the more common case of not having subtransactions shrank. The savings are especially measurable for twophase commits, which previously always used the full version; but will in practice only infrequently have required that. The motivation for this work are not the space savings and and deduplication though; it's that it makes it easier to extend commit records with additional information. That's just a few lines of code now; without impacting the common case where that information is not needed. Discussion: 20150220152150.GD4149@awork2.anarazel.de, 235610.92468.qm%40web29004.mail.ird.yahoo.com Reviewed-By: Heikki Linnakangas, Simon Riggs
* Increase max_wal_size's default from 128MB to 1GB.Andres Freund2015-03-15
| | | | | | | | | | | | | | | | The introduction of min_wal_size & max_wal_size in 88e982302684 makes it feasible to increase the default upper bound in checkpoint size. Previously raising the default would lead to a increased disk footprint, even if more segments weren't beneficial. The low default of checkpoint size is one of common performance problem users have thus increasing the default makes sense. Setups where the increase in maximum disk usage is a problem will very likely have to run with a modified configuration anyway. Discussion: 54F4EFB8.40202@agliodbs.com, CA+TgmoZEAgX5oMGJOHVj8L7XOkAe05Gnf45rP40m-K3FhZRVKg@mail.gmail.com Author: Josh Berkus, after a discussion involving lots of people.
* Remove pause_at_recovery_target recovery.conf setting.Andres Freund2015-03-15
| | | | | | | | | The new recovery_target_action (introduced in aedccb1f6/b8e33a85d4) replaces it's functionality. Having both seems likely to cause more confusion than it saves worry due to the incompatibility. Discussion: 5484FC53.2060903@2ndquadrant.com Author: Petr Jelinek
* Suppress maybe-uninitialized compiler warnings.Fujii Masao2015-03-15
| | | | | | | Previously some compilers were thinking that the variables that 57aa5b2 added maybe-uninitialized. Spotted by Andres Freund
* Fix memory leaks in GIN index vacuum.Heikki Linnakangas2015-03-12
| | | | | Per bug #12850 by Walter Nordmann. Backpatch to 9.4 where the leak was introduced.
* Add macros wrapping all usage of gcc's __attribute__.Andres Freund2015-03-11
| | | | | | | | | | | | | | | | | | | | Until now __attribute__() was defined to be empty for all compilers but gcc. That's problematic because it prevents using it in other compilers; which is necessary e.g. for atomics portability. It's also just generally dubious to do so in a header as widely included as c.h. Instead add pg_attribute_format_arg, pg_attribute_printf, pg_attribute_noreturn macros which are implemented in the compilers that understand them. Also add pg_attribute_noreturn and pg_attribute_packed, but don't provide fallbacks, since they can affect functionality. This means that external code that, possibly unwittingly, relied on __attribute__ defined to be empty on !gcc compilers may now run into warnings or errors on those compilers. But there shouldn't be many occurances of that and it's hard to work around... Discussion: 54B58BA3.8040302@ohmu.fi Author: Oskari Saarenmaa, with some minor changes by me.
* Add GUC to enable compression of full page images stored in WAL.Fujii Masao2015-03-11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When newly-added GUC parameter, wal_compression, is on, the PostgreSQL server compresses a full page image written to WAL when full_page_writes is on or during a base backup. A compressed page image will be decompressed during WAL replay. Turning this parameter on can reduce the WAL volume without increasing the risk of unrecoverable data corruption, but at the cost of some extra CPU spent on the compression during WAL logging and on the decompression during WAL replay. This commit changes the WAL format (so bumping WAL version number) so that the one-byte flag indicating whether a full page image is compressed or not is included in its header information. This means that the commit increases the WAL volume one-byte per a full page image even if WAL compression is not used at all. We can save that one-byte by borrowing one-bit from the existing field like hole_offset in the header and using it as the flag, for example. But which would reduce the code readability and the extensibility of the feature. Per discussion, it's not worth paying those prices to save only one-byte, so we decided to add the one-byte flag to the header. This commit doesn't introduce any new compression algorithm like lz4. Currently a full page image is compressed using the existing PGLZ algorithm. Per discussion, we decided to use it at least in the first version of the feature because there were no performance reports showing that its compression ratio is unacceptably lower than that of other algorithm. Of course, in the future, it's worth considering the support of other compression algorithm for the better compression. Rahila Syed and Michael Paquier, reviewed in various versions by myself, Andres Freund, Robert Haas, Abhijit Menon-Sen and many others.
* Move BRIN page type to page's last two bytesAlvaro Herrera2015-03-10
| | | | | | | | | | | | | | ... which is the usual convention among AMs, so that pg_filedump and similar utilities can tell apart pages of different AMs. It was also the intent of the original code, but I failed to realize that alignment considerations would move the whole thing to the previous-to-last word in the page. The new definition of the associated macro makes surrounding code a bit leaner, too. Per note from Heikki at http://www.postgresql.org/message-id/546A16EF.9070005@vmware.com
* Keep CommitTs module in sync in standby and masterAlvaro Herrera2015-03-09
| | | | | | | | | | | | | | | | We allow this module to be turned off on restarts, so a restart time check is enough to activate or deactivate the module; however, if there is a standby replaying WAL emitted from a master which is restarted, but the standby isn't, the state in the standby becomes inconsistent and can easily be crashed. Fix by activating and deactivating the module during WAL replay on parameter change as well as on system start. Problem reported by Fujii Masao in http://www.postgresql.org/message-id/CAHGQGwFhJ3CnHo1CELEfay18yg_RA-XZT-7D8NuWUoYSZ90r4Q@mail.gmail.com Author: Petr JelĂ­nek
* Move WAL-related definitions from dbcommands.h to separate header file.Heikki Linnakangas2015-03-09
| | | | | | | | This makes it easier to write frontend programs that needs to understand the WAL record format of CREATE/DROP DATABASE. dbcommands.h cannot easily be #included in a frontend program, because it pulls in other header files that need backend stuff, but the new dbcommands_xlog.h header file has fewer dependencies.
* Add missing "goto err" statements in xlogreader.c.Fujii Masao2015-03-09
| | | | Spotted by Andres Freund.
* Fix typo in comment.Fujii Masao2015-03-05
|
* Reconsider when to wait for WAL flushes/syncrep during commit.Andres Freund2015-02-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Up to now RecordTransactionCommit() waited for WAL to be flushed (if synchronous_commit != off) and to be synchronously replicated (if enabled), even if a transaction did not have a xid assigned. The primary reason for that is that sequence's nextval() did not assign a xid, but are worthwhile to wait for on commit. This can be problematic because sometimes read only transactions do write WAL, e.g. HOT page prune records. That then could lead to read only transactions having to wait during commit. Not something people expect in a read only transaction. This lead to such strange symptoms as backends being seemingly stuck during connection establishment when all synchronous replicas are down. Especially annoying when said stuck connection is the standby trying to reconnect to allow syncrep again... This behavior also is involved in a rather complicated <= 9.4 bug where the transaction started by catchup interrupt processing waited for syncrep using latches, but didn't get the wakeup because it was already running inside the same overloaded signal handler. Fix the issue here doesn't properly solve that issue, merely papers over the problems. In 9.5 catchup interrupts aren't processed out of signal handlers anymore. To fix all this, make nextval() acquire a top level xid, and only wait for transaction commit if a transaction both acquired a xid and emitted WAL records. If only a xid has been assigned we don't uselessly want to wait just because of writes to temporary/unlogged tables; if only WAL has been written we don't want to wait just because of HOT prunes. The xid assignment in nextval() is unlikely to cause overhead in real-world workloads. For one it only happens SEQ_LOG_VALS/32 values anyway, for another only usage of nextval() without using the result in an insert or similar is affected. Discussion: 20150223165359.GF30784@awork2.anarazel.de, 369698E947874884A77849D8FE3680C2@maumau, 5CF4ABBA67674088B3941894E22A0D25@maumau Per complaint from maumau and Thom Brown Backpatch all the way back; 9.0 doesn't have syncrep, but it seems better to be consistent behavior across all maintained branches.
* Replace checkpoint_segments with min_wal_size and max_wal_size.Heikki Linnakangas2015-02-23
| | | | | | | | | | | | | | | | | | | | | | | | Instead of having a single knob (checkpoint_segments) that both triggers checkpoints, and determines how many checkpoints to recycle, they are now separate concerns. There is still an internal variable called CheckpointSegments, which triggers checkpoints. But it no longer determines how many segments to recycle at a checkpoint. That is now auto-tuned by keeping a moving average of the distance between checkpoints (in bytes), and trying to keep that many segments in reserve. The advantage of this is that you can set max_wal_size very high, but the system won't actually consume that much space if there isn't any need for it. The min_wal_size sets a floor for that; you can effectively disable the auto-tuning behavior by setting min_wal_size equal to max_wal_size. The max_wal_size setting is now the actual target size of WAL at which a new checkpoint is triggered, instead of the distance between checkpoints. Previously, you could calculate the actual WAL usage with the formula "(2 + checkpoint_completion_target) * checkpoint_segments + 1". With this patch, you set the desired WAL usage with max_wal_size, and the system calculates the appropriate CheckpointSegments with the reverse of that formula. That's a lot more intuitive for administrators to set. Reviewed by Amit Kapila and Venkata Balaji N.
* Add GUC to control the time to wait before retrieving WAL after failed attempt.Fujii Masao2015-02-23
| | | | | | | | | | | | | | Previously when the standby server failed to retrieve WAL files from any sources (i.e., streaming replication, local pg_xlog directory or WAL archive), it always waited for five seconds (hard-coded) before the next attempt. For example, this is problematic in warm-standby because restore_command can fail every five seconds even while new WAL file is expected to be unavailable for a long time and flood the log files with its error messages. This commit adds new parameter, wal_retrieve_retry_interval, to control that wait time. Alexey Vasiliev and Michael Paquier, reviewed by Andres Freund and me.
* Use FLEXIBLE_ARRAY_MEMBER in a number of other places.Tom Lane2015-02-21
| | | | I think we're about done with this...
* Use FLEXIBLE_ARRAY_MEMBER for HeapTupleHeaderData.t_bits[].Tom Lane2015-02-21
| | | | | | | This requires changing quite a few places that were depending on sizeof(HeapTupleHeaderData), but it seems for the best. Michael Paquier, some adjustments by me
* Use FLEXIBLE_ARRAY_MEMBER in some more places.Tom Lane2015-02-20
| | | | | | Fix a batch of structs that are only visible within individual .c files. Michael Paquier
* Use FLEXIBLE_ARRAY_MEMBER in struct varlena.Tom Lane2015-02-20
| | | | | | | This forces some minor coding adjustments in tuptoaster.c and inv_api.c, but the new coding there is cleaner anyway. Michael Paquier
* Fix knn-GiST queue comparison function to return heap tuples first.Heikki Linnakangas2015-02-17
| | | | | | | | | The part of the comparison function that was supposed to keep heap tuples ahead of index items was backwards. It would not lead to incorrect results, but it is more efficient to return heap tuples first, before scanning more index pages, when both have the same distance. Alexander Korotkov
* Minor cleanup/code review for "indirect toast" stuff.Tom Lane2015-02-09
| | | | | | | | | | | | | Fix some issues I noticed while fooling with an extension to allow an additional kind of toast pointer. Much of this is just comment improvement, but there are a couple of actual bugs, which might or might not be reachable today depending on what can happen during logical decoding. An example is that toast_flatten_tuple() failed to cover the possibility of an indirection pointer in its input. Back-patch to 9.4 just in case that is reachable now. In HEAD, also correct some really minor issues with recent compression reorganization, such as dangerously underparenthesized macros.
* Move pg_lzcompress.c to src/common.Fujii Masao2015-02-09
| | | | | | | | | | | | | | | | | | | | | | The meta data of PGLZ symbolized by PGLZ_Header is removed, to make the compression and decompression code independent on the backend-only varlena facility. PGLZ_Header is being used to store some meta data related to the data being compressed like the raw length of the uncompressed record or some varlena-related data, making it unpluggable once PGLZ is stored in src/common as it contains some backend-only code paths with the management of varlena structures. The APIs of PGLZ are reworked at the same time to do only compression and decompression of buffers without the meta-data layer, simplifying its use for a more general usage. On-disk format is preserved as well, so there is no incompatibility with previous major versions of PostgreSQL for TOAST entries. Exposing compression and decompression APIs of pglz makes possible its use by extensions and contrib modules. Especially this commit is required for upcoming WAL compression feature so that the WAL reader facility can decompress the WAL data by using pglz_decompress. Michael Paquier, reviewed by me.
* Use a separate memory context for GIN scan keys.Heikki Linnakangas2015-02-04
| | | | | | | | | | It was getting tedious to track and release all the different things that form a scan key. We were leaking at least the queryCategories array, and possibly more, on a rescan. That was visible if a GIN index was used in a nested loop join. This also protects from leaks in extractQuery method. No backpatching, given the lack of complaints from the field. Maybe later, after this has received more field testing.
* Fix reference-after-free when waiting for another xact due to constraint.Heikki Linnakangas2015-02-04
| | | | | | | | | | | | | | | | | | | If an insertion or update had to wait for another transaction to finish, because there was another insertion with conflicting key in progress, we would pass a just-free'd item pointer to XactLockTableWait(). All calls to XactLockTableWait() and MultiXactIdWait() had similar issues. Some passed a pointer to a buffer in the buffer cache, after already releasing the lock. The call in EvalPlanQualFetch had already released the pin too. All but the call in execUtils.c would merely lead to reporting a bogus ctid, however (or an assertion failure, if enabled). All the callers that passed HeapTuple->t_data->t_ctid were slightly bogus anyway: if the tuple was updated (again) in the same transaction, its ctid field would point to the next tuple in the chain, not the tuple itself. Backpatch to 9.4, where the 'ctid' argument to XactLockTableWait was added (in commit f88d4cfc)
* Fix query-duration memory leak with GIN rescans.Heikki Linnakangas2015-01-30
| | | | | | | | | | | | | | | | | | | | | The requiredEntries / additionalEntries arrays were not freed in freeScanKeys() like other per-key stuff. It's not obvious, but startScanKey() was only ever called after the keys have been initialized with ginNewScanKey(). That's why it doesn't need to worry about freeing existing arrays. The ginIsNewKey() test in gingetbitmap was never true, because ginrescan free's the existing keys, and it's not OK to call gingetbitmap twice in a row without calling ginrescan in between. To make that clear, remove the unnecessary ginIsNewKey(). And just to be extra sure that nothing funny happens if there is an existing key after all, call freeScanKeys() to free it if it exists. This makes the code more straightforward. (I'm seeing other similar leaks in testing a query that rescans an GIN index scan, but that's a different issue. This just fixes the obvious leak with those two arrays.) Backpatch to 9.4, where GIN fast scan was added.
* Fix BuildIndexValueDescription for expressionsStephen Frost2015-01-29
| | | | | | | | | | | | | | | In 804b6b6db4dcfc590a468e7be390738f9f7755fb we modified BuildIndexValueDescription to pay attention to which columns are visible to the user, but unfortunatley that commit neglected to consider indexes which are built on expressions. Handle error-reporting of violations of constraint indexes based on expressions by not returning any detail when the user does not have table-level SELECT rights. Backpatch to 9.0, as the prior commit was. Pointed out by Tom.
* Fix bug where GIN scan keys were not initialized with gin_fuzzy_search_limit.Heikki Linnakangas2015-01-29
| | | | | | | | | | | | | | | | | | | | | | | | When gin_fuzzy_search_limit was used, we could jump out of startScan() without calling startScanKey(). That was harmless in 9.3 and below, because startScanKey()() didn't do anything interesting, but in 9.4 it initializes information needed for skipping entries (aka GIN fast scans), and you readily get a segfault if it's not done. Nevertheless, it was clearly wrong all along, so backpatch all the way to 9.1 where the early return was introduced. (AFAICS startScanKey() did nothing useful in 9.3 and below, because the fields it initialized were already initialized in ginFillScanKey(), but I don't dare to change that in a minor release. ginFillScanKey() is always called in gingetbitmap() even though there's a check there to see if the scan keys have already been initialized, because they never are; ginrescan() free's them.) In the passing, remove unnecessary if-check from the second inner loop in startScan(). We already check in the first loop that the condition is true for all entries. Reported by Olaf Gawenda, bug #12694, Backpatch to 9.1 and above, although AFAICS it causes a live bug only in 9.4.
* Fix column-privilege leak in error-message pathsStephen Frost2015-01-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | While building error messages to return to the user, BuildIndexValueDescription, ExecBuildSlotValueDescription and ri_ReportViolation would happily include the entire key or entire row in the result returned to the user, even if the user didn't have access to view all of the columns being included. Instead, include only those columns which the user is providing or which the user has select rights on. If the user does not have any rights to view the table or any of the columns involved then no detail is provided and a NULL value is returned from BuildIndexValueDescription and ExecBuildSlotValueDescription. Note that, for key cases, the user must have access to all of the columns for the key to be shown; a partial key will not be returned. Further, in master only, do not return any data for cases where row security is enabled on the relation and row security should be applied for the user. This required a bit of refactoring and moving of things around related to RLS- note the addition of utils/misc/rls.c. Back-patch all the way, as column-level privileges are now in all supported versions. This has been assigned CVE-2014-8161, but since the issue and the patch have already been publicized on pgsql-hackers, there's no point in trying to hide this commit.
* Remove dead NULL-pointer checks in GiST code.Heikki Linnakangas2015-01-28
| | | | | | | | | | | | | | | | | | | | gist_poly_compress() and gist_circle_compress() checked for a NULL-pointer key argument, but that was dead code; the gist code never passes a NULL-pointer to the "compress" method. This commit also removes a documentation note added in commit a0a3883, about doing NULL-pointer checks in the "compress" method. It was added based on the fact that some implementations were doing NULL-pointer checks, but those checks were unnecessary in the first place. The NULL-pointer check in gbt_var_same() function was also unnecessary. The arguments to the "same" method come from the "compress", "union", or "picksplit" methods, but none of them return a NULL pointer. None of this is to be confused with SQL NULL values. Those are dealt with by the gist machinery, and are never passed to the GiST opclass methods. Michael Paquier
* Tweak BRIN minmax operator classAlvaro Herrera2015-01-22
| | | | | | | | | | | | | | | | | In the union support proc, we were not checking the hasnulls flag of value A early enough, so it could be skipped if the "allnulls" flag in value B is set. Also, a check on the allnulls flag of value "B" was redundant, so remove it. Also change inet_minmax_ops to not be the default opclass for type inet, as a future inclusion operator class would be more useful and it's pretty difficult to change default opclass for a datatype later on. (There is no catversion bump for this catalog change; this shouldn't be a problem.) Extracted from a larger patch to add an "inclusion" operator class. Author: Emre Hasegeli
* Use abbreviated keys for faster sorting of text datums.Robert Haas2015-01-19
| | | | | | | | | | | | | | | | | | | | | This commit extends the SortSupport infrastructure to allow operator classes the option to provide abbreviated representations of Datums; in the case of text, we abbreviate by taking the first few characters of the strxfrm() blob. If the abbreviated comparison is insufficent to resolve the comparison, we fall back on the normal comparator. This can be much faster than the old way of doing sorting if the first few bytes of the string are usually sufficient to resolve the comparison. There is the potential for a performance regression if all of the strings to be sorted are identical for the first 8+ characters and differ only in later positions; therefore, the SortSupport machinery now provides an infrastructure to abort the use of abbreviation if it appears that abbreviation is producing comparatively few distinct keys. HyperLogLog, a streaming cardinality estimator, is included in this commit and used to make that determination for text. Peter Geoghegan, reviewed by me.
* BRIN typo fix.Robert Haas2015-01-19
| | | | Amit Langote
* Fix thinko in re-setting wal_log_hints flag from a parameter-change record.Heikki Linnakangas2015-01-15
| | | | | | | | The flag is supposed to be copied from the record. Same issue with track_commit_timestamps, but that's master-only. Report and fix by Petr Jalinek. Backpatch to 9.4, where wal_log_hints was added.
* Tweak heapam's rmgr desc output slightlyAlvaro Herrera2015-01-12
| | | | | | | Some spaces were missing, and putting the affected tuple offset first in the lock cases instead of the locking data makes more sense. No backpatch since this is cosmetic and surrounding code has changed.
* Don't open a WAL segment for writing at end of recovery.Heikki Linnakangas2015-01-07
| | | | | | | | | | | | | | Since commit ba94518a, we used XLogFileOpen to open the next segment for writing, but if the end-of-recovery happens exactly at a segment boundary, the new segment might not exist yet. (Before ba94518a, XLogFileOpen was correct, because we would open the previous segment if the switch happened at the boundary.) Instead of trying to create it if necessary, it's simpler to not bother opening the segment at all. XLogWrite() will open or create it soon anyway, after writing the checkpoint or end-of-recovery record. Reported by Andres Freund.
* Update copyright for 2015Bruce Momjian2015-01-06
| | | | Backpatch certain files through 9.0
* Fix thinko in lock mode enumAlvaro Herrera2015-01-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 0e5680f4737a9c6aa94aa9e77543e5de60411322 contained a thinko mixing LOCKMODE with LockTupleMode. This caused misbehavior in the case where a tuple is marked with a multixact with at most a FOR SHARE lock, and another transaction tries to acquire a FOR NO KEY EXCLUSIVE lock; this case should block but doesn't. Include a new isolation tester spec file to explicitely try all the tuple lock combinations; without the fix it shows the problem: starting permutation: s1_begin s1_lcksvpt s1_tuplock2 s2_tuplock3 s1_commit step s1_begin: BEGIN; step s1_lcksvpt: SELECT * FROM multixact_conflict FOR KEY SHARE; SAVEPOINT foo; a 1 step s1_tuplock2: SELECT * FROM multixact_conflict FOR SHARE; a 1 step s2_tuplock3: SELECT * FROM multixact_conflict FOR NO KEY UPDATE; a 1 step s1_commit: COMMIT; With the fixed code, step s2_tuplock3 blocks until session 1 commits, which is the correct behavior. All other cases behave correctly. Backpatch to 9.3, like the commit that introduced the problem.
* Remove superflous variable from xlogreader's XLogFindNextRecord().Andres Freund2015-01-04
| | | | | | | Pointed out by Coverity. Since this is mere, and debatable, cosmetics I'm not backpatching this.