aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access
Commit message (Collapse)AuthorAge
...
* Avoid palloc in critical section in GiST WAL-logging.Heikki Linnakangas2014-04-03
| | | | | | | | | | | | | | | | Memory allocation can fail if you run out of memory, and inside a critical section that will lead to a PANIC. Use conservatively-sized arrays in stack instead. There was previously no explicit limit on the number of pages a GiST split can produce, it was only limited by the number of LWLocks that can be held simultaneously (100 at the moment). This patch adds an explicit limit of 75 pages. That should be plenty, a typical split shouldn't produce more than 2-3 page halves. The bug has been there forever, but only backpatch down to 9.1. The code was changed significantly in 9.1, and it doesn't seem worth the risk or trouble to adapt this for 9.0 and 8.4.
* Fix bug in the new GIN incomplete-split code.Heikki Linnakangas2014-04-01
| | | | | | | | Inserting a downlink to an internal page clears the incomplete-split flag of the child's left sibling, so the left sibling's LSN also needs to be updated and it needs to be marked dirty. The codepath for an insertion got this right, but the case where the internal node is split because of inserting the new downlink missed that.
* Remove dead check for backup block, replace with Assert.Heikki Linnakangas2014-04-01
| | | | | We don't use backup blocks with GIN vacuum records anymore, the page is always recreated from scratch.
* Fix bug in the new B-tree incomplete-split code.Heikki Linnakangas2014-04-01
| | | | | | Inserting a downlink to an internal page clears the incomplete-split flag of the child's left sibling, so the left sibling's LSN also needs to be updated.
* Rewrite the way GIN posting lists are packed on a page, to reduce WAL volume.Heikki Linnakangas2014-03-31
| | | | | | | | | | | | | | Inserting (in retail) into the new 9.4 format GIN posting tree created much larger WAL records than in 9.3. The previous strategy to WAL logging was basically to log the whole page on each change, with the exception of completely unmodified segments up to the first modified one. That was not too bad when appending to the end of the page, as only the last segment had to be WAL-logged, but per Fujii Masao's testing, even that produced 2x the WAL volume that 9.3 did. The new strategy is to keep track of changes to the posting lists in a more fine-grained fashion, and also make the repacking" code smarter to avoid decoding and re-encoding segments unnecessarily.
* Rename GinLogicValue to GinTernaryValue.Heikki Linnakangas2014-03-31
| | | | | It's more descriptive. Also, get rid of the enum, and use #defines instead, per Greg Stark's suggestion.
* Pass more than the first XLogRecData entry to rm_desc, with WAL_DEBUG.Heikki Linnakangas2014-03-26
| | | | | | | | | | | | | | | If you compile with WAL_DEBUG and enable it with wal_debug=on, we used to only pass the first XLogRecData entry to the rm_desc routine. I think the original assumprion was that the first XLogRecData entry contains all the necessary information for the rm_desc routine, but that's a pretty shaky assumption. At least standby_redo didn't get the memo. To fix, piece together all the data in a temporary buffer, and pass that to the rm_desc routine. It's been like this forever, but the patch didn't apply cleanly to back-branches. Probably wouldn't be hard to fix the conflicts, but it's not worth the trouble.
* Don't forget to flush XLOG_PARAMETER_CHANGE record.Fujii Masao2014-03-26
| | | | Backpatch to 9.0 where XLOG_PARAMETER_CHANGE record was instroduced.
* Change ginMergeItemPointers to return a palloc'd array.Heikki Linnakangas2014-03-24
| | | | | That seems nicer than making it the caller's responsibility to pass a suitable-sized array. All the callers were just palloc'ing an array anyway.
* Remove dead code and add comments.Heikki Linnakangas2014-03-24
| | | | | 'cbuffer' variable was left over from an earlier version of the patch to rewrite the incomplete split handling.
* Fix "the the" typos.Heikki Linnakangas2014-03-24
| | | | Erik Rijkers
* Address ccvalid/ccnoinherit in TupleDesc support functions.Noah Misch2014-03-23
| | | | | | | | | equalTupleDescs() neglected both of these ConstrCheck fields, and CreateTupleDescCopyConstr() neglected ccnoinherit. At this time, the only known behavior defect resulting from these omissions is constraint exclusion disregarding a CHECK constraint validated by an ALTER TABLE VALIDATE CONSTRAINT statement issued earlier in the same transaction. Back-patch to 9.2, where these fields were introduced.
* Replace the XLogInsert slots with regular LWLocks.Heikki Linnakangas2014-03-21
| | | | | | | | | | The special feature the XLogInsert slots had over regular LWLocks is the insertingAt value that was updated atomically with releasing backends waiting on it. Add new functions to the LWLock API to do that, and replace the slots with LWLocks. This reduces the amount of duplicated code. (There's still some duplication, but at least it's all in lwlock.c now.) Reviewed by Andres Freund.
* Setup error context callback for transaction lock waitsAlvaro Herrera2014-03-19
| | | | | | | | | | | | | | | | | | With this in place, a session blocking behind another one because of tuple locks will get a context line mentioning the relation name, tuple TID, and operation being done on tuple. For example: LOG: process 11367 still waiting for ShareLock on transaction 717 after 1000.108 ms DETAIL: Process holding the lock: 11366. Wait queue: 11367. CONTEXT: while updating tuple (0,2) in relation "foo" STATEMENT: UPDATE foo SET value = 3; Most usefully, the new line is displayed by log entries due to log_lock_waits, although of course it will be printed by any other log message as well. Author: Christian Kruse, some tweaks by Álvaro Herrera Reviewed-by: Amit Kapila, Andres Freund, Tom Lane, Robert Haas
* Remove rm_safe_restartpoint machinery.Heikki Linnakangas2014-03-18
| | | | | | | | | It is no longer used, none of the resource managers have multi-record actions that would make it unsafe to perform a restartpoint. Also don't allow rm_cleanup to write WAL records, it's also no longer required. Move the call to rm_cleanup routines to make it more symmetric with rm_startup.
* Make the handling of interrupted B-tree page splits more robust.Heikki Linnakangas2014-03-18
| | | | | | | | | | | | | | | | | | | | | | Splitting a page consists of two separate steps: splitting the child page, and inserting the downlink for the new right page to the parent. Previously, we handled the case that you crash in between those steps with a cleanup routine after the WAL recovery had finished, which finished the incomplete split. However, that doesn't help if the page split is interrupted but the database doesn't crash, so that you don't perform WAL recovery. That could happen for example if you run out of disk space. Remove the end-of-recovery cleanup step. Instead, when a page is split, the left page is marked with a new INCOMPLETE_SPLIT flag, and when the downlink is inserted to the parent, the flag is cleared again. If an insertion sees a page with the flag set, it knows that the split was interrupted for some reason, and inserts the missing downlink before proceeding. I used the same approach to fix GIN and GiST split algorithms earlier. This was the last WAL cleanup routine, so we could get rid of that whole machinery now, but I'll leave that for a separate patch. Reviewed by Peter Geoghegan.
* Fix thinko: have trueTriConsistentFn return GIN_TRUE.Heikki Linnakangas2014-03-17
| | | | While we're at it, also improve comments in ginlogic.c.
* Fix typos in comments.Fujii Masao2014-03-17
| | | | Thom Brown
* Fix race condition in B-tree page deletion.Heikki Linnakangas2014-03-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In short, we don't allow a page to be deleted if it's the rightmost child of its parent, but that situation can change after we check for it. Problem ------- We check that the page to be deleted is not the rightmost child of its parent, and then lock its left sibling, the page itself, its right sibling, and the parent, in that order. However, if the parent page is split after the check but before acquiring the locks, the target page might become the rightmost child, if the split happens at the right place. That leads to an error in vacuum (I reproduced this by setting a breakpoint in debugger): ERROR: failed to delete rightmost child 41 of block 3 in index "foo_pkey" We currently re-check that the page is still the rightmost child, and throw the above error if it's not. We could easily just give up rather than throw an error, but that approach doesn't scale to half-dead pages. To recap, although we don't normally allow deleting the rightmost child, if the page is the *only* child of its parent, we delete the child page and mark the parent page as half-dead in one atomic operation. But before we do that, we check that the parent can later be deleted, by checking that it in turn is not the rightmost child of the grandparent (potentially recursing all the way up to the root). But the same situation can arise there - the grandparent can be split while we're not holding the locks. We end up with a half-dead page that we cannot delete. To make things worse, the keyspace of the deleted page has already been transferred to its right sibling. As the README points out, the keyspace at the grandparent level is "out-of-whack" until the half-dead page is deleted, and if enough tuples with keys in the transferred keyspace are inserted, the page might get split and a downlink might be inserted into the grandparent that is out-of-order. That might not cause any serious problem if it's transient (as the README ponders), but is surely bad if it stays that way. Solution -------- This patch changes the page deletion algorithm to avoid that problem. After checking that the topmost page in the chain of to-be-deleted pages is not the rightmost child of its parent, and then deleting the pages from bottom up, unlink the pages from top to bottom. This way, the intermediate stages are similar to the intermediate stages in page splitting, and there is no transient stage where the keyspace is "out-of-whack". The topmost page in the to-be-deleted chain doesn't have a downlink pointing to it, like a page split before the downlink has been inserted. This also allows us to get rid of the cleanup step after WAL recovery, if we crash during page deletion. The deletion will be continued at next VACUUM, but the tree is consistent for searches and insertions at every step. This bug is old, all supported versions are affected, but this patch is too big to back-patch (and changes the WAL record formats of related records). We have not heard any reports of the bug from users, so clearly it's not easy to bump into. Maybe backpatch later, after this has had some field testing. Reviewed by Kevin Grittner and Peter Geoghegan.
* C comments: remove odd blank lines after #ifdef WIN32 linesBruce Momjian2014-03-13
|
* Only WAL-log the modified portion in an UPDATE, if possible.Heikki Linnakangas2014-03-12
| | | | | | | | | When a row is updated, and the new tuple version is put on the same page as the old one, only WAL-log the part of the new tuple that's not identical to the old. This saves significantly on the amount of WAL that needs to be written, in the common case that most fields are not modified. Amit Kapila, with a lot of back and forth with me, Robert Haas, and others.
* Allow opclasses to provide tri-valued GIN consistent functions.Heikki Linnakangas2014-03-12
| | | | | | | | | | | | | | | With the GIN "fast scan" feature, GIN can skip items without fetching all the keys for them, if it can prove that they don't match regardless of those keys. So far, it has done the proving by calling the boolean consistent function with all combinations of TRUE/FALSE for the unfetched keys, but since that's O(n^2), it becomes unfeasible with more than a few keys. We can avoid calling consistent with all the combinations, if we can tell the operator class implementation directly which keys are unknown. This commit includes a triConsistent function for the built-in array and tsvector opclasses. Alexander Korotkov, with some changes by me.
* In WAL replay, restore GIN metapage unconditionally to avoid torn page.Heikki Linnakangas2014-03-12
| | | | | | | | | | | | | | | | We don't take a full-page image of the GIN metapage; instead, the WAL record contains all the information required to reconstruct it from scratch. But to avoid torn page hazards, we must re-initialize it from the WAL record every time, even if it already has a greater LSN, similar to how normal full page images are restored. This was highly unlikely to cause any problems in practice, because the GIN metapage is small. We rely on an update smaller than a 512 byte disk sector to be atomic elsewhere, at least in pg_control. But better safe than sorry, and this would be easy to overlook if more fields are added to the metapage so that it's no longer small. Reported by Noah Misch. Backpatch to all supported versions.
* Fix dangling smgr_owner pointer when a fake relcache entry is freed.Heikki Linnakangas2014-03-07
| | | | | | | | | | | | A fake relcache entry can "own" a SmgrRelation object, like a regular relcache entry. But when it was free'd, the owner field in SmgrRelation was not cleared, so it was left pointing to free'd memory. Amazingly this apparently hasn't caused crashes in practice, or we would've heard about it earlier. Andres found this with Valgrind. Report and fix by Andres Freund, with minor modifications by me. Backpatch to all supported versions.
* Do wal_level and hot standby checks when doing crash-then-archive recovery.Heikki Linnakangas2014-03-05
| | | | | | | | CheckRequiredParameterValues() should perform the checks if archive recovery was requested, even if we are going to perform crash recovery first. Reported by Kyotaro HORIGUCHI. Backpatch to 9.2, like the crash-then-archive recovery mode.
* Fix lastReplayedEndRecPtr calculation when starting from shutdown checkpoint.Heikki Linnakangas2014-03-05
| | | | | | | | | | | | | | | When entering crash recovery followed by archive recovery, and the latest checkpoint is a shutdown checkpoint, and there are no more WAL records to replay before transitioning from crash to archive recovery, we would not immediately allow read-only connections in hot standby mode even if we could. That's because when starting from a shutdown checkpoint, we set lastReplayedEndRecPtr incorrectly to the record before the checkpoint record, instead of the checkpoint record itself. We don't run the redo routine of the shutdown checkpoint record, but starting recovery from it goes through the same motions, so it should be considered as replayed. Reported by Kyotaro HORIGUCHI. All versions with hot standby are affected, so backpatch to 9.0.
* Introduce logical decoding.Robert Haas2014-03-03
| | | | | | | | | | | | | | | | | | | | | | This feature, building on previous commits, allows the write-ahead log stream to be decoded into a series of logical changes; that is, inserts, updates, and deletes and the transactions which contain them. It is capable of handling decoding even across changes to the schema of the effected tables. The output format is controlled by a so-called "output plugin"; an example is included. To make use of this in a real replication system, the output plugin will need to be modified to produce output in the format appropriate to that system, and to perform filtering. Currently, information can be extracted from the logical decoding system only via SQL; future commits will add the ability to stream changes via walsender. Andres Freund, with review and other contributions from many other people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan, Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve Singer.
* Remove bogus while-loop.Heikki Linnakangas2014-02-28
| | | | | | | | | | Commit abf5c5c9a4f142b3343614746bb9e99a794f8e7b added a bogus while- statement after the for(;;)-loop. It went unnoticed in testing, because it was dead code. Report by KONDO Mitsumasa. Backpatch to 9.3. The commit that introduced this was also applied to 9.2, but not the bogus while-loop part, because the code in 9.2 looks quite different.
* Fix WAL replay of locking an updated tupleAlvaro Herrera2014-02-27
| | | | | | | | | | | | | | | We were resetting the tuple's HEAP_HOT_UPDATED flag as well as t_ctid on WAL replay of a tuple-lock operation, which is incorrect when the tuple is already updated. Back-patch to 9.3. The clearing of both header elements was there previously, but since no update could be present on a tuple that was being locked, it was harmless. Bug reported by Peter Geoghegan and Greg Stark in CAM3SWZTMQiCi5PV5OWHb+bYkUcnCk=O67w0cSswPvV7XfUcU5g@mail.gmail.com and CAM-w4HPTOeMT4KP0OJK+mGgzgcTOtLRTvFZyvD0O4aH-7dxo3Q@mail.gmail.com respectively; diagnosis by Andres Freund.
* btbuild no longer calls _bt_doinsert(), update comment.Heikki Linnakangas2014-02-26
| | | | Peter Geoghegan
* Improve comment on setting data_checksum GUC.Heikki Linnakangas2014-02-20
| | | | There was an extra space there, and "fixed" wasn't very descriptive.
* Switch various builtin functions to use pg_lsn instead of text.Robert Haas2014-02-19
| | | | | | | | | | | The functions in slotfuncs.c don't exist in any released version, but the changes to xlogfuncs.c represent backward-incompatibilities. Per discussion, we're hoping that the queries using these functions are few enough and simple enough that this won't cause too much breakage for users. Michael Paquier, reviewed by Andres Freund and further modified by me.
* Fix comment; checkpointer, not bgwriter, performs checkpoints since 9.2.Heikki Linnakangas2014-02-18
| | | | Amit Langote
* Prevent potential overruns of fixed-size buffers.Tom Lane2014-02-17
| | | | | | | | | | | | | | | | | | | | | | | Coverity identified a number of places in which it couldn't prove that a string being copied into a fixed-size buffer would fit. We believe that most, perhaps all of these are in fact safe, or are copying data that is coming from a trusted source so that any overrun is not really a security issue. Nonetheless it seems prudent to forestall any risk by using strlcpy() and similar functions. Fixes by Peter Eisentraut and Jozef Mlich based on Coverity reports. In addition, fix a potential null-pointer-dereference crash in contrib/chkpass. The crypt(3) function is defined to return NULL on failure, but chkpass.c didn't check for that before using the result. The main practical case in which this could be an issue is if libc is configured to refuse to execute unapproved hashing algorithms (e.g., "FIPS mode"). This ideally should've been a separate commit, but since it touches code adjacent to one of the buffer overrun changes, I included it in this commit to avoid last-minute merge issues. This issue was reported by Honza Horak. Security: CVE-2014-0065 for buffer overruns, CVE-2014-0066 for crypt()
* Change the order that pg_xlog and WAL archive are polled for WAL segments.Heikki Linnakangas2014-02-14
| | | | | | | | | | | | | | | | | | | If there is a WAL segment with same ID but different TLI present in both the WAL archive and pg_xlog, prefer the one with higher TLI. Before this patch, the archive was polled first, for all expected TLIs, and only if no file was found was pg_xlog scanned. This was a change in behavior from 9.3, which first scanned archive and pg_xlog for the highest TLI, then archive and pg_xlog for the next highest TLI and so forth. This patch reverts the behavior back to what it was in 9.2. The reason for this is that if for example you try to do archive recovery to timeline 2, which branched off timeline 1, but the WAL for timeline 2 is not archived yet, we would replay past the timeline switch point on timeline 1 using the archived files, before even looking timeline 2's files in pg_xlog Report and patch by Kyotaro Horiguchi. Backpatch to 9.3 where the behavior was changed.
* Separate multixact freezing parameters from xid'sAlvaro Herrera2014-02-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously we were piggybacking on transaction ID parameters to freeze multixacts; but since there isn't necessarily any relationship between rates of Xid and multixact consumption, this turns out not to be a good idea. Therefore, we now have multixact-specific freezing parameters: vacuum_multixact_freeze_min_age: when to remove multis as we come across them in vacuum (default to 5 million, i.e. early in comparison to Xid's default of 50 million) vacuum_multixact_freeze_table_age: when to force whole-table scans instead of scanning only the pages marked as not all visible in visibility map (default to 150 million, same as for Xids). Whichever of both which reaches the 150 million mark earlier will cause a whole-table scan. autovacuum_multixact_freeze_max_age: when for cause emergency, uninterruptible whole-table scans (default to 400 million, double as that for Xids). This means there shouldn't be more frequent emergency vacuuming than previously, unless multixacts are being used very rapidly. Backpatch to 9.3 where multixacts were made to persist enough to require freezing. To avoid an ABI break in 9.3, VacuumStmt has a couple of fields in an unnatural place, and StdRdOptions is split in two so that the newly added fields can go at the end. Patch by me, reviewed by Robert Haas, with additional input from Andres Freund and Tom Lane.
* In XLogReadBufferExtended, don't assume P_NEW yields consecutive pages.Tom Lane2014-02-12
| | | | | | | | | | | | | | | | | | | | | | | | | In a database that's not yet reached consistency, it's possible that some segments of a relation are not full-size but are not the last ones either. Because of the way smgrnblocks() works, asking for a new page with P_NEW will fill in the last not-full-size segment --- and if that makes it full size, the apparent EOF of the relation will increase by more than one page, so that the next P_NEW request will yield a page past the next consecutive one. This breaks the relation-extension logic in XLogReadBufferExtended, possibly allowing a page update to be applied to some page far past where it was intended to go. This appears to be the explanation for reports of table bloat on replication slaves compared to their masters, and probably explains some corrupted-slave reports as well. Fix the loop to check the page number it actually got, rather than merely Assert()'ing that dead reckoning got it to the desired place. AFAICT, there are no other places that make assumptions about exactly which page they'll get from P_NEW. Problem identified by Greg Stark, though this is not the same as his proposed patch. It's been like this for a long time, so back-patch to all supported branches.
* Fix WakeupWaiters() to not wake up an exclusive locker unnecessarily.Heikki Linnakangas2014-02-10
| | | | | | | | WakeupWaiters() is supposed to wake up all LW_WAIT_UNTIL_FREE waiters of the slot, but the loop incorrectly also woke up the first LW_EXCLUSIVE waiter, if there was no LW_WAIT_UNTIL_FREE waiters in the queue. Noted by Andres Freund. This code is new in 9.4, so no backpatching.
* Initialize the entryRes array between each call to triConsistent.Heikki Linnakangas2014-02-07
| | | | | | | | | | | | | | | | | | | | | | The shimTriConstistentFn, which calls the opclass's consistent function with all combinations of TRUE/FALSE for any MAYBE argument, modifies the entryRes array passed by the caller. Change startScanKey to re-initialize it between each call to accommodate that. It's actually a bad habit by shimTriConsistentFn to modify its argument. But the only caller that doesn't already re-initialize the entryRes array was startScanKey, and it's easy for startScanKey to do so. Add a comment to shimTriConsistentFn about that. Note: this does not give a free pass to opclass-provided consistent functions to modify the entryRes argument; shimTriConsistent assumes that they don't, even though it does it itself. While at it, refactor startScanKey to allocate the requiredEntries and additionalEntries after it knows exactly how large they need to be. Saves a little bit of memory, and looks nicer anyway. Per complaint by Tom Lane, buildfarm and the pg_trgm regression test.
* Speed up "rare & frequent" type GIN queries.Heikki Linnakangas2014-02-07
| | | | | | | | | | | | | | | | | | | | | | | | | | If you have a GIN query like "rare & frequent", we currently fetch all the items that match either rare or frequent, call the consistent function for each item, and let the consistent function filter out items that only match one of the terms. However, if we can deduce that "rare" must be present for the overall qual to be true, we can scan all the rare items, and for each rare item, skip over to the next frequent item with the same or greater TID. That greatly speeds up "rare & frequent" type queries. To implement that, introduce the concept of a tri-state consistent function, where the 3rd value is MAYBE, indicating that we don't know if that term is present. Operator classes only provide a boolean consistent function, so we simulate the tri-state consistent function by calling the boolean function several times, with the MAYBE arguments set to all combinations of TRUE and FALSE. Testing all combinations is only feasible for a small number of MAYBE arguments, but it is envisioned that we'll provide a way for operator classes to provide a native tri-state consistent function, which can be much more efficient. But that is not included in this patch. We were already using that trick to for lossy pages, calling the consistent function with the lossy entry set to TRUE and FALSE. Now that we have the tri-state consistent function, use it for lossy pages too. Alexander Korotkov, with fair amount of refactoring by me.
* Remove unnecessary relcache flushes after changing btree metapages.Tom Lane2014-02-05
| | | | | | | | | | | | | | | | | | | | | These flushes were added in my commit d2896a9ed, which added the btree logic that keeps a cached copy of the index metapage data in index relcache entries. The idea was to ensure that other backends would promptly update their cached copies after a change. However, this is not really necessary, since _bt_getroot() has adequate defenses against believing a stale root page link, and _bt_getrootheight() doesn't have to be 100% right. Moreover, if it were necessary, a relcache flush would be an unreliable way to do it, since the sinval mechanism believes that relcache flush requests represent transactional updates, and therefore discards them on transaction rollback. Therefore, we might as well drop these flush requests and save the time to rebuild the whole relcache entry after a metapage change. If we ever try to support in-place truncation of btree indexes, it might be necessary to revisit this issue so that _bt_getroot() can't get caught by trying to follow a metapage link to a page that no longer exists. A possible solution to that is to make use of an smgr, rather than relcache, inval request to force other backends to discard their cached metapages. But for the moment this is not worth pursuing.
* Add primary_slotname to recovery.conf.sample.Fujii Masao2014-02-03
|
* Introduce replication slots.Robert Haas2014-01-31
| | | | | | | | | | | | | | | | Replication slots are a crash-safe data structure which can be created on either a master or a standby to prevent premature removal of write-ahead log segments needed by a standby, as well as (with hot_standby_feedback=on) pruning of tuples whose removal would cause replication conflicts. Slots have some advantages over existing techniques, as explained in the documentation. In a few places, we refer to the type of replication slots introduced by this patch as "physical" slots, because forthcoming patches for logical decoding will also have slots, but with somewhat different properties. Andres Freund and Robert Haas
* Further optimize GIN multi-key searches.Heikki Linnakangas2014-01-29
| | | | | | | | When skipping over some items in a posting tree, re-find the new location by descending the tree from root, rather than walking the right links. This can save a lot of I/O. Heavily modified from Alexander Korotkov's fast scan patch.
* Further optimize multi-key GIN searches.Heikki Linnakangas2014-01-29
| | | | | | | If we're skipping past a certain TID, avoid decoding posting list segments that only contain smaller TIDs. Extracted from Alexander Korotkov's fast scan patch, heavily modified.
* Allow skipping some items in a multi-key GIN search.Heikki Linnakangas2014-01-29
| | | | | | | | | | In a multi-key search, ie. something like "col @> 'foo' AND col @> 'bar'", as soon as we find the next item that matches the first criteria, we don't need to check the second criteria for TIDs smaller the first match. That saves a lot of effort, especially if one of the terms is rare, while the second occurs very frequently. Based on ideas from Alexander Korotkov's fast scan patch.
* Revert C comment change in slot_attisnull()Bruce Momjian2014-01-28
| | | | Revert 89774b58b0ea2874765cae10c094bb6aaf707feb
* Relax the requirement that all lwlocks be stored in a single array.Robert Haas2014-01-27
| | | | | | | | | | | | | | This makes it possible to store lwlocks as part of some other data structure in the main shared memory segment, or in a dynamic shared memory segment. There is still a main LWLock array and this patch does not move anything out of it, but it provides necessary infrastructure for doing that in the future. This change is likely to increase the size of LWLockPadded on some platforms, especially 32-bit platforms where it was previously only 16 bytes. Patch by me. Review by Andres Freund and KaiGai Kohei.
* Adjust C comment in slot_attisnull() regarding nulls.Bruce Momjian2014-01-25
|
* Add recovery_target='immediate' option.Heikki Linnakangas2014-01-25
| | | | | | | | This allows ending recovery as a consistent state has been reached. Without this, there was no easy way to e.g restore an online backup, without replaying any extra WAL after the backup ended. MauMau and me.