aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access
Commit message (Collapse)AuthorAge
...
* Optimize commit_siblings in two ways to improve group commit.Simon Riggs2010-12-08
| | | | | | | | First, avoid scanning the whole ProcArray once we know there are at least commit_siblings active; second, skip the check altogether if commit_siblings = 0. Greg Smith
* Fix bugs in the hot standby known-assigned-xids tracking logic. If there'sHeikki Linnakangas2010-12-07
| | | | | | | | | | | | | | | | | | | | | | | an old transaction running in the master, and a lot of transactions have started and finished since, and a WAL-record is written in the gap between the creating the running-xacts snapshot and WAL-logging it, recovery will fail with "too many KnownAssignedXids" error. This bug was reported by Joachim Wieland on Nov 19th. In the same scenario, when fewer transactions have started so that all the xids fit in KnownAssignedXids despite the first bug, a more serious bug arises. We incorrectly initialize the clog code with the oldest still running transaction, and when we see the WAL record belonging to a transaction with an XID larger than one that committed already before the checkpoint we're recovering from, we zero the clog page containing the already committed transaction, leading to data loss. In hindsight, trying to track xids in the known-assigned-xids array before seeing the running-xacts record was too complicated. To fix that, hold XidGenLock while the running-xacts snapshot is taken and WAL-logged. That ensures that no transaction can begin or end in that gap, so that in recvoery we know that the snapshot contains all transactions running at that point in WAL.
* Fix two typos, by Fujii Masao.Heikki Linnakangas2010-12-06
|
* Fix two small bugs in new gistget.c logic.Tom Lane2010-12-04
| | | | | | | | | | | | | 1. Complain, rather than silently doing nothing, if an "invalid" tuple is found on a leaf page. Per off-list discussion with Heikki. 2. Fix oversight in code that removes a GISTSearchItem from the search queue: we have to reset lastHeap if this was the last heap item in the parent GISTSearchTreeItem. Otherwise subsequent additions will do the wrong thing. This was probably masked in early testing because in typical cases the parent item would now be completely empty and would be deleted on next call. You'd need a queued non-leaf page at exactly the same distance as a heap tuple to expose the bug.
* Add external documentation for KNNGIST.Tom Lane2010-12-03
|
* Put back gistgettuple's check for backwards scan request.Tom Lane2010-12-03
| | | | | On reflection it's a bad idea for the KNNGIST patch to have removed that. We don't want it silently returning incorrect answers.
* KNNGIST, otherwise known as order-by-operator support for GIST.Tom Lane2010-12-03
| | | | | | | | | | | | | | | This commit represents a rather heavily editorialized version of Teodor's builtin_knngist_itself-0.8.2 and builtin_knngist_proc-0.8.1 patches. I redid the opclass API to add a separate Distance method instead of turning the Consistent method into an illogical mess, fixed some bit-rot in the rbtree interfaces, and generally worked over the code style and comments. There's still no non-code documentation to speak of, but I'll work on that separately. Some contrib-module changes are also yet to come (right now, point <-> point is the only KNN-ified operator). Teodor Sigaev and Tom Lane
* Remove now-outdated mention of quotes being required in recovery.conf.Robert Haas2010-12-03
| | | | Noted by Itagaki Takahiro.
* Use GUC lexer for recovery.conf parsing.Robert Haas2010-12-03
| | | | | | | | This eliminates some crufty, special-purpose code and, as a non-trivial side benefit, allows recovery.conf parameters to be unquoted. Dimitri Fontaine, with review and cleanup by Alvaro Herrera, Itagaki Takahiro, and me.
* Create core infrastructure for KNNGIST.Tom Lane2010-12-02
| | | | | | | | | | | | | | | | | | | This is a heavily revised version of builtin_knngist_core-0.9. The ordering operators are no longer mixed in with actual quals, which would have confused not only humans but significant parts of the planner. Instead, ordering operators are carried separately throughout planning and execution. Since the API for ambeginscan and amrescan functions had to be changed anyway, this commit takes the opportunity to rationalize that a bit. RelationGetIndexScan no longer forces a premature index_rescan call; instead, callers of index_beginscan must call index_rescan too. Aside from making the AM-side initialization logic a bit less peculiar, this has the advantage that we do not make a useless extra am_rescan call when there are runtime key values. AMs formerly could not assume that the key values passed to amrescan were actually valid; now they can. Teodor Sigaev and Tom Lane
* Remove useless whitespace at end of linesPeter Eisentraut2010-11-23
|
* The GiST scan algorithm uses LSNs to detect concurrent pages splits, butHeikki Linnakangas2010-11-16
| | | | | | | | | | | | | temporary indexes are not WAL-logged. We used a constant LSN for temporary indexes, on the assumption that we don't need to worry about concurrent page splits in temporary indexes because they're only visible to the current session. But that assumption is wrong, it's possible to insert rows and split pages in the same session, while a scan is in progress. For example, by opening a cursor and fetching some rows, and INSERTing new rows before fetching some more. Fix by generating fake increasing LSNs, used in place of real LSNs in temporary GiST indexes.
* Cleanup various comparisons with the constant "true".Robert Haas2010-11-14
| | | | Itagaki Takahiro, with slight modifications.
* Fix bug introduced by the recent patch to check that the checkpoint redoHeikki Linnakangas2010-11-11
| | | | | | | location read from backup label file can be found: wasShutdown was set incorrectly when a backup label file was found. Jeff Davis, with a little tweaking by me.
* Add monitoring function pg_last_xact_replay_timestamp.Robert Haas2010-11-09
| | | | Fujii Masao, with a little wordsmithing by me.
* In rewriteheap.c (used by VACUUM FULL and CLUSTER), calculate the tupleHeikki Linnakangas2010-11-09
| | | | | | | | | | length stored in the line pointer the same way it's calculated in the normal heap_insert() codepath. As noted by Jeff Davis, the length stored by raw_heap_insert() included padding but the one stored by the normal codepath did not. While the mismatch seems to be harmless, inconsistency isn't good, and the normal codepath has received a lot more testing over the years. Backpatch to 8.3 where the heap rewrite code was introduced.
* Bootstrap WAL to begin at segment logid=0 logseg=1 (000000010000000000000001)Heikki Linnakangas2010-11-02
| | | | | | | | | | | | | rather than 0/0, so that we can safely use 0/0 as an invalid value. This is a more future-proof fix for the corner-case bug in streaming replication that was fixed yesterday. We had a similar corner-case bug with log/seg 0/0 back in February as well. Avoiding 0/0 as a valid value should prevent bugs like that in the future. Per Tom Lane's idea. Back-patch to 9.0. Since this only affects bootstrapping, it makes no difference to existing installations. We don't need to worry about the bug in existing installations, because if you've managed to get past the initial base backup already, you won't hit the bug in the future either.
* Fix corner-case bug in tracking of latest removed WAL segment duringHeikki Linnakangas2010-11-01
| | | | | | | | | streaming replication. We used log/seg 0/0 to indicate that no WAL segments have been removed since startup, but 0/0 is a valid value for the very first WAL segment after initdb. To make that disambiguous, store (latest removed WAL segment + 1) in the global variable. Per report from Matt Chesler, also reproduced by Greg Smith.
* Before removing backup_label and irrevocably changing pg_control file, checkHeikki Linnakangas2010-10-26
| | | | | | | | that WAL file containing the checkpoint redo-location can be found. This avoids making the cluster irrecoverable if the redo location is in an earlie WAL file than the checkpoint record. Report, analysis and patch by Jeff Davis, with small changes by me.
* Refactor typenameTypeId()Peter Eisentraut2010-10-25
| | | | | | Split the old typenameTypeId() into two functions: A new typenameTypeId() that returns only a type OID, and typenameTypeIdAndMod() that returns type OID and typmod. This isolates call sites better that actually care about the typmod.
* Don't try to fetch database name when SetTransactionIdLimit() is executedTom Lane2010-10-20
| | | | | | | | | | | | outside a transaction. This repairs brain fade in my patch of 2009-08-30: the reason we had been storing oldest-database name, not OID, in ShmemVariableCache was of course to avoid having to do a catalog lookup at times when it might be unsafe. This error explains why Aleksandr Dushein is having trouble getting out of an XID wraparound state in bug #5718, though not how he got into that state in the first place. I suspect pg_upgrade is at fault there.
* Remove AtStart_Cache() call in CommandCounterIncrement().Alvaro Herrera2010-10-20
| | | | | | | | This call was present in the aboriginal code from Berkeley, and has never been touched; it may very well be that it was there to mask effects of bugs in other places and it may no longer be necessary. The removal has been foreseen in a code comment since 2007; this seems to be a good time to test this hypothesis.
* Fix a passel of inappropriately-named global functions in GIN.Tom Lane2010-10-17
| | | | | | | | | | | | | | | | The GIN code has absolutely no business exporting GIN-specific functions with names as generic as compareItemPointers() or newScanKey(); that's just trouble waiting to happen. I got annoyed about this again just now and decided to fix it. This commit ensures that all global symbols defined in access/gin/ have names including "gin" or "Gin". There were a couple of cases, like names involving "PostingItem", where arguably the names were already sufficiently nongeneric; but I figured as long as I was risking creating merge problems for unapplied GIN patches I might as well impose a uniform policy. I didn't touch any static symbol names. There might be some places where it'd be appropriate to rename some static functions to match siblings that are exported, but I'll leave that for another time.
* Improve GIN indexscan cost estimation.Tom Lane2010-10-17
| | | | | | | | | | | | | The better estimate requires more statistics than we previously stored: in particular, counts of "entry" versus "data" pages within the index, as well as knowledge of the number of distinct key values. We collect this information during initial index build and update it during VACUUM, storing the info in new fields on the index metapage. No initdb is required because these fields will read as zeroes in a pre-existing index, and the new gincostestimate code is coded to behave (reasonably) sanely if they are zeroes. Teodor Sigaev, reviewed by Jan Urbanski, Tom Lane, and Itagaki Takahiro.
* Make startup process respond to signals to cancel waiting on latch.Simon Riggs2010-10-14
| | | | | | A tidy up for recently committed changes to startup latch. Fujii Masao
* Fix bug in comment of timeline history file.Simon Riggs2010-10-14
| | | | Fujii Masao
* Fix assorted bugs in GIN's WAL replay logic.Tom Lane2010-10-11
| | | | | | | | | | | | | | The original coding was quite sloppy about handling the case where XLogReadBuffer fails (because the page has since been deleted). This would result in either "bad buffer id: 0" or an Assert failure during replay, if indeed the page were no longer there. In a couple of places it also neglected to check whether the change had already been applied, which would probably result in corrupted index contents. I believe that bug #5703 is an instance of the first problem. These issues could show up without replication, but only if you were unfortunate enough to crash between modification of a GIN index and the next checkpoint. Back-patch to 8.2, which is as far back as GIN has WAL support.
* Improve logging in VACUUM FULL VERBOSE and CLUSTER VERBOSE.Tom Lane2010-10-07
| | | | | | | | | This patch resurrects some of the information that could be logged by the old, now-dead implementation of VACUUM FULL, in particular counts of live and dead tuples and the time taken for the table rebuild proper. There's still no logging about the ensuing index rebuilds, though. Itagaki Takahiro
* Remove cvs keywords from all files.Magnus Hagander2010-09-20
|
* Update HOT README about when single-page vacuums happen.Bruce Momjian2010-09-19
|
* Add some documentation about how we WAL-log filesystem actions.Tom Lane2010-09-17
| | | | Per a question from Robert Haas.
* Fix two typos in comments, spotted by Fujii Masao and Thom BrownHeikki Linnakangas2010-09-15
|
* Use a latch to make startup process wake up and replay immediately whenHeikki Linnakangas2010-09-15
| | | | | | | | | | new WAL arrives via streaming replication. This reduces the latency, and also allows us to use a longer polling interval, which is good for energy efficiency. We still need to poll to check for the appearance of a trigger file, but the interval is now 5 seconds (instead of 100ms), like when waiting for a new WAL segment to appear in WAL archive.
* SERIALIZABLE transactions are actually implemented beneath the covers withJoe Conway2010-09-11
| | | | | | | | | | | transaction snapshots, i.e. a snapshot registered at the beginning of a transaction. Change variable naming and comments to reflect this reality in preparation for a future, truly serializable mode, e.g. Serializable Snapshot Isolation (SSI). For the moment transaction snapshots are still used to implement SERIALIZABLE, but hopefully not for too much longer. Patch by Kevin Grittner and Dan Ports with review and some minor wording changes by me.
* Introduce latches. A latch is a boolean variable, with the capability toHeikki Linnakangas2010-09-11
| | | | | | | | | | | | | | | | | | wait until it is set. Latches can be used to reliably wait until a signal arrives, which is hard otherwise because signals don't interrupt select() on some platforms, and even when they do, there's race conditions. On Unix, latches use the so called self-pipe trick under the covers to implement the sleep until the latch is set, without race conditions. On Windows, Windows events are used. Use the new latch abstraction to sleep in walsender, so that as soon as a transaction finishes, walsender is woken up to immediately send the WAL to the standby. This reduces the latency between master and standby, which is good. Preliminary work by Fujii Masao. The latch implementation is by me, with helpful comments from many people.
* Fix oversight in RelFileNodeBackend patch: CreateFakeRelcacheEntry needs toTom Lane2010-08-30
| | | | | | initialize the rd_backend field of a fake Relation entry correctly. Fortunately, that is easy, since only non-temp relations should ever be mentioned in the WAL stream.
* Fix misleading DEBUG2 issued during RemoveOldXlogFiles()Simon Riggs2010-08-30
|
* Truncate subtrans after each restartpoint.Simon Riggs2010-08-30
| | | | Issue reported by Harald Kolb, patch by Fujii Masao, review by me.
* Reduce PANIC to ERROR in some occasionally-reported btree failure cases.Tom Lane2010-08-29
| | | | | | | | | | | | | | | | | | | | | | | | This patch changes _bt_split() and _bt_pagedel() to throw a plain ERROR, rather than PANIC, for several cases that are reported from the field from time to time: * right sibling's left-link doesn't match; * PageAddItem failure during _bt_split(); * parent page's next child isn't right sibling during _bt_pagedel(). In addition the error messages for these cases have been made a bit more verbose, with additional values included. The original motivation for PANIC here was to capture core dumps for subsequent analysis. But with so many users whose platforms don't capture core dumps by default, or who are unprepared to analyze them anyway, it's hard to justify a forced database restart when we can fairly easily detect the problems before we've reached the critical sections where PANIC would be necessary. It is not currently known whether the reports of these messages indicate well-hidden bugs in Postgres, or are a result of storage-level malfeasance; the latter possibility suggests that we ought to try to be more robust even if there is a bug here that's ultimately found. Backpatch to 8.2. The code before that is sufficiently different that it doesn't seem worth the trouble to back-port further.
* Remove duplicate translatable phraseAlvaro Herrera2010-08-26
|
* Tidy up a few calls to smrgextend().Robert Haas2010-08-19
| | | | | | | | | In the new API introduced by my patch to include the backend ID in temprel filenames, the last argument to smrgextend() became skipFsync rather than isTemp, but these calls didn't get the memo. It's not really a problem to pass rel->rd_istemp rather than just plain false, because smgrextend() now automatically skips the fsync for temprels anyway, but this seems cleaner and saves some minute number of cycles.
* Include the backend ID in the relpath of temporary relations.Robert Haas2010-08-13
| | | | | | | | | | | | | | | | | This allows us to reliably remove all leftover temporary relation files on cluster startup without reference to system catalogs or WAL; therefore, we no longer include temporary relations in XLOG_XACT_COMMIT and XLOG_XACT_ABORT WAL records. Since these changes require including a backend ID in each SharedInvalSmgrMsg, the size of the SharedInvalidationMessage.id field has been reduced from two bytes to one, and the maximum number of connections has been reduced from INT_MAX / 4 to 2^23-1. It would be possible to remove these restrictions by increasing the size of SharedInvalidationMessage by 4 bytes, but right now that doesn't seem like a good trade-off. Review by Jaime Casanova and Tom Lane.
* Make RecordTransactionCommit() respect wal_level.Robert Haas2010-08-13
| | | | | | | | | Since the only purpose of WAL-loggin SharedInvalidationMessages is to support Hot Standby operation, they needn't be included when wal_level < hot_standby. Back-patch to 9.0. Review by Heikki Linnakanagas and Fujii Masao.
* Correct sundry errors in Hot Standby-related comments.Robert Haas2010-08-12
| | | | Fujii Masao
* Fix an additional set of problems in GIN's handling of lossy page pointers.Tom Lane2010-08-01
| | | | | | | | | | | | | | | | | Although the key-combining code claimed to work correctly if its input contained both lossy and exact pointers for a single page in a single TID stream, in fact this did not work, and could not work without pretty fundamental redesign. Modify keyGetItem so that it will not return such a stream, by handling lossy-pointer cases a bit more explicitly than we did before. Per followup investigation of a gripe from Artur Dabrowski. An example of a query that failed given his data set is select count(*) from search_tab where (to_tsvector('german', keywords ) @@ to_tsquery('german', 'ee:* | dd:*')) and (to_tsvector('german', keywords ) @@ to_tsquery('german', 'aa:*')); Back-patch to 8.4 where the lossy pointer code was introduced.
* Rewrite the rbtree routines so that an RBNode is the first field of theTom Lane2010-08-01
| | | | | | | | | | | | | | | | | | | | struct representing a tree entry, rather than being a separately allocated piece of storage. This API is at least as clean as the old one (if not more so --- there were some bizarre choices in there) and it permits a very substantial memory savings, on the order of 2X in ginbulk.c's usage. Also, fix minor memory leaks in code called by ginEntryInsert, in particular in ginInsertValue and entryFillRoot, as well as ginEntryInsert itself. These leaks resulted in the GIN index build context continuing to bloat even after we'd filled it to maintenance_work_mem and started to dump data out to the index. In combination these fixes restore the GIN index build code to honoring the maintenance_work_mem limit about as well as it did in 8.4. Speed seems on par with 8.4 too, maybe even a bit faster, for a non-pathological case in which HEAD was formerly slower. Back-patch to 9.0 so we don't have a performance regression from 8.4.
* Rewrite the key-combination logic in GIN's keyGetItem() and scanGetItem()Tom Lane2010-07-31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | routines to make them behave better in the presence of "lossy" index pointers. The previous coding was outright incorrect for some cases, as recently reported by Artur Dabrowski: scanGetItem would fail to return index entries in cases where one index key had multiple exact pointers on the same page as another key had a lossy pointer. Also, keyGetItem was extremely inefficient for cases where a single index key generates multiple "entry" streams, such as an @@ operator with a multiple-clause tsquery. The presence of a lossy page pointer in any one stream defeated its ability to use the opclass consistentFn, resulting in probing many heap pages that didn't really need to be visited. In Artur's example case, a query like WHERE tsvector @@ to_tsquery('a & b') was about 50X slower than the theoretically equivalent WHERE tsvector @@ to_tsquery('a') AND tsvector @@ to_tsquery('b') The way that I chose to fix this was to have GIN call the consistentFn twice with both TRUE and FALSE values for the in-doubt entry stream, returning a hit if either call produces TRUE, but not if they both return FALSE. The code handles this for the case of a single in-doubt entry stream, but punts (falling back to the stupid behavior) if there's more than one lossy reference to the same page. The idea could be scaled up to deal with multiple lossy references, but I think that would probably be wasted complexity. At least to judge by Artur's example, such cases don't occur often enough to be worth trying to optimize. Back-patch to 8.4. 8.3 did not have lossy GIN index pointers, so not subject to these problems.
* Rename asyncCommitLSN to asyncXactLSN to reflect changed role in 9.0.Simon Riggs2010-07-29
| | | | | | Transaction aborts now record their LSN to avoid corner case behaviour in SR/HS, hence change of name of variables and functions. As pointed out by Fujii Masao. Cosmetic changes only.
* Fix possible page corruption by ALTER TABLE .. SET TABLESPACE.Robert Haas2010-07-29
| | | | | | | | | | | | | | If a zeroed page is present in the heap, ALTER TABLE .. SET TABLESPACE will set the LSN and TLI while copying it, which is wrong, and heap_xlog_newpage() will do the same thing during replay, so the corruption propagates to any standby. Note, however, that the bug can't be demonstrated unless archiving is enabled, since in that case we skip WAL logging altogether, and the LSN/TLI are not set. Back-patch to 8.0; prior releases do not have tablespaces. Analysis and patch by Jeff Davis. Adjustments for back-branches and minor wordsmithing by me.
* Avoid deep recursion when assigning XIDs to multiple levels of subxacts.Robert Haas2010-07-23
| | | | | | Backpatch to 8.0. Andres Freund, with cleanup and adjustment for older branches by me.