aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access
Commit message (Collapse)AuthorAge
...
* Write an end-of-backup WAL record at pg_stop_backup(), and wait for it atHeikki Linnakangas2010-01-04
| | | | | | | | | | | | | | | | | | | recovery instead of reading the backup history file. This is more robust, as it stops you from prematurely starting up an inconsisten cluster if the backup history file is lost for some reason, or if the base backup was never finished with pg_stop_backup(). This also paves the way for a simpler streaming replication patch, which doesn't need to care about backup history files anymore. The backup history file is still created and archived as before, but it's not used by the system anymore. It's just for informational purposes now. Bump PG_CONTROL_VERSION as the location of the backup startpoint is now written to a new field in pg_control, and catversion because initdb is required Original patch by Fujii Masao per Simon's idea, with further fixes by me.
* Dept of second thoughts: my first cut at supporting "x IS NOT NULL" btreeTom Lane2010-01-03
| | | | | | | | | indexscans would do the wrong thing if index_rescan() was called with a NULL instead of a new set of scankeys and the index was DESC order, because sk_strategy would not get flipped a second time. I think that those provisions for a NULL argument are dead code now as far as the core backend goes, but possibly somebody somewhere is still using it. In any case, this refactoring seems clearer, and it's definitely shorter.
* Update copyright for the year 2010.Bruce Momjian2010-01-02
|
* Support "x IS NOT NULL" clauses as indexscan conditions. This turns outTom Lane2010-01-01
| | | | | | | | | | | to be just a minor extension of the previous patch that made "x IS NULL" indexable, because we can treat the IS NOT NULL condition as if it were "x < NULL" or "x > NULL" (depending on the index's NULLS FIRST/LAST option), just like IS NULL is treated like "x = NULL". Aside from any possible usefulness in its own right, this is an important improvement for index-optimized MAX/MIN aggregates: it is now reliably possible to get a column's min or max value cheaply, even when there are a lot of nulls cluttering the interesting end of the index.
* Redefine Datum as uintptr_t, instead of unsigned long.Tom Lane2009-12-31
| | | | | | | This is more in keeping with modern practice, and is a first step towards porting to Win64 (which has sizeof(pointer) > sizeof(long)). Tsutomu Yamada, Magnus Hagander, Tom Lane
* Reset minRecoveryPoint at checkpoints, so that we don't uselessly updateHeikki Linnakangas2009-12-30
| | | | | | it in the control file at crash recovery following an archive recovery. Per Fujii Masao and subsequent discussion.
* Fix wrong WAL info value generated when gistContinueInsert() performs anTom Lane2009-12-24
| | | | | | | | index page split. This would result in index corruption, or even more likely an error during WAL replay, if we were unlucky enough to crash during end-of-recovery cleanup after having completed an incomplete GIST insertion. Yoichi Hirai
* Allow read only connections during recovery, known as Hot Standby.Simon Riggs2009-12-19
| | | | | | | | | | | | Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record. New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far. This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required. Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit. Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
* Prevent indirect security attacks via changing session-local state withinTom Lane2009-12-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | an allegedly immutable index function. It was previously recognized that we had to prevent such a function from executing SET/RESET ROLE/SESSION AUTHORIZATION, or it could trivially obtain the privileges of the session user. However, since there is in general no privilege checking for changes of session-local state, it is also possible for such a function to change settings in a way that might subvert later operations in the same session. Examples include changing search_path to cause an unexpected function to be called, or replacing an existing prepared statement with another one that will execute a function of the attacker's choosing. The present patch secures VACUUM, ANALYZE, and CREATE INDEX/REINDEX against these threats, which are the same places previously deemed to need protection against the SET ROLE issue. GUC changes are still allowed, since there are many useful cases for that, but we prevent security problems by forcing a rollback of any GUC change after completing the operation. Other cases are handled by throwing an error if any change is attempted; these include temp table creation, closing a cursor, and creating or deleting a prepared statement. (In 7.4, the infrastructure to roll back GUC changes doesn't exist, so we settle for rejecting changes of "search_path" in these contexts.) Original report and patch by Gurjeet Singh, additional analysis by Tom Lane. Security: CVE-2009-4136
* Add exclusion constraints, which generalize the concept of uniqueness toTom Lane2009-12-07
| | | | | | | | support any indexable commutative operator, not just equality. Two rows violate the exclusion constraint if "row1.col OP row2.col" is TRUE for each of the columns in the constraint. Jeff Davis, reviewed by Robert Haas
* Fix an old bug in multixact and two-phase commit. Prepared transactions canHeikki Linnakangas2009-11-23
| | | | | | | | | | | | | be part of multixacts, so allocate a slot for each prepared transaction in the "oldest member" array in multixact.c. On PREPARE TRANSACTION, transfer the oldest member value from the current backends slot to the prepared xact slot. Also save and recover the value from the 2pc state file. The symptom of the bug was that after a transaction prepared, a shared lock still held by the prepared transaction was sometimes ignored by other transactions. Fix back to 8.1, where both 2PC and multixact were introduced.
* Fix multicolumn GIN's wrong results with fastupdate enabled.Teodor Sigaev2009-11-13
| | | | | | | | User-defined consistent functions believes the check array contains at least one true element which was not a true for scanning pending list. Per report from Yury Don <yura@vpcit.ru>
* Dept of second thoughts: after studying index_getnext() a bit more I realizeTom Lane2009-11-01
| | | | | | that it can scribble on scan->xs_ctup.t_self while following HOT chains, so we can't rely on that to stay valid between hashgettuple() calls. Introduce a private variable in HashScanOpaque, instead.
* Fix two serious bugs introduced into hash indexes by the 8.4 patch that madeTom Lane2009-11-01
| | | | | | | | | | | | | | | | | | | | | | | | hash indexes keep entries sorted by hash value. First, the original plans for concurrency assumed that insertions would happen only at the end of a page, which is no longer true; this could cause scans to transiently fail to find index entries in the presence of concurrent insertions. We can compensate by teaching scans to re-find their position after re-acquiring read locks. Second, neither the bucket split nor the bucket compaction logic had been fixed to preserve hashvalue ordering, so application of either of those processes could lead to permanent corruption of an index, in the sense that searches might fail to find entries that are present. This patch fixes the split and compaction logic to preserve hashvalue ordering, but it cannot do anything about pre-existing corruption. We will need to recommend reindexing all hash indexes in the 8.4.2 release notes. To buy back the performance loss hereby induced in split and compaction, fix them to use PageIndexMultiDelete instead of retail PageIndexDelete operations. We might later want to do something with qsort'ing the page contents rather than doing a binary search for each insertion, but that seemed more invasive than I cared to risk in a back-patch. Per bug #5157 from Jeff Janes and subsequent investigation.
* Code review for LIKE INCLUDING patch --- clean up some cosmetic and notTom Lane2009-10-13
| | | | so cosmetic stuff.
* CREATE LIKE INCLUDING COMMENTS and STORAGE, and INCLUDING ALL shortcut. ↵Andrew Dunstan2009-10-12
| | | | Itagaki Takahiro.
* Remove very ancient tuple-counting infrastructure (IncrRetrieved() andTom Lane2009-10-08
| | | | | | | | | friends). This code has all been ifdef'd out for many years, and doesn't seem to have any prospect of becoming any more useful in the future. EXPLAIN ANALYZE is what people use in practice, and I think if we did want process-wide counters we'd be more likely to put in dtrace events for that than try to resurrect this code. Get rid of it so as to have one less detail to worry about while refactoring execMain.c.
* Make sure that GIN fast-insert and regular code paths enforce the sameTom Lane2009-10-02
| | | | | | | | | | | tuple size limit. Improve the error message for index-tuple-too-large so that it includes the actual size, the limit, and the index name. Sync with the btree occurrences of the same error. Back-patch to 8.4 because it appears that the out-of-sync problem is occurring in the field. Teodor and Tom
* Fix incorrect arguments for gist_box_penalty call. The bug could be observedTeodor Sigaev2009-09-18
| | | | | | only for secondary page split (i.e. for non-first columns of index) Patch by Paul Ramsey <pramsey@opengeo.org>
* Fix two distinct errors in creation of GIN_INSERT_LISTPAGE xlog records.Tom Lane2009-09-15
| | | | | | | | | | | | | | | | | In practice these mistakes were always masked when full_page_writes was on, because XLogInsert would always choose to log the full page, and then ginRedoInsertListPage wouldn't try to do anything. But with full_page_writes off a WAL replay failure was certain. The GIN_INSERT_LISTPAGE record type could probably be eliminated entirely in favor of using XLOG_HEAP_NEWPAGE, but I refrained from doing that now since it would have required a significantly more invasive patch. In passing do a little bit of code cleanup, including making the accounting for free space on GIN list pages more precise. (This wasn't a bug as the errors were always in the conservative direction.) Per report from Simon. Back-patch to 8.4 which contains the identical code.
* Don't error out if recycling or removing an old WAL segment fails at the endHeikki Linnakangas2009-09-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | of checkpoint. Although the checkpoint has been written to WAL at that point already, so that all data is safe, and we'll retry removing the WAL segment at the next checkpoint, if such a failure persists we won't be able to remove any other old WAL segments either and will eventually run out of disk space. It's better to treat the failure as non-fatal, and move on to clean any other WAL segment and continue with any other end-of-checkpoint cleanup. We don't normally expect any such failures, but on Windows it can happen with some anti-virus or backup software that lock files without FILE_SHARE_DELETE flag. Also, the loop in pgrename() to retry when the file is locked was broken. If a file is locked on Windows, you get ERROR_SHARE_VIOLATION, not ERROR_ACCESS_DENIED, at least on modern versions. Fix that, although I left the check for ERROR_ACCESS_DENIED in there as well (presumably it was correct in some environment), and added ERROR_LOCK_VIOLATION to be consistent with similar checks in pgwin32_open(). Reduce the timeout on the loop from 30s to 10s, on the grounds that since it's been broken, we've effectively had a timeout of 0s and no-one has complained, so a smaller timeout is actually closer to the old behavior. A longer timeout would mean that if recycling a WAL file fails because it's locked for some reason, InstallXLogFileSegment() will hold ControlFileLock for longer, potentially blocking other backends, so a long timeout isn't totally harmless. While we're at it, set errno correctly in pgrename(). Backpatch to 8.2, which is the oldest version supported on Windows. The xlog.c changes would make sense on other platforms and thus on older versions as well, but since there's no such locking issues on other platforms, it's not worth it.
* On Windows, when a file is deleted and another process still has an openHeikki Linnakangas2009-09-10
| | | | | | | | | | | | | | | file handle on it, the file goes into "pending deletion" state where it still shows up in directory listing, but isn't accessible otherwise. That confuses RemoveOldXLogFiles(), making it think that the file hasn't been archived yet, while it actually was, and it was deleted along with the .done file. Fix that by renaming the file with ".deleted" extension before deleting it. Also check the return value of rename() and unlink(), so that if the removal fails for any reason (e.g another process is holding the file locked), we don't delete the .done file until the WAL file is really gone. Backpatch to 8.2, which is the oldest version supported on Windows.
* Force VACUUM to recalculate oldestXmin even when we haven't changed ourTom Lane2009-09-01
| | | | | | own database's datfrozenxid, if the current value is old enough to be forcing autovacuums or warning messages. This ensures that a bogus value is replaced as soon as possible. Per a comment from Heikki.
* Actually, we need to bump the format identifier on twophase filesTom Lane2009-09-01
| | | | because of readjustment of 2PC rmgr IDs for flatfile removal.
* Remove flatfiles.c, which is now obsolete.Alvaro Herrera2009-09-01
| | | | | | Recent commits have removed the various uses it was supporting. It was a performance bottleneck, according to bug report #4919 by Lauris Ulmanis; seems it slowed down user creation after a billion users.
* Track the current XID wrap limit (or more accurately, the oldest unfrozenTom Lane2009-08-31
| | | | | | | | | | | | | | | XID) in checkpoint records. This eliminates the need to recompute the value from scratch during database startup, which is one of the two remaining reasons for the flatfile code to exist. It should also simplify life for hot-standby operation. To avoid bloating the checkpoint records unreasonably, I switched from tracking the oldest database by name to tracking it by OID. This turns out to save cycles in general (everywhere but the warning-generating paths, which we hardly care about) and also helps us deal with the case that the oldest database got dropped instead of being vacuumed. The prior coding might go for a long time without updating the wrap limit in that case, which is bad because it might result in a lot of useless autovacuum activity.
* Fix handling of autovacuum reloptions.Alvaro Herrera2009-08-27
| | | | | | | | In the original coding, setting a single reloption would cause default values to be used for all the other reloptions. This is a problem particularly for autovacuum reloptions. Itagaki Takahiro
* In the checkpoint written at the end of archive recovery, the WAL page headerHeikki Linnakangas2009-08-27
| | | | | | | | | | was incorrectly initialized with timeline ID 0. That rendered the WAL page unrecoverable, making a subsequent archive recovery stop at that point. ThisTimeLineID needs to be initialized before calling AdvanceXLInsertBuffer(). This fixes bug #5011 reported by James Bardin. Backpatch to 8.4, as the bug was introduced by the changes to use of bgwriter for writing the end-of-archive-recovery checkpoint. Patch by Tom Lane.
* Fix a violation of WAL coding rules in the recent patch to include anTom Lane2009-08-24
| | | | | | | | | | | "all tuples visible" flag in heap page headers. The flag update *must* be applied before calling XLogInsert, but heap_update and the tuple moving routines in VACUUM FULL were ignoring this rule. A crash and replay could therefore leave the flag incorrectly set, causing rows to appear visible in seqscans when they should not be. This might explain recent reports of data corruption from Jeff Ross and others. In passing, do a bit of editorialization on comments in visibilitymap.c.
* Department of marginal improvements: teach tupconvert.c to avoid doing aTom Lane2009-08-17
| | | | | | | physical conversion when there are dropped columns in the same places in the input and output tupdescs. This avoids possible performance loss from the recent patch to improve dropped-column handling, in some cases where the old code would have worked.
* Allow backends to start up without use of the flat-file copy of pg_database.Tom Lane2009-08-12
| | | | | | | | | | | | | | | | | | | | | | To make this work in the base case, pg_database now has a nailed-in-cache relation descriptor that is initialized using hardwired knowledge in relcache.c. This means pg_database is added to the set of relations that need to have a Schema_pg_xxx macro maintained in pg_attribute.h. When this path is taken, we'll have to do a seqscan of pg_database to find the row we need. In the normal case, we are able to do an indexscan to find the database's row by name. This is made possible by storing a global relcache init file that describes only the shared catalogs and their indexes (and therefore is usable by all backends in any database). A new backend loads this cache file, finds its database OID after an indexscan on pg_database, and then loads the local relcache init file for that database. This change should effectively eliminate number of databases as a factor in backend startup time, even with large numbers of databases. However, the real reason for doing it is as a first step towards getting rid of the flat files altogether. There are still several other sub-projects to be tackled before that can happen.
* Document that LocalSetXLogInsertAllowed can be re-executed.Tom Lane2009-08-08
| | | | Per comment from Simon.
* rm_cleanup functions need to be allowed to write WAL entries. This oversightTom Lane2009-08-07
| | | | | appears to explain the recent reports of "PANIC: cannot make new WAL entries during recovery".
* Improve plpgsql's ability to cope with rowtypes containing dropped columns,Tom Lane2009-08-06
| | | | | | | | | | | by supporting conversions in places that used to demand exact rowtype match. Since this issue is certain to come up elsewhere (in fact, already has, in ExecEvalConvertRowtype), factor out the support code into new core functions for tuple conversion. I chose to put these in a new source file since heaptuple.c is already overly long. Heavily revised version of a patch by Pavel Stehule.
* Add ALTER TABLE ... ALTER COLUMN ... SET STATISTICS DISTINCTTom Lane2009-08-02
| | | | Robert Haas
* Department of second thoughts: let's show the exact key during unique indexTom Lane2009-08-01
| | | | | build failures, too. Refactor a bit more since that error message isn't spelled the same.
* Improve unique-constraint-violation error messages to include the exactTom Lane2009-08-01
| | | | | | | | | values being complained of. In passing, also remove the arbitrary length limitation in the similar error detail message for foreign key violations. Itagaki Takahiro
* Support deferrable uniqueness constraints.Tom Lane2009-07-29
| | | | | | | | | | The current implementation fires an AFTER ROW trigger for each tuple that looks like it might be non-unique according to the index contents at the time of insertion. This works well as long as there aren't many conflicts, but won't scale to massive unique-key reassignments. Improving that case is a TODO item. Dean Rasheed
* Tweak TOAST code so that columns marked with MAIN storage strategy areTom Lane2009-07-22
| | | | | | | | not forced out-of-line unless that is necessary to make the row fit on a page. Previously, they were forced out-of-line if needed to get the row down to the default target size (1/4th page). Kevin Grittner
* Make backend header files C++ safePeter Eisentraut2009-07-16
| | | | | | | | | | | This alters various incidental uses of C++ key words to use other similar identifiers, so that a C++ compiler won't choke outright. You still (probably) need extern "C" { }; around the inclusion of backend headers. based on a patch by Kurt Harriman <harriman@acm.org> Also add a script cpluspluscheck to check for C++ compatibility in the future. As of right now, this passes without error for me.
* Cleanup and code review for the patch that made bgwriter active duringTom Lane2009-06-26
| | | | | | | | | | | | | archive recovery. Invent a separate state variable and inquiry function for XLogInsertAllowed() to clarify some tests and make the management of writing the end-of-recovery checkpoint less klugy. Fix several places that were incorrectly testing InRecovery when they should be looking at RecoveryInProgress or XLogInsertAllowed (because they will now be executed in the bgwriter not startup process). Clarify handling of bad LSNs passed to XLogFlush during recovery. Use a spinlock for setting/testing SharedRecoveryInProgress. Improve quite a lot of comments. Heikki and Tom
* Fix some serious bugs in archive recovery, now that bgwriter is activeHeikki Linnakangas2009-06-25
| | | | | | | | | | | | | | | | | | | | during it: When bgwriter is active, the startup process can't perform mdsync() correctly because it won't see the fsync requests accumulated in bgwriter's private pendingOpsTable. Therefore make bgwriter responsible for the end-of-recovery checkpoint as well, when it's active. When bgwriter is active (= archive recovery), the startup process must not accumulate fsync requests to its own pendingOpsTable, since bgwriter won't see them there when it performs restartpoints. Make startup process drop its pendingOpsTable when bgwriter is launched to avoid that. Update minimum recovery point one last time when leaving archive recovery. It won't be updated by the end-of-recovery checkpoint because XLogFlush() sees us as out of recovery already. This fixes bug #4879 reported by Fujii Masao.
* The code to unlink dropped relations in FinishPreparedTransaction() wasHeikki Linnakangas2009-06-25
| | | | | | acting like runs inside WAL recovery, but it doesn't. I must've copy-pasted this from a redo-function in the relation forks patch. Noticed by Tom Lane while he was looking through callers of smgrdounlink().
* Correct grammar in picksplit debug messagesPeter Eisentraut2009-06-24
|
* Fix a few errors in comments. Patch by Fujii Masao, plus the one inHeikki Linnakangas2009-06-18
| | | | visibilitymap.c by me.
* 8.4 pgindent run, with new combined Linux/FreeBSD/MinGW typedef listBruce Momjian2009-06-11
| | | | provided by Andrew.
* Improve capitalization and punctuation in recently added GiST message.Peter Eisentraut2009-06-10
|
* Keep rs_startblock the same during heap_rescan, so that a rescan of a SeqScanTom Lane2009-06-10
| | | | | | | | | | | | | | node starts from the same place as the first scan did. This avoids surprising behavior of scrollable and WITH HOLD cursors, as seen in Mark Kirkwood's bug report of yesterday. It's not entirely clear whether a rescan should be forced to drop out of the syncscan mode, but for the moment I left the code behaving the same on that point. Any change there would only be a performance and not a correctness issue, anyway. Back-patch to 8.3, since the unstable behavior was created by the syncscan patch.
* Improve the IndexVacuumInfo/IndexBulkDeleteResult API to allow somewhat saneTom Lane2009-06-06
| | | | | | | | | | | | | | | | | | | behavior in cases where we don't know the heap tuple count accurately; in particular partial vacuum, but this also makes the API a bit more useful for ANALYZE. This patch adds "estimated_count" flags to both structs so that an approximate count can be flagged as such, and adjusts the logic so that approximate counts are not used for updating pg_class.reltuples. This fixes my previous complaint that VACUUM was putting ridiculous values into pg_class.reltuples for indexes. The actual impact of that bug is limited, because the planner only pays attention to reltuples for an index if the index is partial; which probably explains why beta testers hadn't noticed a degradation in plan quality from it. But it needs to be fixed. The whole thing is a bit messy and should be redesigned in future, because reltuples now has the potential to drift quite far away from reality when a long period elapses with no non-partial vacuums. But this is as good as it's going to get for 8.4.
* Fix a serious bug introduced into GIN in 8.4: now that MergeItemPointers()Tom Lane2009-06-06
| | | | | | | | | is supposed to remove duplicate heap TIDs, we have to be sure to reduce the tuple size and posting-item count accordingly in addItemPointersToTuple(). Failing to do so resulted in the effective injection of garbage TIDs into the index contents, ie, whatever happened to be in the memory palloc'd for the new tuple. I'm not sure that this fully explains the index corruption reported by Tatsuo Ishii, but the test case I'm using no longer fails.