aboutsummaryrefslogtreecommitdiff
path: root/src/backend
Commit message (Collapse)AuthorAge
* Fix 'recheck' flag in tsquery's GIN tri-consistent function.Heikki Linnakangas2014-03-26
| | | | | | | It needs to be initialized, like in the boolean gin_tsquery_consistent version. Peter Geoghegan.
* Tidy up the populate/to_record{set} code for json a bit.Andrew Dunstan2014-03-25
| | | | In the process fix a small bug.
* Don't forget to flush XLOG_PARAMETER_CHANGE record.Fujii Masao2014-03-26
| | | | Backpatch to 9.0 where XLOG_PARAMETER_CHANGE record was instroduced.
* Remove wchar.c Asserts that were stricter than the main codeBruce Momjian2014-03-24
| | | | | | | | | | Assert errors were thrown for functions being passed invalid encodings, while the main code handled it just fine. Also document that libpq's PQclientEncoding() returns -1 for an encoding lookup failure. Per report from Peter Geoghegan
* Fix ts_rank_cd() to ignore stripped lexemesBruce Momjian2014-03-24
| | | | | | | Previously, stripped lexemes got a default location and could be considered if mixed with non-stripped lexemes. BACKWARD INCOMPATIBILITY CHANGE
* Change ginMergeItemPointers to return a palloc'd array.Heikki Linnakangas2014-03-24
| | | | | That seems nicer than making it the caller's responsibility to pass a suitable-sized array. All the callers were just palloc'ing an array anyway.
* Remove dead code and add comments.Heikki Linnakangas2014-03-24
| | | | | 'cbuffer' variable was left over from an earlier version of the patch to rewrite the incomplete split handling.
* Fix "the the" typos.Heikki Linnakangas2014-03-24
| | | | Erik Rijkers
* Introduce jsonb, a structured format for storing json.Andrew Dunstan2014-03-23
| | | | | | | | | | | | | | | | | | | | | | The new format accepts exactly the same data as the json type. However, it is stored in a format that does not require reparsing the orgiginal text in order to process it, making it much more suitable for indexing and other operations. Insignificant whitespace is discarded, and the order of object keys is not preserved. Neither are duplicate object keys kept - the later value for a given key is the only one stored. The new type has all the functions and operators that the json type has, with the exception of the json generation functions (to_json, json_agg etc.) and with identical semantics. In addition, there are operator classes for hash and btree indexing, and two classes for GIN indexing, that have no equivalent in the json type. This feature grew out of previous work by Oleg Bartunov and Teodor Sigaev, which was intended to provide similar facilities to a nested hstore type, but which in the end proved to have some significant compatibility issues. Authors: Oleg Bartunov, Teodor Sigaev, Peter Geoghegan and Andrew Dunstan. Review: Andres Freund
* Offer triggers on foreign tables.Noah Misch2014-03-23
| | | | | | | | | | | | | | | | | This covers all the SQL-standard trigger types supported for regular tables; it does not cover constraint triggers. The approach for acquiring the old row mirrors that for view INSTEAD OF triggers. For AFTER ROW triggers, we spool the foreign tuples to a tuplestore. This changes the FDW API contract; when deciding which columns to populate in the slot returned from data modification callbacks, writable FDWs will need to check for AFTER ROW triggers in addition to checking for a RETURNING clause. In support of the feature addition, refactor the TriggerFlags bits and the assembly of old tuples in ModifyTable. Ronan Dunklau, reviewed by KaiGai Kohei; some additional hacking by me.
* Improve comments about AfterTriggerBeginQuery() query level usage.Noah Misch2014-03-23
|
* Address ccvalid/ccnoinherit in TupleDesc support functions.Noah Misch2014-03-23
| | | | | | | | | equalTupleDescs() neglected both of these ConstrCheck fields, and CreateTupleDescCopyConstr() neglected ccnoinherit. At this time, the only known behavior defect resulting from these omissions is constraint exclusion disregarding a CHECK constraint validated by an ALTER TABLE VALIDATE CONSTRAINT statement issued earlier in the same transaction. Back-patch to 9.2, where these fields were introduced.
* Fix build with LWLOCK_STATS or dtrace.Heikki Linnakangas2014-03-21
| | | | | | | | Also fix the name of the dtrace probe for LWLockAcquireOrWait(). The function was renamed from LWLockWaitUntilFree to LWLockAqcuireOrWait, but the dtrace probe was neglected. Pointed out by Andres Freund and the buildfarm.
* Remove MinGW readdir/errno bug workaround fixed on 2003-10-10Bruce Momjian2014-03-21
|
* Properly check for readdir/closedir() failuresBruce Momjian2014-03-21
| | | | | | | Clear errno before calling readdir() and handle old MinGW errno bug while adding full test coverage for readdir/closedir failures. Backpatch through 8.4.
* Replace the XLogInsert slots with regular LWLocks.Heikki Linnakangas2014-03-21
| | | | | | | | | | The special feature the XLogInsert slots had over regular LWLocks is the insertingAt value that was updated atomically with releasing backends waiting on it. Add new functions to the LWLock API to do that, and replace the slots with LWLocks. This reduces the amount of duplicated code. (There's still some duplication, but at least it's all in lwlock.c now.) Reviewed by Andres Freund.
* Again fix initialization of auto-tuned effective_cache_size.Tom Lane2014-03-20
| | | | | | | | | | | | | | | | | | | The previous method was overly complex and underly correct; in particular, by assigning the default value with PGC_S_OVERRIDE, it prevented later attempts to change the setting in postgresql.conf, as noted by Jeff Janes. We should just assign the default value with source PGC_S_DYNAMIC_DEFAULT, which will have the desired priority relative to the boot_val as well as user-set values. There is still a gap in this method: if there's an explicit assignment of effective_cache_size = -1 in the postgresql.conf file, and that assignment appears before shared_buffers is assigned, the code will substitute 4 times the bootstrap default for shared_buffers, and that value will then persist (since it will have source PGC_S_FILE). I don't see any very nice way to avoid that though, and it's not a case to be expected in practice. The existing comments in guc-file.l look forward to a redesign of the DYNAMIC_DEFAULT mechanism; if that ever happens, we should consider this case as one of the things we'd like to improve.
* Setup error context callback for transaction lock waitsAlvaro Herrera2014-03-19
| | | | | | | | | | | | | | | | | | With this in place, a session blocking behind another one because of tuple locks will get a context line mentioning the relation name, tuple TID, and operation being done on tuple. For example: LOG: process 11367 still waiting for ShareLock on transaction 717 after 1000.108 ms DETAIL: Process holding the lock: 11366. Wait queue: 11367. CONTEXT: while updating tuple (0,2) in relation "foo" STATEMENT: UPDATE foo SET value = 3; Most usefully, the new line is displayed by log entries due to log_lock_waits, although of course it will be printed by any other log message as well. Author: Christian Kruse, some tweaks by Álvaro Herrera Reviewed-by: Amit Kapila, Andres Freund, Tom Lane, Robert Haas
* Fix memory leak during regular expression execution.Tom Lane2014-03-19
| | | | | | | | For a regex containing backrefs, pg_regexec() might fail to free all the sub-DFAs that were created during execution, resulting in a permanent (session lifespan) memory leak. Problem was introduced by me in commit 587359479acbbdc95c8e37da40707e37097423f5. Per report from Sandro Santilli; diagnosis by Greg Stark.
* Remove rm_safe_restartpoint machinery.Heikki Linnakangas2014-03-18
| | | | | | | | | It is no longer used, none of the resource managers have multi-record actions that would make it unsafe to perform a restartpoint. Also don't allow rm_cleanup to write WAL records, it's also no longer required. Move the call to rm_cleanup routines to make it more symmetric with rm_startup.
* Make the handling of interrupted B-tree page splits more robust.Heikki Linnakangas2014-03-18
| | | | | | | | | | | | | | | | | | | | | | Splitting a page consists of two separate steps: splitting the child page, and inserting the downlink for the new right page to the parent. Previously, we handled the case that you crash in between those steps with a cleanup routine after the WAL recovery had finished, which finished the incomplete split. However, that doesn't help if the page split is interrupted but the database doesn't crash, so that you don't perform WAL recovery. That could happen for example if you run out of disk space. Remove the end-of-recovery cleanup step. Instead, when a page is split, the left page is marked with a new INCOMPLETE_SPLIT flag, and when the downlink is inserted to the parent, the flag is cleared again. If an insertion sees a page with the flag set, it knows that the split was interrupted for some reason, and inserts the missing downlink before proceeding. I used the same approach to fix GIN and GiST split algorithms earlier. This was the last WAL cleanup routine, so we could get rid of that whole machinery now, but I'll leave that for a separate patch. Reviewed by Peter Geoghegan.
* Rewrite comment for shm_mq_receive_bytes.Robert Haas2014-03-18
| | | | | | | The comment and the code diverged at some point before the initial commit of this feature, and I failed to notice. Noted by Tom Lane.
* Fix relcache reference leak in refresh_by_match_merge().Tom Lane2014-03-18
| | | | | | | | | | | One path through the loop over indexes forgot to do index_close(). Rather than adding a fourth call, restructure slightly so that there's only one. In passing, get rid of an unnecessary syscache lookup: the pg_index struct for the index is already available from its relcache entry. Per report from YAMAMOTO Takashi, though this is a bit different from his suggested patch. This is new code in HEAD, so no need for back-patch.
* Improve shm_mq portability around MAXIMUM_ALIGNOF and sizeof(Size).Robert Haas2014-03-18
| | | | | | | | | | | Revise the original decision to expose a uint64-based interface and use Size everywhere possible. Avoid assuming that MAXIMUM_ALIGNOF is 8, or making any assumption about the relationship between that value and sizeof(Size). If MAXIMUM_ALIGNOF is bigger, we'll now insert padding after the length word; if it's smaller, we are now prepared to read and write the length word in chunks. Per discussion with Tom Lane.
* Make it easy to detach completely from shared memory.Robert Haas2014-03-18
| | | | | | | | | | The new function dsm_detach_all() can be used either by postmaster children that don't wish to take any risk of accidentally corrupting shared memory; or by forked children of regular backends with the same need. This patch also updates the postmaster children that already do PGSharedMemoryDetach() to do dsm_detach_all() as well. Per discussion with Tom Lane.
* During index build, check and elog (not just Assert) for broken HOT chain.Tom Lane2014-03-17
| | | | | | | The recently-fixed bug in WAL replay could result in not finding a parent tuple for a heap-only tuple. The existing code would either Assert or generate an invalid index entry, neither of which is desirable. Throw a regular error instead.
* Fix thinko: have trueTriConsistentFn return GIN_TRUE.Heikki Linnakangas2014-03-17
| | | | While we're at it, also improve comments in ginlogic.c.
* Fix typos in comments.Fujii Masao2014-03-17
| | | | Thom Brown
* Fix bug in clean shutdown of walsender that pg_receiving is connecting to.Fujii Masao2014-03-17
| | | | | | | | | | | | | | On clean shutdown, walsender waits for all WAL to be replicated to a standby, and exits. It determined whether that replication had been completed by checking whether its sent location had been equal to a standby's flush location. Unfortunately this condition never becomes true when the standby such as pg_receivexlog which always returns an invalid flush location is connecting to walsender, and then walsender waits forever. This commit changes walsender so that it just checks a standby's write location if a flush location is invalid. Back-patch to 9.1 where enough infrastructure for this exists.
* Make punctuation consistentPeter Eisentraut2014-03-16
|
* Fix whitespacePeter Eisentraut2014-03-16
|
* Cleanups from the remove-native-krb5 patchMagnus Hagander2014-03-16
| | | | | | | | | | | krb_srvname is actually not available anymore as a parameter server-side, since with gssapi we accept all principals in our keytab. It's still used in libpq for client side specification. In passing remove declaration of krb_server_hostname, where all the functionality was already removed. Noted by Stephen Frost, though a different solution than his suggestion
* Fix race condition in B-tree page deletion.Heikki Linnakangas2014-03-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In short, we don't allow a page to be deleted if it's the rightmost child of its parent, but that situation can change after we check for it. Problem ------- We check that the page to be deleted is not the rightmost child of its parent, and then lock its left sibling, the page itself, its right sibling, and the parent, in that order. However, if the parent page is split after the check but before acquiring the locks, the target page might become the rightmost child, if the split happens at the right place. That leads to an error in vacuum (I reproduced this by setting a breakpoint in debugger): ERROR: failed to delete rightmost child 41 of block 3 in index "foo_pkey" We currently re-check that the page is still the rightmost child, and throw the above error if it's not. We could easily just give up rather than throw an error, but that approach doesn't scale to half-dead pages. To recap, although we don't normally allow deleting the rightmost child, if the page is the *only* child of its parent, we delete the child page and mark the parent page as half-dead in one atomic operation. But before we do that, we check that the parent can later be deleted, by checking that it in turn is not the rightmost child of the grandparent (potentially recursing all the way up to the root). But the same situation can arise there - the grandparent can be split while we're not holding the locks. We end up with a half-dead page that we cannot delete. To make things worse, the keyspace of the deleted page has already been transferred to its right sibling. As the README points out, the keyspace at the grandparent level is "out-of-whack" until the half-dead page is deleted, and if enough tuples with keys in the transferred keyspace are inserted, the page might get split and a downlink might be inserted into the grandparent that is out-of-order. That might not cause any serious problem if it's transient (as the README ponders), but is surely bad if it stays that way. Solution -------- This patch changes the page deletion algorithm to avoid that problem. After checking that the topmost page in the chain of to-be-deleted pages is not the rightmost child of its parent, and then deleting the pages from bottom up, unlink the pages from top to bottom. This way, the intermediate stages are similar to the intermediate stages in page splitting, and there is no transient stage where the keyspace is "out-of-whack". The topmost page in the to-be-deleted chain doesn't have a downlink pointing to it, like a page split before the downlink has been inserted. This also allows us to get rid of the cleanup step after WAL recovery, if we crash during page deletion. The deletion will be continued at next VACUUM, but the tree is consistent for searches and insertions at every step. This bug is old, all supported versions are affected, but this patch is too big to back-patch (and changes the WAL record formats of related records). We have not heard any reports of the bug from users, so clearly it's not easy to bump into. Maybe backpatch later, after this has had some field testing. Reviewed by Kevin Grittner and Peter Geoghegan.
* Prevent interrupts while reporting non-ERROR elog messages.Tom Lane2014-03-13
| | | | | | | | | | | | | | | | This should eliminate the risk of recursive entry to syslog(3), which appears to be the cause of the hang reported in bug #9551 from James Morton. Arguably, the real problem here is auth.c's willingness to turn on ImmediateInterruptOK while executing fairly wide swaths of backend code. We may well need to work at narrowing the code ranges in which the authentication_timeout interrupt is enabled. For the moment, though, this is a cheap and reasonably noninvasive fix for a field-reported failure; the other approach would be complex and not necessarily bug-free itself. Back-patch to all supported branches.
* Avoid transaction-commit race condition while receiving a NOTIFY message.Tom Lane2014-03-13
| | | | | | | | | | | | | | | | | | | | | | | | Use TransactionIdIsInProgress, then TransactionIdDidCommit, to distinguish whether a NOTIFY message's originating transaction is in progress, committed, or aborted. The previous coding could accept a message from a transaction that was still in-progress according to the PGPROC array; if the client were fast enough at starting a new transaction, it might fail to see table rows added/updated by the message-sending transaction. Which of course would usually be the point of receiving the message. We noted this type of race condition long ago in tqual.c, but async.c overlooked it. The race condition probably cannot occur unless there are multiple NOTIFY senders in action, since an individual backend doesn't send NOTIFY signals until well after it's done committing. But if two senders commit in close succession, it's certainly possible that we could see the second sender's message within the race condition window while responding to the signal from the first one. Per bug #9557 from Marko Tiikkaja. This patch is slightly more invasive than what he proposed, since it removes the now-redundant TransactionIdDidAbort call. Back-patch to 9.0, where the current NOTIFY implementation was introduced.
* C comments: remove odd blank lines after #ifdef WIN32 linesBruce Momjian2014-03-13
| | | | A few more
* C comments: remove odd blank lines after #ifdef WIN32 linesBruce Momjian2014-03-13
|
* Only WAL-log the modified portion in an UPDATE, if possible.Heikki Linnakangas2014-03-12
| | | | | | | | | When a row is updated, and the new tuple version is put on the same page as the old one, only WAL-log the part of the new tuple that's not identical to the old. This saves significantly on the amount of WAL that needs to be written, in the common case that most fields are not modified. Amit Kapila, with a lot of back and forth with me, Robert Haas, and others.
* Items on GIN data pages are no longer always 6 bytes; update gincostestimate.Heikki Linnakangas2014-03-12
| | | | Also improve the comments a bit.
* Show PIDs of lock holders and waiters in log_lock_waits log message.Fujii Masao2014-03-13
| | | | Christian Kruse, reviewed by Kumar Rajeev Rastogi.
* Fix incorrect assertion about historical snapshots.Robert Haas2014-03-12
| | | | | | Also fix some nearby comments. Andres Freund
* Comment fixes related to logical decoding.Robert Haas2014-03-12
| | | | Andres Freund, per complaints by Peter Eisentraut.
* Allow opclasses to provide tri-valued GIN consistent functions.Heikki Linnakangas2014-03-12
| | | | | | | | | | | | | | | With the GIN "fast scan" feature, GIN can skip items without fetching all the keys for them, if it can prove that they don't match regardless of those keys. So far, it has done the proving by calling the boolean consistent function with all combinations of TRUE/FALSE for the unfetched keys, but since that's O(n^2), it becomes unfeasible with more than a few keys. We can avoid calling consistent with all the combinations, if we can tell the operator class implementation directly which keys are unknown. This commit includes a triConsistent function for the built-in array and tsvector opclasses. Alexander Korotkov, with some changes by me.
* In WAL replay, restore GIN metapage unconditionally to avoid torn page.Heikki Linnakangas2014-03-12
| | | | | | | | | | | | | | | | We don't take a full-page image of the GIN metapage; instead, the WAL record contains all the information required to reconstruct it from scratch. But to avoid torn page hazards, we must re-initialize it from the WAL record every time, even if it already has a greater LSN, similar to how normal full page images are restored. This was highly unlikely to cause any problems in practice, because the GIN metapage is small. We rely on an update smaller than a 512 byte disk sector to be atomic elsewhere, at least in pg_control. But better safe than sorry, and this would be easy to overlook if more fields are added to the metapage so that it's no longer small. Reported by Noah Misch. Backpatch to all supported versions.
* Allow dynamic shared memory segments to be kept until shutdown.Robert Haas2014-03-10
| | | | | Amit Kapila, reviewed by Kyotaro Horiguchi, with some further changes by me.
* Allow logical decoding via the walsender interface.Robert Haas2014-03-10
| | | | | | | | | | | | | | | In order for this to work, walsenders need the optional ability to connect to a database, so the "replication" keyword now allows true or false, for backward-compatibility, and the new value "database" (which causes the "dbname" parameter to be respected). walsender needs to loop not only when idle but also when sending decoded data to the user and when waiting for more xlog data to decode. This means that there are now three separate loops inside walsender.c; although some refactoring has been done here, this is still a bit ugly. Andres Freund, with contributions from Álvaro Herrera, and further review by me.
* Teach on_exit_reset() to discard pending cleanups for dsm.Robert Haas2014-03-10
| | | | | | | | | If a postmaster child invokes fork() and then calls on_exit_reset, that should be sufficient to let it exit() without breaking anything, but dynamic shared memory broke that by not updating on_exit_reset() to discard callbacks registered with dynamic shared memory segments. Per investigation of a complaint from Tom Lane.
* C comments: improve description of relfilenode uniquenessBruce Momjian2014-03-08
| | | | Report by Antonin Houska
* Remove unportable use of anonymous unions from reorderbuffer.h.Tom Lane2014-03-07
| | | | | | | | | | | | | In b89e151054a I had assumed it was ok to use anonymous unions as struct members, but while a longstanding extension in many compilers, it's only been standardized in C11. To fix, remove one of the anonymous unions which tried to hide some implementation specific enum values and give the other a name. The latter unfortunately requires changes in output plugins, but since the feature has only been added a few days ago... Andres Freund
* fix ReplicationSlotsCountDBSlots for dropping unrelated databasesBruce Momjian2014-03-07
| | | | YAMAMOTO Takashi