aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access
Commit message (Collapse)AuthorAge
...
* Ensure unlogged tables are reset even if crash recovery errors out.Andres Freund2014-11-15
| | | | | | | | | | | | | | | | | | | | | | | | Unlogged relations are reset at the end of crash recovery as they're only synced to disk during a proper shutdown. Unfortunately that and later steps can fail, e.g. due to running out of space. This reset was, up to now performed after marking the database as having finished crash recovery successfully. As out of space errors trigger a crash restart that could lead to the situation that not all unlogged relations are reset. Once that happend usage of unlogged relations could yield errors like "could not open file "...": No such file or directory". Luckily clusters that show the problem can be fixed by performing a immediate shutdown, and starting the database again. To fix, just call ResetUnloggedRelations(UNLOGGED_RELATION_INIT) earlier, before marking the database as having successfully recovered. Discussion: 20140912112246.GA4984@alap3.anarazel.de Backpatch to 9.1 where unlogged tables were introduced. Abhijit Menon-Sen and Andres Freund
* Clean up includes from RLS patchStephen Frost2014-11-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | The initial patch for RLS mistakenly included headers associated with the executor and planner bits in rewrite/rowsecurity.h. Per policy and general good sense, executor headers should not be included in planner headers or vice versa. The include of execnodes.h was a mistaken holdover from previous versions, while the include of relation.h was used for Relation's definition, which should have been coming from utils/relcache.h. This patch cleans these issues up, adds comments to the RowSecurityPolicy struct and the RowSecurityConfigType enum, and changes Relation->rsdesc to Relation->rd_rsdesc to follow Relation field naming convention. Additionally, utils/rel.h was including rewrite/rowsecurity.h, which wasn't a great idea since that was pulling in things not really needed in utils/rel.h (which gets included in quite a few places). Instead, use 'struct RowSecurityDesc' for the rd_rsdesc field and add comments explaining why. Lastly, add an include into access/nbtree/nbtsort.c for utils/sortsupport.h, which was evidently missed due to the above mess. Pointed out by Tom in 16970.1415838651@sss.pgh.pa.us; note that the concerns regarding a similar situation in the custom-path commit still need to be addressed.
* Allow interrupting GetMultiXactIdMembersAlvaro Herrera2014-11-14
| | | | | | | | | | | This function has a loop which can lead to uninterruptible process "stalls" (actually infinite loops) when some bugs are triggered. Avoid that unpleasant situation by adding a check for interrupts in a place that shouldn't degrade performance in the normal case. Backpatch to 9.3. Older branches have an identical loop here, but the aforementioned bugs are only a problem starting in 9.3 so there doesn't seem to be any point in backpatching any further.
* Fix race condition between hot standby and restoring a full-page image.Heikki Linnakangas2014-11-13
| | | | | | | | | | | | | | | | | | | There was a window in RestoreBackupBlock where a page would be zeroed out, but not yet locked. If a backend pinned and locked the page in that window, it saw the zeroed page instead of the old page or new page contents, which could lead to missing rows in a result set, or errors. To fix, replace RBM_ZERO with RBM_ZERO_AND_LOCK, which atomically pins, zeroes, and locks the page, if it's not in the buffer cache already. In stable branches, the old RBM_ZERO constant is renamed to RBM_DO_NOT_USE, to avoid breaking any 3rd party extensions that might use RBM_ZERO. More importantly, this avoids renumbering the other enum values, which would cause even bigger confusion in extensions that use ReadBufferExtended, but haven't been recompiled. Backpatch to all supported versions; this has been racy since hot standby was introduced.
* Fix XLogReadBufferForRedoExtended to get cleanup lock when asked to do so.Heikki Linnakangas2014-11-13
|
* Rename pending_list_cleanup_size to gin_pending_list_limit.Fujii Masao2014-11-13
| | | | | Since this parameter is only for GIN index, it's better to add "gin" to the parameter name for easier understanding.
* Add GUC and storage parameter to set the maximum size of GIN pending list.Fujii Masao2014-11-11
| | | | | | | | | | | | | | | | Previously the maximum size of GIN pending list was controlled only by work_mem. But the reasonable value of work_mem and the reasonable size of the list are basically not the same, so it was not appropriate to control both of them by only one GUC, i.e., work_mem. This commit separates new GUC, pending_list_cleanup_size, from work_mem to allow users to control only the size of the list. Also this commit adds pending_list_cleanup_size as new storage parameter to allow users to specify the size of the list per index. This is useful, for example, when users want to increase the size of the list only for the GIN index which can be updated heavily, and decrease it otherwise. Reviewed by Etsuro Fujita.
* BRIN: fix bug in xlog backup block countingAlvaro Herrera2014-11-10
| | | | | | | | | | | | | The code that generates the BRIN_XLOG_UPDATE removes the buffer reference when the page that's target for the updated tuple is freshly initialized. This is a pretty usual optimization, but was breaking the case where the revmap buffer, which is referenced in the same WAL record, is getting a backup block: the replay code was using backup block index 1, which is not valid when the update target buffer gets pruned; the revmap buffer gets assigned 0 instead. Make sure to use the correct backup block index for revmap when replaying. Bug reported by Fujii Masao.
* Further code and wording tweaks in BRINAlvaro Herrera2014-11-10
| | | | | | | Besides a couple of typo fixes, per David Rowley, Thom Brown, and Amit Langote, and mentions of BRIN in the general CREATE INDEX page again per David, this includes silencing MSVC compiler warnings (thanks Microsoft) and an additional variable initialization per Coverity scanner.
* Fix compiler warning for non-assert builds.Kevin Grittner2014-11-10
| | | | | Reported by Peter Geoghegan David Rowley
* Fix some coding issues in BRINAlvaro Herrera2014-11-08
| | | | | | | | | | | | | | | Reported by David Rowley: variadic macros are a problem. Get rid of them using a trick suggested by Tom Lane: add extra parentheses where needed. In the future we might decide we don't need the calls at all and remove them, but it seems appropriate to keep them while this code is still new. Also from David Rowley: brininsert() was trying to use a variable before initializing it. Fix by moving the brin_form_tuple call (which initializes the variable) to within the locked section. Reported by Peter Eisentraut: can't use "new" as a struct member name, because C++ compilers will choke on it, as reported by cpluspluscheck.
* Fix building with WAL_DEBUG.Heikki Linnakangas2014-11-07
| | | | | | | | | | | Now that the backup blocks are appended to the WAL record in xloginsert.c, XLogInsert doesn't see them anymore and cannot remove them from the version reconstructed for xlog_outdesc. This makes running with wal_debug=on more expensive, as we now make (unnecessary) temporary copies of the backup blocks, but it doesn't seem worth convoluting the code to keep that optimization. Reported by Alvaro Herrera.
* Use the sortsupport infrastructure in more cases.Robert Haas2014-11-07
| | | | | | This removes some fmgr overhead from cases such as btree index builds. Peter Geoghegan, reviewed by Andreas Karlsson and me.
* BRIN: Block Range IndexesAlvaro Herrera2014-11-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BRIN is a new index access method intended to accelerate scans of very large tables, without the maintenance overhead of btrees or other traditional indexes. They work by maintaining "summary" data about block ranges. Bitmap index scans work by reading each summary tuple and comparing them with the query quals; all pages in the range are returned in a lossy TID bitmap if the quals are consistent with the values in the summary tuple, otherwise not. Normal index scans are not supported because these indexes do not store TIDs. As new tuples are added into the index, the summary information is updated (if the block range in which the tuple is added is already summarized) or not; in the latter case, a subsequent pass of VACUUM or the brin_summarize_new_values() function will create the summary information. For data types with natural 1-D sort orders, the summary info consists of the maximum and the minimum values of each indexed column within each page range. This type of operator class we call "Minmax", and we supply a bunch of them for most data types with B-tree opclasses. Since the BRIN code is generalized, other approaches are possible for things such as arrays, geometric types, ranges, etc; even for things such as enum types we could do something different than minmax with better results. In this commit I only include minmax. Catalog version bumped due to new builtin catalog entries. There's more that could be done here, but this is a good step forwards. Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera, with contribution by Heikki Linnakangas. Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas. Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo. PS: The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 318633.
* Fix generation of SP-GiST vacuum WAL records.Heikki Linnakangas2014-11-07
| | | | | I broke these in 8776faa81cb651322b8993422bdd4633f1f6a487. Backpatch to 9.4, where that was done.
* Remove obsolete cases from GiST update redo code.Heikki Linnakangas2014-11-07
| | | | | | | | | | | | The code that generated a record to clear the F_TUPLES_DELETED flag hasn't existed since we got rid of old-style VACUUM FULL. I kept the code that sets the flag, although it's not used for anything anymore, because it might still be interesting information for debugging purposes that some tuples have been deleted from a page. Likewise, the code to turn the root page from non-leaf to leaf page was removed when we got rid of old-style VACUUM FULL. Remove the code to replay that action, too.
* Prevent the unnecessary creation of .ready file for the timeline history file.Fujii Masao2014-11-06
| | | | | | | | | | | Previously .ready file was created for the timeline history file at the end of an archive recovery even when WAL archiving was not enabled. This creation is unnecessary and causes .ready file to remain infinitely. This commit changes an archive recovery so that it creates .ready file for the timeline history file only when WAL archiving is enabled. Backpatch to all supported versions.
* Move the backup-block logic from XLogInsert to a new file, xloginsert.c.Heikki Linnakangas2014-11-06
| | | | | | | | | | | | xlog.c is huge, this makes it a little bit smaller, which is nice. Functions related to putting together the WAL record are in xloginsert.c, and the lower level stuff for managing WAL buffers and such are in xlog.c. Also move the definition of XLogRecord to a separate header file. This causes churn in the #includes of all the files that write WAL records, and redo routines, but it avoids pulling in xlog.h into most places. Reviewed by Michael Paquier, Alvaro Herrera, Andres Freund and Amit Kapila.
* Switch to CRC-32C in WAL and other places.Heikki Linnakangas2014-11-04
| | | | | | | | | | | | | | | | | | | | | | | The old algorithm was found to not be the usual CRC-32 algorithm, used by Ethernet et al. We were using a non-reflected lookup table with code meant for a reflected lookup table. That's a strange combination that AFAICS does not correspond to any bit-wise CRC calculation, which makes it difficult to reason about its properties. Although it has worked well in practice, seems safer to use a well-known algorithm. Since we're changing the algorithm anyway, we might as well choose a different polynomial. The Castagnoli polynomial has better error-correcting properties than the traditional CRC-32 polynomial, even if we had implemented it correctly. Another reason for picking that is that some new CPUs have hardware support for calculating CRC-32C, but not CRC-32, let alone our strange variant of it. This patch doesn't add any support for such hardware, but a future patch could now do that. The old algorithm is kept around for tsquery and pg_trgm, which use the values in indexes that need to remain compatible so that pg_upgrade works. While we're at it, share the old lookup table for CRC-32 calculation between hstore, ltree and core. They all use the same table, so might as well.
* Avoid setup work for invalidation messages at start-of-(sub)xact.Robert Haas2014-10-29
| | | | | | | | | | | | Instead of initializing a new TransInvalidationInfo for every transaction or subtransaction, we can just do it for those transactions or subtransactions that actually need to queue invalidation messages. That also avoids needing to free those entries at the end of a transaction or subtransaction that does not generate any invalidation messages, which is by far the common case. Patch by me. Review by Simon Riggs and Andres Freund.
* Update README.tuplockAlvaro Herrera2014-10-23
| | | | | | | This file was documenting an older version of patch 0ac5ad5134; update it to match what was really committed Author: Florian Pflug
* Prevent the already-archived WAL file from being archived again.Fujii Masao2014-10-23
| | | | | | | | | | | | | | | | | | Previously the archive recovery always created .ready file for the last WAL file of the old timeline at the end of recovery even when it's restored from the archive and has .done file. That is, there was the case where the WAL file had both .ready and .done files. This caused the already-archived WAL file to be archived again. This commit prevents the archive recovery from creating .ready file for the last WAL file if it has .done file, in order to prevent it from being archived again. This bug was added when cascading replication feature was introduced, i.e., the commit 5286105800c7d5902f98f32e11b209c471c0c69c. So, back-patch to 9.2, where cascading replication was added. Reviewed by Michael Paquier
* Don't duplicate log_checkpoint messages for both of restart and checkpoints.Andres Freund2014-10-21
| | | | | | | | | | | | | | The duplication originated in cdd46c765, where restartpoints were introduced. In LogCheckpointStart's case the duplication actually lead to the compiler's format string checking not to be effective because the format string wasn't constant. Arguably these messages shouldn't be elog(), but ereport() style messages. That'd even allow to translate the messages... But as there's more mistakes of that kind in surrounding code, it seems better to change that separately.
* Renumber CHECKPOINT_* flags.Andres Freund2014-10-21
| | | | | | | | | | | Commit 7dbb6069382 added a new CHECKPOINT_FLUSH_ALL flag. As that commit needed to be backpatched I didn't change the numeric values of the existing flags as that could lead to nastly problems if any external code issued checkpoints. That's not a concern on master, so renumber them there. Also add a comment about CHECKPOINT_FLUSH_ALL above CreateCheckPoint().
* Flush unlogged table's buffers when copying or moving databases.Andres Freund2014-10-20
| | | | | | | | | | | | | | | | | | | | | | CREATE DATABASE and ALTER DATABASE .. SET TABLESPACE copy the source database directory on the filesystem level. To ensure the on disk state is consistent they block out users of the affected database and force a checkpoint to flush out all data to disk. Unfortunately, up to now, that checkpoint didn't flush out dirty buffers from unlogged relations. That bug means there could be leftover dirty buffers in either the template database, or the database in its old location. Leading to problems when accessing relations in an inconsistent state; and to possible problems during shutdown in the SET TABLESPACE case because buffers belonging files that don't exist anymore are flushed. This was reported in bug #10675 by Maxim Boguk. Fix by Pavan Deolasee, modified somewhat by me. Reviewed by MauMau and Fujii Masao. Backpatch to 9.1 where unlogged tables were introduced.
* Message improvementsPeter Eisentraut2014-10-12
|
* Split builtins.h to a new header ruleutils.hAlvaro Herrera2014-10-08
| | | | | | | The new header contains many prototypes for functions in ruleutils.c that are not exposed to the SQL level. Reviewed by Andres Freund and Michael Paquier.
* Implement SKIP LOCKED for row-level locksAlvaro Herrera2014-10-07
| | | | | | | | | | | | | | | | | | This clause changes the behavior of SELECT locking clauses in the presence of locked rows: instead of causing a process to block waiting for the locks held by other processes (or raise an error, with NOWAIT), SKIP LOCKED makes the new reader skip over such rows. While this is not appropriate behavior for general purposes, there are some cases in which it is useful, such as queue-like tables. Catalog version bumped because this patch changes the representation of stored rules. Reviewed by Craig Ringer (based on a previous attempt at an implementation by Simon Riggs, who also provided input on the syntax used in the current patch), David Rowley, and Álvaro Herrera. Author: Thomas Munro
* Check for GiST index tuples that don't fit on a page.Heikki Linnakangas2014-10-03
| | | | | | | | | The page splitting code would go into infinite recursion if you try to insert an index tuple that doesn't fit even on an empty page. Per analysis and suggested fix by Andrew Gierth. Fixes bug #11555, reported by Bryan Seitz (analysis happened over IRC). Backpatch to all supported versions.
* Remove num_xloginsert_locks GUC, replace with a #defineHeikki Linnakangas2014-10-01
| | | | | | | I left the GUC in place for the beta period, so that people could experiment with different values. No-one's come up with any data that a different value would be better under some circumstances, so rather than try to document to users what the GUC, let's just hard-code the current value, 8.
* Rename CACHE_LINE_SIZE to PG_CACHE_LINE_SIZE.Andres Freund2014-10-01
| | | | | | | | | | | | | | | As noted in http://bugs.debian.org/763098 there is a conflict between postgres' definition of CACHE_LINE_SIZE and the definition by various *bsd platforms. It's debatable who has the right to define such a name, but postgres' use was only introduced in 375d8526f290 (9.4), so it seems like a good idea to rename it. Discussion: 20140930195756.GC27407@msg.df7cb.de Per complaint of Christoph Berg in the above email, although he's not the original bug reporter. Backpatch to 9.4 where the define was introduced.
* Remove most volatile qualifiers from xlog.cAndres Freund2014-09-22
| | | | | | | | | | | | | | For the reason outlined in df4077cda2e also remove volatile qualifiers from xlog.c. Some of these uses of volatile have been added after noticing problems back when spinlocks didn't imply compiler barriers. So they are a good test - in fact removing the volatiles breaks when done without the barriers in spinlocks present. Several uses of volatile remain where they are explicitly used to access shared memory without locks. These locations are ok with slightly out of date data, but removing the volatile might lead to the variables never being reread from memory. These uses could also be replaced by barriers, but that's a separate change of doubtful value.
* Improve code around the recently added rm_identify rmgr callback.Andres Freund2014-09-22
| | | | | | | | | | | | | | | There are four weaknesses in728f152e07f998d2cb4fe5f24ec8da2c3bda98f2: * append_init() in heapdesc.c was ugly and required that rm_identify return values are only valid till the next call. Instead just add a couple more switch() cases for the INIT_PAGE cases. Now the returned value will always be valid. * a couple rm_identify() callbacks missed masking xl_info with ~XLR_INFO_MASK. * pg_xlogdump didn't map a NULL rm_identify to UNKNOWN or a similar string. * append_init() was called when id=NULL - which should never actually happen. But it's better to be careful.
* Add rmgr callback to name xlog record types for display purposes.Andres Freund2014-09-19
| | | | | | | | | | | | | | | | | | | This is primarily useful for the upcoming pg_xlogdump --stats feature, but also allows to remove some duplicated code in the rmgr_desc routines. Due to the separation and harmonization, the output of dipsplayed records changes somewhat. But since this isn't enduser oriented content that's ok. It's potentially desirable to further change pg_xlogdump's display of records. It previously wasn't possible to show the record type separately from the description forcing it to be in the last column. But that's better done in a separate commit. Author: Abhijit Menon-Sen, slightly editorialized by me Reviewed-By: Álvaro Herrera, Andres Freund, and Heikki Linnakangas Discussion: 20140604104716.GA3989@toroid.org
* Fix GIN data page split ratio calculation.Heikki Linnakangas2014-09-12
| | | | | | | | | | The code that tried to split a page at 75/25 ratio, when appending to the end of an index, was buggy in two ways. First, there was a silly typo that caused it to just fill the left page as full as possible. But the logic as it was intended wasn't correct either, and would actually have given a ratio closer to 60/40 than 75/25. Gaetano Mendola spotted the typo. Backpatch to 9.4, where this code was added.
* Remove dead InRecovery check.Heikki Linnakangas2014-09-11
| | | | | With the new B-tree incomplete split handling in 9.4, _bt_insert_parent is never called in recovery.
* Assorted message fixes and improvementsPeter Eisentraut2014-09-05
|
* Refactor per-page logic common to all redo routines to a new function.Heikki Linnakangas2014-09-02
| | | | | | | | | | | | Every redo routine uses the same idiom to determine what to do to a page: check if there's a backup block for it, and if not read, the buffer if the block exists, and check its LSN. Refactor that into a common function, XLogReadBufferForRedo, making all the redo routines shorter and more readable. This has no user-visible effect, and makes no changes to the WAL format. Reviewed by Andres Freund, Alvaro Herrera, Michael Paquier.
* pg_is_xlog_replay_paused(): remove super-user-only restrictionBruce Momjian2014-08-29
| | | | | | | | Also update docs to mention which function are super-user-only. Report by sys-milan@statpro.com Backpatch through 9.4
* Fix bug in compressed GIN data leaf page splitting code.Heikki Linnakangas2014-08-29
| | | | | | | | | The list of posting lists it's dealing with can contain placeholders for deleted posting lists. The placeholders are kept around so that they can be WAL-logged, but we must be careful to not try to access them. This fixes bug #11280, reported by Mårten Svantesson. Backpatch to 9.4, where the compressed data leaf page code was added.
* Revert "Allow units to be specified in relation option setting value."Fujii Masao2014-08-29
| | | | | | | | | | | This reverts commit e23014f3d40f7d2c23bc97207fd28efbe5ba102b. As the side effect of the reverted commit, when the unit is specified, the reloption was stored in the catalog with the unit. This broke pg_dump (specifically, it prevented pg_dump from outputting restorable backup regarding the reloption) and turned the buildfarm red. Revert the commit until the fixed version is ready.
* Allow units to be specified in relation option setting value.Fujii Masao2014-08-28
| | | | | | | | | This introduces an infrastructure which allows us to specify the units like ms (milliseconds) in integer relation option, like GUC parameter. Currently only autovacuum_vacuum_cost_delay reloption can accept the units. Reviewed by Michael Paquier
* revert "Throw error for ALTER TABLE RESET of an invalid option"Bruce Momjian2014-08-25
| | | | | | Reverts commits 73d78e11a0f7183c80b93eefbbb6026fe9664015 and b0488e5c4fbfdce8acc989bdc17d9f0ec09ac281. Also reverts pg_upgrade changes.
* Throw error for ALTER TABLE RESET of an invalid optionBruce Momjian2014-08-25
| | | | | | | Also adjust pg_upgrade to not use this method for optional TOAST table creation. Patch by Fabrízio de Royes Mello
* Revert XactLockTableWait context setup in conditional multixact waitAlvaro Herrera2014-08-25
| | | | | | | | | There's no point in setting up a context error callback when doing conditional lock acquisition, because we never actually wait and so the user wouldn't be able to see the context message anywhere. In fact, this is more in line with what ConditionalXactLockTableWait is doing. Backpatch to 9.4, where this was added.
* Use newly added InvalidCommandId instead of 0Alvaro Herrera2014-08-25
| | | | | | | | | The symbol was added by 71901ab6d; the original code was introduced by 6868ed749. Development of both overlapped which is why we apparently failed to notice. This is a (very slight) behavior change, so I'm not backpatching this to 9.4 for now, even though the symbol does exist there.
* Fix outdated commentAlvaro Herrera2014-08-22
|
* Replace a few strncmp() calls with strlcpy().Noah Misch2014-08-18
| | | | | | | | | | | strncmp() is a specialized API unsuited for routine copying into fixed-size buffers. On a system where the length of a single filename can exceed MAXPGPATH, the pg_archivecleanup change prevents a simple crash in the subsequent strlen(). Few filesystems support names that long, and calling pg_archivecleanup with untrusted input is still not a credible use case. Therefore, no back-patch. David Rowley
* Prevent memory leaks in parseRelOptions().Tom Lane2014-08-13
| | | | | | | | | | | | | | | | parseRelOptions() tended to leak memory in the caller's context. Most of the time this doesn't really matter since the caller's context is at most query-lifespan, and the function won't be invoked very many times. However, when testing with CLOBBER_CACHE_RECURSIVELY, the same relcache entry can get rebuilt a *lot* of times in one query, leading to significant intraquery memory bloat if it has any reloptions. Noted while investigating a related report from Tomas Vondra. In passing, get rid of some Asserts that are redundant with the one done by deconstruct_array(). As with other patches to avoid leaks in CLOBBER_CACHE testing, it doesn't really seem worth back-patching this.
* Move log_newpage and log_newpage_buffer to xlog.c.Heikki Linnakangas2014-07-31
| | | | | | | | | | | log_newpage is used by many indexams, in addition to heap, but for historical reasons it's always been part of the heapam rmgr. Starting with 9.3, we have another WAL record type for logging an image of a page, XLOG_FPI. Simplify things by moving log_newpage and log_newpage_buffer to xlog.c, and switch to using the XLOG_FPI record type. Bump the WAL version number because the code to replay the old HEAP_NEWPAGE records is removed.