aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access
Commit message (Collapse)AuthorAge
...
* Update transaction README for persistent multixactsAlvaro Herrera2014-11-28
| | | | | Multixacts are now maintained during recovery, but the README didn't get the memo. Backpatch to 9.3, where the divergence was introduced.
* Fix WAL-logging of B-tree "unlink halfdead page" operation.Heikki Linnakangas2014-11-17
| | | | | | | | | | | There was some confusion on how to record the case that the operation unlinks the last non-leaf page in the branch being deleted. _bt_unlink_halfdead_page set the "topdead" field in the WAL record to the leaf page, but the redo routine assumed that it would be an invalid block number in that case. This commit fixes _bt_unlink_halfdead_page to do what the redo routine expected. This code is new in 9.4, so backpatch there.
* Ensure unlogged tables are reset even if crash recovery errors out.Andres Freund2014-11-15
| | | | | | | | | | | | | | | | | | | | | | | | Unlogged relations are reset at the end of crash recovery as they're only synced to disk during a proper shutdown. Unfortunately that and later steps can fail, e.g. due to running out of space. This reset was, up to now performed after marking the database as having finished crash recovery successfully. As out of space errors trigger a crash restart that could lead to the situation that not all unlogged relations are reset. Once that happend usage of unlogged relations could yield errors like "could not open file "...": No such file or directory". Luckily clusters that show the problem can be fixed by performing a immediate shutdown, and starting the database again. To fix, just call ResetUnloggedRelations(UNLOGGED_RELATION_INIT) earlier, before marking the database as having successfully recovered. Discussion: 20140912112246.GA4984@alap3.anarazel.de Backpatch to 9.1 where unlogged tables were introduced. Abhijit Menon-Sen and Andres Freund
* Allow interrupting GetMultiXactIdMembersAlvaro Herrera2014-11-14
| | | | | | | | | | | This function has a loop which can lead to uninterruptible process "stalls" (actually infinite loops) when some bugs are triggered. Avoid that unpleasant situation by adding a check for interrupts in a place that shouldn't degrade performance in the normal case. Backpatch to 9.3. Older branches have an identical loop here, but the aforementioned bugs are only a problem starting in 9.3 so there doesn't seem to be any point in backpatching any further.
* Fix race condition between hot standby and restoring a full-page image.Heikki Linnakangas2014-11-13
| | | | | | | | | | | | | | | | | | | There was a window in RestoreBackupBlock where a page would be zeroed out, but not yet locked. If a backend pinned and locked the page in that window, it saw the zeroed page instead of the old page or new page contents, which could lead to missing rows in a result set, or errors. To fix, replace RBM_ZERO with RBM_ZERO_AND_LOCK, which atomically pins, zeroes, and locks the page, if it's not in the buffer cache already. In stable branches, the old RBM_ZERO constant is renamed to RBM_DO_NOT_USE, to avoid breaking any 3rd party extensions that might use RBM_ZERO. More importantly, this avoids renumbering the other enum values, which would cause even bigger confusion in extensions that use ReadBufferExtended, but haven't been recompiled. Backpatch to all supported versions; this has been racy since hot standby was introduced.
* Fix generation of SP-GiST vacuum WAL records.Heikki Linnakangas2014-11-07
| | | | | I broke these in 8776faa81cb651322b8993422bdd4633f1f6a487. Backpatch to 9.4, where that was done.
* Prevent the unnecessary creation of .ready file for the timeline history file.Fujii Masao2014-11-06
| | | | | | | | | | | Previously .ready file was created for the timeline history file at the end of an archive recovery even when WAL archiving was not enabled. This creation is unnecessary and causes .ready file to remain infinitely. This commit changes an archive recovery so that it creates .ready file for the timeline history file only when WAL archiving is enabled. Backpatch to all supported versions.
* Prevent the already-archived WAL file from being archived again.Fujii Masao2014-10-23
| | | | | | | | | | | | | | | | | | Previously the archive recovery always created .ready file for the last WAL file of the old timeline at the end of recovery even when it's restored from the archive and has .done file. That is, there was the case where the WAL file had both .ready and .done files. This caused the already-archived WAL file to be archived again. This commit prevents the archive recovery from creating .ready file for the last WAL file if it has .done file, in order to prevent it from being archived again. This bug was added when cascading replication feature was introduced, i.e., the commit 5286105800c7d5902f98f32e11b209c471c0c69c. So, back-patch to 9.2, where cascading replication was added. Reviewed by Michael Paquier
* Flush unlogged table's buffers when copying or moving databases.Andres Freund2014-10-20
| | | | | | | | | | | | | | | | | | | | | | CREATE DATABASE and ALTER DATABASE .. SET TABLESPACE copy the source database directory on the filesystem level. To ensure the on disk state is consistent they block out users of the affected database and force a checkpoint to flush out all data to disk. Unfortunately, up to now, that checkpoint didn't flush out dirty buffers from unlogged relations. That bug means there could be leftover dirty buffers in either the template database, or the database in its old location. Leading to problems when accessing relations in an inconsistent state; and to possible problems during shutdown in the SET TABLESPACE case because buffers belonging files that don't exist anymore are flushed. This was reported in bug #10675 by Maxim Boguk. Fix by Pavan Deolasee, modified somewhat by me. Reviewed by MauMau and Fujii Masao. Backpatch to 9.1 where unlogged tables were introduced.
* Message improvementsPeter Eisentraut2014-10-12
|
* Check for GiST index tuples that don't fit on a page.Heikki Linnakangas2014-10-03
| | | | | | | | | The page splitting code would go into infinite recursion if you try to insert an index tuple that doesn't fit even on an empty page. Per analysis and suggested fix by Andrew Gierth. Fixes bug #11555, reported by Bryan Seitz (analysis happened over IRC). Backpatch to all supported versions.
* Remove num_xloginsert_locks GUC, replace with a #defineHeikki Linnakangas2014-10-01
| | | | | | | I left the GUC in place for the beta period, so that people could experiment with different values. No-one's come up with any data that a different value would be better under some circumstances, so rather than try to document to users what the GUC, let's just hard-code the current value, 8.
* Rename CACHE_LINE_SIZE to PG_CACHE_LINE_SIZE.Andres Freund2014-10-01
| | | | | | | | | | | | | | | As noted in http://bugs.debian.org/763098 there is a conflict between postgres' definition of CACHE_LINE_SIZE and the definition by various *bsd platforms. It's debatable who has the right to define such a name, but postgres' use was only introduced in 375d8526f290 (9.4), so it seems like a good idea to rename it. Discussion: 20140930195756.GC27407@msg.df7cb.de Per complaint of Christoph Berg in the above email, although he's not the original bug reporter. Backpatch to 9.4 where the define was introduced.
* Fix GIN data page split ratio calculation.Heikki Linnakangas2014-09-12
| | | | | | | | | | The code that tried to split a page at 75/25 ratio, when appending to the end of an index, was buggy in two ways. First, there was a silly typo that caused it to just fill the left page as full as possible. But the logic as it was intended wasn't correct either, and would actually have given a ratio closer to 60/40 than 75/25. Gaetano Mendola spotted the typo. Backpatch to 9.4, where this code was added.
* Remove dead InRecovery check.Heikki Linnakangas2014-09-11
| | | | | With the new B-tree incomplete split handling in 9.4, _bt_insert_parent is never called in recovery.
* Assorted message fixes and improvementsPeter Eisentraut2014-09-05
|
* pg_is_xlog_replay_paused(): remove super-user-only restrictionBruce Momjian2014-08-29
| | | | | | | | Also update docs to mention which function are super-user-only. Report by sys-milan@statpro.com Backpatch through 9.4
* Fix bug in compressed GIN data leaf page splitting code.Heikki Linnakangas2014-08-29
| | | | | | | | | The list of posting lists it's dealing with can contain placeholders for deleted posting lists. The placeholders are kept around so that they can be WAL-logged, but we must be careful to not try to access them. This fixes bug #11280, reported by Mårten Svantesson. Backpatch to 9.4, where the compressed data leaf page code was added.
* Revert XactLockTableWait context setup in conditional multixact waitAlvaro Herrera2014-08-25
| | | | | | | | There's no point in setting up a context error callback when doing conditional lock acquisition, because we never actually wait and so the able wouldn't be able to see it. Backpatch to 9.4, where this was added.
* Fix outdated commentAlvaro Herrera2014-08-22
|
* Oops, fix recoveryStopsBefore functions for regular commits.Heikki Linnakangas2014-07-29
| | | | | Pointed out by Tom Lane. Backpatch to 9.4, the code was structured differently in earlier branches and didn't have this mistake.
* Treat 2PC commit/abort the same as regular xacts in recovery.Heikki Linnakangas2014-07-29
| | | | | | | | | | | | | | | | | There were several oversights in recovery code where COMMIT/ABORT PREPARED records were ignored: * pg_last_xact_replay_timestamp() (wasn't updated for 2PC commits) * recovery_min_apply_delay (2PC commits were applied immediately) * recovery_target_xid (recovery would not stop if the XID used 2PC) The first of those was reported by Sergiy Zuban in bug #11032, analyzed by Tom Lane and Andres Freund. The bug was always there, but was masked before commit d19bd29f07aef9e508ff047d128a4046cc8bc1e2, because COMMIT PREPARED always created an extra regular transaction that was WAL-logged. Backpatch to all supported versions (older versions didn't have all the features and therefore didn't have all of the above bugs).
* Fix checkpointer crash in EXEC_BACKEND builds.Robert Haas2014-07-24
| | | | | | | | | | | | | | Nothing in the checkpointer calls InitXLOGAccess(), so WALInsertLocks never got initialized there. Without EXEC_BACKEND, it works anyway because the correct value is inherited from the postmaster, but with EXEC_BACKEND we've got a problem. The problem appears to have been introduced by commit 68a2e52bbaf98f136a96b3a0d734ca52ca440a95. To fix, move the relevant initialization steps from InitXLOGAccess() to XLOGShmemInit(), making this more parallel to what we do elsewhere. Amit Kapila
* Add missing serial commasPeter Eisentraut2014-07-15
| | | | | Also update one place where the wal_level "logical" was not added to an error message.
* Move view reloptions into their own varlena structAlvaro Herrera2014-07-14
| | | | | | | Per discussion after a gripe from me in http://www.postgresql.org/message-id/20140611194633.GH18688@eldon.alvh.no-ip.org Jaime Casanova
* Fix decoding of consecutive MULTI_INSERTs emitted by one heap_multi_insert().Andres Freund2014-07-12
| | | | | | | | | | | | | | | | | | | Commit 1b86c81d2d fixed the decoding of toasted columns for the rows contained in one xl_heap_multi_insert record. But that's not actually enough, because heap_multi_insert() will actually first toast all passed in rows and then emit several *_multi_insert records; one for each page it fills with tuples. Add a XLOG_HEAP_LAST_MULTI_INSERT flag which is set in xl_heap_multi_insert->flag denoting that this multi_insert record is the last emitted by one heap_multi_insert() call. Then use that flag in decode.c to only set clear_toast_afterwards in the right situation. Expand the number of rows inserted via COPY in the corresponding regression test to make sure that more than one heap page is filled with tuples by one heap_multi_insert() call. Backpatch to 9.4 like the previous commit.
* Rename logical decoding's pg_llog directory to pg_logical.Andres Freund2014-07-09
| | | | | | | | | | | | | | | | | | | The old name wasn't very descriptive as of actual contents of the directory, which are historical snapshots in the snapshots/ subdirectory and mappingdata for rewritten tuples in mappings/. There's been a fair amount of discussion what would be a good name. I'm settling for pg_logical because it's likely that further data around logical decoding and replication will need saving in the future. Also add the missing entry for the directory into storage.sgml's list of PGDATA contents. Bumps catversion as the data directories won't be compatible. Backpatch to 9.4, which I missed to do as Michael Paquier luckily noticed. As there already has been a catversion bump after 9.4beta1, there's no reasons for having 9.4 diverge from master.
* Have multixact be truncated by checkpoint, not vacuumAlvaro Herrera2014-06-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of truncating pg_multixact at vacuum time, do it only at checkpoint time. The reason for doing it this way is twofold: first, we want it to delete only segments that we're certain will not be required if there's a crash immediately after the removal; and second, we want to do it relatively often so that older files are not left behind if there's an untimely crash. Per my proposal in http://www.postgresql.org/message-id/20140626044519.GJ7340@eldon.alvh.no-ip.org we now execute the truncation in the checkpointer process rather than as part of vacuum. Vacuum is in only charge of maintaining in shared memory the value to which it's possible to truncate the files; that value is stored as part of checkpoints also, and so upon recovery we can reuse the same value to re-execute truncate and reset the oldest-value-still-safe-to-use to one known to remain after truncation. Per bug reported by Jeff Janes in the course of his tests involving bug #8673. While at it, update some comments that hadn't been updated since multixacts were changed. Backpatch to 9.3, where persistency of pg_multixact files was introduced by commit 0ac5ad5134f2.
* Fix broken Assert() introduced by 8e9a16ab8f7f0e58Alvaro Herrera2014-06-27
| | | | | | | | | | | | | | | | Don't assert MultiXactIdIsRunning if the multi came from a tuple that had been share-locked and later copied over to the new cluster by pg_upgrade. Doing that causes an error to be raised unnecessarily: MultiXactIdIsRunning is not open to the possibility that its argument came from a pg_upgraded tuple, and all its other callers are already checking; but such multis cannot, obviously, have transactions still running, so the assert is pointless. Noticed while investigating the bogus pg_multixact/offsets/0000 file left over by pg_upgrade, as reported by Andres Freund in http://www.postgresql.org/message-id/20140530121631.GE25431@alap3.anarazel.de Backpatch to 9.3, as the commit that introduced the buglet.
* Consistency improvements for slot and decoding code.Andres Freund2014-06-12
| | | | | | | | Change the order of checks in similar functions to be the same; remove a parameter that's not needed anymore; rename a memory context and expand a couple of comments. Per review comments from Amit Kapila
* Fix infinite loop when splitting inner tuples in SPGiST text indexes.Tom Lane2014-06-09
| | | | | | | | | | | | | | | | | | | | | | | Previously, the code used a node label of zero both for strings that contain no bytes beyond the inner tuple's prefix, and for cases where an "allTheSame" inner tuple has to be split to allow a string with a different next byte to be inserted into it. Failing to distinguish these cases meant that if a string ending with the current prefix needed to be inserted into an allTheSame tuple, we got into an infinite loop, because after splitting the tuple we'd descend into the child allTheSame tuple and then find we need to split again. To fix, instead use -1 and -2 as the node labels for these two cases. This requires widening the node label type from "char" to int2, but fortunately SPGiST stores all pass-by-value node label types in their Datum representation, which means that this change is transparently upward compatible so far as the on-disk representation goes. We continue to recognize zero as a dummy node label for reading purposes, but will not attempt to push new index entries down into such a label, so that the loop won't occur even when dealing with an existing index. Per report from Teodor Sigaev. Back-patch to 9.2 where the faulty code was introduced.
* Wrap multixact/members correctly during extension, take 2Alvaro Herrera2014-06-09
| | | | | | | | In a50d97625497b7 I already changed this, but got it wrong for the case where the number of members is larger than the number of entries that fit in the last page of the last segment. As reported by Serge Negodyuck in a followup to bug #8673.
* Add defenses against running with a wrong selection of LOBLKSIZE.Tom Lane2014-06-05
| | | | | | | | | | | | | | | | | | | | | It's critical that the backend's idea of LOBLKSIZE match the way data has actually been divided up in pg_largeobject. While we don't provide any direct way to adjust that value, doing so is a one-line source code change and various people have expressed interest recently in changing it. So, just as with TOAST_MAX_CHUNK_SIZE, it seems prudent to record the value in pg_control and cross-check that the backend's compiled-in setting matches the on-disk data. Also tweak the code in inv_api.c so that fetches from pg_largeobject explicitly verify that the length of the data field is not more than LOBLKSIZE. Formerly we just had Asserts() for that, which is no protection at all in production builds. In some of the call sites an overlength data value would translate directly to a security-relevant stack clobber, so it seems worth one extra runtime comparison to be sure. In the back branches, we can't change the contents of pg_control; but we can still make the extra checks in inv_api.c, which will offer some amount of protection against running with the wrong value of LOBLKSIZE.
* Consistently spell a replication slot's name as slot_name.Andres Freund2014-06-05
| | | | | | | | | | | Previously there's been a mix between 'slotname' and 'slot_name'. It's not nice to be unneccessarily inconsistent in a new feature. As a post beta1 initdb now is required in the wake of eeca4cd35e, fix the inconsistencies. Most the changes won't affect usage of replication slots because the majority of changes is around function parameter names. The prominent exception to that is that the recovery.conf parameter 'primary_slotname' is now named 'primary_slot_name'.
* Adjust SP-GiST WAL record formats to reduce alignment padding.Heikki Linnakangas2014-06-05
| | | | | | | | | | | | | | | | | | The way the code was written, the padding was copied from uninitialized memory areas.. Because the structs are local variables in the code where the WAL records are constructed, making them larger and zeroing the padding bytes would not make the code very pretty, so rather than fixing this directly by zeroing out the padding bytes, it seems more clear to not try to align the tuples in the WAL records. The redo functions are taught to copy the tuple header to a local variable to avoid unaligned access. Stable-branches have the same problem, but we can't change the WAL format there, so fix in master only. Reading a few random extra bytes at the stack is harmless in practice, so it's not worth crafting a different back-patchable fix. Per reports from Kevin Grittner and Andres Freund, using clang static analyzer and Valgrind, respectively.
* Fix error when trying to delete page with half-dead left sibling.Heikki Linnakangas2014-05-25
| | | | | | | | | | | | The new page deletion code didn't cope with the case the target page's right sibling was marked half-dead. It failed a sanity check which checked that the downlinks in the parent page match the lower level, because a half-dead page has no downlink. To cope, check for that condition, and just give up on the deletion if it happens. The vacuum will finish the deletion of the half-dead page when it gets there, and on the next vacuum after that the empty can be deleted. Reported by Jeff Janes.
* Fix backup-block numbering in redo of b-tree split.Heikki Linnakangas2014-05-19
| | | | | | | | | | | | I got the backup block numbers off-by-one in the commit that changed the way incomplete-splits are handled. I blame the comments, which said "backup block 1" and "backup block 2", even though the backup blocks are numbered starting from 0, in the macros and functions used in replay. Fix the comments and the code. Per Jeff Janes' bug report about corruption caused by torn page writes. The incorrect code is new in git master, but backpatch the comment change down to 9.0, where the numbering in the redo-side macros was changed.
* Fix a bunch of functions that were declared static then defined not-static.Tom Lane2014-05-17
| | | | Per testing with a compiler that whines about this.
* Update README, we don't do post-recovery cleanup actions anymore.Heikki Linnakangas2014-05-17
| | | | | | | transam/README explained how B-tree incomplete splits were tracked and fixed after recovery, as an example of handling complex actions that need multiple WAL records, but that's not how it works anymore. Explain the new paradigm.
* Initialize tsId and dbId fields in WAL record of COMMIT PREPARED.Heikki Linnakangas2014-05-16
| | | | | | | | | | | | | | | | Commit dd428c79 added dbId and tsId to the xl_xact_commit struct but missed that prepared transaction commits reuse that struct. Fix that. Because those fields were left unitialized, replaying a commit prepared WAL record in a hot standby node would fail to remove the relcache init file. That can lead to "could not open file" errors on the standby. Relcache init file only needs to be removed when a system table/index is rewritten in the transaction using two phase commit, so that should be rare in practice. In HEAD, the incorrect dbId/tsId values are also used for filtering in logical replication code, causing the transaction to always be filtered out. Analysis and fix by Andres Freund. Backpatch to 9.0 where hot standby was introduced.
* Fix race condition in preparing a transaction for two-phase commit.Heikki Linnakangas2014-05-15
| | | | | | | | | | | | | | | | | | | | To lock a prepared transaction's shared memory entry, we used to mark it with the XID of the backend. When the XID was no longer active according to the proc array, the entry was implicitly considered as not locked anymore. However, when preparing a transaction, the backend's proc array entry was cleared before transfering the locks (and some other state) to the prepared transaction's dummy PGPROC entry, so there was a window where another backend could finish the transaction before it was in fact fully prepared. To fix, rewrite the locking mechanism of global transaction entries. Instead of an XID, just have simple locked-or-not flag in each entry (we store the locking backend's backend id rather than a simple boolean, but that's just for debugging purposes). The backend is responsible for explicitly unlocking the entry, and to make sure that that happens, install a callback to unlock it on abort or process exit. Backpatch to all supported versions.
* Code review for recent changes in relcache.c.Tom Lane2014-05-14
| | | | | | | | | | | | | | | | | | | | | | | rd_replidindex should be managed the same as rd_oidindex, and rd_keyattr and rd_idattr should be managed like rd_indexattr. Omissions in this area meant that the bitmapsets computed for rd_keyattr and rd_idattr would be leaked during any relcache flush, resulting in a slow but permanent leak in CacheMemoryContext. There was also a tiny probability of relcache entry corruption if we ran out of memory at just the wrong point in RelationGetIndexAttrBitmap. Otherwise, the fields were not zeroed where expected, which would not bother the code any AFAICS but could greatly confuse anyone examining the relcache entry while debugging. Also, create an API function RelationGetReplicaIndex rather than letting non-relcache code be intimate with the mechanisms underlying caching of that value (we won't even mention the memory leak there). Also, fix a relcache flush hazard identified by Andres Freund: RelationGetIndexAttrBitmap must not assume that rd_replidindex stays valid across index_open. The aspects of this involving rd_keyattr date back to 9.3, so back-patch those changes.
* Rename min_recovery_apply_delay to recovery_min_apply_delay.Tom Lane2014-05-10
| | | | | | | Per discussion, this seems like a more consistent choice of name. Fabrízio de Royes Mello, after a suggestion by Peter Eisentraut; some additional documentation wordsmithing by me
* Fix bug in lossy-page handling in GINHeikki Linnakangas2014-05-10
| | | | | | | | | | | When returning rows from a bitmap, as done with partial match queries, we would get stuck in an infinite loop if the bitmap contained a lossy page reference. This bug is new in master, it was introduced by the patch to allow skipping items refuted by other entries in GIN scans. Report and fix by Alexander Korotkov
* Remove overeager assertion in logical_heap_begin_rewrite.Robert Haas2014-05-09
| | | | | | | It's legal to configure wal_level=logical and max_replication_slots=0 simultaneously. Andres Freund
* Protect against torn pages when deleting GIN list pages.Heikki Linnakangas2014-05-08
| | | | | | | | | To-be-deleted list pages contain no useful information, as they are being deleted, but we must still protect the writes from being torn by a crash after a partial write. To do that, re-initialize the pages on WAL replay. Jeff Janes caught this with a test program to test partial writes. Backpatch to all supported versions.
* pgindent run for 9.4Bruce Momjian2014-05-06
| | | | | This includes removing tabs after periods in C comments, which was applied to back branches, so this change should not effect backpatching.
* Correct comment in Hot Standby nbtree handlingSimon Riggs2014-05-06
| | | | Logic is correct, matching handling of LP_DEAD elsewhere.
* Assert that pre/post-fix updated tuples are on the same page during replay.Heikki Linnakangas2014-05-05
| | | | | | | | | | | If they were not 'oldtup.t_data' would be dereferenced while set to NULL in case of a full page image for block 0. Do so primarily to silence coverity; but also to make sure this prerequisite isn't changed without adapting the replay routine as that would appear to work in many cases. Andres Freund
* Fix failure to detoast fields in composite elements of structured types.Tom Lane2014-05-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we have an array of records stored on disk, the individual record fields cannot contain out-of-line TOAST pointers: the tuptoaster.c mechanisms are only prepared to deal with TOAST pointers appearing in top-level fields of a stored row. The same applies for ranges over composite types, nested composites, etc. However, the existing code only took care of expanding sub-field TOAST pointers for the case of nested composites, not for other structured types containing composites. For example, given a command such as UPDATE tab SET arraycol = ARRAY[(ROW(x,42)::mycompositetype] ... where x is a direct reference to a field of an on-disk tuple, if that field is long enough to be toasted out-of-line then the TOAST pointer would be inserted as-is into the array column. If the source record for x is later deleted, the array field value would become a dangling pointer, leading to errors along the line of "missing chunk number 0 for toast value ..." when the value is referenced. A reproducible test case for this was provided by Jan Pecek, but it seems likely that some of the "missing chunk number" reports we've heard in the past were caused by similar issues. Code-wise, the problem is that PG_DETOAST_DATUM() is not adequate to produce a self-contained Datum value if the Datum is of composite type. Seen in this light, the problem is not just confined to arrays and ranges, but could also affect some other places where detoasting is done in that way, for example form_index_tuple(). I tried teaching the array code to apply toast_flatten_tuple_attribute() along with PG_DETOAST_DATUM() when the array element type is composite, but this was messy and imposed extra cache lookup costs whether or not any TOAST pointers were present, indeed sometimes when the array element type isn't even composite (since sometimes it takes a typcache lookup to find that out). The idea of extending that approach to all the places that currently use PG_DETOAST_DATUM() wasn't attractive at all. This patch instead solves the problem by decreeing that composite Datum values must not contain any out-of-line TOAST pointers in the first place; that is, we expand out-of-line fields at the point of constructing a composite Datum, not at the point where we're about to insert it into a larger tuple. This rule is applied only to true composite Datums, not to tuples that are being passed around the system as tuples, so it's not as invasive as it might sound at first. With this approach, the amount of code that has to be touched for a full solution is greatly reduced, and added cache lookup costs are avoided except when there actually is a TOAST pointer that needs to be inlined. The main drawback of this approach is that we might sometimes dereference a TOAST pointer that will never actually be used by the query, imposing a rather large cost that wasn't there before. On the other side of the coin, if the field value is used multiple times then we'll come out ahead by avoiding repeat detoastings. Experimentation suggests that common SQL coding patterns are unaffected either way, though. Applications that are very negatively affected could be advised to modify their code to not fetch columns they won't be using. In future, we might consider reverting this solution in favor of detoasting only at the point where data is about to be stored to disk, using some method that can drill down into multiple levels of nested structured types. That will require defining new APIs for structured types, though, so it doesn't seem feasible as a back-patchable fix. Note that this patch changes HeapTupleGetDatum() from a macro to a function call; this means that any third-party code using that macro will not get protection against creating TOAST-pointer-containing Datums until it's recompiled. The same applies to any uses of PG_RETURN_HEAPTUPLEHEADER(). It seems likely that this is not a big problem in practice: most of the tuple-returning functions in core and contrib produce outputs that could not possibly be toasted anyway, and the same probably holds for third-party extensions. This bug has existed since TOAST was invented, so back-patch to all supported branches.