aboutsummaryrefslogtreecommitdiff
path: root/src/backend/utils
Commit message (Collapse)AuthorAge
...
* Fill CommonRdOptions with default values in extract_autovac_opts()Alexander Korotkov2024-04-08
| | | | | | Reported-by: Thomas Munro Reported-by: Pavel Borisov Discussion: https://postgr.es/m/CA%2BhUKGLZzLR50RBvuqOO3MZ%3DF54ETz-rTp1PDX9uDGP_GqyYqA%40mail.gmail.com
* Adjust wording of trace_connection_negotiation GUC's descriptionHeikki Linnakangas2024-04-08
| | | | | | | | | | We're not very consistent about this across all the GUCs, but the "Logs ..." phrasing is more common than "Log ...", and is used by the neighboring "log_connections" and "log_disconnections" GUCs, so switch to that. Author: Kyotaro Horiguchi Discussion: https://www.postgresql.org/message-id/20240408.154010.1170771365226258348.horikyota.ntt@gmail.com
* Custom reloptions for table AMAlexander Korotkov2024-04-08
| | | | | | | | | | | | | | | | | | Let table AM define custom reloptions for its tables. This allows specifying AM-specific parameters by the WITH clause when creating a table. The reloptions, which could be used outside of table AM, are now extracted into the CommonRdOptions data structure. These options could be by decision of table AM directly specified by a user or calculated in some way. The new test module test_tam_options evaluates the ability to set up custom reloptions and calculate fields of CommonRdOptions on their base. The code may use some parts from prior work by Hao Wu. Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com Discussion: https://postgr.es/m/AMUA1wBBBxfc3tKRLLdU64rb.1.1683276279979.Hmail.wuhao%40hashdata.cn Reviewed-by: Reviewed-by: Pavel Borisov, Matthias van de Meent, Jess Davis
* JSON_TABLE: Add support for NESTED paths and columnsAmit Langote2024-04-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A NESTED path allows to extract data from nested levels of JSON objects given by the parent path expression, which are projected as columns specified using a nested COLUMNS clause, just like the parent COLUMNS clause. Rows comprised from a NESTED columns are "joined" to the row comprised from the parent columns. If a particular NESTED path evaluates to 0 rows, then the nested COLUMNS will emit NULLs, making it an OUTER join. NESTED columns themselves may include NESTED paths to allow extracting data from arbitrary nesting levels, which are likewise joined against the rows at the parent level. Multiple NESTED paths at a given level are called "sibling" paths and their rows are combined by UNIONing them, that is, after being joined against the parent row as described above. Author: Nikita Glukhov <n.gluhov@postgrespro.ru> Author: Teodor Sigaev <teodor@sigaev.ru> Author: Oleg Bartunov <obartunov@gmail.com> Author: Alexander Korotkov <aekorotkov@gmail.com> Author: Andrew Dunstan <andrew@dunslane.net> Author: Amit Langote <amitlangote09@gmail.com> Author: Jian He <jian.universality@gmail.com> Reviewers have included (in no particular order): Andres Freund, Alexander Korotkov, Pavel Stehule, Andrew Alsup, Erik Rijkers, Zihong Yu, Himanshu Upadhyaya, Daniel Gustafsson, Justin Pryzby, Álvaro Herrera, Jian He Discussion: https://postgr.es/m/cd0bb935-0158-78a7-08b5-904886deac4b@postgrespro.ru Discussion: https://postgr.es/m/20220616233130.rparivafipt6doj3@alap3.anarazel.de Discussion: https://postgr.es/m/abd9b83b-aa66-f230-3d6d-734817f0995d%40postgresql.org Discussion: https://postgr.es/m/CA+HiwqE4XTdfb1nW=Ojoy_tQSRhYt-q_kb6i5d4xcKyrLC1Nbg@mail.gmail.com
* Fix JsonExpr deparsing to emit QUOTES and WRAPPER correctlyAmit Langote2024-04-08
| | | | | | | | | | | | | | | Currently, get_json_expr_options() does not emit the default values for QUOTES (KEEP QUOTES) and WRAPPER (WITHOUT WRAPPER). That causes the deparsed JSON_TABLE() columns, such as those contained in a a view's query, to behave differently when executed than the original definition. That's because the rules encoded in transformJsonTableColumns() will choose either JSON_VALUE() or JSON_QUERY() as implementation to execute a given column's path expression depending on the QUOTES and WRAPPER specificationd and they have slightly different semantics. Reported-by: Jian He <jian.universality@gmail.com> Discussion: https://postgr.es/m/CACJufxEqhqsfrg_p7EMyo5zak3d767iFDL8vz_4%3DZBHpOtrghw%40mail.gmail.com
* Fix the wording of or_to_any_transform_limit descriptionAlexander Korotkov2024-04-08
| | | | | Reported-by: Kyotaro Horiguchi Discussion: https://postgr.es/m/20240408.144657.1746688590065601661.horikyota.ntt%40gmail.com
* Fix the value of or_to_any_transform_limit in postgresql.conf.sampleAlexander Korotkov2024-04-08
| | | | | Reported-by: Justin Pryzby Discussion: https://postgr.es/m/ZhM8jH8gsKm5Q-9p%40pryzbyj2023
* Enhance libpq encryption negotiation tests with new GUCHeikki Linnakangas2024-04-08
| | | | | | | | | | | | | | | | | | The new "log_connection_negotiation" server option causes the server to print messages to the log when it receives a SSLRequest or GSSENCRequest packet from the client. Together with "log_connections", it gives a trace of how a connection and encryption is negotiatated. Use the option in the libpq_encryption test, to verify in more detail how libpq negotiates encryption with different gssencmode and sslmode options. This revealed a couple of cases where libpq retries encryption or authentication, when it should already know that it cannot succeed. I marked them with XXX comments in the test tables. They only happen when the connection was going to fail anyway, and only with rare combinations of options, so they're not serious. Discussion: https://www.postgresql.org/message-id/CAEze2Wja8VUoZygCepwUeiCrWa4jP316k0mvJrOW4PFmWP0Tcw@mail.gmail.com
* Fix incorrect KeeperBlock macro in bump.cDavid Rowley2024-04-08
| | | | | | | | | | | | | The macro was missing a MAXALIGN around the sizeof(BumpContext) which would cause problems detecting the keeper block on 32-bit systems that have a MAXALIGN value of 8. Thank you to Andres Freund, Tomas Vondra and Tom Lane for investigating and testing. Reported-by: Melanie Plageman, Tomas Vondra Discussion: https://postgr.es/m/CAAKRu_Y6dZjiJEZghgNZp0Gjar1JVq-CH7XGDqExDVHnPgDjuw@mail.gmail.com Discussion: https://postgr.es/m/a4a10b89-6ba8-4abd-b449-019aafff04fc@enterprisedb.com
* Transform OR clauses to ANY expressionAlexander Korotkov2024-04-08
| | | | | | | | | | | | | | | | | | | | | | | | | | Replace (expr op C1) OR (expr op C2) ... with expr op ANY(ARRAY[C1, C2, ...]) on the preliminary stage of optimization when we are still working with the expression tree. Here Cn is a n-th constant expression, 'expr' is non-constant expression, 'op' is an operator which returns boolean result and has a commuter (for the case of reverse order of constant and non-constant parts of the expression, like 'Cn op expr'). Sometimes it can lead to not optimal plan. This is why there is a or_to_any_transform_limit GUC. It specifies a threshold value of length of arguments in an OR expression that triggers the OR-to-ANY transformation. Generally, more groupable OR arguments mean that transformation will be more likely to win than to lose. Discussion: https://postgr.es/m/567ED6CA.2040504%40sigaev.ru Author: Alena Rybakina <lena.ribackina@yandex.ru> Author: Andrey Lepikhov <a.lepikhov@postgrespro.ru> Reviewed-by: Peter Geoghegan <pg@bowt.ie> Reviewed-by: Ranier Vilela <ranier.vf@gmail.com> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Jian He <jian.universality@gmail.com>
* Use conditional variable to wait for next MultiXact offsetAlvaro Herrera2024-04-07
| | | | | | | | | | | | | | | In one multixact.c edge case, we need a mechanism to wait for one multixact offset to be written before being allowed to read the next one. We used to handle this case by sleeping for one millisecond and retrying, but such sleeps have been reported as problematic in production cases. We can avoid the problem by using a condition variable: readers sleep on it and then every creator of multixacts broadcasts into the CV when creation is sufficiently far along. Author: Kyotaro Horiguchi <horikyotajntt@gmail.com> Reviewed-by: Andrey Borodin <amborodin@acm.org> Discussion: https://postgr.es/m/47A598F4-B4E7-4029-8FEC-A06A6C3CB4B5@yandex-team.ru Discussion: https://postgr.es/m/20200515.090333.24867479329066911.horikyota.ntt
* Use bump memory context for tuplesortsDavid Rowley2024-04-08
| | | | | | | | | | | | | | | | 29f6a959c added a bump allocator type for efficient compact allocations. Here we make use of this for non-bounded tuplesorts to store tuples. This is very space efficient when storing narrow tuples due to bump.c not having chunk headers. This means we can fit more tuples in work_mem before spilling to disk, or perform an in-memory sort touching fewer cacheline. Author: David Rowley Reviewed-by: Nathan Bossart Reviewed-by: Matthias van de Meent Reviewed-by: Tomas Vondra Reviewed-by: John Naylor Discussion: https://postgr.es/m/CAApHDvqGSpCU95TmM=Bp=6xjL_nLys4zdZOpfNyWBk97Xrdj2w@mail.gmail.com
* Introduce a bump memory allocatorDavid Rowley2024-04-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This introduces a bump MemoryContext type. The bump context is best suited for short-lived memory contexts which require only allocations of memory and never a pfree or repalloc, which are unsupported. Memory palloc'd into a bump context has no chunk header. This makes bump a useful context type when lots of small allocations need to be done without any need to pfree those allocations. Allocation sizes are rounded up to the next MAXALIGN boundary, so with this and no chunk header, allocations are very compact indeed. Allocations are also very fast as bump does not check any freelists to try and make use of previously free'd chunks. It just checks if there is enough room on the current block, and if so it bumps the freeptr beyond this chunk and returns the value that the freeptr was previously pointing to. Simple and fast. A new block is malloc'd when there's not enough space in the current block. Code using the bump allocator must take care never to call any functions which could try to call realloc() (or variants), pfree(), GetMemoryChunkContext() or GetMemoryChunkSpace() on a bump allocated chunk. Due to lack of chunk headers, these operations are unsupported. To increase the chances of catching such issues, when compiled with MEMORY_CONTEXT_CHECKING, bump allocated chunks are given a header and any attempt to perform an unsupported operation will result in an ERROR. Without MEMORY_CONTEXT_CHECKING, code attempting an unsupported operation could result in a segfault. A follow-on commit will implement the first user of bump. Author: David Rowley Reviewed-by: Nathan Bossart Reviewed-by: Matthias van de Meent Reviewed-by: Tomas Vondra Reviewed-by: John Naylor Discussion: https://postgr.es/m/CAApHDvqGSpCU95TmM=Bp=6xjL_nLys4zdZOpfNyWBk97Xrdj2w@mail.gmail.com
* Enlarge bit-space for MemoryContextMethodIDDavid Rowley2024-04-07
| | | | | | | | | | | | | | | | Reserve 4 bits for MemoryContextMethodID rather than 3. 3 bits did technically allow a maximum of 8 memory context types, however, we've opted to reserve some bit patterns which left us with only 4 slots, all of which were used. Here we add another bit which frees up 8 slots for future memory context types. In passing, adjust the enum names in MemoryContextMethodID to make it more clear which ones can be used and which ones are reserved. Author: Matthias van de Meent, David Rowley Discussion: https://postgr.es/m/CAApHDvqGSpCU95TmM=Bp=6xjL_nLys4zdZOpfNyWBk97Xrdj2w@mail.gmail.com
* Implement ALTER TABLE ... SPLIT PARTITION ... commandAlexander Korotkov2024-04-07
| | | | | | | | | | | | | | | | | | | | This new DDL command splits a single partition into several parititions. Just like ALTER TABLE ... MERGE PARTITIONS ... command, new patitions are created using createPartitionTable() function with parent partition as the template. This commit comprises quite naive implementation which works in single process and holds the ACCESS EXCLUSIVE LOCK on the parent table during all the operations including the tuple routing. This is why this new DDL command can't be recommended for large partitioned tables under a high load. However, this implementation come in handy in certain cases even as is. Also, it could be used as a foundation for future implementations with lesser locking and possibly parallel. Discussion: https://postgr.es/m/c73a1746-0cd0-6bdd-6b23-3ae0b7c0c582%40postgrespro.ru Author: Dmitry Koval Reviewed-by: Matthias van de Meent, Laurenz Albe, Zhihong Yu, Justin Pryzby Reviewed-by: Alvaro Herrera, Robert Haas, Stephane Tachoires
* Use an LWLock instead of a spinlock in waitlsn.cAlexander Korotkov2024-04-07
| | | | | | | This should prevent busy-waiting when number of waiting processes is high. Discussion: https://postgr.es/m/202404030658.hhj3vfxeyhft%40alvherre.pgsql Author: Alvaro Herrera
* Enhance nbtree ScalarArrayOp execution.Peter Geoghegan2024-04-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals natively. This works by pushing down the full context (the array keys) to the nbtree index AM, enabling it to execute multiple primitive index scans that the planner treats as one continuous index scan/index path. This earlier enhancement enabled nbtree ScalarArrayOp index-only scans. It also allowed scans with ScalarArrayOp quals to return ordered results (with some notable restrictions, described further down). Take this general approach a lot further: teach nbtree SAOP index scans to decide how to execute ScalarArrayOp scans (when and where to start the next primitive index scan) based on physical index characteristics. This can be far more efficient. All SAOP scans will now reliably avoid duplicative leaf page accesses (just like any other nbtree index scan). SAOP scans whose array keys are naturally clustered together now require far fewer index descents, since we'll reliably avoid starting a new primitive scan just to get to a later offset from the same leaf page. The scan's arrays now advance using binary searches for the array element that best matches the next tuple's attribute value. Required scan key arrays (i.e. arrays from scan keys that can terminate the scan) ratchet forward in lockstep with the index scan. Non-required arrays (i.e. arrays from scan keys that can only exclude non-matching tuples) "advance" without the process ever rolling over to a higher-order array. Naturally, only required SAOP scan keys trigger skipping over leaf pages (non-required arrays cannot safely end or start primitive index scans). Consequently, even index scans of a composite index with a high-order inequality scan key (which we'll mark required) and a low-order SAOP scan key (which we won't mark required) now avoid repeating leaf page accesses -- that benefit isn't limited to simpler equality-only cases. In general, all nbtree index scans now output tuples as if they were one continuous index scan -- even scans that mix a high-order inequality with lower-order SAOP equalities reliably output tuples in index order. This allows us to remove a couple of special cases that were applied when building index paths with SAOP clauses during planning. Bugfix commit 807a40c5 taught the planner to avoid generating unsafe path keys: path keys on a multicolumn index path, with a SAOP clause on any attribute beyond the first/most significant attribute. These cases are now all safe, so we go back to generating path keys without regard for the presence of SAOP clauses (just like with any other clause type). Affected queries can now exploit scan output order in all the usual ways (e.g., certain "ORDER BY ... LIMIT n" queries can now terminate early). Also undo changes from follow-up bugfix commit a4523c5a, which taught the planner to produce alternative index paths, with path keys, but without low-order SAOP index quals (filter quals were used instead). We'll no longer generate these alternative paths, since they can no longer offer any meaningful advantages over standard index qual paths. Affected queries thereby avoid all of the disadvantages that come from using filter quals within index scan nodes. They can avoid extra heap page accesses from using filter quals to exclude non-matching tuples (index quals will never have that problem). They can also skip over irrelevant sections of the index in more cases (though only when nbtree determines that starting another primitive scan actually makes sense). There is a theoretical risk that removing restrictions on SAOP index paths from the planner will break compatibility with amcanorder-based index AMs maintained as extensions. Such an index AM could have the same limitations around ordered SAOP scans as nbtree had up until now. Adding a pro forma incompatibility item about the issue to the Postgres 17 release notes seems like a good idea. Author: Peter Geoghegan <pg@bowt.ie> Author: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-By: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com
* Increase default vacuum_buffer_usage_limit to 2MB.Thomas Munro2024-04-06
| | | | | | | | | | | | | | | | | | The BAS_VACUUM ring size has been 256kB since commit d526575f introduced the mechanism 17 years ago. Commit 1cbbee03 recently made it configurable but retained the traditional default. The correct default size has been debated for years, but 256kB is certainly very small. VACUUM soon needs to write back data it dirtied only 32 blocks ago, which usually requires flushing the WAL. New experiments in prefetching pages for VACUUM exacerbated the problem by crashing into dirty data even sooner. Let's make the default 2MB. That's 1.6% of the default toy buffer pool size, and 0.2% of 1GB, which would be a considered a small shared_buffers setting for a real system these days. Users are still free to set the GUC to a different value. Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20240403221257.md4gfki3z75cdyf6%40awork3.anarazel.de Discussion: https://postgr.es/m/CA%2BhUKGLY4Q4ZY4f1rvnFtv6%2BPkjNf8MejdPkcju3Qii9DYqqcQ%40mail.gmail.com
* Add "ABI_compatibility" regions to wait_event_names.txtMichael Paquier2024-04-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current design behind the automatic generation of the C code and documentation related to wait events introduced in fa88928470b5 does not offer a way to attach new wait events without breaking ABI compatibility, as all the events are forcibly reordered for each section in the input file wait_event_names.txt. Adding new wait events to stable branches is something that has happened in the past, 0b6517a3b79a being a recent example of that with VERSION_FILE_SYNC, so we need a way to generate any C code for wait events while maintaining compatibility on stable branches already released. This commit solves this issue by adding a new region called "ABI_compatibility" (keyword could be updated to something else if someone had a better idea) to each section of wait_event_names.txt, so as one can add new wait events to stable branches in wait_event_names.txt while keeping the code ABI-compatible. "ABI_compatibility" has no impact on the documentation generated: all the wait events of one section are still alphabetically ordered. LWLock and Lock sections generate their C code elsewhere, so they do not need an "ABI_compatibility" region. For example, let's imagine a wait_event_names.txt like that: Section: ClassName - Foo FOO_1 "Waiting in Foo 1" FOO_2 "Waiting in Foo 2" ABI_compatibility: NEW_FOO_1 "Waiting in New Foo 1" NEW_BAR_1 "Waiting in New Bar 1" This results in the following enum, where the events in the ABI region are listed last with the same ordering as in wait_event_names.txt: typedef enum { WAIT_EVENT_FOO_1, WAIT_EVENT_FOO_2, WAIT_EVENT_NEW_FOO_1, WAIT_EVENT_NEW_BAR_1 } WaitEventFoo; New wait events added in stable branches should be added at the end of each ABI_compatibility region, and ABI_compatibility should remain empty on HEAD and unreleased stable branches. This design has been suggested by Noah Misch and me. Reported-by: Noah Misch Author: Bertrand Drouvot Reviewed-by: Michael Paquier Discussion: https://postgr.es/m/20240317183114.16@rfd.leadboat.com
* Add basic JSON_TABLE() functionalityAmit Langote2024-04-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | JSON_TABLE() allows JSON data to be converted into a relational view and thus used, for example, in a FROM clause, like other tabular data. Data to show in the view is selected from a source JSON object using a JSON path expression to get a sequence of JSON objects that's called a "row pattern", which becomes the source to compute the SQL/JSON values that populate the view's output columns. Column values themselves are computed using JSON path expressions applied to each of the JSON objects comprising the "row pattern", for which the SQL/JSON query functions added in 6185c9737cf4 are used. To implement JSON_TABLE() as a table function, this augments the TableFunc and TableFuncScanState nodes that are currently used to support XMLTABLE() with some JSON_TABLE()-specific fields. Note that the JSON_TABLE() spec includes NESTED COLUMNS and PLAN clauses, which are required to provide more flexibility to extract data out of nested JSON objects, but they are not implemented here to keep this commit of manageable size. Author: Nikita Glukhov <n.gluhov@postgrespro.ru> Author: Teodor Sigaev <teodor@sigaev.ru> Author: Oleg Bartunov <obartunov@gmail.com> Author: Alexander Korotkov <aekorotkov@gmail.com> Author: Andrew Dunstan <andrew@dunslane.net> Author: Amit Langote <amitlangote09@gmail.com> Author: Jian He <jian.universality@gmail.com> Reviewers have included (in no particular order): Andres Freund, Alexander Korotkov, Pavel Stehule, Andrew Alsup, Erik Rijkers, Zihong Yu, Himanshu Upadhyaya, Daniel Gustafsson, Justin Pryzby, Álvaro Herrera, Jian He Discussion: https://postgr.es/m/cd0bb935-0158-78a7-08b5-904886deac4b@postgrespro.ru Discussion: https://postgr.es/m/20220616233130.rparivafipt6doj3@alap3.anarazel.de Discussion: https://postgr.es/m/abd9b83b-aa66-f230-3d6d-734817f0995d%40postgresql.org Discussion: https://postgr.es/m/CA+HiwqE4XTdfb1nW=Ojoy_tQSRhYt-q_kb6i5d4xcKyrLC1Nbg@mail.gmail.com
* Coordinate emit_log_hook and all log destinations to share the same timevalMichael Paquier2024-04-04
| | | | | | | | | | | | | | | | | | This would cause the timestamp values used by emit_log_hook and all the other log destinations to differ, because the timestamps are reset before sending the logs to the server and after calling the hook. This change matters for emit_log_hook when generating log information with 'n' or 'm' in log_line_prefix through log_status_format(), or when doing direct calls to get_formatted_log_time() like in the JSON or CSV log formats. While on it, this commit fixes a couple of comments related to the formatted timestamps where the JSON was not mentioned. Oversight in dc686681e079, that I have noticed while reviewing this patch. Author: Kambam Vinay, Michael Paquier Discussion: https://postgr.es/m/CANiRfmsK36A0i8mnQtzaxhSm3CUCimPwJPp4WQNq53OdSNkgWg@mail.gmail.com
* Add error codes to some PANIC/FATAL errors reportsDaniel Gustafsson2024-04-03
| | | | | | | | | | | This adds errcodes to a set of PANIC and FATAL errors in xlog.c and relcache.c, which previously had no errcode at all set, in order to make fleetwide analysis of errorlogs easier. There are many more ereport/elogs left which could benefit from having an errcode but this at least makes a dent in the issue. Author: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/CAN55FZ1k8LgLEqncPGmz_fWnrobV6bjABOTH4tOWta6xNcPQig@mail.gmail.com
* Implement pg_wal_replay_wait() stored procedureAlexander Korotkov2024-04-02
| | | | | | | | | | | | | | | | | | | | | | | | pg_wal_replay_wait() is to be used on standby and specifies waiting for the specific WAL location to be replayed before starting the transaction. This option is useful when the user makes some data changes on primary and needs a guarantee to see these changes on standby. The queue of waiters is stored in the shared memory array sorted by LSN. During replay of WAL waiters whose LSNs are already replayed are deleted from the shared memory array and woken up by setting of their latches. pg_wal_replay_wait() needs to wait without any snapshot held. Otherwise, the snapshot could prevent the replay of WAL records implying a kind of self-deadlock. This is why it is only possible to implement pg_wal_replay_wait() as a procedure working in a non-atomic context, not a function. Catversion is bumped. Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru Author: Kartyshov Ivan, Alexander Korotkov Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
* Provide vectored variant of ReadBuffer().Thomas Munro2024-04-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Break ReadBuffer() up into two steps. StartReadBuffers() and WaitReadBuffers() give us two main advantages: 1. Multiple consecutive blocks can be read with one system call. 2. Advice (hints of future reads) can optionally be issued to the kernel ahead of time. The traditional ReadBuffer() function is now implemented in terms of those functions, to avoid duplication. A new GUC io_combine_limit is defined, and the functions for limiting per-backend pin counts are made into public APIs. Those are provided for use by callers of StartReadBuffers(), when deciding how many buffers to read at once. The following commit will add a higher level mechanism for doing that automatically with a practical interface. With some more infrastructure in later work, StartReadBuffers() could be extended to start real asynchronous I/O instead of just issuing advice and leaving WaitReadBuffers() to do the work synchronously. Author: Thomas Munro <thomas.munro@gmail.com> Author: Andres Freund <andres@anarazel.de> (some optimization tweaks) Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Tested-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
* Revert "Custom reloptions for table AM"Alexander Korotkov2024-04-02
| | | | | | | | This reverts commit c95c25f9af4bc77f2f66a587735c50da08c12b37 due to multiple design issues spotted after commit. Reported-by: Jeff Davis Discussion: https://postgr.es/m/11550b536211d5748bb2865ed6cb3502ff073bf7.camel%40j-davis.com
* Use TidStore for dead tuple TIDs storage during lazy vacuum.Masahiko Sawada2024-04-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, we used a simple array for storing dead tuple IDs during lazy vacuum, which had a number of problems: * The array used a single allocation and so was limited to 1GB. * The allocation was pessimistically sized according to table size. * Lookup with binary search was slow because of poor CPU cache and branch prediction behavior. This commit replaces that array with the TID store from commit 30e144287a. Since the backing radix tree makes small allocations as needed, the 1GB limit is now gone. Further, the total memory used is now often smaller by an order of magnitude or more, depending on the distribution of blocks and offsets. These two features should make multiple rounds of heap scanning and index cleanup an extremely rare event. TID lookup during index cleanup is also several times faster, even more so when index order is correlated with heap tuple order. Since there is no longer a predictable relationship between the number of dead tuples vacuumed and the space taken up by their TIDs, the number of tuples no longer provides any meaningful insights for users, nor is the maximum number predictable. For that reason this commit also changes to byte-based progress reporting, with the relevant columns of pg_stat_progress_vacuum renamed accordingly to max_dead_tuple_bytes and dead_tuple_bytes. For parallel vacuum, both the TID store and supplemental information specific to vacuum are shared among the parallel vacuum workers. As with the previous array, we don't take any locks on TidStore during parallel vacuum since writes are still only done by the leader process. Bump catalog version. Reviewed-by: John Naylor, (in an earlier version) Dilip Kumar Discussion: https://postgr.es/m/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com
* Custom reloptions for table AMAlexander Korotkov2024-03-30
| | | | | | | | | | | Let table AM define custom reloptions for its tables. This allows to specify AM-specific parameters by WITH clause when creating a table. The code may use some parts from prior work by Hao Wu. Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com Discussion: https://postgr.es/m/AMUA1wBBBxfc3tKRLLdU64rb.1.1683276279979.Hmail.wuhao%40hashdata.cn Reviewed-by: Reviewed-by: Pavel Borisov, Matthias van de Meent
* Add pg_basetype() function to extract a domain's base type.Tom Lane2024-03-30
| | | | | | | | | | | | | This SQL-callable function behaves much like our internal utility function getBaseType(), except it returns NULL rather than failing for an invalid type OID. (That behavior is modeled on our experience with other catalog-inquiry functions such as the ACL checking functions.) The key advantage over doing a join to pg_type is that it will loop as needed to find the bottom base type of a nest of domains. Steve Chavez, reviewed by jian he and others Discussion: https://postgr.es/m/CAGRrpzZSX8j=MQcbCSEisFA=ic=K3bknVfnFjAv1diVJxFHJvg@mail.gmail.com
* Add support for MERGE ... WHEN NOT MATCHED BY SOURCE.Dean Rasheed2024-03-30
| | | | | | | | | | | | | | | | | | | This allows MERGE commands to include WHEN NOT MATCHED BY SOURCE actions, which operate on rows that exist in the target relation, but not in the data source. These actions can execute UPDATE, DELETE, or DO NOTHING sub-commands. This is in contrast to already-supported WHEN NOT MATCHED actions, which operate on rows that exist in the data source, but not in the target relation. To make this distinction clearer, such actions may now be written as WHEN NOT MATCHED BY TARGET. Writing WHEN NOT MATCHED without specifying BY SOURCE or BY TARGET is equivalent to writing WHEN NOT MATCHED BY TARGET. Dean Rasheed, reviewed by Alvaro Herrera, Ted Yu and Vik Fearing. Discussion: https://postgr.es/m/CAEZATCWqnKGc57Y_JanUBHQXNKcXd7r=0R4NEZUVwP+syRkWbA@mail.gmail.com
* Add unicode_strtitle() for Unicode Default Case Conversion.Jeff Davis2024-03-29
| | | | | | | | | | | | | This brings the titlecasing implementation for the builtin provider out of formatting.c and into unicode_case.c, along with unicode_strlower() and unicode_strupper(). Accepts an arbitrary word boundary callback. Simple for now, but can be extended to support the Unicode Default Case Conversion algorithm with full case mapping. Discussion: https://postgr.es/m/3bc653b5d562ae9e2838b11cb696816c328a489a.camel@j-davis.com Reviewed-by: Peter Eisentraut
* Use version for builtin collations.Jeff Davis2024-03-29
| | | | | | | | Given that the version field already exists, there's little reason not to use it. Suggestion from Peter Eisentraut. Discussion: https://postgr.es/m/613c120a-5413-4fa7-a501-6590eae558f8@eisentraut.org Reviewed-by: Peter Eisentraut
* Add allow_alter_system GUC.Robert Haas2024-03-29
| | | | | | | | | | | | | | | | | | This is marked PGC_SIGHUP, so it can only be set in a configuration file, not anywhere else; and it is also marked GUC_DISALLOW_IN_AUTO_FILE, so it can't be set using ALTER SYSTEM. When set to false, the ALTER SYSTEM command is disallowed. There was considerable concern that this would be misinterpreted as a security feature, which it is not, because a determined superuser has various ways of bypassing it. Hence, a lot of work has gone into wordsmithing the documentation, in the hopes of avoiding any such confusion. Jelte Fennemia-Nio and Gabriele Bartolini, with wording suggestions for the documentation from many others. Discussion: http://postgr.es/m/CA%2BVUV5rEKt2%2BCdC_KUaPoihMu%2Bi5ChT4WVNTr4CD5-xXZUfuQw%40mail.gmail.com
* Rename COMPAT_OPTIONS_CLIENT to COMPAT_OPTIONS_OTHER.Robert Haas2024-03-27
| | | | | | | | | | | | The user-facing name is "Other Platforms and Clients", but the internal name seems too focused on clients specifically, especially given the plan to add a new setting to this session that is about platform or deployment model compatibility rather than client compatibility. Jelte Fennema-Nio Discussion: http://postgr.es/m/CAGECzQTfMbDiM6W3av+3weSnHxJvPmuTEcjxVvSt91sQBdOxuQ@mail.gmail.com
* Add functions to generate random numbers in a specified range.Dean Rasheed2024-03-27
| | | | | | | | | | | | | | | | | | | | | | | | This adds 3 new variants of the random() function: random(min integer, max integer) returns integer random(min bigint, max bigint) returns bigint random(min numeric, max numeric) returns numeric Each returns a random number x in the range min <= x <= max. For the numeric function, the number of digits after the decimal point is equal to the number of digits that "min" or "max" has after the decimal point, whichever has more. The main entry points for these functions are in a new C source file. The existing random(), random_normal(), and setseed() functions are moved there too, so that they can all share the same PRNG state, which is kept private to that file. Dean Rasheed, reviewed by Jian He, David Zhang, Aleksander Alekseev, and Tomas Vondra. Discussion: https://postgr.es/m/CAEZATCV89Vxuq93xQdmc0t-0Y2zeeNQTdsjbmV7dyFBPykbV4Q@mail.gmail.com
* Allow specifying initial and maximum segment sizes for DSA.Masahiko Sawada2024-03-27
| | | | | | | | | | | | | | Previously, the DSA segment size always started with 1MB and grew up to DSA_MAX_SEGMENT_SIZE. It was inconvenient in certain scenarios, such as when the caller desired a soft constraint on the total DSA segment size, limiting it to less than 1MB. This commit introduces the capability to specify the initial and maximum DSA segment sizes when creating a DSA area, providing more flexibility and control over memory usage. Reviewed-by: John Naylor, Tomas Vondra Discussion: https://postgr.es/m/CAD21AoAYGGC1ePjVX0H%2Bpp9rH%3D9vuPK19nNOiu12NprdV5TVJA%40mail.gmail.com
* Optimize roles_is_member_of() with a Bloom filter.Nathan Bossart2024-03-26
| | | | | | | | | | | When the list of roles gathered by roles_is_member_of() grows very large, a Bloom filter is created to help avoid some linear searches through the list. The threshold for creating the Bloom filter is set arbitrarily high and may require future adjustment. Suggested-by: Tom Lane Reviewed-by: Tom Lane Discussion: https://postgr.es/m/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz%3DvggErdSG7pv2s6vmmTOLJSg%40mail.gmail.com
* C comment: mention no doc for negative start of substring(text)Bruce Momjian2024-03-26
| | | | | | Also add URL to hackers discussion. Backpatch-through: master
* Allow "make check"-style testing to work with musl C library.Tom Lane2024-03-26
| | | | | | | | | | | | | | | | | | | | | The musl dynamic linker saves a pointer to the process' environment value of LD_LIBRARY_PATH very early in startup. When we move/clobber the environment to make more room for ps status strings, we clobber that value and thereby prevent libraries from being found via LD_LIBRARY_PATH, which breaks the use of a temporary installation for testing purposes. To fix, stop collecting usable space for ps status if we notice that the variable we are about to clobber is LD_LIBRARY_PATH. This will result in some reduction in how long the ps status can be, but it's only likely to occur in temporary test contexts, so it doesn't seem like a big problem. In any case, we don't have to do it if we see we are on glibc, which surely is where the majority of our Linux testing is done. Thomas Munro, Bruce Momjian, and Tom Lane, per report from Wolfgang Walther. Back-patch to all supported branches, with the hope that we'll set up a buildfarm animal to test on this platform. Discussion: https://postgr.es/m/fddd1cd6-dc16-40a2-9eb5-d7fef2101488@technowledgy.de
* Clarify comment for LogicalTapeSetBlocks().Jeff Davis2024-03-25
| | | | | Discussion: https://postgr.es/m/1229327.1711160246@sss.pgh.pa.us Backpatch-through: 13
* Allow specifying an access method for partitioned tablesAlvaro Herrera2024-03-25
| | | | | | | | | | | | | | | | | | | | | | | | | It's now possible to specify a table access method via CREATE TABLE ... USING for a partitioned table, as well change it with ALTER TABLE ... SET ACCESS METHOD. Specifying an AM for a partitioned table lets the value be used for all future partitions created under it, closely mirroring the behavior of the TABLESPACE option for partitioned tables. Existing partitions are not modified. For a partitioned table with no AM specified, any new partitions are created with the default_table_access_method. Also add ALTER TABLE ... SET ACCESS METHOD DEFAULT, which reverts to the original state of using the default for new partitions. The relcache of partitioned tables is not changed: rd_tableam is not set, even if a partitioned table has a relam set. Author: Justin Pryzby <pryzby@telsasoft.com> Author: Soumyadeep Chakraborty <soumyadeep2007@gmail.com> Author: Michaël Paquier <michael@paquier.xyz> Reviewed-by: The authors themselves Discussion: https://postgr.es/m/CAE-ML+9zM4wJCGCBGv01k96qQ3gFv4WFcFy=zqPHKeaEFwwv6A@mail.gmail.com Discussion: https://postgr.es/m/20210308010707.GA29832%40telsasoft.com
* Allow more cases to pass the unsafe-use-of-new-enum-value restriction.Tom Lane2024-03-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Up to now we've rejected cases like BEGIN; CREATE TYPE rainbow AS ENUM (); ALTER TYPE rainbow ADD VALUE 'red'; -- use the value 'red', perhaps in a constraint or index COMMIT; The concern is that the uncommitted enum value 'red' might get into an index and then break the index if we roll back the ALTER ADD. If the ALTER is in the same transaction as the CREATE then it's really perfectly safe, but we weren't taking the trouble to identify that. pg_dump in binary-upgrade mode will emit enum definitions that look like the above, which up to now didn't fall foul of the unsafe-usage check because we processed each restore command as a separate transaction. However an upcoming patch proposes to bundle the restore commands into large transactions to reduce XID consumption during pg_upgrade, and that makes this behavior a problem. To fix, remember the OIDs of enum types created in the current transaction, and allow use of enum values that are added to one later in the same transaction. To do this fully correctly in the presence of subtransactions, we'd have to track subtransaction nesting level of the CREATE and do maintenance work at every subsequent subtransaction exit. That seems expensive, and we don't need it to satisfy pg_dump's usage. Hence, apply the additional optimization only when the CREATE and ALTER are at outermost transaction level. Patch by me, reviewed by Andrew Dunstan Discussion: https://postgr.es/m/1548468.1711220438@sss.pgh.pa.us
* Add temporal FOREIGN KEY contraintsPeter Eisentraut2024-03-24
| | | | | | | | | | | | | | | | | | Add PERIOD clause to foreign key constraint definitions. This is supported for range and multirange types. Temporal foreign keys check for range containment instead of equality. This feature matches the behavior of the SQL standard temporal foreign keys, but it works on PostgreSQL's native ranges instead of SQL's "periods", which don't exist in PostgreSQL (yet). Reference actions ON {UPDATE,DELETE} {CASCADE,SET NULL,SET DEFAULT} are not supported yet. Author: Paul A. Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: jian he <jian.universality@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com
* Revert "Add notBefore and notAfter to SSL cert info display"Daniel Gustafsson2024-03-22
| | | | | | | | | | This reverts commit 6acb0a628eccab8764e0306582c2b7e2a1441b9b since LibreSSL didn't support ASN1_TIME_diff until OpenBSD 7.1, leaving the older OpenBSD animals in the buildfarm complaining. Per plover in the buildfarm. Discussion: https://postgr.es/m/F0DF7102-192D-4C21-96AE-9A01AE153AD1@yesql.se
* Use a hash table for catcache.c's CatCList objects.Tom Lane2024-03-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | Up to now, all of the "catcache list" objects within a catalog cache were just chained together on a single dlist, requiring O(N) time to search. Remarkably, we've not had serious performance problems with that so far; but we got a complaint of a bad performance regression from v15 in a case with a large number of roles in the system, which traced down to O(N^2) total time when we probed N catcache lists. Replace that data structure with a hashtable having an enlargeable number of dlists, in an exactly parallel way to the data structure we've used for years for the plain CatCTup cache members. The extra cost of maintaining a hash table seems negligible, since we were already computing a hash value for list searches. Normally this'd be HEAD-only material, but in view of the performance regression it seems advisable to back-patch into v16. In the v16 version of the patch, leave the dead cc_lists field where it is and add the new fields at the end of struct catcache, to avoid possible ABI breakage in case any external code is looking at these structs. (We assume no external code is actually allocating new catcache structs.) Per report from alex work. Discussion: https://postgr.es/m/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz=vggErdSG7pv2s6vmmTOLJSg@mail.gmail.com
* Add notBefore and notAfter to SSL cert info displayDaniel Gustafsson2024-03-22
| | | | | | | | | | | | | | | | | | This adds the X509 attributes notBefore and notAfter to sslinfo as well as pg_stat_ssl to allow verifying and identifying the validity period of the current client certificate. OpenSSL has APIs for extracting notAfter and notBefore, but they are only supported in recent versions so we have to calculate the dates by hand in order to make this work for the older versions of OpenSSL that we still support. Original patch by Cary Huang with additional hacking by Jacob and myself. Author: Cary Huang <cary.huang@highgo.ca> Co-author: Jacob Champion <jacob.champion@enterprisedb.com> Co-author: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/182b8565486.10af1a86f158715.2387262617218380588@highgo.ca
* Add TupleTableSlotOps.is_current_xact_tuple() methodAlexander Korotkov2024-03-21
| | | | | | | | | | This allows us to abstract how/whether table AM uses transaction identifiers. A custom table AM can use a custom slot, which may not store xmin directly, but determine the tuple belonging to the current transaction in the other way. Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com Reviewed-by: Matthias van de Meent, Mark Dilger, Pavel Borisov Reviewed-by: Nikita Malakhov, Japin Li
* Allow table AM to store complex data structures in rd_amcacheAlexander Korotkov2024-03-21
| | | | | | | | | | | The new table AM method free_rd_amcache is responsible for freeing all the memory related to rd_amcache and setting free_rd_amcache to NULL. If the new method is not specified, we still assume rd_amcache to be a single chunk of memory, which could be just pfree'd. Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com Reviewed-by: Matthias van de Meent, Mark Dilger, Pavel Borisov Reviewed-by: Nikita Malakhov, Japin Li
* Add SQL/JSON query functionsAmit Langote2024-03-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This introduces the following SQL/JSON functions for querying JSON data using jsonpath expressions: JSON_EXISTS(), which can be used to apply a jsonpath expression to a JSON value to check if it yields any values. JSON_QUERY(), which can be used to to apply a jsonpath expression to a JSON value to get a JSON object, an array, or a string. There are various options to control whether multi-value result uses array wrappers and whether the singleton scalar strings are quoted or not. JSON_VALUE(), which can be used to apply a jsonpath expression to a JSON value to return a single scalar value, producing an error if it multiple values are matched. Both JSON_VALUE() and JSON_QUERY() functions have options for handling EMPTY and ERROR conditions, which can be used to specify the behavior when no values are matched and when an error occurs during jsonpath evaluation, respectively. Author: Nikita Glukhov <n.gluhov@postgrespro.ru> Author: Teodor Sigaev <teodor@sigaev.ru> Author: Oleg Bartunov <obartunov@gmail.com> Author: Alexander Korotkov <aekorotkov@gmail.com> Author: Andrew Dunstan <andrew@dunslane.net> Author: Amit Langote <amitlangote09@gmail.com> Author: Peter Eisentraut <peter@eisentraut.org> Author: Jian He <jian.universality@gmail.com> Reviewers have included (in no particular order): Andres Freund, Alexander Korotkov, Pavel Stehule, Andrew Alsup, Erik Rijkers, Zihong Yu, Himanshu Upadhyaya, Daniel Gustafsson, Justin Pryzby, Álvaro Herrera, Jian He, Anton A. Melnikov, Nikita Malakhov, Peter Eisentraut, Tomas Vondra Discussion: https://postgr.es/m/cd0bb935-0158-78a7-08b5-904886deac4b@postgrespro.ru Discussion: https://postgr.es/m/20220616233130.rparivafipt6doj3@alap3.anarazel.de Discussion: https://postgr.es/m/abd9b83b-aa66-f230-3d6d-734817f0995d%40postgresql.org Discussion: https://postgr.es/m/CA+HiwqHROpf9e644D8BRqYvaAPmgBZVup-xKMDPk-nd4EpgzHw@mail.gmail.com Discussion: https://postgr.es/m/CA+HiwqE4XTdfb1nW=Ojoy_tQSRhYt-q_kb6i5d4xcKyrLC1Nbg@mail.gmail.com
* Add to_regtypemod function to extract typemod from a string type name.Tom Lane2024-03-20
| | | | | | | | | | | | | | | | | | | | | | | In combination with to_regtype, this allows converting a string to the "canonicalized" form emitted by format_type. That usage requires parsing the string twice, which is slightly annoying but not really too expensive. We considered alternatives such as returning a record type, but that way was notationally uglier than this, and possibly less flexible. Like to_regtype(), we'd rather that this return NULL for any bad input, but the underlying type-parsing logic isn't yet capable of not throwing syntax errors. Adjust the documentation for both functions to point that out. In passing, fix up a couple of nearby entries in the System Catalog Information Functions table that had not gotten the word about our since-v13 convention for displaying function usage examples. David Wheeler and Erik Wienhold, reviewed by Pavel Stehule, Jim Jones, and others. Discussion: https://postgr.es/m/DF2324CA-2673-4ABE-B382-26B5770B6AA3@justatheory.com
* Avoid overflow in MaybeRemoveOldWalSummaries().Nathan Bossart2024-03-20
| | | | | | | | | | This commit limits the maximum value of wal_summary_keep_time to INT_MAX / SECS_PER_MINUTE to avoid overflow when it is converted to seconds. In passing, use the HOURS_PER_DAY, MINS_PER_HOUR, and SECS_PER_MINUTE macros in the code for this GUC instead of hard- coding those values. Discussion: https://postgr.es/m/20240314210010.GA3056455%40nathanxps13