path: root/src/include
* Avoid using a C++ keyword in header file  (Peter Eisentraut, 2016-10-26)
| | | | per cpluspluscheck
* Fix typos in comments.  (Heikki Linnakangas, 2016-10-26)
| | | | Vinayak Pokale
* Allow pg_basebackup to stream transaction log in tar mode  (Magnus Hagander, 2016-10-23)
| | | | | | | | | | | | | | | This will write the received transaction log into a file called pg_wal.tar(.gz) next to the other tarfiles instead of writing it to base.tar. When using fetch mode, the transaction log is still written to base.tar like before, and when used against a pre-10 server, the file is named pg_xlog.tar. To do this, implement a new concept of a "walmethod", which is responsible for writing the WAL. Two implementations exist, one that writes to a plain directory (which is also used by pg_receivexlog) and one that writes to a tar file with optional compression. Reviewed by Michael Paquier
* Rename "pg_xlog" directory to "pg_wal".  (Robert Haas, 2016-10-20)
    "xlog" is not a particularly clear abbreviation for "write-ahead log", and it
    sometimes confuses users into believing that the contents of the "pg_xlog"
    directory are not critical data, leading to unpleasant consequences. So,
    rename the directory to "pg_wal".

    This patch modifies pg_upgrade and pg_basebackup to understand both the old
    and new directory layouts; the former is necessary given the purpose of the
    tool, while the latter merely avoids an unnecessary backward-compatibility
    break.

    We may wish to consider renaming other programs, switches, and functions
    which still use the old "xlog" naming to also refer to "wal". However, that's
    still under discussion, so let's do just this much for now.

    Discussion: CAB7nPqTeC-8+zux8_-4ZD46V7YPwooeFxgndfsq5Rg8ibLVm1A@mail.gmail.com
    Michael Paquier
* Fix a few typos in simplehash.h.  (Andres Freund, 2016-10-18)
| | | | | Author: Erik Rijkers Discussion: <274e4c8ac545d6622735f97c1f6c354b@xs4all.nl>
* Fix typo in comment.  (Robert Haas, 2016-10-18)
| | | | Amit Langote
* Revert "Replace PostmasterRandom() with a stronger way of generating randomness."  (Heikki Linnakangas, 2016-10-18)
    This reverts commit 9e083fd4683294f41544e6d0d72f6e258ff3a77c. That was a few
    bricks shy of a load:

    * Query cancel stopped working
    * Buildfarm member pademelon stopped working, because the box doesn't have
      /dev/urandom nor /dev/random.

    This clearly needs some more discussion, and a quite different patch, so
    revert for now.
* Replace PostmasterRandom() with a stronger way of generating randomness.  (Heikki Linnakangas, 2016-10-17)
| | | | | | | | | | | | | | | | | | | | This adds a new routine, pg_strong_random() for generating random bytes, for use in both frontend and backend. At the moment, it's only used in the backend, but the upcoming SCRAM authentication patches need strong random numbers in libpq as well. pg_strong_random() is based on, and replaces, the existing implementation in pgcrypto. It can acquire strong random numbers from a number of sources, depending on what's available: - OpenSSL RAND_bytes(), if built with OpenSSL - On Windows, the native cryptographic functions are used - /dev/urandom - /dev/random Original patch by Magnus Hagander, with further work by Michael Paquier and me. Discussion: <CAB7nPqRy3krN8quR9XujMVVHYtXJ0_60nqgVc6oUk8ygyVkZsA@mail.gmail.com>
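For context, here is a minimal sketch of how a caller might use the new routine. The declaring header and exact signature are assumptions inferred from the commit message (and note that the revert entry earlier in this log undid this commit for the time being).

```c
/*
 * Hypothetical usage sketch for pg_strong_random(); the signature
 * (buffer pointer plus length, returning bool) is an assumption based on
 * the commit text, not verified source.
 */
#include "postgres.h"

static void
generate_session_salt(char *salt, int len)
{
    /* assumed to return false if no entropy source could be used */
    if (!pg_strong_random(salt, len))
        ereport(ERROR,
                (errmsg("could not generate random salt")));
}
```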
* Use more efficient hashtable for execGrouping.c to speed up hash aggregation.  (Andres Freund, 2016-10-14)
    The more efficient hashtable speeds up hash-aggregations with more than a few
    hundred groups significantly. Improvements of over 120% have been measured.

    Due to the different hash table, queries whose result order is not fully
    determined (e.g. GROUP BY without ORDER BY) may change their result order.

    The conversion is largely straight-forward, except that, due to the static
    element types of simplehash.h type hashes, the additional data some users
    store in elements (e.g. the per-group working data for hash aggregates) is
    now stored in TupleHashEntryData->additional. The meaning of
    BuildTupleHashTable's entrysize (renamed to additionalsize) has been changed
    to only be about the additionally stored size. That size is only used for the
    initial sizing of the hash-table.

    Reviewed-By: Tomas Vondra
    Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
* Add a macro-templatized hashtable.  (Andres Freund, 2016-10-14)
    dynahash.c hash tables aren't quite fast enough for some use-cases. There are
    several reasons for the lacking performance:
    - the use of chaining for collision handling makes them cache inefficient;
      that's especially an issue when the tables get bigger
    - as the element sizes for dynahash are only determined at runtime, offset
      computations are somewhat expensive
    - hash and element comparisons are indirect function calls, causing
      unnecessary pipeline stalls
    - its two-level structure has some benefits (somewhat natural partitioning),
      but increases the number of indirections

    To fix several of these, the hash tables have to be adjusted to the
    individual use-case at compile-time. C unfortunately doesn't provide a good
    way to do compile-time code generation (like e.g. C++'s templates, for all
    their weaknesses, do). Thus the somewhat ugly approach taken here is to allow
    for code generation using a macro-templatized header file, which generates
    functions and types based on a prefix and other parameters.

    Later patches use this infrastructure to use such hash tables for
    tidbitmap.c (bitmap scans) and execGrouping.c (hash aggregation, ...). In
    queries where these use up a large fraction of the time, this has been
    measured to lead to performance improvements of over 100%. There are other
    cases where this could be useful (e.g. catcache.c).

    The hash table design chosen is a variant of linear open-addressing. The
    biggest disadvantages of simple linear addressing schemes are highly variable
    lookup times due to clustering, and deletions leaving a lot of tombstones
    around. To address these issues a variant of "robin hood" hashing is
    employed. Robin hood hashing optimizes chaining lengths by moving elements
    close to their optimal bucket ("rich" elements) out of the way if a
    to-be-inserted element is further away from its optimal position (i.e. it's
    "poor"). While that can make insertions slower, the average lookup
    performance is a lot better, and higher fill factors can be used in a still
    performant manner.

    To avoid tombstones - which normally solve the issue that a deleted node's
    presence is relevant to determine whether a lookup needs to continue looking
    or is done - buckets following a deleted element are shifted backwards,
    unless they're empty or already at their optimal position.

    There are further possible improvements that can be made to this
    implementation. Amongst others:
    - Use distance as a termination criterion during searches. This is generally
      a good idea, but I've been able to see the overhead of distance
      calculations in some cases.
    - Consider combining the 'empty' status into the hashvalue, and enforce
      storing the hashvalue. That could, in some cases, increase memory density
      and remove a few instructions.
    - Experiment further with the, very conservatively chosen, fillfactor.
    - Make the maximum size of the hashtable configurable, to allow storing very
      large tables. That'd require 64bit hash values to be more common than now,
      though.
    - Some smaller memcpy calls could be optimized to copy larger chunks.

    But since the new implementation is already considerably faster than dynahash
    it seems sensible to start using it.

    Reviewed-By: Tomas Vondra
    Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
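To make the macro-template approach concrete, here is a rough usage sketch. The parameter-macro names (SH_PREFIX, SH_ELEMENT_TYPE, ...), the required status field, and the generated function names follow my recollection of lib/simplehash.h and should be treated as assumptions rather than verified API; the hash function is a deliberately toy one for illustration.

```c
#include "postgres.h"

/* Sketch only: element layout and parameter names are assumptions. */
typedef struct IntEntry
{
    uint32      key;            /* the hash key */
    uint32      count;          /* caller-defined payload */
    char        status;         /* simplehash bookkeeping (empty / in use) */
} IntEntry;

#define SH_PREFIX           inthash
#define SH_ELEMENT_TYPE     IntEntry
#define SH_KEY_TYPE         uint32
#define SH_KEY              key
#define SH_HASH_KEY(tb, k)  ((uint32) (k) * 2654435761u)    /* toy hash */
#define SH_EQUAL(tb, a, b)  ((a) == (b))
#define SH_SCOPE            static inline
#define SH_DECLARE
#define SH_DEFINE
#include "lib/simplehash.h"

/*
 * The include generates inthash_create(), inthash_insert(), inthash_lookup(),
 * inthash_delete() and friends, all specialized for IntEntry at compile time,
 * avoiding dynahash's indirect calls and runtime size computations.
 */
```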
* Add likely/unlikely() branch hint macros.  (Andres Freund, 2016-10-14)
| | | | | | | | | | | | These are useful for very hot code paths. Because it's easy to guess wrongly about likelihood, and because such likelihoods change over time, they should be used sparingly. Past tests have shown it'd be a good idea to use them in some places, e.g. in error checks around ereports that ERROR out, but that's work for later. Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
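Such macros are conventionally defined along the following lines; this is a sketch of the usual GCC-based pattern rather than the exact text added to c.h, with a typical call site shown.

```c
#include "postgres.h"

/* Typical definition pattern; the exact c.h wording may differ slightly. */
#if defined(__GNUC__)
#define likely(x)   __builtin_expect((x) != 0, 1)
#define unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define likely(x)   ((x) != 0)
#define unlikely(x) ((x) != 0)
#endif

/* Example call site: the error path is expected to be rare. */
static void
check_buffer(const char *buf)
{
    if (unlikely(buf == NULL))
        elog(ERROR, "buffer allocation failed");
}
```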
* Revert addition of PGDLLEXPORT in PG_FUNCTION_INFO_V1 macro.  (Tom Lane, 2016-10-12)
| | | | | | | | | | | | | | | | | | This turns out not to be as harmless as I thought: MSVC will complain if it sees an "extern" declaration without PGDLLEXPORT and then one with. (Seems fairly silly, given that this can be changed after the fact by the linker, but there you have it.) Therefore, contrib modules that have extern's for V1 functions in header files are falling over in the buildfarm, since none of those externs are marked PGDLLEXPORT. We might or might not conclude that we're willing to plaster those declarations with PGDLLEXPORT in HEAD, but in any case there's no way we're going to ship this change in the back branches. Third-party authors would not thank us for breaking their code in a minor release. Hence, revert the addition of PGDLLEXPORT (but let's keep the extra info in the comment). If we do the other changes we can revert this commit in HEAD. Per buildfarm.
* Remove unnecessary int2vector-specific hash function and equality operator.  (Tom Lane, 2016-10-12)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These functions were originally added in commit d8cedf67a to support use of int2vector columns as catcache lookup keys. However, there are no catcaches that use such columns. (Indeed I now think it must always have been dead code: a catcache with such a key column would need an underlying unique index on the column, but we've never had an int2vector btree opclass.) Getting rid of the int2vector-specific operator and function does not lose any functionality, because operations on int2vectors will now fall back to the generic anyarray support. This avoids a wart that a btree index on an int2vector column (made using anyarray_ops) would fail to match equality searches, because int2vectoreq wasn't a member of the opclass. We don't really care much about that, since int2vector is not meant as a type for users to use, but it's silly to have extra code and less functionality. If we ever do want a catcache to be indexed by an int2vector column, we'd need to put back full btree and hash opclasses for int2vector, comparable to the support for oidvector. (The anyarray code can't be used at such a low level, because it needs to do catcache lookups.) But we'll deal with that if/when the need arises. Also worth noting is that removal of the hash int2vector_ops opclass will break any user-created hash indexes on int2vector columns. While hash anyarray_ops would serve the same purpose, it would probably not compute the same hash values and thus wouldn't be on-disk-compatible. Given that int2vector isn't a user-facing type and we're planning other incompatible changes in hash indexes for v10 anyway, this doesn't seem like something to worry about, but it's probably worth mentioning here. Amit Langote Discussion: <d9bb74f8-b194-7307-9ebd-90645d377e45@lab.ntt.co.jp>
* Provide DLLEXPORT markers for C functions via PG_FUNCTION_INFO_V1 macro.  (Tom Lane, 2016-10-12)
| | | | | | | | | | | | | | | | This isn't really necessary for our own code, because we use a .DEF file in MSVC builds (see gendef.pl), or --export-all-symbols in MinGW and Cygwin builds, to ensure that all global symbols in loadable modules will be exported on Windows. However, third-party authors might use different build processes that need this marker, and it's harmless enough for our own builds. To some extent, this is an oversight in commit e7128e8db, so back-patch to 9.4 where that was added. Laurenz Albe Discussion: <A737B7A37273E048B164557ADEF4A58B539300BD@ntex2010a.host.magwien.gv.at>
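For reference, this is the standard V1 calling-convention boilerplate the macro participates in (the function name here is a made-up example). With this commit the macro's expansion additionally carries a PGDLLEXPORT marker, so Windows builds export the symbol without a .DEF file; as the revert entry further up notes, a separate plain extern declaration of the same function in a header then makes MSVC complain about the mismatch.

```c
#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

/* Declares the finfo record for the V1 calling convention; per this commit,
 * the generated declaration is also marked PGDLLEXPORT on Windows. */
PG_FUNCTION_INFO_V1(add_one);

Datum
add_one(PG_FUNCTION_ARGS)
{
    int32       arg = PG_GETARG_INT32(0);

    PG_RETURN_INT32(arg + 1);
}
```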
* Fix copy-pasto in comment.  (Heikki Linnakangas, 2016-10-12)
| | | | Amit Langote
* Simplify the code for logical tape read buffers.  (Heikki Linnakangas, 2016-10-12)
    Pass the buffer size as argument to LogicalTapeRewindForRead, rather than
    setting it earlier with the separate LogicalTapeAssignReadBufferSize call.
    This way, the buffer size is set closer to where it's actually used, which
    makes the code easier to understand.

    This makes the calculation for how much memory to use for the buffers less
    precise. We now use the same amount of memory for every tape, rounded down
    to the nearest BLCKSZ boundary, instead of using one more block for some
    tapes, to get the total up to the exact amount of memory available. That
    should be OK, merging isn't too sensitive to the exact amount of memory used.

    Reviewed by Peter Geoghegan
    Discussion: <0f607c4b-df23-353e-bf56-c0389d28495f@iki.fi>
* Drop server support for FE/BE protocol version 1.0.  (Tom Lane, 2016-10-11)
| | | | | | | | | While this isn't a lot of code, it's been essentially untestable for a very long time, because libpq doesn't support anything older than protocol 2.0, and has not since release 6.3. There's no reason to believe any other client-side code still uses that protocol, either. Discussion: <2661.1475849167@sss.pgh.pa.us>
* Remove "sco" and "unixware" ports.  (Tom Lane, 2016-10-11)
| | | | | | | | | | | SCO OpenServer and SCO UnixWare are more or less dead platforms. We have never had a buildfarm member testing the "sco" port, and the last "unixware" member was last heard from in 2012, so it's fair to doubt that the code even compiles anymore on either one. Remove both ports. We can always undo this if someone shows up with an interest in maintaining and testing these platforms. Discussion: <17177.1476136994@sss.pgh.pa.us>
* Fix fallback implementation of pg_atomic_write_u32().  (Andres Freund, 2016-10-07)
    I somehow had assumed that in the spinlock (in turn possibly using
    semaphores) based fallback atomics implementation 32 bit writes could be done
    without a lock. As far as the write goes that's correct, since postgres
    supports only platforms with single-copy atomicity for aligned 32bit writes.
    But writing without holding the spinlock breaks read-modify-write operations
    like pg_atomic_compare_exchange_u32(), since they'll potentially "miss" a
    concurrent write, which can't happen in actual hardware implementations.

    In 9.6+ when using the fallback atomics implementation this could lead to
    buffer header locks not being properly marked as released, and potentially
    some related state corruption. I don't see a related danger in 9.5 (earliest
    release with the API), because pg_atomic_write_u32() wasn't used in a
    concurrent manner there.

    The state variable of local buffers was, before this change, manipulated
    using pg_atomic_write_u32(), to avoid unnecessary synchronization overhead.
    As that would no longer be the case, introduce and use
    pg_atomic_unlocked_write_u32(), which does not correctly interact with RMW
    operations.

    This bug only caused issues when postgres is compiled on platforms without
    atomics support (i.e. no common new platform), or when compiled with
    --disable-atomics, which explains why this wasn't noticed in testing.

    Reported-By: Tom Lane
    Discussion: <14947.1475690465@sss.pgh.pa.us>
    Backpatch: 9.5-, where the atomic operations API was introduced.
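A small sketch of the distinction the commit draws. The function names match the commit text; the header path and the exact usage conventions are assumptions.

```c
#include "postgres.h"
#include "port/atomics.h"       /* assumed header for the atomics API */

/*
 * Backend-local state that no other process ever touches: the unlocked
 * write described above is sufficient and avoids the fallback spinlock.
 * It must NOT be mixed with compare-exchange on the same variable.
 */
static void
reset_local_state(pg_atomic_uint32 *state)
{
    pg_atomic_unlocked_write_u32(state, 0);
}

/*
 * Shared state that other backends may update with compare-exchange:
 * use the locking variant so RMW operations cannot miss the write.
 */
static void
reset_shared_state(pg_atomic_uint32 *state)
{
    pg_atomic_write_u32(state, 0);
}
```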
* Rename WAIT_* constants to PG_WAIT_*.  (Robert Haas, 2016-10-05)
| | | | | | | | Windows apparently has a constant named WAIT_TIMEOUT, and some of these other names are pretty generic, too. Insert "PG_" at the front of each name in order to disambiguate. Michael Paquier
* Remove trailing commas from enums.  (Robert Haas, 2016-10-04)
| | | | | | Buildfarm member mylodon doesn't like them. Actually, I don't like them either, but I failed to notice these before pushing commit 6f3bd98ebfc008cbd676da777bb0b2376c4c4bfa.
* Extend framework from commit 53be0b1ad to report latch waits.  (Robert Haas, 2016-10-04)
    WaitLatch, WaitLatchOrSocket, and WaitEventSetWait now take an additional
    wait_event_info parameter; legal values are defined in pgstat.h. This makes
    it possible to uniquely identify every point in the core code where we are
    waiting for a latch; extensions can pass WAIT_EXTENSION.

    Because latches were the major wait primitive not previously covered by this
    patch, it is now possible to see information in pg_stat_activity on a large
    number of important wait events not previously addressed, such as ClientRead,
    ClientWrite, and SyncRep.

    Unfortunately, many of the wait events added by this patch will fail to
    appear in pg_stat_activity because they're only used in background processes
    which don't currently appear in pg_stat_activity. We should fix this either
    by creating a separate view for such information, or else by deciding to
    include them in pg_stat_activity after all.

    Michael Paquier and Robert Haas, reviewed by Alexander Korotkov and Thomas
    Munro.
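A sketch of the extended call as an extension might make it. The four-argument signature is my reading of the commit message, the headers are assumptions, and the constant shown was later renamed to PG_WAIT_EXTENSION by the rename entry further up this log.

```c
#include "postgres.h"
#include "miscadmin.h"          /* assumed home of MyLatch */
#include "pgstat.h"             /* wait_event_info constants */
#include "storage/latch.h"

/* Sketch: wait up to one second for our latch, reporting the wait class
 * so it shows up as wait_event in pg_stat_activity. */
static void
wait_for_work(void)
{
    int         rc;

    rc = WaitLatch(MyLatch,
                   WL_LATCH_SET | WL_TIMEOUT,
                   1000L,           /* timeout in ms */
                   WAIT_EXTENSION); /* later spelled PG_WAIT_EXTENSION */

    if (rc & WL_LATCH_SET)
        ResetLatch(MyLatch);
}
```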
* Show a sensible value in pg_settings.unit for GUC_UNIT_XSEGS variables.  (Tom Lane, 2016-10-03)
| | | | | | | | | | | | | | | | Commit 88e982302 invented GUC_UNIT_XSEGS for min_wal_size and max_wal_size, but neglected to make it display sensibly in pg_settings.unit (by adding a case to the switch in GetConfigOptionByNum). Fix that, and adjust said switch to throw a run-time error the next time somebody forgets. In passing, avoid using a static buffer for the output string --- the rest of this function pstrdup's from a local buffer, and I see no very good reason why the units code should do it differently and less safely. Per report from Otar Shavadze. Back-patch to 9.5 where the new unit type was added. Report: <CAG-jOyA=iNFhN+yB4vfvqh688B7Tr5SArbYcFUAjZi=0Exp-Lg@mail.gmail.com>
* Change the way pre-reading in external sort's merge phase works.  (Heikki Linnakangas, 2016-10-03)
| | | | | | | | | | | | | | | | | | | | | | Don't pre-read tuples into SortTuple slots during merge. Instead, use the memory for larger read buffers in logtape.c. We're doing the same number of READTUP() calls either way, but managing the pre-read SortTuple slots is much more complicated. Also, the on-tape representation is more compact than SortTuples, so we can fit more pre-read tuples into the same amount of memory this way. And we have better cache-locality, when we use just a small number of SortTuple slots. Now that we only hold one tuple from each tape in the SortTuple slots, we can greatly simplify the "batch memory" management. We now maintain a small set of fixed-sized slots, to hold the tuples, and fall back to palloc() for larger tuples. We use this method during all merge phases, not just the final merge, and also when randomAccess is requested, and also in the TSS_SORTEDONTAPE case. In other words, it's used whenever we do an external sort. Reviewed by Peter Geoghegan and Claudio Freire. Discussion: <CAM3SWZTpaORV=yQGVCG8Q4axcZ3MvF-05xe39ZvORdU9JcD6hQ@mail.gmail.com>
* Fix breakage in previous change  (Peter Eisentraut, 2016-09-30)
|
* Separate enum from struct  (Peter Eisentraut, 2016-09-30)
| | | | | | Otherwise the enum symbols are not visible outside the struct in C++. Reviewed-by: Thomas Munro <thomas.munro@enterprisedb.com>
* pg_basebackup pg_receivexlog: Issue fsync more carefully  (Peter Eisentraut, 2016-09-29)
    Several places weren't careful about fsyncing in the right way. See 1d4a0ab1
    and 606e0f98 for details about required fsyncs.

    This adds a couple of functions in src/common/ that have an equivalent in
    the backend: durable_rename(), fsync_parent_path().

    From: Michael Paquier <michael.paquier@gmail.com>
* Move fsync routines of initdb into src/common/  (Peter Eisentraut, 2016-09-29)
    The intention is to use those in other utilities such as pg_basebackup and
    pg_receivexlog.

    From: Michael Paquier <michael.paquier@gmail.com>
* Fix CRC check handling in get_controlfile  (Peter Eisentraut, 2016-09-28)
| | | | | | | | The previous patch broke this by returning NULL for a failed CRC check, which pg_controldata would then try to read. Fix by returning the result of the CRC check in a separate argument. Michael Paquier and myself
* Turn password_encryption GUC into an enum.  (Heikki Linnakangas, 2016-09-28)
| | | | | | | | | | | | | This makes the parameter easier to extend, to support other password-based authentication protocols than MD5. (SCRAM is being worked on.) The GUC still accepts on/off as aliases for "md5" and "plain", although we may want to remove those once we actually add support for another password hash type. Michael Paquier, reviewed by David Steele, with some further edits by me. Discussion: <CAB7nPqSMXU35g=W9X74HVeQp0uvgJxvYOuA4A-A3M+0wfEBv-w@mail.gmail.com>
* Fix some typos in comment  (Peter Eisentraut, 2016-09-26)
|
* Replace the built-in GIN array opclasses with a single polymorphic opclass.  (Tom Lane, 2016-09-26)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We had thirty different GIN array opclasses sharing the same operators and support functions. That still didn't cover all the built-in types, nor did it cover arrays of extension-added types. What we want is a single polymorphic opclass for "anyarray". There were two missing features needed to make this possible: 1. We have to be able to declare the index storage type as ANYELEMENT when the opclass is declared to index ANYARRAY. This just takes a few more lines in index_create(). Although this currently seems of use only for GIN, there's no reason to make index_create() restrict it to that. 2. We have to be able to identify the proper GIN compare function for the index storage type. This patch proceeds by making the compare function optional in GIN opclass definitions, and specifying that the default btree comparison function for the index storage type will be looked up when the opclass omits it. Again, that seems pretty generically useful. Since the comparison function lookup is done in initGinState(), making use of the second feature adds an additional cache lookup to GIN index access setup. It seems unlikely that that would be very noticeable given the other costs involved, but maybe at some point we should consider making GinState data persist longer than it now does --- we could keep it in the index relcache entry, perhaps. Rather fortuitously, we don't seem to need to do anything to get this change to play nice with dump/reload or pg_upgrade scenarios: the new opclass definition is automatically selected to replace existing index definitions, and the on-disk data remains compatible. Also, if a user has created a custom opclass definition for a non-builtin type, this doesn't break that, since CREATE INDEX will prefer an exact match to opcintype over a match to ANYARRAY. However, if there's anyone out there with handwritten DDL that explicitly specifies _bool_ops or one of the other replaced opclass names, they'll need to adjust that. Tom Lane, reviewed by Enrique Meneses Discussion: <14436.1470940379@sss.pgh.pa.us>
* Refer to OS X as "macOS", except for the port name which is still "darwin".  (Tom Lane, 2016-09-25)
| | | | | | | | | | | | | | | | | | We weren't terribly consistent about whether to call Apple's OS "OS X" or "Mac OS X", and the former is probably confusing to people who aren't Apple users. Now that Apple has rebranded it "macOS", follow their lead to establish a consistent naming pattern. Also, avoid the use of the ancient project name "Darwin", except as the port code name which does not seem desirable to change. (In short, this patch touches documentation and comments, but no actual code.) I didn't touch contrib/start-scripts/osx/, either. I suspect those are obsolete and due for a rewrite, anyway. I dithered about whether to apply this edit to old release notes, but those were responsible for quite a lot of the inconsistencies, so I ended up changing them too. Anyway, Apple's being ahistorical about this, so why shouldn't we be?
* Avoid using PostmasterRandom() for DSM control segment ID.  (Tom Lane, 2016-09-23)
| | | | | | | | | | | | | | | Commits 470d886c3 et al intended to fix the problem that the postmaster selected the same "random" DSM control segment ID on every start. But using PostmasterRandom() for that destroys the intended property that the delay between random_start_time and random_stop_time will be unpredictable. (Said delay is probably already more predictable than we could wish, but that doesn't mean that reducing it by a couple orders of magnitude is OK.) Revert the previous patch and add a comment warning against misuse of PostmasterRandom. Fix the original problem by calling srandom() early in PostmasterMain, using a low-security seed that will later be overwritten by PostmasterRandom. Discussion: <20789.1474390434@sss.pgh.pa.us>
* Remove nearly-unused SizeOfIptrData macro.  (Tom Lane, 2016-09-22)
| | | | | | | | | | | | | Past refactorings have removed all but one reference to SizeOfIptrData (and that one place was in a pretty noncritical spot). Since nobody's complained, it seems probable that there are no supported compilers that don't think sizeof(ItemPointerData) is 6. If there are, we're wasting MAXALIGN per heap tuple anyway, so it's rather silly to worry about whether we can shave space in places like WAL records. Pavan Deolasee Discussion: <CABOikdOOawDda4hwLOT6zdA6MFfPLu3Z2YBZkX0JdayNS6JOeQ@mail.gmail.com>
* pg_ctl: Detect current standby state from pg_control  (Peter Eisentraut, 2016-09-21)
| | | | | | | | | | pg_ctl used to determine whether a server was in standby mode by looking for a recovery.conf file. With this change, it instead looks into pg_control, which is potentially more accurate. There are also occasional discussions about removing recovery.conf, so this removes one dependency. Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
* Fix typo  (Peter Eisentraut, 2016-09-21)
| | | | From: Michael Paquier <michael.paquier@gmail.com>
* Use PostmasterRandom(), not random(), for DSM control segment ID.  (Robert Haas, 2016-09-20)
| | | | | Otherwise, every startup gets the same "random" value, which is definitely not what was intended.
* Fix outdated comments, GiST search queue is not an RBTree anymore.  (Heikki Linnakangas, 2016-09-20)
    The GiST search queue is implemented as a pairing heap rather than as a
    Red-Black Tree, since 9.5 (commit e7032610). I neglected these comments in
    that commit.
* Add debugging aid "bmsToString(Bitmapset *bms)".  (Tom Lane, 2016-09-16)
| | | | | | | | | | | | | | This function has no direct callers at present, but it's convenient for manual use in a debugger, rather than having to inspect memory and do bit-counting in your head. In passing, get rid of useless outBitmapset() wrapper around _outBitmapset(); let's just export the function that does the work. Likewise for outToken(). Ashutosh Bapat, tweaked a bit by me Discussion: <CAFjFpRdiht8e1HTVirbubr4YzaON5iZTzFJjq909y4sU8M_6eA@mail.gmail.com>
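A sketch of the intended ad-hoc use; the header that declares bmsToString() is an assumption here.

```c
#include "postgres.h"
#include "nodes/bitmapset.h"    /* Bitmapset */
#include "nodes/nodes.h"        /* assumed home of the bmsToString() declaration */

/* Ad-hoc logging helper of the kind the commit has in mind; the function is
 * also handy to call directly from a debugger, e.g. "p bmsToString(some_bms)". */
static void
dump_relids(Bitmapset *relids)
{
    elog(LOG, "relids = %s", bmsToString(relids));
}
```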
* Fix building with LibreSSL.  (Heikki Linnakangas, 2016-09-15)
| | | | | | | | | | | | | | | | LibreSSL defines OPENSSL_VERSION_NUMBER to claim that it is version 2.0.0, but it doesn't have the functions added in OpenSSL 1.1.0. Add autoconf checks for the individual functions we need, and stop relying on OPENSSL_VERSION_NUMBER. Backport to 9.5 and 9.6, like the patch that broke this. In the back-branches, there are still a few OPENSSL_VERSION_NUMBER checks left, to check for OpenSSL 0.9.8 or 0.9.7. I left them as they were - LibreSSL has all those functions, so they work as intended. Per buildfarm member curculio. Discussion: <2442.1473957669@sss.pgh.pa.us>
* Improve code comment for GatherPath's single_copy flag.  (Robert Haas, 2016-09-14)
| | | | Discussion: 5934.1472642782@sss.pgh.pa.us
* Be pickier about converting between Name and Datum.  (Tom Lane, 2016-09-13)
| | | | | | | | | | | | | We were misapplying NameGetDatum() to plain C strings in some places. This worked, because it was just a pointer cast anyway, but it's a type cheat in some sense. Use CStringGetDatum instead, and modify the NameGetDatum macro so it won't compile if applied to something that's not a pointer to NameData. This should result in no changes to generated code, but it is logically cleaner. Mark Dilger, tweaked a bit by me Discussion: <EFD8AC94-4C1F-40C1-A5EA-304080089C1B@gmail.com>
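A small sketch of the pairing being enforced; the description of the revised NameGetDatum() in the comment is my reading of the commit message, not verified source, and the helper functions are illustrative only.

```c
#include "postgres.h"

/*
 * Correct pairings after this commit:
 *   Name (pointer to NameData)  -> NameGetDatum(name)
 *   plain C string              -> CStringGetDatum(str)
 * Per the commit text, NameGetDatum() now goes through the NameData
 * contents, so passing a bare char * no longer compiles (which is the point).
 */
static Datum
datum_from_name(Name name)
{
    return NameGetDatum(name);
}

static Datum
datum_from_cstring(const char *str)
{
    return CStringGetDatum(str);
}
```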
* Improve parser's and planner's handling of set-returning functions.  (Tom Lane, 2016-09-13)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Teach the parser to reject misplaced set-returning functions during parse analysis using p_expr_kind, in much the same way as we do for aggregates and window functions (cf commit eaccfded9). While this isn't complete (it misses nesting-based restrictions), it's much better than the previous error reporting for such cases, and it allows elimination of assorted ad-hoc expression_returns_set() error checks. We could add nesting checks later if it seems important to catch all cases at parse time. There is one case the parser will now throw error for although previous versions allowed it, which is SRFs in the tlist of an UPDATE. That never behaved sensibly (since it's ill-defined which generated row should be used to perform the update) and it's hard to see why it should not be treated as an error. It's a release-note-worthy change though. Also, add a new Query field hasTargetSRFs reporting whether there are any SRFs in the targetlist (including GROUP BY/ORDER BY expressions). The parser can now set that basically for free during parse analysis, and we can use it in a number of places to avoid expression_returns_set searches. (There will be more such checks soon.) In some places, this allows decontorting the logic since it's no longer expensive to check for SRFs in the tlist --- so I made the checks parallel to the handling of hasAggs/hasWindowFuncs wherever it seemed appropriate. catversion bump because adding a Query field changes stored rules. Andres Freund and Tom Lane Discussion: <24639.1473782855@sss.pgh.pa.us>
* Have heapam.h include lockdefs.h rather than lock.h.  (Robert Haas, 2016-09-13)
| | | | | | | | | lockdefs.h was only split from lock.h relatively recently, and represents a minimal subset of the old lock.h. heapam.h only needs that smaller subset, so adjust it to include only that. This requires some corresponding adjustments elsewhere. Peter Geoghegan
* Improve unreachability recognition in elog() macro.  (Tom Lane, 2016-09-10)
| | | | | | | | | | | | | | Some experimentation with an older version of gcc showed that it is able to determine whether "if (elevel_ >= ERROR)" is compile-time constant if elevel_ is declared "const", but otherwise not so much. We had accounted for that in ereport() but were too miserly with braces to make it so in elog(). I don't know how many currently-interesting compilers have the same quirk, but in case it will save some code space, let's make sure that elog() is on the same footing as ereport() for this purpose. Back-patch to 9.3 where we introduced pg_unreachable() calls into elog/ereport.
* Rewrite PageIndexDeleteNoCompact into a form that only deletes 1 tuple.  (Tom Lane, 2016-09-09)
| | | | | | | | | | | | The full generality of deleting an arbitrary number of tuples is no longer needed, so let's save some code and cycles by replacing the original coding with an implementation based on PageIndexTupleDelete. We can always get back the old code from git if we need it again for new callers (though I don't care for its willingness to mess with line pointers it wasn't told to mess with). Discussion: <552.1473445163@sss.pgh.pa.us>
* Convert PageAddItem into a macro to save a few cycles.  (Tom Lane, 2016-09-09)
| | | | | | | | | | Nowadays this is just a backwards-compatibility wrapper around PageAddItemExtended, so let's avoid the extra level of function call. In addition, because pretty much all callers are passing constants for the two bool arguments, compilers will be able to constant-fold the conversion to a flags bitmask. Discussion: <552.1473445163@sss.pgh.pa.us>
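A rough reconstruction of what the wrapper presumably looks like after this commit; the PAI_* flag names follow the naming mentioned in the neighbouring entries and should be treated as assumptions.

```c
/* Sketch: the old function becomes a thin macro over PageAddItemExtended(),
 * so constant bool arguments fold into a constant flags word at compile time. */
#define PageAddItem(page, item, size, offsetNumber, overwrite, is_heap) \
    PageAddItemExtended(page, item, size, offsetNumber, \
                        ((overwrite) ? PAI_OVERWRITE : 0) | \
                        ((is_heap) ? PAI_IS_HEAP : 0))
```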
* Invent PageIndexTupleOverwrite, and teach BRIN and GiST to use it.  (Tom Lane, 2016-09-09)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PageIndexTupleOverwrite performs approximately the same function as PageIndexTupleDelete (or PageIndexDeleteNoCompact) followed by PageAddItem targeting the same item pointer offset. But in the case where the new tuple is the same size as the old, it avoids shuffling other data around on the page, because the new tuple is placed where the old one was rather than being appended to the end of the page. This has been shown to provide a substantial speedup for some GiST use-cases. Also, this change allows some API simplifications: we can get rid of the rather klugy and error-prone PAI_ALLOW_FAR_OFFSET flag for PageAddItemExtended, since that was used only to cover a corner case for BRIN that's better expressed by using PageIndexTupleOverwrite. Note that this patch causes a rather subtle WAL incompatibility: the physical page content change represented by certain WAL records is now different than it was before, because while the tuples have the same itempointer line numbers, the tuples themselves are in different places. I have not bumped the WAL version number because I think it doesn't matter unless you are trying to do bitwise comparisons of original and replayed pages, and in any case we're early in a devel cycle and there will probably be more WAL changes before v10 gets out the door. There is probably room to make use of PageIndexTupleOverwrite in SP-GiST and GIN too, but that is left for a future patch. Andrey Borodin, reviewed by Anastasia Lubennikova, whacked around a bit by me Discussion: <CAJEAwVGQjGGOj6mMSgMwGvtFd5Kwe6VFAxY=uEPZWMDjzbn4VQ@mail.gmail.com>
* Improve scalability of md.c for large relations.  (Andres Freund, 2016-09-08)
| | | | | | | | | | | | | | So far md.c used a linked list of segments. That proved to be a problem when processing large relations, because every smgr.c/md.c level access to a page incurred walking through a linked list of all preceding segments. Thus making accessing pages O(#segments). Replace the linked list of segments hanging off SMgrRelationData with an array of opened segments. That allows O(1) access to individual segments, if they've previously been opened. Discussion: <20140331101001.GE13135@alap3.anarazel.de> Reviewed-By: Peter Geoghegan, Tom Lane (in an older version)
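To illustrate the data-structure change, here is a sketch of the O(1) segment access it enables. The type and field names are illustrative assumptions, not the actual md.c declarations.

```c
#include "postgres.h"
#include "storage/block.h"      /* BlockNumber */
#include "storage/fd.h"         /* File */

/* Illustrative stand-ins for md.c's private per-fork bookkeeping. */
typedef struct MdSegment
{
    File        vfd;            /* virtual fd from fd.c */
    BlockNumber segno;          /* segment number */
} MdSegment;

typedef struct MdOpenSegments
{
    int         nsegs;          /* how many segments are currently open */
    MdSegment  *segs;           /* array of open segments, indexed by segno */
} MdOpenSegments;

/* O(1) access to an already-opened segment, replacing the old list walk
 * over all preceding segments. */
static MdSegment *
get_open_segment(MdOpenSegments *open, BlockNumber segno)
{
    if ((int) segno < open->nsegs)
        return &open->segs[segno];

    return NULL;                /* not opened yet; caller must open/extend it */
}
```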