This commit introduces SVE implementations of pg_popcount{32,64}.
Unlike the Neon versions, we need an additional configure-time
check to determine if the compiler supports SVE intrinsics, and we
need a runtime check to determine if the current CPU supports SVE
instructions. Our testing showed that the SVE implementations are
much faster for larger inputs and are comparable to the status
quo for smaller inputs.
Author: "Devanga.Susmitha@fujitsu.com" <Devanga.Susmitha@fujitsu.com>
Co-authored-by: "Chiranmoy.Bhattacharya@fujitsu.com" <Chiranmoy.Bhattacharya@fujitsu.com>
Co-authored-by: "Malladi, Rama" <ramamalladi@hotmail.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/010101936e4aaa70-b474ab9e-b9ce-474d-a3ba-a3dc223d295c-000000%40us-west-2.amazonses.com
Discussion: https://postgr.es/m/OSZPR01MB84990A9A02A3515C6E85A65B8B2A2%40OSZPR01MB8499.jpnprd01.prod.outlook.com
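
For illustration, a minimal sketch of such an SVE kernel using the ACLE
intrinsics (not the committed code; popcount_sve is a hypothetical name,
and tuning is omitted):

    #include <arm_sve.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Count set bits with predicated SVE loads; the predicate handles the
     * tail, so no scalar cleanup loop is needed.  svaddv_u8 reduces into
     * a uint64_t, so the per-byte counts (at most 8 each) cannot overflow. */
    static uint64_t
    popcount_sve(const uint8_t *buf, size_t len)
    {
        uint64_t    total = 0;
        uint64_t    i = 0;

        while (i < len)
        {
            svbool_t    pred = svwhilelt_b8(i, (uint64_t) len);
            svuint8_t   vec = svld1_u8(pred, buf + i);

            total += svaddv_u8(pred, svcnt_u8_x(pred, vec));
            i += svcntb();      /* vector width in bytes, known only at runtime */
        }
        return total;
    }

On Linux, the runtime check the message refers to can be spelled
getauxval(AT_HWCAP) & HWCAP_SVE; the configure-time check only needs to
verify that <arm_sve.h> and these intrinsics compile.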

----

This commit introduces Neon implementations of pg_popcount{32,64},
pg_popcount(), and pg_popcount_masked(). As in simd.h, we assume
that all available AArch64 hardware supports Neon, so we don't need
any new configure-time or runtime checks. Some compilers already
emit Neon instructions for these functions, but our hand-rolled
implementations for pg_popcount() and pg_popcount_masked()
performed better in testing, likely due to better instruction-level
parallelism.
Author: "Chiranmoy.Bhattacharya@fujitsu.com" <Chiranmoy.Bhattacharya@fujitsu.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/010101936e4aaa70-b474ab9e-b9ce-474d-a3ba-a3dc223d295c-000000%40us-west-2.amazonses.com
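
A sketch of the basic Neon loop for pg_popcount() (illustrative only;
the committed version is reportedly unrolled further to get the
instruction-level parallelism mentioned above):

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    /* vcntq_u8 counts the set bits in each of 16 byte lanes; the u8
     * horizontal add is safe because 16 * 8 = 128 fits in one byte. */
    static uint64_t
    popcount_neon(const uint8_t *buf, size_t len)
    {
        uint64_t    total = 0;

        for (; len >= 16; buf += 16, len -= 16)
            total += vaddvq_u8(vcntq_u8(vld1q_u8(buf)));

        while (len--)           /* scalar tail */
            total += __builtin_popcount(*buf++);

        return total;
    }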

----

This macro protects x86_64-specific code, and a subsequent commit
will introduce AArch64-specific versions of that code. To prevent
confusion, let's rename it to clearly indicate that it's for
x86_64. We should likely move this code to its own file (perhaps
merging it with the AVX-512 popcount code), but that is left as a
future exercise.
Reviewed-by: "Chiranmoy.Bhattacharya@fujitsu.com" <Chiranmoy.Bhattacharya@fujitsu.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/010101936e4aaa70-b474ab9e-b9ce-474d-a3ba-a3dc223d295c-000000%40us-west-2.amazonses.com

----

Since commit f78667bd91, we've used __attribute__((target(...)))
instead of extra compiler flags for AVX-512 support, but this
comment still says that we put the code in a separate file because
it might require extra compiler flags. Let's just remove that part
of the comment.

----

Backpatch-through: 13

----

Redefine our exact width types with standard C99 types and macros,
including int64_t, INT64_MAX, INT64_C(), PRId64 etc. We were already
using <stdint.h> types in a few places.
One complication is that Windows' <inttypes.h> uses format strings like
"%I64d", "%I32", "%I" for PRI*64, PRI*32, PRI*PTR, instead of mapping to
other standardized format strings like "%lld" etc. as seen on other known
systems. Teach our snprintf.c to understand them.
This removes a lot of configure clutter, and should also allow 64-bit
numbers and other standard types to be used in localized messages
without casting.
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/ME3P282MB3166F9D1F71F787929C0C7E7B6312%40ME3P282MB3166.AUSP282.PROD.OUTLOOK.COM
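
The Windows wrinkle is easy to see in a tiny example (illustrative, not
from the commit):

    #include <inttypes.h>
    #include <stdio.h>

    int
    main(void)
    {
        int64_t     val = INT64_C(9223372036854775807);

        /* PRId64 expands to "lld" on most platforms but to "I64d" with
         * Windows' <inttypes.h>; snprintf.c must therefore accept the
         * %I64d-style directives too. */
        printf("largest int64_t = %" PRId64 "\n", val);
        return 0;
    }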

----

Commit 792752af4e added infrastructure for using AVX-512 intrinsic
functions, and this commit uses that infrastructure to optimize
visibilitymap_count(). Specifically, a new pg_popcount_masked()
function is introduced that applies a bitmask to every byte in the
buffer prior to calculating the population count, which is used to
filter out the all-visible or all-frozen bits as needed. Platforms
without AVX-512 support should also see a nice speedup due to the
reduced number of calls to a function pointer.
Co-authored-by: Ants Aasma
Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com
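
A portable sketch of the idea behind pg_popcount_masked() (the
committed AVX-512 version applies the same mask 64 bytes at a time; the
helper below is illustrative):

    #include <stddef.h>
    #include <stdint.h>

    /* AND every byte with a repeating mask before counting its bits. */
    static uint64_t
    popcount_masked(const unsigned char *buf, size_t len, unsigned char mask)
    {
        uint64_t    total = 0;
        size_t      i;

        for (i = 0; i < len; i++)
            total += __builtin_popcount(buf[i] & mask);
        return total;
    }

Since the visibility map stores two bits per heap page, the caller can
pass an alternating mask (e.g. 0x55 or 0xaa) to count only the
all-visible or only the all-frozen bits.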

----

Presently, pg_popcount() processes data in 32-bit or 64-bit chunks
when possible. Newer hardware that supports AVX-512 instructions
can use 512-bit chunks, which provides a nice speedup, especially
for larger buffers. This commit introduces the infrastructure
required to detect compiler and CPU support for the required
AVX-512 intrinsic functions, and it adds a new pg_popcount()
implementation that uses these functions. If CPU support for this
optimized implementation is detected at runtime, a function pointer
is updated so that it is used by subsequent calls to pg_popcount().
Most of the existing in-tree calls to pg_popcount() should benefit
from these instructions, and calls with smaller buffers should at
least not regress compared to v16. The new infrastructure
introduced by this commit can also be used to optimize
visibilitymap_count(), but that is left for a follow-up commit.
Co-authored-by: Paul Amonson, Ants Aasma
Reviewed-by: Matthias van de Meent, Tom Lane, Noah Misch, Akash Shankaran, Alvaro Herrera, Andres Freund, David Rowley
Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com
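
As a rough sketch of the 512-bit inner loop (assuming GCC/Clang and the
AVX512VPOPCNTDQ extension; the committed version's masked-load tail
handling is omitted here):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    __attribute__((target("avx512vpopcntdq,avx512f")))
    static uint64_t
    popcount_avx512(const char *buf, size_t len)
    {
        __m512i     accum = _mm512_setzero_si512();

        for (; len >= 64; buf += 64, len -= 64)
        {
            __m512i     vec = _mm512_loadu_si512((const void *) buf);

            /* per-lane 64-bit popcounts, accumulated lane-wise */
            accum = _mm512_add_epi64(accum, _mm512_popcnt_epi64(vec));
        }
        return (uint64_t) _mm512_reduce_add_epi64(accum);   /* tail omitted */
    }

The runtime detection has to confirm not just the CPUID feature bits but
also (via XGETBV) that the operating system saves the ZMM register state.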

----

If there aren't many bytes to process, the function call overhead
of the optimized implementation isn't worth taking, so instead we
inline a loop that consults pg_number_of_ones in that case. If
there are many bytes to process, we accept the function call
overhead because the optimized versions are likely to be faster.
The threshold at which we use the optimized implementation is set
to the smallest amount of data required to use special popcount
instructions.
Reviewed-by: Alvaro Herrera, Tom Lane
Discussion: https://postgr.es/m/20240402155301.GA2750455%40nathanxps13
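
The small-input path is essentially a byte-table loop over
pg_number_of_ones[], the existing 256-entry popcount table in
pg_bitutils.h. A sketch (function name illustrative):

    #include "port/pg_bitutils.h"

    /* inline fast path for short buffers; avoids the indirect call */
    static inline uint64_t
    popcount_bytewise(const char *buf, size_t len)
    {
        uint64_t    popcnt = 0;

        while (len--)
            popcnt += pg_number_of_ones[(unsigned char) *buf++];
        return popcnt;
    }

The threshold is then, for example, the width of one 64-bit word on
platforms whose special instruction operates on 64-bit operands.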

----

On some systems, calls to pg_popcount{32,64} are indirected through
a function pointer. This commit converts pg_popcount() to a
function pointer on those systems so that we can inline the
appropriate pg_popcount{32,64} implementations into each of the
pg_popcount() implementations. Since pg_popcount() may call
pg_popcount{32,64} several times, this can significantly improve
its performance.
Suggested-by: David Rowley
Reviewed-by: David Rowley
Discussion: https://postgr.es/m/CAApHDvrb7MJRB6JuKLDEY2x_LKdFHwVbogpjZBCX547i5%2BrXOQ%40mail.gmail.com
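
The dispatch pattern looks roughly like this (names hypothetical; the
CPU probe and the fast body are stubbed out for brevity):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    static uint64_t pg_popcount_choose(const char *buf, size_t len);

    /* portable fallback */
    static uint64_t
    pg_popcount_slow(const char *buf, size_t len)
    {
        uint64_t    cnt = 0;

        while (len--)
            cnt += __builtin_popcount((unsigned char) *buf++);
        return cnt;
    }

    /* stand-in for the hardware-accelerated version */
    static uint64_t
    pg_popcount_fast(const char *buf, size_t len)
    {
        return pg_popcount_slow(buf, len);
    }

    /* hypothetical probe; a real one would consult CPUID */
    static bool
    cpu_has_popcnt(void)
    {
        return false;
    }

    /* The pointer starts at "choose", which probes the CPU once and
     * repoints it, so every later call goes straight to the winner. */
    static uint64_t (*pg_popcount_ptr) (const char *buf, size_t len) = pg_popcount_choose;

    static uint64_t
    pg_popcount_choose(const char *buf, size_t len)
    {
        pg_popcount_ptr = cpu_has_popcnt() ? pg_popcount_fast : pg_popcount_slow;
        return pg_popcount_ptr(buf, len);
    }

The commit's point is that once pg_popcount() itself sits behind such a
pointer, each concrete implementation can inline its matching
pg_popcount{32,64} and skip the per-word indirect calls.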

----

Add PGDLLIMPORT to pg_popcount32/64. In passing, fix a typo.
Diagnosis by Masahiko Sawada, patch by David Rowley
Per buildfarm members drongo and fairywren
Discussion: https://postgr.es/m/CAD21AoAMm1mQd%3Dw4PrfrKK%3DOMP8j8%3D7ntJRPF8%2B%3D10iUuvwiCA%40mail.gmail.com
Discussion: https://postgr.es/m/CAApHDvov7724UrD1Ug0D1eV%2B9Pd_x5VEQmw-6HVG9w1WdCxXPA%40mail.gmail.com

----

Reported-by: Michael Paquier
Discussion: https://postgr.es/m/ZZKTDPxBBMt3C0J9@paquier.xyz
Backpatch-through: 12

----

MSVC's _BitScan* functions return a boolean indicating whether any
bits were set in the input, and we were previously asserting that
they returned true, per our API. This is correct. However, other
platforms simply assert that the input is non-zero, so do that to be
more consistent.
Noted while investigating a hypothesis from Ranier Vilela about
undefined behavior, but this is not his proposed patch.
Discussion: https://www.postgresql.org/message-id/CAEudQAoDhUZyKGJ1vbMGcgVUOcsixe-%3DjcVaDWarqkUg163D2w%40mail.gmail.com
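
In sketch form (Assert() is PostgreSQL's assertion macro; the function
name is illustrative):

    #include <intrin.h>

    static inline int
    leftmost_one_pos32(unsigned long word)
    {
        unsigned long result;

        /* assert on the input, as the non-MSVC paths do, rather than
         * on _BitScanReverse's "found a bit" boolean return */
        Assert(word != 0);
        (void) _BitScanReverse(&result, word);
        return (int) result;
    }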

----

677319746 added support for making use of MSVC's bit scanning functions.
However, that commit failed to consider 32-bit MSVC builds where the
64-bit versions of these functions are unavailable. This resulted in
compilation failures on 32-bit MSVC.
Here we adjust the code so we fall back on the manual way of finding the
bit positions for 64-bit integers when building on 32-bit MSVC.
Bug: #17967
Reported-by: Youmiu Mo
Discussion: https://postgr.es/m/17967-cd21e34a314141b2@postgresql.org
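
The shape of the fix, sketched. Note the committed fallback reuses the
existing pure-C lookup-table path; the split-halves variant below is a
different, equally manual way to show the 32-bit-only case:

    #if defined(_MSC_VER) && !defined(_WIN64)
    /* 32-bit MSVC: _BitScanReverse64 does not exist, so scan the high
     * and low 32-bit halves separately. */
    static inline int
    leftmost_one_pos64(unsigned __int64 word)
    {
        unsigned long result;

        if (_BitScanReverse(&result, (unsigned long) (word >> 32)))
            return (int) result + 32;
        (void) _BitScanReverse(&result, (unsigned long) word);
        return (int) result;
    }
    #endif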

----

These were valuable during development, but are unlikely to tell us
anything going forward. This reverts 204b0cbec and adjusts the content
of 677319746 to more closely match the more-readable original style.
Per review from Tom Lane
Discussion: https://www.postgresql.org/message-id/3567481.1676906261%40sss.pgh.pa.us

----

To allow testing for general support for fast bitscan intrinsics,
add symbols HAVE_BITSCAN_REVERSE and HAVE_BITSCAN_FORWARD.
Also do related cleanup in AllocSetFreeIndex(): Previously, we
tested for HAVE__BUILTIN_CLZ and copied the relevant internals of
pg_leftmost_one_pos32(), with a special fallback that does less
work than the general fallback for that function. Now that we have
a more general test, we just call pg_leftmost_one_pos32() directly
for platforms with intrinsic support. On gcc at least, there is no
difference in the binary for non-assert builds.
Discussion: https://www.postgresql.org/message-id/CAFBsxsEPc%2BFnX_0vmmQ5DHv60sk4rL_RZJ%2BMD6ei%3D76L0kFMvA%40mail.gmail.com
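
The new symbols follow pg_bitutils.h's existing pattern; roughly:

    /* "some fast bitscan intrinsic is available", whatever its spelling */
    #if defined(HAVE__BUILTIN_CLZ) || defined(_MSC_VER)
    #define HAVE_BITSCAN_REVERSE
    #endif

    #if defined(HAVE__BUILTIN_CTZ) || defined(_MSC_VER)
    #define HAVE_BITSCAN_FORWARD
    #endif

AllocSetFreeIndex() can then simply test #ifdef HAVE_BITSCAN_REVERSE and
call pg_leftmost_one_pos32() directly, instead of duplicating its
internals.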

----

Discussion: https://www.postgresql.org/message-id/CAFBsxsEPc%2BFnX_0vmmQ5DHv60sk4rL_RZJ%2BMD6ei%3D76L0kFMvA%40mail.gmail.com

----

Backpatch-through: 11

----

Use a more compact notation that allows us to add more size_t variants
as required. This will be used by a later commit.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CA%2BhUKG%2B7dSX1XF8yFGmYk-%3D48dbjH2kmzZj16XvhbrWP-9BzRg%40mail.gmail.com

----

There were a number of places in the code that used bespoke bit-twiddling
expressions to do bitwise rotation. While we've had pg_rotate_right32()
for a while now, we hadn't gotten around to standardizing on that. Do so
now. Since many potential call sites look more natural with the "left"
equivalent, add that function too.
Reviewed by Tom Lane and Yugo Nagata
Discussion: https://www.postgresql.org/message-id/CAFBsxsH7c1LC0CGZ0ADCBXLHU5-%3DKNXx-r7tHYPAW51b2HK4Qw%40mail.gmail.com
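
The helpers are one-liners. Note that, as written (and as in
pg_bitutils.h), a rotation count of 0 or 32 would shift by the full word
width, which is undefined behavior in C, so callers pass 1..31:

    #include <stdint.h>

    static inline uint32_t
    pg_rotate_left32(uint32_t word, int n)
    {
        return (word << n) | (word >> (32 - n));
    }

    static inline uint32_t
    pg_rotate_right32(uint32_t word, int n)
    {
        return (word >> n) | (word << (32 - n));
    }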

----

Backpatch-through: 10

----

This reverts commits c2d1eea9e and 11b500072, as well as similar hacks
elsewhere, in favor of setting up the PGDLLIMPORT macro so that it can
just be used unconditionally. That can work because in frontend code,
we need no marking in either the defining or consuming files for a
variable exported from these libraries; and frontend code has no need
to access variables exported from the core backend, either.
While at it, write some actual documentation about the PGDLLIMPORT
and PGDLLEXPORT macros.
Patch by me, based on a suggestion from Robert Haas.
Discussion: https://postgr.es/m/1160385.1638165449@sss.pgh.pa.us

----

Previously, all pg_popcount{32,64} calls were indirected through
a function pointer, even though we had no fast implementation for
non-x86 platforms. Instead, for those platforms use wrappers around
the pg_popcount{32,64}_slow functions.
Review and additional hacking by David Rowley
Reviewed by Álvaro Herrera
Discussion: https://www.postgresql.org/message-id/flat/CAFBsxsE7otwnfA36Ly44zZO%2Bb7AEWHRFANxR1h1kxveEV%3DghLQ%40mail.gmail.com
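
So on non-x86 builds the functions become plain wrappers, resolvable at
link time; schematically (uint32/uint64 are the usual PostgreSQL
typedefs, and the _slow functions are the existing fallbacks):

    /* no function pointer on platforms without a fast variant */
    int
    pg_popcount32(uint32 word)
    {
        return pg_popcount32_slow(word);
    }

    int
    pg_popcount64(uint64 word)
    {
        return pg_popcount64_slow(word);
    }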

----

The point of introducing the hash_mem_multiplier GUC was to let users
reproduce the old behavior of hash aggregation, i.e. that it could use
more than work_mem at need. However, the implementation failed to get
the job done on Win64, where work_mem is clamped to 2GB to protect
various places that calculate memory sizes using "long int". As
written, the same clamp was applied to hash_mem. This resulted in
severe performance regressions for queries requiring a bit more than
2GB for hash aggregation, as they now spill to disk and there's no
way to stop that.
Getting rid of the work_mem restriction seems like a good idea, but
it's a big job and could not conceivably be back-patched. However,
there's only a fairly small number of places that are concerned with
the hash_mem value, and it turns out to be possible to remove the
restriction there without too much code churn or any ABI breaks.
So, let's do that for now to fix the regression, and leave the
larger task for another day.
This patch does introduce a bit more infrastructure that should help
with the larger task, namely pg_bitutils.h support for working with
size_t values.
Per gripe from Laurent Hasson. Back-patch to v13 where the
behavior change came in.
Discussion: https://postgr.es/m/997817.1627074924@sss.pgh.pa.us
Discussion: https://postgr.es/m/MN2PR15MB25601E80A9B6D1BA6F592B1985E39@MN2PR15MB2560.namprd15.prod.outlook.com
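
A hedged sketch of the size_t-based arithmetic this enables (close in
spirit to the helper the patch adds; work_mem, hash_mem_multiplier and
Min() are the usual PostgreSQL globals/macros, and the clamping details
here are illustrative):

    #include <stddef.h>
    #include <stdint.h>

    size_t
    get_hash_memory_limit(void)
    {
        double      mem_limit;

        /* work_mem is in kilobytes; scale it in double to avoid overflow */
        mem_limit = (double) work_mem * hash_mem_multiplier * 1024.0;

        /* clamp to something that certainly fits in size_t */
        mem_limit = Min(mem_limit, (double) SIZE_MAX);

        return (size_t) mem_limit;
    }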

----

Backpatch-through: 9.5

----

First pass of modifying various places that obtain the next power of 2
of a number, making them use the new functions added in pg_bitutils.h
instead.
This also removes the _hash_log2() function. There are no longer any
callers in core. Other users can swap their _hash_log2(n) call to make use
of pg_ceil_log2_32(n).
Author: David Fetter, with some minor adjustments by me
Reviewed-by: John Naylor, Jesse Zhang
Discussion: https://postgr.es/m/20200114173553.GE32763%40fetter.org
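
The mechanical shape of the swap, for a hypothetical caller:

    #include "port/pg_bitutils.h"

    /* before: nbuckets = 1 << _hash_log2(nitems);
     * after, with no hash-specific helper needed: */
    static uint32
    choose_nbuckets(uint32 nitems)
    {
        return ((uint32) 1) << pg_ceil_log2_32(nitems);
    }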

----

There are many areas in the code where we need to determine the next
highest power of 2 of a given number. We tend to do that in an ad-hoc
way each time, generally with a tight loop that shifts left once per
iteration until it reaches a number at or above the given one.
Here we add two generic functions which make use of the existing
pg_leftmost_one_pos* functions which, when available, will allow us to
calculate the next power of 2 without any looping.
Here we don't add any code which uses these new functions. That will be
done in follow-up commits.
Author: David Fetter, with some minor adjustments by me
Reviewed-by: John Naylor, Jesse Zhang
Discussion: https://postgr.es/m/20200114173553.GE32763%40fetter.org
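
The loop-free approach, sketched close to the committed
pg_nextpower2_32() (Assert() and PG_UINT32_MAX come from c.h):

    #include "port/pg_bitutils.h"

    static inline uint32
    pg_nextpower2_32(uint32 num)
    {
        Assert(num > 0 && num <= PG_UINT32_MAX / 2 + 1);

        /* a power of 2 is its own answer */
        if ((num & (num - 1)) == 0)
            return num;

        /* one bitscan instead of the old shift-in-a-loop */
        return ((uint32) 1) << (pg_leftmost_one_pos32(num) + 1);
    }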

----

Backpatch-through: update all files in master, backpatch legal files through 9.4

----

Our algorithm for choosing batch numbers turned out not to work
effectively for multi-billion key inner relations. We would use
more hash bits than we have, and effectively concentrate all tuples
into a smaller number of batches than we intended. While ideally
we should switch to wider hashes, for now, change the algorithm to
one that effectively gives up bits from the bucket number when we
don't have enough bits. That means we'll finish up with longer
bucket chains than would be ideal, but that's better than having
batches that don't fit in work_mem and can't be divided.
Back-patch to all supported releases.
Author: Thomas Munro
Reviewed-by: Tom Lane, thanks also to Tomas Vondra, Alvaro Herrera, Andres Freund for testing and discussion
Reported-by: James Coleman
Discussion: https://postgr.es/m/16104-dc11ed911f1ab9df%40postgresql.org
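
A simplified sketch of the revised split (condensed from
ExecHashGetBucketAndBatch(), using the pg_rotate_right32() helper from
pg_bitutils.h; names abbreviated here):

    #include "port/pg_bitutils.h"

    /* batch bits come from a rotation, so when the hash is too narrow the
     * bucket bits are reused instead of reading past the end of the hash */
    static inline void
    get_bucket_and_batch(uint32 hashvalue, int log2_nbuckets,
                         uint32 nbuckets, uint32 nbatch,
                         uint32 *bucketno, uint32 *batchno)
    {
        *bucketno = hashvalue & (nbuckets - 1);
        *batchno = pg_rotate_right32(hashvalue, log2_nbuckets) & (nbatch - 1);
    }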

----

Author: Alexander Lakhin
Discussion: https://postgr.es/m/0a5419ea-1452-a4e6-72ff-545b1a5a8076@gmail.com

----

Test for the compiler builtins __builtin_clz, __builtin_ctz, and
__builtin_popcount, and make use of these in preference to
handwritten C code if they're available. Create src/port
infrastructure for "leftmost one", "rightmost one", and "popcount"
so as to centralize these decisions.
On x86_64, __builtin_popcount generally won't make use of the POPCNT
opcode because that's not universally supported yet. Provide code
that checks CPUID and then calls POPCNT via asm() if available.
This requires indirecting through a function pointer, which is
an annoying amount of overhead for a one-instruction operation,
but it's probably not worth working harder than this for our
current use-cases.
I'm not sure we've found all the existing places that could profit
from this new infrastructure; but we at least touched all the
ones that used copied-and-pasted versions of the bitmapset.c code,
and got rid of multiple copies of the associated constant arrays.
While at it, replace c-compiler.m4's one-per-builtin-function
macros with a single one that can handle all the cases we need
to worry about so far. Also, because I'm paranoid, make those
checks into AC_LINK checks rather than just AC_COMPILE; the
former coding failed to verify that libgcc has support for the
builtin, in cases where it's not inline code.
David Rowley, Thomas Munro, Alvaro Herrera, Tom Lane
Discussion: https://postgr.es/m/CAKJS1f9WTAGG1tPeJnD18hiQW5gAk59fQ6WK-vfdAKEHyRg2RA@mail.gmail.com
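
The CPUID dance is short; with GCC/Clang it can be spelled like this
(POPCNT support is reported in bit 23 of ECX for CPUID leaf 1):

    #include <cpuid.h>
    #include <stdbool.h>

    static bool
    cpu_has_popcnt(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return false;
        return (ecx & (1U << 23)) != 0;
    }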

----

This reverts commits fc6c72747ae6, 109de05cbb03, d0b4663c23b7 and
711bab1e4d19.
Somebody will have to try harder before submitting this patch again.
I've spent entirely too much time on it already, and the #ifdef maze yet
to be written in order for it to build at all got on my nerves. The
amount of work needed to get a platform-specific performance improvement
that's barely above the noise level is not worth it.

----

Split out these new functions in three parts: one in a new file that
uses the compiler builtin and gets compiled with the -mpopcnt compiler
option if it exists; another one that uses the compiler builtin but not
the compiler option; and finally the fallback with open-coded
algorithms.
Split out the configure logic: in the original commit, it was selecting
to use the -mpopcnt compiler switch together with deciding whether to
use the compiler builtin, but those two things are really separate.
Split them out. Also, expose whether the builtin exists to
Makefile.global, so that src/port's Makefile can decide whether to
compile the hw-optimized file.
Remove CPUID test for CTZ/CLZ. Make pg_{right,left}most_ones use either
the compiler intrinsic or open-coded algo; trying to use the
HW-optimized version is a waste of time. Make them static inline
functions.
Discussion: https://postgr.es/m/20190213221719.GA15976@alvherre.pgsql
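
The point of the file split, in miniature (hypothetical file and
function names): only one translation unit gets the -mpopcnt flag, so
only there may the builtin compile down to the bare POPCNT opcode:

    /* pg_bitutils_hwpopcnt.c -- compiled with -mpopcnt when available */
    int
    pg_popcount32_hw(unsigned int word)
    {
        return __builtin_popcount(word);
    }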

----

These opcodes have been around in the AMD world since 2007, and 2008 in
the case of Intel. They're supported in GCC and Clang via some __builtin
macros. The opcodes may be unavailable during runtime, in which case we
fall back on a C-based implementation of the code. In order to get the
POPCNT instruction we must pass the -mpopcnt option to the compiler. We
do this only for the pg_bitutils.c file.
David Rowley (with fragments taken from a patch by Thomas Munro)
Discussion: https://postgr.es/m/CAKJS1f9WTAGG1tPeJnD18hiQW5gAk59fQ6WK-vfdAKEHyRg2RA@mail.gmail.com
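
The C-based fallback referred to is the classic SWAR popcount; the
64-bit form looks like this (a sketch, not necessarily the committed
spelling):

    #include <stdint.h>

    static int
    popcount64_slow(uint64_t word)
    {
        /* fold into 2-, then 4-, then 8-bit partial counts */
        word -= (word >> 1) & UINT64_C(0x5555555555555555);
        word = (word & UINT64_C(0x3333333333333333)) +
               ((word >> 2) & UINT64_C(0x3333333333333333));
        word = (word + (word >> 4)) & UINT64_C(0x0f0f0f0f0f0f0f0f);

        /* sum the byte counts into the top byte and extract it */
        return (int) ((word * UINT64_C(0x0101010101010101)) >> 56);
    }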