| Commit message | Author | Age |
| |
Pinning/unpinning a buffer is a very frequent operation, especially in
read-mostly, cache-resident workloads. Benchmarking shows that in various
scenarios the spinlock protecting a buffer header's state becomes a
significant bottleneck. The problem can be reproduced with pgbench -S on
larger machines, but can be considerably worse for queries which touch
the same buffers over and over at a high frequency (e.g. nested loops
over a small inner table).
To allow atomic operations to be used, cram BufferDesc's flags,
usage_count, buf_hdr_lock, and refcount into a single 32-bit atomic
variable; that allows them to be manipulated together using 32-bit
compare-and-swap operations. This requires reducing MAX_BACKENDS to
2^18-1 (which could be lifted by using a 64-bit field, but that's not a
realistic configuration at the moment).
As not all operations can easily be implemented in a lock-free manner,
implement the previous buf_hdr_lock via a flag bit in the atomic
variable. That way we can continue to lock the header in places where
it's needed, but can get away without acquiring it in the more frequent
hot paths. There are some additional operations that could be done
without the lock but aren't in this patch; the most important places are
covered.
As bufmgr.c now essentially re-implements spinlocks, abstract the delay
logic from s_lock.c into something more generic. It already has two
users, and more are coming; there's a follow-up patch for lwlock.c at
least.
This patch is based on a proof-of-concept written by me, which Alexander
Korotkov made into a fully working patch; the committed version was again
revised by me. Benchmarking and testing have, amongst others, been
provided by Dilip Kumar, Alexander Korotkov, and Robert Haas.
On a large x86 system, improvements of up to a factor of 8 have been
observed for read-only pgbench with a high client count.
Author: Alexander Korotkov and Andres Freund
Discussion: 2400449.GjM57CE0Yg@dinodell
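A minimal C sketch of the packed state word described above, assuming
PostgreSQL's port/atomics.h; the bit layout follows the description
(18-bit refcount, 4-bit usage count, flags including the lock bit), but
the macro and function names here are illustrative, not the committed
definitions:

    /* 32-bit state: bits 0-17 refcount, 18-21 usage count, 22+ flags. */
    #define BUF_REFCOUNT_MASK   ((1U << 18) - 1) /* why MAX_BACKENDS <= 2^18-1 */
    #define BUF_USAGECOUNT_ONE  (1U << 18)
    #define BM_LOCKED           (1U << 22)       /* replaces buf_hdr_lock */

    /* Pin a buffer with a single compare-and-swap, no spinlock. */
    static bool
    try_pin_buffer(pg_atomic_uint32 *state)
    {
        uint32 old = pg_atomic_read_u32(state);

        while (!(old & BM_LOCKED))
        {
            /* The refcount lives in the low bits, so +1 increments it. */
            if (pg_atomic_compare_exchange_u32(state, &old, old + 1))
                return true;
            /* CAS failure reloads 'old' with the current value; retry. */
        }
        return false;           /* header "locked"; caller must wait */
    }

On success this is a single atomic instruction on the hot path, versus
acquire-spinlock / modify / release-spinlock before.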
| |
Backpatch certain files through 9.1
| |
Prior to commit 0709b7ee72e4bc71ad07b7120acd117265ab51d0, access to
variables within a spinlock-protected critical section had to be done
through a volatile pointer, but that should no longer be necessary.
Review by Andres Freund
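A sketch of the kind of simplification this allows (names approximate
bufmgr.c's; shown for illustration rather than as the committed diff):

    /* Before 0709b7ee: volatile qualification stopped the compiler from
     * reordering accesses into or out of the critical section. */
    volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);

    /* After: SpinLockAcquire/SpinLockRelease act as compiler barriers,
     * so a plain pointer is sufficient. */
    BufferDesc *hdr = GetBufferDescriptor(buf_id);

    SpinLockAcquire(&hdr->buf_hdr_lock);
    hdr->flags |= BM_DIRTY;
    SpinLockRelease(&hdr->buf_hdr_lock);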
| |
Use "a" and "an" correctly, mostly in comments. Two error messages were
also fixed (they were just elogs, so no translation work required). Two
function comments in pg_proc.h were also fixed. Etsuro Fujita reported one
of these, but I found a lot more with grep.
Also fix a few other typos spotted while grepping for the a/an typos.
For example, "consists out of ..." -> "consists of ...". Plus a "though"/
"through" mixup reported by Euler Taveira.
Many of these typos were in old code, which would be nice to backpatch to
make future backpatching easier. But much of the code was new, and I didn't
feel like crafting separate patches for each branch. So no backpatching.
| |
I think we're about done with this...
| |
Benchmarks have shown that aligning the buffer descriptor array to
cache lines is important for scalability, especially on bigger,
multi-socket machines.
Currently the array sometimes happens to be aligned by happenstance,
depending on how large previous shared memory allocations were. That
can lead to wildly varying performance results after minor
configuration changes.
In addition to aligning the start of the descriptor array, also force
the size of individual descriptors to be a common cache line size (64
bytes). That happens to already be the case on 64-bit platforms, but
this way we can change struct BufferDesc more easily.
As the alignment primarily matters in highly concurrent workloads,
which these days are probably all 64-bit, and the space wastage of
element alignment would be a bit more noticeable on 32-bit systems, we
don't force the stride to be cache-line sized on 32-bit platforms for
now. If somebody does actual performance testing, we can reevaluate
that decision by changing the definition of BUFFERDESC_PADDED_SIZE.
Discussion: 20140202151319.GD32123@awork2.anarazel.de
Per discussion with Bruce Momjian, Tom Lane, Robert Haas, and Peter
Geoghegan.
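The element-size trick can be sketched as a padded union, close in
spirit to what buf_internals.h gained here (the names and constant are
taken from the description above, not verified definitions):

    #define BUFFERDESC_PADDED_SIZE 64   /* common cache line size */

    /* Pad each descriptor to a full cache line so two descriptors never
     * share a line and ping-pong it between sockets. */
    typedef union BufferDescPadded
    {
        BufferDesc  bufferdesc;
        char        pad[BUFFERDESC_PADDED_SIZE];
    } BufferDescPadded;

The array start is then also allocated cache-line aligned, so element i
begins exactly at a line boundary.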
| |
Since their introduction, latches have required barriers in SetLatch
and ResetLatch, but when they were introduced there wasn't any
barrier abstraction. Instead, latches were documented to rely on the
call sites to provide barrier semantics.
Now that the barrier support looks halfway complete, add the necessary
barriers to both latch implementations.
Also remove a now-superfluous lock acquisition from syncrep.c and a
superfluous (and insufficient) barrier from freelist.c. There might be
other cases that can now be simplified, but those are the only ones
I've seen on a quick scan.
We might want to backpatch this at some later point, but right now the
barrier infrastructure in the back branches isn't totally on par with
master.
Discussion: 20150112154026.GB2092@awork2.anarazel.de
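The ordering bug being closed, as a generic sketch (shared->work_available
and the surrounding loop are hypothetical, not committed code):

    /* Waker: the flag store must be visible before the wakeup.
     * Callers used to need an explicit barrier here; SetLatch()
     * now provides it. */
    shared->work_available = true;
    SetLatch(&worker->latch);

    /* Waiter: ResetLatch() now also acts as a barrier, so the flag
     * check can't be reordered before the reset, which could
     * otherwise miss a wakeup. */
    for (;;)
    {
        ResetLatch(MyLatch);
        if (shared->work_available)
            break;
        WaitLatch(MyLatch, WL_LATCH_SET, 0);
    }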
| |
Backpatch certain files through 9.0
| |
StrategyGetBuffer() has proven to be a bottleneck in a number of
buffer-acquisition-heavy workloads. To some degree this has already
been alleviated by 5d7962c6, but it can still be quite a heavy
bottleneck. The problem is that in unfortunate usage patterns a
single StrategyGetBuffer() call will have to look at a large number of
buffers, in turn making it likely that the process will be put to
sleep while still holding the spinlock.
Replace most of the usage of the buffer_strategy_lock spinlock for the
clock sweep with an atomic nextVictimBuffer variable. That variable,
modulo NBuffers, is the current hand of the clock sweep. The clock
sweep then only needs to acquire the spinlock after a wraparound, and
even then only in the process that did the wrapping around. That
alleviates nearly all the contention on the relevant spinlock, although
significant contention on the cacheline can still exist.
Reviewed-By: Robert Haas and Amit Kapila
Discussion: 20141010160020.GG6670@alap3.anarazel.de,
20141027133218.GA2639@awork2.anarazel.de
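In sketch form the lock-free tick looks roughly like this (simplified
from freelist.c's ClockSweepTick(); the committed code folds the counter
back with a CAS loop rather than a plain subtraction):

    static uint32
    clock_sweep_tick(void)
    {
        /* Advance the hand with no lock at all. */
        uint32 victim =
            pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);

        if (victim >= NBuffers)
        {
            uint32 wrapped = victim % NBuffers;

            /* Only the backend that observes the exact wraparound takes
             * the spinlock, to pull the counter back into range and to
             * bump the pass counter the bgwriter feeds on. */
            if (wrapped == 0)
            {
                SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
                StrategyControl->completePasses++;
                pg_atomic_fetch_sub_u32(&StrategyControl->nextVictimBuffer,
                                        NBuffers);
                SpinLockRelease(&StrategyControl->buffer_strategy_lock);
            }
            victim = wrapped;
        }
        return victim;
    }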
| |
Previously, we used an lwlock that was held from the time we began
seeking a candidate buffer until the time when we found and pinned
one, which is disastrous for concurrency. Instead, use a spinlock
which is held just long enough to pop the freelist or advance the
clock sweep hand, and then released. If we need to advance the clock
sweep further, we reacquire the spinlock once per buffer.
This represents a significant increase in atomic operations around
buffer eviction, but it still wins on many workloads. On others, it
may result in no gain, or even cause a regression, unless the number
of buffer mapping locks is also increased. However, that seems like
material for a separate commit. We may also need to consider other
methods of mitigating contention on this spinlock, such as splitting
it into multiple locks or jumping the clock sweep hand more than one
buffer at a time, but those, too, seem like separate improvements.
Patch by me, inspired by a much larger patch from Amit Kapila.
Reviewed by Andres Freund.
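Roughly, the new victim search works like this (a simplified sketch,
not the committed code):

    /* Hold the spinlock only across the hand advance itself; the buffer
     * is then examined, and possibly pinned, without it. */
    for (;;)
    {
        SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
        buf = GetBufferDescriptor(StrategyControl->nextVictimBuffer);
        if (++StrategyControl->nextVictimBuffer >= NBuffers)
            StrategyControl->nextVictimBuffer = 0;
        SpinLockRelease(&StrategyControl->buffer_strategy_lock);

        /* Outside the strategy lock: the real code takes the buffer
         * header spinlock before inspecting or decrementing these. */
        if (buf->refcount == 0)
        {
            if (buf->usage_count == 0)
                return buf;     /* candidate; caller pins and rechecks */
            buf->usage_count--; /* age it and keep sweeping */
        }
    }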
| |
This includes removing tabs after periods in C comments, which was
applied to back branches, so this change should not affect backpatching.
| |
Update all files in head, and files COPYRIGHT and legal.sgml in all back
branches.
| |
Fully update git head, and update back branches in ./COPYRIGHT and
legal.sgml files.
| |
The Solaris Studio compiler warns about these instances, unlike more
mainstream compilers such as gcc. But manual inspection showed that
the code is clearly not reachable, and we hope no worthy compiler will
complain about removing this code.
| |
commit-fest.
| |
Commit 6d90eaaa89a007e0d365f49d6436f35d2392cfeb added a hibernation mode
to the bgwriter to reduce the server's idle-power consumption. However,
its interaction with the detailed behavior of BgBufferSync's feedback
control loop wasn't very well thought out. That control loop depends
primarily on the rate of buffer allocation, not the rate of buffer
dirtying, so the hibernation mode has to be designed to operate only when
no new buffer allocations are happening. Also, the check for whether the
system is effectively idle was not quite right and would fail to detect
a constant low level of activity, thus allowing the bgwriter to go into
hibernation mode in a way that would let the cycle time vary quite a bit,
possibly further confusing the feedback loop. To fix, move the wakeup
support from MarkBufferDirty and SetBufferCommitInfoNeedsSave into
StrategyGetBuffer, and prevent the bgwriter from entering hibernation mode
unless no buffer allocations have happened recently.
In addition, fix the delaying logic to remove the problem of possibly not
responding to signals promptly, which was basically caused by trying to use
the process latch's is_set flag for multiple purposes. I can't prove it
but I'm suspicious that that hack was responsible for the intermittent
"postmaster does not shut down" failures we've been seeing in the buildfarm
lately. In any case it did nothing to improve the readability or
robustness of the code.
In passing, express the hibernation sleep time as a multiplier on
BgWriterDelay, not a constant. I'm not sure whether there's any value in
exposing the longer sleep time as an independently configurable setting,
but we can at least make it act like this for little extra code.
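The multiplier change boils down to something like this in the
bgwriter's main loop (a sketch; HIBERNATE_FACTOR and the event mask are
approximate, not quoted from the commit):

    #define HIBERNATE_FACTOR 50        /* multiplier on BgWriterDelay */

    /* Sleep much longer once no buffer allocations have happened
     * recently; SetLatch from StrategyGetBuffer ends the long sleep
     * as soon as allocations resume. */
    WaitLatch(&MyProc->procLatch,
              WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
              can_hibernate ? BgWriterDelay * HIBERNATE_FACTOR
                            : BgWriterDelay);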
| |
of the 256KB limit originally enforced by a patch committed 2008-11-06.
Per recent test results, the smaller size resulted in an undesirable decrease
in bulk data loading speed, due to COPY processing frequently getting blocked
for WAL flushing. This area might need more tweaking later, but this setting
seems to be good enough for 8.4.
| |
(but not locked, as that would risk deadlocks). Also, make it work in a small
ring of buffers to avoid having bulk inserts trash the whole buffer arena.
Robert Haas, after an idea of Simon Riggs'.
| |
avoid this problem in the future.)
| |
buffers that cannot possibly need to be cleaned, and estimates how many
buffers it should try to clean based on moving averages of recent allocation
requests and density of reusable buffers. The patch also adds a couple
more columns to pg_stat_bgwriter to help measure the effectiveness of the
bgwriter.
Greg Smith, building on his own work and ideas from several other people,
in particular a much older patch from Itagaki Takahiro.
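The estimation step can be sketched as an exponential moving average
(variable names hypothetical; the smoothing constant is illustrative):

    static float smoothed_alloc = 0;    /* moving average of allocs/cycle */

    static int
    estimate_upcoming_allocs(uint32 recent_alloc, double lru_multiplier)
    {
        const int smoothing_samples = 16;   /* illustrative constant */

        /* Fold this cycle's allocation count into the moving average. */
        smoothed_alloc += ((float) recent_alloc - smoothed_alloc) /
                          smoothing_samples;

        /* Clean far enough ahead of the clock sweep hand to cover the
         * predicted demand, scaled by a configurable multiplier. */
        return (int) (smoothed_alloc * lru_multiplier);
    }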
| |
when multiple backends are scanning the same relation concurrently, each page
is (ideally) read only once.
Jeff Davis, with review by Heikki and Tom.
| |
buffers, rather than blowing out the whole shared-buffer arena. Aside from
avoiding cache spoliation, this fixes the problem that VACUUM formerly tended
to cause a WAL flush for every page it modified, because we had it hacked to
use only a single buffer. Those flushes will now occur only once per
ring-ful. The exact ring size, and the threshold for seqscans to switch into
the ring usage pattern, remain under debate; but the infrastructure seems
done. The key bit of infrastructure is a new optional BufferAccessStrategy
object that can be passed to ReadBuffer operations; this replaces the former
StrategyHintVacuum API.
This patch also changes the buffer usage-count methodology a bit: we now
advance usage_count when first pinning a buffer, rather than when last
unpinning it. To preserve the behavior that a buffer's lifetime starts to
decrease when it's released, the clock sweep code is modified to not decrement
usage_count of pinned buffers.
Work not done in this commit: teach GiST and GIN indexes to use the vacuum
BufferAccessStrategy for vacuum-driven fetches.
Original patch by Simon, reworked by Heikki and again by Tom.
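The strategy object amounts to a small ring of buffer IDs (a sketch of
the private struct in freelist.c; field names illustrative):

    typedef struct BufferAccessStrategyData
    {
        BufferAccessStrategyType btype; /* e.g. BAS_VACUUM, BAS_BULKREAD */
        int         ring_size;          /* number of slots in the ring */
        int         current;            /* next slot to hand out */
        Buffer      buffers[1];         /* ring of recently used buffers */
    } BufferAccessStrategyData;

When the next slot's buffer is unpinned and its usage count hasn't been
bumped by someone else, it is simply reused; otherwise the normal
allocation path runs and the new buffer replaces the slot's entry,
keeping the working set bounded to ring_size pages.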
| |
back-stamped for this.
| |
BufferAlloc tries to insert a new mapping entry before deleting the old one
for a buffer, we have a transient need for more than NBuffers entries ---
one more in 8.1, and as many as NUM_BUFFER_PARTITIONS more in CVS HEAD.
In theory this could lead to an "out of shared memory" failure if shmem
had already been completely claimed by the time the extra entries were
needed.
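In sketch form the fix is just headroom in the table-size arithmetic
(per buf_table.c's conventions; shown as an illustration of the idea):

    /* Allow for the transient extra entries BufferAlloc needs while both
     * the old and the new mapping are present at once. */
    InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);

This sizes the shared buffer lookup hash for NBuffers plus one spare
entry per partition, instead of exactly NBuffers.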
| |
pointers, to ensure that compilers won't rearrange accesses to occur
while we're not holding the buffer header spinlock. It's probably
not necessary to mark volatile in every single place in bufmgr.c,
but better safe than sorry. Per trouble report from Kevin Grittner.
| |
to 'Size' (that is, size_t), and install overflow detection checks in it.
This allows us to remove the former arbitrary restrictions on NBuffers
etc. It won't make any difference in a 32-bit machine, but in a 64-bit
machine you could theoretically have terabytes of shared buffers.
(How efficiently we could manage 'em remains to be seen.) Similarly,
num_temp_buffers, work_mem, and maintenance_work_mem can be set above
2Gb on a 64-bit machine. Original patch from Koichi Suzuki, additional
work by moi.
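The overflow checks follow this pattern (close to shmem.c's add_size();
shown here as a sketch):

    /* Add two Size values, erroring out instead of wrapping around. */
    Size
    add_size(Size s1, Size s2)
    {
        Size        result = s1 + s2;

        if (result < s1 || result < s2)
            ereport(ERROR,
                    (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                     errmsg("requested shared memory size overflows size_t")));
        return result;
    }

A matching mul_size() guards multiplications the same way, so
calculations like NBuffers * BLCKSZ can't silently wrap on 64-bit
configurations.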
| |
the freelist, plus per-buffer spinlocks that protect access to individual
shared buffer headers. This requires abandoning a global freelist (since
the freelist is a global contention point), which shoots down ARC and 2Q
as well as plain LRU management. Adopt a clock sweep algorithm instead.
Preliminary results show substantial improvement in multi-backend situations.
| |
This refactoring does not change any algorithms or data structures; it
just removes visibility of the ARC data structures from other source files.
| |
Also performed an initial run through of upgrading our Copyright date to
extend to 2005 ... first run here was very simple ... change everything
where: grep 1996-2004 && the word 'Copyright' ... scanned through the
generated list with 'less' first, and after, to make sure that I only
picked up the right entries ...
| |
as per recent discussions. Invent SubTransactionIds that are managed like
CommandIds (ie, counter is reset at start of each top transaction), and
use these instead of TransactionIds to keep track of subtransaction status
in those modules that need it. This means that a subtransaction does not
need an XID unless it actually inserts/modifies rows in the database.
Accordingly, don't assign it an XID nor take a lock on the XID until it
tries to do that. This saves a lot of overhead for subtransactions that
are only used for error recovery (eg plpgsql exceptions). Also, arrange
to release a subtransaction's XID lock as soon as the subtransaction
exits, in both the commit and abort cases. This avoids holding many
unique locks after a long series of subtransactions. The price is some
additional overhead in XactLockTableWait, but that seems acceptable.
Finally, restructure the state machine in xact.c to have a more orthogonal
set of states for subtransactions.
| |
Bug is only latent given that sole caller is passing NBuffers, but it
could bite someone in the rear someday.
| |
place of time_t, as per prior discussion. The behavior does not change
on machines without a 64-bit-int type, but on machines with one, which
is most, we are rid of the bizarre boundary behavior at the edges of
the 32-bit-time_t range (1901 and 2038). The system will now treat
times over the full supported timestamp range as being in your local
time zone. It may seem a little bizarre to consider that times in
4000 BC are PST or EST, but this is surely at least as reasonable as
propagating Gregorian calendar rules back that far.
I did not modify the format of the zic timezone database files, which
means that for the moment the system will not know about daylight-savings
periods outside the range 1901-2038. Given the way the files are set up,
it's not a simple decision like 'widen to 64 bits'; we have to actually
think about the range of years that need to be supported. We should
probably inquire what the plans of the upstream zic people are before
making any decisions of our own.
| |
of whether we have successfully read data into a buffer; this makes the
error behavior a bit more transparent (IMHO anyway), and also makes it
work correctly for local buffers which don't use Start/TerminateBufferIO.
Collapse three separate functions for writing a shared buffer into one.
This overlaps a bit with cleanups that Neil proposed a while back but
seems not to have committed yet.
| |
of VACUUM cases so that VACUUM requests don't affect the ARC state at all,
avoid corner case where BufferSync would uselessly rewrite a buffer that
no longer contains the page that was to be flushed. Make some minor
other cleanups in and around the bufmgr as well, such as moving PinBuffer
and UnpinBuffer into bufmgr.c where they really belong.
| |
for already empty buffers because their buffer tag was not cleared out
when the buffers were invalidated earlier.
Also removed the misnamed BM_FREE bufhdr flag and replaced the checks,
which effectively ask whether the buffer is unpinned, with checks against
the refcount field.
Jan
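A sketch of the replacement test (the macro name is hypothetical):

    /* BM_FREE is gone: "may this buffer be reused?" is now simply a
     * refcount test against the header. */
    #define BUFFER_IS_UNPINNED(bufHdr) ((bufHdr)->refcount == 0)

    /* ...and invalidation now clears the tag, so an empty buffer can
     * never match a stale lookup: */
    CLEAR_BUFFERTAG(bufHdr->tag);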
| |
Jan