Diffstat (limited to 'src/backend/storage/buffer/README')
-rw-r--r-- | src/backend/storage/buffer/README | 100 |
1 files changed, 100 insertions, 0 deletions
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
new file mode 100644
index 00000000000..519c9c9ebc0
--- /dev/null
+++ b/src/backend/storage/buffer/README
@@ -0,0 +1,100 @@
+$Header: /cvsroot/pgsql/src/backend/storage/buffer/README,v 1.1 2001/07/06 21:04:25 tgl Exp $
+
+Notes about shared buffer access rules
+--------------------------------------
+
+There are two separate access control mechanisms for shared disk buffers:
+reference counts (a/k/a pin counts) and buffer locks. (Actually, there's
+a third level of access control: one must hold the appropriate kind of
+lock on a relation before one can legally access any page belonging to
+the relation. Relation-level locks are not discussed here.)
+
+Pins: one must "hold a pin on" a buffer (increment its reference count)
+before being allowed to do anything at all with it. An unpinned buffer is
+subject to being reclaimed and reused for a different page at any instant,
+so touching it is unsafe. Typically a pin is acquired via ReadBuffer and
+released via WriteBuffer (if one modified the page) or ReleaseBuffer (if not).
+It is OK and indeed common for a single backend to pin a page more than
+once concurrently; the buffer manager handles this efficiently. It is
+considered OK to hold a pin for long intervals --- for example, sequential
+scans hold a pin on the current page until done processing all the tuples
+on the page, which could be quite a while if the scan is the outer scan of
+a join. Similarly, btree index scans hold a pin on the current index page.
+This is OK because normal operations never wait for a page's pin count to
+drop to zero. (Anything that might need to do such a wait is instead
+handled by waiting to obtain the relation-level lock, which is why you'd
+better hold one first.) Pins may not be held across transaction
+boundaries, however.
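
Illustration (not part of the patch): the pin rule described above can be
sketched as a plain reference count that blocks reuse of the buffer. This is
a toy model only. ToyBuffer, toy_pin, toy_unpin and toy_try_reclaim are
hypothetical names invented for this sketch; they are not the buffer
manager's API, and the real ReadBuffer/ReleaseBuffer do considerably more.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, not PostgreSQL's bufmgr API. */
typedef struct ToyBuffer
{
    int page_id;   /* page currently held in this buffer */
    int refcount;  /* pin count; 0 means the buffer may be reused */
} ToyBuffer;

/* Take a pin. The same caller may pin the same buffer repeatedly. */
static void
toy_pin(ToyBuffer *buf)
{
    buf->refcount++;
}

/* Drop one pin. Each toy_pin must be matched by one toy_unpin. */
static void
toy_unpin(ToyBuffer *buf)
{
    assert(buf->refcount > 0);
    buf->refcount--;
}

/*
 * Try to reuse the buffer for a different page. This succeeds only when
 * nobody holds a pin, which is why touching an unpinned buffer is unsafe:
 * its contents may be replaced at any moment.
 */
static bool
toy_try_reclaim(ToyBuffer *buf, int new_page_id)
{
    if (buf->refcount > 0)
        return false;
    buf->page_id = new_page_id;
    return true;
}
```

In this sketch a buffer pinned twice stays protected until both pins are
dropped; only then does toy_try_reclaim succeed.
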
+
+Buffer locks: there are two kinds of buffer locks, shared and exclusive,
+which act just as you'd expect: multiple backends can hold shared locks on
+the same buffer, but an exclusive lock prevents anyone else from holding
+either shared or exclusive lock. (These can alternatively be called READ
+and WRITE locks.) These locks are short-term: they should not be held for
+long. They are implemented as per-buffer spinlocks, so another backend
+trying to acquire a competing lock will spin as long as you hold yours!
+Buffer locks are acquired and released by LockBuffer(). It will *not* work
+for a single backend to try to acquire multiple locks on the same buffer.
+One must pin a buffer before trying to lock it.
+
+Buffer access rules:
+
+1. To scan a page for tuples, one must hold a pin and either shared or
+exclusive lock. To examine the commit status (XIDs and status bits) of
+a tuple in a shared buffer, one must likewise hold a pin and either shared
+or exclusive lock.
+
+2. Once one has determined that a tuple is interesting (visible to the
+current transaction) one may drop the buffer lock, yet continue to access
+the tuple's data for as long as one holds the buffer pin. This is what is
+typically done by heap scans, since the tuple returned by heap_fetch
+contains a pointer to tuple data in the shared buffer. Therefore the
+tuple cannot go away while the pin is held (see rule #5). Its state could
+change, but that is assumed not to matter after the initial determination
+of visibility is made.
+
+3. To add a tuple or change the xmin/xmax fields of an existing tuple,
+one must hold a pin and an exclusive lock on the containing buffer.
+This ensures that no one else might see a partially-updated state of the
+tuple.
+
+4. It is considered OK to update tuple commit status bits (ie, OR the
+values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
+HEAP_XMAX_INVALID into t_infomask) while holding only a shared lock and
+pin on a buffer. This is OK because another backend looking at the tuple
+at about the same time would OR the same bits into the field, so there
+is little or no risk of conflicting update; what's more, if there did
+manage to be a conflict it would merely mean that one bit-update would
+be lost and need to be done again later. These four bits are only hints
+(they cache the results of transaction status lookups in pg_log), so no
+great harm is done if they get reset to zero by conflicting updates.
+
+5. To physically remove a tuple or compact free space on a page, one
+must hold a pin and an exclusive lock, *and* observe while holding the
+exclusive lock that the buffer's shared reference count is one (ie,
+no other backend holds a pin). If these conditions are met then no other
+backend can perform a page scan until the exclusive lock is dropped, and
+no other backend can be holding a reference to an existing tuple that it
+might expect to examine again. Note that another backend might pin the
+buffer (increment the refcount) while one is performing the cleanup, but
+it won't be able to actually examine the page until it acquires shared
+or exclusive lock.
+
+
+As of 7.1, the only operation that removes tuples or compacts free space is
+(oldstyle) VACUUM. It does not have to implement rule #5 directly, because
+it instead acquires exclusive lock at the relation level, which ensures
+indirectly that no one else is accessing pages of the relation at all.
+
+To implement concurrent VACUUM we will need to make it obey rule #5 fully.
+To do this, we'll create a new buffer manager operation
+LockBufferForCleanup() that gets an exclusive lock and then checks to see
+if the shared pin count is currently 1. If not, it releases the exclusive
+lock (but not the caller's pin) and waits until signaled by another backend,
+whereupon it tries again. The signal will occur when UnpinBuffer
+decrements the shared pin count to 1. As indicated above, this operation
+might have to wait a good while before it acquires lock, but that shouldn't
+matter much for concurrent VACUUM. The current implementation only
+supports a single waiter for pin-count-1 on any particular shared buffer.
+This is enough for VACUUM's use, since we don't allow multiple VACUUMs
+concurrently on a single relation anyway.
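
Illustration (not part of the patch): the retry protocol described for
LockBufferForCleanup() can be sketched as a single-threaded toy model. It
takes the exclusive lock, checks whether the pin count is 1, and otherwise
backs off and waits for a signal from whoever drops the pin count to 1.
Everything here is an assumption made for illustration. ToyCleanupBuffer,
toy_try_lock_for_cleanup and toy_release_pin are hypothetical names, the
"signal" is just a flag, and real waiting, spinlocks and inter-backend
signaling are not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy, single-threaded model; all names here are hypothetical. */
typedef struct ToyCleanupBuffer
{
    int  pins;              /* shared pin count across all backends */
    bool exclusive_locked;  /* true while the exclusive lock is held */
    bool waiter_present;    /* at most one pin-count-1 waiter is modeled */
    bool waiter_signaled;   /* set when the waiter should try again */
} ToyCleanupBuffer;

/*
 * One attempt at the cleanup lock. The caller must already hold a pin.
 * Returns true, with the exclusive lock held, if the caller's pin is the
 * only one. Otherwise drops the lock, keeps the pin, registers the caller
 * as the single waiter, and returns false.
 */
static bool
toy_try_lock_for_cleanup(ToyCleanupBuffer *buf)
{
    assert(buf->pins >= 1);
    /* The toy assumes the lock is free; a real backend would block here. */
    assert(!buf->exclusive_locked);

    buf->exclusive_locked = true;
    if (buf->pins == 1)
        return true;                /* rule #5 is satisfied */

    buf->exclusive_locked = false;  /* release the lock but not the pin */
    buf->waiter_present = true;
    buf->waiter_signaled = false;
    return false;
}

/* Some other backend drops its pin; wake the waiter when the count hits 1. */
static void
toy_release_pin(ToyCleanupBuffer *buf)
{
    assert(buf->pins > 1);
    buf->pins--;
    if (buf->pins == 1 && buf->waiter_present)
    {
        buf->waiter_present = false;
        buf->waiter_signaled = true;
    }
}
```

In this sketch a caller whose first attempt fails would wait until
waiter_signaled is set and then call toy_try_lock_for_cleanup again. With
three pins outstanding, the second attempt succeeds only after two calls to
toy_release_pin.
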