1 files changed, 100 insertions, 0 deletions
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
new file mode 100644
index 00000000000..519c9c9ebc0
--- /dev/null
+++ b/src/backend/storage/buffer/README
@@ -0,0 +1,100 @@
+$Header: /cvsroot/pgsql/src/backend/storage/buffer/README,v 1.1 2001/07/06 21:04:25 tgl Exp $
+
+Notes about shared buffer access rules
+--------------------------------------
+
+There are two separate access control mechanisms for shared disk buffers:
+reference counts (a/k/a pin counts) and buffer locks.  (Actually, there's
+a third level of access control: one must hold the appropriate kind of
+lock on a relation before one can legally access any page belonging to
+the relation.  Relation-level locks are not discussed here.)
+
+Pins: one must "hold a pin on" a buffer (increment its reference count)
+before being allowed to do anything at all with it.  An unpinned buffer is
+subject to being reclaimed and reused for a different page at any instant,
+so touching it is unsafe.  Typically a pin is acquired via ReadBuffer and
+released via WriteBuffer (if one modified the page) or ReleaseBuffer (if not).
+It is OK and indeed common for a single backend to pin a page more than
+once concurrently; the buffer manager handles this efficiently.  It is
+considered OK to hold a pin for long intervals --- for example, sequential
+scans hold a pin on the current page until done processing all the tuples
+on the page, which could be quite a while if the scan is the outer scan of
+a join.  Similarly, btree index scans hold a pin on the current index page.
+This is OK because normal operations never wait for a page's pin count to
+drop to zero.  (Anything that might need to do such a wait is instead
+handled by waiting to obtain the relation-level lock, which is why you'd
+better hold one first.)  Pins may not be held across transaction
+boundaries, however.
+
+Buffer locks: there are two kinds of buffer locks, shared and exclusive,
+which act just as you'd expect: multiple backends can hold shared locks on
+the same buffer, but an exclusive lock prevents anyone else from holding
+either shared or exclusive lock.  (These can alternatively be called READ
+and WRITE locks.)  These locks are short-term: they should not be held for
+long.  They are implemented as per-buffer spinlocks, so another backend
+trying to acquire a competing lock will spin as long as you hold yours!
+Buffer locks are acquired and released by LockBuffer().  It will *not* work
+for a single backend to try to acquire multiple locks on the same buffer.
+One must pin a buffer before trying to lock it.
+
+Buffer access rules:
+
+1. To scan a page for tuples, one must hold a pin and either shared or
+exclusive lock.  To examine the commit status (XIDs and status bits) of
+a tuple in a shared buffer, one must likewise hold a pin and either shared
+or exclusive lock.
+
+2. Once one has determined that a tuple is interesting (visible to the
+current transaction) one may drop the buffer lock, yet continue to access
+the tuple's data for as long as one holds the buffer pin.  This is what is
+typically done by heap scans, since the tuple returned by heap_fetch
+contains a pointer to tuple data in the shared buffer.  This is safe: the
+tuple cannot go away while the pin is held (see rule #5).  Its state could
+change, but that is assumed not to matter after the initial determination
+of visibility is made.
+
+3. To add a tuple or change the xmin/xmax fields of an existing tuple,
+one must hold a pin and an exclusive lock on the containing buffer.
+This ensures that no one else might see a partially-updated state of the
+tuple.
+
+4. It is considered OK to update tuple commit status bits (i.e., OR the
+values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
+HEAP_XMAX_INVALID into t_infomask) while holding only a shared lock and
+pin on a buffer.  This is OK because another backend looking at the tuple
+at about the same time would OR the same bits into the field, so there
+is little or no risk of conflicting update; what's more, if there did
+manage to be a conflict it would merely mean that one bit-update would
+be lost and need to be done again later.  These four bits are only hints
+(they cache the results of transaction status lookups in pg_log), so no
+great harm is done if they get reset to zero by conflicting updates.
+
+5. To physically remove a tuple or compact free space on a page, one
+must hold a pin and an exclusive lock, *and* observe while holding the
+exclusive lock that the buffer's shared reference count is one (i.e.,
+no other backend holds a pin).  If these conditions are met then no other
+backend can perform a page scan until the exclusive lock is dropped, and
+no other backend can be holding a reference to an existing tuple that it
+might expect to examine again.  Note that another backend might pin the
+buffer (increment the refcount) while one is performing the cleanup, but
+it won't be able to actually examine the page until it acquires shared
+or exclusive lock.
+
+
+As of 7.1, the only operation that removes tuples or compacts free space is
+(oldstyle) VACUUM.  It does not have to implement rule #5 directly, because
+it instead acquires exclusive lock at the relation level, which ensures
+indirectly that no one else is accessing pages of the relation at all.
+
+To implement concurrent VACUUM we will need to make it obey rule #5 fully.
+To do this, we'll create a new buffer manager operation
+LockBufferForCleanup() that gets an exclusive lock and then checks to see
+if the shared pin count is currently 1.  If not, it releases the exclusive
+lock (but not the caller's pin) and waits until signaled by another backend,
+whereupon it tries again.  The signal will occur when UnpinBuffer
+decrements the shared pin count to 1.  As indicated above, this operation
+might have to wait a good while before it acquires lock, but that shouldn't
+matter much for concurrent VACUUM.  The current implementation only
+supports a single waiter for pin-count-1 on any particular shared buffer.
+This is enough for VACUUM's use, since we don't allow multiple VACUUMs
+concurrently on a single relation anyway.
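
Illustrative sketch (not part of the patch): the cleanup-lock protocol described in the last paragraph of the README can be modeled with a small single-threaded toy. Everything below is an assumption made for illustration: `ToyBuffer` and the `toy_*` functions are hypothetical names, not buffer manager code, and blocking or signaling between backends is reduced to boolean return values. The model only shows the bookkeeping: take the exclusive lock, check that the shared pin count is 1, otherwise release the lock (keeping the pin) and register as the single waiter, which is woken when an unpin drops the count to 1.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for one shared buffer header; all names are hypothetical. */
typedef struct ToyBuffer
{
    int  refcount;          /* shared pin count across all backends */
    bool exclusive_locked;  /* true while some backend holds the exclusive lock */
    bool cleanup_waiter;    /* at most one backend may wait for pin-count-1 */
} ToyBuffer;

/* Take a pin: the buffer cannot be recycled while refcount > 0. */
static void
toy_pin(ToyBuffer *buf)
{
    buf->refcount++;
}

/*
 * Drop a pin.  Returns true when this unpin lowered the count to 1 while a
 * cleanup waiter was registered, i.e. the point at which the waiter would
 * be signaled.
 */
static bool
toy_unpin(ToyBuffer *buf)
{
    assert(buf->refcount > 0);
    buf->refcount--;
    if (buf->refcount == 1 && buf->cleanup_waiter)
    {
        buf->cleanup_waiter = false;
        return true;
    }
    return false;
}

/*
 * One attempt at the cleanup lock.  The caller must already hold a pin.
 * Returns true with the exclusive lock held if the caller's pin is the only
 * one.  Otherwise the lock is released, the caller's pin is kept, the caller
 * is recorded as the waiter, and false is returned so it can retry later.
 */
static bool
toy_try_lock_for_cleanup(ToyBuffer *buf)
{
    assert(buf->refcount > 0);
    if (buf->exclusive_locked)
        return false;           /* a real backend would wait for the lock here */
    buf->exclusive_locked = true;
    if (buf->refcount == 1)
        return true;            /* rule #5 satisfied: no other pin holders */
    buf->exclusive_locked = false;
    buf->cleanup_waiter = true;
    return false;
}

/* Release the exclusive lock; the pin is released separately. */
static void
toy_unlock(ToyBuffer *buf)
{
    buf->exclusive_locked = false;
}
```

In this model, a cleaner and a scanner that both pin the buffer make the first `toy_try_lock_for_cleanup` call fail and register the waiter; once the scanner calls `toy_unpin`, that call reports that the waiter should be signaled and a retry succeeds. The boolean waiter flag mirrors the single-waiter restriction stated in the README.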