$PostgreSQL: pgsql/src/backend/storage/buffer/README,v 1.7.4.1 2005/03/03 16:47:43 tgl Exp $

Notes about shared buffer access rules
--------------------------------------

There are two separate access control mechanisms for shared disk buffers:
reference counts (a/k/a pin counts) and buffer locks.  (Actually, there's
a third level of access control: one must hold the appropriate kind of
lock on a relation before one can legally access any page belonging to
the relation.  Relation-level locks are not discussed here.)

Pins: one must "hold a pin on" a buffer (increment its reference count)
before being allowed to do anything at all with it.  An unpinned buffer
is subject to being reclaimed and reused for a different page at any
instant, so touching it is unsafe.  Typically a pin is acquired via
ReadBuffer and released via WriteBuffer (if one modified the page) or
ReleaseBuffer (if not).  It is OK and indeed common for a single backend
to pin a page more than once concurrently; the buffer manager handles
this efficiently.  It is considered OK to hold a pin for long intervals
--- for example, sequential scans hold a pin on the current page until
done processing all the tuples on the page, which could be quite a while
if the scan is the outer scan of a join.  Similarly, btree index scans
hold a pin on the current index page.  This is OK because normal
operations never wait for a page's pin count to drop to zero.  (Anything
that might need to do such a wait is instead handled by waiting to
obtain the relation-level lock, which is why you'd better hold one
first.)  Pins may not be held across transaction boundaries, however.

Buffer locks: there are two kinds of buffer locks, shared and exclusive,
which act just as you'd expect: multiple backends can hold shared locks
on the same buffer, but an exclusive lock prevents anyone else from
holding either shared or exclusive lock.  (These can alternatively be
called READ and WRITE locks.)  These locks are intended to be
short-term: they should not be held for long.  Buffer locks are acquired
and released by LockBuffer().  It will *not* work for a single backend
to try to acquire multiple locks on the same buffer.  One must pin a
buffer before trying to lock it.
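As a concrete illustration, here is a minimal sketch of the usual
read-only calling sequence (error recovery is omitted, and "reln" and
"blkno" are assumed to identify the page of interest):

	Buffer	buf;
	Page	page;

	buf = ReadBuffer(reln, blkno);		/* acquires a pin */
	LockBuffer(buf, BUFFER_LOCK_SHARE);	/* pin plus lock: OK to scan */
	page = BufferGetPage(buf);

	/* ... examine tuples on "page" ... */

	LockBuffer(buf, BUFFER_LOCK_UNLOCK);	/* drop the lock, keep the pin */
	ReleaseBuffer(buf);			/* drop the pin when done */

A writer takes BUFFER_LOCK_EXCLUSIVE instead, and finishes with
WriteBuffer(), which marks the page dirty and releases the pin.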
Buffer access rules:

1. To scan a page for tuples, one must hold a pin and either shared or
exclusive lock.  To examine the commit status (XIDs and status bits) of
a tuple in a shared buffer, one must likewise hold a pin and either
shared or exclusive lock.

2. Once one has determined that a tuple is interesting (visible to the
current transaction) one may drop the buffer lock, yet continue to
access the tuple's data for as long as one holds the buffer pin.  This
is what is typically done by heap scans, since the tuple returned by
heap_fetch contains a pointer to tuple data in the shared buffer.
Therefore the tuple cannot go away while the pin is held (see rule #5).
Its state could change, but that is assumed not to matter after the
initial determination of visibility is made.

3. To add a tuple or change the xmin/xmax fields of an existing tuple,
one must hold a pin and an exclusive lock on the containing buffer.
This ensures that no one else might see a partially-updated state of
the tuple.

4. It is considered OK to update tuple commit status bits (ie, OR the
values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
HEAP_XMAX_INVALID into t_infomask) while holding only a shared lock and
pin on a buffer.  This is OK because another backend looking at the
tuple at about the same time would OR the same bits into the field, so
there is little or no risk of conflicting update; what's more, if there
did manage to be a conflict it would merely mean that one bit-update
would be lost and need to be done again later.  These four bits are
only hints (they cache the results of transaction status lookups in
pg_clog), so no great harm is done if they get reset to zero by
conflicting updates.

5. To physically remove a tuple or compact free space on a page, one
must hold a pin and an exclusive lock, *and* observe while holding the
exclusive lock that the buffer's shared reference count is one (ie, no
other backend holds a pin).  If these conditions are met then no other
backend can perform a page scan until the exclusive lock is dropped,
and no other backend can be holding a reference to an existing tuple
that it might expect to examine again.  Note that another backend might
pin the buffer (increment the refcount) while one is performing the
cleanup, but it won't be able to actually examine the page until it
acquires shared or exclusive lock.

VACUUM FULL ignores rule #5, because it instead acquires exclusive lock
at the relation level, which ensures indirectly that no one else is
accessing pages of the relation at all.  Plain (concurrent) VACUUM must
respect rule #5 fully.

Obtaining the lock needed under rule #5 is done by the bufmgr routine
LockBufferForCleanup().  It first gets an exclusive lock and then checks
to see if the shared pin count is currently 1.  If not, it releases the
exclusive lock (but not the caller's pin) and waits until signaled by
another backend, whereupon it tries again.  The signal will occur when
UnpinBuffer decrements the shared pin count to 1.  As indicated above,
this operation might have to wait a good while before it acquires the
lock, but that shouldn't matter much for concurrent VACUUM.  The current
implementation only supports a single waiter for pin-count-1 on any
particular shared buffer.  This is enough for VACUUM's use, since we
don't allow multiple VACUUMs concurrently on a single relation anyway.
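In outline, LockBufferForCleanup() behaves like the sketch below.  The
helpers RefCountIsOne() and WaitUntilPinCountOne() are illustrative
stand-ins only: the real routine does that bookkeeping under the buffer
manager's lock, using the signaling machinery described above.

	/* Caller must already hold a pin on "buffer". */
	for (;;)
	{
		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
		if (RefCountIsOne(buffer))
			return;			/* cleanup lock acquired */
		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);	/* keep the pin */
		WaitUntilPinCountOne(buffer);	/* signaled by UnpinBuffer */
	}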
Buffer replacement strategy interface
-------------------------------------

The file freelist.c contains the buffer cache replacement strategy.
The interface to the strategy is:

BufferDesc *StrategyBufferLookup(BufferTag *tagPtr, bool recheck,
				 int *cdb_found_index)

	This is always the first call made by the buffer manager to check
	if a disk page is in memory.  If so, the function returns the
	buffer descriptor and no further action is required.  If the page
	is not in memory, StrategyBufferLookup() returns NULL.

	The flag recheck tells the strategy that this is a second lookup
	after flushing a dirty block.  If the buffer manager has to evict
	another buffer, it will release the bufmgr lock while doing the
	write IO.  During this time, another backend could possibly fault
	in the same page this backend is after, so we have to check again
	after the IO is done if the page is in memory now.

	*cdb_found_index is set to the index of the found CDB, or -1 if
	none.  This is not intended to be used by the caller, except to
	pass to StrategyReplaceBuffer().

BufferDesc *StrategyGetBuffer(int *cdb_replace_index)

	The buffer manager calls this function to get an unpinned cache
	buffer whose content can be evicted.  The returned buffer might be
	empty, clean or dirty.

	The returned buffer is only a candidate for replacement.  It is
	possible that while the buffer is being written, another backend
	finds and modifies it, so that it is dirty again.  The buffer
	manager will then have to call StrategyGetBuffer() again to ask
	for another candidate.

	*cdb_replace_index is set to the index of the candidate CDB, or
	-1 if none (meaning we are using a previously free buffer).  This
	is not intended to be used by the caller, except to pass to
	StrategyReplaceBuffer().

void StrategyReplaceBuffer(BufferDesc *buf, BufferTag *newTag,
			   int cdb_found_index, int cdb_replace_index)

	Called by the buffer manager at the time it is about to change
	the association of a buffer with a disk page.  Before this call,
	StrategyBufferLookup() still has to find the buffer under its old
	tag, even if it was returned by StrategyGetBuffer() as a candidate
	for replacement.  After this call, this buffer must be returned
	for a lookup of the new page identified by *newTag.

	cdb_found_index and cdb_replace_index must be the auxiliary values
	returned by previous calls to StrategyBufferLookup and
	StrategyGetBuffer.

void StrategyInvalidateBuffer(BufferDesc *buf)

	Called by the buffer manager to inform the strategy that the
	content of this buffer is being thrown away.  This happens for
	example in the case of dropping a relation.  The buffer must be
	clean and unpinned on call.

	If the buffer was associated with a disk page, StrategyBufferLookup()
	must not return it for this page after the call.

void StrategyHintVacuum(bool vacuum_active)

	Because VACUUM reads all relations of the entire database through
	the buffer manager, it can greatly disturb the buffer replacement
	strategy.  This function is used by VACUUM to inform the strategy
	that subsequent buffer lookups are (or are not) caused by VACUUM
	scanning relations.

Buffer replacement strategy
---------------------------

The buffer replacement strategy actually used in freelist.c is a
version of the Two Queue (2Q) algorithm specially tailored for
PostgreSQL.

The algorithm works as follows:

There are three lists of Cache Directory Blocks (CDBs): T1, T2, and B1.
The T1 and T2 lists are the "real" cache entries, linking a file page
to a memory buffer where the page is currently cached.  Consequently
T1len+T2len <= NBuffers.  B1 is a list of pages that are not currently
buffered but recently were.  We allocate a total of 1.5*NBuffers CDBs,
so the maximum length of B1 when the cache is fully used is NBuffers/2.
The target size of T1 is NBuffers/4.

Assuming we have a full cache, one of 4 cases happens on a lookup:

	MISS	On a cache miss, depending on T1target and the actual
		T1len, the LRU buffer of either T1 or T2 is evicted.
		Its CDB is removed from the T list and added as MRU of
		B1.  The now-free buffer is replaced with the requested
		page and added as MRU of T1.

	T1 hit	The T1 CDB is moved to the MRU position of the T2 list.

	T2 hit	The T2 CDB is moved to the MRU position of the T2 list.

	B1 hit	This means that a page that was recently evicted from
		the cache is now requested again.  We evict a buffer the
		same as in the normal cache miss case, but the reloaded
		page goes to the MRU of T2 rather than T1.

Thus, every page that is found on lookup in any of the three lists ends
up as the MRU of the T2 list.  The T2 list therefore is the "frequency"
cache, holding frequently requested pages.  Every page that is seen for
the first time ends up as the MRU of the T1 list.  The T1 list is the
"recency" cache, holding recent newcomers.
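In code form, the four cases amount to roughly the following sketch.
The helper names are illustrative only (they are not the actual
freelist.c routines), and maintenance of T1target is omitted:

	cdb = lookup_cdb(tag);
	if (cdb == NULL)			/* MISS */
	{
		buf = evict_lru_of_T1_or_T2();	/* victim CDB becomes MRU of B1 */
		read_page(buf, tag);
		add_mru(T1, buf, tag);		/* newcomer: recency list */
	}
	else if (cdb->list == B1)		/* B1 hit */
	{
		buf = evict_lru_of_T1_or_T2();
		read_page(buf, tag);
		move_to_mru(T2, cdb, buf);	/* re-request: frequency list */
	}
	else					/* T1 or T2 hit */
		move_to_mru(T2, cdb, cdb->buf);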
The tailoring done for PostgreSQL has to do with the way the query
executor works.  A typical UPDATE or DELETE first scans the relation,
searching for the tuples, and then calls heap_update() or
heap_delete().  This causes at least two lookups for the block within
the same statement, and even more when multiple tuples in one block
match.  As a result, every block touched in an UPDATE or DELETE would
jump directly into the T2 cache, which is wrong.  To prevent this, the
strategy remembers which transaction added a buffer to the T1 list and
will not promote it from there into the T2 cache during the same
transaction.

Another specialty is the change of the strategy during VACUUM.  Lookups
during VACUUM do not represent application needs, and do not suggest
that the page will be hit again soon.  Therefore, a page read in to
satisfy VACUUM is placed at the LRU position of the T1 list, for
immediate reuse.  Also, if we happen to get a hit on a CDB entry during
VACUUM, we do not promote the page above its current position in the
list.  Since VACUUM usually requests many pages very fast, the effect
of this is that it will get back the very buffers it filled and
possibly modified on the next call and will therefore do its work in a
few shared memory buffers, while being able to use whatever it finds in
the cache already.  This also implies that most of the write traffic
caused by a VACUUM will be done by the VACUUM itself and not pushed off
onto other processes.
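VACUUM activates this behavior by bracketing its work with the hint
call described in the interface section above; schematically (the scan
in the middle is a placeholder):

	StrategyHintVacuum(true);	/* lookups below are VACUUM-driven */
	/* ... scan and clean the relations ... */
	StrategyHintVacuum(false);	/* restore normal replacement behavior */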