aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access/nbtree/README
diff options
context:
space:
mode:
Diffstat (limited to 'src/backend/access/nbtree/README')
-rw-r--r--src/backend/access/nbtree/README68
1 files changed, 68 insertions, 0 deletions
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
new file mode 100644
index 00000000000..a204ad4af08
--- /dev/null
+++ b/src/backend/access/nbtree/README
@@ -0,0 +1,68 @@
+$Header: /cvsroot/pgsql/src/backend/access/nbtree/README,v 1.1.1.1 1996/07/09 06:21:12 scrappy Exp $
+
+This directory contains a correct implementation of Lehman and Yao's
+btree management algorithm that supports concurrent access for Postgres.
+We have made the following changes in order to incorporate their algorithm
+into Postgres:
+
+ + The requirement that all btree keys be unique is too onerous,
+ but the algorithm won't work correctly without it. As a result,
+ this implementation adds an OID (guaranteed to be unique) to
+ every key in the index. This guarantees uniqueness within a set
+ of duplicates. Space overhead is four bytes.
+
+ For this reason, when we're passed an index tuple to store by the
+ common access method code, we allocate a larger one and copy the
+ supplied tuple into it. No Postgres code outside of the btree
+ access method knows about this xid or sequence number.
+
+ + Lehman and Yao don't require read locks, but assume that in-
+ memory copies of tree nodes are unshared. Postgres shares
+ in-memory buffers among backends. As a result, we do page-
+ level read locking on btree nodes in order to guarantee that
+ no record is modified while we are examining it. This reduces
+ concurrency but guaranteees correct behavior.
+
+ + Read locks on a page are held for as long as a scan has a pointer
+ to the page. However, locks are always surrendered before the
+ sibling page lock is acquired (for readers), so we remain deadlock-
+ free. I will do a formal proof if I get bored anytime soon.
+
+In addition, the following things are handy to know:
+
+ + Page zero of every btree is a meta-data page. This page stores
+ the location of the root page, a pointer to a list of free
+ pages, and other stuff that's handy to know.
+
+ + This algorithm doesn't really work, since it requires ordered
+ writes, and UNIX doesn't support ordered writes.
+
+ + There's one other case where we may screw up in this
+ implementation. When we start a scan, we descend the tree
+ to the key nearest the one in the qual, and once we get there,
+ position ourselves correctly for the qual type (eg, <, >=, etc).
+ If we happen to step off a page, decide we want to get back to
+ it, and fetch the page again, and if some bad person has split
+ the page and moved the last tuple we saw off of it, then the
+ code complains about botched concurrency in an elog(WARN, ...)
+ and gives up the ghost. This is the ONLY violation of Lehman
+ and Yao's guarantee of correct behavior that I am aware of in
+ this code.
+
+Notes to operator class implementors:
+
+ With this implementation, we require the user to supply us with
+ a procedure for pg_amproc. This procedure should take two keys
+ A and B and return < 0, 0, or > 0 if A < B, A = B, or A > B,
+ respectively. See the contents of that relation for the btree
+ access method for some samples.
+
+Notes to mao for implementation document:
+
+ On deletions, we need to adjust the position of active scans on
+ the index. The code in nbtscan.c handles this. We don't need to
+ do this for splits because of the way splits are handled; if they
+ happen behind us, we'll automatically go to the next page, and if
+ they happen in front of us, we're not affected by them. For
+ insertions, if we inserted a tuple behind the current scan location
+ on the current scan page, we move one space ahead.