1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
|
<!--
$PostgreSQL: pgsql/doc/src/sgml/page.sgml,v 1.17 2003/12/14 00:10:32 neilc Exp $
-->
<chapter id="page">
<title>Page Files</title>
<abstract>
<para>
A description of the database file page format.
</para>
</abstract>
<para>
This section provides an overview of the page format used by
<productname>PostgreSQL</productname> tables and indexes. (Index
access methods need not use this page format. At present, all index
methods do use this basic format, but the data kept on index metapages
usually doesn't follow the item layout rules exactly.) TOAST tables
and sequences are formatted just like a regular table.
</para>
<para>
In the following explanation, a
<firstterm>byte</firstterm>
is assumed to contain 8 bits. In addition, the term
<firstterm>item</firstterm>
refers to an individual data value that is stored on a page. In a table,
an item is a row; in an index, an item is an index entry.
</para>
<para>
<xref linkend="page-table"> shows the basic layout of a page.
There are five parts to each page.
</para>
<table tocentry="1" id="page-table">
<title>Sample Page Layout</title>
<titleabbrev>Page Layout</titleabbrev>
<tgroup cols="2">
<thead>
<row>
<entry>
Item
</entry>
<entry>Description</entry>
</row>
</thead>
<tbody>
<row>
<entry>PageHeaderData</entry>
<entry>20 bytes long. Contains general information about the page, including
free space pointers.</entry>
</row>
<row>
<entry>ItemPointerData</entry>
<entry>Array of (offset,length) pairs pointing to the actual items.</entry>
</row>
<row>
<entry>Free space</entry>
<entry>The unallocated space. All new rows are allocated from here, generally from the end.</entry>
</row>
<row>
<entry>Items</entry>
<entry>The actual items themselves.</entry>
</row>
<row>
<entry>Special Space</entry>
<entry>Index access method specific data. Different methods store different
data. Empty in ordinary tables.</entry>
</row>
</tbody>
</tgroup>
</table>
<para>
The first 20 bytes of each page consists of a page header
(PageHeaderData). Its format is detailed in <xref
linkend="pageheaderdata-table">. The first two fields deal with WAL
related stuff. This is followed by three 2-byte integer fields
(<structfield>pd_lower</structfield>, <structfield>pd_upper</structfield>,
and <structfield>pd_special</structfield>). These represent byte offsets to
the start
of unallocated space, to the end of unallocated space, and to the start of
the special space.
</para>
<table tocentry="1" id="pageheaderdata-table">
<title>PageHeaderData Layout</title>
<titleabbrev>PageHeaderData Layout</titleabbrev>
<tgroup cols="4">
<thead>
<row>
<entry>Field</entry>
<entry>Type</entry>
<entry>Length</entry>
<entry>Description</entry>
</row>
</thead>
<tbody>
<row>
<entry>pd_lsn</entry>
<entry>XLogRecPtr</entry>
<entry>8 bytes</entry>
<entry>LSN: next byte after last byte of xlog</entry>
</row>
<row>
<entry>pd_sui</entry>
<entry>StartUpID</entry>
<entry>4 bytes</entry>
<entry>SUI of last changes (currently it's used by heap AM only)</entry>
</row>
<row>
<entry>pd_lower</entry>
<entry>LocationIndex</entry>
<entry>2 bytes</entry>
<entry>Offset to start of free space.</entry>
</row>
<row>
<entry>pd_upper</entry>
<entry>LocationIndex</entry>
<entry>2 bytes</entry>
<entry>Offset to end of free space.</entry>
</row>
<row>
<entry>pd_special</entry>
<entry>LocationIndex</entry>
<entry>2 bytes</entry>
<entry>Offset to start of special space.</entry>
</row>
<row>
<entry>pd_pagesize_version</entry>
<entry>uint16</entry>
<entry>2 bytes</entry>
<entry>Page size and layout version number information.</entry>
</row>
</tbody>
</tgroup>
</table>
<para>
All the details may be found in
<filename>src/include/storage/bufpage.h</filename>.
</para>
<para>
Special space is a region at the end of the page that is allocated at page
initialization time and contains information specific to an access method.
The last 2 bytes of the page header,
<structfield>pd_pagesize_version</structfield>, store both the page size
and a version indicator. Beginning with
<productname>PostgreSQL</productname> 7.3 the version number is 1; prior
releases used version number 0. (The basic page layout and header format
has not changed, but the layout of heap row headers has.) The page size
is basically only present as a cross-check; there is no support for having
more than one page size in an installation.
</para>
<para>
Following the page header are item identifiers
(<type>ItemIdData</type>), each requiring four bytes.
An item identifier contains a byte-offset to
the start of an item, its length in bytes, and a set of attribute bits
which affect its interpretation.
New item identifiers are allocated
as needed from the beginning of the unallocated space.
The number of item identifiers present can be determined by looking at
<structfield>pd_lower</>, which is increased to allocate a new identifier.
Because an item
identifier is never moved until it is freed, its index may be used on a
long-term basis to reference an item, even when the item itself is moved
around on the page to compact free space. In fact, every pointer to an
item (<type>ItemPointer</type>, also known as
<type>CTID</type>) created by
<productname>PostgreSQL</productname> consists of a page number and the
index of an item identifier.
</para>
<para>
The items themselves are stored in space allocated backwards from the end
of unallocated space. The exact structure varies depending on what the
table is to contain. Tables and sequences both use a structure named
<type>HeapTupleHeaderData</type>, described below.
</para>
<para>
The final section is the <quote>special section</quote> which may
contain anything the access method wishes to store. Ordinary tables
do not use this at all (indicated by setting
<structfield>pd_special</> to equal the pagesize).
</para>
<para>
All table rows are structured the same way. There is a fixed-size
header (occupying 23 bytes on most machines), followed by an optional null
bitmap, an optional object ID field, and the user data. The header is
detailed
in <xref linkend="heaptupleheaderdata-table">. The actual user data
(columns of the row) begins at the offset indicated by
<structfield>t_hoff</>, which must always be a multiple of the MAXALIGN
distance for the platform.
The null bitmap is
only present if the <firstterm>HEAP_HASNULL</firstterm> bit is set in
<structfield>t_infomask</structfield>. If it is present it begins just after
the fixed header and occupies enough bytes to have one bit per data column
(that is, <structfield>t_natts</> bits altogether). In this list of bits, a
1 bit indicates not-null, a 0 bit is a null. When the bitmap is not
present, all columns are assumed not-null.
The object ID is only present if the <firstterm>HEAP_HASOID</firstterm> bit
is set in <structfield>t_infomask</structfield>. If present, it appears just
before the <structfield>t_hoff</> boundary. Any padding needed to make
<structfield>t_hoff</> a MAXALIGN multiple will appear between the null
bitmap and the object ID. (This in turn ensures that the object ID is
suitably aligned.)
</para>
<table tocentry="1" id="heaptupleheaderdata-table">
<title>HeapTupleHeaderData Layout</title>
<titleabbrev>HeapTupleHeaderData Layout</titleabbrev>
<tgroup cols="4">
<thead>
<row>
<entry>Field</entry>
<entry>Type</entry>
<entry>Length</entry>
<entry>Description</entry>
</row>
</thead>
<tbody>
<row>
<entry>t_xmin</entry>
<entry>TransactionId</entry>
<entry>4 bytes</entry>
<entry>insert XID stamp</entry>
</row>
<row>
<entry>t_cmin</entry>
<entry>CommandId</entry>
<entry>4 bytes</entry>
<entry>insert CID stamp (overlays with t_xmax)</entry>
</row>
<row>
<entry>t_xmax</entry>
<entry>TransactionId</entry>
<entry>4 bytes</entry>
<entry>delete XID stamp</entry>
</row>
<row>
<entry>t_cmax</entry>
<entry>CommandId</entry>
<entry>4 bytes</entry>
<entry>delete CID stamp (overlays with t_xvac)</entry>
</row>
<row>
<entry>t_xvac</entry>
<entry>TransactionId</entry>
<entry>4 bytes</entry>
<entry>XID for VACUUM operation moving row version</entry>
</row>
<row>
<entry>t_ctid</entry>
<entry>ItemPointerData</entry>
<entry>6 bytes</entry>
<entry>current TID of this or newer row version</entry>
</row>
<row>
<entry>t_natts</entry>
<entry>int16</entry>
<entry>2 bytes</entry>
<entry>number of attributes</entry>
</row>
<row>
<entry>t_infomask</entry>
<entry>uint16</entry>
<entry>2 bytes</entry>
<entry>various flags</entry>
</row>
<row>
<entry>t_hoff</entry>
<entry>uint8</entry>
<entry>1 byte</entry>
<entry>offset to user data</entry>
</row>
</tbody>
</tgroup>
</table>
<para>
All the details may be found in
<filename>src/include/access/htup.h</filename>.
</para>
<para>
Interpreting the actual data can only be done with information obtained
from other tables, mostly <firstterm>pg_attribute</firstterm>. The
particular fields are <structfield>attlen</structfield> and
<structfield>attalign</structfield>. There is no way to directly get a
particular attribute, except when there are only fixed width fields and no
NULLs. All this trickery is wrapped up in the functions
<firstterm>heap_getattr</firstterm>, <firstterm>fastgetattr</firstterm>
and <firstterm>heap_getsysattr</firstterm>.
</para>
<para>
To read the data you need to examine each attribute in turn. First check
whether the field is NULL according to the null bitmap. If it is, go to
the next. Then make sure you have the right alignment. If the field is a
fixed width field, then all the bytes are simply placed. If it's a
variable length field (attlen == -1) then it's a bit more complicated,
using the variable length structure <type>varattrib</type>.
Depending on the flags, the data may be either inline, compressed or in
another table (TOAST).
</para>
</chapter>
|