author     Tom Lane <tgl@sss.pgh.pa.us>	2024-04-01 16:25:56 -0400
committer  Tom Lane <tgl@sss.pgh.pa.us>	2024-04-01 16:25:56 -0400
commit     a45c78e3284b269085e9a0cbd0ea3b236b7180fa (patch)
tree       eb56b7d4a7678f830914209e9c78d7fbb2f19fe9 /src/bin/pg_dump/pg_backup_db.c
parent     5eac8cef24543767015a9b248af08bbfa10b1b70 (diff)
Rearrange pg_dump's handling of large objects for better efficiency.
Commit c0d5be5d6 caused pg_dump to create a separate BLOB metadata TOC
entry for each large object (blob), but it did not touch the ancient
decision to put all the blobs' data into a single "BLOBS" TOC entry.
This is bad for a few reasons: for databases with millions of blobs,
the TOC becomes unreasonably large, causing performance issues;
selective restore of just some blobs is quite impossible; and we
cannot parallelize either dump or restore of the blob data, since our
architecture for that relies on farming out whole TOC entries to
worker processes.
To improve matters, let's group multiple blobs into each blob metadata
TOC entry, and then make corresponding per-group blob data TOC entries.
Selective restore using pg_restore's -l/-L switches is then possible,
though only at the group level. (Perhaps we should provide a switch
to allow forcing one-blob-per-group for users who need precise
selective restore and don't have huge numbers of blobs. This patch
doesn't do that, instead just hard-wiring the maximum number of blobs
per entry at 1000.)
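As a rough sketch of the grouping rule (illustrative code only, with
made-up names; the real dump-side logic also has to start a new group
when the owner or ACL changes, per the next paragraph):

#include <stdio.h>

#define MAX_BLOBS_PER_ENTRY 1000	/* the hard-wired group size */

/* Chunk a list of blob OIDs into TOC-entry-sized groups. */
static void
make_blob_groups(const unsigned int *oids, int nblobs)
{
	for (int start = 0; start < nblobs; start += MAX_BLOBS_PER_ENTRY)
	{
		int			n = nblobs - start;

		if (n > MAX_BLOBS_PER_ENTRY)
			n = MAX_BLOBS_PER_ENTRY;
		/* One BLOB METADATA entry plus one BLOBS data entry per group */
		printf("group: OIDs %u..%u (%d blobs)\n",
			   oids[start], oids[start + n - 1], n);
	}
}

int
main(void)
{
	unsigned int oids[2500];

	for (int i = 0; i < 2500; i++)
		oids[i] = 16384 + i;
	make_blob_groups(oids, 2500);	/* three groups: 1000, 1000, 500 */
	return 0;
}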
The blobs in a group must all have the same owner, since the TOC entry
format only allows one owner to be named. In this implementation
we also require them to all share the same ACL (grants); the archive
format wouldn't require that, but pg_dump's representation of
DumpableObjects does. It seems unlikely that either restriction
will be problematic for databases with huge numbers of blobs.
The metadata TOC entries now have a "desc" string of "BLOB METADATA",
and their "defn" string is just a newline-separated list of blob OIDs.
The restore code has to generate creation commands, ALTER OWNER
commands, and drop commands (for --clean mode) from that. We would
need special-case code for ALTER OWNER and drop in any case, so the
alternative of keeping the "defn" as directly executable SQL code
for creation wouldn't buy much, and it seems like it'd bloat the
archive to little purpose.
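To make that concrete, here is a self-contained sketch of the expansion
(the command fragments are assumptions for illustration; the real work
is done by IssueCommandPerBlob, shown in the diff below):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Wrap each newline-separated OID in cmdBegin/cmdEnd and emit it. */
static void
issue_command_per_blob(const char *defn,
					   const char *cmdBegin, const char *cmdEnd)
{
	char	   *buf = strdup(defn);	/* writable copy */
	char	   *st = buf;
	char	   *en;

	while ((en = strchr(st, '\n')) != NULL)
	{
		*en++ = '\0';
		printf("%s%s%s;\n", cmdBegin, st, cmdEnd);
		st = en;
	}
	free(buf);
}

int
main(void)
{
	/* A BLOB METADATA defn: just blob OIDs, one per line */
	const char *defn = "1001\n1002\n1003\n";

	/* Creation commands (fragments illustrative, not the exact strings) */
	issue_command_per_blob(defn, "SELECT pg_catalog.lo_create('", "')");
	/* Drop commands for --clean mode */
	issue_command_per_blob(defn, "SELECT pg_catalog.lo_unlink('", "')");
	return 0;
}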
Since we require the blobs of a metadata group to share the same ACL,
we can furthermore store only one copy of that ACL, and then make
pg_restore regenerate the appropriate commands for each blob. This
saves space in the dump file not only by removing duplicative SQL
command strings, but by not needing a separate TOC entry for each
blob's ACL. In turn, that reduces client-side memory requirements for
handling many blobs.
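For example (a simplified sketch assuming a single GRANT and no quoted
role names; the real parser, IssueACLPerBlob in the diff below, handles
those cases too), the one stored command is re-aimed at each blob:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	/* The stored ACL command, written against the group's first blob */
	char		acl[] = "GRANT SELECT ON LARGE OBJECT 1001 TO alice;";
	/* OIDs from the associated BLOB METADATA entry (made-up values) */
	const char *oids[] = {"1001", "1002", "1003"};
	char	   *en = strstr(acl, "LARGE OBJECT ") + strlen("LARGE OBJECT ");
	char	   *secondHalf;

	*en++ = '\0';				/* terminate the first half at the OID */
	while (isdigit((unsigned char) *en))
		en++;					/* skip the rest of the stored OID */
	secondHalf = en;
	*strchr(secondHalf, ';') = '\0';	/* strip the trailing semicolon */

	/* Regenerate the command for every blob in the group */
	for (int i = 0; i < 3; i++)
		printf("%s%s%s;\n", acl, oids[i], secondHalf);
	return 0;
}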
ACL TOC entries that need this special processing are labeled as
"ACL"/"LARGE OBJECTS nnn..nnn". If we have a blob with a unique ACL,
continue to label it as "ACL"/"LARGE OBJECT nnn". We don't actually
have to make such a distinction, but it saves a few cycles during
restore for the easy case, and it seems like a good idea to not change
the TOC contents unnecessarily.
The data TOC entries ("BLOBS") are exactly the same as before,
except that now there can be more than one, so we'd better give them
identifying tag strings.
Also, commit c0d5be5d6 put the new BLOB metadata TOC entries into
SECTION_PRE_DATA, which perhaps is defensible in some ways, but
it's a rather odd choice considering that we go out of our way to
treat blobs as data. Moreover, because parallel restore handles
the PRE_DATA section serially, this means we'd only get part of the
parallelism speedup we could hope for. Move these entries into
SECTION_DATA, letting us parallelize the lo_create calls, not just the
data loading, when there are many blobs. Add dependencies to ensure
that we won't try to load data for a blob we've not yet created.
As this stands, we still generate a separate TOC entry for any comment
or security label attached to a blob. I feel comfortable in believing
that comments and security labels on blobs are rare, so this patch
should be enough to get most of the useful TOC compression for blobs.
We have to bump the archive file format version number, since existing
versions of pg_restore wouldn't know they need to do something special
for BLOB METADATA, plus they aren't going to work correctly with
multiple BLOBS entries or multiple-large-object ACL entries.
The directory and tar-file format handlers need some work
for multiple BLOBS entries: they used to hard-wire the file name
as "blobs.toc", which is replaced here with "blobs_<dumpid>.toc".
The 002_pg_dump.pl test script also knows about that and requires
minor updates. (I had to drop the test for manually-compressed
blobs.toc files with LZ4, because lz4's obtuse command line
design requires explicit specification of the output file name
which seems impractical here. I don't think we're losing any
useful test coverage thereby; that test stanza seems completely
duplicative with the gzip and zstd cases anyway.)
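The per-group data file naming mentioned above amounts to roughly this
(a sketch; the actual format-handler code and buffer sizes may differ):

#include <stdio.h>

int
main(void)
{
	int			dumpId = 42;	/* the BLOBS entry's dump ID (made up) */
	char		fname[64];

	/* One TOC file per blob-data group, instead of a single "blobs.toc" */
	snprintf(fname, sizeof(fname), "blobs_%d.toc", dumpId);
	printf("%s\n", fname);		/* prints "blobs_42.toc" */
	return 0;
}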
In passing, centralize management of the lo_buf used to hold data
while restoring blobs. The code previously had each format handler
create lo_buf, which seems rather pointless given that the format
handlers all make it the same way. Moreover, the format handlers
never use lo_buf directly, making this setup a failure from a
separation-of-concerns standpoint. Let's move the responsibility into
pg_backup_archiver.c, which is the only module concerned with lo_buf.
The reason to do this in this patch is that it allows a centralized
fix for the now-false assumption that we never restore blobs in
parallel. Also, get rid of dead code in DropLOIfExists: it's been a
long time since we had any need to be able to restore to a pre-9.0
server.
Discussion: https://postgr.es/m/a9f9376f1c3343a6bb319dce294e20ac@EX13D05UWC001.ant.amazon.com
Diffstat (limited to 'src/bin/pg_dump/pg_backup_db.c')
-rw-r--r--	src/bin/pg_dump/pg_backup_db.c	131
1 file changed, 112 insertions(+), 19 deletions(-)
diff --git a/src/bin/pg_dump/pg_backup_db.c b/src/bin/pg_dump/pg_backup_db.c
index f766b65059d..f9683fb0c53 100644
--- a/src/bin/pg_dump/pg_backup_db.c
+++ b/src/bin/pg_dump/pg_backup_db.c
@@ -541,29 +541,122 @@ CommitTransaction(Archive *AHX)
 	ExecuteSqlCommand(AH, "COMMIT", "could not commit database transaction");
 }
 
+/*
+ * Issue per-blob commands for the large object(s) listed in the TocEntry
+ *
+ * The TocEntry's defn string is assumed to consist of large object OIDs,
+ * one per line.  Wrap these in the given SQL command fragments and issue
+ * the commands.  (cmdEnd need not include a semicolon.)
+ */
 void
-DropLOIfExists(ArchiveHandle *AH, Oid oid)
+IssueCommandPerBlob(ArchiveHandle *AH, TocEntry *te,
+					const char *cmdBegin, const char *cmdEnd)
 {
-	/*
-	 * If we are not restoring to a direct database connection, we have to
-	 * guess about how to detect whether the LO exists.  Assume new-style.
-	 */
-	if (AH->connection == NULL ||
-		PQserverVersion(AH->connection) >= 90000)
+	/* Make a writable copy of the command string */
+	char	   *buf = pg_strdup(te->defn);
+	char	   *st;
+	char	   *en;
+
+	st = buf;
+	while ((en = strchr(st, '\n')) != NULL)
 	{
-		ahprintf(AH,
-				 "SELECT pg_catalog.lo_unlink(oid) "
-				 "FROM pg_catalog.pg_largeobject_metadata "
-				 "WHERE oid = '%u';\n",
-				 oid);
+		*en++ = '\0';
+		ahprintf(AH, "%s%s%s;\n", cmdBegin, st, cmdEnd);
+		st = en;
 	}
-	else
+	ahprintf(AH, "\n");
+	pg_free(buf);
+}
+
+/*
+ * Process a "LARGE OBJECTS" ACL TocEntry.
+ *
+ * To save space in the dump file, the TocEntry contains only one copy
+ * of the required GRANT/REVOKE commands, written to apply to the first
+ * blob in the group (although we do not depend on that detail here).
+ * We must expand the text to generate commands for all the blobs listed
+ * in the associated BLOB METADATA entry.
+ */
+void
+IssueACLPerBlob(ArchiveHandle *AH, TocEntry *te)
+{
+	TocEntry   *blobte = getTocEntryByDumpId(AH, te->dependencies[0]);
+	char	   *buf;
+	char	   *st;
+	char	   *st2;
+	char	   *en;
+	bool		inquotes;
+
+	if (!blobte)
+		pg_fatal("could not find entry for ID %d", te->dependencies[0]);
+	Assert(strcmp(blobte->desc, "BLOB METADATA") == 0);
+
+	/* Make a writable copy of the ACL commands string */
+	buf = pg_strdup(te->defn);
+
+	/*
+	 * We have to parse out the commands sufficiently to locate the blob OIDs
+	 * and find the command-ending semicolons.  The commands should not
+	 * contain anything hard to parse except for double-quoted role names,
+	 * which are easy to ignore.  Once we've split apart the first and second
+	 * halves of a command, apply IssueCommandPerBlob.  (This means the
+	 * updates on the blobs are interleaved if there's multiple commands, but
+	 * that should cause no trouble.)
+	 */
+	inquotes = false;
+	st = en = buf;
+	st2 = NULL;
+	while (*en)
 	{
-		/* Restoring to pre-9.0 server, so do it the old way */
-		ahprintf(AH,
-				 "SELECT CASE WHEN EXISTS("
-				 "SELECT 1 FROM pg_catalog.pg_largeobject WHERE loid = '%u'"
-				 ") THEN pg_catalog.lo_unlink('%u') END;\n",
-				 oid, oid);
+		/* Ignore double-quoted material */
+		if (*en == '"')
+			inquotes = !inquotes;
+		if (inquotes)
+		{
+			en++;
+			continue;
+		}
+		/* If we found "LARGE OBJECT", that's the end of the first half */
+		if (strncmp(en, "LARGE OBJECT ", 13) == 0)
+		{
+			/* Terminate the first-half string */
+			en += 13;
+			Assert(isdigit((unsigned char) *en));
+			*en++ = '\0';
+			/* Skip the rest of the blob OID */
+			while (isdigit((unsigned char) *en))
+				en++;
+			/* Second half starts here */
+			Assert(st2 == NULL);
+			st2 = en;
+		}
+		/* If we found semicolon, that's the end of the second half */
+		else if (*en == ';')
+		{
+			/* Terminate the second-half string */
+			*en++ = '\0';
+			Assert(st2 != NULL);
+			/* Issue this command for each blob */
+			IssueCommandPerBlob(AH, blobte, st, st2);
+			/* For neatness, skip whitespace before the next command */
+			while (isspace((unsigned char) *en))
+				en++;
+			/* Reset for new command */
+			st = en;
+			st2 = NULL;
+		}
+		else
+			en++;
 	}
+	pg_free(buf);
+}
+
+void
+DropLOIfExists(ArchiveHandle *AH, Oid oid)
+{
+	ahprintf(AH,
+			 "SELECT pg_catalog.lo_unlink(oid) "
+			 "FROM pg_catalog.pg_largeobject_metadata "
+			 "WHERE oid = '%u';\n",
+			 oid);
 }