author     Tom Lane <tgl@sss.pgh.pa.us>    2024-04-01 16:25:56 -0400
committer  Tom Lane <tgl@sss.pgh.pa.us>    2024-04-01 16:25:56 -0400
commit     a45c78e3284b269085e9a0cbd0ea3b236b7180fa (patch)
tree       eb56b7d4a7678f830914209e9c78d7fbb2f19fe9 /src/bin/pg_dump/pg_backup_db.c
parent     5eac8cef24543767015a9b248af08bbfa10b1b70 (diff)
Rearrange pg_dump's handling of large objects for better efficiency.
Commit c0d5be5d6 caused pg_dump to create a separate BLOB metadata TOC entry for each large object (blob), but it did not touch the ancient decision to put all the blobs' data into a single "BLOBS" TOC entry. This is bad for a few reasons: for databases with millions of blobs, the TOC becomes unreasonably large, causing performance issues; selective restore of just some blobs is quite impossible; and we cannot parallelize either dump or restore of the blob data, since our architecture for that relies on farming out whole TOC entries to worker processes.

To improve matters, let's group multiple blobs into each blob metadata TOC entry, and then make corresponding per-group blob data TOC entries. Selective restore using pg_restore's -l/-L switches is then possible, though only at the group level. (Perhaps we should provide a switch to allow forcing one-blob-per-group for users who need precise selective restore and don't have huge numbers of blobs. This patch doesn't do that, instead just hard-wiring the maximum number of blobs per entry at 1000.)

The blobs in a group must all have the same owner, since the TOC entry format only allows one owner to be named. In this implementation we also require them to all share the same ACL (grants); the archive format wouldn't require that, but pg_dump's representation of DumpableObjects does. It seems unlikely that either restriction will be problematic for databases with huge numbers of blobs.

The metadata TOC entries now have a "desc" string of "BLOB METADATA", and their "defn" string is just a newline-separated list of blob OIDs. The restore code has to generate creation commands, ALTER OWNER commands, and drop commands (for --clean mode) from that. We would need special-case code for ALTER OWNER and drop in any case, so the alternative of keeping the "defn" as directly executable SQL code for creation wouldn't buy much, and it seems like it'd bloat the archive to little purpose.

Since we require the blobs of a metadata group to share the same ACL, we can furthermore store only one copy of that ACL, and then make pg_restore regenerate the appropriate commands for each blob. This saves space in the dump file not only by removing duplicative SQL command strings, but by not needing a separate TOC entry for each blob's ACL. In turn, that reduces client-side memory requirements for handling many blobs.

ACL TOC entries that need this special processing are labeled as "ACL"/"LARGE OBJECTS nnn..nnn". If we have a blob with a unique ACL, continue to label it as "ACL"/"LARGE OBJECT nnn". We don't actually have to make such a distinction, but it saves a few cycles during restore for the easy case, and it seems like a good idea to not change the TOC contents unnecessarily.

The data TOC entries ("BLOBS") are exactly the same as before, except that now there can be more than one, so we'd better give them identifying tag strings.

Also, commit c0d5be5d6 put the new BLOB metadata TOC entries into SECTION_PRE_DATA, which perhaps is defensible in some ways, but it's a rather odd choice considering that we go out of our way to treat blobs as data. Moreover, because parallel restore handles the PRE_DATA section serially, this means we'd only get part of the parallelism speedup we could hope for. Move these entries into SECTION_DATA, letting us parallelize the lo_create calls not just the data loading when there are many blobs. Add dependencies to ensure that we won't try to load data for a blob we've not yet created.
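[Editor's illustration] To make the new "defn" format concrete, here is a minimal standalone C sketch of the restore-side expansion step. It is hypothetical code, not part of the patch: the sample OIDs and the lo_create command fragments are assumptions, and the real archiver routes output through its own I/O layer rather than printf.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sketch of the per-blob expansion: a BLOB METADATA entry's defn holds
 * one large-object OID per line; wrap each OID in the given command
 * fragments and emit one SQL command per blob.
 */
static void
issue_command_per_blob(const char *defn,
                       const char *cmdBegin, const char *cmdEnd)
{
    char       *buf = strdup(defn);     /* writable copy, as in the patch */
    char       *st = buf;
    char       *en;

    while ((en = strchr(st, '\n')) != NULL)
    {
        *en++ = '\0';           /* split off one OID */
        printf("%s%s%s;\n", cmdBegin, st, cmdEnd);
        st = en;
    }
    free(buf);
}

int
main(void)
{
    /* Hypothetical defn of a three-blob metadata group */
    const char *defn = "16401\n16402\n16403\n";

    /* Emits one creation command per blob in the group */
    issue_command_per_blob(defn, "SELECT pg_catalog.lo_create('", "')");
    return 0;
}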
As this stands, we still generate a separate TOC entry for any comment or security label attached to a blob. I feel comfortable in believing that comments and security labels on blobs are rare, so this patch should be enough to get most of the useful TOC compression for blobs.

We have to bump the archive file format version number, since existing versions of pg_restore wouldn't know they need to do something special for BLOB METADATA, plus they aren't going to work correctly with multiple BLOBS entries or multiple-large-object ACL entries.

The directory and tar-file format handlers need some work for multiple BLOBS entries: they used to hard-wire the file name as "blobs.toc", which is replaced here with "blobs_<dumpid>.toc". The 002_pg_dump.pl test script also knows about that and requires minor updates. (I had to drop the test for manually-compressed blobs.toc files with LZ4, because lz4's obtuse command line design requires explicit specification of the output file name which seems impractical here. I don't think we're losing any useful test coverage thereby; that test stanza seems completely duplicative with the gzip and zstd cases anyway.)

In passing, centralize management of the lo_buf used to hold data while restoring blobs. The code previously had each format handler create lo_buf, which seems rather pointless given that the format handlers all make it the same way. Moreover, the format handlers never use lo_buf directly, making this setup a failure from a separation-of-concerns standpoint. Let's move the responsibility into pg_backup_archiver.c, which is the only module concerned with lo_buf. The reason to do this in this patch is that it allows a centralized fix for the now-false assumption that we never restore blobs in parallel.

Also, get rid of dead code in DropLOIfExists: it's been a long time since we had any need to be able to restore to a pre-9.0 server.

Discussion: https://postgr.es/m/a9f9376f1c3343a6bb319dce294e20ac@EX13D05UWC001.ant.amazon.com
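[Editor's illustration] As a small sketch of the per-entry naming scheme described above, a format handler can derive the blob TOC file name from the entry's dump ID. This is hypothetical standalone code; the dump ID value is made up.

#include <stdio.h>

/*
 * Sketch of the "blobs_<dumpid>.toc" naming that replaces the old
 * hard-wired "blobs.toc", allowing multiple BLOBS entries per archive.
 */
int
main(void)
{
    char        fname[64];
    int         dumpId = 42;    /* hypothetical dump ID of a BLOBS entry */

    snprintf(fname, sizeof(fname), "blobs_%d.toc", dumpId);
    printf("%s\n", fname);      /* prints "blobs_42.toc" */
    return 0;
}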
Diffstat (limited to 'src/bin/pg_dump/pg_backup_db.c')
-rw-r--r--  src/bin/pg_dump/pg_backup_db.c  131
1 file changed, 112 insertions(+), 19 deletions(-)
diff --git a/src/bin/pg_dump/pg_backup_db.c b/src/bin/pg_dump/pg_backup_db.c
index f766b65059d..f9683fb0c53 100644
--- a/src/bin/pg_dump/pg_backup_db.c
+++ b/src/bin/pg_dump/pg_backup_db.c
@@ -541,29 +541,122 @@ CommitTransaction(Archive *AHX)
 	ExecuteSqlCommand(AH, "COMMIT", "could not commit database transaction");
 }
 
+/*
+ * Issue per-blob commands for the large object(s) listed in the TocEntry
+ *
+ * The TocEntry's defn string is assumed to consist of large object OIDs,
+ * one per line.  Wrap these in the given SQL command fragments and issue
+ * the commands.  (cmdEnd need not include a semicolon.)
+ */
 void
-DropLOIfExists(ArchiveHandle *AH, Oid oid)
+IssueCommandPerBlob(ArchiveHandle *AH, TocEntry *te,
+					const char *cmdBegin, const char *cmdEnd)
 {
-	/*
-	 * If we are not restoring to a direct database connection, we have to
-	 * guess about how to detect whether the LO exists.  Assume new-style.
-	 */
-	if (AH->connection == NULL ||
-		PQserverVersion(AH->connection) >= 90000)
+	/* Make a writable copy of the command string */
+	char	   *buf = pg_strdup(te->defn);
+	char	   *st;
+	char	   *en;
+
+	st = buf;
+	while ((en = strchr(st, '\n')) != NULL)
 	{
-		ahprintf(AH,
-				 "SELECT pg_catalog.lo_unlink(oid) "
-				 "FROM pg_catalog.pg_largeobject_metadata "
-				 "WHERE oid = '%u';\n",
-				 oid);
+		*en++ = '\0';
+		ahprintf(AH, "%s%s%s;\n", cmdBegin, st, cmdEnd);
+		st = en;
 	}
-	else
+	ahprintf(AH, "\n");
+	pg_free(buf);
+}
+
+/*
+ * Process a "LARGE OBJECTS" ACL TocEntry.
+ *
+ * To save space in the dump file, the TocEntry contains only one copy
+ * of the required GRANT/REVOKE commands, written to apply to the first
+ * blob in the group (although we do not depend on that detail here).
+ * We must expand the text to generate commands for all the blobs listed
+ * in the associated BLOB METADATA entry.
+ */
+void
+IssueACLPerBlob(ArchiveHandle *AH, TocEntry *te)
+{
+	TocEntry   *blobte = getTocEntryByDumpId(AH, te->dependencies[0]);
+	char	   *buf;
+	char	   *st;
+	char	   *st2;
+	char	   *en;
+	bool		inquotes;
+
+	if (!blobte)
+		pg_fatal("could not find entry for ID %d", te->dependencies[0]);
+	Assert(strcmp(blobte->desc, "BLOB METADATA") == 0);
+
+	/* Make a writable copy of the ACL commands string */
+	buf = pg_strdup(te->defn);
+
+	/*
+	 * We have to parse out the commands sufficiently to locate the blob OIDs
+	 * and find the command-ending semicolons.  The commands should not
+	 * contain anything hard to parse except for double-quoted role names,
+	 * which are easy to ignore.  Once we've split apart the first and second
+	 * halves of a command, apply IssueCommandPerBlob.  (This means the
+	 * updates on the blobs are interleaved if there's multiple commands, but
+	 * that should cause no trouble.)
+	 */
+	inquotes = false;
+	st = en = buf;
+	st2 = NULL;
+	while (*en)
 	{
-		/* Restoring to pre-9.0 server, so do it the old way */
-		ahprintf(AH,
-				 "SELECT CASE WHEN EXISTS("
-				 "SELECT 1 FROM pg_catalog.pg_largeobject WHERE loid = '%u'"
-				 ") THEN pg_catalog.lo_unlink('%u') END;\n",
-				 oid, oid);
+		/* Ignore double-quoted material */
+		if (*en == '"')
+			inquotes = !inquotes;
+		if (inquotes)
+		{
+			en++;
+			continue;
+		}
+		/* If we found "LARGE OBJECT", that's the end of the first half */
+		if (strncmp(en, "LARGE OBJECT ", 13) == 0)
+		{
+			/* Terminate the first-half string */
+			en += 13;
+			Assert(isdigit((unsigned char) *en));
+			*en++ = '\0';
+			/* Skip the rest of the blob OID */
+			while (isdigit((unsigned char) *en))
+				en++;
+			/* Second half starts here */
+			Assert(st2 == NULL);
+			st2 = en;
+		}
+		/* If we found semicolon, that's the end of the second half */
+		else if (*en == ';')
+		{
+			/* Terminate the second-half string */
+			*en++ = '\0';
+			Assert(st2 != NULL);
+			/* Issue this command for each blob */
+			IssueCommandPerBlob(AH, blobte, st, st2);
+			/* For neatness, skip whitespace before the next command */
+			while (isspace((unsigned char) *en))
+				en++;
+			/* Reset for new command */
+			st = en;
+			st2 = NULL;
+		}
+		else
+			en++;
 	}
+	pg_free(buf);
+}
+
+void
+DropLOIfExists(ArchiveHandle *AH, Oid oid)
+{
+	ahprintf(AH,
+			 "SELECT pg_catalog.lo_unlink(oid) "
+			 "FROM pg_catalog.pg_largeobject_metadata "
+			 "WHERE oid = '%u';\n",
+			 oid);
 }
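
[Editor's illustration] The splitting technique IssueACLPerBlob uses can be demonstrated outside the archiver. The following standalone sketch splits one GRANT command around the "LARGE OBJECT <oid>" token and the trailing semicolon, then re-emits it for each blob in a group; the sample SQL, role name, and OIDs are invented for illustration.

#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    /* One ACL command, written to apply to the first blob of the group */
    char        buf[] = "GRANT SELECT ON LARGE OBJECT 16401 TO \"some role\";";
    const char *oids[] = {"16401", "16402", "16403"};  /* made-up group */
    char       *st = buf;       /* start of the first half */
    char       *st2 = NULL;     /* start of the second half */
    char       *en = buf;
    bool        inquotes = false;

    while (*en)
    {
        /* Ignore double-quoted material (e.g. role names) */
        if (*en == '"')
            inquotes = !inquotes;
        if (inquotes)
        {
            en++;
            continue;
        }
        if (strncmp(en, "LARGE OBJECT ", 13) == 0)
        {
            /* Terminate the first half, then skip the written-out OID */
            en += 13;
            *en++ = '\0';
            while (isdigit((unsigned char) *en))
                en++;
            st2 = en;           /* second half starts here */
        }
        else if (*en == ';')
        {
            *en++ = '\0';       /* terminate the second half */
            /* Re-issue the command once per blob in the group */
            for (int i = 0; i < 3; i++)
                printf("%s%s%s;\n", st, oids[i], st2);
            st = en;
            st2 = NULL;
        }
        else
            en++;
    }
    return 0;
}

Running this prints the same GRANT three times, once per OID, which is exactly the space-saving trade the commit message describes: one stored copy of the ACL text, expanded at restore time.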