aboutsummaryrefslogtreecommitdiff
path: root/src/backend/executor/nodeHashjoin.c
diff options
context:
space:
mode:
authorTomas Vondra <tomas.vondra@postgresql.org>2023-05-19 16:31:11 +0200
committerTomas Vondra <tomas.vondra@postgresql.org>2023-05-19 17:17:58 +0200
commit8c4040edf456d9241816176eacb79e4d9a0034fc (patch)
treec175487d46f9b8133829bb3843f498f52f06dbc6 /src/backend/executor/nodeHashjoin.c
parent507615fc533b1b65bcecc6218e36436687fe8420 (diff)
downloadpostgresql-8c4040edf456d9241816176eacb79e4d9a0034fc.tar.gz
postgresql-8c4040edf456d9241816176eacb79e4d9a0034fc.zip
Allocate hash join files in a separate memory context
Should a hash join exceed memory limit, the hashtable is split up into multiple batches. The number of batches is doubled each time a given batch is determined not to fit in memory. Each batch file is allocated with a block-sized buffer for buffering tuples and parallel hash join has additional sharedtuplestore accessor buffers. In some pathological cases requiring a lot of batches, often with skewed data, bad stats, or very large datasets, users can run out-of-memory solely from the memory overhead of all the batch files' buffers. Batch files were allocated in the ExecutorState memory context, making it very hard to identify when this batch explosion was the source of an OOM. This commit allocates the batch files in a dedicated memory context, making it easier to identify the cause of an OOM and work to avoid it. Based on initial draft by Tomas Vondra, with significant reworks and improvements by Jehan-Guillaume de Rorthais. Author: Jehan-Guillaume de Rorthais <jgdr@dalibo.com> Author: Tomas Vondra <tomas.vondra@enterprisedb.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20190421114618.z3mpgmimc3rmubi4@development Discussion: https://postgr.es/m/20230504193006.1b5b9622%40karst#273020ff4061fc7a2fbb1ba96b281f17
Diffstat (limited to 'src/backend/executor/nodeHashjoin.c')
-rw-r--r--src/backend/executor/nodeHashjoin.c31
1 files changed, 25 insertions, 6 deletions
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 615d9980cf5..e40436db38e 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -495,7 +495,8 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
Assert(parallel_state == NULL);
Assert(batchno > hashtable->curbatch);
ExecHashJoinSaveTuple(mintuple, hashvalue,
- &hashtable->outerBatchFile[batchno]);
+ &hashtable->outerBatchFile[batchno],
+ hashtable);
if (shouldFree)
heap_free_minimal_tuple(mintuple);
@@ -1317,21 +1318,39 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
* The data recorded in the file for each tuple is its hash value,
* then the tuple in MinimalTuple format.
*
- * Note: it is important always to call this in the regular executor
- * context, not in a shorter-lived context; else the temp file buffers
- * will get messed up.
+ * fileptr points to a batch file in one of the hashtable arrays.
+ *
+ * The batch files (and their buffers) are allocated in the spill context
+ * created for the hashtable.
*/
void
ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
- BufFile **fileptr)
+ BufFile **fileptr, HashJoinTable hashtable)
{
BufFile *file = *fileptr;
+ /*
+ * The batch file is lazily created. If this is the first tuple
+ * written to this batch, the batch file is created and its buffer is
+ * allocated in the spillCxt context, NOT in the batchCxt.
+ *
+ * During the build phase, buffered files are created for inner
+ * batches. Each batch's buffered file is closed (and its buffer freed)
+ * after the batch is loaded into memory during the outer side scan.
+ * Therefore, it is necessary to allocate the batch file buffer in a
+ * memory context which outlives the batch itself.
+ *
+ * Also, we use spillCxt instead of hashCxt for better accounting of
+ * the memory consumed by spilling.
+ */
if (file == NULL)
{
- /* First write to this batch file, so open it. */
+ MemoryContext oldctx = MemoryContextSwitchTo(hashtable->spillCxt);
+
file = BufFileCreateTemp(false);
*fileptr = file;
+
+ MemoryContextSwitchTo(oldctx);
}
BufFileWrite(file, &hashvalue, sizeof(uint32));