more concurrency when spooling results#2759

jycor · 2026-05-26T22:05:49Z

This PR splits the convert to wire format and send to client steps into two threads.
Copies optimization from gms: dolthub/go-mysql-server#3310

github-actions · 2026-05-26T22:37:40Z

	Main	PR
covering_index_scan_postgres	1363.77/s	1394.75/s	+2.2%
groupby_scan_postgres	62.82/s	63.95/s	+1.7%
index_join_postgres	179.17/s	180.71/s	+0.8%
index_join_scan_postgres	190.90/s	182.50/s	-4.5%
index_scan_postgres	12.04/s	${\color{lightgreen}18.15/s}$	${\color{lightgreen}+50.7\%}$
oltp_delete_insert_postgres	678.47/s	685.58/s	+1.0%
oltp_insert	643.90/s	623.44/s	-3.2%
oltp_point_select	2539.33/s	2534.20/s	-0.3%
oltp_read_only	1850.28/s	1869.54/s	+1.0%
oltp_read_write	1536.24/s	1530.82/s	-0.4%
oltp_update_index	623.02/s	648.91/s	+4.1%
oltp_update_non_index	676.57/s	693.37/s	+2.4%
oltp_write_only	1478.15/s	1509.53/s	+2.1%
select_random_points	122.10/s	130.60/s	+6.9%
select_random_ranges	875.70/s	882.97/s	+0.8%
table_scan_postgres	11.45/s	${\color{lightgreen}17.29/s}$	${\color{lightgreen}+51.0\%}$
types_delete_insert_postgres	698.27/s	705.14/s	+0.9%
types_table_scan_postgres	5.17/s	${\color{lightgreen}8.20/s}$	${\color{lightgreen}+58.6\%}$

github-actions · 2026-05-26T23:27:56Z

	Main	PR
Total	42090	42090
Successful	18250	18250
Failures	23840	23840
Partial Successes¹	5385	5385

	Main	PR
Successful	43.3595%	43.3595%
Failures	56.6405%	56.6405%

These are tests that we're marking as Successful, however they do not match the expected output in some way. This is due to small differences, such as different wording on the error messages, or the column names being incorrect while the data itself is correct. ↩

itoqa · 2026-05-26T23:31:25Z

Ito Test Report ❌

12 test cases ran. 2 failed, 10 passed.

Overall, 12 test cases ran with 10 passing and 2 failing, so the run is not release-clean despite broad stability across streaming, batching, timeout, result-shape, and callback backpressure scenarios. The most important finding is a high-severity regression introduced by this PR in server/doltgres_handler.go where non-EOF iterator errors are returned incorrectly and can be silently dropped in both simple (ComQuery) and prepared (ComExecuteBound) paths, allowing partial/truncated result streams to appear successful to clients.

❌ Failed (2)

Category	Summary	Screenshot
Iterator	⚠️ Non-EOF iterator errors are swallowed, so partial rows can be returned with success status.	N/A
Iterator	⚠️ Prepared execution path inherits the same swallowed iterator-error behavior.

⚠️ Non-EOF iterator errors are silently dropped

What failed: A non-EOF iterator failure does not propagate to the client; the stream can terminate as if successful after partial rows, but expected behavior is a client-visible query error.
Impact: Query clients can receive truncated data without an error, which breaks correctness guarantees for streamed reads. This can cause downstream systems to trust incomplete results as successful query output.
Steps to reproduce:
1. Start Doltgres and connect as superuser on port 5432.
2. Run a fixture query that triggers a non-EOF iter.Next error after partial progress.
3. Observe returned client status and error propagation.
Stub / mock context: Authentication bootstrap was bypassed so local sessions could run deterministically as the postgres role without full auth handshake; no iterator-result mocking or route interception was used for this defect verification.
Code analysis: I inspected the iterator read loop and downstream default-iterator call path in server/doltgres_handler.go; the non-EOF branch returns the named err return variable instead of iErr, and doQuery treats a nil return from resultForDefaultIter as success.
Why this is likely a bug: The production code drops the actual iterator failure object on the error branch, so a real stream fault can be misreported as success.

Relevant code:

server/doltgres_handler.go (lines 653-659)

row, iErr := iter.Next(ctx)
if iErr == io.EOF {
    return nil
}
if iErr != nil {
    return err
}

server/doltgres_handler.go (lines 409-417)

} else {
    resultFields, err := schemaToFieldDescriptions(sqlCtx, schema, formatCodes)
    if err != nil {
        return err
    }
    r, processedAtLeastOneBatch, err = h.resultForDefaultIter(sqlCtx, schema, rowIter, callback, resultFields, formatCodes)
    if err != nil {
        return err
    }
}

⚠️ Prepared query path swallows iterator failures

What failed: Prepared execution inherits the same error-propagation defect, so iterator failures can be swallowed instead of surfaced as a query error.
Impact: Prepared-query consumers can also receive incomplete streamed data without a failure signal. This undermines reliability for applications that rely on prepared execution for core read paths.
Steps to reproduce:
1. Connect as superuser and prepare a large SELECT statement.
2. Execute the prepared statement and begin reading streamed rows.
3. Trigger a non-EOF iterator failure condition mid-stream and inspect the client-visible result.
Stub / mock context: Authentication bootstrap was bypassed so local sessions could run deterministically as the postgres role without full auth handshake; prepared-path confirmation in this run relied on direct code-path verification rather than a full end-to-end prepared fixture replay.
Code analysis: I traced ComExecuteBound into doQuery, which routes prepared queries through resultForDefaultIter; the same non-EOF branch returns err instead of iErr, so prepared streams share the same silent-failure defect.
Why this is likely a bug: The prepared path reaches the same incorrect non-EOF error branch, so a production prepared stream can mask real iterator failures.

Relevant code:

server/doltgres_handler.go (lines 140-153)

func (h *DoltgresHandler) ComExecuteBound(ctx context.Context, conn *mysql.Conn, query string, boundQuery mysql.BoundQuery, formatCodes []int16, callback func(*sql.Context, *Result) error) error {
    analyzedPlan, ok := boundQuery.(sql.Node)
    if !ok {
        return errors.Errorf("boundQuery must be a sql.Node, but got %T", boundQuery)
    }

    err := h.doQuery(ctx, conn, query, nil, analyzedPlan, h.executeBoundPlan, callback, formatCodes)
    if err != nil {
        err = sql.CastSQLError(err)
    }

server/doltgres_handler.go (lines 414-417)

r, processedAtLeastOneBatch, err = h.resultForDefaultIter(sqlCtx, schema, rowIter, callback, resultFields, formatCodes)
if err != nil {
    return err
}

server/doltgres_handler.go (lines 653-659)

row, iErr := iter.Next(ctx)
if iErr == io.EOF {
    return nil
}
if iErr != nil {
    return err
}

✅ Passed (10)

Category	Summary	Screenshot
Callback	Re-run confirmed no server-wide deadlock: three throttled large-result clients stayed active while an independent query completed quickly.	N/A
Callback	Under sustained slow-consumer pressure, the pipeline remained responsive and completed health checks without persistent hang.	N/A
Callback	A streaming-path error was surfaced, and a follow-up command succeeded in the same session, indicating connection recovery.	N/A
Iterator	Full ordered SELECT streamed all 300 rows and completed cleanly at EOF.	N/A
Pipeline	Previous blocked state was infrastructure-only (server startup/port reachability). After recovering runtime in nested container and connecting via container IP, a 400-row (>128 batch) ordered SELECT streamed to completion with exact count and strict 1..400 ordering, confirming no dropped/duplicated rows across batch boundaries.	N/A
Pipeline	Recovered the environment (installed missing ICU deps, built doltgres binary, restarted server) and validated pgwire ordering using TestWireTypesSending with timezone alignment; no RowDescription/DataRow ordering regression was observed across simple and execute paths.	N/A
Pipeline	Not a real app bug. After environment recovery in nested container, exact internal boundary (128 rows) and boundary+1 (129 rows) both streamed completely and in order to the client, showing no tail-row loss at boundary transitions.	N/A
Result	Original blockage was infrastructure-only (no running nested container). After provisioning the inner devcontainer and starting Doltgres, a 300-row ordered SELECT completed successfully, confirming stable row-result streaming across multiple internal batches without switching to OK-result semantics.
Timeout	Re-execution confirmed the earlier block was environmental and the configured timeout path behaved as expected.
Timeout	Repeated timeout-edge runs were consistent (6/6 timeout for delayed queries, 6/6 success for controls) with no flakiness.

Commit: f3634f1

View Full Run

Tell us how we did: Give Ito Feedback

zachmu

LGTM

itoqa · 2026-05-27T23:22:21Z

Ito Test Report ✅

History reset (rebase or force-push detected). Starting test narrative over.

13 test cases ran. 1 additional finding, 12 passed.

Overall, 12 of 13 test cases passed, confirming stable behavior across spooler streaming, batch delivery (including tail batches), DML affected-row counts, timeout/disconnect cleanup, and protocol framing, with follow-up queries consistently showing the server remained responsive after interrupts or transport faults. The key finding was one high-severity confirmed defect: under slow-client backpressure, cancellation can stall because the callback path blocks in socket flush without a cancel-aware boundary, preventing prompt terminal cancellation and potentially tying up server resources.

✅ Passed (12)

Category	Summary	Screenshot
Batch	257-row non-multiple batch returned all rows exactly once with integrity checks passing.
Batch	INSERT/UPDATE/DELETE command tags reported expected affected-row counts (5/3/2).
Batch	Throttled 129-row stream delivered all unique rows with no drops or duplicates.
Delivery	Protocol harness observed exactly one RowDescription, 20 DataRows, and CommandComplete SELECT 20 for simple query mode.
Delivery	Extended-protocol execute produced 15 DataRows with zero RowDescription re-send and CommandComplete SELECT 15, confirming no duplicate metadata framing.
Delivery	Harness interrupted an active stream by closing the client transport; subsequent reconnect and SELECT 1 succeeded, confirming fail-fast behavior and recovery for new sessions.
Delivery	After forcing a transport close during initial row framing, new session queries still succeeded (SELECT 1), indicating deterministic connection failure handling without follow-up protocol desync.
Spool	Started doltgres in nested container with DOLTGRES_AUTH_BOOTSTRAP_BYPASS=1, created 300-row table, SELECT returned (300 rows), and follow-up SELECT 1 succeeded without hang/crash.
Spool	Started a large streaming generate_series query in session A, canceled it in-flight from the client (SIGINT) with log output 'Cancel request sent', then validated server health from a new session using SELECT 1.
Spool	Confirmed non-bug. Source code explicitly guards mixed streams by panicking at server/doltgres_handler.go:713, and panic handling wraps that into an error via pan2err at lines 625-633. Re-executed using a controlled in-package harness in the nested container (`go test ./server -run TestRemediation_MixedOkAndRowResultIsContained`) and verified the mixed stream fails in a controlled way while a subsequent normal stream succeeds, demonstrating containment and continuity.
Timeout	A streaming query was started and the client process was forcefully disconnected mid-flight; pg_stat_activity showed zero lingering active stream entries and a fresh query immediately succeeded, confirming orderly cancellation and continued responsiveness.
Timeout	A long-running stream (including a preliminary OK statement and slow row emission) was aborted mid-stream by killing the client; an immediate subsequent query succeeded and pg_stat_activity confirmed no stale active query entry remained.

ℹ️ Additional Findings (1)

These findings are unrelated to the current changes but were observed during testing.

Category	Summary	Screenshot
Delivery	⚠️ Real bug confirmed: under backpressure, cancellation cannot promptly reach a terminal response because callback delivery blocks in socket flush without a cancel-aware deadline/context check.

⚠️ Backpressure cancellation stalls during socket flush

What failed: Cancellation is expected to unwind promptly and return a terminal response, but the request can stall because the callback path blocks in network flush and does not observe cancellation until flush returns.
Impact: Slow or stalled clients can hold query execution in a non-terminal state, preventing prompt cancellation and tying up server resources. This degrades reliability for core query execution under adverse network conditions with no practical user-side workaround.
Steps to reproduce:
1. Start a large row-returning query and intentionally read very slowly from the client socket.
2. Sustain pressure until callback delivery is actively flushing rows.
3. Cancel the request while the server is still attempting to write buffered DataRow messages.
Stub / mock context: Authentication and authorization were intentionally bypassed for this run so protocol framing behavior could be validated deterministically without real credential checks.
Code analysis: I inspected the spooler pipeline and wire send path in server/doltgres_handler.go and server/connection_handler.go. The callback worker checks ctx.Done() before reading resChan, but once it calls callback(ctx, r), message writes run through send() -> Flush() with no cancel-aware timeout/deadline, so cancellation cannot preempt an in-flight flush.
Why this is likely a bug: The production callback/send code has a cancellation blind spot during blocking flush, so the observed non-terminal cancel behavior is consistent with a real code-path defect rather than a test harness artifact.

Relevant code:

server/doltgres_handler.go (lines 746-761)

eg.Go(func() (err error) {
	defer pan2err(&err)
	defer wg.Done()
	for {
		select {
		case <-ctx.Done():
			return context.Cause(ctx)
		case r, ok := <-resChan:
			if !ok {
				return nil
			}
			processedAtLeastOneBatch = true
			cErr := callback(ctx, r)
			if cErr != nil {
				return cErr
			}
		}
	}
})

server/connection_handler.go (lines 1209-1212)

func (h *ConnectionHandler) send(message pgproto3.BackendMessage) error {
	h.backend.Send(message)
	return h.backend.Flush()
}

server/connection_handler.go (lines 1041-1046)

for _, row := range res.Rows {
	if err := h.send(&pgproto3.DataRow{
		Values: row.val,
	}); err != nil {
		return err
	}
}

Commit: 257d702

View Full Run

Tell us how we did: Give Ito Feedback

James Cor added 2 commits May 27, 2026 15:08

additional thread for callback

edf803d

forgot to close channel

6cc849a

jycor force-pushed the james/handler branch from f3634f1 to 6cc849a Compare May 27, 2026 22:08

return correct error

257d702

zachmu approved these changes May 27, 2026

View reviewed changes

jycor enabled auto-merge (squash) May 27, 2026 22:47

jycor merged commit ae2e83f into main May 27, 2026
22 checks passed

jycor deleted the james/handler branch May 27, 2026 23:28

BrewTestBot mentioned this pull request May 28, 2026

doltgres 0.56.4 Homebrew/homebrew-core#285182

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

more concurrency when spooling results#2759