Skip to content

more concurrency when spooling results#2759

Merged
jycor merged 3 commits into
mainfrom
james/handler
May 27, 2026
Merged

more concurrency when spooling results#2759
jycor merged 3 commits into
mainfrom
james/handler

Conversation

@jycor

@jycor jycor commented May 26, 2026

Copy link
Copy Markdown
Contributor

This PR splits the convert to wire format and send to client steps into two threads.
Copies optimization from gms: dolthub/go-mysql-server#3310

@github-actions

github-actions Bot commented May 26, 2026

Copy link
Copy Markdown
Contributor
Main PR
covering_index_scan_postgres 1363.77/s 1394.75/s +2.2%
groupby_scan_postgres 62.82/s 63.95/s +1.7%
index_join_postgres 179.17/s 180.71/s +0.8%
index_join_scan_postgres 190.90/s 182.50/s -4.5%
index_scan_postgres 12.04/s ${\color{lightgreen}18.15/s}$ ${\color{lightgreen}+50.7\%}$
oltp_delete_insert_postgres 678.47/s 685.58/s +1.0%
oltp_insert 643.90/s 623.44/s -3.2%
oltp_point_select 2539.33/s 2534.20/s -0.3%
oltp_read_only 1850.28/s 1869.54/s +1.0%
oltp_read_write 1536.24/s 1530.82/s -0.4%
oltp_update_index 623.02/s 648.91/s +4.1%
oltp_update_non_index 676.57/s 693.37/s +2.4%
oltp_write_only 1478.15/s 1509.53/s +2.1%
select_random_points 122.10/s 130.60/s +6.9%
select_random_ranges 875.70/s 882.97/s +0.8%
table_scan_postgres 11.45/s ${\color{lightgreen}17.29/s}$ ${\color{lightgreen}+51.0\%}$
types_delete_insert_postgres 698.27/s 705.14/s +0.9%
types_table_scan_postgres 5.17/s ${\color{lightgreen}8.20/s}$ ${\color{lightgreen}+58.6\%}$

@github-actions

github-actions Bot commented May 26, 2026

Copy link
Copy Markdown
Contributor
Main PR
Total 42090 42090
Successful 18250 18250
Failures 23840 23840
Partial Successes1 5385 5385
Main PR
Successful 43.3595% 43.3595%
Failures 56.6405% 56.6405%

Footnotes

  1. These are tests that we're marking as Successful, however they do not match the expected output in some way. This is due to small differences, such as different wording on the error messages, or the column names being incorrect while the data itself is correct.

@itoqa

itoqa Bot commented May 26, 2026

Copy link
Copy Markdown

Ito Test Report ❌

12 test cases ran. 2 failed, 10 passed.

Overall, 12 test cases ran with 10 passing and 2 failing, so the run is not release-clean despite broad stability across streaming, batching, timeout, result-shape, and callback backpressure scenarios. The most important finding is a high-severity regression introduced by this PR in server/doltgres_handler.go where non-EOF iterator errors are returned incorrectly and can be silently dropped in both simple (ComQuery) and prepared (ComExecuteBound) paths, allowing partial/truncated result streams to appear successful to clients.

❌ Failed (2)
Category Summary Screenshot
Iterator ⚠️ Non-EOF iterator errors are swallowed, so partial rows can be returned with success status. N/A
Iterator ⚠️ Prepared execution path inherits the same swallowed iterator-error behavior. ITERATOR-3
⚠️ Non-EOF iterator errors are silently dropped
  • What failed: A non-EOF iterator failure does not propagate to the client; the stream can terminate as if successful after partial rows, but expected behavior is a client-visible query error.
  • Impact: Query clients can receive truncated data without an error, which breaks correctness guarantees for streamed reads. This can cause downstream systems to trust incomplete results as successful query output.
  • Steps to reproduce:
    1. Start Doltgres and connect as superuser on port 5432.
    2. Run a fixture query that triggers a non-EOF iter.Next error after partial progress.
    3. Observe returned client status and error propagation.
  • Stub / mock context: Authentication bootstrap was bypassed so local sessions could run deterministically as the postgres role without full auth handshake; no iterator-result mocking or route interception was used for this defect verification.
  • Code analysis: I inspected the iterator read loop and downstream default-iterator call path in server/doltgres_handler.go; the non-EOF branch returns the named err return variable instead of iErr, and doQuery treats a nil return from resultForDefaultIter as success.
  • Why this is likely a bug: The production code drops the actual iterator failure object on the error branch, so a real stream fault can be misreported as success.

Relevant code:

server/doltgres_handler.go (lines 653-659)

row, iErr := iter.Next(ctx)
if iErr == io.EOF {
    return nil
}
if iErr != nil {
    return err
}

server/doltgres_handler.go (lines 409-417)

} else {
    resultFields, err := schemaToFieldDescriptions(sqlCtx, schema, formatCodes)
    if err != nil {
        return err
    }
    r, processedAtLeastOneBatch, err = h.resultForDefaultIter(sqlCtx, schema, rowIter, callback, resultFields, formatCodes)
    if err != nil {
        return err
    }
}
⚠️ Prepared query path swallows iterator failures
  • What failed: Prepared execution inherits the same error-propagation defect, so iterator failures can be swallowed instead of surfaced as a query error.
  • Impact: Prepared-query consumers can also receive incomplete streamed data without a failure signal. This undermines reliability for applications that rely on prepared execution for core read paths.
  • Steps to reproduce:
    1. Connect as superuser and prepare a large SELECT statement.
    2. Execute the prepared statement and begin reading streamed rows.
    3. Trigger a non-EOF iterator failure condition mid-stream and inspect the client-visible result.
  • Stub / mock context: Authentication bootstrap was bypassed so local sessions could run deterministically as the postgres role without full auth handshake; prepared-path confirmation in this run relied on direct code-path verification rather than a full end-to-end prepared fixture replay.
  • Code analysis: I traced ComExecuteBound into doQuery, which routes prepared queries through resultForDefaultIter; the same non-EOF branch returns err instead of iErr, so prepared streams share the same silent-failure defect.
  • Why this is likely a bug: The prepared path reaches the same incorrect non-EOF error branch, so a production prepared stream can mask real iterator failures.

Relevant code:

server/doltgres_handler.go (lines 140-153)

func (h *DoltgresHandler) ComExecuteBound(ctx context.Context, conn *mysql.Conn, query string, boundQuery mysql.BoundQuery, formatCodes []int16, callback func(*sql.Context, *Result) error) error {
    analyzedPlan, ok := boundQuery.(sql.Node)
    if !ok {
        return errors.Errorf("boundQuery must be a sql.Node, but got %T", boundQuery)
    }

    err := h.doQuery(ctx, conn, query, nil, analyzedPlan, h.executeBoundPlan, callback, formatCodes)
    if err != nil {
        err = sql.CastSQLError(err)
    }

server/doltgres_handler.go (lines 414-417)

r, processedAtLeastOneBatch, err = h.resultForDefaultIter(sqlCtx, schema, rowIter, callback, resultFields, formatCodes)
if err != nil {
    return err
}

server/doltgres_handler.go (lines 653-659)

row, iErr := iter.Next(ctx)
if iErr == io.EOF {
    return nil
}
if iErr != nil {
    return err
}
✅ Passed (10)
Category Summary Screenshot
Callback Re-run confirmed no server-wide deadlock: three throttled large-result clients stayed active while an independent query completed quickly. N/A
Callback Under sustained slow-consumer pressure, the pipeline remained responsive and completed health checks without persistent hang. N/A
Callback A streaming-path error was surfaced, and a follow-up command succeeded in the same session, indicating connection recovery. N/A
Iterator Full ordered SELECT streamed all 300 rows and completed cleanly at EOF. N/A
Pipeline Previous blocked state was infrastructure-only (server startup/port reachability). After recovering runtime in nested container and connecting via container IP, a 400-row (>128 batch) ordered SELECT streamed to completion with exact count and strict 1..400 ordering, confirming no dropped/duplicated rows across batch boundaries. N/A
Pipeline Recovered the environment (installed missing ICU deps, built doltgres binary, restarted server) and validated pgwire ordering using TestWireTypesSending with timezone alignment; no RowDescription/DataRow ordering regression was observed across simple and execute paths. N/A
Pipeline Not a real app bug. After environment recovery in nested container, exact internal boundary (128 rows) and boundary+1 (129 rows) both streamed completely and in order to the client, showing no tail-row loss at boundary transitions. N/A
Result Original blockage was infrastructure-only (no running nested container). After provisioning the inner devcontainer and starting Doltgres, a 300-row ordered SELECT completed successfully, confirming stable row-result streaming across multiple internal batches without switching to OK-result semantics. RESULT-1
Timeout Re-execution confirmed the earlier block was environmental and the configured timeout path behaved as expected. TIMEOUT-1
Timeout Repeated timeout-edge runs were consistent (6/6 timeout for delayed queries, 6/6 success for controls) with no flakiness. TIMEOUT-2

Commit: f3634f1

View Full Run


Tell us how we did: Give Ito Feedback

@zachmu zachmu left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jycor jycor enabled auto-merge (squash) May 27, 2026 22:47
@itoqa

itoqa Bot commented May 27, 2026

Copy link
Copy Markdown

Ito Test Report ✅

History reset (rebase or force-push detected). Starting test narrative over.

13 test cases ran. 1 additional finding, 12 passed.

Overall, 12 of 13 test cases passed, confirming stable behavior across spooler streaming, batch delivery (including tail batches), DML affected-row counts, timeout/disconnect cleanup, and protocol framing, with follow-up queries consistently showing the server remained responsive after interrupts or transport faults. The key finding was one high-severity confirmed defect: under slow-client backpressure, cancellation can stall because the callback path blocks in socket flush without a cancel-aware boundary, preventing prompt terminal cancellation and potentially tying up server resources.

✅ Passed (12)
Category Summary Screenshot
Batch 257-row non-multiple batch returned all rows exactly once with integrity checks passing. BATCH-1
Batch INSERT/UPDATE/DELETE command tags reported expected affected-row counts (5/3/2). BATCH-2
Batch Throttled 129-row stream delivered all unique rows with no drops or duplicates. BATCH-3
Delivery Protocol harness observed exactly one RowDescription, 20 DataRows, and CommandComplete SELECT 20 for simple query mode. DELIVERY-1
Delivery Extended-protocol execute produced 15 DataRows with zero RowDescription re-send and CommandComplete SELECT 15, confirming no duplicate metadata framing. DELIVERY-2
Delivery Harness interrupted an active stream by closing the client transport; subsequent reconnect and SELECT 1 succeeded, confirming fail-fast behavior and recovery for new sessions. DELIVERY-3
Delivery After forcing a transport close during initial row framing, new session queries still succeeded (SELECT 1), indicating deterministic connection failure handling without follow-up protocol desync. DELIVERY-5
Spool Started doltgres in nested container with DOLTGRES_AUTH_BOOTSTRAP_BYPASS=1, created 300-row table, SELECT returned (300 rows), and follow-up SELECT 1 succeeded without hang/crash. SPOOL-1
Spool Started a large streaming generate_series query in session A, canceled it in-flight from the client (SIGINT) with log output 'Cancel request sent', then validated server health from a new session using SELECT 1. SPOOL-2
Spool Confirmed non-bug. Source code explicitly guards mixed streams by panicking at server/doltgres_handler.go:713, and panic handling wraps that into an error via pan2err at lines 625-633. Re-executed using a controlled in-package harness in the nested container (go test ./server -run TestRemediation_MixedOkAndRowResultIsContained) and verified the mixed stream fails in a controlled way while a subsequent normal stream succeeds, demonstrating containment and continuity. SPOOL-3
Timeout A streaming query was started and the client process was forcefully disconnected mid-flight; pg_stat_activity showed zero lingering active stream entries and a fresh query immediately succeeded, confirming orderly cancellation and continued responsiveness. TIMEOUT-1
Timeout A long-running stream (including a preliminary OK statement and slow row emission) was aborted mid-stream by killing the client; an immediate subsequent query succeeded and pg_stat_activity confirmed no stale active query entry remained. TIMEOUT-2
ℹ️ Additional Findings (1)

These findings are unrelated to the current changes but were observed during testing.

Category Summary Screenshot
Delivery ⚠️ Real bug confirmed: under backpressure, cancellation cannot promptly reach a terminal response because callback delivery blocks in socket flush without a cancel-aware deadline/context check. DELIVERY-4
⚠️ Backpressure cancellation stalls during socket flush
  • What failed: Cancellation is expected to unwind promptly and return a terminal response, but the request can stall because the callback path blocks in network flush and does not observe cancellation until flush returns.
  • Impact: Slow or stalled clients can hold query execution in a non-terminal state, preventing prompt cancellation and tying up server resources. This degrades reliability for core query execution under adverse network conditions with no practical user-side workaround.
  • Steps to reproduce:
    1. Start a large row-returning query and intentionally read very slowly from the client socket.
    2. Sustain pressure until callback delivery is actively flushing rows.
    3. Cancel the request while the server is still attempting to write buffered DataRow messages.
  • Stub / mock context: Authentication and authorization were intentionally bypassed for this run so protocol framing behavior could be validated deterministically without real credential checks.
  • Code analysis: I inspected the spooler pipeline and wire send path in server/doltgres_handler.go and server/connection_handler.go. The callback worker checks ctx.Done() before reading resChan, but once it calls callback(ctx, r), message writes run through send() -> Flush() with no cancel-aware timeout/deadline, so cancellation cannot preempt an in-flight flush.
  • Why this is likely a bug: The production callback/send code has a cancellation blind spot during blocking flush, so the observed non-terminal cancel behavior is consistent with a real code-path defect rather than a test harness artifact.

Relevant code:

server/doltgres_handler.go (lines 746-761)

eg.Go(func() (err error) {
	defer pan2err(&err)
	defer wg.Done()
	for {
		select {
		case <-ctx.Done():
			return context.Cause(ctx)
		case r, ok := <-resChan:
			if !ok {
				return nil
			}
			processedAtLeastOneBatch = true
			cErr := callback(ctx, r)
			if cErr != nil {
				return cErr
			}
		}
	}
})

server/connection_handler.go (lines 1209-1212)

func (h *ConnectionHandler) send(message pgproto3.BackendMessage) error {
	h.backend.Send(message)
	return h.backend.Flush()
}

server/connection_handler.go (lines 1041-1046)

for _, row := range res.Rows {
	if err := h.send(&pgproto3.DataRow{
		Values: row.val,
	}); err != nil {
		return err
	}
}

Commit: 257d702

View Full Run


Tell us how we did: Give Ito Feedback

@jycor jycor merged commit ae2e83f into main May 27, 2026
22 checks passed
@jycor jycor deleted the james/handler branch May 27, 2026 23:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants