HPML Final Project: Towards Multi-Turn Dialog Systems for Industrial Asset Operations

Course: High Performance Machine Learning
Semester: Spring 2026
Instructor: Dr. Kaoutar El Maghraoui
Project Mentor: Dr. Dhaval Patel

Team Information

Team 5
Members:
- Rujing Li (rl3641) — Scenario design, profiling pipeline, results analysis
- Yitong Bai (yb2636) — Multi-agent architecture optimization techniques
- Chengrui Li (cl4750) — Baseline refactoring, multi-agent architecture setup
- Rui Li (rl3586) — Evaluation pipeline, results analysis, figures
All team members contributed equally.

Submission

GitHub repository:
- Main System & Profiling: https://github.com/Coderlicr/Multi-Turn-AssetOps (this repository)
- Evaluation: https://github.com/Rui2026/Multi-Turn-AssetOps-Evaluation
Final report: deliverables/HPML_Final_Report_Team5.pdf
Final presentation: deliverables/HPML_Final_Presentation_Team5.pdf
Experiment tracking: link to public Wandb dashboard

1. Problem Statement

Industrial operations and maintenance (O&M) question answering is naturally multi-turn: users refine queries, ask follow-up questions, and expect the system to reuse previous evidence while invoking specialized tools. The baseline Plan-Execute single-agent workflow is fragile in this setting because it plans mostly linearly, repeats expensive tool calls, struggles with tool-argument hallucination, and expands context rapidly after failures.

This project targets inference-time system performance for a tool-centric industrial diagnostic agent. The primary bottleneck is not GPU training throughput, but end-to-end inference latency and cost across remote LLM calls, MCP tool execution, CouchDB retrieval, and multi-agent routing. We optimize the runtime by adding memory-aware artifact reuse, Supervisor-Specialist routing, and optional parallel MCP tool execution.

2. Model/Application Description

Application: Multi-turn industrial asset operations assistant for fault diagnosis, predictive maintenance, operational monitoring, maintenance planning, and end-to-end remediation workflows.
LLM backend: LiteLLM wrapper over IBM WatsonX. The default model in the CLI is watsonx/meta-llama/llama-4-maverick-17b-128e-instruct-fp8.
Agent architectures compared:
- Baseline: Plan-Execute single-agent workflow with sequential MCP tool calls.
- SS: Supervisor-Specialist architecture implemented with LangGraph.
- SSA: Supervisor-Specialist Advanced with parallel MCP tool batches.
Specialist agents: Data Collection, Time Series Analysis, Failure Reasoning, and Maintenance Planning, routed by a Supervisor agent.
Forecasting/anomaly models: IBM Granite-TSFM / TinyTimeMixer artifacts under src/servers/tsfm/artifacts/tsfm_models/; conformal anomaly detection and TSFM forecasting are exposed through the TSFM MCP server.
Frameworks and libraries: Python 3.12+, LangGraph, LiteLLM, FastMCP/MCP, CouchDB, PyTorch/Transformers/Granite-TSFM, pandas, NumPy, SciPy, Pydantic, Weights & Biases.
Dataset: 16 multi-turn industrial diagnosis scenarios in eval/scenarios.py, derived from the AssetOpsBench-style workflow design in DESIGN.md.
Operational data: IoT time-series data for MAIN site chillers 3, 4, 6, and 9; work orders, events, alerts, and failure-code mappings loaded into CouchDB.
Hardware target: Remote IBM WatsonX inference for LLM calls plus local Python/MCP/CouchDB execution. The performance study measures system-level inference latency rather than model training throughput on a fixed GPU.
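
The Supervisor-Specialist control flow described above can be illustrated with a small plain-Python sketch. This is not the LangGraph implementation in this repository: the real Supervisor is LLM-driven, whereas the rule below simply picks the first specialist whose output artifact is missing. The artifact names (sensor_data, anomalies, failure_modes, plan) and all function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Specialist names follow the README; the artifact each one produces is an
# assumption made for this sketch.
PIPELINE = [
    ("data_collection", "sensor_data"),
    ("time_series_analysis", "anomalies"),
    ("failure_reasoning", "failure_modes"),
    ("maintenance_planning", "plan"),
]


@dataclass
class DialogState:
    artifacts: dict = field(default_factory=dict)  # survives across turns
    trace: list = field(default_factory=list)      # specialists that ran


def next_specialist(state: DialogState) -> Optional[str]:
    """Hypothetical routing rule: first specialist whose artifact is missing."""
    for name, artifact in PIPELINE:
        if artifact not in state.artifacts:
            return name
    return None


def run_specialist(name: str, state: DialogState) -> None:
    """Stand-in for a real specialist: it only records a placeholder artifact."""
    artifact = dict(PIPELINE)[name]
    state.artifacts[artifact] = f"<{artifact} from {name}>"
    state.trace.append(name)


def run_turn(state: DialogState, max_steps: int = 12) -> DialogState:
    """Route until nothing is missing or the step budget is exhausted."""
    for _ in range(max_steps):
        name = next_specialist(state)
        if name is None:
            break
        run_specialist(name, state)
    return state
```

Calling run_turn twice on the same DialogState runs all four specialists on the first call and none on the second, which mirrors the idea that a follow-up turn should not redo work whose artifacts already exist.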

3. Final Results Summary

Measurements below are from the presentation results over 16 benchmark dialogs using IBM WatsonX / Llama-4-Maverick-17B-FP8.

Headline numbers

Result	Baseline	Optimized / SS	Improvement
End-to-end dialog latency	323.5 s avg/dialog	265.5 s avg/dialog	1.2x faster end-to-end
TSFM tool latency	159.5 s avg/dialog	37.4 s avg/dialog	4.3x TSFM tool speedup
Follow-up turn latency	145 s on SS turn 1	34 s avg on SS turns 2-5	4.2x faster after turn 1

Profiler run-level summary

Metric	Baseline	SS	SSA
Total wall time	83.9 min	65.2 min	73.3 min
Total tokens consumed	2,553,150	3,322,234	3,623,430
Total LLM API calls	841	941	751
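
As a quick derived view of the run-level summary, the snippet below divides total tokens by total LLM API calls for each architecture. The input numbers are copied from the table above; the derived per-call figures are plain arithmetic and are not reported separately in the profiler output.

```python
# Run-level totals copied from the profiler summary table.
runs = {
    "Baseline": {"tokens": 2_553_150, "llm_calls": 841},
    "SS": {"tokens": 3_322_234, "llm_calls": 941},
    "SSA": {"tokens": 3_623_430, "llm_calls": 751},
}

# Average tokens carried by each LLM API call.
tokens_per_call = {
    name: run["tokens"] / run["llm_calls"] for name, run in runs.items()
}

for name, value in tokens_per_call.items():
    print(f"{name:8s} {value:8.1f} tokens per LLM call")
```

The per-call average rises from Baseline to SS to SSA, so SSA makes fewer calls but each call carries more context.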

Latency breakdown by architecture

Architecture	LLM Time	Tool Time	Routing / Other
Baseline Plan-Execute	43.0%	47.3%	9.7%
SS	69.3%	26.3%	4.4%
SSA	72.3%	23.9%	3.8%

Headline result: The Supervisor-Specialist system shifts the dominant bottleneck away from redundant tool execution: tool time drops from 47.3% to 26.3% of wall time, TSFM latency drops from 159.5 s to 37.4 s per dialog, and end-to-end evaluation wall time improves from 83.9 min to 65.2 min across the benchmark.
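
The ratios behind the headline result can be rechecked from the figures quoted in that paragraph. This is only a worked calculation on the reported numbers, not additional measurement.

```python
def speedup(before: float, after: float) -> float:
    """Ratio of the baseline value to the optimized value."""
    return before / after


# Figures quoted in the headline result (Baseline vs. SS).
tsfm_speedup = speedup(159.5, 37.4)        # TSFM seconds per dialog
wall_speedup = speedup(83.9, 65.2)         # total benchmark minutes
tool_share_drop = round(47.3 - 26.3, 1)    # percentage points of wall time

print(f"TSFM tool latency:    {tsfm_speedup:.2f}x")
print(f"Benchmark wall time:  {wall_speedup:.2f}x")
print(f"Tool share of wall time: -{tool_share_drop} percentage points")
```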

4. Repository Structure

.
|-- README.md
|-- DESIGN.md / DESIGN_annotated.md
|-- PROFILING.md
|-- pyproject.toml
|-- uv.lock
|-- eval/
|   |-- run_eval.py
|   |-- scenarios.py
|   `-- results/
|-- logs/
|   `-- supervisor_specialist/
|-- src/
|   |-- agent/
|   |   |-- cli.py
|   |   |-- plan_execute/
|   |   `-- supervisor_specialist/
|   |       |-- cli.py
|   |       |-- graph.py
|   |       |-- runner.py
|   |       |-- agents/
|   |       `-- runtime/
|   |-- couchdb/
|   |   |-- docker-compose.yaml
|   |   |-- init_asset_data.py
|   |   |-- init_wo.py
|   |   `-- sample_data/
|   |-- llm/
|   |   |-- base.py
|   |   `-- litellm.py
|   `-- servers/
|       |-- iot/
|       |-- wo/
|       |-- tsfm/
|       |-- fmsr/
|       |-- utilities/
|       `-- vibration/
`-- wandb/

5. Reproducibility Instructions

A. Environment Setup

git clone https://github.com/Coderlicr/Multi-Turn-AssetOps.git
cd Multi-Turn-AssetOps

uv sync
source .venv/bin/activate

The project uses uv and requires Python 3.12+. Optional TSFM dependencies include PyTorch, Transformers, and granite-tsfm.

Create a .env file from .env.example and fill in the WatsonX credentials:

cp .env.example .env

Important environment variables:

COUCHDB_URL=http://localhost:5984
IOT_DBNAME=chiller
WO_DBNAME=workorder
COUCHDB_USERNAME=admin
COUCHDB_PASSWORD=password

WATSONX_APIKEY=<your key>
WATSONX_PROJECT_ID=<your project id>
WATSONX_URL=https://us-south.ml.cloud.ibm.com

SS_MODEL_ID=watsonx/meta-llama/llama-4-maverick-17b-128e-instruct-fp8
SS_MAX_STEPS=12
SUPERVISOR_SPECIALIST_PARALLELISM=4
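
For reference, a minimal sketch of how the three Supervisor-Specialist variables could be read in Python. The SSConfig class and load_config function are hypothetical and are not taken from this repository. The fallback values reuse the example values listed above and are assumed defaults, not confirmed ones.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SSConfig:
    model_id: str
    max_steps: int
    parallelism: int


def load_config(env: Optional[Mapping[str, str]] = None) -> SSConfig:
    """Read the SS_* settings, falling back to the README's example values."""
    env = os.environ if env is None else env
    return SSConfig(
        model_id=env.get(
            "SS_MODEL_ID",
            "watsonx/meta-llama/llama-4-maverick-17b-128e-instruct-fp8",
        ),
        # Environment values are strings, so numeric settings are converted.
        max_steps=int(env.get("SS_MAX_STEPS", "12")),
        parallelism=int(env.get("SUPERVISOR_SPECIALIST_PARALLELISM", "4")),
    )
```

Passing a plain dict instead of os.environ, for example load_config({"SS_MAX_STEPS": "6"}), makes the parsing easy to check without touching the shell environment.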

B. Data Setup

Place the downloaded IoT main.json at:

src/couchdb/sample_data/iot/main.json

Start CouchDB and initialize the databases:

docker compose -f src/couchdb/docker-compose.yaml up -d
python src/couchdb/check_couchdb_data.py

The setup imports:

- IoT sensor data for Chiller 3, Chiller 4, Chiller 6, and Chiller 9 into the chiller database.
- Work-order and alert data from src/couchdb/sample_data/work_order/ into the workorder database.
- Optional vibration data into the vibration database.
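
One way to confirm that the import worked is to ask CouchDB for each database's document count. The sketch below uses only the standard library and CouchDB's GET /{db} endpoint, whose JSON response includes a doc_count field. The helper names are hypothetical and this is not the repository's own check script. The URL, credentials, and database names are the example values from the environment section.

```python
import base64
import json
import urllib.request


def db_info_request(base_url: str, db: str, user: str, password: str):
    """Build an authenticated GET /{db} request for a CouchDB server."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/{db}",
        headers={"Authorization": f"Basic {token}"},
    )


def doc_count(request, opener=urllib.request.urlopen) -> int:
    """Return the number of documents reported by CouchDB for a database."""
    with opener(request, timeout=10) as response:
        return json.load(response)["doc_count"]


def report(base_url="http://localhost:5984", user="admin", password="password"):
    """Print document counts for the two databases named in the README."""
    for db in ("chiller", "workorder"):
        request = db_info_request(base_url, db, user, password)
        print(f"{db}: {doc_count(request)} documents")
```

Calling report() requires a running CouchDB instance. A count of zero for either database suggests the initialization step did not load the sample data.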

C. Run the Baseline

uv run plan-execute --show-plan --show-history \
  "What is the current date and time? Also list assets at site MAIN."

D. Run Supervisor-Specialist

Single-turn example:

uv run supervisor-specialist --reference-date 2020-06-20 \
  "The temperature of our chiller at Site MAIN seems unusually high lately. Can you look into it?"

Multi-turn session:

uv run supervisor-specialist --multi-turn --reference-date 2020-06-20

Parallel tool batches:

uv run supervisor-specialist --parallel --reference-date 2020-06-20 \
  "Compare Chiller 3, Chiller 4, Chiller 6, and Chiller 9 over the past month."

E. Run the Benchmark Evaluation

Run all 16 dialogs:

uv run python eval/run_eval.py --system supervisor-specialist

Run a subset:

uv run python eval/run_eval.py --system supervisor-specialist --dialogs 1 2

Override the model:

uv run python eval/run_eval.py \
  --system supervisor-specialist \
  --model-id watsonx/meta-llama/llama-4-maverick-17b-128e-instruct-fp8

Each evaluation writes dialog_XX.json, per-dialog metrics JSONL files, and summary.json into eval/results/<timestamp>/.
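
A small sketch for loading the most recent evaluation run. It assumes that the timestamp directory names sort chronologically as plain strings, and it makes no assumption about the keys inside summary.json or the dialog files. The function names are hypothetical.

```python
import json
from pathlib import Path


def latest_run(results_root="eval/results"):
    """Return the newest run directory, or None if there are no runs yet."""
    runs = sorted(p for p in Path(results_root).iterdir() if p.is_dir())
    return runs[-1] if runs else None


def load_run(run_dir):
    """Load summary.json and every dialog_XX.json from one run directory."""
    run_dir = Path(run_dir)
    summary = json.loads((run_dir / "summary.json").read_text(encoding="utf-8"))
    dialogs = {
        path.stem: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(run_dir.glob("dialog_*.json"))
    }
    return summary, dialogs
```

For example, summary, dialogs = load_run(latest_run()) returns the run-level summary plus a dict keyed by file stem such as dialog_01.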

F. Profiling and Tracking

Profiling is implemented at three layers:

- LLM call metrics: src/llm/litellm.py writes prompt tokens, completion tokens, total tokens, model name, and latency when LITELLM_METRICS_FILE is set.
- MCP tool metrics: Plan-Execute and Supervisor-Specialist tool wrappers write tool name, server, latency, and success status when TOOL_METRICS_FILE is set.
- CouchDB query metrics: IoT, WO, and vibration servers write query latency, status, and document counts when COUCHDB_METRICS_FILE is set.
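
The opt-in pattern shared by these three layers, where a metrics row is written only when the corresponding environment variable is set, can be sketched as a context manager. This is an illustration under assumptions and not the repository's code: record_metric and the JSON field names (latency_s, success) are hypothetical.

```python
import json
import os
import time
from contextlib import contextmanager


@contextmanager
def record_metric(env_var: str, **fields):
    """Time the wrapped block and append one JSONL row if env_var is set."""
    start = time.perf_counter()
    status = {"success": True}
    try:
        yield status
    except Exception:
        status["success"] = False  # record the failure, then re-raise it
        raise
    finally:
        path = os.environ.get(env_var)
        if path:  # profiling is a no-op unless the variable is set
            row = {**fields, **status,
                   "latency_s": time.perf_counter() - start}
            with open(path, "a", encoding="utf-8") as handle:
                handle.write(json.dumps(row) + "\n")
```

A tool wrapper would then use it as: with record_metric("TOOL_METRICS_FILE", tool="some_tool", server="iot"): followed by the actual tool call.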

For WandB and LangSmith, copy .env.profiling.example to .env.profiling and fill:

WANDB_API_KEY=<your key>
WANDB_PROJECT=multi-turn-assetops
LANGCHAIN_API_KEY=<your key>
LANGCHAIN_PROJECT=<your project>

Then run:

uv run python eval/run_eval.py --system supervisor-specialist

The evaluation harness logs per-dialog latency, token usage, LLM calls, tool calls, CouchDB query metrics, per-turn success, and run-level summaries.

6. Results and Observations

- Artifact reuse improves multi-turn efficiency. The Supervisor-Specialist graph stores structured artifacts and rolling conversation memory so follow-up turns can reuse prior site, asset, sensor, time-window, anomaly, and failure-mode context.
- Tool execution was the initial bottleneck. In the baseline, tool calls account for 47.3% of wall time. SS reduces this to 26.3%, and SSA reduces it further to 23.9%.
- TSFM dominates baseline latency. Time-series forecasting and anomaly tools drop from 159.5 s per dialog in Plan-Execute to 37.4 s in SS and 36.1 s in SSA.
- LLM latency becomes the new ceiling. After reducing redundant tool work, LLM API time accounts for about 69-72% of wall time in the Supervisor-Specialist variants.
- Parallelism helps selectively, but routing and reuse matter more. SSA has the lowest tool fraction and fewer LLM calls than SS, but higher total tokens and longer total wall time than SS in the reported run.
- Reliability improves with structured routing. Tool-name validity reaches 100%, schema failures drop by 68.7%, execution failures drop by 59.0%, and recovery failures are eliminated in the reported benchmark.
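
The artifact-reuse idea in the first observation can be illustrated with a tiny in-memory store keyed by the request context. This is a simplified sketch, not the repository's artifact_store.py: the class, the key layout (tool, asset, sensor, time window), and the sensor name are assumptions made for illustration.

```python
class ArtifactStore:
    """Hypothetical in-memory cache of tool results, keyed by request context."""

    def __init__(self):
        self._items = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        """Return the cached artifact for key, calling compute() only on a miss."""
        if key in self._items:
            self.hits += 1
            return self._items[key]
        self.misses += 1
        value = compute()
        self._items[key] = value
        return value


store = ArtifactStore()
calls = []


def expensive_forecast():
    """Stand-in for a slow TSFM tool call."""
    calls.append("tsfm")
    return {"forecast": [1.0, 1.1, 1.2]}


# Hypothetical key: (tool, asset, sensor, time window).
key = ("tsfm_forecast", "Chiller 6", "supply_temp", "2020-05-20/2020-06-20")
first = store.get_or_compute(key, expensive_forecast)   # turn 1: runs the tool
second = store.get_or_compute(key, expensive_forecast)  # follow-up: reused
```

Only the first lookup invokes the tool. A follow-up turn that asks about the same asset, sensor, and window is served from the store.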

Representative per-server tool latency:

Server	Baseline	SS	SSA
TSFM	159.5 s	37.4 s	36.1 s
IoT	24.9 s	12.9 s	11.8 s
WO	12.4 s	13.4 s	12.9 s
Utilities	4.0 s	4.6 s	4.1 s
FMSR	9.9 s	12.1 s	19.7 s
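
The per-server table can be summarized with a short calculation. The values are copied from the table. The column sums are only indicative, because the README does not state whether each row is averaged over all 16 dialogs or only over the dialogs that call that server.

```python
ARCHS = ("Baseline", "SS", "SSA")

# Seconds per dialog, copied from the per-server latency table.
SERVERS = {
    "TSFM": (159.5, 37.4, 36.1),
    "IoT": (24.9, 12.9, 11.8),
    "WO": (12.4, 13.4, 12.9),
    "Utilities": (4.0, 4.6, 4.1),
    "FMSR": (9.9, 12.1, 19.7),
}

# Column sums: total listed tool latency per architecture.
totals = {
    arch: round(sum(row[i] for row in SERVERS.values()), 1)
    for i, arch in enumerate(ARCHS)
}

# Servers whose latency did not improve when moving from Baseline to SS.
slower_in_ss = [name for name, (base, ss, _) in SERVERS.items() if ss > base]

print(totals)
print("Slower in SS than Baseline:", slower_in_ss)
```

The gains come almost entirely from TSFM and IoT. WO, Utilities, and FMSR are slightly slower in SS, and FMSR is slower again in SSA.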

7. Implementation Notes

- MCP servers live in src/servers/ and expose IoT, work-order, time-series, failure-mode, utility, and vibration tools.
- The Plan-Execute baseline is implemented under src/agent/plan_execute/.
- The Supervisor-Specialist system is implemented under src/agent/supervisor_specialist/.
- src/agent/supervisor_specialist/runtime/artifact_store.py holds in-memory artifacts for cross-turn reuse.
- src/agent/supervisor_specialist/runtime/mcp_tools.py implements MCP routing and both sequential and parallel tool-call execution.
- eval/scenarios.py defines the 16 benchmark dialogs and reference dates.
- PROFILING.md documents all profiling metrics and dashboard semantics.
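
To make the difference between sequential and parallel tool-call execution concrete, here is a minimal asyncio sketch of a bounded parallel batch. It is not the code in mcp_tools.py: run_batch and fake_tool are hypothetical, and the parallelism default of 4 mirrors the SUPERVISOR_SPECIALIST_PARALLELISM example value rather than a confirmed default.

```python
import asyncio
import functools


async def run_batch(calls, parallelism: int = 4):
    """Run zero-argument async callables with at most `parallelism` in flight.

    Results come back in the same order as `calls`.
    """
    semaphore = asyncio.Semaphore(parallelism)

    async def guarded(call):
        async with semaphore:
            return await call()

    return await asyncio.gather(*(guarded(call) for call in calls))


async def fake_tool(asset: str, delay: float = 0.05) -> str:
    """Stand-in for an MCP tool call; it only sleeps and echoes the asset."""
    await asyncio.sleep(delay)
    return f"{asset}: ok"


def compare_chillers():
    """Issue one fake tool call per chiller named in the README."""
    calls = [
        functools.partial(fake_tool, f"Chiller {n}") for n in (3, 4, 6, 9)
    ]
    return asyncio.run(run_batch(calls, parallelism=4))
```

The semaphore caps concurrency so a large batch cannot flood the tool servers, and asyncio.gather keeps results aligned with the input order so downstream code can match each result to its asset.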

ACKNOWLEDGMENT

We thank Dr. Dhaval Patel and Dr. Kaoutar El Maghraoui from IBM Research for their guidance and mentorship throughout this project.

AI Use Disclosure

Did your team use any AI tool in completing this project?

[ ] No, we did not use any AI tool.
[x] Yes, we used AI assistance as described below.

Tool(s) used: ChatGPT, Codex, Claude.

Specific purpose: AI tools were used as support tools in the following ways:

- Background reading and clarification. During initial research, we used AI tools to help understand public resources such as Hugging Face dataset descriptions, AssetOpsBench source code, and examples of agent-system implementations. This helped clarify terminology, system behavior, and relevant design patterns.
- Code debugging and assistance. All code used in this project was created, reviewed, improved, and tested by our team. Codex and related AI tools were used to assist with understanding error messages, identifying debugging directions, considering refactoring suggestions, and exploring possible optimizations.
- Reading, translation, and language polishing. While reading related papers and documentation, we used AI tools to clarify technical language. During report writing, AI tools were used only for partial translation, grammar correction, prose polishing, and improving academic wording for content already drafted by the team.

Sections affected: Background research notes, debugging workflow, and report language polishing.

How we verified correctness: The team reviewed all AI-assisted outputs before using them. Code changes were inspected and tested by the team, and documentation content was cross-checked against the repository implementation (src/agent, eval/run_eval.py, eval/scenarios.py) and the project documentation (README.md, PROFILING.md, DESIGN.md).

By submitting this project, the team confirms that the analysis, interpretations, and conclusions are our own, and that any AI assistance is fully disclosed above.

License

Released under the Apache-2.0 License. See LICENSE.

Citation

@misc{multiTurnAssetOps2026hpml,
  title  = {Towards Multi-Turn Dialog Systems for Industrial Asset Operations},
  author = {Li, Rujing and Bai, Yitong and Li, Chengrui and Li, Rui},
  year   = {2026},
  note   = {HPML Spring 2026 Final Project, Columbia University},
  url    = {https://github.com/Coderlicr/Multi-Turn-AssetOps}
}

Contact

Open a GitHub issue or email chengrui.cu@gmail.com.
