Course: High Performance Machine Learning
Semester: Spring 2026
Instructor: Dr. Kaoutar El Maghraoui
Project Mentor: Dr. Dhaval Patel
- Team 5
- Members:
- Rujing Li (rl3641) — Scenario design, profiling pipeline, results analysis
- Yitong Bai (yb2636) — Multi-agent architecture optimization techniques
- Chengrui Li (cl4750) — Baseline refactoring, multi-agent architecture setup
- Rui Li (rl3586) — Evaluation pipeline, results analysis, figures
- All team members contributed equally.
- GitHub repository:
- Main System & Profiling: https://github.com/Coderlicr/Multi-Turn-AssetOps (this repository)
- Evaluation: https://github.com/Rui2026/Multi-Turn-AssetOps-Evaluation
- Final report: deliverables/HPML_Final_Report_Team5.pdf
- Final presentation: deliverables/HPML_Final_Presentation_Team5.pdf
- Experiment tracking: link to public Wandb dashboard
Industrial operations and maintenance (O&M) question answering is naturally multi-turn: users refine queries, ask follow-up questions, and expect the system to reuse previous evidence while invoking specialized tools. The baseline Plan-Execute single-agent workflow is fragile in this setting because it plans mostly linearly, repeats expensive tool calls, struggles with tool-argument hallucination, and expands context rapidly after failures.
This project targets inference-time system performance for a tool-centric industrial diagnostic agent. The primary bottleneck is not GPU training throughput, but end-to-end inference latency and cost across remote LLM calls, MCP tool execution, CouchDB retrieval, and multi-agent routing. We optimize the runtime by adding memory-aware artifact reuse, Supervisor-Specialist routing, and optional parallel MCP tool execution.
- Application: Multi-turn industrial asset operations assistant for fault diagnosis, predictive maintenance, operational monitoring, maintenance planning, and end-to-end remediation workflows.
- LLM backend: LiteLLM wrapper over IBM WatsonX. The default model in the CLI is watsonx/meta-llama/llama-4-maverick-17b-128e-instruct-fp8.
- Agent architectures compared:
- Baseline: Plan-Execute single-agent workflow with sequential MCP tool calls.
- SS: Supervisor-Specialist architecture implemented with LangGraph.
- SSA: Supervisor-Specialist Advanced with parallel MCP tool batches.
- Specialist agents: Data Collection, Time Series Analysis, Failure Reasoning, and Maintenance Planning, routed by a Supervisor agent.
- Forecasting/anomaly models: IBM Granite-TSFM / TinyTimeMixer artifacts under src/servers/tsfm/artifacts/tsfm_models/; conformal anomaly detection and TSFM forecasting are exposed through the TSFM MCP server.
- Frameworks and libraries: Python 3.12+, LangGraph, LiteLLM, FastMCP/MCP, CouchDB, PyTorch/Transformers/Granite-TSFM, pandas, NumPy, SciPy, Pydantic, Weights & Biases.
- Dataset: 16 multi-turn industrial diagnosis scenarios in eval/scenarios.py, derived from the AssetOpsBench-style workflow design in DESIGN.md.
- Operational data: IoT time-series data for MAIN site chillers 3, 4, 6, and 9; work orders, events, alerts, and failure-code mappings loaded into CouchDB.
- Hardware target: Remote IBM WatsonX inference for LLM calls plus local Python/MCP/CouchDB execution. The performance study measures system-level inference latency rather than model training throughput on a fixed GPU.
Measurements below are from the presentation results over 16 benchmark dialogs using IBM WatsonX / Llama-4-Maverick-17B-FP8.
Headline numbers
| Result | Baseline | Optimized / SS | Improvement |
|---|---|---|---|
| End-to-end dialog latency | 323.5 s avg/dialog | 265.5 s avg/dialog | 1.2x faster end-to-end |
| TSFM tool latency | 159.5 s avg/dialog | 37.4 s avg/dialog | 4.3x TSFM tool speedup |
| Follow-up turn latency | 145 s on SS turn 1 | 34 s avg on SS turns 2-5 | 4.2x faster after turn 1 |
Profiler run-level summary
| Metric | Baseline | SS | SSA |
|---|---|---|---|
| Total wall time | 83.9 min | 65.2 min | 73.3 min |
| Total tokens consumed | 2,553,150 | 3,322,234 | 3,623,430 |
| Total LLM API calls | 841 | 941 | 751 |
Latency breakdown by architecture
| Architecture | LLM Time | Tool Time | Routing / Other |
|---|---|---|---|
| Baseline Plan-Execute | 43.0% | 47.3% | 9.7% |
| SS | 69.3% | 26.3% | 4.4% |
| SSA | 72.3% | 23.9% | 3.8% |
Headline result: The Supervisor-Specialist system shifts the dominant bottleneck away from redundant tool execution: tool time drops from 47.3% to 26.3% of wall time, TSFM latency drops from 159.5 s to 37.4 s per dialog, and end-to-end evaluation wall time improves from 83.9 min to 65.2 min across the benchmark.
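The ratios behind these tables can be rechecked with a few lines of arithmetic. The sketch below only reuses figures reported above; the helper name `speedup` and the derived "tokens per LLM call" quantity are illustrative additions, not metrics produced by the profiler.

```python
# Figures copied from the tables above; nothing here is newly measured.
def speedup(before: float, after: float) -> float:
    """Ratio of the slower measurement to the faster one."""
    return before / after

tsfm = speedup(159.5, 37.4)  # TSFM tool latency, seconds per dialog
wall = speedup(83.9, 65.2)   # total wall time, minutes per benchmark run

runs = {
    "Baseline": {"tokens": 2_553_150, "llm_calls": 841},
    "SS": {"tokens": 3_322_234, "llm_calls": 941},
    "SSA": {"tokens": 3_623_430, "llm_calls": 751},
}
# Derived quantity (an illustration, not a reported metric).
tokens_per_call = {name: r["tokens"] / r["llm_calls"] for name, r in runs.items()}

print(f"TSFM speedup: {tsfm:.1f}x, wall-time speedup: {wall:.1f}x")
for name, value in tokens_per_call.items():
    print(f"{name}: {value:,.0f} tokens per LLM call")
```

Dividing tokens by calls suggests that SSA issues fewer LLM calls than SS but each call carries more tokens on average, which is consistent with its higher total token count.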
.
|-- README.md
|-- DESIGN.md / DESIGN_annotated.md
|-- PROFILING.md
|-- pyproject.toml
|-- uv.lock
|-- eval/
| |-- run_eval.py
| |-- scenarios.py
| `-- results/
|-- logs/
| `-- supervisor_specialist/
|-- src/
| |-- agent/
| | |-- cli.py
| | |-- plan_execute/
| | `-- supervisor_specialist/
| | |-- cli.py
| | |-- graph.py
| | |-- runner.py
| | |-- agents/
| | `-- runtime/
| |-- couchdb/
| | |-- docker-compose.yaml
| | |-- init_asset_data.py
| | |-- init_wo.py
| | `-- sample_data/
| |-- llm/
| | |-- base.py
| | `-- litellm.py
| `-- servers/
| |-- iot/
| |-- wo/
| |-- tsfm/
| |-- fmsr/
| |-- utilities/
| `-- vibration/
`-- wandb/
git clone https://github.com/Coderlicr/Multi-Turn-AssetOps.git
cd Multi-Turn-AssetOps
uv sync
source .venv/bin/activate
The project uses uv and requires Python 3.12+. Optional TSFM dependencies include PyTorch, Transformers, and granite-tsfm.
Create a .env file from .env.example and fill the WatsonX credentials:
cp .env.example .env
Important environment variables:
COUCHDB_URL=http://localhost:5984
IOT_DBNAME=chiller
WO_DBNAME=workorder
COUCHDB_USERNAME=admin
COUCHDB_PASSWORD=password
WATSONX_APIKEY=<your key>
WATSONX_PROJECT_ID=<your project id>
WATSONX_URL=https://us-south.ml.cloud.ibm.com
SS_MODEL_ID=watsonx/meta-llama/llama-4-maverick-17b-128e-instruct-fp8
SS_MAX_STEPS=12
SUPERVISOR_SPECIALIST_PARALLELISM=4
Place the downloaded IoT main.json at:
src/couchdb/sample_data/iot/main.json
Start CouchDB and initialize the databases:
docker compose -f src/couchdb/docker-compose.yaml up -d
python src/couchdb/check_couchdb_data.py
The setup imports:
- IoT sensor data for Chiller 3, Chiller 4, Chiller 6, and Chiller 9 into the chiller database.
- Work-order and alert data from src/couchdb/sample_data/work_order/ into the workorder database.
- Optional vibration data into the vibration database.
uv run plan-execute --show-plan --show-history \
  "What is the current date and time? Also list assets at site MAIN."
Single-turn example:
uv run supervisor-specialist --reference-date 2020-06-20 \
  "The temperature of our chiller at Site MAIN seems unusually high lately. Can you look into it?"
Multi-turn session:
uv run supervisor-specialist --multi-turn --reference-date 2020-06-20
Parallel tool batches:
uv run supervisor-specialist --parallel --reference-date 2020-06-20 \
  "Compare Chiller 3, Chiller 4, Chiller 6, and Chiller 9 over the past month."
Run all 16 dialogs:
uv run python eval/run_eval.py --system supervisor-specialist
Run a subset:
uv run python eval/run_eval.py --system supervisor-specialist --dialogs 1 2
Override the model:
uv run python eval/run_eval.py \
  --system supervisor-specialist \
  --model-id watsonx/meta-llama/llama-4-maverick-17b-128e-instruct-fp8
Each evaluation writes dialog_XX.json, per-dialog metrics JSONL files, and summary.json into eval/results/<timestamp>/.
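To inspect the most recent run, something like the following may help. It is a minimal sketch that assumes the timestamped directory names under eval/results/ sort lexicographically in chronological order; the helper `latest_summary` is hypothetical and the schema of summary.json is deliberately not assumed.

```python
import json
from pathlib import Path


def latest_summary(results_root: Path) -> dict:
    """Load summary.json from the lexicographically last run directory."""
    run_dirs = sorted(p for p in results_root.iterdir() if p.is_dir())
    if not run_dirs:
        raise FileNotFoundError(f"no run directories under {results_root}")
    return json.loads((run_dirs[-1] / "summary.json").read_text(encoding="utf-8"))


if __name__ == "__main__":
    root = Path("eval/results")
    if root.is_dir():
        summary = latest_summary(root)
        # Only list the top-level keys; no particular schema is assumed.
        print(sorted(summary))
```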
Profiling is implemented at three layers:
- LLM call metrics: src/llm/litellm.py writes prompt tokens, completion tokens, total tokens, model name, and latency when LITELLM_METRICS_FILE is set.
- MCP tool metrics: Plan-Execute and Supervisor-Specialist tool wrappers write tool name, server, latency, and success status when TOOL_METRICS_FILE is set.
- CouchDB query metrics: IoT, WO, and vibration servers write query latency, status, and document counts when COUCHDB_METRICS_FILE is set.
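The pattern shared by the three layers (write a record only when an environment variable names a file) can be sketched as follows. This is an illustration under assumptions, not the repository's wrapper: the context manager `record_tool_call`, the JSON field names, and the tool name `get_sensors` are hypothetical; only the TOOL_METRICS_FILE variable comes from the description above.

```python
import json
import os
import time
from contextlib import contextmanager


@contextmanager
def record_tool_call(tool: str, server: str, env_var: str = "TOOL_METRICS_FILE"):
    """Time the wrapped block and append one JSON line if env_var names a file."""
    start = time.perf_counter()
    success = True
    try:
        yield
    except Exception:
        success = False
        raise
    finally:
        path = os.environ.get(env_var)
        if path:  # profiling stays off unless the variable is set
            record = {
                "tool": tool,          # hypothetical field names
                "server": server,
                "latency_s": time.perf_counter() - start,
                "success": success,
            }
            with open(path, "a", encoding="utf-8") as fh:
                fh.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    with record_tool_call("get_sensors", "iot"):  # hypothetical tool name
        time.sleep(0.01)
```

Gating on the environment variable keeps the overhead at a single dictionary lookup when profiling is disabled, and appending JSON lines means partial runs still leave usable records.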
For WandB and LangSmith, copy .env.profiling.example to .env.profiling and fill:
WANDB_API_KEY=<your key>
WANDB_PROJECT=multi-turn-assetops
LANGCHAIN_API_KEY=<your key>
LANGCHAIN_PROJECT=<your project>
Then run:
uv run python eval/run_eval.py --system supervisor-specialist
The evaluation harness logs per-dialog latency, token usage, LLM calls, tool calls, CouchDB query metrics, per-turn success, and run-level summaries.
- Artifact reuse improves multi-turn efficiency. The Supervisor-Specialist graph stores structured artifacts and rolling conversation memory so follow-up turns can reuse prior site, asset, sensor, time-window, anomaly, and failure-mode context.
- Tool execution was the initial bottleneck. In the baseline, tool calls account for 47.3% of wall time. SS reduces this to 26.3%, and SSA reduces it further to 23.9%.
- TSFM dominates baseline latency. Time-series forecasting and anomaly tools drop from 159.5 s per dialog in Plan-Execute to 37.4 s in SS and 36.1 s in SSA.
- LLM latency becomes the new ceiling. After reducing redundant tool work, LLM API time accounts for about 69-72% of wall time in the Supervisor-Specialist variants.
- Parallelism helps selectively, but routing and reuse matter more. SSA has the lowest tool fraction and fewer LLM calls than SS, but higher total tokens and longer total wall time than SS in the reported run.
- Reliability improves with structured routing. Tool-name validity reaches 100%, schema failures drop by 68.7%, execution failures drop by 59.0%, and recovery failures are eliminated in the reported benchmark.
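The artifact-reuse idea in the first point can be illustrated with a small cache keyed by tool name and canonicalized arguments. This is a sketch under assumptions, not the code in the repository: the class `ArtifactStore`, its `get_or_run` method, and the tool name `tsfm_forecast` are hypothetical.

```python
import json
from typing import Any, Callable


class ArtifactStore:
    """Hypothetical in-memory cache keyed by tool name and canonical arguments."""

    def __init__(self) -> None:
        self._items: dict[str, Any] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(tool: str, args: dict[str, Any]) -> str:
        # sort_keys makes the key independent of argument order
        return f"{tool}:{json.dumps(args, sort_keys=True)}"

    def get_or_run(self, tool: str, args: dict[str, Any], run: Callable[..., Any]) -> Any:
        k = self.key(tool, args)
        if k in self._items:
            self.hits += 1
            return self._items[k]
        self.misses += 1
        result = run(**args)
        self._items[k] = result
        return result


calls: list[str] = []


def fake_forecast(asset: str, horizon: int) -> dict:
    """Stand-in for an expensive tool call."""
    calls.append(asset)
    return {"asset": asset, "horizon": horizon, "forecast": [0.0] * horizon}


store = ArtifactStore()
first = store.get_or_run("tsfm_forecast", {"asset": "Chiller 6", "horizon": 3}, fake_forecast)
# A follow-up turn asking for the same asset and horizon reuses the artifact.
second = store.get_or_run("tsfm_forecast", {"horizon": 3, "asset": "Chiller 6"}, fake_forecast)
print(store.hits, store.misses, len(calls))
```

A real store would also need an invalidation rule (for example when the reference date or time window changes), which this sketch leaves out.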
Representative per-server tool latency:
| Server | Baseline | SS | SSA |
|---|---|---|---|
| TSFM | 159.5 s | 37.4 s | 36.1 s |
| IoT | 24.9 s | 12.9 s | 11.8 s |
| WO | 12.4 s | 13.4 s | 12.9 s |
| Utilities | 4.0 s | 4.6 s | 4.1 s |
| FMSR | 9.9 s | 12.1 s | 19.7 s |
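Summing the rows of this table gives a rough view of where tool time goes. The sketch below only adds up the figures listed above; since the table is described as representative, the totals are sums of these five rows rather than reported measurements.

```python
# Per-server tool latency in seconds: (Baseline, SS, SSA), copied from the table.
latency_s = {
    "TSFM": (159.5, 37.4, 36.1),
    "IoT": (24.9, 12.9, 11.8),
    "WO": (12.4, 13.4, 12.9),
    "Utilities": (4.0, 4.6, 4.1),
    "FMSR": (9.9, 12.1, 19.7),
}

# Column sums across the five listed servers.
baseline_total, ss_total, ssa_total = (sum(col) for col in zip(*latency_s.values()))
tsfm_share = latency_s["TSFM"][0] / baseline_total

# Servers whose SSA latency is higher than their baseline latency.
regressed = [name for name, (base, _, ssa) in latency_s.items() if ssa > base]

print(f"totals: {baseline_total:.1f} / {ss_total:.1f} / {ssa_total:.1f} s")
print(f"TSFM share of baseline tool time: {tsfm_share:.0%}")
print("slower under SSA than baseline:", regressed)
```

On these rows the reduction comes almost entirely from TSFM and IoT; WO, Utilities, and FMSR are slightly to substantially slower in SSA than in the baseline.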
- MCP servers live in src/servers/ and expose IoT, work-order, time-series, failure-mode, utility, and vibration tools.
- The Plan-Execute baseline is implemented under src/agent/plan_execute/.
- The Supervisor-Specialist system is implemented under src/agent/supervisor_specialist/.
- src/agent/supervisor_specialist/runtime/artifact_store.py holds in-memory artifacts for cross-turn reuse.
- src/agent/supervisor_specialist/runtime/mcp_tools.py implements MCP routing and both sequential and parallel tool-call execution.
- eval/scenarios.py defines the 16 benchmark dialogs and reference dates.
- PROFILING.md documents all profiling metrics and dashboard semantics.
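One common way to run a batch of independent tool calls with bounded concurrency is an asyncio semaphore around `asyncio.gather`. The sketch below illustrates that general technique and is not taken from mcp_tools.py: `run_batch` and `fake_tool` are hypothetical names, and the default of 4 simply mirrors the SUPERVISOR_SPECIALIST_PARALLELISM value shown in the environment settings.

```python
import asyncio
from typing import Awaitable, Callable


async def run_batch(calls: list[Callable[[], Awaitable[str]]], parallelism: int = 4) -> list[str]:
    """Run zero-argument coroutine factories with at most `parallelism` in flight."""
    gate = asyncio.Semaphore(parallelism)

    async def guarded(call: Callable[[], Awaitable[str]]) -> str:
        async with gate:
            return await call()

    # gather returns results in the order the calls were given
    return await asyncio.gather(*(guarded(c) for c in calls))


async def fake_tool(asset: str) -> str:
    """Stand-in for a remote MCP tool call."""
    await asyncio.sleep(0.01)
    return f"summary for {asset}"


assets = ["Chiller 3", "Chiller 4", "Chiller 6", "Chiller 9"]
results = asyncio.run(run_batch([lambda a=a: fake_tool(a) for a in assets]))
print(results)
```

Bounding concurrency matters when the tools share a backend such as CouchDB or a single model server, since unbounded fan-out can shift latency into queueing rather than remove it.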
We thank Dr. Dhaval Patel and Dr. Kaoutar El Maghraoui from IBM Research for their guidance and mentorship throughout this project.
Did your team use any AI tool in completing this project?
- [ ] No, we did not use any AI tool.
- [x] Yes, we used AI assistance as described below.
Tool(s) used: ChatGPT, Codex, Claude.
Specific purpose: AI tools were used as support tools in the following ways:
- Background reading and clarification. During initial research, we used AI tools to help understand public resources such as Hugging Face dataset descriptions, AssetOpsBench source code, and examples of agent-system implementations. This helped clarify terminology, system behavior, and relevant design patterns.
- Code debugging and assistance. All code used in this project was created, reviewed, improved, and tested by our team. Codex and related AI tools were used to assist with understanding error messages, identifying debugging directions, considering refactoring suggestions, and exploring possible optimizations.
- Reading, translation, and language polishing. While reading related papers and documentation, we used AI tools to clarify technical language. During report writing, AI tools were used only for partial translation, grammar correction, prose polishing, and improving academic wording for content already drafted by the team.
Sections affected: Background research notes, debugging workflow, and report language polishing.
How we verified correctness: The team reviewed all AI-assisted outputs before using them. Code changes were inspected and tested by the team, and documentation content was cross-checked against the repository implementation (src/agent, eval/run_eval.py, eval/scenarios.py) and the project documentation (README.md, PROFILING.md, DESIGN.md).
By submitting this project, the team confirms that the analysis, interpretations, and conclusions are our own, and that any AI assistance is fully disclosed above.
Released under the Apache-2.0 License. See LICENSE.
@misc{multiTurnAssetOps2026hpml,
title = {Towards Multi-Turn Dialog Systems for Industrial Asset Operations},
author = {Li, Rujing and Bai, Yitong and Li, Chengrui and Li, Rui},
year = {2026},
note = {HPML Spring 2026 Final Project, Columbia University},
url = {https://github.com/Coderlicr/Multi-Turn-AssetOps}
}
Open a GitHub issue or email chengrui.cu@gmail.com.