Modern AI agents are powerful but often fail to align with the idiosyncratic, evolving preferences of individual users. Prior approaches typically rely on static datasets, either training implicit preference models on interaction history or encoding explicit user profiles in external memory. However, these approaches struggle with new users and with preferences that change over time.
We introduce Personalized Agents from Human Feedback (PAHF), a framework for continual personalization in which agents learn online from live interaction using explicit per-user memory. PAHF operationalizes a three-step loop: (1) seeking pre-action clarification to resolve ambiguity, (2) grounding actions in preferences retrieved from memory, and (3) integrating post-action feedback to update memory when preferences drift.
We evaluate PAHF using a four-phase protocol across embodied manipulation and online shopping, measuring both initial preference learning and adaptation to preference drift. Theoretical and empirical results confirm that combining explicit memory with dual feedback is critical: PAHF consistently outperforms baselines, reducing errors during initial learning and recovering quickly when preferences shift.
Learn initial preferences → Learn context-dependent preferences → Adapt to preference shifts
Explicit per-user memory with two feedback channels: pre-action clarification and post-action correction.
1. Pre-action interaction: ask a clarifying question if the preference is ambiguous.
2. Action execution: act using preferences retrieved from memory.
3. Post-action correction: use feedback to correct and refresh stale memory.
Our framework enables continual personalization by leveraging online user feedback to dynamically read and write to memory, ensuring the agent adapts to evolving preferences.
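The loop above can be sketched in a few lines. This is a minimal illustration of the pre-action/post-action pattern, not the authors' implementation; all class and function names (`UserMemory`, `pahf_step`, and the callback parameters) are hypothetical.

```python
class UserMemory:
    """Explicit per-user preference store keyed by task context."""

    def __init__(self):
        self.prefs = {}  # context -> preference

    def read(self, context):
        return self.prefs.get(context)

    def write(self, context, preference):
        # Writes overwrite stale entries, so memory tracks preference drift.
        self.prefs[context] = preference


def pahf_step(memory, context, ask_user, act, get_feedback):
    # (1) Pre-action clarification: query the user only when memory
    # holds no preference for this context.
    pref = memory.read(context)
    if pref is None:
        pref = ask_user(context)
        memory.write(context, pref)

    # (2) Grounded action: condition the action on the retrieved preference.
    result = act(context, pref)

    # (3) Post-action correction: if the user corrects the outcome,
    # refresh memory so future actions reflect the updated preference.
    correction = get_feedback(result)
    if correction is not None:
        memory.write(context, correction)
    return result
```

Note that clarification happens only on a memory miss, so a returning user is not re-asked unless a post-action correction signals that the stored preference has drifted.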
- Embodied manipulation: home/office tasks with contextual preferences.
- Online shopping: multi-constraint product selection.
- Phase 1 (Training): Initial Learning
- Phase 2 (Testing): Initial Personalization
- Phase 3 (Training): Adaptation to Drift
- Phase 4 (Testing): Adapted Personalization
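The four phases alternate training (memory writes allowed) with testing (memory read-only), with a preference shift injected before Phase 3. A toy sketch of this driver, with a deliberately simplified agent that memorizes task-preference pairs (all names here are illustrative, not from the paper):

```python
class MemoryAgent:
    """Toy agent that memorizes (task -> preferred answer) pairs."""

    def __init__(self):
        self.memory = {}

    def train(self, labeled_tasks):
        # Training phases: write observed preferences into memory.
        self.memory.update(labeled_tasks)

    def evaluate(self, labeled_tasks):
        # Testing phases: memory is read-only; report the success rate.
        correct = sum(self.memory.get(t) == y for t, y in labeled_tasks.items())
        return correct / len(labeled_tasks)


def run_protocol(agent, initial_tasks, shifted_tasks):
    results = {}
    agent.train(initial_tasks)                          # Phase 1: initial learning
    results["phase2"] = agent.evaluate(initial_tasks)   # Phase 2: initial personalization
    agent.train(shifted_tasks)                          # Phase 3: adaptation to drift
    results["phase4"] = agent.evaluate(shifted_tasks)   # Phase 4: adapted personalization
    return results
```

Phase 4 thus measures whether the agent overwrote its stale Phase 1 preferences rather than averaging over both.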
Learning curves for initial preference learning (Phase 1) and adaptation to preference shifts (Phase 3).
Test-phase success rates (%)

| Method | Embodied Phase 2 | Embodied Phase 4 | Shopping Phase 2 | Shopping Phase 4 |
|---|---|---|---|---|
| No memory | 32.3±0.4 | 44.8±0.5 | 27.8±0.2 | 27.0±0.4 |
| Pre-action only | 54.1±1.1 | 35.7±1.0 | 34.4±0.5 | 56.0±0.7 |
| Post-action only | 67.9±1.5 | 68.3±1.2 | 38.9±0.5 | 66.9±0.8 |
| PAHF (pre+post) | 70.5±1.7 | 68.8±1.3 | 41.3±0.8 | 70.3±1.1 |
@article{liang2026learning,
title={Learning Personalized Agents from Human Feedback},
author={Liang, Kaiqu and Kruk, Julia and Qian, Shengyi and Yang, Xianjun and Bi, Shengjie and Yao, Yuanshun and Nie, Shaoliang and Zhang, Mingyang and Liu, Lijuan and Fisac, Jaime Fern{\'a}ndez and others},
journal={arXiv preprint arXiv:2602.16173},
year={2026}
}