SunChungEn

Chung-En, Sun

Pinned

Trustworthy-ML-Lab/Steer2Edit Public

Python 1
Trustworthy-ML-Lab/ThinkEdit Public

[EMNLP 25] An effective and interpretable weight-editing method for mitigating overly short reasoning in LLMs, and a mechanistic study uncovering how reasoning length is encoded in the model’s repr…

Python 19 1
ADV-LLM Public

[NAACL 25] A framework to build powerful adversarial LLMs that can generate jailbreak prompts.

Python 8
Trustworthy-ML-Lab/CB-LLMs Public

[ICLR 25] A novel framework for building intrinsically interpretable LLMs with human-understandable concepts to ensure safety, reliability, transparency, and trustworthiness.

Python 33 19
Trustworthy-ML-Lab/when2tool Public

Python 6 1
Trustworthy-ML-Lab/Agent-Cold-Start-Safety-Gap Public

Python 1