Why do image-generation models often show Mario for "video-game plumber"? 🧐 How can we identify such keywords? How can we improve upon common mitigation strategies to protect copyrighted characters?
See our CopyCat 🐱 paper to find out more! copycat-eval.github.io (1/8)
Luxi (Lucy) He
- Fine-tuning on benign data (e.g. Alpaca) can jailbreak models unexpectedly. We study this problem through a data-centric perspective and find that some seemingly benign data could be more harmful than explicitly malicious data! ⚠️🚨‼️ Paper: arxiv.org/pdf/2404.01099… [1/n]
- I'm attending @COLM_conf next week! Excited to meet folks and chat about alignment, safety, reasoning, LM evaluations, and more! Please feel free to reach out anytime :) @xiamengzhou and I will present our work on data selection + safety on Tuesday afternoon, come chat with us!
- Excited to be attending #ICLR2024 this week! I will be giving an oral presentation of our work, which was designated Best Paper at the Data Problems for Foundation Models (DPFM) Workshop! Come say hi at ICLR. Would love to chat about LLMs, alignment, safety, copyright, and more!
  Quoted post: Fine-tuning on benign data (e.g. Alpaca) can jailbreak models unexpectedly. We study this problem through a data-centric perspective and find that some seemingly benign data could be more harmful than explicitly malicious data! ⚠️🚨‼️ Paper: arxiv.org/pdf/2404.01099… [1/n]
- [𝐒𝐩𝐨𝐭𝐥𝐢𝐠𝐡𝐭 @genlawcenter '24] Fantastic Copyrighted Beasts and How (Not) to Generate Them. We'll have a spotlight talk at the ICML 24 GenLaw Workshop, and please feel free to reach out and chat more!
  Quoted post: Why do image-generation models often show Mario for "video-game plumber"? 🧐 How can we identify such keywords? How can we improve upon common mitigation strategies to protect copyrighted characters? See our CopyCat 🐱 paper to find out more! copycat-eval.github.io (1/8)
- Join us today at 3 pm ET for a discussion on AI safety and alignment with @DavidSKrueger 🤩 Submit your questions in advance at the link in the post!
  Quoted post: PASS seminar tomorrow, 10/15 at 3pm ET! Speaker: @DavidSKrueger from @Cambridge_Uni Live: youtube.com/@PrincetonPLI/… Submit questions: tinyurl.com/pass-question Recordings later at: youtube.com/@PrincetonPLI
- Excited for the talk today at 2pm ET! YouTube link here: youtube.com/@PrincetonPLI, and submit your questions via forms.gle/7GQXAr9aonfvy1… 🤩
  Quoted post: Giving a talk in 4 hours for Princeton AI Alignment and Safety Seminar on new ways we're pushing the frontier of open recipes for fine-tuning. Lots of good details on recipes, datasets, and models we'll be releasing soon. We put a lot of effort into this one. Link for more info: youtube.com (Princeton Language & Intelligence)
- Tune in to our PASS Seminar with @natolambert next Monday! Submit your question via the link in thread :)
  Quoted post: UPCOMING PASS SEMINAR, 11/4 at 2pm ET! Speaker: @natolambert from @allen_ai Live: youtube.com/@PrincetonPLI/… Recordings later at: youtube.com/@PrincetonPLI
- Happening today! You can submit your questions for Gillian here:
  Quoted post: UPCOMING PASS SEMINAR, 11/19 at 1pm ET! Speaker: @ghadfield from @UofT Live: youtube.com/@PrincetonPLI/… Recordings later at: youtube.com/@PrincetonPLI
  Link: docs.google.com, PASS Question Submission. Submit your question for the speaker at Princeton AI Alignment and Safety Seminar (PASS)! We will moderate the questions and ask the speaker during the discussion period. Upcoming Talk: Apr 2 2025,...
- Wondering why your user experience with many MLLMs doesn't quite align with their high performance on existing benchmarks? 🤨 Our human-curated benchmark CharXiv shows flaws in MLLM chart understanding, as well as gaps between open- and closed-source models!
  Quoted post: 🤨 Are Multimodal Large Language Models really as 𝐠𝐨𝐨𝐝 at 𝐜𝐡𝐚𝐫𝐭 𝐮𝐧𝐝𝐞𝐫𝐬𝐭𝐚𝐧𝐝𝐢𝐧𝐠 as existing benchmarks such as ChartQA suggest? 🚫 Our ℂ𝕙𝕒𝕣𝕏𝕚𝕧 benchmark suggests NO! 🥇Humans achieve ✨𝟖𝟎+% correctness. 🥈Sonnet 3.5 outperforms GPT-4o by 10+ points,
- I’m attending @NeurIPSConf 2023! I will be presenting our Spotlight Paper “Aleatoric and Epistemic Discrimination: Fundamental Limits of Fairness Interventions”. Excited to learn more about ML privacy, fairness, LLM safety, and more! #NeurIPS
- Replying to @LuxiHeLucy: Preventing the generation of copyrighted characters (e.g., Mario, Batman) is important for image and video generation models. We build the CopyCat evaluation suite with diverse copyrighted characters and an evaluation pipeline that measures character detection and input consistency. (2/8)
- Replying to @LuxiHeLucy: Both approaches are effective in identifying such benign subsets that break safety. The gradient-based method is more consistent across datasets. Data selected using Llama-7b-chat as the base model also successfully attacks the Llama-13b-chat model. [4/N]
- Replying to @LuxiHeLucy: Such seemingly benign but effectively harmful data further raise awareness of safety vulnerabilities when fine-tuning. This type of approach could help identify optimal safety-utility data mixtures or provide a mechanism for data-centric debugging of safety degradation. [6/N]