Course description

What is the internal structure of modern neural networks and how can we study it? This course provides a broad and deep introduction to interpretability, the subfield of machine learning concerned with understanding precisely how models process information and why they produce the outputs they do. We will cover topics such as probing, steering, causal abstraction, and sparse autoencoders, with a particular emphasis on causal methods and large language models. The course will include guest lectures from leading interpretability labs across academia and industry.

Staff

Thomas Icard (Stanford), Instructor
Atticus Geiger (Goodfire), Instructor
Amir Zur (Stanford), Instructor
Jing Huang (Stanford), Instructor
Junyi Tao (Stanford), Teaching Assistant
Siri Vatsavaya (Goodfire), Course Manager

Please contact the staff at cs221m-spr2526-staff@lists.stanford.edu.

Logistics

Coursework

The course will have five weeks of notebook-guided lectures, four weeks of guest lectures, and one week of final presentations. Students will be graded on participation in lectures and on their final project.

Syllabus

Please download the syllabus here.


Schedule

Note: schedule is subject to change.

| Date | Lesson | Readings | Materials |
| --- | --- | --- | --- |
| Week 1, Mon. March 30 | Introduction | | |
| Week 1, Wed. April 1 | Review of language models | | Slides; Interactive notebook |
| Week 2, Mon. April 6 | Behavioral analysis and input attribution | | |
| Week 2, Wed. April 8 | Probes for decoding activations | | |
| Week 3, Mon. April 13 | Interventions for steering activations | | |
| Week 3, Wed. April 15 | Causal mediation analysis | | |
| Week 4, Mon. April 20 | Theory of causal abstraction I | | |
| Week 4, Wed. April 22 | Designing counterfactuals | | |
| Week 5, Mon. April 27 | Automated causal interpretability | Davies et al. 2023; Cao et al. 2020, 2022; Geiger et al. 2023 (DAS); Wu et al. 2023 (boundless DAS) | |
| Week 5, Wed. April 29 | Theory of causal abstraction II | | |
| Week 6, Mon. May 4 | | | |
| Week 6, Wed. May 6 | | | |
| Week 7, Mon. May 11 | | | |
| Week 7, Wed. May 13 | | | |
| Week 8, Mon. May 18 | | | |
| Week 8, Wed. May 20 | | | |
| Week 9, Mon. May 25 | | | |
| Week 9, Wed. May 27 | | | |
| Week 10, Mon. June 1 | Project presentations | | |
| Week 10, Wed. June 3 | Project presentations | | |

Frequently asked questions

I submitted an application but had not heard back by March 27. Is it still possible to enroll in the course?

We received more than 200 applications, far more than we initially expected. It is truly exciting to see so many students interested in interpretability! We have increased the course capacity to accommodate as many students as we can. However, we are constrained by resources such as course staff, project mentors, and compute, and at this point we do not plan to increase the class size further. We will likely offer the course again next year, so if you are still around, check it out next spring!

Can I audit this course without enrolling?

We generally do not allow auditing. However, you are more than welcome to attend the guest lectures, which will be in the second half of the course. We will also try to make most of the course materials public.

I have enrolled in the class, but cannot attend some lectures in person.

We value participation. Students are expected to attend all lectures and engage with the course materials. If you are unable to attend a lecture due to travel or other unforeseen circumstances, you must notify us by email before the lecture. Please include the date of the anticipated absence and the reason for it. We will follow up with you as necessary.