Computational Linguistics I
CMSC 723 Fall 2010
Computational linguistics (CL) is the science of doing what linguists
do with language, but using computers. Natural language processing
(NLP) is getting computers to do what people do with language.
Despite the title, most of this course is actually about NLP, not CL!
But we'll hit a lot of linguistics along the way.
This course is intended to be a broad introduction to what is a very
broad field. We will cover both rule-based and statistical approaches
to a wide variety of challenging problems in natural language
processing. We will discover that language ambiguity is the
rabid wolf of NLP, and develop techniques to try to tame it. Along
the way, we will see some linguistic theories developed specifically
for computational linguistics, which sheds some light on what sort of
linguistic models make sense, computationally.
Prerequisites: You must be able to program. You must find
language interesting. If you cannot write breadth- and depth-first
search in your programming language of choice in under an hour, you
will struggle in this class. If you cannot find humor in the sentence
pair "I ate spaghetti with meatballs / I ate spaghetti with a fork"
then you might not enjoy the class. Linguistic background is not
necessary, though of course it never hurts.
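As a self-check on the programming prerequisite, the following sketch shows roughly what is meant: a breadth-first search over an adjacency-list graph, in Python. (The graph and function names here are an illustration for this page, not course material; depth-first search is the same idea with a stack in place of the queue.)

```python
from collections import deque

def bfs(graph, start):
    """Return the nodes reachable from `start`, in breadth-first order."""
    visited = [start]          # nodes seen so far, in visit order
    queue = deque([start])     # frontier of nodes awaiting expansion
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.append(neighbor)
                queue.append(neighbor)
    return visited

# A toy graph: 'a' points to 'b' and 'c', both of which point to 'd'.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(bfs(graph, "a"))  # -> ['a', 'b', 'c', 'd']
```

If writing something like this (and its depth-first cousin) takes you well over an hour, the programming projects will be rough going.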
The purpose of grading (in my mind) is to provide extra incentive for
you to keep up with the material and to ensure that you exit the class
as a computational linguistics genius. If everyone gets an A, that
would make me happy (sadly, it hasn't happened yet). The components
of grading are:

| Weight | Component | Description |
| --- | --- | --- |
| 40% | Programming projects | There are four programming projects, each worth 10% of your final grade. You will be graded on both code correctness and your analysis of the results. These must be completed in teams of two or three students, with cross-department (CS/linguistics) teams highly encouraged. |
| 30% | Written homeworks | There are eleven written homeworks (one per week), each worth 3% of your final grade. They will be graded on a high-pass (100%), low-pass (50%), or fail (0%) basis. These are to be completed individually. Your lowest-scoring homework will be dropped. (The initial homework, HW00, is not graded, but is required if you do not want to fail.) |
| 25% | Final project | Everyone is to complete a final project, in teams of up to three. We will discuss the scope of the project later in class. |
| 5% | Class participation | You will be graded on your in-class presentations of homework questions and other general participation, including participation in the comments on the blog. This is mostly subjective. |
The textbook is the new-ish book by Jurafsky and Martin, Speech and Language Processing
(Second Edition) (ISBN 978-0-13-605234-0).
Other recommended (but not required) books:
| Date | Topics | Readings | Due | Notes |
| --- | --- | --- | --- | --- |
| 31 Aug | Welcome to Computational Linguistics: what is this class about, linguistic phenomena | - | - | blog |
| 02 Sep | History and Approaches: initial attempts, ALPAC, statistics and data | 1-1.6 | HW00 | blog |
| 07 Sep | Regular Languages: finite state machines and morphology | 2-2.2, 3-3.3 | - | blog |
| 09 Sep | Probability and Statistics: a refresher, with a language focus | 4-4.3, 4.10-4.11 | HW01 | blog |
| 14 Sep | N-gram Models: language modeling and smoothing | 4.4-4.6 | - | blog |
| 16 Sep | Part-of-Speech Tagging: rule-based approaches | 5.1-5.4 | HW02 | blog |
| 21 Sep | Part-of-Speech Tagging II: hidden Markov models and the Viterbi algorithm | 5.5, 5.8 | - | - |
| 23 Sep | Context-Free Grammars: expressivity, X-bar theory, and parsing as search | 12-12.3, 12.5, 13-13.3 | HW03 | blog |
| 28 Sep | Context-Free Grammars II: dynamic programming and the CKY algorithm | 13.4, X-bar theory | - | blog |
| 30 Sep | No class: finish up P1 (deadline extended 2 hours) | - | P1 | - |
| 05 Oct | No class: Hal sick (again) :( | - | HW04 | - |
| 07 Oct | Statistical Parsing: from treebanks to grammars, and Markovization | 12.4, 14-14.4 | - | blog |
| 12 Oct | Incorporating Context: feature-based grammars, unification | 15-15.4 | - | blog |
| 14 Oct | Representing Meaning: first-order logic | 17-17.3, 18.4 | HW05 | blog |
| 19 Oct | Interpreting Text: interpretation as abduction | abduct (sec 1-3) | P2 | blog |
| 21 Oct | Linguistic Challenges: metaphor, metonymy, time, scope, quantifiers, etc.; plus final project info | 17.4, 18.3, 18.6, 19.6 | HW06 | blog |
| 26 Oct | Computational Lexical Semantics: word sense disambiguation; plus midterm | 19-19.3 | - | blog |
| 28 Oct | Computational Semantics: semantic roles and frames | 19.4-19.5 | HW07 | blog |
| 02 Nov | Classification with Decision Trees: learning, generalization, and features | dt | - | blog |
| 04 Nov | Linear Models for Learning: perceptron learning for sentiment | Perceptron | HW08 | blog |
| 09 Nov | Sequential Learning: named entity recognition | 22.1 | - | blog |
| 11 Nov | Using World Knowledge: bootstrapping knowledge from text | 20.5, boot | HW09 | blog |
| 16 Nov | Local Discourse Context: anaphors, antecedents, and coreference | 21.3-21.7 | - | blog |
| 18 Nov | Document Coherence: TextTiling and argumentative zoning | 21-21.1, zone | HW10 | blog |
| 23 Nov | Hierarchical Text Structure: rhetorical structure theory and the discourse treebank | 21.2, discourse | P3 | blog |
| 30 Nov | Information Extraction | 22.2, 22.4 | HW11 | blog |
| 02 Dec | Mapping Text to Actions | mapping | - | blog |
| 07 Dec | Machine Translation | 25-25.5 | - | - |
| 09 Dec | Automatic Document Summarization | 23.3-23.6 | P4 | - |
| 17 Dec | Final exam; final projects due | - | - | - |