Mark E. Whiting

Common sense, machines, and what they don’t know

2025-10-15T00:00:00+00:00

The framework we introduced for quantifying common sense was built around people — individuals rating claims, and the structure of agreement across a population. The obvious next question is what happens when the raters aren’t human.

With Tuan Dung (Josh) Nguyen and Duncan Watts, we applied the same methodology at scale to large language models, evaluating commonsense knowledge in humans and in LLMs on the same set of claims PNAS Nexus. The comparison is useful both ways: it tells us what current models know in the same terms we used for people, and it exposes where benchmarks designed for humans break down when pointed at a machine.

Running alongside this, a broader community effort — organized by Vinay Chaudhri with many others — has been articulating what a new knowledge resource for AI might look like, one that goes beyond existing knowledge graphs and taps into the kind of structured, commonsense, and expert knowledge that modern AI systems still struggle to use reliably AI Magazine.

Taken together, these projects push at the same question from two sides: how do we measure what machines know, and how do we build the resources that would let them know more?

AI hasn’t fixed teamwork

2025-09-15T00:00:00+00:00

Every wave of productivity technology arrives with a promise that it will finally make teamwork better. Generative AI is the current one, and perhaps the last one.

With Qing Xiao, Xinlan Emily Hu, Arvind Karunakaran, Hong Shen, and Hancheng Cao, we followed a project-based software development organization longitudinally from 2023 to 2025 — through the period in which generative AI tools went from novelty to default — to see what actually changed in how teams collaborated. The preprint is on arXiv.

The short version of what we found is in the title: AI hasn’t fixed teamwork. It has, however, shifted the collaborative culture in measurable and sometimes surprising ways — in who does what, in what counts as a contribution, and in the texture of day-to-day coordination. That’s a distinct finding from either “AI is transformative” or “AI changes nothing,” and it lines up with a broader pattern in our team research: the effects of a new tool depend on the task, the team, and the culture they arrive into.

A task space for team research

2025-09-01T00:00:00+00:00

Teams research is fragmented. Every discipline that studies groups (organizational behavior, social psychology, HCI, economics, operations) has its own favorite tasks: brainstorming, jury deliberation, prisoner’s dilemma, hidden profile, creative writing prompts, estimation games. Results on these tasks rarely talk to each other, and it’s genuinely hard to tell whether that’s because the findings conflict or because the tasks do. The problem extends beyond the choice of task to how tasks are operationalized and how experiments are parameterized and measured.

With Xinlan Emily Hu, Linnea Gandhi, Duncan Watts, and Abdullah Almaatouq, we introduce the Task Space: a framework that organizes the tasks teams do along dimensions that matter for how teams actually perform them, so findings from one task can be meaningfully compared with findings from another Management Science.

This sits on top of a stack of earlier work that kept pointing at the same gap. In Did It Have To End This Way? we showed that the same teams can produce very different outcomes depending on the task in front of them CSCW’19. In Parallel Worlds we re-convened the same teams without them knowing to see how much of their trajectory was locked in CSCW’20. And in Online Juries we found that for some decision tasks, teams are remarkably consistent — while for others, they’re essentially random CHI’21.

The Task Space takes the next step: rather than studying individual tasks and hoping the findings generalize, it maps the space those tasks live in so that generalization can be tested directly.

Quantifying common sense

2024-01-16T00:00:00+00:00

Common sense is one of those ideas that everyone appeals to and no one agrees on. It’s supposedly universal, but it’s also regularly invoked to complain that other people don’t have any. That tension — universal in principle, contested in practice — is actually measurable, if you’re careful about what you count.

With Duncan Watts and collaborators, I developed a method to quantify common sense empirically, at two levels: for an individual (how aligned is this person’s take on a claim with everyone else’s?) and for a collective (how much does a group actually agree?). Running the method over a large set of human-rated claims, we found that what we think of as “commonsense” varies a lot depending on the kind of claim. The clearest agreement shows up on plainly-worded factual claims about the physical world; agreement drops off sharply for claims that are social, normative, or ambiguously worded. Interestingly, who the raters are matters much less than what kind of claim it is. And at the collective level, the universal common sense people often assume exists mostly doesn’t PNAS.

The paper’s full significance statement spells this out:

Common sense, while often portrayed as universal, is paradoxically also often claimed not to exist. Here, we resolve this puzzling situation by introducing a formal methodology to empirically quantify common sense both at individual and collective levels. We then demonstrate the method with a dataset involving human raters evaluating claims. We show that common sense varies considerably across types of claims but aligns most closely with plainly worded, factual claims about physical reality; in contrast, does not vary much across different types of people. We also find limited presence of collective common sense, undermining universalist claims and supporting skeptics. Finally, we argue that quantifying common sense is useful both for applications in social science and AI.
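As a rough illustration of the two levels described above, here is a minimal sketch in Python. It is not the paper's method: the function names, the toy ratings, the majority-match score, and the 0.9 agreement threshold are all assumptions made for the example.

```python
from collections import Counter


def vote_counts(ratings):
    """Tally True/False ratings per claim across all raters."""
    counts = {}
    for by_claim in ratings.values():
        for claim, agrees in by_claim.items():
            counts.setdefault(claim, Counter())[agrees] += 1
    return counts


def individual_alignment(ratings, rater):
    """Share of a rater's claims on which they match the majority rating."""
    counts = vote_counts(ratings)
    own = ratings[rater]
    # Ties count as a True majority; an arbitrary choice for this sketch.
    matches = sum(own[c] == (counts[c][True] >= counts[c][False]) for c in own)
    return matches / len(own)


def collective_agreement(ratings, threshold=0.9):
    """Share of claims on which at least `threshold` of raters give the same rating."""
    counts = vote_counts(ratings)
    agreed = sum(max(c.values()) / sum(c.values()) >= threshold for c in counts.values())
    return agreed / len(counts)


# Hypothetical toy data: rater -> claim -> whether the rater agrees with the claim.
ratings = {
    "ann": {"fire is hot": True, "tipping is rude": True, "rocks sink": True},
    "bob": {"fire is hot": True, "tipping is rude": False, "rocks sink": True},
    "cat": {"fire is hot": True, "tipping is rude": False, "rocks sink": True},
    "dan": {"fire is hot": True, "tipping is rude": False, "rocks sink": False},
}

print(individual_alignment(ratings, "ann"))
print(collective_agreement(ratings))
```

In this toy data only the physical claim that every rater accepts clears the threshold, which loosely mirrors the pattern described above: agreement depends more on the kind of claim than on who is rating it.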

The method turns out to be useful beyond humans — we’ve since used it to evaluate common sense in large language models, where having a single framework that applies to both humans and machines is genuinely handy.

The work has received attention from a few sources, which Altmetric summarizes.

How well can social scientists predict society?

2023-04-15T00:00:00+00:00

One way to ask whether a theory is any good is to ask what it predicts. During COVID, a large group of social scientists ran a forecasting tournament to find out — teams competed to predict near-term changes in societal outcomes like mood, polarization, political ideology, discrimination, and life satisfaction, using whatever theoretical apparatus they wanted.

I contributed to the effort as part of the Forecasting Collaborative, led by Igor Grossmann and colleagues. The headline result is sobering: across nearly every outcome, expert forecasts were not reliably better than simple statistical benchmarks, and often worse Nat Hum Behav.

That finding fits uncomfortably well with the argument we make in Beyond Playing 20 Questions with Nature and in our work on common sense: if a field can’t generate reliable predictions about the phenomena it studies, it’s worth asking hard questions about how the field is cumulating knowledge in the first place. Forecasting tournaments are one of the cleaner tests we have — they are hard to game, easy to score, and directly tied to the kind of claims theories are supposed to support.

Integrative experiments

2022-12-21T00:00:00+00:00

The dominant way we run experiments in social and behavioral science — one experiment at a time, each treated as a test of a theory assumed to generalize — has a serious problem. The integration across experiments that is supposed to happen in the published record largely doesn’t, and the recent push for more reliable single findings doesn’t fix it. You can do every individual experiment perfectly well and still not end up with a cumulative theory.

With Abdullah Almaatouq, Tom Griffiths, Jordan Suchow, James Evans, and Duncan Watts, we argue that the fix has to happen at the level of experimental design. In integrative experiments, researchers explicitly map the space of possible experiments associated with a research question, then iteratively sample from that space. Instead of trying to defend a single experimental condition as the one that captures the phenomenon, you treat the design space itself as the object of study BBS.
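To make the idea of mapping a design space and sampling from it concrete, here is a minimal sketch in Python. The dimensions and their values are hypothetical, and the sampling is plain random order without replacement. A real integrative experiment would more likely choose each new batch adaptively, based on the results so far.

```python
import itertools
import random

# Hypothetical design space: each key is a dimension along which experiments can vary.
design_space = {
    "group_size": [2, 4, 8],
    "task": ["estimation", "brainstorming"],
    "communication": ["chat", "none"],
}


def enumerate_conditions(space):
    """List every combination of dimension values as a dict (the full design space)."""
    keys = sorted(space)
    combos = itertools.product(*(space[k] for k in keys))
    return [dict(zip(keys, values)) for values in combos]


def sample_batches(space, batch_size, seed=0):
    """Yield successive batches of conditions, each condition appearing exactly once."""
    rng = random.Random(seed)
    remaining = enumerate_conditions(space)
    rng.shuffle(remaining)
    for start in range(0, len(remaining), batch_size):
        yield remaining[start:start + batch_size]


for round_number, batch in enumerate(sample_batches(design_space, batch_size=5), start=1):
    print(round_number, len(batch), batch[0])
```

The point of the sketch is the unit of analysis: no single condition is privileged, and each round of data collection covers another slice of the space.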

The paper drew a large set of commentaries from across the field, which we responded to in a follow-up BBS. The discussion — about what theories are for, how generalization should work, and whether the field needs a different unit of analysis — is, for me, the most interesting part of the project.

Most of the other threads in my recent work connect here. Empirica is the infrastructure that makes integrative experiments actually runnable; the common sense framework and the task space are both attempts to make the design space in a particular domain tractable; and the forecasting results are one way of showing that the status quo is leaving a lot on the table.

Ambiguity online

2022-05-01T00:00:00+00:00

Face-to-face, a huge amount of what we communicate is nonverbal — a pause, a glance, a shrug. Online, most of that channel disappears, and the little that remains gets compressed into things like read receipts, likes, profile changes, and emoji. These signals are easy to produce and easy to misread.

With So Yeon Park and Michael Shanks at Stanford, as part of the HPI-Stanford Design Thinking Research Program, we studied these “nonverbal online actions” — how people use them, how others interpret them, and where the gaps between sender and receiver sit DTR’21. A follow-up looked specifically at what happens when those actions create confusion, and what people do to repair it DTR’22.

A sharper version of the same problem is synthetic media. With Dilrukshi Gamage, Piyush Ghasiya, Vamshi Bonagiri, and Kazutoshi Sasahara, we analyzed Reddit conversations about deepfakes to see how people actually talk about them — the concerns they raise, the distinctions they draw, and the societal implications that surface in their own words CHI’22.

Across these projects, the pattern that keeps coming up is that online communication is less a thinner version of in-person communication than it is a different medium with its own ambiguities — and the interesting research question is usually how people navigate those ambiguities, not whether they exist.

COVID dashboards

2021-08-15T00:00:00+00:00

Together with Duncan Watts and others on the team at the CSSLab at Penn, and in conjunction with the city of Philadelphia, we built a collection of interactive data dashboards. They visually summarize human mobility patterns over time and space for a number of cities, starting with Philadelphia, and highlight potentially relevant demographic correlates. The dashboards are available at covid.seas.upenn.edu.

The dashboards are built in Observable. The data are a proprietary combination of cell phone GPS data, demographic data derived from the American Community Survey, and COVID-19 caseload data from the New York Times.

A related piece of the same broader effort was the non-pharmaceutical interventions (NPI) dataset, where a large team of student annotators tracked how local COVID-19 policy — mask mandates, business closures, gathering limits, and so on — changed in near-real-time across many jurisdictions. Policies shifted week to week, and there was no single authoritative source for what was actually in force where, which made modeling the pandemic’s trajectory much harder than it needed to be. With Benjamin Hurt, Madhav Marathe, Michael Bernstein, and many others, we released the resulting annotated dataset and the annotation workflow behind it Scientific Data.

Online learning at scale

2021-06-15T00:00:00+00:00

Massive open online courses promised education at a scale that in-person teaching can’t reach, but the instructor–student ratio that works in a seminar doesn’t survive the jump to tens of thousands of learners. So much of what makes a class work — feedback, discussion, a sense that anyone is paying attention — has to be reconstructed out of what the learners themselves do for each other.

Over several years with Dilrukshi Gamage, Indika Perera, Shantha Fernando, Thomas Staubitz and others, we studied how to make peer-driven learning actually work. We found that aligning incentives to the quality of feedback, rather than simply its length, produced feedback learners judged more useful L@S’17. We showed that seeding peer assessment groups with trained “introduced peers” improved the quality of subsequent discussion TALE’18. We surveyed the broader state of the field in a systematic literature review of peer assessment in MOOCs Distance Education. And we showed that treating learners as communities of practice — rather than isolated users moving through a course — measurably improves outcomes Asian CHI’21, which was recognized with a best paper award.

The through-line across these projects is that scale doesn’t remove the social structure of learning; it just changes who has to supply it.

Running experiments at scale

2021-03-15T00:00:00+00:00

Most of the empirical work I do with collaborators needs infrastructure that doesn’t really exist off the shelf. Running a single behavioral experiment with a handful of people is well supported; running thousands of them, each with groups of participants interacting in real time, is not.

Led by Abdullah Almaatouq, and with Joshua Becker, James Houghton, Nicolas Paton, and Duncan Watts, we built and extended Empirica, an open-source virtual lab designed for exactly this kind of high-throughput, macro-level experimentation. Empirica handles the hard parts (synchronization, treatment randomization, real-time interaction, and iterative design) so researchers can focus on the experiment itself BRM’21.

Empirica was one piece of a broader conversation, across many labs, about what it would take to actually scale up behavioral and social science. In a working paper with a large group of collaborators we laid out a vision for shared infrastructure, shared protocols, and shared participant access as the missing middle of the field OSF’21.

This infrastructure turned out to be essential to the work on integrative experiments that came next — you can’t run a design space full of experiments if each one costs you a month of engineering.