
Simple Checklists to Verify the Accuracy of AI-Generated Research Summaries

Do you share AI-generated audio/video summaries of your research with students? Or with the broader public on social media? Below is a short article I wrote encouraging researchers to share a checklist alongside those summaries that verifies their accuracy and notes their limits (the final version is at Veletsianos, G. (2025). Simple Checklists to Verify the Accuracy of AI-Generated Research Summaries. TechTrends, XX(X), Xx-xx, but here’s a public pre-print too).

Simple Checklists to Verify the Accuracy of AI-Generated Research Summaries

Picture this: An educational technology researcher shares a seven-minute AI-generated audio or video of their latest paper on social media. It sounds engaging and professional. But buried in that smooth narration, the AI has quietly transformed “may suggest” into “proves,” dropped crucial limitations, and expanded the study’s claims beyond what the data supports. The listeners, including students, policymakers, and journalists, have no way of knowing.

The proliferation of AI-generated audio and video summaries of research papers—through tools like Google’s NotebookLM and others—represents both an opportunity and a challenge for scholarly communication. These summaries are promising as they can expand the reach, accessibility, and consumption of our research for diverse audiences (cf. Veletsianos, 2016). They also allow us to efficiently engage with literature outside of our expertise. A seven-minute podcast consumed during a commute may reach audiences who would never read a 30-page paper.

Yet this convenience comes with risks. Peters and Chin-Yee (2025), for example, found that summaries generated by large language models omitted study details and made overgeneralizations. Such risks can propagate misunderstandings, particularly when summaries circulate without clear indicators of their accuracy or limitations.

While some technical solutions to address this problem exist (e.g., fine-tuning models and implementing algorithmic constraints), these approaches remain inaccessible to most researchers. We need a low-barrier intervention that empowers authors to assess and communicate the quality of AI-generated summaries to listeners.

I propose that researchers who share AI-generated summaries complete and publish a brief verification checklist alongside their summary. This practice serves two purposes: it encourages authors to critically review AI output before dissemination, and it provides audiences with transparency about the summary’s accuracy and limitations. Just as we expect ethics approval for research, we should normalize quality assurance for AI-generated scholarly content.

To facilitate this practice, below are two verification checklists, one for academic audiences and another for the general public, even though the latter could serve both audiences. Both are deliberately concise to enable sharing across digital platforms where these summaries circulate, from social media to publishers’ websites to course management systems.

Checklist 1: For Academic Audiences

Author verification: This summary of [paper title] was AI-generated using [tool name] on [date] and reviewed by the author(s). It accurately represents our work. For full details, nuance, and context, please refer to the original work at [URL].

The following items were verified:
✓ Research purpose or questions stated correctly
✓ Study design described correctly
✓ Summary matches study results (no fabricated data)
✓ Conclusions are explicitly limited to the study’s scope and context
✓ Key terminology used properly
✓ Theoretical, conceptual, and/or methodological frameworks are framed appropriately and are neither omitted nor misrepresented
✓ Major limitations are included
✓ Context and scope are clear
✓ The summary does not omit anything of significance
✓ The tone is consistent with the original work

Issues noted: [Note any issues]

Checklist 2: For the General Public

Author verification: This summary of [paper title] was AI-generated using [tool name] on [date] and reviewed by the author(s). It accurately represents our work. For full details, nuance, and context, please refer to the original work at [URL].

What we checked:
✓ Main findings are correct – nothing made up
✓ Doesn’t overstate what we found
✓ Includes what we studied and who participated
✓ Mentions important limitations
✓ Uses language appropriately
✓ Matches our original tone and message

Issues noted: [Note any issues, using plain language]

These checklists are a starting point, not a comprehensive solution. I have attempted to make them flexible enough to accommodate different research paradigms, but if you do use them, you should refine them to fit your needs and orientation. The point is not to develop the perfect checklist, but to provide a flexible tool that can be adapted and improved to minimize the risks of AI-generated research summaries. As AI tools become increasingly integrated into research dissemination, we must develop community standards for responsible use. Normalizing transparency practices now contributes toward maintaining the integrity that underpins scholarly communication.

In an academic landscape saturated with contested claims, particularly in education and educational technology where myths and zombie theories persist (e.g., Sinatra & Jacobson, 2019; Suárez-Guerrero, Rivera-Vargas, & Raffaghelli, 2023), our commitment to accuracy and transparency must remain constant. Verifying AI-generated summaries constitutes a form of reputational stewardship. This quality assurance practice encourages authors to critically review AI output before it circulates, signaling to colleagues, institutions, and the public that they take seriously their role as knowledge custodians. By proactively verifying summaries, researchers can protect the integrity of their findings and build a reputation for reliability that enhances the trustworthiness of their entire body of work. At the end of the day, the few minutes invested in verifying AI-generated summaries of one’s work pale in comparison to the time that might be required to correct a misleading summary that gains traction on social media. Once an AI-generated misrepresentation goes viral, no amount of clarification can fully undo it. In this sense, verification checklists function as both quality control and professional insurance. They are a small investment that yields returns in credibility and peace of mind.

I encourage researchers to adopt versions of these checklists, journals to consider requiring them for AI-generated supplementary materials, and the broader academic community to refine and expand upon this framework. In an era of rapid AI developments, our commitment to scholarly accuracy and transparency must remain constant.

Author notes and transparency statement, as suggested by Bozkurt (2024): This editorial was reviewed, edited, and refined with the assistance of ChatGPT o3 and Gemini Pro 2.5 as of July 2025, complementing the human editorial process to address grammar, flow, and style. I critically assessed and validated the content and assessed potential biases inherent in AI-generated content. The final version of the paper is my sole responsibility.

References

Bozkurt, A. (2024). GenAI et al.: Cocreation, authorship, ownership, academic ethics and integrity in a time of generative AI. Open Praxis, 16(1), 1-10.

Peters, U., & Chin-Yee, B. (2025). Generalization bias in large language model summarization of scientific research. Royal Society Open Science, 12(4), 241776. https://doi.org/10.1098/rsos.241776

Sinatra, G. M., & Jacobson, N. (2019). Zombie Concepts in Education: Why They Won’t Die and Why You Cannot Kill Them. In P. Kendeou, D. H. Robinson, & M. T. McCrudden (Eds.), Misinformation and fake news in education (pp. 7–27). Information Age Publishing, Inc.

Suárez-Guerrero, C., Rivera-Vargas, P., & Raffaghelli, J. (2023). EdTech myths: towards a critical digital educational agenda. Technology, Pedagogy and Education, 32(5), 605-620.

Veletsianos, G. (2016). Networked Scholars: Social Media in Academia. New York, NY: Routledge.

ChatGPT’s ‘Helpful’ Suggestions Are Actually a Design Problem

In a recent op-ed in The New York Times, Meghan O’Rourke highlights how AI systems might tempt learners to offload an increasing amount of their work and thinking. It’s an excellent piece that identifies many crucial problems, and she writes:

Students often turn to A.I. only for research, outlining and proofreading. The problem is that the moment you use it, the boundary between tool and collaborator, even author, begins to blur. First, students might ask it to summarize a PDF they didn’t read. Then — tentatively — to help them outline, say, an essay on Nietzsche. The bot does this, and asks: “If you’d like, I can help you fill this in with specific passages, transitions, or even draft the opening paragraphs?” At that point, students or writers have to actively resist the offer of help. You can imagine how, under deadline, they accede, perhaps “just to see.” And there the model is, always ready with more: another version, another suggestion, and often a thoughtful observation about something missing.

To counteract this, she recommends a variety of pedagogical changes, such as reconsidering the essay format and letter grades. These are fine recommendations. Another approach might be for users to add a system prompt to their LLM so that it limits its suggestions to the specified task. For example, such a prompt might be phrased as follows:

Only respond to my specific request without offering to do additional work, expand your role, or suggest next steps. Do not ask if I’d like help with related tasks, drafting, or improvements unless I explicitly ask. Keep your assistance limited to exactly what I’ve requested.
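
For those who reach a model through an API rather than a chat interface, the same constraint can be set programmatically as a system prompt. Below is a minimal sketch using Anthropic’s Python SDK; the model identifier and the exact instruction wording are illustrative assumptions, not recommendations of a particular product or phrasing.

```python
# Minimal sketch: passing a scope-limiting system prompt via Anthropic's Python SDK.
# The model identifier and the instruction wording are illustrative assumptions.
import anthropic

SCOPE_LIMIT = (
    "Only respond to my specific request without offering to do additional work, "
    "expand your role, or suggest next steps. Do not ask if I'd like help with "
    "related tasks, drafting, or improvements unless I explicitly ask."
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # assumed model identifier; substitute your own
    max_tokens=1024,
    system=SCOPE_LIMIT,                # the scope-limiting system prompt
    messages=[{"role": "user", "content": "Proofread this paragraph: ..."}],
)
print(response.content[0].text)
```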

However, both O’Rourke’s pedagogical reforms and the system prompt I described share a common limitation: they place the burden of change on educators and users rather than addressing the underlying system design that creates these temptations in the first place. In other words, AI’s invitation to take over additional aspects of one’s work/writing is a particular design decision. Some design decisions – namely system defaults, or those settings which are picked for you – are more powerful than others. It is simple to stick to defaults and challenging to resist or change them. For example, in the past I wrote about how YouTube’s default settings (i.e., defaulting uploaded videos to standard copyright rather than a Creative Commons license) have important and unanticipated impacts on open education, as well as how the defaults in Learning Management Systems structure faculty-student relationships in particular ways.

Another approach is to address the system – the design of the chatbot itself – such that the handoff of cognitive work becomes more visible and intentional rather than seamless and automatic. For example, some approaches might include:

  1. Adding friction through confirmation prompts: A few years ago, Twitter made a change to its retweeting practice. If you tried to retweet an article without having clicked it first, it asked you if you really wanted to do that. The intent was to add friction, and to address some of the challenges associated with echo chambers, where we all share things we tend to agree with, even if we don’t actually read them. Similarly, the AI system could add friction by asking: “Are you sure you want me to draft paragraphs for you?” or “This request would significantly reduce your own writing practice – continue anyway?” before taking on substantial work (a rough sketch of this pattern appears after this list).
  2. Implementing escalation warnings: When a user’s requests progressively increase AI involvement within a session, the chatbot could display messages like “You’ve now asked me to research, outline, and draft – consider what learning opportunities you might be missing.”
  3. Defaulting to partial assistance: Instead of offering complete solutions, the system could default to giving hints, questions, or partial frameworks that require human completion. This changes the pedagogical role of the chatbot. It’s probably one of the most consequential decisions that designers of education-specific chatbots must contend with.
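
To make the first two ideas concrete, here is a minimal Python sketch of a wrapper that adds a confirmation prompt and an escalation warning before forwarding requests to a model. The trigger words, the threshold, and the messages are hypothetical illustrations of the design pattern, not features of any existing product.

```python
# Minimal sketch: friction and escalation warnings in a hypothetical chat wrapper.
# Trigger words, the threshold, and the messages are illustrative, not real product features.

DELEGATION_VERBS = {"draft", "write", "outline", "research", "summarize"}

def hands_over_work(request: str) -> bool:
    """Heuristically flag requests that shift substantial writing or thinking to the AI."""
    return any(verb in request.lower() for verb in DELEGATION_VERBS)

def chat_with_friction(send_to_model, requests):
    """Forward each request to the model, pausing for confirmation on delegation
    requests and warning the user as delegation accumulates within a session."""
    delegated = 0
    for request in requests:
        if hands_over_work(request):
            answer = input("This would shift substantial work to the AI. Continue anyway? (y/n) ")
            if answer.strip().lower() != "y":
                continue  # the user reconsidered; skip this request
            delegated += 1
            if delegated >= 3:
                print("You've now delegated several steps of this task -- "
                      "consider what learning opportunities you might be missing.")
        print(send_to_model(request))
```

A real implementation would live inside the chatbot itself rather than in a wrapper, but the sketch shows how little logic the friction requires; the hard part is the design decision to include it.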

These solutions aren’t without downsides. First, they directly conflict with AI companies’ business incentives. More seamless and extensive AI assistance likely increases user engagement, subscription renewals, and the perceived value of their products. Voluntary adoption of these friction-inducing features is unlikely without regulatory pressure or industry-wide coordination. Second, confirmation prompts might become annoying click-through obstacles that users eventually ignore.

The question isn’t whether these solutions are perfect. The alternative, accepting AI’s current design as inevitable, essentially outsources pedagogical decisions to the companies that design these systems.

Prompting Claude to build a simple sudoku

Continuing with sharing simple prompting experiments, a few days ago I built a simple sudoku game. One reason someone might want to create their own version might be to escape the ads and trackers that are embedded in free online games. Another might be to just see what these models are and aren’t capable of. The initial prompt was:

I want to create a standalone webpage where i can play a simple 9 x 9 sudoku puzzle.

Follow-up prompts were:

remove the hint button [Note: I didn’t ask for one. I suspect “hints” are standard in training data consisting of sudoku games or similar. This is a reminder that LLMs are probabilistic machines. This is also a reminder of the “power of defaults.” Any choices that Claude makes will likely have a significant influence on people – e.g., they’re likely to keep the hint button – without consideration of alternatives, because it’s just easy to keep it there. Ask a thousand students to use Claude to build a sudoku. How many will include a hint button?]

remove this error message which shows up in the console: UDOIT [Note: This was a terrible prompt, and it should have asked for a fix in the code to address the error. Yet, Claude deciphered the meaning and tried to fix it. BUT. UDOIT wasn’t the error message that was showing up. UDOIT was the last piece of text that was in my clipboard. Claude happily followed my instructions, agreed it was an error, removed a function that was possibly throwing that error, and declared success, all while the error remained]

this text appears at the top of the board. remove it: .controls { display: flex; gap: 10px; margin-bottom: 10px; } [Note: This was at the top of the HTML file, and Claude just could not locate it. I eventually had to manually remove it.]

When i add a number, I want you to instantly check whether it is accurate or not. If accurate it should immediately turn green. if wrong, it should immediately turn red.
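
The game Claude produced is an HTML/JavaScript page, but the logic that last prompt asks for is tiny. As a rough illustration (not Claude’s output), here is a minimal Python sketch of the core check behind the instant green/red feedback: compare the entered value against the solution grid. Function and variable names are mine.

```python
# Minimal sketch (not Claude's output): the check behind instant green/red feedback.

def feedback_color(solution: list[list[int]], row: int, col: int, value: int) -> str:
    """Return the color a cell should turn when a number is entered at (row, col)."""
    return "green" if solution[row][col] == value else "red"

# Hypothetical usage with a completed solution grid:
solution = [
    [5, 3, 4, 6, 7, 8, 9, 1, 2],
    [6, 7, 2, 1, 9, 5, 3, 4, 8],
    [1, 9, 8, 3, 4, 2, 5, 6, 7],
    [8, 5, 9, 7, 6, 1, 4, 2, 3],
    [4, 2, 6, 8, 5, 3, 7, 9, 1],
    [7, 1, 3, 9, 2, 4, 8, 5, 6],
    [9, 6, 1, 5, 3, 7, 2, 8, 4],
    [2, 8, 7, 4, 1, 9, 6, 3, 5],
    [3, 4, 5, 2, 8, 6, 1, 7, 9],
]
print(feedback_color(solution, 0, 2, 4))  # "green" -- the entry matches the solution
print(feedback_color(solution, 0, 2, 7))  # "red" -- the entry is wrong
```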

To be sure, while this is a very, very simple example of what Andrej Karpathy recently called vibe coding (also see the decent Wikipedia entry on the term), it’s not an argument against the need for essential training and skills in coding: understanding errors, timers, sorting, data structures such as arrays, randomization, logic, and loops – all basic elements of this game – are the kinds of foundational building blocks that are helpful in many other contexts. Building a sudoku game just happens to be an example of putting these theoretical concepts into practice. What I find most important about the conversations around AI in education (including vibe coding) are the questions around who benefits: Amateurs or experts? For whom is vibe coding most helpful? Who accrues the most benefits? And who might lose out on developing foundational and necessary skills as a result of the advent of these approaches/technologies? These lead to more questions, which intersect with literacies, the labor market, education futures, and the design/adoption of these technologies.

[Screenshot: an online Sudoku game interface showing difficulty selection buttons (Easy, Medium, Hard), action buttons (New Game, Check Solution, Solve), a timer reading 00:33, and a partially filled Sudoku grid with some cells highlighted, including one number marked in red.]

Prompting Claude to build a coffee journal

To truly grasp the possibilities, limits, and limitations of Generative AI, you have to get your hands dirty with the technology, meaning you have to use it not just once, but in an ongoing and consistent way. Reading about the technology can only get you so far. One of the assignments in my AI in Education course, for example, invites students to turn to an AI tool as much as they can over three days and to reflect on that experience.

I’ve been doing something similar, and when I saw D’Arcy’s post on prompting Claude to build a sleep journal, I thought I should give it a try, and since I’ve neglected this blog a little, I should post it here. A few adjustments and errors, and about an hour later, I had built a simple coffee journal that tracks and visualizes your coffee consumption and purchases over time. The initial prompt was:

I need to build a “coffee journal” to document my coffee purchasing patterns. I need it to be a standalone web page that stores the coffee data, visualizes the data and calculates trends. Coffee journal entries will be entered every afternoon, to document my purchases of the day. It needs fields for: number coffees for the day (integer), time that I purchased each of the coffees (time), kind of coffee each one was (text field), and cost of each coffee (integer). I need to visualize the data over time, and calculate and display total cost per day and in total so far.
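
For reference, the data model that prompt describes is straightforward. Below is a minimal Python sketch (not Claude’s output) of the entry structure and the two totals the prompt asks for; the field and function names are mine, and cost is a float here even though the prompt specified an integer.

```python
# Minimal sketch (not Claude's output) of the coffee journal's data model and totals.
from dataclasses import dataclass

@dataclass
class CoffeeEntry:
    date: str    # e.g. "2025-07-14"
    time: str    # time of purchase, e.g. "09:10"
    kind: str    # later converted to a dropdown of coffee types
    cost: float  # cost of this coffee

def total_for_day(entries: list[CoffeeEntry], date: str) -> float:
    """Total cost of coffees purchased on a given day."""
    return sum(e.cost for e in entries if e.date == date)

def total_so_far(entries: list[CoffeeEntry]) -> float:
    """Running total across all journal entries."""
    return sum(e.cost for e in entries)

entries = [
    CoffeeEntry("2025-07-14", "09:10", "latte", 4.50),
    CoffeeEntry("2025-07-14", "14:05", "espresso", 3.00),
]
print(total_for_day(entries, "2025-07-14"))  # 7.5
print(total_so_far(entries))                 # 7.5
```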

One of the adjustments, for example, was the kind-of-coffee field. Initially I imagined this to be an open-ended text field, but then realized that users might want to track the kinds of coffees they had. I converted it into a dropdown menu, which then made it possible to track the kinds of purchases over time. Will I be using this? Probably not – I prefer weak homemade coffee – but the point is to see what Claude can and cannot do in terms of a simple web app.

[Screenshot: a coffee tracking app interface with fields to add date, time, coffee type, and cost. Below, an empty table shows a daily total of $0. A green ‘Save Day’ button is present. The ‘Statistics’ section shows 7 total coffees, $28 total spent, and an average cost of $4.]
