September 5, 2024

Do LLMs store what they’re trained on? A reason to think not.

If you ask chatGPT to give you the first paragraph of Moby-Dick, it gets it exactly right. This and other such examples are sometimes taken as evidence that it has kept a copy of that book in memory, a potential copyright violation. The makers of these Large Language Models say they don’t, or at least I haven’t come across any who cop to it.

My understanding is that they analyze the texts they’re trained on looking for statistical relationships, keep the stats, and ditch the texts. (In this, they’re similar to how search engines compile their indexes.)
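Here’s a deliberately tiny sketch of that “keep the stats, ditch the text” idea, using nothing fancier than word-pair counts. It’s a toy, not a claim about how any real LLM is built, and the sample text is just the obvious one:

```python
# A toy illustration of "keep the stats, ditch the text": record only
# word-pair (bigram) counts from the training text, then throw the text away.
# This is a drastic simplification of what an LLM actually does.
from collections import Counter, defaultdict

def train_bigram_counts(text: str) -> dict:
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return dict(counts)  # only the statistics survive; the text is discarded

stats = train_bigram_counts("call me ishmael some years ago never mind how long")
print(stats["call"])  # Counter({'me': 1})
```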

But my understanding is quite limited. So, I poked around.

PROMPT: What is the first paragraph of Moby-Dick, verbatim?

CHATGPT:

Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people’s hats off—then, I account it high time to get to sea as soon as I can

Perfect.

PROMPT: What is the first paragraph of David Weinberger’s book Everything Is Miscellaneous?

CHATGPT:

The first paragraph of Everything is Miscellaneous by David Weinberger reads:

“If you keep your music in iTunes or your photos in Flickr, you’re using an application that lets you put a piece of information in more than one category at a time. That’s weird. In the real world—where our species evolved—everything has its place. Our books are on one shelf at a time, our plates are in one cupboard, and our shoes are by the door or in the closet. That’s the way it’s supposed to be because that’s the way the world is. Except, of course, it’s not.”

This opening sets the stage for Weinberger’s exploration of how digital organization challenges traditional, physical-world categorization.

That’s not a bad intro to the book’s themes: it picks up some of the examples in the book, and it is certainly reminiscent of my style of writing. But it has nothing to do with the first paragraph of the first chapter:

Before the Web, browsing was usually a polite way of telling a salesperson to buzz off. “May I help you?” a salesperson asks. “I’m just browsing,” you reply with a little smile. With that word, a customer declares a lack of commitment. With that smile, she asserts that she’s within her rights: “Just try and stop me, salesboy!”

I’m slightly embarrassed by that, but the point is that the real first paragraph doesn’t talk about anything mentioned in chatGPT’s attempt to quote it.

Further, while chatGPT can quote the opening of Alice’s Adventures in Wonderland, if you ask it to quote the opening of Carroll’s lesser-known “Sylvie and Bruno”, it responds:

One winter night, at a little town called Elveston, a telegram was received by the Sub-Warden, who sat sipping his coffee, and wondering lazily whether it was worth while to finish his cigar, or whether he would take a stroll down to the Club, and see if Captain Muriel had returned yet.

Here’s the actual opening:

—and then all the people cheered again, and one man, who was more excited than the rest, flung his hat high into the air, and shouted (as well as I could make out) “Who roar for the Sub-Warden?” Everybody roared, but whether it was for the Sub-Warden, or not, did not clearly appear: some were shouting “Bread!” and some “Taxes!”, but no one seemed to know what it was they really wanted.

The phrase “one winter” doesn’t appear anywhere in the original. Nor does “a little town called Elveston”. 

Finally, if you repeat the prompt, it gives substantially different responses, including this one, which reads as unfortunately smutty these days:

‘Less bread!’ cried the poor. I looked again, and saw it was not bread they had needed, but more head.

It seems to unqualified me that this bolsters the claims of the LLM makers that they don’t keep copies of their training materials (or at least don’t make them accessible to the model in operation), and that the training-time statistical analysis of texts that quote other texts, plus the magic of transformer technology, is sufficient to explain how they can provide verbatim quotes of oft-cited works.
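To make that concrete in toy form: if a famous opening is quoted over and over across the training text, even a crude word-pair model will reproduce it verbatim from its counts alone, with no copy of the source consulted. The sketch below is nothing like a transformer, and the little “corpus” is invented, but it shows the kind of mechanism I have in mind:

```python
# If a famous line appears many times (because other documents quote it),
# greedy generation from word-pair counts alone reproduces it verbatim.
from collections import Counter, defaultdict

def train(text: str):
    """Keep only word-pair counts; discard the text itself."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start: str, length: int) -> str:
    """Repeatedly follow the most common next word."""
    out, word = [start], start
    for _ in range(length):
        if word not in counts:
            break
        word = counts[word].most_common(1)[0][0]
        out.append(word)
    return " ".join(out)

# The famous opening is "quoted" five times; an obscure line appears once.
corpus = " ".join(["call me ishmael some years ago never mind how long"] * 5
                  + ["one obscure line that almost nobody ever quotes"])
print(generate(train(corpus), "call", 9))
# -> call me ishmael some years ago never mind how long
```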

Am I wrong? It’s highly likely.

Categories: ai, copyright, machine learning, tech Tagged with: ai • chatgpt • copyright • llm • ml • scale Date: September 5th, 2024 dw


October 27, 2013

[2b2k] Globalization of local expertise

In August, I blogged about a mangled quotation supposedly from Mark Twain posted on an interstitial page at Forbes.com. When I tweeted about the post, it was (thanks to John Overholt [twitter:JohnOverholt]) noticed by Quote Investigator [twitter:QuoteResearch], who over the course of a few hours tweeted the results of his investigation. Yes, it was mangled. No, it was not Twain. It was probably Christian Bovee. Quote Investigator, who goes by the pen name Garson O’Toole, has now posted on his site at greater length about this investigation.

It’s been clear from the beginning of the Web that it gives us access to experts on topics we never even thought of. As the Web has become more social, and as conversations have become scaled up, these crazy-smart experts are no longer nestling at home. They’re showing up like genies summoned by the incantation of particular words. We see this at Twitter, Reddit, and other sites with large populations and open-circle conversations.

This is a great thing, especially if the conversational space is engineered to give prominence to the contributions of drive-by experts. We want to take advantage of the fact that if enough people are in a conversation, one of them will be an expert.

Categories: experts Tagged with: 2b2k • conversations • experts • reddit • scale • twitter Date: October 27th, 2013 dw


May 7, 2012

[everythingismisc] Scaling Japan

MetaFilter popped up a three-year-old post from Derek Sivers about how street addresses work in Japan. The system does a background-foreground duck-rabbit Gestalt flip on Western addressing schemes. I’d already heard about it — book-larnin’ because I’ve never been to Japan — but the post got me thinking about how things scale up.

What we would identify by street address, the Japanese identify by house number within a block name. Within a block, the addresses are non-sequential, reflecting instead the order of construction.

I can’t remember where I first read about this (I’m pretty sure I wrote about it in Everything Is Miscellaneous), but it pointed out some of the assumptions and advantages of this system: it assumes local knowledge, confuses invaders, etc. But my reaction then was the same as when I read Derek’s post this morning: Yeah, but it doesn’t scale. Confusing invaders is a positive outcome of a failure to scale, but getting tourists lost is not. The math just doesn’t work: 4 streets intersected by 4 avenues creates 9 blocks, but add just 2 more streets and 2 more avenues and you’ve enclosed another 16 blocks. So, to navigate a large Western city you have to know many, many fewer streets and avenues than the number of existing blocks.
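For anyone who wants to check the arithmetic, here it is as a two-line Python sketch (the grid sizes are just the ones from the example above):

```python
# S streets crossed by A avenues enclose (S - 1) * (A - 1) blocks,
# so block names multiply much faster than street and avenue names.
def blocks(streets: int, avenues: int) -> int:
    return (streets - 1) * (avenues - 1)

print(blocks(4, 4))  # 9 blocks from 8 named streets and avenues
print(blocks(6, 6))  # 25 blocks (16 more) from just 4 additional names
```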

But of course I’m wrong. Tokyo hasn’t fallen apart because there are too many blocks to memorize. Clearly the Japanese system does scale.

In part that’s because, according to the Wikipedia article on it, blocks are themselves located within a nested set of named regions. So you can pop up the geographic hierarchy to a level where there are fewer entities in order to get a more general location, just as we do with towns, counties, states, countries, solar system, galaxy, the universe.
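In rough code terms, an address in that scheme is a path through nested named regions, and “popping up” just means dropping the more specific components. The components below are invented for illustration; only the nesting matters:

```python
# A made-up address as a path through nested named regions.
address = ["Tokyo", "Shibuya Ward", "Jinnan District", "Block 1-19", "House 11"]

def generalize(addr: list, levels: int) -> str:
    """Keep only the first `levels` components for a more general location."""
    return ", ".join(addr[:levels])

print(generalize(address, 2))  # -> Tokyo, Shibuya Ward
print(generalize(address, 5))  # the full address, down to the house
```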

But even without that, the Japanese system scales in ways that peculiarly mirror how the Net scales. Computers have scaled information in the Western city way: bits are tucked into chunks of memory that have sequential addresses. (At least they did the last time I looked in 1987.) But the Internet moves packets to their destinations much the way a Japanese city’s inhabitants might move inquiring visitors along: You ask someone (whom we will call Ms. Router) how to get to a particular place, and Ms. Router sends you in a general direction. After a while you ask another person. Bit by bit you get closer, without anyone having a map of the whole.
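In code, the “ask Ms. Router” idea might look something like this toy sketch. Every node name and route here is made up; the point is only that each hop knows a next step in the right general direction, not the whole map:

```python
# Each node knows only which neighbor is the right general direction
# for a destination region; nobody holds a map of the whole network.
next_hop = {
    "home-router":    {"tokyo": "isp-gateway"},
    "isp-gateway":    {"tokyo": "pacific-link"},
    "pacific-link":   {"tokyo": "tokyo-exchange"},
    "tokyo-exchange": {"tokyo": "deliver-locally"},
}

def route(start: str, region: str) -> list:
    path, node = [start], start
    while node in next_hop:
        node = next_hop[node].get(region)
        if node is None or node == "deliver-locally":
            path.append("delivered")
            break
        path.append(node)
    return path

print(route("home-router", "tokyo"))
# -> ['home-router', 'isp-gateway', 'pacific-link', 'tokyo-exchange', 'delivered']
```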

At the other end of the stack of abstraction, computers have access to such absurdly large amounts of information either locally or in the cloud — and here namespaces are helpful — that storing the block names and house numbers for all of Tokyo isn’t such a big deal. Point your mobile phone to Google Maps’ Tokyo map if you need proof. With enough memory, we do not need to scale physical addresses by using schemes that reduce them to streets and avenues. We can keep the arrangement random and just look stuff up. In the same way, we can stock our warehouses in a seemingly random order and rely on our computers to tell us where each item is; this has the advantage of letting us put the most requested items up front, or on the shelves that require humans to do the least bending or stretching.
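The warehouse version is just a lookup table. A toy sketch, with invented item and shelf names:

```python
# "Random" warehouse storage: put an item wherever there's room and
# record where it landed; the lookup table replaces physical ordering.
locations = {}

def stock(item: str, shelf: str) -> None:
    locations[item] = shelf

def find(item: str) -> str:
    return locations.get(item, "not stocked")

stock("umbrella", "aisle 7, bin 3")
stock("kettle", "aisle 2, bin 1")  # a popular item can go near the front
print(find("kettle"))              # -> aisle 2, bin 1
```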

So, I’m obviously wrong. The Japanese system does scale. It just doesn’t scale in the ways we used when memory spaces were relatively small.

Categories: everythingIsMiscellaneous, too big to know Tagged with: 2b2k • everythingismisc • namespaces • scale Date: May 7th, 2012 dw



Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
TL;DR: Share this post freely, but attribute it to me (name (David Weinberger) and link to it), and don't use it commercially without my permission.

Joho the Blog uses WordPress blogging software.
Thank you, WordPress!
