Add tokenization and prompting API to GPT models #651
Conversation
@tharvik @peacefulotter I'd be happy to hear your take on how to integrate tokenization and prompting into the Disco architecture.
Congrats on the LLM POC milestone! Firstly, I am wondering whether having a custom tokenizer is really relevant in the first place. Is there actually a use case for it in Disco? If so, instead of passing the tokenizer object, you could pass a string identifying the tokenizer to use; an object could then map tokenizer ids / names to the corresponding instances.
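The name-to-instance mapping could look something like this minimal TypeScript sketch (the `Tokenizer` interface, the `char` tokenizer, and `getTokenizer` are all hypothetical names for illustration, not Disco's actual API):

```typescript
interface Tokenizer {
  encode(text: string): number[];
}

// A trivial character-level tokenizer as a stand-in implementation.
const charTokenizer: Tokenizer = {
  encode: (text) => Array.from(text).map((c) => c.charCodeAt(0)),
};

// Registry mapping tokenizer names to instances, so a task only needs
// to carry a serializable string instead of a tokenizer object.
const tokenizers: Record<string, Tokenizer> = {
  char: charTokenizer,
};

function getTokenizer(name: string): Tokenizer {
  const tokenizer = tokenizers[name];
  if (tokenizer === undefined) {
    throw new Error(`unknown tokenizer: ${name}`);
  }
  return tokenizer;
}

console.log(getTokenizer("char").encode("hi")); // [104, 105]
```

Since the registry key is a plain string, it survives msgpack/JSON serialization, unlike a tokenizer object holding functions.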
Secondly, it appears tokenization needs to be done on the fly (before or during training) rather than as a completely separate step. It could be done as a preprocessing step, like Hugo planned to do on his branch.
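As a rough illustration of tokenization as an on-the-fly preprocessing step (the whitespace tokenizer and `preprocess` generator below are toy stand-ins, not the actual implementation):

```typescript
// Toy whitespace tokenizer standing in for a real one; it builds a
// vocabulary on the fly and maps each word to an integer id.
const vocab = new Map<string, number>();

function tokenize(text: string): number[] {
  return text
    .split(/\s+/)
    .filter((w) => w.length > 0)
    .map((w) => {
      if (!vocab.has(w)) vocab.set(w, vocab.size);
      return vocab.get(w)!;
    });
}

// Tokenize lazily, sample by sample, instead of materializing the
// whole tokenized dataset before training starts.
function* preprocess(dataset: Iterable<string>): Generator<number[]> {
  for (const sample of dataset) {
    yield tokenize(sample);
  }
}

for (const ids of preprocess(["hello world", "hello disco"])) {
  console.log(ids); // [0, 1] then [0, 2]
}
```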
@peacefulotter yes I want to do both points you mentioned! The second one is already implemented, the tokenization is part of the preprocessing. |
I don't like
After a bit more thinking, I don't think it really matters for now, as long as #639 is not done.
Indeed, functions can't be represented by msgpack/JSON, so they get removed. It would be nice to have a serialization process to ensure that we get functions back from the network (a string such as
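The issue can be seen in a few lines: JSON (like msgpack) silently drops function-valued fields. One possible workaround, sketched here purely as an illustration (the `registry` and field names are hypothetical, not Disco's actual design), is to send a registered function name and look it up on the receiving side:

```typescript
// Function-valued fields silently disappear during JSON serialization,
// which is why the preprocessing function never survives the network.
const task = {
  name: "wikitext",
  preprocess: (s: string) => s.toLowerCase(),
};
console.log(JSON.stringify(task)); // {"name":"wikitext"}

// Hypothetical workaround: ship a registered function *name* (a plain
// string) and resolve it back to a function after deserialization.
const registry: Record<string, (s: string) => string> = {
  lowercase: (s) => s.toLowerCase(),
};
const wireTask = JSON.stringify({ name: "wikitext", preprocess: "lowercase" });
const received = JSON.parse(wireTask) as { name: string; preprocess: string };
const preprocess = registry[received.preprocess];
console.log(preprocess("HELLO")); // "hello"
```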
|
Transformers.js requires Node >= 18; I'm waiting on #653 to be merged.
Closes: #632
tharvik
left a comment
great work! a bit of nitpicking here and there but nothing really vital
discojs/discojs-core/src/dataset/data/preprocessing/text_preprocessing.ts
Co-authored-by: Valérian Rousset <tharvik@users.noreply.github.com>
tharvik
left a comment
I've created a new models/tokenizer.ts file
LGTM!
a few more comments (I'm a never-ending stream of criticism, please stop me)
Fixes #646
docs/examples/wikitext.ts