Add tokenization and prompting API to GPT models #651
Conversation
@tharvik @peacefulotter I'd be happy to hear your take on how to integrate tokenization and prompting into the Disco architecture.
Congrats on the LLM POC milestone! Firstly, I am wondering whether having a custom tokenizer is really relevant in the first place. Is there actually a use case for it in Disco? If so, instead of passing the tokenizer object, you could pass a string identifying the tokenizer to use; an object could then map tokenizer ids / names to the corresponding instances.
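The name-to-instance mapping could look something like this minimal TypeScript sketch (the `Tokenizer` interface, the `char` tokenizer, and `getTokenizer` are all hypothetical names for illustration, not Disco's actual API):

```typescript
interface Tokenizer {
  encode(text: string): number[];
}

// A trivial character-level tokenizer as a stand-in implementation.
const charTokenizer: Tokenizer = {
  encode: (text) => Array.from(text).map((c) => c.charCodeAt(0)),
};

// Registry mapping tokenizer names to instances, so a task only needs
// to carry a serializable string instead of a tokenizer object.
const tokenizers: Record<string, Tokenizer> = {
  char: charTokenizer,
};

function getTokenizer(name: string): Tokenizer {
  const tokenizer = tokenizers[name];
  if (tokenizer === undefined) {
    throw new Error(`unknown tokenizer: ${name}`);
  }
  return tokenizer;
}

console.log(getTokenizer("char").encode("hi")); // [104, 105]
```

Since the registry key is a plain string, it survives msgpack/JSON serialization, unlike a tokenizer object holding functions.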
Secondly, it appears tokenization needs to be done on the fly (before or during training) rather than as a completely separate step. It could be done as a preprocessing step, like Hugo planned to do on his branch.
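As a rough illustration of tokenization as an on-the-fly preprocessing step (the whitespace tokenizer and `preprocess` generator below are toy stand-ins, not the actual implementation):

```typescript
// Toy whitespace tokenizer standing in for a real one; it builds a
// vocabulary on the fly and maps each word to an integer id.
const vocab = new Map<string, number>();

function tokenize(text: string): number[] {
  return text
    .split(/\s+/)
    .filter((w) => w.length > 0)
    .map((w) => {
      if (!vocab.has(w)) vocab.set(w, vocab.size);
      return vocab.get(w)!;
    });
}

// Tokenize lazily, sample by sample, instead of materializing the
// whole tokenized dataset before training starts.
function* preprocess(dataset: Iterable<string>): Generator<number[]> {
  for (const sample of dataset) {
    yield tokenize(sample);
  }
}

for (const ids of preprocess(["hello world", "hello disco"])) {
  console.log(ids); // [0, 1] then [0, 2]
}
```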
@peacefulotter yes I want to do both points you mentioned! The second one is already implemented, the tokenization is part of the preprocessing. |
I don't like
After a bit more thinking, I don't think it really matters for now, as long as #639 is not done.
Indeed, functions can't be represented by msgpack/JSON, so they get removed. It would be nice to have a serialization process to ensure that we get functions back from the network (a string such as
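The issue can be seen in a few lines: JSON (like msgpack) silently drops function-valued fields. One possible workaround, sketched here purely as an illustration (the `registry` and field names are hypothetical, not Disco's actual design), is to send a registered function name and look it up on the receiving side:

```typescript
// Function-valued fields silently disappear during JSON serialization,
// which is why the preprocessing function never survives the network.
const task = {
  name: "wikitext",
  preprocess: (s: string) => s.toLowerCase(),
};
console.log(JSON.stringify(task)); // {"name":"wikitext"}

// Hypothetical workaround: ship a registered function *name* (a plain
// string) and resolve it back to a function after deserialization.
const registry: Record<string, (s: string) => string> = {
  lowercase: (s) => s.toLowerCase(),
};
const wireTask = JSON.stringify({ name: "wikitext", preprocess: "lowercase" });
const received = JSON.parse(wireTask) as { name: string; preprocess: string };
const preprocess = registry[received.preprocess];
console.log(preprocess("HELLO")); // "hello"
```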
|
Transformers.js requires Node >= 18; I'm waiting on #653 to be merged.
Closes: #632
tharvik
left a comment
great work! a bit of nitpicking here and there but nothing really vital
discojs/discojs-core/src/dataset/data/preprocessing/text_preprocessing.ts
Co-authored-by: Valérian Rousset <tharvik@users.noreply.github.com>
tharvik
left a comment
I've created a new models/tokenizer.ts file
LGTM!
a few more comments (I'm a never-ending stream of criticism, please stop me)
Fixes #646
docs/examples/wikitext.ts