Conversation

@ivan-gorin (Contributor)

Fixes #724 (my issue).
Changed the way the vocabulary is read and converted to ggml, using the same code as in the original tiktoken library.
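
For reference, this is roughly how the .tiktoken format is read: one base64-encoded token and its rank per line, mirroring tiktoken's load_tiktoken_bpe(). A minimal sketch, with an illustrative function name and file path:

```python
import base64

def load_tiktoken_vocab(path):
    """Parse a .tiktoken file: one 'base64(token_bytes) rank' pair per line."""
    ranks = {}
    with open(path, "rb") as f:
        for line in f:
            if not line.strip():
                continue
            token_b64, rank = line.split()
            ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks

# Invert to map token id -> raw byte string before writing the ggml vocab
ranks = load_tiktoken_vocab("gpt2.tiktoken")
vocab = {rank: tok for tok, rank in ranks.items()}
```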

The resulting ggml files are the same as the ones downloaded from Hugging Face. The multilingual models are byte-for-byte identical; the English-only models differ by 17 bytes, due to a difference in Whisper's vocab files.

-rw-rw-r--  1 ivan ivan 77691713 Mar 22 11:09 ggml-tiny.bin
-rw-rw-r--  1 ivan ivan 77691713 Apr  6 06:13 ggml-tiny.bin.new
-rw-rw-r--  1 ivan ivan 77704715 Mar 22 11:08 ggml-tiny.en.bin
-rw-rw-r--  1 ivan ivan 77704698 Apr  6 06:14 ggml-tiny.en.bin.new

In the old vocab.json there are 50257 tokens, the last one being <|endoftext|> with index 50256. In the new gpt2.tiktoken there are only 50256 tokens; <|endoftext|> is removed. The 4 bytes for the int storing the string length plus the 13 bytes for the string itself account for the 17-byte difference. It doesn't seem to affect the model output anyway; I'm not sure why this token was in vocab.json previously, since it is probably a special token.
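
A quick sanity check of the arithmetic, assuming each vocab entry is serialized as a 4-byte int length prefix followed by the raw token bytes (as described above):

```python
import struct

token = b"<|endoftext|>"
# 4-byte length prefix + raw bytes, matching the described ggml vocab entry layout
entry = struct.pack("i", len(token)) + token
assert len(token) == 13
assert len(entry) == 17 == 77704715 - 77704698
```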

I have only tested with the tiny models, but since the only change is in the tokenizer, it should work for all the other models as well.

@ggerganov (Member)

Tested with medium and it produces the same model as before.
Thank you very much!
