Conversation

@ivan-gorin (Contributor)

Fixes #724 (my issue).
Changed the way the vocabulary is read and converted to ggml, using the same code as in the original tiktoken library.
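
For reference, this is roughly how the .tiktoken format is read: one base64-encoded token and its rank per line, mirroring tiktoken's load_tiktoken_bpe(). A minimal sketch, with an illustrative function name and file path:

```python
import base64

def load_tiktoken_vocab(path):
    """Parse a .tiktoken file: one 'base64(token_bytes) rank' pair per line."""
    ranks = {}
    with open(path, "rb") as f:
        for line in f:
            if not line.strip():
                continue
            token_b64, rank = line.split()
            ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks

# Invert to map token id -> raw byte string before writing the ggml vocab
ranks = load_tiktoken_vocab("gpt2.tiktoken")
vocab = {rank: tok for tok, rank in ranks.items()}
```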

The resulting ggml files are the same as the ones downloaded from Hugging Face. The multilingual models are byte-for-byte identical; the English-only models differ by 17 bytes, due to a difference in Whisper's vocab files.

-rw-rw-r--  1 ivan ivan 77691713 Mar 22 11:09 ggml-tiny.bin
-rw-rw-r--  1 ivan ivan 77691713 Apr  6 06:13 ggml-tiny.bin.new
-rw-rw-r--  1 ivan ivan 77704715 Mar 22 11:08 ggml-tiny.en.bin
-rw-rw-r--  1 ivan ivan 77704698 Apr  6 06:14 ggml-tiny.en.bin.new

In the old vocab.json there are 50257 tokens, the last one being <|endoftext|> with index 50256. In the new gpt2.tiktoken there are only 50256 tokens; <|endoftext|> is removed. The 4 bytes for the int storing the string length plus the 13 bytes for the string itself account for the 17-byte difference. It doesn't seem to affect the model output anyway; I'm not sure why this token was in vocab.json previously, since it is probably a special token.
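
A quick sanity check of the arithmetic, assuming each vocab entry is serialized as a 4-byte int length prefix followed by the raw token bytes (as described above):

```python
import struct

token = b"<|endoftext|>"
# 4-byte length prefix + raw bytes, matching the described ggml vocab entry layout
entry = struct.pack("i", len(token)) + token
assert len(token) == 13
assert len(entry) == 17 == 77704715 - 77704698
```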

I have only tested with the tiny models, but since the only change is in the tokenizer, it should work for all the other models as well.

@ggerganov (Member)

Tested with medium and it produces the same model as before.
Thank you very much!
