Conversation

@tonyfettes
Contributor

@tonyfettes tonyfettes commented May 10, 2024

For llama-3, I found an inconsistency between llama.cpp's tokenizer and Huggingface's tokenizers. Example:

 Việt

llama.cpp:

 11655 -> ' Vi'
 26298 -> 'ệ'
    83 -> 't'

Huggingface's tokenizers with tokenizer.json from llama-3:

101798

After comparing the implementations, it seems that Huggingface's tokenizers first looks up each split word in the vocabulary and, if found, pushes the corresponding token directly to the result; only if the lookup fails does it merge the word at the byte level. In llama.cpp we always do the byte-level merge, hence the inconsistency.

This is a simple fix to the problem: just look the word up in the vocabulary before doing the merges.
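As a minimal sketch of the idea (illustrative names only; the actual change is shown in the review comment below):

```cpp
// Sketch of the lookup-before-merge logic, assuming a map-based vocab.
// `byte_pair_merge` stands in for the existing byte-level merge path.
#include <string>
#include <unordered_map>
#include <vector>

std::vector<int> byte_pair_merge(const std::string & word,
                                 const std::unordered_map<std::string, int> & vocab); // existing path, omitted

std::vector<int> tokenize_word(const std::string & word,
                               const std::unordered_map<std::string, int> & vocab,
                               bool ignore_merges) {
    if (ignore_merges) {
        // If the whole word is already a single token, emit it directly ...
        auto it = vocab.find(word);
        if (it != vocab.end()) {
            return { it->second };
        }
    }
    // ... otherwise fall back to the usual byte-level BPE merges.
    return byte_pair_merge(word, vocab);
}
```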

PS: I checked tiktoken, and it seems to do the same thing at src/lib.rs:228 in CoreBPE::_encode_native.

PPS: I searched the tokenizer.json of all BPE models listed below (some are license-walled, so I checked their variants), and it seems that llama-3 is the only one doing this?

| Model | tokenizer.json |
| --- | --- |
| DBRX (Walled) | https://huggingface.co/turboderp/dbrx-instruct-exl2/tree/2.3bpw |
| Deepseek LLM | https://huggingface.co/deepseek-ai/deepseek-llm-67b-chat/raw/main/tokenizer.json |
| Deepseek Coder | https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct/raw/main/tokenizer.json |
| Falcon | https://huggingface.co/tiiuae/falcon-7b/raw/main/tokenizer.json |
| Starcoder | https://huggingface.co/bigcode/starcoder/raw/main/tokenizer.json |
| Refact | https://huggingface.co/smallcloudai/Refact-1_6B-fim/blob/main/tokenizer.json |
| Command R+ | https://huggingface.co/CohereForAI/c4ai-command-r-plus/blob/main/tokenizer.json |
| GPT2 | https://huggingface.co/openai-community/gpt2/raw/main/tokenizer.json |
| OLMo | https://huggingface.co/allenai/OLMo-7B-Instruct/raw/main/tokenizer.json |
| Qwen2 (Qwen1.5) | https://huggingface.co/Qwen/Qwen1.5-110B-Chat/raw/main/tokenizer.json |

@tonyfettes tonyfettes marked this pull request as draft May 10, 2024 08:05
@tonyfettes tonyfettes changed the title from "Llama3 tokenizer ignore merge" to "fix : lookup word in vocab before doing BPE merges" May 10, 2024
@tonyfettes tonyfettes marked this pull request as ready for review May 10, 2024 08:48
@mofosyne mofosyne added the "Review Complexity : Medium" and "bugfix" labels May 10, 2024
@mofosyne mofosyne requested a review from goerch May 10, 2024 10:27
@tonyfettes tonyfettes force-pushed the llama3-tokenizer-ignore-merge branch from 4ba2e5c to 63207d1 on May 10, 2024 13:02
@ggerganov
Member

This change not only fixed the llama3 tokenization, but it also improved the performance by a factor of 4:

./tests/test-tokenizer-0.sh llama-bpe ./build/wikitext-2-raw/wiki.train.raw
  • master
Testing llama-bpe on ./build/wikitext-2-raw/wiki.train.raw ...
main : tokenized in 3141.467 ms (py)
main : tokenized in 6085.319 ms (cpp)
1842692c1842692,1842694
< 101798
---
> 11655
> 26298
> 83
Tokenization differs!
  • PR
Testing llama-bpe on ./build/wikitext-2-raw/wiki.train.raw ...
main : tokenized in 3157.516 ms (py)
main : tokenized in 1408.991 ms (cpp)
Tokenization is correct!

We now tokenize wiki.train.raw 2x faster than Python AutoTokenizer

@ggerganov
Member

> PPS: I searched the tokenizer.json of all BPE models (some are license-walled, so I checked their variants), and it seems that llama-3 is the only one doing this?

Which parameter in the tokenizer config determines this behaviour?

@tonyfettes
Contributor Author

@ggerganov It is "ignore_merges", under "model".
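For reference, the corresponding part of llama-3's tokenizer.json looks roughly like this (abridged excerpt; other fields omitted):

```json
"model": {
    "type": "BPE",
    "ignore_merges": true,
    "vocab": { ... },
    "merges": [ ... ]
}
```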

Member

@ggerganov ggerganov left a comment


Let's merge after the CI is green

@tonyfettes tonyfettes force-pushed the llama3-tokenizer-ignore-merge branch from 0f48f9e to 0c9a0ae on May 11, 2024 01:52
llama.cpp Outdated
Comment on lines 12302 to 12315
if (ignore_merges && vocab.token_to_id.find(word) != vocab.token_to_id.end()) {
    // The whole word is already a single token in the vocabulary:
    // emit it directly and skip the byte-level BPE merges below.
    llm_symbol sym;
    sym.text = word.c_str();
    sym.n    = word.size();
    sym.prev = final_prev_index;
    sym.next = -1;
    if (final_prev_index != -1) {
        symbols_final[final_prev_index].next = symbols_final.size();
    }
    symbols_final.emplace_back(sym);
    final_prev_index = symbols_final.size() - 1;
    continue;
}


Let's apply @jaime-m-p's suggestion here, to reduce the code duplication in this loop:

#6965 (comment)
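The linked suggestion isn't quoted in this thread; purely as an illustration of the deduplication idea, a small helper could absorb the repeated symbol bookkeeping (hypothetical sketch, not the code that was merged):

```cpp
// Hypothetical refactor: one lambda owns the symbols_final bookkeeping,
// so both the ignore_merges shortcut and the post-merge loop can share it.
auto add_symbol = [&](const char * text, size_t n) {
    llm_symbol sym;
    sym.text = text;
    sym.n    = n;
    sym.prev = final_prev_index;
    sym.next = -1;
    if (final_prev_index != -1) {
        symbols_final[final_prev_index].next = (int) symbols_final.size();
    }
    symbols_final.emplace_back(sym);
    final_prev_index = (int) symbols_final.size() - 1;
};

// The ignore_merges shortcut would then reduce to:
//     if (ignore_merges && vocab.token_to_id.count(word)) {
//         add_symbol(word.c_str(), word.size());
//         continue;
//     }
```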

@ggerganov ggerganov merged commit f99e1e4 into ggml-org:master May 11, 2024
@github-actions
Contributor

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 551 iterations 🚀

Details (performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8538.14ms p(95)=21008.61ms fails=, finish reason: stop=490 truncated=61
  • Prompt processing (pp): avg=104.75tk/s p(95)=461.86tk/s
  • Token generation (tg): avg=34.2tk/s p(95)=49.2tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=llama3-tokenizer-ignore-merge commit=b8d3cd5337bfa74f816138af84e7181c5208f717

[Charts omitted: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 551 iterations — time series for llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing.]
