@ejones ejones commented Aug 8, 2023

As discussed in #2501, the grammar implementation incorrectly assumes that tokens contain complete UTF-8 sequences. In reality, a Unicode character may span (the bytes of) several tokens. This patch fixes the issue using the suggestion from @ai-and-i: track partial UTF-8 sequences and match against the range of Unicode code points a partial sequence could still represent.
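
To make the idea concrete, here is a minimal C++ sketch (illustrative only, not the code in this patch; the names PartialUtf8 and partial_utf8_range are invented): once the lead byte of a multi-byte sequence has been seen, each missing continuation byte contributes 6 undetermined bits, so a partial sequence pins the final code point to a contiguous range.

#include <cstdint>
#include <utility>

struct PartialUtf8 {
    uint32_t value;    // code point bits accumulated so far
    int      n_remain; // continuation bytes still missing
};

// The known bits are fixed; the 6*n_remain missing bits are free, so the
// final code point must lie in a contiguous range.
static std::pair<uint32_t, uint32_t> partial_utf8_range(const PartialUtf8 & p) {
    const uint32_t low  = p.value << (p.n_remain * 6);
    const uint32_t high = low | ((1u << (p.n_remain * 6)) - 1);
    return { low, high };
}

For example, after only the lead byte 0xF0 of a four-byte sequence, value is 0 and n_remain is 3, giving the range U+0000-U+3FFFF (ignoring overlong-encoding limits), which overlaps the emoji range [U+1F600-U+1F64F] exercised in the test below.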

Changes

  • updated decode_utf8 to return partial and invalid UTF-8 sequences, and to resume decoding from a partial UTF-8 sequence
  • added the partial UTF-8 sequence to the grammar state and to sampling candidates
  • updated grammar candidate selection to check, in the case of a partial UTF-8 sequence, for overlap between the grammar's char range and the code points the partial sequence could complete to (see the sketch after this list)
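
Continuing the sketch above (again illustrative, not the patch's actual code; partial_matches_range is an invented name), the candidate-selection check reduces to a standard interval-overlap test:

// Keep a candidate token whose bytes end mid-character iff its possible
// code points intersect a [X-Y] character range from the grammar.
static bool partial_matches_range(const PartialUtf8 & p,
                                  uint32_t grammar_lo, uint32_t grammar_hi) {
    const auto [low, high] = partial_utf8_range(p);  // C++17 structured binding
    return low <= grammar_hi && grammar_lo <= high;  // interval overlap test
}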

Testing

Emojis (from #2501)
% ./main -m $LLAMA_30B_Q4_0 -n 32 --grammar 'root ::= [😀-🙏]+'             
main: build = 969 (bedce3c)
main: seed  = 1691499471
llama.cpp: loading model from /Users/evan/llama-models/30B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_head_kv  = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.16 MB
llama_model_load_internal: mem required  = 17452.69 MB (+  780.00 MB per state)
llama_new_context_with_model: kv self size  =  780.00 MB
llama_new_context_with_model: compute buffer total size =   97.35 MB

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


main: grammar:
root ::= root_1 
root_1 ::= [<U+1F600>-<U+1F64F>] root_1 | [<U+1F600>-<U+1F64F>] 

 😊😄🙃🙅🙆🙇🙈🙉
llama_print_timings:        load time =   700.01 ms
llama_print_timings:      sample time =    73.92 ms /    32 runs   (    2.31 ms per token,   432.91 tokens per second)
llama_print_timings: prompt eval time =   270.68 ms /     2 tokens (  135.34 ms per token,     7.39 tokens per second)
llama_print_timings:        eval time =  4851.84 ms /    31 runs   (  156.51 ms per token,     6.39 tokens per second)
llama_print_timings:       total time =  5211.93 ms

Student schema (Jsonformer)
% ./main -m $LLAMA_13B_Q4_0 --grammar "$( python3 examples/json-schema-to-grammar.py ../schemas/student.json --prop-order 'is_student,name,age' )" -p 'Hermione Granger '    
main: build = 969 (bedce3c)
main: seed  = 1691499724
llama.cpp: loading model from /Users/evan/llama-models/13B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_head_kv  = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.11 MB
llama_model_load_internal: mem required  = 6983.72 MB (+  400.00 MB per state)
llama_new_context_with_model: kv self size  =  400.00 MB
llama_new_context_with_model: compute buffer total size =   75.35 MB

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


main: grammar:
space ::= space_1 
space_1 ::= [ ] | 
boolean ::= boolean_3 space 
boolean_3 ::= [t] [r] [u] [e] | [f] [a] [l] [s] [e] 
string ::= ["] string_7 ["] space 
string_5 ::= [^"\] | [\] string_6 
string_6 ::= ["\/bfnrt] | [u] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] 
string_7 ::= string_5 string_7 | 
number ::= number_9 number_15 number_19 space 
number_9 ::= number_10 number_11 
number_10 ::= [-] | 
number_11 ::= [0-9] | [1-9] number_12 
number_12 ::= [0-9] number_12 | 
number_13 ::= [.] number_14 
number_14 ::= [0-9] number_14 | [0-9] 
number_15 ::= number_13 | 
number_16 ::= [eE] number_17 number_18 
number_17 ::= [-+] | 
number_18 ::= [0-9] number_18 | [0-9] 
number_19 ::= number_16 | 
courses ::= [[] space courses_24 []] space 
courses_21 ::= string courses_23 
courses_22 ::= [,] space string 
courses_23 ::= courses_22 courses_23 | 
courses_24 ::= courses_21 | 
root ::= [{] space ["] [i] [s] [_] [s] [t] [u] [d] [e] [n] [t] ["] space [:] space boolean [,] space ["] [n] [a] [m] [e] ["] space [:] space string [,] space ["] [a] [g] [e] ["] space [:] space number [,] space ["] [c] [o] [u] [r] [s] [e] [s] ["] space [:] space courses [}] space 

 Hermione Granger {"is_student":true,"name":"Hermione Granger","age":15,"courses":["Transfiguration","Charms","Defence Against the Dark Arts","Herbology","Potions","History of Magic","Arithmancy","Divination","Care of Magical Creatures","Astronomy","Muggle Studies","Ancient Runes"]} [end of text]

llama_print_timings:        load time =   357.06 ms
llama_print_timings:      sample time =   386.62 ms /    86 runs   (    4.50 ms per token,   222.44 tokens per second)
llama_print_timings: prompt eval time =   421.07 ms /     6 tokens (   70.18 ms per token,    14.25 tokens per second)
llama_print_timings:        eval time =  5607.78 ms /    85 runs   (   65.97 ms per token,    15.16 tokens per second)
llama_print_timings:       total time =  6466.07 ms

CJK
% ./main -m $LLAMA_30B_Q4_0  -p $'Creating a website in 5 steps (Chinese):\n\n' --grammar-file grammars/japanese.gbnf -n 32                                              
main: build = 969 (bedce3c)
main: seed  = 1691499821
llama.cpp: loading model from /Users/evan/llama-models/30B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_head_kv  = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.16 MB
llama_model_load_internal: mem required  = 17452.69 MB (+  780.00 MB per state)
llama_new_context_with_model: kv self size  =  780.00 MB
llama_new_context_with_model: compute buffer total size =   97.35 MB

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


main: grammar:
root ::= root_2 root_5 
jp-char ::= hiragana | katakana | punctuation | cjk 
root_2 ::= jp-char root_2 | jp-char 
root_3 ::= [ <U+0009><U+000A>] root_4 
root_4 ::= jp-char root_4 | jp-char 
root_5 ::= root_3 root_5 | 
hiragana ::= [<U+3041>-<U+309F>] 
katakana ::= [<U+30A1>-<U+30FF>] 
punctuation ::= [<U+3001>-<U+303E>] 
cjk ::= [<U+4E00>-<U+9FFF>] 

 Creating a website in 5 steps (Chinese):

從電腦開始輕鬆建立網站組件。
使用不同的
llama_print_timings:        load time =   500.31 ms
llama_print_timings:      sample time =   133.83 ms /    32 runs   (    4.18 ms per token,   239.11 tokens per second)
llama_print_timings: prompt eval time =  1669.25 ms /    14 tokens (  119.23 ms per token,     8.39 tokens per second)
llama_print_timings:        eval time =  4848.56 ms /    31 runs   (  156.41 ms per token,     6.39 tokens per second)
llama_print_timings:       total time =  6668.48 ms

@ejones ejones marked this pull request as ready for review August 14, 2023 08:59
@ejones ejones requested a review from SlyEcho August 16, 2023 00:29

@SlyEcho SlyEcho left a comment

Seems to work!

./main -m $MODEL -ngl 99 --grammar 'root ::= [🐀-🐿]+' -n 32 -p Rodents:

output:

 Rodents:🐀🐿🐁🐁🐁🐦🐤🐜

@ggerganov ggerganov mentioned this pull request Aug 17, 2023