Conversation

@KerfuffleV2
Contributor

@KerfuffleV2 KerfuffleV2 commented Aug 20, 2023

Currently in a pretty reasonable state. Testing/feedback would be appreciated.

The converted file was tested to confirm it parses these prompts into the same tokens as pre-GGUF llama.cpp:

  1. 你喜欢小狗吗? ("Do you like puppies?")
  2. Once upon a time, in a dark forest, there lived a little fox

I also tested these models with the second prompt:

  1. Random LLaMA1 7B
  2. openorca-platypus2-13b.ggmlv3.q5_K_M.bin
  3. gplatty-30b-superhot-8k.ggmlv3.q4_K_M.bin
  4. platypus2-70b-instruct.ggmlv3.q4_K_M.bin

Identical generation compared to loading the actual GGML file with pre-GGUF llama.cpp when specifying a seed.

Note: When testing, be sure to specify --eps and --gqa as appropriate. You'll probably also want to specify --context-length (it defaults to 2048).
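For example, converting a hypothetical LLaMA2 70B GGML file might look like this (an illustrative invocation, not from the pull itself; the flag names are the ones mentioned above and used later in this thread, and the values depend on the model):

python convert-llama-ggmlv3-to-gguf.py --in llama2-70b.ggmlv3.q4_K_M.bin --out llama2-70b.q4_K_M.gguf --eps 1e-5 --gqa 8 --context-length 4096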

Edit: It's now possible to use HF or "original" format metadata, such as the vocab, when converting. Some information about this and the current state of the pull: #2682 (comment)

Some perplexity results here: #2682 (comment)

@KerfuffleV2 KerfuffleV2 mentioned this pull request Aug 20, 2023
@Dampfinchen

Nice. With the breaking change coming, such a script is crucial so many people can keep using their models!

@KerfuffleV2 KerfuffleV2 marked this pull request as ready for review August 20, 2023 17:07
@KerfuffleV2
Contributor Author

@TheBloke Would something like this actually be useful for you? It still requires rewriting the whole file, but should be a lot faster than converting from HF or .pth format and then quantizing.

One current deficiency is that it has to try to revert the vocabulary mangling that the initial conversion to GGML performed, and it doesn't seem possible to do this 100% correctly. However, I could potentially add a way to load the vocab from the original model metadata (tokenizer.model, tokenizer_config.json, config.json) and use that rather than the vocab in the GGML file.

@klosax
Contributor

klosax commented Aug 20, 2023

You should add a warning that models converted without a new copy of the needed vocab parts may not be fully functional in the future. There is ongoing work being done on the tokenizer in llama.cpp, and there could be issues later on without the additional data.

@klosax
Contributor

klosax commented Aug 20, 2023

If you load the vocab from the original model during conversion, you could compare the model with a real gguf model using sha256sum to verify that your conversion script works.
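For instance, something along these lines would do (a hypothetical check; the filenames are placeholders for a GGUF converted from the HF model and one converted from the GGML file):

sha256sum llama-7b.from-hf.gguf llama-7b.from-ggml.gguf

If the two conversions are byte-identical, the hashes will match.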

@KerfuffleV2
Contributor Author

You should add a warning that models converted without a new copy of the needed vocab parts may not be fully functional in the future.

There's already a pretty big warning every time it runs:

=== WARNING === Be aware that this conversion script is best-effort. Use a native GGUF model if possible. === WARNING ===

Is that not enough?

I'm also not sure what you mean about "needed vocab parts". I don't think parts are necessarily missing; it's just that special meta information, like which tokens are "unknown" or whatever, isn't really possible to recover.

If you load the vocab from the original model during conversion, you could compare the model with a real gguf model using sha256sum to verify that your conversion script works.

Maybe that approach could work for just the vocab part (but there's probably an easier way).

I've been looking at the existing conversion scripts and it's not exactly clear what the correct approach here is. For example, convert.py doesn't even add token types, while convert-llama-hf-to-gguf.py does. It's also kind of annoying that the latter has everything at the top level, so you can't import and reuse its functions.

Anyway, assuming I did implement loading from tokenizer.json or whatever to build a new copy of the vocab rather than using what was in the GGML file, I'd just reuse existing code, so the vocab section at least would be exactly the same as in the other conversion scripts. There's no reason to write a custom version of that.

@klosax
Contributor

klosax commented Aug 20, 2023

There's already a pretty big warning every time it runs:

Ok good.

I've been looking at the existing conversion scripts and it's not exactly clear what the correct approach here is. For example, convert.py doesn't even add token types, while convert-llama-hf-to-gguf.py does.

To test things out, we made the simpler convert-llama-hf-to-gguf.py scripts first, since convert.py is so complex. The latter is not fully finished yet; work is being done on it in #2668.

Besides a full copy of the vocab and scores, we should have token types and the special token mapping for eos/bos, etc.

@KerfuffleV2
Contributor Author

Besides a full copy of the vocab and scores, we should have token types and the special token mapping for eos/bos, etc.

Seems reasonable, and once convert.py is updated I can just use that to load the vocab, so my GGUF output (for the vocab at least) should be exactly the same as the official conversion.

I don't want to go too crazy with the amount of work I put into this when so far there's no indication that it's a candidate to get merged. Also, converting from GGML while also requiring the HF metadata seems like it would be kind of a niche use, and no one's said "I'd actually use this feature!" yet.

@Green-Sky
Collaborator

seems like it would be kind of a niche use

not everyone has endless high speed internet access :)

@KerfuffleV2
Contributor Author

@Green-Sky

not everyone has endless high speed internet access :)

So you're saying you need and would use this feature? If so, I'll look into adding it.

Probably need to wait until the convert.py vocab stuff stabilizes; hopefully that will happen at a point that gives me enough time to update this.

@klosax
Contributor

klosax commented Aug 20, 2023

You should add a parameter to set the model name, and maybe default it to the filename (gguf_writer.add_name()).

@TheBloke
Contributor

@KerfuffleV2 Thanks very much for working on this!

If it's confirmed that this will produce 100% identical results to a new convert.py conversion to GGUF, then yes, I would definitely use it and save ~10+ TB of data download :)

And if that is achieved, I think it absolutely should be merged. There are many users out there on slow or metered internet connections who aren't going to relish re-downloading all their favourite GGML models.

Great work!

@KerfuffleV2
Contributor Author

If it's confirmed that this will produce 100% identical results to a new convert.py conversion to GGUF

I can understand why you'd say that but a guarantee like that isn't really possible. I'll do the best I can to ensure correct output, but I can't promise it's 100%. No one can even promise convert.py is going to always be correct either.

Even in cases where it's correct, output from a GGML conversion probably won't be exactly the same as what convert.py would output because the order of tensors in the GGML file may differ and that kind of thing. That's not something that should actually cause a noticeable difference, but it still wouldn't be exactly the same file.

@KerfuffleV2
Contributor Author

I added the capability to override the hyperparameters and vocab from the HF or PyTorch metadata. Note that it leverages the existing convert.py script and right now that doesn't fully handle vocabulary types (it might actually be worse than converting from GGML).

For testing this, you can try copying in convert.py from #2668 - my conversion script should be able to handle either version of convert.py (assuming that pull doesn't add new changes that are incompatible).

@netrunnereve
Collaborator

netrunnereve commented Aug 20, 2023

not everyone has endless high speed internet access :)

And if that is achieved, I think it absolutely should be merged. There are many users out there on slow or metered internet connections who aren't going to relish re-downloading all their favourite GGML models.

I'll definitely be using this conversion script as I have pretty slow internet myself. I've also introduced LLaMA to less affluent folks with <10 Mbps connections who are having a great time running 7B on 8GB computers.

Now, I'm aware that a lot of the devs here have gigabit fiber and modern workstations, but there are many other people out there as well 😉

@klosax
Contributor

klosax commented Aug 20, 2023

I guess you don't have to bother with vocab types other than the LLaMA spm. The Chinese Aquila is the only LLaMA model using the gpt2 bpe tokenizer that I know of. But if you are going to include that, you will also need to include the merges from the original model; see the gptneox / falcon conversion examples in the gguf branch.

@KerfuffleV2
Contributor Author

KerfuffleV2 commented Aug 20, 2023

I'll definitely be using this conversion script as I have pretty slow internet myself.

Please test it out and provide feedback if you're able to!

The Chinese Aquila is the only LLaMA model using the gpt2 bpe tokenizer that I know of. But if you are going to include that, you will also need to include the merges from the original model; see the gptneox / falcon conversion examples in the gguf branch.

I'm just using convert.py's vocabulary loader, and there's a command-line option to specify the vocab type, same as with convert.py. So it should just work, I think (as long as convert.py works).

Basically, when using --model-metadata-dir, the vocab and general parameters/hyperparameters should be the same as what convert.py produces. The only thing coming from the GGML model at that point is the tensors.

@netrunnereve
Collaborator

I think that the most scientific way of testing this would be to run perplexity on the real GGUF models versus the converted models on the same text. If the script works correctly, the numbers should be practically identical.

@KerfuffleV2
Contributor Author

KerfuffleV2 commented Aug 21, 2023

I think that the most scientific way of testing this would be to run perplexity on the real GGUF models versus the converted models on the same text.

Surely the benchmark would be to run it against the original GGML, right? It might be possible to do better than the original GGML by using the new GGUF stuff to create the vocab and parameters, but if output from the conversion script is at least as good as the original GGML file, I think I'd count that as a success.

Running perplexity is really slow for me. I ran 100 blocks for a 7B LLaMA1 Q5_K model, comparing the original GGML file and the GGUF converted from it, and the results are just about identical. However, I'm not sure this is really that definitive, for two reasons: 1) there are more likely to be issues with non-English tokens like Chinese, emojis, etc., and those might just not have been in the first 100 chunks of wikitext, and 2) it's probably models that have special tokens that will be affected by the vocab conversion stuff, and those tokens aren't likely to be in the wikitext data.

I'll try to add some more (unfortunately abbreviated) perplexity results if I get a chance.
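For reference, runs like the ones below can be reproduced with something along these lines (using llama.cpp's perplexity example; --chunks limits how many blocks are evaluated, and the model/file paths are placeholders):

./perplexity -m llama1-7b.q5_K_M.gguf -f wiki.test.raw --chunks 100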


LLaMA1 7B Q5_K

Original GGML

[1]4.2319,[2]4.7100,[3]5.5832,[4]6.1975,[5]6.3267,[6]6.2936,[7]6.4933,[8]6.5816,[9]6.9006,[10]7.1455,[11]7.3400,[12]7.3664,[13]7.2751,[14]7.3210,[15]7.5625,[16]7.1909,[17]7.0829,[18]7.0293,[19]6.6809,[20]6.6681,[21]6.5676,[22]6.3924,[23]6.3512,[24]6.2579,[25]6.2538,[26]6.0920,[27]5.9206,[28]5.8173,[29]5.7298,[30]5.5769,[31]5.5465,[32]5.5673,[33]5.5154,[34]5.5459,[35]5.5666,[36]5.6013,[37]5.5990,[38]5.6092,[39]5.6407,[40]5.6894,[41]5.6974,[42]5.7345,[43]5.6978,[44]5.7535,[45]5.7559,[46]5.7267,[47]5.7496,[48]5.7258,[49]5.7257,[50]5.6853,[51]5.6816,[52]5.6726,[53]5.7190,[54]5.7032,[55]5.6835,[56]5.7115,[57]5.7313,[58]5.7501,[59]5.7679,[60]5.8070,[61]5.7988,[62]5.8559,[63]5.8853,[64]5.8979,[65]5.9400,[66]5.9492,[67]5.9659,[68]5.9810,[69]6.0034,[70]6.0320,[71]6.0533,[72]6.0853,[73]6.1411,[74]6.1457,[75]6.1605,[76]6.1732,[77]6.1841,[78]6.1694,[79]6.1967,[80]6.1911,[81]6.2031,[82]6.2086,[83]6.1598,[84]6.1445,[85]6.1330,[86]6.1121,[87]6.0524,[88]6.0291,[89]6.0106,[90]5.9959,[91]6.0190,[92]6.0135,[93]6.0116,[94]6.0081,[95]6.0361,[96]6.0363,[97]6.0303,[98]6.0259,[99]6.0128,[100]6.0118

GGML->GGUF (no external metadata)

[1]4.2321,[2]4.7101,[3]5.5833,[4]6.1976,[5]6.3268,[6]6.2937,[7]6.4933,[8]6.5816,[9]6.9007,[10]7.1455,[11]7.3400,[12]7.3664,[13]7.2752,[14]7.3210,[15]7.5625,[16]7.1909,[17]7.0830,[18]7.0293,[19]6.6809,[20]6.6682,[21]6.5676,[22]6.3924,[23]6.3512,[24]6.2579,[25]6.2538,[26]6.0920,[27]5.9206,[28]5.8173,[29]5.7298,[30]5.5769,[31]5.5465,[32]5.5673,[33]5.5154,[34]5.5459,[35]5.5666,[36]5.6013,[37]5.5990,[38]5.6092,[39]5.6408,[40]5.6894,[41]5.6975,[42]5.7345,[43]5.6978,[44]5.7535,[45]5.7559,[46]5.7267,[47]5.7496,[48]5.7258,[49]5.7257,[50]5.6853,[51]5.6816,[52]5.6726,[53]5.7190,[54]5.7032,[55]5.6835,[56]5.7116,[57]5.7313,[58]5.7501,[59]5.7679,[60]5.8070,[61]5.7988,[62]5.8559,[63]5.8853,[64]5.8979,[65]5.9401,[66]5.9492,[67]5.9659,[68]5.9810,[69]6.0034,[70]6.0320,[71]6.0533,[72]6.0853,[73]6.1411,[74]6.1457,[75]6.1605,[76]6.1732,[77]6.1841,[78]6.1694,[79]6.1967,[80]6.1911,[81]6.2031,[82]6.2086,[83]6.1598,[84]6.1445,[85]6.1330,[86]6.1121,[87]6.0524,[88]6.0291,[89]6.0106,[90]5.9959,[91]6.0190,[92]6.0135,[93]6.0116,[94]6.0081,[95]6.0361,[96]6.0363,[97]6.0303,[98]6.0259,[99]6.0128,[100]6.0118


openorca-platypus 13B Q5_K

I don't really understand why the non-metadata converted version would have lower perplexity results here, but only 10 blocks is probably not enough to draw a conclusion. Apparently I can't run LLaMA2 13B with CLBlast, and CPU is very slow.

Also, even though the metadata and non-metadata results look the same, there are actual differences. Loading the one converted without external metadata:

llama_model_load_internal: BOS token = 1 ''
llama_model_load_internal: EOS token = 2 ''

Loading the one converted with external metadata:

llama_model_load_internal: BOS token = 1 '<s>'
llama_model_load_internal: EOS token = 2 '</s>'

Original GGML

[1]4.2607,[2]4.7298,[3]5.4177,[4]6.1269,[5]6.3246,[6]6.2631,[7]6.4312,[8]6.4886,[9]6.8180,[10]7.0552

GGML->GGUF (no external metadata)

[1]4.2473,[2]4.7223,[3]5.4120,[4]6.1220,[5]6.3206,[6]6.2598,[7]6.4283,[8]6.4861,[9]6.8156,[10]7.0530

GGML->GGUF (with external metadata)

[1]4.2473,[2]4.7223,[3]5.4120,[4]6.1220,[5]6.3206,[6]6.2598,[7]6.4283,[8]6.4861,[9]6.8156,[10]7.0530

@TheBloke
Contributor

TheBloke commented Aug 21, 2023

If it's confirmed that this will produce 100% identical results to a new convert.py conversion to GGUF

I can understand why you'd say that but a guarantee like that isn't really possible. I'll do the best I can to ensure correct output, but I can't promise it's 100%. No one can even promise convert.py is going to always be correct either.

Even in cases where it's correct, output from a GGML conversion probably won't be exactly the same as what convert.py would output because the order of tensors in the GGML file may differ and that kind of thing. That's not something that should actually cause a noticeable difference, but it still wouldn't be exactly the same file.

Yeah, I quite understand. I just have to be cautious - the worst case scenario for me is where I convert thousands of files with the script, they all seem to work fine, and then 48 hours later there are reports of issues in niche cases I'd not tested and I feel I have to do them all again. It's unlikely to happen, but I just have to weigh up the possibility when deciding what method to use.

Let me know if some HW would help your testing. E.g. I could provide you with a 4090 pod with fast CPU and 1 Gb/s internet if that'd help you run perplexity tests, or speed up any other testing.

Thanks again for working on this!

@KerfuffleV2
Contributor Author

@TheBloke

I just have to weigh up the possibility when deciding what method to use.

Absolutely. Unfortunately, there are risks with both approaches. As far as I know, I'm the only one who's even tried this pull and having only one person test something is never going to be ideal. At least the official approach has a lot more eyes on it and more people trying it on various models, etc.

E.g. I could provide you with a 4090 pod with fast CPU and 1 Gb/s internet if that'd help you run perplexity tests, or speed up any other testing.

That's a very generous offer. This would definitely help with perplexity tests, but I'm not 100% convinced perplexity is going to highlight the kind of issues that would occur from vocab conversion problems (which I think are the most likely source of trouble).

I'm not sure if you have the time/inclination, but it actually might be better for you to do some of those tests yourself. Since you're not me, you're likely to do stuff I wouldn't do and potentially run into issues that I wouldn't. Of course, it's possible to use both approaches.


Let me also give you (and anyone else interested) a quick summary of the current state:

Converting without the external metadata (config.json, tokenizer.model, etc.) from the original model might be good enough for general use, but I'd say it's probably not suitable for conversion and wide-scale distribution like you're aiming for. Some information just isn't in the GGML file and can't be recovered without external stuff.

Converting with the external metadata should produce a GGUF with the same parameters and vocab as you'd get converting from HF to GGUF directly. It's just the tensor data/info that gets copied from the GGML file in that case, and as far as I know nothing changed there between GGML and GGUF. The file might not be exactly identical (due to stuff like the order of the tensors), but it should be functionally the same.

Note: Using the external metadata just uses the existing convert.py, and the one in the gguf branch lacks some features. This would be an issue when converting from HF as well, but I'd recommend you use the convert.py from #2668 - you can just copy it into the directory this pull is checked out to. (But to make life more complicated, I don't think the version in #2668 has the fixed LLaMA2 70B permute stuff, so don't use convert.py from that branch to convert HF models.)

Example of running conversion with external metadata:

python convert-llama-ggmlv3-to-gguf.py --in blah.bin --out blah.gguf -m /path/to/metadata

It's also possible to specify --vocab-dir and --vocabtype, same as with the official conversion script. The metadata directory can be HF format or the "original" format. However, when using "original" format metadata, if n_vocab or n_ff is missing from params.json it'll fail (it'll just crash rather than produce incorrect output).
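For example, a run that also pulls the vocab from a separate directory might look like this (a hypothetical invocation; whether spm or bpe is the right --vocabtype depends on the model, just like with convert.py):

python convert-llama-ggmlv3-to-gguf.py --in blah.bin --out blah.gguf -m /path/to/metadata --vocab-dir /path/to/vocab --vocabtype spm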

@klosax
Contributor

klosax commented Aug 21, 2023

There is also a slight difference in perplexity between master and gguf with models converted from HF models.
(wiki.test.raw.406)

model                  master      gguf
openllama-7b-v2-f16    7.16735811  7.16541492
openllama-7b-v2-q8_0   7.17201356  7.17004949
openllama-7b-v2-q4_0   7.32185657  7.31998091

@Dampfinchen

So basically, for best results, one needs to download the metadata of the FP16 model from HF (tokenizer_config.json, config.json, etc.), put it in a folder, and then use -m to point the program to that folder?

So, for example, for MythoMax (https://huggingface.co/Gryphe/MythoMax-L2-13b/tree/main) you would download everything aside from the model itself and put that in a folder called Metadata. Since I use q4_K_M, the command would look like this:

python convert-llama-ggmlv3-to-gguf.py --in "mythomax-l2-13b.ggmlv3.q4_K_M.bin" --out "mythomax-l2-13b.ggmlv3.q4_K_M.gguf" -m Metadata

(Assuming the script is running in the same folder)

If I'm understanding it correctly, that's a pretty easy process to do.

@KerfuffleV2
Contributor Author

So basically, for best results, one needs to download the metadata of the FP16 model from HF (tokenizer_config.json, config.json, etc.), put it in a folder, and then use -m to point the program to that folder?

Since the model data won't be used, it doesn't have to be FP16 or anything in particular, but basically yes. If you have git lfs set up, you can just do something like:

env GIT_LFS_SKIP_SMUDGE=1 git lfs clone https://huggingface.co/user/model
cd model
git lfs pull --include=tokenizer.model

The smudge setting tells git to only fetch pointers to the large files (so it'll skip all the model data files). You do need to actually fetch tokenizer.model, though.

If I'm understanding it correctly, that's a pretty easy process to do.

Yep, you understand correctly. Just note the part about using convert.py from #2668 (at least until it gets merged). If you use convert.py from this branch or the gguf branch, it'll potentially be worse than just trying to convert the vocab from the GGML file.

@KerfuffleV2 KerfuffleV2 force-pushed the feat-convert-ggml-to-gguf branch from 297cce3 to f68aef5 Compare August 21, 2023 10:34
Member

@ggerganov ggerganov left a comment


When you are ready, merge this to gguf
I'll merge gguf to master in an hour or two. Alternatively, you can change the target branch to master and merge it after #2398

@KerfuffleV2
Contributor Author

@ggerganov

When you are ready, merge this to gguf

Sounds good. I think I'm done making changes unless other people request stuff or find issues. Do you have a preference between doing it now or waiting? This pull really benefits from #2668, but conversion in general would too. Hopefully that one can make it in as well (whether or not it's merged, this pull won't need changes).

Also, I just want to double-check that you saw the changes to gguf.py:

  1. I added the ability to avoid using a temp file (my conversion stuff just uses numpy's memmap functionality, so the data doesn't have to be explicitly read into memory). I think the normal convert.py stuff could use that approach too, but I didn't want to mess with existing stuff too much.
  2. I made it possible to override the shape and tensor type for add_tensor and add_tensor_info (see the sketch after this list).

Both of those changes are opt-in and the default behavior should be the same as without this pull.
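To make that concrete, here is a rough sketch of how those two opt-in gguf.py additions might be used together. It's a best-effort illustration rather than the actual conversion script: the keyword names (use_temp_file, raw_shape, raw_dtype) follow this pull's description and commit messages, and the file names, offsets and shapes are placeholders.

import numpy as np
import gguf

# Opt-in: skip gguf.py's temp file so tensor data streams straight to the output.
writer = gguf.GGUFWriter("out.gguf", "llama", use_temp_file=False)
writer.add_name("converted-model")

# Memory-map the GGML file so the quantized tensor bytes never have to be
# explicitly read into memory before being written back out.
data = np.memmap("model.ggmlv3.bin", dtype=np.uint8, mode="r")
offset, n_bytes = 0, 1024  # placeholders for a real tensor's location and size
tensor_bytes = data[offset : offset + n_bytes]

# Pass the already-quantized bytes straight through, overriding the shape and
# type that gguf.py would otherwise infer from the numpy array itself.
writer.add_tensor("blk.0.ffn_up.weight", tensor_bytes,
                  raw_shape=(11008, 4096),
                  raw_dtype=gguf.GGMLQuantizationType.Q5_K)

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()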

@ggerganov
Member

I think the normal convert.py stuff could use that approach too but I didn't want to mess with existing stuff too much.

Sure - improvements are welcome.

Will try to get #2668 merged in gguf as well before the final merge to master

@ggerganov ggerganov merged commit e06cbce into ggml-org:gguf Aug 21, 2023
ggerganov added a commit that referenced this pull request Aug 21, 2023
* gguf : first API pass

* gguf : read header + meta data

* gguf : read tensor info

* gguf : initial model loading - not tested

* gguf : add gguf_get_tensor_name()

* gguf : do not support passing existing ggml_context to gguf_init

* gguf : simplify gguf_get_val

* gguf : gguf.c is now part of ggml.c

* gguf : read / write sample models

* gguf : add comments

* refactor : reduce code duplication and better API (#2415)

* gguf : expose the gguf_type enum through the API for now

* gguf : add array support

* gguf.py : some code style changes

* convert.py : start a new simplified implementation by removing old stuff

* convert.py : remove GGML vocab + other obsolete stuff

* GGUF : write tensor (#2426)

* WIP: Write tensor

* GGUF : Support writing tensors in Python

* refactor : rm unused import and upd todos

* fix : fix errors upd writing example

* rm example.gguf

* gitignore *.gguf

* undo formatting

* gguf : add gguf_find_key (#2438)

* gguf.cpp : find key example

* ggml.h : add gguf_find_key

* ggml.c : add gguf_find_key

* gguf : fix writing tensors

* gguf : do not hardcode tensor names to read

* gguf : write sample tensors to read

* gguf : add tokenization constants

* quick and dirty conversion example

* gguf : fix writing gguf arrays

* gguf : write tensors one by one and code reuse

* gguf : fix writing gguf arrays

* gguf : write tensors one by one

* gguf : write tensors one by one

* gguf : write tokenizer data

* gguf : upd gguf conversion script

* Update convert-llama-h5-to-gguf.py

* gguf : handle already encoded string

* ggml.h : get array str and f32

* ggml.c : get arr str and f32

* gguf.py : support any type

* Update convert-llama-h5-to-gguf.py

* gguf : fix set is not subscriptable

* gguf : update convert-llama-h5-to-gguf.py

* constants.py : add layer norm eps

* gguf.py : add layer norm eps and merges

* ggml.h : increase GGML_MAX_NAME to 64

* ggml.c : add gguf_get_arr_n

* Update convert-llama-h5-to-gguf.py

* add gptneox gguf example

* Makefile : add gptneox gguf example

* Update convert-llama-h5-to-gguf.py

* add gptneox gguf example

* Update convert-llama-h5-to-gguf.py

* Update convert-gptneox-h5-to-gguf.py

* Update convert-gptneox-h5-to-gguf.py

* Update convert-llama-h5-to-gguf.py

* gguf : support custom alignment value

* gguf : fix typo in function call

* gguf : mmap tensor data example

* fix : update convert-llama-h5-to-gguf.py

* Update convert-llama-h5-to-gguf.py

* convert-gptneox-h5-to-gguf.py : Special tokens

* gptneox-main.cpp : special tokens

* Update gptneox-main.cpp

* constants.py : special tokens

* gguf.py : accumulate kv and tensor info data + special tokens

* convert-gptneox-h5-to-gguf.py : accumulate kv and ti + special tokens

* gguf : gguf counterpart of llama-util.h

* gguf-util.h : update note

* convert-llama-h5-to-gguf.py : accumulate kv / ti + special tokens

* convert-llama-h5-to-gguf.py : special tokens

* Delete gptneox-common.cpp

* Delete gptneox-common.h

* convert-gptneox-h5-to-gguf.py : gpt2bpe tokenizer

* gptneox-main.cpp : gpt2 bpe tokenizer

* gpt2 bpe tokenizer (handles merges and unicode)

* Makefile : remove gptneox-common

* gguf.py : bytesarray for gpt2bpe tokenizer

* cmpnct_gpt2bpe.hpp : comments

* gguf.py : use custom alignment if present

* gguf : minor stuff

* Update gptneox-main.cpp

* map tensor names

* convert-gptneox-h5-to-gguf.py : map tensor names

* convert-llama-h5-to-gguf.py : map tensor names

* gptneox-main.cpp : map tensor names

* gguf : start implementing libllama in GGUF (WIP)

* gguf : start implementing libllama in GGUF (WIP)

* rm binary commited by mistake

* upd .gitignore

* gguf : calculate n_mult

* gguf :  inference with 7B model working (WIP)

* gguf : rm deprecated function

* gguf : start implementing gguf_file_saver (WIP)

* gguf : start implementing gguf_file_saver (WIP)

* gguf : start implementing gguf_file_saver (WIP)

* gguf : add gguf_get_kv_type

* gguf : add gguf_get_kv_type

* gguf : write metadata in gguf_file_saver (WIP)

* gguf : write metadata in gguf_file_saver (WIP)

* gguf : write metadata in gguf_file_saver

* gguf : rm references to old file formats

* gguf : shorter name for member variable

* gguf : rm redundant method

* gguf : get rid of n_mult, read n_ff from file

* Update gguf_tensor_map.py

* Update gptneox-main.cpp

* gguf : rm references to old file magics

* gguf : start implementing quantization (WIP)

* gguf : start implementing quantization (WIP)

* gguf : start implementing quantization (WIP)

* gguf : start implementing quantization (WIP)

* gguf : start implementing quantization (WIP)

* gguf : start implementing quantization (WIP)

* gguf : quantization is working

* gguf : roper closing of file

* gguf.py : no need to convert tensors twice

* convert-gptneox-h5-to-gguf.py : no need to convert tensors twice

* convert-llama-h5-to-gguf.py : no need to convert tensors twice

* convert-gptneox-h5-to-gguf.py : simplify nbytes

* convert-llama-h5-to-gguf.py : simplify nbytes

* gptneox-main.cpp : n_layer --> n_block

* constants.py : n_layer --> n_block

* gguf.py : n_layer --> n_block

* convert-gptneox-h5-to-gguf.py : n_layer --> n_block

* convert-llama-h5-to-gguf.py : n_layer --> n_block

* gptneox-main.cpp : n_layer --> n_block

* Update gguf_tensor_map.py

* convert-gptneox-h5-to-gguf.py : load model in parts to save memory

* convert-llama-h5-to-gguf.py : load model in parts to save memory

* convert : write more metadata for LLaMA

* convert : rm quantization version

* convert-gptneox-h5-to-gguf.py : add file_type key

* gptneox-main.cpp : add file_type key

* fix conflicts

* gguf : add todos and comments

* convert-gptneox-h5-to-gguf.py : tensor name map changes

* Create gguf_namemap.py : tensor name map changes

* Delete gguf_tensor_map.py

* gptneox-main.cpp : tensor name map changes

* convert-llama-h5-to-gguf.py : fixes

* gguf.py : dont add empty strings

* simple : minor style changes

* gguf : use UNIX line ending

* Create convert-llama-7b-pth-to-gguf.py

* llama : sync gguf-llama.cpp with latest llama.cpp (#2608)

* llama : sync gguf-llama.cpp with latest llama.cpp

* minor : indentation + assert

* llama : refactor gguf_buffer and gguf_ctx_buffer

* llama : minor

* gitignore : add gptneox-main

* llama : tokenizer fixes (#2549)

* Merge tokenizer fixes into the gguf branch.

* Add test vocabularies

* convert : update convert-new.py with tokenizer fixes (#2614)

* Merge tokenizer fixes into the gguf branch.

* Add test vocabularies

* Adapt convert-new.py (and fix a clang-cl compiler error on windows)

* llama : sync gguf-llama with llama (#2613)

* llama : sync gguf-llama with llama

* tests : fix build + warnings (test-tokenizer-1 still fails)

* tests : fix wstring_convert

* convert : fix layer names

* llama : sync gguf-llama.cpp

* convert : update HF converter to new tokenizer voodoo magics

* llama : update tokenizer style

* convert-llama-h5-to-gguf.py : add token types

* constants.py : add token types

* gguf.py : add token types

* convert-llama-7b-pth-to-gguf.py : add token types

* gguf-llama.cpp :  fix n_head_kv

* convert-llama-h5-to-gguf.py : add 70b gqa support

* gguf.py : add tensor data layout

* convert-llama-h5-to-gguf.py : add tensor data layout

* convert-llama-7b-pth-to-gguf.py : add tensor data layout

* gptneox-main.cpp : add tensor data layout

* convert-llama-h5-to-gguf.py : clarify the reverse permute

* llama : refactor model loading code (#2620)

* llama : style formatting + remove helper methods

* llama : fix quantization using gguf tool

* llama : simplify gguf_file_saver

* llama : fix method names

* llama : simplify write_header()

* llama : no need to pass full file loader to the file saver

just gguf_ctx

* llama : gguf_file_saver write I32

* llama : refactor tensor names (#2622)

* gguf: update tensor names searched in quantization

* gguf : define tensor names as constants

* gguf : initial write API (not tested yet)

* gguf : write to file API (not tested)

* gguf : initial write API ready + example

* gguf : fix header write

* gguf : fixes + simplify example + add ggml_nbytes_pad()

* gguf : minor

* llama : replace gguf_file_saver with new gguf write API

* gguf : streaming support when writing files

* gguf : remove oboslete write methods

* gguf : remove obosolete gguf_get_arr_xxx API

* llama : simplify gguf_file_loader

* llama : move hparams and vocab from gguf_file_loader to llama_model_loader

* llama : merge gguf-util.h in llama.cpp

* llama : reorder definitions in .cpp to match .h

* llama : minor simplifications

* llama : refactor llama_model_loader (WIP)

wip : remove ggml_ctx from llama_model_loader

wip : merge gguf_file_loader in llama_model_loader

* llama : fix shape prints

* llama : fix Windows build + fix norm_rms_eps key

* llama : throw error on missing KV paris in model meta data

* llama : improve printing + log meta data

* llama : switch print order of meta data

---------

Co-authored-by: M. Yusuf Sarıgöz <[email protected]>

* gguf : deduplicate (#2629)

* gguf : better type names

* dedup : CPU + Metal is working

* ggml : fix warnings about unused results

* llama.cpp : fix line feed and compiler warning

* llama : fix strncpy warning + note token_to_str does not write null

* llama : restore the original load/save session implementation

Will migrate this to GGUF in the future

* convert-llama-h5-to-gguf.py : support alt ctx param name

* ggml : assert when using ggml_mul with non-F32 src1

* examples : dedup simple

---------

Co-authored-by: klosax <[email protected]>

* gguf.py : merge all files in gguf.py

* convert-new.py : pick #2427 for HF 70B support

* examples/gguf : no need to keep q option for quantization any more

* llama.cpp : print actual model size

* llama.cpp : use ggml_elements()

* convert-new.py : output gguf (#2635)

* convert-new.py : output gguf (WIP)

* convert-new.py : add gguf key-value pairs

* llama : add hparams.ctx_train + no longer print ftype

* convert-new.py : minor fixes

* convert-new.py : vocab-only option should work now

* llama : fix tokenizer to use llama_char_to_byte

* tests : add new ggml-vocab-llama.gguf

* convert-new.py : tensor name mapping

* convert-new.py : add map for skipping tensor serialization

* convert-new.py : convert script now works

* gguf.py : pick some of the refactoring from #2644

* convert-new.py : minor fixes

* convert.py : update to support GGUF output

* Revert "ci : disable CI temporary to not waste energy"

This reverts commit 7e82d25.

* convert.py : n_head_kv optional and .gguf file extension

* convert.py : better always have n_head_kv and default it to n_head

* llama : sync with recent PRs on master

* editorconfig : ignore models folder

ggml-ci

* ci : update ".bin" to ".gguf" extension

ggml-ci

* llama : fix llama_model_loader memory leak

* gptneox : move as a WIP example

* llama : fix lambda capture

ggml-ci

* ggml : fix bug in gguf_set_kv

ggml-ci

* common.h : .bin --> .gguf

* quantize-stats.cpp : .bin --> .gguf

* convert.py : fix HF tensor permuting / unpacking

ggml-ci

* llama.cpp : typo

* llama : throw error if gguf fails to init from file

ggml-ci

* llama : fix tensor name grepping during quantization

ggml-ci

* gguf.py : write tensors in a single pass (#2644)

* gguf : single pass for writing tensors + refactoring writer

* gguf : single pass for writing tensors + refactoring writer

* gguf : single pass for writing tensors + refactoring writer

* gguf : style fixes in simple conversion script

* gguf : refactor gptneox conversion script

* gguf : rename h5 to hf (for HuggingFace)

* gguf : refactor pth to gguf conversion script

* gguf : rm file_type key and method

* gguf.py : fix vertical alignment

* gguf.py : indentation

---------

Co-authored-by: Georgi Gerganov <[email protected]>

* convert-gptneox-hf-to-gguf.py : fixes

* gguf.py : gptneox mapping

* convert-llama-hf-to-gguf.py : fixes

* convert-llama-7b-pth-to-gguf.py : fixes

* ggml.h : reverse GGUF_MAGIC

* gguf.py : reverse GGUF_MAGIC

* test-tokenizer-0.cpp : fix warning

* llama.cpp : print kv general.name

* llama.cpp : get special token kv and linefeed token id

* llama : print number of tensors per type + print arch + style

* tests : update vocab file with new magic

* editorconfig : fix whitespaces

* llama : re-order functions

* llama : remove C++ API + reorganize common source in /common dir

* llama : minor API updates

* llama : avoid hardcoded special tokens

* llama : fix MPI build

ggml-ci

* llama : introduce enum llama_vocab_type + remove hardcoded string constants

* convert-falcon-hf-to-gguf.py : falcon HF --> gguf conversion, not tested

* falcon-main.cpp : falcon inference example

* convert-falcon-hf-to-gguf.py : remove extra kv

* convert-gptneox-hf-to-gguf.py : remove extra kv

* convert-llama-7b-pth-to-gguf.py : remove extra kv

* convert-llama-hf-to-gguf.py : remove extra kv

* gguf.py : fix for falcon 40b

* falcon-main.cpp : fix for falcon 40b

* convert-falcon-hf-to-gguf.py : update ref

* convert-falcon-hf-to-gguf.py : add tensor data layout

* cmpnct_gpt2bpe.hpp : fixes

* falcon-main.cpp : fixes

* gptneox-main.cpp : fixes

* cmpnct_gpt2bpe.hpp : remove non-general stuff

* Update examples/server/README.md

Co-authored-by: slaren <[email protected]>

* cmpnct_gpt2bpe.hpp : cleanup

* convert-llama-hf-to-gguf.py : special tokens

* convert-llama-7b-pth-to-gguf.py : special tokens

* convert-permute-debug.py : permute debug print

* convert-permute-debug-master.py : permute debug for master

* convert-permute-debug.py : change permute type of attn_q

* convert.py : 70b model working (change attn_q permute)

* Delete convert-permute-debug-master.py

* Delete convert-permute-debug.py

* convert-llama-hf-to-gguf.py : fix attn_q permute

* gguf.py : fix rope scale kv

* convert-llama-hf-to-gguf.py : rope scale and added tokens

* convert-llama-7b-pth-to-gguf.py : rope scale and added tokens

* llama.cpp : use rope scale kv

* convert-llama-7b-pth-to-gguf.py : rope scale fix

* convert-llama-hf-to-gguf.py : rope scale fix

* py : fix whitespace

* gguf : add Python script to convert GGMLv3 LLaMA models to GGUF (#2682)

* First pass at converting GGMLv3 LLaMA models to GGUF

* Cleanups, better output during conversion

* Fix vocab space conversion logic

* More vocab conversion fixes

* Add description to converted GGUF files

* Improve help text, expand warning

* Allow specifying name and description for output GGUF

* Allow overriding vocab and hyperparams from original model metadata

* Use correct params override var name

* Fix wrong type size for Q8_K

Better handling of original style metadata

* Set default value for gguf add_tensor raw_shape KW arg

* llama : improve token type support (#2668)

* Merge tokenizer fixes into the gguf branch.

* Add test vocabularies

* Adapt convert-new.py (and fix a clang-cl compiler error on windows)

* Improved tokenizer test

But does it work on MacOS?

* Improve token type support

- Added @klosax code to convert.py
- Improved token type support in vocabulary

* Exclude platform dependent tests

* More sentencepiece compatibility by eliminating magic numbers

* Restored accidentally removed comment

* llama : add API for token type

ggml-ci

* tests : use new tokenizer type API (#2692)

* Merge tokenizer fixes into the gguf branch.

* Add test vocabularies

* Adapt convert-new.py (and fix a clang-cl compiler error on windows)

* Improved tokenizer test

But does it work on MacOS?

* Improve token type support

- Added @klosax code to convert.py
- Improved token type support in vocabulary

* Exclude platform dependent tests

* More sentencepiece compatibility by eliminating magic numbers

* Restored accidentally removed comment

* Improve commentary

* Use token type API in test-tokenizer-1.cpp

* py : cosmetics

* readme : add notice about new file format

ggml-ci

---------

Co-authored-by: M. Yusuf Sarıgöz <[email protected]>
Co-authored-by: klosax <[email protected]>
Co-authored-by: goerch <[email protected]>
Co-authored-by: slaren <[email protected]>
Co-authored-by: Kerfuffle <[email protected]>
@KerfuffleV2 KerfuffleV2 deleted the feat-convert-ggml-to-gguf branch November 17, 2023 03:12