Loading the Llama 2 70B model from TheBloke with rustformers/llm appears to succeed, but inference fails. llama.cpp raises an assertion regardless of the `use_gpu` option:
```
Loading of model complete
Model size = 27262.60 MB / num tensors = 723
[2023-07-29T14:24:19Z INFO  actix_server::builder] starting 10 workers
[2023-07-29T14:24:19Z INFO  actix_server::server] Actix runtime found; starting in Actix runtime
GGML_ASSERT: llama-cpp/ggml.c:6192: ggml_nelements(a) == ne0*ne1*ne2
```
This might be related to the model files, but the models from TheBloke are usually reliable.
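One possible cause worth checking (an assumption on my part, not confirmed): Llama 2 70B uses grouped-query attention (8 key/value heads versus 64 query heads), and ggml builds that predate GQA support can hit exactly this kind of element-count assert when reshaping the K/V tensors with the per-query-head dimensions. A rough sketch of the mismatch, using the published 70B hyperparameters:

```python
# Hypothetical illustration of the reshape mismatch; the dimensions below are
# Llama 2 70B's published hyperparameters, not values read from this model file.
n_embd = 8192                 # hidden size
n_head = 64                   # query heads
n_head_kv = 8                 # key/value heads (grouped-query attention)
head_dim = n_embd // n_head   # 128

# Elements actually present in a K/V projection weight under GQA:
kv_proj_elements = n_embd * (n_head_kv * head_dim)   # 8192 * 1024

# Elements a pre-GQA reshape would expect (one K/V head per query head):
expected_elements = n_embd * (n_head * head_dim)     # 8192 * 8192

print(kv_proj_elements, expected_elements)
# The two differ by the 8x grouping factor, so a reshape asserting
# ggml_nelements(a) == ne0*ne1*ne2 with the expected shape would fail.
assert kv_proj_elements != expected_elements
```

If that is the issue, it would explain why the failure is independent of `use_gpu` and why smaller Llama 2 models (which have no GQA) load fine with the same code path.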
Running on a MacBook Pro M1 Max with 32 GB RAM, macOS 14.0.0 (23A5301g).