Exploring AI Model Formats: A Deep Dive into Llama.cpp, GGUF, GGML, and Huggingface Transformers

Posted by Aug on November 26, 2023

Awesome Breakdown Post

Some clarifications/findings from the above: GGML isn’t necessarily quantized. I ran the GGML converter on my GPT-J float 16 model. It remained float 16, but it just ran a lot faster.

I got GPT-4 to summarize, here it is:

The Reddit post you’ve referenced discusses various tools and libraries for loading, training, and inference of AI models, focusing on compatibility and supported file formats. Here’s a summary:

  1. Compatibility & Supported File Formats:
    • Llama.cpp (by Georgi Gerganov) supports safetensors and can convert .safetensors model files into .gguf.
    • GGUF: A new file format that includes all necessary metadata within the model file, eliminating the need for additional files like tokenizer_config.json.
    • GGML: An older format, now deprecated in Llama.cpp.
    • Transformers (by Hugging Face): Supports safetensors (unquantized and quantized using GPTQ algorithm) and bin (unquantized) formats.
    • AutoGPTQ: A quantization library based on the GPTQ algorithm, also available via Transformers.
    • koboldcpp: A fork of Llama.cpp supporting bin files using the GGML algorithm.
    • ExLlama v2: Offers optimized GPTQ backend for LLaMA models, supporting safetensors quantized using the GPTQ algorithm and AWQ (low-bit quantization).
    • AWQ: Supports low-bit quantization (INT3/4) and uses the safetensors format.
  2. Additional Notes:
    • The “Use in Transformers” button on the Huggingface site may not work for all hosted models; it’s more like an auto-generated template.
    • GGUF files usually contain all necessary components (e.g., tokenizer), but they don’t include the prompt template. Prompt templates can often be found on the model card on HuggingFace.
    • Safetensors can be used with Transformers but not with Llama.cpp.
    • GGML is no longer supported by Llama.cpp, and its support has been deprecated.
    • There are multiple frameworks (like Transformers, Llama.cpp, koboldcpp, ExLlama, etc.), and some support multiple quantization formats while others require specific formats.

This discussion highlights the evolving landscape of AI model formats and the tools used for their management, emphasizing the importance of compatibility and the continuous development of these technologies.

Next, I will work on integrating a UI for my model.