<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by EngineeredAI on Medium]]></title>
        <description><![CDATA[Stories by EngineeredAI on Medium]]></description>
        <link>https://medium.com/@engineeredai_net?source=rss-114ac3fd6f98------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*IHLOwoJmZZ-ftyYtMm5RSg.png</url>
            <title>Stories by EngineeredAI on Medium</title>
            <link>https://medium.com/@engineeredai_net?source=rss-114ac3fd6f98------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 17 May 2026 04:24:41 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@engineeredai_net/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[LLM Quantization Explained: What Q4, Q5, and Q8 Actually Mean for Your GPU]]></title>
            <link>https://medium.com/@engineeredai_net/llm-quantization-explained-what-q4-q5-and-q8-actually-mean-for-your-gpu-4c30cf03b481?source=rss-114ac3fd6f98------2</link>
            <guid isPermaLink="false">https://medium.com/p/4c30cf03b481</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[local-llm]]></category>
            <category><![CDATA[llm]]></category>
            <dc:creator><![CDATA[EngineeredAI]]></dc:creator>
            <pubDate>Thu, 07 May 2026 11:16:38 GMT</pubDate>
            <atom:updated>2026-05-07T11:16:38.901Z</atom:updated>
            <content:encoded><![CDATA[<p>The naming makes no sense until it does. Here’s the breakdown.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5vW4N3jwk6OudECsHJ9TRg.jpeg" /></figure><p>The first time you pull a model in Ollama, the name looks like this: phi4:Q4_K_M or mistral:Q5_K_S or llama3.2:Q8_0. Nobody explains what those suffixes mean. The model page shows the file size. The README says it’s a quantized version. Neither tells you which one to pull or why the choice matters.</p><p>LLM quantization is a method of compressing model weights from full floating-point precision down to lower-bit representations so the model fits in less VRAM without destroying output quality. The number after Q is the bits per weight. The letter after that describes the quantization method. Everything follows from those two facts.</p><h4>Why This Matters More Than People Realize</h4><p>A full-precision 7B model at FP16 needs roughly 14GB of VRAM. A 14B model needs around 28GB. The first barely squeezes onto a 16GB card with nothing left over for context; the second exceeds even a 24GB flagship. Neither is practical on the GPUs most people actually own. Quantization is what makes the local AI ecosystem work on consumer hardware at all.</p><p>The savings are not marginal. A 7B model at Q4_K_M loads in 4 to 4.5GB. A 14B model loads in 8 to 9GB. That’s the difference between a model loading cleanly and a model refusing to load entirely.</p><h4>The Quantization Ladder</h4><p><strong>Q2 and Q3</strong> compress aggressively. The VRAM wins are real. The quality loss is also real: coherence degrades, factual reliability drops, and the model produces confident-sounding nonsense more frequently than at higher quantizations. If a model only fits at Q3 but not Q4, the correct answer is usually a smaller model at Q4, not a larger model at Q3.</p><p><strong>Q4_K_M</strong> is the practical standard. Output quality is strong enough for general use: drafting, coding assistance, summarization, reasoning tasks. The gap between Q4 and full FP16 precision is real but not operationally significant for most use cases. This is what Ollama pulls by default, and that default is correct.</p><p><strong>Q5_K_M</strong> costs roughly 10 to 15 percent more VRAM than Q4 and delivers a noticeable quality improvement on structured output, complex reasoning chains, and constrained code generation. If your card has headroom after Q4 loads comfortably, Q5 is the upgrade worth trying.</p><p><strong>Q8</strong> is near-full-precision output at approximately half the VRAM cost of FP16. A 7B model at Q8 needs 7 to 8GB. A 14B model needs 14 to 15GB. If the model fits cleanly with headroom, Q8 is worth running. If it fits but barely, stay at Q5.</p>
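<p>If you want to sanity-check those numbers for any model size, the arithmetic is just parameters times bits per weight. Here’s a minimal sketch in Python; the effective bits-per-weight figures are approximate averages (K-quants mix precisions, so they sit a little above their nominal bit count), and real usage adds KV cache and buffers on top of the weights:</p><pre>
# Rough weight-storage estimate per quantization tier.
# Bits-per-weight values are approximate effective averages, not exact specs.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
}

def weights_gb(params_billions: float, quant: str) -> float:
    """Weight storage only; context (KV cache) and buffers come on top."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

for quant in BITS_PER_WEIGHT:
    print(f"7B at {quant}: ~{weights_gb(7, quant):.1f} GB")
# FP16 ~14.0, Q8_0 ~7.4, Q5_K_M ~5.0, Q4_K_M ~4.2, Q3_K_M ~3.4
</pre>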
<h4>What K_M Actually Means</h4><p>The K in Q4_K_M stands for K-quant, a quantization method that applies different bit depths to different layers of the model rather than treating all weights identically. Some layers matter more for output quality. K-quants allocate more bits where precision has the most impact.</p><p>The letter after K is the size variant: S is small, M is medium, L is large. It controls how many of the precision-sensitive layers get extra bits, trading a little VRAM for a little quality. The difference between K_M, K_S, and K_L at the same Q level is smaller than the difference between Q levels themselves. K_M is the best default because it balances VRAM efficiency and output quality across the widest range of tasks. Start with K_M.</p><p>The _0 suffix on tags like Q8_0 indicates an older non-K quantization method. If both Q4_K_M and Q4_0 are available, pull Q4_K_M.</p><h4>The Counterintuitive Part</h4><p>The quality gap between Q3 and Q4 is larger than the gap between Q4 and Q8. That’s the thing most people don’t know when they’re building their first local inference setup. People assume Q8 is dramatically better than Q4. It’s better, but not dramatically. Q3 vs Q4 is where the cliff is.</p><p>For a full breakdown with VRAM numbers per model size and a concrete Phi-4 14B example across every quantization tier: <a href="https://engineeredai.net/llm-quantization-explained/">https://engineeredai.net/llm-quantization-explained/</a></p>
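<p>One last illustration for the curious: the storage trick underneath every one of these formats is keeping one floating-point scale per small block of weights plus a low-bit integer per weight. Here’s a toy round-trip in Python; it assumes nothing about llama.cpp’s actual block layouts, which differ in detail:</p><pre>
import random

def quantize_roundtrip(weights, bits):
    """Symmetric block quantization, Q?_0 style: one scale per block,
    one small signed integer per weight, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

random.seed(0)
block = [random.gauss(0, 0.02) for _ in range(32)]  # one 32-weight block
for bits in (3, 4, 8):
    restored = quantize_roundtrip(block, bits)
    err = max(abs(a - b) for a, b in zip(block, restored))
    print(f"{bits}-bit: max round-trip error {err:.5f}")
</pre>]]></content:encoded>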
        </item>
        <item>
            <title><![CDATA[Medium API Alternative Automation: Why It Fails and What Actually Works]]></title>
            <link>https://medium.com/@engineeredai_net/medium-api-alternative-automation-why-it-fails-and-what-actually-works-9f0abd90a4e0?source=rss-114ac3fd6f98------2</link>
            <guid isPermaLink="false">https://medium.com/p/9f0abd90a4e0</guid>
            <category><![CDATA[automation]]></category>
            <category><![CDATA[playwright-automation]]></category>
            <category><![CDATA[browser-automation]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[EngineeredAI]]></dc:creator>
            <pubDate>Thu, 30 Apr 2026 14:56:43 GMT</pubDate>
            <atom:updated>2026-04-30T14:56:43.763Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VZSsSdf4bo4ugJzyRe3trQ.png" /></figure><p>In a world where automating workflows has become the norm, Medium’s API shutdown threw a wrench into the gears for countless developers, AI builders, and tech-adjacent professionals.</p><p>Session-cookie hacks that create blank drafts, automated pipelines grinding to a halt: it’s a familiar nightmare.</p><p>We dug deep into the alternatives, from GraphQL endpoint mapping to Burp Suite traffic analysis.</p><p>The verdict: Playwright browser automation is the only approach that works consistently and reliably.</p>
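<p>For a taste of what that looks like, here’s a minimal Playwright sketch in Python. It reuses a logged-in browser profile instead of raw session cookies; the flow and the profile path are illustrative assumptions, not the hardened pipeline from the full write-up:</p><pre>
# Sketch: drive Medium's real editor with Playwright instead of the dead API.
# Assumes you've logged in once inside the persistent profile directory.
from playwright.sync_api import sync_playwright

PROFILE_DIR = "./medium-profile"  # illustrative path, created on first run

with sync_playwright() as p:
    # A persistent context keeps cookies and local storage between runs,
    # so Medium sees an ordinary logged-in browser session.
    browser = p.chromium.launch_persistent_context(PROFILE_DIR, headless=False)
    page = browser.new_page()
    page.goto("https://medium.com/new-story")

    # The editor is a contenteditable surface, so we type like a human
    # rather than POSTing payloads at an endpoint.
    page.keyboard.type("Draft title typed by Playwright")
    page.keyboard.press("Enter")
    page.keyboard.type("Body text goes here, exactly as a person would type it.")

    page.wait_for_timeout(2000)  # give autosave a moment before closing
    browser.close()
</pre>]]></content:encoded>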
        </item>
        <item>
            <title><![CDATA[Can You Run a Local LLM on Your Android Phone? Here’s What Actually Works]]></title>
            <link>https://medium.com/@engineeredai_net/can-you-run-a-local-llm-on-your-android-phone-heres-what-actually-works-97035a116e83?source=rss-114ac3fd6f98------2</link>
            <guid isPermaLink="false">https://medium.com/p/97035a116e83</guid>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[android]]></category>
            <category><![CDATA[productivity]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[EngineeredAI]]></dc:creator>
            <pubDate>Sat, 25 Apr 2026 07:34:53 GMT</pubDate>
            <atom:updated>2026-04-25T07:34:53.764Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4_ubqNOj6_WR47YOnrq-kA.png" /></figure><p>The premise sounds like a stretch until you look at what flagship Android hardware is actually doing in 2026. Neural Processing Units on Snapdragon 8 Gen 2 chips. 12–16GB RAM in gaming phones that have been in people’s pockets for two years. Quantized models that fit inside that RAM without meaningful quality loss on everyday tasks.</p><p>I tested this because I was in my garden, phone in my hand, and the desk felt like friction I didn’t need for what amounted to a workflow trigger. The question wasn’t whether local AI on mobile is theoretically possible; it was whether the hardware you already own can do useful work without turning into a heat source and a battery drain.</p><p>The answer for most flagship Android devices from the last two years is yes, with the right setup.</p><p><strong>What the hardware threshold actually is</strong></p><p>6GB RAM is the technical floor, but 1B to 3B parameter models are all you’re getting there. The practical starting point is 8GB RAM on a Snapdragon 8 Gen 2 or equivalent. That opens up 3B to 7B models, which is the range where output quality becomes genuinely useful rather than just functional. At 12GB and above, Llama 3.2 7B and Qwen 3 4B run cleanly at 15–30 tokens per second on flagship hardware.</p><p>NPU routing matters. Apps built to use the Neural Processing Unit get better throughput at lower battery cost than CPU-only inference. Off Grid handles this automatically on supported Snapdragon devices: NPU first, Adreno GPU via OpenCL as fallback.</p><p><strong>The one setting to change immediately</strong></p><p>Go into Off Grid settings and switch the KV cache to q4_0. That single change has more impact on inference speed than anything else you’ll do in the app; there’s a rough sketch of why at the end of this post. Everything else is optional tuning.</p><p><strong>Which models at which RAM tier</strong></p><p>At 8GB: Qwen 3 1.7B and Phi-4 Mini. Fast, coherent, useful for drafting and summarization. At 12GB and above: Llama 3.2 7B and Qwen 3 4B. These are the practical ceiling before the speed-to-capability tradeoff starts working against you in conversation. Q4_K_M quantization throughout: it uses roughly a third of the memory of full FP16 precision with minimal real-world quality loss.</p><p><strong>What it’s actually for</strong></p><p>If you’ve already automated parts of your workflow, what the phone handles is the gap before automation kicks in. Reviewing a brief. Approving a draft. Kicking off a sub-workflow. None of that needs a desktop. It needs a capable interface and enough on-device AI to handle lightweight reasoning before the automated layer takes over.</p><p>The desk was always optional for these tasks. The hardware just needed to catch up to that conclusion.</p><p>Full post with app breakdown and model recommendations: <a href="https://engineeredai.net/run-local-llm-on-android-phone/">https://engineeredai.net/run-local-llm-on-android-phone/</a></p>
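<p>For the curious, here’s the promised back-of-envelope sketch of why the KV cache setting is the high-leverage one. The model dimensions below are illustrative of a 7B-class model with grouped-query attention, not the exact specs of anything Off Grid ships:</p><pre>
# Why quantizing the KV cache frees so much memory on a phone.
# Shape numbers are illustrative, not any specific checkpoint's specs.
def kv_cache_mb(layers, kv_heads, head_dim, context, bytes_per_elem):
    # Keys and values are both cached, hence the factor of 2.
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e6

layers, kv_heads, head_dim, context = 32, 8, 128, 8192
fp16 = kv_cache_mb(layers, kv_heads, head_dim, context, 2.0)     # 16-bit entries
q4_0 = kv_cache_mb(layers, kv_heads, head_dim, context, 0.5625)  # ~4.5 bits each
print(f"FP16 KV cache: ~{fp16:.0f} MB")   # ~1074 MB
print(f"q4_0 KV cache: ~{q4_0:.0f} MB")   # ~302 MB
</pre>]]></content:encoded>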
        </item>
        <item>
            <title><![CDATA[How to Automate Your Business with AI Tools: Save Time and Boost Productivity]]></title>
            <link>https://medium.com/@engineeredai_net/how-to-automate-your-business-with-ai-tools-save-time-and-boost-productivity-f5da23b79219?source=rss-114ac3fd6f98------2</link>
            <guid isPermaLink="false">https://medium.com/p/f5da23b79219</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[EngineeredAI]]></dc:creator>
            <pubDate>Fri, 17 Apr 2026 17:55:10 GMT</pubDate>
            <atom:updated>2026-04-17T17:55:10.479Z</atom:updated>
            <content:encoded><![CDATA[<p>As businesses continue to grow and expand, finding ways to optimize operations and increase efficiency becomes increasingly important. One way to achieve this is by leveraging artificial intelligence (AI) tools that can automate various tasks, freeing up time for employees to focus on high-value activities. In this article, we’ll explore how AI tools can help automate your business, saving time and boosting productivity.</p><p>One of the primary areas where AI tools excel is content creation. Automated content generation tools can quickly produce high-quality articles, social media posts, and other written content, allowing businesses to maintain a consistent online presence without the need for extensive writing resources. These tools use natural language processing (NLP) algorithms to analyze topics, tone, and style, ensuring that generated content aligns with an organization’s brand voice.</p><p>Another area where AI can greatly impact business operations is quality assurance (QA). Automated testing tools can quickly identify bugs and errors in software applications, allowing developers to fix issues before they reach customers. This not only improves the overall user experience but also reduces the risk of costly rework or even product recalls.</p><p>Customer support is another critical area where AI-powered automation can make a significant difference. Chatbots and virtual assistants can be programmed to respond to common customer inquiries, freeing up human support agents to tackle more complex issues. This not only enhances customer satisfaction but also increases efficiency, allowing businesses to provide 24/7 support without the need for extensive staffing.</p><p>In addition to these areas, AI tools can also automate tasks such as bookkeeping and accounting, data entry, and even marketing campaign management. By leveraging machine learning algorithms, businesses can analyze vast amounts of data to identify trends and patterns, informing decision-making that drives business growth.</p><p>One key benefit of using AI tools for automation is the ability to scale operations quickly and efficiently. As a business grows, it’s essential to find ways to maintain productivity while expanding into new markets or services. By automating tasks with AI tools, businesses can adapt more rapidly to changing market conditions without sacrificing quality or efficiency.</p><p>Of course, implementing AI-powered automation solutions requires careful planning and execution. It’s essential to assess the specific needs of your business and identify areas where automation can make the most impact. This may involve working closely with IT staff or external consultants to ensure that chosen tools integrate seamlessly with existing systems.</p><p>As businesses continue to explore the potential of AI tools for automation, it’s clear that the benefits extend far beyond mere productivity gains.
By freeing up time for employees to focus on high-value activities and enhancing customer satisfaction, AI-powered automation can drive business growth and success over the long term.</p><p>If you’re interested in learning more about how AI tools can help automate your business, check out our original article at <a href="https://engineeredai.net/how-to-automate-your-business-with-ai-tools/">https://engineeredai.net/how-to-automate-your-business-with-ai-tools/</a>. In this comprehensive guide, we dive deeper into the world of AI-powered automation, exploring its applications and potential impact on businesses across various industries.</p>]]></content:encoded>
        </item>
    </channel>
</rss>