AI.news

30 items last 24h
generated 2026-03-21 23:27 feed.json
7 pts
r/LocalLLaMA Community 7h ago

DeepSeek Core Researcher Daya Guo Rumored to Have Resigned

Recently, heavy-hitting news regarding a major personnel change has emerged in the field of Large Language Models (LLMs): Daya Guo, a core researcher at DeepSeek and one of the primary authors of the DeepSeek-R1 paper, has reportedly resigned. Public records show that Daya Guo po...

6 pts
r/LocalLLaMA Community 9h ago

TGI is in maintenance mode. Time to switch?

Our company uses Hugging Face TGI as the default engine on AWS SageMaker AI. I have had really bad experiences with TGI compared to my home setup using llama.cpp and vLLM. I just saw that Hugging Face has ended new development of TGI: https://huggingface.co/docs/text-generation-inference/i...

5 pts
The Verge AI Industry 9h ago

The gen AI Kool-Aid tastes like eugenics

Like many people, director Valerie Veatch was intrigued when OpenAI first released its Sora text-to-video generative AI model to the public in 2024. Though she didn't fully understand the technology, she was curious about what it could do, and she saw that other artists were buil...

4 pts
r/LocalLLaMA Community 24m ago

Nemotron-Cascade-2 10GB MAC ONLY Scores 88% on MMLU.

Even if someone did happen to make an MLX quant of this size (10 GB), it would be completely incoherent at 2-bit. https://huggingface.co/JANGQ-AI/Nemotron-Cascade-2-30B-A3B-JANG_2L Mistral 4 at 30-40 GB and a 60-70 GB version are coming out later today. submitted by /u/HealthyCo...

4 pts
r/LocalLLaMA Community 3h ago

I wrote a PowerShell script to sweep llama.cpp MoE nCpuMoe vs batch settings

Hi all, I have been playing around with Qwen 3.5 MoE models and found that the sweet-spot tradeoff between nCpuMoe and batch size for speed isn't linear. I also kept rerunning the same tests across different quants, which got tedious. If there is a tool/script that does this alr...
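The post's script is not shown, but the idea (sweep every nCpuMoe × batch-size pair and benchmark each) can be sketched in Python. This is a minimal sketch, not the author's script: the flag names `--n-cpu-moe` and `-b`, the model filename, and the grid values are assumptions; check your build's `llama-bench --help` before running.

```python
import itertools
import shlex

# Hypothetical sweep grid; adjust to your hardware and model.
N_CPU_MOE = [0, 8, 16, 24]          # MoE expert layers kept on CPU (assumed flag)
BATCH_SIZES = [256, 512, 1024]      # llama.cpp batch size (-b)
MODEL = "qwen3.5-moe-Q4_K_M.gguf"   # placeholder model path

def build_commands(model, n_cpu_moe_values, batch_sizes):
    """Build one llama-bench command line per (nCpuMoe, batch) pair.

    Assumes llama-bench accepts --n-cpu-moe and -b; flag names vary
    across llama.cpp versions, so verify against your binary.
    """
    cmds = []
    for ncm, b in itertools.product(n_cpu_moe_values, batch_sizes):
        cmds.append(["llama-bench", "-m", model,
                     "--n-cpu-moe", str(ncm), "-b", str(b)])
    return cmds

if __name__ == "__main__":
    # Print the full sweep so it can be piped to a shell or a job runner.
    for cmd in build_commands(MODEL, N_CPU_MOE, BATCH_SIZES):
        print(shlex.join(cmd))
```

Running each printed command and parsing the tokens/s column would reproduce the sweep the post describes.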

4 pts
r/LocalLLaMA Community 9h ago

Fixing Qwen thinking repetition

OK, so I found the fix for Qwen thinking repetition: pasting this system prompt from Claude fixes it completely. Other long system prompts might also work. I use a presence penalty of 1.5, everything else at llama.cpp webui defaults, no KV cache quantization (f16), and I use a q...
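The setup above (long system prompt plus presence penalty 1.5, everything else default) can be sketched as a request to llama.cpp server's OpenAI-compatible chat endpoint. This is a minimal sketch: the system prompt text here is a stand-in (the post's actual Claude prompt is not shown), and it assumes you are running `llama-server` with that endpoint enabled.

```python
import json

# Stand-in for the long system prompt the post pastes from Claude.
SYSTEM_PROMPT = "You are a careful, concise assistant..."

def build_request(user_msg, presence_penalty=1.5):
    """Build a chat payload for llama.cpp's /v1/chat/completions.

    presence_penalty=1.5 mirrors the post's setting; all other sampling
    parameters are omitted so the server keeps its webui defaults.
    """
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_msg},
        ],
        "presence_penalty": presence_penalty,
    }

if __name__ == "__main__":
    print(json.dumps(build_request("Explain MoE routing."), indent=2))
```

POSTing this JSON to a running llama-server would apply the same anti-repetition settings the post reports.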