Google researchers introduce ‘faithful uncertainty,’ allowing LLMs to offer best guesses instead of hallucinations
Large language models continue to struggle with hallucinations, presenting a major roadblock for real-world enterprise applications. Reducing these errors is a messy business, forcing model developers to navigate a strict tradeoff where eliminating factual errors often suppresses valid answers. In a new paper, Google researchers introduce the concept of “faithful uncertainty,” a metacognitive technique that […]
NanoClaw and JFrog launch ‘immune system’ to block AI agents from downloading malicious code
The creators of the hit, enterprise-friendly, open source OpenClaw variant NanoClaw are partnering with software supply chain management leader JFrog have to launch a new, joint security integration they say will protect NanoClaw autonomous agents from malicious code injection. “These agents are doing things that you cannot necessarily control, and you cannot necessarily train,” said […]
PixelRAG beats text parsers on accuracy and cuts AI agent token costs 10x
Most enterprise RAG pipelines start the same way: a text parser converts web pages and documents into plain text so they can be chunked and indexed for retrieval. That conversion step destroys retrieval signals — and according to new research, it’s responsible for the majority of wrong answers. A research team from UC Berkeley, Princeton […]
Xiaomi’s new open source, agentic AI coding harness MiMo Code beats Claude Code at ultra-long, 200+ step tasks
Xiaomi’s MiMo AI team has open-sourced MiMo Code V0.1.0, a terminal-native AI coding assistant that the Chinese electronics giant says outperforms Anthropic’s Claude Code on key agentic coding benchmarks, especially on long-horizon, multi-step tasks (200+ steps) — at least, according to its own internal beta release and survey of 576 developers. It’s also bundling limited-time […]
Context compression finally works in production: new research cuts LLM input 16x without the accuracy hit
Context windows are becoming a computational bottleneck. The longer an agent runs, the more tokens accumulate from retrieved documents, reasoning traces and conversation history, and the more memory and compute that growing context demands. Most existing solutions either degrade model accuracy, require the full context to load before compression begins, or produce memory savings that […]
Microsoft’s open-source SkillOpt automatically upgrades AI agent skills without touching model weights
Agent skills have become an important part of real-world AI applications, providing a mechanism — a set of instructions saved in a folder of text-based markdown (.md) files, usually — for models to adapt to specific enterprise use cases and complex workflows. However, optimizing these skills is a slow process and faulty process, as they […]
What AI benchmarks miss about real-world performance
Presented by F5 Enterprise AI teams have spent years solving for compute, securing GPU allocations, negotiating cloud capacity, and benchmarking training throughput. The assumption embedded in that work is that the path between storage and compute will keep up. In production, that assumption increasingly does not hold. Real traffic introduces latency spikes, network jitter, and […]
Google’s DiffusionGemma generates 256 tokens in parallel and self-corrects as it goes
GenAI image generators like Stable Diffusion do not draw a picture pixel by pixel from left to right. They start with noise and iteratively refine the entire image in parallel until it converges, in a process known as diffusion. For years, applying that same principle to text generation had remained out of reach at scale. […]
Why AI that works in the lab often fails in production — and what actually fixes it
Presented by Capital One Enterprises aren’t struggling to experiment with AI; they’re struggling to make it work in the real world. Moving from promising prototypes to reliable, production-scale systems is where most efforts stall. In my role within Capital One’s AI Foundations organization, I’ve seen firsthand that successful AI implementation isn’t just about adopting the […]
Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark
Researchers from the University of California, Berkeley’s Center for Responsible, Decentralized Intelligence (RDI), alongside an advisory committee of over 300 domain experts, have launched Agents’ Last Exam (ALE)—a grueling new benchmark built to measure whether artificial intelligence can actually execute economically valuable, long-horizon professional workflows. In a shocking upset, OpenAI’s GPT-5.5 from April, operating through […]
