Imagine asking the same person the same question twice, but getting slightly different answers. Thinking Machines is a team with deep roots in creating some of the most transformative AI systems we know today — ChatGPT, Character.ai, and other leading platforms. They’ve now launched a brand-new research group, and their first blog post already tackles a question that has puzzled both researchers and practitioners:

👉 Why do large language models sometimes give different answers, even when we set temperature to zero?

For many, the assumption has been simple: “temperature=0” should mean deterministic output — the same prompt, the same answer, every time. Yet in practice, that hasn’t always been true.

🔍 In their blog, the Thinking Machines team dives deep into this issue of nondeterminism in LLM inference and uncovers what’s really going on:
» Floating point math is non-associative. Even tiny changes in the order of addition or multiplication can cause slightly different results (quick sketch after this post).
» Parallel computing on GPUs makes this worse. Operations may not always run in the same sequence, producing small but noticeable output differences.
» Inference engines add complexity. Optimizations like kernel fusion, graph compilers, and distributed execution can make the math path non-reproducible.
» Tie-breaking between equally likely tokens isn’t always consistent. This can lead to diverging outputs, even with identical settings.

💡 What’s powerful about this work is not just the diagnosis, but the solutions they propose. The team shows how to configure inference so that outputs really are deterministic:
» Using deterministic kernels and carefully chosen compute libraries.
» Controlling seeds and ensuring fixed tie-breaking behavior.
» Structuring execution so floating-point order is preserved.

This matters because scientific reproducibility and engineering trust depend on being able to reproduce results exactly. In research, we want to confirm findings without hidden randomness. In production, companies want users to have consistent, reliable interactions with AI systems.

✨ With this first blog, Thinking Machines has shown what their new research direction is all about: digging into hard, foundational issues in AI systems, and pushing for solutions that make large-scale models more reliable, transparent, and useful.

📖 You can read the full blog here: https://lnkd.in/e38jfiJt
Thinking Machines explores nondeterminism in LLM inference
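To make the floating-point point concrete, here is a minimal NumPy sketch (my own illustration, not code from the Thinking Machines post): it sums the same float32 values sequentially and then in a pairwise tree order, the latter being roughly how a parallel reduction groups its additions, and the two totals usually disagree in the last bits.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000).astype(np.float32)

# Sequential, left-to-right accumulation.
sequential = np.float32(0.0)
for v in x:
    sequential += v

def pairwise_sum(a):
    # Tree reduction: roughly how a parallel reduction groups the adds.
    if len(a) == 1:
        return a[0]
    mid = len(a) // 2
    return pairwise_sum(a[:mid]) + pairwise_sum(a[mid:])

tree = pairwise_sum(x)
print(sequential, tree, sequential == tree)  # totals usually differ in the low bits
```

Scale that up to billions of additions per forward pass, and small per-request differences stop being surprising.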
More Relevant Posts
-
Thinking Machines Lab, founded by OpenAI’s former CTO Mira Murati, published their first research post earlier this month. The post, titled "Defeating Nondeterminism in LLM Inference", tackles the problem of “reproducibility” in language models. In simple terms: even with temperature = 0, why do language models give different answers to the same question, and how can this be solved?

The most common hypothesis for this problem was concurrency + floating-point non-associativity, which states that (a+b)+c is not the same as a+(b+c). Under that hypothesis, the order in which concurrent threads finish determines the final output when each thread atomically adds its partial result. However, in the typical forward pass of an LLM, there is usually not a single atomic add present.

According to the post, the real culprit is a lack of batch invariance. Servers handle requests in batches, and depending on the number of requests, the batch size changes; since the GPU math depends on the batch size/layout, the output changes too. In other words, your output depends on the parallel requests of other users.

The proposed solution is to fix the reduction order with a “fixed split-size” strategy. In other words, instead of fixing the number of splits, the size of each split is fixed, and we end up with a varying number of splits. In this manner, we can guarantee that regardless of how many tokens we’re processing, we always perform the identical reduction order (rough sketch below).

There is a performance trade-off with this approach, but the cost of not fixing the problem is huge: AI companies doing research can't reproduce their own experiments reliably, businesses using AI for critical decisions get inconsistent results, and training new AI models becomes far more expensive when you can't trust your outputs.

Read the full post here - https://lnkd.in/gHWsZTHR
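As a rough NumPy sketch of the fixed split-size idea described above (my own illustration under assumptions; the function names and sizes are made up and this is not the lab's kernel code):

```python
import numpy as np

def reduce_fixed_num_splits(x, num_splits=8):
    # Fixed NUMBER of splits: the split size (and so the grouping of the
    # additions) changes with the length of x, so the reduction order is
    # not stable across workloads.
    partials = [np.float32(c.sum(dtype=np.float32))
                for c in np.array_split(x, num_splits)]
    return np.float32(sum(partials))

def reduce_fixed_split_size(x, split_size=4096):
    # Fixed SIZE of splits: every element is always grouped with the same
    # neighbours, so the grouping of additions is identical no matter how
    # much work is being processed; only the number of splits varies.
    partials = [np.float32(x[i:i + split_size].sum(dtype=np.float32))
                for i in range(0, len(x), split_size)]
    return np.float32(sum(partials))
```

The design choice is to let the number of splits vary so that the grouping of additions, and therefore the floating-point result, never depends on how much work happens to be in flight.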
-
💭 Ever noticed that your LLM sometimes gives different answers… even with the same prompt and temperature = 0? You’re not alone - and it’s not a bug. It’s a deeper issue in how large language models run inference.

I recently came across this brilliant piece by Thinking Machines Lab - Defeating Nondeterminism in LLM Inference - and it completely changed how I think about reproducibility in AI systems. We tend to assume that if randomness is removed, the model should always produce the same result. But here’s what’s really going on 👇

⚙️ 1. Floating-point math isn’t exact. Tiny rounding errors in (a + b) + c ≠ a + (b + c) accumulate through billions of operations — creating subtle differences in output.

📦 2. Batch size changes everything. Your result can depend on how many other requests the server processes at the same time. Different batch compositions → slightly different computations → different outputs.

🧩 3. The fix: “batch-invariant” kernels. By redesigning GPU kernels (like RMSNorm and attention) to produce the same result no matter the batch size, the Thinking Machines team achieved true deterministic inference.

The impact is striking: 1,000 identical prompts (temperature = 0) → 80 different outputs with standard inference → 1,000 identical outputs with the batch-invariant version.

💡 The real challenge isn’t just making LLMs smarter — it’s making them consistent. Because reliability in AI isn’t only about accuracy or scale — it’s about being predictably right. As LLMs become core to enterprise systems, ensuring deterministic inference will matter as much as speed or efficiency.

👉 Full article here: https://lnkd.in/exf9QEGF

#AI #MachineLearning #LLM #ArtificialIntelligence #DeepLearning
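If you want to run the 1,000-prompt experiment against your own stack, a sketch of the harness might look like this. Note that `generate` is assumed to be your own wrapper around whatever inference endpoint you use (called with temperature = 0); it is not a real library call.

```python
from collections import Counter

def count_unique_completions(generate, prompt, n=1000):
    """Send the same prompt n times and count distinct completions.

    `generate` is a placeholder for your own client wrapper around an
    inference endpoint; swap in whatever you actually use.
    """
    completions = Counter(generate(prompt) for _ in range(n))
    return len(completions), completions.most_common(3)
```

On a standard serving setup the unique count tends to land well above 1; with batch-invariant inference the post reports it collapses to exactly 1.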
-
Mira Murati, OpenAI’s former CTO, announced a breakthrough that tackles Gen AI consistency after launching her new business with a $2B seed round.

Mira and her team at Thinking Machines Lab, in work led by Horace He, released a piece titled 'Defeating Nondeterminism in LLM Inference' (see: https://lnkd.in/e-K7nywB). It tackles the problem of “reproducibility” in large language models, to make AI more reliable. You've probably noticed that even when you set AI models to their most predictable setting (temperature = 0), you still get different answers to the same question.

Every business leader I speak with brings up the trust problem in AI systems, whether I'm discussing variance in AI outputs with board members or legal teams: 'How can you trust a system that gives you a different answer each time you ask it the exact same question?' Thankfully Mira's team has solved this.

💡 Their solution: new batch-invariant kernels. They found that the root cause isn’t randomness; it's that the GPU kernels handling inference aren't consistent across different batch sizes, so your results can actually change depending on the load from other users on the same server. Mira has released their solution as open-source code - true to her promise of “science is better when shared”.

💎 The result? Same input, same output, every time. Great news for compliance teams in every business, particularly in healthcare, finance, and law where reproducibility is essential. But Mira acknowledges that even if those answers were consistent, they could still be wrong. I admire her honesty. That means:
Today: same enquiry = five different responses, one may (or may not) be right.
Tomorrow (with determinism): same enquiry = the same response five times… but it may still be wrong.

Her team tested this solution 1,000 times and got identical results every run. It slows things down - taking 42 seconds versus 26 - but wouldn't you rather wait an extra 16 seconds for something you can rely on, especially if you're making executive decisions or treating patients?
-
The Subtle Randomness of AI: Why Identical Prompts Yield Different Answers

Ever passed the same prompt to an LLM — with temperature = 0 — and got different results? That’s not your imagination. It’s a subtle but critical problem in modern inference pipelines: nondeterminism.

Horace He and Thinking Machines Lab just published an excellent piece explaining why this happens and how to address it. The usual suspects (floating-point rounding, concurrency) only tell part of the story. The real issue is the lack of batch invariance: dynamic batching alters the numerical path of computations, making outputs depend on how requests are grouped, not on their content.

Their solution? Re-engineering key kernels — RMSNorm, MatMul, Attention — to become batch-invariant, ensuring bit-for-bit reproducibility even under load. Expected result: the same input, the same output — every single time.

Why it matters:
- Determinism improves debuggability and safety.
- It aligns training and inference behaviour.
- It helps build trustworthy agentic systems and reproducible research.

This is a quiet but foundational step toward reliable AI infrastructure; something that matters far more than hype.

Read the original post here: https://lnkd.in/ghSEM5Bd

#AI #LLM #MachineLearning #Reproducibility #AgenticAI #IdeasArtificiales
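To see what "batch-invariant" means in the simplest possible terms, here is a toy NumPy RMSNorm (a simplified sketch of mine, not the re-engineered kernel from the article): each row's reduction is computed independently and in a fixed order, so its output cannot depend on which other rows share the batch. Plain NumPy already behaves this way; the hard part the article addresses is preserving the same property in GPU kernels whose parallelization strategy changes with batch size.

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    # x: (batch, hidden). Each row's mean-square is reduced independently,
    # so a row's normalized output does not change when more rows are
    # stacked into the same batch.
    ms = np.mean(x.astype(np.float32) ** 2, axis=-1, keepdims=True)
    return x * (1.0 / np.sqrt(ms + eps)) * weight
```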
-
Defeating Nondeterminism in LLM Inference

One of the biggest frustrations with large language models is how the same prompt can give different results—even when you set everything to “deterministic.” Thinking Machines just shared a deep dive into this issue and proposed solutions that make outputs truly reproducible. Their approach tackles the hidden sources of nondeterminism (like batch size variance) and introduces batch-invariant kernels to ensure bit-identical results across runs.

This is a big step forward. Deterministic outputs mean more reliable debugging, safer deployment, and less technical debt for teams building real-world AI systems.

Read more: https://lnkd.in/gKHbbJ_y
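A sketch of how you might check the "bit-identical regardless of batch" claim for any function you control (my own helper, not part of any library):

```python
import numpy as np

def is_batch_invariant(f, row, other_rows):
    """Check that f gives a bit-identical result for `row` whether it is
    processed alone or stacked with unrelated rows in one batch."""
    alone = f(row[None, :])[0]
    batched = f(np.vstack([row[None, :], other_rows]))[0]
    return np.array_equal(alone, batched)

# Example with a simple per-row normalization (hypothetical stand-in for
# whatever batched function you actually serve):
rng = np.random.default_rng(0)
row = rng.standard_normal(64).astype(np.float32)
others = rng.standard_normal((7, 64)).astype(np.float32)
print(is_batch_invariant(
    lambda b: b / np.linalg.norm(b, axis=-1, keepdims=True), row, others))
```

With NumPy on CPU this check typically passes; the interesting case is pointing it at a real GPU inference path, which is where batch-size-dependent kernels show up.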
-
🔍 Same Prompt. Same Settings. Different Answers.

Ever noticed how LLMs sometimes change their response even when you don’t change the prompt or temperature? For a long time, we brushed this off as “randomness” or “GPU quirks.” But Mira Murati’s new lab Thinking Machines just showed the real reason:

👉 Batch-sensitive kernels. When multiple requests are batched together, or when the sequence is chunked differently, the math inside reductions (RMSNorm, matmuls, attention) happens in a different order. Since floating-point math isn’t perfectly associative, the final result shifts — leading to nondeterministic outputs.

Their proposed solution: Batch-invariant kernels. This means no matter how requests are batched or sliced, the arithmetic order stays fixed → outputs become reproducible.

Why this matters:
• ✅ Consistent outputs for the same input (trust in critical systems)
• ✅ Reliable CI pipelines and A/B testing
• ✅ Safer deployments in regulated industries like finance & healthcare

Yes, there’s a trade-off — determinism may slightly reduce performance — but reproducibility is a huge step forward for building trustworthy AI.

Read the full deep-dive from Thinking Machines here: 🔗 https://lnkd.in/guwZtEGW

💡 What do you think? Would you choose deterministic but slightly slower inference, or faster, non-deterministic inference in production?

#AI #LLM #MLOps #GenAI #Reproducibility #ThinkingMachines
-
📖 A few good reads lately:

⚜️ Breakdown of the design of NotebookLM: https://lnkd.in/gJ-tMnNW. NotebookLM is one of my favourite AI products, and the thought that its creators put into its design underlines the importance of thinking deeply about the user journey and interaction patterns when designing AI products.

🕵️ Definition of agent: https://lnkd.in/giST7UUb We may finally have a definition of 'agent' that is non-buzzwordy, non-cringe-worthy, and non-jargony, and I can get behind it.
✅ An LLM agent runs tools in a loop to achieve a goal (a minimal loop sketch follows this post).
❌ Agents as human replacements, because the features that still remain unique to humans are accountability and agency.

🎡 LLMs being non-deterministic has long been a challenge in designing applications with them. New research from Thinking Machines explores the possibility of defeating nondeterminism in LLM inference: https://lnkd.in/gcmgnikG.
🌟 Key takeaway: A commonly held view blames floating-point non-associativity and concurrency for nondeterminism. While floating-point math can yield minute differences, the main culprit is how inference workloads are batched. A user's result depends on how many other user requests are being handled and how those are grouped into batches, which is inherently variable and hidden from product-level logic.

What have you read lately that you would recommend?

#ai #machinelearning #LLM #notebookLM #agents
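Since "an LLM agent runs tools in a loop to achieve a goal" maps almost literally onto code, here is a minimal loop sketch; `call_model` and `tools` are placeholders for your own LLM client and tool registry, not real APIs.

```python
def run_agent(goal, call_model, tools, max_steps=10):
    """Minimal tool-use loop: ask the model what to do, run the tool,
    feed the result back, stop when the model says it's done.
    `call_model` and `tools` are hypothetical placeholders."""
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        action = call_model(history)  # e.g. {"tool": ..., "args": ...} or {"final": ...}
        if "final" in action:
            return action["final"]
        history.append({"role": "assistant", "content": str(action)})
        result = tools[action["tool"]](**action["args"])
        history.append({"role": "tool", "content": str(result)})
    return None  # gave up after max_steps
```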
-
Reliability has always been the elephant in the room with LLMs. Hallucinations make it hard to trust AI in high-stakes settings, and that's a barrier we need to solve if we want AI to play a meaningful role in travel planning, visitor services or marketing.

Two articles I read this week give me some real optimism that we're close to a breakthrough:
OpenAI – Why Language Models Hallucinate: https://lnkd.in/gudk-wWN
Thinking Machines – Defeating Nondeterminism in LLM Inference: https://lnkd.in/gz9kyhuA

Both pieces dig into why AI sometimes "makes things up" and, more importantly, what's being done to make outputs more stable and trustworthy. Together, they suggest that the next six months could bring major improvements in reliability.

For those of us in the travel and destination space, that's huge. Imagine AI tools you can trust to give visitors accurate information every time, or assistants that DMOs can confidently use in front of travelers without the risk of "creative" errors.

If you've been experimenting with AI in your organization, how big a difference would true reliability make for you?
-
New blog from Thinking Machines (Mira Murati’s new company) about defeating nondeterminism in LLM inference. Deterministic LLMs — models that always return the same answer to the same input 🤗🫡 This could completely change the rules of AI 👑 https://lnkd.in/dvPtwGVq
-
✨ Imagine asking the same question one thousand times on ChatGPT… and getting one thousand identical answers. That might sound obvious, but it’s not what happens today. Even with “deterministic” settings, large language models often produce slightly different answers to the same question.

The article Defeating Nondeterminism in LLM Inference explains why: tiny quirks in floating-point math, batching, and caching make outputs inconsistent. It’s a fascinating, step-by-step breakdown of why AI systems sometimes feel less predictable than we think.

For practitioners, the real value is in how the article unpacks nondeterminism at every layer of inference: from floating-point non-associativity to GPU reduction ordering to cache-vs-no-cache discrepancies in attention. The proposed solution, batch-invariant kernels that enforce deterministic reduction order and cache alignment, directly tackles these issues. Their prototype in vLLM demonstrates that we don’t have to trade reproducibility for performance.

The closing line stuck with me: “We reject this defeatism.” Too often, nondeterminism is accepted as the price of scale. This piece reframes reproducibility as a baseline requirement for trustworthy AI. Deterministic inference isn’t just about consistency, it’s about building AI systems we can debug, audit, and ultimately trust. A must-read for anyone working at the intersection of research and production.

https://lnkd.in/gPxmFFFt

#AI #MachineLearning #LLM #Reproducibility #MLOps #ArtificialIntelligence #DeepLearning
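On the cache-vs-no-cache point: a toy NumPy sketch of mine (not the article's code) that computes single-query attention in one pass versus block-by-block with an online softmax, roughly how a decode kernel walks a paged KV cache. The two are mathematically identical but are not guaranteed to match bit-for-bit, which is exactly the kind of gap order-fixed, batch-invariant kernels are meant to close.

```python
import numpy as np

def attn_full(q, K, V):
    # One query against the whole KV "cache" in a single pass.
    s = (K @ q) / np.float32(np.sqrt(q.shape[0]))
    w = np.exp(s - s.max())
    return (w @ V) / w.sum()

def attn_chunked(q, K, V, chunk=32):
    # Same computation, walking the KV in blocks with an online softmax.
    m, denom = np.float32(-np.inf), np.float32(0.0)
    acc = np.zeros(V.shape[1], dtype=np.float32)
    for i in range(0, len(K), chunk):
        s = (K[i:i + chunk] @ q) / np.float32(np.sqrt(q.shape[0]))
        new_m = max(m, np.float32(s.max()))
        scale = np.float32(np.exp(m - new_m))   # rescale previous partials
        w = np.exp(s - new_m)
        acc = acc * scale + w @ V[i:i + chunk]
        denom = denom * scale + np.float32(w.sum())
        m = new_m
    return acc / denom

rng = np.random.default_rng(0)
K = rng.standard_normal((128, 64)).astype(np.float32)
V = rng.standard_normal((128, 64)).astype(np.float32)
q = rng.standard_normal(64).astype(np.float32)
# May print False on your machine: last-bit drift from the different
# reduction order, even though both compute "the same" attention.
print(np.array_equal(attn_full(q, K, V), attn_chunked(q, K, V)))
```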