Nvidia says software cuts DeepSeek V4 token costs fivefold

Wed, 1st Jul 2026

Nvidia said its inference software has cut token costs for the DeepSeek V4 model on the Blackwell platform by up to a factor of five, following software improvements made over about a month.

The announcement focuses on the economics of running artificial intelligence models at scale. Nvidia argued that cost per token is becoming more important than raw chip specifications as companies move from pilot projects to live deployments. Tokens are the units of text or data that large language models process and generate, and the cost of serving them has become a key benchmark for AI service operators.

Nvidia said its software stack spans model serving, runtime scheduling, kernels, communication libraries and hardware-level optimisation. On Blackwell systems, those layers can combine techniques such as disaggregated serving, large expert parallelism over NVLink, NVFP4 precision and multi-token prediction to increase throughput by as much as 20 times compared with a baseline setup.

The update also highlights the role of software in determining how efficiently AI infrastructure runs. While traditional web and software workloads often scale by adding similar servers for broadly similar tasks, Nvidia said agentic AI creates more complex, distributed workflows across models, tools, memory, networking and storage.

That complexity matters because a single user request can trigger multiple tasks across different systems. Nvidia said the software layer determines whether those tasks use compute resources efficiently or add cost through idle capacity, poor scheduling or inefficient communication between processors and memory.

Customer examples

Nvidia cited several companies as users of the software stack on Blackwell-based systems. Baseten used the open source TensorRT-LLM library to serve DeepSeek V4 Pro for reasoning, coding and long-context workloads, and Nvidia said proprietary runtime changes helped Baseten deliver up to 50% more tokens per second.

Cognition is using the Dynamo inference framework to manage the graphics processors it uses for inference, according to Nvidia. Nvidia said this allows Cognition to scale reinforcement learning workloads without building the underlying infrastructure itself.

Deep Infra was also identified as having used the software stack from the outset to serve open source frontier models, including DeepSeek V4, on Blackwell systems. Together AI, meanwhile, used TensorRT-LLM on Blackwell in work linked to Cursor's real-time coding service.

Open source role

Nvidia framed much of the improvement within the broader open source software ecosystem built around CUDA. It said many widely used AI frameworks and inference projects are designed to run natively on its computing platform, allowing new research and optimisation work to move quickly into production use on Nvidia hardware.

PyTorch featured prominently in that argument. Nvidia said the framework, first launched with native CUDA support, has evolved alongside its chip architectures and gives developers access to technologies such as Tensor Cores, Transformer Engine and NVFP4 through a familiar software environment.

Nvidia pointed to examples including DFlash speculative decoding, which it said can deliver up to 15 times more throughput on existing hardware, and FastVideo, which it said can generate 1080p video in less than five seconds. Its broader point was that when such methods arrive in common frameworks, they can be used immediately on Nvidia systems without extensive rewrites.

That dynamic also affects the deployment of new open models. Nvidia said that when DeepSeek V4 became available, frameworks including vLLM and SGLang had day-one deployment recipes for Blackwell, making the model available across large installed fleets of Blackwell processors.

Nvidia said DeepSeek V4 performance on Blackwell improved by up to five times within about a month across the vLLM and SGLang frameworks. According to the company, that reduced token costs to about one-fifth of earlier levels.
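
The link between a fivefold throughput gain and a one-fifth token cost can be illustrated with a short calculation. The sketch below is not from Nvidia: it assumes cost per token is simply the hourly hardware cost divided by tokens served per hour, and the hourly price and throughput figures are hypothetical placeholders chosen only to show the ratio.

```python
def cost_per_million_tokens(gpu_hour_cost_usd: float, tokens_per_second: float) -> float:
    """Cost in USD to serve one million tokens at a steady throughput.

    Assumes the only cost is a fixed hourly hardware price (an illustrative
    simplification; real deployments also pay for power, networking and idle time).
    """
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical figures, not taken from the article.
GPU_HOUR_COST_USD = 4.0
BASELINE_TOKENS_PER_SECOND = 1_000.0
SPEEDUP = 5.0  # the "up to five times" improvement cited by Nvidia

baseline_cost = cost_per_million_tokens(GPU_HOUR_COST_USD, BASELINE_TOKENS_PER_SECOND)
improved_cost = cost_per_million_tokens(
    GPU_HOUR_COST_USD, BASELINE_TOKENS_PER_SECOND * SPEEDUP
)

# With hardware cost held fixed, cost per token scales as 1 / throughput.
cost_ratio = improved_cost / baseline_cost
print(f"baseline: ${baseline_cost:.4f} per million tokens")
print(f"improved: ${improved_cost:.4f} per million tokens")
print(f"ratio:    {cost_ratio:.2f}")
```

Under that assumption the hourly price cancels out, so the ratio depends only on the speedup: the same hardware serving five times as many tokens per hour costs one-fifth as much per token.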

Economic focus

The emphasis on token costs reflects a broader shift across the AI industry, where spending on chips, networking and electricity has increased pressure on developers and cloud providers to show that deployments can operate economically. Throughput is no longer the only selling point, particularly for services that must meet latency targets while handling interactive user requests.

Nvidia's position is that software improvements can continue to lower inference costs even after hardware has been installed. That is an important message for cloud operators and enterprise buyers weighing whether newer generations of infrastructure can deliver better returns without relying solely on headline chip performance.

Nvidia also cited benchmark material from SemiAnalysis InferenceX comparing token cost and interactivity for GB300 NVL72 systems using SGLang and the Dynamo inference framework, as well as throughput comparisons for GB200 NVL72 systems using vLLM and Dynamo.

The figures form part of Nvidia's broader effort to argue that AI infrastructure economics depend as much on software as on semiconductors, networking and system design, with token output per dollar, per watt and within latency targets emerging as the metric it says matters most.
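
As a rough illustration of how such a metric might be computed, the sketch below ranks serving configurations by tokens per dollar among those that meet a latency target, and also reports tokens per joule. It is a minimal example under stated assumptions: the `ServingConfig` type, its field names and all figures are hypothetical and do not come from Nvidia or SemiAnalysis.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServingConfig:
    """Hypothetical description of one serving setup (all fields are assumptions)."""
    name: str
    tokens_per_second: float
    cost_per_hour_usd: float
    power_watts: float
    latency_ms: float


def tokens_per_dollar(cfg: ServingConfig) -> float:
    # Tokens served in one hour divided by the hourly cost.
    return cfg.tokens_per_second * 3600 / cfg.cost_per_hour_usd


def tokens_per_joule(cfg: ServingConfig) -> float:
    # One watt is one joule per second, so tokens/s divided by watts gives tokens/joule.
    return cfg.tokens_per_second / cfg.power_watts


def best_within_latency(
    configs: list[ServingConfig], max_latency_ms: float
) -> Optional[ServingConfig]:
    """Return the most cost-efficient config that meets the latency target, if any."""
    eligible = [c for c in configs if c.latency_ms <= max_latency_ms]
    return max(eligible, key=tokens_per_dollar, default=None)


# Placeholder figures for illustration only.
configs = [
    ServingConfig("low-latency", 1_000.0, 4.0, 1_000.0, 50.0),
    ServingConfig("high-throughput", 3_000.0, 4.0, 1_000.0, 200.0),
]

interactive_pick = best_within_latency(configs, max_latency_ms=100.0)
batch_pick = best_within_latency(configs, max_latency_ms=250.0)
```

In this toy data the higher-throughput setup wins on tokens per dollar only when the latency target is loose enough to admit it, which is the trade-off between cost and interactivity that the benchmark comparisons describe.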
