
An Overview of NVIDIA Blackwell Ultra (B300 and GB300 GPUs)

Explore NVIDIA Blackwell Ultra: the B300 GPU and the GB300 NVL72 rack-scale system shaping the next era of AI inference, reasoning, and large-model deployment.
Deepak Manoor
Posted: December 6, 2025

    For the past few years, the story of data center GPUs has been about training ever-larger models. With NVIDIA's Blackwell Ultra, that center of gravity shifts. The NVIDIA Blackwell Ultra architecture is explicitly tuned for reasoning and test-time compute: the phase where AI models consume far more tokens than they ever saw during training and where latency, power, and memory behavior matter as much as raw peak FLOPS.

    Within that broader story, two products quietly define the new "unit of compute" and the new "unit of deployment" for AI infrastructure: the B300 GPU and the GB300 NVL72 rack-scale system. Together, they turn Blackwell Ultra from an incremental update into a blueprint for how enterprise and sovereign AI will be built and operated in massive AI factories.

    What “Ultra” Actually Changes Inside Blackwell

    Blackwell Ultra continues the trajectory of Blackwell, but with design choices that strongly favor AI inference, reasoning, and enormous memory and interconnect capacity.

    • At the silicon level, Ultra delivers roughly 1.5× higher performance than standard NVIDIA Blackwell. Native support for the NVFP4 precision format effectively doubles usable compute density and reduces memory footprint while maintaining accuracy (see the quantization sketch below).
    [Figure: Blackwell Ultra performance. Source: NVIDIA]
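    To make the block-scaling idea concrete, here is a minimal NumPy sketch of 4-bit quantization in the spirit of NVFP4. The E2M1 value grid matches the 4-bit float format, but the 16-element block size and the full-precision scale factors are simplifying assumptions here; real NVFP4 kernels reportedly store compact FP8 block scales and do all of this in hardware.

```python
import numpy as np

# Representable magnitudes of an E2M1 (4-bit float) value.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 16  # elements sharing one scale factor (assumed block size)

def quantize_fp4_blockwise(x):
    """Quantize a 1-D array to block-scaled FP4. Returns (codes, scales)."""
    x = x.reshape(-1, BLOCK)
    # One scale per block, chosen so the block max maps onto the FP4 max (6.0).
    scales = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scales[scales == 0] = 1.0                      # avoid divide-by-zero
    scaled = x / scales
    # Snap each scaled value to the nearest representable FP4 magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    codes = np.sign(scaled) * FP4_GRID[idx]
    return codes, scales

x = np.random.randn(64).astype(np.float32)
codes, scales = quantize_fp4_blockwise(x)
x_hat = (codes * scales).reshape(-1)               # dequantize
print("max abs error:", np.abs(x - x_hat).max())
```

    The takeaway: each value is stored in 4 bits plus a small per-block scale, which is where the doubled compute density and reduced memory footprint come from.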

    • Another key improvement over standard Blackwell is the doubling of attention-layer acceleration, an area that now dominates inference cost for long-context and reasoning-heavy models. Attention mechanisms rely on GEMMs (general matrix multiplications) and softmax operations, and softmax is dominated by exponential calculations. While matrix multiplications have become faster with each GPU generation, the special function unit (SFU) that performs exponentials and other transcendental math hasn't kept pace. Dive deeper into this asymmetry in this excellent blog post from PyTorch. With the SFU enhancements, Blackwell Ultra now delivers more powerful attention mechanisms, which means faster reasoning and lower compute costs for transformer models (see the sketch below).
    [Figure: NVIDIA B300 vs. B200. Source: NVIDIA]
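    To see where those exponentials actually sit, here is a textbook single-head attention in PyTorch (a generic sketch, not NVIDIA's kernel); the comments mark which hardware unit each step stresses.

```python
import torch

def attention(q, k, v):
    """Scaled dot-product attention, annotated by hardware unit."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5   # GEMM #1: Tensor Cores
    probs = torch.softmax(scores, dim=-1)       # exponentials: SFU-bound
    return probs @ v                            # GEMM #2: Tensor Cores

seq, d = 4096, 128
q, k, v = (torch.randn(seq, d) for _ in range(3))
print(attention(q, k, v).shape)  # torch.Size([4096, 128])
```

    Note that the softmax operates on the full seq × seq score matrix, so the SFU-bound work grows quadratically with context length, which is exactly why faster transcendental math translates into cheaper long-context reasoning.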

    • Blackwell Ultra also receives a significant upgrade in high-bandwidth memory (HBM): with 288 GB of HBM3e per GPU, built from 12-high HBM stacks (versus 8-high in earlier versions), Blackwell Ultra pushes the memory ceiling far beyond what was previously feasible. This capacity increase matters enormously for large language models (LLMs), retrieval-augmented generation (RAG) systems, and mixture-of-experts (MoE) models, all of which demand large amounts of memory for weights, KV caches, and activations (a sizing sketch follows below). These changes reflect a strategic shift toward steady-state inference economics, not just model training runs.
    [Figure: NVIDIA GB300 vs. GB200. Source: NVIDIA]
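    To see why 288 GB per GPU matters in practice, here is a back-of-the-envelope KV-cache calculation. All model-shape parameters below are illustrative assumptions for a hypothetical 70B-class model with grouped-query attention, not a published configuration.

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2x for keys and values; FP16/BF16 = 2 bytes per element.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# Hypothetical 70B-class model with grouped-query attention:
per_user = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=128_000, batch=1)
print(f"KV cache per 128K-token request: {per_user:.1f} GB")          # ~41.9 GB
print(f"Concurrent 128K requests in 288 GB (ignoring weights): {288 // per_user:.0f}")
```

    Even before weights are loaded, a handful of long-context requests can consume tens of gigabytes each, which is why per-GPU HBM capacity directly sets inference concurrency.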

    Even the die composition reinforces this orientation. Ultra continues to use a dual-reticle design with more die area devoted to tensor cores and the memory paths that feed them. FP64 compute, once a symbolic measure of HPC status, has been dramatically scaled back. Ultra is not a general-purpose scientific processor; it is built for AI at industrial scale, focusing on deep learning and generative AI applications.

    The Blackwell Ultra GPU Family: B300 and GB300 NVL72

    The B300 GPU is the foundational compute element of Blackwell Ultra, designed as an inference-weighted, memory-dense processor built for modern LLMs, agentic systems, and long-context reasoning. It introduces a dual-reticle architecture, dramatic increases in NVFP4/Tensor Core throughput, and a substantial jump to 288 GB of HBM3e per GPU. B300 is intentionally specialized: it deprioritizes traditional FP64 performance to maximize low-precision efficiency, fast attention, and the memory capacity required to hold very large models or many MoE experts in situ. In practical deployments, B300 mostly appears as an 8-GPU node (DGX/HGX B300), forming the basic unit that schedulers and platform teams design around.

    The GB300 NVL72, by contrast, is a fully assembled "AI factory rack." It integrates 72 B300-class GPUs, 36 Grace CPUs, and a next-generation NVLink fabric into a single, coherent accelerator domain. With more than 20 TB of combined HBM and massive NVLink bandwidth, GB300 behaves less like a cluster of nodes and more like a monolithic super-accelerator purpose-built for AI reasoning at industrial scale. Whereas B300 optimizes the fundamentals (memory, throughput, attention performance), GB300 optimizes the system (power, cooling, interconnect, and rack-scale coherence). It is the deployment primitive used to build sovereign AI zones, multi-rack inference farms, and next-generation enterprise AI clouds.
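    As a quick sanity check, the rack-level memory figure follows directly from the per-GPU spec:

```python
# Aggregate HBM per GB300 NVL72 rack, from the per-GPU capacity above.
gpus_per_rack = 72
hbm_per_gpu_gb = 288
print(f"{gpus_per_rack * hbm_per_gpu_gb / 1000:.1f} TB of HBM")  # 20.7 TB
```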

    Here is a comparison of the B300 GPU and the GB300 NVL72 system:


    | Feature | NVIDIA B300 GPU | NVIDIA GB300 NVL72 |
    | --- | --- | --- |
    | Role | Core GPU building block for compute | Rack-scale, unified accelerator system |
    | Architecture | Dual-reticle Blackwell Ultra GPU with NV-HBI die-to-die link | 72× Blackwell Ultra GPUs + 36 Grace CPUs connected via NVLink Switch |
    | Memory | 288 GB HBM3e per GPU | ~20+ TB total HBM across 72 GPUs |
    | Compute focus | NVFP4/FP8 inference and reasoning; 1.5×–2× attention boost | Large-scale reasoning, long-context LLMs, agentic workloads at rack-level density |
    | Primary bottleneck addressed | GPU memory capacity and attention throughput | Interconnect coherence, power/cooling density, large-model horizontal scaling |
    | Deployment form | Appears in 8-GPU DGX/HGX systems | Self-contained, liquid-cooled, 120+ kW rack |
    | Optimal use cases | High-throughput LLM inference, MoE, test-time scaling, fine-tuning | Multi-trillion-parameter models, massive concurrency, sovereign AI zones |

    How Ultra Performs in Benchmarks: What MLPerf Signals

    | DeepSeek-R1 Performance | GB300 NVL72 | GB200 NVL72 | Improvement over GB200 NVL72 | NVIDIA DGX H200 | Improvement over H200 |
    | --- | --- | --- | --- | --- | --- |
    | Offline (tokens/sec/GPU) | 5,842 | 4,024 | 45.18% | 1,253 | 366.24% |
    | Server (tokens/sec/GPU) | 2,907 | 2,327 | 24.92% | 556 | 422.84% |

    Source: NVIDIA
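    The improvement columns are straightforward ratios of the per-GPU throughput figures, which is easy to verify:

```python
# Reproducing the improvement columns (improvement = new / old - 1).
results = {
    "Offline": {"GB300": 5842, "GB200": 4024, "H200": 1253},
    "Server":  {"GB300": 2907, "GB200": 2327, "H200": 556},
}
for scenario, r in results.items():
    print(f"{scenario}: +{r['GB300'] / r['GB200'] - 1:.2%} vs GB200, "
          f"+{r['GB300'] / r['H200'] - 1:.2%} vs H200")
```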

    These improvements are not just architectural. NVIDIA has acknowledged the importance of NVFP4 quantization flows, new parallelism techniques, and new sharding strategies for NVLink 5 fabrics. In other words, Ultra's performance advantage emerges from the combination of hardware specialization and software that understands how to saturate it.
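    As one concrete example of what such sharding means, here is a minimal single-process sketch of column-parallel tensor parallelism, a generic technique rather than NVIDIA's specific recipe; real deployments would use torch.distributed with NCCL over the NVLink fabric instead of in-process slices.

```python
import torch

torch.manual_seed(0)
n_shards = 4                       # stand-ins for 4 GPUs in one NVLink domain
x = torch.randn(8, 1024)           # activations (batch x hidden)
w = torch.randn(1024, 4096)        # full weight matrix of a linear layer

# Split the weight's output dimension across shards; each shard computes its
# slice of the output independently, with communication deferred until the
# results are gathered (or reduced by the next layer).
partial = [x @ shard for shard in w.chunk(n_shards, dim=1)]
y = torch.cat(partial, dim=1)

assert torch.allclose(y, x @ w, atol=1e-4)
print("sharded output matches dense output:", tuple(y.shape))
```

    The fabric determines how cheap that deferred communication is, which is why sharding strategies and NVLink 5 topology are co-designed.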

    One important signal: the largest gains consistently appear on attention-bound workloads, not dense matrix multiplications. This reinforces the central design direction: Ultra is tuned for the economics of inference and reasoning rather than brute-force FP8 training alone.

    Where Blackwell Ultra Points the Industry Next

    Prioritizing Low-Precision Compute at the Expense of FP64

    Blackwell Ultra (and B300) is optimized heavily for low-precision tensor formats (NVFP4, FP8) and attention-based workloads. That means applications requiring high-precision FP64 compute, such as traditional high-performance computing (HPC), physics simulation, and other scientific workloads, may no longer see the same relative performance uplift. This design decision signals that NVIDIA is doubling down on AI workloads (transformers, LLMs, inference, reasoning) instead of trying to be a one-size-fits-all accelerator.

    Memory Density and Thermal / Power Constraints

    With 288 GB of HBM3e per GPU and dozens of GPUs per rack, memory capacity is no longer the bottleneck; power consumption and heat density become the first-order concerns. GB300 NVL72 racks typically require liquid cooling and specially provisioned power delivery, reflecting just how dense and heavy-duty these systems are. This changes data center planning: whitespace layout, cooling loops, power phases, redundancy, and even vibration or structural support may all become relevant.
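    A back-of-the-envelope comparison shows the scale of the shift; the 120 kW figure comes from the rack class described above, while the 20 kW air-cooled ceiling is an illustrative assumption (actual limits vary widely by facility).

```python
rack_power_kw = 120        # GB300 NVL72-class rack (figure cited above)
gpus_per_rack = 72
air_cooled_rack_kw = 20    # assumed typical air-cooled rack budget

print(f"Per-GPU share of rack power: {rack_power_kw / gpus_per_rack * 1000:.0f} W")
print(f"One GB300 rack ~= {rack_power_kw / air_cooled_rack_kw:.0f} air-cooled racks")
```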

    Blackwell Ultra Sets the Stage for a More Mature Architectural Model

    The introduction of Blackwell Ultra is more than just an expansion; it represents a change of operating model for AI infrastructure. Rather than incremental gains in FLOPS, we get a redefinition of what a "GPU" is meant to do: heavy, memory-dense, low-precision tensor work; inference, reasoning, and large-scale deployment; and rack-level coherence. For enterprises, cloud providers, and research labs planning for the next five years, this matters. It changes how you plan data center space, power, cooling, scheduling, and tenancy, and even how you think about what an "AI cluster" looks like. With Blackwell Ultra, you're not just provisioning GPUs; you're building AI factories focused on accelerated computing, deep learning, and the next generation of AI applications.
