Nvidia Waves and Moats

This Article is available as a video essay on YouTube


From the Wall Street Journal:

The Nvidia frenzy over artificial intelligence has come to this: Chief Executive Jensen Huang unveiled his company’s latest chips on Monday in a sports arena at an event one analyst dubbed the “AI Woodstock.”

Customers, partners and fans of the chip company descended on the SAP Center, the home of the National Hockey League’s San Jose Sharks, for Huang’s keynote speech at an annual Nvidia conference that, this year, has a seating capacity of about 11,000. Professional wrestling’s WWE Monday Night RAW event took place there in February. Justin Timberlake is scheduled to play the arena in May. Even Apple’s much-watched launch events for the iPhone and iPad didn’t fill a venue this large. At the center of the tech world’s attention is Huang, who has gone from a semiconductor CEO with a devoted following among videogame enthusiasts to an AI impresario with broad-enough appeal to draw thousands to a corporate event.

Or, as Nvidia Research Manager Jim Fan put it on X:

I’m disappointed that the Wall Street Journal used this lead for their article about the event, but not because I thought they should have talked about the actual announcements: rather, they and I had the exact same idea. It was the spectacle, even more than the announcements, that was the most striking takeaway of Huang’s keynote.

I do think, contra the Wall Street Journal, that iPhone announcements are a relevant analogy; Apple could have, particularly in the early days of the iPhone, easily filled an 11,000-seat arena. Perhaps an even better analogy, though, was the release of Windows 95. Lance Ulanoff wrote a retrospective on Medium in 2020:

It’s hard to imagine an operating system, by itself, garnering the kind of near-global attention the Windows 95 launch attracted in 1995. Journalists arrived from around the world on August 24, 1995, settling on the lush green, and still relatively small Microsoft Campus in Redmond, Washington. There were tickets (I still have mine) featuring the original Windows Start Button (“Start” was a major theme for the entire event) granting admission to the invite-only, carnival-like event…It was a relatively happy and innocent time in technology. Perhaps the last major launch before the internet dominated everything, when a software platform, and not a blog post or a piece of hardware, could change the world.

One can envision an article in 2040 looking back on the “relatively happy and innocent time in technology” as we witnessed “perhaps the last major launch before AI dominated everything” when a chip “could change the world”; perhaps retrospectives of the before times will be the last refuge of human authors like myself.

GTCs of Old

What is interesting to a once-and-future old fogey like myself, who has watched multiple Huang keynotes, is how relatively focused this event was: yes, Huang talked about things like weather and robotics and Omniverse and cars, but this was, first-and-foremost, a chip launch — the Blackwell B200 generation of GPUs — with a huge chunk of the keynote talking about its various features and permutations, performance, partnerships, etc.

I thought this stood in marked contrast to GTC 2022 when Huang announced the Hopper H100 generation of GPUs: that had a much shorter section on the chips/system architecture, accompanied by a lot of talk about potential use cases and a list of all of the various libraries Nvidia was developing for CUDA. This was normal for GTC, as I explained a year earlier:

This was, frankly, a pretty overwhelming keynote; Liberty thinks this is cool:

Robots and digital twins and games and machine learning accelerators and data-center-scale computing and cybersecurity and self-driving cars and computational biology and quantum computing and metaverse-building-tools and trillion-parameter AI models! Yes plz

Something Huang emphasized in the introduction to the keynote, though, is that there is a rhyme and reason to this volume…

I then went on an extended explainer of CUDA and why it was essential to understanding Nvidia’s long-term opportunity, and concluded:

This is a useful way to think about Nvidia’s stack: writing shaders is like writing assembly, as in it’s really hard and very few people can do it well. CUDA abstracted that away into a universal API that was much more generalized and approachable — it’s the operating system in this analogy. Just like with operating systems, though, it is useful to have libraries that reduce duplicative work amongst programmers, freeing them to focus on their own programs. So it is with CUDA and all of those SDKs that Huang referenced: those are libraries that make it much simpler to implement programs that run on Nvidia GPUs.

This is how it is that a single keynote can cover “Robots and digital twins and games and machine learning accelerators and data-center-scale computing and cybersecurity and self-driving cars and computational biology and quantum computing and metaverse-building-tools and trillion-parameter AI models”; most of those are new or updated libraries on top of CUDA, and the more that Nvidia makes, the more they can make.

This isn’t the only part of the Nvidia stack: the company has also invested in networking and infrastructure, both on the hardware and software level, that allows applications to scale across an entire data center, running on top of thousands of chips. This too requires a distinct software plane, which reinforces that the most important thing to understand about Nvidia is that it is not a hardware company, and not a software company: it is a company that integrates both.
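
To make the quoted analogy concrete, here is a minimal sketch of the difference between writing your own GPU code and leaning on an Nvidia-provided library; the specific tools (Numba for a hand-written kernel, CuPy dispatching to Nvidia's cuBLAS) are my illustrative choices, not anything Nvidia announced, and both assume an Nvidia GPU and the CUDA toolkit are available.

```python
# A hand-written kernel versus a library call: the same matrix multiply two
# ways. Numba and CuPy are illustrative choices, not GTC announcements, and
# both require an Nvidia GPU plus the CUDA toolkit.
import numpy as np
import cupy as cp
from numba import cuda

# 1) The "assembly-like" path: write and launch your own CUDA kernel.
@cuda.jit
def matmul_kernel(a, b, out):
    i, j = cuda.grid(2)  # one thread per output element
    if i < out.shape[0] and j < out.shape[1]:
        acc = 0.0
        for k in range(a.shape[1]):
            acc += a[i, k] * b[k, j]
        out[i, j] = acc

a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)

d_a, d_b = cuda.to_device(a), cuda.to_device(b)
d_out = cuda.device_array((256, 256), dtype=np.float32)
matmul_kernel[(16, 16), (16, 16)](d_a, d_b, d_out)  # 16x16 blocks of 16x16 threads
hand_written = d_out.copy_to_host()

# 2) The "library" path: one call that dispatches to Nvidia's tuned cuBLAS.
library = cp.asnumpy(cp.matmul(cp.asarray(a), cp.asarray(b)))

print(np.allclose(hand_written, library, rtol=1e-3))
```

The second path is the CUDA-library value proposition in miniature: one call, tuned by Nvidia, that only runs on Nvidia hardware.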

Those GTCs were, in retrospect, put on by a company before it had achieved astronomical product-market fit. Sure, Huang and Nvidia knew about transformers and GPT models — Huang referenced his hand-delivery of the first DGX supercomputer to OpenAI in 2016 in yesterday’s opening remarks — but notice how his hand-drawn slide of computing history seems to exclude a lot of the stuff that used to be at GTC:

Huang's drawing of computing journey to today

Suddenly all that mattered in those intervening years was transformers!

I am not, to be clear, short-changing Huang or Nvidia in any way; quite the opposite. What is absolutely correct is that Nvidia had on their hands a new way of computing, and the point of those previous GTCs was to experiment and push the world to find use cases for it; today, in this post-ChatGPT world, the largest use case — generative AI — is abundantly clear, and the most important message for Huang to deliver is why Nvidia will continue to dominate that use case for the foreseeable future.

Blackwell

So about Blackwell itself; from Bloomberg:

Nvidia Corp. unveiled its most powerful chip architecture at the annual GPU Technology Conference, dubbed Woodstock for AI by some analysts. Chief Executive Officer Jensen Huang took the stage to show off the new Blackwell computing platform, headlined by the B200 chip, a 208-billion-transistor powerhouse that exceeds the performance of Nvidia’s already class-leading AI accelerators. The chip promises to extend Nvidia’s lead on rivals at a time when major businesses and even nations are making AI development a priority. After riding Blackwell’s predecessor, Hopper, to surpass a valuation of more than $2 trillion, Nvidia is setting high expectations with its latest product.

The first thing to note about Blackwell is that it is actually two dies fused into one chip, with what the company says is full coherence; what this means in practice is that a big portion of Blackwell’s gains relative to Hopper is that it is simply much bigger. Here is Huang holding a Hopper and Blackwell chip up for comparison:

Huang holding a Hopper GPU and a Blackwell GPU

The “Blackwell is bigger” theme holds for the systems Nvidia is building around it. The fully integrated GB200 platform has two Blackwell chips with one Grace CPU chip, as opposed to Hopper’s 1-to-1 architecture. Huang also unveiled the GB200 NVL72, a liquid-cooled, rack-sized system that includes 72 GPUs interconnected with a new generation of NVLink, which the company claims provides a 30x performance increase over the same number of H100 GPUs for LLM inference (thanks in part to dedicated hardware for transformer-based inference), with a 25x reduction in cost and energy consumption. One set of numbers I found notable was on these slides:

Blackwell's increased performance in training relative to Hopper

What is interesting to note is that both training runs take the same amount of time — 90 days. This is because the actual calculation speed is basically the same; this makes sense because Blackwell is, like Hopper, fabbed on TSMC’s 4nm process,1 and the actual calculations are fairly serial in nature (and thus primarily governed by the underlying speed of the chip). “Accelerated computing”, though, isn’t about serial speed, but rather parallelism, and every new generation of chips, combined with new networking, enables ever greater amounts of efficient parallelism that keeps those GPUs full; that’s why the big improvement is in the number of GPUs necessary and thus the overall amount of power drawn.
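
A bit of back-of-the-envelope arithmetic makes the point; the numbers below are invented for illustration, not taken from Nvidia's slides. Hold the wall-clock budget and the total training work fixed, and the benefit of better parallelism shows up entirely in how many GPUs (and how much power) the run requires.

```python
# Illustrative arithmetic only; the figures are made up, not Nvidia's.
# If total training work and wall-clock time are both fixed, and per-chip
# serial speed barely changes, then higher *sustained* per-GPU throughput
# (GPUs kept fed by better interconnect and transformer-specific hardware)
# translates directly into needing fewer GPUs and drawing less power.

TRAIN_DAYS = 90                   # the fixed wall-clock budget noted above
TOTAL_WORK = 10_000 * TRAIN_DAYS  # pretend the job needs 10,000 old-GPU-days (invented)

old_throughput = 1.0              # sustained work per GPU per day (normalized)
new_throughput = 4.0              # hypothetical 4x from better parallel efficiency

old_gpus = TOTAL_WORK / (old_throughput * TRAIN_DAYS)  # -> 10,000 GPUs
new_gpus = TOTAL_WORK / (new_throughput * TRAIN_DAYS)  # -> 2,500 GPUs
print(f"Same {TRAIN_DAYS}-day run: {old_gpus:,.0f} GPUs becomes {new_gpus:,.0f}")
```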

That, by extension, means that a Hopper-sized fleet of Blackwell GPUs will be capable of building that much larger of a model, and given that there appears to be a linear relationship between scale and model capability, the path to GPT-6 and beyond remains clear (GPT-5 was presumably trained on Hopper GPUs; GPT-4 was trained on Ampere A100s).

It is also worth noting that there are reports that while the B100 costs twice as much as the H100 to manufacture, Nvidia is increasing the price much less than expected; this explains the somewhat lower margins the company is expecting going forward. The report — which has since disappeared from the Internet (perhaps because it was published before the keynote?) — speculated that Nvidia is concerned about preserving its market share in the face of AMD being aggressive on price, and its biggest customers trying to build their own chips. There are, needless to say, tremendous incentives to find alternatives, particularly for inference.

Nvidia Inference Microservices (NIM)

I think this provides useful context for another GTC announcement; from the Nvidia developer blog:

The rise in generative AI adoption has been remarkable. Catalyzed by the launch of OpenAI’s ChatGPT in 2022, the new technology amassed over 100M users within months and drove a surge of development activities across almost every industry. By 2023, developers began POCs [Proof of Concepts] using APIs and open-source community models from Meta, Mistral, Stability, and more.

Entering 2024, organizations are shifting their focus to full-scale production deployments, which involve connecting AI models to existing enterprise infrastructure, optimizing system latency and throughput, logging, monitoring, and security, among others. This path to production is complex and time-consuming — it requires specialized skills, platforms, and processes, especially at scale.

NVIDIA NIM, part of NVIDIA AI Enterprise, provides a streamlined path for developing AI-powered enterprise applications and deploying AI models in production.

NIM is a set of optimized cloud-native microservices designed to shorten time-to-market and simplify deployment of generative AI models anywhere, across cloud, data center, and GPU-accelerated workstations. It expands the developer pool by abstracting away the complexities of AI model development and packaging for production using industry-standard APIs.
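
What “industry-standard APIs” means in practice is that a deployed NIM is supposed to be callable like any other web service. Here is a minimal sketch of that, assuming (my assumption for illustration, not something specified in the quoted post) an OpenAI-style chat endpoint running in a local container; the URL, port, and model name are placeholders.

```python
# Minimal sketch of calling a locally deployed NIM container. The endpoint
# path, port, and model name below are placeholders/assumptions for
# illustration, not documented values.
import requests

NIM_URL = "http://localhost:8000/v1/chat/completions"  # hypothetical local deployment

payload = {
    "model": "example-llm",  # placeholder model name
    "messages": [
        {"role": "user", "content": "Summarize today's open customer tickets."}
    ],
}

response = requests.post(NIM_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```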

NIMs are pre-built containers that contain everything an organization needs to get started with model deployment, and they are addressing a real need not just today, but in the future; Huang laid out a compelling scenario where companies use multiple NIMs in an agent-type framework to accomplish complex tasks:

Think about what an AI API is: an AI API is an interface that you just talk to. So this is a piece of software that in the future that has a really simple API, and that API is called human. These packages, incredible bodies of software, will be optimized and packaged and we’ll put it on a website, and you can download it, you can take it with you, you can run it on any cloud, you can run it in your datacenter, you can run it on workstations if it fits, and all you have to do is come to ai.nvidia.com. We call it Nvidia Inference Microservices, but inside the company we all call it NIMs.

Just imagine, someday there’s going to be one of these chatbots, and that chatbot is just going to be in a NIM. You’ll assemble a whole bunch of chatbots, and that’s the way that software is going to be built some day. How do we build software in the future? It is unlikely that you’ll write it from scratch, or write a whole bunch of Python code or anything like that. It is very likely that you assemble a team of AIs.

There’s probably going to be a super-AI that you use that takes the mission that you give it and breaks it down into an execution plan. Some of that execution plan could be handed off to another NIM, that NIM would maybe understand SAP. The language of SAP is ABAP. It might understand ServiceNow and go and retrieve some information from their platforms. It might then hand that result to another NIM, who goes off and does some calculation on it. Maybe it’s an optimization software, a combinatorial optimization algorithm. Maybe it’s just some basic calculator. Maybe it’s pandas to do some numerical analysis on it. And then it comes back with its answer, and it gets combined with everybody else’s, and because it’s been presented with “This is what the right answer should look like,” it knows what right answers to produce, and it presents it to you. We can get a report every single day, top-of-the-hour, that has something to do with a build plan or some forecast or some customer alert or some bugs database or whatever it happens to be, and we can assemble it using all these NIMs.

And because these NIMs have been packaged up and ready to work on your system, so long as you have Nvidia GPUs in your datacenter or in the cloud, these NIMs will work together as a team and do amazing things.
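
Huang’s “team of AIs” description maps onto a fairly simple orchestration pattern: a planner model breaks the mission into steps, each step is routed to whichever specialist service claims that capability, and the results are combined at the end. The sketch below is my own illustration of that pattern, not Nvidia’s software; every endpoint, capability name, and prompt is hypothetical.

```python
# An illustrative sketch of the "team of AIs" pattern described above: a
# planner decomposes a mission, specialists handle individual steps, and the
# planner combines the results. Every endpoint and name here is hypothetical.
import requests

PLANNER_URL = "http://localhost:8000/v1/chat/completions"  # placeholder planner NIM
SPECIALISTS = {                                            # placeholder specialist NIMs
    "erp_lookup": "http://localhost:8001/v1/chat/completions",
    "optimizer": "http://localhost:8002/v1/chat/completions",
    "report_writer": "http://localhost:8003/v1/chat/completions",
}

def ask(url: str, prompt: str) -> str:
    """Send one chat request to an (assumed) OpenAI-style endpoint."""
    payload = {"model": "example", "messages": [{"role": "user", "content": prompt}]}
    r = requests.post(url, json=payload, timeout=120)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

def run_mission(mission: str) -> str:
    # 1) The "super-AI" produces an execution plan, one "capability: task" per line.
    plan = ask(PLANNER_URL, "Break this mission into steps, one per line, "
                            f"formatted as 'capability: task'. Mission: {mission}")
    # 2) Each step is handed to the matching specialist; results accumulate.
    results = []
    for line in plan.splitlines():
        if ":" not in line:
            continue
        capability, task = (part.strip() for part in line.split(":", 1))
        url = SPECIALISTS.get(capability, PLANNER_URL)  # fall back to the planner
        results.append(ask(url, task))
    # 3) The planner assembles everything into the final report.
    return ask(PLANNER_URL, "Combine these results into a daily report:\n" + "\n".join(results))

print(run_mission("Produce today's build-plan status report"))
```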

Did you notice the catch? NIMs — which Nvidia is going to both create itself and also spur the broader ecosystem to create, with the goal of making them freely available — will only run on Nvidia GPUs.

NIMs only run on Nvidia GPUs

This takes this Article full circle: in the before-times, i.e. before the release of ChatGPT, Nvidia was building quite the (free) software moat around its GPUs; the challenge is that it wasn’t entirely clear who was going to use all of that software. Today, meanwhile, the use cases for those GPUs are very clear, and those use cases are happening at a much higher level than CUDA frameworks (i.e. on top of models); that, combined with the massive incentives to find cheaper alternatives to Nvidia, means both the pressure to escape CUDA and the possibility of doing so are higher than ever (even if escape is still distant for lower-level work, particularly when it comes to training).

Nvidia has already started responding: I think that one way to understand DGX Cloud is that it is Nvidia’s attempt to capture the same market that is still buying Intel server chips in a world where AMD chips are better (because they already standardized on Intel); NIMs are another attempt to build lock-in.

In the meantime, though, it remains noteworthy that Nvidia appears to not be taking as much margin with Blackwell as many may have expected; the question as to whether they will have to give back more in future generations will depend on not just their chips’ performance, but also on re-digging a software moat increasingly threatened by the very wave that made GTC such a spectacle.



  1. I was mistaken about this previously