HPC Precision Wars: Satoshi Matsuoka Plants the Ozaki Flag | Deep Dive


Details: Written by Douglas Eadline; Published: 19 June 2026; Hits: 779

From the FP64 is not as boring as it used to be department

A recent paper submitted to arXiv by famed HPC scientist Satoshi Matsuoka, Director of the RIKEN Center for Computational Science in Kobe, Japan, has shaken the tried-and-true FP64 HPC relationship to its core. The paper is entitled: FP8 is All You Need (Part 1): Debunking Hardware FP64 as the HPC Holy Grail: A Tensor–Memory Equilibrium Model and Implementation Strategy for Ozaki Scheme II on Memory-Bound Workloads in the Post-FP64 Era.

There is quite a lot to unpack in the paper, let alone the title. Matsuoka is basically positing that FP64 (64-bit double-precision arithmetic) hardware is not the best way to perform certain HPC computations that require 64-bit precision on newer Nvidia GPUs. Instead, he demonstrates how FP8 (8-bit floating-point operations) found in abundance on modern GPUs can be combined using the Ozaki Scheme II to achieve faster computation at the same 64-bit precision. This rebuke of legacy 64-bit floating-point technology is Matsuoka's flag-in-the-sand moment, and it will have consequences throughout the HPC industry. This article attempts to summarize some of the paper's major points; however, consulting the paper provides greater breadth and support for Satoshi's arguments and predictions.

Using a simple yet strained analogy, Matsuoka is saying: stop using F1 cars to get around the HPC track as fast as possible. Instead, use a bunch of e-bikes! Given traditional HPC beliefs, the proposal seems absurd, something that would never work. The "F1-FP64" car has delivered floating-point performance to HPC applications for decades and is considered an integral part of any performance race. Historically, CPU and GPU vendors have consistently delivered generational increases in FP64 performance. That trend is changing because, in a market driven by Gen-AI, many FP8 e-bikes are preferred. While HPC applications require higher-precision FP64, Gen-AI training and inference require lower precision. To accommodate the larger market, Nvidia has regressed FP64 performance on newer-generation processors (AMD has no such plans) and created an F1-FP64 gap that Matsuoka posits can be filled by lower-precision FP8 e-bikes.

Hardware trends

The HPC market has often borrowed technology from larger markets. Back in the early days of Linux Clusters (Beowulf), HPC shifted away from custom supercomputing processors and toward commodity hardware. The cost of creating next-generation processors required large sales volumes, which were not available for specialized supercomputing and workstation processors. The commodity desktop/server market could justify these volumes, and HPC took advantage of this trend. GPU-assisted computing began as video cards evolved into true HPC acceleration devices. Server GPUs (not necessarily desktop video cards) offered a large FP64 capability needed by HPC applications and helped establish a viable market -- initially dominated by Nvidia.

Enter the age of GenAI. LLM training and inference run faster with lower-precision floating-point numbers. This reduction in precision brings a commensurate decrease in the memory required to represent numbers, generally without a meaningful loss in model fidelity. In addition, smaller operands mean less data to move and cheaper arithmetic per operation, so the same silicon delivers much higher throughput. For instance, standard 64-bit arithmetic manipulates 8 bytes per operand, while an 8-bit floating-point format needs one-eighth of the memory and manipulates only 8 bits. These reduced floating-point sizes range from 32 bits down to 4 bits (FP32, FP16/BF16, FP8, and even FP4).
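To make the memory arithmetic concrete, here is a minimal Python sketch that tabulates the storage needed for one matrix in each format named above. The matrix size and the helper name `footprint_bytes` are assumptions chosen only for illustration; they do not come from the paper.

```python
# Storage width in bits for each format named in the text.
FORMAT_BITS = {"FP64": 64, "FP32": 32, "FP16": 16, "BF16": 16, "FP8": 8, "FP4": 4}


def footprint_bytes(num_elements: int, fmt: str) -> int:
    """Bytes needed to store num_elements values (FP4 packs two values per byte)."""
    return num_elements * FORMAT_BITS[fmt] // 8


n = 4096 * 4096  # hypothetical matrix size, for illustration only
for fmt in FORMAT_BITS:
    mib = footprint_bytes(n, fmt) / 2**20
    print(f"{fmt:>5}: {mib:6.1f} MiB")
```

The FP64 row comes out eight times larger than the FP8 row, which is the "8 times less memory" figure quoted above.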

The GenAI market has pushed GPU and CPU designs to emphasize lower precision "AI-format" operations over FP64. The current trend in FP64 vs lower precision for Nvidia is evident in Table 1, which compares recent Nvidia and AMD results.

Table 1: Comparison of floating-point precision performance for Nvidia and AMD GPUs. Data taken from the Matsuoka paper (Table 2) and AMD documents. The "emulated" result is explained below.

As shown in Table 1, FP64 performance on Nvidia GPUs has regressed, and on the B300 it is effectively non-existent. The AMD GPUs, however, are maintaining the FP64 growth. As shown in the lower-precision results, Nvidia is driving GPU performance growth in these areas. This migration of GPU silicon from FP64 to FP8 and FP4 is clearly due to the GenAI market.

The HPC community has recognized this trend for years. Outside of AMD, the HPC community is asking: "Where will we get our FP64 performance? It seems we have become second-class citizens in the Nvidia GPU space." To answer that, we need to consider using what is now available in abundance: lower-precision hardware to produce high-precision results.

The Ozaki Methods

Back in the Spring of 2025, I wrote "Have You Heard About the Ozaki Scheme? You Will" as an introduction to a family of error-free transformations (EFTs) that allow higher-precision arithmetic to be emulated using lower-precision hardware.

In 2012, five years before Tensor Cores were placed in Nvidia GPUs, Katsuhisa Ozaki, Takeshi Ogita, Shin'ichi Oishi, and Siegfried M. Rump published a paper entitled "Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications." In the paper, the authors describe a technique for fast, error-free splitting of floating-point numbers. Using this technique, they first develop an error-free transformation of a product of two floating-point matrices into a sum of floating-point matrices.

The Ozaki-I scheme, as it is now called, is a method for performing high-accuracy matrix multiplication by leveraging INT8 Tensor Cores on modern GPUs. It achieves this by splitting high-precision input matrices into multiple components and then performing matrix multiplications on these components using low-precision arithmetic. The results are then combined to obtain the final accurate high-precision matrix product.
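The split, multiply, and recombine steps can be sketched in NumPy. This is a simplified fixed-point illustration of the idea, not the published Ozaki-I algorithm: the names `split_slices` and `emulated_matmul`, the 6-bit slice width, and the slice count are assumptions, and an int32 NumPy matmul stands in for the INT8 Tensor Core path. A production implementation would also skip the low-order slice products.

```python
import numpy as np

BITS = 6  # bits kept per slice; every digit lies in [-32, 32] and fits in int8


def split_slices(M, num_slices, axis):
    """Split M into int8 slices after power-of-two scaling along `axis`."""
    maxabs = np.max(np.abs(M), axis=axis, keepdims=True)
    _, e = np.frexp(maxabs)           # maxabs = m * 2**e with 0.5 <= m < 1
    scale = np.ldexp(1.0, e + 1)      # power of two, so |M / scale| < 0.5
    r = M / scale                     # exact: division by a power of two
    slices = []
    for _ in range(num_slices):
        r = r * 2.0**BITS             # shift the next BITS bits above the point
        d = np.rint(r)                # small integer digit
        slices.append(d.astype(np.int8))
        r = r - d                     # exact remainder, |r| <= 0.5
    return slices, scale


def emulated_matmul(A, B, num_slices=9):
    """Approximate A @ B in float64 using only small-integer matrix products."""
    a_slices, a_scale = split_slices(A, num_slices, axis=1)  # one scale per row
    b_slices, b_scale = split_slices(B, num_slices, axis=0)  # one scale per column
    C = np.zeros((A.shape[0], B.shape[1]))
    for i, Ai in enumerate(a_slices):
        for j, Bj in enumerate(b_slices):
            # Integer product with int32 accumulation is exact.
            P = Ai.astype(np.int32) @ Bj.astype(np.int32)
            C += np.ldexp(P.astype(np.float64), -BITS * (i + j + 2))
    return C * a_scale * b_scale


rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
B = rng.standard_normal((5, 7))
print(np.max(np.abs(emulated_matmul(A, B) - A @ B)))
```

With nine 6-bit slices the operands carry about 54 bits, so the result should agree with the native float64 product to roughly rounding error. Reducing `num_slices` trades accuracy for fewer integer products.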

A second EFT method, Ozaki-II, is now available. The Ozaki-II scheme improves on the conventional Ozaki-I scheme with superior computational efficiency and better scalability for large-scale, high-fidelity numerical computations. In early 2026, Uchino, Ozaki, and Imamura observed that the original Ozaki-II algorithm cannot be directly adapted to FP8 matrix-multiply-accumulate units. They introduced a quantization trick that emulates modular arithmetic over FP8. This adaptation is why Ozaki-II remains viable on Blackwell Ultra and the upcoming Nvidia Rubin GPU, both of which reduce INT8 support while expanding FP8/FP4 hardware.
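The modular-arithmetic idea can be illustrated with a small integer example. The sketch below assumes a Chinese-remainder-style reconstruction: the product is computed modulo several small coprime moduli, each cheap enough for 8-bit hardware, and the exact result is then rebuilt from the residues. The moduli, the function name `crt_matmul`, and the restriction to integer inputs are assumptions for illustration. The FP8 quantization trick described above is not reproduced here, and neither is the scaling of floating-point inputs to integers.

```python
from math import prod

import numpy as np

MODULI = (251, 253, 255, 256)  # pairwise coprime; every residue fits in 8 bits
M = prod(MODULI)               # results are recoverable if |C| < M / 2


def crt_matmul(A, B):
    """Exact integer matmul rebuilt from products modulo small coprime moduli."""
    C = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
    for p in MODULI:
        Ap = (A % p).astype(np.int32)   # residues in [0, p)
        Bp = (B % p).astype(np.int32)
        Rp = (Ap @ Bp) % p              # product modulo p
        q = M // p
        w = q * pow(q, -1, p)           # CRT weight: 1 mod p, 0 mod the others
        C = (C + Rp.astype(np.int64) * w) % M
    return np.where(C >= M // 2, C - M, C)  # map back to the signed range


rng = np.random.default_rng(1)
A = rng.integers(-100, 101, size=(4, 6))
B = rng.integers(-100, 101, size=(6, 3))
print(np.array_equal(crt_matmul(A, B), A @ B))
```

One matrix product per modulus is needed, so the work grows with the number of moduli rather than with the square of the number of slices.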

Testing assumptions (HPC dogma)

Matsuoka's argument hinges on assumptions about roofline graphs for applications running on specific processors. The roofline graph shown in Figure 1 is an easy way to visualize the behavior of HPC applications. In general, all applications running on a processor are subject to one of two limitations. The first is memory bandwidth, or the speed at which data needs to be moved into and out of the processor for a given application. The diagonal line on the left side of the graph in Figure 1 represents this performance region. The second limit is the peak speed at which the processor can perform. This region is the horizontal line at the top of the image.

Figure 1: Roofline graph example (Image from Wikipedia)

As shown in the figure, App1 is in the bandwidth-limited zone, while App2 and App3 are in the peak performance zone. Several conclusions can be drawn from this graph: App1 is clearly memory bandwidth-limited, and increasing processor performance will have no effect on App1 (save your money on faster processors). App2 and App3 are running at peak performance, and increasing memory bandwidth will not affect their performance (again, save your money on faster memory). The ideal spot for an application to live is on the ridge point, the intersection of memory bandwidth and peak performance. At this point, the processor is being "fed" data as fast as it can process it, which represents a good balance between the processor's compute capability and its memory subsystem. Remember, most HPC applications are memory-bandwidth-limited and lie on the diagonal line in the figure.
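The roofline rule itself is a one-line formula: attainable performance is the smaller of the peak compute rate and the bandwidth times the arithmetic intensity (FLOPs per byte moved). A minimal Python sketch follows; the machine figures and the per-application intensities are hypothetical values chosen only to reproduce the shape of Figure 1.

```python
def attainable_flops(peak_flops, bandwidth, intensity):
    """Roofline: performance is capped by compute or by memory traffic."""
    return min(peak_flops, bandwidth * intensity)


def regime(peak_flops, bandwidth, intensity):
    """Classify a kernel by comparing its intensity with the ridge point."""
    ridge = peak_flops / bandwidth  # FLOP/byte where the two limits meet
    return "memory-bound" if intensity < ridge else "compute-bound"


PEAK = 10e12  # hypothetical peak, FLOP/s
BW = 2e12     # hypothetical memory bandwidth, byte/s

# Hypothetical arithmetic intensities (FLOP/byte) for the three applications.
for name, ai in {"App1": 1.0, "App2": 8.0, "App3": 20.0}.items():
    perf = attainable_flops(PEAK, BW, ai) / 1e12
    print(f"{name}: {perf:.1f} TFLOP/s, {regime(PEAK, BW, ai)}")
```

With these numbers the ridge sits at 5 FLOP/byte, so App1 lands on the diagonal and App2 and App3 sit on the flat roof.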

Matsuoka points out that the regression of FP64 performance shown in Table 1 dramatically changes the roofline for newer Nvidia GPUs. He identifies two important consequences of these changes.

Consequence 1: Previous memory-bound kernels become compute-bound. The classical roofline ridge point is given by the peak FP64 throughput divided by the HBM bandwidth. On the B300, this ridge sits at 1.3 TFLOPS / 8 TB/s ≈ 0.16 FLOP/byte, forcing every dense linear-algebra kernel narrower than a General Matrix-Matrix Multiplication (GEMM) into the compute-bound regime. In other words, memory bandwidth is more than adequate for these kernels, because the FP64 arithmetic units cannot consume data as fast as the HBM can deliver it.
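Written out with the B300 figures quoted above, the ridge point and the roofline bound are as follows. The symbols are introduced here for convenience and are not the paper's notation.

```latex
I_{\text{ridge}} = \frac{P_{\text{FP64}}}{B_{\text{HBM}}}
                 = \frac{1.3\ \text{TFLOP/s}}{8\ \text{TB/s}}
                 \approx 0.16\ \text{FLOP/byte},
\qquad
P(I) = \min\bigl(P_{\text{FP64}},\; I \cdot B_{\text{HBM}}\bigr)
```

As an illustration, a kernel with an assumed intensity of I = 0.5 FLOP/byte could be fed 0.5 × 8 = 4 TFLOP/s worth of operands from memory, yet the FP64 units cap it at 1.3 TFLOP/s, so it is compute-bound.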

Consequence 2: Low-precision tensor units are dormant. By design, the B300 carries 10 PFLOPS of dense (NV)FP4 throughput (15 PFLOPS sparse) and 5 PFLOPS of dense FP8 (10 PFLOPS sparse). When running a typical FP64-only HPC kernel, these units are idle and contribute nothing to the kernel's time-to-solution. This situation is the Dark Silicon manifestation of the AI–HPC divergence.

As Matsuoka aptly points out, despite the rapid maturation of Ozaki-I and Ozaki-II for dense General Matrix-Matrix Multiplication (GEMM), all published performance studies have focused on the compute-bound regime, where the Ozaki technique most obviously wins due to the sheer number of lower-precision Tensor Cores.

He indicates that no published analysis (to his knowledge, as of May 2026) has asked the question: When is Ozaki-II profitable for memory-bound kernels? As mentioned, most HPC applications live in this region of the roofline. The conventional wisdom that EFT methods like Ozaki-I and II cannot help bandwidth-limited kernels because they inflate operand counts deserves a careful look on hardware where the FP64 compute roof has collapsed below the memory roof.
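A toy roofline comparison shows why the question is worth asking. The sketch below is not the paper's Tensor-Memory Equilibrium Model. It simply assumes that emulation multiplies the arithmetic work by `flop_inflation` and the memory traffic by `byte_inflation`, both invented placeholder values, and uses the B300 peak figures quoted in this article.

```python
def roofline(peak, bandwidth, intensity):
    """Attainable FLOP/s under the classical roofline bound."""
    return min(peak, bandwidth * intensity)


def emulation_speedup(intensity, flop_inflation, byte_inflation,
                      fp64_peak=1.3e12, fp8_peak=5e15, bandwidth=8e12):
    """Ratio of emulated to native FP64-equivalent throughput (toy model)."""
    native = roofline(fp64_peak, bandwidth, intensity)
    # Emulation pays extra FP8 operations and extra bytes per FP64 result.
    emulated = roofline(fp8_peak / flop_inflation,
                        bandwidth / byte_inflation, intensity)
    return emulated / native


# Placeholder overheads: 100x more low-precision FLOPs, 2x more bytes moved.
for ai in (0.05, 0.16, 1.0, 10.0):
    print(f"I = {ai:5.2f} FLOP/byte -> speedup {emulation_speedup(ai, 100, 2):.2f}x")
```

Under these assumptions emulation loses below the native ridge point, where the kernel is still truly bandwidth-limited and the extra bytes hurt. It wins above it, where native FP64 is pinned to the collapsed compute roof. The actual crossover depends on the real overheads, which the paper analyzes.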

Based on the analysis presented in the paper, Matsuoka dramatically labels as dogma the belief that native FP64 processing is the only way to achieve HPC-level performance on Nvidia B300-generation (and later) GPUs. The often staid HPC community has something to discuss.

Performance numbers

To demonstrate and support his position, Satoshi examines performance using four standard HPC primitives.

GEMM (General Matrix-Matrix Multiplication)
Batched GEMV (General Matrix-Vector Multiplication)
7-point Stencil (localized neighboring elements)
SpMV (Sparse Matrix-Vector Multiplication)

Table 2 reports speedups within each GPU, that is, how much Ozaki-II/FP8 accelerates each workload relative to that GPU's own native FP64 performance. Satoshi suggests this view is correct for evaluating emulation on a single chip. Still, it is not the correct view for evaluating whether FP64-emulated execution regresses or progresses relative to the prior-generation HPC baseline.

The appropriate baseline for that question is the H100, the last data-center Nvidia GPU whose architecture was balanced for HPC rather than for AI inference. Table 2 therefore reports absolute achievable FP64-equivalent throughput for the same workloads, with all GPUs normalized to the H100 native FP64.

Table 2: Achievable FP64-equivalent throughput per workload, in TFLOPS, and relative to H100 native FP64 (last column block, in parentheses). Native throughput uses the FP64 tensor path for dense GEMM and the FP64 vector path for the memory-bound primitives.

As Matsuoka states in the paper, "Three patterns in the above table [Table 4 in the paper] support a single thesis: Ozaki-II does not regress performance against the H100 baseline; on the contrary, it restores or improves the prior-generation scaling on every workload." He also states, "Ozaki-II is not just a compensation mechanism for Nvidia's FP64 regression; it is the mechanism that converts the silicon-area savings into bandwidth-scaling and into AI-grade tensor throughput, both of which the application then sees as faster FP64."

Cheap lunch at the AI cafe?

Using the Ozaki-II library requires recoding many standard library routines (e.g., DGEMM). Indeed, any HPC application that wishes to take advantage of the emulated FP64 performance on new Nvidia hardware will require adaptation to the Ozaki-II scheme -- there is no free lunch. This situation is not unlike the past, when applications were migrated to CUDA. Ozaki is a bit more complex, which may slow adoption. However, as Matsuoka points out, the reported success of GenAI-assisted coding may provide a rapid path forward for enabling applications with Ozaki-II emulation. He states in the paper:

This represents an unusual moment in which two ostensibly unrelated AI developments—the architectural pivot of GPUs toward low-precision tensor cores, and the maturation of AI coding assistants—combine to make the emulation strategy practically realizable on the timescale of the FugakuNEXT, Doudna, and Blue Lion deployments.

This assumption/suggestion covers a lot of ground. While GenAI coding assistants are reporting increased productivity, their use is not without issues. Time will be needed to see if this prediction plays out.

AMD sticks to FP64 with emulation options

As noted in Table 1, AMD GPUs are following the traditional FP64 trajectory by increasing performance with each new generation. As a result, AMD is less focused on emulation than Nvidia. Indeed, in the HPCwire article AMD Hints at Big FP64 Increases in MI430X GPU as Ozaki Underwhelms, AMD Fellow Nick Malaya points out several issues with error-free transformations, such as the Ozaki methods.

First, the software is not IEEE-compliant, and it does not produce the same results as running the code on actual FP64 hardware. He states in the article, "In some cases, that's okay, but in a lot of matrices that are common that we've observed, the accuracy implications are pretty profound. In fact, you can give it matrices that differ by a few orders of magnitude in terms of the elements in the matrix…Ozaki has accuracy problems."

The second major problem with Ozaki concerns its expectation of square matrices. If the HPC workload does not use square matrices, the performance drops below native FP64 hardware performance, Malaya said. Malaya also states that traditional HPC applications are vector-based, for which Ozaki methods show little benefit, and that fewer than 10% of these applications have been converted to a matrix format that allows Ozaki methods to be used.

Finally, AMD will support Ozaki emulation on its chips, Malaya said. "There's no reason not to. It's software. We can release it and support it. And you can have libraries that allow you to dynamically switch between the native and the Ozaki method and probably estimate it," he said. "But we're not finding it compelling as, 'You can replace all the 64-bit hardware pipes.' You need those FP64 pipes to fall back onto."

To be Precise ...

HPCwire is honored to be cited three times in Satoshi's paper. It is gratifying to report on the significant advances in HPC. There are, however, some points that could use clarification.

In the paper's introduction, Matsuoka wrote that there is "an announced reliance of the U.S. Department of Energy's Genesis Mission on Ozaki emulation" by reference to an article entitled Genesis Mission Will Lean Heavily on Ozaki Scheme for FP64 Capability written by HPCwire Managing Editor Alex Woodie. And further on, he states, referencing the HPCwire article: "DOE's Genesis Mission explicitly identified Ozaki emulation as its fallback path for FP64-accurate scientific computing on AI-centric hardware."

Although the article title may give that impression by using the phrase "Lean Heavily," the article was a bit more nuanced. As part of the interview, Darío Gil, DOE Under Secretary for Science, stated:

"In discussions I've had with both [AMD CEO] Lisa Su and with [Nvidia CEO] Jensen [Huang], they have expressed a strong commitment for FP64, that it will continue," Gil said in an interview last week. "For us, it's very important, because we don't view this [as a] substitution. These are complementary."

As the article continued, Gil added that the two types of computing will work together to support the Genesis Mission's goal of pushing the limits in AI-powered science and engineering. To be clear, Gil never explicitly described Ozaki techniques as a "fallback" to traditional FP64 methods, nor did he announce a reliance on Ozaki emulation in the interview. As research and testing continue, HPCwire will provide updates and analysis of the situation.

Begun, the HPC precision wars have

This paper is undoubtedly the first of many addressing FP64 emulation. The FP64 bifurcation will continue because GenAI provides a strong economic justification for emphasizing lower precision, which has changed the broader Nvidia hardware market. Matsuoka's paper represents a solid new HPC path forward in that regard.

There is much to be learned, however. From one perspective, jumping in an AMD F1-FP64 car and racing your existing application around the HPC track has a definite convenience factor. On the other hand, connecting FP8 e-bikes may provide higher absolute high-precision track speed as the Gen-AI skew continues in Nvidia hardware. We sense a disturbance in the FP64 force.


This work is licensed under CC BY-NC-SA 4.0

©2005-2023 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.