A while ago, someone in the community published a Zhihu post on simulating MI300X with gem5. Since I was recently verifying AMDGPU floating-point precision, I wanted to compare the floating-point accuracy of gem5’s MI300X model with real hardware.

I recommend running gem5 on a server or workstation, because a personal computer may not have enough resources.

The setup mainly follows that post and the official gem5 documentation, "Full System AMD GPU model".

The basic idea is:

  1. Use qemu-system-x86_64 to build a disk image that includes the AMDGPU driver and ROCm environment, along with the matching kernel;
  2. Use gem5’s KVM mode to emulate an x86 system and integrate the MI300X model as a PCIe card;
  3. Finally, load the prepared image and kernel in gem5, and you are ready to use it.

For convenience, I used the Docker environment provided by gem5 to build everything.

docker pull ghcr.io/gem5/ubuntu-24.04_all-dependencies:v24-0

Clone the relevant gem5 repositories locally, create the gem5 Docker container, and share the repository directory with the container (replace /home/zevorn/gem5 with the path to your own clone):

git clone https://github.com/gem5/gem5.git
git clone https://github.com/gem5/gem5-resources.git gem5/gem5-resources
docker run --name gem5-amdgpu \
--device /dev/kvm \
--volume /home/zevorn/gem5:/gem5 \
--privileged \
-it ghcr.io/gem5/ubuntu-24.04_all-dependencies:v24-0

The next step is the build stage, which has two parts.

First, create the image and kernel in the gem5/gem5-resources repository outside the container:

cd gem5/gem5-resources/src/x86-ubuntu-gpu-ml/
./build.sh

Be patient here. This build script uses packer to create the image. Once the build succeeds, it produces the following files:

├── disk-image
│   └── x86-ubuntu-gpu-ml
├── vmlinux-gpu-ml

Then enter the container and start building gem5:

cd /gem5
scons build/VEGA_X86/gem5.opt -j$(nproc)

We need to modify the default MI300X configuration script, configs/example/gpufs/mi300.py, so that the simulation does not exit after execution:

$ git diff configs/example/gpufs/mi300.py
@@ -153,7 +153,7 @@ def runMI300GPUFS(
         )
         b64file.write(runscriptStr)
 
-    args.script = tempRunscript
+    # args.script = tempRunscript
 
     # Defaults for CPU
     args.cpu_type = "X86KvmCPU"

In the gem5 source tree, create a pytorch_test.py file. gem5 loads this file when running the MI300X configuration, although it will not actually be executed:

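#!/usr/bin/env python3

import torch

# ROCm builds of PyTorch expose AMD GPUs under the 'cuda' device name.
x = torch.rand(5, 3).to('cuda')
y = torch.rand(3, 5).to('cuda')

# Matrix multiply on the GPU: (5x3) @ (3x5) gives a 5x5 result.
z = x @ y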

We also install tmux in the container so that, after starting gem5, we can connect to the terminal from another pane:

apt update
apt install tmux

Then try starting it:

build/VEGA_X86/gem5.opt configs/example/gpufs/mi300.py \
--disk-image gem5-resources/src/x86-ubuntu-gpu-ml/disk-image/x86-ubuntu-gpu-ml \
--kernel gem5-resources/src/x86-ubuntu-gpu-ml/vmlinux-gpu-ml \
--app pytorch_test.py
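
Because a missing image or kernel only shows up after gem5 has started, it can help to check the inputs first. The following is a minimal sketch, not part of gem5: the helper name `build_gem5_command` is hypothetical, and it assumes the directory layout used in this post (the gem5 tree at /gem5 with gem5-resources cloned inside it). It only uses the flags shown in the command above.

```python
from pathlib import Path


def build_gem5_command(gem5_root):
    """Assemble the launch command from this post and list missing inputs.

    Hypothetical helper: the paths mirror the layout used in this post.
    """
    root = Path(gem5_root)
    image_dir = root / "gem5-resources/src/x86-ubuntu-gpu-ml"
    inputs = {
        "--disk-image": image_dir / "disk-image/x86-ubuntu-gpu-ml",
        "--kernel": image_dir / "vmlinux-gpu-ml",
        "--app": root / "pytorch_test.py",
    }
    cmd = [
        str(root / "build/VEGA_X86/gem5.opt"),
        str(root / "configs/example/gpufs/mi300.py"),
    ]
    for flag, path in inputs.items():
        cmd += [flag, str(path)]
    # Report every input file that is not present on disk.
    missing = [str(path) for path in inputs.values() if not path.exists()]
    return cmd, missing


if __name__ == "__main__":
    cmd, missing = build_gem5_command("/gem5")
    if missing:
        print("Missing inputs:")
        for path in missing:
            print("  " + path)
    else:
        print(" ".join(cmd))
```

If nothing is missing, the printed command can be pasted into the container shell as is.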

Using tmux, create another terminal in the current window, then connect to gem5 with the following command:

./util/term/gem5term localhost 3456
root@gem5:~#

Next, we need to manually load the AMDGPU driver. You can enter the following commands directly in the gem5 terminal:

export LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH
export HSA_ENABLE_INTERRUPT=0
export HCC_AMDGPU_TARGET=gfx942
dmesg -n8
cat /proc/cpuinfo
dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128
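
The dd arguments determine where the ROM lands in physical memory. As a small worked example (the helper name `dd_region` is hypothetical, and it assumes dd's `1k` suffix means 1024 bytes), the target range can be computed like this:

```python
KIB = 1024  # dd treats "bs=1k" as 1024 bytes


def dd_region(block_size, seek_blocks, count_blocks):
    """Return the first and last byte offsets written by dd (inclusive)."""
    start = seek_blocks * block_size
    end = start + count_blocks * block_size - 1
    return start, end


start, end = dd_region(block_size=1 * KIB, seek_blocks=768, count_blocks=128)
print(hex(start), hex(end))  # 0xc0000 0xdffff
```

So the command copies up to 128 KiB of mi300.rom to physical addresses 0xC0000 through 0xDFFFF, which is the range where x86 systems conventionally place the legacy video BIOS and option ROMs.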

Let’s verify that it loaded successfully:

root@gem5:~# rocminfo
ROCk module version 6.12.12 is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.15
Runtime Ext Version:     1.7
System Timestamp Freq.:  0.001000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
XNACK enabled:           NO
DMAbuf Support:          YES
VMM Support:             YES
...
*******
Agent 2
*******
  Name:                    gfx942
  Uuid:                    GPU-XX
  Marketing Name:          AMD Instinct MI300X
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
...

At this point, you can use it normally.

When I tested fma instruction results on the gem5 MI300X model, I compared them bit for bit against both the CPU and real MI300X hardware and found that they were consistently off by 3–4 ulps. Reading the fma source, I found that gem5 simulated it as a separate multiply followed by an add, so the product was rounded before the addition. A fused multiply-add rounds only once, which is why the results diverged from real hardware. I changed the source to use std::fma:

$ git diff src/arch/amdgpu/vega/insts/instructions.hh
diff --git a/src/arch/amdgpu/vega/insts/instructions.hh b/src/arch/amdgpu/vega/insts/instructions.hh
index 7a328f9230..76bd96beee 100644
--- a/src/arch/amdgpu/vega/insts/instructions.hh
+++ b/src/arch/amdgpu/vega/insts/instructions.hh
@@ -44352,8 +44352,9 @@ namespace VegaISA
                         int lane_A = i + M * (block + B * (k / K_L));
                         int lane_B = j + N * (block + B * (k / K_L));
                         int item = k % K_L;
-                        result[i][j] +=
-                          src0[item][lane_A] * src1[item][lane_B];
+                        result[i][j] =
+                            std::fma(src0[item][lane_A], src1[item][lane_B],
+                                     result[i][j]);
                     }
                 }
             }
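
To see why the two forms in this diff can disagree, here is a minimal sketch in plain Python. It is an illustration only, not gem5 code: it uses double precision rather than the GPU's data types, and it emulates the fused operation by computing the exact result with `fractions.Fraction` and rounding once, so it does not depend on `math.fma` (which only exists in newer Python versions). The function names are hypothetical.

```python
from fractions import Fraction


def separate_multiply_add(a, b, c):
    """Round after the multiply, then again after the add (the old code path)."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """Compute a*b + c exactly, then round once, as a fused operation does."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # Fraction -> float conversion is correctly rounded


# a * b is exactly 1 - 2**-60, which rounds to 1.0 in double precision.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print(separate_multiply_add(a, b, c))  # 0.0: the small term was rounded away
print(fused_multiply_add(a, b, c) == -2.0**-60)  # True: the small term survives
```

In an accumulation loop like the one in the diff, each separate multiply-add adds its own rounding step, which is consistent with the error of a few ulps described above.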

With this change, the test behaved correctly. I contributed the bug fix upstream, and it has now been merged: arch-vega: Improve MFMA precision to match MI300X hardware.
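
For reference, a bit-for-bit comparison in ulps can be sketched as follows. This is not the test harness used in this post; it assumes the values being compared are single-precision floats, the helper names are hypothetical, and NaN inputs are not handled.

```python
import struct


def float32_bits(x):
    """Return the IEEE 754 single-precision bit pattern of x as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def ordered_key(bits):
    """Map a sign-magnitude bit pattern to an integer that sorts like the float."""
    magnitude = bits & 0x7FFFFFFF
    return -magnitude if bits & 0x80000000 else magnitude


def ulp_distance(a, b):
    """Number of representable float32 values between a and b."""
    return abs(ordered_key(float32_bits(a)) - ordered_key(float32_bits(b)))


reference = 1.0
simulated = 1.0 + 3 * 2.0**-23  # three float32 steps above 1.0
print(ulp_distance(reference, simulated))  # 3
```

A distance of 0 means the two results are bit-identical, apart from +0.0 and -0.0, which this sketch treats as equal.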