A while ago, someone in the community published a Zhihu post on simulating MI300X with gem5. Since I was recently verifying AMDGPU floating-point precision, I wanted to compare the floating-point accuracy of gem5’s MI300X model with real hardware.
I recommend running gem5 on a server or workstation, because a personal computer may not have enough resources.
The setup mainly follows that post and the official gem5 documentation, "Full System AMD GPU model".
The basic idea is:
- Use qemu-system-x86_64 to build a disk image that includes the AMDGPU driver and the ROCm environment, together with a matching kernel;
- Use gem5’s KVM mode to emulate an x86 system and integrate the MI300X model as a PCIe card;
- Finally, load the prepared image and kernel in gem5, and you are ready to use it.
For convenience, I used the Docker environment provided by gem5 to build everything.
docker pull ghcr.io/gem5/ubuntu-24.04_all-dependencies:v24-0
Pull the relevant gem5 repositories locally, create the gem5 Docker container, and share the repository directory with the container:
git clone https://github.com/gem5/gem5.git
git clone https://github.com/gem5/gem5-resources.git gem5/gem5-resources
docker run --name gem5-amdgpu \
    --device /dev/kvm \
    --volume /home/zevorn/gem5:/gem5 \
    --privileged \
    -it ghcr.io/gem5/ubuntu-24.04_all-dependencies:v24-0
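Because the container depends on `/dev/kvm` being passed through, it can save time to check the device node on the host first. The following is a minimal sketch, not part of gem5; the helper name `kvm_status` is made up for illustration.

```python
import os


def kvm_status(dev_path="/dev/kvm"):
    """Return a short status string for the KVM device node.

    Hypothetical helper for illustration only; gem5 does not ship it.
    """
    if not os.path.exists(dev_path):
        return "missing"          # KVM module not loaded or no virtualization support
    if not os.access(dev_path, os.R_OK | os.W_OK):
        return "no-permission"    # current user cannot open the device read/write
    return "ok"


if __name__ == "__main__":
    print(f"/dev/kvm: {kvm_status()}")
```

If this reports anything other than `ok`, the KVM-based CPU model used later is unlikely to work inside the container.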
The next step is the build stage, which has two parts.
First, create the image and kernel in the gem5/gem5-resources repository outside the container:
cd gem5/gem5-resources/src/x86-ubuntu-gpu-ml/
./build.sh
Be patient here. This build script uses packer to create the image. Once the build succeeds, it produces the following files:
├── disk-image
│   └── x86-ubuntu-gpu-ml
├── vmlinux-gpu-ml
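Since the packer build takes a long time, a quick check that both artifacts exist can catch a failed build before moving on. This is a small sketch under the assumption that the two paths listed above are all that is needed; `missing_artifacts` is a hypothetical helper name.

```python
from pathlib import Path

# Relative paths taken from the build output listed above.
EXPECTED = ("disk-image/x86-ubuntu-gpu-ml", "vmlinux-gpu-ml")


def missing_artifacts(build_dir, expected=EXPECTED):
    """Return the expected artifacts that are not present under build_dir."""
    root = Path(build_dir)
    return [rel for rel in expected if not (root / rel).exists()]
```

For example, `missing_artifacts("gem5/gem5-resources/src/x86-ubuntu-gpu-ml")` should return an empty list after a successful build.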
Then enter the container and start building gem5:
cd /gem5
scons build/VEGA_X86/gem5.opt -j$(nproc)
We need to modify the default MI300X Python configuration so that gem5 does not exit once the run script finishes:
$ git diff configs/example/gpufs/mi300.py
@@ -153,7 +153,7 @@ def runMI300GPUFS(
 )
 b64file.write(runscriptStr)

- args.script = tempRunscript
+ # args.script = tempRunscript

 # Defaults for CPU
 args.cpu_type = "X86KvmCPU"
In the gem5 source tree, create a pytorch_test.py file to pass to gem5 when running the MI300X configuration. With the change above, it will not actually be executed:
#!/usr/bin/env python3

import torch

x = torch.rand(5, 3).to('cuda')
y = torch.rand(3, 5).to('cuda')

z = x @ y
We also install tmux in the container so that, after starting gem5, we can connect to the terminal from another pane:
apt update
apt install tmux
Then try starting it:
build/VEGA_X86/gem5.opt configs/example/gpufs/mi300.py \
    --disk-image gem5-resources/src/x86-ubuntu-gpu-ml/disk-image/x86-ubuntu-gpu-ml \
    --kernel gem5-resources/src/x86-ubuntu-gpu-ml/vmlinux-gpu-ml \
    --app pytorch_test.py
Using tmux, create another terminal in the current window, then connect to gem5 with the following command:
./util/term/gem5term localhost 3456
root@gem5:~#
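gem5 takes a moment before the terminal port accepts connections, so it can be handy to wait for it before launching gem5term. The sketch below polls the port with a plain TCP connect. The helper `wait_for_port` is hypothetical, and it assumes that briefly opening and closing a connection does not prevent gem5term from attaching afterwards.

```python
import socket
import time


def wait_for_port(host="localhost", port=3456, timeout=60.0, interval=1.0):
    """Poll host:port until a TCP connection succeeds or timeout expires.

    Hypothetical helper; 3456 is the terminal port used in the command above.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # The probe connection is closed immediately on success.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # port not open yet, try again
    return False
```

Calling `wait_for_port()` in the second tmux pane and starting gem5term once it returns `True` avoids retrying by hand.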
Next, we need to manually load the AMDGPU driver. You can enter the following commands directly in the gem5 terminal:
export LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH
export HSA_ENABLE_INTERRUPT=0
export HCC_AMDGPU_TARGET=gfx942
dmesg -n8
cat /proc/cpuinfo
dd if=/root/roms/mi300.rom of=/dev/mem bs=1k seek=768 count=128
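The `dd` parameters are easier to read once converted to addresses. The arithmetic below shows which physical range the ROM is copied into; on x86 PCs, the region starting at 0xC0000 is conventionally where the legacy video BIOS / option ROM lives. The variable names are only for illustration.

```python
BLOCK = 1024         # bs=1k
SEEK_BLOCKS = 768    # seek=768 (offset into the output file, /dev/mem)
COUNT_BLOCKS = 128   # count=128

start = SEEK_BLOCKS * BLOCK   # first byte written
size = COUNT_BLOCKS * BLOCK   # number of bytes copied
end = start + size - 1        # last byte written

print(f"start=0x{start:X} end=0x{end:X} size={size // 1024} KiB")
# prints: start=0xC0000 end=0xDFFFF size=128 KiB
```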
Let’s verify that it loaded successfully:
root@gem5:~# rocminfo
ROCk module version 6.12.12 is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.15
Runtime Ext Version: 1.7
System Timestamp Freq.: 0.001000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
XNACK enabled: NO
DMAbuf Support: YES
VMM Support: YES
...
*******
Agent 2
*******
  Name: gfx942
  Uuid: GPU-XX
  Marketing Name: AMD Instinct MI300X
  Vendor Name: AMD
  Feature: KERNEL_DISPATCH
  Profile: BASE_PROFILE
  Float Round Mode: NEAR
  Max Queue Number: 128(0x80)
  Queue Min Size: 64(0x40)
  Queue Max Size: 131072(0x20000)
  Queue Type: MULTI
  Node: 1
  Device Type: GPU
...
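To make this check scriptable, the `rocminfo` output can be scanned for agents whose device type is GPU. This is a rough sketch that relies only on the `Name:` and `Device Type:` fields shown above; `gpu_agents` is a hypothetical helper, and the parsing is deliberately simple rather than a full parser for the format.

```python
import re


def gpu_agents(rocminfo_text):
    """Return the names of agents whose 'Device Type' is GPU."""
    agents = []
    name = None
    for line in rocminfo_text.splitlines():
        # Match only lines that start with 'Name:' or 'Device Type:',
        # so 'Marketing Name:' and 'Vendor Name:' are ignored.
        m = re.match(r"\s*(Name|Device Type):\s*(\S+)", line)
        if not m:
            continue
        key, value = m.groups()
        if key == "Name":
            name = value
        elif value == "GPU" and name is not None:
            agents.append(name)
    return agents
```

With the output above saved to a string, `gpu_agents(text)` would be expected to contain `gfx942`.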
At this point, the simulated MI300X is ready for normal use.
When I tested fma instruction results on the gem5 MI300X model, I compared them bit-for-bit against the CPU and real MI300X hardware and found that gem5 was always off by 3–4 ULPs. Reading the source, I found that gem5 simulated the operation as a separate multiply followed by an add. That rounds the intermediate product and then rounds again after the addition, whereas a fused multiply-add rounds only once, so the results drifted from real hardware. I changed the source as follows:
$ git diff src/arch/amdgpu/vega/insts/instructions.hh
diff --git a/src/arch/amdgpu/vega/insts/instructions.hh b/src/arch/amdgpu/vega/insts/instructions.hh
index 7a328f9230..76bd96beee 100644
--- a/src/arch/amdgpu/vega/insts/instructions.hh
+++ b/src/arch/amdgpu/vega/insts/instructions.hh
@@ -44352,8 +44352,9 @@ namespace VegaISA
         int lane_A = i + M * (block + B * (k / K_L));
         int lane_B = j + N * (block + B * (k / K_L));
         int item = k % K_L;
-        result[i][j] +=
-            src0[item][lane_A] * src1[item][lane_B];
+        result[i][j] =
+            std::fma(src0[item][lane_A], src1[item][lane_B],
+                     result[i][j]);
       }
     }
   }
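The effect of the extra rounding step can be reproduced on any CPU. The sketch below is an illustration in double precision, not the MI300X data formats or gem5's code: it emulates a fused multiply-add by computing the exact result with `fractions.Fraction` and rounding once, then compares it with the ordinary two-step expression. The function names are made up for this example.

```python
from fractions import Fraction


def fused_mul_add(a, b, c):
    """a*b + c computed exactly, then rounded once to the nearest double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def unfused_mul_add(a, b, c):
    """a*b rounded to double first, then the sum rounded again."""
    return a * b + c


# a*b is exactly 1 - 2**-60, which is not representable as a double
# and rounds to 1.0 when the product is stored separately.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print("unfused:", unfused_mul_add(a, b, c))  # the product's low bits are lost
print("fused:  ", fused_mul_add(a, b, c))    # keeps the exact -2**-60
```

Here the unfused version returns 0.0 while the fused version returns -2**-60. Across the many accumulation steps of a matrix operation, such per-step differences can add up to a few ULPs.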
After that, the test behaved correctly. I contributed this bug fix upstream, and it has now been merged: arch-vega: Improve MFMA precision to match MI300X hardware.
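For the bit-for-bit comparison described earlier, a ULP distance between two single-precision values can be computed from their bit patterns. This is a minimal sketch of one common approach, not the exact script used for the tests above; the helper names are hypothetical and NaN inputs are not handled.

```python
import struct


def f32_bits(x):
    """Bit pattern of x after rounding to IEEE 754 binary32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def _ordered(bits):
    # Map sign-magnitude float bits onto a monotonically increasing
    # integer line, so +0.0 and -0.0 both land on 0.
    return -(bits & 0x7FFFFFFF) if bits & 0x80000000 else bits


def ulp_distance_f32(a, b):
    """Number of representable float32 steps between a and b."""
    return abs(_ordered(f32_bits(a)) - _ordered(f32_bits(b)))
```

For example, `ulp_distance_f32(1.0, 1.0 + 2**-23)` is 1, because those are adjacent float32 values.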