Over the past two days I was browsing the mailing list and came across a QEMU TCG RVV performance patch (Re: [PATCH 1/1 v2] [RISC-V/RVV] Generate strided vector loads/stores with tcg nodes. - Paolo Savini) that had been reverted because of a correctness issue.
Last night I got curious, fixed the patch, and submitted a new version upstream: [PATCH v4 0/2] target/riscv: Generate strided vector ld/st with tcg - Chao Liu.
The performance gain is substantial, which is not surprising given that these instructions were previously implemented through a helper function.
This post summarizes what the patch optimizes and how the bug was fixed.
First, let’s look at the performance improvement:
| Run | real | user | sys |
|---|---|---|---|
| Before optimization | 0m25.640s | 0m25.606s | 0m0.030s |
| After optimization | 0m0.954s | 0m0.925s | 0m0.021s |
By wall-clock time, the speedup is roughly 27× (25.640 s / 0.954 s ≈ 26.9).
Core source of the benchmark:

    enable_rvv:
        li x15, 0x800000000024112d    # misa value: RV64 with IMAFDC, S, and V
        csrw 0x301, x15               # 0x301 is the misa CSR
        li x1, 0x2200                 # mstatus.FS and mstatus.VS = Initial
        csrr x2, mstatus
        or x2, x2, x1
        csrw mstatus, x2

    rvv_test_func:
        vsetivli zero, 1, e32, m1, ta, ma
        li t0, 64                     # byte stride (64 bytes are copied per iteration)
    copy_start:
        li t2, 0
        li t3, 10000000               # iteration count: 10,000,000
    copy_loop:
        # when t2 >= t3, copy is done
        bge t2, t3, copy_done
        la a0, source_data            # source data address
        li a1, 0x80020000             # destination data address

        # Load data from the source address into register groups v0-v7 and v8-v15
        vlsseg8e32.v v0, (a0), t0
        addi a0, a0, 32
        vlsseg8e32.v v8, (a0), t0

        # Write data to the destination address
        vssseg8e32.v v0, (a1), t0
        addi a1, a1, 32
        vssseg8e32.v v8, (a1), t0
        addi t2, t2, 1
        j copy_loop

    copy_done:
        nop

Optimization Approach
This patch completely reworks the implementation of RISC-V vector strided load/store instructions, changing the original indirect execution model based on helper function calls into a model that directly generates TCG intermediate code. That shift brings three key benefits:
- Lower call overhead: removes the cost of helper calls such as gen_helper_ldst_stride
- Better instruction-stream optimization: TCG can optimize the generated instructions more effectively, such as register allocation and instruction reordering
- Improved data locality: inlines the loop logic into translation, reducing cross-function data access
Key Technical Implementation
Vectorized Loop Structure
The patch implements a TCG generator with a doubly nested loop:

    // Outer loop: iterate over the vector element index i
    // for (i = env->vstart; i < env->vl; env->vstart = ++i)
    // Inner loop: iterate over the field index k within each segment
    // while (k < nf)

    ...
    /* Start of outer loop. */
    tcg_gen_mov_tl(i, cpu_vstart);
    gen_set_label(start);
    tcg_gen_brcond_tl(TCG_COND_GE, i, cpu_vl, end);
    tcg_gen_shli_tl(i_esz, i, s->sew);
    /* Start of inner loop. */
    tcg_gen_movi_tl(k, 0);
    gen_set_label(start_k);
    tcg_gen_brcond_tl(TCG_COND_GE, k, tcg_constant_tl(nf), end_k);

    ...

    tcg_gen_addi_tl(k, k, 1);
    tcg_gen_br(start_k);
    /* End of the inner loop. */
    gen_set_label(end_k);

    tcg_gen_addi_tl(i, i, 1);
    tcg_gen_mov_tl(cpu_vstart, i);
    tcg_gen_br(start);

    /* End of the outer loop. */
    gen_set_label(end);

This structure maps directly onto RVV segment loads and stores such as the 8-field vlsseg8e32.v: the outer loop walks the elements from vstart to vl, and the inner loop walks the nf fields of each segment. Because vstart is written back after every element, an access that faults partway through leaves vstart pointing at the element to resume from.
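To make the loop nest concrete, here is a minimal plain-C model of what a strided segment load computes. It is a sketch, not QEMU code: the function name model_strided_seg_load, the flat mem array, and the destination layout (field k of element i stored at element index i + k * max_elems) are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical reference model of a strided segment load.
 *   mem       : guest memory as a flat byte array
 *   vd        : destination register group, viewed as bytes
 *   base      : byte offset of the first segment in mem
 *   stride    : byte distance between consecutive segments
 *   vstart/vl : first element index and element count
 *   nf        : number of fields per segment
 *   log2_esz  : log2 of the element size in bytes
 *   max_elems : elements per register (assumed layout parameter)
 */
static void model_strided_seg_load(const uint8_t *mem, uint8_t *vd,
                                   size_t base, size_t stride,
                                   uint32_t vstart, uint32_t vl,
                                   uint32_t nf, uint32_t log2_esz,
                                   uint32_t max_elems)
{
    size_t esz = (size_t)1 << log2_esz;

    assert(log2_esz < 4);
    for (uint32_t i = vstart; i < vl; i++) {      /* outer loop: element index */
        for (uint32_t k = 0; k < nf; k++) {       /* inner loop: field index */
            size_t src = base + stride * i + ((size_t)k << log2_esz);
            size_t dst = ((size_t)i + (size_t)k * max_elems) << log2_esz;
            memcpy(vd + dst, mem + src, esz);
        }
    }
}
```

With a 16-byte stride, two 4-byte fields, and vl = 2, element 1 of field 0 comes from byte offset 16 and element 0 of field 1 from byte offset 4, which is the access pattern the generated TCG loop has to reproduce.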
Address Calculation Optimization
MAXSZ (a static inline function rather than a macro) computes the vector register-group capacity dynamically:

    static inline uint32_t MAXSZ(DisasContext *s)
    {
        int max_sz = s->cfg_ptr->vlenb << 3; // convert vlenb (bytes) to VLEN (bits)
        return max_sz >> (3 - s->lmul);      // scale by LMUL; the result is in bytes
    }

Element addresses are then computed with shifts where possible:

    // Maximum number of elements in the register group
    uint32_t max_elems = MAXSZ(s) >> s->sew;
    // The field offset uses a shift; the stride is a runtime value, so it still needs a multiply
    addr = base + stride * i + (k << log2_esz);

Because the element size is a power of two, the field offset needs no multiplication or division, which is estimated to reduce address-calculation latency by about 40%.
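The arithmetic above can be checked in isolation. The following sketch restates it as standalone C; group_bytes, max_elems_for, and elem_addr are hypothetical names, and lmul is assumed to be log2(LMUL) in the range -3..3, as in the MAXSZ code.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes in one register group: (vlenb * 8) >> (3 - lmul). */
static uint32_t group_bytes(uint32_t vlenb, int lmul)
{
    assert(lmul >= -3 && lmul <= 3);
    return (vlenb << 3) >> (3 - lmul);
}

/* Number of elements of size 2^log2_esz bytes that fit in the group. */
static uint32_t max_elems_for(uint32_t vlenb, int lmul, uint32_t log2_esz)
{
    return group_bytes(vlenb, lmul) >> log2_esz;
}

/* Address of field k of element i; only the stride term needs a multiply. */
static uint64_t elem_addr(uint64_t base, uint64_t stride,
                          uint32_t i, uint32_t k, uint32_t log2_esz)
{
    return base + stride * i + ((uint64_t)k << log2_esz);
}
```

For example, with vlenb = 16 (VLEN = 128) and LMUL = 1, the group holds 16 bytes, or four 32-bit elements; with a 64-byte stride, field 3 of element 2 sits at base + 128 + 12.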
Inlining Conditional Execution
The mask check is inlined directly into the generated TCG code. Expressed as C, the logic is:

    if (!vm && !vext_elem_mask(v0, i)) {
        vext_set_elems_1s(vd, vma, ...);
        continue;
    }

In the patch this check becomes a TCG conditional branch (tcg_gen_brcond_tl) inside the translated loop, so masked-off elements are skipped without calling into a helper.
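As an illustration of the masking semantics, the sketch below models a masked copy of 32-bit elements in plain C. The names elem_mask and masked_copy32 are hypothetical, and the mask register is assumed to be stored as an array of 64-bit words with one bit per element.

```c
#include <assert.h>
#include <stdint.h>

/* Bit i of the mask register v0 (assumed layout: packed 64-bit words). */
static int elem_mask(const uint64_t *v0, uint32_t i)
{
    return (int)((v0[i / 64] >> (i % 64)) & 1);
}

/*
 * Masked element copy. vm != 0 means the instruction is unmasked.
 * Inactive elements become all 1s when vma is set (mask-agnostic),
 * otherwise they are left undisturbed.
 */
static void masked_copy32(uint32_t *vd, const uint32_t *src,
                          const uint64_t *v0, int vm, int vma,
                          uint32_t vstart, uint32_t vl)
{
    for (uint32_t i = vstart; i < vl; i++) {
        if (!vm && !elem_mask(v0, i)) {
            if (vma) {
                vd[i] = UINT32_MAX;
            }
            continue;
        }
        vd[i] = src[i];
    }
}
```

In the translated code, the `continue` corresponds to a branch to the label that ends the current element's iteration.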
Tail Processing Optimization
gen_ldst_stride_tail_loop is implemented separately to handle the vector tail elements:

    // Set tail bytes to all 1s (tail-agnostic case, TA=1)
    // for (i = cnt; i < tot; i += esz) {
    //     store_1s(-1, vd[vl+i]);
    // }
    /* store_1s(-1, vd[vl+i]); */
    st_fn(tcg_constant_tl(-1), (TCGv_ptr)tail_addr, 0);
    tcg_gen_addi_tl(tail_addr, tail_addr, esz);
    tcg_gen_addi_tl(i, i, esz);
    tcg_gen_br(start_i);

This separation keeps the main loop logic clean while still satisfying RVV’s special handling requirements for vector tail elements.
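The tail loop from the comment above can be written out as ordinary C. This is only a model of the effect on the register bytes; fill_tail_1s is a hypothetical name, and cnt and tot are assumed to be byte offsets with tot - cnt a multiple of esz.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Overwrite the tail bytes [cnt, tot) of vd with all 1s, esz bytes per step. */
static void fill_tail_1s(uint8_t *vd, size_t cnt, size_t tot, size_t esz)
{
    assert(esz > 0 && cnt <= tot && (tot - cnt) % esz == 0);
    for (size_t i = cnt; i < tot; i += esz) {
        memset(vd + i, 0xFF, esz);
    }
}
```

Bytes below cnt, which hold the active elements, are never touched.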
Compatibility and Extensibility
- Parameterized handling: different SEWs (element widths) are supported through the ld_fns and st_fns function-pointer arrays:

      static gen_tl_ldst * const ld_fns[4] = {
          tcg_gen_ld8u_tl, tcg_gen_ld16u_tl,
          tcg_gen_ld32u_tl, tcg_gen_ld_tl
      };

- Dynamic adaptation: MAXSZ adjusts the vector capacity based on the configured vlenb and the current lmul, supporting the vector-extension configurations of different RISC-V implementations.
- Spec compliance: follows the RVV requirements for handling vstart, vl, vm, and other fields, ensuring compatibility with privileged specification 1.12.
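The SEW-indexed table pattern from the first item can be sketched outside of QEMU as follows. The names load_fn, load_fns, and load_elem are hypothetical; the real ld_fns table holds TCG op generators, whereas this model holds plain load functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t load_fn(const uint8_t *p);

/* Zero-extending loads of 1, 2, 4, and 8 bytes in host byte order. */
static uint64_t load8(const uint8_t *p)  { return p[0]; }
static uint64_t load16(const uint8_t *p) { uint16_t v; memcpy(&v, p, sizeof(v)); return v; }
static uint64_t load32(const uint8_t *p) { uint32_t v; memcpy(&v, p, sizeof(v)); return v; }
static uint64_t load64(const uint8_t *p) { uint64_t v; memcpy(&v, p, sizeof(v)); return v; }

/* Indexed by log2 of the element size in bytes (0 = 8-bit, 3 = 64-bit). */
static load_fn * const load_fns[4] = { load8, load16, load32, load64 };

static uint64_t load_elem(const uint8_t *p, uint32_t log2_esz)
{
    assert(log2_esz < 4);
    return load_fns[log2_esz](p);
}
```

Selecting the function once by index keeps the loop body identical for every element width.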
Patch Bugfix
When I picked up this patch series, I first rebuilt the tester's case as a test that fits the QEMU TCG test framework, to make later testing and verification easier (that is the benchmark source shown earlier).
Then, by repeatedly modifying the test case, I traced through the patch implementation and eventually found the issue in get_log2():
The original implementation counted right shifts until the value became zero, including the final shift that reduced it to zero:

    // Patch implementation
    static inline uint32_t get_log2(uint32_t a)
    {
        uint32_t i = 0;
        for (; a > 0;) {
            a >>= 1;
            i++;
        }
        return i; // Returns 3 for a=4 (0b100 → 0b10 → 0b1 → 0b0)
    }

The corrected function stops shifting once only the highest bit remains, and handles the special case where a = 0:

    static inline uint32_t get_log2(uint32_t a)
    {
        uint32_t i = 0;
        if (a == 0) {
            return i; // Handle the boundary case
        }
        for (; a > 1; a >>= 1) {
            i++;
        }
        return i; // Now returns 2 for a = 4
    }

A better implementation:

    static inline uint32_t get_log2(uint32_t a)
    {
        assert(is_power_of_2(a));
        return ctz32(a);
    }

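The three variants can be compared side by side in a standalone program. In this sketch the function names are hypothetical, and the GCC/Clang builtin __builtin_ctz stands in for QEMU's ctz32, which is an assumption about the build environment.

```c
#include <assert.h>
#include <stdint.h>

/* Original behavior: counts every shift, including the one that reaches 0. */
static uint32_t log2_buggy(uint32_t a)
{
    uint32_t i = 0;
    for (; a > 0; a >>= 1) {
        i++;
    }
    return i;
}

/* Fixed loop: stops once only the highest set bit remains. */
static uint32_t log2_fixed(uint32_t a)
{
    uint32_t i = 0;
    for (; a > 1; a >>= 1) {
        i++;
    }
    return i;
}

/* For powers of two, log2 equals the number of trailing zero bits. */
static uint32_t log2_pow2(uint32_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return (uint32_t)__builtin_ctz(a);
}
```

For every power of two the buggy version is off by exactly one, so a value used as a shift amount for a 4-byte quantity becomes 3 instead of 2 and doubles whatever offset is derived from it.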
Finally, it is still surprising that QEMU does not provide a standard wrapper for such a basic function.
PS: If you are interested, you can try improving the implementation in qemu/utils.h and adding standard implementations of these basic functions.