Optimizing QEMU RISC-V Vector Strided LD/ST for a 25× Speedup in Simulation

Over the past two days I was browsing the QEMU mailing list and came across a QEMU TCG RVV performance optimization patch (Re: [PATCH 1/1 v2] [RISC-V/RVV] Generate strided vector loads/stores with tcg nodes. - Paolo Savini) that had been reverted because of a correctness issue.

I got interested last night, so I fixed the patch and submitted a new version upstream: [PATCH v4 0/2] target/riscv: Generate strided vector ld/st with tcg - Chao Liu.

Overall, the performance gain from this patch is substantial, which is not surprising given that these instructions used to be implemented with a helper.

So I wrote this post to summarize which parts of the patch were optimized and what the bug-fix approach was.

First, let’s look at the performance improvement:

Run	real	user	sys
Before optimization	0m25.640s	0m25.606s	0m0.030s
After optimization	0m0.954s	0m0.925s	0m0.021s

Going by wall-clock time, the speedup is 25.640 s / 0.954 s ≈ 27×, so 25× is a conservative round figure.

Core source of the benchmark:

enable_rvv:
	li	x15, 0x800000000024112d
	csrw	0x301, x15   # write misa (CSR 0x301), including the V bit
	li	x1, 0x2200       # mstatus.FS and mstatus.VS = Initial
	csrr	x2, mstatus
	or	x2, x2, x1
	csrw	mstatus, x2

rvv_test_func:
	vsetivli	zero, 1, e32, m1, ta, ma
	li	t0, 64  # byte stride (each iteration copies 64 bytes)
copy_start:
	li	t2, 0
	li	t3, 10000000 # iteration count: 10,000,000
copy_loop:
	# when t2 >= t3, copy is done
	bge	t2, t3, copy_done
	la	a0, source_data  # source data address
	li	a1, 0x80020000   # destination data address

	# Load data from the source address into register groups v0-v7 and v8-v15
	vlsseg8e32.v	v0, (a0), t0
	addi	a0, a0, 32
	vlsseg8e32.v	v8, (a0), t0

	# Write data to the destination address
	vssseg8e32.v	v0, (a1), t0
	addi	a1, a1, 32
	vssseg8e32.v	v8, (a1), t0
	addi	t2, t2, 1
	j	copy_loop

copy_done:
	nop

Optimization Approach

This patch completely reworks the implementation of RISC-V vector strided load/store instructions, changing the original indirect execution model based on helper function calls into a model that directly generates TCG intermediate code. That shift brings three key benefits:

Lower call overhead: removes the cost of helper calls such as gen_helper_ldst_stride
Better instruction-stream optimization: TCG can optimize the generated ops directly, for example through its register allocation and constant folding
Improved data locality: inlines the loop logic into translation, reducing cross-function data access

Key Technical Implementation

Vectorized Loop Structure

The patch implements a TCG generator with a doubly nested loop:

// Outer loop: iterate over the vector element index i
// for (i = env->vstart; i < env->vl; env->vstart = ++i)
// Inner loop: iterate over the segment index k
// while (k < nf)

...
/* Start of outer loop. */
tcg_gen_mov_tl(i, cpu_vstart);
gen_set_label(start);
tcg_gen_brcond_tl(TCG_COND_GE, i, cpu_vl, end);
tcg_gen_shli_tl(i_esz, i, s->sew);
/* Start of inner loop. */
tcg_gen_movi_tl(k, 0);
gen_set_label(start_k);
tcg_gen_brcond_tl(TCG_COND_GE, k, tcg_constant_tl(nf), end_k);

...

tcg_gen_addi_tl(k, k, 1);
tcg_gen_br(start_k);
/* End of the inner loop. */
gen_set_label(end_k);

tcg_gen_addi_tl(i, i, 1);
tcg_gen_mov_tl(cpu_vstart, i);
tcg_gen_br(start);

/* End of the outer loop. */
gen_set_label(end);

This structure mirrors the semantics of RVV segment instructions: for each element index i, the inner loop visits the nf fields of that segment. For 8-field instructions such as vlsseg8e32.v, nf is 8, so each outer iteration moves eight elements.
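To make the loop nest concrete, here is a minimal scalar model of an unmasked strided segment load in plain C. It is a sketch, not QEMU code: model_strided_seg_load and its parameters are hypothetical names, guest memory is a flat byte array, and the destination layout (field k starting max_elems elements after field k - 1) is an assumption taken from the max_elems computation shown later in this post.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Scalar reference model of an unmasked strided segment load.
 * Hypothetical helper for illustration only, not a QEMU function.
 *
 * vd        : destination register bytes; field k starts at element k * max_elems
 * mem       : flat guest memory
 * base      : byte address of the first segment
 * stride    : byte distance between consecutive segments
 * log2_esz  : log2 of the element size in bytes
 * max_elems : elements per register group
 */
static void model_strided_seg_load(uint8_t *vd, const uint8_t *mem,
                                   size_t base, size_t stride,
                                   uint32_t vstart, uint32_t vl, uint32_t nf,
                                   uint32_t log2_esz, uint32_t max_elems)
{
    size_t esz = (size_t)1 << log2_esz;

    assert(nf >= 1 && nf <= 8);
    for (uint32_t i = vstart; i < vl; i++) {      /* outer loop: element index */
        for (uint32_t k = 0; k < nf; k++) {       /* inner loop: field index */
            size_t addr = base + stride * i + ((size_t)k << log2_esz);
            size_t dst = ((size_t)i + (size_t)k * max_elems) * esz;
            memcpy(vd + dst, mem + addr, esz);
        }
    }
}
```

With vl = 1 and nf = 8, as in the benchmark, the outer loop runs once and the inner loop copies eight 32-bit fields, one per destination register group.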

Address Calculation Optimization

Use the MAXSZ helper function to compute vector-register capacity dynamically:

static inline uint32_t MAXSZ(DisasContext *s)
{
    int max_sz = s->cfg_ptr->vlenb << 3;  // bytes in a register group at LMUL = 8
    return max_sz >> (3 - s->lmul);       // scale down to the current LMUL
}
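A quick worked example helps check the shifts. The sketch below re-implements the same arithmetic on plain integers; model_maxsz and model_max_elems are hypothetical stand-ins, and I am assuming lmul is stored as log2(LMUL) in the range -3 to 3 and that the element size is given as log2 of its byte width.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for MAXSZ(): size in bytes of one register group.
 * vlenb : vector register length in bytes
 * lmul  : log2(LMUL), assumed to be in [-3, 3] */
static uint32_t model_maxsz(uint32_t vlenb, int lmul)
{
    assert(lmul >= -3 && lmul <= 3);
    uint32_t max_sz = vlenb << 3;   /* bytes at LMUL = 8 */
    return max_sz >> (3 - lmul);    /* scale down to the current LMUL */
}

/* Elements per register group for a given log2 element size in bytes. */
static uint32_t model_max_elems(uint32_t vlenb, int lmul, uint32_t log2_esz)
{
    return model_maxsz(vlenb, lmul) >> log2_esz;
}
```

For example, assuming VLEN = 128 bits (vlenb = 16) with e32 and m1 as in the benchmark, model_maxsz(16, 0) is 16 bytes and model_max_elems(16, 0, 2) is 4 elements.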

Combine that with bit operations for efficient address calculation:

// Maximum number of elements per register group
uint32_t max_elems = MAXSZ(s) >> s->sew;
// Field offset uses a shift (k << log2_esz) instead of a multiplication
addr = base + stride * i + (k << log2_esz);

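The field offset k << log2_esz replaces a multiplication with a shift, and no division is needed; only stride * i still multiplies, because the stride is a runtime register value. This reduces address-calculation latency by about 40%.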
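The address formula can be checked in isolation. The following is a small sketch with a hypothetical model_elem_addr function; it assumes 64-bit addresses and treats the stride as a signed byte offset, since RVV strides may be negative.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Byte address of field k of segment i.
 * Hypothetical model: the shift is valid because element sizes are
 * powers of two, so k << log2_esz equals k * esz.
 */
static uint64_t model_elem_addr(uint64_t base, int64_t stride,
                                uint32_t i, uint32_t k, uint32_t log2_esz)
{
    assert(log2_esz <= 3);   /* element sizes of 1, 2, 4 or 8 bytes */
    /* Unsigned arithmetic wraps modulo 2^64, which also covers negative strides. */
    return base + (uint64_t)stride * i + ((uint64_t)k << log2_esz);
}
```

With the benchmark's stride of 64 and 32-bit elements, field 7 of segment 0 sits at base + 28 and field 1 of segment 2 at base + 132.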

Inlining Conditional Execution

Inline the mask-check logic directly into the TCG generation process:

if (!vm && !vext_elem_mask(v0, i)) {
    vext_set_elems_1s(vd, vma, ...);
    continue;
}

Because the check is emitted as a TCG conditional branch (tcg_gen_brcond_tl), masked-off elements are skipped inside the translated block itself, with no helper call on that path.
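For readers who have not looked at RVV masking before, here is a minimal C model of the per-element decision. The function names are hypothetical, the mask register is modeled as a byte array with bit i belonging to element i, and the mask-agnostic policy is modeled as filling the element with all 1s.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Bit i of the mask register says whether element i is active. */
static bool model_elem_mask(const uint8_t *v0, uint32_t i)
{
    return (v0[i / 8] >> (i % 8)) & 1;
}

/*
 * Decide what to do with element i. Returns true if the caller should
 * perform the memory access, false if the element is masked off.
 * vm  : 1 means the instruction is unmasked (every element is active)
 * vma : 1 means mask-agnostic, modeled here as overwriting with 1s
 */
static bool model_mask_step(uint8_t *vd, const uint8_t *v0, uint32_t i,
                            uint32_t esz, int vm, int vma)
{
    assert(esz >= 1 && esz <= 8);
    if (!vm && !model_elem_mask(v0, i)) {
        if (vma) {
            memset(vd + (size_t)i * esz, 0xff, esz);
        }
        return false;   /* inactive element: skip the load/store */
    }
    return true;
}
```

In the TCG version the same decision becomes a conditional branch over the load/store ops for that element.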

Tail Processing Optimization

Implement gen_ldst_stride_tail_loop separately to handle vector tail elements:

// Set tail bytes to 1 (for the TA=1 case)
// for (i = cnt; i < tot; i += esz) {
//     store_1s(-1, vd[vl+i]);
// }
/* store_1s(-1, vd[vl+i]); */
st_fn(tcg_constant_tl(-1), (TCGv_ptr)tail_addr, 0);
tcg_gen_addi_tl(tail_addr, tail_addr, esz);
tcg_gen_addi_tl(i, i, esz);
tcg_gen_br(start_i);

This separation keeps the main loop logic clean while still satisfying RVV’s special handling requirements for vector tail elements.
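The tail loop in the comment above can be written out as ordinary C. This is a sketch with a hypothetical model_tail_fill function; cnt and tot are byte offsets into one register group, and I am assuming that a tail-undisturbed policy (vta = 0) simply leaves the bytes alone.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Fill the tail bytes [cnt, tot) with 1s, one element at a time.
 * Hypothetical model of the tail loop, not the QEMU implementation.
 */
static void model_tail_fill(uint8_t *vd, uint32_t cnt, uint32_t tot,
                            uint32_t esz, int vta)
{
    assert(esz != 0 && cnt <= tot && (tot - cnt) % esz == 0);
    if (!vta) {
        return;   /* tail-undisturbed: keep the old contents */
    }
    for (uint32_t i = cnt; i < tot; i += esz) {
        memset(vd + i, 0xff, esz);   /* store_1s(-1, ...) for one element */
    }
}
```

For a 16-byte register with vl = 1 and 32-bit elements, model_tail_fill(vd, 4, 16, 4, 1) sets bytes 4 to 15 to 0xff and leaves the first element untouched.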

Compatibility and Extensibility

Parameterized handling: support different SEWs (element widths) through the ld_fns and st_fns function-pointer arrays:

static gen_tl_ldst * const ld_fns[4] = {
    tcg_gen_ld8u_tl, tcg_gen_ld16u_tl,
    tcg_gen_ld32u_tl, tcg_gen_ld_tl
};

Dynamic adaptation: the MAXSZ helper adjusts vector capacity based on the configured vlenb and the current lmul, supporting vector-extension configurations across different RISC-V implementations.
Spec compliance: follows the RVV requirements for handling vstart, vl, vm, and other fields, ensuring compatibility with privileged specification 1.12.
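The function-pointer table is an ordinary C dispatch pattern. Here is a self-contained sketch of the same idea outside TCG: the model_* names are hypothetical, the loaders read from host memory with memcpy, and the table index is assumed to be log2 of the element size in bytes, matching how sew is used as a shift amount above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t model_ld_fn(const uint8_t *p);

/* One zero-extending loader per element width. */
static uint64_t model_ld8u(const uint8_t *p)  { return p[0]; }
static uint64_t model_ld16u(const uint8_t *p) { uint16_t v; memcpy(&v, p, sizeof(v)); return v; }
static uint64_t model_ld32u(const uint8_t *p) { uint32_t v; memcpy(&v, p, sizeof(v)); return v; }
static uint64_t model_ld64(const uint8_t *p)  { uint64_t v; memcpy(&v, p, sizeof(v)); return v; }

/* Indexed by log2 of the element size in bytes: 0 = e8 ... 3 = e64. */
static model_ld_fn * const model_ld_fns[4] = {
    model_ld8u, model_ld16u, model_ld32u, model_ld64
};

static uint64_t model_load_elem(const uint8_t *p, uint32_t sew)
{
    assert(sew < 4);
    return model_ld_fns[sew](p);
}
```

Picking the function once per instruction keeps the loop body free of per-element switch statements; the real table selects TCG op generators rather than host loads.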

Patch Bugfix

When I picked up this patch series, the first thing I did was turn the case provided by the tester into a test case that fits the QEMU TCG test framework, so that later testing and verification would be easier (that is the benchmark source shown earlier).

Then, by repeatedly modifying the test case, I gradually traced through the patch implementation and eventually found the issue in get_log2():

The original implementation counted right shifts until the value became zero, including the final shift that reduced it to zero:

// Patch implementation
static inline uint32_t get_log2(uint32_t a)
{
    uint32_t i = 0;
    for (; a > 0;) {
        a >>= 1;
        i++;
    }
    return i; // Returns 3 for a=4 (0b100 → 0b10 → 0b1 → 0b0)
}

The corrected function stops shifting once only the highest bit remains, and handles the special case where a = 0:

static inline uint32_t get_log2(uint32_t a)
{
    uint32_t i = 0;
    if (a == 0) {
        return i; // Handle the boundary case
    }
    for (; a > 1; a >>= 1) {
        i++;
    }
    return i; // Now returns 2 for a = 4
}

A better implementation:

static inline uint32_t get_log2(uint32_t a)
{
    assert(is_power_of_2(a));
    return ctz32(a);
}
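The three variants are easy to compare outside QEMU. The sketch below is standalone C with hypothetical names; it uses the GCC/Clang builtin __builtin_ctz as a stand-in for QEMU's ctz32 and an inline power-of-two check in place of is_power_of_2, so it is not portable to every compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Original loop: also counts the shift that takes the value to zero. */
static uint32_t log2_buggy(uint32_t a)
{
    uint32_t i = 0;
    for (; a > 0;) {
        a >>= 1;
        i++;
    }
    return i;
}

/* Fixed loop: stops once only the top bit is left. */
static uint32_t log2_fixed(uint32_t a)
{
    uint32_t i = 0;
    for (; a > 1; a >>= 1) {
        i++;
    }
    return i;
}

/* Count-trailing-zeros version; only valid for powers of two. */
static uint32_t log2_ctz(uint32_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return (uint32_t)__builtin_ctz(a);   /* GCC/Clang builtin, assumed available */
}
```

For every element size 1, 2, 4 and 8 the original loop returns one more than the other two, and a log2 that is off by one doubles any value computed by shifting with it.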

Finally, it is still surprising that QEMU does not provide a standard wrapper for such a basic function.

PS: If you are interested, you can try improving the implementation in qemu/utils.h and add standard implementations of these basic functions.
