I have used agents to recreate many tasks and ideas that fit within my stack, and the efficiency gains are obvious. But the way to stay competitive is not to fork work; it is to do innovative work.

Here, “fork” means using an agent to finish faster the things you already know how to do — building a CRUD service you are familiar with, porting a driver you have written before, or refactoring code you already understand.

The speedup is substantial, but it is only acceleration, not expansion. Once everyone can use an agent to complete the same kind of work at 10x speed, that efficiency advantage is no longer a moat.

Innovative work means using an agent to enter domains where you do not yet have complete capability and produce something that did not exist before.

I know QEMU and RISC-V reasonably well, but I did not understand Triton’s internal MLIR pass pipeline or the IR lowering path of its CPU backend.

By myself, implementing a Triton RISC-V CPU backend from scratch would take weeks of learning and trial and error. An agent can act as a bridge in that process — connecting my RISC-V architecture knowledge to Triton’s compiler framework and producing a backend implementation that did not exist before.

So I tried using an agent to help me add a new qemu-riscv64 backend to Triton. This article covers:

  1. Heuristic prompting to help a human learn a new domain quickly with agent assistance;
  2. Using Git so each agent-generated feature remains buildable, testable, and bisectable;
  3. How to turn the result into something engineered and shippable.

What Triton Is

Triton is a DSL for writing high-performance parallel compute kernels, developed by OpenAI. It is based on Python syntax, and its relationship to CUDA is similar to a higher-level DSL front end.

Compared with CUDA, Triton’s core idea is to focus on how to operate on a block of data. That is the first thing a programmer needs to care about when writing a Triton kernel. How work is scheduled onto hardware threads, how vector instructions are used, and how caches are managed are all decided by the compiler.

That design is intuitive and can dramatically reduce the cognitive load of operator development.

If your workflow has moved to agents, you can use the following prompt to have an agent help you understand Triton:

“Please review Triton’s official documentation, summarize its core mechanisms, programming examples, and the most important differences from other DSLs, and turn that into a document. At the end, include the relevant reference links. You may ask me questions during the process, one question at a time.”

That prompt has two key points: first, the source of truth must stay faithful to the official materials; second, the agent should keep asking the human questions so the answer matches that person’s technical background.

The GPU backend is Triton’s native target, but a RISC-V CPU backend has the following value:

  1. Development and debugging: develop and debug Triton kernels on machines without a GPU;
  2. General deployment: run the same kernel code on edge devices, embedded systems, and RISC-V servers;
  3. Performance baseline: provide a CPU reference implementation for GPU kernels to validate correctness;
  4. New-architecture validation: the RISC-V ecosystem is growing quickly, and a CPU backend makes Triton a compilation front end for RISC-V vector computing.

The project already has official x86 and aarch64 reference implementations under the name triton-cpu, which is a downstream Triton repository that supports a CPU backend while regularly rebasing onto upstream.

If you want a quick overview of triton-cpu, you can ask your agent:

“Review the triton-cpu GitHub repository and summarize the current progress of the CPU backend, along with the commit differences from upstream Triton.”

Now let’s look at a simple vector add kernel to understand Triton’s programming model.

 1import triton
 2import triton.language as tl
 3
 4@triton.jit  # Kernel decorator
 5def vector_add_kernel(x_ptr, y_ptr, out_ptr, n_elements,
 6                      BLOCK_SIZE: tl.constexpr):  # program_id(0) returns the index of the current program instance along dimension 0
 7    pid = tl.program_id(0)  # Compute the element offset handled by this block
 8    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)  # Bounds-checking mask
 9    mask = offsets < n_elements  # Masked vector load
10    x = tl.load(x_ptr + offsets, mask=mask)
11    y = tl.load(y_ptr + offsets, mask=mask)
12    # Vector store
13    tl.store(out_ptr + offsets, x + y, mask=mask)

The key features are in the last few lines. tl.load and tl.store describe how data blocks are loaded and stored, which makes it clear that the programmer’s focus is on how data blocks are partitioned and used.

Next, we need to understand how Triton DSL is compiled step by step into RISC-V instructions. That helps explain Triton’s role, and if you know LLVM, it is a very good analogy.

At this point the picture is clear: Triton, as a domain-specific compilation front end, uses a series of specialized passes for multi-stage IR lowering, lands in LLVM IR through the standard MLIR-to-LLVM path, and is then finished by LLVM’s instruction selection, register allocation, and machine code generation.

The difference is that Triton IR is natively vectorized and block-parallel, while LLVM IR usually starts from scalar semantics.

You can ask the agent this:

“Where do the Triton CPU backend and GPU backend differ in the IR lowering pipeline? Do their final outputs differ in how they are called and executed?”

Triton RISC-V CPU Backend Implementation

Supporting a RISC-V backend allows triton-cpu to compile Triton kernels into RISC-V machine code. Choosing qemu-riscv64 as the execution target provides a general RISC-V reference implementation that can run RISC-V machine code on a PC through binary translation.

Here is the architecture diagram:

The part decorated by triton.jit is the kernel function itself. It is lowered into TTIR (Triton IR), which is a hardware-independent high-level IR;

then it is lowered into TTCIR (Triton CPU IR), where the CPU concept is introduced;

then further lowered into TTTCIR (Target Triton CPU IR) for target-architecture specialization. The ConvertDotToRVV pass shown in the diagram was written by the agent to vectorize scalar operations.

From TTCIR onward, the pipeline enters LLVM’s world: first it is converted to LLVM MLIR (LLVM dialect in MLIR);

then to LLVM IR, and then emitted as ASM text, which in this case is RISC-V assembly;

finally it is compiled into a .so shared library. If it needs to run under qemu-riscv64, gcc or clang can be used for cross compilation; after that comes runtime loading.

Let’s draw another diagram to summarize it:

The above is the spec we worked out with the agent. Here I recommend using Claude + humanize to generate a higher-quality plan. You can use the following prompt:

1/humanize:gen-idea
2Help me add a new qemu-riscv64 backend to triton-cpu so it can run RISC-V machine code on x86 machines. During the exploration, ask me as many questions as you need.

Claude will then follow the prompt, launch multiple subagents for parallel exploration, and return a draft.

After reading that draft carefully, we need several rounds of discussion with the agent to make sure the direction is correct. Then we generate the plan:

1/humanize:gen-plan
2Generate a plan based on the idea.

This step breaks the draft idea into concrete ACs and tasks, and determines milestones and acceptance criteria.

Once we have reviewed the plan and found no issues, we can start the rlcr-loop and move into implementation:

1/humanize:start-rlcr-loop
2Start executing this plan.

To make sure we fully understand the plan, humanize will ask us random questions before starting the rlcr-loop based on the plan content. If we answer incorrectly, it will suggest reading the plan again before starting.

If everything is fine, the agent will work through every objective in the plan and, at the end of each loop, call Codex (GPT-5.5) for a strict review.

Before starting the rlcr-loop, we also need to align on the Git workflow.

If the code generated by the agent does not land in traceable commits, it cannot be reviewed later, cannot be bisected, and cannot be reproduced. In the plan and the conversation, I used the following prompt to constrain the agent’s Git behavior:

“On the development branch, create a commit at the end of each loop. The commit message should start with Round N and describe the ACs completed in that round and the key changes. After all loops finish, squash the multi-round commits into separate feature commits by feature and merge them into main, so that every commit on main can be built and tested independently.”

The actual Git history looks like this:

 1# Development branch triton-cpu-riscv-qemu (full iteration history preserved)
 2dc2586a9a [CPU] Incremental model-fragment tests: end-to-end qemu validation for softmax + layernorm + MLP
 38048af6b9 [CPU] Round 1: Improve model-fragment tests - three .so MLP compositions + exact OMP consistency
 44f4a3debe [CPU] Round 2: Fix softmax -inf masking + negative tests + C driver hardening
 5a1b41196c [CPU] Round 3: MLP dimension-mismatch detection + direct qemu tests + softmax reference fix...
 6ae838915d [CPU] Round 19: Fix test_math.py vector-library expectations to match backend behavior
 7cb38cd3d4 [CPU] Finalize: Extract shared logic into helper methods + move deferred imports
 8# main branch (squashed feature commits, each independently buildable and testable)
 9700f8cd8a [CPU] Add RISC-V 64-bit cross-compilation infrastructure
10f522fa953 [CPU] Add ConvertDotToRVV pass and SLEEF RVVM1 math dispatch
11cbb9a02ee [CPU] Add RISC-V build targets and SLEEF cross-compilation
1275920596b [CPU] Add comprehensive RISC-V 64-bit test suite
133bd5f5e10 [CPU] Add RISC-V 64-bit backend documentation

The benefit of this two-layer structure is that the development branch preserves the full per-loop modification history, so you can trace any fix made after a Codex review;

on main, each commit corresponds to a complete feature delivery, so you can use git bisect to locate regressions or cherry-pick a single feature into another branch.

To complete the RISC-V backend adaptation, I eventually defined four plans: feature implementation, RVV/BF16/FP16 testing, model-fragment testing, and documentation writing.

Those four plans averaged about three hours each, with about nine loops per plan, and the final code quality was very high.

Key Pass Implementation Breakdown

Let’s walk through the passes stage by stage. For the parts that already existed in triton-cpu, I will keep the explanation brief and focus on where the agent made the key changes.

Stage 1:

make_ttir()

— hardware-independent optimization

1passes.common.add_inliner(pm)           # Function inlining
2passes.ttir.add_combine(pm)             # Triton IR combine optimization
3passes.common.add_canonicalizer(pm)     # MLIR canonicalization
4passes.ttir.add_reorder_broadcast(pm)   # Reorder broadcast operations
5passes.common.add_cse(pm)               # Common subexpression elimination
6passes.common.add_licm(pm)              # Loop-invariant code motion
7passes.common.add_symbol_dce(pm)        # Dead symbol elimination

This stage is shared with the GPU backend and is similar to LLVM’s general mid-end optimizations under -O2.

Stage 2:

make_ttcir()

— TTIR to CPU IR lowering

 1cpu.passes.ttcpuir.add_scalarize(pm, True)                 # Scalarize selected operations
 2cpu.passes.ttcpuir.add_convert_memory_ops(pm, True)        # tl.load/tl.store → memref operations
 3cpu.passes.ttcpuir.add_convert_ptr_ops(pm)                 # Pointer arithmetic conversion
 4cpu.passes.ttcpuir.add_convert_elementwise_ops(pm)         # Elementwise ops (add/sub/mul/div)
 5cpu.passes.ttcpuir.add_convert_elem_manip_ops(pm)          # Element manipulation (reshape, broadcast)
 6cpu.passes.ttcpuir.add_convert_dot_op(pm)                  # tl.dot → cpu::DotOp
 7cpu.passes.ttcpuir.add_convert_histogram_op(pm)            # Histogram ops
 8cpu.passes.ttcpuir.add_convert_reduction_op(pm, True, False)  # tl.sum/tl.max → reductions
 9cpu.passes.ttcpuir.add_convert_scan_op(pm)                 # Prefix-scan ops
10cpu.passes.ttcpuir.add_convert_cf_ops(pm)                  # Control-flow conversion
11cpu.passes.ttcpuir.add_convert_atomic_ops(pm)              # Atomic ops
12cpu.passes.ttcpuir.add_convert_debug_ops(pm)               # Debug ops
13passes.common.add_cse(pm)                                  # CSE cleanup
14passes.common.add_symbol_dce(pm)
15passes.common.add_canonicalizer(pm)

This is the mapping from Triton-specific operations to CPU semantics. For example, the mask semantics of tl.load are expanded here into conditional vector loads.

Triton’s memory operations natively support masks for out-of-bounds protection:

1x = tl.load(x_ptr + offsets, mask=mask, other=0.0)

In the CPU backend’s make_ttcir() stage, the add_convert_memory_ops pass lowers this into:

1; With a mask → LLVM masked load intrinsic
2%result = call <N x float> @llvm.masked.load.vNf32.p0(
3    ptr %addr,
4    i32 alignment,
5    <N x i1> %mask,
6    <N x float> %passthru    ; `other=0.0` becomes zeroinitializer
7)

Without a mask, it becomes a normal LLVM load instruction.

On the RISC-V backend, llvm.masked.load is eventually lowered by LLVM’s RVV backend into vle32.v + vmerge or into masked conditional-load instructions.

Stage 3:

make_tttcir()

— target-architecture specialization

This is the core stage for RISC-V specialization. Different passes are selected based on architecture and extension features:

 1cpu.passes.ttcpuir.add_triton_cpu_canonicalizer(pm)    # CPU IR canonicalization
 2cpu.passes.ttcpuir.add_optimize_masks(pm)              # Mask optimization
 3passes.common.add_canonicalizer(pm)
 4
 5# --- architecture-specific DotOp conversion (ordered by priority) ---
 6
 7# Intel AMX (x86, when the amx-tile feature is present)
 8# cpu.passes.ttcpuir.add_convert_dot_to_amx(pm, ...)
 9
10# x86 AVX512 FMA
11# cpu.passes.ttcpuir.add_convert_dot_to_fma(pm)
12
13# *** RISC-V RVV (riscv64, when the +v feature is present) ***
14if arch == "riscv64" and "v" in features:
15    cpu.passes.ttcpuir.add_convert_dot_to_rvv(pm)
16
17# generic fallback (registered for all architectures)
18cpu.passes.ttcpuir.add_convert_dot_generic(pm)
19
20# --- data type conversion strategy ---
21# BF16/FP16 conditional compilation (see section 4.3)
22cpu.passes.ttcpuir.add_convert_unsupported_ops(pm, ...)
23cpu.passes.ttcpuir.add_decompose_fp_conversions(pm, ...)

At the make_tttcir() stage, Triton’s tl.dot semantics are also handled; this is the core operator for matrix multiplication.

The main pass here is ConvertDotToRVV, which expands a cpu::DotOp with shape [M, K] x [K, N] -> [M, N] into row-wise vector FMA operations. The LLVM backend then maps that to RVV vfmacc.vf instructions.

This pass lives in:

third_party/cpu/lib/TritonCPUTransforms/ConvertDotOp/ConvertDotToRVV.cpp

The agent followed my instructions and reused as much existing implementation as possible.

As a result, ConvertDotToRVV and ConvertDotToFMA (x86) share the same high-level strategy:

first, iterate over the DotOp, run candidate checks to find valid cases, then analyze the memory buffers with findInputBuffer() for LHS/RHS, and then check the loop-carried accumulator with isLoopCarriedAcc();

finally, lower row by row into FMA: extract the LHS scalar, broadcast it into an N-wide vector, and multiply-add it with the RHS row.

The key difference is that FMA on x86 targets fixed-width SIMD (SSE/AVX), while LLVM uses different backend strategies for x86 (direct instruction selection) and RISC-V (vsetvli + RVV instruction selection).

ConvertDotToAMX (x86) is much more complicated because it uses hardware tile registers, explicit tile-size constraints, and VNNI encodings. RVV does not have tile registers, so the vector-FMA approach is simpler.

The matching condition for this pass (isRvvCandidate) requires the input to be 2D, supports only f32 and f64 for now, and does not require a specific shape size.

For block shapes, any rank-2 DotOp is acceptable. The pass does not impose a block-size constraint, and the LLVM backend uses vsetvli to handle arbitrary vector widths, which fully leverages RVV’s characteristics.

Here is the conversion strategy:

Given

C[M,N] += A[M,K] * B[K,N]

it generates the following MLIR operation sequence:

 1# 1. Extract each accumulator row
 2for m in 0..M:
 3    accRow[m] = vector.extract C, [m]        # <N x f32>
 4
 5# 2. Outer-product accumulation
 6for m in 0..M:
 7    for k in 0..K:
 8        scalar = vector.extract A, [m, k]     # f32
 9        splat  = vector.broadcast scalar → <N x f32>
10        row_b  = vector.extract B, [k]        # <N x f32>
11        accRow[m] = vector.fma(splat, row_b, accRow[m])
12
13# 3. Write back the result
14for m in 0..M:
15    result = vector.insert accRow[m], result, [m]

The benefit of this approach is that each FMA is a row-level vector operation, so LLVM can map it directly to RVV’s vfmacc.vf instruction (scalar-vector FMA) without extra vector shuffles.

There is also a loop-carried-accumulator optimization:

isLoopCarriedAcc() checks whether the accumulator is updated repeatedly inside the loop, as in the K-loop of matmul. If so, the accumulator value can stay in registers and avoid repeated load/store operations.

Overall, this pass is fairly conventional and did not feel especially surprising.

Stage 4:

make_llir()

— lowering to LLVM IR

 1# TritonCPU IR → LLVM dialect (MLIR)
 2cpu.passes.ttcpuir.add_lower_vector_multi_dim(pm)       # Flatten multi-dimensional vectors
 3cpu.passes.ttcpuir.add_expand_strided_metadata(pm)     # Expand stride metadata
 4cpu.passes.ttcpuir.add_vector_to_scf(pm, True, 1, False)  # vector → scf loops
 5cpu.passes.ttcpuir.add_lower_affine(pm)                # affine → standard
 6passes.convert.add_scf_to_cf(pm)                       # scf → cf (control flow)
 7passes.convert.add_index_to_llvmir(pm)                 # index → LLVM
 8
 9# *** Triton-specific LLVM conversions ***
10cpu.passes.ttcpuir.add_func_op_to_llvmir(pm)           # tt::FuncOp → LLVM::FuncOp
11# (append 6 extra program-id parameters here)
12cpu.passes.ttcpuir.add_program_id_to_llvmir(pm)        # GetProgramIdOp → read function parameters
13cpu.passes.ttcpuir.add_memory_op_to_llvmir(pm)         # Memory ops to LLVM
14cpu.passes.ttcpuir.add_atomic_ops_to_llvmir(pm)        # Atomic ops to LLVM
15cpu.passes.ttcpuir.add_debug_ops_to_llvmir(pm)         # Debug ops to LLVM
16
17# *** math library dispatch ***
18# For riscv64+v: SLEEF RVVM1 vectorized math functions
19# For x86 AVX512: libmvec or SLEEF
20cpu.passes.ttcpuir.add_math_to_vec_lib(pm, vec_lib, features)
21passes.convert.add_math_to_llvmir(pm)                  # math dialect → LLVM intrinsics
22cpu.passes.ttcpuir.add_math_to_libm(pm)                # Remaining math → libm calls

A major difference from the GPU backend is how tl.program_id semantics are mapped.

On the GPU, program_id corresponds to a hardware block-index register; on the CPU, it becomes function parameters.

In the add_func_op_to_llvmir pass of make_llir(), FuncOpConversion::amendProgramIdArgs() appends six parameters to every kernel function:

 1// Original signature: void kernel(float* x, float* y, float* out, int n)
 2// After amendment:
 3// void kernel(float* x, float* y, float* out, int n,
 4//             int32_t pid0, int32_t pid1, int32_t pid2,
 5//             uint32_t gridX, uint32_t gridY, uint32_t gridZ)
 6amendedInputTy.push_back(i32_ty);     // pid0
 7amendedInputTy.push_back(i32_ty);     // pid1
 8amendedInputTy.push_back(i32_ty);     // pid2
 9amendedInputTy.push_back(ui32_ty);    // gridX
10amendedInputTy.push_back(ui32_ty);    // gridY
11amendedInputTy.push_back(ui32_ty);    // gridZ

GetProgramIdOpConversion replaces tl.program_id(axis) with a direct read from the function parameters:

 1// Utility.cpp
 2Value getProgramId(FunctionOpInterface funcOp, int axis) {
 3    auto args = funcOp.getArguments();
 4    auto argIdx = args.size() - 6 + axis;
 5    return args[argIdx];
 6}
 7
 8Value getNumPrograms(FunctionOpInterface funcOp, int axis) {
 9    auto args = funcOp.getArguments();
10    auto argIdx = args.size() - 3 + axis;
11    return args[argIdx];
12}

Let’s also look at math-function handling, focusing on the semantics of tl.exp and tl.sin.

The math compilation path goes through three layers of lowering:

 1tl.exp(x)
 2  → math::ExpOp (MLIR math dialect)
 3    → [MathToVecLib pass]
 4      → if riscv64+v and SLEEF is available:
 5          func.call @Sleef_expfx_u10rvvm1(%vec)   # SLEEF RVVM1 vectorized version
 6      → if riscv64+v but SLEEF is unavailable:
 7          [math_to_llvmir pass]
 8            → llvm.exp intrinsic (expanded by LLVM into a scalar loop + libm calls)
 9      → if riscv64 without +v:
10          [math_to_libm pass]
11            → func.call @expf(%scalar)             # Scalar libm

Here the agent chose the SLEEF library to provide Triton with vectorized math functions.

The backend mainly supports the SLEEF RVVM1 architecture. RVVM1 means that SLEEF uses an RVV vector-register group with LMUL=1, which is a vector-length-agnostic (VLA) ABI. LLVM determines the actual vector length at code-generation time based on the target hardware’s VLEN.

Concretely, the decision happens in the compile-time pass layer (MathToVecLib.cpp). When isRvv == true, populateSleefRvvPatterns() registers RVVM1 variants for 27 math functions:

1// Each function uses SleefNameGenerator("name", ulp, /*rvvm1=*/true)
2populatePatternsForOp<math::SinOp>(patterns, gen("sin"), ...);
3populatePatternsForOp<math::CosOp>(patterns, gen("cos"), ...);
4populatePatternsForOp<math::ExpOp>(patterns, gen("exp"), ...);
5// ... 27 math operations in total

The RVVM1 naming pattern in SleefNameGenerator is:

1if (useRvvm1) {
2    return "Sleef_" + baseName + (bitwidth == 32 ? "f" : "d") +
3           "x" + ulpSuffix + "rvvm1";
4}

The RVVM1 mode is enabled through RVV feature detection in updatevecsize():

1if (feature == "v") {
2    isRvv = true;
3    vec_size_in_bits = std::max<size_t>(vec_size_in_bits, 128);
4}

Stage 5:

make_asm()

— LLVM IR to assembly

1# Cross-compilation path
2llvm.translate_to_asm(src, target_triple, target_cpu, target_features, ...)
3
4# Native-compilation path
5llvm.translate_to_host_asm(src, ...)

At this stage LLVM calls TargetMachine::addPassesToEmitFile() and runs the full backend pipeline: SelectionDAG / GlobalISel, instruction selection, register allocation, instruction scheduling, MC code emission, and so on.

Stage 6:

make_so()

— assembly to shared library

1# Cross-compilation path: call riscv64-linux-gnu-gcc
2_cross_build("kernel", asm_path, tmpdir, options)
3
4# Native-compilation path: call the host gcc/clang
5_build("kernel", asm_path, tmpdir, lib_dirs, include_dirs, libs, ccflags)

Kernel Runtime Execution Model

Triton’s parallel model is that each program instance independently handles one tile, and all instances form a 3D grid. On the CPU backend, that grid is mapped to threads with OpenMP parallel for.

The core scheduling code lives in the run_omp_kernels template in thirdparty/cpu/backend/driver.py:

 1// 1. Flatten the 3D grid into 1D
 2size_t N = gridX * gridY * gridZ;
 3auto all_grids = get_all_grids(gridX, gridY, gridZ);
 4
 5// 2. Special case: a single instance is called directly
 6if (N == 1) {
 7    (*kernel_ptr)(args..., 0, 0, 0, 1, 1, 1);
 8    return;
 9}
10
11// 3. Special case: single-threaded sequential execution (avoid OMP overhead)
12if (max_threads == 1) {
13    for (size_t i = 0; i < N; ++i) {
14        const auto [x, y, z] = all_grids[i];
15        (*kernel_ptr)(args..., x, y, z, gridX, gridY, gridZ);
16    }
17    return;
18}
19
20// 4. Multi-threaded parallel execution
21#pragma omp parallel for schedule(static) num_threads(max_threads)
22for (size_t i = 0; i < N; ++i) {
23    const auto [x, y, z] = all_grids[i];
24    (*kernel_ptr)(args..., x, y, z, gridX, gridY, gridZ);
25}

The grid expansion order is (z, y, x) with z outermost, which matches CUDA’s grid traversal order. Each OpenMP thread calls the kernel with a different (x, y, z) value, and that is where tl.program_id() comes from.

 116-core example:
 2vector_add(n=16384, BLOCK_SIZE=1024)
 3grid = (16,)  →  16 grid points
 4
 5OMP_NUM_THREADS=16:
 6  core  0: kernel(pid=0)  → arr[0:1024]       one block per core
 7  core  1: kernel(pid=1)  → arr[1024:2048]
 8  ...
 9  core 15: kernel(pid=15) → arr[15360:16384]
10
11OMP_NUM_THREADS=4:
12  core 0: pid=0,1,2,3     (schedule=static)   four blocks per core
13  core 1: pid=4,5,6,7
14  core 2: pid=8,9,10,11
15  core 3: pid=12,13,14,15

During cross compilation, -fopenmp is added by default (compiler.py:crossbuild()), so the generated .so already contains OpenMP parallel code. It can be disabled with TRITONDISABLEOPENMP=1.

PS: if you are interested in parallel programming, I strongly recommend the Chinese translation of Understanding Parallel Programming, 2nd Edition by Dr. Zhouzhouyi. In the agent era, being able to sit down and read one book deeply is a rare and valuable skill.

The ABI of the compiled kernel function is as follows:

 1typedef void (*kernel_ptr_t)(
 2    // User-defined arguments (non-constexpr)
 3    float* x_ptr,        // tl.pointer
 4    float* y_ptr,
 5    float* out_ptr,
 6    int32_t n_elements,  // tl.int32
 7
 8    // Fixed 6 scheduling arguments (added automatically by the compiler)
 9    uint32_t pid0,       // tl.program_id(0)
10    uint32_t pid1,       // tl.program_id(1)
11    uint32_t pid2,       // tl.program_id(2)
12    uint32_t gridX,      // tl.num_programs(0)
13    uint32_t gridY,      // tl.num_programs(1)
14    uint32_t gridZ       // tl.num_programs(2)
15);

Note that pid0/pid1/pid2 use signed int32_t, while gridX/gridY/gridZ use unsigned uint32_t, which matches the type definitions in FuncOpToLLVM.cpp.

Let’s also add the dlopen/dlsym dynamic-loading mechanism.

CPUUtils.load_binary() uses Python’s ctypes.cdll.LoadLibrary() to load the compiled .so dynamically:

 1def load_binary(self, name, kernel, shared_mem, device):
 2    # 1. Check whether the ELF architecture matches
 3    machine = _elf_machine(kernel)
 4    if machine != host_machine:
 5        raise RuntimeError("Cannot load ... ELF binary on ... host")
 6
 7    # 2. Write to a temporary file and load it
 8    with tempfile.NamedTemporaryFile(suffix=".so") as f:
 9        f.write(kernel)
10        f.flush()
11        lib = ctypes.cdll.LoadLibrary(f.name)
12        fn_ptr = getattr(lib, name)
13        return (lib, fn_ptr, 0, 0, 0)

Cross-compiled artifacts cannot be loaded directly on the host because the ELF machine does not match; they must run under QEMU or on real target hardware.

QEMU Validation System

Triton CPU uses qemu-user user-mode emulation rather than qemu-system full-system emulation, mainly for performance reasons. Through collaboration with the agent, we confirmed the following facts:

qemu-user can validate multithreaded correctness, multithreaded scheduling, and synchronization.

qemu-user cannot validate RVWMO memory ordering, because the host memory model is incorporated and relaxed-ordering bugs may be masked; multicore simulation through pthreads is not the same as real RISC-V.

Because the cross-compiled Triton kernel is a .so shared library (with no main function), the agent designed QemuRunner so that each test scenario gets its own C driver program. The driver is responsible for loading kernel.so, feeding in data, calling the kernel according to the grid traversal order, and printing the results.

The driver compilation and execution flow is:

 1def _compile_driver(self, driver_src, tmpdir, extra_flags=None):
 2    # 1. Write the C source code
 3    Path(src_path).write_text(driver_src)
 4    # 2. Cross compile
 5    cmd = [self.cc, src_path, "-o", bin_path,
 6           f"-march={march}", f"-mabi={self.target_abi}",
 7           "-ldl", "-lm", "-O2"]
 8    # 3. Optional OpenMP support
 9    if extra_flags:
10        cmd.extend(extra_flags)  # e.g. ["-fopenmp"]
11
12def _run_qemu(self, binary, args, tmpdir):
13    # 1. Build the qemu command
14    qemu_cmd = [self.qemu]
15    if self._needs_cpu_max():
16        qemu_cmd += ["-cpu", "max"]
17    # 2. Set the library search path
18    qemu_cmd += ["-L", self.sysroot]
19    qemu_cmd += ["-E", f"LD_LIBRARY_PATH={':'.join(lib_paths)}"]
20    # 3. Execute (60-second timeout)
21    ret = subprocess.run(qemu_cmd, capture_output=True, text=True, timeout=60)

To ensure the correctness and stability of agent-generated code, I designed a model-fragment testing plan. The tests are ordered by increasing Triton feature complexity, with each stage introducing a new computation pattern. The main coverage includes:

  • Single-operator correctness, such as basic vector operations, RVV vector operations, and BF16/FP16 type validation;
  • Single-pass reduction Softmax, covering baseline/RVV/OMP/all-inf/non-pow2 cases, with tolerances of atol=1e-4 and rtol=1e-4;
  • Multi-pass reduction LayerNorm, covering baseline/multipass/RVV/OMP plus baseline-vs-RVV three-output comparison, with tolerances of atol=1e-3 and rtol=1e-3;
  • Multi-kernel composition MLP, covering baseline/RVV plus invalid .so, wrong symbol, and dimension-mismatch cases, with tolerances of atol=0.1 and rtol=1e-2.

To improve delivery quality, I asked the agent to add as many positive and negative tests as possible for each feature, aiming to cover every functional point.

humanize Collaboration Retrospective

Finally, let’s look back at the collaboration process. The core toolchain for this work was Claude Code + the humanize skill. Humanize provides a structured workflow of gen-idea → gen-plan → rlcr-loop, and at the end of each loop it calls Codex (GPT-5.5, xhigh effort) for independent review.

The whole project was split into four plans:

PlanTask CountAC CountMilestones
RISC-V backend feature implementation2395 (from native smoke tests to RVV optimization)
RVV/BF16/FP16 testing2295 (from regression baseline to SLEEF integration)
Model-fragment testing1264 (Softmax → LayerNorm → MLP)
Documentation refactor954 (merge 8 files into 3, 45% deduplication)

Across the four plans, that is 66 tasks, 29 acceptance criteria, and 18 milestones. The planning phase took three days (April 27–29), and the execution phase averaged about three hours and nine loops per plan.

The core problem humanize solved here was estimate correction.

In the initial draft, I estimated the cross-compilation work at “300–500 lines of code.”

In the second review round, Codex pointed out that tritoncpu.cc hard-coded opts.x86 = true, vec_lib_requirements had no RISC-V branch at all, and translate_to_host_asm could not handle cross-compilation scenarios.

Another value is blind-spot discovery. Sixteen Codex reviews (13 of them using GPT-5.5) identified seven major technical blind spots in total:

x86 hard-coding, cross-compilation complexity, the C++/CMake/pybind work required for the RVV pass, cache-key design, the 16-core validation standard, the OpenMP validation method, and the multi-kernel composition architecture. If these issues had only surfaced during implementation, the rework cost would have been high.

Back to the original question: fork or innovation?

Honestly, this work sits on the boundary between the two.

ConvertDotToRVV is a RISC-V variant of ConvertDotToFMA, and SLEEF integration is an architecture extension on the x86 path. In terms of implementation technique, the agent was still doing a migration with a reference implementation, which is a high-quality fork.

But the result is new. Triton did not have a RISC-V CPU backend before; now it does. I did not understand the MLIR pass system before; now I can break down the pass pipeline across six compilation stages and write about it.

That is probably what “innovation” looks like in the agent era: you do not have to start from zero.

You bring depth in one domain, and the agent bridges you into another framework. My RISC-V knowledge determined the direction and the judgment criteria, while the agent filled in the Triton compiler-framework blind spots.

Sixteen Codex reviews helped me find seven technical problems I would not have found on my own, and the parallel exploration in gen-idea helped me build a structural understanding of an unfamiliar codebase in just a few minutes.

The upper bound of a fork is the original work. The upper bound of innovation depends on the domain knowledge you bring with you.

The agent lowers the barrier to cross-domain composition from “master both domains” to “master one domain and be able to ask the right questions about the other.”

That workflow still has clear boundaries. If the task is to design a tile-scheduling algorithm for a RISC-V matrix extension from scratch, the quality of the plan produced by gen-plan drops sharply because the agent lacks a reference anchor. Agents are great at producing high-quality variants of known patterns; they are not good at original exploration in unknown space.

Recognizing that boundary is the key to using the agent in the right place.

Delivery and Trust

Writing the code with the agent is only the beginning. The real question is: how do you deliver that result so that others trust it?

The first instinct may be to ask whether the agent implemented something perfectly. But that is not the first-order issue. The first-order issue is whether you, as the developer, fully understand the implementation the agent produced.

Once the code is delivered, the audience is human.

People will ask why you chose RVVM1 instead of RVVM2, someone will challenge the tiling strategy in ConvertDotToRVV during code review, and someone will hit a missing SLEEF symbol link error in production.

Those situations do not call for running the agent again. They call for you to respond in time, localize the problem, and explain it.

That is also why I spent so much time earlier breaking down the six compilation stages.

It was not to show off the agent’s output, but to confirm that I myself understood what each layer of the pass pipeline was doing. If I had just pasted the agent’s output, the first bug report would have exposed that I did not understand what I was shipping.

The more realistic issue is that the agent will not always be there.

The session will expire, the context will overflow, the toolchain will change, and the agent’s output quality may regress — all of that will happen. When that does, can you keep iterating on the code the agent wrote? Can you manually edit a pass, add a test, or fix a cross-compilation link flag?

That is the value of the Git workflow. The development branch preserves the full 19-loop history, and every change, every reason for the change, and every issue discovered by Codex review is traceable.

The five feature commits on main are independently buildable. Anyone who picks up the repository can start from any commit, reconstruct the context, and continue the work.

So the essence of delivery is not the quality of the agent’s code; it is your control over that code. The thing other people trust is not the agent — it is the developer.

That is enough rambling. Happy May Day holidays, everyone~

Written by: Zevorn

Cover image: Kernyr

Review: Zevorn

Image and reference sources:

[1] OpenAI Triton official repository: https://github.com/triton-lang/triton

[2] triton-cpu (CPU backend) repository: https://github.com/triton-lang/triton-cpu

[3] Triton official documentation and tutorials: https://triton-lang.org

[4] SLEEF vector math library: https://github.com/shibatch/sleef

[5] QEMU official documentation: https://www.qemu.org/docs/master/

[6] RISC-V specification (Vector Extension V 1.0): https://github.com/riscv/riscv-v-spec

[7] MLIR official documentation: https://mlir.llvm.org

[8] LLVM official documentation: https://llvm.org/docs/

[9] The illustrative diagrams in this article were generated by GPT image-2.

Further reading

  • Follow the official account and get free access to the community-provided ima knowledge base

Now available:

AI Infra/QEMU/Compiler/Linux