Sparse and Dense Switches on RISC‑V

This post looks at a couple of size tricks used in the RP2350 bootrom. There is one trick for sparse and one trick for dense case statements.

That bootrom has some pretty gnarly size hacks because it has to fit a lot of functionality into a 32 kB mask ROM, and execute on both Arm and RISC‑V. These two tricks are on the more generally useful end of the spectrum; you can apply them in your own hand-written RISC‑V code if you are tight on space.

For the purpose of this post we are interested only in static code size. Performance benchmarking is for nerds.

PC-relative Compressed Instructions

The RISC‑V C extension has relatively few instructions which observe or modify the program counter:

That’s it. One painful omission is a 16-bit counterpart for auipc, like lda from Thumb; in fact the only 16-bit instructions that write a PC-relative value into a GPR are c.jal and c.jalr, and you can bet we will make use of this fact.

There are also the cm.popret and cm.popretz instructions from Zcmp, but they are mostly irrelevant for this post. We’ll stick to the RV32IC dialect.

Sparse Case Statements

A sparse case statement is one where the case items are not numerically consecutive. Say I want to select one of five integers depending on whether an upper-case letter appears in my name:

// Still a better workload than Dhrystone
int is_luke(char x, int a, int b, int c, int d, int e) {
    switch (x) {
        case 'L': return a;
        case 'U': return b;
        case 'K': return c;
        case 'E': return d;
        default:  return e;
    }
}

You can look at that code on Godbolt here but I’m going to use objdump output because it makes the instruction sizes more visually obvious. All compilation in this post is with GCC 15.1.0.

$ riscv32-unknown-elf-gcc -march=rv32ic -Os -c sparse-case.c
$ riscv32-unknown-elf-objdump -d sparse-case.o

sparse-case.o:     file format elf32-littleriscv


Disassembly of section .text:

00000000 <is_luke>:
   0:   04c00813        li      a6,76
   4:   03050563        beq     a0,a6,2e <.L4>
   8:   00a86d63        bltu    a6,a0,22 <.L3>
   c:   04500613        li      a2,69
  10:   02c50163        beq     a0,a2,32 <.L5>
  14:   04b00713        li      a4,75
  18:   00e51363        bne     a0,a4,1e <.L2>
  1c:   87b6            mv      a5,a3

0000001e <.L2>:
  1e:   853e            mv      a0,a5
  20:   8082            ret

00000022 <.L3>:
  22:   05500713        li      a4,85
  26:   fee51ce3        bne     a0,a4,1e <.L2>
  2a:   87b2            mv      a5,a2
  2c:   bfcd            j       1e <.L2>

0000002e <.L4>:
  2e:   87ae            mv      a5,a1
  30:   b7fd            j       1e <.L2>

00000032 <.L5>:
  32:   87ba            mv      a5,a4
  34:   b7ed            j       1e <.L2>

The compiler output is fairly straightforward: load each comparison constant in turn. If it compares equal to x (passed in a0), then branch to a subroutine that returns the correct integer a through e (passed in a1 through a5). One neat detail is that it actually searches in a binary tree instead of going through the cases one by one; the first value it checks is 76, or ASCII L, so U is above it and K and E are below it.

The important thing to notice in that disassembly is that the instructions in the branch tree are all 32-bit instructions.

To make this smaller, think back to our compressible instructions: the only conditional branches are comparisons for equal/not equal to zero, so we need to transform the comparisons into this form. We can do this by adding a series of differences to a0 so that it is zero when the initial value was equal to our case item:

is_luke:
    addi a0, a0, -'E'
    beqz a0, 1f
    addi a0, a0, 'E'-'K'
    beqz a0, 2f
    addi a0, a0, 'K'-'L'
    beqz a0, 3f
    addi a0, a0, 'L'-'U'
    beqz a0, 4f
0:
    mv a0, a5
    ret
1:
    mv a0, a1
    ret
2:
    mv a0, a2
    ret
3:
    mv a0, a3
    ret
4:
    mv a0, a4
    ret

These are all 16-bit instructions, except for the very first instruction which subtracts 69; c.addi has a range of -32 to 31. We can check this by assembling and disassembling:

$ riscv32-unknown-elf-gcc -march=rv32ic -c sparse-case.S
$ riscv32-unknown-elf-objdump -d sparse-case.o          

sparse-case.o:     file format elf32-littleriscv


Disassembly of section .text:

00000000 <is_luke>:
   0:   fbb50513        addi    a0,a0,-69
   4:   c909            beqz    a0,16 <.L1^B1>
   6:   1569            addi    a0,a0,-6
   8:   c909            beqz    a0,1a <.L2^B1>
   a:   157d            addi    a0,a0,-1
   c:   c909            beqz    a0,1e <.L3^B1>
   e:   155d            addi    a0,a0,-9
  10:   c909            beqz    a0,22 <.L4^B1>
  12:   853e            mv      a0,a5
  14:   8082            ret

00000016 <.L1^B1>:
  16:   852e            mv      a0,a1
  18:   8082            ret

0000001a <.L2^B1>:
  1a:   8532            mv      a0,a2
  1c:   8082            ret

0000001e <.L3^B1>:
  1e:   8536            mv      a0,a3
  20:   8082            ret

00000022 <.L4^B1>:
  22:   853a            mv      a0,a4
  24:   8082            ret

The compiler output is 36 bytes, and the hand-written code is 26 bytes, for a 28% reduction.

Dense Case Statements

Hopefully you found that agreeable. The next one is uglier. I mentioned earlier that c.jalr and c.jal are the only compressed instructions which write a PC-relative value to a GPR.

A dense case statement is one where the case items are all consecutive (my definition). Say we want to apply one of several operations to a pair of integers based on some opcode (Godbolt link):

int alu(int op, int a, int b) {
    switch (op & 0x7) {
        case 0: return a + b;
        case 1: return a - b;
        case 2: return a & b;
        case 3: return a | b;
        case 4: return a ^ b;
        case 5: return a >> b;
        case 6: return a << b;
        case 7: return ~a;
        default: __builtin_unreachable();
    }
}

The compiler output is:

$ riscv32-unknown-elf-gcc -march=rv32ic -Os -c dense-case.c 
$ riscv32-unknown-elf-objdump -d dense-case.o

dense-case.o:     file format elf32-littleriscv


Disassembly of section .text:

00000000 <alu>:
   0:   891d            andi    a0,a0,7
   2:   157d            addi    a0,a0,-1
   4:   4799            li      a5,6
   6:   00a7ea63        bltu    a5,a0,1a <.L2>
   a:   000007b7        lui     a5,0x0
   e:   00078793        mv      a5,a5
  12:   050a            slli    a0,a0,0x2
  14:   953e            add     a0,a0,a5
  16:   411c            lw      a5,0(a0)
  18:   8782            jr      a5

0000001a <.L2>:
  1a:   00c58533        add     a0,a1,a2
  1e:   8082            ret

00000020 <.L10>:
  20:   40c58533        sub     a0,a1,a2
  24:   8082            ret

00000026 <.L9>:
  26:   00c5f533        and     a0,a1,a2
  2a:   8082            ret

0000002c <.L8>:
  2c:   00c5e533        or      a0,a1,a2
  30:   8082            ret

00000032 <.L7>:
  32:   00c5c533        xor     a0,a1,a2
  36:   8082            ret

00000038 <.L6>:
  38:   40c5d533        sra     a0,a1,a2
  3c:   8082            ret

0000003e <.L5>:
  3e:   00c59533        sll     a0,a1,a2
  42:   8082            ret

00000044 <.L3>:
  44:   fff5c513        not     a0,a1
  48:   8082            ret

This is mostly just performing a 32-bit lookup with a0 & 0x7 as an index. a0 is the op operand to our original C function. I’m not quite sure why it does the initial branch for > 6; it might be vestigial from the default case, if the compiler doesn’t realise that the case is fully populated due to the AND on op.

The lui; mv pair is actually getting the absolute address of a table of 32-bit pointers in .rodata:

$ riscv32-unknown-elf-objdump -dr -j .rodata dense-case.o

Disassembly of section .rodata:

00000000 <.L4>:
        ...
                 0: R_RISCV_32   .L10
                 4: R_RISCV_32   .L9
                 8: R_RISCV_32   .L8
                 c: R_RISCV_32   .L7
                 10: R_RISCV_32  .L6
                 14: R_RISCV_32  .L5
                 18: R_RISCV_32  .L3

I have two observations about this code:

Here is an alternative approach:

alu:
    mv t1, ra
    andi t0, a0, 0x7
    jal table_branch_byte
alu_op_table:
    .byte 0f - alu_op_table
    .byte 1f - alu_op_table
    .byte 2f - alu_op_table
    .byte 3f - alu_op_table
    .byte 4f - alu_op_table
    .byte 5f - alu_op_table
    .byte 6f - alu_op_table
    .byte 7f - alu_op_table
0:  add a0, a1, a2
    jr t1
1:  sub a0, a1, a2
    jr t1
2:  and a0, a1, a2
    jr t1
3:  or a0, a1, a2
    jr t1
4:  xor a0, a1, a2
    jr t1
5:  sra a0, a1, a2
    jr t1
6:  srl a0, a1, a2
    jr t1
7:  not a0, a1
    jr t1

table_branch_byte:
    add t0, t0, ra
    lbu t0, (t0)
    add t0, t0, ra
    jr t0

This assembles to 74 bytes, including the jump table. What’s more, the table_branch_byte subroutine (10 bytes) can be shared among multiple functions. The compiler version is 106 bytes, including the jump table (minus a potential 6 bytes for linker relaxation of the non-position-independent table reference).

The table_branch_byte routine takes the following steps:

This trashes ra but you are likely to have it on the stack already – it’s pushed for “free” by any cm.push and popped by any cm.pop*.

See here for an example of this trick being used in the RP2350 bootrom. There are some macros in use to cut down on wear and tear on the programmer’s keyboard.

Bonus Trick: Constant Islands in Functions

RISC‑V avoids emitting constants in .text sections because, among other reasons, it makes it difficult to mark .text as execute-only for security purposes. What if we didn’t care about security? What if we were just nice to each other?

Say we wanted to zero a list of memory regions during startup:

.p2align 2
    jal a1, 1f
.word __bss_start, __bss_end
.word __heap_start, __heap_end
.word 0
1:
    lw a0, (a1)
    beqz a0, 3f
    lw a1, 4(a1)
    bgeu a0, a1, 3f
2:
    sw zero, (a0)
    addi a0, a0, 4
    bltu a0, a1, 2b
3:
    // continue with startup...

That’s it, I hope you learned a new trick, or at least recoiled from your screen in horror a couple of times.

⇥ Return to wren.wtf