Friday, December 5, 2008

iPhone VFP for n00bs

I thought I'd jot down a couple notes outlining my experience so far with writing vector assembly for the iPhone.

First of all, a couple of useful links...

About the only examples on the web to date are in the vfpmathlibrary: http://code.google.com/p/vfpmathlibrary

The website is currently down, but the ARM GCC Inline Assembly Cookbook has lots of great info: http://www.ethernut.de/en/documents/arm-inline-asm.html

You'll also likely want the quick reference card for the ARM/VFP instruction set: http://www.voti.nl/hvu/arm/ARMquickref.pdf
(I unfortunately couldn't find a link on arm's site)

If your app is floating point heavy (which is probably why you want to use the vfp in the first place), then you'll likely already have disabled compiling for thumb. If not, you might want to open up your project settings and uncheck that box. Otherwise, you'll have to follow the vfpmathlibrary's lead and wrap your assembly in their macros, which switch to arm mode and back.

Now, go read the inline assembly cookbook linked above if you're not already familiar with how inline assembly works in gcc. The basic format is:

void test(float * src, float * dst)
{
asm volatile (
"fldmias %0!, {s8-s11} \n\t" // operations
"fstmias %1!, {s8-s11} \n\t"
: "=r" (src), "=r" (dst) // output operands
: "0" (src), "1" (dst) // input operands
: "memory", "s8", "s9", "s10", "s11" // clobbers (memory, plus the fp registers we overwrite)
);
}


A couple of notes here:
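  • If you're compiling your code as strict ISO C (for example with -std=c99 or -ansi) instead of C++ or GNU C, plain asm isn't recognized, so you'll need to use __asm__.
  • 'volatile' tells the compiler not to optimize your asm block away (for example, dropping it because its outputs look unused).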
  • The load and store commands don't obey the vector width (more on that below), so you have to specify the range explicitly.
  • The first bank of 8 fp registers (s0-s7, which overlap d0-d3) is treated as scalars, so those registers also ignore the width. This is useful for loading a single value and then multiplying an entire vector by it.
  • VFP has support for both single and double precision. This is what the last letter of each instruction specifies (e.g. fmuls vs fmuld), and the first letter for registers (s0-s31 vs d0-d15).
  • fldmias = floating [point], load multiple, increment after, single [precision]. In the example above, it loads the values at src[0], src[1], src[2] & src[3] into registers s8, s9, s10 & s11 respectively. Alternatively, you can use decrement before (fldmdbs). Some good details on increment after vs decrement before, etc. can be found here.
  • The exclamation mark (!) is required to actually increment or decrement the pointer. If you want to leave it alone, leave the exclamation point off. Technically speaking, it's used to set the 'w' bit. When you modify it this way, you need to make sure it's both an input and output operand.
  • Output operands are write-only and input operands are read-only. Normally, you'd specify an input using just "r", but when you want an operand to be both read and written, you list it as an output ("=r") and then use that output's operand index ("0", "1") as the constraint on the matching input. Refer back to the inline assembler cookbook for more details.
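  • Instructions should be separated by new lines and tabs (the \n\t above), so that when they're pasted into the compiler's generated assembly, it reads nicely for debugging.

The vfp can process several floats per operation; this is the vector width. I use widths up to 4 floats here, though the width field in the fpscr is three bits wide, so single precision can go up to 8 as long as the vector stays within one bank of registers. To set the width, use the following code:

"fmrx r0, fpscr \n\t" // read the fp status/control register into r0
"bic r0, r0, #0x00370000 \n\t" // clear the width (bits 16-18) and stride (bits 20-21) fields
"orr r0, r0, #0x00030000 \n\t" // the 3 in here is the width
"fmxr fpscr, r0 \n\t" // write the new value back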


(or just use common_macros.h from vfpmathlibrary)

The value is always width-1, so for width 4 you use the value 3, and width 1 you use 0.
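
As a rough sketch of the arithmetic those four instructions perform, here is the same bit manipulation in plain C. The name fpscr_with_width is made up for illustration (it isn't part of vfpmathlibrary); the mask and bit position come straight from the snippet above.

```c
#include <stdint.h>

/* Hypothetical helper: given the current fpscr value, return the value
   with the vector width set to `width`. Mirrors the bic/orr pair above.
   Callers in this post pass a width of 1 to 4. */
static uint32_t fpscr_with_width(uint32_t fpscr, unsigned width)
{
    /* bic: clear the width and stride fields */
    fpscr &= ~UINT32_C(0x00370000);
    /* orr: store width-1 in the three-bit field starting at bit 16 */
    fpscr |= (uint32_t)((width - 1u) & 7u) << 16;
    return fpscr;
}
```

For a width of 4 starting from an fpscr of 0, this yields 0x00030000, which is the constant used in the orr instruction.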

If you change the width, you should set it back to one when you're done, since the compiler's own floating point code assumes a width of one. Changing the width is expensive since it needs to wait for existing operations to finish. This means vfp is mostly useful for doing large batches of work instead of wrapping each vector operation individually. My guess is this is why Apple hasn't released a nice math library already.
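
To make the width and the scalar bank a bit more concrete, here's a small plain C model of what a single precision multiply does to the registers. This is a simplified sketch of the rules as I understand them (a stride of 1, with registers wrapping around inside their bank of 8), and model_fmuls is a made-up name; it is not how the hardware or any library implements it.

```c
/* Simplified model of "fmuls sD, sN, sM" with the given width.
   regs[] stands in for s0-s31. Assumes a stride of 1. */
static void model_fmuls(float regs[32], int d, int n, int m, int width)
{
    /* Destination in the first bank (s0-s7): scalar op, width ignored. */
    int count = (d < 8) ? 1 : width;
    /* Last operand in the first bank: the same value is used for every element. */
    int m_is_scalar = (m < 8);
    int i;

    for (i = 0; i < count; ++i) {
        /* Step through the registers, wrapping around inside the bank of 8. */
        int di = (d & ~7) | ((d + i) & 7);
        int ni = (n & ~7) | ((n + i) & 7);
        int mi = m_is_scalar ? m : ((m & ~7) | ((m + i) & 7));
        regs[di] = regs[ni] * regs[mi];
    }
}
```

With s0 holding 2 and s8-s11 holding 1, 2, 3, 4, calling model_fmuls(regs, 8, 8, 0, 4) leaves 2, 4, 6, 8 in s8-s11. That's the "multiply a whole vector by one value" case from the notes above.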

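For inline assembly, stick to numeric local labels (such as 1:). A named label would end up defined twice if the compiler ever inlined or duplicated the asm block. When you branch, you add f or b to the label number to jump to the next (forward) or previous (back) label with that number. Here's a simple example.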

void loop()
{
const int count = 10;

asm volatile (

// Setup loop using count.
"mov r0, %0 \n\t"
"1: \n\t"

// Stuff we want to do count times.
...

// Decrement count and loop till zero.
"subs r0, r0, #1 \n\t"
"bne 1b \n\t"

: // no output in this example
: "r" (count)
: "r0", "cc"
);
}


In the clobbers, we need to specify that we changed "r0" (which we're using to count down), and "cc" because we set the condition flags in the "subs" instruction - the trailing 's' does that. Refer back to the quick reference card for all the neat things you can do if you're not familiar with arm assembly. Here, we're subtracting 1 (use the '#' prefix for constants) from r0 and storing it back in r0, but also setting the condition flags, so the next instruction can branch on the result.
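
For comparison, the control flow of that loop looks roughly like this in C. Note that the body runs before the counter is tested, so a count of zero would wrap around instead of skipping the loop. The guard below is my addition to the sketch, and run_count_times and its callback are hypothetical names.

```c
/* Plain C sketch of the subs/bne countdown loop.
   Returns how many times the body ran. */
static unsigned run_count_times(unsigned count, void (*body)(void))
{
    unsigned r0 = count;      /* mov r0, %0 */
    unsigned iterations = 0;

    if (r0 == 0)              /* guard that the asm version doesn't have */
        return 0;

    do {                      /* 1: */
        if (body)
            body();           /* stuff we want to do count times */
        ++iterations;
        r0 -= 1;              /* subs r0, r0, #1 */
    } while (r0 != 0);        /* bne 1b */

    return iterations;
}
```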

One thing that's easy to forget early on is that modifying registers (i.e. your output operands) modifies the variables. It sounds stupid when put like that, but when you're doing your fldmias %0! to load in a bunch of data, it's updating your 'src' pointer along the way, so when you're done, 'src' will point off the end of your array.
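
Here's a plain C sketch of that effect. The hypothetical load4_writeback stands in for fldmias %0!, {s8-s11}: it copies four floats and bumps the pointer. Walking a separate cursor, as sum_in_fours does, is one way to keep the original pointer around for later.

```c
#include <stddef.h>

/* Model of "fldmias %0!, {s8-s11}": read four floats, then advance the
   pointer past them (the writeback that the '!' asks for). */
static void load4_writeback(const float **src, float regs[4])
{
    size_t i;
    for (i = 0; i < 4; ++i)
        regs[i] = (*src)[i];
    *src += 4;
}

/* Sum n floats (n must be a multiple of 4) in batches of four.
   `cursor` ends up pointing off the end of the array; `data` is untouched. */
static float sum_in_fours(const float *data, size_t n)
{
    const float *cursor = data;
    float regs[4];
    float total = 0.0f;

    while ((size_t)(cursor - data) < n) {
        load4_writeback(&cursor, regs);
        total += regs[0] + regs[1] + regs[2] + regs[3];
    }
    return total;
}
```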

Finally, you'll likely want to wrap your assembly in #if !TARGET_IPHONE_SIMULATOR / #endif. Or rather, you'll want something like:

#include <TargetConditionals.h> // defines TARGET_IPHONE_SIMULATOR

#if TARGET_IPHONE_SIMULATOR
// C/C++ only versions here
#else
// inline asm version here
#endif

Otherwise, you won't be able to compile for the simulator (which runs on your Intel Mac's x86 processor and can't assemble ARM instructions).
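
Putting that together with the copy routine from the start of the post, a minimal sketch might look like the following. The compiler macros used to pick the ARM path are an assumption on my part (they let the same file build off-device too); with the iPhone SDK you'd test TARGET_IPHONE_SIMULATOR as described above. copy4 and USE_VFP_ASM are made-up names.

```c
/* Assumption: __arm__ without __thumb__ or __SOFTFP__ means we're building
   arm mode code with a hardware vfp available. */
#if defined(__arm__) && !defined(__thumb__) && !defined(__SOFTFP__)
#define USE_VFP_ASM 1
#else
#define USE_VFP_ASM 0
#endif

/* Copy four floats from src to dst. */
static void copy4(const float *src, float *dst)
{
#if USE_VFP_ASM
    /* Same two instructions as the first example in this post. */
    __asm__ volatile (
        "fldmias %0!, {s8-s11} \n\t"
        "fstmias %1!, {s8-s11} \n\t"
        : "=r" (src), "=r" (dst)
        : "0" (src), "1" (dst)
        : "memory", "s8", "s9", "s10", "s11"
    );
#else
    /* C only version for the simulator (or any other non-ARM build). */
    int i;
    for (i = 0; i < 4; ++i)
        dst[i] = src[i];
#endif
}
```

Both paths have the same signature, so callers don't need to know which one they got.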