Friday, December 5, 2008

iPhone VFP for n00bs

I thought I'd jot down a couple notes outlining my experience so far with writing vector assembly for the iPhone.

First of all, a couple of useful links...

About the only examples on the web to date are in the vfpmathlibrary: http://code.google.com/p/vfpmathlibrary

The website is currently down, but the ARM GCC Inline Assembly Cookbook has lots of great info: http://www.ethernut.de/en/documents/arm-inline-asm.html

You'll also likely want the quick reference card for the ARM/VFP instruction set: http://www.voti.nl/hvu/arm/ARMquickref.pdf
(Unfortunately, I couldn't find a link on ARM's site.)

If your app is floating-point heavy (which is probably why you want to use the VFP in the first place), then you've likely already disabled compiling for Thumb. If not, you might want to open up your project settings and uncheck that box. Otherwise, you'll have to follow the vfpmathlibrary and wrap your assembly in their macros that switch to ARM mode and back.

Now, go read the inline assembly cookbook linked above if you're not already familiar with how inline assembly works in gcc. The basic format is:

void test(float * src, float * dst)
{
    asm volatile (
        "fldmias %0!, {s8-s11} \n\t" // operations
        "fstmias %1!, {s8-s11} \n\t"
        : "=r" (src), "=r" (dst) // output operands
        : "0" (src), "1" (dst) // input operands
        : "s8", "s9", "s10", "s11", "memory" // clobbers (the registers we overwrite, plus memory)
    );
}


A couple of notes here:
  • If you're compiling your code as C instead of C++, you may need to spell it __asm__ (the plain asm keyword isn't available in strict ISO C modes such as -std=c99).
  • 'volatile' tells the compiler not to optimize away or move your asm block.
  • The load and store instructions don't obey the vector width (more on that below), so you have to specify the register range explicitly.
  • The first 8 single-precision registers (the first bank: s0-s7, which alias d0-d3) are treated as scalars, so they also ignore the width. This is useful for loading a single value and then multiplying an entire vector by it.
  • VFP supports both single and double precision. The last letter of each instruction selects the precision (e.g. fmuls vs fmuld), as does the first letter of each register name (s0-s31 vs d0-d15).
  • fldmias = floating [point], load multiple, increment after, single [precision]. In the example above, it loads the values at src[0], src[1], src[2] & src[3] into registers s8, s9, s10 & s11 respectively. Alternatively, you can use decrement before (fldmdbs). Some good details on increment after vs decrement before, etc. can be found here.
  • The exclamation mark (!) is required to actually increment or decrement the pointer. If you want to leave the pointer alone, leave the exclamation mark off. Technically speaking, it sets the 'w' (writeback) bit. When you modify a pointer this way, you need to make sure it's both an input and an output operand.
  • Output operands are write-only. Input operands are read-only. Normally, you'd specify an input using just "r", but when you want an operand to be both read and written, you list it as an output and use that output's index ("0", "1", ...) as the input constraint instead. Refer back to the inline assembler cookbook for more details.
  • Instructions should be separated by new lines and tabs as illustrated above, so that the generated assembly reads nicely when you're debugging.

The width is how many floats are processed per VFP operation. This post uses widths up to 4 floats (the length field itself is 3 bits wide, so single precision can go up to 8). To set the width, use the following code:

"fmrx r0, fpscr \n\t"
"bic r0, r0, #0x00370000 \n\t"
"orr r0, r0, #0x00030000 \n\t" // the 3 in here is the width
"fmxr fpscr, r0 \n\t"


(or just use common_macros.h from vfpmathlibrary)

The value is always width-1, so for width 4 you use the value 3, and width 1 you use 0.
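
To make the width-1 arithmetic concrete, here is a minimal plain-C sketch of the same bit manipulation the bic/orr pair performs. The function and macro names are made up for illustration (they are not from vfpmathlibrary), and the bit layout is assumed from the mask used above: the length field in bits 16-18 and the stride field in bits 20-21. It only computes the value; you'd still need the fmrx/fmxr instructions to read and write the real register.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, taken from the 0x00370000 mask in the snippet above:
   length in bits 16-18, stride in bits 20-21. */
#define FPSCR_LEN_STRIDE_MASK 0x00370000u
#define FPSCR_LEN_SHIFT       16

/* Hypothetical helper: returns `fpscr` with the vector width set to
   `width` and the stride field cleared, leaving all other bits alone. */
uint32_t fpscr_with_width(uint32_t fpscr, unsigned width)
{
    assert(width >= 1 && width <= 4);                    /* range used in this post */
    fpscr &= ~FPSCR_LEN_STRIDE_MASK;                     /* same job as the bic */
    fpscr |= (uint32_t)(width - 1) << FPSCR_LEN_SHIFT;   /* same job as the orr */
    return fpscr;
}
```

So width 4 yields the 0x00030000 constant from the assembly above, and width 1 yields 0 in that field, which is what you'd write back when restoring scalar mode.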

If you change the width, you should set it back to one when you're done, since compiler-generated floating-point code assumes scalar mode. Changing the width is expensive since it needs to wait for existing operations to finish. This means the VFP is mostly useful for doing large batches of work rather than wrapping each vector operation individually. My guess is this is why Apple hasn't released a nice math library already.
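
If the scalar-bank and width rules feel abstract, a simplified plain-C model of a single "fmuls sD, sN, sM" may help. This is only a sketch under assumptions: the register file is a plain array, the stride is 1, rounding modes and exceptions are ignored, and the function names are hypothetical. It is not how you'd write real code, just a way to see which registers an instruction touches.

```c
#include <assert.h>

/* Step `i` registers on from `base`, wrapping inside its 8-register bank. */
static unsigned bank_wrap(unsigned base, unsigned i)
{
    return (base & ~7u) | ((base + i) & 7u);
}

/* Hypothetical model of "fmuls sD, sN, sM" at the given width, stride 1. */
void fmuls_model(float s[32], unsigned d, unsigned n, unsigned m, unsigned width)
{
    assert(d < 32 && n < 32 && m < 32 && width >= 1 && width <= 4);
    if (d < 8) {                      /* destination in the first bank: scalar op */
        s[d] = s[n] * s[m];
        return;
    }
    for (unsigned i = 0; i < width; i++) {
        /* A first-bank second operand stays fixed (vector times scalar). */
        unsigned mi = (m < 8) ? m : bank_wrap(m, i);
        s[bank_wrap(d, i)] = s[bank_wrap(n, i)] * s[mi];
    }
}
```

With the width set to 4, fmuls_model(s, 8, 8, 0, 4) scales s8-s11 in place by the single value in s0, which is the "multiply an entire vector by one value" case.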

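For inline assembly, stick to numeric local labels (such as 0-9), since a named label would be defined twice if the compiler duplicates or inlines the block. When you branch, you use f or b to jump to the next (forward) or previous (back) specified label. Here's a simple example.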

void loop()
{
const int count = 10;

asm volatile (

// Setup loop using count.
"mov r0, %0 \n\t"
"1: \n\t"

// Stuff we want to do count times.
...

// Decrement count and loop till zero.
"subs r0, r0, #1 \n\t"
"bne 1b \n\t"

: // no output in this example
: "r" (count)
: "r0", "cc"
);
}


In the clobbers, we need to specify that we changed "r0" (which we're using to count down), and "cc" because we set the condition flags in the "subs" instruction - the trailing 's' does that. Refer back to the quick reference card for all the neat things you can do if you're not familiar with ARM assembly. Here, we're subtracting 1 (use the '#' prefix for constants) from r0 and storing it back in r0, but also setting the condition flags, so we can use them for branching in the next instruction.
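
In plain C terms, the subs/bne pair behaves like a do/while loop. Here's a small sketch of that equivalence (the function name is made up for illustration); it just counts how many times the loop body would run.

```c
/* Plain-C model of the countdown loop above.
   Returns how many times the loop body executed. */
unsigned countdown_iterations(unsigned count)
{
    unsigned r0 = count;        /* mov r0, %0 */
    unsigned body_runs = 0;
    do {                        /* 1: */
        body_runs++;            /* stuff we want to do count times */
        r0 -= 1;                /* subs r0, r0, #1 */
    } while (r0 != 0);          /* bne 1b */
    return body_runs;
}
```

Because the test comes after the body, the body always runs at least once. A count of zero wraps around instead of skipping the loop, so check for that case before entering the asm block if it can happen.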

One thing that's easy to forget early on is that modifying registers (i.e. your output operands) modifies the variables. It sounds stupid when put like that, but when you're doing your fldmias %0! to load in a bunch of data, it's updating your 'src' pointer along the way, so when you're done, 'src' will point off the end of your array.
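
A plain-C model of the earlier load/store pair shows the effect. This is only a sketch: the function name is hypothetical, and the local array stands in for registers s8-s11.

```c
#include <stddef.h>

/* Plain-C model of:
       fldmias %0!, {s8-s11}
       fstmias %1!, {s8-s11}
   Both pointers advance by four floats, like the "!" writeback does. */
void copy4_writeback(const float **src, float **dst)
{
    float regs[4];                      /* stands in for s8-s11 */
    for (size_t i = 0; i < 4; i++)
        regs[i] = (*src)[i];            /* load multiple */
    *src += 4;                          /* increment after */
    for (size_t i = 0; i < 4; i++)
        (*dst)[i] = regs[i];            /* store multiple */
    *dst += 4;                          /* increment after */
}
```

If you still need the start of the array afterwards, keep a second copy of the pointer and hand only the working copy to the asm block.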

Finally, you'll likely want to wrap your assembly in #if !TARGET_IPHONE_SIMULATOR / #endif. Or rather, you'll want something like:

#if TARGET_IPHONE_SIMULATOR
C/C++ only versions here
#else
inline asm version here
#endif

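Otherwise, you won't be able to compile for the simulator, which runs on your Intel Mac's x86 processor and can't assemble ARM instructions. TARGET_IPHONE_SIMULATOR is defined as 0 or 1 in TargetConditionals.h for both device and simulator builds, so test it with #if rather than #ifdef.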

12 comments:

Aaron Leiby said...

Oh, also, I haven't had much luck getting gdb to display the fp register values. If anyone knows any tricks to get that working that'd be great. I've been resorting to printf in the meantime.

Wolfgang Engel said...

Hey Aaron,
would you be interested in contributing to the VFP math library?

dopplex said...

Take a look at http://infocenter.arm.com/help/topic/com.arm.doc.ddi0301g/DDI0301G_arm1176jzfs_r0p7_trm.pdf for some additional VFP reference material. It's the entire technical manual for the iPhone's ARM chip, and it has a section or two on VFP.

Unknown said...

Hi, thanks for the article it's very good to get started on VFP dev.

I did some tests on a naive simd implementation of complex numbers multiplication and I found out the hard way that straight C compiled with gcc -O3 -ffast-math -marm did mincemeat of my lovingly hand-crafted SIMD code. I'm a bit peeved as I'd never have thought that 4 C method calls would be 200 times faster than 5 SIMD instructions on a vector 8 floats...

I hope you have better luck than I did.
Cheers,
Palad1

Unknown said...

Well, I found out the hard way...
Never stall the vfp pipeline.

Thing is I need to set the stride to 2, and then set it back to whatever it was before I intervened. AFAIK there's no way to peek at the VFP state without stalling the thing.

Tricky..

Thanks a lot for your article, you've opened the pandora box as far as I can tell :)

Aaron Leiby said...

Yeah, afaict the VFP is not very useful for stuff like making a general purpose math library with a bunch of individual functions that don't do much work at a time. Instead, you need to find hotspots in your code where you're working with streams of data, set your stride, blaze through the numbers, and then restore it.

The problem I'm running into now is stack corruption- and lack of tools for tracking down the problem. I'm afraid it's maybe a multi-threading issue.

BTW: Thanks for the link dopplex.

Last_Inquisitor said...

Two typos (if I remember correctly):

- s0-s7 and d0-d3 are the first bank, not the first two banks.

- TARGET_IPHONE_SIMULATOR is defined for simulator AND device. That's why an #ifdef TARGET_IPHONE_SIMULATOR doesn't make much sense in that context. It should be #if TARGET_IPHONE_SIMULATOR!=0

Generally I recommend using a dedicated .s file instead of using GCC's inline assembly. Much easier to maintain.

And don't use that google-code's vfpmathlibrary's common macros (link mentioned on top of this article). That's the dumbest way to use assembly one can come up with. If you replace your adds, mults etc. with those you will degrade your code's performance for sure.
And obviously they didn't understand how the vector mode works (just take a look at the VFP_FIXED_4_VECTOR_OP define... why the hell are they doing 4 ops here?).
