Assembler Tutorial

From WiiBrew

This introduction to PowerPC assembler assumes that you are somewhat familiar with Intel assembler. It is not written as a tutorial for beginners in assembly programming, although it should be possible to follow if you have only programmed in C before. This tutorial will allow you to write applications in PowerPC assembler. Disassembling compiled code is not covered; knowing assembler is, however, a prerequisite for disassembling code.


The PowerPC is a RISC (Reduced Instruction Set Computing) processor architecture. PowerPC is an acronym which stands for Performance Optimization With Enhanced RISC - Performance Computing. The specification for it was released in 1993 and is a 64-bit specification with a 32-bit subset. Almost all PowerPC processors are 32-bit now but feature a 64-bit data bus.

The PowerPC was developed jointly by Apple, IBM and Motorola (whose semiconductor business later became Freescale). There are many different PowerPC processors available. Apple used the PowerPC in Macintosh systems, IBM used it in its RS/6000 and pSeries computers, and Nintendo used it in its GameCube, Wii, and Wii U systems. There are also many embedded devices using the PowerPC.

The PowerPC is a superscalar microprocessor which means it has separate execution units. There is an integer unit, a floating-point unit, a branching unit, and even more depending on the processor type. These units can execute instructions in parallel within one clock cycle.

The registers

The PowerPC has many more registers than the Intel processors and these are named differently. The general purpose registers are 4 bytes or 32 bits long on the 32-bit versions of the PowerPC; the floating point registers are 64 bits long. There are 32 (0-31) General Purpose Registers (GPRs or rX), 32 Floating point registers (FPRs or fX) and special purpose registers (SPRs) like the program counter PC or IAR (instruction address register), which keeps track of which instruction needs to be executed next. There is a link register (LR), which holds the return address after a subroutine call and can also hold the target address for branch instructions, and the condition register (CR), which has eight (0-7) 4-bit fields holding the result of e.g. a compare instruction. The count register for loops is called CTR. XER is the fixed-point exception register. FPSCR is the floating point status and control register.

On the PowerPC you cannot move data from one memory address to another. You have to read the data into a register first and then store the contents of the register at the destination address in memory. This load/store design allows the processor to operate more efficiently.

The PowerPC uses the big-endian format to store data in memory: the most significant byte (MSB) of a value is stored at the memory location with the lowest address. The bit numbering is also reversed compared to an Intel processor: bit 0 is the most significant bit and bit 31 is the least significant bit of a 32-bit word.
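
The byte order and bit numbering can be illustrated with a short Python sketch. It only models the layout on a PC; the value 0x12345678 and the helper name ppc_bit are chosen for illustration and do not come from any PowerPC toolchain.

```python
import struct

value = 0x12345678

# Big-endian (PowerPC): most significant byte at the lowest address.
big = struct.pack(">I", value)
# Little-endian (Intel x86): least significant byte at the lowest address.
little = struct.pack("<I", value)

print(big.hex())     # 12345678
print(little.hex())  # 78563412


def ppc_bit(n):
    """Value of bit n of a 32-bit word in PowerPC numbering (bit 0 = MSB)."""
    return 1 << (31 - n)


print(hex(ppc_bit(0)))   # 0x80000000
print(hex(ppc_bit(31)))  # 0x1
```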


Variables

The 32-bit version of the PowerPC supports the following data sizes:

Byte - 8 bits
Halfword - 16 bits
Word - 32 bits

An integer value of 12 can also be specified as 0x0C in hexadecimal or 0b01100 in binary.


Variables are defined either in the data section or in the bss section which takes uninitialised data only.

Here are examples of how to define variables. The name of the variable is always set as a label followed by a colon.

bytevar: .byte 0 #length of one byte - init zero
shortvar: .short 0 #length of two bytes - init zero
wordvar: .long 0 #length of four bytes - init zero


fivebytevar: .byte 11,12,13,14,15 #an array of five variables of one byte each
endof_fivebytevar: #specifies the address immediately following the array


stringvar: .string "Hello\n" #string variable - init to "Hello" plus newline and a terminating zero byte
.set stringvarlen, .-stringvar #length of stringvar in bytes, including the terminating zero


Constants

The AS assembler allows you to define constants. These are replaced by their values when the code is assembled.

Example:

.set GPR0,0

This will define the constant "GPR0" having the value zero. The assembler will replace all occurrences of GPR0 in the code by the number zero. This can enhance the readability of the code, since the registers are specified as numbers just like immediate values in the instructions. So using constants, the instruction addi 0,0,0 can be written as addi GPR0,GPR0,0 in the code. Without defining constants the assembler will also accept addi %r0,%r0,0.


Instructions and Mnemonics

Assembler instructions for Intel processors have up to two parameters separated by a comma. Typically the first parameter is modified with the second.

The AS assembler by default uses the opposite operand order for Intel processors (source first, destination last) compared with MASM, TASM etc. For the PowerPC, however, the first operand is used as the destination register and there can be up to five parameters separated by commas.

The PowerPC uses fixed-length 32-bit instructions. For an addi instruction, for example, these 32 bits are divided into the following fields, from the most significant bit to the least significant bit:

Opcode: 6 bits
Destination register (rD): 5 bits
Source register (rA): 5 bits
Immediate value: 16 bits

So to fill a 32-bit register with an immediate value you have to use two instructions supplying 16 bits each. On a 64-bit processor you need even more instructions since you also have to shift the bits.
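
As a rough illustration of the field layout, the following Python sketch packs an addi instruction into its 32-bit word. It assumes the fields are laid out from the most significant bit as opcode, destination register, source register, immediate, and that the primary opcode of addi is 14; the function name encode_addi is made up for this example.

```python
def encode_addi(rd, ra, simm):
    """Pack the fields of an addi instruction into a 32-bit word."""
    opcode = 14  # primary opcode of addi
    assert 0 <= rd < 32 and 0 <= ra < 32
    assert -0x8000 <= simm <= 0x7FFF
    # 6 bits opcode | 5 bits rD | 5 bits rA | 16 bits immediate
    return (opcode << 26) | (rd << 21) | (ra << 16) | (simm & 0xFFFF)


word = encode_addi(3, 6, 4)  # addi 3,6,4
print(f"{word:08x}")         # 38660004
```

Since only 16 bits remain for the immediate, no single instruction of this form can carry a full 32-bit constant.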

Mnemonics (called simplified or extended mnemonics in the PowerPC documentation) are specializations of a more general instruction. They are used as simplified instructions for easier coding of assembly language programs and are defined for frequently used instructions. A mnemonic may have two parameters, and the assembler converts it to an instruction which may require three or more parameters. Samples of mnemonics can be found among the described instructions below.

The available instructions for the PowerPC can be grouped as follows:

  1. Integer instructions
  2. Floating point instructions
  3. Load and store instructions
  4. Branch and flow control instructions
  5. Various instructions

The most common instructions of each group will be discussed here.


Integer Instructions

Integer Arithmetic Instructions

ADD

This instruction has several variants:

1. ADD

Syntax: add rD,rA,rB

This command adds two registers (rA and rB) and puts the result into the register rD (destination).

Example: add 3,6,4

In this example GPR6 and GPR4 are added and the result is put into GPR3.

Example: add 3,6,3

In this example GPR6 and GPR3 are added and the result is put into GPR3.

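Example: add 3,0,4

Here GPR0 and GPR4 are added and the result is put into GPR3. Unlike in addi and addis (described below), a zero as the second parameter of add does mean GPR0 and not the value zero. So this only works like an Intel move instruction if GPR0 happens to contain zero; use the mr mnemonic (described below) to copy a register.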


2. ADDI - Add Immediate

Syntax: addi rD,rA,SIMM

This command adds a 16-bit signed integer (SIMM) to register rA and puts the result into the register rD (destination).

Example: addi 3,6,4

In this example GPR6 and the value 4 are added and the result is put into GPR3.

As you can see the registers are specified as a number and the integer is specified as a number. To improve the readability of the code you can define constants for the registers, e.g.:

.set r0,0; .set r1,1; .set r2,2; .set r3,3; .set r4,4; .set r5,5; .set r6,6; .....

Then the above command can be written as: addi r3,r6,4

You can also use the addi command as a move instruction:

addi 3,0,4

This sets GPR3 to the value 4. If the second parameter is a zero here this does not mean GPR0 but the value zero. This is the case for addi, addis and the base register of the load and store instructions.


3. ADDIS - Add Immediate Shifted

Syntax: addis rD,rA,SIMM

This command is used to add a 16-bit immediate value to the upper 16 bits of a 32-bit register. It shifts the 16-bit signed integer (SIMM) left by 16 bits, adds it to register rA and puts the result into the register rD (destination). rA itself is not modified. The lower 16 bits of the result are those of rA; they are zero if rA is given as 0, since that means the value zero here, as with addi. So to fill a 32-bit register with an immediate 32-bit value you first use addis with rA=0 to fill the upper 16 bits and then addi (or ori) to fill the lower 16 bits. Note that addi sign-extends its immediate, so a lower half of 0x8000 or above subtracts one from the upper half.

Example: addis 3,3,4

In this example the value 4 is shifted 16 bits to the left and added to GPR3, and the result is put into GPR3. If GPR3 contained zero before, it will then contain 0x00040000 in hexadecimal. If you then execute an addi 3,3,4 command GPR3 will contain 0x00040004 in hexadecimal.

To move a pointer to an address of a variable or function into a register there are the @ha, @h and @l modifiers available. If you append these to the variable name, @l gives you the lower 16 bits of the absolute 32-bit address and @h the higher 16 bits. @ha also gives the higher 16 bits, but adjusted (incremented by one if bit 15 of the address is set) to compensate for the sign extension of the lower half by addi. Use @ha together with addi, and @h together with ori.

Example:

addis 3,0,hello@ha
addi 3,3,hello@l

With these two instructions the absolute 32-bit address of the string variable hello is moved into the GPR 3 register.

Instead of addis/addi the mnemonics lis/la are often used. These are explained below.
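
The difference between @h and @ha shows up as soon as bit 15 of the address is set. The following Python sketch models the arithmetic of an addis/addi pair; the helper names and the address 0x8001C000 are made up for illustration, and the functions only mimic what the assembler and processor compute.

```python
def lo(addr):
    """@l: the low 16 bits of the address."""
    return addr & 0xFFFF


def hi(addr):
    """@h: the plain high 16 bits of the address."""
    return (addr >> 16) & 0xFFFF


def ha(addr):
    """@ha: high 16 bits, adjusted for the sign extension done by addi."""
    return ((addr + 0x8000) >> 16) & 0xFFFF


def sign_extend16(v):
    return v - 0x10000 if v & 0x8000 else v


def addis_addi(upper, lower):
    """Model of: addis rD,0,upper ; addi rD,rD,lower"""
    return ((upper << 16) + sign_extend16(lower)) & 0xFFFFFFFF


addr = 0x8001C000  # hypothetical address with bit 15 set

print(hex(addis_addi(ha(addr), lo(addr))))  # 0x8001c000 (correct)
print(hex(addis_addi(hi(addr), lo(addr))))  # 0x8000c000 (off by 0x10000)
```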


4. ADD. - ADD with CR Update

Contrary to the Intel processors, the ADD instruction will not modify any flags. To achieve this you have to append a dot to the instruction. So add. will set the CR bits 0-3 (CR0) in the CR register. These bits will then reflect a signed comparison of the result to zero (less than, greater than, equal) plus a copy of the summary overflow bit. In effect the dot adds a cmpwi rD,0 instruction to the ADD instruction. A dot can be added to many PowerPC instructions.


5. Mnemonics for the ADD instruction


The following mnemonics are converted into ADDI or ADDIS instructions by the assembler:


LI - Load Immediate

Syntax: li rD,value

This is equivalent to addi rD,0,value

Example: li 3,100

Sets GPR3 to 100. The 16-bit value is sign-extended, so the higher 16 bits are cleared for positive values and set for negative values.


LIS - Load Immediate Shifted

Syntax: lis rD,value

This is equivalent to addis rD,0,value

Example: lis 3,100

Sets higher 16 bits of GPR3 to 100 and clears the lower 16 bits.


LA - Load Address

Syntax: la rD,d(rA)

This is equivalent to addi rD,rA,d

Example: la 3,100(9)

Adds 100 to the address in GPR9 and loads the result in GPR3.


As a side note, OR immediate does not treat a zero register operand as the value zero:

ori rA,0,UIMM ORs the contents of GPR0 with UIMM (UIMM = unsigned integer value)

So ori cannot replace li, but it is useful for filling in the lower 16 bits after lis, because unlike addi it does not sign-extend its immediate.

To load an immediate 32-bit value in a register you can use:

lis 3,100
ori 3,3,200

This loads 100 into the higher 16 bits and 200 into the lower 16 bits.


SUBF - Subtract From

1. SUBF

Syntax: subf rD,rA,rB

Example: subf 3,4,5

Similar to the ADD instruction SUBF will subtract GPR4 from GPR5 and place the result in GPR3.


2. Subfic - Subtract from Immediate Carrying

Syntax: subfic rD,rA,SIMM

Example: subfic 3,4,5

This will subtract GPR4 from the signed integer value 5 and place the result in GPR3. The carry bit (CA) in the XER register is modified.


MUL - Multiply

Multiplying two 32-bit values will often result in a 64-bit value. So there are separate instructions to put the 64-bit result into two 32-bit registers:

1. MULLW - Multiply Low Word

Syntax: mullw rD,rA,rB

Example: mullw 3,4,5

This will multiply the contents of GPR4 and GPR5 and place the lower 32 bits of the result in GPR3.


2. MULHW - Multiply High Word

Syntax: mulhw rD,rA,rB

Example: mulhw 6,4,5

This will multiply the contents of GPR4 and GPR5 as signed values and place the higher 32 bits of the result in GPR6. Use mulhwu for unsigned values.


3. MULLI - Multiply Low Immediate

Syntax: mulli rD,rA,SIMM

Example: mulli 3,4,5

This will multiply the contents of GPR4 with the integer 5 and place the lower 32 bits of the result in GPR3. So the higher 32 bits - if any - are lost.
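
How the 64-bit product is split across two registers can be sketched in Python. This is only a model of the arithmetic, using the instruction names mullw and mulhw (the signed multiply-high of the PowerPC instruction set) as function names; the operand values are arbitrary.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret a 32-bit register value as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mullw(a, b):
    """Low 32 bits of the product."""
    return (to_signed32(a) * to_signed32(b)) & MASK32


def mulhw(a, b):
    """High 32 bits of the signed 64-bit product."""
    return ((to_signed32(a) * to_signed32(b)) >> 32) & MASK32


a, b = 0x00010000, 0x00030000              # 65536 * 196608
print(hex(mulhw(a, b)), hex(mullw(a, b)))  # 0x3 0x0
```

Here the low word alone is zero, so the product would be lost if only mullw were used.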


DIV - Divide

divw - Divide Word

Syntax: divw rD,rA,rB

Example: divw 3,4,5

This will divide the contents of GPR4 by the contents of GPR5 as signed values and place the quotient in GPR3. The remainder is lost.
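
If the remainder is needed, it can be recomputed from the quotient with a multiply and a subtract. The Python sketch below models that three-instruction sequence (divw, mullw, subf); it assumes the quotient of a signed division is truncated toward zero, and the function names are only illustrative. Division by zero, whose result is undefined on the processor, raises an exception in this model.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret a 32-bit register value as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def divw(a, b):
    """Signed 32-bit division, quotient truncated toward zero."""
    a, b = to_signed32(a), to_signed32(b)
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    return q & MASK32


def remainder(a, b):
    """Model of: divw q,a,b ; mullw t,q,b ; subf r,t,a"""
    q = divw(a, b)
    t = (to_signed32(q) * to_signed32(b)) & MASK32
    return (a - t) & MASK32


print(divw(17, 5), remainder(17, 5))           # 3 2
print(to_signed32(remainder(-17 & MASK32, 5))) # -2
```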

Assembler programmers often arrange their tasks so that they only need to multiply and divide by powers of two. This way they can use the shift instructions instead of multiply and divide.


Integer Compare and Logical Instructions

1. CMP - Compare

Syntax: cmp crfD,L,rA,rB

Example: cmp 7,0,3,4

This will compare the signed contents of the GPR3 and GPR4 registers and set the CR7 field of the CR register accordingly. The second parameter has to be set to zero for 32-bit processors.

If rA<rB then bit 0 of CR7 will be set. If rA>rB then bit 1 of CR7 will be set. If rA=rB then bit 2 of CR7 will be set. Bit 3 of CR7 receives a copy of the summary overflow (SO) bit of the XER register.


2. CMPI - Compare Immediate

Syntax: cmpi crfD,L,rA,SIMM

Example: cmpi 7,0,3,4

This will compare the signed contents of the GPR3 register with the value 4 and set the CR7 field of the CR register accordingly. The second parameter has to be set to zero for 32-bit processors.

If rA<4 then bit 0 of CR7 will be set. If rA>4 then bit 1 of CR7 will be set. If rA=4 then bit 2 of CR7 will be set.


3. Mnemonics for CMP

Sometimes the following mnemonics are used:

CMPWI - compare word immediate

Syntax: cmpwi crD,rA,SIMM

This is equivalent to cmpi crD,0,rA,SIMM

CMPLWI - Compare Logical word immediate

Syntax: cmplwi crD,rA,UIMM

This is equivalent to cmpli crD,0,rA,UIMM (UIMM = unsigned integer value)
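
The difference between the signed and the logical (unsigned) compares can be modelled in a few lines of Python. The sketch returns the four bits of a CR field as a number, with bit 0 (less than) as the most significant of the four; the function names are illustrative, and the summary overflow bit is simply assumed to be zero.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret a 32-bit register value as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def cr_field(a, b, so=0):
    """Return the 4-bit CR field for a comparison of a with b."""
    if a < b:
        bits = 0b1000       # bit 0: less than
    elif a > b:
        bits = 0b0100       # bit 1: greater than
    else:
        bits = 0b0010       # bit 2: equal
    return bits | (so & 1)  # bit 3: copy of the summary overflow bit


def cmpw(ra, rb):
    """Signed compare, as done by cmp, cmpi and cmpwi."""
    return cr_field(to_signed32(ra), to_signed32(rb))


def cmplw(ra, rb):
    """Logical (unsigned) compare, as done by cmpl, cmpli and cmplwi."""
    return cr_field(ra & MASK32, rb & MASK32)


print(bin(cmpw(0xFFFFFFFF, 1)))   # 0b1000: -1 is less than 1
print(bin(cmplw(0xFFFFFFFF, 1)))  # 0b100: 4294967295 is greater than 1
```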


OR

1. OR

Syntax: or rA,rS,rB

Example: or 3,4,5

This instruction will OR the contents of GPR4 and GPR5 and place the result in GPR3. The variant OR. (+dot) will update CR too.


2. ORI - OR Immediate

Syntax: ori rA,rS,UIMM

Example: ori 3,4,5

This instruction will OR the contents of GPR4 with the unsigned integer 5 and place the result in GPR3.

ori 0,0,0 can be used as a NOP (no operation) instruction. This could be used e.g. as a breakpoint for a debugger.


3. ORIS - OR Immediate Shifted

Syntax: oris rA,rS,UIMM

Example: oris 3,4,5

This instruction will OR the upper 16 bits contained in GPR4 with the unsigned integer 5 and place the result in GPR3. The lower 16 bits of GPR4 are copied unchanged.


4. Mnemonics for OR

MR - move [to] register

Syntax: mr rA,rS

This is equivalent to: or rA,rS,rS

Example: mr 31,1

This will move the value in GPR1 to GPR31.


AND

1. AND

Syntax: and rA,rS,rB

Example: and 3,4,5

This instruction will AND the contents of GPR4 and GPR5 and place the result in GPR3.


2. ANDI. - AND Immediate

Syntax: andi. rA,rS,UIMM

Example: andi. 3,4,0b00000011

This instruction will AND the contents of GPR4 with the unsigned integer 3 (binary 000011) and place the result in GPR3. Since this instruction ends with a dot, the CR register is updated. The higher 16 bits will be cleared. In effect all but the last two bits of GPR4 are cleared in this example.


3. ANDIS. - AND Immediate Shifted

Syntax: andis. rA,rS,UIMM

Example: andis. 3,4,5

This instruction will AND the upper 16 bits contained in GPR4 with the unsigned integer 5 and place the result in GPR3. Since this instruction ends with a dot, the CR register is updated. The lower 16 bits will be cleared.


4. There are equivalent instructions for XOR, NAND etc.


Integer shift and rotate instructions

1. SLW - Shift left word

Syntax: slw rA,rS,rB

Example: slw 3,4,5

The contents of the GPR4 register are shifted left by the value placed in the low-order six bits of the GPR5 register. The 32-bit result is placed in GPR3.


2. SRW - Shift right word

Syntax: srw rA,rS,rB

Example: srw 3,4,5

The contents of the GPR4 register are shifted right by the value placed in the low-order six bits of the GPR5 register. The 32-bit result is placed in GPR3.


3. RLWINM - Rotate Left Word Immediate then AND with Mask

Syntax: rlwinm rA,rS,SH,MB,ME

Example: rlwinm 3,4,5,0,31

Here the contents in GPR4 will be rotated left by 5 bits (the immediate value of 5 - parameter three) and the result placed into GPR3. After rotating GPR4 and before placing the result into GPR3 the value is ANDed with the mask specified in the last two parameters. The fourth parameter specifies the beginning of the 1-bits in the mask and the fifth parameter specifies the end of the 1-bits in the mask. In this example the begin is 0 and the end is 31. So all 32 bits are set in the AND mask. This causes no bits to be cleared by the mask.

A rotate right can be done by specifying a value of 32-n as the third parameter. So

rlwinm 3,4,31,0,31

will rotate GPR4 right by one bit. This is not equal to a division by 2 since a bit may be moved into the sign bit by the rotation. Use the shift mnemonics for that which are described below.


The PowerPC numbers bits in big-endian order: bit 0 is the most significant bit and bit 31 the least significant one. So if you have a value of 31, which is 0b0011111, and you want to clear the lower two bits, you have to AND this with a mask of 0,29. Then bits 30 and 31 are set to zero in the mask and these bits will be cleared in the value of 31. The result will then be 0b0011100 or 28.


If the third parameter is a zero there is no rotation and this command is just used as an AND mask. This is often done by gcc since this allows an AND with a 32-bit mask of contiguous 1-bits in a single instruction, whereas andi. only takes a 16-bit immediate.

Example:

li 3,0b11111111
rlwinm 3,3,0,24,24

will clear all bits except bit 24 (big-endian numbering) in register GPR3. So register GPR3 will then contain 128 or 0b10000000.
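
The rotate-and-mask operation is easier to follow with a small model. The Python sketch below mimics rlwinm; the helper names rotl32 and mask are made up, and the mask is built bit by bit with bit 0 being the most significant bit, wrapping around if the begin lies after the end.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit value left by n bits."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32


def mask(mb, me):
    """1-bits from bit mb through bit me, bit 0 being the most significant."""
    m = 0
    i = mb
    while True:
        m |= 1 << (31 - i)
        if i == me:
            break
        i = (i + 1) & 31
    return m


def rlwinm(rs, sh, mb, me):
    """Rotate left by sh bits, then AND with the mask mb..me."""
    return rotl32(rs & MASK32, sh) & mask(mb, me)


print(rlwinm(0b11111111, 0, 24, 24))  # 128
print(rlwinm(31, 0, 0, 29))           # 28
print(hex(rlwinm(1, 31, 0, 31)))      # 0x80000000 (rotate right by one)
```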


4. Mnemonics for shift

SLWI - Shift Left Word Immediate

Syntax: slwi rA,rS,n (n<32)

This is equivalent to: rlwinm rA,rS,n,0,31-n

Example: slwi 3,4,5

Shifts GPR4 left by 5 bits and places the result in GPR3. This is equal to multiplying GPR4 by 32 (2**5).


SRWI - Shift Right Word Immediate

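Syntax: srwi rA,rS,n (n<32)

This is equivalent to: rlwinm rA,rS,32-n,n,31

Example: srwi 3,4,5

Shifts GPR4 right by 5 bits and places the result in GPR3. This is equal to dividing an unsigned value in GPR4 by 32 (2**5). For signed values use srawi (Shift Right Algebraic Word Immediate), which replicates the sign bit.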


Floating point instructions

1. FMR - Floating Move Register

Syntax: fmr frD,frB

Example: fmr 3,4

The floating-point value in FPR4 will be copied into FPR3 (FPR = floating point register).


2. LFS - Load Floating-Point Single

Syntax: lfs frD,d(rA)

Example: lfs 3,0(4)

Loads the word of data from the location in memory specified in GPR4 into floating-point register FPR3, converting it from single-precision to double-precision.


3. STFS - Store Floating-Point Single

Syntax: stfs frS,d(rA)

Example: stfs 3,0(4)

Converts the contents of FPR3 to single-precision and stores it at the location in memory specified in GPR4.


4. LFD - Load Floating-Point Double

Syntax: lfd frD,d(rA)

Example: lfd 3,0(4)

Loads the doubleword of data from the location in memory specified in GPR4 into floating-point register FPR3.


5. STFD - Store Floating-Point Double

Syntax: stfd frS,d(rA)

Example: stfd 3,0(4)

Stores the doubleword of data in the floating-point register FPR3 at the location in memory specified in GPR4.


6. MTFSF - Move to FPSCR Fields

Syntax: mtfsf FM,frB

Copies the contents of the floating-point register frB into the Floating-Point Status and Control Register (FPSCR) under the control of the field mask in FM.


7. MTFSB1 - Move to FPSCR Bit 1

Syntax: mtfsb1 crbD

Example: mtfsb1 4

Sets bit 4 of the FPSCR register to 1.


Load and Store Instructions

The PowerPC allows you to move data from register to memory and from memory to register. You cannot copy directly from one memory location to another.

Described here are the instructions for word operations. There are equivalent instructions for reading a byte (LBZ) and a halfword (LHZ) or storing a byte (STB) and a halfword (STH).


1. LWZ - Load Word and Zero

Syntax: lwz rD,d(rA)

Example: lwz 3,10(4)

This will read the word at the memory location specified in GPR4 plus an "offset" of 10 and place it in GPR3.

In the case of LBZ and LHZ the higher bits are cleared to zero when moving the value into a 32-bit register.


2. LWZX - Load Word and Zero Indexed

Syntax: lwzx rD,rA,rB

Example: lwzx 3,4,5

Here the word at the memory address computed by adding the values in GPR4 and GPR5 is read and placed in GPR3.


3. LWZU - Load Word and Zero Update

Syntax: lwzu rD,d(rA)

Example: lwzu 3,10(4)

This will read the word at the memory location specified in GPR4 plus an "offset" of 10 and place it in GPR3. Then GPR4 is updated with the computed memory address.


4. STW - Store Word

Syntax: stw rS,d(rA)

Example: stw 3,10(4)

This will store the value in GPR3 at the memory location specified in GPR4 plus an "offset" of 10.


5. STWX - Store Word Indexed

Syntax: stwx rS,rA,rB

Example: stwx 3,4,5

This will store the value in GPR3 at the memory location computed by adding the values in GPR4 and GPR5.


6. STWU - Store Word with Update

Syntax: stwu rS,d(rA)

Example: stwu 3,10(4)

This will store the value in GPR3 at the memory location specified in GPR4 plus an "offset" of 10. Then the computed memory address is placed in GPR4.

This instruction is frequently used to set up a stack frame.


7. MFSPR - Move from Special-Purpose Register

Syntax: mfspr rD,SPR

Example: mfspr 3,920

Here the value from the special register 920 is moved to GPR3.


8. MTSPR - Move to Special-Purpose Register

Syntax: mtspr SPR,rS

Example: mtspr 912,3

Here the value in GPR3 is moved to the special register 912.


9. Mnemonics for MFSPR and MTSPR

a) MFLR - Move from Link Register

It is not possible to use the standard instructions on the link register. So the value of this register first has to be moved into a GPR register.

Example: mflr 0

This will read the value in the link register into GPR0.

b) MTLR - Move to Link Register

Example: mtlr 0

Here the value in GPR0 is written into the link register.

c) MTCTR - Move to Count Register

Example:

li 4,100
mtctr 4

The count register is set to 100 via the GPR4. You cannot load an immediate value into the CTR.


10. There are equivalent instructions for the CR, XER, CTR etc. registers.

Branch instructions

The PowerPC branch instructions are similar to the Intel Processor's jmp and call commands.

1. B - Branch

Syntax: b target_addr

Example: b testlabel

Here the execution will continue at the label "testlabel".

This instruction can be compared to a "near jmp" in Intel syntax.


2. Conditional branch instructions (mnemonics)

Following a cmpi 7,0,3,5 instruction which compares the value in GPR3 with the integer 5 and places the resulting flags in the CR7 field of the CR register, the following conditional branch instructions can be executed. If the condition is true the program will continue at "testlabel" within the code segment. Otherwise it will just continue.

beq - branch if equal | example: beq 7,testlabel
bne - branch if not equal | example: bne 7,testlabel
blt - branch if less than | example: blt 7,testlabel
bgt - branch if greater than | example: bgt 7,testlabel
ble - branch if less or equal | example: ble 7,testlabel
bge - branch if greater or equal | example: bge 7,testlabel


Example:

cmpwi 4,100 /* Compare value in GPR4 with 100 */
bne else_label /*if not 100 goto else */
...if statements...
b endif_label /* jmp over else part */
else_label:
...else statements...
endif_label:


3. BL - Branch then Link

Syntax: bl target_addr

Example: bl testsubroutine

This instruction can be used to call a subroutine or function. The target address is encoded in the instruction itself as an offset relative to the current instruction, so nothing has to be loaded into the link register beforehand. When branching, the bl instruction saves the address of the next instruction in the link register to allow the called subroutine to return. If a subroutine executes a bl instruction itself, it first has to save the link register value, since that register will be overwritten by the bl instruction.


4. BLR - Branch on Link Register

This instruction is frequently used as a return command from a subroutine or function. The link register is filled with the 32-bit return address when a bl (branch then link) instruction is executed (see above). The blr instruction will read that return address from the link register and return to the next instruction in the calling routine.


5. BDNZ - Branch Decrement not Zero

Syntax: bdnz target

Example:

li 4, 100
mtctr 4
looplabel:
...some statements...
bdnz looplabel

In this example the count register (CTR) is loaded with the value 100 first via GPR4. Then the bdnz instruction will decrement the CTR register and branch to looplabel as long as the CTR (count register) is not zero.
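
The order of operations (decrement first, then test) determines how often the loop body runs. The following Python sketch counts the iterations for a given initial CTR value; the function name is made up, and the model ignores everything except the count register.

```python
def run_bdnz_loop(initial_ctr):
    """Count how often the body runs for: mtctr ; body ; bdnz looplabel"""
    ctr = initial_ctr & 0xFFFFFFFF
    iterations = 0
    while True:
        iterations += 1               # ...some statements...
        ctr = (ctr - 1) & 0xFFFFFFFF  # bdnz decrements CTR first
        if ctr == 0:                  # and falls through once CTR is zero
            break
    return iterations


print(run_bdnz_loop(100))  # 100
print(run_bdnz_loop(1))    # 1
```

Note that under this model an initial CTR of zero would wrap around to 0xFFFFFFFF on the first decrement, so the loop would not end after one pass.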


6. BDNZT - Branch Decrement not Zero True

This mnemonic adds a conditional branch test to the BDNZ instruction.

Syntax: bdnzt BI,target

Example:

li 4, 100
mtctr 4
looplabel:
...some statements...
cmpwi 5,10
bdnzt eq,looplabel

In this example the count register (CTR) is loaded with the value 100 first via GPR4. The bdnzt instruction will decrement the CTR register and branch to looplabel only if the CTR (count register) is not zero and the tested condition is TRUE. So the loop continues only as long as the cmpwi instruction determines that GPR5 has the value 10; as soon as GPR5 differs from 10 the loop is terminated, even if CTR has not reached zero. To leave the loop when GPR5 equals 10 instead, use bdnzf eq,looplabel (Branch Decrement not Zero False).


7. BEQLR - Branch to Link Register if Equal

Syntax: beqlr crD

This mnemonic returns from a subroutine or function (like blr) if the preceding cmp instruction determined equal. It takes no target address; the optional parameter selects the CR field and defaults to CR0.


8. BNELR - Branch to Link Register if not Equal

Syntax: bnelr crD

This mnemonic returns from a subroutine or function if the preceding cmp instruction determined NOT equal.


9. RFI - Return from interrupt

returns from an interrupt service routine


Various instructions

1. CRXOR - Condition Register XOR

Syntax: crxor crbD,crbA,crbB

Example: crxor 6,6,6

Clears bit 6 of the CR register


2. CLRLWI - Clear left word immediate

Syntax: clrlwi rA,rS,n

Example: clrlwi 3,4,16

Clears the high-order 16 bits of GPR4 and places the result into GPR3.


3. ISYNC

Delays all following instructions until all previous instructions have completed. Instructions that were already fetched are discarded and fetched again.

Since the PowerPC has several instruction queues, this makes sure that an instruction has completed before the next one in the code is executed.


Application Binary Interface (SVR4 ABI)

On Intel processors external subroutines and functions are usually called by pushing the arguments to be passed to the subroutine on the stack. The subroutine then sets up a stack frame and reads the parameters from the stack.

The PowerPC has lots of registers but none is defined by the PowerPC architecture as a stack pointer. The programmer can select any register to be a stack register.

To enable interoperability between different compilers and object files or libraries an ABI has been defined. There are slightly different versions of ABIs available. Here the System V R4 ABI or SVR4 ABI is described since the GCC compiler for 32-bit PowerPCs uses this ABI.

The SVR4 ABI specifies that arguments are not passed on the stack but in registers beginning with GPR3; only arguments that do not fit into the argument registers go on the stack. GPR1 is specified to be used as the stack pointer. When a function with a variable argument list (such as printf) is called, bit 6 of the CR register indicates that floating point arguments are passed in the floating point registers. If this is not the case this bit can be cleared using the crxor 6,6,6 instruction.

The registers are used as follows:

   r0          volatile, may be used by function linkage
   r1          stack pointer
   r2          reserved for system
   r3  .. r4   volatile, pass 1st - 2nd int args, return 1st - 2nd ints
   r5  .. r10  volatile, pass 3rd - 8th int args
   r11 .. r12  volatile, may be used by function linkage
   r13         small data area pointer
   r14 .. r31  saved
   f0          volatile
   f1          volatile, pass 1st float arg, return 1st float
   f2  .. f8   volatile, pass 2nd - 8th float args
   f9  .. f13  volatile
   f14 .. f30  saved
   f31         saved, static chain if needed.
   lr          volatile, return address
   ctr         volatile
   xer         volatile
   fpscr       volatile
   cr0         volatile
   cr1         volatile
   cr2 .. cr4  saved
   cr5 .. cr7  volatile

Volatile means that a called function does not have to preserve the register's value when it returns; saved means that a called function must restore the value before returning. So the calling function must save any volatile registers whose values it still needs before calling a subroutine.

The ABI also defines the stack frame which should be set up by the called subroutine. This is a diagram showing how it shall be set up:

       SP----> +---------------------------------------+
               | back chain to caller                  | 0
               +---------------------------------------+
               | saved LR                              | 4
               +---------------------------------------+
               | Parameter save area (P)               | 8
               +---------------------------------------+
               | Alloca space (A)                      | 8+P
               +---------------------------------------+
               | Local variable space (L)              | 8+P+A
               +---------------------------------------+
               | saved CR (C)                          | 8+P+A+L
               +---------------------------------------+
               | Save area for GP registers (G)        | 8+P+A+L+C
               +---------------------------------------+
               | Save area for FP registers (F)        | 8+P+A+L+C+G
               +---------------------------------------+
       old SP->| back chain to caller's caller         |
               +---------------------------------------+


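The ABI requires the SP to stay aligned (to 8 bytes under the embedded EABI, to 16 bytes under the full SVR4 ABI), so the frame size is always a multiple of that alignment. The .align 2 directive that GCC emits aligns code and data to 4 bytes (2**2) and does not affect the SP. To set up this stack frame the subroutine first decrements the passed SP. The amount depends on how many local variables shall be put into the stack frame; let's assume the SP is decremented by 40 bytes here. This is done using the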

stwu 1,-40(1)

instruction. It stores the current stack pointer, passed from the calling function in GPR1, at SP-40, which then becomes the back chain field, and then sets the SP in GPR1 to GPR1-40 so that it points to that field. The back chain field thus holds the stack pointer of the previous stack frame, set up by the calling function. This sets up a linked list of stack frames. It allows you to follow this list and write into the preceding stack frame. GCC writes the link register value into the "LR save word" field of the preceding stack frame with the

stw 0,44(1)

instruction after moving the link register into GPR0.

The "LR save word" field in the subroutine's own stack frame (SP+4) is reserved for the functions that the subroutine itself calls.

If the stack frame has a size of 40 bytes, this results in 10 fields of word size. Subtracting two for the back chain and LR save word fields leaves eight words to save data in the stack frame. These fields could be addressed by:

SP+8, SP+12, SP+16, SP+20, SP+24, SP+28, SP+32, SP+36
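
The frame setup can be traced with a small memory model. The Python sketch below mimics stwu 1,-40(1) followed by storing the link register at offset 44, and then undoes it again. The starting stack pointer and return address are hypothetical values, and the restoring sequence (lwz 0,44(1), mtlr 0, addi 1,1,40) is an assumed counterpart to the setup, not code taken from GCC output.

```python
memory = {}  # simple model of RAM: address -> 32-bit word


def prologue(sp, lr, frame_size=40):
    """Model of: stwu 1,-40(1) ; mflr 0 ; stw 0,44(1)"""
    new_sp = sp - frame_size
    memory[new_sp] = sp                   # stwu: back chain at the new SP
    memory[new_sp + frame_size + 4] = lr  # LR save word of the caller's frame
    return new_sp


def epilogue(sp, frame_size=40):
    """Model of: lwz 0,44(1) ; mtlr 0 ; addi 1,1,40"""
    lr = memory[sp + frame_size + 4]
    return sp + frame_size, lr


caller_sp = 0x00002000    # hypothetical stack pointer on entry
return_addr = 0x80004010  # hypothetical return address in LR

sp = prologue(caller_sp, return_addr)
free_slots = [sp + offset for offset in range(8, 40, 4)]
restored_sp, restored_lr = epilogue(sp)

print(hex(sp), hex(memory[sp]))   # 0x1fd8 0x2000
print(len(free_slots))            # 8
```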

In the "Hello world" example GCC saved GPR31 in SP+36, GPR3 in SP+28 and GPR4 in SP+24.

To free the stack frame again the SP has to be set back to point to the back chain field of the previous stack frame. The instruction

addi 1,1,40

will do this in our case here.

GCC will call its exit function instead.