Getting More Optimization out of the MPLAB® XC8 Compiler
What do you do if the usual optimizations are not enough?
The optimizations discussed in this article take advantage of how statements must be encoded by your compiler and they might allow you to squeeze a little extra into a device or allow your code to run just a little bit faster. As always, it is important to avoid degrading the readability of code for the sake of saving a few bytes, as difficult-to-read source code can be difficult to debug; however, the changes described in this article are trivial and should impact little on your code's understandability.
To help assess the impact of these suggestions, check your project’s list file to monitor the size of the assembly code sequences produced by the compiler for statements in your program.
Accessing Structure Members Via Pointers
If you are accessing structure members via a pointer, changing the order of the members might improve the code that accesses them.
Consider the following structure with three members and a pointer that could be used to access these.
int a;
int b;
int c;
} myStruct;
struct MYSTRUCT * stp = &myStruct;
When you access a structure member indirectly, the compiler has to produce assembly instructions that take the address contained in the structure pointer add the offset of the member in the structure to form the final address, then read the memory located at that address. As the content of the pointer is not known by the compiler and must be obtained at runtime, the addition must also be performed at runtime.
Some device instruction sets cannot perform the addition of a member's offset to the address when the indirect access is made, so the compiler might need to insert one or more extra instructions to perform this addition. However, if you are accessing the first member inside the structure, it will be located at an offset of zero and so, this addition is not required. Thus, if you place the most commonly accessed member first in the structure definition, your code might gain a little extra performance.
The following shows the assembly code that might be required to access the first and last members of the above structure via a pointer for a 16-bit PIC® microcontroller.
000300 804042 MOV stp, W2
000302 EB0900 CLR [W2]
29: stp->c = 0x0000;
000304 804042 MOV stp, W2
000306 EB0180 CLR W3
000308 980123 MOV W3, [W2+4]
Note that this situation is not present with direct access of structure members. In this case, the addition is performed between two constant operands that are known at compile time, hence the addition can also be performed at compile time, and no extra code is required.
Note also that the compilers for 16- and 32-bit devices might pad structures so that members are word aligned. This might improve the code required to access the members, but it increases the memory required for the structure. When writing the definition for a structure and if possible, try to arrange the members so that they are grouped by type size as well as having the most commonly accessed first.
int a; I need to be accessed fast
int c; both ints together
char b; both chars together
char d;
} myStruct;
Access of Elements in Awkwardly Sized Arrays
If you have an array, consideration of the array dimensions or element size might improve the code required to access the array elements.
Consider a multidimensional array, which is really just an array of subarrays, defined as:
If we consider only arrays of single-byte char types for the moment, the expression to access an array element, such as myArray[x][y], requires the compiler to take the address of the array and add to this an offset which is calculated as the second index, y, added to the result of the first index, x, multiplied by the size of the subarray, 5. As a C expression, that operation could be written as:
If x is a variable, and hence not known at compile time, the multiplication must be performed at runtime. When using devices with instruction sets that do not readily allow for multiplication, consideration of this expression can improve the generated code.
In the above case, simply swapping the dimensions of the array, as shown, change the constant multiplication operand from a not-friendly 5 to a more-friendly 2.
Multiplication by a value that is a power of 2 can often be performed as a logical shift, which will be much faster. The following MPLAB® XC8 output snippets show a call to a multiplication routine being used to obtain the address of an array defined as myArray[2][5], followed by code using a shift to access an array defined as myArray[5][2].
075F 3005 MOVLW 0x5
0760 00F0 MOVWF _bmul@multiplicand
0761 087D MOVF x, W
0762 27F4 CALL _bmul ;call to multiplication routine
...
36: result = myArray52[x][y];
076C 087D MOVF x, W
076D 00F6 MOVWF ??_main
076E 302A MOVLW low (_myArray52)
076F 35F6 LSLF ??_main, F ;logical shift
...
Even with a single-dimension array, this consideration can still be made, although in this case, the only option available to you is to adjust the size of the array elements. If you had an array of structures and each structure was 3 bytes long, adding an extra byte to the structure would increase the size of the memory required to store the array, but accessing elements that were 4 bytes long might be faster.
Note that if you only use constant array indices, both operands to the multiplication will be known, and the operation will be performed at compile time with no runtime penalty.
Multiplication By Big Numbers
The order of the operands of multiplication can sometimes affect how fast the multiplication is performed.
If a hardware multiply instruction is not available in the instruction set of the device you are using, then a routine might be called to perform this operation. With the MPLAB XC8 compiler, this routine can use a loop, which runs for as long as there are non-zero bits remaining in the first multiplication operand. Thus, if you can arrange for the first operand to have the smaller value, the multiplication might be performed faster.
In the following example, the first multiplication expression takes roughly 50 instruction cycles less time to execute than the second expression for a mid-range PIC device.
char b = 200;
int result;
result = a * b; faster
result = b * a; slower