Code Generator Improvements
From Open Watcom
This page is intended to collect notes on possible improvements of the Open Watcom code generator (cg) with regards to generated code quality (speed or size). Most of the issues are minor in impact and the generated code is often just as good as the competition. The list is by its nature somewhat random. Note that bugs need to be discussed elsewhere.
Many of the following observations were made when building 16-bit code for BIOS/firmware applications.
Contents |
Better intrinsics
Some intrinsic functions would benefit from being handled as pseudo-assembler instructions. That notably includes the inp()/outp() family of functions. For 8-bit I/O port addresses, immediates could be generated instead of requiring the DX register. And even for 16-bit port addresses, giving the register allocator full knowledge of the instructions could avoid unnecessary DX reloads.
More accurate intrinsics declarations
Consider the following:
#include <conio.h>
unsigned *pport;
void foo( unsigned val )
{
outpw( *pport, val );
outpw( *pport, val );
}
Compiled with 'wcc -s -oi', the resulting code is:
Segment: _TEXT BYTE USE16 00000017 bytes 0000 foo_: 0000 53 push bx 0001 52 push dx 0002 56 push si 0003 89 C3 mov bx,ax 0005 8B 36 00 00 mov si,word ptr _pport 0009 8B 14 mov dx,word ptr [si] 000B EF out dx,ax 000C 8B 36 00 00 mov si,word ptr _pport 0010 8B 14 mov dx,word ptr [si] 0012 EF out dx,ax 0013 5E pop si 0014 5A pop dx 0015 5B pop bx 0016 C3 ret
It is obvious that the code between the two OUT instructions is redundant. The intrinsics are not marked as modifying no memory and the cg therefore must assume that the content of the global may have changed. Properly declaring the intrinsic as modifying no memory and no registers helps. That is accomplished by adding the following:
#pragma aux outpw modify nomemory exact []
The generated code now reads:
Segment: _TEXT BYTE USE16 00000011 bytes 0000 foo_: 0000 53 push bx 0001 52 push dx 0002 56 push si 0003 89 C3 mov bx,ax 0005 8B 36 00 00 mov si,word ptr _pport 0009 8B 14 mov dx,word ptr [si] 000B EF out dx,ax 000C EF out dx,ax 000D 5E pop si 000E 5A pop dx 000F 5B pop bx 0010 C3 ret
Many of the intrinsics built into the C and C++ compilers are unnecessarily conservative, sometimes causing the cg to generate larger/slowed code than needed. Numerous intrinsics (inp()/outp(), ldiv(), lrotr()/lrotl(), etc.) modify no memory but that isn't taken into consideration.
More aggressive register reuse
Consider the following:
int comp( signed char sc, unsigned char uc )
{
return( sc > uc );
}
When built with wcc -oaxs -3, the generated code is not bad:
Segment: _TEXT BYTE USE16 00000010 bytes 0000 comp_: 0000 53 push bx 0001 0F BE D8 movsx bx,al 0004 0F B6 C2 movzx ax,dl 0007 39 C3 cmp bx,ax 0009 0F 9F C0 setg al 000C 30 E4 xor ah,ah 000E 5B pop bx 000F C3 ret
However, it is obvious that register BX doesn't need to be used at all. The cg could zero/sign extend the arguments in AX and DX where they are already, avoiding the use of BX and the need to save/restore the register. Note that a similar problem occurs when compiling with wcc -oaxs, but wcc -oaxsh avoids unnecessary register use. There are other similar situations where registers are used unnecessarily. Interestingly, Microsoft Visual C++ 1.52c has the same kind of problem when compiling with cl -c -Oxs -Gr and uses CX (instead of BX) for no good reason.
The problem is caused by the cg allocating a new register when converting a value instead of trying to reuse the already used register. The zero extension is allocated first and grabs register AX (because it is free), instead of sticking to DX since it's using DL already. The sign extension then can't use AX anymore and must allocate BX (the next free register).
The AssignConflicts() routine in cg/c/regalloc.c is responsible. Neither CalcSavings() not SortConflicts() gives any preference to DX when zero extending DL; AX is used simply because it's the first available register. It should be possible to tweak the cg to prefer register reuse. That is, when zero extending DL, the result should be stored in DX if DX is available.
Note that sign extension is more difficult when generating code for 286 and older because only the CBW instruction is available and it is tied to AL/AX. In that case, register moves may be unavoidable. Zero extension is more flexible because AH/BH/CH/DH can all be zeroed directly.
The following is a good testcase. It is surprisingly difficult for 16-bit compilers when passing arguments in registers.
int comp1( signed char sc, unsigned char uc )
{
return( sc < uc );
}
int comp2( unsigned char uc, signed char sc )
{
return( sc < uc );
}
int comp3( signed char sc, unsigned char uc )
{
return( uc < sc );
}
int comp4( unsigned char uc, signed char sc )
{
return( uc < sc );
}
After Change 36188, Open Watcom generates good code for the above. That was achieved by modifying the CountRegMoves() routine in regalloc.c to also consider the cases where the first operand of a MOV or other instruction (notably CNV) overlaps the examined registers.
Adding to SP
Consider the following:
extern int __cdecl foo( int arg );
int __fastcall bar( int arg )
{
return( foo( arg ) );
}
The generated code is as follows:
Segment: _TEXT BYTE USE16 00000008 bytes 0000 @bar: 0000 50 push ax 0001 E8 00 00 call _foo 0004 83 C4 02 add sp,0x0002 0007 C3 ret
While that isn't wrong, it would be much better to generate something like this:
Segment: _TEXT WORD USE16 00000006 bytes 0000 @bar: 0000 50 push ax 0001 E8 00 00 call _foo 0004 5B pop bx 0005 C3 ret
If there is a "killed" register available (unused and its value doesn't need to be saved), it's much better to generate a 1-byte POP rather than 3-byte ADD SP. If a routine with one parameter which uses stack argument passing is called often, this can add up to noticeable savings.
When adding 4 to SP, it can still be advantageous to generate two POP instructions when optimizing for size. There is analogous situation with the 386 cg. Note that on older CPUs, POP is slower than ADD SP, and should therefore be only used when optimizing for size.
Efficient short to long multiply
Consider the following:
long foo( short a, short b )
{
return( (long)a * b );
}
This sort of code is not unusual - it comes up whenever two 16-bit operands need to be multiplied to produce a 32-bit result. The cg currently generates code which is far too long and inefficient, because the intermediate pseudo-assembler language cannot express the x86 MUL/IMUL instructions which produce a result that is twice the size of the operands. Thus to get a 32-bit result, the 16-bit operands are first converted to 32-bit and then the _I4M routine is called. Microsoft C 6.0 and later produce far better code (generated via cl -c -Ox -G2):
Segment: _TEXT WORD USE16 0000000C bytes 0000 _foo: 0000 55 push bp 0001 8B EC mov bp,sp 0003 8B 46 06 mov ax,word ptr 0x6[bp] 0006 F7 6E 04 imul word ptr 0x4[bp] 0009 C9 leave 000A C3 ret 000B 90 nop
That code is much shorter and requires no support routine calls. The compiler recognized that the operands were really 16-bit quantities and therefore IMUL is exactly the right instruction, taking 16-bit operands and producing a 32-bit result in DX:AX.
Fixing this probably won't be trivial and some kind of "extending multiply" pseudo-instruction may need to be added to the cg. The cg will also need to recognize the case where a 32-bit multiply is performed on operands which were converted from 16-bit.
Note that the 32-bit cg has analogous problem with a multiply where the operands are int/long (32-bit) but the result is long long (64-bit).
There is also a potential improvement to be made for multiplies where the operands are 8-bit and the result 16-bit or 32-bit. That case isn't nearly as bad because the runtime support routine does not need to be called, but it would still be better to perform an 8-bit multiply and possibly convert the result rather than converting the two operands.
Better offset calculations (arithmetic optimization)
Consider the following:
typedef struct S1 {
unsigned char length;
unsigned char c;
unsigned char size;
} S1;
#define round_size( s ) \
( (&s->size - &s->length) + (16 - ((&s->size - &s->length) % 16)) )
int foo( S1 *ps )
{
return( &ps->size - &ps->length );
}
int bar( S1 *ps )
{
return( round_size( ps ) );
}
The &s->size - &s->length expression is essentially a poor man's sizeof (although the result is not identical). However, the code generated by Open Watcom for bar() is quite bad, although the final result is correct. The code for foo() is much easier to understand:
0000 foo_: 0000 52 push dx 0001 89 C2 mov dx,ax 0003 83 C2 02 add dx,0x0002 0006 29 C2 sub dx,ax 0008 89 D0 mov ax,dx 000A 5A pop dx 000B C3 ret
The code essentially calculates AX + 2 - AX, which obviously equals 2. In a more complex expression as in bar(), this leads to rather long code which should instead evaluate to a constant at compile time. A simpler and even more obvious testcase follows:
int foo( int a )
{
return( a + 2 - a );
}
The optimization would probably be best performed in the cg when constructing the expression tree.
Note that not every compiler performs this seemingly simple optimization (including e.g. IBM VisualAge C++ 3.5, Borland C++ 5.5). On the other hand, Microsoft C 6.0 does perform this optimization.

