Chapter 4

MMX™ CODE DEVELOPMENT STRATEGY

In general, developing fast applications for Intel Architecture (IA) processors is not difficult. An understanding of the architecture and good development practices make the difference between a fast application and one that runs significantly slower than its full potential. Intel Architecture processors with MMX™ technology add a new dimension to code development. The performance increase can be significant, and the conversion techniques are straightforward. To develop MMX technology code for an existing application, examine the current implementation and determine the best way to take advantage of MMX technology instructions. If you are starting a new implementation, design the application with MMX technology in mind from the start.

4.1 Making a Plan

Whether adapting an existing application or creating a new one, using MMX technology instructions to optimal advantage requires consideration of several issues. Generally, you should look for code segments that are computationally intensive, that are adaptable to integer implementations, and that support efficient use of the cache architecture. Several tools are provided in the Intel Performance Tool Set to aid in this evaluation and tuning.

Several questions should be answered before beginning your implementation:

Which part of the code will benefit from MMX technology?
Is the current algorithm the best for MMX technology?
Is this code Integer or Floating-Point?
How should I arrange my data?
Is my data 8-, 16- or 32-bit?
Does the application need to run on processors both with and without MMX technology? Can I use CPUID to create a scalable implementation?

4.2 Which Part of the Code Will Benefit from MMX™ Technology?

Step one: Determine which code to convert.

Most applications have sections of code that are highly compute-intensive. Examples include speech compression algorithms and filters, video display routines, and rendering routines. These routines are generally small, repetitive loops that operate on 8- or 16-bit integers and take a sizable portion of the application processing time. It is these routines that will yield the greatest performance increase when converted to MMX™ technology-optimized code. Encapsulating these loops in MMX technology-optimized libraries allows greater flexibility in supporting platforms with and without MMX technology.

A performance optimization tool such as Intel's VTune visual tuning tool may be used to isolate the compute-intensive sections of code. Once identified, an evaluation should be done to determine whether the current algorithm or a modified one will give the best performance. In some cases, it is possible to improve performance by changing the types of operations in the algorithm. Matching the algorithms to MMX technology instruction capabilities is key to extracting the best performance.

4.3 Is the Code Floating-Point or Integer?

Step two: Determine whether the algorithm contains floating-point or integer data.

If the current algorithm is implemented with integer data, then simply identify the portions of the algorithm that use the most microprocessor clock cycles. Once identified, re-implement these sections of code using MMX technology instructions.

If the algorithm contains floating-point data, then determine why floating-point was used. Several reasons exist for using floating-point operations: performance, range and precision. If performance was the reason for implementing the algorithm in floating-point, then the algorithm is a candidate for conversion to MMX technology instructions to increase performance.

If range or precision was the reason for implementing the algorithm in floating-point, then further investigation is needed. Can the data values be converted to integer with the required range and precision? If not, this code is best left as floating-point code.
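One way to answer the range and precision question is to convert a representative set of values to a candidate integer format and measure what is lost. The sketch below assumes a 16-bit Q15 fixed-point format (15 fraction bits, values in [-1.0, 1.0)); the format and the helper names `float_to_q15` and `q15_max_error` are illustrative assumptions, not part of this guide.

```c
#include <stdint.h>

/* Q15 is an assumed format for this sketch: one sign bit and 15 fraction
   bits, so one step is 1/32768 and the range is [-1.0, 1.0 - 1/32768]. */
#define Q15_SCALE 32768.0

/* Convert x to Q15 with round-to-nearest.
   Returns 1 on success, 0 if x does not fit in the range (or is NaN). */
int float_to_q15(double x, int16_t *out)
{
    double scaled = x * Q15_SCALE;
    if (!(scaled > -32768.5 && scaled < 32767.5))
        return 0;                          /* range is insufficient */
    *out = (int16_t)(long)(scaled + (scaled >= 0.0 ? 0.5 : -0.5));
    return 1;
}

/* Largest absolute conversion error over n values, or -1.0 if any value
   is out of range. Compare the result with the precision the algorithm
   actually requires. */
double q15_max_error(const double *v, int n)
{
    double worst = 0.0;
    for (int i = 0; i < n; i++) {
        int16_t q;
        if (!float_to_q15(v[i], &q))
            return -1.0;
        double err = v[i] - q / Q15_SCALE;
        if (err < 0.0)
            err = -err;
        if (err > worst)
            worst = err;
    }
    return worst;
}
```

If `q15_max_error` reports an out-of-range value or an error larger than the algorithm can tolerate, the code is a candidate for staying in floating-point.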

4.3.1 MIXING FLOATING-POINT AND MMX™ TECHNOLOGY CODES

When generating MMX technology code, it is important to keep in mind that the eight MMX technology registers are aliased upon the floating-point registers. Switching from MMX technology instructions to floating-point instructions can take up to fifty clock cycles, so it is best to minimize switching between these instruction types. Do not intermix MMX technology code and floating-point code at the instruction level. If an application does perform frequent switches between floating-point and MMX technology instructions, then consider extending the period that the application stays in the MMX technology instruction stream or floating-point instruction stream to minimize the penalty of the switch.

When writing an application that uses both floating-point and MMX technology instructions, use the following guidelines for isolating instruction execution:

Partition the MMX technology instruction stream and the floating-point instruction stream into separate instruction streams that contain instructions of one type.

Do not rely on register contents across transitions.

Leave an MMX technology code section with the floating-point tag word empty using the EMMS instruction.

Leave the floating-point code section with an empty stack.

For example:


FP_code:
	..
	..		/* leave the floating-point stack empty */
MMX_code:
	...
	EMMS		/* empty the MMX technology registers */
FP_code1:
	...
	...		/* leave the floating-point stack empty */

Additional information on the floating-point programming model can be found in the Pentium® Processor Family Developer's Manual: Volume 3, Architecture and Programming, (Order Number 241430).

4.4 EMMS Guidelines

Step three: Always call the EMMS instruction at the end of your MMX technology code.

Since the MMX technology registers are aliased on the floating-point registers, it is very important to clear the MMX technology registers before issuing a floating-point instruction. Use the EMMS instruction to clear the MMX technology registers and set the value of the floating-point tag word (TW) to empty (that is, all ones). This instruction should be inserted at the end of all MMX technology code segments to avoid an overflow exception in the floating-point stack when a floating-point instruction is executed.

4.5 CPUID Usage for Detection of MMX™ Technology

Step four: Determine if MMX technology is available.

MMX technology can be included in your application in two ways. With the first method, the application checks for MMX technology during installation and, if it is available, installs the appropriate libraries. With the second method, the application checks during program execution and installs the proper libraries at run time. The second method is effective for programs that may be executed over a network.

To determine whether you are executing on a processor with MMX technology, your application should check the Intel Architecture feature flags. The CPUID instruction returns the feature flags in the EDX register. Based on the results, the program can decide which version of code is appropriate for the system.

Existence of MMX technology support is denoted by bit 23 of the feature flags. When this bit is set to 1 the processor has MMX technology support. The following code segment loads the feature flags in EDX and tests the result for MMX technology. Additional information on CPUID usage may be found in Intel Processor Identification with CPUID Instruction, Application Note AP-485, (Order Number 241618).

…				; identify existence of CPUID instruction
…				; 
…				; identify Intel Processor
…				;
mov EAX, 1			; request for feature flags
CPUID				; 0Fh, 0A2h   CPUID Instruction
test EDX, 00800000h		; is MMX technology bit (bit 23) in feature
				; flags equal to 1?
jnz	Found
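The run-time decision described above can be sketched in C. This is a minimal sketch that assumes the feature flags value returned in EDX has already been obtained and is passed in as a plain integer; the names `has_mmx`, `add_bias_scalar`, `add_bias_mmx`, and `select_add_bias` are hypothetical, and `add_bias_mmx` is only a stand-in for a routine that would really be written with MMX technology instructions.

```c
#include <stddef.h>
#include <stdint.h>

#define MMX_FEATURE_MASK 0x00800000u   /* bit 23 of the feature flags (EDX) */

/* Returns 1 if the feature flags report MMX technology support. */
int has_mmx(uint32_t feature_flags)
{
    return (feature_flags & MMX_FEATURE_MASK) != 0;
}

/* Both versions of a routine share one signature so the caller can
   switch between them through a function pointer. */
typedef void (*add_bias_fn)(int16_t *v, size_t n, int16_t bias);

/* Scalar version for processors without MMX technology. */
void add_bias_scalar(int16_t *v, size_t n, int16_t bias)
{
    for (size_t i = 0; i < n; i++)
        v[i] = (int16_t)(v[i] + bias);
}

/* Stand-in for the MMX technology-optimized version (hypothetical).
   In this sketch it simply reuses the scalar logic. */
void add_bias_mmx(int16_t *v, size_t n, int16_t bias)
{
    add_bias_scalar(v, n, bias);
}

/* Choose the implementation once, based on the feature flags. */
add_bias_fn select_add_bias(uint32_t feature_flags)
{
    return has_mmx(feature_flags) ? add_bias_mmx : add_bias_scalar;
}
```

Selecting the function pointer once at startup keeps the feature test out of the inner loops, which fits the library encapsulation suggested in Section 4.2.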

4.6 Alignment of Data

Step five: Make sure your data is aligned.

Many compilers allow you to specify the alignment of your variables using compiler controls. In general, this guarantees that your variables will be on the appropriate boundaries. However, if you discover that some of the variables are not aligned as specified, then align the variable using the following C algorithm, which aligns a 64-bit variable on a 64-bit boundary. Once aligned, every access to this variable will save three clock cycles.

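/* allocate one extra element so the pointer can be moved up to a boundary */
if (NULL == (new_ptr = malloc((new_value + 1) * sizeof (var_struct))))
	… handle the allocation failure

mem_tmp = (uintptr_t) new_ptr;		/* assumes mem_tmp is a uintptr_t (<stdint.h>) */
mem_tmp /= 8;				/* drop the low three address bits */
new_tmp_ptr = (var_struct *) ((mem_tmp + 1) * 8);	/* next 64-bit boundary */
/* pass new_ptr, not new_tmp_ptr, to free() */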

Another way to improve data alignment is to copy the data into locations that are aligned on 64-bit boundaries. When the data is accessed frequently this can provide a significant performance improvement.
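Copying data into an aligned location can be sketched as follows. This is an illustrative helper, not code from this guide: the names `round_up_8` and `copy_to_aligned8` are hypothetical. Unlike the algorithm above, it leaves an address unchanged when it is already on an 8-byte boundary.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Round an address up to the next multiple of 8; an address that is
   already a multiple of 8 is returned unchanged. */
uintptr_t round_up_8(uintptr_t addr)
{
    return (addr + 7u) & ~(uintptr_t)7u;
}

/* Copy n bytes from src into a newly allocated buffer that starts on a
   64-bit (8-byte) boundary. On success, returns the aligned pointer and
   stores in *raw the pointer that must later be passed to free().
   Returns NULL (and sets *raw to NULL) if the allocation fails. */
void *copy_to_aligned8(const void *src, size_t n, void **raw)
{
    unsigned char *block = malloc(n + 7);   /* up to 7 bytes of slack */
    if (block == NULL) {
        *raw = NULL;
        return NULL;
    }
    uintptr_t start = (uintptr_t)block;
    unsigned char *aligned = block + (round_up_8(start) - start);
    memcpy(aligned, src, n);
    *raw = block;
    return aligned;
}
```

The caller works with the returned aligned pointer and releases the buffer with `free(raw)`; freeing the aligned pointer itself would be an error whenever the two differ.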

4.6.1 STACK ALIGNMENT

As a matter of convention, compilers allocate anything that is not static on the stack, and it may be convenient to make use of 64-bit data quantities that are stored there. When this is necessary, it is important to make sure the stack is aligned. The following code in the function prologue and epilogue will make sure the stack is aligned.

prologue:
	push		ebp				; save old frame ptr
	mov		ebp, 	esp			; make new frame ptr
	sub		ebp, 	4			; make room for stack ptr
	and		ebp, 	0FFFFFFF8h		; align to 64 bits
	mov		[ebp],	esp			; save old stack ptr
	mov		esp, 	ebp			; copy aligned ptr
	sub		esp, 	FRAMESIZE		; allocate space
	… callee saves, etc
epilogue:
	… callee restores, etc
	mov 		esp, 	[ebp]			; restore old stack ptr
	pop		ebp				; restore old frame ptr
	ret
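The pointer arithmetic in the prologue can be checked with a small C model. Aligning to 64 bits means clearing the low three address bits (mask 0FFFFFFF8h); clearing only two bits would give 32-bit alignment. The function name `aligned_frame_ptr` is hypothetical, and the model assumes 32-bit addresses.

```c
#include <stdint.h>

/* Model of the prologue: given the stack pointer value after "push ebp",
   return the aligned frame pointer. Four bytes are reserved first so the
   old stack pointer can be stored at the aligned address without
   overwriting the saved frame pointer. */
uint32_t aligned_frame_ptr(uint32_t esp_after_push)
{
    uint32_t ebp = esp_after_push;
    ebp -= 4u;              /* make room for the old stack pointer */
    ebp &= 0xFFFFFFF8u;     /* clear the low three bits: 8-byte boundary */
    return ebp;
}
```

For any input, the result is a multiple of 8 and lies at least four bytes below the incoming stack pointer, which is what the epilogue relies on when it reloads ESP from [EBP].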

In cases where misalignment is unavoidable for some frequently accessed data, it may be useful to copy the data to an aligned temporary storage location.

4.7 Data Arrangement

MMX technology uses an SIMD technique to exploit the inherent parallelism of many multimedia algorithms. To get the most performance out of MMX technology code, data should be formatted in memory according to the guidelines below.

Consider a simple example of adding a 16-bit bias to all the 16-bit elements of a vector. In regular scalar code, you would load the bias into a register at the beginning of the loop, access the vector elements in another register, and do the addition one element at a time.

Converting this routine to MMX technology code, you would expect a four times speedup since MMX technology instructions can process four elements of the vector at a time using the MOVQ instruction, and perform four additions at a time using the PADDW instruction. However, to achieve the expected speedup, you would need four contiguous copies of the bias in the MMX technology register when doing the addition.

In the original scalar code, only one copy of the bias was in memory. To use MMX technology instructions, you could use various manipulations to get four copies of the bias in an MMX technology register. Or, you could format your memory in advance to hold four contiguous copies of the bias. Then, you need only load these copies using one MOVQ instruction before the loop, and the four times speedup is achieved. For another interesting example of this type of data arrangement see Section 5.6.
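The effect of holding four contiguous copies of the bias can be illustrated in portable C. The sketch below does not use MMX technology instructions; it emulates the four 16-bit wraparound additions of PADDW inside one 64-bit integer, so the helper names `replicate16`, `paddw_emulated`, and `add_bias_packed` are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Build four contiguous copies of a 16-bit bias in one 64-bit word. */
uint64_t replicate16(uint16_t bias)
{
    return (uint64_t)bias * 0x0001000100010001ULL;
}

/* Emulate PADDW: four independent 16-bit additions with wraparound.
   The top bit of each lane is handled separately so that a carry never
   crosses from one lane into the next. */
uint64_t paddw_emulated(uint64_t a, uint64_t b)
{
    const uint64_t top = 0x8000800080008000ULL;
    uint64_t low_sum = (a & ~top) + (b & ~top);
    return low_sum ^ ((a ^ b) & top);
}

/* Add the bias to every element, four elements per step.
   Elements beyond the last full group of four are left untouched. */
void add_bias_packed(uint16_t *v, size_t n, uint16_t bias)
{
    uint64_t bias4 = replicate16(bias);      /* set up once, before the loop */
    for (size_t i = 0; i + 4 <= n; i += 4) {
        uint64_t q;
        memcpy(&q, &v[i], sizeof q);         /* one 64-bit load (like MOVQ) */
        q = paddw_emulated(q, bias4);
        memcpy(&v[i], &q, sizeof q);         /* one 64-bit store */
    }
}
```

The replicated bias is computed once outside the loop, which mirrors loading the four pre-formatted copies from memory with a single MOVQ.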

The new 64-bit packed data types defined by MMX technology create more potential for misaligned data accesses. The data access patterns of many algorithms are inherently misaligned when using MMX technology instructions and other packed data types. A simple example of this is an FIR filter. An FIR filter is effectively a vector dot product whose length is the number of coefficient taps. If the filter operation of data element i is the vector dot product that begins at data element j (data[j]*coeff[0] + data[j+1]*coeff[1] + ... + data[j+num_of_taps-1]*coeff[num_of_taps-1]), then the filter operation of data element i+1 begins at data element j+1.

Section 4.6 covers aligning 64-bit data in memory. Assuming you have a 64-bit aligned data vector and a 64-bit aligned coefficients vector, the filter operation on the first data element will be fully aligned. For the filter operation on the second data element, however, each access to the data vector will be misaligned! Duplication and padding of data structures may be used to avoid the problem of data accesses in algorithms which are inherently misaligned. Using MMX™ Technology Instructions to Compute a 16-Bit Real FIR Filter, Application Note #559, (Order Number 243044) shows an example of how to avoid the misalignment problem in the FIR filter.

Note that the duplication and padding technique overcomes the misalignment problem, thus avoiding the expensive penalty for misaligned data access, at the price of increasing the data size. When developing your code, you should consider this tradeoff and use the option which gives the best performance.
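One possible form of duplication and padding is sketched below in plain C; it is an illustration under stated assumptions and not necessarily the scheme used in the application note. The coefficient vector is duplicated four times, each copy shifted by 0 to 3 elements and zero-padded, so that every pass over the data vector starts at an element index that is a multiple of four (a 64-bit boundary for 16-bit data). The tap count and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define TAPS   5   /* hypothetical number of coefficient taps */
#define LANES  4   /* 16-bit elements per 64-bit quantity */
#define PADDED 8   /* TAPS + LANES - 1, rounded up to a multiple of LANES */

/* Straightforward filter: the dot product for output i starts at data[i],
   which is 64-bit aligned only when i is a multiple of LANES. */
int32_t fir_reference(const int16_t *data, const int16_t *coeff, size_t i)
{
    int32_t acc = 0;
    for (size_t k = 0; k < TAPS; k++)
        acc += (int32_t)data[i + k] * coeff[k];
    return acc;
}

/* Duplicate the coefficients LANES times. Copy s holds the coefficients
   shifted by s elements, with zeros in every other position. */
void build_shifted(const int16_t *coeff, int16_t shifted[LANES][PADDED])
{
    for (size_t s = 0; s < LANES; s++)
        for (size_t m = 0; m < PADDED; m++)
            shifted[s][m] = (m >= s && m < s + TAPS) ? coeff[m - s] : 0;
}

/* Same result as fir_reference, but the data vector is always read from
   an index that is a multiple of LANES. The data vector must contain at
   least (i rounded down to a multiple of LANES) + PADDED elements. */
int32_t fir_aligned(const int16_t *data, int16_t shifted[LANES][PADDED],
                    size_t i)
{
    size_t base = i & ~(size_t)(LANES - 1);   /* aligned starting index */
    size_t s = i & (size_t)(LANES - 1);       /* which shifted copy to use */
    int32_t acc = 0;
    for (size_t m = 0; m < PADDED; m++)
        acc += (int32_t)data[base + m] * shifted[s][m];
    return acc;
}
```

The cost is visible in the sizes: the coefficients occupy LANES * PADDED elements instead of TAPS, and each output performs PADDED multiplications instead of TAPS, which is the data size tradeoff described above.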

4.8 Tuning the Final Application

The best way to tune your application once it is functioning correctly is to use a profiler that measures the application while it is running on a system. Intel's VTune visual tuning tool is such a tool and can help you to determine where to make changes in your application to improve performance. Additionally, Intel's processors provide performance counters on-chip. Section 6.1 documents these counters and provides an explanation of how to use them.