Chapter 4
MMX CODE DEVELOPMENT STRATEGY
In general, developing fast applications for Intel Architecture (IA) processors is not difficult. An understanding of the architecture and good development practices make the difference between a fast application and one that runs significantly slower than its full potential. Intel Architecture processors with MMX™ technology add a new dimension to code development. The performance increase can be significant, and the conversion techniques are straightforward. To develop MMX technology code, examine the current implementation and determine the best way to take advantage of MMX technology instructions. If you are starting a new implementation, design the application with MMX technology in mind from the start.
Whether adapting an existing application or creating a new one, using MMX technology instructions to optimal advantage requires consideration of several issues. Generally, you should look for code segments that are computationally intensive, that are adaptable to integer implementations, and that support efficient use of the cache architecture. Several tools are provided in the Intel Performance Tool Set to aid in this evaluation and tuning.
Work through the following steps before beginning your implementation:
Step one: Determine which code to convert.
Most applications have sections of code that are highly compute-intensive. Examples include speech compression algorithms and filters, video display routines, and rendering routines. These routines are generally small, repetitive loops, operating on 8- or 16-bit integers, and take a sizable portion of the application processing time. These routines will yield the greatest performance increase when converted to MMX™ technology code. Encapsulating these loops into MMX technology-optimized libraries will allow greater flexibility in supporting platforms with and without MMX technology.
A performance optimization tool such as Intel's VTune visual tuning tool may be used to isolate the compute-intensive sections of code. Once identified, an evaluation should be done to determine whether the current algorithm or a modified one will give the best performance. In some cases, it is possible to improve performance by changing the types of operations in the algorithm. Matching the algorithms to MMX technology instruction capabilities is key to extracting the best performance.
Step two: Determine whether the algorithm contains floating-point or integer data.
If the current algorithm is implemented with integer data, then simply identify the portions of the algorithm that use the most microprocessor clock cycles. Once identified, re-implement these sections of code using MMX technology instructions.
If the algorithm contains floating-point data, then determine why floating-point was used. Several reasons exist for using floating-point operations: performance, range and precision. If performance was the reason for implementing the algorithm in floating-point, then the algorithm is a candidate for conversion to MMX technology instructions to increase performance.
If range or precision was an issue when implementing the algorithm in floating point then further investigation needs to be made. Can the data values be converted to integer with the required range and precision? If not, this code is best left as floating-point code.
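If the range and precision requirements are modest, the data can often be carried in a 16-bit fixed-point format instead. The following C fragment is a minimal sketch of such a conversion; the format (Q7.8) and the helper name are illustrative and are not part of any Intel library.

    #include <limits.h>

    #define FRAC_BITS 8   /* illustrative Q7.8 format: range roughly [-128, +128) */

    /* Convert one float to signed 16-bit fixed point, saturating at the
       limits of the format.  Frequent saturation suggests the data is
       better left in floating point. */
    static short float_to_fixed(float x)
    {
        long v = (long)(x * (1 << FRAC_BITS));
        if (v > SHRT_MAX) v = SHRT_MAX;
        if (v < SHRT_MIN) v = SHRT_MIN;
        return (short)v;
    }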
When generating MMX technology code, it is important to keep in mind that
the eight MMX technology registers are aliased upon the floating-point registers.
Switching from MMX technology instructions to floating-point instructions
can take up to fifty clock cycles, so it is best to minimize switching
between these instruction types. Do not intermix MMX technology code and
floating-point code at the instruction level. If an application
does perform frequent switches between floating-point and MMX technology
instructions, then consider extending the period that the application
stays in the MMX technology instruction stream or floating-point instruction
stream to minimize the penalty of the switch.
When writing an application that uses both floating-point and
MMX technology instructions, use the following guidelines for isolating instruction
execution:

•  Partition your application into separate sections of MMX technology code and floating-point code; do not mix the two instruction types within a single loop or small code segment.
•  Do not rely on register contents across a transition between the two instruction types, since the MMX technology registers are aliased on the floating-point registers.
•  End every MMX technology code section with the EMMS instruction before executing floating-point code (see Step three and Section 4.4).
•  Enter an MMX technology code section with the floating-point stack empty, and leave each floating-point code section with the stack empty.

An example of this partitioning appears in Section 4.4.
Additional information on the floating-point programming model
can be found in the Pentium® Processor Family Developer's
Manual: Volume 3, Architecture and Programming, (Order Number
241430).
Step three: Always include the EMMS instruction at the end of your MMX technology code.
Since the MMX technology registers are aliased on the floating-point
registers, it is very important to clear the MMX technology registers before
issuing a floating-point instruction. Use the EMMS instruction
to clear the MMX technology registers and set the value of the floating-point
tag word (TW) to empty (that is, all ones). This instruction
should be inserted at the end of all MMX technology code segments to avoid
an overflow exception in the floating-point stack when a floating-point
instruction is later executed. The example in Section 4.4 shows this usage.
Step four: Determine if MMX technology is available.
MMX technology can be included in your application
in two ways. With the first method, the application checks
for MMX technology during installation; if it is
available, the appropriate libraries are installed. With the second
method, the application checks during program execution and installs the proper
libraries at runtime. This is effective for programs that may
be executed over a network.
To determine whether you are executing on a processor
with MMX technology, your application should check the Intel Architecture
feature flags. The CPUID instruction returns the feature flags
in the EDX register. Based on the results, the program can decide
which version of code is appropriate for the system.
Existence of MMX technology support is denoted by
bit 23 of the feature flags. When this bit is set to 1 the processor
has MMX technology support. The code segment in Section 4.5 loads
the feature flags into EDX and tests the result for MMX technology.
Additional information on CPUID usage may be found in Intel
Processor Identification with CPUID Instruction, Application
Note AP-485, (Order Number 241618).
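As a sketch of the second (runtime) method, the fragment below selects an implementation once at startup, based on the CPUID feature-flag test shown in Section 4.5. The helper and routine names (mmx_supported, process_mmx, process_scalar) are hypothetical placeholders for your own detection wrapper and library entry points.

    extern int  mmx_supported(void);            /* wraps the CPUID test of bit 23 */
    extern void process_mmx(short *data, int n);
    extern void process_scalar(short *data, int n);

    static void (*process)(short *data, int n); /* chosen implementation */

    void init_dispatch(void)
    {
        process = mmx_supported() ? process_mmx : process_scalar;
    }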
Step five: Make sure your data is aligned.
Many compilers allow you to specify the alignment
of your variables using controls. In general this guarantees
that your variables will be on the appropriate boundaries. However,
if you discover that some of the variables are not appropriately
aligned as specified, then align the variable using the C algorithm
shown in Section 4.6, which aligns a 64-bit variable on a 64-bit boundary.
Once aligned, every access to this variable saves three clock cycles.
Another way to improve data alignment is to copy
the data into locations that are aligned on 64-bit boundaries.
When the data is accessed frequently this can provide a significant
performance improvement.
As a matter of convention, compilers allocate anything that is
not static on the stack, and it may be convenient to operate on
64-bit data quantities that are stored there. When
this is necessary, it is important to make sure the stack is aligned.
The prologue and epilogue code shown in Section 4.6.1 will ensure
that the stack is aligned.
In cases where misalignment is unavoidable for some
frequently accessed data, it may be useful to copy the data to
an aligned temporary storage location.
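A minimal sketch of this copy-to-aligned-storage approach is shown below; the buffer size and function name are illustrative.

    #include <string.h>

    #define TEMP_BYTES 4096                      /* illustrative buffer size */

    static char temp_buf[TEMP_BYTES + 8];        /* raw storage, 8 spare bytes for rounding */

    short *copy_to_aligned(const short *src, unsigned int bytes)
    {
        /* round the buffer address up to the next 64-bit boundary */
        short *dst = (short *)(((unsigned long)temp_buf + 7) & ~7UL);
        memcpy(dst, src, bytes);                 /* work on the aligned copy */
        return dst;
    }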
MMX technology uses an SIMD technique to exploit
the inherent parallelism of many multimedia algorithms. To get
the most performance out of MMX technology code, data should be formatted
in memory according to the guidelines below.
Consider a simple example of adding a 16-bit bias
to all the 16-bit elements of a vector. In regular scalar code,
you would load the bias into a register at the beginning of the
loop, access the vector elements in another register, and do the
addition one element at a time.
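A minimal scalar version of the loop just described might look like this (the names are illustrative):

    void add_bias_scalar(short *v, int n, short bias)
    {
        int i;
        for (i = 0; i < n; i++)
            v[i] = (short)(v[i] + bias);         /* one element per iteration */
    }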
Converting this routine to MMX technology code, you would expect
a four times speedup since MMX technology instructions can process four elements
of the vector at a time using the MOVQ instruction, and perform
four additions at a time using the PADDW instruction. However,
to achieve the expected speedup, you would need four contiguous
copies of the bias in the MMX technology register when doing the addition.
In the original scalar code, only one copy of the
bias was in memory. To use MMX technology instructions, you could use various
manipulations to get four copies of the bias in an MMX technology register.
Or, you could format your memory in advance to hold four contiguous
copies of the bias. Then, you need only load these copies using
one MOVQ instruction before the loop, and the four times speedup
is achieved. For another interesting example of this type of
data arrangement see Section 5.6.
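The fragment below sketches the same loop with four contiguous copies of the bias held in memory, written with the MMX technology C intrinsics from <mmintrin.h> rather than assembly. It assumes a compiler that provides these intrinsics, a vector that is 64-bit aligned with a length that is a multiple of four, and an illustrative BIAS constant.

    #include <mmintrin.h>

    #define BIAS 42                               /* illustrative bias value */

    static const short bias4[4] = { BIAS, BIAS, BIAS, BIAS };  /* four contiguous copies */

    void add_bias_mmx(short *v, int n)
    {
        __m64 b = *(const __m64 *)bias4;          /* one 64-bit load (MOVQ) before the loop */
        int   i;
        for (i = 0; i < n; i += 4) {
            __m64 x = *(const __m64 *)&v[i];      /* load four 16-bit elements      */
            *(__m64 *)&v[i] = _mm_add_pi16(x, b); /* four additions at once (PADDW) */
        }
        _mm_empty();                              /* EMMS: clear the MMX state (see Step three) */
    }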
The new 64-bit packed data types defined by MMX technology
create more potential for misaligned data accesses. The data
access patterns of many algorithms are inherently misaligned when
using MMX technology instructions and other packed data types. A simple example
of this is an FIR filter. An FIR filter is effectively a vector
dot product in the length of the number of coefficient taps. If
the filter operation of data element i
is the vector dot product that begins at data element j
(data [ j ] *coeff
[0] + data [j+1]*coeff [1]+...+data [j+num_of_taps-1]*coeff [num_of_taps-1]
), then the filter operation of data
element i+1 begins at
data element j+1.
Section 4.6 covers aligning 64-bit data in memory.
Assuming you have a 64-bit aligned data vector and a 64-bit aligned
coefficients vector, the filter operation on the first data element
will be fully aligned. For the filter operation on the second
data element, however, each access to the data vector will be
misaligned! Duplication and padding of data structures may be
used to avoid the problem of data accesses in algorithms which
are inherently misaligned. Using MMX™ Technology Instructions
to Compute a 16-Bit Real FIR Filter, Application Note AP-559
(Order Number 243044), shows an example of how to avoid the misalignment
problem in the FIR filter.
Note that the duplication and padding technique overcomes
the misalignment problem, thus avoiding the expensive penalty
for misaligned data access, at the price of increasing the data
size. When developing your code, you should consider this tradeoff
and use the option which gives the best performance.
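As a sketch of the duplication-and-padding idea (not the exact scheme of AP-559), the fragment below builds four zero-padded copies of a 16-bit coefficient vector, each shifted by one element, so that a filter operation starting at any index j can use the copy matching j modulo 4 and keep all of its 64-bit loads of the data vector aligned. The sizes and names are illustrative.

    #define NUM_TAPS   16                         /* illustrative filter length */
    #define PADDED_LEN ((((NUM_TAPS + 3) / 4) + 1) * 4)  /* multiple of 4, with room to shift */

    static short coeff_copies[4][PADDED_LEN];

    void build_coeff_copies(const short *coeff)
    {
        int s, k;
        for (s = 0; s < 4; s++) {                 /* s = starting index modulo 4 */
            for (k = 0; k < PADDED_LEN; k++)
                coeff_copies[s][k] = 0;           /* zero padding */
            for (k = 0; k < NUM_TAPS; k++)
                coeff_copies[s][k + s] = coeff[k];
        }
        /* The filter loop then uses coeff_copies[j % 4] and begins its
           64-bit loads at the aligned element data[j - (j % 4)]. */
    }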
The best way to tune your application, once it is
functioning correctly, is to use a profiler that measures the application
while it is running on a system. Intel's VTune visual tuning
tool is one such profiler and can help you determine where to make
changes in your application to improve performance. Additionally,
Intel's processors provide on-chip performance counters. Section
6.1 documents these counters and provides an explanation of how
to use them.
4.4 EMMS Guidelines

The following example partitions the floating-point and MMX technology instruction streams and clears the MMX technology state with EMMS before floating-point code resumes:

FP_code:
        ..
        ..              ; leave the floating-point stack empty

MMX_code:
        ...
        EMMS            ; empty the MMX technology registers

FP_code1:
        ...
        ...             ; leave the floating-point stack empty
4.5 CPUID Usage for Detection of MMX™ Technology
; identify existence of CPUID instruction
;
; identify Intel Processor
;
        mov     EAX, 1          ; request feature flags
        CPUID                   ; 0Fh, 0A2h is the CPUID instruction
        test    EDX, 00800000h  ; is the MMX technology bit (bit 23) of the feature flags set to 1?
        jnz     Found           ; yes: MMX technology is supported
4.6 Alignment of Data
/* allocate one extra element so the pointer can be rounded up; keep new_ptr for free() */
if (NULL == (new_ptr = (var_struct *) malloc((new_value + 1) * sizeof(var_struct))))
    return;                                        /* allocation failed      */
mem_tmp = (unsigned long) new_ptr;                 /* address as an integer  */
mem_tmp /= 8;                                      /* whole 8-byte blocks    */
new_tmp_ptr = (var_struct *) ((mem_tmp + 1) * 8);  /* next 64-bit boundary   */
4.6.1 STACK ALIGNMENT
Prologue:
        push    ebp             ; save old frame ptr
        mov     ebp, esp        ; make new frame ptr
        sub     ebp, 4          ; make room for saved stack ptr
        and     ebp, 0FFFFFFF8h ; align to a 64-bit boundary
        mov     [ebp], esp      ; save old stack ptr
        mov     esp, ebp        ; copy aligned ptr
        sub     esp, FRAMESIZE  ; allocate space (FRAMESIZE should be a multiple of 8)
        ; callee saves, etc.

Epilogue:
        ; callee restores, etc.
        mov     esp, [ebp]      ; restore old stack ptr
        pop     ebp             ; restore old frame ptr
        ret
4.7 Data Arrangement

Guidelines for arranging data in memory to match the SIMD access pattern of MMX technology instructions are given in the data arrangement discussion earlier in this chapter (see the bias-addition and FIR filter examples).

4.8 Tuning the Final Application

Tune the final application by profiling it with the VTune visual tuning tool and with the on-chip performance counters, as described earlier in this chapter; Section 6.1 documents the performance counters.