Disclaimer
Information in this document is provided in connection with Intel products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel's Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel products including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice. Copyright © Intel Corporation (1996). Third-party brands and names are the property of their respective owners.
The media extension to the Intel Architecture (IA) instruction set includes single-instruction multiple-data (SIMD) instructions. This application note presents code examples of techniques that guarantee data alignment for programs that use MMX technology, written in Assembly, C, C++, or for Microsoft Windows*.
When the fundamental data type was the byte, data alignment was not required for optimum microprocessor performance. Processors accessed bytes equally fast on even or odd boundaries. On later microprocessor architectures, 16-bit and larger data must be aligned on even boundaries, or the processor needs extra time to complete the mis-aligned data access.
With the introduction of the MMX instruction set for the Pentium® processor, data alignment is a valuable tool for developing premium performance applications. Aligned data can be accessed each cycle, whereas a mis-aligned data access incurs a three-cycle penalty, and a mis-aligned access that spans a cache line incurs a penalty of twelve or more cycles.
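Whether an access spans a cache line depends only on its starting address and its length. The following is a minimal sketch in C of that address arithmetic. The function name is hypothetical, and the line size is left as a parameter because it is processor specific; nothing here measures the penalty itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if an access of `size` bytes starting at `addr` touches two
   cache lines, 0 if it stays inside one line.
   `line_size` is the cache line length in bytes (an assumption supplied
   by the caller); `size` must be at least 1. */
int spans_cache_line(uintptr_t addr, size_t size, size_t line_size)
{
    uintptr_t first_line = addr / line_size;               /* line of first byte */
    uintptr_t last_line  = (addr + size - 1) / line_size;  /* line of last byte  */
    return first_line != last_line;
}
```

Assuming a 32-byte line, an 8-byte access at address 25 covers bytes 25 through 32 and therefore spans two lines, while the same access at address 24 ends at byte 31 and stays within one.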
For a simple demonstration of the value of data alignment, sum the elements first of an aligned array and then of a mis-aligned array. Once we have an aligned array, we can guarantee misalignment by adding one to the array index.
    ;sum an aligned buffer
    PADDD MM0, buffer_A[00]
    PADDD MM0, buffer_A[08]
    PADDD MM0, buffer_A[16]
    PADDD MM0, buffer_A[24]
    ...
    ;sum a mis-aligned buffer
    PADDD MM0, buffer_B[01]    ;force mis-alignment
    PADDD MM0, buffer_B[09]
    PADDD MM0, buffer_B[17]
    PADDD MM0, buffer_B[25]
    ...
Aligned data | |
Mis-aligned data |
As shown in Table 1, significant performance gains (in excess of 300 percent) occur when summing aligned data.
To avoid the multiple access penalty for mis-aligned data on Pentium processors, the Pentium® Processor Family Developer's Manual Volume 3: Architecture and Programming Manual (Order Number 241430) lists the following data alignment rules:
Data alignment is vital for developing high-performance applications with the two new data types for the MMX instruction set:
For applications such as multimedia, whose fundamental data types are small (bytes and words), SIMD instructions process multiple data elements in a single clock cycle. For instance, a logical operation can be performed on eight 8-bit pixel elements in a single clock cycle. The new MMX instruction set provides a rich set of arithmetic and logical operations that treat a 64-bit register as a packed register of eight bytes, four words, or two doublewords. If data is aligned and optimization provisions are met, instructions can access and process up to 16 bytes of data per clock cycle in the dual execution pipes of the Pentium processor.
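To make the idea of one operation covering eight pixels concrete, here is a small sketch in portable C that emulates a packed logical AND with a single 64-bit integer operation. It is an illustration only, not MMX code; the function name `and_pixels8` is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* ANDs eight 8-bit pixels with eight mask bytes using one 64-bit
   operation, mimicking what a packed logical instruction does in one step. */
void and_pixels8(uint8_t dst[8], const uint8_t src[8], const uint8_t mask[8])
{
    uint64_t s, m, r;

    memcpy(&s, src, 8);    /* load eight pixels as one 64-bit value */
    memcpy(&m, mask, 8);   /* load eight mask bytes the same way    */
    r = s & m;             /* one operation covers all eight bytes  */
    memcpy(dst, &r, 8);    /* store the eight results               */
}
```

Because AND works on each bit independently, the result is the same as masking the eight pixels one at a time, regardless of byte order.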
Buffers are the fundamental data structure of multimedia programming. Until recently development tools did not align arrays by default, but tool developers are beginning to provide default array alignment based on the array data type. Arrays of words are aligned on word boundaries, and doublewords are aligned to 8-byte boundaries. Do not assume that arrays are aligned; check the alignment behavior of your development tools.
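One way to check what a tool actually does is to inspect the address of a buffer at run time. The helper below is a minimal sketch under the assumption that the boundary is a power of two; the name `misalignment` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the number of bytes by which `p` lies past the previous
   `boundary`-byte boundary. A result of 0 means `p` is aligned.
   `boundary` must be a power of two (for example 2, 4, or 8). */
size_t misalignment(const void *p, size_t boundary)
{
    return (size_t)((uintptr_t)p & (uintptr_t)(boundary - 1));
}
```

For a hypothetical array `samples`, the expression `misalignment(samples, 8) == 0` is true only when the array starts on an 8-byte boundary, so it can be used in a debug check after the buffer is defined or allocated.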
2.1.1. Packed Data Structures
Because structures can consist
of data items of various sizes and alignments, structures present
a potential source of data mis-alignment. Programs compiled for
Microsoft Windows* require structures to be packed on one-byte
boundaries. Alignment of data items within structures can be controlled
with either the /Zpn compiler option or the pack(n) pragma.
2.1.1.1. /Zpn
The /Zpn compiler option specifies
the alignment of data within structures. Small data items are
padded with extra bytes to the size specified by n. For
Microsoft Visual C/C++* 1.5, n = 1, 2, or 4. For Microsoft
Visual C/C++ 4.0, n = 1, 2, 4, 8, or 16.
2.1.1.2. #pragma pack(n)
The pack(n) pragma aligns
data in structures on n-byte boundaries, where n
= 1, 2, 4, or 8. The pack(n) pragma can be used to control packing
on a per-structure basis in source code.
/Zpn and pack(n) are implementation specific and do not align structures on particular boundaries. They align data within structures.
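The effect of packing on member placement can be observed with `offsetof`. The sketch below assumes a compiler that accepts `#pragma pack(1)` and the empty form `#pragma pack()` to restore the default; the structure names are hypothetical.

```c
#include <stddef.h>

#pragma pack(1)               /* pack members on 1-byte boundaries */
struct packed_sample {
    char   tag;
    double value;             /* lands at offset 1, so it is mis-aligned */
};
#pragma pack()                /* restore the compiler's default packing */

struct default_sample {
    char   tag;
    double value;             /* offset chosen by the compiler's default rules */
};
```

With 1-byte packing, `offsetof(struct packed_sample, value)` is 1 and the structure occupies `1 + sizeof(double)` bytes. The offset in `default_sample` depends on the compiler and its options, which is why it is worth printing rather than assuming.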
By default, Microsoft Macro-Assembler* (MASM) aligns segments on 16-byte paragraph addresses. The example below defines two buffers of 100 quadword (8-byte) elements, EXAMPLETbl and TESTTbl. Each is aligned to a 16-byte boundary because each is defined in its own segment. The ES segment register is assigned the segment address of the TESTTbl buffer.
    EXAMPLESEG SEGMENT PARA USE16 PUBLIC 'DATA'
    EXAMPLETbl DQ 100 DUP (0AAAAAAAAAAAAAAAAh)
    EXAMPLESEG ENDS
    TESTSEG SEGMENT PARA USE16 PUBLIC 'DATA'
    TESTTbl DQ 100 DUP (05555555555555555h)
    TESTSEG ENDS
    ; ; ;
    EXAMPLECODE SEGMENT USE16 PUBLIC 'CODE'
    ASSUME CS:EXAMPLECODE, DS:EXAMPLESEG, ES:TESTSEG, SS:EXAMPLESTACK
    Begin:
    mov ax, EXAMPLESEG
    mov ds, ax
    mov ax, TESTSEG
    mov es, ax
    ; ; ;
    EXAMPLECODE ENDS
    end Begin
The ALIGN n directive aligns data on n-byte boundaries. In the example below, data bytes are defined to force mis-alignment of EXAMPLETbl. The ALIGN directive then aligns EXAMPLETbl on the next 8-byte boundary.
    ;Alignment with the ALIGN n directive
    EXAMPLEDATA SEGMENT PARA USE16 PUBLIC 'DATA'
    DB 3 DUP (0)               ;As a test, force mis-alignment
    ALIGN 8
    EXAMPLETbl DQ 100 DUP (0AAAAAAAAAAAAAAAAh)
    EXAMPLEDATA ENDS
    ; ; ;
    EXAMPLECODE SEGMENT PARA PUBLIC 'CODE'
    ASSUME CS:EXAMPLECODE, DS:EXAMPLEDATA, SS:EXAMPLESTACK
    Begin:
    ; ; ;
    EXAMPLECODE ENDS
    end Begin
This method involves adjusting a pointer with an offset to create a pointer to an arbitrarily aligned block of data. The pointer is decremented by the number of bytes required to make it point to an n-byte aligned block of data. An extra data element is defined before the buffer definition to provide the offset memory.
    ; Demonstrates data alignment with an indirect pointer
    EXAMPLEDATA SEGMENT PARA USE16 PUBLIC 'DATA'
    DQ 0AAAAAAAAAAAAAAAAh          ;extra element provides offset memory
    EXAMPLETbl DQ 100 DUP (0AAAAAAAAAAAAAAAAh)
    EXAMPLETblBase dw 0
    EXAMPLEDATA ENDS
    ; ; ;
    EXAMPLECODE SEGMENT PARA USE16 PUBLIC 'CODE'
    ASSUME CS:EXAMPLECODE, DS:EXAMPLEDATA, SS:EXAMPLESTACK
    Begin:
    ; ; ;
    mov ax, OFFSET EXAMPLETbl
    mov bx, ax
    and bx, 07h                    ;bytes past the previous 8-byte boundary
    sub ax, bx                     ;step back to that boundary
    mov EXAMPLETblBase, ax
    ; ; ;
    EXAMPLECODE ENDS
    end Begin
EXAMPLETblBase is then used for data accesses.
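The pointer adjustment in the listing is plain integer arithmetic, which can be sketched in C to check it on sample addresses. The two hypothetical functions below round an address down to an 8-byte boundary, first with the AND-then-SUB sequence and then with a single mask that clears the same three low bits.

```c
#include <stdint.h>

/* Rounds addr down to the previous 8-byte boundary by subtracting
   its low three bits (the and/sub sequence). */
uintptr_t align_down_sub(uintptr_t addr)
{
    uintptr_t low = addr & (uintptr_t)0x7;   /* bytes past the boundary */
    return addr - low;                       /* step back to the boundary */
}

/* Produces the same result with one mask that clears the low three bits. */
uintptr_t align_down_mask(uintptr_t addr)
{
    return addr & ~(uintptr_t)0x7;
}
```

For the address 0x1003, the low three bits are 3, so both functions return 0x1000. Rounding down moves the pointer toward lower addresses, which is why spare memory has to exist in front of the buffer.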
When the LOCAL directive is used, it must be located immediately after the PROC directive. There is no way to force alignment before the LOCAL directive allocates space for locals. It would be convenient if there were a form of the ALIGN directive that could be used with LOCAL to align locals. Alignment can instead be achieved by adjusting a pointer, as demonstrated above.
    ; Demonstrates data alignment of LOCAL directive stack variables
    ; ;
    call TASK
    ; ;
    TASK PROC
    LOCAL nuts[8]:BYTE
    LOCAL locBuff[10]:QWORD
    LOCAL locPtr:WORD
    lea ax, locBuff
    and ax, 0FFF8h             ;round down to an 8-byte boundary
    lea bx, locPtr
    mov word ptr ss:[bx], ax   ;store the aligned pointer in locPtr
    ret
    TASK ENDP
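The same idea carries over to a local buffer in C. The sketch below is a variant that rounds the address up rather than down, so the spare bytes are reserved at the end of the local array instead of in front of it. The function name `align_up` is hypothetical, and the boundary is assumed to be a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the first `boundary`-aligned address at or after `raw`.
   The caller must reserve boundary - 1 spare bytes beyond the space it
   needs, so the adjusted pointer still addresses the full amount.
   `boundary` must be a power of two. */
unsigned char *align_up(unsigned char *raw, size_t boundary)
{
    uintptr_t addr    = (uintptr_t)raw;
    uintptr_t aligned = (addr + (boundary - 1)) & ~(uintptr_t)(boundary - 1);
    return raw + (aligned - addr);           /* skip 0..boundary-1 bytes */
}
```

A local array declared as `unsigned char raw[80 + 7]`, for example, leaves room for `align_up(raw, 8)` to address 80 aligned bytes wherever the compiler happens to place `raw` on the stack.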
EBP and ESP can be aligned to reference arbitrarily aligned regions on the stack.
    ; Demonstrates aligning EBP to an aligned region on the stack
    push ebp
    mov eax, esp
    and eax, 7
    sub ebp, eax
    ...                        ;assembly routine code
    pop ebp
    mov esp, ebp
    pop ebp
    ret
If ESP/SP is adjusted for alignment, either the original ESP/SP or the adjustment value must be saved so that the stack pointer can be returned to its original value. The example below saves the stack pointer.
    ; Demonstrates aligning ESP and restoring it from a saved copy
    ;entry code
    ...
    mov eax, esp
    mov savedStackPointer, eax     ;save copy of mis-aligned stack pointer
    and esp, 0FFFFFFF8h            ;align stack pointer
    ...                            ;program code
    mov eax, savedStackPointer
    mov esp, eax                   ;restore stack pointer
    ;exit code
    ...
    ret
The indirect method of data alignment is flexible and applies to all levels of code development, from Assembly to C++. It entails allocating a buffer of the required size plus one extra data element of the same size as the buffer elements. An offset is then added to the pointer into the buffer so that it points to aligned data, and the adjusted pointer is used for all subsequent data accesses. When the memory is freed, the original pointer returned by the memory allocation function must be used.
Alignment in C code is similar to the indirect method in Assembly. A calculation is made to determine the offset in bytes required to align a pointer on an arbitrary boundary. The original pointer should be saved to free memory.
    // requires <stdlib.h> (malloc, free) and <stdint.h> (uintptr_t)
    int size_of_array = 100;
    int iSizeOfDouble;
    double * ptrDoubleMem;
    size_t size;
    void * vptrMalloc;
    unsigned int iAdjust;
    // size of buffer data element
    iSizeOfDouble = sizeof(double);
    // number of bytes to allocate
    size = size_of_array * iSizeOfDouble + iSizeOfDouble;
    // allocate memory
    vptrMalloc = malloc(size);
    // determine adjustment required to align pointer
    // (uintptr_t holds a pointer value without truncation)
    iAdjust = (unsigned int)((uintptr_t)vptrMalloc % (uintptr_t)iSizeOfDouble);
    // calculate new aligned pointer
    ptrDoubleMem = (double *)((uintptr_t)(iSizeOfDouble - iAdjust)
                              + (uintptr_t)vptrMalloc);
    // Program code
    // free memory with the original pointer
    free(vptrMalloc);
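Keeping the original pointer in a separate variable works, but it means every aligned buffer travels with a second pointer. A common variation, sketched here as an assumption rather than as part of this note's method, stores the original pointer in the bytes just below the aligned block so that a matching free routine can recover it. Both function names are hypothetical, the boundary is assumed to be a power of two, and overflow of the size calculation is not checked.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Allocates `size` bytes aligned to `boundary` (a power of two).
   The pointer returned by malloc is saved just below the aligned block.
   Returns NULL if the allocation fails. */
void *aligned_malloc_sketch(size_t size, size_t boundary)
{
    void *original = malloc(size + boundary - 1 + sizeof(void *));
    unsigned char *raw = original;
    unsigned char *aligned;
    uintptr_t addr;

    if (original == NULL)
        return NULL;

    /* leave room for the saved pointer, then round up to the boundary */
    addr = (uintptr_t)(raw + sizeof(void *));
    addr = (addr + boundary - 1) & ~(uintptr_t)(boundary - 1);
    aligned = raw + (addr - (uintptr_t)raw);

    /* save the original pointer immediately below the aligned block */
    memcpy(aligned - sizeof(void *), &original, sizeof(void *));
    return aligned;
}

/* Frees a block obtained from aligned_malloc_sketch. */
void aligned_free_sketch(void *aligned)
{
    void *original;

    if (aligned == NULL)
        return;
    memcpy(&original, (unsigned char *)aligned - sizeof(void *), sizeof(void *));
    free(original);    /* free with the pointer malloc returned */
}
```

A buffer of 100 doubles would then be obtained with `aligned_malloc_sketch(100 * sizeof(double), 8)` and released with `aligned_free_sketch`, never with `free` directly.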
Alignment in Windows is similar to the method above for C. For Windows memory allocation, use GlobalAlloc instead of malloc from the standard C library. The handle returned by GlobalAlloc is passed to GlobalLock to lock the allocated memory and prevent Windows from reusing the memory. The pointer returned by GlobalLock is the pointer used for alignment adjustment.
    int size_of_array = 100;
    int iSizeOfDouble;
    double FAR * ptrDoubleMem;
    DWORD size;
    HGLOBAL hGlobalMemory;
    void FAR * lpGlobalMemory;
    unsigned int iAdjust;
    // size of buffer data element
    iSizeOfDouble = sizeof(double);
    // number of bytes to allocate
    size = size_of_array * iSizeOfDouble + iSizeOfDouble;
    // allocate and lock memory
    hGlobalMemory = GlobalAlloc(GMEM_MOVEABLE, size);
    lpGlobalMemory = GlobalLock(hGlobalMemory);
    // determine adjustment required to align pointer
    iAdjust = (unsigned long)lpGlobalMemory % (unsigned long)iSizeOfDouble;
    // calculate new aligned pointer
    ptrDoubleMem = (double FAR *)((unsigned long)(iSizeOfDouble - iAdjust)
                                  + (unsigned long)lpGlobalMemory);
    // Program code
    // unlock and free memory with the original handle
    GlobalUnlock(hGlobalMemory);
    GlobalFree(hGlobalMemory);
In C++, use new to allocate memory. The original pointer should be saved in order to free memory with delete.
    // requires <cstdint> (uintptr_t)
    int size_of_array = 100;
    char * ptrNewMem;
    double * ptrDoubleMem;
    size_t iSizeOfDouble;
    unsigned int iAdjust;
    unsigned long size;
    // size of buffer data element
    iSizeOfDouble = sizeof(double);
    // number of bytes to allocate
    size = size_of_array * iSizeOfDouble + iSizeOfDouble;
    // allocate memory
    ptrNewMem = new char[size];
    // determine adjustment required to align pointer
    // (uintptr_t holds a pointer value without truncation)
    iAdjust = (unsigned int)((uintptr_t)ptrNewMem % (uintptr_t)iSizeOfDouble);
    // calculate new aligned pointer
    ptrDoubleMem = (double *)((uintptr_t)(iSizeOfDouble - iAdjust)
                              + (uintptr_t)ptrNewMem);
    // Program code
    // free memory with the original pointer
    delete [] ptrNewMem;