Chapter 5

MMX™ CODING TECHNIQUES

This section contains several simple examples to help you get started coding your application. The goal is to provide simple, low-level operations that are frequently used. Each example uses the minimum number of instructions necessary to achieve the best performance on Pentium® and P6-family processors.

Each example includes:
A short description.
Sample code.
Any necessary notes.

These examples do not address scheduling, because we assume you will incorporate them into longer code sequences.

5.1 Unsigned Unpack

The MMX™ technology provides several instructions that are used to pack and unpack data in the MMX technology registers. The unpack instructions can be used to zero-extend an unsigned number. The following example assumes the source is a packed-word (16-bit) data type.

Input:   MM0 : Source value;
   	 MM7 : 0

A local variable can be used instead of the register MM7, if desired.

Output: 	MM0 : two zero-extended 32-bit doublewords from the 2 LOW end words
		MM1 : two zero-extended 32-bit doublewords from the 2 HIGH end words

MOVQ		MM1, MM0		; copy source
PUNPCKLWD	MM0, MM7   		; unpack the 2 low end words
					; into two 32-bit doublewords
PUNPCKHWD	MM1, MM7		; unpack the 2 high end words into two
					; 32-bit doublewords
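
The effect of this sequence can be checked with a small portable C model. This is only a sketch: the names `interleave_words` and `unsigned_unpack_words` are hypothetical, and the array layout (element 0 is the least significant word of the register) is an assumption made for the illustration.

```c
#include <stdint.h>

/* Hypothetical helper: one doubleword produced by PUNPCKLWD/PUNPCKHWD.
   The destination word lands in bits 15..0, the source word in bits 31..16. */
static uint32_t interleave_words(uint16_t dest_word, uint16_t src_word)
{
    return (uint32_t)dest_word | ((uint32_t)src_word << 16);
}

/* Model of MOVQ / PUNPCKLWD / PUNPCKHWD with MM7 = 0.
   src[0] is the least significant word of MM0. */
void unsigned_unpack_words(const uint16_t src[4],
                           uint32_t low[2], uint32_t high[2])
{
    const uint16_t zero = 0;                    /* contents of MM7 */
    low[0]  = interleave_words(src[0], zero);   /* PUNPCKLWD MM0, MM7 */
    low[1]  = interleave_words(src[1], zero);
    high[0] = interleave_words(src[2], zero);   /* PUNPCKHWD MM1, MM7 */
    high[1] = interleave_words(src[3], zero);
}
```

Because the interleaved partner is always zero, the upper half of every doubleword is zero, which is exactly a zero-extension.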

5.2 Signed Unpack

Signed numbers should be sign-extended when unpacking the values. This is done differently than the zero-extend shown above. The following example assumes the source is a packed-word (16-bit) data type.

Input:	MM0 : source value

Output:	MM0 : two sign-extended 32-bit doublewords from the two LOW end words
	MM1 : two sign-extended 32-bit doublewords from the two HIGH end words

PUNPCKHWD	MM1, MM0	; unpack the 2 high end words of the
				; source into the second and fourth
				; words of the destination (the prior
				; contents of MM1 are shifted out below)
PUNPCKLWD	MM0, MM0	; unpack the 2 low end words of the
				; source into the second and fourth
				; words of the destination
PSRAD		MM0, 16		; Sign-extend the 2 low end words of
				; the source into two 32-bit signed
				; doublewords
PSRAD		MM1, 16		; Sign-extend the 2 high end words of
				; the source into two 32-bit signed
				; doublewords
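
A scalar C model makes the two steps explicit: the unpack places each source word in the upper half of a doubleword, and the arithmetic shift then replicates its sign bit. The sketch below is an illustration under assumptions; the function names are hypothetical and element 0 is taken to be the least significant word.

```c
#include <stdint.h>

/* Hypothetical helper modeling PSRAD by 16 on one doubleword:
   the upper word is kept and its sign bit is replicated. */
static int32_t shift_right_arithmetic_16(uint32_t dword)
{
    uint32_t upper = dword >> 16;               /* 0..0xFFFF */
    return (upper & 0x8000u) ? (int32_t)upper - 0x10000 : (int32_t)upper;
}

void signed_unpack_words(const int16_t src[4],
                         int32_t low[2], int32_t high[2])
{
    for (int i = 0; i < 2; i++) {
        /* After the unpack, the source word sits in bits 31..16 of each
           doubleword. Bits 15..0 are shifted out, so they are left 0 here. */
        uint32_t low_dword  = (uint32_t)(uint16_t)src[i]     << 16;
        uint32_t high_dword = (uint32_t)(uint16_t)src[i + 2] << 16;
        low[i]  = shift_right_arithmetic_16(low_dword);   /* PSRAD MM0, 16 */
        high[i] = shift_right_arithmetic_16(high_dword);  /* PSRAD MM1, 16 */
    }
}
```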

5.3 Interleaved Pack with Saturation

The PACK instructions pack two values into the destination register in a predetermined order. Specifically, the PACKSSDW instruction packs two signed doublewords from the source operand and two signed doublewords from the destination operand into four signed words in the destination register as shown in the figure below.

Figure 5-1. PACKSSDW mm, mm/mm64 Instruction Example

The following example interleaves the two values in the destination register, as shown in the figure below.

Figure 5-2. Interleaved Pack with Saturation Example

This example uses signed doublewords as source operands and the result is interleaved signed words. The pack instructions can be performed with or without saturation as needed.

Input:	MM0 : Signed source1 value
	MM1 : Signed source2 value

Output:	MM0 : The first and third words contain the signed-saturated doublewords from MM0.
	      The second and fourth words contain the signed-saturated doublewords from MM1.

PACKSSDW	MM0, MM0	; pack and sign saturate
PACKSSDW	MM1, MM1	; pack and sign saturate
PUNPCKLWD	MM0, MM1	; interleave the low end 16-bit values of the
				; operands

The pack instructions always assume the source operands are signed numbers. The result in the destination register is always defined by the pack instruction that performs the operation. For example, the PACKSSDW instruction packs each of the two signed 32-bit values of the two sources into four saturated 16-bit signed values in the destination register. The PACKUSWB instruction, on the other hand, packs each of the four signed 16-bit values of the two sources into eight saturated 8-bit unsigned values in the destination. A complete specification of the MMX technology instruction set can be found in the Intel Architecture MMX™ Technology Programmer's Reference Manual (Order Number 243007).
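
The interleaved pack with saturation can be modeled in portable C as follows. This is a sketch for checking results, not a replacement for the MMX sequence; `saturate_to_word` and `interleaved_pack_saturate` are hypothetical names, and element 0 is assumed to be the least significant element.

```c
#include <stdint.h>

/* Hypothetical helper: signed saturation of a doubleword to a word,
   as PACKSSDW does for each element. */
static int16_t saturate_to_word(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* a models MM0 (source1), b models MM1 (source2). */
void interleaved_pack_saturate(const int32_t a[2], const int32_t b[2],
                               int16_t out[4])
{
    /* PACKSSDW MM0, MM0 and PACKSSDW MM1, MM1 saturate each doubleword.
       PUNPCKLWD MM0, MM1 then interleaves the low words: a0, b0, a1, b1. */
    out[0] = saturate_to_word(a[0]);
    out[1] = saturate_to_word(b[0]);
    out[2] = saturate_to_word(a[1]);
    out[3] = saturate_to_word(b[1]);
}
```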

5.4 Interleaved Pack Without Saturation

This example is similar to the previous one, except that the resulting words are not saturated. In addition, to protect against overflow, only the low-order 16 bits of each doubleword are used in this operation.

Input:  MM0 : signed source value
	MM1 : signed source value

Output: MM0 : The first and third words contain the low 16 bits of the doublewords in MM0.
	      The second and fourth words contain the low 16 bits of the doublewords in MM1.


PSLLD		MM1, 16			; shift the 16 LSB of each doubleword
					; value to the 16 MSB position
PAND 		MM0, {0,ffff,0,ffff}	; mask (a 64-bit constant in memory,
					; 0x0000FFFF0000FFFF) to zero the 16 MSB
					; of each doubleword value
POR		MM0, MM1		; merge the two operands
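
Per doubleword, the shift, mask, and merge steps reduce to one line each. The C model below is a sketch; the function name is hypothetical, and each array element stands for one 32-bit doubleword of the register, least significant first.

```c
#include <stdint.h>

/* a models MM0, b models MM1; out models the final MM0. */
void interleaved_pack_truncate(const uint32_t a[2], const uint32_t b[2],
                               uint32_t out[2])
{
    for (int i = 0; i < 2; i++) {
        uint32_t high_half = b[i] << 16;          /* PSLLD MM1, 16     */
        uint32_t low_half  = a[i] & 0x0000FFFFu;  /* PAND  MM0, mask   */
        out[i] = low_half | high_half;            /* POR   MM0, MM1    */
    }
}
```

The upper 16 bits of every input doubleword are discarded, so values that do not fit in 16 bits are truncated rather than saturated.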

5.5 Non-Interleaved Unpack

The unpack instructions perform an interleave merge of the data elements of the destination and source operands into the destination register. The following example merges the two operands into the destination registers without interleaving. For example, take two adjacent elements of a packed-word data type in source1 and place this value in the low 32 bits of the result. Then take two adjacent elements of a packed-word data type in source2 and place this value in the high 32 bits of the result. One of the destination registers will have the combination shown in Figure 5-3.

Figure 5-3. Result of Non-Interleaved Unpack in MM0

The other destination register will contain the opposite combination as in Figure 5-4.

Figure 5-4. Result of Non-Interleaved Unpack in MM1

The following example unpacks two packed-word sources in a non-interleaved way. The trick is to use the instruction which unpacks doublewords to a quadword, instead of using the instruction which unpacks words to doublewords.

Input:  MM0 : packed-word source value
	MM1 : packed-word source value

Output: MM0 : contains the two low end words of the original sources, non-interleaved 
	MM2 : contains the two high end words of the original sources, non-interleaved.

MOVQ		MM2, MM0	; copy source1
PUNPCKLDQ	MM0, MM1	; replace the two high end words of MM0
				; with the two low end words of MM1
				; leave the two low end words of MM0 
				; in place
PUNPCKHDQ	MM2, MM1	; move the two high end words of MM2 to the
				; two low end words of MM2; place the two
				; high end words of MM1 in the two high end
				; words of MM2
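
The resulting word order is easy to verify with a C model. In this sketch the function name is hypothetical, `a` stands for the original MM0, `b` for MM1, and element 0 is the least significant word.

```c
#include <stdint.h>

void non_interleaved_unpack(const uint16_t a[4], const uint16_t b[4],
                            uint16_t low_out[4], uint16_t high_out[4])
{
    /* PUNPCKLDQ MM0, MM1: low doubleword of a, then low doubleword of b */
    low_out[0] = a[0];
    low_out[1] = a[1];
    low_out[2] = b[0];
    low_out[3] = b[1];

    /* PUNPCKHDQ MM2, MM1: high doubleword of a, then high doubleword of b */
    high_out[0] = a[2];
    high_out[1] = a[3];
    high_out[2] = b[2];
    high_out[3] = b[3];
}
```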

5.6 Complex Multiply by a Constant

Complex multiplication requires four multiplications and two additions, which is exactly what the PMADDWD instruction performs. To use this instruction, you need only format the data into four 16-bit values. The real and imaginary components should be 16 bits each.

Let the input data be Dr and Di where
Dr = real component of the data
Di = imaginary component of the data

Format the constant complex coefficients in memory as four 16-bit values [Cr -Ci Ci Cr]. Remember to load the values into the MMX technology register using a MOVQ instruction.

Input: 	MM0 : a complex number Dr, Di
	MM1 : constant complex coefficient in the form [Cr -Ci Ci Cr]

Output: MM0 : two 32-bit dwords containing [Pr Pi]

The real component of the complex product is Pr = Dr*Cr - Di*Ci, and the imaginary component of the complex product is Pi = Dr*Ci + Di*Cr.

PUNPCKLDQ MM0,MM0	; This makes [Dr Di Dr Di]
PMADDWD   MM0, MM1	; and you're done, the result is		
			; [(Dr*Cr-Di*Ci)(Dr*Ci+Di*Cr)]

Note that the output is a packed doubleword. If needed, a pack instruction can be used to convert the result to 16-bit (thereby matching the format of the input).
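
The pairing of data words with coefficient words can be checked with a scalar C model of the multiply-add. This is a sketch under assumptions: the function name is hypothetical, the arrays are written most significant word first to match the bracket notation above, and Ci is assumed not to be -32768 so that -Ci fits in 16 bits.

```c
#include <stdint.h>

void complex_multiply_const(int16_t dr, int16_t di,
                            int16_t cr, int16_t ci,
                            int32_t *pr, int32_t *pi)
{
    /* [Dr Di Dr Di] after PUNPCKLDQ MM0, MM0 */
    int16_t data[4]  = { dr, di, dr, di };
    /* [Cr -Ci Ci Cr]; assumes ci != -32768 */
    int16_t coeff[4] = { cr, (int16_t)-ci, ci, cr };

    /* PMADDWD: multiply corresponding words, then add adjacent products */
    *pr = (int32_t)data[0] * coeff[0] + (int32_t)data[1] * coeff[1];
    *pi = (int32_t)data[2] * coeff[2] + (int32_t)data[3] * coeff[3];
}
```

For example, (3 + 4i) multiplied by (1 + 2i) gives Pr = 3*1 - 4*2 = -5 and Pi = 3*2 + 4*1 = 10.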

5.7 Absolute Difference of Unsigned Numbers

This example computes the absolute difference of two unsigned numbers. It assumes an unsigned packed-byte data type. Here, we make use of the subtract instruction with unsigned saturation. This instruction receives UNSIGNED operands and subtracts them with UNSIGNED saturation. This support exists only for packed bytes and packed words, NOT for packed dwords.

Input:	MM0:  source operand
	MM1:  source operand

Output: MM0: The absolute difference of the unsigned operands

MOVQ		MM2, MM0     	; make a copy of MM0
PSUBUSB		MM0, MM1 	; compute difference one way
PSUBUSB		MM1, MM2  	; compute difference the other way
POR		MM0, MM1     	; OR them together

This example will not work if the operands are signed. See the next example for signed absolute differences.
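
The reason the OR works is that, for each element, at most one of the two saturated differences is non-zero. A minimal C model of the sequence, with hypothetical function names and eight unsigned bytes per operand, might look like this:

```c
#include <stdint.h>

/* Hypothetical helper modeling PSUBUSB on one byte:
   a - b, clamped at 0 instead of wrapping. */
static uint8_t sub_sat_u8(uint8_t a, uint8_t b)
{
    return (a > b) ? (uint8_t)(a - b) : 0;
}

void abs_diff_u8(const uint8_t a[8], const uint8_t b[8], uint8_t out[8])
{
    for (int i = 0; i < 8; i++) {
        uint8_t one_way   = sub_sat_u8(a[i], b[i]);  /* PSUBUSB MM0, MM1 */
        uint8_t other_way = sub_sat_u8(b[i], a[i]);  /* PSUBUSB MM1, MM2 */
        out[i] = one_way | other_way;                /* POR MM0, MM1     */
    }
}
```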

5.8 Absolute Difference of Signed Numbers

This example computes the absolute difference of two signed numbers. There is no MMX technology subtract instruction that receives SIGNED operands and subtracts them with UNSIGNED saturation. The technique used here is to first sort the corresponding elements of the input operands into packed words of the maxima and packed words of the minima. The minima are then subtracted from the maxima to generate the required absolute difference. The key is a fast sorting technique based on the facts that B = XOR(A, XOR(A,B)) and A = XOR(A,0). Thus, given a packed operand in which some elements are XOR(A,B) and the others are 0, XORing that operand with A yields B in the first set of positions and A in the rest. The following example assumes a packed-word data type, each element being a signed value.

Input:	MM0: signed source operand
	MM1: signed source operand

Output:	MM1: The absolute difference of the signed operands

MOVQ		MM2, MM0	; make a copy of source1 (A)
PCMPGTW		MM0, MM1	; create mask of source1>source2 (A>B)
MOVQ		MM4, MM2	; make another copy of A
PXOR		MM2, MM1	; Create the intermediate value of the swap
				; operation - XOR(A,B)
PAND		MM2, MM0	; create a mask of  0s and XOR(A,B)
				; elements. Where A>B there will be a value
				; XOR(A,B) and where A<=B there will be 0.
MOVQ		MM3, MM2	; make a copy of the swap mask
PXOR		MM4, MM2	; This is the minima - XOR(A, swap mask)
PXOR		MM1, MM3	; This is the maxima - XOR(B, swap mask)
PSUBW		MM1, MM4	; absolute difference = maxima-minima
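
The XOR-based sort can be followed element by element in a C model. The sketch below uses a hypothetical function name and returns each difference as an unsigned word, because the difference of two signed words can be as large as 0xFFFF and PSUBW wraps modulo 2^16.

```c
#include <stdint.h>

void abs_diff_s16(const int16_t a[4], const int16_t b[4], uint16_t out[4])
{
    for (int i = 0; i < 4; i++) {
        uint16_t ua = (uint16_t)a[i];
        uint16_t ub = (uint16_t)b[i];
        /* PCMPGTW: all ones where A > B (signed compare), else zero */
        uint16_t greater = (a[i] > b[i]) ? 0xFFFFu : 0u;
        /* PXOR then PAND: XOR(A,B) where A > B, otherwise 0 */
        uint16_t swap = (uint16_t)((ua ^ ub) & greater);
        uint16_t minima = (uint16_t)(ua ^ swap);   /* B where A > B, else A */
        uint16_t maxima = (uint16_t)(ub ^ swap);   /* A where A > B, else B */
        out[i] = (uint16_t)(maxima - minima);      /* PSUBW, wraps mod 2^16 */
    }
}
```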

5.9 Absolute Value

This example computes |x|, where x is signed. It assumes signed words as the operands.

Input: MM0 : signed source operand

Output:	MM1 : ABS(MM0)

MOVQ	MM1, MM0	; make a copy of x
PSRAW	MM0, 15		; replicate sign bit (use 31 if doing
			; DWORDS)
PXOR	MM1, MM0	; take 1's complement of just the
			; negative fields
PSUBSW	MM1, MM0	; add 1 to just the negative fields

Note that the absolute value of the most negative number (that is, 8000 hex for 16-bit) does not fit, but this code does something reasonable for this case; it gives 7fff which is off by one.
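
In scalar terms, the sign-mask technique computes |x| = (x XOR m) - m, where m is the sign bit replicated across the word (0 or -1) and the subtraction saturates. The C model below is a sketch with a hypothetical function name; it writes the one's complement arithmetically as -x - 1 to stay portable.

```c
#include <stdint.h>

int16_t abs_word_saturating(int16_t x)
{
    /* Sign bit replicated across the word: -1 for negative x, else 0 */
    int32_t mask = (x < 0) ? -1 : 0;
    /* XOR with the mask: one's complement (-x - 1) of negative values only */
    int32_t flipped = (x < 0) ? (-(int32_t)x - 1) : (int32_t)x;
    /* Subtracting the mask adds 1 to the negative fields only */
    int32_t result = flipped - mask;
    /* Signed saturation, which matters only for x = -32768 */
    if (result > INT16_MAX) result = INT16_MAX;
    return (int16_t)result;
}
```

For x = -32768 the unsaturated result would be 32768, so saturation yields 32767, the off-by-one value mentioned above.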

5.10 Clipping Signed Numbers to an Arbitrary Signed Range [HIGH, LOW]

This example shows how to clip a signed value to the signed range [HIGH, LOW]. Specifically, if the value is less than LOW or greater than HIGH then clip to LOW or HIGH, respectively. This technique uses the packed-add and packed-subtract instructions with unsigned saturation, which means that this technique can only be used on packed-bytes and packed-words data types.

The following examples show the operation on word values. For simplicity, we use the following constants (corresponding constants are used when the operation is done on byte values):

PACKED_MAX equals 0x7FFF7FFF7FFF7FFF
PACKED_MIN equals 0x8000800080008000
PACKED_LOW contains the value LOW in all 4 words of the packed-words datatype
PACKED_HIGH contains the value HIGH in all 4 words of the packed-words datatype
PACKED_USMAX is all 1's
HIGH_US is the result of adding the HIGH value to all data elements (4 words) of PACKED_MIN
LOW_US is the result of adding the LOW value to all data elements (4 words) of PACKED_MIN

Input: MM0 : Signed source operands

Output: MM0 : Signed operands clipped to the signed range [HIGH, LOW]

PADDW	MM0, PACKED_MIN				; add with no
						; saturation 0x8000
						; to convert to 
						; unsigned
PADDUSW	MM0, (PACKED_USMAX - HIGH_US)		; in effect this clips
						; to HIGH
PSUBUSW	MM0, (PACKED_USMAX - HIGH_US + LOW_US) 	;
						; in effect 
						; this clips to LOW
PADDW	MM0, PACKED_LOW				; undo the previous two
						; offsets

The code above converts values to unsigned numbers first and then clips them to an unsigned range. The last instruction converts the data back to signed data and places the data within the signed range. Conversion to unsigned data is required for correct results when the quantity (HIGH - LOW) < 0x8000.
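
The four-instruction sequence can be modeled per word in C. This is a sketch for checking the arithmetic; all function names are hypothetical, and the offsets are computed at run time here, whereas the MMX code would load them as precomputed packed constants.

```c
#include <stdint.h>

/* Hypothetical helpers modeling PADDUSW and PSUBUSW on one word. */
static uint16_t add_sat_u16(uint16_t a, uint16_t b)
{
    uint32_t sum = (uint32_t)a + b;
    return (sum > 0xFFFFu) ? 0xFFFFu : (uint16_t)sum;
}

static uint16_t sub_sat_u16(uint16_t a, uint16_t b)
{
    return (a > b) ? (uint16_t)(a - b) : 0;
}

/* Assumes low <= high. */
int16_t clip_signed_word(int16_t x, int16_t low, int16_t high)
{
    uint16_t high_us = (uint16_t)((uint16_t)high + 0x8000u);  /* HIGH_US */
    uint16_t low_us  = (uint16_t)((uint16_t)low  + 0x8000u);  /* LOW_US  */

    uint16_t v = (uint16_t)((uint16_t)x + 0x8000u);           /* PADDW PACKED_MIN */
    v = add_sat_u16(v, (uint16_t)(0xFFFFu - high_us));        /* clips to HIGH */
    v = sub_sat_u16(v, (uint16_t)(0xFFFFu - high_us + low_us)); /* clips to LOW */
    v = (uint16_t)(v + (uint16_t)low);                        /* PADDW PACKED_LOW */

    /* Reinterpret the 16-bit pattern as a signed word */
    return (v >= 0x8000u) ? (int16_t)((int32_t)v - 0x10000) : (int16_t)v;
}
```

After the saturating subtract each element holds the clipped unsigned value minus LOW_US, so a single wraparound add of LOW removes both the clipping offset and the 0x8000 bias.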

If (HIGH - LOW) >= 0x8000, the algorithm can be simplified to the following:

Input: MM0 : Signed source operands

Output: MM0 : Signed operands clipped to the signed range [HIGH, LOW]

PADDSW	MM0, (PACKED_MAX - PACKED_HIGH)			; in effect this
							; clips to HIGH
PSUBSW	MM0, (PACKED_USMAX - PACKED_HIGH + PACKED_LOW)	; clips to LOW
PADDW	MM0, LOW_US					; undo the
							; previous two
							; offsets (LOW_US
							; is LOW + 0x8000)

This algorithm saves a cycle when it is known that (HIGH - LOW) >= 0x8000. To see why the three-instruction algorithm does not work when (HIGH - LOW) < 0x8000, realize that 0xFFFF minus any number less than 0x8000 yields a number of at least 0x8000, which is negative when interpreted as a signed word. When

PSUBSW MM0, (0xFFFF - HIGH + LOW)

(the second instruction in the three-instruction algorithm) is executed, a negative number is subtracted, which increases the values in MM0 instead of decreasing them and produces an incorrect result.
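
The three-instruction variant can be modeled the same way with signed saturation. In this sketch the final wraparound add uses LOW + 0x8000 (the LOW_US constant), because after the saturating subtract each element holds the clipped value minus LOW minus 0x8000. The function names are hypothetical, and the precondition (HIGH - LOW) >= 0x8000 is assumed rather than checked.

```c
#include <stdint.h>

/* Hypothetical helper: signed saturation to a word (PADDSW/PSUBSW). */
static int16_t saturate_s16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Assumes (high - low) >= 0x8000. */
int16_t clip_signed_word_wide(int16_t x, int16_t low, int16_t high)
{
    /* PACKED_MAX - PACKED_HIGH, in 0..0x7FFF under the precondition */
    int16_t add_offset = (int16_t)(0x7FFF - high);
    /* PACKED_USMAX - PACKED_HIGH + PACKED_LOW, also in 0..0x7FFF */
    int16_t sub_offset = (int16_t)(0xFFFF - ((int32_t)high - low));
    /* LOW_US = LOW + 0x8000 as a 16-bit pattern */
    uint16_t low_us = (uint16_t)((uint16_t)low + 0x8000u);

    int16_t v = saturate_s16((int32_t)x + add_offset);   /* clips to HIGH */
    v = saturate_s16((int32_t)v - sub_offset);           /* clips to LOW  */

    /* PADDW wraps modulo 2^16 */
    uint16_t r = (uint16_t)((uint16_t)v + low_us);
    return (r >= 0x8000u) ? (int16_t)((int32_t)r - 0x10000) : (int16_t)r;
}
```

With HIGH = 0x7FFF and LOW = -0x8000 both offsets are zero and LOW_US is zero, so every input passes through unchanged, as expected for a range that clips nothing.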

5.11 Clipping Unsigned Numbers to an Arbitrary Unsigned Range [HIGH, LOW]

This example clips an unsigned value to the unsigned range [HIGH, LOW]. If the value is less than LOW or greater than HIGH, then clip to LOW or HIGH, respectively. This technique uses the packed-add and packed-subtract instructions with unsigned saturation, thus this technique can only be used on packed-bytes and packed-words data types.

The example illustrates the operation on word values.

Input: MM0 : Unsigned source operands

Output: MM0 : Unsigned operands clipped to the unsigned range [HIGH, LOW]

PADDUSW	MM0, (0xFFFF - HIGH)		; in effect this clips to HIGH
PSUBUSW	MM0, (0xFFFF - HIGH + LOW)	; in effect this clips to LOW
PADDW	MM0, LOW			; undo the previous two offsets
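
A per-word C model shows how the two saturating steps bracket the value. The function names below are hypothetical, and LOW <= HIGH is assumed.

```c
#include <stdint.h>

/* Hypothetical helpers modeling PADDUSW and PSUBUSW on one word. */
static uint16_t add_sat_u16(uint16_t a, uint16_t b)
{
    uint32_t sum = (uint32_t)a + b;
    return (sum > 0xFFFFu) ? 0xFFFFu : (uint16_t)sum;
}

static uint16_t sub_sat_u16(uint16_t a, uint16_t b)
{
    return (a > b) ? (uint16_t)(a - b) : 0;
}

/* Assumes low <= high. */
uint16_t clip_unsigned_word(uint16_t x, uint16_t low, uint16_t high)
{
    /* Values at or above HIGH saturate to 0xFFFF */
    uint16_t v = add_sat_u16(x, (uint16_t)(0xFFFFu - high));
    /* Values at or below LOW saturate to 0 */
    v = sub_sat_u16(v, (uint16_t)(0xFFFFu - high + low));
    /* Undo the two offsets */
    return (uint16_t)(v + low);
}
```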

5.12 Generating Constants

The MMX technology instruction set does not have an instruction that will load immediate constants to MMX technology registers. The following code segments will generate frequently used constants in an MMX technology register. Of course, you can also put constants as local variables in memory, but when doing so be sure to duplicate the values in memory and load the values with a MOVQ instruction.

Generate a zero register in MM0:

PXOR		MM0, MM0

Generate all 1's in register MM1, which is -1 in each of the packed data type fields:

PCMPEQB		MM1, MM1

Generate the constant 1 in every packed-byte [or packed-word] (or packed-dword) field:

PXOR		MM0, MM0
PCMPEQB		MM1, MM1
PSUBB		MM0, MM1	[PSUBW	MM0, MM1]	(PSUBD	MM0, MM1)

Generate the signed constant 2ⁿ - 1 in every packed-word (or packed-dword) field:

PCMPEQW		MM1, MM1
PSRLW		MM1, 16-n				(PSRLD MM1, 32-n)

Generate the signed constant -2ⁿ in every packed-word (or packed-dword) field:

PCMPEQW		MM1, MM1
PSLLW		MM1, n					(PSLLD	MM1, n)

Because the MMX technology instruction set does not support shift instructions for bytes, 2ⁿ - 1 and -2ⁿ are relevant only for packed-words and packed-dwords.
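
The value produced in each packed-word field by these sequences can be confirmed with a small C model. The function names are hypothetical, and each function returns the contents of a single 16-bit field.

```c
#include <stdint.h>

/* PCMPEQW MM1, MM1: a value always equals itself, so every bit is set */
uint16_t word_all_ones(void)
{
    return 0xFFFFu;
}

/* PXOR MM0, MM0 / PCMPEQW MM1, MM1 / PSUBW MM0, MM1 */
uint16_t word_one(void)
{
    uint16_t zero = 0;
    uint16_t ones = word_all_ones();
    return (uint16_t)(zero - ones);     /* 0 - (-1) = 1 */
}

/* PSRLW MM1, 16-n gives 2^n - 1, for 0 <= n <= 16 */
uint16_t word_pow2_minus_1(unsigned n)
{
    return (uint16_t)((uint32_t)word_all_ones() >> (16u - n));
}

/* PSLLW MM1, n gives -2^n as a 16-bit pattern, for 0 <= n <= 16 */
uint16_t word_neg_pow2(unsigned n)
{
    return (uint16_t)((uint32_t)word_all_ones() << n);
}
```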