AP-527 APPLICATION NOTE

Using MMX™ Instructions to
Get Bits From a Data Stream

Disclaimer
Information in this document is provided in connection with Intel products. No license under any patent or copyright is granted expressly or implied by this publication. Intel assumes no liability whatsoever, including infringement of any patent or copyright, for sale and use of Intel products except as provided in Intel's Terms and Conditions of Sale for such products.

Intel retains the right to make changes to these specifications at any time, without notice. Microcomputer Products may have minor variations to their specifications known as errata.

MPEG is an international standard for audio and video compression and decompression promoted by ISO. Implementations of MPEG CODEC’s or MPEG enabled platforms may require licenses from various entities, including Intel Corporation. Intel makes no representation as to the need for licenses from any entity. No licenses, either express, implied or by estoppel are granted herein. For information on licensing Intel patents, please contact Intel.

1.0. INTRODUCTION

2.0. GETBITS FUNCTION

2.1. Getbits Core

2.2. new64bit Code

2.3. do_refill Code

3.0. PERFORMANCE GAINS

3.1. Scalar Performance

3.2. MMX™ Code Performance

4.0. GETBITS FUNCTION: CODE LISTING

1.0. INTRODUCTION

The Intel Architecture (IA) media extensions include single-instruction, multi-data (SIMD) instructions. This application note presents examples of code that exploit these instructions. Specifically, the getbits function presented here illustrates how to use the new MMX^TM instructions (PSRLQ and PSLLQ) to manipulate a data stream. The performance improvement relative to traditional IA code can be attributed primarily to the much faster shift instructions. Whereas the IA shift instruction (SHIFT) takes four cycles on a Pentium® processor, the MMX instruction (PSHIFT) takes only one cycle. The performance gain is also due to the fact that the MMX instructions operate on 64-bit values instead of the 32-bit values operated on by the scalar shift instructions.

2.0. GETBITS FUNCTION

In applications such as the Moving Pictures Expert Group (MPEG) applications, the data is organized in a bit stream in big-endian order. The user requires a variable number of bits from this bitstream. The getbitsfunction performs these tasks to access the requested data:

1. The function gets the user-specified number of bits from the multimedia register, MM0, and returns them in the integer register, EAX.

2. When all the bits in MM0have been used, getbitsjumps to the new64bit code. This code refreshes MM0, filling it with bits from a buffer of finite length. getbitsthen returns the requested bits in the eax register.

3. When there are no more bits in the buffer, getbitsjumps to the do_refill code, which places a finite number of bits from the input stream into the buffer.

2.1. Getbits Core

Figure 1 illustrates the basic MMX code flow needed for the getbits core. Basically, MM2 (a copy of MM0) shifts the desired bits all the way to the least significant part of the register. Register MM0 is updated so that these returned bits are removed.

The core of the getbitsfunction is listed in Example 1. Recall that the multimedia register, MM0, contains the bits that will be retrieved bygetbits.

The first three lines determine how many bits the user requests and if there are enough valid bits in the MM0register. Assuming that MM0 contains enough valid bits for this call of getbits, the branch in line 4 is not taken.

Figure 1. Getbits Core Diagram

Example 1. Getbits Core

getbits:
  1	MOV	eax, DWORD PTR bit_count	; Number of valid bits in MM0
  2	MOV	ecx, 4[ESP]			; Parameter. How many bits to read
  3	SUB	eax, ecx			; Do we have enough bits in MM0
  4	JL	new64bit 			; If not get new 64 bits
  5	movd	MM3,64_minus_index[ecx*4]	; MM3 = 64 - number of needed bits
  6	movq	MM2,MM0				; make a copy of the bits
  7	movd	MM1,ecx 			; number of bits to read
  8	psrlq	MM2,MM3 			; MM2 now has valid bitstream in 
						; least significant part
  9	mov	BYTE PTR bit_count,al 		; Update number of valid bits
 10	movd	eax,MM2 			; move the result into eax
 11	psllq	MM0,MM1 			; throw away those bits
 12	ret

The big-endian to little-endian conversion is already performed. See Section 2.2. new64bit Code. The relevant bits are currently in the most significant part of MM0. These bits are copied to MM2(line 6). The bits in MM2 are shifted to the right so that they are now in the least significant part of the register (lines 5 and 8). The bits can then be loaded to the EAX register (line 10). The bits that were just loaded to EAX are then discarded at the most significant edge of MM0 (shift to left in lines 7 and 11). Line 9 updates the number of valid bits in MM0.

The read from a 64-bit MMX register takes only eight cycles, including the two cycles for the RET instruction. If the application uses this core piece of the getbits function as a macro, the two-cycle overhead for the return is saved. Of course, this means an extra 42 bytes of code size each time this macro is called.

In addition, if another register can be spared, then the number of requested bits can be passed to the macro in the ECX register, or another similar register. The code may then take as few as four cycles (another instruction or a pair of instructions could execute during the free cycle). The restructured code is listed in Example 2.

Example 2. Restructured getbits Core

getbits:
  1	sub	edi,ecx	       		; Are there enough bits in MM0
  2	jnl	continue		; If there are, then continue
  3	call	new64bit		; If not, get new 64 bits
  4	jmp	out			; finished
continue:
  5	movd	MM3,64_minus_index[ecx*4]
					;MM3 = 64 - number of requested bits
  6	movq	MM2,MM0			; make a copy of the bits
  7	movd	mm1,ecx			; number of bits to read
  8	psrlq	MM2,MM3			; MM2 has valid bitstream in least significant part
					;(free cycle)
					;(free cycle)
 10	movd	eax,MM2			; move the result into eax
 11	psllq	MM0,MM1			; throw away those bits
out:

2.2. new64bit Code

The new64bit code is performed when there are not enough bits available in MM0. This code (as illustrated in Figure 2) calculates the number of requested bits less the number of bits remaining in MM0. A temporary copy of the bits remaining in MM0is made. Then, another 64 bits are obtained from the bitstream (in two 32-bit halves that are converted to little endian at the start of the refill code). The remaining bits are prepended to these bits. After the requested bits are returned in the EAX register, MM0 contains the remaining part of the new 64 bits (the new bits are appended to the previously remaining bits).

The code is listed in Example 3.

Actual statistics from a sample MPEG1 audio application shows that an average of 4.86 bits are requested each time getbits is called. This means that an average of once every 13.17 calls to getbits, 64 more bits are read from the buffer into MM0.

Figure 2: new64bits Core Diagram

Example 3. new64bit Code

					
							;mm2 has remaining bits from old group
							;mm4 has new 64 bits
							;eax has -(number of bits we missed in old group), (negative)
							;ecx has number of requested bits

 1	add	eax,64					; eax = 64- # of bits we missed in old 
							; group
 2	movd	mm3,_64_minus_index[ecx*4]		; MM3 = 64 - number of requested bits
 3	psrlq	mm2,mm3					; mm2 has remaining bits in least 
							; significant part with room for new 
							; bits to right of it
 4	movd	mm3,eax					; mm3=64- # of bits we missed in old 
							; group
 5	movq	mm0,mm4					; save the new word in mm0 for next time
 6	psrlq	mm4,mm3					; mm4 = new bits we now need in least 
							; significant part
 7	mov		_bit_count,eax			; Save bit count for next time
 8	por	mm2,mm4					; combine remaining bits with the bits 
							; from new word
 9	movd	mm1,_64_minus_index[4*eax]		; # of bits we missed in old group
10	movd	eax,mm2					; return bits in eax
11	psllq	mm0,mm1					; remove the bits we just read from mm0

2.3. do_refill Code

When the 64-bit aligned input buffer is empty, the getbits function performs the do_refill code. The do_refill code resets the values of buf_pointer (pointer to input buffer) and the EndBuf (pointer to end of buffer). If the application assumes that the buffer is never empty (for example, perhaps the buffer is filled at the end of every frame) the two-cycle overhead of testing for end of buffer can be saved.

The do_refill code reads new data from the input stream every 64 bits. The current 64 bits are stored in MM0. Therefore, other functions should not modify the contents of MM0.

3.0. PERFORMANCE GAINS

This section details the performance improvement as compared with traditional scalar code. There is approximately a 2X performance gain for the MMX optimized code version for typical MPEG Audio code.

3.1. Scalar Performance

The scalar version of the getbitsfunction executes in 18 cycles (if there are enough bits in the 32-bit register). In case of a jump to the new32bit code (with a mispredicted branch it must be called more often since it retrieves only 32 bits), the code takes 38 cycles to execute, plus an additional five cycles (for the misprediction) to read another 32 bits.

Assuming 4.86 bits per read (see Section 2.2), the 32-bit register is refilled approximately every 6.58 reads. Total getbits execution time (on average) therefore is:

(18 * (6.58-1) + (38+5) )/6.58 = 21.8 cycles

3.2. MMX™ Code Performance

The MMX optimized code version of the getbitsfunction executes in eight cycles (if there are enough bits in MM0). The speedup can be attributed to several factors:

The MMX code uses 64-bit registers, rather than the 32-bit registers used by scalar code.

By using a 64-bit read, the cost of the mispredicted branch is reduced by half, compared to the scalar code with a 32-bit read.

In case of a jump to the new64bit code, the getbitsfunction takes 19 cycles to execute, plus an additional five cycles (for the misprediction) to read another 64-bit value.

Again, assuming 4.86 bits per read (see Section 2.2), the 64-bit MM0 register is refilled approximately every 13.17 getbitscalls. Therefore, on average, the total execution time is:

(8 * (13.17-1) + (19+5)) / 13.17 = 9.2 cycles

It takes 9.2 cycles to finish the MMX technology version of the getbits call. The performance gain is almost 2.4X over that of typical MPEG Audio code.

With the Pentium® Pro processor, the misprediction penalty is much higher. The performance gain relative to scalar code will therefore be even greater.

Note that this analysis assumes 100 percent misprediction on the branch.

4.0. GETBITS FUNCTION: CODE LISTING

	.486P
	
ASSUME ds:FLAT, cs:FLAT, ss:FLAT

EXTRN     refill_buffer:PROC

DATA SEGMENT PARA  PUBLIC USE32 'DATA'
	ALIGN     16

64_minus_index	dd 
64,63,62,61,60,59,58,57,56,55,54,53,52,51,50,49,48,47,46,45,44,43,42,41,40,39,38,37,36,35,34
		dd 
33,32,31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1


EXTRN     _buf_pointer:DWORD
EXTRN     _end_buf:DWORD
EXTRN     _bit_count:DWORD
_DATA ENDS

_TEXT SEGMENT DWORD PUBLIC USE32 'CODE'

PUBLIC    _getbits
_getbits:

	mov	eax, DWORD PTR _bit_count	; Number of valid bits in MM0
	mov	ecx, 4[ESP]			; Parameter. How many bits should we 
						; read.

	sub	eax,ecx				; Do we have enough bits in MM0
	jl	new64bit			; If not get new 64 bits

	movd	MM3,_64_minus_index[ecx*4]	;MM3 = 64 - number of needed bits.
	movq	MM2,MM0
	
	movd		mm1,ecx			; # of bits to read
	psrlq	MM2,MM3				; MM2 now has valid bitstream in least 
						; significant part

	mov 	BYTE PTR _bit_count,al		; Update number of valid bits.

	movd	eax,MM2				; move the result into eax
	psllq	MM0,MM1				; throw away those bits
	
	ret

new64bit:
	movd	MM3,_64_minus_index[ecx*4]	;MM3 = 64 - number of requested bits 
						;(for shifting)
	movq	MM2,MM0				; copy left over bits

	mov	edx,DWORD PTR _buf_pointer	;pointer to bitstream
	mov	ecx,_end_buf			;read pointer to end of buffer

	add	edx,8				;update the pointer
	add	eax,64				;eax = 64- # of bits we missed in old 
						;group

	cmp	edx,ecx				;do we have another qword to read
	mov	DWORD PTR _buf_pointer,edx	;save new value

	mov	ecx,[edx-8]			;read next qword (dword here)
	mov	edx,[edx-4]			;(dword here)

	jge					;do_refill
refill:
	; now convert from bigendian to little and 
	; but make use of left over bits (MM2) before using these
	bswap	edx				;swapping the first 32 bit
	bswap	ecx				;swapping the second 32 bit

	movd		mm4,ecx			;second 32 bit in mm4
	psrlq	mm2,mm3				;mm2 has remaining bits in least 
						;significant part with room for new 
						;bits to right of it
	movd		mm1,edx			;move first 32 bit
	psllq	mm4,32				;shiftsecond 32 bit to upper part of 
						;register

	movd	mm3,eax				;mm3 gets the shift counter
	por	mm4,mm1				;combine the 64 swapped data into mm4

	movq	mm0,mm4				;save new word in mm0 for next time
	psrlq	mm4,mm3				;mm4 = new bits we now need in least 
						;significant part

	mov	bit_count,eax			;Save bit count for next time
	por	mm2,mm4				;combine remaining bits with the bits 
						;from new word

	movd	mm1,_64_minus_index[4*eax]	;# of bits we missed in old group

	movd	eax,mm2				;return bits in eax
	psllq	mm0,mm1				;remove the bits we just 
						;read from mm0

	RET
do_refill:
	PUSH	EAX
	PUSH	EDX
	PUSH	ECX
	CALL	_refill_buffer
	POP	ECX
	POP	EDX
	POP	EAX
	JMP	refill

_TEXT ENDS
END

Using MMX™ Instructions to Get Bits From a Data Stream

Using MMX™ Instructions to
Get Bits From a Data Stream