  ## Using MMX™ Instructions to Implement a SchurWeiner Filter

 Disclaimer Information in this document is provided in connection with Intel products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel's Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel products including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice. Copyright © Intel Corporation (1996). Third-party brands and names are the property of their respective owners.

 1.0. INTRODUCTION 2.1. Input and Output Data Representation 2.2. Error Correction Techniques 3.1. Analyzing the Schur Algorithm 3.2. Creating the Initial Generator Matrix 3.3. Calculating the mth Order Reflection Coefficient 3.4. The Inner Loop: Recalculating the Generator Matrix

## 1.0. INTRODUCTION

The Intel Architecture (IA) media extension includes single-instruction, multi-data (SIMD) instructions. This application note presents examples of code that exploits these instructions. Specifically, the schur function presented here illustrates how to use the multiply and add instruction (PMADDWD>) to perform multiplication more efficiently. The performance improvement relative to traditional IA code is a result of the ability to efficiently perform multiple multiply operations in fewer cycles. To perform a signed Q15 multiply would use the IA IMUL> instruction and would take as many as 10 cycles. Using the PMADDWD> instruction, you can perform four word multiplies and two DWORD adds in three cycles. The performance gain can also be attributed to the fact that these new instructions operate on packed 64-bit values instead of 32-bit values.

## 2.0. THE SCHUR ALGORITHM

The amount of data which represents a human voice or sound is most often too large to store on a typical PC. Therefore, encoding the sound and only storing a partial set of the data is more practical. Voice encoding is one of the applications for which the Schur algorithm is used. This algorithm generates a set of reflection coefficients in a recursive manner as follows:

Step 1. Initialize the generator matrix G[pSize+1] with elements from the input vector r[pSize+1] as follows:

0 r(1) r(2) r(3) ..... r(p)

r(0) r(1) r(2) r(3) ..... r(p)

Step 2. Shift the second row of the generator matrix to the right by one place and discard the last element.

0 r(1) r(2) r(3) ..... r(p)

0 r(0) r(1) r(2) ..... r(p-1)

Step 3. Initialize m to 1

Step 4. Calculate the reflection coefficient K[m] as follows:

K[m] = - G[m] / G[m]

Step 5. Multiply the generator matrix by the following matrix:

1 K(m)

K(m) 1

Step 6. Shift the second row of the resultant matrix of step 5 by one place to the right to form the new generator matrix.

Step 7. Increment m.

Step 8. Repeat steps 4 through 7 until all pSize reflection coefficients have been solved.

### 2.1. Input and Output Data Representation

The input and output of this version of the Schur algorithm is represented in fixed-point notation. The input vector r[ ] and the resultant reflection coefficients K[ ] are represented as Q15 fractions stored in an array of short integers of size pSize+1. Throughout this algorithm multiplication and division of Q fractions are performed. It is important for the reader to understand what the resultant precision of such operations will be. If two signed Q15 fractions are multiplied together, the result will be a Q30 signed fraction. To determine the number of bits assigned to the fractional part of the result of a multiplication, add the number of bits assigned to the fractional parts of each multiplicand. To determine the number of bits assigned to the fractional part of the result of a division, subtract the number of bits assigned to the denominator from the number of bits assigned to the numerator. Refer to Figure 1 for further clarification.

Figure 1. Fixed-Point Notation ### 2.2. Error Correction Techniques

There are various areas in this algorithm where errors can occur. These errors are due primarily by the use of integer instead of floating-point arithmetic. This version of the Schur algorithm performs one error correction technique to minimize the overall amount of error found in the output results. It is accomplished by rounding results to the nearest digit of precision prior to converting from one precision to another (i.e., Q30 to Q15). Refer to Example 1 for a graphical representation of fixed-point conversion techniques.

Example 1. Fixed-Point Conversion Technique

```Converting from a Q30 to a Q15 signed fraction:

0x2A234238	Q30 signed fraction
+ 0x00004000	rounding factor

0x2A238238	rounded result
>> 15
0x00005447	conversion result

```

## 3.0. IMPLEMENTING THE SCHUR ALGORITHM USING MMX™ INSTRUCTIONS

There were numerous areas within the Schur algorithm which could be easily adapted and optimized for use with the MMX instruction set. There was also one area which could not. The remainder of this section discusses each part of the Schur algorithm and the optimizations used during programming.

### 3.1. Analyzing the Schur Algorithm

Prior to implementing the code, the steps of the Schur algorithm were analyzed to understand data organization and MMX instruction set pitfalls. This analysis was used to determine the approach to take when implementing the assembly version. During the analysis, four major pitfalls were found.

The first pitfall noted was the initialization of the generator matrix. This initialization entails a series of data moves from one memory location to another. If the data resides in the data cache, the accesses can be completed in one cycle if the access is aligned. If the data does not reside in the data cache, the speed of the memory access is bound by both the processor performance and the system implementation.

The second pitfall noted pertained to the integer division. This division of G[m] by G[m] can not be avoided since the denominator changes with each iteration of the main loop.

The third pitfall noted was the shifting right of the second row of the generator matrix by one element. This action is done in steps 2 and 6 and entails moving data from one memory location to another. If this step could be avoided many cycles could be saved.

The last pitfall noted was the data organization of the generator matrix with respect to step 5 of the Schur algorithm. To increase the performance of Step 5 (the inner loop), it would be optimal to increase the number of multiplications which could be done with each iteration of the loop. For instance, the new instruction set contains an instruction called PMADDWD which can perform up to four word multiplies and two DWORD adds in three cycles. To perform multiple multiplies, it is most likely that reading more than one element at a time from the generator matrix would be necessary. Therefore, it is optimal to align data accesses (longer than a word) to the generator matrix. For example, assume that the generator matrix is aligned on a quadword boundary. With the first iteration of the main loop, data accesses to the first row of the generator matrix would be from the second element through the last element. If a doubleword or quadword access was used, each access would be unaligned and would incur an unaligned penalty of three cycles.

### 3.2. Creating the Initial Generator Matrix

Creating the initial generator matrix consists of a series of data moves from one memory location to another. As mentioned previously, the performance of this section of code differs depending on if the data is present in the data cache. Assuming the data is not in the data cache, this step is bound by the amount of data accesses generated. Therefore, reducing the number of data accesses will increase the performance of this section of code. To reduce the number of data accesses, it is necessary to read and write as many data elements as possible with each memory access. The new instruction set supplies the quadword move instruction MOVQ. This instruction was used to read four input data elements at one time from the input vector r and to write four data elements at one time to the generator matrix. Example 2 contains the assembly code for initializing the generator matrix.

Example 2. Initializing the Generator Matrix

```initialize_generator_matrix:
movq	mm0, [eax][edx*2]			; get the next quadword
; of rMatrix

movq	[esi][edx*2], mm0			; store it in G0Matrix

movq	[ebx][edx*2], mm0			; store it again in G1Matrix

add	edx, 4					; increment the loop counter
; by 4 elements

cmp	edx, ecx					; is i <= pSize?
jle	initialize_generator_matrix	; if not then exit the loop

```

Note that according to the algorithm, the first element of the generator matrix is initialized to zero. During the analysis of the algorithm, it was found that this first element is never used. Therefore, initializing it to the first element of the input vector is not harmful.

Also note that since quadword data accesses are used, the amount of space allocated to the input vector and to each row of the generator matrix must be divisible by four words. This data restriction was chosen to decrease the amount of checking overhead for this loop.

### 3.3. Calculating the mth Order Reflection Coefficient

The mth order reflection coefficients are calculated by dividing G[m] by G. This integer division can not be avoided since the denominator changes for each iteration of the main loop. It takes 43 clock cycles every time that it is executed. If the denominator value remained constant with each iteration of the outer loop, one divide could have been executed prior to entering the main loop to determine 1/denominator constant. This value could then be used in an integer multiplication using the IMUL instruction and would take 10 cycles to execute. The IMUL instruction would have to be used in this instance since the new instruction set does not supply a doubleword multiply instruction. A floating-point divide could also be executed in place of the fixed-point integer divide but would incur an ~100 clock cycle penalty of switching from floating-point code to the MMX code. Therefore it was not used.

Notice that in the division G is used and not G[m]. In the analysis, we stated that if shifting of the second row of the generator matrix can be avoided many cycles would be saved. This shifting was avoided and will be explained further in the next section.

### 3.4. The Inner Loop: Recalculating the Generator Matrix

With each iteration of the main loop, the generator matrix must be recalculated based on the new K(m). The performance of this inner loop is critical to the overall performance of the code and extra time was spent in optimizing this loop. Below in Example 3 is the C language code which recalculates the generator matrix following the steps of the algorithm. This code will give the user a better idea of how this section of the algorithm is implemented.

Example 3. Recalculating the Generator Matrix (C code)

```
/*********************************************************
* Recalculate the generator matrix entries based on     *
* the new reflection coefficient K[m].                  *
*********************************************************/
for ( i = m ; i <= p; i++)
{
/* perform the matrix multiplication of each
* element
*/
temp1 = ((G[i] * K[i]) + 0x4000) >> 15;
temp2 = ((G[i] * K[i]) + 0x4000) >> 15;

/* perform the addition and store the results
*/
G[i] = G[i] + temp2;
G[i] = G[i] + temp1;
}

/*********************************************************
* Shift the second row over by one element.             *
*********************************************************/
for ( i = p; i > m; i--)
G[i] = G[i-1];
```

One performance gain mentioned in Section 3.1 was to decrease the number of unaligned data accesses to the generator matrix. To accomplish this, the shifting of the second row of the generator matrix was not performed. The algorithm was slightly changed to take this into consideration. By avoiding the shift and ensuring that the second row of the generator matrix is doubleword aligned, the number of unaligned accesses are decreased by half. Figure 2 depicts the change made to the algorithm.

Figure 2. Generator Matrix Usage

 Initial Generator Matrix Setup: G G Calculate K = G / G = A0 / a0 where m = 1 Recalculate the Generator Matrix G G Calculate K = G / G = B1/ a1 where m = 2 Recalculate the Generator Matrix G G Calculate K = G / G = C 2/a 2 where m = 3 Recalculate the Generator Matrix G G Calculate K = G / G = D3/a3 where m = 4 where: m is the main loop counter P is the number of reflection coefficients to solve for G is the first row of the generator matrix G is the second row of the generator matrix
Another performance gain which was mentioned in Section 3.1 was to perform as many multiplications with one instruction as possible. The new MMX instruction set contains two types of multiply instructions: PMADD and PMUL. The data must be analyzed to determine which instruction to use and how many multiplies can be performed with one instruction. Each multiplicand is a Q15 number. The result of each multiply is a Q30 number. The PMULHW could be used to perform four word multiplies but would prevent data rounding prior to the final conversion from Q30 to Q15. The PMADDWD instruction could be used but would only be able to perform two word multiplies. Using the PMADDWD instruction would retain data accuracy and allow for rounding of the result prior to performing the final data conversion from Q30 to Q15. For this implementation, the PMADDWD> instruction was chosen so that the simple error correction technique discussed in Section 2.2 could be applied to minimize the overall error introduced by the use of integer arithmetic. Below in Figure 3 is a representation of the data setup used to perform the multiplications.

Figure 3. Data Setup for the Inner Loop

 Pmaddwd Data Setup * * * * + + Pmaddwd Result Where: X denotes a don't care value G element represents an element from the generator matrix G+1 element represents the next element from the generator matrix K[m] represents the current reflection coefficient

Because of the nature of the algorithm, four multiplications at a minimum must be performed with each iteration of the loop (two for the first row and two for the second). This is necessary because the new values of the first row are dependent on the previous values of the second row. Also, the new values of the second row are dependent on the previous values of the first row. For example, the new value of G[m] is calculated as G[m] + (G * K[m]) and the new value of G is calculated as G + (G[m] * K[m]). Therefore, if two calculations are performed for the first row, it is necessary to perform two calculations for the second row.

The assembly code which was written to implement the recalculation of the generator matrix can be seen in Example 4. This code takes into consideration the previously mentioned optimizations. Assuming all data is in the data cache, this code takes 16 cycles per iteration to execute when accesses to the first row of the generator matrix are aligned. Since four multiplies are performed with each iteration of the inner loop, it takes four cycles to calculate each new generator matrix element. When accesses to the first row of the generator matrix are unaligned, the code takes 19 cycles per iteration and takes

Example 4. The Inner Loop

```_inner_loop:
1.	cmp		edx, eax			; is m > pSize
2.	jg		_inner_loop_done		; if so, jump out of this loop

3.	movd		mm0, [esi][edx*2]		; get G[m] and G[m+1]
4.

5.	movd		mm2, [ecx]			; get G[i] and G[i+1]
6.	punpcklwd	mm0, mm0			; mm0 = G[m+1], G[m+1], G[m],
; 	  G[m]

7.	movq		mm5, mm0			; make a copy of the G data setup
8.	punpcklwd	mm2, mm2			; mm2 = G[i+1], G[i+1], G[i],
;       G[i]

9.	movq		mm6, mm2			; make a copy of the G data setup
10.	pmaddwd	mm5, mm7			; Calculate G[m+1] * K[m],
; G[m] * K[m]

11.	psrad		mm2, 16			; Sign extend G[i+1] and G[i]
12.	pmaddwd	mm6, mm7			; Calculate G[i+1] * K[m],
; G[i] * K[m]

13.							; lose a cycle here since you must
14.							; wait three cycles after a pmaddwd prior
; to using its results

15.	psrad		mm0, 16			; Sign extend G[m+1] and G[m]
; converting from Q30 to Q15

17.	psrad		mm5, 15			; convert from Q30 to Q15
; converting from Q30 to Q15

19.	psrad		mm6, 15			; convert from Q30 to Q15
20.	paddd		mm2, mm5			; Calculate G[i+1] + G[m+1]*K[m]
; and G[i] + G[m]*K[m]

21.	packssw	mm2, mm2			; Convert from double to packed words
22.	paddd		mm0, mm6			; Calculate G[m+1] + G[i+1]*K[m]
; and G[m] + G[i]*K[m]

23.							; lose a cycle here since you must wait
24.							; 1 cycle after modifying a register
; prior to storing it.

25.	movd		[ecx], mm2			; Store the G results
26.	packssdw	mm0, mm0			; Convert from double to packed words

27.	add		ecx, 4			; Increment the G pointer
28.							; This won't pair with the next
; instruction since memory operations
; do not pair with instructions which
; operate on integer registers.

29.	movd		[esi][edx*2], mm0		; Store the G results
30.

31.	add		edx, 2			; Increment the loop counter by 2

32. 	jmp		_inner_loop			; Jump to the start of the inner loop
```

There are two areas in the inner loop where instruction penalties occur. The first penalty occurs on line 16. This instruction uses MM5 as a source. This register contains the result of the PMADDWD> instruction issued on line 10. Since three cycles have not elapsed from the time the PMADDWD instruction was issued to the time the result was used as a source, a one cycle penalty occurs. The second penalty occurs on line 25. This instruction stores the value in MM2 into the location pointed to by the register ECX. The register MM2 was modified in the instruction prior to the store. Since two cycles have not elapsed from the time MM2 was used to the time it was stored to memory, a one cycle penalty occurs.

## 4.0. CODE PERFORMANCE

The overall performance of this version of the Schur Weiner Filter is based on two sections of the code: the calculation of K[m] and the recalculation of the generator matrix. The following discussion assumes all data is present in the data cache and that all data is aligned.

To calculate K[m] and to set up register values prior to re-calculating the generator matrix takes approximately 55 clock cycles. The majority of these cycles can be directly attributed to the integer divide which executes in 43 cycles. As mentioned previously, this integer divide can not be avoided.

There are two cycle count numbers which are important to take into consideration when approximating the number of cycles it takes to recalculate the generator matrix. When all accesses to the first row of the generator matrix are aligned, it takes 16 clock cycles per iteration to recalculate four generator matrix elements. When all accesses to the first row of the generator matrix are unaligned, it takes 19 clock cycles per iteration to recalculate 4 generator matrix elements. Since half of the accesses to the first row of the generator matrix are aligned and half are unaligned, on average it takes 4.375 cycles to recalculate each element.

Applying these cycle count numbers to an application which calculates pSize reflection coefficients using an input vector of size pSize+1, an approximate overall performance can be obtained as follows: For the C language version of the Schur algorithm, refer to Appendix A. For the complete assembly version of the Schur algorithm refer to Appendix B. The assembly version is reentrant but the user may wish to dispose of the local arrays and pass them in as pointers to a heap space location if stack usage is an issue.

## APPENDIX A: "C' Version of the Schur Algorithm

```
* Description:

schur_int is the C language version of the Schur
algorithm used to calculate the

reflection coefficients of a given  set of linear
equations. The input matrix

is assumed to be a set of 15-bit signed fixed-point
short integers. The resultant

reflection coefficients will also be 15-bit signed
fixed-point short integers.

* Input:

r short int *  the input matrix representing the
set of linear

equations of which to find the reflection

coefficients. Each array element must be a

Q15 short integer.

K short int *  the output matrix containing the
resultant

reflection coefficients represented as Q15

short integers.

p int  the number of reflection coefficients to

solve for (typically 10 or 16).

void Schur (short *r, short *K, int  p )

{   // begin Schur()

short i,     // generator matrix index

m;    // order counter

short G;   // generator matrix; the number

// of columns (16) was determined

// by the input size.

// initialize the generator matrix

for (i = 0; i < p; i++)

{   // begin for loop

G[i] = r[i+1];

G[i] = r[i];

}   // end for loop

// for each order,  calculate the new reflection
coefficient

// and the new generator matrix for the next order

for (m = 1; m <= p; m++)

{    // begin for loop

// calculate k[m].  Since the result should be
a Q15 number,  the

// division should be between a Q30 and a Q15 number.
Therefore

// convert the numerator to a Q30 number prior
to performing the

// divide.

K[m] = -(short)(((long)G[m-1] << 15) /
((long)G[m-1]));

// calculate the new generator matrix.  Error correction
techniques

// must be used during this section.  When multiplying
2 Q15 numbers

// the result will become a Q30 number.  This number
must then be

// converted back to a Q15 number.  To performing
the rounding

// error correction technique,  add 0x4000 to the
Q30 number prior

// to shifting it right by 15 bits.

for (i = p-1; i >= m; i--)

{   // begin for loop

G[i] = (short)((((long)G[i] << 15)
+ (G[i] * K[m]) + 0x4000)

>> 15);

G[i] = (short)((((long)G[i-1] << 15)
+ (G[i-1] * K[m]) +

0x4000) >> 15);

}   // end for loop

}   // end for loop

}   // end schur_int()```

## APPENDIX B: The MMX™ Technology Version of the Levinson-Durbin Algorithm

```
;* Description:

;* The purpose of this file is to provide the MMX
code for the  schur algorithm as

;* an instructional example to those who are just
beginning to code using the new

;* instructions.

;*

;* Assumptions:

;* 1.     The set of normal equations given are stable

;* 2.     The normal equations are represented by
one matrix 'r' which

;*         contains the coefficients gamma(0) through
gamma(p) given as

;*         15-bit signed fixed-point short integers
(Q15).

;* 3.     The number of short integers allocated
for matrix 'r' is

;*         divisible by 4.  Used elements are initialized
to 0.

;* 4.     The resultant reflection coefficients will
be returned as

;*         15-bit signed fixed-point short integers
(Q15).

TITLE Schur

.486P

.model FLAT

;****************************************************************************

;*     TEXT SEGMENT

;****************************************************************************

_TEXT SEGMENT

;

; Declare schur as a public routine to allow the
'C' code to

; call it.

;

PUBLIC schur

;* Description:

;* schur is the optimized  MMX technology  version
of the Schur

;* algorithm.  It is used to calculate the reflection
coefficients of a

;* given set of  normal equations. The steps of the
algorithm are as

;* follows:

;*  Step 1:  Initialize the generator matrix to

;*    G0 =    0,    r(1), r(2), r(3), .... r(p)

;*    G1 =    r(0), r(1), r(2), r(3), .... r(p)

;*  Step 2:  Shift the elements in G1 by 1 position
to obtain:

;*    G0 =    0,    r(1), r(2), r(3), .... r(p)

;*    G1 =    0,    r(0), r(1), r(2), .... r(p-1)

;*  Step 3:  Initialize m to 1

;*  Step 4:  Calculate K[m] as follows

;*    K[m] = - G0(m) / G1(m)

;*  Step 5:  Multiply the generator matrix by the
matrix V where

;*    V is defined as:

;*    V = 1      K(m)

;*        K(m)   1

;*  Step 6:  Shift the elements in G1 by 1 position

;*  Step 7:  Repeat steps 4 through 6 for each reflection

;*    coefficient that needs to be calculated

;*

;* Inputs:

;* rMatrix short int *  a pointer to the first element

;*    of the input 'r' matrix

;* kMatrix short int *  a pointer to the output

;*    reflection coefficients array

;* pSize short int  the number of reflection coefficients

;*    to solve for (typically 10 or 16).

schur PROC C USES ebx ecx edx esi,

rMatrix:PTR WORD,

kMatrix:PTR WORD,

pSize:DWORD

;

; Declare local variables

;  1. G0Matrix     - row 1 of the generator matrix

;  2. G1Matrix     - row 2 of the generator matrix

;  3. mSave        - used to store the loop counter
'm'

;  4. round_factor - contains the rounding factor
0x4000

;

LOCAL G0Matrix:WORD

LOCAL G1Matrix:WORD

LOCAL mSave:DWORD

LOCAL round_factor:DWORD

;

; Initialize local variables

mov  round_factor, 04000H

mov  round_factor + 4, 04000H

;

; Setup registers in preparation to initialize the

; generator matrix G0 and G1

;

mov  eax, rMatrix   ; get the address of rMatrix

lea  esi, G0Matrix  ; get the address of G0Matrix

mov  ecx, pSize   ; get pSize

mov  edx, 0

lea  ebx, G1Matrix  ; get the address of G1Matrix

;

; Initialize the generator matrix G0 and G1

;

initialize_generator_matrix:

movq  mm0, [eax][edx*2]  ; get the next quadword

; of rMatrix

movq  [esi][edx*2], mm0  ; store it in G0Matrix

movq  [ebx][edx*2], mm0  ; store it again in G1Matrix

add  edx, 4   ; increment the loop counter

; by 4 elements

cmp  edx, ecx   ; is i <= pSize?

jle  initialize_generator_matrix ; if not then exit
the loop

;

; if (m > pSize) then we are done!

;

mov  edx, 1   ; initialize m

cmp  edx, ecx   ; is m > pSize?

_schur_main_loop:

;

; This is the main loop.

; Setup the numerator and denominator values for
the calculation of K[m].

; The numerator must be converted to a Q30 number
and the denominator will

; remain a Q15 number so that when the division is
completed the result will

; be a Q15 number.

;

mov  ax, [esi][edx*2]  ; get the numerator value

jg  _schur_main_loop_done ; exit the main loop if
m > pSize

shl  eax, 16   ; convert from Q15 to Q30

lea  ecx, G1Matrix  ; get the address of G1Matrix

sar  eax, 1   ; finish the numerator conversion

mov  ebx, [ecx]   ; get the denominator value

shl  ebx, 16   ; sign extend the denominator

mov  mSave, edx  ; save m

sar  ebx, 16   ; finish the sign extension

mov  edx, eax   ; copy the numerator to edx

sar  edx, 31   ; fill edx with the sign extension

mov  ecx, 0

idiv  ebx   ; calculate K[m]

mov  ebx, kMatrix  ; get the address of kMatrix

sub  ecx, eax   ; negate K[m]

mov  edx, mSave  ; get the loop counter 'm'

pxor  mm4, mm4  ; clear out mm4

movd  mm7, ecx   ; setup mm7 with 0, K[m], 0, K[m]

movq  mm1, round_factor  ; load the rounding factor
into mm1

punpcklwd mm7, mm4  ; continue setting up mm7

mov  [ebx][edx*2], ecx  ; store K[m] in kMatrix

lea  ecx, G1Matrix  ; get the address of G1Matrix

mov  eax, pSize   ; get the value of pSize

punpckldq  mm7, mm7  ; finish setting up mm7

;

; This is the inner loop.

; Recalculate the generator matrix using the following
algorithm

; for (i = 0; m <= pSize; i = i+2, m = m+2)

;  {

;  temp1 = (short int)((G0(m) * K(m) + rounding_factor)
>> 15);

;  temp2 = (short int)((G0(m+1) * K(m) + rounding_factor)
>> 15);

;  temp3 = (short int)((G1(i) * K(m) + rounding_factor)
>> 15);

;  temp4 = (short int)((G1(i+1) * K(m) + rounding_factor)
>> 15);

;  G0(m) = G0(m) + temp3;

;  G0(m+1) = G0(m+1) + temp4;

;  G1(i) = G1(i) + temp1;

;  G1(i+1) = G1(i+1) + temp2;

;  }

;

_inner_loop:

cmp  edx, eax   ; is m > pSize

jg  _inner_loop_done  ; if so, jump out of this
loop

movd  mm0, [esi][edx*2]  ; get G0(m) and G0(m+1)

movd  mm2, [ecx]  ; get G1(i) and G1(i+1)

punpcklwd mm0, mm0  ; mm0 = G0(m+1), G0(m+1), G0(m),
G0(m)

punpcklwd mm2, mm2  ; mm2 = G1(i+1), G1(i+1), G1(i),
G1(i)

movq  mm5, mm0  ; make a copy of the G0 data setup

movq  mm6, mm2  ; make a copy of the G1 data setup

pmaddwd  mm5, mm7  ; Calculate G0(m+1) * K(m), G0(m)
* K(m)

pmaddwd  mm6, mm7  ; Calculate G1(i+1) * K(m), G1(i)
* K(m)

psrad  mm2, 16   ; Sign extend G1(i+1) and G1(i)
to DWORDs

psrad  mm0, 16   ; Sign extend G0(m+1) and G0(m)
to DWORDs

* K(m)

; and G0(m) * K(m)

psrad  mm5, 15   ; Convert G0(m+1) * K(m), G0(m)
* K(m) to Q15

* K(m)

; and G1(i) * K(m)

psrad  mm6, 15   ; Convert G1(i+1) * K(m), G1(i)
* K(m) to Q15

paddd  mm2, mm5  ; Calculate G1(i+1) + G0(m+1) *
K(m) and

; G1(i) + G0(m) * K(m)

packssdw  mm2, mm2  ; Convert the two DWORD word
results to packed words

paddd  mm0, mm6  ; Calculate G0(m+1) + G1(i+1) *
K(m) and

; G0(m) + G1(i) * K(m)

movd  [ecx], mm2  ; Store the G1 results

packssdw  mm0, mm0  ; Convert the two DWORD word
results to packed words

movd  [esi][edx*2], mm0  ; Store the G0 results

add  edx, 2   ; Increment m by 2

loop

;

; Increment the main loop counter m and compare it
against pSize.

;

_inner_loop_done:

mov  edx, mSave  ; Get the current value of m

inc  edx   ; Increment m

mov  ecx, pSize   ; Get the value of pSize

cmp  edx, ecx   ; is m > pSize?

main loop

;

; We are done so it's time to return!

;

_schur_main_loop_done:

emms     ; flush floating-point stack 