|
Disclaimer Copyright © Intel Corporation (1996). Third-party brands and names are the property of their respective owners.
|
U and V are subsampled 2:1 in both vertical and horizontal directions. As a result, every U and V values are used for 4 Y values and generate 4 RGB pixels. The diagram shows that the number of bytes in the RGB buffer is the same as for the Y buffer. This is only true for RGB8. For RGB16, the number of bytes is twice as much, and for RGB24 it is 3 times as much.
3. Functional Description
For each 2x2 block of RGB pixels, 4 Y bytes
1 U and 1V byte are needed as shown in Figure 1.
The input and output signals for Y, U, V fall within this range:
16 Y 235
16 u,v 240
Figure 1- color conversion scheme
Conversion is performed according to the following:
G
= 1.164 (Y-16) - 0.391(u-128) - 0.813(v-128)
R
= 1.164 (Y-16) + 1.596(v-128)
B
= 1.164 (Y-16)
+2.018(u-128)
The ranges of R,G,B values can be obtained by substituting the Y,U,V limits into the above equations, as follows:
-179 = 0 -179 < R
< 255 + 179 = 433
-135 = 0 -135 < G
< 255 + 135 = 390
-227 = 0 -227 < B
< 255 + 227 = 365
Once the R,G,B values are calculated, result
should be translated to their final range. For example, in the
case of RGB24 format, each output pixel is represented
by 24 bits; each color component is represented by one byte.
Therefore, each of the R,G,B values must be clamped to
within 0..255. The ranges for RGB above shows signed values,
which means that the all calculations should use signed arithmetic.
On the other hand, the final legal ranges of RGB is 0.255, which
requires that the saturation uses unsigned arithmetics. For RGB16,
the output range is further reduced to fit the RGB values in 16bits.
This is done by dropping some of the least significant bits of
each color.
4. Color Converter Interface
Each Color Converter Kernel (CCK) receives as input three planes: Y, U, V, a Y pitch, and UV pitch (U,V pitches are always the same). It also receives a pointer to the output buffer and its pitch (CCOPitch). In addition, it receives an aspect ratio adjustment count, which enables adjustment of the destination height to fit a specific aspect ratio of the display device.
5. Choosing
Algorithm for Color Conversion To RGB24
Without Zooming
Three different implementations of the YUV12 to
RGB24 algorithm using the MMX technology will be
discussed in this section.
The first implementation of the algorithm utilizes the maximum parallelism offered by MMX Technology. It performs byte operations on 8 pixels at a time. This method uses pre-calculated tables and should yield the best throughput of the methods described here. However, since the temporary results during calculations may be larger than 8 bits, the YUV impact data is scaled down before calculations are made. This results in loss of precession of the final RGB data. However, this loss of data is not recognized by the naked eye and is very well acceptable.
The second method also uses lookup tables. It obtains precise final results by using MMX Technology to operate on words. This method has its own drawbacks, since only 4 pixels can be calculated at a time (compared to 8 in the first method). Moreover, the final word values have to be packed to byte format before storing it to the output buffer. Finally, the lookup tables doubles in size yielding worse cache locality.
The third approach uses direct calculations instead of lookup tables. This approach could be a good alternative to the first because it does not use lookup tables and thus has better cache behavior. Another advantage is realized because memory writes to the graphics card are uncached and slow which gives the CPU enough time to perform the required calculations. On the other hand, this method requires word arithmetic which reduces the amount of parallelism in half, and requires repacking the final results to byte format. Nonetheless, measurements show that this method can be as fast as the first approach.
6. YUV12
to RGB24 Conversion Using Lookup Tables(first
method)
The YUV12 to RGB color conversion formulas could
be represented as follows:
R
= Y_impact[Y] + + VR_impact[v]
G
= Y_impact[Y] + UG_impact[u] + VG_impact[v]
B
= Y_impact[Y] + UB_impact[u]
where (values from section 3):
Y_impact(Y) = 1.164(Y-16),
VR_impact(V) = 1.596(V-128)
VG_impact(V) = -0.813(V-128)
UG_impact(U) = -0.391(U-128)
UB_impact(U) = 2.018(U-128)
As mentioned above, for byte calculations, the Y,U,V impact data have to be scaled down so that the results doe not exceed the data range. Using the scale factor 1/4, ranges of U,V impact can be reduced to -64..64, and Y impact can be reduced to 0..64. Adding the impacts together gives an R,G,B values between -64..128.
To clamp negative R,G,B values to 0, a constant 64 could be included in the Y impact tables which puts yields a range between 0..196. As a result, all calculation could be unsigned byte operations, which is a perfect fit for the MMX technology.
Figure 2. illustrates a block diagram of this algorithm.
Figure 2- Conversion scheme YUV12 RGB24 using look up tables.
6.1 Extracting Y,U and
V Impacts From Lookup Tables
The inner loop of the algorithm generates a 2x4 block
of RGB pixels. It processes two lines at a time, since the impact
of the U and V components is the same for two consecutive
lines. Twelve bytes are generated for four RGB24 pixels. Thus
three dwords are written to the output buffer.
Figure 3- obtaining u impact on four RGB points.
As shown in Figure 3, the first U input byte is used to reference the U_impact table for the first 2 RGB pixels. The second U input byte is used to reference the U_impact table for the next 2 RGB pixels. The UV impact will be used for two consecutive lines.
The Y impact is calculated for each line. To get Y impact on even-numbered lines (Ye..) four Y impact values are combined together as follows:
Figure 4- obtaining Y impact.
The Y impact for odd-numbered lines is calculated in the same manner.
Adding the Y lines to the U,V-impact, and continuing to perform operations as illustrated in Figure 2, the final R,G,B results are generated as follows:
Optimized implementation of this algorithm is found
in Appendix 4.
6.2 Aspect
Ratio Calculation
The Aspect ratio parameter allows for adjustment of picture aspect ratio (width/height). The algorithm only allows for reduction in height of picture by dropping certain lines when generating the output. For example if the aspect ratio is 12 , each 12th line is be dropped. Two solutions were considered. In the first one, each output line is processed separately and if the line number is a multiple of the aspect ratio, the line is dropped. The drawback of this solution is that the UV impacts, which are common for two consecutive lines, are either calculated twice, or stored in a temporary buffer. Both of them increase the amount of accessing required when no line is dropped, which is most of the cases.
The second solution always processes two lines at time. A line is skipped by writing the second calculated line over the first line. Thus, the amount of work is the same as if no lines are dropped at all. Therefore, the benefit of this method comes from the fact that U,V calculation is only done once.
6.3 Size
of Lookup Tables
All tables contain 256 elements. The Y table
contains dword entries, which yields 1K tables size. Each
U, V table has qword entries, which yields 2K table
each. Therefore, the total Y,U,V table size is 5K.
In the U,V tables, the RGB values in locations 0,1,2 are the same as the values in locations 3,4,5 respectively. This is due to the fact that U,V impacts two consecutive pixels. The U,V table sizes could be reduced by half eliminating the duplication. This could be done using shifts at run time to generate the proper format. However, this costs more CPU cycles.
To position the Y impacts in the right places, a shift instruction can be used. It is possible to use four tables for Y, and store shifted value in them. However, such tables will consume more memory, which could add additional pressure on the data cache.
In this algorithm each output point is enlarged into 22 block. So now U and V values impact a 4x4 block, and Y values impact a 2x2 block. This algorithm was implemented using direct calculations of RGB values, and it uses the same ideas like RGB16 zoom by two.
Implementation of this algorithm can be found in Appendix 5.
8. YUV12
to RGB16 Conversion Using Lookup Tables
In the RGB16 color format, every pixel is represented by 16-bit color components. Different graphics cards assign different number of bits for each of the R,G and B components, as follows:
x555 [ignore high order bit, then R,G,B where B
is low]
655 [ R=6(high), G=5, B=5(low) ]
565 [ R=5(high), G=6, B=5(low) ]
664 [ R=6(high), G=6, B=4(low) ]
For example in x555 allocation, 5 bits are used to encode each color.
The first stage of YUV12 to RGB16 conversion is identical to YUV12 to RGB24 conversion. There is an additional step which decimates the RGB24 color components and packs them into the appropriate 16 bit format.
Implementation of this algorithm can be found in Appendix 5.
9. YUV12
to RGB8 (CLUT8) Conversion Using Lookup Tables
9.1 Algorithm
Description
RGB8 format represents each color in 8 bits, yielding a total of 256 colors. The contents of the 8 bits is an index into a Color Lookup Table known as a color palette. Graphics adapters are programmed with this palette either by the operating system or by the application. The operating system reserves the first 10 and last 10 entries of the palette for system usage. The rest of the entries are used by the active application.
(In 256 color mode this picture may look wrong. Use 16 or 24 bit color mode to see this picture properly)
The palette used for this implementation of RGB8 color is divided into 9 zones each with 26 gradients of the same color. U and V impacts are used to determine which color zone they represent, and the Y impact determines the intensity of the color in that zone. Definition of the palette may can be found in Appendix 1.
The Y,U,V impacts are calculated according to the following equations:
Table 1 - Color Conversion Rules for RGB8 CCK
In addition, a noise pattern is added to the input Y,U,V values to give the picture a smooth look. The noise pattern is shown in Table 2. This extra processing consumes more precious cycles of the CPU, especially since that U and V impacts are different on different lines and thus must be calculated separately.
V-noise:
| Line 1 | 10h | 8 | 18h | 0 |
| Line 2 | 18h | 0 | 10h | 8 |
| Line 3 | 8 | 10h | 0 | 18h |
| Line 4 | 0 | 18h | 8 | 10h |
U-noise:
| Line 1 | 8 | 10h | 0 | 18h |
| Line 2 | 0 | 18h | 8 | 10h |
| Line 3 | 10h | 8 | 18h | 0 |
| Line 4 | 18h | 0 | 10h | 8 |
Y-noise:
| Line 1 | 4 | 2 | 6 | 0 |
| Line 2 | 6 | 0 | 4 | 2 |
| Line 3 | 2 | 4 | 0 | 6 |
| Line 4 | 0 | 6 | 2 | 4 |
Table 2 - Noise Matrixes for RGB8 CCK.
Since the noise values are added to the input Y,U,V data, the color conversion rules are different for every pixel in the 44 matrix. For example, consider the first pixel in Line 1. With the noise values added to it, a new color conversion table is derived as follows:
Table 3 - Color Conversion Rules for RGB8 CCK.
9.2 Calculating
UV Impact
This implementation performs color conversion of
8 consecutive pixels at a time, as shown in Figure 5. To calculate
the U impact, the algorithm loads 4 U bytes and
duplicates them across the 8 bytes (since every U value
impacts 2 neighboring pixels). The result is compared against
the pre-calculated constants, U_low_b & U_high_b. Note
that IA MMXTM Technology instructions compare only
signed numbers; therefore, arguments should be converted to sign
range. U_low_b & U_high_b are pre-calculated, such
that the only needed conversion so only one conversion is needed
at run time, for all of the 8 U bytes . The instruction psubb
mm0, convert_to_sign does this conversion.
Figure 5 illustrates a block diagram of this algorithm.
Figure 5 - Calculating u Impact for RGB8 CCK.
The constants U_low_b and U_high_b are the comparison values in Table 3; calculated for every pixel in the 4x4 matrix. Notice that these values include the noise effect introduced in Table 2 and are already converted to signed values.
U_low_b:
ebf3e3fbebf3e3fbh =6c74647c6c74647c - 8080808080808080
e3fbebf3e3fbebf3h =647c6c74647c6c74 - 8080808080808080
f3ebfbe3f3ebfbe3h =746c7c64746c7c64 - 8080808080808080
fbe3f3ebfbe3f3ebh =7c64746c7c64746c - 8080808080808080
These values are derived in a similar method as shown in Table 3. For the first, the value 6c74647c6c74647c is equal to the limit value 6464646464646464 added to the noise pattern at that line 0810001808100018. All constants are converted to signed numbers by subtracting 8080808080808080
The result of the comparison is 00h for any byte below the compared corresponding limit, and FFh for every byte greater or equal to the corresponding limit. The result of the comparison is anded with the value 4e4e4e4e4e4e4e4eh, producing an intermediate result of U impact.
The comparison of the upper limit is done in a similar fashion and its result is added to the lower limit impact, yielding the total impact of U.
A different method is used to calculate the Y impact. The input Y value is first saturated on the lower end by subtracting Y_low_b, which is the lower limit including the effect of the noise, as shown in Table 3. The result is then divided by 8 and clipped to the upper limit by adding saturate_to_Yhigh. Finally, the result is brought back to the mid-range by subtracting the saturate_to_Ylow.
Notice that the saturate_to_Ylow also includes the offset 10, representing the first 10 reserved system colors.
Constant Y_low_b is different for every four consecutive lines. Subtracting Y_low_b is equivalent to adding noise value (0402060004020600 for first line) and subtracting 1bh,which is a lower limit for Y
Y_low_b
1719151b1719151b = 1b1b1b1b1b1b1b1b - 0402060004020600
19171b1519171b15 = 1b1b1b1b1b1b1b1b - 0204000602040006
151b1719151b1719 = 1b1b1b1b1b1b1b1b - 0600040206000402
1b1519171b151917 = 1b1b1b1b1b1b1b1b - 0006020400060204
Adding saturate_to_Y_high constant, converts all values above 19h to FFh, which puts it in the range E6h..FFh. Subtracting return_from_Y_high constant, brings all values to the range Ah..23h, which is Y range 0..19h plus Ah. The constant Ah is added to the result; this is the first 10 reserved colors by the operating system.
saturate_to_Y_high = e6e6e6e6e6e6e6e6 ; e6=
ff-19
return_from_Y_high = dcdcdcdcdcdcdcdc ; ff-19-a
Implementation of this algorithm can be found in Appendix 2.
10. Converting
to RGB8,
Zoom by 2
In this algorithm each output point is duplicated into a 22 block. Therefore, each U and V value impacts a 4x4 block, and each Y value impacts a 2x2 block. Before starting to calculate Y,U,V impacts from Table 1, the noise values are added (using matrixes from Table 2). To calculate U (and V) impact on the first line of the 44 block, use the technique shown in Figure 5 (one more punpack for duplicating u points should be added). U, V impacts are added together, giving a UV impact for 4 pixels in the block. Since the noise values are the same, but in different byte locations in the 4x4 matrix, the rest of the UV impacts for the following three lines could be calculated by shuffling these values accordingly. For example if UV impact on first line is:
UV3 UV2 UV1 UV0
then the rest of lines are:
UV1 UV0 UV3 UV2
- second line
UV2 UV3 UV0 UV1
- third line
UV0 UV1 UV2 UV3
- fourth line
The rest of the algorithm is similar to the non-zoomed algorithm.
Two algorithms were implemented for YUV12 to RGB8 color conversion..
The First algorithm has two sequential loops. The first loop calculates the common UV impacts on four lines. The results are stored in a temporary buffer. The second loop calculates the . The second loop calculates the Y impact and combines them with the pre-calculated UV impacts to calculate the RGB pixel values. Each iteration of the second loop yields a 4x16 block of RGB pixels. This algorithm was found to be slow compared to the second algorithm (below), because of the nature of its calculations. The algorithm performs calculations of RGB pixels, and then writes them out to the graphics card. Due to the slow bandwidth of the graphics card compared to the CPU, the CPU write buffers were almost always full, causing a slow down in performance.
The second algorithm is based on interleaving the writes to the graphics card with RGB calculations. This algorithm is composed of one loop that calculates the Y,U,V impactsand combines them to generate the RGB values. As a result of the extra calculations of U,V impacts inside the loop, the size of the loop is increased, thus spreading the writes to the graphics card between calculations. The change in code structure resulted in a 1.3x speedup.
Implementation of the second algorithm can be found in Appendix 3.
11. Assumptions
For optimal performance, the algorithms assume that
the output buffer is aligned on qword (8 byte) boundary.
If it is aligned on 4 byte boundary, 4 bytes from the previous
iteration and 4 bytes from the current iteration should be packed
into qword. Then, write the 8 bytes to a qword aligned
address. Qword writes are almost twice as fast as dword
writes.
The code sample found in Appendix 5 are optimized for the Pentium® processor. The code samples for YUV to RGB24 converter with lookup tables is also optimized to avoid partial stalls on the Pentium Pro® processor.
12. Appendix
1. Definition
of palette ( used for color space conversion to RGB8 ).
As mentioned before, the first and last 10 colors are reserved by the operating system. Therefore, the first entry in the table corresponds to the 10th entry in the palette table. There are three values for each entry, corresponding to blue, green and red consecutively.
unsigned char
PalTable[26*9*3] = {
0, 39+ 15, 0,
0, 39+ 24, 0,
0, 39+ 33, 0,
0, 39+ 42, 0,
-44+ 51, 39+ 51, 0,
-44+ 60, 39+ 60, -55+ 60,
-44+ 69, 39+ 69, -55+ 69,
-44+ 78, 39+ 78, -55+ 78,
-44+ 87, 39+ 87, -55+ 87,
-44+ 96, 39+ 96, -55+ 96,
-44+105, 39+105, -55+105,
-44+114, 39+114, -55+114,
-44+123, 39+123, -55+123,
-44+132, 39+132, -55+132,
-44+141, 39+141, -55+141,
-44+150, 39+150, -55+150,
-44+159, 39+159, -55+159,
-44+168, 39+168, -55+168,
-44+177, 39+177, -55+177,
-44+186, 39+186, -55+186,
-44+195, 39+195, -55+195,
-44+204, 39+204, -55+204,
-44+213, 39+213, -55+213,
-44+222, 255, -55+222,
-44+231, 255, -55+231,
-44+240, 255, -55+240,
0, 26+ 15, 0+ 15,
0, 26+ 24, 0+ 24,
0, 26+ 33, 0+ 33,
0, 26+ 42, 0+ 42,
-44+ 51, 26+ 51, 0+ 51,
-44+ 60, 26+ 60, 0+ 60,
-44+ 69, 26+ 69, 0+ 69,
-44+ 78, 26+ 78, 0+ 78,
-44+ 87, 26+ 87, 0+ 87,
-44+ 96, 26+ 96, 0+ 96,
-44+105, 26+105, 0+105,
-44+114, 26+114, 0+114,
-44+123, 26+123, 0+123,
-44+132, 26+132, 0+132,
-44+141, 26+141, 0+141,
-44+150, 26+150, 0+150,
-44+159, 26+159, 0+159,
-44+168, 26+168, 0+168,
-44+177, 26+177, 0+177,
-44+186, 26+186, 0+186,
-44+195, 26+195, 0+195,
-44+204, 26+204, 0+204,
-44+213, 26+213, 0+213,
-44+222, 26+222, 0+222,
-44+231, 255, 0+231,
-44+240, 255, 0+240,
0, 14+ 15, 55+ 15,
0, 14+ 24, 55+ 24,
0, 14+ 33, 55+ 33,
0, 14+ 42, 55+ 42,
-44+ 51, 14+ 51, 55+ 51,
-44+ 60, 14+ 60, 55+ 60,
-44+ 69, 14+ 69, 55+ 69,
-44+ 78, 14+ 78, 55+ 78,
-44+ 87, 14+ 87, 55+ 87,
-44+ 96, 14+ 96, 55+ 96,
-44+105, 14+105, 55+105,
-44+114, 14+114, 55+114,
-44+123, 14+123, 55+123,
-44+132, 14+132, 55+132,
-44+141, 14+141, 55+141,
-44+150, 14+150, 55+150,
-44+159, 14+159, 55+159,
-44+168, 14+168, 55+168,
-44+177, 14+177, 55+177,
-44+186, 14+186, 55+186,
-44+195, 14+195, 55+195,
-44+204, 14+204, 255,
-44+213, 14+213, 255,
-44+222, 255, 255,
-44+231, 255, 255,
-44+240, 255, 255,
0+ 15, 13+ 15, 0,
0+ 24, 13+ 24, 0,
0+ 33, 13+ 33, 0,
0+ 42, 13+ 42, 0,
0+ 51, 13+ 51, 0,
0+ 60, 13+ 60, -55+ 60,
0+ 69, 13+ 69, -55+ 69,
0+ 78, 13+ 78, -55+ 78,
0+ 87, 13+ 87, -55+ 87,
0+ 96, 13+ 96, -55+ 96,
0+105, 13+105, -55+105,
0+114, 13+114, -55+114,
0+123, 13+123, -55+123,
0+132, 13+132, -55+132,
0+141, 13+141, -55+141,
0+150, 13+150, -55+150,
0+159, 13+159, -55+159,
0+168, 13+168, -55+168,
0+177, 13+177, -55+177,
0+186, 13+186, -55+186,
0+195, 13+195, -55+195,
0+204, 13+204, -55+204,
0+213, 13+213, -55+213,
0+222, 13+222, -55+222,
0+231, 13+231, -55+231,
0+240, 13+242, -55+240,
0+ 15, 0+ 15, 0+ 15,
0+ 24, 0+ 24, 0+ 24,
0+ 33, 0+ 33, 0+ 33,
0+ 42, 0+ 42, 0+ 42,
0+ 51, 0+ 51, 0+ 51,
0+ 60, 0+ 60, 0+ 60,
0+ 69, 0+ 69, 0+ 69,
0+ 78, 0+ 78, 0+ 78,
0+ 87, 0+ 87, 0+ 87,
0+ 96, 0+ 96, 0+ 96,
0+105, 0+105, 0+105,
0+114, 0+114, 0+114,
0+123, 0+123, 0+123,
0+132, 0+132, 0+132,
0+141, 0+141, 0+141,
0+150, 0+150, 0+150,
0+159, 0+159, 0+159,
0+168, 0+168, 0+168,
0+177, 0+177, 0+177,
0+186, 0+186, 0+186,
0+195, 0+195, 0+195,
0+204, 0+204, 0+204,
0+213, 0+213, 0+213,
0+222, 0+222, 0+222,
0+231, 0+231, 0+231,
0+240, 0+240, 0+240,
0+ 15, -13+ 15, 55+ 15,
0+ 24, -13+ 24, 55+ 24,
0+ 33, -13+ 33, 55+ 33,
0+ 42, -13+ 42, 55+ 42,
0+ 51, -13+ 51, 55+ 51,
0+ 60, -13+ 60, 55+ 60,
0+ 69, -13+ 69, 55+ 69,
0+ 78, -13+ 78, 55+ 78,
0+ 87, -13+ 87, 55+ 87,
0+ 96, -13+ 96, 55+ 96,
0+105, -13+105, 55+105,
0+114, -13+114, 55+114,
0+123, -13+123, 55+123,
0+132, -13+132, 55+132,
0+141, -13+141, 55+141,
0+150, -13+150, 55+150,
0+159, -13+159, 55+159,
0+168, -13+168, 55+168,
0+177, -13+177, 55+177,
0+186, -13+186, 55+186,
0+195, -13+195, 55+195,
0+204, -13+204, 255,
0+213, -13+213, 255,
0+222, -13+222, 255,
0+231, -13+231, 255,
0+240, -13+240, 255,
44+ 15, -14+ 15, 0,
44+ 24, -14+ 24, 0,
44+ 33, -14+ 33, 0,
44+ 42, -14+ 42, 0,
44+ 51, -14+ 51, 0,
44+ 60, -14+ 60, -55+ 60,
44+ 69, -14+ 69, -55+ 69,
44+ 78, -14+ 78, -55+ 78,
44+ 87, -14+ 87, -55+ 87,
44+ 96, -14+ 96, -55+ 96,
44+105, -14+105, -55+105,
44+114, -14+114, -55+114,
44+123, -14+123, -55+123,
44+132, -14+132, -55+132,
44+141, -14+141, -55+141,
44+150, -14+150, -55+150,
44+159, -14+159, -55+159,
44+168, -14+168, -55+168,
44+177, -14+177, -55+177,
44+186, -14+186, -55+186,
44+195, -14+195, -55+195,
44+204, -14+204, -55+204,
255, -14+213, -55+213,
255, -14+222, -55+222,
255, -14+231, -55+231,
255, -14+242, -55+240,
44+ 15, 0, 0+ 15,
44+ 24, 0, 0+ 24,
44+ 33, -26+ 33, 0+ 33,
44+ 42, -26+ 42, 0+ 42,
44+ 51, -26+ 51, 0+ 51,
44+ 60, -26+ 60, 0+ 60,
44+ 69, -26+ 69, 0+ 69,
44+ 78, -26+ 78, 0+ 78,
44+ 87, -26+ 87, 0+ 87,
44+ 96, -26+ 96, 0+ 96,
44+105, -26+105, 0+105,
44+114, -26+114, 0+114,
44+123, -26+123, 0+123,
44+132, -26+132, 0+132,
44+141, -26+141, 0+141,
44+150, -26+150, 0+150,
44+159, -26+159, 0+159,
44+168, -26+168, 0+168,
44+177, -26+177, 0+177,
44+186, -26+186, 0+186,
44+195, -26+195, 0+195,
44+204, -26+204, 0+204,
255, -26+213, 0+213,
255, -26+222, 0+222,
255, -26+231, 0+231,
255, -26+240, 0+240,
44+ 15, 0, 55+ 15,
44+ 24, 0, 55+ 24,
44+ 33, 0, 55+ 33,
44+ 42, -39+ 42, 55+ 42,
44+ 51, -39+ 51, 55+ 51,
44+ 60, -39+ 60, 55+ 60,
44+ 69, -39+ 69, 55+ 69,
44+ 78, -39+ 78, 55+ 78,
44+ 87, -39+ 87, 55+ 87,
44+ 96, -39+ 96, 55+ 96,
44+105, -39+105, 55+105,
44+114, -39+114, 55+114,
44+123, -39+123, 55+123,
44+132, -39+132, 55+132,
44+141, -39+141, 55+141,
44+150, -39+150, 55+150,
44+159, -39+159, 55+159,
44+168, -39+168, 55+168,
44+177, -39+177, 55+177,
44+186, -39+186, 55+186,
44+195, -39+195, 55+195,
44+204, -39+204, 255,
255, -39+213, 255,
255, -39+222, 255,
255, -39+231, 255,
255, -39+240, 255,
};
tmpV3_U1low_bound[esp] -
constants for odd line
tmpV3_U1high_bound[esp]
tmpU3_V1low_bound[esp]
tmpU3_V1high_bound[esp]
tmpV2_U0low_bound[esp] -
constants for even line
tmpV2_U0high_bound[esp]
tmpU2_V0low_bound[esp]
tmpU2_V0high_bound[esp]
tmpY0_low[esp]
- Constants for Y values
tmpY1_low[esp]
;-------------------------------------------------------------------------
;
; cxm1281 -- This function performs YUV12 to CLUT8 color conversion for H26x.
; It dithers among 9 chroma points and 26 luma points, mapping the
; 8 bit luma pels into the 26 luma points by clamping the ends and
; stepping the luma by 8.
;
; Color convertor is not destructive.
; Requirement:
; U and V plane SHOULD be followed by 4 bytes (for read only)
; Y plane SHOULD be followed by 8 bytes (for read only)
.586P
include iammx.inc
ASSUME ds:FLAT, cs:FLAT, ss:FLAT
;------------------------------------------------------------
PQ equ PD
PD equ DWORD PTR
;------------------------------------------------------------
;=============================================================================
_DATA SEGMENT PARA PUBLIC USE32 'DATA'
align 8
PUBLIC Y0_low
PUBLIC Y1_low
PUBLIC U_low_value
PUBLIC V_low_value
PUBLIC U2_V0high_bound
PUBLIC U2_V0low_bound
PUBLIC U3_V1high_bound
PUBLIC U3_V1low_bound
PUBLIC V2_U0high_bound
PUBLIC V2_U0low_bound
PUBLIC V3_U1high_bound
PUBLIC V3_U1low_bound
PUBLIC return_from_Y_high
PUBLIC saturate_to_Y_high
PUBLIC clean_MSB_mask
PUBLIC convert_to_sign
if 0 ;old_constants
V2_U0low_bound dq 0f3ebfbe3f3ebfbe3h ; 746c7c64746c7c64 - 8080808080808080
U2_V0low_bound dq 0ebf3e3fbebf3e3fbh ; 6c74647c6c74647c - 8080808080808080
_V2_U0low_bound dq 0f3ebfbe3f3ebfbe3h ; 746c7c64746c7c64 - 8080808080808080
U3_V1low_bound dq 0e3fbebf3e3fbebf3h ; 647c6c74647c6c74 - 8080808080808080
V3_U1low_bound dq 0fbe3f3ebfbe3f3ebh ; 7c64746c7c64746c - 8080808080808080
_U3_V1low_bound dq 0e3fbebf3e3fbebf3h ; 647c6c74647c6c74 - 8080808080808080
V2_U0high_bound dq 130b1b03130b1b03h ; 948c9c84948c9c84 - 8080808080808080
U2_V0high_bound dq 0b13031b0b13031bh ; 8c94849c8c94849c - 8080808080808080
_V2_U0high_bound dq 130b1b03130b1b03h ; 948c9c84948c9c84 - 8080808080808080
U3_V1high_bound dq 031b0b13031b0b13h ; 849c8c94849c8c94 - 8080808080808080
V3_U1high_bound dq 1b03130b1b03130bh ; 9c84948c9c84948c - 8080808080808080
_U3_V1high_bound dq 031b0b13031b0b13h ; 849c8c94849c8c94 - 8080808080808080
U_low_value dq 1a1a1a1a1a1a1a1ah
V_low_value dq 4e4e4e4e4e4e4e4eh
else ; new constants
V2_U0low_bound dq 0ebf3e3fbebf3e3fbh ; 6c74647c6c74647c - 8080808080808080
U2_V0low_bound dq 0f3ebfbe3f3ebfbe3h ; 746c7c64746c7c64 - 8080808080808080
_V2_U0low_bound dq 0ebf3e3fbebf3e3fbh ; 6c74647c6c74647c - 8080808080808080
U3_V1low_bound dq 0fbe3f3ebfbe3f3ebh ; 7c64746c7c64746c - 8080808080808080
V3_U1low_bound dq 0e3fbebf3e3fbebf3h ; 647c6c74647c6c74 - 8080808080808080
_U3_V1low_bound dq 0fbe3f3ebfbe3f3ebh ; 7c64746c7c64746c - 8080808080808080
V2_U0high_bound dq 0b13031b0b13031bh ; 8c94849c8c94849c - 8080808080808080
U2_V0high_bound dq 130b1b03130b1b03h ; 948c9c84948c9c84 - 8080808080808080
_V2_U0high_bound dq 0b13031b0b13031bh ; 8c94849c8c94849c - 8080808080808080
U3_V1high_bound dq 1b03130b1b03130bh ; 9c84948c9c84948c - 8080808080808080
V3_U1high_bound dq 031b0b13031b0b13h ; 849c8c94849c8c94 - 8080808080808080
_U3_V1high_bound dq 1b03130b1b03130bh ; 9c84948c9c84948c - 8080808080808080
V_low_value dq 1a1a1a1a1a1a1a1ah
U_low_value dq 4e4e4e4e4e4e4e4eh
endif
convert_to_sign dq 8080808080808080h
; Y0_low,Y1_low are arrays
Y0_low dq 1719151b1719151bh ; 1b1b1b1b1b1b1b1b - 0402060004020600 ; for line%4=0
dq 19171b1519171b15h ; 1b1b1b1b1b1b1b1b - 0204000602040006 ; for line%4=2
Y1_low dq 151b1719151b1719h ; 1b1b1b1b1b1b1b1b - 0600040206000402 ; for line%4=1
dq 1b1519171b151917h ; 1b1b1b1b1b1b1b1b - 0006020400060204 ; for line%4=3
clean_MSB_mask dq 1f1f1f1f1f1f1f1fh
saturate_to_Y_high dq 0e6e6e6e6e6e6e6e6h ; ffh-19h
return_from_Y_high dq 0dcdcdcdcdcdcdcdch ; ffh-19h-ah (return back and ADD ah);
_DATA ENDS
;=============================================================================
U_low equ mm6
V_low equ mm7
U_high equ U_low
V_high equ V_low
LocalsRelativeToEBP = 0
RegisterStorageSize = 16
LocalFrameSize = End_of_locals
; Arguments:
arg_YPlane = LocalsRelativeToEBP + RegisterStorageSize + 4
arg_UPlane = LocalsRelativeToEBP + RegisterStorageSize + 8
arg_VPlane = LocalsRelativeToEBP + RegisterStorageSize + 12
arg_FrameWidth = LocalsRelativeToEBP + RegisterStorageSize + 16
arg_FrameHeight = LocalsRelativeToEBP + RegisterStorageSize + 20
arg_YPitch = LocalsRelativeToEBP + RegisterStorageSize + 24
arg_ChromaPitch = LocalsRelativeToEBP + RegisterStorageSize + 28
arg_AspectAdjustmentCount = LocalsRelativeToEBP + RegisterStorageSize + 32
arg_ColorConvertedFrame = LocalsRelativeToEBP + RegisterStorageSize + 36
arg_DCIOffset = LocalsRelativeToEBP + RegisterStorageSize + 40
arg_CCOffsetToLine0 = LocalsRelativeToEBP + RegisterStorageSize + 44
arg_CCOPitch = LocalsRelativeToEBP + RegisterStorageSize + 48
EndOfArgList = LocalsRelativeToEBP + RegisterStorageSize + 56
; LocalFrameSize (on local stack frame)
tmpV2_U0low_bound = 0 ; qw
tmpU2_V0low_bound = 8 ; qw
tmpU3_V1low_bound = 16 ; qw
tmpV3_U1low_bound = 24 ; qw
tmpV2_U0high_bound = 32 ; qw
tmpU2_V0high_bound = 40 ; qw
tmpU3_V1high_bound = 48 ; qw
tmpV3_U1high_bound = 56 ; qw
tmpY0_low = 64 ; qw
tmpY1_low = 72 ; qw
tmpBlockParity = 80
AspectCount = 84
tmpYCursorEven = 88
tmpYCursorOdd = 92
tmpCCOPitch = 96
Old_esp = 100
End_of_locals = 104
LCL EQU <esp+>
;=============================================================================
; extern void "C" MMX_YUV12ToCLUT8 (
; U8* YPlane,
; U8* UPlane,
; U8* VPlane,
; UN FrameWidth,
; UN FrameHeight,
; UN YPitch,
; UN VPitch,
; UN AspectAdjustmentCount,
; U8* ColorConvertedFrame,
; U32 DCIOffset,
; U32 CCOffsetToLine0,
; int CCOPitch,
; int CCType)
;
; The local variables are on the stack.
; The tables are in the one and only data segment.
;
; CCOffsetToLine0 is relative to ColorConvertedFrame.
;
PUBLIC C MMX_YUV12ToCLUT8
_TEXT SEGMENT DWORD PUBLIC USE32 'CODE'
MMX_YUV12ToCLUT8:
push esi
push edi
push ebp
push ebx
mov ebp,esp
sub esp,LocalFrameSize
and esp,0fffffff8h
mov [esp+Old_esp],ebp
mov ecx,[ebp+arg_YPitch]
mov ebx,[ebp+arg_FrameWidth]
mov eax,[ebp+arg_YPlane]
add eax,ebx ; Points to end of Y even line
mov tmpYCursorEven[esp],eax
add eax,ecx ; add YPitch
mov tmpYCursorOdd[esp],eax
lea edx,[edx+2*ebx] ; final value of Y-odd-pointer
mov esi,PD [ebp+arg_VPlane]
mov edx,PD [ebp+arg_UPlane]
mov eax,PD [ebp+arg_ColorConvertedFrame]
add eax,PD [ebp+arg_DCIOffset]
add eax,PD [ebp+arg_CCOffsetToLine0]
sar ebx,1
add esi,ebx
add edx,ebx
lea edi,[eax+2*ebx] ; CCOCursor
mov ecx,[ebp+arg_AspectAdjustmentCount]
mov AspectCount[esp],ecx
test ecx,ecx ; if AspectCount=0 we should not drop any lines
jnz non_zero_AspectCount
dec ecx
non_zero_AspectCount:
mov AspectCount[esp],ecx
cmp ecx,1
jbe finish
neg ebx
mov [ebp+arg_FrameWidth],ebx
movq mm6,PQ U_low_value ; store some frequently used values in registers
movq mm7,PQ V_low_value
xor eax,eax
mov tmpBlockParity[esp],eax
; Register Usage:
;
; esi -- points to the end of V Line
; edx -- points to the end of U Line.
; edi -- points to the end of even line of output.
; ebp -- points to the end of odd line of output.
;
; ecx -- points to the end of even/odd Y Line
; eax -- 8*(line&2) == 0, on line%4=0,1
; == 8, on line%4=2,3
; in the loop, eax points to the end of even Y line
; ebx -- Number of points, we havn't done yet. (multiplyed by -0.5)
;
;
;------------------------------------------------------------------------------
; Noise matrix is of size 4x4 , so we have different noise values in even pair of lines,
; and in odd pair of lines. But in our loop we are doing 2 lines. So here we are prepairing
; constants for next two lines.
; This code is done each time we are starting to convert next pair of lines.
PrepareNext2Lines:
mov eax,tmpBlockParity[esp]
;constants for odd line
movq mm0,PQ V3_U1low_bound[eax]
movq mm1,PQ V3_U1high_bound[eax]
movq mm2,PQ U3_V1low_bound[eax]
movq mm3,PQ U3_V1high_bound[eax]
movq PQ tmpV3_U1low_bound[esp],mm0
movq PQ tmpV3_U1high_bound[esp],mm1
movq PQ tmpU3_V1low_bound[esp],mm2
movq PQ tmpU3_V1high_bound[esp],mm3
;constants for even line
movq mm0,PQ V2_U0low_bound[eax]
movq mm1,PQ V2_U0high_bound[eax]
movq mm2,PQ U2_V0low_bound[eax]
movq mm3,PQ U2_V0high_bound[eax]
movq PQ tmpV2_U0low_bound[esp],mm0
movq PQ tmpV2_U0high_bound[esp],mm1
movq PQ tmpU2_V0low_bound[esp],mm2
movq PQ tmpU2_V0high_bound[esp],mm3
; Constants for Y values
movq mm4,PQ Y0_low[eax]
movq mm5,PQ Y1_low[eax]
xor eax,8
mov tmpBlockParity[esp],eax
movq PQ tmpY0_low[esp],mm4
movq PQ tmpY1_low[esp],mm5
; if AspectCount<2 we should skip a line. In this case we are steel doing two
; lines, but output pointers are the same, so we just overwriting line which we should skip
mov eax,[ebp+arg_CCOPitch]
mov ebx, AspectCount[esp]
xor ecx,ecx
sub ebx,2
mov tmpCCOPitch[esp],eax
ja continue
mov eax,[ebp+arg_AspectAdjustmentCount]
mov tmpCCOPitch[esp],ecx ; 0
lea ebx,[ebx+eax] ; calculate new AspectCount
jnz continue ; skiping even line
;skip_odd_line
mov eax,tmpYCursorEven[esp]
; set odd constants to be equal to even_constants
; Odd line will be performed as even
movq PQ tmpV3_U1low_bound[esp],mm0
movq PQ tmpV3_U1high_bound[esp],mm1
movq PQ tmpU3_V1low_bound[esp],mm2
movq PQ tmpU3_V1high_bound[esp],mm3
movq PQ tmpY1_low[esp],mm4
mov tmpYCursorOdd[esp],eax
; when we got here, we already did all preparations.
; we are entering a main loop which is starts at do_next_8x2_block label
continue:
mov AspectCount[esp],ebx
mov ebx,[ebp+arg_FrameWidth]
mov ebp,edi
add ebp,tmpCCOPitch[esp] ; ebp points to the end of odd line of output
mov eax,tmpYCursorEven[esp]
mov ecx,tmpYCursorOdd[esp]
movdt mm0,[edx+ebx] ; read 4 U points
movdt mm2,[esi+ebx] ; read 4 V points
punpcklbw mm0,mm0 ; u3:u3:u2:u2|u1:u1:u0:u0
psubb mm0,PQ convert_to_sign
punpcklbw mm2,mm2 ; v3:v3:v2:v2|v1:v1:v0:v0
movq mm4,[eax+2*ebx] ; read 8 Y points from even line
movq mm1,mm0 ; u3:u3:u2:u2|u1:u1:u0:u0
do_next_8x2_block:
psubb mm2,PQ convert_to_sign ; convert to sign range (for comparison)
movq mm5,mm1 ; u3:u3:u2:u2|u1:u1:u0:u0
pcmpgtb mm0,PQ tmpV2_U0low_bound[esp]
movq mm3,mm2
pcmpgtb mm1,PQ tmpV2_U0high_bound[esp]
pand mm0,U_low
psubusb mm4,PQ tmpY0_low[esp]
pand mm1,U_high
pcmpgtb mm2,PQ tmpU2_V0low_bound[esp]
psrlq mm4,3
pand mm4,PQ clean_MSB_mask
pand mm2,V_low
paddusb mm4,PQ saturate_to_Y_high
paddb mm0,mm1 ; U03:U03:U02:U02|U01:U01:U00:U00
psubusb mm4,PQ return_from_Y_high
movq mm1,mm5
pcmpgtb mm5,PQ tmpV3_U1low_bound[esp]
paddd mm0,mm2
pcmpgtb mm1,PQ tmpV3_U1high_bound[esp]
pand mm5,U_low
paddd mm0,mm4
movq mm2,mm3
pcmpgtb mm3,PQ tmpU2_V0high_bound[esp]
pand mm1,U_high
movq mm4,[ecx+2*ebx] ; read next 8 Y points from odd line
paddb mm5,mm1 ; u impact on odd line
psubusb mm4,PQ tmpY1_low[esp]
movq mm1,mm2
pcmpgtb mm2,PQ tmpU3_V1low_bound[esp]
psrlq mm4,3
pand mm4,PQ clean_MSB_mask
pand mm2,V_low
paddusb mm4,PQ saturate_to_Y_high
paddd mm5,mm2
psubusb mm4,PQ return_from_Y_high
pand mm3,V_high
pcmpgtb mm1,PQ tmpU3_V1high_bound[esp]
paddb mm3,mm0
movdt mm0,[edx+ebx+4] ; read next 4 U points
pand mm1,V_high
movdt mm2,[esi+ebx+4] ; read next 4 V points
paddd mm5,mm4
movq mm4,[eax+2*ebx+8] ; read next 8 Y points from even line
paddb mm5,mm1
psubb mm0,PQ convert_to_sign
punpcklbw mm2,mm2 ; v3:v3:v2:v2|v1:v1:v0:v0
movq [edi+2*ebx],mm3 ; write even line
punpcklbw mm0,mm0 ; u3:u3:u2:u2|u1:u1:u0:u0
movq [ebp+2*ebx],mm5 ; write odd line
movq mm1,mm0 ; u3:u3:u2:u2|u1:u1:u0:u0
add ebx,4
jl do_next_8x2_block
; update pointes to input and output buffers, to point to the next lines
mov ebp,[esp+Old_esp]
mov eax,tmpYCursorEven[esp]
mov ecx,[ebp+arg_YPitch]
add edi,[ebp+arg_CCOPitch] ; go to the end of next line
add edi,tmpCCOPitch[esp] ; skip odd line
lea eax,[eax+2*ecx]
mov tmpYCursorEven[esp],eax
add eax,[ebp+arg_YPitch]
mov tmpYCursorOdd[esp],eax
add esi,[ebp+arg_ChromaPitch]
add edx,[ebp+arg_ChromaPitch]
sub PD [ebp+arg_FrameHeight],2
ja PrepareNext2Lines
;------------------------------------------------------------------------------
finish:
emms
mov esp,[esp+Old_esp]
pop ebx
pop ebp
pop edi
pop esi
ret
_TEXT ENDS
END
;-------------------------------------------------------------------------
;
; cxm1282 -- This function performs YUV12 to CLUT8 zoom-by-2 color conversion
; for H26x. It dithers among 9 chroma points and 26 luma
; points, mapping the 8 bit luma pels into the 26 luma points by
; clamping the ends and stepping the luma by 8.
;
; 1. The color convertor is destructive; the input Y, U, and V
; planes will be clobbered. The Y plane MUST be preceded by
; 1544 bytes of space for scratch work.
; 2. U and V planes should be preceded by 4 bytes (for read only)
;
include locals.inc
include iammx.inc
ASSUME ds:FLAT, cs:FLAT, ss:FLAT
.586
.xlist
.list
;------------------------------------------------------------
PQ equ PD
;------------------------------------------------------------
;=============================================================================
MMXDATA1 SEGMENT PARA USE32 PUBLIC 'DATA'
ALIGN 8
;convert_to_sign dq 8080808080808080h
;V2_U0low_bound dq 0f3ebfbe3f3ebfbe3h ; 746c7c64746c7c64 - 8080808080808080
;V2_U0high_bound dq 130b1b03130b1b03h ; 948c9c84948c9c84 - 8080808080808080
;U2_V0low_bound dq 0ebf3e3fbebf3e3fbh ; 6c74647c6c74647c - 8080808080808080
;U2_V0high_bound dq 0b13031b0b13031bh ; 8c94849c8c94849c - 8080808080808080
;U_low_value dq 1a1a1a1a1a1a1a1ah
;V_low_value dq 4e4e4e4e4e4e4e4eh
;Y0_correct dq 1b1519171b151917h ; 1b1b1b1b1b1b1b1b - 0006020400060204
;Y1_correct dq 19171b1519171b15h ; 1b1b1b1b1b1b1b1b - 0204000602040006
;Y2_correct dq 151b1719151b1719h ; 1b1b1b1b1b1b1b1b - 0402060004020600
;Y3_correct dq 1719151b1719151bh ; 1b1b1b1b1b1b1b1b - 0600040206000402
;clean_MSB_mask dq 1f1f1f1f1f1f1f1fh
;saturate_to_Y_high dq 0e6e6e6e6e6e6e6e6h ; ffh-19h
;return_from_Y_high dq 0dcdcdcdcdcdcdcdch ; ffh-19h-ah (return back and ADD ah);
extrn convert_to_sign:qword
extrn V2_U0low_bound:qword
extrn V2_U0high_bound:qword
extrn U2_V0low_bound:qword
extrn U2_V0high_bound:qword
extrn U_low_value:qword
extrn V_low_value:qword
extrn Y0_low:qword
extrn Y1_low:qword
extrn clean_MSB_mask:qword
extrn saturate_to_Y_high:qword
extrn return_from_Y_high:qword
Y0_correct equ Y1_low+8
Y1_correct equ Y0_low+8
Y2_correct equ Y1_low
Y3_correct equ Y0_low
U_high_value equ U_low_value
V_high_value equ V_low_value
MMXDATA1 ENDS
;=============================================================================
LocalFrameSize = 24
RegisterStorageSize = 16
; Arguments:
YPlane = LocalFrameSize + RegisterStorageSize + 4
UPlane = LocalFrameSize + RegisterStorageSize + 8
VPlane = LocalFrameSize + RegisterStorageSize + 12
FrameWidth = LocalFrameSize + RegisterStorageSize + 16
FrameHeight = LocalFrameSize + RegisterStorageSize + 20
YPitch = LocalFrameSize + RegisterStorageSize + 24
ChromaPitch = LocalFrameSize + RegisterStorageSize + 28
AspectAdjustmentCount = LocalFrameSize + RegisterStorageSize + 32
ColorConvertedFrame = LocalFrameSize + RegisterStorageSize + 36
DCIOffset = LocalFrameSize + RegisterStorageSize + 40
CCOffsetToLine0 = LocalFrameSize + RegisterStorageSize + 44
CCOPitch = LocalFrameSize + RegisterStorageSize + 48
EndOfArgList = LocalFrameSize + RegisterStorageSize + 56
; Locals (on local stack frame)
CCOCursor = 0
DistanceFromVToU = 4
AspectCount = 8
CCOLine1 = 12
CCOLine2 = 16
CCOLine3 = 20
LCL EQU <esp+>
;=============================================================================
MMXCODE1 SEGMENT PARA USE32 PUBLIC 'CODE'
; extern void "C" MMX_YUV12ToCLUT8ZoomBy2 (
; U8* YPlane,
; U8* UPlane,
; U8* VPlane,
; UN FrameWidth,
; UN FrameHeight,
; UN YPitch,
; UN VPitch,
; UN AspectAdjustmentCount,
; U8* ColorConvertedFrame,
; U32 DCIOffset,
; U32 CCOffsetToLine0,
; int CCOPitch,
; int CCType)
;
; The local variables are on the stack.
; The tables are in the one and only data segment.
;
; CCOffsetToLine0 is relative to ColorConvertedFrame.
;
PUBLIC C MMX_YUV12ToCLUT8ZoomBy2
MMX_YUV12ToCLUT8ZoomBy2:
push esi
push edi
push ebp
push ebx
sub esp,LocalFrameSize
mov ebx,PD [esp+VPlane]
mov ecx,PD [esp+UPlane]
sub ecx,ebx
mov PD [esp+DistanceFromVToU],ecx
mov eax,PD [esp+ColorConvertedFrame]
add eax,PD [esp+DCIOffset]
add eax,PD [esp+CCOffsetToLine0]
mov PD [esp+CCOCursor],eax
; Ledx FrameHeight
; Lecx YPitch
; imul edx,ecx
Ledi CCOPitch
Lesi YPlane ; Fetch cursor over luma plane.
Seax CCOCursor
; add edx,esi
; Sedx YLimit
Ledx AspectAdjustmentCount
Sedx AspectCount
mov edi,esi
Lebx FrameWidth
Leax CCOCursor
sar ebx,1
sub ebx,4 ; counter starts from maxvalue-4, and in last iteration it equals 0
mov ecx,eax
ADDedi YPitch ; edi = odd Y line cursor
ADDecx CCOPitch
Sebx FrameWidth
Secx CCOLine1
Lebx CCOPitch
; in each outer loop iteration, 4 lines of output are done.
; in each inner loop iteration block 4x16 of output is done.
; main task of outer loop is to prepare pointers for inner loop
NextFourLines:
; prepare output pointers
; ebx=CCOPitch
; eax=CCOLine0
; ecx=CCOLine1
Lebp AspectCount
sub ebp,2
ja continue1 ; jump if it still>0
ADDebp AspectAdjustmentCount
mov ecx,eax ; Output1 will overwrite Output0 line
Secx CCOLine1
continue1:
lea edx,[ecx+ebx]
sub ebp,2
Sedx CCOLine2
ja continue2 ; jump if it still>0
ADDebp AspectAdjustmentCount
xor ebx,ebx ; Output1 will overwrite Output0 line
continue2:
Sebp AspectCount
lea ebp,[edx+ebx]
Sebp CCOLine3
; output pointers are done
; Inner loop does 4x16 block of output points
; Register Usage
;
; esi -- Cursor over even Y line
; edi -- Cursor over odd Y line
; edx -- Cursor over V line
; ebp -- Cursor over U line.
; eax -- cursor over Output
; ecx -- cursor over Output1,2,3
; ebx -- counter
Lebp VPlane
Lebx FrameWidth
mov edx,ebp
ADDebp DistanceFromVToU ; Cursor over U line.
movdt mm3,[ebp+ebx] ; read 4 U points
movdt mm2,[edx+ebx] ; read 4 V points
punpcklbw mm3,mm3 ; u3:u3:u2:u2|u1:u1:u0:u0
prepare_next4x8:
psubb mm3,PQ convert_to_sign
punpcklbw mm2,mm2 ; v3:v3:v2:v2|v1:v1:v0:v0
psubb mm2,PQ convert_to_sign
movq mm4,mm3
movdt mm7,[esi+2*ebx] ; read even Y line
punpcklwd mm3,mm3 ; u1:u1:u1:u1|u0:u0:u0:u0
Lecx CCOLine1
movq mm1,mm3
pcmpgtb mm3,PQ V2_U0low_bound
punpcklbw mm7,mm7 ; y3:y3:y2:y2|y1:y1:y0:y0
pand mm3,PQ U_low_value
movq mm5,mm7
psubusb mm7,PQ Y0_correct
movq mm6,mm2
pcmpgtb mm1,PQ V2_U0high_bound
punpcklwd mm2,mm2 ; v1:v1:v1:v1|v0:v0:v0:v0
pand mm1,PQ U_high_value
psrlq mm7,3
pand mm7,PQ clean_MSB_mask
movq mm0,mm2
pcmpgtb mm2,PQ U2_V0low_bound
; empty slot !!!!
pcmpgtb mm0,PQ U2_V0high_bound
paddb mm3,mm1
pand mm2,PQ V_low_value
pand mm0,PQ V_high_value
; two empty slots !!!!
paddusb mm7,PQ saturate_to_Y_high
paddb mm3,mm2
psubusb mm7,PQ return_from_Y_high ; Y impact on line0
paddd mm3,mm0 ; common U,V impact on line 0
psubusb mm5,PQ Y1_correct
paddb mm7,mm3 ; final value of line 0
movq mm0,mm3 ; u31:u21:u11:u01|u30:u20:u10:u00
psrlq mm5,3
pand mm5,PQ clean_MSB_mask
psrld mm0,16 ; : :u31:u21| : :u30:u20
paddusb mm5,PQ saturate_to_Y_high
pslld mm3,16 ; u11:u01: : |u10:u00: :
psubusb mm5,PQ return_from_Y_high ; Y impact on line0
por mm0,mm3 ; u11:u01:u31:u21|u10:u00:u30:u20
movdt mm3,[edi+2*ebx] ; odd Y line
paddb mm5,mm0 ; final value of line 0
punpcklbw mm3,mm3 ; y3:y3:y2:y2|y1:y1:y0:y0
movq mm2,mm0 ; u11:u01:u31:u21|u10:u00:u30:u20
movq [ecx+4*ebx],mm5 ; write Output1 line
movq mm1,mm3
movq [eax+4*ebx],mm7 ; write Output0 line
psrlw mm0,8 ; :u11: :u31| :u10: :u30
psubusb mm1,PQ Y3_correct
psllw mm2,8 ; u01: :u21: |u00: :u20:
psubusb mm3,PQ Y2_correct
psrlq mm1,3
pand mm1,PQ clean_MSB_mask
por mm0,mm2 ; u01:u11:u21:u31|u00:u10:u20:u30
paddusb mm1,PQ saturate_to_Y_high
psrlq mm3,3
psubusb mm1,PQ return_from_Y_high
movq mm5,mm0 ; u01:u11:u21:u31|u00:u10:u20:u30
pand mm3,PQ clean_MSB_mask
paddb mm1,mm0
paddusb mm3,PQ saturate_to_Y_high
psrld mm5,16
psubusb mm3,PQ return_from_Y_high
pslld mm0,16
Lecx CCOLine3
por mm5,mm0 ; u21:u31:u01:u11|u20:u30:u00:u10
movdt mm2,[esi+2*ebx+4] ; read next even Y line
paddb mm5,mm3
movq [ecx+4*ebx],mm1 ; write Output3 line
punpckhwd mm4,mm4 ; u3:u3:u3:u3|u2:u2:u2:u2
; start next 4x8 block of output
; SECOND uv-QWORD
; mm6, mm4 are live
Lecx CCOLine2
movq mm3,mm4
pcmpgtb mm4,PQ V2_U0low_bound
punpckhwd mm6,mm6
movq [ecx+4*ebx],mm5 ; write Output2 line
movq mm7,mm6
pand mm4,PQ U_low_value
punpcklbw mm2,mm2 ; y3:y3:y2:y2|y1:y1:y0:y0
pcmpgtb mm3,PQ V2_U0high_bound
movq mm5,mm2
pand mm3,PQ U_high_value
pcmpgtb mm6,PQ U2_V0low_bound
paddb mm4,mm3
pand mm6,PQ V_low_value
pcmpgtb mm7,PQ U2_V0high_bound
paddb mm4,mm6
pand mm7,PQ V_high_value
psubusb mm2,PQ Y0_correct
paddd mm4,mm7
psubusb mm5,PQ Y1_correct
psrlq mm2,3
pand mm2,PQ clean_MSB_mask
movq mm3,mm4 ; u31:u21:u11:u01|u30:u20:u10:u00
paddusb mm2,PQ saturate_to_Y_high
pslld mm3,16 ; u11:u01: : |u10:u00: :
psubusb mm2,PQ return_from_Y_high
psrlq mm5,3
pand mm5,PQ clean_MSB_mask
paddb mm2,mm4 ; MM4=u31:u21:u11:u01|u30:u20:u10:u00, WHERE U STANDS FOR UNATED U AND V IMPACTS
paddusb mm5,PQ saturate_to_Y_high
psrld mm4,16 ; : :u31:u21| : :u30:u20
psubusb mm5,PQ return_from_Y_high
por mm4,mm3 ; u11:u01:u31:u21|u10:u00:u30:u20
paddb mm5,mm4
Lecx CCOLine1
movdt mm0,[edi+2*ebx+4] ; read odd Y line
movq mm7,mm4 ; u11:u01:u31:u21|u10:u00:u30:u20
movq [ecx+4*ebx+8],mm5 ; write Output1 line
punpcklbw mm0,mm0 ; y3:y3:y2:y2|y1:y1:y0:y0
movq [eax+4*ebx+8],mm2 ; write Output0 line
movq mm1,mm0
psubusb mm1,PQ Y2_correct
psrlw mm4,8 ; :u11: :u31| :u10: :u30
psubusb mm0,PQ Y3_correct
psrlq mm1,3
pand mm1,PQ clean_MSB_mask
psllw mm7,8 ; u01: :u21: |u00: :u20:
paddusb mm1,PQ saturate_to_Y_high
por mm4,mm7 ; u01:u11:u21:u31|u00:u10:u20:u30
psubusb mm1,PQ return_from_Y_high
psrlq mm0,3
pand mm0,PQ clean_MSB_mask
movq mm5,mm4 ; u01:u11:u21:u31|u00:u10:u20:u30
paddusb mm0,PQ saturate_to_Y_high
psrld mm5,16
psubusb mm0,PQ return_from_Y_high
paddb mm0,mm4
Lecx CCOLine3
movdt mm3,[ebp+ebx-4] ; read next 4 U points
pslld mm4,16
movq [ecx+4*ebx+8],mm0 ; write Output3 line
por mm5,mm4 ; u21:u31:u01:u11|u20:u30:u00:u10
paddb mm5,mm1
Lecx CCOLine2
movdt mm2,[edx+ebx-4] ; read next 4 V points
punpcklbw mm3,mm3 ; u3:u3:u2:u2|u1:u1:u0:u0
movq [ecx+4*ebx+8],mm5 ; write Output2 line
sub ebx,4
jae prepare_next4x8
Lebx CCOPitch
Lecx CCOLine3
Lebp YPitch
Ledx VPlane
lea eax,[ecx+ebx] ; next Output0 = old Output3 + CCOPitch
lea ecx,[ecx+2*ebx] ; next Output1 = old Output3 + 2* CCOPitch
ADDedx ChromaPitch
Secx CCOLine1
lea esi,[esi+2*ebp] ; even Y line cursor goes to next line
lea edi,[edi+2*ebp] ; odd Y line cursor goes to next line
Sedx VPlane ; edx will point to V plane
sub PD FrameHeight[esp],2
ja NextFourLines
emms
add esp,LocalFrameSize
pop ebx
pop ebp
pop edi
pop esi
retn
MMXCODE1 ENDS
END
;-------------------------------------------------------------------------
; cx512241 -- This function performs YUV12-to-RGB24 color conversion for H26x.
; It is tuned for best performance on the Pentium(r) Microprocessor.
; It handles the format in which the low order byte is B, the
; second byte is G, and the high order byte is R.
;
; The YUV12 input is planar, 8 bits per pel. The Y plane may have
; a pitch of up to 768. It may have a width less than or equal
; to the pitch. It must be DWORD aligned, and preferably QWORD
; aligned. Pitch and Width must be a multiple of four. For best
; performance, Pitch should not be 4 more than a multiple of 32.
; Height may be any amount, but must be a multiple of two. The U
; and V planes may have a different pitch than the Y plane, subject
; to the same limitations.
;
; The color convertor is destructive; the input Y, U, and V
; planes will be clobbered. The Y plane MUST be preceded by
; 3104 bytes of space for scratch work.
OPTION PROLOGUE:None
OPTION EPILOGUE:ReturnAndRelieveEpilogueMacro
include iammx.inc
include locals.inc
.xlist
.list
.DATA
; any data would go here
ALIGN 8
sixty_four dd 40404040h, 40404040h
include small_ta.asm
.CODE
ASSUME ds:FLAT, cs:FLAT, ss:FLAT
; void FAR ASM_CALLTYPE MMX_YUV12ToRGB24 (
; U8* YPlane,
; U8* UPlane,
; U8* VPlane,
; UN FrameWidth,
; UN FrameHeight,
; UN YPitch,
; UN VPitch,
; UN AspectAdjustmentCount,
; U8* ColorConvertedFrame,
; U32 DCIOffset,
; U32 CCOffsetToLine0,
; IN CCOPitch,
; IN CCType)
;
; The local variables are on the stack.
; The tables are in the one and only data segment.
;
; CCOffsetToLine0 is relative to ColorConvertedFrame.
;
PUBLIC C YUV12ToRGB24
; due to the need for the ebp reg, these parameter declarations aren't used,
; they are here so the assembler knows how many bytes to relieve from the stack
LocalFrameSize = 40
RegisterStorageSize = 16
; Arguments:
; Arguments:
YPlane = LocalFrameSize + RegisterStorageSize + 4
UPlane = LocalFrameSize + RegisterStorageSize + 8
VPlane = LocalFrameSize + RegisterStorageSize + 12
FrameWidth = LocalFrameSize + RegisterStorageSize + 16
FrameHeight = LocalFrameSize + RegisterStorageSize + 20
YPitch = LocalFrameSize + RegisterStorageSize + 24
ChromaPitch = LocalFrameSize + RegisterStorageSize + 28
AspectAdjustmentCount = LocalFrameSize + RegisterStorageSize + 32
ColorConvertedFrame = LocalFrameSize + RegisterStorageSize + 36
DCIOffset = LocalFrameSize + RegisterStorageSize + 40
CCOffsetToLine0 = LocalFrameSize + RegisterStorageSize + 44
CCOPitch = LocalFrameSize + RegisterStorageSize + 48
EndOfArgList = LocalFrameSize + RegisterStorageSize + 52
; Locals (on local stack frame)
CCOCursor = 0
CCOSkipDistance = 4
ChromaLineLen = 8
YSkipDistance = 12
YCursor = 16
DistanceFromVToU = 20
tmpYCursorEven = 24
tmpYCursorOdd = 28
tmpCCOPitch = 32
AspectCount = 36
LCL EQU <esp+>
YUV12ToRGB24:
push esi
push edi
push ebp
push ebx
sub esp,LocalFrameSize
mov ebx,PD [esp+VPlane]
mov ecx,PD [esp+UPlane]
sub ecx,ebx
mov PD [esp+DistanceFromVToU],ecx
mov eax,PD [esp+ColorConvertedFrame]
add eax,PD [esp+DCIOffset]
add eax,PD [esp+CCOffsetToLine0]
mov PD [esp+CCOCursor],eax
; Ledx FrameHeight
Lecx YPitch
; imul edx,ecx ; FrameHeight*YPitch
Lebx FrameWidth
Leax CCOPitch
sub eax,ebx ; CCOPitch-FrameWidth
sub ecx,ebx ; YPitch-FrameWidth
sub eax,ebx ; CCOPitch-2*FrameWidth
Secx YSkipDistance
sub eax,ebx ; CCOPitch-3*FrameWidth
Lesi YPlane ; Fetch cursor over luma plane.
sar ebx,1 ; FrameWidth/2
Seax CCOSkipDistance ; CCOPitch-3*FrameWidth
add edx,esi ; YPlane+Size_of_Y_array
Sebx ChromaLineLen ; FrameWidth/2
; Sedx YLimit
Sesi YCursor
Ledx AspectAdjustmentCount
Lesi VPlane
test edx,edx ; if AspectCount=0 we should not drop any lines
jnz non_zero_AspectCount
dec edx
non_zero_AspectCount:
Sedx AspectAdjustmentCount
xor eax,eax
Lebp DistanceFromVToU
Ledi YCursor ; Fetch Y Pitch.
Lebx FrameWidth
add edi,ebx
Sedi tmpYCursorEven
Leax YPitch
add edi,eax
Sedi tmpYCursorOdd
sar ebx,1
add esi,ebx
add ebp,esi
neg ebx
Sebx FrameWidth
Ledi CCOCursor
; Register Usage:
;
; edx -- Y Line cursor. Chroma contribs go in lines above current Y line.
; esi -- V Line cursor.
; ebp -- U Line cursor
; edi -- Cursor over the color converted output image.
; ebx -- Number of points, we havn't done yet.
;
;
; ecx -- V contribution to RGB; sum of U and V contributions.
; eax -- Alternately a U and a V pel.
;------------------------------------------------------------------------------
sub edi,12
movq mm7,sixty_four
Leax AspectAdjustmentCount
Seax AspectCount
cmp eax,1
jbe finish
PrepareChromaLine:
Lebx FrameWidth
Leax AspectCount
Ledx CCOPitch
xor ecx,ecx
sub eax,2
Sedx tmpCCOPitch
ja continue
Leax AspectAdjustmentCount
Secx tmpCCOPitch ; 0
jnz skip_even_line
skip_odd_line:
Ledx tmpYCursorEven
Seax AspectCount
Sedx tmpYCursorOdd
jmp do_next_4x2_block
skip_even_line:
dec eax
continue:
Seax AspectCount
Ledx tmpYCursorEven
xor eax,eax
mov cl,[edx+2*ebx] ; Ye0
mov al,[edx+2*ebx+1] ; Ye1
movdt mm1,PD Y0[eax*4] ; 0: 0: 0: 0| 0:Ye1: Ye1: Ye1
do_next_4x2_block:
movdt mm3,PD Y0[ecx*4] ; 0: 0: 0: 0| 0:Ye0: Ye0: Ye0
psllq mm1,24 ; 0: 0: Ye1: Ye1| Ye1: 0: 0: 0
xor ecx,ecx
mov al,[edx+2*ebx+2] ; Ye2
mov cl,[edx+2*ebx+3] ; Ye3
xor edx,edx
mov dl,[esi+ebx+1] ; v1
add edi,12 ; output
movdt mm4,PD Y0[eax*4] ; 0: 0: 0: 0| 0: 0:Ye2 : Ye2
por mm3,mm1 ; 0: 0: Ye1: Ye1| Ye1:Ye0: Ye0: Ye0
movdt mm5,PD Y0[ecx*4] ; 0| 0: Ye3: Ye3: Ye3
psllq mm4,48 ;Ye2 : Ye2: 0: 0| 0: 0: 0: 0
mov cl,[ebp+ebx] ; u0
mov al,[esi+ebx] ; v0
movq mm2,PD v0[edx*8] ; 0: 0: Rv3: Gv3| Bv3: Rv2: Gv2: Bv2 u,v impact on RGB[0] and RGB[1] is equal
por mm3,mm4 ;Ye2 : Ye2: Ye1: Ye1| Ye1: Ye0: Ye0: Ye0
movq mm0,PD u0[ecx*8] ; 0: 0: Ru1: Gu1| Bu1: Ru0: Gu0: Bu0 u,v impact on RGB[0] and RGB[1] is equal
psllq mm5,8 ; 0| Ye3: Ye3: Ye3: 0
mov cl,[ebp+ebx+1] ; u1
Ledx tmpYCursorOdd
paddb mm0,PD v0[eax*8] ; 0: 0:Ruv1:Guv1|Buv1:Ruv0:Guv0:Buv0
psrlq mm4,56 ; 0: 0: 0: 0| 0: 0: 0: Ye2
paddb mm2,PD u0[ecx*8] ; 0: 0:Ruv3:Guv3|Buv3:Ruv2:Guv2:Buv2
por mm4,mm5 ; 0| Ye3: Ye3: Ye3: Ye2
movq mm1,mm2
psllq mm2,48 ;Guv2:Buv2: 0: 0| 0: 0: 0: 0
psrlq mm1,16 ; 0: 0: 0: 0|Ruv3:Guv3:Buv3:Ruv2
por mm0,mm2 ;Guv2:Buv2:Ruv1:Guv1|Buv1:Ruv0:Guv0:Buv0
paddb mm3,mm0 ; r0:g0:b0:r1|g1:b1:r2:g2
mov cl,[edx+2*ebx+1] ; Yo1
mov al,[edx+2*ebx] ; Yo0
psubusb mm3,mm7 ; mm7=sixty_four
movdt mm6,PD Y0[ecx*4] ; 0: 0: Ye1: Ye1| Ye1:Ye0: Ye0: Ye0
paddb mm4,mm1 ; x: x: 0: 0| b2: r3: g3: b3
movdt mm5,PD Y0[eax*4] ; 0: 0: 0: 0| 0:Ye0: Ye0: Ye0
psubusb mm4,mm7 ; mm7=sixty_four
psllq mm6,24 ; 0: 0: 0: 0| 0:Ye0: Ye0: Ye0
mov al,[edx+2*ebx+2] ; Yo2
paddusb mm3,mm3
por mm5,mm6
movdt mm6,PD Y0[eax*4] ; 0: 0: 0: 0| 0: 0: Ye2: Ye2
paddusb mm3,mm3
paddusb mm4,mm4
psllq mm6,48 ;Ye2: Ye2: 0: 0| 0: 0: 0: 0
movdf [edi],mm3
paddusb mm4,mm4
mov cl,[edx+2*ebx+3] ; Yo3
psrlq mm3,32
movdf [edi+8],mm4
por mm5,mm6 ;Ye2: Ye2: Ye1: Ye1| Ye1: Ye0: Ye0: Ye0
movdt mm2,PD Y0[ecx*4] ; 0| Ye3: Ye3: Ye3: Ye2
paddb mm5,mm0 ; r0:g0:b0:r1|g1:b1:r2:g2
psllq mm2,8
Ledx tmpYCursorEven
psubusb mm5,mm7 ; mm7=sixty_four
psrlq mm6,56 ; 0: 0: 0: 0| 0: 0: 0: Ye2
por mm6,mm2 ; 0| Ye3: Ye3: Ye3: Ye2
paddusb mm5,mm5
Leax tmpCCOPitch
paddusb mm5,mm5
paddb mm6,mm1 ; x: x: 0: 0| b2: r3: g3: b3
mov cl,[edx+2*ebx+1+4] ; Ye1
movdf [edi+eax],mm5
psrlq mm5,32
movdf [edi+4],mm3
psubusb mm6,mm7 ; mm7=sixty_four
movdf [edi+eax+4],mm5
paddusb mm6,mm6
movdt mm1,PD Y0[ecx*4] ; 0: 0: 0: 0| 0:Ye1: Ye1: Ye1
paddusb mm6,mm6
mov cl,[edx+2*ebx+4] ; Ye0
add ebx,2
movdf [edi+eax+8],mm6
mov eax,zero
jl do_next_4x2_block
Leax YPitch
ADDedi CCOSkipDistance ; go to begin of next line
ADDedi tmpCCOPitch ; skip odd line
Ledx tmpYCursorEven
lea edx,[edx+2*eax]
Sedx tmpYCursorEven
ADDedx YPitch
Sedx tmpYCursorOdd
ADDesi ChromaPitch
ADDebp ChromaPitch
sub PD FrameHeight[esp],2 ; Done with last line?
ja PrepareChromaLine
;------------------------------------------------------------------------------
finish:
add esp,LocalFrameSize
emms
pop ebx
pop ebp
pop edi
pop esi
rturn
END
;-------------------------------------------------------------------------
OPTION PROLOGUE:None
OPTION EPILOGUE:ReturnAndRelieveEpilogueMacro
include iammx.inc
include locals.inc
.586
.xlist
.list
ASSUME ds:FLAT, cs:FLAT, ss:FLAT
MMXCODE1 SEGMENT PARA USE32 PUBLIC 'CODE'
MMXCODE1 ENDS
MMXDATA1 SEGMENT PARA USE32 PUBLIC 'DATA'
MMXDATA1 ENDS
MMXDATA1 SEGMENT
; any data would go here
ALIGN 8
;constants for direct RGB calculation: 4x10.6 values
Minusg dd 00800080h,00800080h
VtR dd 00660066h,00660066h ;01990199h,01990199h
UtB dd 00810081h,00810081h ;02050205h,02050205h
Ymul dd 004a004ah,004a004ah ;012a012ah,012a012ah
Yadd dd 10101010h,10101010h
UVtG dd 00340019h,00340019h ;00d00064h,00d00064h
MASK_036 dd 0ff0000ffh,00ff0000h
MASK_147 dd 0000ff00h,0ff0000ffh
tmpYCursorEven dd 0
tmpYCursorOdd dd 0
tmpBuffer db 48 dup (?) ; aligned on 8 byte boundary scratch buffer
MMXDATA1 ENDS
LocalFrameSize = 20
RegisterStorageSize = 16
; Arguments:
YPlane = LocalFrameSize + RegisterStorageSize + 4
UPlane = LocalFrameSize + RegisterStorageSize + 8
VPlane = LocalFrameSize + RegisterStorageSize + 12
FrameWidth = LocalFrameSize + RegisterStorageSize + 16
FrameHeight = LocalFrameSize + RegisterStorageSize + 20
YPitch = LocalFrameSize + RegisterStorageSize + 24
ChromaPitch = LocalFrameSize + RegisterStorageSize + 28
AspectAdjustmentCount = LocalFrameSize + RegisterStorageSize + 32
ColorConvertedFrame = LocalFrameSize + RegisterStorageSize + 36
DCIOffset = LocalFrameSize + RegisterStorageSize + 40
CCOffsetToLine0 = LocalFrameSize + RegisterStorageSize + 44
CCOPitch = LocalFrameSize + RegisterStorageSize + 48
CCType = LocalFrameSize + RegisterStorageSize + 52
EndOfArgList = LocalFrameSize + RegisterStorageSize + 56
; Locals (on local stack frame)
CCOCursor = 0
CCOSkipDistance = 4
ChromaLineLen = 8
R3G3B3R2 = 12
G2B2R1G1 = 16
AspectCount = 20
LCL EQU <esp+>
MMXCODE1 SEGMENT
; extern void "C" MMX_YUV12ToRGB24ZoomBy2 (U8 * YPlane,
; U8 * UPlane,
; U8 * VPlane,
; UN FrameWidth,
; UN FrameHeight,
; UN YPitch,
; UN VPitch,
; UN AspectAdjustmentCount,
; U8 FAR * ColorConvertedFrame,
; U32 DCIOffset,
; U32 CCOffsetToLine0,
; IN CCOPitch,
; IN CCType)
;
; CCOffsetToLine0 is relative to ColorConvertedFrame.
;
;extrn finish_next_iteration:proc
;extrn start_next_iteration:proc
PUBLIC C MMX_YUV12ToRGB24ZoomBy2
; due to the need for the ebp reg, these parameter declarations aren't used,
; they are here so the assembler knows how many bytes to relieve from the stack
MMX_YUV12ToRGB24ZoomBy2:
push esi
push edi
push ebp
push ebx
sub esp,LocalFrameSize
mov eax,PD [esp+ColorConvertedFrame]
add eax,PD [esp+DCIOffset]
add eax,PD [esp+CCOffsetToLine0]
mov PD [esp+CCOCursor],eax
Ledx FrameHeight
add edx,edx
Sedx FrameHeight
Lecx YPitch
Lebx FrameWidth
Leax CCOPitch
lea esi,[ebx+2*ebx] ; 3*FrameWidth
Ledx AspectAdjustmentCount
sar ebx,1 ; FrameWidth/2
sub eax,esi ; CCOPitch-3*FrameWidth
Sebx ChromaLineLen ; FrameWidth/2
sub eax,esi ; CCOPitch-6*FrameWidth
Seax CCOSkipDistance ; CCOPitch-3*FrameWidth
Lesi VPlane
test edx,edx
jnz non_zero_AspectCount
inc edx
Sedx AspectAdjustmentCount
non_zero_AspectCount:
Sedx AspectCount
xor eax,eax
Ledi CCOCursor
mov edx,PD [esp+UPlane]
sub edx,esi
Lebp YPlane ; Fetch Y Pitch.
Lebx FrameWidth
add ebp,ebx
mov tmpYCursorEven,ebp
Leax YPitch
add ebp,eax
mov tmpYCursorOdd,ebp
sar ebx,1
add esi,ebx
add edx,esi ; edx is distance from V plane to U plane
neg ebx
Sebx FrameWidth
; Register Usage:
;
; ebp -- Y Line cursor. Chroma contribs go in lines above current Y line.
; esi -- V
; edx -- U
; edi -- Cursor over the color converted output image.
; ebx -- Number of points, we havn't done yet.
;
;
; ecx -- 3*CCOPitch
; eax -- CCOPitch.
;------------------------------------------------------------------------------
PrepareChromaLine:
Lebp AspectCount
Leax CCOPitch
sub ebp,4
Lebx FrameWidth
lea ecx,[eax+2*eax] ; pointer to fourth output line
ja continue
lea ecx,[2*eax]
ADDebp AspectAdjustmentCount
continue:
Sebp AspectCount
align 16
do_next_8x2_block:
;;;;;;; trsansformation U, V
movdt mm1, [edx+ebx] ; 4 u values
pxor mm0,mm0 ; mm0=0
movdt mm2, [esi+ebx] ; 4 v values
punpcklbw mm1,mm0 ; get 4 unsign u
psubw mm1,Minusg ; get 4 unsign u-128
punpcklbw mm2,mm0 ; get unsign v
psubw mm2,Minusg ; get unsign v-128
movq mm3,mm1 ; save the u unsign
mov ebp,tmpYCursorEven
punpcklwd mm1,mm2 ; get 2 low u,v unsign pairs
pmaddwd mm1,UVtG
movq mm5,mm3 ; save u-128
movq mm6,[ebp+2*ebx] ; mm6 has 8 y pixels
punpckhwd mm3,mm2 ; create high 2 unsign uv pairs
pmaddwd mm3,UVtG
psubusb mm6,Yadd ; mm6 has 8 y-16 pixels
packssdw mm1,mm3 ; packed the results to signed words
movq mm7,mm6 ; save the 8 y-16 pixels
punpcklbw mm6,mm0 ; mm6 has 4 low y-16 unsign
pmullw mm6,Ymul
punpckhbw mm7,mm0 ; mm7 has 4 high y-16 unsign
pmullw mm7,Ymul
movq mm4,mm1
movq PD [tmpBuffer],mm1 ; save 4 chroma G values
punpcklwd mm1,mm1 ; chroma G replicate low 2
movq mm0,mm6 ; low y
movq mm3,mm7 ; high y
punpckhwd mm4,mm4 ; chroma G replicate high 2
psubw mm6,mm1 ; 4 low G
movq mm1, mm5 ; 4 u values
psraw mm6,6 ; low G
psubw mm7,mm4 ; 4 high G values in signed 16 bit
punpcklwd mm1,mm1 ; replicate the 2 low u pixels
pmullw mm1,UtB
punpckhwd mm5,mm5
pmullw mm5,UtB
psraw mm7,6 ; high G
movq PD [tmpBuffer+8],mm1 ; low chroma B
packuswb mm6,mm7 ; mm6: G7 G6 G5 G4 G3 G2 G1 G0
movq PD [tmpBuffer+16],mm5 ; high chroma B
paddw mm5,mm3 ; 4 high B values in signed 16 bit
paddw mm1,mm0 ; 4 low B values in signed 16 bit
psraw mm5,6 ; high B
movq mm7, mm2
punpcklwd mm2,mm2 ; replicate the 2 low v pixels
psraw mm1,6 ; low B
pmullw mm2,VtR
punpckhwd mm7,mm7
pmullw mm7,VtR
packuswb mm1,mm5 ; mm1: B7 B6 B5 B4 B3 B2 B1 B0
movq PD [tmpBuffer+24],mm2 ; low chroma R
paddw mm2,mm0 ; 4 low R values in signed 16 bit
psraw mm2,6 ; low R
movq PD [tmpBuffer+32],mm7 ; high chroma R
paddw mm7,mm3 ; 4 high R values in signed 16 bit
psraw mm7,6 ; high R
movq PD [tmpBuffer+40],mm1 ; save B in memory
packuswb mm2,mm7 ; mm2: R7 R6 R5 R4 R3 R2 R1 R0
movq mm3,mm6 ; save G in mm3
punpcklbw mm1,mm1 ; mm1: B3 B3 B2 B2 B1 B1 B0 B0
movq mm0,mm1
punpcklwd mm1,mm1 ; mm1: B1 B1 B1 B1 B0 B0 B0 B0
pand mm1,MASK_036 ; mm1: 0 B1 0 0 B0 0 0 B0
punpcklbw mm6,mm6 ; mm6: G3 G3 G2 G2 G1 G1 G0 G0
movq mm5,mm6
punpcklwd mm6,mm6 ; mm6: G1 G1 G1 G1 G0 G0 G0 G0
movq mm4,mm2 ; save R in mm4
punpcklbw mm2,mm2 ; mm2: R3 R3 R2 R2 R1 R1 R0 R0
pand mm6,MASK_036 ; mm6: 0 G1 0 0 G0 0 0 G0
movq mm7,mm2
punpcklwd mm2,mm2 ; mm2: R1 R1 R1 R1 R0 R0 R0 R0
pand mm2,MASK_036 ; mm2: 0 R1 0 0 R0 0 0 R0
psllq mm6,8 ; mm6: G1 0 0 G0 0 0 G0 0
psllq mm2,16 ; mm2: 0 0 R0 0 0 R0 0 0
por mm1,mm6
por mm2,mm1 ; mm2: G1 B1 R0 G0 B0 R0 G0 B0
movq mm1,mm0 ; mm1: B3 B3 B2 B2 B1 B1 B0 B0
movq PD [edi],mm2 ; store result
psrlq mm1,24 ; mm1: 0 0 0 B3 B3 B2 B2 B1
movq PD [edi+eax],mm2 ; store result
punpcklwd mm1,mm1 ; mm1: B3 B2 B3 B2 B2 B1 B2 B1
;; 2nd phase
pand mm1,MASK_036 ; mm1: 0 B2 0 0 B2 0 0 B1
movq mm6,mm5 ; mm6: G3 G3 G2 G2 G1 G1 G0 G0
psllq mm1,8 ; mm1: B2 0 0 B2 0 0 B1 0
psrlq mm6,16 ; mm6: 0 0 G3 G3 G2 G2 G1 G1
movq mm2,mm7
pand mm6,MASK_036 ; mm6: 0 G2 0 0 G2 0 0 G1
psrlq mm2,16 ; mm2: 0 0 R3 R3 R2 R2 R1 R1
psllq mm6,16 ; mm6: 0 0 G2 0 0 G1 0 0
punpcklwd mm2,mm2 ; mm2: R2 R2 R2 R2 R1 R1 R1 R1
por mm1,mm6
pand mm2,MASK_036 ; mm2: 0 R2 0 0 R1 0 0 R1
movq mm6,mm5 ; mm6: G3 G3 G2 G2 G1 G1 G0 G0
por mm2,mm1 ; mm2: B2 R2 G2 B2 R1 G1 B1 R1
movq mm1,mm0 ; mm1: B3 B3 B2 B2 B1 B1 B0 B0
movq PD [edi+8],mm2 ; store result
psrlq mm1,48 ; mm1: 0 0 0 0 0 0 B3 B3
movq PD [edi+eax+8],mm2 ; store result
punpcklwd mm1,mm1 ; mm1: 0 0 0 0 B3 B3 B3 B3
;; 3nd phase
pand mm1,MASK_036 ; mm1: 0 0 0 0 B3 0 0 B3
psrlq mm6,40 ; mm6: 0 0 0 0 0 G3 G3 G2
punpcklwd mm6,mm6 ; mm6: 0 G3 0 G3 G3 G2 G3 G2
movq mm2,mm7
pand mm6,MASK_036 ; mm6: 0 G3 0 0 G3 0 0 G2
psllq mm1,16 ; mm1: 0 0 B3 0 0 B3 0 0
punpckhwd mm2,mm2 ; mm2: R3 R3 R3 R3 R2 0 R2 0
por mm1,mm6
pand mm2,MASK_147 ; mm2: R3 0 0 R3 0 0 R2 0
movq mm6,mm3 ; restore mm6 with G
por mm2,mm1 ; mm2: R3 G3 B3 R3 G3 B3 R2 G2
movq mm1,PD [tmpBuffer+40] ; restore mm1 with B
movq PD [edi+16],mm2 ; store result
punpckhbw mm1,mm1 ; mm1: B7 B7 B6 B6 B5 B5 B4 B4
movq PD [edi+eax+16],mm2 ; store result
movq mm2,mm4 ; restore mm2 with R
; 4th phase
movq mm0,mm1
punpckhbw mm6,mm6 ; mm6: G7 G7 G6 G6 G5 G5 G4 G4
punpcklwd mm1,mm1 ; mm1: B5 B5 B5 B5 B4 B4 B4 B4
movq mm5,mm6
pand mm1,MASK_036 ; mm1: 0 B5 0 0 B4 0 0 B4
punpcklwd mm6,mm6 ; mm6: G5 G5 G5 G5 G4 G4 G4 G4
pand mm6,MASK_036 ; mm6: 0 G5 0 0 G4 0 0 G4
punpckhbw mm2,mm2 ; mm2: R7 R7 R6 R6 R5 R5 R4 R4
psllq mm6,8 ; mm6: G5 0 0 G4 0 0 G4 0
movq mm7,mm2
punpcklwd mm2,mm2 ; mm2: R5 R5 R5 R5 R4 R4 R4 R4
pand mm2,MASK_036 ; mm2: 0 R5 0 0 R4 0 0 R4
psllq mm2,16 ; mm2: 0 0 R4 0 0 R4 0 0
por mm1,mm6
por mm2,mm1 ; mm2: G5 B5 R4 G4 B4 R4 G4 B4
movq mm1,mm0 ; mm1: B7 B7 B6 B6 B5 B5 B4 B4
movq PD [edi+24],mm2 ; store result
psrlq mm1,24 ; mm1: 0 0 0 B7 B7 B6 B6 B5
movq PD [edi+eax+24],mm2 ; store result
punpcklwd mm1,mm1 ; mm1: B7 B6 B7 B6 B6 B5 B6 B5
;; 5th phase
pand mm1,MASK_036 ; mm1: 0 B6 0 0 B6 0 0 B5
movq mm6,mm5 ; mm6: G7 G7 G6 G6 G5 G5 G4 G4
psllq mm1,8 ; mm1: B6 0 0 B6 0 0 B5 0
movq mm2,mm7
psrlq mm6,24 ; mm6: 0 0 0 G7 G7 G6 G6 G5
psrlq mm2,16 ; mm2: 0 0 R7 R7 R6 R6 R5 R5
punpcklwd mm6,mm6 ; mm6: G7 G6 G7 G6 G6 G5 G6 G5
pand mm6,MASK_036 ; mm6: 0 G6 0 0 G6 0 0 G5
punpcklwd mm2,mm2 ; mm2: R6 R6 R6 R6 R5 R5 R5 R5
pand mm2,MASK_036 ; mm2: 0 R6 0 0 R5 0 0 R5
psllq mm6,16 ; mm6: 0 0 G6 0 0 G5 0 0
;>>>>
por mm2,mm6
por mm2,mm1 ; mm2: B6 R6 G6 B6 R5 G5 B5 R5
movq mm1,mm0 ; mm1: B7 B7 B6 B6 B5 B5 B4 B4
psrlq mm1,48 ; mm1: 0 0 0 0 0 0 B7 B7
movq mm6,mm5 ; mm6: G7 G7 G6 G6 G5 G5 G4 G4
movq PD [edi+32],mm2 ; store result
punpcklwd mm1,mm1 ; mm1: 0 0 0 0 B7 B7 B7 B7
movq PD [edi+eax+32],mm2 ; store result
psrlq mm6,40 ; mm6: 0 0 0 0 0 G7 G7 G6
;; 6th phase
pand mm1,MASK_036 ; mm1: 0 0 0 0 B7 0 0 B7
punpcklwd mm6,mm6 ; mm6: 0 G7 0 G7 G7 G6 G7 G6
pand mm6,MASK_036 ; mm6: 0 G7 0 0 G7 0 0 G6
psllq mm1,16 ; mm1: 0 0 B7 0 0 B7 0 0
movq mm2,mm7
por mm1,mm6
mov ebp,tmpYCursorOdd
punpckhwd mm2,mm2 ; mm2: 0 R7 0 R7 R7 R6 R7 R6
pand mm2,MASK_147 ; mm2: 0 R7 0 0 R7 0 0 R6
; lea ecx, [eax+2*eax]
por mm2,mm1 ; mm2: R7 G7 B7 R7 G7 B7 R6 G6
;- start odd line
movq mm1,[ebp+2*ebx] ; mm1 has 8 y pixels
pxor mm0, mm0
psubusb mm1,Yadd ; mm1 has 8 pixels y-16
movq mm5,mm1
punpcklbw mm1,mm0 ; get 4 low y-16 unsign pixels word
pmullw mm1,Ymul ; low 4 luminance contribution
punpckhbw mm5,mm0 ; 4 high y-16
pmullw mm5,Ymul ; high 4 luminance contribution
movq PD [edi+40],mm2 ; store result
movq PD [edi+eax+40],mm2 ; store result
movq mm2,mm1
paddw mm2,PD [tmpBuffer+24] ; low 4 R
movq mm6,mm5
paddw mm5,PD [tmpBuffer+32] ; high 4 R
psraw mm2,6
psraw mm5,6
packuswb mm2,mm5 ; mm0: R7 R6 R5 R4 R3 R2 R1 R0
movq mm0,mm1
paddw mm0,PD [tmpBuffer+8] ; low 4 B
movq mm5,mm6
paddw mm5,PD [tmpBuffer+16] ; high 4 B
psraw mm0,6
movq mm3,PD [tmpBuffer] ; chroma G low 4
psraw mm5,6
packuswb mm0,mm5 ; mm2: B7 B6 B5 B4 B3 B2 B1 B0
movq mm4,mm3
punpcklwd mm3,mm3 ; replicate low 2
punpckhwd mm4,mm4 ; replicate high 2
psubw mm1,mm3 ; 4 low G
psubw mm6,mm4 ; 4 high G values in signed 16 bit
psraw mm1,6 ; low G
movq PD [tmpBuffer+40],mm0 ; save B in memory
psraw mm6,6 ; high G
packuswb mm1,mm6 ; mm1: G7 G6 G5 G4 G3 G2 G1 G0
movq mm4,mm2 ; save R in mm4
movq mm6,mm1
movq mm1,mm0
movq mm3,mm6 ; save G in mm3
punpcklbw mm1,mm1 ; mm1: B3 B3 B2 B2 B1 B1 B0 B0
movq mm0,mm1
punpcklwd mm1,mm1 ; mm1: B1 B1 B1 B1 B0 B0 B0 B0
pand mm1,MASK_036 ; mm1: 0 B1 0 0 B0 0 0 B0
punpcklbw mm6,mm6 ; mm6: G3 G3 G2 G2 G1 G1 G0 G0
movq mm5,mm6
punpcklwd mm6,mm6 ; mm6: G1 G1 G1 G1 G0 G0 G0 G0
pand mm6,MASK_036 ; mm6: 0 G1 0 0 G0 0 0 G0
punpcklbw mm2,mm2 ; mm2: R3 R3 R2 R2 R1 R1 R0 R0
psllq mm6,8 ; mm6: G1 0 0 G0 0 0 G0 0
movq mm7,mm2
punpcklwd mm2,mm2 ; mm2: R1 R1 R1 R1 R0 R0 R0 R0
por mm1,mm6
pand mm2,MASK_036 ; mm2: 0 R1 0 0 R0 0 0 R0
movq mm6,mm5 ; mm6: G3 G3 G2 G2 G1 G1 G0 G0
psllq mm2,16 ; mm2: 0 0 R0 0 0 R0 0 0
por mm2,mm1 ; mm2: G1 B1 R0 G0 B0 R0 G0 B0
psrlq mm6,24 ; mm6: 0 0 0 G3 G3 G2 G2 G1
movq PD [edi+ecx],mm2 ; store result
movq mm1,mm0 ; mm1: B3 B3 B2 B2 B1 B1 B0 B0
movq PD [edi+2*eax],mm2 ; store result
psrlq mm1,24 ; mm1: 0 0 0 B3 B3 B2 B2 B1
;; 2nd phase
punpcklwd mm1,mm1 ; mm1: B3 B2 B3 B2 B2 B1 B2 B1
movq mm2,mm7
pand mm1,MASK_036 ; mm1: 0 B2 0 0 B2 0 0 B1
punpcklwd mm6,mm6 ; mm6: G3 G2 G3 G2 G2 G1 G2 G1
pand mm6,MASK_036 ; mm6: 0 G2 0 0 G2 0 0 G1
psllq mm1,8 ; mm1: B2 0 0 B2 0 0 B1 0
psllq mm6,16 ; mm6: 0 0 G2 0 0 G1 0 0
psrlq mm2,16 ; mm2: 0 0 R3 R3 R2 R2 R1 R1
por mm1,mm6
punpcklwd mm2,mm2 ; mm2: R2 R2 R2 R2 R1 R1 R1 R1
movq mm6,mm5 ; mm6: G3 G3 G2 G2 G1 G1 G0 G0
pand mm2,MASK_036 ; mm2: 0 R2 0 0 R1 0 0 R1
psrlq mm6,40 ; mm6: 0 0 0 0 0 G3 G3 G2
por mm2,mm1 ; mm2: B2 R2 G2 B2 R1 G1 B1 R1
movq mm1,mm0 ; mm1: B3 B3 B2 B2 B1 B1 B0 B0
movq PD [edi+ecx+8],mm2 ; store result
punpckhwd mm1,mm1 ; mm1: B3 B3 B3 B3 0 0 0 0
movq PD [edi+2*eax+8],mm2 ; store result
punpcklwd mm6,mm6 ; mm6: 0 G3 0 G3 G3 G2 G3 G2
;; 3nd phase
pand mm1,MASK_147 ; mm1: 0 0 0 0 B3 0 0 B3
movq mm2,mm7
psrlq mm1,16 ; mm1: 0 0 B3 0 0 B3 0 0
pand mm6,MASK_036 ; mm6: 0 G3 0 0 G3 0 0 G2
psrlq mm2,40 ; mm2: 0 0 0 0 0 R3 R3 R2
punpcklwd mm2,mm2 ; mm2: 0 R3 0 R3 R3 R2 R3 R2
por mm1,mm6
pand mm2,MASK_036 ; mm2: 0 R3 0 0 R3 0 0 R2
movq mm6,mm3 ; restore mm6 with G
psllq mm2,8 ; mm2: R3 0 0 R3 0 0 R2 0
por mm2,mm1 ; mm2: R3 G3 B3 R3 G3 B3 R2 G2
movq mm1,PD [tmpBuffer+40] ; restore mm1 with B
movq PD [edi+ecx+16],mm2 ; store result
psrlq mm1,32 ; 0 0 0 0 B7 B6 B5 B4
movq PD [edi+2*eax+16],mm2 ; store result
psrlq mm6,32 ; 0 0 0 0 G7 G6 G5 G4
movq mm2,mm4 ; restore mm2 with R
punpcklbw mm1,mm1 ; mm1: B7 B7 B6 B6 B5 B5 B4 B4
; 4th phase
psrlq mm2,32 ; 0 0 0 0 R7 R6 R5 R4
movq mm0,mm1
punpcklwd mm1,mm1 ; mm1: B5 B5 B5 B5 B4 B4 B4 B4
pand mm1,MASK_036 ; mm1: 0 B5 0 0 B4 0 0 B4
punpcklbw mm6,mm6 ; mm6: G7 G7 G6 G6 G5 G5 G4 G4
punpcklbw mm2,mm2 ; mm2: R7 R7 R6 R6 R5 R5 R4 R4
movq mm5,mm6
punpcklwd mm6,mm6 ; mm6: G5 G5 G5 G5 G4 G4 G4 G4
movq mm7,mm2
pand mm6,MASK_036 ; mm6: 0 G5 0 0 G4 0 0 G4
punpcklwd mm2,mm2 ; mm2: R5 R5 R5 R5 R4 R4 R4 R4
pand mm2,MASK_036 ; mm2: 0 R5 0 0 R4 0 0 R4
psllq mm6,8 ; mm6: G5 0 0 G4 0 0 G4 0
psllq mm2,16 ; mm2: 0 0 R4 0 0 R4 0 0
por mm1,mm6
por mm2,mm1 ; mm2: G5 B5 R4 G4 B4 R4 G4 B4
movq mm1,mm0 ; mm1: B7 B7 B6 B6 B5 B5 B4 B4
psrlq mm1,24 ; mm1: 0 0 0 B7 B7 B6 B6 B5
movq mm6,mm5 ; mm6: G7 G7 G6 G6 G5 G5 G4 G4
movq PD [edi+ecx+24],mm2 ; store result
punpcklwd mm1,mm1 ; mm1: B7 B6 B7 B6 B6 B5 B6 B5
movq PD [edi+2*eax+24],mm2 ; store result
psrlq mm6,24 ; mm6: 0 0 0 G7 G7 G6 G6 G5
;; 5th phase
pand mm1,MASK_036 ; mm1: 0 B6 0 0 B6 0 0 B5
punpcklwd mm6,mm6 ; mm6: G7 G6 G7 G6 G6 G5 G6 G5
pand mm6,MASK_036 ; mm6: 0 G6 0 0 G6 0 0 G5
psllq mm1,8 ; mm1: B6 0 0 B6 0 0 B5 0
psllq mm6,16 ; mm6: 0 0 G6 0 0 G5 0 0
movq mm2,mm7
psrlq mm2,16 ; mm2: 0 0 R7 R7 R6 R6 R5 R5
por mm1,mm6
punpcklwd mm2,mm2 ; mm2: R6 R6 R6 R6 R5 R5 R5 R5
movq mm6,mm5 ; mm6: G7 G7 G6 G6 G5 G5 G4 G4
pand mm2,MASK_036 ; mm2: 0 R6 0 0 R5 0 0 R5
psrlq mm6,40 ; mm6: 0 0 0 0 0 G7 G7 G6
por mm2,mm1 ; mm2: B6 R6 G6 B6 R5 G5 B5 R5
punpcklwd mm6,mm6 ; mm6: 0 G7 0 G7 G7 G6 G7 G6
pand mm6,MASK_036 ; mm6: 0 G7 0 0 G7 0 0 G6
movq mm1,mm0 ; mm1: B7 B7 B6 B6 B5 B5 B4 B4
movq PD [edi+ecx+32],mm2 ; store result
punpckhwd mm1,mm1 ; mm1: B7 B7 B7 B7 0 0 0 0
movq PD [edi+2*eax+32],mm2 ; store result
movq mm2,mm7
;; 6th phase
pand mm1,MASK_147 ; mm1: B7 0 0 B7 0 0 0 0
psrlq mm2,40 ; mm2: 0 0 0 0 0 R7 R7 R6
punpcklwd mm2,mm2 ; mm2: 0 R7 0 R7 R7 R6 R7 R6
pand mm2,MASK_036 ; mm2: 0 R7 0 0 R7 0 0 R6
psrlq mm1,16 ; mm1: 0 0 B7 0 0 B7 0 0
psllq mm2,8 ; mm2: R7 0 0 R7 0 0 R6 0
por mm1,mm6
por mm2,mm1 ; mm2: R7 G7 B7 R7 G7 B7 R6 G6
movq PD [edi+ecx+40],mm2 ; store result
movq PD [edi+2*eax+40],mm2 ; store result
add edi,48 ; ih take 48 instead of 12 output
add ebx,4 ; ? to take 4 pixels together instead of 2
jl do_next_8x2_block ; ? update the loop for 8 y pixels at once
ADDedi CCOSkipDistance
add edi,ecx ; set output pointer after fourth line
Leax YPitch
mov ebp,tmpYCursorOdd
lea ebp,[ebp+2*eax] ; skip two lines
mov tmpYCursorOdd,ebp
mov ebp,tmpYCursorEven
lea ebp,[ebp+2*eax]
mov tmpYCursorEven,ebp
ADDesi ChromaPitch
ADDedx ChromaPitch
sub PD FrameHeight[esp],4
ja PrepareChromaLine
;------------------------------------------------------------------------------
finish:
add esp,LocalFrameSize
emms
pop ebx
pop ebp
pop edi
pop esi
ret
MMXCODE1 ENDS
END
;-------------------------------------------------------------------------
; cxm12161 -- This function performs YUV12-to-RGB16 color conversion for H26x.
; It handles any format in which there are three fields, the low
; order field being B and fully contained in the low order byte, the
; second field being G and being somewhere in bits 4 through 11,
; and the high order field being R and fully contained in the high
; order byte.
;
; The YUV12 input is planar, 8 bits per pel. The Y plane may have
; a pitch of up to 768. It may have a width less than or equal
; to the pitch. It must be DWORD aligned, and preferably QWORD
; aligned. Pitch and Width must be a multiple of four. For best
; performance, Pitch should not be 4 more than a multiple of 32.
; Height may be any amount, but must be a multiple of two. The U
; and V planes may have a different pitch than the Y plane, subject
; to the same limitations.
;
include iammx.inc
include locals.inc
.586
.xlist
.list
ASSUME ds:FLAT, cs:FLAT, ss:FLAT
MMXCODE1 SEGMENT PARA USE32 PUBLIC 'CODE'
MMXCODE1 ENDS
MMXDATA1 SEGMENT PARA USE32 PUBLIC 'DATA'
MMXDATA1 ENDS
MMXDATA1 SEGMENT
ALIGN 8
RGB_formats:
dd RGB565
dd RGB555
dd RGB664
dd RGB655
Minusg dd 00800080h, 00800080h
Yadd dd 10101010h, 10101010h
VtR dd 00660066h, 00660066h ;01990199h,01990199h
VtG dd 00340034h, 00340034h ;00d000d0h,00d000d0h
UtG dd 00190019h, 00190019h ;00640064h,00640064h
UtB dd 00810081h, 00810081h ;02050205h,02050205h
Ymul dd 004a004ah, 004a004ah ;012a012ah,012a012ah
UVtG dd 00340019h, 00340019h ;00d00064h,00d00064h
VtRUtB dd 01990205h, 01990205h
fourbitu dd 0f0f0f0f0h, 0f0f0f0f0h
fivebitu dd 0e0e0e0e0h, 0e0e0e0e0h
sixbitu dd 0c0c0c0c0h, 0c0c0c0c0h
MMXDATA1 ENDS
LocalFrameSize = 156
RegisterStorageSize = 16
; Arguments:
YPlane = LocalFrameSize + RegisterStorageSize + 4
UPlane = LocalFrameSize + RegisterStorageSize + 8
VPlane = LocalFrameSize + RegisterStorageSize + 12
FrameWidth = LocalFrameSize + RegisterStorageSize + 16
FrameHeight = LocalFrameSize + RegisterStorageSize + 20
YPitch = LocalFrameSize + RegisterStorageSize + 24
ChromaPitch = LocalFrameSize + RegisterStorageSize + 28
AspectAdjustmentCount = LocalFrameSize + RegisterStorageSize + 32
ColorConvertedFrame = LocalFrameSize + RegisterStorageSize + 36
DCIOffset = LocalFrameSize + RegisterStorageSize + 40
CCOffsetToLine0 = LocalFrameSize + RegisterStorageSize + 44
CCOPitch = LocalFrameSize + RegisterStorageSize + 48
CCType = LocalFrameSize + RegisterStorageSize + 52
EndOfArgList = LocalFrameSize + RegisterStorageSize + 56
; Locals (on local stack frame)
CCOCursor = 0
CCOSkipDistance = 4
ChromaLineLen = 8
YCursor = 12
DistanceFromVToU = 16
EndOfChromaLine = 20
AspectCount = 24
AspectBaseCount = 28
tmpYCursorEven = 32
tmpYCursorOdd = 36
tmpCCOPitch = 40
temp_mmx = 44 ; note it is 48 bytes
RLeftShift = 92
GLeftShift = 100
RRightShift = 108
GRightShift = 116
BRightShift = 124
RUpperLimit = 132
GUpperLimit = 140
BUpperLimit = 148
MMXCODE1 SEGMENT
; extern void "C" MMX_YUV12ToRGB16 (
; U8* YPlane,
; U8* UPlane,
; U8* VPlane,
; UN FrameWidth,
; UN FrameHeight,
; UN YPitch,
; UN VPitch,
; UN AspectAdjustmentCount,
; U8* ColorConvertedFrame,
; U32 DCIOffset,
; U32 CCOffsetToLine0,
; IN CCOPitch,
; IN CCType)
;
; The local variables are on the stack,
; The tables are in the one and only data segment.
;
; CCOffsetToLine0 is relative to ColorConvertedFrame.
; CCType used by RGB color convertors to determine the exact conversion type.
; RGB565 = 0
; RGB555 = 1
; RGB664 = 2
; RGB655 = 3
PUBLIC C MMX_YUV12ToRGB16
MMX_YUV12ToRGB16:
push esi
push edi
push ebp
push ebx
sub esp, LocalFrameSize
mov eax, [esp+CCType]
cmp eax,4
jae finish
jmp RGB_formats[eax*4]
RGB555:
xor eax, eax
mov ebx, 2 ; 10-8 for byte shift
mov [esp+RLeftShift], ebx
mov [esp+RLeftShift+4], eax
mov ebx, 5
mov [esp+GLeftShift], ebx
mov [esp+GLeftShift+4], eax
mov ebx, 9
mov [esp+RRightShift], ebx
mov [esp+RRightShift+4], eax
mov [esp+GRightShift], ebx
mov [esp+GRightShift+4], eax
mov [esp+BRightShift], ebx
mov [esp+BRightShift+4], eax
movq mm0, fivebitu
movq [esp+RUpperLimit], mm0
movq [esp+GUpperLimit], mm0
movq [esp+BUpperLimit], mm0
jmp RGBEND
RGB664:
xor eax, eax
mov ebx, 2 ; 8-6
mov [esp+RLeftShift], ebx
mov [esp+RLeftShift+4], eax
mov ebx, 4
mov [esp+GLeftShift], ebx
mov [esp+GLeftShift+4], eax
mov ebx, 8
mov [esp+RRightShift], ebx
mov [esp+RRightShift+4], eax
mov [esp+GRightShift], ebx
mov [esp+GRightShift+4], eax
mov ebx, 10
mov [esp+BRightShift], ebx
mov [esp+BRightShift+4], eax
movq mm0, sixbitu
movq [esp+RUpperLimit], mm0
movq [esp+GUpperLimit], mm0
movq mm0, fourbitu
movq [esp+BUpperLimit], mm0
jmp RGBEND
RGB655:
xor eax, eax
mov ebx, 2 ; 8-6
mov [esp+RLeftShift], ebx
mov [esp+RLeftShift+4], eax
mov ebx, 5
mov [esp+GLeftShift], ebx
mov [esp+GLeftShift+4], eax
mov ebx, 8
mov [esp+RRightShift], ebx
mov [esp+RRightShift+4], eax
mov ebx, 9
mov [esp+GRightShift], ebx
mov [esp+GRightShift+4], eax
mov [esp+BRightShift], ebx
mov [esp+BRightShift+4], eax
movq mm0, sixbitu
movq [esp+RUpperLimit], mm0
movq mm0, fivebitu
movq [esp+GUpperLimit], mm0
movq [esp+BUpperLimit], mm0
jmp RGBEND
RGB565:
xor eax, eax
mov ebx, 3 ; 8-5
mov [esp+RLeftShift], ebx
mov [esp+RLeftShift+4], eax
mov ebx, 5
mov [esp+GLeftShift], ebx
mov [esp+GLeftShift+4], eax
mov ebx, 9
mov [esp+RRightShift], ebx
mov [esp+RRightShift+4], eax
mov [esp+BRightShift], ebx
mov [esp+BRightShift+4], eax
mov ebx, 8
mov [esp+GRightShift], ebx
mov [esp+GRightShift+4], eax
movq mm0, fivebitu
movq [esp+RUpperLimit], mm0
movq [esp+BUpperLimit], mm0
movq mm0, sixbitu
movq [esp+GUpperLimit], mm0
; jmp RGBEND
RGBEND:
mov ebx, [esp+VPlane]
mov ecx, [esp+UPlane]
sub ecx, ebx
mov [esp+DistanceFromVToU], ecx
mov eax, [esp+ColorConvertedFrame]
add eax, [esp+DCIOffset]
add eax, [esp+CCOffsetToLine0]
mov [esp+CCOCursor], eax
Lecx YPitch
Lebx FrameWidth
Leax CCOPitch
sub eax, ebx ; CCOPitch-FrameWidth
sub eax, ebx ; CCOPitch-2*FrameWidth
sar ebx, 1 ; FrameWidth/2
Lesi YPlane ; Fetch cursor over luma plane.
Sebx ChromaLineLen ; FrameWidth/2
Seax CCOSkipDistance ; CCOPitch-3*FrameWidth
Sesi YCursor
Ledx AspectAdjustmentCount
Lesi VPlane
cmp edx,1
je finish
Sedx AspectCount
Sedx AspectBaseCount
xor eax, eax
Ledi ChromaLineLen
Sedi EndOfChromaLine
Ledi CCOCursor
Ledx DistanceFromVToU
Lebp YCursor ; Fetch Y Pitch.
Lebx FrameWidth
add ebp, ebx
Sebp tmpYCursorEven
Leax YPitch
add ebp, eax
Sebp tmpYCursorOdd
sar ebx, 1
add esi, ebx
add edx, esi
neg ebx
Sebx FrameWidth
; Register Usage:
;
;------------------------------------------------------------------------------
PrepareChromaLine:
Lebp AspectCount
Lebx FrameWidth
sub ebp,2
Leax CCOPitch
Seax tmpCCOPitch
ja continue
xor eax,eax
ADDebp AspectAdjustmentCount
Seax tmpCCOPitch
continue:
Sebp AspectCount
do_next_8x2_block:
Lebp tmpYCursorEven
; here is even line
movdt mm1, [edx+ebx] ; 4 u values
pxor mm0, mm0 ; mm0=0
movdt mm2, [esi+ebx] ; 4 v values
punpcklbw mm1, mm0 ; get 4 unsign u
psubw mm1, Minusg ; get 4 unsign u-128
punpcklbw mm2, mm0 ; get unsign v
psubw mm2, Minusg ; get unsign v-128
movq mm3, mm1 ; save the u-128 unsign
movq mm5, mm1 ; save u-128 unsign
punpcklwd mm1, mm2 ; get 2 low u, v unsign pairs
pmaddwd mm1, UVtG
punpckhwd mm3, mm2 ; create high 2 unsign uv pairs
pmaddwd mm3, UVtG
movq temp_mmx[esp], mm2 ; save v-128
movq mm6, [ebp+2*ebx] ; mm6 has 8 y pixels
psubusb mm6, Yadd ; mm6 has 8 y-16 pixels
packssdw mm1, mm3 ; packed the results to signed words
movq mm7, mm6 ; save the 8 y-16 pixels
punpcklbw mm6, mm0 ; mm6 has 4 low y-16 unsign
pmullw mm6, Ymul
punpckhbw mm7, mm0 ; mm7 has 4 high y-16 unsign
pmullw mm7, Ymul
movq mm4, mm1
movq temp_mmx[esp+8], mm1 ; save 4 chroma G values
punpcklwd mm1, mm1 ; chroma G replicate low 2
movq mm0, mm6 ; low y
punpckhwd mm4, mm4 ; chroma G replicate high 2
movq mm3, mm7 ; high y
psubw mm6, mm1 ; 4 low G
psraw mm6, [esp+GRightShift]
psubw mm7, mm4 ; 4 high G values in signed 16 bit
movq mm2, mm5
punpcklwd mm5, mm5 ; replicate the 2 low u pixels
pmullw mm5, UtB
punpckhwd mm2, mm2
psraw mm7, [esp+GRightShift]
pmullw mm2, UtB
packuswb mm6, mm7 ; mm6: G7 G6 G5 G4 G3 G2 G1 G0
movq temp_mmx[esp+16], mm5 ; low chroma B
paddw mm5, mm0 ; 4 low B values in signed 16 bit
movq temp_mmx[esp+40], mm2 ; high chroma B
paddw mm2, mm3 ; 4 high B values in signed 16 bit
psraw mm5, [esp+BRightShift] ; low B scaled down by 6+(8-5)
psraw mm2, [esp+BRightShift] ; high B scaled down by 6+(8-5)
packuswb mm5, mm2 ; mm5: B7 B6 B5 B4 B3 B2 B1 B0
movq mm2, temp_mmx[esp] ; 4 v values
movq mm1, mm5 ; save B
movq mm7, mm2
punpcklwd mm2, mm2 ; replicate the 2 low v pixels
pmullw mm2, VtR
punpckhwd mm7, mm7
pmullw mm7, VtR
paddusb mm1, [esp+BUpperLimit] ; mm1: saturate B+0FF-15
movq temp_mmx[esp+24], mm2 ; low chroma R
paddw mm2, mm0 ; 4 low R values in signed 16 bit
psraw mm2, [esp+RRightShift] ; low R scaled down by 6+(8-5)
pxor mm4, mm4 ; mm4=0 for 8->16 conversion
movq temp_mmx[esp+32], mm7 ; high chroma R
paddw mm7, mm3 ; 4 high R values in signed 16 bit
psraw mm7, [esp+RRightShift] ; high R scaled down by 6+(8-5)
psubusb mm1, [esp+BUpperLimit]
packuswb mm2, mm7 ; mm2: R7 R6 R5 R4 R3 R2 R1 R0
paddusb mm6, [esp+GUpperLimit] ; G fast patch ih
psubusb mm6, [esp+GupperLimit] ; fast patch ih
paddusb mm2, [esp+RUpperLimit] ; R
psubusb mm2, [esp+RUpperLimit]
; here we are packing from RGB24 to RGB16
; input:
; mm6: G7 G6 G5 G4 G3 G2 G1 G0
; mm1: B7 B6 B5 B4 B3 B2 B1 B0
; mm2: R7 R6 R5 R4 R3 R2 R1 R0
; assuming 8 original pixels in 0-H representation on mm6, mm5, mm2
; when H=2**xBITS-1 (x is for R G B)
; output:
; mm1- result: 4 low RGB16
; mm7- result: 4 high RGB16
; using: mm0- zero register
; mm3- temporary results
; algorithm:
; for (i=0; i<8; i++) {
; RGB[i]=256*(R[i]<<(8-5))+(G[i]<<5)+B[i];
; }
psllq mm2, [esp+RLeftShift] ; position R in the most significant part of the byte
movq mm7, mm1 ; mm1: Save B
; note: no need for shift to place B on the least significant part of the byte
; R in left position, B in the right position so they can be combined
punpcklbw mm1, mm2 ; mm1: 4 low 16 bit RB
pxor mm0, mm0 ; mm0: 0
punpckhbw mm7, mm2 ; mm5: 4 high 16 bit RB
movq mm3, mm6 ; mm3: G
punpcklbw mm6, mm0 ; mm6: low 4 G 16 bit
psllw mm6, [esp+GLeftShift] ; shift low G 5 positions
punpckhbw mm3, mm0 ; mm3: high 4 G 16 bit
por mm1, mm6 ; mm1: low RBG16
psllw mm3, [esp+GLeftShift] ; shift high G 5 positions
por mm7, mm3 ; mm5: high RBG16
Lebp tmpYCursorOdd ; moved to here to save cycles before odd line
movq [edi], mm1 ; !! aligned
;- start odd line
movq mm1, [ebp+2*ebx] ; mm1 has 8 y pixels
pxor mm2, mm2
psubusb mm1, Yadd ; mm1 has 8 pixels y-16
movq mm5, mm1
punpcklbw mm1, mm2 ; get 4 low y-16 unsign pixels word
pmullw mm1, Ymul ; low 4 luminance contribution
punpckhbw mm5, mm2 ; 4 high y-16
pmullw mm5, Ymul ; high 4 luminance contribution
movq [edi+8], mm7 ; !! aligned
movq mm0, mm1
paddw mm0, temp_mmx[esp+24] ; low 4 R
movq mm6, mm5
psraw mm0, [esp+RRightShift] ; low R scaled down by 6+(8-5)
paddw mm5, temp_mmx[esp+32] ; high 4 R
movq mm2, mm1
psraw mm5, [esp+RRightShift] ; high R scaled down by 6+(8-5)
paddw mm2, temp_mmx[esp+16] ; low 4 B
packuswb mm0, mm5 ; mm0: R7 R6 R5 R4 R3 R2 R1 R0
psraw mm2, [esp+BRightShift] ; low B scaled down by 6+(8-5)
movq mm5, mm6
paddw mm6, temp_mmx[esp+40] ; high 4 B
psraw mm6, [esp+BRightShift] ; high B scaled down by 6+(8-5)
movq mm3, temp_mmx[esp+8] ; chroma G low 4
packuswb mm2, mm6 ; mm2: B7 B6 B5 B4 B3 B2 B1 B0
movq mm4, mm3
punpcklwd mm3, mm3 ; replicate low 2
punpckhwd mm4, mm4 ; replicate high 2
psubw mm1, mm3 ; 4 low G
psraw mm1, [esp+GRightShift] ; low G scaled down by 6+(8-5)
psubw mm5, mm4 ; 4 high G values in signed 16 bit
psraw mm5, [esp+GRightShift] ; high G scaled down by 6+(8-5)
paddusb mm2, [esp+BUpperLimit] ; mm1: saturate B+0FF-15
packuswb mm1, mm5 ; mm1: G7 G6 G5 G4 G3 G2 G1 G0
psubusb mm2, [esp+BupperLimit]
paddusb mm1, [esp+GUpperLimit] ; G
psubusb mm1, [esp+GUpperLimit]
paddusb mm0, [esp+RUpperLimit] ; R
Leax tmpCCOPitch
psubusb mm0, [esp+RUpperLimit]
; here we are packing from RGB24 to RGB16
; mm1: G7 G6 G5 G4 G3 G2 G1 G0
; mm2: B7 B6 B5 B4 B3 B2 B1 B0
; mm0: R7 R6 R5 R4 R3 R2 R1 R0
; output:
; mm2- result: 4 low RGB16
; mm7- result: 4 high RGB16
; using: mm4- zero register
; mm3- temporary results
psllq mm0, [esp+RLeftShift] ; position R in the most significant part of the byte
movq mm7, mm2 ; mm7: Save B
; note: no need for shift to place B on the least significant part of the byte
; R in left position, B in the right position so they can be combined
punpcklbw mm2, mm0 ; mm1: 4 low 16 bit RB
pxor mm4, mm4 ; mm4: 0
movq mm3, mm1 ; mm3: G
punpckhbw mm7, mm0 ; mm7: 4 high 16 bit RB
punpcklbw mm1, mm4 ; mm1: low 4 G 16 bit
punpckhbw mm3, mm4 ; mm3: high 4 G 16 bit
psllw mm1, [esp+GLeftShift] ; shift low G 5 positions
por mm2, mm1 ; mm2: low RBG16
psllw mm3, [esp+GLeftShift] ; shift high G 5 positions
por mm7, mm3 ; mm7: high RBG16
movq [edi+eax], mm2
movq [edi+eax+8], mm7 ; aligned
add edi, 16 ; ih take 16 bytes (8 pixels-16 bit)
add ebx, 4 ; ? to take 4 pixels together instead of 2
jl do_next_8x2_block ; ? update the loop for 8 y pixels at once
ADDedi CCOSkipDistance ; go to begin of next line
ADDedi tmpCCOPitch ; skip odd line (if it is needed)
; Leax AspectCount
; Lebp CCOPitch ; skip odd line
; sub eax, 2
; jg @f
; Addeax AspectBaseCount
; xor ebp, ebp
;@@:
; Seax AspectCount
; add edi, ebp
Leax YPitch
Lebp tmpYCursorOdd
add ebp, eax ; skip one line
; lea ebp, [ebp+2*eax] ; skip two lines
Sebp tmpYCursorEven
; Sebp tmpYCursorOdd
add ebp, eax ; skip one line
Sebp tmpYCursorOdd
; Lebp tmpYCursorEven
; lea ebp, [ebp+2*eax]
; Sebp tmpYCursorEven
ADDesi ChromaPitch
ADDedx ChromaPitch
; Leax YLimit ; Done with last line?
; cmp ebp, eax
; jbe PrepareChromaLine
sub PD FrameHeight[esp],2
ja PrepareChromaLine
;------------------------------------------------------------------------------
finish:
emms
add esp, LocalFrameSize
pop ebx
pop ebp
pop edi
pop esi
retn
MMXCODE1 ENDS
END
;-------------------------------------------------------------------------
; cx512162 -- This function performs zoom-by-2 YUV12-to-RGB16 color conversion
; for H26x. It handles 555, 655, 565, and 664 formats.
;
; The YUV12 input is planar, 8 bits per pel. The Y plane may have
; a pitch of up to 768. It may have a width less than or equal
; to the pitch. It must be DWORD aligned, and preferably QWORD
; aligned. Pitch and Width must be a multiple of eight.
; Height must be a multiple of two. The U and V planes may have
; a different pitch than the Y plane, subject to the same limitations.
;
; The color convertor is non destructive.
;-------------------------------------------------------------------------
include iammx.inc
include locals.inc
.586
.xlist
.list
ASSUME ds:FLAT, cs:FLAT, ss:FLAT
RTIME16=1
DITHER=1
MMXDATA1 SEGMENT PARA USE32 PUBLIC 'DATA'
ALIGN 8
RGB_formats:
dd RGB565
dd RGB555
dd RGB664
dd RGB655
Minusg dd 00800080h, 00800080h
VtR dd 00660066h, 00660066h ;01990199h,01990199h
VtG dd 00340034h, 00340034h ;00d000d0h,00d000d0h
UtG dd 00190019h, 00190019h ;00640064h,00640064h
UtB dd 00810081h, 00810081h ;02050205h,02050205h
Ymul dd 004a004ah, 004a004ah ;012a012ah,012a012ah
Yadd dd 10101010h, 10101010h
UVtG dd 00340019h, 00340019h ;00d00064h,00d00064h
VtRUtB dd 01990205h, 01990205h
fourbitu dd 0f0f0f0f0h, 0f0f0f0f0h
fivebitu dd 0e0e0e0e0h, 0e0e0e0e0h
sixbitu dd 0c0c0c0c0h, 0c0c0c0c0h
shiftone dd 02020202h, 02020202h
shifttwo dd 04040404h, 04040404h
shiftthree dd 08080808h, 08080808h
MMXDATA1 ENDS
LocalFrameSize = 174
RegisterStorageSize = 16
; Arguments:
YPlane = LocalFrameSize + RegisterStorageSize + 4
UPlane = LocalFrameSize + RegisterStorageSize + 8
VPlane = LocalFrameSize + RegisterStorageSize + 12
FrameWidth = LocalFrameSize + RegisterStorageSize + 16
FrameHeight = LocalFrameSize + RegisterStorageSize + 20
YPitch = LocalFrameSize + RegisterStorageSize + 24
ChromaPitch = LocalFrameSize + RegisterStorageSize + 28
AspectAdjustmentCount = LocalFrameSize + RegisterStorageSize + 32
ColorConvertedFrame = LocalFrameSize + RegisterStorageSize + 36
DCIOffset = LocalFrameSize + RegisterStorageSize + 40
CCOffsetToLine0 = LocalFrameSize + RegisterStorageSize + 44
CCOPitch = LocalFrameSize + RegisterStorageSize + 48
CCType = LocalFrameSize + RegisterStorageSize + 52
EndOfArgList = LocalFrameSize + RegisterStorageSize + 56
; Locals (on local stack frame)
CCOCursor = 0
CCOSkipDistance = 4
ChromaLineLen = 8
YCursor = 12
DistanceFromVToU = 16
EndOfChromaLine = 20
AspectCount = 24
tmpYCursorEven = 28
tmpYCursorOdd = 32
temp_mmx = 36 ; 48 bytes
RLeftShift = 84
GLeftShift = 92
RRightShift = 100
GRightShift = 108
BRightShift = 116
RUpperLimit = 124
GUpperLimit = 132
BUpperLimit = 140
RDither = 148
GDither = 156
BDither = 164
; Switches used by RGB color convertors to determine the exact conversion type.
LCL EQU <esp+>
MMXCODE1 SEGMENT PARA USE32 PUBLIC 'CODE'
; void FAR ASM_CALLTYPE YUV12ToRGB16ZoomBy2 (
; U8* YPlane,
; U8* UPlane,
; U8* VPlane,
; UN FrameWidth,
; UN FrameHeight,
; UN YPitch,
; UN UVPitch,
; UN AspectAdjustmentCount,
; U8* ColorConvertedFrame,
; U32 DCIOffset,
; U32 CCOffsetToLine0,
; int CCOPitch,
; int CCType)
;
; The local variables are on the stack,
; The tables are in the one and only data segment.
;
; CCOffsetToLine0 is relative to ColorConvertedFrame.
;
PUBLIC C MMX_YUV12ToRGB16ZoomBy2
MMX_YUV12ToRGB16ZoomBy2:
push esi
push edi
push ebp
push ebx
sub esp, LocalFrameSize
mov eax, [esp+CCType]
cmp eax,4
jae finish
jmp RGB_formats[eax*4]
RGB555:
xor eax, eax
mov ebx, 2 ; 10-8 for byte shift
mov [esp+RLeftShift], ebx
mov [esp+RLeftShift+4], eax
mov ebx, 5
mov [esp+GLeftShift], ebx
mov [esp+GLeftShift+4], eax
mov ebx, 9
mov [esp+RRightShift], ebx
mov [esp+RRightShift+4], eax
mov [esp+GRightShift], ebx
mov [esp+GRightShift+4], eax
mov [esp+BRightShift], ebx
mov [esp+BRightShift+4], eax
movq mm0, fivebitu
movq [esp+RUpperLimit], mm0
movq [esp+GUpperLimit], mm0
movq [esp+BUpperLimit], mm0
movq mm0,shifttwo ; 1<<(7-5) for dither
movq [esp+RDither],mm0
movq [esp+GDither],mm0
movq [esp+BDither],mm0
jmp RGBEND
RGB664:
xor eax, eax
mov ebx, 2 ; 8-6
mov [esp+RLeftShift], ebx
mov [esp+RLeftShift+4], eax
mov ebx, 4
mov [esp+GLeftShift], ebx
mov [esp+GLeftShift+4], eax
mov ebx, 8
mov [esp+RRightShift], ebx
mov [esp+RRightShift+4], eax
mov [esp+GRightShift], ebx
mov [esp+GRightShift+4], eax
movq mm0, sixbitu
movq [esp+RUpperLimit], mm0
movq [esp+GUpperLimit], mm0
mov ebx, 10
mov [esp+BRightShift], ebx
mov [esp+BRightShift+4], eax
movq mm0, fourbitu
movq [esp+BUpperLimit], mm0
movq mm0,shiftone ; 1<<(7-6) for dither
movq [esp+RDither],mm0
movq [esp+GDither],mm0
movq mm0,shiftthree ; 1<<(7-4) for dither
movq [esp+BDither],mm0
jmp RGBEND
RGB655:
xor eax, eax
mov ebx,2 ; 8-6
mov [esp+RLeftShift], ebx
mov [esp+RLeftShift+4], eax
mov ebx,5
mov [esp+GLeftShift], ebx
mov [esp+GLeftShift+4], eax
mov ebx,9
mov [esp+GRightShift], ebx
mov [esp+GRightShift+4], eax
mov [esp+BRightShift], ebx
mov [esp+BRightShift+4], eax
mov ebx,8
mov [esp+RRightShift], ebx
mov [esp+RRightShift+4], eax
movq mm0,fivebitu
movq [esp+GUpperLimit], mm0
movq [esp+BUpperLimit], mm0
movq mm0,sixbitu
movq [esp+RUpperLimit], mm0
movq mm0,shifttwo ; 1<<(7-5) for dither
movq [esp+GDither],mm0
movq [esp+BDither],mm0
movq mm0,shiftone ; 1<<(7-6) for dither
movq [esp+RDither],mm0
jmp RGBEND
RGB565:
xor eax, eax
mov ebx, 3 ; 8-5
mov [esp+RLeftShift], ebx
mov [esp+RLeftShift+4], eax
mov ebx, 5
mov [esp+GLeftShift], ebx
mov [esp+GLeftShift+4], eax
mov ebx, 9
mov [esp+RRightShift], ebx
mov [esp+RRightShift+4], eax
mov [esp+BRightShift], ebx
mov [esp+BRightShift+4], eax
movq mm0, fivebitu
movq [esp+RUpperLimit], mm0
movq [esp+BUpperLimit], mm0
mov ebx, 8
mov [esp+GRightShift], ebx
mov [esp+GRightShift+4], eax
movq mm0, sixbitu
movq [esp+GUpperLimit], mm0
movq mm0,shifttwo ; 1<<(7-5) for dither
movq [esp+RDither],mm0
movq [esp+BDither],mm0
movq mm0,shiftone ; 1<<(7-6) for dither
movq [esp+GDither],mm0
; jmp RGBEND
RGBEND:
mov ebx, [esp+VPlane]
mov ecx, [esp+UPlane]
sub ecx, ebx
mov [esp+DistanceFromVToU], ecx
mov eax, [esp+ColorConvertedFrame]
add eax, [esp+DCIOffset]
add eax, [esp+CCOffsetToLine0]
mov [esp+CCOCursor], eax
Lebx FrameWidth
Leax CCOPitch
Lesi YPlane ; Fetch cursor over luma plane.
shl ebx, 2 ; FrameWidth*2
sub eax, ebx ; CCOPitch-2*FrameWidth
shr ebx, 3 ; FrameWidth*3
Sesi YCursor
Sebx ChromaLineLen ; FrameWidth*3
Seax CCOSkipDistance ; CCOPitch-3*FrameWidth
Leax AspectAdjustmentCount
Lesi VPlane
Seax AspectCount
xor eax, eax
Ledi ChromaLineLen
Sedi EndOfChromaLine
Ledi CCOCursor
Ledx DistanceFromVToU
Lebp YCursor ; Fetch Y Pitch.
Lebx FrameWidth
add ebp, ebx
Sebp tmpYCursorEven
Leax YPitch
add ebp, eax
Sebp tmpYCursorOdd
sar ebx, 1
add esi, ebx
add edx, esi
neg ebx
Sebx FrameWidth
; Register Usage:
;
; ebp -- Y Line cursor. Chroma contribs go in lines above current Y line.
; esi -- Chroma Line cursor.
; edx -- Distance from V pel to U pel.
; edi -- Cursor over the color converted output image.
; ebx -- Number of points taken together.
;
;
; ecx -- Point to Far line (2 lines away)
; eax -- Line Pitch
;------------------------------------------------------------------------------
PrepareChromaLine:
Lebx FrameWidth
Leax CCOPitch
do_next_8x2_block:
Lebp tmpYCursorEven
movdt mm1, [edx+ebx] ; 4 u values
pxor mm0, mm0 ; mm0=0
movdt mm2, [esi+ebx] ; 4 v values
punpcklbw mm1, mm0 ; get 4 unsign u
psubw mm1, Minusg ; get 4 unsign u-128
punpcklbw mm2, mm0 ; get unsign v
psubw mm2, Minusg ; get unsign v-128
movq mm3, mm1 ; save the u-128 unsign
movq mm5, mm1 ; save u-128 unsign
punpcklwd mm1, mm2 ; get 2 low u, v unsign pairs
pmaddwd mm1, UVtG
punpckhwd mm3, mm2 ; create high 2 unsign uv pairs
pmaddwd mm3, UVtG
movq temp_mmx[esp], mm2 ; save v-128
movq mm6, [ebp+2*ebx] ; mm6 has 8 y pixels
psubusb mm6, Yadd ; mm6 has 8 y-16 pixels
packssdw mm1, mm3 ; packed the results to signed words
movq mm7, mm6 ; save the 8 y-16 pixels
punpcklbw mm6, mm0 ; mm6 has 4 low y-16 unsign
pmullw mm6, Ymul
punpckhbw mm7, mm0 ; mm7 has 4 high y-16 unsign
pmullw mm7, Ymul
movq mm4, mm1
movq temp_mmx[esp+8], mm1 ; save 4 chroma G values
punpcklwd mm1, mm1 ; chroma G replicate low 2
movq mm0, mm6 ; low y
punpckhwd mm4, mm4 ; chroma G replicate high 2
movq mm3, mm7 ; high y
psubw mm6, mm1 ; 4 low G
; movq mm1, mm5 ; 4 u values
psraw mm6, [esp+GRightShift]
psubw mm7, mm4 ; 4 high G values in signed 16 bit
movq mm2, mm5
punpcklwd mm5, mm5 ; replicate the 2 low u pixels
pmullw mm5, UtB
punpckhwd mm2, mm2
pmullw mm2, UtB
psraw mm7, [esp+GRightShift]
packuswb mm6, mm7 ; mm6: G7 G6 G5 G4 G3 G2 G1 G0
movq temp_mmx[esp+16], mm5 ; low chroma B
paddw mm5, mm0 ; 4 low B values in signed 16 bit
movq temp_mmx[esp+40], mm2 ; high chroma B
paddw mm2, mm3 ; 4 high B values in signed 16 bit
psraw mm5, [esp+BRightShift] ; low B scaled down by 6+(8-5)
psraw mm2, [esp+BRightShift] ; high B scaled down by 6+(8-5)
packuswb mm5, mm2 ; mm1: B7 B6 B5 B4 B3 B2 B1 B0
movq mm2, temp_mmx[esp] ; 4 v values
movq mm1, mm5 ; save B
movq mm7, mm2
punpcklwd mm2, mm2 ; replicate the 2 low v pixels
pmullw mm2, VtR
punpckhwd mm7, mm7
pmullw mm7, VtR
paddusb mm1, [esp+BUpperLimit] ; mm1: saturate B+0FF-15
movq temp_mmx[esp+24], mm2 ; low chroma R
paddw mm2, mm0 ; 4 low R values in signed 16 bit
psraw mm2, [esp+RRightShift] ; low R scaled down by 6+(8-5)
pxor mm4, mm4 ; mm4=0 for 8->16 conversion
movq temp_mmx[esp+32], mm7 ; high chroma R
paddw mm7, mm3 ; 4 high R values in signed 16 bit
psraw mm7, [esp+RRightShift] ; high R scaled down by 6+(8-5)
psubusb mm1, [esp+BUpperLimit]
packuswb mm2, mm7 ; mm2: R7 R6 R5 R4 R3 R2 R1 R0
paddusb mm6, [esp+GUpperLimit] ; G
psubusb mm6, [esp+GupperLimit]
paddusb mm2, [esp+RUpperLimit] ; R
psubusb mm2, [esp+RUpperLimit]
psllq mm2, [esp+RLeftShift] ; position R in the most significant part of the byte
movq mm7, mm1 ; mm1: Save B
; note: no need for shift to place B on the least significant part of the byte
; R in left position, B in the right position so they can be combined
punpcklbw mm1, mm2 ; mm1: 4 low 16 bit RB
pxor mm0, mm0 ; mm0: 0
punpckhbw mm7, mm2 ; mm7: 4 high 16 bit RB
movq mm3, mm6 ; mm3: G
punpcklbw mm6, mm0 ; mm6: low 4 G 16 bit
psllw mm6, [esp+GLeftShift] ; shift low G 5 positions
punpckhbw mm3, mm0 ; mm3: high 4 G 16 bit
psllw mm3, [esp+GLeftShift] ; shift high G 5 positions
por mm1, mm6 ; mm1: low RBG16
movq mm2, mm1
por mm7, mm3 ; mm7: high RBG16
punpcklwd mm1, mm1
movq [edi], mm1 ; !! aligned
punpckhwd mm2, mm2
movq [edi+eax], mm1 ; !! patch
movq [edi+8], mm2 ; !! patch
movq [edi+eax+8], mm2 ; !! patch
movq mm6, mm7
punpcklwd mm7, mm7 ; get 4 low y-16 unsign pixels word
movq [edi+16], mm7 ; !! aligned
punpckhwd mm6, mm6 ; get 4 low y-16 unsign pixels word
movq [edi+eax+16], mm7 ; !! aligned
movq [edi+24], mm6 ; !! aligned
movq [edi+eax+24], mm6 ; !! aligned
;- start odd line
Lebp tmpYCursorOdd ; moved here to save cycles before odd line
movq mm1, [ebp+2*ebx] ; mm1 has 8 y pixels
pxor mm2, mm2
psubusb mm1, Yadd ; mm1 has 8 pixels y-16
movq mm5, mm1
punpcklbw mm1, mm2 ; get 4 low y-16 unsign pixels word
pmullw mm1, Ymul ; low 4 luminance contribution
punpckhbw mm5, mm2 ; 4 high y-16
pmullw mm5, Ymul ; high 4 luminance contribution
movq mm0, mm1
paddw mm0, temp_mmx[esp+24] ; low 4 R
movq mm6, mm5
psraw mm0, [esp+RRightShift] ; low R scaled down by 6+(8-5)
paddw mm5, temp_mmx[esp+32] ; high 4 R
movq mm2, mm1
psraw mm5, [esp+RRightShift] ; high R scaled down by 6+(8-5)
paddw mm2, temp_mmx[esp+16] ; low 4 B
packuswb mm0, mm5 ; mm0: R7 R6 R5 R4 R3 R2 R1 R0
psraw mm2, [esp+BRightShift] ; low B scaled down by 6+(8-5)
movq mm5, mm6
paddw mm6, temp_mmx[esp+40] ; high 4 B
psraw mm6, [esp+BRightShift] ;high B scaled down by 6+(8-5)
movq mm3, temp_mmx[esp+8] ; chroma G low 4
packuswb mm2, mm6 ; mm2: B7 B6 B5 B4 B3 B2 B1 B0
movq mm4, mm3
punpcklwd mm3, mm3 ; replicate low 2
punpckhwd mm4, mm4 ; replicate high 2
psubw mm1, mm3 ; 4 low G
psraw mm1, [esp+GRightShift] ; low G scaled down by 6+(8-5)
psubw mm5, mm4 ; 4 high G values in signed 16 bit
psraw mm5, [esp+GRightShift] ; high G scaled down by 6+(8-5)
pxor mm3, mm3 ; B
paddusb mm2, [esp+BUpperLimit] ; mm1: saturate B+0FF-15
packuswb mm1, mm5 ; mm1: G7 G6 G5 G4 G3 G2 G1 G0
psubusb mm2, [esp+BupperLimit]
paddusb mm1, [esp+GUpperLimit] ; G
psubusb mm1, [esp+GUpperLimit]
paddusb mm0, [esp+RUpperLimit] ; R
psubusb mm0, [esp+RUpperLimit]
lea ecx, [eax+2*eax] ; ecx - point to next 3 line
psllq mm0, [esp+RLeftShift] ; position R in the most significant part of the byte
movq mm7, mm2 ; mm7: Save B
; note: no need for shift to place B on the least significant part of the byte
; R in left position, B in the right position so they can be combined
punpcklbw mm2, mm0 ; mm1: 4 low 16 bit RB
pxor mm4, mm4 ; mm4: 0
movq mm3, mm1 ; mm3: G
punpckhbw mm7, mm0 ; mm7: 4 high 16 bit RB
punpcklbw mm1, mm4 ; mm1: low 4 G 16 bit
punpckhbw mm3, mm4 ; mm3: high 4 G 16 bit
psllw mm1, [esp+GLeftShift] ; shift low G 5 positions
por mm2, mm1 ; mm2: low RBG16
psllw mm3, [esp+GLeftShift] ; shift high G 5 positions
movq mm4, mm2 ; mm4: save low RBG16
por mm7, mm3 ; mm7: high RBG16
punpcklwd mm2, mm2 ; replicate low low RGB16
movq [edi+2*eax], mm2
punpckhwd mm4, mm4 ; replicate high low RGB16
movq [edi+2*eax+8], mm4 ; patch
movq mm5, mm7 ; save high RBG16
movq [edi+ecx], mm2
punpcklwd mm7, mm7
movq [edi+ecx+8], mm4 ; patch
punpckhwd mm5, mm5
movq [edi+ecx+16], mm7 ; aligned
movq [edi+2*eax+16], mm7 ; aligned
movq [edi+ecx+24], mm5 ; aligned
movq [edi+2*eax+24], mm5 ; aligned
add edi, 32 ; ih take 16 bytes (8 pixels-16 bit)
add ebx, 4 ; ? to take 4 pixels together instead of 2
jl do_next_8x2_block ; ? update the loop for 8 y pixels at once
ADDedi CCOSkipDistance ; go to begin of next line
ADDedi CCOPitch ; skip odd line
ADDedi CCOPitch ; skip odd line
ADDedi CCOPitch ; skip odd line
Leax CCOPitch
Leax YPitch
Lebp tmpYCursorOdd
lea ebp, [ebp+2*eax] ; skip two lines
Sebp tmpYCursorOdd
Lebp tmpYCursorEven
lea ebp, [ebp+2*eax]
Sebp tmpYCursorEven
ADDesi ChromaPitch
ADDedx ChromaPitch
sub PD FrameHeight[esp],2
ja PrepareChromaLine
;------------------------------------------------------------------------------
finish:
emms
add esp, LocalFrameSize
pop ebx
pop ebp
pop edi
pop esi
retn
MMXCODE1 ENDS
END