Chapter 6
MMX TECHNOLOGY PERFORMANCE MONITORING EXTENSIONS
The most effective way to improve the performance
of your code is to find its performance bottlenecks. Intel Architecture
processors include on-chip performance counters that let
you gather information about the performance of your application.
These counters keep track of events that occur while your code
is executing. You can read the counters during execution to determine
whether your code has stalls, either by using Intel's
VTune profiling tool or by using instructions within your code.
This chapter describes the performance monitoring
features for MMX code on Pentium® and P6-family processors with
MMX technology.
The RDPMC instruction is described in Section 6.3.
6.1 Superscalar (Pentium® Family) Performance Monitoring Events
All Pentium processors feature performance counters,
and several new events have been added to support MMX technology.
Each new event is assigned to one of the two event counters (CTR0
or CTR1). "Twin events" (such as "D1 starvation" and
"FIFO is empty") share an encoding but are assigned to different
counters so that they can be measured concurrently.
Each event must be programmed on its specified counter. Table
6-1 lists the performance monitoring events. New events are listed
in bold.
Table 6-1. Performance Monitoring Events
| Serial | Encoding | Counter 0 | Counter 1 | Performance Monitoring Event | Occurrence or Duration |
| 0 | 000000 | Yes | Yes | Data Read | OCCURRENCE |
| 1 | 000001 | Yes | Yes | Data Write | OCCURRENCE |
| 2 | 000010 | Yes | Yes | Data TLB Miss | OCCURRENCE |
| 3 | 000011 | Yes | Yes | Data Read Miss | OCCURRENCE |
| 4 | 000100 | Yes | Yes | Data Write Miss | OCCURRENCE |
| 5 | 000101 | Yes | Yes | Write (hit) to M or E state lines | OCCURRENCE |
| 6 | 000110 | Yes | Yes | Data Cache Lines Written Back | OCCURRENCE |
| 7 | 000111 | Yes | Yes | External Data Cache Snoops | OCCURRENCE |
| 8 | 001000 | Yes | Yes | External Data Cache Snoop Hits | OCCURRENCE |
| 9 | 001001 | Yes | Yes | Memory Accesses in Both Pipes | OCCURRENCE |
| 10 | 001010 | Yes | Yes | Bank Conflicts | OCCURRENCE |
| 11 | 001011 | Yes | Yes | Misaligned Data Memory or I/O References | OCCURRENCE |
| 12 | 001100 | Yes | Yes | Code Read | OCCURRENCE |
| 13 | 001101 | Yes | Yes | Code TLB Miss | OCCURRENCE |
| 14 | 001110 | Yes | Yes | Code Cache Miss | OCCURRENCE |
| 15 | 001111 | Yes | Yes | Any Segment Register Loaded | OCCURRENCE |
| 16 | 010000 | Yes | Yes | Reserved | |
| 17 | 010001 | Yes | Yes | Reserved | |
| 18 | 010010 | Yes | Yes | Branches | OCCURRENCE |
| 19 | 010011 | Yes | Yes | BTB Predictions | OCCURRENCE |
| 20 | 010100 | Yes | Yes | Taken Branch or BTB hit | OCCURRENCE |
| 21 | 010101 | Yes | Yes | Pipeline Flushes | OCCURRENCE |
| 22 | 010110 | Yes | Yes | Instructions Executed | OCCURRENCE |
| 23 | 010111 | Yes | Yes | Instructions Executed in the v-pipe (e.g., parallelism/pairing) | OCCURRENCE |
| 24 | 011000 | Yes | Yes | Clocks while a bus cycle is in progress (bus utilization) | DURATION |
| 25 | 011001 | Yes | Yes | Number of clocks stalled due to full write buffers | DURATION |
| 26 | 011010 | Yes | Yes | Pipeline stalled waiting for data memory read | DURATION |
| 27 | 011011 | Yes | Yes | Stall on write to an E or M state line | DURATION |
| 29 | 011101 | Yes | Yes | I/O Read or Write Cycle | OCCURRENCE |
| 30 | 011110 | Yes | Yes | Non-cacheable memory reads | OCCURRENCE |
| 31 | 011111 | Yes | Yes | Pipeline stalled because of an address generation interlock | DURATION |
| 32 | 100000 | Yes | Yes | Reserved | |
| 33 | 100001 | Yes | Yes | Reserved | |
| 34 | 100010 | Yes | Yes | FLOPs | OCCURRENCE |
| 35 | 100011 | Yes | Yes | Breakpoint match on DR0 Register | OCCURRENCE |
| 36 | 100100 | Yes | Yes | Breakpoint match on DR1 Register | OCCURRENCE |
| 37 | 100101 | Yes | Yes | Breakpoint match on DR2 Register | OCCURRENCE |
| 38 | 100110 | Yes | Yes | Breakpoint match on DR3 Register | OCCURRENCE |
| 39 | 100111 | Yes | Yes | Hardware Interrupts | OCCURRENCE |
| 40 | 101000 | Yes | Yes | Data Read or Data Write | OCCURRENCE |
| 41 | 101001 | Yes | Yes | Data Read Miss or Data Write Miss | OCCURRENCE |
| 43 | 101011 | Yes | No | MMX™ instructions executed in u-pipe | OCCURRENCE |
| 43 | 101011 | No | Yes | MMX instructions executed in v-pipe | OCCURRENCE |
| 45 | 101101 | Yes | No | EMMS instructions executed | OCCURRENCE |
| 45 | 101101 | No | Yes | Transition between MMX instructions and FP instructions | OCCURRENCE |
| 46 | 101110 | No | Yes | Writes to Non-Cacheable Memory | OCCURRENCE |
| 47 | 101111 | Yes | No | Saturating MMX instructions executed | OCCURRENCE |
| 47 | 101111 | No | Yes | Saturations performed | OCCURRENCE |
| 48 | 110000 | Yes | No | Number of Cycles Not in HLT State | DURATION |
| 49 | 110001 | Yes | No | MMX instruction data reads | OCCURRENCE |
| 50 | 110010 | Yes | No | Floating Point Stalls | DURATION |
| 50 | 110010 | No | Yes | Taken Branches | OCCURRENCE |
| 51 | 110011 | No | Yes | D1 Starvation and one instruction in FIFO | OCCURRENCE |
| 52 | 110100 | Yes | No | MMX instruction data writes | OCCURRENCE |
| 52 | 110100 | No | Yes | MMX instruction data write misses | OCCURRENCE |
| 53 | 110101 | Yes | No | Pipeline flushes due to wrong branch prediction | OCCURRENCE |
| 53 | 110101 | No | Yes | Pipeline flushes due to wrong branch predictions resolved in WB-stage | OCCURRENCE |
| 54 | 110110 | Yes | No | Misaligned data memory reference on MMX instruction | OCCURRENCE |
| 54 | 110110 | No | Yes | Pipeline stalled waiting for MMX instruction data memory read | DURATION |
| 55 | 110111 | Yes | No | Returns Predicted Incorrectly | OCCURRENCE |
| 55 | 110111 | No | Yes | Returns Predicted (Correctly and Incorrectly) | OCCURRENCE |
| 56 | 111000 | Yes | No | MMX instruction multiply unit interlock | DURATION |
| 56 | 111000 | No | Yes | MOVD/MOVQ store stall due to previous operation | DURATION |
| 57 | 111001 | Yes | No | Returns | OCCURRENCE |
| 57 | 111001 | No | Yes | RSB Overflows | OCCURRENCE |
| 58 | 111010 | Yes | No | BTB false entries | OCCURRENCE |
| 58 | 111010 | No | Yes | BTB misprediction on a Not-Taken Branch | OCCURRENCE |
| 59 | 111011 | Yes | No | Number of clocks stalled due to full write buffers while executing MMX instructions | DURATION |
| 59 | 111011 | No | Yes | Stall on MMX instruction write to E or M line | DURATION |
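The counter-assignment rule for Table 6-1 can be illustrated with a short Python sketch. This is not part of the original manual: the names `MMX_EVENTS`, `encoding`, and `can_measure_together` are hypothetical, and only a few rows of the table are transcribed. It shows that the Encoding column is the serial number written as six binary digits, and that two single-counter events can be collected in the same run only if they sit on different counters.

```python
# Illustrative subset of Table 6-1: event name -> (serial number, counter).
# Only events that are restricted to a single counter are listed here.
MMX_EVENTS = {
    "MMX instructions executed in u-pipe": (43, 0),
    "MMX instructions executed in v-pipe": (43, 1),
    "EMMS instructions executed": (45, 0),
    "Saturating MMX instructions executed": (47, 0),
    "Saturations performed": (47, 1),
    "MMX instruction data writes": (52, 0),
    "MMX instruction data write misses": (52, 1),
}


def encoding(serial):
    """Return the 6-bit event encoding for a serial number, as a bit string."""
    if not 0 <= serial < 64:
        raise ValueError("serial number must fit in 6 bits")
    return format(serial, "06b")


def can_measure_together(event_a, event_b):
    """Two single-counter events can run concurrently only on different counters."""
    return MMX_EVENTS[event_a][1] != MMX_EVENTS[event_b][1]
```

For example, the twin events "Saturating MMX instructions executed" and "Saturations performed" can be measured in one run, while two counter-0 events would need separate runs.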
6.1.1 DESCRIPTION OF MMX™ INSTRUCTION EVENTS
The event code and counter number are given in parentheses.
- MMX instructions executed in U-pipe (101011/0):
-- Total number of MMX instructions executed in U-pipe.
- MMX instructions executed in V-pipe (101011/1):
-- Total number of MMX instructions executed in V-pipe.
- EMMS instructions executed (101101/0):
-- Counts number of EMMS instructions executed.
- Transition between MMX instructions and FP instructions
(101101/1):
-- Counts first floating-point instruction following any MMX instruction
or first MMX instruction following a floating-point instruction.
May be used to estimate the penalty in transitions between FP
state and MMX state. An even count indicates the processor is
in MMX state. An odd count indicates it is in FP state.
- Writes to non-cacheable memory (101110/1):
-- Counts the number of write accesses to non-cacheable memory.
It includes write cycles caused by TLB misses and I/O write cycles.
Cycles restarted due to BOFF# are not re-counted.
- Saturating MMX instructions executed (101111/0):
-- Counts saturating MMX instructions executed, independently
of whether or not they actually saturated. Saturating MMX instructions
may perform add, subtract, or pack operations.
- Saturations performed (101111/1):
-- Counts the number of MMX instructions that used saturating arithmetic
where at least one of the results actually saturated (that is,
if an MMX instruction operating on four words saturated in three
out of the four results, the counter will be incremented by only
one).
- Number of cycles not in HALT (HLT) state (110000/0):
-- This event counts the number of cycles in which the processor is not
idle due to a HALT (HLT) instruction. It enables the
user to calculate "net CPI". Note that the Time
Stamp Counter (TSC) keeps counting while the processor
is halted by the HLT instruction. Because this event is controlled
by the Counter Controls CC0 and CC1, it can be used to calculate the
CPI at CPL=3, which the TSC cannot provide.
- MMX instruction data reads (110001/0):
-- Analogous to "Data reads", counting only MMX instruction
accesses.
- MMX instruction data read misses (110001/1):
-- Analogous to "Data read misses", counting only MMX
instruction accesses.
- Floating-Point stalls (110010/0):
-- This event counts the number of clocks while the pipe is stalled
due to a floating-point freeze.
- Number of Taken Branches (110010/1):
-- This event counts the number of taken branches.
- D1 starvation and FIFO is empty (110011/0), D1
starvation and only one instruction in FIFO (110011/1):
-- The D1 stage can issue 0, 1, or 2 instructions per clock if
instructions are available in the FIFO buffer. The first event
counts how many times D1 cannot issue any instructions because
the FIFO buffer is empty. The second event counts how many times
the D1 stage issues just a single instruction because the FIFO
buffer had just one instruction ready. Combined with two other
events, Instructions Executed (010110) and Instructions Executed
in the V-pipe (010111), the second event enables the user to calculate
the number of times pairing rules prevented issue of two instructions.
- MMX instruction data writes (110100/0):
-- Analogous to "Data writes", counting only MMX instruction
accesses.
- MMX instruction data write misses (110100/1):
-- Analogous to "Data write misses", counting only MMX
instruction accesses.
- Pipeline flushes due to wrong branch prediction
(110101/0), Pipeline flushes due to wrong branch prediction resolved
in WB-stage (110101/1):
-- Counts any pipeline flush due to a branch which the pipeline
did not follow correctly. It includes cases where a branch was
not in the BTB, cases where a branch was in the BTB but was mispredicted,
and cases where a branch was correctly predicted but to the wrong
address. Branches are resolved in either the Execute stage (E-stage)
or the Writeback stage (WB-stage). In the latter case, the misprediction
penalty is larger by one clock. The first event listed above counts
the number of incorrectly predicted branches resolved in either
the E-stage or the WB-stage. The second event counts the number
of incorrectly predicted branches resolved in the WB-stage. The
difference between these two counts is the number of E-stage resolved
branches.
- Misaligned data memory reference on MMX instruction
(110110/0):
-- Analogous to "Misaligned data memory reference",
counting only MMX instruction accesses.
- Pipeline stalled waiting for MMX instruction data memory read
(110110/1):
-- Analogous to "Pipeline stalled waiting for data memory
read", counting only MMX instruction accesses.
- Returns predicted incorrectly or not predicted
at all (110111/0):
-- This is the actual number of Returns that were either incorrectly
predicted or were not predicted at all. It is the difference between
the total number of executed returns and the number of returns
that were correctly predicted. Only RET instructions are counted
(that is, IRET instructions are not counted).
- Returns predicted (correctly and incorrectly)
(110111/1):
-- This is the actual number of Returns for which a prediction
was made. Only RET instructions are counted (that is, IRET instructions
are not counted).
- MMX instruction multiply unit interlock (111000/0):
-- This event counts the number of clocks the pipe is stalled
because the destination of a previous MMX multiply instruction
is not yet ready. The counter will not be incremented if there
is another cause for a stall. For each occurrence of a multiply
interlock, this event may be counted twice (if the stalled instruction
comes on the next clock after the multiply) or only once (if the
stalled instruction comes two clocks after the multiply).
- MOVD/MOVQ store stall due to previous operation
(111000/1):
-- Number of clocks a MOVD/MOVQ store is stalled in the D2 stage due
to a previous MMX operation whose destination is used by
the store instruction.
- Returns (111001/0):
-- This is the actual number of Returns executed. Only RET instructions
are counted (that is, IRET instructions are not counted). Any
exception taken on a RET instruction also updates this counter.
- RSB overflows (111001/1):
-- This event counts the number of times the Return Stack Buffer
(RSB) cannot accommodate a call address.
- BTB false entries (111010/0):
-- This event counts the number of false entries in the Branch
Target Buffer. False entries are causes for misprediction other
than a wrong prediction.
- BTB misprediction on a Not-Taken Branch (111010/1):
-- This event counts the number of times the BTB predicted a Not-Taken
branch as Taken.
- Number of clocks stalled due to full write buffers
while executing MMX instructions (111011/0):
-- Analogous to "Number of clocks stalled due to full write
buffers", counting only MMX instruction accesses.
- Stall on MMX instruction write to an E or M state
line (111011/1):
-- Analogous to "Stall on write to an E or M state line",
counting only MMX instruction accesses.
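Several of the descriptions above imply simple arithmetic on raw counter values. The following Python sketch, which is not part of the original manual, collects those calculations; all function names are hypothetical, and the "net CPI" formula (non-halted clocks divided by instructions executed) is an assumed reading of the text rather than a definition the manual states.

```python
def net_cpi(cycles_not_halted, instructions_executed):
    """Assumed definition of "net CPI": non-halted clocks per executed instruction.

    cycles_not_halted     -- count of event 110000/0
    instructions_executed -- count of event 010110
    """
    if instructions_executed <= 0:
        raise ValueError("instructions_executed must be positive")
    return cycles_not_halted / instructions_executed


def e_stage_mispredictions(all_flushes, wb_stage_flushes):
    """Mispredicted branches resolved in the E-stage: 110101/0 minus 110101/1."""
    return all_flushes - wb_stage_flushes


def correctly_predicted_returns(returns_executed, returns_mispredicted):
    """Returns (111001/0) minus returns predicted incorrectly or not at all (110111/0)."""
    return returns_executed - returns_mispredicted


def saturation_ratio(saturations_performed, saturating_executed):
    """Fraction of saturating MMX instructions (101111/0) that saturated (101111/1)."""
    if saturating_executed <= 0:
        raise ValueError("saturating_executed must be positive")
    return saturations_performed / saturating_executed


def state_after_transitions(transition_count):
    """An even transition count (101101/1) means MMX state, an odd count FP state."""
    return "MMX" if transition_count % 2 == 0 else "FP"
```

Because each event must run on its specified counter, inputs that share a counter (for example the two return counts, both on counter 0) would have to be collected in separate runs of the same workload.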
6.2 Dynamic Execution (P6-Family) Performance Monitoring Events
This section describes the counters on P6-family
processors. Table 6-2 lists the events that can be counted with
the performance-monitoring counters and read with the RDPMC instruction.
In the table, the:
- Unit column gives the microarchitecture or bus
unit that produces the event.
- Event number column gives the hexadecimal number
identifying the event.
- Mnemonic event name column gives the name of
the event.
- Unit mask column gives the unit mask required
(if any).
- Description column describes the event.
- Comments column gives additional information
about the event.
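As an illustration of how the event number and unit mask columns are used, the Python sketch below packs them into a 32-bit event-select value. The bit positions are an assumption taken from the P6-family event-select register layout (event number in bits 7:0, unit mask in bits 15:8, USR at bit 16, OS at bit 17, enable at bit 22), which this section does not describe; the function name `event_select` and its parameters are hypothetical.

```python
# Bit positions assumed from the P6-family event-select register layout
# (not described in this section).
USR_BIT = 1 << 16     # count while running at CPL 1, 2, or 3
OS_BIT = 1 << 17      # count while running at CPL 0
ENABLE_BIT = 1 << 22  # enable counting


def event_select(event_num, unit_mask=0x00, user=True, kernel=True, enable=True):
    """Pack an event number and unit mask into an event-select value."""
    if not 0 <= event_num <= 0xFF:
        raise ValueError("event number must fit in 8 bits")
    if not 0 <= unit_mask <= 0xFF:
        raise ValueError("unit mask must fit in 8 bits")
    value = event_num | (unit_mask << 8)
    if user:
        value |= USR_BIT
    if kernel:
        value |= OS_BIT
    if enable:
        value |= ENABLE_BIT
    return value
```

For example, `event_select(0xC0)` selects INST_RETIRED with a unit mask of 00H, and `event_select(0x2E, unit_mask=0x0F)` selects L2_RQSTS with the MESI unit mask 0FH.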
These performance monitoring events are intended
to be used as guides for performance tuning. The counter values
reported are not guaranteed to be absolutely accurate and should
be used as a relative guide for tuning. Known discrepancies are
documented where applicable. All performance events are model-specific
to P6-family processors and are not architecturally guaranteed
in future versions of the processor. All performance event encodings
not listed in the table are reserved and their use will result
in undefined counter results.
Further details will be made available in a later
version of this document.
See the end of the table for notes related to certain
entries in the table.
Table 6-2. Performance Monitoring Counters
| Unit | Event Num. | Mnemonic Event Name | Unit Mask | Description | Comments |
| Data Cache Unit (DCU) | 43H | DATA_MEM_REFS | 00H | All memory references, both cacheable and non-cacheable. | |
| | 45H | DCU_LINES_IN | 00H | Total lines allocated in the DCU. | |
| | 46H | DCU_M_LINES_IN | 00H | Number of M state lines allocated in the DCU. | |
| | 47H | DCU_M_LINES_OUT | 00H | Number of M state lines evicted from the DCU. This includes evictions via snoop HITM, intervention or replacement. | |
| | 48H | DCU_MISS_OUTSTANDING | 00H | Weighted number of cycles while a DCU miss is outstanding. | An access that also misses the L2 is short-changed by 2 cycles (i.e., if the count is N cycles, it should be N+2 cycles). Subsequent loads to the same cache line will not result in any additional counts. Count value not precise, but still useful. |
| Instruction Fetch Unit (IFU) | 80H | IFU_IFETCH | 00H | Number of instruction fetches, both cacheable and non-cacheable. | |
| | 81H | IFU_IFETCH_MISS | 00H | Number of instruction fetch misses. | |
| | 85H | ITLB_MISS | 00H | Number of ITLB misses. | |
| | 86H | IFU_MEM_STALL | 00H | Number of cycles that the instruction fetch pipe stage is stalled, including cache misses, ITLB misses, ITLB faults, and victim cache evictions. | |
| | 87H | ILD_STALL | 00H | Number of cycles that the instruction length decoder is stalled. | |
| L2 Cache | 29H | L2_LD | MESI 0FH | Number of L2 data loads. | |
| | 2AH | L2_ST | MESI 0FH | Number of L2 data stores. | |
| | 24H | L2_LINES_IN | 00H | Number of lines allocated in the L2. | |
| | 26H | L2_LINES_OUT | 00H | Number of lines removed from the L2 for any reason. | |
| | 25H | L2_M_LINES_INM | 00H | Number of modified lines allocated in the L2. | |
| | 27H | L2_M_LINES_OUTM | 00H | Number of modified lines removed from the L2 for any reason. | |
| | 2EH | L2_RQSTS | MESI 0FH | Number of L2 requests. | |
| | 21H | L2_ADS | 00H | Number of L2 address strobes. | |
| | 22H | L2_DBUS_BUSY | 00H | Number of cycles during which the data bus was busy. | |
| | 23H | L2_DBUS_BUSY_RD | 00H | Number of cycles during which the data bus was busy transferring data from L2 to the processor. | |
| External Bus Logic (EBL) (see note 2) | 62H | BUS_DRDY_CLOCKS | 00H (Self), 20H (Any) | Number of clocks during which DRDY is asserted. | Unit Mask = 00H counts bus clocks when the processor is driving DRDY. Unit Mask = 20H counts in processor clocks when any agent is driving DRDY. |
| | 63H | BUS_LOCK_CLOCKS | 00H (Self), 20H (Any) | Number of clocks during which LOCK is asserted. | Always counts in processor clocks. |
| | 66H | BUS_TRAN_RFO | 00H (Self), 20H (Any) | Number of read for ownership transactions. | |
| | 68H | BUS_TRAN_IFETCH | 00H (Self), 20H (Any) | Number of instruction fetch transactions. | |
| | 69H | BUS_TRAN_INVAL | 00H (Self), 20H (Any) | Number of invalidate transactions. | |
| | 6AH | BUS_TRAN_PWR | 00H (Self), 20H (Any) | Number of partial write transactions. | |
| | 6BH | BUS_TRANS_P | 00H (Self), 20H (Any) | Number of partial transactions. | |
| | 6CH | BUS_TRANS_IO | 00H (Self), 20H (Any) | Number of I/O transactions. | |
| | 6DH | BUS_TRAN_DEF | 00H (Self), 20H (Any) | Number of deferred transactions. | |
| | 6EH | BUS_TRAN_BURST | 00H (Self), 20H (Any) | Number of burst transactions. | |
| | 70H | BUS_TRAN_ANY | 00H (Self), 20H (Any) | Number of all transactions. | |
| | 6FH | BUS_TRAN_MEM | 00H (Self), 20H (Any) | Number of memory transactions. | |
| | 61H | BUS_BNR_DRV | 00H (Self) | Number of bus clock cycles during which this processor is driving the BNR pin. | |
| | 7AH | BUS_HIT_DRV | 00H (Self) | Number of bus clock cycles during which this processor is driving the HIT pin. | Includes cycles due to snoop stalls. |
| | 7EH | BUS_SNOOP_STALL | 00H (Self) | Number of clock cycles during which the bus is snoop stalled. | |
| Floating Point Unit | C1H | FLOPS | 00H | Number of computational floating-point operations retired. | Counter 0 only. |
| | 10H | FP_COMP_OPS_EXE | 00H | Number of computational floating-point operations executed. | Counter 0 only. |
| | 11H | FP_ASSIST | 00H | Number of floating-point exception cases handled by microcode. | Counter 1 only. |
| | 12H | MUL | 00H | Number of multiplies. | Counter 1 only. |
| | 13H | DIV | 00H | Number of divides. | Counter 1 only. |
| | 14H | CYCLES_DIV_BUSY | 00H | Number of cycles during which the divider is busy. | Counter 0 only. |
| Memory Ordering | 03H | LD_BLOCKS | 00H | Number of store buffer blocks. | |
| | 04H | SB_DRAINS | 00H | Number of store buffer drain cycles. | |
| | 05H | MISALIGN_MEM_REF | 00H | Number of misaligned data memory references. | |
| Instruction Decoding and Retirement | C0H | INST_RETIRED | 00H | Number of instructions retired. | |
| | C2H | UOPS_RETIRED | 00H | Number of micro-ops retired. | |
| | D0H | INST_DECODER | 00H | Number of instructions decoded. | |
| Interrupts | C8H | HW_INT_RX | 00H | Number of hardware interrupts received. | |
| | C6H | CYCLES_INT_MASKED | 00H | Number of processor cycles for which interrupts are disabled. | |
| Branches | C4H | BR_INST_RETIRED | 00H | Number of branch instructions retired. | |
| | C5H | BR_MISS_PRED_RETIRED | 00H |
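Since the counter values are meant as a relative guide rather than exact figures, they are often most useful as ratios between two events. The Python sketch below, which is not part of the original manual, shows two such ratios built from events in Table 6-2; the function names are hypothetical and the choice of ratios is only an example.

```python
def ratio(numerator, denominator):
    """Divide two raw counter values, rejecting an empty denominator."""
    if denominator <= 0:
        raise ValueError("denominator count must be positive")
    return numerator / denominator


def uops_per_instruction(uops_retired, inst_retired):
    """UOPS_RETIRED (C2H) divided by INST_RETIRED (C0H)."""
    return ratio(uops_retired, inst_retired)


def ifetch_miss_ratio(ifetch_misses, ifetches):
    """IFU_IFETCH_MISS (81H) divided by IFU_IFETCH (80H)."""
    return ratio(ifetch_misses, ifetches)
```

Comparing such ratios before and after a code change gives a relative indication of improvement even when the absolute counts are imprecise.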
Legal Stuff © 1997 Intel Corporation