1> AMD Family 15h processors include many features designed to improve software performance. The
internal design, or microarchitecture, of these processors provides the following key features:
• Integrated DDR3 memory controller with memory prefetcher
• 64-Kbyte L1 instruction cache and 16-Kbyte L1 data cache
• Shared L2 cache between the cores of a compute unit
• Shared L3 cache between compute units on chip (on supported platforms)
• 32-byte instruction fetch
• Instruction predecode and branch prediction during cache-line fills
• Decoupled prediction and instruction fetch pipelines
• Four-way AMD64 instruction decoding (This is a theoretical limit. See section 2.3 on page 31.)
• Dynamic scheduling and speculative execution
• Two-way integer execution
• Two-way address generation
• Two-way 128-bit wide floating-point execution
• Legacy single-instruction multiple-data (SIMD) instruction extensions, as well as support for
XOP, FMA4, VPERMILx, and Advanced Vector Extensions (AVX).
• Superforwarding
• Prefetch into L2 or L1 data cache
• Deep out-of-order integer and floating-point execution
• HyperTransport™ technology
Several enhancements to the AMD64 architecture have resulted in significant performance
improvements in AMD Family 15h processors, including:
• Improved performance of shuffle instructions
• Improved data transfer between floating-point registers and general purpose registers
• Improved floating-point register to floating-point register moves
• Optimization of repeated move instructions
• More efficient PUSH/POP stack operations
• 1-Gbyte paging
• Load-Execute Instructions for Unaligned Data
Use load-execute instructions instead of discrete load and execute instructions when performing
SIMD integer, SIMD floating-point, and x87 computations on floating-point source operands. On
AMD Family 15h processors, this is recommended regardless of the alignment of packed data. (On
previous AMD64 processors, the use of load-execute instructions under these circumstances was
recommended only for aligned packed data.) This replacement is only possible if the misaligned
exception mask (MM) is set. This optimization can be especially useful in vectorized SIMD loops
and may eliminate the need for loop peeling due to nonalignment.
2> AMD Family 15h Processor Cache Operations
AMD Family 15h processors use four different caches to accelerate instruction execution and data
processing:
• L1 instruction cache
• L1 data cache
• Shared compute-unit L2 cache
• Shared on-chip L3 cache (on supported platforms)
2.1> L1 Instruction Cache
The out-of-order execution engine of AMD Family 15h processors contains a 64-Kbyte, 2-way set
associative L1 instruction cache. Each line in this cache is 64 bytes long. However, only 32 bytes
are fetched in every cycle. Functions associated with the L1 instruction cache are instruction loads,
instruction prefetching, instruction predecoding, and branch prediction. Requests that miss in the L1
instruction cache are fetched from the L2 cache or, subsequently, from the L3 cache or system
memory.
On misses, the L1 instruction cache generates fill requests to a naturally aligned 64-byte line
containing the instructions and the next sequential line of bytes (a prefetch). Because code typically
exhibits spatial locality, prefetching is an effective technique for avoiding decode stalls. Cache-line
replacement is based on a least-recently-used replacement algorithm.
Predecoding begins as the L1 instruction cache is filled. Predecode information is generated and
stored alongside the instruction cache. This information is used to help efficiently identify the
boundaries between variable-length AMD64 instructions.
2.2> L1 Data Cache
The AMD Family 15h processor contains a 16-Kbyte, 4-way predicted L1 data cache with two 128-
bit ports. This is a write-through cache that supports up to two 128-bit loads per cycle. It is divided
into 16 banks, each 16 bytes wide. In addition, the L1 cache is protected from single bit errors through
the use of parity. There is a hardware prefetcher that brings data into the L1 data cache to avoid
misses. The L1 data cache has a 4-cycle load-to-use latency. Only one load can be performed from a
given bank of the L1 cache in a single cycle.
2.3> L2 Cache
The AMD Family 15h processor has one shared L2 cache per compute unit. This full-speed on-die L2
cache is mostly inclusive relative to the L1 caches. Because the L1 data cache is a write-through
cache, every time a store is performed in a core, the data is written into both the L1 data cache of the
core performing the store and the L2 cache (which is shared between the two cores). The L2 cache has
an 18-20 cycle load-to-use latency.
Size and associativity of the AMD Family 15h processor L2 cache is implementation dependent. See
the appropriate BIOS and Kernel Developer's Guide for details.
2.4> L3 Cache
The AMD Family 15h processor supports a maximum of 8 Mbytes of L3 cache per die, distributed
among four L3 sub-caches, each of which can be up to 2 Mbytes in size. The L3 cache implements a
non-inclusive victim-cache architecture optimized for multi-core AMD processors. Only L2 evictions cause
allocations into the L3 cache. Requests that hit in the L3 cache can either leave the data in the L3
cache—if it is likely the data is being accessed by multiple cores—or remove the data from the L3
cache (and place it solely in the L1 cache, creating space for other L2 victim/copy-backs), if it is likely
the data is only being accessed by a single core. Furthermore, the L3 cache of the AMD Family 15h
processor also features a number of micro-architectural improvements that enable higher
bandwidth.
3> Branch Prediction
To predict and accelerate branches, AMD Family 15h processors employ a combination of next-
address logic, a 2-level branch target buffer (BTB) for branch identification and direct target
prediction, a return address stack used for predicting return addresses, an indirect target predictor for
predicting indirect jump and call addresses, a hybrid branch predictor for predicting conditional
branch directions, and a fetch window tracking structure (BSR). Predicted-taken branches incur a 1-
cycle bubble in the branch prediction pipeline when they are predicted by the L1 BTB, and a 4-cycle
bubble in the case where they are predicted by the L2 BTB. The minimum branch misprediction
penalty is 20 cycles in the case of conditional and indirect branches and 15 cycles for unconditional
direct branches and returns.
The BTB is a tagged two-level set associative structure accessed using the fetch address of the current
window. Each BTB entry includes information about a branch and its target. The L1 BTB contains
128 sets of 4 ways for a total of 512 entries, while the L2 BTB has 1024 sets of 5 ways for a total of
5120 entries.
The hybrid branch predictor is used for predicting conditional branches. It consists of a global
predictor, a local predictor and a selector that tracks whether each branch is correlating better with the
global or local predictor. The selector and local predictor are indexed with a linear address hash. The
global predictor is accessed via a 2-bit address hash and a 12-bit global history.
AMD Family 15h processors implement a separate 512-entry indirect target array used to predict
indirect branches with multiple dynamic targets.
In addition, the processors implement a 24-entry return address stack to predict return addresses from
a near or far call. Most of the time, as calls are fetched, the next return address is pushed onto the
return stack and subsequent returns pop a predicted return address off the top of the stack. However,
mispredictions sometimes arise during speculative execution. Mechanisms exist to restore the stack to
a consistent state after these mispredictions.
4> Instruction Fetch and Decode
While previous AMD64 processors had a single 32-byte fetch window, AMD Family 15h processors
have two 32-byte fetch windows, from which four μops can be selected. These fetch windows, when
combined with the 128-bit floating-point execution unit, allow the processor to sustain a
fetch/dispatch/retire sequence of four instructions per cycle. Most instructions decode to a single
μop, but fastpath double instructions decode to two μops. ALU instructions can also issue at up to
four μops per cycle, while microcoded instructions should be considered single-issue. Thus, there is
not necessarily a one-to-one correspondence between the decode size of assembler instructions and
the capacity of the 32-byte fetch window, and producing optimal assembler code requires
considerable attention to the details of the underlying programming constraints.
Assembly language programmers can now group more instructions together but must still concern
themselves with the possibility that an instruction may span a 32-byte fetch window. In this regard, it
is also advisable to align hot loops to 32 bytes instead of 16 bytes, especially in the case of loops for
large SIMD instructions.
AMD Family 15h processors can theoretically fetch 32 bytes of instructions per cycle and send these
instructions to the Decode Unit (DE) in 16-byte windows through the 16-entry (per-thread)
Instruction Byte Buffer (IBB). The Decode Unit can only scan two of these 16-byte windows in a
given cycle for up to four instructions. If four instructions partially or wholly exist in more than two
of these windows, only those instructions within the first and second windows will be decoded.
Aligning to 16-byte boundaries is important to achieve full decode performance.
5> Integer Execution
The integer execution unit for the AMD Family 15h processor consists of two components:
• the integer datapath
• the instruction scheduler and retirement control
These two components are responsible for all integer execution (including address generation) as well
as coordination of all instruction retirement and exception handling. The instruction scheduler and
retirement control tracks instruction progress through dispatch, issue, execution, and eventual
retirement. Scheduling for integer operations is fully data-dependency driven, proceeding out of
order based on the validity of source operands and the availability of execution resources.
Since the Bulldozer core implements a floating-point coprocessor model of operation, most
scheduling and execution decisions for floating-point operations are handled by the floating-point unit.
However, the scheduler does track the completion status of all outstanding operations and is the final
arbiter for exception processing and recovery.
6> Translation-Lookaside Buffer
A translation-lookaside buffer (TLB) holds the most-recently-used page mapping information. It
assists and accelerates the translation of virtual addresses to physical addresses.
The AMD Family 15h processors utilize a two-level TLB structure.
6.1> L1 Instruction TLB Specifications
The AMD Family 15h processor contains a fully-associative L1 instruction TLB with 48 4-Kbyte
page entries and 24 2-Mbyte or 1-Gbyte page entries. 4-Mbyte pages require two 2-Mbyte entries;
thus, the number of entries available for 4-Mbyte pages is one half the number of 2-Mbyte page
entries.
6.2> L1 Data TLB Specifications
The AMD Family 15h processor contains a fully-associative L1 data TLB with 32 entries for 4-
Kbyte, 2-Mbyte, and 1-Gbyte pages. 4-Mbyte pages require two 2-Mbyte entries; thus, the number of
entries available for 4-Mbyte pages is one half the number of 2-Mbyte page entries.
6.3> L2 Instruction TLB Specifications
The AMD Family 15h processor contains a 4-way set-associative L2 instruction TLB with 512 4-
Kbyte page entries.
6.4> L2 Data TLB Specifications
The AMD Family 15h processor contains an L2 data TLB and page walk cache (PWC) with 1024 4-
Kbyte, 2-Mbyte or 1-Gbyte page entries (8-way set-associative). 4-Mbyte pages require two 2-Mbyte
entries; thus, the number of entries available for 4-Mbyte pages is one half the number of 2-Mbyte
page entries.
7> Integer Unit
The integer unit consists of two components: the integer scheduler, which feeds the integer execution
pipes, and the integer execution unit, which carries out several types of operations discussed below.
The integer unit is duplicated, one per core, within each compute unit.
7.1> Integer Scheduler
The scheduler can receive and schedule up to four micro-ops (μops) in a dispatch group per cycle.
The scheduler tracks operand availability and dependency information as part of its task of issuing
μops to be executed. It also assures that older μops which have been waiting for operands are
executed in a timely manner. The scheduler also manages register mapping and renaming.
7.2> Integer Execution Unit
There are four integer execution units per core: two units (EX) that handle all arithmetic, logical,
and shift operations, and two units (AGLU) that handle address generation and simple ALU
operations. Figure 2 shows a block diagram for one integer cluster. There are two such integer
clusters per compute unit.
Macro-ops are broken down into micro-ops in the schedulers. Micro-ops are executed when their
operands are available, either from the register file or result buses. Micro-ops from a single operation
can execute out-of-order. In addition, a particular integer pipe can execute two micro-ops from
different macro-ops (one in the ALU and one in the AGLU) at the same time. (See Figure 1 on
page 32.) The scheduler can receive up to four macro-ops per cycle. This group of macro-ops is
called a dispatch group.
EX0 contains a variable latency non-pipelined integer divider. EX1 contains a pipelined integer
multiplier. The AGLUs contain a simple ALU to execute arithmetic and logical operations and
generate effective addresses. A load and store unit (LSU) reads and writes data to and from the L1
data cache. The integer scheduler sends a completion status to the ICU when the outstanding
micro-ops for a given macro-op are executed. (For more information on the LSU, see section 2.12 on
page 38.)
The L1 DTLB has been increased to 64 entries for AMD Family 15h models 10h-1fh processors. For
AMD Family 15h models 20h-2fh processors, the L1 DTLB size has likewise increased from 32
entries to 64 entries.
The LZCNT and POPCNT operations are handled in a pipelined unit attached to EX0.
8> Floating-Point Unit
The AMD Family 15h processor floating-point unit (FPU) was designed to provide four times the raw
FADD and FMUL bandwidth of the original AMD Opteron and Athlon 64 processors. It achieves this
by means of two 128-bit fused multiply-accumulate (FMAC) units which are supported by a 128-bit
high-bandwidth load-store system. The FPU is a coprocessor model that is shared between the two
cores of one AMD Family 15h compute unit. As such it contains its own scheduler, register files and
renamers and does not share them with the integer units. This decoupling provides optimal
performance of both the integer units and the FPU. In addition to the two FMACs, the FPU also
contains two 128-bit integer units which perform arithmetic and logical operations on AVX, MMX
and SSE packed integer data.
A 128-bit integer multiply accumulate (IMAC) unit is incorporated into FPU pipe 0. The IMAC
performs integer fused multiply and accumulate, and similar arithmetic operations on AVX, MMX
and SSE data. A crossbar (XBAR) unit is integrated into FPU pipe 1 to execute the permute
instruction along with shifts, packs/unpacks and shuffles. There is an FPU load-store unit which
supports up to two 128-bit loads and one 128-bit store per cycle.
FPU Features Summary and Specifications:
• The FPU can receive up to four ops per cycle. These ops can only be from one thread, but the
thread may change every cycle. Likewise the FPU is four wide, capable of issue, execution and
completion of four ops each cycle. Once received by the FPU, ops from multiple threads can be
executed.
• Within the FPU, up to two loads per cycle can be accepted, possibly from different threads.
• There are four logical pipes: two FMAC and two packed integer. For example, two 128-bit
FMAC and two 128-bit integer ALU ops can be issued and executed per cycle.
• Two 128-bit FMAC units. Each FMAC supports four single-precision or two double-precision ops.
• FADDs and FMULs are implemented within the FMACs.
• x87 FADDs and FMULs are also handled by the FMACs.
• Each FMAC contains a variable latency divide/square root machine.
• Only one 256-bit operation can issue per cycle; however, an extra cycle can be incurred, as in the
case of a FastPath Double, if both micro-ops cannot issue together.
9> Load-Store Unit
The AMD Family 15h processor load-store (LS) unit handles data accesses. There are two LS units
per compute unit, or one per core. The LS unit supports two 128-bit loads and one 128-bit store per
cycle. A 24-entry store queue buffers stored data until it can be written to the data cache. The load
queue has 40 entries and holds load operations until the load has been completed and delivered to the
integer unit or the FPU. The LS unit is composed of two largely independent pipelines, enabling the
execution of two memory operations per cycle.
Finally, the LS unit helps ensure that the architectural load and store ordering rules are preserved
(a requirement for AMD64 architecture compatibility).
10> Write Combining
AMD Family 15h processors provide four write-combining data buffers that allow four simultaneous
streams.
A Write Coalescing Cache (WCC) has been incorporated into the AMD Family 15h
microarchitecture. The WCC is 4 Kbytes in size and 4-way set associative. Stores to cacheable
memory are coalesced in the WCC before being written to the L2 cache.
11> Integrated Memory Controller
AMD Family 15h processors provide integrated low-latency, high-bandwidth DDR3 memory
controllers.
The memory controller supports:
• DRAM chips that are 4, 8, and 16 bits wide within a DIMM.
• Interleaving memory within DIMMs.
• ECC checking with single symbol correcting and double symbol detecting.
• Dual-independent 64-bit channel operation.
• Optimized scheduling algorithms and access pattern predictors to improve latency and achieved
bandwidth, particularly for interleaved streams of read and write DRAM accesses.
• A data prefetcher.
Prefetched data is held in the memory controller itself and is not speculatively filled into the L1, L2,
or L3 caches. This prefetcher is able to capture both positive and negative stride values (both unit and
non-unit) of cache-line size, as well as some more complicated access patterns.
For specifications on a particular processor's memory controller, see the data sheet for that
processor. For information on how to program the memory controller, see the appropriate BIOS and
Kernel Developer's Guide.
12> HyperTransport™ Technology Interface
AMD Family 15h processors support HyperTransport™ 3.x and HyperTransport Assist.
Additional features in the AMD Family 15h HyperTransport implementation may include:
• HyperTransport link bandwidth balancing, allowing multiple HyperTransport links to be teamed
to carry coherent traffic.
• HyperTransport Link Splitting, which allows a single 16-bit link to be split into two 8-bit links.
These features allow for further optimized platform designs that are capable of increasing system
bandwidth and reducing latency.
13> AMD Virtualization Optimizations
Optimization topics for virtualized environments include:
• The advantages of using nested paging instead of shadow paging
• Guest page attribute table (PAT) configuration
• State swapping
• Economizing Interceptions
• Nested page and shadow page size
• TLB control and flushing in shadow pages
• Instruction Fetch for Intercepted (REP) INS instructions
• Sharing IOIO and MSR protection masks
• CPUID
• Time resources
• Paravirtualized resources