This first, high-end desktop implementation of Nehalem is code-named Bloomfield, and it's essentially the same silicon that should go into two-socket servers eventually. As a result, Bloomfield chips come with two QPI links onboard, as the die shot above indicates. However, the second QPI link is unused. In 2P servers based on this architecture, that second interconnect will link the two sockets, and over it, the CPUs will share cache coherency messages (using a new protocol) and data (since the memory subsystem will be NUMA)—again, very similar to the Opteron.
In order to take advantage of this radically modified system architecture, the design team tweaked Nehalem's processor cores in a range of ways big and small. Although the Core 2's basic four-issue-wide design and execution resources remain more or less unchanged, almost everything around the execution units has been altered to keep them more fully occupied. The instruction decoder can fuse more types of x86 instructions together and, unlike Core 2, it can do so when running in 64-bit mode. The branch predictor's accuracy has been enhanced, too. Many of the changes involve the memory subsystem—not just the caches and memory controller, which we've already discussed, but inside the core itself. The load and store buffers have been increased in size, for instance.
These modifications make sense in light of the Core i7's much higher system-level throughput, but they also help make another new mechanism in the chip work better: the resurrected Hyper-Threading, or simultaneous multithreading (SMT). Each core in Nehalem can track two independent hardware threads, much like some other Intel processors, including later versions of the Pentium 4 and, more recently, the Atom. SMT takes advantage of the explicit parallelism built into multithreaded software to keep the CPU's execution units more fully occupied, and done well, it can be a clear win, delivering solid performance gains at very little cost in terms of additional die area or power use. Intel architect Rohank Singhal outlined how Nehalem's implementation of Hyper-Threading works at this past Fall IDF. Some hardware, such as the registers, must be duplicated for each thread, but much of it can be shared. Nehalem's load, store, and reorder buffers are statically partitioned between the two threads, for example, while the reservation station and caches are shared dynamically based on demand. The execution units themselves don't need to be altered at all.
Comments (0)
Post a Comment