## 14.5 A 600MHz Single-Chip Multiprocessor with 4.8GB/s Internal Shared Pipelined Bus and 512kB Internal Memory

Satoshi Kaneko, Katsunori Sawai, Norio Masui, Koichi Ishimi, Teruyuki Itou, Masayuki Satou, Hiroyuki Kondo, Naoto Okumura, Yukari Takata, Hirokazu Takata, Mamoru Sakugawa, Takashi Higuchi, Sugako Ohtani, Kei Sakamoto, Naoshi Ishikawa, Masami Nakajima, Shunichi Iwata, Kiyoshi Hayase, Satoshi Nakano, Sachiko Nakazawa, Osamu Tomisawa, Toru Shimizu

Mitsubishi Electric Corporation, Itami, Japan

This 600MHz single-chip multiprocessor consists of two M32R 32b CPU cores [1] and 512kB of shared SRAM, and is designed as a core for embedded microcontrollers and systems on chip (SoC). As software for embedded systems grows in complexity, embedded processors [2]-[4] are required to deliver higher performance. At the same time, low power dissipation remains a key requirement for battery-operated applications. The objective of this work is to implement a single-chip multiprocessor with both higher peak performance and lower power dissipation.

This chip is fabricated in a 0.15µm 4M CMOS process, runs at 600MHz, and dissipates 800mW peak. Supply voltages are 1.5V (internal) and 3.3V (I/O). Figure 14.5.1 shows the micrograph of the 65.0mm<sup>2</sup> die. The block diagram of this chip is shown in Fig. 14.5.2. Two symmetric CPU cores (CPU0, CPU1) and the shared SRAM (512kB) are connected via a 128b-wide internal CPU bus. The operating frequencies of the CPU core, the CPU bus/shared SRAM, and the external bus are 600MHz, 300MHz and 100MHz, respectively. The CPU core is a 7-stage pipelined, dual-issue processor with DSP functions. It contains 8kB 2-way set-associative instruction and data caches, and 32-entry fully associative instruction and data TLBs. The peripheral units are an ICU for multiprocessor use, a timer, serial I/O, a digital phase-locked loop (PLL) and an SDRAM controller.

Since each CPU has two bus access ports, one for instruction fetches and the other for loads/stores, there are four bus masters launching multiple requests. To meet the large bandwidth requirement of these bus masters, the internal CPU bus is pipelined. Figure 14.5.3 shows the pipeline of the bus transfer. The pipeline consists of four stages: arbitration, snoop, slave-access and data, each taking one bus cycle (3.3ns). The 300MHz 128b internal CPU bus provides 4.8GB/s peak throughput. Some pipeline stages are omitted to eliminate unnecessary cycles for specific bus operations. The data stage is removed on operand stores. The snoop stage is removed on instruction fetches and non-cacheable accesses in the multiprocessor mode, and on all bus operations in the single-processor mode. Because the bus latency is kept low, inefficient speculative bus operation is eliminated.
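The throughput figure and the stage-omission rules above can be checked with a small model. The following is a minimal sketch, not the chip's actual bus logic: the function names and the operation labels (`"ifetch"`, `"load"`, `"store"`) are illustrative assumptions, and only the stage names, bus width and bus clock come from the text.

```python
# Illustrative model of the pipelined internal CPU bus (names are assumptions).
BUS_CLOCK_HZ = 300e6      # internal CPU bus clock, from the text
BUS_WIDTH_BITS = 128      # internal CPU bus width, from the text
ALL_STAGES = ("arbitration", "snoop", "slave-access", "data")


def peak_throughput_bytes_per_s(clock_hz=BUS_CLOCK_HZ, width_bits=BUS_WIDTH_BITS):
    """One full-width transfer per bus cycle: bytes per cycle times cycles per second."""
    return clock_hz * width_bits / 8


def stages_for(op, cacheable=True, multiprocessor=True):
    """Return the pipeline stages used by a bus operation.

    op is one of "ifetch", "load" or "store" (hypothetical labels).
    """
    stages = list(ALL_STAGES)
    if op == "store":
        stages.remove("data")       # data stage is removed on operand stores
    if not multiprocessor or op == "ifetch" or not cacheable:
        stages.remove("snoop")      # snoop stage is removed in these cases
    return stages


def latency_ns(stages, clock_hz=BUS_CLOCK_HZ):
    """Each stage takes one bus cycle (about 3.3ns at 300MHz)."""
    return len(stages) * 1e9 / clock_hz


if __name__ == "__main__":
    print(peak_throughput_bytes_per_s() / 1e9, "GB/s")
    for op in ("ifetch", "load", "store"):
        s = stages_for(op)
        print(op, s, round(latency_ns(s), 1), "ns")
```

In this model, 128b is 16 bytes per cycle, and 16 bytes at 300MHz gives the 4.8GB/s peak quoted above.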

The internal shared bus also provides the features of a cache-coherent symmetric multiprocessor (SMP) bus. The data cache (D-cache) is a write-back cache supporting the MESI protocol. The internal shared bus also supports a lock/unlock function. These features provide a means of interprocessor synchronization.

The cache memories are among the largest current-consuming blocks in a CPU. Reducing the power dissipation of the on-chip memory system is key to implementing a single-chip multiprocessor for embedded use, because multiple cache memories operate concurrently. Figure 14.5.4 shows the diagram of the instruction cache (I-cache). To reduce power consumption, in normal prefetch cycles the tag memory access and the data memory access are split across two consecutive cycles and only one is activated. On a branch, by contrast, the tag memory access and the data memory access are executed in the same cycle to enhance performance. With these two variations of cache access, both low power and high performance are obtained. The variable-latency cache reduces its power dissipation by about 35% according to Dhrystone benchmark simulation.
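The trade-off between the two access modes can be sketched as a toy energy model. This is only an illustration under stated assumptions: it reads "only one is activated" as meaning that, after the tag compare, only the data array of the hit way is read; the mode names and the per-array energy values `e_tag` and `e_data` are hypothetical placeholders, not measured figures, and the model does not attempt to reproduce the 35% result.

```python
# Toy model of the variable-latency I-cache access (all numbers hypothetical).
WAYS = 2  # 2-way set-associative, from the text


def icache_access(mode, e_tag=1.0, e_data=4.0):
    """Return (cycles, energy) for one access in arbitrary energy units.

    mode == "sequential": tags are read first, then (assumed) only the
        hit way's data array is read in the following cycle.
    mode == "branch": tags and all data arrays are read in the same cycle.
    """
    if mode == "sequential":
        return 2, WAYS * e_tag + 1 * e_data
    if mode == "branch":
        return 1, WAYS * e_tag + WAYS * e_data
    raise ValueError(f"unknown access mode: {mode}")


if __name__ == "__main__":
    for mode in ("sequential", "branch"):
        cycles, energy = icache_access(mode)
        print(f"{mode}: {cycles} cycle(s), {energy} energy units")
```

Under these assumptions the sequential path saves the energy of the unused data array at the cost of one extra cycle, while the branch path pays for both arrays to get the instruction one cycle earlier.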

The power consumption of the TLBs is reduced by separating the tag memory into an address space identifier (ASID) tag and a virtual page number (VPN) tag, as shown in Figure 14.5.5. Only when a new value is stored to the ASID register, for example on a context switch, is each entry of the ASID-CAM plane compared, and the result is stored in the ASID-match registers. The VPN-tag comparison is then performed only on the entries whose ASID-match register is set. Since the ASID-CAM plane is not compared as frequently as the VPN-CAM plane, this divided TLB tag memory reduces power dissipation by more than 28%.
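The divided-tag lookup can be illustrated with a small behavioral model. This is a sketch under assumptions, not the circuit itself: the class and field names are hypothetical, entry refill is left out, and the comparison counters simply stand in for CAM activity to show why the ASID plane is exercised less often than the VPN plane.

```python
from dataclasses import dataclass


@dataclass
class TlbEntry:
    asid: int
    vpn: int
    ppn: int                  # physical page number returned on a hit
    asid_match: bool = False  # models the per-entry ASID-match register


class SplitTagTlb:
    """Behavioral model of a TLB with separate ASID and VPN tag planes."""

    def __init__(self, entries):
        self.entries = list(entries)
        self.asid_compares = 0  # activity in the ASID-CAM plane
        self.vpn_compares = 0   # activity in the VPN-CAM plane

    def write_asid(self, asid):
        # ASID plane is compared only when the ASID register is written,
        # e.g. on a context switch; results are latched per entry.
        for e in self.entries:
            self.asid_compares += 1
            e.asid_match = (e.asid == asid)

    def lookup(self, vpn):
        # VPN plane is compared only for entries whose ASID-match bit is set.
        candidates = [e for e in self.entries if e.asid_match]
        self.vpn_compares += len(candidates)
        for e in candidates:
            if e.vpn == vpn:
                return e.ppn
        return None  # TLB miss


if __name__ == "__main__":
    tlb = SplitTagTlb([TlbEntry(1, 0x10, 0xA0), TlbEntry(2, 0x10, 0xB0),
                       TlbEntry(1, 0x11, 0xA1)])
    tlb.write_asid(1)
    print(hex(tlb.lookup(0x10)), tlb.asid_compares, tlb.vpn_compares)
```

In this model the ASID comparisons grow only with the number of ASID register writes, while each lookup touches just the entries that belong to the current address space.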

In addition to the normal multiprocessor mode, this chip supports three power-saving modes: (1) single-processor mode stops CPU1; (2) sleep mode stops CPU0 and CPU1; (3) stop mode stops all operating clock signals. CPU0, CPU1, the internal CPU bus and the peripheral units can work at different frequencies. These frequencies can be changed dynamically through software-accessible registers, and a change takes 51ns. Furthermore, the digital PLL features a fast lock-in time because it includes a counter that controls the frequency. The value of this counter is stored in a register before entering stop mode and is restored after exiting stop mode. With this method, the lock-in time is shortened to 750ns (reference clock 100MHz).

As shown in Figure 14.5.6, a PLL and a clock generator generate four clock signals (CLKCPU0, CLKCPU1, CLKBUS, CLKPER). The lower-frequency clocks, CLKPER and CLKBUS, are distributed through clock trees over the entire chip. The higher-frequency clocks, CLKCPU0 and CLKCPU1, are distributed to CPU0 and CPU1 through equal-length routing. Inside the CPU core, the CLKCPU clock domain is divided into four small regions. In each region, CLKCPU is distributed through a clock mesh. This configuration minimizes the area required for clock distribution and reduces the clocking power. The total area of the clock meshes is about 1570µm × 300µm, resulting in a clocking power consumption of 75.24mW for each CPU. More than 61% of the flip-flops and latches are clock-gated to reduce power consumption. Simulated clock skew is 57.5ps within one clock mesh, and 75.5ps overall.

## References

[1] T. Shimizu, et al., "Multimedia 32b RISC Microprocessor with 16Mb DRAM," *ISSCC Digest of Technical Papers*, pp. 216-217, 1996.

[2] M. Nakajima, et al., "400MHz 32b Embedded Microprocessor Core AM34-1 with 4.0GB/s Cross-Bar Bus Switch for SoC," *ISSCC Digest of Technical Papers*, pp. 342-343, 2002.

[3] T. Koyama, et al., "250MHz Single-Chip Multiprocessor for A/V Signal Processing," *ISSCC Digest of Technical Papers*, pp. 146-147, 2001.

[4] N. Nishi, et al., "1GIPS 1W Single-Chip Tightly-Coupled Four-Way Multiprocessor with Architecture Support for Multiple Control Flow Execution," *ISSCC Digest of Technical Papers*, pp. 418-419, 2000.


