# An Operation Rearrangement Technique for Low-Power VLIW Instruction Fetch* 

Dongkun Shin and Jihong Kim<br>School of Computer Science and Engineering<br>Seoul National University<br>E-mail: \{sdk, jihong\}@davinci.snu.ac.kr

## Abstract

As mobile applications are required to handle more computing-intensive tasks, many mobile devices are designed using VLIW processors for high performance. In VLIW machines where a single instruction contains multiple operations, the power consumption during instruction fetches varies significantly depending on how the operations are arranged within the instruction. In this paper, we describe a post-pass optimal operation rearrangement method for low-power VLIW instruction fetch. The proposed method modifies operation placement orders within VLIW instructions so that the switching activity between successive instruction fetches is minimized. Our experiment shows that the switching activity can be reduced by $34 \%$ on average for benchmark programs.

## I. Introduction

As mobile applications are required to handle more computing-intensive tasks (such as video decoding), many mobile devices are designed using VLIW processors for high performance. For example, the Crusoe processors [9] from Transmeta (which were developed for the mobile Internet computing market) are based on 64-bit or 128-bit VLIW CPU cores. Fujitsu Microelectronics' FR300 [5] (whose main application area is wireless cellular phones) also has a VLIW architecture. In addition, there are many VLIW digital signal processors, such as Texas Instruments' TMS320C6x series, that can be used for wireless devices $[4,6]$.

While VLIW CPU-based mobile devices generally provide enough computing power to handle many computing-intensive applications, they usually consume a large amount of power. For example, TMS320C620x processors consume between 1.2 W and 2.3 W at 1.8 V, while high-end embedded microprocessors such as StrongArm 110 consume between 100 mW and 1 W at 3 V [8, 13]. Therefore, in designing VLIW CPU-based mobile devices, low power consumption is a dominant design constraint.

[^0]In digital CMOS circuits (that use well-designed logic gates), switching activity accounts for over $90 \%$ of total power consumption [1]. Therefore, many techniques have been proposed and developed to reduce the amount of switching activity at multiple levels of design abstraction [3]. For example, bus-invert coding [14] removes a significant number of bit changes on bus lines by dynamically inverting the bus lines when the number of switched bus lines is more than half the number of bus lines. Register relabeling [11] assigns register numbers of instructions so that register numbers that appear consecutively more often have a smaller Hamming distance, thus reducing the switching activity in the instruction fetch and decode logic.

In this paper, we propose a post-pass optimization technique that can significantly reduce switching activity during the instruction fetch phase in VLIW processors. The proposed method takes advantage of a VLIW machine's instruction encoding characteristic: VLIW CPUs can place the same operation in multiple operation slots within the VLIW instruction. ${ }^{1}$ Since a single instruction generally contains multiple operations in a VLIW CPU, the power consumption during instruction fetches varies significantly depending on how the operations are arranged within the instruction. We reduce switching activity by modifying operation placement orders within VLIW instructions so that the switching activity between successive instruction fetches is minimized.

The organization of the rest of the paper is as follows. Before presenting the proposed operation rearrangement technique, we review prior works on low-power instruction scheduling in Section II. In Section III, we describe a target VLIW machine model and define several terms. An operation rearrangement technique applicable to a single basic block is explained in Section IV while the complete solution is discussed in Section V. Experimental results are presented in Section VI followed by conclusions in Section VII.

[^1]
## II. Related Works

The low-power instruction scheduling problem has been recently investigated by several research groups.

Su et al. proposed an instruction scheduling technique, called cold scheduling, to reduce the amount of switching activity in the control path [15]. Used in conjunction with a traditional list scheduling algorithm, cold scheduling schedules instructions in the ready list based on the power cost of an instruction. The power cost of an instruction is determined by the number of bit changes when the instruction in question is scheduled following the last instruction. Su et al. show that the combination of Gray code addressing and cold scheduling results in a $20-30 \%$ reduction in the switching activity from the control path.

Tiwari et al. show that conventional compiler optimization techniques targeting high performance are also effective for low-power software $[16,17]$. Their experiments indicate that an optimal register allocation technique is effective in reducing power consumption.

Lee et al. have investigated the low-power scheduling problem for DSP-based systems [10]. They take into consideration what they term circuit-state overhead, which is the switching activity between a pair of specific instructions. Through code rescheduling based on circuit-state overhead, energy savings of up to $40 \%$ were achieved on the benchmarks used.

Since off-chip drivers and buses consume a significant amount of power in microprocessor-based systems, low-power instruction scheduling has also been studied as a way to reduce the switching activity on the system bus. Tomiyama et al. proposed an instruction scheduling technique which reduces transitions on an instruction bus between an on-chip cache and a main memory when instruction cache misses occur [19]. This technique schedules instructions in each basic block so that the binary representations of two consecutive machine instructions differ in fewer bits, while maintaining the control/data dependencies of the original program.

Most existing low-power instruction scheduling techniques (including the techniques described above), however, assume that processors can issue at most one instruction per cycle. Therefore, these techniques cannot be directly applied to multiple-issue machines such as a VLIW CPU. In a VLIW CPU, since multiple operations are packed into a single instruction, two levels of scheduling decisions should be made to reduce power consumption. In the first level, we have to decide which operations are packed into which instructions. Once the first-level scheduling decision is made, in the second level, we have to decide in which order the selected operations are placed within each instruction. The technique proposed in this paper solves the second-level low-power scheduling problem for a VLIW CPU, assuming that the decision for the first-level scheduling problem has already been made.

One recent study investigated a low-power instruction scheduling technique for a VLIW CPU [18]. However, the goal of [18] was to reduce the peak power dissipation. The scheduling algorithm described in [18] schedules an operation in the current instruction as long as the power dissipation of the current instruction does not exceed the given threshold value. Although effective in reducing the peak power dissipation, this algorithm does not take account of the inter-instruction effect and inter-operation effect during the scheduling process. Our scheduling algorithm proposed in this paper considers both effects in arranging the operations within the instruction, thus resulting in a better solution.

## III. VLIW Machine Model and Definitions

## A. Target VLIW Machine Model

VLIW architectures use long instruction words to execute multiple operations simultaneously. In specifying multiple operations within a single VLIW instruction, two encoding methods are typically used: uncompressed encoding and compressed encoding [2]. In a VLIW machine with an uncompressed encoding, each operation slot of a VLIW instruction corresponds to a particular functional unit. The operation specified in a particular operation slot, therefore, is executed only in the corresponding functional unit. If a functional unit is not scheduled to execute an operation at the given cycle, NOP should be specified in the corresponding operation slot. Under this encoding method, the number of candidate operation slots for an operation is limited to the number of corresponding functional units that can execute the operation.

On the other hand, in a VLIW machine with a compressed encoding, the position of an operation slot within a VLIW instruction does not directly correspond to a particular functional unit. The assignment of a particular functional unit to an operation is generally decided by the functional unit subfield of the operation encoding. The functional unit subfield specifies which functional unit should be assigned to the operation. In addition, in order to increase memory utilization, NOP operations are not explicitly encoded in the VLIW instruction. In this type of VLIW machine, an operation can be placed in any operation slot within the same VLIW instruction.

Figures 1 and 2 compare the two types of encoding methods using a sample VLIW program sequence $S$. In the program sequence $S$, three VLIW instructions are shown where "||" specifies parallel operations that are executed simultaneously. As shown in Figures 1.(b) and 1.(c), in an uncompressed VLIW instruction encoding, the operation rearrangement is rather limited. For example, in the first VLIW instruction, IADD and NOP, FADD and NOP, and LOAD and STORE can be exchanged. For a compressed VLIW instruction encoding shown in Figures 2.(b) and 2.(c), there are more chances for operation rearrangements because there is no direct correspondence between the position of an operation slot and a corresponding functional unit. For example, for the first VLIW instruction of $S$, all $4!$ different operation rearrangements are possible. ${ }^{2}$ Although the proposed operation rearrangement technique is equally effective for a VLIW machine with an uncompressed encoding, we assume that the target VLIW CPU uses a compressed encoding method.

Fig. 1. Uncompressed VLIW instruction encoding; (a) a sample instruction sequence $S$, (b) one uncompressed encoding of $S$ and (c) an alternative encoding of $S$.

Fig. 2. Compressed VLIW instruction encoding; (a) a sample instruction sequence $S$, (b) one compressed encoding of $S$ and (c) an alternative encoding of $S$.

Throughout this paper, we consider a target system with the architectural organization shown in Figure 3. The VLIW processor with a compressed encoding has an on-chip instruction cache. The VLIW instructions are fetched through the $b_{\text{cache}}$-bit wide instruction bus. If the instruction is not found in the on-chip instruction cache, the corresponding memory block is fetched from the main memory through the $b_{\text{mem}}$-bit wide instruction bus. Because of the compressed encoding format, several VLIW instructions can be fetched together in a single fetch from the instruction cache. We call such a group of instructions a fetch packet.

[^2]

Fig. 3. Target system architecture.

For descriptive purposes, we make the following assumptions about the target system:

- In a single $b_{\text {cache }}$-bit fetch packet, exactly $N$ operations are included. (That is, the width of a single operation slot is exactly $b_{\text {cache }} / N$.)
- No instruction crosses the fetch packet boundary.
- $b_{\text{mem}}$ is equal to the operation width. (That is, $b_{\text{mem}}=b_{\text{cache}} / N$.)
- When the external instruction bus is not used, each line in the external bus is assumed to hold a logic 1 value to prevent a high-impedance condition.


## B. Definitions

In explaining the operation rearrangement technique, we use the following definitions:

Definition 1 A permutation $\sigma:\{1, \cdots, n\} \rightarrow\{1, \cdots, n\}$ is said to be an operation rearrangement function.

Definition 2 Two VLIW instructions $I_{1}=\left(OP_{1}^{1}, OP_{2}^{1}, \cdots, OP_{n}^{1}\right)$ and $I_{2}=\left(OP_{1}^{2}, OP_{2}^{2}, \cdots, OP_{n}^{2}\right)$ are said to be equivalent under operation rearrangement if there exists an operation rearrangement function $\sigma$ such that $OP_{\sigma(i)}^{1}=OP_{i}^{2}$ for all $1 \leq i \leq n$.

Definition 3 Two fetch packets $FP_{1}=\left(I_{1}^{1}, I_{2}^{1}, \cdots, I_{n}^{1}\right)$ and $FP_{2}=\left(I_{1}^{2}, I_{2}^{2}, \cdots, I_{n}^{2}\right)$ are said to be equivalent under operation rearrangement if there exist operation rearrangement functions $\left(\sigma_{1}, \sigma_{2}, \cdots, \sigma_{n}\right)$ such that $I_{i}^{1}$ is equivalent to $I_{i}^{2}$ under $\sigma_{i}$ for all $1 \leq i \leq n$. $EQ\left(FP_{i}\right)$ is used to represent the set of equivalent fetch packets for a given $FP_{i}$.

Definition 4 Two basic blocks $bb_{1}=\left(FP_{1}^{1}, FP_{2}^{1}, \cdots, FP_{n}^{1}\right)$ and $bb_{2}=\left(FP_{1}^{2}, FP_{2}^{2}, \cdots, FP_{n}^{2}\right)$ are said to be equivalent under operation rearrangement if $FP_{i}^{1}$ is equivalent to $FP_{i}^{2}$ under operation rearrangement for all $1 \leq i \leq n$. $EQ(bb)$ is used to represent the set of equivalent basic blocks for a given basic block $bb$.

Definition 5 Two programs $S_{1}=\left(bb_{1}^{1}, bb_{2}^{1}, \cdots, bb_{n}^{1}\right)$ and $S_{2}=\left(bb_{1}^{2}, bb_{2}^{2}, \cdots, bb_{n}^{2}\right)$ are said to be equivalent under operation rearrangement if $bb_{i}^{1}$ is equivalent to $bb_{i}^{2}$ under operation rearrangement for all $1 \leq i \leq n$. $EQ(S)$ is used to represent the set of equivalent programs for a given program $S$.

In the rest of the paper, we use "equivalent" to mean "equivalent under operation rearrangement" where no confusion arises.
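The definitions above can be made concrete with a small enumeration. The following Python sketch is illustrative only: it assumes that operations are opaque strings, that an instruction is a tuple of operations, and that a fetch packet is a tuple of instructions. The function names are hypothetical and do not come from the paper.

```python
from itertools import permutations, product

def equivalent_instructions(instr):
    """All rearrangements of one VLIW instruction (Definitions 1 and 2)."""
    # set() removes duplicates in case an instruction repeats an operation.
    return sorted(set(permutations(instr)))

def equivalent_fetch_packets(fetch_packet):
    """EQ(FP): each instruction is rearranged independently (Definition 3)."""
    per_instruction = [equivalent_instructions(i) for i in fetch_packet]
    return [tuple(choice) for choice in product(*per_instruction)]

# Hypothetical fetch packet: a 3-operation instruction followed by a
# 1-operation instruction. Operations never cross an instruction boundary.
fp = (("IADD", "FADD", "LOAD"), ("STORE",))
eq = equivalent_fetch_packets(fp)
print(len(eq))  # 3! * 1! = 6
```

The count grows as a product of factorials, which is why the later sections care about the size of each $EQ(FP)$ set.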

## IV. Local Operation Rearrangement Problem

In this section, we consider a simpler operation rearrangement problem that we call the local operation rearrangement (LOR) problem. In the LOR problem, each basic block is considered independently, and we assume that the basic block is fetched from the main memory and executed only once. Since the basic block is fetched from the main memory, cache misses occur during the instruction fetch. The complete operation rearrangement problem, which we call the global operation rearrangement (GOR) problem, is discussed in the next section. In the GOR problem, all the basic blocks are considered simultaneously.

## A. Basic Idea

In order to reduce the switching activity during the instruction fetch phase in a target system, we reduce the number of bit transitions between successive instruction fetches, because switching activity is directly proportional to the number of bit changes. Since, in a VLIW machine with a compressed encoding, an operation can be placed in any operation slot within the instruction boundary, the number of bit transitions between successive instruction fetches can be reduced by replacing the given VLIW instructions with equivalent instructions that cause less switching activity. Consider the example shown in Figure 4. There are four fetch packets, each of which is 32 bits wide (that is, $b_{\text{cache}}=32$). In the example, each fetch packet consists of a single VLIW instruction which in turn consists of four operations. Figure 4.(b) shows the instruction sequence after the operation placement order was modified to reduce the bit transitions on the instruction bus. When the four instructions are executed sequentially only once, the rearranged instruction sequence shown in Figure 4.(b) reduces the total number of bit changes by about $25 \%$, from 39 to 29, while maintaining the same semantics as the original sequence.
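The bit-transition counts quoted for the rearranged sequence can be checked directly. The short Python sketch below takes the four 32-bit values listed as "fetched values on the instruction bus" in Figure 4.(b) and counts the bits that differ between successive fetches; the helper name `hamming` is an illustrative choice, not something defined in the paper.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two bus values differ."""
    return bin(a ^ b).count("1")

# Fetched values on the instruction bus after rearrangement (Figure 4.(b)).
fetches = [
    "00010101100101011001100100000000",
    "00011101100011110101110100000010",
    "10011101100110011111111110010000",
    "10001111000111011010010100011100",
]
values = [int(bits, 2) for bits in fetches]

# Bit transitions between each pair of successive fetches, and their total.
transitions = [hamming(x, y) for x, y in zip(values, values[1:])]
print(transitions, sum(transitions))  # [8, 10, 11] 29
```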

## B. LOR Problem Formulation

(a) Before operation rearrangement (panel contents not reproduced)

(b) After operation rearrangement: instruction cache contents, which are also the values fetched on the instruction bus

| Fetch | Operation 1 | Operation 2 | Operation 3 | Operation 4 | Bit transitions to the next fetch |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | 00010101 | 10010101 | 10011001 | 00000000 | 8 |
| 2 | 00011101 | 10001111 | 01011101 | 00000010 | 10 |
| 3 | 10011101 | 10011001 | 11111111 | 10010000 | 11 |
| 4 | 10001111 | 00011101 | 10100101 | 00011100 |  |

The total number of bit changes $=29$.

Fig. 4. An operation rearrangement example.

If a given basic block $B$ is executed only once, in our target architecture shown in Figure 3, the number of bit changes $S W^{B}$ during the instruction fetch phase is given by the sum of two terms, $S W_{\text{cache}}^{B}$ and $S W_{\text{mem}}^{B}$. $S W_{\text{cache}}^{B}$ represents the number of bit changes at the internal instruction bus and $S W_{\text{mem}}^{B}$ indicates the number of bit changes at the external instruction bus. Using the notations explained in Table 1, $S W_{\text{cache}}^{B}$ and $S W_{\text{mem}}^{B}$ are computed as follows.

$S W_{\text{cache}}^{B}$ is the sum of all the bit changes incurred during successive fetches of fetch packets from the instruction cache and is calculated as follows:

$$
\begin{equation*}
S W_{\text {cache }}^{B}=\sum_{i=1}^{N_{f p}(B)-1} d_{f p}\left(F P_{i}^{B}, F P_{i+1}^{B}\right) \tag{1}
\end{equation*}
$$

$S W_{m e m}^{B}$ is the sum of all the bit changes between adjacent operation fetches from the main memory. Since we assumed that $b_{m e m}$ is equal to $b_{\text {cache }} / N_{o p}$ in Section III, if we assume that there is only one cache miss for each memory block and basic blocks are aligned by the cache memory block size, $S W_{m e m}^{B}$ is calculated as follows:

$$
\begin{align*}
S W_{\text{mem}}^{B} & =\sum_{i=1}^{N_{f p}(B)} \sum_{n=1}^{N_{o p}-1} d_{o p}\left(O P_{n}^{F P_{i}^{B}}, O P_{n+1}^{F P_{i}^{B}}\right) \\
& +\sum_{i=1}^{N_{f p}(B)-1} d_{o p}\left(O P_{N_{o p}}^{F P_{i}^{B}}, O P_{1}^{F P_{i+1}^{B}}\right) \\
& +d_{o p}\left(\mathbf{1}, O P_{1}^{F P_{1}^{B}}\right)+d_{o p}\left(O P_{N_{o p}}^{F P_{N_{f p}(B)}^{B}}, \mathbf{1}\right) \tag{2}
\end{align*}
$$

Assuming the load capacitance ratio of the internal instruction bus to the external instruction bus is $\frac{1}{\alpha}, S W^{B}$ is computed as follows using the Equations (1) and (2):

$$
\begin{align*}
S W^{B} & =S W_{\text {cache }}^{B}+\alpha \cdot S W_{\text {mem }}^{B} \\
& =\sum_{i=1}^{N_{f p}(B)-1} S W_{F P}^{\text {inter }}\left(F P_{i}^{B}, F P_{i+1}^{B}\right) \\
& +\sum_{i=1}^{N_{f p}(B)} S W_{F P}^{\text {intra }}\left(F P_{i}^{B}\right) \tag{3}
\end{align*}
$$

| Symbol | Meaning |
| :--- | :--- |
| $N_{f p}(B)$ | The number of fetch packets in a basic block $B$. |
| $N_{o p}$ | The number of operations in a fetch packet. (This is a fixed value regardless of $B$.) |
| $\mathbf{1}$ | The bit vector where every bit is 1 and whose length is $b_{m e m}$. |
| $F P_{i}^{B}$ | The $i$-th fetch packet of a basic block $B$. |
| $O P_{n}^{F P_{i}^{B}}$ | The $n$-th operation of $F P_{i}^{B}$. <br> (Within a fetch packet $F P_{i}^{B}$, the first operation is $O P_{1}^{F P_{i}^{B}}$ and the last one is $O P_{N_{o p}}^{F P_{i}^{B}}$.) |
| $d_{f p}\left(F P_{i}^{B}, F P_{j}^{B}\right)$ | The Hamming distance between the fetch packets $F P_{i}^{B}$ and $F P_{j}^{B}$. |
| $d_{o p}\left(O P_{n}^{F P_{i}^{B}}, O P_{m}^{F P_{j}^{B}}\right)$ | The Hamming distance between the operations $O P_{n}^{F P_{i}^{B}}$ and $O P_{m}^{F P_{j}^{B}}$. |

TABLE 1
Notations used in Section IV.B

where the two terms in Equation (3) are defined as follows:

$$
\begin{align*}
S W_{F P}^{\text{inter}}\left(F P_{i}^{B}, F P_{i+1}^{B}\right) & = d_{f p}\left(F P_{i}^{B}, F P_{i+1}^{B}\right)+\alpha \cdot d_{o p}\left(O P_{N_{o p}}^{F P_{i}^{B}}, O P_{1}^{F P_{i+1}^{B}}\right) \tag{4}\\
S W_{F P}^{\text{intra}}\left(F P_{i}^{B}\right) & = \begin{cases}\alpha \cdot d_{o p}\left(\mathbf{1}, O P_{1}^{F P_{i}^{B}}\right)+S_{o p} & \text{if } i=1 \\
\alpha \cdot d_{o p}\left(O P_{N_{o p}}^{F P_{i}^{B}}, \mathbf{1}\right)+S_{o p} & \text{if } i=N_{f p}(B) \\
S_{o p} & \text{otherwise}\end{cases} \tag{5}\\
& \quad\left(\text{where } S_{o p}=\alpha \cdot \sum_{n=1}^{N_{o p}-1} d_{o p}\left(O P_{n}^{F P_{i}^{B}}, O P_{n+1}^{F P_{i}^{B}}\right)\right)
\end{align*}
$$

Given a basic block $B$, the LOR problem is to find an equivalent basic block $B^{\prime}$ such that $S W^{B^{\prime}} \leq$ $S W^{B^{\prime \prime}}$ for all $B^{\prime \prime} \in E Q(B)$. If operations are rearranged, $d_{f p}\left(F P_{i}^{B}, F P_{i+1}^{B}\right), d_{o p}\left(O P_{N_{o p}}^{F P_{i}^{B}}, O P_{1}^{F P_{i+1}^{B}}\right)$ and $d_{o p}\left(O P_{n}^{F P_{i}^{B}}, O P_{n+1}^{F P_{i}^{B}}\right)$ in Equations (4) and (5) are changed.
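To make the cost model tangible, the following Python sketch evaluates Equations (1) to (5) on the rearranged block of Figure 4.(b). It is a sketch under stated assumptions: operations are 8-bit integers, the idle bus vector $\mathbf{1}$ is `0xFF`, and $\alpha=2$ is an arbitrary illustrative value. All function names are hypothetical. For a block with a single fetch packet, `sw_intra` applies both boundary terms, which matches Equation (2).

```python
ALL_ONES = 0xFF  # assumed idle external-bus value (the bit vector 1), 8-bit operations

def hamming(a, b):
    return bin(a ^ b).count("1")

def pack(fp, width=8):
    """Concatenate the operations of a fetch packet into one bus value."""
    value = 0
    for op in fp:
        value = (value << width) | op
    return value

def sw_cache(block):
    """Equation (1): bit changes on the internal instruction bus."""
    return sum(hamming(pack(a), pack(b)) for a, b in zip(block, block[1:]))

def sw_mem(block):
    """Equation (2): bit changes on the external bus, one operation per transfer."""
    stream = [ALL_ONES] + [op for fp in block for op in fp] + [ALL_ONES]
    return sum(hamming(a, b) for a, b in zip(stream, stream[1:]))

def sw_total(block, alpha):
    """First line of Equation (3)."""
    return sw_cache(block) + alpha * sw_mem(block)

def sw_inter(fp, nxt, alpha):
    """Equation (4)."""
    return hamming(pack(fp), pack(nxt)) + alpha * hamming(fp[-1], nxt[0])

def sw_intra(block, i, alpha):
    """Equation (5) for the fetch packet at 0-based position i."""
    fp = block[i]
    cost = alpha * sum(hamming(a, b) for a, b in zip(fp, fp[1:]))
    if i == 0:
        cost += alpha * hamming(ALL_ONES, fp[0])
    if i == len(block) - 1:
        cost += alpha * hamming(fp[-1], ALL_ONES)
    return cost

# The rearranged block of Figure 4.(b), one row per fetch packet.
block = [
    [0x15, 0x95, 0x99, 0x00],
    [0x1D, 0x8F, 0x5D, 0x02],
    [0x9D, 0x99, 0xFF, 0x90],
    [0x8F, 0x1D, 0xA5, 0x1C],
]
alpha = 2  # assumed value, for illustration only

# Second form of Equation (3): sum of inter and intra fetch-packet terms.
decomposed = (sum(sw_inter(a, b, alpha) for a, b in zip(block, block[1:]))
              + sum(sw_intra(block, i, alpha) for i in range(len(block))))
print(sw_cache(block), sw_mem(block), sw_total(block, alpha), decomposed)
```

The two forms of Equation (3) agree on this input, which is the property the shortest path formulation in the next subsection relies on.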

## C. Optimal Solution for $L O R$

We compute an optimal solution for the LOR problem by converting the LOR problem to the shortest path problem between two special nodes, START and END. Using the notations described in Table 2, given a basic block $B$, we construct a weighted directed graph $G_{B}=\left\{V, E, W_{\text {node }}, W_{\text {edge }}\right\}$, where

$$
\begin{aligned}
V= & \{\mathrm{START}, \mathrm{END}\} \cup \bigcup_{i=1}^{N_{f p}(B)} E Q\left(F P_{i}^{B}\right) \\
= & \{\mathrm{START}, \mathrm{END}\} \cup \bigcup_{i=1}^{N_{f p}(B)}\left\{F P_{i, 1}^{B}, \cdots, F P_{i, N_{e q}\left(F P_{i}^{B}\right)}^{B}\right\} \\
E= & \left\{(v, w) \mid v=\mathrm{START}, w \in E Q\left(F P_{1}^{B}\right)\right\} \cup \\
& \left\{(v, w) \mid w=\mathrm{END}, v \in E Q\left(F P_{N_{f p}(B)}^{B}\right)\right\} \cup \\
& \left\{(v, w) \mid v \in E Q\left(F P_{i}^{B}\right), w \in E Q\left(F P_{i+1}^{B}\right)\right. \\
& \left.\quad \text { for } 1 \leq i<N_{f p}(B)\right\}
\end{aligned}
$$



Fig. 5. A shortest path problem formulation of the LOR problem (with node and edge weights omitted).

Figure 5 shows an example graph constructed by transforming the LOR problem to the shortest path problem. For each fetch packet $F P_{i}^{B}$, $N_{e q}\left(F P_{i}^{B}\right)$ vertices are created in $G_{B}$, and for successive fetch packets $F P_{i}^{B}$ and $F P_{i+1}^{B}$, every pair $\left(F P_{i, k}^{B}, F P_{i+1, k^{\prime}}^{B}\right)$ is connected by an edge. We say that the $N_{e q}\left(F P_{i}^{B}\right)$ vertices created from the fetch packet $F P_{i}^{B}$ are in level $i$. In the graph $G_{B}$, the distance of a path $P=\left(\mathrm{START}, v_{1}, \cdots, v_{k}, \mathrm{END}\right)$ is given by $\sum_{i=1}^{k} W_{\text{node}}\left(v_{i}\right)+\sum_{i=1}^{k-1} W_{\text{edge}}\left(v_{i}, v_{i+1}\right)$. The distance of path $P$ is equal to $S W^{B}$ when each fetch packet $F P_{i}^{B}$ is reordered to $v_{i}$ for $1 \leq i \leq k$.

An optimal solution of the shortest path problem described above can be found by using a modified shortest path algorithm shown in Figure 6. The modified shortest path algorithm is based on the following theorem whose proof is trivial.

| Symbol | Meaning |
| :--- | :--- |
| $N_{i n s}\left(F P_{i}^{B}\right)$ | The number of instructions in $F P_{i}^{B}$. |
| $I_{j}^{F P_{i}^{B}}$ | The $j$-th instruction of $F P_{i}^{B} \quad\left(1 \leq j \leq N_{i n s}\left(F P_{i}^{B}\right)\right)$. |
| $N_{o p}\left(I_{j}^{F P_{i}^{B}}\right)$ | The number of operations in $I_{j}^{F P_{i}^{B}}$. |
| $N_{e q}\left(I_{j}^{F P_{i}^{B}}\right)$ | The number of instructions that are equivalent to $I_{j}^{F P_{i}^{B}}\left(N_{e q}\left(I_{j}^{F P_{i}^{B}}\right)=\left(N_{o p}\left(I_{j}^{F P_{i}^{B}}\right)\right)!\right)$. |
| $N_{e q}\left(F P_{i}^{B}\right)$ | The number of fetch packets that are equivalent to $F P_{i}^{B}\left(N_{e q}\left(F P_{i}^{B}\right)=\prod_{j=1}^{N_{i n s}\left(F P_{i}^{B}\right)} N_{e q}\left(I_{j}^{F P_{i}^{B}}\right)\right)$. |
| $F P_{i, n}^{B}$ | The $n$-th fetch packet in $E Q\left(F P_{i}^{B}\right)\left(1 \leq n \leq N_{e q}\left(F P_{i}^{B}\right)\right)$. |

TABLE 2
Notations used in Section IV.C

```
for i ← 0 to N_fp(B) {   /* level 0 is START, level N_fp(B)+1 is END */
    /* for each vertex in the level i+1 */
    for k ← 1 to N_eq(FP^B_{i+1}) {
        SW_min := ∞;
        /* for each vertex in the level i */
        for j ← 1 to N_eq(FP^B_i) {
            SW_cur := d_P(FP^B_{i,j}) + W_edge(FP^B_{i,j}, FP^B_{i+1,k})
                      + W_node(FP^B_{i+1,k});
            /* find the minimum value */
            if (SW_min > SW_cur) {
                SW_min := SW_cur;
                MinNode := j;
            }
        }
        d_P(FP^B_{i+1,k}) := SW_min;
        /* store MinNode for the final path construction */
        MinPath[FP^B_{i+1,k}] := FP^B_{i,MinNode};
    }
}
```

Fig. 6. A modified shortest path algorithm.

Theorem 1 Let a path $P\left(F P_{i, j}^{B}\right)=\left(\mathrm{START}, v_{1}, \cdots, v_{i-1}, F P_{i, j}^{B}\right)$ be the shortest path from START to $F P_{i, j}^{B} \in E Q\left(F P_{i}^{B}\right)$ and let the distance of the path $P\left(F P_{i, j}^{B}\right)$ be $d_{P\left(F P_{i, j}^{B}\right)}$. Then the minimum distance of the path $P\left(F P_{i+1, k}^{B}\right)=\left(\mathrm{START}, v_{1}, \cdots, v_{i}, F P_{i+1, k}^{B}\right)$, $d_{P\left(F P_{i+1, k}^{B}\right)}$, is given by

$$
\begin{equation*}
\min _{1 \leq j \leq N_{e q}\left(F P_{i}^{B}\right)}\left[d_{P\left(F P_{i, j}^{B}\right)}+W_{\text{edge}}\left(F P_{i, j}^{B}, F P_{i+1, k}^{B}\right)+W_{\text{node}}\left(F P_{i+1, k}^{B}\right)\right] \tag{6}
\end{equation*}
$$

In Figure 6, $S W_{\min}$ is a variable to store the minimum distance of a path from START to $F P_{i+1, k}^{B}$ (in Line 15) and $S W_{cur}$ is a variable to store the minimum distance of a path from START to $F P_{i+1, k}^{B}$ that passes through $F P_{i, j}^{B}$. The shortest path is constructed by visiting MinPath in reverse order. The complexity of the modified shortest path algorithm is given by $O\left(N_{f p}(B) \cdot\left(\overline{N_{e q}^{F P^{B}}}\right)^{2}\right)$ where $\overline{N_{e q}^{F P^{B}}}=\frac{1}{N_{f p}(B)} \sum_{i=1}^{N_{f p}(B)} N_{e q}\left(F P_{i}^{B}\right)$. $\overline{N_{e q}^{F P^{B}}}$ is bounded by $N_{o p}!$.
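The level-by-level search of Figure 6 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes that each fetch packet holds a single instruction of 8-bit operations (as in Figure 4), so that every permutation of a packet is an equivalent packet; that the node and edge weights of $G_{B}$ are $S W_{F P}^{\text{intra}}$ and $S W_{F P}^{\text{inter}}$ from Equations (5) and (4), by analogy with the weights given for $G_{S}$ in Section V.B; and that $\alpha=2$. All function names are hypothetical.

```python
from itertools import permutations

ALL_ONES = 0xFF  # assumed idle external-bus value for 8-bit operations

def hamming(a, b):
    """Number of differing bits between two values."""
    return bin(a ^ b).count("1")

def w_node(fp, alpha, is_first, is_last):
    """Assumed node weight: SW_FP^intra of Equation (5)."""
    cost = alpha * sum(hamming(a, b) for a, b in zip(fp, fp[1:]))
    if is_first:
        cost += alpha * hamming(ALL_ONES, fp[0])
    if is_last:
        cost += alpha * hamming(fp[-1], ALL_ONES)
    return cost

def w_edge(fp, nxt, alpha):
    """Assumed edge weight: SW_FP^inter of Equation (4)."""
    d_fp = sum(hamming(a, b) for a, b in zip(fp, nxt))
    return d_fp + alpha * hamming(fp[-1], nxt[0])

def path_cost(path, alpha):
    """Distance of a START-to-END path, i.e. SW^B of the given arrangement."""
    last = len(path) - 1
    nodes = sum(w_node(fp, alpha, i == 0, i == last) for i, fp in enumerate(path))
    edges = sum(w_edge(a, b, alpha) for a, b in zip(path, path[1:]))
    return nodes + edges

def solve_lor(block, alpha):
    """Return (minimum SW^B, rearranged block) by a level-by-level search."""
    levels = [sorted(set(permutations(fp))) for fp in block]
    last = len(levels) - 1
    # dist[v]: shortest distance from START to vertex v in the current level
    dist = {v: w_node(v, alpha, True, last == 0) for v in levels[0]}
    min_path = []  # min_path[i - 1][v]: best predecessor of level-i vertex v
    for i in range(1, len(levels)):
        new_dist, pred = {}, {}
        for v in levels[i]:
            u = min(levels[i - 1], key=lambda p: dist[p] + w_edge(p, v, alpha))
            new_dist[v] = (dist[u] + w_edge(u, v, alpha)
                           + w_node(v, alpha, False, i == last))
            pred[v] = u
        dist = new_dist
        min_path.append(pred)
    # Reconstruct the shortest path by following predecessors backwards.
    v = min(dist, key=dist.get)
    total, path = dist[v], [v]
    for pred in reversed(min_path):
        v = pred[v]
        path.append(v)
    return total, path[::-1]

# The block of Figure 4.(b) as input; operations are 8-bit values.
block = [
    (0x15, 0x95, 0x99, 0x00),
    (0x1D, 0x8F, 0x5D, 0x02),
    (0x9D, 0x99, 0xFF, 0x90),
    (0x8F, 0x1D, 0xA5, 0x1C),
]
best_cost, rearranged = solve_lor(block, alpha=2)
print(path_cost(block, 2), best_cost)
```

Each level here has at most $4!=24$ vertices, so the search examines at most $24 \times 24$ vertex pairs per pair of adjacent levels, in line with the complexity bound stated above.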

## V. Global Operation Rearrangement Problem

In the GOR problem, all the basic blocks in a program are simultaneously considered to find a global optimal solution. Since the LOR problem does not take account of inter-block switching activity, simply solving the LOR problem for each basic block does not minimize the number of bit changes for a complete program. In order to compute an optimal solution for the GOR problem, we need additional information on the dynamic behavior of program execution. For example, we should know how many times each basic block is executed, how often each basic block experiences cache misses, and how basic blocks are related to each other.

## A. GOR Problem Formulation

If a program $S$ is composed of basic blocks $b b_{1}, b b_{2}, \cdots, b b_{N_{b b}(S)}$, then the total number of bit changes $S W^{S}$ from instruction fetches while executing the program $S$ is given as follows, using the notations described in Table 3:

$$
\begin{equation*}
S W^{S}=\sum_{i=1}^{N_{b b}(S)} \sum_{j=1}^{N_{b b}(S)} S W_{B B}^{\text{inter}}\left(b b_{i}, b b_{j}\right)+\sum_{i=1}^{N_{b b}(S)} S W_{B B}^{\text{intra}}\left(b b_{i}\right) \tag{7}
\end{equation*}
$$

where

$$
\begin{align*}
S W_{B B}^{\text{inter}}\left(b b_{i}, b b_{j}\right) & = w\left(b b_{i}, b b_{j}\right) \cdot S W_{F P}^{\text{inter}}\left(F P_{N_{f p}\left(b b_{i}\right)}^{b b_{i}}, F P_{1}^{b b_{j}}\right) \tag{8}\\
S W_{B B}^{\text{intra}}\left(b b_{i}\right) & = w\left(b b_{i}\right) \cdot \Biggl(\sum_{n=1}^{N_{f p}\left(b b_{i}\right)-1} S W_{F P}^{\text{inter}}\left(F P_{n}^{b b_{i}}, F P_{n+1}^{b b_{i}}\right) \\
& \quad+\sum_{n=1}^{N_{f p}\left(b b_{i}\right)} S W_{F P}^{\text{intra}}\left(F P_{n}^{b b_{i}}\right)\Biggr) \tag{9}
\end{align*}
$$

$$
\begin{align*}
S W_{F P}^{\text{inter}}\left(F P_{n}^{b b_{i}}, F P_{m}^{b b_{j}}\right) & = d_{f p}\left(F P_{n}^{b b_{i}}, F P_{m}^{b b_{j}}\right) \tag{10}\\
S W_{F P}^{\text{intra}}\left(F P_{n}^{b b_{i}}\right) & = \alpha \cdot R_{miss}^{m b} \cdot \Biggl(\sum_{j=1}^{N_{f p}(M B)} \sum_{k=1}^{N_{o p}-1} d_{o p}\left(O P_{k}^{F P_{j}^{m b}}, O P_{k+1}^{F P_{j}^{m b}}\right) \\
& \quad+\sum_{j=1}^{N_{f p}(M B)-1} d_{o p}\left(O P_{N_{o p}}^{F P_{j}^{m b}}, O P_{1}^{F P_{j+1}^{m b}}\right) \\
& \quad+d_{o p}\left(\mathbf{1}, O P_{1}^{F P_{1}^{m b}}\right)+d_{o p}\left(O P_{N_{o p}}^{F P_{N_{f p}(M B)}^{m b}}, \mathbf{1}\right)\Biggr) \tag{11}\\
& \quad\left(\text{where } m b=M B\left(F P_{n}^{b b_{i}}\right)\right)
\end{align*}
$$

| Symbol | Meaning |
| :--- | :--- |
| $N_{b b}(S)$ | The number of basic blocks in a program $S$. |
| $w\left(b b_{i}, b b_{j}\right)$ | The expected number of times that a basic block $b b_{j}$ is executed right after a basic block $b b_{i}$. |
| $w\left(b b_{i}\right)$ | The expected number of times that a basic block $b b_{i}$ is executed. |
| $M B\left(F P_{n}^{b b_{i}}\right)$ | The memory block that contains $F P_{n}^{b b_{i}}$. |
| $F P_{j}^{M B\left(F P_{n}^{b b_{i}}\right)}$ | The $j$-th fetch packet in the memory block that contains $F P_{n}^{b b_{i}}$. |
| $R_{miss}^{M B\left(F P_{n}^{b b_{i}}\right)}$ | The cache miss rate of the memory block $M B\left(F P_{n}^{b b_{i}}\right)$. |

TABLE 3
Notations used in Section V.A

| Symbol | Meaning |
| :--- | :--- |
| $N_{e q}\left(b b_{i}^{S}\right)$ | The number of basic blocks that are equivalent to $b b_{i}^{S}\left(N_{e q}\left(b b_{i}^{S}\right)=\prod_{j=1}^{N_{f p}\left(b b_{i}^{S}\right)} N_{e q}\left(F P_{j}^{b b_{i}^{S}}\right)\right)$. |
| $b b_{i, n}^{S}$ | The $n$-th basic block in $E Q\left(b b_{i}^{S}\right)\left(1 \leq n \leq N_{e q}\left(b b_{i}^{S}\right)\right)$. |

TABLE 4
Notations used in Section V.B

In Equations (8), (9), and (11), $w\left(b b_{i}, b b_{j}\right)$, $w\left(b b_{i}\right)$ and $R_{miss}^{m b}$ can be calculated by analyzing program execution traces. When a cache miss occurs for a fetch packet $F$, all the fetch packets in the missed memory block that contains $F$ are fetched from the main memory. Therefore, in Equation (11), all the fetch packets in $M B\left(F P_{n}^{b b_{i}}\right)$ are considered in computing the bit changes at the external instruction bus. We assume that basic blocks are aligned by the cache memory block size. In Equation (10), the Hamming distance between the last operation of $F P_{n}^{b b_{i}}$ and the first operation of $F P_{m}^{b b_{j}}$ is omitted because it is included in Equation (11). Given a program $S$, the GOR problem is to find an equivalent program $S^{\prime}$ such that $S W^{S^{\prime}} \leq S W^{S^{\prime \prime}}$ for all $S^{\prime \prime} \in E Q(S)$.
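As a simple illustration of how the weights might be obtained, the Python sketch below counts block executions and block-to-block transitions in a single execution trace. The trace format (a list of basic-block identifiers) and the function name are assumptions made for this example. Counts from one trace only approximate the expected values used in Table 3, and the miss rate $R_{miss}^{m b}$ would additionally require a cache simulation, which is not shown.

```python
from collections import Counter

def block_weights(trace):
    """Return (w_bb, w_pair) counted from one trace of basic-block ids."""
    w_bb = Counter(trace)                    # estimates w(bb_i)
    w_pair = Counter(zip(trace, trace[1:]))  # estimates w(bb_i, bb_j)
    return w_bb, w_pair

# Hypothetical trace: bb1, a loop over bb2 and bb3 taken three times, then bb4.
trace = ["bb1", "bb2", "bb3", "bb2", "bb3", "bb2", "bb3", "bb4"]
w_bb, w_pair = block_weights(trace)
print(w_bb["bb2"], w_pair[("bb3", "bb2")])  # 3 2
```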

## B. Optimal Solution for GOR

We solve the GOR problem in a similar fashion as the LOR problem by transforming the GOR problem to the shortest path problem. The main difference from the LOR problem is that since a program generally contains branches and loops, a constructed graph may span multiple paths from a given node. In order to utilize the same shortest path algorithm used in solving the LOR problem, we transform the constructed graph so that the new graph has no branches and loops.

Using the notations described in Table 4, given a program $S$, we construct a weighted directed graph $G_{S}=\left\{V, E, W_{\text{node}}, W_{\text{edge}}\right\}$, where

$$
\begin{aligned}
V & =\{\mathrm{START}, \mathrm{END}\} \cup \bigcup_{i=1}^{N_{b b}(S)} E Q\left(b b_{i}^{S}\right) \\
& =\{\mathrm{START}, \mathrm{END}\} \cup \bigcup_{i=1}^{N_{b b}(S)}\left\{b b_{i, 1}^{S}, \cdots, b b_{i, N_{e q}\left(b b_{i}^{S}\right)}^{S}\right\}, \\
E & =\left\{(v, w) \mid v=\mathrm{START}, w \in E Q\left(b b_{Entry}^{S}\right)\right\} \cup \\
& \quad\ \Bigl\{(v, w) \mid w=\mathrm{END}, v \in \bigcup_{i \in Exit^{S}} E Q\left(b b_{i}^{S}\right)\Bigr\} \cup \\
& \quad\ \bigl\{(v, w) \mid v \in E Q\left(b b_{i}^{S}\right), w \in E Q\left(b b_{j}^{S}\right) \text{ for } 1 \leq i, j \leq N_{b b}(S) \\
& \quad\ \ \text{where } b b_{j}^{S} \text{ is an immediate successor of } b b_{i}^{S} \text{ in a control flow graph}\bigr\}, \\
W_{\text{node}}(v) & = \begin{cases}S W_{B B}^{\text{intra}}(v) & \text{if } v \in V-\{\mathrm{START}, \mathrm{END}\} \\
0 & \text{otherwise}\end{cases} \text{, and} \\
W_{\text{edge}}(v, w) & = \begin{cases}S W_{B B}^{\text{inter}}(v, w) & \text{if } v, w \in V-\{\mathrm{START}, \mathrm{END}\} \\
0 & \text{otherwise}\end{cases}
\end{aligned}
$$

$b b_{\text {Entry }}^{S}$ is the entry point of the program $S$, and $\mathrm{Exit}^{S}$ contains the indices of the basic blocks that are exit points of the program $S$.

In order to reduce the computational complexity of the GOR problem, we eliminate $b b_{i, k}^{S}$ from $G_{S}$ if there exists $b b_{i, j}^{S}$ (where $j \neq k$) such that

$$
\begin{gather*}
F P_{1}^{b b_{i, j}^{S}}=F P_{1}^{b b_{i, k}^{S}},  \tag{12}\\
F P_{N_{f p}\left(b b_{i}^{S}\right)}^{b b_{i, j}^{S}}=F P_{N_{f p}\left(b b_{i}^{S}\right)}^{b b_{i, k}^{S}}, \text { and }  \tag{13}\\
S W_{B B}^{\text {intra }}\left(b b_{i, j}^{S}\right) \leq S W_{B B}^{\text {intra }}\left(b b_{i, k}^{S}\right). \tag{14}
\end{gather*}
$$

If $b b_{i, k}^{S}$ satisfies Equations (12), (13), and (14), $b b_{i, k}^{S}$ is not needed in an optimal GOR solution: because $b b_{i, j}^{S}$ and $b b_{i, k}^{S}$ share the same first and last fetch packets, they have the same $S W_{B B}^{\text {inter }}$ values with respect to any neighboring basic block, while the $S W_{B B}^{\text {intra }}$ value of $b b_{i, j}^{S}$ is no larger. For each basic block $b b_{i}^{S}$, by applying a modified LOR algorithm (with the first and last fetch packets fixed) $N_{e q}\left(F P_{1}^{b b_{i}^{S}}\right) \times N_{e q}\left(F P_{N_{f p}\left(b b_{i}^{S}\right)}^{b b_{i}^{S}}\right)$ times, we can construct a simplified $G_{S}$ with no redundant $b b_{i, k}^{S}$'s. By eliminating the redundant $b b_{i, k}^{S}$'s from $G_{S}$, $N_{e q}\left(b b_{i}^{S}\right)$ is effectively reduced to $N_{e q}\left(F P_{1}^{b b_{i}^{S}}\right) \times N_{e q}\left(F P_{N_{f p}\left(b b_{i}^{S}\right)}^{b b_{i}^{S}}\right)$. (For the rest of the paper, we use $N_{i}$ to represent $N_{e q}\left(F P_{1}^{b b_{i}^{S}}\right) \times N_{e q}\left(F P_{N_{f p}\left(b b_{i}^{S}\right)}^{b b_{i}^{S}}\right)$ for a simpler description.)
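The elimination rule of Equations (12) to (14) amounts to keeping, for each pair of first and last fetch packets, only one candidate with the smallest intra-block switching activity. The following is a minimal sketch of that filter, assuming a hypothetical tuple representation of the equivalent basic blocks; it enumerates candidates explicitly rather than running the modified LOR algorithm described above.

```python
def prune_equivalents(candidates):
    """Keep one cheapest candidate per (first fetch packet, last fetch packet).

    candidates: list of (first_fp, last_fp, sw_intra, layout) tuples for the
    equivalent versions of one basic block (hypothetical representation).
    """
    best = {}
    for first_fp, last_fp, sw_intra, layout in candidates:
        key = (first_fp, last_fp)
        # A candidate is redundant if another one with the same boundary
        # fetch packets has no larger intra-block switching activity.
        if key not in best or sw_intra < best[key][2]:
            best[key] = (first_fp, last_fp, sw_intra, layout)
    return list(best.values())
```

The number of surviving candidates is bounded by the number of distinct boundary pairs, which corresponds to $N_{i}$ in the text.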

Once a simplified $G_{S}$ is constructed, it is further converted to remove branching nodes and looping nodes so that the shortest path algorithm for the LOR problem can be reused. Figure 7 illustrates how the branch merging and loop rolling operations work on the $G_{S}$ graph using the example control structures.


Fig. 7. Effects of branch merging and loop rolling on the $G_{S}$ graph.

Branch merging replaces two branch successor nodes $v_{i}$ and $v_{j}$ with a new node $v_{i \oplus j}$. For the new node $v_{i \oplus j}$, $N_{e q}\left(b b_{i \oplus j}\right)$ is set to $N_{e q}\left(b b_{i}\right) \times N_{e q}\left(b b_{j}\right)$. For example, in Figure 7.(a), the basic blocks $b_{2}$ and $b_{3}$ each have two equivalent basic blocks. After the branch merging operation is applied, $v_{2 \oplus 3}$ has four equivalent basic blocks. The node $v_{2}$ (with $W_{\text {node }}\left(v_{2}\right)=16$) and the node $v_{3}$ (with $W_{\text {node }}\left(v_{3}\right)=13$) are merged into the node $v_{2 \oplus 3}$, whose $W_{\text {node }}$ value is $29(=16+13)$. The node $v_{2 \oplus 3}$ has an edge with $v_{1}$, and its edge weight is $6(=2+4)$.
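The growth of the candidate set under branch merging can be sketched as a Cartesian product of the two candidate sets. This is an illustrative sketch only: the dictionary representation and the candidate names are assumptions, and only the weights 16 and 13 come from the Figure 7.(a) example.

```python
from itertools import product


def branch_merge(eq_i, eq_j):
    """Merge the candidate sets of two branch successor blocks.

    eq_i, eq_j: dicts mapping a candidate name to its W_node value.
    Each merged candidate pairs one candidate of each block, so the result
    has len(eq_i) * len(eq_j) entries, and its weight is the sum of the two.
    """
    return {
        (a, b): w_a + w_b
        for (a, w_a), (b, w_b) in product(eq_i.items(), eq_j.items())
    }
```

For two blocks with two candidates each, `branch_merge` returns four merged candidates, which is why repeated branch merging can make the optimal method expensive in memory.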

After branch merging in Figure 7.(a), the three basic blocks $b_{1}, b_{2 \oplus 3}$, and $b_{4}$ can be merged into a single basic block using the LOR algorithm. The resulting node has a $W_{\text {node }}$ value of 78. We call this extra merging step sequential merging. If the basic blocks $b_{i}, \cdots, b_{j}$ are merged into a single basic block by a sequential merging operation, the merged basic block has $N_{e q}\left(F P_{1}^{b b_{i}^{S}}\right) \times N_{e q}\left(F P_{N_{f p}\left(b b_{j}^{S}\right)}^{b b_{j}^{S}}\right)$ equivalent nodes.

Loop rolling works in a similar fashion to sequential merging. It merges the loop body nodes $v_{i}, \cdots, v_{j}$ into a new node $v_{\cup(i, \cdots, j)}$, as with sequential merging. The difference is that loop rolling adds the weights of back edges when computing the weight of the merged node. For example, in Figure 7.(b), consider the basic blocks $b_{2}$ and $b_{3}$, each of which has two equivalent basic blocks. The nodes $v_{2}$ and $v_{3}$ are merged into the new node $v_{\cup(2,3)}$, whose $W_{\text {node }}$ value is $60(=25+24+4+7)$. The node $v_{\cup(2,3)}$ has an edge with the node $v_{1}$, and its edge weight is 6, which is the value of $W_{\text {edge }}\left(v_{1}, v_{2}\right)$.

When nodes $v_{i}, v_{j}$ are merged into a new node $v^{\prime}$ by branch merging or loop rolling, the following changes are made to the $G_{S}$ graph:

$$
\begin{aligned}
& V= V \cup E Q\left(v^{\prime}\right)-E Q\left(v_{i}\right)-E Q\left(v_{j}\right) \\
& E= E \cup\left\{(v, w) \mid v \in E Q\left(v^{\prime}\right), w \in E Q\left(v_{k}\right)\right\} \\
& \cup\left\{(v, w) \mid v \in E Q\left(v_{h}\right), w \in E Q\left(v^{\prime}\right)\right\} \\
&-\left\{(v, w) \mid v \in E Q\left(v_{i}\right) \cup E Q\left(v_{j}\right), w \in E Q\left(v_{k}\right)\right\} \\
&-\left\{(v, w) \mid v \in E Q\left(v_{h}\right), w \in E Q\left(v_{i}\right) \cup E Q\left(v_{j}\right)\right\} \\
&-\left\{(v, w) \mid v \in E Q\left(v_{i}\right), w \in E Q\left(v_{j}\right)\right\} \\
&-\left\{(v, w) \mid v \in E Q\left(v_{j}\right), w \in E Q\left(v_{i}\right)\right\} \\
& \text { for all } k \text { such that } k \neq i, k \neq j, \\
& \text { and }\left(\left(v_{i}, v_{k}\right) \in E \text { or }\left(v_{j}, v_{k}\right) \in E\right), \text { and } \\
& \text { for all } h \text { such that } h \neq i, h \neq j, \\
& \text { and }\left(\left(v_{h}, v_{i}\right) \in E \text { or }\left(v_{h}, v_{j}\right) \in E\right) \\
& \\
& W_{\text {node }}\left(v^{\prime}\right)= W_{\text {node }}\left(v_{i}\right)+W_{\text {node }}\left(v_{j}\right) \\
&+W_{\text {edge }}\left(v_{i}, v_{j}\right)+W_{\text {edge }}\left(v_{j}, v_{i}\right) \\
& W_{\text {edge }}\left(v^{\prime}, v_{k}\right)= W_{\text {edge }}\left(v_{i}, v_{k}\right)+W_{\text {edge }}\left(v_{j}, v_{k}\right) \\
& \text { for all } k \text { such that } k \neq i, k \neq j, \\
& \text { and }\left(\left(v_{i}, v_{k}\right) \in E \text { or }\left(v_{j}, v_{k}\right) \in E\right) \\
& W_{\text {edge }}\left(v_{h}, v^{\prime}\right)= W_{\text {edge }}\left(v_{h}, v_{i}\right)+W_{\text {edge }}\left(v_{h}, v_{j}\right) \\
& \text { for all } h \text { such that } h \neq i, h \neq j, \\
& \text { and }\left(\left(v_{h}, v_{i}\right) \in E \text { or }\left(v_{h}, v_{j}\right) \in E\right)
\end{aligned}
$$
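The weight update rules above can be sketched at the level of one representative per node, with a missing edge counted as weight 0. The dictionary-based graph representation and the function name are assumptions made for illustration; the sketch does not expand the equivalent-block sets $E Q(\cdot)$.

```python
def merge_weights(w_node, w_edge, i, j, new):
    """Merge nodes i and j into `new` and return the updated weights.

    w_node: dict node -> W_node; w_edge: dict (src, dst) -> W_edge.
    """
    def edge(a, b):
        return w_edge.get((a, b), 0)  # an absent edge contributes nothing

    nodes = {v: w for v, w in w_node.items() if v not in (i, j)}
    # Edges between i and j (including a loop back edge) are absorbed
    # into the weight of the merged node.
    nodes[new] = w_node[i] + w_node[j] + edge(i, j) + edge(j, i)

    edges = {}
    for (a, b), w in w_edge.items():
        if {a, b} <= {i, j}:
            continue  # internal edge, already counted in the node weight
        a2 = new if a in (i, j) else a
        b2 = new if b in (i, j) else b
        # Parallel edges to or from the merged node are summed.
        edges[(a2, b2)] = edges.get((a2, b2), 0) + w
    return nodes, edges
```

The same function reproduces both Figure 7 examples: for branch merging the two incoming edge weights add up ($2+4=6$), and for loop rolling the forward and back edges inside the loop body are folded into the node weight ($25+24+4+7=60$). Which of the two loop edges carries 4 and which carries 7 is not stated in the text and does not affect the sum.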

Once $G_{S}$ is converted to a graph with no branches or loops, the shortest path algorithm used for the LOR problem can compute the optimal solution.
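After the conversion, the graph is a chain of candidate sets, so a shortest path can be found by dynamic programming from one set to the next. The following is a minimal sketch of such a search, not necessarily the algorithm used for the LOR problem; the data layout and names are assumptions, and START and END are omitted because their weights are zero.

```python
def shortest_chain(layers, w_node, w_edge):
    """Pick one candidate per layer so that the total weight is minimal.

    layers: list of candidate lists, one per (merged) basic block in order.
    w_node: dict candidate -> node weight.
    w_edge: dict (candidate, next_candidate) -> edge weight.
    Returns (total_cost, chosen_candidates).
    """
    # best[v] = (cost of the cheapest path ending at v, that path)
    best = {v: (w_node[v], [v]) for v in layers[0]}
    for layer in layers[1:]:
        nxt = {}
        for w in layer:
            cost, path = min(
                ((c + w_edge[(v, w)], p) for v, (c, p) in best.items()),
                key=lambda t: t[0],
            )
            nxt[w] = (cost + w_node[w], path + [w])
        best = nxt
    return min(best.values(), key=lambda t: t[0])
```

The running time is proportional to the sum over adjacent layers of the product of their sizes, which is why reducing the number of candidates per block matters.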

## C. Heuristic Solution for GOR

Finding an optimal GOR solution using the $G_{S}$ graph constructed in the previous section may require an excessive amount of memory and cycles. For example, for each basic block $b_{i}, N_{i}$ node structures are required. Furthermore, when two basic blocks $b_{i}$ and $b_{j}$ are merged using a branch merging operation, the required number of node structures for the merged node increases to $N_{i} \times N_{j}$. In this section, we propose a heuristic solution for the GOR problem which we call the GOR-H algorithm.

The GOR-H algorithm significantly reduces the memory requirement and the computing cycles by using two heuristic rules. First, not all the basic blocks are treated equally. With each basic block $b b_{i}$, we associate $F R\left(b b_{i}\right)$, which is defined as follows:

$$
F R\left(b b_{i}\right)=\frac{w\left(b b_{i}\right) \cdot N_{f p}\left(b b_{i}\right)}{\sum_{j=1}^{N_{b b}(S)} w\left(b b_{j}\right) \cdot N_{f p}\left(b b_{j}\right)}
$$

$F R\left(b b_{i}\right)$ represents an effective fetch rate of the fetch packets in the basic block $b b_{i}$ over all the basic blocks of a program. Since a basic block with a larger $F R\left(b b_{i}\right)$ value has a bigger effect on the total switching activity during the instruction fetch phase, basic blocks with large $F R\left(b b_{i}\right)$ values are more thoroughly reordered than ones with small $F R\left(b b_{i}\right)$ values.
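The fetch rate definition can be sketched directly from the formula. The input representation is an assumption: each basic block is given by its execution count $w\left(b b_{i}\right)$ and its number of fetch packets $N_{f p}\left(b b_{i}\right)$, and at least one block is assumed to have a nonzero product.

```python
def fetch_rates(blocks):
    """Compute FR(bb_i) for every basic block.

    blocks: dict block name -> (w, n_fp), where w is the execution count
    and n_fp is the number of fetch packets in the block.
    """
    total = sum(w * n_fp for w, n_fp in blocks.values())
    return {name: (w * n_fp) / total for name, (w, n_fp) in blocks.items()}
```

For instance, a three-packet loop body executed 100 times dominates straight-line blocks executed once, so it would be the first block to receive thorough reordering.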

Second, for each basic block $b b_{i}$, not all the equivalent basic blocks in $E Q\left(b b_{i}\right)$ are considered in the search for a solution. Only $N_{\text {cand }}$ equivalent basic blocks are created and included in $G_{S}$: those with the $N_{\text {cand }}$ smallest switching activity values among all the basic blocks in $E Q\left(b b_{i}\right)$.
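The second rule can be sketched as a selection of the $N_{\text {cand }}$ cheapest candidates. This sketch assumes that the candidates and their switching activity values are already available as pairs; how GOR-H generates them without enumerating all of $E Q\left(b b_{i}\right)$ is not covered here.

```python
import heapq


def select_candidates(equivalents, n_cand):
    """Return the n_cand candidates with the smallest switching activity.

    equivalents: iterable of (switching_activity, layout) pairs
    (hypothetical representation of the equivalent basic blocks).
    """
    return heapq.nsmallest(n_cand, equivalents, key=lambda c: c[0])
```

Using `heapq.nsmallest` avoids sorting the whole candidate list when `n_cand` is much smaller than the number of candidates.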

Once the $G_{S}$ graph is constructed by the two rules above, the rest of the processing steps (that is, branch merging, loop rolling, and sequential merging) are the same as in the previous section. From the transformed $G_{S}$, we can solve the GOR problem using the LOR algorithm.

## VI. Experiments

In order to evaluate how well the proposed operation rearrangement technique works on application programs, we have performed experiments using a VLIW digital signal processor, the TMS320C6201 [7] from Texas Instruments. The TMS320C6201 is a fixed-point DSP that can specify eight 32-bit operations in a single 256-bit instruction. The TMS320C6201 uses a compressed encoding with $b_{\text {cache }}=256$. As benchmark programs, various DSP programs were used. The proposed technique was implemented as a separate post-pass tool, which takes as input an executable file produced by TI's TMS320C6x optimizing C compiler and produces as output the rearranged low-power version of the same program.

We have measured the number of bit transitions during the instruction fetch phase for each benchmark program using a switching activity counter. Given an executable file with appropriate input data, the switching activity counter program computes the number of bit transitions on both the internal and external buses during the program execution using instruction address traces. Instruction address traces for the benchmark programs were collected by a manual analysis of the benchmark source programs.

Table 5 summarizes the experimental results with selected DSP benchmark programs. For each benchmark program, the average number of bit transitions per instruction fetch (BT/IF) is computed. For $\alpha$, we have used 100 [12]. We have compared BT/IFs among the TI compiler-generated programs (the default column in Table 5), the programs rearranged by the proposed LOR technique (the LOR column in Table 5), and the programs rearranged by the GOR heuristic technique (the GOR-H column in Table 5). We have used 100 for $N_{c a n d}$ in the GOR heuristic technique.

As shown in Table 5, our operation rearrangement technique reduces the number of bit transitions during the instruction fetch phase by $34.3 \%$ on average compared with the programs generated by the TI compiler. The GOR heuristic technique outperformed the LOR technique, reducing the switching activity by a further 2.9 percentage points on average ($34.3 \%$ versus $31.4 \%$). For many benchmark programs, however, the LOR technique was quite effective, achieving almost the same switching activity reduction as the GOR heuristic technique.
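The averages quoted above can be checked from the reduction columns of Table 5. The short sketch below only re-derives numbers that are already in the table; the variable names are illustrative.

```python
# Per-benchmark reductions (in percent) copied from Table 5, in table order.
lor = [33.0, 31.6, 23.9, 27.4, 28.3, 30.0, 43.9, 34.3, 30.0]
gor_h = [36.3, 34.6, 24.0, 28.0, 34.2, 36.3, 44.1, 38.1, 33.0]

mean_lor = sum(lor) / len(lor)        # about 31.4
mean_gor_h = sum(gor_h) / len(gor_h)  # about 34.3
gap = mean_gor_h - mean_lor           # about 2.9 percentage points

print(round(mean_lor, 1), round(mean_gor_h, 1), round(gap, 1))
```

The reported averages are therefore means of the per-benchmark reductions, and the 2.9 figure is the difference between the two means.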

## VII. Conclusions

In this paper, we have described and evaluated an operation rearrangement method for reducing the power consumed during instruction fetches in VLIW machines. The proposed method, which works as a post-pass tool for compiled programs, reorganizes the operation placement orders within VLIW instructions such that the resulting program has the minimum number of bit transitions during instruction fetches. The experimental results show that the proposed rearrangement technique can significantly reduce the switching activity during the instruction fetch phase in VLIW machines. For our benchmark programs, the switching activity was reduced by $34 \%$ on average.

In this paper, we considered the problem of modifying operation orders for pre-compiled VLIW programs. However, optimization decisions made during the compilation process can affect the outcome of operation rearrangement. For example, depending on how instructions are scheduled, the number of bit changes during the instruction fetch phase can vary significantly. We plan to investigate the phase-ordering problem between the operation rearrangement and other optimization steps as a next topic.

| Benchmark program | BT/IF (default) | BT/IF (LOR) | BT/IF (GOR-H) | Reduction (LOR) | Reduction (GOR-H) |
| :--- | ---: | ---: | ---: | ---: | ---: |
| vector multiply | 68.6 | 46.0 | 43.7 | $33.0 \%$ | $36.3 \%$ |
| FIR8 | 86.8 | 59.3 | 56.7 | $31.6 \%$ | $34.6 \%$ |
| FIRcx | 79.5 | 60.6 | 60.5 | $23.9 \%$ | $24.0 \%$ |
| IIR | 71.7 | 52.1 | 51.7 | $27.4 \%$ | $28.0 \%$ |
| lattice analysis | 88.4 | 63.4 | 58.2 | $28.3 \%$ | $34.2 \%$ |
| W_vec | 89.5 | 62.9 | 57.1 | $30.0 \%$ | $36.3 \%$ |
| dotp_sqr | 79.2 | 44.5 | 44.3 | $43.9 \%$ | $44.1 \%$ |
| minerror | 50.6 | 33.2 | 31.3 | $34.3 \%$ | $38.1 \%$ |
| biquad | 78.1 | 54.6 | 52.3 | $30.0 \%$ | $33.0 \%$ |
| Average | 76.9 | 53.0 | 50.6 | $31.4 \%$ | $34.3 \%$ |

TABLE 5
Experimental results

## References

[1] A. Chandrakasan, T. Shung, and R. W. Broderson. Low power CMOS digital design. IEEE Journal of Solid State Circuits, 27(4):473-484, 1992.
[2] T. Conte, S. Banerjia, S. Larin, K. N. Menezes, and S. W. Sathaye. Instruction fetch mechanisms for VLIW architectures with compressed encodings. In Proc. of the 29th IEEE/ACM Int. Symp. on Microarchitecture, pages 201-211, 1996.
[3] S. Devadas and S. Malik. A survey of optimization techniques targeting low power VLSI circuits. In Proc. of Int. Symp. on Low Power Electronics and Design (ISLPED'97), pages 239-242, 1997.
[4] P. Faraboschi, G. Desoli, and J. A. Fisher. The latest word in digital and media processing. IEEE Signal Processing Magazine, 15(2):59-85, 1998.
[5] Fujitsu Microelectronics, Inc. Fujitsu's new high-performance VLIW processor cores. http://www.fujitsumicro.com/.
[6] R. Henning and C. Chakrabarti. High-level design synthesis of a low power, VLIW processor for the IS54 VSELP speech encoder. In Proc. of Int. Conf. on Computer Design (ICCD'97), pages 571-576, 1997.
[7] Texas Instruments. TMS320C62xx CPU and Instruction Set, 1997.
[8] Texas Instruments. TMS320C6000 Power Consumption Summary, 1999.
[9] A. Klaiber. The technology behind the Crusoe processor. Transmeta Corporation White Paper, 2000.
[10] M. T. Lee, V. Tiwari, S. Malik, and M. Fujita. Power analysis and minimization techniques for embedded DSP software. IEEE Trans. VLSI Systems, 5(1):123-135, 1997.
[11] H. Mehta, R. M. Owens, M. J. Irwin, R. Chen, and D. Ghosh. Techniques for low energy software. In Proc. of Int. Symp. on Low Power Electronics and Design (ISLPED'97), pages 72-75, 1997.
[12] E. Musoll, T. Lang, and L. Cortadella. Exploiting the locality of memory references to reduce the address bus energy. In Proc. of Int. Symp. on Low Power Electronics and Design (ISLPED'97), pages 202-207, 1997.
[13] J.-M. Puiatti, J. Llosa, C. Piguet, and E. Sanchez. Low-power VLIW processors: A high-level evaluation. In Proc. of Int. Workshop - Power and Timing Modeling, Optimization and Simulation (PATMOS '98), pages 399-408, 1998.
[14] M. R. Stan and W. P. Burleson. Bus-invert coding for low power I/O. IEEE Trans. on VLSI Systems, 3:49-58, Mar. 1995.
[15] C. L. Su, C. Y. Tsui, and A. Despain. Low power architectural design and compilation techniques for high-performance processor. In Proc. of COMPCON94, pages 489-498, 1994.
[16] V. Tiwari, S. Malik, and A. Wolfe. Compilation techniques for low energy: An overview. In Proc. of Int. Symp. on Low-Power Electronics, 1994.
[17] V. Tiwari, S. Malik, and A. Wolfe. Power analysis of embedded software: A first step towards software power minimization. IEEE Trans. VLSI Systems, 2(4):437-445, 1994.
[18] M. C. Toburen, T. M. Conte, and M. Reilly. Instruction scheduling for low power dissipation in high performance microprocessors. In Proc. of Power Driven Microarchitecture Workshop in conjunction with the 25th International Symposium on Computer Architecture (ISCA'98), 1998.
[19] H. Tomiyama, T. Ishihara, A. Inoue, and H. Yasuura. Instruction scheduling for power reduction in processor-based system design. In Proc. of the 1998 Design Automation and Test in Europe (DATE '98), pages 855-860, 1998.


[^0]:    *This work was supported in part by BK21 Information Technology program.

[^1]:    We distinguish between an operation and an instruction in a VLIW CPU. A VLIW instruction is assumed to consist of several operations.

[^2]:    In Figures 2.(b) and 2.(c), parallel operations within the same VLIW instruction are specified using tail bits (shown in the shaded boxes). If the tail bit of an operation $O$ is 1, the operation $O$ is executed in parallel with the next operation. Otherwise, the next operation is executed after the current instruction is executed.

