Hardware/Software Partitioning

Peter Marwedel
Informatik 12
TU Dortmund
Germany

2008/12/06
Structure of this course

3: Embedded System HW

2: Specifications

4: Standard Software, Real-Time Operating Systems

5: Scheduling, HW/SW-Partitioning, Applications to MP-Mapping

6: Evaluation

7: Optimization of Embedded Systems

8: Testing

New clustering
Hardware/software partitioning

No need to consider special purpose hardware in the long run? Correct for fixed functionality, but wrong in general, since “By the time MPEG-n can be implemented in software, MPEG-n+1 has been invented” [de Man]
Functionality to be implemented in software or in hardware?

Decision based on hardware/software partitioning, a special case of hardware/software codesign.
Codesign Tool (COOL) as an example of HW/SW partitioning

Inputs to COOL:
1. Target technology
2. Design constraints
3. Required behavior
Hardware/software codesign: approach

Specification

Mapping

Steps of the COOL partitioning algorithm (1)

1. Translation of the behavior into an internal graph model
2. Translation of the behavior of each node from VHDL into C
3. Compilation
   - All C programs compiled for the target processor,
   - Computation of the resulting program size,
   - Estimation of the resulting execution time (simulation input data might be required)
4. Synthesis of hardware components:
   ∀ leaf nodes, application-specific hardware is synthesized. High-level synthesis sufficiently fast.
Steps of the COOL partitioning algorithm (2)

1. Flattening of the hierarchy:
   • Granularity used by the designer is maintained.
   • Cost and performance information added to the nodes.
   • Precise information required for partitioning is pre-computed

2. Generating and solving a mathematical model of the optimization problem:
   • Integer programming IP model for optimization. Optimal with respect to the cost function (approximates communication time)
Steps of the COOL partitioning algorithm (3)

1. Iterative improvements:
   Adjacent nodes mapped to the same hardware component are now merged.

![Diagram showing iterative improvements in the COOL partitioning algorithm.]
Steps of the COOL partitioning algorithm (4)

1. **Interface synthesis:**
   After partitioning, the glue logic required for interfacing processors, application-specific hardware and memories is created.
An integer programming model for HW/SW partitioning

Notation:

- Index set $I$ denotes task graph nodes.
- Index set $L$ denotes task graph node types, e.g. square root, DCT or FFT.
- Index set $KH$ denotes hardware component types, e.g. hardware components for the DCT or the FFT.
- Index set $J$ of hardware component instances.
- Index set $KP$ denotes processors. All processors are assumed to be of the same type.
An IP model for HW/SW partitioning

- $X_{i,k} = 1$ if node $v_i$ is mapped to hardware component type $k \in KH$ and 0 otherwise.
- $Y_{i,k} = 1$ if node $v_i$ is mapped to processor $k \in KP$ and 0 otherwise.
- $NY_{\ell,k} = 1$ if at least one node of type $\ell$ is mapped to processor $k \in KP$ and 0 otherwise.
- $T$ is a mapping from task graph nodes to their types: $T: I \rightarrow L$
- The cost function accumulates the cost of hardware units:
  $$C = \text{cost(processors)} + \text{cost(memories)} + \text{cost(application specific hardware)}$$
Constraints

Operation assignment constraints

\[ \forall i \in I : \sum_{k \in KH} X_{i,k} + \sum_{k \in KP} Y_{i,k} = 1 \]

All task graph nodes have to be mapped either in software or in hardware.

Variables are assumed to be integers.

Additional constraints to guarantee they are either 0 or 1:

\[ \forall i \in I : \forall k \in KH : X_{i,k} \leq 1 \]

\[ \forall i \in I : \forall k \in KP : Y_{i,k} \leq 1 \]
Operation assignment constraints (2)

\[ \forall \ell \in L, \forall i: T(v_i) = c_\ell, \forall k \in KP: NY_{\ell,k} \geq Y_{i,k} \]

For all types \( \ell \) of operations and for all nodes \( i \) of this type: if \( i \) is mapped to some processor \( k \), then that processor must implement the functionality of \( \ell \).

Decision variables must also be 0/1 variables:

\[ \forall \ell \in L, \forall k \in KP: NY_{\ell,k} \leq 1. \]
Resource & design constraints

- \( \forall k \in KH \), the cost (area) used for components of that type is calculated as the sum of the costs of the components of that type. This cost should not exceed its maximum.
- \( \forall k \in KP \), the cost for associated data storage area should not exceed its maximum.
- \( \forall k \in KP \) the cost for storing instructions should not exceed its maximum.
- The total cost (\( \sum_{k \in KH} \)) of HW components should not exceed its maximum.
- The total cost of data memories (\( \sum_{k \in KP} \)) should not exceed its maximum.
- The total cost instruction memories (\( \sum_{k \in KP} \)) should not exceed its maximum.
Scheduling

Processor $p_1$

FIR$_1$

FIR$_2$

ASIC $h_1$

Communication channel $c_1$

$\vdots v_3 \ldots v_4$

or

$\vdots v_4 \ldots v_3$

$\vdots v_7 \ldots v_8$

or

$\vdots v_8 \ldots v_7$

$\vdots e_3 \ldots e_4$

or

$\vdots e_4 \ldots e_3$
Scheduling / precedence constraints

- For all nodes $v_{i1}$ and $v_{i2}$ that are potentially mapped to the same processor or hardware component instance, introduce a binary decision variable $b_{i1,i2}$ with $b_{i1,i2} = 1$ if $v_{i1}$ is executed before $v_{i2}$ and $= 0$ otherwise.

Define constraints of the type

- $(\text{end-time of } v_{i1}) \leq (\text{start time of } v_{i2})$ if $b_{i1,i2} = 1$ and
- $(\text{end-time of } v_{i2}) \leq (\text{start time of } v_{i1})$ if $b_{i1,i2} = 0$

- Ensure that the schedule for executing operations is consistent with the precedence constraints in the task graph.

- Approach just fixes the order of execution and avoids the complexity of computing start times during optimization.
Other constraints

- **Timing constraints**
  These constraints can be used to guarantee that certain time constraints are met.
- Some less important constraints omitted ..
Example

HW types H1, H2 and H3 with costs of 20, 25, and 30.
Processors of type P.
Tasks T1 to T5.
Execution times:

<table>
<thead>
<tr>
<th>T</th>
<th>H1</th>
<th>H2</th>
<th>H3</th>
<th>P</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>20</td>
<td></td>
<td></td>
<td>100</td>
</tr>
<tr>
<td>2</td>
<td></td>
<td>20</td>
<td></td>
<td>100</td>
</tr>
<tr>
<td>3</td>
<td>12</td>
<td></td>
<td>12</td>
<td>10</td>
</tr>
<tr>
<td>4</td>
<td>12</td>
<td></td>
<td>12</td>
<td>10</td>
</tr>
<tr>
<td>5</td>
<td>20</td>
<td></td>
<td></td>
<td>100</td>
</tr>
</tbody>
</table>
Operation assignment constraints (1)

<table>
<thead>
<tr>
<th>T</th>
<th>H1</th>
<th>H2</th>
<th>H3</th>
<th>P</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>20</td>
<td></td>
<td></td>
<td>100</td>
</tr>
<tr>
<td>2</td>
<td></td>
<td>20</td>
<td></td>
<td>100</td>
</tr>
<tr>
<td>3</td>
<td></td>
<td>12</td>
<td></td>
<td>10</td>
</tr>
<tr>
<td>4</td>
<td></td>
<td>12</td>
<td></td>
<td>10</td>
</tr>
<tr>
<td>5</td>
<td>20</td>
<td></td>
<td></td>
<td>100</td>
</tr>
</tbody>
</table>

\[ \forall i \in I: \sum_{k \in KH} x_{i,k} + \sum_{k \in KP} y_{i,k} = 1 \]

\[ x_{1,1} + y_{1,1} = 1 \text{ (task 1 mapped to H1 or to P)} \]
\[ x_{2,2} + y_{2,1} = 1 \]
\[ x_{3,3} + y_{3,1} = 1 \]
\[ x_{4,3} + y_{4,1} = 1 \]
\[ x_{5,1} + y_{5,1} = 1 \]
Operation assignment constraints (2)

Assume types of tasks are \( \ell = 1, 2, 3, 3, \) and 1.
\[
\forall \, \ell \in L, \ \forall \, i : T(v_i) = c_\ell, \ \forall \, k \in KP : NY_{\ell,k} \geq Y_{i,k}
\]

Functionality 3 to be implemented on processor if node 4 is mapped to it.

\[
NY_{1,1} \geq Y_{1,1},
NY_{2,1} \geq Y_{2,1},
NY_{3,1} \geq Y_{3,1},
NY_{3,1} \geq Y_{4,1},
NY_{1,1} \geq Y_{5,1}
\]
Other equations

Time constraints leading to: Application specific hardware required for time constraints under 100 time units.

<table>
<thead>
<tr>
<th>T</th>
<th>H1</th>
<th>H2</th>
<th>H3</th>
<th>P</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>20</td>
<td></td>
<td></td>
<td>100</td>
</tr>
<tr>
<td>2</td>
<td></td>
<td>20</td>
<td></td>
<td>100</td>
</tr>
<tr>
<td>3</td>
<td></td>
<td>12</td>
<td></td>
<td>10</td>
</tr>
<tr>
<td>4</td>
<td></td>
<td>12</td>
<td></td>
<td>10</td>
</tr>
<tr>
<td>5</td>
<td>20</td>
<td></td>
<td></td>
<td>100</td>
</tr>
</tbody>
</table>

Cost function:

\[ C = 20 \#(H1) + 25 \#(H2) + 30 \#(H3) + \text{cost(processor)} + \text{cost(memory)} \]
Result

For a time constraint of 100 time units and cost(P)<cost(H3):

<table>
<thead>
<tr>
<th>T</th>
<th>H1</th>
<th>H2</th>
<th>H3</th>
<th>P</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>20</td>
<td>20</td>
<td>100</td>
<td></td>
</tr>
<tr>
<td>2</td>
<td>20</td>
<td>12</td>
<td>10</td>
<td></td>
</tr>
<tr>
<td>3</td>
<td>12</td>
<td>12</td>
<td>10</td>
<td></td>
</tr>
<tr>
<td>4</td>
<td>20</td>
<td>100</td>
<td></td>
<td></td>
</tr>
<tr>
<td>5</td>
<td>100</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Solution (educated guessing) :
T1 → H1
T2 → H2
T3 → P
T4 → P
T5 → H1
Separation of scheduling and partitioning

Combined scheduling/partitioning very complex;

- Heuristic: Compute estimated schedule
  - Perform partitioning for estimated schedule
  - Perform final scheduling
  - If final schedule does not meet time constraint, go to 1 using a reduced overall timing constraint.

Diagram:

- Actual execution time
- Approx. execution time
- Specification
- New specification
- 1st Iteration
- 2nd Iteration
Application example

Audio lab (mixer, fader, echo, equalizer, balance units); slow SPARC processor
1μ ASIC library
Allowable delay of 22.675 μs (~ 44.1 kHz)

Outdated technology; just a proof of concept.
Running time for COOL optimization

CASCADE Only simple models can be solved optimally.
Deviation from optimal design

Hardly any loss in design quality.
Running time for heuristic
Design space for audio lab

Everything in software: 72.9 µs, 0 $\lambda^2$
Everything in hardware: 3.06 µs, $457.9 \times 10^6 \lambda^2$
Lowest cost for given sample rate: 18.6 µs, $78.4 \times 10^6 \lambda^2$, 
Positioning of COOL

COOL approach:

- shows that formal model of hardware/SW codesign is beneficial; IP modeling can lead to useful implementation even if optimal result is available only for small designs.

Other approaches for HW/SW partitioning:

- starting with everything mapped to hardware; gradually moving to software as long as timing constraint is met.
- starting with everything mapped to software; gradually moving to hardware until timing constraint is met.
- Binary search.
HW/SW partitioning
in the context of mapping applications to processors

- Handling of heterogeneous systems
- Handling of task dependencies
- Considers of communication (at least in COOL)
- Considers memory sizes etc (at least in COOL)
- For COOL: just homogeneous processors
- No link to scheduling theory