
Workshop on Compiler Assisted SoC Assembly (CASA), Oct. 27th, 2006


This is the second edition of the CASA workshop (initially called OCASA), which brings together researchers and industrial attendees who want to teach, learn, and discuss the use of compilers in designing SoC architectures. The goal of using compilers in the architecture design loop is to provide reliable information about the impact of architectural features. This reliable information is expected to avoid time-consuming design iterations and is therefore expected to provide a "Drastic Reduction of NRE and Time-to-Market". The workshop will consist of invited university and industrial speakers performing high-profile research and development on the topic of the workshop.

Workshop Chairs:

  • Peter Marwedel, University of Dortmund, Dortmund, Germany
  • Hideharu Amano, Keio University, Yokohama, Japan

Abstracts of the Workshop on Compiler Assisted SoC Assembly (CASA)

  • Paolo Bonzini, Laura Pozzi (University of Lugano, Switzerland), Ajay K. Verma, Paolo Ienne (École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland): Enabling and Profiting of Automatic Customization: New Challenges in Front-end Optimizations and Datapath Design

    Most commercial embedded processors, both for ASICs and FPGAs, now support some form of application-domain customization. Although customization is still largely a manual task, considerable work has been performed in recent years, both in industry and in academia, to develop algorithms and tools which can customize such processors automatically from the source code of applications. We argue that these developments represent an important evolution whereby Electronic Design Automation begins to successfully tackle typical Computer Architecture issues. Yet, we also show that, for such automated approaches to be truly successful, more research is needed in two key areas. At one end of the process, there is a need to revisit and develop front-end compiler optimizations so that they expose the best opportunities to customization algorithms; we outline some research directions and results in this field. At the opposite end of the customization process, where typically application-specific datapaths are identified, there are new challenges in automating the efficient implementation of such structures; we discuss some recent results and challenges in the optimization of arithmetic circuits, especially when the description of such circuits comes from software descriptions.
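
A deliberately naive sketch (not from the talk; the pair-counting heuristic and all names are illustrative assumptions) of how customization opportunities can be exposed: rank operator pairs by how often they appear on the dataflow edges of an application, a crude stand-in for the subgraph enumeration used to propose custom instructions.

```python
from collections import Counter

def frequent_op_pairs(dfg_edges):
    """dfg_edges: (producer_op, consumer_op) pairs, one per dataflow edge.
    Returns op pairs sorted by frequency; frequent pairs are candidates
    for fusion into a custom instruction."""
    return Counter(dfg_edges).most_common()

# A multiply feeding an add dominates this toy dataflow graph, so a fused
# multiply-add would be the first custom-instruction candidate.
edges = [("mul", "add")] * 3 + [("add", "shl")]
candidates = frequent_op_pairs(edges)
```

Real identification algorithms enumerate connected subgraphs under I/O and convexity constraints; the point here is only that the analysis runs on the compiler's dataflow representation of the source code.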

  • Grant Martin (Tensilica, US): Design Space Exploration for Configurable-Processor based SoC

    There are several ways in which compilers are used to determine the SoC architectural characteristics for a configurable and extensible processor. It is of fundamental importance that the language compilers for a configurable processor be as configurable and extensible as the processor architecture itself, so that all user-defined extensions are accessible during the coding and exploration process to decide on an optimal processor configuration and set of instruction extensions.

    Recent technology developments, however, allow a much higher level of automation in the design space exploration of a configurable processor architecture for particular applications. It is now possible to automatically explore a huge range of possible instantiations and produce a large tradeoff set so that performance, energy consumption and cost can be optimised. This kind of technology is useful not only for single-processor based SoC, but can also be used in conjunction with multi-processor design concepts to assist in mapping a complex embedded application to MPSoC.

    This talk will discuss some of the most recent technology advances that allow automated design space exploration to be possible.

  • Aviral Shrivastava (Arizona State University, Tempe, US), Nikil Dutt (UC Irvine, US): Compiler-assisted processor exploration and design

    Programmable embedded systems are characterized by strict, multi-dimensional design constraints such as power, performance, and time-to-market. In order to satisfy all these design constraints simultaneously, embedded processor architectures are highly customized. They may omit standard architectural features, e.g., caches, prefetching, etc., and employ various idiosyncratic architectural features, e.g., partial register renaming, dual instruction sets, etc. This poses a challenge for the compilers for these embedded processors. The compiler should not only be able to exploit the idiosyncratic design features, but also avoid the penalty of missing design features. Although difficult to construct, the compiler can be very effective in such situations, and can achieve a significant improvement in the power, performance, etc. of the system. In such cases, inputs from the compiler should be included while designing the processor (micro)architecture.

    In this talk, we will present our compiler-assisted design technique in the context of register file power reduction; the register file is a major hot-spot in modern processors. A bypass-sensitive compiler can modify the code so that more instructions read their source register operands from the bypasses rather than from the register file, thereby reducing the power density of the register file by as much as 26%. Our Compiler-in-the-Loop bypass exploration technique allows designers to achieve the same register file power envelope with fewer bypasses.
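
As a toy illustration of the underlying idea (this is not the authors' algorithm; the one-instruction-per-cycle model and the bypass depth are assumptions), one can count how many source operands must be read from the register file, given a linear schedule and a bypass network that covers producers up to a few cycles back:

```python
def rf_reads(schedule, depth=2):
    """Count source operands that must come from the register file.
    schedule: one (dest, srcs) tuple per cycle; an operand produced at
    most `depth` cycles earlier is taken from the bypass network."""
    produced = {}   # register -> cycle of its most recent producer
    reads = 0
    for cycle, (dest, srcs) in enumerate(schedule):
        for r in srcs:
            if r not in produced or cycle - produced[r] > depth:
                reads += 1   # producer too old (or live-in): RF access
        produced[dest] = cycle
    return reads

# Moving the consumer of r1 close to its producer turns a register file
# read into a bypass read:
far  = [("r1", []), ("r2", []), ("r3", []), ("r4", ["r1"])]
near = [("r2", []), ("r3", []), ("r1", []), ("r4", ["r1"])]
```

A bypass-sensitive compiler performs exactly this kind of reordering at scale, which is why it can shift reads from the register file onto the bypasses.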

  • Takuji Hieda, Hiroaki Tanaka, Keishi Sakanushi, Yoshinori Takeuchi, and Masaharu Imai (Osaka University, Osaka, Japan): Effectiveness of Partial Forwarding in Pipeline Processors

    Data forwarding is an essential technique to improve the performance of pipelined processors. However, full forwarding requires considerable hardware area, increases power consumption, and may degrade the clock frequency of the processor. Partial forwarding is an attractive method that requires moderate area and power consumption while still improving the performance of the processor. This paper discusses the advantages of the partial forwarding method and introduces an instruction scheduling algorithm that takes partial forwarding into consideration.
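
Why scheduling matters under partial forwarding can be shown with a simplified stall model (a sketch under stated assumptions, not the paper's algorithm: single-issue pipeline, results reach writeback after 3 instructions, and `forwarded` names the producer-to-consumer distances a forwarding path covers):

```python
def stall_cycles(schedule, forwarded, writeback=3):
    """Count stall cycles in a simple in-order pipeline sketch.
    A dependence at an uncovered distance shorter than `writeback`
    stalls until it reaches a covered distance or writeback."""
    pos = {}     # register -> index of the instruction producing it
    stalls = 0
    for i, (dest, srcs) in enumerate(schedule):
        for r in srcs:
            if r in pos:
                d = i - pos[r]
                if d < writeback and d not in forwarded:
                    # wait for the nearest forwarding path or writeback
                    stalls += min([f - d for f in forwarded if f > d]
                                  + [writeback - d])
        pos[dest] = i
    return stalls

# A dependence chain stalls when only the distance-2 path exists, but a
# scheduler that interleaves independent work removes the stalls:
chain  = [("r1", []), ("r2", ["r1"]), ("r3", ["r2"])]
spread = [("r1", []), ("a", []), ("r2", ["r1"]), ("b", []), ("r3", ["r2"])]
```

This is the intuition behind a forwarding-aware scheduler: with only some forwarding paths implemented, the schedule determines whether dependences land on the cheap distances.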

  • Krishna Palem, Lakshmi N. B. Chakrapani (Georgia Tech, Atlanta, US): Highly Productive Compiler-Driven Customization of Heterogeneous Embedded Platforms

    The spectacular growth in the computer industry has been fueled in significant part by the striking rate at which CMOS technology has continued to scale, popularly referred to as "Moore's law". However, the attractive opportunities for embedded systems that this trend offers are limited by serious hurdles to the easy adoption of the billion-transistor chips looming on the horizon.

    With increasing transistor densities, non-recurring engineering cost (NRE) and time-to-market pose an ever increasing impediment to the proliferation of customized computing systems; custom architectures are often tailored to meet an application's specific needs and pervade the embedded computing domain. With an optimizing compiler at its heart, we describe our methodology through which we have successfully overcome these impediments posed by NRE and time-to-market, despite the increasing design complexity. Using a technique referred to as geometry-aware modulo scheduling, we empower the "compiler" to play a central role as a front-end to the process of designing custom SoC (hardware) architectures. A proof-of-concept system embodying this concept will be demonstrated.

  • Scott Mahlke, Kevin Fan, Manjunath Kudlur, Ganesh Dasika (University of Michigan, Ann Arbor, MI, US): Compiler Orchestrated Customization of Accelerator Pipelines

    An ever larger variety of embedded computer systems are being designed and deployed that have both challenging performance requirements to process real-time audio or video signals, and tight energy constraints to allow them to operate in a portable manner.  Specialized nonprogrammable hardware accelerators are often used in these systems to execute parts of the applications that would run too slowly or consume too much power if implemented in software on an embedded programmable processor.  In this talk, I will describe the Streamroller synthesis system for automatically designing pipelines of accelerators for streaming applications.  The application is modeled using the C language with several stylizations. The synthesis of the accelerator pipeline consists of a top-level design for defining the mapping of kernels to accelerators and the desired throughput for each accelerator.  This step is followed by designing individual loop accelerators, instantiating buffers for arrays used in the application, and hooking up these building blocks to form a prescribed-throughput pipeline.  Several novel technologies are used in the Streamroller system to create highly customized but efficient designs: multifunction accelerators for mapping multiple loop nests to a single hardware accelerator, cost-sensitive modulo scheduling for minimizing the cost of individual accelerators, and razor latches to facilitate performing online timing analysis and data-centric dynamic frequency adaptation.

  • Maziar Goudarzi, Tohru Ishihara, Hiroto Yasuura (System LSI Research Center, Kyushu University, Fukuoka, Japan): Process Variation Aware Compilation for Improving Energy-Efficiency of Processors with Defective Caches

    We introduce a new concept for canceling the negative effects of process variation at the system level of abstraction. The basic idea is to use different object codes for chips which have different leakage current and delay values due to process variation, even though those chips are fabricated from the same design.

    Yield improvement through exploiting fault-free sections of a defective cache memory is a well-known technique. The idea is to mark defective cache sections using an extra flag and to invalidate them so that the defective sections do not affect the correct operation of the processor. The marked sections are not used for replacement in case of a cache miss, and accessing them will always cause a cache miss. The major downside of the idea is performance degradation due to the reduced size of the cache memory. Our approach modifies the placement of basic blocks or functions in the address space so that the number of cache misses is minimized for given locations of defective cache lines.

    A defective cache line here denotes an ultra-leaky, an extremely slow, or a functionally faulty cache line. Suppose there are several cache lines having a much larger delay than the average due to process variation. The cycle time of the processor would then be determined by the access time of those slow cache lines. If we invalidate and do not use those lines, the clock cycle time can be improved. On the other hand, suppose there are several ultra-leaky cache lines whose leakage currents are much higher than the average. In this case, invalidating the cache lines cannot reduce the leakage currents. Instead, our technique modifies the instruction scheduling and the placement of basic blocks or functions so that the total leakage current in the cache is minimized. This idea is based on the observation that the leakage current in a leaky SRAM cell depends on the logic value stored.

    In this paper, we present a process-variation-aware code placement technique which improves the energy-efficiency of a processor with a defective cache memory. To the best of our knowledge, this is the first compiler-based technique which offsets the performance degradation and the abnormal leakage current due to process variation. Our technique does not need any modification of the internal processor architecture. Experiments demonstrate that the technique can drastically reduce the performance degradation and the abnormal leakage current in 90nm CMOS technology and below.
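
The flavor of per-chip code placement can be conveyed with a minimal sketch (assuming a direct-mapped cache and a greedy heuristic; the actual technique optimizes basic-block placement far more carefully): give the most frequently executed blocks the fault-free cache lines of this particular chip.

```python
def place_blocks(freq, num_lines, defective):
    """freq: basic block name -> execution count. Greedily map the
    hottest blocks to the fault-free lines of a direct-mapped cache;
    blocks that do not fit wrap around over all lines."""
    usable = [l for l in range(num_lines) if l not in defective]
    placement = {}
    for i, name in enumerate(sorted(freq, key=freq.get, reverse=True)):
        placement[name] = usable[i] if i < len(usable) else i % num_lines
    return placement

# On a chip whose line 0 is defective, all three blocks avoid it, with
# the hottest block getting the first usable line:
placement = place_blocks({"hot": 100, "warm": 10, "cold": 1},
                         num_lines=4, defective={0})
```

Because the defect map differs from chip to chip, each chip receives its own object code from the same source, which is the core of the proposed concept.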

  • Diederik Verkest, Johan Vounckx (IMEC, Leuven, Belgium): Scratchpad memory exploration in multi-processor platforms

    Scratchpad memory is an additional memory resource in SoCs next to the more traditionally used caches. Particularly in multi-processor SoCs, scratchpad memory can provide a more energy-efficient and more predictable alternative to the use of caches. However, scratchpad memory is software-controlled, and today its usage is typically left in the hands of the programmer. This not only makes it hard to use the scratchpad effectively, it also makes it impossible to explore different options for the scratchpad when defining a platform architecture. In this presentation, we look at compiler support for scratchpad memory management. Thanks to the automation offered by a compiler, a platform architect can explore the best options for the memory hierarchy for the considered application domain.
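
The kind of sweep that automation enables might look like the following sketch (the cost numbers and the greedy access-density allocator are illustrative assumptions, not IMEC's tooling): once the compiler can allocate objects automatically, the architect can evaluate many scratchpad sizes in one run.

```python
def explore_spm_sizes(objects, sizes, spm_cost=1, dram_cost=10):
    """objects: (name, size, accesses) triples. For each candidate
    scratchpad capacity, greedily place the most access-dense objects
    and report the total memory access cost."""
    results = {}
    for cap in sizes:
        free, cost = cap, 0
        for name, size, accesses in sorted(
                objects, key=lambda o: o[2] / o[1], reverse=True):
            if size <= free:
                free -= size
                cost += accesses * spm_cost    # served by scratchpad
            else:
                cost += accesses * dram_cost   # falls back to DRAM
        results[cap] = cost
    return results

# Sweeping three candidate capacities for two hypothetical objects:
objects = [("buf", 4, 100), ("tbl", 2, 60)]
results = explore_spm_sizes(objects, [0, 4, 8])
```

Plotting `results` over candidate sizes is exactly the cost/benefit curve a platform architect needs when fixing the memory hierarchy for an application domain.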

  • Peter Marwedel, Manish Verma, Lars Wehmeyer (University of Dortmund, Germany): Compiler-Directed Optimization of Memory Architectures using a Memory Architecture Aware Compiler

    The speed of memories is known to be improving at a slower rate than the speed of processors. This leads to a situation in which the performance of systems is heavily influenced by the architecture and the speed of memories. Optimizing the memory architecture requires the concurrent consideration of the impact of compiler algorithms and architectures. This is feasible with compilers which are used in the optimization loop. In this talk, we will demonstrate how this can be done with a memory-architecture-aware compiler algorithm. This algorithm is capable of exploiting information about the memory architecture in order to map frequently accessed memory objects to faster memories in the memory hierarchy. In particular, this compiler supports the use of scratchpad memories. Such memories provide three advantages at the same time: they are fast (since they are small), they are energy-efficient (for the same reason), and they are timing-predictable (since we know at design time whether an access will go to the scratchpad or to the main memory). We will present approaches for exploiting such scratchpads. We will then show how these optimizations can be used to optimize the memory architecture.
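
Mapping memory objects to a small scratchpad is naturally a 0/1 knapsack problem: maximize the benefit (e.g. energy saved) of the selected objects subject to the capacity. A minimal dynamic-programming sketch of that formulation (object names and gain values are made up for illustration):

```python
def allocate_to_spm(objects, capacity):
    """objects: (name, size, gain) triples, gain being e.g. the energy
    saved when the object lives in the scratchpad. Returns the best
    total gain and the chosen object set (0/1 knapsack DP)."""
    best = [(0, frozenset())] * (capacity + 1)
    for name, size, gain in objects:
        nxt = list(best)
        for c in range(size, capacity + 1):
            g, chosen = best[c - size]
            if g + gain > nxt[c][0]:
                nxt[c] = (g + gain, chosen | {name})
        best = nxt
    return best[capacity]

# Two small objects together beat the single large one in 5 units:
gain, chosen = allocate_to_spm(
    [("A", 4, 10), ("B", 3, 7), ("C", 2, 4)], capacity=5)
```

Because the selection is made at compile time, every access to a chosen object is statically known to hit the scratchpad, which is the source of the timing predictability mentioned above.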

  • Tobias Becker and Wayne Luk (Imperial College London): Run-Time Compilation for Reconfigurable System-on-Chip Designs

    The latest Field-Programmable Gate Arrays (FPGAs) are examples of reconfigurable system-on-chip devices, since they contain a variety of computational, storage and communication resources. Run-time reconfiguration has been of academic interest for more than a decade; it adds software-like flexibility to hardware performance. Usually run-time reconfiguration relies on precompiled bitstreams, which contain the programming information for a reconfigurable device and are loaded at run time. While this is arguably the fastest approach to updating the functionality of an FPGA, the requirement of having to store multiple bitstreams can impose a significant overhead, especially if there are thousands or tens of thousands of bitstreams. In applications where the required configurations are not known at design time, the pre-compilation of all possible combinations of configurations can result in a bitstream storage that is unrealistically large. Run-time compilation can solve this problem by rapidly generating new configurations at run time. However, conventional hardware implementation tools are not optimized for fast execution. Our approach to fast compilation relies on the assumption that run-time reconfigurable designs exhibit a certain regularity. For instance, in designs employing constant folding to reduce circuit size, delay and power consumption for signal processing or encryption applications, different parameters can be used to produce circuits optimized for particular situations. This talk describes the use of parameterized structural hardware descriptions to simplify the compilation process for reconfigurable devices. Such descriptions can be used in combination with staged and incremental compilation techniques to speed up the compilation process for fast execution during run time.
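
What constant folding buys in hardware can be sketched in a few lines (a software model only; the talk's tools generate actual FPGA configurations): a generic multiplier specialized for a known constant collapses into a small shift-add network, one term per set bit of the constant.

```python
def const_mult_ops(c):
    """Decompose multiplication by the constant c into left-shift
    amounts, one per set bit of c; this models the shift-add network
    that constant folding specializes a generic multiplier into
    (a canonical-signed-digit recoding would need even fewer terms)."""
    return [i for i in range(c.bit_length()) if (c >> i) & 1]

def apply_ops(x, ops):
    """Evaluate the specialized shift-add network in software."""
    return sum(x << sh for sh in ops)

# Multiplying by 10 (binary 1010) needs only two shifted terms:
ops = const_mult_ops(10)
```

Each new parameter value yields a different small circuit, which is why precompiling every variant explodes the bitstream storage and why regenerating the specialized structure at run time is attractive.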