| Hussam Wei, Chia-Lin Sung, Khe-Chung Wang and Chin-Yuan Lu. Robust Brain-Inspired Computing: On the Reliability of Spiking Neural Network Using Emerging Non-Volatile Synapses. In Proceedings of the IEEE 59th International Reliability Physics Symposium (IRPS) 2021
@inproceedings { WeiIRPS21,
author = {Wei, Hussam and Sung, Chia-Lin and Wang, Khe-Chung and Lu, Chin-Yuan},
title = {Robust Brain-Inspired Computing: On the Reliability of Spiking Neural Network Using Emerging Non-Volatile Synapses},
booktitle = {Proceedings of the IEEE 59th International Reliability Physics Symposium (IRPS)},
year = {2021},
keywords = {nvm-oma},
confidential = {n},
} |
| Christian Hakert, Asif-Ali Khan, Kuan-Hsun Chen, Fazal Hameed, Jeronimo Castrillon and Jian-Jia Chen. BLOwing Trees to the Ground: Layout Optimization of Decision Trees on Racetrack Memory. In 58th ACM/IEEE Design Automation Conference (DAC), accepted 2021
@inproceedings { HakertDAC21,
author = {Hakert, Christian and Khan, Asif-Ali and Chen, Kuan-Hsun and Hameed, Fazal and Castrillon, Jeronimo and Chen, Jian-Jia},
title = {BLOwing Trees to the Ground: Layout Optimization of Decision Trees on Racetrack Memory},
booktitle = {58th ACM/IEEE Design Automation Conference (DAC), accepted},
year = {2021},
keywords = {kuan, nvm-oma},
confidential = {n},
abstract = {Modern embedded systems integrate machine learning algorithms. In resource-constrained setups, execution has to be optimized for execution time and energy. In order to access data in racetrack memory (RTM), it needs to be shifted to the access port. We propose a novel domain-specific approach for placing decision trees in RTMs. We reduce the total number of shifts by exploiting the tree structure. We prove that the theoretically optimal decision tree placement is at most 4× better in terms of shifts than our proposed approach. Through extensive experiments, we show that our method outperforms the state-of-the-art methods.},
}
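As context for the shift-cost objective: accessing a node in RTM requires shifting the track until the node's domain is aligned with the access port, so serving one root-to-leaf traversal costs the sum of distances between consecutively accessed positions. A minimal sketch of this cost model (not the paper's layout algorithm; the placement and access sequence below are hypothetical):

```python
# Hedged sketch of the RTM shift-cost model (not the paper's algorithm):
# the cost of an inference path is the sum of distances between
# consecutively accessed domain positions on the track.

def shift_cost(placement, access_sequence):
    """placement: dict mapping tree node -> domain index on the track.
    access_sequence: nodes visited by one root-to-leaf traversal."""
    cost, port = 0, 0  # assume the track starts aligned at domain 0
    for node in access_sequence:
        cost += abs(placement[node] - port)
        port = placement[node]
    return cost

# Example: a naive breadth-first placement of a depth-2 tree.
placement = {"root": 0, "l": 1, "r": 2, "ll": 3, "lr": 4, "rl": 5, "rr": 6}
print(shift_cost(placement, ["root", "r", "rr"]))  # 0 + 2 + 4 = 6 shifts
```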
|
| Mario Guenzel, Kuan-Hsun Chen, Niklas Ueter, Georg von der Brüggen, Marco Duerr and Jian-Jia Chen. Timing Analysis of Asynchronized Distributed Cause-Effect Chains. In Real-Time and Embedded Technology and Applications Symposium (RTAS) 2021
@inproceedings { guenzel2021e2e,
author = {Guenzel, Mario and Chen, Kuan-Hsun and Ueter, Niklas and Br\"uggen, Georg von der and Duerr, Marco and Chen, Jian-Jia},
title = {Timing Analysis of Asynchronized Distributed Cause-Effect Chains},
booktitle = {Real-Time and Embedded Technology and Applications Symposium (RTAS)},
year = {2021},
keywords = {kuan, georg},
confidential = {n},
abstract = {Real-time systems require the formal guarantee of timing constraints, not only for the individual tasks but also for the data-propagation paths. A cause-effect chain describes the data flow among multiple tasks, e.g., from sensors to actuators, independently from the priority order of the tasks. In this paper, we provide an end-to-end timing analysis for cause-effect chains on asynchronized distributed systems with periodic task activations, considering the maximum reaction time (duration of data processing) and the maximum data age (worst-case data freshness). On one local electronic control unit (ECU), we present how to compute the exact local (worst-case) end-to-end latencies when the execution time of the periodic tasks is fixed. We further extend our analysis to globally asynchronized systems by combining the local results. Using synthesized data based on an automotive benchmark as well as randomized parameters, we show that our analytical results improve the state-of-the-art for periodic task activations.},
}
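For context, a classical upper bound on the end-to-end latency of such a chain, due to Davare et al. (DAC 2007), sums each task's period and worst-case response time; exact analyses like the one above aim to improve on it. A minimal sketch, assuming per-task periods and response times are given:

```python
# Minimal sketch of the classical end-to-end latency bound for a
# cause-effect chain of periodic tasks (Davare et al., DAC 2007),
# which analyses like this paper's improve upon. T is the period and
# R the worst-case response time of each task along the chain.

def davare_bound(chain):
    """chain: list of (period, wcrt) tuples along the data path."""
    return sum(T + R for (T, R) in chain)

# Hypothetical 3-task chain: sensor (10ms), filter (20ms), actuator (5ms).
print(davare_bound([(10, 2), (20, 7), (5, 1)]))  # 45 ms upper bound
```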
|
| Hsiang-Yun Cheng, Chun-Feng Wu, Christian Hakert, Kuan-Hsun Chen, Yuan-Hao Chang, Jian-Jia Chen, Chia-Lin Yang and Tei-Wei Kuo. Future Computing Platform Design: A Cross-Layer Design Approach. In Design, Automation and Test in Europe Conference (DATE) 2021
@inproceedings { Cheng/etal/2021,
author = {Cheng, Hsiang-Yun and Wu, Chun-Feng and Hakert, Christian and Chen, Kuan-Hsun and Chang, Yuan-Hao and Chen, Jian-Jia and Yang, Chia-Lin and Kuo, Tei-Wei},
title = {Future Computing Platform Design: A Cross-Layer Design Approach},
booktitle = {Design, Automation and Test in Europe Conference (DATE)},
year = {2021},
keywords = {kuan, nvm-oma},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2021datecross.pdf},
confidential = {n},
abstract = {Future computing platforms are facing a paradigm shift with the emerging resistive memory technologies. First, they offer fast memory accesses and data persistence in a single large-capacity device deployed on the memory bus, blurring the boundary between memory and storage. Second, they enable computing-in-memory for neuromorphic computing to mitigate costly data movements. Due to the non-ideality of these resistive memory devices at the moment, we envision that cross-layer design is essential to bring such a system into practice. In this paper, we showcase a few examples to demonstrate how cross-layer design can be developed to fully exploit the potential of resistive memories and accelerate its adoption for future computing platforms.},
}
|
| Mikail Yayla, Kuan-Hsun Chen, Georgios Zervakis, Jörg Henkel, Jian-Jia Chen and Hussam Amrouch. FeFET and NCFET for Future Neural Networks: Visions and Opportunities. In Design, Automation and Test in Europe Conference (DATE) 2021
@inproceedings { yayla/etal/2021,
author = {Yayla, Mikail and Chen, Kuan-Hsun and Zervakis, Georgios and Henkel, J\"org and Chen, Jian-Jia and Amrouch, Hussam},
title = {FeFET and NCFET for Future Neural Networks: Visions and Opportunities},
booktitle = {Design, Automation and Test in Europe Conference (DATE)},
year = {2021},
keywords = {kuan, nvm-oma},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2021datefefet.pdf},
confidential = {n},
abstract = {The goal of this special session paper is to introduce and discuss different emerging technologies for logic circuitry and memory as well as new lightweight architectures for neural networks. We demonstrate how the ever-increasing complexity of Artificial Intelligence (AI) applications, which results in an immense increase in required computational power, inevitably necessitates employing innovations starting from the underlying devices all the way up to the architectures. Two different promising emerging technologies will be presented: (i) Negative Capacitance Field-Effect Transistor (NCFET) as a new beyond-CMOS technology with advantages for offering low power and/or higher accuracy for neural network inference. (ii) Ferroelectric FET (FeFET) as a novel non-volatile, area-efficient and ultra-low-power memory device. In addition, we demonstrate how Binary Neural Networks (BNNs) offer a promising alternative to traditional Deep Neural Networks (DNNs) due to their lightweight hardware implementation. Finally, we present the challenges of combining FeFET-based NVM with NNs and summarize our perspectives on future NNs and the vital role that emerging technologies may play.},
}
|
| Jian-Jia Chen and Christian Hakert. Tutorial for Full System Simulations of Non-Volatile Main Memories. In Design, Automation and Test in Europe Conference 2021
@inproceedings { date2021tutorial,
author = {Chen, Jian-Jia and Hakert, Christian},
title = {Tutorial for Full System Simulations of Non-Volatile Main Memories},
booktitle = {Design, Automation and Test in Europe Conference},
year = {2021},
url = {https://video.tu-dortmund.de/m/c25c3c9373cc554bf8007973fe6e81fbb78c9b1f221b9b1b9cce780f9bbe4186676e7036389eb22bc89f1f605c618ab1cf5bf0d7ead76af129bd412b82bb7601},
keywords = {nvm-oma},
confidential = {n},
} |
| Sebastian Buschjäger, Jian-Jia Chen, Kuan-Hsun Chen, Mario Günzel, Katharina Morik, Rodion Novkin, Lukas Pfahler and Mikail Yayla. Bit Error Tolerance Metrics for Binarized Neural Networks. In Workshop on System-level Design Methods for Deep Learning on Heterogeneous Architectures (SLOHA) 2021
@inproceedings { buschjaegerSLOHA2021,
author = {Buschj\"ager, Sebastian and Chen, Jian-Jia and Chen, Kuan-Hsun and G\"unzel, Mario and Morik, Katharina and Novkin, Rodion and Pfahler, Lukas and Yayla, Mikail},
title = {Bit Error Tolerance Metrics for Binarized Neural Networks},
booktitle = {Workshop on System-level Design Methods for Deep Learning on Heterogeneous Architectures (SLOHA)},
year = {2021},
url = {https://arxiv.org/abs/2102.00818},
keywords = {kuan},
confidential = {n},
abstract = {To reduce the resource demand of neural network (NN) inference systems, it has been proposed to use approximate memory, in which the supply voltage and the timing parameters are tuned, trading accuracy for energy consumption and performance. Tuning these parameters aggressively leads to bit errors, which can be tolerated by NNs when bit flips are injected during training. However, bit flip training, which is the state of the art for achieving bit error tolerance, does not scale well; it leads to massive overheads and cannot be applied for high bit error rates (BERs). Alternative methods to achieve bit error tolerance in NNs are needed, but the underlying principles behind the bit error tolerance of NNs have not been reported yet. With this lack of understanding, further progress in the research on NN bit error tolerance will be restrained. In this study, our objective is to investigate the internal changes in the NNs that bit flip training causes, with a focus on Binarized NNs (BNNs). To this end, we quantify the properties of bit error tolerant BNNs with two metrics. First, we propose a neuron-level bit error tolerance metric, which calculates the margin between the pre-activation values and batch normalization thresholds. Second, to capture the effects of bit error tolerance on the interplay of neurons, we propose an inter-neuron bit error tolerance metric, which measures the importance of each neuron and computes the variance over all importance values. Our experimental results support that these two metrics are strongly related to bit error tolerance.},
}
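The two metrics, as described in the abstract, can be sketched directly; the array shapes and names below are assumptions for illustration, not the authors' code:

```python
import numpy as np

# Hedged sketch of the two metrics as described in the abstract.

def neuron_margins(preact, bn_threshold):
    """Neuron-level metric: margin between pre-activation values and the
    batch-normalization comparison thresholds. preact has shape
    (batch, neurons); bn_threshold has shape (neurons,)."""
    return np.abs(preact - bn_threshold)  # larger margin -> more bit flips tolerated

def importance_variance(importance):
    """Inter-neuron metric: variance over per-neuron importance values
    (a flat distribution suggests no single neuron is critical)."""
    return np.var(importance)

margins = neuron_margins(np.random.randn(32, 128), np.zeros(128))
print(margins.mean(), importance_variance(np.random.rand(128)))
```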
|
| Christian Hakert and Jian-Jia Chen. [Demo] Tutorial for Full System Simulations of Non-Volatile Main Memories. In Design, Automation and Test in Europe Conference 2021
@inproceedings { date2021tutorialdemo,
author = {Hakert, Christian and Chen, Jian-Jia},
title = {[Demo] Tutorial for Full System Simulations of Non-Volatile Main Memories},
booktitle = {Design, Automation and Test in Europe Conference},
year = {2021},
url = {https://video.tu-dortmund.de/m/d62742ad8a171810b7af16e983f4e1349ba7ea83a0918ae3944d1659b9c5ae4d0730d7cac39cba40f5851f2001a6c17f5f33b2f67802cb722eb6295618357ddb},
keywords = {nvm-oma},
confidential = {n},
} |
| Sebastian Buschjäger, Jian-Jia Chen, Kuan-Hsun Chen, Mario Günzel, Christian Hakert, Katharina Morik, Rodion Novkin, Lukas Pfahler and Mikail Yayla. Margin-Maximization in Binarized Neural Networks for Optimizing Bit Error Tolerance. In Design, Automation and Test in Europe Conference (DATE), accepted 2021, Best Paper Award Candidate
@inproceedings { buschjaeger/etal/2021,
author = {Buschj\"ager, Sebastian and Chen, Jian-Jia and Chen, Kuan-Hsun and G\"unzel, Mario and Hakert, Christian and Morik, Katharina and Novkin, Rodion and Pfahler, Lukas and Yayla, Mikail},
title = {Margin-Maximization in Binarized Neural Networks for Optimizing Bit Error Tolerance},
booktitle = {Design, Automation and Test in Europe Conference (DATE), accepted},
year = {2021},
note = {Best Paper Award Candidate},
keywords = {kuan, nvm-oma},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2021dateyayla.pdf},
confidential = {n},
abstract = {To overcome the memory wall in neural network (NN) inference systems, recent studies have proposed to use approximate memory, in which the supply voltage and access latency parameters are tuned, for lower energy consumption and faster access at the cost of reliability. To tolerate the occurring bit errors, the state-of-the-art approaches apply bit flip injections to the NNs during training, which incur high overheads and do not scale well for large NNs and high bit error rates. In this work, we focus on binarized NNs (BNNs), whose simpler structure allows better exploration of bit error tolerance metrics based on margins. We provide formal proofs to quantify the maximum number of bit flips that can be tolerated. With the proposed margin-based metrics and the well-known hinge loss for maximum-margin classification in support vector machines (SVMs), we construct a modified hinge loss (MHL) to train BNNs for bit error tolerance without any bit flip injections. Our experimental results indicate that the MHL enables BNNs to tolerate higher bit error rates than with bit flip training and, therefore, allows to further lower the requirements on approximate memories used for BNNs.},
}
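A hedged sketch of a margin-enforcing hinge term in the spirit of the MHL follows; the exact formulation is in the paper, and the margin parameter `b` below is a hypothetical placeholder:

```python
import numpy as np

# Sketch of a margin-enforcing hinge term in the spirit of the paper's
# modified hinge loss (MHL); this is an illustration, not the authors'
# exact loss.

def hinge_margin_loss(preact, sign_target, b=4.0):
    """Penalize pre-activations that do not clear the decision threshold
    (here 0, as after batch normalization) by at least margin b.
    sign_target in {-1, +1} encodes the desired binary activation."""
    return np.mean(np.maximum(0.0, b - sign_target * preact))

preact = np.array([5.0, 1.0, -3.0])
target = np.array([1.0, 1.0, -1.0])
print(hinge_margin_loss(preact, target))  # mean of [0, 3, 1] = 4/3
```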
|
| Qiao Yu, Kuan-Hsun Chen and Jian-Jia Chen. Using a Set of Triangle Inequalities to Accelerate K-means Clustering. In Similarity Search and Applications - 13th International Conference (SISAP) Virtual Conference, Sep 30 - Oct 2 2020
@inproceedings { yu2020,
author = {Yu, Qiao and Chen, Kuan-Hsun and Chen, Jian-Jia},
title = {Using a Set of Triangle Inequalities to Accelerate K-means Clustering},
booktitle = {Similarity Search and Applications - 13th International Conference (SISAP)},
year = {2020},
editor = {Shin'ichi Satoh and Lucia Vadicamo and Arthur Zimek and Fabio Carrara and Ilaria Bartolini and Martin Aum\"uller and Bj\"orn Þór Jónsson and Rasmus Pagh},
address = {Virtual Conference},
month = {Sep 30 - Oct 2},
publisher = {Springer},
keywords = {kuan},
confidential = {n},
abstract = {The k-means clustering is a well-known problem in data mining and machine learning. However, the de facto standard, i.e., Lloyd’s k-means algorithm, spends a large amount of time on distance calculations. Elkan’s k-means algorithm, as one prominent approach, exploits the triangle inequality to greatly reduce such distance calculations between points and centers, while achieving exactly the same clustering results with significant speed improvement, especially on high-dimensional datasets. In this paper, we propose a set of triangle inequalities to enhance the filtering step of Elkan’s k-means algorithm. With our new filtering bounds, a filtering-based Elkan (FB-Elkan) is proposed, which preserves the same results as Lloyd’s k-means algorithm and additionally prunes unnecessary distance calculations. In addition, a memory-optimized Elkan (MO-Elkan) is provided, where the space complexity is greatly reduced by trading off the maintenance of lower bounds against run-time efficiency. Through evaluations with real-world datasets, FB-Elkan in general accelerates the original Elkan’s k-means algorithm for high-dimensional datasets (up to 1.69x), whereas MO-Elkan outperforms the others for low-dimensional datasets (up to 2.48x). Specifically, when the datasets have a large number of points, i.e., n ≥ 5M, MO-Elkan can still derive the exact clustering results, while the original Elkan’s k-means algorithm is not applicable due to memory limitations.},
}
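The classic filtering step from Elkan's algorithm, which the paper's additional inequalities tighten, rests on a single triangle-inequality argument: for a point x assigned to center c with upper bound u >= d(x, c), another center c' cannot be closer whenever d(c, c') >= 2u, since d(x, c') >= d(c, c') - d(x, c) >= 2u - u = u >= d(x, c). A minimal sketch of that filter (names are mine, not the paper's code):

```python
import numpy as np

# Sketch of the classic triangle-inequality filter from Elkan's k-means,
# which the paper's additional inequalities refine further.

def can_skip(center_dist, upper_bound):
    """True if d(x, c') need not be computed at all: d(c, c') >= 2u
    guarantees that c' is no closer to x than its current center c."""
    return center_dist >= 2.0 * upper_bound

x, c, c2 = np.zeros(2), np.array([1.0, 0.0]), np.array([5.0, 0.0])
u = np.linalg.norm(x - c)                   # upper bound on d(x, c): 1.0
print(can_skip(np.linalg.norm(c - c2), u))  # d(c, c') = 4 >= 2 -> True, skip
```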
|
| Tseng-Yi Chen, Yuan-Hao Chang, Ming-Chang Yang and Huang-Wei Chen. How to Cultivate a Green Decision Tree without Loss of Accuracy?. In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED) 2020
@inproceedings { ChenISLPED20,
author = {Chen, Tseng-Yi and Chang, Yuan-Hao and Yang, Ming-Chang and Chen, Huang-Wei},
title = {How to Cultivate a Green Decision Tree without Loss of Accuracy?},
booktitle = {Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED)},
year = {2020},
keywords = {nvm-oma, MOST},
confidential = {n},
abstract = {The decision tree is the core algorithm of random forest learning, which has been widely applied to classification and regression problems in the machine learning field. To avoid underfitting, a decision tree algorithm stops growing its tree model only when the model is a fully-grown tree. However, a fully-grown tree results in an overfitting problem that reduces the accuracy of a decision tree. In such a dilemma, some post-pruning strategies have been proposed to reduce the model complexity of the fully-grown decision tree. Nevertheless, such a process is very energy-inefficient on a non-volatile-memory-based (NVM-based) system because NVMs generally have high write costs (i.e., energy consumption and I/O latency). The unnecessary data written for later-pruned tree parts induce high write energy consumption and long I/O latency on NVM-based architectures, especially for low-power-oriented embedded systems. In order to establish a green decision tree (i.e., a tree model with minimized construction energy consumption), this study rethinks pruning with a new algorithm, namely the duo-phase pruning framework, which can significantly decrease the energy consumption on the NVM-based computing system without loss of accuracy.},
}
|
| Yunfeng Huang, Fang-Jing Wu, Christian Hakert, Georg von der Brüggen, Kuan-Hsun Chen, Jian-Jia Chen, Patrick Böcker, Petr Chernikov, Luis Cruz, Zeyi Duan, Ahmed Gheith, Yantao Gong, Anand Gopalan, Karthik Prakash, Ammar Tauqir and Yue Wang. Demo Abstract: Perception vs. Reality - Never Believe in What You See. In 19th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN) Virtual Conference, 2020
@inproceedings { ipsndemo2020,
author = {Huang, Yunfeng and Wu, Fang-Jing and Hakert, Christian and Br\"uggen, Georg von der and Chen, Kuan-Hsun and Chen, Jian-Jia and B\"ocker, Patrick and Chernikov, Petr and Cruz, Luis and Duan, Zeyi and Gheith, Ahmed and Gong, Yantao and Gopalan, Anand and Prakash, Karthik and Tauqir, Ammar and Wang, Yue},
title = {Demo Abstract: Perception vs. Reality - Never Believe in What You See},
booktitle = {19th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN)},
year = {2020 },
address = {Virtual Conference},
keywords = {kuan, georg},
file = {https://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2020-ipsn.pdf},
confidential = {n},
abstract = {The increasing availability of heterogeneous ambient sensing systems challenges the corresponding information processing systems to analyse and compare a variety of different systems in a single scenario. For instance, localization of objects can be performed by image processing systems as well as by radio-based localization. If such systems are utilized to localize the same objects, synergy of the outputs is important to enable comparable and meaningful analysis. This demo showcases the practical deployment and challenges of such an example system.},
}
|
| Christian Hakert, Kuan-Hsun Chen, Mikail Yayla, Georg von der Brüggen, Sebastian Bloemeke and Jian-Jia Chen. Software-Based Memory Analysis Environments for In-Memory Wear-Leveling. In 25th Asia and South Pacific Design Automation Conference ASP-DAC 2020, Invited Paper, Beijing, China, 2020
@inproceedings { nvmsimulator,
author = {Hakert, Christian and Chen, Kuan-Hsun and Yayla, Mikail and Br\"uggen, Georg von der and Bloemeke, Sebastian and Chen, Jian-Jia},
title = {Software-Based Memory Analysis Environments for In-Memory Wear-Leveling},
booktitle = {25th Asia and South Pacific Design Automation Conference ASP-DAC 2020, Invited Paper},
year = {2020},
address = {Beijing, China},
keywords = {kuan, nvm-oma, georg},
file = {https://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2020-aspdac-nvm.pdf},
confidential = {n},
abstract = {Emerging non-volatile memory (NVM) architectures are considered as a replacement for DRAM and storage in the near future, since NVMs provide low power consumption, fast access speed, and low unit cost. Due to the lower write endurance of NVMs, several in-memory wear-leveling techniques have been studied over the last years. Since most approaches propose or rely on specialized hardware, the techniques are often evaluated based on assumptions and in-house simulations rather than on real systems. To address this issue, we develop a setup consisting of a gem5 instance and an NVMain2.0 instance, which simulates an entire system (CPU, peripherals, etc.) together with an NVM plugged into the system. Taking a recorded memory access pattern from a low-level simulation into consideration to design and optimize wear-leveling techniques as operating system services allows a cross-layer design of wear-leveling techniques. With the insights gathered by analyzing the recorded memory access patterns, we develop a software-only wear-leveling solution, which does not require special hardware at all. This algorithm is afterwards evaluated by the full system simulation.},
}
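To illustrate the flavor of a software-only in-memory wear-leveling service, here is a toy sketch under assumed interfaces (not the paper's algorithm): keep per-frame write counters and periodically remap the hottest logical page onto the least-worn physical frame.

```python
# Toy sketch of software-only wear-leveling (not the paper's algorithm):
# approximate per-frame write counters drive a periodic remapping of the
# hottest logical page onto the least-worn physical frame. Data migration
# and address translation details are omitted for brevity.

class WearLeveler:
    def __init__(self, n_pages):
        self.mapping = list(range(n_pages))   # logical page -> physical frame
        self.wear = [0] * n_pages             # writes per physical frame

    def write(self, logical_page):
        self.wear[self.mapping[logical_page]] += 1

    def rebalance(self):
        hot = max(range(len(self.mapping)), key=lambda p: self.wear[self.mapping[p]])
        cold_frame = min(range(len(self.wear)), key=self.wear.__getitem__)
        cold = self.mapping.index(cold_frame)
        self.mapping[hot], self.mapping[cold] = self.mapping[cold], self.mapping[hot]

wl = WearLeveler(4)
for _ in range(100):
    wl.write(0)       # one hot page accumulates wear on frame 0
wl.rebalance()        # page 0 now maps to the least-worn frame
print(wl.mapping)     # e.g. [1, 0, 2, 3]
```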
|
| Wei-Chun Cheng, Shuo-Han Chen, Yuan-Hao Chang, Kuan-Hsun Chen, Jian-Jia Chen, Tseng-Yi Chen, Ming-Chang Yang and Wei-Kuan Shih. NS-FTL: Alleviating the Uneven Bit-Level Wearing of NVRAM-based FTL via NAND-SPIN. In 9th Non-Volatile Memory Systems and Applications Symposium (NVMSA) Virtual Conference, 2020
@inproceedings { most2020nvmsa,
author = {Cheng, Wei-Chun and Chen, Shuo-Han and Chang, Yuan-Hao and Chen, Kuan-Hsun and Chen, Jian-Jia and Chen, Tseng-Yi and Yang, Ming-Chang and Shih, Wei-Kuan},
title = {NS-FTL: Alleviating the Uneven Bit-Level Wearing of NVRAM-based FTL via NAND-SPIN},
booktitle = {9th Non-Volatile Memory Systems and Applications Symposium (NVMSA)},
year = {2020},
address = {Virtual Conference},
keywords = {kuan, nvm-oma},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2020nvmsa-ftl.pdf},
confidential = {n},
abstract = {Non-volatile random access memory (NVRAM) has been regarded as a promising DRAM alternative with its non-volatility, near-zero idle power consumption, and byte addressability. In particular, some NVRAM devices, such as Spin Torque Transfer (STT) RAM, can provide the same or better access performance and lower power consumption when compared with dynamic random access memory (DRAM). These features make NVRAM an attractive DRAM replacement on NAND flash storage for resolving the management overhead of the flash translation layer (FTL). For instance, when adopting NVRAM for storing the mapping entries of the FTL, the overheads of loading and storing the mapping entries between the non-volatile NAND flash and the volatile DRAM can be eliminated. Nevertheless, due to the limited lifetime constraint of NVRAM, the bit-level update behavior of the FTL may lead to the issue of uneven bit-level wearing, and the lifetime capacity of those less-worn NVRAM cells could be underutilized. Such an observation motivates this study to utilize the emerging NAND-like Spin Torque Transfer memory (NAND-SPIN) for alleviating the uneven bit-level wearing of NVRAM-based FTL and making the best of the lifetime capacity of each NAND-SPIN cell. The experimental results show that the proposed design can effectively avoid uneven bit-level wearing, when compared with a page-based FTL on NAND-SPIN.},
}
|
| Christian Hakert, Kuan-Hsun Chen, Simon Kuenzer, Sharan Santhanam, Shuo-Han Chen, Yuan-Hao Chang, Felipe Huici and Jian-Jia Chen. Split’n Trace NVM: Leveraging Library OSes for Semantic Memory Tracing. In 9th Non-Volatile Memory Systems and Applications Symposium (NVMSA) Virtual Conference, 2020
@inproceedings { hakert2020nvmsa,
author = {Hakert, Christian and Chen, Kuan-Hsun and Kuenzer, Simon and Santhanam, Sharan and Chen, Shuo-Han and Chang, Yuan-Hao and Huici, Felipe and Chen, Jian-Jia},
title = {Split’n Trace NVM: Leveraging Library OSes for Semantic Memory Tracing},
booktitle = {9th Non-Volatile Memory Systems and Applications Symposium (NVMSA)},
year = {2020},
address = {Virtual Conference},
keywords = {kuan, nvm-oma},
file = {https://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2020-nvmsa-hakert.pdf},
confidential = {n},
abstract = {With the rise of non-volatile memory (NVM) as a replacement for traditional main memories (e.g. DRAM), memory access analysis is becoming an increasingly important topic. NVMs suffer from technical shortcomings, such as reduced cell endurance, which call for precise memory access analysis in order to design maintenance strategies that can extend the memory’s lifetime. While existing memory access analyzers trace memory accesses at various levels, from the application level with code instrumentation, down to the hardware level where software is executed on special analysis hardware, they usually interpret main memory as a consecutive area, without investigating the application semantics of different memory regions.
In contrast, this paper presents a memory access simulator, which splits the main memory into semantic regions and enriches the simulation result with semantics from the analyzed application. We leverage a library-based operating system called Unikraft by ascribing memory regions of the simulation to the relevant OS libraries. This novel approach allows us to derive a detailed analysis of which libraries (and thus functionalities) are responsible for which memory access patterns. Through offline profiling with our simulator, we provide a fine-granularity analysis of memory access patterns that provide insights for the design of efficient NVM maintenance strategies.},
}
|
| Marcel Ebbrecht, Kuan-Hsun Chen and Jian-Jia Chen. Bucket of Ignorance: A Hybrid Data Structure for Timing Mechanism in Real-Time Operating Systems. In 26th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), Brief Presentations Track (BP) Virtual Conference (accepted for presentation), April 2020
@inproceedings { ebbrecht2020timer,
author = {Ebbrecht, Marcel and Chen, Kuan-Hsun and Chen, Jian-Jia},
title = {Bucket of Ignorance: A Hybrid Data Structure for Timing Mechanism in Real-Time Operating Systems},
booktitle = {26th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), Brief Presentations Track (BP)},
year = {2020},
address = {Virtual Conference (accepted for presentation)},
month = {April},
url = {https://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/ebrrechttimer.pdf},
keywords = {kuan},
confidential = {n},
abstract = {To maintain deterministic timing behaviors, Real-Time Operating Systems (RTOSes) require not only a task scheduler but also a timing mechanism for the periodicity of recurrent tasks. Most existing open-source RTOSes implement either a tree-based or a list-based mechanism to track which task is ready to release on-the-fly. Although tree-based mechanisms are known to be efficient in time complexity for searching operations, the additional effort of processing removals and insertions is not negligible and may countervail that advantage compared to list-based timer managers, even for small task sets. In this work, we provide a simulation framework, which is ready to be released, to investigate existing timing mechanisms and analyze how they perform under certain conditions. Through extensive simulations, we show that our proposed solution indeed requires less computation effort than conventional timing mechanisms when the task-set size is in the range of 16 to 208.},
}
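For context, the two conventional timer managers compared above can each be sketched in a few lines (the hybrid bucket structure itself is described in the paper):

```python
import heapq
from bisect import insort

# Minimal sketches of the two conventional timer managers the paper
# compares: a sorted list and a binary heap (standing in for a tree).

class ListTimer:                       # sorted list: O(n) insert, O(1) peek
    def __init__(self):
        self.events = []               # (release_time, task_id), ascending
    def insert(self, t, task):
        insort(self.events, (t, task))
    def pop_due(self, now):
        while self.events and self.events[0][0] <= now:
            yield self.events.pop(0)

class HeapTimer:                       # binary heap: O(log n) insert/pop
    def __init__(self):
        self.events = []
    def insert(self, t, task):
        heapq.heappush(self.events, (t, task))
    def pop_due(self, now):
        while self.events and self.events[0][0] <= now:
            yield heapq.heappop(self.events)

lt = ListTimer(); lt.insert(5, "A"); lt.insert(2, "B")
print(list(lt.pop_due(3)))  # [(2, 'B')]
```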
|
| Christian Hakert, Kuan-Hsun Chen and Jian-Jia Chen. Can Wear-Aware Memory Allocation be Intelligent?. In 2020 ACM/IEEE Workshop on Machine Learning for CAD (MLCAD ’20), November 16–20, 2020, Virtual Event, Iceland 2020
@inproceedings { mlcad2020intelliheap,
author = {Hakert, Christian and Chen, Kuan-Hsun and Chen, Jian-Jia},
title = {Can Wear-Aware Memory Allocation be Intelligent?},
booktitle = {2020 ACM/IEEE Workshop on Machine Learning for CAD (MLCAD ’20), November 16–20, 2020, Virtual Event, Iceland},
year = {2020},
file = {https://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2020-mlcad-hakert.pdf},
confidential = {n},
abstract = {Many non-volatile memories (NVM) suffer from severely reduced cell endurance and therefore require wear-leveling. Heap memory, as one segment which potentially is mapped to an NVM, exhibits strongly application-dependent characteristics regarding the amount of memory accesses and allocations. A simple deterministic strategy for wear-leveling of the heap may suffer when the available action space becomes too large. Therefore, in this paper we investigate the employment of a reinforcement learning agent as a substitute for such a strategy. The agent’s objective is to learn a strategy that is optimal with respect to the total memory wear-out. We conclude this work with an evaluation, where we compare the deterministic strategy with the proposed agent. We report that our proposed agent outperforms the simple deterministic strategy in several cases. However, we also report further optimization potential in the agent design and deployment.},
}
|
| Lea Schönberger, Georg von der Brüggen, Kuan-Hsun Chen, Benjamin Sliwa, Hazem Youssef, Aswin Ramachandran, Christian Wietfeld, Michael ten Hompel and Jian-Jia Chen. Offloading Safety- and Mission-Critical Tasks via Unreliable Connections. In 32nd Euromicro Conference on Real-Time Systems (ECRTS 2020) Virtual Conference, June 2020
@inproceedings { schoenberger2020ecrts,
author = {Sch\"onberger, Lea and Br\"uggen, Georg von der and Chen, Kuan-Hsun and Sliwa, Benjamin and Youssef, Hazem and Ramachandran, Aswin and Wietfeld, Christian and ten Hompel, Michael and Chen, Jian-Jia},
title = {Offloading Safety- and Mission-Critical Tasks via Unreliable Connections},
booktitle = {32nd Euromicro Conference on Real-Time Systems (ECRTS 2020)},
year = {2020},
address = {Virtual Conference},
month = {June},
url = {https://drops.dagstuhl.de/opus/volltexte/2020/12381/},
keywords = {lea, georg, kuan},
confidential = {n},
abstract = {For many cyber-physical systems, e.g., IoT systems and autonomous vehicles, offloading workload to auxiliary processing units has become crucial. However, since this approach highly depends on network connectivity and responsiveness, typically only non-critical tasks are offloaded, which have less strict timing requirements than critical tasks. In this work, we provide two protocols that allow offloading critical and non-critical tasks alike, while providing different service levels for non-critical tasks in the event of an unsuccessful offloading operation, depending on the respective system requirements. We analyze the worst-case timing behavior of the local cyber-physical system and, based on these analyses, we provide a sufficient schedulability test for each of the proposed protocols. In the course of comprehensive experiments, we show that our protocols have reasonable acceptance ratios under the provided schedulability tests. Moreover, we demonstrate that the system behavior under our proposed protocols strongly depends on the probability of unsuccessful offloading operations, the percentage of critical tasks in the system, and the amount of offloaded workload.},
}
|
| Jakob Richter, Junjie Shi, Jian-Jia Chen, Jörg Rahnenführer and Michel Lang. Model-based Optimization with Concept Drifts. In The Genetic and Evolutionary Computation Conference, accepted and to appear, 2020
@inproceedings { GECCO-Richter-etal2010,
author = {Richter, Jakob and Shi, Junjie and Chen, Jian-Jia and Rahnenf\"uhrer, J\"org and Lang, Michel},
title = {Model-based Optimization with Concept Drifts},
booktitle = {The Genetic and Evolutionary Computation Conference},
year = {2020},
note = {accepted and to appear},
confidential = {n},
} |
| Shuo-Han Chen, Ming-Chang Yang and Yuan-Hao Chang. The Best of Both Worlds: On Exploiting Bit-Alterable NAND Flash for Lifetime and Read Performance Optimization. In Proceedings of the 56th Annual Design Automation Conference (DAC) 2019
@inproceedings { ChenDAC19,
author = {Chen, Shuo-Han and Yang, Ming-Chang and Chang, Yuan-Hao},
title = {The Best of Both Worlds: On Exploiting Bit-Alterable NAND Flash for Lifetime and Read Performance Optimization},
booktitle = {Proceedings of the 56th Annual Design Automation Conference (DAC)},
year = {2019},
keywords = {nvm-oma, MOST},
confidential = {n},
abstract = {With the emergence of bit-alterable 3D NAND flash, programming and erasing a flash cell at bit-level granularity have become a reality. Bit-level operations can benefit the high-density, high-bit-error-rate 3D NAND flash via realizing the "bit-level rewrite operation," which can refresh error bits at bit-level granularity for reducing the error correction latency and improving the read performance with minimal lifetime expense. Different from existing refresh techniques, bit-level operations can lower the lifetime expense via removing error bits directly without page-based rewrites. However, since bit-level rewrites may induce a similar amount of latency as conventional page-based rewrites and thus lead to low rewrite throughput, the efficiency of bit-level rewrites should be carefully considered. Such an observation motivates us to propose a bit-level error removal (BER) scheme to derive the most efficient way of utilizing bit-level operations for both lifetime and read performance optimization. A series of experiments was conducted to demonstrate the capability of the BER scheme with encouraging results.},
}
|
| Yu Ting Ho, Chun-Feng Wu, Ming-Chang Yang, Tseng-Yi Chen and Yuan-Hao Chang. Replanting Your Forest: NVM-friendly Bagging Strategy for Random Forest. In Non-Volatile Memory Systems and Applications Symposium (NVMSA) 2019
@inproceedings { HoNVMSA19,
author = {Ho, Yu Ting and Wu, Chun-Feng and Yang, Ming-Chang and Chen, Tseng-Yi and Chang, Yuan-Hao},
title = {Replanting Your Forest: NVM-friendly Bagging Strategy for Random Forest},
booktitle = {Non-Volatile Memory Systems and Applications Symposium (NVMSA)},
year = {2019},
keywords = {nvm-oma, MOST},
confidential = {n},
abstract = {Random forest is effective and accurate in making predictions for classification and regression problems, which constitute the majority of machine learning applications or systems nowadays. However, as data are being generated explosively in this big data era, many machine learning algorithms, including the random forest algorithm, may face difficulty in maintaining and processing all the required data in the main memory. Instead, intensive data movements (i.e., data swappings) between the faster-but-smaller main memory and the slower-but-larger secondary storage may occur excessively and largely degrade the performance. To address this challenge, great hopes are placed on the emerging non-volatile memory (NVM) technologies to substitute the traditional random access memory (RAM) and to build a larger-than-ever main memory space, because of their higher cell density, lower power consumption, and comparable read performance as traditional RAM. Nevertheless, the limited write endurance of NVM cells and the read-write asymmetry of NVMs may still limit the feasibility of performing machine learning algorithms directly on NVMs. Such a dilemma inspires this study to develop an NVM-friendly bagging strategy for the random forest algorithm, in order to trade the “randomness” of the sampled data for reduced data movements in the memory hierarchy without hurting the prediction accuracy. The evaluation results show that the proposed design could save up to 72% of the write accesses on the representative traces with nearly no degradation of the prediction accuracy.},
}
|
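One plausible reading of "trading the randomness of the sampled data for reduced data movements" is a block-wise bootstrap: sampling contiguous row blocks instead of individual rows keeps each tree's training set confined to few memory pages. A minimal sketch under that assumption (the paper's actual strategy may differ; all names are hypothetical):

```python
import numpy as np

def block_bootstrap_indices(n_samples, block_size, rng):
    """Draw a bootstrap sample as contiguous blocks of rows.

    Sampling whole blocks instead of individual rows keeps each
    bootstrap confined to fewer memory pages, reducing data movement
    at the cost of some sampling randomness."""
    n_blocks = (n_samples + block_size - 1) // block_size
    drawn = rng.integers(0, n_blocks, size=n_blocks)  # blocks, with replacement
    idx = np.concatenate([
        np.arange(b * block_size, min((b + 1) * block_size, n_samples))
        for b in drawn
    ])
    return idx[:n_samples]

rng = np.random.default_rng(42)
idx = block_bootstrap_indices(n_samples=10_000, block_size=256, rng=rng)
# idx indexes ~10k rows drawn from at most 40 distinct 256-row blocks.
```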
| Chi-Hsing Chang and Che-Wei Chang. Adaptive Memory and Storage Fusion on Non-Volatile One-Memory System. In Non-Volatile Memory Systems and Applications Symposium (NVMSA) 2019 [BibTeX][Abstract]@inproceedings { ChangNVMSA19,
author = {Chang, Chi-Hsing and Chang, Che-Wei},
title = {Adaptive Memory and Storage Fusion on Non-Volatile One-Memory System},
booktitle = {Non-Volatile Memory Systems and Applications Symposium (NVMSA)},
year = {2019},
keywords = {nvm-oma, MOST},
confidential = {n},
abstract = {Non-volatile memory (NVM), such as phase change memory (PCM), is a promising candidate to replace DRAM because of its lower leakage power and higher density. Since PCM is non-volatile, it can also be used as storage to support in-place execution and reduce loading time. However, as conventional operating systems have different strategies to satisfy various constraints on memory and storage subsystems, using PCM as both memory and storage in a system requires thorough consideration of the system's inherent constraints, such as limited lifetime, retention time requirements, and possible overheads. Most existing work still divides NVM into separate memory and storage parts, but this strategy still incurs the overhead of loading data from storage to memory as in conventional systems. In our work, we rethink the data retention time requirements for PCM memory/storage and develop an adaptive memory-storage management strategy to dynamically reconfigure the One-Memory System, considering the current average write-cycle count and the number of retention-time-qualified frames for storage, to reduce the extra data movement between memory and storage with a limited lifetime sacrifice. Experimental results show that our adaptive design improves the performance by eliminating 86.1% of the extra writes caused by data movement, while only 3.4% of the system's lifetime is sacrificed.},
}
|
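As a rough illustration of reconfiguring a memory/storage split by wear, the sketch below lets only frames whose accumulated write cycles still guarantee the storage retention time hold storage data. It is a hypothetical simplification of the adaptive strategy, with invented thresholds:

```python
def partition_frames(write_cycles, storage_qualified_cycles, demand):
    """Split PCM frames between memory and storage roles: only frames
    whose accumulated write cycles are still low enough to guarantee
    the storage retention time qualify for storage data; the
    least-worn qualified frames are picked until the storage demand
    (in frames) is covered."""
    qualified = sorted((f for f, c in enumerate(write_cycles)
                        if c < storage_qualified_cycles),
                       key=lambda f: write_cycles[f])
    storage = set(qualified[:demand])
    memory = [f for f in range(len(write_cycles)) if f not in storage]
    return sorted(storage), memory

cycles = [10, 5_000_000, 40, 70, 9_000_000]   # hypothetical per-frame wear
print(partition_frames(cycles, storage_qualified_cycles=1_000_000, demand=2))
# -> ([0, 2], [1, 3, 4])
```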
| Mikail Yayla, Anas Toma, Jan Eric Lenssen, Victoria Shpacovitch, Kuan-Hsun Chen, Frank Weichert and Jian-Jia Chen. Resource-Efficient Nanoparticle Classification Using Frequency Domain Analysis. In BVM Workshop Lübeck, Germany, March 2019 [BibTeX][Abstract]@inproceedings { Yayla-BVM2019,
author = {Yayla, Mikail and Toma, Anas and Lenssen, Jan Eric and Shpacovitch, Victoria and Chen, Kuan-Hsun and Weichert, Frank and Chen, Jian-Jia},
title = {Resource-Efficient Nanoparticle Classification Using Frequency Domain Analysis},
booktitle = {BVM Workshop},
year = {2019},
address = {L\"ubeck, Germany},
month = {March},
keywords = {kuan},
confidential = {n},
abstract = {We present a method for resource-efficient classification of nanoparticles such as viruses in liquid or gas samples by analyzing Surface Plasmon Resonance (SPR) images using frequency domain features. The SPR images are obtained with the Plasmon Assisted Microscopy Of Nano-sized Objects (PAMONO) biosensor, which was developed as a mobile virus and particle detector. Convolutional neural network (CNN) solutions are available for the given task, but since the mobility of the sensor is an important factor, we provide a faster and less resource-demanding alternative approach for use in a small virus detection device. The execution time of our approach, which can be optimized further using low-power hardware such as a digital signal processor (DSP), is at least 2.6 times faster than the current CNN solution while sacrificing only 1 to 2.5 percentage points in accuracy.},
}
|
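A minimal sketch of frequency-domain feature extraction of the kind the abstract describes: radially binned FFT magnitudes of an image patch, fed to an off-the-shelf classifier. The bin count, patch size, and the choice of a scikit-learn random forest are assumptions, not the paper's exact pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def freq_features(patch, n_bins=16):
    """Radially binned magnitude spectrum of a 2D image patch."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(patch)))
    h, w = patch.shape
    yy, xx = np.mgrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)          # radius of each pixel
    bins = np.linspace(0, r.max() + 1e-9, n_bins + 1)
    return np.array([spec[(r >= lo) & (r < hi)].mean()
                     for lo, hi in zip(bins[:-1], bins[1:])])

# Hypothetical usage: patches is (N, 32, 32), labels is (N,)
# X = np.stack([freq_features(p) for p in patches])
# clf = RandomForestClassifier(n_estimators=100).fit(X, labels)
```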
| Kuan-Hsun Chen, Niklas Ueter, Georg von der Brüggen and Jian-Jia Chen. Efficient Computation of Deadline-Miss Probability and Parametric Remedies for Potential Pitfalls. In Design, Automation and Test in Europe (DATE) Florence, Italy, 25-29th, March 2019 [BibTeX][PDF][Link][Abstract]@inproceedings { khchenDATE2019,
author = {Chen, Kuan-Hsun and Ueter, Niklas and Br\"uggen, Georg von der and Chen, Jian-Jia},
title = {Efficient Computation of Deadline-Miss Probability and Parametric Remedies for Potential Pitfalls},
booktitle = {Design, Automation and Test in Europe (DATE)},
year = {2019},
address = {Florence, Italy},
month = {25-29th, March},
url = {https://ieeexplore.ieee.org/abstract/document/8714908},
keywords = {kuan},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/kuan2019date.pdf},
confidential = {n},
abstract = {In soft real-time systems, applications can tolerate rare deadline misses. Therefore, probabilistic arguments and analyses are applicable in the timing analyses for this class of systems, as demonstrated in much existing research. Convolution-based analyses allow deriving tight deadline-miss probabilities, but suffer from a high time complexity. Among the analytical approaches, which result in a significantly faster runtime than the convolution-based approaches, the Chernoff bounds provide the tightest results. In this paper, we show that calculating the deadline-miss probability using Chernoff bounds can be solved by considering an equivalent convex optimization problem. This allows us to, on the one hand, decrease the runtime of the Chernoff bounds while, on the other hand, ensuring a tighter approximation, since a larger variable space can be searched more efficiently, i.e., by using binary search techniques over a larger area instead of a sequential search over a smaller area. We evaluate this approach considering synthesized task sets. Our approach is shown to be computationally efficient for large task systems, while experimentally suggesting reasonable approximation quality compared to an exact analysis.},
}
|
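The core technique the abstract builds on, minimizing the Chernoff bound over its free parameter s, can be sketched directly: f(s) = sum_i log E[exp(s C_i)] - s t is convex in s, so a one-dimensional search finds the minimizer. The sketch below uses a ternary search over a bounded interval rather than the paper's binary-search formulation, and the task-set interface is an assumption:

```python
import math

def log_mgf(dist, s):
    """log E[exp(s*C)] for a discrete execution-time distribution
    given as [(value, probability), ...]."""
    return math.log(sum(p * math.exp(s * c) for c, p in dist))

def chernoff_miss_bound(dists, demand_bound, s_max=10.0, iters=100):
    """Upper bound on P(total demand >= demand_bound).

    f(s) is convex in s, so a ternary search over (0, s_max]
    converges to its minimum."""
    def f(s):
        return sum(log_mgf(d, s) for d in dists) - s * demand_bound
    lo, hi = 0.0, s_max
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return math.exp(f((lo + hi) / 2))

# Two jobs, each 1 unit (p=0.9) or 3 units (p=0.1, e.g. fault-induced
# re-execution); bound the probability that total demand reaches 6.
dist = [(1, 0.9), (3, 0.1)]
print(chernoff_miss_bound([dist, dist], demand_bound=6.0))  # ~0.01
```

Here the bound approaches the exact probability 0.01 (both jobs taking 3 units), which illustrates why searching a larger s-range tightens the result.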
| Helena Kotthaus, Lea Schönberger, Andreas Lang, Jian-Jia Chen and Peter Marwedel. Can Flexible Multi-Core Scheduling Help to Execute Machine Learning Algorithms Resource-Efficiently?. In 22nd International Workshop on Software and Compilers for Embedded Systems, pages 59--62 2019 [BibTeX][Link]@inproceedings { kotthaus/2019b,
author = {Kotthaus, Helena and Sch\"onberger, Lea and Lang, Andreas and Chen, Jian-Jia and Marwedel, Peter},
title = {Can Flexible Multi-Core Scheduling Help to Execute Machine Learning Algorithms Resource-Efficiently?},
booktitle = {22nd International Workshop on Software and Compilers for Embedded Systems},
year = {2019},
series = {SCOPES '19},
pages = {59--62},
publisher = {ACM},
url = {https://dl.acm.org/citation.cfm?id=3323986},
keywords = {Lea},
confidential = {n},
} |
| Helena Kotthaus and Jan Vitek. Typical Mistakes in Data Science: Should you Trust my Model? . In Abstract Booklet of the International R User Conference (UseR!) Toulouse, France, July 2019 [BibTeX][Link]@inproceedings { kotthaus/2019c,
author = {Kotthaus, Helena and Vitek, Jan},
title = {Typical Mistakes in Data Science: Should you Trust my Model? },
booktitle = {Abstract Booklet of the International R User Conference (UseR!)},
year = {2019},
address = {Toulouse, France},
month = {July},
url = {http://www.user2019.fr/posters/},
confidential = {n},
} |
| Anas Toma, Juri Wenner, Jan Eric Lenssen and Jian-Jia Chen. Adaptive Quality Optimization of Computer Vision Tasks in Resource-Constrained Devices using Edge Computing. In the 19th Annual IEEE/ACM International Symposium in Cluster, Cloud, and Grid Computing (CCGrid 2019) Larnaca, Cyprus, May 2019 [BibTeX][Abstract]@inproceedings { Toma-CCGrid2019,
author = {Toma, Anas and Wenner, Juri and Lenssen, Jan Eric and Chen, Jian-Jia},
title = {Adaptive Quality Optimization of Computer Vision Tasks in Resource-Constrained Devices using Edge Computing},
booktitle = {the 19th Annual IEEE/ACM International Symposium in Cluster, Cloud, and Grid Computing (CCGrid 2019)},
year = {2019},
address = {Larnaca, Cyprus},
month = {May},
confidential = {n},
abstract = {This paper presents an approach to optimize the quality of computer vision tasks in resource-constrained devices by using different execution versions of the same task. The execution versions are generated by dropping irrelevant contents of the input images or other contents that have a marginal effect on the quality of the result. Our execution model is designed to support the edge computing paradigm, where tasks can be executed remotely on edge nodes either to improve the quality or to reduce the workload of the local device. We also propose an algorithm that selects the suitable execution version for each task, i.e., the configuration and the location of the execution, and maximizes the total quality of the tasks based on the available resources. The proposed approach provides reliable and adaptive task execution by using several execution versions with various performance and quality trade-offs. Therefore, it is very beneficial for systems with resource and timing constraints such as portable medical devices, surveillance video cameras, wearable systems, etc. The proposed algorithm is evaluated using different computer vision benchmarks.},
}
|
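Version selection of this kind is naturally a multiple-choice knapsack: pick one (cost, quality) version per task so total quality is maximized under a resource budget. The abstract does not state the paper's algorithm, so the dynamic program below is purely illustrative, with an integer budget as a simplifying assumption:

```python
def max_total_quality(tasks, budget):
    """Choose one execution version per task to maximize summed quality
    under an integer resource budget (multiple-choice knapsack DP).

    tasks: list of tasks; each task is a list of (cost, quality)
    versions, e.g. local full-quality vs. degraded vs. edge-offloaded."""
    NEG = float("-inf")
    best = [0.0] + [NEG] * budget        # best[b]: max quality at cost b
    for versions in tasks:
        nxt = [NEG] * (budget + 1)
        for b, q in enumerate(best):
            if q == NEG:
                continue
            for cost, quality in versions:
                if b + cost <= budget:
                    nxt[b + cost] = max(nxt[b + cost], q + quality)
        best = nxt
    return max(best)                     # -inf if some task never fits

# Two tasks, each with a cheap low-quality and a costly high-quality version:
tasks = [[(2, 0.6), (5, 0.9)], [(3, 0.5), (6, 0.95)]]
print(max_total_quality(tasks, budget=8))  # -> 1.55 (0.6 + 0.95)
```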
| Lea Schönberger, Georg von der Brüggen, Horst Schirmeier and Jian-Jia Chen. Design Optimization for Hardware-Based Message Filters in Broadcast Buses. In Design, Automation and Test in Europe (DATE) Florence, Italy, March 25-29 2019 [BibTeX][Link][Abstract]@inproceedings { schoenbergerDATE2019,
author = {Sch\"onberger, Lea and Br\"uggen, Georg von der and Schirmeier, Horst and Chen, Jian-Jia},
title = {Design Optimization for Hardware-Based Message Filters in Broadcast Buses},
booktitle = {Design, Automation and Test in Europe (DATE)},
year = {2019},
address = {Florence, Italy},
month = {March 25-29},
url = {https://ieeexplore.ieee.org/abstract/document/8714793},
keywords = {lea, georg},
confidential = {n},
abstract = {In the field of automotive engineering, broadcast buses, e.g., Controller Area Network (CAN), are frequently used to connect multiple electronic control units (ECUs). Each message transmitted on such buses can be received by every single participant, but not all messages are relevant for every ECU. Therefore, all incoming messages must be filtered in terms of relevance by either hardware or software techniques. We address the issue of designing hardware filter configurations for clients connected to a broadcast bus in order to reduce the cost, i.e., the computation overhead, provoked by undesired but accepted messages. More precisely, we propose an SMT formulation that can be applied to i) retrieve a (minimal) perfect filter configuration, i.e., one that accepts no undesired messages, ii) optimize the filter quality under given hardware restrictions, or iii) minimize the hardware cost for a given type of filter component and a maximum cost threshold.},
}
|
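For background on what such a hardware filter computes: a typical CAN acceptance filter accepts an ID exactly when (id ^ filter) & mask == 0. The sketch below derives the tightest single (filter, mask) pair for a wanted ID set and counts the false positives it still admits; it illustrates only the matching semantics and a baseline construction, not the paper's SMT formulation:

```python
def tightest_filter(wanted, id_bits=11):
    """Derive the single (filter, mask) pair accepting all wanted IDs
    that is as restrictive as possible: mask bits are set exactly
    where every wanted ID agrees. Acceptance: (id ^ flt) & mask == 0."""
    full = (1 << id_bits) - 1
    agree, base = full, wanted[0]
    for i in wanted[1:]:
        agree &= ~(i ^ base) & full       # clear bits where IDs differ
    return base & agree, agree            # (filter value, mask)

wanted = [0x100, 0x101, 0x102, 0x103]
flt, mask = tightest_filter(wanted)
false_pos = [i for i in range(0x800) if i not in wanted
             and (i ^ flt) & mask == 0]
print(hex(flt), hex(mask), len(false_pos))   # 0x100 0x7fc 0
```

With fewer filter registers than message groups, some masks must cover several groups and false positives become unavoidable, which is exactly the cost the SMT formulation minimizes.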
| Christian Hakert, Mikail Yayla, Kuan-Hsun Chen, Georg von der Brüggen, Jian-Jia Chen, Sebastian Buschjäger, Katharina Morik, Paul R. Genssler, Lars Bauer, Hussam Amrouch and Jörg Henkel. Stack Usage Analysis for Efficient Wear Leveling in Non-Volatile Main Memory Systems. In 1st ACM/IEEE Workshop on Machine Learning for CAD (MLCAD) Alberta, Canada, 2019 [BibTeX][PDF][Abstract]@inproceedings { mlcad2019stackanalysis,
author = {Hakert, Christian and Yayla, Mikail and Chen, Kuan-Hsun and Br\"uggen, Georg von der and Chen, Jian-Jia and Buschj\"ager, Sebastian and Morik, Katharina and Genssler, Paul R. and Bauer, Lars and Amrouch, Hussam and Henkel, J\"org},
title = {Stack Usage Analysis for Efficient Wear Leveling in Non-Volatile Main Memory Systems},
booktitle = {1st ACM/IEEE Workshop on Machine Learning for CAD (MLCAD)},
year = {2019},
address = {Alberta, Canada},
keywords = {kuan, nvm-oma, georg},
file = {https://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/mlcad_2019.pdf},
confidential = {n},
abstract = {Emerging non-volatile memory (NVM) technologies, such as Phase Change Memory (PCM), have been considered as a replacement for DRAM and storage due to their low power consumption, fast access speed, and low unit cost. Even so, some NVMs have a significantly lower write endurance, and hence in-memory wear leveling is an important requirement for practical applicability. Since writes to the stack often target a small and dense memory region, generic, coarse-grained wear-leveling mechanisms (e.g., virtual memory page remapping) are not sufficient. An alternative solution is to relocate the stack memory regularly, which involves copying the stack content. As the stack content changes in size during the execution of an application, the copy overhead can be significantly mitigated by performing the relocation when the stack size is small. In this paper, we investigate two approaches to determine points in time when the stack is small. First, we analyze the possibility of fitting simple machine-learning models to the stack usage function. Precise predictions of this function enable the identification of the minimum stack size during execution. In our evaluation, the tested models provide accurate estimates of the future stack usage function for a subset of common applications. As a second approach, we analyze applications a priori and determine potentially optimal points in the instruction stream at which to perform relocation. In detail, we deploy the application in an analysis environment, which determines a rating for each executed instruction. Based on this rating, we apply a genetic algorithm to identify the best points in the instruction stream to perform the stack relocation. This approach saves up to 85% of the write overhead for wear leveling in our experiments.},
}
|
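Why relocating at stack minima helps can be shown with a toy cost model: the volume copied is the live stack size at each relocation point. The sketch below triggers one relocation per period at the first sufficiently small sample; it is a hypothetical simplification, not the paper's ML- or GA-based point selection:

```python
def relocation_cost(stack_trace, period, threshold):
    """Estimate bytes copied when the stack is relocated once per
    period, at the first sample whose live stack size drops below
    threshold (falling back to the period's minimum otherwise).

    stack_trace: sampled live stack sizes in bytes, one per tick."""
    copied = 0
    for start in range(0, len(stack_trace), period):
        window = stack_trace[start:start + period]
        copied += next((s for s in window if s <= threshold), min(window))
    return copied

trace = [900, 400, 120, 700, 950, 80, 60, 500]   # hypothetical samples
print(relocation_cost(trace, period=4, threshold=128))  # 120 + 80 = 200
```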
| Junjie Shi, Niklas Ueter, Georg von der Brüggen and Jian-Jia Chen. Multiprocessor Synchronization of Periodic Real-Time Tasks Using Dependency Graphs. In 25th IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS, pages 279--292 2019 [BibTeX][PDF][Link][Abstract]@inproceedings { DBLP:conf/rtas/ShiUBC19,
author = {Shi, Junjie and Ueter, Niklas and Br\"uggen, Georg von der and Chen, Jian-Jia},
title = {Multiprocessor Synchronization of Periodic Real-Time Tasks Using Dependency Graphs},
booktitle = {25th IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS},
year = {2019},
pages = {279--292},
url = {https://doi.org/10.1109/RTAS.2019.00031},
keywords = {georg, junjie},
file = {https://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2019-rtas-junjie.pdf},
confidential = {n},
abstract = {When considering recurrent real-time tasks in multiprocessor systems, access to shared resources, via so-called critical sections, can jeopardize the schedulability of the system. The reason is that resource access is mutually exclusive and a task must finish its execution of the critical section before another task can access the same resource. Therefore, the problem of multiprocessor synchronization has been extensively studied since the 1990s, and a large number of multiprocessor resource sharing protocols have been developed and analyzed. Most protocols assume work-conserving scheduling algorithms, which make it impossible to schedule task sets where a critical section of one task is longer than the relative deadline of another task that accesses the same resource. The only known exception to the work-conserving paradigm is the recently presented Dependency Graph Approach, where the order in which tasks access a shared resource is not determined online but based on a pre-computed dependency graph. Since the initial work only considers frame-based task systems, this paper extends the Dependency Graph Approach to periodic task systems. We point out the connection to the uniprocessor non-preemptive scheduling problem and exploit the related algorithms to construct dependency graphs for each resource. To schedule the derived dependency graphs, list scheduling is combined with an earliest-deadline-first heuristic. We evaluated the performance considering synthesized task sets under different configurations, where a significant improvement of the acceptance ratio compared to other resource sharing protocols is observed. Furthermore, to show the applicability in real-world systems, we detail the implementation in LITMUS RT and report the resulting scheduling overheads.},
}
|
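A minimal, non-preemptive sketch of the scheduling step described above: list scheduling over a dependency graph, breaking ties among ready jobs by earliest deadline. The data layout ({job: (wcet, deadline)} plus a successor map) is an assumption for illustration, not the paper's implementation:

```python
import heapq

def edf_list_schedule(jobs, succs, m):
    """List-schedule a dependency graph on m processors, picking among
    ready jobs by earliest deadline (EDF heuristic).

    jobs: {job: (wcet, deadline)}; succs: {job: [successor, ...]}.
    Returns {job: (start, finish, processor)}."""
    indeg = {j: 0 for j in jobs}
    for ss in succs.values():
        for s in ss:
            indeg[s] += 1
    ready = [(jobs[j][1], j) for j in jobs if indeg[j] == 0]
    heapq.heapify(ready)
    procs = [(0.0, p) for p in range(m)]          # (free-at time, id)
    heapq.heapify(procs)
    finish, schedule = {}, {}
    while ready:
        _, j = heapq.heappop(ready)               # earliest deadline first
        free_at, p = heapq.heappop(procs)
        preds_done = max((finish[x] for x, ss in succs.items() if j in ss),
                         default=0.0)
        start = max(free_at, preds_done)
        finish[j] = start + jobs[j][0]
        schedule[j] = (start, finish[j], p)
        heapq.heappush(procs, (finish[j], p))
        for s in succs.get(j, []):                # release successors
            indeg[s] -= 1
            if indeg[s] == 0:
                heapq.heappush(ready, (jobs[s][1], s))
    return schedule

# Three jobs: a -> c and b -> c, on two processors.
jobs = {"a": (2.0, 5.0), "b": (3.0, 6.0), "c": (1.0, 8.0)}
print(edf_list_schedule(jobs, {"a": ["c"], "b": ["c"]}, m=2))
```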
| Junjie Shi, Niklas Ueter, Georg von der Brüggen and Jian-Jia Chen. Partitioned Scheduling for Dependency Graphs in Multiprocessor Real-Time Systems. In 25th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, RTCSA, pages 1--12 Hangzhou, China, August 18-21 2019 [BibTeX][PDF][Link][Abstract]@inproceedings { DBLP:conf/rtcsa/ShiUBC19,
author = {Shi, Junjie and Ueter, Niklas and Br\"uggen, Georg von der and Chen, Jian-Jia},
title = {Partitioned Scheduling for Dependency Graphs in Multiprocessor Real-Time Systems},
booktitle = {25th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, RTCSA},
year = {2019},
pages = {1--12},
address = {Hangzhou, China},
month = {August 18-21},
url = {https://ieeexplore.ieee.org/document/8864591},
keywords = {georg, junjie},
file = {https://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/shi-partitioned.pdf},
confidential = {n},
abstract = {Effectively handling precedence constraints and resource synchronization is a challenging problem in the era of multiprocessor systems, even with massively parallel computation power. One common approach is to apply list scheduling to a given task graph with precedence constraints. However, in some application scenarios, such as the OpenMP task model and multiprocessor partitioned scheduling for resource synchronization using binary semaphores, several operations can be forced to be tied to the same processor, which invalidates list scheduling. This paper studies a special case of this challenging scheduling problem, where a task comprised of (at most) three subtasks is executed sequentially on the same processor and the second subtasks of the tasks may have sequential dependencies, e.g., due to synchronization. We demonstrate the limits of existing algorithms and provide effective heuristics considering preemptive execution. The evaluation results show a significant improvement compared to the existing multiprocessor partitioned scheduling strategies.},
}
|
| Jian-Jia Chen, Tobias Hahn, Ruben Hoeksma, Nicole Megow and Georg von der Brüggen. Scheduling Self-Suspending Tasks: New and Old Results. In 31st Euromicro Conference on Real-Time Systems, ECRTS, pages 16:1--16:23 Stuttgart, Germany, July 9-12 2019 [BibTeX][PDF][Link][Abstract]@inproceedings { DBLP:conf/ecrts/ChenHHMB19,
author = {Chen, Jian-Jia and Hahn, Tobias and Hoeksma, Ruben and Megow, Nicole and Br\"uggen, Georg von der},
title = {Scheduling Self-Suspending Tasks: New and Old Results},
booktitle = {31st Euromicro Conference on Real-Time Systems, ECRTS},
year = {2019},
pages = {16:1--16:23},
address = {Stuttgart, Germany},
month = {July 9-12},
url = {https://doi.org/10.4230/LIPIcs.ECRTS.2019.16},
keywords = {georg},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2019-ecrts-jj.pdf},
confidential = {n},
abstract = {In computing systems, a job may suspend itself (before it finishes its execution) when it has to wait for certain results from other (usually external) activities. For real-time systems, such self-suspension behavior has been shown to induce performance degradation. Hence, researchers in the real-time systems community have devoted themselves to the design and analysis of scheduling algorithms that can alleviate the performance penalty due to self-suspension behavior. As self-suspension and delegation of parts of a job to non-bottleneck resources is quite natural in many applications, researchers in the operations research (OR) community have also explored scheduling algorithms for systems with such suspension behavior, called the master-slave problem in the OR community. This paper first reviews the results for the master-slave problem in the OR literature and explains their impact on several long-standing problems for scheduling self-suspending real-time tasks. For frame-based periodic real-time tasks, in which the periods of all tasks are identical and all jobs related to one frame are released synchronously, we explore different approximation metrics with respect to resource augmentation factors under different scenarios for both uniprocessor and multiprocessor systems, and demonstrate that different approximation metrics can create different levels of difficulty for the approximation. Our experimental results show that such more carefully designed schedules can significantly outperform the state-of-the-art.},
}
|
| Nils Hölscher, Kuan-Hsun Chen, Georg von der Brüggen and Jian-Jia Chen. Examining and Supporting Multi-Tasking in EV3OSEK. In 14th annual workshop on Operating Systems Platforms for Embedded Real-Time applications (OSPERT 2018), Barcelona, Spain Barcelona, Spain, July 2018 [BibTeX][PDF][Abstract]@inproceedings { Nils-OSPERT,
author = {H\"olscher, Nils and Chen, Kuan-Hsun and Br\"uggen, Georg von der and Chen, Jian-Jia},
title = {Examining and Supporting Multi-Tasking in EV3OSEK},
booktitle = {14th annual workshop on Operating Systems Platforms for Embedded Real-Time applications (OSPERT 2018), Barcelona, Spain},
year = {2018},
address = {Barcelona, Spain},
month = {July},
keywords = {kuan, Georg},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2018-nils.pdf},
confidential = {n},
abstract = {Lego Mindstorms robots are a popular platform for graduate-level research and college education purposes. As a port of nxtOSEK, an OSEK-standard-compatible real-time operating system, EV3OSEK inherits the advantages of nxtOSEK for experiments on EV3, the latest generation of Mindstorms robots. Unfortunately, the current version of EV3OSEK still has some serious errors. In this work we address task preemption, a common feature desired in every RTOS. We reveal the errors in the current version and propose corresponding solutions for EV3OSEK that properly fix the errors in the IRQ-Handler and the task dispatching, thus enabling real multi-tasking on EV3OSEK. Our verification shows that the previous design flaws are solved. Along with this work, we suggest that researchers who performed experiments on nxtOSEK carefully examine whether the flaws presented in this paper affect their results.},
}
|
| Zheng Dong, Cong Liu, Soroush Bateni, Kuan-Hsun Chen, Jian-Jia Chen, Georg von der Brüggen and Junjie Shi. Shared-Resource-Centric Limited Preemptive Scheduling: A Comprehensive Study of Suspension-based Partitioning Approaches. In IEEE Real-Time and Embedded Technology and Applications Symposium RTAS Porto, Portugal, April 11-13 2018 [BibTeX][PDF][Link][Abstract]@inproceedings { dong2018rtas,
author = {Dong, Zheng and Liu, Cong and Bateni, Soroush and Chen, Kuan-Hsun and Chen, Jian-Jia and Br\"uggen, Georg von der and Shi, Junjie},
title = {Shared-Resource-Centric Limited Preemptive Scheduling: A Comprehensive Study of Suspension-based Partitioning Approaches},
booktitle = {IEEE Real-Time and Embedded Technology and Applications Symposium RTAS},
year = {2018},
address = {Porto, Portugal},
month = {April 11-13},
url = {https://ieeexplore.ieee.org/document/8430080},
keywords = {kuan, georg, junjie},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2018_rtas_zheng.pdf},
confidential = {n},
abstract = {This paper studies the problem of scheduling a set of hard real-time sporadic tasks that may access CPU cores and a shared resource. Motivated by the observation that the CPU resource is often abundant compared to the shared resources in multi-core and many-core systems, we propose to resolve this problem from a counter-intuitive shared-resource-centric perspective, focusing on judiciously prioritizing and scheduling tasks’ requests in a limited preemptive manner on the shared resource while viewing the worst-case latency a task may experience on the CPU cores as suspension delays. We develop a rather comprehensive set of task partitioning algorithms that partition tasks onto the shared resource with the objective of guaranteeing schedulability while minimizing the required size of the shared resource, which plays a critical role in reducing the overall cost and complexity of building resource-constrained embedded systems in many application domains. A GPU-based prototype case study and extensive simulation-based experiments have been conducted, which validate both our shared-resource-centric scheduling philosophy and the efficiency of our suspension-based partitioning solutions in practice.},
}
|
| Jian-Jia Chen, Georg von der Brüggen and Niklas Ueter. Push Forward: Global Fixed-Priority Scheduling of Arbitrary-Deadline Sporadic Task Systems. In 30th Euromicro Conference on Real-Time Systems (ECRTS) 2018 , pages 8:1--8:24 Barcelona, Spain, July 3-6, 2018 2018 [BibTeX][PDF][Link][Abstract]@inproceedings { Chen2018ECRTS,
author = {Chen, Jian-Jia and Br\"uggen, Georg von der and Ueter, Niklas},
title = {Push Forward: Global Fixed-Priority Scheduling of Arbitrary-Deadline Sporadic Task Systems},
booktitle = {30th Euromicro Conference on Real-Time Systems (ECRTS) 2018 },
year = {2018},
pages = {8:1--8:24},
address = {Barcelona, Spain},
month = {July 3-6, 2018},
url = {http://drops.dagstuhl.de/opus/volltexte/2018/8996/pdf/LIPIcs-ECRTS-2018-8.pdf},
keywords = {georg},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2018-ecrts-jj.pdf},
confidential = {n},
abstract = {The sporadic task model is often used to analyze recurrent execution of tasks in real-time systems. A sporadic task defines an infinite sequence of task instances, also called jobs, that arrive under the minimum inter-arrival time constraint. To ensure the system safety, timeliness has to be guaranteed in addition to functional correctness, i.e., all jobs of all tasks have to be finished before the job deadlines. We focus on analyzing arbitrary-deadline task sets on a homogeneous (identical) multiprocessor system under any given global fixed-priority scheduling approach and provide a series of schedulability tests with different tradeoffs between their time complexity and their accuracy. Under the arbitrary-deadline setting, the relative deadline of a task can be longer than the minimum inter-arrival time of the jobs of the task. We show that global deadline-monotonic (DM) scheduling has a speedup bound of 3 − 1/M against any optimal scheduling algorithms, where M is the number of identical processors, and prove that this bound is asymptotically tight.},
}
|
| Kuan-Hsun Chen, Georg von der Brüggen and Jian-Jia Chen. Analysis of Deadline Miss Rates for Uniprocessor Fixed-Priority Scheduling. In The 24th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA) Hakodate, Japan, August 2018, Best Student Paper Award , Outstanding Paper Award [BibTeX][PDF][Link][Abstract]@inproceedings { khchenRTCSA18,
author = {Chen, Kuan-Hsun and Br\"uggen, Georg von der and Chen, Jian-Jia},
title = {Analysis of Deadline Miss Rates for Uniprocessor Fixed-Priority Scheduling},
booktitle = {The 24th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA)},
year = {2018},
address = {Hakodate, Japan},
month = {August},
note = { Best Student Paper Award , Outstanding Paper Award },
url = {https://ieeexplore.ieee.org/document/8607246},
keywords = {Georg, kuan},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2018-kuanrtcsa.pdf},
confidential = {n},
abstract = {Timeliness is an important feature for many embedded systems. Although soft real-time embedded systems can tolerate and allow certain deadline misses, it is still important to quantify them to justify whether the considered systems are acceptable. In this paper, we provide a way to safely over-approximate the expected deadline miss rate for a specific sporadic real-time task under fixed-priority preemptive scheduling in uniprocessor systems. Our approach is compatible with the existing results in the literature that calculate the probability of deadline misses, either based on convolution-based approaches or analytically. We demonstrate our approach by considering randomly generated task sets with an execution behavior that simulates jobs subjected to soft errors incurred by hardware transient faults under a given fault rate. To empirically gather the deadline miss rates, we implemented an event-based simulator with a fault-injection module and released the scripts. With extensive simulations under different fault rates, we evaluate the efficiency and the pessimism of our approach. The evaluation results show that our approach is effective in deriving an upper bound on the expected deadline miss rate and efficient with respect to the required computation time.},
}
|
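The convolution-based building block this line of work relies on can be sketched compactly: convolving the discrete execution-time distributions of the jobs that must complete by the deadline yields the demand distribution, whose tail is the deadline-miss probability. The index-as-time encoding below is an assumption; the paper's own contribution, bounding the miss *rate* from such probabilities, is not reproduced here:

```python
import numpy as np

def miss_probability(dists, deadline):
    """P(total demand > deadline) by convolving discrete execution-time
    distributions. dists: list of 1-D arrays where dist[c] is the
    probability of needing c time units (array index = time units)."""
    total = np.array([1.0])
    for d in dists:
        total = np.convolve(total, d)
    return total[deadline + 1:].sum()

# One job type: 2 units (p=0.9) or 4 units after a fault (p=0.1),
# released three times before the deadline.
job = np.array([0.0, 0.0, 0.9, 0.0, 0.1])
print(miss_probability([job] * 3, deadline=8))  # P(10)+P(12) = 0.028
```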
| Mikail Yayla, Kuan-Hsun Chen and Jian-Jia Chen. Fault Tolerance on Control Applications: Empirical Investigations of Impacts from Incorrect Calculations. In 4th Workshop on Emerging Ideas and Trends in Engineering of Cyber-Physical Systems (EITEC) 2018 [BibTeX][Link][Abstract]@inproceedings { Yayla2018EITEC,
author = {Yayla, Mikail and Chen, Kuan-Hsun and Chen, Jian-Jia},
title = {Fault Tolerance on Control Applications: Empirical Investigations of Impacts from Incorrect Calculations},
booktitle = {4th Workshop on Emerging Ideas and Trends in Engineering of Cyber-Physical Systems (EITEC)},
year = {2018},
url = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2018-yayla-eitec.pdf},
keywords = {kuan},
confidential = {n},
abstract = {Due to aggressive technology downscaling, mobile and embedded systems are susceptible to transient faults in the underlying hardware. Transient faults may incur soft errors or even lead to system failure. A recent study has proposed to exploit the concept of the (m,k)-firm real-time task model with compensation techniques to manage redundant executions, aiming to selectively protect the control application. In this work we provide an empirical approach to find the (m,k) robustness requirements. With the derived (m,k) robustness requirements on path tracing and balance control tasks, we conduct comprehensive case studies to evaluate the effectiveness of the compensation techniques under different fault locations and fault rates.},
}
|
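The (m,k)-firm property that such robustness requirements refer to is a sliding-window condition: every k consecutive job outcomes must contain at least m correct ones. A minimal checker (the handling of runs shorter than k is a hedged choice, not taken from the paper):

```python
def satisfies_mk(outcomes, m, k):
    """Check an (m,k)-firm condition: in every window of k consecutive
    job outcomes, at least m are correct (truthy)."""
    if len(outcomes) < k:
        return sum(outcomes) >= m          # hedged choice for short runs
    return all(sum(outcomes[i:i + k]) >= m
               for i in range(len(outcomes) - k + 1))

# Jobs correct except two faults; require 3 correct in any 5 consecutive.
print(satisfies_mk([1, 1, 0, 1, 0, 1, 1, 1], m=3, k=5))  # True
```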
| Helena Kotthaus, Andreas Lang and Peter Marwedel. Optimizing Parallel R Programs via Dynamic Scheduling Strategies. In Abstract Booklet of the International R User Conference (UseR!) Brisbane, Australia, July 2018 [BibTeX][Link]@inproceedings { kotthaus/2018a,
author = {Kotthaus, Helena and Lang, Andreas and Marwedel, Peter},
title = {Optimizing Parallel R Programs via Dynamic Scheduling Strategies},
booktitle = {Abstract Booklet of the International R User Conference (UseR!)},
year = {2018},
address = {Brisbane, Australia},
month = {July},
url = {https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf},
confidential = {n},
} |
| Sebastian Buschjäger, Kuan-Hsun Chen, Jian-Jia Chen and Katharina Morik. Realization of Random Forest for Real-Time Evaluation through Tree Framing. In The IEEE International Conference on Data Mining (ICDM) Singapore, November 2018 [BibTeX][PDF]@inproceedings { Buschjaeger2018,
author = {Buschj\"ager, Sebastian and Chen, Kuan-Hsun and Chen, Jian-Jia and Morik, Katharina},
title = {Realization of Random Forest for Real-Time Evaluation through Tree Framing},
booktitle = {The IEEE International Conference on Data Mining (ICDM)},
year = {2018},
address = {Singapore},
month = {November},
keywords = {kuan},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/buschjaeger2018.pdf},
confidential = {n},
} |
| Jian-Jia Chen, Nikhil Bansal, Samarjit Chakraborty and Georg von der Brüggen. Packing Sporadic Real-Time Tasks on Identical Multiprocessor Systems. In International Symposium on Algorithms and Computation (ISAAC) Jiaoxi, Yilan County, Taiwan, December 16-19 2018 [BibTeX][PDF][Link][Abstract]@inproceedings { Chen-ISAAC-2018,
author = {Chen, Jian-Jia and Bansal, Nikhil and Chakraborty, Samarjit and Br\"uggen, Georg von der},
title = {Packing Sporadic Real-Time Tasks on Identical Multiprocessor Systems},
booktitle = {International Symposium on Algorithms and Computation (ISAAC)},
year = {2018},
address = {Jiaoxi, Yilan County, Taiwan},
month = {December 16-19},
url = {https://doi.org/10.1109/PRDC.2018.00010},
keywords = {georg},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2018-isaac-jj.pdf},
confidential = {n},
abstract = {In real-time systems, in addition to functional correctness, recurrent tasks must fulfill timing constraints to ensure the correct behavior of the system. Partitioned scheduling is widely used in real-time systems, i.e., the tasks are statically assigned onto processors while ensuring that all timing constraints are met. The decision version of the problem, which is to check whether the deadline constraints of tasks can be satisfied on a given number of identical processors, is known to be NP-complete in the strong sense. Several studies on this problem are based on approximations involving resource augmentation, i.e., speeding up individual processors. This paper studies another type of resource augmentation by allocating additional processors, a topic that has not been explored until recently. We provide polynomial-time algorithms and analyses in which the approximation factors are dependent upon the input instances. Specifically, the factors are related to the maximum ratio of the period to the relative deadline of a task in the given task set. We also show that these algorithms unfortunately cannot achieve a constant approximation factor for general cases. Furthermore, we prove that the problem does not admit any asymptotic polynomial-time approximation scheme (APTAS) unless P = NP when the task set has constrained deadlines, i.e., the relative deadline of a task is no more than the period of the task.},
}
|
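For context on the partitioned setting the abstract studies, a common baseline heuristic is first-fit decreasing by utilization with the uniprocessor EDF test (total utilization at most 1 per core) for implicit-deadline tasks. The paper's algorithms for general period/deadline ratios and for processor augmentation are not reproduced here; this sketch only illustrates the bin-packing nature of the problem:

```python
def first_fit_partition(utilizations, n_procs):
    """Assign implicit-deadline tasks to processors by first-fit
    decreasing, using the EDF uniprocessor test (utilization <= 1 per
    core). Returns {task: processor} or None if some task does not fit."""
    load = [0.0] * n_procs
    mapping = {}
    for t, u in sorted(enumerate(utilizations), key=lambda x: -x[1]):
        for p in range(n_procs):
            if load[p] + u <= 1.0:
                load[p] += u
                mapping[t] = p
                break
        else:
            return None            # would need an additional processor
    return mapping

print(first_fit_partition([0.6, 0.5, 0.4, 0.3, 0.2], n_procs=2))
# -> {0: 0, 1: 1, 2: 0, 3: 1, 4: 1}
```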
| Anas Toma, Vincent Meyers and Jian-Jia Chen. Implementation and Evaluation of Multi-Mode Real-Time Tasks under Different Scheduling Algorithms. In the 14th annual workshop on Operating Systems Platforms for Embedded Real-Time applications (OSPERT 2018), Barcelona, Spain Barcelona, Spain, July 2018 [BibTeX][PDF][Abstract]@inproceedings { Toma-OSPERT2018,
author = {Toma, Anas and Meyers, Vincent and Chen, Jian-Jia},
title = {Implementation and Evaluation of Multi-Mode Real-Time Tasks under Different Scheduling Algorithms},
booktitle = {the 14th annual workshop on Operating Systems Platforms for Embedded Real-Time applications (OSPERT 2018), Barcelona, Spain},
year = {2018},
address = {Barcelona, Spain},
month = {July},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2018-toma-ospert.pdf},
confidential = {n},
abstract = {Tasks in the multi-mode real-time model have different execution modes according to an external input. Every mode represents a level of functionality where the tasks have different parameters. Such a model exists in automobiles, where some of the tasks that control the engine must always adapt to its rotation speed. Many studies have evaluated the feasibility of such a model under different scheduling algorithms, however only through simulation. This paper provides an empirical evaluation of the schedulability of multi-mode real-time tasks under fixed- and dynamic-priority scheduling algorithms. Furthermore, an evaluation of the overhead of the scheduling algorithms is provided. The implementation and the evaluation were carried out in a real environment using Raspberry Pi hardware and the FreeRTOS real-time operating system. A simulation of a crankshaft was performed to generate realistic tasks in addition to the synthetic ones. Contrary to expectations, the results show that the Rate-Monotonic algorithm outperforms the Earliest Deadline First algorithm in scheduling tasks with relatively shorter periods.},
}
|
| Jan Eric Lenssen, Anas Toma, Albert Seebold, Victoria Shpacovitch, Pascal Libuschewski, Frank Weichert, Jian-Jia Chen and Roland Hergenröder. Real-Time Low SNR Signal Processing for Nanoparticle Analysis with Deep Neural Networks. In the 11th International Conference on Bio-Inspired Systems and Signal Processing (BIOSIGNALS 2018) Funchal, Portugal, January 2018, (Best Paper Award) [BibTeX][Abstract]@inproceedings { Eric-Biosignals18,
author = {Lenssen, Jan Eric and Toma, Anas and Seebold, Albert and Shpacovitch, Victoria and Libuschewski, Pascal and Weichert, Frank and Chen, Jian-Jia and Hergenr{\"o}der, Roland},
title = {Real-Time Low SNR Signal Processing for Nanoparticle Analysis with Deep Neural Networks},
booktitle = {the 11th International Conference on Bio-Inspired Systems and Signal Processing (BIOSIGNALS 2018)},
year = {2018},
address = {Funchal, Portugal},
month = {January},
note = { (Best Paper Award) },
confidential = {n},
abstract = {In this work, we improve several steps of our Plasmon Assisted Microscopy Of Nano-sized Objects (PAMONO) sensor data processing pipeline through the application of deep neural networks. The PAMONO biosensor is a mobile nanoparticle sensor utilizing Surface Plasmon Resonance (SPR) imaging for quantification and analysis of nanoparticles in liquid or air samples. Characteristics of PAMONO sensor data are spatiotemporal blob-like structures with a very low Signal-to-Noise Ratio (SNR), which indicate particle bindings and can be automatically analyzed with image processing methods. We propose and evaluate deep neural network architectures for spatiotemporal detection, time-series analysis and classification. We compare them to traditional methods like frequency domain or polygon shape features classified by a Random Forest classifier. It is shown that the application of deep learning enables the sensor to automatically detect and quantify 80 nm polystyrene particles and pushes the limits in blob detection with very low SNRs below one. In addition, we present benchmarks and show that real-time processing is achievable on consumer-level desktop Graphics Processing Units (GPUs).},
} In this work, we improve several steps of our Plasmon Assisted Microscopy Of Nano-sized Objects (PAMONO) sensor data processing pipeline through the application of deep neural networks. The PAMONO biosensor is a mobile nanoparticle sensor utilizing Surface Plasmon Resonance (SPR) imaging for quantification and analysis of nanoparticles in liquid or air samples. Characteristics of PAMONO sensor data are spatiotemporal blob-like structures with a very low Signal-to-Noise Ratio (SNR), which indicate particle bindings and can be automatically analyzed with image processing methods. We propose and evaluate deep neural network architectures for spatiotemporal detection, time-series analysis and classification. We compare them to traditional methods like frequency domain or polygon shape features classified by a Random Forest classifier. It is shown that the application of deep learning enables the sensor to automatically detect and quantify 80 nm polystyrene particles and pushes the limits in blob detection with very low SNRs below one. In addition, we present benchmarks and show that real-time processing is achievable on consumer-level desktop Graphics Processing Units (GPUs).
|
| Georg von der Brüggen, Lea Schönberger and Jian-Jia Chen. Do Nothing, but Carefully: Fault Tolerance with Timing Guarantees for Multiprocessor Systems devoid of Online Adaptation. In The 23rd IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2018) Taipei, Taiwan, December 4-7 2018 [BibTeX][PDF][Link][Abstract]@inproceedings { brueggenetalPRDC2018,
author = {Br\"uggen, Georg von der and Sch\"onberger, Lea and Chen, Jian-Jia},
title = {Do Nothing, but Carefully: Fault Tolerance with Timing Guarantees for Multiprocessor Systems devoid of Online Adaptation},
booktitle = {The 23rd IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2018)},
year = {2018},
address = {Taipei, Taiwan},
month = {December 4-7},
url = {https://ieeexplore.ieee.org/document/8639554},
keywords = {georg, lea},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2018-prdc-georg.pdf},
confidential = {n},
abstract = {Many practical real-time systems must be able to sustain several reliability threats induced by their physical environments that cause short-term abnormal system behavior, such as transient faults. To cope with this change of system behavior, online adaptations, which may introduce a high computation overhead, are performed in many cases to ensure the timeliness of the more important tasks while no guarantees are provided for the less important tasks. In this work, we propose a system model which does not require any online adaptation but, according to the concept of dynamic real-time guarantees, provides full timing guarantees as well as limited timing guarantees, depending on the system behavior. For the normal system behavior, timeliness is guaranteed for all tasks; otherwise, timeliness is guaranteed only for the more important tasks while bounded tardiness is ensured for the less important tasks. Aiming to provide such dynamic timing guarantees, we propose a suitable system model and discuss how this can be established by means of partitioned as well as semi-partitioned strategies. Moreover, we propose an approach for handling abnormal behavior with a longer duration, such as intermittent faults or overheating of processors, by performing task migration to compensate for the affected system component and to increase the system’s reliability. We show by comprehensive experiments that good acceptance ratios can be achieved under partitioned scheduling, which can be further improved under semi-partitioned strategies. In addition, we demonstrate that the proposed migration techniques lead to a reasonable trade-off between the decrease in schedulability and the gain in robustness of the system. The presented approaches can also be applied to mixed-criticality systems with two criticality levels.},
} Many practical real-time systems must be able to sustain several reliability threats induced by their physical environments that cause short-term abnormal system behavior, such as transient faults. To cope with this change of system behavior, online adaptations, which may introduce a high computation overhead, are performed in many cases to ensure the timeliness of the more important tasks while no guarantees are provided for the less important tasks. In this work, we propose a system model which does not require any online adaptation but, according to the concept of dynamic real-time guarantees, provides full timing guarantees as well as limited timing guarantees, depending on the system behavior. For the normal system behavior, timeliness is guaranteed for all tasks; otherwise, timeliness is guaranteed only for the more important tasks while bounded tardiness is ensured for the less important tasks. Aiming to provide such dynamic timing guarantees, we propose a suitable system model and discuss how this can be established by means of partitioned as well as semi-partitioned strategies. Moreover, we propose an approach for handling abnormal behavior with a longer duration, such as intermittent faults or overheating of processors, by performing task migration to compensate for the affected system component and to increase the system’s reliability. We show by comprehensive experiments that good acceptance ratios can be achieved under partitioned scheduling, which can be further improved under semi-partitioned strategies. In addition, we demonstrate that the proposed migration techniques lead to a reasonable trade-off between the decrease in schedulability and the gain in robustness of the system. The presented approaches can also be applied to mixed-criticality systems with two criticality levels.
|
| Lea Schönberger, Wen-Hung Huang, Georg von der Brüggen, Kuan-Hsun Chen and Jian-Jia Chen. Schedulability Analysis and Priority Assignment for Segmented Self-Suspending Tasks. In The 24th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA) Hakodate, Japan, August 28-31 2018 [BibTeX][Link][Abstract]@inproceedings { schoenbergerRTCSA2018,
author = {Sch\"onberger, Lea and Huang, Wen-Hung and Br\"uggen, Georg von der and Chen, Kuan-Hsun and Chen, Jian-Jia},
title = {Schedulability Analysis and Priority Assignment for Segmented Self-Suspending Tasks},
booktitle = {The 24th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA)},
year = {2018},
address = {Hakodate, Japan},
month = {August 28-31},
url = {https://ieeexplore.ieee.org/document/8607245},
keywords = {kuan, lea, Georg},
confidential = {n},
abstract = {Self-suspending behavior in real-time embedded systems can have a major and non-trivial negative impact on timing predictability. In this work, we investigate how to analyze the schedulability of segmented self-suspending task systems under a fixed-priority assignment. For this purpose, we introduce the multi-segment workload function as well as the maximum workload function in order to quantify the maximum interference from the higher-priority tasks when constructing our (sufficient) schedulability test. Moreover, we derive an optimal priority assignment with respect to our schedulability test since it is compatible with Audsley’s Optimal Priority Assignment (OPA). We show by means of comprehensive evaluations that our approach is highly effective concerning the number of schedulable task sets. Furthermore, one set of results reveals a rather non-intuitive observation, namely, that the worst-case suspension time of a computation segment should also be respected to improve the schedulability even if the suspension may finish earlier.},
} Self-suspending behavior in real-time embedded systems can have a major and non-trivial negative impact on timing predictability. In this work, we investigate how to analyze the schedulability of segmented self-suspending task systems under a fixed-priority assignment. For this purpose, we introduce the multi-segment workload function as well as the maximum workload function in order to quantify the maximum interference from the higher-priority tasks when constructing our (sufficient) schedulability test. Moreover, we derive an optimal priority assignment with respect to our schedulability test since it is compatible with Audsley’s Optimal Priority Assignment (OPA). We show by means of comprehensive evaluations that our approach is highly effective concerning the number of schedulable task sets. Furthermore, one set of results reveals a rather non-intuitive observation, namely, that the worst-case suspension time of a computation segment should also be respected to improve the schedulability even if the suspension may finish earlier.
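Since the test above is stated to be OPA-compatible, the priority assignment itself follows Audsley's classic bottom-up procedure. Below is a minimal sketch assuming only an abstract OPA-compatible test schedulable(task, higher_prio_tasks); the paper's concrete workload-function test is not reproduced, and the toy test in the usage example is a naive stand-in for illustration only.

```python
# Sketch of Audsley's Optimal Priority Assignment (OPA). It assumes an
# OPA-compatible test: schedulable(task, others) decides whether `task`
# is schedulable at the lowest priority when all tasks in `others` have
# higher priority (their relative order must not matter).

def audsley_opa(tasks, schedulable):
    """Return tasks ordered from lowest to highest priority, or None."""
    unassigned = list(range(len(tasks)))
    order_low_to_high = []
    while unassigned:
        for i in unassigned:
            others = [tasks[j] for j in unassigned if j != i]
            if schedulable(tasks[i], others):
                # Task i can safely take the lowest remaining priority level.
                unassigned.remove(i)
                order_low_to_high.append(tasks[i])
                break
        else:
            return None  # no task fits the lowest level: unschedulable
    return order_low_to_high

# Toy, deliberately naive test (NOT the paper's test): own WCET plus one
# job of each higher-priority task must fit within the deadline (= period).
tasks = [(1, 4), (1, 5), (2, 8)]  # (wcet, period)
def toy_test(task, others):
    return task[0] + sum(c for c, _ in others) <= task[1]

print(audsley_opa(tasks, toy_test))
```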
|
| Jian-Jia Chen, Georg von der Brüggen, Junjie Shi and Niklas Ueter. Dependency Graph Approach for Multiprocessor Real-Time Synchronization. In IEEE Real-Time Systems Symposium, RTSS 2018, pages 434--446 Nashville, TN, USA, December 11-14 2018 [BibTeX][PDF][Link][Abstract]@inproceedings { DBLP:conf/rtss/ChenBSU18,
author = {Chen, Jian-Jia and Br\"uggen, Georg von der and Shi, Junjie and Ueter, Niklas},
title = {Dependency Graph Approach for Multiprocessor Real-Time Synchronization},
booktitle = {IEEE Real-Time Systems Symposium, RTSS 2018},
year = {2018},
pages = {434--446},
address = {Nashville, TN, USA},
month = {December 11-14},
url = {https://doi.org/10.1109/RTSS.2018.00057},
keywords = {georg, junjie},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2018-rtss-jj.pdf},
confidential = {n},
abstract = {Over the years, many multiprocessor locking protocols have been designed and analyzed. However, the performance of these protocols highly depends on how the tasks are partitioned and prioritized, and how the resources are shared locally and globally. This paper answers a few fundamental questions when real-time tasks share resources in multiprocessor systems. We explore the fundamental difficulty of the multiprocessor synchronization problem and show that a very simplified version of this problem is NP-hard in the strong sense regardless of the number of processors and the underlying scheduling paradigm. Therefore, the allowance of preemption or migration does not reduce the computational complexity. On the positive side, we develop a dependency-graph approach that is specifically useful for frame-based real-time tasks, i.e., when all tasks have the same period and release their jobs always at the same time. We present a series of algorithms with speedup factors between 2 and 3 under semi-partitioned scheduling. We further explore methodologies for and tradeoffs between preemptive and non-preemptive scheduling algorithms, and partitioned and semi-partitioned scheduling algorithms. Our approach is extended to periodic tasks under certain conditions.},
} Over the years, many multiprocessor locking protocols have been designed and analyzed. However, the performance of these protocols highly depends on how the tasks are partitioned and prioritized, and how the resources are shared locally and globally. This paper answers a few fundamental questions when real-time tasks share resources in multiprocessor systems. We explore the fundamental difficulty of the multiprocessor synchronization problem and show that a very simplified version of this problem is NP-hard in the strong sense regardless of the number of processors and the underlying scheduling paradigm. Therefore, the allowance of preemption or migration does not reduce the computational complexity. On the positive side, we develop a dependency-graph approach that is specifically useful for frame-based real-time tasks, i.e., when all tasks have the same period and release their jobs always at the same time. We present a series of algorithms with speedup factors between 2 and 3 under semi-partitioned scheduling. We further explore methodologies for and tradeoffs between preemptive and non-preemptive scheduling algorithms, and partitioned and semi-partitioned scheduling algorithms. Our approach is extended to periodic tasks under certain conditions.
|
| Niklas Ueter, Georg von der Brüggen, Jian-Jia Chen, Jing Li and Kunal Agrawal. Reservation-Based Federated Scheduling for Parallel Real-Time Tasks. In 2018 IEEE Real-Time Systems Symposium, RTSS 2018, pages 482--494 Nashville, TN, USA, December 11-14 2018, Outstanding Paper Award [BibTeX][PDF][Link][Abstract]@inproceedings { DBLP:conf/rtss/UeterBCLA18,
author = {Ueter, Niklas and Br\"uggen, Georg von der and Chen, Jian-Jia and Li, Jing and Agrawal, Kunal},
title = {Reservation-Based Federated Scheduling for Parallel Real-Time Tasks},
booktitle = {2018 IEEE Real-Time Systems Symposium, RTSS 2018},
year = {2018},
pages = {482--494},
address = {Nashville, TN, USA},
month = {December 11-14},
note = { Outstanding Paper Award },
url = {https://doi.org/10.1109/RTSS.2018.00061},
keywords = {georg},
file = {https://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2018-rtss-niklas.pdf},
confidential = {n},
abstract = {Multicore systems are increasingly utilized in real-time systems in order to address the high computational demands. To fully exploit the advantages of multicore processing, possible intra-task parallelism modeled as a directed acyclic graph (DAG) must be utilized efficiently. This paper considers the scheduling problem for parallel real-time tasks with constrained and arbitrary deadlines. In contrast to prior work in this area, it generalizes federated scheduling and proposes a novel reservation-based approach. Namely, we propose a reservation-based federated scheduling strategy that reduces the problem of scheduling arbitrary-deadline DAG task sets to the problem of scheduling arbitrary-deadline sequential task sets by allocating reservation servers. We provide the general reservation design for sporadic parallel tasks, such that any scheduling algorithm and analysis for sequential tasks with arbitrary deadlines can be used to execute the allocated reservation servers of parallel tasks. Moreover, the proposed reservation-based federated scheduling algorithms provide constant speedup factors with respect to any optimal scheduler for arbitrary-deadline DAG task sets. We demonstrate via numerical and empirical experiments that our algorithms are competitive with the state of the art.},
} Multicore systems are increasingly utilized in real-time systems in order to address the high computational demands. To fully exploit the advantages of multicore processing, possible intra-task parallelism modeled as a directed acyclic graph (DAG) must be utilized efficiently. This paper considers the scheduling problem for parallel real-time tasks with constrained and arbitrary deadlines. In contrast to prior work in this area, it generalizes federated scheduling and proposes a novel reservation-based approach. Namely, we propose a reservation-based federated scheduling strategy that reduces the problem of scheduling arbitrary-deadline DAG task sets to the problem of scheduling arbitrary-deadline sequential task sets by allocating reservation servers. We provide the general reservation design for sporadic parallel tasks, such that any scheduling algorithm and analysis for sequential tasks with arbitrary deadlines can be used to execute the allocated reservation servers of parallel tasks. Moreover, the proposed reservation-based federated scheduling algorithms provide constant speedup factors with respect to any optimal scheduler for arbitrary-deadline DAG task sets. We demonstrate via numerical and empirical experiments that our algorithms are competitive with the state of the art.
|
| Georg von der Brüggen, Nico Piatkowski, Kuan-Hsun Chen, Jian-Jia Chen and Katharina Morik. Efficiently Approximating the Probability of Deadline Misses in Real-Time Systems. In 30th Euromicro Conference on Real-Time Systems (ECRTS 2018) Barcelona, Spain, July 3 - 6 2018 [BibTeX][PDF][Link][Abstract]@inproceedings { DBLP:conf/ecrts/BruggenPCCM18,
author = {Br\"uggen, Georg von der and Piatkowski, Nico and Chen, Kuan-Hsun and Chen, Jian-Jia and Morik, Katharina},
title = {Efficiently Approximating the Probability of Deadline Misses in Real-Time Systems},
booktitle = {30th Euromicro Conference on Real-Time Systems (ECRTS 2018) },
year = {2018},
address = {Barcelona, Spain},
month = {July 3 - 6},
url = {http://drops.dagstuhl.de/opus/volltexte/2018/8997/pdf/LIPIcs-ECRTS-2018-6.pdf},
keywords = {kuan, Georg},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2018-ecrts-georg.pdf},
confidential = {n},
abstract = {This paper explores the probability of deadline misses for a set of constrained-deadline sporadic soft real-time tasks on uniprocessor platforms. We explore two directions to evaluate the probability that a job of the task under analysis can finish its execution at (or before) a testing time point t. One approach is based on analytical upper bounds that can be efficiently computed in polynomial time at the price of precision loss for each testing point, derived from the well-known Hoeffding's inequality and the well-known Bernstein's inequality. Another approach convolutes the probability efficiently over multinomial distributions, exploiting a series of state space reduction techniques, i.e., pruning without any loss of precision, and approximations via unifying equivalence classes with a bounded loss of precision. We demonstrate the effectiveness of our approaches in a series of evaluations. Distinct from the convolution-based methods in the literature, which suffer from high computation demand and are applicable only to task sets with a few tasks, our approaches can scale reasonably without losing much precision in terms of the derived probability of deadline misses.},
} This paper explores the probability of deadline misses for a set of constrained-deadline sporadic soft real-time tasks on uniprocessor platforms. We explore two directions to evaluate the probability that a job of the task under analysis can finish its execution at (or before) a testing time point t. One approach is based on analytical upper bounds that can be efficiently computed in polynomial time at the price of precision loss for each testing point, derived from the well-known Hoeffding's inequality and the well-known Bernstein's inequality. Another approach convolutes the probability efficiently over multinomial distributions, exploiting a series of state space reduction techniques, i.e., pruning without any loss of precision, and approximations via unifying equivalence classes with a bounded loss of precision. We demonstrate the effectiveness of our approaches in a series of evaluations. Distinct from the convolution-based methods in the literature, which suffer from high computation demand and are applicable only to task sets with a few tasks, our approaches can scale reasonably without losing much precision in terms of the derived probability of deadline misses.
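For readers unfamiliar with the first direction above, Hoeffding's inequality bounds the tail of a sum S of independent execution times bounded in [a_i, b_i]: P(S - E[S] >= x) <= exp(-2 x^2 / sum_i (b_i - a_i)^2). The sketch below evaluates such a bound at one testing point; the task parameters are invented for illustration and this is not the paper's full analysis.

```python
import math

# Hoeffding-style upper bound on the probability that the accumulated
# workload S = sum of independent job execution times exceeds a threshold t
# (e.g., the time available before a deadline). Each job i contributes an
# execution time in [lo_i, hi_i] with mean mean_i.

def hoeffding_exceedance_bound(jobs, t):
    """jobs: list of (lo, hi, mean). Returns an upper bound on P(S >= t)."""
    expected = sum(mean for (_, _, mean) in jobs)
    spread_sq = sum((hi - lo) ** 2 for (lo, hi, _) in jobs)
    if t <= expected:
        return 1.0  # the bound is vacuous at or below the mean workload
    return math.exp(-2.0 * (t - expected) ** 2 / spread_sq)

# Illustrative jobs with two-valued execution behavior (normal vs. longer).
jobs = [(1.0, 1.5, 1.05), (2.0, 3.0, 2.1), (0.5, 0.9, 0.54)]
print(hoeffding_exceedance_bound(jobs, t=6.0))
```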
|
| Anas Toma, Alexander Starinow, Jan Eric Lenssen and Jian-Jia Chen. Saving Energy for Cloud Applications in Mobile Devices using Nearby Resources. In the 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP 2018) Cambridge, UK, March 2018 [BibTeX][PDF][Abstract]@inproceedings { Toma-PDP2018,
author = {Toma, Anas and Starinow, Alexander and Lenssen, Jan Eric and Chen, Jian-Jia},
title = {Saving Energy for Cloud Applications in Mobile Devices using Nearby Resources},
booktitle = {the 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP 2018)},
year = {2018},
address = {Cambridge, UK},
month = {March},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2018-toma-pdp.pdf},
confidential = {n},
abstract = {In this paper, we present a middleware to save energy in mobile computing devices that offload tasks to a remote server in the cloud. Saving energy in these devices is very important to prolong the battery life and avoid overheating. The middleware uses an available nearby device called the auxiliary server either as a surrogate for the remote one or as a proxy to pass the data between the mobile device and the remote server. The main idea is to reduce the energy consumption of the communication with the remote server by using a high-speed or a low-power local connection with the auxiliary server instead. The paper also analyzes when it is beneficial to use the auxiliary server, based on the response time from the remote server and the bandwidth of the remote connection. The proposed middleware is evaluated using different benchmarks, including commonly used applications in mobile devices, and simulations. Furthermore, it is compared to state-of-the-art approaches in this area. The experiments show that the middleware is energy-efficient, especially when the bandwidth of the remote communication is relatively low or the server is overloaded.},
} In this paper, we present a middleware to save energy in mobile computing devices that offload tasks to a remote server in the cloud. Saving energy in these devices is very important to prolong the battery life and avoid overheating. The middleware uses an available nearby device called the auxiliary server either as a surrogate for the remote one or as a proxy to pass the data between the mobile device and the remote server. The main idea is to reduce the energy consumption of the communication with the remote server by using a high-speed or a low-power local connection with the auxiliary server instead. The paper also analyzes when it is beneficial to use the auxiliary server, based on the response time from the remote server and the bandwidth of the remote connection. The proposed middleware is evaluated using different benchmarks, including commonly used applications in mobile devices, and simulations. Furthermore, it is compared to state-of-the-art approaches in this area. The experiments show that the middleware is energy-efficient, especially when the bandwidth of the remote communication is relatively low or the server is overloaded.
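The "when is the auxiliary server beneficial" question admits a simple back-of-the-envelope form. As a hedged illustration in our own notation (not the paper's exact model): with data volume $d$, link bandwidths $B_{\mathrm{loc}}$ and $B_{\mathrm{rem}}$, radio powers $P_{\mathrm{loc}}$ and $P_{\mathrm{rem}}$ during transfer, idle power $P_{\mathrm{idle}}$, and server response times $t_{\mathrm{aux}}$ and $t_{\mathrm{rem}}$, routing through the auxiliary server saves energy roughly when

$$P_{\mathrm{loc}}\,\frac{d}{B_{\mathrm{loc}}} + P_{\mathrm{idle}}\,t_{\mathrm{aux}} \;<\; P_{\mathrm{rem}}\,\frac{d}{B_{\mathrm{rem}}} + P_{\mathrm{idle}}\,t_{\mathrm{rem}},$$

which matches the abstract's observation that a low remote bandwidth or an overloaded (slow) remote server favors the auxiliary path.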
|
| Junjie Shi, Kuan-Hsun Chen, Shuai Zhao, Wen-Hung Huang, Jian-Jia Chen and Andy Wellings. Implementation and Evaluation of Multiprocessor Resource Synchronization Protocol (MrsP) on LITMUSRT. In 13th Workshop on Operating Systems Platforms for Embedded Real-Time Applications 2017 [BibTeX][PDF][Abstract]@inproceedings { OSPERT17,
author = {Shi, Junjie and Chen, Kuan-Hsun and Zhao, Shuai and Huang, Wen-Hung and Chen, Jian-Jia and Wellings, Andy},
title = {Implementation and Evaluation of Multiprocessor Resource Synchronization Protocol (MrsP) on LITMUSRT},
booktitle = {13th Workshop on Operating Systems Platforms for Embedded Real-Time Applications},
year = {2017},
keywords = {kuan, junjie},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2017-junjie-ospert.pdf},
confidential = {n},
abstract = {Preventing race conditions or data corruption for concurrent shared resource accesses of real-time tasks is a challenging problem. Such problems have been studied in the literature by adopting resource synchronization protocols, but there are not enough evaluations that consider the overhead of the implementations of different protocols. In this paper, we discuss our implementation of the Multiprocessor Resource Sharing Protocol (MrsP) and the Distributed Non-Preemptive Protocol (DNPP) on LITMUSRT. Both of them are released in open source under the GNU General Public License (GPLv2). To study the impact of the implementation overhead, we deploy different synchronization scenarios with generated task sets and measure the performance with respect to the worst-case response time. The results illustrate that the implementation overhead is generally acceptable, whereas some unexpected system overhead may occur under distributed synchronization protocols on LITMUSRT.},
} Preventing race conditions or data corruption for concurrent shared resource accesses of real-time tasks is a challenging problem. Such problems have been studied in the literature by adopting resource synchronization protocols, but there are not enough evaluations that consider the overhead of the implementations of different protocols. In this paper, we discuss our implementation of the Multiprocessor Resource Sharing Protocol (MrsP) and the Distributed Non-Preemptive Protocol (DNPP) on LITMUSRT. Both of them are released in open source under the GNU General Public License (GPLv2). To study the impact of the implementation overhead, we deploy different synchronization scenarios with generated task sets and measure the performance with respect to the worst-case response time. The results illustrate that the implementation overhead is generally acceptable, whereas some unexpected system overhead may occur under distributed synchronization protocols on LITMUSRT.
|
| Kuan-Hsun Chen and Jian-Jia Chen. Probabilistic Schedulability Tests for Uniprocessor Fixed-Priority Scheduling under Soft Errors. In IEEE International Symposium on Industrial Embedded Systems (SIES), pages 1--8 2017 [BibTeX][PDF][Abstract]@inproceedings { SIES2017,
author = {Chen, Kuan-Hsun and Chen, Jian-Jia},
title = {Probabilistic Schedulability Tests for Uniprocessor Fixed-Priority Scheduling under Soft Errors},
booktitle = {IEEE International Symposium on Industrial Embedded Systems (SIES)},
year = {2017},
pages = {1--8},
keywords = {kuan},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2017-kuan-epst.pdf},
confidential = {n},
abstract = {Due to rising integration density, low-voltage operation, and environmental influences such as electromagnetic interference and radiation, transient faults may cause soft errors and corrupt the execution state. Such soft errors can be recovered from by applying fault-tolerant techniques. Therefore, the execution time of a job of a sporadic/periodic task may differ, depending upon the occurrence of soft errors and the applied error detection and recovery mechanisms. We model a periodic/sporadic real-time task under such a scenario by using two different worst-case execution times (WCETs), in which one is with the occurrence of soft errors and another is not. Based on a probabilistic soft-error model, the WCETs are hence associated with different probabilities. In this paper, we present efficient probabilistic schedulability tests that can be applied to verify the schedulability based on probabilistic arguments under fixed-priority scheduling on a uniprocessor system. We demonstrate how the Chernoff bounds can be used to calculate the task workloads based on their probabilistic WCETs. In addition, we further consider how to calculate the probability of consecutive deadline misses of a task. The pessimism and the efficiency of our approaches are evaluated against the tighter and approximated convolution-based approaches, by running extensive evaluations under different soft-error rates. The evaluation results show that our approaches are effective in deriving the probability of deadline misses and efficient with respect to the needed calculation time.},
} Due to rising integration density, low-voltage operation, and environmental influences such as electromagnetic interference and radiation, transient faults may cause soft errors and corrupt the execution state. Such soft errors can be recovered from by applying fault-tolerant techniques. Therefore, the execution time of a job of a sporadic/periodic task may differ, depending upon the occurrence of soft errors and the applied error detection and recovery mechanisms. We model a periodic/sporadic real-time task under such a scenario by using two different worst-case execution times (WCETs), in which one is with the occurrence of soft errors and another is not. Based on a probabilistic soft-error model, the WCETs are hence associated with different probabilities. In this paper, we present efficient probabilistic schedulability tests that can be applied to verify the schedulability based on probabilistic arguments under fixed-priority scheduling on a uniprocessor system. We demonstrate how the Chernoff bounds can be used to calculate the task workloads based on their probabilistic WCETs. In addition, we further consider how to calculate the probability of consecutive deadline misses of a task. The pessimism and the efficiency of our approaches are evaluated against the tighter and approximated convolution-based approaches, by running extensive evaluations under different soft-error rates. The evaluation results show that our approaches are effective in deriving the probability of deadline misses and efficient with respect to the needed calculation time.
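The Chernoff-bound step can be made concrete with the two-valued WCET model from the abstract: job i runs for c'_i (with error and recovery) with probability p_i and for c_i otherwise, so P(S >= s) <= min over theta > 0 of E[e^(theta S)] e^(-theta s). The sketch below evaluates this bound over a crude parameter grid; the job parameters are invented for illustration and this is not the paper's implementation.

```python
import math

# Chernoff-style bound for independent jobs with two-valued execution times:
# job i takes c_err_i (soft error + recovery) with probability p_i, and c_i
# otherwise. For any theta > 0, P(S >= s) <= E[exp(theta*S)] / exp(theta*s);
# we minimize over a grid of theta values.

def chernoff_deadline_miss_bound(jobs, s, thetas=None):
    """jobs: list of (c_normal, c_error, p_error). Upper-bounds P(S >= s)."""
    if thetas is None:
        thetas = [k / 100.0 for k in range(1, 501)]  # crude 1-D search grid
    best = 1.0
    for theta in thetas:
        log_mgf = sum(
            math.log((1 - p) * math.exp(theta * c) + p * math.exp(theta * ce))
            for (c, ce, p) in jobs
        )
        best = min(best, math.exp(log_mgf - theta * s))
    return best

jobs = [(1.0, 1.6, 0.01), (2.0, 2.9, 0.02), (0.7, 1.1, 0.01)]
print(chernoff_deadline_miss_bound(jobs, s=6.5))
```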
|
| Olaf Neugebauer, Peter Marwedel, Roland Kühn and Michael Engel. Quality Evaluation Strategies for Approximate Computing in Embedded Systems. In Technological Innovation for Smart Systems: 8th IFIP WG 5.5/SOCOLNET Advanced Doctoral Conference on Computing, Electrical and Industrial Systems, DoCEIS 2017, Costa de Caparica, Portugal, May 3-5, 2017, Proceedings, pages 203--210 2017 [BibTeX][Link]@inproceedings { Neugebauer2017,
author = {Neugebauer, Olaf and Marwedel, Peter and K{\"u}hn, Roland and Engel, Michael},
title = {Quality Evaluation Strategies for Approximate Computing in Embedded Systems},
booktitle = {Technological Innovation for Smart Systems: 8th IFIP WG 5.5/SOCOLNET Advanced Doctoral Conference on Computing, Electrical and Industrial Systems, DoCEIS 2017, Costa de Caparica, Portugal, May 3-5, 2017, Proceedings},
year = {2017},
editor = {Camarinha-Matos, Luis M. and Parreira-Rocha, Mafalda and Ramezani, Javaneh},
pages = {203--210},
publisher = {Springer International Publishing},
url = {http://dx.doi.org/10.1007/978-3-319-56077-9_19},
confidential = {n},
} |
| Helena Kotthaus, Jakob Richter, Andreas Lang, Janek Thomas, Bernd Bischl, Peter Marwedel, Jörg Rahnenführer and Michel Lang. RAMBO: Resource-Aware Model-Based Optimization with Scheduling for Heterogeneous Runtimes and a Comparison with Asynchronous Model-Based Optimization. In Proceedings of the 11th International Conference: Learning and Intelligent Optimization (LION 11), pages 180--195 2017 [BibTeX][Link][Abstract]@inproceedings { kotthaus/2017a,
author = {Kotthaus, Helena and Richter, Jakob and Lang, Andreas and Thomas, Janek and Bischl, Bernd and Marwedel, Peter and Rahnenf\"uhrer, J\"org and Lang, Michel},
title = {RAMBO: Resource-Aware Model-Based Optimization with Scheduling for Heterogeneous Runtimes and a Comparison with Asynchronous Model-Based Optimization},
booktitle = {Proceedings of the 11th International Conference: Learning and Intelligent Optimization (LION 11)},
year = {2017},
pages = {180--195},
publisher = {Lecture Notes in Computer Science, Springer},
url = {http://www.springer.com/de/book/9783319694030},
confidential = {n},
abstract = {Sequential model-based optimization is a popular technique for global optimization of expensive black-box functions. It uses a regression model to approximate the objective function and iteratively proposes new interesting points. Deviating from the original formulation, it is often indispensable to apply parallelization to speed up the computation. This is usually achieved by evaluating as many points per iteration as there are workers available. However, if runtimes of the objective function are heterogeneous, resources might be wasted by idle workers. Our new knapsack-based scheduling approach aims at increasing the effectiveness of parallel optimization by efficient resource utilization. Derived from an extra regression model we use runtime predictions of point evaluations to efficiently map evaluations to workers and reduce idling. We compare our approach to five established parallelization strategies on a set of continuous functions with heterogeneous runtimes. Our benchmark covers comparisons of synchronous and asynchronous model-based approaches and investigates the scalability.},
} Sequential model-based optimization is a popular technique for global optimization of expensive black-box functions. It uses a regression model to approximate the objective function and iteratively proposes new interesting points. Deviating from the original formulation, it is often indispensable to apply parallelization to speed up the computation. This is usually achieved by evaluating as many points per iteration as there are workers available. However, if runtimes of the objective function are heterogeneous, resources might be wasted by idle workers. Our new knapsack-based scheduling approach aims at increasing the effectiveness of parallel optimization by efficient resource utilization. Derived from an extra regression model we use runtime predictions of point evaluations to efficiently map evaluations to workers and reduce idling. We compare our approach to five established parallelization strategies on a set of continuous functions with heterogeneous runtimes. Our benchmark covers comparisons of synchronous and asynchronous model-based approaches and investigates the scalability.
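As a deliberately simplified stand-in for the scheduling idea above (the paper uses a knapsack-based formulation; this sketch only conveys why runtime predictions reduce idle workers), a longest-processing-time-first greedy mapping of predicted evaluation runtimes to workers:

```python
import heapq

# LPT greedy: assign the longest predicted runtimes first, each to the
# currently least-loaded worker. A classic makespan heuristic, shown here
# only as an illustration of runtime-aware mapping, not the paper's method.

def lpt_schedule(predicted_runtimes, num_workers):
    """Return (makespan, per-worker lists of job indices)."""
    heap = [(0.0, w) for w in range(num_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    plan = [[] for _ in range(num_workers)]
    for job in sorted(range(len(predicted_runtimes)),
                      key=lambda j: -predicted_runtimes[j]):
        load, w = heapq.heappop(heap)   # least-loaded worker
        plan[w].append(job)
        heapq.heappush(heap, (load + predicted_runtimes[job], w))
    return max(load for load, _ in heap), plan

print(lpt_schedule([9.0, 3.5, 7.2, 1.0, 4.4], num_workers=2))
```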
|
| Helena Kotthaus, Andreas Lang, Olaf Neugebauer and Peter Marwedel. R goes Mobile: Efficient Scheduling for Parallel R Programs on Heterogeneous Embedded Systems. In Abstract Booklet of the International R User Conference (UseR!), pages 74 Brussels, Belgium, July 2017 [BibTeX][Link]@inproceedings { kotthaus/2017b,
author = {Kotthaus, Helena and Lang, Andreas and Neugebauer, Olaf and Marwedel, Peter},
title = {R goes Mobile: Efficient Scheduling for Parallel R Programs on Heterogeneous Embedded Systems},
booktitle = {Abstract Booklet of the International R User Conference (UseR!)},
year = {2017},
pages = {74},
address = {Brussels, Belgium},
month = {July},
url = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2017_user_kotthaus.pdf },
confidential = {n},
} |
| Jian-Jia Chen, Georg von der Brüggen, Wen-Hung Huang and Robert I. Davis. On the Pitfalls of Resource Augmentation Factors and Utilization Bounds in Real-Time Scheduling. In 29th Euromicro Conference on Real-Time Systems, ECRTS, pages 9:1--9:25 Dubrovnik, Croatia, June 27-30 2017 [BibTeX][PDF][Link][Abstract]@inproceedings { DBLP:conf/ecrts/ChenBHD17,
author = {Chen, Jian-Jia and Br\"uggen, Georg von der and Huang, Wen-Hung and Davis, Robert I.},
title = {On the Pitfalls of Resource Augmentation Factors and Utilization Bounds in Real-Time Scheduling},
booktitle = {29th Euromicro Conference on Real-Time Systems, ECRTS},
year = {2017},
pages = {9:1--9:25},
address = {Dubrovnik, Croatia},
month = {June 27-30},
url = {https://doi.org/10.4230/LIPIcs.ECRTS.2017.9},
keywords = {Georg},
file = {http://drops.dagstuhl.de/opus/volltexte/2017/7161/pdf/LIPIcs-ECRTS-2017-9.pdf},
confidential = {n},
abstract = {In this paper, we take a careful look at speedup factors, utilization bounds, and capacity augmentation bounds. These three metrics have been widely adopted in real-time scheduling research as the de facto standard theoretical tools for assessing scheduling algorithms and schedulability tests. Despite that, it is not always clear how researchers and designers should interpret or use these metrics. In studying this area, we found a number of surprising results, and related to them, ways in which the metrics may be misinterpreted or misunderstood. In this paper, we provide a perspective on the use of these metrics, guiding researchers on their meaning and interpretation, and helping to avoid pitfalls in their use. Finally, we propose and demonstrate the use of parametric augmentation functions as a means of providing nuanced information that may be more relevant in practical settings.},
} In this paper, we take a careful look at speedup factors, utilization bounds, and capacity augmentation bounds. These three metrics have been widely adopted in real-time scheduling research as the de facto standard theoretical tools for assessing scheduling algorithms and schedulability tests. Despite that, it is not always clear how researchers and designers should interpret or use these metrics. In studying this area, we found a number of surprising results, and related to them, ways in which the metrics may be misinterpreted or misunderstood. In this paper, we provide a perspective on the use of these metrics, guiding researchers on their meaning and interpretation, and helping to avoid pitfalls in their use. Finally, we propose and demonstrate the use of parametric augmentation functions as a means of providing nuanced information that may be more relevant in practical settings.
|
| Jian-Jia Chen, Georg von der Brüggen, Wen-Hung Huang and Cong Liu. State of the art for scheduling and analyzing self-suspending sporadic real-time tasks. In 23rd IEEE International Conference on Embedded and Real-Time Computing Systems and Applications RTCSA, pages 1--10 Hsinchu, Taiwan, August 16-18 2017, Invited paper [BibTeX][PDF][Link][Abstract]@inproceedings { DBLP:conf/rtcsa/ChenBH017,
author = {Chen, Jian-Jia and Br\"uggen, Georg von der and Huang, Wen-Hung and Liu, Cong},
title = {State of the art for scheduling and analyzing self-suspending sporadic real-time tasks},
booktitle = {23rd IEEE International Conference on Embedded and Real-Time Computing Systems and Applications RTCSA},
year = {2017},
pages = {1--10},
address = {Hsinchu, Taiwan},
month = {August 16-18},
note = {Invited paper},
url = {http://doi.ieeecomputersociety.org/10.1109/RTCSA.2017.8046321},
keywords = {Georg},
file = {media/documents/publications/downloads/2017-chen-rtcsa.suspension-review.pdf},
confidential = {n},
abstract = {In computing systems, a job/process/task/thread may suspend itself when it has to wait for some other internal or external activities, such as computation offloading or memory accesses, to finish before it can continue its execution. In the literature, there are two commonly adopted self-suspending sporadic task models in real-time systems: 1) the dynamic self-suspension model and 2) the segmented self-suspension sporadic task model. A dynamic self-suspending sporadic task is specified with an upper bound on the maximum suspension time for a job (task instance), which allows a job to dynamically suspend itself arbitrarily often as long as the suspension time upper bound is not violated. By contrast, a segmented self-suspending sporadic task has a predefined execution and suspension pattern in an interleaving manner. The dynamic self-suspension model is very flexible but inaccurate, whilst the segmented self-suspension model is very restrictive but very accurate. The gap between these two widely adopted self-suspension task models can potentially be filled by the hybrid self-suspension task model. The investigation of the impact of self-suspension on timing predictability started in 1988. This survey paper provides a short summary of the state of the art in the design and analysis of scheduling algorithms and schedulability tests for self-suspending tasks in real-time systems.},
} In computing systems, a job/process/task/thread may suspend itself when it has to wait for some other internal or external activities, such as computation offloading or memory accesses, to finish before it can continue its execution. In the literature, there are two commonly adopted self-suspending sporadic task models in real-time systems: 1) the dynamic self-suspension model and 2) the segmented self-suspension sporadic task model. A dynamic self-suspending sporadic task is specified with an upper bound on the maximum suspension time for a job (task instance), which allows a job to dynamically suspend itself arbitrarily often as long as the suspension time upper bound is not violated. By contrast, a segmented self-suspending sporadic task has a predefined execution and suspension pattern in an interleaving manner. The dynamic self-suspension model is very flexible but inaccurate, whilst the segmented self-suspension model is very restrictive but very accurate. The gap between these two widely adopted self-suspension task models can potentially be filled by the hybrid self-suspension task model. The investigation of the impact of self-suspension on timing predictability started in 1988. This survey paper provides a short summary of the state of the art in the design and analysis of scheduling algorithms and schedulability tests for self-suspending tasks in real-time systems.
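Summarizing only the definitions given above in compact notation (the notation is ours), the two models can be contrasted as

$$\tau_i^{\mathrm{dyn}} = (C_i,\, S_i,\, D_i,\, T_i), \qquad \tau_i^{\mathrm{seg}} = \big(C_i^{1}, S_i^{1}, C_i^{2}, \dots, S_i^{m_i-1}, C_i^{m_i},\, D_i,\, T_i\big),$$

where a dynamic task may interleave up to $C_i$ units of execution with at most $S_i$ of total suspension arbitrarily, while a segmented task fixes the interleaving pattern of its $m_i$ computation segments and $m_i - 1$ suspension intervals.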
|
| Jian-Jia Chen, Wen-Hung Huang, Zheng Dong and Cong Liu. Fixed-priority scheduling of mixed soft and hard real-time tasks on multiprocessors. In 23rd IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, RTCSA, pages 1--10 Hsinchu, Taiwan, August 16-18 2017 [BibTeX][PDF][Link]@inproceedings { DBLP:conf/rtcsa/ChenHD017,
author = {Chen, Jian-Jia and Huang, Wen-Hung and Dong, Zheng and Liu, Cong},
title = {Fixed-priority scheduling of mixed soft and hard real-time tasks on multiprocessors},
booktitle = {23rd {IEEE} International Conference on Embedded and Real-Time Computing Systems and Applications, {RTCSA} },
year = {2017},
pages = {1--10},
address = {Hsinchu, Taiwan},
month = {August 16-18},
url = {http://doi.ieeecomputersociety.org/10.1109/RTCSA.2017.8046312},
file = {media/documents/publications/downloads/2017-chen-RTCSA-SRT.pdf},
confidential = {n},
} |
| Georg von der Brüggen, Wen-Hung Huang and Jian-Jia Chen. Hybrid self-suspension models in real-time embedded systems. In 23rd IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, RTCSA, pages 1--9 Hsinchu, Taiwan, August 16-18 2017 [BibTeX][PDF][Link][Abstract]@inproceedings { DBLP:conf/rtcsa/BruggenHC17,
author = {Br\"uggen, Georg von der and Huang, Wen-Hung and Chen, Jian-Jia},
title = {Hybrid self-suspension models in real-time embedded systems},
booktitle = {23rd IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, RTCSA},
year = {2017},
pages = {1--9},
address = {Hsinchu, Taiwan},
month = {August 16-18},
url = {http://doi.ieeecomputersociety.org/10.1109/RTCSA.2017.8046328},
keywords = {Georg, kevin},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2017-rtcsa-georg.pdf},
confidential = {n},
abstract = {To tackle the unavoidable self-suspension behavior due to I/O-intensive interactions, multi-core processors, computation offloading systems with coprocessors, etc., the dynamic and the segmented self-suspension sporadic task models have been widely used in the literature. We propose new self-suspension models that are hybrids of the dynamic and the segmented models. Those hybrid models are capable of exploiting knowledge about execution paths, potentially reducing modelling pessimism. In addition, we provide the corresponding schedulability analysis under fixed-relative-deadline (FRD) scheduling and explain how the state-of-the-art FRD scheduling strategy can be applied. Empirically, these hybrid approaches are shown to be effective with regards to the number of schedulable task sets.},
} To tackle the unavoidable self-suspension behavior due to I/O-intensive interactions, multi-core processors, computation offloading systems with coprocessors, etc., the dynamic and the segmented self-suspension sporadic task models have been widely used in the literature. We propose new self-suspension models that are hybrids of the dynamic and the segmented models. Those hybrid models are capable of exploiting knowledge about execution paths, potentially reducing modelling pessimism. In addition, we provide the corresponding schedulability analysis under fixed-relative-deadline (FRD) scheduling and explain how the state-of-the-art FRD scheduling strategy can be applied. Empirically, these hybrid approaches are shown to be effective with regards to the number of schedulable task sets.
|
| Georg von der Brüggen, Jian-Jia Chen, Wen-Hung Huang and Maolin Yang. Release Enforcement in Resource-Oriented Partitioned Scheduling for Multiprocessor Systems. In 25th International Conference on Real-Time Networks and Systems (RTNS) Grenoble, France, October 04 - 06 2017 [BibTeX][PDF][Link][Abstract]@inproceedings { DBLP:conf/rtns/Bruggen17rop,
author = {Br\"uggen, Georg von der and Chen, Jian-Jia and Huang, Wen-Hung and Yang, Maolin},
title = {Release Enforcement in Resource-Oriented Partitioned Scheduling for Multiprocessor Systems},
booktitle = {25th International Conference on Real-Time Networks and Systems (RTNS)},
year = {2017},
address = {Grenoble, France},
month = {October 04 - 06},
url = {https://dl.acm.org/citation.cfm?id=3139287},
keywords = {Georg, kevin},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2017_rtns_brueggen_rop.pdf},
confidential = {n},
abstract = {When partitioned scheduling is used in real-time multiprocessor systems, access to shared resources can jeopardize the schedulability if the task partition is not done carefully. To tackle this problem, we change our view angle from focusing on the computing tasks to focusing on the shared resources by applying resource-oriented partitioned scheduling. We use a release enforcement technique to shape the interference from the higher-priority jobs to be sporadic, analyze the schedulability, and provide strategies for partitioning both the critical and the non-critical sections of tasks onto processors individually. Our approaches are shown to be effective, both in the evaluations and from a theoretical point of view, by providing a speedup factor of 6, improving previously known results.},
} When partitioned scheduling is used in real-time multiprocessor systems, access to shared resources can jeopardize the schedulability if the task partition is not done carefully. To tackle this problem, we change our view angle from focusing on the computing tasks to focusing on the shared resources by applying resource-oriented partitioned scheduling. We use a release enforcement technique to shape the interference from the higher-priority jobs to be sporadic, analyze the schedulability, and provide strategies for partitioning both the critical and the non-critical sections of tasks onto processors individually. Our approaches are shown to be effective, both in the evaluations and from a theoretical point of view, by providing a speedup factor of 6, improving previously known results.
|
| Georg von der Brüggen, Niklas Ueter, Jian-Jia Chen and Matthias Freier. Parametric Utilization Bounds for Implicit-Deadline Periodic Tasks in Automotive Systems. In 25th International Conference on Real-Time Networks and Systems (RTNS) Grenoble, France, 2017 [BibTeX][PDF][Link][Abstract]@inproceedings { DBLP:conf/rtns/Bruggen17automotive,
author = {Br\"uggen, Georg von der and Ueter, Niklas and Chen, Jian-Jia and Freier, Matthias},
title = {Parametric Utilization Bounds for Implicit-Deadline Periodic Tasks in Automotive Systems},
booktitle = {25th International Conference on Real-Time Networks and Systems (RTNS)},
year = {2017},
address = {Grenoble, France},
url = {https://dl.acm.org/citation.cfm?id=3139273},
keywords = {georg},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2017_brueggen_rtns_automotive.pdf},
confidential = {n},
abstract = {Fixed-priority scheduling has been widely used in safety-critical applications. This paper explores the parametric utilization bounds for implicit-deadline periodic tasks in automotive uniprocessor systems, where the period of a task is either 1, 2, 5, 10, 20, 50, 100, 200, or 1000 milliseconds. We prove a parametric utilization bound of 90%+z for such automotive task systems under rate-monotonic preemptive scheduling (RM-P), where z is a parameter defined by the input task set with 0 ≤ z ≤ 10%. Moreover, we explain how to perform an exact schedulability test for an automotive task set under RM-P by validating only three conditions. Furthermore, we extend our analyses to rate-monotonic non-preemptive scheduling (RM-NP). We show that very reasonable utilization values can still be achieved under RM-NP if the execution time of all tasks is below 1 millisecond. The analyses presented here are compatible with angle-synchronous tasks by applying the related arrival curves. It is shown in the evaluations that scheduling those angle-synchronous tasks according to their minimum inter-arrival time instead of assigning them to the highest priority can drastically increase the acceptance ratio in some settings.},
} Fixed-priority scheduling has been widely used in safety-critical applications. This paper explores the parametric utilization bounds for implicit-deadline periodic tasks in automotive uniprocessor systems, where the period of a task is either 1, 2, 5, 10, 20, 50, 100, 200, or 1000 milliseconds. We prove a parametric utilization bound of 90%+z for such automotive task systems under rate-monotonic preemptive scheduling (RM-P), where z is a parameter defined by the input task set with 0 ≤ z ≤ 10%. Moreover, we explain how to perform an exact schedulability test for an automotive task set under RM-P by validating only three conditions. Furthermore, we extend our analyses to rate-monotonic non-preemptive scheduling (RM-NP). We show that very reasonable utilization values can still be achieved under RM-NP if the execution time of all tasks is below 1 millisecond. The analyses presented here are compatible with angle-synchronous tasks by applying the related arrival curves. It is shown in the evaluations that scheduling those angle-synchronous tasks according to their minimum inter-arrival time instead of assigning them to the highest priority can drastically increase the acceptance ratio in some settings.
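The exact test referenced above is, for preemptive fixed-priority scheduling of implicit-deadline periodic tasks, classically the response-time (time-demand) iteration. The sketch below shows that standard iteration with invented task parameters; it is not the specialized three-condition shortcut the paper derives for the automotive period set.

```python
import math

# Standard response-time iteration (time-demand analysis) for preemptive
# fixed-priority scheduling of implicit-deadline periodic tasks. Tasks are
# (wcet, period), listed from highest to lowest priority, e.g. in
# rate-monotonic order.

def response_time(tasks, k):
    """Worst-case response time of task k, or None if it misses its deadline."""
    wcet_k, period_k = tasks[k]
    r = wcet_k
    while True:
        # Demand: own WCET plus preemption by all higher-priority tasks.
        demand = wcet_k + sum(math.ceil(r / t_j) * c_j
                              for (c_j, t_j) in tasks[:k])
        if demand == r:
            return r          # fixed point reached: exact response time
        if demand > period_k:
            return None       # deadline (= period) miss
        r = demand

tasks = sorted([(1, 5), (2, 10), (5, 20)], key=lambda ct: ct[1])  # RM order
print([response_time(tasks, k) for k in range(len(tasks))])
```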
|
| Wen-Hung Huang and Jian-Jia Chen. Self-Suspension Real-Time Tasks under Fixed-Relative-Deadline Fixed-Priority Scheduling. In Design, Automation and Test in Europe (DATE) Dresden, Germany, 14 -18th Mar 2016 [BibTeX][PDF][Abstract]@inproceedings { HC16,
author = {Huang, Wen-Hung and Chen, Jian-Jia},
title = {Self-Suspension Real-Time Tasks under Fixed-Relative-Deadline Fixed-Priority Scheduling},
booktitle = {Design, Automation and Test in Europe (DATE)},
year = {2016},
address = {Dresden, Germany},
month = {14 -18th Mar},
keywords = {kevin},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/Self-Suspension-EDA-FP-6p},
confidential = {n},
abstract = {Self-suspension is becoming a prominent characteristic in real-time systems such as (i) I/O-intensive systems, (ii) multi-core processors, and (iii) computation offloading systems with coprocessors, like Graphics Processing Units (GPUs). In this work, we study self-suspension systems under a fixed-priority (FP) fixed-relative-deadline (FRD) algorithm by using release enforcement to control the behavior of self-suspending tasks. Specifically, we use equal-deadline assignment (EDA) to assign the release phases of computations and suspensions. We provide an analysis for deriving the speedup factor of the FP FRD scheduler using suspension-laxity-monotonic (SLM) priority assignment. This is the first positive result to provide bounded speedup factor guarantees for general multi-segment self-suspending task systems.},
} Self-suspension is becoming a prominent characteristic in real-time systems such as (i) I/O-intensive systems, (ii) multi-core processors, and (iii) computation offloading systems with coprocessors, like Graphics Processing Units (GPUs). In this work, we study self-suspension systems under a fixed-priority (FP) fixed-relative-deadline (FRD) algorithm by using release enforcement to control the behavior of self-suspending tasks. Specifically, we use equal-deadline assignment (EDA) to assign the release phases of computations and suspensions. We provide an analysis for deriving the speedup factor of the FP FRD scheduler using suspension-laxity-monotonic (SLM) priority assignment. This is the first positive result to provide bounded speedup factor guarantees for general multi-segment self-suspending task systems.
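Equal-deadline assignment can be stated compactly. As commonly formulated in this line of work (a hedged summary in our notation, not a verbatim reproduction of the paper's definitions): for a task with $m_i$ computation segments, total suspension of at most $S_i$, and relative deadline $D_i$, EDA grants each computation segment the same relative deadline,

$$D_i^{j} = \frac{D_i - S_i}{m_i} \quad \text{for } j = 1, \dots, m_i,$$

and release enforcement makes each segment eligible only at its fixed offset, which decouples the interference analysis of the individual segments.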
|
| Wen-Hung Huang, Jian-Jia Chen and Jan Reineke. MIRROR: Symmetric Timing Analysis for Real-Time Tasks on Multicore Platforms with Shared Resources. In Design Automation Conference (DAC) Austin, TX, USA, June 05-09 2016 [BibTeX][PDF][Abstract]@inproceedings { WR16,
author = {Huang, Wen-Hung and Chen, Jian-Jia and Reineke, Jan},
title = {MIRROR: Symmetric Timing Analysis for Real-Time Tasks on Multicore Platforms with Shared Resources},
booktitle = {Design Automation Conference (DAC)},
year = {2016},
address = {Austin, TX, USA},
month = {June 05-09 },
keywords = {kevin},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/PID4192125-mirror.pdf},
confidential = {n},
abstract = {The emergence of multicore and manycore platforms poses a big challenge for the design of real-time embedded systems, especially for timing analysis. We observe in this paper that response-time analysis for multicore platforms with shared resources can be symmetrically approached from two perspectives: a core-centric and a shared-resource-centric perspective. The common "core-centric" perspective is that a task executes on a core until it suspends the execution due to shared resource accesses. The potentially less intuitive "shared-resource-centric" perspective is that a task performs requests on shared resources until suspending itself back to perform computation on its respective core. Based on the above observation, we provide a pseudo-polynomial-time schedulability test and response-time analysis for constrained-deadline sporadic task systems. In addition, we propose a task partitioning algorithm that achieves a speedup factor of 7, compared to the optimal schedule. This constitutes the first result in this research line with a speedup factor guarantee. The experimental evaluation demonstrates that our approach can yield high acceptance ratios if the tasks have only a few resource access segments.},
} The emergence of multicore and manycore platforms poses a big challenge for the design of real-time embedded systems, especially for timing analysis. We observe in this paper that response-time analysis for multicore platforms with shared resources can be symmetrically approached from two perspectives: a core-centric and a shared-resource-centric perspective. The common “core-centric” perspective is that a task executes on a core until it suspends the execution due to shared resource accesses. The potentially less intuitive “shared-resource-centric” perspective is that a task performs requests on shared resources until suspending itself back to perform computation on its respective core. Based on the above observation, we provide a pseudo-polynomial-time schedulability test and response-time analysis for constrained-deadline sporadic task systems. In addition, we propose a task partitioning algorithm that achieves a speedup factor of 7, compared to the optimal schedule. This constitutes the first result in this research line with a speedup factor guarantee. The experimental evaluation demonstrates that our approach can yield high acceptance ratios if the tasks have only a few resource access segments.
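MIRROR's symmetric multicore analysis is not reproduced here; as background, the classical uniprocessor response-time recurrence that such schedulability tests generalize can be sketched in a few lines (the task encoding and names are illustrative):

    import math

    def response_time(C, D, higher_prio):
        # Classical fixed-point iteration: R = C + sum_j ceil(R / T_j) * C_j.
        # higher_prio is a list of (C_j, T_j) pairs of higher-priority tasks.
        R = C
        while R <= D:
            nxt = C + sum(math.ceil(R / T_j) * C_j for (C_j, T_j) in higher_prio)
            if nxt == R:
                return R          # fixed point reached: R is the response time
            R = nxt
        return None               # recurrence exceeded D: deemed unschedulable

For example, response_time(1, 5, [(1, 4), (2, 6)]) converges to 4, which is within the deadline of 5.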
|
| Wen-Hung Huang and Jian-Jia Chen. Utilization Bounds on Allocating Rate-Monotonic Scheduled Multi-Mode Tasks on Multiprocessor Systems. In Design Automation Conference (DAC) Austin, TX, USA, June 05-09 2016 [BibTeX][PDF][Abstract]@inproceedings { HC16a,
author = {Huang, Wen-Hung and Chen, Jian-Jia},
title = {Utilization Bounds on Allocating Rate-Monotonic Scheduled Multi-Mode Tasks on Multiprocessor Systems},
booktitle = {Design Automation Conference (DAC)},
year = {2016},
address = {Austin, TX, USA},
month = {June 05-09 },
keywords = {kevin},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/PID4192127-multimode-mp.pdf},
confidential = {n},
abstract = {Formal models used for representing recurrent real-time processes have traditionally been characterized by a collection of jobs that are released periodically. However, such a modeling may result in resource under-utilization in systems whose behaviors are not entirely periodic. For instance, tasks in a cyber-physical system (CPS) may change their service levels, e.g., periods and/or execution times, to adapt to changes in the environment. In this work, we study a model that is a generalization of the periodic task model, called the multi-mode task model: a task has several modes, specified with different execution times and periods, between which it can switch during runtime, independently of other tasks. Moreover, we study the problem of allocating a set of multi-mode tasks on a homogeneous multiprocessor system. We present a scheduling approach that uses any reasonable allocation decreasing (RAD) algorithm to allocate multi-mode tasks onto multiprocessor systems. We prove that this algorithm achieves 38% utilization for implicit-deadline rate-monotonic (RM) scheduled multi-mode tasks on multiprocessor systems.},
} Formal models used for representing recurrent real-time processes have traditionally been characterized by a collection of jobs that are released periodically. However, such a modeling may result in resource under-utilization in systems whose behaviors are not entirely periodic. For instance, tasks in a cyber-physical system (CPS) may change their service levels, e.g., periods and/or execution times, to adapt to changes in the environment. In this work, we study a model that is a generalization of the periodic task model, called the multi-mode task model: a task has several modes, specified with different execution times and periods, between which it can switch during runtime, independently of other tasks. Moreover, we study the problem of allocating a set of multi-mode tasks on a homogeneous multiprocessor system. We present a scheduling approach that uses any reasonable allocation decreasing (RAD) algorithm to allocate multi-mode tasks onto multiprocessor systems. We prove that this algorithm achieves 38% utilization for implicit-deadline rate-monotonic (RM) scheduled multi-mode tasks on multiprocessor systems.
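Read as a sufficient test, the claimed bound suggests a check of roughly the following shape (a speculative sketch: which utilization is counted per task, and whether the 38% applies to total platform capacity, is inferred from the abstract's wording only):

    def rad_rm_schedulable(tasks, M):
        # Speculative reading of the 38% bound: a multi-mode task's utilization
        # is taken as its maximum over all modes (C, T); the task set passes if
        # the total stays within 38% of the M-processor capacity.
        U = sum(max(C / T for (C, T) in modes) for modes in tasks)
        return U <= 0.38 * M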
|
| Kuan-Hsun Chen, Georg Brüggen and Jian-Jia Chen. Overrun Handling for Mixed-Criticality Support in RTEMS. In Workshop on Mixed-Criticality Systems Porto, Portugal, Nov 29 2016 [BibTeX][Link][Abstract]@inproceedings { WMC2016,
author = {Chen, Kuan-Hsun and Br\"uggen, Georg and Chen, Jian-Jia},
title = {Overrun Handling for Mixed-Criticality Support in RTEMS},
booktitle = {Workshop on Mixed-Criticality Systems},
year = {2016},
address = {Porto, Portugal},
month = {Nov 29},
url = {https://hal.archives-ouvertes.fr/hal-01438843/document},
keywords = {kuan, Georg},
confidential = {n},
abstract = {Real-time operating systems are not only used in embedded real-time systems but are also useful for the simulation and validation of those systems. During the evaluation of our paper about Systems with Dynamic Real-Time Guarantees, which appeared at RTSS 2016, we discovered certain unexpected system behavior in the open-source real-time operating system RTEMS. In the current implementation of RTEMS (version 4.11), overruns of an implicit-deadline task, i.e., deadline misses, result in unexpected system behavior, as they may lead to a shift of the release pattern of the task. As a consequence, some task instances are not released when they should be. In this paper we explain why these problems occur in RTEMS and present our solutions.},
} Real-time operating systems are not only used in embedded real-time systems but are also useful for the simulation and validation of those systems. During the evaluation of our paper about Systems with Dynamic Real-Time Guarantees, which appeared at RTSS 2016, we discovered certain unexpected system behavior in the open-source real-time operating system RTEMS. In the current implementation of RTEMS (version 4.11), overruns of an implicit-deadline task, i.e., deadline misses, result in unexpected system behavior, as they may lead to a shift of the release pattern of the task. As a consequence, some task instances are not released when they should be. In this paper we explain why these problems occur in RTEMS and present our solutions.
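The RTEMS internals are not shown in the abstract; the general pitfall class it describes can be illustrated as follows (a hypothetical sketch of the bug pattern, not RTEMS code):

    def next_release_buggy(completion_time, period):
        # Anchoring the next release at the job's completion time: after an
        # overrun, the whole future release pattern shifts, and some
        # activations that should have happened are silently skipped.
        return completion_time + period

    def next_release_correct(previous_release, period):
        # Anchoring at the previous *scheduled* release keeps the periodic
        # pattern intact even when a job overruns its deadline.
        return previous_release + period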
|
| Jian-Jia Chen, Wen-Hung Huang and Cong Liu. k2Q: A Quadratic-Form Response Time and Schedulability Analysis Framework for Utilization-Based Analysis. In Real-Time Systems Symposium (RTSS) Porto, Portugal, Nov. 29 - Dec. 2 2016 [BibTeX]@inproceedings { RTSS2016-k2Q,
author = {Chen, Jian-Jia and Huang, Wen-Hung and Liu, Cong},
title = {k2Q: A Quadratic-Form Response Time and Schedulability Analysis Framework for Utilization-Based Analysis},
booktitle = {Real-Time Systems Symposium (RTSS)},
year = {2016},
address = {Porto, Portugal},
month = {Nov. 29 - Dec. 2},
keywords = {kevin },
confidential = {n},
} |
| Kuan-Hsun Chen, Björn Bönninghoff, Jian-Jia Chen and Peter Marwedel. Compensate or Ignore? Meeting Control Robustness Requirements through Adaptive Soft-Error Handling. In Languages, Compilers, Tools and Theory for Embedded Systems (LCTES) Santa Barbara, CA, U.S.A., June 2016 [BibTeX][PDF][Link][Abstract]@inproceedings { Chenlctes2016,
author = {Chen, Kuan-Hsun and B\"onninghoff, Bj\"orn and Chen, Jian-Jia and Marwedel, Peter},
title = {Compensate or Ignore? Meeting Control Robustness Requirements through Adaptive Soft-Error Handling},
booktitle = {Languages, Compilers, Tools and Theory for Embedded Systems (LCTES)},
year = {2016},
address = {Santa Barbara, CA, U.S.A.},
month = {June},
organization = {ACM},
url = {http://dx.doi.org/10.1145/2907950.2907952},
keywords = {kuan},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2016-khchen-lctes.pdf},
confidential = {n},
abstract = {To avoid catastrophic events like unrecoverable system failures on mobile and embedded systems caused by soft-errors, software-based error detection and compensation techniques have been proposed. Methods like error-correction codes or redundant execution can offer high flexibility and allow for application-specific fault-tolerance selection without the need for special hardware support. However, such software-based approaches may lead to system overload due to the execution time overhead. An adaptive deployment of such techniques to meet both application requirements and system constraints is desired. From our case study, we observe that a control task can tolerate limited errors with acceptable performance loss. Such tolerance can be modeled as an (m, k) constraint, which requires at least m out of any k consecutive runs to be correct. In this paper, we discuss how a given (m, k) constraint can be satisfied by adopting patterns of task instances with individual error detection and compensation capabilities. We introduce static strategies and provide a formal feasibility analysis for validation. Furthermore, we develop an adaptive scheme that extends our initial approach with online awareness that increases efficiency while preserving analysis results. The effectiveness of our method is shown in a real-world case study as well as for synthesized task sets.},
} To avoid catastrophic events like unrecoverable system failures on mobile and embedded systems caused by soft-errors, software-based error detection and compensation techniques have been proposed. Methods like error-correction codes or redundant execution can offer high flexibility and allow for application-specific fault-tolerance selection without the need for special hardware support. However, such software-based approaches may lead to system overload due to the execution time overhead. An adaptive deployment of such techniques to meet both application requirements and system constraints is desired. From our case study, we observe that a control task can tolerate limited errors with acceptable performance loss. Such tolerance can be modeled as an (m, k) constraint, which requires at least m out of any k consecutive runs to be correct. In this paper, we discuss how a given (m, k) constraint can be satisfied by adopting patterns of task instances with individual error detection and compensation capabilities. We introduce static strategies and provide a formal feasibility analysis for validation. Furthermore, we develop an adaptive scheme that extends our initial approach with online awareness that increases efficiency while preserving analysis results. The effectiveness of our method is shown in a real-world case study as well as for synthesized task sets.
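The (m, k) constraint as defined in the abstract can be encoded directly (a sketch; the boolean history encoding is an assumption):

    def satisfies_mk(history, m, k):
        # history: sequence of booleans, True = the run was correct (assumed
        # encoding). The (m, k) constraint holds iff every window of k
        # consecutive runs contains at least m correct ones.
        if len(history) < k:
            return True  # no full window yet, so nothing can be violated
        return all(sum(history[i:i + k]) >= m
                   for i in range(len(history) - k + 1))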
|
| Ingo Korb, Helena Kotthaus and Peter Marwedel. mmapcopy: Efficient Memory Footprint Reduction using Application Knowledge. In Proceedings of the 31st Annual ACM Symposium on Applied Computing Pisa, Italy, 2016 [BibTeX][PDF][Abstract]@inproceedings { korb:2016:sac,
author = {Korb, Ingo and Kotthaus, Helena and Marwedel, Peter},
title = {mmapcopy: Efficient Memory Footprint Reduction using Application Knowledge},
booktitle = {Proceedings of the 31st Annual ACM Symposium on Applied Computing },
year = {2016},
address = {Pisa, Italy},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2016-korb-mmapcopy.pdf},
confidential = {n},
abstract = {Memory requirements can be a limiting factor for programs dealing with large data structures. Especially interpreted programming languages that are used to deal with large vectors like R suffer from memory overhead when copying such data structures. Avoiding data duplication directly in the application can reduce the memory requirements. Alternatively, generic kernel-level memory reduction functionality like deduplication and compression can lower the amount of memory required, but they need to compensate for missing application knowledge by utilizing more CPU time, leading to excessive overhead. To allow new optimizations based
on the application’s knowledge about its own memory utilization, we propose to introduce a new system call. This system call uses the existing copy-on-write functionality of the Linux kernel to avoid duplicating memory when data is copied. Our experiments using real-world benchmarks written in the R language show that our approach can yield significant improvement in CPU time compared to Kernel Samepage Merging without compromising the amount of memory saved.
},
} Memory requirements can be a limiting factor for programs dealing with large data structures. Especially interpreted programming languages that are used to deal with large vectors like R suffer from memory overhead when copying such data structures. Avoiding data duplication directly in the application can reduce the memory requirements. Alternatively, generic kernel-level memory reduction functionality like deduplication and compression can lower the amount of memory required, but they need to compensate for missing application knowledge by utilizing more CPU time, leading to excessive overhead. To allow new optimizations based
on the application’s knowledge about its own memory utilization, we propose to introduce a new system call. This system call uses the existing copy-on-write functionality of the Linux kernel to avoid duplicating memory when data is copied. Our experiments using real-world benchmarks written in the R language show that our approach can yield significant improvement in CPU time compared to Kernel Samepage Merging without compromising the amount of memory saved.
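The proposed system call's interface is not given in the abstract and is not guessed here. As a rough analogy only, Linux already provides copy-on-write semantics for private file mappings, which Python can exercise via the standard mmap module (Unix only; the file name is hypothetical):

    import mmap

    # MAP_PRIVATE gives copy-on-write pages: reads share the page cache, and
    # a physical copy is made only for pages that are actually written.
    with open("big_vector.bin", "r+b") as f:   # hypothetical data file
        view = mmap.mmap(f.fileno(), 0, flags=mmap.MAP_PRIVATE)
        view[0:4] = b"\x00\x01\x02\x03"        # only this page gets copied
        view.close()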
|
| Jakob Richter, Helena Kotthaus, Bernd Bischl, Peter Marwedel, Jörg Rahnenführer and Michel Lang. Faster Model-Based Optimization through Resource-Aware Scheduling Strategies. In Proceedings of the 10th International Conference: Learning and Intelligent Optimization (LION 10). vol. 10079 of Lecture Notes in Computer Science., pages 267--273 2016 [BibTeX][PDF][Link][Abstract]@inproceedings { kotthaus/2016a,
author = {Richter, Jakob and Kotthaus, Helena and Bischl, Bernd and Marwedel, Peter and Rahnenf\"uhrer, J\"org and Lang, Michel},
title = {Faster Model-Based Optimization through Resource-Aware Scheduling Strategies},
booktitle = {Proceedings of the 10th International Conference: Learning and Intelligent Optimization (LION 10).},
year = {2016},
volume = {vol. 10079 of Lecture Notes in Computer Science.},
pages = {267--273},
publisher = {Springer},
url = {http://link.springer.com/chapter/10.1007/978-3-319-50349-3_22},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2016_kotthaus_lion10.pdf },
confidential = {n},
abstract = {We present a Resource-Aware Model-Based Optimization framework RAMBO that leads to efficient utilization of parallel computer architectures through resource-aware scheduling strategies. Conventional MBO fits a regression model on the set of already evaluated configurations and their observed performances to guide the search. Due to its inherent sequential nature, an efficient parallel variant can not directly be derived, as only the most promising configuration w.r.t. an infill criterion is evaluated in each iteration. This issue has been addressed by generalized infill criteria in order to propose multiple points simultaneously for parallel execution in each sequential step. However, these extensions in general neglect systematic runtime differences in the configuration space which often leads to underutilized systems. We estimate runtimes using an additional surrogate model to improve the scheduling and demonstrate that our framework approach already yields improved resource utilization on two exemplary classification tasks.},
} We present a Resource-Aware Model-Based Optimization framework RAMBO that leads to efficient utilization of parallel computer architectures through resource-aware scheduling strategies. Conventional MBO fits a regression model on the set of already evaluated configurations and their observed performances to guide the search. Due to its inherent sequential nature, an efficient parallel variant can not directly be derived, as only the most promising configuration w.r.t. an infill criterion is evaluated in each iteration. This issue has been addressed by generalized infill criteria in order to propose multiple points simultaneously for parallel execution in each sequential step. However, these extensions in general neglect systematic runtime differences in the configuration space which often leads to underutilized systems. We estimate runtimes using an additional surrogate model to improve the scheduling and demonstrate that our framework approach already yields improved resource utilization on two exemplary classification tasks.
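The estimate-guided scheduling idea can be pictured with a minimal greedy placement (longest estimated runtime first onto the least loaded worker; a generic stand-in for illustration, not RAMBO's model-based scheduler, and all names are made up):

    import heapq

    def assign_by_estimate(tasks, estimate, n_workers):
        # Greedy: sort tasks by estimated runtime (descending) and always
        # give the next task to the currently least loaded worker.
        heap = [(0.0, w) for w in range(n_workers)]   # (accumulated load, id)
        plan = {w: [] for w in range(n_workers)}
        for task in sorted(tasks, key=estimate, reverse=True):
            load, w = heapq.heappop(heap)
            plan[w].append(task)
            heapq.heappush(heap, (load + estimate(task), w))
        return plan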
|
| Helena Kotthaus, Jakob Richter, Andreas Lang, Michel Lang and Peter Marwedel. Resource-Aware Scheduling Strategies for Parallel Machine Learning R Programs through RAMBO. In Abstract Booklet of the International R User Conference (UseR!) 195 USA, Stanford, June 2016 [BibTeX][Link][Abstract]@inproceedings { kotthaus:2016b,
author = {Kotthaus, Helena and Richter, Jakob and Lang, Andreas and Lang, Michel and Marwedel, Peter},
title = {Resource-Aware Scheduling Strategies for Parallel Machine Learning R Programs through RAMBO},
booktitle = {Abstract Booklet of the International R User Conference (UseR!)},
year = {2016},
number = {195},
address = {USA, Stanford},
month = {June},
url = {http://user2016.org/files/abs-book.pdf},
confidential = {n},
abstract = {We present resource-aware scheduling strategies for parallel R programs leading to efficient utilization of parallel computer architectures by estimating resource demands. We concentrate on applications that consist of independent tasks.
The R programming language is increasingly used to process large data sets in parallel, which requires a high amount of resources. One important application is parameter tuning of machine learning algorithms where evaluations need to be executed in parallel to reduce runtime. Here, resource demands of tasks heavily vary depending on the algorithm configuration. Running such an application in a naive parallel way leads to inefficient resource utilization and thus to long runtimes. Therefore, the R package “parallel” offers a scheduling strategy, called “load balancing”.
It dynamically allocates tasks to worker processes. This option is recommended when tasks have widely different computation times or if computer architectures are heterogeneous. We analyzed memory and CPU utilization of parallel applications with our TraceR profiling tool and found that the load balancing mechanism is not sufficient for parallel tasks with high variance in resource demands. A scheduling strategy needs to know resource demands of a task before execution to efficiently map applications to available resources.
Therefore, we build a regression model to estimate resource demands based on previously evaluated tasks. Resource estimates like runtime are then used to guide our scheduling strategies. Those strategies are integrated in our RAMBO (Resource-Aware Model-Based Optimization) Framework. Compared to the standard mechanisms of the parallel package, our approach yields improved resource utilization.},
} We present resource-aware scheduling strategies for parallel R programs leading to efficient utilization of parallel computer architectures by estimating resource demands. We concentrate on applications that consist of independent tasks.
The R programming language is increasingly used to process large data sets in parallel, which requires a high amount of resources. One important application is parameter tuning of machine learning algorithms where evaluations need to be executed in parallel to reduce runtime. Here, resource demands of tasks heavily vary depending on the algorithm configuration. Running such an application in a naive parallel way leads to inefficient resource utilization and thus to long runtimes. Therefore, the R package “parallel” offers a scheduling strategy, called “load balancing”.
It dynamically allocates tasks to worker processes. This option is recommended when tasks have widely different computation times or if computer architectures are heterogeneous. We analyzed memory and CPU utilization of parallel applications with our TraceR profiling tool and found that the load balancing mechanism is not sufficient for parallel tasks with high variance in resource demands. A scheduling strategy needs to know resource demands of a task before execution to efficiently map applications to available resources.
Therefore, we build a regression model to estimate resource demands based on previously evaluated tasks. Resource estimates like runtime are then used to guide our scheduling strategies. Those strategies are integrated in our RAMBO (Resource-Aware Model-Based Optimization) Framework. Compared to the standard mechanisms of the parallel package, our approach yields improved resource utilization.
|
| Wen-Hung Huang, Maolin Yang and Jian-Jia Chen. Resource-Oriented Partitioned Scheduling in Multiprocessor Systems: How to Partition and How to Share?. In Real-Time Systems Symposium (RTSS) Porto, Portugal, Nov. 29 - Dec. 2 2016, (Outstanding Paper Award). We identified some typos and revised the paper on May 29th, 2017. Revised version [BibTeX][PDF][Link]@inproceedings { RTSS2016-resource,
author = {Huang, Wen-Hung and Yang, Maolin and Chen, Jian-Jia},
title = {Resource-Oriented Partitioned Scheduling in Multiprocessor Systems: How to Partition and How to Share?},
booktitle = {Real-Time Systems Symposium (RTSS)},
year = {2016},
address = {Porto, Portugal},
month = {Nov. 29 - Dec. 2},
publisher = { Revised version with latexdiff},
note = { (Outstanding Paper Award). We identified some typos and revised the paper on May 29th, 2017. Revised version},
url = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2017-synchronization-revised-diff.pdf},
keywords = {kevin},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2016-kevin-synchronization_RTSS_camera_ready.pdf},
confidential = {n},
} |
| Jian-Jia Chen. Computational Complexity and Speedup Factors Analyses for Self-Suspending Tasks. In Real-Time Systems Symposium (RTSS) Porto, Portugal, Nov. 29 - Dec. 2 2016 [BibTeX][PDF]@inproceedings { RTSS2016-suspension,
author = {Chen, Jian-Jia},
title = {Computational Complexity and Speedup Factors Analyses for Self-Suspending Tasks},
booktitle = {Real-Time Systems Symposium (RTSS)},
year = {2016},
address = {Porto, Portugal},
month = {Nov. 29 - Dec. 2},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2016-JJ-Suspension-Complexity.pdf},
confidential = {n},
} |
| Jian-Jia Chen. Partitioned Multiprocessor Fixed-Priority Scheduling of Sporadic Real-Time Tasks. In Euromicro Conference on Real-Time Systems (ECRTS) Toulouse, France, 05-08, July 2016, (Outstanding Paper Award) An extended version is available via arXiv: http://arxiv.org/abs/1505.04693 [BibTeX][PDF][Abstract]@inproceedings { ChenECRTS2016-Partition,
author = {Chen, Jian-Jia},
title = {Partitioned Multiprocessor Fixed-Priority Scheduling of Sporadic Real-Time Tasks},
booktitle = {Euromicro Conference on Real-Time Systems (ECRTS)},
year = {2016},
address = {Toulouse, France},
month = {05-08, July},
note = { (Outstanding Paper Award) An extended version is available via arXiv: http://arxiv.org/abs/1505.04693},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2016-chen-ecrts16-partition.pdf},
confidential = {n},
abstract = { Partitioned multiprocessor scheduling has been widely accepted in
academia and industry to statically assign and partition real-time
tasks onto identical multiprocessor systems. This paper studies
fixed-priority partitioned multiprocessor scheduling for sporadic
real-time systems, in which deadline-monotonic scheduling is applied
on each processor. Prior to this paper, the best known results are
by Fisher, Baruah, and Baker with speedup factors $4-\frac{2}{M}$
and $3-\frac{1}{M}$ for arbitrary-deadline and constrained-deadline
sporadic real-time task systems, respectively, where $M$ is the
number of processors. We show that a
greedy mapping strategy has a speedup factor $3-\frac{1}{M}$ when
considering task systems with arbitrary deadlines. Such a factor holds for polynomial-time
schedulability tests and exponential-time (exact) schedulability
tests. Moreover, we also improve the speedup factor to $2.84306$
when considering constrained-deadline task systems.
We also provide tight examples when the fitting strategy in the
mapping stage is arbitrary and $M$ is sufficiently large.
For both constrained- and
arbitrary-deadline task systems, the analytical result surprisingly shows that using exact tests does not gain
theoretical benefits (with respect to speedup factors) if the speedup factor analysis
is oblivious of the particular fitting strategy used.},
} Partitioned multiprocessor scheduling has been widely accepted in
academia and industry to statically assign and partition real-time
tasks onto identical multiprocessor systems. This paper studies
fixed-priority partitioned multiprocessor scheduling for sporadic
real-time systems, in which deadline-monotonic scheduling is applied
on each processor. Prior to this paper, the best known results are
by Fisher, Baruah, and Baker with speedup factors $4-\frac{2}{M}$
and $3-\frac{1}{M}$ for arbitrary-deadline and constrained-deadline
sporadic real-time task systems, respectively, where $M$ is the
number of processors. We show that a
greedy mapping strategy has a speedup factor $3-\frac{1}{M}$ when
considering task systems with arbitrary deadlines. Such a factor holds for polynomial-time
schedulability tests and exponential-time (exact) schedulability
tests. Moreover, we also improve the speedup factor to $2.84306$
when considering constrained-deadline task systems.
We also provide tight examples when the fitting strategy in the
mapping stage is arbitrary and $M$ is sufficiently large.
For both constrained- and
arbitrary-deadline task systems, the analytical result surprisingly shows that using exact tests does not gain
theoretical benefits (with respect to speedup factors) if the speedup factor analysis
is oblivious of the particular fitting strategy used.
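The greedy mapping strategy the paper analyzes can be pictured as a first-fit skeleton (a sketch: the per-processor deadline-monotonic schedulability test is left as a caller-supplied predicate rather than guessing the paper's exact test, and ordering tasks by relative deadline is an assumption):

    def first_fit_partition(tasks, M, fits):
        # Greedy mapping skeleton: `fits(assigned, task)` should be any
        # per-processor deadline-monotonic schedulability test.
        processors = [[] for _ in range(M)]
        for task in sorted(tasks, key=lambda t: t["D"]):  # DM order (assumed)
            for proc in processors:
                if fits(proc, task):
                    proc.append(task)
                    break
            else:
                return None   # the task fits nowhere: mapping attempt fails
        return processors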
|
| Georg von der Brüggen, Kuan-Hsun Chen, Wen-Hung Huang and Jian-Jia Chen. Systems with Dynamic Real-Time Guarantees in Uncertain and Faulty Execution Environments. In Real-Time Systems Symposium (RTSS) Porto, Portugal, Nov. 29 - Dec. 2 2016 [BibTeX][PDF][Link][Abstract]@inproceedings { RTSS2016-dynamic-faulty,
author = {Br\"uggen, Georg von der and Chen, Kuan-Hsun and Huang, Wen-Hung and Chen, Jian-Jia},
title = {Systems with Dynamic Real-Time Guarantees in Uncertain and Faulty Execution Environments},
booktitle = {Real-Time Systems Symposium (RTSS)},
year = {2016},
address = {Porto, Portugal},
month = {Nov. 29 - Dec. 2},
url = {https://ieeexplore.ieee.org/document/7809865},
keywords = {georg, kevin, kuan},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2016_rtss_georg.pdf},
confidential = {n},
abstract = {In many practical real-time systems, the physical environment and the system platform can impose uncertain execution behaviour on the system. For example, if transient faults are detected, the execution time of a task instance can increase due to recovery operations. Such fault recovery routines make the system very vulnerable with respect to meeting hard real-time deadlines. In theory and in practical systems, this problem is often handled by aborting less important tasks to guarantee the response time of the more important tasks. However, in most systems such faults occur rarely, and the results of the less important tasks might still be useful, even if they are a bit late. This suggests not aborting these less important tasks but keeping them running even when faults occur, provided that the more important tasks still meet their hard real-time properties. In this paper, we present Systems with Dynamic Real-Time Guarantees to model this behaviour and determine whether the system can provide full timing guarantees or limited timing guarantees without any online adaptation after a fault occurred. We present a schedulability test, provide an algorithm for optimal priority assignment, determine the maximum interval length until the system will again provide full timing guarantees, and explain how the system state can be monitored online. The approaches presented in this paper can also be applied to mixed-criticality systems with dual criticality levels.},
} In many practical real-time systems, the physical environment and the system platform can impose uncertain execution behaviour on the system. For example, if transient faults are detected, the execution time of a task instance can increase due to recovery operations. Such fault recovery routines make the system very vulnerable with respect to meeting hard real-time deadlines. In theory and in practical systems, this problem is often handled by aborting less important tasks to guarantee the response time of the more important tasks. However, in most systems such faults occur rarely, and the results of the less important tasks might still be useful, even if they are a bit late. This suggests not aborting these less important tasks but keeping them running even when faults occur, provided that the more important tasks still meet their hard real-time properties. In this paper, we present Systems with Dynamic Real-Time Guarantees to model this behaviour and determine whether the system can provide full timing guarantees or limited timing guarantees without any online adaptation after a fault occurred. We present a schedulability test, provide an algorithm for optimal priority assignment, determine the maximum interval length until the system will again provide full timing guarantees, and explain how the system state can be monitored online. The approaches presented in this paper can also be applied to mixed-criticality systems with dual criticality levels.
|
| Jian-Jia Chen, Geoffrey Nelissen and Wen-Hung Kevin Huang. A Unifying Response Time Analysis Framework for Dynamic Self-Suspending Tasks. In Euromicro Conference on Real-Time Systems (ECRTS) Toulouse, France, 05-08, July 2016, An extended version is available in technical report #850, Technische Universität Dortmund - Fakultät für Informatik [BibTeX][PDF][Abstract]@inproceedings { ChenECRTS2016-suspension,
author = {Chen, Jian-Jia and Nelissen, Geoffrey and Huang, Wen-Hung Kevin},
title = {A Unifying Response Time Analysis Framework for Dynamic Self-Suspending Tasks},
booktitle = {Euromicro Conference on Real-Time Systems (ECRTS)},
year = {2016},
address = {Toulouse, France},
month = {05-08, July},
note = { An extended version is available in technical report #850, Technische Universit\"at Dortmund - Fakult\"at f\"ur Informatik},
keywords = {kevin},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2016-chen-ecrts-suspension.pdf},
confidential = {n},
abstract = { For real-time embedded systems, self-suspending behaviors can cause
substantial performance/schedulability degradations. In this paper,
we focus on preemptive fixed-priority scheduling for the dynamic
self-suspension task model on uniprocessor. This
model assumes that a job of a task can dynamically suspend itself during its execution (for instance, to wait for shared resources or access co-processors or external devices).
The total suspension time of a job is upper-bounded, but this dynamic behavior drastically influences the interference generated by this task on lower-priority tasks. The state-of-the-art results for this task model can be classified
into three categories (i) modeling suspension as computation, (ii)
modeling suspension as release jitter, and (iii) modeling suspension as a blocking term.
However, several results associated to the release jitter approach have been recently proven to be erroneous, and the concept of modeling suspension as blocking was never
formally proven correct. This paper presents a unifying
response time analysis framework for the dynamic self-suspending
task model. We provide a rigorous proof and show that the existing analyses pertaining to the three categories mentioned above are analytically dominated by our proposed solution. Therefore, all those techniques are in fact correct, but they are
inferior to the proposed response time analysis in this paper. The
evaluation results show that our analysis framework can generate huge
improvements (an increase of up to $50\%$ of the number of task sets
deemed schedulable) over these state-of-the-art analyses.},
} For real-time embedded systems, self-suspending behaviors can cause
substantial performance/schedulability degradations. In this paper,
we focus on preemptive fixed-priority scheduling for the dynamic
self-suspension task model on uniprocessor. This
model assumes that a job of a task can dynamically suspend itself during its execution (for instance, to wait for shared resources or access co-processors or external devices).
The total suspension time of a job is upper-bounded, but this dynamic behavior drastically influences the interference generated by this task on lower-priority tasks. The state-of-the-art results for this task model can be classified
into three categories (i) modeling suspension as computation, (ii)
modeling suspension as release jitter, and (iii) modeling suspension as a blocking term.
However, several results associated to the release jitter approach have been recently proven to be erroneous, and the concept of modeling suspension as blocking was never
formally proven correct. This paper presents a unifying
response time analysis framework for the dynamic self-suspending
task model. We provide a rigorous proof and show that the existing analyses pertaining to the three categories mentioned above are analytically dominated by our proposed solution. Therefore, all those techniques are in fact correct, but they are
inferior to the proposed response time analysis in this paper. The
evaluation results show that our analysis framework can generate huge
improvements (an increase of up to $50\%$ of the number of task sets
deemed schedulable) over these state-of-the-art analyses.
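As a rough illustration of two of the three categories named in the abstract (suspension as computation, suspension as release jitter), here is the classical recurrence under both modeling choices (a sketch of common formulations from the literature, not the paper's unifying framework; the task encoding is assumed):

    import math

    def rta_suspension_aware(task, higher_prio, model="jitter"):
        # Tasks are dicts with C (WCET), T (period), D (deadline), S
        # (maximum total suspension). The analyzed task's own suspension is
        # always added to its demand.
        C = task["C"] + task["S"]
        R = C
        while R <= task["D"]:
            if model == "computation":   # (i) interferers' S counted as C
                demand = sum(math.ceil(R / hp["T"]) * (hp["C"] + hp["S"])
                             for hp in higher_prio)
            else:                        # (ii) interferers' S counted as jitter
                demand = sum(math.ceil((R + hp["S"]) / hp["T"]) * hp["C"]
                             for hp in higher_prio)
            if C + demand == R:
                return R                 # fixed point: response-time bound
            R = C + demand
        return None                      # bound exceeds the deadline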
|
| Matthias Freier and Jian-Jia Chen. Sporadic Task Handling in Time-Triggered Systems. In Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems, {SCOPES} 2016, Sankt Goar, Germany, May 23-25, 2016, pages 135--144 2016 [BibTeX][Link]@inproceedings { DBLP:conf/scopes/FreierC16,
author = {Freier, Matthias and Chen, Jian-Jia},
title = {Sporadic Task Handling in Time-Triggered Systems},
booktitle = {Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems, {SCOPES} 2016, Sankt Goar, Germany, May 23-25, 2016},
year = {2016},
pages = {135--144},
url = {http://doi.org/10.1145/2906363.2906383},
confidential = {n},
} |
| Georg von der Brüggen, Wen-Hung Huang, Jian-Jia Chen and Cong Liu. Uniprocessor Scheduling Strategies for Self-Suspending Task Systems. In Proceedings of the 24th International Conference on Real-Time Networks and Systems (RTNS), pages 119--128 Brest, October 19 - 21 2016 [BibTeX][PDF][Link][Abstract]@inproceedings { vonderBruggen:2016:USS:2997465.2997497,
author = {Br\"uggen, Georg von der and Huang, Wen-Hung and Chen, Jian-Jia and Liu, Cong},
title = {Uniprocessor Scheduling Strategies for Self-Suspending Task Systems},
booktitle = {Proceedings of the 24th International Conference on Real-Time Networks and Systems (RTNS)},
year = {2016},
pages = {119--128},
address = {Brest},
month = {October 19 - 21},
publisher = {ACM},
url = {http://dl.acm.org/ft_gateway.cfm?id=2997497\&ftid=1804918\&dwn=1\&CFID=691780547\&CFTOKEN=64912419},
keywords = {georg, kevin},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2016-rtns-georg.pdf},
confidential = {n},
abstract = {We study uniprocessor scheduling for hard real-time self-suspending task systems where each task may contain a single self-suspension interval. We focus on improving state-of-the-art fixed-relative-deadline (FRD) scheduling approaches, where an FRD scheduler assigns a separate relative deadline to each computation segment of a task. Then, FRD schedules different computation segments by using the earliest deadline first (EDF) scheduling policy, based on the assigned deadlines for the computation segments. Our proposed algorithm, Shortest Execution Interval First Deadline Assignment (SEIFDA), greedily assigns the relative deadlines of the computation segments, starting with the task with the smallest execution interval length, i.e., the period minus the self-suspension time. We show that any reasonable deadline assignment under this strategy has a speedup factor of 3. Moreover, we present how to approximate the schedulability test and a generalized mixed-integer linear program (MILP) that can be formulated based on the tolerable loss in the schedulability test defined by the users. We show by both analysis and experiments that through designing smarter relative deadline assignment policies, the resulting FRD scheduling algorithms yield significantly better performance than existing schedulers for such task systems.},
} We study uniprocessor scheduling for hard real-time self-suspending task systems where each task may contain a single self-suspension interval. We focus on improving state-of-the-art fixed-relative-deadline (FRD) scheduling approaches, where an FRD scheduler assigns a separate relative deadline to each computation segment of a task. Then, FRD schedules different computation segments by using the earliest deadline first (EDF) scheduling policy, based on the assigned deadlines for the computation segments. Our proposed algorithm, Shortest Execution Interval First Deadline Assignment (SEIFDA), greedily assigns the relative deadlines of the computation segments, starting with the task with the smallest execution interval length, i.e., the period minus the self-suspension time. We show that any reasonable deadline assignment under this strategy has a speedup factor of 3. Moreover, we present how to approximate the schedulability test and a generalized mixed-integer linear program (MILP) that can be formulated based on the tolerable loss in the schedulability test defined by the users. We show by both analysis and experiments that through designing smarter relative deadline assignment policies, the resulting FRD scheduling algorithms yield significantly better performance than existing schedulers for such task systems.
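The ordering SEIFDA starts from is stated directly in the abstract and is a one-liner (the per-task deadline assignment, the paper's core contribution, is not reproduced here; the task encoding is an assumption):

    def seifda_order(tasks):
        # Process tasks by increasing execution interval length, i.e.,
        # period minus total self-suspension time (per the abstract).
        return sorted(tasks, key=lambda t: t["T"] - t["S"])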
|
| Anas Toma, Santiago Pagani, Jian-Jia Chen, Wolfgang Karl and Jörg Henkel. An Energy-Efficient Middleware for Computation Offloading in Real-Time Embedded Systems. In the 22nd IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA 2016) Daegu, South Korea, August 2016 [BibTeX][PDF][Abstract]@inproceedings { Toma-RTCSA2016,
author = {Toma, Anas and Pagani, Santiago and Chen, Jian-Jia and Karl, Wolfgang and Henkel, J\"org},
title = {An Energy-Efficient Middleware for Computation Offloading in Real-Time Embedded Systems},
booktitle = {the 22nd IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA 2016)},
year = {2016},
address = {Daegu, South Korea},
month = {August},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2016-toma-rtcsa.pdf},
confidential = {n},
abstract = {Embedded systems have limited resources, such as computation capabilities and battery life. The Dynamic Voltage and Frequency Scaling (DVFS) technique is used to save energy by running the processor of the embedded system at low voltage and frequency levels. However, this prolongs the execution time, which may cause potential deadline misses for real-time tasks. In this paper, we propose a general-purpose middleware to reduce the energy consumption in embedded systems without violating the real-time constraints. The algorithms in the middleware adopt the computation offloading concept to reduce the workload on the processor of the embedded system by sending the computation-intensive tasks to a powerful server. The algorithms are further combined with the DVFS technique to find the running frequency (or speed) such that the energy consumption is minimized and the real-time constraints are satisfied. The evaluation shows that our approach reduces the average energy consumption down to nearly 60%, compared to executing all the tasks locally at the maximum processor speed.},
} Embedded systems have limited resources, such as computation capabilities and battery life. The Dynamic Voltage and Frequency Scaling (DVFS) technique is used to save energy by running the processor of the embedded system at low voltage and frequency levels. However, this prolongs the execution time, which may cause potential deadline misses for real-time tasks. In this paper, we propose a general-purpose middleware to reduce the energy consumption in embedded systems without violating the real-time constraints. The algorithms in the middleware adopt the computation offloading concept to reduce the workload on the processor of the embedded system by sending the computation-intensive tasks to a powerful server. The algorithms are further combined with the DVFS technique to find the running frequency (or speed) such that the energy consumption is minimized and the real-time constraints are satisfied. The evaluation shows that our approach reduces the average energy consumption down to nearly 60%, compared to executing all the tasks locally at the maximum processor speed.
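As a toy illustration of the frequency-selection side of such schemes (assuming execution time scales inversely with frequency and that lower frequencies consume less energy; the paper's energy model and offloading decision are richer):

    def min_feasible_frequency(cycles, deadline, frequencies):
        # Pick the lowest available frequency that still meets the deadline,
        # under the simple model: execution time = cycles / f.
        for f in sorted(frequencies):
            if cycles / f <= deadline:
                return f
        return None   # locally infeasible: a candidate for offloading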
|
| Wen-Hung Huang and Jian-Jia Chen. Response Time Bounds for Sporadic Arbitrary-Deadline Tasks under Global Fixed-Priority Scheduling on Multiprocessors. In International Conference on Real-Time Networks and Systems (RTNS) Lille, France, 4-6th Nov 2015 [BibTeX][PDF]@inproceedings { CH15,
author = {Huang, Wen-Hung and Chen, Jian-Jia},
title = {Response Time Bounds for Sporadic Arbitrary-Deadline Tasks under Global Fixed-Priority Scheduling on Multiprocessors},
booktitle = {International Conference on Real-Time Networks and Systems (RTNS)},
year = {2015},
address = {Lille, France},
month = {4-6th Nov},
keywords = {kevin},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2015-huang_response-time-bounded-rtns.pdf},
confidential = {n},
} |
| Wen-Hung Huang, Jian-Jia Chen, Husheng Zhou and Cong Liu. PASS: Priority Assignment of Real-Time Tasks with Dynamic Suspending Behavior under Fixed-Priority Scheduling. In Design Automation Conference (DAC), San Francisco, CA, USA 2015 [BibTeX][PDF][Abstract]@inproceedings { Wal.15a,
author = {Huang, Wen-Hung and Chen, Jian-Jia and Zhou, Husheng and Liu, Cong},
title = {PASS: Priority Assignment of Real-Time Tasks with Dynamic Suspending Behavior under Fixed-Priority Scheduling},
booktitle = {Design Automation Conference (DAC), San Francisco, CA, USA},
year = {2015},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/pass-dac-2015.pdf},
confidential = {n},
abstract = {Self-suspension is becoming an increasingly prominent characteristic in real-time systems such as: (i) I/O-intensive systems, where applications interact intensively with I/O devices, (ii) multi-core processors, where tasks running on different cores have to synchronize and communicate with each other, and (iii) computation offloading systems with coprocessors, like Graphics Processing Units (GPUs). In this paper, we show that rate-monotonic (RM), deadline-monotonic (DM) and laxity-monotonic (LM) scheduling perform rather poorly in dynamic self-suspending systems in terms of speed-up factors. On the other hand, the proposed PASS approach is guaranteed to find a feasible priority assignment on a speed-2 uniprocessor, if one exists on a unit-speed processor. We evaluate the feasibility of the proposed approach via a case study implementation. Furthermore, the effectiveness of the proposed approach is also shown via extensive simulation results.},
} Self-suspension is becoming an increasingly prominent characteristic in real-time systems such as: (i) I/O-intensive systems, where applications interact intensively with I/O devices, (ii) multi-core processors, where tasks running on different cores have to synchronize and communicate with each other, and (iii) computation offloading systems with coprocessors, like Graphics Processing Units (GPUs). In this paper, we show that rate-monotonic (RM), deadline-monotonic (DM) and laxity-monotonic (LM) scheduling perform rather poorly in dynamic self-suspending systems in terms of speed-up factors. On the other hand, the proposed PASS approach is guaranteed to find a feasible priority assignment on a speed-2 uniprocessor, if one exists on a unit-speed processor. We evaluate the feasibility of the proposed approach via a case study implementation. Furthermore, the effectiveness of the proposed approach is also shown via extensive simulation results.
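PASS itself is not reproduced here; the classical template for feasibility-driven priority assignment in this setting is Audsley's optimal priority assignment, sketched below (named plainly; PASS is a different, suspension-aware algorithm, and the per-task feasibility test is left abstract):

    def audsley_assignment(tasks, feasible_at_lowest):
        # Audsley's OPA: repeatedly find some task that is feasible at the
        # lowest still-unassigned priority, with all remaining tasks above it.
        order, remaining = [], list(tasks)
        while remaining:
            for t in remaining:
                if feasible_at_lowest(t, [u for u in remaining if u is not t]):
                    order.append(t)       # t takes the lowest free priority
                    remaining.remove(t)
                    break
            else:
                return None               # no feasible assignment exists
        return order[::-1]                # highest priority first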
|
| Kuan-Hsun Chen, Jian-Jia Chen, Florian Kriebel, Semeen Rehman, Muhammad Shafique and J Henkel. Reliability-Aware Task Mapping on Many-Cores with Performance Heterogeneity. In ESWEEK Workshop on Resiliency in Embedded Electronic Systems 2015 [BibTeX][PDF][Abstract]@inproceedings { REES2015,
author = {Chen, Kuan-Hsun and Chen, Jian-Jia and Kriebel, Florian and Rehman, Semeen and Shafique, Muhammad and Henkel, J},
title = {Reliability-Aware Task Mapping on Many-Cores with Performance Heterogeneity},
booktitle = {ESWEEK Workshop on Resiliency in Embedded Electronic Systems},
year = {2015},
keywords = {kuan},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2015-kuan-rees.pdf},
confidential = {n},
abstract = {In this paper we explore how to efficiently allocate tasks onto many-cores by using RMT to improve the overall dependability, with respect to both timing and functional correctness, while also accounting for application tasks with multiple compiled versions. Such multiple reliable versions can be generated by using reliability-aware compilers like [2] and [8], and exhibit diverse performance and reliability properties. By applying multiple reliable task versions and RMT, we are able to exploit the optimization space at both the software and hardware levels while exploring different area, execution time, and achieved reliability tradeoffs. Timing correctness can be defined via the deadline miss rate, which is typically adopted as the quality-of-service (QoS) metric in many practical real-time applications.},
} In this paper we explore how to efficiently allocate tasks onto many-cores by using RMT to improve the overall dependability, with respect to both timing and functional correctness, while also accounting for application tasks with multiple compiled versions. Such multiple reliable versions can be generated by using reliability-aware compilers like [2] and [8], and exhibit diverse performance and reliability properties. By applying multiple reliable task versions and RMT, we are able to exploit the optimization space at both the software and hardware levels while exploring different area, execution time, and achieved reliability tradeoffs. Timing correctness can be defined via the deadline miss rate, which is typically adopted as the quality-of-service (QoS) metric in many practical real-time applications.
|
| Matthias Freier and Jian-Jia Chen. Time-Triggered Communication Scheduling Analysis for Real-Time Multicore Systems. In 10th IEEE International Symposium on Industrial Embedded Systems (SIES), Siegen, Germany,, June 8-10 2015 [BibTeX][PDF][Abstract]@inproceedings { Freier2015,
author = {Freier, Matthias and Chen, Jian-Jia},
title = {Time-Triggered Communication Scheduling Analysis for Real-Time Multicore Systems},
booktitle = {10th IEEE International Symposium on Industrial Embedded Systems (SIES), },
year = {2015},
address = {Siegen, Germany,},
month = {June 8-10},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2015-chen-SIES.pdf},
confidential = {n},
abstract = {Scheduling of real-time applications for multicore platforms has become an important research topic. For analyzing the timing satisfaction of real-time tasks, most research in the literature assumes independent tasks. However, industrial applications usually come with fully tangled dependencies among the tasks. Independence of the tasks provides a very nice abstraction, whereas dependent structures due to the tangled executions of the tasks are closer to real systems. This paper studies scheduling policies and schedulability analysis based on independent tasks by hiding the execution dependencies behind additional timing parameters. Our scheduling policy relates to the well-known periodic task model, but in contrast, tasks are able to communicate with each other. A feasible task set requires an analysis for each core and the communication infrastructure, which can be performed individually by decoupling computation from communication in a distributed system. By using a Time-Triggered Constant Phase (TTCP) scheduler, each task receives certain time-slots in the hyper-period of the task set, which ensures a time-predictable communication impact. In this paper, we provide several algorithms to derive the time-slot for each task. Furthermore, we present a fast heuristic algorithm to calculate the time-slot for each task, which is capable of reaching a core utilization of 90% for typical industrial applications. Finally, experiments show the effectiveness of our heuristic and its performance in different settings.},
} Scheduling of real-time applications for multicore platforms has become an important research topic. For analyzing the timing satisfaction of real-time tasks, most research in the literature assumes independent tasks. However, industrial applications usually come with fully tangled dependencies among the tasks. Independence of the tasks provides a very nice abstraction, whereas dependent structures due to the tangled executions of the tasks are closer to real systems. This paper studies scheduling policies and schedulability analysis based on independent tasks by hiding the execution dependencies behind additional timing parameters. Our scheduling policy relates to the well-known periodic task model, but in contrast, tasks are able to communicate with each other. A feasible task set requires an analysis for each core and the communication infrastructure, which can be performed individually by decoupling computation from communication in a distributed system. By using a Time-Triggered Constant Phase (TTCP) scheduler, each task receives certain time-slots in the hyper-period of the task set, which ensures a time-predictable communication impact. In this paper, we provide several algorithms to derive the time-slot for each task. Furthermore, we present a fast heuristic algorithm to calculate the time-slot for each task, which is capable of reaching a core utilization of 90% for typical industrial applications. Finally, experiments show the effectiveness of our heuristic and its performance in different settings.
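The window a TTCP-style slot table operates in is simply the hyper-period of the task set (the slot derivation itself is the paper's contribution and is not reproduced; Python 3.9+ for math.lcm):

    from math import lcm

    def hyperperiod(periods):
        # The time-slot table repeats every hyper-period, i.e., the least
        # common multiple of all (integer) task periods.
        return lcm(*periods)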
|
| Wen-Hung Huang and Jian-Jia Chen. Techniques for Schedulability Analysis in Mode Change Systems under Fixed-Priority Scheduling. In IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA) Hong Kong, August 19-21 2015, (Best Paper Award). We identified some typos in the proofs of Theorems 5 and 6 on May 29th, 2017. Revised version [BibTeX][PDF][Link]@inproceedings { WC15-RTCSA,
author = {Huang, Wen-Hung and Chen, Jian-Jia},
title = {Techniques for Schedulability Analysis in Mode Change Systems under Fixed-Priority Scheduling},
booktitle = {IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA)},
year = {2015},
address = {Hong Kong},
month = {August 19-21},
note = { (Best Paper Award). We identified some typos in the proofs of Theorems 5 and 6 on May 29th, 2017. Revised version },
url = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2015-multimode-revised-diff.pdf},
keywords = {kevin},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/polynomial-mode-change.pdf},
confidential = {n},
} |
| Jian-Jia Chen, Wen-Hung Huang and Cong Liu. k2U: A General Framework from k-Point Effective Schedulability Analysis to Utilization-Based Tests. In Real-Time Systems Symposium (RTSS) Dec. 1-4 2015 [BibTeX][PDF][Abstract]@inproceedings { ChenHLRTSS2015,
author = {Chen, Jian-Jia and Huang, Wen-Hung and Liu, Cong},
title = {k2U: A General Framework from k-Point Effective Schedulability Analysis to Utilization-Based Tests},
booktitle = {Real-Time Systems Symposium (RTSS)},
year = {2015},
month = {Dec. 1-4},
keywords = {kevin},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2015-chen-RTSS.pdf},
confidential = {n},
abstract = {To deal with a large variety of workloads in
different application domains in real-time embedded systems,
a number of expressive task models have been developed. For
each individual task model, researchers tend to develop different
types of techniques for deriving schedulability tests with different
computation complexity and performance. In this paper, we
present a general schedulability analysis framework, namely the
k2U framework, that can be potentially applied to analyze
a large set of real-time task models under any fixed-priority
scheduling algorithm, on both uniprocessor and multiprocessor
scheduling. The key to k2U is a k-point effective schedulability
test, which can be viewed as a “blackbox” interface. For any
task model, if a corresponding k-point effective schedulability
test can be constructed, then a sufficient utilization-based test
can be automatically derived. We show the generality of k2U by
applying it to different task models, which results in new and
improved tests compared to the state-of-the-art.},
} To deal with a large variety of workloads in
different application domains in real-time embedded systems,
a number of expressive task models have been developed. For
each individual task model, researchers tend to develop different
types of techniques for deriving schedulability tests with different
computation complexity and performance. In this paper, we
present a general schedulability analysis framework, namely the
k2U framework, that can be potentially applied to analyze
a large set of real-time task models under any fixed-priority
scheduling algorithm, on both uniprocessor and multiprocessor
scheduling. The key to k2U is a k-point effective schedulability
test, which can be viewed as a “blackbox” interface. For any
task model, if a corresponding k-point effective schedulability
test can be constructed, then a sufficient utilization-based test
can be automatically derived. We show the generality of k2U by
applying it to different task models, which results in new and
improved tests compared to the state-of-the-art.
|
| Helena Kotthaus, Ingo Korb and Peter Marwedel. Performance Analysis for Parallel R Programs: Towards Efficient Resource Utilization. In Abstract Booklet of the International R User Conference (UseR!), pages 66 Aalborg, Denmark, June 2015 [BibTeX][Link]@inproceedings { kotthaus/2015a,
author = {Kotthaus, Helena and Korb, Ingo and Marwedel, Peter},
title = {Performance Analysis for Parallel R Programs: Towards Efficient Resource Utilization},
booktitle = {Abstract Booklet of the International R User Conference (UseR!)},
year = {2015},
pages = {66},
address = {Aalborg, Denmark},
month = {June},
url = {http://user2015.math.aau.dk/docs/useR2015-BookOfAbstracts.pdf},
confidential = {n},
} |
| Helena Kotthaus, Ingo Korb and Peter Marwedel. Distributed Performance Analysis for R. In R Implementation, Optimization and Tooling Workshop (RIOT) Prag, Czech, July 2015 [BibTeX][Link]@inproceedings { kotthaus/2015b,
author = {Kotthaus, Helena and Korb, Ingo and Marwedel, Peter},
title = {Distributed Performance Analysis for R},
booktitle = {R Implementation, Optimization and Tooling Workshop (RIOT)},
year = {2015},
address = {Prag, Czech},
month = {July},
url = {http://2015.ecoop.org/track/RIOT-2015-papers#program},
confidential = {n},
} |
| Andreas Heinig, Florian Schmoll, Björn Bönninghoff, Peter Marwedel and Michael Engel. FAME: Flexible Real-Time Aware Error Correction by Combining Application Knowledge and Run-Time Information. In Proceedings of the 11th Workshop on Silicon Errors in Logic - System Effects (SELSE) 2015 [BibTeX][PDF]@inproceedings { heinig:2015:selse,
author = {Heinig, Andreas and Schmoll, Florian and B\"onninghoff, Bj\"orn and Marwedel, Peter and Engel, Michael},
title = {FAME: Flexible Real-Time Aware Error Correction by Combining Application Knowledge and Run-Time Information},
booktitle = {Proceedings of the 11th Workshop on Silicon Errors in Logic - System Effects (SELSE)},
year = {2015},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2015-heinig-selse2015.pdf},
confidential = {n},
} |
| Georg von der Brüggen, Jian-Jia Chen and Wen-Hung Huang. Schedulability and Optimization Analysis for Non-Preemptive Static Priority Scheduling Based on Task Utilization and Blocking Factors. In Proceedings of Euromicro Conference on Real-Time Systems (ECRTS) Lund, Sweden, July 8-10 2015, We identified an error and revised the paper on Aug. 14th 2015. Short summary of erratum [BibTeX][PDF][Abstract]@inproceedings { brueggen:2015:ecrts,
author = {Br\"uggen, Georg von der and Chen, Jian-Jia and Huang, Wen-Hung},
title = {Schedulability and Optimization Analysis for Non-Preemptive Static Priority Scheduling Based on Task Utilization and Blocking Factors},
booktitle = {Proceedings of Euromicro Conference on Real-Time Systems (ECRTS)},
year = {2015},
address = {Lund, Sweden},
month = {July 8-10},
note = {We identified an error and revised the paper on Aug. 14th 2015. Short summary of erratum },
keywords = {georg, kevin},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2015_brueggen_ecrts.pdf},
confidential = {n},
abstract = {For real-time task sets, allowing preemption is often considered
important to ensure schedulability, as it allows high-priority tasks
to be allocated to the processor nearly immediately.
However, preemptive scheduling also introduces additional
overhead and may not be allowed for some hardware components, which
motivates the need for non-preemptive or limited-preemptive scheduling.
We present a safe sufficient schedulability test
for non-preemptive (NP) fixed-priority scheduling that can verify the
schedulability for Deadline Monotonic (DM-NP) and Rate Monotonic
(RM-NP) scheduling in linear time, if the task orders according to priority and
period are given. This test leads to a better
upper bound on the speedup factor for
DM-NP and RM-NP in comparison to Earliest Deadline First (EDF-NP) than previously known,
closing the gap between lower and upper bound.
We improve our test, resulting in interesting properties of the blocking time
that allow us to determine schedulability by only considering the schedulability of
the preemptive case if some conditions are met. Furthermore, we present a
utilization bound for RM-NP, based on the ratio $\gamma > 0$ of the
upper bound of the maximum blocking time to the execution time, significantly improving previous results.},
} For real-time task sets, allowing preemption is often considered
important to ensure schedulability, as it allows high-priority tasks
to be allocated to the processor nearly immediately.
However, preemptive scheduling also introduces additional
overhead and may not be allowed for some hardware components, which
motivates the need for non-preemptive or limited-preemptive scheduling.
We present a safe sufficient schedulability test
for non-preemptive (NP) fixed-priority scheduling that can verify the
schedulability for Deadline Monotonic (DM-NP) and Rate Monotonic
(RM-NP) scheduling in linear time, if the task orders according to priority and
period are given. This test leads to a better
upper bound on the speedup factor for
DM-NP and RM-NP in comparison to Earliest Deadline First (EDF-NP) than previously known,
closing the gap between lower and upper bound.
We improve our test, resulting in interesting properties of the blocking time
that allow us to determine schedulability by only considering the schedulability of
the preemptive case if some conditions are met. Furthermore, we present a
utilization bound for RM-NP, based on the ratio γ > 0 of the
upper bound of the maximum blocking time to the execution time, significantly improving previous results.
|
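For a concrete picture of how a blocking term enters a utilization-based test, the sketch below implements the classical blocking-aware RM condition $\sum_{j \le i} U_j + B_i/T_i \le i\,(2^{1/i}-1)$, where under non-preemptive scheduling $B_i$ is the longest lower-priority execution time. This is the textbook test, not the paper's improved linear-time analysis; the task representation is assumed.

```python
def np_blocking_rm_test(tasks):
    """Classical blocking-aware RM utilization test (sufficient only).

    tasks: list of (wcet, period), sorted by period (rate-monotonic order).
    Under non-preemptive scheduling, task i can be blocked for at most the
    longest WCET of any lower-priority task; check, for every i,
    sum_{j<=i} U_j + B_i / T_i <= i * (2^(1/i) - 1).
    """
    n = len(tasks)
    for i in range(n):
        u_hp = sum(c / p for c, p in tasks[: i + 1])
        blocking = max((c for c, _ in tasks[i + 1:]), default=0.0)
        if u_hp + blocking / tasks[i][1] > (i + 1) * (2 ** (1 / (i + 1)) - 1):
            return False
    return True
```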
| Olaf Neugebauer, Michael Engel and Peter Marwedel. Multi-Objective Aware Communication Optimization for Resource-Restricted Embedded Systems. In Proceedings of Architecture of Computing Systems. Proceedings, ARCS 2015 [BibTeX][PDF][Abstract]@inproceedings { neugebauer:2015:arcs,
author = {Neugebauer, Olaf and Engel, Michael and Marwedel, Peter},
title = {Multi-Objective Aware Communication Optimization for Resource-Restricted Embedded Systems},
booktitle = {Proceedings of Architecture of Computing Systems. Proceedings, ARCS},
year = {2015},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2015-neugebauer-arcs.pdf},
confidential = {n},
abstract = {Creating efficient parallel software for current embedded multicore systems is a complex and error-prone task. While automatic parallelization tools help to exploit the performance of multicores, most of these systems waste optimization opportunities since they neglect to consider hardware details such as communication performance and memory hierarchies. In addition, most tools do not allow multi-criterial optimization for objectives such as performance and energy. These approaches are especially relevant in the embedded domain. In this paper we present PICO, an approach that enables multi-objective optimization of embedded parallel programs. In combination with a state-of-the-art parallelization approach for sequential C code, PICO uses high-level models and simulators for performance and energy consumption optimization. As a result, PICO generates a set of Pareto-optimal solutions using a genetic algorithm-based optimization. These solutions allow an embedded system designer to choose a parallelization solution which exhibits a suitable trade-off between the required speedup and the resulting energy consumption according to a given system's requirements. Using PICO, we were able to reduce energy consumption by about 35% compared to the sequential execution for a heterogeneous architecture. Further, runtime reductions by roughly 55% were achieved for a benchmark on a homogeneous platform.},
} Creating efficient parallel software for current embedded multicore systems is a complex and error-prone task. While automatic parallelization tools help to exploit the performance of multicores, most of these systems waste optimization opportunities since they neglect to consider hardware details such as communication performance and memory hierarchies. In addition, most tools do not allow multi-criterial optimization for objectives such as performance and energy. These approaches are especially relevant in the embedded domain. In this paper we present PICO, an approach that enables multi-objective optimization of embedded parallel programs. In combination with a state-of-the-art parallelization approach for sequential C code, PICO uses high-level models and simulators for performance and energy consumption optimization. As a result, PICO generates a set of Pareto-optimal solutions using a genetic algorithm-based optimization. These solutions allow an embedded system designer to choose a parallelization solution which exhibits a suitable trade-off between the required speedup and the resulting energy consumption according to a given system's requirements. Using PICO, we were able to reduce energy consumption by about 35% compared to the sequential execution for a heterogeneous architecture. Further, runtime reductions by roughly 55% were achieved for a benchmark on a homogeneous platform.
|
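PICO reports a set of Pareto-optimal parallelizations over speedup and energy. The following sketch shows only the generic Pareto-filtering step for two minimization objectives (runtime, energy); the point representation is assumed, and this is not PICO's implementation.

```python
def pareto_front(points):
    """Keep the non-dominated candidates when minimizing both objectives.

    points: list of (runtime, energy) tuples for candidate parallelizations.
    A point is dominated if some other point is no worse in both
    objectives; only non-dominated points form the Pareto front offered
    to the system designer.
    """
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return front

# (12, 4) and (8, 7) trade runtime against energy; (11, 6) is dominated
print(pareto_front([(10, 5), (8, 7), (9, 6), (12, 4), (11, 6)]))
```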
| Olaf Neugebauer, Pascal Libuschewski, Michael Engel, Heinrich Mueller and Peter Marwedel. Plasmon-based Virus Detection on Heterogeneous Embedded Systems. In Proceedings of Workshop on Software & Compilers for Embedded Systems (SCOPES) 2015 [BibTeX][PDF][Abstract]@inproceedings { neugebauer2015:scopes,
author = {Neugebauer, Olaf and Libuschewski, Pascal and Engel, Michael and Mueller, Heinrich and Marwedel, Peter},
title = {Plasmon-based Virus Detection on Heterogeneous Embedded Systems},
booktitle = {Proceedings of Workshop on Software \& Compilers for Embedded Systems (SCOPES) },
year = {2015},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2015-neugebauer-scopes.pdf},
confidential = {n},
abstract = {Embedded systems, e.g. in computer vision applications, are expected to provide significant amounts of computing power to process large data volumes. Many of these systems, such as those used in medical diagnosis, are mobile devices and face significant challenges to provide sufficient performance while operating on a constrained energy budget.
Modern embedded MPSoC platforms use heterogeneous CPU and GPU cores providing a large number of optimization parameters. This allows finding useful trade-offs between energy consumption and performance for a given application. In this paper, we describe how the complex data processing required for PAMONO, a novel type of biosensor for the detection of biological viruses, can efficiently be implemented on a state-of-the-art heterogeneous MPSoC platform. An additional optimization dimension explored is the achieved quality of service. Reducing the virus detection accuracy enables additional optimizations not achievable by modifying hardware or software parameters alone.
Instead of relying on often inaccurate simulation models, our design space exploration employs a hardware-in-the-loop approach to evaluate the performance and energy consumption on the embedded target platform. Trade-offs between performance, energy and accuracy are controlled by a genetic algorithm running on a PC control system which deploys the evaluation tasks to a number of connected embedded boards. Using our optimization approach, we are able to achieve frame rates meeting the requirements without losing accuracy. Further, our approach is able to reduce the energy consumption by 93% with a still reasonable detection quality.},
} Embedded systems, e.g. in computer vision applications, are expected to provide significant amounts of computing power to process large data volumes. Many of these systems, such as those used in medical diagnosis, are mobile devices and face significant challenges to provide sufficient performance while operating on a constrained energy budget.
Modern embedded MPSoC platforms use heterogeneous CPU and GPU cores providing a large number of optimization parameters. This allows finding useful trade-offs between energy consumption and performance for a given application. In this paper, we describe how the complex data processing required for PAMONO, a novel type of biosensor for the detection of biological viruses, can efficiently be implemented on a state-of-the-art heterogeneous MPSoC platform. An additional optimization dimension explored is the achieved quality of service. Reducing the virus detection accuracy enables additional optimizations not achievable by modifying hardware or software parameters alone.
Instead of relying on often inaccurate simulation models, our design space exploration employs a hardware-in-the-loop approach to evaluate the performance and energy consumption on the embedded target platform. Trade-offs between performance, energy and accuracy are controlled by a genetic algorithm running on a PC control system which deploys the evaluation tasks to a number of connected embedded boards. Using our optimization approach, we are able to achieve frame rates meeting the requirements without losing accuracy. Further, our approach is able to reduce the energy consumption by 93% with a still reasonable detection quality.
|
| Peter Munk, Matthias Freier, Jan Richling and Jian-Jia Chen. Dynamic Guaranteed Service Communication on Best-Effort Networks-on-Chip. In 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, {PDP} 2015, Turku, Finland, March 4-6, 2015, pages 353--360 2015 [BibTeX][PDF][Abstract]@inproceedings { DBLP:conf/pdp/MunkFRC15,
author = {Munk, Peter and Freier, Matthias and Richling, Jan and Chen, Jian-Jia},
title = {Dynamic Guaranteed Service Communication on Best-Effort Networks-on-Chip},
booktitle = {23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, {PDP} 2015, Turku, Finland, March 4-6, 2015},
year = {2015},
pages = {353--360},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2015-chen-PDP.pdf},
confidential = {n},
abstract = {In order to execute applications under real-time constraints on many-core processors with a Network-on-Chip (NoC), guaranteed service (GS) communication with guaranteed end-to-end latency and bandwidth is required. Several hardware-based solutions for GS communication have been proposed in the literature. However, commercially available many-core processors, e.g., Tilera's TILEPro64 or Adapteva's Epiphany, do not support such features. In this paper, we propose a software solution that allows GS communication on 2D-mesh packet-switching NoCs. Our investigation is based on a hardware model that is applicable to commercially available processors, which include multiple NoCs to separate request and response packets and support only best-effort communication. We prove that a common upper bound of the injection rate for all sources limits the congestion, which leads to an upper bound of the worst-case transmission latency (WCTL) for any transmission, i.e., the combination of a request and a response packet. Furthermore, our approach supports arbitrary transmission streams that can be modified at runtime without violating the upper bound of the WCTL, as long as the injection rate is not violated. This enables adaptive features such as task migration or dynamic scheduling policies. Experiments evaluate our solution for different traffic patterns.},
} In order to execute applications under real-time constraints on many-core processors with a Network-on-Chip (NoC), guaranteed service (GS) communication with guaranteed end-to-end latency and bandwidth is required. Several hardware-based solutions for GS communication have been proposed in the literature. However, commercially available many-core processors, e.g., Tilera's TILEPro64 or Adapteva's Epiphany, do not support such features. In this paper, we propose a software solution that allows GS communication on 2D-mesh packet-switching NoCs. Our investigation is based on a hardware model that is applicable to commercially available processors, which include multiple NoCs to separate request and response packets and support only best-effort communication. We prove that a common upper bound of the injection rate for all sources limits the congestion, which leads to an upper bound of the worst-case transmission latency (WCTL) for any transmission, i.e., the combination of a request and a response packet. Furthermore, our approach supports arbitrary transmission streams that can be modified at runtime without violating the upper bound of the WCTL, as long as the injection rate is not violated. This enables adaptive features such as task migration or dynamic scheduling policies. Experiments evaluate our solution for different traffic patterns.
|
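The software mechanism rests on bounding each source's injection rate. One common way to enforce such a bound, sketched below with assumed parameter names, is a token bucket at every source; the paper's exact enforcement mechanism and the WCTL derivation are in the text itself.

```python
class InjectionLimiter:
    """Token-bucket rate limiter: at most `rate` packets per time unit,
    with bursts bounded by `burst`. Enforcing such a bound at every
    source is the premise under which a WCTL bound can hold."""

    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, 0.0

    def try_inject(self, now):
        # Refill tokens for the elapsed time, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True   # packet may enter the NoC
        return False      # hold the packet to respect the injection-rate bound
```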
| Santiago Pagani, Jian-Jia Chen, Muhammad Shafique and Jörg Henkel. MatEx: efficient transient and peak temperature computation for compact thermal models. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition, (DATE), pages 1515--1520 Grenoble, France, , March 9-13 2015 [BibTeX][PDF][Abstract]@inproceedings { DBLP:conf/date/PaganiCSH15,
author = {Pagani, Santiago and Chen, Jian-Jia and Shafique, Muhammad and Henkel, J\"org},
title = {MatEx: efficient transient and peak temperature computation for compact thermal models},
booktitle = {Proceedings of the 2015 Design, Automation \& Test in Europe Conference \& Exhibition, (DATE)},
year = {2015},
pages = {1515--1520},
address = {Grenoble, France, },
month = {March 9-13},
file = {http://cesweb.itec.kit.edu/~pagani/pubs/Pagani-DATE-2015-MatEx.pdf},
confidential = {n},
abstract = {In many-core systems, run-time scheduling decisions, such as task migration, core activations/deactivations, voltage/frequency scaling, etc., are typically used to optimize resource usage. Such run-time decisions change the power consumption, which can in turn result in transient temperatures much higher than in any steady-state scenario. Therefore, to be thermally safe, it is important to evaluate the transient peaks before making resource management decisions. This paper presents a method for computing these transient peaks in just a few milliseconds, which is suited for run-time usage. This technique works for any compact thermal model consisting of a system of first-order differential equations, for example, RC thermal networks. Instead of using regular numerical methods, our algorithm is based on analytically solving the differential equations using matrix exponentials and linear algebra. This results in a mathematical expression which can easily be analyzed and differentiated to compute the maximum transient temperatures. Moreover, our method can also be used to efficiently compute all transient temperatures for any given time resolution without accuracy losses. We implement our solution as an open-source tool called MatEx. Our experimental evaluations show that the execution time of MatEx for peak temperature computation can be bounded to no more than 2.5 ms for systems with 76 thermal nodes, and to no more than 26.6 ms for systems with 268 thermal nodes, which is three orders of magnitude faster than the state-of-the-art for the same settings.},
} In many-core systems, run-time scheduling decisions, such as task migration, core activations/deactivations, voltage/frequency scaling, etc., are typically used to optimize resource usage. Such run-time decisions change the power consumption, which can in turn result in transient temperatures much higher than in any steady-state scenario. Therefore, to be thermally safe, it is important to evaluate the transient peaks before making resource management decisions. This paper presents a method for computing these transient peaks in just a few milliseconds, which is suited for run-time usage. This technique works for any compact thermal model consisting of a system of first-order differential equations, for example, RC thermal networks. Instead of using regular numerical methods, our algorithm is based on analytically solving the differential equations using matrix exponentials and linear algebra. This results in a mathematical expression which can easily be analyzed and differentiated to compute the maximum transient temperatures. Moreover, our method can also be used to efficiently compute all transient temperatures for any given time resolution without accuracy losses. We implement our solution as an open-source tool called MatEx. Our experimental evaluations show that the execution time of MatEx for peak temperature computation can be bounded to no more than 2.5 ms for systems with 76 thermal nodes, and to no more than 26.6 ms for systems with 268 thermal nodes, which is three orders of magnitude faster than the state-of-the-art for the same settings.
|
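The analytical core described in the abstract is the closed-form solution of a linear thermal model: for $\dot{T}(t) = A\,T(t) + B$ with constant input $B$, one has $T(t) = e^{At}\,T(0) + A^{-1}(e^{At} - I)\,B$. The sketch below evaluates this expression numerically; it is a minimal illustration, not the MatEx tool, and the matrices are assumed given.

```python
import numpy as np
from scipy.linalg import expm

def transient_temperature(A, B, T0, t):
    """Closed-form solution of dT/dt = A T + B at time t.

    A:  (n, n) system matrix of the compact thermal model (assumed invertible)
    B:  (n,) constant input vector (power and ambient contribution)
    T0: (n,) initial temperatures
    """
    E = expm(A * t)   # matrix exponential e^{At}
    return E @ T0 + np.linalg.solve(A, (E - np.eye(len(T0))) @ B)
```

Because the expression is analytic in t, it can be differentiated to locate transient peaks directly instead of stepping through time numerically, which is the efficiency argument the abstract makes.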
| Santiago Pagani, Muhammad Shafique, Heba Khdr, Jian-Jia Chen and Jörg Henkel. seBoost: Selective Boosting for Heterogeneous Manycores. In International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS) Amsterdam, Netherlands, October 4-9 2015 [BibTeX][PDF][Abstract]@inproceedings { DBLP:conf/codes/PaganiCSH15,
author = {Pagani, Santiago and Shafique, Muhammad and Khdr, Heba and Chen, Jian-Jia and Henkel, J\"org},
title = {seBoost: Selective Boosting for Heterogeneous Manycores},
booktitle = {International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS)},
year = {2015},
address = {Amsterdam, Netherlands},
month = {October 4-9},
file = {http://cesweb.itec.kit.edu/~pagani/pubs/Pagani-DATE-2015-MatEx.pdf},
confidential = {n},
abstract = {Boosting techniques have been widely adopted in commercial multicore and manycore systems, mainly because they provide means to satisfy performance requirement surges, for one or more cores, at run-time. Current boosting techniques select the boosting levels (for boosted cores) and the throttle-down levels (for non-boosted cores) either arbitrarily or through step-wise control approaches. These methods might result in unnecessary performance losses for the non-boosted cores, in short boosting intervals, in failing to satisfy the required performance surges, or in unnecessarily
high power and energy consumption. This paper presents an efficient and lightweight run-time boosting technique based
on transient temperature estimation, called seBoost. Our technique guarantees meeting the performance requirement
surges at run-time, thus maximizing the boosting time with a minimum loss of performance for the non-boosted cores.},
} Boosting techniques have been widely adopted in commercial multicore and manycore systems, mainly because they provide means to satisfy performance requirement surges, for one or more cores, at run-time. Current boosting techniques select the boosting levels (for boosted cores) and the throttle-down levels (for non-boosted cores) either arbitrarily or through step-wise control approaches. These methods might result in unnecessary performance losses for the non-boosted cores, in short boosting intervals, in failing to satisfy the required performance surges, or in unnecessarily
high power and energy consumption. This paper presents an efficient and lightweight run-time boosting technique based
on transient temperature estimation, called seBoost. Our technique guarantees meeting the performance requirement
surges at run-time, thus maximizing the boosting time with a minimum loss of performance for the non-boosted cores.
|
| Jing Li, Jian-Jia Chen, Kunal Agrawal, Chenyang Lu, Chris Gill and Abusayeed Saifullah. Analysis of Federated and Global Scheduling for Parallel Tasks. In Proceedings of the 26th Euromicro Conference on Real-Time Systems, Madrid, Spain, July 8-11, 2014 2014 [BibTeX][Abstract]@inproceedings { Li:ecrts14,
author = {Li, Jing and Chen, Jian-Jia and Agrawal, Kunal and Lu, Chenyang and Gill, Chris and Saifullah, Abusayeed},
title = {Analysis of Federated and Global Scheduling for Parallel Tasks},
booktitle = {Proceedings of the 26th Euromicro Conference on Real-Time Systems, Madrid, Spain, July 8-11, 2014},
year = {2014},
confidential = {n},
abstract = {This paper considers the scheduling of parallel real-time tasks with implicit deadlines. Each parallel task is characterized as a general directed acyclic graph (DAG). We analyze three different real-time scheduling strategies: two well known algorithms, namely global earliest-deadline-first and global rate-monotonic, and one new algorithm, namely federated scheduling. The federated scheduling algorithm proposed in this paper is a generalization of partitioned scheduling to parallel tasks. In this strategy, each high-utilization task (utilization $\ge 1$) is assigned a set of dedicated cores and the remaining low-utilization tasks share the remaining cores. We prove capacity augmentation bounds for all three schedulers. In particular, we show that if on unit-speed cores, a task set has total utilization of at most $m$ and the critical-path length of each task is smaller than its deadline, then federated scheduling can schedule that task set on $m$ cores of speed 2; G-EDF can schedule it with speed $\frac{3+\sqrt{5}}{2} \approx 2.618$; and G-RM can schedule it with speed $2+\sqrt{3}\approx 3.732$. We also provide lower bounds on the speedup and show that the bounds are tight for federated scheduling and G-EDF when $m$ is sufficiently large.},
} This paper considers the scheduling of parallel real-time tasks with implicit deadlines. Each parallel task is characterized as a general directed acyclic graph (DAG). We analyze three different real-time scheduling strategies: two well known algorithms, namely global earliest-deadline-first and global rate-monotonic, and one new algorithm, namely federated scheduling. The federated scheduling algorithm proposed in this paper is a generalization of partitioned scheduling to parallel tasks. In this strategy, each high-utilization task (utilization $\ge 1$) is assigned a set of dedicated cores and the remaining low-utilization tasks share the remaining cores. We prove capacity augmentation bounds for all three schedulers. In particular, we show that if on unit-speed cores, a task set has total utilization of at most $m$ and the critical-path length of each task is smaller than its deadline, then federated scheduling can schedule that task set on $m$ cores of speed 2; G-EDF can schedule it with speed $\frac{3+\sqrt{5}}{2} \approx 2.618$; and G-RM can schedule it with speed $2+\sqrt{3} \approx 3.732$. We also provide lower bounds on the speedup and show that the bounds are tight for federated scheduling and G-EDF when $m$ is sufficiently large.
|
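For a high-utilization DAG task with total work $C_i$, critical-path length $L_i$ and deadline $D_i$, the federated strategy described above dedicates $n_i = \lceil (C_i - L_i)/(D_i - L_i) \rceil$ cores, which suffices by the standard greedy-scheduler bound $L_i + (C_i - L_i)/n_i \le D_i$. A minimal sketch under these assumed variable names:

```python
from math import ceil

def federated_core_assignment(tasks):
    """Dedicated core counts for high-utilization tasks (federated scheduling).

    tasks: list of (work, critical_path, deadline) for DAG tasks with
    utilization work/deadline >= 1. By the greedy (work-conserving)
    makespan bound L + (C - L)/n <= D, it suffices to grant each task
    n_i = ceil((C_i - L_i) / (D_i - L_i)) cores of its own.
    """
    return [ceil((c - l) / (d - l)) for c, l, d in tasks]

# One task: work 20, critical path 4, deadline 8 -> ceil(16 / 4) = 4 cores
print(federated_core_assignment([(20, 4, 8)]))
```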
| Cong Liu and Jian-Jia Chen. Bursty-Interference Analysis Techniques for Analyzing Complex Real-Time Task Models. In Proceedings of the 35th IEEE Real-Time Systems Symposium (RTSS), Rome, Italy, December 2-5, 2014 2014 [BibTeX][Abstract]@inproceedings { Liu:RTSS14a,
author = {Liu, Cong and Chen, Jian-Jia},
title = {Bursty-Interference Analysis Techniques for Analyzing Complex Real-Time Task Models},
booktitle = {Proceedings of the 35th IEEE Real-Time Systems Symposium (RTSS), Rome, Italy, December 2-5, 2014},
year = {2014},
confidential = {n},
abstract = {Due to the recent trend towards building complex real-time cyber-physical systems, system designers need to develop and choose expressive formal models for representing such systems, as the model should be adequately expressive such that it can accurately convey the relevant characteristics of the system being modeled. Compared to the classical sporadic task model, there exist a number of real-time task models that are more expressive. However, such models are often complex and thus are rather difficult to be analyzed efficiently. Due to this reason, prior analysis methods for dealing with such complex task models are pessimistic. In this paper, a novel analysis technique, namely the bursty-interference analysis, is presented for analyzing two common expressive real-time task models, the general self-suspending task model and the deferrable server task model. This technique is used to derive new uniprocessor utilization-based schedulability tests and rate-monotonic utilization bounds for the two considered task models scheduled under rate-monotonic scheduling. Extensive experiments presented herein show that our proposed tests improve upon prior tests in all scenarios, in many cases by a wide margin. To the best of our knowledge, these are the first analysis techniques that can efficiently handle the general self-suspending and deferrable server task models on uniprocessors.},
} Due to the recent trend towards building complex real-time cyber-physical systems, system designers need to develop and choose expressive formal models for representing such systems, as the model should be adequately expressive such that it can accurately convey the relevant characteristics of the system being modeled. Compared to the classical sporadic task model, there exist a number of real-time task models that are more expressive. However, such models are often complex and thus are rather difficult to be analyzed efficiently. Due to this reason, prior analysis methods for dealing with such complex task models are pessimistic. In this paper, a novel analysis technique, namely the bursty-interference analysis, is presented for analyzing two common expressive real-time task models, the general self-suspending task model and the deferrable server task model. This technique is used to derive new uniprocessor utilization-based schedulability tests and rate-monotonic utilization bounds for the two considered task models scheduled under rate-monotonic scheduling. Extensive experiments presented herein show that our proposed tests improve upon prior tests in all scenarios, in many cases by a wide margin. To the best of our knowledge, these are the first analysis techniques that can efficiently handle the general self-suspending and deferrable server task models on uniprocessors.
|
| Cong Liu, Jian-Jia Chen, Liang He and Yu Gu. Analysis Techniques for Supporting Harmonic Real-Time Tasks with Suspensions. In Proceedings of the 26th Euromicro Conference on Real-Time Systems, Madrid, Spain, July 8-11, 2014 2014 [BibTeX][Abstract]@inproceedings { Liu:ecrts14,
author = {Liu, Cong and Chen, Jian-Jia and He, Liang and Gu, Yu},
title = {Analysis Techniques for Supporting Harmonic Real-Time Tasks with Suspensions},
booktitle = {Proceedings of the 26th Euromicro Conference on Real-Time Systems, Madrid, Spain, July 8-11, 2014},
year = {2014},
confidential = {n},
abstract = {In many real-time systems, tasks may experience suspension delays when they block to access shared resources or interact with external devices such as I/O. It is known that such suspension delays may negatively impact schedulability. Particularly in hard real-time systems, a few negative results exist on analyzing the schedulability of such systems, even for very restricted suspending task models on a uniprocessor. In this paper, we focus on the particular case of hard real-time suspending task systems with harmonic periods, which is a special case of practical relevance. We propose a new uniprocessor suspension-aware analysis technique for supporting such task systems under rate-monotonic scheduling. Our analysis technique is able to achieve only Θ(1) suspension-related utilization loss on a uniprocessor. Based upon this technique, we further propose a partitioning scheme that supports suspending task systems with harmonic periods on multiprocessors. The resulting schedulability test shows that compared to existing schedulability tests designed for ordinary non-suspending task systems, suspensions only result in Θ(m) additional suspension-related utilization loss, where m is the number of processors. Furthermore, experiments presented herein show that both our uniprocessor and multiprocessor schedulability tests improve upon prior approaches by a significant margin.},
} In many real-time systems, tasks may experience suspension delays when they block to access shared resources or interact with external devices such as I/O. It is known that such suspension delays may negatively impact schedulability. Particularly in hard real-time systems, a few negative results exist on analyzing the schedulability of such systems, even for very restricted suspending task models on a uniprocessor. In this paper, we focus on the particular case of hard real-time suspending task systems with harmonic periods, which is a special case of practical relevance. We propose a new uniprocessor suspension-aware analysis technique for supporting such task systems under rate-monotonic scheduling. Our analysis technique is able to achieve only Θ(1) suspension-related utilization loss on a uniprocessor. Based upon this technique, we further propose a partitioning scheme that supports suspending task systems with harmonic periods on multiprocessors. The resulting schedulability test shows that compared to existing schedulability tests designed for ordinary non-suspending task systems, suspensions only result in Θ(m) additional suspension-related utilization loss, where m is the number of processors. Furthermore, experiments presented herein show that both our uniprocessor and multiprocessor schedulability tests improve upon prior approaches by a significant margin.
|
| Jian-Jia Chen and Cong Liu. Fixed-Relative-Deadline Scheduling of Hard Real-Time Tasks with Self-Suspensions. In Proceedings of the 35th IEEE Real-Time Systems Symposium (RTSS), Rome, Italy, December 2-5, 2014 2014, We identified a typo in the schedulability test in Theorem 3 on 13, May, 2015. Short summary [BibTeX][Abstract]@inproceedings { Chen:RTSS14a,
author = {Chen, Jian-Jia and Liu, Cong},
title = {Fixed-Relative-Deadline Scheduling of Hard Real-Time Tasks with Self-Suspensions},
booktitle = {Proceedings of the 35th IEEE Real-Time Systems Symposium (RTSS), Rome, Italy, December 2-5, 2014},
year = {2014},
note = {We identified a typo in the schedulability test in Theorem 3 on 13, May, 2015. Short summary },
confidential = {n},
abstract = {In many real-time systems, tasks may experience self-suspension delays when accessing external devices. The problem of scheduling such self-suspending tasks to meet hard deadlines on a uniprocessor is known to be $\mathcal{NP}$-hard in the strong sense. Current solutions including the common suspension-oblivious approach of treating all suspensions as computation can be quite pessimistic. This paper shows that another category of scheduling algorithms, namely fixed-relative-deadline (FRD) scheduling, may yield better performance than classical schedulers such as EDF and RM, for real-time tasks that may experience one self-suspension during the execution of a task instance. We analyze a simple FRD algorithm, namely EDA, and derive corresponding pseudo-polynomial-time and linear-time schedulability tests. To analyze the quality of EDA and its schedulability tests, we analyze their resource augmentation factors, with respect to the speed-up factor that is needed to ensure the schedulability and feasibility of the resulting schedule. Specifically, the speed-up factor of EDA is $2$ and $3$, when referring to the optimal FRD scheduling and any feasible arbitrary scheduling, respectively. Moreover, the speed-up factor of the proposed linear-time schedulability test is $2.787$ and $4.875$, when referring to the optimal FRD scheduling and any feasible arbitrary scheduling, respectively. Furthermore, extensive experiments presented herein show that our proposed linear-time schedulability test improves upon prior approaches by a significant margin. To our best knowledge, for the scheduling of self-suspending tasks, these are the first results of any sort that indicate it might be possible to design good approximation algorithms.},
} In many real-time systems, tasks may experience self-suspension delays when accessing external devices. The problem of scheduling such self-suspending tasks to meet hard deadlines on a uniprocessor is known to be $\mathcal{NP}$-hard in the strong sense. Current solutions including the common suspension-oblivious approach of treating all suspensions as computation can be quite pessimistic. This paper shows that another category of scheduling algorithms, namely fixed-relative-deadline (FRD) scheduling, may yield better performance than classical schedulers such as EDF and RM, for real-time tasks that may experience one self-suspension during the execution of a task instance. We analyze a simple FRD algorithm, namely EDA, and derive corresponding pseudo-polynomial-time and linear-time schedulability tests. To analyze the quality of EDA and its schedulability tests, we analyze their resource augmentation factors, with respect to the speed-up factor that is needed to ensure the schedulability and feasibility of the resulting schedule. Specifically, the speed-up factor of EDA is $2$ and $3$, when referring to the optimal FRD scheduling and any feasible arbitrary scheduling, respectively. Moreover, the speed-up factor of the proposed linear-time schedulability test is $2.787$ and $4.875$, when referring to the optimal FRD scheduling and any feasible arbitrary scheduling, respectively. Furthermore, extensive experiments presented herein show that our proposed linear-time schedulability test improves upon prior approaches by a significant margin. To our best knowledge, for the scheduling of self-suspending tasks, these are the first results of any sort that indicate it might be possible to design good approximation algorithms.
|
| Santiago Pagani, Heba Khdr, Waqaas Munawar, Jian-Jia Chen, Muhammad Shafique, Minming Li and Jörg Henkel. {TSP}: Thermal Safe Power - Efficient power budgeting for Many-Core Systems in Dark Silicon. In IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS) New Delhi, India, October 2014 2014, Best Paper Award, TSP tool is available here [BibTeX][Abstract]@inproceedings { Pagani-TSP14,
author = {Pagani, Santiago and Khdr, Heba and Munawar, Waqaas and Chen, Jian-Jia and Shafique, Muhammad and Li, Minming and Henkel, J\"org},
title = {{TSP}: Thermal Safe Power - Efficient power budgeting for Many-Core Systems in Dark Silicon},
booktitle = {IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS) New Delhi, India, October 2014},
year = {2014},
note = {Best Paper Award, TSP tool is available here},
confidential = {n},
abstract = {Chip manufacturers provide the Thermal Design Power (TDP) for a specific chip. The cooling solution is designed to dissipate this power level. But because TDP is not necessarily the maximum power that can be applied, chips are operated with Dynamic Thermal Management (DTM) techniques. To avoid excessive triggers of DTM, usually, system designers also use TDP as power constraint. However, using a single and constant value as power constraint, e.g., TDP, can result in significant performance losses in many-core systems. Having better power budgeting techniques is a major step towards dealing with the dark silicon problem. This paper presents a new power budget concept, called Thermal Safe Power (TSP), which is an abstraction that provides safe power constraint values as a function of the number of simultaneously operating cores. Executing cores at any power consumption below TSP ensures that DTM is not triggered. TSP can be computed offline for the worst cases, or online for a particular mapping of cores. Our simulations show that using TSP as power constraint results in 50.5\% and 14.2\% higher average performance, compared to using constant power budgets (both per-chip and per-core) and a boosting technique, respectively. Moreover, TSP results in dark silicon estimations which are more optimistic than estimations using constant power budgets.},
} Chip manufacturers provide the Thermal Design Power (TDP) for a specific chip. The cooling solution is designed to dissipate this power level. But because TDP is not necessarily the maximum power that can be applied, chips are operated with Dynamic Thermal Management (DTM) techniques. To avoid excessive triggers of DTM, usually, system designers also use TDP as power constraint. However, using a single and constant value as power constraint, e.g., TDP, can result in significant performance losses in many-core systems. Having better power budgeting techniques is a major step towards dealing with the dark silicon problem. This paper presents a new power budget concept, called Thermal Safe Power (TSP), which is an abstraction that provides safe power constraint values as a function of the number of simultaneously operating cores. Executing cores at any power consumption below TSP ensures that DTM is not triggered. TSP can be computed offline for the worst cases, or online for a particular mapping of cores. Our simulations show that using TSP as power constraint results in 50.5% and 14.2% higher average performance, compared to using constant power budgets (both per-chip and per-core) and a boosting technique, respectively. Moreover, TSP results in dark silicon estimations which are more optimistic than estimations using constant power budgets.
|
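The idea behind TSP can be illustrated with a steady-state linear thermal model, where the temperature rise is $G^{-1}P$: for a given set of active cores one can solve directly for the largest uniform per-core power that keeps every node below the DTM trigger temperature. The sketch below is only this simplified steady-state variant, not the paper's TSP computation (which also covers worst-case mappings and transients); the matrix and all names are assumptions.

```python
import numpy as np

def thermal_safe_power(G_inv, active, T_dtm, T_amb):
    """Uniform per-core power budget for a given set of active cores.

    G_inv:  (n, n) inverse thermal conductance matrix, so that the
            steady-state temperature rise for power vector P is G_inv @ P.
    active: indices of the simultaneously active cores.
    Returns the largest per-core power p such that no thermal node
    exceeds T_dtm in steady state (ambient offset T_amb assumed additive).
    """
    rise_per_watt = G_inv[:, active].sum(axis=1)  # K per watt of uniform power
    return (T_dtm - T_amb) / rise_per_watt.max()
```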
| Waqaas Munawar, Heba Khdr, Santiago Pagani, Muhammad Shafique, Jian-Jia Chen and Jörg Henkel. Peak Power Management for Scheduling Real-time Tasks on Heterogeneous Many-Core Systems. In The 20th IEEE International Conference on Parallel and Distributed Systems, (ICPADS), Hsinchu, Taiwan, Dec 16-19, 2014 2014 [BibTeX][PDF][Abstract]@inproceedings { munawarPeak14,
author = {Munawar, Waqaas and Khdr, Heba and Pagani, Santiago and Shafique, Muhammad and Chen, Jian-Jia and Henkel, J{\"o}rg},
title = {Peak Power Management for Scheduling Real-time Tasks on Heterogeneous Many-Core Systems},
booktitle = {The 20th IEEE International Conference on Parallel and Distributed Systems, (ICPADS), Hsinchu, Taiwan, Dec 16-19, 2014},
year = {2014},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2014-munawar-icpads.pdf},
confidential = {n},
abstract = {The number and diversity of cores in on-chip systems is increasing rapidly. However, due to the Thermal Design Power (TDP) constraint, it is not possible to continuously operate all cores at the same time. Exceeding the TDP constraint may activate Dynamic Thermal Management (DTM) to ensure thermal stability. Such hardware-based closed-loop safeguards pose a major challenge in using many-core chips for real-time tasks. Managing the worst-case peak power usage of a chip can help toward resolving this issue. We present a scheme to minimize the peak power usage for frame-based and periodic real-time tasks on many-core processors by scheduling the sleep cycles for each active core and introduce the concept of a sufficient test for peak power consumption for task feasibility. We consider both inter-task and inter-core diversity in terms of power usage and present computationally efficient algorithms for peak power minimization for these cases, i.e., from the special case of homogeneous tasks on homogeneous cores to the general case of heterogeneous tasks on heterogeneous cores. We evaluate our solution through extensive simulations using the 48-core SCC platform and the gem5 architecture simulator. Our simulation results show the efficacy of our scheme.},
} The number and diversity of cores in on-chip systems is increasing rapidly. However, due to the Thermal Design Power (TDP) constraint, it is not possible to continuously operate all cores at the same time. Exceeding the TDP constraint may activate Dynamic Thermal Management (DTM) to ensure thermal stability. Such hardware-based closed-loop safeguards pose a major challenge in using many-core chips for real-time tasks. Managing the worst-case peak power usage of a chip can help toward resolving this issue. We present a scheme to minimize the peak power usage for frame-based and periodic real-time tasks on many-core processors by scheduling the sleep cycles for each active core and introduce the concept of a sufficient test for peak power consumption for task feasibility. We consider both inter-task and inter-core diversity in terms of power usage and present computationally efficient algorithms for peak power minimization for these cases, i.e., from the special case of homogeneous tasks on homogeneous cores to the general case of heterogeneous tasks on heterogeneous cores. We evaluate our solution through extensive simulations using the 48-core SCC platform and the gem5 architecture simulator. Our simulation results show the efficacy of our scheme.
|
| Anas Toma, Jian-Jia Chen and Wei Liu. Computation Offloading for Sporadic Real-Time Tasks. In 20th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), Chongqing, China, August 2014 2014 [BibTeX][PDF][Abstract]@inproceedings { TomaCL-RTCSA14,
author = {Toma, Anas and Chen, Jian-Jia and Liu, Wei},
title = {Computation Offloading for Sporadic Real-Time Tasks},
booktitle = {20th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), Chongqing, China, August 2014},
year = {2014},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2014-toma-rtcsa.pdf},
confidential = {n},
abstract = {Applications for mobile devices are becoming increasingly sophisticated. They include computation-intensive tasks, such as video and audio processing. However, mobile devices have limited resources, which may make it difficult to finish these tasks in time. Computation offloading can be used to boost the capabilities of these resource-constrained devices, where the computation-intensive tasks are moved to a powerful remote processing unit. This paper considers the computation offloading problem for sporadic real-time tasks. The total bandwidth server (TBS) is adopted on the remote processing unit (the server side) for resource reservation. On the client side, a dynamic programming algorithm is proposed to determine the offloading decision of the tasks such that their schedule is feasible (i.e., all the tasks meet their deadlines). The algorithm is evaluated using a case study of a surveillance system and synthesized benchmarks.},
} Applications for mobile devices are becoming increasingly sophisticated. They include computation-intensive tasks, such as video and audio processing. However, mobile devices have limited resources, which may make it difficult to finish these tasks in time. Computation offloading can be used to boost the capabilities of these resource-constrained devices, where the computation-intensive tasks are moved to a powerful remote processing unit. This paper considers the computation offloading problem for sporadic real-time tasks. The total bandwidth server (TBS) is adopted on the remote processing unit (the server side) for resource reservation. On the client side, a dynamic programming algorithm is proposed to determine the offloading decision of the tasks such that their schedule is feasible (i.e., all the tasks meet their deadlines). The algorithm is evaluated using a case study of a surveillance system and synthesized benchmarks.
|
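On the server side, a Total Bandwidth Server with reserved bandwidth $U_s$ assigns the $k$-th accepted job the absolute deadline $d_k = \max(a_k, d_{k-1}) + C_k/U_s$. The sketch below implements only this standard TBS rule; it is not the paper's offloading algorithm, and the class and parameter names are assumptions.

```python
class TotalBandwidthServer:
    """Standard TBS deadline assignment for aperiodic/offloaded jobs."""

    def __init__(self, utilization):
        self.u = utilization     # server bandwidth U_s reserved on the remote unit
        self.last_deadline = 0.0

    def assign_deadline(self, arrival, wcet):
        # d_k = max(a_k, d_{k-1}) + C_k / U_s
        self.last_deadline = max(arrival, self.last_deadline) + wcet / self.u
        return self.last_deadline

tbs = TotalBandwidthServer(utilization=0.5)
print(tbs.assign_deadline(arrival=0.0, wcet=2.0))  # 4.0
print(tbs.assign_deadline(arrival=1.0, wcet=1.0))  # 6.0
```

Because the assigned deadlines never demand more than the reserved bandwidth, offloaded jobs can be scheduled alongside the server's other EDF workload without jeopardizing its guarantees.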
| Helena Kotthaus, Ingo Korb, Markus Künne and Peter Marwedel. Performance Analysis for R: Towards a Faster R Interpreter. In Abstract Booklet of the International R User Conference (UseR!), pages 104 Los Angeles, USA, July 2014 [BibTeX][Link]@inproceedings { kotthaus/2014b,
author = {Kotthaus, Helena and Korb, Ingo and K\"unne, Markus and Marwedel, Peter},
title = {Performance Analysis for R: Towards a Faster R Interpreter},
booktitle = {Abstract Booklet of the International R User Conference (UseR!)},
year = {2014},
pages = {104},
address = {Los Angeles, USA},
month = {July},
url = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/kotthaus_user2014.pdf},
confidential = {n},
} |
| Helena Kotthaus, Ingo Korb, Michael Engel and Peter Marwedel. Dynamic Page Sharing Optimization for the R Language . In Proceedings of the 10th Symposium on Dynamic Languages, pages 79--90 Portland, Oregon, USA, October 2014 [BibTeX][PDF][Link][Abstract]@inproceedings { kotthaus/2014e,
author = {Kotthaus, Helena and Korb, Ingo and Engel, Michael and Marwedel, Peter},
title = {Dynamic Page Sharing Optimization for the R Language },
booktitle = {Proceedings of the 10th Symposium on Dynamic Languages},
year = {2014},
series = {DLS '14},
pages = {79--90},
address = {Portland, Oregon, USA},
month = {oct},
publisher = {ACM},
url = {http://dl.acm.org/citation.cfm?id=2661094},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2014e_kotthaus.pdf},
confidential = {n},
abstract = {
Dynamic languages such as R are increasingly used to process large data sets. Here, the R interpreter induces a large memory overhead due to wasteful memory allocation policies. If an application's working set exceeds the available physical memory, the OS starts to swap, resulting in slowdowns of several orders of magnitude. Thus, memory optimizations for R will be beneficial to many applications.
Existing R optimizations are mostly based on dynamic compilation or native libraries. Both methods are futile when the OS starts to page out memory. So far, only a few data-type- or application-specific memory optimizations for R exist. To remedy this situation, we present a low-overhead page sharing approach for R that significantly reduces the interpreter's memory overhead. Concentrating on the most rewarding optimizations avoids the high runtime overhead of existing generic approaches for memory deduplication or compression. In addition, by applying knowledge of interpreter data structures and memory allocation patterns, our approach is not constrained to specific R applications and is transparent to the R interpreter.
Our page sharing optimization enables us to reduce the memory consumption by up to 53.5% with an average of 18.0% for a set of real-world R benchmarks with a runtime overhead of only 5.3% on average. In cases where page I/O can be avoided, significant speedups are achieved.
},
}
Dynamic languages such as R are increasingly used to process large data sets. Here, the R interpreter induces a large memory overhead due to wasteful memory allocation policies. If an application's working set exceeds the available physical memory, the OS starts to swap, resulting in slowdowns of several orders of magnitude. Thus, memory optimizations for R will be beneficial to many applications.
Existing R optimizations are mostly based on dynamic compilation or native libraries. Both methods are futile when the OS starts to page out memory. So far, only a few data-type- or application-specific memory optimizations for R exist. To remedy this situation, we present a low-overhead page sharing approach for R that significantly reduces the interpreter's memory overhead. Concentrating on the most rewarding optimizations avoids the high runtime overhead of existing generic approaches for memory deduplication or compression. In addition, by applying knowledge of interpreter data structures and memory allocation patterns, our approach is not constrained to specific R applications and is transparent to the R interpreter.
Our page sharing optimization enables us to reduce the memory consumption by up to 53.5% with an average of 18.0% for a set of real-world R benchmarks with a runtime overhead of only 5.3% on average. In cases where page I/O can be avoided, significant speedups are achieved.
|
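At its core, content-based page sharing means detecting byte-identical pages and backing them with one physical copy. The sketch below shows only this generic detection step via content hashing; the paper's approach additionally exploits interpreter-internal knowledge to select candidate pages cheaply, which is not modeled here. Page size and representation are assumptions.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size in bytes

def sharable_pages(pages):
    """Group byte-identical pages by content hash.

    pages: dict mapping page id -> bytes object of length PAGE_SIZE.
    Returns lists of page ids whose contents are identical and could
    therefore be backed by a single copy-on-write physical page.
    """
    groups = {}
    for pid, content in pages.items():
        digest = hashlib.sha256(content).digest()
        groups.setdefault(digest, []).append(pid)
    return [ids for ids in groups.values() if len(ids) > 1]
```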
| Chen-Wei Huang, Timon Kelter, Bjoern Boenninghoff, Jan Kleinsorge, Michael Engel, Peter Marwedel and Shiao-Li Tsao. Static WCET Analysis of the H.264/AVC Decoder Exploiting Coding Information. In International Conference on Embedded and Real-Time Computing Systems and Applications Chongqing, China, August 2014 [BibTeX]@inproceedings { huang:2014:rtcsa,
author = {Huang, Chen-Wei and Kelter, Timon and Boenninghoff, Bjoern and Kleinsorge, Jan and Engel, Michael and Marwedel, Peter and Tsao, Shiao-Li},
title = {Static WCET Analysis of the H.264/AVC Decoder Exploiting Coding Information},
booktitle = {International Conference on Embedded and Real-Time Computing Systems and Applications},
year = {2014},
address = {Chongqing, China},
month = {August},
organization = {IEEE},
keywords = {wcet},
confidential = {n},
} |
| Andreas Heinig, Florian Schmoll, Peter Marwedel and Michael Engel. Who's Using that Memory? A Subscriber Model for Mapping Errors to Tasks. In Proceedings of the 10th Workshop on Silicon Errors in Logic - System Effects (SELSE) Stanford, CA, USA, April 2014 [BibTeX][PDF][Abstract]@inproceedings { heinig:2014:SELSE,
author = {Heinig, Andreas and Schmoll, Florian and Marwedel, Peter and Engel, Michael},
title = {Who's Using that Memory? A Subscriber Model for Mapping Errors to Tasks},
booktitle = {Proceedings of the 10th Workshop on Silicon Errors in Logic - System Effects (SELSE)},
year = {2014},
address = {Stanford, CA, USA},
month = {April},
keywords = {ders},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2014-heinig-selse2014.pdf},
confidential = {n},
abstract = {In order to assess the robustness of software-based fault-tolerance
methods, extensive tests have to be performed that
inject faults, such as bit flips, into hardware components of a running
system. Fault injection commonly uses either system simulations, resulting
in execution times orders of magnitude longer than on real systems, or
exposes a real system to error sources like radiation. This can take place
in real time, but it enables only a very coarse-grained control over the
affected system component.
A solution combining the best characteristics from both approaches should
achieve precise fault injection in real hardware systems. The approach
presented in this paper uses the JTAG background debug facility of a CPU
to inject faults into main memory and registers of a running system. Compared
to similar earlier approaches, our solution is able to achieve rapid
fault injection using a low-cost microcontroller instead of a complex
FPGA. Consequently, our injection software is much more flexible. It
allows to restrict error injection to the execution of a set of predefined
components, resulting in a more precise control of the injection, and
also emulates error reporting, which enables the evaluation
of different error detection approaches in addition to robustness
evaluation.
},
} In order to assess the robustness of software-based fault-tolerance
methods, extensive tests have to be performed that
inject faults, such as bit flips, into hardware components of a running
system. Fault injection commonly uses either system simulations, resulting
in execution times orders of magnitude longer than on real systems, or
exposes a real system to error sources like radiation. This can take place
in real time, but it enables only a very coarse-grained control over the
affected system component.
A solution combining the best characteristics from both approaches should
achieve precise fault injection in real hardware systems. The approach
presented in this paper uses the JTAG background debug facility of a CPU
to inject faults into main memory and registers of a running system. Compared
to similar earlier approaches, our solution is able to achieve rapid
fault injection using a low-cost microcontroller instead of a complex
FPGA. Consequently, our injection software is much more flexible. It
allows restricting error injection to the execution of a set of predefined
components, resulting in a more precise control of the injection, and
also emulates error reporting, which enables the evaluation
of different error detection approaches in addition to robustness
evaluation.
|
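The basic injection primitive discussed above is a single bit flip at a chosen location. The sketch below shows that primitive in a plain software simulation over a byte buffer; the paper performs the equivalent operation on a running system through the CPU's debug interface, which this sketch does not model.

```python
import random

def flip_random_bit(memory):
    """Inject a single-event upset: flip one random bit in `memory`.

    memory: a bytearray standing in for the target's RAM.
    Returns (byte_index, bit_index) so the experiment can log the fault
    and later correlate it with observed error reports.
    """
    byte_index = random.randrange(len(memory))
    bit_index = random.randrange(8)
    memory[byte_index] ^= 1 << bit_index
    return byte_index, bit_index
```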
| Timon Kelter and Peter Marwedel. Parallelism Analysis: Precise WCET Values for Complex Multi-Core Systems. In Third International Workshop on Formal Techniques for Safety-Critical Systems Luxembourg, November 2014 [BibTeX][PDF][Link]@inproceedings { kelter:2014:ftscs,
author = {Kelter, Timon and Marwedel, Peter},
title = {Parallelism Analysis: Precise WCET Values for Complex Multi-Core Systems},
booktitle = {Third International Workshop on Formal Techniques for Safety-Critical Systems},
year = {2014},
editor = {Cyrille Artho and Peter \"Olveczky},
series = {FTSCS},
address = {Luxembourg},
month = {November},
publisher = {Springer},
url = {http://www.ftscs.org/index.php?n=Main.Home},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2014-kelter-ftscs.pdf},
confidential = {n},
} |
| Timon Kelter, Peter Marwedel and Hendrik Borghorst. WCET-aware Scheduling Optimizations for Multi-Core Real-Time Systems. In International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), pages 67-74 Samos, Greece, July 2014 [BibTeX][PDF]@inproceedings { kelter:2014:samos,
author = {Kelter, Timon and Marwedel, Peter and Borghorst, Hendrik},
title = {WCET-aware Scheduling Optimizations for Multi-Core Real-Time Systems},
booktitle = {International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)},
year = {2014},
pages = {67-74},
address = {Samos, Greece},
month = {July},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2014-samos.pdf},
confidential = {n},
} |
| Bjoern Dusza, Peter Marwedel, Olaf Spinczyk and Christian Wietfeld. A Context-Aware Battery Lifetime Model for Carrier Aggregation Enabled LTE-A Systems. In IEEE Consumer Communications and Networking Conference Las Vegas, USA, January 2014 [BibTeX][Abstract]@inproceedings { marwedel:2014:ccnc,
author = {Dusza, Bjoern and Marwedel, Peter and Spinczyk, Olaf and Wietfeld, Christian},
title = {A Context-Aware Battery Lifetime Model for Carrier Aggregation Enabled LTE-A Systems},
booktitle = {IEEE Consumer Communications and Networking Conference},
year = {2014},
series = {CCNC},
address = {Las Vegas, USA},
month = {January},
organization = {IEEE},
keywords = {energy},
confidential = {n},
abstract = {A Quality of Experience (QoE) parameter of increasing importance is the time that a battery-powered
communication device (e.g. smartphone) can be operated before it needs to be recharged. However, because battery capacity is not evolving as fast as the power requirement, the battery lifetime of modern user equipment is
stagnating or even decreasing from one device generation to another. In parallel, a major challenge for the design of
next-generation wireless systems such as LTE-Advanced (LTE-A) is that the required large portion of spectrum is not
available as one contiguous block. For this reason, a procedure called interband non-continuous Carrier Aggregation
(CA) will be introduced in LTE-A which allows for the combination of multiple spectrum pieces from different frequency
bands. This procedure however requires the parallel operation of multiple power amplifiers that are characterized by a
high energy demand. In this paper, we quantify the impact of CA on the power consumption of LTE-A enabled communication by means of a Markovian based power consumption model that incorporates system parameters as well as context parameters. The results show that the suitability of CA does, from a battery lifetime perspective, strongly depend upon the actual device characteristics as well as the resource availability in the various frequency bands. Furthermore, the application of the sophisticated Kinetic Battery Model (KiBaM) shows that the charge recovery effect during idle periods does significantly affect the battery lifetime.},
} A Quality of Experience (QoE) parameter of increasing importance is the time that a battery powered
communication device (e.g. smartphone) can be operated before it needs to be recharged. However, due to the fact that battery capacity is not evolving as fast as the power requirement, the battery lifetime of modern user equipment is
stagnating or even decreasing from one device generation to another. In parallel, a major challenge for the design of
next generation wireless systems such as LTE-Advanced (LTE-A) is that the required high portion of spectrum is not
available in a consecutive portion. For this reason, a procedure called interband non-continuous Carrier Aggregation
(CA) will be introduced in LTE-A which allows for the combination of multiple spectrum pieces from different frequency
bands. This procedure however requires the parallel operation of multiple power amplifiers that are characterized by a
high energy demand. In this paper, we quantify the impact of CA on the power consumption of LTE-A enabled communication by means of a Markovian based power consumption model that incorporates system parameters as well as context parameters. The results show that the suitability of CA does, from a battery lifetime perspective, strongly depend upon the actual device characteristics as well as the resource availability in the various frequency bands. Furthermore, the application of the sophisticated Kinetic Battery Model (KiBaM) shows that the charge recovery effect during idle periods does significantly affect the battery lifetime.
|
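The Kinetic Battery Model invoked in this abstract has a standard two-well formulation in the literature; the equations below are that textbook form (notation ours), included only to make the charge-recovery effect concrete, not taken from the paper itself:

    \frac{dy_1}{dt} = -I(t) + k\,(h_2 - h_1), \qquad \frac{dy_2}{dt} = -k\,(h_2 - h_1), \qquad h_1 = \frac{y_1}{c}, \quad h_2 = \frac{y_2}{1-c}

Here y_1 is the available charge, y_2 the bound charge, I(t) the load current, c the capacity ratio and k the rate constant; the battery is exhausted when y_1 = 0. With I(t) = 0 during idle periods, charge flows from the bound well back into the available well, which is exactly the recovery effect the abstract reports as significant.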
| Peter Marwedel and Michael Engel. Flipped classroom teaching for a cyber-physical system course - an adequate presence-based learning approach in the internet age. In Proceedings of the Tenth European Workshop on Microelectronics Education (EWME) Tallinn, Estonia, May 2014 [BibTeX][PDF][Abstract]@inproceedings { marwedel:2014:ewme,
author = {Marwedel, Peter and Engel, Michael},
title = {Flipped classroom teaching for a cyber-physical system course - an adequate presence-based learning approach in the internet age},
booktitle = {Proceedings of the Tenth European Workshop on Microelectronics Education (EWME)},
year = {2014},
address = {Tallinn, Estonia},
month = {May},
publisher = {IEEE},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2014-ewme.pdf},
confidential = {n},
abstract = {In the age of the Internet, teaching styles need to take new ways of learning into account. This paper recommends the use of the flipped classroom approach. In this approach, the roles of work at home and in class are essentially swapped. We present a case study covering a course on cyber-physical system fundamentals. Results are strongly encouraging us to continue along these lines. We are also commenting on general advantages and limitations of this style of teaching.},
} In the age of the Internet, teaching styles need to take new ways of learning into account. This paper recommends the use of the flipped classroom approach. In this approach, the roles of work at home and in class are essentially swapped. We present a case study covering a course on cyber-physical system fundamentals. Results are strongly encouraging us to continue along these lines. We are also commenting on general advantages and limitations of this style of teaching.
|
| Dominic Siedhoff and Heinrich Müller. Signal/Background Classification of Time Series for Biological Virus Detection. In Pattern Recognition - 36th German Conference, GCPR 2014, Münster, Germany, September 2-5, 2014. Proceedings 2014 [BibTeX]@inproceedings { Siedhoff/etal/2014b,
author = {Siedhoff, Dominic and M\"uller, Heinrich},
title = {Signal/Background Classification of Time Series for Biological Virus Detection},
booktitle = {Pattern Recognition - 36th German Conference, GCPR 2014, M\"unster, Germany, September 2-5, 2014. Proceedings},
year = {2014},
editor = {Xiaoyi Jiang and Joachim Hornegger and Reinhard Koch},
publisher = {Springer},
confidential = {n},
} |
| Olaf Neugebauer, Michael Engel and Peter Marwedel. A Parallelization Approach for Resource Restricted Embedded Heterogeneous MPSoCs Inspired by OpenMP. In Proceedings of Software Engineering for Parallel Systems (SEPS) 2014 [BibTeX]@inproceedings { neugebauer:2014:seps,
author = {Neugebauer, Olaf and Engel, Michael and Marwedel, Peter},
title = {A Parallelization Approach for Resource Restricted Embedded Heterogeneous MPSoCs Inspired by OpenMP},
booktitle = {Proceedings of Software Engineering for Parallel Systems (SEPS)},
year = {2014},
confidential = {n},
} |
| Jan Kleinsorge and Peter Marwedel. Computing Maximum Blocking Times with Explicit Path Analysis under Non-local Flow Bounds. In Proceedings of the International Conference on Embedded Software (EMSOFT 2014) New Delhi, India, October 2014 [BibTeX][Link]@inproceedings { Kleinsorge:2014:EMSOFT,
author = {Kleinsorge, Jan and Marwedel, Peter},
title = {Computing Maximum Blocking Times with Explicit Path Analysis under Non-local Flow Bounds},
booktitle = {Proceedings of the International Conference on Embedded Software (EMSOFT 2014)},
year = {2014},
series = {EMSOFT 2014},
address = {New Delhi, India},
month = {oct},
url = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2014-jk.pdf},
confidential = {n},
} |
| Wei Liu, Jian-Jia Chen, Anas Toma, Tei-Wei Kuo and Qingxu Deng. Computation Offloading by Using Timing Unreliable Components in Real-Time Systems. In Design Automation Conference (DAC), San Francisco, CA, USA, June 1-5 2014 [BibTeX][PDF][Link][Abstract]@inproceedings { DBLP:conf/dac/LiuCTKD14,
author = {Liu, Wei and Chen, Jian-Jia and Toma, Anas and Kuo, Tei-Wei and Deng, Qingxu},
title = {Computation Offloading by Using Timing Unreliable Components in Real-Time Systems},
booktitle = {Design Automation Conference (DAC), San Francisco, CA, USA, June 1-5},
year = {2014},
url = {http://doi.acm.org/10.1145/2593069.2593109},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2014-liu-dac.pdf},
confidential = {n},
abstract = {There are many timing unreliable computing components in modern computer systems, which are typically forbidden in hard real-time systems due to the timing uncertainty. In this paper, we propose a computation offloading mechanism to utilise these timing unreliable components in a hard real-time system, by providing local compensations. The key of the mechanism is to decide (1) how the unreliable components are utilized and (2) how to set the worst-case estimated response time. The local compensation has to start when the unreliable components do not deliver the results in the estimated response time. We propose a scheduling algorithm and its schedulability test to analyze the feasibility of the compensation mechanism. To validate the proposed mechanism, we perform a case study based on image-processing applications in a robot system and simulations. By adopting the timing unreliable components, the system can handle higher-quality images with better performance.},
} There are many timing unreliable computing components in modern computer systems, which are typically forbidden in hard real-time systems due to the timing uncertainty. In this paper, we propose a computation offloading mechanism to utilise these timing unreliable components in a hard real-time system, by providing local compensations. The key of the mechanism is to decide (1) how the unreliable components are utilized and (2) how to set the worst-case estimated response time. The local compensation has to start when the unreliable components do not deliver the results in the estimated response time. We propose a scheduling algorithm and its schedulability test to analyze the feasibility of the compensation mechanism. To validate the proposed mechanism, we perform a case study based on image-processing applications in a robot system and simulations. By adopting the timing unreliable components, the system can handle higher-quality images with better performance.
|
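The compensation rule described above, starting the local fallback when the unreliable component misses its estimated response time, can be pictured with a minimal sketch. This is not the paper's algorithm (its core contribution is the scheduling algorithm and schedulability analysis); all names are illustrative, and Python 3.9+ is assumed for cancel_futures:

    import concurrent.futures as cf

    def run_with_compensation(offload_fn, compensate_fn, data, est_response_time_s):
        # Offload to the timing-unreliable component, but start the local
        # compensation if no result arrives within the estimated
        # worst-case response time.
        pool = cf.ThreadPoolExecutor(max_workers=1)
        try:
            future = pool.submit(offload_fn, data)
            try:
                return future.result(timeout=est_response_time_s)
            except cf.TimeoutError:
                return compensate_fn(data)
        finally:
            pool.shutdown(wait=False, cancel_futures=True)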
| Pascal Libuschewski, Dennis Kaulbars, Dominic Siedhoff, Frank Weichert, Heinrich Müller, Christian Wietfeld and Peter Marwedel. Multi-Objective Computation Offloading for Mobile Biosensors via LTE. In Wireless Mobile Communication and Healthcare (Mobihealth), 2014 EAI 4th International Conference on December 2014 [BibTeX][PDF][Link][Abstract]@inproceedings { Libuschewski/etal/2014a,
author = {Libuschewski, Pascal and Kaulbars, Dennis and Siedhoff, Dominic and Weichert, Frank and M\"uller, Heinrich and Wietfeld, Christian and Marwedel, Peter},
title = {Multi-Objective Computation Offloading for Mobile Biosensors via LTE},
booktitle = {Wireless Mobile Communication and Healthcare (Mobihealth), 2014 EAI 4th International Conference on},
year = {2014},
month = {Dec},
url = {http://dx.doi.org/10.4108/icst.mobihealth.2014.257374},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2014-mobihealth.pdf},
confidential = {n},
abstract = {For a rapid identification of viral epidemics, a mobile virus detection is needed which can process samples without a laboratory. The application of medical biosensors at key positions with a high passenger volume (e.g. airports) has become increasingly meaningful for epidemic early warning systems. As mobile biosensors have to fulfill various demands, like a rapid analysis and a prolonged battery lifetime, we present in this study a multi-objective computation offloading approach for mobile sensors. The decision whether it is beneficial to offload work to a server can be made automatically on the basis of contrary objectives and several constraints.},
} For a rapid identification of viral epidemics, a mobile virus detection is needed which can process samples without a laboratory. The application of medical biosensors at key positions with a high passenger volume (e.g. airports) has become increasingly meaningful for epidemic early warning systems. As mobile biosensors have to fulfill various demands, like a rapid analysis and a prolonged battery lifetime, we present in this study a multi-objective computation offloading approach for mobile sensors. The decision whether it is beneficial to offload work to a server can be made automatically on the basis of contrary objectives and several constraints.
|
| Pascal Libuschewski, Peter Marwedel, Dominic Siedhoff and Heinrich Müller. Multi-Objective Energy-Aware GPGPU Design Space Exploration for Medical or Industrial Applications. In Signal-Image Technology and Internet-Based Systems (SITIS), 2014 Tenth International Conference on, pages 637-644 November 2014, doi 10.1109/SITIS.2014.11 [BibTeX][PDF][Link][Abstract]@inproceedings { Libuschewski/etal/2014b,
author = {Libuschewski, Pascal and Marwedel, Peter and Siedhoff, Dominic and M\"uller, Heinrich},
title = {Multi-Objective Energy-Aware GPGPU Design Space Exploration for Medical or Industrial Applications},
booktitle = {Signal-Image Technology and Internet-Based Systems (SITIS), 2014 Tenth International Conference on},
year = {2014},
pages = {637-644},
month = {Nov},
publisher = {IEEE Computer Society},
note = {doi 10.1109/SITIS.2014.11},
url = {http://dx.doi.org/10.1109/SITIS.2014.11},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2014-sitis.pdf},
confidential = {n},
abstract = {This work presents a multi-objective design space exploration for Graphics Processing Units (GPUs). For any given GPGPU application, a Pareto front of best suited GPUs can be calculated. The objectives can be chosen according to the demands of the system, for example energy efficiency, run time and real-time capability. The simulated GPUs can be desktop, high performance or mobile versions. Also GPUs that do not yet exist can be modeled and simulated.
The main application area for the presented approach is the identification of suitable GPU hardware for given medical or industrial applications, e.g. for real-time process control or in healthcare sensor environments. As use case a real-time capable medical biosensor program for an automatic detection of pathogens and a wide variety of industrial, biological and physical applications were evaluated.},
} This work presents a multi-objective design space exploration for Graphics Processing Units (GPUs). For any given GPGPU application, a Pareto front of best suited GPUs can be calculated. The objectives can be chosen according to the demands of the system, for example energy efficiency, run time and real-time capability. The simulated GPUs can be desktop, high performance or mobile versions. Also GPUs that do not yet exist can be modeled and simulated.
The main application area for the presented approach is the identification of suitable GPU hardware for given medical or industrial applications, e.g. for real-time process control or in healthcare sensor environments. As use case a real-time capable medical biosensor program for an automatic detection of pathogens and a wide variety of industrial, biological and physical applications were evaluated.
|
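The Pareto front of GPU candidates mentioned in this abstract can be obtained with a plain non-dominance filter; the sketch below is generic (not the paper's exploration engine) and assumes all objectives, e.g. energy and run time, are minimized, with the candidate numbers invented:

    def pareto_front(points):
        # Keep a point unless some other point is at least as good in
        # every objective and strictly better in at least one.
        return [p for p in points
                if not any(all(q[i] <= p[i] for i in range(len(p))) and q != p
                           for q in points)]

    candidates = [(5.0, 30.0), (4.0, 40.0), (6.0, 25.0), (5.5, 35.0)]  # (J, ms)
    print(pareto_front(candidates))  # (5.5, 35.0) is dominated by (5.0, 30.0)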
| Yu-Ming Chang, Yuan-Hao Chang, Jian-Jia Chen, Tei-Wei Kuo, Hsiang-Pang Li and Hang-Ting Lue. On Trading Wear-leveling with Heal-leveling. In Design Automation Conference (DAC), San Francisco, CA, USA, June 1-5 2014, Best Paper Candidate [BibTeX][Link][Abstract]@inproceedings { DBLP:conf/dac/ChangCCKLL14,
author = {Chang, Yu-Ming and Chang, Yuan-Hao and Chen, Jian-Jia and Kuo, Tei-Wei and Li, Hsiang-Pang and Lue, Hang-Ting},
title = {On Trading Wear-leveling with Heal-leveling},
booktitle = {Design Automation Conference (DAC), San Francisco, CA, USA, June 1-5},
year = {2014},
note = {Best Paper Candidate},
url = {http://doi.acm.org/10.1145/2593069.2593172},
confidential = {n},
abstract = {Manufacturers are constantly seeking to increase flash memory density in order to fulfill the ever growing demand for storage capacity. However, this trend significantly reduces the reliability and endurance of flash memory chips. The lifetime degradation worsens as the number of erase cycles grows, even with wear leveling technology being adopted to extend flash memory lifetime by evenly distributing erase cycles to every flash block. To address this issue, self-healing technology is proposed to recover a flash block before the flash block is worn out, but such a technology still has its limitation when recovering flash blocks. In contrast to the existing wear leveling designs, we adopt the self-healing technology to propose a heal-leveling design that evenly distributes healing cycles to flash blocks. Ultimately, heal-leveling aims to extend the lifetime of flash memory without introducing a large amount of live-data copying overheads. We conducted a series of experiments to evaluate the capability of the proposed design. The results show that our design can significantly improve the access performance and the effective lifetime of flash memory without the unnecessary overheads caused by wear leveling technology.},
} Manufacturers are constantly seeking to increase flash memory density in order to fulfill the ever growing demand for storage capacity. However, this trend significantly reduces the reliability and endurance of flash memory chips. The lifetime degradation worsens as the number of erase cycles grows, even with wear leveling technology being adopted to extend flash memory lifetime by evenly distributing erase cycles to every flash block. To address this issue, self-healing technology is proposed to recover a flash block before the flash block is worn out, but such a technology still has its limitation when recovering flash blocks. In contrast to the existing wear leveling designs, we adopt the self-healing technology to propose a heal-leveling design that evenly distributes healing cycles to flash blocks. Ultimately, heal-leveling aims to extend the lifetime of flash memory without introducing a large amount of live-data copying overheads. We conducted a series of experiments to evaluate the capability of the proposed design. The results show that our design can significantly improve the access performance and the effective lifetime of flash memory without the unnecessary overheads caused by wear leveling technology.
|
| Jian-Jia Chen, Mong-Jen Kao, D. T. Lee, Ignaz Rutter and Dorothea Wagner. Online Dynamic Power Management with Hard Real-Time Guarantees. In 31st International Symposium on Theoretical Aspects of Computer Science (STACS), Lyon, France, March 5-8, 2014, pages 226-238 2014 [BibTeX][Link][Abstract]@inproceedings { DBLP:conf/stacs/ChenKLRW14,
author = {Chen, Jian-Jia and Kao, Mong-Jen and Lee, D. T. and Rutter, Ignaz and Wagner, Dorothea},
title = {Online Dynamic Power Management with Hard Real-Time Guarantees},
booktitle = {31st International Symposium on Theoretical Aspects of Computer Science (STACS), Lyon, France, March 5-8, 2014},
year = {2014},
pages = {226-238},
url = {http://dx.doi.org/10.4230/LIPIcs.STACS.2014.226},
confidential = {n},
abstract = {We consider the problem of online dynamic power management that provides hard real-time guarantees for multi-processor systems. In this problem, a set of jobs, each associated with an arrival time, a deadline, and an execution time, arrives at the system in an online fashion. The objective is to compute a non-migrative preemptive schedule of the jobs and a sequence of power on/off operations of the processors so as to minimize the total energy consumption while ensuring that all the deadlines of the jobs are met. We assume that we can use as many processors as necessary. In this paper we examine the complexity of this problem and provide online strategies that lead to practical energy-efficient solutions for real-time multi-processor systems. First, we consider the case for which we know in advance that the set of jobs can be scheduled feasibly on a single processor. We show that, even in this case, the competitive factor of any online algorithm is at least 2.06. On the other hand, we give a 4-competitive online algorithm that uses at most two processors. For jobs with unit execution times, the competitive factor of this algorithm improves to 3.59. Second, we relax our assumption by considering as input multiple streams of jobs, each of which can be scheduled feasibly on a single processor. We present a trade-off between the energy-efficiency of the schedule and the number of processors to be used. More specifically, for k given job streams and h processors with h>k, we give a scheduling strategy such that the energy usage is at most 4k/(h-k) times that used by any schedule which schedules each of the k streams on a separate processor. Finally, we drop the assumptions on the input set of jobs. We show that the competitive factor of any online algorithm is at least 2.28, even for the case of unit job execution times for which we further derive an O(1)-competitive algorithm.},
} We consider the problem of online dynamic power management that provides hard real-time guarantees for multi-processor systems. In this problem, a set of jobs, each associated with an arrival time, a deadline, and an execution time, arrives at the system in an online fashion. The objective is to compute a non-migrative preemptive schedule of the jobs and a sequence of power on/off operations of the processors so as to minimize the total energy consumption while ensuring that all the deadlines of the jobs are met. We assume that we can use as many processors as necessary. In this paper we examine the complexity of this problem and provide online strategies that lead to practical energy-efficient solutions for real-time multi-processor systems. First, we consider the case for which we know in advance that the set of jobs can be scheduled feasibly on a single processor. We show that, even in this case, the competitive factor of any online algorithm is at least 2.06. On the other hand, we give a 4-competitive online algorithm that uses at most two processors. For jobs with unit execution times, the competitive factor of this algorithm improves to 3.59. Second, we relax our assumption by considering as input multiple streams of jobs, each of which can be scheduled feasibly on a single processor. We present a trade-off between the energy-efficiency of the schedule and the number of processors to be used. More specifically, for k given job streams and h processors with h>k, we give a scheduling strategy such that the energy usage is at most 4k/(h-k) times that used by any schedule which schedules each of the k streams on a separate processor. Finally, we drop the assumptions on the input set of jobs. We show that the competitive factor of any online algorithm is at least 2.28, even for the case of unit job execution times for which we further derive an O(1)-competitive algorithm.
|
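To make the stated trade-off concrete: the bound above says that k streams on h > k processors use at most 4k/(h-k) times the energy of the per-stream reference schedule, e.g.

    \frac{4k}{h-k}\Big|_{k=2,\,h=4} = 4, \qquad \frac{4k}{h-k}\Big|_{k=2,\,h=6} = 2,

so doubling the processor surplus from 2 to 4 halves the guaranteed energy factor (example numbers ours).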
| Helena Kotthaus, Michel Lang, Jörg Rahnenführer and Peter Marwedel. Runtime and Memory Consumption Analyses for Machine Learning R Programs. In Abstracts 45. Arbeitstagung, Ulmer Informatik-Berichte, pages 3-4 June 2013 [BibTeX][PDF]@inproceedings { kotthaus/2013a,
author = {Kotthaus, Helena and Lang, Michel and Rahnenf{\"u}hrer, J{\"o}rg and Marwedel, Peter},
title = {Runtime and Memory Consumption Analyses for Machine Learning R Programs},
booktitle = {Abstracts 45. Arbeitstagung, Ulmer Informatik-Berichte},
year = {2013},
pages = {3-4},
month = {jun},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/kotthaus_etal_2013a.pdf},
confidential = {n},
} |
| Björn Döbel, Horst Schirmeier and Michael Engel. Investigating the Limitations of PVF for Realistic Program Vulnerability Assessment. In Proceedings of the 5th Workshop on Design for Reliability (DFR) January 2013, - Best Poster Award - [BibTeX][PDF][Abstract]@inproceedings { doebel:2013:dfr,
author = {D\"obel, Bj\"orn and Schirmeier, Horst and Engel, Michael},
title = {Investigating the Limitations of PVF for Realistic Program Vulnerability Assessment},
booktitle = {Proceedings of the 5th Workshop on Design for Reliability (DFR)},
year = {2013},
month = {January},
note = {- Best Poster Award -},
keywords = {ders},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2013-dfr-doebel.pdf},
confidential = {n},
abstract = {From a software developer's perspective, fault injection (FI) is the most complete way of evaluating the sensitivity of a program against hardware errors. Unfortunately, FI campaigns require a substantial investment of both time and computing resources, making their application infeasible in many cases.
Program Vulnerability Factor (PVF) analysis has been proposed as an alternative for estimating software vulnerability. In this paper we present PVF/x86, a tool for computing the PVF for x86 programs. We validate the use of PVF analysis by running PVF/x86 on an image decoder application and compare the results to those obtained with a state-of-the-art FI framework. We identify weak spots of PVF analysis and outline ideas for addressing those points.},
} From a software developer's perspective, fault injection (FI) is the most complete way of evaluating the sensitivity of a program against hardware errors. Unfortunately, FI campaigns require a substantial investment of both time and computing resources, making their application infeasible in many cases.
Program Vulnerability Factor (PVF) analysis has been proposed as an alternative for estimating software vulnerability. In this paper we present PVF/x86, a tool for computing the PVF for x86 programs. We validate the use of PVF analysis by running PVF/x86 on an image decoder application and compare the results to those obtained with a state-of-the-art FI framework. We identify weak spots of PVF analysis and outline ideas for addressing those points.
|
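As background (from the PVF literature, not a result of this poster): the Program Vulnerability Factor of an architectural resource R is usually defined as the fraction of its bit-cycles needed for architecturally correct execution (ACE), i.e.

    \mathrm{PVF}_R = \frac{\sum_{t=1}^{T} (\text{ACE bits of } R \text{ in cycle } t)}{B_R \cdot T},

where B_R is the width of R in bits and T the number of cycles analyzed; a PVF of 0 means no fault in R can alter the program outcome.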
| Daniel Cordes, Michael Engel, Olaf Neugebauer and Peter Marwedel. Automatic Extraction of Task-Level Parallelism for Heterogeneous MPSoCs. In Proceedings of the Fourth International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI 2013) Lyon, France, October 2013 [BibTeX][PDF][Abstract]@inproceedings { Cordes:2013:PSTI,
author = {Cordes, Daniel and Engel, Michael and Neugebauer, Olaf and Marwedel, Peter},
title = {Automatic Extraction of Task-Level Parallelism for Heterogeneous MPSoCs},
booktitle = {Proceedings of the Fourth International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI 2013)},
year = {2013},
series = {PSTI 2013},
address = {Lyon, France},
month = {oct},
keywords = {automatic parallelization; embedded software; heterogeneity; mpsoc; integer linear programming; task-level parallelism},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2013-psti-cordes.pdf},
confidential = {n},
abstract = {Heterogeneous multi-core platforms are increasingly attractive for embedded applications due to their adaptability and efficiency. This proliferation of heterogeneity demands new approaches for extracting thread level parallelism from sequential applications which have to be efficient at runtime. We present, to the best of our knowledge, the first Integer Linear Programming (ILP)-based parallelization approach for heterogeneous multi-core platforms. Using Hierarchical Task Graphs and high-level timing models, our approach manages to balance the extracted tasks while considering performance differences between cores. As a result, we obtain considerable speedups at runtime, significantly outperforming tools for homogeneous systems. We evaluate our approach by parallelizing standard benchmarks from various application domains.},
} Heterogeneous multi-core platforms are increasingly attractive for embedded applications due to their adaptability and efficiency. This proliferation of heterogeneity demands new approaches for extracting thread level parallelism from sequential applications which have to be efficient at runtime. We present, to the best of our knowledge, the first Integer Linear Programming (ILP)-based parallelization approach for heterogeneous multi-core platforms. Using Hierarchical Task Graphs and high-level timing models, our approach manages to balance the extracted tasks while considering performance differences between cores. As a result, we obtain considerable speedups at runtime, significantly outperforming tools for homogeneous systems. We evaluate our approach by parallelizing standard benchmarks from various application domains.
|
| Timon Kelter, Tim Harde, Peter Marwedel and Heiko Falk. Evaluation of resource arbitration methods for multi-core real-time systems. In Proceedings of the 13th International Workshop on Worst-Case Execution Time Analysis (WCET) Paris, France, July 2013 [BibTeX][PDF][Link][Abstract]@inproceedings { kelter:2013:wcet,
author = {Kelter, Timon and Harde, Tim and Marwedel, Peter and Falk, Heiko},
title = {Evaluation of resource arbitration methods for multi-core real-time systems},
booktitle = {Proceedings of the 13th International Workshop on Worst-Case Execution Time Analysis (WCET)},
year = {2013},
editor = {Claire Maiza},
address = {Paris, France},
month = {July},
url = {http://wcet2013.imag.fr/},
keywords = {wcet},
file = {http://drops.dagstuhl.de/opus/volltexte/2013/4117/pdf/2.pdf},
confidential = {n},
abstract = {Multi-core systems have become prevalent in the last years, because of their favorable properties in terms of energy consumption, computing power and design complexity. First attempts have been made to devise WCET analyses for multi-core processors, which have to deal with the problem that the cores may experience interferences during accesses to shared resources. To limit these interferences, the vast majority of previous work proposes a strict TDMA (time division multiple access) schedule for arbitrating shared resources. Though this type of arbitration yields a high predictability, this advantage is paid for with a poor resource utilization. In this work, we compare different arbitration methods with respect to their predictability and average case performance. We show how known WCET analysis techniques can be extended to work with the presented arbitration strategies and perform an evaluation of the resulting ACETs and WCETs on an extensive set of real-world benchmarks. Results show that there are cases when TDMA is not the best strategy, especially when predictability and performance are equally important.},
} Multi-core systems have become prevalent in the last years, because of their favorable properties in terms of energy consumption, computing power and design complexity. First attempts have been made to devise WCET analyses for multi-core processors, which have to deal with the problem that the cores may experience interferences during accesses to shared resources. To limit these interferences, the vast majority of previous work proposes a strict TDMA (time division multiple access) schedule for arbitrating shared resources. Though this type of arbitration yields a high predictability, this advantage is paid for with a poor resource utilization. In this work, we compare different arbitration methods with respect to their predictability and average case performance. We show how known WCET analysis techniques can be extended to work with the presented arbitration strategies and perform an evaluation of the resulting ACETs and WCETs on an extensive set of real-world benchmarks. Results show that there are cases when TDMA is not the best strategy, especially when predictability and performance are equally important.
|
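The predictability/utilization tension analyzed above follows directly from TDMA arithmetic: with n cores sharing a resource in slots of s cycles, a request that arrives just after its own slot has passed must wait (n-1)·s cycles for the next one, e.g. (4-1)·10 = 30 cycles for n = 4 and s = 10 (numbers ours, for illustration). A WCET analysis must assume this wait even when all other cores are idle, and that assumed-but-unused bandwidth is precisely what work-conserving arbiters recover at the price of a harder analysis.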
| Daniel Cordes, Michael Engel, Olaf Neugebauer and Peter Marwedel. Automatic Extraction of Pipeline Parallelism for Embedded Heterogeneous Multi-Core Platforms. In Proceedings of the Sixteenth International Conference on Compilers, Architectures, and Synthesis for Embedded Systems (CASES 2013) Montreal, Canada, October 2013 [BibTeX][PDF][Abstract]@inproceedings { Cordes:2013:CASES,
author = {Cordes, Daniel and Engel, Michael and Neugebauer, Olaf and Marwedel, Peter},
title = {Automatic Extraction of Pipeline Parallelism for Embedded Heterogeneous Multi-Core Platforms},
booktitle = {Proceedings of the Sixteenth International Conference on Compilers, Architectures, and Synthesis for Embedded Systems (CASES 2013)},
year = {2013},
series = {CASES 2013},
address = {Montreal, Canada},
month = {oct},
keywords = {Automatic Parallelization; Heterogeneity; MPSoC; Embedded Software; Integer Linear Programming; Pipeline},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2013-cases-cordes.pdf},
confidential = {n},
abstract = {Automatic parallelization of sequential applications is the key for efficient use and optimization of current and future embedded multi-core systems. However, existing approaches often fail to achieve efficient balancing of tasks running on heterogeneous cores of an MPSoC. A reason for this is often insufficient knowledge of the underlying architecture's performance.
In this paper, we present a novel parallelization approach for embedded MPSoCs that combines pipeline parallelization for loops with knowledge about different execution times for tasks on cores with different performance properties. Using Integer Linear Programming, an optimal solution with respect to the model used is derived implementing tasks with a well-balanced execution behavior. We evaluate our pipeline parallelization approach for heterogeneous MPSoCs using a set of standard embedded benchmarks and compare it with two existing state-of-the-art approaches. For all benchmarks, our parallelization approach obtains significantly higher speedups than either approach on heterogeneous MPSoCs.
},
} Automatic parallelization of sequential applications is the key for efficient use and optimization of current and future embedded multi-core systems. However, existing approaches often fail to achieve efficient balancing of tasks running on heterogeneous cores of an MPSoC. A reason for this is often insufficient knowledge of the underlying architecture's performance.
In this paper, we present a novel parallelization approach for embedded MPSoCs that combines pipeline parallelization for loops with knowledge about different execution times for tasks on cores with different performance properties. Using Integer Linear Programming, an optimal solution with respect to the model used is derived implementing tasks with a well-balanced execution behavior. We evaluate our pipeline parallelization approach for heterogeneous MPSoCs using a set of standard embedded benchmarks and compare it with two existing state-of-the-art approaches. For all benchmarks, our parallelization approach obtains significantly higher speedups than either approach on heterogeneous MPSoCs.
|
| Andreas Heinig, Ingo Korb, Florian Schmoll, Peter Marwedel and Michael Engel. Fast and Low-Cost Instruction-Aware Fault Injection. In Proc. of SOBRES 2013 2013 [BibTeX][Link][Abstract]@inproceedings { heinig:2013:sobres,
author = {Heinig, Andreas and Korb, Ingo and Schmoll, Florian and Marwedel, Peter and Engel, Michael},
title = {Fast and Low-Cost Instruction-Aware Fault Injection},
booktitle = {Proc. of SOBRES 2013},
year = {2013},
url = {http://danceos.org/sobres/2013/papers/SOBRES-640-Heinig.pdf},
keywords = {ders},
confidential = {n},
abstract = {In order to assess the robustness of software-based fault-tolerance methods, extensive tests have to be performed that inject faults, such as bit flips, into hardware components of a running system. Fault injection commonly either uses system simulations, resulting in execution times orders of magnitude longer than on real systems, or exposes a real system to error sources like radiation. The latter can take place in real time, but it enables only a very coarse-grained control over the affected system component.
A solution combining the best characteristics from both approaches should achieve precise fault injection in real hardware systems. The approach presented in this paper uses the JTAG background debug facility of a CPU to inject faults into main memory and registers of a running system. Compared to similar earlier approaches, our solution is able to achieve rapid fault injection using a low-cost microcontroller instead of a complex FPGA. Consequently, our injection software is much more flexible. It allows restricting error injection to the execution of a set of predefined components, resulting in more precise control of the injection, and also emulates error reporting, which enables the evaluation of different error detection approaches in addition to robustness evaluation.},
} In order to assess the robustness of software-based fault-tolerance methods, extensive tests have to be performed that inject faults, such as bit flips, into hardware components of a running system. Fault injection commonly either uses system simulations, resulting in execution times orders of magnitude longer than on real systems, or exposes a real system to error sources like radiation. The latter can take place in real time, but it enables only a very coarse-grained control over the affected system component.
A solution combining the best characteristics from both approaches should achieve precise fault injection in real hardware systems. The approach presented in this paper uses the JTAG background debug facility of a CPU to inject faults into main memory and registers of a running system. Compared to similar earlier approaches, our solution is able to achieve rapid fault injection using a low-cost microcontroller instead of a complex FPGA. Consequently, our injection software is much more flexible. It allows restricting error injection to the execution of a set of predefined components, resulting in more precise control of the injection, and also emulates error reporting, which enables the evaluation of different error detection approaches in addition to robustness evaluation.
|
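The injection principle of this paper, halting the target via its background-debug port, mutating state and resuming, can be sketched as follows. The probe class merely simulates target RAM so the sketch is runnable anywhere; the real tool drives the CPU's debug facility from a microcontroller, and all identifiers here are illustrative:

    import random

    class DebugProbe:
        # Stand-in for a background-debug (BDM/JTAG) probe; it simulates
        # target memory so this sketch is self-contained.
        def __init__(self, ram_words=1024):
            self.ram = [0] * ram_words
        def halt(self):
            pass      # a real probe would stop the target CPU here
        def resume(self):
            pass      # ...and let it continue here
        def read_word(self, addr):
            return self.ram[addr]
        def write_word(self, addr, value):
            self.ram[addr] = value

    def inject_bit_flip(probe, addr, word_bits=32):
        # Halt the target, flip one uniformly chosen bit of the word at
        # 'addr', then resume execution.
        probe.halt()
        word = probe.read_word(addr)
        probe.write_word(addr, word ^ (1 << random.randrange(word_bits)))
        probe.resume()

    inject_bit_flip(DebugProbe(), addr=42)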
| Daniel Cordes, Michael Engel, Olaf Neugebauer and Peter Marwedel. Automatic Extraction of Multi-Objective Aware Parallelism for Heterogeneous MPSoCs. In Proceedings of the Sixth International Workshop on Multi-/Many-core Computing Systems (MuCoCoS 2013) Edinburgh, Scotland, UK, September 2013 [BibTeX][PDF][Abstract]@inproceedings { Cordes:2013:MUCOCOS,
author = {Cordes, Daniel and Engel, Michael and Neugebauer, Olaf and Marwedel, Peter},
title = {Automatic Extraction of Multi-Objective Aware Parallelism for Heterogeneous MPSoCs},
booktitle = {Proceedings of the Sixth International Workshop on Multi-/Many-core Computing Systems (MuCoCoS 2013)},
year = {2013},
series = {MuCoCoS 2013},
address = {Edinburgh, Scotland, UK},
month = {sep},
keywords = {automatic parallelization; embedded software; heterogeneity; mpsoc; genetic algorithms; task-level parallelism; pipeline parallelism; multi-objective},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2013-mucocos-cordes.pdf},
confidential = {n},
abstract = {Heterogeneous MPSoCs are used in a large fraction of current embedded systems. In order to efficiently exploit the available processing power, advanced parallelization techniques are required. In addition to considering performance variances between heterogeneous cores, these methods have to be multi-objective aware to be useful for resource-restricted embedded systems. This multi-objective optimization requirement results in an explosion of the design space size. As a consequence, efficient approaches are required to find promising solution candidates. In this paper, we present the first portable genetic algorithm-based approach to speed up ANSI-C applications by combining extraction techniques for task-level and pipeline parallelism for heterogeneous multicores while considering additional objectives.
Using our approach enables embedded system designers to select a parallelization of an application from a set of Pareto-optimal solutions according to the performance and energy consumption requirements of a given system. The evaluation of a large set of typical embedded benchmarks shows that our approach is able to generate solutions with low energy consumption, high speedup, low communication overhead or useful trade-offs between these three objectives.},
} Heterogeneous MPSoCs are used in a large fraction of current embedded systems. In order to efficiently exploit the available processing power, advanced parallelization techniques are required. In addition to considering performance variances between heterogeneous cores, these methods have to be multi-objective aware to be useful for resource-restricted embedded systems. This multi-objective optimization requirement results in an explosion of the design space size. As a consequence, efficient approaches are required to find promising solution candidates. In this paper, we present the first portable genetic algorithm-based approach to speed up ANSI-C applications by combining extraction techniques for task-level and pipeline parallelism for heterogeneous multicores while considering additional objectives.
Using our approach enables embedded system designers to select a parallelization of an application from a set of Pareto-optimal solutions according to the performance and energy consumption requirements of a given system. The evaluation of a large set of typical embedded benchmarks shows that our approach is able to generate solutions with low energy consumption, high speedup, low communication overhead or useful trade-offs between these three objectives.
|
| Jan Kleinsorge, Heiko Falk and Peter Marwedel. Simple Analysis of Partial Worst-case Execution Paths on General Control Flow Graphs. In Proceedings of the International Conference on Embedded Software (EMSOFT 2013) Montreal, Canada, October 2013 [BibTeX][Link]@inproceedings { Kleinsorge:2013:EMSOFT,
author = {Kleinsorge, Jan and Falk, Heiko and Marwedel, Peter},
title = {Simple Analysis of Partial Worst-case Execution Paths on General Control Flow Graphs},
booktitle = {Proceedings of the International Conference on Embedded Software (EMSOFT 2013)},
year = {2013},
series = {EMSOFT 2013},
address = {Montreal, Canada},
month = {oct},
url = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2013_emsoft.pdf},
keywords = {wcet; Worst-case Execution Time; Path Analysis; Static Analysis},
confidential = {n},
} |
| A. Herkersdorf, M. Engel, M. Glaß, J. Henkel, V.B. Kleeberger, M.A. Kochte, J.M. Kühn, S.R. Nassif, H. Rauchfuss, W. Rosenstiel, U. Schlichtmann, M. Shafique, M.B. Tahoori, J. Teich, N. Wehn, C. Weis and H.-J. Wunderlich. Cross-Layer Dependability Modeling and Abstraction in Systems on Chip. In Proceedings of the Workshop on Silicon Errors in Logic System Effects (SELSE) March 2013 [BibTeX][PDF][Abstract]@inproceedings { herkersdorf:2013:selse,
author = {Herkersdorf, A. and Engel, M. and Gla{\ss}, M. and Henkel, J. and Kleeberger, V.B. and Kochte, M.A. and K\"uhn, J.M. and Nassif, S.R. and Rauchfuss, H. and Rosenstiel, W. and Schlichtmann, U. and Shafique, M. and Tahoori, M.B. and Teich, J. and Wehn, N. and Weis, C. and Wunderlich, H.-J.},
title = {Cross-Layer Dependability Modeling and Abstraction in Systems on Chip},
booktitle = {Proceedings of the Workshop on Silicon Errors in Logic System Effects (SELSE)},
year = {2013},
month = {March},
keywords = {ders},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2013-selse-herkersdorf.pdf},
confidential = {n},
abstract = {The Resilience Articulation Point (RAP) model aims at providing researchers and developers with a probabilistic fault abstraction and error propagation framework covering all hardware/software layers of a System on Chip. RAP assumes that physically induced faults at the technology or CMOS device layer will eventually manifest themselves as a single or multiple
bit flip(s). When probabilistic error functions for specific fault origins are known at the bit or signal level, knowledge about
the unit of design and its environment allows the transformation of the bit-related error functions into characteristic higher
layer representations, such as error functions for data words, Finite State Machine (FSM) state, macro interfaces or software
variables. Thus, design concerns at higher abstraction layers can be investigated without the necessity to further consider the full details of lower levels of design. This paper introduces the ideas of RAP based on examples of radiation induced soft errors in SRAM cells and sequential CMOS logic. It shows by example how probabilistic bit flips are systematically abstracted and propagated towards higher abstraction levels up to the application software layer, and how RAP can be used to parameterize architecture level resilience methods.},
} The Resilience Articulation Point (RAP) model aims at providing researchers and developers with a probabilistic fault abstraction and error propagation framework covering all hardware/software layers of a System on Chip. RAP assumes that physically induced faults at the technology or CMOS device layer will eventually manifest themselves as a single or multiple
bit flip(s). When probabilistic error functions for specific fault origins are known at the bit or signal level, knowledge about
the unit of design and its environment allows the transformation of the bit-related error functions into characteristic higher
layer representations, such as error functions for data words, Finite State Machine (FSM) state, macro interfaces or software
variables. Thus, design concerns at higher abstraction layers can be investigated without the necessity to further consider the full details of lower levels of design. This paper introduces the ideas of RAP based on examples of radiation induced soft errors in SRAM cells and sequential CMOS logic. It shows by example how probabilistic bit flips are systematically abstracted and propagated towards higher abstraction levels up to the application software layer, and how RAP can be used to parameterize architecture level resilience methods.
|
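A minimal instance of the bit-to-word lifting RAP describes: if every bit of an n-bit word flips independently with probability p_bit per cycle (independence is our simplifying assumption, not RAP's), the word-level error function is

    P_{\mathrm{word}} = 1 - (1 - p_{\mathrm{bit}})^{n},

so p_bit = 10^{-9} and n = 32 give P_word ≈ 3.2·10^{-8} per cycle. RAP's point is that such per-bit functions can be lifted further, to FSM state, macro interfaces or software variables, once the unit of design and its environment are known.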
| Pascal Libuschewski, Dominic Siedhoff, Constantin Timm, Andrej Gelenberg and Frank Weichert. Fuzzy-enhanced, Real-time capable Detection of Biological Viruses Using a Portable Biosensor. In Proceedings of the International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSIGNALS), pages 169-174 February 2013 [BibTeX][Abstract]@inproceedings { Libuschewski/etal/2013b,
author = {Libuschewski, Pascal and Siedhoff, Dominic and Timm, Constantin and Gelenberg, Andrej and Weichert, Frank},
title = {Fuzzy-enhanced, Real-time capable Detection of Biological Viruses Using a Portable Biosensor},
booktitle = {Proceedings of the International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSIGNALS)},
year = {2013},
pages = {169-174},
month = {February},
confidential = {n},
abstract = {This work presents a novel portable biosensor for indirect detection of viruses by optical microscopy. The focus lies on energy-efficient real-time data analysis for automated virus detection. The superiority of our fuzzy-enhanced time-series analysis over hard thresholding is demonstrated. Real-time capability is achieved through general-purpose computing on graphics processing units (GPGPU). It is shown that this virus detection is real-time capable on an off-the-shelf laptop computer, allowing for a wide range of in-field use-cases.},
} This work presents a novel portable biosensor for indirect detection of viruses by optical microscopy. The focus lies on energy-efficient real-time data analysis for automated virus detection. The superiority of our fuzzy-enhanced time-series analysis over hard thresholding is demonstrated. Real-time capability is achieved through general-purpose computing on graphics processing units (GPGPU). It is shown that this virus detection is real-time capable on an off-the-shelf laptop computer, allowing for a wide range of in-field use-cases.
|
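The gap between hard thresholding and a fuzzy-enhanced decision, as compared in this abstract, can be illustrated with a generic ramp membership function; this is not the paper's detector, whose membership design is tied to the sensor's time series:

    def hard_threshold(x, t):
        # Crisp decision: everything at or above t counts fully as signal.
        return 1.0 if x >= t else 0.0

    def fuzzy_membership(x, lo, hi):
        # Piecewise-linear degree of being "signal": borderline values
        # receive intermediate scores instead of a 0/1 jump.
        if x <= lo:
            return 0.0
        if x >= hi:
            return 1.0
        return (x - lo) / (hi - lo)

    print(hard_threshold(0.49, 0.5), fuzzy_membership(0.49, 0.4, 0.6))  # 0.0 vs 0.45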
| Janmartin Jahn, Sebastian Kobbe, Santiago Pagani, Jian-Jia Chen and Jörg Henkel. Runtime resource allocation for software pipelines. In International Workshop on Software and Compilers for Embedded Systems, M-SCOPES '13, Sankt Goar, Germany, June 19-21, 2013, pages 96-99 2013 [BibTeX][Link]@inproceedings { DBLP:conf/scopes/JahnKPCH13,
author = {Jahn, Janmartin and Kobbe, Sebastian and Pagani, Santiago and Chen, Jian{-}Jia and Henkel, J{\"{o}}rg},
title = {Runtime resource allocation for software pipelines},
booktitle = {International Workshop on Software and Compilers for Embedded Systems, {M-SCOPES} '13, Sankt Goar, Germany, June 19-21, 2013},
year = {2013},
pages = {96--99},
url = {http://doi.acm.org/10.1145/2463596.2486156},
confidential = {n},
} |
| Michael Engel. Adding Flexibility to Fault-Tolerance by Analyzing Hardware-Software Interactions. In Invited Talk at HiPEAC Computing Systems Week Ghent, Belgium, October 2012, Thematic Session "The intertwining challenges of reliability, testing and verification" [BibTeX][Abstract]@inproceedings { engel:csw2012,
author = {Engel, Michael},
title = {Adding Flexibility to Fault-Tolerance by Analyzing Hardware-Software Interactions},
booktitle = {Invited Talk at HiPEAC Computing Systems Week},
year = {2012},
address = {Ghent, Belgium},
month = {oct},
organization = {HiPEAC},
note = {Thematic Session "The intertwining challenges of reliability, testing and verification"},
confidential = {n},
abstract = {With an expected increasing number of permanent and transient errors and a growing influence of variability on semiconductor operation, correcting all possible errors in hardware will become more and more infeasible. Future fault-tolerant systems will have to incorporate information about possible effects of errors on the application level in order to reduce the hardware and software overhead for fault-tolerance. This requires a reconsideration of the interaction of hardware and software. By extending programming language semantics and performing compiler-based static analyses on error effects and propagation, it becomes possible to introduce reliability requirements into software development as a first-class member, allowing system designers to tailor the fault tolerance behavior of a system to given requirements, like expected uptime or quality-of-service bounds. This talk will give an overview of current research approaches to build these more flexible fault-tolerant systems with a special focus on the projects in Germany's research program SPP1500 "Dependable Embedded Systems".},
} With an expected increasing number of permanent and transient errors and a growing influence of variability on semiconductor operation, correcting all possible errors in hardware will become more and more infeasible. Future fault-tolerant systems will have to incorporate information about possible effects of errors on the application level in order to reduce the hardware and software overhead for fault-tolerance. This requires a reconsideration of the interaction of hardware and software. By extending programming language semantics and performing compiler-based static analyses on error effects and propagation, it becomes possible to introduce reliability requirements into software development as a first-class member, allowing system designers to tailor the fault tolerance behavior of a system to given requirements, like expected uptime or quality-of-service bounds. This talk will give an overview of current research approaches to build these more flexible fault-tolerant systems with a special focus on the projects in Germany's research program SPP1500 "Dependable Embedded Systems".
|
| Daniel Cordes and Peter Marwedel. Multi-Objective Aware Extraction of Task-Level Parallelism Using Genetic Algorithms. In Proceedings of Design, Automation and Test in Europe (DATE 2012) Dresden, Germany, March 2012 [BibTeX][PDF][Abstract]@inproceedings { cordes:12:date,
author = {Cordes, Daniel and Marwedel, Peter},
title = {Multi-Objective Aware Extraction of Task-Level Parallelism Using Genetic Algorithms},
booktitle = {Proceedings of Design, Automation and Test in Europe (DATE 2012)},
year = {2012},
address = {Dresden, Germany},
month = {mar},
keywords = {Automatic Parallelization, Embedded Software, Multi-Objective, Genetic Algorithms, Task-Level Parallelism, Energy awareness},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2012-date-cordes.pdf},
confidential = {n},
abstract = {A large amount of research work has been done in the area of automatic parallelization for decades, resulting in a huge number of tools which should relieve the designer of the burden of manually parallelizing an application. Unfortunately, most of these tools only optimize the execution time by splitting up applications into concurrently executed tasks. In the domain of embedded devices, however, it is not sufficient to look only at this criterion. Since most of these devices are constraint-driven regarding execution time, energy consumption, heat dissipation and other objectives, a good trade-off has to be found to efficiently map applications to multiprocessor system on chip (MPSoC) devices. Therefore, we developed a fully automated multi-objective aware parallelization framework which optimizes different objectives at the same time. The tool returns a Pareto-optimal front of solutions of the parallelized application to the designer, so that the solution with the best trade-off can be chosen.},
} A large amount of research work has been done in the area of automatic parallelization for decades, resulting in a huge number of tools which should relieve the designer of the burden of manually parallelizing an application. Unfortunately, most of these tools only optimize the execution time by splitting up applications into concurrently executed tasks. In the domain of embedded devices, however, it is not sufficient to look only at this criterion. Since most of these devices are constraint-driven regarding execution time, energy consumption, heat dissipation and other objectives, a good trade-off has to be found to efficiently map applications to multiprocessor system on chip (MPSoC) devices. Therefore, we developed a fully automated multi-objective aware parallelization framework which optimizes different objectives at the same time. The tool returns a Pareto-optimal front of solutions of the parallelized application to the designer, so that the solution with the best trade-off can be chosen.
|
| Michael Engel and Peter Marwedel. Semantic Gaps in Software-Based Reliability. In Proceedings of the 4th Workshop on Design for Reliability (DFR'12) Paris, France, January 2012 [BibTeX][Abstract]@inproceedings { engel:dfr:2012,
author = {Engel, Michael and Marwedel, Peter},
title = {Semantic Gaps in Software-Based Reliability},
booktitle = {Proceedings of the 4th Workshop on Design for Reliability (DFR'12)},
year = {2012},
address = {Paris, France},
month = {jan},
organization = {HiPEAC},
keywords = {ders},
confidential = {n},
abstract = {Future semiconductors will show a heterogeneous distribution of permanent faults as a result of fabrication variations and aging. To increase yields and lifetimes of these chips, a fault tolerance approach is required that handles resources on a small-scale basis with low overhead. In embedded systems, this overhead can be reduced by classifying data and instructions to determine the varying impact of errors on different instructions and data. Using this classification, only errors with significant impact on system behavior have to be corrected.
In this position paper, we describe one problem with this analysis, the semantic gap between high-level language source code and the low-level data flow through architecture components. In addition, we discuss possible approaches to handle this gap. Of special interest are the implications on achieving reliable execution of dependability-critical code.
},
} Future semiconductors will show a heterogeneous distribution of permanent faults as a result of fabrication variations and aging. To increase yields and lifetimes of these chips, a fault tolerance approach is required that handles resources on a small-scale basis with low overhead. In embedded systems, this overhead can be reduced by classifying data and instructions to determine the varying impact of errors on different instructions and data. Using this classification, only errors with significant impact on system behavior have to be corrected.
In this position paper, we describe one problem with this analysis, the semantic gap between high-level language source code and the low-level data flow through architecture components. In addition, we discuss possible approaches to handle this gap. Of special interest are the implications on achieving reliable execution of dependability-critical code.
|
| Olivera Jovanovic, Nils Kneuper, Peter Marwedel and Michael Engel. ILP-based Memory-Aware Mapping Optimization for MPSoCs. In The 10th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing Paphos, Cyprus, December 2012 [BibTeX][PDF][Abstract]@inproceedings { jovanovic:2012b,
author = {Jovanovic, Olivera and Kneuper, Nils and Marwedel, Peter and Engel, Michael},
title = {ILP-based Memory-Aware Mapping Optimization for MPSoCs},
booktitle = {The 10th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing},
year = {2012},
address = {Paphos, Cyprus},
month = {December},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2012-cse-jovanovic.pdf},
confidential = {n},
abstract = {The mapping of applications onto multiprocessor system-on-chip (MPSoC) devices is an important and complex optimization task. The goal is to efficiently distribute application tasks to available processors while optimizing for energy or runtime. Unfortunately, the influence of memories or memory hierarchies has not been considered in existing mapping optimizations so far, even though it is a well-known fact that memories have a drastic impact on the runtime and energy consumption of the system.
In this paper, we address the challenge of finding an efficient application-to-MPSoC mapping while explicitly considering the underlying memory subsystem and an efficient mapping of the tasks' memory objects to memories. For this purpose, we developed a memory-aware mapping tool based on ILP optimization. Evaluations on various benchmarks show that our memory-aware mapping tool outperforms state-of-the-art mapping optimizations by reducing the runtime by up to 18% and the energy consumption by up to 21%.},
} The mapping of applications onto multiprocessor system-on-chip (MPSoC) devices is an important and complex optimization task. The goal is to efficiently distribute application tasks to available processors while optimizing for energy or runtime. Unfortunately, the influence of memories or memory hierarchies has not been considered in existing mapping optimizations so far, even though it is a well-known fact that memories have a drastic impact on the runtime and energy consumption of the system.
In this paper, we address the challenge of finding an efficient application-to-MPSoC mapping while explicitly considering the underlying memory subsystem and an efficient mapping of the tasks' memory objects to memories. For this purpose, we developed a memory-aware mapping tool based on ILP optimization. Evaluations on various benchmarks show that our memory-aware mapping tool outperforms state-of-the-art mapping optimizations by reducing the runtime by up to 18% and the energy consumption by up to 21%.
|
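To make the flavor of such an ILP formulation concrete, here is a minimal sketch using the PuLP solver library; the tasks, processors, memories, cost figures and the SPM capacity are invented placeholders, not data from the paper:

```python
# Hypothetical memory-aware mapping ILP: assign tasks to processors and their
# memory objects to memories, minimizing a joint energy cost. All names and
# numbers below are made up for illustration.
from pulp import LpBinary, LpMinimize, LpProblem, LpVariable, lpSum

tasks = ["t0", "t1", "t2"]
procs = ["cpu0", "cpu1"]
mems  = ["spm", "dram"]                  # one small fast memory, one large slow one
objs  = ["buf_t0", "buf_t1", "buf_t2"]   # one memory object per task, for brevity

exec_cost = {"t0": {"cpu0": 3, "cpu1": 5}, "t1": {"cpu0": 6, "cpu1": 4},
             "t2": {"cpu0": 2, "cpu1": 2}}           # energy of task on processor
acc_cost = {o: {"spm": 1, "dram": 4} for o in objs}  # energy per object access
size = {"buf_t0": 2, "buf_t1": 3, "buf_t2": 2}
SPM_CAPACITY = 4

x = LpVariable.dicts("x", (tasks, procs), cat=LpBinary)  # task -> processor
y = LpVariable.dicts("y", (objs, mems), cat=LpBinary)    # object -> memory

prob = LpProblem("memory_aware_mapping", LpMinimize)
prob += (lpSum(exec_cost[t][p] * x[t][p] for t in tasks for p in procs)
         + lpSum(acc_cost[o][m] * y[o][m] for o in objs for m in mems))

for t in tasks:                      # each task runs on exactly one processor
    prob += lpSum(x[t][p] for p in procs) == 1
for o in objs:                       # each object lives in exactly one memory
    prob += lpSum(y[o][m] for m in mems) == 1
prob += lpSum(size[o] * y[o]["spm"] for o in objs) <= SPM_CAPACITY

prob.solve()
for t in tasks:
    print(t, "->", [p for p in procs if x[t][p].value() == 1][0])
```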
| Sascha Plazar, Jan Kleinsorge, Heiko Falk and Peter Marwedel. WCET-aware Static Locking of Instruction Caches. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), pages 44-52 San Jose, CA, USA, April 2012 [BibTeX][Link][Abstract]@inproceedings { plazar:2012:cgo,
author = {Plazar, Sascha and Kleinsorge, Jan and Falk, Heiko and Marwedel, Peter},
title = {WCET-aware Static Locking of Instruction Caches},
booktitle = {Proceedings of the International Symposium on Code Generation and Optimization (CGO)},
year = {2012},
pages = {44-52},
address = {San Jose, CA, USA},
month = {apr},
url = {http://www.uni-ulm.de/fileadmin/website_uni_ulm/iui.inst.050/profile/profil_hfalk/publications/20120402-cgo-plazar.pdf},
keywords = {wcet},
confidential = {n},
abstract = {In the past decades, embedded system designers moved from simple, predictable system designs towards complex systems equipped with caches. This step was necessary in order to bridge the ever-growing gap between processor and memory system performance. Static analysis techniques had to be developed to allow the estimation of the cache behavior and of an upper bound on the execution time of a program. This bound is called worst-case execution time (WCET). Its knowledge is crucial to verify whether hard real-time systems satisfy their timing constraints, and the WCET is a key parameter for the design of embedded systems.
In this paper, we propose a WCET-aware optimization technique for static I-cache locking which improves a program’s performance and predictability. To select the memory blocks to lock into the cache and avoid time-consuming repetitive WCET analyses, we developed a new algorithm employing integer-linear programming (ILP). The ILP models the worst-case execution path (WCEP) of a program and takes the influence of locked cache contents into account. By modeling the effect of locked memory blocks on the runtime of basic blocks, the overall WCET of a program can be minimized. We show that our optimization is able to reduce the WCET of real-life benchmarks by up to 40.8%. At the same time, our proposed approach is able to outperform a regular cache by up to 23.8% in terms of WCET.},
} In the past decades, embedded system designers moved from simple, predictable system designs towards complex systems equipped with caches. This step was necessary in order to bridge the ever-growing gap between processor and memory system performance. Static analysis techniques had to be developed to allow the estimation of the cache behavior and of an upper bound on the execution time of a program. This bound is called worst-case execution time (WCET). Its knowledge is crucial to verify whether hard real-time systems satisfy their timing constraints, and the WCET is a key parameter for the design of embedded systems.
In this paper, we propose a WCET-aware optimization technique for static I-cache locking which improves a program’s performance and predictability. To select the memory blocks to lock into the cache and avoid time-consuming repetitive WCET analyses, we developed a new algorithm employing integer-linear programming (ILP). The ILP models the worst-case execution path (WCEP) of a program and takes the influence of locked cache contents into account. By modeling the effect of locked memory blocks on the runtime of basic blocks, the overall WCET of a program can be minimized. We show that our optimization is able to reduce the WCET of real-life benchmarks by up to 40.8%. At the same time, our proposed approach is able to outperform a regular cache by up to 23.8% in terms of WCET.
|
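Reduced to its core, the block-selection problem sketched in this abstract is a knapsack: pick the set of memory blocks to lock that maximizes the WCET gain under the cache-capacity budget. The formulation below is a deliberate simplification (the paper's ILP additionally models the worst-case execution path and how it shifts when blocks are locked), and the symbols are assumed inputs:

```latex
\begin{align*}
\text{maximize}   \quad & \textstyle\sum_{b \in B} g_b\, x_b
   && g_b: \text{WCET gain if block } b \text{ is locked}\\
\text{subject to} \quad & \textstyle\sum_{b \in B} s_b\, x_b \le C
   && s_b: \text{size of block } b,\; C: \text{cache capacity}\\
& x_b \in \{0,1\} && x_b = 1 \text{ iff } b \text{ is locked into the I-cache}
\end{align*}
```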
| Andreas Heinig, Vincent J. Mooney, Florian Schmoll, Peter Marwedel, Krishna Palem and Michael Engel. Classification-based Improvement of Application Robustness and Quality of Service in Probabilistic Computer Systems. In Proceedings of ARCS 2012 - International Conference on Architecture of Computing Systems Munich, Germany, March 2012, -- ARCS 2012 Best Paper Award Winner -- [BibTeX][PDF][Abstract]@inproceedings { heinig:2012:arcs,
author = {Heinig, Andreas and Mooney, Vincent J. and Schmoll, Florian and Marwedel, Peter and Palem, Krishna and Engel, Michael},
title = {Classification-based Improvement of Application Robustness and Quality of Service in Probabilistic Computer Systems},
booktitle = {Proceedings of ARCS 2012 - International Conference on Architecture of Computing Systems},
year = {2012},
address = {Munich, Germany},
month = {mar},
note = {-- ARCS 2012 Best Paper Award Winner --},
keywords = {ders},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2012-arcs-heinig.pdf},
confidential = {n},
abstract = {Future semiconductors no longer guarantee permanent deterministic operation. They are expected to show probabilistic behavior due to lowered voltages and shrinking structures. Compared to radiation-induced errors, probabilistic systems face increased error frequencies leading to unexpected bit-flips. Approaches like probabilistic CMOS provide methods to control error distributions which reduce the error probability in more significant bits. However, instructions handling control flow or pointers still require determinism, requiring a classification to identify these instructions.
We apply our transient error classification to probabilistic circuits using differing voltage distributions. Static analysis ensures that probabilistic effects only affect unreliable operations which accept a certain level of impreciseness, and that errors in probabilistic components will never propagate to critical operations.
To evaluate, we analyze robustness and quality-of-service of an H.264 video decoder. Using classification results, we map unreliable arithmetic operations onto probabilistic components of an MPARM model, while remaining operations use deterministic components.},
} Future semiconductors no longer guarantee permanent deterministic operation. They are expected to show probabilistic behavior due to lowered voltages and shrinking structures. Compared to radiation-induced errors, probabilistic systems face increased error frequencies leading to unexpected bit-flips. Approaches like probabilistic CMOS provide methods to control error distributions which reduce the error probability in more significant bits. However, instructions handling control flow or pointers still require determinism, requiring a classification to identify these instructions.
We apply our transient error classification to probabilistic circuits using differing voltage distributions. Static analysis ensures that probabilistic effects only affect unreliable operations which accept a certain level of impreciseness, and that errors in probabilistic components will never propagate to critical operations.
To evaluate, we analyze robustness and quality-of-service of an H.264 video decoder. Using classification results, we map unreliable arithmetic operations onto probabilistic components of an MPARM model, while remaining operations use deterministic components.
|
| Sudipta Chattopadhyay, Chong Lee Kee, Abhik Roychoudhury, Timon Kelter, Heiko Falk and Peter Marwedel. A Unified WCET Analysis Framework for Multi-core Platforms. In IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), pages 99-108 Beijing, China, April 2012 [BibTeX][PDF][Link][Abstract]@inproceedings { kelter:2012:rtas,
author = {Chattopadhyay, Sudipta and Kee, Chong Lee and Roychoudhury, Abhik and Kelter, Timon and Falk, Heiko and Marwedel, Peter},
title = {A Unified WCET Analysis Framework for Multi-core Platforms},
booktitle = {IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS)},
year = {2012},
pages = {99-108},
address = {Beijing, China},
month = {April},
url = {http://www.rtas.org/12-home.htm},
keywords = {wcet},
file = {http://www.comp.nus.edu.sg/~sudiptac/papers/mxtiming.pdf},
confidential = {n},
abstract = {With the advent of multi-core architectures, worst-case execution time (WCET) analysis has become an increasingly difficult problem. In this paper, we propose a unified WCET analysis framework for multi-core processors featuring both a shared cache and a shared bus. Compared to previous works, our work differs by modeling the interaction of the shared cache and the shared bus with other basic micro-architectural components (e.g. pipeline and branch predictor). In addition, our framework does not assume a timing-anomaly-free multi-core architecture for computing the WCET. A detailed experimental methodology suggests that we can obtain reasonably tight WCET estimates for a wide range of benchmark programs.},
} With the advent of multi-core architectures, worst-case execution time (WCET) analysis has become an increasingly difficult problem. In this paper, we propose a unified WCET analysis framework for multi-core processors featuring both a shared cache and a shared bus. Compared to previous works, our work differs by modeling the interaction of the shared cache and the shared bus with other basic micro-architectural components (e.g. pipeline and branch predictor). In addition, our framework does not assume a timing-anomaly-free multi-core architecture for computing the WCET. A detailed experimental methodology suggests that we can obtain reasonably tight WCET estimates for a wide range of benchmark programs.
|
| Helena Kotthaus, Sascha Plazar and Peter Marwedel. A JVM-based Compiler Strategy for the R Language. In Abstract Booklet at The 8th International R User Conference (UseR!) WiP, pages 68 Nashville, Tennessee, USA, June 2012 [BibTeX][PDF]@inproceedings { kotthaus:12:user,
author = {Kotthaus, Helena and Plazar, Sascha and Marwedel, Peter},
title = {A JVM-based Compiler Strategy for the R Language},
booktitle = {Abstract Booklet at The 8th International R User Conference (UseR!) WiP},
year = {2012},
pages = {68},
address = {Nashville, Tennessee, USA},
month = {jun},
keywords = {R language, Java, dynamic compiler optimization},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2012-user-kotthaus.pdf},
confidential = {n},
} |
| Daniel Cordes, Michael Engel, Peter Marwedel and Olaf Neugebauer. Automatic extraction of multi-objective aware pipeline parallelism using genetic algorithms. In Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis Tampere, Finland, October 2012 [BibTeX][PDF][Abstract]@inproceedings { Cordes:2012:CODES,
author = {Cordes, Daniel and Engel, Michael and Marwedel, Peter and Neugebauer, Olaf},
title = {Automatic extraction of multi-objective aware pipeline parallelism using genetic algorithms},
booktitle = {Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis},
year = {2012},
series = {CODES+ISSS '12},
address = {Tampere, Finland},
month = {oct},
publisher = {ACM},
keywords = {automatic parallelization, embedded software, energy, genetic algorithms, multi-objective, pipeline parallelism},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2012-codes-cordes.pdf},
confidential = {n},
abstract = {The development of automatic parallelization techniques has been fascinating researchers for decades. This has resulted in a significant number of tools, which should relieve the designer from the burden of manually parallelizing an application. However, most of these tools only focus on minimizing execution time, which drastically reduces their applicability to embedded devices. It is essential to find good trade-offs between different objectives like, e.g., execution time, energy consumption, or communication overhead, if applications should be parallelized for embedded multiprocessor system-on-chip (MPSoC) devices. Another important aspect which has to be taken into account is the streaming-based structure found in many embedded applications such as multimedia and network services. The best way to parallelize these applications is to extract pipeline parallelism. Therefore, this paper presents the first multi-objective aware approach exploiting pipeline parallelism automatically to make it most suitable for resource-restricted embedded devices. We have compared the new pipeline parallelization approach to an existing task-level extraction technique. The evaluation has shown that the new approach extracts very efficient multi-objective aware parallelism. In addition, the two approaches have been combined and it could be shown that both approaches perfectly complement each other.},
} The development of automatic parallelization techniques has been fascinating researchers for decades. This has resulted in a significant number of tools, which should relieve the designer from the burden of manually parallelizing an application. However, most of these tools only focus on minimizing execution time, which drastically reduces their applicability to embedded devices. It is essential to find good trade-offs between different objectives like, e.g., execution time, energy consumption, or communication overhead, if applications should be parallelized for embedded multiprocessor system-on-chip (MPSoC) devices. Another important aspect which has to be taken into account is the streaming-based structure found in many embedded applications such as multimedia and network services. The best way to parallelize these applications is to extract pipeline parallelism. Therefore, this paper presents the first multi-objective aware approach exploiting pipeline parallelism automatically to make it most suitable for resource-restricted embedded devices. We have compared the new pipeline parallelization approach to an existing task-level extraction technique. The evaluation has shown that the new approach extracts very efficient multi-objective aware parallelism. In addition, the two approaches have been combined and it could be shown that both approaches perfectly complement each other.
|
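As a rough illustration of a genetic algorithm over pipeline-stage assignments, the sketch below evolves a mapping of loop-body nodes to stages. The chromosome encoding, the cost model and the fixed weighting of the two objectives are invented for this example; the paper uses a true multi-objective GA and respects data dependencies, both of which this toy omits:

```python
# Toy GA for pipeline-stage assignment: minimize a weighted sum of the
# bottleneck stage time (throughput proxy) and inter-stage communication.
import random

N_NODES, N_STAGES = 12, 3
work = [random.randint(1, 9) for _ in range(N_NODES)]  # invented work per node

def fitness(assign):
    stage_time = [0] * N_STAGES
    for node, stage in enumerate(assign):
        stage_time[stage] += work[node]
    # Communication counted whenever consecutive nodes sit in different stages.
    comm = sum(1 for i in range(N_NODES - 1) if assign[i] != assign[i + 1])
    return 0.7 * max(stage_time) + 0.3 * comm          # invented weighting

def mutate(assign):
    child = list(assign)
    child[random.randrange(N_NODES)] = random.randrange(N_STAGES)
    return child

def crossover(a, b):
    cut = random.randrange(1, N_NODES)
    return a[:cut] + b[cut:]

pop = [[random.randrange(N_STAGES) for _ in range(N_NODES)] for _ in range(30)]
for _ in range(200):
    pop.sort(key=fitness)                              # lower fitness is better
    elite = pop[:10]
    pop = elite + [mutate(crossover(random.choice(elite), random.choice(elite)))
                   for _ in range(20)]
print("best assignment:", pop[0], "fitness:", round(fitness(pop[0]), 2))
```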
| Michael Engel and Björn Döbel. The Reliable Computing Base – A Paradigm for Software-based Reliability. In Proceedings of SOBRES September 2012 [BibTeX][PDF][Abstract]@inproceedings { engel:2012:sobres,
author = {Engel, Michael and D\"obel, Bj\"orn},
title = {The Reliable Computing Base – A Paradigm for Software-based Reliability},
booktitle = {Proceedings of SOBRES},
year = {2012},
month = {sep},
keywords = {ders},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2012-sobres-engel.pdf},
confidential = {n},
abstract = {For embedded systems, the use of software-based error detection
and correction approaches is an attractive means to reduce the
often inconvenient overheads in hardware. To ensure that such a
software-based fault-tolerance approach is effective, it must be
guaranteed that a certain set of hardware and software components in
a system can be trusted to provide correct service in the presence of
errors. In analogy with the Trusted Computing Base (TCB) in security
research, we call these components the Reliable Computing Base (RCB).
Similar to the TCB, it is also desirable to reduce the size of the RCB, so the
overhead in redundant hardware resources can be reduced. In this
position paper, we describe approaches for informal as well as formal definitions
of the RCB, the related metrics and approaches for RCB minimization.
},
} For embedded systems, the use of software-based error detection
and correction approaches is an attractive means to reduce the
often inconvenient overheads in hardware. To ensure that such a
software-based fault-tolerance approach is effective, it must be
guaranteed that a certain set of hardware and software components in
a system can be trusted to provide correct service in the presence of
errors. In analogy with the Trusted Computing Base (TCB) in security
research, we call these components the Reliable Computing Base (RCB).
Similar to the TCB, it is also desirable to reduce the size of the RCB, so the
overhead in redundant hardware resources can be reduced. In this
position paper, we describe approaches for informal as well as formal definitions
of the RCB, the related metrics and approaches for RCB minimization.
|
| Björn Döbel, Hermann Härtig and Michael Engel. Operating System Support for Redundant Multithreading. In Proceedings of EMSOFT October 2012 [BibTeX][PDF][Abstract]@inproceedings { doebel:2012:EMSOFT,
author = {D\"obel, Bj\"orn and H\"artig, Hermann and Engel, Michael},
title = {Operating System Support for Redundant Multithreading},
booktitle = {Proceedings of EMSOFT},
year = {2012},
month = {oct},
keywords = {ders},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2012-emsoft-doebel.pdf},
confidential = {n},
abstract = {In modern commodity operating systems, core functionality is usually designed assuming that the underlying processor hardware always functions correctly. Shrinking hardware feature sizes break this assumption. Existing approaches to cope with these issues either use hardware functionality that is not available in commercial-off-the-shelf (COTS) systems or pose additional requirements on the software development side, making reuse of existing software hard, if not impossible.
In this paper we present Romain, a framework that provides transparent redundant multithreading as an operating system service for hardware error detection and recovery. When applied to a standard benchmark suite, Romain requires a maximum runtime overhead of 30% for triple modular redundancy (while in many cases remaining below 5%). Furthermore, our approach minimizes the complexity added to the operating system for the sake of replication.
},
} In modern commodity operating systems, core functionality is usually designed assuming that the underlying processor hardware always functions correctly. Shrinking hardware feature sizes break this assumption. Existing approaches to cope with these issues either use hardware functionality that is not available in commercial-off-the-shelf (COTS) systems or pose additional requirements on the software development side, making reuse of existing software hard, if not impossible.
In this paper we present Romain, a framework that provides transparent redundant multithreading as an operating system service for hardware error detection and recovery. When applied to a standard benchmark suite, Romain requires a maximum runtime overhead of 30% for triple modular redundancy (while in many cases remaining below 5%). Furthermore, our approach minimizes the complexity added to the operating system for the sake of replication.
|
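The core mechanism, replicated execution plus majority voting, fits in a few lines. The sketch below votes over three runs of a pure function with injected bit-flips; Romain itself replicates whole processes at the operating-system level, so the fault rate, the squaring "application" and the voter are invented simplifications:

```python
# Minimal model of triple modular redundancy (TMR) with a majority voter.
import random
from collections import Counter

def replica(x, fault_rate=0.1):
    result = x * x                        # the "application" computation
    if random.random() < fault_rate:      # injected transient bit-flip
        result ^= 1 << random.randrange(8)
    return result

def tmr(x):
    outputs = [replica(x) for _ in range(3)]
    winner, votes = Counter(outputs).most_common(1)[0]
    if votes == 1:                        # all three replicas disagree
        raise RuntimeError("no majority among replicas")
    return winner                         # masked as long as two replicas agree

print(tmr(7))   # prints 49 unless two replicas were hit by faults
```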
| Peter Marwedel and Michael Engel. Efficient Computing in Cyber-Physical Systems. In Proceedings of SAMOS XII July 2012 [BibTeX][PDF][Abstract]@inproceedings { marwedel:2012:samos,
author = {Marwedel, Peter and Engel, Michael},
title = {Efficient Computing in Cyber-Physical Systems},
booktitle = {Proceedings of SAMOS XII},
year = {2012},
month = {jul},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2012-samos-marwedel.pdf},
confidential = {n},
abstract = {Computing in cyber-physical systems has to be efficient in terms of a number of objectives. In particular, computing has to be execution-time and energy efficient. In this paper, we will consider optimization techniques aiming at efficiency in terms of these two objectives. In the first part, we will consider techniques for the integration of compilers and worst-case execution time (WCET) estimation. We will demonstrate, how such integration opens the door to WCET-reduction algorithms. For example, an algorithm for WCET-aware compilation reduces the WCET for an automotive application by more than 50\% by exploiting scratch pad memories (SPMs).
In the second part, we will demonstrate techniques for improving the energy efficiency of cyber-physical systems, in particular the use of SPMs. In the third part, we demonstrate how the optimization for multiple objectives taken into account. This paper provides an overview of work performed at the Chair for Embedded Systems of TU Dortmund and the Informatik Centrum Dortmund, Germany.},
} Computing in cyber-physical systems has to be efficient in terms of a number of objectives. In particular, computing has to be execution-time and energy efficient. In this paper, we will consider optimization techniques aiming at efficiency in terms of these two objectives. In the first part, we will consider techniques for the integration of compilers and worst-case execution time (WCET) estimation. We will demonstrate, how such integration opens the door to WCET-reduction algorithms. For example, an algorithm for WCET-aware compilation reduces the WCET for an automotive application by more than 50% by exploiting scratch pad memories (SPMs).
In the second part, we will demonstrate techniques for improving the energy efficiency of cyber-physical systems, in particular the use of SPMs. In the third part, we demonstrate how the optimization for multiple objectives taken into account. This paper provides an overview of work performed at the Chair for Embedded Systems of TU Dortmund and the Informatik Centrum Dortmund, Germany.
|
| Christopher Boelmann, Torben Weis, Arno Wacker and Michael Engel. Self-Stabilizing Micro Controller for Large-Scale Sensor Networks in Spite of Program Counter Corruptions due to Soft Errors. In Proceedings of ICPADS December 2012 [BibTeX][PDF][Abstract]@inproceedings { Boelmann:2012:ICPADS,
author = {Boelmann, Christopher and Weis, Torben and Wacker, Arno and Engel, Michael},
title = {Self-Stabilizing Micro Controller for Large-Scale Sensor Networks in Spite of Program Counter Corruptions due to Soft Errors},
booktitle = {Proceedings of ICPADS},
year = {2012},
month = {dec},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2012-icpads-boelmann.pdf},
confidential = {n},
abstract = {For large installations of networked embedded systems it is important that each entity is self-stabilizing, because usually there is nobody to restart nodes that have hung up. Self-stabilization means recovering from temporary failures (soft errors) and adapting to changes of the network topology caused by permanent failures. On the software side, self-stabilizing algorithms must assume that the hardware is executing the software correctly. In this paper we discuss cases in which soft errors invalidate this assumption, especially cases where CPU registers or the watchdog timer are affected by the fault.
Based on the observation that guaranteed self-stabilization is only possible as long as the watchdog timer is working properly after temporary failures, we propose and compare three different approaches that meet the requirements of sensor networks and solve this problem with a combination of hardware and software modifications:
1) A run-time verification of every watchdog access
2) A completely hardware-based approach, without any software modifications
3) A 2X byte code alignment, to realign a corrupted program counter
Furthermore, we determine the average code-size increase and evaluate the necessary hardware modifications that come along with each approach.},
} For large installations of networked embedded systems it is important that each entity is self-stabilizing, because usually there is nobody to restart nodes that have hung up. Self-stabilization means recovering from temporary failures (soft errors) and adapting to changes of the network topology caused by permanent failures. On the software side, self-stabilizing algorithms must assume that the hardware is executing the software correctly. In this paper we discuss cases in which soft errors invalidate this assumption, especially cases where CPU registers or the watchdog timer are affected by the fault.
Based on the observation that guaranteed self-stabilization is only possible as long as the watchdog timer is working properly after temporary failures, we propose and compare three different approaches that meet the requirements of sensor networks and solve this problem with a combination of hardware and software modifications:
1) A run-time verification of every watchdog access
2) A completely hardware-based approach, without any software modifications
3) A 2X byte code alignment, to realign a corrupted program counter
Furthermore, we determine the average code-size increase and evaluate the necessary hardware modifications that come along with each approach.
|
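Approach 1), run-time verification of every watchdog access, can be modeled as a watchdog that only honors a kick accompanied by a valid control-flow signature, so a corrupted program counter that skipped the checkpoints cannot keep the node alive. The checkpoint tokens and the signature update rule below are invented for this sketch and are not taken from the paper:

```python
# Watchdog kicks are only accepted when the accumulated control-flow
# signature matches the value expected for a complete pass through the code.
EXPECTED_CHECKPOINTS = [0x11, 0x22, 0x33]

def signature_of(tokens):
    sig = 0
    for token in tokens:
        sig = (sig * 31 + token) & 0xFFFF
    return sig

class VerifyingWatchdog:
    def __init__(self):
        self.signature = 0

    def checkpoint(self, token):          # called at fixed points in the code
        self.signature = (self.signature * 31 + token) & 0xFFFF

    def kick(self):
        if self.signature != signature_of(EXPECTED_CHECKPOINTS):
            raise SystemExit("watchdog: bad signature -> reset node")
        self.signature = 0                # start the next verification window

wd = VerifyingWatchdog()
for token in EXPECTED_CHECKPOINTS:        # normal control flow hits every check
    wd.checkpoint(token)
wd.kick()                                 # accepted
wd.kick()                                 # no checkpoints since last kick -> reset
```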
| Olivera Jovanovic, Peter Marwedel, Iuliana Bacivarov and Lothar Thiele. MAMOT: Memory-Aware Mapping Optimization Tool for MPSoC. In 15th Euromicro Conference on Digital System Design (DSD 2012) Izmir, Turkey, September 2012 [BibTeX]@inproceedings { Jovanovic/etal/2012a,
author = {Jovanovic, Olivera and Marwedel, Peter and Bacivarov, Iuliana and Thiele, Lothar},
title = {MAMOT: Memory-Aware Mapping Optimization Tool for MPSoC},
booktitle = {15th Euromicro Conference on Digital System Design (DSD 2012)},
year = {2012},
address = {Izmir, Turkey},
month = {September},
confidential = {n},
} |
| Jörg Henkel, Lars Bauer, Joachim Becker, Oliver Bringmann, Uwe Brinkschulte, Samarjit Chakraborty, Michael Engel, Rolf Ernst, Hermann Härtig, Lars Hedrich, Andreas Herkersdorf, Rüdiger Kapitza, Daniel Lohmann, Peter Marwedel, Marco Platzner, Wolfgang Rosenstiel, Ulf Schlichtmann, Olaf Spinczyk, Mehdi Tahoori, Jürgen Teich, Norbert Wehn and Hans-Joachim Wunderlich. Design and Architectures for Dependable Embedded Systems. In Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS) Taipei, Taiwan, October 2011 [BibTeX][PDF][Abstract]@inproceedings { SPP1500:11,
author = {Henkel, J{\"o}rg and Bauer, Lars and Becker, Joachim and Bringmann, Oliver and Brinkschulte, Uwe and Chakraborty, Samarjit and Engel, Michael and Ernst, Rolf and H{\"a}rtig, Hermann and Hedrich, Lars and Herkersdorf, Andreas and Kapitza, R{\"u}diger and Lohmann, Daniel and Marwedel, Peter and Platzner, Marco and Rosenstiel, Wolfgang and Schlichtmann, Ulf and Spinczyk, Olaf and Tahoori, Mehdi and Teich, J{\"u}rgen and Wehn, Norbert and Wunderlich, Hans-Joachim},
title = {Design and Architectures for Dependable Embedded Systems},
booktitle = {Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS)},
year = {2011},
address = {Taipei, Taiwan},
month = {oct},
keywords = {ders},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2011-esweek-marwedel.pdf},
confidential = {n},
abstract = {The paper presents an overview of a major research project on dependable embedded systems that started in Fall 2010 and is running for a projected duration of six years. Its aim is a `dependability co-design' that spans various levels of abstraction in the design process of embedded systems, from the gate level through the operating system and application software up to the system architecture. In addition, we present a new classification of faults, errors, and failures.},
} The paper presents an overview of a major research project on dependable embedded systems that started in Fall 2010 and is running for a projected duration of six years. Its aim is a `dependability co-design' that spans various levels of abstraction in the design process of embedded systems, from the gate level through the operating system and application software up to the system architecture. In addition, we present a new classification of faults, errors, and failures.
|
| Constantin Timm, Frank Weichert, David Fiedler, Christian Prasse, Heinrich Müller, Michael Hompel and Peter Marwedel. Decentralized Control of a Material Flow System enabled by an Embedded Computer Vision System. In Proceedings of the IEEE ICC 2011 Workshop on Embedding the Real World into the Future Internet June 2011 [BibTeX][PDF][Abstract]@inproceedings { Timm:2011a,
author = {Timm, Constantin and Weichert, Frank and Fiedler, David and Prasse, Christian and M{\"u}ller, Heinrich and Hompel, Michael and Marwedel, Peter},
title = {Decentralized Control of a Material Flow System enabled by an Embedded Computer Vision System},
booktitle = {Proceedings of the IEEE ICC 2011 Workshop on Embedding the Real World into the Future Internet},
year = {2011},
month = {jun},
file = {http://dx.doi.org/10.1109/iccw.2011.5963564},
confidential = {n},
abstract = {In this study, a novel sensor/actuator network approach for scalable automated facility logistics systems is presented. The approach comprises (1) a new sensor combination (cameras and few RFID scanners) for distributed detection, localization and identification of parcels and bins and (2) a novel middleware approach based on a service-oriented architecture tailored towards the utilization in sensor/actuator networks. The latter enables a more flexible deployment of automated facility logistics systems, while the former presents a novel departure for the detection and tracking of bins and parcels in automated facility logistics systems: light barriers and bar code readers are substituted by low-cost cameras, local conveyor-mounted embedded evaluation units and few RFID readers. By combining vision-based systems and RFID systems, this approach can compensate for the drawbacks of each respective system. By utilizing a state-of-the-art middleware for connecting all computer systems of an automated facility logistics system, the costs of deploying and reconfiguring the system can be decreased.
The paper describes image processing methods specific to the given problem to both track and read visual markers attached to parcels or bins, processing the data on an embedded system and communication/middleware aspects between different computer systems of an automated facility logistics system such as a database holding the loading and routing information of the conveyed objects as a service for the different visual sensor units. In addition, information from the RFID system is used to narrow the decision space for detection and identification. From an economic point of view, this approach enables a high density of identification while lowering hardware costs compared to state-of-the-art applications and, due to decentralized control, minimizing the effort for (re-)configuration. These innovations will make automated material flow systems more cost-efficient.},
} In this study, a novel sensor/actuator network approach for scalable automated facility logistics systems is presented. The approach comprises (1) a new sensor combination (cameras and few RFID scanners) for distributed detection, localization and identification of parcels and bins and (2) a novel middleware approach based on a service-oriented architecture tailored towards the utilization in sensor/actuator networks. The latter enables a more flexible deployment of automated facility logistics systems, while the former presents a novel departure for the detection and tracking of bins and parcels in automated facility logistics systems: light barriers and bar code readers are substituted by low-cost cameras, local conveyor-mounted embedded evaluation units and few RFID readers. By combining vision-based systems and RFID systems, this approach can compensate for the drawbacks of each respective system. By utilizing a state-of-the-art middleware for connecting all computer systems of an automated facility logistics system, the costs of deploying and reconfiguring the system can be decreased.
The paper describes image processing methods specific to the given problem to both track and read visual markers attached to parcels or bins, processing the data on an embedded system and communication/middleware aspects between different computer systems of an automated facility logistics system such as a database holding the loading and routing information of the conveyed objects as a service for the different visual sensor units. In addition, information from the RFID system is used to narrow the decision space for detection and identification. From an economic point of view, this approach enables a high density of identification while lowering hardware costs compared to state-of-the-art applications and, due to decentralized control, minimizing the effort for (re-)configuration. These innovations will make automated material flow systems more cost-efficient.
|
| Constantin Timm, Pascal Libuschewski, Dominic Siedhoff, Frank Weichert, Heinrich Müller and Peter Marwedel. Improving Nanoobject Detection in Optical Biosensor Data. In Proceedings of the 5th International Symposium on Bio- and Medical Informatics and Cybernetics, BMIC 2011 July 2011 [BibTeX][PDF][Abstract]@inproceedings { Timm:2011b,
author = {Timm, Constantin and Libuschewski, Pascal and Siedhoff, Dominic and Weichert, Frank and M{\"u}ller, Heinrich and Marwedel, Peter},
title = {Improving Nanoobject Detection in Optical Biosensor Data},
booktitle = {Proceedings of the 5th International Symposium on Bio- and Medical Informatics and Cybernetics, BMIC 2011},
year = {2011},
month = {July},
file = {http://www.iiis.org/CDs2011/CD2011SCI/BMIC_2011/PapersPdf/BA536CW.pdf},
confidential = {n},
abstract = {The importance of real-time capable mobile biosensors increases in the face of rising numbers of global virus epidemics. Such biosensors can be used for on-site diagnosis, e.g. at airports, to prevent further spread of virus-transmitted diseases, by answering the question whether or not a sample contains a certain virus. In-depth laboratory analysis might furthermore require measurements of the concentration of virus particles in a sample. The novel PAMONO sensor technique allows for accomplishing both tasks. One of its basic prerequisites is an efficient analysis of the biosensor image data by means of digital image processing and classification. In this study, we present a high performance approach to this analysis: The diagnosis whether a virus occurs in the sample can be carried out in real-time with high accuracy. An estimate of the concentration can be obtained in real-time as well, if that concentration is not too high.
The contribution of this work is an optimization of our processing pipeline used for PAMONO sensor data analysis. The following objectives are optimized: detection quality, speed and consumption of resources (e.g. energy, memory). Thus our approach respects the constraints imposed by medical applicability, as well as the constraints on resource consumption arising in embedded systems. The parameters to be optimized are descriptive (virus appearance parameters) and hardware-related (design space exploration).
},
} The importance of real-time capable mobile biosensors increases in the face of rising numbers of global virus epidemics. Such biosensors can be used for on-site diagnosis, e.g. at airports, to prevent further spread of virus-transmitted diseases, by answering the question whether or not a sample contains a certain virus. In-depth laboratory analysis might furthermore require measurements of the concentration of virus particles in a sample. The novel PAMONO sensor technique allows for accomplishing both tasks. One of its basic prerequisites is an efficient analysis of the biosensor image data by means of digital image processing and classification. In this study, we present a high performance approach to this analysis: The diagnosis whether a virus occurs in the sample can be carried out in real-time with high accuracy. An estimate of the concentration can be obtained in real-time as well, if that concentration is not too high.
The contribution of this work is an optimization of our processing pipeline used for PAMONO sensor data analysis. The following objectives are optimized: detection quality, speed and consumption of resources (e.g. energy, memory). Thus our approach respects the constraints imposed by medical applicability, as well as the constraints on resource consumption arising in embedded systems. The parameters to be optimized are descriptive (virus appearance parameters) and hardware-related (design space exploration).
|
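Viewed as design-space exploration, the tuning described above amounts to evaluating candidate parameter settings against the competing objectives and keeping the non-dominated ones. In the sketch below, the two parameters and the entire cost model are made up; only the Pareto-dominance test is the general technique:

```python
# Enumerate invented (threshold, window) configurations, score them on
# (detection error, runtime, energy), and keep the Pareto front.
from itertools import product

def evaluate(threshold, window):
    err = 1.0 / (threshold * window)      # invented quality model
    time = 0.02 * window * 480 * 360      # invented per-frame runtime
    energy = 0.5 * time                   # invented energy model
    return (err, time, energy)

def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

configs = {(t, w): evaluate(t, w) for t, w in product((2, 4, 8), (3, 5, 9))}
front = [c for c, cost in configs.items()
         if not any(dominates(other, cost) for other in configs.values())]
print("Pareto-optimal configurations:", front)
```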
| Michael Engel, Florian Schmoll, Andreas Heinig and Peter Marwedel. Temporal Properties of Error Handling for Multimedia Applications. In Proceedings of the 14th ITG Conference on Electronic Media Technology Dortmund / Germany, February 2011 [BibTeX][PDF][Abstract]@inproceedings { engel:11:itg,
author = {Engel, Michael and Schmoll, Florian and Heinig, Andreas and Marwedel, Peter},
title = {Temporal Properties of Error Handling for Multimedia Applications},
booktitle = {Proceedings of the 14th ITG Conference on Electronic Media Technology},
year = {2011},
address = {Dortmund / Germany},
month = {feb},
keywords = {ders},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2011-itg-engel.pdf},
confidential = {n},
abstract = {In embedded consumer electronics devices, cost pressure is one of the driving design objectives. Devices that handle multimedia information, like DVD players or digital video cameras, require high computing performance and real-time capabilities while adhering to the cost restrictions. The cost pressure often results in system designs that barely exceed the minimum requirements for such a system.
Thus, hardware-based fault tolerance methods are frequently ignored due to their cost overhead. However, the number of transient faults showing up in semiconductor-based systems is expected to increase sharply in the near future. Thus, low-overhead methods to correct related errors in such systems are required. Considering restrictions in processing speed, the real-time properties of a system with added error handling are of special interest. In this paper, we present our approach to flexible error handling and discuss the challenges as well as the inherent timing dependencies of deploying it in a typical soft real-time multimedia system, an H.264 video decoder.},
} In embedded consumer electronics devices, cost pressure is one of the driving design objectives. Devices that handle multimedia information, like DVD players or digital video cameras, require high computing performance and real-time capabilities while adhering to the cost restrictions. The cost pressure often results in system designs that barely exceed the minimum requirements for such a system.
Thus, hardware-based fault tolerance methods are frequently ignored due to their cost overhead. However, the number of transient faults showing up in semiconductor-based systems is expected to increase sharply in the near future. Thus, low-overhead methods to correct related errors in such systems are required. Considering restrictions in processing speed, the real-time properties of a system with added error handling are of special interest. In this paper, we present our approach to flexible error handling and discuss the challenges as well as the inherent timing dependencies of deploying it in a typical soft real-time multimedia system, an H.264 video decoder.
|
| Emanuele Cannella, Lorenzo Di Gregorio, Leandro Fiorin, Menno Lindwer, Paolo Meloni, Olaf Neugebauer and Andy D. Pimentel. Towards an ESL Design Framework for Adaptive and Fault-tolerant MPSoCs: MADNESS or not?. In Proceedings of the 9th IEEE/ACM Symposium on Embedded Systems for Real-Time Multimedia (ESTIMedia'11) Taipei, Taiwan, October 2011 [BibTeX][PDF][Abstract]@inproceedings { madness.2011,
author = {Cannella, Emanuele and Gregorio, Lorenzo Di and Fiorin, Leandro and Lindwer, Menno and Meloni, Paolo and Neugebauer, Olaf and Pimentel, Andy D.},
title = {Towards an ESL Design Framework for Adaptive and Fault-tolerant MPSoCs: MADNESS or not?},
booktitle = {Proceedings of the 9th IEEE/ACM Symposium on Embedded Systems for Real-Time Multimedia (ESTIMedia'11)},
year = {2011},
address = {Taipei, Taiwan},
month = {October},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2011-estimedia-madness.pdf},
confidential = {n},
abstract = {The MADNESS project aims at the definition of innovative system-level design methodologies for embedded MP-SoCs, extending the classic concept of design space exploration in multi-application domains to cope with high heterogeneity, technology scaling and system reliability. The main goal of the project is to provide a framework able to guide designers and researchers to the optimal composition of embedded MPSoC architectures, according to the requirements and the features of a given target application field. The proposed approach will tackle the new challenges, related to both architecture and design methodologies, arising with technology scaling, system reliability and the ever-growing computational needs of modern applications. The methodologies proposed in this project act at different levels of the design flow, enhancing the state of the art with novel features in system-level synthesis, architectural evaluation and prototyping. Support for fault resilience and efficient adaptive runtime management is introduced at hardware and middleware level, and considered by the system-level synthesis as one of the optimization factors to be taken into account. This paper presents the first stable results obtained in the MADNESS project, already demonstrating the effectiveness of the proposed methods.},
} The MADNESS project aims at the definition of innovative system-level design methodologies for embedded MP-SoCs, extending the classic concept of design space exploration in multi-application domains to cope with high heterogeneity, technology scaling and system reliability. The main goal of the project is to provide a framework able to guide designers and researchers to the optimal composition of embedded MPSoC architectures, according to the requirements and the features of a given target application field. The proposed approach will tackle the new challenges, related to both architecture and design methodologies, arising with technology scaling, system reliability and the ever-growing computational needs of modern applications. The methodologies proposed in this project act at different levels of the design flow, enhancing the state of the art with novel features in system-level synthesis, architectural evaluation and prototyping. Support for fault resilience and efficient adaptive runtime management is introduced at hardware and middleware level, and considered by the system-level synthesis as one of the optimization factors to be taken into account. This paper presents the first stable results obtained in the MADNESS project, already demonstrating the effectiveness of the proposed methods.
|
| Michael Engel, Florian Schmoll, Andreas Heinig and Peter Marwedel. Unreliable yet Useful -- Reliability Annotations for Data in Cyber-Physical Systems. In Proceedings of the 2011 Workshop on Software Language Engineering for Cyber-physical Systems (WS4C) Berlin / Germany, October 2011 [BibTeX][PDF][Abstract]@inproceedings { engel:11:ws4c,
author = {Engel, Michael and Schmoll, Florian and Heinig, Andreas and Marwedel, Peter},
title = {Unreliable yet Useful -- Reliability Annotations for Data in Cyber-Physical Systems},
booktitle = {Proceedings of the 2011 Workshop on Software Language Engineering for Cyber-physical Systems (WS4C)},
year = {2011},
address = {Berlin / Germany},
month = {oct},
keywords = {ders},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2011-ws4c-engel.pdf},
confidential = {n},
abstract = {Today, cyber-physical systems face yet another challenge in addition to the traditional constraints in energy, computing power, or memory. Shrinking semiconductor structure sizes and supply voltages imply that the number of errors that manifest themselves in a system will rise significantly. Most CP systems have to survive errors, but many systems do not have sufficient resources to correct all errors that show up. Thus, it is important to spend the available resources on handling errors with the most critical effect. We propose an ``unreliability'' annotation for data types in C programs that indicates if an error showing up in a specific variable or data structure will possibly cause a severe problem like a program crash or might only show rather negligible effects, e.g., a discolored pixel in video decoding. This classification of data is supported by static analysis methods that verify that the value contained in a variable marked as unreliable does not end up as part of a critical operation, e.g., an array index or loop termination condition. This classification enables several approaches to flexible error handling. For example, a CP system designer might choose to selectively safeguard variables marked as non-unreliable or to employ memories with different reliability properties to store the respective values.},
} Today, cyber-physical systems face yet another challenge in addition to the traditional constraints in energy, computing power, or memory. Shrinking semiconductor structure sizes and supply voltages imply that the number of errors that manifest themselves in a system will rise significantly. Most CP systems have to survive errors, but many systems do not have sufficient resources to correct all errors that show up. Thus, it is important to spend the available resources on handling errors with the most critical effect. We propose an "unreliability" annotation for data types in C programs that indicates if an error showing up in a specific variable or data structure will possibly cause a severe problem like a program crash or might only show rather negligible effects, e.g., a discolored pixel in video decoding. This classification of data is supported by static analysis methods that verify that the value contained in a variable marked as unreliable does not end up as part of a critical operation, e.g., an array index or loop termination condition. This classification enables several approaches to flexible error handling. For example, a CP system designer might choose to selectively safeguard variables marked as non-unreliable or to employ memories with different reliability properties to store the respective values.
|
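The annotation-and-propagation rule can be mimicked dynamically in a few lines. The paper performs this check statically on annotated C programs; the wrapper class and the critical() guard below are an invented Python stand-in that only illustrates how unreliability propagates through arithmetic but is rejected at critical operations:

```python
# Values tagged Unreliable may flow through arithmetic, but must never reach
# an operation that has to stay deterministic (array index, branch, pointer).
class Unreliable:
    def __init__(self, value):
        self.value = value
    def __add__(self, other):             # unreliability propagates
        v = other.value if isinstance(other, Unreliable) else other
        return Unreliable(self.value + v)

def critical(value):
    """Guard in front of operations that must stay deterministic."""
    if isinstance(value, Unreliable):
        raise TypeError("unreliable value reaches a critical operation")
    return value

pixel = Unreliable(200) + 13              # a discolored pixel is tolerable
table = list(range(10))
print(table[critical(5)])                 # reliable index: fine
try:
    print(table[critical(pixel)])         # unreliable index: rejected
except TypeError as err:
    print("rejected:", err)
```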
| Heiko Falk and Helena Kotthaus. WCET-driven Cache-aware Code Positioning. In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), pages 145-154 Taipei, Taiwan, October 2011 [BibTeX][PDF][Abstract]@inproceedings { falk:11:cases,
author = {Falk, Heiko and Kotthaus, Helena},
title = {WCET-driven Cache-aware Code Positioning},
booktitle = {Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES)},
year = {2011},
pages = {145-154},
address = {Taipei, Taiwan},
month = {oct},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2011-cases_1.pdf},
confidential = {n},
abstract = {Code positioning is a well-known compiler optimization aiming at the improvement of the instruction cache behavior. A contiguous mapping of code fragments in memory avoids overlapping of cache sets and thus decreases the number of cache conflict misses.
We present a novel cache-aware code positioning optimization driven by worst-case execution time (WCET) information. For this purpose, we introduce a formal cache model based on a conflict graph which is able to capture a broad class of cache architectures. This cache model is combined with a formal WCET timing model, resulting in a cache conflict graph weighted with WCET data. This conflict graph is then exploited by heuristics for code positioning of both basic blocks and entire functions.
Code positioning is able to decrease the accumulated cache misses for a total of 18 real-life benchmarks by 15.5% on average for an automotive processor featuring a 2-way set-associative cache. These cache miss reductions translate to average WCET reductions by 6.1%. For direct-mapped caches, even larger savings of 18.8% (cache misses) and 9.0% (WCET) were achieved.
},
} Code positioning is a well-known compiler optimization aiming at the improvement of the instruction cache behavior. A contiguous mapping of code fragments in memory avoids overlapping of cache sets and thus decreases the number of cache conflict misses.
We present a novel cache-aware code positioning optimization driven by worst-case execution time (WCET) information. For this purpose, we introduce a formal cache model based on a conflict graph which is able to capture a broad class of cache architectures. This cache model is combined with a formal WCET timing model, resulting in a cache conflict graph weighted with WCET data. This conflict graph is then exploited by heuristics for code positioning of both basic blocks and entire functions.
Code positioning is able to decrease the accumulated cache misses for a total of 18 real-life benchmarks by 15.5% on average for an automotive processor featuring a 2-way set-associative cache. These cache miss reductions translate to average WCET reductions by 6.1%. For direct-mapped caches, even larger savings of 18.8% (cache misses) and 9.0% (WCET) were achieved.
|
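A greedy reading of the conflict-graph idea: when edges carry WCET-weighted conflict misses, chaining the most heavily conflicting fragments next to each other keeps them from evicting one another. The fragment names and weights below are invented, and the paper's actual heuristics (for basic blocks and entire functions, under a formal cache model) are considerably more involved:

```python
# Place code fragments by descending conflict weight, so heavy conflict pairs
# end up contiguous in memory and stop competing for the same cache sets.
conflicts = {("f_decode", "f_filter"): 90,   # invented WCET-weighted misses
             ("f_decode", "f_output"): 25,
             ("f_filter", "f_output"): 40}

order, placed = [], set()
for (a, b), _ in sorted(conflicts.items(), key=lambda kv: -kv[1]):
    for frag in (a, b):
        if frag not in placed:
            order.append(frag)
            placed.add(frag)
print("layout:", order)   # the 90-weight pair is laid out back to back
```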
| Heiko Falk, Norman Schmitz and Florian Schmoll. WCET-aware Register Allocation based on Integer-Linear Programming. In Proceedings of the 23rd Euromicro Conference on Real-Time Systems (ECRTS), pages 13-22 Porto / Portugal, July 2011 [BibTeX][PDF][Abstract]@inproceedings { falk:11:ecrts,
author = {Falk, Heiko and Schmitz, Norman and Schmoll, Florian},
title = {WCET-aware Register Allocation based on Integer-Linear Programming},
booktitle = {Proceedings of the 23rd Euromicro Conference on Real-Time Systems (ECRTS)},
year = {2011},
pages = {13-22},
address = {Porto / Portugal},
month = {jul},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2011-ecrts_2.pdf},
confidential = {n},
abstract = {Current compilers lack precise timing models guiding their built-in optimizations. Hence, compilers apply ad-hoc heuristics during optimization to improve code quality. One of the most important optimizations is register allocation. Many compilers heuristically decide when and where to spill a register to memory, without having a clear understanding of the impact of such spill code on a program's runtime. This paper presents an integer-linear programming \textit{(ILP)} based register allocator that uses precise worst-case execution time \textit{(WCET)} models. Using this WCET timing data, the compiler avoids spill code generation along the critical path defining a program's WCET. To the best of our knowledge, this paper is the first one to present a WCET-aware ILP-based register allocator. Our results underline the effectiveness of the proposed techniques. For a total of 55 realistic benchmarks, we reduced WCETs by 20.2\% on average and ACETs by 14\%, compared to a standard graph coloring allocator. Furthermore, our ILP-based register allocator outperforms a WCET-aware graph coloring allocator by more than a factor of two for the considered benchmarks, while requiring less runtime.},
} Current compilers lack precise timing models guiding their built-in optimizations. Hence, compilers apply ad-hoc heuristics during optimization to improve code quality. One of the most important optimizations is register allocation. Many compilers heuristically decide when and where to spill a register to memory, without having a clear understanding of the impact of such spill code on a program's runtime. This paper presents an integer-linear programming (ILP) based register allocator that uses precise worst-case execution time (WCET) models. Using this WCET timing data, the compiler avoids spill code generation along the critical path defining a program's WCET. To the best of our knowledge, this paper is the first one to present a WCET-aware ILP-based register allocator. Our results underline the effectiveness of the proposed techniques. For a total of 55 realistic benchmarks, we reduced WCETs by 20.2% on average and ACETs by 14%, compared to a standard graph coloring allocator. Furthermore, our ILP-based register allocator outperforms a WCET-aware graph coloring allocator by more than a factor of two for the considered benchmarks, while requiring less runtime.
|
| Constantin Timm, Frank Weichert, Peter Marwedel and Heinrich Müller. Multi-Objective Local Instruction Scheduling for GPGPU Applications. In Proceedings of the International Conference on Parallel and Distributed Computing Systems 2011 (PDCS) Dallas, USA, December 2011 [BibTeX][PDF][Abstract]@inproceedings { timm:2011:pdcs,
author = {Timm, Constantin and Weichert, Frank and Marwedel, Peter and M{\"u}ller, Heinrich},
title = {Multi-Objective Local Instruction Scheduling for GPGPU Applications},
booktitle = {Proceedings of the International Conference on Parallel and Distributed Computing Systems 2011 (PDCS) },
year = {2011},
address = {Dallas, USA},
month = {December},
publisher = {IASTED/ACTA Press},
file = {http://www.actapress.com/PaperInfo.aspx?paperId=453074},
confidential = {n},
abstract = {In this paper, a new optimization approach (MOLIS: Multi-Objective Local Instruction Scheduling) is presented which maximizes the performance and minimizes the energy consumption of GPGPU applications. The design process of writing efficient GPGPU applications is time-consuming. This disadvantage mainly arises from the fact that the optimization of an application is accomplished in an expensive trial-and-error manner without efficient compiler support. In particular, efficient register utilization and load balancing of the concurrently working instruction and memory pipelines have so far not been considered in the compilation process. Another drawback of the state-of-the-art GPGPU application design process is that energy consumption is not taken into account, which is important in the face of green computing. In order to optimize performance and energy consumption simultaneously, a multi-objective genetic algorithm was utilized. The optimization of GPGPU applications in MOLIS employs local instruction scheduling methods. The optimization potential of MOLIS was evaluated by profiling the runtime and the energy consumption on a real platform. The optimization approach was tested with several real-world benchmarks stemming from the Nvidia CUDA examples, the VSIPL-GPGPU-Library and the Rodinia benchmark suite. By applying MOLIS to the real-world benchmarks, up to 9% energy and 12% runtime can be saved.},
} In this paper, a new optimization approach (MOLIS: Multi-Objective Local Instruction Scheduling) is presented which maximizes the performance and minimizes the energy consumption of GPGPU applications. The design process of writing efficient GPGPU applications is time-consuming. This disadvantage mainly arises from the fact that the optimization of an application is accomplished in an expensive trial-and-error manner without efficient compiler support. In particular, efficient register utilization and load balancing of the concurrently working instruction and memory pipelines have so far not been considered in the compilation process. Another drawback of the state-of-the-art GPGPU application design process is that energy consumption is not taken into account, which is important in the face of green computing. In order to optimize performance and energy consumption simultaneously, a multi-objective genetic algorithm was utilized. The optimization of GPGPU applications in MOLIS employs local instruction scheduling methods. The optimization potential of MOLIS was evaluated by profiling the runtime and the energy consumption on a real platform. The optimization approach was tested with several real-world benchmarks stemming from the Nvidia CUDA examples, the VSIPL-GPGPU-Library and the Rodinia benchmark suite. By applying MOLIS to the real-world benchmarks, up to 9% energy and 12% runtime can be saved.
|
| Timon Kelter, Heiko Falk, Peter Marwedel, Sudipta Chattopadhyay and Abhik Roychoudhury. Bus-Aware Multicore WCET Analysis through TDMA Offset Bounds. In Proceedings of the 23rd Euromicro Conference on Real-Time Systems (ECRTS), pages 3-12 Porto / Portugal, July 2011 [BibTeX][PDF][Abstract]@inproceedings { kelter:11:ecrts,
author = {Kelter, Timon and Falk, Heiko and Marwedel, Peter and Chattopadhyay, Sudipta and Roychoudhury, Abhik},
title = {Bus-Aware Multicore WCET Analysis through TDMA Offset Bounds},
booktitle = {Proceedings of the 23rd Euromicro Conference on Real-Time Systems (ECRTS)},
year = {2011},
pages = {3-12},
address = {Porto / Portugal},
month = {jul},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2011-ecrts_1.pdf},
confidential = {n},
abstract = {In the domain of real-time systems, the analysis of the timing behavior of programs is crucial for guaranteeing the schedulability and thus the safety of a system. Static analyses of the \textit{WCET} (Worst-Case Execution Time) have proven to be a key element for timing analysis, as they provide safe upper bounds on a program's execution time. For single-core systems, industrial-strength WCET analyzers are already available, but up to now, only first proposals have been made to analyze the WCET in multicore systems, where the different cores may interfere during the access to shared resources. Important examples of this are shared buses which connect the cores to a shared main memory. The time to gain access to the shared bus may vary significantly, depending on the bus arbitration protocol used and on the access timings. In this paper, we propose a new technique for analyzing the duration of accesses to shared buses. We implemented a prototype tool which uses the new analysis and tested it on a set of real-world benchmarks. Results demonstrate that our analysis achieves the same precision as the best existing approach while drastically outperforming it in terms of analysis time.},
} In the domain of real-time systems, the analysis of the timing behavior of programs is crucial for guaranteeing the schedulability and thus the safety of a system. Static analyses of the WCET (Worst-Case Execution Time) have proven to be a key element for timing analysis, as they provide safe upper bounds on a program's execution time. For single-core systems, industrial-strength WCET analyzers are already available, but up to now, only first proposals have been made to analyze the WCET in multicore systems, where the different cores may interfere during the access to shared resources. Important examples of this are shared buses which connect the cores to a shared main memory. The time to gain access to the shared bus may vary significantly, depending on the bus arbitration protocol used and on the access timings. In this paper, we propose a new technique for analyzing the duration of accesses to shared buses. We implemented a prototype tool which uses the new analysis and tested it on a set of real-world benchmarks. Results demonstrate that our analysis achieves the same precision as the best existing approach while drastically outperforming it in terms of analysis time.
|
| Sascha Plazar, Jan C. Kleinsorge, Heiko Falk and Peter Marwedel. WCET-driven Branch Prediction aware Code Positioning. In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), pages 165-174 Taipei, Taiwan, October 2011 [BibTeX][PDF][Abstract]@inproceedings { plazar:11:cases,
author = {Plazar, Sascha and Kleinsorge, Jan C. and Falk, Heiko and Marwedel, Peter},
title = {WCET-driven Branch Prediction aware Code Positioning},
booktitle = {Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES)},
year = {2011},
pages = {165-174},
address = {Taipei, Taiwan},
month = {oct},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2011-cases_2.pdf},
confidential = {n},
abstract = {In the past decades, embedded system designers moved from simple, predictable system designs towards complex systems equipped with caches, branch prediction units and speculative execution. This step was necessary in order to fulfill increasing requirements on computational power. Static analysis techniques considering such speculative units had to be developed to allow the estimation of an upper bound of the execution time of a program. This bound is called worst-case execution time (WCET). Its knowledge is crucial to verify whether hard real-time systems satisfy their timing constraints, and the WCET is a key parameter for the design of embedded systems.
In this paper, we propose a WCET-driven branch prediction aware optimization which reorders basic blocks of a function in order to reduce the number of jump instructions and mispredicted branches. We employed a genetic algorithm which rearranges basic blocks in order to decrease the WCET of a program. This enables a first estimation of the possible optimization potential at the cost of high optimization runtimes. To avoid time-consuming repetitive WCET analyses, we developed a new algorithm employing integer linear programming (ILP). The ILP models the worst-case execution path (WCEP) of a program and takes branch prediction effects into account. This algorithm enables short optimization runtimes at the cost of slightly decreased optimization results. In a case study, the genetic algorithm is able to reduce the benchmarks’ WCET by up to 24.7% whereas our ILP-based approach is able to decrease the WCET by up to 20.0%.},
} In the past decades, embedded system designers moved from simple, predictable system designs towards complex systems equipped with caches, branch prediction units and speculative execution. This step was necessary in order to fulfill increasing requirements on computational power. Static analysis techniques considering such speculative units had to be developed to allow the estimation of an upper bound of the execution time of a program. This bound is called worst-case execution time (WCET). Its knowledge is crucial to verify whether hard real-time systems satisfy their timing constraints, and the WCET is a key parameter for the design of embedded systems.
In this paper, we propose a WCET-driven branch prediction aware optimization which reorders basic blocks of a function in order to reduce the number of jump instructions and mispredicted branches. We employed a genetic algorithm which rearranges basic blocks in order to decrease the WCET of a program. This enables a first estimation of the possible optimization potential at the cost of high optimization runtimes. To avoid time-consuming repetitive WCET analyses, we developed a new algorithm employing integer linear programming (ILP). The ILP models the worst-case execution path (WCEP) of a program and takes branch prediction effects into account. This algorithm enables short optimization runtimes at the cost of slightly decreased optimization results. In a case study, the genetic algorithm is able to reduce the benchmarks’ WCET by up to 24.7% whereas our ILP-based approach is able to decrease the WCET by up to 20.0%.
|
| Daniel Cordes, Andreas Heinig, Peter Marwedel and Arindam Mallik. Automatic Extraction of Pipeline Parallelism for Embedded Software Using Linear Programming. In Proceedings of the 17th IEEE International Conference on Parallel and Distributed Systems (ICPADS), 2011, pages 699-706 Tainan, Taiwan, December 2011 [BibTeX][PDF][Abstract]@inproceedings { cordes:2011:icpads,
author = {Cordes, Daniel and Heinig, Andreas and Marwedel, Peter and Mallik, Arindam},
title = {Automatic Extraction of Pipeline Parallelism for Embedded Software Using Linear Programming},
booktitle = {Proceedings of the 17th IEEE International Conference on Parallel and Distributed Systems (ICPADS), 2011},
year = {2011},
pages = {699-706},
address = {Tainan, Taiwan},
month = {dec},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2011-icpads-cordes.pdf},
confidential = {n},
abstract = {The complexity and performance requirements of embedded software are continuously increasing, making Multiprocessor System-on-Chip (MPSoC) architectures more and more important in the domain of embedded and cyber-physical systems. Using multiple cores in a single system reduces problems concerning energy consumption and heat dissipation, and increases performance. Nevertheless, these benefits do not come for free. Porting existing, mostly sequential, applications to MPSoCs requires extracting efficient parallelism to utilize all available cores. Many embedded applications, like network services and multimedia tasks for voice, image and video processing, operate on data streams and thus have a streaming-based structure. Despite the abundance of parallelism in streaming applications, it is a non-trivial task to split and efficiently map sequential applications to MPSoCs. Therefore, we present an algorithm which automatically extracts pipeline parallelism from sequential ANSI-C applications. The presented tool employs an integer linear programming (ILP) based approach enriched with an adequate cost model to automatically control the granularity of the parallelization. By applying our tool to real-life applications, it can be shown that our approach is able to speed up applications by a factor of up to 3.9x on a four-core MPSoC architecture, compared to a sequential execution.},
} The complexity and performance requirements of embedded software are continuously increasing, making Multiprocessor System-on-Chip (MPSoC) architectures more and more important in the domain of embedded and cyber-physical systems. Using multiple cores in a single system reduces problems concerning energy consumption and heat dissipation, and increases performance. Nevertheless, these benefits do not come for free. Porting existing, mostly sequential, applications to MPSoCs requires extracting efficient parallelism to utilize all available cores. Many embedded applications, like network services and multimedia tasks for voice, image and video processing, operate on data streams and thus have a streaming-based structure. Despite the abundance of parallelism in streaming applications, it is a non-trivial task to split and efficiently map sequential applications to MPSoCs. Therefore, we present an algorithm which automatically extracts pipeline parallelism from sequential ANSI-C applications. The presented tool employs an integer linear programming (ILP) based approach enriched with an adequate cost model to automatically control the granularity of the parallelization. By applying our tool to real-life applications, it can be shown that our approach is able to speed up applications by a factor of up to 3.9x on a four-core MPSoC architecture, compared to a sequential execution.
|
| Peter Marwedel and Michael Engel. Embedded System Design 2.0: Rationale Behind a Textbook Revision. In Proceedings of Workshop on Embedded Systems Education (WESE) Taipei, Taiwan, October 2011 [BibTeX][PDF][Abstract]@inproceedings { marwedel:2011:wese,
author = {Marwedel, Peter and Engel, Michael},
title = {Embedded System Design 2.0: Rationale Behind a Textbook Revision},
booktitle = {Proceedings of Workshop on Embedded Systems Education (WESE)},
year = {2011},
address = {Taipei, Taiwan},
month = {October},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2011-wese-marwedel.pdf},
confidential = {n},
abstract = {Seven years after its first release, it became necessary to publish a new edition of the author’s textbook on embedded system design. This paper explains the key changes that were incorporated into the second edition. These changes reflect seven years of teaching the subject, with two courses every year. The rationale behind these changes can also be found in the paper. In this way, the paper also reflects changes in the area over time, as the area becomes more mature. The paper helps the reader understand why a particular topic is included in this curriculum for embedded system design and why a certain structure of the course is suggested.},
} Seven years after its first release, it became necessary to publish a new edition of the author’s textbook on embedded system design. This paper explains the key changes that were incorporated into the second edition. These changes reflect seven years of teaching the subject, with two courses every year. The rationale behind these changes can also be found in the paper. In this way, the paper also reflects changes in the area over time, as the area becomes more mature. The paper helps the reader understand why a particular topic is included in this curriculum for embedded system design and why a certain structure of the course is suggested.
|
| Horst Schirmeier, Jens Neuhalfen, Ingo Korb, Olaf Spinczyk and Michael Engel. RAMpage: Graceful Degradation Management for Memory Errors in Commodity Linux Servers. In Proceedings of the 11th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2011) Pasadena, USA, 2011 [BibTeX][PDF][Abstract]@inproceedings { schirmeier:11:prdc,
author = {Schirmeier, Horst and Neuhalfen, Jens and Korb, Ingo and Spinczyk, Olaf and Engel, Michael},
title = {RAMpage: Graceful Degradation Management for Memory Errors in Commodity Linux Servers},
booktitle = {Proceedings of the 11th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2011)},
year = {2011},
address = {Pasadena, USA},
organization = {IEEE Computer Society Press},
keywords = {ders},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2011-prdc-schirmeier.pdf},
confidential = {n},
abstract = {Memory errors are a major source of reliability problems in current computers. Undetected errors may result in program termination, or, even worse, silent data corruption. Recent studies have shown that the frequency of permanent memory errors is an order of magnitude higher than previously assumed and regularly affects everyday operation.
Often, neither additional circuitry to support hardware-based error detection nor downtime for performing hardware tests can be afforded. In the case of permanent memory errors, a system faces two challenges: detecting errors as early as possible and handling them while avoiding system downtime.
To increase system reliability, we have developed RAMpage, an online memory testing infrastructure for commodity x86-64-based Linux servers, which is capable of efficiently detecting memory errors and which provides graceful degradation by withdrawing affected memory pages from further use.
We describe the design and implementation of RAMpage and present results of an extensive qualitative as well as quantitative evaluation.},
} Memory errors are a major source of reliability problems in current computers. Undetected errors may result in program termination, or, even worse, silent data corruption. Recent studies have shown that the frequency of permanent memory errors is an order of magnitude higher than previously assumed and regularly affects everyday operation.
Often, neither additional circuitry to support hardware-based error detection nor downtime for performing hardware tests can be afforded. In the case of permanent memory errors, a system faces two challenges: detecting errors as early as possible and handling them while avoiding system downtime.
To increase system reliability, we have developed RAMpage, an online memory testing infrastructure for commodity x86-64-based Linux servers, which is capable of efficiently detecting memory errors and which provides graceful degradation by withdrawing affected memory pages from further use.
We describe the design and implementation of RAMpage and present results of an extensive qualitative as well as quantitative evaluation.
|
| Jan C. Kleinsorge, Heiko Falk and Peter Marwedel. A Synergetic Approach To Accurate Analysis Of Cache-Related Preemption Delay. In Proceedings of the International Conference on Embedded Software (EMSOFT), pages 329-338 Taipei, Taiwan, October 2011 [BibTeX][PDF][Abstract]@inproceedings { kleinsorge:11:emsoft,
author = {Kleinsorge, Jan C. and Falk, Heiko and Marwedel, Peter},
title = {A Synergetic Approach To Accurate Analysis Of Cache-Related Preemption Delay},
booktitle = {Proceedings of the International Conference on Embedded Software (EMSOFT)},
year = {2011},
pages = {329-338},
address = {Taipei, Taiwan},
month = {oct},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2011-emsoft.pdf},
confidential = {n},
abstract = {The worst-case execution time (WCET) of a task denotes the largest possible execution time for all possible inputs and thus, hardware states. For non-preemptive multitask scheduling, techniques for the static estimation of safe upper bounds have been subject to industrial practice for years. For preemptive scheduling however, the isolated analysis of tasks becomes imprecise as interferences among tasks cannot be considered with sufficient precision. For such scenarios, the cache-related preemption delay (CRPD) denotes a key metric as it reflects the effects of preemptions on the execution behavior of a single task. Until recently, proposals for CRPD analyses were often limited to direct-mapped caches or comparably imprecise for k-way set-associative caches.
In this paper, we propose how the current best techniques for CRPD analysis, which have only been proposed separately and for different aspects of the analysis, can be brought together to construct an efficient CRPD analysis with unique properties. Moreover, along the construction, we propose several different enhancements to the methods employed. We also exploit that in a complete approach, analysis steps are synergetic and can be combined into a single analysis pass solving all formerly separate steps at once. In addition, we argue that it is often sufficient to carry out the combined analysis on basic block bounds, which further lowers the overall complexity. The result is a proposal for a fast CRPD analysis of very high accuracy.},
} The worst-case execution time (WCET) of a task denotes the largest possible execution time for all possible inputs and thus, hardware states. For non-preemptive multitask scheduling, techniques for the static estimation of safe upper bounds have been subject to industrial practice for years. For preemptive scheduling however, the isolated analysis of tasks becomes imprecise as interferences among tasks cannot be considered with sufficient precision. For such scenarios, the cache-related preemption delay (CRPD) denotes a key metric as it reflects the effects of preemptions on the execution behavior of a single task. Until recently, proposals for CRPD analyses were often limited to direct-mapped caches or comparably imprecise for k-way set-associative caches.
In this paper, we propose how the current best techniques for CRPD analysis, which have only been proposed separately and for different aspects of the analysis, can be brought together to construct an efficient CRPD analysis with unique properties. Moreover, along the construction, we propose several different enhancements to the methods employed. We also exploit that in a complete approach, analysis steps are synergetic and can be combined into a single analysis pass solving all formerly separate steps at once. In addition, we argue that it is often sufficient to carry out the combined analysis on basic block bounds, which further lowers the overall complexity. The result is a proposal for a fast CRPD analysis of very high accuracy.
|
| Samarjit Chakraborty, Marco Di Natale, Heiko Falk, Martin Lukasiewyzc and Frank Slomka. Timing and Schedulability Analysis for Distributed Automotive Control Applications. In Tutorial at the International Conference on Embedded Software (EMSOFT), pages 349-350 Taipei, Taiwan, October 2011 [BibTeX][PDF][Abstract]@inproceedings { falk:11:emsoft_tutorial,
author = {Chakraborty, Samarjit and Di Natale, Marco and Falk, Heiko and Lukasiewyzc, Martin and Slomka, Frank},
title = {Timing and Schedulability Analysis for Distributed Automotive Control Applications},
booktitle = {Tutorial at the International Conference on Embedded Software (EMSOFT)},
year = {2011},
pages = {349-350},
address = {Taipei, Taiwan},
month = {oct},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2011-emsoft_tutorial.pdf},
confidential = {n},
abstract = {High-end cars today consist of more than 100 electronic control units (ECUs) that are connected to a set of sensors and actuators and run multiple distributed control applications. The design flow of such architectures consists of specifying control applications as Simulink/Stateflow models, followed by generating code from them and finally mapping such code onto multiple ECUs. In addition, the scheduling policies and parameters on both the ECUs and the communication buses over which they communicate also need to be specified. These policies and parameters are computed from high-level timing and control performance constraints. The proposed tutorial will cover different aspects of this design flow, with a focus on timing and schedulability problems. After reviewing the basic concepts of worst-case execution time analysis and schedulability analysis, we will discuss the differences between meeting timing constraints (as in classical real-time systems) and meeting control performance constraints (e.g., stability, steady and transient state performance). We will then describe various control performance related schedulability analysis techniques and how they may be tied to model-based software development. Finally, we will discuss various schedule synthesis techniques, both for ECUs as well as for communication protocols like FlexRay, so that control performance constraints specified at the model-level may be satisfied. Throughout the tutorial different commercial as well as academic tools will be discussed and demonstrated.},
} High-end cars today consist of more than 100 electronic control units (ECUs) that are connected to a set of sensors and actuators and run multiple distributed control applications. The design flow of such architectures consists of specifying control applications as Simulink/Stateflow models, followed by generating code from them and finally mapping such code onto multiple ECUs. In addition, the scheduling policies and parameters on both the ECUs and the communication buses over which they communicate also need to be specified. These policies and parameters are computed from high-level timing and control performance constraints. The proposed tutorial will cover different aspects of this design flow, with a focus on timing and schedulability problems. After reviewing the basic concepts of worst-case execution time analysis and schedulability analysis, we will discuss the differences between meeting timing constraints (as in classical real-time systems) and meeting control performance constraints (e.g., stability, steady and transient state performance). We will then describe various control performance related schedulability analysis techniques and how they may be tied to model-based software development. Finally, we will discuss various schedule synthesis techniques, both for ECUs as well as for communication protocols like FlexRay, so that control performance constraints specified at the model-level may be satisfied. Throughout the tutorial different commercial as well as academic tools will be discussed and demonstrated.
|
| Peter Marwedel, Jürgen Teich, Georgia Kouveli, Iuliana Bacivarov, Lothar Thiele, Soonhoi Ha, Chanhee Lee, Qiang Xu and Lin Huang. Mapping of Applications to MPSoCs. In Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS) Taipei, Taiwan, October 2011 [BibTeX][PDF][Abstract]@inproceedings { marwedel:2011:codes-isss2,
author = {Marwedel, Peter and Teich, J\"urgen and Kouveli, Georgia and Bacivarov, Iuliana and Thiele, Lothar and Ha, Soonhoi and Lee, Chanhee and Xu, Qiang and Huang, Lin},
title = {Mapping of Applications to MPSoCs},
booktitle = {Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS)},
year = {2011},
address = {Taipei, Taiwan},
month = {October},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2011-codes-isss-marwedel.pdf},
confidential = {n},
abstract = {The advent of embedded many-core architectures results in the need to come up with techniques for mapping embedded applications onto such architectures. This paper presents a representative set of such techniques. The techniques focus on optimizing performance, temperature distribution, reliability and fault tolerance for various models.},
} The advent of embedded many-core architectures results in the need to come up with techniques for mapping embedded applications onto such architectures. This paper presents a representative set of such techniques. The techniques focus on optimizing performance, temperature distribution, reliability and fault tolerance for various models.
|
| Robert Pyka, Felipe Klein, Peter Marwedel and Stylianos Mamagkakis. Versatile System-Level Memory-Aware Platform Description Approach for Embedded MPSoCs. In Proc. of the ACM SIGPLAN/SIGBED 2010 Conference on Languages, Compilers, and Tools for Embedded Systems, pages 9-16 2010 [BibTeX][Abstract]@inproceedings { pyka:2010,
author = {Pyka, Robert and Klein, Felipe and Marwedel, Peter and Mamagkakis, Stylianos},
title = {Versatile System-Level Memory-Aware Platform Description Approach for Embedded MPSoCs},
booktitle = {Proc. of the ACM SIGPLAN/SIGBED 2010 Conference on Languages, Compilers, and Tools for Embedded Systems},
year = {2010},
pages = {9-16},
publisher = {ACM},
confidential = {n},
abstract = {In this paper, we present a novel system modeling language which primarily targets the development of source-level multiprocessor memory aware optimizations. In contrast to previous system modeling approaches, this approach tries to model the whole system and especially the memory hierarchy in a structural and semantically accessible way. Previous approaches primarily support the generation of simulators or retargetable code selectors and thus concentrate on pure behavioral models or describe only the processor instruction set in a semantically accessible way. A simple, database-like interface is offered to the optimization developer, which in conjunction with the MACCv2 framework enables rapid development of source-level architecture independent optimizations.},
} In this paper, we present a novel system modeling language which primarily targets the development of source-level multiprocessor memory aware optimizations. In contrast to previous system modeling approaches, this approach tries to model the whole system and especially the memory hierarchy in a structural and semantically accessible way. Previous approaches primarily support the generation of simulators or retargetable code selectors and thus concentrate on pure behavioral models or describe only the processor instruction set in a semantically accessible way. A simple, database-like interface is offered to the optimization developer, which in conjunction with the MACCv2 framework enables rapid development of source-level architecture independent optimizations.
|
| Matthias Meier, Michael Engel, Matthias Steinkamp and Olaf Spinczyk. LavA: An Open Platform for Rapid Prototyping of MPSoCs. In Proceedings of the 20th International Conference on Field Programmable Logic and Applications (FPL '10), pages 452--457 Milano, Italy, 2010 [BibTeX]@inproceedings { meier:10:fpl,
author = {Meier, Matthias and Engel, Michael and Steinkamp, Matthias and Spinczyk, Olaf},
title = {LavA: An Open Platform for Rapid Prototyping of MPSoCs},
booktitle = {Proceedings of the 20th International Conference on Field Programmable Logic and Applications (FPL '10)},
year = {2010},
pages = {452--457},
address = {Milano, Italy},
publisher = {IEEE Computer Society Press},
confidential = {n},
} |
| Michael Engel, Felix Jungermann, Katharina Morik and Nico Piatkowski. Enhancing Ubiquitous Systems Through System Call Mining. In Proceedings of the ICDM 2010 Workshop on Large-scale Analytics for Complex Instrumented Systems (LACIS 2010) 2010 [BibTeX]@inproceedings { engel:10:lacis,
author = {Engel, Michael and Jungermann, Felix and Morik, Katharina and Piatkowski, Nico},
title = {Enhancing Ubiquitous Systems Through System Call Mining},
booktitle = {Proceedings of the ICDM 2010 Workshop on Large-scale Analytics for Complex Instrumented Systems (LACIS 2010)},
year = {2010},
confidential = {n},
} |
| Peter Marwedel and Michael Engel. Plea for a Holistic Analysis of the Relationship between Information Technology and Carbon-Dioxide Emissions. In Workshop on Energy-aware Systems and Methods (GI-ITG) Hanover / Germany, February 2010 [BibTeX][PDF][Abstract]@inproceedings { marwedel:10:GI,
author = {Marwedel, Peter and Engel, Michael},
title = {Plea for a Holistic Analysis of the Relationship between Information Technology and Carbon-Dioxide Emissions},
booktitle = {Workshop on Energy-aware Systems and Methods (GI-ITG)},
year = {2010},
address = {Hanover / Germany},
month = {feb},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/arcs-10-marwedel.pdf},
confidential = {n},
abstract = {An analysis of the relationship between information technology (IT) and carbon-dioxide (CO2) emissions should not be constrained to an analysis of emissions caused during the operation of IT equipment. Rather, an analysis of emissions should be based on a full life-cycle assessment (LCA) of IT systems, from their conception until their recycling. Also, the reduction of emissions through the use of IT systems should not be forgotten. This paper explains these viewpoints in more detail and provides rough life-cycle analyses of personal computers (PCs). It will be shown that, for standard scenarios, emissions from PC production exceed those of their shipment and use. This stresses the importance of using PCs as long as possible.},
} An analysis of the relationship between information technology (IT) and carbon-dioxide (CO2) emissions should not be constrained to an analysis of emissions caused during the operation of IT equipment. Rather, an analysis of emissions should be based on a full life-cycle assessment (LCA) of IT systems, from their conception until their recycling. Also, the reduction of emissions through the use of IT systems should not be forgotten. This paper explains these viewpoints in more detail and provides rough life-cycle analyses of personal computers (PCs). It will be shown that, for standard scenarios, emissions from PC production exceed those of their shipment and use. This stresses the importance of using PCs as long as possible.
|
| Constantin Timm, Andrej Gelenberg, Peter Marwedel and Frank Weichert. Energy Considerations within the Integration of General Purpose GPUs in Embedded Systems. In Proceedings of the International Conference on Advances in Distributed and Parallel Computing November 2010 [BibTeX]@inproceedings { timm:2010:adpc,
author = {Timm, Constantin and Gelenberg, Andrej and Marwedel, Peter and Weichert, Frank},
title = {Energy Considerations within the Integration of General Purpose GPUs in Embedded Systems},
booktitle = {Proceedings of the International Conference on Advances in Distributed and Parallel Computing},
year = {2010},
month = {November},
publisher = {Global Science \& Technology Forum},
confidential = {n},
} |
| Daniel Cordes, Peter Marwedel and Arindam Mallik. Automatic Parallelization of Embedded Software Using Hierarchical Task Graphs and Integer Linear Programming. In Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis (CODES+ISSS 2010) Scottsdale / US, October 2010 [BibTeX][PDF][Abstract]@inproceedings { cordes:10:CODES,
author = {Cordes, Daniel and Marwedel, Peter and Mallik, Arindam},
title = {Automatic Parallelization of Embedded Software Using Hierarchical Task Graphs and Integer Linear Programming},
booktitle = {Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis (CODES+ISSS 2010)},
year = {2010},
address = {Scottsdale / US},
month = {oct},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2010-codes-cordes.pdf},
confidential = {n},
abstract = {The last years have shown that there is no way to disregard the advantages provided by multiprocessor System-on-Chip (MPSoC) architectures in the embedded systems domain. Using multiple cores in a single system makes it possible to close the gap between energy consumption, heat dissipation, and computational power. Nevertheless, these benefits do not come for free. New challenges arise if existing applications have to be ported to these multiprocessor platforms. One of the most ambitious tasks is to extract efficient parallelism from these existing sequential applications. Hence, many parallelization tools have been developed, most of which extract as much parallelism as possible, which is in general not the best choice for embedded systems with their limitations in hardware and software support. In contrast to previous approaches, we present a new automatic parallelization tool, tailored to the particular requirements of resource-constrained embedded systems. Therefore, this paper presents an algorithm which automatically steers the granularity of the generated tasks, with respect to architectural requirements and the overall execution time reduction. For this purpose, we exploit hierarchical task graphs to simplify a new integer linear programming based approach in order to split up sequential programs in an efficient way. Results on real-life benchmarks have shown that the presented approach is able to speed up sequential applications by a factor of up to 3.7 on a four-core MPSoC architecture.},
} The last years have shown that there is no way to disregard the advantages provided by multiprocessor System-on-Chip (MPSoC) architectures in the embedded systems domain. Using multiple cores in a single system makes it possible to close the gap between energy consumption, heat dissipation, and computational power. Nevertheless, these benefits do not come for free. New challenges arise if existing applications have to be ported to these multiprocessor platforms. One of the most ambitious tasks is to extract efficient parallelism from these existing sequential applications. Hence, many parallelization tools have been developed, most of which extract as much parallelism as possible, which is in general not the best choice for embedded systems with their limitations in hardware and software support. In contrast to previous approaches, we present a new automatic parallelization tool, tailored to the particular requirements of resource-constrained embedded systems. Therefore, this paper presents an algorithm which automatically steers the granularity of the generated tasks, with respect to architectural requirements and the overall execution time reduction. For this purpose, we exploit hierarchical task graphs to simplify a new integer linear programming based approach in order to split up sequential programs in an efficient way. Results on real-life benchmarks have shown that the presented approach is able to speed up sequential applications by a factor of up to 3.7 on a four-core MPSoC architecture.
|
| Peter Marwedel and Michael Engel. Ein Plädoyer für eine holistische Analyse der Zusammenhänge zwischen Informationstechnologie und Kohlendioxyd-Emissionen. In VDE-Kongress Leipzig, Germany, November 2010 [BibTeX]@inproceedings { marwedel:10:VDE,
author = {Marwedel, Peter and Engel, Michael},
title = {Ein Pl{\"a}doyer f{\"u}r eine holistische Analyse der Zusammenh{\"a}nge zwischen Informationstechnologie und Kohlendioxyd-Emissionen},
booktitle = {VDE-Kongress},
year = {2010},
address = {Leipzig, Germany},
month = {nov},
confidential = {n},
} |
| Katharina Morik, Nico Piatkowski, Michael Engel and Felix Jungermann. Enhancing Ubiquitous Systems Through System Call Mining. In Proceedings of the ICDM Workshop on Large-scale Analytics for Complex Instrumented Systems ({LACIS 2010}) Sydney, Australia, December 2010 [BibTeX]@inproceedings { morik:10:icdm10,
author = {Morik, Katharina and Piatkowski, Nico and Engel, Michael and Jungermann, Felix},
title = {Enhancing Ubiquitous Systems Through System Call Mining},
booktitle = {Proceedings of the ICDM Workshop on Large-scale Analytics for Complex Instrumented Systems ({LACIS 2010})},
year = {2010},
address = {Sydney, Australia},
month = {dec},
publisher = {IEEE Computer Society Press},
confidential = {n},
} |
| Sascha Plazar, Peter Marwedel and Jörg Rahnenführer. Optimizing Execution Runtimes of R Programs. In Book of Abstracts of International Symposium on Business and Industrial Statistics (ISBIS), pages 81-82 Portoroz (Portorose) / Slovenia, July 2010 [BibTeX][PDF]@inproceedings { plazar:10:isbis,
author = {Plazar, Sascha and Marwedel, Peter and Rahnenf\"uhrer, J\"org},
title = {Optimizing Execution Runtimes of R Programs},
booktitle = {Book of Abstracts of International Symposium on Business and Industrial Statistics (ISBIS)},
year = {2010},
pages = {81-82},
address = {Portoroz (Portorose) / Slovenia},
month = {jul},
keywords = {rcs},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2010-isbis.pdf},
confidential = {n},
} |
| Sascha Plazar, Paul Lokuciejewski and Peter Marwedel. WCET-driven Cache-aware Memory Content Selection. In Proceedings of the 13th IEEE International Symposium on Object/Component/Service-oriented Real-time Distributed Computing (ISORC), pages 107-114 Carmona / Spain, May 2010 [BibTeX][PDF][Abstract]@inproceedings { plazar:10:isorc,
author = {Plazar, Sascha and Lokuciejewski, Paul and Marwedel, Peter},
title = {WCET-driven Cache-aware Memory Content Selection},
booktitle = {Proceedings of the 13th IEEE International Symposium on Object/Component/Service-oriented Real-time Distributed Computing (ISORC)},
year = {2010},
pages = {107-114},
address = {Carmona / Spain},
month = {may},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2010-isorc.pdf},
confidential = {n},
abstract = {Caches are widely used to bridge the continuously growing gap between processor and memory performance. They store copies of frequently used parts of the slow main memory for faster access. Static analysis techniques allow the estimation of the worst-case cache behavior and enable the computation of an upper bound of the execution time of a program. This bound is called worst-case execution time (WCET). Its knowledge is crucial to verify whether hard real-time systems satisfy their timing constraints, and the WCET is a key parameter for the design of embedded systems. In this paper, we propose a new WCET-driven cache-aware memory content selection algorithm, which allocates functions whose WCET highly benefits from a cached execution to cached memory areas. Vice versa, rarely used functions which do not benefit from a cached execution are allocated to non-cached memory areas. As a result of this, unfavorable functions w.\,r.\,t. a program's WCET cannot evict beneficial functions from the cache. This can lead to a reduced cache miss ratio and a decreased WCET. The effectiveness of our approach is demonstrated by results achieved on real-life benchmarks. In a case study, our greedy algorithm is able to reduce the benchmarks' WCET by up to 20\%.},
} Caches are widely used to bridge the continuously growing gap between processor and memory performance. They store copies of frequently used parts of the slow main memory for faster access. Static analysis techniques allow the estimation of the worst-case cache behavior and enable the computation of an upper bound of the execution time of a program. This bound is called worst-case execution time (WCET). Its knowledge is crucial to verify whether hard real-time systems satisfy their timing constraints, and the WCET is a key parameter for the design of embedded systems. In this paper, we propose a new WCET-driven cache-aware memory content selection algorithm, which allocates functions whose WCET highly benefits from a cached execution to cached memory areas. Vice versa, rarely used functions which do not benefit from a cached execution are allocated to non-cached memory areas. As a result of this, unfavorable functions w.r.t. a program's WCET cannot evict beneficial functions from the cache. This can lead to a reduced cache miss ratio and a decreased WCET. The effectiveness of our approach is demonstrated by results achieved on real-life benchmarks. In a case study, our greedy algorithm is able to reduce the benchmarks' WCET by up to 20%.
|
| Frank Weichert, Marcel Gaspar, Alexander Zybin, Evgeny Gurevich, Alexander Görtz, Constantin Timm, Heinrich Müller and Peter Marwedel. Plasmonen-unterstützte Mikroskopie zur Detektion von Viren. In Bildverarbeitung für die Medizin Aachen / Germany, March 2010 [BibTeX][Abstract]@inproceedings { weichert:10:bvm,
author = {Weichert, Frank and Gaspar, Marcel and Zybin, Alexander and Gurevich, Evgeny and G\"ortz, Alexander and Timm, Constantin and M\"uller, Heinrich and Marwedel, Peter},
title = {Plasmonen-unterst\"utzte Mikroskopie zur Detektion von Viren},
booktitle = {Bildverarbeitung f\"ur die Medizin},
year = {2010},
address = {Aachen / Germany},
month = {March},
confidential = {n},
abstract = {In view of viral infections increasingly occurring in epidemic proportions, an efficient and ubiquitously available method for reliable virus detection is highly relevant. Plasmon-assisted microscopy provides a novel examination method for this purpose, but it places high demands on the image processing required to differentiate the viruses within the image data. In this work, a first promising approach to this problem is presented. Using image-based pattern recognition and time-series analysis in combination with classification methods, both the differentiation of nano-objects and the detection of virus-like particles could be demonstrated.},
} In view of viral infections increasingly occurring in epidemic proportions, an efficient and ubiquitously available method for reliable virus detection is highly relevant. Plasmon-assisted microscopy provides a novel examination method for this purpose, but it places high demands on the image processing required to differentiate the viruses within the image data. In this work, a first promising approach to this problem is presented. Using image-based pattern recognition and time-series analysis in combination with classification methods, both the differentiation of nano-objects and the detection of virus-like particles could be demonstrated.
|
| Andreas Heinig, Michael Engel, Florian Schmoll and Peter Marwedel. Using Application Knowledge to Improve Embedded Systems Dependability. In Proceedings of the Workshop on Hot Topics in System Dependability (HotDep 2010) Vancouver, Canada, October 2010 [BibTeX][PDF]@inproceedings { heinig:10:hotdep,
author = {Heinig, Andreas and Engel, Michael and Schmoll, Florian and Marwedel, Peter},
title = {Using Application Knowledge to Improve Embedded Systems Dependability},
booktitle = {Proceedings of the Workshop on Hot Topics in System Dependability (HotDep 2010)},
year = {2010},
address = {Vancouver, Canada},
month = {oct},
publisher = {USENIX Association},
keywords = {ders},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2010-heinig-hotdep.pdf},
confidential = {n},
} |
| Jochen Strunk, Andreas Heinig, Toni Volkmer, Wolfgang Rehm and Heiko Schick. ACCFS - Virtual File System Support for Host Coupled Run-Time Reconfigurable FPGAs. In Advances in Parallel Computing, Volume 19, Parallel Computing: From Multicores and GPU's to Petascale, 2010, from Parallel Computing with FPGAs (ParaFPGA) held in conjunction with International Conference on Parallel Computing (ParCo 2009) 2010 [BibTeX]@inproceedings { sjoc2010parafpga,
author = {Strunk, Jochen and Heinig, Andreas and Volkmer, Toni and Rehm, Wolfgang and Schick, Heiko},
title = {ACCFS - Virtual File System Support for Host Coupled Run-Time Reconfigurable FPGAs},
booktitle = {Advances in Parallel Computing, Volume 19, Parallel Computing: From Multicores and GPU's to Petascale, 2010, from Parallel Computing with FPGAs (ParaFPGA) held in conjunction with International Conference on Parallel Computing (ParCo 2009)},
year = {2010},
publisher = {IOS Press},
confidential = {n},
} |
| Timon Kelter. Superblock-basierte Quellcodeoptimierungen zur WCET-Reduktion. In Workshop ''Echtzeit 2010'' Boppard / Germany, November 2010 [BibTeX][PDF][Abstract]@inproceedings { kelter:2010:gi-ez,
author = {Kelter, Timon},
title = {Superblock-basierte Quellcodeoptimierungen zur WCET-Reduktion},
booktitle = {Workshop ''Echtzeit 2010''},
year = {2010},
address = {Boppard / Germany},
month = {nov},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2010-gi-echtzeit.pdf},
confidential = {n},
abstract = {The concept of superblocks has already been used successfully in the field of compiler optimizations to optimize the ACET (Average-Case Execution Time). Superblocks are special chains of basic blocks that make it easier to apply optimizations across basic block boundaries and thus create a higher optimization potential. In the present work, this concept is exploited for the first time to reduce the WCET (Worst-Case Execution Time) of programs for embedded systems. The WCET is an important metric in the context of embedded systems, since many embedded systems have to operate under real-time conditions, for which a safe upper bound on a program's execution time is indispensable. The presented superblock formation builds on a novel trace selection algorithm that evaluates WCET data. Furthermore, the concept of superblocks is applied at the source code level for the first time. In this way, the optimization takes place earlier, so that a larger number of subsequent optimizations can profit from the achieved restructuring. In addition, the classical optimizations Common Subexpression Elimination (CSE) and Dead Code Elimination (DCE) were adapted for use in source code superblocks. With these techniques, an average WCET reduction of up to 10.2\% was achieved on a test set of 55 well-known standard benchmarks.},
} The concept of superblocks has already been used successfully in the field of compiler optimizations to optimize the ACET (Average-Case Execution Time). Superblocks are special chains of basic blocks that make it easier to apply optimizations across basic block boundaries and thus create a higher optimization potential. In the present work, this concept is exploited for the first time to reduce the WCET (Worst-Case Execution Time) of programs for embedded systems. The WCET is an important metric in the context of embedded systems, since many embedded systems have to operate under real-time conditions, for which a safe upper bound on a program's execution time is indispensable. The presented superblock formation builds on a novel trace selection algorithm that evaluates WCET data. Furthermore, the concept of superblocks is applied at the source code level for the first time. In this way, the optimization takes place earlier, so that a larger number of subsequent optimizations can profit from the achieved restructuring. In addition, the classical optimizations Common Subexpression Elimination (CSE) and Dead Code Elimination (DCE) were adapted for use in source code superblocks. With these techniques, an average WCET reduction of up to 10.2% was achieved on a test set of 55 well-known standard benchmarks.
|
| Christos Baloukas, Lazaros Papadopoulos, Dimitrios Soudris, Sander Stuijk, Olivera Jovanovic, Florian Schmoll, Daniel Cordes, Robert Pyka, Arindam Mallik, Stylianos Mamagkakis, François Capman, Séverin Collet, Nikolaos Mitas and Dimitrios Kritharidis. Mapping Embedded Applications on MPSoCs: The MNEMEE Approach. In Proceedings of the 2010 IEEE Annual Symposium on VLSI, pages 512-517 Washington, DC, USA, September 2010 [BibTeX][PDF][Abstract]@inproceedings { baloukas:10:isvlsi,
author = {Baloukas, Christos and Papadopoulos, Lazaros and Soudris, Dimitrios and Stuijk, Sander and Jovanovic, Olivera and Schmoll, Florian and Cordes, Daniel and Pyka, Robert and Mallik, Arindam and Mamagkakis, Stylianos and Capman, Fran\c{c}ois and Collet, S\'{e}verin and Mitas, Nikolaos and Kritharidis, Dimitrios},
title = {Mapping Embedded Applications on MPSoCs: The MNEMEE Approach},
booktitle = {Proceedings of the 2010 IEEE Annual Symposium on VLSI},
year = {2010},
series = {ISVLSI '10},
pages = {512-517},
address = {Washington, DC, USA},
month = {sep},
publisher = {IEEE Computer Society},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2010-isvlsi.pdf},
confidential = {n},
abstract = {As embedded systems are becoming the center of our digital life, system design becomes progressively harder. The integration of multiple features on devices with limited resources requires careful and exhaustive exploration of the design search space in order to efficiently map modern applications to an embedded multi-processor platform. The MNEMEE project addresses this challenge by offering a unique integrated tool flow that performs source-to-source transformations to automatically optimize the original source code and map it on the target platform. The optimizations aim at reducing the number of memory accesses and the required memory storage of both dynamically and statically allocated data. Furthermore, the MNEMEE tool flow performs optimal assignment of all data on the memory hierarchy of the target platform. Designers can use the whole flow or a part of it and integrate it into their own design flow. This paper gives an overview of the MNEMEE tool flow. It also presents two industrial case studies that demonstrate how the techniques and tools developed in the MNEMEE project can be integrated into industrial design flows.},
} As embedded systems are becoming the center of our digital life, system design becomes progressively harder. The integration of multiple features on devices with limited resources requires careful and exhaustive exploration of the design search space in order to efficiently map modern applications to an embedded multi-processor platform. The MNEMEE project addresses this challenge by offering a unique integrated tool flow that performs source-to-source transformations to automatically optimize the original source code and map it on the target platform. The optimizations aim at reducing the number of memory accesses and the required memory storage of both dynamically and statically allocated data. Furthermore, the MNEMEE tool flow performs optimal assignment of all data on the memory hierarchy of the target platform. Designers can use the whole flow or a part of it and integrate it into their own design flow. This paper gives an overview of the MNEMEE tool flow. It also presents two industrial case studies that demonstrate how the techniques and tools developed in the MNEMEE project can be integrated into industrial design flows.
|
| Andreas Heinig, Michael Engel, Florian Schmoll and Peter Marwedel. Improving Transient Memory Fault Resilience of an H.264 Decoder. In Proceedings of the Workshop on Embedded Systems for Real-time Multimedia (ESTIMedia 2010) Scottsdale, AZ, USA, October 2010 [BibTeX][PDF]@inproceedings { heinig:10:estimedia,
author = {Heinig, Andreas and Engel, Michael and Schmoll, Florian and Marwedel, Peter},
title = {Improving Transient Memory Fault Resilience of an H.264 Decoder},
booktitle = {Proceedings of the Workshop on Embedded Systems for Real-time Multimedia (ESTIMedia 2010)},
year = {2010},
address = {Scottsdale, AZ, USA},
month = {oct},
publisher = {IEEE Computer Society Press},
keywords = {ders},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2010-heinig-estimedia.pdf},
confidential = {n},
} |
| Paul Lokuciejewski, Timon Kelter and Peter Marwedel. Superblock-Based Source Code Optimizations for WCET Reduction. In Proceedings of the 7th International Conference on Embedded Software and Systems (ICESS), pages 1918-1925 Bradford / UK, June 2010 [BibTeX][PDF][Abstract]@inproceedings { lokuciejewski:10:icess,
author = {Lokuciejewski, Paul and Kelter, Timon and Marwedel, Peter},
title = {Superblock-Based Source Code Optimizations for WCET Reduction},
booktitle = {Proceedings of the 7th International Conference on Embedded Software and Systems (ICESS)},
year = {2010},
pages = {1918-1925},
address = {Bradford / UK},
month = {jun},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2010-icess.pdf},
confidential = {n},
abstract = {Superblocks represent regions in a program code that consist of multiple basic blocks. Compilers benefit from this structure since it enables optimization across block boundaries. This increased optimization potential was thoroughly studied in the past for average-case execution time (ACET) reduction at assembly level. In this paper, the concept of superblocks is exploited for the optimization of embedded real-time systems that have to meet stringent timing constraints specified by the worst-case execution time (WCET). To achieve this goal, our superblock formation is based on a novel trace selection algorithm which is driven by WCET data. Moreover, we translate superblocks for the first time from assembly to source code level. This approach enables an early code restructuring in the optimizer, providing more optimization opportunities for both subsequent source code and assembly level transformations. An adaptation of the traditional optimizations common subexpression and dead code elimination to our WCET-aware superblocks allows an effective WCET reduction. Using our techniques, we significantly outperform standard optimizations and achieve an average WCET reduction of up to 10.2\% for a total of 55 real-life benchmarks.},
} Superblocks represent regions in a program code that consist of multiple basic blocks. Compilers benefit from this structure since it enables optimization across block boundaries. This increased optimization potential was thoroughly studied in the past for average-case execution time (ACET) reduction at assembly level. In this paper, the concept of superblocks is exploited for the optimization of embedded real-time systems that have to meet stringent timing constraints specified by the worst-case execution time (WCET). To achieve this goal, our superblock formation is based on a novel trace selection algorithm which is driven by WCET data. Moreover, we translate superblocks for the first time from assembly to source code level. This approach enables an early code restructuring in the optimizer, providing more optimization opportunities for both subsequent source code and assembly level transformations. An adaptation of the traditional optimizations common subexpression and dead code elimination to our WCET-aware superblocks allows an effective WCET reduction. Using our techniques, we significantly outperform standard optimizations and achieve an average WCET reduction of up to 10.2% for a total of 55 real-life benchmarks.
|
| Paul Lokuciejewski, Sascha Plazar, Heiko Falk, Peter Marwedel and Lothar Thiele. Multi-Objective Exploration of Compiler Optimizations for Real-Time Systems. In Proceedings of the 13th International Symposium on Object/Component/Service-oriented Real-time Distributed Computing (ISORC), pages 115-122 Carmona / Spain, May 2010 [BibTeX][PDF][Abstract]@inproceedings { lokuciejewski:10:isorc,
author = {Lokuciejewski, Paul and Plazar, Sascha and Falk, Heiko and Marwedel, Peter and Thiele, Lothar},
title = {Multi-Objective Exploration of Compiler Optimizations for Real-Time Systems},
booktitle = {Proceedings of the 13th International Symposium on Object/Component/Service-oriented Real-time Distributed Computing (ISORC)},
year = {2010},
pages = {115-122},
address = {Carmona / Spain},
month = {may},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2010-isorc_2.pdf},
confidential = {n},
abstract = {With the growing complexity of embedded systems software, high code quality can only be achieved using a compiler. Sophisticated compilers provide a vast spectrum of various optimizations to improve code aggressively w.r.t. different objective functions, e.g., average-case execution time \textit{(ACET)} or code size. Due to the complex interactions between the optimizations, the choice for a promising sequence of code transformations is not trivial. Compiler developers address this problem by proposing standard optimization levels, e.g., \textit{O3} or \textit{Os}. However, previous studies have shown that these standard levels often miss optimization potential or might even result in performance degradation. In this paper, we propose the first adaptive WCET-aware compiler framework for an automatic search of compiler optimization sequences which yield highly optimized code. Besides the objective functions ACET and code size, we consider the worst-case execution time \textit{(WCET)} which is a crucial parameter for real-time systems. To find suitable trade-offs between these objectives, stochastic evolutionary multi-objective algorithms identifying Pareto optimal solutions are exploited. A comparison based on statistical performance assessments is performed which helps to determine the most suitable multi-objective optimizer. The effectiveness of our approach is demonstrated on real-life benchmarks showing that standard optimization levels can be significantly outperformed.},
} With the growing complexity of embedded systems software, high code quality can only be achieved using a compiler. Sophisticated compilers provide a vast spectrum of various optimizations to improve code aggressively w.r.t. different objective functions, e.g., average-case execution time (ACET) or code size. Due to the complex interactions between the optimizations, the choice for a promising sequence of code transformations is not trivial. Compiler developers address this problem by proposing standard optimization levels, e.g., O3 or Os. However, previous studies have shown that these standard levels often miss optimization potential or might even result in performance degradation. In this paper, we propose the first adaptive WCET-aware compiler framework for an automatic search of compiler optimization sequences which yield highly optimized code. Besides the objective functions ACET and code size, we consider the worst-case execution time (WCET) which is a crucial parameter for real-time systems. To find suitable trade-offs between these objectives, stochastic evolutionary multi-objective algorithms identifying Pareto optimal solutions are exploited. A comparison based on statistical performance assessments is performed which helps to determine the most suitable multi-objective optimizer. The effectiveness of our approach is demonstrated on real-life benchmarks showing that standard optimization levels can be significantly outperformed.
|
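The core of the multi-objective exploration described above is keeping only Pareto-optimal optimization sequences under (WCET, ACET, code size). A small Python sketch of that filtering step follows; the random population and `fake_evaluate` are stand-ins for the paper's evolutionary search and for real compilation plus timing analysis.

```python
import random

# Sketch: Pareto filtering of candidate compiler optimization sequences.
# Assumption: evaluate(seq) returns (wcet, acet, code_size), all minimized;
# fake_evaluate below is a synthetic stand-in for actual compilation.

def dominates(a, b):
    """a dominates b: no worse in every objective, better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates, evaluate):
    scored = [(seq, evaluate(seq)) for seq in candidates]
    return [(s, v) for s, v in scored
            if not any(dominates(w, v) for _, w in scored)]

OPTS = ["unroll", "inline", "cse", "licm", "dce"]

def fake_evaluate(seq):
    rnd = random.Random(hash(tuple(seq)))          # deterministic toy scores
    return tuple(rnd.uniform(0.5, 1.0) for _ in range(3))

population = [tuple(random.sample(OPTS, 3)) for _ in range(20)]
for seq, objs in pareto_front(population, fake_evaluate):
    print(seq, [round(o, 2) for o in objs])
```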
| Paul Lokuciejewski, Marco Stolpe, Katharina Morik and Peter Marwedel. Automatic Selection of Machine Learning Models for WCET-aware Compiler Heuristic Generation. In Proceedings of the 4th Workshop on Statistical and Machine Learning Approaches to Architectures and Compilation (SMART), pages 3-17 Pisa / Italy, January 2010 [BibTeX][PDF][Abstract]@inproceedings { lokuciejewski:10:smart,
author = {Lokuciejewski, Paul and Stolpe, Marco and Morik, Katharina and Marwedel, Peter},
title = {Automatic Selection of Machine Learning Models for WCET-aware Compiler Heuristic Generation},
booktitle = {Proceedings of the 4th Workshop on Statistical and Machine Learning Approaches to Architectures and Compilation (SMART)},
year = {2010},
pages = {3-17},
address = {Pisa / Italy},
month = {jan},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2010-smart.pdf},
confidential = {n},
abstract = {Machine learning has shown its capabilities for an automatic generation of heuristics used by optimizing compilers. The advantages of these heuristics are that they can be easily adapted to a new environment and in some cases outperform hand-crafted compiler optimizations. However, this approach shifts the effort from manual heuristic tuning to the model selection problem of machine learning - i.e., selecting learning algorithms and their respective parameters - which is a tedious task in its own right. In this paper, we tackle the model selection problem in a systematic way. As our experiments show, the right choice of a learning algorithm and its parameters can significantly affect the quality of the generated heuristics. We present a generic framework integrating machine learning into a compiler to enable an automatic search for the best learning algorithm. To find good settings for the learner parameters within the large search space, optimizations based on evolutionary algorithms are applied. In contrast to the majority of other approaches aiming at a reduction of the average-case execution time (ACET), our goal is the minimization of the worst-case execution time (WCET) which is a key parameter for embedded systems acting as real-time systems. A careful case study on the heuristic generation for the well-known optimization loop invariant code motion shows the challenges and benefits of our methods.},
} Machine learning has shown its capabilities for an automatic generation of heuristics used by optimizing compilers. The advantages of these heuristics are that they can be easily adapted to a new environment and in some cases outperform hand-crafted compiler optimizations. However, this approach shifts the effort from manual heuristic tuning to the model selection problem of machine learning - i.e., selecting learning algorithms and their respective parameters - which is a tedious task in its own right. In this paper, we tackle the model selection problem in a systematic way. As our experiments show, the right choice of a learning algorithm and its parameters can significantly affect the quality of the generated heuristics. We present a generic framework integrating machine learning into a compiler to enable an automatic search for the best learning algorithm. To find good settings for the learner parameters within the large search space, optimizations based on evolutionary algorithms are applied. In contrast to the majority of other approaches aiming at a reduction of the average-case execution time (ACET), our goal is the minimization of the worst-case execution time (WCET) which is a key parameter for embedded systems acting as real-time systems. A careful case study on the heuristic generation for the well-known optimization loop invariant code motion shows the challenges and benefits of our methods.
|
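A minimal sketch of the model selection problem the abstract describes: score candidate (learner, parameters) configurations by cross-validation and keep the best. It assumes scikit-learn is available and simplifies the paper's evolutionary parameter search to random search; the synthetic data stands in for real code features and optimization decisions.

```python
# Sketch: automatic model selection for compiler heuristic generation.
# Assumptions: scikit-learn; random search instead of the paper's
# evolutionary algorithms; make_classification as a data stand-in.
import random
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def random_config():
    return random.choice([
        ("tree", DecisionTreeClassifier(max_depth=random.randint(2, 10))),
        ("forest", RandomForestClassifier(n_estimators=random.choice([10, 50, 100]))),
        ("svm", SVC(C=10 ** random.uniform(-2, 2))),
    ])

def select_model(X, y, budget=20):
    best_name, best_score = None, float("-inf")
    for _ in range(budget):
        name, model = random_config()
        score = cross_val_score(model, X, y, cv=5).mean()  # 5-fold CV accuracy
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
print(select_model(X, y))
```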
| Michael Engel, Hans P. Reiser, Olaf Spinczyk, Rüdiger Kapitza and Jörg Nolte. Proceedings of the Workshop on Isolation and Integration for Dependable Systems (IIDS 2010). Paris, France, April 2010 [BibTeX]@inproceedings { engel:10:eurosys-iids-proc,
author = {Engel, Michael and Reiser, Hans P. and Spinczyk, Olaf and Kapitza, R{\"u}diger and Nolte, J{\"o}rg},
title = {Proceedings of the Workshop on Isolation and Integration for Dependable Systems (IIDS 2010)},
year = {2010},
address = {Paris, France},
month = {apr},
publisher = {ACM Press},
confidential = {n},
} |
| Constantin Timm, Jens Schmutzler, Peter Marwedel and Christian Wietfeld. Dynamic Web Service Orchestration applied to the Device Profile for Web Services in Hierarchical Networks. In COMSWARE '09: Proceedings of the Fourth International ICST Conference on COMmunication System softWAre and middlewaRE, pages 1-6 Dublin, Ireland, June 2009 [BibTeX][Abstract]@inproceedings { 2009Timm,
author = {Timm, Constantin and Schmutzler, Jens and Marwedel, Peter and Wietfeld, Christian},
title = {Dynamic Web Service Orchestration applied to the Device Profile for Web Services in Hierarchical Networks},
booktitle = {COMSWARE '09: Proceedings of the Fourth International ICST Conference on COMmunication System softWAre and middlewaRE},
year = {2009},
pages = {1-6},
address = {Dublin, Ireland},
month = {jun},
confidential = {n},
abstract = {Based on the idea of Service Oriented Architectures (SOA), Web Services paved the way for open and flexible interaction between heterogeneous systems with a loose coupling between service endpoints. The Device Profile for Web Services (DPWS) implements a subset of WS-* specifications in order to make the advantages of the Web Service architecture available to a growing embedded systems market. In this paper we are proposing a service orchestration mechanism applied to services on top of a DPWS-based middleware. The approach is complementary to the rather complex and resource intensive Web Service Business Process Execution Language (WS-BPEL) and focuses on service orchestration on resource constrained devices deployed in hierarchical network topologies. We validate our service orchestration concept through its resource consumption and illustrate its seamless integration into the service development cycle based on the underlying DPWS-compliant middleware.},
} Based on the idea of Service Oriented Architectures (SOA), Web Services paved the way for open and flexible interaction between heterogeneous systems with a loose coupling between service endpoints. The Device Profile for Web Services (DPWS) implements a subset of WS-* specifications in order to make the advantages of the Web Service architecture available to a growing embedded systems market. In this paper we are proposing a service orchestration mechanism applied to services on top of a DPWS-based middleware. The approach is complementary to the rather complex and resource intensive Web Service Business Process Execution Language (WS-BPEL) and focuses on service orchestration on resource constrained devices deployed in hierarchical network topologies. We validate our service orchestration concept through its resource consumption and illustrate its seamless integration into the service development cycle based on the underlying DPWS-compliant middleware.
|
| Heiko Falk. WCET-aware Register Allocation based on Graph Coloring. In The 46th Design Automation Conference (DAC), pages 726-731 San Francisco / USA, July 2009 [BibTeX][PDF][Abstract]@inproceedings { falk:09:dac1,
author = {Falk, Heiko},
title = {WCET-aware Register Allocation based on Graph Coloring},
booktitle = {The 46th Design Automation Conference (DAC)},
year = {2009},
pages = {726-731},
address = {San Francisco / USA},
month = {jul},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2009-dac_1.pdf},
confidential = {n},
abstract = {Current compilers lack precise timing models guiding their built-in optimizations. Hence, compilers apply ad-hoc heuristics during optimization to improve code quality. One of the most important optimizations is register allocation. Many compilers heuristically decide when and where to spill a register to memory, without having a clear understanding of the impact of such spill code on a program's run time. This paper extends a graph coloring register allocator such that it uses precise worst-case execution time \textit{(WCET)} models. Using this WCET timing data, the compiler tries to avoid spill code generation along the critical path defining a program's WCET. To the best of our knowledge, this paper is the first one to present a WCET-aware register allocator. Our results underline the effectiveness of the proposed techniques. For a total of 46 realistic benchmarks, we reduced WCETs by 31.2\% on average. Additionally, the runtimes of our WCET-aware register allocator still remain acceptable.},
} Current compilers lack precise timing models guiding their built-in optimizations. Hence, compilers apply ad-hoc heuristics during optimization to improve code quality. One of the most important optimizations is register allocation. Many compilers heuristically decide when and where to spill a register to memory, without having a clear understanding of the impact of such spill code on a program's run time. This paper extends a graph coloring register allocator such that it uses precise worst-case execution time (WCET) models. Using this WCET timing data, the compiler tries to avoid spill code generation along the critical path defining a program's WCET. To the best of our knowledge, this paper is the first one to present a WCET-aware register allocator. Our results underline the effectiveness of the proposed techniques. For a total of 46 realistic benchmarks, we reduced WCETs by 31.2% on average. Additionally, the runtimes of our WCET-aware register allocator still remain acceptable.
|
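A compact Python sketch of the idea behind this entry: Chaitin-style graph coloring where, when the allocator gets stuck, the spill victim is the node whose spill code is cheapest on the worst-case path. The interference graph and `wcet_cost` weights are illustrative; the paper derives spill costs from a real timing analysis.

```python
# Sketch: graph-coloring register allocation with WCET-weighted spill costs.
# Assumptions: `graph` is a symmetric interference graph; wcet_cost[v] is a
# hypothetical cost of v's spill code on the worst-case path.

def color(graph, k, wcet_cost):
    work = {n: set(adj) for n, adj in graph.items()}
    stack, spilled = [], []
    while work:
        simplifiable = [n for n in work if len(work[n]) < k]
        if simplifiable:
            victim = min(simplifiable, key=lambda n: len(work[n]))
            stack.append(victim)
        else:
            # Stuck: spill the node that hurts the worst-case path least.
            victim = min(work, key=lambda n: wcet_cost[n])
            spilled.append(victim)
        del work[victim]
        for adj in work.values():
            adj.discard(victim)
    colors = {}
    for n in reversed(stack):           # assign colors in reverse removal order
        used = {colors[m] for m in graph[n] if m in colors}
        colors[n] = min(c for c in range(k) if c not in used)
    return colors, spilled

g = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print(color(g, 2, {"a": 9, "b": 1, "c": 5, "d": 2}))  # spills cheap "b"
```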
| Heiko Falk and Jan C. Kleinsorge. Optimal Static WCET-aware Scratchpad Allocation of Program Code. In The 46th Design Automation Conference (DAC), pages 732-737 San Francisco / USA, July 2009 [BibTeX][PDF][Abstract]@inproceedings { falk:09:dac2,
author = {Falk, Heiko and Kleinsorge, Jan C.},
title = {Optimal Static WCET-aware Scratchpad Allocation of Program Code},
booktitle = {The 46th Design Automation Conference (DAC)},
year = {2009},
pages = {732-737},
address = {San Francisco / USA},
month = {jul},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2009-dac_2.pdf},
confidential = {n},
abstract = {Caches are notorious for their unpredictability. It is difficult or even impossible to predict if a memory access will result in a definite cache hit or miss. This unpredictability is highly undesired especially when designing real-time systems where the \textit{worst-case execution time (WCET)} is one of the key metrics. \textit{Scratchpad memories (SPMs)} have proven to be a fully predictable alternative to caches. In contrast to caches, however, SPMs require dedicated compiler support. This paper presents an optimal static SPM allocation algorithm for program code. It minimizes WCETs by placing the most beneficial parts of a program's code in an SPM. Our results underline the effectiveness of the proposed techniques. For a total of 73 realistic benchmarks, we reduced WCETs by 7.4\% on average and by up to 40\%. Additionally, the run times of our ILP-based SPM allocator are negligible.},
} Caches are notorious for their unpredictability. It is difficult or even impossible to predict if a memory access will result in a definite cache hit or miss. This unpredictability is highly undesired especially when designing real-time systems where the worst-case execution time (WCET) is one of the key metrics. Scratchpad memories (SPMs) have proven to be a fully predictable alternative to caches. In contrast to caches, however, SPMs require dedicated compiler support. This paper presents an optimal static SPM allocation algorithm for program code. It minimizes WCETs by placing the most beneficial parts of a program's code in an SPM. Our results underline the effectiveness of the proposed techniques. For a total of 73 realistic benchmarks, we reduced WCETs by 7.4% on average and by up to 40%. Additionally, the run times of our ILP-based SPM allocator are negligible.
|
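The selection core of this allocation problem can be illustrated as a 0/1 knapsack: pick the code blocks whose placement in the SPM yields the largest total gain within the SPM's capacity. This is only a simplification of the paper's ILP: moving code can make a different path the worst case and change the gains, which a plain knapsack ignores. The block names and numbers below are invented.

```python
# Sketch: scratchpad allocation of program code as a 0/1 knapsack.
# Assumption (simplifying): every block has a fixed WCET gain if placed
# in the SPM; the paper's ILP handles path switches that this DP cannot.

def spm_allocate(blocks, capacity):
    """blocks: list of (name, size, wcet_gain); returns the chosen names."""
    best = [[0] * (capacity + 1)]
    for _, size, gain in blocks:
        prev = best[-1]
        best.append([prev[c] if c < size
                     else max(prev[c], prev[c - size] + gain)
                     for c in range(capacity + 1)])
    chosen, c = [], capacity            # backtrack through the DP table
    for i in range(len(blocks), 0, -1):
        if best[i][c] != best[i - 1][c]:
            name, size, _ = blocks[i - 1]
            chosen.append(name)
            c -= size
    return chosen

print(spm_allocate([("init", 400, 2), ("filter", 700, 9), ("isr", 300, 5)], 1000))
# ['isr', 'filter'] -- 1000 bytes used, combined gain 14
```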
| Andreas Heinig, Jochen Strunk, Wolfgang Rehm and Heiko Schick. ACCFS - Operating System Integration of Computational Accelerators Using a VFS Approach. In Proceedings of Applied Reconfigurable Computing (ARC) 2009 [BibTeX]@inproceedings { Heinig2009arc,
author = {Heinig, Andreas and Strunk, Jochen and Rehm, Wolfgang and Schick, Heiko},
title = {ACCFS - Operating System Integration of Computational Accelerators Using a VFS Approach},
booktitle = {Proceedings of Applied Reconfigurable Computing (ARC)},
year = {2009},
publisher = {LNCS},
confidential = {n},
} |
| Jochen Strunk, Andreas Heinig, Toni Volkmer, Wolfgang Rehm and Heiko Schick. Run-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFS. In Proceedings of First International Workshop on HyperTransport Research and Applications 2009 [BibTeX]@inproceedings { sjoc2009whtra,
author = {Strunk, Jochen and Heinig, Andreas and Volkmer, Toni and Rehm, Wolfgang and Schick, Heiko},
title = {Run-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFS},
booktitle = {Proceedings of First International Workshop on HyperTransport Research and Applications},
year = {2009},
publisher = {WHTRA},
confidential = {n},
} |
| Michael Engel and Olaf Spinczyk. A Radical Approach to Network-on-Chip Operating Systems. In Proceedings of the 42nd Hawai'i International Conference on System Sciences (HICSS '09) Waikoloa, Big Island, Hawaii, January 2009 [BibTeX]@inproceedings { engel:09:hicss,
author = {Engel, Michael and Spinczyk, Olaf},
title = {A Radical Approach to Network-on-Chip Operating Systems},
booktitle = {Proceedings of the 42nd Hawai'i International Conference on System Sciences (HICSS '09)},
year = {2009},
address = {Waikoloa, Big Island, Hawaii},
month = {jan},
publisher = {IEEE Computer Society Press},
confidential = {n},
} |
| Sascha Plazar, Paul Lokuciejewski and Peter Marwedel. WCET-aware Software Based Cache Partitioning for Multi-Task Real-Time Systems. In The 9th International Workshop on Worst-Case Execution Time Analysis (WCET), pages 78-88 Dublin / Ireland, June 2009 [BibTeX][PDF][Abstract]@inproceedings { plazar:09:wcet,
author = {Plazar, Sascha and Lokuciejewski, Paul and Marwedel, Peter},
title = {WCET-aware Software Based Cache Partitioning for Multi-Task Real-Time Systems},
booktitle = {The 9th International Workshop on Worst-Case Execution Time Analysis (WCET)},
year = {2009},
pages = {78-88},
address = {Dublin / Ireland},
month = {jun},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2009-wcet.pdf},
confidential = {n},
abstract = {Caches are a source of unpredictability since it is very difficult to predict if a memory access results in a cache hit or miss. In systems running multiple tasks steered by a preempting scheduler, it is even impossible to determine the cache behavior since interrupt-driven schedulers lead to unknown points of time for context switches. Partitioned caches are already used in multi-task environments to increase the cache hit ratio by avoiding mutual eviction of tasks from the cache. For real-time systems, the upper bound of the execution time is one of the most important metrics, called the Worst-Case Execution Time (WCET). In this paper, we use partitioning of instruction caches as a technique to achieve tighter WCET estimations since tasks can not be evicted from their partition by other tasks. We propose a novel WCET-aware algorithm, which determines the optimal partition size for each task with focus on decreasing the system's WCET for a given set of possible partition sizes. Employing this algorithm, we are able to decrease the WCET depending on the number of tasks in a set by up to 34\%. On average, reductions between 12\% and 19\% can be achieved.},
} Caches are a source of unpredictability since it is very difficult to predict if a memory access results in a cache hit or miss. In systems running multiple tasks steered by a preempting scheduler, it is even impossible to determine the cache behavior since interrupt-driven schedulers lead to unknown points of time for context switches. Partitioned caches are already used in multi-task environments to increase the cache hit ratio by avoiding mutual eviction of tasks from the cache. For real-time systems, the upper bound of the execution time is one of the most important metrics, called the Worst-Case Execution Time (WCET). In this paper, we use partitioning of instruction caches as a technique to achieve tighter WCET estimations since tasks can not be evicted from their partition by other tasks. We propose a novel WCET-aware algorithm, which determines the optimal partition size for each task with focus on decreasing the system's WCET for a given set of possible partition sizes. Employing this algorithm, we are able to decrease the WCET depending on the number of tasks in a set by up to 34%. On average, reductions between 12% and 19% can be achieved.
|
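The partition-sizing step this abstract describes can be sketched as a search over per-task partition sizes drawn from the given candidate set. The brute-force enumeration below is only an illustration for tiny task sets (the paper uses its own algorithm), and the `wcet` table entries are invented.

```python
from itertools import product

# Sketch: WCET-aware instruction-cache partitioning.
# Assumptions: wcet[task][size] is the analyzed WCET of a task that owns a
# partition of that size (hypothetical numbers), and the objective is the
# sum of the per-task WCETs; sizes must add up to the cache size.

def partition(cache_size, sizes, wcet):
    tasks = list(wcet)
    best_total, best_assign = float("inf"), None
    for assign in product(sizes, repeat=len(tasks)):
        if sum(assign) != cache_size:
            continue
        total = sum(wcet[t][s] for t, s in zip(tasks, assign))
        if total < best_total:
            best_total, best_assign = total, dict(zip(tasks, assign))
    return best_assign, best_total

wcet = {"t1": {1: 90, 2: 60, 4: 50}, "t2": {1: 70, 2: 55, 4: 42}}
print(partition(4, [1, 2, 4], wcet))  # ({'t1': 2, 't2': 2}, 115)
```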
| Daniel Dressler, Martin Groß, Jan-Philipp Kappmeier, Timon Kelter, Daniel Plümpe, Melanie Schmidt, Martin Skutella and Sylvie Temme. On the Use of Network Flow Techniques for Assigning Evacuees to Exits. In The First International Conference on Evacuation Modeling (ICEM) Delft / The Netherlands, September 2009 [BibTeX][PDF][Abstract]@inproceedings { dressler:09:icem,
author = {Dressler, Daniel and Gro\ss, Martin and Kappmeier, Jan-Philipp and Kelter, Timon and Pl\"umpe, Daniel and Schmidt, Melanie and Skutella, Martin and Temme, Sylvie},
title = {On the Use of Network Flow Techniques for Assigning Evacuees to Exits},
booktitle = {The First International Conference on Evacuation Modeling (ICEM)},
year = {2009},
address = {Delft / The Netherlands},
month = {sep},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2009-icem.pdf},
confidential = {n},
abstract = {We apply network flow techniques to find good exit selections for evacuees in an emergency evacuation. More precisely, we present two algorithms for computing exit distributions using both classical flows and flows over time which are well known from combinatorial optimization. The performance of these new proposals is compared to a simple shortest path approach and to a best response dynamics approach by using a cellular automaton model.},
} We apply network flow techniques to find good exit selections for evacuees in an emergency evacuation. More precisely, we present two algorithms for computing exit distributions using both classical flows and flows over time which are well known from combinatorial optimization. The performance of these new proposals is compared to a simple shortest path approach and to a best response dynamics approach by using a cellular automaton model.
|
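The "classical flows" variant mentioned in this abstract can be sketched with a standard Edmonds-Karp max-flow over a source → rooms → exits → sink network: room arcs carry occupancies, exit arcs carry capacities, and the resulting flow is read off as an exit assignment. Flows over time and the cellular-automaton evaluation are omitted; the building data is invented.

```python
from collections import defaultdict, deque

# Sketch: exit assignment via max flow (Edmonds-Karp).
# Assumptions: hypothetical room occupancies and exit capacities.

def add_edge(g, u, v, c):
    g[u][v] += c
    g[v][u] += 0                        # make sure the residual arc exists

def max_flow(g, s, t):
    flow = defaultdict(int)
    def augmenting_path():
        parent, q = {s: None}, deque([s])
        while q:
            u = q.popleft()
            for v in g[u]:
                if v not in parent and g[u][v] - flow[(u, v)] > 0:
                    parent[v] = u
                    if v == t:
                        return parent
                    q.append(v)
        return None
    total = 0
    while (parent := augmenting_path()) is not None:
        path, v = [], t
        while parent[v] is not None:    # walk back from sink to source
            path.append((parent[v], v))
            v = parent[v]
        push = min(g[u][v] - flow[(u, v)] for u, v in path)
        for u, v in path:
            flow[(u, v)] += push
            flow[(v, u)] -= push
        total += push
    return total, flow

g = defaultdict(lambda: defaultdict(int))
for room, people in {"R1": 30, "R2": 20}.items():
    add_edge(g, "S", room, people)
for room, exits in {"R1": ["E1", "E2"], "R2": ["E2"]}.items():
    for e in exits:
        add_edge(g, room, e, 10 ** 9)   # room-to-exit arcs are uncapacitated
for exit_, cap in {"E1": 25, "E2": 30}.items():
    add_edge(g, exit_, "T", cap)
total, flow = max_flow(g, "S", "T")
print(total)  # 50: every evacuee gets an exit; read flow[(room, exit)] for the split
```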
| G. Schuenemann, P. Hartmann, D. Schirmer, P. Towalski, T. Weis, K. Wille and P. Marwedel. An FPGA Based Data Acquisition System for a fast Orbit Feedback at DELTA. In 9th European Workshop on Beam Diagnostics and Instrumentation for Particle Accelerators Basel / Switzerland, May 2009 [BibTeX][PDF][Abstract]@inproceedings { marwedel:09:dipac,
author = {Schuenemann, G. and Hartmann, P. and Schirmer, D. and Towalski, P. and Weis, T. and Wille, K. and Marwedel, P.},
title = {An FPGA Based Data Acquisition System for a fast Orbit Feedback at DELTA},
booktitle = {9th European Workshop on Beam Diagnostics and Instrumentation for Particle Accelerators},
year = {2009},
address = {Basel / Switzerland},
month = {may},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2009-dipac.pdf},
confidential = {n},
abstract = {The demand for beam orbit stability for frequencies up to 1kHz resulted in the need for a fast orbit position data acquisition system at DELTA. The measurement frequency was decided to be 10kHz which results in a good margin for 1kHz corrections. It is based on a Xilinx University Program Virtex-II Pro Development System in conjunction with an in-house developed Analog-Digital Converter board, featuring two Analog Devices AD974 chips. In-house developed software written in VHDL manages measurement and data pre-processing. A communication controller has been adopted from the Diamond Light Source and is used as communication instance. The communication controller is versatile in its application. The data distribution between two or more of the developed measuring systems is possible. This includes data distribution with other systems utilizing the communication controller, e.g. the Libera beam diagnostic system. To enhance its measuring capabilities one of the two onboard PowerPC cores is running a Linux kernel. A kernel module, capable of receiving the measurement data from the Field Programmable Gate Array (FPGA) measurement core, was implemented, allowing for advanced data processing and distribution options. The paper presents the design of the system, the used methods and successful results of the first beam measurements.},
} The demand for beam orbit stability for frequencies up to 1kHz resulted in the need for a fast orbit position data acquisition system at DELTA. The measurement frequency was decided to be 10kHz which results in a good margin for 1kHz corrections. It is based on a Xilinx University Program Virtex-II Pro Development System in conjunction with an in-house developed Analog-Digital Converter board, featuring two Analog Devices AD974 chips. In-house developed software written in VHDL manages measurement and data pre-processing. A communication controller has been adopted from the Diamond Light Source and is used as communication instance. The communication controller is versatile in its application. The data distribution between two or more of the developed measuring systems is possible. This includes data distribution with other systems utilizing the communication controller, e.g. the Libera beam diagnostic system. To enhance its measuring capabilities one of the two onboard PowerPC cores is running a Linux kernel. A kernel module, capable of receiving the measurement data from the Field Programmable Gate Array (FPGA) measurement core, was implemented, allowing for advanced data processing and distribution options. The paper presents the design of the system, the used methods and successful results of the first beam measurements.
|
| Paul Lokuciejewski, Daniel Cordes, Heiko Falk and Peter Marwedel. A Fast and Precise Static Loop Analysis based on Abstract Interpretation, Program Slicing and Polytope Models. In International Symposium on Code Generation and Optimization (CGO), pages 136-146 Seattle / USA, March 2009 [BibTeX][PDF][Abstract]@inproceedings { lokuciejewski:09:cgo,
author = {Lokuciejewski, Paul and Cordes, Daniel and Falk, Heiko and Marwedel, Peter},
title = {A Fast and Precise Static Loop Analysis based on Abstract Interpretation, Program Slicing and Polytope Models},
booktitle = {International Symposium on Code Generation and Optimization (CGO)},
year = {2009},
pages = {136-146},
address = {Seattle / USA},
month = {mar},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2009-cgo.pdf},
confidential = {n},
abstract = {A static loop analysis is a program analysis computing loop iteration counts. This information is crucial for different fields of applications. In the domain of compilers, the knowledge about loop iterations can be exploited for aggressive loop optimizations like Loop Unrolling. A loop analyzer also provides static information about code execution frequencies which can assist feedback-directed optimizations. Another prominent application is the static worst-case execution time (WCET) analysis which relies on a safe approximation of loop iteration counts. In this paper, we propose a framework for a static loop analysis based on Abstract Interpretation, a theory of a sound approximation of program semantics. To accelerate the analysis, we preprocess the analyzed code using Program Slicing, a technique that removes statements irrelevant for the loop analysis. In addition, we introduce a novel polytope-based loop evaluation that further significantly reduces the analysis time. The efficiency of our loop analyzer is evaluated on a large number of benchmarks. Results show that 99\% of the considered loops could be successfully analyzed in an acceptable amount of time. This study points out that our methodology is best suited for real-world problems.},
} A static loop analysis is a program analysis computing loop iteration counts. This information is crucial for different fields of applications. In the domain of compilers, the knowledge about loop iterations can be exploited for aggressive loop optimizations like Loop Unrolling. A loop analyzer also provides static information about code execution frequencies which can assist feedback-directed optimizations. Another prominent application is the static worst-case execution time (WCET) analysis which relies on a safe approximation of loop iteration counts. In this paper, we propose a framework for a static loop analysis based on Abstract Interpretation, a theory of a sound approximation of program semantics. To accelerate the analysis, we preprocess the analyzed code using Program Slicing, a technique that removes statements irrelevant for the loop analysis. In addition, we introduce a novel polytope-based loop evaluation that further significantly reduces the analysis time. The efficiency of our loop analyzer is evaluated on a large number of benchmarks. Results show that 99% of the considered loops could be successfully analyzed in an acceptable amount of time. This study points out that our methodology is best suited for real-world problems.
|
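For affine loops, the iteration count the analysis computes has a simple closed form, which is the intuition behind the polytope-based evaluation: a loop `for (i = a; i < b; i += s)` with stride `s > 0` runs `max(0, ceil((b - a) / s))` times, and a rectangular nest multiplies the per-level counts. Triangular or otherwise coupled bounds need genuine polytope counting, which this sketch does not attempt.

```python
# Sketch: closed-form iteration counts for affine loops.
# Assumption: bounds and strides are known integers, e.g. after slicing
# and abstract interpretation have resolved them.

def iterations(a, b, s):
    """Executions of: for (i = a; i < b; i += s), with stride s > 0."""
    return max(0, (b - a + s - 1) // s)     # integer-exact ceiling division

print(iterations(0, 100, 3))                       # 34
print(iterations(5, 5, 1))                         # 0 (loop never entered)
print(iterations(0, 10, 2) * iterations(0, 8, 1))  # 40 for a rectangular 2-level nest
```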
| Paul Lokuciejewski and Peter Marwedel. Combining Worst-Case Timing Models, Loop Unrolling, and Static Loop Analysis for WCET Minimization. In The 21st Euromicro Conference on Real-Time Systems (ECRTS), pages 35-44 Dublin / Ireland, July 2009 [BibTeX][PDF][Abstract]@inproceedings { lokuciejewski:09:ecrts,
author = {Lokuciejewski, Paul and Marwedel, Peter},
title = {Combining Worst-Case Timing Models, Loop Unrolling, and Static Loop Analysis for WCET Minimization},
booktitle = {The 21st Euromicro Conference on Real-Time Systems (ECRTS)},
year = {2009},
pages = {35-44},
address = {Dublin / Ireland},
month = {jul},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2009-ecrts.pdf},
confidential = {n},
abstract = {Program loops are notorious for their optimization potential on modern high-performance architectures. Compilers aim at their aggressive transformation to achieve large improvements of the program performance. In particular, the optimization loop unrolling has shown in the past decades to be highly effective achieving significant increases of the average-case performance. In this paper, we present loop unrolling that is tailored towards real-time systems. Our novel optimization is driven by worst-case execution time (WCET) information to effectively minimize the program's worst-case behavior. To exploit maximal optimization potential, the determination of a suitable unrolling factor is based on precise loop iteration counts provided by a static loop analysis. In addition, our heuristics avoid adverse effects of unrolling which result from instruction cache overflows and the generation of additional spill code. Results on 45 real-life benchmarks demonstrate that aggressive loop unrolling can yield WCET reductions of up to 13.7\% over simple, naive approaches employed by many production compilers.},
} Program loops are notorious for their optimization potential on modern high-performance architectures. Compilers aim at their aggressive transformation to achieve large improvements of the program performance. In particular, the optimization loop unrolling has shown in the past decades to be highly effective achieving significant increases of the average-case performance. In this paper, we present loop unrolling that is tailored towards real-time systems. Our novel optimization is driven by worst-case execution time (WCET) information to effectively minimize the program's worst-case behavior. To exploit maximal optimization potential, the determination of a suitable unrolling factor is based on precise loop iteration counts provided by a static loop analysis. In addition, our heuristics avoid adverse effects of unrolling which result from instruction cache overflows and the generation of additional spill code. Results on 45 real-life benchmarks demonstrate that aggressive loop unrolling can yield WCET reductions of up to 13.7% over simple, naive approaches employed by many production compilers.
|
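A minimal sketch of the factor-selection step this abstract motivates: given the exact iteration count from a static loop analysis, pick the largest unroll factor that divides the count, fits the free instruction-cache space, and does not obviously raise spill pressure. The size and register parameters below are invented stand-ins for the paper's cost model.

```python
# Sketch: choosing an unrolling factor under WCET-oriented constraints.
# Assumptions: `iters` comes from a static loop analysis; body_size and
# icache_free are in bytes; the register check is a crude stand-in for
# the paper's spill-code heuristic.

def unroll_factor(iters, body_size, icache_free, live_regs, max_regs):
    best = 1
    for f in range(2, iters + 1):
        if iters % f:               # prefer factors that avoid a remainder loop
            continue
        if f * body_size > icache_free:
            break                   # unrolled body would overflow the I-cache
        if live_regs + f > max_regs:
            break                   # proxy for additional spill code
        best = f
    return best

print(unroll_factor(iters=32, body_size=64, icache_free=512,
                    live_regs=4, max_regs=10))  # 4
```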
| Paul Lokuciejewski, Fatih Gedikli, Peter Marwedel and Katharina Morik. Automatic WCET Reduction by Machine Learning Based Heuristics for Function Inlining. In Proceedings of the 3rd Workshop on Statistical and Machine Learning Approaches to Architectures and Compilation (SMART), pages 1-15 Paphos / Cyprus, January 2009 [BibTeX][PDF][Abstract]@inproceedings { lokuciejewski:09:smart,
author = {Lokuciejewski, Paul and Gedikli, Fatih and Marwedel, Peter and Morik, Katharina},
title = {Automatic WCET Reduction by Machine Learning Based Heuristics for Function Inlining},
booktitle = {Proceedings of the 3rd Workshop on Statistical and Machine Learning Approaches to Architectures and Compilation (SMART)},
year = {2009},
pages = {1-15},
address = {Paphos / Cyprus},
month = {jan},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2009-smart.pdf},
confidential = {n},
abstract = {The application of machine learning techniques in compiler frameworks has become a challenging research area. Learning algorithms are exploited for an automatic generation of optimization heuristics which often outperform hand-crafted models. Moreover, these automatic approaches can effectively tune the compilers' heuristics after larger changes in the optimization sequence or they can be leveraged to tailor heuristics towards a particular architectural model. Previous works focussed on a reduction of the average-case performance. In this paper, learning approaches are studied in the context of an automatic minimization of the worst-case execution time (WCET) which is the upper bound of the program's maximum execution time. We show that explicitly taking the new timing model into account allows the construction of compiler heuristics that effectively reduce the WCET. This is demonstrated for the well-known optimization function inlining. Our WCET-driven inlining heuristics based on a fast classifier called random forests outperform standard heuristics by up to 9.1\% on average in terms of the WCET reduction. Moreover, we point out that our classifier is highly accurate with a prediction rate for inlining candidates of 84.0\%.},
} The application of machine learning techniques in compiler frameworks has become a challenging research area. Learning algorithms are exploited for an automatic generation of optimization heuristics which often outperform hand-crafted models. Moreover, these automatic approaches can effectively tune the compilers' heuristics after larger changes in the optimization sequence or they can be leveraged to tailor heuristics towards a particular architectural model. Previous works focussed on a reduction of the average-case performance. In this paper, learning approaches are studied in the context of an automatic minimization of the worst-case execution time (WCET) which is the upper bound of the program's maximum execution time. We show that explicitly taking the new timing model into account allows the construction of compiler heuristics that effectively reduce the WCET. This is demonstrated for the well-known optimization function inlining. Our WCET-driven inlining heuristics based on a fast classifier called random forests outperform standard heuristics by up to 9.1% on average in terms of the WCET reduction. Moreover, we point out that our classifier is highly accurate with a prediction rate for inlining candidates of 84.0%.
|
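The random-forest classifier mentioned in this abstract can be sketched in a few lines with scikit-learn (an assumption; the paper used its own toolchain). The features and the labeling rule below are synthetic stand-ins for the paper's call-site features and its "inlining reduced the WCET" training labels.

```python
# Sketch: learning an inlining heuristic with a random forest.
# Assumptions: scikit-learn; synthetic features (callee_size, calls on the
# worst-case path, parameter count) and a toy labeling rule.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(500, 3))     # callee_size, wc_calls, params
y = (X[:, 0] < 40) & (X[:, 1] > 20)         # toy rule: small + hot => inline

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
print("inline?", clf.predict([[25, 60, 2]]))   # a small, hot call site
```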
| Paul Lokuciejewski, Fatih Gedikli and Peter Marwedel. Accelerating WCET-driven Optimizations by the Invariant Path Paradigm - a Case Study of Loop Unswitching. In The 12th International Workshop on Software & Compilers for Embedded Systems (SCOPES), pages 11-20 Nice / France, April 2009 [BibTeX][PDF][Abstract]@inproceedings { lokuciejewski:09:scopes,
author = {Lokuciejewski, Paul and Gedikli, Fatih and Marwedel, Peter},
title = {Accelerating WCET-driven Optimizations by the Invariant Path Paradigm - a Case Study of Loop Unswitching},
booktitle = {The 12th International Workshop on Software \& Compilers for Embedded Systems (SCOPES)},
year = {2009},
pages = {11-20},
address = {Nice / France},
month = {apr},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2009-scopes.pdf},
confidential = {n},
abstract = {The worst-case execution time (WCET) being the upper bound of the maximum execution time corresponds to the longest path through the program's control flow graph. Its reduction is the objective of a WCET optimization. Unlike average-case execution time compiler optimizations which consider a static (most frequently executed) path, the longest path is variable since its optimization might result in another path becoming the effective longest path. To keep path information valid, WCET optimizations typically perform a time-consuming static WCET analysis after each code modification to ensure that subsequent optimization steps operate on the critical path. However, a code modification does not always lead to a path switch, making many WCET analyses superfluous. To cope with this problem, we propose a new paradigm called Invariant Path which eliminates the pessimism by indicating whether a path update is mandatory. To demonstrate the paradigm's practical use, we developed a novel optimization called WCET-driven Loop Unswitching which exploits the Invariant Path information. In a case study, our optimization reduced the WCET of real-world benchmarks by up to 18.3\%, while exploiting the Invariant Path paradigm led to a reduction of the optimization time by 57.5\% on average.},
} The worst-case execution time (WCET) being the upper bound of the maximum execution time corresponds to the longest path through the program's control flow graph. Its reduction is the objective of a WCET optimization. Unlike average-case execution time compiler optimizations which consider a static (most frequently executed) path, the longest path is variable since its optimization might result in another path becoming the effective longest path. To keep path information valid, WCET optimizations typically perform a time-consuming static WCET analysis after each code modification to ensure that subsequent optimization steps operate on the critical path. However, a code modification does not always lead to a path switch, making many WCET analyses superfluous. To cope with this problem, we propose a new paradigm called Invariant Path which eliminates the pessimism by indicating whether a path update is mandatory. To demonstrate the paradigm's practical use, we developed a novel optimization called WCET-driven Loop Unswitching which exploits the Invariant Path information. In a case study, our optimization reduced the WCET of real-world benchmarks by up to 18.3%, while exploiting the Invariant Path paradigm led to a reduction of the optimization time by 57.5% on average.
|
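The Invariant Path idea can be pictured as a guard around re-analysis: re-run the WCET analysis only if the just-optimized block lies on the current critical path, or if an off-path slowdown exceeds that region's slack. The bookkeeping below is illustrative only and not the paper's exact model.

```python
# Sketch: skipping superfluous WCET analyses, in the spirit of the
# Invariant Path paradigm. Assumptions: hypothetical per-region slack
# values and signed local timing changes (negative = faster).

class InvariantPathGuard:
    def __init__(self, critical_path, slack):
        self.critical_path = set(critical_path)  # blocks on the WCET path
        self.slack = slack                       # off-path region -> slack

    def needs_reanalysis(self, block, local_change):
        if block in self.critical_path:
            # Shortening the critical path may make another path longest.
            return local_change < 0
        # Off-path code: only a slowdown larger than the region's slack
        # can turn it into the new worst-case path.
        return local_change > self.slack.get(block, 0)

guard = InvariantPathGuard(critical_path=["B1", "B3", "B7"],
                           slack={"B2": 120, "B5": 40})
print(guard.needs_reanalysis("B2", local_change=-30))  # False: off-path, faster
print(guard.needs_reanalysis("B5", local_change=90))   # True: slack exceeded
print(guard.needs_reanalysis("B3", local_change=-10))  # True: critical path shrank
```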
| Paul Lokuciejewski, Heiko Falk and Peter Marwedel. WCET-driven Cache-based Procedure Positioning Optimizations. In The 20th Euromicro Conference on Real-Time Systems (ECRTS), pages 321-330 Prague / Czech Republic, July 2008 [BibTeX][PDF][Abstract]@inproceedings { loku:08:ecrts,
author = {Lokuciejewski, Paul and Falk, Heiko and Marwedel, Peter},
title = {WCET-driven Cache-based Procedure Positioning Optimizations},
booktitle = {The 20th Euromicro Conference on Real-Time Systems (ECRTS)},
year = {2008},
pages = {321-330},
address = {Prague / Czech Republic},
month = {jul},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2008-ecrts.pdf},
confidential = {n},
abstract = {Procedure Positioning is a well known compiler optimization aiming at the improvement of the instruction cache behavior. A contiguous mapping of procedures calling each other frequently in the memory avoids overlapping of cache lines and thus decreases the number of cache conflict misses. In standard literature, these positioning techniques are guided by execution profile data and focus on an improved average-case performance. We present two novel positioning optimizations driven by worst-case execution time (WCET) information to effectively minimize the program's worst-case behavior. WCET reductions by 10\% on average are achieved. Moreover, a combination of positioning and the WCET-driven Procedure Cloning optimization is presented improving the WCET analysis by 36\% on average.},
} Procedure Positioning is a well known compiler optimization aiming at the improvement of the instruction cache behavior. A contiguous mapping of procedures calling each other frequently in the memory avoids overlapping of cache lines and thus decreases the number of cache conflict misses. In standard literature, these positioning techniques are guided by execution profile data and focus on an improved average-case performance. We present two novel positioning optimizations driven by worst-case execution time (WCET) information to effectively minimize the program's worst-case behavior. WCET reductions by 10% on average are achieved. Moreover, a combination of positioning and the WCET-driven Procedure Cloning optimization is presented improving the WCET analysis by 36% on average.
|
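A sketch of the positioning step, in the spirit of Pettis-Hansen chain merging: procedures connected by the hottest call edges are laid out adjacently so their cache lines do not conflict. Here the edge weights are hypothetical WCET-path call counts, and chains are only ever joined end-to-end, which is a simplification.

```python
# Sketch: WCET-driven procedure positioning via greedy chain merging.
# Assumption: weighted_calls lists (caller, callee, worst-case-path weight)
# edges; real data would come from a WCET analysis of the call graph.

def position(procs, weighted_calls):
    chain_of = {p: [p] for p in procs}
    for u, v, _ in sorted(weighted_calls, key=lambda e: -e[2]):
        cu, cv = chain_of[u], chain_of[v]
        if cu is cv:
            continue                  # already laid out together
        cu.extend(cv)                 # place callee's chain after caller's
        for p in cv:
            chain_of[p] = cu
    seen, layout = set(), []
    for p in procs:                   # emit each merged chain once
        if id(chain_of[p]) not in seen:
            seen.add(id(chain_of[p]))
            layout.extend(chain_of[p])
    return layout

calls = [("main", "filter", 90), ("filter", "dot", 80), ("main", "log", 5)]
print(position(["main", "filter", "dot", "log"], calls))
# ['main', 'filter', 'dot', 'log']
```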
| Andreas Heinig, René Oertel, Jochen Strunk, Wolfgang Rehm and Heiko Schick. Generalizing the SPUFS concept - a case study towards a common accelerator interface. In Proceedings of the Many-core and Reconfigurable Supercomputing Conference Belfast, 1-3 April 2008 [BibTeX]@inproceedings { Heinig2008mrsc,
author = {Heinig, Andreas and Oertel, Ren\'{e} and Strunk, Jochen and Rehm, Wolfgang and Schick, Heiko},
title = {Generalizing the SPUFS concept - a case study towards a common accelerator interface},
booktitle = {Proceedings of the Many-core and Reconfigurable Supercomputing Conference},
year = {2008},
address = {Belfast},
month = {1-3 April},
confidential = {n},
} |
| Niklas Holsti, Jan Gustafsson, Guillem Bernat, Clément Ballabriga, Armelle Bonenfant, Roman Bourgade, Hugues Cassé, Daniel Cordes, Albrecht Kadlec, Raimund Kirner, Jens Knoop, Paul Lokuciejewski and Merriam. WCET Tool Challenge 2008: Report. In International Workshop on Worst-Case Execution Time Analysis (WCET) Prague / Czech Republic, September 2008 [BibTeX][PDF][Abstract]@inproceedings { holsti:08:wcet,
author = {Holsti, Niklas and Gustafsson, Jan and Bernat, Guillem and Ballabriga, Cl\'ement and Bonenfant, Armelle and Bourgade, Roman and Cass\'e, Hugues and Cordes, Daniel and Kadlec, Albrecht and Kirner, Raimund and Knoop, Jens and Lokuciejewski, Paul and Merriam},
title = {WCET Tool Challenge 2008: Report},
booktitle = {International Workshop on Worst-Case Execution Time Analysis (WCET)},
year = {2008},
address = {Prague / Czech Republic},
month = {sep},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2008-wcet.pdf},
confidential = {n},
abstract = {Following the successful WCET Tool Challenge in 2006, the second event in this series was organized in 2008, again with support from the ARTIST2 Network of Excellence. The WCET Tool Challenge 2008 (WCC'08) provides benchmark programs and poses a number of "analysis problems" about the dynamic, runtime properties of these programs. The participants are challenged to solve these problems with their program-analysis tools. Two kinds of problems are defined: WCET problems, which ask for bounds on the execution time of chosen parts (subprograms) of the benchmarks, under given constraints on input data; and flow-analysis problems, which ask for bounds on the number of times certain parts of the benchmark can be executed, again under some constraints. We describe the organization of WCC'08, the benchmark programs, the participating tools, and the general results, successes, and failures. Most participants found WCC'08 to be a useful test of their tools. Unlike the 2006 Challenge, the WCC'08 participants include several tools for the same target (ARM7, LPC2138), and tools that combine measurements and static analysis, as well as pure static-analysis tools.},
} Following the successful WCET Tool Challenge in 2006, the second event in this series was organized in 2008, again with support from the ARTIST2 Network of Excellence. The WCET Tool Challenge 2008 (WCC'08) provides benchmark programs and poses a number of "analysis problems" about the dynamic, runtime properties of these programs. The participants are challenged to solve these problems with their program-analysis tools. Two kinds of problems are defined: WCET problems, which ask for bounds on the execution time of chosen parts (subprograms) of the benchmarks, under given constraints on input data; and flow-analysis problems, which ask for bounds on the number of times certain parts of the benchmark can be executed, again under some constraints. We describe the organization of WCC'08, the benchmark programs, the participating tools, and the general results, successes, and failures. Most participants found WCC'08 to be a useful test of their tools. Unlike the 2006 Challenge, the WCC'08 participants include several tools for the same target (ARM7, LPC2138), and tools that combine measurements and static analysis, as well as pure static-analysis tools.
|
| Paul Lokuciejewski, Heiko Falk, Peter Marwedel and Henrik Theiling. WCET-Driven, Code-Size Critical Procedure Cloning. In The 11th International Workshop on Software & Compilers for Embedded Systems (SCOPES), pages 21-30 Munich / Germany, March 2008 [BibTeX][PDF][Abstract]@inproceedings { loku:08:scopes,
author = {Lokuciejewski, Paul and Falk, Heiko and Marwedel, Peter and Theiling, Henrik},
title = {WCET-Driven, Code-Size Critical Procedure Cloning},
booktitle = {The 11th International Workshop on Software \& Compilers for Embedded Systems (SCOPES)},
year = {2008},
pages = {21-30},
address = {Munich / Germany},
month = {mar},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2008-scopes.pdf},
confidential = {n},
abstract = {In the domain of the worst-case execution time (WCET) analysis, loops are an inherent source of unpredictability and loss of precision since the determination of tight and safe information on the number of loop iterations is a difficult task. In particular, data-dependent loops whose iteration counts depend on function parameters can not be precisely handled by a timing analysis. Procedure Cloning can be exploited to make these loops explicit within the source code allowing a highly precise WCET analysis. In this paper we extend the standard Procedure Cloning optimization by WCET-aware concepts with the objective to improve the tightness of the WCET estimation. Our novel approach is driven by WCET information which successively eliminates code structures leading to overestimated timing results, thus making the code more suitable for the analysis. In addition, the code size increase during the optimization is monitored and large increases are avoided. The effectiveness of our optimization is shown by tests on real-world benchmarks. After performing our optimization, the estimated WCET is reduced by up to 64.2\% while the employed code transformations yield an additional code size increase of 22.6\% on average. In contrast, the average-case performance being the original objective of Procedure Cloning showed a slight decrease.},
} In the domain of the worst-case execution time (WCET) analysis, loops are an inherent source of unpredictability and loss of precision since the determination of tight and safe information on the number of loop iterations is a difficult task. In particular, data-dependent loops whose iteration counts depend on function parameters can not be precisely handled by a timing analysis. Procedure Cloning can be exploited to make these loops explicit within the source code allowing a highly precise WCET analysis. In this paper we extend the standard Procedure Cloning optimization by WCET-aware concepts with the objective to improve the tightness of the WCET estimation. Our novel approach is driven by WCET information which successively eliminates code structures leading to overestimated timing results, thus making the code more suitable for the analysis. In addition, the code size increase during the optimization is monitored and large increases are avoided. The effectiveness of our optimization is shown by tests on real-world benchmarks. After performing our optimization, the estimated WCET is reduced by up to 64.2% while the employed code transformations yield an additional code size increase of 22.6% on average. In contrast, the average-case performance being the original objective of Procedure Cloning showed a slight decrease.
|
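The code-size-critical cloning decision can be sketched as a budgeted specialization pass: create one clone per (function, constant loop-bound argument) pair, most frequent constants first, until the size budget is exhausted. Call-site counts, sizes, and the naming scheme below are invented for illustration; the real optimization rewrites source code and re-runs the WCET analysis.

```python
# Sketch: code-size-aware procedure cloning for constant loop bounds.
# Assumptions: call_sites maps (function, constant argument) to its call
# count, func_size gives per-function code sizes in abstract units.

def clone_procedures(call_sites, func_size, size_budget):
    """Returns a call-site rewrite map and the code size consumed."""
    rewrites, clones, used = {}, set(), 0
    # Most frequent constants first: they make the most loops explicit.
    for (fn, const), _count in sorted(call_sites.items(),
                                      key=lambda kv: -kv[1]):
        clone = f"{fn}__{const}"         # hypothetical clone-naming scheme
        if clone not in clones:
            if used + func_size[fn] > size_budget:
                continue                 # keep the generic version here
            clones.add(clone)
            used += func_size[fn]
        rewrites[(fn, const)] = clone
    return rewrites, used

sites = {("fir", 8): 12, ("fir", 64): 3, ("dct", 4): 7}
print(clone_procedures(sites, {"fir": 40, "dct": 25}, size_budget=70))
# clones fir__8 and dct__4; fir__64 is skipped to respect the budget
```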
| Sascha Plazar, Paul Lokuciejewski and Peter Marwedel. A Retargetable Framework for Multi-objective WCET-aware High-level Compiler Optimizations. In Proceedings of The 29th IEEE Real-Time Systems Symposium (RTSS) WiP, pages 49-52 Barcelona / Spain, December 2008 [BibTeX][PDF][Abstract]@inproceedings { plazar:08:rtss,
author = {Plazar, Sascha and Lokuciejewski, Paul and Marwedel, Peter},
title = {A Retargetable Framework for Multi-objective WCET-aware High-level Compiler Optimizations},
booktitle = {Proceedings of The 29th IEEE Real-Time Systems Symposium (RTSS) WiP},
year = {2008},
pages = {49-52},
address = {Barcelona / Spain},
month = {dec},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2008-rtss.pdf},
confidential = {n},
abstract = {The worst-case execution time (WCET) is a key parameter in the domain of real-time systems and its automatic compiler-based minimization becomes a challenging research area. Although today's embedded system applications are written in a high-level language, most published works consider low-level optimizations which complicate their portability to other processors. In this work, we present a framework for the development of novel WCET-driven high-level optimizations. Our WCET-aware compiler framework provides a multi-target support as well as an integration of different non-functional objectives. It enables multi-objective optimizations, thus opens avenues to a state-of-the-art design of predictable and efficient systems. In addition, the multi-target support provides the opportunity to efficiently evaluate the impact of different compiler optimizations on various processors.},
} The worst-case execution time (WCET) is a key parameter in the domain of real-time systems and its automatic compiler-based minimization becomes a challenging research area. Although today's embedded system applications are written in a high-level language, most published works consider low-level optimizations which complicate their portability to other processors. In this work, we present a framework for the development of novel WCET-driven high-level optimizations. Our WCET-aware compiler framework provides a multi-target support as well as an integration of different non-functional objectives. It enables multi-objective optimizations, thus opens avenues to a state-of-the-art design of predictable and efficient systems. In addition, the multi-target support provides the opportunity to efficiently evaluate the impact of different compiler optimizations on various processors.
|
| Peter Marwedel and Heiko Falk (presentation). Memory-architecture aware compilation. In The ARTIST2 Summer School 2008 in Europe Autrans / France, 2008 [BibTeX][PDF]@inproceedings { marwedel:08:artist2,
author = {Marwedel, Peter and Falk (presentation), Heiko},
title = {Memory-architecture aware compilation},
booktitle = {The ARTIST2 Summer School 2008 in Europe},
year = {2008},
address = {Autrans / France},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2008-artist2summerschool.pdf},
confidential = {n},
} |
| Michael Engel and Olaf Spinczyk. Aspects in Hardware - What Do They Look Like?. In Proceedings of the 7th AOSD Workshop on Aspects, Components, and Patterns for Infrastructure Software (AOSD-ACP4IS '08) Brussels, Belgium, April 2008 [BibTeX]@inproceedings { engel:08:aosd-acp4is,
author = {Engel, Michael and Spinczyk, Olaf},
title = {Aspects in Hardware - What Do They Look Like?},
booktitle = {Proceedings of the 7th AOSD Workshop on Aspects, Components, and Patterns for Infrastructure Software (AOSD-ACP4IS '08)},
year = {2008},
address = {Brussels, Belgium},
month = {apr},
publisher = {ACM Press},
confidential = {n},
} |
| Michael Engel and Olaf Spinczyk. System-on-Chip Integration of Embedded Automotive Controllers. In Proceedings of the First Workshop on Isolation and Integration in Embedded Systems Glasgow, UK, April 2008 [BibTeX]@inproceedings { engel:08:eurosys-iies,
author = {Engel, Michael and Spinczyk, Olaf},
title = {System-on-Chip Integration of Embedded Automotive Controllers},
booktitle = {Proceedings of the First Workshop on Isolation and Integration in Embedded Systems},
year = {2008},
address = {Glasgow, UK},
month = {apr},
publisher = {ACM Press},
confidential = {n},
} |
| Paul Lokuciejewski, Heiko Falk, Martin Schwarzer and Peter Marwedel. Tighter WCET Estimates by Procedure Cloning. In 7th International Workshop on Worst-Case Execution Time Analysis (WCET), pages 27-32 Pisa/Italy, July 2007 [BibTeX][PDF][Abstract]@inproceedings { loku:07:wcet,
author = {Lokuciejewski, Paul and Falk, Heiko and Schwarzer, Martin and Marwedel, Peter},
title = {Tighter WCET Estimates by Procedure Cloning},
booktitle = {7th International Workshop on Worst-Case Execution Time Analysis (WCET)},
year = {2007},
pages = {27-32},
address = {Pisa/Italy},
month = {jul},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2007-wcet.pdf},
confidential = {n},
abstract = {Embedded software spends most of its execution time in loops. To allow a precise static WCET analysis, each loop iteration should, in theory, be represented by an individual calling context. However, due to the enormous analysis times of real-world applications, this approach is not feasible and requires a reduction of the analysis complexity by limiting the number of considered contexts. This restricted timing analysis results in imprecise WCET estimates. In particular, data-dependent loops with iteration counts depending on function parameters cannot be precisely analyzed. In order to reduce the number of contexts that must be implicitly considered, causing an increase in analysis time, we apply the standard compiler optimization \textit{procedure cloning} which improves the program's predictability by making loops explicit and thus allowing a precise annotation of loop bounds. The result is a tight WCET estimation within a reduced analysis time. Our results indicate that reductions of the WCET between 12\% and 95\% were achieved for real-world benchmarks. In contrast, the reduction of the simulated program execution time remained marginal with only 3\%. As will be also shown, this optimization only produces a small overhead for the WCET analysis.},
} Embedded software spends most of its execution time in loops. To allow a precise static WCET analysis, each loop iteration should, in theory, be represented by an individual calling context. However, due to the enormous analysis times of real-world applications, this approach is not feasible and requires a reduction of the analysis complexity by limiting the number of considered contexts. This restricted timing analysis results in imprecise WCET estimates. In particular, data-dependent loops with iteration counts depending on function parameters cannot be precisely analyzed. In order to reduce the number of contexts that must be implicitly considered, causing an increase in analysis time, we apply the standard compiler optimization procedure cloning which improves the program's predictability by making loops explicit and thus allowing a precise annotation of loop bounds. The result is a tight WCET estimation within a reduced analysis time. Our results indicate that reductions of the WCET between 12% and 95% were achieved for real-world benchmarks. In contrast, the reduction of the simulated program execution time remained marginal with only 3%. As will be also shown, this optimization only produces a small overhead for the WCET analysis.
|
| P. Reinhardt, O. Battenfeld, M. Engel and B. Freisleben. A Paravirtualized Scalable Emulation Testbed for Mobile Ad-Hoc Networks. In Proceedings of ICCP07, Oman 2007 [BibTeX]@inproceedings { engel:07:iccp,
author = {Reinhardt, P. and Battenfeld, O. and Engel, M. and Freisleben, B.},
title = {A Paravirtualized Scalable Emulation Testbed for Mobile Ad-Hoc Networks},
booktitle = {Proceedings of ICCP07, Oman},
year = {2007},
publisher = {IEEE Computer Society Press},
confidential = {n},
} |
| Robert Pyka, Christoph Faßbach, Manish Verma, Heiko Falk and Peter Marwedel. Operating system integrated energy aware scratchpad allocation strategies for multiprocess applications. In 10th International Workshop on Software & Compilers for Embedded Systems (SCOPES), pages 41-50 Nice/France, April 2007 [BibTeX][PDF][Abstract]@inproceedings { pyka:07:scopes,
author = {Pyka, Robert and Fa{\ss}bach, Christoph and Verma, Manish and Falk, Heiko and Marwedel, Peter},
title = {Operating system integrated energy aware scratchpad allocation strategies for multiprocess applications},
booktitle = {10th International Workshop on Software \& Compilers for Embedded Systems (SCOPES)},
year = {2007},
pages = {41-50},
address = {Nice/France},
month = {apr},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2007-scopes.pdf},
confidential = {n},
abstract = {Various scratchpad allocation strategies have been developed in the past. Most of them target the reduction of energy consumption. These approaches share the necessity of having direct access to the scratchpad memory. In earlier embedded systems this was always true, but with the increasing complexity of tasks systems have to perform, an additional operating system layer between the hardware and the application is becoming mandatory. This paper presents an approach to integrate a scratchpad memory manager into the operating system. The goal is to minimize energy consumption. In contrast to previous work, compile time knowledge about the application's behavior is taken into account. A set of fast heuristic allocation methods is proposed in this paper. An in-depth study and comparison of achieved energy savings and cycle reductions was performed. The results show that even in the highly dynamic environment of an operating system equipped embedded system, up to 83% energy consumption reduction can be achieved.},
}
|
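To make the idea of an OS-integrated scratchpad manager concrete, here is a minimal allocator sketch; the interface, the bump-pointer policy, and all names are assumptions chosen for illustration, not the design evaluated in the paper:

    #include <stddef.h>

    #define SPM_SIZE 4096u
    static unsigned char spm[SPM_SIZE];  /* stands in for the on-chip scratchpad */
    static size_t spm_used;

    /* The compiler passes its static estimate of the energy saved per byte;
       a real manager would use it to arbitrate between competing processes
       and possibly evict less profitable objects. */
    void *spm_alloc(size_t size, double gain_per_byte) {
      (void)gain_per_byte;               /* ignored in this simplified sketch */
      if (size > SPM_SIZE - spm_used)
        return NULL;                     /* no space: object stays in main memory */
      void *p = &spm[spm_used];
      spm_used += size;
      return p;
    }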
| Heiko Falk, Sascha Plazar and Henrik Theiling. Compile Time Decided Instruction Cache Locking Using Worst-Case Execution Paths. In International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 143-148 Salzburg/Austria, September 2007 [BibTeX][PDF][Abstract]@inproceedings { falk:07:codes_isss,
author = {Falk, Heiko and Plazar, Sascha and Theiling, Henrik},
title = {Compile Time Decided Instruction Cache Locking Using Worst-Case Execution Paths},
booktitle = {International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS)},
year = {2007},
pages = {143-148},
address = {Salzburg/Austria},
month = {sep},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2007-codes+isss_1.pdf},
confidential = {n},
abstract = {Caches are notorious for their unpredictability. It is difficult or even impossible to predict if a memory access results in a definite cache hit or miss. This unpredictability is highly undesired for real-time systems. The Worst-Case Execution Time \textem{(WCET)} of a software running on an embedded processor is one of the most important metrics during real-time system design. The WCET depends to a large extent on the total amount of time spent for memory accesses. In the presence of caches, WCET analysis must always assume a memory access to be a cache miss if it can not be guaranteed that it is a hit. Hence, WCETs for cached systems are imprecise due to the overestimation caused by the caches. Modern caches can be controlled by software. The software can load parts of its code or of its data into the cache and lock the cache afterwards. Cache locking prevents the cache's contents from being flushed by deactivating the replacement. A locked cache is highly predictable and leads to very precise WCET estimates, because the uncertainty caused by the replacement strategy is eliminated completely. This paper presents techniques exploring the lockdown of instruction caches at compile-time to minimize WCETs. In contrast to the current state of the art in the area of cache locking, our techniques explicitly take the worst-case execution path into account during each step of the optimization procedure. This way, we can make sure that always those parts of the code are locked in the I-cache that lead to the highest WCET reduction. The results demonstrate that WCET reductions from 54\% up to 73\% can be achieved with an acceptable amount of CPU seconds required for the optimization and WCET analyses themselves.},
}
|
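The selection loop sketched below paraphrases the worst-case-path-driven idea from the abstract as greedy pseudocode in C; all helper functions are hypothetical placeholders for the compiler/analyzer coupling, and the paper's actual algorithms differ in detail:

    /* Hypothetical helpers: blocks_on_wcep() lists the code blocks on the
       current worst-case execution path, wcet_if_locked() re-runs the WCET
       analysis with one additional block locked (-1 means none), and
       lock_block() commits a locking decision. */
    extern int  blocks_on_wcep(int *out, int max);
    extern long wcet_if_locked(int extra_block);
    extern void lock_block(int block);

    void lock_icache(int budget_blocks) {
      while (budget_blocks-- > 0) {
        int  cand[64];
        int  n         = blocks_on_wcep(cand, 64);
        long best_wcet = wcet_if_locked(-1);     /* baseline: lock nothing more */
        int  best      = -1;
        for (int i = 0; i < n; i++) {            /* choose the block whose      */
          long w = wcet_if_locked(cand[i]);      /* locking shrinks WCET most   */
          if (w < best_wcet) { best_wcet = w; best = cand[i]; }
        }
        if (best < 0) break;                     /* further locking cannot help */
        lock_block(best);                        /* locking may shift the WCEP, */
      }                                          /* hence re-analysis per round */
    }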
| Paul Lokuciejewski, Heiko Falk, Martin Schwarzer, Peter Marwedel and Henrik Theiling. Influence of Procedure Cloning on WCET Prediction. In International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 137-142 Salzburg/Austria, September 2007 [BibTeX][PDF][Abstract]@inproceedings { loku:07:codes_isss,
author = {Lokuciejewski, Paul and Falk, Heiko and Schwarzer, Martin and Marwedel, Peter and Theiling, Henrik},
title = {Influence of Procedure Cloning on WCET Prediction},
booktitle = {International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS)},
year = {2007},
pages = {137-142},
address = {Salzburg/Austria},
month = {sep},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2007-codes+isss_2.pdf},
confidential = {n},
abstract = {For the worst-case execution time \textem{(WCET)} analysis, especially loops are an inherent source of unpredictability and loss of precision. This is caused by the difficulty to obtain safe and tight information on the number of iterations executed by a loop in the worst case. In particular, data-dependent loops whose iteration counts depend on function parameters are extremely difficult to analyze precisely. Procedure cloning helps by making such data-dependent loops explicit within the source code, thus making them accessible for high-precision WCET analyses. This paper presents the effect of procedure cloning applied at the source-code level on worst-case execution time. The optimization generates specialized versions of functions being called with constant values as arguments. In standard literature, it is used to enable further optimizations like constant propagation within functions and to reduce calling overhead. We show that procedure cloning for WCET minimization leads to significant improvements. Reductions of the WCET from 12\% up to 95\% were measured for real-life benchmarks. These results demonstrate that procedure cloning improves analyzability and predictability of real-time applications dramatically. In contrast, average-case performance as the criterion procedure cloning was developed for is reduced by only 3\% at most. Our results also show that these WCET reductions only implied small overhead during WCET analysis.},
}
|
| Peter Marwedel, Heiko Falk, Sascha Plazar, Robert Pyka and Lars Wehmeyer. Automatic mapping to tightly-coupled memories and cache locking. In Proceedings of 4th HiPEAC Industrial Workshop on Compilers and Architectures Cambridge, UK, August 2007 [BibTeX][PDF][Link]@inproceedings { marwedel:07:hipeac,
author = {Marwedel, Peter and Falk, Heiko and Plazar, Sascha and Pyka, Robert and Wehmeyer, Lars},
title = {Automatic mapping to tightly-coupled memories and cache locking},
booktitle = {Proceedings of 4th HiPEAC Industrial Workshop on Compilers and Architectures},
year = {2007},
address = {Cambridge, UK},
month = {aug},
url = {http://www.hipeac.net/industry_workshop4},
keywords = {wcet},
file = {http://www.hipeac.net/system/files?file=session1_3.ppt},
confidential = {n},
} |
| Heiko Falk, Paul Lokuciejewski and Henrik Theiling. Design of a WCET-Aware C Compiler. In 6th International Workshop on Worst-Case Execution Time Analysis (WCET) Dresden/Germany, July 2006 [BibTeX][PDF][Abstract]@inproceedings { falk:06:wcet,
author = {Falk, Heiko and Lokuciejewski, Paul and Theiling, Henrik},
title = {Design of a WCET-Aware C Compiler},
booktitle = {6th International Workshop on Worst-Case Execution Time Analysis (WCET)},
year = {2006},
address = {Dresden/Germany},
month = {jul},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2006-wcet_1.pdf},
confidential = {n},
abstract = {This paper presents techniques to tightly integrate worst-case execution time \textem{(WCET)} information into a compiler framework. Currently, a tight integration of WCET information into the compilation process is strongly desired, but only some ad-hoc approaches have been reported so far. Previous publications mainly used self-written WCET estimators with very limited functionality and preciseness during compilation. A very tight integration of a high quality industry-relevant WCET analyzer into a compiler had not yet been achieved. This work is the first to present techniques capable of achieving such a tight coupling between a compiler and the WCET analyzer aiT. This is done by automatically translating the assembly-like contents of the compiler's low-level intermediate representation \textem{(LLIR)} to aiT's exchange format CRL2. Additionally, the results produced by the WCET analyzer are automatically collected and re-imported into the compiler infrastructure. The work described in this paper is smoothly integrated into a C compiler environment for the Infineon TriCore processor. It opens up new possibilities for the design of WCET-aware optimizations in the future. The concepts for extending the compiler infrastructure are kept very general so that they are not limited to WCET information. Rather, it is possible to use our structures also for multi-objective optimization of e.g. best-case execution time \textem{(BCET)} or energy dissipation.},
}
|
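The compiler/analyzer coupling described above boils down to an export-analyze-import cycle; the sketch below uses invented placeholder names rather than the actual WCC/aiT interfaces:

    /* Placeholders for the three coupling steps named in the abstract. */
    extern void llir_to_crl2(const char *crl2_file);   /* export the low-level IR */
    extern int  run_ait(const char *crl2_file,
                        const char *result_file);      /* invoke the analyzer     */
    extern void import_wcet(const char *result_file);  /* back-annotate the IR    */

    void wcet_aware_pass(void) {
      llir_to_crl2("prog.crl2");           /* translate LLIR to aiT's CRL2 format */
      if (run_ait("prog.crl2", "prog.wcet") == 0)
        import_wcet("prog.wcet");          /* per-block WCET data lands in LLIR   */
      /* subsequent optimizations can now query WCET data on the LLIR */
    }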
| M. Smith, B. Klose, R. Ewerth, T. Friese, M. Engel and B. Freisleben. Runtime Integration of Reconfigurable Hardware in Service-Oriented Grids. In Proceedings of the IEEE International Conference on Web Services (ICWS), Chicago, USA, pages 945-948 2006 [BibTeX]@inproceedings { engel:06:icws,
author = {Smith, M. and Klose, B. and Ewerth, R. and Friese, T. and Engel, M. and Freisleben, B.},
title = {Runtime Integration of Reconfigurable Hardware in Service-Oriented Grids},
booktitle = {Proceedings of the IEEE International Conference on Web Services (ICWS), Chicago, USA},
year = {2006},
pages = {945-948},
publisher = {IEEE Computer Society Press},
confidential = {n},
} |
| M. Smith, T. Friese, M. Engel, B. Freisleben, G. Koenig and W. Yurcik. Security Issues in On-Demand Grid and Cluster Computing. In Sixth IEEE International Symposium on Cluster Computing and the Grid Workshops (CCGRIDW'06), pages 24 2006 [BibTeX]@inproceedings { engel:06:iscc,
author = {Smith, M. and Friese, T. and Engel, M. and Freisleben, B. and Koenig, G. and Yurcik, W.},
title = {Security Issues in On-Demand Grid and Cluster Computing},
booktitle = {Sixth IEEE International Symposium on Cluster Computing and the Grid Workshops (CCGRIDW'06)},
year = {2006},
pages = {24},
publisher = {IEEE Computer Society Press},
confidential = {n},
} |
| M. Smith, T. Friese, M. Engel and B. Freisleben. Countering Security Threats in Service-Oriented On-Demand Grid Computing Using Sandboxing and Trusted Computing Techniques. In Journal of Parallel and Distributed Computing, Volume 66, Issue 9, pages 1189-1204 2006 [BibTeX]@inproceedings { engel:06:jpdc,
author = {Smith, M. and Friese, T. and Engel, M. and Freisleben, B.},
title = {Countering Security Threats in Service-Oriented On-Demand Grid Computing Using Sandboxing and Trusted Computing Techniques},
booktitle = {Journal of Parallel and Distributed Computing, Volume 66, Issue 9},
year = {2006},
pages = {1189-1204},
publisher = {Elsevier},
confidential = {n},
} |
| Heiko Falk and Martin Schwarzer. Loop Nest Splitting for WCET-Optimization and Predictability Improvement. In 6th International Workshop on Worst-Case Execution Time Analysis (WCET) Dresden/Germany, July 2006 [BibTeX][PDF][Abstract]@inproceedings { falk:06:wcet2,
author = {Falk, Heiko and Schwarzer, Martin},
title = {Loop Nest Splitting for WCET-Optimization and Predictability Improvement},
booktitle = {6th International Workshop on Worst-Case Execution Time Analysis (WCET)},
year = {2006},
address = {Dresden/Germany},
month = {jul},
keywords = {sco, wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2006-wcet_2.pdf},
confidential = {n},
abstract = {This paper presents the influence of the loop nest splitting source code optimization on the worst-case execution time \textem{(WCET)}. Loop nest splitting minimizes the number of executed if-statements in loop nests of embedded multimedia applications. It identifies iterations of a loop nest where all if-statements are satisfied and splits the loop nest such that if-statements are not executed at all for large parts of the loop nest's iteration space. Especially loops and if-statements of high-level languages are an inherent source of unpredictability and loss of precision for WCET analysis. This is caused by the fact that it is difficult to obtain safe and tight worst-case estimates of an application's flow of control through these high-level constructs. In addition, the corresponding control flow redirections expressed at the assembly level reduce predictability even more due to the complex pipeline and branch prediction behavior of modern embedded processors. The analysis techniques for loop nest splitting are based on precise mathematical models combined with genetic algorithms. On the one hand, these techniques achieve a significantly more homogeneous structure of the control flow. On the other hand, the precision of our analyses leads to the generation of very accurate high-level flow facts for loops and if-statements. The application of our implemented algorithms to three real-life multimedia benchmarks leads to average speed-ups by 25.0\% - 30.1\%, while WCET is reduced between 34.0\% and 36.3\%.},
}
|
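A minimal before/after sketch of loop nest splitting, with an invented iteration space and condition (the paper evaluates real multimedia kernels):

    extern void common(int x, int y);   /* executed in the frequent case */
    extern void rare(int x, int y);     /* executed near the boundary    */

    /* Before: the condition is tested in every one of the 36*49 iterations,
       although it holds throughout the region x >= 10. */
    void before(void) {
      for (int x = 0; x < 36; x++)
        for (int y = 0; y < 49; y++)
          if (x >= 10 || y >= 14) common(x, y);
          else                    rare(x, y);
    }

    /* After splitting: the analysis proves the condition always holds for
       x >= 10, so that region runs condition-free; the if-statement survives
       only in the small remainder of the iteration space. */
    void after(void) {
      for (int x = 0; x < 36; x++) {
        if (x >= 10) {
          for (int y = 0; y < 49; y++)
            common(x, y);               /* no if-statement executed here */
        } else {
          for (int y = 0; y < 49; y++)
            if (y >= 14) common(x, y);
            else         rare(x, y);
        }
      }
    }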
| Michael Engel and Bernd Freisleben. {TOSKANA:} A Toolkit for Operating System Kernel Aspects. In Transactions on AOSD II 4242, pages 182--226 2006 [BibTeX]@inproceedings { engel:06:taosd,
author = {Engel, Michael and Freisleben, Bernd},
title = {{TOSKANA:} A Toolkit for Operating System Kernel Aspects},
booktitle = {Transactions on AOSD II},
year = {2006},
editor = {Awais Rashid and Mehmet Aksit},
number = {4242},
series = {Lecture Notes in Computer Science},
pages = {182--226},
publisher = {Springer-Verlag},
confidential = {n},
} |
| Manish Verma and Peter Marwedel. Compilation and Simulation Tool Chain for Memory Aware Energy Optimizations. In Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS VI) Samos, Greece, July 2006 [BibTeX][PDF][Abstract]@inproceedings { verma:06:samos,
author = {Verma, Manish and Marwedel, Peter},
title = {Compilation and Simulation Tool Chain for Memory Aware Energy Optimizations},
booktitle = {Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS VI)},
year = {2006},
address = {Samos, Greece},
month = {jul},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2006-samos.pdf},
confidential = {n},
abstract = {Memory hierarchies are known to be the energy bottleneck of portable embedded devices. Numerous memory aware energy optimizations have been proposed. However, both the optimization and the validation is performed in an ad-hoc manner as a coherent compilation and simulation framework does not exist as yet. In this paper, we present such a framework for performing memory hierarchy aware energy optimization. Both the compiler and the simulator are configured from a single memory hierarchy description. Significant savings of up to 50\% in the total energy dissipation are reported.},
}
|
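The single memory hierarchy description that configures both tools might, in spirit, look like the following table; the format, field names, and numbers are all invented for illustration:

    /* One shared description of the memory hierarchy: the energy-aware
       compiler reads it to cost its allocation decisions, the simulator
       reads it to report energy. All values below are made up. */
    typedef struct {
      const char *name;
      unsigned    size_bytes;
      unsigned    latency_cycles;   /* cycles per access */
      double      energy_nj;        /* energy per access */
    } mem_level_t;

    static const mem_level_t hierarchy[] = {
      { "scratchpad",   4 * 1024,         1,  0.12 },
      { "icache",       8 * 1024,         1,  0.25 },
      { "main_memory",  16 * 1024 * 1024, 20, 4.90 },
    };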
| M. Smith, M. Engel, S. Hanemann and B. Freisleben. Towards a Roadcasting Communications Infrastructure. In Proceedings of the IEEE International Conference on Mobile Communications and Learning Technologies, pages 213-213 2006 [BibTeX]@inproceedings { engel:06:icmclt,
author = {Smith, M. and Engel, M. and Hanemann, S. and Freisleben, B.},
title = {Towards a Roadcasting Communications Infrastructure},
booktitle = {Proceedings of the IEEE International Conference on Mobile Communications and Learning Technologies},
year = {2006},
pages = {213-213},
publisher = {IEEE Computer Society Press},
confidential = {n},
} |
| Heiko Falk, Jens Wagner and André Schaefer. Use of a Bit-true Data Flow Analysis for Processor-Specific Source Code Optimization. In 4th IEEE Workshop on Embedded Systems for Real-Time Multimedia (ESTIMedia), pages 133-138 Seoul/Korea, October 2006 [BibTeX][PDF][Abstract]@inproceedings { falk:06:estimedia,
author = {Falk, Heiko and Wagner, Jens and Schaefer, Andr\'e},
title = {Use of a Bit-true Data Flow Analysis for Processor-Specific Source Code Optimization},
booktitle = {4th IEEE Workshop on Embedded Systems for Real-Time Multimedia (ESTIMedia)},
year = {2006},
pages = {133-138},
address = {Seoul/Korea},
month = {oct},
keywords = {sco},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2006-estimedia_1.pdf},
confidential = {n},
abstract = {Nowadays, key characteristics of a processor's instruction set are only exploited in high-level languages by using inline assembly or compiler intrinsics. Inserting intrinsics into the source code is up to the programmer, since only a few automatic approaches exist. Additionally, these approaches are based on simple code pattern matching strategies. This paper presents techniques for processor-specific code analysis and optimization at the source-level. It is shown how a bit-true data flow analysis is made applicable for source code analysis for the TI C6x DSPs for the very first time. Based on this bit-true analysis, fully automated optimizations superior to conventional pattern matching techniques are presented which optimize saturated arithmetic, reduce bitwidths of variables and exploit SIMD data processing within source codes. The application of our implemented algorithms to complex real-life codes leads to speed-ups between 33\% - 48\% for the optimization of saturated arithmetic, and up to 16\% after SIMD optimization.},
}
|
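The saturation idiom that such an analysis must prove bit-true is typically written as below; replacing it with a DSP saturating-add instruction or intrinsic is only safe once the value ranges are known exactly (the concrete function is our own example, not taken from the paper):

    #include <stdint.h>

    /* Saturated 16-bit addition written portably. A bit-true data flow
       analysis can prove that sum always lies in [-65536, 65534], so the
       two clamps implement exact 16-bit saturation and the whole body may
       be replaced by a saturating-add operation of the target DSP. */
    int16_t sat_add16(int16_t a, int16_t b) {
      int32_t sum = (int32_t)a + (int32_t)b;
      if (sum >  32767) sum =  32767;   /* clamp to INT16_MAX */
      if (sum < -32768) sum = -32768;   /* clamp to INT16_MIN */
      return (int16_t)sum;
    }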
| Heiko Falk and Martin Schwarzer. Loop Nest Splitting for WCET-Optimization and Predictability Improvement. In 4th IEEE Workshop on Embedded Systems for Real-Time Multimedia (ESTIMedia), pages 115-120 Seoul/Korea, October 2006 [BibTeX][PDF][Abstract]@inproceedings { falk:06:estimedia2,
author = {Falk, Heiko and Schwarzer, Martin},
title = {Loop Nest Splitting for WCET-Optimization and Predictability Improvement},
booktitle = {4th IEEE Workshop on Embedded Systems for Real-Time Multimedia (ESTIMedia)},
year = {2006},
pages = {115-120},
address = {Seoul/Korea},
month = {oct},
keywords = {sco, wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2006-estimedia_2.pdf},
confidential = {n},
abstract = {This paper presents the effect of the loop nest splitting source code optimization on worst-case execution time \textem{(WCET)}. Loop nest splitting minimizes the number of executed if-statements in loop nests of multimedia applications. It identifies iterations where all if-statements are satisfied and splits the loop nest such that if-statements are not executed at all for large parts of the loop nest's iteration space. Especially loops and if-statements are an inherent source of unpredictability and loss of precision for WCET analysis. This is caused by the difficulty to obtain safe and tight worst-case estimates of an application's high-level control flow. In addition, assembly-level control flow redirections reduce predictability even more due to complex processor pipelines and branch prediction units. Loop nest splitting is based on precise mathematical models combined with genetic algorithms. On the one hand, these techniques achieve a significantly more homogeneous control flow structure. On the other hand, the precision of our analyses enables us to generate very accurate high-level flow facts for loops and if-statements. The application of our implemented algorithms to three real-life benchmarks leads to average speed-ups by 25.0\% - 30.1\%, while WCET is reduced by 34.0\% - 36.3\%.},
}
|
| Heiko Falk, Paul Lokuciejewski and Henrik Theiling. Design of a WCET-Aware C Compiler. In 4th IEEE Workshop on Embedded Systems for Real-Time Multimedia (ESTIMedia), pages 121-126 Seoul/Korea, October 2006 [BibTeX][PDF][Abstract]@inproceedings { falk:06:estimedia3,
author = {Falk, Heiko and Lokuciejewski, Paul and Theiling, Henrik},
title = {Design of a WCET-Aware C Compiler},
booktitle = {4th IEEE Workshop on Embedded Systems for Real-Time Multimedia (ESTIMedia)},
year = {2006},
pages = {121-126},
address = {Seoul/Korea},
month = {oct},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2006-estimedia_3.pdf},
confidential = {n},
abstract = {This paper presents techniques to integrate worst-case execution time \textem{(WCET)} data into a compiler. Currently, a tight integration of WCET into compilers is strongly desired, but only some ad-hoc approaches have been reported so far. Previous work mainly used self-written WCET estimators with limited functionality and preciseness during compilation. A very tight integration of a high quality WCET analyzer into a compiler had not yet been achieved. This work is the first to present such a tight coupling between a compiler and the WCET analyzer aiT. This is done by automatically translating the assembly-like contents of the compiler's low-level format \textem{(LLIR)} to aiT's exchange format CRL2. Additionally, the results produced by aiT are automatically collected and re-imported into the compiler infrastructure. The work described in this paper is smoothly integrated into a C compiler for the Infineon TriCore processor. It opens up new possibilities for the design of WCET-aware optimizations in the future. The concepts for extending the compiler structure are kept very general so that they are not limited to WCET information. Rather, it is possible to use our concepts also for multi-objective optimization of e.g. best-case execution time \textem{(BCET)} or energy dissipation.},
}
|
| Heiko Falk. Control Flow driven Code Hoisting at the Source Code Level. In Optimizations for DSP and Embedded Systems San Jose/United States, March 2005 [BibTeX][PDF][Abstract]@inproceedings { falk:05:odes,
author = {Falk, Heiko},
title = {Control Flow driven Code Hoisting at the Source Code Level},
booktitle = {Optimizations for DSP and Embedded Systems},
year = {2005},
address = {San Jose/United States},
month = {mar},
keywords = {sco},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2005-odes.pdf},
confidential = {n},
abstract = {This paper presents a novel source code optimization technique called advanced code hoisting. It aims at moving portions of code from inner loops to outer ones. In contrast to existing code motion techniques, this is done under consideration of control flow aspects. Depending on the conditions of \textem{if}-statements, moving an expression can lead to an increased number of executions of this expression. This paper contains formal descriptions of the polyhedral models used for control flow analysis so as to suppress a code motion in such a situation. Due to the inherent portability of source code transformations, a very detailed benchmarking using 8 different processors was performed. The application of our implemented techniques to real-life multimedia benchmarks leads to average speed-ups of 25.5\%-52\% and energy savings of 33.4\%-74.5\%. Furthermore, advanced code hoisting leads to improved pipeline and cache behavior and smaller code sizes.},
}
|
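The control-flow-sensitive profitability test is the novelty here; a small invented example shows why plain loop-invariant code motion is not enough:

    extern int a[64][64];
    extern int k;

    void hoisting_example(void) {
      /* Before: k*k is invariant in both loops but guarded, so it is
         evaluated only 4 times (i == 0 and j < 4). */
      for (int i = 0; i < 64; i++)
        for (int j = 0; j < 64; j++)
          if (i == 0 && j < 4)
            a[i][j] += k * k;

      /* Advanced code hoisting counts guarded iteration points with its
         polyhedral model: hoisting k*k in front of the whole nest costs one
         evaluation (profitable, so it is performed), whereas hoisting it
         merely into the body of the i-loop would cost 64 evaluations, more
         than the original 4, and is therefore suppressed. */
      int t = k * k;
      for (int i = 0; i < 64; i++)
        for (int j = 0; j < 64; j++)
          if (i == 0 && j < 4)
            a[i][j] += t;
    }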
| Lars Wehmeyer and Peter Marwedel. Influence of Memory Hierarchies on Predictability for Time Constrained Embedded Software. In Design Automation and Test in Europe (DATE) Munich, Germany, March 2005 [BibTeX][PDF][Abstract]@inproceedings { wehm:05:date,
author = {Wehmeyer, Lars and Marwedel, Peter},
title = {Influence of Memory Hierarchies on Predictability for Time Constrained Embedded Software},
booktitle = {Design Automation and Test in Europe (DATE)},
year = {2005},
address = {Munich, Germany},
month = {mar},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2005-date.pdf},
confidential = {n},
abstract = {Safety-critical embedded systems having to meet real-time constraints are expected to be highly predictable in order to guarantee at design time that certain timing deadlines will always be met. This requirement usually prevents designers from utilizing caches due to their highly dynamic, thus hardly predictable behavior. The integration of scratchpad memories represents an alternative approach which allows the system to benefit from a performance gain comparable to that of caches while at the same time maintaining predictability. In this work, we compare the impact of scratchpad memories and caches on worst case execution time (WCET) analysis results. We show that caches, despite requiring complex techniques, can have a negative impact on the predicted WCET, while the estimated WCET for scratchpad memories scales with the achieved performance gain at no extra analysis cost.},
}
|
| M. Engel and B. Freisleben. Supporting Autonomic Computing Functionality via Dynamic Operating System Kernel Aspects. In Proceedings of the Fourth International Conference on Aspect Oriented Software Development, Chicago, USA, pages 51-62 2005 [BibTeX]@inproceedings { engel:05:aosd,
author = {Engel, M. and Freisleben, B.},
title = {Supporting Autonomic Computing Functionality via Dynamic Operating System Kernel Aspects},
booktitle = {Proceedings of the Fourth International Conference on Aspect Oriented Software Development, Chicago, USA},
year = {2005},
pages = {51-62},
publisher = {ACM Press},
confidential = {n},
} |
| Peter Marwedel, Manish Verma and Lars Wehmeyer. Compiler optimizations improving the processor/memory interface. In Workshop on Optimizing Compiler Assisted SoC Assembly (OCASA) September 2005 [BibTeX]@inproceedings { marw:05:ocasa,
author = {Marwedel, Peter and Verma, Manish and Wehmeyer, Lars},
title = {Compiler optimizations improving the processor/memory interface},
booktitle = {Workshop on Optimizing Compiler Assisted SoC Assembly (OCASA)},
year = {2005},
month = {sep},
confidential = {n},
} |
| Manish Verma and Peter Marwedel. Memory Optimization Techniques for Low-Power Embedded Processors. In IFIP VIVA Workshop - Fundamentals and Methods for Low-Power Information Processing Bonn, Germany, September 2005 [BibTeX][PDF][Abstract]@inproceedings { verma:05:viva,
author = {Verma, Manish and Marwedel, Peter},
title = {Memory Optimization Techniques for Low-Power Embedded Processors},
booktitle = {IFIP VIVA Workshop - Fundamentals and Methods for Low-Power Information Processing},
year = {2005},
address = {Bonn, Germany},
month = {sep},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2005-viva.pdf},
confidential = {n},
abstract = {Power consumption is an important design issue for contemporary portable embedded devices. It is known that the next generation of portable devices will feature faster processors and larger memories, both of which require high operational power. The memory subsystem has already been identified as the energy bottleneck of the entire system. Consequently, memory hierarchies are being constructed to reduce the memory subsystem's energy dissipation. Caches and scratchpad memories represent two contrasting memory architectures. Scratchpads are both more area efficient and more power efficient than caches. However, they require explicit support from the compiler for managing their contents. In this work, we present three approaches for the prudent utilization of the scratchpad memory of an ARM7 processor and of a M5 DSP based system. The first approach is based on the following observations. Firstly, a small memory requires less energy per access than a large memory. Secondly, applications in general consist of small and frequently accessed arrays and large but infrequently accessed arrays. Consequently, the approach partitions the large scratchpad into several small scratchpads. The arrays are also statically mapped such that the small arrays are mapped to small and energy efficient scratchpads. The approach leads to average energy savings of 52\% and 35\% in the data memory subsystem of the ARM7 and the M5 DSP, respectively. The second approach utilizes the scratchpad as an instruction buffer in a cache based memory hierarchy. The approach models the cache as a conflict graph and assigns instructions to the scratchpad. The objective is to minimize the energy consumption of the system while preserving the predictable behavior of the memory hierarchy. The approach results in an average energy saving of 21\% against the above approach for the ARM7 based system. The last approach optimizes the energy consumption of the system by overlaying memory objects (\textem{i.e.} code segments and data elements) onto the scratchpad. Memory objects with non-conflicting life-times are assigned to the same location on the scratchpad. This improves the scratchpad utilization; however, it requires copying memory objects on and off the scratchpad during the execution of the application. Average energy reductions of 34\% and 33\% are reported for the ARM7 and the M5 DSP based systems, respectively.},
}
|
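The first of the three approaches, statically mapping arrays to scratchpad partitions, is commonly realized through linker sections; the section names and array sizes below are invented, assuming GCC-style attributes:

    /* Small, hot arrays go to the smallest (cheapest per access) scratchpad
       partition; large, cold data stays in main memory. The .spm_* sections
       are hypothetical names a linker script would map to the partitions. */
    __attribute__((section(".spm_small")))
    static int weights[64];             /* small, frequently accessed  */

    __attribute__((section(".spm_large")))
    static short frame_buf[2048];       /* larger, still worth on-chip */

    static int samples[65536];          /* big and cold: main memory   */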
| M. Engel, M. Mezini and B. Freisleben. Creating a Component-Based Multi-Server OS From Existing Source Code Using Aspect-Oriented Programming. In Proceedings of ICCCP'05 2005 [BibTeX]@inproceedings { engel:05:icccp,
author = {Engel, M. and Mezini, M. and Freisleben, B.},
title = {Creating a Component-Based Multi-Server OS From Existing Source Code Using Aspect-Oriented Programming},
booktitle = {Proceedings of ICCCP'05},
year = {2005},
publisher = {IEEE Computer Society Press},
confidential = {n},
} |
| M. Engel and B. Freisleben. Using a Low-Level Virtual Machine to Improve Dynamic Aspect Support in Operating System Kernels. In Proceedings of the AOSD ACP4IS Workshop 2005, pages 1-6 2005 [BibTeX]@inproceedings { engel:05:acp4is,
author = {Engel, M. and Freisleben, B.},
title = {Using a Low-Level Virtual Machine to Improve Dynamic Aspect Support in Operating System Kernels},
booktitle = {Proceedings of the AOSD ACP4IS Workshop 2005},
year = {2005},
pages = {1-6},
publisher = {ACM Press},
confidential = {n},
} |
| Peter Marwedel. Towards laying common grounds for embedded system design education. In Workshop on Embedded Systems Education (WESE) 2005 [BibTeX][PDF][Abstract]@inproceedings { marwedel:05:wese,
author = {Marwedel, Peter},
title = {Towards laying common grounds for embedded system design education},
booktitle = {Workshop on Embedded Systems Education (WESE)},
year = {2005},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2005-wese.pdf},
confidential = {n},
abstract = {In this paper, we propose to introduce a common introductory course for embedded system education. The course puts the different areas of embedded system design into perspective and avoids an early over-specialization. Also, it motivates the students for attending more advanced theoretical courses. The content, the structure and the prerequisites of such a course are outlined. The course requires a basic understanding of computer hardware and software and can typically be taught in the second or third year.},
}
|
| M. Engel and B. Freisleben. Autonomic Network Services on a Microkernel. In Proceedings of EUROCON, Belgrade, Serbia, pages 636-639 2005 [BibTeX]@inproceedings { engel:06:eurocon1,
author = {Engel, M. and Freisleben, B.},
title = {Autonomic Network Services on a Microkernel},
booktitle = {Proceedings of EUROCON, Belgrade, Serbia},
year = {2005},
pages = {636-639},
publisher = {IEEE Computer Society Press},
confidential = {n},
} |
| M. Engel and B. Freisleben. Dynamic Aspect Support for Native Code. In Proceedings of EUROCON, Belgrade, Serbia, pages 732-735 2005 [BibTeX]@inproceedings { engel:06:eurocon2,
author = {Engel, M. and Freisleben, B.},
title = {Dynamic Aspect Support for Native Code},
booktitle = {Proceedings of EUROCON, Belgrade, Serbia},
year = {2005},
pages = {732-735},
publisher = {IEEE Computer Society Press},
confidential = {n},
} |
| Manish Verma, Klaus Petzold, Lars Wehmeyer, Heiko Falk and Peter Marwedel. Scratchpad Sharing Strategies for Multiprocess Embedded Systems: A First Approach. In IEEE 3rd Workshop on Embedded Systems for Real-Time Multimedia (ESTIMedia), pages 115-120 Jersey City, USA, September 2005 [BibTeX][PDF][Abstract]@inproceedings { verma:05:estimedia,
author = {Verma, Manish and Petzold, Klaus and Wehmeyer, Lars and Falk, Heiko and Marwedel, Peter},
title = {Scratchpad Sharing Strategies for Multiprocess Embedded Systems: A First Approach},
booktitle = {IEEE 3rd Workshop on Embedded Systems for Real-Time Multimedia (ESTIMedia)},
year = {2005},
pages = {115-120},
address = {Jersey City, USA},
month = {sep},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2005-estimedia.pdf},
confidential = {n},
abstract = {Portable embedded systems require diligence in managing their energy consumption. Thus, power efficient processors coupled with onchip memories (e.g. caches, scratchpads) are the base of today's portable devices. Scratchpads are more energy efficient than caches but require software support for their utilization. Portable devices' applications consist of multiple processes for different tasks. However, all the previous scratchpad allocation approaches only consider single process applications. In this paper, we propose a set of optimal strategies to reduce the energy consumption of applications by sharing the scratchpad among multiple processes. The strategies assign both code and data elements to the scratchpad and result in average total energy reductions of 9\%-20\% against a published single process approach. Furthermore, the strategies generate Pareto-optimal curves for the applications allowing design time exploration of energy/scratchpad size tradeoffs.},
}
|
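For intuition only, a greedy variant of the sharing problem is sketched below; the paper derives optimal strategies and Pareto curves, so this simplification merely makes the setting concrete:

    #include <stddef.h>

    typedef struct {
      int    pid;      /* owning process                   */
      size_t size;     /* bytes occupied on the scratchpad */
      double gain;     /* estimated energy saved per byte  */
    } mem_obj_t;

    /* Admit code and data objects of all processes in order of descending
       gain until the scratchpad is full; objs[] is assumed pre-sorted. */
    size_t share_spm(const mem_obj_t *objs, int n, size_t spm_size) {
      size_t used = 0;
      for (int i = 0; i < n; i++)
        if (used + objs[i].size <= spm_size)
          used += objs[i].size;  /* objs[i] receives a scratchpad slot */
      return used;
    }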
| Lars Wehmeyer and Peter Marwedel. Influence of Onchip Scratchpad Memories on WCET prediction. In Proceedings of the 4th International Workshop on Worst-Case Execution Time (WCET) Analysis Catania, Sicily, Italy, June 2004 [BibTeX][PDF][Abstract]@inproceedings { wehm:04:wcet,
author = {Wehmeyer, Lars and Marwedel, Peter},
title = {Influence of Onchip Scratchpad Memories on WCET prediction},
booktitle = {Proceedings of the 4th International Workshop on Worst-Case Execution Time (WCET) Analysis},
year = {2004},
address = {Catania, Sicily, Italy},
month = {jun},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2004-WCET.pdf},
confidential = {n},
abstract = {In contrast to standard PCs and many high-performance computer systems, systems that have to meet real-time requirements usually do not feature caches, since caches primarily improve the average case performance, whereas their impact on WCET is generally hard to predict. Especially in embedded systems, scratchpad memories have become popular. Since these small, fast memories can be controlled by the programmer or the compiler, their behavior is perfectly predictable. In this paper, we study for the first time the impact of scratchpad memories on worst case execution time (WCET) prediction. Our results indicate that scratchpads can significantly improve WCET at no extra analysis cost.},
}
|
| Lars Wehmeyer, Urs Helmig and Peter Marwedel. Compiler-optimized Usage of Partitioned Memories. In Proceedings of the 3rd Workshop on Memory Performance Issues (WMPI2004) Munich, Germany, June 2004 [BibTeX][PDF][Abstract]@inproceedings { wehm:04:wmpi,
author = {Wehmeyer, Lars and Helmig, Urs and Marwedel, Peter},
title = {Compiler-optimized Usage of Partitioned Memories},
booktitle = {Proceedings of the 3rd Workshop on Memory Performance Issues (WMPI2004)},
year = {2004},
address = {Munich, Germany},
month = {jun},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2004-WMPI.pdf},
confidential = {n},
abstract = {In order to meet the requirements concerning both performance and energy consumption in embedded systems, new memory architectures are being introduced. Beside the well-known use of caches in the memory hierarchy, processor cores today also include small onchip memories called scratchpad memories whose usage is not controlled by hardware, but rather by the programmer or the compiler. Techniques for utilization of these scratchpads have been known for some time. Some new processors provide more than one scratchpad, making it necessary to enhance the workflow such that this complex memory architecture can be efficiently utilized. In this work, we present an energy model and an ILP formulation to optimally assign memory objects to different partitions of scratchpad memories at compile time, achieving energy savings of up to 22\% compared to previous approaches.},
}
|
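The paper's exact ILP is not reproduced here; in our own notation, the underlying assignment problem reads:

    \min \sum_{i}\sum_{j} e_{i,j}\, x_{i,j}
    \quad\text{s.t.}\quad
    \sum_{j} x_{i,j} = 1 \;\;\forall i, \qquad
    \sum_{i} s_i\, x_{i,j} \le C_j \;\;\forall j, \qquad
    x_{i,j} \in \{0,1\}

where the binary variable x_{i,j} places memory object i in scratchpad partition j, e_{i,j} is the modeled energy of serving object i's accesses from partition j, s_i is the object's size, and C_j the capacity of partition j; letting one "partition" stand for main memory with unbounded capacity guarantees feasibility.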
| Manish Verma, Lars Wehmeyer and Peter Marwedel. Cache Aware Scratchpad Allocation. In Design Automation and Test in Europe (DATE) Paris/France, February 2004 [BibTeX][PDF][Abstract]@inproceedings { verma:04:date,
author = {Verma, Manish and Wehmeyer, Lars and Marwedel, Peter},
title = {Cache Aware Scratchpad Allocation},
booktitle = {Design Automation and Test in Europe (DATE)},
year = {2004},
address = {Paris/France},
month = {feb},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2004-date.pdf},
confidential = {n},
abstract = {In the context of portable embedded systems, reducing energy is one of the prime objectives. Most high-end embedded microprocessors include onchip instruction and data caches, along with a small energy efficient scratchpad. Previous approaches for utilizing the scratchpad did not consider caches and hence fail for this current architecture. In the presented work, we use the scratchpad for storing instructions and propose a generic Cache Aware Scratchpad Allocation (CASA) algorithm. We report an average reduction of 8-29\% in instruction memory energy consumption compared to a previously published technique for benchmarks from the Mediabench suite. The scratchpad in the presented architecture is similar to a preloaded loop cache. Comparing the energy consumption of our approach against preloaded loop caches, we report average energy savings of 20-44\%.},
} In the context of portable embedded systems, reducing energy is one of the prime objectives. Most high-end embedded microprocessors include onchip instruction and data caches, along with a small energy efficient scratchpad. Previous approaches for utilizing the scratchpad did not consider caches and hence fail for this current architecture. In the presented work, we use the scratchpad for storing instructions and propose a generic Cache Aware Scratchpad Allocation (CASA) algorithm. We report an average reduction of 8-29% in instruction memory energy consumption compared to a previously published technique for benchmarks from the Mediabench suite. The scratchpad in the presented architecture is similar to a preloaded loop cache. Comparing the energy consumption of our approach against preloaded loop caches, we report average energy savings of 20-44%.
|
| Michael Engel and Guido Germano. CITY at home: Monte Carlo option pricing distributed on personal computers. In Proc. of the 10th International Conference on Computing in Economics and Finance of the Society of Computational Economics 2004 [BibTeX]@inproceedings { engel:04:iccef,
author = {Engel, Michael and Germano, Guido},
title = {CITY at home: Monte Carlo option pricing distributed on personal computers},
booktitle = {Proc. of the 10th International Conference on Computing in Economics and Finance of the Society of Computational Economics},
year = {2004},
confidential = {n},
} |
| Heiko Falk and Manish Verma. Combined Data Partitioning and Loop Nest Splitting for Energy Consumption Minimization. In SCOPES, pages 137-151 Amsterdam/The Netherlands, September 2004 [BibTeX][PDF][Abstract]@inproceedings { falk:04:scopes,
author = {Falk, Heiko and Verma, Manish},
title = {Combined Data Partitioning and Loop Nest Splitting for Energy Consumption Minimization},
booktitle = {SCOPES},
year = {2004},
pages = {137-151},
address = {Amsterdam/The Netherlands},
month = {sep},
keywords = {sco},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2004-scopes.pdf},
confidential = {n},
abstract = {For mobile embedded systems, the energy consumption is a limiting factor because of today's battery capacities. Besides the processor, memory accesses consume a high amount of energy. The use of additional less power hungry memories like caches or scratchpads is thus common. This paper presents a combined approach for energy consumption minimization consisting of two complementary and phase-coupled optimizations, viz. data partitioning and loop nest splitting. In a first step, data partitioning partitions large arrays found in typical embedded software into smaller ones which are placed onto an on-chip scratchpad memory. Although being effective w.r.t. energy dissipation, this optimization adds overhead to the code since the correct part of a partitioned array has to be selected at runtime. Therefore, the control flow is optimized as a second step in our framework. In this phase, loop nests containing \emph{if}-statements are split using genetic algorithms leading to minimized \emph{if}-statement executions. However, loop nest splitting leads to an increase in code size and can potentially annul the program layout achieved by the first step. Consequently, the proposed approach iteratively applies these optimizations till a local optimum is found. The proposed framework of combined memory and control flow optimization leads to considerable energy savings for a representative set of typical embedded software routines. Using an accurate energy model for the ARM7 processor, energy savings between 20.3\% and 43.3\% were measured.},
} For mobile embedded systems, the energy consumption is a limiting factor because of today's battery capacities. Besides the processor, memory accesses consume a high amount of energy. The use of additional less power hungry memories like caches or scratchpads is thus common. This paper presents a combined approach for energy consumption minimization consisting of two complementary and phase-coupled optimizations, viz. data partitioning and loop nest splitting. In a first step, data partitioning partitions large arrays found in typical embedded software into smaller ones which are placed onto an on-chip scratchpad memory. Although being effective w.r.t. energy dissipation, this optimization adds overhead to the code since the correct part of a partitioned array has to be selected at runtime. Therefore, the control flow is optimized as a second step in our framework. In this phase, loop nests containing if-statements are split using genetic algorithms leading to minimized if-statement executions. However, loop nest splitting leads to an increase in code size and can potentially annul the program layout achieved by the first step. Consequently, the proposed approach iteratively applies these optimizations till a local optimum is found. The proposed framework of combined memory and control flow optimization leads to considerable energy savings for a representative set of typical embedded software routines. Using an accurate energy model for the ARM7 processor, energy savings between 20.3% and 43.3% were measured.
|
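The iterative phase coupling described in the entry above can be summarized in a few lines. This is a minimal sketch under assumed interfaces; energy_cost, partition_data and split_loop_nests are placeholders, not the paper's implementation:

    # Phase-coupled optimization loop: alternate the two transformations
    # until the energy estimate stops improving (a local optimum).
    def optimize(program, energy_cost, partition_data, split_loop_nests):
        best, best_cost = program, energy_cost(program)
        while True:
            candidate = split_loop_nests(partition_data(best))
            cost = energy_cost(candidate)
            if cost >= best_cost:      # no further improvement: local optimum
                return best
            best, best_cost = candidate, cost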
| Markus Lorenz and Peter Marwedel. Phase Coupled Code Generation for DSPs Using a Genetic Algorithm. In DATE, pages 1270-1275 June 2004 [BibTeX][PDF][Abstract]@inproceedings { lorenz:04:date,
author = {Lorenz, Markus and Marwedel, Peter},
title = {Phase Coupled Code Generation for DSPs Using a Genetic Algorithm},
booktitle = {DATE},
year = {2004},
pages = {1270-1275},
month = {jun},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2004-date-lorenz.pdf},
confidential = {n},
abstract = {The growing use of digital signal processors (DSPs) in embedded systems necessitates the use of optimizing compilers supporting special hardware features. Due to the irregular architectures present in today's DSPs there is a need for compilers which are capable of performing a phase coupling of the highly interdependent code generation subtasks and a graph based code selection. In this paper we present a code generator which performs a graph based code selection and a complete phase coupling of code selection, instruction scheduling (including compaction) and register allocation. In addition, our code generator takes into account effects of the subsequent address code generation phase. In order to solve the phase coupling problem and to handle the problem complexity, our code generator is based on a genetic algorithm. Experimental results for several benchmarks and an MP3 application for two DSPs show the effectiveness and the retargetability of our approach. Using the presented techniques, the number of execution cycles is reduced by 51\% on average for the M3-DSP and by 38\% on average for the ADSP2100 compared to standard techniques.},
} The growing use of digital signal processors (DSPs) in embedded systems necessitates the use of optimizing compilers supporting special hardware features. Due to the irregular architectures present in today's DSPs there is a need for compilers which are capable of performing a phase coupling of the highly interdependent code generation subtasks and a graph based code selection. In this paper we present a code generator which performs a graph based code selection and a complete phase coupling of code selection, instruction scheduling (including compaction) and register allocation. In addition, our code generator takes into account effects of the subsequent address code generation phase. In order to solve the phase coupling problem and to handle the problem complexity, our code generator is based on a genetic algorithm. Experimental results for several benchmarks and an MP3 application for two DSPs show the effectiveness and the retargetability of our approach. Using the presented techniques, the number of execution cycles is reduced by 51% on average for the M3-DSP and by 38% on average for the ADSP2100 compared to standard techniques.
|
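To make the genetic approach in the entry above concrete, here is a toy sketch, not the paper's code generator: a chromosome is a priority permutation over operations, and fitness is the cycle count produced by a user-supplied evaluation callback (cycles_of below is an assumption, e.g. a list scheduler plus cost model).

    import random

    def genetic_codegen(n_ops, cycles_of, pop_size=30, generations=100):
        # a chromosome is a permutation of operation indices (priorities)
        def random_chrom():
            c = list(range(n_ops))
            random.shuffle(c)
            return c

        pop = [random_chrom() for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=cycles_of)              # fewer cycles = fitter
            survivors = pop[: pop_size // 2]
            children = []
            while len(survivors) + len(children) < pop_size:
                a, b = random.sample(survivors, 2)
                cut = random.randrange(1, n_ops)
                # order crossover keeps the child a valid permutation
                child = a[:cut] + [g for g in b if g not in a[:cut]]
                if random.random() < 0.1:        # occasional swap mutation
                    i, j = random.randrange(n_ops), random.randrange(n_ops)
                    child[i], child[j] = child[j], child[i]
                children.append(child)
            pop = survivors + children
        return min(pop, key=cycles_of)

Because the chromosome only encodes priorities, code selection, scheduling and register allocation can all be evaluated together inside the fitness function, which is what makes the phases coupled.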
| Peter Marwedel, Lars Wehmeyer, Manish Verma, Stefan Steinke and Urs Helmig. Fast, predictable and low energy memory references through architecture-aware compilation. In ASPDAC, pages 4-11 January 2004 [BibTeX][PDF][Abstract]@inproceedings { marw:04:aspdac,
author = {Marwedel, Peter and Wehmeyer, Lars and Verma, Manish and Steinke, Stefan and Helmig, Urs},
title = {Fast, predictable and low energy memory references through architecture-aware compilation},
booktitle = {ASPDAC},
year = {2004},
pages = {4-11},
month = {jan},
keywords = {wcet},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2004-aspdac-spm.pdf},
confidential = {n},
abstract = {The design of future high-performance embedded systems is hampered by two problems: First, the required hardware needs more energy than is available from batteries. Second, current cache-based approaches for bridging the increasing speed gap between processors and memories cannot guarantee predictable real-time behavior. A contribution to solving both problems is made in this paper which describes a comprehensive set of algorithms that can be applied at design time in order to maximally exploit scratch pad memories (SPMs). We show that both the energy consumption as well as the computed worst case execution time (WCET) can be reduced by up to 80\% and 48\%, respectively, by establishing a strong link between the memory architecture and the compiler.},
} The design of future high-performance embedded systems is hampered by two problems: First, the required hardware needs more energy than is available from batteries. Second, current cache-based approaches for bridging the increasing speed gap between processors and memories cannot guarantee predictable real-time behavior. A contribution to solving both problems is made in this paper which describes a comprehensive set of algorithms that can be applied at design time in order to maximally exploit scratch pad memories (SPMs). We show that both the energy consumption as well as the computed worst case execution time (WCET) can be reduced by up to 80% and 48%, respectively, by establishing a strong link between the memory architecture and the compiler.
|
| Manish Verma, Lars Wehmeyer and Peter Marwedel. Dynamic Overlay of Scratchpad Memory for Energy Minimization. In International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS) Stockholm, Sweden, September 2004 [BibTeX][PDF][Abstract]@inproceedings { verma:04:codes,
author = {Verma, Manish and Wehmeyer, Lars and Marwedel, Peter},
title = {Dynamic Overlay of Scratchpad Memory for Energy Minimization},
booktitle = {International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS)},
year = {2004},
address = {Stockholm, Sweden},
month = {sep},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2004-isss.pdf},
confidential = {n},
abstract = {The memory subsystem accounts for a significant portion of the aggregate energy budget of contemporary embedded systems. Moreover, there exists a large potential for optimizing the energy consumption of the memory subsystem. Consequently, novel memories as well as novel algorithms for their efficient utilization are being designed. Scratchpads are known to perform better than caches in terms of power, performance, area and predictability. However, unlike caches they depend upon software allocation techniques for their utilization. In this paper, we present an allocation technique which analyzes the application and inserts instructions to dynamically copy both code segments and variables onto the scratchpad at runtime. We demonstrate that the problem of dynamically overlaying scratchpad is an extension of the Global Register Allocation problem. The overlay problem is solved optimally using ILP formulation techniques. Our approach improves upon the only previously known allocation technique for statically allocating both variables and code segments onto the scratchpad. Experiments report an average reduction of 34\% and 18\% in the energy consumption and the runtime of the applications, respectively. A minimal increase in code size is also reported.},
} The memory subsystem accounts for a significant portion of the aggregate energy budget of contemporary embedded systems. Moreover, there exists a large potential for optimizing the energy consumption of the memory subsystem. Consequently, novel memories as well as novel algorithms for their efficient utilization are being designed. Scratchpads are known to perform better than caches in terms of power, performance, area and predictability. However, unlike caches they depend upon software allocation techniques for their utilization. In this paper, we present an allocation technique which analyzes the application and inserts instructions to dynamically copy both code segments and variables onto the scratchpad at runtime. We demonstrate that the problem of dynamically overlaying scratchpad is an extension of the Global Register Allocation problem. The overlay problem is solved optimally using ILP formulation techniques. Our approach improves upon the only previously known allocation technique for statically allocating both variables and code segments onto the scratchpad. Experiments report an average reduction of 34% and 18% in the energy consumption and the runtime of the applications, respectively. A minimal increase in code size is also reported.
|
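A greedy stand-in for the overlay decision described in the entry above (the paper solves it optimally via an ILP; the region/object representation and the copy-cost model here are assumptions for illustration): at each program region's entry, copy in the objects whose accesses within that region outweigh the cost of copying them.

    def plan_overlay(regions, capacity):
        # regions: list of dicts mapping object -> (accesses, size_bytes)
        plan = []
        for region in regions:
            free, copies = capacity, []
            ranked = sorted(region.items(),
                            key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
            for obj, (accesses, size) in ranked:
                copy_cost = size     # assumed: one transfer per byte copied
                if size <= free and accesses > copy_cost:
                    copies.append(obj)
                    free -= size
            plan.append(copies)      # copy these in at the region's entry
        return plan

The analogy to global register allocation is visible here: objects play the role of values, regions the role of live ranges, and the scratchpad the role of the register file.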
| M. Engel, B. Freisleben, M. Smith and S. Hanemann. Wireless Ad-Hoc Network Emulation Using Microkernel-Based Virtual Linux Systems. In Proceedings of the 5th EUROSIM Congress on Modeling and Simulation, Marne la Vallee, France, pages 198-203 2004 [BibTeX]@inproceedings { engel:04:eurosim,
author = {Engel, M. and Freisleben, B. and Smith, M. and Hanemann, S.},
title = {Wireless Ad-Hoc Network Emulation Using Microkernel-Based Virtual Linux Systems},
booktitle = {Proceedings of the 5th EUROSIM Congress on Modeling and Simulation, Marne la Vallee, France},
year = {2004},
pages = {198-203},
publisher = {EUROSIM Publishers},
confidential = {n},
} |
| Markus Lorenz, Peter Marwedel, Thorsten Dräger, Gerhard Fettweis and Rainer Leupers. Compiler based Exploration of DSP Energy Savings by SIMD Operations. In ASPDAC, pages 839-842 June 2004 [BibTeX][PDF][Abstract]@inproceedings { lorenz:04:aspdac,
author = {Lorenz, Markus and Marwedel, Peter and Dr\"ager, Thorsten and Fettweis, Gerhard and Leupers, Rainer},
title = {Compiler based Exploration of DSP Energy Savings by SIMD Operations},
booktitle = {ASPDAC},
year = {2004},
pages = {839-842},
month = {jun},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2004-aspdac-lorenz.pdf},
confidential = {n},
abstract = {The growing use of digital signal processors (DSPs) in embedded systems necessitates the use of optimizing compilers supporting their special architecture features. Besides the irregular DSP architectures for reducing chip size and energy consumption, single instruction multiple data (SIMD) functionality is frequently integrated with the intention of performance improvement. In order to get an energy-efficient system consisting of processor and compiler, it is necessary to optimize hardware as well as software. It is not obvious that SIMD operations can save any energy: if n operations are executed in parallel, each of them might consume the same amount of energy as if they were executed sequentially. Up to now, no work has been done to investigate the influence of compiler-generated code containing SIMD operations w.r.t. the energy consumption. This paper deals with the exploration of the energy saving potential of SIMD operations for a DSP by using a generic compilation framework including an integrated instruction level energy cost model for our target architecture. Effects of SIMD operations on the energy consumption are shown for several benchmarks and an MP3 application.},
} The growing use of digital signal processors (DSPs) in embedded systems necessitates the use of optimizing compilers supporting their special architecture features. Besides the irregular DSP architectures for reducing chip size and energy consumption, single instruction multiple data (SIMD) functionality is frequently integrated with the intention of performance improvement. In order to get an energy-efficient system consisting of processor and compiler, it is necessary to optimize hardware as well as software. It is not obvious that SIMD operations can save any energy: if n operations are executed in parallel, each of them might consume the same amount of energy as if they were executed sequentially. Up to now, no work has been done to investigate the influence of compiler-generated code containing SIMD operations w.r.t. the energy consumption. This paper deals with the exploration of the energy saving potential of SIMD operations for a DSP by using a generic compilation framework including an integrated instruction level energy cost model for our target architecture. Effects of SIMD operations on the energy consumption are shown for several benchmarks and an MP3 application.
|
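The break-even question raised in the entry above can be phrased as a simple inequality, in notation assumed here for illustration: combining n operations into one SIMD instruction saves energy only if

    E_{\mathrm{SIMD}}(n) + E_{\mathrm{pack/unpack}} < n \cdot E_{\mathrm{scalar}}

where E_{\mathrm{scalar}} includes the per-instruction fetch and decode energy that the SIMD version amortizes over n data elements, and E_{\mathrm{pack/unpack}} accounts for any data reorganization the SIMD form requires.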
| Peter Marwedel and Birgit Sirocic. Bridges to computer architectures education. In Proceedings of the Workshop of Computer Architecture Education (WCAE) Munich, Germany, June 2004 [BibTeX][PDF][Abstract]@inproceedings { marwedel:04:wcae,
author = {Marwedel, Peter and Sirocic, Birgit},
title = {Bridges to computer architectures education},
booktitle = {Proceedings of the Workshop of Computer Architecture Education (WCAE)},
year = {2004},
address = {Munich, Germany},
month = {jun},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2004-wcae.pdf},
confidential = {n},
abstract = {Bridging the gap between the student's current knowledge and the technical details of computer systems is frequently required for the education of undergraduate students or students with a lack of previous technical knowledge. In this paper we describe how to build such bridges by combining introductory Flash animations with the educational units of the RaVi system. In a first step, the Flash-based animations make the students familiar with the underlying principles. After that, the student can move on to the more technical context by using the educational units of RaVi. We have developed two bridges, one for explaining the principles of cache coherency protocols and the other for showing the concept of processor pipelines.},
} Bridging the gap between the student's current knowledge and the technical details of computer systems is frequently required for the education of undergraduate students or students with a lack of previous technical knowledge. In this paper we describe how to build such bridges by combining introductory Flash animations with the educational units of the RaVi system. In a first step, the Flash-based animations make the students familiar with the underlying principles. After that, the student can move on to the more technical context by using the educational units of RaVi. We have developed two bridges, one for explaining the principles of cache coherency protocols and the other for showing the concept of processor pipelines.
|
| Xiaoning Nie and Jens Wagner. High Performance Network Protocol Processor - Architecture and Tools. In Euro DesignCon München, June 2004 [BibTeX][PDF]@inproceedings { nie:04:eurodesign,
author = {Nie, Xiaoning and Wagner, Jens},
title = {High Performance Network Protocol Processor - Architecture and Tools},
booktitle = {Euro DesignCon},
year = {2004},
address = {M\"unchen},
month = {jun},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2004-DC.pdf},
confidential = {n},
} |
| Heiko Falk and Peter Marwedel. Control Flow driven Splitting of Loop Nests at the Source Code Level. In Design, Automation and Test in Europe (DATE) 2003, pages 410-415 Munich/Germany, March 2003 [BibTeX][PDF][Abstract]@inproceedings { falk:03:date,
author = {Falk, Heiko and Marwedel, Peter},
title = {Control Flow driven Splitting of Loop Nests at the Source Code Level},
booktitle = {Design, Automation and Test in Europe (DATE) 2003},
year = {2003},
pages = {410-415},
address = {Munich/Germany},
month = {mar},
keywords = {sco},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2003-date.pdf},
confidential = {n},
abstract = {This paper presents a novel source code transformation for control flow optimization called loop nest splitting which minimizes the number of executed if-statements in loop nests of embedded multimedia applications. The goal of the optimization is to reduce runtimes and energy consumption. The analysis techniques are based on precise mathematical models combined with genetic algorithms. Due to the inherent portability of source code transformations, a very detailed benchmarking using 10 different processors can be performed. The application of our implemented algorithms to three real-life multimedia benchmarks leads to average speed-ups of 23.6\% - 62.1\% and energy savings of 19.6\% - 57.7\%. Furthermore, our optimization also leads to advantageous pipeline and cache performance.},
} This paper presents a novel source code transformation for control flow optimization called loop nest splitting which minimizes the number of executed if-statements in loop nests of embedded multimedia applications. The goal of the optimization is to reduce runtimes and energy consumption. The analysis techniques are based on precise mathematical models combined with genetic algorithms. Due to the inherent portability of source code transformations, a very detailed benchmarking using 10 different processors can be performed. The application of our implemented algorithms to three real-life multimedia benchmarks leads to average speed-ups of 23.6% - 62.1% and energy savings of 19.6% - 57.7%. Furthermore, our optimization also leads to advantageous pipeline and cache performance.
|
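A before/after illustration of loop nest splitting in the spirit of the entry above; the bounds and the splitting condition are invented for this sketch, whereas the paper derives the split point analytically with genetic algorithms:

    def before(width, height, border, core):
        for y in range(height):
            for x in range(width):
                if x < 2 or y < 2:        # condition tested width * height times
                    border(x, y)
                else:
                    core(x, y)

    def after(width, height, border, core):
        # split the iteration space so the condition disappears entirely
        for y in range(2):
            for x in range(width):
                border(x, y)
        for y in range(2, height):
            for x in range(2):
                border(x, y)
            for x in range(2, width):     # hot core region: no if-statement
                core(x, y)

Both functions visit every (x, y) pair with the same border/core classification, but the transformed version executes no if-statement in the hot core region, which is where the cited runtime and energy savings come from.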
| Manish Verma, Lars Wehmeyer and Peter Marwedel. Efficient Scratchpad Allocation Algorithms for Energy Constrained Embedded Systems. In PACS 2003 San Diego, CA 2003, June 2003, Also in: Lecture Notes in Computer Science (LNCS 3164) Vol. 3164/2004 [BibTeX][PDF][Abstract]@inproceedings { verma:03:pacs,
author = {Verma, Manish and Wehmeyer, Lars and Marwedel, Peter},
title = {Efficient Scratchpad Allocation Algorithms for Energy Constrained Embedded Systems},
booktitle = {PACS 2003},
year = {2003},
address = {San Diego, CA 2003},
month = {jun},
note = {Also in: Lecture Notes in Computer Science (LNCS 3164) Vol. 3164/2004},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2003-pacs.pdf},
confidential = {n},
abstract = {In the context of portable embedded systems, reducing energy is one of the prime objectives. Memories are responsible for a significant percentage of a system's aggregate energy consumption. Consequently, novel memories as well as novel memory hierarchies are being designed to reduce the energy consumption. Caches and scratchpads are two contrasting variants of memory architectures. The former relies completely on hardware logic while the latter requires software for its utilization. Most high-end embedded microprocessors today include onchip instruction and data caches along with a scratchpad. Previous software approaches for utilizing scratchpad did not consider caches and hence fail for the prevalent high-end system architectures. In this work, we use the scratchpad for storing instructions. We solve the allocation problem using a greedy heuristic and also solve it optimally using an ILP formulation. We report an average reduction of 20.7\% in instruction memory energy consumption compared to a previously published technique. Larger reductions are also reported when the problem is solved optimally. The scratchpad in the presented architecture is similar to a preloaded loop cache. Comparing the energy consumption of our approach against that of preloaded loop caches, we report average energy savings of 28.9\% using the heuristic.},
} In the context of portable embedded systems, reducing energy is one of the prime objectives. Memories are responsible for a significant percentage of a system's aggregate energy consumption. Consequently, novel memories as well as novel memory hierarchies are being designed to reduce the energy consumption. Caches and scratchpads are two contrasting variants of memory architectures. The former relies completely on hardware logic while the latter requires software for its utilization. Most high-end embedded microprocessors today include onchip instruction and data caches along with a scratchpad. Previous software approaches for utilizing scratchpad did not consider caches and hence fail for the prevalent high-end system architectures. In this work, we use the scratchpad for storing instructions. We solve the allocation problem using a greedy heuristic and also solve it optimally using an ILP formulation. We report an average reduction of 20.7% in instruction memory energy consumption compared to a previously published technique. Larger reductions are also reported when the problem is solved optimally. The scratchpad in the presented architecture is similar to a preloaded loop cache. Comparing the energy consumption of our approach against that of preloaded loop caches, we report average energy savings of 28.9% using the heuristic.
|
| Heiko Falk, Cédric Ghez, Miguel Miranda and Rainer Leupers. High-level Control Flow Transformations for Performance Improvement of Address-Dominated Multimedia Applications. In 11th Workshop on Synthesis And System Integration of Mixed Information technologies (SASIMI), pages 338-344 Hiroshima/Japan, April 2003 [BibTeX][PDF][Abstract]@inproceedings { falk:03:sasimi,
author = {Falk, Heiko and Ghez, C\'edric and Miranda, Miguel and Leupers, Rainer},
title = {High-level Control Flow Transformations for Performance Improvement of Address-Dominated Multimedia Applications},
booktitle = {11th Workshop on Synthesis And System Integration of Mixed Information technologies (SASIMI)},
year = {2003},
pages = {338-344},
address = {Hiroshima/Japan},
month = {apr},
keywords = {sco},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2003-sasimi.pdf},
confidential = {n},
abstract = {This paper describes a set of novel high-level control flow transformations for performance improvement of typical address-dominated multimedia applications. We show that these transformations applied at the source code level can have a very large impact on execution time at the cost of limited overhead in code size for a broad range of instruction set processor families (i.e. CISC, RISC, DSP, VLIW, ...). For a profound evaluation, all transformations are applied to the C-codes of two real-life applications selected from the video and image processing domains. A detailed analysis of the effect of the transformations is done by compiling and executing the transformed programs on seven different programmable processors. The measured runtimes indicate quite significant improvements in all processor families when comparing the performance of the transformed codes to their initial version even when these are compiled using their native optimizing compilers with their most aggressive optimization features enabled. The average gains in execution time range from 40.2\% to 87.7\% depending on the driver, with an average overhead in code size between 21.1\% and 100.9\%.},
} This paper describes a set of novel high-level control flow transformations for performance improvement of typical address-dominated multimedia applications. We show that these transformations applied at the source code level can have a very large impact on execution time at the cost of limited overhead in code size for a broad range of instruction set processor families (i.e. CISC, RISC, DSP, VLIW, ...). For a profound evaluation, all transformations are applied to the C-codes of two real-life applications selected from the video and image processing domains. A detailed analysis of the effect of the transformations is done by compiling and executing the transformed programs on seven different programmable processors. The measured runtimes indicate quite significant improvements in all processor families when comparing the performance of the transformed codes to their initial version even when these are compiled using their native optimizing compilers with their most aggressive optimization features enabled. The average gains in execution time range from 40.2% to 87.7% depending on the driver, with an average overhead in code size between 21.1% and 100.9%.
|
| Peter Marwedel and Birgit Sirocic. Overcoming the Limitations of Traditional Media For Teaching Modern Processor Design. In International Conference on Microelectronic Systems Education (MSE 2003) Anaheim, June 2003 [BibTeX][PDF][Abstract]@inproceedings { marwedel:03:mse,
author = {Marwedel, Peter and Sirocic, Birgit},
title = {Overcoming the Limitations of Traditional Media For Teaching Modern Processor Design},
booktitle = {International Conference on Microelectronic Systems Education (MSE 2003)},
year = {2003},
address = {Anaheim},
month = {jun},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2003-mse.pdf},
confidential = {n},
abstract = {Understanding modern processors requires a good knowledge of the dynamic behavior of processors. Traditional media like books can be used for describing the dynamic behavior of processors. Visualization of this behavior, however, is impossible, due to the static nature of books. In this paper, we describe a Java-based tool for visualizing the dynamic behavior of hardware structures, called RaVi (abbreviation for the German equivalent of "computer architecture visualization"). Available RaVi components include models of a microcoded MIPS architecture, of a MIPS pipeline, of scoreboarding, Tomasulo's algorithm and the MESI multiprocessor cache protocol. These models were found to be more useful than general simulators in classroom use. The Java-based design also enables Internet-based distance learning. Tools are available at http://ls12-www.cs.tu-dortmund.de/ravi.},
} Understanding modern processors requires a good knowledge of the dynamic behavior of processors. Traditional media like books can be used for describing the dynamic behavior of processors. Visualization of this behavior, however, is impossible, due to the static nature of books. In this paper, we describe a Java-based tool for visualizing the dynamic behavior of hardware structures, called RaVi (abbreviation for the German equivalent of "computer architecture visualization"). Available RaVi components include models of a microcoded MIPS architecture, of a MIPS pipeline, of scoreboarding, Tomasulo's algorithm and the MESI multiprocessor cache protocol. These models were found to be more useful than general simulators in classroom use. The Java-based design also enables Internet-based distance learning. Tools are available at http://ls12-www.cs.tu-dortmund.de/ravi.
|
| Manish Verma, Stefan Steinke and Peter Marwedel. Data Partitioning for Maximal Scratchpad Usage. In ASPDAC 2003 KitaKyushu/Japan, January 2003 [BibTeX][PDF][Abstract]@inproceedings { verma:03:aspdac,
author = {Verma, Manish and Steinke, Stefan and Marwedel, Peter},
title = {Data Partitioning for Maximal Scratchpad Usage},
booktitle = {ASPDAC 2003},
year = {2003},
address = {KitaKyushu/Japan},
month = {jan},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2003-aspdac.pdf},
confidential = {n},
abstract = {The energy consumption for Mobile Embedded Systems is a limiting factor because of today's battery capacities. The memory subsystem consumes a large chunk of the energy, necessitating its efficient utilization. Energy efficient scratchpads are thus becoming common, though unlike caches they require to be explicitly utilized. In this paper, an algorithm integrated into a compiler is presented which analyzes the application, partitions an array variable whenever it is beneficial, appropriately modifies the application and selects the best set of variables and program parts to be placed onto the scratchpad. Results show an energy improvement between 5.7\% and 17.6\% for a variety of applications against a previously known algorithm.},
} The energy consumption for Mobile Embedded Systems is a limiting factor because of today's battery capacities. The memory subsystem consumes a large chunk of the energy, necessitating its efficient utilization. Energy efficient scratchpads are thus becoming common, though unlike caches they require to be explicitly utilized. In this paper, an algorithm integrated into a compiler is presented which analyzes the application, partitions an array variable whenever it is beneficial, appropriately modifies the application and selects the best set of variables and program parts to be placed onto the scratchpad. Results show an energy improvement between 5.7% and 17.6% for a variety of applications against a previously known algorithm.
|
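The source transformation behind the entry above can be pictured as follows; the split point, array sizes and names are hypothetical, since the paper's compiler chooses the partitioning and placement automatically:

    SPLIT = 256                  # assumed split point chosen by the compiler

    a_sp = [0] * SPLIT           # hot leading partition, placed on scratchpad
    a_mm = [0] * (4096 - SPLIT)  # remaining partition stays in main memory

    def read_a(i):
        # selection code inserted by the compiler; this is the runtime
        # overhead that partitioning trades against cheaper hot accesses
        return a_sp[i] if i < SPLIT else a_mm[i - SPLIT]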
| Rainer Leupers, Oliver Wahlen, Manuel Hohenauer, Tim Kogel and Peter Marwedel. An Executable Intermediate Representation for Retargetable Compilation and High-Level Code Optimization. In International Conference on Information Communication Technologies in Education (SAMOS 2003) June 2003 [BibTeX][PDF][Abstract]@inproceedings { leupers:03:samos,
author = {Leupers, Rainer and Wahlen, Oliver and Hohenauer, Manuel and Kogel, Tim and Marwedel, Peter},
title = {An Executable Intermediate Representation for Retargetable Compilation and High-Level Code Optimization},
booktitle = {International Conference on Information Communication Technologies in Education (SAMOS 2003)},
year = {2003},
month = {jun},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2003-samosIII.pdf},
confidential = {n},
abstract = {Due to fast time-to-market and IP reuse requirements, an increasing amount of the functionality of embedded HW/SW systems is implemented in software. As a consequence, software programming languages like C play an important role in system specification, design, and validation. Besides many other advantages, the C language offers executable specifications, with clear semantics and high simulation speed. However, virtually any tool operating on C specifications has to convert C sources into some intermediate representation (IR), during which the executability is normally lost. In order to overcome this problem, this paper describes a novel IR format, called IR-C, for the use in C based design tools, which combines the simplicity of three address code with the executability of C. Besides the IR-C format and its generation from ANSI C, we also describe its applications in the areas of validation, retargetable compilation, and source-level code optimization.},
} Due to fast time-to-market and IP reuse requirements, an increasing amount of the functionality of embedded HW/SW systems is implemented in software. As a consequence, software programming languages like C play an important role in system specification, design, and validation. Besides many other advantages, the C language offers executable specifications, with clear semantics and high simulation speed. However, virtually any tool operating on C specifications has to convert C sources into some intermediate representation (IR), during which the executability is normally lost. In order to overcome this problem, this paper describes a novel IR format, called IR-C, for the use in C based design tools, which combines the simplicity of three address code with the executability of C. Besides the IR-C format and its generation from ANSI C, we also describe its applications in the areas of validation, retargetable compilation, and source-level code optimization.
|
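A toy lowering pass showing the flavor of the IR-C idea from the entry above, i.e. three-address statements that remain valid, compilable C; this sketch is an assumption for illustration, not the paper's ANSI-C front end:

    import itertools

    _tmp = itertools.count()

    def lower(expr, out):
        """Lower nested tuples like ('+', 'a', ('*', 'b', 'c')) into
        three-address C statements appended to out; returns the name
        holding the result."""
        if isinstance(expr, str):
            return expr
        op, lhs, rhs = expr
        l, r = lower(lhs, out), lower(rhs, out)
        t = f"t{next(_tmp)}"
        out.append(f"int {t} = {l} {op} {r};")   # still compilable C
        return t

    code = []
    lower(('+', 'a', ('*', 'b', 'c')), code)
    # code == ['int t0 = b * c;', 'int t1 = a + t0;']

Because every lowered statement is itself C, the IR can be fed back to a C compiler and executed, which is the property the abstract highlights.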
| Peter Marwedel and Birgit Sirocic. Multimedia components for the visualization of dynamic behavior in computer architectures. In Proceedings of the Workshop of Computer Architecture Education (WCAE'03) San Diego, CA, June 2003 [BibTeX][PDF][Abstract]@inproceedings { marwedel:03:wcae,
author = {Marwedel, Peter and Sirocic, Birgit},
title = {Multimedia components for the visualization of dynamic behavior in computer architectures},
booktitle = {Proceedings of the Workshop of Computer Architecture Education (WCAE'03)},
year = {2003},
address = {San Diego, CA},
month = {jun},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2003-wcae.pdf},
confidential = {n},
abstract = {Understanding modern processors requires a good knowledge of the dynamic behavior of processors. Traditional media like books use text for describing the dynamic behavior of processors. Visualization of this behavior, however, is impossible, due to the static nature of books. In this paper, we describe multimedia components for visualizing the dynamic behavior of hardware structures, called RaVi (abbreviation for the German equivalent of "computer architecture visualization"). Available RaVi components include models of a microcoded MIPS architecture, of a MIPS pipeline, of scoreboarding, Tomasulo's algorithm and the MESI multiprocessor cache protocol.},
} Understanding modern processors requires a good knowledge of the dynamic behavior of processors. Traditional media like books use text for describing the dynamic behavior of processors. Visualization of this behavior, however, is impossible, due to the static nature of books. In this paper, we describe multimedia components for visualizing the dynamic behavior of hardware structures, called RaVi (abbreviation for the German equivalent of "computer architecture visualization"). Available RaVi components include models of a microcoded MIPS architecture, of a MIPS pipeline, of scoreboarding, Tomasulo's algorithm and the MESI multiprocessor cache protocol.
|
| Jens Wagner and Rainer Leupers. A Fast Simulator and Debugger for a Network Processor. In Embedded Intelligence Nuernberg, Germany, February 2002 [BibTeX][PDF][Abstract]@inproceedings { wagner:02:es,
author = {Wagner, Jens and Leupers, Rainer},
title = {A Fast Simulator and Debugger for a Network Processor},
booktitle = {Embedded Intelligence},
year = {2002},
address = {Nuernberg, Germany},
month = {feb},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2002-embedded.pdf},
confidential = {n},
abstract = {In this paper, we describe the design of an efficient simulator/debugger tool environment for an industrial network processor. The simulator is based on the compiled simulation principle. The debugger, which builds on the compiled simulator, has been linked to the popular software debugger DDD from TU Braunschweig.},
} In this paper, we describe the design of an efficient simulator/debugger tool environment for an industrial network processor. The simulator is based on the compiled simulation principle. The debugger, which builds on the compiled simulator, has been linked to the popular software debugger DDD from TU Braunschweig.
|
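The compiled-simulation principle mentioned in the entry above, reduced to a few lines; the two-instruction ISA and its encoding are invented for this sketch:

    def compile_program(instructions):
        # decode each instruction exactly once into a host closure
        def gen(op, d, s):
            if op == "mov":
                return lambda regs: regs.__setitem__(d, regs[s])
            if op == "add":
                return lambda regs: regs.__setitem__(d, regs[d] + regs[s])
            raise ValueError(op)
        return [gen(*insn) for insn in instructions]

    def run(compiled, regs):
        for step in compiled:        # no fetch/decode inside this loop
            step(regs)
        return regs

    regs = run(compile_program([("mov", 0, 1), ("add", 0, 2)]), [0, 5, 7])
    # regs[0] == 12

Moving the decode work out of the execution loop is what gives compiled simulators their speed advantage over interpretive ones.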
| M. Engel and B. Freisleben. A Lightweight Communication Infrastructure for Spontaneously Networked Devices With Limited Resources. In Proceedings of the 2002 International Conference on Objects, Components, Architectures, Services, and Applications for a Networked World, Erfurt, Germany, pages 22-40 2002 [BibTeX]@inproceedings { engel:02:lncs,
author = {Engel, M. and Freisleben, B.},
title = {A Lightweight Communication Infrastructure for Spontaneously Networked Devices With Limited Resources},
booktitle = {Proceedings of the 2002 International Conference on Objects, Components, Architectures, Services, and Applications for a Networked World, Erfurt, Germany},
year = {2002},
pages = {22-40},
publisher = {LNCS 2591, Springer-Verlag},
confidential = {n},
} |
| Jens Wagner and Rainer Leupers. Advanced Code Generation for Network Processors with Bit Packet Addressing. In Workshop on Network Processors (NP1) Cambridge, Massachusetts, February 2002 [BibTeX][PDF]@inproceedings { wagner:02:npi,
author = {Wagner, Jens and Leupers, Rainer},
title = {Advanced Code Generation for Network Processors with Bit Packet Addressing},
booktitle = {Workshop on Network Processors (NP1)},
year = {2002},
address = {Cambridge, Massachusetts},
month = {feb},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2002-np1.pdf},
confidential = {n},
} |
| Thorsten Dräger and Gerhard Fettweis. Energy Savings with Appropriate Interconnection Networks in Parallel DSP. In VIVA Workshop Chemnitz, March 2002 [BibTeX][PDF][Abstract]@inproceedings { draeger:02:viva,
author = {Dr\"ager, Thorsten and Fettweis, Gerhard},
title = {Energy Savings with Appropriate Interconnection Networks in Parallel DSP},
booktitle = {VIVA Workshop},
year = {2002},
address = {Chemnitz},
month = {mar},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2002-viva.pdf},
confidential = {n},
abstract = {This paper presents an instruction level power model for a very long instruction word (VLIW) single instruction multiple data (SIMD) digital signal processor (DSP). We show that power consuming memory accesses can be reduced in such SIMD processor architectures with an appropriate network connecting the parallel register files and/or data paths. Several network topologies are analyzed for a variety of digital signal processor algorithms concerning energy issues.},
} This paper presents an instruction level power model for a very long instruction word (VLIW) single instruction multiple data (SIMD) digital signal processor (DSP). We show that power consuming memory accesses can be reduced in such SIMD processor architectures with an appropriate network connecting the parallel register files and/or data paths. Several network topologies are analyzed for a variety of digital signal processor algorithms concerning energy issues.
|
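Instruction-level power models of the kind the entry above builds commonly take the following form (our notation, not necessarily the paper's):

    E = \sum_{i} B_i\, N_i \;+\; \sum_{i,j} O_{i,j}\, N_{i,j}

where B_i is the base energy of instruction i, N_i its execution count, and O_{i,j} the circuit-state overhead incurred when instruction j executes directly after i (N_{i,j} counts such adjacent pairs).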
| Stefan Steinke, Lars Wehmeyer, Bo-Sik Lee and Peter Marwedel. Assigning Program and Data Objects to Scratchpad for Energy Reduction. In DATE 2002 Paris/France, March 2002 [BibTeX][PDF][Abstract]@inproceedings { steinke:02:date,
author = {Steinke, Stefan and Wehmeyer, Lars and Lee, Bo-Sik and Marwedel, Peter},
title = {Assigning Program and Data Objects to Scratchpad for Energy Reduction},
booktitle = {DATE 2002},
year = {2002},
address = {Paris/France},
month = {mar},
keywords = {ecc},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2002-date.pdf},
confidential = {n},
abstract = {The number of embedded systems is increasing and a remarkable percentage is designed as mobile applications. For the latter, the energy consumption is a limiting factor because of today's battery capacities. Besides the processor, memory accesses consume a high amount of energy. The use of additional less power hungry memories like caches or scratchpads is thus common. Caches incorporate the hardware control logic for moving data in and out automatically. On the other hand, this logic requires chip area and energy. A scratchpad memory is much more energy efficient, but there is a need for software control of its content. In this paper, an algorithm integrated into a compiler is presented which analyses the application and selects program and data parts which are placed into the scratchpad. Comparisons against a cache solution show remarkable advantages between 12\% and 43\% in energy consumption for designs of the same memory size.},
} The number of embedded systems is increasing and a remarkable percentage is designed as mobile applications. For the latter, the energy consumption is a limiting factor because of today's battery capacities. Besides the processor, memory accesses consume a high amount of energy. The use of additional less power hungry memories like caches or scratchpads is thus common. Caches incorporate the hardware control logic for moving data in and out automatically. On the other hand, this logic requires chip area and energy. A scratchpad memory is much more energy efficient, but there is a need for software control of its content. In this paper, an algorithm integrated into a compiler is presented which analyses the application and selects program and data parts which are placed into the scratchpad. Comparisons against a cache solution show remarkable advantages between 12% and 43% in energy consumption for designs of the same memory size.
|
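A greedy stand-in for the selection step in the entry above (the paper's algorithm is integrated into the compiler and, in related work, solved via ILP; the sizes and energy numbers below are invented): rank each program or data object by energy saved per byte and fill the scratchpad in that order.

    def select_for_scratchpad(objects, capacity):
        # objects: list of (name, size_bytes, energy_saving)
        chosen, free = [], capacity
        for name, size, saving in sorted(
                objects, key=lambda o: o[2] / o[1], reverse=True):
            if size <= free:
                chosen.append(name)
                free -= size
        return chosen

    picks = select_for_scratchpad(
        [("main_loop", 512, 900.0), ("lut", 2048, 1200.0), ("isr", 256, 500.0)],
        capacity=2048)
    # picks == ['isr', 'main_loop']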
| Stefan Steinke, Nils Grunwald, Lars Wehmeyer, Rajeshwari Banakar, M. Balakrishnan and Peter Marwedel. Reducing Energy Consumption by Dynamic Copying of Instructions onto Onchip Memory. In ISSS 2002 Kyoto/Japan, October 2002 [BibTeX][PDF][Abstract]@inproceedings { steinke:02:isss,
author = {Steinke, Stefan and Grunwald, Nils and Wehmeyer, Lars and Banakar, Rajeshwari and Balakrishnan, M. and Marwedel, Peter},
title = {Reducing Energy Consumption by Dynamic Copying of Instructions onto Onchip Memory},
booktitle = {ISSS 2002},
year = {2002},
address = {Kyoto/Japan},
month = {oct},
keywords = {ecc},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2002-isss.pdf},
confidential = {n},
abstract = {The number of mobile embedded systems is increasing and all of them are limited in their uptime by their battery capacity. Several hardware changes have been introduced during the last years, but the steadily growing functionality still requires further energy reductions, e.g. by software optimizations. A significant amount of energy can be saved in the memory hierarchy where most of the energy is consumed. In this paper a new software technique is presented which supports the use of an onchip scratchpad memory by dynamically copying program parts into it. The set of selected program parts are determined with an optimal algorithm using integer linear programming. Experimental results show a reduction of the energy consumption by nearly 30\%, a performance increase by 25\% against a common cache system and energy improvements against a static approach of up to 38\%.},
} The number of mobile embedded systems is increasing and all of them are limited in their uptime by their battery capacity. Several hardware changes have been introduced during the last years, but the steadily growing functionality still requires further energy reductions, e.g. by software optimizations. A significant amount of energy can be saved in the memory hierarchy where most of the energy is consumed. In this paper a new software technique is presented which supports the use of an onchip scratchpad memory by dynamically copying program parts into it. The set of selected program parts are determined with an optimal algorithm using integer linear programming. Experimental results show a reduction of the energy consumption by nearly 30%, a performance increase by 25% against a common cache system and energy improvements against a static approach of up to 38%.
|
| Rajeshwari Banakar, Stefan Steinke, Bo-Sik Lee, M. Balakrishnan and Peter Marwedel. Scratchpad Memory: A Design Alternative for Cache On-chip Memory in Embedded Systems. In CODES Estes Park (Colorado), May 2002 [BibTeX][PDF][Abstract]@inproceedings { banakar:02:codes,
author = {Banakar, Rajeshwari and Steinke, Stefan and Lee, Bo-Sik and Balakrishnan, M. and Marwedel, Peter},
title = {Scratchpad Memory: A Design Alternative for Cache On-chip Memory in Embedded Systems},
booktitle = {CODES},
year = {2002},
address = {Estes Park (Colorado)},
month = {may},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2002-codes.pdf},
confidential = {n},
abstract = {In this paper we address the problem of on-chip memory selection for computationally intensive applications, by proposing scratch pad memory as an alternative to cache. Area and energy for different scratch pad and cache sizes are computed using the CACTI tool while performance was evaluated using the trace results of the simulator. The target processor chosen for evaluation was AT91M40400. The results clearly establish scratchpad memory as a low power alternative in most situations with an average energy reduction of 40\%. Further the average area-time reduction for the scratchpad memory was 46\% of the cache memory.},
} In this paper we address the problem of on-chip memory selection for computationally intensive applications, by proposing scratch pad memory as an alternative to cache. Area and energy for different scratch pad and cache sizes are computed using the CACTI tool while performance was evaluated using the trace results of the simulator. The target processor chosen for evaluation was AT91M40400. The results clearly establish scratchpad memory as a low power alternative in most situations with an average energy reduction of 40%. Further the average area-time reduction for the scratchpad memory was 46% of the cache memory.
|
| Markus Lorenz, Lars Wehmeyer, Thorsten Dräger and Rainer Leupers. Energy aware Compilation for DSPs with SIMD Instructions. In LCTES/SCOPES 2002 Berlin, Germany, June 2002 [BibTeX][PDF][Abstract]@inproceedings { lorenz:02:scopes,
author = {Lorenz, Markus and Wehmeyer, Lars and Dr\"ager, Thorsten and Leupers, Rainer},
title = {Energy aware Compilation for DSPs with SIMD Instructions},
booktitle = {LCTES/SCOPES 2002},
year = {2002},
address = {Berlin, Germany},
month = {jun},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2002-lctes-scopes.pdf},
confidential = {n},
abstract = {The growing use of digital signal processors (DSPs) in embedded systems necessitates the use of optimizing compilers supporting special hardware features. In this paper we present compiler optimizations with the aim of minimizing energy consumption of embedded applications: This comprises loop optimizations for exploitation of SIMD instructions and zero overhead hardware loops in order to increase performance and decrease the energy consumption. In addition, we use a phase coupled code generator based on a genetic algorithm (GCG) which is capable of performing energy aware instruction selection and scheduling. Energy aware compilation is done with respect to an instruction level energy cost model which is integrated into our code generator and simulator. Experimental results for several benchmarks show the effectiveness of our approach.},
} The growing use of digital signal processors (DSPs) in embedded systems necessitates the use of optimizing compilers supporting special hardware features. In this paper we present compiler optimizations with the aim of minimizing energy consumption of embedded applications: This comprises loop optimizations for exploitation of SIMD instructions and zero overhead hardware loops in order to increase performance and decrease the energy consumption. In addition, we use a phase coupled code generator based on a genetic algorithm (GCG) which is capable of performing energy aware instruction selection and scheduling. Energy aware compilation is done with respect to an instruction level energy cost model which is integrated into our code generator and simulator. Experimental results for several benchmarks show the effectiveness of our approach.
|
| Peter Marwedel and Luca Benini. Low-power/low-energy embedded software. In Tutorial at Design, Automation and Test in Europe (DATE) Paris, 2002, March 2002 [BibTeX][Abstract]@inproceedings { marwedel:02:date,
author = {Marwedel, Peter and Benini, Luca},
title = {Low-power/low-energy embedded software},
booktitle = {Tutorial at Design, Automation and Test in Europe (DATE)},
year = {2002},
address = {Paris, 2002},
month = {mar},
confidential = {n},
abstract = {The size of the embedded system market is expected to grow significantly in the next years. For many of these systems, power and energy management are becoming primary issues due to rapidly increasing computational requirement and the slow rate of improvements in battery technology. Hence, researchers and designers have to be aware of approaches for reducing the power and energy consumption of embedded systems. In addition to standard techniques used at the device and the circuit level, new techniques are required which exploit the fact that a major part of the functionality of embedded systems is implemented in software. There are various levels at which the energy consumption of software can be considered, including the high-level algorithm descriptions, compilers, and operating systems. These levels will be covered in the tutorial. Summary: With a focus on programmable embedded systems, this tutorial will: \begin{itemize} \item survey the interaction of architecture, operating systems, compilers and memories from a power/energy focus, \item present specific contributors of each part to power and energy, and \item outline software techniques for minimization of power/energy. \end{itemize} Outline: In the first section, an introduction to the topic will be provided. Due to the large influence of the memory architecture on the total energy consumption, different memory architectures will be presented next. We will show how partitioned memories can help reducing the energy consumption. In the next section, we will describe how partitioned memory architectures and other features of embedded systems can be exploited in compilers. This includes the exploitation of scratch-pad memories and a comparison between scratch-pads and caches. In addition, this includes an analysis of the size of register files. Furthermore, we will explain techniques for reducing the memory traffic by global optimizations designed for multimedia applications. This section also comprises a description of applicable standard compiler optimizations and their potential for energy reductions as well as a brief introduction to compression techniques. The final section describes system software and real-time operating system (RTOS) issues. This will include hardware for RTOS-based power management, software support for power management, power-aware process scheduling and power-aware device management. Exploitation of application-specific information and power management of distributed systems will also be covered. Audience: Researchers starting to work on embedded systems or on low power design techniques; designers interested in getting to know available methodologies for low power software design.},
} The size of the embedded system market is expected to grow significantly in the next years. For many of these systems, power and energy management are becoming primary issues due to rapidly increasing computational requirement and the slow rate of improvements in battery technology. Hence, researchers and designers have to be aware of approaches for reducing the power and energy consumption of embedded systems. In addition to standard techniques used at the device and the circuit level, new techniques are required which exploit the fact that a major part of the functionality of embedded systems is implemented in software. There are various levels at which the energy consumption of software can be considered, including the high-level algorithm descriptions, compilers, and operating systems. These levels will be covered in the tutorial. Summary: With a focus on programmable embedded systems, this tutorial will: (i) survey the interaction of architecture, operating systems, compilers and memories from a power/energy focus, (ii) present specific contributors of each part to power and energy, and (iii) outline software techniques for minimization of power/energy. Outline: In the first section, an introduction to the topic will be provided. Due to the large influence of the memory architecture on the total energy consumption, different memory architectures will be presented next. We will show how partitioned memories can help reducing the energy consumption. In the next section, we will describe how partitioned memory architectures and other features of embedded systems can be exploited in compilers. This includes the exploitation of scratch-pad memories and a comparison between scratch-pads and caches. In addition, this includes an analysis of the size of register files. Furthermore, we will explain techniques for reducing the memory traffic by global optimizations designed for multimedia applications. This section also comprises a description of applicable standard compiler optimizations and their potential for energy reductions as well as a brief introduction to compression techniques. The final section describes system software and real-time operating system (RTOS) issues. This will include hardware for RTOS-based power management, software support for power management, power-aware process scheduling and power-aware device management. Exploitation of application-specific information and power management of distributed systems will also be covered. Audience: Researchers starting to work on embedded systems or on low power design techniques; designers interested in getting to know available methodologies for low power software design.
|
| Peter Marwedel and Srinivas Devadas (eds.). LCTES'02-SCOPES'02: Joint Conference on Languages, Compilers and Tools for Embedded Systems & Software and Compilers for Embedded Systems. In ACM Press June 2002 [BibTeX][Abstract]@inproceedings { marwedel:02:lctes,
author = {Marwedel, Peter and Devadas (eds.), Srinivas},
title = {LCTES'02-SCOPES'02: Joint Conference on Languages, Compilers and Tools for Embedded Systems \& Software and Compilers for Embedded Systems},
booktitle = {ACM Press},
year = {2002},
month = {jun},
confidential = {n},
abstract = {(Draft version of the preface, final layout is different): This volume contains the proceedings of the Joint Languages, Compilers, and Tools for Embedded Systems (LCTES'02) and Software and Compilers for Embedded Systems (SCOPES'02) Conference. LCTES/SCOPES'02 took place in Berlin from June 19th to the 21st. For the first time, LCTES and SCOPES were held together, resulting in stimulating contacts between researchers predominantly having a background in programming languages and electronic design automation, respectively. Also, for the very first time, LCTES was held as a conference and not as a workshop. LCTES/SCOPES'02 received a total of 73 papers. During a comprehensive review process a total of 234 reviews were submitted. Finally, 25 papers were accepted and included in the resulting high-quality program. Accepted papers covered the following areas: compilers including low-energy compilation, synthesis, design space exploration, debugging and validation, code generation and register allocation, processor modeling, hardware/software codesign, and real-time scheduling. Accepted papers were grouped into 10 sessions. In addition, two invited keynotes given by Dr. Philippe Magarshack of STMicroelectronics and Prof. Gerhard Fettweis of Systemonic AG and of Dresden University emphasized industrial viewpoints and perspectives. We thank the following members of the program committee for their efforts in reviewing the papers: David August, Shuvra S. Bhattacharyya, Raul Camposano, Keith D. Cooper, Rajiv Gupta, Mary Hall, Seongsoo Hong, Masaharu Imai, Ahmed Jerraya, Jochen Jess, Kurt Keutzer, Rainer Leupers, Annie Liu, Jef van Meerbergen, SangLyul Min, Frank Mueller, Tatsuo Nakajima, Alex Nicolau, Santosh Pande, Manas Saksena, Bob Rau, Wolfgang Rosenstiel, Sreeranga P. Rajan, Gang-Ryung Uh, Carl Von Platen, Bernard Wess, David Whalley, Reinhard Wilhelm, and Hiroto Yasuura. We thank Jens Knoop of the University of Dortmund, who served as the Finance Chair, and Ahmed Jerraya of TIMA, Grenoble, who served as the Publicity Chair, for their efforts in organizing LCTES/SCOPES'02 and making it a success. Thanks to Armin Zimmermann of the Technical University of Berlin for organizing the social event in LCTES/SCOPES'02. We acknowledge the sponsorship of ACM SIGPLAN. ACM also provided travel grants to students. The European Design Automation Association (EDAA) provided an in-cooperation status to LCTES/SCOPES. Finally, we thank the steering committee for their efforts in ensuring continuity from the previous LCTES workshops. Peter Marwedel, University of Dortmund \\ Srinivas Devadas, MIT \\ Co-Chairs of LCTES/SCOPES'02},
} (Draft version of the preface, final layout is different): This volume contains the proceedings of the Joint Languages, Compilers, and Tools for Embedded Systems (LCTES'02) and Software and Compilers for Embedded Systems (SCOPES'02) Conference. LCTES/SCOPES'02 took place in Berlin from June 19th to the 21st. For the first time, LCTES and SCOPES were held together, resulting in stimulating contacts between researchers predominantly having a background in programming languages and electronic design automation, respectively. Also, for the very first time, LCTES was held as a conference and not as a workshop. LCTES/SCOPES'02 received a total of 73 papers. During a comprehensive review process a total of 234 reviews were submitted. Finally, 25 papers were accepted and included in the resulting high-quality program. Accepted papers covered the following areas: compilers including low-energy compilation, synthesis, design space exploration, debugging and validation, code generation and register allocation, processor modeling, hardware/software codesign, and real-time scheduling. Accepted papers were grouped into 10 sessions. In addition, two invited keynotes given by Dr. Philippe Magarshack of STMicroelectronics and Prof. Gerhard Fettweis of Systemonic AG and of Dresden University emphasized industrial viewpoints and perspectives. We thank the following members of the program committee for their efforts in reviewing the papers: David August, Shuvra S. Bhattacharyya, Raul Camposano, Keith D. Cooper, Rajiv Gupta, Mary Hall, Seongsoo Hong, Masaharu Imai, Ahmed Jerraya, Jochen Jess, Kurt Keutzer, Rainer Leupers, Annie Liu, Jef van Meerbergen, SangLyul Min, Frank Mueller, Tatsuo Nakajima, Alex Nicolau, Santosh Pande, Manas Saksena, Bob Rau, Wolfgang Rosenstiel, Sreeranga P. Rajan, Gang-Ryung Uh, Carl Von Platen, Bernard Wess, David Whalley, Reinhard Wilhelm, and Hiroto Yasuura. We thank Jens Knoop of the University of Dortmund, who served as the Finance Chair, and Ahmed Jerraya of TIMA, Grenoble, who served as the Publicity Chair, for their efforts in organizing LCTES/SCOPES'02 and making it a success. Thanks to Armin Zimmermann of the Technical University of Berlin for organizing the social event in LCTES/SCOPES'02. We acknowledge the sponsorship of ACM SIGPLAN. ACM also provided travel grants to students. The European Design Automation Association (EDAA) provided an in-cooperation status to LCTES/SCOPES. Finally, we thank the steering committee for their efforts in ensuring continuity from the previous LCTES workshops. Peter Marwedel, University of Dortmund; Srinivas Devadas, MIT; Co-Chairs of LCTES/SCOPES'02
|
| Peter Marwedel, Khac Dung Cong and Sergej Schwenk. RAVI: Interactive Visualization of information systems dynamics using a Java-based schematic editor and simulator. In Open IFIP-GI-Conference on Social, Ethical and Cognitive Issues of Informatics and ICT (Information and Communication Technologies), SECIII June 2002 [BibTeX][PDF][Abstract]@inproceedings { marwedel:02:seciii,
author = {Marwedel, Peter and Cong, Khac Dung and Schwenk, Sergej},
title = {RAVI: Interactive Visualization of information systems dynamics using a Java-based schematic editor and simulator},
booktitle = {Open IFIP-GI-Conference on Social, Ethical and Cognitive Issues of Informatics and ICT (Information and Communication Technologies), SECIII},
year = {2002},
month = {jun},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2002-seciii.pdf},
confidential = {n},
abstract = {One of the key limitations of traditional media, e.g. books, for teaching is their inability to visualize the dynamic behavior of systems. Video tapes and video distribution techniques have made it possible to show non-interactive media elements to students. However, one of the key advantages of multimedia systems over traditional media is their potential for providing interactivity. This interactivity, however, is not easy to obtain, since it requires the simulation of the system to be visualized. Designing simulators can be a challenging task that cannot be solved within the time-frame usually allocated for multimedia projects. On the other hand, available simulators are not suitable for classroom use. They are frequently designed for optimum simulation speed and complex design projects. Ease of use, excellent visualization and portability have normally not been top goals for simulator design. Also, powerful simulators are typically proprietary and come at high costs, preventing their widespread deployment to classrooms and into the hands of students.},
} One of the key limitations of traditional media, e.g. books, for teaching is their inability to visualize the dynamic behavior of systems. Video tapes and video distribution techniques have made it possible to show non-interactive media elements to students. However, one of the key advantages of multimedia systems over traditional media is their potential for providing interactivity. This interactivity, however, is not easy to obtain, since it requires the simulation of the system to be visualized. Designing simulators can be a challenging task that cannot be solved within the time-frame usually allocated for multimedia projects. On the other hand, available simulators are not suitable for classroom use. They are frequently designed for optimum simulation speed and complex design projects. Ease of use, excellent visualization and portability have normally not been top goals for simulator design. Also, powerful simulators are typically proprietary and come at high costs, preventing their widespread deployment to classrooms and into the hands of students.
|
| Peter Marwedel. Embedded Software: How to make it efficient?. In Proceedings of the Euromicro Symposium on Digital System Design Dortmund, 2002 [BibTeX][PDF][Abstract]@inproceedings { marwedel:02:euromicro,
author = {Marwedel, Peter},
title = {Embedded Software: How to make it efficient?},
booktitle = {Proceedings of the Euromicro Symposium on Digital System Design},
year = {2002},
address = {Dortmund},
publisher = {IEEE Design \& Test},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2002-euromicro.pdf},
confidential = {n},
abstract = {This paper stresses the importance of designing efficient embedded software and it provides a global view of some of the techniques that have been developed to meet this goal. These techniques include high-level transformations, compiler optimizations reducing the energy consumption of embedded programs and optimizations exploiting architectural features of embedded processors. Such optimizations lead to significant reductions of the execution time, the required energy and the memory size of embedded applications. Despite this, they can hardly be found in any available compiler. Embedded systems are typically implemented with a combination of application-specific hardware (designed for one particular application) and processors running software. The latter is used to meet the flexibility requirements found for almost all of today's applications.},
} This paper stresses the importance of designing efficient embedded software and it provides a global view of some of the techniques that have been developed to meet this goal. These techniques include high-level transformations, compiler optimizations reducing the energy consumption of embedded programs and optimizations exploiting architectural features of embedded processors. Such optimizations lead to significant reductions of the execution time, the required energy and the memory size of embedded applications. Despite this, they can hardly be found in any available compiler. Embedded systems are typically implemented with a combination of application-specific hardware (designed for one particular application) and processors running software. The latter is used to meet the flexibility requirements found for almost all of today's applications.
|
| Manoj Kumar Jain, Lars Wehmeyer, Stefan Steinke, Peter Marwedel and M. Balakrishnan. Evaluating Register File Size in ASIP Design. In CODES Copenhagen (Denmark), April 2001 [BibTeX][PDF][Abstract]@inproceedings { jain:01:codes,
author = {Jain, Manoj Kumar and Wehmeyer, Lars and Steinke, Stefan and Marwedel, Peter and Balakrishnan, M.},
title = {Evaluating Register File Size in ASIP Design},
booktitle = {CODES},
year = {2001},
address = {Copenhagen (Denmark)},
month = {apr},
keywords = {ecc},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2001-codes.pdf},
confidential = {n},
abstract = {Interest in synthesis of Application Specific Instruction Set Processors or ASIPs has increased considerably and a number of methodologies have been proposed for ASIP design. A key step in ASIP synthesis involves deciding architectural features based on application requirements and constraints. In this paper we observe the effect of changing register file size on performance as well as on power and energy consumption. Detailed data is generated and analyzed for a number of application programs. Results indicate that the choice of an appropriate number of registers has a significant impact on performance.},
} Interest in synthesis of Application Specific Instruction Set Processors or ASIPs has increased considerably and a number of methodologies have been proposed for ASIP design. A key step in ASIP synthesis involves deciding architectural features based on application requirements and constraints. In this paper we observe the effect of changing register file size on performance as well as on power and energy consumption. Detailed data is generated and analyzed for a number of application programs. Results indicate that the choice of an appropriate number of registers has a significant impact on performance.
|
| Markus Lorenz, Rainer Leupers, Peter Marwedel, Thorsten Dräger and Gerhard P. Fettweis. Low-Energy DSP Code Generation Using a Genetic Algorithm. In ICCD '01 Austin/Texas/USA, September 2001 [BibTeX][PDF][Abstract]@inproceedings { lorenz:01:iccd,
author = {Lorenz, Markus and Leupers, Rainer and Marwedel, Peter and Dr\"ager, Thorsten and Fettweis, Gerhard P.},
title = {Low-Energy DSP Code Generation Using a Genetic Algorithm},
booktitle = {ICCD '01},
year = {2001},
address = {Austin/Texas/USA},
month = {sep},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2001-iccd.pdf},
confidential = {n},
abstract = {This paper deals with low-energy code generation for a highly optimized digital signal processor designed for mobile communication applications. We present a genetic algorithm based code generator (GCG), and an instruction-level power model for this processor. Our code generator is capable of reducing the power dissipation of target applications by means of two techniques: First, GCG minimizes the number of memory accesses by using a special list-scheduling algorithm. This technique makes it possible to perform graph based code selection and to take into account the high interdependencies of the subtasks of code generation by phase coupling. In addition, GCG optimizes the scheduling of processor instructions with respect to the instruction-level power model based on a gate level simulation. Experimental results for several benchmarks show the effectiveness of our approach.},
} This paper deals with low-energy code generation for a highly optimized digital signal processor designed for mobile communication applications. We present a genetic algorithm based code generator (GCG), and an instruction-level power model for this processor. Our code generator is capable of reducing the power dissipation of target applications by means of two techniques: First, GCG minimizes the number of memory accesses by using a special list-scheduling algorithm. This technique makes it possible to perform graph based code selection and to take into account the high interdependencies of the subtasks of code generation by phase coupling. In addition, GCG optimizes the scheduling of processor instructions with respect to the instruction-level power model based on a gate level simulation. Experimental results for several benchmarks show the effectiveness of our approach.
|
| Jens Wagner and Rainer Leupers. C Compiler Design for an Industrial Network Processor. In LCTES Snowbird (USA), June 2001 [BibTeX][Abstract]@inproceedings { wagner:01:lctes,
author = {Wagner, Jens and Leupers, Rainer},
title = {C Compiler Design for an Industrial Network Processor},
booktitle = {LCTES},
year = {2001},
address = {Snowbird (USA)},
month = {jun},
confidential = {n},
abstract = {One important problem in code generation for embedded processors is the design of efficient compilers for ASIPs with application specific architectures. This paper outlines the design of a C compiler for an industrial ASIP for telecom applications. The target ASIP is a network processor with special instructions for bit-level access to data registers, which is required for packet-oriented communication protocol processing. From a practical viewpoint, we describe the main challenges in exploiting these application specific features in a C compiler, and we show how a compiler backend has been designed that accommodates these features by means of compiler intrinsics and a dedicated register allocator. The compiler is fully operational, and first experimental results indicate that C-level programming of the ASIP leads to good code quality without the need for time-consuming assembly programming.},
} One important problem in code generation for embedded processors is the design of efficient compilers for ASIPs with application specific architectures. This paper outlines the design of a C compiler for an industrial ASIP for telecom applications. The target ASIP is a network processor with special instructions for bit-level access to data registers, which is required for packet-oriented communication protocol processing. From a practical viewpoint, we describe the main challenges in exploiting these application specific features in a C compiler, and we show how a compiler backend has been designed that accommodates these features by means of compiler intrinsics and a dedicated register allocator. The compiler is fully operational, and first experimental results indicate that C-level programming of the ASIP leads to good code quality without the need for time-consuming assembly programming.
|
| Markus Lorenz, David Kottmann, Steven Bashford, Rainer Leupers and Peter Marwedel. Optimized Address Assignment for DSPs with SIMD Memory Accesses. In ASP-DAC '01 Yokohama/Japan, January 2001 [BibTeX][PDF][Abstract]@inproceedings { lorenz:01:aspdac,
author = {Lorenz, Markus and Kottmann, David and Bashford, Steven and Leupers, Rainer and Marwedel, Peter},
title = {Optimized Address Assignment for DSPs with SIMD Memory Accesses},
booktitle = {ASP-DAC '01},
year = {2001},
address = {Yokohama/Japan},
month = {jan},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2001-aspdac.pdf},
confidential = {n},
abstract = {This paper deals with address assignment in code generation for digital signal processors (DSPs) with SIMD (single instruction multiple data) memory accesses. In these processors data are organized in groups (or partitions), whose elements share one common memory address. In order to optimize program performance for processors with such memory architectures it is important to have a suitable memory layout of the variables. We propose a two-step address assignment technique for scalar variables using a genetic algorithm based partitioning method and a graph based heuristic which makes use of available DSP address generation hardware. We show that our address assignment techniques lead to a significant code quality improvement compared to heuristics.},
} This paper deals with address assignment in code generation for digital signal processors (DSPs) with SIMD (single instruction multiple data) memory accesses. In these processors data are organized in groups (or partitions), whose elements share one common memory address. In order to optimize program performance for processors with such memory architectures it is important to have a suitable memory layout of the variables. We propose a two-step address assignment technique for scalar variables using a genetic algorithm based partitioning method and a graph based heuristic which makes use of available DSP address generation hardware. We show that our address assignment techniques lead to a significant code quality improvement compared to heuristics.
|
| Rainer Leupers and Daniel Kotte. Variable Partitioning for Dual Memory Bank DSPs. In ICASSP Salt Lake City (USA), May 2001 [BibTeX][PDF][Abstract]@inproceedings { leupers:01:icassp,
author = {Leupers, Rainer and Kotte, Daniel},
title = {Variable Partitioning for Dual Memory Bank DSPs},
booktitle = {ICASSP},
year = {2001},
address = {Salt Lake City (USA)},
month = {may},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2001-icassp.pdf},
confidential = {n},
abstract = {DSPs with dual memory banks offer high memory bandwidth, which is required for high-performance applications. However, such DSP architectures pose problems for C compilers, which are mostly not capable of partitioning program variables between memory banks. As a consequence, time-consuming assembly programming is required for an efficient coding of time-critical algorithms. This paper presents a new technique for automatic variable partitioning between memory banks in compilers, which leads to a higher utilization of available memory bandwidth in the generated machine code. We present experimental results obtained by integrating the proposed technique into an existing C compiler for the AMS Gepard, an industrial DSP core.},
} DSPs with dual memory banks offer high memory bandwidth, which is required for high-performance applications. However, such DSP architectures pose problems for C compilers, which are mostly not capable of partitioning program variables between memory banks. As a consequence, time-consuming assembly programming is required for an efficient coding of time-critical algorithms. This paper presents a new technique for automatic variable partitioning between memory banks in compilers, which leads to a higher utilization of available memory bandwidth in the generated machine code. We present experimental results obtained by integrating the proposed technique into an existing C compiler for the AMS Gepard, an industrial DSP core.
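To give a concrete flavor of the partitioning problem described above, here is a minimal Python sketch (not the paper's algorithm): variables that are accessed simultaneously should land in different banks, which can be approximated by a greedy max-cut over a conflict graph. The input format and all names are invented for illustration.

```python
from collections import defaultdict

def partition_variables(parallel_pairs):
    """parallel_pairs: pairs of variables accessed in the same cycle.
    Greedy max-cut: place each variable in the bank ('X' or 'Y') that
    separates it from as many already-placed conflict partners as possible."""
    w = defaultdict(int)        # conflict weight per variable pair
    adj = defaultdict(set)
    for a, b in parallel_pairs:
        if a == b:
            continue
        w[frozenset((a, b))] += 1
        adj[a].add(b)
        adj[b].add(a)
    # visit heavily conflicting variables first
    order = sorted(adj, key=lambda v: -sum(w[frozenset((v, n))] for n in adj[v]))
    bank = {}
    for v in order:
        to_y = sum(w[frozenset((v, n))] for n in adj[v] if bank.get(n) == 'Y')
        to_x = sum(w[frozenset((v, n))] for n in adj[v] if bank.get(n) == 'X')
        # placing v in X cuts the edges to neighbours already in Y, and vice versa
        bank[v] = 'X' if to_y >= to_x else 'Y'
    return bank

if __name__ == "__main__":
    accesses = [("a", "b"), ("a", "b"), ("c", "d"), ("b", "c")]
    print(partition_variables(accesses))   # all three conflicts end up cut
```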
|
| Peter Marwedel, Stefan Steinke and Lars Wehmeyer. Compilation techniques for energy-, code-size-, and run-time-efficient embedded software. In Int. Workshop on Advanced Compiler Techniques for High Performance and Embedded Processors Bucharest, Hungary, 2001 [BibTeX][Abstract]@inproceedings { marwedel:01:iwact,
author = {Marwedel, Peter and Steinke, Stefan and Wehmeyer, Lars},
title = {Compilation techniques for energy-, code-size-, and run-time-efficient embedded software},
booktitle = {Int. Workshop on Advanced Compiler Techniques for High Performance and Embedded Processors},
year = {2001},
address = {Bucharest, Hungary},
keywords = {ecc},
confidential = {n},
abstract = {This paper is motivated by two essential characteristics of embedded systems: the increasing amount of software that is used for implementing embedded systems and the need for implementing embedded systems efficiently. As a consequence, embedded software has to be efficient. In the following, we will present techniques for generating efficient machine code for architectures which are typically found in embedded systems. We will demonstrate, using examples, how compilers for embedded processors can exploit features that are found in embedded processors.},
} This paper is motivated by two essential characteristics of embedded systems: the increasing amount of software that is used for implementing embedded systems and the need for implementing embedded systems efficiently. As a consequence, embedded software has to be efficient. In the following, we will present techniques for generating efficient machine code for architectures which are typically found in embedded systems. We will demonstrate, using examples, how compilers for embedded processors can exploit features that are found in embedded processors.
|
| Stefan Steinke, Markus Knauer, Lars Wehmeyer and Peter Marwedel. An Accurate and Fine Grain Instruction-Level Energy Model Supporting Software Optimizations. In PATMOS Yverdon (Switzerland), September 2001 [BibTeX][PDF][Abstract]@inproceedings { steinke:01:patmos,
author = {Steinke, Stefan and Knauer, Markus and Wehmeyer, Lars and Marwedel, Peter},
title = {An Accurate and Fine Grain Instruction-Level Energy Model Supporting Software Optimizations},
booktitle = {PATMOS},
year = {2001},
address = {Yverdon (Switzerland)},
month = {sep},
keywords = {ecc},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2001-patmos.pdf},
confidential = {n},
abstract = {Power aware compilers have been under research during the last few years. However, there is still a need for accurate energy models for supporting software optimizations. In this paper we present a new energy model on the instruction level. In addition to the effects captured by former models, bit toggling on internal and external buses as well as accesses to off-chip memories are considered. To determine the characteristics, a measuring method is presented which can be used to establish the energy model without detailed knowledge of the internal processor structures. Finally, the proposed energy model is established for the ARM7TDMI RISC processor.},
} Power aware compilers have been under research during the last few years. However, there is still a need for accurate energy models for supporting software optimizations. In this paper we present a new energy model on the instruction level. In addition to the effects captured by former models, bit toggling on internal and external buses as well as accesses to off-chip memories are considered. To determine the characteristics, a measuring method is presented which can be used to establish the energy model without detailed knowledge of the internal processor structures. Finally, the proposed energy model is established for the ARM7TDMI RISC processor.
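The structure of such an instruction-level energy model can be sketched in a few lines: the estimate sums per-instruction base costs, an inter-instruction overhead, a bit-toggling term for the buses, and a cost per off-chip memory access. The Python sketch below mirrors only that structure; all constants are invented placeholders, not the measured ARM7TDMI values.

```python
# Illustrative instruction-level energy estimate. All cost constants
# are made up for the example; a real model would use measured values.
BASE_COST = {"MOV": 2.1, "ADD": 2.3, "MUL": 4.0, "LDR": 5.6, "STR": 5.0}  # nJ
SWITCH_COST = 0.9       # nJ inter-instruction overhead when the opcode changes
BIT_TOGGLE_COST = 0.05  # nJ per toggled bus bit
MEM_ACCESS_COST = 12.0  # nJ per off-chip memory access

def toggles(a, b):
    """Number of bit positions that change between consecutive bus words."""
    return bin(a ^ b).count("1")

def estimate_energy(trace):
    """trace: list of (mnemonic, encoding_word, is_memory_access)."""
    energy, prev_op, prev_word = 0.0, None, 0
    for op, word, mem in trace:
        energy += BASE_COST[op]
        if prev_op is not None and prev_op != op:
            energy += SWITCH_COST            # circuit-state change overhead
        energy += BIT_TOGGLE_COST * toggles(prev_word, word)
        if mem:
            energy += MEM_ACCESS_COST
        prev_op, prev_word = op, word
    return energy

print(estimate_energy([("MOV", 0xE3A00001, False),
                       ("ADD", 0xE2800002, False),
                       ("LDR", 0xE5901000, True)]))
```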
|
| Peter Marwedel. Compilation Techniques for Embedded Software. In Invited Tutorial, Asia and South Pacific Design Automation Conference (ASP-DAC) Yokohama, Japan, 2001 [BibTeX][Abstract]@inproceedings { marwedel:01:aspdac,
author = {Marwedel, Peter},
title = {Compilation Techniques for Embedded Software},
booktitle = {Invited Tutorial, Asia and South Pacific Design Automation Conference (ASP-DAC)},
year = {2001},
address = {Yokohama, Japan},
confidential = {n},
abstract = {Embedded systems demand efficient processor architectures, optimized for application domains or applications. Current compiler technology supports these architectures poorly and has been recognized as a bottleneck for designing systems. Recent research projects aim at removing this bottleneck. We will present code optimization approaches taking the characteristics of embedded processor architectures into account. The following topics will be included: \begin{itemize} \item memory allocation techniques, \item compiler techniques for VLIW machines, \item compiler techniques aiming at energy minimization of embedded software, \item retargetability of compilers, \item code compression techniques. \end{itemize}},
} Embedded systems demand efficient processor architectures, optimized for application domains or applications. Current compiler technology supports these architectures poorly and has been recognized as a bottleneck for designing systems. Recent research projects aim at removing this bottleneck. We will present code optimization approaches taking the characteristics of embedded processor architectures into account. The following topics will be included: memory allocation techniques; compiler techniques for VLIW machines; compiler techniques aiming at energy minimization of embedded software; retargetability of compilers; and code compression techniques.
|
| Rainer Leupers. Compilertechniken für VLIW DSPs. In DSP Deutschland Munich, October 2000 [BibTeX][PDF][Abstract]@inproceedings { leupers:00:dsp,
author = {Leupers, Rainer},
title = {Compilertechniken f\"ur VLIW DSPs},
booktitle = {DSP Deutschland},
year = {2000},
address = {Munich},
month = {oct},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2000-dsp.pdf},
confidential = {n},
abstract = {VLIW processor architectures are increasingly being used for high-performance digital signal processing applications. Programming VLIW DSPs effectively requires powerful C compilers. In this contribution, we point out several weaknesses of current compilers for VLIW DSPs and present new code optimization techniques to address them. These concern the exploitation of SIMD instructions and conditional instructions as well as efficient scheduling and function inlining. Experimental results for a well-known VLIW DSP, the Texas Instruments C6201, demonstrate the practical benefit. In addition, frontend aspects and machine-independent code optimizations are briefly covered.},
} VLIW processor architectures are increasingly being used for high-performance digital signal processing applications. Programming VLIW DSPs effectively requires powerful C compilers. In this contribution, we point out several weaknesses of current compilers for VLIW DSPs and present new code optimization techniques to address them. These concern the exploitation of SIMD instructions and conditional instructions as well as efficient scheduling and function inlining. Experimental results for a well-known VLIW DSP, the Texas Instruments C6201, demonstrate the practical benefit. In addition, frontend aspects and machine-independent code optimizations are briefly covered.
|
| Rainer Leupers. Code Selection for Media Processors with SIMD Instructions. In DATE 2000 Paris/France, March 2000 [BibTeX][PDF][Abstract]@inproceedings { leupers:00:date,
author = {Leupers, Rainer},
title = {Code Selection for Media Processors with SIMD Instructions},
booktitle = {DATE 2000},
year = {2000},
address = {Paris/France},
month = {mar},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2000-date.pdf},
confidential = {n},
abstract = {Media processors show special instruction sets for fast execution of signal processing algorithms on different media data types. They provide SIMD instructions, capable of executing one operation on multiple data in parallel within a single instruction cycle. Unfortunately, their use in compilers is so far very restricted and requires either assembly libraries or compiler intrinsics. This paper presents a novel code selection technique capable of exploiting SIMD instructions even when compiling plain C source code. It makes it possible to take advantage of SIMD instructions for multimedia applications while still using portable source code.},
} Media processors show special instruction sets for fast execution of signal processing algorithms on different media data types. They provide SIMD instructions, capable of executing one operation on multiple data in parallel within a single instruction cycle. Unfortunately, their use in compilers is so far very restricted and requires either assembly libraries or compiler intrinsics. This paper presents a novel code selection technique capable of exploiting SIMD instructions even when compiling plain C source code. It makes it possible to take advantage of SIMD instructions for multimedia applications while still using portable source code.
|
| Rainer Leupers. Code Generation for Embedded Processors. In ISSS 2000 Madrid/Spain, September 2000 [BibTeX][PDF][Abstract]@inproceedings { leupers:00:isss,
author = {Leupers, Rainer},
title = {Code Generation for Embedded Processors},
booktitle = {ISSS 2000},
year = {2000},
address = {Madrid/Spain},
month = {sep},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2000-isss.pdf},
confidential = {n},
abstract = {The increasing use of programmable processors as IP blocks in embedded system design creates a need for C/C++ compilers capable of generating efficient machine code. Many of today's compilers for embedded processors suffer from insufficient code quality in terms of code size and performance. This violates the tight chip area and real-time constraints often imposed on embedded systems. The reason is that embedded processors typically show architectural features which are not well handled by classical compiler technology. This paper provides a survey of methods and techniques dedicated to efficient code generation for embedded processors. Emphasis is put on DSP and multimedia processors, for which better compiler technology is definitely required. In addition, some frontend aspects and recent trends in research and industry are briefly covered. The goal of these recent efforts in embedded code generation is to facilitate the step from assembly to high-level language programming of embedded systems, so as to provide higher productivity, dependability, and portability of embedded software.},
} The increasing use of programmable processors as IP blocks in embedded system design creates a need for C/C++ compilers capable of generating efficient machine code. Many of today's compilers for embedded processors suffer from insufficient code quality in terms of code size and performance. This violates the tight chip area and real-time constraints often imposed on embedded systems. The reason is that embedded processors typically show architectural features which are not well handled by classical compiler technology. This paper provides a survey of methods and techniques dedicated to efficient code generation for embedded processors. Emphasis is put on DSP and multimedia processors, for which better compiler technology is definitely required. In addition, some frontend aspects and recent trends in research and industry are briefly covered. The goal of these recent efforts in embedded code generation is to facilitate the step from assembly to high-level language programming of embedded systems, so as to provide higher productivity, dependability, and portability of embedded software.
|
| Rainer Leupers. Instruction Scheduling for Clustered VLIW DSPs. In PACT 2000 Philadelphia/USA, October 2000 [BibTeX][PDF][Abstract]@inproceedings { leupers:00:pact,
author = {Leupers, Rainer},
title = {Instruction Scheduling for Clustered VLIW DSPs},
booktitle = {PACT 2000},
year = {2000},
address = {Philadelphia/USA},
month = {oct},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2000-pact.pdf},
confidential = {n},
abstract = {Recent digital signal processors (DSPs) show a homogeneous VLIW-like data path architecture, which allows C compilers to generate efficient code. However, some special restrictions still have to be obeyed in code generation for VLIW DSPs. In order to reduce the number of register file ports needed to provide data for multiple functional units working in parallel, the DSP data path may be clustered into several sub-paths, with very limited capabilities of exchanging values between the different clusters. An example is the well-known Texas Instruments C6201 DSP. For such an architecture, the tasks of scheduling and partitioning instructions between the clusters are highly interdependent. This paper presents a new instruction scheduling approach, which, in contrast to earlier work, integrates partitioning and scheduling into a single technique, so as to achieve a high code quality. We show experimentally that the proposed technique is capable of generating more efficient code than a commercial code generator for the TI C6201.},
} Recent digital signal processors (DSPs) show a homogeneous VLIW-like data path architecture, which allows C compilers to generate efficient code. However, some special restrictions still have to be obeyed in code generation for VLIW DSPs. In order to reduce the number of register file ports needed to provide data for multiple functional units working in parallel, the DSP data path may be clustered into several sub-paths, with very limited capabilities of exchanging values between the different clusters. An example is the well-known Texas Instruments C6201 DSP. For such an architecture, the tasks of scheduling and partitioning instructions between the clusters are highly interdependent. This paper presents a new instruction scheduling approach, which, in contrast to earlier work, integrates partitioning and scheduling into a single technique, so as to achieve a high code quality. We show experimentally that the proposed technique is capable of generating more efficient code than a commercial code generator for the TI C6201.
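As a rough illustration of why partitioning and scheduling interact, consider this toy Python list scheduler for a hypothetical two-cluster machine with one issue slot per cluster and a one-cycle cross-cluster transfer penalty. It is a simplified stand-in, not the paper's technique.

```python
def schedule(ops, deps):
    """ops: operation names; deps: dict op -> list of predecessors (a DAG).
    Cycle-by-cycle list scheduling onto two clusters with one issue slot
    each and unit-latency operations; reading a value produced on the
    other cluster costs one extra cycle for the cross-cluster move."""
    done_cycle, cluster_of, result = {}, {}, []
    remaining, cycle = set(ops), 0
    while remaining:
        for cl in ("A", "B"):
            best = None
            for o in remaining:
                if any(p not in done_cycle for p in deps.get(o, ())):
                    continue                     # predecessors not scheduled yet
                start = 0
                for p in deps.get(o, ()):
                    ready = done_cycle[p] + 1    # unit latency
                    if cluster_of[p] != cl:
                        ready += 1               # cross-cluster transfer
                    start = max(start, ready)
                if start <= cycle and (best is None or o < best):
                    best = o
            if best is not None:
                done_cycle[best], cluster_of[best] = cycle, cl
                remaining.discard(best)
                result.append((cycle, cl, best))
        cycle += 1
    return result

print(schedule(["ld1", "ld2", "add"], {"add": ["ld1", "ld2"]}))
# -> [(0, 'A', 'ld1'), (0, 'B', 'ld2'), (2, 'A', 'add')]
```

The example shows the interdependence directly: spreading the loads across clusters gains issue slots, but the dependent add then pays the transfer penalty, which is exactly the trade-off an integrated approach must weigh.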
|
| Rainer Leupers. Register Allocation for Common Subexpressions in DSP Data Paths. In ASP-DAC 2000 Yokohama/Japan, June 2000 [BibTeX][PDF][Abstract]@inproceedings { leupers:00:aspdac,
author = {Leupers, Rainer},
title = {Register Allocation for Common Subexpressions in DSP Data Paths},
booktitle = {ASP-DAC 2000},
year = {2000},
address = {Yokohama/Japan},
month = {jun},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2000-aspdac.pdf},
confidential = {n},
abstract = {This paper presents a new code optimization technique for DSPs with irregular data path structures. We consider the problem of generating machine code for data flow graphs with common subexpressions (CSEs). While previous work assumes that CSEs are strictly stored in memory, the technique proposed in this paper also permits the allocation of special purpose registers for temporarily storing CSEs. As a result, both the code size and the number of memory accesses are reduced. The optimization is controlled by a simulated annealing algorithm. We demonstrate its effectiveness for several DSP applications and a widely used DSP processor.},
} This paper presents a new code optimization technique for DSPs with irregular data path structures. We consider the problem of generating machine code for data flow graphs with common subexpressions (CSEs). While previous work assumes that CSEs are strictly stored in memory, the technique proposed in this paper also permits the allocation of special purpose registers for temporarily storing CSEs. As a result, both the code size and the number of memory accesses are reduced. The optimization is controlled by a simulated annealing algorithm. We demonstrate its effectiveness for several DSP applications and a widely used DSP processor.
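For readers unfamiliar with the control strategy, a generic simulated-annealing loop over binary "memory vs. special-purpose register" decisions looks as follows. The move set and the toy cost function are invented stand-ins for the paper's code-size/memory-access model.

```python
import math, random

def anneal(n_cses, cost, t0=10.0, cooling=0.95, iters=2000):
    """Flip one CSE between memory (False) and register (True) per move;
    accept worsening moves with the usual Boltzmann probability."""
    state = [False] * n_cses
    cur_cost = cost(state)
    best, best_cost, t = list(state), cur_cost, t0
    for _ in range(iters):
        i = random.randrange(n_cses)
        state[i] = not state[i]              # propose a move
        new_cost = cost(state)
        if new_cost <= cur_cost or random.random() < math.exp((cur_cost - new_cost) / t):
            cur_cost = new_cost              # accept
            if new_cost < best_cost:
                best, best_cost = list(state), new_cost
        else:
            state[i] = not state[i]          # reject: undo the move
        t *= cooling
    return best, best_cost

# Placeholder cost: each CSE kept in memory costs 2 accesses; registers
# are scarce, so holding more than 2 CSEs in registers incurs spill cost.
def toy_cost(s):
    in_regs = sum(s)
    return 2 * (len(s) - in_regs) + (0 if in_regs <= 2 else 4 * (in_regs - 2))

print(anneal(6, toy_cost))
```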
|
| Anupam Basu and Peter Marwedel. Array Index Allocation under Register Constraints. In 12th Int. Conf. on VLSI Design Goa/India, January 1999 [BibTeX][PDF][Abstract]@inproceedings { basu:1999:vlsi,
author = {Anupam Basu and Peter Marwedel},
title = {Array Index Allocation under Register Constraints},
booktitle = {12th Int. Conf. on VLSI Design},
year = {1999},
address = {Goa/India},
month = {jan},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1999-vlsi.pdf},
confidential = {n},
abstract = {Code optimization for digital signal processors (DSPs) has been identified as an important new topic in system-level design of embedded systems. Both DSP processors and algorithms show special characteristics usually not found in general-purpose computing. Since real-time constraints imposed on DSP algorithms demand very high quality machine code, high-level language compilers for DSPs should take these characteristics into account. One important characteristic of DSP algorithms is the iterative pattern of references to array elements within loops. DSPs support efficient address computations for such array accesses by means of dedicated address generation units (AGUs). In this paper, we present a heuristic code optimization technique which, given an AGU with a fixed number of address registers, minimizes the number of instructions needed for address computations in loops.},
} Code optimization for digital signal processors (DSPs) has been identified as an important new topic in system-level design of embedded systems. Both DSP processors and algorithms show special characteristics usually not found in general-purpose computing. Since real-time constraints imposed on DSP algorithms demand very high quality machine code, high-level language compilers for DSPs should take these characteristics into account. One important characteristic of DSP algorithms is the iterative pattern of references to array elements within loops. DSPs support efficient address computations for such array accesses by means of dedicated address generation units (AGUs). In this paper, we present a heuristic code optimization technique which, given an AGU with a fixed number of address registers, minimizes the number of instructions needed for address computations in loops.
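The cost model behind such techniques is easy to illustrate: an AGU can post-increment or post-decrement an address register for free, so only accesses whose index is out of auto-modify range of every register need an explicit address-computation instruction. The greedy Python sketch below is a simplified illustration under that assumption, not the heuristic proposed in the paper; it assumes at least one address register is available.

```python
def address_instructions(access_seq, k, autoinc=1):
    """access_seq: array indices touched in loop order; k: number of
    address registers. Serve each access with the register whose current
    value is closest; count accesses needing an explicit address load."""
    regs = []            # current index held by each address register
    explicit = 0
    for idx in access_seq:
        # register requiring the smallest jump (None if no register yet)
        best = min(range(len(regs)), key=lambda i: abs(regs[i] - idx),
                   default=None)
        if best is not None and abs(regs[best] - idx) <= autoinc:
            regs[best] = idx          # free auto-increment/decrement
        elif len(regs) < k:
            regs.append(idx)          # first use: load the register
            explicit += 1
        else:
            regs[best] = idx          # out of range: explicit address update
            explicit += 1
    return explicit

# Two registers cover the two interleaved access streams with only the
# two initial loads; one register would need repeated explicit updates.
print(address_instructions([0, 1, 4, 5, 2, 3], k=2))   # -> 2
```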
|
| Matthias Weiss, Gerhard Fettweis, Markus Lorenz, Rainer Leupers and Peter Marwedel. Toolumgebung fuer plattformbasierte DSPs der naechsten Generation. In DSP Deutschland Munich/Germany, September 1999 [BibTeX][PDF][Abstract]@inproceedings { weiss:1999:dsp,
author = {Weiss, Matthias and Fettweis, Gerhard and Lorenz, Markus and Leupers, Rainer and Marwedel, Peter},
title = {Toolumgebung fuer plattformbasierte DSPs der naechsten Generation},
booktitle = {DSP Deutschland},
year = {1999},
address = {Munich/Germany},
month = {sep},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1999-dspd.pdf},
confidential = {n},
abstract = {Digital signal processors (DSPs) are used wherever certain performance or cost requirements cannot be met by standard processors. Examples include media and speech codecs, modems, and image and speech recognition. Due to the differing performance requirements of these applications, DSP vendors are now beginning to announce not a single DSP but a platform solution. Examples are the StarCore from Motorola/Lucent and the TigerSharc from Analog Devices. In this context, the M3-DSP platform has been developed at the Chair for Mobile Communications Systems of TU Dresden. In contrast to conventional approaches, a programming environment for such a DSP platform must also support the requirements of the platform. In this paper we present the architecture of such a tool environment for the M3 platform and demonstrate its applicability by means of examples.},
} Digital signal processors (DSPs) are used wherever certain performance or cost requirements cannot be met by standard processors. Examples include media and speech codecs, modems, and image and speech recognition. Due to the differing performance requirements of these applications, DSP vendors are now beginning to announce not a single DSP but a platform solution. Examples are the StarCore from Motorola/Lucent and the TigerSharc from Analog Devices. In this context, the M3-DSP platform has been developed at the Chair for Mobile Communications Systems of TU Dresden. In contrast to conventional approaches, a programming environment for such a DSP platform must also support the requirements of the platform. In this paper we present the architecture of such a tool environment for the M3 platform and demonstrate its applicability by means of examples.
|
| U. Bieker, M. Kaibel, P. Marwedel and W. Geisselhardt. STAR-DUST: Hierarchical Test of Embedded Processors by Self-Test Programs. In European Test Workshop (ETW) June 1999 [BibTeX][PDF][Abstract]@inproceedings { bieker:1999:etw,
author = {Bieker, U. and Kaibel, M. and Marwedel, P. and Geisselhardt, W.},
title = {STAR-DUST: Hierarchical Test of Embedded Processors by Self-Test Programs},
booktitle = {European Test Workshop (ETW)},
year = {1999},
month = {jun},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1999-etw.pdf},
confidential = {n},
abstract = {This paper describes the hierarchical test-generation method STAR-DUST, using self-test program generator RESTART, test pattern generator DUST, fault simulator FAUST and SYNOPSYS logic synthesis tools. RESTART aims at supporting self-test of embedded processors. Its integration into the STAR-DUST environment allows test program generation for realistic fault assumptions and provides, for the first time, experimental data on the fault coverage that can be obtained for full processor models. Experimental data shows that fault masking is not a problem even though the considered processor has to perform result comparison and arithmetic operations in the same ALU.},
} This paper describes the hierarchical test-generation method STAR-DUST, using self-test program generator RESTART, test pattern generator DUST, fault simulator FAUST and SYNOPSYS logic synthesis tools. RESTART aims at supporting self-test of embedded processors. Its integration into the STAR-DUST environment allows test program generation for realistic fault assumptions and provides, for the first time, experimental data on the fault coverage that can be obtained for full processor models. Experimental data shows that fault masking is not a problem even though the considered processor has to perform result comparison and arithmetic operations in the same ALU.
|
| Steven Bashford and Rainer Leupers. Constraint Driven Code Selection for Fixed-Point DSPs. In 36th Design Automation Conference New Orleans (USA), June 1999 [BibTeX][PDF][Abstract]@inproceedings { bashford:1999:dac,
author = {Steven Bashford and Rainer Leupers},
title = {Constraint Driven Code Selection for Fixed-Point DSPs},
booktitle = {36th Design Automation Conference},
year = {1999},
address = {New Orleans (USA)},
month = {jun},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1999-dac.pdf},
confidential = {n},
abstract = {Fixed-point DSPs are a class of embedded processors with highly irregular architectures. This irregularity makes it difficult to generate high-quality machine code from programming languages such as C. In this paper we present a novel constraint driven approach to code selection for irregular processor architectures, which provides a twofold improvement over earlier work. First, it handles complete data flow graphs instead of trees and thereby generates better code in the presence of common subexpressions. Second, the presented technique is not restricted to the computation of a single solution, but generates alternative solutions. This feature enables the tight coupling of different code generation phases, resulting in better exploitation of instruction-level parallelism. Experimental results indicate that our technique is capable of generating machine code that competes well with hand-written assembly code.},
} Fixed-point DSPs are a class of embedded processors with highly irregular architectures. This irregularity makes it difficult to generate high-quality machine code from programming languages such as C. In this paper we present a novel constraint driven approach to code selection for irregular processor architectures, which provides a twofold improvement over earlier work. First, it handles complete data flow graphs instead of trees and thereby generates better code in the presence of common subexpressions. Second, the presented technique is not restricted to the computation of a single solution, but generates alternative solutions. This feature enables the tight coupling of different code generation phases, resulting in better exploitation of instruction-level parallelism. Experimental results indicate that our technique is capable of generating machine code that competes well with hand-written assembly code.
|
| Rainer Leupers. Exploiting Conditional Instructions in Code Generation for Embedded VLIW Processors. In Design Automation and Test in Europe (DATE) Munich/Germany, March 1999 [BibTeX][PDF][Abstract]@inproceedings { leupers:1999:date,
author = {Leupers, Rainer},
title = {Exploiting Conditional Instructions in Code Generation for Embedded VLIW Processors},
booktitle = {Design Automation and Test in Europe (DATE)},
year = {1999},
address = {Munich/Germany},
month = {mar},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1999-date.pdf},
confidential = {n},
abstract = {This paper presents a new code optimization technique for a class of embedded processors. Modern embedded processor architectures show deep instruction pipelines and highly parallel VLIW-like instruction sets. For such architectures, any change in the control flow of a machine program due to a conditional jump may cause a significant code performance penalty. Therefore, the instruction sets of recent VLIW machines offer support for branch-free execution of conditional statements in the form of so-called conditional instructions. Whether an if-then-else statement is implemented by a conditional jump scheme or by conditional instructions has a strong impact on its worst-case execution time. However, the optimal selection is difficult particularly for nested conditionals. We present a dynamic programming technique for selecting the fastest implementation for nested if-then-else statements based on estimations. The efficacy is demonstrated for a real-life VLIW DSP.},
} This paper presents a new code optimization technique for a class of embedded processors. Modern embedded processor architectures show deep instruction pipelines and highly parallel VLIW-like instruction sets. For such architectures, any change in the control flow of a machine program due to a conditional jump may cause a significant code performance penalty. Therefore, the instruction sets of recent VLIW machines offer support for branch-free execution of conditional statements in the form of so-called conditional instructions. Whether an if-then-else statement is implemented by a conditional jump scheme or by conditional instructions has a strong impact on its worst-case execution time. However, the optimal selection is difficult particularly for nested conditionals. We present a dynamic programming technique for selecting the fastest implementation for nested if-then-else statements based on estimations. The efficacy is demonstrated for a real-life VLIW DSP.
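The core selection problem can be captured in a few lines: for every conditional, choose the cheaper of a jump scheme (worst-case branch plus pipeline penalty) and a conditional-instruction scheme (both branches issued, no penalty), computed bottom-up over the nesting structure. The sketch below uses invented cost constants and a toy tree encoding; the paper's estimation model is more detailed.

```python
# Worst-case cycle count of a nested conditional under the cheaper of
# two implementation schemes. The branch penalty and the node encoding
# ('leaf', n_instructions) / ('if', then_node, else_node) are made up.
BRANCH_PENALTY = 5   # cycles lost on a conditional jump (placeholder)

def cost(node):
    if node[0] == 'leaf':
        return node[1]
    _, then_node, else_node = node
    t, e = cost(then_node), cost(else_node)      # bottom-up recursion
    jump_scheme = max(t, e) + BRANCH_PENALTY     # worst-case path + penalty
    conditional = t + e + 1                      # both sides issued + test
    return min(jump_scheme, conditional)

inner = ('if', ('leaf', 2), ('leaf', 3))
outer = ('if', inner, ('leaf', 10))
# inner: min(3+5, 2+3+1) = 6 -> conditional instructions win;
# outer: min(10+5, 6+10+1) = 15 -> the jump scheme wins.
print(cost(outer))   # -> 15
```

The example already shows why a global, bottom-up choice matters: the best scheme for the inner conditional depends on its size, and that result in turn changes which scheme is best for the enclosing one.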
|
| Rainer Leupers and Peter Marwedel. Function Inlining under Code Size Constraints for Embedded Processors. In ICCAD San Jose (USA), November 1999 [BibTeX][PDF][Abstract]@inproceedings { leupers:1999:iccad,
author = {Rainer Leupers and Peter Marwedel},
title = {Function Inlining under Code Size Constraints for Embedded Processors},
booktitle = {ICCAD},
year = {1999},
address = {San Jose (USA)},
month = {nov},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1999-iccad.pdf},
confidential = {n},
abstract = {Function inlining is a compiler optimization that generally increases performance at the expense of larger code size. However, current inlining techniques do not meet the special demands in the design of embedded systems, since they are based on simple heuristics, and they generate code of unpredictable size. This paper presents a novel approach to function inlining in C compilers for embedded processors, which aims at maximum program speedup under a global limit on code size. The core of this approach is a branch-and-bound algorithm which makes it possible to quickly explore the large search space. In an application study we show how this algorithm can be applied to maximize the execution speed of an application under a given code size constraint.},
} Function inlining is a compiler optimization that generally increases performance at the expense of larger code size. However, current inlining techniques do not meet the special demands in the design of embedded systems, since they are based on simple heuristics, and they generate code of unpredictable size. This paper presents a novel approach to function inlining in C compilers for embedded processors, which aims at maximum program speedup under a global limit on code size. The core of this approach is a branch-and-bound algorithm which makes it possible to quickly explore the large search space. In an application study we show how this algorithm can be applied to maximize the execution speed of an application under a given code size constraint.
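Abstracting away the call-graph details, the optimization resembles a 0/1 knapsack: each call site offers an estimated speedup at a code-size price, subject to a global size budget. The sketch below shows a branch-and-bound with a fractional upper bound over independent call sites; the gains, sizes, and independence assumption are illustrative simplifications, not the paper's model.

```python
def best_inlining(sites, budget):
    """sites: list of (name, speedup, size). Maximize total speedup
    subject to sum(size) <= budget. Returns (speedup, chosen names)."""
    # sort by speedup density so the fractional relaxation bound is tight
    sites = sorted(sites, key=lambda s: s[1] / s[2], reverse=True)
    best = [0.0, []]

    def bound(i, room):
        """Upper bound on extra speedup from sites[i:] within 'room'."""
        ub = 0.0
        for _, gain, size in sites[i:]:
            if size <= room:
                room -= size; ub += gain
            else:
                return ub + gain * room / size   # fractional relaxation
        return ub

    def search(i, room, gain, chosen):
        if gain > best[0]:
            best[0], best[1] = gain, list(chosen)
        if i == len(sites) or gain + bound(i, room) <= best[0]:
            return                                # exhausted or pruned
        name, g, size = sites[i]
        if size <= room:                          # branch 1: inline site i
            chosen.append(name)
            search(i + 1, room - size, gain + g, chosen)
            chosen.pop()
        search(i + 1, room, gain, chosen)         # branch 2: keep the call

    search(0, budget, 0.0, [])
    return best[0], best[1]

print(best_inlining([("f", 8, 40), ("g", 6, 30), ("h", 5, 30)], budget=60))
# -> (11, ['g', 'h']): inlining the single best site is not optimal here.
```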
|
| Rainer Leupers and Birger Landwehr. Generation of Interpretive and Compiled Instruction Set Simulators. In Asia South Pacific Design Automation Conference (ASP-DAC) Hong Kong, China, January 1999 [BibTeX][PDF][Abstract]@inproceedings { leupers:1999:aspdac,
author = {Rainer Leupers and Birger Landwehr},
title = {Generation of Interpretive and Compiled Instruction Set Simulators},
booktitle = {Asia South Pacific Design Automation Conference (ASP-DAC)},
year = {1999},
address = {Hong Kong, China},
month = {jan},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1999-aspdac-lel.pdf},
confidential = {n},
abstract = {Due to the large variety of different embedded processor types, retargetable software development tools, such as compilers and simulators, have received attention recently. Retargetability makes it possible to handle different target processors with a single tool. In this paper, we present a system for automatic generation of instruction set simulators for a class of embedded processors. Retargetability is achieved by automatic generation of simulators from processor descriptions, given as behavioral or RT-level HDL models. The presented system is capable of bit-true simulation for arbitrary processor word lengths, and it generates both interpretive and compiled simulators. Experimental results for different processors indicate comparatively high simulation speed.},
} Due to the large variety of different embedded processor types, retargetable software development tools, such as compilers and simulators, have received attention recently. Retargetability makes it possible to handle different target processors with a single tool. In this paper, we present a system for automatic generation of instruction set simulators for a class of embedded processors. Retargetability is achieved by automatic generation of simulators from processor descriptions, given as behavioral or RT-level HDL models. The presented system is capable of bit-true simulation for arbitrary processor word lengths, and it generates both interpretive and compiled simulators. Experimental results for different processors indicate comparatively high simulation speed.
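To make the interpretive/compiled distinction concrete: an interpretive simulator fetches, decodes, and dispatches every executed instruction at run time, whereas a compiled simulator translates the program once up front. Below is a minimal interpretive loop for an invented three-instruction toy ISA; it illustrates the principle only and says nothing about the structure of the generated simulators.

```python
# Minimal interpretive instruction-set simulator for a made-up toy ISA
# with four registers: li (load immediate), add, and bnez (branch if
# register is non-zero). Every instruction is decoded on every execution.
def run(program, steps=100):
    regs = [0] * 4
    pc = 0
    while 0 <= pc < len(program) and steps:
        op, *args = program[pc]        # fetch + decode
        if op == "li":                 # li rd, imm
            regs[args[0]] = args[1]
        elif op == "add":              # add rd, rs1, rs2
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "bnez":             # bnez rs, target
            if regs[args[0]] != 0:
                pc = args[1]
                steps -= 1
                continue
        pc += 1
        steps -= 1
    return regs

prog = [("li", 0, 5), ("li", 1, 0), ("li", 2, -1),
        ("add", 1, 1, 0), ("add", 0, 0, 2), ("bnez", 0, 3)]
print(run(prog))   # regs[1] accumulates 5+4+3+2+1 = 15
```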
|
| Rainer Leupers. Compiler Optimizations for Media Processors. In EMMSEC '99 Stockholm/Sweden, June 1999 [BibTeX][PDF][Abstract]@inproceedings { leupers:1999:emmsec,
author = {Leupers, Rainer},
title = {Compiler Optimizations for Media Processors},
booktitle = {EMMSEC '99},
year = {1999},
address = {Stockholm/Sweden},
month = {jun},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1999-emmsec.pdf},
confidential = {n},
abstract = {In the design of embedded systems, programmable processors gain more and more importance due to their high flexibility and potential for reuse. As a consequence, compilers for embedded processors are required, capable of generating very fast and dense code. In particular, this concerns the area of computation-intensive multimedia applications. While domain-specific digital signal processors may offer sufficient performance for multimedia, they show comparatively low flexibility and pose difficult problems to compiler and software developers due to their irregular architectures. In fact, meeting the system specification while minimizing the costs frequently requires time-consuming assembly-level programming of embedded processors. Recent media processors cover a larger set of application areas and, due to a more regular architecture, also facilitate the construction of compilers capable of generating high-quality machine code. However, media processors simultaneously introduce new challenges for compiler technology. In this paper, we motivate the use of media processors and we present two new compiler optimizations for such processors.},
} In the design of embedded systems, programmable processors gain more and more importance due to their high flexibility and potential for reuse. As a consequence, compilers for embedded processors are required, capable of generating very fast and dense code. In particular, this concerns the area of computation-intensive multimedia applications. While domain-specific digital signal processors may offer sufficient performance for multimedia, they show comparatively low flexibility and pose difficult problems to compiler and software developers due to their irregular architectures. In fact, meeting the system specification while minimizing the costs frequently requires time-consuming assembly-level programming of embedded processors. Recent media processors cover a larger set of application areas and, due to a more regular architecture, also facilitate the construction of compilers capable of generating high-quality machine code. However, media processors simultaneously introduce new challenges for compiler technology. In this paper, we motivate the use of media processors and we present two new compiler optimizations for such processors.
|
| Birger Landwehr. A Genetic Algorithm based Approach for Multi-Objective Data-Flow Graph Optimization. In Asia South Pacific Design Automation Conference (ASP-DAC) Hong Kong, China, January 1999 [BibTeX][PDF][Abstract]@inproceedings { landwehr:1999:aspdac,
author = {Landwehr, Birger},
title = {A Genetic Algorithm based Approach for Multi-Objective Data-Flow Graph Optimization},
booktitle = {Asia South Pacific Design Automation Conference (ASP-DAC)},
year = {1999},
address = {Hong Kong, China},
month = {jan},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1999-aspdac_a.pdf},
confidential = {n},
abstract = {This paper presents a genetic algorithm based approach for algebraic optimization of behavioral system specifications. We introduce a chromosomal representation of data-flow graphs (DFG) which ensures that the correctness of algebraic transformations realized by the underlying genetic operators selection, recombination, and mutation is always preserved. We present substantial fitness functions for both the minimization of overall resource costs and critical path length. We also demonstrate that, due to their flexibility, genetic algorithms can easily be adapted to different objective functions, as is exemplarily shown for power optimization. In order to avoid inferior results caused by the counteracting demands on resources of different basic blocks, all DFGs of the input description are optimized concurrently.},
} This paper presents a genetic algorithm based approach for algebraic optimization of behavioral system specifications. We introduce a chromosomal representation of data-flow graphs (DFG) which ensures that the correctness of algebraic transformations realized by the underlying genetic operators selection, recombination, and mutation is always preserved. We present substantial fitness functions for both the minimization of overall resource costs and critical path length. We also demonstrate that, due to their flexibility, genetic algorithms can easily be adapted to different objective functions, as is exemplarily shown for power optimization. In order to avoid inferior results caused by the counteracting demands on resources of different basic blocks, all DFGs of the input description are optimized concurrently.
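The encoding idea can be sketched as follows: if the only genetic operator applies semantics-preserving algebraic rewrites to an expression tree, every individual in the population remains correct by construction. The rule set (commutativity and associativity), the depth-based fitness, and the population scheme below are simplified assumptions for illustration, not the paper's actual chromosomal representation.

```python
import random

def rewrite(e):
    """Apply one semantics-preserving rule at the root, if applicable."""
    if isinstance(e, tuple) and e[0] in ("+", "*"):
        op, l, r = e
        if random.random() < 0.5:
            return (op, r, l)                      # commutativity
        if isinstance(l, tuple) and l[0] == op:
            return (op, l[1], (op, l[2], r))       # associativity
    return e

def mutate(e):
    """Rewrite at a random node; correctness is preserved by construction."""
    if not isinstance(e, tuple) or random.random() < 0.4:
        return rewrite(e)
    op, l, r = e
    return (op, mutate(l), mutate(r))

def depth(e):                                      # fitness proxy for the
    if not isinstance(e, tuple):                   # critical path length
        return 0
    return 1 + max(depth(e[1]), depth(e[2]))

expr = ("+", ("+", ("+", "a", "b"), "c"), "d")     # left-leaning, depth 3
population = [expr] * 8
for _ in range(50):
    offspring = [mutate(random.choice(population)) for _ in range(16)]
    population = sorted(population + offspring, key=depth)[:8]
print(population[0], "-> depth", depth(population[0]))   # depth 2 is optimal
```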
|
| Anupam Basu and Peter Marwedel. Register-Constrained Address Computation in DSP Programs. In Design Automation and Test in Europe (DATE) Paris/France, February 1998 [BibTeX][PDF][Abstract]@inproceedings { basu:1998:date,
author = {Basu, Anupam and Marwedel, Peter},
title = {Register-Constrained Address Computation in DSP Programs},
booktitle = {Design Automation and Test in Europe (DATE)},
year = {1998},
address = {Paris/France},
month = {feb},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1998-date-BLM.pdf},
confidential = {n},
abstract = {This paper describes a new code optimization technique for digital signal processors (DSPs). One important characteristic of DSP algorithms is the iterative access to data array elements within loops. DSPs support efficient address computations for such array accesses by means of dedicated address generation units (AGUs). We present a heuristic technique which, given an AGU with a fixed number of address registers, minimizes the number of instructions needed for array address computations in a program loop.},
} This paper describes a new code optimization technique for digital signal processors (DSPs). One important characteristic of DSP algorithms is the iterative access to data array elements within loops. DSPs support efficient address computations for such array accesses by means of dedicated address generation units (AGUs). We present a heuristic technique which, given an AGU with a fixed number of address registers, minimizes the number of instructions needed for array address computations in a program loop.
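The effect the heuristic exploits can be made concrete with a toy cost model: an access whose offset differs by at most one from what an address register currently holds is free (auto-increment/decrement), while any larger step costs an explicit modify instruction. The greedy assignment of accesses to a fixed number of address registers below is an invented stand-in for the paper's heuristic.

```python
def greedy_cost(access_seq, num_ars):
    """Count explicit address-register modifies under a +/-1 auto-inc model."""
    ars = [None] * num_ars           # offset currently held by each AR
    cost = 0
    for off in access_seq:
        for i, cur in enumerate(ars):
            if cur is not None and abs(cur - off) <= 1:
                ars[i] = off         # reachable by a free auto-inc/dec
                break
        else:                        # no AR is close enough: pay one modify
            victim = ars.index(None) if None in ars else 0
            ars[victim] = off
            cost += 1
    return cost

accesses = [0, 1, 4, 2, 5, 3, 6]     # array offsets touched per iteration
for k in (1, 2, 3):
    print(k, "ARs ->", greedy_cost(accesses, k), "modify instructions")
# 1 AR -> 6, 2 ARs -> 2, 3 ARs -> 2: a second register removes most modifies
```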
|
| Anupam Basu and Peter Marwedel. Interface Synthesis for Embedded Applications in a Codesign Environment. In Eleventh International Conference on VLSI Design Chennai, India, January 1998 [BibTeX][PDF][Abstract]@inproceedings { basu:1998:vlsi,
author = {Basu, Anupam and Marwedel, Peter},
title = {Interface Synthesis for Embedded Applications in a Codesign Environment},
booktitle = {Eleventh International Conference on VLSI Design},
year = {1998},
address = {Chennai, India},
month = {jan},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1998-vlsi.pdf},
confidential = {n},
abstract = {In embedded systems, programmable peripherals are often coupled with the main programmable processor to achieve desired functionality. Interfacing such peripherals with the processor qualifies as an important task of hardware software codesign. In this paper, three important aspects of such interfacing, namely the allocation of addresses to the devices, allocation of device drivers, and approaches to handle events and transitions have been discussed. The proposed approaches have been incorporated in a codesign system MICKEY. The paper includes a number of examples, taken from the results synthesized by MICKEY, to illustrate the ideas.},
} In embedded systems, programmable peripherals are often coupled with the main programmable processor to achieve desired functionality. Interfacing such peripherals with the processor qualifies as an important task of hardware software codesign. In this paper, three important aspects of such interfacing, namely the allocation of addresses to the devices, allocation of device drivers, and approaches to handle events and transitions have been discussed. The proposed approaches have been incorporated in a codesign system MICKEY. The paper includes a number of examples, taken from the results synthesized by MICKEY, to illustrate the ideas.
|
| Rajesh K. Gupta, Sujit Dey and Peter Marwedel. Embedded System Design and Validation: Building Systems from IC Cores to Chips (Tutorial). In Eleventh Int. Conf. on VLSI Design 1998 [BibTeX][Abstract]@inproceedings { gupta:1998:vlsi,
author = {Gupta, Rajesh K. and Dey, Sujit and Marwedel, Peter},
title = {Embedded System Design and Validation: Building Systems from IC Cores to Chips (Tutorial)},
booktitle = {Eleventh Int. Conf. on VLSI Design},
year = {1998},
confidential = {n},
abstract = {This tutorial addresses the challenges in the design and validation of an embedded system-on-a-chip. In particular, we examine the design and use of pre-designed, pre-characterized, and pre-verified silicon building blocks called `cores.' We examine design styles for portability and reuse of core cells, and how these can be validated for given applications. We examine the trend in IC cores by describing a wide range of available processor, DSP, interface and analog cores and how these are used in application domains such as multimedia and networking systems. Covered topics include emulation, in-circuit emulation, compliance test environments, instruction-set simulation, and hardware-software co-simulation. We discuss the increasing software content on chips, made possible by proliferation of processor and DSP cores. We examine application-specific instruction-set (ASIP) and signal processors (ASSPs) and their use in specific application domains. We then present new code optimization approaches related to memory allocation and code compaction that exploit the embedded processor architectures. Finally, we present techniques for retargeting compilers to new architectures easily. In particular, we show how compilers can be generated from descriptions of processor architectures.},
} This tutorial addresses the challenges in the design and validation of an embedded system-on-a-chip. In particular, we examine the design and use of pre-designed, pre-characterized, and pre-verified silicon building blocks called `cores.' We examine design styles for portability and reuse of core cells, and how these can be validated for given applications. We examine the trend in IC cores by describing a wide range of available processor, DSP, interface and analog cores and how these are used in application domains such as multimedia and networking systems. Covered topics include emulation, in-circuit emulation, compliance test environments, instruction-set simulation, and hardware-software co-simulation. We discuss the increasing software content on chips, made possible by proliferation of processor and DSP cores. We examine application-specific instruction-set (ASIP) and signal processors (ASSPs) and their use in specific application domains. We then present new code optimization approaches related to memory allocation and code compaction that exploit the embedded processor architectures. Finally, we present techniques for retargeting compilers to new architectures easily. In particular, we show how compilers can be generated from descriptions of processor architectures.
|
| Rainer Leupers. Ausnutzung von Conditional Instructions in VLIW DSP-Compilern. In DSP Deutschland Munich/Germany, October 1998 [BibTeX][PDF][Abstract]@inproceedings { leupers:1998:dsp,
author = {Leupers, Rainer},
title = {Ausnutzung von Conditional Instructions in VLIW DSP-Compilern},
booktitle = {DSP Deutschland},
year = {1998},
address = {Munich/Germany},
month = {oct},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1998-dspd.pdf},
confidential = {n},
abstract = {In this paper we present a novel compiler optimization technique which is applied primarily when programming control-intensive applications on VLIW DSPs. The technique is based on the optimized exploitation of so-called conditional instructions, found in the instruction sets of recent DSPs, for the implementation of if-then-else statements. While the "classical" implementation by means of conditional jumps can cause considerable performance losses in the machine code due to pipeline conflicts, the use of conditional instructions is frequently more efficient. The presented technique selects the fastest implementation for each (possibly nested) if-then-else statement in the source code. For this purpose, execution time estimates and a procedure based on dynamic programming are employed. Experimental results for a TI 'C62xx demonstrate the effectiveness of the optimization technique.},
} In this paper we present a novel compiler optimization technique which is applied primarily when programming control-intensive applications on VLIW DSPs. The technique is based on the optimized exploitation of so-called conditional instructions, found in the instruction sets of recent DSPs, for the implementation of if-then-else statements. While the "classical" implementation by means of conditional jumps can cause considerable performance losses in the machine code due to pipeline conflicts, the use of conditional instructions is frequently more efficient. The presented technique selects the fastest implementation for each (possibly nested) if-then-else statement in the source code. For this purpose, execution time estimates and a procedure based on dynamic programming are employed. Experimental results for a TI 'C62xx demonstrate the effectiveness of the optimization technique.
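The selection step can be sketched as a recursion over the statement tree: for each (possibly nested) if-then-else, compare a branch-based implementation, which executes only one arm but pays a pipeline penalty, with a predicated one, which executes both arms. All cycle counts and the penalty are invented numbers, and the plain recursion is a simplified stand-in for the paper's dynamic-programming procedure.

```python
BRANCH_PENALTY = 4   # assumed cycles lost per conditional jump (invented)

def cost(stmt):
    """stmt is ('block', cycles) or ('if', then_stmt, else_stmt)."""
    if stmt[0] == "block":
        return stmt[1]
    _, then_s, else_s = stmt
    then_c, else_c = cost(then_s), cost(else_s)        # optimal subproblems
    branching = BRANCH_PENALTY + max(then_c, else_c)   # execute one arm
    predicated = then_c + else_c                       # execute both arms
    return min(branching, predicated)

inner = ("if", ("block", 2), ("block", 3))
outer = ("if", ("block", 20), inner)
print(cost(inner))   # 5: predication (2+3) beats branching (4+3)
print(cost(outer))   # 24: branching wins once one arm is expensive
```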
|
| Ralf Niemann and Peter Marwedel. Synthesis of Communicating Controllers for Concurrent Hardware/Software Systems. In Design, Automation and Test in Europe (DATE) Paris/France, February 1998 [BibTeX][PDF][Abstract]@inproceedings { niemann:1998:date,
author = {Niemann, Ralf and Marwedel, Peter},
title = {Synthesis of Communicating Controllers for Concurrent Hardware/Software Systems},
booktitle = {Design, Automation and Test in Europe (DATE)},
year = {1998},
address = {Paris/France},
month = {feb},
keywords = {hwsw},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1998-date-niemann.pdf},
confidential = {n},
abstract = {Two main aspects in hardware/software codesign are hardware/software partitioning and co-synthesis. Most codesign approaches work only on one of these problems. In this paper, an approach coupling hardware/software partitioning and co-synthesis will be presented, which works fully automatically. The techniques have been integrated into the codesign tool COOL (COdesign toOL) supporting the complete design flow from system specification to board-level implementation for multi-processor and multi-ASIC target architectures for data-flow dominated applications.},
} Two main aspects in hardware/software codesign are hardware/software partitioning and co-synthesis. Most codesign approaches work only on one of these problems. In this paper, an approach coupling hardware/software partitioning and co-synthesis will be presented, which works fully automatically. The techniques have been integrated into the codesign tool COOL (COdesign toOL) supporting the complete design flow from system specification to board-level implementation for multi-processor and multi-ASIC target architectures for data-flow dominated applications.
|
| Rainer Leupers. Novel Code Optimization Techniques for DSPs. In 2nd European DSP Education and Research Conference Paris/France, September 1998 [BibTeX][PDF][Abstract]@inproceedings { leupers:1998:dsper,
author = {Leupers, Rainer},
title = {Novel Code Optimization Techniques for DSPs},
booktitle = {2nd European DSP Education and Research Conference},
year = {1998},
address = {Paris/France},
month = {sep},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1998-dsper.pdf},
confidential = {n},
abstract = {Software development for DSPs is frequently a bottleneck in the system design process, due to the poor code quality delivered by many current C compilers. As a consequence, most of the DSP software still has to be written manually in assembly language. In order to overcome this problem, new DSP-specific code optimization techniques are required, which, in contrast to classical compiler technology, take the detailed processor architecture sufficiently into account. This paper describes several new DSP code optimization techniques: maximum utilization of parallel address generation units, exploitation of instruction-level parallelism through exact code compaction, and optimized code generation for IF-statements by means of conditional instructions. Experimental results indicate significant improvements in code quality as compared to existing compilers.},
} Software development for DSPs is frequently a bottleneck in the system design process, due to the poor code quality delivered by many current C compilers. As a consequence, most of the DSP software still has to be written manually in assembly language. In order to overcome this problem, new DSP-specific code optimization techniques are required, which, in contrast to classical compiler technology, take the detailed processor architecture sufficiently into account. This paper describes several new DSP code optimization techniques: maximum utilization of parallel address generation units, exploitation of instruction-level parallelism through exact code compaction, and optimized code generation for IF-statements by means of conditional instructions. Experimental results indicate significant improvements in code quality as compared to existing compilers.
|
| Rainer Leupers. HDL-based Modeling of Embedded Processor Behavior for Retargetable Compilation. In ISSS '98 Hsinchu/Taiwan, December 1998 [BibTeX][PDF][Abstract]@inproceedings { leupers:1998:isss1,
author = {Leupers, Rainer},
title = {HDL-based Modeling of Embedded Processor Behavior for Retargetable Compilation},
booktitle = {ISSS '98},
year = {1998},
address = {Hsinchu/Taiwan},
month = {dec},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1998-isss_a.pdf},
confidential = {n},
abstract = {The concept of retargetability enables compiler technology to keep pace with the increasing variety of domain-specific embedded processors. In order to achieve user retargetability, powerful processor modeling formalisms are required. Most of the recent modeling formalisms concentrate on horizontal, VLIW-like instruction formats. However, for encoded instruction formats with restricted instruction-level parallelism (ILP), a large number of ILP constraints might need to be specified, resulting in less concise processor models. This paper presents an HDL-based approach to processor modeling for retargetable compilation, in which ILP may be implicitly constrained. As a consequence, the formalism allows for concise models also for encoded instruction formats. The practical applicability of the modeling formalism is demonstrated by means of a case study for a complex DSP.},
} The concept of retargetability enables compiler technology to keep pace with the increasing variety of domain-specific embedded processors. In order to achieve user retargetability, powerful processor modeling formalisms are required. Most of the recent modeling formalisms concentrate on horizontal, VLIW-like instruction formats. However, for encoded instruction formats with restricted instruction-level parallelism (ILP), a large number of ILP constraints might need to be specified, resulting in less concise processor models. This paper presents an HDL-based approach to processor modeling for retargetable compilation, in which ILP may be implicitly constrained. As a consequence, the formalism allows for concise models also for encoded instruction formats. The practical applicability of the modeling formalism is demonstrated by means of a case study for a complex DSP.
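The notion that an encoded instruction format implicitly constrains ILP can be illustrated in a few lines: two operations fit into the same instruction word only if their bit-field encodings do not conflict, so the parallelism restrictions never have to be listed explicitly. The field layout and operations below are invented for illustration.

```python
# micro-operation -> {instruction-word field: required encoding}
ENCODING = {
    "mac":   {"alu_op": 0b01, "mul_en": 1},
    "add":   {"alu_op": 0b10},
    "load":  {"mem_op": 1, "alu_op": 0b00},   # also occupies the alu_op field
    "store": {"mem_op": 1},
}

def compatible(op_a, op_b):
    """Parallel issue is legal iff all shared fields agree."""
    fa, fb = ENCODING[op_a], ENCODING[op_b]
    return all(fa[f] == fb[f] for f in fa.keys() & fb.keys())

print(compatible("add", "store"))   # True: disjoint fields
print(compatible("add", "load"))    # False: both encode alu_op, differently
```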
|
| Rainer Leupers and Fabian David. A Uniform Optimization Technique for Offset Assignment Problems. In ISSS '98 Hsinchu/Taiwan, December 1998 [BibTeX][PDF][Abstract]@inproceedings { leupers:1998:isss2,
author = {Leupers, Rainer and David, Fabian},
title = {A Uniform Optimization Technique for Offset Assignment Problems},
booktitle = {ISSS '98},
year = {1998},
address = {Hsinchu/Taiwan},
month = {dec},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1998-isss_b.pdf},
confidential = {n},
abstract = {A number of different algorithms for optimized offset assignment in DSP code generation have been developed recently. These algorithms aim at constructing a layout of local variables in memory, such that the addresses of variables can be computed efficiently in most cases. This is achieved by maximizing the use of auto-increment operations on address registers. However, the algorithms published in previous work only consider special cases of offset assignment problems, characterized by fixed parameters such as register file sizes and auto-increment ranges. In contrast, this paper presents a genetic optimization technique capable of simultaneously handling arbitrary register file sizes and auto-increment ranges. Moreover, this technique is the first that integrates the allocation of modify registers into offset assignment. Experimental evaluation indicates a significant improvement in the quality of constructed offset assignments, as compared to previous work.},
} A number of different algorithms for optimized offset assignment in DSP code generation have been developed recently. These algorithms aim at constructing a layout of local variables in memory, such that the addresses of variables can be computed efficiently in most cases. This is achieved by maximizing the use of auto-increment operations on address registers. However, the algorithms published in previous work only consider special cases of offset assignment problems, characterized by fixed parameters such as register file sizes and auto-increment ranges. In contrast, this paper presents a genetic optimization technique capable of simultaneously handling arbitrary register file sizes and auto-increment ranges. Moreover, this technique is the first that integrates the allocation of modify registers into offset assignment. Experimental evaluation indicates a significant improvement in the quality of constructed offset assignments, as compared to previous work.
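To make the underlying cost function concrete, the brute-force sketch below searches for a memory layout of variables that minimizes the number of consecutive accesses falling outside the auto-increment range of an address register. Exhaustive search only works at toy sizes; scaling this up, and additionally allocating modify registers, is exactly what the paper's genetic technique addresses. The access sequence is invented.

```python
from itertools import permutations

def layout_cost(order, access_seq, autoinc_range=1):
    """Address operations not covered by auto-increment of range r."""
    pos = {v: i for i, v in enumerate(order)}
    return sum(1 for a, b in zip(access_seq, access_seq[1:])
               if abs(pos[a] - pos[b]) > autoinc_range)

accesses = "a b c a d a c b".split()
variables = sorted(set(accesses))
best = min(permutations(variables),
           key=lambda o: layout_cost(o, accesses))
print(best, "->", layout_cost(best, accesses), "explicit address ops")
```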
|
| Rainer Leupers and Peter Marwedel. Optimized Array Index Computation in DSP Programs. In ASP-DAC '98 Yokohama/Japan, February 1998 [BibTeX][PDF][Abstract]@inproceedings { leupers:1998:aspdac,
author = {Leupers, Rainer and Marwedel, Peter},
title = {Optimized Array Index Computation in DSP Programs},
booktitle = {ASP-DAC '98},
year = {1998},
address = {Yokohama/Japan},
month = {feb},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1998-asp-dac.pdf},
confidential = {n},
abstract = {An increasing number of components in embedded systems are implemented by software running on embedded processors. This trend creates a need for compilers for embedded processors capable of generating high quality machine code. Particularly for DSPs, such compilers are hardly available, and novel DSP-specific code optimization techniques are required. In this paper we focus on efficient address computation for array accesses in loops. Based on previous work, we present a new and optimal algorithm for address register allocation and provide an experimental evaluation of different algorithms. Furthermore, an efficient and close-to-optimum heuristic is proposed for large problems.},
} An increasing number of components in embedded systems are implemented by software running on embedded processors. This trend creates a need for compilers for embedded processors capable of generating high quality machine code. Particularly for DSPs, such compilers are hardly available, and novel DSP-specific code optimization techniques are required. In this paper we focus on efficient address computation for array accesses in loops. Based on previous work, we present a new and optimal algorithm for address register allocation and provide an experimental evaluation of different algorithms. Furthermore, an efficient and close-to-optimum heuristic is proposed for large problems.
|
| Rainer Leupers and Peter Marwedel. Formale Methoden in der Codeerzeugung für digitale Signalprozessoren. In 5. GI/ITG/GMM Workshop "Methoden des Entwurfs und der Verifikation digitaler Systeme" Linz/Austria, April 1997 [BibTeX][PDF][Abstract]@inproceedings { leupers:1997:itg,
author = {Leupers, Rainer and Marwedel, Peter},
title = {Formale Methoden in der Codeerzeugung f\"ur digitale Signalprozessoren},
booktitle = {5. GI/ITG/GMM Workshop "Methoden des Entwurfs und der Verifikation digitaler Systeme"},
year = {1997},
address = {Linz/Austria},
month = {apr},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1997-gi-itg.pdf},
confidential = {n},
abstract = {Besides methods for HW/SW partitioning and hardware synthesis, the area of HW/SW codesign for embedded systems necessarily also comprises techniques for code generation for embedded programmable processors. Especially in the case of digital signal processors (DSPs), the quality of available compilers is insufficient. To avoid costly assembly-level programming, new DSP-specific code generation techniques are therefore required. This contribution presents the compiler RECORD, which translates high-level language programs into machine code for a class of DSPs. To meet the special demands on compilers for DSPs, formal methods are employed in part. We present two such methods developed for RECORD, which are used for the analysis of processor models and for code compaction, and we discuss their practical application.},
} Besides methods for HW/SW partitioning and hardware synthesis, the area of HW/SW codesign for embedded systems necessarily also comprises techniques for code generation for embedded programmable processors. Especially in the case of digital signal processors (DSPs), the quality of available compilers is insufficient. To avoid costly assembly-level programming, new DSP-specific code generation techniques are therefore required. This contribution presents the compiler RECORD, which translates high-level language programs into machine code for a class of DSPs. To meet the special demands on compilers for DSPs, formal methods are employed in part. We present two such methods developed for RECORD, which are used for the analysis of processor models and for code compaction, and we discuss their practical application.
|
| Rainer Leupers and Peter Marwedel. Optimierende Compiler für DSPs: Was ist verfügbar ? (in German). In DSP Deutschland '97 Munich/Germany, October 1997 [BibTeX][PDF][Abstract]@inproceedings { leupers:1997:dsp,
author = {Leupers, Rainer and Marwedel, Peter},
title = {Optimierende Compiler f\"ur DSPs: Was ist verf\"ugbar ? (in German)},
booktitle = {DSP Deutschland '97},
year = {1997},
address = {Munich/Germany},
month = {oct},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1997-dsp-deutschland.pdf},
confidential = {n},
abstract = {Software development for embedded processors today still takes place largely at the assembly level. The reason for this situation, which is hardly tenable in the long run, is the lack of good C compilers. In recent years, however, substantial progress has been made in code optimization, especially for DSPs, progress which so far has found its way into commercial products only insufficiently. This contribution identifies the principal sources of optimization and summarizes the state of the art. The central methods are complex optimization procedures that go beyond traditional compiler technology, as well as the exploitation of DSP-specific hardware architectures for the efficient translation of C language constructs into DSP machine instructions. Some of these methods can also be applied to assembly programs in general (whether generated by compilers or written by hand).},
} Software development for embedded processors today still takes place largely at the assembly level. The reason for this situation, which is hardly tenable in the long run, is the lack of good C compilers. In recent years, however, substantial progress has been made in code optimization, especially for DSPs, progress which so far has found its way into commercial products only insufficiently. This contribution identifies the principal sources of optimization and summarizes the state of the art. The central methods are complex optimization procedures that go beyond traditional compiler technology, as well as the exploitation of DSP-specific hardware architectures for the efficient translation of C language constructs into DSP machine instructions. Some of these methods can also be applied to assembly programs in general (whether generated by compilers or written by hand).
|
| Rainer Leupers and Peter Marwedel. Retargetable Generation of Code Selectors from HDL Processor Models. In European Design & Test Conference (ED & TC) Paris/France, March 1997 [BibTeX][PDF][Abstract]@inproceedings { leupers:1997:edtc,
author = {Leupers, Rainer and Marwedel, Peter},
title = {Retargetable Generation of Code Selectors from HDL Processor Models},
booktitle = {European Design \& Test Conference (ED \& TC)},
year = {1997},
address = {Paris/France},
month = {mar},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1997-edtc.pdf},
confidential = {n},
abstract = {Besides high code quality, a primary issue in embedded code generation is retargetability of code generators. This paper presents techniques for automatic generation of code selectors from externally specified processor models. In contrast to previous work, our retargetable compiler RECORD does not require tool-specific modelling formalisms, but starts from general HDL processor models. From an HDL model, all processor aspects needed for code generation are automatically derived. As demonstrated by experimental results, short turnaround times for retargeting are achieved, which makes it possible to study the HW/SW trade-off between processor architectures and program execution speed.},
} Besides high code quality, a primary issue in embedded code generation is retargetability of code generators. This paper presents techniques for automatic generation of code selectors from externally specified processor models. In contrast to previous work, our retargetable compiler RECORD does not require tool-specific modelling formalisms, but starts from general HDL processor models. From an HDL model, all processor aspects needed for code generation are automatically derived. As demonstrated by experimental results, short turnaround times for retargeting are achieved, which makes it possible to study the HW/SW trade-off between processor architectures and program execution speed.
|
| Peter Marwedel. Code Generation for Core Processors. In Invited embedded tutorial, 34th Design Automation Conference 1997 [BibTeX][PDF][Abstract]@inproceedings { marwedel:1997:dac,
author = {Marwedel, Peter},
title = {Code Generation for Core Processors},
booktitle = {Invited embedded tutorial, 34th Design Automation Conference},
year = {1997},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1997-f_dac.pdf},
confidential = {n},
abstract = {This tutorial responds to the rapidly increasing use of cores in general and of processor cores in particular for implementing systems-on-a-chip. In the first part of this text, we will provide a brief introduction to various cores. Applications can be found in most segments of the embedded systems market. These applications demand extreme efficiency, in particular efficient processor architectures and efficient embedded software. In the second part of this text, we will show that current compilers do not provide the required efficiency and we will give an overview of new compiler optimization techniques, which aim at making assembly language programming for embedded software obsolete. These new techniques take advantage of the special characteristics of embedded software and embedded architectures. Due to efficiency considerations, processor architectures optimized for application domains or even for particular applications are of interest. This results in a large number of architectures and instruction sets, leading to the requirement for retargeting compilers to those numerous architectures. In the final section of the tutorial, we will present techniques for retargeting compilers to new architectures easily. We will show how compilers can be generated from descriptions of processors. One of the approaches closes the gap which so far existed between electronic CAD and compiler generation.},
} This tutorial responds to the rapidly increasing use of cores in general and of processor cores in particular for implementing systems-on-a-chip. In the first part of this text, we will provide a brief introduction to various cores. Applications can be found in most segments of the embedded systems market. These applications demand extreme efficiency, in particular efficient processor architectures and efficient embedded software. In the second part of this text, we will show that current compilers do not provide the required efficiency and we will give an overview of new compiler optimization techniques, which aim at making assembly language programming for embedded software obsolete. These new techniques take advantage of the special characteristics of embedded software and embedded architectures. Due to efficiency considerations, processor architectures optimized for application domains or even for particular applications are of interest. This results in a large number of architectures and instruction sets, leading to the requirement for retargeting compilers to those numerous architectures. In the final section of the tutorial, we will present techniques for retargeting compilers to new architectures easily. We will show how compilers can be generated from descriptions of processors. One of the approaches closes the gap which so far existed between electronic CAD and compiler generation.
|
| Peter Marwedel. Processor-Core Based Design and Test - Invited Embedded Tutorial. In Asia and South Pacific Design Automation Conference 1998 Tokyo, Japan, 1997 [BibTeX][PDF][Abstract]@inproceedings { marwedel:1997:tut,
author = {Marwedel, Peter},
title = {Processor-Core Based Design and Test - Invited Embedded Tutorial},
booktitle = {Asia and South Pacific Design Automation Conference 1998},
year = {1997},
address = {Tokyo, Japan},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1997-aspdactut.pdf},
confidential = {n},
abstract = {This tutorial responds to the rapidly increasing use of various cores for implementing systems-on-a-chip. It specifically focuses on processor cores. We will give some examples of cores, including DSP cores and application-specific instruction-set processors (ASIPs). We will mention market trends for these components, and we will touch on design procedures, in particular the use of compilers. Finally, we will discuss the problem of testing core-based designs. Existing solutions include boundary scan, embedded in-circuit emulation (ICE), the use of processor resources for stimuli/response compaction and self-test programs.},
} This tutorial responds to the rapidly increasing use of various cores for implementing systems-on-a-chip. It specifically focuses on processor cores. We will give some examples of cores, including DSP cores and application-specific instruction-set processors (ASIPs). We will mention market trends for these components, and we will touch on design procedures, in particular the use of compilers. Finally, we will discuss the problem of testing core-based designs. Existing solutions include boundary scan, embedded in-circuit emulation (ICE), the use of processor resources for stimuli/response compaction and self-test programs.
|
| Renate Beckmann and Jürgen Herrmann. Using Constraint Logic Programming in Memory Synthesis for General Purpose Computers. In European Design & Test Conference (ED & TC) Paris/France, March 1997 [BibTeX][PDF][Abstract]@inproceedings { beckmann:1997:edtc,
author = {Beckmann, Renate and Herrmann, J\"urgen},
title = {Using Constraint Logic Programming in Memory Synthesis for General Purpose Computers},
booktitle = {European Design \& Test Conference (ED \& TC)},
year = {1997},
address = {Paris/France},
month = {mar},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1997-edtc-poster.pdf},
confidential = {n},
abstract = {In modern computer systems the performance is dominated by the memory performance. Currently, there is neither a systematic design methodology nor a tool for the design of memory systems for general purpose computers. We present a first approach to CAD support for this crucial subtask of system level design. Dependencies between influencing factors and design decisions are explicitly represented by constraints, and constraint logic programming is used to make the design decisions. The memory design is optimized with respect to several objectives by iterating the (re)design cycle. Event driven simulation is used for evaluation of the intermediate results. The system is organized as an interactive design assistant.},
} In modern computer systems the performance is dominated by the memory performance. Currently, there is neither a systematic design methodology nor a tool for the design of memory systems for general purpose computers. We present a first approach to CAD support for this crucial subtask of system level design. Dependencies between influencing factors and design decisions are explicitly represented by constraints, and constraint logic programming is used to make the design decisions. The memory design is optimized with respect to several objectives by iterating the (re)design cycle. Event driven simulation is used for evaluation of the intermediate results. The system is organized as an interactive design assistant.
|
| Birger Landwehr, Peter Marwedel, Ingolf Markhof and Rainer Dömer. Exploiting Isomorphism for Speeding-Up Instance-Binding in an Integrated Scheduling, Allocation and Assignment Approach to Architectural Synthesis. In Conference on Computer Hardware Description Languages and their Applications Toledo, Spain, April 1997 [BibTeX][PDF][Abstract]@inproceedings { landwehr:1997:chdl,
author = {Landwehr, Birger and Marwedel, Peter and Markhof, Ingolf and D\"omer, Rainer},
title = {Exploiting Isomorphism for Speeding-Up Instance-Binding in an Integrated Scheduling, Allocation and Assignment Approach to Architectural Synthesis},
booktitle = {Conference on Computer Hardware Description Languages and their Applications},
year = {1997},
address = {Toledo, Spain},
month = {apr},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1997-chdl.pdf},
confidential = {n},
abstract = {Register-Transfer (RT-) level netlists are said to be isomorphic if they can be made identical by relabeling RT-components. RT-netlists can be generated by architectural synthesis. In order to consider just the essential design decisions, architectural synthesis should consider only a single representative of sets of isomorphic netlists. In this paper, we are using netlist isomorphism for the very first time in architectural synthesis. Furthermore, we describe how an integer-programming (IP-) based synthesis technique can be extended to take advantage of netlist isomorphism.},
} Register-Transfer (RT-) level netlists are said to be isomorphic if they can be made identical by relabeling RT-components. RT-netlists can be generated by architectural synthesis. In order to consider just the essential design decisions, architectural synthesis should consider only a single representative of sets of isomorphic netlists. In this paper, we are using netlist isomorphism for the very first time in architectural synthesis. Furthermore, we describe how an integer-programming (IP-) based synthesis technique can be extended to take advantage of netlist isomorphism.
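A cheap way to bucket candidate-isomorphic netlists, sketched below, is an iterative label-refinement fingerprint (in the style of one-dimensional Weisfeiler-Leman graph refinement): each component's label is repeatedly rebuilt from its type and its neighbours' labels. Equal fingerprints are necessary but not sufficient for isomorphism, so they can prune representatives before any exact check. The netlist encoding and example components are invented; the paper integrates isomorphism exploitation into its IP-based synthesis rather than using this particular test.

```python
def fingerprint(netlist, rounds=3):
    """netlist: {component: (type, [neighbour components])}"""
    labels = {c: t for c, (t, _) in netlist.items()}
    for _ in range(rounds):
        labels = {c: (labels[c], tuple(sorted(labels[n] for n in nbrs)))
                  for c, (_, nbrs) in netlist.items()}
    return sorted(map(str, labels.values()))

n1 = {"u1": ("mul", ["u3"]), "u2": ("add", ["u3"]),
      "u3": ("reg", ["u1", "u2"])}
n2 = {"x9": ("add", ["x7"]), "x2": ("mul", ["x7"]),
      "x7": ("reg", ["x9", "x2"])}
print(fingerprint(n1) == fingerprint(n2))   # True: equal up to relabeling
```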
|
| Birger Landwehr and Peter Marwedel. A New Optimization Technique for Improving Resource Exploitation and Critical Path Minimization. In 10th International Symposium on System Synthesis Antwerp, Belgium, 1997 [BibTeX][PDF][Abstract]@inproceedings { landwehr:1997:isss,
author = {Landwehr, Birger and Marwedel, Peter},
title = {A New Optimization Technique for Improving Resource Exploitation and Critical Path Minimization},
booktitle = {10th International Symposium on System Synthesis},
year = {1997},
address = {Antwerp, Belgium},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1997-isss.pdf},
confidential = {n},
abstract = {This paper presents a novel approach to algebraic optimization of data-flow graphs in the domain of computationally intensive applications. The presented approach is based upon the paradigm of simulated evolution which has been proven to be a powerful method for solving large non-linear optimization problems. We introduce a genetic algorithm with a new chromosomal representation of data-flow graphs that serves as a basis for preserving the correctness of algebraic transformations and allows an efficient implementation of the genetic operators. Furthermore, we introduce a new class of hardware-related transformation rules which for the first time make it possible to take existing component libraries into account. The efficiency of our method is demonstrated by encouraging experimental results for several standard benchmarks.},
} This paper presents a novel approach to algebraic optimization of data-flow graphs in the domain of computationally intensive applications. The presented approach is based upon the paradigm of simulated evolution which has been proven to be a powerful method for solving large non-linear optimization problems. We introduce a genetic algorithm with a new chromosomal representation of data-flow graphs that serves as a basis for preserving the correctness of algebraic transformations and allows an efficient implementation of the genetic operators. Furthermore, we introduce a new class of hardware-related transformation rules which for the first time make it possible to take existing component libraries into account. The efficiency of our method is demonstrated by encouraging experimental results for several standard benchmarks.
|
| Rainer Leupers and Peter Marwedel. Retargetable Compilers for Embedded DSPs. In 7th European Multimedia, Microprocessor Systems and Electronic Commerce Conference (EMMSEC) Florence/Italy, November 1997 [BibTeX][PDF][Abstract]@inproceedings { leupers:1997:emmsec,
author = {Leupers, Rainer and Marwedel, Peter},
title = {Retargetable Compilers for Embedded DSPs},
booktitle = {7th European Multimedia, Microprocessor Systems and Electronic Commerce Conference (EMMSEC)},
year = {1997},
address = {Florence/Italy},
month = {nov},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1997-emmsec.pdf},
confidential = {n},
abstract = {Programmable devices are a key technology for the design of embedded systems, such as in the consumer electronics market. Processor cores are used as building blocks for more and more embedded system designs, since they provide a unique combination of features: flexibility and reusability. Processor-based design implies that compilers capable of generating efficient machine code are necessary. However, highly efficient compilers for embedded processors are hardly available. In particular, this holds for digital signal processors (DSPs). This contribution is intended to outline different aspects of DSP compiler technology. First, we cover demands on compilers for embedded DSPs, which are partially in sharp contrast to traditional compiler construction. Secondly, we present recent advances in DSP code optimization techniques, which explore a comparatively large search space in order to achieve high code quality. Finally, we discuss the different approaches to retargetability of compilers, that is, techniques for automatic generation of compilers from processor models.},
} Programmable devices are a key technology for the design of embedded systems, such as in the consumer electronics market. Processor cores are used as building blocks for more and more embedded system designs, since they provide a unique combination of features: flexibility and reusability. Processor-based design implies that compilers capable of generating efficient machine code are necessary. However, highly efficient compilers for embedded processors are hardly available. In particular, this holds for digital signal processors (DSPs). This contribution is intended to outline different aspects of DSP compiler technology. First, we cover demands on compilers for embedded DSPs, which are partially in sharp contrast to traditional compiler construction. Secondly, we present recent advances in DSP code optimization techniques, which explore a comparatively large search space in order to achieve high code quality. Finally, we discuss the different approaches to retargetability of compilers, that is, techniques for automatic generation of compilers from processor models.
|
| Peter Marwedel and Rainer Dömer. Introducing Complex Components into Architectural Synthesis. In Asia South Pacific Design Automation Conference (ASP-DAC) Chiba, Japan, January 1997 [BibTeX][PDF][Abstract]@inproceedings { marwedel:1997:aspdac,
author = {Marwedel, Peter and D\"omer, Rainer},
title = {Introducing Complex Components into Architectural Synthesis},
booktitle = {Asia South Pacific Design Automation Conference (ASP-DAC)},
year = {1997},
address = {Chiba, Japan},
month = {jan},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1997-asp-dac.pdf},
confidential = {n},
abstract = {In this paper, we extend the set of library components which are usually considered in architectural synthesis by components with built-in chaining. For such components, the result of some internally computed arithmetic function is made available as an argument to some other function through a local connection. These components can be used to implement chaining in a data-path in a single component. Components with built-in chaining are combinatorial circuits. They correspond to ``complex gates'' in logic synthesis. If compared to implementations with several components, components with built-in chaining usually provide a denser layout, reduced power consumption, and a shorter delay time. Multiplier/accumulators are the most prominent example of such components. Such components require new approaches for library mapping in architectural synthesis. In this paper, we describe an IP-based approach taken in our OSCAR synthesis system.},
} In this paper, we extend the set of library components which are usually considered in architectural synthesis by components with built-in chaining. For such components, the result of some internally computed arithmetic function is made available as an argument to some other function through a local connection. These components can be used to implement chaining in a data-path in a single component. Components with built-in chaining are combinatorial circuits. They correspond to "complex gates" in logic synthesis. If compared to implementations with several components, components with built-in chaining usually provide a denser layout, reduced power consumption, and a shorter delay time. Multiplier/accumulators are the most prominent example of such components. Such components require new approaches for library mapping in architectural synthesis. In this paper, we describe an IP-based approach taken in our OSCAR synthesis system.
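Library mapping with such components can be illustrated by a greedy tree cover: an addition fed by a multiplication is covered by one MAC cell instead of separate MUL and ADD instances. Greedy covering is not optimal in general, which is one reason the paper uses an IP-based approach; the node format and cell names below are invented.

```python
def cover(node):
    """node: ('op', left, right) or a leaf name; returns a list of cells."""
    if not isinstance(node, tuple):
        return []                                   # inputs need no cell
    op, l, r = node
    if op == "+":
        for mul, other in ((l, r), (r, l)):         # match a*b + c or c + a*b
            if isinstance(mul, tuple) and mul[0] == "*":
                return ["MAC"] + cover(mul[1]) + cover(mul[2]) + cover(other)
        return ["ADD"] + cover(l) + cover(r)
    if op == "*":
        return ["MUL"] + cover(l) + cover(r)
    raise ValueError("unknown operator: " + op)

print(cover(("+", ("*", "a", "b"), "c")))                 # ['MAC']
print(cover(("+", ("*", "a", "b"), ("*", "c", "d"))))     # ['MAC', 'MUL']
```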
|
| Peter Marwedel. Compilers for Embedded Processors. In Invited Embedded Tutorial SASIMI, Osaka, 1997 [BibTeX][PDF][Abstract]@inproceedings { marwedel:1997:sasimi,
author = {Marwedel, Peter},
title = {Compilers for Embedded Processors},
booktitle = {Invited Embedded Tutorial},
year = {1997},
address = {SASIMI, Osaka},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1997-sasimi.pdf},
confidential = {n},
abstract = {This talk responds to the rapidly increasing use of embedded processors for implementing systems. Such processors come in the form of discrete processors as well as in the form of core processors. They are available both from vendors and within system companies. Applications can be found in most segments of the embedded system market, such as automotive electronics and telecommunications. These applications demand extremely efficient processor architectures, optimized for a certain application domain or even a certain application. Current compiler technology supports these architectures very poorly and has recently been recognized as a major bottleneck for designing systems quickly, efficiently and reliably. A number of recent research projects aim at removing this bottleneck. The talk will briefly discuss the trend towards embedded processors. We will show market trends and examples of recent embedded processors. We will also introduce the terms "application specific instruction-set processors" (ASIPs), "application-specific signal processors" (ASSPs), "soft cores" and "hard cores". We will then present new code optimization approaches taking the special characteristics of embedded processor architectures into account. In particular, we will present new memory allocation and code compaction algorithms. In the final section of the talk, we will present techniques for retargeting compilers to new architectures easily. These techniques are motivated by the need for domain- or application-dependent optimizations of processor architectures. The scope for such optimizations should not be restricted to hardware architectures but has to include the corresponding work on compilers as well. We will show how compilers can be generated from descriptions of processor architectures. Presented techniques aim at bridging the gap between electronic CAD and compiler generation.},
} This talk responds to the rapidly increasing use of embedded processors for implementing systems. Such processors come in the form of discrete processors as well as in the form of core processors. They are available both from vendors and within system companies. Applications can be found in most segments of the embedded system market, such as automotive electronics and telecommunications. These applications demand extremely efficient processor architectures, optimized for a certain application domain or even a certain application. Current compiler technology supports these architectures very poorly and has recently been recognized as a major bottleneck for designing systems quickly, efficiently and reliably. A number of recent research projects aim at removing this bottleneck. The talk will briefly discuss the trend towards embedded processors. We will show market trends and examples of recent embedded processors. We will also introduce the terms "application specific instruction-set processors" (ASIPs), "application-specific signal processors" (ASSPs), "soft cores" and "hard cores". We will then present new code optimization approaches taking the special characteristics of embedded processor architectures into account. In particular, we will present new memory allocation and code compaction algorithms. In the final section of the talk, we will present techniques for retargeting compilers to new architectures easily. These techniques are motivated by the need for domain- or application-dependent optimizations of processor architectures. The scope for such optimizations should not be restricted to hardware architectures but has to include the corresponding work on compilers as well. We will show how compilers can be generated from descriptions of processor architectures. Presented techniques aim at bridging the gap between electronic CAD and compiler generation.
|
| Ulrich Bieker and Steven Bashford. Scheduling, Compaction and Binding in a Retargetable Code Generator using Constraint Logic Programming. In 4. GI/ITG/GME Workshop "Methoden des Entwurfs und der Verifikation digitaler Systeme" Kreischa/Germany, March 1996 [BibTeX][PDF][Abstract]@inproceedings { bieker:1996:itg,
author = {Bieker, Ulrich and Bashford, Steven},
title = {Scheduling, Compaction and Binding in a Retargetable Code Generator using Constraint Logic Programming},
booktitle = {4. GI/ITG/GME Workshop "Methoden des Entwurfs und der Verifikation digitaler Systeme"},
year = {1996},
address = {Kreischa/Germany},
month = {mar},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1996-gi_itg_gme.pdf},
confidential = {n},
abstract = {Code generation for embedded programmable processors is becoming increasingly important. Many of these processors have irregular architectures and offer instruction-level parallelism (e.g. DSPs). In order to generate code for a wide range of architectures, a code generator should be retargetable. Most of the previous code generation approaches concentrate on the datapath, not taking the peculiarities of the controller into account. The controller can have strange address generation schemes and imposes restrictions on the amount of parallelism in the datapath. In this paper we propose a new method to model all these restrictions and characteristics of the controller uniformly, in order to perform scheduling, compaction and binding in a retargetable code generator. For this, we exploit the programming paradigm of constraint logic programming (CLP). CLP offers a general and uniform model for various constraints, performs consistency checks, and integrates constraint solving techniques.},
} Code generation for embedded programmable processors is becoming increasingly important. Many of these processors have irregular architectures and offer instruction-level parallelism (e.g. DSPs). In order to generate code for a wide range of architectures, a code generator should be retargetable. Most of the previous code generation approaches concentrate on the datapath, not taking the peculiarities of the controller into account. The controller can have strange address generation schemes and imposes restrictions on the amount of parallelism in the datapath. In this paper we propose a new method to model all these restrictions and characteristics of the controller uniformly, in order to perform scheduling, compaction and binding in a retargetable code generator. For this, we exploit the programming paradigm of constraint logic programming (CLP). CLP offers a general and uniform model for various constraints, performs consistency checks, and integrates constraint solving techniques.
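The uniform constraint view can be roughly illustrated by searching for a joint schedule and binding that simultaneously satisfies data dependences, structural hazards, and a controller restriction on what the instruction word can encode per cycle. A real CLP system adds consistency checking and propagation instead of the plain generate-and-test used here; all operations, units, and constraints are invented.

```python
from itertools import product

ops = ["ld", "mul", "add"]                       # add is independent of mul
units = {"ld": ["mem"], "mul": ["mac"], "add": ["alu", "mac"]}
deps = [("ld", "mul")]                           # ld must finish before mul

def ok(assign):                                  # assign: op -> (cycle, unit)
    sched = {op: t for op, (t, _) in assign.items()}
    if any(sched[a] >= sched[b] for a, b in deps):
        return False                             # data dependence violated
    if len(set(assign.values())) < len(assign):
        return False                             # unit used twice in a cycle
    by_cycle = {}
    for t, u in assign.values():
        by_cycle.setdefault(t, set()).add(u)
    # controller constraint: the word cannot encode mem and mac together
    return all(not {"mem", "mac"} <= us for us in by_cycle.values())

domains = [[(t, u) for t in range(2) for u in units[o]] for o in ops]
solutions = [dict(zip(ops, c)) for c in product(*domains)
             if ok(dict(zip(ops, c)))]
print(min(solutions, key=lambda s: max(t for t, _ in s.values())))
# {'ld': (0, 'mem'), 'mul': (1, 'mac'), 'add': (0, 'alu')}: the controller
# constraint forces add onto the ALU in cycle 0, even though the MAC is free
```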
|
| Rainer Leupers and Peter Marwedel. Flexible Compiler-Techniken für anwendungsspezifische DSPs (in German). In DSP Deutschland '96 Munich/Germany, October 1996 [BibTeX][PDF][Abstract]@inproceedings { leupers:1996:dsp,
author = {Leupers, Rainer and Marwedel, Peter},
title = {Flexible Compiler-Techniken f\"ur anwendungsspezifische DSPs (in German)},
booktitle = {DSP Deutschland '96},
year = {1996},
address = {Munich/Germany},
month = {oct},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1996-dsp-d.pdf},
confidential = {n},
abstract = {Insufficient code quality is known to be a main problem of currently available high-level language compilers for DSPs. The need for new DSP-specific compiler techniques is further reinforced by the trend towards application-specific DSPs. These require compilers that are not only optimizing but also flexible, i.e., adaptable to modified target architectures with little effort. In this contribution we describe an approach for the automatic generation of code generators from hardware-oriented, easily modifiable processor models. Furthermore, we present DSP-specific code optimization techniques that aim at the best possible exploitation of potential parallelism.},
} Insufficient code quality is known to be a main problem of currently available high-level language compilers for DSPs. The need for new DSP-specific compiler techniques is further reinforced by the trend towards application-specific DSPs. These require compilers that are not only optimizing but also flexible, i.e., adaptable to modified target architectures with little effort. In this contribution we describe an approach for the automatic generation of code generators from hardware-oriented, easily modifiable processor models. Furthermore, we present DSP-specific code optimization techniques that aim at the best possible exploitation of potential parallelism.
|
| Rainer Leupers and Peter Marwedel. Instruction-Set Modelling for ASIP Code Generation. In 9th Int. Conference on VLSI Design Bangalore/India, January 1996 [BibTeX][PDF][Abstract]@inproceedings { leupers:1996:vlsi,
author = {Leupers, Rainer and Marwedel, Peter},
title = {Instruction-Set Modelling for ASIP Code Generation},
booktitle = {9th Int. Conference on VLSI Design},
year = {1996},
address = {Bangalore/India},
month = {jan},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1996-vlsi-design.pdf},
confidential = {n},
abstract = {A main objective in code generation for ASIPs is to develop retargetable compilers in order to permit exploration of different architectural alternatives within short turnaround time. Retargetability requires that the compiler is supplied with a formal description of the target processor. This description is usually transformed into an internal instruction set model, on which the actual code generation operates. In this contribution we analyze the demands on instruction set models for retargetable code generation, and we present a formal instruction set model which meets these demands. Compared to previous work, it covers a broad range of instruction formats and includes a detailed view of inter-instruction restrictions.},
} A main objective in code generation for ASIPs is to develop retargetable compilers in order to permit exploration of different architectural alternatives within short turnaround time. Retargetability requires that the compiler is supplied with a formal description of the target processor. This description is usually transformed into an internal instruction set model, on which the actual code generation operates. In this contribution we analyze the demands on instruction set models for retargetable code generation, and we present a formal instruction set model which meets these demands. Compared to previous work, it covers a broad range of instruction formats and includes a detailed view of inter-instruction restrictions.
|
| Ralf Niemann and Peter Marwedel. Hardware/Software Partitioning using Integer Programming. In Proc. European Design & Test Conference 1996 [BibTeX][PDF][Abstract]@inproceedings { niemann:1996:edtc,
author = {Niemann, Ralf and Marwedel, Peter},
title = {Hardware/Software Partitioning using Integer Programming},
booktitle = {Proc. European Design \& Test Conference},
year = {1996},
keywords = {hwsw},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1996-edtc.pdf},
confidential = {n},
abstract = {One of the key problems in hardware/software codesign is hardware/software partitioning. This paper describes a new approach to hardware/software partitioning using integer programming (IP). The advantage of using IP is that optimal results are calculated with respect to the chosen objective function. The partitioning approach works fully automatically and supports multi-processor systems, interfacing and hardware sharing. In contrast to other approaches where special estimators are used, we use compilation and synthesis tools for cost estimation. The increased time for calculating the cost metrics is compensated by an improved quality of the estimations compared to the results of estimators. Therefore, fewer iteration steps of partitioning will be needed. The paper will show that using integer programming to solve the hardware/software partitioning problem is feasible and leads to promising results.},
} One of the key problems in hardware/software codesign is hardware/software partitioning. This paper describes a new approach to hardware/software partitioning using integer programming (IP). The advantage of using IP is that optimal results are calculated with respect to the chosen objective function. The partitioning approach works fully automatically and supports multi-processor systems, interfacing and hardware sharing. In contrast to other approaches where special estimators are used, we use compilation and synthesis tools for cost estimation. The increased time for calculating the cost metrics is compensated by an improved quality of the estimations compared to the results of estimators. Therefore, fewer iteration steps of partitioning will be needed. The paper will show that using integer programming to solve the hardware/software partitioning problem is feasible and leads to promising results.
|
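The 0/1 model behind the entry above is easy to illustrate on a toy instance: each task gets a binary hardware/software decision variable, and hardware area is minimized subject to a real-time bound. The sketch below is only a minimal rendering of that idea, not the paper's model: all task names, cost numbers and the deadline are invented, total time is simplified to a plain sum (purely sequential execution, no interfacing costs), and exhaustive enumeration stands in for a real IP solver.

```python
# Minimal 0/1 hardware/software partitioning sketch (illustrative only).
# Task names, costs and the deadline are hypothetical; the paper would
# obtain such numbers from synthesis/compilation tools and use an IP solver.
from itertools import product

# (hw_area, hw_time, sw_time) per task
tasks = {"fir": (40, 2, 9), "fft": (120, 5, 30), "ctrl": (15, 4, 5)}
DEADLINE = 25  # hypothetical real-time constraint

best = None
for assign in product([0, 1], repeat=len(tasks)):  # 1 = hardware, 0 = software
    area = time = 0
    for x, (a, th, ts) in zip(assign, tasks.values()):
        area += x * a                      # hardware area only if hw-mapped
        time += x * th + (1 - x) * ts      # simplistic sequential time model
    if time <= DEADLINE and (best is None or area < best[0]):
        best = (area, dict(zip(tasks, assign)))

print(best)  # minimum-area partition that meets the deadline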
| Rainer Leupers and Peter Marwedel. Algorithms for Address Assignment in DSP Code Generation. In Int. Conf. on Computer-Aided Design (ICCAD) San Jose, November 1996 [BibTeX][PDF][Abstract]@inproceedings { leupers:1996:iccad,
author = {Leupers, Rainer and Marwedel, Peter},
title = {Algorithms for Address Assignment in DSP Code Generation},
booktitle = {Int. Conf. on Computer-Aided Design (ICCAD)},
year = {1996},
address = {San Jose},
month = {nov},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1996-iccad.pdf},
confidential = {n},
abstract = {This paper presents DSP code optimization techniques, which originate from dedicated memory address generation hardware. We define a generic model of DSP address generation units. Based on this model, we present efficient heuristics for computing memory layouts for program variables, which optimize utilization of parallel address generation units. Improvements and generalizations of previous work are described, and the efficacy of the proposed algorithms is demonstrated through experimental evaluation.},
} This paper presents DSP code optimization techniques, which originate from dedicated memory address generation hardware. We define a generic model of DSP address generation units. Based on this model, we present efficient heuristics for computing memory layouts for program variables, which optimize utilization of parallel address generation units. Improvements and generalizations of previous work are described, and the efficacy of the proposed algorithms is demonstrated through experimental evaluation.
|
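A core subproblem behind such address generation units is offset assignment: lay out the variables of a basic block in memory so that consecutive accesses are usually one word apart and can reuse free post-increment/decrement addressing. The sketch below is a generic adjacency-frequency greedy in the spirit of Bartley/Liao-style offset assignment, not the paper's specific algorithms; the access sequence is invented.

```python
# Offset assignment sketch: place variables so that frequent back-to-back
# accesses end up adjacent, letting the address generation unit use free
# post-increment/decrement instead of costly explicit address loads.
from collections import Counter

access = ["a", "b", "a", "c", "b", "d", "c", "d"]  # hypothetical basic block

# Count how often each unordered variable pair is accessed consecutively.
pairs = Counter(frozenset(p) for p in zip(access, access[1:]) if p[0] != p[1])

# Greedily grow one chain: glue the heaviest pair whose endpoint is still
# a free chain end (a deliberately simplified single-chain heuristic).
layout = []
for pair, _ in pairs.most_common():
    u, v = sorted(pair)
    if not layout:
        layout = [u, v]
    elif u in (layout[0], layout[-1]) and v not in layout:
        layout = [v] + layout if u == layout[0] else layout + [v]
    elif v in (layout[0], layout[-1]) and u not in layout:
        layout = [u] + layout if v == layout[0] else layout + [u]
for v in access:                       # append anything never glued in
    if v not in layout:
        layout.append(v)

# Cost = transitions that are NOT +/-1 in the layout (need an address load).
pos = {v: i for i, v in enumerate(layout)}
cost = sum(abs(pos[x] - pos[y]) > 1 for x, y in zip(access, access[1:]))
print(layout, cost)
```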
| Rainer Leupers and Peter Marwedel. Instruction Selection for Embedded DSPs with Complex Instructions. In European Design Automation Conference (EURO-DAC) Geneva/Switzerland, September 1996 [BibTeX][PDF][Abstract]@inproceedings { leupers:1996:eurodac,
author = {Leupers, Rainer and Marwedel, Peter},
title = {Instruction Selection for Embedded DSPs with Complex Instructions},
booktitle = {European Design Automation Conference (EURO-DAC)},
year = {1996},
address = {Geneva/Switzerland},
month = {sep},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1996-eurodac.pdf},
confidential = {n},
abstract = {We address the problem of instruction selection in code generation for embedded digital signal processors. Recent work has shown that this task can be efficiently solved by tree covering with dynamic programming, even in combination with the task of register allocation. However, performing instruction selection by tree covering only does not exploit available instruction-level parallelism, for instance in the form of multiply-accumulate instructions or parallel data moves. In this paper we investigate how such complex instructions may affect detection of optimal tree covers, and we present a two-phase scheme for instruction selection which exploits available instruction-level parallelism. At the expense of higher compilation time, this technique may significantly increase the code quality compared to previous work, which is demonstrated for a widespread DSP.},
} We address the problem of instruction selection in code generation for embedded digital signal processors. Recent work has shown that this task can be efficiently solved by tree covering with dynamic programming, even in combination with the task of register allocation. However, performing instruction selection by tree covering only does not exploit available instruction-level parallelism, for instance in the form of multiply-accumulate instructions or parallel data moves. In this paper we investigate how such complex instructions may affect detection of optimal tree covers, and we present a two-phase scheme for instruction selection which exploits available instruction-level parallelism. At the expense of higher compilation time, this technique may significantly increase the code quality compared to previous work, which is demonstrated for a widespread DSP.
|
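The effect described above, namely that a complex instruction such as a multiply-accumulate can undercut a cover built from simple patterns, can be shown with a minimal recursive tree-covering cost computation. The pattern set and cycle costs below are invented and deliberately tiny; they are not the instruction set of any real DSP.

```python
# Minimal tree-covering sketch for instruction selection: recursively pick
# the cheapest cover of an expression tree; a combined multiply-accumulate
# (MAC) pattern can undercut separate MUL and ADD instructions.

class Node:
    def __init__(self, op, left=None, right=None):
        self.op, self.left, self.right = op, left, right

COST = {"LOAD": 1, "ADD": 1, "MUL": 2, "MAC": 2}  # hypothetical cycle costs

def cover(n):
    """Return the minimal cost of covering the subtree rooted at n."""
    if n.op == "var":
        return COST["LOAD"]
    if n.op == "+":
        base = COST["ADD"] + cover(n.left) + cover(n.right)
        # MAC pattern: '+' with a '*' child collapses into one instruction.
        for mul, other in ((n.left, n.right), (n.right, n.left)):
            if mul.op == "*":
                base = min(base, COST["MAC"] + cover(mul.left)
                                 + cover(mul.right) + cover(other))
        return base
    if n.op == "*":
        return COST["MUL"] + cover(n.left) + cover(n.right)
    raise ValueError(n.op)

v = lambda: Node("var")
expr = Node("+", Node("*", v(), v()), v())   # a*b + c
print(cover(expr))  # 5 with the MAC pattern vs. 6 with separate MUL + ADD
```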
| Rainer Leupers and Peter Marwedel. A BDD-based Frontend for Retargetable Compilers. In Proc. European Design & Test Conference 1995 [BibTeX][PDF][Abstract]@inproceedings { leupers:1995:edtc,
author = {Leupers, Rainer and Marwedel, Peter},
title = {A BDD-based Frontend for Retargetable Compilers},
booktitle = {Proc. European Design \& Test Conference},
year = {1995},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1995-edtc.pdf},
confidential = {n},
abstract = {In this paper we present a unified frontend for retargetable compilers that performs analysis of the target processor model. Our approach bridges the gap between structural and behavioral processor models for retargetable compilation. This is achieved by means of instruction set extraction. The extraction technique is based on a BDD data structure which significantly improves control signal analysis in the target processor compared to previous approaches.},
} In this paper we present a unified frontend for retargetable compilers that performs analysis of the target processor model. Our approach bridges the gap between structural and behavioral processor models for retargetable compilation. This is achieved by means of instruction set extraction. The extraction technique is based on a BDD data structure which significantly improves control signal analysis in the target processor compared to previous approaches.
|
| Rainer Leupers and Peter Marwedel. Time-constrained Code Compaction for DSPs. In Int. Symp. on System Synthesis (ISSS) September 1995 [BibTeX][PDF][Abstract]@inproceedings { leupers:1995:isss,
author = {Leupers, Rainer and Marwedel, Peter},
title = {Time-constrained Code Compaction for DSPs},
booktitle = {Int. Symp. on System Synthesis (ISSS)},
year = {1995},
month = {sep},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1995-isss.pdf},
confidential = {n},
abstract = {DSP algorithms in most cases are subject to hard real-time constraints. In case of programmable DSP processors, meeting those constraints must be ensured by appropriate code generation techniques. For processors offering instruction-level parallelism, the task of code generation includes code compaction. The exact timing behavior of a DSP program is only known after compaction. Therefore, real-time constraints should be taken into account during the compaction phase. While most known DSP code generators rely on rigid heuristics for that phase, this paper proposes a novel approach to local code compaction based on an Integer Programming model, which obeys exact timing constraints. Due to a general problem formulation, the model also obeys encoding restrictions and possible side effects.},
} DSP algorithms in most cases are subject to hard real-time constraints. In case of programmable DSP processors, meeting those constraints must be ensured by appropriate code generation techniques. For processors offering instruction-level parallelism, the task of code generation includes code compaction. The exact timing behavior of a DSP program is only known after compaction. Therefore, real-time constraints should be taken into account during the compaction phase. While most known DSP code generators rely on rigid heuristics for that phase, this paper proposes a novel approach to local code compaction based on an Integer Programming model, which obeys exact timing constraints. Due to a general problem formulation, the model also obeys encoding restrictions and possible side effects.
|
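For contrast with the exact IP model proposed above, the sketch below shows the kind of rigid heuristic the paper argues against: a greedy pass that packs micro-operations into the earliest control step permitted by data dependences and instruction-encoding conflicts, after which the deadline can only be checked, not enforced. Operation names, dependences and field assignments are invented.

```python
# Greedy baseline for local microcode compaction (not the paper's IP model):
# place each micro-operation in the earliest control step allowed by its
# data dependences and by conflicts in the instruction encoding.
ops = ["ld1", "ld2", "mul", "add", "st"]            # topological order assumed
deps = {("ld1", "mul"), ("ld2", "mul"), ("mul", "add"), ("add", "st")}
field = {"ld1": "mem", "ld2": "mem", "mul": "alu", "add": "alu", "st": "mem"}

step = {}
for o in ops:
    # earliest step after all predecessors
    t = max((step[p] + 1 for p, q in deps if q == o), default=0)
    # advance past steps whose instruction word already uses o's field
    while any(step.get(u) == t and field[u] == field[o] for u in step):
        t += 1
    step[o] = t

length = max(step.values()) + 1
print(step, length)   # a timing constraint could only be checked here,
                      # e.g. assert length <= T_MAX for a given bound
```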
| Rainer Leupers and Peter Marwedel. Using Compilers for Heterogeneous System Design. In Int. Conf. on Parallel Architectures and Compilation Techniques (PACT) June 1995 [BibTeX][PDF][Abstract]@inproceedings { leupers:1995:pact,
author = {Leupers, Rainer and Marwedel, Peter},
title = {Using Compilers for Heterogeneous System Design},
booktitle = {Int. Conf. on Parallel Architectures and Compilation Techniques (PACT)},
year = {1995},
month = {jun},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1995-pact.pdf},
confidential = {n},
abstract = {Heterogeneous systems combine both data and control processing functions. A programmable DSP core forms the central component. The design of such systems establishes a new application of compilers in electronic CAD: In order to meet given real-time constraints and optimize chip area consumption, the DSP core needs to be customized for each application. In turn, this requires compiler support for evaluating different architectural alternatives. This paper discusses the importance of retargetable compilers in heterogeneous system design.},
} Heterogeneous systems combine both data and control processing functions. A programmable DSP core forms the central component. The design of such systems establishes a new application of compilers in electronic CAD: In order to meet given real-time constraints and optimize chip area consumption, the DSP core needs to be customized for each application. In turn, this requires compiler support for evaluating different architectural alternatives. This paper discusses the importance of retargetable compilers in heterogeneous system design.
|
| Ulrich Bieker and Peter Marwedel. Retargetable self-test program generation using constraint logic programming. In Proceedings of the 32nd annual ACM/IEEE Design Automation Conference, pages 605--611 New York, NY, USA, 1995 [BibTeX][PDF][Link]@inproceedings { Bieker:1995:RSP:217474.217597,
author = {Bieker, Ulrich and Marwedel, Peter},
title = {Retargetable self-test program generation using constraint logic programming},
booktitle = {Proceedings of the 32nd annual ACM/IEEE Design Automation Conference},
year = {1995},
series = {DAC '95},
pages = {605--611},
address = {New York, NY, USA},
publisher = {ACM},
url = {http://doi.acm.org/10.1145/217474.217597},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1995-dac.pdf},
confidential = {n},
} |
| Ulrich Bieker and Andreas Neumann. Using Logic Programming and Coroutining for electronic CAD.. In The Second International Conference on the Practical Application of Prolog London, April 1994 [BibTeX][PDF][Abstract]@inproceedings { bieker:1994:pap,
author = {Bieker, Ulrich and Neumann, Andreas},
title = {Using Logic Programming and Coroutining for electronic CAD.},
booktitle = {The Second International Conference on the Practical Application of Prolog},
year = {1994},
address = {London},
month = {apr},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1994-pap.pdf},
confidential = {n},
abstract = {We show how an extended Prolog can be exploited to implement different electronic CAD tools. Starting with a computer hardware description language (CHDL), several problems like digital circuit analysis, simulation, test generation and code generation for programmable microprocessors are discussed. For that purpose the MIMOLA (machine independent microprogramming language) system MSS (MIMOLA hardware design system) is presented. Several advantages obtained by applying techniques of logic programming to solve problems in the area of integrated circuit design are shown. In particular, maintenance, small source code, backtracking and the extension of standard Prolog by a coroutining mechanism to express Boolean constraints are pointed out.},
} We show how an extended Prolog can be exploited to implement different electronic CAD tools. Starting with a computer hardware description language (CHDL), several problems like digital circuit analysis, simulation, test generation and code generation for programmable microprocessors are discussed. For that purpose the MIMOLA (machine independent microprogramming language) system MSS (MIMOLA hardware design system) is presented. Several advantages obtained by applying techniques of logic programming to solve problems in the area of integrated circuit design are shown. In particular, maintenance, small source code, backtracking and the extension of standard Prolog by a coroutining mechanism to express Boolean constraints are pointed out.
|
| Ingolf Markhof. High-Level-Synthesis by Constraint Logic Programming. In GI/ITG-Workshop "Anwendung formaler Methoden im Systementwurf" Frankfurt a.M., March 1994 [BibTeX][PDF][Abstract]@inproceedings { markhof:1994:gi,
author = {Markhof, Ingolf},
title = {High-Level-Synthesis by Constraint Logic Programming},
booktitle = {GI/ITG-Workshop "Anwendung formaler Methoden im Systementwurf"},
year = {1994},
address = {Frankfurt a.M.},
month = {mar},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1994-gi-itg.pdf},
confidential = {n},
abstract = {Integer programming has become popular for synthesis since it allows computing optimal solutions by efficient formal methods. The drawback of this approach to synthesis is its restricted mathematical model. We adopted the basic idea of handling the synthesis problem as a constraint satisfaction problem and focus on solving it by constraint search. We use constraint logic programming, which is more flexible with respect to the representation of constraints.},
} Integer programming has become popular for synthesis since it allows computing optimal solutions by efficient formal methods. The drawback of this approach to synthesis is its restricted mathematical model. We adopted the basic idea of handling the synthesis problem as a constraint satisfaction problem and focus on solving it by constraint search. We use constraint logic programming, which is more flexible with respect to the representation of constraints.
|
| Renate Beckmann, Ulrich Bieker and Ingolf Markhof. Application of Constraint Logic Programming for VLSI CAD Tools.. In Proceedings of the Conference Constraints in Computational Logics (CCL) September 1994 [BibTeX][PDF][Abstract]@inproceedings { beckmann:1994:ccl,
author = {Beckmann, Renate and Bieker, Ulrich and Markhof, Ingolf},
title = {Application of Constraint Logic Programming for VLSI CAD Tools.},
booktitle = {Proceedings of the Conference Constraints in Computational Logics (CCL)},
year = {1994},
month = {sep},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1994-ccl.pdf},
confidential = {n},
abstract = {This paper describes the application of CLP (constraint logic programming) to several digital circuit design problems. It is shown that logic programming together with efficient constraint propagation techniques is an adequate programming environment for complex real world problems like high level synthesis, simulation, code generation, and memory synthesis. Different types of constraints - Boolean, integer, symbolic, structural, and type binding ones - are used to express relations between the components of a digital circuit, and efficient propagation is achieved by the coroutining mechanism. To deal with the increasing complexity of digital circuits, we use HDLs (hardware description languages) to represent the structure and behaviour of circuits.},
} This paper describes the application of CLP (constraint logic programming) to several digital circuit design problems. It is shown that logic programming together with efficient constraint propagation techniques is an adequate programming environment for complex real world problems like high level synthesis, simulation, code generation, and memory synthesis. Different types of constraints - Boolean, integer, symbolic, structural, and type binding ones - are used to express relations between the components of a digital circuit, and efficient propagation is achieved by the coroutining mechanism. To deal with the increasing complexity of digital circuits, we use HDLs (hardware description languages) to represent the structure and behaviour of circuits.
|
| Rainer Leupers, Wolfgang Schenk and Peter Marwedel. Retargetable Assembly Code Generation by Bootstrapping. In Proc. 7th International Symposium on High-Level Synthesis 1994 [BibTeX][PDF][Abstract]@inproceedings { leupers:1994:hlss,
author = {Leupers, Rainer and Schenk, Wolfgang and Marwedel, Peter},
title = {Retargetable Assembly Code Generation by Bootstrapping},
booktitle = {Proc. 7th International Symposium on High-Level Synthesis},
year = {1994},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1994-hlss.pdf},
confidential = {n},
abstract = {In a hardware/software codesign environment, compilers are needed that map software components of a partitioned system behavioral description onto a programmable processor. Since the processor structure is not static, but can repeatedly change during the design process, the compiler should be retargetable in order to avoid manual compiler adaptation for each alternative architecture. A restriction of existing retargetable compilers is that they only generate microcode for the target architecture instead of machine-level code. In this paper we introduce a bootstrapping technique that permits translating high-level language (HLL) programs into real machine-level code using a retargetable microcode compiler. Retargetability is preserved, permitting comparison of different architectural alternatives in a codesign framework within a relatively short time.},
} In a hardware/software codesign environment, compilers are needed that map software components of a partitioned system behavioral description onto a programmable processor. Since the processor structure is not static, but can repeatedly change during the design process, the compiler should be retargetable in order to avoid manual compiler adaptation for each alternative architecture. A restriction of existing retargetable compilers is that they only generate microcode for the target architecture instead of machine-level code. In this paper we introduce a bootstrapping technique that permits translating high-level language (HLL) programs into real machine-level code using a retargetable microcode compiler. Retargetability is preserved, permitting comparison of different architectural alternatives in a codesign framework within a relatively short time.
|
| Rainer Leupers, Ralf Niemann and Peter Marwedel. Methods for Retargetable DSP Code Generation. In IEEE Workshop on VLSI Signal Processing 1994 1994 [BibTeX][PDF][Abstract]@inproceedings { leupers:1994:ieee,
author = {Leupers, Rainer and Niemann, Ralf and Marwedel, Peter},
title = {Methods for Retargetable DSP Code Generation},
booktitle = {IEEE Workshop on VLSI Signal Processing 1994},
year = {1994},
keywords = {hwsw},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1994-IEEE.pdf},
confidential = {n},
abstract = {Efficient embedded DSP system design requires methods of hardware/software codesign. In this contribution we focus on software synthesis for partitioned system behavioral descriptions. In previous approaches, this task is performed by compiling the behavioral descriptions onto standard processors using target-specific compilers. It is argued that abandoning this restriction allows for higher degrees of freedom in design space exploration. In turn, this demands retargetable code generation tools. We present different schemes for DSP code generation using the MSSQ microcode generator. Experiments with industrial applications revealed that retargetable DSP code generation based on structural hardware descriptions is feasible, but there exists a strong dependency between the behavioral description style and the resulting code quality. As a result, necessary features of high-quality retargetable DSP code generators are identified.},
} Efficient embedded DSP system design requires methods of hardware/software codesign. In this contribution we focus on software synthesis for partitioned system behavioral descriptions. In previous approaches, this task is performed by compiling the behavioral descriptions onto standard processors using target-specific compilers. It is argued that abandoning this restriction allows for higher degrees of freedom in design space exploration. In turn, this demands retargetable code generation tools. We present different schemes for DSP code generation using the MSSQ microcode generator. Experiments with industrial applications revealed that retargetable DSP code generation based on structural hardware descriptions is feasible, but there exists a strong dependency between the behavioral description style and the resulting code quality. As a result, necessary features of high-quality retargetable DSP code generators are identified.
|
| Rainer Leupers, Wolfgang Schenk and Peter Marwedel. Microcode Generation for Flexible Parallel Target Architectures. In IFIP Trans. A-50: Parallel Architectures and Compilation Techniques (PACT-94) 1994 [BibTeX][PDF][Abstract]@inproceedings { leupers:1994:pact,
author = {Leupers, Rainer and Schenk, Wolfgang and Marwedel, Peter},
title = {Microcode Generation for Flexible Parallel Target Architectures},
booktitle = {IFIP Trans.\ A-50: Parallel Architectures and Compilation Techniques (PACT-94)},
year = {1994},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1994-pact.pdf},
confidential = {n},
abstract = {Advanced architectural features of microprocessors like instruction level parallelism and pipelined functional hardware units require code generation techniques beyond the scope of traditional compilers. Additionally, recent design styles in the area of digital signal processing pose a strong demand for retargetable compilation. This paper presents an approach to code generation based on netlist descriptions of the target processor. The basic features of the MSSQ microcode compiler are outlined, and novel techniques for handling complex hardware modules and multi-cycle operations are presented.},
} Advanced architectural features of microprocessors like instruction level parallelism and pipelined functional hardware units require code generation techniques beyond the scope of traditional compilers. Additionally, recent design styles in the area of digital signal processing pose a strong demand for retargetable compilation. This paper presents an approach to code generation based on netlist descriptions of the target processor. The basic features of the MSSQ microcode compiler are outlined, and novel techniques for handling complex hardware modules and multi-cycle operations are presented.
|
| Rainer Leupers and Peter Marwedel. Instruction Set Extraction From Programmable Structures.. In Proc. EURO-DAC 1994 1994 [BibTeX][PDF][Abstract]@inproceedings { leupers:1994:eurodac,
author = {Leupers, Rainer and Marwedel, Peter},
title = {Instruction Set Extraction From Programmable Structures.},
booktitle = {Proc. EURO-DAC 1994},
year = {1994},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1994-eurodac-extract.pdf},
confidential = {n},
abstract = {Due to the demand for more design flexibility and design reuse, ASIPs have emerged as a new important design style in the area of DSP systems. In order to obtain efficient hardware/software partitionings within ASIP-based systems, the designer has to be supported by CAD tools that allow frequent re-mapping of algorithms onto variable programmable target structures. This leads to a new class of design tools: retargetable compilers. Considering existing retargetable compilers based on pattern matching, automatic instruction set extraction is identified as a profitable frontend for those compilers. This paper presents concepts and an implementation of an instruction set extractor.},
} Due to the demand for more design flexibility and design reuse, ASIPs have emerged as a new important design style in the area of DSP systems. In order to obtain efficient hardware/software partitionings within ASIP-based systems, the designer has to be supported by CAD tools that allow frequent re-mapping of algorithms onto variable programmable target structures. This leads to a new class of design tools: retargetable compilers. Considering existing retargetable compilers based on pattern matching, automatic instruction set extraction is identified as a profitable frontend for those compilers. This paper presents concepts and an implementation of an instruction set extractor.
|
| Birger Landwehr, Peter Marwedel and Rainer Doemer. Optimum Simultaneous Scheduling, Allocation and Resource Binding Based on Integer Programming.. In Eurodac 1994 Grenoble, September 1994 [BibTeX][PDF][Abstract]@inproceedings { landwehr:1994:eurodac,
author = {Landwehr, Birger and Marwedel, Peter and Doemer, Rainer},
title = {Optimum Simultaneous Scheduling, Allocation and Resource Binding Based on Integer Programming.},
booktitle = {Eurodac 1994},
year = {1994},
address = {Grenoble},
month = {sep},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1994-eurodac-oscar.pdf},
confidential = {n},
abstract = {This paper presents an approach to high-level synthesis which is based upon a 0/1 integer programming model. In contrast to other approaches, this model allows solving all three subtasks of high-level synthesis (scheduling, allocation and binding) simultaneously. As a result, designs which are optimal with respect to the cost function are generated. The model is able to exploit large component libraries with multi-functional units and complex components such as multiplier-accumulators. Furthermore, the model is capable of handling mixed speeds and chaining in its general form.},
} This paper presents an approach to high-level synthesis which is based upon a 0/1 integer programming model. In contrast to other approaches, this model allows solving all three subtasks of high-level synthesis (scheduling, allocation and binding) simultaneously. As a result, designs which are optimal with respect to the cost function are generated. The model is able to exploit large component libraries with multi-functional units and complex components such as multiplier-accumulators. Furthermore, the model is capable of handling mixed speeds and chaining in its general form.
|
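The shape of such a 0/1 model can be sketched abstractly. In the simplified formulation below (single-cycle operations, one resource count per module type; not the paper's exact model), x_{o,t,m} = 1 means operation o is scheduled in control step t and bound to a module of type m, so scheduling, allocation and binding are indeed decided by one variable set:

```latex
% Simplified 0/1 IP coupling scheduling, allocation and binding
% (single-cycle operations; a sketch, not the paper's exact model).
\begin{align*}
  \min\;       & \textstyle\sum_{m} c_m N_m
               && \text{total component cost}\\
  \text{s.t.}\;& \textstyle\sum_{t}\sum_{m \in M(o)} x_{o,t,m} = 1
               && \text{schedule and bind each operation once}\\
               & \textstyle\sum_{o} x_{o,t,m} \le N_m
               && \text{at most } N_m \text{ modules of type } m \text{ busy in step } t\\
               & \textstyle\sum_{t,m} t\,x_{o',t,m} \;\ge\; \textstyle\sum_{t,m} t\,x_{o,t,m} + 1
               && \text{for each data dependence } o \to o'
\end{align*}
```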
| Ulrich Bieker. On the Formal Semantics of a CHDL - A Case Study.. In GI/ITG-Workshop: Formale Methoden zum Entwurf korrekter Systeme Bad Herrenalb, March 1993 [BibTeX][PDF][Abstract]@inproceedings { bieker:1993:gi,
author = {Bieker, Ulrich},
title = {On the Formal Semantics of a CHDL - A Case Study.},
booktitle = {GI/ITG-Workshop: Formale Methoden zum Entwurf korrekter Systeme},
year = {1993},
address = {Bad Herrenalb},
month = {mar},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1993-gi-itg.pdf},
confidential = {n},
abstract = {The semantics of HDL descriptions influences all facets of VLSI design such as synthesis, test, verification, logic simulation and fault simulation. In this paper, the formal semantics of the intermediate language TREEMOLA, used in the MIMOLA hardware design system MSS, is presented. In particular, the semantics of module declarations, described at the register-transfer level in the CHDL MIMOLA, is defined.},
} The semantics of HDL descriptions influences all facets of VLSI design such as synthesis, test, verification, logic simulation and fault simulation. In this paper, the formal semantics of the intermediate language TREEMOLA, used in the MIMOLA hardware design system MSS, is presented. In particular, the semantics of module declarations, described at the register-transfer level in the CHDL MIMOLA, is defined.
|
| Ulrich Bieker and Andreas Neumann. Using Logic Programming and Coroutining for VLSI Design.. In 9. Workshop Logische Programmierung Hagen, October 1993 [BibTeX][PDF][Abstract]@inproceedings { bieker:1993:wlp,
author = {Bieker, Ulrich and Neumann, Andreas},
title = {Using Logic Programming and Coroutining for VLSI Design.},
booktitle = {9. Workshop Logische Programmierung},
year = {1993},
address = {Hagen},
month = {oct},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1993-wlp.pdf},
confidential = {n},
abstract = {We show how an extended Prolog can be exploited to implement different electronic CAD tools. Starting with a computer hardware description language (CHDL), several problems like digital circuit analysis, simulation and code generation for programmable microprocessors are discussed. For that purpose a part of the MIMOLA (machine independent microprogramming language) system MSS (MIMOLA hardware design system) is presented. Several advantages obtained by applying techniques of logic programming to solve problems in the area of integrated circuit design are shown. In particular, maintenance, small source code, backtracking and the extension of standard Prolog by a coroutining mechanism to express Boolean constraints are pointed out.},
} We show how an extended Prolog can be exploited to implement different electronic CAD tools. Starting with a computer hardware description language (CHDL), several problems like digital circuit analysis, simulation and code generation for programmable microprocessors are discussed. For that purpose a part of the MIMOLA (machine independent microprogramming language) system MSS (MIMOLA hardware design system) is presented. Several advantages obtained by applying techniques of logic programming to solve problems in the area of integrated circuit design are shown. In particular, maintenance, small source code, backtracking and the extension of standard Prolog by a coroutining mechanism to express Boolean constraints are pointed out.
|
| Lorenz Ladage and Rainer Leupers. Resistance Extraction using a Routing Algorithm.. In Proc. 30th Design Automation Conference, pages 38-42 June 1993 [BibTeX][PDF][Abstract]@inproceedings { ladage:1993:dac,
author = {Ladage, Lorenz and Leupers, Rainer},
title = {Resistance Extraction using a Routing Algorithm.},
booktitle = {Proc. 30th Design Automation Conference},
year = {1993},
pages = {38-42},
month = {jun},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1993-dac.pdf},
confidential = {n},
abstract = {This paper presents a new algorithm for calculating the resistance of an arbitrarily shaped polygon within a VLSI mask layout analysis program. In contrast to earlier approaches no polygon decomposition is required. Instead the current flow is determined by a routing algorithm. The resistance approximation is derived from the current flow. Experimental results have shown that this new algorithm achieves accurate results in comparatively little time.},
} This paper presents a new algorithm for calculating the resistance of an arbitrarily shaped polygon within a VLSI mask layout analysis program. In contrast to earlier approaches no polygon decomposition is required. Instead the current flow is determined by a routing algorithm. The resistance approximation is derived from the current flow. Experimental results have shown that this new algorithm achieves accurate results in comparatively little time.
|
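As a toy illustration of estimating resistance without polygon decomposition, one can rasterize the polygon into unit squares, let a BFS-style router find a path between the two terminals, and multiply the path's square count by the sheet resistance. This is only a caricature of the paper's method (which derives the estimate from the computed current flow); the geometry and the sheet-resistance constant below are invented.

```python
# Toy routing-based resistance estimate: BFS over a unit-square raster of
# the polygon; path length approximates the number of squares between the
# terminals, so R ~ squares * sheet resistance. Illustrative only.
from collections import deque

R_SHEET = 0.05  # ohms per square, hypothetical process value
shape = {(x, y) for x in range(8) for y in range(2)} | \
        {(7, y) for y in range(2, 6)}                    # an L-shaped wire
src, dst = (0, 0), (7, 5)

dist = {src: 0}
q = deque([src])
while q:
    x, y = q.popleft()
    for n in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        if n in shape and n not in dist:
            dist[n] = dist[(x, y)] + 1
            q.append(n)

print(dist[dst] * R_SHEET)  # squares along the route x sheet resistance
```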
| Birger Landwehr and Peter Marwedel. Intelligent library component selection and management in an IP-model based high-level synthesis system. In IFIP Workshop on Logic and Architecture Synthesis Grenoble, 1993 [BibTeX]@inproceedings { landwehr:93:ifip,
author = {Landwehr, Birger and Marwedel, Peter},
title = {Intelligent library component selection and management in an IP-model based high-level synthesis system},
booktitle = {IFIP Workshop on Logic and Architecture Synthesis},
year = {1993},
address = {Grenoble},
confidential = {n},
} |
| Peter Marwedel and Wolfgang Schenk. Cooperation of Synthesis, Retargetable Code Generation and Test Generation in the MSS.. In European Design and Test Conf. (EDAC-ETC-EUROASIC) 1993 [BibTeX][PDF][Abstract]@inproceedings { marwedel:1993:edac,
author = {Marwedel, Peter and Schenk, Wolfgang},
title = {Cooperation of Synthesis, Retargetable Code Generation and Test Generation in the MSS.},
booktitle = {European Design and Test Conf. (EDAC-ETC-EUROASIC)},
year = {1993},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1993-edac.pdf},
confidential = {n},
abstract = {This paper demonstrates how the different tools in the MIMOLA hardware design system MSS are used during a typical design process. Typical design processes are partly automatic and partly manual. They include high-level synthesis, manual post-optimization, retargetable code generation, testability evaluation and simulation. The paper demonstrates how consistent tools can help to solve a variety of related design tasks. There is no other system with an equivalent set of consistent tools. A key contribution of this paper is to show how current high-level synthesis systems can be extended by retargetable code generators which map algorithms to predefined structures. This extension is necessary in order to support manual design modifications.},
} This paper demonstrates how the different tools in the MIMOLA hardware design system MSS are used during a typical design process. Typical design processes are partly automatic and partly manual. They include high-level synthesis, manual post-optimization, retargetable code generation, testability evaluation and simulation. The paper demonstrates how consistent tools can help to solve a variety of related design tasks. There is no other system with an equivalent set of consistent tools. A key contribution of this paper is to show how current high-level synthesis systems can be extended by retargetable code generators which map algorithms to predefined structures. This extension is necessary in order to support manual design modifications.
|
| Peter Marwedel. Tree-Based Mapping of Algorithms to Predefined Structures. In International Conference on Computer Aided Design (ICCAD) 1993 [BibTeX][PDF][Abstract]@inproceedings { marwedel:1993:iccad,
author = {Marwedel, Peter},
title = {Tree-Based Mapping of Algorithms to Predefined Structures},
booktitle = {International Conference on Computer Aided Design (ICCAD)},
year = {1993},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1993-iccad.pdf},
confidential = {n},
abstract = {Due to the need for fast design cycles and low production cost, programmable targets like DSP processors are becoming increasingly popular. Design planning, detailed design as well as updating such designs requires mapping existing algorithms onto these targets. Instead of writing target-specific mappers, we propose using retargetable mappers. The technique reported in this paper is based on pattern matching. Binary code is generated as a result of this matching process. This paper describes the techniques of our mapper MSSV and identifies areas for improvements. As a result, it shows that efficient handling of alternative mappings is crucial for an acceptable performance.},
} Due to the need for fast design cycles and low production cost, programmable targets like DSP processors are becoming increasingly popular. Design planning, detailed design as well as updating such designs requires mapping existing algorithms onto these targets. Instead of writing target-specific mappers, we propose using retargetable mappers. The technique reported in this paper is based on pattern matching. Binary code is generated as a result of this matching process. This paper describes the techniques of our mapper MSSV and identifies areas for improvements. As a result, it shows that efficient handling of alternative mappings is crucial for an acceptable performance.
|
| C. Albrecht, S. Bashford, P. Marwedel, A. Neumann and W. Schenk. The design of the PRIPS microprocessor. In 4th EUROCHIP-Workshop on VLSI Training 1993 [BibTeX][PDF][Abstract]@inproceedings { albrecht:1993:eurochip,
author = {Albrecht, C. and Bashford, S. and Marwedel, P. and Neumann, A. and Schenk, W.},
title = {The design of the PRIPS microprocessor},
booktitle = {4th EUROCHIP-Workshop on VLSI Training},
year = {1993},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/1993-eurochips-albrecht.pdf},
confidential = {n},
abstract = {The PRIPS microprocessor was recently designed at the University of Dortmund, Lehrstuhl Informatik XII. PRIPS is a coprocessor with a RISC-like instruction set. The supported data types and some of the instructions are oriented towards supporting the execution of PROLOG programs. The design was performed by a project group consisting of 10 students. Such project groups are one of the key features of the computer science curriculum at Dortmund. The project group was partitioned into three subgroups. The first subgroup was responsible for everything related to compiling PROLOG programs into machine code. This group designed the instruction set of PRIPS. The first approach was to consider implementing the Warren abstract machine (WAM). This approach was rejected because of the size of the required microcode. Therefore, it was decided to implement a RISC-like instruction set with some special instructions for PROLOG. With this approach, PROLOG programs are first compiled into WAM code. WAM instructions are then expanded into RISC instructions. The first subgroup also analysed the effect of different cache options on the performance. The second subgroup designed the register-transfer structure for the given instruction set. To this end, the semantics of the instruction set was described in the hardware description language MIMOLA. Using the TODOS high-level synthesis system designed at the Universities of Kiel and Dortmund, an initial RT-structure was generated. Subsequent improvements were added by using the retargetable code generator MSSQ (see the paper by Marwedel at EDAC-EUROASIC-93 for a description of the design process). The final register-transfer structure contains a register file of 32 registers of 32 bits and an ALU with very sophisticated concurrent tag checking modes. In order to achieve maximum flexibility, it was decided to implement an on-chip loadable microstore. In order to improve testability of the chip, PRIPS uses external clock generation and a scan-path design for the controller. The third group entered the PRIPS design into a commercial Cadence EDGE CAD database. Due to problems with the EDIF interface, the design was entered using schematics entry. The final layout was also obtained with EDGE. Several iterations were required to meet design constraints. The final chip size is 12 by 8 mm for the ES2 1.5μm CMOS process. PRIPS has been submitted to EUROCHIP for fabrication. After some delay, caused by undocumented features of format converters, 30 samples were received in February 1993. The setup used for testing basically consists of a Hewlett Packard 16500A tester, which is linked to a Sun workstation and programmed using the TSSI software package. First results indicate that some of the chips are working. However, detailed results are not available yet.},
} The PRIPS microprocessor was recently designed at the University of Dortmund, Lehrstuhl Informatik XII. PRIPS is a coprocessor with a RISC-like instruction set. The supported data types and some of the instructions are oriented towards supporting the execution of PROLOG programs. The design was performed by a project group consisting of 10 students. Such project groups are one of the key features of the computer science curriculum at Dortmund. The project group was partitioned into three subgroups. The first subgroup was responsible for everything related to compiling PROLOG programs into machine code. This group designed the instruction set of PRIPS. The first approach was to consider implementing the Warren abstract machine (WAM). This approach was rejected because of the size of the required microcode. Therefore, it was decided to implement a RISC-like instruction set with some special instructions for PROLOG. With this approach, PROLOG programs are first compiled into WAM code. WAM instructions are then expanded into RISC instructions. The first subgroup also analysed the effect of different cache options on the performance. The second subgroup designed the register-transfer structure for the given instruction set. To this end, the semantics of the instruction set was described in the hardware description language MIMOLA. Using the TODOS high-level synthesis system designed at the Universities of Kiel and Dortmund, an initial RT-structure was generated. Subsequent improvements were added by using the retargetable code generator MSSQ (see the paper by Marwedel at EDAC-EUROASIC-93 for a description of the design process). The final register-transfer structure contains a register file of 32 registers of 32 bits and an ALU with very sophisticated concurrent tag checking modes. In order to achieve maximum flexibility, it was decided to implement an on-chip loadable microstore. In order to improve testability of the chip, PRIPS uses external clock generation and a scan-path design for the controller. The third group entered the PRIPS design into a commercial Cadence EDGE CAD database. Due to problems with the EDIF interface, the design was entered using schematics entry. The final layout was also obtained with EDGE. Several iterations were required to meet design constraints. The final chip size is 12 by 8 mm for the ES2 1.5μm CMOS process. PRIPS has been submitted to EUROCHIP for fabrication. After some delay, caused by undocumented features of format converters, 30 samples were received in February 1993. The setup used for testing basically consists of a Hewlett Packard 16500A tester, which is linked to a Sun workstation and programmed using the TSSI software package. First results indicate that some of the chips are working. However, detailed results are not available yet.
|
| Jürgen Herrmann and Renate Beckmann. A Heuristic Inductive Generalization Method and its Application to VLSI-Design. In German Workshop on Artificial Intelligence 1992 [BibTeX][Abstract]@inproceedings { hermmann:1992:ai,
author = {Herrmann, J\"urgen and Beckmann, Renate},
title = {A Heuristic Inductive Generalization Method and its Application to VLSI-Design},
booktitle = {German Workshop on Artificial Intelligence},
year = {1992},
confidential = {n},
abstract = {The system LEFT is presented that learns most specific generalizations (MSGs) from structural descriptions. The new inductive multi-staged generalization |