The vision of Ambient Intelligence opens a world of unprecedented experiences: the interaction of people with electronic devices is changed as contextual awareness, natural interfaces and ubiquitous availability of information are realized. We analyze the consequences of the ambient intelligence vision for electronic devices by mapping the involved technologies on a power-information graph. Based on the differences in power consumption, three types of devices are introduced: the autonomous or microWatt-node, the personal or milliWatt-node and the static or Watt-node. Ambient intelligent functions are realized by a network of these devices with the computing, communication and interface electronics realized in Silicon IC technologies. Three case studies highlight the IC design challenges involved, and show the variety of problems that have to be solved.
Semiconductors will play a key role in driving technological evolution over the next 20 years. We already possess many of the technologies that will deeply change the landscape, among them nanotechnologies, bioelectronics and photonics. The central role of Integrated Circuits in the economy will grow stronger in the future, starting from the convergence of storage, security, video, audio, mobility and connectivity. Systems are converging, and ICs are increasingly converging with systems. The fundamental issue is how to translate knowledge and competence from different fields into single architectures. The key to winning this challenge is to build the right culture: an organisation for innovation, with the right mix of creativity, personal initiative and execution skills.
Memory partitioning is an effective approach to memory energy optimization in embedded systems. Spatial locality of the memory address profile is the key property that partitioning exploits to determine an efficient multi-bank memory architecture. This paper presents an approach, called address clustering, for increasing the locality of a given memory access profile, and thus improving the efficiency of partitioning. Results obtained on several embedded applications running on an ARM7 core show average energy reductions of 25% (maximum 57%) w.r.t. a partitioned memory architecture synthesized without resorting to address clustering.
This paper presents a new algorithm for on-the-fly
data compression in high performance VLIW
processors. The algorithm aggressively targets energy
minimization of some of the dominant factors in the
SoC energy budget (i.e., main memory access and high
throughput global bus). Based on a differential
technique, both the new algorithm and the HW
compression unit have been developed to efficiently
manage data compression and decompression into a
high performance industrial processor architecture,
under strict real time constraints (Lx-ST200: A 4-issue,
6-stage pipelined VLIW processor with on-chip D and
I-Cache). The original Data Cache line is compressed
before write-back to main memory and, then,
decompressed whenever Cache refill takes place. An
extensive experimental strategy has been developed
for the specific validation of the target Lx processor. In
order to allow public comparison, we also report the
results obtained on a MIPS pipelined RISC processor
simulated with SimpleScalar. The two platforms have
been benchmarked over Ptolemy and MediaBench
programs. Energy savings provided by the application
of the proposed technique range from 10% to 22% on
the Lx-ST200 platform and from 11% to 14% on the
MIPS platform.
Keywords: Data compression algorithms, system-level
energy optimization, VLIW embedded processors.
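As a rough illustration of what a differential scheme can look like, the sketch below encodes a cache line as one base word plus the differences between successive words, and leaves the line uncompressed when a difference does not fit. This is not the algorithm of the paper, whose details are not given in the abstract; the 32-bit word size, the 8-bit delta width and all function names are assumptions made for the example.

```python
from typing import List, Optional, Tuple

WORD_BITS = 32   # assumed word width
DELTA_BITS = 8   # assumed width of each stored difference


def compress_line(words: List[int]) -> Optional[Tuple[int, List[int]]]:
    """Return (base, deltas) if every successive difference fits in
    DELTA_BITS as a signed value, otherwise None (line stays uncompressed).
    Expects a non-empty list of words."""
    lo, hi = -(1 << (DELTA_BITS - 1)), (1 << (DELTA_BITS - 1)) - 1
    deltas = []
    for prev, cur in zip(words, words[1:]):
        d = cur - prev
        if not lo <= d <= hi:
            return None
        deltas.append(d)
    return words[0], deltas


def decompress_line(base: int, deltas: List[int]) -> List[int]:
    """Rebuild the original words by accumulating the deltas."""
    words = [base]
    for d in deltas:
        words.append(words[-1] + d)
    return words


def compressed_bits(n_words: int) -> int:
    """Size of a compressed line: one full base word plus n-1 deltas."""
    return WORD_BITS + (n_words - 1) * DELTA_BITS
```

In this toy model, compression would run before write-back and decompression on cache refill, so fewer bits cross the bus whenever neighbouring words hold similar values.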
The instruction memory communication path accounts for a significant share of power consumption in embedded processors. We propose an encoding technique that exploits application information to reduce the associated power consumption. The microarchitectural support enables reprogrammability of the encoding transformations so as to track code particularities effectively. Restricting the encoding to functional transformations enables effective coding and major power savings, and also removes the need for dictionary lookup, one of the major shortcomings of prior approaches. The frugal functional transformation, which relies on a single-bit logic gate, has no impact on the critical fetch stage of the processor pipeline while delivering all of the theoretically achievable power savings. The reprogrammable hardware implementation enables flexible and inexpensive switches between the transformations. Extensive experimental results on numerical and DSP codes confirm the theoretically expected magnitude of power savings, with reductions of up to half of the original transitions.
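To make the idea of a single-gate functional transformation concrete, the following sketch counts transitions on one bus line and applies one possible transformation: each bit is sent XORed with the previous original bit, so the decoder needs only one XOR per bit. The choice of XOR with the previous bit is an assumption picked for illustration; the abstract does not state which transformations are used or how they are selected.

```python
from typing import List


def transitions(bits: List[int]) -> int:
    """Count 0->1 and 1->0 changes on one bus line."""
    return sum(a != b for a, b in zip(bits, bits[1:]))


def encode_xor(bits: List[int]) -> List[int]:
    """Send each bit XORed with the previous original bit (first bit as is)."""
    return [bits[0]] + [a ^ b for a, b in zip(bits, bits[1:])]


def decode_xor(sent: List[int]) -> List[int]:
    """Invert encode_xor with one XOR per bit: b[i] = sent[i] ^ b[i-1]."""
    out = [sent[0]]
    for s in sent[1:]:
        out.append(s ^ out[-1])
    return out
```

For an alternating stream such as `[0, 1, 0, 1, 0, 1]` this transformation helps, while for a mostly constant stream the identity would be the better choice, which is why being able to switch transformations matters.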
This paper presents a new technique to improve the efficiency of data scheduling for multi-context reconfigurable architectures targeting multimedia and DSP applications. The main goal is to improve application energy consumption. Two levels of on-chip data storage are assumed in the reconfigurable architecture. The Data Scheduler attempts to optimally exploit this storage, by deciding in which on-chip memory the data have to be stored in order to reduce energy consumption. We also show that a suitable data scheduling could decrease the energy required to implement the dynamic reconfiguration of the system.
The gap between the advances and the utilization of the deep submicron (DSM) technology is increasing as the new generation of technology is introduced faster than ever. Signal integrity is one of the most important issues in overcoming this gap. With the increasing coupling capacitance between the high aspect ratio wires, the delay uncertainty is unpredictable in the current design flow. We present an algorithm to generate the global wire bus configuration with minimum delay uncertainty under timing constraints. The timing window information from the timing budget (or specified in IPs) is integrated with the modern accurate crosstalk noise models in the proposed algorithm. HSPICE simulations show that the algorithm is very effective and efficient when compared to the buffer insertion scheme with minimum delay. The standard deviation of the delay obtained from the Monte-Carlo simulation is improved by up to 73%. This global wire bus configuration can be adopted in early wire planning to ease the timing closure problem and increase the accuracy of the timing budget.
Delay variation due to crosstalk has made timing analysis
a hard problem. In sequential circuits with transparent
latches, crosstalk makes the timing verification (also known
as clock schedule verification) even harder. In this paper,
we point out a false negative problem in current timing verification
techniques and propose a new approach based on
switching windows. In this approach, coupling delay calculations
are combined naturally with latch timing iterations. A
novel algorithm is given for timing verification with crosstalk
in transparently latched circuits and preliminary experiments
show promising results.
Categories & Subject Descriptors
B.7.2 Design Aids: Verification
General Terms
Algorithms, Design, Verification
Keywords
Timing, Clock, Verification, Coupling, Delay
The growing impact of within-die process variation has created the need for statistical timing analysis, where gate delays are modeled as random variables. Statistical timing analysis has traditionally suffered from exponential run time complexity with circuit size, due to the dependencies created by reconverging paths in the circuit. In this paper, we propose a new approach to statistical timing analysis which uses statistical bounds. First, we provide a formal definition of the statistical delay of a circuit and derive a statistical timing analysis method from this definition. Since this method for finding the exact statistical delay has exponential run time complexity with circuit size, we also propose a new method for computing statistical bounds which has linear run time complexity. We prove the correctness of the proposed bounds. Since we provide both a lower and upper bound on the true statistical delay, we can determine the quality of the bounds. The proposed methods were implemented and tested on benchmark circuits. The results demonstrate that the proposed bounds have only a small error.
The design of clock distribution networks in synchronous digital systems presents enormous challenges. Controlling the clock signal delay in the presence of various noise sources, process parameter variations, and environmental effects represents a fundamental problem in the design of high speed synchronous circuits. A polynomial time algorithm that improves the tolerance of a clock distribution network to process and environmental variations is presented in this paper. The algorithm generates a clock tree topology that minimizes the uncertainty of the clock signal delay to the most critical data paths. Strategies for enhancing the physical layout of the clock tree to decrease delay uncertainty are also presented. Application of the methodology on benchmark circuits demonstrates clock tree topologies with decreased delay uncertainties of up to 90%. Techniques to enhance a clock tree layout have been applied on a set of benchmark circuits, yielding a reduction in delay uncertainty of up to 48%.
Smart cards are vulnerable to both invasive and non-invasive attacks. Specifically, non-invasive attacks using power and timing measurements to extract the cryptographic key have drawn a lot of negative publicity for smart card usage. The power measurement techniques rely on the data-dependent energy behavior of the underlying system. Further, power analysis can be used to identify the specific portions of the program being executed to induce timing glitches that may in turn help to bypass key checking. Thus, it is important to mask the energy consumption when executing the encryption algorithms. In this work, we augment the instruction set architecture of a simple five-stage pipelined smart card processor with secure instructions to mask the energy differences due to key-related data-dependent computations in DES encryption. The secure versions operate on the normal and complementary versions of the operands simultaneously to mask the energy variations due to value-dependent operations. However, this incurs the penalty of increased overall energy consumption in the data-path components. Consequently, we employ secure versions of instructions only for critical operations; that is, we use secure instructions selectively, as directed by an optimizing compiler. Using a cycle-accurate energy simulator, we demonstrate the effectiveness of this enhancement. Our approach achieves the energy masking of critical operations while consuming 83% less energy than existing approaches employing dual-rail circuits.
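The masking effect of processing a value together with its complement can be seen with a very crude model in which the number of 1 bits stands in for data-dependent energy. That proxy, the 8-bit default width and the function names are assumptions for this sketch; the paper itself relies on a cycle-accurate energy simulator.

```python
def hamming_weight(x: int) -> int:
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def dual_weight(x: int, width: int = 8) -> int:
    """Combined Hamming weight of x and its bitwise complement at a given width."""
    mask = (1 << width) - 1
    x &= mask
    return hamming_weight(x) + hamming_weight(~x & mask)
```

Under this proxy the combined weight equals the word width for every operand value, so it no longer reveals anything about the data, at the cost of handling twice as many bits.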
This paper describes a new Dynamic Voltage Scaling (DVS) technique for embedded systems expressed as Conditional Task Graphs (CTGs). The idea is to identify and exploit the available worst case slack time, taking into account the conditional behaviour of CTGs. We also examine the effect of combining a genetic algorithm based mapping with the DVS technique for CTGs and show that further energy reduction can be obtained. The techniques have been tested on a number of CTGs including a real life example. The results show that the DVS technique can be applied to CTGs with energy savings of up to 24%. Furthermore, it is shown that savings of up to 51% are achieved by considering DVS during the mapping.
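Why slack translates into energy savings can be shown with a first-order model, sketched below. It assumes that clock frequency scales linearly with supply voltage and that energy per cycle scales with the square of the voltage; both are textbook simplifications rather than the model used in the paper, and the function name is hypothetical.

```python
def scaled_energy_ratio(exec_time: float, slack: float) -> float:
    """Energy relative to full speed when a task is slowed to fill its slack,
    under the assumed model f ~ V and E ~ V**2 per cycle."""
    speed = exec_time / (exec_time + slack)  # fraction of full speed needed
    return speed ** 2
```

With slack equal to the execution time, the task can run at half speed and, in this model, uses a quarter of the energy.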
We present a novel design methodology for synthesizing multiple configurations (or modes) into a single programmable system. Many DSP and multimedia applications require reconfigurability of a system along with efficiency in terms of power, performance and area. FPGAs provide a reconfigurable platform; however, they are slower and consume significantly more power than a customized ASIC. In this work, we have developed techniques to realize an efficient reconfigurable system for a set of user-specified configurations. A data flow graph transformation method coupled with efficient scheduling and allocation is used to automatically synthesize the system from its behavioral level specifications. Experimental results on several applications demonstrate that we can achieve about 60X power reduction on average with about 4X improvement in performance over corresponding FPGA implementations.
We propose a technique for compressing test vectors. The technique reduces test application time and tester memory requirements by utilizing part of the predecessor response in constructing the subsequent test vector. An algorithm is provided for stitching test vectors that retains full fault coverage while appreciably reducing time and tester requirements. The analysis provided enables significant compression ratios, while necessitating no hardware outlay whatsoever, making the technique we propose particularly suitable for SOC testing. The test time benefits necessitate no MISR utilization, ensuring no consequent aliasing loss. We examine a number of implementation considerations for the new compression technique and we provide experimental data that can be used to guide an eventual commercial implementation. Experimental data confirms the significant test application time and tester memory reductions.
This paper proposes a new test compression technique that employs a Fan-out SCAN chain with Feedback (FSCANF) architecture. It allows us to use prelude vectors to resolve dependencies created by fanning out multiple scan chains from a single scan-in pin. This paper describes the proposed architecture as well as the algorithm that generates compressed test vectors using a vertex coloring algorithm. The distribution of specified bits in each test pattern determines the compression ratio of the individual test pattern. Therefore, our technique optimizes the overall compression ratio and shows higher reduction in test data and application time than previous techniques, which use the extreme case of serializing all the scan chains in the presence of conflicts across the fanout scan chains. The FSCANF architecture has small hardware overhead and is independent of scan cell orders in the scan chains. Experimental results show that our technique significantly reduces both the test data volume and test application time in six of the largest ISCAS 89 sequential benchmark circuits compared to the previous techniques.
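One way to picture the role of vertex coloring is sketched below: scan chains that never require opposite specified bits at the same position can share one scan-in stream, so chains are the vertices of a conflict graph and each color is one group. The greedy coloring, the string encoding with `X` for don't-care bits and the function names are assumptions for this example; the abstract does not describe the paper's actual graph formulation.

```python
from typing import List


def conflict(a: str, b: str) -> bool:
    """Two chains conflict if some position has opposite specified bits."""
    return any(x != y and "X" not in (x, y) for x, y in zip(a, b))


def color_chains(chains: List[str]) -> List[int]:
    """Greedy vertex coloring of the conflict graph; one color per chain."""
    colors: List[int] = []
    for i, chain in enumerate(chains):
        # Colors already taken by earlier chains that conflict with this one.
        used = {colors[j] for j in range(i) if conflict(chain, chains[j])}
        c = 0
        while c in used:
            c += 1
        colors.append(c)
    return colors
```

Chains with the same color are pairwise compatible, so the fewer colors a pattern needs, the fewer separate loads are required for it.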
Reduction of both the test suite size and the download time of test vectors is important in today's System-On-a-Chip designs. In this paper, a method for compressing the scan test patterns using the LZW algorithm is presented. This method leverages the large number of "Don't-Cares" in test vectors in order to improve the compression ratio significantly. The hardware decompression architecture presented here uses existing on-chip embedded memories. Tests using the ISCAS89 and the ITC99 benchmarks show that this method achieves high compression ratios.
Data correlation is a well-known problem that causes difficulty in VLSI testing. Based on a correlation metric, an efficient heuristic to select BIST registers was proposed in previous work. However, the computation of data correlation itself was a computationally intensive process and became a bottleneck in that work. This paper presents an efficient technique to compute data correlation using Binary Decision Diagrams (BDDs). Once a BDD is built, our algorithms take linear time to compute the corresponding data correlation. The experimental results show that this technique is much faster than the previous technique based on simulation. It enables testing approaches based on data correlation to handle more practical designs. As one of the successful applications, partial scan is demonstrated by integrating our computation results.
System level synthesis is widely seen as the solution for closing the productivity gap in system design. High level system models are used in system level design for early design exploration. While real time operating systems (RTOS) are an increasingly important component in system design, specific RTOS implementations cannot be used directly in high level models. On the other hand, existing system level design languages (SLDL) lack support for RTOS modeling. In this paper we propose an RTOS model built on top of existing SLDLs which, by providing the key features typically available in any RTOS, allows the designer to model the dynamic behavior of multi-tasking systems at higher abstraction levels and to incorporate it into existing design flows. Experimental results show that our RTOS model is easy to use and efficient while being able to provide accurate results.
This paper describes automation methods for device driver development in IP-based embedded systems in order to achieve high reliability, productivity, reusability and fast time to market. We formally specify device behaviors using event driven finite state machines, communication channels, declaratively described rules, constraints and synthesis patterns. A driver is synthesized from this specification for a virtual environment that is platform (processor, operating system and other hardware) independent. The virtual environment is mapped to a specific platform to complete the driver implementation. The illustrative application of our approach for a USB device driver in Linux demonstrates improved productivity and reusability.
The embedded software design cost represents an important percentage of the embedded-system development costs [1]. This paper presents a method for systematic embedded software generation that reduces the software generation cost in a platform-based HW/SW codesign methodology for embedded systems based on SystemC. The goal is that the same SystemC code allows system-level specification and verification, and, after SW/HW partitioning, SW/HW co-simulation and embedded software generation. The C++ code for the SW partition (processes and process communication including HW/SW interfaces) is systematically generated, including the user-selected embedded OS (e.g., the eCos open source OS).
Noise performance is a critical analog and RF circuit design constraint, and can impact the selection of the IC system-level architecture. It is therefore imperative that some model of the noise is represented at the highest levels of abstraction during the design process. In this paper we propose a noise macromodel for analog circuits and demonstrate it by way of implementation in a system level simulator based on MATLAB. We also explain our process of macromodel extraction via reformulation of frequency-domain noise analysis results, and the corresponding steps of model order reduction. The results demonstrate the efficacy of this macromodel for frequency domain system level simulation.
A new computational concept of timing jitter is proposed that is suitable for exploitation in circuit simulators. It is based on the approximation of computed noise characteristic in an arbitrary language. It is explained how different models of the building blocks can easily be included in the model. To show the validity of the model, some simulation results with an implementation in the MATLAB programming language are presented.
Behavioural simulation is the common alternative to the costly electrical simulation of Delta-Sigma modulators (Delta-SigmaMs). This paper explores the behavioural modelling and simulation of Delta-SigmaMs by using hardware description languages (HDLs) and commercial behavioural simulators, as an alternative to the common special-purpose behavioural simulators. A library of building blocks, where an HDL has been used to model a complete set of circuit non-idealities influencing the performance of Delta-SigmaMs, is introduced. Three alternatives for introducing Delta-SigmaM topologies have been implemented. Experimental results of the simulation of a fourth-order 2-1-1 cascade multi-bit Delta-SigmaM are given.
We present an approach to schedulability analysis for the synthesis of multi-cluster distributed embedded systems consisting of time-triggered and event-triggered clusters, interconnected via gateways. We have also proposed a buffer size and worst case queuing delay analysis for the gateways, responsible for routing inter-cluster traffic. Optimization heuristics for the priority assignment and synthesis of bus access parameters aimed at producing a schedulable system with minimal buffer needs have been proposed. Extensive experiments and a real-life example show the efficiency of our approaches.
We present a framework (Real-Time Calculus) for analysing various system properties pertaining to timing analysis, loads on various components and on-chip buffer memory requirements of heterogeneous platform-based architectures, in a single coherent way. Many previous analysis techniques from the real-time systems domain, which are based on standard event models, turn out to be special cases of our framework. We illustrate this using various realistic examples.
In this paper, a novel approach to high-level (i.e. architecture independent) worst case execution time (WCET) analysis is presented that automatically computes exact bounds for all inputs. To this end, we make use of the distinction between micro and macro steps as usually done by synchronous languages. As macro steps must not contain loops, a later low-level WCET analysis (architecture dependent) is simplified to a large extent. Checking exact execution times for all inputs is a complex task that can nevertheless be efficiently done when implicit state space representations are used. With our tools, it is not only possible to compute path information by exploring all computations, but also to verify given path information.
The functionality of a typical embedded system is specified
once at design time and cannot be altered later during
the whole mission period. There are, however, a number
of important application domains that ask for both flexibility
and availability. In such a flexible embedded system the
functionality can be modified while the application is running.
This paper presents a rapid prototyping environment for
flexible embedded systems on multi-DSP architectures. This
prototyping environment automatically maps and schedules
an application onto a multi-DSP architecture and introduces
a special, lightweight reconfiguration environment
onto the target platform. A running multi-DSP application
can, therefore, be modified by reconfiguring software tasks.
By using our prototyping environment the modified application
can be tested, simulated and emulated prior to the
implementation on the target.
keywords: embedded system; multi-DSP architectures;
task reconfiguration; testability; rapid prototyping
This paper presents a methodology for testing high-performance
pipelined circuits with slow-speed testers. The
technique uses a clock timing circuit to control data transfer
in the pipeline in test mode. A clock timing circuit capable of
achieving a timing resolution of 50 ps in 0.18 µm CMOS technology
is presented. The design provides the ability to test the clock
timing circuit itself.
Keywords: Delay-fault testing, high-performance testing,
design for testability, design for delay testability.
As technology shrinks and operating frequencies move into the multi-gigahertz range, the issues related to interconnect testing are becoming more dominant. Specifically, signal integrity loss is becoming more important, and detecting and diagnosing these losses is becoming a great challenge. In this paper, an enhanced boundary scan architecture with slight modification in the boundary scan cells is proposed to test signal integrity in SoC interconnects. Our extended JTAG architecture: 1) minimizes scan-in operation by using modified boundary scan cells in pattern generation; and 2) incorporates the integrity loss information within the modified observation cells. To fully comply with the JTAG standard, we propose two new instructions, one for pattern generation and the other for scanning out the captured signal integrity information.
A novel design methodology for test pattern generation in BIST is presented, in which faults and errors in the generator itself are detected. Two different design methodologies are presented. The first one guarantees detection of all single faults and errors, and the second is capable of detecting multiple faults and errors. Furthermore, the proposed LFSRs do not have additional hardware overhead. Importantly, the test patterns generated also have the potential to achieve superior fault coverage.
We present a new partition-based fault diagnosis technique for identifying failing scan cells in a scan-BIST environment. This approach relies on a two-step scan chain partitioning scheme. In the first step, an interval-based partitioning scheme is used to generate a small number of partitions, where each element of a partition consists of a set of scan cells. In the second step, additional partitions are created using an earlier-proposed random-selection partitioning method. Two-step partitioning leads to higher diagnostic resolution than a scheme that relies only on random-selection partitioning, with only a small amount of additional hardware. The proposed scheme is especially suitable for a system-on-chip (SOC) composed of multiple embedded cores, where test access is provided by means of a TestRail that is threaded through the internal scan chains of the embedded cores. We present experimental results for the six largest ISCAS-89 benchmark circuits and for two SOCs crafted from some of the ISCAS-89 circuits.
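The benefit of combining two kinds of partitions can be sketched as follows: each partition splits the scan cells into groups, a group is flagged when it contains a failing cell, and the candidate set is the intersection of the flagged groups across partitions. The sketch assumes ideal group outcomes with no aliasing, uses Python's `random` module as a stand-in for the earlier-proposed random-selection method, and all function names are hypothetical.

```python
import random
from typing import List, Optional, Set


def interval_partition(n_cells: int, n_groups: int) -> List[Set[int]]:
    """Split scan cells 0..n_cells-1 into at most n_groups contiguous intervals."""
    size = -(-n_cells // n_groups)  # ceiling division
    return [set(range(s, min(s + size, n_cells))) for s in range(0, n_cells, size)]


def random_partition(n_cells: int, n_groups: int, seed: int) -> List[Set[int]]:
    """Assign each scan cell to a pseudo-randomly chosen group."""
    rng = random.Random(seed)
    groups: List[Set[int]] = [set() for _ in range(n_groups)]
    for cell in range(n_cells):
        groups[rng.randrange(n_groups)].add(cell)
    return groups


def candidates(partitions: List[List[Set[int]]], failing: Set[int]) -> Set[int]:
    """Cells that lie in a flagged (failing) group of every partition."""
    result: Optional[Set[int]] = None
    for groups in partitions:
        bad: Set[int] = set()
        for g in groups:
            if g & failing:
                bad |= g
        result = bad if result is None else result & bad
    return result or set()
```

Since faulty cells often cluster along the chain, an interval partition can narrow the candidates to a few neighbouring cells quickly, and the random partitions then prune within that interval.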
This paper presents a new, frequency-domain based method for modeling and analysis of phase-locked loop (PLL) small-signal behavior, including time-varying aspects. Focus is given to PLLs with sampling phase-frequency detectors (PFDs) which compute the phase error only once per period of the reference signal. Using the harmonic transfer matrix (HTM) formalism, the well-known continuous-time, linear time-invariant (LTI) approximations are extended to take the impact of time-varying behavior, arising from the sampling nature of the PFD, into account. Especially for PLLs with a fast feedback loop, this time-varying behavior has severe impact on, for example, loop stability and cannot be neglected. Contrary to LTI analysis, our method is able to predict and quantify these difficulties. The method is verified for a typical loop design.
A new numerical technique for periodic small signal analysis based on the harmonic balance method is proposed. Special-purpose numerical procedures based on Krylov subspace methods are developed that reduce the computational effort of solving linear problems under frequency sweeping. Examples are given to show the efficiency of the new algorithm for computing small signal characteristics for typical RF circuits.
This paper presents a new method to automatically generate posynomial symbolic expressions for the performance characteristics of analog integrated circuits. The coefficient set as well as the exponent set of the posynomial expression are determined based on SPICE simulation data with device-level accuracy. We will prove that this problem corresponds to solving a non-convex optimization problem without local minima. The presented method is capable of generating posynomial performance expressions for both linear and nonlinear circuits and circuit characteristics. This approach allows the automatic generation of an accurate sizing model that composes a geometric program that fully describes the analog circuit sizing problem. The automatic generation avoids the time-consuming nature of hand-crafted analytic model generation. Experimental results illustrate the capabilities and effectiveness of the presented modeling technique.
A novel methodology for structured yield-aware synthesis is presented. The trade-off between yield and the unspecified performances is explored along the design space boundaries, while respecting specifications on the other performances. Through the unique combination of multi-objective evolutionary optimization techniques, multi-variate regression modeling and sensitivity-based yield estimation, the designer is given access to this trade-off, all within transistor-level accuracy. Moreover, a large reduction in required computer resources is obtained compared to alternative approaches.
Conventional synthesis algorithms perform the allocation of heterogeneous specifications, those formed by operations of different types and widths, by binding operations to functional units of their same type and width. Thus, most of the implementations obtained waste some hardware. This paper proposes an allocation algorithm able to minimize this hardware waste by fragmenting operations into their common operative kernel, which then may be executed over the same functional units. Hence, fragmented operations are executed over sets of several linked hardware resources. The implementations proposed by our algorithm need considerably smaller area than the ones proposed by conventional allocation algorithms. Moreover, because of operation fragmentation, the type, number, and width of the hardware resources in the resulting datapaths are independent of the type, number, and width of the specification operations and variables.
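A small example of executing one wide operation over several linked narrower resources is a 16-bit addition performed as a chain of 8-bit additions with carry. The sketch below shows only that general idea; how the paper extracts a common operative kernel across different operation types is not described in the abstract, and the function name and parameters are assumptions.

```python
def add_fragmented(a: int, b: int, width: int, frag: int) -> int:
    """Add two width-bit values using a chain of frag-bit additions with carry."""
    mask = (1 << frag) - 1
    result, carry = 0, 0
    for shift in range(0, width, frag):
        # One narrow adder handles this fragment plus the incoming carry.
        s = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (s & mask) << shift
        carry = s >> frag
    return result & ((1 << width) - 1)
```

Because each fragment uses the same narrow adder type, the adder width in the datapath no longer has to match the width of the operation in the specification.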
We present two novel strategies to increase the scope for application of speculative code motions: (1) Adding scheduling steps dynamically during scheduling to conditional branches with fewer scheduling steps. This increases the opportunities to apply code motions such as conditional speculation that duplicate operations into the branches of a conditional block. (2) Determining if an operation can be conditionally speculated into multiple basic blocks either by using existing idle resources or by creating new scheduling steps. These strategies lead to balancing of the number of steps in the conditional branches without increasing the longest path through the conditional block. Algorithms for these strategies have been implemented within the Spark high-level synthesis framework that accepts a behavioral description in ANSI-C as input and produces synthesizable register-transfer level VHDL. Experiments on two moderately complex industrial-strength applications, namely, MPEG-1 and the GIMP image processing tool, demonstrate that conditional speculation is ineffective without using these strategies.
To exploit the performance benefits of variable computation time arithmetic units at the system level, we propose a new synchronous control unit design methodology for dataflow graphs in which telescopic arithmetic units, a well-known class of synchronous variable computation time arithmetic units, are allocated. The proposed method generates an independent synchronous controller for each component arithmetic unit, and builds a distributed synchronous control unit by integrating the derived controllers. The distributed structure of the final control unit maximizes the performance improvement of telescopic arithmetic units by completely preserving the original concurrency among operations.
The performance of a system, especially a multiprocessor system, heavily depends upon the efficiency of its bus architecture. This paper presents a methodology to generate a custom bus system for a multiprocessor System-on-a-Chip (SoC). Our bus synthesis tool (BusSyn) uses this methodology to generate five different bus systems as examples: Bi-FIFO Bus Architecture (BFBA), Global Bus Architecture Version I (GBAVI), Global Bus Architecture Version III (GBAVIII), Hybrid bus architecture (Hybrid) and Split Bus Architecture (SplitBA). We verify and evaluate the performance of each bus system in the context of two applications: an Orthogonal Frequency Division Multiplexing (OFDM) wireless transmitter and an MPEG2 decoder. This methodology gives the designer a great benefit in fast design space exploration of bus architectures across a variety of performance impacting factors such as bus types, processor types and software programming style. In this paper, we show that BusSyn can generate buses that achieve superior performance when compared to a simple General Global Bus Architecture (GGBA) (e.g., 16.44% performance improvement in the case of OFDM transmitter) or when compared to the CoreConnect Bus Architecture (CCBA) (e.g., 15.54% performance improvement in the case of MPEG2 decoder). In addition, the bus architecture generated by BusSyn is designed in a matter of seconds instead of weeks for the hand design of a custom bus system.
This paper presents our work toward an operating system that manages the resources of a reconfigurable device in a multitasking manner. We propose an online scheduling system that allocates tasks to a block-partitioned reconfigurable device. The blocks are statically fixed but can have different widths, which allows the computational resources to be matched to the task requirements. We implement several non-preemptive and preemptive schedulers as well as different placement strategies. Finally, we present a simulation environment for experimentally investigating the effects of specific partitioning, placement, and scheduling methods.
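To make the idea of matching block widths to task requirements concrete, the following is a minimal sketch of one possible placement strategy (best fit: the narrowest free block that is wide enough). The abstract does not say which strategies were implemented, so the function name `place_task`, the best-fit rule, and the example widths are all illustrative assumptions.

```python
from typing import List, Optional


def place_task(block_widths: List[int], busy: List[bool], task_width: int) -> Optional[int]:
    """Return the index of the narrowest free block that fits the task, or None.

    Hypothetical best-fit rule; the blocks are fixed and only their
    busy/free state changes, as in a block-partitioned device.
    """
    best = None
    for i, width in enumerate(block_widths):
        if busy[i] or width < task_width:
            continue  # block is occupied or too narrow
        if best is None or width < block_widths[best]:
            best = i
    if best is not None:
        busy[best] = True  # mark the chosen block as occupied
    return best


# Assumed example: three blocks of widths 4, 8 and 16 columns.
widths = [4, 8, 16]
busy = [False, False, False]
first = place_task(widths, busy, 6)   # narrowest fitting block has width 8
second = place_task(widths, busy, 6)  # width 8 is taken, so width 16 is used
third = place_task(widths, busy, 6)   # only width 4 is free: no placement
```

A scheduler built on such a routine would typically queue a task whose placement returns `None` until a suitable block is released.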
Coarse-grained reconfigurable architectures have become increasingly important in recent years. Automatic design or compilation tools are essential to their success. In this paper, we present a modulo scheduling algorithm to exploit loop-level parallelism for coarse-grained reconfigurable architectures. This algorithm is a key part of our Dynamically Reconfigurable Embedded Systems Compiler (DRESC). It is capable of solving placement, scheduling and routing of operations simultaneously in a modulo-constrained 3D space and uses an abstract architecture representation to model a wide class of coarse-grained architectures. The experimental results show high performance and efficient resource utilization on tested kernels.
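The modulo constraint behind modulo scheduling can be illustrated with a small sketch: an operation issued at cycle t occupies resource slot t mod II, where II is the initiation interval, so operations from overlapping loop iterations must not oversubscribe any slot. The code below is a simplified, assumption-laden illustration and not the DRESC algorithm: it models only identical functional units, takes each operation's earliest cycle as given, and ignores placement and routing entirely. All names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def modulo_schedule(ops: List[Tuple[str, int]], ii: int, num_units: int) -> Optional[Dict[str, int]]:
    """Greedily assign each (name, earliest_cycle) operation to a cycle.

    usage[s] counts operations in modulo slot s; a slot may hold at most
    num_units operations. Returns None if some operation cannot be
    placed within ii consecutive cycles.
    """
    usage = [0] * ii
    schedule: Dict[str, int] = {}
    for name, earliest in ops:
        for t in range(earliest, earliest + ii):
            if usage[t % ii] < num_units:
                usage[t % ii] += 1
                schedule[name] = t
                break
        else:
            return None  # every modulo slot is full for this operation
    return schedule


# Assumed four-operation loop body on two functional units.
ops = [("load", 0), ("mul", 1), ("add", 2), ("store", 3)]
sched_ii2 = modulo_schedule(ops, ii=2, num_units=2)  # fits: two ops per slot
sched_ii1 = modulo_schedule(ops, ii=1, num_units=2)  # fails: four ops, one slot
```

In practice a modulo scheduler starts from a lower bound on II and increases it until a valid schedule is found; the sketch only shows the feasibility check for one value of II.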
Reconfigurable hardware will be used in many future embedded applications. Since most of these embedded systems will be temporarily or permanently connected to a network, the possibility to reload parts of the application at run time arises. In the 1990s it was recognized that the huge variety of processors would lead to a tremendous number of binaries for the same piece of software. For the hardware parts of an embedded system, the situation today is even worse. The Java approach based on a Java virtual machine (JVM) was invented to solve the problem for software. In this paper, we show how the hardware parts of an embedded system can be implemented in a hardware byte code, which can be interpreted using a virtual hardware machine running on an arbitrary FPGA. Our results show that this approach is feasible and that it leads to fast, portable and reconfigurable designs, which run on any programmable target architecture.
In this paper, we propose a test generation method for non-robust path delay faults using stuck-at fault test generation algorithms. In our method, we first transform an original combinational circuit into a circuit called a partial leaf-dag using path-leaf transformation. Then we generate test patterns using a stuck-at fault test generation algorithm for stuck-at faults in the partial leaf-dag. Finally we transform the test patterns into two-pattern tests for path delay faults in the original circuit. We prove the correctness of the approach, and experimental results on several benchmark circuits show its effectiveness.
This paper presents a new and low-cost approach for identifying sequentially untestable faults. Unlike the single fault theorem, where the stuck-at fault is injected only in the right-most time frame of the k-frame unrolled circuit, our approach can handle fault injection in any time frame within the unrolled sequential circuit. To efficiently apply our concept to untestable fault identification, powerful sequential implications are used to efficiently extend the unobservability propagation of gates in multiple time frames. Application of the proposed theorem to ISCAS'89 sequential benchmark circuits showed that more untestable faults could be identified using our approach, at practically no overhead in both memory and execution time.
The first non-enumerative framework for diagnosing path delay faults using zero-suppressed binary decision diagrams is introduced. We show that fault-free path delay faults with a validated non-robust test may be used, together with fault-free robustly tested faults, to eliminate faults from the set of suspected faults. All operations are implemented by an implicit diagnosis tool based on the zero-suppressed binary decision diagram. The proposed method is space and time non-enumerative, as opposed to existing methods, which are space and time enumerative. Experimental results on the ISCAS'85 benchmarks show that the proposed technique is on average at least three times more efficient at improving the diagnostic resolution than existing techniques.
This paper defines a new diagnosis problem for diagnosing delay defects based upon statistical timing models. We illustrate the differences between the delay defect diagnosis and traditional logic defect diagnosis. We propose different diagnosis algorithms, and evaluate their performance via statistical defect injection and statistical delay fault simulation. With a statistical timing analysis framework developed in the past, we demonstrate the new concepts in delay defect diagnosis, and discuss experimental results based upon benchmark circuits.
In this paper, the hardware abstraction layer (HAL) is explained in the context of SoC design. First, a definition of HAL is given and the differences between HAL and other similar concepts are described. Existing HALs are examined. The role of HAL in SoC design is explained. Finally, a proposal for a standard HAL is presented.
Traditionally, an Operating System (OS) implements in software basic system functions such as task/process management and I/O. Furthermore, a Real-Time Operating System (RTOS) has also been implemented in software to manage tasks in a predictable, real-time manner. However, with System-on-a-Chip (SoC) architectures similar to Figure becoming more and more common, OS and RTOS functionality need not be implemented solely in software. Thus, partitioning the interface between hardware and software for an OS is a new idea that can have a significant impact.
The new standard DRM for digital radio broadcast in the AM band requires integrated devices for radio receivers at low cost and very low power consumption. A chipset is currently being designed based on an ARM9 multi-core architecture. This paper introduces the application itself, the HW architecture of the SoC and the SW architecture, which includes the physical layer, receiver management, the application layer and the global scheduler based on a real-time OS. Then, the paper presents the HW/SW partitioning and the SW breakdown between the various processing cores. The methodology used in the project to develop, validate and integrate the SW, covering various methods such as simulation, emulation and covalidation, is described. Key points and critical issues are also addressed. One of the challenges is to integrate the whole receiver on a single chip while respecting the real-time constraints linked to the audio services.
Interconnect networks play a critical role in shared memory multiprocessor systems-on-chip (MPSoC) designs. MPSoC performance and power consumption are greatly affected by the packet dataflows that are transported on the network. In this paper, by introducing a packetized on-chip communication power model, we discuss the packetization impact on MPSoC performance and power consumption. Particularly, we propose a quantitative analysis method to evaluate the relationship between different design options (cache, memory, packetization scheme, etc.) at the architectural level. From the benchmark experiments, we show that optimal performance and power tradeoff can be achieved by the selection of appropriate packet sizes.
Managing the complexity of designing chips containing billions of transistors requires decoupling computation from communication. For the communication, scalable and compositional interconnects, such as networks on chip (NoC), must be used. In this paper [...] aggregate bandwidth of 80 Gbit/s. It occupies 0.26 mm² in CMOS12. This shows that our router provides high performance at reasonable cost, bringing NoCs one step closer.
Software implementations of channel decoding algorithms are attractive for communication systems with their large variety of existing and emerging standards due to their flexibility and extensibility. For high throughput, however, a single processor cannot provide the necessary compute power. Using several processors in parallel without exploiting the internal parallelism of the algorithm leads to intolerable overhead in area, power consumption, and latency. We propose a multiprocessor-based Turbo-Decoder implementation where inherently parallel decoding tasks are mapped onto individual processing nodes. The resulting challenging inter-processor communication is efficiently handled by our framework such that throughput is not degraded. In this paper we present communication-centric architectures, from buses to heterogeneous networks, that allow numerous processors to be interconnected to perform high-throughput Turbo-decoding.
The ForSyDe methodology has been developed for system level design. Starting with a formal specification model that captures the functionality of the system at a high abstraction level, it provides formal design transformation methods for a transparent refinement process of the system model into an implementation model that is optimized for synthesis. The main contribution of this paper is the formal treatment of transformational design refinement. Using the formal semantics of ForSyDe processes, we introduce the term characteristic function to be able to define and classify transformations as either semantics-preserving transformations or design decisions. We also illustrate how we can incorporate classical synthesis techniques that have traditionally been used with control/data-flow graphs as ForSyDe transformations. Thus, our approach avoids discontinuities since it moves design refinement into the domain of the specification model.
The Lava system provides novel techniques for representing system level specifications which are supported by a design flow that maps Lava descriptions onto System-on-Chip platforms implemented on very large FPGAs. The key contribution of this paper is a type class based approach for specifying bus-based system configurations. This provides a very flexible and parameterised flow for combining predesigned IP blocks into a complete FPGA-based system.
In this article, a denotational definition of a synchronous subset of SystemC is proposed. The subset treated includes modules, processes, threads, the wait statement, ports and signals. We propose a formal model for the SystemC delta delay. We also give a complete semantic definition for the language's two-phase scheduler. The proposed semantics can serve as a basis for validating the equivalence of synchronous HDL subsets.
Reflection and automated introspection of a design in system level design frameworks are seen as necessities for the CAD tools to manipulate the designs within the tools. These features are also useful for debuggers, class and object browsers, design analyzers, composition validation, type checking, compatibility checking, etc. However, the central question is whether such features should be integrated into the language, or if we should build frameworks which feature these capabilities in a meta-layer, leaving the system-level language intact. In our recent interactions with designers, we have found differing opinions. Especially in the context of SystemC, the temptation to integrate reflective APIs into the language is great, because C++ is expressive, and already has type introspective packages available. In this paper, we analyze this issue and show that (i) it is a better EDA system architecture to implement reflection/introspection at a meta-layer in a design framework, and (ii) there are relatively unexplored territories of design automation, such as behavioral typing of component interfaces, corresponding type-theory, and their implication in automating component composition, interface synthesis, and validation, which can be better incorporated if the introspection is implemented at a meta-layer.
This paper presents and discusses the foundations on which the analog and mixed-signal extensions of SystemC, named SystemC-AMS, will be developed. First, requirements from targeted application domains are identified. These are then used to derive design objectives and related rationales. Finally, some preliminary seed work is presented and the outline of the analog and mixed-signal extensions development work is given.
Novel reconfigurable computing architectures exploit the inherent parallelism available in many signal processing problems. These architectures often consist of networks of compute elements that have an ALU-like structure with corresponding instructions. This opens opportunities for rapid dynamic reconfiguration and instruction multiplexing. The field of computer architectures has significantly contributed to the systematic and quantified exploration of architectures. Novel reconfigurable architecture exploration should learn from this approach. Future System-on-a-Chip platforms will consist of a combination of processor architectures, on-chip memories, and reconfigurable architectures. The real challenge is to design those architectures that can be programmed efficiently. This requires that first a programming environment and benchmarks be created and then that the reconfigurable architectures be systematically explored.
Dynamically reprogrammable hardware has been advocated in the academic research community as the next hot area in system design for some time now. The lack of integrated systems in the marketplace that incorporate dynamic reprogramming stands in contrast to the enthusiasm of the research community for the topic. We would like to offer as a middle ground several examples of dynamic reprogramming in working silicon that might help to illuminate the path towards the future of SoCs. In our research at STMicroelectronics, we have built two independent SoCs that utilize embedded FPGAs to provide the dynamic reprogramming capability. The benefit of the embedded FPGA has been demonstrated to range from application acceleration to augmenting functionality and providing silicon area reuse. The first system to be described is intended for image processing and biometric recognition. The second system is aimed at wireless LAN baseband processing.
This paper presents a lightweight approach for embedded reconfiguration of Xilinx Virtex II™ series FPGAs. A hardware and software infrastructure is reported that enables an FPGA to dynamically reconfigure itself under the control of a soft microprocessor core that is instantiated on the same array. The system provides a highly integrated, lightweight approach to dynamic reconfiguration for embedded systems. It combines the benefits of intelligent control, fast reconfiguration and small overhead.
This paper presents a novel source code transformation for control flow optimization called loop nest splitting, which minimizes the number of executed if-statements in loop nests of embedded multimedia applications. The goal of the optimization is to reduce runtimes and energy consumption. The analysis techniques are based on precise mathematical models combined with genetic algorithms. Due to the inherent portability of source code transformations, a very detailed benchmarking using 10 different processors can be performed. The application of our implemented algorithms to three real-life multimedia benchmarks leads to average speed-ups of 23.6% to 62.1% and energy savings of 19.6% to 57.7%. Furthermore, our optimization also leads to advantageous pipeline and cache performance.
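The effect of loop nest splitting can be illustrated with a small hand-made example. The sketch below is an assumption-based illustration rather than the paper's algorithm: the split point is chosen by hand instead of being derived by the mathematical models and genetic algorithms mentioned above, and Python is used only to count how often an if-statement executes. Once the loop-invariant condition is known to hold for all remaining iterations, a single "splitting-if" redirects execution to a copy of the nest that contains no if-statements.

```python
def original_nest(n, m, threshold):
    """Loop nest that evaluates the if-statement in every inner iteration."""
    total = checks = 0
    for x in range(n):
        for y in range(m):
            checks += 1  # one executed if-statement
            if x >= threshold:
                total += 2 * y
            else:
                total += y
    return total, checks


def split_nest(n, m, threshold):
    """Same computation with a splitting-if tested once per outer iteration."""
    total = checks = 0
    for x in range(n):
        checks += 1  # the splitting-if
        if x >= threshold:
            # The condition holds for this and all later x:
            # finish the iteration space in an if-free copy of the nest.
            for x2 in range(x, n):
                for y in range(m):
                    total += 2 * y
            break
        for y in range(m):
            checks += 1  # original if-statement, still executed here
            if x >= threshold:
                total += 2 * y
            else:
                total += y
    return total, checks


result_original = original_nest(8, 4, 2)
result_split = split_nest(8, 4, 2)
```

Both versions compute the same total, but for this assumed 8 x 4 iteration space the split version executes 11 if-statements instead of 32.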
With the widespread use of embedded devices such as PDAs, printers, game machines, cellular telephones, achieving high performance demands an optimized operating system (OS) that can take full advantage of the underlying hardware components. This paper presents a locality conscious process scheduling strategy for embedded environments. The objective of our scheduling strategy is to maximize reuse in the data cache. It achieves this by restructuring the process codes based on data sharing patterns between processes.
Compiler-directed ILP extraction techniques are critical to effectively exploiting the significant processing capacity of contemporary VLIW/EPIC machines. In this paper we propose a novel algorithm for ILP extraction targeting clustered EPIC machines that integrates three powerful techniques: predication, speculation and modulo scheduling. In addition, our framework schedules and binds operations, generating actual VLIW code. To the best of our knowledge, there is no other algorithm in the literature on predicated code optimizations that jointly considers speculation and modulo scheduling in the context of clustered EPIC machines. Our experimental results show that by jointly considering different extraction techniques in a resource-aware context, the proposed algorithm can take maximum advantage of the resources available on the clustered machine, aggressively improving performance.
This paper presents an efficient hash table based method to optimally overcome a new variant of the state space explosion which appears during the quasi-static task scheduling of embedded, reactive systems. Our application domain is targeted to one-processor software synthesis, and the scheduling process is based on Petri net reachability analysis to ensure cyclic, bounded and undeadlocked programs. To achieve greater flexibility, we employ a dynamic, history based criterion to prune the search space. This makes our synthesis approach different from most existing code generation techniques. Our experimental results reveal a significant reduction in algorithmic complexity (both in memory storage and CPU time) obtained for medium and large size problems.
Wireplanning is an approach in which the timing of input-output paths is planned before modules are specified, synthesized or sized. If these global wires are optimally segmented and buffered, their delay is linear in the path length and independent of the position of the modules along these paths. From timing requirements, the total budget left to modules after allocating the appropriate delay to the wires can be determined. This paper describes how this budget can be optimally divided amongst the modules. A novel, static timing-like, mathematical programming formulation is introduced such that the total module area is minimized. Instead of only the worst delay, all pin-to-pin delays are implicitly taken into account. If area-delay tradeoffs are convex, a reasonable approximation in practice, the program can be solved efficiently. Further, efficiency of different formulations is discussed, and a low-cost method of making the budget relatively immune to downstream uncertainties and surprises is presented. The efficiency of the formulation is clear from benchmarks with over 2000 nodes and 5e19 paths.
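The budget computation described above reduces to simple arithmetic once wire delay is linear in path length. The sketch below shows only that first step, computing the total delay left for the modules on one path; it does not attempt the optimal division of the budget among modules, which the abstract solves with a mathematical programming formulation. The function name, units and numbers are illustrative assumptions.

```python
def module_budget(required_time_ps, path_length_mm, wire_delay_ps_per_mm):
    """Delay budget left for all modules on one input-output path.

    With optimally segmented and buffered wires the wire delay is
    assumed linear in path length, so it does not depend on where the
    modules sit along the path.
    """
    wire_delay_ps = wire_delay_ps_per_mm * path_length_mm
    return required_time_ps - wire_delay_ps


# Assumed numbers: 1000 ps timing requirement, 4 mm path, 50 ps/mm wires.
budget = module_budget(1000, 4, 50)  # 1000 - 200 = 800 ps for the modules
```

A negative result would indicate that the timing requirement cannot be met by the wires alone, before any module delay is considered.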
We present a framework that considers global routing, repeater insertion, and flip-flop relocation for early interconnect planning. We formulate the interconnect retiming and flip-flop placement problem as a local area constrained retiming problem and solve it as a series of weighted minimum area retiming problems. Our method for early interconnect planning can reduce and even avoid design iterations between physical planning and high level designs. Experimental results show that our method can reduce the number of area violations by an average of 84% in a single interconnect planning step.
We propose a new metric for evaluation of interconnect architectures. This metric is computed by optimal assignment of wires from a given wire length distribution (WLD) to a given interconnect architecture (IA). This new metric, the rank of an IA, is a single number that gives the number of connections in the WLD that meet a specific target delay when embedded in the IA. A dynamic programming algorithm is presented to exactly compute the rank of an IA with respect to a given WLD within practical runtimes. We use our new IA metric to quantitatively compare impacts of geometric parameters as well as process and material technology advances. For example, we observe that 42% reduction in Miller coupling factor achieves the same rank improvement as a 38% reduction in inter-layer dielectric permittivity for a 1M gate design in the 130nm technology.
In its most general sense, intellectual property components (IPs) refer to any design artifacts that are reusable. While the specification of functional IPs, such as behavioral and RTL specifications, has been widely investigated, the specifications of others, such as timing, constraints, layouts and architectures, are largely ad hoc. This leads to different standard or proprietary file/database formats with interoperability problems, which eventually hinder the distribution and integration of IPs. In this paper, we address the difficult problem of integrating semantically diverse non-functional IPs by the use of a new, extensible language called Babel. Despite its simple 1-page grammar, Babel is the front end for a powerful IP-based design infrastructure. We demonstrate the effectiveness of our approach with two case studies, one for the creation of parameterized memory IPs and one for the creation of processor IPs.
In embedded system design, memory is one of the most restricted resources. Code compression has been proposed as a solution to reduce the code size of applications for embedded systems. Data compression techniques are used to compress programs to reduce memory size. Most previous work compresses all instructions found in an executable, without taking into account the program execution profile. In this paper, a profile-driven code compression design methodology is proposed. Program profiling information can be used to help code compression to selectively compress non-critical instructions, such that the system performance degradation due to the decompression penalty is reduced.
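The selection step of profile-driven compression can be sketched as follows. This is a hedged illustration, not the methodology of the paper: it assumes the profile is available as per-block execution counts, and it uses a hypothetical rule that leaves uncompressed the hottest blocks covering a chosen fraction of all executions while compressing everything else. The block names, sizes and the 95% coverage value are made up for the example.

```python
def choose_blocks_to_compress(blocks, hot_coverage):
    """Return the names of blocks to compress, sorted alphabetically.

    blocks: list of (name, size_bytes, exec_count) tuples from a profile.
    Blocks are visited from hottest to coldest; they stay uncompressed
    until hot_coverage of all executions is covered, the rest are
    treated as non-critical and compressed.
    """
    total = sum(count for _, _, count in blocks)
    covered = 0
    compress = []
    for name, _size, count in sorted(blocks, key=lambda b: b[2], reverse=True):
        if covered < hot_coverage * total:
            covered += count  # hot block: keep uncompressed
        else:
            compress.append(name)  # cold block: compress
    return sorted(compress)


# Assumed profile of five code blocks.
profile = [
    ("init", 400, 1),
    ("loop", 120, 9000),
    ("filter", 200, 900),
    ("error", 600, 0),
    ("exit", 80, 1),
]
to_compress = choose_blocks_to_compress(profile, 0.95)
compressed_bytes = sum(size for name, size, _ in profile if name in to_compress)
```

In this assumed profile the two hot blocks account for 320 of 1400 bytes, so most of the code is still compressed while the frequently executed instructions avoid the decompression penalty.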
This work treats the design and analysis of a programmable (or reconfigurable) DSP-domain-specific architecture called MorphoSys, on which the world's first single-chip software solution for a DVB-T base-band receiver can be implemented. Starting from the first version of MorphoSys, many modifications have been made to greatly improve both computation power and data movement efficiency. Sequential codes and SIMD codes can be parallelized; temporal granularity adjustment boosts performance by up to 4 times; numerous different types of data movement can be accelerated by 8 to 64 times relative to sequential movement. As a complicated (21 GOPS) and typical communication system, the DVB-T base-band receiver is designed with low performance loss and mapped onto the MorphoSys architecture (>28 GOPS). This contributes solidly to the development of software defined radio.
Built-In Self-Test (BIST) is also becoming important for more complex structures such as complete front-ends. To reduce the cost of the test overhead, Spectral Signature Analysis at system level appears to be a promising concept. The investigations carried out target the most challenging problems: generation of the test signature, evaluation of the signature response, implementation of the concept, and verification by simulation. From these investigations it can be concluded that the concept is especially suitable for transceiver-type DUTs.
Stresses are considered an integral part of any modern industrial DRAM test. This paper describes a novel method to optimize stresses for memory testing, using defect injection and electrical simulation. The new method shows how each stress should be applied to achieve a higher fault coverage of a given test, based on an understanding of the internal behavior of the memory. In addition, results of a fault analysis study, performed to verify the new optimization method, show its effectiveness.
Key words: stresses, memory testing, test optimization, defect simulation.
Circuit marginality failures in high performance VLSI circuits are projected to increase due to shrinking process geometries and high frequency design techniques. Capacitive cross coupling between interconnects is known to be a prime contributor to such failures. In this paper, we present novel techniques to model and prioritize capacitive cross-talk faults. Experimental results are provided to show effectiveness of the proposed modeling technique on industrial circuits.
Charge-pump phase-locked loops (CP-PLLs) are used in a variety of applications, including on-chip clock synthesis, symbol timing recovery for serial data streams, and generation of frequency-agile high-frequency carrier signals. In many applications PLLs are embedded into larger digital systems; in consequence, analogue test access is often limited. Test motivation is thus towards methods that can either aid digital-only test of the PLL or, alternatively, facilitate complete self-testing of the PLL. One useful characterisation technique used by PLL designers is closed-loop phase transfer function measurement. This test allows the PLL's natural frequency, damping, and 3 dB bandwidth to be estimated from the magnitude and phase response plots. These parameters relate directly to the time domain response of the PLL and will indicate errors in the PLL circuitry. This paper provides suggestions towards test methods that use a novel maximum frequency detection technique to aid automatic measurement of the closed-loop phase transfer function. In addition, the techniques presented have potential for full BIST applications.
Keywords: PLL, CP-PLL, BIST, TEST, DfT.
This paper presents a novel approach for an efficient, yet accurate estimation technique for power consumption and performance of embedded and general purpose applications. Our approach is adaptive in nature and is based on detecting sections of code characterized by high temporal locality (also called hotspots) in the execution profile of the benchmark being executed on a target processor. The technique itself is architecture and input independent and can be used for both embedded, as well as for general purpose processors. We have implemented a hybrid simulation engine which can significantly shorten the simulation time by using on-the-fly profiling for critical sections of the code and by reusing this information during power/performance estimation for the rest of the code. By using this strategy, we were able to achieve up to 20X better accuracy compared to a flat, non-adaptive sampling scheme and a simulation speed-up of up to 11.84X with a maximum error of 1.03% for performance and 1.92% for total energy on a wide variety of media and general purpose applications.
Chip multiprocessing (or multiprocessor system-on-a-chip) is a technique that combines two or more processor cores on a single piece of silicon to enhance computing performance. An important problem to be addressed in executing applications on an on-chip multiprocessor environment is to select the most suitable number of processors to use for a given objective function (e.g., minimizing execution time or energy-delay product) under multiple constraints. Previous research proposed an ILP-based solution to this problem that is based on exhaustive evaluation of each nest under all possible processor sizes. In this paper, we take a different approach and propose a pure runtime strategy for determining the best number of processors to use at runtime. This approach is more general than static techniques and can be applicable in situations where the latter cannot be.
Heterogeneous multi-processor platforms are an interesting option to satisfy the computational performance requirements of dynamic multi-media applications at a reasonable energy cost. Today, almost no support exists to energy-efficiently manage the data of a multi-threaded application on these platforms. In this paper we show that the assignment of data of dynamically created/deleted tasks to the shared memory has a large impact on the energy consumption. We present two dynamic memory allocators which solve the bank assignment problem for shared multi-banked SDRAM memories. Both allocators assign the tasks' data to the available SDRAM banks such that the number of page-misses is reduced. We have measured large energy savings with these allocators compared to existing dynamic memory allocators for several task-sets based on MediaBench[5].
Interconnects have deserved attention as a source of crosstalk to other interconnects, but have been ignored as a source of substrate noise. In this paper, we evaluate the importance of interconnect-induced substrate noise. A known interconnect and substrate model is validated by comparing simulation results to experimental measurements. Based on the validated modeling approach, a complete study considering frequency, geometrical, load and shielding effects is presented. The importance of interconnect-induced substrate noise is demonstrated after observing that, for typically sized interconnects and state-of-the-art speeds, the amount of coupled noise is already comparable to that injected by hundreds of transistors.
A new model-order reduction technique for linear dynamic systems is presented. The idea behind this technique is to transform the dynamic system function from the s-domain into the z-domain via the bilinear transformation, then use Prony's [1] least-squares approximation method instead of the commonly employed Padé approximation method, and finally transform the reduced system back into the s-domain using the inverse bilinear transformation.
[...] strategy allows a very accurate and efficient full-wave solution of interconnection structures with possibly complex geometry including the nonlinear and dynamic effects of real-world digital devices, without the need of detailed transistor-level models. Examples of signal integrity and field coupling analysis are shown.
Signal integrity is and will continue to be a major concern in deep sub-micron VLSI designs, where the proximity of signal-carrying lines leads to crosstalk, unpredictable signal delays and other parasitic side effects. Our scheme uses a bus encoding that guarantees that at any time any two signal-carrying lines are separated by at least one grounded line, thus providing a high degree of signal integrity. This comes at a small overhead of only one additional bus line (the closest related work needs 14 additional lines for a 32-bit bus) and a small average performance decrease of 0.36%. By means of a large set of real-world applications, we compare our scheme to other state-of-the-art approaches and present comparisons in terms of degree of integrity, overhead (e.g. additional lines required) and a possible performance decrease.
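The guarantee stated above (any two signal-carrying lines are separated by at least one grounded line at any time) can be expressed as a simple invariant check. The sketch below does not reproduce the encoding itself, which the abstract does not describe; it only tests the stated property on an assumed per-cycle description of the bus, where each line is marked 'S' (carrying a signal) or 'G' (grounded). The representation and function names are hypothetical.

```python
def signals_are_shielded(lines):
    """True if no two adjacent bus lines both carry a signal.

    lines: string with one character per physical bus line,
    'S' for a signal-carrying line and 'G' for a grounded line.
    """
    return all(not (a == "S" and b == "S") for a, b in zip(lines, lines[1:]))


def schedule_is_shielded(cycles):
    """Check the invariant for every bus cycle in a sequence."""
    return all(signals_are_shielded(lines) for lines in cycles)


# Assumed 5-line bus over two cycles: the grounded lines alternate position.
good = schedule_is_shielded(["SGSGS", "GSGSG"])
bad = schedule_is_shielded(["SGSGS", "SSGSG"])  # lines 0 and 1 are adjacent signals
```

A checker of this kind could be used in simulation to confirm that an encoder never violates the separation property, independently of how the encoder is built.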
We present a fast and accurate SW simulation model called the fast timed SW model. The model enables fast simulation through native execution of the application SW and OS, and achieves simulation accuracy through timed SW and HW simulation. When building fast timed SW models, we need to solve two problems: (1) how to enable timing synchronization between the native execution and HW simulation and (2) how to make the native execution (which needs multi-tasking support from the simulation environment to emulate its multi-tasking operation) portable across different simulation environments (which provide different types of multi-tasking). In this paper, to enable the synchronization, we present a synchronization function. To enable the portability, we present an adaptation layer called the simulation environment abstraction layer. We present our case studies in building fast timed SW models.
Given the growth in application-specific processors, there is a strong need for a retargetable modeling framework that is capable of accurately capturing complex processor behaviors and generating efficient simulators. We propose the operation state machine (OSM) computation model to serve as the foundation of such a modeling framework. The OSM model separates the processor into two interacting layers: the operation layer where operation semantics and timing are modeled, and the hardware layer where disciplined hardware units interact. This declarative model allows for direct synthesis of micro-architecture simulators as it encapsulates precise concurrency semantics of microprocessors. We illustrate the practical benefits of this model through two case studies - the StrongARM core and the PowerPC-750 superscalar processor. The experimental results demonstrate that the OSM model has excellent modeling productivity and model efficiency. Additional applications of this modeling framework include derivation of information required by compilers and formal analysis for processor validation.
In this paper the application of Instruction Set Emulation for rapid prototyping of SoCs will be presented. The emulation works in a way that both the software and the hardware behaviour of the emulated processor core is reproduced cycle accurately. This requires the use of hardware and software components. The hardware component consists of a board containing a VLIW processor and FPGAs. The software component is an instruction set simulator of the core running on the VLIW processor. The FPGAs are used for emulating the SoC bus of this processor core. This way the simulation of an instruction set of a processor core has been extended to a real emulation of this core that can be used for rapid prototyping.
This paper describes an approach to hardware/software design space exploration for reconfigurable processors. Because of the user-definable instructions, the existing compiler tool-chain needs to be extended in order to offer developers an easy way to explore the design space. Such extensions are often not easy to use for developers who have only a software background and are therefore unfamiliar with the details of the reconfigurable architecture or with hardware logic synthesis on FPGAs. Our approach differs from others because it is based on a simple extension of the standard programming model well known to software developers.
The emergence of run-time reconfigurable architectures makes feasible the configure-execute paradigm. Compilation of behavioral descriptions (in, e.g., C, Java, etc.), apart from mapping the computational structures onto the available resources on the device, must split the program in temporal sections if it needs more resources than physically available. In addition, since the execution of the computational structures in a configuration needs at least two stages (i.e., configuring and computing), it is important to split the program such that the reconfiguration overheads are minimized, taking advantage of the overlapping of the execution stages on different configurations. This paper presents mapping techniques to cope with those features. The techniques are being researched in the context of a C compiler for the eXtreme Processing Platform (XPP). Temporal partitioning is applied to furnish a set of configurations that reduces the reconfiguration overhead and thus may lead to performance gains. We also show that when applications include a sequence of loops, the use of several configurations may be more beneficial than the mapping of the entire application onto a single configuration. Preliminary results for a number of benchmarks strongly confirm the approach.
In this paper we present a hardware implementation of the RSA algorithm for public-key cryptography. The RSA algorithm consists of the computation of modular exponentials on large integers, which can be reduced to repeated modular multiplications. We present a serial implementation of RSA, which is based upon an optimized version of the RSA algorithm originally proposed by P.L. Montgomery. The proposed architecture is innovative, and it widely exploits specific capabilities of Xilinx programmable devices. As compared to other solutions in the literature, the proposed implementation of the RSA processor has smaller area occupation and comparable performance. The final performance level is a function of the serialization factor. We provide a thorough discussion of design tradeoffs, in terms of area requirements vs performance, for different values of the key length and of the serialization factor.
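The reduction of modular exponentiation to repeated modular multiplications can be sketched in software. The following is a minimal illustration of Montgomery multiplication combined with left-to-right square-and-multiply, assuming an odd modulus and Python 3.8 or later; the function names are hypothetical, and it does not model the serial hardware architecture or the serialization factor discussed in the abstract.

```python
def montgomery_setup(n, k):
    """Precompute constants for an odd modulus n < 2**k, with R = 2**k."""
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r  # n * n_prime == -1 (mod R)
    r2 = (r * r) % n                # used to convert into Montgomery form
    return n_prime, r2


def mont_mul(a, b, n, k, n_prime):
    """Return a * b * R^-1 mod n for 0 <= a, b < n, without dividing by n."""
    mask = (1 << k) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask
    u = (t + m * n) >> k            # exact division by R
    return u - n if u >= n else u


def mont_pow(base, exp, n):
    """Compute base**exp mod n using only Montgomery multiplications."""
    k = n.bit_length()
    n_prime, r2 = montgomery_setup(n, k)
    x = mont_mul(base % n, r2, n, k, n_prime)  # base in Montgomery form
    acc = mont_mul(1, r2, n, k, n_prime)       # 1 in Montgomery form
    for bit in bin(exp)[2:]:
        acc = mont_mul(acc, acc, n, k, n_prime)      # square
        if bit == "1":
            acc = mont_mul(acc, x, n, k, n_prime)    # multiply
    return mont_mul(acc, 1, n, k, n_prime)     # leave Montgomery form
```

For a toy key such as n = 3233, e = 17, d = 2753, `mont_pow(mont_pow(65, 17, 3233), 2753, 3233)` recovers the message 65. Real key lengths only change the size of the integers involved.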
In modern SoCs, embedded memories occupy the largest part of the chip area and include an even larger share of the active devices. As memories are designed very tightly to the limits of the technology, they are more prone to failures than logic. Thus, memories concentrate the large majority of defects and affect circuit yield dramatically. As a consequence, Built-In Self-Repair is gaining importance. This work presents optimal reconfiguration functions for memory built-in self-repair on the data-bit level. We also present a dynamic repair scheme that allows reducing the size of the repairable units. The combination of these schemes allows repairing multiple faults affecting both regular and spare units at low hardware cost. The scheme uses a single test pass, resulting in low test and repair time.
There have been several recent attempts to include duplication-based on-line testability in behaviourally synthesized designs. In this paper, on-line testability is considered within the optimisation process of iterative, cost function-driven high-level synthesis, such that on-line testing resources are inserted automatically without any modification of the source HDL code. This involves the introduction of a metric for on-line testability. A variation of duplication testing (namely inversion testing) is also used, providing the system with an additional degree of freedom towards minimising hardware overheads associated with test resource insertion. Considering on-line testability within the synthesis process facilitates fast and efficient design space exploration, resulting in a versatile high-level synthesis process, capable of producing alternative realisations according to the designer's directions.
Instruction and data caches are well known architectural solutions that allow significantly improving the performance of high-end processors. Due to their sensitivity to soft errors they are often disabled in safety critical applications, thus sacrificing performance for improved dependability. In this paper we report an accurate analysis of the effects of soft errors in the instruction and data caches of a soft core implementing the SPARC architecture. Thanks to an efficient simulation-based fault injection environment we developed, we are able to present in this paper an extensive analysis of the effects of soft errors on a processor running several applications under different memory configurations. The procedure we followed allows the precise computation of the processor failure rate when the cache is enabled even without resorting to expensive radiation experiments.
In this article we propose a high speed and highly testable parallel two-rail code checker, which features a compact structure and is Totally-Self-Checking or Strongly Code-Disjoint with respect to a wide set of realistic faults. The proposed checker is also particularly suitable to implement embedded two-rail code checkers, as it requires only two input codewords for fault detection. Our checker can be employed to check the correct operation of a connected functional block using the two-rail code, to implement the output two-rail code checker of "normal" checkers for unordered codes, or to join together the error messages produced by various checkers (possibly using different codes) present within the same self-checking system. The behavior of our checker has been verified by means of electrical level simulations (performed using HSPICE), considering both nominal values and statistical variations of electrical parameters.
Automotive systems engineering has made significant progress in using formal methods to design safe hardware-software systems. The architectures and design methods could become a model for safe and cost-efficient embedded software development as a whole. This paper gives several examples from the leading edge of industrial automotive applications.
The architectural study of wireless communication systems typically requires simulations with high-level models for different analog and RF blocks. Among these blocks, frequency-translating devices such as mixers pose problems in RF circuit simulation since their response typically covers a mix of long and short time scales. This paper proposes a technique to analyze and model nonlinear frequency-translating RF circuits such as up- and down-conversion mixers. The proposed method is based on a generalized Volterra series approach for periodically time-varying systems. It enables a multi-tone distortion analysis starting from a circuit description and derives simplified high-level models based on the most important nonlinear contributions. These models both give insight into the nonlinear behavior and enable an efficient high-level simulation during architectural design of front-ends of RF transceivers.
For the example of a 12-bit Nyquist-rate ADC, a model for nonlinearity-causing mechanisms is developed based on circuit simulations. The model is used to estimate circuit element values from measured device characteristics. Post-manufacture reconfiguration of the digital control part of the device-type that is used as a test vehicle in this work can improve the linearity performance of a device. An algorithm is proposed that searches for a locally-optimal reconfiguration based on the determined circuit element values. Applying calibration to the circuit simulation model allows one to estimate the performance improvement obtainable with the proposed calibration scheme for a given manufacturing process prior to a physical implementation.
This paper describes a sizing and design methodology for high-speed high-accuracy current steering D/A converters taking into account mismatching in all the transistors of the current source cell. The presented method allows a more accurate selection of the optimal design point without introducing arbitrary safety margins, as was done in the previous literature. This methodology has been applied to the design of a CMOS 12-bit 400 MHz current-steering segmented D/A converter. Commercial CAD tools are used to automatically lay out regular structures of the DAC, especially the current source array, following an optimal two-dimensional switching scheme to compensate for systematic mismatch errors.
Wireless LANs (WLANs) operating in the 5-6 GHz range become commercially viable only if they can be produced at low cost. Consequently, tight integration of the physical layer, consisting of the radio front-end and the digital signal processing part, is a must. Especially with respect to mixed-signal feedback loops, with automatic gain control as a recurring example, existing tools have major difficulties in offering efficient ways of modeling and simulation. We present a modeling approach where the complexity of the analog behavioral model has been reduced to the minimum required by the digital receiver, namely its steady-state responses and a "worst-case" time delay. Moreover, we show how this mixed-signal receiver model can be used in an end-to-end communication link simulation to give the designer insight into statistical information such as transient delays and gain tolerances. For this model, we set up a co-simulation of two existing in-house tools, one for the analog part, the other for the digital system part.
The increasing use of microprocessor cores in embedded systems, as well as mobile and portable devices, creates an opportunity for customizing the cache subsystem for improved performance. Traditionally, a design-simulate-analyze methodology is used to achieve desired cache performance. Here, to bootstrap the process, arbitrary cache parameters are selected and the cache subsystem is simulated using a cache simulator; based on the performance results, the cache parameters are tuned, and the process is repeated until an acceptable design is obtained. Since the cache design space is typically very large, the traditional approach often requires a very long time to converge. In the proposed approach, we outline an efficient algorithm that directly computes cache parameters satisfying the desired performance. We demonstrate the feasibility of our algorithm by applying it to a large number of embedded system benchmarks. Keywords: Cache Optimization, Core-Based Design, Design Space Exploration, System-on-a-Chip
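The traditional design-simulate-analyze loop described in this abstract can be sketched as follows. This is only an illustration of the baseline that the paper aims to replace, not the proposed direct algorithm; the LRU set-associative cache model, the candidate parameter ranges and the function names are all assumptions made for the example.

```python
from collections import OrderedDict


def miss_count(trace, num_sets, ways, line_bytes):
    """Simulate a set-associative LRU cache and return the number of misses."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        block = addr // line_bytes
        cache_set = sets[block % num_sets]
        tag = block // num_sets
        if tag in cache_set:
            cache_set.move_to_end(tag)         # mark as most recently used
        else:
            misses += 1
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict least recently used
            cache_set[tag] = True
    return misses


def explore(trace, max_misses, line_bytes=16):
    """Simulate every candidate and keep the smallest cache meeting the target."""
    best = None
    for num_sets in (1, 2, 4, 8, 16, 32):      # assumed candidate values
        for ways in (1, 2, 4):
            size = num_sets * ways * line_bytes
            if miss_count(trace, num_sets, ways, line_bytes) <= max_misses:
                if best is None or size < best[0]:
                    best = (size, num_sets, ways)
    return best  # (size in bytes, sets, ways), or None if nothing qualifies
```

Every candidate costs one full pass over the trace, which is why the search becomes slow once line size, replacement policy and larger parameter ranges are added to the design space.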
In system-level platform-based embedded systems design, the mapping model is a crucial link between the application model and the architecture model. All three models must match when design-space exploration has to be fast and accurate, and when exploration methods and design methods have to be closely related. For the media processing application domain we present an architecture model and corresponding mapping model that meet these requirements better than previously proposed models. A case study illustrates this improvement.
This paper describes a design space exploration experiment for a real application from the embedded networking domain - the physical layer of a wireless protocol. The application models both control oriented as well as data processing functions, and hence requires composing tasks from different models of computation. We show how the cost and performance of communication and computation can be quickly evaluated, with a reasonable modeling cost. While the example uses a specific tool, the methodology and results can be used in a more general context.
This paper proposes a novel methodology tailored to design embedded systems, taking into account the emerging market needs, such as hw/sw partitioning, object-oriented specifications, overall design costs and early analysis of design alternatives. The proposal tackles the problem by considering UML as the starting point for system-level description and uses a customization of Function Point analysis and COCOMO to provide cost metrics both for hardware and software. Finally, a genetic algorithm is used to select the best candidate architecture. The paper also reports some results, obtained from a case study, showing the viability of the proposed approach.
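For readers unfamiliar with COCOMO, the textbook Basic COCOMO model estimates effort and schedule from code size. The sketch below uses the standard published coefficients; the customization used in the paper is not given in the abstract, so this is only a stand-in, and the function names are hypothetical.

```python
# Textbook Basic COCOMO coefficients (a, b, c, d) per project mode.
# These are NOT the paper's customized values, which the abstract omits.
BASIC_COCOMO = {
    "organic": (2.4, 1.05, 2.5, 0.38),
    "semi-detached": (3.0, 1.12, 2.5, 0.35),
    "embedded": (3.6, 1.20, 2.5, 0.32),
}


def effort_person_months(kloc, mode="embedded"):
    """Effort = a * KLOC**b, in person-months."""
    a, b, _, _ = BASIC_COCOMO[mode]
    return a * kloc ** b


def schedule_months(kloc, mode="embedded"):
    """Development time = c * Effort**d, in months."""
    _, _, c, d = BASIC_COCOMO[mode]
    return c * effort_person_months(kloc, mode) ** d
```

A cost function of this kind can serve as one term of the fitness that a search procedure, such as the genetic algorithm mentioned in the abstract, evaluates for each candidate architecture.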
This paper details the first step of the Design Trotter framework for design space exploration applied to dedicated SOCs. The aim of this step is to provide metrics in order to guide the designer and the synthesis tool towards an efficient application-architecture matching. This work presents a computation of metrics at all levels of the application graph-based hierarchy. These metrics are computed through data and control dependency analysis. They quantify the memory, control and processing orientations as well as the average parallelism for different granularities.
This paper presents an efficient methodology for estimating the energy consumption of application programs running on extensible processors. Extensible processors, which are increasingly popular in embedded system design, allow a designer to customize a base processor core through instruction set extensions. Existing processor energy macro-modeling techniques are not applicable to extensible processors, since they assume that the instruction set architecture as well as the underlying structural description of the microarchitecture remain fixed. Our solution to this problem is an energy macro-model suitably parameterized to estimate the energy consumption of a processor instance that incorporates any custom instruction extensions. Such a characterization is facilitated by careful selection of macro-model parameters/variables that can capture both the functional and structural aspects of the execution of a program on an extensible processor. Another feature of the proposed characterization flow is the use of regression analysis to build the macro-model. Regression analysis allows for in-situ characterization, thus allowing arbitrary test programs to be used during macro-model construction. We validate the proposed methodology by characterizing the energy consumption of a state-of-the-art extensible processor (Tensilica's Xtensa). We use the macro-model to analyze the energy consumption of several benchmark applications with custom instructions. The mean absolute error in the macro-model estimates is only 3.3%, when compared to the energy values obtained by a commercial tool operating on the synthesized RTL description of the custom processor. Our approach achieves an average speedup of three orders of magnitude over the commercial RTL energy estimator. Our experiments show that the proposed methodology also achieves good relative accuracy, which is essential in energy optimization studies.
In this paper, we present an algorithm which automatically maps the IPs onto a generic regular Network on Chip (NoC) architecture and constructs a deadlock-free deterministic routing function such that the total communication energy is minimized. At the same time, the performance of the resulting communication system is guaranteed to satisfy the specified constraints through bandwidth reservation. As the main contribution, we first formulate the problem of energy/performance aware mapping, in a topological sense, and show how the routing flexibility can be exploited to expand the solution space and improve the solution quality. An efficient branch-and-bound algorithm is then described to solve this problem. Experimental results show that the proposed algorithm is very fast, and significant energy savings can be achieved. For instance, for a complex video/audio application, 51.7% energy savings have been observed, on average, compared to an ad-hoc implementation.
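The mapping problem can be illustrated on a very small mesh. The sketch below assumes that communication energy is proportional to traffic volume times the hop count of a minimal (XY-style) route, and it replaces the paper's branch-and-bound algorithm with plain exhaustive search, which is only feasible for a handful of tiles. Bandwidth constraints are not modeled, and all names are hypothetical.

```python
from itertools import permutations


def hops(a, b, width):
    """Manhattan distance between tiles a and b on a mesh of given width."""
    ax, ay = a % width, a // width
    bx, by = b % width, b // width
    return abs(ax - bx) + abs(ay - by)


def comm_cost(mapping, traffic, width):
    """Assumed energy proxy: sum of volume * hop count over all flows."""
    return sum(vol * hops(mapping[src], mapping[dst], width)
               for (src, dst), vol in traffic.items())


def best_mapping(cores, traffic, width, height):
    """Try every placement of cores on tiles and return the cheapest one."""
    best = None
    for perm in permutations(range(width * height), len(cores)):
        mapping = dict(zip(cores, perm))
        cost = comm_cost(mapping, traffic, width)
        if best is None or cost < best[0]:
            best = (cost, mapping)
    return best  # (cost, {core: tile index})
```

Branch-and-bound explores the same placement tree but discards partial mappings whose lower-bound cost already exceeds the best complete mapping found so far.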
This paper presents a low-power encoding technique, called chromatic encoding, for the Digital Visual Interface standard (DVI), a digital serial video interface. Chromatic encoding reduces power consumption by minimizing the transition counts on the DVI. This technique relies on the notion of tonal locality, i.e., the observation that the signal differences between adjacent pixels in images follow a Gaussian distribution. Based on this observation, an optimal code assignment is performed to minimize the transition counts. Furthermore, the three color channels of the DVI may be reciprocally encoded to achieve even more power saving. The idea is that given the signal values from the three color channels, one or two of these channels are encoded by reciprocal differences with a number of redundant bits used to indicate the selection. The proposed technique requires only three redundant bits for each 24-bit pixel. Experimental results show up to a 75% transition reduction.
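The tonal-locality idea can be illustrated with a simplified difference code. The sketch below is not the paper's chromatic encoding and does not model the DVI signalling itself: it assumes 8-bit pixels on a single channel, counts only the bit toggles inside each codeword, and assigns the codewords with the fewest toggles to the pixel differences of smallest magnitude. All names are hypothetical.

```python
def transitions(word, bits=8):
    """Number of 0/1 toggles between adjacent bits of a bits-wide word."""
    mask = (1 << (bits - 1)) - 1
    return bin((word ^ (word >> 1)) & mask).count("1")


def build_code_table(bits=8):
    """Map differences (mod 2**bits) to codewords, small |diff| first."""
    size = 1 << bits
    codes = sorted(range(size), key=lambda w: (transitions(w, bits), w))
    diffs = sorted(
        range(size),
        key=lambda d: (abs(d - size if d >= size // 2 else d), d),
    )
    return dict(zip(diffs, codes))


def encode(pixels, table, bits=8):
    """Replace each pixel by the codeword of its difference to the previous one."""
    size, prev, out = 1 << bits, 0, []
    for p in pixels:
        out.append(table[(p - prev) % size])
        prev = p
    return out


def decode(codes, table, bits=8):
    """Invert encode() by accumulating the decoded differences."""
    size, prev, out = 1 << bits, 0, []
    inverse = {code: diff for diff, code in table.items()}
    for c in codes:
        prev = (prev + inverse[c]) % size
        out.append(prev)
    return out
```

On a smooth gradient almost every difference is the same small value, so nearly all transmitted codewords are among those with the fewest toggles, while the mapping stays lossless because the table is a bijection.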
We present a graph theoretical methodology that reduces the implementation complexity of a vector multiplied by a scalar. The proposed approach is called MRP (minimally redundant parallel) optimization and is presented in an FIR filtering framework to obtain a low-complexity multiplier-less implementation. The key idea is to expand the design space using shift inclusive differential coefficients together with computation reordering using a graph theoretic approach to obtain maximal computation sharing. The transformed architecture of a filter is obtained by solving a set cover problem of the graph. A simple algorithm based on a greedy approach is presented. The proposed approach is merged with common sub-expression elimination. The simulation results show 70% and 16% improvements in computational complexity over a simple implementation (transposed direct form) and common sub-expression elimination, respectively, when using a carry lookahead adder synthesized from the Synopsys DesignWare library in 0.25 µm technology.
For wireless embedded systems, the power consumption in the network interface (radio) plays a dominant role in determining battery life. In this paper, we explore transport protocol optimizations for reducing the energy consumption of wireless LAN interfaces. Our work is based on the observation that the transport protocol, which implements flow control to regulate the network traffic, plays a significant role in determining the workload of the network interface. Hence, by monitoring run-time parameters in the transport protocol, coarse-granularity idle periods, which present the best opportunities for network interface power reduction, can be accurately identified. We further show that, by tuning parameters in the protocol software implementation, we can shape the activity profile of the network interface, making it more energy efficient while remaining compliant with the TCP standard. To validate the proposed techniques, we have performed extensive current measurements using an experimental testbed that consists of a Compaq iPAQ PDA with a Cisco Aironet wireless network adapter. Our measurements indicate energy savings ranging from 28% to 69% compared to the use of state-of-the-art MAC layer power reduction techniques, with little or no impact on performance.
Software self-testing of embedded processor cores, which effectively partitions the testing effort between low-speed external equipment and internal processor resources, has recently been proposed as an alternative to classical hardware built-in self-test techniques, over which it provides significant advantages. In this
A P1500-Compatible Programmable BIST Approach for the Test of Embedded Flash Memories [p. 720]
In this paper we present a microprocessor-based approach suitable for embedded flash memory testing in a System-on-a-chip (SOC) environment. The main novelty of the approach is its high flexibility, which guarantees easy reuse of the same architecture for different memory cores. The proposed approach is compatible with the P1500 standard. A case study has been developed and demonstrates the advantages of the proposed core test strategy in terms of area overhead and test application time.
Test data compression (TDC) is a promising low-cost methodology for System-on-a-Chip (SOC) test. This is due to the fact that it can reduce not only the volume of test data but also the bandwidth requirements. In this paper we provide a quantitative analysis of two distinctive TDC methods from the system integrator's standpoint considering a core based SOC environment. The proposed analysis addresses four parameters: compression ratio, test application time, area overhead and power dissipation. Based on our analysis, some future research directions are given which can lead to an easier integration of TDC in the SOC design flow and to further improve the four parameters.
One of the difficult problems which core-based system-on-chip (SoC) designs face is test access. For testing the cores in a SoC, a special mechanism is required, since they are not directly accessible via chip inputs and outputs. In this paper we introduce a novel Test Access Mechanism (TAM) based on time domain multiplexing (TDM-TAM). This TAM is P1500 compatible and uses a P1500 wrapper. The TAM characteristics are its flexibility, scalability, and reconfigurability. The proposed TAM is compared with two other approaches: a serial threading approach analogous to the IEEE 1149.1 standard (Serial TAM) [7] and a packet switching test network (NIMA) [9]. A network-processing engine SoC is used as a platform to compare the different TAMs [6]. Results show that in most cases, TDM is the most effective TAM in both test time and overhead area. Keywords: SoC te