This paper introduces an FPGA IP evaluation and delivery
system that operates within Java applets. The use of such
applets allows designers to create, evaluate, test, and obtain
FPGA circuits directly within a web browser. Based on
the JHDL design tool, these applets allow structural viewing,
circuit simulation, and netlist generation of application-specific
circuits. Applets can be customized to provide varying
levels of IP visibility and functionality as needed by both
customer and vendor.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids - Simulation, Hardware
description languages
General Terms
Design
Keywords
Intellectual Property, JHDL, Applet, FPGA
Linear programming (LP) in its many forms has proven to be an
indispensable tool for expressing and solving optimization problems
in numerous domains. We propose the first set of generic
watermarking techniques for integer-LP (ILP). The proof of authorship
by watermarking is achieved by introducing additional
constraints to limit the solution space and can be used as effective
means of intellectual property protection (IPP) and authentication.
We classify and analyze the types of constraints in the ILP watermarking
domain and show how ILP formulations provide more
degrees of freedom for embedding signatures than other existing
approaches. To demonstrate the effectiveness of the proposed ILP
watermarking techniques, the generic discussion is further concretized
using two examples, namely Satisfiability and Scheduling.
Categories and Subject Descriptors
K.6.5 [Management of Computing and Information Systems]:
Security and Protection -- Digital Watermarking
General Terms
Algorithms, Economics, Theory, Legal Aspects.
Keywords
Digital Watermarking, Intellectual Property Protection
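As an illustration of constraint-based watermarking on the Satisfiability example, the following is a minimal sketch, not the authors' encoding. It assumes a hypothetical scheme in which a SHA-256 digest of the signature string selects the literals of a few extra clauses. Each clause corresponds to a 0-1 linear inequality (the sum of its literals is at least 1), so the same idea carries over to an ILP formulation. The function names, the clause width, and the brute-force solver are assumptions made for illustration only.

```python
import hashlib
from itertools import product


def signature_clauses(signature, num_vars, num_clauses, width=2):
    """Derive extra CNF clauses from a signature (hypothetical encoding)."""
    digest = hashlib.sha256(signature.encode()).digest()
    clauses = []
    pos = 0
    for _ in range(num_clauses):
        clause = []
        for _ in range(width):
            byte = digest[pos % len(digest)]
            pos += 1
            var = byte % num_vars + 1             # variable index in 1..num_vars
            sign = 1 if (byte >> 7) & 1 else -1   # top bit picks the polarity
            clause.append(sign * var)
        clauses.append(clause)
    return clauses


def satisfies(assignment, clauses):
    """True if every clause has at least one literal that holds."""
    return all(
        any((lit > 0) == assignment[abs(lit)] for lit in clause)
        for clause in clauses
    )


def solve(num_vars, clauses):
    """Brute-force search; only meant for tiny illustrative instances."""
    for bits in product([False, True], repeat=num_vars):
        assignment = {i + 1: b for i, b in enumerate(bits)}
        if satisfies(assignment, clauses):
            return assignment
    return None


original = [[1, 2, 3], [-1, 4], [-2, -4, 5]]
watermark = signature_clauses("ACME Corp 2002", num_vars=5, num_clauses=3)
solution = solve(5, original + watermark)
if solution is not None:
    # The author can later point to the watermark clauses as evidence:
    # an arbitrary solution of `original` is unlikely to satisfy all of them.
    print(satisfies(solution, watermark))
```

The extra clauses can make a small instance unsatisfiable, which is why the sketch checks for `None`. A practical scheme would choose constraints that shrink the solution space without emptying it.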
Design tools can be profitably associated with libraries of
reusable modeling components that will make the description
and also the validation of the models much easier. Furthermore,
applications of today and tomorrow will be increasingly based
on three fundamental technologies: Object
Orientation, Client/Server and Internet. We propose in this
article an object-oriented architecture for the definition of
Web-based hierarchical model libraries. The originality of
our approach lies in the fact that it is based on: (i) a
notion of genericity of use, (ii) notions such as inheritance and
abstraction links between the stored models, and (iii) Web-based
procedures for storing and consulting libraries.
Categories and Subject Descriptors
J.6 [Computer Applications]: Computer-Aided Design;
D.2.11 [Software]: Software Engineering - Software Architectures;
D.2.13 [Software]: Software Engineering - Reusable Software
General Terms
Design, Management
Keywords
models reuse, models libraries, Web-based access, abstraction hierarchy
Engineering change (EC) is a technique that enables a designer
to rapidly perform minor specification alterations
while minimally resynthesizing only small portions of the
specification throughout several levels of design abstraction.
In this paper, we introduce the first EC-based synthesis technique
for coordinated design optimization in multiple steps.
The technique has four phases: optimization region identification, feedback
formulation, resynthesis in the first step, and
finally resynthesis in the second design step. To demonstrate
the technique, we focus on behavioral synthesis and
transformation, scheduling, and register assignment steps.
We developed a generic EC-based approach for design optimization
during multiple consecutive synthesis steps. Next,
we show how one can use EC to enhance coordinated application
of transformations and scheduling, and scheduling
and register assignment.
Categories and Subject Descriptors
B.5.2 [Register-Transfer-Level Implementation]: Design
Aids - Optimization
General Terms
Design
Keywords
Engineering change, transformations, scheduling, register assignment
In the last decade, instruction-set simulators have become
an essential development tool for the design of new programmable
architectures. Consequently, the simulator performance
is a key factor for the overall design efficiency.
Because of the extremely poor performance of commonly used
interpretive simulators, research on fast compiled instruction-set
simulation started ten years ago. However,
due to the restrictiveness of the compiled technique,
it has not gained acceptance in commercial products.
This paper presents a new retargetable simulation
technique which combines the performance of traditional
compiled simulators with the flexibility of interpretive simulation.
This technique is not limited to any class of architectures
or applications and can be utilized from architecture
exploration up to end-user software development.
The workflow and applicability of the so-called just-in-time
cache compiled simulation (JIT-CCS) technique are
demonstrated on state-of-the-art real-world architectures.
Categories and Subject Descriptors
I.6.3 [Simulation and Modeling]: Simulation Support
Systems; I.6.3 [Simulation and Modeling]: Model Validation
and Analysis; D.3.2 [Programming Languages]:
Design Languages - LISA; C.0 [General]: Modeling of Computer
Architecture
General Terms
Design, Languages, Performance
Keywords
Retargetable simulation, compiled simulation, instruction
set architectures
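The idea of combining compiled and interpretive simulation can be illustrated with a minimal sketch. It assumes that "cache compiled" means translating each instruction the first time it is executed and reusing the translation on later visits; the abstract does not give the details. The three-instruction toy ISA, the closure-based "compiled" form, and all names are hypothetical. The cache entry stores the instruction word it was built from, so a changed program memory location is recompiled rather than executed stale.

```python
def compile_instruction(word):
    """Translate one instruction word into a Python closure."""
    op = word[0]
    if op == "addi":                      # regs[reg] += imm
        _, reg, imm = word

        def run(state):
            state["regs"][reg] += imm
            state["pc"] += 1
    elif op == "jnz":                     # jump to target if regs[reg] != 0
        _, reg, target = word

        def run(state):
            taken = state["regs"][reg] != 0
            state["pc"] = target if taken else state["pc"] + 1
    elif op == "halt":
        def run(state):
            state["halted"] = True
    else:
        raise ValueError(f"unknown opcode: {op!r}")
    return run


def simulate(program, regs, max_steps=10_000):
    """Run `program`, compiling each instruction on first use."""
    state = {"pc": 0, "regs": dict(regs), "halted": False}
    cache = {}                            # pc -> (word, compiled closure)
    stats = {"compiled": 0, "executed": 0}
    while not state["halted"] and stats["executed"] < max_steps:
        pc = state["pc"]
        word = program[pc]
        entry = cache.get(pc)
        if entry is None or entry[0] != word:   # miss, or memory was modified
            entry = (word, compile_instruction(word))
            cache[pc] = entry
            stats["compiled"] += 1
        entry[1](state)
        stats["executed"] += 1
    return state["regs"], stats


program = [
    ("addi", "r1", 2),    # r1 += 2
    ("addi", "r0", -1),   # r0 -= 1
    ("jnz", "r0", 0),     # loop while r0 != 0
    ("halt",),
]
regs, stats = simulate(program, {"r0": 5, "r1": 0})
print(regs, stats)
```

In this run the loop body executes five times, yet each of the four instructions is translated only once, which is where the speed-up over a purely interpretive loop would come from.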
Profiling an application executing on a microprocessor is part of
the solution to numerous software and hardware optimization and
design automation problems. Most current profiling techniques
suffer from runtime overhead, inaccuracy, or slowness, and the
traditional non-intrusive method of using a logic analyzer does not
work for today's systems-on-a-chip with embedded cores. We
introduce a novel on-chip memory architecture that overcomes
these limitations. The architecture, which we call ProMem, is
based on a pipelined binary tree structure. It achieves single-cycle
throughput, so it can keep up with today's fastest pipelined
processors. It can also be laid out efficiently and scales very well,
becoming more efficient the larger it gets. The memory can be
used in a wide variety of common profiling situations, such as
instruction profiling, value profiling, and network traffic profiling,
which in turn can be used to guide numerous design automation
tasks.
Keywords
Profiling, system-on-a-chip, platform tuning, adaptive
architectures, low power, embedded CAD, binary tree, memory
design, embedded systems.
Code compression is known as an effective technique to reduce
instruction memory size on an embedded system. However, code
compression can also be very effective in increasing processor-to-memory
bandwidth and hence provide increased system performance.
In this paper we describe our design and design methodology
of the first running prototype of a one-cycle code decompression
unit that decompresses compressed instructions on-the-fly.
We describe in detail the architecture that enables decompression
of multiple instructions in one cycle and we present the design
methodologies and tools used. The stand-alone decompression unit
does not require any modifications to the processor core. We observed
a performance increase of up to 63%, and 25% on average, over
a wide variety of applications running on the hardware prototype
under various system configurations.
Categories and Subject Descriptors
B.3 [Hardware]: Memory Structures; C.3 [Computer Systems
Organization]: Special-purpose and Application-based Systems-Real-time
and embedded systems
General Terms
Algorithms, Design, Performance
We present a framework for passivity-preserving model reduction
for RLC systems that includes, as a special case, the well-known
PRIMA model reduction algorithm. This framework provides a
new interpretation for PRIMA, and offers a qualitative explanation
as to why PRIMA performs remarkably well in practice. In addition,
the framework enables the derivation of new error bounds for
PRIMA-like methods. We also show how the framework offers a
systematic approach to computing reduced-order models that better
approximate the original system than PRIMA, while still preserving
passivity.
Categories and Subject Descriptors
G.1.3 [NUMERICAL ANALYSIS]: Numerical Linear Algebra -
linear systems; F.2.1 [ANALYSIS OF ALGORITHMS AND
PROBLEM COMPLEXITY]: Numerical Algorithms and Problems-computations
on matrices
General Terms
Algorithms
Keywords
Model Reduction, Large Scale Systems, RLC interconnect, Passivity
Preserving, Factorization
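For orientation, the projection form in which PRIMA is commonly stated can be written as follows. The notation is an assumption, since the abstract does not fix one: $C$ and $G$ are the susceptance and conductance matrices of the RLC system, $B$ the port incidence matrix, and $V$ a real matrix whose columns span the chosen Krylov subspace.

```latex
\begin{align*}
  C\,\dot{x}(t) + G\,x(t) &= B\,u(t), \qquad y(t) = B^{T} x(t), \\
  \hat{C} &= V^{T} C V, \qquad \hat{G} = V^{T} G V, \qquad \hat{B} = V^{T} B, \\
  z^{T}\hat{C}\,z &= (Vz)^{T} C\,(Vz) \;\ge\; 0
      \quad \text{whenever } C = C^{T} \succeq 0, \\
  z^{T}\bigl(\hat{G} + \hat{G}^{T}\bigr) z &= (Vz)^{T}\bigl(G + G^{T}\bigr)(Vz) \;\ge\; 0
      \quad \text{whenever } G + G^{T} \succeq 0 .
\end{align*}
```

The last two lines show why a congruence transformation with any real $V$ keeps the definiteness properties on which the passivity argument rests; the choice of $V$ then only affects accuracy.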
This paper presents a class of algorithms suitable for model reduction
of distributed systems. Distributed systems are not suitable
for treatment by standard model-reduction algorithms such as
PRIMA, PVL, and the Arnoldi schemes because they generate matrices
that are dependent on frequency (or other parameters) and
cannot be put in a lumped or state-space form. Our algorithms
build on well-known projection-based reduction techniques, and so
require only matrix-vector product operations and are thus suitable
for operation in conjunction with electromagnetic analysis codes
that use iterative solution methods and fast-multipole acceleration
techniques. Under the condition that the starting systems satisfy
system-theoretic properties required of physical systems, the reduced
systems can be guaranteed to be passive. For distributed
systems, we argue that causality of the underlying representation is
as important a consideration as passivity has become.
Categories and Subject Descriptors: B.7.2 Simulation, B.8.2 Performance
Analysis and Design Aids, G.1.1 Interpolation, G.1.2 Approximation,
I.6 Simulation and Modeling.
General Terms: Algorithms, Performance, Design
Keywords: Passive reduced order modeling, Distributed systems.
The major concerns in state-of-the-art model reduction algorithms
are: achieving accurate models of sufficiently small size, numerically
stable and efficient generation of the models, and preservation
of system properties such as passivity. Algorithms such
as PRIMA generate guaranteed-passive models, for systems with
special internal structure, using numerically stable and efficient
Krylov-subspace iterations. Truncated Balanced Realization (TBR)
algorithms, as used to date in the design automation community,
can achieve smaller models with better error control, but do not
necessarily preserve passivity. In this paper we show how to construct
TBR-like methods that guarantee passive reduced models
and in addition are applicable to state-space systems with arbitrary
internal structure.
Categories and Subject Descriptors: B.7.2 Simulation, B.8.2 Performance
Analysis and Design Aids, I.6 Simulation and Modeling.
General Terms: Algorithms, Performance, Design
Keywords: Passive reduced order modeling, Truncated balanced
realization.
Almost by definition, well-tuned digital circuits have a large number of equally critical paths, which form a so-called "wall" in the slack histogram. However, by the time the design has been through manufacturing, many uncertainties cause these carefully aligned delays to spread out. Inaccuracies in parasitic predictions, clock slew, model-to-hardware correlation, static timing assumptions and manufacturing variations all cause the performance to vary from prediction. Simple statistical principles tell us that the variation of the limiting slack is larger when the height of the wall is greater. Although the wall may be the optimum solution if the static timing predictions were perfect, in the presence of uncertainty in timing and manufacturing, it may no longer be the best choice. The application of formal mathematical optimization in transistor sizing increases the height of the wall, thus exacerbating the problem. There is also a practical matter that schematic restructuring down-stream in the design methodology is easier to conceive when there are fewer equally critical paths. This paper describes a method that gives formal mathematical optimizers the incentive to avoid the wall of equally critical paths, while giving up as little as possible in nominal performance. Surprisingly, such a formulation reduces the degeneracy of the optimization problem and can render the optimizer more effective. This "uncertainty-aware" mode has been implemented and applied to several high-performance microprocessor macros. Numerical results are included.
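The statistical argument about the wall can be made concrete with a small Monte Carlo sketch. It rests on simplifying assumptions that are not taken from the paper: every path in the wall has the same nominal slack of zero, and each path's slack is perturbed by an independent Gaussian error with the same standard deviation. The function name and parameter values are hypothetical.

```python
import random
import statistics


def mean_limiting_slack(num_paths, sigma=10.0, trials=2000, seed=0):
    """Average worst-case slack over `trials` random perturbations.

    All `num_paths` paths have nominal slack 0; each gets an independent
    Gaussian error with standard deviation `sigma` (assumed model).
    """
    rng = random.Random(seed)
    worst = [
        min(rng.gauss(0.0, sigma) for _ in range(num_paths))
        for _ in range(trials)
    ]
    return statistics.mean(worst)


for height in (1, 10, 100):
    print(height, round(mean_limiting_slack(height), 2))
```

Under these assumptions the expected limiting slack falls further below its nominal value as the number of equally critical paths grows, which is the effect an uncertainty-aware formulation tries to avoid.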
We present a global wire design methodology that simultaneously
considers the performance needs for both signal
lines and power grids under congestion considerations. An
iterative procedure is employed in which the global routing
is performed according to a congestion map that includes
the resource utilization of the power grid, followed by a step
in which the power grid is adjusted to relax the congestion
in crowded regions. This adjustment is in the form of wire
removal in noncritical regions, followed by a wire sizing step
that overcomes the effects of wire removal. Experimental
results show that the overall routability can be significantly
improved while the power grid noise is maintained within
the voltage droop constraint.
Categories and Subject Descriptors
B.8.2 [Performance and Reliability]: Performance Analysis
and Design Aids
General Terms
Algorithms
Keywords
wire congestion, codesign, signal routing, power grid noise
Interconnect management is a critical design issue for large FPGA
based designs. One of the most important issues for planning interconnection
is the ability to accurately and efficiently predict the
routability of a given design on a given FPGA architecture. The recently
proposed routability estimation procedure, fGREP [6], produced
estimates within 3 to 4% of an actual detailed router. Other
known routability estimation methods include RISA [5], Lou's [7]
method and Rent's rule based methods [1] [12] [9]. Comparing
these methods has been difficult because of the different reporting
methods used by the authors. We propose a uniform reporting metric
based on comparing the estimates produced with the results of
an actual detailed router on both local and global levels. We compare
all the above methods using our reporting metric on a large
number of benchmark circuits and show that the enhanced fGREP
method produces tight estimates that outperform most other techniques.
Categories and Subject Descriptors
B.7.2 [Design Aids for Integrated Circuits]:
General Terms
Algorithms, Measurement, Experimentation
Keywords
FPGA, fGREP, routability estimation, congestion, RISA, Rent's
rule
This paper discusses potential solutions to the CMOS device
technology scaling at gate lengths approaching 10nm. Promising
circuit and design techniques to control leakage power are
described. Energy-efficient microarchitecture trends for general purpose
microprocessors are elucidated.
Categories and Subject Descriptors
B.7 INTEGRATED CIRCUITS
B.7.1 Types and Design Styles Microprocessors and
microcomputers, VLSI.
General Terms
Performance, Design
Keywords
Technology scaling, Leakage control, Microarchitecture
The next generation of computer chips will continue the trend
toward greater complexity than their predecessors. Many of them will
contain different chip technologies and are termed SoCs (System
on a Chip). They present major new challenges to the process community, the system
and circuit communities, as well as the design and test
communities. On the other hand, they also
offer new opportunities. For one, the
desire to bring more functionality onto a single chip tends to
require additional processing, which in turn results in various
degrees of device compromises. The chips will also tend to
become larger due to the added device content, and this
generally will impact the yieldability of the final chip. Such
chips will also potentially require new approaches to validate the
intended design performance. Chip sector reuse must also be
brought into the discussion and, wherever possible, into practice.
The net effect implies higher chip costs. Much of the industry's
effort is therefore focused on addressing these challenges,
so far without much success. The alternative has
been to continue in the placement of chips onto substrate
modules. Yet, this solution creates practical limits on achievable
wiring densities and bandwidth, due to the spacing requirements
of the C4 interconnection. Furthermore, every C4 joint is
associated with a signal delay of about 50 psec. All of these
handicaps would potentially benefit greatly from new SoC
methods, starting with the fabrication methodology and
extending it into the chip design and test areas.
Such a direction has been set in motion. The opportunity for a
uniquely new chip fabrication method has emerged by
combining a set of somewhat diverse processes. It is based on a
judicious selection of process elements from the traditional chip
area and combined with those of a somewhat more recent chip
packaging process methodology. This approach simultaneously
overcomes all of the key current process
limitations experienced with today's SoC chip designs and
eliminates certain chip packaging technology handicaps.
Yet it does not require new process tooling. It relies
on currently existing process tooling and process methodologies.
This new process direction has been found to be quite applicable
to a number of desirable SoC device designs, and offers new
opportunities for yet another expansion of the current
semiconductor technology base over the next few years.
However, effective SoC designs and fabrications require a much
closer and earlier collaboration between the process, design and
test communities.
General Terms: Design
Key Words: SoCs (System on a Chip); Chip Fabrication
methods; Chip/Packing integration; Chip Subsector concepts
In this paper, the evolution of CMOS and its fundamental and practical
limitations are briefly reviewed, and the working principles,
performance, and fabrication of single-electron transistors (SETs) are
addressed in detail. Some of the unique characteristics and
functionality of SETs, like unrivalled integration and low power,
which are complementary to sub-20 nm CMOS, are
demonstrated. Characteristics of two novel SET architectures, namely,
C-SET and R-SET, aimed at logic applications are compared. Finally,
it is shown that combination of CMOS and SET in hybrid ICs appears
to be attractive in terms of new functionality and performance,
together with better integrability for ULSI, especially because of their
complementary characteristics. It is envisioned that efforts in terms of
compatible fabrication processes, packaging, modeling, electrical
characterization, co-design and co-simulation will be needed in the
near future to achieve substantial advances in both memory and logic
circuit applications based on CMOS-SET hybrid circuits.
Categories and Subject Descriptors
B.7 INTEGRATED CIRCUITS
B.7.1 Types and Design Styles Advanced Technologies
General Terms
Design, Experimentation, Measurement, Performance
Keywords
Nanoelectronics, Single-Electron Transistors, Ultimate CMOS,
Hybrid CMOS-SET Circuits, Low power, Inverter, Quantizer.
In this paper, we present recent advances in the understanding of
the properties of semiconducting single-wall carbon nanotubes and
in the exploration of their use as field-effect transistors (FETs).
Both electrons and holes can be injected in a nanotube transistor
by either controlling the metal-nanotube Schottky barriers present
at the contacts or simply by doping the bulk of the nanotube.
These methods give complementary nanotube FETs that can be
integrated together to make inter- and intra-nanotube logic
circuits. The performance and general characteristics of these devices
suggest that they can compete with silicon MOSFETs. While this
is true when considering simple prototype devices, several issues
remain to be explored before a nanotube-based technology is
possible. These issues are also discussed.
Categories and Subject Descriptors
B.6.0 [Logic Design]: General novel logic devices.
General Terms
Measurement, Performance, Design, Experimentation.
Keywords
Nanoelectronics, Carbon Nanotube, Semiconductor, Field-Effect
Transistor, FET, Schottky Barrier, Circuits, Inverter, Logic Gate,
SWNT.
Symbolic simulation is attracting increasing interest for the validation
of digital circuits. It allows the verification engineer to explore
all, or a major portion of the circuit's state space without having
to design specific and time-consuming test stimuli. However, the
complexity and unpredictable run-time behavior of symbolic simulation
have limited its scope to small-to-medium circuits.
In this paper, we propose a novel approach to symbolic simulation
that reduces the size of the BDDs of the state vector while
maintaining an exact representation of the set of states visited. The
method exploits the decomposition properties of Boolean functions.
By restructuring the next-state functions in their disjoint support
components, we gain a better insight in the role of each input variable.
Consequently, we can simplify the next-state functions without
significantly sacrificing the simulation accuracy. Our experimental
results show that this approach can effectively
reduce the memory requirements of symbolic simulation while
surrendering only a small portion of the design's state space.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids - Verification, Simulation; B.8
[Hardware]: Performance and Reliability
General Terms
Design, Verification, Performance, Theory
Keywords
Formal Verification, Symbolic Simulation, BDDs
Symbolic simulation is a formal verification technique which combines
the flexibility of conventional simulation with powerful symbolic
methods. Some constructs, however, which are easy to handle in conventional
simulation need special consideration in symbolic simulation.
This paper discusses some special constructs that require unique
treatment in symbolic simulation such as the symbolic representation
of arrays, an efficient symbolic method for storing arrayed instances
and the handling of symbolic data-dependent delays. We present results
which demonstrate the effectiveness of our symbolic array model
in the simulation of highly regular structures like FPGAs, memories or
cellular automata.
Categories and Subject Descriptors
B.5.2 [Hardware]: Register-Transfer-Level Implementation - Design
Aids, Verification; B.6.3 [Hardware]: Logic Design - Design Aids,
Verification; B.7.2 [Hardware]: Integrated Circuits - Design Aids, Verification
General Terms
Verification
Keywords
Symbolic Simulation, Formal Verification
One method of handling the computational complexity of the verification
process is to combine the strengths of different approaches.
We propose a hybrid verification technology combining symbolic
trajectory evaluation with either symbolic model checking or SAT-based
model checking. This reduces significantly the cost (both
human and computing) of verifying circuits with complex initialisation,
as well as simplifying proof development by enhancing verification
productivity. The approach has been tested on current Intel
designs.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids - verification; F.3.1 [Specifying
and Verifying and Reasoning about Programs]: mechanical verification
General Terms
Verification, Theory
Keywords
symbolic model checking, symbolic trajectory evaluation, hybrid
verification
Bounded Model Checking (BMC) based on propositional satisfiability (SAT) methods has recently proven its efficacy for bug hunting. BDD-based tools are able to verify broader sets of properties (e.g. CTL formulas), but recent experimental comparisons between SAT and BDDs in formal verification lead to the conclusion that SAT approaches are more robust and scalable than BDD techniques. In this work we extend BDD-based verification to larger circuit and problem sizes, so that it can indeed compete with SAT-based tools. The approach we propose solves Bounded Model Checking problems using BDDs. In order to cope with larger models it exploits approximate traversals, yet it is exact, i.e. it does not produce false negatives or positives. It reaps relevant performance enhancements from mixed forward and backward, approximate and exact traversals, guided search, conjunctive decompositions and generalized cofactor based BDD simplifications. We experimentally compare our tool with BMC in NuSMV (using mchaff as SAT engine), and we show that BDDs are able to accomplish large verification tasks, and they can better cope with increasing sequential depths.
An RTL C-based design and verification methodology is
presented which enabled the successful high speed validation of a 7
million gate simultaneous multi-threaded (SMT) network processor.
The methodology is centered on statically scheduled C-based coding
style, C to HDL translation, and a novel RTL-C to RTL-Verilog
equivalence checking flow. It leverages improved simulation performance
combined with static techniques to reduce the amount of
RTL-Verilog and gate-level verification required during development.
Categories - B.5.2 [Register-Transfer-Level Implementation]
Design Aids: Automatic synthesis, Hardware description languages,
Optimization, Simulation, Verification.
General Terms - Design, Verification, Performance, Languages.
Keywords - C/C++, RTL, design, verification, formal equivalence checking.
A central problem in functional verification is to check that a circuit
block is producing correct outputs while enforcing that the environment
is providing legal inputs. To attack this problem, several researchers
have proposed monitor-based methodologies, which offer
many benefits. This paper presents a novel, high-level specification
style for these monitors, along with a linear-size, linear-time translation
algorithm into monitor circuits. The specification style naturally
fits the complex, but well-specified interfaces used between
IP blocks in systems-on-chip. To demonstrate the advantage of our
specification style, we have specified monitors for various versions
of the Sonics OCP protocol as well as the AMBA AHB protocol,
and have developed a prototype tool that automatically translates
specifications into Verilog or VHDL monitor circuits.
Categories and Subject Descriptors
B.5.2 [Register-Transfer Level Implementation]: Design Aids;
B.6.3 [Logic Design]: Design Aids; C.0 [Computer Systems
Organization]: General - Systems specification methodology; J.6
[Computer-Aided Engineering]: Computer-aided design (CAD)
General Terms
Documentation, Languages, Verification
Keywords
Formal Verification, Regular Expressions, Pipelining, Alternation
Getting the interlock logic which controls pipeline flow correct
is an important prerequisite for maximising pipeline performance.
Unnecessary pipeline stalls can only be eliminated when they can
be distinguished from those stalls which are necessary to preserve
functional correctness.
We propose a method for deriving a maximum pipeline performance
specification from a complete functional specification of the
pipeline control logic. The performance specification can be used
to generate simulation testbench assertions. On the other hand, the
specification can serve as a basis for formal property checking. The
most promising aspect of our work is, however, the potential to synthesise
the actual control logic from its formal description.
Categories and Subject Descriptors
B.5.2 [Register-Transfer-Level Implementation]: Design Aids -
Verification; B.5.1 [Register-Transfer-Level Implementation]: Design
- Control Design, Pipeline
General Terms
Performance, Verification
Keywords
Pipeline Stall, Interlock Logic, Verification
One of the main concerns of the designer of a circuit module is to guarantee
that the interface of the module conforms to specific protocols (such as
PCI Bus or Ethernet) by which it interacts with its environment. The
computational complexity of verifying such open systems under all possible
environments has been shown to be very hard (EXPTIME complete [10]). On
the other hand, designers are typically required to guarantee correct behavior
only for specific valid behaviors of the environment (such as a valid PCI
Bus environment). Designers attempt to model these behaviors through an
appropriate test bench for the module. In this paper we present a module
verifier tool based on a proposed real time temporal logic called Open-RTCTL,
which allows combined specification of the correctness properties and the input
environments. The tool accepts the design in a subset of Verilog. By making
the designer specify the environment constraints, we are able to verify a
module in isolation, and thereby avoid the state explosion problem due to
composition of modules. We present experimental results on modules from
the Texas-97 Benchmark circuits [14] to demonstrate the space/time
efficiency of the tool.
Categories and Subject Descriptors
B.7.2. [Hardware]: Integrated Circuits - Verification
General Terms
Verification
Keywords
Formal Verification, Temporal Logic
The automated generation of timing models from gate-level
netlists facilitates IP reuse and dramatically improves chip-level
STA runtime in a hierarchical design flow. In this paper we
discuss two different approaches to model generation, the design
flows they lend themselves to and results from the application of
these model generation solutions to large customer designs.
Categories and Subject Descriptors
J.6: Computer Application.CAD.
General Terms
Design, Performance, Algorithm, Verification
Keywords
Static Timing Analysis, Model Generation, EDA.
A timing model extractor builds a timing model of a digital circuit for use with a static timing analyzer. This paper proposes a novel method of generating a gray box timing model from a gate-level netlist by reducing a timing graph. Previous methods of generating timing models sacrificed accuracy and/or did not scale well with design size. The proposed method is simple, yet it provides model accuracy including arbitrary levels of latch time borrowing, correct support for self-loop timing checks and the capability to support timing constraints that span multiple blocks. Also, the CPU and memory resources required to generate the model scale well with the size of the circuit. We were able to extract a model for a 456K gate block using under 2 minutes of CPU time and 464 MB of memory on a Sun Fire 880 machine. The generated model can provide a capacity improvement in timing verification by more than two orders of magnitude.
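Reducing a timing graph can be illustrated with a minimal sketch of node elimination. This is a generic serial/parallel merge and not the method proposed in the paper: it assumes an acyclic graph with a single maximum delay per edge and ignores latches, time borrowing, and timing checks. Eliminating a node replaces each fan-in/fan-out pair by one edge whose delay is the sum (serial merge), keeping the larger delay when such an edge already exists (parallel merge). The `keep` argument names the nodes that must survive: only the ports for a port-to-port model, or ports plus selected internal nodes for a gray-box style model. All names are hypothetical.

```python
def eliminate_node(edges, node):
    """Remove `node` from a DAG given as {(src, dst): max_delay}."""
    fanin = {u: d for (u, v), d in edges.items() if v == node}
    fanout = {v: d for (u, v), d in edges.items() if u == node}
    # Keep every edge that does not touch the eliminated node.
    reduced = {e: d for e, d in edges.items() if node not in e}
    for u, d_in in fanin.items():
        for v, d_out in fanout.items():
            through = d_in + d_out                       # serial merge
            reduced[(u, v)] = max(reduced.get((u, v), through), through)  # parallel merge
    return reduced


def reduce_timing_graph(edges, keep):
    """Eliminate every node that is not listed in `keep`."""
    internal = {n for edge in edges for n in edge} - set(keep)
    for node in sorted(internal):
        edges = eliminate_node(edges, node)
    return edges


graph = {
    ("a", "n1"): 2, ("b", "n1"): 3,
    ("n1", "n2"): 4, ("n2", "z"): 5,
    ("n1", "z"): 1, ("a", "z"): 7,
}
print(reduce_timing_graph(graph, keep={"a", "b", "z"}))
```

The reduced graph preserves the longest-path delay between every pair of kept nodes, so a static timing analyzer sees the same worst-case arrival times at the block boundary while traversing far fewer edges.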
We have developed a new timing abstraction model for digital
circuit blocks that is stimulus independent, port based, supports
designs with level triggered latches, and can be input into
commercial STA (Static Timing Analysis) tools. The model is
based on an extension of the concept of latch transparency to
circuit block transparency introduced in this paper. It was
implemented, tested and is being used in conjunction with
transistor level STA for microprocessor designs with tens of
millions of transistors. The STA simulation times are significantly
shorter than with gray box timing models, which can decrease the
overall chip timing verification time. The model can also be used
in the intellectual property encapsulation domain.
Categories and Subject Descriptors
B.7.2 [Integrated Circuits]: Design Aids - simulation, verification
General Terms
Performance, Design, Verification.
Keywords
Timing analysis, timing model, VLSI design, circuit optimization
This paper proposes a fast multi-cycle path analysis method
for large sequential circuits. It determines whether or not all
the paths between every flip-flop pair are multi-cycle paths.
The proposed method is based on ATPG techniques, especially
on implication techniques, to utilize circuit structure
and multi-cycle path condition directly. The method also
checks whether or not the multi-cycle path may be invalidated
by static hazards in combinational logic parts. Experimental
results show that our method is much faster than
conventional ones.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids
General Terms
Algorithms, Designs, Verification
Keywords
multi-cycle path, sequential circuits, implication, ATPG
Textiles and computing share a synergistic relationship, which is being harnessed to create a new paradigm in personalized mobile information processing (PMIP). In this paper, we provide an overview of this "interconnection" between the two fields and present the vision for "E-Textiles," which represents the convergence of the two fields. We discuss the role of the Georgia Tech Wearable Motherboard in pioneering this paradigm of "fabric is the computer" and serving as a framework for PMIP. Finally, recent research in this area resulting in the realization of a "computational fabric network" is discussed.
This paper addresses an emerging new field of
research that combines the strengths and capabilities of electronics
and textiles in one: electronic textiles, or e-textiles. E-textiles, also
called Smart Fabrics, have not only "wearable" capabilities like any
other garment, but also local monitoring and computation, as well as
wireless communication capabilities. Sensors and simple
computational elements are embedded in e-textiles, as well as built
into yarns, with the goal of gathering sensitive information,
monitoring vital statistics and sending them remotely (possibly over a
wireless channel) for further processing. Possible applications include
medical (infant or patient) monitoring, personal information
processing systems, or remote monitoring of deployed personnel in
military or space applications. We illustrate the challenges imposed by
the dual textile/electronics technology on their modeling and
optimization methodology.
Categories and Subject Descriptors: I.6 [Simulation and
Modeling]: Modeling methodologies; B.8.2 [Performance and
reliability]: performance analysis and design aids.
General terms: design, performance
Categories and Subject Descriptors
I.2.8 [Problem Solving, Control Methods, and Search]:
Scheduling
General Terms
Algorithms, Design
Keywords
voltage selection, task scheduling
In this paper, we present a two-phase framework that integrates
task assignment, ordering and voltage selection (VS)
together to minimize energy consumption of real-time dependent
tasks executing on a given number of variable voltage processors.
Task assignment and ordering in the first
phase strive to maximize the opportunities that can be exploited
for lowering voltage levels during the second phase,
i.e., voltage selection. In the second phase, we formulate the
VS problem as an Integer Programming (IP) problem and
solve the IP efficiently. Experimental results demonstrate
that our framework is very effective in executing tasks at
lower voltage levels under different system configurations.
Operation of battery-powered portable systems can no longer be
sustained once a battery becomes discharged. Maximization of the
battery lifetime is a difficult task due to nonlinearity of battery behavior
that depends on the characteristics of the system load profile.
We address the problem of task sequencing without and with voltage/
clock scaling that shapes the profile so that the battery lifetime
is maximized. We developed an accurate analytical battery model
and validated it with measurements taken on a real lithium-ion battery
used in a pocket computer. We use the model as a basis for a
unique battery-conscious cost function and utilize its properties to
develop several novel algorithms, including insertion of recovery
periods and voltage/clock scaling for delay slack distribution.
Categories and Subject Descriptors
J.6.2 [Computer-Aided Engineering]: Computer-Aided Design
General Terms
Algorithms, Performance, Design
Keywords
Battery, modeling, low-power design, scheduling, voltage scaling
In this paper, we evaluate an adaptive loop parallelization strategy
(i.e., a strategy that allows each loop nest to execute using a different
number of processors if doing so is beneficial) and measure
the potential energy savings when unused processors during execution
of a nested loop in a multi-processor on-a-chip (MPoC)
are shut down (i.e., placed into a power-down or sleep state). Our
results show that shutting down unused processors can lead to as
much as 67% energy savings with up to 17% performance loss in
a set of array-intensive applications. We also discuss and evaluate
a processor pre-activation strategy based on compile-time analysis
of nested loops. Based on our experiments, we conclude that an
adaptive loop parallelization strategy combined with idle processor
shut-down and pre-activation can be very effective in reducing
energy consumption without increasing execution time.
Categories and Subject Descriptors
D.3.4 [Programming Languages]: Processors - Compilers, Optimization
General Terms
Design, Experimentation, Performance
Keywords
Adaptive Parallelization, Multiprocessing, Energy Consumption
A regular circuit structure called a River PLA and its reconfigurable
version, Glacier PLA, are presented. River PLAs
provide greater regularity than circuits implemented with
standard-cells. Conventional optimization stages such as
technology mapping, placement and routing are eliminated. These
two features make the River PLA a highly predictable structure.
Glacier PLAs can be an alternative to FPGAs, but with a simpler
and more efficient design methodology.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids - Automatic synthesis.
General Terms
Algorithms.
Keywords
Programmable Logic Array, River routing.
In deep sub-micron (DSM) technology, wires are as important as, or more
important than, logic components, since wire-related problems such
as crosstalk and noise are much more critical in system-on-chip (SoC) design.
Recently, a method [12] was proposed for generating a partial product reduction
tree (PPRT) with optimal timing using bit-level adders to
implement arithmetic circuits; it outperforms the current best
designs. However, in the conventional approaches, including
[12], interconnects are not primary components to be optimized
in the synthesis of arithmetic circuits, mainly due to their
high integration complexity or unpredictable wire effects, thereby
resulting in unsatisfactory layout results with long and messy wire
connections. To overcome this limitation, we propose a new module
generation/synthesis algorithm for arithmetic circuits utilizing
carry-save-adder (CSA) modules, which not only optimizes the circuit
timing but also generates a much more regular interconnect topology
of the final circuits. Specifically, we propose a two-step algorithm:
(Phase 1: CSA module generation) we propose an optimal-timing
CSA module generation algorithm for an arithmetic expression under
a general CSA timing model; (Phase 2: Bit-level interconnect
refinements) we optimally refine the interconnects between the CSA
modules while retaining the global CSA-tree structure produced by
Phase 1. It is shown that the timing of the circuits produced by
our approach is equal or very close to that of [12] in most test cases
(even without including the interconnect delay), and at the
same time, the interconnects in layout are significantly shorter and
more regular.
Categories and Subject Descriptors
B.2.4 [Arithmetic and Logic Structures]: High-Speed Arithmetic
- Algorithms, Cost/Performance
General Terms: Algorithms, Design and Performance
Keywords: Carry-save-adder, layout, high performance
An architectural solution to reducing memory energy consumption
is to adopt a multi-bank memory system instead of a monolithic
(single-bank) memory system. Some recent multi-bank memory
architectures help reduce memory energy by allowing an unused
bank to be placed into a low-power operating mode. This paper
describes an automatic data migration strategy which dynamically
places the arrays with temporal affinity into the same set of banks.
This strategy increases the number of banks which can be put into
low-power modes and allows the use of more aggressive energy saving
modes. Experiments using several array-dominated applications
show the usefulness of data migration and indicate that large
energy savings can be achieved with low overhead.
Categories and Subject Descriptors
B.3 [Hardware]: Memory Structures
General Terms
Design, Experimentation, Performance
Keywords
Energy Consumption, Multi-Bank Memories, Data Migration
In this paper, we present a compiler strategy to optimize data accesses
in regular array-intensive applications running on embedded
multiprocessor environments. Specifically, we propose an optimization
algorithm that targets the reduction of extra off-chip memory
accesses caused by inter-processor communication. This is
achieved by increasing the application-wide reuse of data that resides
in the scratch-pad memories of processors. Our experimental
results obtained on four array-intensive image processing applications
indicate that exploiting inter-processor data sharing can reduce
the energy-delay product by as much as 33.8% (and 24.3%
on average) on a four-processor embedded system. The results also
show that the proposed strategy is robust in the sense that it gives
consistently good results over a wide range of architectural
parameters.
Categories and Subject Descriptors B.3 [Hardware] Memory
Structures; D.3.4 [Software] Programming Languages: Processors
[Compilers]
General Terms Algorithms, management, performance.
Keywords Embedded multiprocessors, energy consumption, scratch
pad memories, access patterns, compiler optimizations, data tiles.
One of the important issues in embedded system design
is to optimize program code for the microprocessor to be stored
in ROM. In this paper, we propose an integrated approach to the
DSP address code generation problem for minimizing the number
of addressing instructions. Unlike previous works in which code
scheduling and offset assignment are performed sequentially without
any interaction between them, our work tightly couples the offset
assignment problem with code scheduling so that scheduling can be exploited
to minimize addressing instructions more effectively. We accomplish
this by developing a fast but accurate two-phase procedure
which, for a sequence of code schedules, finds a sequence of memory
layouts with minimum addressing instructions. Experimental
results with benchmark DSP programs show improvements of
13%-33% in the address code size over Solve-SOA/GOA [7].
Categories and Subject Descriptors
C.3 [Special-purpose and application-based systems]: [Signal
processing systems]
General Terms
Algorithms, Performance
Keywords
Offset assignment, Scheduling, Code Generation
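As a rough illustration of the cost that offset assignment tries to minimize, the sketch below counts explicit addressing instructions for a given access sequence and memory layout. It assumes a simplified model that is not taken from the paper: a single address register with free post-increment/decrement by one, so an explicit instruction is needed only when consecutive accesses are more than one location apart. The function name, variable names, and both schedules are hypothetical.

```python
def addressing_cost(access_seq, layout):
    """Count explicit address-register updates for one schedule and one layout.

    Assumed model: one address register with free +1/-1 post-modify.
    """
    # Map each variable to its offset in memory.
    offset = {var: i for i, var in enumerate(layout)}
    cost = 0
    for prev, cur in zip(access_seq, access_seq[1:]):
        # A jump of more than one location needs an explicit instruction.
        if abs(offset[cur] - offset[prev]) > 1:
            cost += 1
    return cost


layout = ["a", "b", "c", "d"]
schedule_1 = ["a", "b", "c", "d", "a", "c"]  # hypothetical access order
schedule_2 = ["a", "b", "c", "d", "c", "a"]  # hypothetical reordering

print(addressing_cost(schedule_1, layout))
print(addressing_cost(schedule_2, layout))
```

Under this toy model, the same layout costs less for the second schedule than for the first, which is the kind of interaction between scheduling and layout that a coupled approach can exploit.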
An agile, transparent optical network is emerging. This paper
enumerates the functions that will be needed at nodes in order to
add transparency and agility to the network while robustly
assuring optical fibre channel performance. One key requirement
will be scalable integration of multiple functions on a platform.
The paper presents recent results along one strategic axis for
integration: the use of functionalized self-organized photonic
crystals and heterostructures thereof to control the flow and
features of light.
Categories and Subject Descriptors
C.2.1 [Network Architecture and Design]: Enabling technologies.
General Terms
Experimentation, Theory.
Keywords
Agile optical networks, reconfigurability, optical performance
monitoring, optoelectronic integration, photonic crystals, optical
nonlinearity, electro-optics, optical polymers, semiconductor
nanocrystals.
We present a general overview of the role of computer models in
the design and optimization of commercial optical transmission
systems. Specifically, we discuss (1) the role of modeling in a
commercial setting, (2) achieving the proper balance between
accuracy and computation speed, (3) model verification against
experiment, and (4) case studies demonstrating the benefits of
modeling. Ideally, experiments are preferable to models when
describing system performance, particularly to support claims of
a system's functionality to a customer. However, modeling is
often the only choice for many of the problems that a commercial
networking company must solve. Because there are design
parameter spaces that are either too expensive or too time consuming
to verify experimentally, the main role of modeling in
industry is to study what experiments cannot. For example,
when an analytical solution of a statistical problem is infeasible,
a common modeling solution is to perform Monte Carlo trials to
study the statistical behavior. Another typical modeling task
involves looking at variations of hardware that would be
prohibitively expensive to acquire and test. Implementing
modeling in industry involves a balance between three needs:
cost-efficiency, time-efficiency, and accuracy. We will discuss
the approaches we have taken at PhotonEx to meet these needs:
leveraging academic research, developing reduced models and
utilizing computational clusters. Specifically, we will use case
studies to illustrate the application of these approaches to
modeling long-haul optical transmission systems.
Categories and Subject Descriptors
J.6 [Computer Applications]: Computer-Aided Engineering -
computer-aided design (CAD) and computer-aided
manufacturing (CAM).
General Terms
Algorithms, Measurement, Performance, Design,
Experimentation, Theory, Verification
Keywords
Optical Communication, Long-Haul (LH) Transmission, Ultra-Long
Haul (ULH) Transmission, Optical Modeling
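The Monte Carlo approach mentioned in the modeling overview above can be sketched in a few lines. This is only a toy, not a model from the paper: it assumes a hypothetical system penalty that is zero-mean Gaussian (in dB) and estimates how often it exceeds a design margin. The function name and all parameters are illustrative assumptions.

```python
import random


def estimate_outage_probability(margin_db, sigma_db, trials, seed=0):
    """Estimate P(penalty > margin) by Monte Carlo sampling.

    Toy assumption: the penalty is a zero-mean Gaussian with standard
    deviation sigma_db. Real system models are far more detailed.
    """
    rng = random.Random(seed)  # fixed seed for reproducible runs
    failures = 0
    for _ in range(trials):
        penalty = rng.gauss(0.0, sigma_db)
        if penalty > margin_db:
            failures += 1
    return failures / trials


print(estimate_outage_probability(margin_db=2.0, sigma_db=1.0, trials=100_000))
```

The number of trials sets the trade-off between accuracy and computation time: rare events need many more samples before the estimate stabilizes.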
As designers become more aggressive in introducing optical
components to micro-systems, rigorous optical models are
required for system-level simulation tools. Common optical
modeling techniques and approximations are not valid for most
optical micro-systems, and those techniques that provide accurate
simulation are computationally slow. In this paper, we introduce
an angular frequency optical propagation technique that greatly
reduces computation time while achieving the accuracy of a full
scalar formulation. We present simulations of a diffractive optical
MEM Grating Light Valve to show the advantages of this optical
propagation method and the integration of the technique into a
system-level multi-domain CAD tool.
Categories and Subject Descriptors
I.6.5 [Simulation and Modeling]: Model Development -
modeling methodologies
General Terms
Algorithms, Design
Keywords
Optical Propagation, Angular Spectrum, CAD, Optical
Micro-systems, Optical MEMS
We present a new clock-control DFT technique
for sequential circuits, based on clock partitioning
and selective clock freezing, and we use it to break the global
feedback loops and to generate clock waves to test the
resulting sequential circuit with self-loops. Clock waves
allow us to significantly reduce the complexity of sequential
ATPG. Unlike scan, our non-intrusive DFT technique
does not introduce any delay penalty; the generated tests
may be applied at speed, have shorter application time, and
dissipate less power.
Categories and Subject Descriptors: B.8.1 [Performance
and Reliability]: Reliability, Testing, and
Fault-Tolerance
General Terms: Algorithms, Design, Reliability
Logic built-in self test (BIST) is increasingly being adopted to
improve test quality and reduce test costs for rapidly growing
designs. Compared to deterministic automated test pattern generation
(ATPG), BIST presents inherent fault diagnostic challenges.
Previous diagnostic techniques have been limited in their diagnosis
resolution and/or require significant hardware overhead. This
paper proposes an interval-based scan-unload method that ensures
diagnosis resolution down to gate-level faults with minimal hardware
overhead. Tester fail-data collection is based on a novel construct
incorporated into the design-extensions of the standard test-interface
language (STIL). The implementation of the proposed
method is presented and analyzed.
Categories and Subject Descriptors: B.8.1 [Performance
and Reliability]: Reliability, Testing and Fault-Tolerance.
General Terms: Algorithms, Design.
Keywords: built-in self-test (BIST), fault diagnosis.
A circuit may produce unknown output values during simulation of an input sequence due to an unknown initial state or due to the existence of tri-state elements. For circuits tested using BIST, unknown output values make it impossible to determine a single unique signature for the fault free circuit. To accommodate unknown output values in a BIST scheme, we describe a procedure for synthesizing a minimal logic block that replaces unknown output values by a known constant. The proposed procedure ensures that the BIST scheme will be able to detect all the faults detectable by the input sequence applied to the circuit while allowing a single unique signature to be obtained.
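The effect of replacing unknown values by a constant can be illustrated with a small sketch. It does not reproduce the synthesis procedure described above; it only shows, under assumed data, why an unknown output bit prevents a unique signature and how forcing it to a constant restores one. The MISR width, tap positions, and response vectors are all hypothetical.

```python
def mask_unknowns(vectors, constant=0):
    """Replace unknown output bits ('X') by a known constant bit."""
    return [[constant if bit == "X" else bit for bit in vec] for vec in vectors]


def signature(vectors, width=8, taps=(7, 5, 4, 3)):
    """Compact fully specified output vectors into a MISR-style signature."""
    mask = (1 << width) - 1
    state = 0
    for vec in vectors:
        word = 0
        for bit in vec:
            word = (word << 1) | bit
        feedback = 0
        for tap in taps:
            feedback ^= (state >> tap) & 1
        # Shift, insert the feedback bit, then fold in the output word.
        state = (((state << 1) | feedback) & mask) ^ (word & mask)
    return state


# Hypothetical fault-free responses; 'X' marks an unknown output value.
responses = [[1, 0, 1, 1], [0, "X", 1, 0], [1, 1, 0, 0]]

# Without masking, the signature depends on how 'X' happens to resolve.
sig_x0 = signature([[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 0]])
sig_x1 = signature([[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0]])

# With masking, the fault-free circuit has a single signature.
sig_masked = signature(mask_unknowns(responses))
print(sig_x0, sig_x1, sig_masked)
```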
Software-based self-test (SBST) is emerging as a promising
technology for enabling at-speed test of high-speed microprocessors
using low-cost testers. We explore the fault diagnosis capability
of SBST, in which functional information can be used to
guide and facilitate the generation of diagnostic tests. By using a
large number of carefully constructed diagnostic test programs,
the fault universe can be divided into fine-grained partitions,
each corresponding to a unique pass/fail pattern. We evaluate the
quality of diagnosis by constructing diagnostic-tree-based fault
dictionaries. We demonstrate the feasibility of the proposed
method by applying it to a processor example. Experimental
results show its potential as an effective method for diagnosing
larger processors.
Categories and Subject Descriptors
B.8.1 [Performance and Reliability]: Reliability, Testing, and
Fault-Tolerance.
General Terms
Algorithms, Measurement, Reliability, Experimentation.
Keywords
Microprocessor, self-test, instruction, diagnostics.
The design of high-throughput large-state Viterbi decoders relies
on the use of multiple arithmetic units. The global communication
channels among these parallel processors often consist of long interconnect
wires, resulting in large area and high power consumption.
In this paper, we propose a data-transfer oriented design
methodology to implement a low-power 256-state rate-1/3 IS95
Viterbi decoder. Our architectural level scheme uses operation partitioning,
packing, and scheduling to analyze and optimize interconnect
effects in early design stages. In comparison with other
published Viterbi decoders, our approach reduces the global data
transfers by up to 75% and decreases the number of global buses
by up to 48%, while enabling the use of deeply pipelined datapaths
with no data forwarding. In the RTL implementation of the individual
processors, we apply precomputation in conjunction with
saturation arithmetic to further reduce power dissipation with provably
no coding performance degradation. Designed using a 0.25 µm
standard cell library, our decoder achieves a throughput of 20
Mbps in simulation and dissipates only 450 mW.
Categories and Subject Descriptors
B.7.1 [Integrated Circuits]: Types and Design Styles - Algorithms
implemented in hardware
General Terms
Design, Performance
Keywords
Communications, Pipelining, Bus reduction
Hardware/software co-design methodologies generally focus on
the prediction of system performance or co-verification of system
functionality. This study extends this conventional focus through
the development of a methodology and software tool that
evaluates system (hardware and software) development,
fabrication, and testing costs (dollar costs) concurrent with
hardware/software partitioning in a co-design environment.
Based on the determination of key metrics such as gate count
and lines of software, a new tool called Ghost evaluates software
and hardware development, fabrication, packaging and testing
costs. Ghost enables optimization of hardware/software
partitioning as a function of specific combinations of hardware
foundries and software development environments.
Categories and Subject Descriptors
E3 [HW/SW co-design]: specification, model., co-simulation and
performance analysis, system-level scheduling and partitioning.
General Terms Design, Economics.
Keywords Cost Modeling, Cost-Performance Trade-off.
This paper presents efficient automatic code synthesis techniques
from dataflow graphs for multimedia applications. Since
multimedia applications require large buffers containing
composite-type data, we aim to reduce the buffer sizes with a
fractional rate dataflow (FRDF) extension and a buffer sharing technique. In
an H.263 encoder experiment, the FRDF extension and buffer
sharing technique enable us to reduce the buffer size by 67%. The
final buffer size is no larger than that of manually written reference code.
Keywords
memory optimization, software synthesis, multimedia, dataflow
The ForSyDe methodology has been developed for system level design.
In this paper we present formal transformation methods for
the refinement of an abstract and formal system model into an implementation
model. The methodology defines two classes of design
transformations: (1) semantic-preserving transformations and
(2) design decisions. In particular we present and illustrate communication
and clock domain refinement by way of a digital equalizer
system.
Categories and Subject Descriptors
B.7.2 [Integrated Circuits]: Design-Aids; J.6 [Computer-Aided
Engineering]: Computer-Aided Design (CAD)
General Terms
Design, Theory
Keywords
System Design, System Modeling, Design Refinement
Categories and Subject Descriptors
C.0 [General]: System Architecture; C.3 [Computer Systems Organization]:
Special-Purpose and Application-Based Systems - real-time
and embedded systems; C.4 [Computer Systems Organization]:
Performance of Systems
General Terms
Algorithms, Performance, Verification
Keywords
Platform-Based Design, Performance Analysis, Scheduling, Formal
Analysis
In this paper, a new timing generation method is proposed for the performance analysis of embedded software. The time stamp generation of I/O accesses is crucial to performance estimation and architecture exploration in the timed functional simulation, which simulates the whole design at a functional level with timing. A portable compiler is modified to generate time-deltas, which are the estimated cycle counts between two adjacent I/O accesses, by counting the cycles of the intermediate representation (IR) operations and using a machine description that contains information on a target processor. Since the proposed method is based on the machine-independent IR of a compiler, the method can be applied to various processors by changing the machine description. The experimental results show that the proposed method is effective in that the average estimation error is about 2% and the maximum speed-up over the corresponding instruction-set simulators is about 300 times. The proposed method is also verified in a timed functional simulation environment.
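The time-delta idea above can be sketched as follows. This is a minimal illustration rather than the modified compiler itself: it assumes the IR is a flat list of operation names and that the machine description is a table of cycles per operation. All names and cycle counts are hypothetical, and the first delta is measured from the start of the sequence.

```python
def time_deltas(ir_ops, machine_desc, io_ops=("io_read", "io_write")):
    """Return estimated cycle counts between adjacent I/O accesses."""
    deltas = []
    cycles = 0
    for op in ir_ops:
        cycles += machine_desc[op]  # cost of this IR operation on the target
        if op in io_ops:
            deltas.append(cycles)   # emit the time-delta at the I/O access
            cycles = 0
    return deltas


# Hypothetical machine description: cycles per IR operation.
machine_desc = {"add": 1, "mul": 3, "load": 2, "io_read": 2, "io_write": 2}
ir_ops = ["add", "mul", "io_read", "load", "add", "io_write"]

print(time_deltas(ir_ops, machine_desc))
```

Retargeting to another processor in this sketch only requires passing a different `machine_desc` table; the IR sequence stays the same.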
A chip that is required to meet strict operating criteria in terms of speed,
power, or area is commonly custom designed at the switch level. Traditional
techniques for verifying these designs, based on simulation, are expensive
in terms of resources and cannot completely guarantee correct operation.
Formal verification methods, on the other hand, provide for a
complete proof of correctness, and require less effort to set up. This paper
presents Motorola's Switch Level Verification (SLV) tool, which employs
detailed switch level analysis to model the behavior of MOS transistors
and obtain an equivalent RTL model. This tool has been used for equivalence
checking at the switch level for several years within Motorola for
the PowerPC, M*Core and DSP custom blocks. We focus on the novel
techniques employed in SLV, particularly in the areas of pre-charged and
sequential logic analysis, and provide details on the automated and integrated
equivalence checking flow in which the tool is used.
Categories and Subject Descriptors
J.6 [Computer-Aided Engineering]: Computer-Aided Design.
General Terms
Algorithms, Design, Verification.
Keywords
Custom design, switch level analysis, equivalence checking, formal verification,
MOS circuits, VLSI design.
An important step in using combinational equivalence
checkers to verify sequential designs is identifying and
matching corresponding compare-points in the two
sequential designs to be verified. Both non-function and
function-based matching methods are usually employed in
commercial verification tools. In this paper, we describe a
heuristic algorithm using ATPG for matching compare-points
based on the functionality of the combinational
blocks in the sequential designs. Results on industrial-sized
circuits show our methods are both practical and efficient.
Categories and Subject Descriptors
J.6 [Computer-aided engineering]: Verification, compare-point
matching.
General Terms
Algorithms, Experimentation, Verification.
Keywords
Combinational verification, equivalence checking, latch
mapping.
Verification of gate-level implementations of arithmetic circuits is
challenging due to a number of reasons: the existence of some
hard-to-verify arithmetic operators (e.g. multiplication), the use of
different operand ordering, the incorporation of merged arithmetic
with cross-operator implementations, and the employment of circuit
transformations based on arithmetic relations. It is hence a
peculiar problem that does not fit quite well into the existing RTL-to-gate
equivalence checking methodology. In this paper, we propose
a self-referential functional verification approach which uses
the gate-level implementation of the arithmetic circuit under verification
to verify itself. Specifically, the verification task is decomposed
into a sequence of equivalence checking subproblems,
each of which compares circuit pairs derived from the implementation
under verification based on the proposed self-referential functional
equations. A decomposition-based heuristic using structural
information is employed to guide the verification process for better
efficiency. Experimental results on a number of implementations
of the multiply-add units and the inner product units with different
architectures demonstrate the versatility of this approach.
Categories and Subject Descriptors
B.5.2 [Register-Transfer-Level Implementation]: Design Aids -
verification
General Terms
Algorithm, Verification
Keywords
Arithmetic circuit verification
This paper addresses the problem of automatic generation of implementation software from high-level functional specifications in the context of embedded system-on-chip designs. Software design complexity for embedded systems has increased so much that a high-level functional programming paradigm needs to be adopted for formal verifiability, maintainability and short time-to-market. We propose a framework for efficiently generating implementation software from a synchronous state machine specification for embedded control systems. The framework is generic enough to allow hardware/software partitioning for a given architecture platform. It is demonstrated that logic optimization and simulation techniques can be combined to produce fast execution code for such embedded systems. Specifically, we propose a framework for software synthesis from multi-valued logic, including fast evaluation of logic functions and scheduling techniques for node execution. Experiments are performed to show the initial results of our algorithms in this framework.
Embedded software designers often use libraries that have been
pre-optimized for a given processor to achieve higher code
quality. However, using such libraries in legacy code
optimization is nontrivial and typically requires manual
intervention. This paper presents a methodology that maps
algorithmic constructs of the software specification to a library of
complex software elements. This library-mapping step is
automated by using symbolic algebra techniques. We illustrate
the advantages of our methodology by optimizing an algorithmic
level description of an MPEG Layer III (MP3) audio decoder for the
Badge4 [2] portable embedded system. During the optimization
process we use commercially available libraries with complex
elements ranging from simple mathematical functions such as
exp to the IDCT routine. We implemented and measured the
performance and energy consumption of the MP3 decoder
software on Badge4 running the embedded Linux operating system.
The optimized MP3 audio decoder runs 300 times faster than the
original code obtained from the standards body while consuming
400 times less energy. Since our optimized MP3 decoder runs 3.5
times faster than real-time, additional energy can be saved by
using processor frequency and voltage scaling.
Categories and Subject Descriptors
C.3 [Special-Purpose and Application-Based Systems]:
Microprocessor/microcomputer applications, Real-time and
embedded systems, Signal processing systems.
General Terms
Algorithms, Performance, Design, Experimentation, Theory.
Keywords
Embedded software optimization, Automated library mapping,
Symbolic algebra, Polynomial representation, Computation
intensive software.
Since software is playing an increasingly important role in system-on-chip,
retargetable compilation has been an active research area
in the last few years. However, the retargeting of equally important
downstream system tools, such as assemblers, linkers and debuggers,
has either been ignored, or falls short of meeting the requirements
of modern programming languages and operating systems.
In this paper, we present techniques that can automatically
retarget the GNU binutils tool kit, which contains a large array of
production-quality downstream tools. Other than having all the advantages
enjoyed by open-source software by aligning to a de facto
standard, our techniques are systematic, as a result of using a formal
model of instruction set architecture (ISA) and application binary
interface (ABI); and simple, as a result of leveraging free software
to the largest extent.
Categories and Subject Descriptors
D.3.4 [Processors]: Retargetable compilers
General Terms
Design, Languages
Increasing non-recurring engineering (NRE) and mask costs are
making it harder to turn to hardwired Application Specific
Integrated Circuit (ASIC) solutions for high performance
applications [12]. The volume required to amortize these high
costs has been increasing, making it increasingly expensive to
afford ASIC solutions for medium volume products. This has led
to designers seeking programmable solutions of varying sorts
using so-called programmable platforms. These
programmable platforms span a large range from bit-level
programmable Field Programmable Gate Arrays (FPGAs), to
word-level programmable application-specific, and in some cases
even general-purpose processors. The programmability comes
with a power and performance overhead. Attempts to reduce this
overhead typically involve making some core hardwired ASIC-like
logic blocks accessible to the programmable elements. This
paper presents one such hybrid solution in this space: a relatively
simple processor with a dynamically reconfigurable datapath
acting as an accelerating co-processor. This datapath consists of
hardwired function units and reconfigurable interconnect. We
present a methodology for the design of these solutions and
illustrate it with two complete case studies: an MPEG 2 coder,
and a GSM coder, to show how significant speedups can be
obtained using relatively little hardware. The co-processor can be
viewed as a VLIW processor with a single instruction per kernel
loop. We compare the efficiency of exploiting the operation level
parallelism using classic VLIW processors and this proposed class
of dynamically configurable co-processors. This work is part of
the MESCAL project, which is geared towards developing design
environments for the development of application specific
platforms.
Categories and Subject Descriptors
J.6 [COMPUTER-AIDED ENGINEERING]: Computer-aided
design (CAD)
General Terms
Design, Performance
Tools and a design methodology have been developed to
support partial run-time reconfiguration of FPGA logic on
the Field Programmable Port Extender. High-speed Internet
packet processing circuits on this platform are implemented
as Dynamic Hardware Plugin (DHP) modules that
fit within a specific region of an FPGA device. The PARBIT
tool has been developed to transform and restructure bitfiles
created by standard computer aided design tools into partial
bitstreams that program DHPs. The methodology allows
the platform to hot-swap application-specific DHP modules
without disturbing the operation of the rest of the system.
Keywords
FPGA, partial RTR, reconfiguration, hardware, modularity,
network, routing, packet, Internet, IP, platform computing
Categories and Subject Descriptors
B.7.2 [Hardware]: Circuits - Design Aids; B.7.1 [Hardware]:
Circuits - VLSI; B.4.3 [Hardware]: Input/Output
and Data Communications - Interconnections (Subsystems);
C.2.1 [Computer Systems Organization]: Computer-Communication
Networks - Network Architecture and Design
General Terms
Design, Experimentation
A hard disk readback signal generator designed to provide noise-corrupted signals to a channel simulator has been implemented on a Xilinx Virtex-E FPGA device. The generator simulates pulses sensed by read heads in hard drives. All major distortion and noise processes, such as intersymbol interference, transition noise, electronics noise, head and media nonlinearity, intertrack interference, and write timing error, can be generated according to the statistics and parameters defined by the user. Reconfigurable implementation enables an update of the signal characteristics at run time. The user also has the flexibility to choose from a set of bitstreams to simulate particular combinations of noise and distortion. Such customized restructuring helps reduce the area consumption and hence virtually increases the capacity of the FPGA device. The time to generate the readback signals has been reduced by four orders of magnitude compared to its software counterpart.
At-speed testing of high-speed circuits is becoming increasingly difficult with external testers due to the growing gap between design and tester performance, growing cost of high-performance testers and increasing yield loss caused by inherent tester inaccuracy. Therefore, empowering the chip to test itself seems like a natural solution. Hardware-based self-testing techniques have limitations due to performance and area overhead and problems caused by the application of non-functional patterns.
Embedded software-based self-testing has recently become the focus of intense research. In this methodology, the programmable cores are used for on-chip test generation, measurement, response analysis and even diagnosis. After the programmable core on a System-on-Chip (SoC) has been self-tested, it can be reused for testing on-chip buses, interfaces and other non-programmable cores. The advantages of this methodology include at-speed testing, low design-for-testability overhead and application of functional patterns in the functional environment. In this paper, we give a survey and outline the roadmap and challenges of this emerging embedded software-based self-testing paradigm.
Categories and Subject Descriptors
B.8.1 [Integrated Circuits]: Performance and Reliability - reliability, testing and fault-tolerance.
General Terms
Algorithms, Performance, Reliability.
Keywords
VLSI test, SoC test, functional test, microprocessor test.
Transient current (IDD) based testing has been often cited
and investigated as an alternative and/or supplement to quiescent
current (IDDQ) testing. While the potential of IDD
testing for fault detection has been established, there is no
known efficient method for fault diagnosis using IDD analysis.
In this paper, we present a novel integrated method
for fault detection and localization using wavelet transform
based IDD waveform analysis. The time-frequency resolution
property of wavelet transform helps us detect as well
as localize faults in digital CMOS circuits. Experiments
performed on measured data from a fabricated 8-bit shift
register and simulation data from more complex circuits
show promising results for both detection and localization.
The wavelet-based detection method shows higher sensitivity
than spectral and time-domain methods. The effectiveness
of the localization method in the presence of process variation,
measurement noise and a complex power supply network is
addressed.
Categories and Subject Descriptors
B.8.2 [Hardware]: Performance and Reliability - Reliability,
Testing, and Fault-Tolerance
General Terms
Algorithms, Reliability, Experimentation
Keywords
Transient current (IDD), wavelet transform, fault localization
This paper aims at analysis of signal integrity for the purpose of testing high speed interconnects. This requires taking into account the effect of inputs as well as parasitic RLC elements of the interconnect. To improve the analysis/simulation time in integrity fault testing, we use reduced-order modeling that essentially performs the analysis in the frequency domain. To demonstrate the generality and usefulness of our method, we also discuss its application for test pattern generation targeting signal integrity loss.
In conventional delay testing, the test clock is a single pre-defined
parameter that is often set to be the same as the system clock. This
paper discusses the potential of enhancing test efficiency by using
multiple clock frequencies. The intuition behind our work is that
for a given set of AC delay patterns, a carefully-selected, tighter
clock would result in higher effectiveness to screen out the potential
defective chips. Then, by using a smarter test clock scheme and
combining with a second set of AC delay patterns, the overall quality
of AC delay test can be enhanced while the cost of including the
second pattern set can be minimized. We demonstrate these concepts
through analysis and experiments using a statistical timing
analysis framework with defect-injected simulation.
Categories and Subject Descriptors
B.8.1 [Hardware]: Reliability, Testing, and Fault-Tolerance
General Terms
Experimentation, Measurement, Reliability
Keywords
Delay Testing, Statistical Timing Analysis, Transition Fault Model
The complexity of a System-on-Chip design is not only in the
million transistors packed in a square millimeter. The major
challenge for technical success of a SoC is to make sure that
millions of lines of software fit in with millions of gates.
In this paper, the problem of multi-million-gate design is
illustrated from the viewpoint of a practical development of a
complex digital system done at STMicroelectronics for a
GSM/GPRS cellular application.
Categories and Subject Descriptors
C.3 [Computer Systems Organisation]: Special-purpose and
application-based systems - real-time and embedded systems.
General Terms
Design.
Keywords
SoC Design, HW/SW co-design.
This paper proposes a general hierarchical analysis methodology, HiPRIME, to efficiently analyze RLKC power delivery systems. After partitioning the circuits into blocks, we develop and apply the IEKS (Improved Extended Krylov Subspace) method to build the Multi-port Norton Equivalent circuits, which transform all the internal sources to Norton current sources at ports. Since there are no active elements inside the Norton circuits, passive or realizable model order reduction techniques such as PRIMA can be applied. To further reduce the top-level hierarchy runtime, we develop a second-level model reduction algorithm and prove its passivity. Experimental results show 400-700X runtime improvement with less than 0.2% error.
We present a frequency domain current macro-modeling technique
for capturing the dependence of the block current
waveform on its input vectors. The macro-model is based on
estimating the Discrete Cosine Transform (DCT) of the current
waveform as a function of input vector pair and then
taking the inverse transform to estimate the time domain
current waveform. The input vector pairs are partitioned
according to Hamming distance and a current macro-model
is built for each Hamming distance using regression. Regression
is done on a set of current waveforms generated for
each circuit, using HSPICE. The average relative error in
peak current estimation using the current macro-model is
less than 20%.
Categories and Subject Descriptors
B.7 [Hardware]: Integrated Circuits - CAD; B.7.2 [Integrated
Circuits]: Design Aids - Modeling
General Terms
Algorithms
Keywords
Power grid, Current macro-model, DCT
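The flow described in the current macro-modeling abstract above (partition input vector pairs by Hamming distance, model the DCT of the current waveform per partition, then invert the transform) can be sketched as follows. This is only an illustrative sketch: the abstract does not give the regression form, so a per-partition mean of the DCT coefficients (a constant fit) stands in for it, and the names `build_macro_model` and `estimate_waveform` are hypothetical.

```python
import math
from collections import defaultdict


def hamming_distance(v1, v2):
    """Number of bit positions in which two equal-length input vectors differ."""
    return sum(a != b for a, b in zip(v1, v2))


def dct(x):
    """Unnormalized DCT-II of a sampled waveform."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi / n * (t + 0.5) * k) for t in range(n))
            for k in range(n)]


def idct(coeffs):
    """Inverse of dct() above (a scaled DCT-III)."""
    n = len(coeffs)
    return [(coeffs[0] + 2 * sum(coeffs[k] * math.cos(math.pi / n * (t + 0.5) * k)
                                 for k in range(1, n))) / n
            for t in range(n)]


def build_macro_model(samples):
    """samples: iterable of (v1, v2, waveform) with equal-length waveforms.

    Assumption: the mean DCT spectrum of each Hamming-distance partition
    replaces the regression used in the paper.
    """
    groups = defaultdict(list)
    for v1, v2, wave in samples:
        groups[hamming_distance(v1, v2)].append(dct(wave))
    return {hd: [sum(col) / len(col) for col in zip(*spectra)]
            for hd, spectra in groups.items()}


def estimate_waveform(model, v1, v2):
    """Time-domain current estimate for the input vector pair (v1, v2)."""
    return idct(model[hamming_distance(v1, v2)])
```

In this simplified form the estimate for a vector pair is the average training waveform of its Hamming-distance partition; a real regression would let the coefficients depend on further features of the vector pair.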
The power delivery network is made up of passive elements in the
distribution network, as well as the active transistor loads. A chip
typically has three types of power supplies that require attention:
core, I/O, and analog. Core circuits consist of digital circuits and
have the largest current demand. In addition to all of the system
issues/models for the core, modeling the I/O subsystem has the
additional requirement of modeling return paths and discontinuities.
The analog circuits present yet a different challenge to the macromodeling
of the supply network because they place a tight demand
on supply variations. This paper presents a design methodology on
how to generate macro-models of the entire chip electrical interface.
This methodology can be used by the chip, package, and system
designers and is being used to design high-reliability servers.
Categories and Subject Descriptors
C.5.3 [Computer System Implementation]: VLSI Systems.
General Terms
Performance, Design, Reliability.
Keywords
VLSI Power Distribution, Inductance, High Speed Microprocessor
Design, Analog and I/O Power Delivery.
In this paper we propose a novel and efficient methodology for
modeling and analysis of regular symmetrically-structured power/
ground distribution networks. The modeling of inductive effects is
simplified by a folding technique which exploits the symmetry in
the power/ground distribution. Furthermore, employment of susceptance
[10,11] (inverse of inductance) models enables further
simplification of the analysis, and is also shown to preserve the
symmetric positive definiteness of the circuit equations. Experimental
results demonstrate that our approach can provide up to 8x
memory savings and up to 10x speedup over the already efficient
simulation based on the original sparse susceptance matrix without
loss of accuracy. Importantly, this work demonstrates that by employing
limited regularity, one can create excellent power/ground
distribution designs that are dramatically simpler to analyze, and
therefore amenable to more powerful global design optimization.
Categories and Subject Descriptors
B.7.2 [Integrated Circuits]: Design Aids - verification
General Terms
Design, Verification
Keywords
Power/Ground Distribution, Susceptance, Folding Technique, Design
Regularity
In a synchronous clock distribution network with zero latencies,
digital circuits switch simultaneously on the clock edge, therefore
they generate substrate noise due to the sharp peaks on the supply
current. We present a novel methodology that optimizes the clock tree
to reduce substrate noise generation by using statistical single-cycle supply
current profiles computed for every clock region taking the timing
constraints into account. Our methodology is novel as it uses an
error-driven compressed data set during the optimization over a
number of clock regions specified for a significant reduction in
substrate noise. It also produces a quality analysis of the computed
latencies as a function of the clock skew. The experimental results
show a more than 2x reduction of substrate noise generation from circuits
with four clock regions whose latencies are optimized.
Categories and Subject Descriptors
B.5.1 [Register-Transfer-Level Implementation]: Design - datapath
design. B.6.1 [Logic Design]: Design Styles - sequential
circuits. B.6.3 [Logic Design]: Design Aids - optimization,
simulation. B.7.1 [Integrated Circuits]: Types and Design Styles- VLSI.
B.8.2 [Performance and Reliability]: Performance
Analysis and Design Aids.
General Terms
Algorithms, Design, Performance, Reliability.
Keywords
Substrate noise, di/dt noise, low-noise digital design, clock
distribution networks, supply current shaping and optimization.
Several approaches have been proposed for the syntax-directed compilation of asynchronous circuits from high-level specification languages, such as Balsa and Tangram. Both compilers have been successfully used in large real-world applications; however, in practice, these methods suffer from significant performance overheads due to their reliance on straightforward syntax-directed translation. This paper introduces a powerful new set of transformations, and an extended channel-based language to support them, which can be used as an optimizing back-end for Balsa. The transforms described in this paper fall into two categories: resynthesis and peephole. The proposed optimization techniques have been fully integrated into a comprehensive asynchronous CAD package, Balsa. Experimental results on several substantial design examples indicate significant performance improvements.
The roadblock to wide acceptance of asynchronous methodology is poor CAD support. Current asynchronous design tools require a significant re-education of designers, and their features are far behind synchronous commercial tools. This paper considers a particular subclass of asynchronous circuits (Null Convention Logic or NCL) and suggests a design flow that is based entirely on commercial CAD tools. This new design flow shows a significant area improvement over known flows based on NCL.
This paper presents an approach by which asynchronous circuits
can be realised with a conventional EDA tool flow and conventional
standard cell libraries. Based on a gate-level asynchronous circuit
implementation technique, direct-mapping, and by identifying the
delay constraints and exploiting certain EDA tool features, this paper
demonstrates that a conventional EDA tool flow can be used to
describe, place, route and timing-verify asynchronous circuits.
Categories and Subject Descriptors
B.7.1 [Integrated Circuits]: Types and Design Styles
General Terms
Design, Experimentation, Standardization
Keywords
Asynchronous, EDA, Tool-Flow
This paper gives a simple but nontrivial set of local transformation
rules for Control-NOT(CNOT)-based combinatorial circuits. It is
shown that this rule set is complete, namely, for any two equivalent
circuits, S1 and S2, there is a sequence of transformations, each of
them in the rule set, which changes S1 to S2. Our motivation is to
use this rule set for developing a design theory for quantum circuits
whose Boolean logic parts should be implemented by CNOT-based
circuits. As a preliminary example, we give a design procedure
based on our transformation rules which reduces the cost of CNOT-based
circuits.
Categories and Subject Descriptors
B.6.m [LOGIC DESIGN]: Miscellaneous
General Terms
Design, Theory
Keywords
Quantum Circuit, CNOT Gate, Local Transformation Rules
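To make the notion of a local transformation on CNOT-based circuits concrete, the sketch below simulates such circuits on classical bits and applies one well-known identity: two identical adjacent gates cancel, because every (multi-)controlled NOT is its own inverse. This identity is chosen purely for illustration; the abstract does not list the paper's rule set, and the gate encoding `(controls, target)` and all function names are assumptions.

```python
from itertools import product


def apply_circuit(circuit, bits):
    """Run a circuit on a tuple of classical bits.

    Each gate is (controls, target): the target bit is flipped when every
    control bit is 1. The target is assumed not to appear among its controls.
    """
    state = list(bits)
    for controls, target in circuit:
        if all(state[c] for c in controls):
            state[target] ^= 1
    return tuple(state)


def equivalent(c1, c2, n):
    """Exhaustively compare two circuits on all 2**n classical inputs."""
    return all(apply_circuit(c1, b) == apply_circuit(c2, b)
               for b in product((0, 1), repeat=n))


def cancel_adjacent_pairs(circuit):
    """Remove identical neighbouring gates, repeating until none remain."""
    out = []
    for gate in circuit:
        if out and out[-1] == gate:
            out.pop()          # the gate undoes its identical predecessor
        else:
            out.append(gate)
    return out
```

The exhaustive `equivalent` check is exponential in the number of wires, so it only serves to validate a rewrite on small examples; using the gate count as the cost measure is likewise an assumption made for this sketch.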
Sum of Pseudoproducts (SPP) is a three level logic synthesis technique developed
in recent years. In this framework we exploit the "regularity" of Boolean
functions to decrease minimization time. Our main results are: 1) the
regularity of a Boolean function f of n variables is expressed by its
autosymmetry degree k (with 0 <= k <= n), where k = 0 means no
regularity (that is, we are not able to provide any advantage over standard
synthesis); 2) for k >= 1 the function is autosymmetric, and a new
function fk is identified in polynomial time: fk is
"equivalent" to, but smaller than f, and depends on n - k variables only;
3) given a minimal SPP form for fk, a minimal SPP form for f
is built in linear time; 4) experimental results show that 61% of the
functions in the classical ESPRESSO benchmark suite are autosymmetric,
and the SPP minimization time for them is critically reduced; we can also solve
cases otherwise practically intractable. We finally discuss the role and
meaning of autosymmetry.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids - Automatic Synthesis, Optimization.
General Terms
Algorithms, Design, Theory.
Keywords:
Three-Level Logic, Synthesis, Autosymmetry.
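The autosymmetry degree mentioned in the SPP abstract above can be illustrated by brute force on a truth table. The sketch assumes the usual definition from the autosymmetry literature, which the abstract itself does not spell out: a vector alpha belongs to the set L_f when f(x XOR alpha) = f(x) for every x, L_f is a vector space over GF(2), and k is its dimension. The function name and the integer encoding of inputs are hypothetical, and this exhaustive check is exponential, unlike the polynomial-time procedure the paper reports.

```python
def autosymmetry_degree(truth_table, n):
    """Return (k, L_f) for a Boolean function of n variables.

    truth_table[x] is f(x), where the integer x encodes the input bits.
    Assumption: k is the dimension of the space of vectors alpha such
    that f(x XOR alpha) == f(x) for all x.
    """
    size = 1 << n
    if len(truth_table) != size:
        raise ValueError("truth table must have 2**n entries")
    closed = [alpha for alpha in range(size)
              if all(truth_table[x ^ alpha] == truth_table[x]
                     for x in range(size))]
    # L_f is a linear space over GF(2), so its size is 2**k.
    k = len(closed).bit_length() - 1
    return k, closed
```

For example, the parity of three variables is unchanged by any alpha of even weight, so the function reports k = 2, whereas a two-input AND has k = 0 and gains nothing from this kind of regularity.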
This paper presents a new direct-fitting method to generate posynomial
response surface models with arbitrary constant exponents
for linear and nonlinear performance parameters of analog integrated
circuits. Posynomial models enable the use of efficient geometric
programming techniques for circuit sizing and optimization.
The automatic generation avoids the time-consuming nature
and inaccuracies of handcrafted analytic model generation. The
technique is based on the fitting of posynomial model templates
to numerical data from SPICE simulations. Attention is paid to
estimating the relative "goodness-of-fit" of the generated models.
Experimental results illustrate the significantly better accuracy of
the new approach.
Categories and Subject Descriptors
B.7.2 [Integrated Circuits]: Design Aids; B.8.2 [Performance
and Reliability]: Performance Analysis and Design Aids; I.6.5
[Simulation and Modeling]: Model Development
General Terms
Performance, Design, Algorithms
Keywords
Performance Modeling for Analog Circuits, Posynomial Response
Surface Modeling, Geometric Programming
The introduction of simulation-based analog synthesis tools creates a
new challenge for analog modeling. These tools routinely visit 10^3 to
10^5 fully simulated circuit solution candidates. What might we do
with all this circuit data? We show how to adapt recent ideas from
large-scale data mining to build models that capture significant
regions of this visited performance space, parameterized by
variables manipulated by synthesis, trained by the data points
visited during synthesis. Experimental results show that we can
automatically build useful nonlinear regression models for large
analog design spaces.
CATEGORIES AND SUBJECT DESCRIPTORS
B.7.2 [Integrated Circuits]: Design aids - verification
GENERAL TERMS
Algorithms
An algorithm for architecture-level exploration of delta-sigma ADC
design space is presented. The algorithm finds an optimal
solution by exhaustively exploring both single-loop and cascaded
architectures, with single-bit or multi-bit quantizer,
for a range of oversampling ratios. A fast filter-level step
evaluates the performance of all loop-filter topologies and
passes the accepted solutions to the architecture-level optimization
step which maps the filters on feasible architectures
and evaluates their performance. The power consumption of
each accepted architecture is estimated and the best top-ten
solutions in terms of the ratio of peak SNDR versus power
consumption are further optimized for yield. Experimental
results for two different design targets are presented. They
show that previously published solutions are among the best
architectures for a given target but that better solutions can
be designed.
Categories and Subject Descriptors
J.6 [Computer Applications]: Computer-Aided Engineering
General Terms
Design
Keywords
ADC, CAD, delta-sigma
The systematic design of a high-speed, high-accuracy Nyquist-rate
A/D converter is proposed. The presented design
methodology covers the complete flow and is supported by
software tools. A generic behavioral model is used to explore the
A/D converter's specifications during high-level design and
exploration. The inputs to the flow are the specifications of the
A/D converter and the technology process. The result is a
generated layout and the corresponding extracted behavioral
model. The approach has been applied to a real-life test case,
where a Nyquist-rate 8-bit 200 MS/s 4-2 interpolating/averaging
A/D converter was developed for a WLAN application.
Categories and Subject Descriptors
B.7.m Integrated Circuits: miscellaneous
General Terms
Design
Keywords
A/D converters, Interpolating, Flash, Simulated Annealing.
In this paper, a new type of Petri net, called the Hierarchical Colored Hardware Petri Net (HCHPN), is proposed to model real-delay switching activity for power estimation. The logic circuit is converted into an HCHPN and simulated as a Petri net to obtain the switching activity estimate and thus the power values. The method is accurate and is significantly faster than other simulative methods. The HCHPN yields an average error of 4.9% with respect to Hspice for the ISCAS '85 benchmark circuits. The per-pattern simulation time is about 46 times shorter than that of PowerMill.
The purpose of this work is twofold. First, to quantify and establish future trends for the dynamic power dissipation in global wires of high performance integrated circuits. Second, to develop a novel and efficient delay-power tradeoff formulation for minimizing power due to repeaters, which can otherwise constitute 50% of total global wire power dissipation. Using the closed-form solutions from this formulation, power savings of 50% on repeaters are shown with minimal delay penalties of about 5% at the 50 nm technology node. These closed-form, analytical solutions provide a fast and powerful tool for designers to minimize power.
High-speed domino logic is now prevalent in the performance-critical blocks of
a chip. A Low Voltage Swing Clock (LVSC) domino logic family is developed
for substantial dynamic power saving. To boost the transition speed in the
proposed circuitry, a well-established dual threshold voltage technique is
exploited. A dual supply voltage technique in the LVSC domino logic is
geared to reduce power consumption in the clock tree and logic gates
effectively. A Delay Constrained Power Optimization (DCPO) algorithm allocates
low supply voltage to logic gates such that dynamic power consumed by logic
gates is minimized. Delay time variations due to gate-to-source voltage
change and input signal arrival time difference are considered for accurate
timing analysis in DCPO.
Categories and Subject Descriptors
B.6 [Hardware]: Logic Design; B.7 [Hardware]: Integrated Circuits
General Terms
Design
Keywords
domino logic, low swing clock, dual supply voltage, dual threshold
voltage, low power
In this paper we propose a novel integrated circuit and
architectural level technique to reduce leakage power
consumption in high performance cache memories using single Vt
(transistor threshold voltage) process. We utilize the concept of
Gated-Ground [5] (NMOS transistor inserted between Ground
line and SRAM cell) to achieve reduction in leakage energy
without significantly affecting performance. Experimental results
on gated-Ground caches show that data is retained (DRG-Cache)
even if the memory is put in the stand-by mode of operation.
Data is restored when the gated-Ground transistor is turned on.
Turning off the gated-Ground transistor in turn gives large
reduction in leakage power. This technique requires no extra
circuitry; row decoder itself can be used to control the gated-Ground
transistor. The technique is applicable to data and
instruction caches as well as different levels of cache hierarchy
such as the L1, L2, or L3 caches. We fabricated a test chip in
TSMC 0.25 µm technology to show the data retention capability and
the cell stability of DRG-cache. Our simulation results on 100nm
and 70nm processes (Berkeley Predictive Technology Model)
show 16.5% and 27% reduction in consumed energy in L1 cache
and 50% and 47% reduction in L2 cache with less than 5% impact
on execution time and within 4% increase in area overhead.
Categories and Subject Descriptors
B.3.2 [Memory Structure]: Design Styles --- Cache memories;
B.3.1 [Memory Structure]: Semiconductor Memories --- Static
memory (SRAM); B.7.1 [Integrated Circuits]: Types and Design
Styles --- Memory technology.
General Terms: Design, Performance and
Experimentation.
Keywords: Gated-ground, SRAM, low leakage cache.
Reducing power dissipation is one of the principal concerns
in VLSI design today. Scaling causes subthreshold leakage currents
to become a large component of total power dissipation. This
paper presents two techniques for efficient gate clustering in MTCMOS
circuits by modeling the problem via Bin-Packing (BP) and
Set-Partitioning (SP) techniques. An automated solution is presented,
and both techniques are applied to six benchmarks to verify
functionality. Both methodologies offer significant reduction
in both dynamic and leakage power over previous techniques during
the active and standby modes respectively. Furthermore, the
SP technique takes the circuit's routing complexity into consideration
which is critical for Deep Sub-Micron (DSM) implementations.
Sufficient performance is achieved, while significantly reducing
the overall sleep transistor area. Results obtained indicate
that our proposed techniques can achieve on average 90% savings
for leakage power and 15% savings for dynamic power.
Categories and Subject Descriptors: B.7.1 [Integrated Circuits]:
Types and Design Styles
General Terms: Design
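The Bin-Packing view of gate clustering in the MTCMOS abstract above can be illustrated with a generic first-fit-decreasing heuristic. This is a loose sketch under stated assumptions, not the paper's formulation: each gate is reduced to a single hypothetical peak-current figure, each sleep transistor to a hypothetical current capacity, and the heuristic itself is a textbook choice that the abstract does not name.

```python
def first_fit_decreasing(gate_currents, capacity):
    """Cluster gates so that each cluster's summed current fits one sleep transistor.

    gate_currents: dict mapping a gate name to its (assumed) peak current.
    capacity: (assumed) maximum current one sleep transistor can sink.
    Returns a list of clusters, each {"gates": [...], "load": total_current}.
    """
    clusters = []
    ordered = sorted(gate_currents.items(), key=lambda kv: kv[1], reverse=True)
    for gate, current in ordered:
        if current > capacity:
            raise ValueError(f"gate {gate} exceeds the sleep transistor capacity")
        for cluster in clusters:
            if cluster["load"] + current <= capacity:
                cluster["gates"].append(gate)   # first cluster with room wins
                cluster["load"] += current
                break
        else:
            clusters.append({"gates": [gate], "load": current})
    return clusters
```

Each returned cluster would share one sleep transistor, so fewer clusters means less total sleep transistor area in this simplified model. Routing complexity, which the abstract says the Set-Partitioning variant accounts for, is ignored here.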
We describe various design automation solutions for design
migration to a dual-Vt process technology. We include the
results of a Lagrangian Relaxation based tool, iSTATS, and a
heuristic iterative optimization flow. Joint dual-Vt allocation
and sizing reduces total power by 10+% compared with Vt
allocation alone, and by 25+% compared with pure sizing
methods. The heuristic flow requires 5x larger computation
runtime than iSTATS due to its iterative nature.
Categories and Subject Descriptors
B.7 INTEGRATED CIRCUITS
B.7.1 Types and Design Styles - Microprocessors and
microcomputers, VLSI.
General Terms
Algorithms, Performance, Design, Experimentation,
Verification.
Keywords
Dual-Vt design, multiple threshold, sizing, optimization.
This paper presents an optimal voltage synthesis technique for a
satellite application to maximize system performance subject to
energy budget. A period of a satellite's orbit is partitioned into
several independent regions with different characteristics such as
type of computation, importance, performance requirements, and
energy consumption. Given a periodic energy recharge model,
optimal voltages for the regions are synthesized such that the
overall performance is maximized within the energy budget in the
period.
Categories and Subject Descriptors
C.4 [PERFORMANCE OF SYSTEMS] Design studies,
Modeling techniques, Performance attributes.
General Terms
Algorithms, Management, Performance, Design.
Keywords
Power-aware design, power-efficient design, satellite application,
queueing.
Techniques for fast and accurate simulation of fractional-N
synthesizers at a detailed behavioral level are presented.
The techniques allow a uniform time step to be used for the
simulator, and can be applied to a variety of phase locked
loop (PLL) and delay locked loop (DLL) circuits beyond
fractional-N synthesizers, as well as to a variety of simulation
frameworks such as Verilog and Matlab. Simulated results
from a custom C++ simulator are shown to compare well to
measured results from a prototype fractional-N synthesizer
using a Delta-Sigma modulator to dither its divide value.
Categories and Subject Descriptors
I.6.5 [Simulation and Modeling]: Model Development
General Terms
Algorithms
Keywords
fractional-N, frequency, synthesizer, sigma, delta, PLL, DLL
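The idea of a Delta-Sigma modulator dithering the divide value, mentioned in the fractional-N abstract above, can be shown with the simplest case: a first-order accumulator that switches the divider between N and N+1. The modulator order of the prototype is not stated in the abstract, so first order is an assumption, and the function and parameter names are hypothetical.

```python
def divide_sequence(n_int, frac_num, frac_den, cycles):
    """Divide values produced by a first-order accumulator modulator.

    The target average divide ratio is n_int + frac_num / frac_den,
    with 0 <= frac_num < frac_den. On every reference cycle the
    accumulator adds frac_num; a carry-out selects n_int + 1 for that cycle.
    """
    if not 0 <= frac_num < frac_den:
        raise ValueError("fraction must satisfy 0 <= frac_num < frac_den")
    acc = 0
    sequence = []
    for _ in range(cycles):
        acc += frac_num
        if acc >= frac_den:
            acc -= frac_den          # carry-out of the accumulator
            sequence.append(n_int + 1)
        else:
            sequence.append(n_int)
    return sequence
```

Over any whole number of `frac_den`-cycle periods the mean of the sequence equals the target ratio exactly; the periodic pattern is what produces fractional spurs, which is one reason higher-order modulators are commonly used in practice.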
Simulation of RF circuits often demands analysis of distributed component models that are described via frequency-dependent multiport Y, Z, or S parameters. Frequency-domain methods such as harmonic balance are able to handle these components without difficulty, while