 
Predictable Design of Embedded Systems using Networked Architectures
Henk Corporaal
www.ics.ele.tue.nl/~heco
ASCI Winterschool on Embedded Systems
Rockanje, March 2006
Outline
• Trends and design problems
• Unpredictability
• Platforms
• Predictable design
• Proposed design flow
• Open issues
Note: this lecture is not about a solved problem.
ASCI Winterschool 2006
Henk Corporaal
(2)

Outline
• Trends and design problems
  - Embedded systems everywhere
  - Design practice
  - Design complexity
  - Memory wall
• Unpredictability
• Platforms
• Predictable design
• Design flow
• Open issues

Embedded systems everywhere
• Convergence of the 3 Cs: computers, communications and consumer electronics
• The computer enters its 3rd phase: computing power - networking - intelligent processing
• The world is one network: all information and communication available wherever and whenever
We get a smart environment.

Design practice: informal system specification
[Figure: system people split the system into tasks and write a paper spec; hardware people (VHDL, Verilog) and software people (C, ASM) work from it; integration follows]

Design practice
[Figure: Y-chart (Gajski-Kuhn) with the axes behavioral specification, structure description and physical realization, and the levels system, algorithm, R/T, logic, circuit]
• A design flow is a path in the Y-chart
• Up to RT level the flow is largely manual

Design complexity problem
[Figure: complexity (log scale, 10^1 to 10^3) versus year (4 to 16). Process technology: +58%; HW design productivity: +21%; SW productivity: +8%. The widening differences are marked as the HW gap and the SW gap]

Hitting the memory wall
[Figure: performance (log scale, 1 to 1000) versus time (1980 to 2005) [Patterson]. µProc ("Moore's Law"): 55%/year; DRAM: 7%/year; the processor-memory performance gap grows 50%/year]

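The growth figures on the memory-wall chart can be checked with a few lines of arithmetic. The sketch below assumes that both rates simply compound per year; the function names are illustrative and not part of the lecture. Under that assumption the quoted 55% and 7% give a gap that widens by roughly 45% per year, the same order as the 50% quoted on the chart.

```python
def yearly_gap_growth(cpu_percent=55.0, dram_percent=7.0):
    """Relative growth per year of the CPU/DRAM performance ratio."""
    return (1.0 + cpu_percent / 100.0) / (1.0 + dram_percent / 100.0) - 1.0


def gap_after(years, cpu_percent=55.0, dram_percent=7.0):
    """Factor by which the gap has widened after `years` of compounding."""
    return (1.0 + yearly_gap_growth(cpu_percent, dram_percent)) ** years


if __name__ == "__main__":
    print(f"gap grows {100 * yearly_gap_growth():.1f}% per year")
    for years in (5, 10, 20):
        print(f"after {years:2d} years the gap is x{gap_after(years):.0f}")
```
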
Unpredictability at all levels
applications
architectures
DSM VLSI design
Uncertainty increases at all levels

Application: two forms of unpredictability
[Figure: video processing graph (Video In1, Video In2, NR, HSRC, VSRC, mix, 100Hz, Peak, Matrix, Txt gen, mem) and its resource usage over time]
• Applications can be data dependent
• Applications may have different scenarios

In addition: dynamically changing set of applications
Multi-standard modem operation [Philips EVP]
• Several applications have to be activated simultaneously
• Too many combinations for an analysis at design time (non-deterministic events)
[Figure: compute load (0 to 125) over time for the modes initial acquisition, UMTS connected, UMTS connected/WLAN acquisition, inter-system handover and WLAN connected/UMTS monitoring. The load is built up from SCH search (SCH), CPICH search, RAKE chip-rate processing, RAKE sym-rate processing, WLAN acquisition and WLAN receiver]

Architecture unpredictability
[Figure: multiprocessor with CPUs and caches ($), many IP blocks, several interconnects, memories, an arbiter and external memory]
Local schedulers:
• OS: task switching, interrupts
• interconnect: busses, bridges, networks
• memory controllers: external memory
• cache strategy: cache pollution
e.g. RR, TDMA, FCFS, LRU, EDLF, FIFO, priority, …
What is the global behavior (end-to-end), composed of interacting local solutions?

DSM VLSI unpredictability
• Global wiring delay becomes dominant over gate delay (timing closure)
[Figure: gate delay (ps) and wire delay (ps/mm), scale 0 to 400, versus technology (0.5, 0.35, 0.25, 0.18, 0.13, 0.1 micron)]

DSM VLSI unpredictability
[Figure: length of the isosynchronous zone as a function of frequency]
Other DSM problems:
• Clock distribution, skew
• VDD and VSS voltage drop
• Signal integrity, cross-talk
• Variance in process parameters increases

Unpredictability: design closure problems
Design closure = a realization meets all requirements, including functionality, speed, power, area, yield, etc., without design iterations.
[Figure: application -> mapping & scheduling -> architecture -> placement & routing -> FPGA realization / VLSI realization]
Closure problem at all levels

Unpredictability: design closure problems
[Figure: computational requirements (0% to 1200%) versus time; the variation spans orders of magnitude]
Mapping with performance guarantees looks impossible!

Solution ingredients:
• Higher abstraction levels
• SW and HW IP reuse / PnP principle
• Standards
• Avoid large design iterations
• Design correct by synthesis
• Avoid worst-case resource requirements
How do we achieve all of this?

What is a platform?
Definition:
A platform is a generic, but domain specific
information processing (sub-)system
• Generic means that it is flexible, containing programmable
component(s).
• Platforms are meant to quickly realize your next system
(in a certain domain).
• Single chip?

Platforms, why?
- Reuse
- Short time-to-market
- High quality
• Flexible and programmable
• Large software component
• Standardization
• Optimized for specific domain
• And you do not have to solve this design closure problem!

Platforms separate the design communities!
[Figure: layers Applications, Platform, Enabling technologies; the design technology splits into SDT (system design technology) and PDT (platform design technology)]

Platform examples: Digital camera
Sanyo [Okada99]
TI OMAP
Up to 192Mbyte off-chip memory
192Kbyte shared SRAM
8Kb data cache (2-way,
512 lines of 16 bytes)
Write buffer (17 elements)
16Kb (2-way)
16Kb (2-way)
8Kb mem (2x 4K)
64Kb dual port (8x 4K x 16b)
96Kb single port (12x 4k x 16b)
32Kb ROM

SpaceCake (Philips Research)
• Homogeneous: set of equal tiles
• Per tile, e.g.:
  - n * MIPS
  - m * TriMedia
  - Accelerators
  - k * L2 cache bank
  - Shared memory
  - Cache coherency
  - Big interconnect switch
• Inter-tile:
  - Router
  - Message passing
  - Working on inter-tile cache coherence
[Figure: single tile with switch and L2 cache memory banks]

IMAGINE Stream Processor (Stanford)
IMAGINE = SIMD of VLIWs
It is controlled by a host processor, which sends it stream instructions (load, store, receive, send, VLIW op, load microcode).

Hybrid FPGAs: Xilinx Virtex 4-Pro
[Figure, courtesy of Xilinx (Virtex II Pro): GHz IO (up to 16 serial transceivers), PowerPCs, memory blocks & multipliers, reconfigurable logic blocks]

Fundamental platform design decisions
• Homogeneous versus heterogeneous?
• Bus versus network?
• Shared memory versus message passing?
• QoS support, guarantees built in?
• Generic versus application specific?
• What types of parallelism to support?
  - ILP, DLP, TLP
• Focus on performance, power or cost?
• Memory organisation?
• HW or SW reconfigurable?
And further:
• OS support, middleware?
• Mapping support?

Homogeneous or heterogeneous
• Homogeneous:
  - replication effect
  - memory dominated anyway
  - solve realization issues once and for all
  - less flexible

Homogeneous or heterogeneous
• Heterogeneous:
  - more flexible
  - better fit to application domain
  - smaller increments
  - no tile reuse

Homogeneous or heterogeneous
• Middle-of-the-road approach
  - Flexible tiles
  - Fixed tile structure at top level
[Figure: grid of tiles, each attached to a router]

HW or SW reconfigurable?
[Figure: reconfiguration time (1 cycle, loopbuffer, context, reset) versus data path granularity (fine to coarse), placing FPGA (spatial mapping), subword parallelism and VLIW (temporal mapping)]

Outline
• Trends and design problems
• Unpredictability
• Platforms
• Predictable design
  - Current practice
  - Predictability
  - Architecture consequences
  - Design consequences
• Design flow
• Open issues

How should we design?
• Trajectory, from idea to realization
• Decisions based on models
  - Abstract from implementation details (not all known yet)
  - Relatively cheap to create, validate and simulate
[Figure: the design problem, from idea (concepts, requirements) to realization during design time: generate ideas, construct models, evaluate properties, make design decisions; the outcome "steers" the design]

Current practice
Mapping: easy, but ...
• Given
  - reference C code for the application, e.g. MPEG-4 motion estimation
  - platform: SUPERDUPER-LX50
• Task
  - map application on architecture
• But … wait a moment
[Figure: the idea expressed as C code: a=b*5+d; for (...) {..}]
me@work> CC -o2 mpeg4_me mpeg4_me.c
Thank you for running SUPERDUPER-LX50 compiler.
Your program uses 257321886 bytes memory, 78 Watt, 428798765291 clock cycles

Current design process
[Figure: application and constraints -> mapping -> "OK?"; yes: done, no: iterate]
• Post analysis: check constraints after mapping
• Simulation based
  - Does it still work for other data?
  - Does it still work when other applications are active?
• Too many iterations
• Easy to program, hard to tune
• Can this be improved?
  - e.g. constraints = input

Predictable design
What is it?
• Being able to reason at a high level about a design (in terms of functional and non-functional properties), and
• Being able to realize this design without time-consuming iterations in the design flow (design closure)
How:
• Predictable architecture
  - Making resources predictable
  - Proper modeling of less predictable elements
• Predictable design flow
  - Compositionality
  - Composability
  - Design time analysis, run time analysis

Making architectures predictable
• Getting rid of all unpredictable elements
• Caches?
  - No problem, but the WCET estimate may be big and unacceptable!
  - Software controlled: locked cache lines, non-cachable memory, controlled replacement
• Shared memory
• Communication

Making architectures predictable: NoC
Philips AETHEREAL
• The router provides both guaranteed throughput (GT) and best effort (BE) services to communicate with IPs.
• The combination of GT and BE leads to efficient use of bandwidth and a simple programming model.
[Figure: router network (R) connected to the IPs through network interfaces]

Making the NoC predictable: how to support GT traffic?
Time wheel concept
• Control the injection of traffic at the network interface
[Figure: time wheel with slots 1 to 8, turning with time]

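The time wheel can be read as a slot table that the network interface consults before injecting guaranteed-throughput data. The following is a minimal sketch of that reading, not the AETHEREAL implementation: the class name, the 0-based slot numbering and the connection name "video" are assumptions made for illustration.

```python
class TimeWheel:
    """Slot table of one network interface (illustrative model only)."""

    def __init__(self, num_slots=8):
        # owner[i] is the connection allowed to inject in slot i, or None.
        self.owner = [None] * num_slots

    def reserve(self, connection, slots):
        """Give `connection` exclusive use of the listed slots."""
        taken = [s for s in slots if self.owner[s] is not None]
        if taken:
            raise ValueError(f"slots already reserved: {taken}")
        for s in slots:
            self.owner[s] = connection

    def may_inject(self, connection, cycle):
        """GT data of `connection` may enter the network only in its own slots."""
        return self.owner[cycle % len(self.owner)] == connection

    def guaranteed_share(self, connection):
        """Fraction of the slots, and hence bandwidth, reserved for `connection`."""
        return self.owner.count(connection) / len(self.owner)

    def worst_case_wait(self, connection):
        """Largest number of slots data can wait for the next reserved slot."""
        n = len(self.owner)
        owned = [i for i, o in enumerate(self.owner) if o == connection]
        if not owned:
            return None
        return max(min((o - s) % n for o in owned) for s in range(n))


if __name__ == "__main__":
    wheel = TimeWheel(8)
    wheel.reserve("video", [0, 4])
    print(wheel.guaranteed_share("video"), wheel.worst_case_wait("video"))
```

Because the reservation is fixed, both the bandwidth share and the waiting time follow from the table alone, independent of what other connections do.
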
Making the design flow predictable: compositionality
[Figure: a high-level design and a low-level design, each built from components a, b with composite properties over x, y, z]
High-level design: P(x,y) if [P(a,b),...] !
Low-level design: P(x,y) if [P(a,b),...] ?

Making the design flow predictable
• Design time
  - Determine upper bounds on time and resources: Pareto curves
  - Scenario discovery: separate your application into parts for which the upper bounds are not too far from the worst case
[Figure: frequency of occurrence versus load for the scenarios Sc1, Sc2, Sc3]

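One way to read scenario discovery is that a bound per scenario wastes less than a single bound for the whole application. The sketch below illustrates that reading under loud assumptions: the load samples are invented, and the maximum of the profiled loads stands in for an upper bound that would really come from analysis.

```python
# Hypothetical profiled loads per scenario (not data from the lecture).
SAMPLES = {
    "Sc1": [10, 12, 11, 12],
    "Sc2": [30, 28, 31],
    "Sc3": [60],
}


def scenario_bounds(samples):
    """Bound per scenario; here simply the largest observed load."""
    return {name: max(loads) for name, loads in samples.items()}


def global_bound(samples):
    """Single bound covering every scenario."""
    return max(max(loads) for loads in samples.values())


def average_reservation(samples):
    """Average amount reserved when each run reserves its scenario's bound."""
    bounds = scenario_bounds(samples)
    total = sum(len(loads) for loads in samples.values())
    return sum(bounds[name] * len(loads) for name, loads in samples.items()) / total


if __name__ == "__main__":
    print("per scenario:", scenario_bounds(SAMPLES))
    print("global bound:", global_bound(SAMPLES))
    print("average reservation:", average_reservation(SAMPLES))
```
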
What do we want? Design time analysis
Single application
• Reasoning about end-to-end timing constraints (for given resources and quality) = predictability
• Which local arbitration mechanisms are needed?
• How to translate this to the global level?
Example:
• Given
  - Comp. resources
  - Bandwidth
  - Buffer size
• Throughput
  - Pareto curve
[Figure: application graph with actors A1 to A5 mapped on processors P1 to P4, and a Pareto curve of 1/throughput versus cost (resources) with a point (q1,c1)]

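A Pareto curve of cost versus 1/throughput keeps only the configurations that no other configuration beats on both axes. The sketch below shows one simple way to filter such points; the function name and the sample points are illustrative assumptions, not results from the lecture.

```python
def pareto_front(points):
    """Return the non-dominated (cost, inverse_throughput) points.

    Both coordinates are minimised. After sorting by cost, a point is kept
    only if it improves on the best inverse throughput seen so far.
    """
    front = []
    for point in sorted(set(points)):
        if not front or point[1] < front[-1][1]:
            front.append(point)
    return front


if __name__ == "__main__":
    # Hypothetical design points: (cost in resources, 1/throughput).
    candidates = [(1, 9), (2, 6), (3, 7), (4, 3), (5, 3), (6, 1)]
    print(pareto_front(candidates))
```
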
Scenarios: MP3

What do we want? Composability
• Multiple applications
• If app. 1 and app. 2 each fit individually, what can be said about the combination?
• Concept of virtual platform
[Figure: actors A1 to A4 mapped on processors Proc1 and Proc2]

Predictability: composability
Can we add Pareto points?
[Figure: quality Q versus cost (resources) for application 1 with point (q1,c1) and for application 2 with point (q2,c2); is their sum (q1+q2, c1+c2)?]

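If the platform is composable, so that qualities and costs really do add up, the combined curve can be formed by summing every pair of points and discarding the dominated sums. The sketch below assumes exactly that additivity, which is the open question on the slide; the helper names and sample numbers are illustrative.

```python
from itertools import product


def dominates(a, b):
    """True if (quality, cost) point a is at least as good as b and differs."""
    return a != b and a[0] >= b[0] and a[1] <= b[1]


def minimise(points):
    """Drop all dominated points; quality is maximised, cost minimised."""
    pts = set(points)
    return sorted(p for p in pts if not any(dominates(q, p) for q in pts))


def combine(app1, app2):
    """Pareto points of two applications running together, assuming additivity."""
    return minimise((q1 + q2, c1 + c2)
                    for (q1, c1), (q2, c2) in product(app1, app2))


if __name__ == "__main__":
    app1 = [(1, 2), (3, 5)]   # hypothetical (quality, cost) points
    app2 = [(2, 1), (4, 6)]
    print(combine(app1, app2))
```
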
Problem: predictable resource utilization?
[Figure: two application graphs A and B, each with three actors annotated 50, to be mapped and scheduled on processors P1, P2, P3]

Problem: predictable resource utilization?
[Figure: the graphs A and B on P1, P2, P3 with added ordering dependences (edges); the schedule over t0 to t3 shows a scheduling conflict]
Only 50% processor utilization!

Where is the problem?
• Different throughput is obtained for different orders of actors
• The number of possibilities for the overall graph increases exponentially with the number of actors and individual graphs
• Very difficult to do a complete analysis to obtain an optimal order
• Hard to model and analyze different arbitration strategies realistically

Problem: too many possibilities!
[Figure: three application graphs A, B and C with actors annotated 1, 3 and 5]

So, what is composability?
• The degree to which we can analyze the applications in isolation: throughput, latency, resource utilization, deadlock, switching / reconfiguration overhead, etc.
• Design time analysis for the complete system is too expensive and often infeasible
• Each job should be executed as if it had access to its own dedicated resources: virtualization
• Consider applications separately and then reason about the behavior of the overall system

Providing a bound for resources
• Arbitration strategy plays an important role in determining the resource requirement
  - A naive strategy leads to over-estimation of resources
  - A worst-case estimate is not always possible
• Need a predictable arbitration mechanism
  - More 'realistic' worst-case bounds
  - Handle dynamism in the system
• An overall quality versus resources Pareto curve is needed

Making the design flow predictable: run-time aspects
• Scalable applications
• QoS management
[Figure: applications (application n, scenario m), each with a local manager, talk through a QoS protocol to a global manager on the platform]

Match quality with resources
[Figure: Quality^-1 versus computational requirements]

Design flow
[Figure: from idea, via C and a requirements spec, to models and spec: Reactive Process Network, POOSL/SystemC, Kahn Process Network (YAPI), BDF, SDF; then, correct by synthesis, onto the platform]

RPN (Reactive Process Networks): events and streaming
Event part (Event_in, Event_out):
• Processing of events
• Finite state machine
• Controlling host CPU (e.g. ARM)
• RTOS; hard real-time
• 'Classical' SW complexity
Streaming part (Stream_in, Stream_out), coupled to the event part through mode and status:
• Soft real-time
• Compute intensive
• Special hardware

POOSL modeling language
• Mathematically defined semantics
  - Allows formal analysis of model properties
• Can formally describe:
  - concurrency
  - synchronous communication
  - timing (delay statements)
  - functionality
[Figure: processes P1 and P2; example statement: delay 1;]

POOSL: phases of model execution
[Figure: along model time, the state space alternates between asynchronous action execution and synchronous time passage]

From model to realization
Model fragment:
  a()();
  sel
    delay d1; b()();
  or
    c()(); delay d2;
  les;
[Figure: state diagram over S1 to S6 with actions a, b, c and delays d1, d2]
Possible execution (timed) traces:
  (S1, t1), (S2, t1), (S3, t1+d1), (S5, t1+d1)
  (S1, t1), (S2, t1), (S4, t1+d2), (S6, t1+d2)
With worst-case execution times of the actions:
  (S1, t1), (S2, t1+wcet(a)), (S3, t1+d1), (S5, t1+d1+wcet(b))
  (S1, t1), (S2, t1+wcet(a)), (S4, t1+wcet(a)+wcet(c)), (S6, t1+d2)

ε-hypothesis: property preservation
• If the time deviation between two timed execution traces is less than ε, then, if one trace satisfies a real-time property, that property, weakened up to ε, is preserved in the second one as well.
[Figure: actions a and b at model times t1 and t2, a distance d1 apart, and at physical times t'1 + ε1 and t'2 + ε2, a distance d1 - ε1 apart, with ε1, ε2 < ε]

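The time deviation between a model trace and a physical trace can be measured directly once both are recorded as lists of (action, time) pairs. The sketch below is an illustration with invented timestamps; the function names are assumptions. It also shows why a distance between two events needs some slack: by the triangle inequality it can shift by at most twice the deviation.

```python
def deviation(model, physical):
    """Largest timestamp difference between two traces of the same actions."""
    if [a for a, _ in model] != [a for a, _ in physical]:
        raise ValueError("traces must contain the same actions in the same order")
    return max(abs(t - u) for (_, t), (_, u) in zip(model, physical))


def distance(trace, first, second):
    """Time between two named actions in one trace."""
    times = dict(trace)
    return times[second] - times[first]


if __name__ == "__main__":
    # Hypothetical timestamps in microseconds.
    model = [("a", 0), ("b", 100)]
    physical = [("a", 3), ("b", 108)]
    eps = deviation(model, physical)
    shift = abs(distance(physical, "a", "b") - distance(model, "a", "b"))
    print(f"deviation {eps}, distance shifted by {shift} (at most {2 * eps})")
```
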
Extending SDF
SADF: Scenario Aware Data Flow
• Can deal with dynamism
• Still possible to reason about
  - deadlock,
  - resource utilization,
  - latency and throughput
• Currently implemented in POOSL

SADF example: MPEG-2 decoder
• Pipelined MPEG-2 decoder for I and P frames
  - VLD and IDCT fire per macro-block
  - MC and RC fire per frame
  - FD (frame detector) models the control part of VLD that determines the frame type
• Image size = 176x144
  - I-frame
      99 macro-blocks
      No motion vectors
  - Px-frame
      x macro-blocks
      Motion vectors from VLD to MC
      Previous frame from RC to MC
  - P0-frame (still video)
      Copy previous frame
• FD model based on occurrence probability of frame types
• Execution time distributions of kernels determined with profiling tool
[Figure: SADF graph with kernels VLD, IDCT, MC, RC and detector FD; the channels carry the parameterised rates a, b, c, d, e and the fixed rates 1 and 3]

Rate   I    P0   Px
a      0    0    1
b      0    0    x
c      99   1    x
d      1    0    1
e      99   0    x
x ∈ {30, 40, 50, 60, 70, 80, 99}

Results for MPEG-2 decoder
Time unit = 1 kCycle. Accuracy results based on confidence levels of 0.95.

Process   Throughput (rel. error)
VLD       0.063    (≤ 0.036%)
IDCT      0.063    (≤ 0.036%)
MC        0.00106  (≤ 0.190%)
RC        0.00106  (≤ 0.191%)

Latency between successive firings
Process   Maximum   Average (rel. error)   Variance (rel. error)
VLD       710       15.99 (≤ 0.031%)       75.38 (≤ 0.18%)
IDCT      698       15.99 (≤ 0.031%)       56.45 (≤ 4.99%)
MC        3305      940.3 (≤ 0.017%)       2.4·10^5 (≤ 3.46%)
RC        2216      940.3 (≤ 0.017%)       1.5·10^5 (≤ 4.99%)

Channel memory between processes
Channel        Maximum occupancy   Time-average occupancy (rel. error)   Time-variance in occupancy (rel. error)
VLD and IDCT   9                   1.910 (≤ 0.064%)                      0.528 (≤ 1.99%)
IDCT and RC    154                 60.19 (≤ 0.178%)                      671.8 (≤ 4.55%)
VLD and MC     133                 34.73 (≤ 0.517%)                      698.4 (≤ 4.39%)
MC and RC      1                   0.577 (≤ 0.561%)                      0.244 (≤ 3.27%)

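Two of the reported numbers can be cross-checked by hand. The sketch below assumes standard 16x16 macro-blocks and assumes that throughput is the reciprocal of the average latency between successive firings; the function names are illustrative. Under those assumptions a 176x144 image has 99 macro-blocks, and the average latencies of 15.99 and 940.3 kCycles agree with the throughputs 0.063 and 0.00106.

```python
def macroblocks(width, height, block=16):
    """Number of block x block macro-blocks in an image."""
    return (width // block) * (height // block)


def throughput_from_latency(avg_latency_kcycles):
    """Firings per kCycle, taken as 1 / average latency between firings."""
    return 1.0 / avg_latency_kcycles


if __name__ == "__main__":
    print("macro-blocks per frame:", macroblocks(176, 144))
    for name, latency in (("VLD", 15.99), ("MC", 940.3)):
        print(f"{name}: {throughput_from_latency(latency):.5f} firings per kCycle")
```
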
Design flow
• Run-time
  - Combine Pareto points: exploit Pareto algebra
  - QoS management / scalable application

Mapping multiple jobs
Multiple jobs can be active simultaneously.
• When can a second job start?
• Are the requested resources available?
  - If not, can the quality level be lowered?
  - If not, can other jobs go for a lower quality?
  - If yes, independent from other jobs?
• How to give guarantees?
[Figure: resources (up to 100%) used over time by jobs T0, T1, T2, with reconfiguration intervals]

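The questions above can be strung together into a simple admission test. The sketch below is one crude policy chosen for illustration, not the lecture's resource manager: each job lists its resource demands from highest to lowest quality, and running jobs are dropped straight to their lowest level, one at a time, until the new job fits. All names and numbers are hypothetical.

```python
def first_fit(levels, free):
    """Highest-quality demand of the new job that fits in `free`, or None."""
    return next((demand for demand in levels if demand <= free), None)


def admit(running, levels, capacity=100):
    """Try to start a new job.

    running: dict job name -> demands, current level first, fallbacks after.
    levels:  demands of the new job, highest quality first.
    Returns (demand granted to the new job, demands of the running jobs),
    or None if the job cannot be admitted.
    """
    used = {name: options[0] for name, options in running.items()}
    free = capacity - sum(used.values())
    demand = first_fit(levels, free)
    for name, options in running.items():
        if demand is not None:
            break
        # Lower this running job to its lowest quality and retry.
        free += used[name] - options[-1]
        used[name] = options[-1]
        demand = first_fit(levels, free)
    return None if demand is None else (demand, used)


if __name__ == "__main__":
    print(admit({"T0": [50, 30], "T1": [40, 25]}, [30, 20]))
```
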
Combining Pareto points
[Figure: cost versus cycle budget curves for applications 1, 2 and 3; total cycle budget 100, of which 80 is in use]
• A new thread frame is coming
• 20 cycle budgets available

Combining Pareto points
[Figure: the same three curves; application 3 is given the 20 remaining cycle budgets: feasible, but optimal?]

Combining Pareto points
[Figure: a running application gives up part of its cycle budget (cost increase Δ1) so that application 3 gets 40 instead of 20 cycle budgets (cost decrease Δ2); since Δ2 > Δ1, this is a better solution]

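Choosing one Pareto point per application under a shared cycle budget is a small search problem. The sketch below solves it by brute force over all combinations, which is only workable for a handful of applications; the (cycles, cost) points are invented so that the outcome mirrors the slides: handing the new frame the 20 leftover cycles is feasible, but rebalancing to 60 and 40 is cheaper overall.

```python
from itertools import product


def best_combination(apps, budget):
    """Pick one (cycles, cost) point per application.

    Returns (total cost, chosen points) with the lowest total cost among
    the combinations whose cycles fit in `budget`, or None if none fits.
    """
    best = None
    for choice in product(*apps):
        cycles = sum(c for c, _ in choice)
        cost = sum(k for _, k in choice)
        if cycles <= budget and (best is None or cost < best[0]):
            best = (cost, choice)
    return best


if __name__ == "__main__":
    running = [(60, 7), (80, 5)]    # hypothetical Pareto points
    new_frame = [(20, 9), (40, 4)]
    print("keep 80, give 20:", 5 + 9)
    print("best:", best_combination([running, new_frame], 100))
```
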
Open issues
• Gap between specification and architecture modeling
• High level modeling
  - use of modeling pattern library
• Incorporate multiple Pareto solutions into DSE
  - Pareto algebra
• Get synthesis correct for
  - control applications including compute intensive tasks
  - mapping to multi-processor
• Managing QoS
  - Scenario detection, merging, prediction and exploitation
  - Runtime resource manager optimizing overall quality
  - Measuring overall quality

Open issues (cont'd)
• Architecture modeling
  - how to deal with local memory (scratch pad / cache)
• Modeling scheduling and arbitration
  - make things composable!
• Definition NAL (run-time services)
• Automatic partitioning
  - e.g., SPRINT tool of IMEC is a good start (C to SystemC)
• VLSI tiling
• … and many more …
e.g. see: Ogras et al., "Key research problems in NoC design: a holistic perspective", CODES+ISSS 2005