Post on 13-Nov-2021
Optimization techniques for developing embedded applications on single-chip multiprocessor platforms
Luca Benini, Lbenini@deis.unibo.it
DEIS Università di Bologna
Embedded Systems
[Figure: microprocessor market shares. General purpose systems: 1%; embedded systems: 99%]
Example Area: Automotive Electronics
What is “automotive electronics”?
Vehicle functions implemented with electronics
Body electronics
System electronics: chassis, engine
Information/entertainment
Automotive Electronics Market Size
[Figure: automotive electronics market ($ billions) and cost of electronics per car ($, axis from 0 to 1400), 1998 to 2005. Market: 8.9 (1998), 10.5 (1999), 13.1 (2000), 14.1 (2001), 15.8 (2002), 17.4 (2003), 19.3 (2004), 21.0 (2005)]
90% of future innovations in vehicles: based on electronic embedded systems
2006: 25% of the total cost of a car will be electronics
Automotive Electronics Platform Example
Source: Expanding automotive electronic systems, IEEE Computer, Jan. 2002
Digital Convergence – Mobile Example
Broadcasting, Telematics, Imaging, Computing, Communication, Entertainment
One device, multiple functions
Center of ubiquitous media network
Smart mobile device: next drive for the semiconductor industry
4th Gen and Next-Gen Networks
Includes: 802.20, WiMAX (802.16), HSDPA, TDD UMTS, UMTS and future versions of UMTS
SoC: Enabler for Digital Convergence
[Figure: SoC today vs. future: > 100X performance, low power, complexity; storage; 4G/5G, DMB, WiBro, etc.]
Application pull
[Figure (source: [IMEC]): required computing efficiency by year of introduction, 2005 to 2015, with levels at 5 GOPS/W, 100 GOPS/W and 1 TOPS/W. Applications shown: sign recognition, A/V streaming, adaptive route, collision avoidance, autonomous driving, 3D projected display, HMI by motion, gesture detection, ubiquitous navigation, Si Xray, Gbit radio, UWB, 802.11n, structured encoding, structured decoding, 3D TV, 3D gaming, H264 encoding, H264 decoding, image recognition, full recognition (security), auto personalization, dictation, 3D ambient interaction, language, emotion, gesture and expression recognition, mobile baseband]
MPSoC Platform Evolution
[Figure: tile-based MPSoC platform: 45 nm, < 4 mm, < 1 GHz, 30 Mtr; I/O peripherals; 3D stacked main memory; local memory hierarchy; network interface; power, test and management; router; bus-based multiprocessor]
Applications, software optimization, middleware, RTOS, API, run-time controller
Mapping: V, Vt, Fclk, IL
Today's SoCs could fit in 1 tile!
Tile-based design
Multicores Are Here!
[Figure (source: [Amarasinghe06]): number of cores per chip (1 to 512) vs. year (1970 to 20??). Chips shown: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon, Raw, Power4, Opteron, Power6, Niagara, Yonah, PExtreme, Tanglewood, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, PA-8800, Cisco CSR-1, Picochip PC102, Broadcom 1480, Opteron 4P, Xeon MP, Ambric AM2045]
MPSoC – 2005 ITRS roadmap
[Figure (source: [Martin06]): 2005 to 2020. Right axis: number of processing engines (0 to 1200). Left axis: total logic size and total memory size, normalized to 2005 (0 to 60). Number of processing engines, 2005 to 2020: 16, 23, 32, 46, 63, 79, 101, 133, 161, 212, 268, 348, 424, 526, 669, 878]
SoC: Solution-on-a-Chip
Target System Application
Requires design of hardware AND software
[Figure: SoC = chip + system e-SW. Layers: System / Service: application S/W; Mobile terminal: middleware; Module: RTOS; Chip: HAL; Process: S/W IP]
Design as optimization
Design space: the set of "all" possible design choices
Constraints: solutions that we are not willing to accept
Cost function: a property we are interested in (execution time, power, reliability…)
Hardware synthesis
[Figure: synthesis flow from APPLICATION and ALGORITHM through HIGH-LEVEL SYNTHESIS to ARCHITECTURE (ASIC, GP signal processor, memory, interconnect, MCM) and LOGIC AND PHYSICAL SYNTHESIS. Insets: a filter frequency response (amplitude in dB vs. frequency) and two filter dataflow graphs built from adders, delays D and coefficients c1 to c8, k]
Behavioral synthesis
[Figure: a Control/Data Flow Graph (CDFG) and its implementation with registers, a multiplier and an adder]
Allocation, Assignment, and Scheduling
[Figure: dataflow graph with +, -, >> operations and delays D]
Allocation: How much? (2 adders, 1 shifter, 24 registers)
Assignment: Where? (Shifter 1)
Schedule: When? (Time Slot 4)
Techniques well understood and mature
Resource constraints
[Figure: a dataflow graph with operations +1, +2, +3, *1, *2, *3, scheduled in 4 control steps in two different ways]
Control step | Schedule 1 | Schedule 2
1 | +1 | +3
2 | +2 | +1, *2
3 | +3, *1 | +2, *3
4 | *2, *3 | *1
Scheduling under resource constraints
Intractable problem
Algorithms:
Exact: integer linear program; Hu (restrictive assumptions)
Approximate: list scheduling; force-directed scheduling
ILP formulation
Binary decision variables: $X = \{x_{il},\ i = 1, 2, \ldots, n;\ l = 1, 2, \ldots, \lambda + 1\}$
$x_{il}$ is TRUE only when operation $v_i$ starts in step $l$ of the schedule (i.e. $l = t_i$)
$\lambda$ is an upper bound on latency
Start time of operation $v_i$: $\sum_l l \cdot x_{il}$
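A minimal C sketch of list scheduling, one of the approximate algorithms named above, assuming unit-delay operations. The names `sched_problem` and `list_schedule` and the index-order priority are illustrative assumptions, not taken from the slides; practical list schedulers usually rank ready operations by a priority such as the longest path to the sink.

```c
#include <assert.h>

#define MAX_OPS   32
#define MAX_TYPES 4

/* Hypothetical problem description (not from the slides). */
typedef struct {
    int n;                       /* number of unit-delay operations        */
    int n_types;                 /* number of resource types               */
    int type[MAX_OPS];           /* resource type of each operation        */
    int avail[MAX_TYPES];        /* a_k: available units of each type      */
    int pred[MAX_OPS][MAX_OPS];  /* pred[i][j] != 0: j must precede i      */
} sched_problem;

/* Greedy list scheduling. Writes the 1-based control step of every
 * operation into start[] and returns the latency (last step used),
 * or -1 if no progress is possible (cyclic graph or a missing resource). */
int list_schedule(const sched_problem *p, int start[])
{
    int done = 0, step = 0;

    assert(p->n > 0 && p->n <= MAX_OPS);
    assert(p->n_types > 0 && p->n_types <= MAX_TYPES);
    for (int i = 0; i < p->n; i++)
        start[i] = 0;                       /* 0 means "not scheduled yet" */

    while (done < p->n) {
        int used[MAX_TYPES] = {0};          /* units busy in this step     */
        int placed = 0;

        step++;
        for (int i = 0; i < p->n; i++) {    /* priority = index order      */
            int ready = 1;

            if (start[i] != 0)
                continue;
            for (int j = 0; j < p->n; j++) {
                /* a predecessor must sit in a strictly earlier step */
                if (p->pred[i][j] && (start[j] == 0 || start[j] >= step)) {
                    ready = 0;
                    break;
                }
            }
            if (ready && used[p->type[i]] < p->avail[p->type[i]]) {
                start[i] = step;
                used[p->type[i]]++;
                done++;
                placed++;
            }
        }
        if (placed == 0)
            return -1;
    }
    return step;
}
```

The heuristic runs in polynomial time but gives no optimality guarantee, which is the trade-off against the exact ILP approach.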
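ILP formulation constraints
Operations start only once: $\sum_l x_{il} = 1,\quad i = 1, 2, \ldots, n$
Sequencing relations must be satisfied: $t_i \ge t_j + d_j \quad \forall (v_j, v_i) \in E$, that is, $\sum_l l \cdot x_{il} - \sum_l l \cdot x_{jl} - d_j \ge 0 \quad \forall (v_j, v_i) \in E$
Resource bounds must be satisfied. Simple case (unit delay): $\sum_{i: T(v_i) = k} x_{il} \le a_k,\quad k = 1, 2, \ldots, n_{res};\ \forall l$
ILP Formulation
$\min \sum_l l \cdot x_{nl}$ such that
$\sum_l x_{il} = 1,\quad i = 1, 2, \ldots, n$
$\sum_l l \cdot x_{il} - \sum_l l \cdot x_{jl} - d_j \ge 0,\quad i, j = 1, 2, \ldots, n,\ (v_j, v_i) \in E$
$\sum_{i: T(v_i) = k} \sum_{m = l - d_i + 1}^{l} x_{im} \le a_k,\quad k = 1, 2, \ldots, n_{res};\ l = 0, 1, \ldots, \lambda$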
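Example
Resource constraints: 2 ALUs; 2 multipliers: $a_1 = 2$; $a_2 = 2$
Single-cycle operations: $d_i = 1 \quad \forall i$
[Figure: sequencing graph with a source NOP (vertex 0), operations 1 to 11 (multiplications, additions, subtractions and one comparison) and a sink NOP (vertex n)]
Operations start only once:
$x_{11} = 1$
$x_{61} + x_{62} = 1$
…
Sequencing relations must be satisfied:
$x_{61} + 2x_{62} - 2x_{72} - 3x_{73} + 1 \le 0$
$2x_{92} + 3x_{93} + 4x_{94} - 5x_{N5} + 1 \le 0$
…
Resource bounds must be satisfied:
$x_{11} + x_{21} + x_{61} + x_{81} \le 2$
$x_{32} + x_{62} + x_{72} + x_{82} \le 2$
…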
Example
[Figure: the same sequencing graph with its operations scheduled into TIME 1 to TIME 4]
Resource-Efficient Application Mapping for MPSoCs
MULTIMEDIA APPLICATIONS
Given a platform:
1. Achieve a specified throughput
2. Minimize usage of shared resources
Optimization Development
The abstraction gap between high-level optimization tools and standard application programming models can introduce unpredictable and undesired behaviours.
Programmers must be aware of the simplifying assumptions made by optimization tools.
New methodology for multi-task application development on MPSoCs.
[Figure: application design flow: Platform Modelling, Optimization Analysis, Optimal Solution; Starting Implementation, Platform Execution, Final Implementation; an abstraction gap separates the optimization side from the implementation side]
Resource assignment and scheduling
[Figure: THE SYSTEM: nodes 1 to N, each with a processor, a tightly-coupled memory (limited size) and a bus interface, connected by a SHARED SYSTEM BUS (max bus bandwidth) to an on-chip memory (assumed to be infinite). Tasks A, B, …, N with WCETs Ta, Tb, …, Tn; max time: wheel period T]
The application
[Figure: signal processing pipeline of tasks T0 to T7, subject to a throughput constraint]
Queues for inter-processor communication: in TCM for efficiency reasons
Program data: in TCM (if space) or on-chip memory
Internal state: in TCM (if space) or on-chip memory
Each task is characterized by:
• WCET
• Memory requirements
Communication-aware Allocation and Scheduling for Stream-Oriented MPSoCs
[Figure: the signal processing pipeline mapped onto a message-oriented MPSoC architecture: ARM7 cores, each with a local scratchpad memory and a private memory, connected by a bus]
Simplifying assumptions vs predictability
Efficient solutions in reasonable time
Pure ILP formulations suitable for small task sets
Widespread use of heuristics
Master Problem model
Assignment of tasks and memory slots (master problem)
Tij = 1 if task i executes on processor j, 0 otherwise
Yij = 1 if task i allocates program data on processor j memory, 0 otherwise
Zij = 1 if task i allocates the internal state on processor j memory, 0 otherwise
Xij = 1 if task i executes on processor j and task i+1 does not, 0 otherwise
Each process should be allocated to one processor: $\sum_j T_{ij} = 1$ for all i
Link between variables X and T: $X_{ij} = |T_{ij} - T_{i+1,j}|$ for all i and j (can be linearized)
If a task is NOT allocated to a processor, neither are its required memories: $T_{ij} = 0 \Rightarrow Y_{ij} = 0$ and $Z_{ij} = 0$
Objective function: $\sum_i \sum_j \left[ mem_i (T_{ij} - Y_{ij}) + state_i (T_{ij} - Z_{ij}) + data_i X_{ij} / 2 \right]$
Improvement of the model
With the proposed model, the allocation problem solver tends to pack all tasks on a single processor and all memory required on the local memory so as to have a ZERO communication cost: TRIVIAL SOLUTION
To improve the model we should add a relaxation of the subproblem to the master problem model:
For each set S of consecutive tasks whose sum of durations exceeds the real-time requirement, we impose that their processors should not all be the same:
$\sum_{i \in S} WCET_i > RT \Rightarrow \sum_{i \in S} T_{ij} \le |S| - 1$ for each processor j
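The cut above can be generated mechanically. The following C sketch, under the assumption that tasks form a linear pipeline indexed 0..n-1, finds for each first task the shortest run of consecutive tasks whose summed WCET exceeds RT, and checks one processor's assignment column against the resulting cut. The function names `overload_windows` and `cut_satisfied` are hypothetical.

```c
#include <assert.h>

/* For each first task lo, find the shortest run lo..hi of consecutive
 * tasks whose WCET sum exceeds rt. Each run is one set S of the cut.
 * lo_out/hi_out must hold at least n entries. Returns the run count.  */
int overload_windows(const int wcet[], int n, int rt,
                     int lo_out[], int hi_out[])
{
    int count = 0;

    assert(n >= 0 && rt >= 0);
    for (int lo = 0; lo < n; lo++) {
        int sum = 0;
        for (int hi = lo; hi < n; hi++) {
            sum += wcet[hi];
            if (sum > rt) {              /* shortest overloaded run from lo */
                lo_out[count] = lo;
                hi_out[count] = hi;
                count++;
                break;
            }
        }
    }
    return count;
}

/* on_proc[i] plays the role of T_ij for one fixed processor j.
 * Returns 1 when sum_{i in S} T_ij <= |S| - 1 for S = lo..hi.         */
int cut_satisfied(const int on_proc[], int lo, int hi)
{
    int sum = 0;

    for (int i = lo; i <= hi; i++)
        sum += on_proc[i];
    return sum <= hi - lo;               /* |S| - 1 equals hi - lo */
}
```

Only the shortest runs are kept because a cut on a set S implies the same cut on every superset of S.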
Sub-Problem model
Task scheduling with static resource assignment (subproblem)
We have to schedule tasks, so we have to decide when they start
Activity starting time: Start_i :: [0..Deadline_i]
Precedence constraints: Start_i + Dur_i ≤ Start_j
Real-time constraints: for all activities running on the same processor, ∑_i (Start_i + Dur_i) ≤ RT
Cumulative constraints on resources:
processors are unary resources: cumulative([Start], [Dur], [1], 1)
memories are additive resources: cumulative([Start], [Dur], [MR], C)
What about the bus?
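To make the meaning of the cumulative constraint concrete, here is a small C checker, written as a sketch rather than as a CP propagator. It assumes positive durations and non-negative requirements; the name `cumulative_ok` is hypothetical. A unary resource is the special case where every requirement and the capacity are 1.

```c
#include <assert.h>

/* Returns 1 when, at every instant, the summed requirement of the
 * activities in progress stays within the capacity, 0 otherwise.
 * Activity k occupies the half-open interval [start[k], start[k]+dur[k]). */
int cumulative_ok(const int start[], const int dur[], const int req[],
                  int n, int capacity)
{
    assert(n >= 0 && capacity >= 0);
    /* The load profile can only rise at a start time, so it is enough
     * to evaluate the load at each activity's start instant.          */
    for (int i = 0; i < n; i++) {
        int t = start[i];
        int load = 0;

        for (int k = 0; k < n; k++) {
            if (start[k] <= t && t < start[k] + dur[k])
                load += req[k];
        }
        if (load > capacity)
            return 0;
    }
    return 1;
}
```

A CP solver enforces the same condition during search by pruning start-time domains; this function only verifies a finished schedule.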
Bus model
[Figure: bus bandwidth (bit/s) vs. time, bounded by the max bus bandwidth, showing the state reads and state writes of task i and task j around their execution times]
Unary resource: granularity of one clock cycle
Arbitration mechanism that decides the bus allocation
Additive bus model
[Figure: bus bandwidth (bit/s) vs. time for task i and task j, bounded by the max bus bandwidth; labels: size of program data, task execution time, state reads and state writes; Task0 accesses input data with BW = MaxBW/NoProc]
The model does not hold under heavy bus congestion (more than 65% of total bandwidth)
Bus traffic has to be minimized
No-good generation
[Figure: the master problem (assignment of tasks and memory slots, IP solver) passes a solution to the sub-problem (task scheduling with static resource assignment, CP solver), which returns either a solution or a no-good]
If no feasible schedule exists for the allocation provided by the master, a no-good is generated.
We use a simple BUT EFFECTIVE one: identify the CONFLICTING RESOURCES CR. For each R ∈ CR, let ST_R be the set of tasks allocated on R:
$\sum_{i \in ST_R} T_{iR} \le |ST_R| - 1$
Other cuts are also possible [Hooker, Constraints 2005], but these are enough for our case and easy to extract
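As an illustration of what the no-good forbids, the C sketch below tests a candidate allocation against one stored cut. The representation (an `alloc[]` array mapping each task to a resource index, and the `nogood` struct) is an assumption made for this example only.

```c
#include <assert.h>

/* Hypothetical encoding of one no-good: the tasks in tasks[] (the set
 * ST_R) must not all be placed on resource R again.                   */
typedef struct {
    int        resource;   /* R                                   */
    int        n_tasks;    /* |ST_R|                              */
    const int *tasks;      /* indices of the tasks in ST_R        */
} nogood;

/* alloc[i] is the resource chosen for task i, so T_iR = 1 exactly when
 * alloc[i] == R. Returns 1 when sum_{i in ST_R} T_iR <= |ST_R| - 1.   */
int nogood_satisfied(const nogood *g, const int alloc[])
{
    int on_r = 0;

    assert(g->n_tasks > 0);
    for (int k = 0; k < g->n_tasks; k++) {
        if (alloc[g->tasks[k]] == g->resource)
            on_r++;
    }
    return on_r <= g->n_tasks - 1;
}
```

In the hybrid loop the cut is added to the IP model as a linear constraint, so the master never proposes the same conflicting group again.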
Computational efficiency
CP and IP formulations simplified
Hybrid approach clearly outperforms pure CP and IP techniques
Search time bounded to 15 minutes
CP and IP can find a solution in only 50% of the instances
Hybrid approach always found a solution
Validation of bus model
Requesting more than 65% of the theoretical maximum bandwidth causes the additive model to fail.
Lower threshold in presence of communication hotspots (50%)
Benefits of the additive model:
task execution time almost independent of bus utilization
performance predictability greatly enhanced
Validation of optimizer solutions
MAX error lower than 10%
AVG error equal to 4.7%, with standard deviation of 0.08
Optimizer turns out to be conservative in predicting infeasibility
The flow was successfully applied to the GSM benchmark
Energy-Efficient Application Mapping for MPSoCs
MULTIMEDIA APPLICATIONS
Given a platform:
1. Achieve a specified throughput
2. Minimize power consumption
Application Mapping
The problem of allocation, scheduling and frequency selection for task graphs on multi-processors in a distributed real-time system is NP-hard.
New tool flows for efficient mapping of multi-task applications onto hardware platforms
[Figure: task graph T1 to T8 mapped onto processors Proc. 1 to Proc. N, each with a private memory, connected by an INTERCONNECT. Allocation assigns tasks to processors; schedule and frequency selection place them on a resources vs. time chart before the deadline]
Exploiting Voltage Supply
Supply voltage impacts power and performance
Circuit slowdown: $T = 1/f = K/(V_{dd} - V_t)^a$
Cubic power savings: $P = C_{eff} \cdot V_{dd}^2 \cdot f$
Just-in-time computation: stretch execution time up to the max tolerable
[Figure: power vs. available time: fixed voltage + shutdown compared with variable voltage]
Scheduling & Voltage Scaling
Mapping and scheduling: given (fastest freq.)
Energy/speed trade-offs: varying the voltages (CPU Vdd, Vbs)
Different voltages: different frequencies (f1, f2, f3)
[Figure: power vs. time for tasks τ1, τ2, τ3 before the deadline: at the fastest frequency, leaving slack, and at scaled frequencies f1, f2, f3]
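Just-in-time computation amounts to picking the slowest setting that still meets the deadline. The C sketch below does this for a single task on a core whose clock is a baseline frequency divided by an integer divider. The function name, the parameter layout and the microsecond deadline unit are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the largest divider d such that wcn cycles executed at
 * base_hz / d finish within deadline_us microseconds, or -1 if no
 * divider in the list meets the deadline.                          */
int slowest_feasible_divider(uint64_t wcn, uint64_t base_hz,
                             uint64_t deadline_us,
                             const int dividers[], int n)
{
    int best = -1;

    assert(base_hz > 0);
    for (int k = 0; k < n; k++) {
        uint64_t d = (uint64_t)dividers[k];

        /* execution time is wcn * d / base_hz seconds; cross-multiply
         * to compare against the deadline using integers only        */
        if (wcn * d * 1000000ULL <= deadline_us * base_hz
            && dividers[k] > best)
            best = dividers[k];
    }
    return best;
}
```

For example, 1,000,000 cycles on a 200 MHz baseline take 5 ms at divider 1, so a 12 ms deadline admits divider 2 (10 ms) but not divider 3 (15 ms). The full problem in this deck is harder, because frequency choices interact with allocation and with the schedules of other tasks.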
Target architecture - 2
Homogeneous computation tiles:
ARM cores (including instruction and data caches);
Tightly coupled software-controlled scratch-pad memories (SPM);
AMBA AHB;
DMA engine;
RTEMS OS;
Technology homogeneous (0.13 um), industrial power models (ST)
Variable voltage/frequency cores with discrete (Vdd, f) pairs
Frequency dividers scale down the baseline 200 MHz system clock
Cores use non-cacheable shared memory to communicate;
Semaphore and interrupt facilities are used for synchronization;
Private on-chip memory to store data.
[Figure: tiles with synchronization modules and private memories, a shared memory and an interrupt slave on an AMBA AHB INTERCONNECT; a programmable register drives a clock tree generator (SystemC) that produces CLOCK 1 to CLOCK N and Int_CLK]
Application model
A task graph represents:
A group of tasks T
Task dependencies
Execution times expressed in clock cycles: WCN(Ti)
Communication times (writes & reads) expressed as WCN(WTiTj) and WCN(RTiTj)
These values can be back-annotated from functional simulation
[Figure: task graph Task1 to Task6. Nodes are annotated with WCN(T1) to WCN(T6); the arcs T1-T2, T1-T3, T2-T4, T3-T5, T4-T6 and T5-T6 are annotated with WCN(WTiTj) and WCN(RTiTj)]
Efficient Application Development Support
Optimization tools generally rely on many simplifying assumptions.
Neglecting these assumptions in the software implementation can:
generate unpredictable and undesired system-level interactions;
make the overall system error-prone.
We propose an entire framework to help programmers in software implementation:
a generic customizable application template (OFFLINE SUPPORT);
a set of high-level APIs (ONLINE SUPPORT).
The main goals of our development framework are:
exact and reliable execution of the application after the optimization step;
guarantees about high performance and constraint satisfaction.
Customizable Application Template
Starting from a high-level task and data flow graph, software developers can easily and quickly build their application infrastructure.
Programmers can intuitively translate the high-level representation into C code using our facilities and library.
Users can specify:
the number of tasks included in the target application;
their nature (e.g. branch, fork, or-node, and-node);
their precedence constraints (e.g. due to data communication);
thus quickly drawing its CTG schema.
Programmers can focus on the functionalities of the tasks:
the main effort goes to the more specific and critical sections of the application.
OS-level and Task-level APIs
Users can easily reproduce optimizer solutions, thus:
Indirectly neglecting optimizer's abstractions: task model; communication model; OS overheads.
Obtaining the needed application constraint satisfaction.
Programmers can allocate to the right hardware resources: tasks; program data; queues.
Scheduling support APIs: frequency and voltage selection.
Communication issues: shared queues; semaphores; interrupts.
#define TASK_NUMBER 12
//Node Type: 0 NORMAL; 1 BRANCH; 2 STOCHASTIC
uint node_type[TASK_NUMBER] = {1,2,2,1,..};
//Node Behaviour: 0 AND; 1 OR; 2 FORK; 3 BRANCH
uint node_behaviour[TASK_NUMBER] = {2,3,3,..};
#define N_CPU 2
uint task_on_core[TASK_NUMBER] = {1,1,2,1};
int schedule_on_core[N_CPU][TASK_NUMBER] = {{1,2,4,8}..};
uint queue_consumer [..] [..] = {{0,1,1,0,..},{0,0,0,1,1,.},{0,0,0,0,0,1,1..},{0,0,0,0,..}..};
Example
Number of nodes: 12
Graph of activities
Node type: Normal, Branch, Conditional, Terminator
Node behaviour: Or, And, Fork, Branch
Number of CPUs: 2
Task allocation
Task scheduling
Arc priorities
Freq. & voltage
[Figure: conditional task graph with 12 nodes (N1, B2, B3, C4, C5, C6, C7, N8, N9, N10, N11, T12) and arcs a1 to a14, with fork, or, and, and branch nodes; its allocation and schedule on processors P1 and P2 are shown on a resources vs. time chart before the deadline]
Queue ordering optimization
Communication ordering affects system performance
[Figure: tasks T1 to T6 on CPU1 and CPU2 exchanging messages C1 to C5; depending on the order in which the queues are served, a consumer task either waits ("Wait!") or runs ("RUN!")]
Synchronization among tasks
[Figure: tasks T1 to T4 with communications C1, C2, C3 on Proc. 1 and Proc. 2; T4 is suspended and later re-activated]
Non blocked semaphores
Logic Based Benders Decomposition
[Figure: allocation & frequency assignment (INTEGER PROGRAMMING; objective function: communication cost & energy consumption; memory constraints) passes a valid allocation to scheduling (CONSTRAINT PROGRAMMING; real-time constraint), which returns a no-good as a linear constraint]
Decomposes the problem into 2 sub-problems:
Allocation & assignment of frequency settings → IP
Objective function: minimizing energy consumption during execution and communication of tasks
Scheduling → CP
Objective function: minimizing energy consumption during frequency switching
Solver Performance
Hundreds of decision variables
Much beyond ILP solver or CP solver capability
Allocation problem model
X_tfp = 1 if task t executes on processor p at frequency f;
W_ijfp = 1 if tasks i and j run on different cores and task i on core p writes data to j at frequency f;
R_ijfp = 1 if tasks i and j run on different cores and task j on core p reads data from i at frequency f.
Each task can execute on only one processor at one frequency:
$\sum_{p=1}^{P} \sum_{f=1}^{M} X_{tfp} = 1 \quad \forall t$
Communication between tasks can execute only once per execution, and one write corresponds to one read:
$\sum_{p=1}^{P} \sum_{f=1}^{M} W_{ijfp} \le 1 \quad \forall i, j \in T$
$\sum_{p=1}^{P} \sum_{f=1}^{M} R_{ijfp} \le 1 \quad \forall i, j \in T$
$\sum_{p=1}^{P} \sum_{f=1}^{M} (W_{ijfp} - R_{ijfp}) = 0 \quad \forall i, j \in T$
The objective function: minimize energy consumption associated with task execution and communication:
$OF = En_{Comp} + En_{Read} + En_{Write}$
Allocation problem model
[Figure: two CPUs and a memory connected by a bus]
$OF = En_{Comp} + En_{Write} + En_{Read}$
Computation energy for all tasks in the system:
$En_{Comp} = \sum_{p=1}^{P} \sum_{f=1}^{M} \sum_{t=1}^{T} X_{tfp} \, WCN_t \, E_f$
Communication energy for writes to shared memory. Writes are carried out at the same frequency as the task:
$En_{Write} = \sum_{p=1}^{P} \sum_{f=1}^{M} \sum_{t=1}^{T} \left( X_{ifp} \, WCN_{W_{ij}Loc} + W_{ijfp} \, (WCN_{W_{ij}m} - WCN_{W_{ij}Loc}) \right) E_f$
Communication energy for reads from shared memory. Reads are carried out at the same frequency as the task:
$En_{Read} = \sum_{p=1}^{P} \sum_{f=1}^{M} \sum_{t=1}^{T} \left( X_{ifp} \, WCN_{R_{ij}Loc} + R_{ijfp} \, (WCN_{R_{ij}m} - WCN_{R_{ij}Loc}) \right) E_f$
Scheduling problem model
Five phases behaviour
INPUT = input data reading; EXEC = computation activity; OUTPUT = output data writing.
Atomic activities
[Figure: a task as a fork of input activities (reading phase), one exec activity, and a join of output activities (writing phase)]
The objective function: minimize energy consumption associated with frequency switching
• Processors are modelled as unary resources
• Bus is modelled as an additive resource
Duration of task i is now fixed since mode is fixed. Precedence between task i and task j:
Tasks running on the same processor at the same frequency: $Start_i + duration_i \le Start_j$
Tasks running on the same processor at different frequencies: $Start_i + duration_i + T_i \le Start_j$
Tasks running on different processors: $Start_i + duration_i + dWrite_{ij} + dRead_{ij} \le Start_j$
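The three precedence cases can be folded into one helper that returns the earliest admissible start of the successor. This is a sketch with hypothetical names; in particular, reading the extra term in the second case as a frequency-switching time (`t_switch`) is an assumption.

```c
#include <assert.h>

/* How predecessor i and successor j are placed (hypothetical enum). */
typedef enum {
    SAME_PROC_SAME_FREQ,
    SAME_PROC_DIFF_FREQ,
    DIFF_PROC
} placement;

/* Earliest start of task j allowed by the precedence i -> j.
 * t_switch: assumed frequency-switching time on the shared core.
 * d_write, d_read: durations of the write and read over the bus.   */
int earliest_successor_start(int start_i, int duration_i, placement rel,
                             int t_switch, int d_write, int d_read)
{
    assert(duration_i >= 0);
    switch (rel) {
    case SAME_PROC_SAME_FREQ:
        return start_i + duration_i;
    case SAME_PROC_DIFF_FREQ:
        return start_i + duration_i + t_switch;
    case DIFF_PROC:
        return start_i + duration_i + d_write + d_read;
    }
    return -1;   /* not reached for a valid placement value */
}
```

A scheduler would use this value as a lower bound on the successor's start time and push it later whenever the processor or the bus is busy.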
Application Development Methodology
[Figure: CTG, characterization phase (simulator), application profiles, optimization phase (optimizer), optimal SW application implementation (allocation, scheduling) with the application development support, platform execution]
Validation of optimizer solutions: Throughput
[Figure: the optimal allocation & schedule from the optimizer is validated on the virtual platform; histogram over 250 instances of probability (%) vs. throughput difference (%), from -5% to 11%]
MAX error lower than 10%;
AVG error equal to 4.51%, with standard deviation of 1.94;
All the deadline constraints are satisfied.
Validation of optimizer solutions: Power
[Figure: the optimal allocation & schedule from the optimizer is validated on the virtual platform; histogram over 250 instances of probability (%) vs. energy consumption difference (%), from -5% to 11%]
MAX error lower than 10%;
AVG error equal to 4.80%, with standard deviation of 1.71.
GSM Encoder
Throughput required: 1 frame/10 ms.
With 2 processors and 4 possible frequency & voltage settings:
Task graph: 10 computational tasks; 15 communication tasks.
Without optimizations: 50.9 μJ
With optimizations: 17.1 μJ (-66.4%)
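The reported saving follows directly from the two energy values above:

```latex
\frac{50.9\,\mu\mathrm{J} - 17.1\,\mu\mathrm{J}}{50.9\,\mu\mathrm{J}}
  = \frac{33.8}{50.9} \approx 0.664 = 66.4\%
```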
Summary & future work
Energy-optimal task mapping:
Strong optimization engine (complete)
Programmer support (design & exec time)
Validation: accuracy & optimality
Future work:
Conditional task graphs
Dealing with multiple use cases
Variable execution times
Aggressive communication scheduling