
Optimization techniques for developing embedded applications on single-chip multiprocessor platforms

Luca Benini
Lbenini@deis.unibo.it

DEIS Università di Bologna

Embedded Systems

General purpose systems Embedded systems

Microprocessor market shares

Example Area: Automotive Electronics

What is “automotive electronics”?

Vehicle functions implemented with electronics

Body electronics
System electronics: chassis, engine
Information/entertainment

Automotive Electronics Market Size

Market ($ billions):

Year     1998  1999  2000  2001  2002  2003  2004  2005
Market    8.9  10.5  13.1  14.1  15.8  17.4  19.3  21.0

[Figure: market size plotted together with the cost of electronics per car ($), on an axis from 400 to 1400]

90% of future innovations in vehicles: based on electronic embedded systems

2006: 25% of the total cost of a car will be electronics

Automotive Electronics Platform Example

Source: Expanding automotive electronic systems, IEEE Computer, Jan. 2002

Digital Convergence – Mobile Example

[Figure: converging functions: Broadcasting, Telematics, Imaging, Computing, Communication, Entertainment]

One device, multiple functions
Center of ubiquitous media network
Smart mobile device: next drive for the semiconductor industry

4th Gen and Next-Gen Networks

Includes: 802.20, WiMAX (802.16), HSDPA, TDD UMTS, UMTS and future versions of UMTS

SoC: Enabler for Digital Convergence

[Figure: future SoCs as the enabler: > 100X performance, low power, complexity, storage; 4G/5G, DMB, WiBro, etc.]

Application pull

[Figure: applications by year of introduction (2005 to 2015) against required compute efficiency, rising from 5 GOPS/W through 100 GOPS/W to 1 TOPS/W]

Applications shown: sign recognition, A/V streaming, adaptive route, collision avoidance, autonomous driving, 3D projected display, HMI by motion, gesture detection, ubiquitous navigation, Si Xray, Gbit radio, 802.11n, structured encoding, structured decoding, 3D TV, 3D gaming, H264 encoding, H264 decoding, image recognition, fully recognition (security), auto personalization, dictation, 3D ambient interaction, language/emotion recognition, gesture recognition, expression recognition, mobile base-band.

[IMEC]

MPSoC Platform Evolution

[Figure: tile-based MPSoC platform. Labels: I/O peripherals, 3D stacked, 30 Mtr, local memory hierarchy, network interface, power, test, router, bus-based multiprocessor. Software stack: applications, software optimization, middleware, RTOS, API, run-time controller. Mapping knobs: V, Vt, Fclk, IL]

Today’s SoCs could fit in 1 tile!!
Tile-based design

Multicores Are Here!

[Figure: number of cores per chip by year, from 1970 to 20??, on an axis reaching 128 and 256 cores]

Processors shown: 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon, Power4, Opteron, Power6, Niagara, Yonah, PExtreme, Tanglewood, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, PA-8800, Cisco CSR-1, Picochip PC102, Broadcom 1480, Opteron 4P, Xeon MP, Ambric AM2045.

[Amarasinghe06]

MPSoC – 2005 ITRS roadmap

[Figure: roadmap for 2005 to 2020 showing the number of processing engines (right axis), total logic size and total memory size (both normalized to 2005, left axis). Data labels shown: 16, 23, 32, 46, 63, 79, 101, 133, 161, 212]

[Martin06]

Level               Software layer
System / Service    Application S/W
Mobile Terminal     Middleware
Module              RTOS
Chip                HAL
Process             S/W IP

Target System Application

Requires design of Hardware AND software

SoC Solution-on-a-Chip

SOCSOC

System e-SW

Design as optimization

Design space: the set of “all” possible design choices
Constraints: solutions that we are not willing to accept
Cost function: a property we are interested in (execution time, power, reliability…)

Hardware synthesis

[Figure: synthesis flow. Labels: ALGORITHM, APPLICATION (IN, OUT, frequency response plot), HIGH-LEVEL SYNTHESIS (states S1 to S4), ARCHITECTURE (ASIC, GP signal processor, memory, interconnect), LOGIC AND PHYSICAL SYNTHESIS]

Behavioral synthesis

Control/Data Flow Graph (CDFG) → Implementation

[Figure: datapath implementation built from registers and a multiplier]

Allocation, Assignment, and Scheduling

Allocation: How much? (2 adders, 1 shifter, 24 registers)

Assignment: Where? (Shifter 1)

Schedule: When? (Time Slot 4)

Techniques Well Understood and Mature

Resource constraints

4 control steps

[Figure: two alternative schedules (Schedule 1 and Schedule 2) assigning the additions (+1, +2, +3) and multiplications (*1, *2, *3) to control steps]

Scheduling under resource constraints

Intractable problem

Algorithms:
  Exact:
    Integer linear program
    Hu (restrictive assumptions)
  Approximate:
    List scheduling
    Force-directed scheduling
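As an illustration of the approximate family, here is a minimal list-scheduling sketch in Python. It assumes unit-delay operations and uses the longest path to a sink as the priority; the operation names, the small graph, and the resource limits are hypothetical and not taken from the slides.

```python
from collections import defaultdict

def list_schedule(ops, deps, limits):
    """ops: {name: resource_type}; deps: (pred, succ) pairs; limits: {type: units}.
    Unit-delay list scheduling. Returns {name: control_step}, steps start at 1."""
    succs, preds = defaultdict(list), defaultdict(list)
    for u, v in deps:
        succs[u].append(v)
        preds[v].append(u)

    # Priority: length of the longest path to a sink (critical-path heuristic).
    prio = {}
    def longest(u):
        if u not in prio:
            prio[u] = 1 + max((longest(v) for v in succs[u]), default=0)
        return prio[u]
    for u in ops:
        longest(u)

    start, step = {}, 1
    while len(start) < len(ops):
        used = defaultdict(int)
        # Ready: unscheduled operations whose predecessors ran in earlier steps.
        ready = [u for u in ops if u not in start
                 and all(p in start and start[p] < step for p in preds[u])]
        for u in sorted(ready, key=lambda u: -prio[u]):
            if used[ops[u]] < limits[ops[u]]:
                start[u] = step
                used[ops[u]] += 1
        step += 1
    return start

# Hypothetical graph: three multiplications feeding two additions.
OPS = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "alu", "a2": "alu"}
DEPS = [("m1", "a1"), ("m2", "a1"), ("m3", "a2"), ("a1", "a2")]

one_mul = list_schedule(OPS, DEPS, {"mul": 1, "alu": 1})
two_mul = list_schedule(OPS, DEPS, {"mul": 2, "alu": 1})
```

With one multiplier the sketch needs 4 control steps; with two it needs 3. The heuristic is fast but gives no optimality guarantee, which is the trade-off against the exact methods.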

Binary decision variables:
$X = \{ x_{il};\ i = 1, 2, \ldots, n;\ l = 1, 2, \ldots, \lambda + 1 \}$
$x_{il}$ is TRUE only when operation $v_i$ starts in step $l$ of the schedule (i.e. $l = t_i$)
$\lambda$ is an upper bound on latency

Start time of operation $v_i$: $t_i = \sum_l l \cdot x_{il}$

ILP formulation constraints

Operations start only once:
$\sum_l x_{il} = 1, \quad i = 1, 2, \ldots, n$

Sequencing relations must be satisfied:
$t_i \ge t_j + d_j \quad \forall (v_j, v_i) \in E$
$\sum_l l \cdot x_{il} - \sum_l l \cdot x_{jl} - d_j \ge 0 \quad \forall (v_j, v_i) \in E$

Resource bounds must be satisfied. Simple case (unit delay):
$\sum_{i : T(v_i) = k} x_{il} \le a_k, \quad k = 1, 2, \ldots, n_{res};\ \forall l$

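ILP Formulation

$\min \sum_l l \cdot x_{nl}$ such that

$\sum_l x_{il} = 1, \quad i = 1, 2, \ldots, n$

$\sum_l l \cdot x_{il} - \sum_l l \cdot x_{jl} - d_j \ge 0, \quad i, j = 1, 2, \ldots, n,\ (v_j, v_i) \in E$

$\sum_{i : T(v_i) = k} \ \sum_{m = l - d_i + 1}^{l} x_{im} \le a_k, \quad k = 1, 2, \ldots, n_{res};\ l = 0, 1, \ldots, \lambda$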
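To make the three constraint families concrete, the sketch below solves a tiny instance by exhaustive enumeration, which stands in for a real ILP solver. Choosing a start step $t_i$ for every operation is equivalent to setting exactly one $x_{il}$ to 1. The five-operation instance, its resource types, and the function name are hypothetical.

```python
from itertools import product

def solve_schedule_ilp(types, delays, edges, avail, lam):
    """types[i]: resource type of v_i; delays[i]: d_i; edges: (j, i) pairs meaning
    v_j precedes v_i; avail[k]: a_k; lam: latency bound. The last operation is
    the sink v_n. Returns the start times minimizing the sink start time."""
    n = len(types)
    steps = range(1, lam + 2)                      # l = 1 .. lam + 1
    best = None
    for t in product(steps, repeat=n):             # each op starts exactly once
        # Sequencing: t_i - t_j - d_j >= 0 for every edge (v_j, v_i).
        if any(t[i] - t[j] - delays[j] < 0 for j, i in edges):
            continue
        # Resource bounds: ops of type k executing in step l must not exceed a_k.
        feasible = True
        for l in steps:
            for k, a_k in avail.items():
                busy = sum(1 for i in range(n)
                           if types[i] == k and t[i] <= l <= t[i] + delays[i] - 1)
                if busy > a_k:
                    feasible = False
        if feasible and (best is None or t[n - 1] < best[n - 1]):
            best = t
    return best

# Hypothetical instance: v0, v1, v2 multiply; v3 adds v0 and v1; v4 is the sink.
TYPES = ["mul", "mul", "mul", "alu", "nop"]
DELAYS = [1, 1, 1, 1, 0]
EDGES = [(0, 3), (1, 3), (2, 4), (3, 4)]
best = solve_schedule_ilp(TYPES, DELAYS, EDGES, {"mul": 2, "alu": 1}, lam=4)
```

Enumeration grows exponentially with the number of operations, which is why the slides call the problem intractable and list heuristics alongside the exact formulation.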

Example

Resource constraints: 2 ALUs; 2 multipliers
$a_1 = 2;\ a_2 = 2$

Single-cycle operation: $d_i = 1 \ \forall i$

[Figure: sequencing graph of multiplications (*), additions (+) and a comparison (<)]

Example

Operations start only once:
$x_{11} = 1$
$x_{61} + x_{62} = 1$
…

Sequencing relations must be satisfied:
$x_{61} + 2x_{62} - 2x_{72} - 3x_{73} + 1 \le 0$
$2x_{92} + 3x_{93} + 4x_{94} - 5x_{N5} + 1 \le 0$
…

Resource bounds must be satisfied:
$x_{11} + x_{21} + x_{61} + x_{81} \le 2$
$x_{32} + x_{62} + x_{72} + x_{82} \le 2$
…

Example

[Figure: resulting schedule of the sequencing graph over TIME 1 to TIME 4]

Resource-Efficient Application Mapping for MPSoCs

MULTIMEDIA APPLICATIONS

Given a platform:
1. Achieve a specified throughput
2. Minimize usage of shared resources

Optimization Development

The abstraction gap between high-level optimization tools and standard application programming models can introduce unpredictable and undesired behaviours.
Programmers must be aware of the simplifying assumptions made in optimization tools.
New methodology for multi-task application development on MPSoCs.

Platform Modelling

Optimization Analysis

Optimal Solution

Starting Implementation

Platform Execution

Abstraction gap

Final Implementation

Application design flow

Resource assignment and scheduling

[Figure: THE SYSTEM. Node 1 to Node N, each with a processor, a tightly-coupled memory (limited size) and a bus interface, share the SHARED SYSTEM BUS (max bus bandwidth) and an on-chip memory (assumed to be infinite). Tasks A, B, …, N with worst-case execution times Ta, Tb, …, Tn must fit in the max time (wheel period)]

The application

[Figure: signal processing pipeline T0, T1, T2, T3, …, T7 with a throughput constraint]

Queues for inter-processor communication: in TCM for efficiency reasons
Program data: in TCM (if space) or on-chip memory
Internal state: in TCM (if space) or on-chip memory

Each task is characterized by:
• WCET
• Memory requirements

Communication-aware Allocation and Scheduling for Stream-Oriented MPSoCs

[Figure: signal processing pipeline (T0, T1, T2, …, T7) mapped onto a message-oriented MPSoC architecture: ARM7 cores, each with a local scratchpad memory and a private memory, connected by a BUS]

Simplifying assumptions vs predictability
Efficient solutions in reasonable time
Pure ILP formulations suitable for small task sets
Widespread use of heuristics

Master problem model: assignment of tasks and memory slots

$T_{ij} = 1$ if task $i$ executes on processor $j$, 0 otherwise
$Y_{ij} = 1$ if task $i$ allocates its program data on the memory of processor $j$, 0 otherwise
$Z_{ij} = 1$ if task $i$ allocates its internal state on the memory of processor $j$, 0 otherwise
$X_{ij} = 1$ if task $i$ executes on processor $j$ and task $i+1$ does not, 0 otherwise

Each process should be allocated to exactly one processor: $\sum_j T_{ij} = 1$ for all $i$

Link between variables X and T: $X_{ij} = |T_{ij} - T_{i+1,j}|$ for all $i$ and $j$ (can be linearized)

If a task is NOT allocated to a processor, neither are its required memories: $T_{ij} = 0 \Rightarrow Y_{ij} = 0$ and $Z_{ij} = 0$

Objective function: $\sum_i \sum_j \left[ mem_i (T_{ij} - Y_{ij}) + state_i (T_{ij} - Z_{ij}) + data_i X_{ij} / 2 \right]$
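A small sketch of how this objective can be evaluated for a given allocation. It assumes the internal-state term is charged through $Z_{ij}$ (the variable defined for internal state) and that the last task of the pipeline has no successor, so its $X$ is 0. The three-task, two-processor instance and all sizes are hypothetical.

```python
def master_cost(T, Y, Z, mem, state, data):
    """Communication cost of an allocation.
    T, Y, Z: 0/1 matrices indexed [task][processor] as defined above.
    mem[i], state[i]: sizes of task i's program data and internal state.
    data[i]: amount of data task i sends to task i+1."""
    n, m = len(T), len(T[0])
    cost = 0.0
    for i in range(n):
        for j in range(m):
            # X_ij = |T_ij - T_(i+1)j|; assumed 0 for the last task.
            x = abs(T[i][j] - T[i + 1][j]) if i + 1 < n else 0
            cost += mem[i] * (T[i][j] - Y[i][j])      # program data left remote
            cost += state[i] * (T[i][j] - Z[i][j])    # internal state left remote
            cost += data[i] * x / 2                   # each hop is counted twice
    return cost

# Hypothetical allocation: tasks 0 and 1 on processor 0, task 2 on processor 1.
T = [[1, 0], [1, 0], [0, 1]]
Y = [[1, 0], [0, 0], [0, 1]]   # task 1 keeps its program data in on-chip memory
Z = [[1, 0], [1, 0], [0, 1]]
cost = master_cost(T, Y, Z, mem=[10, 20, 30], state=[1, 2, 3], data=[5, 6, 0])
```

The division by 2 compensates for the absolute value: a hop between task $i$ and task $i+1$ sets $X_{ij} = 1$ on both processors involved.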

Improvement of the model

With the proposed model, the allocation problem solver tends to pack all tasks on a single processor and all required memory on the local memory, so as to have a ZERO communication cost: TRIVIAL SOLUTION

To improve the model we should add a relaxation of the subproblem to the master problem model:

For each set S of consecutive tasks whose sum of durations exceeds the real-time requirement, we impose that their processors should not all be the same:

$\sum_{i \in S} WCET_i > RT \ \Rightarrow \ \sum_{i \in S} T_{ij} \le |S| - 1 \quad \forall j$
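The relaxation can be generated mechanically from the task durations. The sketch below assumes a linear pipeline and, for each starting task, emits the shortest window of consecutive tasks whose total WCET exceeds RT; the WCET values and function names are hypothetical.

```python
def consecutive_cuts(wcet, rt):
    """For each starting task, return the shortest window (first, last) of
    consecutive tasks whose summed WCET exceeds the real-time requirement rt."""
    cuts = []
    for first in range(len(wcet)):
        total = 0
        for last in range(first, len(wcet)):
            total += wcet[last]
            if total > rt:
                cuts.append((first, last))
                break
    return cuts

def violates(cut, task_proc):
    """A cut sum_{i in S} T_ij <= |S| - 1 is violated when every task of the
    window S sits on the same processor."""
    first, last = cut
    return len(set(task_proc[first:last + 1])) == 1

WCET = [4, 3, 5, 2, 6]          # hypothetical durations
cuts = consecutive_cuts(WCET, rt=8)
```

Longer windows containing one of these are already covered, so only the shortest window per starting task needs to be added to the master problem.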

Sub-problem model: task scheduling with static resource assignment

We have to schedule tasks, so we have to decide when they start.

Activity starting time: $Start_i :: [0..Deadline_i]$

Precedence constraints: $Start_i + Dur_i \le Start_j$

Real-time constraints, for all activities running on the same processor: $\sum_i (Start_i + Dur_i) \le RT$

Cumulative constraints on resources:
  processors are unary resources: cumulative([Start], [Dur], [1], 1)
  memories are additive resources: cumulative([Start], [Dur], [MR], C)
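The meaning of the cumulative constraint can be checked directly on a candidate schedule. The sketch below is only a feasibility checker, not a CP propagator, and the start times, durations, and capacities are hypothetical.

```python
def cumulative_ok(starts, durs, reqs, capacity):
    """Check cumulative(starts, durs, reqs, capacity): at every instant the
    requirements of the running activities add up to at most capacity.
    Usage can only increase at a start time, so those instants suffice."""
    for t in sorted(set(starts)):
        load = sum(r for s, d, r in zip(starts, durs, reqs) if s <= t < s + d)
        if load > capacity:
            return False
    return True

# Processor as a unary resource: every activity needs 1 unit out of 1.
sequential = cumulative_ok([0, 2, 5], [2, 3, 1], [1, 1, 1], 1)
overlapping = cumulative_ok([0, 1], [2, 2], [1, 1], 1)

# Memory as an additive resource: requirements MR against capacity C.
fits = cumulative_ok([0, 1], [2, 2], [3, 4], 8)
overflows = cumulative_ok([0, 1], [2, 2], [3, 4], 6)
```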

What about the bus??

Bus model

Unary resource: granularity clock cycle
Arbitration mechanism that decides the bus allocation

[Figure: bus bandwidth (bit/sec) over the execution time of task i and task j, up to the max bus bandwidth; the state reads and state writes of each task occupy the bus one at a time]

Additive bus model

Task 0 accesses input data: BW = MaxBW / NoProc
The model does not hold under heavy bus congestion (more than 65% of total bandwidth)
Bus traffic has to be minimized

[Figure: bus bandwidth (bit/sec) over task execution time; the bandwidth shares of concurrent transfers (program data, state reads and state writes of task i and task j) add up below the max bus bandwidth]
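A minimal sketch of the additive view of the bus: each transfer holds a share of the bandwidth for its whole duration, and the shares of concurrent transfers are summed. The 65% threshold comes from the slide; the transfer tuples, the units, and the function names are hypothetical.

```python
def peak_bus_load(transfers):
    """transfers: (start, end, bandwidth) triples. Returns the highest summed
    bandwidth of transfers that are active at the same time."""
    starts = sorted({s for s, _, _ in transfers})
    return max((sum(bw for s, e, bw in transfers if s <= t < e) for t in starts),
               default=0)

def additive_model_holds(transfers, max_bw, threshold=0.65):
    """True while the requested bandwidth stays within the range where the
    additive model is reported to be valid."""
    return peak_bus_load(transfers) <= threshold * max_bw

MAX_BW = 100.0                                   # hypothetical units
light = [(0, 10, 30.0), (5, 15, 30.0)]           # peak of 60 while both overlap
heavy = [(0, 10, 40.0), (5, 15, 30.0)]           # peak of 70 while both overlap
```

A scheduler that relies on this model would reject or reshape any allocation for which `additive_model_holds` is false, since predicted execution times are no longer trustworthy there.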

No-good generation

Master problem: assignment of tasks and memory slots
Subproblem: task scheduling with static resource assignment

If no feasible schedule exists for the allocation provided by the master, a no-good is generated.

We use a simple BUT EFFECTIVE one: identify the CONFLICTING RESOURCES CR. For each R ∈ CR, let $S_{TR}$ be the set of tasks allocated on R:

$\sum_{i \in S_{TR}} T_{iR} \le |S_{TR}| - 1$

Other cuts are also possible [Hooker, Constraints 2005], but these are enough for our case and easy to extract.

[Figure: the master problem (IP solver) passes a solution to the sub-problem (CP solver), which returns either a solution or a no-good]
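The interaction between master and subproblem can be sketched as a loop. In this toy version brute-force enumeration stands in for the IP solver, a per-processor load check stands in for the CP scheduler, and the master objective simply counts pipeline hops between processors; all of these are simplifying assumptions, as are the task durations.

```python
from itertools import product

def benders(durations, n_procs, rt):
    """Toy logic-based Benders loop. Returns (assignment, iterations)."""
    n = len(durations)
    nogoods = []                      # (processor R, set S_TR of tasks on R)

    def cost(assign):                 # hops between consecutive pipeline tasks
        return sum(assign[i] != assign[i + 1] for i in range(n - 1))

    def master():
        # Keep allocations satisfying sum_{i in S_TR} T_iR <= |S_TR| - 1.
        feasible = [a for a in product(range(n_procs), repeat=n)
                    if all(sum(a[i] == r for i in s) <= len(s) - 1
                           for r, s in nogoods)]
        return min(feasible, key=cost) if feasible else None

    def subproblem(assign):
        # Conflicting resources: processors whose tasks cannot fit within rt.
        conflicts = []
        for r in range(n_procs):
            tasks = frozenset(i for i in range(n) if assign[i] == r)
            if sum(durations[i] for i in tasks) > rt:
                conflicts.append((r, tasks))
        return conflicts

    iterations = 0
    while True:
        iterations += 1
        assign = master()
        if assign is None:
            return None, iterations   # no allocation can be scheduled
        conflicts = subproblem(assign)
        if not conflicts:
            return assign, iterations
        nogoods.extend(conflicts)     # cut off this allocation and retry

result = benders([4, 3, 5, 2], n_procs=2, rt=8)
```

Each no-good removes at least the allocation that produced it, so the loop terminates; the first allocation that passes the subproblem is optimal for the master objective because the master always proposes its cheapest remaining candidate.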

Computational efficiency

CP and IP formulations simplified
Hybrid approach clearly outperforms pure CP and IP techniques
Search time bounded to 15 minutes

Pure CP and IP found a solution in at most 50% of the instances
Hybrid approach always found a solution

Validation of bus model

Requesting more than 65% of the theoretical maximum bandwidth causes the additive model to fail.
Lower threshold in presence of communication hotspots (50%)

Benefits of the additive model:
  Task execution time almost independent of bus utilization
  Performance predictability greatly enhanced

Validation of optimizer solutions

MAX error lower than 10%
AVG error equal to 4.7%, with standard deviation of 0.08
Optimizer turns out to be conservative in predicting infeasibility
The flow was successfully applied to the GSM benchmark

Energy-Efficient Application Mapping for MPSoCs

MULTIMEDIA APPLICATIONS

Given a platform:
1. Achieve a specified throughput
2. Minimize power consumption

Application Mapping

The problem of allocation, scheduling and frequency selection for task graphs on multiprocessors in a distributed real-time system is NP-hard.
New tool flows for efficient mapping of multi-task applications onto hardware platforms.

[Figure: tasks T1 to T8 are allocated to processors Proc. 1, Proc. 2, …, Proc. N, each with a private memory, connected by an INTERCONNECT; scheduling and frequency selection then place the tasks before the deadline]

Exploiting Voltage Supply

Supply voltage impacts power and performance
  Circuit slowdown: $T = 1/f = K / (V_{dd} - V_t)^a$
  Cubic power savings: $P = C_{eff} \cdot V_{dd}^2 \cdot f$

Just-in-time computation
  Stretch execution time up to the max tolerable

[Figure: power versus available time, comparing fixed voltage + shutdown against variable voltage]
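The two relations can be combined numerically to see the trade-off between speed and energy. The sketch below uses the formulas as stated on the slide; the technology constants (threshold voltage, K, exponent a, effective capacitance) and the workload are hypothetical values chosen only to give round numbers.

```python
def max_frequency(vdd, vt, k, a):
    """Invert T = 1/f = K / (Vdd - Vt)^a to get the highest usable frequency."""
    return (vdd - vt) ** a / k

def dynamic_power(c_eff, vdd, f):
    """P = Ceff * Vdd^2 * f."""
    return c_eff * vdd ** 2 * f

# Hypothetical constants, not taken from the slides.
VT, K, A, C_EFF = 0.4, 3.2e-9, 2.0, 1e-9
CYCLES = 1_000_000                      # fixed amount of work

results = {}
for vdd in (1.2, 0.8):
    f = max_frequency(vdd, VT, K, A)
    p = dynamic_power(C_EFF, vdd, f)
    results[vdd] = {"f": f, "power": p,
                    "time": CYCLES / f, "energy": p * CYCLES / f}

power_ratio = results[0.8]["power"] / results[1.2]["power"]
energy_ratio = results[0.8]["energy"] / results[1.2]["energy"]
time_ratio = results[0.8]["time"] / results[1.2]["time"]
```

Under these assumptions, lowering the supply from 1.2 V to 0.8 V makes the same work take four times longer while the energy drops by a factor of $(0.8/1.2)^2$; this is the slack that just-in-time computation exploits when the deadline allows it.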

Scheduling & Voltage Scaling

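Mapping and scheduling: given (fastest freq.)
Energy/speed trade-offs: varying the voltages (Vdd, Vbs) of the CPU
Different voltages: different frequencies

[Figure: tasks τ1, τ2, τ3 scheduled on a time axis before the deadline, first at the fastest frequency and then stretched to run at frequencies f1, f2, f3]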

Target architecture - 2

Homogeneous computation tiles:
  ARM cores (including instruction and data caches)
  Tightly coupled software-controlled scratch-pad memories (SPM)
AMBA AHB
DMA engine
RTEMS OS
Technology-homogeneous (0.13um) industrial power models (ST)
Variable voltage/frequency cores with discrete (Vdd, f) pairs
  Frequency dividers scale down the baseline 200 MHz system clock
Cores use non-cacheable shared memory to communicate
Semaphore and interrupt facilities are used for synchronization
Private on-chip memory to store data

[Figure: SystemC platform model. Tiles, each with a synchronization module and a private memory, are connected through the AMBA AHB INTERCONNECT to the shared memory and an interrupt slave; a CLOCK TREE GENERATOR with programmable registers supplies CLOCK 1, CLOCK 2, CLOCK 3, …, CLOCK N to the tiles and the interconnect clock]

A task graph represents:
  A group of tasks T
  Task dependencies
  Execution times expressed in clock cycles: WCN(Ti)
  Communication times (writes and reads) expressed as: WCN(WTiTj) and WCN(RTiTj)
These values can be back-annotated from functional simulation

Application model

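[Figure: task graph with tasks T1 to T6. Nodes are annotated with WCN(T1) to WCN(T6); arcs are annotated with write and read costs such as WCN(WT1T2), WCN(RT1T2), WCN(WT1T3), WCN(RT1T3), WCN(WT2T4), WCN(RT2T4)]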
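Because execution and communication costs are given in clock cycles, their duration depends on the frequency selected for each core. The sketch below converts cycle counts to time, assuming the 200 MHz baseline clock with integer dividers from the target architecture and assuming that writes and reads run at the frequency of the task that issues them. The cycle counts and function names are hypothetical.

```python
BASE_CLOCK_HZ = 200_000_000     # baseline 200 MHz system clock

def seconds(wcn, divider):
    """Time to execute wcn clock cycles on a core clocked at BASE_CLOCK_HZ / divider."""
    return wcn * divider / BASE_CLOCK_HZ

def edge_latency(wcn_src, wcn_write, wcn_read, wcn_dst, div_src, div_dst):
    """Worst-case time for Ti -> Tj on different cores: Ti computes and writes at
    its own frequency, then Tj reads and computes at its own frequency."""
    return (seconds(wcn_src + wcn_write, div_src)
            + seconds(wcn_read + wcn_dst, div_dst))

# Hypothetical cycle counts for WCN(T1), WCN(WT1T2), WCN(RT1T2), WCN(T2).
latency = edge_latency(1_000_000, 50_000, 50_000, 400_000, div_src=1, div_dst=2)
```

Here T1 runs at 200 MHz and T2 at 100 MHz, giving 5.25 ms plus 4.5 ms; slowing a core lengthens every cycle-based quantity charged to it, which is what the optimizer trades against energy.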

Efficient Application Development Support

Optimization tools generally rely on many simplifying assumptions. Neglecting these assumptions in the software implementation can:
  generate unpredictable and undesired system-level interactions;
  make the overall system error-prone.

We propose a complete framework to help programmers with the software implementation:
  a generic customizable application template (OFFLINE SUPPORT);
  a set of high-level APIs (ONLINE SUPPORT).

The main goals of our development framework are:
  exact and reliable execution of the application after the optimization step;
  guarantees of high performance and constraint satisfaction.

Customizable Application Template

Starting from a high-level task and data flow graph, software developers can easily and quickly build their application infrastructure.
Programmers can intuitively translate the high-level representation into C code using our facilities and library.

Users can specify:
  the number of tasks included in the target application;
  their nature (e.g. branch, fork, or-node, and-node);
  their precedence constraints (e.g. due to data communication);
thus quickly drawing the application's CTG schema.

Programmers can focus on the functionality of the tasks: the main effort goes into the more specific and critical sections of the application.

OS-level and Task-level APIs

Users can easily reproduce optimizer solutions, thus:
  Indirectly neglecting the optimizer’s abstractions:
    Task model;
    Communication model;
    OS overheads.
  Obtaining the needed application constraint satisfaction.

Programmers can allocate to the right hardware resources:
  Tasks;
  Program data;
  Queues.

Scheduling support APIs:
  Frequency and voltage selection;

Communication issues:
  Shared queues;
  Semaphores;
  Interrupts.

//Node Behaviour: 0 AND ; 1 OR; 2 FORK; 3 BRANCH
uint node_behaviour[TASK_NUMBER] = {2,3,3,..};

#define N_CPU 2
uint task_on_core[TASK_NUMBER] = {1,1,2,1};
int schedule_on_core[N_CPU][TASK_NUMBER] = {{1,2,4,8}..};

uint queue_consumer [..] [..] = {
    {0,1,1,0,..},
    {0,0,0,1,1,.},
    {0,0,0,0,0,1,1..},
    {0,0,0,0,..}..};

//Node Type: 0 NORMAL; 1 BRANCH ; 2 STOCHASTIC
uint node_type[TASK_NUMBER] = {1,2,2,1,..};

Example

Number of nodes: 12
Graph of activities
Node type: Normal, Branch, Conditional, Terminator
Node behaviour: Or, And, Fork, Branch
Number of CPUs: 2
Task allocation
Task scheduling
Arc priorities
Freq. & voltage

[Figure: example task graph with a deadline; node labels include T4 to T10, C4 to C7, N8 to N10 and B3, with two branch nodes and arcs a3 to a12]

#define TASK_NUMBER 12

Queue ordering optimization

Communication ordering affects system performances

[Figure: tasks on CPU1 and CPU2 exchanging data through queues (T5, T6, …), shown for two different communication orderings]

Synchronization among tasks

Non-blocking semaphores

[Figure: T2 on Proc. 1 sends C2 to T4 on Proc. 2, which also runs T3; T4 is suspended while it waits for the data and is re-activated once it arrives]

Logic Based Benders Decomposition

[Figure: the allocation and frequency assignment stage (INTEGER PROGRAMMING; objective function: communication cost and energy consumption; memory constraints) passes a valid allocation to the scheduling stage (CONSTRAINT PROGRAMMING; real-time constraint), which returns a no-good as a linear constraint]

Decomposes the problem into 2 sub-problems:
  Allocation and assignment of frequency settings → IP
    Objective function: minimizing energy consumption during execution and communication of tasks
  Scheduling → CP
    Objective function: minimizing energy consumption during frequency switching

Solver Performance

Hundreds of decision variables
Much beyond ILP solver or CP solver capability

Allocation problem model

Decision variables:
  $X_{tfp} = 1$ if task $t$ executes on processor $p$ at frequency $f$
  $W_{ijfp} = 1$ if tasks $i$ and $j$ run on different cores and task $i$, on core $p$, writes data to $j$ at frequency $f$
  $R_{ijfp} = 1$ if tasks $i$ and $j$ run on different cores and task $j$, on core $p$, reads data from $i$ at frequency $f$

Constraints:
  Each task can execute on only one processor at one frequency:
  $\sum_p \sum_f X_{tfp} = 1 \quad \forall t$
  Each communication between tasks can execute only once, and one write corresponds to one read:
  $\sum_p \sum_f W_{ijfp} \le 1 \quad \forall (i, j)$
  $\sum_p \sum_f W_{ijfp} - \sum_p \sum_f R_{ijfp} = 0 \quad \forall (i, j)$

The objective function: minimize energy consumption associated with task execution and communication

$OF = En_{Comp} + En_{Read} + En_{Write}$

  $En_{Comp}$: computation energy for all tasks in the system
  $En_{Read}$: communication energy for reads from shared memory; reads are carried out at the same frequency as the task
  $En_{Write}$: communication energy for writes to shared memory; writes are carried out at the same frequency as the task

[Equations: each energy term is a sum over tasks, frequencies and processors of the decision variables ($X_{tfp}$, $R_{ijfp}$, $W_{ijfp}$) weighted by the corresponding cycle counts WCN and per-cycle energies E]

Scheduling problem model

Five phases behaviour: INPUT = input data reading; EXEC = computation activity; OUTPUT = output data writing.
Atomic activities

[Figure: a task as a sequence of INPUT, EXEC and OUTPUT activities, with reading phase, writing phase, fork and join]

The objective function: minimize energy consumption associated with frequency switching

•Processors are modelled as unary resources
•Bus is modelled as an additive resource

Duration of task i is now fixed since the mode is fixed. Precedence between task i and task j:

Tasks running on the same processor at the same frequency:
$Start_i + duration_i \le Start_j$

Tasks running on the same processor at different frequencies:
$Start_i + duration_i + T \le Start_j$

Tasks running on different processors:
$Start_i + duration_i + dWrite + dRead \le Start_j$

Application Development Methodology

[Figure: development flow. CTG characterization phase: the simulator produces application profiles. Optimization phase: the optimizer produces the optimal SW application implementation. Application development support then leads to platform execution]

Validation of optimizer solutions: Throughput

MAX error lower than 10%
AVG error equal to 4.51%, with standard deviation of 1.94
All the deadline constraints are satisfied

[Figure: the optimizer's optimal allocation and schedule is validated on the virtual platform; histogram of the throughput difference (%) over 250 instances, on an axis from -5% to 11%]

Validation of optimizer solutions: Power

MAX error lower than 10%
AVG error equal to 4.80%, with standard deviation of 1.71

[Figure: the optimizer's optimal allocation and schedule is validated on the virtual platform; histogram of the energy consumption difference (%) over 250 instances, on an axis from -5% to 11%]

GSM Encoder

Throughput required: 1 frame/10 ms
Platform: 2 processors and 4 possible frequency and voltage settings

Task graph:
  10 computational tasks
  15 communication tasks

Without optimizations: 50.9 μJ
With optimizations: 17.1 μJ (-66.4%)

Summary & future work

Energy-optimal task mapping
  Strong optimization engine (complete)
  Programmer support (design & exec time)
  Validation: accuracy & optimality

Future work
  Conditional task graphs
  Dealing with multiple use cases
  Variable execution times
  Aggressive communication scheduling
