Nicolescu/Model-Based Design for Embedded Systems 67842_C003 Finals Page 66 2009-10-13
66 Model-Based Design for Embedded Systems
in the application binary. Depending on the current cache state and the
execution history, cache misses may occur at different points in time. How-
ever, formal methods are able to identify for each basic block the maximum
number of cache misses that may occur during the execution [46]. The con-
trol flow graph can be annotated with this information, making the longest
path analyses feasible again.
Depending on the actual system configuration, the upper bound on the
number of transactions per task execution may not be sufficiently accurate.
In a formal model, this could translate into an assumed burst of requests
that may not occur in practice. This can be addressed with a more detailed
analysis of the task control flow, as is done in [1,39], which provides bounds
on the minimum distances between any n requests of an activation of that
task. This pattern will then repeat with each task activation.
This procedure makes it possible to conservatively derive the shared resource
request bound functions η̃⁺τ(w) and η̃⁻τ(w) that represent the transaction traf-
fic that each task τ in the system can produce within a given time window
of size w. Requesting tasks that share the same processor may be executed in
alternation, resulting in a combined request traffic for the complete proces-
sor. This again can be expressed as an event model. For example, a straight-
forward approach is to approximate the processor’s request event model (in
a given time window) with the aggregation of the request event models of
each individual task executing on that processor. Obviously, this is an over-
estimation, as the tasks will not be executed at the same time, but rather the
scheduler will assign the processor exclusively. The resulting requests will
be separated by the intermediate executions, which can be captured in the
joint shared resource request bound by a piecewise assembly from the ele-
mentary streams [39].
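As an illustration of the aggregation step, the following sketch sums per-task request bound functions into a processor-level bound. This is the simple over-approximation described above, not the tighter piecewise assembly of [39]; the periodic parameterization of η⁺ and all task parameters are assumptions for the example.

```python
# Hypothetical sketch: aggregating per-task request bound functions
# eta_plus(w) into a processor-level request bound. The periodic/burst
# parameterization below is an illustrative assumption.

def make_eta_plus(period, burst=1):
    """Upper request bound of a task issuing `burst` requests every
    `period` time units: max number of requests in any window of size w."""
    def eta_plus(w):
        if w <= 0:
            return 0
        # ceil(w / period) periods can start within the window
        return burst * (-(-w // period))
    return eta_plus

def processor_eta_plus(task_bounds):
    """Conservative processor-level request bound: the sum of the
    per-task bounds, ignoring that the scheduler serializes the tasks."""
    def eta_plus(w):
        return sum(eta(w) for eta in task_bounds)
    return eta_plus

tasks = [make_eta_plus(period=10, burst=2), make_eta_plus(period=25, burst=1)]
cpu_eta = processor_eta_plus(tasks)
```

As noted above, this aggregation overestimates the combined traffic, since the scheduler never executes two tasks at the same time.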
3.3.2 Response Time Analysis in the Presence of Shared Memory Accesses
Memory access delays may be treated differently by various processor imple-
mentations. Many processors, and some of the most commonly used, allow
tasks to perform coprocessor or memory accesses by offering a multi-cycle
operation that stalls the entire processor until the transaction has been pro-
cessed by the system [44]. In other cases, a set of hardware threads may allow
a quick context switch to another thread that is ready, effectively
keeping the processor utilized (e.g., [17]). While this behavior usually has
a beneficial effect on the average throughput of a system, multithreading
requires caution in priority-based systems with reactive or control applica-
tions. In this case, the worst-case response time of even high-priority tasks
may actually increase [38].
The integration of dynamic memory access delays into the real-time anal-
ysis will in the following be performed for a processor with priority-based
preemptive scheduling that is stalled during memory accesses. In such a
system, a task’s worst-case response time is determined by the task’s worst-
case execution time plus the maximum amount of time the task can be kept
from executing because of preemptions by higher-priority tasks and blocking
by lower-priority tasks. A task that performs memory accesses is addition-
ally delayed when waiting for the arrival of requested data. Furthermore,
preemption times are increased, as the remote memory accesses also cause
high-priority tasks to execute longer.
A possible runtime schedule is depicted in Figure 3.5. In the case where
both tasks execute in the local memory (Scenario 3.5a), the low-priority task
is kept from executing by three invocations of the high-priority tasks. Local
memory accesses are not explicitly shown, as they can be considered to be
part of the execution time. When both tasks access the same remote mem-
ory (Scenario 3.5b), the finishing time of the lower-priority task increases,
because it itself fetches data from the remote memory, and also because of
the prolonged preemptions by the higher-priority task (as its request also
stalls the processor). The execution of the low-priority task in the example
is now stretched such that it suffers from an additional preemption of the
other task. Finally, Scenario 3.5c shows the effect of a task on another core
CPUb that is also accessing the same shared memory, in this case, periodi-
cally. Whenever the memory is also used by a task on CPUb, CPUa is stalled
for a longer time, again increasing the task response times, possibly leading
to the violation of a given deadline. As the busy wait adds to the execution
time of a task, the total processor load increases, possibly making the overall
system unschedulable.

FIGURE 3.5
Tasks on different processors accessing a shared memory. (a and b) Single
processor case and (c) conflicts from another CPU.
On the basis of these observations, a response time equation can be
derived for the example scheduler. The response time is derived from
the following:
• The core execution times of all tasks mapped to the processor, and their
activation event models.
• The increased blocking time due to the resources being stalled during
memory accesses. (This is not shown in the example.)
• The aggregate delay caused by the memory accesses that is a function
of the memory accesses of a specific task and its higher-priority tasks.
This is investigated in Section 3.3.3.
Variations of such a response time analysis have been presented for
single- and multithreaded static-priority preemptive scheduling [38], as well
as for round-robin scheduling [41]. Other scheduling policies for which clas-
sical real-time analysis is available can be straightforwardly extended to
include memory delays by adding a term that represents the aggregate
busy time due to memory accesses.
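A minimal sketch of such an extension, assuming a standard static-priority response-time fixed-point iteration with an added aggregate memory busy-time term (the exact formulations of [38,41] differ; all parameters are hypothetical):

```python
# Illustrative sketch, not the cited analyses: classical static-priority
# response-time iteration extended by an aggregate memory busy-time term.

import math

def response_time(C, B, higher_prio, mem_busy, limit=10000):
    """C: core WCET of the task under analysis; B: blocking by
    lower-priority tasks; higher_prio: list of (period, wcet) pairs of
    higher-priority tasks; mem_busy(R): aggregate memory delay incurred
    within a busy window of length R."""
    R = C + B
    for _ in range(limit):
        interference = sum(math.ceil(R / T) * c for T, c in higher_prio)
        R_next = C + B + interference + mem_busy(R)
        if R_next == R:
            return R  # fixed point reached: worst-case response time bound
        R = R_next
    raise RuntimeError("no convergence: task set may be unschedulable")
```

The `mem_busy` term is the aggregate busy time investigated in Section 3.3.3; passing `lambda R: 0` recovers the classical analysis without memory delays.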
3.3.3 Deriving Aggregate Busy Time
Deriving the timing of many memory accesses has recently become an
important topic in real-time research. Previously, the worst-case timing of
individual events was the main concern. Technically, a sufficient solution
to find the delay that a set of many events may experience is to derive
the single worst-case load scenario and assume it for every access. How-
ever, not every memory request will experience a worst-case system state,
such as worst-case time wheel positions in the time division multiple access
(TDMA) schedules, or transient overloads in priority-based components. For
example, the task on CPUb in Figure 3.5 will periodically access the shared
memory, and, as a consequence, disturb the accesses by the two tasks on
CPUa. A “worst-case memory access” will experience this delay, but of all
accesses from CPUb, this happens maximally three times in this example.
Thus, accounting for this interference for every single memory access leads
to very pessimistic results, which has previously prevented the use of con-
servative methods in this context.
The key idea is instead to consider all requests that are processed during
the lifetime of a task jointly. We therefore introduce the worst-case accumu-
lated busy time, defined as the total amount of time during which at least one
request has been issued but is not yet finished. Multiple requests in a certain
amount of time can in total only be delayed by a certain amount of interference,
which is expressed by the aggregate busy time.
This aggregate busy time can be efficiently calculated (e.g., for a shared
bus): a set of requests is issued from different processors that may interfere
with each other. The exact individual request times are unknown and their
actual latency is highly dynamic. Extracting detailed timing information
(e.g., when a specific cache miss occurs) is virtually impossible and consid-
ering such details in a conservative analysis yields exponential complexity.
Consequently, we disregard such details and focus on bounding the aggre-
gate busy time. Given a certain level of dynamism in the system, this con-
sideration will not result in excessive overestimations. Interestingly, even in
multithreaded multicore architectures, the conservatism is moderate, sum-
ming up to less than a total of 25% of the overestimated response time, as
shown in practical experiments [42].
Without bus access prioritization, it has to be assumed that every transaction
issued by any processor during the lifetime of a task activation i may disturb
the transactions issued by i. The interference is then usually given by the
transactions issued by the other concurrently active tasks on the other
processors, as well as by the tasks on the same processor, as their requests
are treated on a first-come-first-served basis. Interested readers are referred
to [40] for more details on the calculation of aggregate memory access
latencies.
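Under these FCFS assumptions, a deliberately coarse upper bound on the aggregate busy time can be sketched as follows (an illustration, not the analysis of [40]; the uniform per-transaction access time is an assumption):

```python
# Coarse illustrative bound: with FCFS arbitration and no prioritization,
# every interfering transaction that other processors can issue within
# the task's lifetime may delay the task's own requests.

def aggregate_busy_time(own_requests, access_time, interferer_bounds, window):
    """own_requests: number of accesses issued by the task itself;
    access_time: assumed uniform service time per transaction;
    interferer_bounds: upper request bounds eta_plus(w) of the other
    processors; window: length of the task's lifetime."""
    interference = sum(eta(window) for eta in interferer_bounds)
    return (own_requests + interference) * access_time
```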
If a memory controller with access prioritization is utilized, this can be
considered very efficiently. For example, all requests from a certain processor
may be prioritized over those of another. Then, the interference imposed by all lower-priority
requests equals zero. Additionally, a small blocking factor of one elementary
memory access time is required, in order to model the time before a transac-
tion may be aborted for the benefit of a higher-priority request.
The compositional analysis approach of Section 3.2, used together with
the methods of Section 3.3, now delivers a complete framework for the
performance analysis of heterogeneous multiprocessor systems with shared
memories. The following section turns to detailed modeling of inter-
processor communication with the help of HEMs.
3.4 Hierarchical Communication
As explained in Section 3.2, traditional compositional analysis models bus
communication by a simple communication task that is directly activated by
the sending task, and which directly activates the receiving task. Figure 3.6
shows a simple example system that uses this model for communication,
where each output event of the sending tasks, Ta and Tb, triggers the
transmission of one message over the bus.
However, the modern communication stacks employed in today’s
embedded control units (ECUs), for example, in the automotive domain,
make this abstraction inadequate. Depending on the configuration of the
communication layer, the output events (called signals here) may or may
FIGURE 3.6
Traditional model. (Tasks Ta and Tb on ECU1 send over the bus channels C1
and C2 to tasks Tc and Td on ECU2.)
FIGURE 3.7
Communication via ComLayer. (The output streams ESa and ESb of tasks Ta
and Tb on ECU1 are combined into the hierarchical stream HESa,b of the
communication task C on the bus, feeding tasks Tc and Td on ECU2.)
not directly trigger the transmissions of messages (called frames here).
For instance, AUTOSAR [2] defines a detailed API for the communication
stack, including several frame transmission modes (direct, periodic, mixed,
or none) and signal transfer properties (triggered or pending) with key
influences on communication timings. Hence, the transmission timings of
messages over the bus do not have to be directly connected to the output
behaviors of the sending tasks anymore, but they may even be completely
independent of the task’s output behavior (e.g., sending several output
signals in one message).
In the example shown in Figure 3.7, the tasks Ta and Tb produce output
signals that are transmitted over the bus to the tasks Tc and Td. The send-
. The send-
ing tasks write their output data into registers provided by the communi-
cation layer, which is responsible for packing the data into messages, called
frames here, and triggering the transmission of these frames according to the
signal types and transmission modes. On the receiving side, the frames are
unpacked, which means that the contained signals are again written into dif-
ferent registers for the corresponding receiving task. Using flat event models,
the timings of signal arrivals can only be bounded with a large overestimation.
To adequately consider such effects of modern communication stacks in
the system analysis, two elements must be determined:
1. The activation timings of the frames
2. The timings of signals transmitted within these frames arriving at the
receiving side
To cope with both challenges, we introduce hierarchical event streams
(HESs) modeled by a HEM, which determines the activating function of the
frame and also captures the timings of the signals assigned to that frame, and,
most importantly, defines how the effects on the frame timings influence the
timings of the transmitted signals. The latter makes it possible to unpack the
signals on the receiving side, giving tighter bounds for the activations of
those tasks receiving the signals.
The general idea is that a HES has one outer representation in the form
of an event stream ESouter, and each combined event stream has one inner
representation, also in the form of an event stream ESi, where i denotes the
task to which the event stream corresponds. The relation between the outer
event stream and the inner event streams depends on the hierarchical stream
constructor (HSC) that combined the event streams. Each of the involved
event streams is defined by the functions δ⁻(n) and δ⁺(n) (see Section 3.2.2),
returning the minimum and the maximum distance, respectively, between n
consecutive events.
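For illustration, a common concrete parameterization of such distance functions is the periodic-with-jitter event model; this specific form is an assumption for the example, since the streams in the text may be arbitrary as long as the functions bound them:

```python
# Illustrative sketch: distance functions of a periodic event stream
# with jitter (period P, jitter J). This parameterization is an
# assumption, not the only event model supported by the analysis.

def delta_minus(n, P, J):
    """Minimum distance between n consecutive events (0 for n < 2)."""
    return max(0, (n - 1) * P - J) if n >= 2 else 0

def delta_plus(n, P, J):
    """Maximum distance between n consecutive events (0 for n < 2)."""
    return (n - 1) * P + J if n >= 2 else 0
```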
Figure 3.8 illustrates the structure of the HES at the input of the chan-
nel C of the example shown in Figure 3.7. The HSC combines the output
streams of the tasks Ta and Tb, resulting in the hierarchical input stream of
the communication task C. According to the properties and the configuration
of the communication layer that is modeled by the HSC, the inner and outer
event streams of the HES are calculated. Each event of the outer event stream,
ESouter, represents the sending of one message by the communication layer.
The events of a specific inner event stream, ESa and ESb, model the timings of
only those messages that contain data from the corresponding sending tasks.
The detailed calculations of the inner and outer event streams, considering
the different signal properties and frame transmission modes, are presented
in [37].
For the local scheduling analysis of the bus, only the outer event stream
is relevant. As a result, the best-case response time, Rmin, and the worst-case
response time, Rmax, are obtained.

FIGURE 3.8
Structure of the hierarchical input stream of C. (The HSC combines the event
streams ES′a and ES′b into the inner streams ESa and ESb and the outer
stream ESouter; plots of δ⁻(n) and δ⁺(n) show the distances between all
message releases and between messages containing a new signal from each
task.)

Based on the outer event stream, ESouter, of the hierarchical input stream,
we obtain the outer event stream, ESouter, of the hierarchical output stream
by using the following equations:
δ⁻outer(n) = max{δ⁻outer(n) − Jresp, δ⁻outer(n − 1) + dmin}    (3.2)

δ⁺outer(n) = max{δ⁺outer(n) + Jresp, δ⁺outer(n − 1) + dmin}    (3.3)
In fact, Equations 3.2 and 3.3 are generalizations of the output model calcu-
lation presented in Equation 3.1. As can be seen, two changes have
been made to the message timing. First, the minimum (maximum) distance
between a given number of events decreases (increases) by no more than the
response time jitter, Jresp = Rmax − Rmin. Second, two consecutive events
at the output of the channel are separated by at least a minimum distance,
dmin = Rmin. The resulting event stream, modeled by δ⁻outer(n) and
δ⁺outer(n), becomes the outer stream of the output model.
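Equations 3.2 and 3.3 can be evaluated recursively; the sketch below assumes that the recursive term refers to the already-computed output distances and that the distance for a single event (n = 1) is 0:

```python
# Hedged sketch of Equations 3.2 and 3.3. Assumption: the recursive
# term refers to the output stream's own distances, with value 0 at n = 1.

def outer_output_delta(din_minus, din_plus, J_resp, d_min, n_max):
    """din_minus/din_plus: distance functions of the channel's input
    stream; J_resp = Rmax - Rmin; d_min = Rmin. Returns the output
    stream's minimum/maximum distance values for n = 1..n_max."""
    d_minus = {1: 0}
    d_plus = {1: 0}
    for n in range(2, n_max + 1):
        # Eq. 3.2: jitter shrinks the minimum distance, but consecutive
        # output events stay at least d_min apart.
        d_minus[n] = max(din_minus(n) - J_resp, d_minus[n - 1] + d_min)
        # Eq. 3.3: jitter stretches the maximum distance; the enforced
        # minimum separation can push later events out even further.
        d_plus[n] = max(din_plus(n) + J_resp, d_plus[n - 1] + d_min)
    return d_minus, d_plus
```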
To obtain the inner event streams, ESi, of the hierarchical output stream,
we adapt the inner event streams, ESi, of the hierarchical input stream
according to the changes applied to the outer stream. For the adaptation,
we consider the two changes mentioned above separately. First, consider
that the minimum distance between n messages decreases by Jresp. Then,
the minimum distance between k messages that contain the data of a specific
task also decreases by Jresp. Second, we must consider that two consecutive
messages become separated by a minimum distance dmin. Figure 3.9a
illustrates a sequence of events consisting of two different event types,
a and b. Assume that this event sequence models the message timing, where
the events labeled by a lowercase a correspond to the messages containing
data from task Ta, and the events labeled by a lowercase b correspond to the
messages containing data from task Tb. Figure 3.9b shows how this event
sequence changes when a minimum distance dmin between two consecutive
events is considered. As indicated, the distance between the last two events
of type b further decreases because of the minimum distance. Likewise, the
maximum distance increases because of the minimum distance, dmin, as can
be seen for the first and the second of the events of type b. Based on the
minimum distance, dmin, the maximum possible decrease (increase), Dmax,
in the minimum (maximum) distance between events that can occur because
of the minimum distance can be calculated. Note that, in the case of large
bursts, Dmax can be significantly larger than dmin, since an event can be
delayed by its predecessor event, which itself is delayed by its predecessor,
and so on. More details can be found in [37].

FIGURE 3.9
(a) The event sequence before applying the minimum distance and (b) the
event sequence after considering the minimum distance dmin.
In general, considering the response time jitter, Jresp, and the minimum
distance, dmin, the inner stream of the hierarchical output stream, modeling
messages that contain data from the task Ti, can be modeled by

δ⁻i(n) = max{δ⁻i(n) − Jresp − Dmax, δ⁻i(n − 1) + dmin},

δ⁺i(n) = δ⁺i(n) + Jresp + Dmax
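These adaptation rules can be sketched analogously to the outer-stream case; the assumptions are that the recursive term in the minimum-distance bound refers to the output stream and that the distance for a single event is 0:

```python
# Hedged sketch of the inner-stream adaptation. Assumptions: recursive
# term refers to the output stream's own distances, 0 at n = 1.

def inner_output_deltas(din_minus, din_plus, J_resp, D_max, d_min, n_max):
    """Adapt an inner input stream i: the jitter J_resp and the extra
    displacement D_max (caused by enforcing d_min) widen the distance
    bounds of the messages carrying task i's data."""
    d_minus = {1: 0}
    d_plus = {1: 0}
    for n in range(2, n_max + 1):
        d_minus[n] = max(din_minus(n) - J_resp - D_max,
                         d_minus[n - 1] + d_min)
        d_plus[n] = din_plus(n) + J_resp + D_max
    return d_minus, d_plus
```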
To determine the activation timings of the receiving tasks, Tc and Td, we
now have not only the arrival times of messages, but also the timings of
exactly those messages that contain new data from a certain sending task,
given by the corresponding inner stream. Assuming that the task Tc is only
activated when a new signal from the task Ta arrives, the inner event stream
ESa of the hierarchical output stream of the communication task C can
directly be used as the input stream of the task Tc.
It is also possible to have event streams with multiple hierarchical lay-
ers, for example, when modeling several layers of communication stacks or
communications over networks interconnected by gateways, where several
packets may be combined into some higher-level communication structure.
This can be captured by our HEM by having an inner event stream of a HES
that is the outer event stream of another HEM. For more details on multilevel
hierarchies, refer to [36].
3.5 Scenario-Aware Analysis
Because of the increasing complexity of modern applications, hard real-time
systems are often required to run different scenarios (also called operating
modes) over time. For example, an automotive platform may exclusively exe-
cute either an ESC or a parking-assistant application. While the investigation
of each static scenario can be achieved with classical real-time performance
analysis, timing failures during the transition phase can only be uncovered
with new methods, which consider the transient overload situation during
the transition phase, in which both scenarios can impose load on the
system.
Each scenario is characterized by a specific behavior and is associated
with a specific set of tasks. A scenario change (SC) from one scenario to
another is triggered by a scenario change request (SCR) which may be caused
either by the need to change the system functionality over time or by a system
transition to a specific internal state requiring an SC. Depending on the task
behavior across an SC, three types of tasks are defined:
• Unchanged task: An unchanged task belongs to both task sets of the ini-
tial (old) and the new scenario. It remains unchanged and continues
executing normally after the SCR.
• Completed task: A completed task only belongs to the old scenario task
set. However, to preserve data consistency, completed task jobs acti-
vated before the SC are allowed to complete their execution after the
SCR. Then the task terminates.
• Added task: An added task only belongs to the new scenario task set.
It is initially activated after the SCR. Each added task is assigned an
offset value, φ, that denotes its earliest activation time after the SCR.
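The three task types can be illustrated with a small sketch; the task names and scenario sets below are hypothetical:

```python
# Illustrative classification of tasks across a scenario change (SC);
# the task names and scenario task sets are hypothetical.

OLD_SCENARIO = {"eval1", "eval2", "logger"}
NEW_SCENARIO = {"sig_out", "logger"}

def classify(task, old_tasks, new_tasks):
    """Classify a task relative to a scenario change."""
    if task in old_tasks and task in new_tasks:
        return "unchanged"  # keeps executing normally after the SCR
    if task in old_tasks:
        return "completed"  # pending jobs finish, then the task terminates
    return "added"          # first activated at its offset after the SCR
```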
During an SC, executions of completed, unchanged, and added tasks may
interfere with one another, leading to a transient overload on the resource.
Since the timing requirements in the system have to be met at any time dur-
ing the system execution, it is necessary to verify if task deadlines could be
missed because of an SC.
Methods analyzing the timing behavior across an SC under static-priority
preemptive scheduling already exist [32,45,47]. However, they are limited
to independent tasks mapped on single resources. Under such an assump-
tion, the worst-case response time for an SC for a given task under analysis
is proved to be obtained within the busy window during which the SCR
occurs, called the transition busy window. However, these approaches can-
not be applied to distributed systems because of the so-called echo effect.
The echo effect is explained in the following section using the system exam-
ple in Figure 3.11.
3.5.1 Echo Effect
The system used in the experiments of Section 3.8 (depicted in Figure 3.11)
represents a hypothetical automotive system consisting of two IP compo-
nents, four ECUs, and one multicore ECU connected via a CAN bus. The
system is assumed to run two mutually exclusive applications: an ESP
application (Sens1, Sens2 → eval1, eval2) and a parking-assistant application
(Sens3 → SigOut). A detailed system description can be found in Section 3.8.
Let us focus on what happens on the CAN bus when the ESP application
is deactivated (Scenario 1) and the parking-assistant application becomes
active (Scenario 2). Depending on which application a communication task
belongs to, we can determine the following task types on the bus when an SC
occurs from Scenario 1 to Scenario 2: C1 and C5 are unchanged communica-
tion tasks, C3 and C4 are added communication tasks, and C2 is a completed
communication task. Furthermore, we assume the following priority order-
ing on the bus: C1 > C2 > C3 > C4 > C5.
When an SC occurs from Scenario 1 to Scenario 2, the added communica-
tion task C3 is activated by events sent by the task mon3. However, C3 may
have to wait until the prior completed communication task C2 finishes exe-
cuting before being deactivated. This may lead to a burst of events waiting at
the input of C3 that in turn may lead to a burst of events produced at its out-
put. This burst of events is then propagated through the task ctrl3 on ECU4
to the input of C4. In between, this burst of events may have been amplified
because of scheduling effects on ECU4 (the task ctrl3 might have to wait
until calc finishes executing). By the time this burst of events arrives at C4's
input, which is a consequence of the SC on the bus, the transition busy
window might already be finished on the bus. The effect of the transient
overload because of the SC on the bus may therefore not be limited to the
transition busy window but may recur. We call this recurrent effect the echo effect.
As a consequence of the echo effect, for the worst-case response time calcu-
lation across the SC of the low-priority unchanged communication task C5,
it is not sufficient to consider only its activations within the transition busy
window. Rather, the activations within the successive busy windows need to
be considered.
3.5.2 Compositional Scenario-Aware Analysis
The previous example illustrates how difficult it is to predict the effect of
the recurrent transient overload after an SC in a distributed system. As a
consequence of this unpredictability, it turns out to be very difficult to describe
the event timings at task outputs and therefore to describe the event tim-
ings at the inputs of the connected tasks, needed for the response time cal-
culation across the SC. To overcome this problem, we need to describe the
event timing at each task output, in a way that covers all its possible tim-
ing behaviors, even those resulting from the echo effect that might occur
after an SC.
This calculation is performed by extending the compositional methodol-
ogy presented in Section 3.2 as follows. As usual, all external event models at
the system inputs are propagated along the system paths until an initial acti-
vating event model is available at each task input. Then, global system anal-
ysis is performed in the following way. In the first phase, two task response
time calculations are performed on each resource. First, for each task we cal-
culate its worst-case response time during the transition busy window. This
calculation is described in detail in [13]. Additionally, for each unchanged or
added task, using the classical analysis techniques we calculate its worst-case
response times assuming the exclusive execution of the new scenario. Then,
for each task, a response time interval is built into which all of its observable
response times fall; its upper bound is the maximum of the task's response
time during the transition busy window and its response time assuming the
exclusive execution of the new scenario. The tasks' best-case response times
are given by their minimum execution times in all scenarios.
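The construction of this response time interval can be sketched as follows; the naming convention that `rt_new` is `None` for completed tasks (which have no new-scenario bound) is an assumption for the example:

```python
# Sketch of the response time interval built in the first analysis phase.
# Assumption: rt_new is None for completed tasks, which do not execute
# in the new scenario and thus have no steady-state bound there.

def response_time_interval(rt_transition, rt_new, bcets):
    """rt_transition: worst case within the transition busy window;
    rt_new: worst case assuming exclusive execution of the new scenario;
    bcets: best-case execution times of the task in all scenarios.
    Returns (lower, upper) covering all observable response times."""
    upper = rt_transition if rt_new is None else max(rt_transition, rt_new)
    return (min(bcets), upper)
```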