Principles of EMSEC - Side Channel Analysis: How we side-step encryption

SvenSvensonov · Jan 19, 2015

This is a little something I put together to explain a certain concept called Side Channel Analysis. Once again it will not provide any classified data, but it does go into additional and further depth and technicality.This is an advanced article with a lot of math that proves the concept of Side Channel Analysis - an EMSEC activity.

This is a big document so additional sections will be added as I finish with them. If something doesn't make sense, or you want to check a piece of information feel free to ask me for clarification... or you can wait until the entire document is posted. The final section will have all the references (these will explain some of the name, backup the research and offer additional articles on the subject), but in the meantime I will answer any questions you have as best as I can (the technicals aren't a problem, the legality is).

Most of the matrixes have been omitted as they were attachments and each section had too many. I couldn't keep each post under ten photos/attachments.

SvenSvensonov

Chapter 1 Introduction
1.1 Motivation
Nowadays, security becomes more and more important in the automotive industry. Robert Bosch GmbH, Corporate Research supplies the necessary know-how and provides solutions for future applications in the business units.

Many functions in a modern car are realized with the help of software applications and the underlying electronics. This means in particular, that a growing part of the components in a car are realized through an easy to copy and manipulate software. Due to this fact, there is a growing need for an effective protection of these software applications to save the intellectual property on the one hand and to disclose manipulation, e.g., by tuning, on the other hand.

There are up to 80 electronic control units (ECU) built into a modern upper class car. These control units come from different suppliers, communicate over potentially insecure networks with each other and with the outside world respectively, e.g., with diagnostic equipment or in the near future Car2Car and Car2Infrastructure. As a consequence, these control units should authenticate each other and the communication data should be cryptographically protected against manipulation.

For several use cases in the automotive domain, e.g., tuning detection and sensor protection, cryptography constitutes a considerable part of the overall-security architecture. Due to their performance and compactness in implementations, symmetric encryption primitives are of special interest. These primitives can be used for attaining the following security goals:

• confidentiality: encryption of data,• authenticity: challenge-response protocols,

• integrity: message authentication codes.

The State-of-the-Art symmetric encryption scheme is the Advanced Encryption Standard

(AES), [FIP01]. This block cipher with key sizes of 128, 192, and 256bits and a block size of 128bits is remarkably fast and can be implemented very efficiently on a wide variety of platforms. It replaces the old Data Encryption Standard DES or, to be more precise, its variant Triple-DES.

A lot of proprietary mechanisms in the automotive environment are getting replaced by standardized security mechanisms. An example is here the changeover from the proprietary KeeLoq System1, which is used for keyless entry in cars and garage doors, to modern challenge response protocols based on AES2.

1.2 Adversary Model and Assumptions
Ross Anderson and Markus Kuhn give a good classification of adversaries’ knowledge, funding, and skills in [AK98]. They define the following three classes of attackers:

• Class I (clever outsiders): They are very intelligent but may have insufficient knowledge of the system. They may have access to only moderately sophisticated equipment. They often try to take advantage of an existing weakness in the system, rather than try to create one.

• Class II (knowledgeable insiders): They have substantial specialized technical education and experience. They have varying degrees of understanding of parts of the system but potential access to most of it. They have highly sophisticated tools and instruments for analysis.

• Class III (funded organizations): They are able to assemble teams of specialists with related and complementary skills backed by great funding resources. They can do in-depth analysis of the system, design sophisticated attacks, and use the most advanced analysis tools. They may use Class II Adversaries as part of the attack team.

We will focus on knowledgeable insiders, i.e., the cryptographic algorithm itself is known to the attack agent (no security by obscurity!). The attack agent may have access to equipment like oscilloscopes etc. that can be found in university labs. The specific implementation, e.g., the source code or an open sample device, is assumed to be not available to the attacker.

In terms of side-channel analysis resistance3, this means that our implementation should be resistant against:

• timing attacks, and

• first-order differential power analysis (DPA) attacks.

Resistance against timing attacks implies that we assume an adversary having access to an exact simulation of the runtime of the implementation. Note that we do not require resistance against the new microarchitectural side-channel attacks like cache and branch predication attacks [AKS07, Ber05, OST06, NS06]. This does not mean that these attacks may not be relevant, it just means that we think that an attack agent has no or only little chance to install a spy process on the target processor and to measure the timing signals with the necessary precision.4

We assume that table lookups (TLU) have a constant runtime. For achieving timing analysis resistance, we need a key-independent runtime of the implementation. Therefore we will consider the effects of the 4-stage pipeline of the target processor, since they have a huge influence on the runtime.

Resistance against first-order differential power analysis (DPA) attacks means that neither a standard difference of means nor a correlation attack against the implementation will succeed

with a sufficient number of measurement traces. We assume that the attacker has no access to a training device where he can input, e.g., random masks or keys, i.e., we effectively rule out template attacks [GLRP06, APSQ06, CRR02].

The goal is not to make the implementation secure against second or higher order attacks, i.e., the focus is put on simple masking schemes. Of course, very simple second order attacks,

e.g., by using the same mask twice directly on the bus, have to be prevented.

1.3 Scope
Robert Bosch GmbH, Corporate Sector Research and Engineering investigates the suitability of side-channel resistant implementations of symmetric key primitives under the constraints of the embedded domain. In this thesis, we will survey and investigate several masking countermeasures described in the literature to protect an implementation of the Advanced Encryption Standard (AES) against first-order DPA attacks in terms of security and runtime.

A selected countermeasure will be implemented in C (and partly in assembly language) on a typical 32-bit automotive processor used in Embedded Control Units (ECUs), e.g., TriCore. The implementation shall be secured against timing attacks as well. Of special interest is an efficient implementation taking into account the requirements of embedded devices with respect to memory and performance. Implementations of AES without countermeasures against DPA are provided by Robert Bosch CR/AEA. The tasks, that are to be carried out are:

• to understand and to survey the relevant AES implementation options,

• to understand and to survey the most important side-channel attacks (DPA: differenceof means + correlation power attack), timing attack,

• to survey countermeasures (it is not the goal to invent new countermeasures, just todescribe what is available in the literature),

• to describe the different masking schemes in their memory usage (RAM and ROM),performance, and number of random bits,

• to select and to implement the most appropriate or a combination of methods,

• to implement AES efficiently and securely with a key length of 128bits on a TriCore 17965 processor in C and, maybe, parts of it in Assembler.

All countermeasures considered in this master thesis will be algorithmic, i.e., softwarebased, countermeasures. Although many hardware countermeasures have been discussed in the literature, e.g., special logic styles such as Masked Dual Rail Pre-Charge Logic (MDPL), increasing the noise by special noise generators, etc., we will only discuss these countermeasures if they can be applied to software as well. However, we will survey a number of countermeasures which have been originally proposed for hardware implementations e.g., masking in tower fields and investigate if they can be applied efficiently in software implementations.

For the masking countermeasures, we need a certain amount of random data. The generation of these random values is not part of this thesis. We assume that all random values are given to our program via a pointer to an array with sufficient numbers of the necessary entropy. For simulation the stream cipher Grain-80 [HJM05] is used as pseudo random number generator (RNG).

We will present a side-channel resistant AES implementation for the TriCore architecture, which is not available up to now. Therefore we will combine side-channel analysis (SCA) countermeasures.

In Chapter 2 we will give a short overview of AES and the existing implementation options. We will introduce some mathematical basics which are important to understand the upcoming implementations. We introduce a processor model, which will help us to select the right implementation option for our needs.

Chapter 3 gives an overview about known side-channels and possible attacks on it. The focus lies on timing analysis because this form of side-channel will be analyzed at the end of this thesis.

Chapter 4 describes countermeasures to prevent side-channel attacks. We analyze different masking schemes in terms of size and runtime and combine them with other countermeasures like randomization.

Chapter 5 is dedicated to the TriCore. We will describe the architecture and the assembler instructions which are used for the implementation.

Chapter 6 is focused on the implementation. We will discuss improvements for the already available implementations and based on the improvements build a protected AES implementation.

In Chapter 7 we will analyze the implemented AES versions in terms of their timing behavior.

Chapter 2 AES – Advanced Encryption Standard
2.1 Introduction
The Advanced Encryption Standard (AES), also known as Rijndael, is a symmetric block cipher that was developed by the Belgian cryptographers Joan Daemen and Vincent Rijmen. AES encrypts/decrypts data with a block size of 128bits and a key size of 128, 192, or 256bits. AES targets on a small and fast implementation in hardware and software. The standard is described in [FIP01]. Brian Gladman wrote a detailed description of AES in [Gla01] and an updated version of the description in [Gla07]. We will give here a short overview to introduce the notation used in this thesis.

AES is a typical substitution-permutation network (SPN), see [Sti05]. An SPN consists of several rounds, each made up with a substitution and a permutation stage. The substitution stage obscures the relationship between elements of the plaintext and elements of the ciphertext, while the permutation stage makes of the influence of plaintext elements spread over the ciphertext.

For AES, the number of rounds depends on the key size. We focus on a key size of 128bits, which corresponds to ten rounds. The 16byte plaintext P = (p0,p1,...,p15) is internally arranged in a 4 × 4 matrix, the so-called State matrix S. Every element sr,c of the State matrix (r denotes the index of the row and c the index of the column) equals to one byte. In Figure 2.1 we can see the individual transformations which are applied to the State matrix. Figure 2.2 shows how the plaintext P is initially mapped to the State matrix S and then to the ciphertext output, C = (c0,c1,...,c15).

In each round, four transformations are applied to the State matrix S:

1. Byte Substitution (SubBytes), which is a non-linear byte substitution operating independently on each byte (sr,c) of the State matrix, using a substitution table called S-box. This S-box can be constructed from two transformations:

a) an inversion in the finite field GF(28), where the zero is mapped to itself; and

b) a bitwise affine transformation.

2. ShiftRows, where the bytes sr,c in the State matrix are cyclically shifted. The first row is not rotated. The second row is rotated to the left by one position, row three is rotated to the left by two positions and row four is rotated to the left by three positions.

3. MixColumns, where every column of the State matrix is treated as a four-term polynomial and gets multiplied by a fixed 4×4 matrix over GF(28) which is defined in [FIP01].

5

4. Key Addition (AddRoundKey), which adds a 4 × 4 key matrix (128-bit) to the State matrix. This key matrix is called round key.

The final round differs slightly from the first nine rounds. As we see in Figure 2.1 the MixColumns transformation is missing, which is also typical for a substitution-permutation network since the MixColumns transformation adds no cryptographic strength to the algorithm if it is included at the end of the cipher.

Decryption is accomplished by running the algorithm “backward”. The last round key is first XORed with the ciphertext, and then in each round, the inverse linear transformations are applied, namely InvMixColumns, InvShiftRows, followed by the inverse S-box InvSubBytes. Note that for AES additional space for the inverse functions used in decryption is needed, in contrast to DES where encryption and decryption basically consume the same program memory. For example, in AES we need additional 256bytes if the inverse transformation InvSubBytes is implemented as lookup table (LUT).

To supply the algorithm with the required round keys, AES uses a mechanism called key scheduling. The encryption key of 16bytes is expanded to an array with eleven round keys, where each round key has a size of 16bytes. So in total 176bytes are needed for the round keys. To provide non-linearity between the cipher key and the expanded round key, the key schedule uses the SubBytes transformation followed by a cyclic rotation of the bytes and an addition of a round constant. The round constant is defined by a simple recursive rule.

2.2 Mathematical Background
In order to explain the upcoming implementation tricks and the side-channel countermeasures, we have to recall some basics on finite fields. Almost all operations that will follow are in the finite field GF(2), i.e., the bits ’0’ and ’1’, or the extension field GF(28), where GF denotes a Galois Field. A more detailed and general introduction to finite fields can be found in [Kna06].

GF(28) is the unique finite field with 256 elements and the operations ⊕ (addition) and ⊗ (multiplication). Its elements can be represented as polynomials of order 8 with binary coefficients ai ∈ {0,1},i = 0,...,7. An element a ∈ GF(28) can thus be seen as the set of coefficients ai ∈ GF(2) of the polynomial a with function value a(x), i.e.,

{a7a6a5a4a3a2a1a0}2 7→ a(x), (2.1)

7

a(x) = Xaixi = a7x7 + a6x6 + a5x5 + a4x4 + a3x3 + a2x2 + a1x + a0. (2.2)

i=0

A sequence of 8bits, for example {11010100}2, is called a byte. We will also depict byte values by their hexadecimal notation. Hence we can write the value {11010100}2 as {D4}16 which represents the polynomial x7 + x6 + x4 + x2.

Addition. In the following, ⊕ denotes an addition in GF(28), that can be realized using an exclusive OR (XOR). We can add two elements a,b ∈ GF(28) by adding their coefficients modulo 2 element wise.

7 7 7

Xaixi ⊕ Xbixi = X(ai ⊕ bi)xi (2.3) i=0 i=0 i=0

Multiplication. Multiplication in GF(28) is the usual polynomial multiplication, modulus an irreducible polynomial. For AES, the reduction polynomial m with

m(x) = x8 + x4 + x3 + x + 1

is used. The polynomial m can be represented by {11B}16. Note that in GF(28) reduction by m(x) is equivalent to adding m(x).

areduction by(xWe will use) · b(x) modm⊗(mxto indicate the multiplication of two GF(2)(xensures that the result is an element of GF) with a,b ∈ GF(28) and “·” denoting a polynomial multiplication. The8)(2elements, i.e.,8). a(x) ⊗ b(x) :=

We can implement the multiplication by x by a left shift and a subsequent conditional bitwise XOR with {11B}16, if the most significant bit (MSB) bit is set. For example, multiplying {11001000}2 by x gives {110010000}2. The overflow bit is then removed by adding {100011011}2, the modular polynomial, to give the final result of {10001011}2.

This method is called xtimes, see [Gla07]. Multiplication by higher powers of x can be implemented by repeated application of xtimes. Since we can write all elements of GF(28) as a sum of powers of {02}16, we can implement the multiplication by any value by a repeated use of xtimes. To get the result of the multiplication with any input g by the constant value {15}16.

2.3 Implementation Aspects
AES is an industry-wide used block cipher since the National Institute of Standards and Technologies (NIST) adopted it in 2001 from the Rijndael algorithm. It is used for the protection of government documents as well as for electronic commerce and private use. Many different implementations of AES have been proposed and have been optimized for the specific requirements like maximized throughput or minimized circuitry. We will give a short overview of the implementations, in order to get a picture of the tricks we can use for our implementation.

We begin by defining two microcontroller models (8-bit and 32-bit) to estimate the clock cycles for the implementation. Then we present the 8-bit implementation which is proposed in the standard [FIP01] and then look at existing 32-bit software implementations. We introduce the affine transformation, which is part of the SubBytes transformation, because we will need it in a later chapter.

For the current master thesis it is required to implement both encryption and decryption. For reasons of brevity, we will focus on the description of the encryption and give only the necessary hints for understanding the decryption process.

2.3.1 Microcontroller Model
To compare the runtime of an encryption, we define a simple 8-bit processor model and count the clock cycles (cc) according to that model. This model does not fit to all existing 8-bit microcontrollers. Hence, the given clock cycle count is a rough estimate to get an impression of the implementation. We make the following assumptions for byte operations on our processor model:

• load/store from/to RAM takes one clock cycle,

• load/store from/to ROM takes two clock cycles,

• XOR two registers takes one clock cycle,

• shift by one bit takes one clock cycle.

For the 32-bit (word) operations, we expand our processor model, which is defined in Section 2.3.3. We assume, that a load and a store operation on a word takes one clock cycle.

2.3.2 The Key Schedule
The AES implementation has a key size of 128bits, which leads to ten rounds. With the initial AddRoundKey transformation, an entire encryption needs eleven round keys. There are two different options to supply the algorithm with those round keys:

Key Expansion Before the Encryption
1 KeyExpansion ( byte key [16] , byte expkey [ 4 ] [ 4 4] )

2 begin

3 for ( j = 0; j < 4; j++) {

4 for ( i = 0; i < 4; i++)

5 expkey [ i ] [ j ] = key [ j ∗ 4 + i ] ;

6 }

7 for ( j = 4; j < 44; j++) {

8 if ( j % 4 == 0) {

9 expkey [ 0 ] [ j ] = expkey [ 0 ] [ j −4] ^

10 SubBytes(expkey [ 1 ] [ j −1], 0) ^ rc [ j /4];

11 for ( i = 1; i < 4; i++)

12 expkey [ i ] [ j ] = expkey [ i ] [ j −4] ^

13 SubBytes(expkey [( i +1) % 4][ j −1] ,0);

14 }

15 else {

16 for ( i = 0; i < 4; i++)

17 expkey [ i ] [ j ] = expkey [ i ] [ j − 4] ^ expkey [ i ] [ j − 1];

18 }

19 }

20 End

21

On the one hand, the whole round keys can be calculated before the actual encryption. Therefore, the KeyExpansion function is used, which is depicted in Listing 2.1. This function expands the 128-bit cipher key into an expanded key array, with four rows and 44 columns. The size of each element of this array equals to one byte. To store this expanded key array, we need 11·16bytes = 176bytes RAM. The first four columns (16bytes) of the expanded key are filled with the 128-bit cipher key. The next columns recursively depend on the previous ones, with the following cases: If the column index j = 0,...,43 is not a multiple of four, column j is a bitwise XOR of column j−4 and column j−1. Otherwise, column j is a bitwise XOR of column j − 4 and j − 1, with the difference, that the S-box is applied on the four bytes of column j − 1, with an additional XOR of the round constant rc on the first byte of this column. Since there are eleven rounds, also eleven round constants are needed, which are calculated by the following recursion rule in GF(28):

rc[1] = 1, rc[2] = x, rc[n] = x ⊗ rc[n − 1], n = 3,...,11. (2.5)

Key On-The-Fly Calculation
On the other hand, we can calculate the round keys on-the-fly. This alternative method is for environments with limited memory, e.g., smartcards. The calculation works the same way as we described above, but we only store the last calculated round key since we see that the calculations depend only on the columns j −4 to j −1, so we can derive the next one from it. In Listing 2.2 the function NextRoundKey calculates the next round key from the last used round key and the last used round constant rc. The first round key is the cipher key, the first round constant is one.

1 NextRoundKey( byte round_key [16] , byte rc )

2 begin

3 round_key [0] = round_key [0] ^ SubBytes [ round_key [ 13] ] ^ rc ;

4 round_key [1] = round_key [1] ^ SubBytes [ round_key [ 1 4 ] ] ;

5 round_key [2] = round_key [2] ^ SubBytes [ round_key [ 1 5 ] ] ;

6 round_key [3] = round_key [3] ^ SubBytes [ round_key [ 1 2 ] ] ;

7

8 for ( i =4; i <16; i++) {

9 round_key [ i ] = round_key [ i ] ^ round_key [ i −4];

10 }

11 rc = xtime( rc ); /∗ compute new round constant ∗/

12 end

13

If the key is calculated on-the-fly, we have to add the calculation costs to our encryption costs, since the calculation is done during every encryption. By counting the operations from Listing 2.2, 17 XORs, 3 · 16 = 48 load and store operations plus four load operations from ROM (for the S-box lookup) and two shift and an additional addition (XOR) for the xtime operation are necessary to calculate the next round key. Summarizing, we get 72cc for calculating the next round key and in total, the key on-the-fly method costs 10·72cc = 720cc.

2.3.3 8-bit Software Implementation Straight from the Standard
On an 8-bit processor, AES can be programmed by simply implementing the different component transformations which are described in Section 2.1.

The AddRoundKey Transformation
During the AddRoundKey transformation, every byte of the State matrix S is XORed with one byte from the expanded key. For one byte sr,c of the State this takes:

• 1cc to load the byte sr,c from RAM,

• 1cc to load one byte of the expanded key from RAM,

• 1cc to XOR both, and

• 1cc to store the result.

To process one State matrix S, 16 · 4cc = 64cc are needed. Since eleven round keys are added during an AES run, a whole AES en-/decryption needs 11 · 64cc = 704cc for the AddRoundKey transformations.

The ShiftRows Transformation
For ShiftRows the implementation is straightforward. Here the bytes in the last three rows of the State matrix are cyclically shifted. For this operation, the State value is loaded from RAM and must be written back at its new position. We assume that these two operations take two clock cycles per byte of the State matrix. Then it takes 4 · 2cc = 8cc per row and

3·8cc = 24cc per transformation. For a whole AES en-/decryption with ten rounds we need 10 · 24cc = 240cc for the ten ShiftRows transformations.

The SubBytes Transformation
The SubBytes transformation is implemented in the standard as a static lookup table (the S-box) of 256bytes. For decryption, the inverse SubBytes transformation is needed, which results in a second 256-byte lookup table. Both tables are generally located in read-only memory (ROM). The transformation loads one byte from ROM, one byte from the State matrix, XORs them, and writes the result back to RAM. We estimate that these operations take five clock cycles. One SubBytes transformation takes 16 · 5cc = 80cc. For a whole AES en-/decryption with ten rounds this makes 10 · 80cc = 800cc for the ten SubBytes transformations.

The MixColumns Transformation
The MixColumns transformation considers every column8). This column sc is multiplied by a fixed polynomialsc (0 ≤ c ≤ 3) of the State matrix as a four-term polynomial over GF(2 a with:

a(x) = {03}16 · x3 + {01}16 · x2 + {01}16 · x + {02}16 (2.6)

and then reduced by a fixed polynomial l with l(x) = x4 +1. The reduction by l decreases the result so it fits to the column again. Since a is coprime to l it is invertible and all elements have an inverse. The multiplication can be rewritten with the fixed polynomial a:

sc0 (x) = a(x) · sc(x) mod l(x), for 0 ≤ c ≤ 3

(2.7)

The indices 0,...,3 indicate the byte in the column c of the current State S. A multiplication with {02}16 can be efficiently implemented with the xtime operation which is denoted in Listing 2.31 and described in Section 2.2.

We estimate five clock cycles for the xtimes operation. A multiplication with {03}16 can be implemented as a multiplication with {02}16 plus an additional XOR operation with the operand. So for computing equations (2.8a) to (2.8d) we need:

• 4 · 1cc = 4cc for loading from RAM,

• 4 · 1cc = 4cc for writing to RAM,

• 4 · 3cc = 12cc for the additions (XOR),

• 4 · 5cc = 20cc for the four fixed multiplications with {02}16,

• 4 · 6cc = 24cc for the four fixed multiplications with {03}16.

This makes 64cc per column, 4 · 64cc = 256cc per round and 9 · 256cc = 2304cc for all nine MixColumns transformations during an AES encryption.

The Inverse MixColumns Transformation
The InvMixColumns transformation is similar to MixColumns transformation, but it uses the inverse of a:

a−1(x) = {0B}16 · x3 + {0D}16 · x2 + {09}16 · x + {0E}16. (2.9)

1 Please note that implementing xtime this way leaks timing information, due to the conditional multiplication with {1B}16. This will be discussed in Section 3.1.

The Hamming weight2 of the coefficients of a−1(x) is larger, which leads to more xtime calls and thus to a slower implementation for the decryption. For the InvMixColumns operation, which is given by the equations:

sc0 (x) = a−1(x) · sc(x) mod l(x), for 0 ≤ c ≤ 3

(2.10)

For the inverse MixColumns transformation the following operations are needed:

• 4 · 1cc = 4cc for loading from RAM,

• 4 · 1cc = 4cc for writing to RAM,

• 4 · 3cc = 12cc for the additions (XOR),

• 4 · 17cc = 68cc for the four fixed multiplications with {0E}16,

• 4 · 17cc = 68cc for the four fixed multiplications with {0D}16,

• 4 · 17cc = 68cc for the four fixed multiplications with {0B}16,

• 4 · 16cc = 64cc for the four fixed multiplications with {09}16,

Summarizing: To calculate a column of the State matrix S, 288cc are needed. To calculate the whole State matrix 4·288cc = 1152cc and 9·1152cc = 10368cc for all nine InvMixColumns transformations, during an AES decryption.

The Affine Transformation
As described in Section 2.1, the SubBytes transformation can be realized by a inversion followed by an affine transformation. Figure 2.6 illustrates these two steps. In the first part of the transformation the multiplicative inversion si,j−1 in GF(28) is build of the processed byte si,j. The inversion is followed by an affine transformation. The affine transformation

is a polynomial multiplication of a constant matrix with the byte a = s−1 followed by a

2 No. of “1” in the binary representation

XOR with the constant c. The matrix and the constant c are given in the standard [FIP01]. Equation (2.12) denotes the affine transformation.

b67 0 1 1 1 1 1 0 0 a67 1 b 1 1 1 1 1 0 0 0 a 0 b[1]43 0 0 1 1 1 1 1 0 a025143 1

b 0 0 0 1 1 1 1 1 a 0  2 = · a ⊕ 0 (2.12)

b 1 0 0 0 1 1 1 1

b10 1 1 0 0 0 1 1 1 a 0 b 1 1 1 0 0 0 1 1 a 1

b 1 1 1 1 0 0 0 1 a 1

because some of the countermeasures we will present calculate the inversion on the fly, we also have to calculate the affine transformation. So we estimate the runtime for this transformation. To calculate the runtime, we use the hypothetical processor model from Section 2.3.1 and count the clock cycles (cc). To calculate one bit of the result byte b we need 8 multiplications (AND) and seven additions (XOR). This makes 8+7 = 15cc for one result bit and 8·15cc = 120cc for all eight resulting bits a0 to a[2][3]. The affine transformation is executed 160 times during a whole AES run. So this naive way would take us 160 · 120cc = 19200cc just for the affine transformation.

Obviously, we can improve this implementation. If we take a closer look at the constant matrix in (2.12) the first thing we see, is the pattern m = {11111000}2 in the first row. This pattern repeats in the following rows, but every time rotated by one position to the right. With the fact, that this calculation in done in GF(2), we can also save some computation. If we are able to calculate the parity3 of a byte, we can safe the seven bitwise binary additions. Therefore we multiply the current column m of the constant matrix with the byte a and just set the output bit in b if the parity of the result is even. Algorithm 1 denotes this method.

Algorithm 1: Calculation of the affine transformation

Input: a

Output: b=aff_trans(a)

1 b = 0;

2 m = {11111000}2;

3 c = {01100011}2;

4 for i = 0 to 7 do

40cc. In the last step, the constant c is added to the result what also takes one clock cycle. To calculate the affine transformation for one byte this takes now 40cc + 1cc = 41cc. The affine transformation part of a whole AES run, now takes 160 · 41cc = 6560cc.

Summary
Summarizing, in our processor model an AES encryption needs 240cc for the ShiftRows transformation, 800cc for the SubBytes transformation, and 2304cc for the MixColumns transformation. Depending on the key calculation method, this gives either a total cycle count of 4048cc if separate key scheduling is used, or 4064cc for the key on-the-fly method for an entire AES encryption.

The AES decryption needs a bit more clock cycles, because the InvMixColumns is slower than the MixColumns transformation. With the 10368cc for an InvMixColumns transformation, the decryption takes 704cc+240cc+800cc+10368cc = 12112cc for an AES decryption with separate key scheduling.

2.3.4 8-bit Optimized Software Implementation
The discussed AES implementation can be improved by combining the ShiftRows and SubBytes transformation, see [Den07]. In addition, the inverse MixColumns transformation needed for the decryption can be rewritten in a clock cycle saving way. This leads to an optimized implementation as discussed in this section.

ShiftRows and SubBytes Combined
Since ShiftRows is only a permutation of the indices of the State matrix elements, it can be combined with the SubBytes transformation. With this combination, we safe the 240cc of the ShiftRows transformation.

Improved inverse MixColumns
In [DR02], P. Barreto observes the following relation between the MixColumns polynomial a(x), which has been defined in Equation (2.6), and the InvMixColumns polynomial a−1(x):

a−1(x) = ({04}16 · x2 + {05}16) · a(x) mod (x4 + {01}16) (2.13)

There are two important points. The Hamming weight of the coefficients is much smaller than in the original Equation (2.9) and it has only half as many coefficients. In matrix notation, the right part of Equation (2.13) becomes:

{02}16 {03}16 {01}16 {01}1616 {05}1616 {00}1616 {04}1616 {00}1616

{01}1616 {02}1616 {03}1616 {01}1616· {{0004}}1616 {{0500}}1616 {{0005}}1616 {{0400}}1616 =

{01} {01} {02} {03}

{03}16 {01}16 {01}16 {02} {00} {04} {00} {05}

{0E}1616 {0B}1616 {0D}1616 {09}1616

{09}1616 {0E}1616 {0B}1616 {0D}16 (2.14)

{0D} {09} {0E} {0B}

{0B} {0D} {09} {0E}16

So the normal MixColumns transformation can be applied after a little precalculation of the State matrix. For the constant multiplication of {04}16 two xtime calls and for the multiplication by {05}16 two xtime calls plus one addition (XOR) are needed. Listing 2.4 denotes the preprocessing step. The word w (four bytes) is the current column of the State matrix.

1 word ImvprovedInvMixColumns( byte w[ 4] )

2 begin

3 u = xtime (xtime(w[ 0] ^ w[ 2 ] ) ) ; 4 v = xtime (xtime(w[ 1] ^ w[ 3 ] ) ) ;

5

6 w[0] = w[0] ^ u; 7 w[1] = w[1] ^ v ; 8 w[2] = w[2] ^ u; 9 w[3] = w[3] ^ v ;

10

11 MixColumn(w);

12 return w;

13 end

Listing 2.4: Preprocessing step for improved InvMixColumns

The precalculation needs four xtimes operations and six XOR operations, these are 4·5cc+ 6·1cc = 26cc. The precalculation is followed by a normal MixColumns transformation which leads to 26cc + 64cc = 90cc to calculate the inverse MixColumns transformation for one column of the State matrix. A whole State S takes 4·90cc = 360cc and all nine MixColumns transformations during an AES decryption take 9 · 360cc = 3240cc.

Summary
Summarizing, the optimized (8-bit) AES encryption takes 704cc for the AddRoundKey transformation, 800cc for the combined ShiftRows and SubBytes transformation, and 2304cc for the MixColumns transformation. In total, this makes 3808cc for one entire AES encryption with TLU keys.

With the improved inverse MixColumns transformation, the AES decryption now needs 704cc + 800cc + 3240cc = 4744cc to decrypt a 128bit ciphertext.

2.3.5 32-bit Software Implementation
Two of the discussed 8-bit transformations, namely the AddRoundKey and the MixColumns transformation both benefit directly from a 32-bit register size. In the next sections, we discuss this 32-bit implementation. In addition, there is a 32-bit implementation called Ttables implementation, which combines the four transformations in precalculated tables.

The AddRoundKey Transformation
The AddRoundKey transformation can also be implemented to work simultaneously on a whole column of the State matrix. Therefore it loads one word from the expanded key array and one word from the State matrix S and adds them. The resulting word is written back to the State matrix. To process the State S, 4·4cc = 16cc are needed. All eleven AddRoundKey transformations now take 11 · 16cc = 176cc.

The MixColumns Transformation
The MixColumns transformation can be speed up considerably on 32-bit processors. Gladman mentioned in [Gla01] that equation (2.8a) to (2.8d) can be rewritten in such a way, that only multiplications by two are necessary, e.g., {03}16 ⊗ s0,c = {02}16 ⊗ s0,c ⊕ s0,c.

The four bytes s0,c to s3,c can be stored successively in the word w. Listing 2.5 denotes the multiplication by two within a whole word, called FFmulX. The word t, on the left hand side (from the second XOR) in line four, extracts the highest bits from each byte within the word w. The four individual bytes are now multiplied by two in parallel with a single left shift. On the right hand side, the bytes of word w whose top bits are set are multiplied by {1B}16, the AES reduction polynomial. This reduction ensures, that the four 8-bit results fits into the 32-bit register.

1 word FFmulX(word w)

2 begin

3 word t = w & 0x80808080 ;

4 return ((w ^ t ) << 1) ^ (( t >> 3) |

5 ( t >> 4) | ( t >> 6) | ( t >> 7));

6 end

Since the four bytes s0,c to s3,c are stored successively in the word w, the four Equations (2.16a) to Equations (2.16d) can be calculated in parallel. Listing 2.6 denotes this MixColumns transformation for one column of the State matrix. The resulting word contains the four transformed bytes s00 ,c to s30 ,c.

1 word MixColumn(word w)

2 begin

3 w = rot1 (w) ^ rot2 (w) ^ rot3 (w) ^ FFmulX( rot1 (w) ^ w);

4 return w;

5 end

Listing 2.6: MixColumns transformation for one column of the State matrix

With Listing 2.5 and Listing 2.6 we can estimate the clock cycles for the MixColumns transformation. The FFmulX function needs ≈ 11cc. The MixColumn operation needs four rotations, four XOR operations, one FFmulX operation which makes 19cc. In addition it has to load the word w from RAM and store the result back in RAM, this takes two clock cycles. One MixColumns transformation needs four MixColumn operations which makes 4 · 21cc = 84cc for one round and 9 · 84cc = 756cc for all nine MixColumns transformations.

The Inverse MixColumns Transformation
The 8-bit precalculation from Listing 2.4 can be formulated in such a way, that it uses 32-bit operations. Listing 2.7 denotes the improved inverse MixColumns transformation for one column w of the State matrix.

1 word InvMixColumn(word w)

2 begin

3 word tmp;

4

5 tmp = rot2 (w); 6 tmp = tmp ^ w;

7 tmp = FFmulX(tmp);

8 tmp = FFmulX(tmp);

9 10 w = w ^ tmp;

11

12 w = MixColumn(w);

13

14 return w;

15 end

Listing 2.7: Inverse MixColumns transformation with 32-bit operations

We can estimate the clock cycles for the InvMixColumns transformation with Listing 2.7. For one column of the State matrix it takes one rotation, two XOR operations, two FFmulX operations and one MixColumn operation. This makes 1cc + 2 · 1cc + 2 · 11cc + 21cc = 46cc for one column, 4 · 46cc = 184cc for a State and 9 · 184cc = 1656cc for all nine rounds.

Summary
Summarizing the optimized 32-bit AES encryption takes 176cc for the AddRoundKey transformation, 800cc for the combined ShiftRows and SubBytes transformation, and 756cc for the MixColumns transformation. In total this makes 1732cc for the encryption.

The decryption, with 1656cc for the InvMixColumns transformation, now needs 176cc + 800cc + 1656cc = 2632cc.

2.3.6 32-bit T-Tables Implementation
Daemon and Rijmen already showed in their original AES proposal, that AES can be implemented very efficiently on 32-bit processors by combining the transformations SubBytes,

ShiftRows, MixColumns, and AddRoundKey for one column4. They define four tables each of 256 4-byte words (for 0 ≤ x ≤ 255).

Where c(r) = c + r mod 4 with c(0) = c. Here c denotes the actual column index, and c(r),r = 0,...,3 denotes the new position after the ShiftRows transformation. With these tables each column in the output State s0 can be computed by using four XOR operations together with one word from the key schedule and four table-lookups that are indexed using four bytes from the input State. We need four XOR and five TLU per column. Since we have four columns per State matrix, 4 · 4 = 16 XOR, 4 · 4 = 16 TLU (ROM) and four TLU (RAM) per round are needed. This makes 16cc + 2 · 16cc + 4cc = 52cc for one round and 10 · 52cc = 520cc for the whole algorithm (plus additional key scheduling).

One table consumes 4·256bytes = 1024bytes ROM, and for the four tables 4·1024bytes =

4096bytes are needed. In the last round, the MixColumns is not present. This means that different tables are required for the last round. If we also implement the last round as a table, we need 2 · 4096bytes = 8192bytes of table space altogether.

If we take a closer look at the tables, it can be recognized that the structure of the four tables is closely related to each other. The columns are rotated by one, two and three bytes, since Ti(x) = rot(Ti−1(x)),i = 0,... ,3, where the function rot() rotates the bytes in the column cyclically by one position. So, the space needed for the tables can be reduced by a factor of four at the expense of three additional cyclically rotations per round.

2.3.7 32-bit Implementation with Transposed State
Bertoni et al. [BBF+03] had the idea to transpose the State matrix S in such a way, that it can be processed by columns instead of rows. Therefore they have restructured the MixColumns, the ShiftRows, and the round key generation for en-/decryption. Since the SubBytes transformation works on each byte individually, they left it as it is.

The consequence of the transposition is that the new version of the InvMixColumns achieves a much higher speed gain. In fact, the transposed InvMixColumns requires seven doublings and 27 additions. The authors report a cycle count of 2047cc for decryption on an ARM7TDMI. The encryption becomes a bit slower than the 32-bit reference implementation of Gladman. For encryption they need 1675cc. Their reference implementation from Brian Gladman needs 2763cc for decryption and 1641cc for encryption on the ARM7TDMI.

2.3.8 Hardware Implementations
There has also been much effort put in efficient hardware implementations of AES: From the 8-bit implementation straight from the standard to 32-bit implementations using a 32-bit data path up to high-speed implementations with a 128-bit data path. For every implementation, very much care has to be given to the S-box implementation, both for performance and for security reasons.

Rijmen suggested in 2001 to use subfield arithmetic for computing the inverse part of the SubBytes transformation in [Rij01]. This idea was extended by Satoh et al. in [SMTM01] by using sub-subfields, the so-called “composite field” or “tower field” approach, which resulted in the smallest AES circuit at that point. The “composite field” approach utilizes that a finite field GF(2k) with k = n · m can also be represented as GF((2n)m). Later, Canright [CB09] improved this method again by a carefully chosen normal basis, which results in the most compact S-box up to date.

As we will see, some ideas from the “composite field” approach can be reused for our SCA resistant AES implementation. So we will give a short overview to gain insight into the “composite field” arithmetic.

Composite Fields
We have seen in Section 2.2, that the finite field GF(28) can be seen as an extension field of GF(2) and its elements are represented as bytes. However, we can also see GF(28) as a quadratic extension of the field GF(24), which is far better suited for hardware implementations [Paa94, SMTM01]. We call this field composite field. In particular, composite field inversions are used to create compact AES hardware implementations [RDJ+01, SMTM01, MS03, WOL02].

We can represent every element a ∈ GF(28) as a linear polynomial ahx + al with the coefficients ah and al being elements of the field GF(24). We refer to this polynomial as twoterm polynomial. Both coefficients of this polynomial have four bits. The finite field GF(28) is isomorphic to the field GF((24)2). For each element in GF(28) there exists exactly one element in GF((24)2). The following field polynomial, which is used to define the quadratic extension of GF(24), is taken from [WOL02, Tri03].

Two two-term polynomials can be added by adding their corresponding coefficients:

(ahx + al) ⊕ (bhx + bl) = (ah ⊕ bh)x + (al ⊕ bl). (2.19)

For the multiplication and inversion of two two-term polynomials a modular reduction step is required, to ensure that the result is again a two-term polynomial. The irreducible polynomial n used for the reduction is given by:

n(x) = x2 + x + {E}16. (2.20)

The coefficients of n are elements of GF(24) and written in hexadecimal notation. To ensure, that the product of two coefficients lies in GF(24), i. e., a(x)⊗ b(x) := a(x) · b(x) mod m4(x) with a,b ∈ GF(24), the irreducible reduction polynomial m4 is used, which is given by:

m4(x) = x2 + x + 1. (2.21)

To calculate the inverse of d Oswald et al. shift this element down to GF(22) in [OMP04], where they compute the inversion on a dedicated hardware. This is feasible in hardware but it is not efficient in software on a microcontroller. Instead the inversion of d can be efficiently realized as a lookup table with 16 elements from GF(24).

If the inversion part of SubBytes is calculated with the composite field method, the big lookup table for the inversion in GF(28) with 256 elements can be replaced by the lookup table for the inversion in GF(24) with 16 elements.

2.3.9 Comparison of Different Implementation Options
In the last sections, several options to implement AES were discussed. Two main possibilities to write an 8-bit software implementation of AES have been pointed out.

In the first method, all round keys are calculated in advance, Table 2.1 line one, three, four, five and six. A method with separate key scheduling can be recommended, if there is enough RAM to store the 176bytes for the round keys. The separate key scheduling especially pays off, if two or more blocks of 128bit are encrypted5. An AES 128-bit encryption with separate key scheduling takes 3808cc on an 8-bit processor (line three) and 1732cc on a 32-bit processor (line four).

The second method is to implement an embedded key scheduling. Here the round keys are calculated when they are needed, on-the-fly, and afterwards they are discarded. This leads

to a very small RAM consumption. Table 2.1 denotes in line two, that this implementation needs 4064cc and only 16bytes RAM (for the last used round key). If the processor used has a large number of registers, everything can be calculated without the use of RAM. Thus, the on-the-fly method can be preferred on architectures where only very limited RAM is available.

On a 32-bit processor, AES can also be implemented with almost only table lookups. Table 2.1 refers to this method as T-tables. An 128-bit encryption with the T-tables approach takes approximately 520cc (with separate key scheduling). The T-tables implementation needs 8192bytes of ROM, to store the four pre-calculated tables that are needed for the encryption, and additional 8192bytes for the decryption.

The first three entries in the table are to get a rough feeling of the space and cycle consumption of AES. In the second part of the table they can be compared with real implementations. We notice that there are big clock cycles differences between different implementations even if they are on the same architecture (8 or 32-bit). So we just compare the ratio between the cycle counts for an implementation given from the same author.

The cAESar implementation from [HKQ99] is one of the first 8-bit implementations we found. The used processor is not clearly described in the paper. The implementation is completely written in Assembler and needs 2889cc for one encryption. The round keys are generated on-the-fly.

The second implementation is from Akkar/Giraud in 2001 [AG01] (line 8). Their nonoptimized implementation was done in assembly code using an 8-bit processor. In their paper, they specify the runtime with 18.1ms at 5MHz which results in 90500cc for the encryption of a 128-bit message. The whole implementation needs 730bytes ROM and 41bytes of RAM.

In [HOM06], Herbst et al. compare their SCA resistant 8-bit implementation with one self made older unprotected one from the Institute for Applied Information Processing and Communication (IAIK) [IAI06] they work at. This unprotected implementation needs 4427cc for an encryption. The second unprotected implementation they refer to, is an implementation from Christian Röpke [R¨03] on an ATmega163, which needs 7498cc for an encryption. These two clock cycle counts are interesting for us, because we want use to some countermeasures from Herbst et al.

Bernstein and Schwabe [BS08] have written the fastest 32-bit implementation we have found so far. For achieving these results, they have made heavily optimized versions for the underlying 32-bit micro architecture and utilized the T-tables approach. On an Intel Core 2 Quad with, e.g., combined load and XOR operations, their implementation needs ≈ 169cc for the encryption. In their clock cycle calculation they ignored the costs for computing the round keys.

The main requirement for our AES implementation is to be small in terms of RAM and ROM consumption. For this reason, we do not take the 32-bit T-tables approach. Because a typical AES implementation will have a fixed key and we assume that more than one 128-bit block is encrypted during one AES run, we decided us for the separate key scheduling method. Therefore we will take the optimized 32-bit implementation (line four) and try to implement it on the TriCore architecture as fast as possible. The RAM consumption in Table 2.1 does not contain the 16-byte State matrix.

[1] t = a ⊗ m;

6 b = rotate_bit_left(b);

[2] b = b ⊕ parity(t);

8 m = rotate_bit_right(m);

9 end

[3] b = b ⊕ c;

Now we one multiplication (AND) 1cc, two rotations 2cc, one parity calculation 1cc and one addition (XOR) 1cc to calculate one bit of the result. For all eight bits we need 8·5cc =

3 States whether the sum of all bits in the byte are even or odd.

@Slav Defence - Again, this is a much more advanced article on EMSEC, one that goes into the principles behind the concept of side channel analysis, or the task of side-stepping encryption. Also, I hope these article are helping you and others understand what I did in the US Navy. They are quite technical, if you'd like me to post less technical information, please say so. I really want people to enjoy these article and learn something, but the information can be overwhelming and scary at times, especially for novices... this isn't even too advanced, it's first year stuff!!!

@Jango @Manticore @Jungibaaz

@Hu Songshan - you told me in our conversation that you where interested in Electronic Warfare, so I thought I'd tag you here as well given this thread pertains to said topic.

@Nihonjin1051 - I have no idea if this topic interests you (if it does, you're welcome. If not, :cry:

), but this was my job in the USN.

SvenSvensonov · Jan 19, 2015

Chapter 3 Side-Channel Analysis
In the classical “cryptanalysis”, a cryptographic algorithm (cipher) is an abstract mathematical object, which turns some plaintext input data into encrypted output data, parameterized by a key. Here, Kerckhoffs’ law [Ker83] is a main rule for everybody. It means, that the security of a cryptosystem should not depend on the secrecy of the algorithm. All security is in the cryptographic key, which therefore needs to be carefully protected. In classical cryptanalysis the security of a cipher is investigated by mathematical proofs and analysis of the algorithm.

However, a cipher has to be implemented in a program, that will run on a given processor in a specific environment like a smartcard or embedded system. These systems have specific physical characteristics. If these physical characteristics deliver information about the secret key involved in the operation it is called a side-channel. Typical side-channels are:

• the time, an algorithm (or part of it) needs for execution,

• the power consumption of the device,

• the electromagnetic field of the device during its processing, and

• the specific behavior after injecting a fault.

The side-channel attacks attempt to recover the secret key by parts. By only looking at small parts of the key, exhaustive key search on these sub-keys becomes possible. This process is then repeated until the full key is found. In the following sections, we take a closer look at the first two side-channels and how they can be used to obtain information about the secret key.

3.1 Timing Analysis
If the runtime of a cryptographic device is key dependent, it can be used as a side-channel. Kocher introduced this idea in [Koc96]. A formalized description can be found in [HMS00], we will give here a short overview. For example, we want to discover the private exponent e in the following calculation:

y = xe mod n. (3.1)

We assume, that the calculation is done with the Square-and-Multiply method (left-to-right binary exponentiation) which is described in Algorithm 2. We see, that the running time of the algorithm directly depends on the bits e = (ep,...,e0) of the secret exponent where p+1 denotes the total number of bits in the exponent. If a bit of the exponent is one, then an

25

additional multiplication by the base x and reduction by n is performed, which will take a bit more time.

Let us assume, that we already know the first bits ep,...,em−1 of the exponent. Bit em will be attacked. We measure the runtime Ti,i = 1,...,N, for N modular exponentiations with random (but known) base xi:

Ti = T(xei mod n). (3.2)

Next we assume that the unknown bit em is zero. We compute the execution time Ti,m0 via an emulation of the runtime until bit em is reached. Now we build the empirical variance1 V0 := Var(Ti − Ti,m0 ). We do the same for em = 1 and build the empirical variance V1 := Var(Ti − Ti,m1 ). Then we decide whether the exponent bit em was zero or one:

em =0 if V0 < V1, (3.3)

1 else.

This method works, because a wrong guess of em leads to an increase of the empirical variance, if we can exactly simulate the execution time. Note that we do not require new measurements for the other unknown bits. We just repeat the steps for the next unknown bit of the exponent e.

We see, that this attack is applicable, since intermediate values of the algorithm we attack depend on small parts of the key and lead to a different timing. This fact is utilized in other attacks, too. Since we know of the existence of those attacks, we will show in the following a method to verify, if our implementation is vulnerable to timing attacks.

General countermeasures against timing attacks are to modify the algorithm in a way, that it:

• always has a constant, thus key-independent, runtime,

• or masking the data, for example, with Chaum’s blinding method presented in [Cha82].Here the exponentiation is done with a value ˜ei = λei, λ random, unknown to the attacker. Thus, the attacker cannot simulate the execution time Ti,mb .

•

3.1.1 Evaluation of Timing Analysis Resistance
The Square-and-Multiply algorithm has demonstrated, that cryptographic algorithms can leak critical information during runtime. This example showed a very clear key-dependent runtime behavior. Often this behavior is not so clear, i.e., there could be even very small timing differences in the so-called Square-and-Always-Multiply Algorithm, a “fixed” version of Square-and-Multiply where a multiplication is performed regardless of the exponent bit. The

“useless” result of this multiplication is then discarded. Obviously, in this case the timing difference between em = 0 and em = 1 is very small (if there is any). In order to detect such small variations we cannot rely on simple mechanisms, we need a thorough statistical background. Therefore, to verify if our AES implementation leaks key dependent information, some statistical tests are performed.

In the following, we assume that the running time of the algorithm is a random variable X which is distributed like a Gaussian distribution2, i.e., X ∼ N(µX,σX2 ) with mean µX and variance σX2 .3

For the statistical analysis, we will perform runtime measurements with different plaintexts and different keys. We define some measurement sets with, for example, N = 1000 measurements per set. A set contains the plaintext, key and running time for an encryption run. Since the runtime varies and the measurement process is not perfect, there are always differences in the measured time.4 The goal of the statistical tests, to be introduced in a minute, is to verify if these outliers or differences are systematic or not, i.e., to test if we can distinguish the measurement sets at hand of their runtime.

3.1.2 Two Sample T-Test
The density function of a t-distributed random variable X with m ∈ N degrees of freedom5 is given by

x m+1

fX,m for x ∈ R, (3.4)

m

where Γ denotes the Gamma function, which is defined the following way

Γ(1) = 1, Γ(n + 1) = n! for n ∈ N

and can be generalized to arbitrary real arguments x ∈ R. For sufficiently large m, the density function of the t-distribution approximates the Gaussian normal distribution N(0,1), i.e., fX,m −−−−m→∞→ N(0,1).

If X is an N(0,1)-normal distributed random variable and Y is a chi-squared (χ2) distributed random variable with m degrees of freedom and both random variables are independent, then the random variable

X T = qm

Y

is t-distributed with m degrees of freedom.

Now we can formulate the two sample t-test: Let Xi,i = 1,...,m, and Yj,j = 1,...,n, be independent and identically distributed (i.i.d) random variables, respectively, with

Xi ∼ N(µX,σX2 ) and Yj ∼ N(µY ,σY2 )

with equal but unknown variances σX2 = σY2 . Under the hypothesis H0 : µX = µY , the test characteristic

X Y mn

T = − r (3.5a)

SP m + n

with

SP2 = 1 − (m − 1)SX2 + (n − 1)SY2 , (3.5b) m + n 2

m

SX2 = 1 (Xi − X)2, (3.5c) m − 1 Xi=1

n

SY2 = 1 (Yj − Y )2 (3.5d) n − 1 jX=1

has a t-distribution with m + n − 2 degrees of freedom.

To test, if both random variables come from the same population, we make the following two hypotheses:

H0 : µX = µY and H1 : µX =6 µY . (3.6)

With a significance level α, for example, α = 0.05, the hypothesis H0 can be rejected, if

|T, (3.7)

where tm+n−2;1−α2 denotes the 1 − α/2 quantile of the t-distribution with m + n − 2 degrees of freedom.

Otherwise we have to say, that we have insufficient evidence to reject the null hypothesis.

3.1.3 Fisher’s F-Test
With the t-test we test the mean values of our measurements under the assumption that the unknown variances are the same. To statistically test the observed variances, the so-called Fisher F-test can be used which will be described in this section. With the F-test we can verify statistical moments of order two, which are usually much smaller and more difficult to detect.

Let us assume two normally distributed random variables. To test if the variance of one random variable equals the variance of another random variable with unknown, but not necessarily equal mean values, we can use the F-test.

3.1 Timing Analysis

The density function of an F-distributed random variable X with (m,n)-degrees of freedom, m,n ∈ N, is given by

m n

Γ(mΓ(+2m2n))Γ(m n22)n 2 · xm2 −m1+2 n for x >≤ 0, (3.8) fX(x) =(n+mx)

0 for x 0.

If X and Y are two independent χ2-distributed random variables with m and n degrees of freedom, X ∼ χ2m,Y ∼ χ2n, then the random variable

X/m

T =Y/n

is F-distributed with (m,n)-degrees of freedom.

Let Fm,n;α be the quantile with the order α. For large n the quantile of the F-distribution with (m,n)-degrees of freedom converges to the quantile of the χ2-distribution with m-degrees of freedom multiplied with the factor 1/m:

Fm,n;.

Now we are able to formulate Fisher’s F-test: Let Xi,i = 1,...,m, and Yj,j = 1,...,n, be independent and identically distributed (i.i.d) random variables, respectively. We assume, that the random variables are normally distributed

Xi ∼ N(µX,σX2 ) and Yj ∼ N(µY ,σY2 )

with µX and µY unknown. Under the hypotheses H0 : σX2 = σY2 , the test statistic

S2

T = SXY2 (3.9a)

with

SX2 = 1 m (Xi − X)2, (3.9b) m − 1 Xi=1

SY2 = −1 Xn (Yj − Y )2 (3.9c) n 1

j=1

is F-distributed with (m − 1,n − 1) degrees of freedom. We make the following hypotheses

H0 : σX2 = σY2 and H1 : σX2 6= σY2 . (3.10)

With a significance level α, for example, α = 0.05, we can reject the hypothesis H0, if

|T. (3.11)

Otherwise we have to say, that we have insufficient evidence to reject the null hypothesis.

3.2 Simple Power Analysis
At Crypto ’99, Kocher et al. [KJJ99] introduced the so-called power analysis attacks. Every hardware device which performs cryptographic operations can leak information by its power consumption. A common example of an algorithm which might be vulnerable to a Simple Power Analysis (SPA) is the Square-and-Multiply algorithm which was discussed above. Because the algorithm goes sequentially through the bits of the exponent, the difference whether a multiplication takes place or not can be directly observed in the power trace. The power trace can be obtained by measuring the power consumption. This can be done by adding a small resistor in series with the power or ground of the attacked device. The current, which can be derived from the voltage across that resistor divided by the resistance, yields to the power trace. The length of the power trace is the product of the time span in seconds with the sampling rate. A simple power analysis is typically performed on one single power trace. However, multiple power traces can be recorded while the device performs the same operation on the same data several times. Computing the mean of those power traces will reduce the noise. Messerges has observed several sources for noise in [Mes00]. Sources for noise are, for example, at the clock edges, during the quantization and algorithm noise.

Building a good measurement setup should not be underrated in side-channel attacks. From an attackers viewpoint it is important, to avoid noise as much as possible. Also, if multiple traces for the same operation were recorded, they have to be aligned exactly in time. This can be achieved with an exact trigger and alignment techniques. Hermann Seuschek has analyzed the different types of noise in his master thesis [Seu05] and gives a detailed explanation on how to minimize these effects for an optimal measurement setup.

3.3 Differential and Correlation Power Analysis

The differential power analysis (DPA) exploits the relationship between the processed data and the power leakage of multiple traces with different input data. For a DPA, the detailed description of the implementation of the algorithm is not needed, which makes it a very powerful attack. It was suggested by Kocher [KJJ99] and later formalized by Messerges et al. in [MDS99]. The four following requirements, which are necessary for a first-order DPA attack on AES, can be formulated:

1. The attacker needs physical access to the device to measure the power consumptionwhile it performs several en-/decryptions;

2. We need to know either the plaintext or the ciphertext of the algorithm (attack the firstor the last rounds of the algorithm), we refer to it as the processed data vector d = (d1,...,di,...,dD), where di denotes the plaintext or ciphertext of the ith decryption or encryption run and D is the total number of runs.

3. An intermediate result that occurs during a de-/encryption which needs to be a functionof the processed data di and a small part of the key, which is denoted as k∗.

4. The power consumption of the device has to depend on the processed data. We refer to the measured power trace that corresponds to the processed data di as ti = (ti,1,...,ti,T), where T denotes the length of the traces. We obtain a trace for each run so this results in the power trace matrix T with size D × T.

3.3 Differential and Correlation Power Analysis

Detailed instructions on how to perform a DPA can be found in [MOP07]. The basic idea for a DPA is, to make a hypothesis about one ore more bits of the secret key. Therefore, we chose an intermediate function, which depends on the processed data di and a small part of the key k∗. Since the small part of the key is unknown, all possible values k∗ = (k1,...,kj,...,kK) for k∗ have to be processed, where the index j denotes the jth possible key and the index i denotes the ith en-/decryption. Now the hypothetical intermediate values vi,j can be calculated for all D de-/encryption runs and for all K possible key hypotheses. This results in a matrix V with size D × K, where the column j contains the intermediate results that have been calculated with the key hypothesis kj. Now, we can derive the hypothetical power consumptions H from the hypothetical intermediate values via a power model. The better the power model matches the actual power consumption characteristics of our hardware device the less power traces are needed. The most common power models are:

• the bit model (LSB-Least Significant Bit):

hi,j = LSB(vi,j), • the Hamming-weight (HW) model:

(3.12)

hi,j = HW(vi,j), • the Hamming-distance (HD) model:

(3.13)

hi,j = HD(vi,j,di), and

(3.14)

• the zero-value model. Here we assume, that the power consumption for the data value0 is lower than for all other data values:

hi,j =0 if vi,j = 0 (3.15)

1 if vi,j = 06 .

The final step is to compare the measured power traces T with our hypothetical model H. In [KJJ99], the authors measure correlations indirectly by computing the difference of means. In this thesis, we describe how the attack is performed using the Pearson correlation coefficient. A detailed description is in the Appendix. We estimate each value ri,j with ri,j =PdD=1(h−d,i − h¯i·)P· (td,j − t¯j)− i = 1,... ,K, j = 1,...,T. (3.16)

(hd,i h

Dd=1 ¯i)2 Dd=1 (td,j t¯j)2

With the resulting matrix R of correlation coefficients, we can determine if the hypothetical power consumption H and the measured power traces T are correlated. In Equation (3.16), h¯i and t¯j denote the mean values of the columns hi in H and tj in T. Every row in matrix R is the correlation coefficient over time for a specific key hypothesis. If the key hypothesis (“key guess”) was wrong, then the hypothetical power consumption and the power trace should be uncorrelated; i.e., we see just noise. If the key guess is correct, we should see a peak in the correlation curve.

A second order DPA attack considers multiple points simultaneously within the same power trace t. These points belong to two different intermediate values, e.g., a mask value m and the hidden intermediate value d⊕m. The knowledge of either m or d⊕m alone is not of any use to the attacker. But if the attacker correlate both values, he can gain some information on d. A detailed description of higher order DPA can be found at [JPS05].

Chapter 4 Side-Channel Resistant AES Implementation
A main criteria for the feasibility of a power analysis attack is, that the part of the cipher which is attacked only depends on a small amount of key bits. AES offers exploitable functions like the first and the last AddRoundKey. They both depend only on a small amount of key bits. AES1 provides another interesting characteristic which makes a power analysis attack even more applicable: The non-linear SubBytes transformation. This is because this transformation is non-linear and therefore if there is one bit wrong in the key hypothesis several bits of the output differ from the calculated intermediate result. On average, half of the bits calculated based on an incorrect key hypothesis differ from the actual intermediate result – independently from the number of incorrect bits in the key hypothesis. This improves the ability of finding the right key during a SCA enormously. Figure 4.1 depicts the initial AddRoundKey operation (XOR) before the first round and the first SubBytes transformation for the byte s0,0 of the State matrix in round one. So an important task will be to protect

s0,0 (8-bit from the plaintext, known)

k (8-bit from the key, unknown)

s00 ,0 (8-bit result)

Figure 4.1: Initial AddRoundKey followed by the first SubBytes transformation in round one for one byte of the State matrix

the SubBytes transformation in the side-channel resistant implementation.

4.1 Random Bits
All following countermeasures need random bits. The Bundesamt für Sicherheit in der Informationstechnik (BSI) describe four classes for the quality of deterministic random number generators in [BSI01]. From the “weak” class K1, where the random numbers containing no identical consecutive elements with a high probability, to the “strong” class K4, where it should be impossible for an attacker to calculate or guess from an inner state of the generator any previous numbers in the sequence of random numbers.

To mask the SubBytes transformation, it is very important to generate high class random numbers for the mask values, because if the distribution of the random mask has a bias with respect to the uniform distribution, the masked value leaks information, which can be exploited during a DPA, see [CGPR08].

So the random number generator should at least fulfill the requirements of a K2 random number generator for the masked values. This class ensures, that the random numbers passed statistical tests and are uniformly distributed. For generating those qualitative good random values, the IAIK suggests the Grain-80 stream cipher, which can also be used as random number generator. The Grain-80 is optimized for hardware implementations. In practice the AES implementation will get the random numbers from outside (a hardware random number generator), Grain-80 is only be used to generate random numbers for the final measurements on the development board.

4.2 Countermeasure: Masking
During a power analysis attack, the attacker has to make a hypothesis about the power consumption of the device performing the attacked operation. Our goal is, to make the real intermediate values of the executed transformation independent from their power consumption. This can be realized by masking them. With masking we conceal every intermediate value with a random value m, the so-called mask. The attacker does not know the masking values and thus cannot compute meaningful hypothesis on the power consumption. As a result, the side-channel leakage of all intermediate, key-dependent values, does not correlate with the corresponding unmasked value. We can easily reproduce the propagation of the mask throughout the linear parts of AES in order to remove the mask at the end of the cipher. For the non-linear parts, namely the SubBytes transformation, we require considerable more effort for the mask correction after this transformation.

In the following subsections, we will discuss the different masking schemes in terms of space consumption, security, and speed. For all masking schemes, the runtime of the key scheduling is excluded. The 176-byte RAM consumption for the expanded round key is included in the estimated RAM value.

4.2.1 Additive Masking
Additive masking masks a value x with a random value m by adding (XORing) them, xm = x ⊕ m. The mask can be removed by adding the mask value again, x = xm ⊕ m. This is an easy but very efficient masking scheme, since the masked value xm is not distinguishable from a random value if the mask value m is a random value.

The “problem” with AES is, that the SubBytes transformation is a non-linear operation, which means SubBytes(si,j ⊕ m) =6 SubBytes(si,j) ⊕ SubBytes(m). So additive masking cannot be easily applied “out of the box”. The mask value m cannot be easily removed from the intermediate value si,j0 after the transformation without modifying the SubBytes transformation.

For additive masking, the whole lookup table (S-box) has to be recomputed. The input mask m masks the index and the output value mx masks the actual data according to Algorithm 3. This masking method was published first in [Mes00].

Algorithm 3: Computation of the masked AES SubBytes transformation as proposed in

[Mes00]

Input: m,mx

Output: MaskedSubBytes(si,j ⊕ mx)=SubBytes(si,j)⊕m

1 for i = 0 to 255 do

2 MaskedSubBytes(i ⊕ mx)=SubBytes(i)⊕m;

3 end

With one mask value pair m and mx for all intermediate values, we need additional

256bytes RAM for the new masked S-box and two random bytes for the input mask m and the output mask mx. With the expanded key that makes 176bytes + 256bytes + 2bytes =

434bytes.

The precalculation is made up of 256 additions (XOR) to mask the table index, 256 table lookups (TLU) to read the unmask S-box values from ROM, 256 additions (XOR) to mask the table entries and, finally, 256 write operations to store the masked table in RAM. This takes 256cc + 2 · 256cc + 256cc + 256cc = 1280cc for computing the new S-box.

With four mask value pairs a whole column of the State matrix can be masked. The advantage is, that during the MixColumns transformation, which works on a whole column (4bytes) of the State matrix, all intermediate values are mask with different masks. Therefore we need eight random bytes and 4 · 256bytes = 1024bytes for the new lookup tables, which results (with the expanded key) in 176bytes + 8bytes + 1024bytes = 1208bytes RAM. The precalculation overhead for four different masks leads to 4 · 1280cc = 5120cc for computing the new S-boxes. We can assume the runtime of the normal SubBytes transformation, since during an AES encryption, we just have to index the correct masked table for the TLU. In addition we have to mask the intermediate value from the State matrix with the input mask before and with the output mask after the transformation. That makes additional 2cc per byte of the State matrix and 2 · 160cc = 320cc for the whole encryption. Summarizing, this scheme needs for all ten rounds 320cc + 1280cc = 1600cc with one masked S-box and 320cc + 5120cc = 5440cc with four masked S-boxes.

To mask the 32-bit T-tables implementation we would require 32-bit mask values, since we now operate on whole rows, instead of bytes from the State matrix. Masking the tables with one input– and one output–32-bit mask requires eight random bytes for the masks and 8192bytes RAM for the eight masked T-tables (four for the first nine rounds and four for the last round). In total it makes (with the expanded key) 176bytes +8bytes+8192bytes = 8376bytes. The precalculation is made up of 256 additions (XOR) to mask the 32-bit table index, 256 table lookups (TLU) to read the unmask values from the tables (ROM), 256 additions (XOR) to mask the table entries and finally 256 write operations to store the masked table 32-bit values in RAM. This costs us 256cc+2·256cc+256cc+256cc = 1280cc for the precalculation of the masks and in total 520cc+1280cc = 1800cc for the whole AES. Masking all four columns of the State matrix with the T-tables implementation would require 8 · 4bytes = 32 random bytes for the masks and 4 · 8192bytes = 32768bytes for the masked tables. In total that makes with the expanded key 176bytes+32bytes+32768bytes = 32972bytes RAM. The precalculation of the tables take 4 · 1280cc = 5120cc and in total 520cc + 5120cc = 5640cc for the whole AES.

4.2.2 Transformed Masking Method
In [AG01], Akkar and Giraud do not use a lookup table for the SubBytes transformation. Instead, they calculate the SubBytes transformation. Thereby, they mask the intermediate value si,j multiplicative during the inversion part of the SubBytes transformation.

The idea of Transformed Multiplicative Masking (TMM) is to add a new multiplicative mask Yi,j to the intermediate value si,j and remove the additive mask Mi,j. The intermediate value is now masked with only the multiplicative mask. The advantage of the multiplicative mask is, that this mask can be removed after the inversion of the masked intermediate value easily by inverting the multiplicative mask Yi,j separately and then multiply that inverted mask with the inverted masked value.

If we calculate the multiplication with the binary multiplication method, as it is suggested in [FIP01], we need for the multiplication at average 16 shift operations and 8 XOR operations, which makes 16cc + 8cc = 24cc.

For an inversion of one byte from the State matrix with this method, we need four multiplications, two XOR, and two inversions (TLU). This costs us 4·24cc+2·1cc+2·2cc = 102cc. To invert the 10 · 16bytes = 160bytes for all ten rounds, we need 160 · 102cc = 16320cc.

Akkar and Giraud give a cycle count of 293500cc per encryption, which is about three times more than in the unprotected implementation we have seen in Section 2.3, which was 90500cc.

In [TSG02], Trichina et al. have published a simplified variant of the algorithm which is shown in Figure 4.3 on the right hand side. The scheme is equivalent to the original masking scheme but the authors here use the same mask value for the multiplicative masking (Y=M). For an inversion of one byte from the State matrix with this simplified method, we need only two multiplications, two XOR, and one inversion (TLU). This costs us 2·24cc+2·1cc+1·2cc = 52cc. To invert the 160bytes for all ten rounds, we need 160 · 52cc = 8320cc.

The authors in [TSG02] give us no information about the cycles or memory consumption in their paper, so we will compare it to the Akkar/Giraud version. Since it saves the half on multiplications and inversion during one SubBytes transformation, which are the expensive and time consuming functions, we assume, that this implementation needs half of the cycles, the Transformed Masking Method from Akkar/Giraud would need.

Embedded Multiplicative Masking
Both, the normal and the simplified transformed masking method mask only non-zero values. If a key byte equals a data byte, then the result of AddRoundKey is zero. This fact can be exploited in a special (first-order) differential side-channel attack, the zero-value attack. In [OMP04] it is shown, that around 64 times more measurements are needed in a zero-value attack than in a standard differential side-channel attack. This indicates that they must be avoided because they still pose a serious practical threat.

The idea from [GT03] is to embed the finite field GF(28) ' GF(2)[X]/P(x), P(X) =

(x8 +x4 +x3 +x+1)2 into the larger ring R = GF(2)[x]/(P(x)·Q(x)). Q(x) has degree k and needs to be irreducible over GF(2) and coprime to P(x). We define ρ : GF(2)[x]/P(x) → R:

ρ(U) = U + R × P (4.1)

The mask R(x) is a random polynomial = 06 of degree smaller than k. We define a mapping F on R, F : R → R with F(U) = U254 which coincides with the inversion on GF(2)[X]/P(x)3. Reducing F(U) modulo P(X) gives the inverse of U in GF(2)[X]/P(x). The additive masking can be restored before performing this final reduction step. If the polynomial R is random, the masking will not allow to reveal information even if X = 0. This zero value gets mapped to one of 2k random values. This is a mathematically elegant way to bypass the zero value problem

but it leads to very costly implementations. According to [BGK04], this leads in at least 731 AND and 766 XOR operations, including the transformations to the new field representation and back. For our cycle count this means that we need at least 731cc + 766cc = 1497cc for one byte of the State matrix and 160 · 1497cc = 239520cc for all ten rounds.

4.2.3 Perfectly Masking AES Inverter Against First-Order Side-Channel Attacks
Another way to protect the SubBytes function is presented in [BGK04]. The authors obtain a perfectly masked algorithm for AES by computing the multiplicative inverse in GF(28):

−1 = INV(si,j) =si,j−1, if si,j ∈ GF(28) (4.2) si,j

(

0, if si,j = 0

and masking all intermediate values.

They calculate the inverse si,j−1 = INV(si,j), as y = s254i,j by using the Square-and-Multiply algorithm. We discussed the algorithm and the resulting side-channel leakage in Section 3.1. The goal is to protect the value si,j, the intermediate values, and the resulting inverse during the calculation.

Let r, r0 be independently and uniformly distributed random masks and u = si,j. Now, the authors start with an additively masked value x = u+r and want to calculate y = INV(u)+r0.

Let us recall the Square-and-Multiply algorithm with the exponent e = 254 and the given values:

Algorithm 4: Left-to-Right binary Square-and-Multiply Algorithm

Input: x = (u + r),n = 255,e = {ep,...,e0}2 = {11111110}2

Output: y = xe mod n = x−1 mod n

1 y = 1

2 for k = p down to 0 do

3 y = y2 mod n

4 if ek = 1 then

5 y = y · x mod n

6 end

7 end

As we can see, the algorithm consists of two operations, a squaring in Step 3 followed by a conditional multiplication with a reduction of n in Step 5. These are the operations, which were modified by the authors of [BGK04] to ensure the side-channel resistance.

In the following two algorithms, the input si,j becomes the additively masked input u+rl,k, where r is the lth mask byte in Step k of Algorithm 4. Our goal is, to invert u. The auxiliary values f,v,w and t are needed in the algorithm.

Algorithm 5 describes the perfectly masked squaring (PMS) method. During the first step, the input ue + r1,k−1 is squared. In Step two and three, they change the old mask r1,k−1 to the new mask r1,k. The desired output is a squared input with a fresh mask u2e + r1,k.

In Algorithm 6, we see the perfectly masked multiplication (PMM) method. This algorithm gets two input values: The output of the previous step and a freshly masked value x0. In

Step 1, the authors calculate the product of the two masked values. In Steps 2-5 the auxiliary

Algorithm 5: Perfectly Masked Squaring (PMS), Source: [BGK04]

Input: x = ue + r1,k−1,r1,k Output: tk = u2e + r1,k

1 fk = x2 ; {f1 = u2e + r12,k−1}

2 w1,k = r12,k−1 + r1,k ; {auxiliary term to correct fk}

3 tk = fk + w1,k ; {tk = u2e + r1,k}

terms are calculated. The last part of the algorithm, Step 6 and 7 removes the disturbing parts of the masked product by simply adding up the two auxiliary terms w1,k, w2,k and fk.

To calculate the masked inverse of one element with the secured Square-and-Multiply algorithm, we need seven PMS with additional PMM, both including a modular reduction step, and at last one PMS with a modular reduction. Every intermediate value of the PMS and PMM is hidden by a fresh mask. So we need (7 · 3) + 1 = 22 random masks for the inversion of one element of the State matrix. For a whole AES en-/decryption we need 160·22 = 3520 random mask values each with the size of one byte. In the case of a first order side channel resistance, the authors point out in [BGK04, Section 5.4], that the number of random masks can be reduced to only three masks (r1,r2,r3), by using the same masks in each step k.

To calculate the clock cycles for the inversion, we recall the clock cycle for: addition

(1cc), multiplication (24cc) and squaring (2cc). By counting the operations during the algorithm we get: 2 · 2cc + 2cc = 6cc (two squaring and two additions), for one PMS, and 4 · 24cc + 2 · 1cc = 98cc (four multiplications and four additions) for one PMM operation. So we need:

• (7 · 6)cc = 42cc for the PMS operation

• (7 · 98)cc = 686cc for the PMM operation

• 3cc for loading the random intermediate mask r1,k,r1,k−1,r2,k, • 1cc for loading the masked intermediate value from RAM, and

• 1cc for saving the result.

In total we need (42 + 686 + 3 + 1 + 1)cc = 733cc for the inversion of one element from the State matrix. For a whole AES en-/decryption we need 160 · 733cc = 117280cc for the inversion part of the SubBytes transformation with three masks.

4.2.4 Combined Masking in Tower Fields
In [OMP04] and [OMPR05], Oswald et al. present a masking scheme, which is based on the composite field arithmetic we discussed in Section 2.3.8. They call it combined masking in tower fields (CMTF). The idea is, that the inversion in GF(22) is a linear transformation, which gives us the potential to track the changes which are applied to the mask value m during an inversion.

To securely calculate the inversion of the current intermediate byte si,j from the State matrix, we first mask it additively with the mask m to obtain a = si,j ⊕ m. Next we transform a to its composite field representation. Please recall, that every element of GF(28) can be represented as a linear polynomial over GF(24). The finite fields, which we will use are the same as before (Section 2.3.8):

GF(28) ' GF(2)[x]/(x8 + x4 + x3 + x + 1), GF(24) ' GF(2)[x]/(x4 + x + 1).

(4.3)

(4.4)

Note that operations ⊕ and ⊗ which we use here, are for coefficients from GF(24). We have introduced them in Section 2.3.8.

After the mapping, the value that needs to be inverted is represented by (ah⊕mh)x+(al⊕ml) instead of ahx + al with ah,al,mh,ml ∈ GF(24). Both values, ah and al, are now additively masked. It is easier to follow the masked inversion if we recall the normal inversion in the composite field:

al0 (4.5a) ah0 = ah ⊗ d−1 (4.5b) al0 = (ah ⊕ al) ⊗ d−1 (4.5c) d a2l (4.5d)

We can calculate the masked inverse, by following the steps from (4.5a) to (4.5d), but this time we use the masked values:

) (4.6a)

ah0 mh0 (4.6b) al0 ml0 (4.6c) dmh (4.6d)

Now we need to securely calculate the inverse for d⊕mh,d ∈ GF(24). So far, we have only shifted the problem from the bigger field down to the smaller one. Because GF(28) is a tower field over GF(22), we can shift the computation of the masked inversion for d down to GF(22) the same way as we did it in (4.6a) to (4.6d) with the corresponding field operations. In this field, the inversion is a linear operation and we keep track of the mask value. The inversion operation preserves the masking value, because (d ⊕ m)−1 = (d ⊕ m)2 = d2 ⊕ m2 in GF(22).

This approach has been presented in [OMPR05] to work in tower fields which are applicable on hardware implementations but inefficient to implement in software on a 32-bit processor, since during the field transformations we work on every individual bit of all bytes in the State matrix. Instead, all transformations above can be precomputed and stored in tables. This method will be discussed in the next subsection.

Combined Masking in Tower Fields with Precomputed Tables
To compute the masked inversion in GF(28), Schramm et al. presented in [OS05] a method which is based on the inversion in composite fields, as we discussed it in the last section. The difference is, that the calculations of the intermediate values are mapped to a sequence of table lookups. Therefore the authors first map the masked intermediate value a = si,j ⊕ m down to its field representation with coefficients from GF(24) as done before. With the use of the following four tables:

Td⊕

Td2 : ((x ⊕ m),(y ⊕ m0)) →7 ((x ⊕ m) ⊕ (y ⊕ m0 )) × (y ⊕ m ) Tm : ((x ⊕ m),(y ⊕ m )) →7 (x ⊕ m) × (y ⊕ m )

Tinv : ((x ⊕ m),m) 7→ x−1 ⊕ m

which can be precalculated and stored in ROM, the masked intermediate values of the formula (4.6b) to (4.5d) are calculated. Note, that all tables take two elements of GF(24) as inputs and give an element of GF(24) as output.

d ⊕ mh = Td1(ah ⊕ mh,mh) ⊕ Td2((ah ⊕ mh),(al ⊕ ml))

⊕ Tm((ah ⊕ mh),ml) ⊕ Tm((al ⊕ ml),mh) ⊕ Tm((mh ⊕ ml),ml). (4.7)

We compute the masked inversion of d with the following table lookup:

d−1 ⊕ mh = Tinv(d ⊕ mh,mh). (4.8)

Now we can compute ah0 ⊕ mh0 and al0 ⊕ ml0 from (4.6b) and (4.6c) the following way:

ah0 mh,d−1 ⊕ ml)

⊕ mh ⊕ Tm(d−1 ⊕ ml,mh) ⊕ Tm(ah ⊕ mh,ml) ⊕ Tm(mh,ml), (4.9)

and

al0

⊕ ml ⊕ Tm(d−1 ⊕ mh,ml) ⊕ Tm(al ⊕ ml,mh) ⊕ fah ⊕ mh ⊕ Tm(mh,ml). (4.10)

To map GF(28) elements to GF(24) × GF(24) elements and vice versa, two additional tables with a size of 512bytes are needed.

As the authors state and we can count above, the total costs of a masked inversion this way are 14 table lookup operations and 15 XOR operations for an inversion, which results in 14·2cc+15cc = 43cc. For an AES encryption this results in 160·43cc = 6880cc additional for the new masked SubBytes transformations.

To store the six tables we need 4 · 256bytes + 2 · 256bytes = 1536bytes of ROM.

4.2.5 Masking the MixColumns Transformation
The MixColumns transformation can leak timing informations because it uses the xtimes operation. Xtimes contains a conditional reduction by {11B}16 if the result of the operation is larger than eight bits, see Section 2.2.

To prevent the timing information leakage and power leakage, the State S has to be additively masked with the mask M before the transformation, SM = S ⊕ M, where M is a

4 ×0 4 matrix created of four random bytes M1,...,M4, see (4.11). After the transformation0 ( SM = MixColumns(SM)) the mask M can be removed from the transformed State SM by adding the transformed mask M0 = MixColumns(M), so S.



M1 M2 M3 M4 M1 M2 M3 M4

M =(4.11)

M1 M2 M3 M4
M1 M2 M3 M4

Masking the MixColumns transformation needs one additional MixColumns transformation

(84cc) to get M’ if the same mask M is used to mask all MixColumns transformations during an whole AES run. In addition, four random bytes are needed for the mask M, and 16bytes are needed for the transformed mask M0 = MixColumns(M).

4.3 Countermeasure: Randomization
Randomizing the execution of the algorithm provides additional resistance against power analysis attacks. Herbst et al. suggest in [HOM06] two ways to introduce randomness in the execution of the algorithm. They propose to shuffle the sequence of operations which work on the elements of the State matrix and to add dummy rounds (or parts of a round) during the normal ones.

Then we proceed with the next element. As we work on each intermediate value of the State matrix, this version of shuffling is applicable on 8-bit implementations. With shuffling, the probability that one specific intermediate value is processed at a specific point in time is 1/16. Theoretically, we can shuffle the whole State matrix with 16! possibilities. But that would not increase the security against a first-order DPA with the same factor since here it

is important that the operations with the intermediate values do not occur on the same fixed time.

The randomization can be broken with windowing, published by Clavier et al. [CCD00]. Windowing can be seen as performing multiple second-order DPA attacks in parallel and then combining the power traces for the mask processing with each of the points in time, where the targeted masked values can occur. So shuffling all 16! possibilities would not bring us more security but would increase the implementation costs. For a successful side-channel attack 16 times more traces are needed with this countermeasure than with an unprotected implementation.

Another way to increase the randomization is to add dummy intermediate values d to the State matrix which will be operated on. Those operations must not be distinguishable from real operations of the algorithm. The problem with added wait states or, for example, just added random NOPs (No Operation) is that they can be easily removed by analyzing a single power trace. The authors in [HOM06] therefore suggest to insert dummy rounds or parts of a round on some dummy State bytes, see Figure 4.5. The combination of shuffling and dummy operations make it very difficult for the attacker to locate the attack point in the first two and the last two rounds. These are the rounds, which need our attention, because they are vulnerable during a known/chosen plaintext/ciphertext attack. After these rounds, every value from the State has been subjected to three AddRoundKey transformations and depend on sufficient many key bytes, to prevent an DPA attack [HOM06]. The probability, that a certain intermediate value occurs on a specific point in time is p = 1/(D + 1), where D is the number of additional dummy cycles. If we chose to insert 15 dummy operations, the probability is 1/16, which would increase the number of needed traces for a successful SCA also by 16 times. Note, that we have to see the normal consecutive State bytes as one value compared to the dummy bytes, to calculate this probability.

With D additional dummy rounds and shuffling, which is recommended by the authors, the number of required traces for a successful mounted SCA increases by (D + 1) · 16. With, e.g., D = 5 dummy rounds randomly inserted at the start and the end of the algorithm, 96 times more measurements would be required for a successful DPA attack.

The number of added dummy operations D is a fixed value. The random number r, 0 ≤

r ≤ D partitions the D dummy operations into two parts. The dummy operations from 0 to r − 1 will be executed before and the dummy operations from r to D − 1 will be executed after the normal operations, like it is depicted in Figure 4.5 for r = 2 and D = 5.

4.4 Comparison
In this chapter we gave a short overview about the discussed (low-cost) software countermeasures against timing and power attacks. Now we estimate the number of clock cycles for a protected AES implementation with the selected countermeasure based on our 32-bit optimized implementation with a clock cycle count of 1732cc. The clocks cycles for the affine transformation (Section 2.3.3) with 6560cc are added to those schemes, which have to calculate it. We get the following runtimes estimated for the countermeasures:

• Additive Masking: 1732cc + 1600cc = 3332cc,

• Transformed Masking: 1732cc + 6560cc + 16320cc = 24612cc,

• Transformed Masking simpl.: 1732cc + 6560cc + 8320cc = 16612cc,

• Embedded Masking: 1732cc + 6560cc + 239520cc = 247812cc,

• Perfect Masking: 1732cc + 6560cc + 117280cc = 125572cc,

• CMTF with TL: 1732cc + 6560cc + 6880cc = 15172cc.

Table 4.2 contains the estimated memory consumption (ROM and RAM), the number of random bits, and the runtime in clock cycles (cc) for the discussed masking methods. Note that we are only interested in a rough estimate to select the most appropriate countermeasure. The value on the left side denotes the estimated result for using one random byte to mask all

16 elements of the State matrix. The value on the right side is the estimated value using four random bytes to mask a whole column of the State matrix. All RAM estimates include the 176-byte expanded key array for the key scheduling.

With 3332cc and 434bytes RAM, Additive Masking is applicable when used with one mask. With four different masks the RAM consumption of 1208bytes, which is mainly caused by the four masked S-boxes, is too big for our needs. The runtime 5440cc +1732cc = 7172cc is still acceptable.

In a 32-bit T-tables implementation, which is Additively Masked with one mask pair, the RAM consumption of 8376bytes and 32976bytes for four masks is also too big.

4.4 Comparison

The Transformed Masking and simplified Transformed Masking Method both multiplicatively mask an element of the State matrix during the SubBytes transformation. The simplified Transformed Masking Method is the improved one of the Transformed Masking Method, in terms of needed operation and random bytes for masking. It is very interesting regarded to the RAM consumption of 180bytes. Compared to the Additive Masking implementation, the runtime of 16612cc is also acceptable. A big pro of the Transformed Masking approach is the constant runtime, which is independent from the number of mask values.

The Perfect Masking in line six needs 179 bytes RAM (three for the masks and 176 for the expanded key). Since it calculates the inversion transformation and the affine transformation on the fly, it has a very big runtime, which is the main criteria against it.

The same constant runtime characteristic holds for the Combined Masking in Tower Fields in row seven. The big ROM consumption of 1536bytes mainly results from a couple of constant tables, which are used for the transformation between the different representation systems and for calculating the masked inversion, see Section 4.2.4.

The big part of the runtime from the Additive Masking Method arises from the creation of the masked S-boxes. If these masked S-boxes have been created, the implementation is even as fast as the unmasked implementation since then, the transformation is only a table lookup. Transformed Masking and the improved versions of it would only pay off if a high number of random masks is used. But using a high number of masks does not lead to a practically safer implementation. This is because a higher order DPA works whenever the same mask values occur on two different points in time. Since all masks have to be created at a specific point in time and are used to mask an intermediate value at some point later, there are always two points in time that allow a higher order DPA attack.

We decided to use the Additive Masking Method and combine it with randomization (shuffling and dummy states). By using the masking approach, we fulfill the requirements needed against a first-order DPA. In addition, we will mask the MixColumns transformation to prevent timing analysis attacks and use masking and shuffling to increase the resistance to a second-order DPA.

4.5 Low-Cost SCA-Resistant AES Software Implementation
On the Applied Cryptography and Network Security Conference (ACNS’07), Tillich, Herbst and Mangard [THM07] presented a low-cost SCA-resistant AES implementation, which takes advantage of the masking, shuffling and execution of dummy cycle approach. In Figure 4.7, the program flow of the randomized and masked AES is depicted. The randomization areas are called Randomization Zone 1 and Randomization Zone 2. In these two zones the shuffling and the dummy operation approach is applied. Because the randomization is very costly, the authors keep the Randomization Zones as short as possible. During the Randomization Zones, the following operations are randomized: The AddRoundKey operation, since the 16bytes of the State matrix S are processed independently. The same holds for the (masked) SubBytes transformation. During the MixColumns transformation, the sequence of the processed columns is randomized.

The first two and the last two rounds of the AES implementation are additionally protected by masking. The current mask of the State S is shown on the right in Figure 4.7. The SubBytes transformation is masked additively with the mask byte M, see Section 4.2.1.

To protect the MixColumns transformation, the authors use four mask bytes, M1 to M4, see Section 4.2.5. Every mask byte M1 to M4 is used to mask the corresponding column

4.5 Low-Cost SCA-Resistant AES Software Implementation

of the State S. To remove the four masks after the MixColumns transformation, the four mask bytes are also transformed, (M10,M20,M30,M40) = MixColumns(M1,M2,M3,M4). To remask the State, the AddRoundKey transformation is used whenever possible. The ShiftRows transformation is excluded from the Randomization Zones because it works on the whole State matrix.

We will adapt this scheme. The Additive Masking implementation with one masked Sbox needs needs 3332cc. To calculate the four masks M10,M20,M30 and M40 we need one additional MixColumns transformation (84cc). We implement one dummy operation to operate on one column of the State matrix. This is the smallest part on which all transformations can work and it fits to our 32-bit architecture. During one dummy operation the following transformations are executed AddRoundKeyColumn, SubBytesColumn, MixColumn,

AddRoundKeyColumn and SubBytesColumn. This takes 4cc + 20cc + 21cc + 4cc + 20cc = 69cc. For shuffling, we only change the indices of the current column and current row.

We think this runtime is negligible. So the protected implementation has a runtime of 3332cc + 84cc = 3416cc with no dummy operations and 3416+D · 69cc with D dummy operations.

The Additive Masking scheme needs 256bytes RAM for the masked S-box and 16bytes for the four words M10 to M40. In addition six random bytes for the masked values M,M0,M1 to M4 one random byte r to divide the number of dummy operations during the randomization zones, 16 random bytes for one dummy State and five random bytes for shuffling the five AddRoundKey, four SubBytes and three MixColumns transformations in the two randomization zones. That makes in total (176 + 256 + 16 + 6 + 1 + 16 + 5)bytes = 476bytes RAM. The masked AES implementation is roughly two times slower than their referenced unmasked AES implementation. The randomized implementation takes about 11845 clock cycles when no dummy operations are added. This is about 2.7 times slower than the unprotected AVR-based AES implementation. With D dummy operations added, their randomized AES implementation needs 11845 + D · 240 clock cycles. Our estimated protected AES encryption increases by a factor of two.

Screambowl · Jan 19, 2015

in 4th semester we had one optional module power analyses on cryptography. Which some how gives an idea of the size of message and keys used.

Full of Matrix and Set theory.

SvenSvensonov · Jan 19, 2015

Screambowl said:
in 4th semester we had one optional module power analyses on cryptography. Which some how gives an idea of the size of message and keys used.

Full of Matrix and Set theory.

Yep, scary thing is, I actually left out the matrices since there we so many of them that I couldn't keep each post under ten photos! You understand this? Didn't think I'd find another person who knew anything about electronic security. There are a couple people with an electronic background, I think @Donatello is one of them, but I didn't think too many others were. Did you go in-depth into this in your schooling, or was it just the basics?

Screambowl · Jan 19, 2015

SvenSvensonov said:
Yep, scary thing is, I actually left out the matrices since there we so many of them that I couldn't keep each post under ten photos! You understand this? Didn't think I'd find another person who knew anything about electronic security.

I am from electronic but this is more related to programming, and power analyses is more about reading graphs on oscilloscope. Then you use Fourier series to calculate the amplitude in respected time interval and make a matrix of all the values and then co relate it with given set of keys. It's f*ckin time consuming , only sounds cool when heard

After the given set of keys are 100% confirmed, then it is sent to programmers and mathematicians who work on decoding using matrices , probability and other set theories through high speed computers.

Slav Defence · Jan 19, 2015

SvenSvensonov said:
Chapter 3 Side-Channel Analysis
In the classical “cryptanalysis”, a cryptographic algorithm (cipher) is an abstract mathematical object, which turns some plaintext input data into encrypted output data, parameterized by a key. Here, Kerckhoffs’ law [Ker83] is a main rule for everybody. It means, that the security of a cryptosystem should not depend on the secrecy of the algorithm. All security is in the cryptographic key, which therefore needs to be carefully protected. In classical cryptanalysis the security of a cipher is investigated by mathematical proofs and analysis of the algorithm.

However, a cipher has to be implemented in a program, that will run on a given processor in a specific environment like a smartcard or embedded system. These systems have specific physical characteristics. If these physical characteristics deliver information about the secret key involved in the operation it is called a side-channel. Typical side-channels are:

• the time, an algorithm (or part of it) needs for execution,

• the power consumption of the device,

• the electromagnetic field of the device during its processing, and

• the specific behavior after injecting a fault.

The side-channel attacks attempt to recover the secret key by parts. By only looking at small parts of the key, exhaustive key search on these sub-keys becomes possible. This process is then repeated until the full key is found. In the following sections, we take a closer look at the first two side-channels and how they can be used to obtain information about the secret key.

3.1 Timing Analysis
If the runtime of a cryptographic device is key dependent, it can be used as a side-channel. Kocher introduced this idea in [Koc96]. A formalized description can be found in [HMS00], we will give here a short overview. For example, we want to discover the private exponent e in the following calculation:

y = xe mod n. (3.1)

We assume, that the calculation is done with the Square-and-Multiply method (left-to-right binary exponentiation) which is described in Algorithm 2. We see, that the running time of the algorithm directly depends on the bits e = (ep,...,e0) of the secret exponent where p+1 denotes the total number of bits in the exponent. If a bit of the exponent is one, then an

25

additional multiplication by the base x and reduction by n is performed, which will take a bit more time.

Let us assume, that we already know the first bits ep,...,em−1 of the exponent. Bit em will be attacked. We measure the runtime Ti,i = 1,...,N, for N modular exponentiations with random (but known) base xi:

Ti = T(xei mod n). (3.2)

Next we assume that the unknown bit em is zero. We compute the execution time Ti,m0 via an emulation of the runtime until bit em is reached. Now we build the empirical variance1 V0 := Var(Ti − Ti,m0 ). We do the same for em = 1 and build the empirical variance V1 := Var(Ti − Ti,m1 ). Then we decide whether the exponent bit em was zero or one:

em =0 if V0 < V1, (3.3)

1 else.

This method works, because a wrong guess of em leads to an increase of the empirical variance, if we can exactly simulate the execution time. Note that we do not require new measurements for the other unknown bits. We just repeat the steps for the next unknown bit of the exponent e.

We see, that this attack is applicable, since intermediate values of the algorithm we attack depend on small parts of the key and lead to a different timing. This fact is utilized in other attacks, too. Since we know of the existence of those attacks, we will show in the following a method to verify, if our implementation is vulnerable to timing attacks.

General countermeasures against timing attacks are to modify the algorithm in a way, that it:

• always has a constant, thus key-independent, runtime,

• or masking the data, for example, with Chaum’s blinding method presented in [Cha82].Here the exponentiation is done with a value ˜ei = λei, λ random, unknown to the attacker. Thus, the attacker cannot simulate the execution time Ti,mb .

•

3.1.1 Evaluation of Timing Analysis Resistance
The Square-and-Multiply algorithm has demonstrated, that cryptographic algorithms can leak critical information during runtime. This example showed a very clear key-dependent runtime behavior. Often this behavior is not so clear, i.e., there could be even very small timing differences in the so-called Square-and-Always-Multiply Algorithm, a “fixed” version of Square-and-Multiply where a multiplication is performed regardless of the exponent bit. The

“useless” result of this multiplication is then discarded. Obviously, in this case the timing difference between em = 0 and em = 1 is very small (if there is any). In order to detect such small variations we cannot rely on simple mechanisms, we need a thorough statistical background. Therefore, to verify if our AES implementation leaks key dependent information, some statistical tests are performed.

In the following, we assume that the running time of the algorithm is a random variable X which is distributed like a Gaussian distribution2, i.e., X ∼ N(µX,σX2 ) with mean µX and variance σX2 .3

For the statistical analysis, we will perform runtime measurements with different plaintexts and different keys. We define some measurement sets with, for example, N = 1000 measurements per set. A set contains the plaintext, key and running time for an encryption run. Since the runtime varies and the measurement process is not perfect, there are always differences in the measured time.4 The goal of the statistical tests, to be introduced in a minute, is to verify if these outliers or differences are systematic or not, i.e., to test if we can distinguish the measurement sets at hand of their runtime.

3.1.2 Two Sample T-Test
The density function of a t-distributed random variable X with m ∈ N degrees of freedom5 is given by

x m+1

fX,m for x ∈ R, (3.4)

m

where Γ denotes the Gamma function, which is defined the following way

Γ(1) = 1, Γ(n + 1) = n! for n ∈ N

and can be generalized to arbitrary real arguments x ∈ R. For sufficiently large m, the density function of the t-distribution approximates the Gaussian normal distribution N(0,1), i.e., fX,m −−−−m→∞→ N(0,1).

If X is an N(0,1)-normal distributed random variable and Y is a chi-squared (χ2) distributed random variable with m degrees of freedom and both random variables are independent, then the random variable

X T = qm

Y

is t-distributed with m degrees of freedom.

Now we can formulate the two sample t-test: Let Xi,i = 1,...,m, and Yj,j = 1,...,n, be independent and identically distributed (i.i.d) random variables, respectively, with

Xi ∼ N(µX,σX2 ) and Yj ∼ N(µY ,σY2 )

with equal but unknown variances σX2 = σY2 . Under the hypothesis H0 : µX = µY , the test characteristic

X Y mn

T = − r (3.5a)

SP m + n

with

SP2 = 1 − (m − 1)SX2 + (n − 1)SY2 , (3.5b) m + n 2

m

SX2 = 1 (Xi − X)2, (3.5c) m − 1 Xi=1

n

SY2 = 1 (Yj − Y )2 (3.5d) n − 1 jX=1

has a t-distribution with m + n − 2 degrees of freedom.

To test, if both random variables come from the same population, we make the following two hypotheses:

H0 : µX = µY and H1 : µX =6 µY . (3.6)

With a significance level α, for example, α = 0.05, the hypothesis H0 can be rejected, if

|T, (3.7)

where tm+n−2;1−α2 denotes the 1 − α/2 quantile of the t-distribution with m + n − 2 degrees of freedom.

Otherwise we have to say, that we have insufficient evidence to reject the null hypothesis.

3.1.3 Fisher’s F-Test
With the t-test we test the mean values of our measurements under the assumption that the unknown variances are the same. To statistically test the observed variances, the so-called Fisher F-test can be used which will be described in this section. With the F-test we can verify statistical moments of order two, which are usually much smaller and more difficult to detect.

Let us assume two normally distributed random variables. To test if the variance of one random variable equals the variance of another random variable with unknown, but not necessarily equal mean values, we can use the F-test.

3.1 Timing Analysis

The density function of an F-distributed random variable X with (m,n)-degrees of freedom, m,n ∈ N, is given by

m n

Γ(mΓ(+2m2n))Γ(m n22)n 2 · xm2 −m1+2 n for x >≤ 0, (3.8) fX(x) =(n+mx)

0 for x 0.

If X and Y are two independent χ2-distributed random variables with m and n degrees of freedom, X ∼ χ2m,Y ∼ χ2n, then the random variable

X/m

T =Y/n

is F-distributed with (m,n)-degrees of freedom.

Let Fm,n;α be the quantile with the order α. For large n the quantile of the F-distribution with (m,n)-degrees of freedom converges to the quantile of the χ2-distribution with m-degrees of freedom multiplied with the factor 1/m:

Fm,n;.

Now we are able to formulate Fisher’s F-test: Let Xi,i = 1,...,m, and Yj,j = 1,...,n, be independent and identically distributed (i.i.d) random variables, respectively. We assume, that the random variables are normally distributed

Xi ∼ N(µX,σX2 ) and Yj ∼ N(µY ,σY2 )

with µX and µY unknown. Under the hypotheses H0 : σX2 = σY2 , the test statistic

S2

T = SXY2 (3.9a)

with

SX2 = 1 m (Xi − X)2, (3.9b) m − 1 Xi=1

SY2 = −1 Xn (Yj − Y )2 (3.9c) n 1

j=1

is F-distributed with (m − 1,n − 1) degrees of freedom. We make the following hypotheses

H0 : σX2 = σY2 and H1 : σX2 6= σY2 . (3.10)

With a significance level α, for example, α = 0.05, we can reject the hypothesis H0, if

|T. (3.11)

Otherwise we have to say, that we have insufficient evidence to reject the null hypothesis.

3.2 Simple Power Analysis
At Crypto ’99, Kocher et al. [KJJ99] introduced the so-called power analysis attacks. Every hardware device which performs cryptographic operations can leak information by its power consumption. A common example of an algorithm which might be vulnerable to a Simple Power Analysis (SPA) is the Square-and-Multiply algorithm which was discussed above. Because the algorithm goes sequentially through the bits of the exponent, the difference whether a multiplication takes place or not can be directly observed in the power trace. The power trace can be obtained by measuring the power consumption. This can be done by adding a small resistor in series with the power or ground of the attacked device. The current, which can be derived from the voltage across that resistor divided by the resistance, yields to the power trace. The length of the power trace is the product of the time span in seconds with the sampling rate. A simple power analysis is typically performed on one single power trace. However, multiple power traces can be recorded while the device performs the same operation on the same data several times. Computing the mean of those power traces will reduce the noise. Messerges has observed several sources for noise in [Mes00]. Sources for noise are, for example, at the clock edges, during the quantization and algorithm noise.

Building a good measurement setup should not be underrated in side-channel attacks. From an attackers viewpoint it is important, to avoid noise as much as possible. Also, if multiple traces for the same operation were recorded, they have to be aligned exactly in time. This can be achieved with an exact trigger and alignment techniques. Hermann Seuschek has analyzed the different types of noise in his master thesis [Seu05] and gives a detailed explanation on how to minimize these effects for an optimal measurement setup.

3.3 Differential and Correlation Power Analysis

The differential power analysis (DPA) exploits the relationship between the processed data and the power leakage of multiple traces with different input data. For a DPA, the detailed description of the implementation of the algorithm is not needed, which makes it a very powerful attack. It was suggested by Kocher [KJJ99] and later formalized by Messerges et al. in [MDS99]. The four following requirements, which are necessary for a first-order DPA attack on AES, can be formulated:

1. The attacker needs physical access to the device to measure the power consumptionwhile it performs several en-/decryptions;

2. We need to know either the plaintext or the ciphertext of the algorithm (attack the firstor the last rounds of the algorithm), we refer to it as the processed data vector d = (d1,...,di,...,dD), where di denotes the plaintext or ciphertext of the ith decryption or encryption run and D is the total number of runs.

3. An intermediate result that occurs during a de-/encryption which needs to be a functionof the processed data di and a small part of the key, which is denoted as k∗.

4. The power consumption of the device has to depend on the processed data. We refer to the measured power trace that corresponds to the processed data di as ti = (ti,1,...,ti,T), where T denotes the length of the traces. We obtain a trace for each run so this results in the power trace matrix T with size D × T.

3.3 Differential and Correlation Power Analysis

Detailed instructions on how to perform a DPA can be found in [MOP07]. The basic idea for a DPA is, to make a hypothesis about one ore more bits of the secret key. Therefore, we chose an intermediate function, which depends on the processed data di and a small part of the key k∗. Since the small part of the key is unknown, all possible values k∗ = (k1,...,kj,...,kK) for k∗ have to be processed, where the index j denotes the jth possible key and the index i denotes the ith en-/decryption. Now the hypothetical intermediate values vi,j can be calculated for all D de-/encryption runs and for all K possible key hypotheses. This results in a matrix V with size D × K, where the column j contains the intermediate results that have been calculated with the key hypothesis kj. Now, we can derive the hypothetical power consumptions H from the hypothetical intermediate values via a power model. The better the power model matches the actual power consumption characteristics of our hardware device the less power traces are needed. The most common power models are:

• the bit model (LSB-Least Significant Bit):

hi,j = LSB(vi,j), • the Hamming-weight (HW) model:

(3.12)

hi,j = HW(vi,j), • the Hamming-distance (HD) model:

(3.13)

hi,j = HD(vi,j,di), and

(3.14)

• the zero-value model. Here we assume, that the power consumption for the data value0 is lower than for all other data values:

hi,j =0 if vi,j = 0 (3.15)

1 if vi,j = 06 .

The final step is to compare the measured power traces T with our hypothetical model H. In [KJJ99], the authors measure correlations indirectly by computing the difference of means. In this thesis, we describe how the attack is performed using the Pearson correlation coefficient. A detailed description is in the Appendix. We estimate each value ri,j with ri,j =PdD=1(h−d,i − h¯i·)P· (td,j − t¯j)− i = 1,... ,K, j = 1,...,T. (3.16)

(hd,i h

Dd=1 ¯i)2 Dd=1 (td,j t¯j)2

With the resulting matrix R of correlation coefficients, we can determine if the hypothetical power consumption H and the measured power traces T are correlated. In Equation (3.16), h¯i and t¯j denote the mean values of the columns hi in H and tj in T. Every row in matrix R is the correlation coefficient over time for a specific key hypothesis. If the key hypothesis (“key guess”) was wrong, then the hypothetical power consumption and the power trace should be uncorrelated; i.e., we see just noise. If the key guess is correct, we should see a peak in the correlation curve.

A second order DPA attack considers multiple points simultaneously within the same power trace t. These points belong to two different intermediate values, e.g., a mask value m and the hidden intermediate value d⊕m. The knowledge of either m or d⊕m alone is not of any use to the attacker. But if the attacker correlate both values, he can gain some information on d. A detailed description of higher order DPA can be found at [JPS05].

Chapter 4 Side-Channel Resistant AES Implementation
A main criteria for the feasibility of a power analysis attack is, that the part of the cipher which is attacked only depends on a small amount of key bits. AES offers exploitable functions like the first and the last AddRoundKey. They both depend only on a small amount of key bits. AES1 provides another interesting characteristic which makes a power analysis attack even more applicable: The non-linear SubBytes transformation. This is because this transformation is non-linear and therefore if there is one bit wrong in the key hypothesis several bits of the output differ from the calculated intermediate result. On average, half of the bits calculated based on an incorrect key hypothesis differ from the actual intermediate result – independently from the number of incorrect bits in the key hypothesis. This improves the ability of finding the right key during a SCA enormously. Figure 4.1 depicts the initial AddRoundKey operation (XOR) before the first round and the first SubBytes transformation for the byte s0,0 of the State matrix in round one. So an important task will be to protect

s0,0 (8-bit from the plaintext, known)

k (8-bit from the key, unknown)

s00 ,0 (8-bit result)

Figure 4.1: Initial AddRoundKey followed by the first SubBytes transformation in round one for one byte of the State matrix

the SubBytes transformation in the side-channel resistant implementation.

4.1 Random Bits
All following countermeasures need random bits. The Bundesamt für Sicherheit in der Informationstechnik (BSI) describe four classes for the quality of deterministic random number generators in [BSI01]. From the “weak” class K1, where the random numbers containing no identical consecutive elements with a high probability, to the “strong” class K4, where it should be impossible for an attacker to calculate or guess from an inner state of the generator any previous numbers in the sequence of random numbers.

To mask the SubBytes transformation, it is very important to generate high class random numbers for the mask values, because if the distribution of the random mask has a bias with respect to the uniform distribution, the masked value leaks information, which can be exploited during a DPA, see [CGPR08].

So the random number generator should at least fulfill the requirements of a K2 random number generator for the masked values. This class ensures, that the random numbers passed statistical tests and are uniformly distributed. For generating those qualitative good random values, the IAIK suggests the Grain-80 stream cipher, which can also be used as random number generator. The Grain-80 is optimized for hardware implementations. In practice the AES implementation will get the random numbers from outside (a hardware random number generator), Grain-80 is only be used to generate random numbers for the final measurements on the development board.

4.2 Countermeasure: Masking
During a power analysis attack, the attacker has to make a hypothesis about the power consumption of the device performing the attacked operation. Our goal is, to make the real intermediate values of the executed transformation independent from their power consumption. This can be realized by masking them. With masking we conceal every intermediate value with a random value m, the so-called mask. The attacker does not know the masking values and thus cannot compute meaningful hypothesis on the power consumption. As a result, the side-channel leakage of all intermediate, key-dependent values, does not correlate with the corresponding unmasked value. We can easily reproduce the propagation of the mask throughout the linear parts of AES in order to remove the mask at the end of the cipher. For the non-linear parts, namely the SubBytes transformation, we require considerable more effort for the mask correction after this transformation.

In the following subsections, we will discuss the different masking schemes in terms of space consumption, security, and speed. For all masking schemes, the runtime of the key scheduling is excluded. The 176-byte RAM consumption for the expanded round key is included in the estimated RAM value.

4.2.1 Additive Masking
Additive masking masks a value x with a random value m by adding (XORing) them, xm = x ⊕ m. The mask can be removed by adding the mask value again, x = xm ⊕ m. This is an easy but very efficient masking scheme, since the masked value xm is not distinguishable from a random value if the mask value m is a random value.

The “problem” with AES is, that the SubBytes transformation is a non-linear operation, which means SubBytes(si,j ⊕ m) =6 SubBytes(si,j) ⊕ SubBytes(m). So additive masking cannot be easily applied “out of the box”. The mask value m cannot be easily removed from the intermediate value si,j0 after the transformation without modifying the SubBytes transformation.

For additive masking, the whole lookup table (S-box) has to be recomputed. The input mask m masks the index and the output value mx masks the actual data according to Algorithm 3. This masking method was published first in [Mes00].

Algorithm 3: Computation of the masked AES SubBytes transformation as proposed in

[Mes00]

Input: m,mx

Output: MaskedSubBytes(si,j ⊕ mx)=SubBytes(si,j)⊕m

1 for i = 0 to 255 do

2 MaskedSubBytes(i ⊕ mx)=SubBytes(i)⊕m;

3 end

With one mask value pair m and mx for all intermediate values, we need additional

256bytes RAM for the new masked S-box and two random bytes for the input mask m and the output mask mx. With the expanded key that makes 176bytes + 256bytes + 2bytes =

434bytes.

The precalculation is made up of 256 additions (XOR) to mask the table index, 256 table lookups (TLU) to read the unmask S-box values from ROM, 256 additions (XOR) to mask the table entries and, finally, 256 write operations to store the masked table in RAM. This takes 256cc + 2 · 256cc + 256cc + 256cc = 1280cc for computing the new S-box.

With four mask value pairs a whole column of the State matrix can be masked. The advantage is, that during the MixColumns transformation, which works on a whole column (4bytes) of the State matrix, all intermediate values are mask with different masks. Therefore we need eight random bytes and 4 · 256bytes = 1024bytes for the new lookup tables, which results (with the expanded key) in 176bytes + 8bytes + 1024bytes = 1208bytes RAM. The precalculation overhead for four different masks leads to 4 · 1280cc = 5120cc for computing the new S-boxes. We can assume the runtime of the normal SubBytes transformation, since during an AES encryption, we just have to index the correct masked table for the TLU. In addition we have to mask the intermediate value from the State matrix with the input mask before and with the output mask after the transformation. That makes additional 2cc per byte of the State matrix and 2 · 160cc = 320cc for the whole encryption. Summarizing, this scheme needs for all ten rounds 320cc + 1280cc = 1600cc with one masked S-box and 320cc + 5120cc = 5440cc with four masked S-boxes.

To mask the 32-bit T-tables implementation we would require 32-bit mask values, since we now operate on whole rows, instead of bytes from the State matrix. Masking the tables with one input– and one output–32-bit mask requires eight random bytes for the masks and 8192bytes RAM for the eight masked T-tables (four for the first nine rounds and four for the last round). In total it makes (with the expanded key) 176bytes +8bytes+8192bytes = 8376bytes. The precalculation is made up of 256 additions (XOR) to mask the 32-bit table index, 256 table lookups (TLU) to read the unmask values from the tables (ROM), 256 additions (XOR) to mask the table entries and finally 256 write operations to store the masked table 32-bit values in RAM. This costs us 256cc+2·256cc+256cc+256cc = 1280cc for the precalculation of the masks and in total 520cc+1280cc = 1800cc for the whole AES. Masking all four columns of the State matrix with the T-tables implementation would require 8 · 4bytes = 32 random bytes for the masks and 4 · 8192bytes = 32768bytes for the masked tables. In total that makes with the expanded key 176bytes+32bytes+32768bytes = 32972bytes RAM. The precalculation of the tables take 4 · 1280cc = 5120cc and in total 520cc + 5120cc = 5640cc for the whole AES.

4.2.2 Transformed Masking Method
In [AG01], Akkar and Giraud do not use a lookup table for the SubBytes transformation. Instead, they calculate the SubBytes transformation. Thereby, they mask the intermediate value si,j multiplicative during the inversion part of the SubBytes transformation.

The idea of Transformed Multiplicative Masking (TMM) is to add a new multiplicative mask Yi,j to the intermediate value si,j and remove the additive mask Mi,j. The intermediate value is now masked with only the multiplicative mask. The advantage of the multiplicative mask is, that this mask can be removed after the inversion of the masked intermediate value easily by inverting the multiplicative mask Yi,j separately and then multiply that inverted mask with the inverted masked value.

If we calculate the multiplication with the binary multiplication method, as it is suggested in [FIP01], we need for the multiplication at average 16 shift operations and 8 XOR operations, which makes 16cc + 8cc = 24cc.

For an inversion of one byte from the State matrix with this method, we need four multiplications, two XOR, and two inversions (TLU). This costs us 4·24cc+2·1cc+2·2cc = 102cc. To invert the 10 · 16bytes = 160bytes for all ten rounds, we need 160 · 102cc = 16320cc.

Akkar and Giraud give a cycle count of 293500cc per encryption, which is about three times more than in the unprotected implementation we have seen in Section 2.3, which was 90500cc.

In [TSG02], Trichina et al. have published a simplified variant of the algorithm which is shown in Figure 4.3 on the right hand side. The scheme is equivalent to the original masking scheme but the authors here use the same mask value for the multiplicative masking (Y=M). For an inversion of one byte from the State matrix with this simplified method, we need only two multiplications, two XOR, and one inversion (TLU). This costs us 2·24cc+2·1cc+1·2cc = 52cc. To invert the 160bytes for all ten rounds, we need 160 · 52cc = 8320cc.

The authors in [TSG02] give us no information about the cycles or memory consumption in their paper, so we will compare it to the Akkar/Giraud version. Since it saves the half on multiplications and inversion during one SubBytes transformation, which are the expensive and time consuming functions, we assume, that this implementation needs half of the cycles, the Transformed Masking Method from Akkar/Giraud would need.

Embedded Multiplicative Masking
Both, the normal and the simplified transformed masking method mask only non-zero values. If a key byte equals a data byte, then the result of AddRoundKey is zero. This fact can be exploited in a special (first-order) differential side-channel attack, the zero-value attack. In [OMP04] it is shown, that around 64 times more measurements are needed in a zero-value attack than in a standard differential side-channel attack. This indicates that they must be avoided because they still pose a serious practical threat.

The idea from [GT03] is to embed the finite field GF(28) ' GF(2)[X]/P(x), P(X) =

(x8 +x4 +x3 +x+1)2 into the larger ring R = GF(2)[x]/(P(x)·Q(x)). Q(x) has degree k and needs to be irreducible over GF(2) and coprime to P(x). We define ρ : GF(2)[x]/P(x) → R:

ρ(U) = U + R × P (4.1)

The mask R(x) is a random polynomial = 06 of degree smaller than k. We define a mapping F on R, F : R → R with F(U) = U254 which coincides with the inversion on GF(2)[X]/P(x)3. Reducing F(U) modulo P(X) gives the inverse of U in GF(2)[X]/P(x). The additive masking can be restored before performing this final reduction step. If the polynomial R is random, the masking will not allow to reveal information even if X = 0. This zero value gets mapped to one of 2k random values. This is a mathematically elegant way to bypass the zero value problem

but it leads to very costly implementations. According to [BGK04], this leads in at least 731 AND and 766 XOR operations, including the transformations to the new field representation and back. For our cycle count this means that we need at least 731cc + 766cc = 1497cc for one byte of the State matrix and 160 · 1497cc = 239520cc for all ten rounds.

4.2.3 Perfectly Masking AES Inverter Against First-Order Side-Channel Attacks
Another way to protect the SubBytes function is presented in [BGK04]. The authors obtain a perfectly masked algorithm for AES by computing the multiplicative inverse in GF(28):

−1 = INV(si,j) =si,j−1, if si,j ∈ GF(28) (4.2) si,j

(

0, if si,j = 0

and masking all intermediate values.

They calculate the inverse si,j−1 = INV(si,j), as y = s254i,j by using the Square-and-Multiply algorithm. We discussed the algorithm and the resulting side-channel leakage in Section 3.1. The goal is to protect the value si,j, the intermediate values, and the resulting inverse during the calculation.

Let r, r0 be independently and uniformly distributed random masks and u = si,j. Now, the authors start with an additively masked value x = u+r and want to calculate y = INV(u)+r0.

Let us recall the Square-and-Multiply algorithm with the exponent e = 254 and the given values:

Algorithm 4: Left-to-Right binary Square-and-Multiply Algorithm

Input: x = (u + r),n = 255,e = {ep,...,e0}2 = {11111110}2

Output: y = xe mod n = x−1 mod n

1 y = 1

2 for k = p down to 0 do

3 y = y2 mod n

4 if ek = 1 then

5 y = y · x mod n

6 end

7 end

As we can see, the algorithm consists of two operations, a squaring in Step 3 followed by a conditional multiplication with a reduction of n in Step 5. These are the operations, which were modified by the authors of [BGK04] to ensure the side-channel resistance.

In the following two algorithms, the input si,j becomes the additively masked input u+rl,k, where r is the lth mask byte in Step k of Algorithm 4. Our goal is, to invert u. The auxiliary values f,v,w and t are needed in the algorithm.

Algorithm 5 describes the perfectly masked squaring (PMS) method. During the first step, the input ue + r1,k−1 is squared. In Step two and three, they change the old mask r1,k−1 to the new mask r1,k. The desired output is a squared input with a fresh mask u2e + r1,k.

In Algorithm 6, we see the perfectly masked multiplication (PMM) method. This algorithm gets two input values: The output of the previous step and a freshly masked value x0. In

Step 1, the authors calculate the product of the two masked values. In Steps 2-5 the auxiliary

Algorithm 5: Perfectly Masked Squaring (PMS), Source: [BGK04]

Input: x = ue + r1,k−1,r1,k Output: tk = u2e + r1,k

1 fk = x2 ; {f1 = u2e + r12,k−1}

2 w1,k = r12,k−1 + r1,k ; {auxiliary term to correct fk}

3 tk = fk + w1,k ; {tk = u2e + r1,k}

terms are calculated. The last part of the algorithm, Step 6 and 7 removes the disturbing parts of the masked product by simply adding up the two auxiliary terms w1,k, w2,k and fk.

To calculate the masked inverse of one element with the secured Square-and-Multiply algorithm, we need seven PMS with additional PMM, both including a modular reduction step, and at last one PMS with a modular reduction. Every intermediate value of the PMS and PMM is hidden by a fresh mask. So we need (7 · 3) + 1 = 22 random masks for the inversion of one element of the State matrix. For a whole AES en-/decryption we need 160·22 = 3520 random mask values each with the size of one byte. In the case of a first order side channel resistance, the authors point out in [BGK04, Section 5.4], that the number of random masks can be reduced to only three masks (r1,r2,r3), by using the same masks in each step k.

To calculate the clock cycles for the inversion, we recall the clock cycle for: addition

(1cc), multiplication (24cc) and squaring (2cc). By counting the operations during the algorithm we get: 2 · 2cc + 2cc = 6cc (two squaring and two additions), for one PMS, and 4 · 24cc + 2 · 1cc = 98cc (four multiplications and four additions) for one PMM operation. So we need:

• (7 · 6)cc = 42cc for the PMS operation

• (7 · 98)cc = 686cc for the PMM operation

• 3cc for loading the random intermediate mask r1,k,r1,k−1,r2,k, • 1cc for loading the masked intermediate value from RAM, and

• 1cc for saving the result.

In total we need (42 + 686 + 3 + 1 + 1)cc = 733cc for the inversion of one element from the State matrix. For a whole AES en-/decryption we need 160 · 733cc = 117280cc for the inversion part of the SubBytes transformation with three masks.

4.2.4 Combined Masking in Tower Fields
In [OMP04] and [OMPR05], Oswald et al. present a masking scheme, which is based on the composite field arithmetic we discussed in Section 2.3.8. They call it combined masking in tower fields (CMTF). The idea is, that the inversion in GF(22) is a linear transformation, which gives us the potential to track the changes which are applied to the mask value m during an inversion.

To securely calculate the inversion of the current intermediate byte si,j from the State matrix, we first mask it additively with the mask m to obtain a = si,j ⊕ m. Next we transform a to its composite field representation. Please recall, that every element of GF(28) can be represented as a linear polynomial over GF(24). The finite fields, which we will use are the same as before (Section 2.3.8):

GF(28) ' GF(2)[x]/(x8 + x4 + x3 + x + 1), GF(24) ' GF(2)[x]/(x4 + x + 1).

(4.3)

(4.4)

Note that operations ⊕ and ⊗ which we use here, are for coefficients from GF(24). We have introduced them in Section 2.3.8.

After the mapping, the value that needs to be inverted is represented by (ah⊕mh)x+(al⊕ml) instead of ahx + al with ah,al,mh,ml ∈ GF(24). Both values, ah and al, are now additively masked. It is easier to follow the masked inversion if we recall the normal inversion in the composite field:

al0 (4.5a) ah0 = ah ⊗ d−1 (4.5b) al0 = (ah ⊕ al) ⊗ d−1 (4.5c) d a2l (4.5d)

We can calculate the masked inverse, by following the steps from (4.5a) to (4.5d), but this time we use the masked values:

) (4.6a)

ah0 mh0 (4.6b) al0 ml0 (4.6c) dmh (4.6d)

Now we need to securely calculate the inverse for d⊕mh,d ∈ GF(24). So far, we have only shifted the problem from the bigger field down to the smaller one. Because GF(28) is a tower field over GF(22), we can shift the computation of the masked inversion for d down to GF(22) the same way as we did it in (4.6a) to (4.6d) with the corresponding field operations. In this field, the inversion is a linear operation and we keep track of the mask value. The inversion operation preserves the masking value, because (d ⊕ m)−1 = (d ⊕ m)2 = d2 ⊕ m2 in GF(22).

This approach has been presented in [OMPR05] to work in tower fields which are applicable on hardware implementations but inefficient to implement in software on a 32-bit processor, since during the field transformations we work on every individual bit of all bytes in the State matrix. Instead, all transformations above can be precomputed and stored in tables. This method will be discussed in the next subsection.

Combined Masking in Tower Fields with Precomputed Tables
To compute the masked inversion in GF(28), Schramm et al. presented in [OS05] a method which is based on the inversion in composite fields, as we discussed it in the last section. The difference is, that the calculations of the intermediate values are mapped to a sequence of table lookups. Therefore the authors first map the masked intermediate value a = si,j ⊕ m down to its field representation with coefficients from GF(24) as done before. With the use of the following four tables:

Td⊕

Td2 : ((x ⊕ m),(y ⊕ m0)) →7 ((x ⊕ m) ⊕ (y ⊕ m0 )) × (y ⊕ m ) Tm : ((x ⊕ m),(y ⊕ m )) →7 (x ⊕ m) × (y ⊕ m )

Tinv : ((x ⊕ m),m) 7→ x−1 ⊕ m

which can be precalculated and stored in ROM, the masked intermediate values of the formula (4.6b) to (4.5d) are calculated. Note, that all tables take two elements of GF(24) as inputs and give an element of GF(24) as output.

d ⊕ mh = Td1(ah ⊕ mh,mh) ⊕ Td2((ah ⊕ mh),(al ⊕ ml))

⊕ Tm((ah ⊕ mh),ml) ⊕ Tm((al ⊕ ml),mh) ⊕ Tm((mh ⊕ ml),ml). (4.7)

We compute the masked inversion of d with the following table lookup:

d−1 ⊕ mh = Tinv(d ⊕ mh,mh). (4.8)

Now we can compute ah0 ⊕ mh0 and al0 ⊕ ml0 from (4.6b) and (4.6c) the following way:

ah0 mh,d−1 ⊕ ml)

⊕ mh ⊕ Tm(d−1 ⊕ ml,mh) ⊕ Tm(ah ⊕ mh,ml) ⊕ Tm(mh,ml), (4.9)

and

al0

⊕ ml ⊕ Tm(d−1 ⊕ mh,ml) ⊕ Tm(al ⊕ ml,mh) ⊕ fah ⊕ mh ⊕ Tm(mh,ml). (4.10)

To map GF(28) elements to GF(24) × GF(24) elements and vice versa, two additional tables with a size of 512bytes are needed.

As the authors state and we can count above, the total costs of a masked inversion this way are 14 table lookup operations and 15 XOR operations for an inversion, which results in 14·2cc+15cc = 43cc. For an AES encryption this results in 160·43cc = 6880cc additional for the new masked SubBytes transformations.

To store the six tables we need 4 · 256bytes + 2 · 256bytes = 1536bytes of ROM.

4.2.5 Masking the MixColumns Transformation
The MixColumns transformation can leak timing informations because it uses the xtimes operation. Xtimes contains a conditional reduction by {11B}16 if the result of the operation is larger than eight bits, see Section 2.2.

To prevent the timing information leakage and power leakage, the State S has to be additively masked with the mask M before the transformation, SM = S ⊕ M, where M is a

4 ×0 4 matrix created of four random bytes M1,...,M4, see (4.11). After the transformation0 ( SM = MixColumns(SM)) the mask M can be removed from the transformed State SM by adding the transformed mask M0 = MixColumns(M), so S.



M1 M2 M3 M4 M1 M2 M3 M4

M =(4.11)

M1 M2 M3 M4
M1 M2 M3 M4

Masking the MixColumns transformation needs one additional MixColumns transformation

(84cc) to get M’ if the same mask M is used to mask all MixColumns transformations during an whole AES run. In addition, four random bytes are needed for the mask M, and 16bytes are needed for the transformed mask M0 = MixColumns(M).

4.3 Countermeasure: Randomization
Randomizing the execution of the algorithm provides additional resistance against power analysis attacks. Herbst et al. suggest in [HOM06] two ways to introduce randomness in the execution of the algorithm. They propose to shuffle the sequence of operations which work on the elements of the State matrix and to add dummy rounds (or parts of a round) during the normal ones.

Then we proceed with the next element. As we work on each intermediate value of the State matrix, this version of shuffling is applicable on 8-bit implementations. With shuffling, the probability that one specific intermediate value is processed at a specific point in time is 1/16. Theoretically, we can shuffle the whole State matrix with 16! possibilities. But that would not increase the security against a first-order DPA with the same factor since here it

is important that the operations with the intermediate values do not occur on the same fixed time.

The randomization can be broken with windowing, published by Clavier et al. [CCD00]. Windowing can be seen as performing multiple second-order DPA attacks in parallel and then combining the power traces for the mask processing with each of the points in time, where the targeted masked values can occur. So shuffling all 16! possibilities would not bring us more security but would increase the implementation costs. For a successful side-channel attack 16 times more traces are needed with this countermeasure than with an unprotected implementation.

Another way to increase the randomization is to add dummy intermediate values d to the State matrix which will be operated on. Those operations must not be distinguishable from real operations of the algorithm. The problem with added wait states or, for example, just added random NOPs (No Operation) is that they can be easily removed by analyzing a single power trace. The authors in [HOM06] therefore suggest to insert dummy rounds or parts of a round on some dummy State bytes, see Figure 4.5. The combination of shuffling and dummy operations make it very difficult for the attacker to locate the attack point in the first two and the last two rounds. These are the rounds, which need our attention, because they are vulnerable during a known/chosen plaintext/ciphertext attack. After these rounds, every value from the State has been subjected to three AddRoundKey transformations and depend on sufficient many key bytes, to prevent an DPA attack [HOM06]. The probability, that a certain intermediate value occurs on a specific point in time is p = 1/(D + 1), where D is the number of additional dummy cycles. If we chose to insert 15 dummy operations, the probability is 1/16, which would increase the number of needed traces for a successful SCA also by 16 times. Note, that we have to see the normal consecutive State bytes as one value compared to the dummy bytes, to calculate this probability.

With D additional dummy rounds and shuffling, which is recommended by the authors, the number of required traces for a successful mounted SCA increases by (D + 1) · 16. With, e.g., D = 5 dummy rounds randomly inserted at the start and the end of the algorithm, 96 times more measurements would be required for a successful DPA attack.

The number of added dummy operations D is a fixed value. The random number r, 0 ≤

r ≤ D partitions the D dummy operations into two parts. The dummy operations from 0 to r − 1 will be executed before and the dummy operations from r to D − 1 will be executed after the normal operations, like it is depicted in Figure 4.5 for r = 2 and D = 5.

4.4 Comparison
In this chapter we gave a short overview about the discussed (low-cost) software countermeasures against timing and power attacks. Now we estimate the number of clock cycles for a protected AES implementation with the selected countermeasure based on our 32-bit optimized implementation with a clock cycle count of 1732cc. The clocks cycles for the affine transformation (Section 2.3.3) with 6560cc are added to those schemes, which have to calculate it. We get the following runtimes estimated for the countermeasures:

• Additive Masking: 1732cc + 1600cc = 3332cc,

• Transformed Masking: 1732cc + 6560cc + 16320cc = 24612cc,

• Transformed Masking simpl.: 1732cc + 6560cc + 8320cc = 16612cc,

• Embedded Masking: 1732cc + 6560cc + 239520cc = 247812cc,

• Perfect Masking: 1732cc + 6560cc + 117280cc = 125572cc,

• CMTF with TL: 1732cc + 6560cc + 6880cc = 15172cc.

Table 4.2 contains the estimated memory consumption (ROM and RAM), the number of random bits, and the runtime in clock cycles (cc) for the discussed masking methods. Note that we are only interested in a rough estimate to select the most appropriate countermeasure. The value on the left side denotes the estimated result for using one random byte to mask all

16 elements of the State matrix. The value on the right side is the estimated value using four random bytes to mask a whole column of the State matrix. All RAM estimates include the 176-byte expanded key array for the key scheduling.

With 3332cc and 434bytes RAM, Additive Masking is applicable when used with one mask. With four different masks the RAM consumption of 1208bytes, which is mainly caused by the four masked S-boxes, is too big for our needs. The runtime 5440cc +1732cc = 7172cc is still acceptable.

In a 32-bit T-tables implementation, which is Additively Masked with one mask pair, the RAM consumption of 8376bytes and 32976bytes for four masks is also too big.

4.4 Comparison

The Transformed Masking and simplified Transformed Masking Method both multiplicatively mask an element of the State matrix during the SubBytes transformation. The simplified Transformed Masking Method is the improved one of the Transformed Masking Method, in terms of needed operation and random bytes for masking. It is very interesting regarded to the RAM consumption of 180bytes. Compared to the Additive Masking implementation, the runtime of 16612cc is also acceptable. A big pro of the Transformed Masking approach is the constant runtime, which is independent from the number of mask values.

The Perfect Masking in line six needs 179 bytes RAM (three for the masks and 176 for the expanded key). Since it calculates the inversion transformation and the affine transformation on the fly, it has a very big runtime, which is the main criteria against it.

The same constant runtime characteristic holds for the Combined Masking in Tower Fields in row seven. The big ROM consumption of 1536bytes mainly results from a couple of constant tables, which are used for the transformation between the different representation systems and for calculating the masked inversion, see Section 4.2.4.

The big part of the runtime from the Additive Masking Method arises from the creation of the masked S-boxes. If these masked S-boxes have been created, the implementation is even as fast as the unmasked implementation since then, the transformation is only a table lookup. Transformed Masking and the improved versions of it would only pay off if a high number of random masks is used. But using a high number of masks does not lead to a practically safer implementation. This is because a higher order DPA works whenever the same mask values occur on two different points in time. Since all masks have to be created at a specific point in time and are used to mask an intermediate value at some point later, there are always two points in time that allow a higher order DPA attack.

We decided to use the Additive Masking Method and combine it with randomization (shuffling and dummy states). By using the masking approach, we fulfill the requirements needed against a first-order DPA. In addition, we will mask the MixColumns transformation to prevent timing analysis attacks and use masking and shuffling to increase the resistance to a second-order DPA.

4.5 Low-Cost SCA-Resistant AES Software Implementation
On the Applied Cryptography and Network Security Conference (ACNS’07), Tillich, Herbst and Mangard [THM07] presented a low-cost SCA-resistant AES implementation, which takes advantage of the masking, shuffling and execution of dummy cycle approach. In Figure 4.7, the program flow of the randomized and masked AES is depicted. The randomization areas are called Randomization Zone 1 and Randomization Zone 2. In these two zones the shuffling and the dummy operation approach is applied. Because the randomization is very costly, the authors keep the Randomization Zones as short as possible. During the Randomization Zones, the following operations are randomized: The AddRoundKey operation, since the 16bytes of the State matrix S are processed independently. The same holds for the (masked) SubBytes transformation. During the MixColumns transformation, the sequence of the processed columns is randomized.

The first two and the last two rounds of the AES implementation are additionally protected by masking. The current mask of the State S is shown on the right in Figure 4.7. The SubBytes transformation is masked additively with the mask byte M, see Section 4.2.1.

To protect the MixColumns transformation, the authors use four mask bytes, M1 to M4, see Section 4.2.5. Every mask byte M1 to M4 is used to mask the corresponding column

4.5 Low-Cost SCA-Resistant AES Software Implementation

of the State S. To remove the four masks after the MixColumns transformation, the four mask bytes are also transformed, (M10,M20,M30,M40) = MixColumns(M1,M2,M3,M4). To remask the State, the AddRoundKey transformation is used whenever possible. The ShiftRows transformation is excluded from the Randomization Zones because it works on the whole State matrix.

We will adapt this scheme. The Additive Masking implementation with one masked Sbox needs needs 3332cc. To calculate the four masks M10,M20,M30 and M40 we need one additional MixColumns transformation (84cc). We implement one dummy operation to operate on one column of the State matrix. This is the smallest part on which all transformations can work and it fits to our 32-bit architecture. During one dummy operation the following transformations are executed AddRoundKeyColumn, SubBytesColumn, MixColumn,

AddRoundKeyColumn and SubBytesColumn. This takes 4cc + 20cc + 21cc + 4cc + 20cc = 69cc. For shuffling, we only change the indices of the current column and current row.

We think this runtime is negligible. So the protected implementation has a runtime of 3332cc + 84cc = 3416cc with no dummy operations and 3416+D · 69cc with D dummy operations.

The Additive Masking scheme needs 256bytes RAM for the masked S-box and 16bytes for the four words M10 to M40. In addition six random bytes for the masked values M,M0,M1 to M4 one random byte r to divide the number of dummy operations during the randomization zones, 16 random bytes for one dummy State and five random bytes for shuffling the five AddRoundKey, four SubBytes and three MixColumns transformations in the two randomization zones. That makes in total (176 + 256 + 16 + 6 + 1 + 16 + 5)bytes = 476bytes RAM. The masked AES implementation is roughly two times slower than their referenced unmasked AES implementation. The randomized implementation takes about 11845 clock cycles when no dummy operations are added. This is about 2.7 times slower than the unprotected AVR-based AES implementation. With D dummy operations added, their randomized AES implementation needs 11845 + D · 240 clock cycles. Our estimated protected AES encryption increases by a factor of two.

Wonderful post Sonova,
From now on,I will be calling you Professor Sonova

Again it was excellent

Regards

Neptune · Jan 19, 2015

@SvenSvensonov great piece. I assume you were SW. What was your specialized area mate? EP, ES, EA. I am glad I have found another squid around.

SvenSvensonov · Jan 19, 2015

Neptune said:
@SvenSvensonov great piece. I assume you were SW. What was your specialized area mate? EP, ES, EA. I am glad I have found another squid around.

Hey, @Neptune - I wasn't a squid, was never posted on US ships (at least not as an actual posting, I've been on US ships numerous times though) - was stationed at the Virginia SPAWAR facility in Norfolk. That said, because I have been on US subs for systems validation, I was required to undergo the USN's submariner program and I am a qualified submariner. The most time I've spent on a US submarine was 2-1/2 months. I'm not going into the details of that program though. I did my training at NAVSUBSCOL.

The closest designation to mine, without actually listing my designation, is CTR - Signals Collector.

I've seen you talking with some of the other members, what was your role in the Turkish Navy?

Space and Naval Warfare Systems Command - Wikipedia, the free encyclopedia

@Slav Defence - here's the next section of the Side Channel Analysis paper:

Chapter 5 Target Platform: TriCore TC1796
To write an optimized AES implementation, it is important to understand the underlying microcontroller architecture. We will discuss the most important parts of the microcontroller in the following subsections and only mention those features of the TriCore, which are interesting for our work.

The TriCore TC1796 or variants thereof is a typical automotive embedded processor which is used, for example, in the current engine control units (ECU) MEDC17 by Robert Bosch GmbH. The TriCore TC1796 combines a reduced instruction set computing (RISC) processor, a digital signal processor (DSP), and on-chip memories as well as peripherals. It has a program flash memory of 2Mbyte, a data flash of 128Kbyte, and 136Kbyte data memory and a 16Kbyte instruction cache (ICACHE). The TriCore TC1796 has no data cache [Inf07, page 2-36].

The program flash and the data flash are connected through the SBD (System Bus Domain) to the CPU. Because the data is accessed through a bridge, it can be one order of magnitude slower than the local domain which is located nearest to the core [Inf02, page 30]. This means for our processor model, that the runtime can also differ by an order of magnitude because our program code and program data is located at the flash.

The chip provides a lot more interesting features, which are explained in the data sheet of the chip [Inf08a]. But first let us start with an overview of the development environment used.

5.1 Development Environment

For programming and compilation of our code, we use the Tasking VX tool set version 3.0r1. It offers a C compiler, linker, Assembler, and a simulator for the TriCore. The C compiler supports the ISO C99 standard.

At the Robert Bosch GmbH, the Hightech GNU C development environment is widely used and also used for developing TriCore applications. We write our code in such a way, that it compiles on both development environments.

We use the Universal Debug Engine (UDE) UAD2 to communicate with the evaluation board. The debugger is connected via USB to our PC. To flash our application to the development board and to debug it in real time, we use UDE-Desktop application. The development board is a TriCore Evaboard TriBoard TC1796.

5.1.1 Compiler

We write our implementation in C following the C99 standard to keep it as portable as possible. If some optimizations are only possible in assembler, we will write a TriCore optimized version and a generic version, separated by defines in the source.

Assembly instructions can be inserted with the keyword __asm in C source code. The C variables are passed as operand to the assembly code. The general syntax of the __asm keyword is:

__asm("instruction_template" [ : output_param_list

[ : input_param_list

[ : register_save_list]]] );

The instruction_template is the assembly instruction, that may contain parameter from the input_param_listor output_param_list. The register_save_listcontains the name of registers which should not be used.

The following example multiplies two C variables and assigns the result to a third C variable. The =d denotes that a data register is necessary for output, and the d denotes that the values come from a data register.

The following sections list the built in functions with their respective prototypes and associated machine instructions. In case an assembler mnemonic is given, the builtin’s semantic is the same (from the C programmer’s point of view) as the semantic of the machine instruction. We do not repeat the instruction set manual here.

Please note, that the syntax of the Hightech GNU C compiler differs for the inline assembler. However, the compiler has built in functions for the assembler instructions which can be used instead.

To ensure the formal correctness of our code, we use splint (Secure Programming Lint). Splint is a tool for statically checking C programs for security vulnerabilities and coding mistakes.1

Because our implementation will be used in the automotive environment it should also be Motor Industry Software Reliability Association2 (MISRA) conform. MISRA is a set of programming rules for secure programming in C.

For code formating we use indent –no-tabs. We will deliver our AES implementation as source code, which can be compiled as a library.

5.1.2 Simulator
During the development process, we use the TriCore Instruction Set Simulator (TSIM) to simulate and debug our implementation. It is described in [Inf05] and perfectly integrated in our development IDE. The simulator has the disadvantage, that it does not simulate the pipelining right. Therefore, we will make the time measurements on the evaluation board.

5.1.3 Evaluation Board
We have the TriBoard – TC1796 V4.1 which hosts the TriCore TC1796. This board is connected via a JTAG (Joint Test Action Group) interface to the UAD. The Debugger is connected via USB to our development PC.

We can debug the TriCore with the Universal Debug Engine (UDE). The UDE has a virtual console, where we can receive messages from our application on the TriBoard. To receive the messages during a run on the evaluation board, we have to set the define CONSOLE_TYPE in BCLtypes.h to CONSOLE_TYPE_TASKING. This causes the compiler to overwrite the standard POSIX read and write functions, which are called by the ANSI-C Library. The stdin, stdout and stderr are now piped through a JTAG interface and we see the output in the UDE-

Desktop application.

5.1.4 The Multi-Core Debug System
In addition, we got the TriBoard – TC1796.303, which hosts the TC1796ED. With this board, we can do exact timing measurements without external influences. Figure 5.1 depicts the TriBoard and how the UAD is connected via the JTAG interface.

The TC1796ED provides a Multi-Core Debug System (MCDS) that can be used to trace the program, which is executed on the TC1796, in real-time. Its MCDS is integrated as a plug-in into the UDE-Desktop application. Figure 5.2 depicts the MCDS configuration dialog. The dialog contains 5 tasks which are needed to measure the runtime of the function aes_Encrypt. The first task configures the emulation device, there the timer resolution is set to 250ns.

The next two tasks create one signal each. The first signal (sig_start) fires if the TriCore program counter (PC) reaches the start address of the function aes_Encrypt. The second signal (sig_stop) fires when the end of aes_Encrypt is reached. The following two tasks define the actions on the signals. The first action starts a timer, the second one stops the timer and stores its value for every execution of the function aes_Encrypt.

5.2 TriCore CPU

The TC1796 processor contains a TriCore 1 V1.3 central processing unit (CPU), with a maximum CPU frequency of fCPU=150MHz. The most interesting features of the CPU for us are the:

• 32-bit load/store architecture,

• 16-bit and 32-bit instructions for reduced code size,

• data formats: bit, byte (8-bit), half-word (16-bit), word (32-bit), double-word (64-bit),

• byte and bit addressing,• little endian byte ordering,

• packed data.

This enumeration is taken from the user manual [Inf07, Section 2]. The CPU contains an instruction fetch unit, an execution unit, and a general purpose register file (GPR), namely the address and data registers.

The instruction fetch unit pre-fetches and aligns the incoming instructions. The execution unit of the TriCore contains three pipelines which work in parallel. This allows us to execute up to three instructions in one clock cycle. The execution unit contains the integer pipeline (IP), the load/store pipeline (LS), and the loop pipeline.

The integer pipeline and the load/store pipeline have the following four stages: fetch, decode, execute, and write-back. The loop pipeline has the two stages: decode and writeback. The execution unit can execute most operations we use in one clock cycle.

The instruction fetch unit feeds the three pipelines. The fetch unit is able of issuing, in one cycle, an integer pipeline instruction to the integer pipeline and an immediately following LS instruction to the LS pipeline.

Our two main goals during our TriCore optimization strategy are to avoid pipeline stalls as much as possible and to take advantage of the parallel issue capabilities as mentioned above.

The CPU has 16 data registers (register D0 to D15) and 16 address registers (register A0 to A15). Each address and data register has a size of 32-bit. Respectively two 32-bit data

5.3 TriCore Instruction Set
The TriCore supports 16 and 32-bit instructions. Since there is no mode bit and no requirement for word alignment of the 32-bit instructions we can freely mix the instructions. To specify a 16-bit command, the basic operation name is followed by a 16, e.g., SH16. The advantage of a 16-bit instruction is, that it is smaller than a 32-bit instruction in the program memory. It should be used whenever possible for smaller code size.

To follow the upcoming assembler examples, we first show how an assembler instruction is composed. The assembler instruction starts with the basic operation, which is derived from the intention of the instruction. For example, a shift instruction starts with SH. The basic operation can be followed by an operation modifier, e.g., the instruction JNE would execute a conditional jump. The condition here is given by the NE, which stands for “not equal”.

The data type modifier specifies the data type we want to work with. For example, if we load a byte from memory, we use the LD.B instruction. The .B denotes, that we load a byte value. In

The data type modifiers can be combined to a new one, e.g., the instruction LD.BU loads an unsigned byte. In the following subsections, we will describe the commonly used assembler instructions.

5.3.1 Load and Store Instructions
To load and store a value into data register D[A] we can use the load word and store word instructions (LD.W and ST.W).

5.3.2 Arithmetic Instructions
The following three bit arithmetic instruction work almost the same way. So we describe it only once. The OR,XOR and AND instruction computes the bitwise OR, XOR, respectively AND operation of the data register D[a] and the data register D. The result is stored in the data register D[c].

OR

D[c], D[a], D

; c = a OR b

XOR

D[c], D[a], D

; c = a XOR b

AND

D[c], D[a], D

; c = a AND b

The multiplication instruction multiplies two signed 32-bit values from the data register D[a] and the data register D and puts the product into the 32-bit data register D[c] or the 64-bit data register E[c].

MUL

D[c], D[a], D

; c = (lower half of)

; a * b

MUL

E[c], D[a], D

; c = a * b

With the shift instruction SH we can shift the value in D[a] by the number of bytes specified with either a constant value const or the value in data register D. The result is put in D[c]. The vacated bits are filled with zeros and the bits shifted out are discarded.

SH

D[c], D[a], D

; shift D[a] by D positions ; and store the result in D[c]

SH

D[c], D[a], const

; shift D[a] by #const positions

; and store the result in D[c]

5.3.3 Extract Instruction
It is possible to work with so called bit-fields. We can extract the number of consecutive bits specified by width starting at the bit pos from a source register D[a] with the EXTR (Extract Bit Field) and EXTR.U (Extract Bit Field Unsigned) instructions, beginning with the bit number specified by the pos operand. The result is stored sign extended (EXTR) or filled with zeros (EXTR.U) in the destination register D[c].

EXTR

D[c], D[a], pos, width ; Extract Bit Field

EXTR.U

D[c], D[a], pos, width ; Extract Bit Field Unsigned

To extract a 32-bit word from the registers {D[a] and D}, where D[a] contains the most-significant 32bits of the value, we use the DEXTR instruction. The extraction starts at the bit number specified with 32-pos. The result is put in D[C].

DEXTR D[c], D[a], D, pos ; Extract from Double Register

Packed Arithmetic
The packed arithmetic instruction partitions a 32-bit word into four byte values or into two, 16-bit halfwords values. Those can be fetched, stored, and operated on in parallel. Instructions which operate on the data in this way are denoted by the .B and .BU data type modifier. The arithmetic on packed data includes addition, subtraction, multiplication, shift and the absolute difference.

To load the contents of the memory location specified by the offset we can use the load byte instruction (LD.B).

5.4 Time Measurement

The ST.B instruction stores the byte value in the eight least-significant bits of the data register D[a] to the byte memory location specified by the offset off.

ST.B off, D[a] ; Store Byte

5.3.4 Address Arithmetic
The ADDSC.A operation left-shifts the contents of data register D[a] by the amount specified by n, where n = 0,...,3. That value is added to the contents of address register A and the result is put into address register A[c].

5.4 Time Measurement

The TriCore has a system timer (STM), which we can utilize for our time measurement, see the user’s manual [Inf07, Section 15]. The STM is a free running 56-bit upward counter and driven by max. 75MHz (=fSYS). The maximum clock period is 256·fSTM. For fSTM=75MHz, for example, the STM counts 30.46years before it overflows. The minimum resolution of the

STM is 13.3ns. The default frequency after a reset is fSTM = fSYS/2.

We can read the counter value from the STM timer register, which has the address {F0000210}16. Due to the fact, that the STM is 56-bit wide, we cannot read its entire content with one instruction. We have to use two load instructions. Listing 5.1 denotes, how we can load the STM register values with inline assembler.

The STM value can also be loaded in sections from seven registers, STM_TIM0 through STM_TIM6. Theses register can be viewed as individual 32-bit timers, each with a different resolution and timing range [Inf07, Section 15].

The "=e" (tmp) in line 8 denotes, that the variable tmp is used for the result of the LD.W instruction and is an extended 64-bit data register.

Chapter 6 Optimized and Randomized AES Implementations on TriCore
6.1 Implementation Constraints
The AES implementation runs on an automotive processor and it will not be the only task running there, so it will be implemented as space saving as possible in terms of RAM and ROM consumption1.

The ROM consumption is measured for a whole minimal AES implementation. We get the ROM consumption from the Assembler list file. To generate the list file, we have to enable the “Generate list file” option in the project settings dialog “Assembler -> List file” from the Tasking toolchain. The generated list file contains a “List section summary” with the ROM consumption.

6.1.1 Validation Tests
The National Institute of Standards and Technology (NIST) requires that all AES implementations have to pass some algorithm validation tests. This tests help to find failures early during the development process. It can happen, e.g., that we calculate a wrong byte in a S-box which is not used during an encryption but during a later one. This can happen because the used S-box bytes depend on the used plaintext and key combinations. To verify the correctness of the implementation, the following four tests are used:

• Variable key known answer test,

• Variable text (plaintext/ciphertext) known answer test, and

• tables known answer test,

• Monte Carlo test.

Variable Key Known Answer Test
During the Key Known Answer Test (Key KAT) [NIS98], the 128-bit plaintext is always initialized to zero. The key used in an AES iteration i, 0 ≤ i < 127 consists of a one in the ith position and zeros in all other positions. Each of the possible keys is created by shifting the one a single position at a time, starting at the most significant (left most) bit of the key.

Variable Text (Plaintext/Ciphertext) Known Answer Test
In this test, the 128-bit key is initialized to zero. Each 128-bit block data input consists of a one at the ith position, starting at the most significant (left most) bit position, where i = 0,...,127 is the AES iteration.

Table Known Answer Test
This test contain sets of key and plaintext combinations which have been carefully selected to test the tables used for the AES implementation, e.g., the S-boxes.

Monte Carlo Test
This test is used to identify implementation flaws. Therefore the authors of AES supply pseudo random values for the key and plaintext. The test consists of 4 million AES encryption/decryption cycles. Thereby the results of every 10,000th encryption or decryption cycle are recorded and evaluated with the supplied results from the authors. The first key k0 and plaintext p0 are given.

6.2 Available AES Implementations
We begin our work by reviewing the already existing AES implementations from Brijesh Singh [Sin08] and Robert Szerwinski, done at Robert Bosch GmbH, Corporate Research2.

The implementations are 8-bit implementations and work on the 16-byte State matrix, as it is described in the AES overview, Section 2.1. The State matrix is filled sequentially with memcpy() which is a few cycles faster than copying the plaintext byte-by-byte within a for() loop. Both authors combine the SubBytes and the ShiftRows transformation. This new transformation realizes the SubBytes transformation as a table lookup and integrates the ShiftRows transformation by using the indices for the table lookups, as they were after a ShiftRows transformation. This trick safes the rotation operations of the ShiftRows transformation, as described in Section 2.3.4.

Now we come to the differences of the available implementations. Brijesh Singh uses separate key scheduling in his implementation. He realized the xtimes operation as a table lookup.

This idea comes from Tom St Denis [Den07]. For this lookup table he needs 256bytes additional ROM. In the following we refer to this implementation as Singh-1. The second implementation of Brijesh Singh also uses a separate key scheduling but computes the xtimes operation as it is described in Section 2.3. We refer to this second implementation as Singh-2.

Robert Szerwinski realized the AES implementation without the lookup table for the xtimes operation. He implements this function as it was suggested in the standard and is described in the mathematical background, Section 2.2. Contrary to the standard, he calculates the keys on-the-fly. We refer to his implementation as Szerwinski.

6.3 General Optimizations Hints
Both presented implementations neither uses the 32-bit register size nor any special instruction of the TriCore. But they are a good start for our further optimization.

Since we program our implementation in C, we have to pay attention to the general C compiler specific behavior. For example we pass constant variables to functions as const values, which avoid the compiler from creating a local copy of the variable in the called function. Brijesh Singh also notes that using the __near pragma3 before a variable declaration causes the linker to put the variable into direct addressable memory, which makes the access to the variable faster. Another known fact is, that calling functions is very expensive. We avoid functions if they are used often, for example in loops. Alternatively for those functions we create a macro or an inline function of it. The compiler then replaces the macro resp. inline function call by the actual code of the function.

6.4 AES Optimizations for the TriCore
In the following sections we will walk through the four AES transformations AddRoundKey,

SubBytes, ShiftRows and MixColumns and take a look, if they can be optimized for the

TriCore. In addition, we will use the 32-bit capability of the TriCore and implement the transformations as discussed in Section 2.3.5.

6.4.1 AddRoundKey
The AddRoundKey transformation is implemented as macro to avoid the overhead of a function call. Because we arrange the State matrix and the expanded key arranged column wise on RAM, we can load one word w from the State matrix and one word from the expanded key in two clock cycles.

The address for the used key is calculated from the current round and the current column index of the State matrix.

The compiler creates the following Assembler instructions for the 32-bit addition of one column from the State matrix:

LD16.W

D0, [A10]

; load word from the

; State matrix

LD16.W

D15, [A15]16

; load key word from expanded

; key

XOR

D0, D15

; add them

ST16.W

[A10], D0

; write the word back to RAM

The LD.W (load word) instruction, loads 32 subsequent bits from the State matrix, which is located in RAM, into the data register D[0]. The next instruction loads 32-bits from the expanded key to D[15]. Both values are bitwise added with the XOR instruction which the result in D[0]. The ST.W (store word) instruction stores the register value at the given address in A[10] (the position of the State matrix in RAM).

By processing four bytes at the same time, we save approximately the factor four for the key addition, i.e., the AddRoundKey takes 176cc instead of 704cc.

6.4.2 SubBytes and ShiftRows
We also combine the SubBytes transformation with the ShiftRows transformation by a priori choosing the right index value for the table lookup.

During the SubBytes calculation we have to access each byte sr,c of the State matrix for processing the table lookup (the S-box). To access each byte of the State matrix instead of the whole column, we have to cast the 32-bit value from the State matrix to an 8-bit value.

This causes the compiler to use the byte data type modifiers, e.g., LD.BU (load byte unsigned). The LEA instruction computes the absolute address of the S-box and puts the resulting address in address register A[4]. The ADDSC.A (add scaled index to address) instruction calculates the index value and puts the resulting address in address register A[2]. Finally, the S-box byte at the calculated index is stored in the State matrix. The following assembler listing depicts the combined SubBytesShiftRows operation for one byte of the State matrix. Address register A[15] contains the next byte which is used as index for the TLU in the S-box (line three).

LEA

A4, sbox

; compute absolute address

; of the S-box

LD16.BU

D15, [A15]

; load byte from the State

; matrix

ADDSC16.A

A2, A4,D15,#0

; use the byte as index

; for the TLU

LD16.BU

D15, [A2]

; load the TLU value

ST16.BU

[A15], D15

; store the looked up value in

; the State matrix

We have to do these steps (excluding the LEA operation) for all 16bytes bytes of the State matrix. Since the State matrix of the reference implementations are also processed byte-wise, we cannot achieve an speed improvement here.

6.4.3 MixColumns

The MixColumns transformation is well discussed in Section 2.3.5.

Instead of the shift instructions which are used by Gladman, we use the multiplication by {1B}16 for the reduction if an overflow occurs. We do this because the TriCore can multiply in two clock cycles, which is still faster than the alternative shift and XOR method which is used by Gladman.

To implement the MixColumn operation efficiently, we need the ability to cyclically rotate the content of a 32-bit register by one, two and three bytes. If we take a look at the instruction manual of the TriCore [Inf08b], we cannot find a direct instruction for a rotation. Fortunately, we can utilize the DEXTR (extract from double register) instruction for the rotation. The function __my_rot1 returns the word w rotated by one byte position to the left. To rotate the word w by two and three bytes the same instruction can be used with a different starting bit, i.e., 16 for a rotation by two, and 24 for a rotation by three bytes.

The function MixColumn calculates the MixColumns operation for the column w with four rotations and four XOR operations. The MixColumns function in line six applies this transformation on each column of the State matrix.

6.4.4 Inverse MixColumns
The InvMixColumn operation is applied on each column w of the State matrix. The implementation follows the transformation description in Section 2.3.5.

6.5 Size Comparison of Unprotected Implementations
With the above discussed additional optimizations, we can achieve the memory consumption. The values are taken from the assembler listing file, which depicts the data and code section size. The data section from Singh-1 contains the 256-byte xtimes table, the 12bytes round constant RC and the 256bytes S-box. This leads to a total data memory consumption of 524bytes.

The data memory of Singh-2 does not contain the 256-byte xtimes table, it consists of the S-box 256bytes and 12bytes for the round constant RC. The Szerwinski implementation only holds the 256-byte S-box in the data memory.

It is obviously that the program code of the Optimized AES implementation is about twice as large as the other implementations. This is because it includes encryption and decryption whereas the first three implementations only include the encryption. The data memory contains the S-box, the inverse S-box for the decryption, the 12-byte round constant RC which is needed by the key expansion (Section 2.3.2) and four byte for the constant word {80808080}16 which is needed in the FFmulX function. The performance analysis will be done in Section 7.

6.6 Protected AES Implementation
The protected AES implementation uses masking, which is described in Section 4.2 and randomization from Section 4.3 to be resistant against timing and first order DPA attacks. We add D additional dummy operations to the real transformations. From these fixed D dummy operations, a random number of D1 dummy operations, D1 ≤ D are executed before the first round begins. After the last round, D − D1 dummy operations are executed. The dummy operations work one a dummy State matrix DS with a dummy key DK.

In the randomization zones, the transformations are shuffled. This means for the AddRoundKey and MixColumns transformations, that the starting column of the State matrix for these transformations is randomly chosen with two random bits rc. During the SubBytes transformation, the first column is randomly chosen as well as the first row. The random starting row is selected with the two random bits rr.

The four mask bytes M1 to M4 are used to mask the MixColumns transformation, as it is described in Section 4.2.5.

The SubBytes transformation is masked with the random mask M.

For masking and randomization 320 random bits are needed per AES encryption. Table 6.2 summarizes the names and usage of the random bits.

6.6.1 Masking the Optimized AES Implementation
We have outlined two critical operations which must be protected against a differential power analysis attack. The initial two SubBytes transformations during the first two rounds as well as the last two SubBytes transformations. We will present two independent masking schemes. To protect the implementations against first order DPA attacks, we will use additive masking from Section 4.2.1. In addition we will implement the Combined Masking in Tower Fields

approach from Section 4.2.4. Both masking schemes can be exchanged in the protected AES implementation.

Additively Masked SubBytes
To mask the SubBytes transformation additively, the whole S-box has to be recalculated, see Section 4.2.1. Because the recomputing of the S-box is an expensive operation in terms of runtime, we make it configurable. We can decide during the compilation of the implementation, after how many AES encryptions the S-box will be recalculated.

To perform a SubBytes transformation with the masked S-box, we have to add (XOR) the value M to every byte sr,c of the current State matrix. After the masked transformation, we have to remove the mask Mx from the looked-up value.

Two possibilities are conceivable. The mask values M and Mx can be added in the key scheduling and mask the expanded round. Then the current round key would mask the State in such a way we need it. But that implies, that we have to calculate the expanded key every time, we change the masks for the S-box. Since our implementation is targeted on devices with a fixed key, a recalculation of the expanded key would be very wasteful in terms of runtime.

We chose the AddRoundKey transformation, to add the currently needed mask values to the State.

SubBytes Masked with the CMTF Approach
As we have discussed in Section 2.3.8, we can calculate the inversion in so called composite fields. In this section we show how the equations from Section 4.2.4 can be computed. This can also be found in [WOL02]. First we have to transform an element a ∈ GF(28) to its isomorphic composite field representation ahx + al, ah,al ∈ GF(24):

a =∼ ahx + al, a ∈ GF(28), ah,al ∈ GF(24).

The transformation between the field representations can be calculated the following way:

ah2 = aB ⊕ a2 ⊕ a3, ah3 = aB,

where ai,i = 0,...,7 are the bits from a ∈ GF(28) and ahj,alj,j = 0,...,3 are the bits from ah,al ∈ GF(24). The ⊕ denotes the XOR operation in GF(2).

The inverse transformation, which converts the two-term polynomial ahx + al back into a GF(28) element a ∈ GF(28) can be calculated following way:

a4 = aA ⊕ aB ⊕ al3, a5 = aB ⊕ al2, a6 = aA ⊕ al2 ⊕ al3 ⊕ ah0, a7 = aB ⊕ al2 ⊕ ah3.

To compute the inversion in the composite field GF((24)2) an addition, multiplication, and inversion is needed. Two elements from GF(24) can be added the following way:

(ahx + al) ⊕ (bhx + bl) = (ah ⊕ bh)x + (al ⊕ bl). The multiplication of two elements, which is defined in Section 2.3.8:

(6.2)

q(x) = a(x) · b(x) mod m4(x), a,b,q ∈ GF(24)

can be computed the following way:

aA = a0 ⊕ a3, aB = a2 ⊕ a3,

q0 = a0b0 ⊕ a3b1 ⊕ a2b2 ⊕ a1b3, q1 = a1b0 ⊕ aAb1 ⊕ aBb2 ⊕ (a1 ⊕ a2)b3, q2 = a2b0 ⊕ a1b1 ⊕ aAb2 ⊕ aBb3, q3 = a3b0 ⊕ a2b1 ⊕ a1b2 ⊕ aAb3,

(6.3)

where ai,bi,qi,i = 0,...,3 are the bits from a, b and q ∈ GF(24).

To square an element in GF(24):

q(x) = a(x)2 mod m4(x) (6.4)

we calculate:

q0 = a0 ⊕ a2, q1 = a2, q2 = a1 ⊕ a3, q3 = a3.

The multiplicative inversion a−1 of an element a ∈ GF(24) is calculated by the following formula:

aA = a1 ⊕ a2 ⊕ a3 ⊕ a1a2a3

q0 = aA ⊕ a0 ⊕ a0a2 ⊕ a1a2 ⊕ a0a1a2 q1 = a0a1 ⊕ a0a2 ⊕ a1a2 ⊕ a3 ⊕ a1a3 ⊕ a0a1a3 q2 = a0a1 ⊕ a2 ⊕ a0a2 ⊕ a3 ⊕ a0a3 ⊕ a0a2a3 q3 = aA ⊕ a0a3 ⊕ a1a3 ⊕ a2a3

Now we can calculate the inversion of a whole two-term polynomial. Note that, this time, the addition ⊕ and multiplication ⊗ are in GF(24) and are computed as we have discussed above. The inversion of the whole two-term polynomial is done by calculating:

d (6.5)

where:

d , d ∈ GF(24) (6.6)

As we can see4, the transformation between the two field representations and the computation of the inverse are very expensive computations because they work bitwise on the elements. This values therefore will be pre-calculated stored in four lookup tables Td1,Td2,Tm and Tinv as it is discussed in Section 4.2.4. function gets a byte x from the State matrix and the mask m which masks the calculation. The functions mapGF16 and imapGF16 transform the elements between the field representations.

The function rot1_right rotates the byte b bitwise to the right. The function rot1_left rotates the byte b bitwise to the left. The function parity computes the parity bit of the given byte.

6.6.2 Size Comparison of Protected Implementations

Both AES implementations are protected with dummy operations and shuffling. The first uses additive masking to mask as masking aproach, the second uses CMTF as masking aproach. Both implementations contain the code for encryption and decryption.

The code size increases for the additive masked implementation, because we have additional functions that work on columns instead on the whole State matrix for the individual transformations.

Neptune · Jan 19, 2015

@SvenSvensonov

Hey, @@Neptune - I wasn't a squid, was never posted on US ships (at least not as an actual posting, I've been on US ships numerous times though) - was stationed at the Virginia SPAWAR facility in Norfolk. That said, because I have been on US subs for systems validation, I was required to undergo the USN's submariner program and I am a qualified submariner. The most time I've spent on a US submarine was 2-1/2 months. I'm not going into the details of that program though. I did my training at NAVSUBSCOL.

The closest designation to mine, without actually listing my designation, it CTR - Signals Collector.

I've seen you talking with some of the other members, what was your role in the Turkish Navy?

that's great, kinda similar to SIGINT I believe. Well, I was not commissioned at all, was an officer cadet left the academy at the end of the fourth year. Though sometimes I regret lol. I was qualified for naval aviation. You know for the S-70B, CN-235MPA/ASW or ATR-72/600.
Now I want political science and IR, let's see what the life will bring up here.

As for the topic. It is great that we have someone expertisd at SIGINT. Many may know about it but most don't know much about the SIGINT/ELINT collection and countermeasuring in naval warfare.

These calculations seem very confusing for civies. But it's amazing that this business is even deeper than science IMO.

Slav Defence · Jan 19, 2015

SvenSvensonov said:
Hey, @Neptune - I wasn't a squid, was never posted on US ships (at least not as an actual posting, I've been on US ships numerous times though) - was stationed at the Virginia SPAWAR facility in Norfolk. That said, because I have been on US subs for systems validation, I was required to undergo the USN's submariner program and I am a qualified submariner. The most time I've spent on a US submarine was 2-1/2 months. I'm not going into the details of that program though. I did my training at NAVSUBSCOL.

The closest designation to mine, without actually listing my designation, is CTR - Signals Collector.

I've seen you talking with some of the other members, what was your role in the Turkish Navy?

Space and Naval Warfare Systems Command - Wikipedia, the free encyclopedia

@Slav Defence - here's the next section of the Side Channel Analysis paper:

Chapter 5 Target Platform: TriCore TC1796
To write an optimized AES implementation, it is important to understand the underlying microcontroller architecture. We will discuss the most important parts of the microcontroller in the following subsections and only mention those features of the TriCore, which are interesting for our work.

The TriCore TC1796 or variants thereof is a typical automotive embedded processor which is used, for example, in the current engine control units (ECU) MEDC17 by Robert Bosch GmbH. The TriCore TC1796 combines a reduced instruction set computing (RISC) processor, a digital signal processor (DSP), and on-chip memories as well as peripherals. It has a program flash memory of 2Mbyte, a data flash of 128Kbyte, and 136Kbyte data memory and a 16Kbyte instruction cache (ICACHE). The TriCore TC1796 has no data cache [Inf07, page 2-36].

The program flash and the data flash are connected through the SBD (System Bus Domain) to the CPU. Because the data is accessed through a bridge, it can be one order of magnitude slower than the local domain which is located nearest to the core [Inf02, page 30]. This means for our processor model, that the runtime can also differ by an order of magnitude because our program code and program data is located at the flash.

The chip provides a lot more interesting features, which are explained in the data sheet of the chip [Inf08a]. But first let us start with an overview of the development environment used.

5.1 Development Environment

For programming and compilation of our code, we use the Tasking VX tool set version 3.0r1. It offers a C compiler, linker, Assembler, and a simulator for the TriCore. The C compiler supports the ISO C99 standard.

At the Robert Bosch GmbH, the Hightech GNU C development environment is widely used and also used for developing TriCore applications. We write our code in such a way, that it compiles on both development environments.

We use the Universal Debug Engine (UDE) UAD2 to communicate with the evaluation board. The debugger is connected via USB to our PC. To flash our application to the development board and to debug it in real time, we use UDE-Desktop application. The development board is a TriCore Evaboard TriBoard TC1796.

5.1.1 Compiler

We write our implementation in C following the C99 standard to keep it as portable as possible. If some optimizations are only possible in assembler, we will write a TriCore optimized version and a generic version, separated by defines in the source.

Assembly instructions can be inserted with the keyword __asm in C source code. The C variables are passed as operand to the assembly code. The general syntax of the __asm keyword is:

__asm("instruction_template" [ : output_param_list

[ : input_param_list

[ : register_save_list]]] );

The instruction_template is the assembly instruction, that may contain parameter from the input_param_listor output_param_list. The register_save_listcontains the name of registers which should not be used.

The following example multiplies two C variables and assigns the result to a third C variable. The =d denotes that a data register is necessary for output, and the d denotes that the values come from a data register.

The following sections list the built in functions with their respective prototypes and associated machine instructions. In case an assembler mnemonic is given, the builtin’s semantic is the same (from the C programmer’s point of view) as the semantic of the machine instruction. We do not repeat the instruction set manual here.

Please note, that the syntax of the Hightech GNU C compiler differs for the inline assembler. However, the compiler has built in functions for the assembler instructions which can be used instead.

To ensure the formal correctness of our code, we use splint (Secure Programming Lint). Splint is a tool for statically checking C programs for security vulnerabilities and coding mistakes.1

Because our implementation will be used in the automotive environment it should also be Motor Industry Software Reliability Association2 (MISRA) conform. MISRA is a set of programming rules for secure programming in C.

For code formating we use indent –no-tabs. We will deliver our AES implementation as source code, which can be compiled as a library.

5.1.2 Simulator
During the development process, we use the TriCore Instruction Set Simulator (TSIM) to simulate and debug our implementation. It is described in [Inf05] and perfectly integrated in our development IDE. The simulator has the disadvantage, that it does not simulate the pipelining right. Therefore, we will make the time measurements on the evaluation board.

5.1.3 Evaluation Board
We have the TriBoard – TC1796 V4.1 which hosts the TriCore TC1796. This board is connected via a JTAG (Joint Test Action Group) interface to the UAD. The Debugger is connected via USB to our development PC.

We can debug the TriCore with the Universal Debug Engine (UDE). The UDE has a virtual console, where we can receive messages from our application on the TriBoard. To receive the messages during a run on the evaluation board, we have to set the define CONSOLE_TYPE in BCLtypes.h to CONSOLE_TYPE_TASKING. This causes the compiler to overwrite the standard POSIX read and write functions, which are called by the ANSI-C Library. The stdin, stdout and stderr are now piped through a JTAG interface and we see the output in the UDE-

Desktop application.

5.1.4 The Multi-Core Debug System
In addition, we got the TriBoard – TC1796.303, which hosts the TC1796ED. With this board, we can do exact timing measurements without external influences. Figure 5.1 depicts the TriBoard and how the UAD is connected via the JTAG interface.

The TC1796ED provides a Multi-Core Debug System (MCDS) that can be used to trace the program, which is executed on the TC1796, in real-time. Its MCDS is integrated as a plug-in into the UDE-Desktop application. Figure 5.2 depicts the MCDS configuration dialog. The dialog contains 5 tasks which are needed to measure the runtime of the function aes_Encrypt. The first task configures the emulation device, there the timer resolution is set to 250ns.

The next two tasks create one signal each. The first signal (sig_start) fires if the TriCore program counter (PC) reaches the start address of the function aes_Encrypt. The second signal (sig_stop) fires when the end of aes_Encrypt is reached. The following two tasks define the actions on the signals. The first action starts a timer, the second one stops the timer and stores its value for every execution of the function aes_Encrypt.

5.2 TriCore CPU

The TC1796 processor contains a TriCore 1 V1.3 central processing unit (CPU), with a maximum CPU frequency of fCPU=150MHz. The most interesting features of the CPU for us are the:

• 32-bit load/store architecture,

• 16-bit and 32-bit instructions for reduced code size,

• data formats: bit, byte (8-bit), half-word (16-bit), word (32-bit), double-word (64-bit),

• byte and bit addressing,• little ****** byte ordering,

• packed data.

This enumeration is taken from the user manual [Inf07, Section 2]. The CPU contains an instruction fetch unit, an execution unit, and a general purpose register file (GPR), namely the address and data registers.

The instruction fetch unit pre-fetches and aligns the incoming instructions. The execution unit of the TriCore contains three pipelines which work in parallel. This allows us to execute up to three instructions in one clock cycle. The execution unit contains the integer pipeline (IP), the load/store pipeline (LS), and the loop pipeline.

The integer pipeline and the load/store pipeline have the following four stages: fetch, decode, execute, and write-back. The loop pipeline has the two stages: decode and writeback. The execution unit can execute most operations we use in one clock cycle.

The instruction fetch unit feeds the three pipelines. The fetch unit is able of issuing, in one cycle, an integer pipeline instruction to the integer pipeline and an immediately following LS instruction to the LS pipeline.

Our two main goals during our TriCore optimization strategy are to avoid pipeline stalls as much as possible and to take advantage of the parallel issue capabilities as mentioned above.

The CPU has 16 data registers (register D0 to D15) and 16 address registers (register A0 to A15). Each address and data register has a size of 32-bit. Respectively two 32-bit data

5.3 TriCore Instruction Set
The TriCore supports 16 and 32-bit instructions. Since there is no mode bit and no requirement for word alignment of the 32-bit instructions we can freely mix the instructions. To specify a 16-bit command, the basic operation name is followed by a 16, e.g., SH16. The advantage of a 16-bit instruction is, that it is smaller than a 32-bit instruction in the program memory. It should be used whenever possible for smaller code size.

To follow the upcoming assembler examples, we first show how an assembler instruction is composed. The assembler instruction starts with the basic operation, which is derived from the intention of the instruction. For example, a shift instruction starts with SH. The basic operation can be followed by an operation modifier, e.g., the instruction JNE would execute a conditional jump. The condition here is given by the NE, which stands for “not equal”.

The data type modifier specifies the data type we want to work with. For example, if we load a byte from memory, we use the LD.B instruction. The .B denotes, that we load a byte value. In

The data type modifiers can be combined to a new one, e.g., the instruction LD.BU loads an unsigned byte. In the following subsections, we will describe the commonly used assembler instructions.

5.3.1 Load and Store Instructions
To load and store a value into data register D[A] we can use the load word and store word instructions (LD.W and ST.W).

5.3.2 Arithmetic Instructions
The following three bit arithmetic instruction work almost the same way. So we describe it only once. The OR,XOR and AND instruction computes the bitwise OR, XOR, respectively AND operation of the data register D[a] and the data register D. The result is stored in the data register D[c].

OR

D[c], D[a], D

; c = a OR b

XOR

D[c], D[a], D

; c = a XOR b

AND

D[c], D[a], D

; c = a AND b

The multiplication instruction multiplies two signed 32-bit values from the data register D[a] and the data register D and puts the product into the 32-bit data register D[c] or the 64-bit data register E[c].

MUL

D[c], D[a], D

; c = (lower half of)

; a * b

MUL

E[c], D[a], D

; c = a * b

With the shift instruction SH we can shift the value in D[a] by the number of bytes specified with either a constant value const or the value in data register D. The result is put in D[c]. The vacated bits are filled with zeros and the bits shifted out are discarded.

SH

D[c], D[a], D

; shift D[a] by D positions ; and store the result in D[c]

SH

D[c], D[a], const

; shift D[a] by #const positions

; and store the result in D[c]

5.3.3 Extract Instruction
It is possible to work with so called bit-fields. We can extract the number of consecutive bits specified by width starting at the bit pos from a source register D[a] with the EXTR (Extract Bit Field) and EXTR.U (Extract Bit Field Unsigned) instructions, beginning with the bit number specified by the pos operand. The result is stored sign extended (EXTR) or filled with zeros (EXTR.U) in the destination register D[c].

EXTR

D[c], D[a], pos, width ; Extract Bit Field

EXTR.U

D[c], D[a], pos, width ; Extract Bit Field Unsigned

To extract a 32-bit word from the registers {D[a] and D}, where D[a] contains the most-significant 32bits of the value, we use the DEXTR instruction. The extraction starts at the bit number specified with 32-pos. The result is put in D[C].

DEXTR D[c], D[a], D, pos ; Extract from Double Register

Packed Arithmetic
The packed arithmetic instruction partitions a 32-bit word into four byte values or into two, 16-bit halfwords values. Those can be fetched, stored, and operated on in parallel. Instructions which operate on the data in this way are denoted by the .B and .BU data type modifier. The arithmetic on packed data includes addition, subtraction, multiplication, shift and the absolute difference.

To load the contents of the memory location specified by the offset we can use the load byte instruction (LD.B).

5.4 Time Measurement

The ST.B instruction stores the byte value in the eight least-significant bits of the data register D[a] to the byte memory location specified by the offset off.

ST.B off, D[a] ; Store Byte

5.3.4 Address Arithmetic
The ADDSC.A operation left-shifts the contents of data register D[a] by the amount specified by n, where n = 0,...,3. That value is added to the contents of address register A and the result is put into address register A[c].

5.4 Time Measurement

The TriCore has a system timer (STM), which we can utilize for our time measurement, see the user’s manual [Inf07, Section 15]. The STM is a free running 56-bit upward counter and driven by max. 75MHz (=fSYS). The maximum clock period is 256·fSTM. For fSTM=75MHz, for example, the STM counts 30.46years before it overflows. The minimum resolution of the

STM is 13.3ns. The default frequency after a reset is fSTM = fSYS/2.

We can read the counter value from the STM timer register, which has the address {F0000210}16. Due to the fact, that the STM is 56-bit wide, we cannot read its entire content with one instruction. We have to use two load instructions. Listing 5.1 denotes, how we can load the STM register values with inline assembler.

The STM value can also be loaded in sections from seven registers, STM_TIM0 through STM_TIM6. Theses register can be viewed as individual 32-bit timers, each with a different resolution and timing range [Inf07, Section 15].

The "=e" (tmp) in line 8 denotes, that the variable tmp is used for the result of the LD.W instruction and is an extended 64-bit data register.

Chapter 6 Optimized and Randomized AES Implementations on TriCore
6.1 Implementation Constraints
The AES implementation runs on an automotive processor and it will not be the only task running there, so it will be implemented as space saving as possible in terms of RAM and ROM consumption1.

The ROM consumption is measured for a whole minimal AES implementation. We get the ROM consumption from the Assembler list file. To generate the list file, we have to enable the “Generate list file” option in the project settings dialog “Assembler -> List file” from the Tasking toolchain. The generated list file contains a “List section summary” with the ROM consumption.

6.1.1 Validation Tests
The National Institute of Standards and Technology (NIST) requires that all AES implementations have to pass some algorithm validation tests. This tests help to find failures early during the development process. It can happen, e.g., that we calculate a wrong byte in a S-box which is not used during an encryption but during a later one. This can happen because the used S-box bytes depend on the used plaintext and key combinations. To verify the correctness of the implementation, the following four tests are used:

• Variable key known answer test,

• Variable text (plaintext/ciphertext) known answer test, and

• tables known answer test,

• Monte Carlo test.

Variable Key Known Answer Test
During the Key Known Answer Test (Key KAT) [NIS98], the 128-bit plaintext is always initialized to zero. The key used in an AES iteration i, 0 ≤ i < 127 consists of a one in the ith position and zeros in all other positions. Each of the possible keys is created by shifting the one a single position at a time, starting at the most significant (left most) bit of the key.

Variable Text (Plaintext/Ciphertext) Known Answer Test
In this test, the 128-bit key is initialized to zero. Each 128-bit block data input consists of a one at the ith position, starting at the most significant (left most) bit position, where i = 0,...,127 is the AES iteration.

Table Known Answer Test
This test contain sets of key and plaintext combinations which have been carefully selected to test the tables used for the AES implementation, e.g., the S-boxes.

Monte Carlo Test
This test is used to identify implementation flaws. Therefore the authors of AES supply pseudo random values for the key and plaintext. The test consists of 4 million AES encryption/decryption cycles. Thereby the results of every 10,000th encryption or decryption cycle are recorded and evaluated with the supplied results from the authors. The first key k0 and plaintext p0 are given.

6.2 Available AES Implementations
We begin our work by reviewing the already existing AES implementations from Brijesh Singh [Sin08] and Robert Szerwinski, done at Robert Bosch GmbH, Corporate Research2.

The implementations are 8-bit implementations and work on the 16-byte State matrix, as it is described in the AES overview, Section 2.1. The State matrix is filled sequentially with memcpy() which is a few cycles faster than copying the plaintext byte-by-byte within a for() loop. Both authors combine the SubBytes and the ShiftRows transformation. This new transformation realizes the SubBytes transformation as a table lookup and integrates the ShiftRows transformation by using the indices for the table lookups, as they were after a ShiftRows transformation. This trick safes the rotation operations of the ShiftRows transformation, as described in Section 2.3.4.

Now we come to the differences of the available implementations. Brijesh Singh uses separate key scheduling in his implementation. He realized the xtimes operation as a table lookup.

This idea comes from Tom St Denis [Den07]. For this lookup table he needs 256bytes additional ROM. In the following we refer to this implementation as Singh-1. The second implementation of Brijesh Singh also uses a separate key scheduling but computes the xtimes operation as it is described in Section 2.3. We refer to this second implementation as Singh-2.

Robert Szerwinski realized the AES implementation without the lookup table for the xtimes operation. He implements this function as it was suggested in the standard and is described in the mathematical background, Section 2.2. Contrary to the standard, he calculates the keys on-the-fly. We refer to his implementation as Szerwinski.

6.3 General Optimizations Hints
Both presented implementations neither uses the 32-bit register size nor any special instruction of the TriCore. But they are a good start for our further optimization.

Since we program our implementation in C, we have to pay attention to the general C compiler specific behavior. For example we pass constant variables to functions as const values, which avoid the compiler from creating a local copy of the variable in the called function. Brijesh Singh also notes that using the __near pragma3 before a variable declaration causes the linker to put the variable into direct addressable memory, which makes the access to the variable faster. Another known fact is, that calling functions is very expensive. We avoid functions if they are used often, for example in loops. Alternatively for those functions we create a macro or an inline function of it. The compiler then replaces the macro resp. inline function call by the actual code of the function.

6.4 AES Optimizations for the TriCore
In the following sections we will walk through the four AES transformations AddRoundKey,

SubBytes, ShiftRows and MixColumns and take a look, if they can be optimized for the

TriCore. In addition, we will use the 32-bit capability of the TriCore and implement the transformations as discussed in Section 2.3.5.

6.4.1 AddRoundKey
The AddRoundKey transformation is implemented as macro to avoid the overhead of a function call. Because we arrange the State matrix and the expanded key arranged column wise on RAM, we can load one word w from the State matrix and one word from the expanded key in two clock cycles.

The address for the used key is calculated from the current round and the current column index of the State matrix.

The compiler creates the following Assembler instructions for the 32-bit addition of one column from the State matrix:

LD16.W

D0, [A10]

; load word from the

; State matrix

LD16.W

D15, [A15]16

; load key word from expanded

; key

XOR

D0, D15

; add them

ST16.W

[A10], D0

; write the word back to RAM

The LD.W (load word) instruction, loads 32 subsequent bits from the State matrix, which is located in RAM, into the data register D[0]. The next instruction loads 32-bits from the expanded key to D[15]. Both values are bitwise added with the XOR instruction which the result in D[0]. The ST.W (store word) instruction stores the register value at the given address in A[10] (the position of the State matrix in RAM).

By processing four bytes at the same time, we save approximately the factor four for the key addition, i.e., the AddRoundKey takes 176cc instead of 704cc.

6.4.2 SubBytes and ShiftRows
We also combine the SubBytes transformation with the ShiftRows transformation by a priori choosing the right index value for the table lookup.

During the SubBytes calculation we have to access each byte sr,c of the State matrix for processing the table lookup (the S-box). To access each byte of the State matrix instead of the whole column, we have to cast the 32-bit value from the State matrix to an 8-bit value.

This causes the compiler to use the byte data type modifiers, e.g., LD.BU (load byte unsigned). The LEA instruction computes the absolute address of the S-box and puts the resulting address in address register A[4]. The ADDSC.A (add scaled index to address) instruction calculates the index value and puts the resulting address in address register A[2]. Finally, the S-box byte at the calculated index is stored in the State matrix. The following assembler listing depicts the combined SubBytesShiftRows operation for one byte of the State matrix. Address register A[15] contains the next byte which is used as index for the TLU in the S-box (line three).

LEA

A4, sbox

; compute absolute address

; of the S-box

LD16.BU

D15, [A15]

; load byte from the State

; matrix

ADDSC16.A

A2, A4,D15,#0

; use the byte as index

; for the TLU

LD16.BU

D15, [A2]

; load the TLU value

ST16.BU

[A15], D15

; store the looked up value in

; the State matrix

We have to do these steps (excluding the LEA operation) for all 16bytes bytes of the State matrix. Since the State matrix of the reference implementations are also processed byte-wise, we cannot achieve an speed improvement here.

6.4.3 MixColumns

The MixColumns transformation is well discussed in Section 2.3.5.

Instead of the shift instructions which are used by Gladman, we use the multiplication by {1B}16 for the reduction if an overflow occurs. We do this because the TriCore can multiply in two clock cycles, which is still faster than the alternative shift and XOR method which is used by Gladman.

To implement the MixColumn operation efficiently, we need the ability to cyclically rotate the content of a 32-bit register by one, two and three bytes. If we take a look at the instruction manual of the TriCore [Inf08b], we cannot find a direct instruction for a rotation. Fortunately, we can utilize the DEXTR (extract from double register) instruction for the rotation. The function __my_rot1 returns the word w rotated by one byte position to the left. To rotate the word w by two and three bytes the same instruction can be used with a different starting bit, i.e., 16 for a rotation by two, and 24 for a rotation by three bytes.

The function MixColumn calculates the MixColumns operation for the column w with four rotations and four XOR operations. The MixColumns function in line six applies this transformation on each column of the State matrix.

6.4.4 Inverse MixColumns
The InvMixColumn operation is applied on each column w of the State matrix. The implementation follows the transformation description in Section 2.3.5.

6.5 Size Comparison of Unprotected Implementations
With the above discussed additional optimizations, we can achieve the memory consumption. The values are taken from the assembler listing file, which depicts the data and code section size. The data section from Singh-1 contains the 256-byte xtimes table, the 12bytes round constant RC and the 256bytes S-box. This leads to a total data memory consumption of 524bytes.

The data memory of Singh-2 does not contain the 256-byte xtimes table, it consists of the S-box 256bytes and 12bytes for the round constant RC. The Szerwinski implementation only holds the 256-byte S-box in the data memory.

It is obviously that the program code of the Optimized AES implementation is about twice as large as the other implementations. This is because it includes encryption and decryption whereas the first three implementations only include the encryption. The data memory contains the S-box, the inverse S-box for the decryption, the 12-byte round constant RC which is needed by the key expansion (Section 2.3.2) and four byte for the constant word {80808080}16 which is needed in the FFmulX function. The performance analysis will be done in Section 7.

6.6 Protected AES Implementation
The protected AES implementation uses masking, which is described in Section 4.2 and randomization from Section 4.3 to be resistant against timing and first order DPA attacks. We add D additional dummy operations to the real transformations. From these fixed D dummy operations, a random number of D1 dummy operations, D1 ≤ D are executed before the first round begins. After the last round, D − D1 dummy operations are executed. The dummy operations work one a dummy State matrix DS with a dummy key DK.

In the randomization zones, the transformations are shuffled. This means for the AddRoundKey and MixColumns transformations, that the starting column of the State matrix for these transformations is randomly chosen with two random bits rc. During the SubBytes transformation, the first column is randomly chosen as well as the first row. The random starting row is selected with the two random bits rr.

The four mask bytes M1 to M4 are used to mask the MixColumns transformation, as it is described in Section 4.2.5.

The SubBytes transformation is masked with the random mask M.

For masking and randomization 320 random bits are needed per AES encryption. Table 6.2 summarizes the names and usage of the random bits.

6.6.1 Masking the Optimized AES Implementation
We have outlined two critical operations which must be protected against a differential power analysis attack. The initial two SubBytes transformations during the first two rounds as well as the last two SubBytes transformations. We will present two independent masking schemes. To protect the implementations against first order DPA attacks, we will use additive masking from Section 4.2.1. In addition we will implement the Combined Masking in Tower Fields

approach from Section 4.2.4. Both masking schemes can be exchanged in the protected AES implementation.

Additively Masked SubBytes
To mask the SubBytes transformation additively, the whole S-box has to be recalculated, see Section 4.2.1. Because the recomputing of the S-box is an expensive operation in terms of runtime, we make it configurable. We can decide during the compilation of the implementation, after how many AES encryptions the S-box will be recalculated.

To perform a SubBytes transformation with the masked S-box, we have to add (XOR) the value M to every byte sr,c of the current State matrix. After the masked transformation, we have to remove the mask Mx from the looked-up value.

Two possibilities are conceivable. The mask values M and Mx can be added in the key scheduling and mask the expanded round. Then the current round key would mask the State in such a way we need it. But that implies, that we have to calculate the expanded key every time, we change the masks for the S-box. Since our implementation is targeted on devices with a fixed key, a recalculation of the expanded key would be very wasteful in terms of runtime.

We chose the AddRoundKey transformation, to add the currently needed mask values to the State.

SubBytes Masked with the CMTF Approach
As we have discussed in Section 2.3.8, we can calculate the inversion in so called composite fields. In this section we show how the equations from Section 4.2.4 can be computed. This can also be found in [WOL02]. First we have to transform an element a ∈ GF(28) to its isomorphic composite field representation ahx + al, ah,al ∈ GF(24):

a =∼ ahx + al, a ∈ GF(28), ah,al ∈ GF(24).

The transformation between the field representations can be calculated the following way:

ah2 = aB ⊕ a2 ⊕ a3, ah3 = aB,

where ai,i = 0,...,7 are the bits from a ∈ GF(28) and ahj,alj,j = 0,...,3 are the bits from ah,al ∈ GF(24). The ⊕ denotes the XOR operation in GF(2).

The inverse transformation, which converts the two-term polynomial ahx + al back into a GF(28) element a ∈ GF(28) can be calculated following way:

a4 = aA ⊕ aB ⊕ al3, a5 = aB ⊕ al2, a6 = aA ⊕ al2 ⊕ al3 ⊕ ah0, a7 = aB ⊕ al2 ⊕ ah3.

To compute the inversion in the composite field GF((24)2) an addition, multiplication, and inversion is needed. Two elements from GF(24) can be added the following way:

(ahx + al) ⊕ (bhx + bl) = (ah ⊕ bh)x + (al ⊕ bl). The multiplication of two elements, which is defined in Section 2.3.8:

(6.2)

q(x) = a(x) · b(x) mod m4(x), a,b,q ∈ GF(24)

can be computed the following way:

aA = a0 ⊕ a3, aB = a2 ⊕ a3,

q0 = a0b0 ⊕ a3b1 ⊕ a2b2 ⊕ a1b3, q1 = a1b0 ⊕ aAb1 ⊕ aBb2 ⊕ (a1 ⊕ a2)b3, q2 = a2b0 ⊕ a1b1 ⊕ aAb2 ⊕ aBb3, q3 = a3b0 ⊕ a2b1 ⊕ a1b2 ⊕ aAb3,

(6.3)

where ai,bi,qi,i = 0,...,3 are the bits from a, b and q ∈ GF(24).

To square an element in GF(24):

q(x) = a(x)2 mod m4(x) (6.4)

we calculate:

q0 = a0 ⊕ a2, q1 = a2, q2 = a1 ⊕ a3, q3 = a3.

The multiplicative inversion a−1 of an element a ∈ GF(24) is calculated by the following formula:

aA = a1 ⊕ a2 ⊕ a3 ⊕ a1a2a3

q0 = aA ⊕ a0 ⊕ a0a2 ⊕ a1a2 ⊕ a0a1a2 q1 = a0a1 ⊕ a0a2 ⊕ a1a2 ⊕ a3 ⊕ a1a3 ⊕ a0a1a3 q2 = a0a1 ⊕ a2 ⊕ a0a2 ⊕ a3 ⊕ a0a3 ⊕ a0a2a3 q3 = aA ⊕ a0a3 ⊕ a1a3 ⊕ a2a3

Now we can calculate the inversion of a whole two-term polynomial. Note that, this time, the addition ⊕ and multiplication ⊗ are in GF(24) and are computed as we have discussed above. The inversion of the whole two-term polynomial is done by calculating:

d (6.5)

where:

d , d ∈ GF(24) (6.6)

As we can see4, the transformation between the two field representations and the computation of the inverse are very expensive computations because they work bitwise on the elements. This values therefore will be pre-calculated stored in four lookup tables Td1,Td2,Tm and Tinv as it is discussed in Section 4.2.4. function gets a byte x from the State matrix and the mask m which masks the calculation. The functions mapGF16 and imapGF16 transform the elements between the field representations.

The function rot1_right rotates the byte b bitwise to the right. The function rot1_left rotates the byte b bitwise to the left. The function parity computes the parity bit of the given byte.

6.6.2 Size Comparison of Protected Implementations

Both AES implementations are protected with dummy operations and shuffling. The first uses additive masking to mask as masking aproach, the second uses CMTF as masking aproach. Both implementations contain the code for encryption and decryption.

The code size increases for the additive masked implementation, because we have additional functions that work on columns instead on the whole State matrix for the individual transformations.

Thank you professor Sonova

SvenSvensonov · Jan 19, 2015

Slav Defence said:
Thank you professor Sonova

Glad you enjoyed it, I'll be posting the next few sections sometime later today.

Slav Defence · Jan 19, 2015

SvenSvensonov said:
Glad you enjoyed it, I'll be posting the next few sections sometime later today.

I have enjoyed all your posts, Professor!
Seems like you ain't reading any of em!

SvenSvensonov · Jan 19, 2015

Slav Defence said:
I have enjoyed all your posts, Professor!
Seems like you ain't reading any of em!

- nah, I read everything addressed to me, just reiterating my appreciation towards your appreciation of my contributions. Also, I'd welcome any feedback or suggestions you can provide on the future of my posting here on PDF.

Feedback appreciated

SvenSvensonov · Jan 20, 2015

Chapter 7 Implementation Results
In this chapter, the measurement setup is discussed. The runtime of the unprotected implementations is measured and compared in the following. The runtime behavior of the protected AES implementations is analyzed in detail.

7.1 Measurement Setup
To compare the four unprotected AES implementations in size and speed, the given source files are imported to the Tasking development environment. Therefore an own project for every implementation is created where the following default build settings have been changed:

• optimization: level 2 (Optimize more),

• trade-off between speed and size: Level 4 (Size), and

• active build configuration: Release.

To measure the timing behavior of the AES implementations, the TriCore 1796 emulation device (TC1796ED) development board is used, see Section 5.1.4.

The statistical analysis were carried out on a PC featuring a Intel Celeron processor at 2.53GHz and 960MB RAM. We used Matlab Release 2007b with the Statistics Toolbox version 6.1 [Mat07].

7.2 Runtime Comparison of the Implementations
For the runtime measurement, we use the maximal CPU frequency of 150MHz. The runtime Xj,j = 1,...,1000 of aes_Encrypt is measured for 1,000 different encryptions. Thereby we use one fixed key k and different plaintexts p. The new plaintext pj+1 we use is the output cj of the previous encryption cj = aes_Encrypt(k,pj). Algorithm 8 shows the program fragment used for the measurement process. We start the measurement right before the AES run which is denoted by the tic instruction. After the run we stop the measurement (denoted by toc) and save the elapsed runtime. The runtime X can be seen as a random variable that is normally distributed, i.e. X = x1,...,x1000 with Xj ∼ N(µx,σx2),xj iid.

The optimized AES implementation is with 78.31µs (204.32kbyte/s) the fastest implementation. It is about 2.6 times faster than the comparable implementation from Singh-2, which also uses an expanded key and only table lookups for the SubBytes transformation. The faster

Singh-1 implementation also uses tables lookups for the xtimes part of the MixColumns transformation. The runtime results of the measurements Singh-1 and Singh-2 fit with the results from Brijesh Singh presented in [Sin08].

The runtime of the protected AES implementation depends on the amount of dummy operations D and the number of AES runs which use the same mask for the S-box. The value in line five of Table 7.1 depicts the runtime, if the S-box is recalculated on every new AES run. With zero dummy operations one AES encryption takes 160µs (100kbyte/s).

Line six depicts the runtime for the CTMF approach. This implementation is just a concept. The decryption is not implemented. With this approach it is possible to remask every SubBytes operation. The runtime does not depend on the used masks since it does not recalculate the S-box. It calculates the inversion as we described it in Section 4.2.4. With 1183 + (D · 56)µs it is about eight times slower than the protected implementation in line five. This implementation can be used in higher order DPA resistant implementations.

The runtime for D = 0 to D = 255. The S-box is recalculated at every new AES run. We can see a linear runtime behavior which is described by 160µs+D·3.8µs for the encryption. For the decryption, we get 160µs+D·4µs. We can see, that the decryption takes a bit more time, which is mainly caused by the inverse MixColumns transformation. For the two figures, we measured the runtime for six different numbers of dummy operations, D ∈ {1,50,100,150,200,255} according to Algorithm 8 and drawed the regression line between the measured results. To calculate the polynomial for the regression line we use the Matlab function polyfit on the measured samples.

To check, if the S-box recalculation works correctly, we build a version with D = 10 dummy operations and which recalculates the S-box after every 5th AES run. The peak denotes, that at this AES run the S-box has been recalculated. We can observe, that a recalculation of the S-box takes about 38µs.

7.3 Timing Analysis
To measure the dependency between the runtime of an AES encryption and the input data five measurements sets were created. Between the different measurement sets, the key k and the plaintext p vary.

All measurement sets contain random data, but set two and set three share the same keys and different plaintexts and set four and five use the same plaintext but different keys.

The runtime Xj of every encryption run (aes_Encrypt) was measured 1,000 times for each set. Algorithm 9 denotes the measurement process.

For the timing analysis, we need a higher resolution. Because we cannot increase the precision of the MCDS timing measurement, we decrease the CPU frequency instead. The CPU frequency is controlled by the PLL1. We have to setup the four main parameters: the P-Divider, N-Divider, K-Divider, and VCO2 range where:

N

fCPU = · · fOSC.

P K

With N = 100, P = 5 and K = 16 we get the smallest frequency fCPU = 25MHz. The VCO range is set to 400 – 500MHz.

The smallest measurable time unit ∆ (the absolute failure) we can measure with the MCDS is 250µs.To get a more precise time domain we oversample our measurements by the factor of 163, which means we average the measurement over 16 AES runs.

7.3.1 Timing Analysis of the Unprotected AES Implementations

The arithmetic mean values X, i.e., an estimator for the expected value µX, and the empirical standard deviation sX, i.e., an estimator for σX, each for all 1,000 measurements.

The timing behavior of the implementation from Szerwinski and in Table 7.4 our optimized AES implementation are depicted. Both leak timing information, since they show a different runtime for different combinations of the key and plaintext.

We can distinguish the different runtimes for the different measurement sets. For example, we can clearly distinguish between set one on the left and set five on the right side of the histogram.

The measurement results from Singh-2 are given. In this AES implementation, the MixColumns transformation is calculated on-the-fly, as described in Section 2.3.3. It is the only transformation, with a conditional part (xtimes) during the calculation. As expected,

this leads to an exploitable timing consumption. The mean values X depend on the processed data.

The measurement results from Singh-1 are shown. Also in this implementa-

tion, the mean values X for the five different measurements sets are distinguishable. The implementation from Singh-1 contains no obviously data dependent parts which would affect the runtime. Here the xtimes part of the MixColumns transformation is realized as a lookup table. We guess, that the different runtime results from platform depending scratchpad-RAM effects. This contradicts our assumption from Section 1.2, that a table lookup has a constant runtime. Thus we can observe a kind of cache effect.

Because the Protected AES implementation is based on the optimized AES implementation, it can be assumed, that the protected implementation also shows a different timing consumption on different key and plaintext pairs. It must be verified, that this information cannot be used during a timing analysis attack.

7.3.2 Timing Analysis of the Protected AES Implementation
For the measurements we use D = 16 dummy operations. The SubBytes transformation is implemented as we described it in Section 6.6.1. We re-mask the S-box on every new run of the cipher.

Encryption

The mean values X for the different measurement sets, which also differs but with a much smaller variance.

The runtime of the an AES encryption with D = 16 dummy operations. The runtime was measured 1,000 times. As we expect, the most measured runtimes lie in the area x ± sx.

It is harder to distinguish between the measurement sets on the protected AES implementation.

To test, if the different measurement sets really do not lead to a distinguishable runtime, the two sample t-test is used. We fix the indices i and j. The statistical question we are interested in is if we can clearly distinguish between the sets Xki and Ykj, k = 1,...,1000. The t-test is applied among each mean value X = X(i), i = 1,... ,5 and Y = X(j), j = 1,...,5, where:

• Xk – Runtime of aes_Encrypt in µs for set i, with kth measurement, k = 1,...,1000

• Yk – Runtime of aes_Encrypt in µs for set j, with kth measurement, k = 1,...,1000

• Condition: Xk,Yk i.i.d., Xk ∼ N(µX,σX2 ), Yk ∼ N(µY ,σY2 ), k = 1,... ,1000, σX2 = σY2 and unknown

• Null hypothesis: H0 : µX = µY versus H1 : µX =6 µY

The zero indicates that the test cannot be rejected. This means that the mean values come from the same population and are not distinguishable. This holds for every measurement set combination. We get more information from the p-value on the right hand side of Table 7.8.

The p-value is the probability of observing the given result, [or one more extreme values] under the assumption that the null hypothesis is true. If the p-value is less than α, then you reject the null hypothesis. For example, if α = 0.05 and the p-value is 0.03, then you reject the null hypothesis. The converse is not true. If the p-value is greater than α, you do not accept the null hypothesis. You just have insufficient evidence to reject the null hypothesis.

One pre-condition for the applicability of the two sample t-test is that the unknown variances of both random variables X and Y are the same. We can check this with Fisher’s F-test. Fisher’s F-test verifies if the variances σX2 and σY2 of two random variables X and Y which are normally distributed with unknown and not necessarily equal mean values µx and µy are the same, see Section 3.1.3. Note, however, that the two-sample t-test checks statistical moments of first order, whereby the F-test checks statistical moments of second order. Effects of order two are more difficult to detect, in general, because they are smaller than effects of first order.

• Xk – Runtime of aes_Encrypt in µs for set i, with kth measurement, k = 1,...,1000

• Yk – Runtime of aes_Encrypt in µs for set j, with kth measurement, k = 1,...,1000

• Condition: Xk,Yk i.i.d., Xk ∼ N(µX,σX2 ), Yk ∼ N(µY ,σY2 ), k = 1,...,1000

• Null hypothesis: H0 : σx2 = σy2 versus H1 : σx2 =6 σy2

Every measurement set is compared with each other.

Decryption

The mean value X of the runtime for the given set. The mean values also differs for different key and plaintext combinations but with a much bigger variance.

It is harder to make a differentiation between the measurement sets on the protected AES implementation. The appearance of a specific runtime for a specific measurement set is almost normal distributed over all possible measured timings.

To test, if the different measurement sets really do not lead to a distinguishable runtime, we also use the two sample t-test. The t-test is applied among each mean value X = X(i), i = 1,...,5 and Y = X(j), j = 1,...,5, where:

• Xk – Runtime of aes_Decrypt in µs for set i, with kth measurement, k = 1,...,1000

• Yk – Runtime of aes_Decrypt in µs for set j, with kth measurement, k = 1,...,1000

• Condition: Xk,Yk i.i.d., Xk ∼ N(µX,σX2 ), Yk ∼ N(µY ,σY2 ), k = 1,... ,1000, σX2 = σY2 and unknown

• Null hypothesis: H0 : µX = µY versus H1 : µX =6 µY

The zero indicates that the hypothesis cannot be rejected. This means that the mean values probably come from the same population and are not distinguishable. This holds for every measurement set combination.

The condition for the t-test is, that the unknown variance of both random variables X and Y is the same. To ensure this, we use the F-test again.

• Xk – Runtime of aes_Decrypt in µs for set i, with kth measurement, k = 1,...,1000

• Yk – Runtime of aes_Decrypt in µs for set j, with kth measurement, k = 1,...,1000

• Condition: Xk,Yk i.i.d., Xk ∼ N(µX,σX2 ), Yk ∼ N(µY ,σY2 ), k = 1,...,1000

• Null hypothesis: H0 : σx2 = σy2 versus H1 : σx2 =6 σy2

The F-test results for the measured timings. We can see, that they are all above the 5%.

The statistical tests show, that we cannot distinguish between the different sets, i.e., the different keys, on hand of the runtime behavior of our protected AES implementation. Thus, we can assume that the protected AES implementation is resistant against timing analysis.

7.4 Power Analysis
It is not in the scope of this thesis to analyze the protected AES implementation with respect to Power Analysis resistance. The design of the countermeasures was chosen in such a way that the implementation should be resistant against Simple Power Analysis and standard Correlation and Differential Power Analysis, see [HOM06]. This means that more traces are needed for a successfully mounted DPA attack than for an attack against an unprotected implementation. However, this implementation is susceptible to advanced power analysis techniques, especially second order attacks.

The theoretical results from [THM07] and the practical results from [TH08] showed, that the number of required traces for a successful mounted SCA increases by (D + 1) · 16. The number of traces which are needed to attack an unprotected implementation on the TriCore has not been investigated yet.

To measure the resistance against a DPA, one should start to build a measurement setup. The setup could be tested by analyzing an unprotected AES version like the optimized AES implementation. This implementation should be fairly easy to attack.

Afterwards the protected AES implementation should be attacked with methods like windowing and second order DPA, see Section 4.3.1. However, this requires a fair amount of experience in attacking protected implementations.

Chapter 8 Conclusion
The topic of this master thesis was the side-channel resistant implementation of the Advanced Encryption Standard (AES) on a typical automotive processor.

Symmetric crypto primitives are the basis for most cryptographic architectures. The Advanced Encryption Standard (AES) is the current standard for block ciphers. An efficient and secure implementation of this algorithm therefore is very important for future security mechanisms in the automotive world. While there are many fairly efficient AES implementations in the field, most of them are not optimized for embedded systems, especially in the automotive world. Most AES implementations are for 8-bit systems, e.g., smart cards, or for 32-bit architectures with lots of memory. However, in the automotive domain we often have 32-bit RISC processors and tight constraints on the memory and the program size while speed is most often not an issue. Therefore, the selection and implementation of an unprotected AES specially adapted for the needs of the automotive industry is necessary.

The topic of side-channel attacks is one of the most important issues in applied cryptography. In the literature, one can find many results on high security systems like smart cards. However, until now we are not aware of any side-channel analysis resistant implementation of AES on automotive processors. For the first time, an efficient (32-bit optimized) and secured (countermeasures against timing and power analysis) AES implementation has been investigated on automotive platforms.

The target platform for the implementation was an Infineon TriCore TC1796. This processor is widely used in Engine Control Units (ECU) made by Robert Bosch GmbH as well as by Continental AG, for example. It is a 32-bit RISC processor.

After a short introduction of the Advanced Encryption Standard we discussed in Chapter 2 various implementation options for unsecured implementations. Since many of the existing highly efficient implementations are very processor-specific and are therefore not directly comparable, we introduced a simple processor model for comparison. Note that the resulting estimates are not very meaningful for itself but are important with respect to the other implementations. Thus, not the absolute number of clock cycles and memory consumption is important but the relative factor compared to other implementations. Specifically, we estimated a faster runtime with factor 2.2 between an existing, optimized 8-bit AES implementation(3808 cc) and a 32-bit optimized implementation (1732cc).

In practice, we could achieve this factor with our real optimized AES implementation (78.31µs) which utilizes the TriCore instruction set and uses the whole 32-bit register size. Compared to the optimized 8-bit implementation (Singh-2 with 208.3µs) our implementation is faster by the factor 2.6 as the reference implementation.

Since the overall security level of automotive processors is moderate (no secure memory, for example) it should be enough to secure an AES implementation against the most dangerous,

low-level attacks, i.e., Timing Analysis (TA), Simple and Differential Power Analysis (SPA, DPA) (first order). In Chapter 3 we therefore discussed various attack techniques for Side Channel Analysis (SCA) and gave an overview about software countermeasures against timing analysis and (first order) Differential Power Analysis (DPA).

With our goal, an efficient and protected AES implementation, in mind we analyzed the existing SCA countermeasures in terms of performance, memory requirement and known attacks in Chapter 4. Using our processor model, we showed that AES implementations with resistance against second order DPA attacks require, in general, over ten times more clock cycles than unprotected implementation. Irrespective of the security aspects, overheads of this order are too much for an implementation on automotive processors. So we decided to use a first order DPA resistant masking method and combine it with randomization (shuffling and dummy states) as proposed in [HOM06] and [THM07]. In these two papers, the authors carefully selected various low-cost countermeasures to achieve the overall security level required.

In addition to this method, we provide a concept study for a masking scheme called CMTF (Combined Masking in Tower Fields), that can be used in an implementation that is resistant against higher order DPA attacks. Originally, CMTF has been proposed as hardware countermeasure. However, [OS05] used precomputed tables for the transformation between sub-fields, making it thus applicable as a software countermeasure. The CMTF method with precomputed tables can be combined with the randomization approach by Herbst et al., i.e., shuffling and dummy states).

It has turned out, that the affine transformation which is part of the SubBytes transformation is a bottleneck in terms of runtime because it works bitwise on each byte. An improvement would be if four affine transformation would be calculated on four bytes of the State matrix in parallel. This would require the possibility of rotating the four bytes in a word in parallel. The CMTF masking scheme as well as the other scheme which calculate the SubBytes transformation on the fly can be used in implementations which use more than one mask for the bytes of the State matrix, since the RAM consumption of the additive masking increases very fast with the number of masks used.

In Chapter 7 we evaluated the timing behavior of our implementations. We showed, that the unprotected implementations are vulnerable to timing analysis attacks. In addition we showed, that our protected implementation is probably timing analysis resistant. Power analysis has been left as a subject for further investigation.

In our theoretical part, we estimated a runtime slowdown by a factor of 1.9 between the 32-bit optimized implementation (1732cc) and the additive masked implementation (3332cc). We could observe this factor in practice at the real optimized implementation (≈ 78µs) and the protected implementation with no dummy operations (≈ 160µs). This gives us a factor of two. The security level can be further increased by introducing additional dummy states

(D = 0).6

Summarizing, for the first time we have an efficient and side-channel secured AES implementation on a typical automotive processor. The resulting implementation is very competitive in terms of program and memory size as well as performance. The security of the implementation is adapted to the current security level of automotive processors (low-cost countermeasures since we have no secure memory or tamper protection). The implementation can be used as a plug and play solution which can replace vulnerable unprotected AES implementations.

We predict, that in the mid-term future, no automotive manufacturer or supplier can afford

Bibliography
[AG01]

M. L. Akkar and Chr. Giraud. An Implementation of DES and AES, Secure against Some Attacks. In C. K. Koç, D. Naccache, and Chr. Paar, editors, Proceedings of CHES 2001, volume 2162 of Lecture Notes in Computer Science, pages 309–318. Springer-Verlag, 2001.

[AK98]

Ross J. Anderson and Markus G. Kuhn. Low Cost Attacks on Tamper Resistant Devices. In Proceedings of the 5th International Workshop on Security Protocols, pages 125–136, London, UK, 1998. Springer-Verlag.

[AKS07]

Onur Aciiçmez, Çetin Kaya Koç, and Jean-Pierre Seifert. On the Power of Simple Branch Prediction Analysis. In ASIACCS 2007: Proceedings of the 2nd ACM symposium on Information, computer and communications security, pages 312– 320, New York, NY, USA, 2007. ACM.

[APSQ06]

Cédric Archambeau, Eric Peeters, François-Xavier Standaert, and Jean-Jacques Quisquater. Template Attacks in Principal Subspaces. In Louis Goubin and Mitsuru Matsui, editors, Proceedings of CHES 2006, volume 4249 of Lecture Notes in Computer Science, pages 1–14. Springer-Verlag, 2006.

[BBF+03]

Guido Bertoni, Luca Breveglieri, Pasqualina Fragneto, Marco Macchetti, and Stefano Marchesin. Efficient Software Implementation of AES on 32-bit Platforms. In Burton S. Kaliski Jr., Çetin Kaya Koç, and Christof Paar, editors, Proceedings of CHES 2002: Revised Papers from the 4th International Workshop on Cryptographic Hardware and Embedded Systems, volume 2523 of Lecture Notes in Computer Science, pages 159–171. Springer-Verlag, 2003.

[Ber05]

Daniel J. Bernstein. Cache-timing Attacks on AES, 2005. cr.yp.to antiforgery/cachetiming-20050414.pdf.

[BGK04]

Johannes Blömer, Jorge Guajardo, and Volker Krummel. Provably Secure Masking of AES. In H. Handschuh and M. Anwar Hasan, editors, Selected Areas in Cryptography – SAC 2004, number 3357 in Lecture Notes in Computer Science, pages 69–83. Springer-Verlag, 2004.

[Bos98]

Karl Bosch. Statistik-Taschenbuch. Oldenbourg, January 1998.

[BS08]

Daniel Bernstein and Peter Schwabe. New AES Software Speed Records. Cryptology ePrint Archive, Report 2008/381, 2008. Cryptology ePrint Archive

[BSI01]

Bundesamt für Sicherheit in der Informationstechnik BSI. AIS 20: Funktionalitätsklassen und Evaluationsmethodologie für deterministische Zufallszahlengeneratoren. Anwendungshinweise und Interpretationen zum Schema (AIS) Version 1, 932.12.1999, BSI, 2001.

[CB09]

D. Canright and Lejla Batina. A Very Compact “Perfectly Masked” S-Box for AES (corrected). Cryptology ePrint Archive, Report 2009/011, 2009. http:// eprint.iacr.org/.

[CCD00]

Christophe Clavier, Jean-Sébastien Coron, and Nora Dabbous. Differential Power Analysis in the Presence of Hardware Countermeasures. In Christof Paar

and Çetin Kaya Koç, editors, Cryptographic Hardware and Embedded Systems CHES 2000, volume 1965 of Lecture Notes in Computer Science, pages 252–263. Springer-Verlag, 2000.

[CGPR08] Jean-Sébastien Coron, Christophe Giraud, Emmanuel Prouff, and Matthieu Rivain. Attack and Improvement of a Secure S-Box Calculation Based on the Fourier Transform. In Cryptographic Hardware and Embedded Systems – CHES 2008, volume 5154 of Lecture Notes in Computer Science, pages 1–14. Springer-Verlag, 2008.

[Cha82]

David Chaum. Blinding Signatures for Untraceable Payments. In David Chaum, Ronald L. Rivest, and Alan T. Sherman, editors, Advances in Cryptology – CRYPTO 1982, pages 199–203, 1982.

[CRR02]

Suresh Chari, Josyula R. Rao, and Pankaj Rohatgi. Template Attacks. In Proceedings of CHES 2002, volume 2523 of Lecture Notes in Computer Science, pages 13–28, London, UK, August 2002. Springer-Verlag.

[Den07]

Tom St Denis. Cryptography for Developers. Syngress Media, 2007.

[DR02]

Joan Daemen and Vincent Rijmen. The Design of Rijndael, AES – The Advanced Encryption Standard. Springer-Verlag, 2002.

[FIP01]

FIPS-197. Advanced Encryption Standard (AES). Federal Information Processing Standards Publication 197, November 2001. Available at http://csrc.nist. gov/.

[Gla01]

Brian Gladman. A Specification for Rijndael, the AES Algorithm. v3.1.0, http://fp.gladman.plus.com/cryptography_technology/rijndael/ aes.spec.v310.pdf, 2001.

[Gla07]

Brian Gladman. A Specification for Rijndael, the AES Algorithm. v3.1.6, http://fp.gladman.plus.com/cryptography_technology/rijndael/ aes.spec.v316.pdf, 2007.

[GLRP06] Benedikt Gierlichs, Kerstin Lemke-Rust, and Christof Paar. Templates vs. Stochastic Methods. In Louis Goubin and Mitsuru Matsui, editors, Proceedings of CHES 2006, volume 4249 of Lecture Notes in Computer Science, pages 15–29.

Springer-Verlag, 2006.

[GT03]

Jovan Dj. Golic and Christophe Tymen. Multiplicative Masking and Power Analysis of AES. In Burton S. Kaliski Jr., Çetin Kaya Koç, and Christof Paar, editors, Proceedings of CHES 2002, volume 2523 of Lecture Notes in Computer Science, pages 198–212. Springer-Verlag, 2003.

[HJM05]

Martin Hell, Thomas Johansson, and Willi Meier. Grain - A Stream Cipher for Constrained Environments. eSTREAM, ECRYPT Stream Cipher. Technical report, 2005/010, ECRYPT (European Network of Excellence for Cryptology), 2005.

[HKQ99]

Gael Hachez, Francois Koeune, and Jean-Jacques Quisquater. cAESar Results:

Implementation of Four AES Candidates on Two Smart Cards. In Morris

Dworkin, editor, Proceedings of Second Advanced Encryption Standard Candidate Conference, pages 95–108, Rome, Italy, March 22-23, 1999, 1999.

[HMS00]

Erwin Hess, Bernd Meyer, and Torsten Schütze. Information Leakage Attacks Against Smart Card Implementations of Cryptographic Algorithms and Countermeasures – A Survey. In EUROSMART Security Conference, pages 55–64, 2000.

[HOM06]

Christoph Herbst, Elisabeth Oswald, and Stefan Mangard. An AES Smart Card

Implementation Resistant to Power Analysis Attacks. In Jianying Zhou, Moti Yung, and Feng Bao, editors, Proceedings of ACNS 2006, volume 3989 of Lecture Notes in Computer Science, pages 239–255. Springer-Verlag, 2006.

[IAI06]

IAIK-2006. VLSI Products – Software Modules. IAIK - TU Graz research/vlsi/02_products/index.php, January 2006.

[Inf02]

Infineon Technologies AG. TriCore 2, 32-bit Unified Processor Core, 2002,08 edition, Sept 2002.

[Inf05]

Infineon Technologies AG. TriCore™, Instruction Set Simulator (ISS), User Guide, 2005-01 edition, January 2005.

[Inf07]

Infineon Technologies AG. TC1796 32-Bit Single-Chip Microcontroller, TriCore, User’s Manual, 2007-07 edition, July 2007.

[Inf08a]

Infineon Technologies AG. TC1796 32-Bit Single-Chip Microcontroller, TriCore, Data Sheet, 2008-04 edition, April 2008.

[Inf08b]

Infineon Technologies AG. TC1796 32-Bit Single-Chip Microcontroller, TriCore, Instruction Set, V1.3 & V1.3.1 Architecture, 2008-01 edition, July 2008.

[JPS05]

Marc Joye, Pascal Paillier, and Berry Schoenmakers. On Second-Order Differential Power Analysis. In CHES 2005: Revised Papers from the 7th International Workshop on Cryptographic Hardware and Embedded Systems, volume 3659 of Lecture Notes in Computer Science, pages 293–308. Springer-Verlag, 2005.

[Ker83]

Auguste Kerckhoffs. La cryptographie militaire. In Journal des sciences militaires, volume IX, pages 5–83, Jan. 1883.

[KJJ99]

P. Kocher, J. Jaffe, and B. Jun. Differential Power Analysis. In M. J. Wiener, editor, Proceedings of CRYPTO 1999, volume 1666 of Lecture Notes in Computer Science, pages 388–397. Springer-Verlag, 1999.

[Kna06]

Anthony W. Knapp. Basic Algebra. Birkhäuser, 2006.

[Koc96]

P. Kocher. Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems. In N. Koblitz, editor, Proceedings of CRYPTO 1996, volume 1109 of Lecture Notes in Computer Science, pages 104–113. Springer-Verlag, 1996.

[Mat07]

The MathWorks, Inc. Statistics Toolbox User’s Guide, Version 6.1, 2007.

[MDS99]

Thomas S. Messerges, Ezzy A. Dabbish, and Robert H. Sloan. Investigations of Power Analysis Attacks on Smartcards. In Proceedings of the USENIX Workshop on Smartcard Technology, pages 151–162, Berkeley, CA, USA, 1999. USENIX Association.

[Mes00]

Thomas Messerges. Securing the AES Finalists Against Power Analysis Attacks. In Bruce Schneier, editor, Proceedings of FSE 2000, volume 1978 of Lecture Notes in Computer Science, pages 150–164. Springer-Verlag, 2000.

[MOP07]

Stefan Mangard, Elisabeth Oswald, and Thomas Popp. Power Analysis Attacks: Revealing the Secrets of Smart Cards. Springer-Verlag, 2007.

[MS03]

Sumio Morioka and Akashi Satoh. An Optimized S-box Circuit Architecture for Low Power AES Design. In Burton S. Kaliski Jr., Çetin Kaya Koç, and Christof Paar, editors, CHES 2002: Revised Papers from the 4th International Workshop on Cryptographic Hardware and Embedded Systems, volume 2523 of Lecture Notes in Computer Science, pages 172–186. Springer-Verlag, 2003.

[NIS98]

NIST. Known Answer Test for Advanced Encryption Standard (AES). http: //csrc.nist.gov/archive/aes/rijndael/rijndael-vals.zip, 1998.

[NS06]

Michael Neve and Jean-Pierre Seifert. Advances on Access-Driven Cache Attacks on AES. In Selected Areas in Cryptography, pages 147–162, 2006.

[OMP04]

Elisabeth Oswald, Stefan Mangard, and Norbert Pramstaller. Secure and Efficient Masking of AES – a Mission Impossible? Cryptology ePrint Archive, Report 2004/134, 2004. Cryptology ePrint Archive

[OMPR05] Elisabeth Oswald, Stefan Mangard, Norbert Pramstaller, and Vincent Rijmen. A Side-Channel Analysis Resistant Description of the AES S-Box. In Henri Gilbert and Helena Handschuh, editors, Proceedings of FSE 2005, volume 3557 of Lecture Notes in Computer Science, pages 413–423. Springer-Verlag, 2005.

[OS05] Elisabeth Oswald and Kai Schramm. An Efficient Masking Scheme for AES Software Implementations. In JooSeok Song, Taekyoung Kwon, and Moti Yung, editors, Proceedings of WISA 2005, number 3786 in Lecture Notes in Computer Science, pages 292–305. Springer-Verlag, 2005.

[OST06]

Dag Arne Osvik, Adi Shamir, and Eran Tromer. Cache Attacks and Countermeasures: The Case of AES. In David Pointcheval, editor, Proceedings of CT-RSA 2006, volume 3860 of Lecture Notes in Computer Science, pages 1–20. Springer-Verlag, 2006.

[Paa94]

C. Paar. Efficient VLSI Architectures for Bit-Parallel Computation in Galois Fields. Dissertation, Institute for Experimental Mathematics, Universität Essen, Deutschland, 1994. http://crypto.rub.de/imperia/md/content/texte/ theses/paar_php_diss.pdf.

[R¨03]

Christian Röpke. Praktikum B: Embedded Smartcard Microcontrollers. http: //www.christianroepke.de/studium_praktikumB.html, 2003.

[RDJ+01] Atri Rudra, Pradeep K. Dubey, Charanjit S. Jutla, Vijay Kumar, Josyula R. Rao, and Pankaj Rohatgi. Efficient Rijndael Encryption Implementation with Composite Field Arithmetic. In CHES 2001: Proceedings of the Third International Workshop on Cryptographic Hardware and Embedded Systems, Lecture Notes in Computer Science, pages 171–184. Springer-Verlag, 2001.

[Rij01]

Vincent Rijmen. Efficient Implementation of the Rijndael S-Box. Technical report, Katholieke Universiteit Leuven, Dept. ESAT, Belgium, 2001.

[Seu05]

Hermann Seuschek. DPA-Analyse von Implementierungen symmetrischer kryptographischer Algorithmen. Diplomarbeit, Technische Universität München, 2005.

[Sin08]

Brijesh Singh. Bosch Cryptographic Library – Documentation. Technical report, Robert Bosch GmbH, CR/AEA, 2008.

[SMTM01] Akashi Satoh, Sumio Morioka, Kohji Takano, and Seiji Munetoh. A Compact

Rijndael Hardware Architecture with S-Box Optimization. In Colin Boyd, editor,

Proceedings of ASIACRYPT ’2001, volume 2248 of Lecture Notes in Computer Science, pages 239–254. Springer-Verlag, 2001.

[Sti05]

Douglas R. Stinson. Cryptography: Theory and Practice. CRC press, 3rd edition, 2005.

[TH08]

Stefan Tillich and Christoph Herbst. Attacking State-of-the-Art Software Countermeasures—A Case Study for AES. In Elisabeth Oswald and Pankaj Rohatgi, editors, Cryptographic Hardware and Embedded Systems – CHES 2008, volume 5154 of LNCS, pages 228–243. Springer, 2008.

[THM07]

Stefan Tillich, Christoph Herbst, and Stefan Mangard. Protecting AES Software Implementations on 32-Bit Processors Against Power Analysis. In Jonathan Katz and Moti Yung, editors, Proceedings of ACNS 2007, volume 4521 of Lecture Notes in Computer Science, pages 141–157. Springer-Verlag, 2007.

[Tri03]

Elena Trichina. Combinational Logic Design for AES Subbytes Transformation on Masked Data. Technical report, Cryptology ePrint Archive: Report 2003/236, 2003.

[TSG02]

Elena Trichina, Domenico De Seta, and Lucia Germani. Simplified Adaptive Multiplicative Masking for AES. In Burton S. Kaliski Jr., Çetin Kaya Koç, and Christof Paar, editors, Proceedings of CHES 2002, volume 2523 of Lecture Notes in Computer Science, pages 187–197. Springer-Verlag, 2002.

[WG04]

Guido Walz and Barbara Grabowski, editors. Lexikon der Statistik: mit ausführlichem Anwendungsteil (German Edition). Spektrum Akademischer Verlag, first edition, June 2004.

[WOL02]

J. Wolkerstorfer, E. Oswald, and M. Lamberger. An ASIC Implementation of the AES S-Boxes. In Proceedings of CT-RSA 2002, volume 2271 of Lecture Notes in Computer Science, pages 67–78. Springer-Verlag, 2002.

@Slav Defence - My most esteemed pupil, so ends this lesson. Hope you enjoyed it and fortunately for you there will be no quiz :lol:

. @Jungibaaz

Slav Defence · Jan 20, 2015

SvenSvensonov said:
Chapter 7 Implementation Results
In this chapter, the measurement setup is discussed. The runtime of the unprotected implementations is measured and compared in the following. The runtime behavior of the protected AES implementations is analyzed in detail.

7.1 Measurement Setup
To compare the four unprotected AES implementations in size and speed, the given source files are imported to the Tasking development environment. Therefore an own project for every implementation is created where the following default build settings have been changed:

• optimization: level 2 (Optimize more),

• trade-off between speed and size: Level 4 (Size), and

• active build configuration: Release.

To measure the timing behavior of the AES implementations, the TriCore 1796 emulation device (TC1796ED) development board is used, see Section 5.1.4.

The statistical analysis were carried out on a PC featuring a Intel Celeron processor at 2.53GHz and 960MB RAM. We used Matlab Release 2007b with the Statistics Toolbox version 6.1 [Mat07].

7.2 Runtime Comparison of the Implementations
For the runtime measurement, we use the maximal CPU frequency of 150MHz. The runtime Xj,j = 1,...,1000 of aes_Encrypt is measured for 1,000 different encryptions. Thereby we use one fixed key k and different plaintexts p. The new plaintext pj+1 we use is the output cj of the previous encryption cj = aes_Encrypt(k,pj). Algorithm 8 shows the program fragment used for the measurement process. We start the measurement right before the AES run which is denoted by the tic instruction. After the run we stop the measurement (denoted by toc) and save the elapsed runtime. The runtime X can be seen as a random variable that is normally distributed, i.e. X = x1,...,x1000 with Xj ∼ N(µx,σx2),xj iid.

The optimized AES implementation is with 78.31µs (204.32kbyte/s) the fastest implementation. It is about 2.6 times faster than the comparable implementation from Singh-2, which also uses an expanded key and only table lookups for the SubBytes transformation. The faster

Singh-1 implementation also uses tables lookups for the xtimes part of the MixColumns transformation. The runtime results of the measurements Singh-1 and Singh-2 fit with the results from Brijesh Singh presented in [Sin08].

The runtime of the protected AES implementation depends on the amount of dummy operations D and the number of AES runs which use the same mask for the S-box. The value in line five of Table 7.1 depicts the runtime, if the S-box is recalculated on every new AES run. With zero dummy operations one AES encryption takes 160µs (100kbyte/s).

Line six depicts the runtime for the CTMF approach. This implementation is just a concept. The decryption is not implemented. With this approach it is possible to remask every SubBytes operation. The runtime does not depend on the used masks since it does not recalculate the S-box. It calculates the inversion as we described it in Section 4.2.4. With 1183 + (D · 56)µs it is about eight times slower than the protected implementation in line five. This implementation can be used in higher order DPA resistant implementations.

The runtime for D = 0 to D = 255. The S-box is recalculated at every new AES run. We can see a linear runtime behavior which is described by 160µs+D·3.8µs for the encryption. For the decryption, we get 160µs+D·4µs. We can see, that the decryption takes a bit more time, which is mainly caused by the inverse MixColumns transformation. For the two figures, we measured the runtime for six different numbers of dummy operations, D ∈ {1,50,100,150,200,255} according to Algorithm 8 and drawed the regression line between the measured results. To calculate the polynomial for the regression line we use the Matlab function polyfit on the measured samples.

To check, if the S-box recalculation works correctly, we build a version with D = 10 dummy operations and which recalculates the S-box after every 5th AES run. The peak denotes, that at this AES run the S-box has been recalculated. We can observe, that a recalculation of the S-box takes about 38µs.

7.3 Timing Analysis
To measure the dependency between the runtime of an AES encryption and the input data five measurements sets were created. Between the different measurement sets, the key k and the plaintext p vary.

All measurement sets contain random data, but set two and set three share the same keys and different plaintexts and set four and five use the same plaintext but different keys.

The runtime Xj of every encryption run (aes_Encrypt) was measured 1,000 times for each set. Algorithm 9 denotes the measurement process.

For the timing analysis, we need a higher resolution. Because we cannot increase the precision of the MCDS timing measurement, we decrease the CPU frequency instead. The CPU frequency is controlled by the PLL1. We have to setup the four main parameters: the P-Divider, N-Divider, K-Divider, and VCO2 range where:

N

fCPU = · · fOSC.

P K

With N = 100, P = 5 and K = 16 we get the smallest frequency fCPU = 25MHz. The VCO range is set to 400 – 500MHz.

The smallest measurable time unit ∆ (the absolute failure) we can measure with the MCDS is 250µs.To get a more precise time domain we oversample our measurements by the factor of 163, which means we average the measurement over 16 AES runs.

7.3.1 Timing Analysis of the Unprotected AES Implementations

The arithmetic mean values X, i.e., an estimator for the expected value µX, and the empirical standard deviation sX, i.e., an estimator for σX, each for all 1,000 measurements.

The timing behavior of the implementation from Szerwinski and in Table 7.4 our optimized AES implementation are depicted. Both leak timing information, since they show a different runtime for different combinations of the key and plaintext.

We can distinguish the different runtimes for the different measurement sets. For example, we can clearly distinguish between set one on the left and set five on the right side of the histogram.

The measurement results from Singh-2 are given. In this AES implementation, the MixColumns transformation is calculated on-the-fly, as described in Section 2.3.3. It is the only transformation, with a conditional part (xtimes) during the calculation. As expected,

this leads to an exploitable timing consumption. The mean values X depend on the processed data.

The measurement results from Singh-1 are shown. Also in this implementa-

tion, the mean values X for the five different measurements sets are distinguishable. The implementation from Singh-1 contains no obviously data dependent parts which would affect the runtime. Here the xtimes part of the MixColumns transformation is realized as a lookup table. We guess, that the different runtime results from platform depending scratchpad-RAM effects. This contradicts our assumption from Section 1.2, that a table lookup has a constant runtime. Thus we can observe a kind of cache effect.

Because the Protected AES implementation is based on the optimized AES implementation, it can be assumed, that the protected implementation also shows a different timing consumption on different key and plaintext pairs. It must be verified, that this information cannot be used during a timing analysis attack.

7.3.2 Timing Analysis of the Protected AES Implementation
For the measurements we use D = 16 dummy operations. The SubBytes transformation is implemented as we described it in Section 6.6.1. We re-mask the S-box on every new run of the cipher.

Encryption

The mean values X for the different measurement sets, which also differs but with a much smaller variance.

The runtime of the an AES encryption with D = 16 dummy operations. The runtime was measured 1,000 times. As we expect, the most measured runtimes lie in the area x ± sx.

It is harder to distinguish between the measurement sets on the protected AES implementation.

To test, if the different measurement sets really do not lead to a distinguishable runtime, the two sample t-test is used. We fix the indices i and j. The statistical question we are interested in is if we can clearly distinguish between the sets Xki and Ykj, k = 1,...,1000. The t-test is applied among each mean value X = X(i), i = 1,... ,5 and Y = X(j), j = 1,...,5, where:

• Xk – Runtime of aes_Encrypt in µs for set i, with kth measurement, k = 1,...,1000

• Yk – Runtime of aes_Encrypt in µs for set j, with kth measurement, k = 1,...,1000

• Condition: Xk,Yk i.i.d., Xk ∼ N(µX,σX2 ), Yk ∼ N(µY ,σY2 ), k = 1,... ,1000, σX2 = σY2 and unknown

• Null hypothesis: H0 : µX = µY versus H1 : µX =6 µY

The zero indicates that the test cannot be rejected. This means that the mean values come from the same population and are not distinguishable. This holds for every measurement set combination. We get more information from the p-value on the right hand side of Table 7.8.

The p-value is the probability of observing the given result, [or one more extreme values] under the assumption that the null hypothesis is true. If the p-value is less than α, then you reject the null hypothesis. For example, if α = 0.05 and the p-value is 0.03, then you reject the null hypothesis. The converse is not true. If the p-value is greater than α, you do not accept the null hypothesis. You just have insufficient evidence to reject the null hypothesis.

One pre-condition for the applicability of the two sample t-test is that the unknown variances of both random variables X and Y are the same. We can check this with Fisher’s F-test. Fisher’s F-test verifies if the variances σX2 and σY2 of two random variables X and Y which are normally distributed with unknown and not necessarily equal mean values µx and µy are the same, see Section 3.1.3. Note, however, that the two-sample t-test checks statistical moments of first order, whereby the F-test checks statistical moments of second order. Effects of order two are more difficult to detect, in general, because they are smaller than effects of first order.

• Xk – Runtime of aes_Encrypt in µs for set i, with kth measurement, k = 1,...,1000

• Yk – Runtime of aes_Encrypt in µs for set j, with kth measurement, k = 1,...,1000

• Condition: Xk,Yk i.i.d., Xk ∼ N(µX,σX2 ), Yk ∼ N(µY ,σY2 ), k = 1,...,1000

• Null hypothesis: H0 : σx2 = σy2 versus H1 : σx2 =6 σy2

Every measurement set is compared with each other.

Decryption

The mean value X of the runtime for the given set. The mean values also differs for different key and plaintext combinations but with a much bigger variance.

It is harder to make a differentiation between the measurement sets on the protected AES implementation. The appearance of a specific runtime for a specific measurement set is almost normal distributed over all possible measured timings.

To test, if the different measurement sets really do not lead to a distinguishable runtime, we also use the two sample t-test. The t-test is applied among each mean value X = X(i), i = 1,...,5 and Y = X(j), j = 1,...,5, where:

• Xk – Runtime of aes_Decrypt in µs for set i, with kth measurement, k = 1,...,1000

• Yk – Runtime of aes_Decrypt in µs for set j, with kth measurement, k = 1,...,1000

• Condition: Xk,Yk i.i.d., Xk ∼ N(µX,σX2 ), Yk ∼ N(µY ,σY2 ), k = 1,... ,1000, σX2 = σY2 and unknown

• Null hypothesis: H0 : µX = µY versus H1 : µX =6 µY

The zero indicates that the hypothesis cannot be rejected. This means that the mean values probably come from the same population and are not distinguishable. This holds for every measurement set combination.

The condition for the t-test is, that the unknown variance of both random variables X and Y is the same. To ensure this, we use the F-test again.

• Xk – Runtime of aes_Decrypt in µs for set i, with kth measurement, k = 1,...,1000

• Yk – Runtime of aes_Decrypt in µs for set j, with kth measurement, k = 1,...,1000

• Condition: Xk,Yk i.i.d., Xk ∼ N(µX,σX2 ), Yk ∼ N(µY ,σY2 ), k = 1,...,1000

• Null hypothesis: H0 : σx2 = σy2 versus H1 : σx2 =6 σy2

The F-test results for the measured timings. We can see, that they are all above the 5%.

The statistical tests show, that we cannot distinguish between the different sets, i.e., the different keys, on hand of the runtime behavior of our protected AES implementation. Thus, we can assume that the protected AES implementation is resistant against timing analysis.

7.4 Power Analysis
It is not in the scope of this thesis to analyze the protected AES implementation with respect to Power Analysis resistance. The design of the countermeasures was chosen in such a way that the implementation should be resistant against Simple Power Analysis and standard Correlation and Differential Power Analysis, see [HOM06]. This means that more traces are needed for a successfully mounted DPA attack than for an attack against an unprotected implementation. However, this implementation is susceptible to advanced power analysis techniques, especially second order attacks.

The theoretical results from [THM07] and the practical results from [TH08] showed, that the number of required traces for a successful mounted SCA increases by (D + 1) · 16. The number of traces which are needed to attack an unprotected implementation on the TriCore has not been investigated yet.

To measure the resistance against a DPA, one should start to build a measurement setup. The setup could be tested by analyzing an unprotected AES version like the optimized AES implementation. This implementation should be fairly easy to attack.

Afterwards the protected AES implementation should be attacked with methods like windowing and second order DPA, see Section 4.3.1. However, this requires a fair amount of experience in attacking protected implementations.

Chapter 8 Conclusion
The topic of this master thesis was the side-channel resistant implementation of the Advanced Encryption Standard (AES) on a typical automotive processor.

Symmetric crypto primitives are the basis for most cryptographic architectures. The Advanced Encryption Standard (AES) is the current standard for block ciphers. An efficient and secure implementation of this algorithm therefore is very important for future security mechanisms in the automotive world. While there are many fairly efficient AES implementations in the field, most of them are not optimized for embedded systems, especially in the automotive world. Most AES implementations are for 8-bit systems, e.g., smart cards, or for 32-bit architectures with lots of memory. However, in the automotive domain we often have 32-bit RISC processors and tight constraints on the memory and the program size while speed is most often not an issue. Therefore, the selection and implementation of an unprotected AES specially adapted for the needs of the automotive industry is necessary.

The topic of side-channel attacks is one of the most important issues in applied cryptography. In the literature, one can find many results on high security systems like smart cards. However, until now we are not aware of any side-channel analysis resistant implementation of AES on automotive processors. For the first time, an efficient (32-bit optimized) and secured (countermeasures against timing and power analysis) AES implementation has been investigated on automotive platforms.

The target platform for the implementation was an Infineon TriCore TC1796. This processor is widely used in Engine Control Units (ECU) made by Robert Bosch GmbH as well as by Continental AG, for example. It is a 32-bit RISC processor.

After a short introduction of the Advanced Encryption Standard we discussed in Chapter 2 various implementation options for unsecured implementations. Since many of the existing highly efficient implementations are very processor-specific and are therefore not directly comparable, we introduced a simple processor model for comparison. Note that the resulting estimates are not very meaningful for itself but are important with respect to the other implementations. Thus, not the absolute number of clock cycles and memory consumption is important but the relative factor compared to other implementations. Specifically, we estimated a faster runtime with factor 2.2 between an existing, optimized 8-bit AES implementation(3808 cc) and a 32-bit optimized implementation (1732cc).

In practice, we could achieve this factor with our real optimized AES implementation (78.31µs) which utilizes the TriCore instruction set and uses the whole 32-bit register size. Compared to the optimized 8-bit implementation (Singh-2 with 208.3µs) our implementation is faster by the factor 2.6 as the reference implementation.

Since the overall security level of automotive processors is moderate (no secure memory, for example) it should be enough to secure an AES implementation against the most dangerous,

low-level attacks, i.e., Timing Analysis (TA), Simple and Differential Power Analysis (SPA, DPA) (first order). In Chapter 3 we therefore discussed various attack techniques for Side Channel Analysis (SCA) and gave an overview about software countermeasures against timing analysis and (first order) Differential Power Analysis (DPA).

With our goal, an efficient and protected AES implementation, in mind we analyzed the existing SCA countermeasures in terms of performance, memory requirement and known attacks in Chapter 4. Using our processor model, we showed that AES implementations with resistance against second order DPA attacks require, in general, over ten times more clock cycles than unprotected implementation. Irrespective of the security aspects, overheads of this order are too much for an implementation on automotive processors. So we decided to use a first order DPA resistant masking method and combine it with randomization (shuffling and dummy states) as proposed in [HOM06] and [THM07]. In these two papers, the authors carefully selected various low-cost countermeasures to achieve the overall security level required.

In addition to this method, we provide a concept study for a masking scheme called CMTF (Combined Masking in Tower Fields), that can be used in an implementation that is resistant against higher order DPA attacks. Originally, CMTF has been proposed as hardware countermeasure. However, [OS05] used precomputed tables for the transformation between sub-fields, making it thus applicable as a software countermeasure. The CMTF method with precomputed tables can be combined with the randomization approach by Herbst et al., i.e., shuffling and dummy states).

It has turned out, that the affine transformation which is part of the SubBytes transformation is a bottleneck in terms of runtime because it works bitwise on each byte. An improvement would be if four affine transformation would be calculated on four bytes of the State matrix in parallel. This would require the possibility of rotating the four bytes in a word in parallel. The CMTF masking scheme as well as the other scheme which calculate the SubBytes transformation on the fly can be used in implementations which use more than one mask for the bytes of the State matrix, since the RAM consumption of the additive masking increases very fast with the number of masks used.

In Chapter 7 we evaluated the timing behavior of our implementations. We showed, that the unprotected implementations are vulnerable to timing analysis attacks. In addition we showed, that our protected implementation is probably timing analysis resistant. Power analysis has been left as a subject for further investigation.

In our theoretical part, we estimated a runtime slowdown by a factor of 1.9 between the 32-bit optimized implementation (1732cc) and the additive masked implementation (3332cc). We could observe this factor in practice at the real optimized implementation (≈ 78µs) and the protected implementation with no dummy operations (≈ 160µs). This gives us a factor of two. The security level can be further increased by introducing additional dummy states

(D = 0).6

Summarizing, for the first time we have an efficient and side-channel secured AES implementation on a typical automotive processor. The resulting implementation is very competitive in terms of program and memory size as well as performance. The security of the implementation is adapted to the current security level of automotive processors (low-cost countermeasures since we have no secure memory or tamper protection). The implementation can be used as a plug and play solution which can replace vulnerable unprotected AES implementations.

We predict, that in the mid-term future, no automotive manufacturer or supplier can afford

Bibliography
[AG01]

M. L. Akkar and Chr. Giraud. An Implementation of DES and AES, Secure against Some Attacks. In C. K. Koç, D. Naccache, and Chr. Paar, editors, Proceedings of CHES 2001, volume 2162 of Lecture Notes in Computer Science, pages 309–318. Springer-Verlag, 2001.

[AK98]

Ross J. Anderson and Markus G. Kuhn. Low Cost Attacks on Tamper Resistant Devices. In Proceedings of the 5th International Workshop on Security Protocols, pages 125–136, London, UK, 1998. Springer-Verlag.

[AKS07]

Onur Aciiçmez, Çetin Kaya Koç, and Jean-Pierre Seifert. On the Power of Simple Branch Prediction Analysis. In ASIACCS 2007: Proceedings of the 2nd ACM symposium on Information, computer and communications security, pages 312– 320, New York, NY, USA, 2007. ACM.

[APSQ06]

Cédric Archambeau, Eric Peeters, François-Xavier Standaert, and Jean-Jacques Quisquater. Template Attacks in Principal Subspaces. In Louis Goubin and Mitsuru Matsui, editors, Proceedings of CHES 2006, volume 4249 of Lecture Notes in Computer Science, pages 1–14. Springer-Verlag, 2006.

[BBF+03]

Guido Bertoni, Luca Breveglieri, Pasqualina Fragneto, Marco Macchetti, and Stefano Marchesin. Efficient Software Implementation of AES on 32-bit Platforms. In Burton S. Kaliski Jr., Çetin Kaya Koç, and Christof Paar, editors, Proceedings of CHES 2002: Revised Papers from the 4th International Workshop on Cryptographic Hardware and Embedded Systems, volume 2523 of Lecture Notes in Computer Science, pages 159–171. Springer-Verlag, 2003.

[Ber05]

Daniel J. Bernstein. Cache-timing Attacks on AES, 2005. cr.yp.to antiforgery/cachetiming-20050414.pdf.

[BGK04]

Johannes Blömer, Jorge Guajardo, and Volker Krummel. Provably Secure Masking of AES. In H. Handschuh and M. Anwar Hasan, editors, Selected Areas in Cryptography – SAC 2004, number 3357 in Lecture Notes in Computer Science, pages 69–83. Springer-Verlag, 2004.

[Bos98]

Karl Bosch. Statistik-Taschenbuch. Oldenbourg, January 1998.

[BS08]

Daniel Bernstein and Peter Schwabe. New AES Software Speed Records. Cryptology ePrint Archive, Report 2008/381, 2008. Cryptology ePrint Archive

[BSI01]

Bundesamt für Sicherheit in der Informationstechnik BSI. AIS 20: Funktionalitätsklassen und Evaluationsmethodologie für deterministische Zufallszahlengeneratoren. Anwendungshinweise und Interpretationen zum Schema (AIS) Version 1, 932.12.1999, BSI, 2001.

[CB09]

D. Canright and Lejla Batina. A Very Compact “Perfectly Masked” S-Box for AES (corrected). Cryptology ePrint Archive, Report 2009/011, 2009. http:// eprint.iacr.org/.

[CCD00]

Christophe Clavier, Jean-Sébastien Coron, and Nora Dabbous. Differential Power Analysis in the Presence of Hardware Countermeasures. In Christof Paar

and Çetin Kaya Koç, editors, Cryptographic Hardware and Embedded Systems CHES 2000, volume 1965 of Lecture Notes in Computer Science, pages 252–263. Springer-Verlag, 2000.

[CGPR08] Jean-Sébastien Coron, Christophe Giraud, Emmanuel Prouff, and Matthieu Rivain. Attack and Improvement of a Secure S-Box Calculation Based on the Fourier Transform. In Cryptographic Hardware and Embedded Systems – CHES 2008, volume 5154 of Lecture Notes in Computer Science, pages 1–14. Springer-Verlag, 2008.

[Cha82]

David Chaum. Blinding Signatures for Untraceable Payments. In David Chaum, Ronald L. Rivest, and Alan T. Sherman, editors, Advances in Cryptology – CRYPTO 1982, pages 199–203, 1982.

[CRR02]

Suresh Chari, Josyula R. Rao, and Pankaj Rohatgi. Template Attacks. In Proceedings of CHES 2002, volume 2523 of Lecture Notes in Computer Science, pages 13–28, London, UK, August 2002. Springer-Verlag.

[Den07]

Tom St Denis. Cryptography for Developers. Syngress Media, 2007.

[DR02]

Joan Daemen and Vincent Rijmen. The Design of Rijndael, AES – The Advanced Encryption Standard. Springer-Verlag, 2002.

[FIP01]

FIPS-197. Advanced Encryption Standard (AES). Federal Information Processing Standards Publication 197, November 2001. Available at http://csrc.nist. gov/.

[Gla01]

Brian Gladman. A Specification for Rijndael, the AES Algorithm. v3.1.0, http://fp.gladman.plus.com/cryptography_technology/rijndael/ aes.spec.v310.pdf, 2001.

[Gla07]

Brian Gladman. A Specification for Rijndael, the AES Algorithm. v3.1.6, http://fp.gladman.plus.com/cryptography_technology/rijndael/ aes.spec.v316.pdf, 2007.

[GLRP06] Benedikt Gierlichs, Kerstin Lemke-Rust, and Christof Paar. Templates vs. Stochastic Methods. In Louis Goubin and Mitsuru Matsui, editors, Proceedings of CHES 2006, volume 4249 of Lecture Notes in Computer Science, pages 15–29.

Springer-Verlag, 2006.

[GT03]

Jovan Dj. Golic and Christophe Tymen. Multiplicative Masking and Power Analysis of AES. In Burton S. Kaliski Jr., Çetin Kaya Koç, and Christof Paar, editors, Proceedings of CHES 2002, volume 2523 of Lecture Notes in Computer Science, pages 198–212. Springer-Verlag, 2003.

[HJM05]

Martin Hell, Thomas Johansson, and Willi Meier. Grain - A Stream Cipher for Constrained Environments. eSTREAM, ECRYPT Stream Cipher. Technical report, 2005/010, ECRYPT (European Network of Excellence for Cryptology), 2005.

[HKQ99]

Gael Hachez, Francois Koeune, and Jean-Jacques Quisquater. cAESar Results:

Implementation of Four AES Candidates on Two Smart Cards. In Morris

Dworkin, editor, Proceedings of Second Advanced Encryption Standard Candidate Conference, pages 95–108, Rome, Italy, March 22-23, 1999, 1999.

[HMS00]

Erwin Hess, Bernd Meyer, and Torsten Schütze. Information Leakage Attacks Against Smart Card Implementations of Cryptographic Algorithms and Countermeasures – A Survey. In EUROSMART Security Conference, pages 55–64, 2000.

[HOM06]

Christoph Herbst, Elisabeth Oswald, and Stefan Mangard. An AES Smart Card

Implementation Resistant to Power Analysis Attacks. In Jianying Zhou, Moti Yung, and Feng Bao, editors, Proceedings of ACNS 2006, volume 3989 of Lecture Notes in Computer Science, pages 239–255. Springer-Verlag, 2006.

[IAI06]

IAIK-2006. VLSI Products – Software Modules. IAIK - TU Graz research/vlsi/02_products/index.php, January 2006.

[Inf02]

Infineon Technologies AG. TriCore 2, 32-bit Unified Processor Core, 2002,08 edition, Sept 2002.

[Inf05]

Infineon Technologies AG. TriCore™, Instruction Set Simulator (ISS), User Guide, 2005-01 edition, January 2005.

[Inf07]

Infineon Technologies AG. TC1796 32-Bit Single-Chip Microcontroller, TriCore, User’s Manual, 2007-07 edition, July 2007.

[Inf08a]

Infineon Technologies AG. TC1796 32-Bit Single-Chip Microcontroller, TriCore, Data Sheet, 2008-04 edition, April 2008.

[Inf08b]

Infineon Technologies AG. TC1796 32-Bit Single-Chip Microcontroller, TriCore, Instruction Set, V1.3 & V1.3.1 Architecture, 2008-01 edition, July 2008.

[JPS05]

Marc Joye, Pascal Paillier, and Berry Schoenmakers. On Second-Order Differential Power Analysis. In CHES 2005: Revised Papers from the 7th International Workshop on Cryptographic Hardware and Embedded Systems, volume 3659 of Lecture Notes in Computer Science, pages 293–308. Springer-Verlag, 2005.

[Ker83]

Auguste Kerckhoffs. La cryptographie militaire. In Journal des sciences militaires, volume IX, pages 5–83, Jan. 1883.

[KJJ99]

P. Kocher, J. Jaffe, and B. Jun. Differential Power Analysis. In M. J. Wiener, editor, Proceedings of CRYPTO 1999, volume 1666 of Lecture Notes in Computer Science, pages 388–397. Springer-Verlag, 1999.

[Kna06]

Anthony W. Knapp. Basic Algebra. Birkhäuser, 2006.

[Koc96]

P. Kocher. Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems. In N. Koblitz, editor, Proceedings of CRYPTO 1996, volume 1109 of Lecture Notes in Computer Science, pages 104–113. Springer-Verlag, 1996.

[Mat07]

The MathWorks, Inc. Statistics Toolbox User’s Guide, Version 6.1, 2007.

[MDS99]

Thomas S. Messerges, Ezzy A. Dabbish, and Robert H. Sloan. Investigations of Power Analysis Attacks on Smartcards. In Proceedings of the USENIX Workshop on Smartcard Technology, pages 151–162, Berkeley, CA, USA, 1999. USENIX Association.

[Mes00]

Thomas Messerges. Securing the AES Finalists Against Power Analysis Attacks. In Bruce Schneier, editor, Proceedings of FSE 2000, volume 1978 of Lecture Notes in Computer Science, pages 150–164. Springer-Verlag, 2000.

[MOP07]

Stefan Mangard, Elisabeth Oswald, and Thomas Popp. Power Analysis Attacks: Revealing the Secrets of Smart Cards. Springer-Verlag, 2007.

[MS03]

Sumio Morioka and Akashi Satoh. An Optimized S-box Circuit Architecture for Low Power AES Design. In Burton S. Kaliski Jr., Çetin Kaya Koç, and Christof Paar, editors, CHES 2002: Revised Papers from the 4th International Workshop on Cryptographic Hardware and Embedded Systems, volume 2523 of Lecture Notes in Computer Science, pages 172–186. Springer-Verlag, 2003.

[NIS98]

NIST. Known Answer Test for Advanced Encryption Standard (AES). http: //csrc.nist.gov/archive/aes/rijndael/rijndael-vals.zip, 1998.

[NS06]

Michael Neve and Jean-Pierre Seifert. Advances on Access-Driven Cache Attacks on AES. In Selected Areas in Cryptography, pages 147–162, 2006.

[OMP04]

Elisabeth Oswald, Stefan Mangard, and Norbert Pramstaller. Secure and Efficient Masking of AES – a Mission Impossible? Cryptology ePrint Archive, Report 2004/134, 2004. Cryptology ePrint Archive

[OMPR05] Elisabeth Oswald, Stefan Mangard, Norbert Pramstaller, and Vincent Rijmen. A Side-Channel Analysis Resistant Description of the AES S-Box. In Henri Gilbert and Helena Handschuh, editors, Proceedings of FSE 2005, volume 3557 of Lecture Notes in Computer Science, pages 413–423. Springer-Verlag, 2005.

[OS05] Elisabeth Oswald and Kai Schramm. An Efficient Masking Scheme for AES Software Implementations. In JooSeok Song, Taekyoung Kwon, and Moti Yung, editors, Proceedings of WISA 2005, number 3786 in Lecture Notes in Computer Science, pages 292–305. Springer-Verlag, 2005.

[OST06]

Dag Arne Osvik, Adi Shamir, and Eran Tromer. Cache Attacks and Countermeasures: The Case of AES. In David Pointcheval, editor, Proceedings of CT-RSA 2006, volume 3860 of Lecture Notes in Computer Science, pages 1–20. Springer-Verlag, 2006.

[Paa94]

C. Paar. Efficient VLSI Architectures for Bit-Parallel Computation in Galois Fields. Dissertation, Institute for Experimental Mathematics, Universität Essen, Deutschland, 1994. http://crypto.rub.de/imperia/md/content/texte/ theses/paar_php_diss.pdf.

[R¨03]

Christian Röpke. Praktikum B: Embedded Smartcard Microcontrollers. http: //www.christianroepke.de/studium_praktikumB.html, 2003.

[RDJ+01] Atri Rudra, Pradeep K. Dubey, Charanjit S. Jutla, Vijay Kumar, Josyula R. Rao, and Pankaj Rohatgi. Efficient Rijndael Encryption Implementation with Composite Field Arithmetic. In CHES 2001: Proceedings of the Third International Workshop on Cryptographic Hardware and Embedded Systems, Lecture Notes in Computer Science, pages 171–184. Springer-Verlag, 2001.

[Rij01]

Vincent Rijmen. Efficient Implementation of the Rijndael S-Box. Technical report, Katholieke Universiteit Leuven, Dept. ESAT, Belgium, 2001.

[Seu05]

Hermann Seuschek. DPA-Analyse von Implementierungen symmetrischer kryptographischer Algorithmen. Diplomarbeit, Technische Universität München, 2005.

[Sin08]

Brijesh Singh. Bosch Cryptographic Library – Documentation. Technical report, Robert Bosch GmbH, CR/AEA, 2008.

[SMTM01] Akashi Satoh, Sumio Morioka, Kohji Takano, and Seiji Munetoh. A Compact

Rijndael Hardware Architecture with S-Box Optimization. In Colin Boyd, editor,

Proceedings of ASIACRYPT ’2001, volume 2248 of Lecture Notes in Computer Science, pages 239–254. Springer-Verlag, 2001.

[Sti05]

Douglas R. Stinson. Cryptography: Theory and Practice. CRC press, 3rd edition, 2005.

[TH08]

Stefan Tillich and Christoph Herbst. Attacking State-of-the-Art Software Countermeasures—A Case Study for AES. In Elisabeth Oswald and Pankaj Rohatgi, editors, Cryptographic Hardware and Embedded Systems – CHES 2008, volume 5154 of LNCS, pages 2
oti Yung, editors, Proceedings of ACNS 2007, volume 4521 of Lecture Notes in Computer Science, pages 141–157. Springer-Verlag, 2007.

[Tri03]

Elena Trichina. Combinational Logic Design for AES Subbytes Transformation on Masked Data. Technical report, Cryptology ePrint Archive: Report 2003/236, 2003.

[TSG02]

Elena Trichina, Domenico De Seta, and Lucia Germani. Simplified Adaptive Multiplicative Masking for AES. In Burton S. Kaliski Jr., Çetin Kaya Koç, and Christof Paar, editors, Proceedings of CHES 2002, volume 2523 of Lecture Notes in Computer Science, pages 187–197. Springer-Verlag, 2002.

[WG04]

Guido Walz and Barbara Grabowski, editors. Lexikon der Statistik: mit ausführlichem Anwendungsteil (German Edition). Spektrum Akademischer Verlag, first edition, June 2004.

[WOL02]

J. Wolkerstorfer, E. Oswald, and M. Lamberger. An ASIC Implementation of the AES S-Boxes. In Proceedings of CT-RSA 2002, volume 2271 of Lecture Notes in Computer Science, pages 67–78. Springer-Verlag, 2002.

@Slav Defence - My most esteemed pupil, so ends this lesson. Hope you enjoyed it and fortunately for you there will be no quiz. @Jungibaaz

My most respectable professor!

Du ist eine Mann!

Thank you so much for giving us time and making an attempt to knock some sense of Electronic warfare

We give you this special award:

Regards

Search

Principles of EMSEC - Side Channel Analysis: How we side-step encryption

SvenSvensonov

PROFESSIONAL

SvenSvensonov

PROFESSIONAL

Screambowl

BANNED

SvenSvensonov

PROFESSIONAL

Screambowl

BANNED

Slav Defence

THINK TANK VICE CHAIRMAN: ANALYST

Neptune

SENIOR MEMBER

SvenSvensonov

PROFESSIONAL

Neptune

SENIOR MEMBER

Slav Defence

THINK TANK VICE CHAIRMAN: ANALYST

SvenSvensonov

PROFESSIONAL

Slav Defence

THINK TANK VICE CHAIRMAN: ANALYST

SvenSvensonov

PROFESSIONAL

SvenSvensonov

PROFESSIONAL

Slav Defence

THINK TANK VICE CHAIRMAN: ANALYST

Similar threads

Latest posts

Pakistan Defence Latest Posts

Pakistan Affairs Latest Posts

Military Forum Latest Posts

Country Latest Posts