SvenSvensonov
PROFESSIONAL
- Joined
- Oct 15, 2014
- Messages
- 1,617
- Reaction score
- 207
- Country
- Location
This is a little something I put together to explain a certain concept called Side Channel Analysis. Once again it will not provide any classified data, but it does go into additional and further depth and technicality.This is an advanced article with a lot of math that proves the concept of Side Channel Analysis - an EMSEC activity.
This is a big document so additional sections will be added as I finish with them. If something doesn't make sense, or you want to check a piece of information feel free to ask me for clarification... or you can wait until the entire document is posted. The final section will have all the references (these will explain some of the name, backup the research and offer additional articles on the subject), but in the meantime I will answer any questions you have as best as I can (the technicals aren't a problem, the legality is).
Most of the matrixes have been omitted as they were attachments and each section had too many. I couldn't keep each post under ten photos/attachments.
SvenSvensonov
Chapter 1 Introduction
1.1 Motivation
Nowadays, security becomes more and more important in the automotive industry. Robert Bosch GmbH, Corporate Research supplies the necessary know-how and provides solutions for future applications in the business units.
Many functions in a modern car are realized with the help of software applications and the underlying electronics. This means in particular, that a growing part of the components in a car are realized through an easy to copy and manipulate software. Due to this fact, there is a growing need for an effective protection of these software applications to save the intellectual property on the one hand and to disclose manipulation, e.g., by tuning, on the other hand.
There are up to 80 electronic control units (ECU) built into a modern upper class car. These control units come from different suppliers, communicate over potentially insecure networks with each other and with the outside world respectively, e.g., with diagnostic equipment or in the near future Car2Car and Car2Infrastructure. As a consequence, these control units should authenticate each other and the communication data should be cryptographically protected against manipulation.
For several use cases in the automotive domain, e.g., tuning detection and sensor protection, cryptography constitutes a considerable part of the overall-security architecture. Due to their performance and compactness in implementations, symmetric encryption primitives are of special interest. These primitives can be used for attaining the following security goals:
• confidentiality: encryption of data,• authenticity: challenge-response protocols,
• integrity: message authentication codes.
The State-of-the-Art symmetric encryption scheme is the Advanced Encryption Standard
(AES), [FIP01]. This block cipher with key sizes of 128, 192, and 256bits and a block size of 128bits is remarkably fast and can be implemented very efficiently on a wide variety of platforms. It replaces the old Data Encryption Standard DES or, to be more precise, its variant Triple-DES.
A lot of proprietary mechanisms in the automotive environment are getting replaced by standardized security mechanisms. An example is here the changeover from the proprietary KeeLoq System1, which is used for keyless entry in cars and garage doors, to modern challenge response protocols based on AES2.
1.2 Adversary Model and Assumptions
Ross Anderson and Markus Kuhn give a good classification of adversaries’ knowledge, funding, and skills in [AK98]. They define the following three classes of attackers:
• Class I (clever outsiders): They are very intelligent but may have insufficient knowledge of the system. They may have access to only moderately sophisticated equipment. They often try to take advantage of an existing weakness in the system, rather than try to create one.
• Class II (knowledgeable insiders): They have substantial specialized technical education and experience. They have varying degrees of understanding of parts of the system but potential access to most of it. They have highly sophisticated tools and instruments for analysis.
• Class III (funded organizations): They are able to assemble teams of specialists with related and complementary skills backed by great funding resources. They can do in-depth analysis of the system, design sophisticated attacks, and use the most advanced analysis tools. They may use Class II Adversaries as part of the attack team.
We will focus on knowledgeable insiders, i.e., the cryptographic algorithm itself is known to the attack agent (no security by obscurity!). The attack agent may have access to equipment like oscilloscopes etc. that can be found in university labs. The specific implementation, e.g., the source code or an open sample device, is assumed to be not available to the attacker.
In terms of side-channel analysis resistance3, this means that our implementation should be resistant against:
• timing attacks, and
• first-order differential power analysis (DPA) attacks.
Resistance against timing attacks implies that we assume an adversary having access to an exact simulation of the runtime of the implementation. Note that we do not require resistance against the new microarchitectural side-channel attacks like cache and branch predication attacks [AKS07, Ber05, OST06, NS06]. This does not mean that these attacks may not be relevant, it just means that we think that an attack agent has no or only little chance to install a spy process on the target processor and to measure the timing signals with the necessary precision.4
We assume that table lookups (TLU) have a constant runtime. For achieving timing analysis resistance, we need a key-independent runtime of the implementation. Therefore we will consider the effects of the 4-stage pipeline of the target processor, since they have a huge influence on the runtime.
Resistance against first-order differential power analysis (DPA) attacks means that neither a standard difference of means nor a correlation attack against the implementation will succeed
with a sufficient number of measurement traces. We assume that the attacker has no access to a training device where he can input, e.g., random masks or keys, i.e., we effectively rule out template attacks [GLRP06, APSQ06, CRR02].
The goal is not to make the implementation secure against second or higher order attacks, i.e., the focus is put on simple masking schemes. Of course, very simple second order attacks,
e.g., by using the same mask twice directly on the bus, have to be prevented.
1.3 Scope
Robert Bosch GmbH, Corporate Sector Research and Engineering investigates the suitability of side-channel resistant implementations of symmetric key primitives under the constraints of the embedded domain. In this thesis, we will survey and investigate several masking countermeasures described in the literature to protect an implementation of the Advanced Encryption Standard (AES) against first-order DPA attacks in terms of security and runtime.
A selected countermeasure will be implemented in C (and partly in assembly language) on a typical 32-bit automotive processor used in Embedded Control Units (ECUs), e.g., TriCore. The implementation shall be secured against timing attacks as well. Of special interest is an efficient implementation taking into account the requirements of embedded devices with respect to memory and performance. Implementations of AES without countermeasures against DPA are provided by Robert Bosch CR/AEA. The tasks, that are to be carried out are:
• to understand and to survey the relevant AES implementation options,
• to understand and to survey the most important side-channel attacks (DPA: differenceof means + correlation power attack), timing attack,
• to survey countermeasures (it is not the goal to invent new countermeasures, just todescribe what is available in the literature),
• to describe the different masking schemes in their memory usage (RAM and ROM),performance, and number of random bits,
• to select and to implement the most appropriate or a combination of methods,
• to implement AES efficiently and securely with a key length of 128bits on a TriCore 17965 processor in C and, maybe, parts of it in Assembler.
All countermeasures considered in this master thesis will be algorithmic, i.e., softwarebased, countermeasures. Although many hardware countermeasures have been discussed in the literature, e.g., special logic styles such as Masked Dual Rail Pre-Charge Logic (MDPL), increasing the noise by special noise generators, etc., we will only discuss these countermeasures if they can be applied to software as well. However, we will survey a number of countermeasures which have been originally proposed for hardware implementations e.g., masking in tower fields and investigate if they can be applied efficiently in software implementations.
For the masking countermeasures, we need a certain amount of random data. The generation of these random values is not part of this thesis. We assume that all random values are given to our program via a pointer to an array with sufficient numbers of the necessary entropy. For simulation the stream cipher Grain-80 [HJM05] is used as pseudo random number generator (RNG).
We will present a side-channel resistant AES implementation for the TriCore architecture, which is not available up to now. Therefore we will combine side-channel analysis (SCA) countermeasures.
In Chapter 2 we will give a short overview of AES and the existing implementation options. We will introduce some mathematical basics which are important to understand the upcoming implementations. We introduce a processor model, which will help us to select the right implementation option for our needs.
Chapter 3 gives an overview about known side-channels and possible attacks on it. The focus lies on timing analysis because this form of side-channel will be analyzed at the end of this thesis.
Chapter 4 describes countermeasures to prevent side-channel attacks. We analyze different masking schemes in terms of size and runtime and combine them with other countermeasures like randomization.
Chapter 5 is dedicated to the TriCore. We will describe the architecture and the assembler instructions which are used for the implementation.
Chapter 6 is focused on the implementation. We will discuss improvements for the already available implementations and based on the improvements build a protected AES implementation.
In Chapter 7 we will analyze the implemented AES versions in terms of their timing behavior.
Chapter 2 AES – Advanced Encryption Standard
2.1 Introduction
The Advanced Encryption Standard (AES), also known as Rijndael, is a symmetric block cipher that was developed by the Belgian cryptographers Joan Daemen and Vincent Rijmen. AES encrypts/decrypts data with a block size of 128bits and a key size of 128, 192, or 256bits. AES targets on a small and fast implementation in hardware and software. The standard is described in [FIP01]. Brian Gladman wrote a detailed description of AES in [Gla01] and an updated version of the description in [Gla07]. We will give here a short overview to introduce the notation used in this thesis.
AES is a typical substitution-permutation network (SPN), see [Sti05]. An SPN consists of several rounds, each made up with a substitution and a permutation stage. The substitution stage obscures the relationship between elements of the plaintext and elements of the ciphertext, while the permutation stage makes of the influence of plaintext elements spread over the ciphertext.
For AES, the number of rounds depends on the key size. We focus on a key size of 128bits, which corresponds to ten rounds. The 16byte plaintext P = (p0,p1,...,p15) is internally arranged in a 4 × 4 matrix, the so-called State matrix S. Every element sr,c of the State matrix (r denotes the index of the row and c the index of the column) equals to one byte. In Figure 2.1 we can see the individual transformations which are applied to the State matrix. Figure 2.2 shows how the plaintext P is initially mapped to the State matrix S and then to the ciphertext output, C = (c0,c1,...,c15).
In each round, four transformations are applied to the State matrix S:
1. Byte Substitution (SubBytes), which is a non-linear byte substitution operating independently on each byte (sr,c) of the State matrix, using a substitution table called S-box. This S-box can be constructed from two transformations:
a) an inversion in the finite field GF(28), where the zero is mapped to itself; and
b) a bitwise affine transformation.
2. ShiftRows, where the bytes sr,c in the State matrix are cyclically shifted. The first row is not rotated. The second row is rotated to the left by one position, row three is rotated to the left by two positions and row four is rotated to the left by three positions.
3. MixColumns, where every column of the State matrix is treated as a four-term polynomial and gets multiplied by a fixed 4×4 matrix over GF(28) which is defined in [FIP01].
5
4. Key Addition (AddRoundKey), which adds a 4 × 4 key matrix (128-bit) to the State matrix. This key matrix is called round key.
The final round differs slightly from the first nine rounds. As we see in Figure 2.1 the MixColumns transformation is missing, which is also typical for a substitution-permutation network since the MixColumns transformation adds no cryptographic strength to the algorithm if it is included at the end of the cipher.
Decryption is accomplished by running the algorithm “backward”. The last round key is first XORed with the ciphertext, and then in each round, the inverse linear transformations are applied, namely InvMixColumns, InvShiftRows, followed by the inverse S-box InvSubBytes. Note that for AES additional space for the inverse functions used in decryption is needed, in contrast to DES where encryption and decryption basically consume the same program memory. For example, in AES we need additional 256bytes if the inverse transformation InvSubBytes is implemented as lookup table (LUT).
To supply the algorithm with the required round keys, AES uses a mechanism called key scheduling. The encryption key of 16bytes is expanded to an array with eleven round keys, where each round key has a size of 16bytes. So in total 176bytes are needed for the round keys. To provide non-linearity between the cipher key and the expanded round key, the key schedule uses the SubBytes transformation followed by a cyclic rotation of the bytes and an addition of a round constant. The round constant is defined by a simple recursive rule.
2.2 Mathematical Background
In order to explain the upcoming implementation tricks and the side-channel countermeasures, we have to recall some basics on finite fields. Almost all operations that will follow are in the finite field GF(2), i.e., the bits ’0’ and ’1’, or the extension field GF(28), where GF denotes a Galois Field. A more detailed and general introduction to finite fields can be found in [Kna06].
GF(28) is the unique finite field with 256 elements and the operations ⊕ (addition) and ⊗ (multiplication). Its elements can be represented as polynomials of order 8 with binary coefficients ai ∈ {0,1},i = 0,...,7. An element a ∈ GF(28) can thus be seen as the set of coefficients ai ∈ GF(2) of the polynomial a with function value a(x), i.e.,
{a7a6a5a4a3a2a1a0}2 7→ a(x), (2.1)
7
a(x) = Xaixi = a7x7 + a6x6 + a5x5 + a4x4 + a3x3 + a2x2 + a1x + a0. (2.2)
i=0
A sequence of 8bits, for example {11010100}2, is called a byte. We will also depict byte values by their hexadecimal notation. Hence we can write the value {11010100}2 as {D4}16 which represents the polynomial x7 + x6 + x4 + x2.
Addition. In the following, ⊕ denotes an addition in GF(28), that can be realized using an exclusive OR (XOR). We can add two elements a,b ∈ GF(28) by adding their coefficients modulo 2 element wise.
7 7 7
Xaixi ⊕ Xbixi = X(ai ⊕ bi)xi (2.3) i=0 i=0 i=0
Multiplication. Multiplication in GF(28) is the usual polynomial multiplication, modulus an irreducible polynomial. For AES, the reduction polynomial m with
m(x) = x8 + x4 + x3 + x + 1
is used. The polynomial m can be represented by {11B}16. Note that in GF(28) reduction by m(x) is equivalent to adding m(x).
areduction by(xWe will use) · b(x) modm⊗(mxto indicate the multiplication of two GF(2)(xensures that the result is an element of GF) with a,b ∈ GF(28) and “·” denoting a polynomial multiplication. The8)(2elements, i.e.,8). a(x) ⊗ b(x) :=
We can implement the multiplication by x by a left shift and a subsequent conditional bitwise XOR with {11B}16, if the most significant bit (MSB) bit is set. For example, multiplying {11001000}2 by x gives {110010000}2. The overflow bit is then removed by adding {100011011}2, the modular polynomial, to give the final result of {10001011}2.
This method is called xtimes, see [Gla07]. Multiplication by higher powers of x can be implemented by repeated application of xtimes. Since we can write all elements of GF(28) as a sum of powers of {02}16, we can implement the multiplication by any value by a repeated use of xtimes. To get the result of the multiplication with any input g by the constant value {15}16.
2.3 Implementation Aspects
AES is an industry-wide used block cipher since the National Institute of Standards and Technologies (NIST) adopted it in 2001 from the Rijndael algorithm. It is used for the protection of government documents as well as for electronic commerce and private use. Many different implementations of AES have been proposed and have been optimized for the specific requirements like maximized throughput or minimized circuitry. We will give a short overview of the implementations, in order to get a picture of the tricks we can use for our implementation.
We begin by defining two microcontroller models (8-bit and 32-bit) to estimate the clock cycles for the implementation. Then we present the 8-bit implementation which is proposed in the standard [FIP01] and then look at existing 32-bit software implementations. We introduce the affine transformation, which is part of the SubBytes transformation, because we will need it in a later chapter.
For the current master thesis it is required to implement both encryption and decryption. For reasons of brevity, we will focus on the description of the encryption and give only the necessary hints for understanding the decryption process.
2.3.1 Microcontroller Model
To compare the runtime of an encryption, we define a simple 8-bit processor model and count the clock cycles (cc) according to that model. This model does not fit to all existing 8-bit microcontrollers. Hence, the given clock cycle count is a rough estimate to get an impression of the implementation. We make the following assumptions for byte operations on our processor model:
• load/store from/to RAM takes one clock cycle,
• load/store from/to ROM takes two clock cycles,
• XOR two registers takes one clock cycle,
• shift by one bit takes one clock cycle.
For the 32-bit (word) operations, we expand our processor model, which is defined in Section 2.3.3. We assume, that a load and a store operation on a word takes one clock cycle.
2.3.2 The Key Schedule
The AES implementation has a key size of 128bits, which leads to ten rounds. With the initial AddRoundKey transformation, an entire encryption needs eleven round keys. There are two different options to supply the algorithm with those round keys:
Key Expansion Before the Encryption
1 KeyExpansion ( byte key [16] , byte expkey [ 4 ] [ 4 4] )
2 begin
3 for ( j = 0; j < 4; j++) {
4 for ( i = 0; i < 4; i++)
5 expkey [ i ] [ j ] = key [ j ∗ 4 + i ] ;
6 }
7 for ( j = 4; j < 44; j++) {
8 if ( j % 4 == 0) {
9 expkey [ 0 ] [ j ] = expkey [ 0 ] [ j −4] ^
10 SubBytes(expkey [ 1 ] [ j −1], 0) ^ rc [ j /4];
11 for ( i = 1; i < 4; i++)
12 expkey [ i ] [ j ] = expkey [ i ] [ j −4] ^
13 SubBytes(expkey [( i +1) % 4][ j −1] ,0);
14 }
15 else {
16 for ( i = 0; i < 4; i++)
17 expkey [ i ] [ j ] = expkey [ i ] [ j − 4] ^ expkey [ i ] [ j − 1];
18 }
19 }
20 End
21
On the one hand, the whole round keys can be calculated before the actual encryption. Therefore, the KeyExpansion function is used, which is depicted in Listing 2.1. This function expands the 128-bit cipher key into an expanded key array, with four rows and 44 columns. The size of each element of this array equals to one byte. To store this expanded key array, we need 11·16bytes = 176bytes RAM. The first four columns (16bytes) of the expanded key are filled with the 128-bit cipher key. The next columns recursively depend on the previous ones, with the following cases: If the column index j = 0,...,43 is not a multiple of four, column j is a bitwise XOR of column j−4 and column j−1. Otherwise, column j is a bitwise XOR of column j − 4 and j − 1, with the difference, that the S-box is applied on the four bytes of column j − 1, with an additional XOR of the round constant rc on the first byte of this column. Since there are eleven rounds, also eleven round constants are needed, which are calculated by the following recursion rule in GF(28):
rc[1] = 1, rc[2] = x, rc[n] = x ⊗ rc[n − 1], n = 3,...,11. (2.5)
Key On-The-Fly Calculation
On the other hand, we can calculate the round keys on-the-fly. This alternative method is for environments with limited memory, e.g., smartcards. The calculation works the same way as we described above, but we only store the last calculated round key since we see that the calculations depend only on the columns j −4 to j −1, so we can derive the next one from it. In Listing 2.2 the function NextRoundKey calculates the next round key from the last used round key and the last used round constant rc. The first round key is the cipher key, the first round constant is one.
1 NextRoundKey( byte round_key [16] , byte rc )
2 begin
3 round_key [0] = round_key [0] ^ SubBytes [ round_key [ 13] ] ^ rc ;
4 round_key [1] = round_key [1] ^ SubBytes [ round_key [ 1 4 ] ] ;
5 round_key [2] = round_key [2] ^ SubBytes [ round_key [ 1 5 ] ] ;
6 round_key [3] = round_key [3] ^ SubBytes [ round_key [ 1 2 ] ] ;
7
8 for ( i =4; i <16; i++) {
9 round_key [ i ] = round_key [ i ] ^ round_key [ i −4];
10 }
11 rc = xtime( rc ); /∗ compute new round constant ∗/
12 end
13
If the key is calculated on-the-fly, we have to add the calculation costs to our encryption costs, since the calculation is done during every encryption. By counting the operations from Listing 2.2, 17 XORs, 3 · 16 = 48 load and store operations plus four load operations from ROM (for the S-box lookup) and two shift and an additional addition (XOR) for the xtime operation are necessary to calculate the next round key. Summarizing, we get 72cc for calculating the next round key and in total, the key on-the-fly method costs 10·72cc = 720cc.
2.3.3 8-bit Software Implementation Straight from the Standard
On an 8-bit processor, AES can be programmed by simply implementing the different component transformations which are described in Section 2.1.
The AddRoundKey Transformation
During the AddRoundKey transformation, every byte of the State matrix S is XORed with one byte from the expanded key. For one byte sr,c of the State this takes:
• 1cc to load the byte sr,c from RAM,
• 1cc to load one byte of the expanded key from RAM,
• 1cc to XOR both, and
• 1cc to store the result.
To process one State matrix S, 16 · 4cc = 64cc are needed. Since eleven round keys are added during an AES run, a whole AES en-/decryption needs 11 · 64cc = 704cc for the AddRoundKey transformations.
The ShiftRows Transformation
For ShiftRows the implementation is straightforward. Here the bytes in the last three rows of the State matrix are cyclically shifted. For this operation, the State value is loaded from RAM and must be written back at its new position. We assume that these two operations take two clock cycles per byte of the State matrix. Then it takes 4 · 2cc = 8cc per row and
3·8cc = 24cc per transformation. For a whole AES en-/decryption with ten rounds we need 10 · 24cc = 240cc for the ten ShiftRows transformations.
The SubBytes Transformation
The SubBytes transformation is implemented in the standard as a static lookup table (the S-box) of 256bytes. For decryption, the inverse SubBytes transformation is needed, which results in a second 256-byte lookup table. Both tables are generally located in read-only memory (ROM). The transformation loads one byte from ROM, one byte from the State matrix, XORs them, and writes the result back to RAM. We estimate that these operations take five clock cycles. One SubBytes transformation takes 16 · 5cc = 80cc. For a whole AES en-/decryption with ten rounds this makes 10 · 80cc = 800cc for the ten SubBytes transformations.
The MixColumns Transformation
The MixColumns transformation considers every column8). This column sc is multiplied by a fixed polynomialsc (0 ≤ c ≤ 3) of the State matrix as a four-term polynomial over GF(2 a with:
a(x) = {03}16 · x3 + {01}16 · x2 + {01}16 · x + {02}16 (2.6)
and then reduced by a fixed polynomial l with l(x) = x4 +1. The reduction by l decreases the result so it fits to the column again. Since a is coprime to l it is invertible and all elements have an inverse. The multiplication can be rewritten with the fixed polynomial a:
sc0 (x) = a(x) · sc(x) mod l(x), for 0 ≤ c ≤ 3
(2.7)
The indices 0,...,3 indicate the byte in the column c of the current State S. A multiplication with {02}16 can be efficiently implemented with the xtime operation which is denoted in Listing 2.31 and described in Section 2.2.
We estimate five clock cycles for the xtimes operation. A multiplication with {03}16 can be implemented as a multiplication with {02}16 plus an additional XOR operation with the operand. So for computing equations (2.8a) to (2.8d) we need:
• 4 · 1cc = 4cc for loading from RAM,
• 4 · 1cc = 4cc for writing to RAM,
• 4 · 3cc = 12cc for the additions (XOR),
• 4 · 5cc = 20cc for the four fixed multiplications with {02}16,
• 4 · 6cc = 24cc for the four fixed multiplications with {03}16.
This makes 64cc per column, 4 · 64cc = 256cc per round and 9 · 256cc = 2304cc for all nine MixColumns transformations during an AES encryption.
The Inverse MixColumns Transformation
The InvMixColumns transformation is similar to MixColumns transformation, but it uses the inverse of a:
a−1(x) = {0B}16 · x3 + {0D}16 · x2 + {09}16 · x + {0E}16. (2.9)
1 Please note that implementing xtime this way leaks timing information, due to the conditional multiplication with {1B}16. This will be discussed in Section 3.1.
The Hamming weight2 of the coefficients of a−1(x) is larger, which leads to more xtime calls and thus to a slower implementation for the decryption. For the InvMixColumns operation, which is given by the equations:
sc0 (x) = a−1(x) · sc(x) mod l(x), for 0 ≤ c ≤ 3
(2.10)
For the inverse MixColumns transformation the following operations are needed:
• 4 · 1cc = 4cc for loading from RAM,
• 4 · 1cc = 4cc for writing to RAM,
• 4 · 3cc = 12cc for the additions (XOR),
• 4 · 17cc = 68cc for the four fixed multiplications with {0E}16,
• 4 · 17cc = 68cc for the four fixed multiplications with {0D}16,
• 4 · 17cc = 68cc for the four fixed multiplications with {0B}16,
• 4 · 16cc = 64cc for the four fixed multiplications with {09}16,
Summarizing: To calculate a column of the State matrix S, 288cc are needed. To calculate the whole State matrix 4·288cc = 1152cc and 9·1152cc = 10368cc for all nine InvMixColumns transformations, during an AES decryption.
The Affine Transformation
As described in Section 2.1, the SubBytes transformation can be realized by a inversion followed by an affine transformation. Figure 2.6 illustrates these two steps. In the first part of the transformation the multiplicative inversion si,j−1 in GF(28) is build of the processed byte si,j. The inversion is followed by an affine transformation. The affine transformation
is a polynomial multiplication of a constant matrix with the byte a = s−1 followed by a
2 No. of “1” in the binary representation
XOR with the constant c. The matrix and the constant c are given in the standard [FIP01]. Equation (2.12) denotes the affine transformation.
b67 0 1 1 1 1 1 0 0 a67 1 b 1 1 1 1 1 0 0 0 a 0 b[1]43 0 0 1 1 1 1 1 0 a025143 1
b 0 0 0 1 1 1 1 1 a 0 2 = · a ⊕ 0 (2.12)
b 1 0 0 0 1 1 1 1
b10 1 1 0 0 0 1 1 1 a 0 b 1 1 1 0 0 0 1 1 a 1
b 1 1 1 1 0 0 0 1 a 1
because some of the countermeasures we will present calculate the inversion on the fly, we also have to calculate the affine transformation. So we estimate the runtime for this transformation. To calculate the runtime, we use the hypothetical processor model from Section 2.3.1 and count the clock cycles (cc). To calculate one bit of the result byte b we need 8 multiplications (AND) and seven additions (XOR). This makes 8+7 = 15cc for one result bit and 8·15cc = 120cc for all eight resulting bits a0 to a[2][3]. The affine transformation is executed 160 times during a whole AES run. So this naive way would take us 160 · 120cc = 19200cc just for the affine transformation.
Obviously, we can improve this implementation. If we take a closer look at the constant matrix in (2.12) the first thing we see, is the pattern m = {11111000}2 in the first row. This pattern repeats in the following rows, but every time rotated by one position to the right. With the fact, that this calculation in done in GF(2), we can also save some computation. If we are able to calculate the parity3 of a byte, we can safe the seven bitwise binary additions. Therefore we multiply the current column m of the constant matrix with the byte a and just set the output bit in b if the parity of the result is even. Algorithm 1 denotes this method.
Algorithm 1: Calculation of the affine transformation
Input: a
Output: b=aff_trans(a)
1 b = 0;
2 m = {11111000}2;
3 c = {01100011}2;
4 for i = 0 to 7 do
40cc. In the last step, the constant c is added to the result what also takes one clock cycle. To calculate the affine transformation for one byte this takes now 40cc + 1cc = 41cc. The affine transformation part of a whole AES run, now takes 160 · 41cc = 6560cc.
Summary
Summarizing, in our processor model an AES encryption needs 240cc for the ShiftRows transformation, 800cc for the SubBytes transformation, and 2304cc for the MixColumns transformation. Depending on the key calculation method, this gives either a total cycle count of 4048cc if separate key scheduling is used, or 4064cc for the key on-the-fly method for an entire AES encryption.
The AES decryption needs a bit more clock cycles, because the InvMixColumns is slower than the MixColumns transformation. With the 10368cc for an InvMixColumns transformation, the decryption takes 704cc+240cc+800cc+10368cc = 12112cc for an AES decryption with separate key scheduling.
2.3.4 8-bit Optimized Software Implementation
The discussed AES implementation can be improved by combining the ShiftRows and SubBytes transformation, see [Den07]. In addition, the inverse MixColumns transformation needed for the decryption can be rewritten in a clock cycle saving way. This leads to an optimized implementation as discussed in this section.
ShiftRows and SubBytes Combined
Since ShiftRows is only a permutation of the indices of the State matrix elements, it can be combined with the SubBytes transformation. With this combination, we safe the 240cc of the ShiftRows transformation.
Improved inverse MixColumns
In [DR02], P. Barreto observes the following relation between the MixColumns polynomial a(x), which has been defined in Equation (2.6), and the InvMixColumns polynomial a−1(x):
a−1(x) = ({04}16 · x2 + {05}16) · a(x) mod (x4 + {01}16) (2.13)
There are two important points. The Hamming weight of the coefficients is much smaller than in the original Equation (2.9) and it has only half as many coefficients. In matrix notation, the right part of Equation (2.13) becomes:
{02}16 {03}16 {01}16 {01}1616 {05}1616 {00}1616 {04}1616 {00}1616
{01}1616 {02}1616 {03}1616 {01}1616· {{0004}}1616 {{0500}}1616 {{0005}}1616 {{0400}}1616 =
{01} {01} {02} {03}
{03}16 {01}16 {01}16 {02} {00} {04} {00} {05}
{0E}1616 {0B}1616 {0D}1616 {09}1616
{09}1616 {0E}1616 {0B}1616 {0D}16 (2.14)
{0D} {09} {0E} {0B}
{0B} {0D} {09} {0E}16
So the normal MixColumns transformation can be applied after a little precalculation of the State matrix. For the constant multiplication of {04}16 two xtime calls and for the multiplication by {05}16 two xtime calls plus one addition (XOR) are needed. Listing 2.4 denotes the preprocessing step. The word w (four bytes) is the current column of the State matrix.
1 word ImvprovedInvMixColumns( byte w[ 4] )
2 begin
3 u = xtime (xtime(w[ 0] ^ w[ 2 ] ) ) ; 4 v = xtime (xtime(w[ 1] ^ w[ 3 ] ) ) ;
5
6 w[0] = w[0] ^ u; 7 w[1] = w[1] ^ v ; 8 w[2] = w[2] ^ u; 9 w[3] = w[3] ^ v ;
10
11 MixColumn(w);
12 return w;
13 end
Listing 2.4: Preprocessing step for improved InvMixColumns
The precalculation needs four xtimes operations and six XOR operations, these are 4·5cc+ 6·1cc = 26cc. The precalculation is followed by a normal MixColumns transformation which leads to 26cc + 64cc = 90cc to calculate the inverse MixColumns transformation for one column of the State matrix. A whole State S takes 4·90cc = 360cc and all nine MixColumns transformations during an AES decryption take 9 · 360cc = 3240cc.
Summary
Summarizing, the optimized (8-bit) AES encryption takes 704cc for the AddRoundKey transformation, 800cc for the combined ShiftRows and SubBytes transformation, and 2304cc for the MixColumns transformation. In total, this makes 3808cc for one entire AES encryption with TLU keys.
With the improved inverse MixColumns transformation, the AES decryption now needs 704cc + 800cc + 3240cc = 4744cc to decrypt a 128bit ciphertext.
2.3.5 32-bit Software Implementation
Two of the discussed 8-bit transformations, namely the AddRoundKey and the MixColumns transformation both benefit directly from a 32-bit register size. In the next sections, we discuss this 32-bit implementation. In addition, there is a 32-bit implementation called Ttables implementation, which combines the four transformations in precalculated tables.
The AddRoundKey Transformation
The AddRoundKey transformation can also be implemented to work simultaneously on a whole column of the State matrix. Therefore it loads one word from the expanded key array and one word from the State matrix S and adds them. The resulting word is written back to the State matrix. To process the State S, 4·4cc = 16cc are needed. All eleven AddRoundKey transformations now take 11 · 16cc = 176cc.
The MixColumns Transformation
The MixColumns transformation can be speed up considerably on 32-bit processors. Gladman mentioned in [Gla01] that equation (2.8a) to (2.8d) can be rewritten in such a way, that only multiplications by two are necessary, e.g., {03}16 ⊗ s0,c = {02}16 ⊗ s0,c ⊕ s0,c.
The four bytes s0,c to s3,c can be stored successively in the word w. Listing 2.5 denotes the multiplication by two within a whole word, called FFmulX. The word t, on the left hand side (from the second XOR) in line four, extracts the highest bits from each byte within the word w. The four individual bytes are now multiplied by two in parallel with a single left shift. On the right hand side, the bytes of word w whose top bits are set are multiplied by {1B}16, the AES reduction polynomial. This reduction ensures, that the four 8-bit results fits into the 32-bit register.
1 word FFmulX(word w)
2 begin
3 word t = w & 0x80808080 ;
4 return ((w ^ t ) << 1) ^ (( t >> 3) |
5 ( t >> 4) | ( t >> 6) | ( t >> 7));
6 end
Since the four bytes s0,c to s3,c are stored successively in the word w, the four Equations (2.16a) to Equations (2.16d) can be calculated in parallel. Listing 2.6 denotes this MixColumns transformation for one column of the State matrix. The resulting word contains the four transformed bytes s00 ,c to s30 ,c.
1 word MixColumn(word w)
2 begin
3 w = rot1 (w) ^ rot2 (w) ^ rot3 (w) ^ FFmulX( rot1 (w) ^ w);
4 return w;
5 end
Listing 2.6: MixColumns transformation for one column of the State matrix
With Listing 2.5 and Listing 2.6 we can estimate the clock cycles for the MixColumns transformation. The FFmulX function needs ≈ 11cc. The MixColumn operation needs four rotations, four XOR operations, one FFmulX operation which makes 19cc. In addition it has to load the word w from RAM and store the result back in RAM, this takes two clock cycles. One MixColumns transformation needs four MixColumn operations which makes 4 · 21cc = 84cc for one round and 9 · 84cc = 756cc for all nine MixColumns transformations.
The Inverse MixColumns Transformation
The 8-bit precalculation from Listing 2.4 can be formulated in such a way, that it uses 32-bit operations. Listing 2.7 denotes the improved inverse MixColumns transformation for one column w of the State matrix.
1 word InvMixColumn(word w)
2 begin
3 word tmp;
4
5 tmp = rot2 (w); 6 tmp = tmp ^ w;
7 tmp = FFmulX(tmp);
8 tmp = FFmulX(tmp);
9 10 w = w ^ tmp;
11
12 w = MixColumn(w);
13
14 return w;
15 end
Listing 2.7: Inverse MixColumns transformation with 32-bit operations
We can estimate the clock cycles for the InvMixColumns transformation with Listing 2.7. For one column of the State matrix it takes one rotation, two XOR operations, two FFmulX operations and one MixColumn operation. This makes 1cc + 2 · 1cc + 2 · 11cc + 21cc = 46cc for one column, 4 · 46cc = 184cc for a State and 9 · 184cc = 1656cc for all nine rounds.
Summary
Summarizing the optimized 32-bit AES encryption takes 176cc for the AddRoundKey transformation, 800cc for the combined ShiftRows and SubBytes transformation, and 756cc for the MixColumns transformation. In total this makes 1732cc for the encryption.
The decryption, with 1656cc for the InvMixColumns transformation, now needs 176cc + 800cc + 1656cc = 2632cc.
2.3.6 32-bit T-Tables Implementation
Daemon and Rijmen already showed in their original AES proposal, that AES can be implemented very efficiently on 32-bit processors by combining the transformations SubBytes,
ShiftRows, MixColumns, and AddRoundKey for one column4. They define four tables each of 256 4-byte words (for 0 ≤ x ≤ 255).
Where c(r) = c + r mod 4 with c(0) = c. Here c denotes the actual column index, and c(r),r = 0,...,3 denotes the new position after the ShiftRows transformation. With these tables each column in the output State s0 can be computed by using four XOR operations together with one word from the key schedule and four table-lookups that are indexed using four bytes from the input State. We need four XOR and five TLU per column. Since we have four columns per State matrix, 4 · 4 = 16 XOR, 4 · 4 = 16 TLU (ROM) and four TLU (RAM) per round are needed. This makes 16cc + 2 · 16cc + 4cc = 52cc for one round and 10 · 52cc = 520cc for the whole algorithm (plus additional key scheduling).
One table consumes 4·256bytes = 1024bytes ROM, and for the four tables 4·1024bytes =
4096bytes are needed. In the last round, the MixColumns is not present. This means that different tables are required for the last round. If we also implement the last round as a table, we need 2 · 4096bytes = 8192bytes of table space altogether.
If we take a closer look at the tables, it can be recognized that the structure of the four tables is closely related to each other. The columns are rotated by one, two and three bytes, since Ti(x) = rot(Ti−1(x)),i = 0,... ,3, where the function rot() rotates the bytes in the column cyclically by one position. So, the space needed for the tables can be reduced by a factor of four at the expense of three additional cyclically rotations per round.
2.3.7 32-bit Implementation with Transposed State
Bertoni et al. [BBF+03] had the idea to transpose the State matrix S in such a way, that it can be processed by columns instead of rows. Therefore they have restructured the MixColumns, the ShiftRows, and the round key generation for en-/decryption. Since the SubBytes transformation works on each byte individually, they left it as it is.
The consequence of the transposition is that the new version of the InvMixColumns achieves a much higher speed gain. In fact, the transposed InvMixColumns requires seven doublings and 27 additions. The authors report a cycle count of 2047cc for decryption on an ARM7TDMI. The encryption becomes a bit slower than the 32-bit reference implementation of Gladman. For encryption they need 1675cc. Their reference implementation from Brian Gladman needs 2763cc for decryption and 1641cc for encryption on the ARM7TDMI.
2.3.8 Hardware Implementations
There has also been much effort put in efficient hardware implementations of AES: From the 8-bit implementation straight from the standard to 32-bit implementations using a 32-bit data path up to high-speed implementations with a 128-bit data path. For every implementation, very much care has to be given to the S-box implementation, both for performance and for security reasons.
Rijmen suggested in 2001 to use subfield arithmetic for computing the inverse part of the SubBytes transformation in [Rij01]. This idea was extended by Satoh et al. in [SMTM01] by using sub-subfields, the so-called “composite field” or “tower field” approach, which resulted in the smallest AES circuit at that point. The “composite field” approach utilizes that a finite field GF(2k) with k = n · m can also be represented as GF((2n)m). Later, Canright [CB09] improved this method again by a carefully chosen normal basis, which results in the most compact S-box up to date.
As we will see, some ideas from the “composite field” approach can be reused for our SCA resistant AES implementation. So we will give a short overview to gain insight into the “composite field” arithmetic.
Composite Fields
We have seen in Section 2.2, that the finite field GF(28) can be seen as an extension field of GF(2) and its elements are represented as bytes. However, we can also see GF(28) as a quadratic extension of the field GF(24), which is far better suited for hardware implementations [Paa94, SMTM01]. We call this field composite field. In particular, composite field inversions are used to create compact AES hardware implementations [RDJ+01, SMTM01, MS03, WOL02].
We can represent every element a ∈ GF(28) as a linear polynomial ahx + al with the coefficients ah and al being elements of the field GF(24). We refer to this polynomial as twoterm polynomial. Both coefficients of this polynomial have four bits. The finite field GF(28) is isomorphic to the field GF((24)2). For each element in GF(28) there exists exactly one element in GF((24)2). The following field polynomial, which is used to define the quadratic extension of GF(24), is taken from [WOL02, Tri03].
Two two-term polynomials can be added by adding their corresponding coefficients:
(ahx + al) ⊕ (bhx + bl) = (ah ⊕ bh)x + (al ⊕ bl). (2.19)
For the multiplication and inversion of two two-term polynomials a modular reduction step is required, to ensure that the result is again a two-term polynomial. The irreducible polynomial n used for the reduction is given by:
n(x) = x2 + x + {E}16. (2.20)
The coefficients of n are elements of GF(24) and written in hexadecimal notation. To ensure, that the product of two coefficients lies in GF(24), i. e., a(x)⊗ b(x) := a(x) · b(x) mod m4(x) with a,b ∈ GF(24), the irreducible reduction polynomial m4 is used, which is given by:
m4(x) = x2 + x + 1. (2.21)
To calculate the inverse of d Oswald et al. shift this element down to GF(22) in [OMP04], where they compute the inversion on a dedicated hardware. This is feasible in hardware but it is not efficient in software on a microcontroller. Instead the inversion of d can be efficiently realized as a lookup table with 16 elements from GF(24).
If the inversion part of SubBytes is calculated with the composite field method, the big lookup table for the inversion in GF(28) with 256 elements can be replaced by the lookup table for the inversion in GF(24) with 16 elements.
2.3.9 Comparison of Different Implementation Options
In the last sections, several options to implement AES were discussed. Two main possibilities to write an 8-bit software implementation of AES have been pointed out.
In the first method, all round keys are calculated in advance, Table 2.1 line one, three, four, five and six. A method with separate key scheduling can be recommended, if there is enough RAM to store the 176bytes for the round keys. The separate key scheduling especially pays off, if two or more blocks of 128bit are encrypted5. An AES 128-bit encryption with separate key scheduling takes 3808cc on an 8-bit processor (line three) and 1732cc on a 32-bit processor (line four).
The second method is to implement an embedded key scheduling. Here the round keys are calculated when they are needed, on-the-fly, and afterwards they are discarded. This leads
to a very small RAM consumption. Table 2.1 denotes in line two, that this implementation needs 4064cc and only 16bytes RAM (for the last used round key). If the processor used has a large number of registers, everything can be calculated without the use of RAM. Thus, the on-the-fly method can be preferred on architectures where only very limited RAM is available.
On a 32-bit processor, AES can also be implemented with almost only table lookups. Table 2.1 refers to this method as T-tables. An 128-bit encryption with the T-tables approach takes approximately 520cc (with separate key scheduling). The T-tables implementation needs 8192bytes of ROM, to store the four pre-calculated tables that are needed for the encryption, and additional 8192bytes for the decryption.
The first three entries in the table are to get a rough feeling of the space and cycle consumption of AES. In the second part of the table they can be compared with real implementations. We notice that there are big clock cycles differences between different implementations even if they are on the same architecture (8 or 32-bit). So we just compare the ratio between the cycle counts for an implementation given from the same author.
The cAESar implementation from [HKQ99] is one of the first 8-bit implementations we found. The used processor is not clearly described in the paper. The implementation is completely written in Assembler and needs 2889cc for one encryption. The round keys are generated on-the-fly.
The second implementation is from Akkar/Giraud in 2001 [AG01] (line 8). Their nonoptimized implementation was done in assembly code using an 8-bit processor. In their paper, they specify the runtime with 18.1ms at 5MHz which results in 90500cc for the encryption of a 128-bit message. The whole implementation needs 730bytes ROM and 41bytes of RAM.
In [HOM06], Herbst et al. compare their SCA resistant 8-bit implementation with one self made older unprotected one from the Institute for Applied Information Processing and Communication (IAIK) [IAI06] they work at. This unprotected implementation needs 4427cc for an encryption. The second unprotected implementation they refer to, is an implementation from Christian Röpke [R¨03] on an ATmega163, which needs 7498cc for an encryption. These two clock cycle counts are interesting for us, because we want use to some countermeasures from Herbst et al.
Bernstein and Schwabe [BS08] have written the fastest 32-bit implementation we have found so far. For achieving these results, they have made heavily optimized versions for the underlying 32-bit micro architecture and utilized the T-tables approach. On an Intel Core 2 Quad with, e.g., combined load and XOR operations, their implementation needs ≈ 169cc for the encryption. In their clock cycle calculation they ignored the costs for computing the round keys.
The main requirement for our AES implementation is to be small in terms of RAM and ROM consumption. For this reason, we do not take the 32-bit T-tables approach. Because a typical AES implementation will have a fixed key and we assume that more than one 128-bit block is encrypted during one AES run, we decided us for the separate key scheduling method. Therefore we will take the optimized 32-bit implementation (line four) and try to implement it on the TriCore architecture as fast as possible. The RAM consumption in Table 2.1 does not contain the 16-byte State matrix.
[1] t = a ⊗ m;
6 b = rotate_bit_left(b);
[2] b = b ⊕ parity(t);
8 m = rotate_bit_right(m);
9 end
[3] b = b ⊕ c;
Now we one multiplication (AND) 1cc, two rotations 2cc, one parity calculation 1cc and one addition (XOR) 1cc to calculate one bit of the result. For all eight bits we need 8·5cc =
3 States whether the sum of all bits in the byte are even or odd.
@Slav Defence - Again, this is a much more advanced article on EMSEC, one that goes into the principles behind the concept of side channel analysis, or the task of side-stepping encryption. Also, I hope these article are helping you and others understand what I did in the US Navy. They are quite technical, if you'd like me to post less technical information, please say so. I really want people to enjoy these article and learn something, but the information can be overwhelming and scary at times, especially for novices... this isn't even too advanced, it's first year stuff!!!
@Jango @Manticore @Jungibaaz
@Hu Songshan - you told me in our conversation that you where interested in Electronic Warfare, so I thought I'd tag you here as well given this thread pertains to said topic.
@Nihonjin1051 - I have no idea if this topic interests you (if it does, you're welcome. If not, ), but this was my job in the USN.
This is a big document so additional sections will be added as I finish with them. If something doesn't make sense, or you want to check a piece of information feel free to ask me for clarification... or you can wait until the entire document is posted. The final section will have all the references (these will explain some of the name, backup the research and offer additional articles on the subject), but in the meantime I will answer any questions you have as best as I can (the technicals aren't a problem, the legality is).
Most of the matrixes have been omitted as they were attachments and each section had too many. I couldn't keep each post under ten photos/attachments.
SvenSvensonov
Chapter 1 Introduction
1.1 Motivation
Nowadays, security becomes more and more important in the automotive industry. Robert Bosch GmbH, Corporate Research supplies the necessary know-how and provides solutions for future applications in the business units.
Many functions in a modern car are realized with the help of software applications and the underlying electronics. This means in particular, that a growing part of the components in a car are realized through an easy to copy and manipulate software. Due to this fact, there is a growing need for an effective protection of these software applications to save the intellectual property on the one hand and to disclose manipulation, e.g., by tuning, on the other hand.
There are up to 80 electronic control units (ECU) built into a modern upper class car. These control units come from different suppliers, communicate over potentially insecure networks with each other and with the outside world respectively, e.g., with diagnostic equipment or in the near future Car2Car and Car2Infrastructure. As a consequence, these control units should authenticate each other and the communication data should be cryptographically protected against manipulation.
For several use cases in the automotive domain, e.g., tuning detection and sensor protection, cryptography constitutes a considerable part of the overall-security architecture. Due to their performance and compactness in implementations, symmetric encryption primitives are of special interest. These primitives can be used for attaining the following security goals:
• confidentiality: encryption of data,• authenticity: challenge-response protocols,
• integrity: message authentication codes.
The State-of-the-Art symmetric encryption scheme is the Advanced Encryption Standard
(AES), [FIP01]. This block cipher with key sizes of 128, 192, and 256bits and a block size of 128bits is remarkably fast and can be implemented very efficiently on a wide variety of platforms. It replaces the old Data Encryption Standard DES or, to be more precise, its variant Triple-DES.
A lot of proprietary mechanisms in the automotive environment are getting replaced by standardized security mechanisms. An example is here the changeover from the proprietary KeeLoq System1, which is used for keyless entry in cars and garage doors, to modern challenge response protocols based on AES2.
1.2 Adversary Model and Assumptions
Ross Anderson and Markus Kuhn give a good classification of adversaries’ knowledge, funding, and skills in [AK98]. They define the following three classes of attackers:
• Class I (clever outsiders): They are very intelligent but may have insufficient knowledge of the system. They may have access to only moderately sophisticated equipment. They often try to take advantage of an existing weakness in the system, rather than try to create one.
• Class II (knowledgeable insiders): They have substantial specialized technical education and experience. They have varying degrees of understanding of parts of the system but potential access to most of it. They have highly sophisticated tools and instruments for analysis.
• Class III (funded organizations): They are able to assemble teams of specialists with related and complementary skills backed by great funding resources. They can do in-depth analysis of the system, design sophisticated attacks, and use the most advanced analysis tools. They may use Class II Adversaries as part of the attack team.
We will focus on knowledgeable insiders, i.e., the cryptographic algorithm itself is known to the attack agent (no security by obscurity!). The attack agent may have access to equipment like oscilloscopes etc. that can be found in university labs. The specific implementation, e.g., the source code or an open sample device, is assumed to be not available to the attacker.
In terms of side-channel analysis resistance3, this means that our implementation should be resistant against:
• timing attacks, and
• first-order differential power analysis (DPA) attacks.
Resistance against timing attacks implies that we assume an adversary having access to an exact simulation of the runtime of the implementation. Note that we do not require resistance against the new microarchitectural side-channel attacks like cache and branch predication attacks [AKS07, Ber05, OST06, NS06]. This does not mean that these attacks may not be relevant, it just means that we think that an attack agent has no or only little chance to install a spy process on the target processor and to measure the timing signals with the necessary precision.4
We assume that table lookups (TLU) have a constant runtime. For achieving timing analysis resistance, we need a key-independent runtime of the implementation. Therefore we will consider the effects of the 4-stage pipeline of the target processor, since they have a huge influence on the runtime.
Resistance against first-order differential power analysis (DPA) attacks means that neither a standard difference of means nor a correlation attack against the implementation will succeed
with a sufficient number of measurement traces. We assume that the attacker has no access to a training device where he can input, e.g., random masks or keys, i.e., we effectively rule out template attacks [GLRP06, APSQ06, CRR02].
The goal is not to make the implementation secure against second or higher order attacks, i.e., the focus is put on simple masking schemes. Of course, very simple second order attacks,
e.g., by using the same mask twice directly on the bus, have to be prevented.
1.3 Scope
Robert Bosch GmbH, Corporate Sector Research and Engineering investigates the suitability of side-channel resistant implementations of symmetric key primitives under the constraints of the embedded domain. In this thesis, we will survey and investigate several masking countermeasures described in the literature to protect an implementation of the Advanced Encryption Standard (AES) against first-order DPA attacks in terms of security and runtime.
A selected countermeasure will be implemented in C (and partly in assembly language) on a typical 32-bit automotive processor used in Embedded Control Units (ECUs), e.g., TriCore. The implementation shall be secured against timing attacks as well. Of special interest is an efficient implementation taking into account the requirements of embedded devices with respect to memory and performance. Implementations of AES without countermeasures against DPA are provided by Robert Bosch CR/AEA. The tasks, that are to be carried out are:
• to understand and to survey the relevant AES implementation options,
• to understand and to survey the most important side-channel attacks (DPA: differenceof means + correlation power attack), timing attack,
• to survey countermeasures (it is not the goal to invent new countermeasures, just todescribe what is available in the literature),
• to describe the different masking schemes in their memory usage (RAM and ROM),performance, and number of random bits,
• to select and to implement the most appropriate or a combination of methods,
• to implement AES efficiently and securely with a key length of 128bits on a TriCore 17965 processor in C and, maybe, parts of it in Assembler.
All countermeasures considered in this master thesis will be algorithmic, i.e., softwarebased, countermeasures. Although many hardware countermeasures have been discussed in the literature, e.g., special logic styles such as Masked Dual Rail Pre-Charge Logic (MDPL), increasing the noise by special noise generators, etc., we will only discuss these countermeasures if they can be applied to software as well. However, we will survey a number of countermeasures which have been originally proposed for hardware implementations e.g., masking in tower fields and investigate if they can be applied efficiently in software implementations.
For the masking countermeasures, we need a certain amount of random data. The generation of these random values is not part of this thesis. We assume that all random values are given to our program via a pointer to an array with sufficient numbers of the necessary entropy. For simulation the stream cipher Grain-80 [HJM05] is used as pseudo random number generator (RNG).
We will present a side-channel resistant AES implementation for the TriCore architecture, which is not available up to now. Therefore we will combine side-channel analysis (SCA) countermeasures.
In Chapter 2 we will give a short overview of AES and the existing implementation options. We will introduce some mathematical basics which are important to understand the upcoming implementations. We introduce a processor model, which will help us to select the right implementation option for our needs.
Chapter 3 gives an overview about known side-channels and possible attacks on it. The focus lies on timing analysis because this form of side-channel will be analyzed at the end of this thesis.
Chapter 4 describes countermeasures to prevent side-channel attacks. We analyze different masking schemes in terms of size and runtime and combine them with other countermeasures like randomization.
Chapter 5 is dedicated to the TriCore. We will describe the architecture and the assembler instructions which are used for the implementation.
Chapter 6 is focused on the implementation. We will discuss improvements for the already available implementations and based on the improvements build a protected AES implementation.
In Chapter 7 we will analyze the implemented AES versions in terms of their timing behavior.
Chapter 2 AES – Advanced Encryption Standard
2.1 Introduction
The Advanced Encryption Standard (AES), also known as Rijndael, is a symmetric block cipher that was developed by the Belgian cryptographers Joan Daemen and Vincent Rijmen. AES encrypts/decrypts data with a block size of 128bits and a key size of 128, 192, or 256bits. AES targets on a small and fast implementation in hardware and software. The standard is described in [FIP01]. Brian Gladman wrote a detailed description of AES in [Gla01] and an updated version of the description in [Gla07]. We will give here a short overview to introduce the notation used in this thesis.
AES is a typical substitution-permutation network (SPN), see [Sti05]. An SPN consists of several rounds, each made up with a substitution and a permutation stage. The substitution stage obscures the relationship between elements of the plaintext and elements of the ciphertext, while the permutation stage makes of the influence of plaintext elements spread over the ciphertext.
For AES, the number of rounds depends on the key size. We focus on a key size of 128bits, which corresponds to ten rounds. The 16byte plaintext P = (p0,p1,...,p15) is internally arranged in a 4 × 4 matrix, the so-called State matrix S. Every element sr,c of the State matrix (r denotes the index of the row and c the index of the column) equals to one byte. In Figure 2.1 we can see the individual transformations which are applied to the State matrix. Figure 2.2 shows how the plaintext P is initially mapped to the State matrix S and then to the ciphertext output, C = (c0,c1,...,c15).
In each round, four transformations are applied to the State matrix S:
1. Byte Substitution (SubBytes), which is a non-linear byte substitution operating independently on each byte (sr,c) of the State matrix, using a substitution table called S-box. This S-box can be constructed from two transformations:
a) an inversion in the finite field GF(28), where the zero is mapped to itself; and
b) a bitwise affine transformation.
2. ShiftRows, where the bytes sr,c in the State matrix are cyclically shifted. The first row is not rotated. The second row is rotated to the left by one position, row three is rotated to the left by two positions and row four is rotated to the left by three positions.
3. MixColumns, where every column of the State matrix is treated as a four-term polynomial and gets multiplied by a fixed 4×4 matrix over GF(28) which is defined in [FIP01].
5
4. Key Addition (AddRoundKey), which adds a 4 × 4 key matrix (128-bit) to the State matrix. This key matrix is called round key.
The final round differs slightly from the first nine rounds. As we see in Figure 2.1 the MixColumns transformation is missing, which is also typical for a substitution-permutation network since the MixColumns transformation adds no cryptographic strength to the algorithm if it is included at the end of the cipher.
Decryption is accomplished by running the algorithm “backward”. The last round key is first XORed with the ciphertext, and then in each round, the inverse linear transformations are applied, namely InvMixColumns, InvShiftRows, followed by the inverse S-box InvSubBytes. Note that for AES additional space for the inverse functions used in decryption is needed, in contrast to DES where encryption and decryption basically consume the same program memory. For example, in AES we need additional 256bytes if the inverse transformation InvSubBytes is implemented as lookup table (LUT).
To supply the algorithm with the required round keys, AES uses a mechanism called key scheduling. The encryption key of 16bytes is expanded to an array with eleven round keys, where each round key has a size of 16bytes. So in total 176bytes are needed for the round keys. To provide non-linearity between the cipher key and the expanded round key, the key schedule uses the SubBytes transformation followed by a cyclic rotation of the bytes and an addition of a round constant. The round constant is defined by a simple recursive rule.
2.2 Mathematical Background
In order to explain the upcoming implementation tricks and the side-channel countermeasures, we have to recall some basics on finite fields. Almost all operations that will follow are in the finite field GF(2), i.e., the bits ’0’ and ’1’, or the extension field GF(28), where GF denotes a Galois Field. A more detailed and general introduction to finite fields can be found in [Kna06].
GF(28) is the unique finite field with 256 elements and the operations ⊕ (addition) and ⊗ (multiplication). Its elements can be represented as polynomials of order 8 with binary coefficients ai ∈ {0,1},i = 0,...,7. An element a ∈ GF(28) can thus be seen as the set of coefficients ai ∈ GF(2) of the polynomial a with function value a(x), i.e.,
{a7a6a5a4a3a2a1a0}2 7→ a(x), (2.1)
7
a(x) = Xaixi = a7x7 + a6x6 + a5x5 + a4x4 + a3x3 + a2x2 + a1x + a0. (2.2)
i=0
A sequence of 8bits, for example {11010100}2, is called a byte. We will also depict byte values by their hexadecimal notation. Hence we can write the value {11010100}2 as {D4}16 which represents the polynomial x7 + x6 + x4 + x2.
Addition. In the following, ⊕ denotes an addition in GF(28), that can be realized using an exclusive OR (XOR). We can add two elements a,b ∈ GF(28) by adding their coefficients modulo 2 element wise.
7 7 7
Xaixi ⊕ Xbixi = X(ai ⊕ bi)xi (2.3) i=0 i=0 i=0
Multiplication. Multiplication in GF(28) is the usual polynomial multiplication, modulus an irreducible polynomial. For AES, the reduction polynomial m with
m(x) = x8 + x4 + x3 + x + 1
is used. The polynomial m can be represented by {11B}16. Note that in GF(28) reduction by m(x) is equivalent to adding m(x).
areduction by(xWe will use) · b(x) modm⊗(mxto indicate the multiplication of two GF(2)(xensures that the result is an element of GF) with a,b ∈ GF(28) and “·” denoting a polynomial multiplication. The8)(2elements, i.e.,8). a(x) ⊗ b(x) :=
We can implement the multiplication by x by a left shift and a subsequent conditional bitwise XOR with {11B}16, if the most significant bit (MSB) bit is set. For example, multiplying {11001000}2 by x gives {110010000}2. The overflow bit is then removed by adding {100011011}2, the modular polynomial, to give the final result of {10001011}2.
This method is called xtimes, see [Gla07]. Multiplication by higher powers of x can be implemented by repeated application of xtimes. Since we can write all elements of GF(28) as a sum of powers of {02}16, we can implement the multiplication by any value by a repeated use of xtimes. To get the result of the multiplication with any input g by the constant value {15}16.
2.3 Implementation Aspects
AES is an industry-wide used block cipher since the National Institute of Standards and Technologies (NIST) adopted it in 2001 from the Rijndael algorithm. It is used for the protection of government documents as well as for electronic commerce and private use. Many different implementations of AES have been proposed and have been optimized for the specific requirements like maximized throughput or minimized circuitry. We will give a short overview of the implementations, in order to get a picture of the tricks we can use for our implementation.
We begin by defining two microcontroller models (8-bit and 32-bit) to estimate the clock cycles for the implementation. Then we present the 8-bit implementation which is proposed in the standard [FIP01] and then look at existing 32-bit software implementations. We introduce the affine transformation, which is part of the SubBytes transformation, because we will need it in a later chapter.
For the current master thesis it is required to implement both encryption and decryption. For reasons of brevity, we will focus on the description of the encryption and give only the necessary hints for understanding the decryption process.
2.3.1 Microcontroller Model
To compare the runtime of an encryption, we define a simple 8-bit processor model and count the clock cycles (cc) according to that model. This model does not fit to all existing 8-bit microcontrollers. Hence, the given clock cycle count is a rough estimate to get an impression of the implementation. We make the following assumptions for byte operations on our processor model:
• load/store from/to RAM takes one clock cycle,
• load/store from/to ROM takes two clock cycles,
• XOR two registers takes one clock cycle,
• shift by one bit takes one clock cycle.
For the 32-bit (word) operations, we expand our processor model, which is defined in Section 2.3.3. We assume, that a load and a store operation on a word takes one clock cycle.
2.3.2 The Key Schedule
The AES implementation has a key size of 128bits, which leads to ten rounds. With the initial AddRoundKey transformation, an entire encryption needs eleven round keys. There are two different options to supply the algorithm with those round keys:
Key Expansion Before the Encryption
1 KeyExpansion ( byte key [16] , byte expkey [ 4 ] [ 4 4] )
2 begin
3 for ( j = 0; j < 4; j++) {
4 for ( i = 0; i < 4; i++)
5 expkey [ i ] [ j ] = key [ j ∗ 4 + i ] ;
6 }
7 for ( j = 4; j < 44; j++) {
8 if ( j % 4 == 0) {
9 expkey [ 0 ] [ j ] = expkey [ 0 ] [ j −4] ^
10 SubBytes(expkey [ 1 ] [ j −1], 0) ^ rc [ j /4];
11 for ( i = 1; i < 4; i++)
12 expkey [ i ] [ j ] = expkey [ i ] [ j −4] ^
13 SubBytes(expkey [( i +1) % 4][ j −1] ,0);
14 }
15 else {
16 for ( i = 0; i < 4; i++)
17 expkey [ i ] [ j ] = expkey [ i ] [ j − 4] ^ expkey [ i ] [ j − 1];
18 }
19 }
20 End
21
On the one hand, the whole round keys can be calculated before the actual encryption. Therefore, the KeyExpansion function is used, which is depicted in Listing 2.1. This function expands the 128-bit cipher key into an expanded key array, with four rows and 44 columns. The size of each element of this array equals to one byte. To store this expanded key array, we need 11·16bytes = 176bytes RAM. The first four columns (16bytes) of the expanded key are filled with the 128-bit cipher key. The next columns recursively depend on the previous ones, with the following cases: If the column index j = 0,...,43 is not a multiple of four, column j is a bitwise XOR of column j−4 and column j−1. Otherwise, column j is a bitwise XOR of column j − 4 and j − 1, with the difference, that the S-box is applied on the four bytes of column j − 1, with an additional XOR of the round constant rc on the first byte of this column. Since there are eleven rounds, also eleven round constants are needed, which are calculated by the following recursion rule in GF(28):
rc[1] = 1, rc[2] = x, rc[n] = x ⊗ rc[n − 1], n = 3,...,11. (2.5)
Key On-The-Fly Calculation
On the other hand, we can calculate the round keys on-the-fly. This alternative method is for environments with limited memory, e.g., smartcards. The calculation works the same way as we described above, but we only store the last calculated round key since we see that the calculations depend only on the columns j −4 to j −1, so we can derive the next one from it. In Listing 2.2 the function NextRoundKey calculates the next round key from the last used round key and the last used round constant rc. The first round key is the cipher key, the first round constant is one.
1 NextRoundKey( byte round_key [16] , byte rc )
2 begin
3 round_key [0] = round_key [0] ^ SubBytes [ round_key [ 13] ] ^ rc ;
4 round_key [1] = round_key [1] ^ SubBytes [ round_key [ 1 4 ] ] ;
5 round_key [2] = round_key [2] ^ SubBytes [ round_key [ 1 5 ] ] ;
6 round_key [3] = round_key [3] ^ SubBytes [ round_key [ 1 2 ] ] ;
7
8 for ( i =4; i <16; i++) {
9 round_key [ i ] = round_key [ i ] ^ round_key [ i −4];
10 }
11 rc = xtime( rc ); /∗ compute new round constant ∗/
12 end
13
If the key is calculated on-the-fly, we have to add the calculation costs to our encryption costs, since the calculation is done during every encryption. By counting the operations from Listing 2.2, 17 XORs, 3 · 16 = 48 load and store operations plus four load operations from ROM (for the S-box lookup) and two shift and an additional addition (XOR) for the xtime operation are necessary to calculate the next round key. Summarizing, we get 72cc for calculating the next round key and in total, the key on-the-fly method costs 10·72cc = 720cc.
2.3.3 8-bit Software Implementation Straight from the Standard
On an 8-bit processor, AES can be programmed by simply implementing the different component transformations which are described in Section 2.1.
The AddRoundKey Transformation
During the AddRoundKey transformation, every byte of the State matrix S is XORed with one byte from the expanded key. For one byte sr,c of the State this takes:
• 1cc to load the byte sr,c from RAM,
• 1cc to load one byte of the expanded key from RAM,
• 1cc to XOR both, and
• 1cc to store the result.
To process one State matrix S, 16 · 4cc = 64cc are needed. Since eleven round keys are added during an AES run, a whole AES en-/decryption needs 11 · 64cc = 704cc for the AddRoundKey transformations.
The ShiftRows Transformation
For ShiftRows the implementation is straightforward. Here the bytes in the last three rows of the State matrix are cyclically shifted. For this operation, the State value is loaded from RAM and must be written back at its new position. We assume that these two operations take two clock cycles per byte of the State matrix. Then it takes 4 · 2cc = 8cc per row and
3·8cc = 24cc per transformation. For a whole AES en-/decryption with ten rounds we need 10 · 24cc = 240cc for the ten ShiftRows transformations.
The SubBytes Transformation
The SubBytes transformation is implemented in the standard as a static lookup table (the S-box) of 256bytes. For decryption, the inverse SubBytes transformation is needed, which results in a second 256-byte lookup table. Both tables are generally located in read-only memory (ROM). The transformation loads one byte from ROM, one byte from the State matrix, XORs them, and writes the result back to RAM. We estimate that these operations take five clock cycles. One SubBytes transformation takes 16 · 5cc = 80cc. For a whole AES en-/decryption with ten rounds this makes 10 · 80cc = 800cc for the ten SubBytes transformations.
The MixColumns Transformation
The MixColumns transformation considers every column8). This column sc is multiplied by a fixed polynomialsc (0 ≤ c ≤ 3) of the State matrix as a four-term polynomial over GF(2 a with:
a(x) = {03}16 · x3 + {01}16 · x2 + {01}16 · x + {02}16 (2.6)
and then reduced by a fixed polynomial l with l(x) = x4 +1. The reduction by l decreases the result so it fits to the column again. Since a is coprime to l it is invertible and all elements have an inverse. The multiplication can be rewritten with the fixed polynomial a:
sc0 (x) = a(x) · sc(x) mod l(x), for 0 ≤ c ≤ 3
(2.7)
The indices 0,...,3 indicate the byte in the column c of the current State S. A multiplication with {02}16 can be efficiently implemented with the xtime operation which is denoted in Listing 2.31 and described in Section 2.2.
We estimate five clock cycles for the xtimes operation. A multiplication with {03}16 can be implemented as a multiplication with {02}16 plus an additional XOR operation with the operand. So for computing equations (2.8a) to (2.8d) we need:
• 4 · 1cc = 4cc for loading from RAM,
• 4 · 1cc = 4cc for writing to RAM,
• 4 · 3cc = 12cc for the additions (XOR),
• 4 · 5cc = 20cc for the four fixed multiplications with {02}16,
• 4 · 6cc = 24cc for the four fixed multiplications with {03}16.
This makes 64cc per column, 4 · 64cc = 256cc per round and 9 · 256cc = 2304cc for all nine MixColumns transformations during an AES encryption.
The Inverse MixColumns Transformation
The InvMixColumns transformation is similar to MixColumns transformation, but it uses the inverse of a:
a−1(x) = {0B}16 · x3 + {0D}16 · x2 + {09}16 · x + {0E}16. (2.9)
1 Please note that implementing xtime this way leaks timing information, due to the conditional multiplication with {1B}16. This will be discussed in Section 3.1.
The Hamming weight2 of the coefficients of a−1(x) is larger, which leads to more xtime calls and thus to a slower implementation for the decryption. For the InvMixColumns operation, which is given by the equations:
sc0 (x) = a−1(x) · sc(x) mod l(x), for 0 ≤ c ≤ 3
(2.10)
For the inverse MixColumns transformation the following operations are needed:
• 4 · 1cc = 4cc for loading from RAM,
• 4 · 1cc = 4cc for writing to RAM,
• 4 · 3cc = 12cc for the additions (XOR),
• 4 · 17cc = 68cc for the four fixed multiplications with {0E}16,
• 4 · 17cc = 68cc for the four fixed multiplications with {0D}16,
• 4 · 17cc = 68cc for the four fixed multiplications with {0B}16,
• 4 · 16cc = 64cc for the four fixed multiplications with {09}16,
Summarizing: To calculate a column of the State matrix S, 288cc are needed. To calculate the whole State matrix 4·288cc = 1152cc and 9·1152cc = 10368cc for all nine InvMixColumns transformations, during an AES decryption.
The Affine Transformation
As described in Section 2.1, the SubBytes transformation can be realized by a inversion followed by an affine transformation. Figure 2.6 illustrates these two steps. In the first part of the transformation the multiplicative inversion si,j−1 in GF(28) is build of the processed byte si,j. The inversion is followed by an affine transformation. The affine transformation
is a polynomial multiplication of a constant matrix with the byte a = s−1 followed by a
2 No. of “1” in the binary representation
XOR with the constant c. The matrix and the constant c are given in the standard [FIP01]. Equation (2.12) denotes the affine transformation.
b67 0 1 1 1 1 1 0 0 a67 1 b 1 1 1 1 1 0 0 0 a 0 b[1]43 0 0 1 1 1 1 1 0 a025143 1
b 0 0 0 1 1 1 1 1 a 0 2 = · a ⊕ 0 (2.12)
b 1 0 0 0 1 1 1 1
b10 1 1 0 0 0 1 1 1 a 0 b 1 1 1 0 0 0 1 1 a 1
b 1 1 1 1 0 0 0 1 a 1
because some of the countermeasures we will present calculate the inversion on the fly, we also have to calculate the affine transformation. So we estimate the runtime for this transformation. To calculate the runtime, we use the hypothetical processor model from Section 2.3.1 and count the clock cycles (cc). To calculate one bit of the result byte b we need 8 multiplications (AND) and seven additions (XOR). This makes 8+7 = 15cc for one result bit and 8·15cc = 120cc for all eight resulting bits a0 to a[2][3]. The affine transformation is executed 160 times during a whole AES run. So this naive way would take us 160 · 120cc = 19200cc just for the affine transformation.
Obviously, we can improve this implementation. If we take a closer look at the constant matrix in (2.12) the first thing we see, is the pattern m = {11111000}2 in the first row. This pattern repeats in the following rows, but every time rotated by one position to the right. With the fact, that this calculation in done in GF(2), we can also save some computation. If we are able to calculate the parity3 of a byte, we can safe the seven bitwise binary additions. Therefore we multiply the current column m of the constant matrix with the byte a and just set the output bit in b if the parity of the result is even. Algorithm 1 denotes this method.
Algorithm 1: Calculation of the affine transformation
Input: a
Output: b=aff_trans(a)
1 b = 0;
2 m = {11111000}2;
3 c = {01100011}2;
4 for i = 0 to 7 do
40cc. In the last step, the constant c is added to the result what also takes one clock cycle. To calculate the affine transformation for one byte this takes now 40cc + 1cc = 41cc. The affine transformation part of a whole AES run, now takes 160 · 41cc = 6560cc.
Summary
Summarizing, in our processor model an AES encryption needs 240cc for the ShiftRows transformation, 800cc for the SubBytes transformation, and 2304cc for the MixColumns transformation. Depending on the key calculation method, this gives either a total cycle count of 4048cc if separate key scheduling is used, or 4064cc for the key on-the-fly method for an entire AES encryption.
The AES decryption needs a bit more clock cycles, because the InvMixColumns is slower than the MixColumns transformation. With the 10368cc for an InvMixColumns transformation, the decryption takes 704cc+240cc+800cc+10368cc = 12112cc for an AES decryption with separate key scheduling.
2.3.4 8-bit Optimized Software Implementation
The discussed AES implementation can be improved by combining the ShiftRows and SubBytes transformation, see [Den07]. In addition, the inverse MixColumns transformation needed for the decryption can be rewritten in a clock cycle saving way. This leads to an optimized implementation as discussed in this section.
ShiftRows and SubBytes Combined
Since ShiftRows is only a permutation of the indices of the State matrix elements, it can be combined with the SubBytes transformation. With this combination, we safe the 240cc of the ShiftRows transformation.
Improved inverse MixColumns
In [DR02], P. Barreto observes the following relation between the MixColumns polynomial a(x), which has been defined in Equation (2.6), and the InvMixColumns polynomial a−1(x):
a−1(x) = ({04}16 · x2 + {05}16) · a(x) mod (x4 + {01}16) (2.13)
There are two important points. The Hamming weight of the coefficients is much smaller than in the original Equation (2.9) and it has only half as many coefficients. In matrix notation, the right part of Equation (2.13) becomes:
{02}16 {03}16 {01}16 {01}1616 {05}1616 {00}1616 {04}1616 {00}1616
{01}1616 {02}1616 {03}1616 {01}1616· {{0004}}1616 {{0500}}1616 {{0005}}1616 {{0400}}1616 =
{01} {01} {02} {03}
{03}16 {01}16 {01}16 {02} {00} {04} {00} {05}
{0E}1616 {0B}1616 {0D}1616 {09}1616
{09}1616 {0E}1616 {0B}1616 {0D}16 (2.14)
{0D} {09} {0E} {0B}
{0B} {0D} {09} {0E}16
So the normal MixColumns transformation can be applied after a little precalculation of the State matrix. For the constant multiplication of {04}16 two xtime calls and for the multiplication by {05}16 two xtime calls plus one addition (XOR) are needed. Listing 2.4 denotes the preprocessing step. The word w (four bytes) is the current column of the State matrix.
1 word ImvprovedInvMixColumns( byte w[ 4] )
2 begin
3 u = xtime (xtime(w[ 0] ^ w[ 2 ] ) ) ; 4 v = xtime (xtime(w[ 1] ^ w[ 3 ] ) ) ;
5
6 w[0] = w[0] ^ u; 7 w[1] = w[1] ^ v ; 8 w[2] = w[2] ^ u; 9 w[3] = w[3] ^ v ;
10
11 MixColumn(w);
12 return w;
13 end
Listing 2.4: Preprocessing step for improved InvMixColumns
The precalculation needs four xtimes operations and six XOR operations, these are 4·5cc+ 6·1cc = 26cc. The precalculation is followed by a normal MixColumns transformation which leads to 26cc + 64cc = 90cc to calculate the inverse MixColumns transformation for one column of the State matrix. A whole State S takes 4·90cc = 360cc and all nine MixColumns transformations during an AES decryption take 9 · 360cc = 3240cc.
Summary
Summarizing, the optimized (8-bit) AES encryption takes 704cc for the AddRoundKey transformation, 800cc for the combined ShiftRows and SubBytes transformation, and 2304cc for the MixColumns transformation. In total, this makes 3808cc for one entire AES encryption with TLU keys.
With the improved inverse MixColumns transformation, the AES decryption now needs 704cc + 800cc + 3240cc = 4744cc to decrypt a 128bit ciphertext.
2.3.5 32-bit Software Implementation
Two of the discussed 8-bit transformations, namely the AddRoundKey and the MixColumns transformation both benefit directly from a 32-bit register size. In the next sections, we discuss this 32-bit implementation. In addition, there is a 32-bit implementation called Ttables implementation, which combines the four transformations in precalculated tables.
The AddRoundKey Transformation
The AddRoundKey transformation can also be implemented to work simultaneously on a whole column of the State matrix. Therefore it loads one word from the expanded key array and one word from the State matrix S and adds them. The resulting word is written back to the State matrix. To process the State S, 4·4cc = 16cc are needed. All eleven AddRoundKey transformations now take 11 · 16cc = 176cc.
The MixColumns Transformation
The MixColumns transformation can be speed up considerably on 32-bit processors. Gladman mentioned in [Gla01] that equation (2.8a) to (2.8d) can be rewritten in such a way, that only multiplications by two are necessary, e.g., {03}16 ⊗ s0,c = {02}16 ⊗ s0,c ⊕ s0,c.
The four bytes s0,c to s3,c can be stored successively in the word w. Listing 2.5 denotes the multiplication by two within a whole word, called FFmulX. The word t, on the left hand side (from the second XOR) in line four, extracts the highest bits from each byte within the word w. The four individual bytes are now multiplied by two in parallel with a single left shift. On the right hand side, the bytes of word w whose top bits are set are multiplied by {1B}16, the AES reduction polynomial. This reduction ensures, that the four 8-bit results fits into the 32-bit register.
1 word FFmulX(word w)
2 begin
3 word t = w & 0x80808080 ;
4 return ((w ^ t ) << 1) ^ (( t >> 3) |
5 ( t >> 4) | ( t >> 6) | ( t >> 7));
6 end
Since the four bytes s0,c to s3,c are stored successively in the word w, the four Equations (2.16a) to Equations (2.16d) can be calculated in parallel. Listing 2.6 denotes this MixColumns transformation for one column of the State matrix. The resulting word contains the four transformed bytes s00 ,c to s30 ,c.
1 word MixColumn(word w)
2 begin
3 w = rot1 (w) ^ rot2 (w) ^ rot3 (w) ^ FFmulX( rot1 (w) ^ w);
4 return w;
5 end
Listing 2.6: MixColumns transformation for one column of the State matrix
With Listing 2.5 and Listing 2.6 we can estimate the clock cycles for the MixColumns transformation. The FFmulX function needs ≈ 11cc. The MixColumn operation needs four rotations, four XOR operations, one FFmulX operation which makes 19cc. In addition it has to load the word w from RAM and store the result back in RAM, this takes two clock cycles. One MixColumns transformation needs four MixColumn operations which makes 4 · 21cc = 84cc for one round and 9 · 84cc = 756cc for all nine MixColumns transformations.
The Inverse MixColumns Transformation
The 8-bit precalculation from Listing 2.4 can be formulated in such a way, that it uses 32-bit operations. Listing 2.7 denotes the improved inverse MixColumns transformation for one column w of the State matrix.
1 word InvMixColumn(word w)
2 begin
3 word tmp;
4
5 tmp = rot2 (w); 6 tmp = tmp ^ w;
7 tmp = FFmulX(tmp);
8 tmp = FFmulX(tmp);
9 10 w = w ^ tmp;
11
12 w = MixColumn(w);
13
14 return w;
15 end
Listing 2.7: Inverse MixColumns transformation with 32-bit operations
We can estimate the clock cycles for the InvMixColumns transformation with Listing 2.7. For one column of the State matrix it takes one rotation, two XOR operations, two FFmulX operations and one MixColumn operation. This makes 1cc + 2 · 1cc + 2 · 11cc + 21cc = 46cc for one column, 4 · 46cc = 184cc for a State and 9 · 184cc = 1656cc for all nine rounds.
Summary
Summarizing the optimized 32-bit AES encryption takes 176cc for the AddRoundKey transformation, 800cc for the combined ShiftRows and SubBytes transformation, and 756cc for the MixColumns transformation. In total this makes 1732cc for the encryption.
The decryption, with 1656cc for the InvMixColumns transformation, now needs 176cc + 800cc + 1656cc = 2632cc.
2.3.6 32-bit T-Tables Implementation
Daemon and Rijmen already showed in their original AES proposal, that AES can be implemented very efficiently on 32-bit processors by combining the transformations SubBytes,
ShiftRows, MixColumns, and AddRoundKey for one column4. They define four tables each of 256 4-byte words (for 0 ≤ x ≤ 255).
Where c(r) = c + r mod 4 with c(0) = c. Here c denotes the actual column index, and c(r),r = 0,...,3 denotes the new position after the ShiftRows transformation. With these tables each column in the output State s0 can be computed by using four XOR operations together with one word from the key schedule and four table-lookups that are indexed using four bytes from the input State. We need four XOR and five TLU per column. Since we have four columns per State matrix, 4 · 4 = 16 XOR, 4 · 4 = 16 TLU (ROM) and four TLU (RAM) per round are needed. This makes 16cc + 2 · 16cc + 4cc = 52cc for one round and 10 · 52cc = 520cc for the whole algorithm (plus additional key scheduling).
One table consumes 4·256bytes = 1024bytes ROM, and for the four tables 4·1024bytes =
4096bytes are needed. In the last round, the MixColumns is not present. This means that different tables are required for the last round. If we also implement the last round as a table, we need 2 · 4096bytes = 8192bytes of table space altogether.
If we take a closer look at the tables, it can be recognized that the structure of the four tables is closely related to each other. The columns are rotated by one, two and three bytes, since Ti(x) = rot(Ti−1(x)),i = 0,... ,3, where the function rot() rotates the bytes in the column cyclically by one position. So, the space needed for the tables can be reduced by a factor of four at the expense of three additional cyclically rotations per round.
2.3.7 32-bit Implementation with Transposed State
Bertoni et al. [BBF+03] had the idea to transpose the State matrix S in such a way, that it can be processed by columns instead of rows. Therefore they have restructured the MixColumns, the ShiftRows, and the round key generation for en-/decryption. Since the SubBytes transformation works on each byte individually, they left it as it is.
The consequence of the transposition is that the new version of the InvMixColumns achieves a much higher speed gain. In fact, the transposed InvMixColumns requires seven doublings and 27 additions. The authors report a cycle count of 2047cc for decryption on an ARM7TDMI. The encryption becomes a bit slower than the 32-bit reference implementation of Gladman. For encryption they need 1675cc. Their reference implementation from Brian Gladman needs 2763cc for decryption and 1641cc for encryption on the ARM7TDMI.
2.3.8 Hardware Implementations
There has also been much effort put in efficient hardware implementations of AES: From the 8-bit implementation straight from the standard to 32-bit implementations using a 32-bit data path up to high-speed implementations with a 128-bit data path. For every implementation, very much care has to be given to the S-box implementation, both for performance and for security reasons.
Rijmen suggested in 2001 to use subfield arithmetic for computing the inverse part of the SubBytes transformation in [Rij01]. This idea was extended by Satoh et al. in [SMTM01] by using sub-subfields, the so-called “composite field” or “tower field” approach, which resulted in the smallest AES circuit at that point. The “composite field” approach utilizes that a finite field GF(2k) with k = n · m can also be represented as GF((2n)m). Later, Canright [CB09] improved this method again by a carefully chosen normal basis, which results in the most compact S-box up to date.
As we will see, some ideas from the “composite field” approach can be reused for our SCA resistant AES implementation. So we will give a short overview to gain insight into the “composite field” arithmetic.
Composite Fields
We have seen in Section 2.2, that the finite field GF(28) can be seen as an extension field of GF(2) and its elements are represented as bytes. However, we can also see GF(28) as a quadratic extension of the field GF(24), which is far better suited for hardware implementations [Paa94, SMTM01]. We call this field composite field. In particular, composite field inversions are used to create compact AES hardware implementations [RDJ+01, SMTM01, MS03, WOL02].
We can represent every element a ∈ GF(28) as a linear polynomial ahx + al with the coefficients ah and al being elements of the field GF(24). We refer to this polynomial as twoterm polynomial. Both coefficients of this polynomial have four bits. The finite field GF(28) is isomorphic to the field GF((24)2). For each element in GF(28) there exists exactly one element in GF((24)2). The following field polynomial, which is used to define the quadratic extension of GF(24), is taken from [WOL02, Tri03].
Two two-term polynomials can be added by adding their corresponding coefficients:
(ahx + al) ⊕ (bhx + bl) = (ah ⊕ bh)x + (al ⊕ bl). (2.19)
For the multiplication and inversion of two two-term polynomials a modular reduction step is required, to ensure that the result is again a two-term polynomial. The irreducible polynomial n used for the reduction is given by:
n(x) = x2 + x + {E}16. (2.20)
The coefficients of n are elements of GF(24) and written in hexadecimal notation. To ensure, that the product of two coefficients lies in GF(24), i. e., a(x)⊗ b(x) := a(x) · b(x) mod m4(x) with a,b ∈ GF(24), the irreducible reduction polynomial m4 is used, which is given by:
m4(x) = x2 + x + 1. (2.21)
To calculate the inverse of d Oswald et al. shift this element down to GF(22) in [OMP04], where they compute the inversion on a dedicated hardware. This is feasible in hardware but it is not efficient in software on a microcontroller. Instead the inversion of d can be efficiently realized as a lookup table with 16 elements from GF(24).
If the inversion part of SubBytes is calculated with the composite field method, the big lookup table for the inversion in GF(28) with 256 elements can be replaced by the lookup table for the inversion in GF(24) with 16 elements.
2.3.9 Comparison of Different Implementation Options
In the last sections, several options to implement AES were discussed. Two main possibilities to write an 8-bit software implementation of AES have been pointed out.
In the first method, all round keys are calculated in advance, Table 2.1 line one, three, four, five and six. A method with separate key scheduling can be recommended, if there is enough RAM to store the 176bytes for the round keys. The separate key scheduling especially pays off, if two or more blocks of 128bit are encrypted5. An AES 128-bit encryption with separate key scheduling takes 3808cc on an 8-bit processor (line three) and 1732cc on a 32-bit processor (line four).
The second method is to implement an embedded key scheduling. Here the round keys are calculated when they are needed, on-the-fly, and afterwards they are discarded. This leads
to a very small RAM consumption. Table 2.1 denotes in line two, that this implementation needs 4064cc and only 16bytes RAM (for the last used round key). If the processor used has a large number of registers, everything can be calculated without the use of RAM. Thus, the on-the-fly method can be preferred on architectures where only very limited RAM is available.
On a 32-bit processor, AES can also be implemented with almost only table lookups. Table 2.1 refers to this method as T-tables. An 128-bit encryption with the T-tables approach takes approximately 520cc (with separate key scheduling). The T-tables implementation needs 8192bytes of ROM, to store the four pre-calculated tables that are needed for the encryption, and additional 8192bytes for the decryption.
The first three entries in the table are to get a rough feeling of the space and cycle consumption of AES. In the second part of the table they can be compared with real implementations. We notice that there are big clock cycles differences between different implementations even if they are on the same architecture (8 or 32-bit). So we just compare the ratio between the cycle counts for an implementation given from the same author.
The cAESar implementation from [HKQ99] is one of the first 8-bit implementations we found. The used processor is not clearly described in the paper. The implementation is completely written in Assembler and needs 2889cc for one encryption. The round keys are generated on-the-fly.
The second implementation is from Akkar/Giraud in 2001 [AG01] (line 8). Their nonoptimized implementation was done in assembly code using an 8-bit processor. In their paper, they specify the runtime with 18.1ms at 5MHz which results in 90500cc for the encryption of a 128-bit message. The whole implementation needs 730bytes ROM and 41bytes of RAM.
In [HOM06], Herbst et al. compare their SCA resistant 8-bit implementation with one self made older unprotected one from the Institute for Applied Information Processing and Communication (IAIK) [IAI06] they work at. This unprotected implementation needs 4427cc for an encryption. The second unprotected implementation they refer to, is an implementation from Christian Röpke [R¨03] on an ATmega163, which needs 7498cc for an encryption. These two clock cycle counts are interesting for us, because we want use to some countermeasures from Herbst et al.
Bernstein and Schwabe [BS08] have written the fastest 32-bit implementation we have found so far. For achieving these results, they have made heavily optimized versions for the underlying 32-bit micro architecture and utilized the T-tables approach. On an Intel Core 2 Quad with, e.g., combined load and XOR operations, their implementation needs ≈ 169cc for the encryption. In their clock cycle calculation they ignored the costs for computing the round keys.
The main requirement for our AES implementation is to be small in terms of RAM and ROM consumption. For this reason, we do not take the 32-bit T-tables approach. Because a typical AES implementation will have a fixed key and we assume that more than one 128-bit block is encrypted during one AES run, we decided us for the separate key scheduling method. Therefore we will take the optimized 32-bit implementation (line four) and try to implement it on the TriCore architecture as fast as possible. The RAM consumption in Table 2.1 does not contain the 16-byte State matrix.
[1] t = a ⊗ m;
6 b = rotate_bit_left(b);
[2] b = b ⊕ parity(t);
8 m = rotate_bit_right(m);
9 end
[3] b = b ⊕ c;
Now we one multiplication (AND) 1cc, two rotations 2cc, one parity calculation 1cc and one addition (XOR) 1cc to calculate one bit of the result. For all eight bits we need 8·5cc =
3 States whether the sum of all bits in the byte are even or odd.
@Slav Defence - Again, this is a much more advanced article on EMSEC, one that goes into the principles behind the concept of side channel analysis, or the task of side-stepping encryption. Also, I hope these article are helping you and others understand what I did in the US Navy. They are quite technical, if you'd like me to post less technical information, please say so. I really want people to enjoy these article and learn something, but the information can be overwhelming and scary at times, especially for novices... this isn't even too advanced, it's first year stuff!!!
@Jango @Manticore @Jungibaaz
@Hu Songshan - you told me in our conversation that you where interested in Electronic Warfare, so I thought I'd tag you here as well given this thread pertains to said topic.
@Nihonjin1051 - I have no idea if this topic interests you (if it does, you're welcome. If not, ), but this was my job in the USN.
Last edited: