Cache Side Channel Defense System - Ezgi Demirayak
Cache Side Channel Attacks
When a program needs a particular data object d from level k+1, it first looks for d in one of the blocks currently stored at level k. If d happens to be cached at level k, then we have what is called a cache hit. The program reads d directly from level k, which by the nature of the memory hierarchy is faster than reading from level k+1. For example, a program with good temporal locality might read a data object from block 14, resulting in a cache hit at level k. Although caches are highly effective in reducing average memory access time and are thus widely used in modern processors, their internal behaviors, i.e., hits and misses, have been shown to leak critical information that puts trusted software implementations in unforeseen danger.
Software cache-based side channel attacks present a serious threat to computer systems. There are mainly two types of software cache-based side-channel attacks: access-driven and time-driven attacks. In access-driven attacks, the adversary controls one or more spy processes that share the cache with the victim process. Due to cache sharing, the victim process may evict the spy process's cache lines/entries when it accesses key-dependent (i.e., critical) cache lines/entries. By measuring the access times of its own cache lines/entries, the spy process can figure out which lines were evicted by the victim process. This cache access behavior of the victim process may leak enough information for the adversary to infer the key. In time-driven attacks, the adversary sends various encryption/decryption requests to the target crypto process and, upon receiving responses, records the encryption times. Since the secret key may correlate with different numbers of cache misses for different inputs/outputs, the variation among encryption times may provide sufficient information for the adversary to derive the key.
I am researching a defense mechanism to prevent attackers from exploiting the last-level cache (LLC).
2 October: Met with Prof. Ponomarev to discuss my questions. We also determined which part of the paper I will be working on: I will develop a defense mechanism to protect writable data in caches.
4 October: I researched the AES cipher, which is a symmetric cipher. https://engineering.purdue.edu/kak/compsec/NewLectures/Lecture8.pdf
7 October: I read Prof. Dmitry Ponomarev's publication about cache side channel attacks.
9 October: Learned the basics of the M-sim simulator from a graduate student of Prof. Ponomarev, in order to use it when I need statistics.
10 October: I read articles about the ElGamal cipher.
http://www14.in.tum.de/konferenzen/Jass05/courses/1/papers/meier_paper.pdf
http://www.asecuritysite.com/encryption/elgamal
18 October: Worked on a possible solution.
21 October: Worked on a possible solution.
23 October: Weekly meeting with Prof. Ponomarev.
November 3 2015
I worked on the M-sim simulator. M-Sim is a multi-threaded microarchitectural simulation environment with a detailed cycle-accurate model of the key pipeline structures. It can measure processor performance under both single-threaded and multi-threaded simulation, giving designers and researchers a platform for detailed performance analysis of superscalar microprocessors. I tried to simulate L3 (last-level) cache partitioning. First, I executed M-sim from the terminal to see whether it has a built-in L3 cache; running M-sim with a benchmark prints the built-in configuration. I executed the following command:
ezgi@fatjoe:~/Desktop/msim$ ./sim-outorder spec2K6/bzip2NS.1.arg
Then I checked whether there is an L3 cache or not. There were only L1 and L2 caches, which I could see from the following configuration output:
-cache:dl1 dl1:256:64:4:l # l1 data cache config, i.e., {<config>|none}
-cache:dl1lat 1 # l1 data cache hit latency (in cycles)
-cache:dl2 ul2:512:64:16:l # l2 data cache config, i.e., {<config>|none}
-cache:dl2lat 10 # l2 data cache hit latency (in cycles)
-cache:il1 il1:512:64:2:l # l1 inst cache config, i.e., {<config>|dl1|dl2|none}
-cache:il1lat 1 # l1 instruction cache hit latency (in cycles)
-cache:il2 dl2 # l2 instruction cache config, i.e., {<config>|dl2|none}
-cache:il2lat 10 # l2 instruction cache hit latency (in cycles)
But when I looked at L3, I saw the information below:
-cache:dl3 none # l3 data cache config, i.e., {<config>|none}
-cache:dl3lat 30 # l3 data cache hit latency (in cycles)
-cache:il3 none # l3 instruction cache config, i.e., {<config>|dl3|none}
-cache:il3lat 30 # l3 instruction cache hit latency (in cycles)
I understood that there is no L3 cache enabled by default in M-sim, so I needed to create one.
The cache config parameter <config> has the following format:
<name>:<nsets>:<bsize>:<assoc>:<repl>
<name> - name of the cache being defined
<nsets> - number of sets in the cache
<bsize> - block size of the cache
<assoc> - associativity of the cache
<repl> - block replacement strategy, 'l'-LRU, 'f'-FIFO, 'r'-random
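As an aside, a config string in this format can be parsed with a single sscanf. This is just an illustrative sketch; parse_cache_cfg and struct cache_cfg are hypothetical names, not M-sim's actual parsing code:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parsed fields of a <name>:<nsets>:<bsize>:<assoc>:<repl> config string. */
struct cache_cfg {
    char name[16];
    int  nsets, bsize, assoc;
    char repl;            /* 'l' = LRU, 'f' = FIFO, 'r' = random */
};

/* Returns 1 on success, 0 on malformed input. */
static int parse_cache_cfg(const char *s, struct cache_cfg *c)
{
    return sscanf(s, "%15[^:]:%d:%d:%d:%c",
                  c->name, &c->nsets, &c->bsize, &c->assoc, &c->repl) == 5;
}
```

For example, "ul3:8192:64:16:l" parses into name "ul3", 8192 sets, 64-byte blocks, 16-way associativity, LRU replacement.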
The name of my cache will be ul3; "u" stands for unified, meaning my L3 cache will hold both data and instructions. Then I had to figure out how many sets I would need; I calculated that part last. I set the block size to 64 to be consistent with the L1 and L2 caches. The cache size is given by the following formula:
Total cache size = sets * block size * associativity
I did some research to find an optimal cache size. I found that Intel's Core i7 processors have maintained an 8MB L3 cache. Intel's Sandy Bridge CPUs also shared an 8MB L3 cache with the on-die graphics core (Ivy Bridge gave the GPU its own dedicated slice of L3 cache instead of sharing the entire 8MB). The shared L3 cache is slower but much larger, which means it can hold data for all the cores at once.
So I specified my cache size as 8MB, which is 2^3 * 2^20 bytes. As for associativity, there is a trade-off between performance and resources: when the cache has to check more places, it costs more power, more resources, and more time. However, as associativity increases, the cache suffers fewer conflict misses and the CPU wastes less time reading from slow memory. Therefore I wanted my cache to be highly associative and set the associativity to 16.
Plugging into the formula:
2^3 * 2^20 = sets * 64 * 16
sets = 2^23 / 2^10 = 2^13 = 8192
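The arithmetic can be double-checked with a small helper that applies the same formula (cache_sets is a hypothetical name used only for this sketch):

```c
#include <assert.h>

/* Number of sets = total cache size / (block size * associativity). */
static unsigned long cache_sets(unsigned long total_bytes,
                                unsigned long block_bytes,
                                unsigned long assoc)
{
    return total_bytes / (block_bytes * assoc);
}
```

For an 8MB cache (8UL << 20 bytes) with 64-byte blocks and 16 ways, this yields 8192 sets, matching the hand calculation.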
Lastly, I needed to decide which block replacement policy to use. When the cache is full, a victim block must be chosen and evicted to make room for the new data.
I examined my options:
Random Replacement: Replaces the block which is randomly selected among all blocks that are currently in the cache.
First-in-first-out (FIFO) Replacement: Replaces the block that has been in the cache for the longest time, i.e., the block that was brought into the cache first.
Least Recently Used (LRU): Replaces the block that has not been used for the longest time.
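As a concrete illustration of LRU (a sketch, not M-sim's actual implementation), eviction within one set can be tracked with per-way age counters: each access ages every block in the set and resets the touched block, and the victim is the block with the largest age. The helper names touch and lru_victim are hypothetical:

```c
#include <assert.h>

/* Age-counter LRU for one cache set: age[w] counts accesses since way w
 * was last used; the LRU victim is the way with the largest age. */
static void touch(unsigned age[], int assoc, int way)
{
    for (int w = 0; w < assoc; w++)
        age[w]++;
    age[way] = 0;               /* the touched way is now most recently used */
}

static int lru_victim(const unsigned age[], int assoc)
{
    int victim = 0;
    for (int w = 1; w < assoc; w++)
        if (age[w] > age[victim])
            victim = w;
    return victim;
}
```

After accessing ways 0, 1, 2, 3 in order in a 4-way set, way 0 is the LRU victim; if way 0 is then touched again, way 1 becomes the victim.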
To compare their performance, I used the following article:
http://www.ece.uah.edu/~milenka/docs/milenkovic_acmse04r.pdf
On average, with LRU the number of misses decreases as associativity increases; since I am using a highly associative cache, LRU is a good way to reduce misses. According to the article above, the LRU policy in data caches performs better than FIFO and Random across almost all evaluated benchmarks and cache sizes. After settling on my cache's parameters, I ran the following command to create my L3 cache:
../sim-outorder -cache:dl3 ul3:8192:64:16:l -cache:il3 dl3
This means: create the L3 cache with the above features and set il3 as a pointer to dl3.
However, since I did not specify a benchmark the program did not output any statistics.
To see the hit and miss rates of the L3 cache, I ran the following command:
../sim-outorder -cache:dl3 ul3:8192:64:16:l -cache:il3 dl3 bzip2NS.1.arg
Now, I was able to see the statistics about my L3.
Core_0_ul3.accesses 558908 # total number of accesses
Core_0_ul3.hits 279200 # total number of hits
Core_0_ul3.misses 279708 # total number of misses
Core_0_ul3.replacements 271534 # total number of replacements
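As a quick sanity check on any of these dumps, the counters should satisfy accesses = hits + misses (here, 279200 + 279708 = 558908). A tiny hypothetical helper makes this easy to verify:

```c
#include <assert.h>

/* In a sim-outorder stats dump, the access count should equal
 * the hit count plus the miss count. */
static int stats_consistent(long accesses, long hits, long misses)
{
    return accesses == hits + misses;
}
```

The same identity holds for every stats dump later in this log.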
5 November 2015
I want to see what happens if I have two cores. Do they share the L3 or not?
I can specify the number of cores by simply passing -num_cores <number of cores>:
../sim-outorder -num_cores 2 -cache:dl3 ul3:8192:64:16:l -cache:il3 dl3 bzip2NS.1.arg
I realized that the cores share the L3, since statistics are reported for only one L3 cache (Core_0_ul3):
Core_0_ul3.accesses 558908 # total number of accesses
Core_0_ul3.hits 279200 # total number of hits
Core_0_ul3.misses 279708 # total number of misses
Core_0_ul3.replacements 271534 # total number of replacements
...
So now I will try partitioning the L3 cache between the two cores. My first approach: halve the associativity and see what happens to the hits and misses. This will be an approximation.
When I did this, nothing changed in the statistics. I looked for the reason and found that I was not executing enough instructions to fill the cache and actually generate replacements and misses.
So I increased the maximum number of instructions the program will execute:
../sim-outorder -max:inst 10000000 -cache:dl3 ul3:8192:64:16:l -cache:il3 dl3 bzip2NS.1.arg
Then the results started to change.
This is when my L3 cache is 8-way associative:
Core_0_ul3.accesses 1674635 # total number of accesses
Core_0_ul3.hits 749368 # total number of hits
Core_0_ul3.misses 925267 # total number of misses
Core_0_ul3.replacements 864221 # total number of replacements
This is when my L3 cache is 16-way associative:
Core_0_ul3.accesses 1674635 # total number of accesses
Core_0_ul3.hits 749368 # total number of hits
Core_0_ul3.misses 925267 # total number of misses
Core_0_ul3.replacements 802781 # total number of replacements
As can be seen, the total number of replacements is reduced, since the cache needs to evict less often.
X X X X X X X X

The row above represents one set of an 8-way associative cache. When the 9th read to the set occurs, one of the blocks must be replaced. With 16-way associativity, however, no replacement is needed until the 17th read. Therefore the number of replacements is smaller than with 8 ways.
Now I think about partitioning. Suppose two cores share a 16-way associative L3. What happens if they split the cache in half?
Then each core will use 8 ways of the cache. From the statistics above, I came to the following conclusion: each core effectively behaves as if it had an 8-way cache, and with 8-way associativity I measured

Core_0_ul3.replacements 864221 # total number of replacements

However, I have two cores; therefore, I multiply this by 2, giving 1,728,442 replacements in total:

Core_0_ul3.replacements 1728442 # total number of replacements

This is a rough approximation.
On the other hand, when the two cores share the whole L3, each has 16-way associativity, and the total number of replacements is:
Core_0_ul3.replacements 802781 # total number of replacements
which is much lower.
Now, I will try to run a different benchmark with a larger memory footprint to see the difference.
This one is with 16-way set associativity:
Core_0_ul3.accesses 714279 # total number of accesses
Core_0_ul3.hits 357204 # total number of hits
Core_0_ul3.misses 357075 # total number of misses
Core_0_ul3.replacements 262088 # total number of replacements
Core_0_ul3.writebacks 261539 # total number of writebacks
Core_0_ul3.invalidations 0 # total number of invalidations
Core_0_ul3.miss_rate 0.499910 # miss rate (misses/ref)
Core_0_ul3.repl_rate 0.366927 # replacement rate (repls/ref)
This one is with 8-way set associativity:
Core_0_ul3.accesses 714279 # total number of accesses
Core_0_ul3.hits 357158 # total number of hits
Core_0_ul3.misses 357121 # total number of misses
Core_0_ul3.replacements 327670 # total number of replacements
Core_0_ul3.writebacks 327097 # total number of writebacks
Core_0_ul3.invalidations 0 # total number of invalidations
Core_0_ul3.miss_rate 0.499974 # miss rate (misses/ref)
Core_0_ul3.repl_rate 0.458742 # replacement rate (repls/ref)
Now I will try to partition the cache with different ratios. However, I need to change the simulator code internally, because the cache associativity must be a power of two, so I cannot configure two differently sized per-core caches from the command line alone.
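The partitioning idea can be sketched in code (an illustration of way-partitioning, not M-sim's actual cache.c): each core owns a fixed range of ways in every set, and victim selection is restricted to that range. The struct way_partition and partitioned_victim names are hypothetical:

```c
#include <assert.h>

#define NUM_WAYS 16

/* Each core owns ways [start, start + count) of every set. */
struct way_partition { int start; int count; };

/* Example split of a 16-way set: 2 ways for core 0, 14 for core 1. */
static const struct way_partition part[2] = {
    { 0,  2 },    /* core 0: ways 0..1  */
    { 2, 14 },    /* core 1: ways 2..15 */
};

/* Pick the LRU victim for `core`, restricted to the ways it owns;
 * age[w] is the number of accesses since way w was last used. */
static int partitioned_victim(int core, const unsigned age[NUM_WAYS])
{
    const struct way_partition *p = &part[core];
    int victim = p->start;
    for (int w = p->start + 1; w < p->start + p->count; w++)
        if (age[w] > age[victim])
            victim = w;
    return victim;
}
```

Confining each core's evictions to its own ways means a spy core can no longer evict the victim core's lines, at the cost of a smaller effective cache per core.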
10 - 15 November
Read an article about Cache Partitioning in order to have more knowledge.
http://link.springer.com/article/10.1023%2FB%3ASUPE.0000014800.27383.8f?LI=true
18 - 24 November
Read an article about a way to partition the cache dynamically.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.5764&rep=rep1&type=pdf
24-30 November: Thanksgiving break.
5 December: Weekly meeting with Prof. Ponomarev.
../sim-outorder -fastfwd 2000000000 -cache:dl3 ul3:8192:64:16:l -cache:il3 dl3 mcfNS.1.arg
Core_0_ul3.accesses 119441 # total number of accesses
Core_0_ul3.hits 52258 # total number of hits
Core_0_ul3.misses 67183 # total number of misses
Core_0_ul3.replacements 67183 # total number of replacements
Core_0_ul3.writebacks 14606 # total number of writebacks
Core_0_ul3.invalidations 0 # total number of invalidations
Core_0_ul3.miss_rate 0.562479 # miss rate (misses/ref)
Core_0_ul3.repl_rate 0.562479 # replacement rate (repls/ref)
Core_0_ul3.wb_rate 0.122286 # writeback rate (wrbks/ref)
Core_0_ul3.inv_rate 0.000000 # invalidation rate (invs/ref)
../sim-outorder -fastfwd 2000000000 -cache:dl3 ul3:8192:64:8:l -cache:il3 dl3 mcfNS.1.arg
Core_0_ul3.accesses 119434 # total number of accesses
Core_0_ul3.hits 39384 # total number of hits
Core_0_ul3.misses 80050 # total number of misses
February 2
Weekly meeting with Prof. Ponomarev. We talked about what I will do this semester.
I will change the internal code of the cache simulator to be able to partition the cache with different ratios.
February 3: Worked in the lab. I tried ways to modify the cache.c and cache.h files so that I could partition the cache with numbers that are not powers of two. I modified the code and became able to partition the cache with the ratios I wanted, so I continued partitioning.
For 16-way associativity, I first gave 1/16 of the entire cache to one core and 15/16 to the other.

../sim-outorder -fastfwd 2000000000 -cache:dl3 ul3:8192:64:15:l -cache:il3 dl3 mcfNS.1.arg

When a core is given 15/16 of the cache, the results are:
Core_0_ul3.accesses 119439 # total number of accesses
Core_0_ul3.hits 51288 # total number of hits
Core_0_ul3.misses 68151 # total number of misses
Core_0_ul3.replacements 68151 # total number of replacements
Core_0_ul3.writebacks 14888 # total number of writebacks
Core_0_ul3.invalidations 0 # total number of invalidations
Core_0_ul3.miss_rate 0.570593 # miss rate (misses/ref)
Core_0_ul3.repl_rate 0.570593 # replacement rate (repls/ref)
Core_0_ul3.wb_rate 0.124649 # writeback rate (wrbks/ref)
Core_0_ul3.inv_rate 0.000000 # invalidation rate (invs/ref)
ezgi@fatjoe:~/Desktop/msim/spec2K6$ ../sim-outorder -fastfwd 2000000000 -cache:dl3 ul3:8192:64:1:l -cache:il3 dl3 mcfNS.1.arg

When a core is given 1/16 of the cache, the results are:
Core_0_ul3.accesses 119436 # total number of accesses
Core_0_ul3.hits 9589 # total number of hits
Core_0_ul3.misses 109847 # total number of misses
Core_0_ul3.replacements 109847 # total number of replacements
Core_0_ul3.writebacks 27079 # total number of writebacks
Core_0_ul3.invalidations 0 # total number of invalidations
Core_0_ul3.miss_rate 0.919714 # miss rate (misses/ref)
Core_0_ul3.repl_rate 0.919714 # replacement rate (repls/ref)
Core_0_ul3.wb_rate 0.226724 # writeback rate (wrbks/ref)
Core_0_ul3.inv_rate 0.000000 # invalidation rate (invs/ref)
Then, I gave 2/16 of the entire cache to one core and 14/16 to the other. When a core is given 2/16 of the cache, the results are:
Core_0_ul3.accesses 119436 # total number of accesses
Core_0_ul3.hits 27538 # total number of hits
Core_0_ul3.misses 91898 # total number of misses
Core_0_ul3.replacements 91898 # total number of replacements
Core_0_ul3.writebacks 24374 # total number of writebacks
Core_0_ul3.invalidations 0 # total number of invalidations
Core_0_ul3.miss_rate 0.769433 # miss rate (misses/ref)
Core_0_ul3.repl_rate 0.769433 # replacement rate (repls/ref)
Core_0_ul3.wb_rate 0.204076 # writeback rate (wrbks/ref)
Core_0_ul3.inv_rate 0.000000 # invalidation rate (invs/ref)
When a core is given 14/16 of the cache, the results are:
Core_0_ul3.accesses 119439 # total number of accesses
Core_0_ul3.hits 50110 # total number of hits
Core_0_ul3.misses 69329 # total number of misses
Core_0_ul3.replacements 69329 # total number of replacements
Core_0_ul3.writebacks 15261 # total number of writebacks
Core_0_ul3.invalidations 0 # total number of invalidations
Core_0_ul3.miss_rate 0.580455 # miss rate (misses/ref)
Core_0_ul3.repl_rate 0.580455 # replacement rate (repls/ref)
Core_0_ul3.wb_rate 0.127772 # writeback rate (wrbks/ref)
Core_0_ul3.inv_rate 0.000000 # invalidation rate (invs/ref)
Here is 3/16 and 13/16. When a core is given 3/16 of the cache, the results are:
Core_0_ul3.accesses 119437 # total number of accesses
Core_0_ul3.hits 31448 # total number of hits
Core_0_ul3.misses 87989 # total number of misses
Core_0_ul3.replacements 87989 # total number of replacements
Core_0_ul3.writebacks 22021 # total number of writebacks
Core_0_ul3.invalidations 0 # total number of invalidations
Core_0_ul3.miss_rate 0.736698 # miss rate (misses/ref)
Core_0_ul3.repl_rate 0.736698 # replacement rate (repls/ref)
Core_0_ul3.wb_rate 0.184373 # writeback rate (wrbks/ref)
Core_0_ul3.inv_rate 0.000000 # invalidation rate (invs/ref)
Here is 4/16 and 12/16:
Here is 5/16 and 11/16:
Here is 6/16 and 10/16:
Here is 7/16 and 9/16:
Here is 8/16 and 8/16: