Cache Side Channel Defense System - Ezgi Demirayak
Cache Side Channel Attacks
When a program needs a particular data object d from level k+1, it first looks for d in one of the blocks currently stored at level k. If d happens to be cached at level k, then we have what is called a cache hit. The program reads d directly from level k, which by the nature of the memory hierarchy is faster than reading from level k+1. For example, a program with good temporal locality might read a data object from block 14, resulting in a cache hit at level k. Although caches are highly effective in reducing average memory access time and are thus widely used in modern processors, their internal behaviors, i.e., hits and misses, have been shown to leak critical information that puts trusted software implementations in unforeseen danger.
Software cache-based side channel attacks present a serious threat to computer systems. There are mainly two types of software cache-based side-channel attacks: access-driven and time-driven attacks. In access-driven attacks, the adversary controls one or more spy processes that share the cache with the victim process. Due to cache sharing, the victim process may evict the spy process's cache lines/entries when it accesses key-dependent (i.e., critical) cache lines/entries. By measuring the access times of its own cache lines/entries, the spy process can figure out which lines were evicted by the victim process. This cache access behavior of the victim process may leak enough information for the adversary to infer the key. In time-driven attacks, the adversary sends various encryption/decryption requests to the target crypto process and, upon receiving responses, records the encryption times. Since the secret key may correlate with different numbers of cache misses for different inputs/outputs, the variation among encryption times may provide sufficient information for the adversary to derive the key.
I am researching a defense mechanism to prevent attackers from exploiting the last-level cache (LLC).
2 October: Met with Prof. Ponomarev to discuss my questions. We also determined which part of the paper I will be working on: I will develop a defense mechanism to protect writable data in caches.
4 October: I researched the AES cipher, which is a symmetric cipher. https://engineering.purdue.edu/kak/compsec/NewLectures/Lecture8.pdf
7 October: I read Prof. Dmitry Ponomarev's publication about cache side channel attacks.
9 October: Learned the basics of the M-sim simulator from a graduate student of Prof. Ponomarev, in order to use it when I need statistics.
10 October: I read articles about the ElGamal cipher.
http://www14.in.tum.de/konferenzen/Jass05/courses/1/papers/meier_paper.pdf
http://www.asecuritysite.com/encryption/elgamal
18 October: Worked on a possible solution.
21 October: Worked on a possible solution.
23 October: Weekly meeting with Prof. Ponomarev.
November 3 2015
I worked on the M-sim simulator. M-Sim is a multi-threaded microarchitectural simulation environment with a detailed cycle-accurate model of the key pipeline structures. It can measure processor performance under both single-threaded and multi-threaded simulation, giving designers and researchers a platform for detailed performance analysis of superscalar microprocessors. I tried to simulate L3 (last-level) cache partitioning. First, I executed M-sim from the terminal to see whether it has a built-in L3 cache; running M-sim with a benchmark prints the built-in configuration. I executed the following command:
ezgi@fatjoe:~/Desktop/msim$ ./sim-outorder spec2K6/bzip2NS.1.arg
Then I checked whether there is an L3 cache or not. There were only L1 and L2 caches, which I could see from the following configuration output:
-cache:dl1 dl1:256:64:4:l # l1 data cache config, i.e., {<config>|none}
-cache:dl1lat 1 # l1 data cache hit latency (in cycles)
-cache:dl2 ul2:512:64:16:l # l2 data cache config, i.e., {<config>|none}
-cache:dl2lat 10 # l2 data cache hit latency (in cycles)
-cache:il1 il1:512:64:2:l # l1 inst cache config, i.e., {<config>|dl1|dl2|none}
-cache:il1lat 1 # l1 instruction cache hit latency (in cycles)
-cache:il2 dl2 # l2 instruction cache config, i.e., {<config>|dl2|none}
-cache:il2lat 10 # l2 instruction cache hit latency (in cycles)
But when I looked at L3, I saw the information below:
-cache:dl3 none # l3 data cache config, i.e., {<config>|none}
-cache:dl3lat 30 # l3 data cache hit latency (in cycles)
-cache:il3 none # l3 instruction cache config, i.e., {<config>|dl3|none}
-cache:il3lat 30 # l3 instruction cache hit latency (in cycles)
I understood that there is no L3 cache enabled by default in M-sim, so I needed to create one.
The cache config parameter <config> has the following format:
<name>:<nsets>:<bsize>:<assoc>:<repl>
<name> - name of the cache being defined
<nsets> - number of sets in the cache
<bsize> - block size of the cache
<assoc> - associativity of the cache
<repl> - block replacement strategy, 'l'-LRU, 'f'-FIFO, 'r'-random
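As an aside, a config string in this format can be parsed with a single sscanf. This is just an illustrative sketch; parse_cache_cfg and struct cache_cfg are hypothetical names, not M-sim's actual parsing code:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parsed fields of a <name>:<nsets>:<bsize>:<assoc>:<repl> config string. */
struct cache_cfg {
    char name[16];
    int  nsets, bsize, assoc;
    char repl;            /* 'l' = LRU, 'f' = FIFO, 'r' = random */
};

/* Returns 1 on success, 0 on malformed input. */
static int parse_cache_cfg(const char *s, struct cache_cfg *c)
{
    return sscanf(s, "%15[^:]:%d:%d:%d:%c",
                  c->name, &c->nsets, &c->bsize, &c->assoc, &c->repl) == 5;
}
```

For example, "ul3:8192:64:16:l" parses into name "ul3", 8192 sets, 64-byte blocks, 16-way associativity, LRU replacement.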
The name of my cache will be ul3; "u" stands for unified, meaning my L3 cache will hold both data and instructions. Then I had to figure out how many sets I would need; I calculated that part last. I set the block size to 64 to be consistent with the L1 and L2 caches. The cache size is given by the following formula:
Total cache size = sets * block size * associativity
I did some research to find an optimal cache size. I found that Intel's Core i7 processors have maintained an 8MB L3 cache. Intel's Sandy Bridge CPUs also shared an 8MB L3 cache with the on-die graphics core (Ivy Bridge gave the GPU its own dedicated slice of L3 cache instead of sharing the entire 8MB). The shared L3 cache is slower but much larger, which means it can hold data for all the cores at once.
So I specified my cache size as 8MB, which is 2^3 * 2^20 bytes. As for associativity, there is a trade-off between performance and resources: when the cache has to check more places, it costs more power, more resources, and more time. However, as associativity increases, the cache suffers fewer conflict misses and the CPU wastes less time reading from slow memory. Therefore I wanted my cache to be highly associative and set the associativity to 16.
Plugging into the formula:
2^3 * 2^20 = sets * 64 * 16
sets = 2^23 / 2^10 = 2^13 = 8192
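The arithmetic can be double-checked with a small helper that applies the same formula (cache_sets is a hypothetical name used only for this sketch):

```c
#include <assert.h>

/* Number of sets = total cache size / (block size * associativity). */
static unsigned long cache_sets(unsigned long total_bytes,
                                unsigned long block_bytes,
                                unsigned long assoc)
{
    return total_bytes / (block_bytes * assoc);
}
```

For an 8MB cache (8UL << 20 bytes) with 64-byte blocks and 16 ways, this yields 8192 sets, matching the hand calculation.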
Lastly, I needed to decide which block replacement policy to use. When the cache is full, a victim block must be chosen and evicted to make room for the new data.
I examined my options:
Random Replacement: Replaces the block which is randomly selected among all blocks that are currently in the cache.
First-in-first-out (FIFO) Replacement: Replaces the block that has been in the cache for the longest time, i.e., the block that was brought into the cache first.
Least Recently Used (LRU): Replaces the block that has not been used for the longest time.
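As a concrete illustration of LRU (a sketch, not M-sim's actual implementation), eviction within one set can be tracked with per-way age counters: each access ages every block in the set and resets the touched block, and the victim is the block with the largest age. The helper names touch and lru_victim are hypothetical:

```c
#include <assert.h>

/* Age-counter LRU for one cache set: age[w] counts accesses since way w
 * was last used; the LRU victim is the way with the largest age. */
static void touch(unsigned age[], int assoc, int way)
{
    for (int w = 0; w < assoc; w++)
        age[w]++;
    age[way] = 0;               /* the touched way is now most recently used */
}

static int lru_victim(const unsigned age[], int assoc)
{
    int victim = 0;
    for (int w = 1; w < assoc; w++)
        if (age[w] > age[victim])
            victim = w;
    return victim;
}
```

After accessing ways 0, 1, 2, 3 in order in a 4-way set, way 0 is the LRU victim; if way 0 is then touched again, way 1 becomes the victim.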
To compare their performance, I used the following article:
http://www.ece.uah.edu/~milenka/docs/milenkovic_acmse04r.pdf
On average, with LRU the number of misses decreases as associativity increases; since I am using a highly associative cache, LRU is a good way to reduce misses. According to the article above, the LRU policy in data caches performs better than FIFO and Random across almost all evaluated benchmarks and cache sizes. After settling on my cache's parameters, I ran the following command to create my L3 cache:
../sim-outorder -cache:dl3 ul3:8192:64:16:l -cache:il3 dl3
This means: create the L3 cache with the above features and set il3 as a pointer to dl3.
However, since I did not specify a benchmark the program did not output any statistics.
To see the hit and miss rates of the L3 cache, I ran the following command:
../sim-outorder -cache:dl3 ul3:8192:64:16:l -cache:il3 dl3 bzip2NS.1.arg
Now, I was able to see the statistics about my L3.
Core_0_ul3.accesses 558908 # total number of accesses
Core_0_ul3.hits 279200 # total number of hits
Core_0_ul3.misses 279708 # total number of misses
Core_0_ul3.replacements 271534 # total number of replacements
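As a quick sanity check on any of these dumps, the counters should satisfy accesses = hits + misses (here, 279200 + 279708 = 558908). A tiny hypothetical helper makes this easy to verify:

```c
#include <assert.h>

/* In a sim-outorder stats dump, the access count should equal
 * the hit count plus the miss count. */
static int stats_consistent(long accesses, long hits, long misses)
{
    return accesses == hits + misses;
}
```

The same identity holds for every stats dump later in this log.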
5 November 2015
I want to see what happens if I have two cores. Do they share the L3 or not?
I can specify the number of cores by simply passing -num_cores <number of cores>:
../sim-outorder -num_cores 2 -cache:dl3 ul3:8192:64:16:l -cache:il3 dl3 bzip2NS.1.arg
I realized that the cores share the L3, since statistics are reported for only one L3 cache (Core_0_ul3):
Core_0_ul3.accesses 558908 # total number of accesses
Core_0_ul3.hits 279200 # total number of hits
Core_0_ul3.misses 279708 # total number of misses
Core_0_ul3.replacements 271534 # total number of replacements
...
So now I will try partitioning the L3 cache between the two cores. My first approach: halve the associativity and see what happens to the hits and misses. This will be an approximation.
When I did this, nothing changed in the statistics. I looked for the reason and found that I was not executing enough instructions to fill the cache and actually generate replacements and misses.
So I increased the maximum number of instructions the program will execute:
../sim-outorder -max:inst 10000000 -cache:dl3 ul3:8192:64:16:l -cache:il3 dl3 bzip2NS.1.arg
Then the results started to change.
This is when my L3 cache is 8-way associative:
Core_0_ul3.accesses 1674635 # total number of accesses
Core_0_ul3.hits 749368 # total number of hits
Core_0_ul3.misses 925267 # total number of misses
Core_0_ul3.replacements 864221 # total number of replacements
This is when my L3 cache is 16-way associative:
Core_0_ul3.accesses 1674635 # total number of accesses
Core_0_ul3.hits 749368 # total number of hits
Core_0_ul3.misses 925267 # total number of misses
Core_0_ul3.replacements 802781 # total number of replacements
As can be seen, the total number of replacements is reduced, since the cache needs to evict less often.
X X X X X X X X

The row above represents one set of an 8-way associative cache. When the 9th read to the set occurs, one of the blocks must be replaced. With 16-way associativity, however, no replacement is needed until the 17th read. Therefore the number of replacements is smaller than with 8 ways.
Now I think about partitioning. Suppose two cores share a 16-way associative L3. What happens if they split the cache in half?
Then each core will use 8 ways of the cache. From the statistics above, I came to the following conclusion: each core effectively behaves as if it had an 8-way cache, and with 8-way associativity I measured

Core_0_ul3.replacements 864221 # total number of replacements

However, I have two cores; therefore, I multiply this by 2, giving 1,728,442 replacements in total:

Core_0_ul3.replacements 1728442 # total number of replacements

This is a rough approximation.
On the other hand, when the two cores share the whole L3, each has 16-way associativity, and the total number of replacements is:
Core_0_ul3.replacements 802781 # total number of replacements
which is much lower.
Now, I will try to run a different benchmark with a larger memory footprint to see the difference.
This one is with 16-way set associativity:
Core_0_ul3.accesses 714279 # total number of accesses
Core_0_ul3.hits 357204 # total number of hits
Core_0_ul3.misses 357075 # total number of misses
Core_0_ul3.replacements 262088 # total number of replacements
Core_0_ul3.writebacks 261539 # total number of writebacks
Core_0_ul3.invalidations 0 # total number of invalidations
Core_0_ul3.miss_rate 0.499910 # miss rate (misses/ref)
Core_0_ul3.repl_rate 0.366927 # replacement rate (repls/ref)
This one is with 8-way set associativity:
Core_0_ul3.accesses 714279 # total number of accesses
Core_0_ul3.hits 357158 # total number of hits
Core_0_ul3.misses 357121 # total number of misses
Core_0_ul3.replacements 327670 # total number of replacements
Core_0_ul3.writebacks 327097 # total number of writebacks
Core_0_ul3.invalidations 0 # total number of invalidations
Core_0_ul3.miss_rate 0.499974 # miss rate (misses/ref)
Core_0_ul3.repl_rate 0.458742 # replacement rate (repls/ref)
Now I will try to partition the cache with different ratios. However, I need to change the simulator code internally, because the cache associativity must be a power of two, so I cannot configure two differently sized per-core caches from the command line alone.
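The partitioning idea can be sketched in code (an illustration of way-partitioning, not M-sim's actual cache.c): each core owns a fixed range of ways in every set, and victim selection is restricted to that range. The struct way_partition and partitioned_victim names are hypothetical:

```c
#include <assert.h>

#define NUM_WAYS 16

/* Each core owns ways [start, start + count) of every set. */
struct way_partition { int start; int count; };

/* Example split of a 16-way set: 2 ways for core 0, 14 for core 1. */
static const struct way_partition part[2] = {
    { 0,  2 },    /* core 0: ways 0..1  */
    { 2, 14 },    /* core 1: ways 2..15 */
};

/* Pick the LRU victim for `core`, restricted to the ways it owns;
 * age[w] is the number of accesses since way w was last used. */
static int partitioned_victim(int core, const unsigned age[NUM_WAYS])
{
    const struct way_partition *p = &part[core];
    int victim = p->start;
    for (int w = p->start + 1; w < p->start + p->count; w++)
        if (age[w] > age[victim])
            victim = w;
    return victim;
}
```

Confining each core's evictions to its own ways means a spy core can no longer evict the victim core's lines, at the cost of a smaller effective cache per core.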
10 - 15 November
Read an article about Cache Partitioning in order to have more knowledge.
http://link.springer.com/article/10.1023%2FB%3ASUPE.0000014800.27383.8f?LI=true
18 - 24 November
Read an article about a way to partition the cache dynamically.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.42.5764&rep=rep1&type=pdf
24-30 November: Thanksgiving break.
5 December: Weekly meeting with Prof. Ponomarev.
../sim-outorder -fastfwd 2000000000 -cache:dl3 ul3:8192:64:16:l -cache:il3 dl3 mcfNS.1.arg
Core_0_ul3.accesses 119441 # total number of accesses
Core_0_ul3.hits 52258 # total number of hits
Core_0_ul3.misses 67183 # total number of misses
Core_0_ul3.replacements 67183 # total number of replacements
Core_0_ul3.writebacks 14606 # total number of writebacks
Core_0_ul3.invalidations 0 # total number of invalidations
Core_0_ul3.miss_rate 0.562479 # miss rate (misses/ref)
Core_0_ul3.repl_rate 0.562479 # replacement rate (repls/ref)
Core_0_ul3.wb_rate 0.122286 # writeback rate (wrbks/ref)
Core_0_ul3.inv_rate 0.000000 # invalidation rate (invs/ref)
../sim-outorder -fastfwd 2000000000 -cache:dl3 ul3:8192:64:8:l -cache:il3 dl3 mcfNS.1.arg
Core_0_ul3.accesses 119434 # total number of accesses
Core_0_ul3.hits 39384 # total number of hits
Core_0_ul3.misses 80050 # total number of misses
February 2
Weekly meeting with Prof. Ponomarev. We talked about what I will do this semester.
I will change the internal code of the cache simulator to be able to partition the cache with different ratios.
February 3: Worked in the lab. I tried ways to modify the cache.c and cache.h files so that I could partition the cache with numbers that are not powers of two. I modified the code and became able to partition the cache with the ratios I wanted, so I continued partitioning.
For 16-way associativity, I first gave 1/16 of the entire cache to one core and 15/16 to the other.

../sim-outorder -fastfwd 2000000000 -cache:dl3 ul3:8192:64:15:l -cache:il3 dl3 mcfNS.1.arg

When a core is given 15/16 of the cache, the results are:
Core_0_ul3.accesses 119439 # total number of accesses
Core_0_ul3.hits 51288 # total number of hits
Core_0_ul3.misses 68151 # total number of misses
Core_0_ul3.replacements 68151 # total number of replacements
Core_0_ul3.writebacks 14888 # total number of writebacks
Core_0_ul3.invalidations 0 # total number of invalidations
Core_0_ul3.miss_rate 0.570593 # miss rate (misses/ref)
Core_0_ul3.repl_rate 0.570593 # replacement rate (repls/ref)
Core_0_ul3.wb_rate 0.124649 # writeback rate (wrbks/ref)
Core_0_ul3.inv_rate 0.000000 # invalidation rate (invs/ref)
ezgi@fatjoe:~/Desktop/msim/spec2K6$ ../sim-outorder -fastfwd 2000000000 -cache:dl3 ul3:8192:64:1:l -cache:il3 dl3 mcfNS.1.arg

When a core is given 1/16 of the cache, the results are:
Core_0_ul3.accesses 119436 # total number of accesses
Core_0_ul3.hits 9589 # total number of hits
Core_0_ul3.misses 109847 # total number of misses
Core_0_ul3.replacements 109847 # total number of replacements
Core_0_ul3.writebacks 27079 # total number of writebacks
Core_0_ul3.invalidations 0 # total number of invalidations
Core_0_ul3.miss_rate 0.919714 # miss rate (misses/ref)
Core_0_ul3.repl_rate 0.919714 # replacement rate (repls/ref)
Core_0_ul3.wb_rate 0.226724 # writeback rate (wrbks/ref)
Core_0_ul3.inv_rate 0.000000 # invalidation rate (invs/ref)
Then, I gave 2/16 of the entire cache to one core and 14/16 to the other. When a core is given 2/16 of the cache, the results are:
Core_0_ul3.accesses 119436 # total number of accesses
Core_0_ul3.hits 27538 # total number of hits
Core_0_ul3.misses 91898 # total number of misses
Core_0_ul3.replacements 91898 # total number of replacements
Core_0_ul3.writebacks 24374 # total number of writebacks
Core_0_ul3.invalidations 0 # total number of invalidations
Core_0_ul3.miss_rate 0.769433 # miss rate (misses/ref)
Core_0_ul3.repl_rate 0.769433 # replacement rate (repls/ref)
Core_0_ul3.wb_rate 0.204076 # writeback rate (wrbks/ref)
Core_0_ul3.inv_rate 0.000000 # invalidation rate (invs/ref)
When a core is given 14/16 of the cache, the results are:
Core_0_ul3.accesses 119439 # total number of accesses
Core_0_ul3.hits 50110 # total number of hits
Core_0_ul3.misses 69329 # total number of misses
Core_0_ul3.replacements 69329 # total number of replacements
Core_0_ul3.writebacks 15261 # total number of writebacks
Core_0_ul3.invalidations 0 # total number of invalidations
Core_0_ul3.miss_rate 0.580455 # miss rate (misses/ref)
Core_0_ul3.repl_rate 0.580455 # replacement rate (repls/ref)
Core_0_ul3.wb_rate 0.127772 # writeback rate (wrbks/ref)
Core_0_ul3.inv_rate 0.000000 # invalidation rate (invs/ref)
Here is 3/16 and 13/16. When a core is given 3/16 of the cache, the results are:
Core_0_ul3.accesses 119437 # total number of accesses
Core_0_ul3.hits 31448 # total number of hits
Core_0_ul3.misses 87989 # total number of misses
Core_0_ul3.replacements 87989 # total number of replacements
Core_0_ul3.writebacks 22021 # total number of writebacks
Core_0_ul3.invalidations 0 # total number of invalidations
Core_0_ul3.miss_rate 0.736698 # miss rate (misses/ref)
Core_0_ul3.repl_rate 0.736698 # replacement rate (repls/ref)
Core_0_ul3.wb_rate 0.184373 # writeback rate (wrbks/ref)
Core_0_ul3.inv_rate 0.000000 # invalidation rate (invs/ref)
Here is 4/16 and 12/16:
Here is 5/16 and 11/16:
Here is 6/16 and 10/16:
Here is 7/16 and 9/16:
Here is 8/16 and 8/16: