[[CUDA_Programming|← Back to project main page]]
= Evaluating the Performance of GPGPUs and Their Use in Scientific Computing =


= Introduction =
Computational science (or scientific computing) is the field of study concerned with constructing mathematical models and quantitative analysis techniques and using computers to analyse and solve scientific problems. Scientists and engineers develop computer programs (application software) that model the systems being studied and run these programs with various sets of input parameters. [0]
Applications from scientific computing often require a large amount of execution time due to large system sizes or a large number of iteration steps. The execution time can be significantly reduced by parallel execution on a suitable parallel or distributed platform. [1] Historically, people in the sciences used supercomputers or computing grids to carry out these computations.
However, with advancements in computer graphics, graphics processing units have become much more efficient and powerful. Because of the nature of graphical data, GPUs have become specialized in handling complex matrix calculations and massive mathematical computations. As the processing power of GPUs has increased, so has their demand for electrical power. This problem has led researchers to look for alternative solutions, and parallel programming has been adopted by many scientists to further optimize performance.
Nowadays, GPUs are especially well suited to address problems that can be expressed as data-parallel computations with high arithmetic intensity. Many applications that process large data sets, such as arrays or volumes, can use a data-parallel programming model to speed up computations. These applications include, for example [2]:
* Seismic simulations
* Computational biology
* Option risk calculations in finance
* Medical imaging
* Pattern recognition
* Signal processing
* Physical simulation
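As a small illustration of the data-parallel model these applications share, the sketch below (illustrative only, not part of the project code) applies the same arithmetic to every array element independently; on a GPU, each element would typically map to its own thread.

```python
# One arithmetic-heavy operation, applied independently per element.
def poly(x):
    return x * x + 2.0 * x + 1.0

data = [i / 1000.0 for i in range(1000)]

# Serial view: one element at a time.
serial = [poly(x) for x in data]

# Data-parallel view: every element is independent of the others,
# which is exactly what lets a GPU process them all concurrently.
parallel = list(map(poly, data))

assert serial == parallel
```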
Ackermann et al. [3] have developed a computational approach that allows massively parallel simulation of biological molecular networks by leveraging the computing power of modern graphics cards. They demonstrated that the parallelization on the GPU achieved a speedup of about a factor of 59 compared to a CPU implementation executed on a standard PC.
Davis et al. [4] have carried out water simulations on GPUs and compared the performance gained using a GPU versus the same simulation on a single CPU or multiple CPUs. According to their results, their GPU implementation performs ~7x faster than on a single CPU.
Another study on data normalization, by Rodríguez et al. [5], suggests that their GPU implementation of a quantile-based normalization method for high-density oligonucleotide array data, based on variance and bias, achieves a speed-up factor exceeding 7x over the counterpart methods implemented on CPUs.


= Research Description =
== Purpose ==
The purpose of this research project is to illustrate the performance gain of using GPUs for general-purpose computing compared to using CPUs.

== Problem ==
The problem I'll be working on to test the hardware is “cluster analysis of gene expressions”.
Cluster analysis or clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense [6].
A gene is a segment of DNA, which contains the formula for the chemical composition of one particular protein. The large majority of abundantly expressed genes are associated with common functions, such as metabolism, and hence are expressed in all cells. However, there will be differences between the expression profiles of different cells, and even in a single cell, expression will vary with time, in a manner dictated by external and internal signals that reflect the state of the organism and the cell itself [7].
A natural basis for organizing gene expression data is to group together genes with similar patterns of expression. For any series of measurements, a number of sensible measures of similarity in the behavior of two genes can be used [8]. This information can then be used by experts in the biological sciences to gather further knowledge in the area.
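Two common similarity measures of this kind are Euclidean distance between expression profiles and Pearson correlation between expression patterns. A minimal pure-Python sketch (the gene vectors are hypothetical, chosen only to illustrate the difference between the two measures):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Pearson correlation: similarity of expression *patterns*,
    insensitive to the absolute expression level."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

# Two hypothetical genes measured across four conditions:
gene1 = [1.0, 2.0, 3.0, 4.0]
gene2 = [2.0, 4.0, 6.0, 8.0]   # same pattern, double the level
print(euclidean(gene1, gene2))  # sqrt(30) ≈ 5.477
print(pearson(gene1, gene2))    # 1.0: perfectly correlated patterns
```

Note how the two genes are far apart in Euclidean terms yet perfectly correlated; which measure is "sensible" depends on the biological question being asked.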
This situation makes cluster analysis the best candidate for extracting the information out of gene expressions.
== Methodology ==
For testing purposes, I used three different clustering programs: a single-threaded C program and two programs that use the CUDA [9] and OpenCL [10] parallel programming APIs, respectively. For the single-threaded C program, I used the Cluster 3.0 software [11]; I implemented the CUDA and OpenCL versions myself.
The clustering algorithm used in this project is hierarchical clustering, with Euclidean distance [12] as the distance metric and single linkage [13] as the linkage method.
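The approach can be sketched as follows in pure Python (a naive illustration of the algorithm, not the project's CUDA/OpenCL code): clusters start as singletons and are repeatedly merged, where the distance between two clusters under single linkage is the minimum distance between any pair of their members.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(points, target_clusters):
    """Agglomerative hierarchical clustering: repeatedly merge the
    two clusters whose closest members are nearest (single linkage)."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: minimum pairwise member distance.
                d = min(euclidean(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters

# Two obvious groups of 2-D points (illustrative data only):
pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
print(single_linkage(pts, 2))
# → [[(0.0, 0.0), (0.1, 0.2)], [(5.0, 5.0), (5.1, 4.9)]]
```

The dominant cost is the pairwise distance computation, which is exactly the independent, arithmetic-heavy work that a GPU implementation parallelizes.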
The gene data is gathered from Gene Expression Omnibus Data Set Record 3345 [14]. From it, the following data sets with the given row count x column count are generated: 4096x16, 8192x16, 16384x16, 4096x32, 8192x32, 16384x32, 4096x64, 8192x64, 16384x64. Each of these sets is given as input to all three programs.
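To reproduce the shape of this benchmark sweep without the GEO download, one could substitute synthetic matrices of the same dimensions; the sketch below does exactly that (the pseudo-random values and their range are assumptions for illustration, not the real expression data):

```python
import random

# The nine row x column sizes listed above, in the same order.
SHAPES = [(rows, cols)
          for cols in (16, 32, 64)
          for rows in (4096, 8192, 16384)]

def synthetic_expression_matrix(rows, cols, seed=0):
    """Stand-in for the GEO data set: a rows x cols matrix of
    pseudo-random expression levels (illustrative only)."""
    rng = random.Random(seed)
    return [[rng.uniform(0.0, 15.0) for _ in range(cols)]
            for _ in range(rows)]

# Generate the smallest input as a demonstration; the full sweep
# would iterate over all nine shapes and feed each to every program.
m = synthetic_expression_matrix(*SHAPES[0])
assert len(m) == 4096 and len(m[0]) == 16
```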
== Evaluation ==
Evaluation of the work is based on performance metrics used in evaluation of processing units (CPUs and GPUs). Please see [[CUDA_Programming/BenchmarkingTools|Benchmarking Tools]] section of the wiki for more detailed info.
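The two core metrics, total execution time and speedup, can be sketched as a small timing harness (a generic Python sketch, not one of the project's actual benchmarking tools; the workloads below are hypothetical stand-ins for the CPU and GPU runs):

```python
import time

def measure(fn, *args, repeats=3):
    """Total execution time of fn(*args): best of several runs,
    to reduce the influence of system noise."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

def speedup(t_baseline, t_optimized):
    """Speedup: how many times faster the optimized version runs."""
    return t_baseline / t_optimized

def slow():  # stand-in for the baseline (e.g. single-core CPU) run
    return sum(i * i for i in range(200_000))

def fast():  # stand-in for the optimized (e.g. GPU) run
    return sum(i * i for i in range(20_000))

t_base = measure(slow)
t_opt = measure(fast)
print(f"speedup: {speedup(t_base, t_opt):.1f}x")
```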
= Results =
The results show that the program written using the CUDA API performed significantly better than the OpenCL and Cluster 3.0 versions: the speedup of CUDA over OpenCL was between 2x and 8x, and over Cluster 3.0 between 3x and 20x. It can be argued that the performance difference between CUDA and OpenCL comes from the fact that, on NVIDIA hardware, the OpenCL library is effectively a wrapper around the CUDA library.
= References =
* [0] http://en.wikipedia.org/wiki/Computational_science
* [1] Rauber T., Rünger G., “Exploiting Multiple Levels of Parallelism in Scientific Computing”. IFIP International Federation for Information Processing, 2005, Volume 172/2005, 3-19, DOI: 10.1007/0-387-24049-7_1
* [2] NVIDIA Tesla GPU Computing Technical Brief. Version 1.0.0, 5/24/2007
* [3] Ackermann, J., Baecher, P., Franzel T., Goesele, M., Hamacher, K., “Massively-Parallel Simulation of Biochemical Systems”
* [4] Davis, J., Ozsoy, A., Patel, S., Taufer, M., “Towards Large-Scale Molecular Dynamics Simulations on Graphics Processors”
* [5] Rodríguez, A., Trelles, O., Ujaldón, M., “Using Graphics Processors for a High Performance Normalization of Gene Expressions”
* [6] http://en.wikipedia.org/wiki/Cluster_analysis
* [7] Domany, Eytan. “Cluster Analysis of Gene Expression Data”
* [8] Eisen, M., Spellman, P., Brown, P., Botstein, D., “Cluster Analysis and Display of Genome-Wide Expression Patterns”. PNAS December 8, 1998 vol. 95 no. 25 14863-14868
* [9] http://www.nvidia.com/object/what_is_cuda_new.html
* [10] http://www.khronos.org/opencl/
* [11] http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm
* [12] http://en.wikipedia.org/wiki/Euclidean_distance
* [13] http://en.wikipedia.org/wiki/Single-linkage_clustering
* [14] http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS3345
* [15] http://developer.nvidia.com/object/visual-profiler.html
