Exam Number
Y3901924
HIGH-PERFORMANCE PARALLEL
AND DISTRIBUTED SYSTEMS
COM00174M



Table of Contents
Parallelization Approach
    OpenMP
    MPI
    CUDA
Validations
Experimental Setup
    OpenMP
    MPI
    CUDA
Performance Evaluation
    OpenMP
    MPI
    CUDA
Comparative Analysis
References





Parallelization Approach
OpenMP
The diagram in [1] shows the architecture of OpenMP and the flow of control that occurs when OpenMP is used to parallelize an application. I first identified the regions of the code that could be parallelized: the methods update_fields(), apply_boundary() and resolve_to_grid() contain for loops that can be optimized with OpenMP. OpenMP is driven mainly by directives and clauses [2].
In update_fields(), an omp parallel for directive with the collapse clause is applied to each loop nest. The collapse clause was chosen because it parallelizes multiple loops in a nest without introducing nested parallelism. The loop variables i and j are declared private, meaning each thread gets its own copy, which is not initialised on entry to the parallel region. A sketch of this pattern is shown below.
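The following is a minimal sketch of this pattern; the array names, loop bounds and update formula are illustrative assumptions, not the exact code from maxwell.c:

    #include <omp.h>

    /* Sketch only: nx, ny, the field arrays and the update expression are
       placeholders standing in for the real update_fields() loops. */
    void update_bz(int nx, int ny, double dt, double dx, double dy,
                   double **Ex, double **Ey, double **Bz)
    {
        int i, j;
        #pragma omp parallel for collapse(2) private(i, j)
        for (i = 0; i < nx; i++) {
            for (j = 0; j < ny; j++) {
                /* every (i, j) cell update is independent, so both loops can
                   be collapsed into one parallel iteration space */
                Bz[i][j] += (dt / dy) * (Ex[i][j + 1] - Ex[i][j])
                          - (dt / dx) * (Ey[i + 1][j] - Ey[i][j]);
            }
        }
    }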
In apply_boundary(), omp parallel for with the private clause is used for the same reason: the variable i is declared private in one loop and j in the other.
In resolve_to_grid(), two local variables, local_E_mag and local_B_mag, are introduced and used in a reduction; here the parallel for, collapse, private and reduction clauses are combined. Reduction clauses allow some forms of recurrence calculation to be performed in parallel [3]. A sketch of the reduction follows.
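A minimal sketch of the reduction pattern, again with illustrative names rather than the exact resolve_to_grid() code:

    #include <math.h>

    /* Sketch only: each thread accumulates into its own private copy of
       local_E_mag / local_B_mag, and OpenMP combines the copies at the end
       of the parallel region. */
    void resolve_magnitudes(int nx, int ny, double **Ex, double **Ey, double **Bz,
                            double *E_mag, double *B_mag)
    {
        double local_E_mag = 0.0, local_B_mag = 0.0;
        int i, j;
        #pragma omp parallel for collapse(2) private(i, j) reduction(+: local_E_mag, local_B_mag)
        for (i = 0; i < nx; i++) {
            for (j = 0; j < ny; j++) {
                local_E_mag += sqrt(Ex[i][j] * Ex[i][j] + Ey[i][j] * Ey[i][j]);
                local_B_mag += fabs(Bz[i][j]);
            }
        }
        *E_mag = local_E_mag;
        *B_mag = local_B_mag;
    }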
The number of threads is set with omp_set_num_threads(). The application was run many times, and the best performance was observed with 8 threads; the comparison is shown later in the report.
MPI
To identify a process we need a process ID and a routine that lets a process discover its own ID. MPI assigns an integer to each process, beginning with 0 for the parent process and incrementing each time a new process is created; this process ID is also called the process's "rank" [4].
The main loop iterations are divided among the processes. The number of iterations is divided by the number of processes, and each process then computes its own starting and ending indices: the start is the rank multiplied by the number of iterations per process, and the end is the start plus that count. Each process then runs only its own share of the loop, as sketched below.
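A minimal sketch of this decomposition, assuming a placeholder iteration count and leaving the per-iteration work abstract:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's ID (rank)  */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

        int total_iterations = 2000;            /* placeholder problem size   */
        int steps_per_proc   = total_iterations / size;
        int start = rank * steps_per_proc;      /* first index this rank owns */
        int end   = (rank == size - 1)          /* last rank takes remainder  */
                  ? total_iterations
                  : start + steps_per_proc;

        for (int i = start; i < end; i++) {
            /* each process runs only its own share of the loop */
        }

        MPI_Finalize();
        return 0;
    }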
CUDA
In CUDA, data must be transferred between the memory spaces of the host and the device before work can be performed. CUDA allows the programmer to manipulate GPU hardware and software resources directly from the kernel, such as threads, blocks of threads, grids of blocks, memory and cache allocation, thread registers and asynchronous thread execution.


To apply CUDA efficiently, the first step is to assess the program and find the parts of the code responsible for the bulk of the execution time [5]. The code in maxwell.c and setup.c performs many different calculations to set up, allocate and compute the solution to the problem; running this code as kernels should significantly increase performance, since GPUs are usually faster than CPUs for such workloads [6].
CUDA works by copying data from the host to the device, so device pointers were created first. The __global__ declaration specifier is placed before the name of each function that is to run on the GPU; it tells the compiler that the function should be compiled for the device (GPU) rather than the host (CPU) [7]. The method resolve_to_grid() was not run as a kernel because it is mostly used for debugging and has little impact on execution time. The function update_fields() was broken down into three separate kernels, each of which computes the thread and block indices it is responsible for. Host code was written to invoke the kernels, specifying the number of blocks and the grid each kernel should run with; the same approach was followed for problem_set_up() and apply_boundary(). The grid size was decided by dividing the array size by the block size. To choose the block size, a test was carried out with the values 2, 4, 6, 8, 12, 16 and 32, using NVPROF and cudaEvent_t to measure performance and time. On my system the code performed best with a block size of 16, while for apply_boundary() a block size of 1024 performed best. The comparisons are shown in later sections of the report. A sketch of the kernel and launch pattern follows.
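A minimal CUDA sketch of this pattern, with a placeholder one-dimensional kernel, array name and block size rather than the kernels actually submitted:

    #include <cuda_runtime.h>

    #define BLOCK_SIZE 16                                 /* threads per block (placeholder) */

    __global__ void update_kernel(double *d_field, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index   */
        if (idx < n)                                      /* guard against overrun */
            d_field[idx] *= 0.5;                          /* placeholder update    */
    }

    void run_update(double *h_field, int n)
    {
        double *d_field;
        cudaMalloc(&d_field, n * sizeof(double));
        cudaMemcpy(d_field, h_field, n * sizeof(double), cudaMemcpyHostToDevice);

        /* grid size = array size divided by block size, rounded up */
        int grid = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
        update_kernel<<<grid, BLOCK_SIZE>>>(d_field, n);

        cudaMemcpy(h_field, d_field, n * sizeof(double), cudaMemcpyDeviceToHost);
        cudaFree(d_field);
    }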
Validations

A key aspect of verifying the correctness of any changes to an existing program is to create a mechanism that compares the new results to a known-good reference. Exactly identical results are sometimes not possible, especially where floating-point arithmetic is concerned [5], so the application can be considered valid if the differences are small.
To validate the program, the result data was stored in files and a separate validator program was written that computes the difference between the resulting E_mag and B_mag values and prints a warning whenever the values differ. Using this program, no difference was observed for the OpenMP implementation; the MPI implementation showed only a slight difference, due to floating-point arithmetic; and no difference was observed for the CUDA implementation. Because the recorded MPI data was not in sorted order, it was first sorted in a spreadsheet, copied back to the file and then run through the validation code. A sketch of such a validator is shown below.
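A minimal sketch of such a validator, assuming one value per line in both files and a tolerance chosen purely for illustration:

    #include <math.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        if (argc < 3) {
            fprintf(stderr, "usage: %s reference_file result_file\n", argv[0]);
            return 1;
        }
        FILE *ref = fopen(argv[1], "r");    /* known-good single-threaded output */
        FILE *out = fopen(argv[2], "r");    /* output of the parallel version    */
        if (!ref || !out) {
            fprintf(stderr, "cannot open input files\n");
            return 1;
        }

        double a, b, tol = 1e-9;
        long line = 0;
        while (fscanf(ref, "%lf", &a) == 1 && fscanf(out, "%lf", &b) == 1) {
            line++;
            if (fabs(a - b) > tol)                      /* warn on any difference */
                printf("warning: value %ld differs (%g vs %g)\n", line, a, b);
        }

        fclose(ref);
        fclose(out);
        return 0;
    }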
After this, all of the implementations were run for a problem size of x = 100 and y = 100 and .vtk files were generated. The generated files were loaded into the VisIt software for visualization, and the resulting images were compared against the single-threaded implementation.



Experimental Setup
The program was run on the University of York Viking research computing cluster. The following resources were allocated to each implementation:

OpenMP: 8 GB RAM, 10 CPUs; modules: GCC 11.2.0
MPI: 16 GB RAM, 40 CPUs; modules: OpenMPI 4.1.1-GCC-11.2.0
CUDA: 8 GB RAM, 10 CPUs, 1x NVIDIA Tesla V100 GPU; modules: CUDA 11.0.2-GCC-9.3.0
OpenMP
The execution time of the code was recorded using the omp_get_wtime() routine, which is part of the OpenMP API and returns the elapsed wall-clock time in seconds [8]. Different numbers of threads were tested, and the best-performing thread count was used when comparing OpenMP to the other options. The measurements were recorded in a spreadsheet and used for the analysis. A sketch of the timing code follows.
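A minimal sketch of the timing, where run_simulation() is a placeholder for the solver's main loop:

    #include <omp.h>
    #include <stdio.h>

    extern void run_simulation(void);       /* placeholder for the solver */

    int main(void)
    {
        double start = omp_get_wtime();     /* wall-clock time in seconds */
        run_simulation();
        double elapsed = omp_get_wtime() - start;
        printf("elapsed: %f s\n", elapsed);
        return 0;
    }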
MPI
For MPI, the execution time was recorded using MPI_Wtime(), which is provided by OpenMPI and returns the elapsed time on the calling processor [9]. The measurements were recorded in a spreadsheet and used for the analysis; a sketch of the timing pattern follows.
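A minimal sketch of the timing pattern (also used in the evaluation below), assuming MPI has already been initialised, rank holds the result of MPI_Comm_rank(), and run_simulation() stands in for the rank-local share of the work:

    /* MPI_Barrier() synchronises the ranks so MPI_Wtime() brackets the same
       region on every process. */
    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();

    run_simulation();                       /* placeholder for the solver */

    MPI_Barrier(MPI_COMM_WORLD);
    double elapsed = MPI_Wtime() - start;
    if (rank == 0)
        printf("elapsed: %f s\n", elapsed);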
CUDA
To record the code execution time accurately, CUDA event timers and nvprof were used. The profiling tool nvprof provided useful insight into the performance of every kernel. The performance for different numbers of blocks and grid sizes was noted, and the best-performing configuration was selected for the final test. The measurements were recorded in a spreadsheet and used for the analysis. A sketch of the event-based timing follows.
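A minimal sketch of the event-based timing, reusing the placeholder names from the earlier kernel sketch:

    /* cudaEvent_t timers bracket the kernel; cudaEventElapsedTime() reports
       the elapsed time in milliseconds. */
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    update_kernel<<<grid, BLOCK_SIZE>>>(d_field, n);   /* kernel under test */
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                        /* wait for the kernel to finish */

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);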


Performance Evaluation
OpenMP

For the performance evaluation, the best configuration first had to be decided, in this case the best number of threads. The code was run 10 times with different numbers of threads for the problem size x = 2000 and y = 2000, and the average was calculated, with omp_get_wtime() used to record the execution time. The time taken for each thread count was then plotted in the graph shown above; on the test machine the best performance was observed with 8 threads.

After deciding on a suitable number of threads, the code was run 10 times for the grid sizes 300x300, 1000x1000, 2000x2000, 4000x4000 and 5000x5000. The average time was then calculated and plotted against the data for the single-threaded implementation of the code.

Using the observed data, the percentage reduction in execution time of the OpenMP implementation relative to the single-threaded implementation was calculated, and the graph above was plotted from this data.


MPI
To find the optimal configuration for MPI, we need the number of processes that gives the best result and is supported by the system. The Viking computing cluster is an extremely powerful system with cutting-edge specifications, so the tests could be run with far more resources than would be available on a typical user's machine. The MPI_Wtime() function, included with OpenMPI, returns the elapsed time on the calling processor and was used to measure the execution time; MPI_Barrier() was used alongside it to ensure the timer gives a correct result.
The application was run for the problem size x = 2000 and y = 2000 with the number of processes varied from 2 to 40. The time taken for each variant was noted in a spreadsheet, and the best-performing process count was used for further testing with different grid sizes. As the graph above shows, the best-performing number of processes was 40.
With the best configuration set, the application was run 10 times for the problem grid sizes 300x300, 1000x1000, 2000x2000, 3000x3000, 4000x4000 and 5000x5000. The average time was then calculated for each grid size and plotted against the data for the single-threaded implementation of the code.

From all the data, the graph shown above was plotted; it shows the percentage reduction in execution time of this implementation compared to the single-threaded one and illustrates the scaling behaviour of the implementation.





CUDA

First, the goal was to find the block and grid configuration that would give the best result on the current system. The code was run 10 times with different block sizes (the grid size in the current implementation depends on the block size) for the problem size x = 4000 and y = 4000. The block sizes tested were 2, 4, 6, 8, 12, 16 and 32, and the average was noted for each. As the graph above shows, the code performed best with the block size set to 16.

After setting the best-performing configuration, the code was run 10 times for each of the problem grid sizes 300x300, 1000x1000, 2000x2000, 4000x4000 and 5000x5000, with the execution time recorded using cudaEvent_t. The average time was then calculated, noted down and plotted against the data for the single-threaded implementation of the code in the graph shown above.

Using the observed data, the percentage reduction in execution time of the CUDA implementation relative to the single-threaded implementation was calculated, and the graph above was plotted from this data.






Comparative Analysis

Code execution time (sec) in the test system

Problem Size    Single Threaded    OpenMP        MPI          CUDA
300x300         0.1                0.068729      0.020669     0.00386822
1000x1000       2.23               0.927611      0.283728     0.111949
2000x2000       26.54              6.7566575     2.853362     0.839262
3000x3000       97.44              22.969056     9.13889      3.88811
4000x4000       234.97             54.664113     15.636817    8.38714

With all the data collected we can now compare and analyse all four implementations of the code. From the graph and table above, a few observations can be made:
- CUDA is the fastest implementation, followed by MPI and then OpenMP.
- All of the implementations follow a trend of performing relatively better as the problem grid size increases.
From the graph we can see that CUDA provides the best optimization for this problem, because GPUs are usually much faster than CPUs for this type of workload [1]. As the problem grid grows, the optimizations perform better and better. Looking at the code used for optimization, OpenMP is the easiest to implement and still shows a decent amount of improvement. It is based on multithreading over shared memory, where a number of cores (or CPUs) work on a common, shared physical address space, whereas MPI is based on distributed memory. Implementing MPI is considerably more complex than OpenMP, but it delivers a performance difference that justifies the effort.
Time saved by OpenMP, MPI and CUDA compared to the single-threaded implementation

Problem Size    OpenMP     MPI        CUDA
300x300         31.27%     79.33%     96.13%
1000x1000       58.40%     87.27%     94.98%
2000x2000       74.54%     89.25%     96.84%
3000x3000       76.43%     90.62%     96.01%
4000x4000       76.74%     93.34%     96.43%


The largest difference appears when the CUDA implementation is compared against OpenMP and MPI. Running the code as kernels on the GPU provides a huge performance improvement over every other implementation for this task. In the other implementations the performance gain at smaller problem sizes is much lower than with CUDA, whereas CUDA delivers a large improvement regardless of problem size.





References

[1] F. K. Nawaz, A. Chattopadhyay, K. G J, G. D. Mane and R. N. Savanth, "Comparison of
Open MP and CUDA," International Journal of Computer Sciences and Engineering,
Vellore, 2014.
[2] OpenMP, "OpenMP Application Programming Interface," November 2018. [Online].
Available: https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf.
[Accessed 15 April 2022].
[3] The OpenMP Architecture Review Board, "OPENMP API Specification: Version 5.0,"
November 2018. [Online]. Available: https://www.openmp.org/spec-
html/5.0/openmpsu107.html. [Accessed 10 April 2022].
[4] D. Thomasset and M. Grobe, "An introduction to the Message Passing Interface (MPI) using
C," [Online]. Available: http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml. [Accessed
15 April 2022].
[5] Nvidia, "CUDA C++ Best Practices Guide," March 2022. [Online]. Available:
https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html. [Accessed 15 April
2022].
[6] A. Syberfeldt and T. Ekblom, "A Comparative Evaluation of the GPU vs The CPU for
Parallelization of Evolutionary Algorithms Through Multiple Independent Runs," International
Journal of Information Technology and Computer Science, vol. IX, no. 3, pp. 1-3, 2017.
[7] J. Sanders and E. Kandrot, "CUDA by Example: An Introduction to General-Purpose GPU Programming," Pearson Education, Inc., Boston, 2011.
[8] The OpenMP Architecture Review Board, " OPENMP API Specification: Version 5.0,"
November 2018. [Online]. Available: https://www.openmp.org/spec-
html/5.0/openmpsu160.html. [Accessed 15 April 2022].
[9] The Open MPI Development Team, "MPI_Wtime(3) man page (version 3.0.6)," 20 March
2020. [Online]. Available: https://www.open-mpi.org/doc/v3.0/man3/MPI_Wtime.3.php.
[Accessed 15 April 2022].

