C/C++ - Use of Scientific Libraries and GPU Acceleration
In this project, graduate students are assigned tasks in addition to tasks expected from the
Here is a useful link on BLAS from Intel and NVIDIA to start with.
2) Write programs implementing matrix multiplication C = AB, where A is m × n and B is n × k.
Your program should take m, n, and k as command-line arguments (e.g. ./executable m n k)
and perform the multiplication in several different ways. Create a separate function for
each of the following operations and call each function from one main program:
1. Create a CPU version of the naïve matrix multiplication similar to the one I presented in
2. Compute the inner products of rows of A with columns of B using the level-1 BLAS
function ddot( ), which calculates the dot product of two arrays. READ THE NOTES
ON CANVAS THAT INTRODUCE LEVEL-1 BLAS OPERATIONS. The first d in
ddot( ) stands for double, which means that this operation is to be performed on arrays of
doubles. Good link for ddot( ) implementation:
3. The second method also uses a level-1 BLAS function for the matrix multiplication. In
this case, you will use daxpy( ) to form each column of C as a linear combination of
columns of A. Once again, the d in daxpy( ) stands for double, so use double arrays.
4. Implement the same matrix multiplication using the
level-3 BLAS routine dgemm, the standard BLAS function for
general matrix-matrix multiplication. Intel provides the
following page to explain its usage.
In this step, you should create a random number function to
initialize your matrices with random integers
ranging from 1 to 10.
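5. Create a CUDA kernel that does the naive matrix multiplication for square n × n matrices.
Use 16 × 16 thread blocks and round the grid size up so that every element of C is covered
(threads whose row or column index is n or larger must do nothing), then execute as follows: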
dim3 block(16, 16);
dim3 grid( (n+15)/16, (n+15)/16 );
6. CUDA also provides a GPU version of BLAS called cuBLAS. Repeat Tasks #2, 3, and 4
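
The two level-1 formulations in tasks 2 and 3 can be sketched in plain C as follows. This is a minimal sketch under stated assumptions, not the course's reference solution: `ref_ddot` and `ref_daxpy` are hypothetical stand-ins that mirror the usual BLAS argument order (n, x, incx, y, incy and n, alpha, x, incx, y, incy) so that the example builds without linking a BLAS library; in the actual assignment those calls would go to the library's ddot and daxpy. Row-major storage and positive strides are assumed, and the names `matmul_ddot` and `matmul_daxpy` are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for BLAS ddot: sum of x[i*incx] * y[i*incy]. */
static double ref_ddot(int n, const double *x, int incx,
                       const double *y, int incy)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += x[(size_t)i * incx] * y[(size_t)i * incy];
    return sum;
}

/* Hypothetical stand-in for BLAS daxpy: y <- alpha * x + y. */
static void ref_daxpy(int n, double alpha, const double *x, int incx,
                      double *y, int incy)
{
    for (int i = 0; i < n; ++i)
        y[(size_t)i * incy] += alpha * x[(size_t)i * incx];
}

/* C (m x k) = A (m x n) * B (n x k), all row-major.
   C[i][j] is the dot product of row i of A (stride 1)
   with column j of B (stride k). */
void matmul_ddot(int m, int n, int k,
                 const double *A, const double *B, double *C)
{
    assert(m > 0 && n > 0 && k > 0);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < k; ++j)
            C[(size_t)i * k + j] =
                ref_ddot(n, &A[(size_t)i * n], 1, &B[j], k);
}

/* Same product built column by column: column j of C is the sum over p
   of B[p][j] times column p of A. Columns of a row-major matrix are
   strided, so the strides n and k are passed as incx and incy. */
void matmul_daxpy(int m, int n, int k,
                  const double *A, const double *B, double *C)
{
    assert(m > 0 && n > 0 && k > 0);
    for (int j = 0; j < k; ++j) {
        for (int i = 0; i < m; ++i)
            C[(size_t)i * k + j] = 0.0;        /* clear column j of C */
        for (int p = 0; p < n; ++p)
            ref_daxpy(m, B[(size_t)p * k + j], /* alpha = B[p][j] */
                      &A[p], n,                /* column p of A */
                      &C[j], k);               /* column j of C */
    }
}
```

Both functions take the same arguments, so they can be checked against each other on a small case before timing. Note that the ddot version makes m·k library calls of length n, while the daxpy version makes n·k calls of length m.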
• Demonstrate matrix multiplication for a small problem (e.g.
5x5). Print all elements of the matrices. This step is for
• Time your code for square matrices of size N = 100, 500, 1000,
2000, and 5000 on both the CPU and the GPU using the ddot, daxpy,
and dgemm routines, and present your results in a table
and in a plot. See the example above. Comment on your
findings. Make sure to compile with optimization level -O3
during your testing.
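
The timing study needs two small utilities: an initializer that fills a matrix with random integers from 1 to 10, and a timer wrapped around each multiplication routine. A minimal sketch follows. The names `matmul_fn`, `matmul_naive`, `fill_random_1_to_10`, and `time_matmul` are hypothetical, and the naive routine is included only so the example is self-contained. `clock()` reports CPU time of the host process, which is an assumption that suits the CPU versions; for GPU code one would more likely synchronize the device before reading the clock, or use CUDA events.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Common signature for every multiplication routine under test. */
typedef void (*matmul_fn)(int m, int n, int k,
                          const double *A, const double *B, double *C);

/* Naive triple loop: C (m x k) = A (m x n) * B (n x k), row-major. */
void matmul_naive(int m, int n, int k,
                  const double *A, const double *B, double *C)
{
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < k; ++j) {
            double sum = 0.0;
            for (int p = 0; p < n; ++p)
                sum += A[(size_t)i * n + p] * B[(size_t)p * k + j];
            C[(size_t)i * k + j] = sum;
        }
}

/* Fill a[0..count-1] with random integers in [1, 10], stored as doubles.
   Seeding with srand() is left to the caller. */
void fill_random_1_to_10(double *a, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        a[i] = (double)(rand() % 10 + 1);
}

/* Time one call of f on N x N matrices.
   Returns CPU seconds, or -1.0 if allocation fails. */
double time_matmul(matmul_fn f, int N)
{
    assert(N > 0);
    size_t count = (size_t)N * (size_t)N;
    double *A = malloc(count * sizeof *A);
    double *B = malloc(count * sizeof *B);
    double *C = malloc(count * sizeof *C);
    if (!A || !B || !C) {
        free(A); free(B); free(C);
        return -1.0;
    }
    fill_random_1_to_10(A, count);
    fill_random_1_to_10(B, count);

    clock_t t0 = clock();
    f(N, N, N, A, B, C);
    clock_t t1 = clock();

    free(A); free(B); free(C);
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Calling `time_matmul(matmul_naive, 1000)` returns the elapsed seconds for one 1000 × 1000 product. Any routine with the same signature can be passed in, so one loop over the sizes 100, 500, 1000, 2000, and 5000 can produce each row of the results table.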
HELPFUL TIPS:
• BLAS functions should look like ddot_( ) rather than ddot( ). Also, since you are making a
.cu file and compiling with nvcc, your function prototype for BLAS functions should look like
extern "C" double ddot_(…arguments);
• It is sometimes useful to pass in the transpose of a matrix, rather than the original matrix.
Remember that C stores arrays using row-major ordering, whereas BLAS routines assume column-major (Fortran) ordering.
• Don’t forget to clean up when your code is done by using free( ) and cudaFree( ).
• Make sure to include all of the required headers and link the appropriate libraries in your
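
The transpose tip can be made concrete. BLAS routines follow the Fortran convention and treat matrices as column-major, while C arrays are row-major, so a row-major m × k matrix C occupies the same memory as the column-major k × m matrix Cᵀ. Because Cᵀ = BᵀAᵀ, passing B first and A second to a column-major multiply yields the row-major C with no explicit transposition. The sketch below checks this idea with `ref_gemm_colmajor`, a hypothetical stand-in for a non-transposed column-major dgemm call with alpha = 1 and beta = 0, so that it builds without a BLAS library; the wrapper name is likewise illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a column-major GEMM:
   Z (rows x cols) = X (rows x inner) * Y (inner x cols),
   with leading dimensions ldx, ldy, ldz as in BLAS. */
static void ref_gemm_colmajor(int rows, int cols, int inner,
                              const double *X, int ldx,
                              const double *Y, int ldy,
                              double *Z, int ldz)
{
    for (int j = 0; j < cols; ++j)
        for (int i = 0; i < rows; ++i) {
            double sum = 0.0;
            for (int p = 0; p < inner; ++p)
                sum += X[i + (size_t)p * ldx] * Y[p + (size_t)j * ldy];
            Z[i + (size_t)j * ldz] = sum;
        }
}

/* Row-major C (m x k) = A (m x n) * B (n x k), computed by a
   column-major routine as C^T (k x m) = B^T (k x n) * A^T (n x m).
   The row-major buffers are passed unchanged; only the operand order
   and the dimensions are swapped. */
void matmul_rowmajor_via_colmajor(int m, int n, int k,
                                  const double *A, const double *B,
                                  double *C)
{
    assert(m > 0 && n > 0 && k > 0);
    ref_gemm_colmajor(k, m, n,   /* rows, cols, inner of C^T */
                      B, k,      /* B^T has leading dimension k */
                      A, n,      /* A^T has leading dimension n */
                      C, k);     /* C^T has leading dimension k */
}
```

With a Fortran-style dgemm, the same swap would typically mean passing the dimensions in the order k, m, n and the operands in the order B, A, with leading dimensions k, n, and k. The exact prototype depends on the BLAS library in use, so check it against the library's documentation.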