Use of Scientific Libraries and GPU Acceleration
In this project, graduate students are assigned tasks in addition to the tasks expected from the undergraduate students. Here are two useful links on BLAS, from Intel and NVIDIA, to start with:

https://software.intel.com/en-us/node/520725
http://docs.nvidia.com/cuda/cublas/#axzz4cdzdjSPJ

2) Write programs implementing matrix multiplication C = AB, where A is m × n and B is n × k. Your program should take m, n, k as command line arguments (i.e. ./executable with m, n, and k supplied as arguments), and the multiplication is to be done in a few different ways. Create a separate function for each of the following operations and execute each function in one main program:

1. Create a CPU version of the naïve matrix multiplication similar to the one I presented in class.

2. Compute the inner products of rows of A with columns of B using the level-1 BLAS function ddot( ), which calculates the dot product of two arrays. READ THE NOTES ON CANVAS THAT INTRODUCE LEVEL-1 BLAS OPERATIONS. The first d in ddot( ) stands for double, which means that this operation is to be performed on arrays of doubles. Good link for a ddot( ) implementation: https://svn.nmap.org/nmap/liblinear/blas/ddot.c

3. The second method also uses a level-1 BLAS function for the matrix multiplication. In this case, you will use daxpy( ), which computes y = a*x + y, to form each column of C as a linear combination of columns of A. Once again, the d in daxpy( ) stands for double, so use double arrays.

4. Implement the same matrix multiplication problem using the dgemm routine, which is the most common function for matrix multiplication. Intel provides the following page to explain the usage: https://software.intel.com/en-us/mkl-tutorial-c-multiplying-matrices-using-dgemm
In this step, you should create a random number function to initialize your matrices with random integers ranging from 1 to 10.

5. Create a kernel that does the naive matrix multiplication for square matrices.
Calculate your grid and block sizes and execute as follows:

    dim3 block(16, 16);
    dim3 grid( (n+15)/16, (n+15)/16 );
    my_kernel<<<grid, block>>>(arguments);

6. CUDA also provides a GPU version of BLAS, which is cuBLAS. Repeat Tasks #2, 3, and 4 using cuBLAS.

• Demonstrate matrix multiplication for a small problem (e.g. 5x5). Print all elements of the matrices. This step is for verification.
• Time your code for square matrix sizes of N = 100, 500, 1000, 2000, and 5000 for both CPU and GPU using the ddot, daxpy, and dgemm routines, and present your results in a table and also in a plot. See the example above. Comment on your findings. Make sure to compile with optimization level -O3 during your testing.

HELPFUL TIPS:
• BLAS functions should look like ddot_( ) rather than ddot( ). Also, since you are making a .cu file and compiling with nvcc, your function prototype for BLAS functions should look like:
    extern "C" double ddot_(…arguments);
• It is sometimes useful to pass in the transpose of a matrix, rather than the original matrix. Remember that C stores arrays using row-major ordering, while the Fortran BLAS routines expect column-major storage.
• Don't forget to clean up when your code is done by using free( ) and cudaFree( ).
• Make sure to include all of the required headers and link the appropriate libraries in your Makefile.
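The three CPU formulations (naive loops, inner products, and columns of C built as linear combinations of columns of A) can be sketched as below. This is only an illustration under stated assumptions, not the expected solution: all matrices are assumed to be stored row-major in flat double arrays, and dot_ref( ) and axpy_ref( ) are hypothetical local stand-ins that mimic what ddot( ) and daxpy( ) compute, so that the sketch compiles without linking a BLAS library. The names matmul_naive, matmul_dot, matmul_axpy, and self_check are likewise made up for this example. The point to notice is how the strides select a row (stride 1) or a column (stride equal to the row length) of a row-major matrix.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in with ddot semantics: sum of x[i*incx] * y[i*incy]. */
static double dot_ref(size_t len, const double *x, size_t incx,
                      const double *y, size_t incy)
{
    double sum = 0.0;
    for (size_t i = 0; i < len; ++i)
        sum += x[i * incx] * y[i * incy];
    return sum;
}

/* Hypothetical stand-in with daxpy semantics: y = alpha * x + y. */
static void axpy_ref(size_t len, double alpha, const double *x, size_t incx,
                     double *y, size_t incy)
{
    for (size_t i = 0; i < len; ++i)
        y[i * incy] += alpha * x[i * incx];
}

/* Naive triple loop: C (m x k) = A (m x n) * B (n x k), all row-major. */
void matmul_naive(size_t m, size_t n, size_t k,
                  const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < k; ++j) {
            double sum = 0.0;
            for (size_t l = 0; l < n; ++l)
                sum += A[i * n + l] * B[l * k + j];
            C[i * k + j] = sum;
        }
}

/* Dot form: C(i,j) = (row i of A, stride 1) . (column j of B, stride k). */
void matmul_dot(size_t m, size_t n, size_t k,
                const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < k; ++j)
            C[i * k + j] = dot_ref(n, A + i * n, 1, B + j, k);
}

/* axpy form: column j of C = sum over l of B(l,j) * (column l of A). */
void matmul_axpy(size_t m, size_t n, size_t k,
                 const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < m * k; ++i)
        C[i] = 0.0;                      /* axpy accumulates, so start at zero */
    for (size_t j = 0; j < k; ++j)
        for (size_t l = 0; l < n; ++l)
            /* column l of A has stride n; column j of C has stride k */
            axpy_ref(m, B[l * k + j], A + l, n, C + j, k);
}

/* Verifies all three variants on a 2 x 3 times 3 x 2 product worked by hand. */
void self_check(void)
{
    const double A[6] = {1, 2, 3, 4, 5, 6};       /* 2 x 3 */
    const double B[6] = {7, 8, 9, 10, 11, 12};    /* 3 x 2 */
    const double expected[4] = {58, 64, 139, 154};
    double C1[4], C2[4], C3[4];

    matmul_naive(2, 3, 2, A, B, C1);
    matmul_dot(2, 3, 2, A, B, C2);
    matmul_axpy(2, 3, 2, A, B, C3);
    for (size_t i = 0; i < 4; ++i) {
        assert(C1[i] == expected[i]);
        assert(C2[i] == expected[i]);
        assert(C3[i] == expected[i]);
    }
}
```

When moving to the real library, the stand-ins would be replaced by calls to ddot_( ) and daxpy_( ). Note that the Fortran-style interface takes its length, stride, and scalar arguments by pointer, so the call sites will look different from the ones in this sketch.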