matrix multiplication with large dimensions

Hi, everyone! I’m just a CUDA novice.

I’m reading the book “Programming Massively Parallel Processors, A Hands-on Approach”.

I followed the example of matrix multiplication with multiple thread blocks in Chapter 4, and made minor modifications for the exercises. I found that it works for smaller matrices, say 1280x1280, but it doesn’t work for larger matrices, say 2400x2400. Could anyone help me figure out the problem? Thank you in advance!


xwu@xth3:~> uname -a
Linux xth3 #1 SMP PREEMPT 2010-12-13 11:13:53 +0100 x86_64 x86_64 x86_64 GNU/Linux

xwu@xth3:~> /sbin/lspci | grep VGA
02:00.0 VGA compatible controller: nVidia Corporation G96 [Quadro FX 580] (rev a1)

xwu@xth3:~> nvcc

xwu@xth3:~> ./a.out 1280
      1280 serial      20.73 sec
cuda       4.54 sec

xwu@xth3:~> ./a.out 2400
      2400 serial     150.90 sec
cuda       9.63 sec
         0	p    5.99841	gp    0.00000
a.out: int main(int, char**): Assertion `p[i]==gp[i]’ failed.



#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <math.h>
#include <time.h>
#include <cuda.h>

#define TILE_WIDTH 16

// serial implementation
void MatMul(float *m, float *n, float *p, int width) {
  int i, j, k;
  float a, b, sum;

  for (i=0; i<width; i++) {
    for (j=0; j<width; j++) {
      sum = 0.0;
      for (k=0; k<width; k++) {
        a = m[i*width+k];
        b = n[k*width+j];
        sum += a*b;
      }
      p[i*width+j] = sum;
    }
  }
}

// cuda gpu implementation kernel
__global__ void gMatMulKernel (float *m_d, float *n_d, float *gp_d, int width) {
  int k, row, col;
  float sum;

  row = blockIdx.y*TILE_WIDTH + threadIdx.y;
  col = blockIdx.x*TILE_WIDTH + threadIdx.x;

  sum = 0.0;
  for (k=0; k<width; k++)
    sum += m_d[row*width+k] * n_d[k*width+col];
  gp_d[row*width+col] = sum;
}

// cuda gpu implementation
void gMatMul(float *m, float *n, float *gp, int width) {
  int size;
  float *m_d, *n_d, *gp_d;
  dim3 dimGrid(width/TILE_WIDTH, width/TILE_WIDTH, 1);
  dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);

  size = width*width*sizeof(float);

  cudaMalloc((void **) &m_d, size);
  cudaMemcpy(m_d, m, size, cudaMemcpyHostToDevice);
  cudaMalloc((void **) &n_d, size);
  cudaMemcpy(n_d, n, size, cudaMemcpyHostToDevice);
  cudaMalloc((void **) &gp_d, size);

  gMatMulKernel<<<dimGrid, dimBlock>>>(m_d, n_d, gp_d, width);

  cudaMemcpy(gp, gp_d, size, cudaMemcpyDeviceToHost);

  cudaFree(m_d);
  cudaFree(n_d);
  cudaFree(gp_d);
}

int main (int argc, char *argv[]) {
  int i, width, size;
  float *m, *n, *p, *gp;
  clock_t t1, t2;

  sscanf(argv[1], "%d", &width);
  if (width%TILE_WIDTH != 0) return 0;
  size = width*width;

  m = (float *) malloc(size*sizeof(float));
  n = (float *) malloc(size*sizeof(float));
  p = (float *) malloc(size*sizeof(float));
  gp = (float *) malloc(size*sizeof(float));

  srand (time(NULL));
  for (i=0; i<size; i++) {
    m[i] = (float) rand() / (float) RAND_MAX;
    m[i] = m[i]/10.0;
    n[i] = (float) rand() / (float) RAND_MAX;
    n[i] = n[i]/10.0;
  }

  t1 = clock();
  MatMul(m, n, p, width);
  t2 = clock();
  printf("%10d serial %10.2lf sec\n", width, (t2-t1)/(double) CLOCKS_PER_SEC);

  t1 = clock();
  gMatMul(m, n, gp, width);
  t2 = clock();
  printf("cuda %10.2lf sec\n", (t2-t1)/(double) CLOCKS_PER_SEC);

  for (i=0; i<size; i++) {
    if (fabs(p[i]-gp[i])>0.001) {
      printf("%10d\tp %10.5f\tgp %10.5f\n", i, p[i], gp[i]);
    }
    assert(p[i]==gp[i]);
  }

  free(m);
  free(n);
  free(p);
  free(gp);

  return 0;
}

Have you checked how large the difference between the CPU and GPU solutions actually is? I’m quite surprised that the results for the smaller matrices are bit-for-bit identical, given the different rounding on the CPU and GPU. This might be related to the value of RAND_MAX on your CPU.

The large case is clearly not running to completion; there should be about a factor of 8 difference in solution time between the small and large cases (the work grows as the cube of the matrix dimension). The fact that the large case only takes a bit more than 9 seconds suggests that the kernel is being killed by the watchdog timer in the display driver. You are also running X11 on this card, I assume? If so, you are limited to roughly an 8-second kernel execution time by the display driver. If you add some error checking code to the kernel launch, you should see an error reported for the big case.
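A minimal sketch of what that error checking might look like, wrapped around the kernel launch from the code above (the exact error string varies between toolkit versions):

```cuda
// Hedged sketch: check for errors around the kernel call.  A watchdog
// kill only surfaces once the host synchronizes with the device.
gMatMulKernel<<<dimGrid, dimBlock>>>(m_d, n_d, gp_d, width);

cudaError_t err = cudaGetLastError();      // launch/configuration errors
if (err == cudaSuccess)
  err = cudaThreadSynchronize();           // errors during execution
                                           // (cudaDeviceSynchronize in newer toolkits)
if (err != cudaSuccess)
  fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
```

For a timed-out kernel this should report something like "the launch timed out and was terminated", rather than silently copying back an incomplete result.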

Yes, I have checked the difference. Even for smaller matrices, e.g. 16x16, they are not identical. I was actually not using

assert(p[i]==gp[i]);

Instead, I was using a tolerance on the difference:

if (fabs(p[i]-gp[i])>0.0001)

According to my observations, the differences can sometimes be large, e.g. more than 0.0001. Can I trust the GPU results?

I’m using single precision; I don’t have a device that supports double precision. If I switched to a device with double precision, would I get more accurate results?


Yes! Thank you, avidday!

I’m running KDE. How can I extend the limit of the kernel execution time?

Don’t run KDE (i.e. just work from the console), or use a dedicated compute card. The limit itself can’t be disabled on a card running a display. If you really need to multiply large matrices, use an optimized BLAS, like CUBLAS or MAGMA BLAS; both are close to an order of magnitude faster than the code you are using now, and you will get much larger problems solved within the display driver’s time limit.
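As a rough sketch of the CUBLAS route (using the legacy cublasSgemm interface; the wrapper name is hypothetical, and cublasInit() is assumed to have been called already): CUBLAS stores matrices column-major, and a row-major P = M x N viewed column-major is its transpose, which equals (col-major N) x (col-major M), so the operands are passed in swapped order.

```cuda
#include <cublas.h>

// Hypothetical drop-in replacement for the hand-written kernel launch,
// operating on device pointers that were allocated with cudaMalloc.
void gMatMulCublas(float *m_d, float *n_d, float *gp_d, int width) {
  cublasSgemm('n', 'n', width, width, width,
              1.0f, n_d, width,    // N passed first (see note above)
                    m_d, width,
              0.0f, gp_d, width);
}
```

Because SGEMM is tiled and heavily optimized, the same 2400x2400 product should finish well inside the watchdog limit.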

Thanks, avidday! Another concern is the discrepancy between the CPU and GPU results indicated in my previous posts. How can I resolve that?

Are you sure it really is a “problem”? What are the maximum absolute and relative error values you are getting?

IEEE 754 single precision only gives about 6 decimal digits of accuracy, so that is the absolute upper limit of agreement. You might also want to revisit the assumption that, when there is disagreement, the CPU must be right and the GPU wrong. The GPU uses fused multiply-add operations, which don’t suffer from intermediate truncation or rounding, whereas the CPU code probably doesn’t. That can make the GPU “more accurate” than the CPU. The only way to tell for sure is to compute a reference solution in higher precision (so double, in this case) on the CPU, and compare both the CPU and GPU single precision results to the double precision result. That will give a better indication of the relative accuracy of the two solutions you are comparing.