Yes, it's the same as the Stack Overflow one; the other one was missing one __syncthreads().

I basically wanted to do non-square matrix multiplication, so I checked the CUDA SDK example and saw that it uses dimensions that are multiples of the block size. I took that approach and merged it with the kernel from the Programming Massively Parallel Processors book. Most of the results (in terms of throughput and the resulting matrix) are fine, aside from those special cases of non-square matrices.
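For sizes that are not multiples of the block size, a common approach (a sketch only, not the SDK's code) is to round the grid size up instead of truncating it. The helper names `ceilDiv` and `uncovered` are hypothetical; the second one just shows how many elements along one dimension a truncated grid would never touch.

```cpp
#include <cassert>

// Hypothetical helper: number of blocks needed to cover n elements
// when each block handles `block` of them (rounds up).
inline int ceilDiv(int n, int block) {
    assert(n >= 0 && block > 0);
    return (n + block - 1) / block;
}

// Hypothetical helper: elements along one dimension that are left
// without a thread when the grid is sized with plain n / block.
inline int uncovered(int n, int block) {
    assert(n >= 0 && block > 0);
    return n - (n / block) * block;
}
```

A grid built this way, for example `dim3 dimGrid(ceilDiv(WN, blocksize), ceilDiv(HM, blocksize));`, launches threads past the matrix edges. The kernel would then need bounds checks on both the shared-memory tile loads and the final store, which I am assuming here rather than showing.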

Since I don't have the privileges to compile programs in the CUDA SDK directory, I will copy the SDK kernel and test it with my host code. Although I will be working with 3D matrices, I have to make sure I understand all the subtleties of 2D multiplication.

P.S.: I had a good laugh because you connected the two posts, even though I use my nickname here and my real name on Stack Overflow ^^

EDIT: I tested the SDK example and it doesn't give me the cuda-memcheck error. Nevertheless, I keep getting the dead pixels problem with the same examples I posted here. Given that one of them (3200x3200) is not even close to filling the GPU memory, there is something happening outside the kernel. This is my host code; nothing in it seems out of the ordinary:

```
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// HM, WM, HN, WN, HP, WP, blocksize, the nonsquare kernel and printMat
// are defined elsewhere in the program (not shown in this post).

int main(){
    cudaEvent_t evstart, evstop;
    cudaEventCreate(&evstart);
    cudaEventCreate(&evstop);

    // Host matrices; M and N are filled with their own linear index.
    float *M=(float*)malloc(sizeof(float)*HM*WM);
    float *N=(float*)malloc(sizeof(float)*HN*WN);
    float *P=(float*)malloc(sizeof(float)*HP*WP);
    for(int i=0;i<WM*HM;i++)
        M[i]=(float)i;
    for(int i=0;i<WN*HN;i++)
        N[i]=(float)i;

    // Device copies of the three matrices.
    float *Md,*Nd,*Pd;
    cudaMalloc((void**)&Md,HM*WM*sizeof(float));
    cudaMalloc((void**)&Nd,HN*WN*sizeof(float));
    cudaMalloc((void**)&Pd,HP*WP*sizeof(float));
    cudaMemcpy(Md,M,HM*WM*sizeof(float),cudaMemcpyHostToDevice);
    cudaMemcpy(Nd,N,HN*WN*sizeof(float),cudaMemcpyHostToDevice);

    // One thread per element of P. The integer divisions assume that
    // WN and HM are multiples of blocksize.
    dim3 dimBlock(blocksize,blocksize);
    dim3 dimGrid(WN/dimBlock.x,HM/dimBlock.y);

    cudaEventRecord(evstart,0);
    nonsquare<<<dimGrid,dimBlock>>>(Pd,Md,Nd,WM,WN);
    cudaEventRecord(evstop,0);
    cudaEventSynchronize(evstop);

    // Report any launch or execution error instead of reading back garbage.
    cudaError_t err=cudaGetLastError();
    if(err!=cudaSuccess)
        printf("Kernel error: %s\n",cudaGetErrorString(err));

    float time;
    cudaEventElapsedTime(&time,evstart,evstop);
    cudaMemcpy(P,Pd,WP*HP*sizeof(float),cudaMemcpyDeviceToHost);

    // Device memory usage in MB; size_t values are printed with %zu.
    size_t freem, totalm;
    cudaMemGetInfo(&freem,&totalm);
    float free_m =freem/1048576.0f;
    float total_m=totalm/1048576.0f;
    float used_m=total_m-free_m;
    printf(" mem free %zu .... %f MB mem total %zu....%f MB mem used %f MB\n",freem,free_m,totalm,total_m,used_m);

    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);

    printf("\nMatrix P:\n");
    printMat(P,WP,HP);
    printf("\n Time spent:%f ms",time);

    // time is in ms, so bytes/(ms*1e6) gives GB/s.
    float Bandwidth=((HM*WM+HN*WN+WP*HP)*sizeof(float))/(time*1000000);
    // Computed in floating point: 2*HM*WM*WN overflows int for 3200x3200.
    long double NumOp=2.0L*HM*WM*WN;
    long double GFLOP=(NumOp*0.000000001)/((double)time*0.001);
    printf("\nThroughput %Lg GFLOP/s\n",GFLOP); // %Lg matches long double
    printf("Bandwidth %f GB/s \n", Bandwidth);

    free(M);
    free(N);
    free(P);
    cudaEventDestroy(evstart);
    cudaEventDestroy(evstop);
    return 0;
}
```
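One way to pin down where the dead pixels are is to compare the GPU result against a plain CPU product. Below is a minimal host-side sketch, assuming row-major storage as in the code above. `matMulRef` and `countMismatches` are hypothetical names, and the relative tolerance is something to tune: with inputs like M[i]=(float)i on 3200x3200 matrices, the values go past what a float represents exactly, so an exact comparison would flag harmless rounding differences.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference product P = M * N on the host, all matrices row-major.
// M is hm x wm, N is wm x wn, P is hm x wn. Sums are accumulated in double.
std::vector<float> matMulRef(const std::vector<float>& M,
                             const std::vector<float>& N,
                             int hm, int wm, int wn) {
    assert(M.size() == static_cast<std::size_t>(hm) * wm);
    assert(N.size() == static_cast<std::size_t>(wm) * wn);
    std::vector<float> P(static_cast<std::size_t>(hm) * wn);
    for (int row = 0; row < hm; ++row) {
        for (int col = 0; col < wn; ++col) {
            double acc = 0.0;
            for (int k = 0; k < wm; ++k) {
                acc += static_cast<double>(M[static_cast<std::size_t>(row) * wm + k]) *
                       static_cast<double>(N[static_cast<std::size_t>(k) * wn + col]);
            }
            P[static_cast<std::size_t>(row) * wn + col] = static_cast<float>(acc);
        }
    }
    return P;
}

// Counts entries of `got` that differ from `ref` by more than relTol
// (relative to max(1, |ref|)). NaN entries count as mismatches.
// If firstBad is non-null it receives the first bad index, or -1 if none.
std::size_t countMismatches(const std::vector<float>& got,
                            const std::vector<float>& ref,
                            double relTol, std::ptrdiff_t* firstBad) {
    assert(got.size() == ref.size());
    std::size_t bad = 0;
    if (firstBad) *firstBad = -1;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double r = static_cast<double>(ref[i]);
        const double diff = std::fabs(static_cast<double>(got[i]) - r);
        const double tol = relTol * std::max(1.0, std::fabs(r));
        if (!(diff <= tol)) {  // written this way so NaN is caught too
            if (bad == 0 && firstBad) *firstBad = static_cast<std::ptrdiff_t>(i);
            ++bad;
        }
    }
    return bad;
}
```

With the first bad index i in hand, row = i / WP and col = i % WP show whether the wrong entries cluster at the right or bottom edge (which would point at the grid size) or are scattered (which would point at the kernel or at a failed launch). A tolerance around 1e-3 is a starting guess, not a derived bound.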