Working with very large arrays in CUDA

Well, the day before yesterday, I was testing my matrix-matrix multiplication program using 8000x8000 matrices, meaning I cudaMalloc'ed three of them: 8000^2 x 3 x 4 bytes (float) = 768 MB, if my math is right. Soon after that, the monitor started displaying a lot of dead pixels, to the point that I had to reboot the computer (and now no one at the university can log in to that computer, but I don't think that is related to the dead pixels).

So basically I want to know the maximum amount of memory I can allocate on my GPU (NVIDIA Quadro FX 3800), as I thought it was equal to the memory size (1 GB).

I would also like to know some alternatives in case I am not able to allocate such an amount of memory, because I will have to work with very large 3D matrices.

Thanks in advance

Anyone?

If you are seeing graphics corruption after allocating a large chunk of memory, that sounds like a driver (or a hardware) problem.

You can use cudaMemGetInfo() to retrieve the amount of free memory on the device if you want to know the maximum amount of memory you can use from CUDA.
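For example, something along these lines (a rough, untested sketch; the 8000x8000 sizes are just your example) would tell you up front whether the three matrices will fit, and fail gracefully if they don't:

	#include <cstdio>
	#include <cuda_runtime.h>

	int main()
	{
	    size_t freeB = 0, totalB = 0;
	    cudaMemGetInfo(&freeB, &totalB);

	    // Three 8000x8000 float matrices, as in your test case.
	    size_t need = 3ull * 8000 * 8000 * sizeof(float);
	    printf("free %zu MB, total %zu MB, needed %zu MB\n",
	           freeB >> 20, totalB >> 20, need >> 20);

	    if (need > freeB) {
	        printf("Not enough free device memory for this problem size.\n");
	        return 1;
	    }

	    float *d = 0;
	    cudaError_t err = cudaMalloc((void **)&d, need);
	    if (err != cudaSuccess) {
	        // This can still fail even when the free total looks sufficient,
	        // e.g. if the free memory is fragmented.
	        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
	        return 1;
	    }
	    cudaFree(d);
	    return 0;
	}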

Thanks, after a little bit of searching I found an implementation using cudaMemGetInfo():

	size_t freem, totalm;
	float free_m, total_m, used_m;

	cudaMemGetInfo(&freem, &totalm);
	free_m  = freem  / 1048576.0f;   // bytes -> MB
	total_m = totalm / 1048576.0f;
	used_m  = total_m - free_m;

	printf("  mem free %zu .... %f MB mem total %zu....%f MB mem used %f MB\n",
	       freem, free_m, totalm, total_m, used_m);

	//Throughput
	// Note: HM*WM*WN is evaluated in 32-bit integer arithmetic, so it
	// overflows for large matrices before being converted to double.
	double NumOperations = (HM*WM*WN)*2;
	double GFLOP = (NumOperations*0.000000001)/((double)time*0.001);
	printf("\nThroughput %f GFLOP/s\n", GFLOP);

I then tried multiplying 4000x4000 matrices, and these are the results:

mem free 827068160 .... 788.753662 MB mem total 1073020928....1023.312500 MB mem used 234.558838 MB

Time Spent:1963.720093 ms

Throughput -0.432352 GFLOP/s (I know it is wrong; when the number of elements is very high the integer operation count goes out of range. Using a calculator I found the value to be 521 GFLOP/s... okay, that does not seem possible either lol)

Bandwidth 0.097774 GB/s

When I try 8000x8000 (which I did just now, and I am writing under a cloud of dead pixels again lol), these are the results I get:

mem free 0 .... 0.000000 MB mem total 0....0.000000 MB mem used 0.000000 MB

Time Spent:0.000000 ms

Throughput inf GFLOP/s

Bandwidth inf GB/s

It doesn't make sense, because three matrices of 8000x8000 float elements (about 768 MB) do not fill the 1023 MB.

EDIT: After commenting out the kernel invocation I ran the code again and got 783 MB used and no dead pixels, so the problem lies within the kernel (damn), which is strange since it is the same as the one in the Programming Massively Parallel Processors book. I will look into it.

P.S. Isn't it strange that when using blocksize 16 and 4000x4000 matrices I get around 32 GFLOP/s, and the GPU goes bananas with 8000x8000?

I think there is something wrong with your graphics driver or hardware. No CUDA kernel should be able to corrupt the display. (People have reported memory corruption, but that would indicate a fundamental bug in the graphics drivers.)

Are you using the latest NVIDIA drivers?

I am not at the university at the moment, but tomorrow I will check the display drivers and post the version here. Since I only started working with CUDA about two months ago, I think the display drivers should be fairly up to date.

I will post some tests too, because I think I'm getting some strange output results, mainly in the throughput. I have tested the matrix multiplication up to 200x200 matrices and confirmed with CPU code that the results are fine, so I don't know why increasing the data size is affecting the kernel. I am not a CUDA expert, but the whole point of using CUDA should be to process a large amount of data with high throughput.

Video memory corruption almost always means out-of-bounds global memory writes on pre-Fermi hardware. The fact that not running your kernel eliminates the problem confirms it. The difference between 4000 and 8000 is probably explained by GPU memory page size versus write stride differences between the two cases. Running cuda-memcheck will probably tell you more.
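The usual culprit looks something like the first kernel below, where the store index is never checked against the array size, so the threads in the last block write past the end of the allocation. This is just an illustration of the pattern to look for, not your code:

	// Unguarded: every thread writes, including the ones past the end of
	// the data, so the last block scribbles over memory it does not own.
	__global__ void scale_bad(float *out, const float *in, int n)
	{
	    int i = blockIdx.x * blockDim.x + threadIdx.x;
	    out[i] = 2.0f * in[i];              // out of bounds when i >= n
	}

	// Guarded: threads with i >= n simply do nothing.
	__global__ void scale_ok(float *out, const float *in, int n)
	{
	    int i = blockIdx.x * blockDim.x + threadIdx.x;
	    if (i < n)
	        out[i] = 2.0f * in[i];
	}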

Well, as for the display drivers, the current version is 260.19.44 (I have 260.19.36).

Now, after about two hours of intensive testing:

Blocksize 16 - [32][32] matrices - result confirmed with CPU

Time spent:0.106656 ms

Throughput 0.614461 GFLOP/s

Bandwidth 0.115212 GB/s 

Blocksize 16 - [1600][1600] matrices - not confirmed

Time Spent:107.803421 ms

Throughput 75.990167 GFLOP/s

Bandwidth 0.284963 GB/s 

Blocksize 16 - [3200][3200] matrices - not confirmed

Time Spent:978.571655 ms

Throughput 66.971078 GFLOP/s

Bandwidth 0.125571 GB/s 

Blocksize 2 - [32][32] matrices 

Time Spent:0.131744 ms

Throughput 0.131744 GFLOP/s

Bandwidth 0.093272 GB/s 

Blocksize 2 - [1600][1600] matrices 

Time Spent:0.131744 ms

Throughput 1.723420 GFLOP/s

Bandwidth 0.006463 GB/s 

Blocksize 2 - [3200][3200] matrices

Time Spent:0.000000 ms

Throughput 0 GFLOP/s

Bandwidth inf GB/s 

VIDEO CORRUPTION

More results confirmed with CPU code:

[160][160]

[40][20]*[20][4] - non square

[80][180]*[180][40] - non square

[80][180]*[180][200] - wrong result, all elements inf. For this case cuda-memcheck reports:

========= Invalid __global__ read of size 4

=========     at 0x00000168 in nonsquare

=========     by thread (0,0,0) in block (8,1) //the block changes

=========     Address 0x0011db80 is out of bounds

=========

========= ERROR SUMMARY: 1 error

Other matrices give the same error, albeit with correct output:

[2][2]*[2][4]

[4][4]*[4][8]

It seems the error appears only when the width of the second matrix is larger than the other dimensions.

P.S.: When I run [8000][8000] I keep getting the strange output, but cuda-memcheck returns no error! Isn't that strange?

P.S.2: avidday, thanks for the tip about cuda-memcheck.

Is this still the same kernel code you have previously posted here and on stackoverflow?

Yes, it's the same as the stackoverflow one; the other one was missing a __syncthreads().

I basically wanted to do non-square matrix multiplication, so I checked the CUDA SDK example and saw that they were using dimensions that are multiples of the blocksize. I then took that idea and merged it with the kernel from the Programming Massively Parallel Processors book, and most of the results (in terms of throughput and the resulting matrix) are fine, aside from those special cases of non-square matrices.

Since I don't have the privileges to compile programs in the CUDA SDK directory, I will copy the SDK kernel and test it with my host code. Although I will be working with 3D matrices, I have to make sure I understand all the subtleties of 2D multiplication first.
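For reference, this is roughly the shape of the bounds-checked tiled kernel I am trying to end up with. It is only a sketch of my own (TILE, the names, and the argument order are mine, not the book's or the SDK's):

	#define TILE 16

	// P = M (hM x wM) * N (wM x wN), row-major storage. The bounds checks
	// mean the dimensions do not have to be multiples of TILE.
	__global__ void nonsquare_tiled(float *P, const float *M, const float *N,
	                                int hM, int wM, int wN)
	{
	    __shared__ float Ms[TILE][TILE];
	    __shared__ float Ns[TILE][TILE];

	    int row = blockIdx.y * TILE + threadIdx.y;   // row in P and M
	    int col = blockIdx.x * TILE + threadIdx.x;   // column in P and N
	    float acc = 0.0f;

	    // Walk over the shared dimension one tile at a time.
	    for (int t = 0; t < (wM + TILE - 1) / TILE; ++t) {
	        int mCol = t * TILE + threadIdx.x;
	        int nRow = t * TILE + threadIdx.y;

	        // Guard every global read; out-of-range threads load 0 instead.
	        Ms[threadIdx.y][threadIdx.x] =
	            (row < hM && mCol < wM) ? M[row * wM + mCol] : 0.0f;
	        Ns[threadIdx.y][threadIdx.x] =
	            (nRow < wM && col < wN) ? N[nRow * wN + col] : 0.0f;
	        __syncthreads();

	        for (int k = 0; k < TILE; ++k)
	            acc += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
	        __syncthreads();
	    }

	    if (row < hM && col < wN)
	        P[row * wN + col] = acc;
	}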

P.S.: I had a good laugh because you connected the two posts, even though I use my nickname here and my real name on stackoverflow ^^

EDIT: I tested the SDK kernel and it doesn't give me the cuda-memcheck error. Nevertheless, I keep getting the dead-pixels problem with the same examples I posted here. Given that one of them (3200x3200) is not even close to reaching full GPU memory capacity, there is something happening outside the kernel. This is my host code; it doesn't seem like there is anything out of the ordinary:

#include <stdio.h>
#include <stdlib.h>

// HM/WM, HN/WN, HP/WP, blocksize, the nonsquare kernel and printMat
// are defined elsewhere in the file.

int main(){
	cudaEvent_t evstart, evstop;
	cudaEventCreate(&evstart);
	cudaEventCreate(&evstop);

	float *M=(float*)malloc(sizeof(float)*HM*WM);
	float *N=(float*)malloc(sizeof(float)*HN*WN);
	for(int i=0;i<WM*HM;i++)
		M[i]=(float)i;
	for(int i=0;i<WN*HN;i++)
		N[i]=(float)i;
	float *P=(float*)malloc(sizeof(float)*HP*WP);

	float *Md,*Nd,*Pd;//,*Ptranspostad
	cudaMalloc((void**)&Md,HM*WM*sizeof(float));
	cudaMalloc((void**)&Nd,HN*WN*sizeof(float));
	cudaMalloc((void**)&Pd,HP*WP*sizeof(float));
	cudaMemcpy(Md,M,HM*WM*sizeof(float),cudaMemcpyHostToDevice);
	cudaMemcpy(Nd,N,HN*WN*sizeof(float),cudaMemcpyHostToDevice);

	dim3 dimBlock(blocksize,blocksize);        //(tile_width, tile_width)
	dim3 dimGrid(WN/dimBlock.x,HM/dimBlock.y); // rounds down if WN or HM is not a multiple of blocksize

	cudaEventRecord(evstart,0);
	nonsquare<<<dimGrid,dimBlock>>>(Pd,Md,Nd,WM,WN);
	cudaEventRecord(evstop,0);
	cudaEventSynchronize(evstop);
	float time;
	cudaEventElapsedTime(&time,evstart,evstop);

	cudaMemcpy(P,Pd,WP*HP*sizeof(float),cudaMemcpyDeviceToHost);

	size_t freem, totalm;
	float free_m,total_m,used_m;
	cudaMemGetInfo(&freem,&totalm);
	free_m =freem/1048576.0f;    // bytes -> MB
	total_m=totalm/1048576.0f;
	used_m =total_m-free_m;
	printf("  mem free %zu .... %f MB mem total %zu....%f MB mem used %f MB\n",
	       freem,free_m,totalm,total_m,used_m);

	cudaFree(Md);
	cudaFree(Nd);
	cudaFree(Pd);

	printf("\nMatrix P:\n");
	printMat(P,WP,HP);
	printf("\n Time spent:%f ms",time);

	float Bandwidth=((HM*WM+HN*WN+WP*HP)*sizeof(float))/(time*1000000);
	// Note: HM*WM*WN is evaluated in 32-bit integer arithmetic, so it
	// overflows for large matrices before being converted to long double.
	long double NumOp= (HM*WM*WN)*2;
	long double GFLOP= (NumOp*0.000000001)/((double)time*0.001);
	printf("\nThroughput %Lg GFLOP/s\n",GFLOP);
	printf("Bandwidth %f GB/s \n", Bandwidth);

	free(M); free(N); free(P);
	cudaEventDestroy(evstart);
	cudaEventDestroy(evstop);
	return 0;
}
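One thing I notice while posting it is that none of the calls, nor the kernel launch, are checked for errors, so a failure would go unnoticed. A minimal sketch of the kind of checking I could wrap them in (the macro is my own, not from the book):

	#include <cstdio>
	#include <cstdlib>

	// Abort with file/line information if a CUDA runtime call fails.
	#define CUDA_CHECK(call)                                              \
	    do {                                                              \
	        cudaError_t err_ = (call);                                    \
	        if (err_ != cudaSuccess) {                                    \
	            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
	                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
	            exit(EXIT_FAILURE);                                       \
	        }                                                             \
	    } while (0)

	// Usage, for example:
	//   CUDA_CHECK(cudaMalloc((void**)&Md, HM*WM*sizeof(float)));
	//   nonsquare<<<dimGrid,dimBlock>>>(Pd,Md,Nd,WM,WN);
	//   CUDA_CHECK(cudaGetLastError());        // launch-configuration errors
	//   CUDA_CHECK(cudaDeviceSynchronize());   // errors raised during execution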