Some basic questions on CUDA and memory

I have the following host-side launcher:

extern "C" void
runCuda(const int argc, const char** argv, float *sPWM,
	int popSize, int matrixLen, float specificity, float *sFitness,
	int numMotifSeq, char *smotifSeq, char *srmotifSeq,
	int *motifSeqLen, int numBackgSeq, char *sbackgSeq, char *srbackgSeq,
	int *backgSeqLen, float *scoreMotif, float *scoreBackg)
{
	CUT_DEVICE_INIT();

	// Declare device pointers
	float *dsPWM, *dsFitness, *dspecificity;
	int *dmotifSeqLen, *dbackgSeqLen, *dmatrixLen, *dnumMotifSeq, *dnumBackgSeq;
	char *dsmotifSeq, *dsrmotifSeq, *dsbackgSeq, *dsrbackgSeq;

	// Allocation size of each variable
	int dsPWM_size        = sizeof(float)*100*17*4;
	int dsFitness_size    = sizeof(float)*100;
	int dmotifSeqLen_size = sizeof(int)*5001;
	int dbackgSeqLen_size = sizeof(int)*5001;
	int dmatrixLen_size   = sizeof(int)*1;
	int dspecificity_size = sizeof(float)*1;
	int dnumMotifSeq_size = sizeof(int)*1;
	int dnumBackgSeq_size = sizeof(int)*1;
	int dsmotifSeq_size   = sizeof(char)*302*10001;
	int dsrmotifSeq_size  = sizeof(char)*302*10001;
	int dsbackgSeq_size   = sizeof(char)*1500*10001;
	int dsrbackgSeq_size  = sizeof(char)*1500*10001;

	// Allocate device memory
	CUDA_SAFE_CALL(cudaMalloc((void**) &dsPWM, dsPWM_size));
	.... (all the other 10 mallocs)
	CUDA_SAFE_CALL(cudaMalloc((void**) &dsrbackgSeq, dsrbackgSeq_size));

	// Copy host memory to device
	CUDA_SAFE_CALL(cudaMemcpy(dsPWM, sPWM, dsPWM_size, cudaMemcpyHostToDevice));
	.... (all the other 10 memcpys)
	CUDA_SAFE_CALL(cudaMemcpy(dsrbackgSeq, srbackgSeq, dsrbackgSeq_size, cudaMemcpyHostToDevice));

	// Set up execution parameters
	dim3 grid(10, 1, 1);
	dim3 threads(10, 1, 1);

	// Execute the kernel
	evolve<<< grid, threads >>>(dsPWM, dmatrixLen, dspecificity, dsFitness,
	                            dnumMotifSeq, dsmotifSeq, dsrmotifSeq, dmotifSeqLen,
	                            dnumBackgSeq, dsbackgSeq, dsrbackgSeq, dbackgSeqLen);

	// Check whether the kernel launch generated an error
	CUT_CHECK_ERROR("Kernel execution failed");

	// Copy the results from device to host
	CUDA_SAFE_CALL(cudaMemcpy(sFitness, dsFitness, dsFitness_size, cudaMemcpyDeviceToHost));

	// Clean up device memory
	CUDA_SAFE_CALL(cudaFree(dsPWM));
	... (all the other frees)
}

The kernel itself is simple:

__global__ void evolve(float *sPWM, int *matrixLen, float *specificity, float *sFitness,
		       int *numMotifSeq, char *smotifSeq, char *srmotifSeq, int *motifSeqLen,
		       int *numBackgSeq, char *sbackgSeq, char *srbackgSeq, int *backgSeqLen)
{
	// Per-thread scratch space
	float stPWM[100];
	float fsumMin, fsumMax;
	float fscoreMotif[302], fscoreBackg[1501];
	int matLen = *matrixLen;
	float spec = *specificity;

	// Compute this thread's global index
	const int tid = blockIdx.x * blockDim.x + threadIdx.x;

	transform_pwm_s(&sPWM[tid*matLen*4], stPWM, *matrixLen);
	sum_position_min_s(&fsumMin, stPWM, *matrixLen);
	sum_position_max_s(&fsumMax, stPWM, *matrixLen);
	score_seq_s(*numMotifSeq, smotifSeq, srmotifSeq, motifSeqLen, stPWM,
		    *matrixLen, fsumMin, fsumMax, fscoreMotif);
	score_seq_s(*numBackgSeq, sbackgSeq, srbackgSeq, backgSeqLen, stPWM,
		    *matrixLen, fsumMin, fsumMax, fscoreBackg);

	// Write the result back to global memory
	sFitness[tid] = cal_ROC_s(fscoreMotif, *numMotifSeq, fscoreBackg, *numBackgSeq, 1.0f - spec);
}

From main() I want to call runCuda multiple times. I have the following problems:

  1. Without the assignment back (sFitness[tid] = …) in the kernel, the code runs in about 0.06 s; when I try to assign the result back, it takes 66 s. What happened? (See the timing sketch after this list.)

  2. I think I'm using too much memory, but I'm not clear on how much is too much. In the kernel, when I declare fscoreMotif with more than 500 elements I get a compilation error (using too much local memory), but at the current size it compiles without complaint.

  3. In main, the first time I call runCuda everything works perfectly and I get the correct result for the data I pass in (after 66 s), but the second iteration runs in only 18 s and gives me a wrong result, and the third iteration causes a segmentation fault. I'm not sure what went wrong. Is there any way, after every call to runCuda, to reset everything to its initial state?

  4. I want to put CUDA_EXIT(argc, argv) inside runCuda, but when it runs it prints out something like “cannot parse the argv …”. Can anyone give me an idea about this?
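For problem 1, one thing worth knowing: kernel launches are asynchronous, and without the write to sFitness the compiler is also free to eliminate the kernel body as dead code, so the 0.06 s run may not be executing the real work at all. Below is a minimal timing sketch, assuming the CUDA 2.x-era cudaThreadSynchronize (newer toolkits call it cudaDeviceSynchronize); it drops into runCuda around the existing launch:

	// at the top of the .cu file:
	#include <cstdio>
	#include <ctime>

	// Sketch only: synchronize before reading the clock, otherwise
	// clock() measures just the asynchronous launch overhead.
	clock_t t0 = clock();
	evolve<<< grid, threads >>>(dsPWM, dmatrixLen, dspecificity, dsFitness,
	                            dnumMotifSeq, dsmotifSeq, dsrmotifSeq, dmotifSeqLen,
	                            dnumBackgSeq, dsbackgSeq, dsrbackgSeq, dbackgSeqLen);
	cudaThreadSynchronize();   // block the host until the kernel finishes
	printf("kernel time: %.2f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);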

Thanks a lot and best regards

Avoid CUTIL for production code. It can be changed or withdrawn by NVIDIA at any time, and that will BREAK your code.

Regarding your question, we don't know what cal_ROC_s does… maybe it is too compute-intensive.

Hey, thanks for the quick reply. What do you mean by CUTIL? cal_ROC_s is actually very fast; it just runs a loop with 1000 iterations. Even on the CPU that costs less than 100 ms.

CUT_DEVICE_INIT, CUDA_SAFE_CALL and the like all come from the CUTIL part of the SDK, i.e. the “NVIDIA CUDA SDK/common/inc” directory.

Those macros were written by the people who wrote the SDK, for their own use. They won't support that code, and they won't commit to not withdrawing it. It is a RISK in production code: one day you may upgrade to some CUDA version locally and find your build tree breaking.

This has been officially stated by some NVIDIA moderators; hence my advice.
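For reference, those macros are thin wrappers over the runtime API and are easy to replace. A minimal sketch of a hand-rolled substitute (the name checkCuda is my own, not an SDK symbol):

	#include <cstdio>
	#include <cstdlib>
	#include <cuda_runtime.h>

	// Hypothetical stand-in for CUDA_SAFE_CALL: check every runtime-API
	// result and abort with a readable message.
	#define checkCuda(call)                                              \
		do {                                                         \
			cudaError_t err = (call);                            \
			if (err != cudaSuccess) {                            \
				fprintf(stderr, "CUDA error at %s:%d: %s\n", \
					__FILE__, __LINE__,                  \
					cudaGetErrorString(err));            \
				exit(EXIT_FAILURE);                          \
			}                                                    \
		} while (0)

	// Usage, mirroring runCuda:
	//   checkCuda(cudaMalloc((void**) &dsPWM, dsPWM_size));
	//   evolve<<< grid, threads >>>(...);
	//   checkCuda(cudaGetLastError());   // stand-in for CUT_CHECK_ERROR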

By the way, local memory might be too slow. Also, you are using just 10 blocks with 10 threads each… that is a very, very small number.
The block size should at least be a multiple of 32; otherwise you are wasting GPU cycles: with 10 threads per block you fill only 10 of the 32 lanes of a warp, roughly 30% utilization.

And there are always latencies; you need to have more blocks actively running. How many multiprocessors does your GPU have? Or, which GPU do you use?

You need lots of threads and blocks for the GPU. The GPU is an elephant; don't feed it like an ant. It does not scale down to small sizes.
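To make that concrete, here is a minimal sketch of a warp-friendly launch configuration (the block size of 64 and the early-exit guard are illustrative choices, not the original code):

	// Block size a multiple of 32; grid derived from the work size
	// (popSize individuals) instead of the hard-coded 10x10.
	const int threadsPerBlock = 64;
	const int numBlocks = (popSize + threadsPerBlock - 1) / threadsPerBlock;
	dim3 grid(numBlocks, 1, 1);
	dim3 threads(threadsPerBlock, 1, 1);
	// The kernel then needs an early-exit guard, since the rounded-up
	// grid can spawn more threads than there are work items:
	//     if (tid >= popSize) return;

The guard matters here because evolve indexes sPWM and sFitness by tid.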

Oh I see. I’m using an 8800GTX :)

You have 16 multiprocessors in the GTX. You are spawning 10 blocks, so most likely you are using only 10 multiprocessors.

Each multiprocessor is running 10 threads. You need 192 active threads (all from the same block, or from different active blocks) to hide even register latencies.

Simply put, you need 192*16 = 3072 threads actively running to keep your GTX performing at an acceptable level.

Depending on the application, this number could increase.
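Rather than hard-coding 16, the multiprocessor count can be queried at runtime. A small sketch using cudaGetDeviceProperties (the 192-per-multiprocessor figure is the rule of thumb from above):

	#include <cstdio>
	#include <cuda_runtime.h>

	int main()
	{
		cudaDeviceProp prop;
		cudaGetDeviceProperties(&prop, 0);   // device 0
		// Rule of thumb: ~192 active threads per multiprocessor
		// to hide register latencies.
		printf("%s: %d multiprocessors, aim for at least %d active threads\n",
		       prop.name, prop.multiProcessorCount,
		       192 * prop.multiProcessorCount);
		return 0;
	}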

Thanks a lot. Then what about the amount of memory used by each thread? Can the kernel above (with the parameters passed in at the sizes indicated in the other piece of code) run properly? And what happens with my second call to runCuda() onward, where it seems some of the threads cannot be invoked?