Segmentation error in cudaMalloc - concurrent kernel execution

The CUDA program segment below gives a "segmentation fault" error.
I am allocating memory using CUDA 4.0 on 64-bit CentOS with a GTX 480. I am not sure where it is failing; I think one of the cudaMalloc calls makes my program crash. I am creating pointers to pointers in device memory with CUDA. In this code I am executing concurrent kernels, so one stream is created per kernel.

I even commented out the kernel call, but I still get the same segmentation fault. Please have a look at the program below and give me a hint on how to overcome this issue.

float randomNumberKernel(int num_data, int num_kernels, int num_blocks) {
    mtgp32_kernel_status_t *d_status;
    uint32_t **d_data = (uint32_t **) malloc(sizeof(uint32_t) * num_kernels);
    uint32_t **h_data = (uint32_t **) malloc(sizeof(uint32_t) * num_kernels);
    cudaError_t e;
    float gputime;
    cudaEvent_t start;
    cudaEvent_t end;

    cudaStream_t *streams;
    streams = (cudaStream_t *) malloc(sizeof(cudaStream_t) * num_kernels);
    for (int i = 0; i < num_kernels; ++i)
        cudaStreamCreate(&streams[i]);

    ccudaEventCreate(&start);
    ccudaEventCreate(&end);

    for (int i = 0; i < num_kernels; ++i)
        cudaMalloc((void **)&d_data[i], sizeof(uint32_t) * num_data / num_kernels);

    for (int i = 0; i < num_kernels; ++i)
        cudaHostAlloc((void **)&h_data[i], (num_data / num_kernels) * sizeof(uint32_t), cudaHostAllocWriteCombined);

    ccudaEventRecord(start, 0);

    /* kernel calls */
    for (int i = 0; i < num_kernels; ++i) {
        mtgp32_uint32_kernel<<<num_blocks, THREAD_NUM, 0, streams[i]>>>(d_status, d_data[i], num_data / (num_kernels * num_blocks), i * num_blocks);
    }
    e = cudaGetLastError();
    if (e != cudaSuccess) {
        printf("failure in kernel call.\n%s\n", cudaGetErrorString(e));
        exit(1);
    }

    for (int i = 0; i < num_kernels; i++) {
        cudaStreamSynchronize(streams[i]);
    }

    cudaEventRecord(end, 0);
    cudaEventSynchronize(end);

    cudaEventElapsedTime(&gputime, start, end);

    /* release resources */
    for (int i = 0; i < num_kernels; i++) {
        cudaStreamDestroy(streams[i]);
    }
    for (int i = 0; i < num_kernels; i++) {
        cudaFreeHost(h_data[i]);
        cudaFree(d_data[i]);
    }
    /* ccutDeleteTimer(timer); */
    cudaEventDestroy(start);
    cudaEventDestroy(end);

    ccudaFree(d_status);
    return gputime;
}

uint32_t **d_data = (uint32_t **) malloc(sizeof(uint32_t*) * num_kernels);	

uint32_t **h_data = (uint32_t **) malloc(sizeof(uint32_t*) * num_kernels);

EDIT: Correct syntax.

What do you mean by the above segment? I do not understand; please give me a hint.

You've used the wrong type inside the sizeof() in your code, so I've posted the corrected lines above.

BTW, there is no need to post the same question again; you won't receive any more replies by having the same thread twice in the forums.
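To see why that crashes, here is a minimal host-only sketch (just an illustration, assuming a 64-bit build where a pointer is 8 bytes and a uint32_t is 4 bytes):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int num_kernels = 8;

    /* sizeof(uint32_t) * num_kernels = 32 bytes, but the array has to hold
       num_kernels pointers, which needs sizeof(uint32_t *) * num_kernels = 64 bytes */
    uint32_t **d_data = (uint32_t **) malloc(sizeof(uint32_t) * num_kernels);

    printf("allocated %zu bytes, needed %zu bytes\n",
           sizeof(uint32_t) * num_kernels,
           sizeof(uint32_t *) * num_kernels);

    /* cudaMalloc((void **)&d_data[i], ...) writes a pointer into d_data[i];
       for the upper half of i that write lands past the end of the buffer,
       corrupting the heap - that is where the segmentation fault comes from */
    free(d_data);
    return 0;
}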

Hi Tera,

it worked, thank you so much.

Actually I used it as below (uint32_t*, not *uint32_t, inside sizeof()). Does it really matter? Anyway, I got compilation errors when I used malloc(sizeof(*uint32_t)).

uint32_t **d_data = (uint32_t **) malloc(sizeof(uint32_t*) * num_kernels);
uint32_t **h_data = (uint32_t **) malloc(sizeof(uint32_t*) * num_kernels);

thanks again

hello Tera,

I have one more question,

I know that using pointers to pointers is a performance killer. If I use a 2D array for the same task, will I get better performance? What do you think?

If I may help here: should the 'pointer to pointers' be used as input, you could consider applying not only a 2D array but also a texture in such a case, am I right?
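For example, something along these lines (only a rough sketch with made-up names, using the texture reference API available in CUDA 4.0):

texture<uint32_t, 2, cudaReadModeElementType> texData;

__global__ void consume2D(uint32_t *out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = tex2D(texData, x, y);  /* cached 2D read */
}

/* host side: bind a pitched 2D allocation to the texture reference */
uint32_t *d_2d;
size_t pitch;
cudaMallocPitch((void **)&d_2d, &pitch, width * sizeof(uint32_t), height);
cudaChannelFormatDesc desc = cudaCreateChannelDesc<uint32_t>();
cudaBindTexture2D(NULL, texData, d_2d, desc, width, height, pitch);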

Bless You,

MK

Sorry, my bad. Fixed the original post now.

Yes, 2D arrays are a lot better for performance than having to follow pointers to pointers. Whether 2D textures are a performance win over 2D arrays on compute capability 2.x devices depends a lot on your memory access patterns (both for the 2D array and for the rest of the kernel), so you'd have to try it out.
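Just as a rough illustration of the flattened layout in your case (names are illustrative, and it assumes num_data is divisible by num_kernels as in your snippet): one contiguous cudaMalloc instead of the per-kernel ones, with each kernel getting an offset into it:

uint32_t *d_flat;
size_t chunk = num_data / num_kernels;   /* elements per kernel */
cudaMalloc((void **)&d_flat, sizeof(uint32_t) * num_data);

for (int i = 0; i < num_kernels; ++i) {
    /* each stream's kernel writes its own slice of the single buffer */
    mtgp32_uint32_kernel<<<num_blocks, THREAD_NUM, 0, streams[i]>>>(
        d_status, d_flat + i * chunk,
        num_data / (num_kernels * num_blocks), i * num_blocks);
}

That also gets rid of the host-side array of device pointers entirely, so the original sizeof() pitfall cannot come back.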