segmentation fault at the first cudaMalloc with --device-emulation everything was fine

timon · January 20, 2010, 10:29pm

Hi, I was developing my project with the --device-emulation option because I can use printf and it was easier for me to debug.
Now that everything seems to work fine, I was trying to run the same project without that option and I get a segmentation fault at the first cudaMalloc call. I got no idea why this is happening since everything was alright while I was emulating the device.
Anyone knows what can be the cause?

To compile I use:

nvcc -g -G --compiler-bindir $(HOME)/bin -c -arch sm_11 paralelLayer.cu

and to link:

nvcc -o preann xmm32.o paralelLayer.o commonFunctions.o vector.o xmmVector.o layer.o cudaLayer.o xmmLayer.o neuralNet.o cudaNeuralNet.o xmmNeuralNet.o chronometer.o main.o -L/usr/local/cuda/lib -lcudart

Any help will be great.

eyalhir74 · January 21, 2010, 8:24am

Hi,

device-emulation runs your code on the cpu in a serial way - so actually beside testing and debugging indexes calculations and that

you indeed access the correct position in your input/output array its mostly meaningless. A perfectly running code on emulation can

crash all day on the GPU.

So what you should do is post your code sample that crashes and also add the following after the line that causes the crash (probably

the kernel code):

char errorMessage[1000] = "Error running this....";

char buff[ 1000 ];

cudaError_t err = cudaGetLastError();

if( cudaSuccess != err) 

{ 

   sprintf_s( buff, "Cuda error: %s in file '%s' in line %i : %s[%d].\n",errorMessage, __FILE__, __LINE__, cudaGetErrorString( err), err );

   exit(-1);

}

err = cudaThreadSynchronize(); 

if( cudaSuccess != err) 

{ 

   sprintf_s( buff, "Cuda error: %s in file '%s' in line %i : %s[%d].\n",errorMessage, __FILE__, __LINE__, cudaGetErrorString( err), err );

   exit(-1);

}

Make this a macro and you can put it now after kernel invocations, memory allocations, etc…

eyal

timon · January 22, 2010, 3:29pm

Hi, I had a similar but simpler function to handle errors:

void checkCUDAError(const char *msg)

{

	cudaError_t err = cudaGetLastError();

	if( cudaSuccess != err) 

	{

		fprintf(stderr, "Cuda error: %s: %s.\n", msg, 

								  cudaGetErrorString( err) );

		exit(EXIT_FAILURE);

	}						 

}

I have just one file but I like your LINE thing as a way to avoid using cuda-gdb.

I just don’t understand cudaThreadSynchronize() and sprintf_s…

Anyway, I didn’t post the code because it wasn’t the kernel function it crashes the first time I call cudaMalloc. This seems very strange, but maybe a miss something.

Here’s the crashing code:

extern "C" void** InputsToDevice(void** host_inputs, unsigned* host_inputSizes, VectorType* host_types, unsigned numberInputs)

{

	size_t size = numberInputs * sizeof(void*);

	void** dev_inputs;

	cudaMalloc((void**)&(dev_inputs), size); // Here crashes!!!

	

	for (unsigned i=0; i < numberInputs; i++){

		

		if (host_types[i] == FLOAT){

		

			size = host_inputSizes[i] * sizeof(float);

		} else {

			size = (((host_inputSizes[i] - 1)/ BITS_PER_UNSIGNED) + 1) * sizeof(unsigned);

		}

		cudaMalloc((void**)&(dev_inputs[i]), size);

		cudaMemcpy(dev_inputs[i], host_inputs[i], size, cudaMemcpyHostToDevice);

	}

	checkCUDAError("Inputs To Device");

	return dev_inputs;

}

I don’t know, I’m kinda lost.

Thanks

jaka1 · January 23, 2010, 2:22am

Hi, I had a similar but simpler function to handle errors:
void checkCUDAError(const char *msg)

{

	cudaError_t err = cudaGetLastError();

	if( cudaSuccess != err) 

	{

		fprintf(stderr, "Cuda error: %s: %s.\n", msg, 

								  cudaGetErrorString( err) );

		exit(EXIT_FAILURE);

	}						 

}
I have just one file but I like your LINE thing as a way to avoid using cuda-gdb.

I just don’t understand cudaThreadSynchronize() and sprintf_s…

Anyway, I didn’t post the code because it wasn’t the kernel function it crashes the first time I call cudaMalloc. This seems very strange, but maybe a miss something.

Here’s the crashing code:
extern "C" void** InputsToDevice(void** host_inputs, unsigned* host_inputSizes, VectorType* host_types, unsigned numberInputs)

{

	size_t size = numberInputs * sizeof(void*);

	void** dev_inputs;

	cudaMalloc((void**)&(dev_inputs), size); // Here crashes!!!

	

	for (unsigned i=0; i < numberInputs; i++){

		

		if (host_types[i] == FLOAT){

		

			size = host_inputSizes[i] * sizeof(float);

		} else {

			size = (((host_inputSizes[i] - 1)/ BITS_PER_UNSIGNED) + 1) * sizeof(unsigned);

		}

		cudaMalloc((void**)&(dev_inputs[i]), size);

		cudaMemcpy(dev_inputs[i], host_inputs[i], size, cudaMemcpyHostToDevice);

	}

	checkCUDAError("Inputs To Device");

	return dev_inputs;

}
I don’t know, I’m kinda lost.

Thanks

cudaMalloc((void**)&(dev_inputs), size); // Here crashes!!!

Here dev_inputs is already void**. By taking the address, you make it void***.

timon · January 23, 2010, 2:00pm

Thank you, jaka.
But, what I don’t understand is why this didn’t crash in emulation mode the same.

I guess Then I have to revise all the cudaMalloc calls.

Thanks a lot.

timon · January 23, 2010, 2:46pm

hi, jaka.
Removing & it crashes the same. I’m not sure it doesn’t have to be void*** before the (void**) casting.

eyal, my function for handling the errors doesn’t works. The error isn’t printed like it used to do while I was using --device-emulation.
Do you have any idea why?

Thanks

avidday · January 23, 2010, 3:18pm

I actually thought it would crash here:

cudaMalloc((void**)&(dev_inputs[i]), size);

You can’t do that. You have already allocated dev_inputs to be a device array of pointers. You cannot reference its values from host memory. That will be a guaranteed segfault. It you want that sort of memory structure, the assignment of values into the array of pointers must be done in device code (like a small initialization kernel), or using cudaMemcpyToSymbol.

timon · January 23, 2010, 4:13pm

I actually thought it would crash here:
cudaMalloc((void**)&(dev_inputs[i]), size);
You can’t do that. You have already allocated dev_inputs to be a device array of pointers. You cannot reference its values from host memory. That will be a guaranteed segfault. It you want that sort of memory structure, the assignment of values into the array of pointers must be done in device code (like a small initialization kernel), or using cudaMemcpyToSymbol.

you’re right. It was crashing there. I mistook the number line with the main.cpp one. What a stupid mistake!

I’ll study the cudaMemcpyToSymbol thing and the initialization kernel solution.

Thank you very much.

timon · January 25, 2010, 5:31pm

Hi, I’ve changed my code to:

extern "C" void** InputsToDevice(void** host_inputs, unsigned* host_inputSizes, VectorType* host_types, unsigned numberInputs)

{

	size_t size = numberInputs * sizeof(void*);

	void** dev_inputs;

	cudaMalloc((void**)&dev_inputs, size);

	

	for (unsigned i=0; i < numberInputs; i++){

		

		if (host_types[i] == FLOAT){

		

			size = host_inputSizes[i] * sizeof(float);

		} else {

			size = (((host_inputSizes[i] - 1)/ BITS_PER_UNSIGNED) + 1) * sizeof(unsigned);

		}

		void* aux;

		cudaMalloc((void**)&aux, size);

		cudaMemcpy(aux, host_inputs[i], size, cudaMemcpyHostToDevice);

		cudaMemcpyToSymbol(&(dev_inputs[i]), &aux, sizeof(void*), cudaMemcpyDeviceToDevice);

	}

	checkCUDAError("Inputs To Device");

	return dev_inputs;

}

And I get a cuda error:

invalid device symbol

I’ve tried

cudaMemcpyToSymbol(&(dev_inputs[i]), &aux, sizeof(void*), cudaMemcpyDeviceToDevice);

and cudaMemcpyToSymbol(&(dev_inputs[i]), &aux, sizeof(void*), cudaMemcpyDeviceToDevice);

but the result is the same.

I’ve also tried:

cudaMemcpyToSymbol(dev_inputs, &aux, sizeof(void*), i * sizeof(void*) ,cudaMemcpyHostToDevice); (same error)

and

cudaMemcpyToSymbol(dev_inputs, &aux, sizeof(void*), i * sizeof(void*) ,cudaMemcpyDeviceToDevice);

with this i get “invalid device pointer”

Do I have to write a kernel just for doing an assignment?

timon · January 25, 2010, 6:12pm

I realised that I don’t need dev_inputs vector to be in device memory, but just the inputs themselves.

The method will be:

extern "C" void** InputsToDevice(void** host_inputs, unsigned* host_inputSizes, VectorType* host_types, unsigned numberInputs)

{

	size_t size = numberInputs * sizeof(void*);

	void** dev_inputs;

	dev_inputs = new void*[numberInputs];

	

	for (unsigned i=0; i < numberInputs; i++){

		

		if (host_types[i] == FLOAT){

		

			size = host_inputSizes[i] * sizeof(float);

		} else {

			size = (((host_inputSizes[i] - 1)/ BITS_PER_UNSIGNED) + 1) * sizeof(unsigned);

		}

		cudaMalloc((void**)&(dev_inputs[i]), size);

		cudaMemcpy(dev_inputs[i], host_inputs[i], size, cudaMemcpyHostToDevice);

	}

	checkCUDAError("Inputs To Device");

	return dev_inputs;

}

But I got another method that copies data to the device and I got the same (or very similar) problem there.

extern "C" struct_Layer* LayerHostToDevice(struct_Layer* h_layer, VectorType inputType, VectorType outputType){

	struct_Layer* d_layer;

	cudaMalloc((void**)&d_layer, sizeof(struct_Layer));

	size_t size = sizeof(unsigned);

	cudaMemcpy(&(d_layer->numberInputLayers), &(h_layer->numberInputLayers), size, cudaMemcpyHostToDevice);

	cudaMemcpy(&(d_layer->totalWeighsPerOutput), &(h_layer->totalWeighsPerOutput), size, cudaMemcpyHostToDevice);

	cudaMemcpy(&(d_layer->outputSize), &(h_layer->outputSize), size, cudaMemcpyHostToDevice);

	cudaMemcpy(&(d_layer->functionType), &(h_layer->functionType), sizeof(FunctionType), cudaMemcpyHostToDevice);

	size = h_layer->numberInputLayers * sizeof(unsigned);

	cudaMalloc((void**)&(d_layer->inputLayerSize), size); //Now crashes here!!! I think the reason is the same :(

	cudaMemcpy(d_layer->inputLayerSize, h_layer->inputLayerSize, size, cudaMemcpyHostToDevice);

	size = h_layer->numberInputLayers * sizeof(void*);

	cudaMalloc((void**)d_layer->inputNeurons, size);

	if (outputType == FLOAT){

		size = sizeof(float) * h_layer->outputSize * h_layer->totalWeighsPerOutput;

	} else {

		size = sizeof(unsigned char) * h_layer->outputSize * h_layer->totalWeighsPerOutput;

	}

	cudaMalloc((void**)&(d_layer->weighs), size);

	cudaMemcpy(d_layer->weighs, h_layer->weighs, size, cudaMemcpyHostToDevice);

	if (outputType == FLOAT){

		size = sizeof(float) * h_layer->outputSize;

	} else {

		size = sizeof(unsigned) * (((h_layer->outputSize - 1)/ BITS_PER_UNSIGNED) + 1);

	}

	cudaMalloc((void**)&(d_layer->outputNeurons), size);

	cudaMemcpy(d_layer->outputNeurons, h_layer->outputNeurons, size, cudaMemcpyHostToDevice);

	size = h_layer->outputSize * sizeof(float);

	cudaMalloc((void**)&(d_layer->thresholds), size);

	cudaMemcpy(d_layer->thresholds, h_layer->thresholds, size, cudaMemcpyHostToDevice);

	

	checkCUDAError("Layer Host To Device");

	return d_layer;

}

Any help will be great, but I think I’m going to start a new topic since the title of this one is not appropriate anymore.

Thanks for the help, everybody.

avidday · January 25, 2010, 6:16pm

Yeah obviously you don’t do it quite as literally as that…

To use global memory symbols, you have to declare device memory target pointers to the compiler/runtime (and the must be in the same compilation unit as the code that will write to them). Then malloc dynamic memory roughly like you are doing now, then use cudaMemcpyToSymbol to copy the global memory addresses of your malloced memory onto the symbols.

Topic		Replies	Views
Can we do malloc inside a __global__ function CUDA Programming and Performance	26	9628	February 21, 2010
Segmentation fault when calling virtual function on host CUDA Programming and Performance	9	2439	September 10, 2019
second kernel call results in segmentation fault and other annoying problems CUDA Programming and Performance	6	2159	March 15, 2009
cudaMalloc error in big loop CUDA Programming and Performance	12	15583	May 21, 2008
cudaMalloc causes segmentation fault 2 Mo is far from my 1,2 Go card memory limit CUDA Programming and Performance	7	7460	June 28, 2011
Segmentation Fault on calling cudaMalloc - I can't figure out why CUDA Programming and Performance	1	2022	November 12, 2015
cudaMalloc() leads to segment fault Jetson TX1	9	4504	June 30, 2017
CudaMalloc? CUDA Programming and Performance	11	9563	December 14, 2010
Segmentation fault (core dumped) CUDA Programming and Performance	4	13037	May 13, 2017
using cudaMalloc and cudaFree within a loop unspecified launch failure! CUDA Programming and Performance	21	37661	April 23, 2009

segmentation fault at the first cudaMalloc with --device-emulation everything was fine

Related topics