Segmentation fault at the first cudaMalloc (with --device-emulation everything was fine)

Hi, I was developing my project with the --device-emulation option because I can use printf and it was easier for me to debug.
Now that everything seems to work fine, I tried to run the same project without that option, and I get a segmentation fault at the first cudaMalloc call. I have no idea why this is happening, since everything was alright while I was emulating the device.
Does anyone know what the cause could be?

To compile I use:

nvcc -g -G --compiler-bindir $(HOME)/bin -c -arch sm_11 paralelLayer.cu

and to link:

nvcc -o preann xmm32.o paralelLayer.o commonFunctions.o vector.o xmmVector.o layer.o cudaLayer.o xmmLayer.o neuralNet.o cudaNeuralNet.o xmmNeuralNet.o chronometer.o main.o -L/usr/local/cuda/lib -lcudart

Any help will be great.

Hi,

device-emulation runs your code on the CPU in a serial way. So, beyond testing and debugging index calculations to verify that you indeed access the correct positions in your input/output arrays, it is mostly meaningless. Code that runs perfectly in emulation can crash all day on the GPU.

So what you should do is post the code sample that crashes, and also add the following after the line that causes the crash (probably the kernel launch):

char errorMessage[1000] = "Error running this....";
char buff[ 1000 ];

cudaError_t err = cudaGetLastError();
if( cudaSuccess != err )
{
	sprintf_s( buff, "Cuda error: %s in file '%s' in line %i : %s[%d].\n", errorMessage, __FILE__, __LINE__, cudaGetErrorString( err ), err );
	printf( "%s", buff ); // actually print the message before bailing out
	exit(-1);
}

err = cudaThreadSynchronize();
if( cudaSuccess != err )
{
	sprintf_s( buff, "Cuda error: %s in file '%s' in line %i : %s[%d].\n", errorMessage, __FILE__, __LINE__, cudaGetErrorString( err ), err );
	printf( "%s", buff );
	exit(-1);
}

Make this a macro and you can then put it after kernel invocations, memory allocations, etc…
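[Editor's note: one possible way to fold the snippet above into a macro. The macro name and exact layout are my own, not from the thread; cudaThreadSynchronize() is kept because that was the API of this CUDA era.]

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: checks the last CUDA error, then synchronizes
// so that errors from asynchronous kernel launches are caught as well.
#define CUDA_CHECK(msg)                                                       \
    do {                                                                      \
        cudaError_t err = cudaGetLastError();                                 \
        if (cudaSuccess == err)                                               \
            err = cudaThreadSynchronize(); /* surfaces async kernel errors */ \
        if (cudaSuccess != err) {                                             \
            fprintf(stderr, "Cuda error: %s in file '%s' in line %d: %s[%d].\n", \
                    (msg), __FILE__, __LINE__,                                \
                    cudaGetErrorString(err), (int)err);                       \
            exit(-1);                                                         \
        }                                                                     \
    } while (0)

// Usage after a launch or an allocation:
//   myKernel<<<grid, block>>>(args);
//   CUDA_CHECK("myKernel launch");
```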

eyal

Hi, I had a similar but simpler function to handle errors:

void checkCUDAError(const char *msg)
{
	cudaError_t err = cudaGetLastError();
	if( cudaSuccess != err )
	{
		fprintf(stderr, "Cuda error: %s: %s.\n", msg,
		        cudaGetErrorString( err ) );
		exit(EXIT_FAILURE);
	}
}

I have it all in just one file, but I like your __LINE__ thing as a way to avoid using cuda-gdb.

I just don’t understand cudaThreadSynchronize() and sprintf_s…

Anyway, I didn’t post the code because it isn’t the kernel function that crashes: it crashes the first time I call cudaMalloc. This seems very strange, but maybe I’m missing something.

Here’s the crashing code:

extern "C" void** InputsToDevice(void** host_inputs, unsigned* host_inputSizes, VectorType* host_types, unsigned numberInputs)
{
	size_t size = numberInputs * sizeof(void*);
	void** dev_inputs;
	cudaMalloc((void**)&(dev_inputs), size); // Here crashes!!!

	for (unsigned i=0; i < numberInputs; i++){

		if (host_types[i] == FLOAT){
			size = host_inputSizes[i] * sizeof(float);
		} else {
			size = (((host_inputSizes[i] - 1)/ BITS_PER_UNSIGNED) + 1) * sizeof(unsigned);
		}
		cudaMalloc((void**)&(dev_inputs[i]), size);
		cudaMemcpy(dev_inputs[i], host_inputs[i], size, cudaMemcpyHostToDevice);
	}
	checkCUDAError("Inputs To Device");
	return dev_inputs;
}

I don’t know, I’m kinda lost.

Thanks

cudaMalloc((void**)&(dev_inputs), size); // Here crashes!!!

Here dev_inputs is already void**. By taking the address, you make it void***.

Thank you, jaka.
But what I don’t understand is why this didn’t crash in emulation mode as well.

I guess I have to revise all the cudaMalloc calls then.

Thanks a lot.

hi, jaka.
Removing the & it crashes just the same. And I’m not sure it shouldn’t be void*** before the (void**) cast anyway.

eyal, my function for handling the errors doesn’t work. The error isn’t printed like it used to be while I was using --device-emulation.
Do you have any idea why?

Thanks

I actually thought it would crash here:

cudaMalloc((void**)&(dev_inputs[i]), size);

You can’t do that. You have already allocated dev_inputs as a device array of pointers, so you cannot dereference its elements from host code. That is a guaranteed segfault. If you want that sort of memory structure, the assignment of values into the array of pointers must be done in device code (like a small initialization kernel), or using cudaMemcpyToSymbol.
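[Editor's note: a third option, not spelled out in the thread, is to stage the pointer table in host memory and copy it to the device in one go. The function name below is my own; it is a sketch of the pattern, not the poster's code.]

```cuda
#include <cuda_runtime.h>

// Build the array of device pointers on the host, where host code is
// allowed to write each slot, then copy the whole table to the device
// with a single cudaMemcpy. No kernel or symbol is needed.
void** makeDevicePointerTable(void** host_bufs, size_t* sizes, unsigned n)
{
    void** host_table = new void*[n];          // staging array in host memory
    for (unsigned i = 0; i < n; i++) {
        cudaMalloc(&host_table[i], sizes[i]);  // allocate each device buffer
        cudaMemcpy(host_table[i], host_bufs[i], sizes[i],
                   cudaMemcpyHostToDevice);    // fill it with the host data
    }
    void** dev_table;
    cudaMalloc((void**)&dev_table, n * sizeof(void*));   // device-side table
    cudaMemcpy(dev_table, host_table, n * sizeof(void*),
               cudaMemcpyHostToDevice);        // copy the pointer values over
    delete[] host_table;
    return dev_table;                          // dereferenceable from kernels only
}
```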

You’re right, it was crashing there. I mixed up the line number with the one in main.cpp. What a stupid mistake!

I’ll study the cudaMemcpyToSymbol thing and the initialization kernel solution.

Thank you very much.

Hi, I’ve changed my code to:

extern "C" void** InputsToDevice(void** host_inputs, unsigned* host_inputSizes, VectorType* host_types, unsigned numberInputs)
{
	size_t size = numberInputs * sizeof(void*);
	void** dev_inputs;
	cudaMalloc((void**)&dev_inputs, size);

	for (unsigned i=0; i < numberInputs; i++){

		if (host_types[i] == FLOAT){
			size = host_inputSizes[i] * sizeof(float);
		} else {
			size = (((host_inputSizes[i] - 1)/ BITS_PER_UNSIGNED) + 1) * sizeof(unsigned);
		}
		void* aux;
		cudaMalloc((void**)&aux, size);
		cudaMemcpy(aux, host_inputs[i], size, cudaMemcpyHostToDevice);
		cudaMemcpyToSymbol(&(dev_inputs[i]), &aux, sizeof(void*), cudaMemcpyDeviceToDevice);
	}
	checkCUDAError("Inputs To Device");
	return dev_inputs;
}

And I get a cuda error:

invalid device symbol

I’ve tried

cudaMemcpyToSymbol(&(dev_inputs[i]), &aux, sizeof(void*), cudaMemcpyDeviceToDevice);

and cudaMemcpyToSymbol(&(dev_inputs[i]), &aux, sizeof(void*), cudaMemcpyDeviceToDevice);

but the result is the same.

I’ve also tried:

cudaMemcpyToSymbol(dev_inputs, &aux, sizeof(void*), i * sizeof(void*), cudaMemcpyHostToDevice); (same error)

and

cudaMemcpyToSymbol(dev_inputs, &aux, sizeof(void*), i * sizeof(void*), cudaMemcpyDeviceToDevice);

With this I get “invalid device pointer”.

Do I have to write a kernel just for doing an assignment?
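[Editor's note: a sketch of the "small initialization kernel" option mentioned earlier in the thread. The kernel name and launch shape are my own illustration.]

```cuda
#include <cuda_runtime.h>

// A kernel executes on the device, so it is allowed to write into the
// device-resident pointer table. One thread is enough for one assignment.
__global__ void setPointerSlot(void** table, unsigned i, void* ptr)
{
    table[i] = ptr; // plain assignment, but performed in device code
}

// Host side, inside the allocation loop:
//   setPointerSlot<<<1, 1>>>(dev_inputs, i, aux);
```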

I realised that I don’t need the dev_inputs vector itself to be in device memory, just the inputs it points to.

The method will be:

extern "C" void** InputsToDevice(void** host_inputs, unsigned* host_inputSizes, VectorType* host_types, unsigned numberInputs)
{
	size_t size;
	void** dev_inputs = new void*[numberInputs];

	for (unsigned i=0; i < numberInputs; i++){

		if (host_types[i] == FLOAT){
			size = host_inputSizes[i] * sizeof(float);
		} else {
			size = (((host_inputSizes[i] - 1)/ BITS_PER_UNSIGNED) + 1) * sizeof(unsigned);
		}
		cudaMalloc((void**)&(dev_inputs[i]), size);
		cudaMemcpy(dev_inputs[i], host_inputs[i], size, cudaMemcpyHostToDevice);
	}
	checkCUDAError("Inputs To Device");
	return dev_inputs;
}

But I have another method that copies data to the device, and I get the same (or a very similar) problem there.

extern "C" struct_Layer* LayerHostToDevice(struct_Layer* h_layer, VectorType inputType, VectorType outputType){

	struct_Layer* d_layer;
	cudaMalloc((void**)&d_layer, sizeof(struct_Layer));

	size_t size = sizeof(unsigned);
	cudaMemcpy(&(d_layer->numberInputLayers), &(h_layer->numberInputLayers), size, cudaMemcpyHostToDevice);
	cudaMemcpy(&(d_layer->totalWeighsPerOutput), &(h_layer->totalWeighsPerOutput), size, cudaMemcpyHostToDevice);
	cudaMemcpy(&(d_layer->outputSize), &(h_layer->outputSize), size, cudaMemcpyHostToDevice);
	cudaMemcpy(&(d_layer->functionType), &(h_layer->functionType), sizeof(FunctionType), cudaMemcpyHostToDevice);

	size = h_layer->numberInputLayers * sizeof(unsigned);
	cudaMalloc((void**)&(d_layer->inputLayerSize), size); // Now crashes here!!! I think the reason is the same :(
	cudaMemcpy(d_layer->inputLayerSize, h_layer->inputLayerSize, size, cudaMemcpyHostToDevice);

	size = h_layer->numberInputLayers * sizeof(void*);
	cudaMalloc((void**)d_layer->inputNeurons, size);

	if (outputType == FLOAT){
		size = sizeof(float) * h_layer->outputSize * h_layer->totalWeighsPerOutput;
	} else {
		size = sizeof(unsigned char) * h_layer->outputSize * h_layer->totalWeighsPerOutput;
	}
	cudaMalloc((void**)&(d_layer->weighs), size);
	cudaMemcpy(d_layer->weighs, h_layer->weighs, size, cudaMemcpyHostToDevice);

	if (outputType == FLOAT){
		size = sizeof(float) * h_layer->outputSize;
	} else {
		size = sizeof(unsigned) * (((h_layer->outputSize - 1)/ BITS_PER_UNSIGNED) + 1);
	}
	cudaMalloc((void**)&(d_layer->outputNeurons), size);
	cudaMemcpy(d_layer->outputNeurons, h_layer->outputNeurons, size, cudaMemcpyHostToDevice);

	size = h_layer->outputSize * sizeof(float);
	cudaMalloc((void**)&(d_layer->thresholds), size);
	cudaMemcpy(d_layer->thresholds, h_layer->thresholds, size, cudaMemcpyHostToDevice);

	checkCUDAError("Layer Host To Device");
	return d_layer;
}

Any help will be great, but I think I’m going to start a new topic since the title of this one is not appropriate anymore.

Thanks for the help, everybody.

Yeah, obviously you don’t do it quite as literally as that…

To use global memory symbols, you have to declare device memory target pointers to the compiler/runtime (and they must be in the same compilation unit as the code that will write to them). Then malloc dynamic memory roughly like you are doing now, and use cudaMemcpyToSymbol to copy the global memory addresses of your malloced memory onto the symbols.
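[Editor's note: a minimal sketch of what this could look like. The symbol and function names are my own illustration; the key point is that the __device__ pointer is a symbol the runtime knows about, so cudaMemcpyToSymbol can write the malloced buffer's address into it.]

```cuda
#include <cuda_runtime.h>

// Global-memory symbol, declared in the same compilation unit as the
// kernels that will read it.
__device__ float* g_inputs;

void setInputsSymbol(const float* host_data, size_t n)
{
    float* d_buf;
    cudaMalloc((void**)&d_buf, n * sizeof(float));       // dynamic allocation
    cudaMemcpy(d_buf, host_data, n * sizeof(float),
               cudaMemcpyHostToDevice);                  // copy the data over
    // Copy the *address* of the malloced buffer onto the symbol; kernels
    // can then simply dereference g_inputs.
    cudaMemcpyToSymbol(g_inputs, &d_buf, sizeof(float*), 0,
                       cudaMemcpyHostToDevice);
}
```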