A naive question about dereferenced pointers unspecified launch failure

Hi all,

when I start my Cuda program in device release mode I get an unspecified launch error, this doesn’t happend using the emulation mode. I read some topics in this forum about it and often it seems that a dereferenced pointer is the problem, so that in fact the error message is not the launch failure but a segmentation failure.

My program has following structure. I define three pointers, two of them point to an array of structs one to a float array. I call the first kernel passing one pointer, this pointer is passed to a device function which fills the first array. The second kernel does the same for the seconde array and the the third kernel should use the first 2 arrays to calculate results which are then written to the 3rd array and copied back to the host. so finally I pass the pointers to several kernels and device functions.

So basically my question is, can it be that the pointers are dereferenced. How can dereferenciation happen. I mean, I often read that you have to be carefull that the pointers are not dereferenced, however I don’t know how this can happen?

Can anyone give me an example how a pointer is dereferenced?

Finally I am not really sure that my problem comes from the pointers but I don’t know where to search for it anymore. So any hint would be helpful!


you have to make sure the pointers you use as input for your kernels are actually pointers to memory allocated on the GPU. The error you are seeing might come from the fact that you are passing a pointer to CPU memory to the kernel.

I tried two ways to declare my pointers:

1.) in the function which then calls the different kernels:

xy_struct* xy_coords;    //pointer for array on device 

z_struct* z_coords;     //pointer for array on device

float* pre_sum_d;

This is the way that I saw used in the several examples, however I was not sure, if the pointer really includes the address of memory space on the device, so I found a second way of decalaration.

2.) before the function which calls the kernels:

__device__ xy_struct* xy_coords;

__device__ z_struct* z_coords;

__device__ float* pre_sum_d;

In the kernel I use the array like that:

__global__ void pre_sum_calc(xy_struct* xy_coords, z_struct* z_coords, float* pre_sum ){

   int BIN = blockIdx.x*blockDim.x + threadIdx.x;


   weightedVoxel S[640]; //another array of struct which is just used in the kernel

   int BIN_n_nz;

     BIN_n_nz = get_BIN_data(xy_coords],z_coords, &S[0], BIN);  //a device function which fills S using the data out of the two arrays

       for (int k=0; k<BIN_n_nz; k++){




In the function which calls the kernels y allocate the memory in following way for both versions of pointer declaration:

#define n_bins (192*192*175)

#define  n_voxels (128*128*47)

void BIN_calculations(){

   int size_xy= 192*192;

   CUDA_SAFE_CALL(cudaMalloc((void **) &xy_coords, sizeof(xy_struct)*size_xy));

   Sinogram<<<numBlocks, numThreadsPerBlock>>>(xy_coords); //first kernel

   //here I synchronize and check for errors

   int size_z = 175;

   CUDA_SAFE_CALL(cudaMalloc((void **) &z_coords, sizeof(z_struct)*size_z));

   Michelogram<<<numBlocks, numThreadsPerBlock>>>(z_coords); //second kernel

   //here I synchronize and check for errors

   CUDA_SAFE_CALL(cudaMalloc((void**)&pre_sum_d, sizeof(float)*n_bins));

   float* pre_sum_h;

   pre_sum_h = (float*)malloc(sizeof(float)*n_voxels); 

   for (int i=0; i<n_voxels; i++){



   //Copy pre_sum array to GPU

   CUDA_SAFE_CALL(cudaMemcpy(pre_sum_d, pre_sum_h, sizeof(float)*n_voxels, cudaMemcpyHostToDevice));

   pre_sum_calc<<<numBlocks, numThreadsPerBlock>>>(xy_coords, z_coords, pre_sum_d); //3 kernel

   //here I synchronize and check for errors --------> and here occurs the launch error

   CUDA_SAFE_CALL(cudaMemcpy(pre_sum_h, pre_sum_d, sizeof(float)*n_voxels, cudaMemcpyDeviceToHost));






I didn’t add the number of threads per block and number of bloccks. I calculate them for each kernel according to the regsiter and shared memory used.

Sorry for the long code, but perhaps it is getting clearer what I am doing.

So is there a problem the way I use the pointers?


your pointer look good, so probably you are writing past the end of an array.

n_bins looks to be > n_voxels, but are you sure you are not reading/writing past the end of an array.

unspec launch error can also be caused by writing past end of array, dividing by zero also I think, so there are multiple options…

Some people run the code in emulation mode under valgrind to check for writing past the end of arrays

Yes, there is a huge problem here. You are declaring a pointer that resides on the device and then you are attempting to write the value of that pointer from the host. You are lucky (or unlucky, depending on how you look at it) that this doesn’t seg fault immeadiately. If you were to run this through valgrind, it would tell you of the problem.

What you need to do is this:

xy_struct* xy_coords;

z_struct* z_coords;

float* pre_sum_d;

CUDA_SAFE_CALL(cudaMalloc((void **) &xy_coords, sizeof(xy_struct)*size_xy));

And then call the kernel, passing these pointers as parameters (which you appear to be doing anyways? Why even declare them as device variables then).

My advice is to never use device variables. They have all the same evils that global variables do ad create added headaches. It is difficult enough to keep track of which pointers were allocated on the device and which were on the host (a d_ prefix is often used to help with this).

Ah, I completely missed that point due to the fact I have never used the device attribute.

And indeed, d_* for device pointers, h_* for host pointers in host code is useful. Just as
g_* for variables in global memory, s_* for shared memory variables in kernel code. It really helps reading other people’s code.

First of al thanks for your hints.

As I said, I tried both versions, but in fact finally it happened exactly what you predicted, my screen turned black and I had to reboot the system, so I will avoid using __device__variables!

I checked for the boundaries and in fact, my pre_sum_h was not right, so it should be n_voxels, so the host array was bigger as it should be, so shouldn’t have any influence. I fixed it and the problems still stays the same.

I didn’t know about this valgrind tool, so I downloaded it and did some tests. It seems that there is a possible memory problem in my code, this is what it sowed me:

--12125-- supp:    8 dl_relocate_object

==12125== malloc/free: in use at exit: 39,519 bytes in 114 blocks.

==12125== malloc/free: 311 allocs, 197 frees, 666,319 bytes allocated.


==12125== searching for pointers to 114 not-freed blocks.

==12125== checked 25,816,752 bytes.


==12125== 27,648 bytes in 96 blocks are possibly lost in loss record 15 of 15

==12125==    at 0x4905D27: calloc (vg_replace_malloc.c:279)

==12125==    by 0x3F0A30D5B2: _dl_allocate_tls (in /lib64/ld-2.3.4.so)

==12125==    by 0x3F0B606786: pthread_create@@GLIBC_2.2.5 (in /lib64/tls/libpthread-2.3.4.so)

==12125==    by 0x4A3C2E4: (within /usr/local/cuda/lib/libcudart.so.2.0)

==12125==    by 0x4A344F0: (within /usr/local/cuda/lib/libcudart.so.2.0)

==12125==    by 0x4A20AF9: cudaLaunch (in /usr/local/cuda/lib/libcudart.so.2.0)

==12125==    by 0x411328: cudaError cudaLaunch<char>(char*) (cuda_runtime.h:326)

==12125==    by 0x41113D: __device_stub___globfunc__Z8SinogramP9xy_struct (tmpxft_00002dc9_00000000-1_Michelogram.cudafe1.stub.c:18)

==12125==    by 0x44A901: MLEM (MLEM.cu:42)

==12125==    by 0x411364: main (main_t.cu:5)

Then I had a close look to my kernel function Sinogram(*xystruct xy_coords) (in the file named Michelogram), however I coudldn’t find any problem there. I commented out everything what it was doing and just filled the array of struct with a ‘1’ and the message from valgrind still stays the same, so I am not sure if it is really a problem. However as I never used valgrind so far, I don’t know, if some messages that come up perhaps are due to the cuda sepcific memory allocation instructions?

anyone has more experience in using valgrind and can give me an hint about the messages coing out, please?

For the kind of problems you are looking for, you don’t really care about lost or “leaked” memory at this point. That is just allocated memory that isn’t freed. Look for errors like “invalid write of size 8 at some_file.cu:some_line”. Those are the nasty out of bounds memory writes that can cause the problems you are seeing.

Of course, if you want to look for such out of bounds memory writes in your kernel with valgrind, you have to compile your code in emulation mode so that the kernel is executed on the CPU.

OK, finally I solved my problem, even I can’t exactly tell what happened. After searching through the whole code for several days I found out, that my problem came from a pointer which has the address to a array of structs which is located in shared memory (in the above mentioned code, weightedVoxel S). I pass a pointer of this array to a device function, I call this device function up to 4 times, the first time works perfectly, in the second one the program crashes and returns an unspecified launch error. However I do not know why it works perfectly in emulation mode and in the first call of the program in release mode and then fails.

So, I found the problem, I found a work around, however I still don’t understand what goes wrong and as there is no debug mode for release mode, perhaps it stays a secret forever!

Nevertheless thanks for the help!


Does the compiler ever print an warning like "can’t tell what pointer points too, assuming global memory? If the compiler emits code to read/write global memory where it should be shared memory, that could be your problem. It won’t matter in emulation mode or valgrind because all memory is host memory there…

No,there was no warning like that. I also checked the emulation mode with the valgrind tool as I you recommended before, but nothing about problems with pointers.

Another thing I tested was if I write over the limit of the array, this could happen, because the amount of elements of the array varies with each run of the device function. For first implementation I just chose the max. value which I got with the emulation version. As results between CPU and GPU implementation can be vary slightly to differences in float accuracy, I put a condition, that in case, that I could reach the end of the array, I stop the loop and would just loose some values, which shouldn’t effect too much my results. So, when I found the source of the problem I tried to imagine what could cause the launch failure, but couldn’t really figure it out.