Strange extern shared memory error

Hi all,

I have a strange shared memory error. The main problem is in the GPUTrapez3 routine in the attached file.

When I declare the memory as extern shared, the GPU goes nuts. By nuts I mean that it stops working properly. For example, the “small” call DevSetzenMitFaktor no longer works and does nothing. The timer also stops working correctly and returns exactly the same time as before.

I would be very happy for any advice. For those who want to read the code, it is attached. The important routines are:

GPUTrapez3
DevTrapez3
main() from line 255

IntTest.cu (7.82 KB)

According to the CUDA C programming guide, unless you have set the environment variable CUDA_LAUNCH_BLOCKING to 1, your kernel calls will be asynchronous. Hence it appears to me your kernels are all running simultaneously, which might be causing what you see.

Check the return codes of all CUDA function calls. Also insert cudaStreamSynchronize(0) between kernels and check the return value to find out which kernel is causing the problem.
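A minimal sketch of that advice, assuming two placeholder kernels (kernelA and kernelB are hypothetical stand-ins, not the routines from the attached file). The helper syncAndCheck is also made up for illustration; it only uses cudaGetLastError() and cudaStreamSynchronize(0) to show which launch fails first.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-ins for the kernels in the attached file.
__global__ void kernelA(float* p) { p[threadIdx.x] = 1.0f; }
__global__ void kernelB(float* p) { p[threadIdx.x] += 1.0f; }

// Waits for the default stream to finish and reports any error from the
// preceding launch. Returns true when no error was found.
bool syncAndCheck(const char* label)
{
    cudaError_t launchErr = cudaGetLastError();      // e.g. invalid launch configuration
    cudaError_t runErr = cudaStreamSynchronize(0);   // e.g. failure while the kernel ran
    cudaError_t err = (launchErr != cudaSuccess) ? launchErr : runErr;
    if (err != cudaSuccess)
        fprintf(stderr, "%s failed with error code %d\n", label, (int)err);
    return err == cudaSuccess;
}

// Launches both kernels and stops at the first one that reports an error.
bool runBoth()
{
    float* d = nullptr;
    if (cudaMalloc(&d, 32 * sizeof(float)) != cudaSuccess) return false;

    kernelA<<<1, 32>>>(d);
    bool ok = syncAndCheck("kernelA");
    if (ok) {
        kernelB<<<1, 32>>>(d);
        ok = syncAndCheck("kernelB");
    }
    cudaFree(d);
    return ok;
}
```

Synchronizing after every launch is only meant for debugging; it removes the overlap between host and device work.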

No. Kernels launched into the same stream (or into the default stream) never run in parallel.

Right, I re-read that section with the default stream in mind and agree with you: each kernel should be queued up, waiting for the previous one to finish. I guess the purpose of the environment variable is to create a blocking call that doesn’t return to the host until the kernel is finished.

Thanks for reading through my spaghetti code. I looked at the return value of cudaStreamSynchronize(0), and the problem is caused by the extern shared declaration in the routine DevTrapez3.

In the first call, where no extern shared memory is allocated, cudaStreamSynchronize(0) returns 0. After the first call that allocates extern shared memory, cudaStreamSynchronize(0) returns 4. So far I couldn't figure out what to do with this output or how to restructure my code so that the streams are synchronized. I would be thankful for any advice.

What is also strange is that when I declare the memory in DevTrapez3 as ordinary shared memory instead of extern shared memory, every cudaStreamSynchronize(0) call returns 0 and the other kernel calls work properly.

I figured out what caused the error. In the routine DevTrapez3 the code

Stutzstellen[threadIdx.x] += Stutzstellen[threadIdx.x + 32];
Stutzstellen[threadIdx.x] += Stutzstellen[threadIdx.x + 16];
Stutzstellen[threadIdx.x] += Stutzstellen[threadIdx.x + 8];
Stutzstellen[threadIdx.x] += Stutzstellen[threadIdx.x + 4];
Stutzstellen[threadIdx.x] += Stutzstellen[threadIdx.x + 2];
Stutzstellen[threadIdx.x] += Stutzstellen[threadIdx.x + 1];

was causing the error. That was my first attempt to unroll a loop, so I forgot to put this code block inside “if(threadIdx.x<32)”. I think that caused the error because different warps tried to read and write the extern shared memory. The code works fine now. Thank you for your help.
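For readers who hit the same problem, here is a hedged sketch of a guarded reduction tail. It is not the DevTrapez3 routine; the names blockSum, sumOnGpu and buf are hypothetical. It assumes a block size that is a power of two and at least 64, and CUDA 9 or later for __syncwarp(). Without the guard, threads with high indices would read past the end of the extern shared array, whose size is whatever the third launch parameter says.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each block sums blockDim.x floats.
// Assumes blockDim.x is a power of two and at least 64.
__global__ void blockSum(const float* in, float* out)
{
    extern __shared__ float buf[];   // size comes from the 3rd launch parameter
    unsigned tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Steps that span several warps need a block-wide barrier.
    for (unsigned s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }

    // Guarded tail: only the first warp runs it, so tid + 32 <= 63 stays in bounds.
    if (tid < 32) {
        float v = buf[tid];
        #pragma unroll
        for (unsigned s = 32; s > 0; s >>= 1) {
            v += buf[tid + s];
            __syncwarp();            // all reads done before anyone writes
            buf[tid] = v;
            __syncwarp();            // all writes visible before the next read
        }
        if (tid == 0) out[blockIdx.x] = v;
    }
}

// Hypothetical host helper: sums n floats with one block of n threads.
// Returns -1 if any CUDA call fails.
float sumOnGpu(const float* host, int n)
{
    float *dIn = nullptr, *dOut = nullptr, result = -1.0f;
    if (cudaMalloc(&dIn, n * sizeof(float)) != cudaSuccess) return result;
    if (cudaMalloc(&dOut, sizeof(float)) != cudaSuccess) { cudaFree(dIn); return result; }

    if (cudaMemcpy(dIn, host, n * sizeof(float), cudaMemcpyHostToDevice) == cudaSuccess) {
        // The third launch parameter reserves n floats of dynamic shared memory.
        blockSum<<<1, n, n * sizeof(float)>>>(dIn, dOut);
        if (cudaDeviceSynchronize() == cudaSuccess)
            cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    }
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

The explicit __syncwarp() calls are a design choice for newer GPUs, where threads of one warp are not guaranteed to run in lockstep; the older warp-synchronous style relied on that implicitly.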

@tera:
Out of curiosity: what does the return value of cudaStreamSynchronize(0) actually mean?

The online documentation has a list of CUDA error codes (further down on that page) in its directory. 4 is a general launch failure, caused e.g. by accesses past the boundary of shared memory (as you found out).
You can use cudaGetErrorString() to translate the return code to a human-readable string.
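A small sketch of that translation step. The helper name describe is hypothetical; the only runtime call it relies on is cudaGetErrorString().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints a CUDA return code next to its description
// and returns the description string.
const char* describe(cudaError_t err)
{
    const char* text = cudaGetErrorString(err);
    printf("CUDA error %d: %s\n", (int)err, text);
    return text;
}
```

It could be called as describe(cudaStreamSynchronize(0)) right after a kernel launch. As far as I know, the numeric values are not the same in every toolkit release, so comparing against the symbolic name (e.g. cudaErrorLaunchFailure) is probably safer than comparing against 4.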

@Nvidia:
It seems that cudaGetErrorString() isn’t mentioned in the Programming Guide (anymore?). And the Toolkit Reference Manual doesn’t list the numerical error codes; only the online directory page I mentioned does (the detailed page describing cudaError does not). Worth changing, I believe.