Problem with arch=sm_20

Is there some obvious cause for this behaviour that I can’t figure out? I have some code that works fine when I compile with -arch=sm_13, but if I compile with -arch=sm_20 the compilation succeeds without any problems, yet the program then crashes during the run with an “unspecified launch failure”. I have a GTX480 card. Any suggestions?

Out of bounds shared memory access, probably. Try running the code with cuda-memcheck and see what is reported.

Thanks for the answer. I tried using cuda-memcheck but unfortunately it does not give me much information at all. The only additional output I get when I run my program as “cuda-memcheck ./my_program” is:


//the ordinary output which I always get

Cuda operation problem : unspecified launch failure caused by CudaMemcpy host_result

========= Internal Error

The second-to-last line is just me printing some info when a CUDA operation does not return cudaSuccess, so I know where the problem occurred. But I get no information from cuda-memcheck whatsoever :/
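For reference, the kind of checked-call helper the poster describes (their actual code is not shown, so this is a hypothetical sketch) usually looks something like this:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper, not the poster's actual code: wraps every runtime
// API call and prints file/line plus the CUDA error string on failure.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",             \
                    __FILE__, __LINE__, cudaGetErrorString(err));    \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Usage: CUDA_CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost));
```

Note that an “unspecified launch failure” reported by a cudaMemcpy is usually a stale error left over from the preceding kernel launch, so also checking cudaGetLastError() right after the launch helps narrow down where the failure really comes from.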

I noticed another weird thing when I tried to debug my code with cuda-gdb. For the debug build I used the -g -G flags, so I compiled like this:

nvcc -g -G -o WalkFDebug -lxerces-c -lmysqlpp -arch=sm_20

If I try to run this with cuda-gdb (“cuda-gdb WalkFDebug”, then “run”), cuda-gdb crashes with a segmentation fault. But if I just run ./WalkFDebug directly, the program works. Then if I compile without the debug flags, like

nvcc -o WalkF -lxerces-c -lmysqlpp -arch=sm_20

and run that with ./WalkF, the program crashes with an unspecified launch failure!? So to sum up: if I compile the code with debug flags I can run the program, but if I try to run it through cuda-gdb it crashes; and if I compile the code without debug flags the program crashes during execution as well. This seems very strange to me. Does anyone have any suggestions what the problem may be here?

Compiling for debugging spills everything to local memory, which can hide out of bounds shared memory errors. On compute 2.x devices, shared memory and the L1 cache share the same physical memory, which is why an out of bounds shared memory access causes aborts not seen on older architectures. If it didn’t, the result could be global memory corruption.

Thanks again for the answer. I don’t think I am using that much shared memory. The only shared memory I use in my kernel is the following:

__shared__ unsigned char sh_progs[MAX_PROGRAM_SIZE]; //MAX_PROGRAM_SIZE = 512

and then I have

extern __shared__ int array[];

int* optValues = (int*)array;

Can this be the problem somehow? I’m launching my kernel with <<<blockDimensions, NUM_THREADS, nrLinesInFile*sizeof(int)>>>. So I’m allocating shared memory “dynamically” and the size is based on nrLinesInFile (which is 3 in my test cases, by the way). Is this not the way to do it on compute 2.x devices? Or do you think that I am reading/writing outside the allocated shared memory, and that this only gets noticed when compiling with -arch=sm_20 for the reason you explained above (shared memory and the L1 cache share the same physical memory)?

Looking at my code I can’t see how I could read outside the arrays, but that is no proof that I’m not actually doing it =)

That is the most likely source of the problem.
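One cheap way to confirm such an overrun (a debugging sketch, not code from the thread; the index expression is hypothetical) is to wrap every shared memory index in a range check that prints the offending value instead of silently corrupting memory:

```cuda
#include <cstdio>

#define MAX_PROGRAM_SIZE 512

// Debugging sketch: report any out-of-range shared-memory index together
// with the block/thread that produced it. Remove once the bug is found.
__device__ int checkedIdx(int i, int size)
{
    if (i < 0 || i >= size)
        printf("OOB shared index %d (size %d) in block %d thread %d\n",
               i, size, blockIdx.x, threadIdx.x);
    return i;
}

__global__ void kernelSketch(int n)
{
    __shared__ unsigned char sh_progs[MAX_PROGRAM_SIZE];
    int i = threadIdx.x * n;  // hypothetical index expression
    sh_progs[checkedIdx(i, MAX_PROGRAM_SIZE)] = 0;
}
```

Device-side printf needs sm_20; on sm_13 one could write a flag to a global memory location instead and inspect it on the host.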

Ok thanks, I will have to look into it more tomorrow. But there is no problem with the “dynamic” allocation of shared memory space then? That is done the same way on compute 2.0 as on earlier compute capabilities?

The part with

extern __shared__ int array[];

int* optValues = (int*)array;

and the kernel launch with <<<blockDimensions, NUM_THREADS, nrLinesInFile*sizeof(int)>>>

None of that changes. The only difference is that Fermi will abort on code that ran ok on older hardware because it has more strict runtime shared memory protection.
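To illustrate, the dynamic shared memory pattern from the posts above looks like this on any compute capability (a minimal sketch reusing the thread’s names; the index is illustrative only):

```cuda
// Minimal sketch of the pattern discussed above; identical on sm_13 and sm_20.
extern __shared__ int array[];        // sized by the 3rd launch parameter

__global__ void kernelSketch(int nrLinesInFile)
{
    int* optValues = (int*)array;     // reinterpret the dynamic region
    if (threadIdx.x < nrLinesInFile)  // must stay within nrLinesInFile ints
        optValues[threadIdx.x] = 0;
}

// Host side, with nrLinesInFile = 3 as in the thread:
// kernelSketch<<<blockDimensions, NUM_THREADS,
//                nrLinesInFile*sizeof(int)>>>(nrLinesInFile);
```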

Ok, I looked a bit in a whitepaper on the Fermi architecture and, exactly like you said, there is 64 KB of on-chip memory for L1 cache and shared memory. If I understand it correctly, you can switch how much is used for one or the other, like 16 KB L1 cache and 48 KB shared, or the other way around. So my question is: how do you set the ratio between L1 cache and shared memory? Or is that done implicitly? I mean, if I use 30 KB of shared memory, is 34 KB then automatically used for L1 cache? Or do I need to use some flags when I compile my code to set that up?

There are only two possible ways to configure shared memory: the default 16 KB shared/48 KB L1 cache, or 48 KB shared/16 KB L1. If you do nothing, Fermi will work exactly like a GT200 and give you 16 KB of shared memory per block. If your code requests more shared memory than the configuration permits, you will get an explicit kernel launch failure with a cudaErrorInvalidConfiguration runtime error (not sure what the driver API will return), again just like a GT200. If you want to change the shared/L1 cache configuration, the 3.2 API includes cudaThreadSetCacheConfig() and cuCtxSetCacheConfig().
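For example, with the 3.2 runtime API the larger shared memory split could be requested like this (a sketch; the call sets a preference for the calling host thread, and the driver honours it where possible):

```cuda
#include <cuda_runtime.h>

// Sketch: request the 48 KB shared / 16 KB L1 split before launching
// kernels that need more than 16 KB of shared memory per block.
int main()
{
    cudaThreadSetCacheConfig(cudaFuncCachePreferShared);  // 48 KB shared
    // ... launch kernels here ...
    // cudaFuncCachePreferL1 would select 16 KB shared / 48 KB L1 instead.
    return 0;
}
```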

But none of this has anything to do with your problem. Your problem is that somewhere in your code there is an out of bounds shared memory read or write. Fermi will explicitly detect this and abort the kernel. A GT200 won’t and the code will run.

Thanks again, appreciate your answers.

Now I am totally confused though. I tried to dig deeper and find any out of bounds shared memory reads or writes. I started by commenting out basically the whole kernel, so there is hardly anything left, just these lines:

float a = 0;

int blockIndex = blockIdx.x + gridDim.x*blockIdx.y;

result[blockIndex*NUM_THREADS+threadIdx.x].profit = blockIndex;

result[blockIndex*NUM_THREADS+threadIdx.x].trades = threadIdx.x;

If I do this I can run the code without any error output when compiling with -arch=sm_20.

But if I compile the code for debugging and run it through cuda-gdb I get a segmentation fault on the line

cudaMalloc((void**)&kernel_program, MAX_PROGRAM_SIZE*sizeof(float));

This happens whether I compile with -arch=sm_13 or -arch=sm_20. I can’t see how that is happening; am I making some really basic error here? The variables used there are declared like this (so there shouldn’t be any problem, I think):

float* kernel_program;

const int MAX_PROGRAM_SIZE = 512;

I also tried removing all usage of shared memory from the original code and using global memory instead, but it still crashes when compiled with -arch=sm_20.

I wonder if the CUDA version or driver version I’m using is affecting the result here, or if cuda-gdb is buggy.

I’m working on Linux ( Gentoo ) and have following versions of relevant stuff:

dev-util/nvidia-cuda-sdk : version 3.0

dev-util/nvidia-cuda-toolkit : version 3.0

x11-drivers/nvidia-drivers : version 195.36.31 ( this one is quite old I noticed now )

The more recent versions of CUDA (3.1, 3.2) seem to be masked in Gentoo for some reason that I don’t know.

Any input here would be greatly appreciated, because I am totally confused at the moment.

Posting isolated, random out-of-context code snippets is not very helpful. If you can post a concise, clean, easy to read and follow repro case which demonstrates the problem, someone might be able to take a look at it.

Sorry about that, here is a very scaled down version of some (quite useless) code. This code gives me a segmentation fault when I run it on a computer with a GTX480 card (and the same versions of drivers etc. as posted earlier).

#include <iostream>

const int NUM_THREADS = 32;

#include ""

using namespace std;

__global__ static void runWalkForwardAnalysis(float* program);

int main(){

	float* kernel_program;

	float program[6] = {25, 122, 4, 137, 126, 143};

	dim3 blockDimensions(4505,5);

	cout << "Stop for debugging " << endl;

	cudaMalloc((void**)&kernel_program, 6*sizeof(float));

	cudaMemcpy(kernel_program, program, sizeof(float)*6, cudaMemcpyHostToDevice);

	runWalkForwardAnalysis<<<blockDimensions, NUM_THREADS>>>(kernel_program);

	cudaMemcpy(program, kernel_program, sizeof(float)*6, cudaMemcpyDeviceToHost);

	cout << "Program 0 " << program[0] << endl;

	return 0;
}

__global__ static void runWalkForwardAnalysis(float* program) {

	int blockIndex = blockIdx.x + gridDim.x*blockIdx.y;

	program[0] = 111;
}
This code is very trivial and does not do anything useful, and I can’t understand why it crashes with a segmentation fault.

It doesn’t crash for me, sorry (cuda 3.1, Ubuntu 9.04, GTX 470):

avidday@cuda:~$ module load cuda/3.1

avidday@cuda:~$ nvcc -arch=sm_20 -o walkforward.exe

warning: variable "blockIndex" was declared but never referenced

warning: variable "blockIndex" was declared but never referenced

avidday@cuda:~$ cuda-gdb ./walkforward.exe 

NVIDIA (R) CUDA Debugger

3.1 beta release

Portions Copyright (C) 2008,2009,2010 NVIDIA Corporation

GNU gdb 6.6

Copyright (C) 2006 Free Software Foundation, Inc.

GDB is free software, covered by the GNU General Public License, and you are

welcome to change it and/or distribute copies of it under certain conditions.

Type "show copying" to see the conditions.

There is absolutely no warranty for GDB.  Type "show warranty" for details.

This GDB was configured as "x86_64-unknown-linux-gnu"...

Using host libthread_db library "/lib/".

(cuda-gdb) run

Starting program: /home/david/walkforward.exe 

[Thread debugging using libthread_db enabled]

[New process 31711]

[New Thread 140242886149888 (LWP 31711)]

Stop for debugging 

[Launch of CUDA Kernel 0 on Device 0]

[Termination of CUDA Kernel 0 on Device 0]

Program 0 111

Program exited normally.


OK thanks for testing it.

Must be some weird thing with the drivers and version of cuda or something then.

Will have to experiment a bit with it.

Can you use the latest CUDA 4.0RC release to see if cuda-memcheck shows where the problem is in your application? If you are a registered developer you should have an email with a link to the 4.0 binaries. The 4.0 cuda-memcheck detects many more application errors.

Actually, the problem disappeared when I updated to CUDA 3.2 and newer nvidia-drivers. Now I can run the original code compiled with -arch=sm_20; it does not crash and gives 0 errors when run with cuda-memcheck. So somehow the drivers/CUDA version seems to have been causing the problem.

I still can’t run the code through cuda-gdb though; it crashes with some error that I have no clue about:

BACKTRACE (4 frames):

Another weird thing is that even when it crashes under cuda-gdb, the process does not actually stop. I noticed this when I got the error message “error: The CUDA driver failed initialization (error=20)”. After that I ran ‘top’ and saw the process(es) are still there. I googled a bit, and from what I’ve found this problem may be related to X in some way. The CUDA application is executed on a headless machine, so I only have access to a terminal/console view. Somehow it seems to get stuck in an infinite loop for some reason. Does anyone have any idea why this is happening and what to do to solve the problem?