Problem with arch=sm_20

Mange · March 3, 2011, 11:46am

Is there some obvious cause for this behaviour that I can’t figure out? I have some code that works fine when I compile with -arch=sm_13. But if I compile with -arch=sm_20 the compilation works without any problems, but the program crashes during the run then with a “unspecified launch failure”. I have a gtx480 card. Any suggestions?

avidday · March 3, 2011, 12:49pm

Out of bounds shared memory access, probably. Try running the code with cuda-memcheck and see what is reported.

Mange · March 3, 2011, 3:37pm

Thanks for the answer. I tried using cuda-memcheck but it not give me much information at all unfortunately. The only more output I get if I run ‘my_program’ like “cuda-memcheck ./my_program” is it starts with a

========= CUDA-MEMCHECK

//the ordinary output which I always get

Cuda operation problem : unspecified launch failure caused by CudaMemcpy host_result

========= Internal Error

The last output is just me printing some info when a Cuda-operation does not return cudaSuccess so I know where the problem occured. But I get no information from cuda-memcheck whatsoever External Image

I noticed another weird thing when I tried to debug my code with cuda-gdb. When compiling for the debugging I used the -g -G flags so I compiled like this:

nvcc -g -G Walk.cu -o WalkFDebug -lxerces-c -lmysqlpp -arch=sm_20

If I try to run this with cuda-gdb like "cuda-gdb WalkFDebug → run ", cuda-gdb crashes saying segmentation fault. But if I just run ./WalkFDebug directly the program works. But then if I compile without the debug-flags like

nvcc Walk.cu -o WalkF -lxerces-c -lmysqlpp -arch=sm_20

If I run that with ./WalkF the program crashes with a unspecified launch failure!? So to sum if I compile the code with debug flags I can run the program, but if I try to run it through cuda-gdb it crashes. If I compile the code without debugging-flags the program crashes during execution as well. This seems very strange to me, anyone have any suggestions what may be the problem here?

avidday · March 3, 2011, 3:58pm

Compiling for debugging spills everything to local memory, which can hide out of bounds shared memory errors. In compute 2.x devices, shared memory and the l1 cache share the same physical memory, which is why out of bounds shared memory access causes aborts not seen on older architectures. If it didn’t the result could be global memory corruption.

Mange · March 3, 2011, 5:00pm

Thanks again for the answer. I am not using that much shared memory I think. The only shared memory I am using in my kernel is the following:

shared unsigned char sh_progs[MAX_PROGRAM_SIZE]; //MAX_PROGRAM_SIZE = 512

and then I have

extern shared int array;

int* optValues = (int*)array;

Can this be the problem somehow? I’m launching my kernel with <<<blockDimensions, NUM_THREADS, nrLinesInFile*sizeof(int)>>> . So I’m allocating shared memory "dynamically’ and the size is based on nrLinesInFile ( which is 3 in my testcases by the way ). Is this not the way to do it on compute 2.x devices? Or do you think that I am reading/writing outside the allocated shared memory and that only gets noticed when compiling with -arch=sm_20, for the reason you explained above? ( l1 cache share the same physical memory )

Looking at my code I can’t see that I could read outside the arrays, but that is no proof for that Im not actually doing it =)

avidday · March 3, 2011, 5:19pm

That is the most likely source of the problem.

Mange · March 3, 2011, 6:12pm

Ok thanks, will have to look into it more tomorrow. But there is no problem with the “dynamic” allocation of shared memory space then? That is done the same way on 2.0 as earlier compute capabilities?

The part with

extern shared int array;

int* optValues = (int*)array;

and the kernel launch with <<<blockDimensions, NUM_THREADS, nrLinesInFile*sizeof(int)>>>

avidday · March 3, 2011, 7:50pm

None of that changes. The only difference is that Fermi will abort on code that ran ok on older hardware because it has more strict runtime shared memory protection.

Mange · March 4, 2011, 8:57am

Ok, I looked a bit in a whitepaper regarding Fermi architecture and, exactly like you said, there is 64 kb on-chip memory for L1 cache and shared memory. If I understand it correctly you can switch between how much that is used for one or the other. Like 16Kb L1 cache and 48 kb shared and the other way around. To my question, how do you set up which “ratio” between L1 cache and shared memory you will use. Or is that done implicitly? I mean if I use 30kb of shared memory then automatically 34 kb is used for L1 cache? Or do I need to use some flags when I compile my code to set that up?

avidday · March 4, 2011, 10:31am

There are only two possible ways to configure shared memory - the default 16kb shared/48kb L1 cache or 48kb shared/16kb L1. If you do nothing, Fermi will work exactly like a GT200 and give your 16Kb of shared memory per block. If your code requests more shared memory than the configuration permits, you will get an explicit kernel launch failure with an cudaErrorInvalidConfiguration runtime error (not sure what the driver API will return). Also just like a GT200. If you want to change the shared/L1 cache configuration, the 3.2 API includes cudaThreadSetCacheConfig() and cuCtxSetCacheConfig().

But none of this has anything to do with your problem. Your problem is that somewhere in your code there is an out of bounds shared memory read or write. Fermi will explicitly detect this and abort the kernel. A GT200 won’t and the code will run.

Mange · March 4, 2011, 11:27am

Thanks again, appreciate your answers.

Now I am totally confused thou, I tried to dig deeper and find any out of bounds shared memory read or writes. I started with commentating out basically the whole kernel so there is hardly anything left just these lines

float a = 0;

int blockIndex = blockIdx.x + gridDim.x*blockIdx.y;

result[blockIndex*NUM_THREADS+threadIdx.x].profit = blockIndex;;

result[blockIndex*NUM_THREADS+threadIdx.x].trades = threadIdx.x;

If I do this I can run the code without any error-output when compiling with -arch=sm_20 .

But if I compile the code for debugging and run it through cuda-gdb I get a segmentation fault on the line

cudaMalloc((void**)&kernel_program, MAX_PROGRAM_SIZE*sizeof(float));

This happens both if I compile with -arch=sm_13 or -arch=sm_20. I can’t see how that is happening, or am I, doing something really basic error here. The variables used there is declared like this: (so it shouldn’t be any problems I think)

float* kernel_program;

const int MAX_PROGRAM_SIZE = 512;

I also have tried to remove all usage of shared memory at all from the original code and use global memory instead but it still crashes compiled with -arch=sm_20 .

I wonder if the cuda-version I’m using or driver version or something is affecting the result here for me, or if cuda-gdb is bugged.

I’m working on Linux ( Gentoo ) and have following versions of relevant stuff:

dev-util/nvidia-cuda-sdk : version 3.0

dev-util/nvidia-cuda-toolkit : version 3.0

x11-drivers/nvidia-drivers : version 195.36.31 ( this one is quite old I noticed now )

The more recent versions of cuda 3.1, 3.2 seems to be masked in gentoo for some reason that I don’t know.

Any input here would be greatly appreciated, because I am totally confused at the moment.

avidday · March 4, 2011, 11:33am

Posting isolated, random out-of-context code snippets is not very helpful. If you can post a concise, clean, easy to read and follow repro case which demonstrates the problem, someone might be able to take a look at it.

Mange · March 4, 2011, 3:30pm

Sorry about that, here is a very scaled down version of some ( quite useless ) code. This code gives me a segmantation fault when I’m running this on a computer with a GTX480 card ( and the same version of drivers etc as posted earlier ).

#include <iostream>

const int NUM_THREADS = 32;

#include "WalkForwardKernel.cu"

using namespace std;

int main(){

	cudaSetDevice(0);

	float* kernel_program;

	float program[6] = {25, 122, 4, 137, 126, 143};;

	dim3 blockDimensions(4505,5);

	cout << "Stop for debugging " << endl;

	cudaMalloc((void**)&kernel_program, 6*sizeof(float));

	cudaMemcpy(kernel_program, program, sizeof(float)*6, cudaMemcpyHostToDevice);

	runWalkForwardAnalysis<<<blockDimensions, NUM_THREADS>>>(kernel_program);

	cudaMemcpy(program, kernel_program, sizeof(float)*6, cudaMemcpyDeviceToHost);

	cout << "Program 0 " << program[0] << endl;

	return 0;

}

__global__ static void runWalkForwardAnalysis(float* program) {

	int blockIndex = blockIdx.x + gridDim.x*blockIdx.y;

	program[0] = 111;

}

This code is very trivial and does not do much useful things, and I can’t understand why it crashes with a segmentation fault.

avidday · March 4, 2011, 3:39pm

It doesn’t crash for me, sorry (cuda 3.1, Ubuntu 9.04, GTX 470):

avidday@cuda:~$ module load cuda/3.1

avidday@cuda:~$ nvcc -arch=sm_20 walkforward.cu -o walkforward.exe

walkforward.cu(9): warning: variable "blockIndex" was declared but never referenced

walkforward.cu(9): warning: variable "blockIndex" was declared but never referenced

avidday@cuda:~$ cuda-gdb ./walkforward.exe 

NVIDIA (R) CUDA Debugger

3.1 beta release

Portions Copyright (C) 2008,2009,2010 NVIDIA Corporation

GNU gdb 6.6

Copyright (C) 2006 Free Software Foundation, Inc.

GDB is free software, covered by the GNU General Public License, and you are

welcome to change it and/or distribute copies of it under certain conditions.

Type "show copying" to see the conditions.

There is absolutely no warranty for GDB.  Type "show warranty" for details.

This GDB was configured as "x86_64-unknown-linux-gnu"...

Using host libthread_db library "/lib/libthread_db.so.1".

(cuda-gdb) run

Starting program: /home/david/walkforward.exe 

[Thread debugging using libthread_db enabled]

[New process 31711]

[New Thread 140242886149888 (LWP 31711)]

Stop for debugging 

[Launch of CUDA Kernel 0 on Device 0]

[Termination of CUDA Kernel 0 on Device 0]

Program 0 111

Program exited normally.

(cuda-gdb)

Mange · March 4, 2011, 3:53pm

OK thanks for testing it.

Must be some weird thing with the drivers and version of cuda or something then.

Will have to experiment a bit with it.

Satish_Salian · March 4, 2011, 7:34pm

Can you use the latest CUDA 4.0RC release to see if cuda-memcheck shows where the problem is in your application? If you are registered dev you should have an email with a link to the 4.0 biniaries.
4.0 cuda-memcheck detects many more application errors.

Mange · March 4, 2011, 11:12pm

Actually the problem disappeared when I updated to cuda 3.2 and newer nvidia-drivers. Now I can run the original code compiled with arch=sm_20 and it does not crash and gives 0 errors when ran with cuda-memcheck. So somehow the drivers/cuda-version seems to have given me the problem.

I still can’t run the code through cuda-gdb thou, crashes with some error that I have no clue what it means.

BACKTRACE (4 frames):
cuda-gdb[0x46379e]
/lib/libc.so.6(+0x324d0)[0x7f0d294384d0]
cuda-gdb[0x534146]
cuda-gdb[0x53418d]

Another weird thing is that even when it crashes with cuda-gdb the process does not actually stop. Noticed this when I got the error-message “error: The CUDA driver failed initialization (error=20)”. After that I ran ‘top’ and saw the process(es) is still there. I googled a bit and from what I’ve found it seems this problem may be related to X in some way. The cuda application is executed on a “headless machine” so I only have access to “terminal view” or “console view” or what is the proper name. Somehow it seems it gets stuck in a infinite loop for some reason. Anyone have any idea why this is happening and what to do to solve the problem?

Topic		Replies	Views
CUDA Toolkit 3.0 update GPU HW debugging tools to replace device emulation CUDA Programming and Performance	44	29436	April 29, 2010
Cuda-gdb doesn't break and/or step into Kernels CUDA Programming and Performance	26	53718	August 1, 2011
Memory Corruption on a Fermi-Class GPU Error only on Fermis, program works on non-Fermis. CUDA Programming and Performance	18	7108	July 22, 2011
Cuda Portability and SharedMem vs Cache CUDA Programming and Performance	9	11621	October 18, 2010
CUDA-GDB captured "Illegal access to address" exception when I invoke child kernel, but the result is correct when free run CUDA Programming and Performance	6	1717	March 20, 2017
Cuda-gdb internal-error of copy_type on basic fortran example CUDA-GDB cuda-gdb	5	973	January 2, 2024
Random execution times and freezes with concurent kernels - 2 CUDA Programming and Performance	5	2640	November 10, 2015
gtx 465 performance CUDA Programming and Performance	34	4301	August 18, 2010
How to debug kernel throwing an exception? CUDA Programming and Performance	16	7922	June 14, 2013
Potential Bug, cuda-memcheck can someone verify? Program crashing on GPU initialisation with cuda-me CUDA Programming and Performance	11	3461	April 24, 2020

Problem with arch=sm_20

Related topics