PTXAS Fatal: Memory Allocation Failure

I’m trying to build/run even the simplest CUDA app with no success.

I’ve installed the 7.5 toolkit on Win10, running VS 2013. I create a new CUDA project and paste any one of the Thrust example apps into it. It compiles just fine (a bunch of Thrust warnings, but it compiles and links). When I go to run it (again, this is ANY sample app), it takes forever and finally says “PTXAS Fatal: Memory Allocation Failure”. I’ve stepped into the code, and it happens on the first line that creates any variable.

Again, I’ve tried this with several different samples. Anyone have any idea what’s up?

In all my years of using CUDA, I don’t think I have ever seen that. My first thought was “corrupt installation”. Did you have a previous version of CUDA installed on this machine? I assume this is a 64-bit Windows system with plenty of system memory? Are you able to successfully run trivial CUDA samples that don’t use Thrust?

The fact that the error occurs when you run the app suggests that JIT compilation is being used, which would indicate that your build did not specify the target GPU architecture appropriate for your GPU. What is your GPU, and what target architecture do you specify in your build (e.g., the -arch switch)?

Depending on how much code is being JIT compiled (and I could imagine it’s a lot when you use Thrust), program startup could take a long time due to JIT compilation overhead. I am not aware of any code size limitation on the JIT compiler; what you see may indicate a bug, for which a bug report should be filed. Are you using the latest driver package (the JIT compiler is part of the CUDA driver)?
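As a side note: when JIT compilation does happen, the driver caches the compiled kernels between runs, so only the first launch should pay the full cost. The cache can be tuned with environment variables documented for the CUDA driver (a sketch; the values shown are examples, and on Windows you would use `set` instead of `export`):

```shell
# Enlarge the JIT compute cache (size in bytes) so large Thrust apps
# don't evict each other's compiled kernels:
export CUDA_CACHE_MAXSIZE=1073741824
# Optionally relocate the cache directory:
export CUDA_CACHE_PATH=/tmp/cuda-cache
# Or disable caching entirely while diagnosing JIT problems:
export CUDA_CACHE_DISABLE=1
```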

Are you building a debug project or a release project?

Are you building a 32-bit app or a 64-bit app?

What is the GPU you are running on?

Have you specified the correct GPU arch in your compilation command?

Thank you both for your help. More information:

I downloaded the very latest CUDA from NVIDIA and installed it on a virgin installation of VS 2013. I open up VS 2013, click on “New Project”, select “NVIDIA CUDA 7.5 Project”, and give it a name. It creates a new project, with a file called “kernel.cu” that has some basic CUDA sample code in it.

If I simply click on “run”, it builds, and runs fine. Does some simple demo and a printf. Works OK.

If, however, I select all in this .cu file and paste in ANY of the Thrust examples (from the Thrust GitHub repository), what I describe happens. It builds to completion, although there is a flurry of warnings from within Thrust that say “decorated name length exceeded, name was truncated”. I haven’t a clue what that means, but the names involved are “yuge”. For example, here’s one:

‘thrust::detail::cons<T0,thrust::detail::cons<thrust::zip_iterator<thrust::tuple<thrust::detail::normal_iterator<thrust::pointer<int,thrust::system::cuda::detail::tag,thrust::use_default,thrust::use_default>>,thrust::permutation_iterator<thrust::detail::normal_iterator<thrust::device_ptr>,thrust::detail::normal_iterator<thrust::pointer<int,thrust::system::cuda::detail::tag,thrust::use_default,thrust::use_default>>>,thrust::transform_iterator<thrust::system::cuda::detail::reduce_by_key_detail::tuple_and,thrust::zip_iterator<thrust::tuple<thrust::transform_iterator<thrust::detail::tail_flags<thrust::zip_iterator<thrust::tuple<thrust::detail::normal_iterator<thrust::pointer<int,thrust::system::cuda::detail::tag,thrust::use_default,thrust::use_default>>,thrust::detail::normal_iterator<thrust::pointer<bool,thrust::system::cuda::detail::tag,thrust::use_default,thrust::use_default>>,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type>>,thrust::equal_to<thrust::tuple<int,bool,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type>>,bool,int>::tail_flag_functor,thrust::counting_iterator<IndexType,thrust::use_default,thrust::use_default,thrust::use_default>,thrust::use_default,thrust::use_default>,thrust::detail::normal_iterator<thrust::pointer<bool,thrust::system::cuda::detail::tag,thrust::use_default,thrust::use_default>>,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type>>,thrust::use_default,thrust::use_default>,thrust::permutation_iterator<thrust::detail::normal_iterator<thrust::device_ptr>,thrust::detail::normal_iterator<thrust::pointer<int,thrust::system::cuda::detail::tag,thrust::use_default,thrust::use_default>>>,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type>>,thrust::detail::cons<thrust::detail::wrapped_function<thrust::detail::binary_transform_if_functor<thrust::plus,thrust::identity>,void>,thrust::detail::cons<int,thrust::detail::map_tuple_to_consthrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type,thrust::null_type::type>>>>::cons’

Again, lord only knows what that all means.

TXBob, this is, apparently, a 32-bit debug build of the app (I tried a 64-bit build; no change). I’m running the latest Win10 updates, Win10 Pro 64-bit OS on an x64 processor. It is a good-sized machine with an i7, 16 GB of memory, and two EVGA GeForce GTX 980s in it. I’m not specifying the GPU architecture; it’s the default, and the gencode string is -gencode=arch=compute_20,code=\"sm_20,compute_20\", which appears to match this architecture.

Thanks for any help.
Chris

Further research into this “decorated name length” warning suggests that it’s harmless and only causes debugging issues. Not sure why I’d be the first/only one to see it. No one else gets this?

GTX 980 is sm_52, not sm_20 (the compiler uses the least capable supported architecture as the default). By specifying the correct architecture, you will avoid JIT compilation from PTX.
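For example, on the nvcc command line the flag would look like the sketch below (this assumes a GTX 980, compute capability 5.2; in Visual Studio the equivalent value goes in the project’s CUDA C/C++ → Device → Code Generation property):

```shell
# Generate native machine code for sm_52 so no JIT happens at startup;
# the second -gencode also embeds compute_52 PTX for forward compatibility
# with GPUs newer than the 980.
nvcc -gencode arch=compute_52,code=sm_52 \
     -gencode arch=compute_52,code=compute_52 \
     kernel.cu -o app
```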

Thanks. So what exactly do I want there? compute_20,sm_52? Or…?

Thanks so much, I’m a newbie at CUDA – obviously :)

Cool! I changed to compute_52,sm_52 and it works!

Great. Thanks for your help. Weird that I got the error in the first place? But at least I’m moving ahead.

Further, note that I played around, and it seems that as long as I target 5.x or later, it works.

My issue has been that my eventual target architecture is the Jetson TX1, and I want to mimic that during development. So I’ll go for 5.3 (the TX1’s level).

Thanks SO much for your help.
Chris

Thanks! It’s great that this post helped me solve the same issue I had. I changed “compute_20,sm_20” to “compute_52,sm_52” and it still didn’t work. I failed to find the compute_ and sm_ values for my GPU (GTX850), so as a wild guess I changed to “compute_50,sm_50” instead, and the problem was solved!

So may I check how I can find the appropriate/optimal setting for my GPU? Thanks!

Cheers
Penney

IMHO, the easiest way to find your GPU’s compute capability is to consult the handy table in Wikipedia (https://en.wikipedia.org/wiki/CUDA), which shows the GTX 850M to be a device with compute capability 5.0.