Questions about CUDA 3.1

I heard that CUDA now supports C++ and recursion. I am wondering whether I can use std::list or std::vector in __global__ and __device__ functions.

  1. Recursion works in 3.1 + Fermi card:

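    #include <stdio.h>
    #include <stdlib.h>
    #include <iostream>
    #include <cuda.h>

    // Recursive factorial on the device; recursion needs sm_20 (Fermi).
    __device__ int fact(int f)
    {
        if (f == 0)
            return 1;
        return f * fact(f - 1);
    }

    // Kernel: a single thread writes fact(f) to *result.
    __global__ void gfact(int * result, int f)
    {
        *result = fact(f);
    }

    int main()
    {
        // Allow mapped (zero-copy) host memory; call before other runtime calls.
        cudaSetDeviceFlags(cudaDeviceMapHost);

        dim3 Dg(1, 1);
        dim3 Db(1);
        size_t Ns = 0;
        int * result;
        int errMalloc = cudaHostAlloc((void**)&result, sizeof(int), cudaHostAllocMapped);
        int * dresult;
        int err = cudaHostGetDevicePointer((void**)&dresult, (void*)result, 0);
        gfact<<<Dg, Db, Ns>>>(dresult, 5);

        cudaError_t ek = cudaGetLastError();
        if (ek)
        {
            std::cout << "Error in kernel call " << ek << std::endl;
            return 1;
        }

        // Wait for the kernel before reading the mapped host memory
        // (cudaDeviceSynchronize() in CUDA 4.0 and later).
        cudaThreadSynchronize();
        std::cout << "Value = " << *result << std::endl;

        return 0;
    }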

However, I have to compile with sm_20 and run it on a Fermi card. On a GeForce 8900, the kernel call fails with error code 8 (cudaErrorInvalidDeviceFunction).

It won’t compile under sm_11:

“C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.1\bin\nvcc.exe” …

produces: “Error: Recursive function call is not supported yet: fact(int)”

  2. For as long as I’ve used CUDA (since 2.x), it has supported C++ features like classes, templates, etc. You can use classes in your kernel code, but depending on what you want, it can be cumbersome. In classes, you have to put the __device__ qualifier (and __host__ too, if the member is also called from host code) on your member functions, but in my experience you cannot put it on the constructors or destructors. You can pass a class object between host and device, either as a copy (e.g., class foo {…}; __global__ void bar(foo v) {…} int main() { foo xxx; …; bar<<<…>>>(xxx); …}) or by native pointer (__global__ void bar(foo * v)). But reference parameters (“& var”) are not possible.
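As a rough sketch of the pass-by-copy case above: the class Foo, the kernel bar, and the HD macro are names I made up for illustration, not anything from the toolkit. The macro expands to the CUDA qualifiers under nvcc and to nothing under a plain C++ compiler, so the same member function can be exercised on the host first.

```cpp
// HD is a hypothetical helper macro: CUDA qualifiers under nvcc,
// nothing under an ordinary C++ compiler.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical example class. It has no user-written constructor or
// destructor, so it can be brace-initialized and copied byte for byte.
class Foo {
public:
    int scale;
    int offset;

    // Callable from kernels because of the qualifier.
    HD int apply(int x) const { return scale * x + offset; }
};

#ifdef __CUDACC__
// The kernel takes Foo by value: the object is copied into the
// kernel's argument space at launch. A Foo& parameter would not work.
__global__ void bar(Foo v, int * out)
{
    *out = v.apply(5);
}
#endif

// Host-side equivalent of the kernel body, using the same member function.
int bar_host(Foo v)
{
    return v.apply(5);
}
```

On the host, Foo f = {3, 2}; bar_host(f) exercises the same code path the kernel would take. Under nvcc the launch would look like bar<<<1, 1>>>(f, d_out), where d_out is assumed to be a device-accessible int pointer you allocated yourself.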

However, the bigger problem is representing complex data structures. For example, consider a graph with nodes and edges. You could represent a node as a class containing a list of native pointers to all its edge objects, and an edge as a class with two native pointers for its start and end nodes, but you’d have to copy those objects to device space and translate every native host pointer into the corresponding device memory address. Yuk. Instead, people usually represent pointers as integer offsets into one big block, and allocate the objects out of that block. You could hide a lot of this by overloading operators such as -> and *, and you would have to write your own malloc and operator new for that memory space. I’ve been hacking at a dlmalloc for this purpose, so that I can maybe allocate objects in device code, but mostly just so I can have common code between the host and device to access the complex data structure.
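Here is a minimal sketch of the integer-offset idea, with everything simplified: the Graph, Node, Edge, and Handle names and the fixed capacities are assumptions of mine, and there is no real allocator or operator overloading, just indices into arrays inside one block. Because the struct holds no native pointers, a single byte-wise copy to the device should leave every link valid.

```cpp
// HD is a hypothetical helper macro so the same code builds with or without nvcc.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

typedef int Handle;        // an index into the block, used instead of a pointer
enum { kNull = -1 };       // plays the role of a null pointer

struct Edge {
    Handle from;           // index of the start node
    Handle to;             // index of the end node
    Handle next;           // next edge in the start node's edge list
};

struct Node {
    int value;
    Handle first_edge;     // head of this node's edge list, or kNull
};

// Hypothetical fixed-capacity graph stored in one contiguous block.
struct Graph {
    enum { kMaxNodes = 16, kMaxEdges = 64 };
    Node nodes[kMaxNodes];
    Edge edges[kMaxEdges];
    int node_count;
    int edge_count;

    HD void init() { node_count = 0; edge_count = 0; }

    // Bump-allocate a node out of the block; returns kNull when full.
    HD Handle add_node(int value) {
        if (node_count >= kMaxNodes) return kNull;
        Handle h = node_count++;
        nodes[h].value = value;
        nodes[h].first_edge = kNull;
        return h;
    }

    // Bump-allocate an edge and push it onto the front of from's list.
    HD Handle add_edge(Handle from, Handle to) {
        if (edge_count >= kMaxEdges) return kNull;
        Handle h = edge_count++;
        edges[h].from = from;
        edges[h].to = to;
        edges[h].next = nodes[from].first_edge;
        nodes[from].first_edge = h;
        return h;
    }

    // Walk the edge list by following indices rather than pointers.
    HD int out_degree(Handle n) const {
        int count = 0;
        for (Handle e = nodes[n].first_edge; e != kNull; e = edges[e].next)
            ++count;
        return count;
    }
};
```

The trade-off is that every access goes through the block (nodes[h] rather than *p), which is exactly the part that overloaded operators could hide in a fuller version.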

Lots of other issues, of course. Just play around and you’ll see.

I can’t get std::vector declarations and calls to compile, but I didn’t expect them to anyway. Functions called from device code have to have the __device__ qualifier, and you still cannot call a __global__ function from a __device__ function. Calls to printf are supposed to work in 3.1 on a Fermi card, but I can’t seem to get them to work. That’s kind of disappointing after it was promised, especially with Nsight still in beta, unless I’m doing something wrong.

If you need something like std::vector, you should look into the Thrust library.