Strange dynamic allocation behavior

I have this constructor, compiled for both host and device:

__host__ __device__ neurons_network::neurons_network(unsigned long _nneurons, unsigned long _nneuronspre) {
	unsigned long i;
	if (_nneuronspre < _nneurons) {
		neurons = new neuron_axon[_nneurons];

		synapses = new synapse*[_nneurons];
		for (i = 0; i < _nneurons; i++)
			synapses[i] = new synapse[_nneuronspre];

		links = new ulong*[_nneurons];
		for (i = 0; i < _nneurons; i++)
			links[i] = new ulong[_nneuronspre];
	}
}
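In device code, new returns NULL when the device heap is exhausted, so each allocation should be checked before use. A minimal sketch with a hypothetical kernel (not the original class code):

```cuda
#include <cstdio>

// In-kernel new draws from the device heap (sized by cudaLimitMallocHeapSize).
// When that heap runs out, new returns NULL rather than throwing.
__global__ void alloc_check_demo(unsigned long n) {
    float *buf = new float[n];
    if (buf == NULL) {
        printf("device heap exhausted for n=%lu\n", n);
        return;                     // bail out instead of dereferencing NULL
    }
    buf[0] = 1.0f;                  // safe to use only after the check
    delete [] buf;
}
```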

I call it from a __global__ kernel:

__global__ void nn_constructor_neurons_network(void **cuaddr, unsigned long _nneurons, unsigned long _nneuronspre) {
	if (threadIdx.x == 0) {
		cuaddr[0] = (void *)new neurons_network(_nneurons, _nneuronspre);
	}
}

launched from the host like this:

cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);
nn_constructor_neurons_network<<<1,1>>>( … … … );
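cudaDeviceSetLimit itself returns an error code, and the limit actually granted can be read back with cudaDeviceGetLimit. A sketch of setting and verifying the heap limit (128*1024*1024 here is just an example size):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Must be called before any kernel that uses in-kernel new/malloc.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize,
                                         128 * 1024 * 1024);
    if (err != cudaSuccess) {
        printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    size_t heap = 0;
    cudaDeviceGetLimit(&heap, cudaLimitMallocHeapSize);  // read back actual size
    printf("device heap: %zu bytes\n", heap);
    return 0;
}
```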

The printf inside the device code does print the address of each allocation: it is always 4294967295 (-1).
Can anyone help me?

PS: I am using a Kepler GPU with 2.0 architecture

Kepler GPUs don’t have a 2.0 architecture.

What happens when you run your code with cuda-memcheck?

Are you doing any proper CUDA error checking anywhere?
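A common pattern for the error checking asked about above is to wrap every runtime call in a macro, and to check both the launch and the asynchronous execution of each kernel. A sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t e = (call);                                       \
        if (e != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(e), __FILE__, __LINE__);       \
        }                                                             \
    } while (0)

// Usage after a kernel launch:
//   kernel<<<1,1>>>(...);
//   CUDA_CHECK(cudaGetLastError());       // launch-time errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // asynchronous execution errors
```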

Hi, I am running it on a GTX 560M; it is a Kepler GPU. Its compute architecture is 2.1, and it supports dynamic memory allocation. The problem was simply that I had set too small a heap allocation per thread. Too simple.

GTX 560M is not a Kepler GPU; cc 2.1 is Fermi.