CUDA C++ - problem reading arrays in heap memory

I am trying to parallelize the computation of a metric on the nodes of a graph.

My approach is to have each thread calculate the metric for one node (since the calculation for each node is independent).

Each thread must calculate some values based on the clusters formed by the neighbors of its node and store them in an array whose size is initially unknown (and different for each node).

I can’t use an extern __shared__ array because each thread has to compute its own array, which can’t be shared.

I can’t declare a (max) fixed array size because it would be very inefficient for my task.

The solution I adopted (thanks to the solution proposed for this question) is to create the array in the kernel with the new operator (e.g. int* array = new int[]), so each thread allocates, initializes, and computes its own array.

Once I’m done writing the array, I need to read the values back for the next steps of the metric calculation. I have checked that, up to this point, the values written to the array are correct.

The problem arises when the thread tries to read the elements it previously wrote: it seems to read values written by other threads.

Aren’t the arrays created with new private for each thread?

Could it be a heap size problem? (I have already tried raising the heap size above the default 8 MB, e.g. to 256 MB.)
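For reference, this is roughly how I’m enlarging the device heap with cudaDeviceSetLimit before the launch (the exact value varies while testing):

    // must be called before launching any kernel that uses in-kernel new/malloc
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 256 * 1024 * 1024);   // 256 MB instead of the default 8 MB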

How can I solve this problem?

Yes, the arrays are separate from each other. However they are not “private” in what I would consider to be the usual sense of that word. The pointer returned by new points into the logical global space. That means that if the pointer were somehow passed to another thread, that other thread could use it to read the data written by the first thread.
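Here is a minimal standalone sketch of that behavior (not your code, just an illustration of one thread reading through a pointer allocated by another):

#include <cstdio>

__global__ void sharedAlloc() {

    __shared__ int* p;                        // one pointer visible to the whole block

    if (threadIdx.x == 0) {
        p = new int[4];                       // device-heap allocation: logically global space
        if (p != NULL) p[0] = 42;
    }
    __syncthreads();

    if (threadIdx.x == 1 && p != NULL)
        printf("thread 1 reads %d\n", p[0]); // another thread reads through the same pointer

    __syncthreads();
    if (threadIdx.x == 0 && p != NULL) delete [] p;
}

int main() {
    sharedAlloc<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}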

You haven’t provided a full example so that’s about as far as I can go. I suggest the usual general debug practices:

  • make sure you are using proper cuda error checking
  • run your code with compute-sanitizer

and the debug best practice associated with in-kernel new or malloc usage:

  • before using the returned pointer, test it against NULL. The device runtime signals an allocation error by returning a NULL pointer. If the pointer is NULL, don’t attempt to dereference it.

Note that compute-sanitizer is able to detect out-of-bounds accesses to allocations created with in-kernel new, which might be the issue here based on your description.
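In code, the NULL test is simply something along these lines (a minimal sketch, with n standing in for whatever size you need):

    int* buf = new int[n];
    if (buf == NULL) {   // the device runtime signals an allocation failure by returning NULL
        return;          // bail out rather than dereference it
    }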

This is my __global__ function:

__global__ void expectedForce(int* IR_vec, int* IC_vec, int numOfNodes, int maxDegree, double* d_exf) {

    int seed = blockDim.x * blockIdx.x + threadIdx.x;

    if (seed < numOfNodes && seed != 0) {
        double ExF = 0; // metric I want to calculate for each node
        int degreeIndex = 0;
        int* distOne = new int[maxDegree]; // array containing node's neighbors

        if (&distOne == NULL) { printf("distOne failed\n"); return; }

        ... // calculating node's neighbors and putting them in distOne array

        int* degrees = new int[] {0}; // array storing the rank of the neighboring clusters for each node -> this is the problematic array

        if (&degrees == NULL) { printf("degrees failed\n"); return; }

        ... // for each cluster I calculate its degree (number of arcs coming out of the cluster) by a __device__ function and store it in degrees array 
               degrees[degreeIndex] = clusterDegree; 

        // if I check the values with a printf they are correctly written in the degrees array

        for (int K = 0; K < degreeIndex; K++) { // here I have to normalize values by their final sum
            if (degrees[K] != 0) {
                norm = (float)degrees[K] / degreesTotalSum; // PROBLEM: the thread reads the wrong values here, maybe the ones written to the degrees array by another thread
                ExF -= log(norm) * norm;
            }
        }

        d_exf[seed] = ExF;

    }
}


This is an extract from the main:

    int threadsPerBlock = 1024;
    int blocksPerGrid = (numOfNodes + threadsPerBlock - 1) / threadsPerBlock;


    expectedForce <<<blocksPerGrid, threadsPerBlock >>> (IR_vec, IC_vec, numOfNodes, maxDegree, d_exf);
    CUDA_CALL(cudaDeviceSynchronize());
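(CUDA_CALL is just my error-checking wrapper, roughly the usual pattern, assuming <cstdio> and <cstdlib> are included:)

    #define CUDA_CALL(call)                                              \
        do {                                                             \
            cudaError_t err_ = (call);                                   \
            if (err_ != cudaSuccess) {                                   \
                fprintf(stderr, "CUDA error %s at %s:%d\n",              \
                        cudaGetErrorString(err_), __FILE__, __LINE__);   \
                exit(EXIT_FAILURE);                                      \
            }                                                            \
        } while (0)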

Your check if (&distOne == NULL) is not correct, and neither are any others like it.

You want something like this:

if (distOne == NULL)

Also this doesn’t look valid to me:

int* degrees = new int[] {0};

What is your expectation about the size of the allocation there? It doesn’t make any sense to me. Furthermore, when I attempt that in my own test case, I get compile errors.
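If the intent is one entry per neighboring cluster, the allocation needs an explicit size. For example, something along these lines (assuming maxDegree is an upper bound on the number of clusters for a node):

    int* degrees = new int[maxDegree];                    // explicit size for the allocation
    if (degrees == NULL) { printf("degrees failed\n"); return; }
    for (int i = 0; i < maxDegree; i++) degrees[i] = 0;   // new[] does not zero-initialize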

I would also suggest that any new operations in your kernel code also have a corresponding delete operation (when you are finished using that particular allocation). That is just good C++ programming practice, not unique or specific to CUDA, and I don’t know if your actual code has those or not.
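In your kernel, that would mean something like this at the end of the seed < numOfNodes branch:

    delete [] distOne;   // pair each in-kernel new[] with a delete[] once the thread is done with it
    delete [] degrees;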

Beyond that I wouldn’t be able to help without a complete example.