problem with array offset

I’m pretty sure this doesn’t happen; I’ve checked it. And in any case, the program should give an error even with atomicAdd()…

This really indicates that the problem is in the parameters you pass to the kernel, not the kernel itself.

I’d suggest you check the types of the variables in sg, and of any variables that attributes of sg are copied into.

NB: in the past, when I’ve used printf-style debugging, I have found that when an error causes a crash, the tail of my debug output can be truncated and the info I really need is missing.

So I would add code to thoroughly test that the values of j1, iend1, and blocksPerGrid are valid before the call to VecAdd,

e.g.

if ( blocksPerGrid < 0 || blocksPerGrid > length_of_NghCounts/threadsPerBlock )
    printf( … );

Only call VecAdd if all parameters are within reasonable limits.

I would do this on the original code in your first post.
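To make that concrete, here is a minimal sketch of such a pre-launch check. length_of_NghCounts is a placeholder for however you track the real array length, and the exact limits are assumptions to adapt:

/* Sketch only: validate launch parameters before calling VecAdd. */
int ok = 1;
if (j1 >= iend1 || iend1 > length_of_NghCounts) {
    printf("bad slice: j1 = %u, iend1 = %u\n", j1, iend1);
    ok = 0;
}
if (blocksPerGrid == 0 ||
    blocksPerGrid > (length_of_NghCounts + threadsPerBlock - 1) / threadsPerBlock) {
    printf("bad blocksPerGrid: %u\n", blocksPerGrid);
    ok = 0;
}
if (ok)
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_vecE, d_vecNC, j1, iend1);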

====== next 3 quotes are just to answer other questions you have asked ========

Think of something like

values[threadIdx.x] = d_vecE[i];

every thread is copying one value

As you are definitely working with a slice of the array, this

for ( int offset = 0; offset < lengthOfDvecE; offset = offset+threadsPerBlock)

can be changed to

for ( int offset = sliceStart; offset < sliceEnd; offset = offset+threadsPerBlock)

where sliceStart and sliceEnd are the parameters passed to VecAdd
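Put together, a sketch of how the per-thread copy and the slice loop might combine inside a kernel. The shared array size assumes threadsPerBlock == 256 (blockDim.x inside the kernel), and the per-tile processing is left abstract:

__shared__ unsigned int values[256];          /* assumes threadsPerBlock == 256 */

for (int offset = sliceStart; offset < sliceEnd; offset += blockDim.x) {
    int i = offset + threadIdx.x;
    if (i < sliceEnd)
        values[threadIdx.x] = d_vecE[i];      /* every thread copies one value */
    __syncthreads();                          /* whole tile loaded before use */

    /* ... process values[] for this tile ... */

    __syncthreads();                          /* tile consumed before the next load */
}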

When I forgot to use atomicAdd I didn’t get a runtime error, just incorrect data, so I think the error is being caused by something else, as Lev said. Using atomicAdd causes the code to be compiled differently, which masks the error; but as i is a variable local to the thread, there should be no need to use atomicAdd on it.

That made me reread your original code: if every slice of vec_E is guaranteed to contain unique values, then you don’t need to use my suggested approach.
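In other words, the only racy access in this pattern is the increment of a shared bin. A sketch of where an atomic does and does not belong, assuming a B[d_vecE[i]]++ style kernel like in your original code:

unsigned int i = blockIdx.x * blockDim.x + threadIdx.x + offs;  /* i is private to the
                                                                   thread: no atomic needed */
if (i < endset)
    atomicAdd(&B[d_vecE[i]], 1);   /* B is shared by all threads: the atomic is only needed
                                      if two threads can see the same d_vecE value */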

I’ve checked everything and everything is right! The for loop cannot go out of the arrays’ bounds, and the types of the variables are right.

Look at this:

I’ve tried these variants:

1)  //atomicAdd(&i, offs);
    i += offs;

    crash after X kernel calls.

2)  atomicAdd(&i, offs);
    //i += offs;

    no crash until the end.

3)  atomicAdd(&i, offs);
    i += offs;

    crash after X kernel calls.

4)  i += offs;
    atomicAdd(&i, offs);

    crash after X kernel calls.

So how can this be explained? If atomicAdd() is a workaround that avoids the problem, why does calling it before or after the i increment make the program crash? And crash at exactly the same point as the first try…

And if the problem is not the kernel but the array, why does using ONLY atomicAdd avoid the error? The array is written at the same point as in the first try…

I’m really going crazy… I’ve been on this problem for more than a week and I still don’t know why this happens.

How large is ‘large’ where things start to go wrong? It might help to declare i, offs and endset as unsigned to prevent overflows, and to make sure the conditional protects the array access against any arbitrary values of i.

And of course it would be worth protecting against bad input data:

#include <stdio.h>
#include <cutil_inline.h>
#include <cuda_runtime_api.h>
#include "parallelcount.h"
#include "cuPrintf.cu"

__global__ void VecAdd(unsigned int* d_vecE, int* B, unsigned int offs, unsigned int endset, unsigned int B_length)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x + offs;

    if (i < endset)
    {
        unsigned int j = d_vecE[i];
        if (j < B_length)
            B[d_vecE[i]]++;
        else
            cuPrintf("d_vecE[%u] = %u is out of range [0..%u]\n", i, j, B_length-1);
    }
}

extern "C"
int functionCUDA(int* NghCounts, sparsegraph *sg, int *lab, int ind0, int ind2MainThr)
{
    unsigned int* d_vecE;
    int* d_vecNC;
    unsigned int i = 0;
    unsigned int iterations = 0;
    unsigned int j1 = 0;
    unsigned int iend1 = 0;
    unsigned int threadsPerBlock = 256;
    unsigned int blocksPerGrid = 0;

    size_t sizeE = sg->elen;
    size_t sizeNC = (sg->nv) * sizeof(int);

    cudaPrintfInit();

    // Allocate vectors in device memory
    cutilSafeCall( cudaMalloc((void**)&d_vecE, sizeE) );
    cutilSafeCall( cudaMalloc((void**)&d_vecNC, sizeNC) );

    // Copy vectors from host memory to device memory
    cutilSafeCall( cudaMemcpy(d_vecE, sg->e, sizeE, cudaMemcpyHostToDevice) );
    cutilSafeCall( cudaMemcpy(d_vecNC, NghCounts, sizeNC, cudaMemcpyHostToDevice) );

/*  printf("N vertex: %d\n", sg->nv);
    printf("Ind0: %u\n", ind0);
    printf("ind2MainThr: %d\n", ind2MainThr);
    printf("elen: %d\n", sg->elen);
*/

    for (i = ind0; i < ind2MainThr; i++) {
        j1 = sg->v[lab[i]];
        iend1 = j1 + sg->d[lab[i]];
        printf("%u) j1: %u\n", i-ind0+1, j1);
        printf("%u) iend1: %u\n", i-ind0+1, iend1);
        iterations = sg->d[lab[i]];
        printf("%u) iterations: %u\n", i-ind0+1, iterations);
        blocksPerGrid = (iterations + threadsPerBlock - 1) / threadsPerBlock;
        printf("%u) blocksPerGrid: %u\n", i-ind0+1, blocksPerGrid);

        if (iend1 < sizeE/sizeof(*d_vecE))
            VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_vecE, d_vecNC, j1, iend1, sizeE/sizeof(*d_vecE));
        else
            printf("sg->v[lab[%u]] = sg->v[%d] = %u out of range [0..%lu]\n",
                   i, lab[i], sg->v[lab[i]], (unsigned long) sizeE/sizeof(*d_vecE));

        cudaPrintfDisplay(stdout, true);
        printf("%d\n", i-ind0+1);
        cutilCheckMsg("kernel launch failure");
    }

#ifdef _DEBUG
    cutilSafeCall( cudaThreadSynchronize() );
#endif

    // Copy result from device memory to host memory
    cutilSafeCall( cudaMemcpy(NghCounts, d_vecNC, sizeNC, cudaMemcpyDeviceToHost) );

    // Free device memory
    if (d_vecE)
        cudaFree(d_vecE);
    if (d_vecNC)
        cudaFree(d_vecNC);

    cudaPrintfEnd();

    return 0;
}

Thanks tera for the help, but I haven’t understood why you say to pass "sizeE/sizeof(*d_vecE)"; that isn’t the slice of d_vecE I have to scan…

However, I’ve found what the problem was!!!

As Lev said, sizeE was suspicious, and it was wrong. I had assumed that sg->elen was of type size_t (as the specifications of the program I’m improving with CUDA say -_-) but it wasn’t. Fixing it has solved every crash!
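For reference, one defensive habit that catches this class of bug is to check the size explicitly instead of trusting an assumption about the struct’s types. A sketch, purely illustrative:

/* Illustration only: fail loudly if the byte count looks implausible,
   instead of trusting a guess about sg->elen's type and meaning. */
size_t sizeE = sg->elen;
if (sizeE == 0 || sizeE % sizeof(*d_vecE) != 0) {
    fprintf(stderr, "suspicious sizeE = %lu\n", (unsigned long) sizeE);
    return -1;
}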

BUT…

I’m seeing that implementing CUDA this way doesn’t improve the performance of the main program, I think because of the large number of times I have to allocate memory on the device. So I’m trying to allocate it only once, but I’ve encountered a problem:

It seems that the pointer from cudaMalloc is not retained between function calls!

This is the structure of my program:

//MainProgram.c

int* d_vecE;
int* d_vecNC;
...
AllocDeviceArray(..., d_vecE, d_vecNC);
...
functionCUDA(..., d_vecE, d_vecNC);
...
functionCUDA(..., d_vecE, d_vecNC);
...
functionCUDA(..., d_vecE, d_vecNC);
...
DeallocDeviceArray(..., d_vecE, d_vecNC);
...

//parallelcount.cu

__global__ void VecAdd(int* d_vecE, int* B, int offs, int endset)
{
       ...
}

extern "C"
int functionCUDA(int* NghCounts, sparsegraph *sg, int *lab, int ind0, int ind2MainThr, int* d_vecE, int* d_vecNC)
{
        ...
        VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_vecE, d_vecNC, j1, iend1);
        ...
}

//memoryDevice.cu

extern "C"
void AllocDeviceArray(int *sge, int *d_vecE, int *d_vecNC, int elen, int nv)
{
        ...
        cutilSafeCall( cudaMalloc((void**)&d_vecE, sizeE) );
        cutilSafeCall( cudaMalloc((void**)&d_vecNC, sizeNC) );
        ...
}

extern "C"
void DeallocDeviceArray(int *d_vecE, int* d_vecNC)
{
        // Free device memory
        if (d_vecE)
                cudaFree(d_vecE);
        if (d_vecNC)
                cudaFree(d_vecNC);
}

So the pointers to the device memory are set inside AllocDeviceArray() in memoryDevice.cu, but just outside that function, in MainProgram.c, those pointers are 0.

How can I make the allocated device memory available to all the functions? d_vecE is very large (100,000+ elements) and I want to allocate it only once…

Thanks again for your help, and sorry for my awful English :P

Yes, sorry, I originally had sizeE/sizeof(*d_vecE) and sizeNC interchanged. I thought I had corrected all occurrences before posting, but apparently had missed that one.

The problem is in how you call AllocDeviceArray(): C passes arguments by value, so modifications to the function’s arguments are not visible to the caller, and the pointers to the newly allocated arrays are lost there.

Change AllocDeviceArray() to

extern "C"

void AllocDeviceArray(int *sge, int **d_vecE, int **d_vecNC, int elen, int nv)

{

         ...

         cutilSafeCall( cudaMalloc((void**)d_vecE, sizeE) );

         cutilSafeCall( cudaMalloc((void**)d_vecNC, sizeNC) );

         ...

}

and call it in the main program like this

//MainProgram.c

int* d_vecE;
int* d_vecNC;
...
AllocDeviceArray(..., &d_vecE, &d_vecNC);
...
functionCUDA(..., d_vecE, d_vecNC);
...
functionCUDA(..., d_vecE, d_vecNC);
...
functionCUDA(..., d_vecE, d_vecNC);
...
DeallocDeviceArray(..., d_vecE, d_vecNC);
...

cudaMalloc takes a pointer to a pointer for exactly the same reason.
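As a minimal illustration of the same out-parameter idiom in plain C (alloc_buffer is a made-up name, not part of any API):

#include <stdlib.h>

/* The callee writes through the extra level of indirection,
   so the caller's pointer is updated. */
void alloc_buffer(int **out, size_t n)
{
    *out = (int *) malloc(n * sizeof(int));
}

int main(void)
{
    int *buf = NULL;
    alloc_buffer(&buf, 100000);   /* buf now points at the allocation */
    free(buf);
    return 0;
}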

Thanks, your solution works.

But my program is still too slow: with my video card the program runs 22-25 times slower than the serial version…

I’m trying to find a way to use shared memory efficiently, but without success so far :\

The scattered writes to B will kill performance, so placing B in shared memory will be beneficial.
Even if B does not fit into shared memory, it would still be worth doing multiple passes over d_vecE with only a window of B available each time. After all, as you used int, you can have at most 32768 bins, or 8 passes (if the full 32-bit width of the bin counters is needed).
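A rough sketch of what one such pass could look like, launched once per window as histWindow<<<blocks, threads, winLen*sizeof(int)>>>(d_vecE, n, B, winStart, winLen). histWindow and all parameter names are hypothetical, it uses a grid-stride loop rather than your exact indexing, and shared-memory atomics need compute capability 1.2 or later:

__global__ void histWindow(const unsigned int* d_vecE, unsigned int n,
                           int* B, unsigned int winStart, unsigned int winLen)
{
    extern __shared__ int sB[];                      // one counter per bin in this window

    // zero this block's private window of counters
    for (unsigned int k = threadIdx.x; k < winLen; k += blockDim.x)
        sB[k] = 0;
    __syncthreads();

    // grid-stride pass over d_vecE, counting only values that fall in the window
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        unsigned int j = d_vecE[i];
        if (j >= winStart && j < winStart + winLen)
            atomicAdd(&sB[j - winStart], 1);         // cheap shared-memory atomic
    }
    __syncthreads();

    // fold this block's counters into the global histogram
    for (unsigned int k = threadIdx.x; k < winLen; k += blockDim.x)
        atomicAdd(&B[winStart + k], sB[k]);
}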

What compute capability is your device? On 2.x devices you could do the different passes in adjacent blocks, hoping that these stay close enough together to reread d_vecE from the level 2 cache instead of global memory.