problem with array offset

I’m pretty sure this doesn’t happen; I’ve checked it. And in any case, the program should give an error even with atomicAdd()…

This really indicates that the problem is in the parameters you pass to the kernel, not the kernel itself.

I’d suggest you check the types of the variables in sg, and of any variables that attributes of sg are copied into.

NB: in the past, when I’ve used printf-style debugging, I have found that when an error causes a crash, the tail of my debug output can be truncated and the info I really need is missing.

So I would add code to thoroughly test that the values of j1, iend1, and blocksPerGrid are valid before the call to VecAdd,

e.g.

if ( blocksPerGrid < 0 || blocksPerGrid > length_of_NghCounts/threadsPerBlock )
    printf( … );

Only call VecAdd if all parameters are within reasonable limits.

I would do this on the original code in your first post.
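To make that concrete, here is a minimal sketch of such a pre-launch check. length_of_NghCounts is a placeholder for however you track the real array length, and the exact limits are assumptions to adapt:

/* Sketch only: validate launch parameters before calling VecAdd. */
int ok = 1;
if (j1 >= iend1 || iend1 > length_of_NghCounts) {
    printf("bad slice: j1 = %u, iend1 = %u\n", j1, iend1);
    ok = 0;
}
if (blocksPerGrid == 0 ||
    blocksPerGrid > (length_of_NghCounts + threadsPerBlock - 1) / threadsPerBlock) {
    printf("bad blocksPerGrid: %u\n", blocksPerGrid);
    ok = 0;
}
if (ok)
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_vecE, d_vecNC, j1, iend1);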

====== next 3 quotes are just to answer other questions you have asked ========

Think of something like

values[threadIdx.x] = d_vecE[i];

every thread is copying one value

As you are definitely working with a slice of the array, this

for ( int offset = 0; offset < lengthOfDvecE; offset = offset+threadsPerBlock)

can be changed to

for ( int offset = sliceStart; offset < sliceEnd; offset = offset+threadsPerBlock)

where sliceStart and sliceEnd are the parameters passed to VecAdd
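Put together, a sketch of how the per-thread copy and the slice loop might combine inside a kernel. The shared array size assumes threadsPerBlock == 256 (blockDim.x inside the kernel), and the per-tile processing is left abstract:

__shared__ unsigned int values[256];          /* assumes threadsPerBlock == 256 */

for (int offset = sliceStart; offset < sliceEnd; offset += blockDim.x) {
    int i = offset + threadIdx.x;
    if (i < sliceEnd)
        values[threadIdx.x] = d_vecE[i];      /* every thread copies one value */
    __syncthreads();                          /* whole tile loaded before use */

    /* ... process values[] for this tile ... */

    __syncthreads();                          /* tile consumed before the next load */
}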

When I forgot to use atomicAdd I didn’t get a runtime error, just incorrect data, so I think the error is being caused by something else, as Lev said. Using atomicAdd causes the code to be compiled differently, which masks the error; but as i is a variable local to the thread, there should be no need to use atomicAdd on it.

That made me reread your original code: if every slice of vec_E is guaranteed to contain unique values, then you don’t need to use my suggested approach.
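In other words, the only racy access in this pattern is the increment of a shared bin. A sketch of where an atomic does and does not belong, assuming a B[d_vecE[i]]++ style kernel like in your original code:

unsigned int i = blockIdx.x * blockDim.x + threadIdx.x + offs;  /* i is private to the
                                                                   thread: no atomic needed */
if (i < endset)
    atomicAdd(&B[d_vecE[i]], 1);   /* B is shared by all threads: the atomic is only needed
                                      if two threads can see the same d_vecE value */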

I’ve checked everything and everything is right! The for loop cannot go out of the arrays’ bounds, and the types of the variables are right.

Look at this:

I’ve tried these variants:

1)  //atomicAdd(&i, offs);
    i += offs;

    crash after X kernel calls.

2)  atomicAdd(&i, offs);
    //i += offs;

    no crash until the end.

3)  atomicAdd(&i, offs);
    i += offs;

    crash after X kernel calls.

4)  i += offs;
    atomicAdd(&i, offs);

    crash after X kernel calls.

So how can this be explained? If atomicAdd() is a workaround that avoids the problem, why does calling it before or after the i increment make the program crash? And crash at exactly the same point as the first try…

And if the problem is not the kernel but the array, why does using ONLY atomicAdd avoid the error? The array is written at the same point as in the first try…

I’m really going crazy… I’ve been on this problem for more than a week and I still don’t know why this happens.

How large is ‘large’ where things start to go wrong? It might help to declare i, offs and endset as unsigned to prevent overflows, and to make sure the conditional protects the array access against any arbitrary values of i.

And of course it would be worth protecting against bad input data:

#include <stdio.h>
#include <cutil_inline.h>
#include <cuda_runtime_api.h>
#include "parallelcount.h"
#include "cuPrintf.cu"

__global__ void VecAdd(unsigned int* d_vecE, int* B, unsigned int offs, unsigned int endset, unsigned int B_length)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x + offs;

    if (i < endset)
    {
        unsigned int j = d_vecE[i];
        if (j < B_length)
            B[d_vecE[i]]++;
        else
            cuPrintf("d_vecE[%u] = %u is out of range [0..%u]\n", i, j, B_length-1);
    }
}

extern "C"
int functionCUDA(int* NghCounts, sparsegraph *sg, int *lab, int ind0, int ind2MainThr)
{
    unsigned int* d_vecE;
    int* d_vecNC;
    unsigned int i = 0;
    unsigned int iterations = 0;
    unsigned int j1 = 0;
    unsigned int iend1 = 0;
    unsigned int threadsPerBlock = 256;
    unsigned int blocksPerGrid = 0;

    size_t sizeE = sg->elen;
    size_t sizeNC = (sg->nv) * sizeof(int);

    cudaPrintfInit();

    // Allocate vectors in device memory
    cutilSafeCall( cudaMalloc((void**)&d_vecE, sizeE) );
    cutilSafeCall( cudaMalloc((void**)&d_vecNC, sizeNC) );

    // Copy vectors from host memory to device memory
    cutilSafeCall( cudaMemcpy(d_vecE, sg->e, sizeE, cudaMemcpyHostToDevice) );
    cutilSafeCall( cudaMemcpy(d_vecNC, NghCounts, sizeNC, cudaMemcpyHostToDevice) );

/*  printf("N vertex: %d\n", sg->nv);
    printf("Ind0: %u\n", ind0);
    printf("ind2MainThr: %d\n", ind2MainThr);
    printf("elen: %d\n", sg->elen);
*/

    for (i = ind0; i < ind2MainThr; i++) {
        j1 = sg->v[lab[i]];
        iend1 = j1 + sg->d[lab[i]];
        printf("%u) j1: %u\n", i-ind0+1, j1);
        printf("%u) iend1: %u\n", i-ind0+1, iend1);
        iterations = sg->d[lab[i]];
        printf("%u) iterations: %u\n", i-ind0+1, iterations);
        blocksPerGrid = (iterations + threadsPerBlock - 1) / threadsPerBlock;
        printf("%u) blocksPerGrid: %u\n", i-ind0+1, blocksPerGrid);

        if (iend1 < sizeE/sizeof(*d_vecE))
            VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_vecE, d_vecNC, j1, iend1, sizeE/sizeof(*d_vecE));
        else
            printf("sg->v[lab[%u]] = sg->v[%d] = %u out of range [0..%lu]\n",
                   i, lab[i], sg->v[lab[i]], (unsigned long) sizeE/sizeof(*d_vecE));

        cudaPrintfDisplay(stdout, true);
        printf("%d\n", i-ind0+1);
        cutilCheckMsg("kernel launch failure");
    }

#ifdef _DEBUG
    cutilSafeCall( cudaThreadSynchronize() );
#endif

    // Copy result from device memory to host memory
    cutilSafeCall( cudaMemcpy(NghCounts, d_vecNC, sizeNC, cudaMemcpyDeviceToHost) );

    // Free device memory
    if (d_vecE)
        cudaFree(d_vecE);
    if (d_vecNC)
        cudaFree(d_vecNC);

    cudaPrintfEnd();

    return 0;
}

Thanks tera for the help, but I haven’t understood why you say to pass "sizeE/sizeof(*d_vecE)"; that isn’t the slice of d_vecE I have to scan…

However, I’ve found what the problem was!!!

As Lev said, sizeE was suspicious, and it was wrong. I had assumed that sg->elen was of type size_t (as the specifications of the program I’m improving with CUDA say -_-) but it wasn’t. Fixing it has solved every crash!
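For reference, one defensive habit that catches this class of bug is to check the size explicitly instead of trusting an assumption about the struct’s types. A sketch, purely illustrative:

/* Illustration only: fail loudly if the byte count looks implausible,
   instead of trusting a guess about sg->elen's type and meaning. */
size_t sizeE = sg->elen;
if (sizeE == 0 || sizeE % sizeof(*d_vecE) != 0) {
    fprintf(stderr, "suspicious sizeE = %lu\n", (unsigned long) sizeE);
    return -1;
}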

BUT…

I’m seeing that implementing CUDA this way doesn’t improve the performance of the main program, I think because of the large number of times I have to allocate memory on the device. So I’m trying to allocate it only once, but I’ve encountered a problem:

It seems that the pointer from cudaMalloc is not retained between function calls!

This is the structure of my program:

//MainProgram.c

int* d_vecE;
int* d_vecNC;
...
AllocDeviceArray(..., d_vecE, d_vecNC);
...
functionCUDA(..., d_vecE, d_vecNC);
...
functionCUDA(..., d_vecE, d_vecNC);
...
functionCUDA(..., d_vecE, d_vecNC);
...
DeallocDeviceArray(..., d_vecE, d_vecNC);
...

//parallelcount.cu

__global__ void VecAdd(int* d_vecE, int* B, int offs, int endset)
{
       ...
}

extern "C"
int functionCUDA(int* NghCounts, sparsegraph *sg, int *lab, int ind0, int ind2MainThr, int* d_vecE, int* d_vecNC)
{
        ...
        VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_vecE, d_vecNC, j1, iend1);
        ...
}

//memoryDevice.cu

extern "C"
void AllocDeviceArray(int *sge, int *d_vecE, int *d_vecNC, int elen, int nv)
{
        ...
        cutilSafeCall( cudaMalloc((void**)&d_vecE, sizeE) );
        cutilSafeCall( cudaMalloc((void**)&d_vecNC, sizeNC) );
        ...
}

extern "C"
void DeallocDeviceArray(int *d_vecE, int* d_vecNC)
{
        // Free device memory
        if (d_vecE)
                cudaFree(d_vecE);
        if (d_vecNC)
                cudaFree(d_vecNC);
}

So the pointers to the device memory are set inside AllocDeviceArray() in memoryDevice.cu, but just outside that function, in MainProgram.c, those pointers are 0.

How can I make the allocated device memory available to all the functions? d_vecE is very large (100,000+ elements) and I want to allocate it only once…

Thanks again for your help, and sorry for my awful English :P

Yes, sorry, I originally had sizeE/sizeof(*d_vecE) and sizeNC interchanged. I thought I had corrected all occurrences before posting, but apparently had missed that one.

The problem is in how you call AllocDeviceArray(): C passes arguments by value, so modifications to the function’s arguments are not visible to the caller, and the pointers to the newly allocated arrays are lost there.

Change AllocDeviceArray() to

extern "C"

void AllocDeviceArray(int *sge, int **d_vecE, int **d_vecNC, int elen, int nv)

{

         ...

         cutilSafeCall( cudaMalloc((void**)d_vecE, sizeE) );

         cutilSafeCall( cudaMalloc((void**)d_vecNC, sizeNC) );

         ...

}

and call it in the main program like this

//MainProgram.c

int* d_vecE;
int* d_vecNC;
...
AllocDeviceArray(..., &d_vecE, &d_vecNC);
...
functionCUDA(..., d_vecE, d_vecNC);
...
functionCUDA(..., d_vecE, d_vecNC);
...
functionCUDA(..., d_vecE, d_vecNC);
...
DeallocDeviceArray(..., d_vecE, d_vecNC);
...

cudaMalloc takes a pointer to a pointer for exactly the same reason.
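As a minimal illustration of the same out-parameter idiom in plain C (alloc_buffer is a made-up name, not part of any API):

#include <stdlib.h>

/* The callee writes through the extra level of indirection,
   so the caller's pointer is updated. */
void alloc_buffer(int **out, size_t n)
{
    *out = (int *) malloc(n * sizeof(int));
}

int main(void)
{
    int *buf = NULL;
    alloc_buffer(&buf, 100000);   /* buf now points at the allocation */
    free(buf);
    return 0;
}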

Thanks, your solution works.

But my program is still too slow: with my video card the program runs 22-25 times slower than the serial version…

I’m trying to find a way to use shared memory efficiently, but without success so far :\

The scattered writes to B will kill performance, so placing B in shared memory will be beneficial.
Even if B does not fit into shared memory, it would still be worth doing multiple passes over d_vecE with only a window of B available each time. After all, as you used int, you can have at most 32768 bins, or 8 passes (if the full 32-bit width of the bin counters is needed).
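A rough sketch of what one such pass could look like, launched once per window as histWindow<<<blocks, threads, winLen*sizeof(int)>>>(d_vecE, n, B, winStart, winLen). histWindow and all parameter names are hypothetical, it uses a grid-stride loop rather than your exact indexing, and shared-memory atomics need compute capability 1.2 or later:

__global__ void histWindow(const unsigned int* d_vecE, unsigned int n,
                           int* B, unsigned int winStart, unsigned int winLen)
{
    extern __shared__ int sB[];                      // one counter per bin in this window

    // zero this block's private window of counters
    for (unsigned int k = threadIdx.x; k < winLen; k += blockDim.x)
        sB[k] = 0;
    __syncthreads();

    // grid-stride pass over d_vecE, counting only values that fall in the window
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        unsigned int j = d_vecE[i];
        if (j >= winStart && j < winStart + winLen)
            atomicAdd(&sB[j - winStart], 1);         // cheap shared-memory atomic
    }
    __syncthreads();

    // fold this block's counters into the global histogram
    for (unsigned int k = threadIdx.x; k < winLen; k += blockDim.x)
        atomicAdd(&B[winStart + k], sB[k]);
}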

What compute capability is your device? On 2.x devices you could do the different passes in adjacent blocks, hoping that these stay close enough together to reread d_vecE from the level 2 cache instead of global memory.