problem with array offset

Hi,

I'm trying to write a program that counts how many times each number occurs in a certain part of an array and correspondingly increments that number's index in another array.

The code:

#include <stdio.h>
#include <cutil_inline.h>
#include <cuda_runtime_api.h>
#include "parallelcount.h"
#include "cuPrintf.cu"

__global__ void VecAdd(int* d_vecE, int* B, int offs, int endset)
{
    int i = threadIdx.x;
    i += offs;
    if (i < endset)
    {
        B[d_vecE[i]]++;
        cuPrintf("i: %d, Value is: %d\n", i, B[d_vecE[i]]);
    }
}

extern "C"
int functionCUDA(int* NghCounts, sparsegraph *sg, int *lab, int ind0, int ind2MainThr)
{
    int* d_vecE;
    int* d_vecNC;
    int i = 0;
    int iterations = 0;
    int j1 = 0;
    int iend1 = 0;
    int threadsPerBlock = 256;
    int blocksPerGrid = 0;

    size_t sizeE = sg->elen;
    size_t sizeNC = (sg->nv) * sizeof(int);

    cudaPrintfInit();

    // Allocate vectors in device memory
    cutilSafeCall( cudaMalloc((void**)&d_vecE, sizeE) );
    cutilSafeCall( cudaMalloc((void**)&d_vecNC, sizeNC) );

    // Copy vectors from host memory to device memory
    cutilSafeCall( cudaMemcpy(d_vecE, sg->e, sizeE, cudaMemcpyHostToDevice) );
    cutilSafeCall( cudaMemcpy(d_vecNC, NghCounts, sizeNC, cudaMemcpyHostToDevice) );

    /*
    printf("N vertex: %d\n", sg->nv);
    printf("Ind0: %d\n", ind0);
    printf("ind2MainThr: %d\n", ind2MainThr);
    printf("elen: %d\n", sg->elen);
    */

    for (i = ind0; i < ind2MainThr; i++) {
        j1 = sg->v[lab[i]];
        iend1 = j1 + sg->d[lab[i]];
        printf("%d) j1: %d\n", i-ind0+1, j1);
        printf("%d) iend1: %d\n", i-ind0+1, iend1);

        iterations = sg->d[lab[i]];
        printf("%d) iterations: %d\n", i-ind0+1, iterations);

        blocksPerGrid = (iterations + threadsPerBlock - 1) / threadsPerBlock;
        printf("%d) blocksPerGrid: %d\n", i-ind0+1, blocksPerGrid);

        VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_vecE, d_vecNC, j1, iend1);
        cudaPrintfDisplay(stdout, true);
        printf("%d\n", i-ind0+1);

        cutilCheckMsg("kernel launch failure");
    }

#ifdef _DEBUG
    cutilSafeCall( cudaThreadSynchronize() );
#endif

    // Copy result from device memory to host memory
    cutilSafeCall( cudaMemcpy(NghCounts, d_vecNC, sizeNC, cudaMemcpyDeviceToHost) );

    // Free device memory
    if (d_vecE)
        cudaFree(d_vecE);
    if (d_vecNC)
        cudaFree(d_vecNC);

    cudaPrintfEnd();

    return 0;
}

sg->e is the array containing the numbers; NghCounts is the array containing the counts of the numbers in sg->e.

j1 is the offset in sg->e at which I have to start searching, and iend1 is the offset at which I have to stop, so if j1 == 5 and iend1 == 7 I have to search sg->e[5], sg->e[6], sg->e[7]. Again, if sg->e[5] == 10, sg->e[6] == 20 and sg->e[7] == 2, I have to do: NghCounts[2]++, NghCounts[10]++, NghCounts[20]++.

So what is the problem?

The problem is that when I run this code, I get "cutilCheckMsg() CUTIL CUDA error : kernel launch failure : unspecified launch failure." on the line after the kernel call.

If I remove the line "i += offs;" from the kernel function, I get no error, but obviously that is not a solution, because I have to check only a part of the array, and by deleting that line I suppose I end up checking the whole array, right?

So what am I doing wrong?

Thanks

You probably refer to a device variable in host code:

cutilSafeCall( cudaMalloc((void**)&d_vecE, sizeE) );
cutilSafeCall( cudaMalloc((void**)&d_vecNC, sizeNC) );

d_vecE points to device memory, but it returns a pointer to host. Also check sizeE; it looks suspicious.

Initially I had declared d_vecE in the host code and the error was the same. I declared it as a device variable in one of my tries, but without success.

Why does sizeE look suspicious? It should be right.

So what do you suggest I do?

Bad attempt. By the way, you do not multiply it by the size of int, while you do multiply sizeNC. You should find a way to debug your code: either install Nsight or use CUDA 3.0 with device emulation mode. You may use cuPrintf to get some info on what is going on inside.
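I.e. if sg->elen is a count of elements rather than a byte count, the allocation and the copy would need something like:

cutilSafeCall( cudaMalloc((void**)&d_vecE, sg->elen * sizeof(int)) );
cutilSafeCall( cudaMemcpy(d_vecE, sg->e, sg->elen * sizeof(int), cudaMemcpyHostToDevice) );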

I don't multiply it because sg->elen is already a size_t type…

I have tried to download cuPrintf but I cannot, because I'm not a member of the CUDA developer program; besides, I applied many weeks ago but I haven't received any answer yet. If somebody could send me the file I'd be very happy :)

However, I will try to install Nsight. Thanks Lev.

If somebody has any other advice I will be very grateful.

Nsight is available only for Windows and I have to develop under Linux :(

I really need cuPrintf :(

You may use CUDA 3.0; it is not too different from 3.2.
On the other hand, cuPrintf is available in 3.2 for everyone.

Are you sure? I have CUDA 3.2 and I don't see cuPrintf in any library…

I was wrong, I have found it.

Now I have changed the code in the first post: I have removed the device variable, declaring it in the host code instead, and I have added some cuPrintf calls to show what is happening in the kernel.

This is the output:

1) j1: 202

1) iend1: 303

1) iterations: 101

1) blocksPerGrid: 1

[0, 32]: i: 234, Value is: 1

[0, 33]: i: 235, Value is: 1

[0, 34]: i: 236, Value is: 1

[0, 35]: i: 237, Value is: 1

[0, 36]: i: 238, Value is: 1

[0, 37]: i: 239, Value is: 1

[0, 38]: i: 240, Value is: 1

[0, 39]: i: 241, Value is: 1

[0, 40]: i: 242, Value is: 1

[0, 41]: i: 243, Value is: 1

[0, 42]: i: 244, Value is: 1

[0, 43]: i: 245, Value is: 1

[0, 44]: i: 246, Value is: 1

[0, 45]: i: 247, Value is: 1

[0, 46]: i: 248, Value is: 1

[0, 47]: i: 249, Value is: 1

[0, 48]: i: 250, Value is: 1

[0, 49]: i: 251, Value is: 1

[0, 50]: i: 252, Value is: 1

[0, 51]: i: 253, Value is: 1

[0, 52]: i: 254, Value is: 1

[0, 53]: i: 255, Value is: 1

[0, 54]: i: 256, Value is: 1

[0, 55]: i: 257, Value is: 1

[0, 56]: i: 258, Value is: 1

[0, 57]: i: 259, Value is: 1

[0, 58]: i: 260, Value is: 1

[0, 59]: i: 261, Value is: 1

[0, 60]: i: 262, Value is: 1

[0, 61]: i: 263, Value is: 1

[0, 62]: i: 264, Value is: 1

[0, 63]: i: 265, Value is: 1

[0, 64]: i: 266, Value is: 1

[0, 65]: i: 267, Value is: 1

[0, 66]: i: 268, Value is: 1

[0, 67]: i: 269, Value is: 1

[0, 68]: i: 270, Value is: 1

[0, 69]: i: 271, Value is: 1

[0, 70]: i: 272, Value is: 1

[0, 71]: i: 273, Value is: 1

[0, 72]: i: 274, Value is: 1

[0, 73]: i: 275, Value is: 1

[0, 74]: i: 276, Value is: 1

[0, 75]: i: 277, Value is: 1

[0, 76]: i: 278, Value is: 1

[0, 77]: i: 279, Value is: 1

[0, 78]: i: 280, Value is: 1

[0, 79]: i: 281, Value is: 1

[0, 80]: i: 282, Value is: 1

[0, 81]: i: 283, Value is: 1

[0, 82]: i: 284, Value is: 1

[0, 83]: i: 285, Value is: 1

[0, 84]: i: 286, Value is: 1

[0, 85]: i: 287, Value is: 1

[0, 86]: i: 288, Value is: 1

[0, 87]: i: 289, Value is: 1

[0, 88]: i: 290, Value is: 1

[0, 89]: i: 291, Value is: 1

[0, 90]: i: 292, Value is: 1

[0, 91]: i: 293, Value is: 1

[0, 92]: i: 294, Value is: 1

[0, 93]: i: 295, Value is: 1

[0, 94]: i: 296, Value is: 1

[0, 95]: i: 297, Value is: 1

[0, 96]: i: 298, Value is: 1

[0, 97]: i: 299, Value is: 1

[0, 98]: i: 300, Value is: 1

[0, 99]: i: 301, Value is: 1

[0, 100]: i: 302, Value is: 1

[0, 0]: i: 202, Value is: 1

[0, 1]: i: 203, Value is: 1

[0, 2]: i: 204, Value is: 1

[0, 3]: i: 205, Value is: 1

[0, 4]: i: 206, Value is: 1

[0, 5]: i: 207, Value is: 1

[0, 6]: i: 208, Value is: 1

[0, 7]: i: 209, Value is: 1

[0, 8]: i: 210, Value is: 1

[0, 9]: i: 211, Value is: 1

[0, 10]: i: 212, Value is: 1

[0, 11]: i: 213, Value is: 1

[0, 12]: i: 214, Value is: 1

[0, 13]: i: 215, Value is: 1

[0, 14]: i: 216, Value is: 1

[0, 15]: i: 217, Value is: 1

[0, 16]: i: 218, Value is: 1

[0, 17]: i: 219, Value is: 1

[0, 18]: i: 220, Value is: 1

[0, 19]: i: 221, Value is: 1

[0, 20]: i: 222, Value is: 1

[0, 21]: i: 223, Value is: 1

[0, 22]: i: 224, Value is: 1

[0, 23]: i: 225, Value is: 1

[0, 24]: i: 226, Value is: 1

[0, 25]: i: 227, Value is: 1

[0, 26]: i: 228, Value is: 1

[0, 27]: i: 229, Value is: 1

[0, 28]: i: 230, Value is: 1

[0, 29]: i: 231, Value is: 1

[0, 30]: i: 232, Value is: 1

[0, 31]: i: 233, Value is: 1

1

2) j1: 40400

2) iend1: 40501

2) iterations: 101

2) blocksPerGrid: 1

2

3) j1: 41006

3) iend1: 41107

3) iterations: 101

3) blocksPerGrid: 1

3

parallelcount.cu(65) : cutilCheckMsg() CUTIL CUDA error : kernel launch failure : unspecified launch failure.

Everything seems to go well in the first kernel call, but the second one doesn't seem to be processed…

THE VERY STRANGE THING is that if I delete the line i += offs; in the kernel, all the kernel calls work fine!! Why??

Please help, I'm going crazy!

Solved!

Adding cudaThreadSynchronize(); after the kernel call, it works fine…
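I.e. the launch in the loop is now followed by the synchronize:

VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_vecE, d_vecNC, j1, iend1);
cudaThreadSynchronize();
cudaPrintfDisplay(stdout, true);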

but…

I have another problem!

The error message is always the same:
parallelcount.cu(65) : cutilCheckMsg() CUTIL CUDA error : kernel launch failure : unspecified launch failure.

but the problem is different:

I get this after MANY kernel calls, so it seems like a memory problem… But I don't see any part of my code where the memory usage can grow: once I have allocated my arrays, I should not allocate more memory until the end of the for loop that contains the kernel call.

Is that right, or am I missing something?

I want to add that with small inputs the whole program works fine. But if I give it a BIG input, after many kernel calls the program returns that error…

This is the best part!

I've found that by replacing the line i += offs with the line atomicAdd(&i, offs); the program works fine with inputs of any size!

So I've found that by placing cudaThreadSynchronize(); after the kernel call, the error message doesn't appear with inputs of small and medium size, but with large inputs I have to use atomic functions to add the offset to the i counter!!!

OK, I've found a workaround, but can somebody explain to me what is happening? I would prefer not to use atomicAdd() because it is VERY slow…

A)

I think that using atomicAdd on i += offs is effectively forcing the blocks to run sequentially. But of course we want it to run in parallel.

Actually, with this code you should be using atomicAdd where you increment B, i.e. on B[d_vecE[i]]++; otherwise two threads may both be updating the same element of B at the same time and only one will succeed. And for the reason in B) below, that is happening a lot.
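I.e. something like:

atomicAdd(&B[d_vecE[i]], 1);   // instead of B[d_vecE[i]]++;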

B)

I also notice that this line probably isn't correct:

int i = threadIdx.x;

Maybe you meant:

int i = threadIdx.x + blockIdx.x * threadsPerBlock;

I think it made the code only check the first 'threadsPerBlock' elements of the array.
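Putting A) and B) together, the kernel would look something like this (just an untested sketch; blockDim.x is the built-in equivalent of your threadsPerBlock):

__global__ void VecAdd(int* d_vecE, int* B, int offs, int endset)
{
    // global thread index across all blocks, shifted by the slice offset
    int i = offs + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < endset)
    {
        // atomicAdd prevents lost updates when two threads hit the same counter
        atomicAdd(&B[d_vecE[i]], 1);
    }
}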

Sorry, I'm not familiar with the sparsegraph structure you are using, so I'm paraphrasing what I think the code is doing.

You go through the outer loop (ind2MainThr - ind0) times.

Each time through the loop the code gets j1, iend1 and iterations from the sparsegraph; these determine the 'slice' of d_vecE to process.

So the 'slice' being processed may actually be a different width on each call. NB: the width of the slice is the 'iterations' variable in your code.

You then call VecAdd to process that slice of d_vecE and increment an array of counters called d_vecNC.

I may be missing something, but I can't see that it is necessary to divide d_vecE into slices (unless the outer loop is just for testing the code), as the results always go into the same cells of d_vecNC.

Regarding d_vecNC, you must use atomicAdd when incrementing the values or you will not get correct results, and that means one call to atomicAdd for every element of d_vecE that you process, which is a huge overhead that you should try to reduce. The best way of reducing it depends on the nature of your data.


If sg->nv is fairly small then a one-thread-per-value approach might work well ('small' might be 10,000 or even more), i.e. thread 9137 counts the number of times the value 9137 is found in d_vecE (or a slice of it). That way each cell in d_vecNC can only be updated by one thread and there is no need for atomicAdd. If you try that approach, then I suggest code something like this:

#define threadsPerBlock 256

__global__ void VecAdd( .... )
{
    __shared__ int values[threadsPerBlock];

    int myValue = threadIdx.x + blockIdx.x * threadsPerBlock;
    int myCount = 0;

    for (int offset = 0; offset < lengthOfDvecE; offset = offset + threadsPerBlock)
    {
        // copy threadsPerBlock elements of d_vecE into 'values' (shared array).
        // This greatly reduces the number of reads of global device memory.
        __syncthreads();

        // now each thread scans the shared array and updates its local
        // variable myCount each time it finds myValue
        __syncthreads();
    }

    // now each thread writes its result into d_vecNC
}

That's just a suggestion.
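If it helps, here is a fuller (untested) version of that sketch with the missing pieces filled in. VecCount, lengthOfDvecE and numValues are just placeholder names, and I've renamed the threadsPerBlock macro so it doesn't clash with your host variable:

#define COUNT_THREADS 256

// one thread per possible value: thread 'myValue' counts how many times
// 'myValue' occurs in d_vecE, so no atomics are needed on the counters
__global__ void VecCount(const int* d_vecE, int* d_vecNC, int lengthOfDvecE, int numValues)
{
    __shared__ int values[COUNT_THREADS];

    int myValue = threadIdx.x + blockIdx.x * COUNT_THREADS;
    int myCount = 0;

    for (int offset = 0; offset < lengthOfDvecE; offset += COUNT_THREADS)
    {
        // each thread copies one element of d_vecE into the shared array
        int idx = offset + threadIdx.x;
        values[threadIdx.x] = (idx < lengthOfDvecE) ? d_vecE[idx] : -1;
        __syncthreads();

        // every thread scans the shared chunk looking for its own value
        for (int k = 0; k < COUNT_THREADS; k++)
            if (values[k] == myValue)
                myCount++;
        __syncthreads();
    }

    // each counter is owned by exactly one thread, so a plain add is safe
    if (myValue < numValues)
        d_vecNC[myValue] += myCount;
}

You would launch it with (numValues + COUNT_THREADS - 1) / COUNT_THREADS blocks of COUNT_THREADS threads.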

Enjoy

Hi kbam,

Regarding the line:

int i = threadIdx.x;

you are right; I had forced my program to run with only one block for the previous problem and I forgot to change that line back.

But I don't think it should be quite as you say, rather something like:

int i = blockDim.x * blockIdx.x + threadIdx.x;

Is that correct?

Edit: I just figured out that it is the same thing :P

You are right about what I have to do, but I don't need to use atomicAdd where I increment B, because I'm sure that in the slice of vec_E that I'm scanning in one cycle of the outer FOR loop there are no equal values. This does not mean that vec_E is an array of unique values, but it does hold within each one of the slices…

And again, I really do need to scan only a slice of vec_E, because I need to increment only the counters (in vec_NC) corresponding to that slice.

This CUDA function was born from the need to speed up a larger project where this operation is performed hundreds of millions of times…

So do you think your code is still good for my purpose, or do I need something different?

Edit:

I'm trying to implement your code but not everything you wrote is clear to me…

1) You placed

// copy threadsPerBlock elements of d_vecE into 'values' (shared array).

inside the FOR loop, but shared memory has block visibility, right? So why am I supposed to copy the array in each thread? Shouldn't it be done once per block?

2) I hope this will improve my program's performance, but will it solve my problem? If I have to copy only a slice of the vector, maybe I will get the same error when I set the startset and endset…

However, if I have understood correctly, the problem is that the blocks are not synchronized, so they can damage each other when writing the i counter, right?

So I've made this change in the host code:

...
for (j = 0; j < blocksPerGrid; j++)
{
    VecAdd<<<1, threadsPerBlock>>>(d_vecE, d_vecNC, j1, iend1, sg->elen, j);
    cudaThreadSynchronize();
}
...

and this change in the device code:

__global__ void VecAdd(int* d_vecE, int* B, int offs, int endset, int maxlen, int blockoffs)
{
    int i = blockDim.x * blockoffs + threadIdx.x;
...

Now, if I haven't missed something, the program should run one kernel launch per block and wait for all of its threads to finish before the next… but I get the same error again! I really don't understand…

In the kernel launch code you just posted, there will only be one block. The cudaThreadSynchronize() call just makes the host wait until all GPU activity is finished. It has no effect on the GPU at all.

So why do I continue to get the same error, and why do I get no error if I use atomicAdd? :O

I don't entirely follow all of the discussion in this thread, but the line B[d_vecE[i]]++; introduces a race condition in your code. The increment operator on an array element does not translate to a single instruction. It in fact requires three operations:

  1. Read integer from global memory to a register.

  2. Increment the integer in the register.

  3. Write new integer back to global memory from the register.

If another thread is accessing the same global memory location, then it is possible for these steps to be interleaved between threads, producing a wrong result.

The correct way to do this is to use the atomicAdd() function (as you found), which pushes the increment operation into the memory controller. Then hardware can force steps 1, 2 and 3 to act as an indivisible unit. The slowness of atomicAdd() is in part due to the serial nature of the problem of multiple threads incrementing one variable. Fermi-based GPUs can perform atomic operations in the L2 cache, which gives them a 20x speed boost on this kind of code. Additionally, I believe there are more complex algorithms to compute a histogram (which is basically what you are doing) without relying heavily on atomics.
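For example, one common pattern (just a sketch; it assumes the whole value range fits in shared memory, which may not hold for a large sg->nv, and shared-memory atomics need compute capability 1.2 or higher) builds a per-block sub-histogram in shared memory and only touches global memory once per bin:

#define NUM_BINS 2048   // placeholder: must be at least the largest value in d_vecE + 1

__global__ void histoSlice(const int* d_vecE, int* B, int offs, int endset)
{
    __shared__ int subHist[NUM_BINS];

    // clear the block-local histogram
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        subHist[b] = 0;
    __syncthreads();

    // shared-memory atomics are much cheaper than global ones
    int i = offs + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < endset)
        atomicAdd(&subHist[d_vecE[i]], 1);
    __syncthreads();

    // one global atomicAdd per non-empty bin per block
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        if (subHist[b] > 0)
            atomicAdd(&B[b], subHist[b]);
}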

No, I'm sure that this does not happen, because vec_E is structured in a certain way and in the slice that I'm scanning there is no way to find two equal values, so the threads never increment the same cell of the B array…

When I said that I used atomicAdd, I was referring to the line:

i += offs

Maybe it gets an out-of-bounds access with offset and endset.