problem with array offset


i’m trying to write a program to count how many times a number is contained in certain part of an array and consequentially increment the relative index of the number in another array.

The code:

#include <stdio.h>

#include <cutil_inline.h>

#include <cuda_runtime_api.h>

#include "parallelcount.h"

#include ""

__global__ void VecAdd(int* d_vecE, int* B, int offs, int endset)


    int i = threadIdx.x;

    i += offs;

    if (i < endset)



		cuPrintf("i: %d, Value is: %d\n",i, B[d_vecE[i]]);



extern "C"

int functionCUDA(int* NghCounts, sparsegraph *sg, int *lab, int ind0, int ind2MainThr )


	int* d_vecE;

	int* d_vecNC;

	int i = 0;

	int iterations = 0;

	int j1 = 0;

	int iend1 = 0;

	int threadsPerBlock = 256;

	int blocksPerGrid = 0;


	size_t sizeE = sg->elen;

	size_t sizeNC = (sg->nv) * sizeof(int);




	// Allocate vectors in device memory

	cutilSafeCall( cudaMalloc((void**)&d_vecE, sizeE) );

    cutilSafeCall( cudaMalloc((void**)&d_vecNC, sizeNC) );

// Copy vectors from host memory to device memory

    cutilSafeCall( cudaMemcpy(d_vecE, sg->e, sizeE, cudaMemcpyHostToDevice) );

    cutilSafeCall( cudaMemcpy(d_vecNC, NghCounts, sizeNC, cudaMemcpyHostToDevice) );

/*printf("N vertex: %d\n",sg->nv);

    printf("Ind0: %d\n",ind0);

    printf("ind2MainThr: %d\n",ind2MainThr);

    printf("elen: %d\n",sg->elen);


	for (i = ind0; i < ind2MainThr; i++) {	

		j1 = sg->v[lab[i]];

		iend1 = j1+sg->d[lab[i]];

																							printf("%d) j1: %d\n",i-ind0+1,j1);

																							printf("%d) iend1: %d\n",i-ind0+1,iend1);

		iterations = sg->d[lab[i]];

																							printf("%d) iterations: %d\n",i-ind0+1,iterations);

		blocksPerGrid = (iterations + threadsPerBlock - 1) / threadsPerBlock;

																					    printf("%d) blocksPerGrid: %d\n",i-ind0+1,blocksPerGrid);

		VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_vecE, d_vecNC, j1, iend1);

		cudaPrintfDisplay(stdout, true);



		cutilCheckMsg("kernel launch failure");


	#ifdef _DEBUG

		cutilSafeCall( cudaThreadSynchronize() );



	// Copy result from device memory to host memory

	cutilSafeCall( cudaMemcpy(NghCounts, d_vecNC, sizeNC, cudaMemcpyDeviceToHost) );

	// Free device memory

	if (d_vecE)


	if (d_vecNC)





	return 0;


sg->e is the array containing the numbers, NghCounts is the array containing the count of the numbers in the sg->e.

j1 is the offset from i have to start searching in sg->e, iend1 is the offset i have to stop searching in sg->e, so if j1==5 and end1==7 i have to search in sg->e[5],sg->e[6],sg->e[7]. Again if sg->e[5]== 10, sg->e[6]==20, sg->e[7]==2, i have to do: NghCounts[2]++, NghCounts[10]++, NghCounts[20]++.

So what is the problem?

The problem is that running this code, i get “cutilCheckMsg() CUTIL CUDA error : kernel launch failure : unspecified launch failure.” on the line after the kernel call.

If i remove the line : “i += offs;” in the kernel function, i get non error but obviously is not a solution because i have to check only a part of the array and deleting that line i suppose to check all the array, right?

So what i’m doing wrong?


You probably refer device variable in host code

cutilSafeCall( cudaMalloc((void**)&d_vecE, sizeE) );
cutilSafeCall( cudaMalloc((void**)&d_vecNC, sizeNC) );

d_vecE points to device memory, but it returns pointer to host, also check SizeE, it looks sucpicious.

Initially i had declared d_vecE in host code and the error was the same. I declared as device in one of my tries, but without success.

Why SizeE looks suspicious? It should be right.

So what do you suggest to do?

Bad attempt, btw you do not multiplay it by size of int while you multiplay sizeNC. You should found a way to debug your code. Either install nsight or use cuda3.0 with device emulation mode. You may use cuprintf to get some info of what is going on inside.

i don’t multiplay it because sg->elen is already a size_t type…

I have tried to download cuprintf but i cannot because i’m not a member of cuda developer program, beside i have applied it many weeks ago, but i haven’t receveid any answer, yet. If somebody of you can send me the file i’d be very happy :)

However i will try to install nsight, thanks Lev.

If somebody has some else advice i will be very grateful ;)

nsight is available only for windows and i have to develop under linux :(

I really need cuprintf :(

you may use cuda3.0, it is not too different than 3.2
from other way, cuprintf is avaialbe in 3.2 for all.

are you sure? I have cuda 3.2 and i don’t see cuprintf in any library…

I was wrong, i have found it.

Now i have changed the code in the first post, i have removed the device variable declaring it in the host code and i have added some cuprintf to show what is happening in the kernel.

this is the output:

1) j1: 202

1) iend1: 303

1) iterations: 101

1) blocksPerGrid: 1

[0, 32]: i: 234, Value is: 1

[0, 33]: i: 235, Value is: 1

[0, 34]: i: 236, Value is: 1

[0, 35]: i: 237, Value is: 1

[0, 36]: i: 238, Value is: 1

[0, 37]: i: 239, Value is: 1

[0, 38]: i: 240, Value is: 1

[0, 39]: i: 241, Value is: 1

[0, 40]: i: 242, Value is: 1

[0, 41]: i: 243, Value is: 1

[0, 42]: i: 244, Value is: 1

[0, 43]: i: 245, Value is: 1

[0, 44]: i: 246, Value is: 1

[0, 45]: i: 247, Value is: 1

[0, 46]: i: 248, Value is: 1

[0, 47]: i: 249, Value is: 1

[0, 48]: i: 250, Value is: 1

[0, 49]: i: 251, Value is: 1

[0, 50]: i: 252, Value is: 1

[0, 51]: i: 253, Value is: 1

[0, 52]: i: 254, Value is: 1

[0, 53]: i: 255, Value is: 1

[0, 54]: i: 256, Value is: 1

[0, 55]: i: 257, Value is: 1

[0, 56]: i: 258, Value is: 1

[0, 57]: i: 259, Value is: 1

[0, 58]: i: 260, Value is: 1

[0, 59]: i: 261, Value is: 1

[0, 60]: i: 262, Value is: 1

[0, 61]: i: 263, Value is: 1

[0, 62]: i: 264, Value is: 1

[0, 63]: i: 265, Value is: 1

[0, 64]: i: 266, Value is: 1

[0, 65]: i: 267, Value is: 1

[0, 66]: i: 268, Value is: 1

[0, 67]: i: 269, Value is: 1

[0, 68]: i: 270, Value is: 1

[0, 69]: i: 271, Value is: 1

[0, 70]: i: 272, Value is: 1

[0, 71]: i: 273, Value is: 1

[0, 72]: i: 274, Value is: 1

[0, 73]: i: 275, Value is: 1

[0, 74]: i: 276, Value is: 1

[0, 75]: i: 277, Value is: 1

[0, 76]: i: 278, Value is: 1

[0, 77]: i: 279, Value is: 1

[0, 78]: i: 280, Value is: 1

[0, 79]: i: 281, Value is: 1

[0, 80]: i: 282, Value is: 1

[0, 81]: i: 283, Value is: 1

[0, 82]: i: 284, Value is: 1

[0, 83]: i: 285, Value is: 1

[0, 84]: i: 286, Value is: 1

[0, 85]: i: 287, Value is: 1

[0, 86]: i: 288, Value is: 1

[0, 87]: i: 289, Value is: 1

[0, 88]: i: 290, Value is: 1

[0, 89]: i: 291, Value is: 1

[0, 90]: i: 292, Value is: 1

[0, 91]: i: 293, Value is: 1

[0, 92]: i: 294, Value is: 1

[0, 93]: i: 295, Value is: 1

[0, 94]: i: 296, Value is: 1

[0, 95]: i: 297, Value is: 1

[0, 96]: i: 298, Value is: 1

[0, 97]: i: 299, Value is: 1

[0, 98]: i: 300, Value is: 1

[0, 99]: i: 301, Value is: 1

[0, 100]: i: 302, Value is: 1

[0, 0]: i: 202, Value is: 1

[0, 1]: i: 203, Value is: 1

[0, 2]: i: 204, Value is: 1

[0, 3]: i: 205, Value is: 1

[0, 4]: i: 206, Value is: 1

[0, 5]: i: 207, Value is: 1

[0, 6]: i: 208, Value is: 1

[0, 7]: i: 209, Value is: 1

[0, 8]: i: 210, Value is: 1

[0, 9]: i: 211, Value is: 1

[0, 10]: i: 212, Value is: 1

[0, 11]: i: 213, Value is: 1

[0, 12]: i: 214, Value is: 1

[0, 13]: i: 215, Value is: 1

[0, 14]: i: 216, Value is: 1

[0, 15]: i: 217, Value is: 1

[0, 16]: i: 218, Value is: 1

[0, 17]: i: 219, Value is: 1

[0, 18]: i: 220, Value is: 1

[0, 19]: i: 221, Value is: 1

[0, 20]: i: 222, Value is: 1

[0, 21]: i: 223, Value is: 1

[0, 22]: i: 224, Value is: 1

[0, 23]: i: 225, Value is: 1

[0, 24]: i: 226, Value is: 1

[0, 25]: i: 227, Value is: 1

[0, 26]: i: 228, Value is: 1

[0, 27]: i: 229, Value is: 1

[0, 28]: i: 230, Value is: 1

[0, 29]: i: 231, Value is: 1

[0, 30]: i: 232, Value is: 1

[0, 31]: i: 233, Value is: 1


2) j1: 40400

2) iend1: 40501

2) iterations: 101

2) blocksPerGrid: 1


3) j1: 41006

3) iend1: 41107

3) iterations: 101

3) blocksPerGrid: 1

3 : cutilCheckMsg() CUTIL CUDA error : kernel launch failure : unspecified launch failure.

everything seems to go well infirst kernel call, but the second seems not to be processed…

THE VERY STRANGE THING is that deleting the line i += offs; in the kernel, all the kernel calls work fine!! Why??

Please help, i’m going crazy!


adding cudaThreadSynchronize(); after the kernel call it works fine…


I have another problem! ;)

The message error is always the same: : cutilCheckMsg() CUTIL CUDA error : kernel launch failure : unspecified launch failure.

but the problem is different:

i got this after MANY kernel calls, so it seems as a memory problem… But i don’t see any part in my code where the memory can increase by size, once i have allocated my arrays, i should not allocate more memory until the end of the for which has the kernel call inside.

Is it right or i’m not seeing something?

i want to add that with small inputs, all the program works fine. But if i give a BIG input, after many kernel calls, the program returns that error…

this is the best!

I’ve found that replacing the line i+=offs with the line atomicAdd(&i, offs); the program works fine with inputs of any size!

So i’ve found that placing cudaThreadSynchronize(); after the kernel call, the error message doesn’t come with inputs of small and medium size, but with input of large size, i have to use atomic functions to add the offset to the i counter!!!

Ok i’ve found a workaround, but can somebody explain me what is happening? Because i wouldn’t to use atomicAdd() because is VEEERY slow…


I think that using AtomicAdd on i+=off is effectively forcing the blocks to run sequentially. But of course we want it to run in parallel.

Actually with this code you should be using AtomicAdd where you increment B i.e. on B[d_vecE[i]]++; otherwise two threads may both be updating the same element of B at same time and only one will succeed. And for the reason in B) below that is happening a lot.


I also notice that this line probably isn’t correct

int i = threadIdx.x;

maybe you meant

int i = threadIdx.x + blockIDx.x * threadsPerBlock;

I think it made the code only check the 1st ‘threadsPerBlock’ element in the array.

Sorry I’m not familiar with the sparesegraph structure you are using, so I’m paraphrasing what I think the code is doing.

You go through the outer loop ind2MainThr times

each time through the loop the code gets jl, iendl and iterations from the sparsegraph, these determine the ‘slice’ of d_vecE to process.

so actually the ‘slice’ being processed may be a different width on each call NB width of the slice is the ‘iterations’ variable in your code.

you then call VecAdd to process that slice of d_vecE and increment an array of counters called d_vecNC

I may be missing something but I can’t see that it is neccessary to divide d_vecE into slices (unless the outer loop is just for testing the code), as the results always go into the same cells in d_vecNC.

Regarding d_vecNC you must use AtomicAdd when incrementing the values or you will not get correct results. and that means 1 call to AtomicAdd for every element of d_vecE that you process, and that is a huge overhead that you should try to reduce. The best way of reducing it depends on the nature of your data.

If sg->nv is fairly small then a 1 thread per value approach might work well ( ‘small’ might be 10,000 or even more) i.e. thread 9137 counts the number of times the value 9137 is found in d_vecE (or slice of it). That way each cell in d_vecNC can only be updated by 1 thread and there is no need for AtomicAdd, if trying that approach then I suggest code something like this.

#define threadsPerBlock 256

__shared__ int values[threadsPerBlock];

VecAdd( .... )


 int myValue = threadIdx.x + blockIdx.x * threadsPerBlock;

 int myCount = 0;

 for ( int offset = 0; offset < lengthOfDvecE; offset = offset+threadsPerBlock)


   // copy threadsPerBlock elements of d_vecE into 'values' (shared array). This greatly reduces the number of reads of global device memory


   // now each thread scans the shared array and updates its local variable myCount each time it finds myValue



 // now each thread writes its result into d_vecNC


Thats just a suggestion.


hi kbam,

Regarding the line:

int i = threadIdx.x;

you are right, i had forced my program to run with only a block in the previous problem and i forgot to rechange that line.

But i don’t think that should be as you say, but something like:

int i = blockDim.x * blockIdx.x + threadIdx.x;

it is correct?

edit: i just figured that it is the same thing :P

You are right about what i have to do, but i don’t need to use atomic add where i increment B, because i’m sure that in the slice of the vec_e that i’m scanning in one cycle of the outer FOR, there are no equal values. This not means that vec_e is an array with unique values, but this happens in each one of the slices…

And again, i really need to scan only a slice of vec_e because i need to increment only the relative values (in vec_NC) of that slice.

This cuda function born from the necessity to speed up a larger project where this operation is made hundred million times…

So do you think that your code is good for my aim, yet, or i need something different?


i’m trying to implement your code but not all what have you wrote is clear for me…

1)You have placed

// copy threadsPerBlock elements of d_vecE into 'values' (shared array).

inside the FOR, but shared memory has block visibility, true? So, why i’m supposing to copy the array in each thread? Shouldn’t do it once time for block?

2)i hope this will improve my program’s performance, but this will solve my problem? If i have to copy only a slice of the vec, maybe i will get the same error when i will set the startset and endset…

however, if i have understood, the problem is that the blocks are not synchronized so they can damage each other writing the i counter, right?

so i’ve made this change in the host code:


for(j = 0; j<blocksPerGrid; j++)


      VecAdd<<<1, threadsPerBlock>>>(d_vecE, d_vecNC, j1, iend1, sg->elen, j);

      cudaThreadSynchronize() ;



and this change in device code:

__global__ void VecAdd(int* d_vecE, int* B, int offs, int endset, int maxlen, int blockoffs)


    int i = blockDim.x * blockoffs + threadIdx.x;


Now if i haven’t missed something, the program should run a kernel function for each block, and each block wait for all your thread finish… but i got the same error again! Really i don’t undesrtand…

In the kernel lauch code you just posted, there will only be one block. The cudaThreadSynchronize() call just makes the host wait until all GPU activity is finished. It has no effect on the GPU at all.

so why i continue to get the same error and if i use AtomicAdd i get no error? :O

I don’t entirely follow all of the discussion in this thread, but this line introduces a race condition in your code. The increment operator on an array element does not translate to a single instruction. It in fact requires three operations:

  1. Read integer from global memory to a register.

  2. Increment the integer in the register.

  3. Write new integer back to global memory from the register.

If another thread is accessing the same global memory location, then it is possible for these steps to be interleaved between threads, producing a wrong result.

The correct way to do this is to use the atomicAdd() function (as you found), which pushes the increment operation into the memory controller. Then hardware can force steps 1, 2 and 3 to act as an indivisible unit. The slowness of atomicAdd() is in part due to the serial nature of the problem of multiple threads incrementing one variable. Fermi-based GPUs can perform atomic operations in the L2 cache, which gives them a 20x speed boost on this kind of code. Additionally, I believe there are more complex algorithms to compute a histogram (which is basically what you are doing) without relying heavily on atomics.

no, i’m sure that this not happens because vec_E is structured in a certain way and in the slice that i’m scanning there is no way to find 2 equal values, so the threads never go to increment the same cell of the B array…

When i’ve said that i’ve used atomicAdd, i referred to the line:


maybe it gets out of bound access with offset and endset.