Memory copies and arrays: methinks I misunderstand the concept...

Hello all. I am new.

I am working on porting some code to run using CUDA and I am aware that there are a number of concepts to keep in mind when dealing with threads, memory addressing and efficiency.

That said, I thought I would just jump in to get my feet wet, and I believe that all I have discovered is that I do not understand some fundamental memory work…

I have declared arrays that will represent data on the host and (duplicated) data on the card. The idea is that I will copy some arrays to the card, do a bunch of work on them, then copy a resultant array back to the host for visualizing… makes sense, I guess.

I declare the arrays like this:

float3	bPos[MAX_COUNT];

float3 *bPosD;

float3	bDir[MAX_COUNT];

float3 *bDirD;

float3 *pCenterD;

The ‘D’ arrays will point to data on the card.

I allocate and fill the local arrays, then allocate the space I need on the card (actually, much more than I will need, as I foresee the number of array elements varying throughout the life of the application, up to a max).

I allocate the memory on the card with:

int theSize = MAX_COUNT*sizeof(float3);

CUDA_SAFE_CALL( cudaMalloc((void **)&bPosD, theSize)   );

CUDA_SAFE_CALL( cudaMalloc((void **)&bDirD, theSize)   );

CUDA_SAFE_CALL( cudaMalloc((void **)&pCenterD, theSize)   );

then copy the local contents to those places on the card:

CUDA_SAFE_CALL( cudaMemcpy(bDirD, bDir, bCount, cudaMemcpyHostToDevice) );

CUDA_SAFE_CALL( cudaMemcpy(bPosD, bPos, bCount, cudaMemcpyHostToDevice) );

I call the device code and it chews through the work with aplomb. No worries.

calc<<<blockCount,threadsPerBlock>>>(bPosD, pCenterD, bDirD, bCount, aRadius, threadsPerBlock );

and the kernel has a signature of:

__global__ void 

calc(float3 bPos[], float3 pCenter[], float3 bDir[], int bCount, float aRadius, int threadsPerBlock )

blockCount and threadsPerBlock are modified according to how large the arrays are… currently, threadsPerBlock is 128 and blocks can number in the dozens, but in the future, per the docs, I plan on having a max of 512 threads per block and a block count in the hundreds… I assume it will take some experimentation.

The kernel determines which array element it will work with using a simple expression:

int index = blockIdx.x*threadsPerBlock + threadIdx.x;

looking at emuRelease output, it seems to correctly address each element index. That’s all well and good, except it appears that only the first element in the array is ever different and therefore the only one that is modified. The rest of the values are all the same.

In the kernel I take the values from bPos (remember, pointing to bPosD now) and modify pCenter and bDir.

After all the threads run I copy the bPosD and bDirD arrays back to the host using:

CUDA_SAFE_CALL( cudaMemcpy(bDir, bDirD, bCount, cudaMemcpyDeviceToHost) );

CUDA_SAFE_CALL( cudaMemcpy(bPos, bPosD, bCount, cudaMemcpyDeviceToHost) );

The [0] element is the only one that is different from what I previously sent to the card.

What am I doing wrong? I suspect that I am not using the memory ‘correctly’, but if that is true then I do not understand how to do it correctly.

Any thoughts?

I am (currently) not worried as much about full utilization as I am about just getting the correct output.

Any guidance is greatly appreciated.

Thanks in advance,


You are doing all the right steps, but there is something wrong with your kernel.

This is a very simple example to add two vectors:

#include <stdio.h>
#include <stdlib.h>

__global__ void add_arrays_gpu( float *in1, float *in2, float *out, int Ntot)
{
        int idx=blockIdx.x*blockDim.x+threadIdx.x;
        if ( idx <Ntot )
                out[idx]=in1[idx]+in2[idx];
}

int main()
{
 /* pointers to host memory */
 float *a, *b, *c;

 /* pointers to device memory */
 float *a_d, *b_d, *c_d;

 int N=18;
 int i;

 /* Allocate arrays a, b and c on host */
 a = (float*) malloc(N*sizeof(float));
 b = (float*) malloc(N*sizeof(float));
 c = (float*) malloc(N*sizeof(float));

 /* Allocate arrays a_d, b_d and c_d on device */
 cudaMalloc ((void **) &a_d, sizeof(float)*N);
 cudaMalloc ((void **) &b_d, sizeof(float)*N);
 cudaMalloc ((void **) &c_d, sizeof(float)*N);

 /* Initialize arrays a and b */
 for (i=0; i<N; i++)
 {
   a[i]= (float) i;
   b[i]=-(float) i;
 }

 /* Copy data from host memory to device memory */
 cudaMemcpy(a_d, a, sizeof(float)*N, cudaMemcpyHostToDevice);
 cudaMemcpy(b_d, b, sizeof(float)*N, cudaMemcpyHostToDevice);

 /* Compute the execution configuration */
 int block_size=8;
 dim3 dimBlock(block_size);
 dim3 dimGrid ( (N/dimBlock.x) + (!(N%dimBlock.x)?0:1) );

 /* Add arrays a and b, store result in c */
 add_arrays_gpu<<<dimGrid,dimBlock>>>(a_d, b_d, c_d, N);

 /* Copy data from device memory to host memory */
 cudaMemcpy(c, c_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

 /* Print c */
 for (i=0; i<N; i++)
   printf(" c[%d]=%f\n", i, c[i]);

 /* Free the memory */
 free(a); free(b); free(c);
 cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);

 return 0;
}


In real code, the number of threads per block should be larger, as should the size of the arrays.

Thanks for the (very!) fast reply… but I seem to be ambling in the twilight zone…

Inside my kernel, a loop that steps through one of the arrays sees values that are all the same except for the first slot:

for (int j = 0; j < bCount; j++)
{
	pCenter[i].x += bPos[j].x;
	pCenter[i].y += bPos[j].y;
	pCenter[i].z += bPos[j].z;
	printf("  in Jloop: bPos[%i].x,y,z = %f,%f,%f\n",j,  bPos[j].x, bPos[j].y, bPos[j].z); 
} // j

emu prints out a list in which the [0] slot is unique (and correct, it seems) while the other slots are identical to each other, but different from [0] and don’t correctly match any slot in the array.



in Jloop: bPos[0].x,y,z = .2434, -.1234, .7890
in Jloop: bPos[1].x,y,z = .001327,  .001327,  .001327
in Jloop: bPos[2].x,y,z = .001327,  .001327,  .001327
in Jloop: bPos[3].x,y,z = .001327,  .001327,  .001327
…
in Jloop: bPos[9].x,y,z = .001327,  .001327,  .001327

which makes me think this is some kind of bank issue?

At present I am only using a single block and 10 threads (the arrays are only 10 elements long…!)

Can I not address any element of the arrays that I want? I understood that ‘gathering’ was the easy part! :)

Copying the array back from the card works fine, though (even though I don’t need it for this particular operation… it should be the same as it was when it was copied TO the card… and it is).



if I got you right you are doing a

CUDA_SAFE_CALL( cudaMemcpy(bDirD, bDir, bCount, cudaMemcpyHostToDevice) );

with bCount = 10, so you are copying 10 bytes. The size of a float3 is 12 bytes, so you are only copying part of the first element. Try

CUDA_SAFE_CALL( cudaMemcpy(bDirD, bDir, bCount*sizeof(float3), cudaMemcpyHostToDevice) );

hope that helps :-)

ohhh boy…

Have you ever heard the story about me and The Ultimate Ice Cream Sundae? It’s the Stuff Of Legend.

It was hot and I was positively DREAMING of a Sundae but I didn’t have any ice cream So I hopped into my car and drove the 20 minutes to the store to get some of that really expensive, slow-churned kind that’s made by hand and infused with Deliciousness.

While entering the store I thought to myself, “Man, some fresh strawberries would be GREAT with this!” so I went and found PERFECT, vine fresh strawberries.

A new batch of fresh bananas! Awesome! Gotta get a bunch of those to scoop the ice cream onto!

I also thought about how great it would be to get Hershey’s chocolate syrup, and, YES! – some fresh walnuts that I COULD EVEN HEAT UP and, for the piece de resistance, add them to some maple syrup to drizzle on the top of it all…(Ohhh… need some of that, too… glad I thought of that! I’m out!). OOOOOO!!! I almost forgot! WHIPPED CREAM! YAY!

I got it all together, waited in line patiently, paid the nice girl at the cash register (I could see the pure jealousy in her eyes!), and then hurried back to the house with the WINDOWS DOWN so that I would be goooood and hot when the time came to enjoy my Ultimate Ice Cream Sundae.

I was so excited I practically ran into the house (no small feat: I’m a big guy). I threw all of the stuff out of the bag, put the Hershey’s in a sauce pan half-filled with water on the stove and turned it on to heat the chocolate up. Then, I sliced the bananas and strawberries. While chopping the nuts, I put a cup of the maple syrup in the microwave so that it would get runny and easy to mix the chopped walnut bits into. I took the strawberries and added a splash of water and some sugar to create a delicate but sweet sauce for them.

Now came the moment of truth!!! COME ON ICE CREAM!!!

It would have been great, too, if I had remembered to buy the dang ice cream…

Thanks for making me feel like a moo-ran, again!!! That’s what happens when I start to get all excited about something and forget about the fundamentals!

(and seriously, thanks for taking the time to look at my code… I really appreciate it!)


It’s not that simple.

CUDA documentation is not always clear on when to use a byte count and when to use an element count, especially when dealing with arrays. I think for one function the documentation is even wrong (I read this on this forum).

Don’t blame yourself too much ;-)

If you find errors in the documentation, please file a bug or report it on the forum, so that we can fix them.


I’m trying to run the sample simple code - adding two vectors, and I’m getting the following results - what am I doing wrong?