Hello all. I am new.
I am working on porting some code to CUDA, and I am aware that there are a number of concepts to keep in mind when dealing with threads, memory addressing, and efficiency.
That said, I thought I would just jump in to get my feet wet, and I believe all I have discovered so far is that I do not understand some fundamental memory work…
I have declared arrays that will represent data on the host and (duplicated) data on the card. The idea is that I will copy some arrays to the card, do a bunch of work on them, then copy a resultant array back to the host for visualizing… makes sense, I guess.
I declare the arrays like this:
float3 bPos[MAX_COUNT];
float3 *bPosD;
float3 bDir[MAX_COUNT];
float3 *bDirD;
float3 *pCenterD;
The ‘D’ arrays will point to data on the card.
I allocate and fill the local arrays, then allocate the space I need on the card (actually, much more than I will need, as I foresee the number of array elements varying throughout the life of the application – up to a max).
I allocate the memory on the card with:
int theSize = MAX_COUNT*sizeof(float3);
CUDA_SAFE_CALL( cudaMalloc((void **)&bPosD, theSize) );
CUDA_SAFE_CALL( cudaMalloc((void **)&bDirD, theSize) );
CUDA_SAFE_CALL( cudaMalloc((void **)&pCenterD, theSize) );
then copy the local contents to those places on the card:
CUDA_SAFE_CALL( cudaMemcpy(bDirD, bDir, bCount, cudaMemcpyHostToDevice) );
CUDA_SAFE_CALL( cudaMemcpy(bPosD, bPos, bCount, cudaMemcpyHostToDevice) );
I call the device code and it chews through the work with aplomb. No worries.
calc<<<blockCount,threadsPerBlock>>>(bPosD, pCenterD, bDirD, bCount, aRadius, threadsPerBlock );
and the kernel has a signature of:
__global__ void
calc(float3 bPos[], float3 pCenter[], float3 bDir[], int bCount, float aRadius, int threadsPerBlock )
blockCount and threadsPerBlock are modified according to how large the arrays are… currently, threadsPerBlock is 128 and blocks can number in the dozens, but in the future, per the docs, I plan on having a max of 512 threads per block and a block count in the hundreds… I assume that it will take some experimentation.
The kernel determines which array element it will work with using a simple function:
int index = blockIdx.x*threadsPerBlock + threadIdx.x;
Looking at the emuRelease output, it seems to address each element index correctly. That's all well and good, except it appears that only the first element of the array is ever different, and therefore the only one modified. The rest of the values are all the same.
In the kernel I take the values from bPos (remember, pointing to bPosD now) and modify pCenter and bDir.
After all the threads run I copy the bPosD and bDirD arrays back to the host using:
CUDA_SAFE_CALL( cudaMemcpy(bDir, bDirD, bCount, cudaMemcpyDeviceToHost) );
CUDA_SAFE_CALL( cudaMemcpy(bPos, bPosD, bCount, cudaMemcpyDeviceToHost) );
The [0] element is the only one that is different from what I previously sent to the card.
What am I doing wrong? I suspect that I am not using the memory ‘correctly’, but if that is true then I do not understand how to do it correctly.
Any thoughts?
I am (currently) not as worried about full utilization as I am about just getting the correct output.
Any guidance is greatly appreciated.
Thanks in advance,
Dave