Isn't that coalesced?! Writing to global memory in a coalesced way


i’m wondering if my kernel is accessing global memory in a coalesced way. as i understand coalescing, the threads of an active warp have to read elements from a global array that are stored right next to each other:

thread 0: element N
thread 1: element N+1
thread 2: element N+2

i’m using the following grid:
blockDim.x = 256, blockDim.y = 0;
gridDim.x = gridDim.y = 256;

that makes it possible to access 256×256×256 elements in a “parallel” way:

my kernel is similar to this:

__global__ void someKernel(float* arrayInGlobalMemory)
{
	int idx = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;

	//doing some calculations (takes around 400 ms for 256×256×256 elements)

	//then i’m writing each result to the array in global memory:
	arrayInGlobalMemory[idx] = someResult; // this takes about 4200 ms (for 256×256×256 elements)!!!
}


isn’t that coalesced??!!

there must be some serious bottleneck! but i don’t really know how to solve it!

if someone can give me a hint, i will be very thankful!

best regards rob

blockDim.y = 0 is incorrect. You want the y dimension to be 1 to have a “row of 256 threads”.

The time it takes shows that it definitely is NOT coalesced. You’re getting 16 MB/s… that’s three orders of magnitude less than you should get; this is much worse than simply having uncoalesced writes. I presume it’s either y=0 or the way you calculate the index that somehow results in a race condition.

  1. I meant blockDim.y = 1. (sry)

  2. how do you calculate the index when you have 1-dimensional blocks (one row of 256 threads in my case) and a 2-dimensional grid (256 rows/columns of blocks in my case)?
    (I got this: int idx = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x from another post in this forum)

It is a long shot, but I can imagine a situation when it is not coalesced:

  • you are using a card of Compute Capability 1.1 or lower
  • the pointer arrayInGlobalMemory is not a direct result from cudaMalloc, but a computed value which happens not to be aligned to 64 bytes.

If that scenario is true, aligning the pointer to 64 should solve the problem.

thx for the replies!

i’m doing the standard allocation process:

float* arrayInGlobalMemory; 
CUDA_SAFE_CALL(cudaMalloc((void **) &arrayInGlobalMemory, 256*256*256*sizeof(float)));

and i’m compiling with ‘-arch sm_13’, so it should be aligned to 64 bytes. (with a GTX 260 this should work)

can it be a serious problem if i use the 32-bit cuda libraries and run them on a 64-bit cuda driver (vista64)?

Okay, I just ran your code through the profiler and it says the writes are coalesced. It took around 1.6 ms to complete, giving me 40 GB/s of bandwidth (bandwidthTest from the SDK maxes out at 50 GB/s for my card). Your indexing scheme is also correct.

Here’s the code

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cutil.h>

#if __DEVICE_EMULATION__

bool InitCUDA(void){return true;}

#else

bool InitCUDA(void)
{
	int count = 0;
	int i = 0;

	cudaGetDeviceCount(&count);
	if(count == 0) {
		fprintf(stderr, "There is no device.\n");
		return false;
	}

	for(i = 0; i < count; i++) {
		cudaDeviceProp prop;
		if(cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
			if(prop.major >= 1) {
				break;
			}
		}
	}

	if(i == count) {
		fprintf(stderr, "There is no device supporting CUDA.\n");
		return false;
	}

	cudaSetDevice(i);
	printf("CUDA initialized.\n");
	return true;
}

#endif

//using ints instead of floats for more robust correctness checking later,
//tested with floats as well and it made no difference in bandwidth
__global__ void coalescing2d(int arrayGlobal[])
{
	int idx = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;
	arrayGlobal[idx] = idx;
}

int main(int argc, char* argv[])
{
	if(!InitCUDA()) {
		return 0;
	}

	dim3 blockDim;
	dim3 gridDim;
	blockDim.x = 256, blockDim.y = 1;
	gridDim.x = gridDim.y = 256;

	int arrayLength = gridDim.x*gridDim.y*blockDim.x*blockDim.y;
	int* arrayInGlobalMemory;
	//I can feel tmurray's wrath
	CUDA_SAFE_CALL(cudaMalloc((void **) &arrayInGlobalMemory, arrayLength*sizeof(int)));

	coalescing2d<<<gridDim, blockDim>>>(arrayInGlobalMemory);
	//no syncing, measure time with the profiler instead

	//test for correctness
	int *h_array = new int[arrayLength];
	CUDA_SAFE_CALL(cudaMemcpy(h_array, arrayInGlobalMemory, arrayLength*sizeof(int), cudaMemcpyDeviceToHost));

	//numbers written should be incrementing if index calculations went right
	//ie. each element should be previous element + 1
	int prev = -1;
	int errors = 0;
	for(int i = 0; i < arrayLength; ++i)
	{
		if(h_array[i] != prev+1)
			++errors;
		prev = h_array[i];
	}
	printf("errors: %d\n", errors);

	delete[] h_array;
	CUDA_SAFE_CALL(cudaFree(arrayInGlobalMemory));

	return 0;
}

How did you measure that it’s the store that takes 4000ms?

first of all: thank you so much for putting so much effort into my problem! i really appreciate that!

i measured these 4000 ms by simply removing any kind of computation from my kernel and just storing a fixed value to every array element:

e.g. arrayInGlobalMemory[idx] = 1.0f;

the problem must be somewhere else…

for later deploy-testing i compiled my code with the 32-bit libraries (cutil32.dll etc.) and took the time measurements on a pc with the 64-bit nvidia driver installed. so maybe that doesn’t fit together.

32-bit libraries can work with 64-bit drivers. I have used them on XP 64-bit machine.

There is a known problem with “cudaMalloc” in such setups: cudaMalloc fails after some repeated invocations of an application (no matter how little you allocate inside that application). But that should NOT cause abysmal turnaround times… as you see.

btw, not sure if this problem exists on Vista 64.

Purely out of curiosity. Did you check that you are not running in device emulation mode ?

No, but the time gets close to it.

I’m currently using CUDA v2.1. Are there any known bugs in that release that could cause this behavior?

(Copying 256 KB from device to host also takes around 1 sec… reminds me of my C64)