I’m not sure if this thread is completely stale, but the title fits my problem perfectly.
It seems to me that CUDA cannot coalesce writes to global memory. Here is my evidence:
I am permuting a row of memory into a random order in 3 steps so that I can profile time spent in each:
- read a random row index from global memory
- read a data row at that random index
- write the data item to a new row in global memory
For example, if I reorder the following vector named data into the vector named permuted by accessing data at the indices in random, the result would be the following:
data: [.3, .5, .2, .9]
random: [1, 3, 2, 0]
permuted: [.5, .9, .2, .3]
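For reference, the operation itself can be sketched on the CPU in plain C roughly as follows. This is only a minimal sketch: permute_reference and random_idx are made-up names, and it assumes plain unsigned indices rather than the .value field that the kernel reads.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference for the permutation: permuted[i] = data[random_idx[i]].
 * Hypothetical names; the real kernel reads the index from random[i].value. */
static void permute_reference(const float *data, const unsigned *random_idx,
                              float *permuted, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned src = random_idx[i];  /* step 1: read the row index      */
        assert(src < len);             /* every index must be in range    */
        float value = data[src];       /* step 2: scattered read (gather) */
        permuted[i] = value;           /* step 3: write to slot i         */
    }
}
```

Calling it with the data and random vectors above fills permuted with .5, .9, .2, .3. Note that only the read in step 2 uses a scattered address; steps 1 and 3 both index by i.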
I understand coalescence and I have verified via testing that step 1 is indeed coalesced nicely.
Warps in step 2 access values scattered at random, so those reads are inherently impossible to coalesce. That's fine; I have to accept it.
The data structure written to in step 3 is allocated with cudaMallocPitch, etc., just like the data structure whose accesses coalesce nicely in step 1. However, step 3 takes just as much time as step 2! I have verified that the data structure is properly set up for coalescing by doing a subsequent read from it, which does turn out to be coalesced.
Time spent in this loop:
Step 1: ~2%
Step 2: ~49%
Step 3: ~49%
So the read is coalesced and the write, apparently, is not. Is this supposed to happen in general? The manual only speaks in terms of “memory accesses” and does not distinguish between reading and writing.
System info:
Device: GeForce 8800 Ultra
OS: RHEL5 x86_64
Cuda compilation tools, release 2.0, V0.2.1221
Code from kernel loop (time keeping statements removed):
[codebox]for (int i = 0; i < len; i++)
{
    // Step 1: read a random row index
    tmpi = random[i].value;
    // Step 2: read the data item at that index
    tmpf = data[tmpi];
    // Step 3: write the data item to the new permuted row
    permuted[i] = tmpf;
    // Step 4: read from permuted to check for coalesced reading
    // (this turns out to be just as fast as step 1, which is also a coalesced read)
    tmp = permuted[i];
}[/codebox]
Thanks very much.