read from global mem vs write to global mem

Hi, everyone

There is a __global__ function which accesses two arrays residing in global memory: array A contains 119131 integers and array B contains 1000 integers. The function reads from array A and does some computation in a for loop; at the end, the result is written back to array B. I timed the procedure: reading from array A plus the computation takes 3.141024 milliseconds, while writing back to array B takes 12.191104 milliseconds, and it still takes more than 12 milliseconds even if just one integer is written back to array B. The first part accesses an array with far more elements than the second and does a lot of computation, yet it takes much less time than the write, which doesn't make sense to me. So I wonder whether there is a difference between reading from global memory and writing back to global memory. Is that the case? Can anybody give an explanation?

Thanks

—Gimurk

If you commented out the write to time the read, then the optimizer most likely recognized that your kernel does nothing and optimized most of it away.

Otherwise, non-coalesced memory writes could potentially cause such a big hit in performance.
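
To illustrate the dead-code point, here is a sketch (the kernel, array names, and the sentinel check are all made up, not your actual code): with the store commented out, nothing the loop computes is ever used, so nvcc is free to delete the loop and you end up timing an almost empty kernel. Keeping the result alive through a branch the compiler cannot evaluate avoids that while still skipping the per-thread write:

[codebox]__global__ void sum_kernel(const int *A, int *B, int n)
{
    int sum = 0;

    // the "read + compute" part being timed
    for (int i = 0; i < n; ++i)
        sum += A[i];

    // B[blockIdx.x * blockDim.x + threadIdx.x] = sum;  // the write being timed separately

    // Without the lines below, the loop above is dead code and gets removed.
    // The condition is (almost) never true at run time, but the compiler
    // cannot prove that, so the reads and the computation survive.
    if (sum == 0x7fffffff)
        B[0] = sum;
}[/codebox]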

Code in the device function:

[codebox]unsigned int item_idx = stream_start + blockIdx.x * blockDim.x + threadIdx.x;
unsigned int item_cnt = 0;

dk_freq_itemset_cnt[item_idx].itemset_cnt = item_cnt; // write back to global mem
[/codebox]

Writing back to global mem is coalesced, I think.

It may be a problem with the optimizer, as you said…

You are writing to a struct array as far as I can see. That will most likely not be coalesced.
Also, even if you were writing to a plain output array, it would only be coalesced if stream_start is a multiple of 16.

[codebox]typedef struct __align__(8) {
    unsigned int itemset_id;
    unsigned int itemset_cnt;
} ITEMCNT;
[/codebox]

The output array is aligned, and I thought it might be the problem you describe, so I tried writing the output to an int array, but nothing changed.

In my program I make sure stream_start is a multiple of 32, so it seems to be the problem MisterAnderson42 mentioned.

But you are only writing one of the ints in the struct, so it won’t be coalesced. You could verify this by running with the CUDA profiler with the right counters enabled.

You can also verify if the dead code optimization is happening by looking at the ptx output from nvcc.
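
(Dumping the ptx is just a matter of something like the command below; mykernel.cu stands in for your source file, and the exact flag spelling may vary between toolkit versions.)

[codebox]nvcc -ptx mykernel.cu -o mykernel.ptx[/codebox]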

I have tried writing to an int array; nothing changed. I will check memory coalescing with the CUDA profiler.

Thanks for your advice.

gld_incoherent = 0

gst_incoherent = 0

divergent_branch = 0

warp_serialize = 0

The CUDA profiler shows no problem with the memory accesses, no matter whether the result of the computation is written to a struct array or an int one.

Are you running on a GTX 260 or GTX 280? Those always report 0 for g*_incoherent.

My device is a GeForce 8800 GT, and it did give non-zero values for g*_incoherent when I ran an earlier version of the function that had non-coalesced memory accesses.

Well, my mind is officially boggled on this one. Your code

dk_freq_itemset_cnt[item_idx].itemset_cnt = item_cnt;

has consecutive threads writing 4-byte values to byte offsets 4, 12, 20, …, which is most certainly not coalesced. Maybe the compiler is somehow issuing that write as a st.global.v2 instruction despite the fact that you only write half of the struct. You'll have to check the ptx code to know for sure.

Or… are you running on Mac OS X or Vista? They don’t support the profiler counters, either.
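
For what it's worth, one layout that should give coalesced stores is a structure-of-arrays version with one plain int array per field, so that consecutive threads write consecutive 4-byte words. This is only a sketch: apart from item_idx, item_cnt and stream_start, the kernel shape and the names are assumptions, not your code.

[codebox]// one array per field instead of one array of ITEMCNT structs
__global__ void count_kernel(unsigned int *dk_itemset_id,
                             unsigned int *dk_itemset_cnt,
                             unsigned int stream_start)
{
    unsigned int item_idx = stream_start + blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int item_cnt = 0;

    // ... computation that produces item_cnt ...

    // consecutive threads write consecutive 4-byte words of dk_itemset_cnt,
    // so the 16 stores of a half-warp fall into one aligned 64-byte segment
    // and coalesce (assuming stream_start is a multiple of 16)
    dk_itemset_cnt[item_idx] = item_cnt;
}[/codebox]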

In the main function:

[codebox]int *d_itemsets_cnt_struct; // store the counted or connected itemsets information temporarily

CUDA_SAFE_CALL( cudaMalloc((void**)&d_itemsets_cnt_struct, DBASE_MAXITEM*sizeof(int)) );[/codebox]

In the device function:

[codebox]dk_freq_itemset_cnt[item_idx] = item_cnt;[/codebox]

As I said, I had already tried writing back to an int array; even when only one integer is written back, the timing doesn't change, so it doesn't seem to be a memory-coalescing problem.

My OS is Fedora 7; I prefer running programs on Linux rather than on MS Windows. I have examined several device functions with the CUDA profiler, and the output looks reasonable, with different values, zero or non-zero.

I will learn how to check the ptx code.

Thank you for your help.

I’m not sure if this thread is completely stale, but the title fits my problem perfectly.

It seems to me that CUDA cannot coalesce writes to global memory. Here is my evidence:

I am permuting a row of memory into a random order in 3 steps so that I can profile time spent in each:

  1. read a random row index from global memory

  2. read a data row at that random index

  3. write the data item to a new row in global memory

For example, if I reorder the following vector named data into the vector named permuted by accessing data at the indices in random, the result would be the following:

data: [.3, .5, .2, .9]

random: [1, 3, 2, 0]

permuted: [.5, .9, .2, .3]

I understand coalescence and I have verified via testing that step 1 is indeed coalesced nicely.

Warps in step 2 access randomly criss-crossing values and are inherently impossible to coalesce-- that’s fine, I must accept that.

The data structure written to in step 3 is allocated using cudaMallocPitch, etc., just like the data structure whose accesses are nicely coalesced in step 1. However, it takes just as much time as step 2! I have verified that the data structure is properly set up for coalescing by doing a subsequent read from it, which does indeed turn out to be coalesced.
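
(For context, the output buffer is allocated with the usual cudaMallocPitch pattern, roughly the generic sketch below; the variable names and sizes are placeholders, not my exact code.)

[codebox]size_t width  = 4096;   // elements per row (placeholder)
size_t height = 1024;   // number of rows (placeholder)
float *permuted;
size_t pitch;

// each row is padded to "pitch" bytes so that every row start keeps the
// alignment needed for coalescing
cudaMallocPitch((void**)&permuted, &pitch, width * sizeof(float), height);

// inside a kernel, row r is then addressed through the pitch:
// float *row = (float*)((char*)permuted + r * pitch);
// row[i] = value;[/codebox]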

Time spent in this loop:

Step 1: ~2%

Step 2: ~49%

Step 3: ~49%

So, the reading is coalesced and writing is not. Is this supposed to happen in general? The manual only speaks in terms of “memory accesses” and does not make a distinction between reading and writing.

System info:

Device: GeForce 8800 Ultra

OS: RHEL5 x86_64

Cuda compilation tools, release 2.0, V0.2.1221

Code from the kernel loop (timekeeping statements removed):

[codebox]for (int i = 0; i < len; i++)
{
    // Step 1: read a random row index
    tmpi = random[i].value;

    // Step 2: read the data item at that index
    tmpf = data[tmpi];

    // Step 3: write the data item to the new permuted row
    permuted[i] = tmpf;

    // Step 4: read from permuted to check for coalesced reading
    // (this turns out to be just as fast as step 1, which is a coalesced read also)
    tmp = permuted[i];
}[/codebox]
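
For reference, here is a minimal one-thread-per-element sketch where I would expect the write side to coalesce (the names, types, and launch shape are made up, it is not my real code):

[codebox]__global__ void permute_kernel(const float *data, const int *random,
                               float *permuted, int len)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len)
    {
        // gather: consecutive threads read scattered addresses, cannot coalesce
        float v = data[random[i]];

        // consecutive threads write consecutive addresses, should coalesce
        permuted[i] = v;
    }
}[/codebox]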

Thanks very much.

My time measurements actually aren't very helpful, but Visual Profiler 1.0 says that when I write to my data structure in the same coalesced way I'm reading from it, my uncoalesced READS go through the roof. Is there something I'm missing about coalesced writes?