read from global mem vs write to global mem

Hi, everyone

There is a __global__ function which accesses two arrays residing in global memory: array A contains 119131 integers and array B contains 1000 integers. The function reads from array A and does some computation in a for loop; at the end, the result is written back to array B. I timed the procedure: reading from array A plus the computation takes 3.141024 milliseconds, while writing back to array B takes 12.191104 milliseconds, and it still takes more than 12 milliseconds even if just one integer is written back to array B. The first part accesses an array with far more elements than the second and does a lot of computation, yet it takes much less time than the write, which doesn't make sense to me. So I wonder whether there is a difference between reading from global memory and writing back to global memory. Is that the case? Can anybody give an explanation?

Thanks

—Gimurk

If you commented out the write to time the read, then the optimizer most likely recognized that your kernel does nothing and optimized most of it away.

Otherwise, non-coalesced memory writes could potentially cause such a big hit in performance.
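
To illustrate the dead-code point, here is a sketch (the kernel, array names, and the sentinel check are all made up, not your actual code): with the store commented out, nothing the loop computes is ever used, so nvcc is free to delete the loop and you end up timing an almost empty kernel. Keeping the result alive through a branch the compiler cannot evaluate avoids that while still skipping the per-thread write:

[codebox]__global__ void sum_kernel(const int *A, int *B, int n)
{
    int sum = 0;

    // the "read + compute" part being timed
    for (int i = 0; i < n; ++i)
        sum += A[i];

    // B[blockIdx.x * blockDim.x + threadIdx.x] = sum;  // the write being timed separately

    // Without the lines below, the loop above is dead code and gets removed.
    // The condition is (almost) never true at run time, but the compiler
    // cannot prove that, so the reads and the computation survive.
    if (sum == 0x7fffffff)
        B[0] = sum;
}[/codebox]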

Code in the device function:

[codebox]unsigned int item_idx = stream_start + blockIdx.x * blockDim.x + threadIdx.x;
unsigned int item_cnt = 0;

dk_freq_itemset_cnt[item_idx].itemset_cnt = item_cnt; // write back to global mem
[/codebox]

Writing back to global mem is coalesced, I think.

It may be a problem with the optimizer, as you said…

You are writing to a struct array as far as I can see. That will most likely not be coalesced.
Also, even if you were writing to a plain output array, it would only be coalesced if stream_start is a multiple of 16.

[codebox]typedef struct __align__(8) {
    unsigned int itemset_id;
    unsigned int itemset_cnt;
} ITEMCNT;
[/codebox]

The output array is aligned, and I thought it might be the problem you describe, so I tried writing the output to an int array, but nothing changed.

In my program I make sure stream_start is a multiple of 32, so it seems to be the problem MisterAnderson42 mentioned.

But you are only writing one of the ints in the struct, so it won’t be coalesced. You could verify this by running with the CUDA profiler with the right counters enabled.

You can also verify if the dead code optimization is happening by looking at the ptx output from nvcc.
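
(Dumping the ptx is just a matter of something like the command below; mykernel.cu stands in for your source file, and the exact flag spelling may vary between toolkit versions.)

[codebox]nvcc -ptx mykernel.cu -o mykernel.ptx[/codebox]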

I have tried writing to an int array; nothing changed. I will check memory coalescing with the CUDA profiler.

Thanks for your advice.

gld_incoherent = 0

gst_incoherent = 0

divergent_branch = 0

warp_serialize = 0

The CUDA profiler shows no problem with the memory accesses, no matter whether the result of the computation is written to a struct array or an int one.

Are you running on a GTX 260 or GTX 280? Those always report 0 for g*_incoherent.

My device is a GeForce 8800 GT, and it did give non-zero values for g*_incoherent when I ran an earlier version of the function that had non-coalesced memory accesses.

Well, my mind is officially boggled on this one. Your code

dk_freq_itemset_cnt[item_idx].itemset_cnt = item_cnt;

has consecutive threads writing 4-byte values to byte offsets 4, 12, 20, …, which is most certainly not coalesced. Maybe the compiler is somehow issuing that write as a st.global.v2 instruction despite the fact that you only write half of the struct. You'll have to check the ptx code to know for sure.

Or… are you running on Mac OS X or Vista? They don’t support the profiler counters, either.
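
For what it's worth, one layout that should give coalesced stores is a structure-of-arrays version with one plain int array per field, so that consecutive threads write consecutive 4-byte words. This is only a sketch: apart from item_idx, item_cnt and stream_start, the kernel shape and the names are assumptions, not your code.

[codebox]// one array per field instead of one array of ITEMCNT structs
__global__ void count_kernel(unsigned int *dk_itemset_id,
                             unsigned int *dk_itemset_cnt,
                             unsigned int stream_start)
{
    unsigned int item_idx = stream_start + blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int item_cnt = 0;

    // ... computation that produces item_cnt ...

    // consecutive threads write consecutive 4-byte words of dk_itemset_cnt,
    // so the 16 stores of a half-warp fall into one aligned 64-byte segment
    // and coalesce (assuming stream_start is a multiple of 16)
    dk_itemset_cnt[item_idx] = item_cnt;
}[/codebox]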

In the main function:

[codebox]int *d_itemsets_cnt_struct; // store the counted or connected itemsets information temporarily

CUDA_SAFE_CALL( cudaMalloc((void**)&d_itemsets_cnt_struct, DBASE_MAXITEM*sizeof(int)) );[/codebox]

In the device function:

[codebox]dk_freq_itemset_cnt[item_idx] = item_cnt;[/codebox]

As I said, I had already tried writing back to an int array; even when only one integer is written back, the timing doesn't change, so it doesn't seem to be a memory-coalescing problem.

My OS is Fedora 7; I prefer running programs on Linux rather than on MS Windows. I have examined several device functions with the CUDA profiler, and the output looks reasonable, with different values, zero or non-zero.

I will learn how to check the ptx code.

Thank you for your help.

I’m not sure if this thread is completely stale, but the title fits my problem perfectly.

It seems to me that CUDA cannot coalesce writes to global memory. Here is my evidence:

I am permuting a row of memory into a random order in 3 steps so that I can profile time spent in each:

  1. read a random row index from global memory

  2. read a data row at that random index

  3. write the data item to a new row in global memory

For example, if I reorder the following vector named data into the vector named permuted by accessing data at the indices in random, the result would be the following:

data: [.3, .5, .2, .9]

random: [1, 3, 2, 0]

permuted: [.5, .9, .2, .3]

I understand coalescence and I have verified via testing that step 1 is indeed coalesced nicely.

Warps in step 2 access randomly criss-crossing values and are inherently impossible to coalesce-- that’s fine, I must accept that.

The data structure written to in step 3 is allocated using cudaMallocPitch, etc., just like the data structure whose accesses are nicely coalesced in step 1. However, it takes just as much time as step 2! I have verified that the data structure is properly set up for coalescing by doing a subsequent read from it, which does indeed turn out to be coalesced.
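
(For context, the output buffer is allocated with the usual cudaMallocPitch pattern, roughly the generic sketch below; the variable names and sizes are placeholders, not my exact code.)

[codebox]size_t width  = 4096;   // elements per row (placeholder)
size_t height = 1024;   // number of rows (placeholder)
float *permuted;
size_t pitch;

// each row is padded to "pitch" bytes so that every row start keeps the
// alignment needed for coalescing
cudaMallocPitch((void**)&permuted, &pitch, width * sizeof(float), height);

// inside a kernel, row r is then addressed through the pitch:
// float *row = (float*)((char*)permuted + r * pitch);
// row[i] = value;[/codebox]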

Time spent in this loop:

Step 1: ~2%

Step 2: ~49%

Step 3: ~49%

So, the reading is coalesced and writing is not. Is this supposed to happen in general? The manual only speaks in terms of “memory accesses” and does not make a distinction between reading and writing.

System info:

Device: GeForce 8800 Ultra

OS: RHEL5 x86_64

Cuda compilation tools, release 2.0, V0.2.1221

Code from the kernel loop (timekeeping statements removed):

[codebox]for (int i = 0; i < len; i++)
{
    // Step 1: read a random row index
    tmpi = random[i].value;

    // Step 2: read the data item at that index
    tmpf = data[tmpi];

    // Step 3: write the data item to the new permuted row
    permuted[i] = tmpf;

    // Step 4: read from permuted to check for coalesced reading
    // (this turns out to be just as fast as step 1, which is a coalesced read also)
    tmp = permuted[i];
}[/codebox]
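
For reference, here is a minimal one-thread-per-element sketch where I would expect the write side to coalesce (the names, types, and launch shape are made up, it is not my real code):

[codebox]__global__ void permute_kernel(const float *data, const int *random,
                               float *permuted, int len)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len)
    {
        // gather: consecutive threads read scattered addresses, cannot coalesce
        float v = data[random[i]];

        // consecutive threads write consecutive addresses, should coalesce
        permuted[i] = v;
    }
}[/codebox]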

Thanks very much.

My time measurements actually aren't very helpful, but Visual Profiler 1.0 says that when I write to my data structure in the same coalesced way I'm reading from it, my uncoalesced READS go through the roof. Is there something I'm missing about coalesced writes?