Preferred method of updating float4 type in global memory

Each launch visits each location in a contiguous float4 array only once (no contention from other threads for the same memory location), and I would guess that there is no more ‘optimal’ method of updating than this simple example:

float4 cur_vals=make_float4(1.0f,2.0f,3.0f,4.0f);

float4 temp_4=Arr[idx];

temp_4.x+=cur_vals.x;
temp_4.y+=cur_vals.y;
temp_4.z+=cur_vals.z;
temp_4.w+=cur_vals.w;

Arr[idx]=temp_4;

Would it make any difference if I applied the addition to each 4-byte sub-location directly, without using a temporary variable?
I just want to make sure there is no better method, as I will have to update quite a large region of memory in this fashion.

Personally I wouldn’t bother with the creation of cur_vals, although the compiler should optimize anything going on there pretty well. Questions like this are usually best answered by inspection of SASS and comparison of multiple approaches.

The fiddling around in the thread code should not be that significant compared to the global load/store activity.
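As a rough sketch of what "comparison of multiple approaches" could look like, the two variants from the question can be put side by side as separate kernels. The kernel names, the bounds check, the array size, and the launch configuration below are assumptions made for illustration, not anything from the original post.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Variant A (hypothetical name): read the float4 into a register
// temporary, update it, and write the whole float4 back.
__global__ void add_with_temp(float4 *arr, float4 delta, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float4 t = arr[idx];
        t.x += delta.x;
        t.y += delta.y;
        t.z += delta.z;
        t.w += delta.w;
        arr[idx] = t;
    }
}

// Variant B (hypothetical name): update each 4-byte component
// directly in global memory, with no explicit temporary.
__global__ void add_in_place(float4 *arr, float4 delta, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        arr[idx].x += delta.x;
        arr[idx].y += delta.y;
        arr[idx].z += delta.z;
        arr[idx].w += delta.w;
    }
}

int main()
{
    const int n = 1 << 20;                      // assumed array size
    float4 *d_arr = nullptr;
    if (cudaMalloc(&d_arr, n * sizeof(float4)) != cudaSuccess) return 1;
    cudaMemset(d_arr, 0, n * sizeof(float4));   // start from all zeros

    const float4 delta = make_float4(1.0f, 2.0f, 3.0f, 4.0f);
    const int threads = 256;                    // assumed block size
    const int blocks = (n + threads - 1) / threads;

    // One thread per float4; adjacent threads touch adjacent elements.
    add_with_temp<<<blocks, threads>>>(d_arr, delta, n);
    add_in_place<<<blocks, threads>>>(d_arr, delta, n);

    // Copy back the first element as a sanity check on both kernels.
    float4 first;
    if (cudaMemcpy(&first, d_arr, sizeof(float4),
                   cudaMemcpyDeviceToHost) != cudaSuccess) return 1;
    printf("%f %f %f %f\n", first.x, first.y, first.z, first.w);

    cudaFree(d_arr);
    return 0;
}
```

After both launches every element should hold (2, 4, 6, 8), since each kernel adds (1, 2, 3, 4) once to a zeroed array. Whether variant B ends up as a single 128-bit load and store or as separate 32-bit accesses is exactly the sort of thing the SASS for the two kernels would show.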

How do I examine the corresponding SASS and make sense of the output? Are there any resources that provide a guide?

When I compile, the …\x64\release folder is updated, and I can see the generated PTX in a file alongside other files of types such as FATBIN, CUBIN, II, and GPU.

Generally speaking each thread will update one adjacent float4 location in memory.

In general, the CUDA binary utilities are documented here:

CUDA Binary Utilities :: CUDA Toolkit Documentation

cuobjdump -sass myapp.exe

should give you something.

If you compile with -lineinfo and use nvdisasm instead, and you specify a cc3.0 or greater architecture, I think you’ll get output that is more readable, at least in referencing back to the source. It’s been a while since I fiddled with this, and I rarely use Windows for this level of work, so if things aren’t working just holler.