pacard
January 18, 2010, 11:30am
1
I am trying to profile my program with the CUDA profiler. I used the following events:
gld_32b : 32-byte global memory load transactions
gld_64b : 64-byte global memory load transactions
gld_128b : 128-byte global memory load transactions
gld_request : Global memory loads
My understanding is that gld_request=gld_32b+gld_64b+gld_128b.
But I am getting this output:
gld_32b=[ 9315200 ]
gld_64b=[ 3031600 ]
gld_128b=[ 1537150 ]
gld_request=[ 1962736 ]
So what does gld_request really mean?
I am using a GTX280 on Redhat EL_5.3_x86.
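As a quick arithmetic check of the relationship proposed above, the posted counters can be summed and compared with gld_request. This is only a sketch: the variable names are made up for illustration, and the numbers are copied from the profiler output in the post.

```python
# Counter values copied from the profiler output quoted in the post.
gld = {"gld_32b": 9315200, "gld_64b": 3031600, "gld_128b": 1537150}
gld_request = 1962736

# Hypothesis from the post: gld_request == gld_32b + gld_64b + gld_128b.
total_transactions = sum(gld.values())
print(total_transactions)                          # 13883950
print(total_transactions == gld_request)           # False
print(round(total_transactions / gld_request, 2))  # 7.07
```

The sum is roughly seven times gld_request, so the two quantities cannot be counting the same thing.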
avidday
January 18, 2010, 12:26pm
2
I don’t think that is correct. On the GT200, a global load request is a half-warp-wide request for a global memory load from the memory controller. The GT200 memory controller can decompose the request into a sequence of transactions to service the gld_request, so one gld_request can produce more than one 32-byte, 64-byte or 128-byte load (see, for example, Figure 5-4 in the CUDA 2.3 programming guide). So I wouldn’t expect the relationship you suggest to hold in any case except perhaps code with perfectly coalesced read behaviour.
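To make the decomposition concrete, here is a minimal Python sketch of how one half-warp request might be split into transactions. It assumes the coalescing protocol for compute capability 1.2/1.3 devices as described in the CUDA 2.3 programming guide (pick the segment of the lowest-numbered pending thread, serve every thread in that segment, shrink the transaction if only one half is used, repeat). The function name and the simplifications are illustrative, not the hardware's or the profiler's actual implementation.

```python
# Simplified model of global-memory coalescing for ONE half-warp on
# compute capability 1.2/1.3 hardware. Illustrative assumption only.

def transactions_for_half_warp(addresses, word_size=4):
    """Return the transaction sizes (bytes) issued for one half-warp request.

    addresses: byte addresses requested by the active threads (up to 16),
               each assumed to be aligned to word_size.
    """
    # Initial segment size depends on the word size accessed by each thread.
    segment = {1: 32, 2: 64}.get(word_size, 128)
    pending = list(addresses)
    sizes = []
    while pending:
        # Segment containing the address of the lowest-numbered pending thread.
        base = (pending[0] // segment) * segment
        served = [a for a in pending if base <= a < base + segment]
        size, lo = segment, base
        # Halve the transaction while only its lower or upper half is used.
        while size > 32:
            half = size // 2
            if all(a + word_size <= lo + half for a in served):
                size = half
            elif all(a >= lo + half for a in served):
                lo, size = lo + half, half
            else:
                break
        sizes.append(size)
        # Threads served by this transaction are done.
        pending = [a for a in pending if not (base <= a < base + segment)]
    return sizes

# Perfectly coalesced 4-byte reads: one request, one 64-byte transaction.
print(transactions_for_half_warp([4 * i for i in range(16)]))         # [64]
# Each thread in its own 128-byte segment: one request, sixteen 32-byte transactions.
print(len(transactions_for_half_warp([128 * i for i in range(16)])))  # 16
```

In this model only the perfectly coalesced case gives one transaction per request; scattered addresses give many transactions for a single request.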
pacard
January 18, 2010, 1:03pm
3
avidday, thank you for your reply.
Now the data can be explained.
But I still have an odd set of data here:
gst_32b=1638400
gst_64b=0
gst_128b=307200
gst_request=57344
Summing these, the number of memory transactions is 1945600, so each request generates 1945600/57344 = 33.9 transactions. :-(
Am I getting something wrong here?
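The store figures can be checked the same way. The sketch below reproduces the ratio from the posted counters and compares it with 16, on the assumption (not confirmed in this thread for stores or atomics) that one request covers a 16-thread half-warp and that each transaction serves at least one of those threads. Variable names are illustrative.

```python
# Counter values copied from the post.
gst = {"gst_32b": 1638400, "gst_64b": 0, "gst_128b": 307200}
gst_request = 57344

# Assumption: one request covers one 16-thread half-warp, and every
# transaction serves at least one thread, so the ratio should be <= 16.
HALF_WARP_THREADS = 16

total = sum(gst.values())
ratio = total / gst_request
print(total)                      # 1945600
print(round(ratio, 1))            # 33.9
print(ratio > HALF_WARP_THREADS)  # True
```

Under that assumption the measured ratio is more than twice the expected ceiling, which is why the numbers look odd; the sketch only confirms the arithmetic and does not explain the discrepancy.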
pacard
January 18, 2010, 1:07pm
4
By the way, I am using the atomicCAS and atomicAdd operations heavily. Could that have caused the problem?
That looks somewhat suspicious and incorrect, but it still hints that you have nearly completely random global memory writes, which calls for optimisation :)
pacard
January 18, 2010, 2:04pm
6
Well, I tried my best to make the memory access pattern more coalesce-able. This is totally beyond my imagination.
I would accept the reality if transaction/request = 10, but not 33.