Fewer memory transactions taking more time

Hi,

I've two kernels whose profiling counters look like

[codebox]timestamp=[ 4025906.000 ] method=[ _Z14rearrangeDataSimple] gputime=[ 1215528.250 ] cputime=[ 1215528.000 ] occupancy=[ 1.000 ] gld_32b=[ 32022248 ] gld_64b=[ 40 ] gld_128b=[ 125160 ] instructions=[ 1824224 ]

timestamp=[ 3496534.000 ] method=[ _Z13rearrangeDataMod] gputime=[ 673151.125 ] cputime=[ 673157.000 ] occupancy=[ 0.500 ] gld_32b=[ 32022528 ] gld_64b=[ 0 ] gld_128b=[ 125088 ] instructions=[ 35431217 ][/codebox]

I was trying to employ a mechanism in “rearrangeDataMod” that should improve the run-time compared to “rearrangeDataSimple”. As you can see, that is indeed the case, but the memory transaction counters tell the opposite story. The idea was to increase the number of 128-byte transactions by clubbing data together, yet the 128-byte count is essentially unchanged (if anything, slightly lower), so I’m not really clear where the improvement is coming from. I’ve checked the write transactions, which are the same for both kernels, and both kernels are doing essentially the same thing. If anyone knows anything else I should profile to figure this out, that would be great.
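To sketch the idea (a simplified placeholder, not the actual kernels from the attached code; names and types are made up for illustration): instead of each thread loading one 32-bit word per field, the modified kernel clubs four adjacent words into a single wide load, roughly like this:

[codebox]
// Simplified sketch of the idea, not the actual kernels from largesort.cu.

// Simple version: one 32-bit load per thread.
__global__ void sketchSimple(const unsigned int *in, unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];            // 16 threads -> 64 bytes per half-warp
}

// Modified version: club four adjacent words into one 128-bit load,
// hoping each half-warp issues fewer, wider (128-byte) transactions.
__global__ void sketchMod(const uint4 *in, uint4 *out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4)
        out[i] = in[i];            // 16 threads -> 256 bytes per half-warp
}
[/codebox]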

Thanks

Shibdas

You could look for warp serialization and divergent branches, which could also be a reason for more memory transactions. What I understood is that you clubbed data together to get fewer memory transactions through improved coalescing, but you now see essentially the same number as in the “unoptimized” kernel, right?
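If it helps, with the command-line profiler those counts can be requested through a config file, roughly like this (counter names as used by the CUDA profiler of that era on GT200-class parts; the profiler only collects a handful of counters per run, so you may need to split them across runs):

[codebox]
# enable the command-line profiler
export CUDA_PROFILE=1
export CUDA_PROFILE_CONFIG=profile.cfg
export CUDA_PROFILE_LOG=profile.log

# profile.cfg -- one counter per line
warp_serialize
divergent_branch
branch
instructions
[/codebox]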

Yes. Actually, the supposedly optimized kernel is firing slightly more memory transactions than the previous one, but I could understand that if the technique of “clubbing” data together simply weren’t working as it should. My question is: why, then, is that kernel taking less time than the unoptimized one, as shown in the profiler output? Even the instruction counts and occupancy say it should be the other way round.
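For what it’s worth, totaling the load bytes implied by the counters (a back-of-the-envelope check; as far as I know the profiler on this hardware samples counters from a single TPC, so only the ratio between the two kernels is meaningful, not absolute bandwidth) suggests both kernels load essentially the same amount of data, just at different rates:

[codebox]
#include <stdio.h>

/* Back-of-the-envelope check from the profiler counters above.
 * The counters come from one sampled TPC, so the throughput figure
 * is not whole-GPU bandwidth -- only the two kernels' ratio matters. */
int main(void)
{
    /*                 gld_32b,  gld_64b, gld_128b, gputime (us) */
    double simple[] = { 32022248, 40,      125160,   1215528.25  };
    double mod[]    = { 32022528, 0,       125088,    673151.125 };
    double *k[]        = { simple, mod };
    const char *name[] = { "rearrangeDataSimple", "rearrangeDataMod" };

    for (int i = 0; i < 2; ++i) {
        double bytes = k[i][0] * 32 + k[i][1] * 64 + k[i][2] * 128;
        double secs  = k[i][3] * 1e-6;
        printf("%-20s %8.1f MB loaded, %7.1f MB/s\n",
               name[i], bytes / 1e6, bytes / 1e6 / secs);
    }
    return 0;   /* both ~1040.7 MB; Mod moves the same data ~1.8x faster */
}
[/codebox]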

Do you have warp serialization or divergent branches? Is the data aligned to the memory segments (so each memory transaction needed for 16/32 threads starts at the beginning of a segment)?
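To illustrate what I mean by alignment (a toy example, not your code): on compute 1.2/1.3 hardware, a half-warp reading 32-bit words from a segment-aligned base needs a single transaction, but shifting the base by even one word splits the same reads across segments:

[codebox]
// Toy illustration of segment alignment, not taken from the attached code.
// With offset == 0 each half-warp's 64-byte read starts at a segment
// boundary; with offset == 1 the same reads straddle two segments and
// the hardware has to issue extra/wider transactions.
__global__ void copyWithOffset(const float *in, float *out, int n, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i + offset];
}
[/codebox]

Comparing offset = 0 against offset = 1 in the profiler should change the gld_* counters even though the same amount of data moves.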

I’m attaching the code, which can be compiled and run. If you take a peek at largesort.cu, there are two kernels, rearrangeDataSimple and rearrangeDataMod. You can uncomment them one at a time to see the difference in run-time. The code expects a file called “random.dat” to be present in the current directory. It can be created using the program genRand.c; you have to specify the number of random integers to be generated. This preliminary code also expects the number of records to be a multiple of (1024 * 768). I’m checking with 40028160 records with 8 fields.

[codebox]

make

gcc -o genRand genRand.c

./genRand 40028160

bin/linux/release/largeSort -n=40028160

Using CUDA device [0]: Tesla T10 Processor

Sorting 40028160 32-bit unsigned int keys and values

Sorting : PASS

: elements GPUms

:: 40028160 1443.16767578

bin/linux/release/largeSort -n=40028160

Using CUDA device [0]: Tesla T10 Processor

Sorting 40028160 32-bit unsigned int keys and values

Sorting : PASS

: elements GPUms

:: 40028160 921.94521484

[/codebox]

Those two kernels are very similar and do not have any divergent branches. If you profile them, you will see the transaction and instruction counts mentioned earlier.

code.zip (14.5 KB)

From a short glimpse at your code I couldn’t find any difference between the two kernels that would explain such a difference in performance. I think you will have to dig somewhat further into your code to find one.