kernel function efficiency

DiSol · June 26, 2012, 6:04pm

Hi,

I have the below function written on cuda and i’m calling it with 192 threads 16 blocks and shared memory of the sharedMemPerBlock of the device.

I run the profiler and got occupancy of 83 percent which seems good but the thing is that i have a very small improvment from the CPU version (x5).

Any ideas what am i doing wrong and how i could improve the performance?

My guess is that the nested loop could cause it but i don’t know how can i solve it?

__global__ void cudaSieving (unsigned long *legendre, // pointer to array located on the device mem

				unsigned int *log_legendre, // pointer to array located on the device mem

				unsigned long *legendre_mod, // pointer to array located on the device mem

				unsigned int *partition, // pointer to array located on the device mem

				size_t legendre_len, // size of three first arrays

				size_t partition_size, // size of partition array

				size_t shared_partition_len, // size of partition on the shared mem

				unsigned int bits, // for init the partition on the shared mem

				unsigned int partition_idx) {	// partition_idx is for debug only

	// the shared partition will be later copy to the device partition (each block holds on the shared mem another part of the device partition)

	unsigned int i;

	int thread_loop_id;

	// init partitions

	extern __shared__ unsigned int shared_partition[];

	for (i = 0; (thread_loop_id = threadIdx.x + i) < shared_partition_len; i += blockDim.x) {

	

		shared_partition[thread_loop_id] = bits;

	}

	__syncthreads();

	int thread_legendre;

	int thread_log_legendre;

	int thread_legendre_mod;

	int in_partition_idx;

	int block_offset = shared_partition_len * blockIdx.x;

	for (i = 0; (thread_loop_id = threadIdx.x + i) < legendre_len; i += blockDim.x) {

	

		// get current thread legendre data

		thread_legendre = legendre[thread_loop_id];

		thread_log_legendre = log_legendre[thread_loop_id];

		thread_legendre_mod = legendre_mod[thread_loop_id];

		// calculate the in partition jumps offset

		in_partition_idx = thread_legendre_mod + block_offset;

		in_partition_idx %= thread_legendre;

		if (in_partition_idx)	// in case of 0 there is no offset

			in_partition_idx = thread_legendre - in_partition_idx;

		// partition jumps

		while (in_partition_idx < shared_partition_len) {

			atomicSub (&shared_partition[in_partition_idx], thread_log_legendre);

			in_partition_idx += thread_legendre;

		}

	}

	

	__syncthreads();

	// copy the partition part from the block shared mem to the device global mem

	for (i = 0; (thread_loop_id = threadIdx.x + i) < shared_partition_len; i += blockDim.x) {

	

		partition[shared_partition_len * blockIdx.x + thread_loop_id] = 

			shared_partition[thread_loop_id];

	}

}

tera · June 26, 2012, 8:16pm

I wouldn consider a 5Ã— speedup much more than a “very small improvement”.

Atomic operations on shared memory are expensive. You can easily test whether that is the bottleneck in your case by replacing the [font=“Courier New”]atomicSub()[/font] with [font=“Courier New”]-=[/font] (leading to wrong results of course) and checking if there is any significant speedup.

DiSol · June 27, 2012, 6:22am

Hi Tera,

Thanks for the fast reply.

I checked the run time (with profiler) without the atomic operation and you are right - without the atomic operation the run time is x1.5 from the one with atomic operation and the occupancy was now increased to almost 100 percent but when i run the whole program (this kernel runs thousends of times) the time decresed from 550 seconds to only 500 seconds.

Please notice that the fact that without the atomic operation i’ll somtimes get wrong calculation doesn’t really matter because this stage is only filtering numbers for me and i don’t care to miss some numbers becuaase i’ll find others…

what really happens is that this kernel will be executed now more times until it’ll filter the needed amount of numbers.

But still I think that the GPU run time doesn’t improve enouth from the CPU…

Maybe i have bank conflicts or the netead loop have some relation to this?

I was expecting improvment of at least x40 - is it too musch?

tera · June 27, 2012, 8:47am

If your kernel runs 5Ã— faster than on the CPU but the whole program only sees a 10% speedup, either the copying to and from the GPU takes too much time or your program spends most of the time outside of the part you’ve ported to CUDA.

DiSol · June 27, 2012, 8:57am

No, the 550 seconds is the cuda version, the CPU version takes 550x5=2750 seconds (actually it’s 2975 External Image.

I’m sure that most of time is spent inside the kernel.

The 500 seconds i wrote on the prev comment is after i changed the atomic operation i.e. CUDA run time decreased from 550 → 500 seconds after changing the atomic substraction to regular one.

tera · June 27, 2012, 9:50am

Ah ok. So if 50s correspond to 50% of your kernel’s runtime, I conclude 100s are spent inside the kernel and 400s in non-CUDA code? That would still suggest you are soon to enter the domain of diminishing returns regarding further optimization.

In my experience you are very lucky if you are to encounter a 40Ã— speedup. Between 5Ã— and 10Ã— is more realistic, unless your CPU version is not well optimized.

DiSol · June 27, 2012, 9:58am

It’s not 100s that spent inside the kernel but most of the 500s because all my application is doing is a loop that calls to the kernel and this loop itself don’t take more of few seconds…

MKasper · June 27, 2012, 4:28pm

What GPU are you using?

My thoughts:

Can you modify your problem to launch bigger Kernels (more than 16 blocks) to further reduce kernel launch overhead?
If you use sharedMemPerBlock of shared memory in each block, you can fit exactly one block per SM. I don’t get how you can reach such a high occupancy.
If you launch just 16 blocks, and have 16 SMs (assuming GTX500 kind of architecture), you have just 6 warps per SM, that you can not really hide the global memory latencies and have to rely on your caches.
can you avoid the if statement by using
in_partition_idx = -thread_legendre_mod - block_offset;
in_partition_idx %= thread_legendre;
why do you use those loops instead of more threads per block?

Cheers,
Markus

DiSol · June 27, 2012, 4:49pm

Hi,

Thanks for the reply!

I’m using GTX 460.

I guess i can modify the problem to run more than 16 blocks but than i’ll lose other things - if the partition will be too small i won’t gain enouth with the jumps loop because i jump with big numbers in the partition array (each num =~ 80000) and if the partition is to small, to much threads won’t jump but any way i’ll try it.

i’m using this loops insted of more threads because the legendre is to big and there is no enouth threads for each one of them to take one value from it.

regarding the if statment - i can’t do what you wrote because it will be wrong calculation:

x%y is not equal to (-x)%y.

Thanks a lot for the help!

MKasper · June 28, 2012, 8:22am

Hi,
argh. sorry, I forgot the C % is the remainder not the modulo.

GTX460 means you have 7 SMs supporting Compute Capability 2.1:
max 1024 Threads/block, max 8 blocks per SM, max 1526 threads per SM, max 32K regs per SM, max 48K shared/SM

If your shared Mem usage is indeed 49152 byte per block, then you assign ONE block to each SM. Having 7 SMs my intuition says: use a multiple of 7 blocks to keep the hardware busy all the time. Otherwise 7 blocks are processed on 7 SMs → another 7 blocks are processed on 7 SMs → the remaining 2 blocks are processed on 2 SMS and 5 SMs are idle. This is relaxed a bit by: 1. concurrent kernel execution if you use multiple streams and 2. very short kernels

Furthermore my gut feeling says this kernel is memory bound, as you have 3 loads and just few instructions to hide their latency. With just 6 warps on the SM, most of the GPUs resources should be waiting for the memory accesses. You should see a slight speedup, using more threads/block. I’d suggest increasing the number of threads per block to the maximum of 1024 unless you run out of registers. Otherwise increase as long as you can without regster spilling. Of course your legendre_len parameter can give a natural upper bound.

If you don’t care: remove the atomic instruction.

For the rest: I fear you are in memory bandwidth hell with this algorithm. If you see any chance to reduce the global memory accesses or to remove the randomness of the shared mem accesses (causing them to be serialized) → try it, even if it is at the cost of some extra instructions. For example: if the log_legendre is just a precalculated log of the legendre value, try to make this one load + one log (if you round anyways consider the fast __logf primitives). Remember your cores should beidle most of the time, so playing with funny ideas is worth trying.

Topic		Replies	Views
Confusion about setting kernel block and grid size for maximum occupancy CUDA Programming and Performance cuda	11	862	March 30, 2024
Efficient use of shared memory CUDA Programming and Performance	29	4588	December 2, 2019
Cuda code performance CUDA Programming and Performance	14	3177	December 16, 2014
Cuda Latency problems Slow Cuda CUDA Programming and Performance	15	13942	September 5, 2008
Fewer threads per block = ... faster performance? CUDA Programming and Performance	9	147	December 31, 2024
Cuda program results are always zero in HW, correct in EMU? CUDA Programming and Performance	35	11216	May 23, 2010
Performance degradation as task size grows CUDA Programming and Performance	13	650	April 25, 2023
Discrepancy between theoretical occupancy and achieved occupancy depending on ThreadsPerBlock CUDA Programming and Performance cuda	7	150	September 6, 2024
Memory problem? ...incredible slowdown CUDA Programming and Performance	29	16345	January 30, 2011
Doubts related to CUDA CUDA Programming and Performance	17	11820	November 18, 2010

kernel function efficiency

Related topics