kernel function efficiency

Hi,

I have the below function written on cuda and i’m calling it with 192 threads 16 blocks and shared memory of the sharedMemPerBlock of the device.

I run the profiler and got occupancy of 83 percent which seems good but the thing is that i have a very small improvment from the CPU version (x5).

Any ideas what am i doing wrong and how i could improve the performance?

My guess is that the nested loop could cause it but i don’t know how can i solve it?

__global__ void cudaSieving (unsigned long *legendre, // pointer to array located on the device mem

				unsigned int *log_legendre, // pointer to array located on the device mem

				unsigned long *legendre_mod, // pointer to array located on the device mem

				unsigned int *partition, // pointer to array located on the device mem

				size_t legendre_len, // size of three first arrays

				size_t partition_size, // size of partition array

				size_t shared_partition_len, // size of partition on the shared mem

				unsigned int bits, // for init the partition on the shared mem

				unsigned int partition_idx) {	// partition_idx is for debug only

	// the shared partition will be later copy to the device partition (each block holds on the shared mem another part of the device partition)

	unsigned int i;

	int thread_loop_id;

	// init partitions

	extern __shared__ unsigned int shared_partition[];

	for (i = 0; (thread_loop_id = threadIdx.x + i) < shared_partition_len; i += blockDim.x) {

	

		shared_partition[thread_loop_id] = bits;

	}

	__syncthreads();

	int thread_legendre;

	int thread_log_legendre;

	int thread_legendre_mod;

	int in_partition_idx;

	int block_offset = shared_partition_len * blockIdx.x;

	for (i = 0; (thread_loop_id = threadIdx.x + i) < legendre_len; i += blockDim.x) {

	

		// get current thread legendre data

		thread_legendre = legendre[thread_loop_id];

		thread_log_legendre = log_legendre[thread_loop_id];

		thread_legendre_mod = legendre_mod[thread_loop_id];

		// calculate the in partition jumps offset

		in_partition_idx = thread_legendre_mod + block_offset;

		in_partition_idx %= thread_legendre;

		if (in_partition_idx)	// in case of 0 there is no offset

			in_partition_idx = thread_legendre - in_partition_idx;

		// partition jumps

		while (in_partition_idx < shared_partition_len) {

			atomicSub (&shared_partition[in_partition_idx], thread_log_legendre);

			in_partition_idx += thread_legendre;

		}

	}

	

	__syncthreads();

	// copy the partition part from the block shared mem to the device global mem

	for (i = 0; (thread_loop_id = threadIdx.x + i) < shared_partition_len; i += blockDim.x) {

	

		partition[shared_partition_len * blockIdx.x + thread_loop_id] = 

			shared_partition[thread_loop_id];

	}

}

I wouldn consider a 5× speedup much more than a “very small improvement”.

Atomic operations on shared memory are expensive. You can easily test whether that is the bottleneck in your case by replacing the [font=“Courier New”]atomicSub()[/font] with [font=“Courier New”]-=[/font] (leading to wrong results of course) and checking if there is any significant speedup.

Hi Tera,

Thanks for the fast reply.

I checked the run time (with profiler) without the atomic operation and you are right - without the atomic operation the run time is x1.5 from the one with atomic operation and the occupancy was now increased to almost 100 percent but when i run the whole program (this kernel runs thousends of times) the time decresed from 550 seconds to only 500 seconds.

Please notice that the fact that without the atomic operation i’ll somtimes get wrong calculation doesn’t really matter because this stage is only filtering numbers for me and i don’t care to miss some numbers becuaase i’ll find others…

what really happens is that this kernel will be executed now more times until it’ll filter the needed amount of numbers.

But still I think that the GPU run time doesn’t improve enouth from the CPU…

Maybe i have bank conflicts or the netead loop have some relation to this?

I was expecting improvment of at least x40 - is it too musch?

If your kernel runs 5× faster than on the CPU but the whole program only sees a 10% speedup, either the copying to and from the GPU takes too much time or your program spends most of the time outside of the part you’ve ported to CUDA.

No, the 550 seconds is the cuda version, the CPU version takes 550x5=2750 seconds (actually it’s 2975 :)).

I’m sure that most of time is spent inside the kernel.

The 500 seconds i wrote on the prev comment is after i changed the atomic operation i.e. CUDA run time decreased from 550 -> 500 seconds after changing the atomic substraction to regular one.

Ah ok. So if 50s correspond to 50% of your kernel’s runtime, I conclude 100s are spent inside the kernel and 400s in non-CUDA code? That would still suggest you are soon to enter the domain of diminishing returns regarding further optimization.

In my experience you are very lucky if you are to encounter a 40× speedup. Between 5× and 10× is more realistic, unless your CPU version is not well optimized.

It’s not 100s that spent inside the kernel but most of the 500s because all my application is doing is a loop that calls to the kernel and this loop itself don’t take more of few seconds…

What GPU are you using?

My thoughts:

  1. Can you modify your problem to launch bigger Kernels (more than 16 blocks) to further reduce kernel launch overhead?
  2. If you use sharedMemPerBlock of shared memory in each block, you can fit exactly one block per SM. I don’t get how you can reach such a high occupancy.
  3. If you launch just 16 blocks, and have 16 SMs (assuming GTX500 kind of architecture), you have just 6 warps per SM, that you can not really hide the global memory latencies and have to rely on your caches.
  4. can you avoid the if statement by using
    in_partition_idx = -thread_legendre_mod - block_offset;
    in_partition_idx %= thread_legendre;
  5. why do you use those loops instead of more threads per block?

Cheers,
Markus

Hi,

Thanks for the reply!

I’m using GTX 460.

I guess i can modify the problem to run more than 16 blocks but than i’ll lose other things - if the partition will be too small i won’t gain enouth with the jumps loop because i jump with big numbers in the partition array (each num =~ 80000) and if the partition is to small, to much threads won’t jump but any way i’ll try it.

i’m using this loops insted of more threads because the legendre is to big and there is no enouth threads for each one of them to take one value from it.

regarding the if statment - i can’t do what you wrote because it will be wrong calculation:

x%y is not equal to (-x)%y.

Thanks a lot for the help!

Hi,
argh. sorry, I forgot the C % is the remainder not the modulo.

GTX460 means you have 7 SMs supporting Compute Capability 2.1:
max 1024 Threads/block, max 8 blocks per SM, max 1526 threads per SM, max 32K regs per SM, max 48K shared/SM

If your shared Mem usage is indeed 49152 byte per block, then you assign ONE block to each SM. Having 7 SMs my intuition says: use a multiple of 7 blocks to keep the hardware busy all the time. Otherwise 7 blocks are processed on 7 SMs -> another 7 blocks are processed on 7 SMs -> the remaining 2 blocks are processed on 2 SMS and 5 SMs are idle. This is relaxed a bit by: 1. concurrent kernel execution if you use multiple streams and 2. very short kernels

Furthermore my gut feeling says this kernel is memory bound, as you have 3 loads and just few instructions to hide their latency. With just 6 warps on the SM, most of the GPUs resources should be waiting for the memory accesses. You should see a slight speedup, using more threads/block. I’d suggest increasing the number of threads per block to the maximum of 1024 unless you run out of registers. Otherwise increase as long as you can without regster spilling. Of course your legendre_len parameter can give a natural upper bound.

If you don’t care: remove the atomic instruction.

For the rest: I fear you are in memory bandwidth hell with this algorithm. If you see any chance to reduce the global memory accesses or to remove the randomness of the shared mem accesses (causing them to be serialized) -> try it, even if it is at the cost of some extra instructions. For example: if the log_legendre is just a precalculated log of the legendre value, try to make this one load + one log (if you round anyways consider the fast __logf primitives). Remember your cores should beidle most of the time, so playing with funny ideas is worth trying.