Here are my timing results, and they are not impressive. Help.

I have tried to do some computations on a block of memory in parallel. Here is the kernel code:

__global__ void
testKernel(float* g_idata, int data_size)
{
    // access thread id
    const unsigned int tid = threadIdx.x;
    // access number of threads in this block
    const unsigned int num_threads = blockDim.x;

    // each thread works on its own contiguous slice of the data
    int slices = data_size / num_threads;
    int js;

    for (int i = 0; i < slices; i++) {
        js = (tid * slices) + i;

        for (int j = 0; j < 200000; j++) {
            g_idata[js] = g_idata[js] + 2;
            g_idata[js] = g_idata[js] - 1;
        }
    }

    __syncthreads();
}

I have run this with data_size always set to 2000 and the number of threads varying. Each time I also did the same computation on the CPU to compare the results. Please look at the results below. Does anyone else see similar numbers? I am not impressed by the 10, 20 and 50 thread cases, which is the range my application would run in.
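For reference, the host side is just the usual allocate / copy / launch / copy back sequence. Here is a simplified sketch of it (h_idata, h_ref and d_idata are just the names I use here, and the actual timing code around the kernel launch and the CPU loop is left out):

    int    data_size   = 2000;
    int    num_threads = 50;                      // varied between runs
    size_t bytes       = data_size * sizeof(float);

    float* h_idata = (float*) malloc(bytes);      // host copy of the device data
    float* h_ref   = (float*) malloc(bytes);      // CPU reference result
    for (int i = 0; i < data_size; i++) { h_idata[i] = 0.0f; h_ref[i] = 0.0f; }

    float* d_idata;
    cudaMalloc((void**) &d_idata, bytes);
    cudaMemcpy(d_idata, h_idata, bytes, cudaMemcpyHostToDevice);

    // one block of num_threads threads, as in the results below
    testKernel<<<1, num_threads>>>(d_idata, data_size);
    cudaThreadSynchronize();                      // make sure the kernel has finished
    cudaMemcpy(h_idata, d_idata, bytes, cudaMemcpyDeviceToHost);

    // CPU reference: the same +2 / -1 work on every element
    for (int js = 0; js < data_size; js++)
        for (int j = 0; j < 200000; j++) {
            h_ref[js] = h_ref[js] + 2;
            h_ref[js] = h_ref[js] - 1;
        }
    // then compare h_idata against h_ref and time both sections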

Here are my results:

All times in microseconds.

Release build:

Threads:   1    kernel: 9395514    CPU: 3024071
Threads:   2    kernel: 7564088    CPU: 3022072
Threads:   5    kernel: 7220803    CPU: 3019677
Threads:  10    kernel: 4499308    CPU: 3025057
Threads:  20    kernel: 2454269    CPU: 3019823
Threads:  50    kernel: 2379628    CPU: 3027225
Threads: 100    kernel: 2455272    CPU: 3029687
Threads: 200    kernel: 2450573    CPU: 3026759
Threads: 500    kernel: 2450161    CPU: 3020396

Is there anything I am doing wrong?

  1. Block sizes should be multiples of 32, since that is the warp size and the smallest batch of threads the device can execute.
  2. It doesn't appear that your memory accesses are coalesced. Rearranging them so that consecutive threads read/write adjacent elements (across rows instead of down columns) should help; see the sketch after this list.
  3. Are you running just one block? The GPU is really a collection of relatively slow processors, but it can process an enormous number of threads concurrently. If you run only one block, you see the slow processors because most of the device sits idle. With 100+ blocks, all multiprocessors are active and multiple blocks can be interleaved.
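For point 2, a minimal sketch of what the interleaved indexing could look like, keeping your +2/-1 busy loop and only changing how js is computed (testKernelCoalesced is just an illustrative name):

    __global__ void
    testKernelCoalesced(float* g_idata, int data_size)
    {
        const unsigned int tid         = threadIdx.x;
        const unsigned int num_threads = blockDim.x;

        // stride by the thread count instead of giving each thread a
        // contiguous slice: on iteration i, thread 0 touches element
        // i*num_threads + 0, thread 1 touches i*num_threads + 1, and so on,
        // so the threads of a warp read/write adjacent words together
        for (int js = tid; js < data_size; js += num_threads) {
            for (int j = 0; j < 200000; j++) {
                g_idata[js] = g_idata[js] + 2;
                g_idata[js] = g_idata[js] - 1;
            }
        }
    }

The result per element is the same; only the assignment of elements to threads changes.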

Yes, I am only running one block.

How do I find out which block I am running in from inside the kernel?

The blockIdx variable.
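The usual pattern is to combine blockIdx.x with blockDim.x and threadIdx.x to get a grid-wide thread index. A minimal sketch along those lines, again using your +2/-1 loop (testKernelMultiBlock is just an illustrative name, not code from your post):

    __global__ void
    testKernelMultiBlock(float* g_idata, int data_size)
    {
        // global thread index across all blocks, and total thread count in the grid
        const unsigned int gtid          = blockIdx.x * blockDim.x + threadIdx.x;
        const unsigned int total_threads = gridDim.x * blockDim.x;

        // grid-wide interleaved loop over the data
        for (int js = gtid; js < data_size; js += total_threads) {
            for (int j = 0; j < 200000; j++) {
                g_idata[js] = g_idata[js] + 2;
                g_idata[js] = g_idata[js] - 1;
            }
        }
    }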

Here are results for varying the number of threads and blocks. My application needs exactly 50 threads in total, so I am splitting them up this way… I have not seen the benefit of 32 blocks that a previous post in this discussion suggested would be optimal. Is this due to the fact that there are 12 multiprocessors on my card? Any comments?

Data for 50 threads and 1 blocks
Elapsed microseconds for kernel: 2454572
Elapsed microseconds for CPU reference: 3017526
Test PASSED

Data for 25 threads and 2 blocks
Elapsed microseconds for kernel: 2379803
Elapsed microseconds for CPU reference: 3016858
Test PASSED

Data for 10 threads and 5 blocks
Elapsed microseconds for kernel: 1495458
Elapsed microseconds for CPU reference: 3016244
Test PASSED

Data for 5 threads and 10 blocks
Elapsed microseconds for kernel: 1296207
Elapsed microseconds for CPU reference: 3016458
Test PASSED

Data for 2 threads and 25 blocks
Elapsed microseconds for kernel: 2282588
Elapsed microseconds for CPU reference: 3016848
Test PASSED

Data for 1 threads and 50 blocks
Elapsed microseconds for kernel: 4492677
Elapsed microseconds for CPU reference: 3017058
Test PASSED

If your algorithm can only scale to 50 threads (across all blocks), then CUDA is going to be a poor fit. Even mediocre utilization of your card requires at least 32 threads/block x 12 blocks = 384 threads. To get good performance you will need at least double that, both in threads per block and in total blocks.

Also, when MisterAnderson42 said “block size” he meant “threads per block”. That number really should be a multiple of the warp size, which is 32. It is less important how many total blocks you have, though you definitely want at least as many blocks as you have multiprocessors, and 2 or 3x more if possible.
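To make that concrete, a launch for a 12-multiprocessor card could look something like the sketch below (the kernel is the grid-wide variant sketched earlier in the thread, d_idata is a hypothetical device pointer, and the numbers are only meant to illustrate the multiples, not a tuned choice):

    // threads per block: a multiple of the 32-thread warp size
    int threads_per_block = 128;
    // blocks: a few times the 12 multiprocessors, so every one has work to interleave
    int num_blocks = 36;

    // note: with only data_size = 2000 elements, most of these threads would
    // have nothing to do, which is why a problem that only exposes ~50 threads
    // of parallelism is a poor fit for the card
    testKernelMultiBlock<<<num_blocks, threads_per_block>>>(d_idata, data_size);
    cudaThreadSynchronize();   // wait for the kernel before stopping the timer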