Please tell me the optimal values of the A, B, C, D parameters

for a GeForce 8800 GTS 640MB

in this code:

// setup execution parameters

	dim3 threads(A, B);

	dim3 grid(C, D);

	// execute the kernel

	matrix<<< grid, threads >>>(dataDev);

Well - that heavily depends on the amount of data you are going to process, as well as on the resource usage (registers, shared memory) of your kernel code.

If I use 1 register and 1 byte of shared memory, what are the best values of the A, B, C, D parameters?

Again: it depends on what you’re doing in your kernel; you cannot expect a definitive answer to this kind of question. If there existed some values that worked best for all cases, they would be hard-coded in the CUDA runtime, and the programmer would not be asked to supply them as arguments to the kernel invocation.

this code:

__global__ void matrix( calcData* data)
{
	// Global thread index
	int i = threadIdx.x + blockIdx.x * BLOCK_SIZE;

	calcData* dat = data + i;

	int f1 = dat->field1;
	int f2 = dat->field2;

	dat->field1 = i;

	int s = 0;
	for (int k = 0; k < 999999; ++k)  // loop counter
	{
		s += f1;
		s ^= f2;
	}

	dat->field3 = s;
}

How many items in the data array do you have to process with this kernel?

The number of items does not matter.

I need to maximize throughput.

Of course the number of items matters - the size of the grid (the numbers “C” and “D” you asked for) cannot be determined without knowing the number of items to be processed.

I need to process 9999999999999999999999 items.

How many items need to be processed by a single call

to maximize throughput?

Tell me the optimal values of the A, B, C, D parameters

for a GeForce 8800 GTS 640MB.

this code:

// setup execution parameters

	dim3 threads(A, B);

	dim3 grid(C, D);

	// execute the kernel

	matrix<<< grid, threads >>>(dataDev);
__global__ void matrix( calcData* data)
{
	// Global thread index
	int i = threadIdx.x + blockIdx.x * BLOCK_SIZE;

	calcData* dat = data + i;

	int f1 = dat->field1;
	int f2 = dat->field2;

	dat->field1 = i;

	int s = 0;
	for (int k = 0; k < 999999; ++k)  // loop counter
	{
		s += f1;
		s ^= f2;
	}

	dat->field3 = s;
}

(hard-headed people on this forum)

You need to provide actual working code, and details of the data you want to process, before anyone can give you an answer to this question. The amount of shared memory and registers your kernel uses is generally what determines A, B, C, and D.

this code:

__global__ void matrix( calcData* data)
{
	// Global thread index
	int i = threadIdx.x + blockIdx.x * BLOCK_SIZE;

	calcData* dat = data + i;

	int f1 = dat->field1;
	int f2 = dat->field2;

	dat->field1 = i;

	int s = 0;
	for (int k = 0; k < 999999; ++k)  // loop counter
	{
		s += f1;
		s ^= f2;
	}

	dat->field3 = s;
}

In order to reach maximum occupancy on a 1.0 or 1.1 device, you can launch:

  • a block of size 384
  • two blocks per stream multiprocessor.

According to the Programming Guide (Appendix A.1, General Specifications, page 101), the GeForce 8800 GTS has 4 stream multiprocessors. Therefore, to completely saturate your device you need to launch (for example):
8 blocks of 384 threads.
Another option is
12 blocks of 256 threads.

There are other combinations…
In general:

  • the size of a block cannot be bigger than 512
  • you need 768 threads per SM for maximum occupancy
  • you cannot launch more than 8 blocks per SM

How many in the X or Y direction does not matter that much, but I launch only one-dimensional kernels. From my own experience with multiple dimensions - I hurt myself a lot by doing that. Long debugging to understand that one of my small device functions assumed one-dimensional execution only.

However, maximum occupancy requires that each thread uses at most 16 registers. You are also limited by the amount of shared memory: in the 8-blocks-of-384 setting you can use at most 8 KB of shared memory per block.
When you compile your code with nvcc, add the option “--ptxas-options=-v” (without quotation marks).

I agree with what previous people said, maximum speed heavily depends on your code, but with the simplest setting here is the answer…

So multiply the numbers provided by Cygnus X1 above by 3, and that’s it (24 blocks of 384 threads, or 36 blocks of 256 threads, as minimum numbers to fill up the device), in case you don’t hit register usage or shared memory limits.

From the kernel source you stubbornly kept pasting here, one would say you are not using shared memory, so that should not be an issue. On the other hand, this source doesn’t compile, so it’s not possible to check register usage; still, it seems only a small number of registers would be used, so that should not be an issue either.

But both you and we know that this kernel source is meaningless (it would be fun to see the host code preparing the calcData array layout in device memory for passing it to the kernel), so I’m afraid all of this talk is not going to be of much help. Instead, do yourself a favor and read chapter 4 of the “Best Practices Guide” document - it is all explained there.

Is there a method to apply low priority to CUDA calculations?
My video system becomes slow.

It looks like video functions are completely blocked while a CUDA task is running.
Can I print to the screen while a CUDA task is running?

While CUDA is running and your screen is plugged into the same device, all output freezes, so no, you cannot really print stuff at exactly that time. There are ways “partially” around it: the kernel invokes printing, but it appears later. I don’t know about cuPrintf; I made one myself using mapped pinned memory.
Regarding the GeForce 8800 GTS - I stand corrected, it has 12 SMs, not just 4. So you multiply the number of blocks by 3; the number of threads remains the same.

On a trivial task, CUDA on the GeForce 8800 GTS 640MB is faster than a Celeron Core Duo 2333 by a factor of 112.

On the Linux crypt(3) function task, CUDA on the GeForce 8800 GTS 640MB is faster than a Celeron Core Duo 2333 by a factor of 32.

What causes this difference?

Memory usage?

The Linux crypt(3) function:

ptxas info	: Compiling entry function '_Z6matrixP10calcDataInP11calcDataOut'

ptxas info	: Used 16 registers, 170+0 bytes lmem, 4312+16 bytes smem, 116 bytes cmem[1], 4 bytes cmem[14]

trivial task:

ptxas info	: Compiling entry function '_Z6matrixP10calcDataInP11calcDataOut'

ptxas info	: Used 6 registers, 8+16 bytes smem, 8 bytes cmem[1]

Ahh… this is a 1.0 device; I keep looking at the 1.2 specs sometimes. My mistake.
To reach maximum occupancy on your device, you cannot have more than 10 registers per thread (and not 16, as I stated before).
Nevertheless, I see a bigger problem! You are using 170 bytes of local memory, which is almost as slow as global memory. Local memory is used in the following cases:

  • dynamic addressing of local arrays - this cannot be done in registers
  • you run out of register space. If you set the limit to 16 registers but the code needs a 17th one, it has to spill one out to local memory to free up space.
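As an illustration of the first case (this example is mine, not from the thread), a kernel like the following would typically force the per-thread array into local memory, because the index is not known at compile time:

```cuda
// Hypothetical example: dynamic addressing of a per-thread array.
__global__ void localMemExample(int* out, int idx)
{
	int scratch[32];                 // per-thread array
	for (int k = 0; k < 32; ++k)
		scratch[k] = k * k;
	// idx is a runtime value, so scratch cannot live purely in registers
	// on compute 1.x - ptxas will report lmem usage for this kernel.
	out[threadIdx.x] = scratch[idx];
}
```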

Also, you have 16384 bytes (16 KB) of shared memory per SM on this device. If your kernel uses 4312+16 bytes (as reported by ptxas info above), you will be able to fit at most three blocks on an SM before the next one won’t fit.

For some algorithms maximum occupancy is very hard to reach, and often it is not even worth it! Because of register spilling, your performance may actually deteriorate.

Anyway, to experiment with register usage, shared memory usage, and block dimensions, you can play with the Occupancy Calculator spreadsheet, which ships somewhere with the CUDA toolkit or SDK (I don’t remember where, and I seem to have moved mine elsewhere).

Where in this code - http://openpaste.org/en/18965/ -

is local memory being used?

How can I find out which variables are placed in local memory?

Are function-local arrays guaranteed to be placed in local memory, or not?