CUDA 4.1 suggested improvements.

[list=1]

[*]Some more CUDA device information in the device properties structure!

Number of 32-bit registers per multiprocessor

With that I can find the register limit for optimal occupancy (see the sketch after this list).

For example, I have four kernels with register usages of 12, 10, 9, and 7.

So for a compute capability 1.1 device, the best choice is to limit the registers to 10 (8192 registers per multiprocessor / 768 resident threads ≈ 10.6 registers per thread)!

Maximum number of resident blocks per multiprocessor

Even though this one is currently constant across all platforms, it also helps to find the best block size for optimal occupancy for kernels below the register limit.

[*]Could you check this bug?

kernel produces only sometimes correct result, shared memory struct

[*]And this strange behaviour:

shared memory struct needs no initialisation!
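Regarding item 1, a minimal host-side sketch of what I have in mind, assuming the requested fields were added to cudaDeviceProp under the illustrative names regsPerMultiprocessor and maxThreadsPerMultiProcessor (they do not exist in CUDA 4.1):

[code]
// Minimal sketch; regsPerMultiprocessor and maxThreadsPerMultiProcessor are
// illustrative names for the requested fields, not part of CUDA 4.1.
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int regsPerSM    = prop.regsPerMultiprocessor;        // 8192 on CC 1.1
    int threadsPerSM = prop.maxThreadsPerMultiProcessor;  // 768 on CC 1.1

    // Largest per-thread register count that still allows full occupancy.
    int regLimit = regsPerSM / threadsPerSM;               // 8192 / 768 -> 10
    printf("Compile with -maxrregcount=%d for full occupancy\n", regLimit);
    return 0;
}
[/code]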

It would be nice to have the warp size as a variable. Don't you think so?

CUDA C Programming Guide, Version 4.0:

B.4.5 warpSize

This variable is of type int and contains the warp size in threads (see Section 4.1 for the definition of a warp).

Firstly, this does not make for pretty-looking code. Secondly, I suppose this is not good from a performance point of view.
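If the performance concern is that warpSize is a runtime variable rather than a compile-time constant, here is a minimal sketch of the difference (illustrative kernel, not from the guide); the compiler generally cannot fold divisions by warpSize the way it can with a literal 32:

[code]
__global__ void laneAndWarp(int *out)
{
    // warpSize is an ordinary int at run time, so the compiler
    // emits a real integer division/modulo here.
    int warp = threadIdx.x / warpSize;
    int lane = threadIdx.x % warpSize;

    // A hard-coded 32 lets the compiler reduce this to a shift and a mask.
    int warpConst = threadIdx.x >> 5;
    int laneConst = threadIdx.x & 31;

    out[threadIdx.x] = warp + lane + warpConst + laneConst; // keep values live
}
[/code]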

The cost depends on the amount of contention.

Launching a kernel whose parameter uses a device value as opposed to a host value.

Say I have a kernel that does some work, computes some integer N, and stores it at device memory address d_s.

I want to launch a new kernel with the number of blocks equal to N, which is stored at d_s.

I don’t want to have to cudaMemcpy the value of *d_s back to the host to launch the kernel, as cudaMemcpy is very slow.

And obviously something like kernel_launch<<<*d_s, blah>>>(); fails miserably.

http://forums.nvidia.com/index.php?showtopic=211486

Can be deleted, I misunderstood the post, thanks @njuffa.

Based on the example gurifisu gave, (s)he would like to base the grid configuration and / or threadblock configuration of a subsequent kernel on the data computed (and stored into device memory) by a previous kernel. That is currently doable only indirectly: either by copying the relevant data back to the host, explicitly or using the zero copy mechanism; or by unneeded threadblocks terminating themselves based on the data stored on the device, as I had suggested.
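For reference, a minimal sketch of the second option (unneeded threadblocks terminating themselves); the kernel name, MAX_BLOCKS, and the argument list are illustrative:

[code]
// Launch with an upper bound on the grid size; blocks beyond the
// device-computed count *d_s simply exit.
__global__ void second_kernel(const int *d_s /*, ... other args */)
{
    if (blockIdx.x >= *d_s)   // N was written to d_s by the previous kernel
        return;               // unneeded block terminates itself

    // ... real work for blocks 0 .. *d_s - 1 ...
}

// Host side: MAX_BLOCKS is a worst-case grid size known in advance.
// second_kernel<<<MAX_BLOCKS, threadsPerBlock>>>(d_s);
[/code]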

+1

Have you tried this one? Does it work? Thanks.

Could you elaborate about the context and the requirements? In particular, is the return value of the atomic operation needed or is it just used for accumulation? How many independent double-precision accumulators do you need? Are the atomic operations intermixed with non-atomic accesses? Do you have an idea about the dynamic range of the values?

I am thinking about an implementation based on a fixed-point long accumulator.

Yes. Currently we are implementing a particle code for simulating plasma physics, namely the particle-in-cell (PIC) method. In its simplest form, a Poisson equation is solved for the electric potential in which charged particles are moving. The source term of the Poisson equation is obtained by accumulating the charges of the particles. What has been observed in previous GPU implementations is that such accumulations are exceedingly expensive, thus becoming the bottleneck of the whole simulation. As double precision is commonly used for scientific simulations, an efficient atomicAdd for double-precision floating-point numbers in shared memory is highly desirable.

If one employs one million particles, each particle deposits every time step, and a simulation takes one thousand steps, the number of atomic operations can be very large.

Normally for PIC simulations they are not intermixed with non-atomic accesses (I am not completely sure what you mean by this… could you explain it a little or give me an example? Thanks.)

The dynamic range of the values for charge accumulation, as an example, could be around the number of particles. Or, in the case of current (velocity) accumulation, it varies between the positive and negative speed of light.

Thanks for asking. Please let me know if you have further questions.

With best regards,

Benetion

Thanks for the explanation.
What I had in mind is that you might be better served by a fixed-point accumulator than by atomic floating-point operations.

An issue with floating-point atomics is that they make the computation non-deterministic, as floating-point addition is not associative. This may be a concern especially if the accumulated terms can have different signs and may cancel each other.

An alternative is to perform the accumulation in a large fixed-point number (scaled integer). As fixed-point addition is associative and has no rounding error, fixed-point accumulation is deterministic and more amenable to error analysis. Using a sufficiently large accumulator, summations can even be computed exactly.
Even a 64-bit fixed-point accumulator may deliver more accurate results than double-precision floating-point (which has 53 bits of precision). For instance, using a scaling factor of 2^32, values up to 10^9 can be accumulated with a resolution of 10^(-9).
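For illustration, a minimal sketch of such a 64-bit fixed-point accumulator in shared memory, assuming a scale factor of 2^32 and a device of compute capability 2.0 or higher (required for 64-bit atomics on shared memory); the function names and the kernel are illustrative, not a finished implementation:

[code]
// 64-bit fixed-point (scaled integer) accumulator sketch, scale = 2^32.
#define FIXED_SCALE 4294967296.0   // 2^32

__device__ void atomicAddFixed(unsigned long long *acc, double val)
{
    // Convert to a scaled signed integer; the wrap-around of the unsigned
    // 64-bit add implements signed (two's-complement) accumulation.
    long long scaled = (long long)(val * FIXED_SCALE);
    atomicAdd(acc, (unsigned long long)scaled);
}

__device__ double fixedToDouble(unsigned long long acc)
{
    // Single conversion back to floating point after accumulation is done.
    return (double)(long long)acc / FIXED_SCALE;
}

__global__ void accumulate(const double *contrib, int n, double *blockSums)
{
    __shared__ unsigned long long acc;
    if (threadIdx.x == 0) acc = 0ULL;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAddFixed(&acc, contrib[i]);
    __syncthreads();

    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = fixedToDouble(acc);
}
[/code]

Note that the conversion back to double happens only once per block, which keeps the conversion cost out of the inner accumulation loop.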

The issues with long accumulators are the following:

  • Converting a long accumulator back to floating-point is a relatively expensive operation. Frequent conversions will hurt performance (this is what I meant by “intermixing with non-atomic accesses”).
  • They may use more memory than floating-point accumulators. If you are limited by the amount of shared memory and need accumulators longer than 64-bit, this can be a problem.

It can be implemented so that accumulation is done in constant time regardless of the accumulator width, using a few tricks such as high-radix carry-save representation. But it is probably not worth the effort for accumulators smaller than 128-bit.

Would 64-bit fixed-point be enough for your application? If so, the implementation is much easier, and the issues mentioned above are not a concern.