complex multiplication & coalesced global memory access

hi all, ;)
I want to do a complex multiplication of dev_input_signal and dev_supportfunction (complex arrays):
cufftComplex *dev_input_signal;
cufftComplex *dev_supportfunction;

cudaMalloc((void**)&dev_input_signal, sizeof(cufftComplex) * 512 * 512);
cudaMalloc((void**)&dev_supportfunction, sizeof(cufftComplex) * 512 * 512);

And record results in an array
cufftComplex *dev_output_signal;
cudaMalloc((void**)&dev_output_signal, sizeof(cufftComplex) * 512 * 512);

I have a kernel:

[b]__global__ void Kernel_Mult( cufftComplex *dev_input_signal, cufftComplex *dev_supportfunction, cufftComplex *dev_output_signal )
{
	int offset = threadIdx.x + blockIdx.x * blockDim.x;

	if (offset < 512*512){
		float real_sgn = dev_input_signal[offset].x;
		float imag_sgn = dev_input_signal[offset].y;
		float real_h = dev_supportfunction[offset].x;
		float imag_h = dev_supportfunction[offset].y;
		dev_output_signal[offset].x = real_sgn * real_h - imag_sgn * imag_h;
		dev_output_signal[offset].y = imag_h * real_sgn + imag_sgn * real_h;
	}
	}

}
[/b]

int main(){

dim3 blocks(512);
dim3 threads(512);
Kernel_Mult <<<blocks, threads>>>(dev_input_signal, dev_supportfunction, dev_output_signal);

}

…it worked, but it was too slow.
Visual profiler showed:
Kernel requested global memory read throughput (GB/s) 11.7987
Kernel requested global memory write throughput (GB/s) 5.89936

I have a device with Fermi architecture and compute capability 2.1.
I think that the problem is non-coalesced global memory access, but I do not know how to achieve coalescing for cufftComplex arrays.

Any ideas? Tips?

I think you might get some improvement if you replace real_sgn and imag_sgn by just one variable, float2 c_sgn, and use c_sgn = dev_input_signal[offset]. Same with the real_h and imag_h variables. I think that the lines
float real_sgn = dev_input_signal[offset].x;
float imag_sgn = dev_input_signal[offset].y;
float real_h = dev_supportfunction[offset].x;
float imag_h = dev_supportfunction[offset].y; will start 4 memory reads in which the requested data is interleaved, which will result in 4 times more requests than if you use:
float2 c_sgn=dev_input_signal[offset];
float2 c_h = dev_supportfunction[offset];
I think you can use cufftComplex in the kernel instead of float2 as well.
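Indeed, cufftComplex is a typedef for cuFloatComplex (which is a float2), so the helpers in the CUDA header cuComplex.h apply directly. A sketch of the same kernel using cuCmulf (untested here, but the function is part of the standard cuComplex.h header):

```cuda
#include <cuComplex.h>
#include <cufft.h>

__global__ void Kernel_Mult(const cufftComplex *sgn,
                            const cufftComplex *h,
                            cufftComplex *out)
{
    int offset = threadIdx.x + blockIdx.x * blockDim.x;
    if (offset < 512 * 512) {
        // cuCmulf does the complex product in one call; each operand
        // is still loaded as a single 64-bit (float2) access.
        out[offset] = cuCmulf(sgn[offset], h[offset]);
    }
}
```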

thank you for reply:)
so… I made:
[b]__global__ void Kernel_Mult( cufftComplex *dev_input_signal, cufftComplex *dev_supportfunction, cufftComplex *dev_output_signal )
{

	int offset = threadIdx.x + blockIdx.x * blockDim.x;

	if (offset < 512*512){
		float2 c_sgn = dev_input_signal[offset];
		float2 c_h = dev_supportfunction[offset];
		dev_output_signal[offset].x = c_sgn.x * c_h.x - c_sgn.y * c_h.y;
		dev_output_signal[offset].y = c_h.y * c_sgn.x + c_sgn.y * c_h.x;
	}

}[/b]

Kernel launch parameters
int main(){

dim3 blocks(512);
dim3 threads(512);
Kernel_Mult <<<blocks, threads>>>(dev_input_signal, dev_supportfunction, dev_output_signal);

}

Visual profiler showed:
Kernel requested global memory read throughput (GB/s) 18.8052
Kernel requested global memory write throughput (GB/s) 9.40258
yes, that's faster :)
And how can I determine whether the memory access is coalesced?
I hope to make it faster.

I am confused. Both versions produce exactly the same code, because the compiler automatically reorders the instructions to take advantage of 64-bit wide memory accesses. So how could one be faster than the other?

Which CUDA version are you using? How reproducible is your measurement?

Your throughput is now bigger, but I think that the ultimate test is measuring the time.
The accesses should be coalesced now. Your data is arranged like:

RIRIRIRIRIRIRIRIRIRIRIRIRIRI

Your code was trying to get the R’s first and then the I’s, so each warp had to do at least two transactions: first loading the R’s and then the I’s. With the new code the data is accessed contiguously and each warp loads a contiguous chunk.

If you do this on the CPU there is probably no big increase, because of the large cache and some optimizations the compilers have. On the GPU things work differently: each warp loads a chunk at a time, and the programmer must be more careful.
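On measuring the time directly: a minimal sketch using CUDA events (assuming the Kernel_Mult and device buffers from the earlier posts; error checking omitted for brevity):

```cuda
// Time the kernel with CUDA events instead of relying only on the
// profiler's sampled throughput counters.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
Kernel_Mult<<<512, 512>>>(dev_input_signal, dev_supportfunction, dev_output_signal);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("Kernel_Mult: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Averaging over many launches (and skipping the first, which includes one-time setup cost) gives more reproducible numbers than a single run.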

This was just the advice I got from an Nvidia employee during an optimization workshop. I read the programming guide 2 times, but there are still many things I do not understand.
Tera, is the feature you are referring to only present in toolkit 4.0 ?

Just a late edit: your performance might change with different numbers for the dim3 blocks and threads variables. You might also want to try float4 variables and have the kernel do more than one complex multiplication per thread. This might help or not, depending on the instructions/byte ratio and the real and theoretical occupancy.
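A sketch of the float4 idea (kernel name and parameters are mine, not from the thread): each float4 load carries two consecutive complex values, so every thread does two complex multiplications and the grid shrinks by half. Since cudaMalloc returns pointers aligned well beyond 16 bytes, the cufftComplex buffers can be reinterpreted as float4:

```cuda
// Two complex multiplications per thread via float4 loads.
// (a.x + i*a.y) and (a.z + i*a.w) are two consecutive complex numbers.
__global__ void Kernel_Mult2(const float4 *sgn, const float4 *h,
                             float4 *out, int n_pairs)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n_pairs) {
        float4 a = sgn[i];
        float4 b = h[i];
        float4 r;
        r.x = a.x * b.x - a.y * b.y;   // first product, real part
        r.y = a.x * b.y + a.y * b.x;   // first product, imaginary part
        r.z = a.z * b.z - a.w * b.w;   // second product, real part
        r.w = a.z * b.w + a.w * b.z;   // second product, imaginary part
        out[i] = r;
    }
}

// Launch with half as many threads: 512*512/2 = 131072 pairs.
// Kernel_Mult2<<<256, 512>>>((const float4*)dev_input_signal,
//                            (const float4*)dev_supportfunction,
//                            (float4*)dev_output_signal, 512*512/2);
```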

No, all toolkits at least going back to 2.something are able to perform this optimization. I haven’t checked with 4.1 yet, but would be really surprised if Nvidia introduced such a bad performance regression.

The suggestion to use [font=“Courier New”]float4[/font] of course is a good one, and probably is the single remaining optimization that might be worth doing. However the cache on compute capability 2.x devices already evens out performance a lot, and any optimization may thus provide at most marginal improvements.

As hinted above, I strongly suspect that with this relatively minor amount of data transferred, the results are just not very reproducible. Note that the profiler only provides statistical sampling, as it only collects data on a subset of multiprocessors and memory controllers.

EDIT: Grammar

Many thanks for the clarification.
I’ll have to try it with larger samples.