CUDA program results are always zero in HW, correct in EMU?

OK, here’s one thing to try. It might be worth going through shared mem again: since we permute over more than a half-warp, the different warps were reading some data twice (this would also explain why Fermi’s speedup vs. the GTX 280 is more than the memory bandwidth ratio).

The changes in the setup code are mostly for my own enjoyment; I don’t expect them to actually help since we are bandwidth-bound.

#define subVector 192

__global__ void gamma1_Kernel(Matrix M, Matrix N)
{
	// subVector is a multiple of 24
	__shared__ float Mds[subVector];

	int tx = threadIdx.x;
	int basex = blockDim.x * blockIdx.x;

	// split the thread index into a group of 6 and the position within it
	int txdiv6 = tx / 6;
	int txmod6 = tx - __umul24(txdiv6, 6);

	// evaluates to -1.0f or +1.0f; the operands must stay signed ints
	float sign = (((2 * tx) ^ txdiv6) & 2) - 1;
	int permuted_idx = __umul24(txdiv6 ^ 3, 6) + (txmod6 ^ 1);

	// stage the subvector in shared memory so the global reads stay coalesced
	Mds[tx] = M.elements[basex + tx];
	__syncthreads();

	N.elements[basex + tx] = sign * Mds[permuted_idx];
}

Have you computed the actual global memory bandwidth the kernel achieves and compared it to the devices’ specified memory bandwidth? This might answer some of your professor’s questions.
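
If not, here’s a minimal sketch of how to measure it (timing with CUDA events; n and numBlocks are my names, assuming n is the total float count and every element is read once and written once):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
gamma1_Kernel<<<numBlocks, subVector>>>(M, N);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

// one read plus one write per element: 2 * n * sizeof(float) bytes in total
double gbps = 2.0 * n * sizeof(float) / (ms * 1.0e6);
printf("achieved bandwidth: %.2f GB/s\n", gbps);

A single launch only moves 2 x 192 KB here, so average over many launches to get a stable number.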

EDIT: Fixed bugs in the kernel - the sign depends on the index after permutation, and unsigned ints wrap around instead of becoming negative.
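
To illustrate the second bug (a sketch mirroring the kernel’s sign expression):

// with unsigned operands, (x & 2) - 1 can never be negative: when
// (x & 2) == 0 it wraps to 4294967295, which rounds to 2^32 as a float
unsigned int x = 0;
float sign = (x & 2) - 1;   // 4294967296.0f instead of -1.0f

With plain int operands the subtraction yields a proper -1.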

I thank you all for your patience and attention!
I am really new to the field and am doing my best! You have been extremely helpful.

If the input is a 49152-element float vector, 192 KB of space in total … doesn’t latency play as much of a role as bandwidth does, and isn’t latency the main factor here … am I wrong in my conclusion? How can I hide latency when all I mainly do is fetch an element from global memory, place it into its new location, and copy it back … what other work is there to be done?

Wouldn’t scaling the number of SMs be the deciding factor here … if each SM has been assigned a number of thread blocks, it stays occupied until all of those blocks retire and new ones take their place … but as the SMs scale up, thread blocks no longer need to wait their turn … they can occupy other available SMs and their threads start to work … am I wrong here as well?

The other work comes from the other warps that also want to read memory. In effect, you are bound by the memory bandwidth, not its latency.

The memory access path is highly pipelined, so the bandwidth is much higher than bus width divided by latency.

No, the SMs don’t actually do much work here. Each SM can happily sit and wait for multiple memory transactions to finish; you don’t need multiple SMs for that.
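
To put a rough number on “highly pipelined” (illustrative figures, not measurements): by Little’s law, sustainable bandwidth equals bytes in flight divided by latency. To sustain, say, 150 GB/s at ~400 ns of latency, about 150e9 * 400e-9 = 60 KB must be in flight at any moment, vastly more than one warp’s single transaction. That concurrency comes from many warps each waiting on their own memory transactions, which is exactly why latency stops mattering once enough warps are resident.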

The easy test of that hypothesis would be to compare the Fermi result with the GTX 280 result. The GTX 280 has 30 multiprocessors; the GTX 480 has 15. Under your hypothesis, the GTX 280 should be considerably faster than the GTX 480 because it has double the SM count. But your measurements seem to directly contradict this.
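
(For reference, the spec sheets put the GTX 280 at roughly 142 GB/s and the GTX 480 at roughly 177 GB/s, so raw bandwidth alone only accounts for about a 1.25x difference.)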

And to really maximize the available memory bandwidth, we can switch to 128-byte memory transactions by having each thread move a float2:

#define subVector 96

__global__ void gamma1_Kernel(Matrix M, Matrix N)
{
	// subVector is a multiple of 12
	__shared__ float2 Mds[subVector];

	int tx = threadIdx.x;
	int basex = blockDim.x * blockIdx.x;

	// each thread now moves a float pair, so groups of 6 become groups of 3
	int txdiv3 = tx / 3;
	int txmod3 = tx - __umul24(txdiv3, 3);

	// evaluates to -1.0f or +1.0f; txdiv3 must be a signed int (see the EDIT)
	float sign = (txdiv3 & 2) - 1;

	float2 *Mptr = (float2 *)M.elements;
	float2 *Nptr = (float2 *)N.elements;
	float2 Nd;

	int permuted_idx = __umul24(txdiv3 ^ 3, 3) + txmod3;

	Mds[tx] = Mptr[basex + tx];
	__syncthreads();

	// the intra-pair swap (txmod6 ^ 1 before) becomes an x/y swap with
	// opposite signs
	Nd.x =  sign * Mds[permuted_idx].y;
	Nd.y = -sign * Mds[permuted_idx].x;
	Nptr[basex + tx] = Nd;
}

Note that the block size now needs to be half of what it was before!
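
For clarity, a host-side launch sketch (n is my name for the total float count, assumed to divide evenly):

int nFloat2 = n / 2;              // n floats viewed as n/2 float2
dim3 block(subVector);            // 96 threads instead of 192
dim3 grid(nFloat2 / subVector);   // still one block per 192 floats
gamma1_Kernel<<<grid, block>>>(M, N);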

Rewriting the kernel to use a 32x3 block size instead is left as an exercise for the reader - we don’t expect any speedup anyway.

Please also check that I didn’t make any mistakes.

EDIT: Removed “unsigned” to fix the bug spotted by Last_time.

Which brings me to the subject of alignment. I assume M.elements is aligned on a 128-byte boundary?
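
If in doubt, a quick host-side check (sketch only):

#include <assert.h>
#include <stdint.h>

// cudaMalloc allocations are aligned to at least 256 bytes, so a pointer
// straight from cudaMalloc is fine; only an offset into an allocation could
// break the 128-byte segment alignment the float2 accesses rely on.
assert(((uintptr_t)M.elements & 127) == 0);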

And you should merge gamma1_Kernel with the subsequent kernel that does the reduction. There’s no need to write the results out to memory and read them back again in the next kernel.
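
Since I don’t know what your reduction computes, here is only a structural sketch, assuming for illustration that it sums the permuted elements - the point is that Nd stays on chip instead of round-tripping through N:

__global__ void gamma1_reduce_Kernel(Matrix M, float *blockSums)
{
	__shared__ float2 Mds[subVector];
	__shared__ float partial[subVector];

	int tx = threadIdx.x;
	int basex = blockDim.x * blockIdx.x;
	int txdiv3 = tx / 3;
	int txmod3 = tx - __umul24(txdiv3, 3);
	float sign = (txdiv3 & 2) - 1;
	int permuted_idx = __umul24(txdiv3 ^ 3, 3) + txmod3;

	float2 *Mptr = (float2 *)M.elements;
	Mds[tx] = Mptr[basex + tx];
	__syncthreads();

	// same permutation as before, feeding the reduction directly
	float2 Nd;
	Nd.x =  sign * Mds[permuted_idx].y;
	Nd.y = -sign * Mds[permuted_idx].x;
	partial[tx] = Nd.x + Nd.y;
	__syncthreads();

	// tree reduction; subVector = 96 is not a power of two, so fold the
	// top 32 elements in first (96 = 64 + 32), then reduce the remaining 64
	if (tx < 32) partial[tx] += partial[tx + 64];
	__syncthreads();
	for (int s = 32; s > 0; s >>= 1) {
		if (tx < s) partial[tx] += partial[tx + s];
		__syncthreads();
	}
	if (tx == 0) blockSums[blockIdx.x] = partial[0];
}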

Tera :)) You are great indeeeeed!!! :))

The average execution time was almost cut in half!! From approx. 0.15 ms to 0.08 ms on the G210!
One note though: the sign in this form didn’t work for the elements 0-12, 24-36 … it had the value of 2^32 for these groups, so I replaced it with
int sign = (txdiv3 & 0x0002) - 0x0001;
and it worked … Test Passed!

I have learned a lot from you and avidday over the past couple of days … I just can’t express my gratitude to both of you for your time, patience, and effort!
All we’ve taken is a general, partial course in CUDA, less than 30 hours in total … but I think I will spend this summer delving into the details to learn as much as possible!! :)

Thank you from the heart!

Great!

Would you mind giving performance numbers for both of the last versions? I’d be interested to learn how much improvement (if any) one gets from 128-byte vs. 64-byte memory transactions. That’s something that is neither well documented nor obvious, so it’s always nice to have more data points. One of my kernels profits hugely from that change, but a big part of that can be explained by unaligned accesses.

And the 0.08 ms vs. 0.06 ms difference suggests that, now that the bandwidth requirement is reduced, a little more may be gained by amortizing the setup time through a loop over more data, and by removing the division by making the grid two-dimensional.
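
Something like this for the loop part (a sketch; tilesPerBlock is my name and its value is only illustrative, assuming the element count divides evenly into blocks):

#define tilesPerBlock 8

__global__ void gamma1_Kernel_looped(Matrix M, Matrix N)
{
	__shared__ float2 Mds[subVector];

	int tx = threadIdx.x;
	// the index arithmetic is now paid once per thread, not once per subvector
	int txdiv3 = tx / 3;
	int txmod3 = tx - __umul24(txdiv3, 3);
	float sign = (txdiv3 & 2) - 1;
	int permuted_idx = __umul24(txdiv3 ^ 3, 3) + txmod3;

	float2 *Mptr = (float2 *)M.elements;
	float2 *Nptr = (float2 *)N.elements;

	int base = __umul24(blockIdx.x, subVector * tilesPerBlock);
	for (int t = 0; t < tilesPerBlock; t++, base += subVector) {
		Mds[tx] = Mptr[base + tx];
		__syncthreads();
		float2 Nd;
		Nd.x =  sign * Mds[permuted_idx].y;
		Nd.y = -sign * Mds[permuted_idx].x;
		Nptr[base + tx] = Nd;
		__syncthreads();   // don't refill Mds while others may still read it
	}
}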
