I may well be wrong, but it seems that CUDA 3.1 is not able to compile this on sm_13:
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda.h>
__device__ unsigned int atomic_cnt = 0;
__device__ int queue_fetch()
{
    __shared__ bool amFirst;
    __shared__ bool amLast;
    const uint id = threadIdx.z + threadIdx.y*blockDim.z + threadIdx.x*blockDim.z*blockDim.y;
    const uint nblocks = gridDim.z*gridDim.y*gridDim.x;
    if(0==id) {
        unsigned int ticket = 0;
        ticket = atomicInc(&atomic_cnt, nblocks);
        amFirst = (ticket == 0);
        amLast = (ticket == (nblocks-1));
    }
    __syncthreads();
    // other code here which was removed to simplify it
    return 0;
}
__global__ void kernel()
{
    queue_fetch();
}
int main()
{
}
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2010 NVIDIA Corporation
Built on Tue_Jun__8_18:56:17_PDT_2010
Cuda compilation tools, release 3.1, V0.2.1221
$ nvcc -arch=sm_13 -c test.cu
test.cu(8): warning: variable "amFirst" was set but never used
test.cu(9): warning: variable "amLast" was set but never used
ptxas /tmp/tmpxft_0000671b_00000000-2_test.ptx, line 0; �%J�Y: (C6017) Unaligned access for GRF[0] (unknown symbol) in entry ; the offset should be 4-byte aligned.
CUDA 2.3, on the other hand, compiles it without problems.
Any ideas?
avidday
September 6, 2010, 10:58am
#4
I initially thought that as well, but changing the pair of shared memory booleans to a 32-bit type doesn’t help. The PTX it generates looks OK to my inexpert eye, but the assembler is choking on something.
I think it was suggested in the other thread that the if statement somehow causes this… the OP also has an if statement in his code.
eyal
Judging from the weird output, I really think there is a bug inside ptxas, and that garbled message is a side effect of it.
Just look at the characters after the “line 0:” part, and the “”.
It may be in a ptxas code path specific to sm_13, which I think is much less tested nowadays.
For developers targeting primarily compute 1.x devices there are currently not a lot of good reasons to update to the 3.0 and 3.1 toolkits.
You are not going to believe this:
Leave gridDim.z out of the nblocks declaration and it will compile (sm_13 with CUDA 3.1).
AFAIK gridDim.z is always 1 on compute capability 1.3 anyway?
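For reference, a minimal sketch of what that workaround might look like when applied to the posted code. It has not been verified against the CUDA 3.1 toolchain, so treat it as an illustration only. Two things differ from the original: gridDim.z is dropped from the nblocks product, and uint is spelled out as unsigned int, since uint is not a standard type and is not defined on every platform.

```cuda
#include <cuda.h>

__device__ unsigned int atomic_cnt = 0;

__device__ int queue_fetch()
{
    __shared__ bool amFirst;
    __shared__ bool amLast;

    // Linear thread index within the block, same formula as the posted code.
    const unsigned int id = threadIdx.z
                          + threadIdx.y * blockDim.z
                          + threadIdx.x * blockDim.z * blockDim.y;

    // Workaround: gridDim.z is left out of the product. This assumes a
    // 2D grid, i.e. gridDim.z == 1, so the block count is unchanged.
    const unsigned int nblocks = gridDim.y * gridDim.x;

    if (0 == id) {
        // atomicInc returns the old value and stores
        // (old >= nblocks) ? 0 : old + 1 back into atomic_cnt.
        const unsigned int ticket = atomicInc(&atomic_cnt, nblocks);
        amFirst = (ticket == 0);
        amLast  = (ticket == (nblocks - 1));
    }
    __syncthreads();

    // other code from the original post was removed here as well
    return 0;
}

__global__ void kernel()
{
    queue_fetch();
}

int main()
{
}
```

As in the original report, this is meant as a compile-only test (nvcc -arch=sm_13 -c test.cu); the warnings about amFirst and amLast being set but never used are expected because the code that reads them was removed.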
Hey… that’s right! Nice spot!
Now I understand why a similar construct is OK in another piece of code…
Thank you so much!
njuffa
September 7, 2010, 9:30pm
#16
Thank you for bringing this issue to our attention. I was able to reproduce the problem on 64-bit Linux (RHEL 5.3) with CUDA 3.1. Interestingly, it does not reproduce on 64-bit Windows (although the code as posted won’t compile there, because “uint” is undefined). I will follow up with our toolchain team.