Hi everyone.
I have a problem in my kernel and, after investigating, I think I finally discovered what the problem is. Some days ago I [topic=“107943”]posted[/topic] about a problem with my graphics card, which seems to behave strangely after some executions. teju told me that maybe I was accessing a non-'malloc'ed position. To make sure this wasn't happening I wrote a simple program that guarantees correct memory accesses. But the problem was still there.
I was wondering if the problem was the kernel's loop, which is very big. I did some tests with this program:
#include <cuda.h>
#include <cutil.h>
#include <iostream>
#define BLOCK_X 32
#define BLOCK_Y 16
#define N1 4500000
//#define N1 4700000
using namespace std;
__global__ void bucles(float * data, float * res, uint Nbucle)
{
    for(uint n=0;n<Nbucle;n++)
        res[n%512]=data[n%512];
}
int main()
{
uint Nbucle = N1;
uint N = 512;
float * data = new float[N];
for(uint i=0;i<N;i++)
data[i]=3.;
float * dataDevice;
cudaMalloc((void**)&dataDevice,sizeof(float)*N);
cudaMemcpy(dataDevice,data,sizeof(float)*N,cudaMemcpyHostToDevice);
float * resDevice;
cudaMalloc((void**)&resDevice,sizeof(float)*N);
dim3 dimBlock(BLOCK_X,BLOCK_Y);
dim3 dimGrid(N/(dimBlock.x*dimBlock.y));
bucles<<<dimGrid,dimBlock>>>(dataDevice,resDevice,Nbucle);
float * res = new float[N];
cudaMemcpy(res,resDevice,sizeof(float)*N,cudaMemcpyDeviceToHost);
cudaFree(dataDevice);
cudaFree(resDevice);
for(uint i=0;i<N;i++)
{
if(i%200==0) cout<<endl;
cout<<res[i]<<",";
}
cout<<endl;
delete [] data;
delete [] res;
return 0;
}
The grid has just one block and the block size is 512 threads. Every thread loops Nbucle times. I want to test what the limit on the number of instructions a thread supports is (if it exists). I tried different values of N1 and bounded the value (more or less). For N1=4500000 the results were OK and the graphics hardware responded correctly, but for N1=4700000 the results were all wrong and the screen looked like the attached image.
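In case it is relevant: I could also check the error state right after the launch, like this (just a sketch; I am assuming the CUDA 2.2 runtime API here, and I have not yet verified what it reports in the failing case):

```cuda
// Sketch: error checking around the launch (assumes the CUDA 2.2 runtime API).
bucles<<<dimGrid,dimBlock>>>(dataDevice,resDevice,Nbucle);
cudaError_t err = cudaGetLastError();   // reports launch failures
if(err != cudaSuccess)
    cout<<"launch error: "<<cudaGetErrorString(err)<<endl;
err = cudaThreadSynchronize();          // waits for the kernel to finish
if(err != cudaSuccess)
    cout<<"execution error: "<<cudaGetErrorString(err)<<endl;
```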
I couldn't find the exact value where this happens because sometimes for N1=4500000 the result was wrong (it depends on the execution). So I analyzed my kernel with the CUDA Visual Profiler and saw that the same kernel sometimes executed a different number of instructions on different runs. This [topic=“108354”]post[/topic] explains why. And this explains why N1=4500000 was sometimes right and sometimes wrong: I am on the limit of the maximum instructions per thread (I concluded).
So my question about all this is: am I correct? Does a maximum number of instructions per thread exist? If it exists, why?
In the CUDA Programming Guide 2.2 I read something about the maximum size of a kernel: “The limit on kernel size is 2 million PTX instructions”. I understand that this does not mean the number of instructions executed per thread but the size of the compiled kernel. I assume that nvcc doesn't unroll my loop (since Nbucle is a runtime parameter), so my simple program doesn't come near the 2 million PTX instructions.
Any help will be very much appreciated. Thanks.