Help with CUDA streams

I am trying to use CUDA streams in my code, but it gives me error number 11, i.e. cudaErrorInvalidDevice.
Can anyone please help me?
I am posting my code below:

error=(cudaMemcpy2DAsync(gpu_T, pitch_t, T, size, size, N1, cudaMemcpyHostToDevice,stream1));

printf("\nerror is %d", error);

(cudaMallocPitch((void**)&gpu_T_, &pitch_t_, size, N1));
(cudaMemcpy2DAsync(gpu_T_, pitch_t_, T_, size,size, N1, cudaMemcpyHostToDevice,stream2));

(cudaMallocPitch((void**)&gpu_D ,&pitch_d, size, N1));
(cudaMemcpy2D(gpu_D, pitch_d, D, size, size, N1, cudaMemcpyHostToDevice));

transpose_kernel<<<dimGrid, dimBlock,0,stream3>>>(gpu_T,gpu_T_,pitch_t/sizeof(float));

And my kernel code is:

__global__ void transpose_kernel(float *T, float *T_, int pitch)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x;
    int yid = blockIdx.y * blockDim.y + threadIdx.y;

    if (xid < N1 && yid < N1)
        T_[xid * pitch + yid] = T[yid * pitch + xid];
}


Please help me understand why I am getting this error.
Thanks in advance.
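
A bare numeric code is hard to interpret, and the numeric values of cudaError_t have been renumbered between toolkit releases, so printing the runtime's own description is usually more informative. The sketch below wraps each call in a hypothetical helper, cudaOk (not part of the original post), that prints cudaGetErrorString on failure. It also allocates gpu_T before copying into it: the snippet in the question does not show a cudaMallocPitch for gpu_T or the creation of stream1, and if either is really missing rather than just left out of the post, the first copy would be expected to fail. N1 = 64 and the use of the default stream are assumptions made only to keep the example self-contained.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): prints the runtime's
// description of a failed call and returns true only on success.
static bool cudaOk(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s (code %d)\n",
                what, cudaGetErrorString(err), (int)err);
        return false;
    }
    return true;
}

int main()
{
    const int N1 = 64;                       // assumed matrix dimension
    const size_t size = N1 * sizeof(float);  // width of one row in bytes
    static float T[N1 * N1];                 // host matrix, zero-initialised

    float *gpu_T = NULL;
    size_t pitch_t = 0;

    // The destination must be allocated before any copy targets it.
    if (!cudaOk(cudaMallocPitch((void **)&gpu_T, &pitch_t, size, N1),
                "cudaMallocPitch"))
        return 1;

    // Stream 0 (the default stream) is used here purely for illustration.
    if (!cudaOk(cudaMemcpy2DAsync(gpu_T, pitch_t, T, size, size, N1,
                                  cudaMemcpyHostToDevice, 0),
                "cudaMemcpy2DAsync"))
        return 1;

    cudaOk(cudaFree(gpu_T), "cudaFree");
    return 0;
}
```

Wrapping every call this way shows which call fails first, which is often not the one whose return value happened to be printed.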

I think you should make sure “stream1” and “stream2” have completed before launching the kernel on “stream3”. I am assuming that the kernel depends on the memcopies done in stream1 and stream2. If not, ignore this.
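
One way to express that dependency without blocking the host is to record an event after each copy and have stream3 wait on both events. The following is a minimal sketch, not the poster's actual program: N1 = 64, the 16x16 block size, the event names copied1 and copied2, the pinned host buffers, and the run_transpose wrapper are all assumptions. Most return codes are ignored for brevity; only the final result is checked.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N1 64  // assumed matrix dimension; the thread never states it

// Same indexing as the kernel in the question; pitch is in elements.
__global__ void transpose_kernel(const float *T, float *T_, int pitch)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x;
    int yid = blockIdx.y * blockDim.y + threadIdx.y;
    if (xid < N1 && yid < N1)
        T_[xid * pitch + yid] = T[yid * pitch + xid];
}

// Returns 0 if the transposed matrix came back correct, 1 otherwise.
int run_transpose()
{
    const size_t size = N1 * sizeof(float);  // row width in bytes
    float *T, *T_, *gpu_T, *gpu_T_;
    size_t pitch_t, pitch_t_;
    cudaStream_t stream1, stream2, stream3;
    cudaEvent_t copied1, copied2;

    // Pinned host memory is what allows the async copies to overlap.
    cudaMallocHost((void **)&T, N1 * size);
    cudaMallocHost((void **)&T_, N1 * size);
    for (int i = 0; i < N1 * N1; ++i) { T[i] = (float)i; T_[i] = 0.0f; }

    // Device buffers, streams and events must exist before first use.
    cudaMallocPitch((void **)&gpu_T, &pitch_t, size, N1);
    cudaMallocPitch((void **)&gpu_T_, &pitch_t_, size, N1);
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    cudaStreamCreate(&stream3);
    cudaEventCreate(&copied1);
    cudaEventCreate(&copied2);

    // Each event marks the point in its stream where the copy has finished.
    cudaMemcpy2DAsync(gpu_T, pitch_t, T, size, size, N1,
                      cudaMemcpyHostToDevice, stream1);
    cudaEventRecord(copied1, stream1);
    cudaMemcpy2DAsync(gpu_T_, pitch_t_, T_, size, size, N1,
                      cudaMemcpyHostToDevice, stream2);
    cudaEventRecord(copied2, stream2);

    // Work queued on stream3 after these calls waits for both copies.
    cudaStreamWaitEvent(stream3, copied1, 0);
    cudaStreamWaitEvent(stream3, copied2, 0);

    dim3 dimBlock(16, 16);
    dim3 dimGrid((N1 + 15) / 16, (N1 + 15) / 16);
    // Both buffers have the same row width, so they share one pitch here.
    transpose_kernel<<<dimGrid, dimBlock, 0, stream3>>>(
        gpu_T, gpu_T_, (int)(pitch_t / sizeof(float)));
    cudaError_t err = cudaGetLastError();

    // Copy the result back on stream3, then wait for that stream only.
    cudaMemcpy2DAsync(T_, size, gpu_T_, pitch_t_, size, N1,
                      cudaMemcpyDeviceToHost, stream3);
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream3);

    int bad = (err != cudaSuccess);
    for (int y = 0; y < N1 && !bad; ++y)
        for (int x = 0; x < N1; ++x)
            if (T_[x * N1 + y] != T[y * N1 + x]) { bad = 1; break; }

    cudaDeviceSynchronize();  // make sure nothing is in flight before freeing
    cudaEventDestroy(copied1); cudaEventDestroy(copied2);
    cudaStreamDestroy(stream1); cudaStreamDestroy(stream2);
    cudaStreamDestroy(stream3);
    cudaFree(gpu_T); cudaFree(gpu_T_);
    cudaFreeHost(T); cudaFreeHost(T_);
    return bad;
}

int main()
{
    int bad = run_transpose();
    puts(bad ? "transpose mismatch or CUDA error" : "transpose ok");
    return bad;
}
```

A simpler alternative is to call cudaStreamSynchronize(stream1) and cudaStreamSynchronize(stream2) on the host before the launch; that blocks the CPU thread, but it is easier to reason about while debugging.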

Have you set up your context?

cudaSetDevice(0);   // Should be the first cudaX() call, IIRC
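
If the error really is about the device, it may help to confirm that device 0 exists before selecting it. This is a small sketch under that assumption; select_device is a hypothetical helper, not something from the thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: selects `device` only if the runtime reports that it
// exists. Returns cudaSuccess on success, otherwise an error code.
cudaError_t select_device(int device)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess)
        return err;                        // e.g. no driver or no device
    if (device < 0 || device >= count)
        return cudaErrorInvalidDevice;     // index outside the reported range
    return cudaSetDevice(device);
}

int main()
{
    cudaError_t err = select_device(0);
    if (err != cudaSuccess) {
        fprintf(stderr, "could not select device 0: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // Print the name so it is obvious which GPU the program is using.
    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("using device 0: %s\n", prop.name);
    return 0;
}
```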

Thanks for your reply.

Yes, the kernel depends on the memcopies done in stream1 and stream2.

Please tell me what I should do.

No, I haven't set up any context.

How should I do that?

I tried calling cudaSetDevice(0) as you suggested, but it still gives me the same error.

Please tell me what I should do.

Just issue a “cudaThreadSynchronize()” before the kernel call. Everything should be fine.
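
In newer toolkits cudaThreadSynchronize() is deprecated in favour of cudaDeviceSynchronize(), which likewise waits for all previously issued work on the device. The sketch below shows that barrier placed before a launch, with the launch itself checked through cudaGetLastError. The fill_kernel and launch_with_sync names are hypothetical stand-ins used only to keep the example self-contained.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: writes each element's own index.
__global__ void fill_kernel(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = i;
}

// Returns the first error encountered, or cudaSuccess if everything worked.
cudaError_t launch_with_sync(int *host_out, int n)
{
    int *dev_out = NULL;
    cudaError_t err = cudaMalloc((void **)&dev_out, n * sizeof(int));
    if (err != cudaSuccess)
        return err;

    // Device-wide barrier: waits for work issued earlier in every stream.
    err = cudaDeviceSynchronize();
    if (err == cudaSuccess) {
        fill_kernel<<<(n + 255) / 256, 256>>>(dev_out, n);
        err = cudaGetLastError();       // reports launch-configuration errors
    }
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();  // reports errors raised while running
    if (err == cudaSuccess)
        err = cudaMemcpy(host_out, dev_out, n * sizeof(int),
                         cudaMemcpyDeviceToHost);
    cudaFree(dev_out);
    return err;
}

int main()
{
    int host_out[8];
    cudaError_t err = launch_with_sync(host_out, 8);
    if (err != cudaSuccess) {
        fprintf(stderr, "failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("last element: %d\n", host_out[7]);
    return 0;
}
```

Note that a device-wide barrier only orders the work; it cannot repair a call that has already returned an error, so if the first cudaMemcpy2DAsync fails, the cause has to be fixed at that call.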

It's still giving me an error.

Please tell me what I should do.