I have a question about cudaMallocPitch() and cudaMemcpy2D().
float *X_h; X_h = (float *)malloc(N*K*sizeof(float));
where X_h[n*K+k] is the (n,k) element of X_h.
float *X_d;
cudaMallocPitch((void **) &X_d, &pitch_x, width*sizeof(float), height);
cudaMemcpy2D(X_d, pitch_x, X_h, width*sizeof(float), width*sizeof(float), height, cudaMemcpyHostToDevice);
According to the NVIDIA manual,
*((float *)((char *)X_d + pitch_x*n) + k); accesses the nth row and kth column.
Why, in my case, am I accessing the kth row and the nth column instead? Is this a bug in CUDA 2.3?
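For reference, here is a minimal host-only sketch of the addressing rule quoted above. It emulates the pitched layout in plain C rather than calling the CUDA runtime. The post does not say what width and height are set to, so the sketch assumes width = K (floats per row) and height = N (number of rows). The helpers pitched_elem, copy_2d and check_layout are hypothetical names made up for this illustration, and copy_2d is only a stand-in that mimics a row-by-row 2D copy, not the real cudaMemcpy2D.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: address of element (n, k) in a pitched buffer,
   using the row-stride formula quoted from the manual. */
static float *pitched_elem(void *base, size_t pitch_bytes, size_t n, size_t k)
{
    return (float *)((char *)base + n * pitch_bytes) + k;
}

/* Hypothetical host-side stand-in for a 2D copy: copies `height` rows of
   `width_bytes` bytes each, advancing each pointer by its own pitch per row. */
static void copy_2d(void *dst, size_t dpitch, const void *src, size_t spitch,
                    size_t width_bytes, size_t height)
{
    for (size_t r = 0; r < height; ++r)
        memcpy((char *)dst + r * dpitch, (const char *)src + r * spitch,
               width_bytes);
}

/* Returns 1 if every (n, k) read through the pitched formula equals
   X_h[n*K + k] when the copy uses width = K and height = N (assumption).
   pitch_bytes must be at least K * sizeof(float). */
static int check_layout(size_t N, size_t K, size_t pitch_bytes)
{
    float *X_h = (float *)malloc(N * K * sizeof(float));
    char *X_p = (char *)malloc(N * pitch_bytes);
    int ok = 1;

    if (!X_h || !X_p) { free(X_h); free(X_p); return 0; }

    /* Fill the host array so each element has a distinct value. */
    for (size_t i = 0; i < N * K; ++i)
        X_h[i] = (float)i;

    /* N rows of K floats each, packed source, padded destination. */
    copy_2d(X_p, pitch_bytes, X_h, K * sizeof(float), K * sizeof(float), N);

    for (size_t n = 0; n < N; ++n)
        for (size_t k = 0; k < K; ++k)
            if (*pitched_elem(X_p, pitch_bytes, n, k) != X_h[n * K + k])
                ok = 0;

    free(X_h);
    free(X_p);
    return ok;
}
```

Under that assumption, calling check_layout(3, 5, 64) should report a match for every element, which suggests the formula and the X_h[n*K+k] layout agree as long as the width argument corresponds to K and the height argument to N. It may be worth checking which of N and K is actually passed as width and which as height in the real calls.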