I have about 10,000 8x8 matrices to invert. I have implemented the function and am now working on optimizations. The first optimization I am looking at is switching from plain cudaMalloc to cudaMallocPitch. I have a question about cudaMallocPitch that I seem to have gotten wrong, because the way I tried it does not produce correct results. First, how do you lay out host memory so that a cudaMemcpy2D from it to the device works correctly? Second, if I am going to have to copy the matrices from global memory to shared memory, what is the best way to structure the shared memory to optimize performance? Any help would be greatly appreciated.
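To make the question concrete, here is a rough sketch of the layout and copy pattern I have in mind; the names, the float element type, the matrix count, and the one-block-per-matrix kernel shape are just illustrative assumptions, not my working code:

```c
#include <cuda_runtime.h>
#include <stdlib.h>

#define N       8      /* matrix dimension              */
#define NUM_MAT 10000  /* number of matrices to invert  */

int main(void)
{
    /* Host side: plain contiguous rows, one 8x8 matrix after another.
       The host buffer needs no special pitched layout; cudaMemcpy2D only
       needs to know the host row stride in bytes (spitch). */
    size_t hostRowBytes = N * sizeof(float);
    float *h_mats = (float *)malloc((size_t)NUM_MAT * N * hostRowBytes);

    /* Device side: cudaMallocPitch pads each row out to an aligned pitch.
       Treat the whole batch as (NUM_MAT * N) rows of N floats. */
    float *d_mats = NULL;
    size_t d_pitch = 0;
    cudaMallocPitch((void **)&d_mats, &d_pitch,
                    hostRowBytes,           /* row width in bytes  */
                    (size_t)NUM_MAT * N);   /* total number of rows */

    /* Host -> device: spitch is the unpadded host row stride,
       dpitch is whatever cudaMallocPitch returned. */
    cudaMemcpy2D(d_mats, d_pitch,
                 h_mats, hostRowBytes,
                 hostRowBytes, (size_t)NUM_MAT * N,
                 cudaMemcpyHostToDevice);

    /* ... launch the inversion kernel here ... */

    /* Device -> host is the same call with the pitches swapped. */
    cudaMemcpy2D(h_mats, hostRowBytes,
                 d_mats, d_pitch,
                 hostRowBytes, (size_t)NUM_MAT * N,
                 cudaMemcpyDeviceToHost);

    cudaFree(d_mats);
    free(h_mats);
    return 0;
}
```

And for the shared-memory part, something along these lines, with one block per matrix and an 8x8 thread block, is what I am imagining:

```c
/* Illustrative kernel: each thread loads one element of its block's matrix
   from the pitched global array into shared memory. */
__global__ void invert8x8(float *g_mats, size_t pitch)
{
    /* Padding each shared row to 9 floats is an assumption meant to avoid
       bank conflicts on column-wise accesses during the inversion. */
    __shared__ float s_mat[8][9];

    int m   = blockIdx.x;   /* matrix index        */
    int row = threadIdx.y;
    int col = threadIdx.x;

    /* Row 'row' of matrix 'm' starts (m * 8 + row) pitches into the array. */
    float *g_row = (float *)((char *)g_mats + ((size_t)m * 8 + row) * pitch);
    s_mat[row][col] = g_row[col];
    __syncthreads();

    /* ... invert s_mat in shared memory, then write back the same way ... */
}
```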