In a CUDA application I need to create new data on the GPU, based on the values in an existing array. I would like to launch a thread for each point in the array, and each thread creates a new array.
- if, for example, the value in the array equals zero, the thread does nothing
- it is not known in advance how many threads will generate data, or how much
I could allocate a fixed maximum per thread on the GPU, but:
- this data structure would take up far too much memory
- the data structure would be sparse, so useless padding would be copied from GPU to host
For example: in block X only thread A encounters a value > 0 and creates an array of 5 KB, while in block Y threads B and C encounter values > 0 and create arrays of 1 KB and 3 KB.
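One standard way to handle this kind of variable-size output is a two-pass "count, scan, scatter" scheme: a first kernel computes how many bytes each thread will produce, an exclusive prefix sum over those counts yields each thread's write offset, and a second kernel writes into a densely packed buffer. A rough sketch of what I mean (names like `outputSizeFor` and `writeOutputAt` are placeholders for the application-specific logic):

```cuda
// Pass 1: each thread records how much output it will produce.
__global__ void countKernel(const float *input, int *counts, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        counts[i] = (input[i] > 0.0f) ? outputSizeFor(input[i]) : 0;
}

// ...exclusive prefix sum over counts -> offsets, e.g. with the scan
// example from the SDK...

// Pass 2: each producing thread writes at its precomputed offset,
// so the output buffer is dense with no gaps.
__global__ void writeKernel(const float *input, const int *offsets,
                            char *packed, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && input[i] > 0.0f)
        writeOutputAt(&packed[offsets[i]], input[i]);
}
```

The total output size is `offsets[n-1] + counts[n-1]`, so only exactly that many bytes need to be copied back to the host, which would solve the sparseness problem at the cost of running two kernels.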
I was thinking about allocating, for example, 20 MB and handing out 1 KB blocks for storing information (much like a hard disk). But this might also end up as a sparse data structure. And how would I handle the administration?
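For the administration, one possibility I'm considering is a single global bump counter advanced with `atomicAdd`: a thread that needs space reserves a run of 1 KB blocks by incrementing the counter and remembers where its data went. Because the counter only ever grows, the used part of the pool stays dense, and only `nextFreeBlock * BLOCK_BYTES` bytes would need to be copied back. A sketch under those assumptions (all names here are illustrative, and global-memory `atomicAdd` on integers needs compute capability 1.1 or later):

```cuda
#define BLOCK_BYTES 1024

__device__ unsigned int nextFreeBlock = 0;   // pool administration

// Reserve numBlocks consecutive 1 KB blocks; returns 0 if the pool is full.
__device__ char *allocBlocks(char *poolBase, unsigned int numBlocks,
                             unsigned int totalBlocks)
{
    unsigned int first = atomicAdd(&nextFreeBlock, numBlocks);
    if (first + numBlocks > totalBlocks)
        return 0;                            // pool exhausted
    return poolBase + (size_t)first * BLOCK_BYTES;
}

__global__ void produce(const float *input, char *poolBase,
                        int *firstBlockOfThread, int n,
                        unsigned int totalBlocks)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;
    if (input[i] <= 0.0f) {
        firstBlockOfThread[i] = -1;          // this thread produced nothing
        return;
    }
    unsigned int needed = 5;                 // e.g. 5 KB -> five 1 KB blocks
    char *dst = allocBlocks(poolBase, needed, totalBlocks);
    firstBlockOfThread[i] = dst ? (int)((dst - poolBase) / BLOCK_BYTES) : -1;
    // ...write up to needed * BLOCK_BYTES bytes at dst...
}
```

The `firstBlockOfThread` array is the bookkeeping the host needs to find each thread's output afterwards; the obvious downside is that blocks from different threads end up interleaved in allocation order rather than sorted by thread.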
Am I making any sense? Does anybody have pointers to literature or websites (or something in the SDK I missed)? I can’t imagine I’m the first to encounter this problem.