Data transfer from device to host taking too much time
My code is something like this:
struct objectType { char* str1; char* str2; };

objectType* o;
cudaMallocManaged(&o, sizeof(objectType) * n);
for (int i = 0; i < n; ++i) { /* use cudaMallocManaged to allocate and copy the per-object data */ }

if (useGPU)
    compute_on_gpu(o, ….);
else
    compute_on_cpu(o, ….);

function1(o, ….); // on host
When computing on the GPU, ‘function1’ takes much longer to execute (around 2 seconds) than when computing on the CPU (around 0.01 seconds). What could be a workaround for this? I guess this is the time it takes to transfer the data back from the GPU to the CPU, but I’m just a beginner, so I’m not quite sure how to handle this.
Note: I am passing ‘o’ allocated with cudaMallocManaged to the CPU path as well, just for a fair comparison, even though the CPU path doesn’t need the data to be accessible from the GPU.
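In case it helps narrow things down, here is a sketch of what I think a workaround might look like, based on my (possibly wrong) guess that the time is spent migrating managed pages back to the host inside ‘function1’. The idea is to synchronize after the kernel so GPU time isn’t attributed to ‘function1’, and then prefetch the managed buffer back to the host with cudaMemPrefetchAsync before the host touches it:

```cuda
// Sketch only, not tested. Assumes 'o' and 'n' are as in the code above.
compute_on_gpu(o, ….);       // kernel launches are asynchronous
cudaDeviceSynchronize();     // so without this, GPU time can show up
                             // in whatever host code runs next

// Migrate the managed pages back to host memory up front, instead of
// paying a page fault on each first touch inside function1.
cudaMemPrefetchAsync(o, sizeof(objectType) * n, cudaCpuDeviceId);
cudaDeviceSynchronize();     // wait for the prefetch to complete

function1(o, ….);            // host code now touches resident pages
```

Note that this would only prefetch the array of structs itself; the buffers that str1 and str2 point to are separate managed allocations and would presumably need their own prefetch calls. Is something along these lines the right approach?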