Fig. 4

(a) Performance (speedup over sequential execution) achieved by the Block-Interleaved approach for several values of BS (32, 64, 128, 256, 512), with a CUDA block size of 128. (b) Performance (speedup over sequential execution) achieved by the Block-Shared, Flat, Full-Interleaved (Full-Inter), and Multicore (Multi, using 16 cores) implementations. The test case computed 256 000 medium-high neurons on one of the two logical GPUs of an NVIDIA K80.