Abstract | The Compute Unified Device Architecture (CUDA)has enabled graphics processors to be explicitly programmed
as general-purpose shared-memory multi-core processors with
a high level of parallelism. In this paper, we present our
preliminary results of implementing the Graph Cuts algorithm
on CUDA. Our primary focus is on implementing Graph Cuts on
grid graphs, which are extensively used in imaging applications.
We first explain our implementation of breadth first search (BFS)
graph traversal on CUDA, which is extensively used in our Graph
Cuts implementation. We then present a basic implementation
of Graph Cuts that succeeds to achieve absolute and relative
speedups when used for foreground-background segmentation on
synthesized images. Finally, we introduce two optimizations that
utilize the special structure of grid graphs. The first one is lockstep
BFS, which is used to reduce the overhead of BFS traversals.
The second is cache emulation, which is a general technique to
regularize memory access patterns and hence enhance memory
access throughput. We experimentally show how each of the
two optimizations can enhance the performance of the basic
implementation on the image segmentation application.
|