Given an image with k grey levels on a p-processor machine, we wish
to develop efficient and portable
parallel algorithms to perform various useful primitive image
processing computations. Efficiency is a performance measure used to
evaluate parallel algorithms. This measure provides an indication of
the effective utilization of the p processors relative to the given
parallel algorithm. For example, an algorithm with an efficiency near
one runs approximately p times faster on p processors than the
same algorithm on a single processor. Portability refers to code that
is written independently of low-level primitives reflecting machine
architecture or size. Our goal is to develop portable algorithms that
are scalable in terms of both image size and number of processors,
when run on distributed memory multiprocessors.
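For concreteness, the efficiency measure described above can be stated
in the standard form (the symbols T_seq and T(p) are introduced here
only for illustration):

\[ E(p) \;=\; \frac{T_{\mathrm{seq}}}{p \cdot T(p)}, \]

where T_seq is the running time of the best sequential algorithm and
T(p) is the running time of the parallel algorithm on p processors, so
that an efficiency near one corresponds to a speedup of roughly p.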
Our first algorithm computes the histogram of an image; i.e., the
output consists of an array H[0..k-1] held in a single processor
such that H[i] is equal to the number of pixels in the image with
grey level i. Without loss of generality, k is assumed to be a
power of two. The second algorithm computes the connected components
of images ([1], [6], [8],
of images ([1], [6], [8],
[9], [12], [16], [17],
[20], [38]). The task of connected component
labeling is cited as an important object recognition problem in the
DARPA Image Understanding benchmarks ([37], [47]),
and also can be applied to several computational physics problems such
as percolation ([41], [5]) and various cluster
Monte Carlo algorithms for computing the spin models of magnets such
as the two-dimensional Ising spin model ([2],
[3], [4], [39], [40]). All
pixels with grey level (or `color') 0 are assumed to be background,
while pixels with color > 0 are foreground objects. A connected
component in the image is a maximal collection of pixels such that a
path exists between any pair of pixels in the component. Note that we
use the notion of 8-connectivity, meaning that two pixels are
adjacent if and only if one pixel lies in any of the eight positions
surrounding the other pixel, or that of 4-connectivity, in which only
the north, east, south, and west pixels are adjacent. Each pixel in the
image will receive a positive integer label; pixels will have the same
label if and only if they belong to the same connected component.
Also, all 0-pixels will receive a label of 0.
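As an illustration of this labeling convention (and not of the parallel
algorithm developed in Section 5), a minimal sequential sketch in C
using union-find over the 8-neighborhood might look as follows; the
image dimensions, the array names, and the treatment of every nonzero
pixel as a single foreground class are assumptions made only for the
example.

/* Sequential 8-connectivity labeling via union-find -- an illustrative
   sketch only, not the parallel algorithm of Section 5.  The image
   size and names below are assumptions; every nonzero pixel is treated
   as foreground. */
#include <stdlib.h>

#define W 8                      /* assumed image width  */
#define H 8                      /* assumed image height */

static int parent[W * H];

static int find(int x) {                 /* root with path halving */
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

static void unite(int a, int b) { parent[find(a)] = find(b); }

/* On return, label[r][c] is 0 for background pixels, and two
   foreground pixels share a label if and only if they belong to the
   same 8-connected component. */
void label_components(const int img[H][W], int label[H][W]) {
    for (int i = 0; i < W * H; i++) parent[i] = i;

    /* Merge each foreground pixel with its already-scanned neighbors;
       west, north-west, north, and north-east cover every 8-adjacency
       exactly once (for 4-connectivity, keep only west and north). */
    static const int dr[4] = { 0, -1, -1, -1 };
    static const int dc[4] = { -1, -1, 0, 1 };
    for (int r = 0; r < H; r++)
        for (int c = 0; c < W; c++) {
            if (img[r][c] == 0) continue;          /* background */
            for (int k = 0; k < 4; k++) {
                int nr = r + dr[k], nc = c + dc[k];
                if (nr >= 0 && nc >= 0 && nc < W && img[nr][nc] != 0)
                    unite(r * W + c, nr * W + nc);
            }
        }

    /* Hand out consecutive positive labels, one per component root;
       background pixels keep label 0. */
    int next = 1;
    int *root_label = calloc(W * H, sizeof(int));
    for (int r = 0; r < H; r++)
        for (int c = 0; c < W; c++) {
            if (img[r][c] == 0) { label[r][c] = 0; continue; }
            int root = find(r * W + c);
            if (root_label[root] == 0) root_label[root] = next++;
            label[r][c] = root_label[root];
        }
    free(root_label);
}

Restricting the scan to the four previously visited neighbors examines
each adjacent pair of pixels only once; dropping the north-west and
north-east entries yields the 4-connectivity variant instead.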
The majority of previous parallel histogramming algorithms are architecture- or machine-specific and do not port well to other platforms. Table 1 shows some previous running times for histogramming algorithms on parallel machines. Note that several of these machines are special purpose image processing machines. The last column corresponds to a normalized measure of the amount of work per pixel, where the total work is defined to be the product of the execution time and the number of processors. In order to normalize the results between fine- and coarse-grained machines, we divide the number of processors in the fine-grained machines by 32 to compute the work per pixel site.
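Stated as a formula (the symbols t, p, and w are introduced here only
as shorthand for the quantities described above), the normalized work
per pixel reported in the tables is

\[ w \;=\; \frac{t \cdot p'}{\text{number of pixels}}, \qquad
   p' = \begin{cases} p/32 & \text{for the fine-grained machines,}\\
                      p    & \text{otherwise,} \end{cases} \]

where t is the execution time and p is the number of processors.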
Table 1: Implementation Results of Parallel Histogramming Algorithms
As with the histogramming algorithms, most of the previous parallel connected components algorithms are also architecture- or machine-specific and do not port easily to other platforms. Table 2 shows some previous running times for implementations of connected components for images using parallel machines. Again, several of these machines are special purpose image processing machines. The second-to-last column corresponds to a normalized measure of the amount of work per pixel, where the total work is defined to be the product of the execution time and the number of processors. Note that only the entries labeled ``DARPA II Image'' use the benchmark image given in Figure 2. All other test images used are unknown to the authors.
Section 2 describes the algorithmic model used to analyze the algorithms whereas Section 3 describes the input images used, as well as the data layout on the Thinking Machines CM-5, IBM SP-1 and SP-2, Meiko CS-2, and Intel Paragon. The histogramming algorithm is presented in Section 4, and the binary connected components algorithm is described in Section 5. Section 6 generalizes the connected components algorithm to multi-grey level images.
The experimental data obtained reflect the execution times from implementations on the TMC CM-5, IBM SP-1 and SP-2, Meiko CS-2, and the Intel Paragon, with the number of parallel processing nodes ranging from 16 to 128 for each machine when possible. The algorithms are written in SPLIT-C [10], a shared-memory programming language following the SPMD (single program multiple data) model on these parallel machines, and the source code is available for distribution to interested parties.
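To make the SPMD structure concrete, the following is a minimal sketch
of the local-count-then-combine pattern underlying the histogramming
computation, written with C and MPI rather than the SPLIT-C primitives
actually used; the function and variable names, and the assumption that
each processor already holds a block of local_n pixels with grey levels
in the range 0..k-1, are illustrative only.

/* SPMD histogramming sketch: each of the p processors counts the grey
   levels of its own block of pixels, and the p partial histograms are
   then summed so that processor 0 holds H[0..k-1].  Illustrative MPI
   code, not the SPLIT-C implementation described in Section 4. */
#include <mpi.h>
#include <stdlib.h>

void histogram(const unsigned char *local_pixels, long local_n,
               int k, int *H /* length k; significant on rank 0 only */)
{
    int *local_H = calloc(k, sizeof(int));

    for (long i = 0; i < local_n; i++)        /* local counting phase */
        local_H[local_pixels[i]]++;

    /* combine the p partial histograms into H on processor 0 */
    MPI_Reduce(local_H, H, k, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    free(local_H);
}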
Table 2: Implementation Results of Parallel Connected Components of Images Algorithms