The USC Andrew and Erna Viterbi School of Engineering USC Signal and Image Processing Institute USC Ming Hsieh Department of Electrical Engineering University of Southern California

Technical Report USC-SIPI-292

“Architecture and Simulation of Selected Fine-Grained VLSI Array Processors”

by Tony H.-Y. Wu

November 1995

With rapid advances in semiconductor manufacturing technologies and the maturing of computer-aided design tools, the move from serial to parallel signal processing is a natural progression. The circuit-complexity constraint on VLSI implementation can be alleviated by using regular, repetitive architectural structures. Scalability and massive parallelism provide a throughput rate and processing capability that conventional sequential processors cannot achieve.

An analog compact neural network model is reviewed. It is a powerful parallel processing paradigm consisting of densely connected analog computing cells. Various applications, such as edge detection, hole-filling, and connected-component detection, can be accomplished by changing the local interconnection strengths, which are programmed through the coefficient templates.
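To make the role of the coefficient templates concrete, the following is a minimal sketch of a template-programmed cell array. It assumes the standard Chua-Yang cellular-network state equation (dx/dt = -x + A*y + B*u + I with a saturating piecewise-linear output), which this abstract does not spell out, so the report's own model may differ. The function names, the forward-Euler integration, the fixed white boundary, and the edge-detection template values are all illustrative assumptions, not taken from the report.

```python
def clamp(x):
    """Piecewise-linear cell output: saturates at -1 and +1."""
    return max(-1.0, min(1.0, x))


def conv3(template, grid, r, c, boundary):
    """Apply a 3x3 template around (r, c); cells outside the grid read `boundary`."""
    rows, cols = len(grid), len(grid[0])
    total = 0.0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            inside = 0 <= rr < rows and 0 <= cc < cols
            value = grid[rr][cc] if inside else boundary
            total += template[dr + 1][dc + 1] * value
    return total


def simulate(u, A, B, I, dt=0.1, steps=200, boundary=-1.0):
    """Forward-Euler integration of dx/dt = -x + A*y + B*u + I from x = 0.

    u holds the input image (+1 = black, -1 = white). Returns the cell outputs.
    """
    rows, cols = len(u), len(u[0])
    # The feed-forward term B*u + I does not change during the transient.
    drive = [[conv3(B, u, r, c, boundary) + I for c in range(cols)]
             for r in range(rows)]
    x = [[0.0] * cols for _ in range(rows)]
    for _ in range(steps):
        y = [[clamp(v) for v in row] for row in x]
        x = [[x[r][c] + dt * (-x[r][c] + conv3(A, y, r, c, boundary) + drive[r][c])
              for c in range(cols)]
             for r in range(rows)]
    return [[clamp(v) for v in row] for row in x]


# Illustrative edge-detection templates (assumed values, not from the report).
EDGE_A = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
EDGE_B = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]
EDGE_I = -1.0
```

With these assumed templates, a filled black square on a white background settles to an output in which only the outline of the square remains black. Swapping in a different `A`, `B`, and `I` changes the task without touching `simulate`, which is the sense in which the templates program the array.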

The behavioral simulator cnna, which reads in the configuration information and simulates the dynamic behavior of the network, is presented. A unique feature of this simulator is its hardware annealing capability, which provides an efficient method of finding globally optimal solutions. The effects of hardware annealing and of different bias settings are presented. Performance comparisons with respect to image size and template size are summarized. A comparison with simulators developed by other researchers is also included.

The parallel computation architecture is defined. A compiler and its associated programming language greatly facilitate the simulation of compact neural networks in order to optimize the design parameters. The framework of the whole simulation environment is also presented.

An innovative architectural mapping based on multiplication-accumulation techniques, similar to distributed arithmetic, is described. The array processor is most suitable for the high-performance requirements of 2-dimensional forward and inverse discrete cosine transformation tasks. Each computing cell contains two adders: one for 2-bit multiplication and the other for partial-sum accumulation. The computation power could reach 10 giga-operations per second at a system clock of 100 MHz using a 0.5 µm CMOS manufacturing technology. The accuracy of the inverse discrete cosine transformation has been tested, and a comparison with other researchers' work is summarized. These results lay a critical foundation for the increasingly important multimedia applications, especially in image, video, and speech processing.
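One way to read the two-adder cell is as a digit-serial multiply-accumulate that consumes the input sample two bits at a time. The sketch below illustrates that reading only; it is an assumption about how such a cell could work, not the report's actual circuit. The function names, the unsigned 8-bit sample width, and the integer coefficients are hypothetical.

```python
def mul_2bit(coeff, digit):
    """Multiply coeff by a 2-bit digit (0..3) using a shift and one addition."""
    low = coeff if digit & 1 else 0
    high = (coeff << 1) if digit & 2 else 0
    return low + high  # first adder: forms the 2-bit partial product


def digit_serial_mac(coeffs, samples, bits=8):
    """Inner product sum(c * x), processing each sample 2 bits per step.

    Samples are assumed to be unsigned integers below 2**bits, with bits even.
    """
    acc = 0
    for c, x in zip(coeffs, samples):
        for shift in range(0, bits, 2):
            digit = (x >> shift) & 0b11
            acc += mul_2bit(c, digit) << shift  # second adder: accumulates
    return acc
```

Each output of a one-dimensional discrete cosine transform is an inner product of this form, with a row of fixed cosine coefficients (scaled to integers) as `coeffs` and the input samples as `samples`; a 2-dimensional transform can then be built from row and column passes.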

To download the report in PDF format click here: USC-SIPI-292.pdf (21.1Mb)