“Estimating the Kinematics and Structure of a Moving Object from a Sequence of Images”

by Theodore Broida

The problem considered here involves the use of a sequence of monocular images of a three-dimensional (3-D) moving object to estimate both its structure and kinematics. The object is assumed to be rigid, and its motion is assumed to be "smooth." A set of object match points is assumed to be available, consisting of fixed features on the object, the image plane coordinates of which have been extracted from successive images in the sequence. The measured data are the noisy image plane coordinates of this set of object match points, taken from each image in the sequence. Structure is defined as the 3-D positions of these object feature points, relative to each other. Rotational motion occurs about the origin of an object-centered coordinate system, while translational motion is that of the origin of this coordinate system.

A set of models is developed that decouples the object and its motion from the imaging process. This allows both translational and rotational motion to be modeled to any desired order (constant velocity, constant acceleration, constant jerk, etc.), independent of the nonlinearity of the central projection imaging process. Standard rectilinear models are used for translational motion. Quaternions are used to propagate the rotational motion via closed-form expressions for constant angular velocity and constant precession rate, and a well-behaved ordinary differential equation for higher-order motion. The models allow the use of arbitrarily many image frames and feature points: additional data, as available, can be incorporated to improve the estimation accuracy, especially in high-noise environments. The time intervals between image frames need not be constant, and there is no requirement that all feature points be visible in all frames, so the events of feature occlusion or disappearance can be accommodated.
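As an illustration of how such a decoupled model might look in the simplest case (constant translational velocity, constant angular velocity), the following Python sketch propagates the object with a closed-form quaternion and only then applies central projection. The function names, the (w, x, y, z) quaternion ordering, and the `focal_length` parameter are assumptions made for this sketch, not notation taken from the work itself.

```python
import numpy as np

def quat_from_angular_velocity(omega, t):
    """Unit quaternion (w, x, y, z) for constant angular velocity omega applied for time t."""
    omega = np.asarray(omega, dtype=float)
    rate = np.linalg.norm(omega)
    if rate < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])  # no rotation
    half_angle = 0.5 * rate * t
    return np.concatenate(([np.cos(half_angle)], np.sin(half_angle) * omega / rate))

def quat_to_matrix(q):
    """Rotation matrix of a unit quaternion (w, x, y, z)."""
    w, x, y, z = q
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def predict_image_points(structure, p0, v, omega, t, focal_length=1.0):
    """Image-plane coordinates of the feature points at time t.

    structure: (N, 3) feature positions in the object-centered frame.
    p0, v: position and constant velocity of the object-frame origin.
    omega: constant angular velocity about that origin.
    """
    structure = np.asarray(structure, dtype=float)
    rotation = quat_to_matrix(quat_from_angular_velocity(omega, t))
    origin = np.asarray(p0, dtype=float) + np.asarray(v, dtype=float) * t
    points = structure @ rotation.T + origin  # motion model, independent of imaging
    return focal_length * points[:, :2] / points[:, 2:3]  # central projection
```

Because `t` is an arbitrary real number, nothing in this sketch requires uniformly spaced frames, and a feature that is occluded in some frame can simply be left out of `structure` for that frame.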

Object kinematics can be interpolated or extrapolated to any desired time, based on available data. Extrapolated kinematics have potential applications in any situation in which a vision system is used to aid a machine's interaction with its environment. For example, by predicting an object's trajectory to some future time, a video sequence could be used to direct a manipulator arm to meet and grasp the object at that time. Extrapolation of the image coordinates of object feature points to future images could also be used to aid in feature extraction, segmentation, and other image processing operations. Approximate Cramér-Rao lower bounds are derived, which can be used to predict estimation accuracy for any given situation.

Both batch and recursive formulations are presented for estimation of the model parameters. Results using the batch approach are given for both simulated and real imagery, when translational motion has constant velocity and rotational motion has constant angular velocity, with unknown object structure. Recursive results are presented for the case of pure translation with known structure, based on simulated imagery. Image plane noise levels from 0.01% to 15% of the object image size are used.
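To give a feel for what a batch formulation can look like, the sketch below fits the initial position and constant velocity of an object with known structure to all frames at once. It is only an illustration under stated assumptions: the Gauss-Newton iteration with a finite-difference Jacobian is a generic stand-in rather than the estimator used in the work, the data are synthetic and noise-free, and every name and numeric value is hypothetical.

```python
import numpy as np

def project(structure, p0, v, t, focal_length=1.0):
    """Central projection of translating feature points at time t (no rotation)."""
    points = structure + p0 + v * t
    return focal_length * points[:, :2] / points[:, 2:3]

def residuals(theta, structure, times, measured):
    """Stacked image-plane errors over all frames for theta = (p0, v)."""
    p0, v = theta[:3], theta[3:]
    predicted = np.stack([project(structure, p0, v, t) for t in times])
    return (predicted - measured).ravel()

def gauss_newton(theta0, structure, times, measured, iterations=20, eps=1e-6):
    """Generic Gauss-Newton loop with a forward-difference Jacobian (an assumed stand-in)."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(iterations):
        r = residuals(theta, structure, times, measured)
        jacobian = np.empty((r.size, theta.size))
        for j in range(theta.size):
            step = np.zeros_like(theta)
            step[j] = eps
            jacobian[:, j] = (residuals(theta + step, structure, times, measured) - r) / eps
        delta, *_ = np.linalg.lstsq(jacobian, -r, rcond=None)
        theta = theta + delta
    return theta

# Hypothetical known structure: four non-coplanar feature points.
structure = np.array([[0.5, 0.0, 0.0], [-0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]])
times = np.linspace(0.0, 2.0, 10)  # frame times (need not be uniform)
true_theta = np.array([0.5, -0.3, 10.0, 0.2, 0.1, -0.5])  # (p0, v), made up for the demo
measured = np.stack([project(structure, true_theta[:3], true_theta[3:], t) for t in times])

estimate = gauss_newton([0.0, 0.0, 8.0, 0.0, 0.0, 0.0], structure, times, measured)
```

Adding zero-mean noise to `measured`, scaled as a fraction of the projected object size, would be one way to mimic the noise levels quoted above, although the exact noise definition used in the work is not specified here.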