Neurons in the visual cortex are known to possess localized, oriented receptive fields. It has previously been suggested that these distinctive properties may reflect an efficient image encoding strategy based on maximizing the sparseness of the distribution of output neuronal activities or, alternatively, on extracting the independent components of natural image ensembles. Here, we show that a relatively simple neural solution to the problem of transformation-invariant visual recognition also causes localized, oriented receptive fields to be learned from natural images. These receptive fields, which code for various transformations in the image plane, allow a pair of cooperating neural networks, one estimating object identity (``what'') and the other estimating object transformations (``where''), to simultaneously recognize an object and estimate its pose by jointly maximizing the a posteriori probability of generating the observed visual data. We provide experimental results demonstrating the ability of these networks to factor retinal stimuli into object-centred features and object-invariant transformations. The resulting neuronal architecture suggests concrete computational roles for the neuroanatomical connections known to exist between the dorsal and ventral visual pathways.