Jul 13, 2022 ▪ 137 min read (~20 pages) ▪ Computer Science ▪ Updated on Nov 16, 2022
This post is adapted from lecture notes taken during postgraduate studies at King's College London.
Computer vision is a field of artificial intelligence, and the term is generally used to describe methods for extracting information from images. Related disciplines include (note that image processing is image to image and computer graphics is description to image, whereas computer vision is image to description):
Computer vision is needed when machines interact with the physical world, such as in robotics, or when machines are used to extract useful information from images (and images are everywhere).
The transformation of an image into a description is not obvious; it is an ill-posed problem. One image can have many interpretations and one object can result in many images, so the space of possible solutions is exponentially large:
Most vision systems use constraints and priors, as in prior knowledge, to make some interpretations more likely than others. Our brain produces one interpretation from many possible ones. In an image, an object can appear in any location, scale, orientation, and color, so the number of possible images of the same object increases exponentially with the number of these parameters. Viewpoint, angle, illumination (light or dark), and deformations (smiling or tilting head) can affect appearance, and often result in images of the same object with almost no visual similarity. Within-category variation (e.g., a calculator looks more like a modern phone than a modern phone looks like an old phone) and other objects in the same image, such as background and clutter, can also result in images with no visual similarity.
Human perception uses inference, based on prior information about the world, to constrain results and (sometimes) produce a more accurate interpretation:
Human perception is influenced by prior expectations, or simply priors. The examples listed above are illusions, which arise from assumptions that our vision system makes to solve an under-constrained problem:
An image is formed when a sensor registers radiation that interacted with a physical object. An image that is formed is affected by two types of parameters (note that camera and computer vision systems can also work with non-human-visible wavelengths, such as infra-red and x-ray):
Colors are created either by mixing light (additive), which adds illumination to increase intensity and results in white at the extreme, or by mixing pigments (subtractive), which reflect less illumination to decrease intensity and result in black at the extreme. A color is determined by luminance, which is the amount of light striking the sensor, illuminance, which is the amount of light striking the surface, and reflectance, which is the fraction of light reflected by the surface (depends on surface material). So, luminance at a location, $(x, y)$, and wavelength, $w$, is a function of illuminance and reflectance at the same location and wavelength:
The human eye seems to recover surface color (reflectance), since perceived colors remain unchanged with changes to illumination, such as when looking at the color of fruits in different lighting.
Light spreads out from a point. Focusing means converging light (restricting the flow of light) into a single image point, and exposure is the time needed to let enough light through to form an image. In the pinhole camera model, a small pinhole gives a sharp but dim image (long exposure), and a large pinhole gives a bright but blurred image (short exposure).
A lens can be used with a large pinhole to produce a bright and sharp image; the lens keeps light from spreading out (light is refracted):
a lens is positioned between object and image
the thin lens equation is $\frac{1}{f} = \frac{1}{|z|} + \frac{1}{|z'|}$ where $f$ is focal length
A lens follows the pinhole model (also called perspective camera model) for objects that are in focus:
An external reference frame is used when camera is moving or with more than one camera (stereo vision):
In Euclidean geometry, objects are described as they are: transformations stay within the 3D world (translations and rotations) and preserve shape. In projective geometry, objects are described as they appear: transformations map a 3D world to a 2D image and include scaling and shear in addition to translation and rotation. A vanishing point is the projection of a point at infinity, such as the point where train tracks appear to meet in the distance.
A digital image is represented as a 2D array of numbers and is sampled at discrete points (pixels), where the value at each pixel is the light intensity at that point (0 is black and 255 is white, sometimes 1 is white). An image is usually denoted as $I$ with the origin in the top-left corner, a point on the image is denoted as $p$, such as $p = (x, y)^{T}$ (a column vector), and the pixel value is denoted as $I(p)$, or $I(x, y)$. The intensity value is averaged at each sampling point, also called pixelization, and represented using finite discrete values (quantization):
A charge-coupled device (CCD) is a semiconductor with a two-dimensional matrix of photo-sensors (photodiodes), where each sensor is a small, isolated capacitive region that can accumulate charge. The photoelectric effect converts photons on the sensors into electrons, and the accumulated charge is proportional to light intensity and exposure time. CCDs can be used with colored filters, such as a Bayer filter, to make different pixels selective to different colors.
A Bayer filter, or mask, has twice as many green filters as red or blue filters and samples colors at specific locations. Demosaicing is an algorithm that computes the color at each pixel (based on local red, green, and blue values in the subsampled images) to fill in the missing values:
The human eye has several parts (listing relevant only):
A photoreceptor is either a rod, which is highly sensitive and can operate in low light levels, or a cone, which has low sensitivity but is sensitive to different wavelengths. Blue cones respond to short wavelengths with peak sensitivity at 440nm, green cones to medium wavelengths with peak sensitivity at 545nm, and red cones to long wavelengths with peak sensitivity at 580nm. The blind spot has no photoreceptors (it is blind), the fovea has no rods but many cones, and the periphery has many rods and few cones. The fovea has high resolution (high density of photoreceptors), color vision (cones), and low sensitivity (no rods), whereas the periphery has low resolution (low density of photoreceptors), monochrome vision (rods), and high sensitivity (rods).
A center-surround receptive field (RF) is the area of visual space from which a neuron receives input:
A center-surround RF measures the change in intensity, or contrast, between adjacent locations. The relative contrast should be independent of lighting conditions, so illuminance should be irrelevant.
The process of applying a filter, or mask, to an image is called convolution, where the output at each location $(i, j)$ is the weighted sum of pixel values in adjacent locations (range is defined by $k$ and $l$), so $I' = I * H$ with $I'(i, j) = \sum_{k, l} I(i - k, j - l) H(k, l)$. Convolution is commutative, $I * H = H * I$, and associative, $(I * H) * G = I * (H * G)$, so for each pixel (note that a 2D convolution is separable if $H$ is a convolution of two vectors, $H = h_{1} * h_{2}$, which is much more efficient):
For more convolution examples, see: Example of 2D Convolution.
A filter, or mask, is a point-spread function: convolving an image of isolated white dots on a black background superimposes the mask at each dot. A mask is also a template: the convolution output is maximal when large image values are multiplied with large mask values, so the response is strongest at features that look like the rotated mask, and the rotated mask can be used as a template to find image features:
each pixel replaced by itself
0 | 0 | 0 |
0 | 1 | 0 |
0 | 0 | 0 |
each pixel replaced by the one to the left (rotated)
0 | 0 | 0 |
0 | 0 | 1 |
0 | 0 | 0 |
each pixel replaced by average of itself and its eight neighbors, also called smoothing, mean filter, or box mask (weights add up to one to preserve average grey levels)
1/9 | 1/9 | 1/9 |
1/9 | 1/9 | 1/9 |
1/9 | 1/9 | 1/9 |
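The three masks above can be tried out with a direct implementation of the convolution sum. This is only a minimal sketch: the function name `convolve2d`, the zero padding at the image border, and the small test image are assumptions made for illustration, and a real system would use an optimized library routine instead of Python loops.

```python
import numpy as np

def convolve2d(image, mask):
    """Direct 2D convolution with zero padding; output has the same shape as image."""
    image = np.asarray(image, dtype=float)
    mask = np.asarray(mask, dtype=float)
    mh, mw = mask.shape
    ph, pw = mh // 2, mw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))  # zeros outside the image (assumption)
    flipped = mask[::-1, ::-1]  # convolution uses the mask rotated by 180 degrees
    out = np.zeros_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + mh, j:j + mw] * flipped)
    return out

identity = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]])
shift = np.array([[0, 0, 0], [0, 0, 1], [0, 0, 0]])
box = np.full((3, 3), 1 / 9)

image = np.arange(25, dtype=float).reshape(5, 5)
same = convolve2d(image, identity)     # each pixel replaced by itself
shifted = convolve2d(image, shift)     # each pixel replaced by the one to its left
smoothed = convolve2d(image, box)      # each pixel replaced by a local average
```

Away from the border, `smoothed` holds the mean of each 3x3 neighborhood, while the first column of `shifted` is zero because the pixel to its left lies in the zero padding.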
A Gaussian mask gives more weight to nearby pixels to reduce hard edges (it is separable):
A difference mask is used to calculate differences instead of averages. The difference between pixel values gives the gradient of intensity values (smoothing approximates mathematical integration, differencing approximates differentiation) and highlights locations with intensity changes:
weights add up to zero to generate zero response to constant image regions
-1 | -1 | -1 |
-1 | 8 | -1 |
-1 | -1 | -1 |
Note that the Laplacian mask is a combination of difference masks in each direction and detects intensity discontinuities at all orientations (sensitive to noise).
An edge, or discontinuity, is often where the intensity changes in an image. Edges can usually be found with difference masks and often correspond to physical characteristics, or features:
Edges are most efficiently detected using both a differencing mask and a smoothing mask, where one approach is Gaussian smoothing combined with the Laplacian, also called Laplacian of Gaussian (LoG):
an example Laplacian of Gaussian mask (weights add up to zero)
-1/8 | -1/8 | -1/8 |
-1/8 | 1 | -1/8 |
-1/8 | -1/8 | -1/8 |
Other methods to find edges include:
An image feature can be found at different scales by applying filters of different sizes, or by applying filters of a fixed size to an image at different sizes (most common). A down-sampling algorithm is often used to scale an image by taking every $n$-th pixel, and can be used recursively to create images in various sizes. However, results can be good or bad (aliased, or not representative of the image), and smoothing before sampling can be used to fix bad sampling:
In our brain, the cortex is responsible for all higher cognitive functions, such as perception, learning, language, memory, and reasoning, and the pathway from our retina to the cortex is not straightforward:
The primary visual cortex (V1) has receptive fields (some similar to retinal ganglion cells) for color, orientation (stronger response at a specific angle, simple cells act as edge and bar detectors, complex cells act as edge and bar detectors with tolerance to location, hyper-complex cells are selective to length and optimal when matching width of RF), direction, spatial frequency, eye of origin (monocular cells receive input from one eye only, binocular cells are optimal with the same input from both eyes), binocular disparity (difference in location in each eye, depth of stimulus), and position.
V1 has center-surround RFs, like the LGN and retina, double-opponent (DO) cells that detect locations where color changes, and hypercolumns, each of which is a collection of all neurons with RFs at the same location on the retina (each hypercolumn can process every image attribute).
A 2D Gabor function is a Gaussian multiplied with a sinusoid:
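A minimal sketch of such a mask, assuming an isotropic Gaussian envelope and a cosine carrier. The function name `gabor` and the parameters `sigma`, `wavelength`, `theta`, and `phase` are illustrative choices, not notation from the lecture notes.

```python
import numpy as np

def gabor(size, sigma, wavelength, theta, phase=0.0):
    """Return a size x size Gabor mask: a Gaussian envelope times a sinusoid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Coordinate along the direction in which the sinusoid varies.
    x_r = x * np.cos(theta) + y * np.sin(theta)
    gaussian = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    sinusoid = np.cos(2 * np.pi * x_r / wavelength + phase)
    return gaussian * sinusoid

# A 9x9 mask whose sinusoid varies along the x-axis.
mask = gabor(size=9, sigma=2.0, wavelength=4.0, theta=0.0)
```

Changing `theta` rotates the preferred orientation, and changing `wavelength` changes the preferred spatial frequency, which is how a bank of such masks can cover different orientations and scales.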
An image component is a small part of different images that are selected (as a mask) and summed to produce an image, such as set of handwritten parts to generate a number as if written by different people:
A set of randomly selected image components in natural images are like Gabor functions, which seem to capture the intrinsic structure of natural images. A set of Gabor functions can represent every image (with sparsity constraint, such as using as few as possible), which is useful for:
The mid-level vision process involves grouping elements that belong together (intensities, colors, edges, features) and segmenting elements in groups from each other (differentiate between groups). A top-down approach is grouping elements that are on the same object based on internal knowledge or prior experience, whereas bottom-up approach is grouping elements that are similar based on image properties.
An object grouped with another object is influenced by Gestalt laws, bottom-up approach, to increase simplicity or likelihood (note that the border-ownership of an object is decided in V2, which comes after V1):
In the context of vision, a feature is used to determine which elements belong together (individually or in combination, based on Gestalt laws), such as location/proximity, color/similarity, texture/similarity, size/similarity, depth, motion/common fate, not separated by contour/common region, and forming a known shape (top-down approach). A feature space is a coordinate system with image elements, or features, as points, where similarity is determined by distance between points in the feature space:
Features can be weighted differently based on relative importance, finding best performance is non-trivial, but can be scaled to make calculations easier, such as within range of 0 and 1.
A region-based method for segmentation tries to group image elements with similar feature vectors (look similar), such as thresholding (applied to intensity), region growing, region merging, split and merge, k-means clustering, hierarchical clustering, and graph cutting. An edge-based method partitions an image based on changes in feature values (intensity discontinuities), such as thresholding (applied to edge detectors), Hough transform (model-based, fit data to a predefined model), and active contours (also model-based).
Thresholding methods have regions defined by differences in intensity values and feature space is one-dimensional, where $I'(x, y) = 1$ if $I(x, y)$ above threshold, otherwise $I'(x, y) = 0$ (note that results are often missing parts in figure, such as edges with missing pixels, or has unwanted pixels in figure):
Morphological operations can be used to clean up the results of thresholding, where the neighborhood is defined by a structuring element (matrix). Dilation expands the area of foreground pixels to fill gaps (a background pixel that neighbors a foreground pixel is changed from 0 to 1), and erosion shrinks the area of foreground pixels to remove unwanted pixels such as bridges and branches (a foreground pixel that neighbors a background pixel is changed from 1 to 0). The two can be used in combination to remove unwanted pixels and fill gaps:
A partitional clustering algorithm divides data into non-overlapping subsets, or clusters, where each data point is in exactly one cluster. A few methods to determine similarity between clusters:
The k-means clustering algorithm assumes $k$ clusters as input (works best with equally sized clusters):
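Thresholding followed by dilation and erosion can be sketched as below. The 3x3 structuring element of ones, the zero padding at the border, the threshold of 128, and the small test image are all assumptions chosen for illustration.

```python
import numpy as np

def threshold(image, t):
    """Binary image: 1 where intensity is above t, otherwise 0."""
    return (np.asarray(image) > t).astype(np.uint8)

def dilate(binary):
    """A pixel becomes 1 if any pixel in its 3x3 neighborhood is 1."""
    padded = np.pad(binary, 1)  # zero padding: outside the image is background
    out = np.zeros_like(binary)
    for di in range(3):
        for dj in range(3):
            out |= padded[di:di + binary.shape[0], dj:dj + binary.shape[1]]
    return out

def erode(binary):
    """A pixel stays 1 only if every pixel in its 3x3 neighborhood is 1."""
    padded = np.pad(binary, 1)
    out = np.ones_like(binary)
    for di in range(3):
        for dj in range(3):
            out &= padded[di:di + binary.shape[0], dj:dj + binary.shape[1]]
    return out

image = np.array([
    [0,   0,   0,   0,   0,   0, 0],
    [0, 200, 200, 200, 200, 200, 0],
    [0, 200, 200,  90, 200, 200, 0],  # one dark pixel inside the figure
    [0, 200, 200, 200, 200, 200, 0],
    [0,   0,   0,   0,   0,   0, 0],
])
binary = threshold(image, 128)   # the dark pixel leaves a hole in the figure
closed = erode(dilate(binary))   # dilation fills the hole, erosion restores the size
```

Applying dilation and then erosion fills the hole without growing the figure, whereas erosion followed by dilation would instead remove small isolated foreground pixels.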
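The usual k-means iteration (assign each point to the nearest cluster center, then move each center to the mean of its points) can be sketched as follows. Initializing the centers from randomly chosen data points, the Euclidean distance, and the names `k_means`, `iterations`, and `seed` are assumptions of this sketch.

```python
import numpy as np

def k_means(points, k, iterations=100, seed=0):
    """Cluster points (n x d array) into k clusters; returns (labels, centres)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points chosen at random.
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Assign every point to its nearest centre.
        distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centre to the mean of its points (keep it if the cluster is empty).
        new_centres = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centres[c]
            for c in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

features = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centres = k_means(features, k=2)
```

The result depends on the initial centers, so in practice the algorithm is often run several times with different seeds.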
Hierarchical clustering produces a set of nested clusters organized as a tree, such as divisive clustering, where data is regarded as single cluster and then recursively split, or agglomerative clustering, where each data point is regarded as cluster and then recursively merged with most similar cluster.
agglomerative clustering
A feature space can be represented as a graph, $G = (V, E)$, where vertices, $V$, represent image elements (feature vector), edges, $E$, represent connections between pair of vertices, and each edge is the similarity between the two connected elements. A graph cutting process involves:
Normalized cuts (Ncuts) are used to avoid bias towards small subgraphs (cost of cutting graph). Finding the set of nodes that produces the minimum cut is an NP-hard problem and only approximations are possible for real images, with bias towards partitioning into equal segments.
Fitting algorithms use mathematical models to represent a set of elements. A model of the outline of an object can be rotated and translated to compare with an image, and used to represent closely fitting data points, or line segments, in an image, where data points that fit the model are grouped together.
The Hough transform is useful for fitting lines, where data points vote on which line to belong to (there are often a lot of potential straight lines):
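The voting idea can be sketched with an accumulator array. This sketch assumes the common normal-form parameterization of a line, $\rho = x\cos\theta + y\sin\theta$, which may differ from the parameterization used in the lecture; the function name `hough_lines` and the bin sizes are illustrative.

```python
import numpy as np

def hough_lines(points, n_theta=180, rho_step=1.0):
    """Each point votes for every (rho, theta) line that passes through it."""
    points = np.asarray(points, dtype=float)
    thetas = np.deg2rad(np.arange(n_theta) * 180.0 / n_theta)
    # Largest possible |rho|, rounded up so that bins are centred on multiples of rho_step.
    max_rho = float(np.ceil(np.hypot(*np.abs(points).max(axis=0))))
    n_rho = int(np.ceil(2 * max_rho / rho_step)) + 1
    accumulator = np.zeros((n_rho, n_theta), dtype=int)
    for x, y in points:
        rhos = x * np.cos(thetas) + y * np.sin(thetas)
        rows = np.round((rhos + max_rho) / rho_step).astype(int)
        accumulator[rows, np.arange(n_theta)] += 1
    # The cell with the most votes is the best supported line.
    r, t = np.unravel_index(accumulator.argmax(), accumulator.shape)
    return r * rho_step - max_rho, thetas[t], accumulator

# Ten points on the horizontal line y = 5.
edge_points = [(x, 5) for x in range(0, 100, 10)]
rho, theta, votes = hough_lines(edge_points)
```

For these points the peak is at $\theta = 90°$ and $\rho = 5$, which is the line $y = 5$ in normal form. With noisy data, several peaks above a vote threshold would usually be kept instead of only the maximum.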
A generalized Hough transform is an extension to express shapes that cannot be expressed parametrically:
An active contour algorithm, also called snakes, should produce result that is near the edge and smooth, where a snake is a curve that moves to minimize energy:
Minimizing the energy of a snake results in a curve that is short, smooth, and close to intensity discontinuities, but it only works on shapes that are closed (no gaps), is sensitive to parameters (smooth vs short vs proximity), and depends on the initial position, which is often placed around the object manually.
Typically, multiple images are used when multiple cameras take two or more images at the same time (such as stereo, to recover 3D information), when one camera takes two or more images at different times (such as video, for motion tracking), or for object recognition, such as comparing the current image with training images. The correspondence problem refers to the problem of finding matching image elements across different views (need to decide what to search for, such as intensity or edges). Correspondence requires that most scene points are visible in both images and that corresponding regions appear similar, where some issues might be:
A correlation-based method tries to match image intensities over window of pixels to find correspondence:
Correlation-based methods are easy to implement and give a dense correspondence map (calculated at all points), but they are computationally expensive, constant or repetitive regions give false matches, and viewpoints cannot be too different.
A feature-based method finds correspondence by matching sparse sets of image features (detect interest points in both images):
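A correlation-based search can be sketched as below, using normalized cross-correlation as the similarity measure (other measures, such as the sum of squared differences, are also common). The search along a single row, the window size, and the names `ncc` and `match_along_row` are assumptions of this sketch.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equally sized windows, in [-1, 1]."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a**2).sum() * (b**2).sum())
    return 0.0 if denom == 0 else float((a * b).sum() / denom)

def match_along_row(left, right, row, col, half=1, max_disparity=4):
    """Find the horizontal shift d so that right[row, col - d] best matches left[row, col]."""
    window = left[row - half:row + half + 1, col - half:col + half + 1]
    best_score, best_d = -np.inf, 0
    for d in range(max_disparity + 1):
        c = col - d
        if c - half < 0:
            break  # candidate window would leave the image
        candidate = right[row - half:row + half + 1, c - half:c + half + 1]
        score = ncc(window, candidate)
        if score > best_score:
            best_score, best_d = score, d
    return best_d, best_score

rng = np.random.default_rng(0)
left = rng.random((8, 10))
right = np.roll(left, -2, axis=1)  # synthetic second view: content shifted 2 pixels left
d, score = match_along_row(left, right, row=3, col=5)
```

A constant window has zero variance, so `ncc` returns 0 for it, which reflects why constant regions cannot be matched reliably.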
start from image features extracted from preprocessing, match image features, and compare matches using distance between feature descriptions
for each interest point $ip_{1}$ in $I_{1}$, and for each $ip_{2}$ in $I_{2}$
Feature-based methods are less sensitive to illumination and appearance and computationally cheaper than correlation-based methods (only need to match selected locations instead of every pixel), but they have sparse correspondence maps, which is still sufficient for many tasks, and are bad with constant or random regions. The choice of interest points is important, where an interest point is typically a corner: the same point needs to be detected independently in each image (repeatable detector, invariant to scaling and rotation), and corresponding points need to be recognized in each image (distinctive descriptor, sufficiently complex to map with high probability).
Typically, corners are selected as interest points and can be detected by computing intensity gradients ($I_{x}$ and $I_{y}$), convolving the image with derivative of Gaussian masks, then summing gradients over a small area (Gaussian window) around each pixel:
The Harris corner detector is invariant to translation and rotation, partly invariant to illumination and viewpoint, but not invariant to scale. First find the matrix of summed gradient products (referred to here as the Hessian matrix, more commonly called the second-moment matrix or structure tensor), then find $R$:
corner is where $R$ is greater than some threshold
local maxima of $R$ as interest points
1 | 2 | 2 | 2 |
0 | 1 | [4] | 3 |
0 | 2 | 2 | 2 |
0 | 1 | 1 | 0 |
use non-maximum suppression to find local maxima
0 | 0 | 0 | 0 |
0 | 0 | 4 | 0 |
0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 |
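The non-maximum suppression step in the tables above can be sketched as follows, assuming a 3x3 neighborhood and that a pixel is kept only if it is strictly greater than all eight neighbors. The function name is illustrative.

```python
import numpy as np

def non_max_suppression(R):
    """Keep values that are strictly greater than all 8 neighbours; set the rest to 0."""
    R = np.asarray(R, dtype=float)
    padded = np.pad(R, 1, constant_values=-np.inf)  # border pixels have fewer neighbours
    out = np.zeros_like(R)
    for i in range(R.shape[0]):
        for j in range(R.shape[1]):
            window = padded[i:i + 3, j:j + 3].copy()
            window[1, 1] = -np.inf  # exclude the centre pixel itself
            if R[i, j] > window.max():
                out[i, j] = R[i, j]
    return out

R = np.array([
    [1, 2, 2, 2],
    [0, 1, 4, 3],
    [0, 2, 2, 2],
    [0, 1, 1, 0],
])
maxima = non_max_suppression(R)
```

For the example values of $R$ only the 4 survives, matching the second table. In a full detector, the threshold on $R$ would normally be applied as well.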
An image pyramid can be used to detect interest points at different scales (scale invariant interest points). The Harris-Laplacian finds local maxima in space and scale, where the descriptor (list of features) is a small window around the interest point, or a set of pixel intensity values.
The Scale Invariant Feature Transform (SIFT) finds local maxima using difference of Gaussians (DoG) in space and scale:
convolve image with DoG mask and repeat for different resolutions (create Laplacian image pyramid)
detect maxima and minima of difference of Gaussian across scale space
The descriptor can be found using this method:
calculate magnitude and orientation of intensity gradient at all pixels around interest point (using Gaussian smoothed image at scale where interest point is found, approximate using pixel differences)
create histogram of all orientations around the interest point
create separate histograms for all orientations in sub-windows
Correct matches are inliers and incorrect matches are outliers. The usual procedure is to extract features (feature-based methods), compute matches, and find the most likely transformation (the one with the most inliers and fewest outliers).
RANSAC is a random sample consensus algorithm and can be used to fit model to data set with outliers, where data consist of inliers and outliers, and parameterized model explains inliers:
RANSAC is simple and effective, and works with different model fitting problems, such as segmentation, camera transformation, and object trajectory, but it can sometimes require many iterations for models with a lot of parameters, which can be computationally expensive.
A camera projects a 3D point onto a 2D plane, so all 3D points on the same line-of-sight have the same 2D image location and depth information is lost, $x' = \frac{f'}{Z_{1}} X_{1} = \frac{f'}{Z_{2}} X_{2}$. A stereo image (two images) can be used to recover depth information, where points project to the same location in one image but to different locations in the other image. Two images can be used to measure how far the projections of each point are from each other (different viewpoints, need to solve the correspondence problem). This is useful for:
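A minimal RANSAC sketch for fitting a line $y = mx + c$ to points that include outliers. The choice of model, the residual threshold, the number of iterations, and the final least-squares refit on the inliers are assumptions made for illustration.

```python
import numpy as np

def ransac_line(points, iterations=200, threshold=0.1, seed=0):
    """Fit y = m*x + c while ignoring outliers; returns (m, c, inlier_mask)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(iterations):
        # 1. Randomly sample the minimum number of points needed for the model.
        i, j = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[i], points[j]
        if x1 == x2:
            continue  # a vertical line cannot be written as y = m*x + c
        # 2. Fit the model to the sample.
        m = (y2 - y1) / (x2 - x1)
        c = y1 - m * x1
        # 3. Count the points that agree with the model (the consensus set).
        residuals = np.abs(points[:, 1] - (m * points[:, 0] + c))
        inliers = residuals < threshold
        # 4. Keep the model with the largest consensus set.
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit to all inliers of the best model with least squares.
    m, c = np.polyfit(points[best_inliers, 0], points[best_inliers, 1], deg=1)
    return m, c, best_inliers

xs = np.arange(10, dtype=float)
line_points = np.column_stack([xs, 2 * xs + 1])           # inliers on y = 2x + 1
outliers = np.array([[1.0, 9.0], [4.0, -3.0], [8.0, 2.0]])
data = np.vstack([line_points, outliers])
m, c, inlier_mask = ransac_line(data)
```

The number of iterations needed grows with the fraction of outliers and with the number of points required to fit the model, which is the source of the computational cost mentioned above.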
Image formation with one camera is a 3D scene point, $P$, projected to point on image, $P'$, so $P' = (x', y') = \left( f\frac{x}{z}, f\frac{y}{z} \right)$. Image formation with two cameras (stereo) is a 3D scene point projected to $(x_{R}', y_{R}')$ and $(x_{L}', y_{L}')$, so $(x_{L}', y_{L}') = \left( f\frac{x}{z}, f\frac{y}{z} \right)$ and $(x_{R}', y_{R}') = \left( f\frac{x - B}{z}, f\frac{y}{z} \right)$, where $B$ is baseline, or distance between cameras, $x_{R} = x_{L} - B$.
Disparity, denoted as $d$, is the difference between the coordinates of two corresponding points (2D vector). A pair of stereo images defines a field of disparity vectors, or disparity map, and coplanar cameras have disparity in $x$-coordinates only, where $d = x_{L}' - x_{R}' = f\frac{x}{z} - f\frac{x - B}{z} = f\frac{B}{z}$, or $z = f\frac{B}{d}$:
The epipolar constraint for cameras on the same plane is $y_{L}' = y_{R}'$, so it is possible to search along a straight line to find the corresponding point in the other image. The maximum disparity constraint is that the length of the search region depends on the maximum expected disparity, which occurs at the minimum expected depth, so $d_{\textrm{max}} = f\frac{B}{z_{\textrm{min}}}$:
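The relation $z = f\frac{B}{d}$ for coplanar cameras can be checked numerically. The focal length, baseline, and scene point below are made-up values for illustration, and the function name is hypothetical.

```python
def depth_from_disparity(x_left, x_right, focal_length, baseline):
    """Depth z = f * B / d for coplanar cameras, with d = x_left - x_right."""
    d = x_left - x_right
    if d <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_length * baseline / d

# Made-up example: f = 500 (pixels), B = 0.1 (metres), scene point at x = 0.2, z = 2.0.
f, B = 500.0, 0.1
x, z = 0.2, 2.0
x_left = f * x / z          # projection in the left image
x_right = f * (x - B) / z   # projection in the right image
recovered_z = depth_from_disparity(x_left, x_right, f, B)
```

Because depth is inversely proportional to disparity, a fixed error in the measured disparity causes a larger depth error for distant points than for nearby points.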
Other constraints are:
A field of view (FOV) is the set of scene points visible to a camera, where one camera has one FOV and two cameras have two FOVs, with one common FOV visible to both cameras (coplanar):
Non-coplanar cameras are when cameras are at different planes with intersecting optical axes, which gives large common FOV and small depth error, fixation point determined by convergence angle (note that corresponding points still occur on straight lines, or epipolar lines, but not strictly horizontal, so search can be reduced to line):
In epipolar geometry:
Rectification is a transform that makes epipolar lines parallel to the rows of an image, warping both images so that all epipolar lines are horizontal (after which the cameras can be treated as coplanar).
Depth in images can be recognized using different methods:
A video is a series of $n$ images, also referred to as frames, at discrete time instants. A static image has pixel intensity as a function of spatial coordinates $x$, $y$, so $I(x, y)$, whereas a video has pixel intensity as a function of spatial coordinates $x$, $y$ and time $t$, so $I(x, y, t)$. Optic flow is the change in position from one image to another, an optic flow vector is the image motion of a scene point, and an optic flow field is the collection of all optic flow vectors, which can be sparse or dense (defined for specified features or defined everywhere). Optic flow provides an approximation of the motion field, which is the true image motion of scene points from the actual projection of relative motion between camera and 3D scene, but it is not always accurate (smooth surfaces, moving light source). It is measured by finding corresponding points in different frames (note that a discontinuity in an optic flow field indicates different depths, so different objects):
Constraints for finding corresponding points in video:
Optic flow is measured to estimate the layout of the environment, such as depth and orientation of surfaces, to estimate ego motion (camera velocity relative to visual frame of reference), to estimate object motion relative to the visual frame of reference or environment frame of reference, and to predict information for control of action. The aperture problem is the inability to determine optic flow along the direction of a brightness pattern (no edges or corners along straight lines), where any movement with the same component perpendicular to an edge is possible. It is solved by combining local motion measurements across space.
Below is an example of recovering depth from velocity if the direction of motion is perpendicular to the optical axis and the velocity of the camera is known:
$x = f\frac{X}{Z}$, $Z = f\frac{X_{1}}{x_{1}} = f\frac{X_{2}}{x_{2}}$ and $\dot{x} = \frac{x_{2} - x_{1}}{t}$, so $Z = f\frac{V}{\dot{x}}$, where $V = \frac{X_{2} - X_{1}}{t}$ is the velocity of the point relative to the camera
Below is an example of recovering depth from velocity if the direction of motion is along the optical axis and the velocity of the camera is known:
$x = f\frac{X}{Z}$, where $fX = x_{1} Z_{1} = x_{2} Z_{2}$, $Z_{1} = Z_{2} + Vt$ and $\dot{x} = \frac{x_{2} - x_{1}}{t}$, so $Z_{2} = \frac{Vx_{1}}{\dot{x}}$
Below is an example of recovering depth from velocity if the direction of motion is along the optical axis and the velocity of the camera is not known (time-to-collision if velocity is constant, used by birds to catch prey and land without crashing into surfaces):
$Z_{2} = \frac{Vx_{1}}{\dot{x}}$, so time-to-collision is $\frac{Z_{2}}{V} = \frac{x_{1}}{\dot{x}}$
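The case of motion along the optical axis can be checked with a small numerical sketch. It treats the image velocity as $\frac{x_{2} - x_{1}}{t}$, and the focal length, scene point, and camera speed are made-up values; the function names are hypothetical.

```python
def depth_along_axis(x1, x2, t, speed):
    """Depth at the second frame when the camera moves along the optical axis at a known speed."""
    x_dot = (x2 - x1) / t  # image velocity of the point
    return speed * x1 / x_dot

def time_to_collision(x1, x2, t):
    """Time until the camera reaches the point; needs no knowledge of speed or depth."""
    x_dot = (x2 - x1) / t
    return x1 / x_dot

# Made-up scene: f = 1, point at X = 1, depth 10 in frame 1, camera speed 2, frames 1 time unit apart.
f, X, Z1, V, t = 1.0, 1.0, 10.0, 2.0, 1.0
Z2 = Z1 - V * t       # the point is closer in the second frame
x1 = f * X / Z1       # image position in frame 1
x2 = f * X / Z2       # image position in frame 2
estimated_Z2 = depth_along_axis(x1, x2, t, V)
ttc = time_to_collision(x1, x2, t)
```

Here the estimated depth is 8 and the time-to-collision is 4 time units, which equals $Z_{2} / V$ even though `time_to_collision` never uses the speed.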
A parallel optic flow field, where depth velocity is zero, is when all optic flow vectors are parallel, direction of camera movement is opposite to direction of optic flow field, speed of camera movement is proportional to length of optic flow vectors, and depth is inversely proportional to magnitude of optic flow vector (like motion parallax with fixation on infinity).
A radial optic flow field, where depth velocity is not zero, is when all optic flow vectors point towards or away from vanishing point, direction of camera movement is determined by focus of expansion or focus of contraction, destination of movement is focus of expansion, depth is inversely proportional to magnitude of optic flow vector and proportional to distance from point to vanishing point.
An optic flow algorithm, such as a tracking algorithm, matches high-level features across several frames to track objects. It uses previous frames to predict the location in the next frame (Kalman filter, particle filter) to restrict the search and do less work. Measurement noise is averaged to get better estimates, where an object matched to a location should be near the predicted location.
Segmentation from motion is typically done using optic flow discontinuities, optic flow and depth, or Gestalt law of common fate:
image differencing is the process of subtracting consecutive frames pixel by pixel to create a binary image (absolute difference above threshold), where intensity levels change the most in regions with motion
background subtraction is when a background image is used as the reference and subtracted from each frame, where the reference can be a static image of the background or the previous image in the sequence (adjacent frame difference, which is like image differencing), $B(x, y) = I(x, y, t - 1)$
In background subtraction, new objects that are temporarily stationary are seen as foreground, and dilation and erosion can be used to clean the result. Given $B(x, y)$ and $I(x, y, t)$, a pixel is foreground if $|I(x, y, t) - B(x, y)| > T$, where $T$ is some threshold:
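The thresholded difference can be sketched in a few lines. The frame size, intensity values, and threshold below are made up for illustration, and the function name is hypothetical.

```python
import numpy as np

def foreground_mask(frame, background, threshold):
    """Binary mask that is 1 where |I(x, y, t) - B(x, y)| > T."""
    # Convert to float first: subtracting uint8 images directly would wrap around.
    diff = np.abs(frame.astype(float) - background.astype(float))
    return (diff > threshold).astype(np.uint8)

background = np.full((4, 4), 10, dtype=np.uint8)   # static reference image
frame = background.copy()
frame[1:3, 1:3] = 200                              # a bright object enters the scene
mask = foreground_mask(frame, background, threshold=50)
```

Using the previous frame as `background` turns the same function into adjacent frame differencing.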
An object recognition task is often to determine the identity of an individual instance of an object, such as recognizing different phone models, or people. A classification task is to determine category of an object, such as human vs ape, or phone vs calculator, where each category has a different level:
A localization task determines the presence and location of an object in an image, such as finding faces and cars, where image segmentation is used to determine the location of multiple different objects in an image (semantic segmentation). Object recognition should be sensitive to small image differences relevant to different categories of objects (phone vs calculator) and insensitive to large image differences that do not affect the identity or category of an object (mobile phone vs old phone), such as background clutter and occlusion, viewpoint, lighting, non-rigid deformations (same object changing shape, wind blowing in tree), or variation within category (different types of chairs).
Note that local representations make it hard to distinguish between similar objects (many false positives), whereas global representations are sensitive to viewpoint and occlusion (many false negatives), so one approach to object recognition is to use features of intermediate complexity between local and global, or a hierarchy of features with a range of complexities, or sensitivities.
The object recognition process is to associate information extracted from images with objects, which requires accurate image data, representations of objects, and some matching technique:
Note that an image can be represented as:
The most common methods used for object recognition are sliding window, ISM, SIFT, and bag-of-words, where matching procedures are:
Template matching is a general technique for object recognition, where image of some object to be recognized is represented as a template, which is an array with pixel intensities (note that template needs to be very similar to target object, so not scaled or rotated, and is sensitive to occlusion):
A sliding window method is a template matching method with a classifier for each image region, where each image region is warped to increase tolerance to changes in appearance. The classifier, which could be a deep neural network (CNN), then determines if image region contains the object instead of simple intensity value comparison. However, it is very computationally expensive (pre-processing using image segmentation can reduce number of image regions to classify, sky not likely to have cows in it).
In edge-matching, an image of some object is represented as a template, which is pre-processed to extract edges:
A model-based method uses a model of object identity and pose, where the object is rendered in the image (back-projected) and compared to the image (edges in model match edges in image). The object is represented as a 2D or 3D model of its shape, and compared using an edge score, based on image edges near predicted object edges (unreliable), or an oriented edge score, based on image edges near predicted object edges with the correct orientation. It is very computationally expensive.
An intensity histogram method uses a histogram of pixel intensity values (grayscale or color) to represent some object, and color histograms are commonly used in face detection and recognition algorithms. Histograms are compared to find the closest match, which is fast and easy to compute, and matches are insensitive to small viewpoint changes. However, it is sensitive to lighting and within-category changes to appearance, and insensitive to spatial configuration, where images with the same pixel values in different spatial configurations result in the same histogram.
An implicit shape model (ISM) method is where 2D image fragments, or parts, are extracted from image regions around each interest point (Harris detector):
A feature-based method transforms training image content into local features that are invariant to translation, rotation, and scale. Local features are extracted from the input image and then matched with training image features. In SIFT feature matching, the descriptor is a 128-element histogram of orientations of intensity gradients, binned into 8 orientations for each cell of a 4x4 grid of sub-windows around the interest point, and normalized so that the dominant orientation is vertical:
A bag-of-words method is when different objects have distinct set of features that occur in different frequencies:
A geometric invariant is some property of an object in a scene that does not change with viewpoint (note that geometric invariant methods compare values of the cross-ratio in an image with cross-ratios in training images, but are sensitive to occlusion and to the availability and distinctiveness of points):
Object-based theory, or recognition-by-components, is a concept where an object is represented as 3D model with an object-centered reference frame, objects are stored in our brain as structural descriptions (parts and configuration), and each object is a collection of shapes (or geometric components), such as human head and body could be represented as a cube shape above cylinder shape. Geometric components are primitive elements (or geons, letters forming words), such as cube, wedge, pyramid, cylinder, barrel, arch, cone, expanded cylinder, handle, expanded handle, and different combinations of geons can be used to represent a large variety of objects:
Image-based theory is a concept where an object is represented by multiple 2D views with a viewer-centered reference frame and matched using template matching, which is an early version of an image-based approach (less flexible compared to human recognition), or multiple views approach, which encode multiple views of an object through experience (templates for recognition).
Classification assigns objects to different categories based on their features in a feature space, where category membership is defined by abstract rules (such as "has three sides, four legs, and barks"):
A nearest neighbor classifier has a non-linear decision boundary and does not deal well with outliers. The k-nearest neighbors classifier also has a non-linear decision boundary but reduces the effect of outliers. To determine similarity (using similarity measures):
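The difference between the two classifiers can be sketched with a small example. Euclidean distance is assumed as the similarity measure, and the feature values and labels are made up, with one deliberately mislabeled outlier.

```python
from collections import Counter

import numpy as np

def knn_classify(query, samples, labels, k=3):
    """Label a query point by majority vote among its k nearest training samples."""
    samples = np.asarray(samples, dtype=float)
    distances = np.linalg.norm(samples - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

samples = [[1, 1], [1, 2], [2, 1],      # phones
           [8, 8], [8, 9], [9, 8],      # calculators
           [1.4, 1.4]]                  # an outlier among the phones
labels = ["phone", "phone", "phone",
          "calculator", "calculator", "calculator",
          "calculator"]

query = [1.5, 1.5]
nearest_neighbor = knn_classify(query, samples, labels, k=1)
three_nearest = knn_classify(query, samples, labels, k=3)
```

With `k=1` the query takes the label of the outlier, while with `k=3` the two neighboring phones outvote it.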
In the cortical visual system, there are pathways for different kinds of information, such as spatial and motion (the where and how), which goes from V1 to parietal cortex, or identity and category (the what), which goes from V1 to inferotemporal cortex. The receptive fields become greater further down a pathway, like a hierarchy, or progression:
A feedforward model processes an image through layers of neurons with progressively more complex receptive fields, at progressively less specific locations, where features at one stage are built from features at earlier stages (hierarchical). Hierarchical max-pooling (HMAX) is a deep neural network with different mathematical operations required to increase complexity of receptive fields, where multiple models use alternating layers of neurons with different properties, such as $\textrm{simple}(S)$ and $\textrm{complex}(C)$. The simple cells are sums (similar to logical $and$ operation) and used to increase selectivity, whereas complex cells are max (similar to logical $or$ operation) and used to increase invariance. A convolutional neural network (CNN) is a deep neural network similar to HMAX, but with alternating layers of convolution and sub-sampling.