In this talk, I will introduce a structure model for an array of
rectangles. The rectangles are aligned both horizontally (in rows) and
vertically (in columns), but the spacing between rows and columns is
left unconstrained. Such a structure may represent, for instance,
windows in an (orthographically rectified) facade image. Parsing a
facade image into array-like structures of various facade elements is
our toy problem.
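As a rough illustration of the structure model only (not of the
grammar itself), the array can be written down as a set of column
intervals and a set of row intervals whose positions and sizes are
free. The class name `RectArray` and its fields are hypothetical, and
the sketch assumes that all rectangles in a column share one width and
all rectangles in a row share one height, which the description above
does not state explicitly.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, size) in pixels


@dataclass(frozen=True)
class RectArray:
    """Hypothetical array-of-rectangles structure.

    Columns and rows are arbitrary intervals, so the spacing between
    them is unconstrained; only the alignment is enforced.
    """
    columns: Tuple[Interval, ...]  # (x, width) of each column
    rows: Tuple[Interval, ...]     # (y, height) of each row

    def rectangles(self) -> List[Tuple[int, int, int, int]]:
        # One rectangle (x, y, width, height) per row/column pair,
        # listed row by row.
        return [(x, y, w, h)
                for (y, h) in self.rows
                for (x, w) in self.columns]
```

Two columns and two rows with uneven gaps, for example, would yield
four aligned rectangles.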
We implement this simple structure model using a kind of attributed
grammar, inspired by the work of Zhu et al. The advantages of this
choice will be discussed. The task is to find the most probable array
of windows in a Bayesian framework. Our algorithm finds an approximate
solution. The method is fast and works surprisingly well, even when a
very simple pixelwise model is used for the image likelihood. This is due
to a strong interplay between the structural model and the image model
during parsing.
An attempt to generalize this result involves two alternative
mechanisms I will briefly mention in this talk:
1. EM-like image model focusing. An alternating algorithm starts from
an initial image model, runs the parser, updates the image model
from the current interpretation, re-runs the parser with the
updated model, etc. We observed quite stable and fast convergence.
2. On-line incremental appearance learning for a terminal-symbol
(e.g. window) detector. I will demonstrate that the combination of
on-line learning and structural modeling helps prevent the
classifier from drifting away from the initial terminal-symbol
class during on-line appearance learning.
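The alternating scheme in item 1 can be sketched as a simple loop. The
following is a toy stand-in under strong assumptions, not the parser
from the talk: the "image" is a 1-D scanline, the "structure" is a
single window interval, the image model is a pair of mean intensities
(window, wall), and parsing is a brute-force least-squares search. All
function names are hypothetical.

```python
def parse(signal, model):
    """Toy parser: best single window interval [a, b) under the model."""
    mu_win, mu_wall = model
    best, best_cost = None, float("inf")
    n = len(signal)
    for a in range(n):
        for b in range(a + 1, n + 1):
            # Pixelwise squared error: window mean inside, wall mean outside.
            cost = sum((v - (mu_win if a <= i < b else mu_wall)) ** 2
                       for i, v in enumerate(signal))
            if cost < best_cost:
                best, best_cost = (a, b), cost
    return best


def update_model(signal, interval):
    """Re-estimate both means from the current interpretation."""
    a, b = interval
    inside = signal[a:b]
    outside = signal[:a] + signal[b:]
    mu_win = sum(inside) / len(inside)
    # If the window covers everything, fall back to the window mean.
    mu_wall = sum(outside) / len(outside) if outside else mu_win
    return (mu_win, mu_wall)


def alternate(signal, model, max_iters=20):
    """Alternate parsing and model updates until the parse stops changing."""
    interpretation = None
    for _ in range(max_iters):
        new_interpretation = parse(signal, model)
        if new_interpretation == interpretation:
            break  # fixed point reached
        interpretation = new_interpretation
        model = update_model(signal, interpretation)
    return interpretation, model
```

Starting from a deliberately poor initial model, the loop sharpens the
two means from the first interpretation and stops once the parse no
longer changes.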
Although this is still early work, it is possible to say that
attributed image grammars are a strong and efficient modeling tool
that holds great promise in applications.