Presentation of Multiple
Geo-Referenced Videos
Zhang Lingyan
B. Eng. (Hons), Zhejiang University
A THESIS SUBMITTED
FOR THE DEGREE OF MASTER OF SCIENCE
COMPUTER SCIENCE DEPARTMENT
SCHOOL OF COMPUTING
NATIONAL UNIVERSITY OF SINGAPORE
2011
Abstract
Geo-tagging is becoming increasingly common as location information, collected from sources such as the Global Positioning System (GPS) and digital compasses, is associated with various kinds of data. In the field of media, images and, most recently, videos can be automatically tagged with the geographic position of the camera. Geo-tagged videos can be searched according to their location information, which can make a query more specific and precise if the user already knows the place he or she is interested in. In this thesis we consider the challenge of presenting geo-referenced videos, and we first review the related work in this area. A number of researchers have focused on geo-tagged images, while few have considered geo-tagged videos.
Earlier literature presents the concept of Field-of-View (FOV) which we also adopt
in our research. In addition, recently the concept of 3D virtual environments has
gained increased prominence, with Google Earth being one example. Some of them
are so-called mirror-worlds – large-scale environments that are essentially detailed
computer-models of our three-dimensional real world. The focus of our work is on
utilizing such virtual environments for the presentation of multiple geo-referenced
videos. We are proposing an algorithm to compute a reasonable viewing location or
viewpoint for an observer of multiple videos. Without calculating the viewpoint, it
might be difficult to find the best location to watch several videos. Our proposed
system automatically presents multiple geo-referenced videos according to an advantageous viewpoint. We performed several experiments to demonstrate the usefulness
and feasibility of our algorithm. We conclude the thesis by describing some of the
challenges of our research and possible future work.
Acknowledgments
First, my sincere thanks go to Dr. Roger Zimmermann, my advisor, for his guidance. He carefully introduced me to the concepts behind the presentation of geo-referenced videos and regularly took the time to discuss my work and point me in the right direction, from which I benefited greatly.
The completion of this thesis was made possible with the great help of Dr. Beomjoo Seo, a research fellow in my supervisor's group. His insightful advice and patient explanations were very useful throughout my research. In addition, grateful thanks go to Dr. Sakire Arslan Ay, a former student of my supervisor, for her constructive suggestions and academic discussions.
Finally, I would like to thank my parents and my friends for their support.
Contents

Summary
List of Tables
List of Figures

1 Introduction
  1.1 Motivation
  1.2 Research Problem Statement
  1.3 Thesis Roadmap

2 Literature Survey
  2.1 Definition of Related Concepts
  2.2 Geo-Spatial Techniques for Images
    2.2.1 Image Browsing
    2.2.2 Image Hierarchies and Clustering
    2.2.3 Image Presentation
    2.2.4 Summary
  2.3 Indexing and Retrieving
  2.4 Field-of-View Models
  2.5 Geo-Location Techniques for Videos
    2.5.1 Sensor-Based Videos
    2.5.2 Presentation of Videos
    2.5.3 Obtaining Viewpoints of Videos
    2.5.4 Video Compression
    2.5.5 Augmented Environments
    2.5.6 Summary
  2.6 Conclusions

3 System Overview
  3.1 Architecture of GRVS
  3.2 Data Acquisition
  3.3 Database Implementation
  3.4 2D Search Engine
    3.4.1 Web Interface
    3.4.2 Communication between Client and Server
    3.4.3 Video Management
  3.5 3D Search Engine
    3.5.1 Web Interface
    3.5.2 Communication between Client and Server
    3.5.3 Video Management
    3.5.4 The Algorithm for Presentation of Multiple Videos

4 Evaluation
  4.1 Experiment Design
  4.2 Experimental Results
  4.3 Discussion and Analysis

5 Challenges and Future Work
  5.1 Challenges
  5.2 Future Work
    5.2.1 Complete and Extend Previous Work
    5.2.2 3D Query Method
    5.2.3 Adjustment of Video Quality

6 Conclusions
  6.1 Summary
  6.2 Contributions

7 List of Publications
Summary
The primary objective of this thesis is to present multiple geo-referenced videos
in a useful way within 2D or 3D mirror worlds. As the number of geo-tagged videos
is increasing, showing multiple videos within a virtual environment may become
an important topic of media research. We conjecture that presenting videos in the
context of maps or virtual environments is a more precise and comprehensive way.
Our example geo-referenced videos contain longitude, latitude, directional heading,
and video timestamp information which can aid in the search of videos that capture
a particular region of interest. Our main work focuses on presenting the videos
in 3D environments. Therefore, we show the videos with a 3D perspective that
may present the scene at a certain angle. Furthermore, to show multiple videos,
we propose an algorithm to compute a suitable common viewpoint to observe these
videos. To obtain a better viewpoint we provide some guiding rules. Finally, we
perform an experiment with our system to examine its feasibility and effectiveness.
We have studied the literature on existing advanced technologies in detail and leverage it for reference. There exist many models that make use of the field-of-view (FOV) concept based on location and orientation information, which
we also use in our research. In a virtual world like Second Life, although it is
an imaginary environment, the user can watch videos which are correctly warped
according to the 3D perspective. Learning from this example, we have adopted video
presentation with a 3D perspective in our system. In later sections we will describe
the implementation of our system and the design of a prototype geo-referenced
video search engine for both 2D and 3D environments. In our system, we have
implemented the querying of geo-referenced videos and their presentation with Google
Maps and Google Earth. We will show the adopted architecture, the database design
and the 2D and 3D implementations. Furthermore, the evaluation of our system is
shown, which involves our algorithm for calculating the viewpoint. Combined with
a web interface, we can visually show the results and check the effectiveness of our
algorithm. Although there are some tradeoffs in our approach, we believe that it
is useful. There are many conditions we need to consider when implementing this
algorithm. Firstly, if there are more than two videos, calculating the viewpoint is
more difficult, for example when two videos are close together or when more than
four videos must be shown. Secondly, if two videos are shot in opposite directions,
we need to decide which video will be in the viewable scene. Thirdly, if the computed
viewpoint is far from the camera positions, we may need to move it closer to the
cameras. Finally, we also introduce some challenges of our research, show possible
future work, and draw conclusions and contributions of our work.
To summarize, our novel system can present multiple geo-referenced videos with
a 3D perspective in a corresponding virtual environment. As a basis for our system, we propose an algorithm to show multiple videos. As demonstrated through
experiments, the approach produces useful results.
List of Tables

2.1 Summary of features of different techniques for images.
2.2 The features of different techniques for videos.
3.1 Schema for 3D field-of-view (FOV) representation.
List of Figures

1.1 Illustration of FOVScene model (a) in 2D and (b) in 3D.
1.2 Early setup for geo-tagged video data collection: laptop computer, OceanServer OS5000-US compass, Canon VIXIA HV30 camera, and Pharos iGPS-500 receiver.
1.3 Integrated iPhone application for geo-tagged video acquisition.
1.4 Android application for geo-tagged video data acquisition.
1.5 Example Google Earth 3D environment of the Marina Bay area in Singapore.
1.6 The difference between presenting videos in 2D perspective or 3D perspective.
2.1 Information Retrieval versus Data Retrieval spectrum.
2.2 Pictorial diagram of angle of view.
2.3 Architecture of ThemExplorer.
2.4 PhotoCompas system diagram.
2.5 System architecture for generating representative summaries of landmark image sets.
2.6 Estimated camera locations for the Great Wall data set.
2.7 Screenshots of the explorer interface. Right: a view looking down on the Prague dataset, rendered in a non-photorealistic style. Left: when the user visits a photo, that photo appears at full-resolution, and information about it appears in a pane on the left.
2.8 Overview of the process of indexing a video segment.
2.9 Picture browser interface.
2.10 Field of view evaluation. If |HA - BA| is less than a given threshold, then point B is in the field of view of point A. If |HA - HB| is less than a given threshold, then the pictures taken at A and B have similar heading directions. If both of these conditions are met, then image_b, taken at point B, is in the field of view of image_a, taken at A.
2.11 Visualization of a Viewpoint in 3D space and how it conceptually relates to a video sequence frame and GPS point. While the image defines a viewing plane that is orthogonal to the Ortho Photo, in spatial terms the polyhedron, or more specifically the frustum, defines the spatial extent. Scales are not preserved.
2.12 Illustration of filter-refinement steps.
2.13 The video results of a circle scene query (a) and a FOV scene query (b).
2.14 FOV representation in different spaces.
2.15 SEVA recorder laptop equipped with a camera, a 3D digital compass, a Mote with wireless radio and Cricket receiver, a GPS receiver, and 802.11b wireless.
2.16 Sample screenshots from the prototype.
2.17 Schematic of the Re-cinematography process. Conceptually, an image mosaic is constructed for the video clip and a virtual camera viewing this mosaic is keyframed. Yellow denotes the source camera path, magenta (dark) the keyframed virtual camera.
2.18 Orientation based visualization model using a minimum bounding box, MBB.
2.19 Transfer of corresponding points.
2.20 H.264 encoder block diagram.
2.21 Components of the Augmented Virtual Environment (AVE) system with dynamic modeling.
2.22 Overview of approach to generate ALIVE cities that one can browse and see dynamic and live Aerial Earth Maps, highlighting the three main stages of Observation, Registration, and Simulation.
2.23 A taxonomy of related work technologies.
3.1 Architecture of geo-referenced video search.
3.2 Data flow diagram of geo-referenced video search.
3.3 Geo-referenced 2D video search engine web interface.
3.4 Sensor meta-data exchanged between client and server. The XML file includes GPS coordinates, compass heading, radius, view angle, and video segment information (start time, duration, and video file name).
3.5 Geo-referenced 3D video search engine web interface showing multiple videos simultaneously.
3.6 Sensor meta-data exchanged between client and server. The XML file includes GPS coordinates, compass heading, radius, view angle, and video segment information (start time, duration, and video file name) for multiple geo-referenced videos.
3.7 Sensor meta-data produced by the server and invoked by the client. The KML file includes GPS coordinates, compass heading, waiting time, and trajectory.
3.8 Different situations when either of the directions is 90 degrees or 270 degrees.
3.9 Same direction for two videos to compute viewpoint.
3.10 Opposite direction for two videos to compute viewpoint.
3.11 General case for two videos to compute viewpoint.
3.12 Best situation to compute viewpoint of four videos.
4.1 Showing one geo-referenced video.
4.2 Showing two geo-referenced videos simultaneously.
4.3 The trajectory of three videos for different cases.
Chapter 1
Introduction
1.1
Motivation
Due to technological advances, an increasing number of videos are being collected
together with sensor information from devices such as GPS receivers, digital compasses, and
Motes with wireless radios. Additionally, 2D and 3D virtual environments mirroring our real world now exist. Therefore it is possible to use this sensor meta-data
to present the associated videos from the corresponding viewpoints in these
mirror-world environments. The captured geographic meta-data have significant
potential to aid in the process of indexing and searching geo-referenced video data,
especially in location-aware video applications.
If videos are presented in a useful way, users can directly find what they desire
through the videos' associated location and orientation information. Furthermore, the videos are presented within a 3D virtual environment that contains real-world
location information (longitude and latitude data). Based on this environment, our presentation approach can match real-world videos with the 3D
virtual world and give users an intuitive feel when obtaining the video results.
Technological advances have led to interesting developments in the following three
areas:
• Location and direction information can now be affordably collected through
GPS and compass sensors. By combining location data with other information,
interesting new applications can be developed. Location data also gives rise to
a natural organization of information by superimposing it on maps that can
be browsed and queried.
• While maps are two-dimensional, three-dimensional mirror worlds have recently appeared. In these networked virtual environments, the real world is
“mirrored” with digital models of buildings, trees, roads, etc. Mirror worlds
allow a user to explore, for example, a city from the comfort of their home in
a very realistic way.

Figure 1.1: Illustration of FOVScene model (a) in 2D and (b) in 3D. In 2D, P <longitude, latitude> denotes the camera location, θ the viewable angle, d the camera direction vector, and R the visible distance. In 3D, P <longitude, latitude, altitude> denotes the camera location, θ and φ the horizontal and vertical viewable angles, d the camera direction vector (in 3D), and R the visible distance.
• High quality video camcorders are now quite inexpensive and the amount of
user collected video data is growing at an astounding rate. With a large video
data set, we can obtain more precise and convincing results.
Our goal with the presented approach is to harness the confluence of the above
developments. Specifically, we envision a detailed mirror world that is augmented
with (possibly user-collected) videos that are correctly positioned in such a way
that they overlay the 2D and 3D structures behind them, hence bringing the mostly
static mirror world to life and providing a more dynamic experience to the user who
is exploring such a world.
As a basis for our work we leverage a query system called Geo-Referenced Video
Search (GRVS) which is a web-based video search engine that allows geo-referenced
videos to be searched by specifying geographic regions of interest. To achieve this
system, a previous study [3] investigated the representation of a viewable scene of
a video frame as a circular sector (i.e., a pie slice shape) using sensor inputs such
as the camera location from a GPS device and the camera direction from a digital
compass. Figure 1.1 shows the corresponding 2D and 3D field-of-view models. In
2D space, the field-of-view of the camera at time t, (FOVScene(P, d, θ, R, t)) forms
a pie-slice-shaped area as illustrated in Figure 1.1(a). Figure 1.1(b) shows an example camera FOVScene volume in 3D space. For a 3D FOVScene representation we
would need the altitude of the camera location point and the pitch and roll values
to describe the camera heading on the zx and zy planes (i.e., whether camera is di2
rected upwards or downwards). Based on the proposed model, we constructed three
video acquisition prototypes (shown in Figures 1.2, 1.3, and 1.4) to capture the relevant meta-data, implemented a database with a real-world video data set captured
using our prototype capture systems, and developed a web-based search system to
demonstrate the feasibility and applicability of our concept of geo-referenced video
search.
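To make the 2D FOVScene model more concrete, the following sketch, written in Python with hypothetical function and parameter names (it is not the GRVS implementation), tests whether a query point falls inside the pie-slice-shaped area defined by FOVScene(P, d, θ, R, t). It treats longitude and latitude as locally planar coordinates, which is only a rough approximation over short distances.

```python
import math

def point_in_fov(cam_lon, cam_lat, heading_deg, theta_deg, radius_m,
                 pt_lon, pt_lat):
    """Approximate 2D FOVScene test: is (pt_lon, pt_lat) inside the pie
    slice centred at the camera P, facing heading d, with viewable angle
    theta and visible distance R?  Uses a simple equirectangular
    approximation, adequate over a few hundred metres."""
    # Convert the lon/lat offset into metres (rough local approximation).
    metres_per_deg = 111_320.0
    dx = (pt_lon - cam_lon) * metres_per_deg * math.cos(math.radians(cam_lat))
    dy = (pt_lat - cam_lat) * metres_per_deg

    dist = math.hypot(dx, dy)
    if dist > radius_m:
        return False                     # outside the visible distance R

    # Bearing from camera to the point, measured clockwise from north.
    bearing = math.degrees(math.atan2(dx, dy)) % 360.0
    # Smallest angular difference to the camera heading d.
    diff = abs((bearing - heading_deg + 180.0) % 360.0 - 180.0)
    return diff <= theta_deg / 2.0       # inside the viewable angle theta

# Example: camera at (103.8521, 1.2900) facing north-east (45 degrees),
# with a 60-degree viewable angle and 200 m visible distance.
print(point_in_fov(103.8521, 1.2900, 45.0, 60.0, 200.0, 103.8530, 1.2908))
```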
Figure 1.2: Early setup for geo-tagged video data collection: laptop computer,
OceanServer OS5000-US compass, Canon VIXIA HV30 camera, and Pharos iGPS-500 receiver.
We will first discuss the acquisition of geo-referenced videos. Figure 1.2 illustrates the capture setup, with computer, camera, GPS, and compass as separate components.
This early prototype is quite inconvenient to operate; in other words,
we need to carry significant equipment to record videos. In contrast, Figures 1.3
and 1.4 show the acquisition applications implemented on mobile phones. We have
implemented the software for iPhone and Android. It is obvious that using mobile
phones will be more feasible than using the equipment shown in Figure 1.2. Therefore, in our recent work we have been using these phones for video acquisition. With
mobile phone applications we can more easily expand our data set and with a larger
data set our experimental results will be more convincing.
To test the feasibility of this idea we have collected a number of videos that
were augmented with compass and GPS sensor information using the above data
acquisition prototype. We then have used Google Maps and Google Earth as a
backdrop to overlay the acquired video clips in the correct locations. According to
this method, our video results can be presented in an intuitive way, and the users
can watch the videos within the mirror world.
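As an illustration of how such an overlay viewpoint could be handed to Google Earth, the sketch below (Python; a hypothetical helper with made-up values, not the code used in our system) builds a minimal KML placemark with a LookAt element for a given position and heading. The element names follow the public KML 2.2 schema; the meta-data files actually exchanged by our system (see Figure 3.7) contain additional fields such as waiting time and trajectory.

```python
def viewpoint_to_kml(lon, lat, heading_deg, tilt_deg=70.0, range_m=150.0):
    """Return a minimal KML document whose LookAt places the Google Earth
    camera so that it observes the point (lon, lat) from the given heading."""
    return f"""<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Computed viewpoint</name>
    <LookAt>
      <longitude>{lon}</longitude>
      <latitude>{lat}</latitude>
      <altitude>0</altitude>
      <heading>{heading_deg}</heading>
      <tilt>{tilt_deg}</tilt>
      <range>{range_m}</range>
      <altitudeMode>relativeToGround</altitudeMode>
    </LookAt>
    <Point>
      <coordinates>{lon},{lat},0</coordinates>
    </Point>
  </Placemark>
</kml>"""

# Example: a viewpoint near the Marina Bay area, facing north-east.
print(viewpoint_to_kml(103.8554, 1.2838, 45.0))
```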
Another issue we have considered and implemented in this research is the presentation of multiple videos. As search results contain more and more video clips,
our objective is to show multiple videos with a good viewpoint. Then the users can
watch several videos at the same time and find the most relevant one more quickly
Figure 1.3: Integrated iPhone application for geo-tagged video acquisition.
than otherwise. To achieve this goal, we are proposing an algorithm to compute a
common viewpoint. However, there is a tradeoff between obtaining a good viewpoint
and providing a smooth trajectory. The trajectory is a path consisting of a sequence of
viewpoints, i.e., camera positions, that describes where the observer is located over time. To
balance this tradeoff, we provide a number of rules to solve the problem and
also to improve efficiency.
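The algorithm itself is described in Section 3.5.4; purely as intuition for the kind of geometric computation involved, the following sketch (Python; a naive placeholder, not the algorithm proposed in this thesis) averages the camera positions and headings and steps the observer back against the mean viewing direction. It ignores the special cases, such as opposite headings or widely separated cameras, that our rules address.

```python
import math

def naive_common_viewpoint(cameras, back_off_m=80.0):
    """cameras: list of (lon, lat, heading_deg) tuples, one per video.
    Returns (lon, lat, heading_deg) for a single observer that faces the
    mean heading and stands behind the centroid of the camera positions."""
    n = len(cameras)
    c_lon = sum(c[0] for c in cameras) / n
    c_lat = sum(c[1] for c in cameras) / n

    # Average the headings as unit vectors to avoid the 359/1 degree wrap.
    sx = sum(math.sin(math.radians(c[2])) for c in cameras)
    sy = sum(math.cos(math.radians(c[2])) for c in cameras)
    mean_heading = math.degrees(math.atan2(sx, sy)) % 360.0

    # Step back from the centroid, opposite to the mean viewing direction.
    metres_per_deg = 111_320.0
    d_lon = back_off_m * math.sin(math.radians(mean_heading)) / (
        metres_per_deg * math.cos(math.radians(c_lat)))
    d_lat = back_off_m * math.cos(math.radians(mean_heading)) / metres_per_deg
    return c_lon - d_lon, c_lat - d_lat, mean_heading

# Two cameras filming roughly north-east from nearby street corners.
print(naive_common_viewpoint([(103.8520, 1.2900, 40.0),
                              (103.8526, 1.2896, 55.0)]))
```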
1.2
Research Problem Statement
Our research goal is to provide users with an enhanced presentation of multiple
geo-referenced videos in a specific region of interest. The term enhanced presentation
refers to the display of multiple videos such that each video is rendered on a virtual
canvas positioned in a 3D environment to provide an aligned overlay-view with the
objects in the background (e.g., buildings). Our conjecture is that such an integrated
rendering of videos provides increased situational awareness to users and would be
beneficial for a number of applications. Based on this objective we state several
research problems that we investigated in our work.
First, we need to determine which environment works best to present the videos.
The reason is that using a suitable environment can help users understand the videos
more comprehensively. With Google Earth, the 3D virtual models correspond to
Figure 1.4: Android application for geo-tagged video data acquisition.
objects in the real world (termed a mirror world), however, with Second Life the
environment is imaginary. Therefore, Google Earth serves as a good choice for our
research. Figure 1.5 shows a screen shot of Google Earth, and we can see the virtual
3D buildings in this environment.
Second, it is very important to provide precise visual alignment of the video frames
with the 3D virtual world. In such a virtual world, there are many virtual objects
corresponding to the real world. By comparing the frames in the videos with such
objects we can check the accuracy of our system and our initial data. If a video
frame can totally match the objects in the virtual environment, we may say that
our system is very precise. However, because of inaccuracies in the initial data from
GPS and compass equipment, the matching process is sometimes challenging and the
video frames do not match the objects. In such a situation, even if the camera location is
in the right place (i.e., it matches the street, road, etc.), our system
is not accurate. We need to check the frames; if these frames are within an acceptable range (we
will define this later), then we can accept such a result, otherwise, the system may
not be good for video presentation.
Third, we need to think about how to reasonably present multiple geo-referenced
videos. With an appropriate environment, how to place the videos and how to
Figure 1.5: Example Google Earth 3D environment of the Marina Bay area in
Singapore.
show them are important issues that we need to carefully consider especially for
multiple videos. Our conjecture is that showing videos in a 3D environment with
a 3D perspective will be better than simply using a flat, non-warped 2D view. We
can see the difference between a 2D perspective and a 3D perspective of videos in
Figure 1.6. In addition, with the presentation of multiple videos we need to design
an algorithm to compute the best viewpoint from which a user can view multiple
videos in a suitable way.
Fourth, most of the time the search results contain more than one video. Accordingly, we need to consider how to rank them, and how many videos should be
presented at the same time. In addition, as part of these considerations, we also
need to consider the network bandwidth. With the presentation of multiple videos
in a 3D environment, a possible network bottleneck is a big challenge.
Lastly, given different environments we need to utilize different methods. With
2D environments we can easily present the videos with a flat, non-warped 2D view.
Using a video player such as Flowplayer, Adobe Flash Player, etc. we can achieve
this. However, with a 3D environment the situation is more complicated. Given
videos with a 3D perspective, a normal video player cannot handle these issues. We
use the HTML 5 video tag to play videos with a 3D perspective. In addition, the
query window should have a 3D shape which means we can query in terms of 3D
instead of 2D. More specifically, the 3D FOV model we use is shown in Figure 1.1(b).
Figure 1.6: The difference between presenting videos in 2D perspective or 3D perspective.
1.3
Thesis Roadmap
The rest of this thesis is organized as follows. Chapter 2 presents a literature survey related to our research. Implementation of our system is described in Chapter 3.
In this chapter, we present the detailed technologies we have adopted. Furthermore,
in Chapter 4 we describe some experiments and show how our algorithm works. In
addition, challenges and future work are outlined in Chapter 5. Finally, Chapter 6
draws conclusions of this thesis.
Chapter 2
Literature Survey
The existing literature on geo-located videos is quite limited. In this chapter we
review some early work that has focused on 2D geo-referenced video acquisition,
search and presentation. Additionally, we also give a general overview of other
relevant research topics. The subsequent parts of this chapter are organized as follows.
First, definitions of related concepts are given in Section 2.1 to
help explain the content. Second, since our video search engine is based on acquired
sensor information (location coordinates, compass data, etc.), in Section 2.2 we
review some selected papers of image geo-spatial techniques which utilize different
types of sensor information. Third, an effective approach of indexing and retrieving
geo-referenced video is necessary for our system. Hence, a brief survey of video
retrieval techniques is given in Section 2.3. Fourth, we describe the Field-of-View
(FOV) model in Section 2.4. For each video, using the FOV model can provide a more
accurate position when it is shown on a map. Therefore, some works that exploit
direction (orientation) information in their FOV models are examined in this
section. Fifth, how to present video in a 3D environment is another vital problem
in our system. In Section 2.5, several approaches which target 3D presentation
methods are reviewed. We summarize how these previous techniques have inspired
our new algorithm for computing a viewpoint for multiple videos. Finally, we draw
conclusions for the literature review in Section 2.6.
2.1
Definition of Related Concepts
To be able to better describe the forthcoming concepts, we first list several definitions of specialized terms.
Document Space: We are only concerned with the geographic information of the document space, which can be broken into two subspaces: a geographical space and a
thematic space.
Figure 2.1: Information Retrieval versus Data Retrieval spectrum.
• Geographical space: a two-dimensional space representing a geographic coordinate system. Documents can be geometrically represented and applied as
footprints in such a space.
• Thematic space: a multi-dimensional space where documents are concerned
with their theme.
RDF: The Resource Description Framework is a framework for representing Web
resources which can be used in a variety of areas. For instance, providing better
search engines, describing the content of special web pages or digital libraries, and so
on. RDF can denote metadata for inter-communication between applications that
exchange information which machines can understand via the web. In addition,
RDF metadata is represented by a syntax for encoding and transportation. One
choice of syntax is the Extensible Markup Language (XML). Combining RDF and
XML can make metadata more understandable. The objective of RDF is to define
a mechanism for describing data without making assumptions about a particular
application domain [35].
GIR: Geographic Information Retrieval can be treated as a special case of traditional information retrieval. GIR provides access to geo-referenced information
sources and includes all of the core areas of Information Retrieval (IR). In addition, it places emphasis on spatial and geographic indexing and retrieval.
The concepts of “Information Retrieval” and “Data Retrieval (DR)” related to
database management systems (DBMS) are different. A variety of attributes of IR
and DR are shown in Figure 2.1. Firstly, in IR, the model of providing access
to documents is probabilistic as it is concerned with subjective issues. On the
other hand, DR is deterministic with retrieval processes that are certain. In GIR,
applications generally adopt both deterministic and probabilistic retrieval. Secondly,
Figure 2.2: Pictorial diagram of angle of view.
indexing for IR is derived from the contents, while with DR the complete record is the indexing
unit. Again, a hybrid method is applied for GIR. Thirdly, the matching and retrieval
algorithms are based on the retrieval model. In other words, the retrieval algorithms
of IR are probabilistic which may include the actual calculation of probabilities. In
contrast, the DR algorithms are deterministic which require an exact match of query
specification and the contents of a database. Fourthly, the query types of IR and
DR are distinct, meaning that IR searches are expressed in natural language that
may be ambiguous, while DR queries are expressed in a structured query language
which is more precise. Finally, the results for IR are shown in a ranked order while
DR query results are in arbitrary order. As a consequence, Geographic Information Retrieval
(GIR) combines IR with a DBMS and is concerned with the indexing, retrieval, and searching
of geo-referenced information sources [34].
GIS: Geographical Information Systems introduce particular utilities for obtaining, storing, controlling, and showing geo-referenced location data. In a generic
sense, GIS are systems that allow users to create queries to match associated geographical information. The most common method of data creation for modern GIS
is digitization, where a map is transferred into digital form through a computer-aided design (CAD) program [67].
PIRIA: The Program for the Indexing and Research of Images by Affinity is a
content-based search engine. PIRIA is a novel search engine that uses the query-by-example method. When a query is sent to the system, we obtain a list of
ranked images. The ranking method is not only based on keywords, but also on form,
color and texture [21]. This technique is described in one of the manuscripts we have
reviewed, therefore we introduce this terminology as an illustrative example.
FOV: The field of view (abbreviated FOV) is the (angular or linear or areal)
range of the observable world [67]. Different animals have different fields of view
which depend on the location of the eyes. Compared with humans who have almost
180-degree forward-facing vision, some birds have a nearly 360-degree field of view.
The concept is related with the angle of view, and Figure 2.2 shows the detailed
information. A rectilinear lens is in the camera, and S1 is the distance between the
lens and the object. Considering the situation in two dimensions, α is the angle
of view, and F is the focal length obtained by setting the lens to infinite
focus. According to this figure, the “opposite” side of the
right triangle is d/2, and the “adjacent” side is S2 (the distance from the lens to
the image plane). Therefore, we obtain Equation 2.1 from basic trigonometry,
which we can solve for α to yield Equation 2.2. The angle of view is then
given by Equation 2.3, where f = F.
\tan\left(\frac{\alpha}{2}\right) = \frac{d/2}{S_2} \qquad (2.1)

\alpha = 2 \arctan\left(\frac{d}{2 S_2}\right) \qquad (2.2)

\alpha = 2 \arctan\left(\frac{d}{2 f}\right) \qquad (2.3)
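As a small numerical example of Equation 2.3 (in Python; the 36 mm sensor width and 50 mm focal length are assumed values for illustration only):

```python
import math

def angle_of_view(sensor_width_mm, focal_length_mm):
    """Horizontal angle of view from Equation 2.3: alpha = 2 arctan(d / 2f)."""
    return math.degrees(2 * math.atan(sensor_width_mm / (2 * focal_length_mm)))

# Full-frame sensor (d = 36 mm) with a 50 mm lens focused at infinity.
print(round(angle_of_view(36.0, 50.0), 1))  # approx. 39.6 degrees
```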
LIDAR: Light Detection And Ranging is an optical remote sensing system which
can collect geometry samples according to measured properties of scattered light
to find the range and other characteristics of a distant object. A general method
(radar) is to use radio waves to determine distance, but LIDAR adopts laser pulses
to compute the distance. The range is computed through the time delay between
transmitting a pulse and the detection of the return signal [67].
MBB: The minimum bounding box for a point set in N dimensions is the box
with the smallest measure (e.g., area or volume) within which all the points lie. The term
“box” stems from its use in the Cartesian coordinate system, and in the 2D case it
is also called the minimum bounding rectangle.
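As a quick illustration (Python; a hypothetical helper, not taken from any cited system), the axis-aligned minimum bounding rectangle of a 2D point set is simply the extent of its coordinates:

```python
def minimum_bounding_rectangle(points):
    """Axis-aligned minimum bounding rectangle of a 2D point set.
    Returns ((min_x, min_y), (max_x, max_y))."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys)), (max(xs), max(ys))

print(minimum_bounding_rectangle([(1.0, 2.0), (3.5, 0.5), (2.0, 4.0)]))
# ((1.0, 0.5), (3.5, 4.0))
```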
IBR: Image Based Rendering depends on a set of two-dimensional (2D) images
of a scene to produce a three-dimensional (3D) model to render novel views of this
scene with the help of computer graphics and computer vision methods. Typically,
IBR is an automatic method to map multiple 2D images to novel 2D images.
FVV: Free View Video allows users to control the viewpoint and generate new
views of a dynamic scene from any 3D position.
VBR: Video Based Rendering is an extension of image-based rendering that can
handle dynamic scenes [47]. Furthermore, according to Shields [6], VBR can refer
to the generation of individual frames by computer. This can be used to produce a
fluid video and is especially useful for certain types of applications. For instance, if
a special filter is applied to a video using a software program, the video will
be rendered by the computer and each frame will be produced and assembled
into the video output.
MVC: Multi-view Video Coding is an amendment to the H.264/MPEG-4 AVC
video compression standard to enable the efficient encoding of video sequences from
multiple cameras using a single video stream. In addition, multi-view video contains a large number of inter-view statistical dependencies, therefore the integration
of temporal and inter-view prediction is key for MVC [67].
2.2
Geo-Spatial Techniques for Images
There exist several research areas concerned with geo-spatial images and their associated location, time, and orientation information. Some research focuses on
sharing and browsing of geo-referenced photos, some emphasize hierarchies and clustering of images, and others concentrate on how to present images to users. In the
following sections we will perform a detailed literature review.
2.2.1
Image Browsing
We will review several papers related to browsing images according to
location and other relevant information. For example, many tourists record
photos of family while traveling, archaeologists take photos of historical relics, and
botanists shoot images of plant species. In these situations, the geographic location
information is a critical marker when browsing these images. In addition, there are
many ways to present location information, such as place names (“San Francisco”),
street addresses, zip codes, latitude/longitude coordinates, and so forth. Most of
the GIS projects use latitude and longitude coordinates, for example as defined by
the WGS84 standard. This is a very concise and accurate way to designate point
locations and also a format that can be recognized by certain systems [66]. We will
now describe some related projects.
First, Google allows the embedding of photos in Google Maps. These photos and videos are placed according to their geo-locations, which have to be uploaded
manually. GPicSync [53] is a Google project that aims to automatically insert locations into users’ photos. Thus, such photos can also be used with any ’geocode aware’
application like Google Earth, Flickr, etc. On the other hand, Microsoft has
introduced the World Wide Media eXchange (WWMX) to browse images on their
web site. Toyama et al. [66] from Microsoft have presented a system that uses geographic location tags based on WWMX. The WWMX database contains metadata
of timestamps and location information, which makes it relatively easy to browse the
photos. In addition, acquiring location tags on photos, establishing data structures
Figure 2.3: Architecture of ThemExplorer.
for images, and implementing UIs for location-tagged photo browsing are the other
main contributions of this paper.
Second, another research direction is based on geographical information retrieval.
GeoVIBE is a browsing tool which builds on geographical information retrieval (GIR)
and textual information retrieval (IR) systems [7]. In addition, this system includes two types
of browsing strategies: GeoView and VibeView. GeoView enforces a geographical
order on the document space with the idea of hypermaps. On the other hand,
VibeView presents a similar document space with multiple reference points.
GeoVIBE integrates the two, and users can search for information with either geographic
clues or conceptual clues. Similarly, Popescu et al. [1] presented a suitable and
more powerful map-based tool named ThemExplorer which combines a geographical
database and a content-based facility. Nowadays, there exist a number of map-based
interfaces such as Google Maps, Google Earth, Yahoo Maps, and so on. The authors
also evaluated the accuracy of ThemExplorer for browsing geo-referenced images
through different dimensions.
Figure 2.3 shows the architecture of ThemExplorer which includes TagMaps,
Content-Based Image Retrieval (CBIR), and an image collector. With ThemExplorer, users can ask for geographic names within a certain region, then the system
can retrieve the images by querying the database. To search images with CBIR, the
system employs PIRIA which is a content-based search engine. This system provides
a layering of images according to the geonames in the database. However, there are
no usability studies to support the validation of the system.
With a similar idea as ThemExplorer, TagNSearch proposed by Nguyen et al. [44]