Consider the following comma-delimited file:
city, sun, temp, precip
Los Angeles, 300, 70, 10
London, 50, 55, 40
Singapore, 330, 80, 60
Looking at the contents of the file, we can see that it contains data about the cities of Los Angeles, London, and Singapore. A comma separates each field or attribute, and the file also contains a header row that describes the data contained in each column. Or does it? What does the column “sun” refer to? Is it the number of sunny days this year, last year, annually, or when? What about “temp”? Does this refer to the average daytime, evening, or annual temperature? For that matter, how is temperature measured? In Celsius? Fahrenheit? Kelvin? The column “precip” probably refers to precipitation, but again, what are the units or time frame for such measures and data? Finally, where did these data come from? Who collected them, when were they collected, and for what purpose?
It is remarkable that such a small text file can lead to so many questions. Now let us extend the example to a file with one hundred records on ten variables, one thousand records on one hundred variables or, better yet, ten thousand records on one thousand variables. Through this rather simple example, many general but central issues related to data emerge. Such issues range from the relatively mundane naming conventions used to identify individual records (i.e., rows) and distinguish one field (i.e., column) from another, to the issue of providing documentation about what data are included in a given file, when the data were collected, for what purpose the data are to be used, who collected them, and, of course, where the data came from.
The previous simple text file illustrates how we cannot and should not take data and information for granted. It also highlights two essential concepts concerning the source of data and the contents of data files. Concerning data sources, data can be put into one of two distinct categories. The first category is called primary data. Primary data refer to data that are collected directly or on a firsthand basis. For example, if you wanted to examine the variability of local temperatures in May, and you recorded the temperature at noon every day in May, you would be constructing a primary data set. Conversely, secondary data refer to data collected by someone else or some other party. For instance, when we work with Census or economic data collected and distributed by the government, we are using secondary data.
Several factors influence the decision behind the construction and use of primary data sets versus secondary data sets. Among the most critical factors are the costs associated with data acquisition in terms of money, availability, and time. The data acquisition and integration phase of most geographic information system (GIS) projects is often the most time-consuming. In other words, locating, obtaining, and putting together the data to be used for a GIS project, whether you collect the data yourself or use secondary data, may indeed take up most of your time. Of course, depending on the purpose, availability, and need, it may not be necessary to construct an entirely new data set (i.e., primary data set). In light of the vast amounts of data and information that are publicly available, for example, via the Internet, the cost and time savings of using secondary data often offset any benefits that are associated with primary data collection.
With a foundational understanding of primary and secondary data, as well as the rationale behind each, we can go about finding the data and information that we need. There is an incredibly vast and growing amount of data and information available to us, and performing an online search for “deforestation data” will return hundreds, if not thousands, of results. To overcome this data and information overload, we need to turn to even more data. In particular, we are looking for a special kind of data called metadata. Simply defined, metadata are data about data. At one level, a header row in a simple text file like the one discussed earlier is analogous to metadata. The header row provides data (e.g., names and labels) about the subsequent rows of data.
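As a minimal sketch of the header row acting as a thin layer of metadata, the following Python snippet parses the sample file from the start of this section using the standard `csv` module. The function name `read_records` and the choice to hold the file in a string are assumptions made only to keep the example self-contained.

```python
import csv
import io

# The sample file from the start of this section, held in a string so the
# sketch runs without any external file.
SAMPLE = """city, sun, temp, precip
Los Angeles, 300, 70, 10
London, 50, 55, 40
Singapore, 330, 80, 60
"""

def read_records(text):
    """Parse comma-delimited text into (field names, list of row dicts)."""
    # skipinitialspace drops the blank that follows each comma in the sample.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    field_names = reader.fieldnames  # taken from the header row
    return field_names, list(reader)

fields, records = read_records(SAMPLE)
print(fields)      # the labels supplied by the header row
print(records[1])  # the London record, keyed by those labels
```

Note that every value comes back as a plain string. The header supplies labels, but nothing in the file says what the labels mean, what units apply, or even that the columns other than the city name are numeric.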
Header rows themselves, however, may need additional explanation, as previously illustrated. Furthermore, when working with or searching through several data sets, it can be quite tedious at best or impossible at worst to open each file in order to determine its contents and usability. Enter metadata. Today many files, and in particular secondary data sets, come with a metadata file. These metadata files contain general descriptions of the contents of the file, definitions for the various terms used to identify records (rows) and fields (columns), the range of values for fields, the quality or reliability of the data and measurements, how the data were collected, when the data were collected, and who collected the data. Though not all data are accompanied by metadata, it is easy to see and understand why metadata are essential and valuable when searching for secondary data, as well as when constructing primary data that may be shared in the future.
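To make the idea of a companion metadata file concrete, here is a small sketch of what such a record might look like for the sample city file, together with a check for fields that lack documentation. Everything inside `METADATA` is hypothetical: the original file states no definitions, units, or collector, so the values below are invented purely for illustration, as is the helper name `undocumented_fields`.

```python
# Hypothetical metadata for the sample city file. Every definition, unit,
# and range below is an assumption made for illustration; the original
# file does not state any of them.
METADATA = {
    "description": "Illustrative climate summary for three cities",
    "collected_by": "unknown",
    "collected_on": "unknown",
    "fields": {
        "city": {"definition": "Name of the city", "units": None},
        "sun": {
            "definition": "Sunny days per year (assumed)",
            "units": "days",
            "range": (0, 366),
        },
        "temp": {
            "definition": "Mean annual temperature (assumed)",
            "units": "degrees Fahrenheit (assumed)",
        },
    },
}

def undocumented_fields(header, metadata):
    """Return header names that have no entry in the metadata's field list."""
    documented = metadata.get("fields", {})
    return [name for name in header if name not in documented]

header = ["city", "sun", "temp", "precip"]
print(undocumented_fields(header, METADATA))  # ['precip']
```

A check like this does not tell us whether the documentation is correct, only whether it exists. It is nonetheless a quick way to screen a data set before deciding whether it is usable.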
Just as simple files come in all shapes, sizes, and formats, so too do metadata. As the amount and availability of data and information increase every day, metadata play a critical role in making sense of it all. The class of metadata that we are most concerned with when working with a GIS is called geospatial metadata. As the name suggests, geospatial metadata are data about geographical and spatial data. According to the Federal Geographic Data Committee (FGDC) in the United States, “Geospatial metadata are used to document digital geographic resources such as GIS files, geospatial databases, and earth imagery. A geospatial metadata record includes core library catalog elements such as Title, Abstract, and Publication Data; geographic elements such as Geographic Extent and Projection Information; and database elements such as Attribute Label Definitions and Attribute Domain Values.” The definition of geospatial metadata is about improving transparency when it comes to data, as well as promoting standards.
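The elements named in the FGDC description can be pictured as fields of a single record. The sketch below groups them in a Python dataclass. It is not the structure of the FGDC standard itself, only an illustration built from the element names in the quotation above; the class name, field types, the order assumed for the extent tuple, and all example values are placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class GeospatialMetadataRecord:
    """Illustrative grouping of the elements named in the FGDC description."""
    # Core library catalog elements
    title: str
    abstract: str
    publication_data: str
    # Geographic elements
    geographic_extent: tuple  # assumed order: (west, south, east, north)
    projection_information: str
    # Database elements
    attribute_label_definitions: dict = field(default_factory=dict)
    attribute_domain_values: dict = field(default_factory=dict)

    def missing_elements(self):
        """Return the names of elements that were left empty."""
        return [name for name, value in vars(self).items() if not value]

# Placeholder values only; none of these describe a real data set.
record = GeospatialMetadataRecord(
    title="Example city climate table",
    abstract="Placeholder abstract",
    publication_data="",
    geographic_extent=(-180.0, -90.0, 180.0, 90.0),
    projection_information="",
    attribute_label_definitions={"city": "Name of the city"},
)
print(record.missing_elements())
# ['publication_data', 'projection_information', 'attribute_domain_values']
```

Listing the empty elements mirrors the transparency goal described above: a record with gaps is still usable, but the gaps are visible to anyone deciding whether to rely on the data.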
Generally, standards refer to widely promoted, accepted, and followed rules and practices. Given the range and variability of data and data sources, identifying a common thread to locate and understand the contents of any given file can be a challenge. Just as the rules of grammar and mathematics provide the foundations for communication and numeric calculations, respectively, metadata provide similar frameworks for working with and sharing data and information from various sources.
The central point behind metadata is that they facilitate data and information sharing. Within the context of large organizations such as governments, data and information sharing can eliminate redundancies and increase efficiencies. Moreover, access to data and information promotes the integration of different data that can improve analyses, inform decisions, and shape policy. The role that metadata, and in particular geospatial metadata, play in the world of GIS is critical and offers enormous benefits in terms of cost and time savings. It is precisely the sharing, widespread distribution, and integration of various geographic and nongeographic data and information, enabled by metadata, that drive some of the most exciting and compelling innovations in GIS and the broader geospatial information technology community. More critically, widespread access, distribution, and sharing of geographic data and information have social costs and benefits and yield better analyses and more informed decisions.