1.4 Data Quality and Data Types
Understanding data and data quality is a basic requirement for working as a data scientist. There are many different ways of thinking about and classifying data (see Figure 1.6). In the following sections we briefly define and give examples of each of the data types in Figure 1.6. It is important to note that these categories are not mutually exclusive; data can belong to more than one of them because the categorizations reflect different criteria. A single dataset may contain data elements (variables) of different types. For example, a survey may include fixed-category responses, continuous measures, and open-ended questions that produce qualitative data. Understanding these categories is important because different types of data require different analytic techniques and visualizations. As a data scientist, you may need to do some work to transform data from one type to another (e.g. unstructured to structured, or qualitative to discrete).
Figure 1.6 Data types.
Discrete vs. Continuous Data
Data type | Characteristics | Examples |
Discrete | Numeric data; can be counted; takes on only specific, separate values | Survey question, "On a scale of 1 to 5, rate your PAD 504 experience" (you can take the average response to assess student sentiment); total number of hip replacements performed by each hospital in New York in 2014 |
Continuous | Numeric data; can take on any value over a range; value has infinite possibilities; is measured | Height, weight; age; temperature; time to complete a 504 problem set; commute distance from home to Husted 004; money |
Note: Sometimes you will see body mass index (BMI) turned into discrete categories (underweight, normal weight, overweight, …). The underlying data (height, weight) are still continuous; this is "binning" (grouping values into bins).
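For readers who want to see binning concretely, here is a minimal sketch in Python (assuming the pandas library; the respondents and column names are hypothetical). It computes BMI from continuous height and weight and then bins it into the standard BMI categories mentioned in the note above.

```python
import pandas as pd

# Continuous measurements for a few hypothetical respondents
df = pd.DataFrame({
    "height_m": [1.60, 1.75, 1.82, 1.68],
    "weight_kg": [50, 70, 95, 80],
})

# BMI itself is still a continuous variable
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# "Binning": turn the continuous BMI values into discrete categories
df["bmi_category"] = pd.cut(
    df["bmi"],
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=["underweight", "normal weight", "overweight", "obese"],
)
print(df)
```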
Qualitative Data
Data type | Characteristics | Examples |
Qualitative | Categories or longer texts – formats can include categories, text, audio, and visual observations; typically collected to understand a phenomenon; some numbers that refer to categories are also qualitative | Newspaper articles; Congressional hearing transcripts; notes from observing a town hall meeting; dialogue from open-ended interviews; gender, country of origin, ZIP codes |
Note: Categories can be transformed into quantitative variables by counting frequencies or calculating proportions. Some surveys contain open-ended questions that are later coded into categories. There are also qualitative analysis techniques that quantify textual information by counting related words/terms. Yet most qualitative data are not analyzed this way – statistical methods are not appropriate for many qualitative data.
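As a small illustration of the note above, the following sketch (assuming pandas; the survey items and column names are hypothetical) counts frequencies and proportions for a categorical variable and does a crude keyword count on open-ended text.

```python
import pandas as pd

# Hypothetical survey data with one categorical item and one open-ended item
df = pd.DataFrame({
    "country_of_origin": ["US", "Mexico", "US", "Canada", "US"],
    "open_comment": [
        "The course was great",
        "Too much homework",
        "Great instructor, great material",
        "Homework load was heavy",
        "Enjoyed the guest speakers",
    ],
})

# Turn a categorical variable into counts and proportions
print(df["country_of_origin"].value_counts())
print(df["country_of_origin"].value_counts(normalize=True))

# A crude quantification of text: flag and count responses mentioning a term
df["mentions_homework"] = df["open_comment"].str.contains("homework", case=False)
print(df["mentions_homework"].sum())
```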
Cross-Sectional vs. Longitudinal Data
Data type | Characteristics | Examples |
Cross-sectional | One point in time; data were collected within a short time frame; a "snapshot" | Midterm exams; course evaluations; opinion polls; online shopping product ratings |
Longitudinal | Multiple time periods; observations of changes over time | U.S. Census (repeated survey); bus ridership (daily rate of passengers); HIV disease registry; UAlbany student records |
Note: A single political opinion poll is cross-sectional. However, if you have multiple polls done at different time periods with a similar questionnaire, you can turn this into longitudinal data.
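Here is a minimal sketch of how several cross-sectional polls can be stacked into a longitudinal dataset, assuming pandas; the poll results and the supports_policy item are hypothetical.

```python
import pandas as pd

# Hypothetical results from three cross-sectional polls using a similar questionnaire
poll_jan = pd.DataFrame({"supports_policy": [1, 0, 1, 1]})
poll_apr = pd.DataFrame({"supports_policy": [1, 0, 0, 1]})
poll_jul = pd.DataFrame({"supports_policy": [0, 0, 1, 0]})

# Tag each snapshot with its time period, then stack them into one dataset
frames = []
for period, poll in [("2024-01", poll_jan), ("2024-04", poll_apr), ("2024-07", poll_jul)]:
    poll = poll.copy()
    poll["period"] = period
    frames.append(poll)

longitudinal = pd.concat(frames, ignore_index=True)

# Track change in support over time
print(longitudinal.groupby("period")["supports_policy"].mean())
```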
Structured vs. Unstructured Data
Data type | Characteristics | Examples |
Structured | Rows and columns; links and relations | Excel Welcome Week data (MPA admissions, Clinton impeachment); survey responses with fixed categories (e.g. "your satisfaction with the Welcome Week experience on a scale from 1 to 5") or continuous responses (e.g. "your age"); most databases |
Unstructured | Raw, unorganized; with time and sophisticated computing, can be turned into structured data | Medical claims (if they contain X-ray images and text-based doctor notes); emails; PDF files; open-ended survey responses; social media; video |
Note: Qualitative long texts are also unstructured data.
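To make the structured/unstructured distinction concrete, the sketch below (pandas assumed; responses hypothetical) imposes a simple row-and-column structure on open-ended survey responses by deriving a few variables from the raw text.

```python
import pandas as pd

# Unstructured, open-ended survey responses (hypothetical)
responses = [
    "Welcome Week was well organized, but parking was a nightmare.",
    "Loved the orientation sessions; parking info was unclear.",
    "Everything went smoothly for me.",
]

# Impose structure: one row per response, with a few derived columns
structured = pd.DataFrame({"response": responses})
structured["n_words"] = structured["response"].str.split().str.len()
structured["mentions_parking"] = structured["response"].str.contains("parking", case=False)

print(structured)
```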
Big Data
Data type | Characteristics | Examples |
Big data | Velocity, volume, variety; typically requires complex computer algorithms to analyze (e.g. automated text coding and machine learning); cannot be crunched on a standard computer – needs big processing capability | Social network data (e.g. blogs, tweets, YouTube videos, text messages, Instagram pictures); health information (e.g. medical records, billing information such as diagnostic codes, X-rays); industry-wide data (e.g. commercial transactions, banking/stock records, services delivered by a social service agency); Internet of Things (e.g. remote sensors such as traffic cameras, surveillance videos) |
Note: The term “big data” is frequently used incorrectly… a very large survey with millions of observations (e.g. U.S. Census) is large, but not “big.” Big data can be structured (e.g. business transactions), unstructured (e.g. Tweets and Instagram photos), or have elements of both (e.g. medical records containing quantitative information, doctor notes, and images).
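When a dataset is too large to load into memory on a standard computer, one common workaround is to process it in pieces (or distribute it across many machines). The sketch below uses pandas' chunked CSV reading; the file transactions.csv and its amount column are hypothetical.

```python
import pandas as pd

# Process a very large file in chunks instead of loading it all at once.
# "transactions.csv" and its "amount" column are hypothetical.
total, n_rows = 0.0, 0
for chunk in pd.read_csv("transactions.csv", chunksize=1_000_000):
    total += chunk["amount"].sum()
    n_rows += len(chunk)

print("mean transaction amount:", total / n_rows)
```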
About Data Quality
Databases and other repositories are not error free. In fact, when surveyed about data quality, organizations frequently acknowledge having many data quality problems. Moreover, data scientists and analysts commonly report spending 80% of their time cleaning and preparing data, which in many cases reflects poor quality of the data in the database, or poor data definitions or descriptions in the metadata. Metadata are "data about the data," usually included in codebooks, lists of data elements or data dictionaries, explanations of how the data were collected, descriptions of the sample, explanations of how the original records were transformed into the current dataset, and so on. These metadata are very important for understanding the content of a dataset and for merging it with other data for analysis.
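As an illustration of what machine-readable metadata can look like, here is a hypothetical codebook entry for a single survey variable, written as a Python dictionary; the field names are illustrative rather than a formal metadata standard.

```python
# An illustrative (not standardized) data dictionary entry for one survey variable
codebook = {
    "supports_policy": {
        "description": "Do you support the proposed policy?",
        "type": "discrete (binary)",
        "values": {1: "yes", 0: "no", 9: "refused / missing"},
        "collection": "Telephone survey, random digit dialing, Jan 2024",
        "notes": "Question wording changed slightly from the previous wave.",
    }
}

for variable, meta in codebook.items():
    print(variable, "-", meta["description"])
```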
Although data quality was traditionally understood as accuracy, data researchers have identified several characteristics that define it, including accuracy but also other factors such as completeness and accessibility. One useful way to understand data quality is to group these characteristics into major dimensions. Two important dimensions recognized in the literature are intrinsic and contextual data quality. The following two tables list data quality characteristics in these two major dimensions. Because metadata are so important for understanding and working with datasets, metadata quality has also become an important topic of discussion, and its characteristics are summarized in a third table below.
Intrinsic Data Quality
Dimension | Definition | Application to Political Opinion Polls |
Accuracy | Data are correct and certified free of error | The staff who call people to ask questions record responses correctly, and there are no mistakes when transposing into the database (e.g. if "yes" and "no" are represented as 1 and 0, then all "yes" responses are entered as 1's, not "." or another value) |
Believability/ reputation | Data are regarded as true, real, and credible, and come from a trusted source | Reuters and university-based polling services versus Fox News poll |
Confidentiality | Data are confidential and protect privacy | Individual respondents cannot be identified from the summary report or individual dataset (if made available to others) |
Objectivity | Data are unbiased, unprejudiced, and impartial | "To what extent do you support Trump's firm stance on immigration, to protect our borders from unwanted criminals like rapists and murderers?" "To what extent do you support Clinton's plan to reduce college education costs for hard-working middle class families to ensure upward mobility?" (both are leading questions that undermine objectivity) |
Reliability | Data would have similar values, if collected multiple times or in different ways | Public support for the Affordable Care Act (“Obamacare”) varied depending on how the questions were framed |
Validity | Data are a good representation of the underlying construct | Poll questions about Obamacare are really asking about their support for the legislation, not about whether they like Obama or their personal health – the questions solicit information about the concept you are trying to measure |
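A simple way to screen for some intrinsic quality problems is to check that a variable contains only its allowed codes, in the spirit of the accuracy example in the table above. The sketch below assumes pandas; the response codes are hypothetical.

```python
import pandas as pd

# Hypothetical poll responses where "yes"/"no" should be coded 1/0
responses = pd.Series([1, 0, 1, ".", 1, 0, 2], name="supports_candidate")

# Flag any records whose codes fall outside the allowed set
allowed = {0, 1}
invalid = responses[~responses.isin(allowed)]

print(f"{len(invalid)} of {len(responses)} records have invalid codes:")
print(invalid)
```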
Contextual Data Quality
Dimension | Definition | Application to Political Opinion Polls |
Appropriate amount of info | Quantity or volume of available data is appropriate | For a nationally representative sample, there should be enough respondents to have a low margin of error |
Completeness | No missing data; of sufficient breadth and scope for the data user's purpose; all relevant individuals included | Respondents answer all questions; a survey about the complex Affordable Care Act asks questions about its different components (expanded insurance, rules on what services are covered, tax credits, …), not just a simple "do you support this?"; a survey about the likelihood of voting for Clinton vs. Trump includes all likely voters (different ages, Republicans vs. Democrats, different geographic regions, …) |
Concise representation | Data are compactly represented without being overwhelming | Most of the lay public will want to see a compact summary of the poll results, while researchers will want access to the raw polling data |
Ease of manipulation | Data are easy to manipulate and apply to different tasks | NYT Stop, Question, and Frisk interactive website is easy to navigate, but many big survey datasets require days to understand how to use and repackage into your preferred statistical software program |
Ease of understanding | Data have little ambiguity, and are clear to users | Very easy to understand the fivethirtyeight.com polling data without reading all of the documentation |
Relevancy | Data are applicable and helpful for user and task at hand | Fivethirtyeight.com presidential polling data displays information by popular vote, electoral college, etc. |
Timeliness | The age of the data is appropriate for the user and task at hand | Fivethirtyeight.com polling data are refreshed regularly, which is great for the lay public, while researchers may also want historical data to analyze long-term trends in voter preferences |
Value-added | Data are beneficial and provide advantages from their use | There are so many existing polls that creating a new one will likely not improve the forecasts |
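Two of the contextual dimensions above lend themselves to quick checks: completeness (the share of missing answers) and appropriate amount of information (whether the sample is large enough for a low margin of error, using the standard 95% formula 1.96 * sqrt(p(1-p)/n) for a proportion). The sketch below assumes pandas and NumPy; the poll data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical poll with some missing answers
df = pd.DataFrame({"supports_candidate": [1, 0, 1, np.nan, 1, 0, np.nan, 1]})

# Completeness: share of missing responses per question
print(df.isna().mean())

# Appropriate amount of information: 95% margin of error for a proportion
n = df["supports_candidate"].notna().sum()
p = df["supports_candidate"].mean()  # pandas skips missing values by default
margin_of_error = 1.96 * np.sqrt(p * (1 - p) / n)
print(f"n = {n}, estimated support = {p:.2f}, margin of error = ±{margin_of_error:.2f}")
```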
Metadata Quality
Dimension | Definition | Application to Political Opinion Polls |
Completeness | Users can understand what the data are, how they can (not) be used | Data collection instruments available, sampling methods explained, codebook available, data limitations described (e.g. insufficient sample to create state-level estimates), etc. |
Accuracy | Metadata are correct and certified free of error | The correct codebooks are included (esp. important for questions or sampling methods that change slightly over time) |
Conformance to expectations | Metadata contain standard elements that users would expect to find | Users would expect to learn standard things about the data such as the contributor, coverage, creator, etc.; "expectations" are nebulous, but there are frameworks like the Dublin Core |
Consistency | Metadata homogeneously represented; for example, using controlled vocabularies | Terms like "nationally representative sample" or "data owner" have the same meaning across datasets |
Interpretability | Clear and without ambiguity | The metadata are presented in accessible language; it does not take a PhD in statistics to understand the data descriptions |
Provenance | Describes the data lifecycle – complete information about how data were created and transformed | It is clear how people were selected to be contacted (e.g. random digit dialing of landlines, who is excluded), what questions were asked (and how), how individual records (telephone calls to respondents) were compiled into an electronic database, how missing values were imputed, etc. |
Timeliness | As data or our understanding of the data change over time, the metadata are also updated | Survey questions change over time, even for repeated polls by the same polling firm. Metadata should be updated to reflect these changes. Non-political example: New York just made major changes to its HIV surveillance dataset (removing "lost cases" and people with out-of-state addresses), which has changed all of the data series (all estimates of people living with HIV are lower than they were before); this needs to be explained in the metadata. |
Attribution
By Luis F. Luna-Reyes and Erika Martin, and licensed under CC BY-NC-SA 4.0.