14.4 Secondary data analysis

Matthew DeCarlo

Learning Objectives

Define secondary data analysis
List the strengths and limitations of secondary data analysis
Name at least two sources of publicly available quantitative data
Name at least two sources of publicly available qualitative data

One advantage of unobtrusive research is that you may be able to skip the data collection phase altogether. To many, skipping the data collection phase is preferable since it allows the researcher to proceed directly to answering their question through data analysis. When researchers analyze data originally gathered by another person or entity, they engage in secondary data analysis. Researchers access data collected by other researchers by connecting with those researchers personally or by accessing their data via publicly available sources.

Imagine you wanted to study whether race or gender influenced what major people chose at your college. You could do your best to distribute a survey to a representative sample of students, but perhaps a better idea would be to ask your college registrar for this information. Your college already collects this information on all of its students. Wouldn’t it be better to simply ask for access to this information, rather than collecting it yourself? Maybe.

Challenges in secondary data analysis

Some of you may be thinking, “I never gave my college permission to share my information with other researchers.” Depending on the policies of your university, this may or may not be true. In any case, secondary data is usually anonymized or does not contain identifying information. In our example, students’ names, student ID numbers, home towns, and other identifying details would not be shared with a secondary researcher. Instead, just the information on the variables (race, gender, and major) would be shared. Anonymization techniques are not foolproof, however, and this is a challenge to secondary data analysis. In the social work classrooms I have taught, admittedly a limited sample, there are usually only two or three men in the room. While privacy may not be a big deal for a study about choice of major, imagine if our example study included final grades, income, or whether your parents attended college. If I were a researcher using secondary data, I could probably figure out which data belonged to which men because there are so few men in the major. This is a problem in real-world research, as well. Anonymized data from credit card companies, Netflix, AOL, and online advertising companies have been “unmasked,” allowing researchers to identify nearly all individuals in a data set (Bode, 2017; de Montjoye, Radaelli, Singh, & Pentland, 2015). ^[1]
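To make the "two or three men in the room" problem concrete, here is a minimal sketch of how a secondary researcher might check a data set for small groups before trusting that it is anonymous. Everything in it is an assumption made for illustration: the records are invented, the `small_cells` helper is hypothetical, and the threshold of five people per group is an arbitrary choice rather than a rule from this chapter.

```python
from collections import Counter

# Invented records with no names or ID numbers: (gender, major, final_grade).
records = [
    ("woman", "social work", "A"),
    ("woman", "social work", "B"),
    ("woman", "social work", "A"),
    ("woman", "social work", "C"),
    ("woman", "social work", "B"),
    ("man", "social work", "B"),
    ("man", "social work", "D"),
    ("woman", "engineering", "A"),
    ("man", "engineering", "B"),
    ("man", "engineering", "A"),
    ("man", "engineering", "C"),
    ("man", "engineering", "B"),
    ("man", "engineering", "A"),
]


def small_cells(rows, k):
    """Return the (gender, major) groups that contain fewer than k people."""
    counts = Counter((gender, major) for gender, major, _grade in rows)
    return {group: n for group, n in counts.items() if n < k}


# k=5 is an illustrative threshold (an assumption, not a standard).
risky = small_cells(records, k=5)
for group, n in sorted(risky.items()):
    print(group, n)
```

Any group the check flags is small enough that someone who knows the program could plausibly match a grade to a person, even though no names appear in the data.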

[Image: person wearing a tin foil mask]

Another challenge with secondary data stems from the lack of control over the data collection process. Perhaps your university made a mistake on their forms or they entered data incorrectly. You certainly would not have made such a mistake if this were your data, but if you did make a mistake, you could correct it right away. Using secondary data, you are less able to correct for errors made by the original source during data collection. More importantly, you may not know these errors exist and reach erroneous conclusions as a result. Researchers using secondary data should evaluate data collection procedures wherever possible, and they should treat data that lacks procedural documentation with caution.

It is also important to attend to how the original researchers dealt with missing or incomplete data. They may have simply substituted the mean score for a missing value or excluded cases with missing data from the analysis entirely. The primary researchers made that choice for a reason, and secondary researchers should understand their decision-making process before proceeding with analysis. Finally, secondary researchers must have access to the codebook for quantitative data and the coding scheme for qualitative data. A quantitative dataset often contains shorthand for question numbers, variables, and attributes. A qualitative data analysis contains a coding scheme explaining definitions and relationships for all codes. Without these, the data would be difficult to comprehend for a secondary researcher.
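The two missing-data choices described above can be compared side by side. The sketch below uses invented scores, with `None` standing in for a missing response (an assumption; real datasets mark missing values in many different ways). It shows one reason the choice matters: substituting the mean keeps every case and leaves the mean unchanged, but it makes the scores look less spread out than the observed responses actually were.

```python
from statistics import mean, variance

# Invented survey scores; None marks a missing response (an assumption).
scores = [4, 5, None, 3, None, 4]

# Choice 1: exclude cases with missing data from the analysis.
observed = [s for s in scores if s is not None]

# Choice 2: substitute the mean of the observed scores for each missing value.
fill = mean(observed)
imputed = [fill if s is None else s for s in scores]

print("cases kept:", len(observed), "vs", len(imputed))
print("mean:", mean(observed), "vs", mean(imputed))
print("variance:", variance(observed), "vs", variance(imputed))
```

If the primary researchers filled in means and did not say so, a secondary researcher could mistake the artificially low variance for a real feature of the sample.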

Secondary researchers, particularly those conducting quantitative research, must also ensure that their conceptualization and operationalization of variables matches that of the primary researchers. If your secondary analysis focuses on a variable that was not a major part of the original analysis, you may not have enough information about that variable to conduct a thorough analysis. For example, let’s say you wanted to study whether depression is associated with income for students and you found a dataset that included those variables. If depression was not a focus of the dataset, researchers may only have included a question like, “Have you ever been diagnosed with major depressive disorder?” While answers to this question will give you some information about depression, they will not provide the same depth as a scale like Beck’s Depression Inventory or the Hamilton Rating Scale for Depression. They would also fail to provide information about indicators of severity, like hospitalization or suicide attempts. Without this level of depth, your analysis may lack validity. Even when the variables are thoroughly operationalized, researchers may conceptualize variables differently than you do. Perhaps they are interested in whether a person was diagnosed with depression in their life, but you are concerned with current symptoms of depression. For these reasons, reading research reports and other documentation is a requirement for secondary data analysis.

The lack of control over the data collection process also hamstrings the research process itself. While some studies are created perfectly, most are refined through pilot testing and feedback before the full study is conducted (Engel & Schutt, 2016). ^[2] Secondary data analysis does not allow you to engage in this process. For qualitative researchers in particular, this is an important challenge. Qualitative research, particularly from the interpretivist paradigm, uses emergent processes in which research questions, conceptualization of terms, and measures develop and change over the course of the study. Secondary data analysis inhibits this process because the data are already collected. Qualitative methods also often involve analyzing the context in which data are collected. Because secondary researchers were not present for data collection, they may not know enough about that context to represent the original data authentically and accurately in a new analysis.

Returning to our example on race, gender, and major, let’s assume you are reasonably certain the data do not contain errors and you are comfortable with having no control over the data collection process. Getting access to the data is not as simple as walking into the registrar’s office with a smile. Researchers seeking access to data collected by universities (or hospitals, health insurers, human service agencies, etc.) must have the support of the administration. In some cases, a researcher may only have to demonstrate that they are competent to complete the analysis, share their data analysis plan, and receive ethical approval from an IRB. Administrators of data that are often accessed by researchers, such as Medicaid or Census data, may fall into this category.

Your school administration may not be used to partnering with researchers to analyze data on their students. In fact, administrators may be quite sensitive to how their school is perceived as a result of your study. If your study found that women or Latinos are excluded from engineering and science degrees, that would reflect poorly on the university and the administration. It may be important for researchers to form a partnership with the agency or university whose data is included in the secondary data analysis. Administrators will trust people whom they perceive as competent, reputable, and objective. They must trust you to engage in rigorous and conscientious research. A disreputable researcher may seek to raise their reputation by finding shocking results (real or fake) in your university’s data, while damaging the reputation of the university.

On the other hand, if your secondary data analysis paints the university in a glowing and rosy light, other researchers may be skeptical of your findings. This problem concerned Steven Levitt, an economist who worked with Uber to estimate how much consumers saved by using its service versus traditional taxis. Levitt knew that he would have to partner with Uber in order to gain access to their data but was careful to secure written permission to publish his results, regardless of whether his results were positive or negative for Uber (Huggins, 2016). ^[3] Researchers using secondary data must be careful to build trust with gatekeepers in administration while not compromising their objectivity through conflicts of interest.

Strengths of secondary data analysis

While the challenges associated with secondary data analysis are important, the strengths of this method often outweigh these limitations. Most importantly, secondary data analysis is quicker and cheaper than a traditional study because the data are already collected. Once a researcher gains access to the data, it is simply a matter of analyzing it and writing up the results to complete the project. Data can take a long time to gather, and gathering it can be quite resource-intensive, so avoiding this step is a significant strength of secondary data analysis. If the primary researchers had access to more resources, they may also have been able to engage in data collection that is more rigorous than a secondary researcher could manage. In this way, outsourcing the data collection to someone with more resources may make your design stronger, not weaker. Finally, secondary researchers ask new questions that the primary researchers may not have considered. In this way, secondary data analysis deepens our understanding of existing data in the field.

[Image: stacks in a library, showing books on one side and boxes on another]

Secondary data analysis also provides researchers with access to data that would otherwise be unavailable or unknown to the public. A good example of this is historical research, in which researchers analyze data from primary sources of historical events and proceedings. Netting and O’Connor (2016) ^[4] were interested in understanding what impact religious organizations had on the development of human services in Richmond, Virginia. Using documents from the Valentine History Center, Virginia Historical Society, and other sources, the researchers were able to discover the origins of social welfare in the city—traveler’s assistance programs in the 1700s. In their study, they also uncovered the important role women played in social welfare agencies, a surprising finding given the historical disenfranchisement of women in American society. Secondary data analysis provides the researcher with the opportunity to answer questions like these without a time machine. Table 14.3 summarizes the strengths and limitations of existing data.

Table 14.3 Strengths and limitations of existing data
Strengths	Limitations
Reduces the time needed to complete the project	Anonymous data may not be truly anonymous
Cheaper to conduct, in many cases	No control over data collection process
Primary researcher may have more resources to conduct a rigorous data collection than you	Cannot refine questions, measures, or procedure based on feedback or pilot tests
Helps us deepen our understanding of data already in the literature	May operationalize or conceptualize concepts differently than primary researcher
Useful for historical research	Missing qualitative context
	Barriers to access and conflicts of interest

Ultimately, you will have to weigh the strengths and limitations of using secondary data on your own. Engel and Schutt (2016, p. 327) ^[5] propose six questions to ask before using secondary data:

What were the agency’s or researcher’s goals in collecting the data?
What data were collected, and what were they intended to measure?
When was the information collected?
What methods were used for data collection? Who was responsible for data collection, and what were their qualifications? Are they available to answer questions about the data?
How is the information organized (by date, individual, family, event, etc.)? Are there identifiers used to identify different types of data available?
What is known about the success of the data collection effort? How are missing data indicated and treated? What kind of documentation is available? How consistent are the data with data available from other sources?
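The question of how missing data are indicated is usually answered by the codebook. As a rough sketch of what that looks like in practice, the example below decodes a few raw responses using a made-up codebook. The variable name `Q12`, the numeric codes, the use of `-9` and `-8` for missing responses, and the `decode` helper are all hypothetical; a real dataset's documentation defines its own.

```python
# Hypothetical codebook: variable name, label, and codes are all invented.
codebook = {
    "Q12": {
        "label": "Ever diagnosed with major depressive disorder",
        "values": {1: "yes", 2: "no"},
        "missing": {-9: "refused", -8: "not asked"},
    },
}

# Invented raw rows as they might appear in a data file.
raw_rows = [{"Q12": 1}, {"Q12": 2}, {"Q12": -9}, {"Q12": 2}, {"Q12": 7}]


def decode(row, book):
    """Translate coded answers into labels; documented missing codes become None."""
    decoded = {}
    for var, code in row.items():
        entry = book[var]
        if code in entry["values"]:
            decoded[entry["label"]] = entry["values"][code]
        elif code in entry["missing"]:
            decoded[entry["label"]] = None
        else:
            # A code the documentation never mentions is a warning sign.
            raise ValueError(f"{var}: code {code!r} is not in the codebook")
    return decoded


decoded_rows, problems = [], []
for i, row in enumerate(raw_rows):
    try:
        decoded_rows.append(decode(row, codebook))
    except ValueError as err:
        problems.append((i, str(err)))

print(len(decoded_rows), "rows decoded;", len(problems), "row(s) need follow-up")
```

Treating `-9` as a real score, or silently accepting an undocumented code like the `7` in the last row, is exactly the kind of error that is hard to catch when you did not collect the data yourself.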

Sources of secondary data

Many sources of quantitative data are publicly available. The General Social Survey (GSS), which was discussed in Chapter 11, is one of the most commonly used sources of publicly available data among quantitative researchers (http://www.norc.uchicago.edu/GSS+Website). Data for the GSS have been collected regularly since 1972, thus offering social researchers the opportunity to investigate changes in Americans’ attitudes and beliefs over time. Questions on the GSS cover an extremely broad range of topics, from family life to political and religious beliefs to work experiences.

Other sources of quantitative data include Add Health (http://www.cpc.unc.edu/projects/addhealth), a study that was initiated in 1994 to learn about the lives and behaviors of adolescents in the United States, and the Wisconsin Longitudinal Study (https://www.ssc.wisc.edu/wlsresearch), which has systematically surveyed a panel of 10,000 people who graduated from Wisconsin high schools in 1957. Quantitative researchers interested in studying social processes outside of the United States also have many options when it comes to publicly available data sets. Data from the British Household Panel Survey (https://www.iser.essex.ac.uk/bhps), a longitudinal, representative survey of households in Britain, are freely available to those conducting academic research (private entities are charged for access to the data). The International Social Survey Programme (http://www.issp.org) merges the GSS with its counterparts in other countries around the globe. These represent just a few of the many sources of publicly available quantitative data.

Unfortunately for qualitative researchers, far fewer sources of free, publicly available qualitative data exist. However, this is slowly changing as technical sophistication grows and it becomes easier to digitize and share qualitative data. Even so, a number of data sources are available to qualitative researchers whose interests or resources limit their ability to collect data on their own. The Murray Research Archive, housed at the Institute for Quantitative Social Science at Harvard University, offers case histories and qualitative interview data (http://dvn.iq.harvard.edu/dvn/dv/mra). The Global Feminisms project at the University of Michigan offers interview transcripts and videotaped oral histories focused on feminist activism, women’s movements, and academic women’s studies in China, India, Poland, and the United States. ^[6] At the University of Connecticut, the Oral History Office provides links to a number of other oral history sites (http://www.oralhistory.uconn.edu/links.html). Not all the links offer publicly available data, but many do. Finally, the Southern Historical Collection at the University of North Carolina–Chapel Hill offers digital versions of many primary documents online, such as journals, letters, correspondence, and other papers that document the history and culture of the American South (http://dc.lib.unc.edu/ead/archivalhome.php?CISOROOT=/ead).

Keep in mind that the resources mentioned here represent just a snapshot of the many sources of publicly available data that can be easily accessed via the web. Table 14.4 summarizes the data sources discussed in this section.

Table 14.4 Sources of publicly available data
Organizational home	Focus/topic	Data	Web address
National Opinion Research Center	General Social Survey; demographic, behavioral, attitudinal, and special interest questions; national sample	Quantitative	http://www.norc.uchicago.edu/GSS+Website/
Carolina Population Center	Add Health; longitudinal social, economic, psychological, and physical well-being of cohort in grades 7–12 in 1994	Quantitative	http://www.cpc.unc.edu/projects/addhealth
Center for Demography of Health and Aging	Wisconsin Longitudinal Study; life course study of cohorts who graduated from high school in 1957	Quantitative	https://www.ssc.wisc.edu/wlsresearch/
Institute for Social & Economic Research	British Household Panel Survey; longitudinal study of British lives and well-being	Quantitative	https://www.iser.essex.ac.uk/bhps
International Social Survey Programme	International data similar to GSS	Quantitative	http://www.issp.org/
The Institute for Quantitative Social Science at Harvard University	Large archive of written data, audio, and video focused on many topics	Quantitative and qualitative	http://dvn.iq.harvard.edu/dvn/dv/mra
Institute for Research on Women and Gender	Global Feminisms Project; interview transcripts and oral histories on feminism and women’s activism	Qualitative	http://www.umich.edu/~glblfem/index.html
Oral History Office	Descriptions and links to numerous oral history archives	Qualitative	http://www.oralhistory.uconn.edu/links.html
UNC Wilson Library	Digitized manuscript collection from the Southern Historical Collection	Qualitative	http://dc.lib.unc.edu/ead/archivalhome.php?CISOROOT=/ead

While the public and free sharing of data has become increasingly common over the years, and it is an increasingly common requirement of those who fund research, Harvard researchers recently learned of the potential dangers of making one’s data available to all (Parry, 2011). ^[7] In 2008, Professor Nicholas Christakis, Jason Kaufman, and colleagues, of Harvard’s Berkman Center for Internet & Society, rolled out the first wave of their data collected from the profiles of 1,700 Facebook users (2008). ^[8] But shortly thereafter, the researchers were forced to deny public access to the data after it was discovered that subjects could easily be identified with some careful mining of the data set. Perhaps only time and additional experience will tell what the future holds for increased access to data collected by others.

Key Takeaways

The strengths and limitations of secondary data analysis must be considered before a project begins.
Previously collected data sources enable researchers to conduct secondary data analysis.

Glossary

Anonymized data– data that does not contain identifying information

Historical research– analyzing data from primary sources of historical events and proceedings

Secondary data analysis– analyzing data originally gathered by another person or entity

Image attributions

anonymous by SplitShire CC-0

archive by Pexels CC-0

[1] Bode, K. (2017, January 26). One more time with feeling: ‘Anonymized’ user data not really anonymous. Techdirt. Retrieved from: https://www.techdirt.com/articles/20170123/08125136548/one-more-time-with-feeling-anonymized-user-data-not-really-anonymous.shtml; de Montjoye, Y. A., Radaelli, L., Singh, V. K., & Pentland, A. S. (2015). Unique in the shopping mall: On the reidentifiability of credit card metadata. Science, 347(6221), 536-539.
[2] Engel, R. J. & Schutt, R. K. (2016). The practice of research in social work (4th ed.). Washington, DC: SAGE Publishing.
[3] Huggins, H. (Producer). (2016, September 7). Why Uber is an economist’s dream [Audio podcast]. Retrieved from: http://freakonomics.com/podcast/uber-economists-dream/
[4] Netting, F. E., & O’Connor, M. K. (2016). The intersectionality of religion and social welfare: Historical development of Richmond’s nonprofit health and human services. Religions, 7(1), 13-28.
[5] Engel, R. J. & Schutt, R. K. (2016). The practice of research in social work (4th ed.). Washington, DC: SAGE Publishing.
[6] These data are not free, though they are available at a reasonable price. See the Global Feminisms’ webpage at https://globalfeminisms.umich.edu/contact
[7] Parry, M. (2011, July 10). Harvard researchers accused of breaching students’ privacy. The Chronicle of Higher Education. Retrieved from https://chronicle.com/article/Harvards-Privacy-Meltdown/128166
[8] Berkman Center for Internet & Society. (2008, September 25). Tastes, ties, and time: Facebook data release. Retrieved from https://cyber.law.harvard.edu/node/4682

License

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License