2. Corpus of Contemporary American English (COCA) Tutorial

Ingrid Veloso

Montclair State University

 

1. Introduction to COCA

The Corpus of Contemporary American English (COCA) is one of the largest corpora of American English with over one billion words from various sources collected from 1990 to 2020. The texts that make up this corpus were collected from magazines, web pages, conversation, and more, thus serving as a comprehensive source for research exploring language patterns across several registers or genres. In other words, the wide range of texts in this corpus allows researchers to understand language use in various contexts.

COCA is freely available as part of the English-Corpora suite of corpora. The English-Corpora website contains several corpora, including News on the Web (NOW), the Wikipedia Corpus, and the Corpus of Historical American English. This tutorial focuses specifically on the COCA interface of the English-Corpora website. Users can use the COCA interface to search for words in particular contexts, with filters and functions that target different categories of linguistic features. The frequency of a word or phrase can also be shown in different forms, such as charts or lists, depending on the user’s preference. These are only a few of COCA’s interface options, and its features, when combined and layered, can provide meaningful results for English teachers.

This tutorial aims to help first-time users of COCA get started, from registering to using the basic features and functions of the website’s tools. Resources for more in-depth tutorials are provided in Section 6 for users who would like to learn more about the capabilities of the COCA tools. To check your understanding of the tutorial’s material, we encourage you to complete the practice exercises.

2. Registering

To register as a user of the corpus, go to https://www.english-corpora.org/coca/ on your browser. At the top right corner, click the yellow ID icon, which will lead you to the “LOGIN” page. Below the “Log in” button, find and click “REGISTER” in gray. This will take you to a page similar to the one illustrated in Figure 1.

Figure 1 – Registration page

Fill in the corresponding details on the page, specifying your name, email address, password, country, and category. For Category, pick the option that best describes your status or occupation. Be sure to click on the checkbox below regarding the Terms and Conditions of the site. In the following section, type the letters found in the colored text block shown (the blue text background in the screenshot above). This changes every time the site is refreshed or if the “RESET” button is clicked.

Once all details have been filled out, click “SUBMIT” at the bottom of the page. You will be redirected to a page that thanks you for registering and asks you to confirm your account via an email they send you immediately.

Check for an email from admin@english-corpora.org in your inbox (look in your spam folder if it is not there within 2 minutes). Click on the link provided in the email to complete your registration; this will take you to a page that confirms your registration. Once confirmed, close your browser completely and reload the corpora website to log in again. After logging in, you will be brought to your user information page. To begin using the corpus, navigate away from this page by clicking the COCA title at the upper left corner of the webpage.

3. Navigating the site

Once you are logged into the website as a registered user, you will land on the SEARCH tab, as illustrated in Figure 2.

 

Figure 2 – COCA Home Page

3.1 Top Right Icons

The icons immediately to the right of the COCA title logo are additional resources for the corpus. Please refer to the visual below for the names of the top right icons on the same bar.

Figure 3 – Menu Icons

The Profile icon leads users to their account profile and allows them to view their profile information. Changes to one’s profile can be completed on this page. The profile also displays a user’s statistical usage of the corpus, including the number of words or phrases saved in one’s lists (see following paragraph for explanation on lists).

The Saved Words icon directs you to words that you have saved in previous searches. Please refer to Section 5.1 to learn how to save words from your search results.

The Virtual Corpora icon allows you to create corpora of your own from the texts in COCA itself. Resources for learning how to use this feature are provided in Section 6.

The History icon shows a user’s most recent searches in the corpus. The last icon to the right, the Help icon, leads the user to a page that explains each icon. It also provides a short description of each icon page’s available functions.

4. Using the corpora: Main Menus

Directly below the COCA name at the top left corner of the screen are the main tabs of the website: SEARCH, FREQUENCY, CONTEXT, and OVERVIEW. These tab options will shift and change depending on the website function most recently used by the user.

The Search function of the corpus is housed in the SEARCH tab. This function serves as a starting point for word, phrase and string searches in the corpus. For a full tutorial on the functions available on the SEARCH tab, please refer to Section 5.

When a word, phrase or string is searched via the SEARCH tab, users can view the frequency of their input within the corpora against several different measures by clicking on the FREQUENCY tab. For an in-depth tutorial and explanation of this tab, please refer to Section 5.

The CONTEXT tab displays a search’s concordance lines as found in the corpus. Users can view words surrounding their searches and analyze their use based on the information provided. For a guided explanation of the CONTEXT tab, please refer to Section 5.

The OVERVIEW tab provides a brief explanation of the corpus and its general functions. Users can refer to this page when browsing what functions the corpus website is capable of. This tutorial will present a handful of these functions and teach users how to use them on their own.

5. Using the Corpora: Other options

Once you are logged into your account, click on the “SEARCH” tab on the left-most side of the menu bar. The “SEARCH” tab has 7 different functions, which are highlighted in blue above the search bar. We will first explore the defaults of the “List” function, and then introduce the remaining 6 functions in their own subsections.

5.1 List

To search for a word or phrase within the corpus, type the word, phrase or string into the textbox and click the “Find matching strings” button below it.

Figure 4 – List Option

The search will redirect you to the results on the “FREQUENCY” tab:

Figure 5 – Frequency of “On Loop”

This page will list all the relevant forms (under ALL FORMS) of your input and their frequency (under FREQ) in the corpus. To save one or more of the results of your search, you can click the star next to the result and it will be added to your “Saved Words” list (see Section 3.1). If you click on the form presented (“ON LOOP” in the example above), you will be redirected to your search results on the “CONTEXT” tab.

Figure 6 – Concordance lines of “on loop”

This page will show you examples of your search word in several sentences – these are called concordance lines. This page also shows the date of the source, its type (magazine, spoken, fiction, etc.), and source name.

To restart your search on all tabs, return to the “SEARCH” tab and click the “Reset” button below the text box.

There are a few other notable features to the search bar that can help narrow your searches. They will be presented in the following subsections.

5.1.1 Wildcard Search

The wildcard search allows the user to find words that share part of a word or a morpheme. To do so, use the “*” wildcard. For example, if we attach “*” directly to the end of “be,” we receive words from COCA that start with the letters “be”:

Figure 7 – Example of search with a wildcard

Figure 8 – Example of results for wildcard search

Additionally, if we separate the “*” from “be” with a space, we receive phrases that start with the word “be”:

Figure 9 – Search with space before wildcard

Figure 10 – Results of a search with space before wildcard

We can place the “*” before or after the searched word or morpheme, depending on which side we want to leave open, and COCA will return the most frequent words or phrases that match. For example, if a teacher is looking for example words with the prefix “im,” they can search for “im*”.
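The behavior of the “*” wildcard can be imitated on a small scale with a short Python sketch. This is only an illustration under stated assumptions: the word list `TOKENS` is invented, the function name `wildcard_matches` is hypothetical, and Python's standard `fnmatch` module stands in for COCA's own search engine, whose internals are not described in this tutorial. In this sketch the “*” may also match nothing, so “be*” returns “be” itself.

```python
from fnmatch import fnmatchcase

# Toy word list standing in for corpus tokens (invented, not real COCA data).
TOKENS = ["be", "been", "because", "before", "impossible", "import", "maybe", "time"]

def wildcard_matches(pattern, tokens):
    """Return the tokens matching a pattern in which '*' stands for any run of characters."""
    return [t for t in tokens if fnmatchcase(t, pattern)]

print(wildcard_matches("be*", TOKENS))  # words that start with "be"
print(wildcard_matches("im*", TOKENS))  # words that start with the prefix "im"
print(wildcard_matches("*be", TOKENS))  # words that end in "be"
```

Moving the “*” from the end of the pattern to the beginning switches the search from word beginnings to word endings, which mirrors the choice described above.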

5.1.2 Searching for multiple words

To search for more than one word at a time, we can use the divider symbol “|”. Simply place “|” between two or more words, and COCA will return results comparing the words chosen, giving us a side-by-side view of multiple words of interest:

Figure 11 – Search with “or”

Figure 12 – Result of search with “or”

For example, if teachers would like to show students which word is more frequent, “teenager” or “adolescent,” they can use “adolescent|teenager” as a search term.
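As a rough illustration of what a “|” search does, the following Python sketch counts each alternative in a tiny invented text and lists the more frequent word first. The sample sentence and the function name `compare_alternatives` are assumptions made for this example; the counts have nothing to do with actual COCA frequencies.

```python
from collections import Counter

# Toy text standing in for the corpus (invented, not real COCA data).
TEXT = "the teenager met another teenager and an adolescent at the library"

def compare_alternatives(query, text):
    """Count each alternative in a 'word1|word2' query, most frequent first."""
    counts = Counter(text.lower().split())
    alternatives = query.lower().split("|")
    return sorted(((w, counts[w]) for w in alternatives),
                  key=lambda pair: pair[1], reverse=True)

for word, freq in compare_alternatives("adolescent|teenager", TEXT):
    print(word, freq)
```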

5.1.3 Searching for lemmas

While a handful of words only have one form, plenty of other words have multiple forms (e.g., work = work, works, working, worked; cat = cat, cats). The base form that groups these inflected forms together is referred to as a lemma. We can search by lemma in COCA, allowing us to see which forms of a word are used most frequently in the corpus. To do so, we place our desired word into square brackets “[]”:

Figure 13 – Example of search for lemmas

Figure 14 – Results of search for lemmas
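The idea behind a lemma search can be sketched in Python as follows. The sketch assumes a small hand-made lookup table (`LEMMA_OF`) that maps each word form to its base form; COCA relies on its own lemmatized annotation, so the table, the sample text, and the function name `forms_of_lemma` are all hypothetical.

```python
from collections import Counter

# Hypothetical mini lemma table: each word form maps to its base form.
LEMMA_OF = {
    "work": "work", "works": "work", "working": "work", "worked": "work",
    "cat": "cat", "cats": "cat",
}

# Toy text standing in for the corpus (invented, not real COCA data).
TEXT = "she works hard he worked late they work and keep working the cats work too"

def forms_of_lemma(lemma, text):
    """Count every form in the text that belongs to the given lemma."""
    forms = (t for t in text.lower().split() if LEMMA_OF.get(t) == lemma)
    return Counter(forms).most_common()

for form, freq in forms_of_lemma("work", TEXT):
    print(form, freq)
```

A plain search for “work” would only count the form “work” itself, whereas the lemma search groups “works,” “worked,” and “working” under the same entry.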

5.1.4 Part of Speech

Next to the search box in each menu is the “[POS]” or part of speech option. Click on this text to narrow your search down to a specific part of speech. For example, if you want to search the word “run” as a verb, you would click “[POS]” and select “verb.ALL” from the list of part of speech types provided.

Figure 15 – List of POS available on COCA

Once you select one of the POS options, COCA will edit your text entry accordingly to produce your desired results:

Figure 16 – Example search with part-of-speech information

__________________________

Exercise 1: Using the COCA list function, we want to compare the frequency of the words “delicious” and “tasty.” Show how this would be typed into the search bar. Which word is used more often according to the corpus’ search results?

__________________________

5.2 Chart

The Chart tab is the menu option immediately to the right of the List tab. With this feature, you can search for a given word and see its frequency by register (Blog, Web, TV and Movies, Spoken, Fiction, Magazines, News, or Academic Writing). To use the Chart tab, enter the word of your choice into the text box and click the “See frequency by section” button. This search will lead you to a page that shows the frequency of your search word in the different domains. The far right side of the chart also shows the frequency distribution of the word in five-year periods starting from 1990 (e.g., 1990-1994 and so on).

Figure 17 – Example of chart search

“SECTION” refers to the register or year cluster. The “FREQ” or frequency row shows the number of times the searched word appears in the section specified in the row above. Below “FREQ” is the “WORDS (M)” row, which shows how many words, in millions, the specified section of COCA contains (e.g., 128.6 means 128,600,000 words). The “PER MIL” row shows the frequency of the searched word per million words in the domain, that is, “FREQ” divided by “WORDS (M)”. The numbers in this row are therefore normalized rates rather than raw counts or percentages, which makes them comparable across sections of different sizes (e.g., FREQ 2893 / WORDS (M) 128.6 gives roughly 22.5 occurrences per million words, shown as 22.49 in the chart).
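The arithmetic behind the “PER MIL” row can be reproduced with a few lines of Python. The function name `per_million` is chosen for this example only; the two numbers are the FREQ and WORDS (M) values from the example above.

```python
def per_million(freq, words_in_millions):
    """Normalized frequency: occurrences per one million words of the section."""
    return freq / words_in_millions

rate = per_million(2893, 128.6)
print(f"{rate:.2f} occurrences per million words")
```

Because the section size shown in the chart is rounded to one decimal place, a hand calculation like this one may differ from the displayed “PER MIL” value in the last digit.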

To gain more information about a word’s distribution within a given domain on the chart, you can click on the domain name (e.g., “BLOG”) and a chart will appear below the main chart, displaying the subcategories within that domain.

Figure 18 – Subcategories of Blogs in COCA

To see the contexts of the word in a given section/domain, you can click on the section’s chart bar within the very last row of the chart.

Figure 19 – Concordance lines of “Jello” within TV/MOVIES

It is worth noting that the wildcard feature and the multiple word feature as illustrated in instructions for the List tool (Section 5.1) can also be used when utilizing the Chart tool.

__________________________

Exercise 2: Using the COCA Chart function, search for the word “assist.” Which section is this word most frequently used in according to the search results?

__________________________

5.3 Word

Another handy tool within COCA is the “Word” menu option, which provides definitions and detailed information on a word much like a dictionary does, but with the added data of the COCA corpus. This option is located immediately to the right of the “Chart” menu option in the “SEARCH” tab. Click on the “Word” tab and type in your desired word into the text box. To search for the word, click the “See detailed info for word” button below.

Figure 20 – Results of “Word” search for the word “hello”

Clicking this button will bring you to the “WORD” tab. Within this tab, we can see various information regarding our search word, including a dictionary definition, the part of speech, frequency across sections (see Section 5.2), collocates, and clusters. To the right of the frequency chart, the “COLLOCATES” section lists the words that occur most frequently with the word you searched. These words are organized by part of speech. Much like the collocates, the “CLUSTERS” section on the results page lists the most common words immediately surrounding the searched word. However, this section lists the most common words by their location around the word (before or after it) as a phrase. It also includes words up to 3 positions away from the searched word.

Figure 21 – Clusters with the word “hello”

Lastly, this menu also features the “CONCORDANCE” section, which shows the searched word in contexts, highlighting the words closest to it so users can easily identify its most common contextual usages.

Figure 22 – KWIC with the word “hello”

__________________________

Exercise 3: Using the COCA Word function, search for the word “amuse.” List the verb collocates of this word according to the search results.

__________________________

5.4 Collocates

The “COLLOCATES” menu option can be found when clicking the “+” button after the “Browse” menu option on the “SEARCH” tab. This option allows users to find the most common words adjacent to the searched word. To make a search on this feature, type the word you want to search into the text box and click the “Find collocates” button underneath. This is the default search, which brings us to the following screens.

Figure 23 – Example of Collocates search

Figure 24 – Example of Collocates output

The resulting page will present the collocates by part of speech (NOUN, ADJ, VERB, and ADV). The chart is organized by frequency ranking, with the most frequent occurrences being higher on the list and the less frequent occurrences being at the bottom of the list. The leftmost column underneath the part of speech title shows the number of times the given collocate to the right has occurred with the searched word within the corpus. For the sake of simplicity in this tutorial, the number to the right can be ignored.

Besides the default search option, which simply retrieves whichever words surround the word in both directions, we can also limit our search to the most common occurrences before or after a word, and set how far these occurrences may be from the searched word. For example, if we only want to find what occurs after our searched word up to two places (words) away, we can adjust the number bar beneath the text boxes to generate the desired results.

Figure 25 – Example search of collocates with two words to the right selected

By clicking the “0” on the left of the blue box (which represents the searched word) and selecting the “2” on the right side of the blue box, we will obtain the following results:

Figure 26 – Example output of collocates search

The chart on the resulting page shows the most common words surrounding the searched word, given the restrictions set in the number bar. Only words up to two places away from our word are shown, in order from highest to lowest frequency. In the example above, for the word “hello,” the question mark “?” is the most common collocate. The frequency of this punctuation mark in conjunction with “hello” is 20,915 (“FREQ” column), and this can be compared to all occurrences of this punctuation mark in the corpus: 7,723,700 (“ALL” column). The “%” column expresses the first number as a percentage of the second, that is, the share of all occurrences of the collocate that appear near the searched word.
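The window settings and the “%” column described above can be sketched in Python. This is a simplified illustration, not COCA's implementation: the token list is invented, punctuation marks are treated as separate tokens, and the function names `collocates` and `percent_with_node` are hypothetical. Only the numbers 20,915 and 7,723,700 come from the example above.

```python
from collections import Counter

# Toy token list standing in for the corpus (invented, not real COCA data).
TOKENS = "hello ? hello there ! oh hello ? well hello again ? is it there ?".split()

def collocates(node, tokens, left=0, right=2):
    """Count the tokens up to `left` places before and `right` places after each `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    return counts

def percent_with_node(freq, all_occurrences):
    """Share of a collocate's total occurrences that fall near the searched word."""
    return 100 * freq / all_occurrences

# 0 places to the left and 2 places to the right, as in the example search.
print(collocates("hello", TOKENS, left=0, right=2).most_common(3))

# FREQ and ALL values for "?" taken from the "hello" example.
print(f"{percent_with_node(20915, 7723700):.2f}%")
```

Changing `left` and `right` plays the role of clicking different numbers on either side of the blue box in the number bar.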

To see each result in context with the searched word, simply click on the result you want to check and you will be directed to the corresponding page in the CONTEXT tab.

__________________________

Exercise 4: Using the COCA Collocates feature, search for the word “fun” and adjust the search to include 1 collocate to the left of the word and 3 collocates to the right of the word. What is the most common collocate for this search? What is the frequency of this collocate in the corpus?

__________________________

5.5 Compare

The “COMPARE” menu option allows users to compare the collocates of two different words based on their usage in the corpus. This menu option is located next to the “COLLOCATES” feature when the menu is expanded with the “+” button. To make a search with this feature, simply enter the two words you want to compare, one per text box provided, and click on the “Compare words” button.

Figure 27 – Example of Compare search

Figure 28 – Example of Compare search output

The resulting page presents two charts, one for each word, listing that word’s most frequent collocates. By default, the charts are sorted by ratio, from highest to lowest, which allows us to see how frequent a collocate is with each of the two words in comparison to the overall frequency of those two words. A detailed explanation of the charts’ ratios can be found by clicking the “[HELP…]” button on the upper right corner of this page.

To sort the charts by frequency rather than ratio, click “FREQUENCY,” which is highlighted in blue above the charts.

Figure 29 – Compare output sorted by frequency

__________________________

Exercise 5: Using the COCA Compare feature, compare the words “eat” and “drink.” What is the most common collocate for “eat”? What is the most common collocate for “drink”?

__________________________

5.6 KWIC

The KWIC function, which stands for Keyword In Context, allows users to search a word and see its surrounding context with examples from the corpus. This function can be found as the very last menu option when the menu is expanded with the “+” button. To make a search with this feature, type the word you want to search and click on the “Keyword in Context” button.

Figure 30 – KWIC Example

The resulting page shows the word lined up in a single column within a chart with the word being used in various sentences. For convenience and ease of viewing, the part of speech for each word is also highlighted accordingly. The part of speech coloring is as follows:

Figure 31 – Description of POS coding

On the top right corner of the results page are the sorting buttons, which allow you to sort the KWIC lines alphabetically. These buttons can also be found in the “SEARCH” tab when starting a KWIC search. By default, KWIC results are alphabetized by the 3 words immediately to the right of the searched word. By clicking the “L,” the entries are sorted by the three words to the left of the searched word. By clicking the “R,” the entries are sorted by the three words to the right. The blue box in the middle marked by a dash represents the searched word, and by clicking any of the dashed lines next to it, we can sort by our desired number of words immediately to the left or right of the searched word. To reset the alphabetical sorting, click the square with the “*” asterisk.
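The “L” and “R” sorting options can be illustrated with a small Python sketch that builds KWIC lines from an invented sentence and sorts them by their right or left context. The function names are hypothetical, and the left-hand sort assumes that the word nearest to the searched word is compared first, which this tutorial does not specify for COCA.

```python
# Toy token list standing in for the corpus (invented, not real COCA data).
TOKENS = "please take a seat and take the bus or take a break".split()

def kwic(node, tokens, span=3):
    """Return (left context, node, right context) for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            lines.append((left, tok, right))
    return lines

def sort_right(lines):
    """Alphabetize by the words to the right of the node (the 'R' button)."""
    return sorted(lines, key=lambda line: line[2])

def sort_left(lines):
    """Alphabetize by the words to the left, nearest word first (assumed for 'L')."""
    return sorted(lines, key=lambda line: line[0][::-1])

for left, node, right in sort_right(kwic("take", TOKENS)):
    print(f"{' '.join(left):>15} [{node}] {' '.join(right)}")
```

Sorting by the right-hand context places lines such as “take a break” and “take a seat” next to each other, which makes recurring patterns after the searched word easier to spot.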

__________________________

Exercise 6: Using the COCA KWIC function, search for the word “take.” Based on the search results, what word most frequently appears immediately after “take”?

__________________________

5.7 SEARCH Tab Additional Features

In addition to the menu options of the “SEARCH” tab as presented, the menus also have additional options beneath the “search” button for each page, allowing users to narrow their searches in different ways. For this tutorial, we will be focusing on the “Sections” function, the “Texts/Virtual” function, and the “Sort/Limit” function.

Figure 31 – Additional search features

5.7.1 Sections

By clicking “Sections,” we can narrow our search to only certain domains and time periods available in the corpus. To do so, simply scroll through the domains/time periods in List 1 and click on the sections you want to restrict your search to. To compare the results for these sections with another group of sections, scroll through List 2 and click on the sections you want to see in comparison to those selected in List 1. To select more than one section per list, press the ALT or command button on your keyboard and click on the additional sections you want to include.

Figure 32 – Options under sections

5.7.2 Texts/Virtual

The Sort/Limit feature is another option at the bottom of each search menu; it allows you to select how you want your results to be sorted. To use this feature, click on “Sort/Limit” at the bottom of the search. This will reveal the following options:

Figure 33 – Options under Text/Virtual

In the “SORTING” drop-down menu, you can arrange your results by “FREQUENCY” (more common occurrences in the corpus will appear at the top of the list), “RELEVANCE” (results that are more related to your search will appear first), or “ALPHABETICAL” (results will be alphabetically sorted). You can also limit which results are shown with the “MINIMUM” drop-down menu. In this menu, you can require results to meet a minimum threshold based on “FREQUENCY” or “MUT INFO,” which stands for mutual information. “MUT INFO” is only available for searches with one or more words for comparison. The threshold itself is set in the text box to the right of the drop-down menu, where you can type the minimum number for your search (e.g., the default setting is “MINIMUM: FREQUENCY of 20”). To activate this number feature, you must click the checkbox to the left of this text box.

Figure 34 – Sorting and Frequency Menu
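The combined effect of a minimum frequency and a sorting choice can be sketched in Python as follows. The result list and its frequencies are invented for illustration, the function name `sort_and_limit` is hypothetical, and the “RELEVANCE” and “MUT INFO” options are left out because this tutorial does not describe how COCA calculates them.

```python
# Invented search results as (form, frequency) pairs; not real COCA data.
RESULTS = [("book", 120), ("books", 85), ("booked", 12), ("booking", 25)]

def sort_and_limit(results, sorting="FREQUENCY", minimum=20):
    """Drop results below the minimum frequency, then sort the remainder."""
    kept = [(form, freq) for form, freq in results if freq >= minimum]
    if sorting == "ALPHABETICAL":
        return sorted(kept)
    # Default: most frequent results first.
    return sorted(kept, key=lambda result: result[1], reverse=True)

print(sort_and_limit(RESULTS))                          # by frequency, minimum 20
print(sort_and_limit(RESULTS, sorting="ALPHABETICAL"))  # alphabetical, minimum 20
print(sort_and_limit(RESULTS, minimum=1))               # keep low-frequency forms too
```

With the minimum set to 20, the form that occurs only 12 times disappears from the list; lowering the minimum brings it back.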

__________________________

Exercise 7: Using the Sections functions and the COCA List function, search for the word “book” and narrow the search to results in TV/Movies and Fiction. Which section does this word appear in more frequently?

__________________________

6. Resources for learning more functions

The following links are video suggestions for users who would like to learn more about how to use COCA in depth:

__________________________

Answers to Exercise Questions:

Exercise 1: delicious (FREQ: 15519)

Exercise 2: ACAD

Exercise 3: entertain, smile, pretend, cease, inform, inspire, amaze, invent

Exercise 4: “much” (FREQ: 6033)

Exercise 5: Most common collocate for “eat”: “bugs”; Most common collocate for “drink”: “liquor”

Exercise 6: “a”

Exercise 7: FICT

 

Veloso, I. (2023). Corpus of Contemporary American English (COCA) Tutorial. In L. Goulart & I. Veloso (Eds.), Corpora in English Language Teaching: Classroom Activities for Teachers New to Corpus Linguistics. Open Educational Resource. Montclair State University.

License

Corpora in English Language Teaching Copyright © by Ingrid Veloso is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.