A glossary of corpus literacy

This glossary provides concise definitions of key terms for novices new to corpus linguistics and its applications. Specifically, it compiles the essential metalinguistic terms for corpus literacy (CL) and covers the concepts teachers need to know before learning about the corpus-based approach to teaching and applying CBLP.

While the glossary is not comprehensive, it covers the fundamental concepts needed to build corpus literacy. The listed terms are generic and may be encountered in various corpus websites or tools. Whenever you feel unsure or forget their meanings, refer back to this glossary.


A

Authenticity

Authenticity is a feature that a corpus offers through the principled collection of naturally occurring linguistic data. The authenticity of corpus data provides contextual information about how words and phrases are actually used.


C

Corpus literacy

“The ability to use the technology of corpus linguistics to investigate language and enhance the language development of students” (Heather & Helt, 2012, p. 417). The ability encompasses several components, including understanding the basic concepts of corpus linguistics, searching corpora and analysing corpus data with corpus tools, and interpreting the results.

Concordance, concordance lines, and concordancer

A concordance is a list of lines showing all occurrences of the search item(s) in a corpus. The search term is the node of the concordance and is usually aligned at the centre, so readers can examine its surrounding words for contextual information, showing how the search item(s) are used in real life.

Each separate line is called a concordance line. Usually, a concordance can be sorted, meaning the results can be rearranged according to your search purposes.

A concordancer can be an online, web-based search engine, e.g., COCA, Compleat Lexical Tutor, SKELL, or CorpusMate, or offline analysis software, e.g., AntConc or LancsBox, that helps users generate concordances.
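The core mechanics of a concordancer can be sketched in a few lines of Python. This is a toy illustration, not the implementation of any particular tool: it finds each occurrence of the node, centres it, and shows a fixed window of context on either side.

```python
def kwic(text, node, width=30):
    """Return concordance lines with the node word aligned at the centre."""
    tokens = text.split()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = " ".join(tokens[:i])[-width:]          # context before the node
            right = " ".join(tokens[i + 1:])[:width]      # context after the node
            lines.append(f"{left:>{width}}  {token}  {right}")
    return lines

sample = ("The cat sat on the mat while the dog "
          "watched the cat from the door")
for line in kwic(sample, "cat", width=20):
    print(line)
```

Each printed line corresponds to one concordance line, with the node column aligned so the left and right contexts can be scanned vertically.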

Collocation and collocates

Collocation refers to the phenomenon of words co-occurring frequently in certain contexts. A collocate is a word that frequently appears alongside a particular word. For example, “iced” is one of the collocates of “tea”. Concordancers often allow users to query the collocates of search item(s) by measuring how frequently words appear around them. The more frequently a word appears around a search item, the more likely it is a collocate. In some software, the strength of collocation can be measured as well.

Studying collocations is useful for understanding how even low-frequency collocates are internalised by native speakers, and how speakers manage to intuit collocates with high familiarity (Hoffman and Lehmann, 2000).
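The frequency-based collocate counting described above can be sketched as follows. This is a frequency-only illustration (real tools add statistical strength measures), and the sample sentence and span size are invented for the example:

```python
from collections import Counter

def collocates(tokens, node, span=3):
    """Count the words occurring within `span` positions of each node occurrence."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            # Collect the window of words to the left and right of the node.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

tokens = "i drink iced tea and she drinks iced tea too".split()
print(collocates(tokens, "tea", span=2).most_common(3))
```

Here “iced” tops the list because it occurs within the window of both occurrences of “tea”, mirroring how a concordancer ranks candidate collocates by co-occurrence frequency.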

Context

Context, or contextual information, is another important feature that makes studying corpora interesting. In corpus concordancers and software, conducting corpus searches shows us the contextual information of the word of interest. For instance, in KWIC searches, aligning the node at the centre of the concordance reveals patterns of language use, such as the immediately preceding or following words, collocations, word clusters, etc. Other contextual information, like the register, genre, and year of publication, also informs us about the usage of the search item across different subjects or time periods.


D

Dispersion

Dispersion refers to the statistical measurement of how a search item is distributed across the different text files of a corpus. In some software, e.g., AntConc and LancsBox, dispersion can be obtained using the plot function, which displays dispersion plots to visualise where the search item occurs in a corpus. Studying dispersion can show how the search item spreads throughout a text, revealing which parts of the text it is central to.
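A dispersion plot boils down to recording where in a text each hit occurs. A minimal sketch, expressing each occurrence as a fraction of the text's length (the sample sentence is invented):

```python
def dispersion(tokens, node):
    """Return the relative position (0-1) of each occurrence of the node."""
    n = len(tokens)
    return [round(i / (n - 1), 2) for i, t in enumerate(tokens) if t == node]

tokens = ("god made the world and god saw that the world was good "
          "and god rested").split()
print(dispersion(tokens, "god"))
```

A plot function in a tool like AntConc essentially draws a tick mark at each of these relative positions, so an evenly spread item produces evenly spaced ticks.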


F

Frequency

In the realm of corpus linguistics, the concept of frequency underpins most analytical work carried out in corpus analysis. Frequencies (usually expressed as X times per million words) obtained through corpus concordancers or analysis software show how often the search item is used in a specific context.

It is also common for users to compare the frequencies of different words or phrases when conducting a comparative analysis. Frequency counts can be applied to individual words, phrases, collocations, etc. A word list for a corpus or body of text is typically compiled from the frequency counts of each word.
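Normalising a raw count to the “per million words” rate mentioned above is simple arithmetic; the figures below are invented for illustration:

```python
def per_million(raw_count, corpus_size):
    """Convert a raw frequency into a rate per million words."""
    return raw_count / corpus_size * 1_000_000

# e.g., 155 hits in a 2.3-million-word corpus (invented figures)
print(round(per_million(155, 2_300_000), 1))  # 67.4 per million words
```

Normalisation matters when comparing corpora of different sizes: 155 hits in a small corpus and 155 hits in a huge one represent very different rates of use.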


K

Keyword & keyword list

To know what a keyword is, we must understand why we obtain keywords in the first place. In principle, keywords are used to characterise a text's “aboutness” or “keyness”, meaning its main theme or subject, in comparison with other texts. To calculate keyness, the corpus we want to study is compared to a reference corpus. A keyword or keyword list therefore sums up what is distinctive about the text, since the words in the keyword list appear unusually frequently in it compared with the reference corpus, as determined by a statistical measure such as log-likelihood.

Corpus analysis software such as AntConc and LancsBox can generate keyword lists.
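The log-likelihood statistic mentioned above compares a word's frequency in the target corpus with its frequency in the reference corpus. Below is a sketch of the standard two-corpus log-likelihood calculation; the counts and corpus sizes are invented, and real tools may apply variants of the formula:

```python
from math import log

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Two-corpus log-likelihood: higher values indicate stronger keyness."""
    total = freq_target + freq_ref
    # Expected counts if the word were spread evenly across both corpora.
    expected_t = size_target * total / (size_target + size_ref)
    expected_r = size_ref * total / (size_target + size_ref)
    ll = 0.0
    if freq_target:
        ll += freq_target * log(freq_target / expected_t)
    if freq_ref:
        ll += freq_ref * log(freq_ref / expected_r)
    return 2 * ll

# A word with 120 hits in a 50,000-word target corpus but only
# 300 hits in a 1,000,000-word reference corpus (invented counts):
print(round(log_likelihood(120, 50_000, 300, 1_000_000), 1))
```

A keyword list is produced by computing this score for every word in the target corpus and ranking the words from highest to lowest keyness.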

Keyword in context (KWIC)

KWIC is the abbreviation for Keyword in Context and is used in many corpus websites and analysis software. KWIC can be pronounced as individual letters (K-W-I-C) or as a single word (“quick”, /kwɪk/).

In KWIC searches, the node is aligned at the centre. KWIC is sometimes used interchangeably with concordance.


L

Lemma & lemmatisation

A lemma is the canonical form of a word: the set of lexical forms sharing the same stem and belonging to the same major word class, regardless of inflection or spelling.

Lemmatisation refers to the process of reducing words belonging to the same lemma to that lemma by removing inflections. A lemma is formally expressed using small capital letters (as opposed to full capitals). For instance, lemmatising running, runs, and ran will produce the lemma RUN.

In some corpus software, frequencies are calculated based on the lemmatised forms of words, so that the inflected forms or variant spellings of the same stem can be included in the calculation.
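A lemmatisation step can be mimicked with a simple lookup table. Real lemmatisers use full morphological lexicons; the tiny mapping below is purely an illustration:

```python
# Toy lemma dictionary -- real lemmatisers use full morphological lexicons.
LEMMAS = {"running": "run", "runs": "run", "ran": "run",
          "better": "good", "mice": "mouse"}

def lemmatise(tokens):
    """Replace each token with its lemma where one is known."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]

print(lemmatise("He ran while the mice were running".split()))
```

Counting frequencies over the lemmatised output groups all forms of a word together, which is exactly what lemma-based frequency lists in corpus software do.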


N

Node

In corpus concordancers, the word of interest, which we want to search for and analyse, is also called the node. KWIC searches often align the node at the centre, so we can look at the surrounding words to study collocations.

In some corpus analysis software, e.g., LancsBox, the node can also be plotted as the centre of a network graph to reveal the relationships between the node and its collocates.


P

POS

POS is the abbreviation for part-of-speech. In corpus analysis, annotating words with their parts of speech (including word classes such as nouns, verbs, and adjectives, as well as other properties, such as singular or plural) is preferred because it helps to filter searches and enables further analyses of search results. For instance, in COCA it is possible to limit the search results to attributive adjectives modifying the noun “boy”.

POS tagging is the process of annotating a body of text with an automatic tagger that attaches the correct part-of-speech to each word. Such programs are available as POS taggers like TagAnt. However, manual pre-processing of text files and post-editing of tagged files are still necessary to remove errors and increase accuracy. Consistency is also an issue when conducting POS tagging.


R

Range

In corpus analysis software like AntConc, the range is one measurement that reveals the dispersion of a target word or phrase. The range refers to the number of corpus parts or texts in which the item occurs across a corpus. For instance, in a corpus containing 17 texts about religion, the word god appears 155 times (frequency) across 14 texts; the range is therefore 14. For a clearer depiction, see the textual tutorial for setting up search parameters for range in AntConc.

Register

In the context of a corpus, the term “register” refers to the variety or style of language that is associated with a particular social situation, context, or domain. It refers to the specific language choices, vocabulary, grammar, and tone used by a writer or speaker in different communicative settings. Register can vary based on factors such as formality, technicality, familiarity, and social relationships. For example, the language used in a formal register, like an academic paper, differs from the language used in an informal register, like a casual conversation among very close friends.

In corpus concordancing websites, such as COCA or CorpusMate, where disciplinary searches are possible, specifying the register helps us understand how language is adapted and used in different contexts and for different purposes. For example, in COCA, it is possible to choose in which “sections” we look for our target language patterns. In CorpusMate, a similar function is searching “in topic”.


S

Span

The span refers to the size of the collocation window displayed in a concordance. For example, in a concordancer, we can look for collocates of the node within a range of 3 words to the left (3L) and 3 words to the right (3R), i.e., a span of 3L–3R.

In most concordancers and corpus analysis software, users can sort their search results according to a predefined span to determine the range of collocation they want to study. For a clearer depiction, see the textual tutorial for setting up search parameters for span in AntConc.


T

Target corpus & reference corpus

In corpus analysis software like AntConc, the target corpus is the corpus we want to study and analyse, while the reference corpus is, in principle, a larger corpus compiled from a broader set of texts drawn from different genres or disciplines. In some cases, for instance when generating a keyword list for a corpus of interest, a comparison with the larger reference corpus is needed to reflect the “keyness”/“aboutness” of the keywords. For a clearer depiction, see the textual tutorial for managing the target corpus and reference corpus to conduct keyword analysis in AntConc.

Token and Type

Token and type are two interrelated counting units for individual linguistic items.

The number of tokens is the total number of running words in a text, counting every occurrence of a word, whereas the number of types is the total number of unique words, counting repeated words only once. Therefore, in principle, a corpus contains at least as many tokens as types.

For instance, in the following sentence:

How much wood could a woodchuck chuck if a woodchuck could chuck wood?

There are 13 tokens (13 individual words), but only 8 types, since wood, could, a, woodchuck, and chuck each appear twice.

The type/token ratio (sometimes abbreviated as TTR) is calculated by dividing the number of types by the number of tokens. A high type/token ratio suggests the text is lexically diverse, whereas a low ratio means repetition is frequent throughout the text.
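The calculation above can be carried out directly, using the woodchuck sentence as the text:

```python
def ttr(tokens):
    """Type/token ratio: unique words divided by total words."""
    return len(set(tokens)) / len(tokens)

tokens = ("how much wood could a woodchuck chuck "
          "if a woodchuck could chuck wood").lower().split()
# 13 tokens, 8 types, ratio about 0.62
print(len(tokens), len(set(tokens)), round(ttr(tokens), 2))
```

Note that tokens are lowercased first, so that “How” and “how” count as the same type.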


W

Wildcard

A wildcard (sometimes ‘wild card’) is a symbol or character that represents one or more unknown or variable characters in a search pattern. In the context of corpus linguistics, wildcards can be used to retrieve concordances based on patterns rather than exact matches. Two common symbols – * (asterisk) and ? (question mark) – are used as wildcards in query languages.

For example, if the wildcard * (asterisk) is used in the search phrase “draw*”, where it immediately follows “draw”, the search results will show concordance lines containing “draws”, “drawing”, “drawn”, “drawer”, etc. In other words, the results will display forms of “draw” with suffixes, including inflections and derivations. On the other hand, if we include a space before the wildcard, as in the search phrase “draw *”, the results will show concordance lines containing “draw the”, “draw a”, “draw on”, “draw attention”, etc. Note that in both cases the asterisk can stand for any number of letters. The question mark, by contrast, typically represents exactly one letter.

Wildcards are particularly useful in situations where you want to search for language data that partially matches a specified pattern or when you want to retrieve multiple results that share a common characteristic.

For a clearer depiction, see the textual tutorials for wildcard searches in COCA, Netspeak, or CorpusMate.
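The behaviour of the two wildcard symbols can be approximated with regular expressions. The sketch below simply translates a wildcard query into a regex and is an illustration, not the query syntax of any specific tool:

```python
import re

def wildcard_to_regex(query):
    """Translate a wildcard query: * matches any letters, ? matches one letter."""
    pattern = re.escape(query).replace(r"\*", "[a-z]*").replace(r"\?", "[a-z]")
    return re.compile(rf"^{pattern}$")

words = ["draw", "draws", "drawing", "drawn", "drawer", "dread"]
matches = [w for w in words if wildcard_to_regex("draw*").match(w)]
print(matches)
```

Here “draw*” matches every word beginning with “draw” but not “dread”, while a query like “dr?w” would match exactly one letter in the gap (e.g., “draw”).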

Word cloud

A word cloud, in the context of a corpus, is a visual representation of the most frequently occurring words in a text or collection of texts. It provides a quick and intuitive way to identify the key themes or topics within a corpus. In a word cloud, words are displayed graphically, with the size of each word indicating its frequency of occurrence. Typically, the more frequently a word appears in the corpus, the larger and more prominent it appears in the word cloud. This visualisation technique helps users gain insights into the important words or concepts that appear frequently in the analysed corpus. It is also particularly useful to stimulate learners, especially younger learners, with a visualisation of the words contained in a body of text, so they have a brief idea about the text.

Examples of corpus websites and tools with the word cloud feature are VersaText, SKELL, AntConc, and LancsBox.


Now that you have a deeper understanding of these terms about corpus literacy, continue with the following modules to learn how to use corpus technology. Go and check out the first tool – COCA!
