Bag-of-words representation of text

Consider the following text:

A (real) vector is just a collection of real numbers, referred to as the components (or, elements) of the vector; \mathbb{R}^n denotes the set of vectors with n elements. If x \in \mathbb{R}^n denotes a vector, we use subscripts to denote elements, so that x_i is the i-th component of x. Vectors are arranged in a column, or a row. If x is a column vector, x^T denotes the corresponding row vector, and vice-versa.

The row vector x = [5,4,5] contains the number of times each word in the list {vector, of, the} appears in the above paragraph, counting exact matches only (so the plural "vectors" does not count as "vector"). Vectors can thus be used to represent text documents. This representation, often referred to as the bag-of-words representation, is not faithful, as it ignores the order in which the words appear. In addition, stop words (such as "the" or "of") are often ignored.
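The counting step can be sketched in a few lines of Python. This is only an illustrative sketch, not a prescribed method: the function name bag_of_words and the tokenization rule (lowercase the text, then keep maximal runs of the letters a to z as words) are assumptions made here, not something the text specifies.

```python
import re
from collections import Counter

# The example paragraph, copied verbatim (raw strings keep the LaTeX backslashes).
text = (
    r"A (real) vector is just a collection of real numbers, referred to as "
    r"the components (or, elements) of the vector; \mathbb{R}^n denotes the "
    r"set of vectors with n elements. If x \in \mathbb{R}^n denotes a vector, "
    r"we use subscripts to denote elements, so that x_i is the i-th component "
    r"of x. Vectors are arranged in a column, or a row. If x is a column "
    r"vector, x^T denotes the corresponding row vector, and vice-versa."
)

# Word list from the text; its order fixes the order of the vector's components.
vocabulary = ["vector", "of", "the"]


def bag_of_words(document, vocabulary):
    """Return the list of exact-match counts of each vocabulary word.

    Assumed tokenization: lowercase, then split into runs of letters,
    so "vectors" is a different token from "vector".
    """
    tokens = re.findall(r"[a-z]+", document.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


x = bag_of_words(text, vocabulary)
print(x)

# Word order is ignored: reversing the order of the words leaves the vector unchanged.
reversed_text = " ".join(reversed(text.split()))
print(bag_of_words(reversed_text, vocabulary) == x)
```

The resulting counts depend on the tokenization choices (case folding, and whether a plural such as "vectors" is matched to "vector"), so a different rule can give a different vector for the same text. Ignoring stop words amounts to leaving them out of the vocabulary list.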

See also: Bag-of-words representation of text: measure of document similarity.

License

Linear Algebra and Applications Copyright © 2023 by VinUniversity is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, except where otherwise noted.