Bag-of-words representation of text
Consider the following text:
A (real) vector is just a collection of real numbers, referred to as the components (or, elements) of the vector; $\mathbf{R}^n$ denotes the set of vectors with $n$ elements. If $x \in \mathbf{R}^n$ denotes a vector, we use subscripts to denote elements, so that $x_i$ is the $i$-th component of $x$. Vectors are arranged in a column, or a row. If $x$ is a column vector, $x^T$ denotes the corresponding row vector, and vice-versa.
Consider the row vector whose entries are the number of times each word in the list {vector, of, the} appears in the above paragraph. Vectors can thus be used to represent text documents. This representation, often referred to as the bag-of-words representation, is not faithful, as it ignores the order in which the words appear. In addition, stop words (such as "the" or "of") are often ignored as well.
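The counting step can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `bag_of_words`, the two example documents, and the tokenization rule (lowercase, alphabetic runs only, no merging of plurals) are all hypothetical choices made here, not something the text prescribes.

```python
import re
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return the count of each vocabulary word in `text`, in vocabulary order."""
    # Assumed tokenizer: lowercase the text and keep runs of letters only,
    # so "The" matches "the" and punctuation is dropped. Plurals such as
    # "vectors" are NOT merged with "vector".
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # A Counter returns 0 for words that never occur.
    return [counts[word] for word in vocabulary]


vocabulary = ["vector", "of", "the"]

# Hypothetical documents: same words, different order.
doc_a = "The vector of counts ignores the order of the words."
doc_b = "Of the words, the order ignores the vector of counts."

vec_a = bag_of_words(doc_a, vocabulary)
vec_b = bag_of_words(doc_b, vocabulary)

# Ignoring stop words amounts to removing them from the vocabulary.
stop_words = {"the", "of"}
filtered_vocabulary = [w for w in vocabulary if w not in stop_words]
vec_filtered = bag_of_words(doc_a, filtered_vocabulary)

print(vec_a)           # [1, 2, 3]
print(vec_a == vec_b)  # True: word order does not affect the vector
print(vec_filtered)    # [1]
```

The two documents produce the same vector even though their word order differs, which is exactly the sense in which the representation is not faithful. A different tokenizer (for instance one that treats "vectors" as "vector") would give different counts, so any reported vector depends on that choice.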
See also: Bag-of-words representation of text: measure of document similarity.