Bag-of-words representation of text
Consider the following text:
A (real) vector is just a collection of real numbers, referred to as the components (or, elements) of the vector; [...]
The row vector contains the number of times each word in the list
{vector, of, the}
appears in the above paragraph. Vectors can thus be used to represent text documents. This representation, often referred to as the bag-of-words representation, is not faithful, as it ignores the order in which the words appear. In addition, stop words (such as "the" or "of") are often ignored as well.
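As a minimal sketch of the idea, the Python snippet below counts how often each word of a fixed list occurs in a piece of text. The function name `bag_of_words`, the lowercase letters-only tokenization, and the toy sentences are assumptions made for illustration; they are not part of the original text. The count is applied only to the opening sentence of the passage quoted above, and words are matched exactly, so "vectors" would not count as "vector".

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return the count of each vocabulary word in text, in vocabulary order."""
    # Lowercase the text and keep runs of letters only, so punctuation is dropped.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # A Counter returns 0 for words that never occur.
    return [counts[word] for word in vocabulary]

vocabulary = ["vector", "of", "the"]
sentence = ("A (real) vector is just a collection of real numbers, referred to "
            "as the components (or, elements) of the vector")
print(bag_of_words(sentence, vocabulary))  # [2, 2, 2]

# Word order is lost: two different sentences map to the same vector.
toy_vocabulary = ["the", "dog", "man", "bit"]
print(bag_of_words("the dog bit the man", toy_vocabulary))  # [2, 1, 1, 1]
print(bag_of_words("the man bit the dog", toy_vocabulary))  # [2, 1, 1, 1]
```

The last two calls return identical vectors even though the sentences mean different things, which is the sense in which the representation is not faithful.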
See also: Bag-of-words representation of text: measure of document similarity.