Haystacks and Needles

Tokenization is domain specific

You don't always want to split up letters from numbers - f.e. in product numbers.
You might want to exclude words (stop words)
You might want to filter out words that are short, or just long
You might want to define synonyms
You might want to normalize text (remove accents, Unicode forms)