Some languages provide special classes to solve that problem, e.g.
Let’s take a string as follows
Note that the given string has leading and trailing spaces, and that some words are separated by more than a single space.
Let's start with the string.split() function, passing a single-space string as the separator.
It gives something like:
So this is not exactly what we expected: there are too many empty tokens that we need to get rid of. One solution could be to remove the empty strings (tokens) from the output array using the Array.prototype.splice function. That, however, is not a great approach, as it would require extra code (and additional testing). We need something that works directly while the text is being converted to tokens.
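As a minimal sketch of the cleanup described above, the snippet below splits on a single space and then drops the empty tokens. The sample string is an assumption chosen for illustration (it is not the string used in this article), and Array.prototype.filter stands in for splice here because it needs less code.

```javascript
// Hypothetical sample: two leading spaces, two trailing spaces,
// and a double space between "alpha" and "beta".
const sample = "  alpha  beta gamma  ";

// Every single space acts as a separator, so adjacent spaces
// produce empty strings in the result.
const rawTokens = sample.split(" ");
console.log(rawTokens);
// → ["", "", "alpha", "", "beta", "gamma", "", ""]

// Post-processing step: keep only the non-empty tokens.
const cleanedTokens = rawTokens.filter((token) => token !== "");
console.log(cleanedTokens);
// → ["alpha", "beta", "gamma"]
```

This works, but it needs a second pass over the array, which is the kind of extra step we would like to avoid.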
Let’s try the following regular expression:
// we use the same text variable as defined above
console.log(text.split(/\s+/))
It will give output like:
So we have managed to get rid of the empty tokens in the middle of the output array. Now we need to get rid of the empty tokens at the beginning and the end of the array.
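To make this concrete, here is a small sketch that splits on runs of whitespace. The sample string is an assumption made for illustration and is not the string used in this article.

```javascript
// Hypothetical sample with leading, trailing and repeated spaces.
const sample = "  alpha  beta gamma  ";

// /\s+/ treats each run of whitespace as one separator, so the
// double space between "alpha" and "beta" yields no empty token.
const tokens = sample.split(/\s+/);
console.log(tokens);
// → ["", "alpha", "beta", "gamma", ""]

// The leading and trailing runs still act as separators next to
// the string boundaries, which leaves an empty token at each end.
console.log(tokens[0] === "", tokens[tokens.length - 1] === "");
// → true true
```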
Let’s try a slightly modified regular expression.
// we use the same text variable as defined above
console.log(text.split(/\b\s+/))
It gives output like:
We have managed to get rid of the first empty token, but we still have an empty token at the end of the array. Let’s modify our regular expression so that the string is split only where the separator is not immediately followed by the end of the string.
// we use the same text variable as defined above
console.log(text.split(/\b\s+(?!$)/))
It gives output like:
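The two word-boundary variants can be compared side by side in the sketch below. The sample string is an assumption made for illustration, so the exact tokens will differ from those produced by the string used in this article.

```javascript
// Hypothetical sample with leading, trailing and repeated spaces.
const sample = "  alpha  beta gamma  ";

// \b requires a word character right before the whitespace run,
// so the leading spaces are no longer a separator. They stay
// attached to the first token instead.
const boundarySplit = sample.split(/\b\s+/);
console.log(boundarySplit);
// → ["  alpha", "beta", "gamma", ""]

// (?!$) rejects a match that ends exactly at the end of the string.
// The regex engine backtracks and matches only the first of the two
// trailing spaces, so the last token becomes a single space.
const lookaheadSplit = sample.split(/\b\s+(?!$)/);
console.log(lookaheadSplit);
// → ["  alpha", "beta", "gamma", " "]
```

On this sample the empty tokens are gone, but whitespace still leaks into the first and last tokens, so splitting alone does not give a clean result.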
A different approach is to use the String.prototype.match function. The match() method retrieves the matches found when a string is matched against a regular expression. So we need a regular expression that matches our words but not the whitespace characters.
For that we can use \S, which matches a single non-whitespace character.
So let’s check whether the match function helps us. Our code snippet is as follows. Note that we use the g (global) option to match all tokens in the given string.
// we use the same text variable as defined above
console.log(text.match(/\S+/g))
And the output is:
Now the output is precisely what we expected: there are no empty tokens and no spaces in the extracted tokens.
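The sketch below shows the same call on an assumed sample string (not the one used in this article), together with one caveat worth guarding against: with the g flag, match() returns null rather than an empty array when nothing matches.

```javascript
// Hypothetical sample with leading, trailing and repeated spaces.
const sample = "  alpha  beta gamma  ";

// \S+ matches each run of non-whitespace characters, so whitespace
// never ends up in a token, wherever it sits in the string.
const words = sample.match(/\S+/g);
console.log(words);
// → ["alpha", "beta", "gamma"]

// A whitespace-only (or empty) string has no matches, and match()
// then returns null. Falling back to [] keeps the result an array.
const noWords = "   ".match(/\S+/g) || [];
console.log(noWords);
// → []
```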
Note that by providing a different regular expression we can extract different kinds of tokens. In the example below, we retrieve all digit tokens.
"I've found 4 ducks on 11th street.".match(/\d+/g)
That results in the following array of tokens:
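The idea of swapping the matcher can be sketched as a small helper. The function name extractTokens is hypothetical and introduced only for this illustration. It assumes the pattern carries the g flag, because without it match() returns only the first match.

```javascript
// Hypothetical helper: returns every match of `pattern` in `text`,
// or an empty array when there is no match at all.
// Assumption: `pattern` is a RegExp created with the g flag.
function extractTokens(text, pattern) {
  return text.match(pattern) || [];
}

const sentence = "I've found 4 ducks on 11th street.";

// Runs of digits only.
const digitTokens = extractTokens(sentence, /\d+/g);
console.log(digitTokens);
// → ["4", "11"]

// Runs of non-whitespace characters (punctuation stays attached).
const wordTokens = extractTokens(sentence, /\S+/g);
console.log(wordTokens);
// → ["I've", "found", "4", "ducks", "on", "11th", "street."]
```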