Tokenizing

The Standard C (and C++) function strtok() leaves a lot to be desired in terms of user-friendliness. It's unintuitive, it destroys the character string on which it operates, and it requires you to handle all the memory problems. But it does let the client code decide what to use to break the string into pieces; it allows you to choose the "whitespace," so to speak.
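For comparison, here is a minimal sketch of the strtok() idiom those complaints refer to. The helper name strtok_tokens is hypothetical (it is not part of any library), and the use of nullptr assumes a C++11 compiler. It shows the loop with a varying first argument and the fact that the caller must supply a writable buffer, which strtok() then overwrites.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper, for illustration only.
// buf must be a writable, NUL-terminated array: std::strtok overwrites
// each delimiter that ends a token with '\0'.
std::vector<std::string> strtok_tokens(char *buf, const char *delimiters)
{
    std::vector<std::string> tokens;

    // The first call passes the buffer; every later call passes a null
    // pointer so that strtok continues where it left off.
    for (char *tok = std::strtok(buf, delimiters);
         tok != nullptr;
         tok = std::strtok(nullptr, delimiters))
    {
        tokens.push_back(tok);   // copy the token out of the buffer
    }
    return tokens;
}
```

Calling strtok_tokens on a char array holding "one two\tthree" with delimiters " \t" yields the three tokens, but afterwards the array, read as a C string, contains only "one", because the delimiters after each token have been replaced by NUL characters.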

A C++ implementation lets us keep the good things and fix those annoyances. The implementation here is more intuitive (you call it only once, not in a loop with varying arguments), it does not modify the original string at all, and all the memory allocation is handled for you.

It's called stringtok, and it's a template function. The source is shown below in a less portable form than it could be, to keep this example simple (for example, see the comments on what kind of string it will accept).

#include <string>

// Container must support push_back(std::string), e.g. std::list<std::string>
// or std::vector<std::string>.  Only std::string input is accepted.
template <typename Container>
void
stringtok(Container &container, std::string const &in,
          const char * const delimiters = " \t\n")
{
    const std::string::size_type len = in.length();
          std::string::size_type i = 0;

    while (i < len)
    {
        // Eat leading whitespace
        i = in.find_first_not_of(delimiters, i);
        if (i == std::string::npos)
            return;   // Nothing left but white space

        // Find the end of the token
        std::string::size_type j = in.find_first_of(delimiters, i);

        // Push token
        if (j == std::string::npos)
        {
            container.push_back(in.substr(i));
            return;
        }
        else
            container.push_back(in.substr(i, j-i));

        // Set up for next loop
        i = j + 1;
    }
}

The author uses a more general (but less readable) form of it for parsing command strings and the like. If you compiled and ran this code using it:

   // requires <iostream>, <list>, and <string>
   std::string             s = " this  \t is\t\n  a test  ";
   std::list<std::string>  ls;
   stringtok (ls, s);
   for (std::list<std::string>::const_iterator i = ls.begin();
        i != ls.end(); ++i)
   {
       std::cerr << ':' << (*i) << ":\n";
   }

You would see this as output:

   :this:
   :is:
   :a:
   :test:

with all the whitespace removed. The original s is still available for use, ls will clean up after itself, and ls.size() will return how many tokens there were.

As always, there is a price paid here, in that stringtok is not as fast as strtok. The other benefits usually outweigh that, however. Another version of stringtok is given here, suggested by Chris King and tweaked by Petr Prikryl, and this one uses the transformation functions mentioned below. If you are comfortable with reading the new function names, this version is recommended as an example.

Added February 2001: Mark Wilden pointed out that the standard std::getline() function can be used with standard istringstreams to perform tokenizing as well. Build an istringstream from the input text, and then use std::getline with varying delimiters (the three-argument signature) to extract tokens into a string.
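The getline approach might look like the following minimal sketch. The helper name getline_tokens is hypothetical, and the choice to discard empty tokens is an assumption made here so that runs of adjacent delimiters behave like stringtok. Note that the three-argument std::getline takes a single delimiter character per call, not a set of delimiters; this sketch uses the same delimiter for every call, although a different one could be passed each time.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, for illustration only: split `in` on one
// delimiter character using std::getline on an istringstream.
std::vector<std::string> getline_tokens(const std::string &in, char delimiter)
{
    std::vector<std::string> tokens;
    std::istringstream stream(in);   // works on a copy; `in` is untouched
    std::string token;

    // Each call extracts characters up to (and discards) the next delimiter.
    while (std::getline(stream, token, delimiter))
    {
        if (!token.empty())          // skip empty fields between adjacent delimiters
            tokens.push_back(token);
    }
    return tokens;
}
```

With this sketch, getline_tokens("a,b,,c", ',') produces the three tokens a, b, and c.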
