Richard G Baldwin (512) 223-4758, baldwin@austin.cc.tx.us, http://www2.austin.cc.tx.us/baldwin/

Stream Tokenizer

Java Programming, Lecture Notes # 61, Revised 12/16/98.


Preface

Students in Prof. Baldwin's Intermediate Java Programming classes at ACC will be responsible for knowing and understanding all of the material in this lesson beginning with the Spring semester of 1999.

This lesson was originally written on September 24, 1998, using the JDK 1.1.6 download package. It was upgraded to JDK 1.2 on 12/16/98. The purpose of this lesson is to illustrate the use of the StreamTokenizer class.

Introduction

The StreamTokenizer class makes it possible to parse an input stream into a set of tokens.  The tokens can be read and dealt with one at a time.  A table is used to control the parsing process.  In addition, there are several flags that can be set to different states to assist in the control of the parsing process.  The parsing process can recognize words, numbers, quoted strings, and comment styles.  It also recognizes ordinary characters which are not included in any of the above.

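Each byte read from the input stream is treated as a character in the hexadecimal range from 00 to FF.  Each input character is considered by the StreamTokenizer to be one of the following:

  • A whitespace character that delimits a word.
  • An ordinary character that is not a delimiter and is not part of a word or a number.
  • A character that is part of a word.
  • A character that is part of a number.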

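An instance of the class has four flags which can be set to true or false.  These flags are set to indicate:

  • Whether line terminators are returned as tokens or treated as whitespace that merely separates tokens.
  • Whether C-style comments are recognized and skipped.
  • Whether C++-style comments are recognized and skipped.
  • Whether the characters of word tokens are converted to lower case.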
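
As a minimal sketch of two of these flags in action, the following program tokenizes the same text twice: once with ends of line made significant and lower-case conversion turned on, and once with both turned off.  The class name FlagDemo, the method name tokenize, and the sample text are hypothetical and were invented for this illustration.  They are not part of the sample program discussed later.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the lesson's program.
class FlagDemo {
  // Returns the word tokens found in text.  The string "EOL"
  // marks each end-of-line token, which can only appear when
  // eolIsSignificant is true.
  static List<String> tokenize(String text,
                               boolean eolIsSignificant,
                               boolean lowerCase) {
    StreamTokenizer strTok =
                  new StreamTokenizer(new StringReader(text));
    strTok.eolIsSignificant(eolIsSignificant);
    strTok.lowerCaseMode(lowerCase);
    List<String> tokens = new ArrayList<String>();
    try {
      while (strTok.nextToken() != StreamTokenizer.TT_EOF) {
        if (strTok.ttype == StreamTokenizer.TT_EOL) {
          tokens.add("EOL");
        } else if (strTok.ttype == StreamTokenizer.TT_WORD) {
          tokens.add(strTok.sval);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return tokens;
  }

  public static void main(String[] args) {
    String text = "Hello World\nBye";
    // prints [hello, world, EOL, bye]
    System.out.println(tokenize(text, true, true));
    // prints [Hello, World, Bye]
    System.out.println(tokenize(text, false, false));
  }
}
```

With both flags off, the newline simply separates World from Bye in the same way that a space would.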

Sequential tokens are extracted from the stream by invoking the nextToken() method on an object of the StreamTokenizer class after instantiating the object and passing a stream Reader object as a parameter to the constructor.  The Reader object specifies the stream that will be tokenized.

The nextToken() method always returns an int that can be used to interpret the token.  The class contains four class constants.  The int value returned by nextToken() will always either match one of the constants, or will contain the value of the character.  (Three of the constants, TT_EOF, TT_NUMBER, and TT_WORD, are negative int values which cannot possibly match the value of a character in the range 00 to FF.  The fourth, TT_EOL, has the value of the newline character.)

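The four constants are:

  • TT_EOF - Indicates that the end of the stream has been read.
  • TT_EOL - Indicates that the end of a line has been read.
  • TT_NUMBER - Indicates that a number token has been read.
  • TT_WORD - Indicates that a word token has been read.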
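
One quick way to check the actual values of the class constants is to print them.  The following sketch does only that.  The class name ConstantValues and the method name describe are hypothetical.  Note that TT_EOL has the value of the newline character, while the other three constants are negative.

```java
import java.io.StreamTokenizer;

// Hypothetical demo class; not part of the lesson's program.
class ConstantValues {
  // Builds one line of text showing the four class constants.
  static String describe() {
    return "TT_EOF=" + StreamTokenizer.TT_EOF
         + " TT_EOL=" + StreamTokenizer.TT_EOL
         + " TT_NUMBER=" + StreamTokenizer.TT_NUMBER
         + " TT_WORD=" + StreamTokenizer.TT_WORD;
  }

  public static void main(String[] args) {
    // prints TT_EOF=-1 TT_EOL=10 TT_NUMBER=-2 TT_WORD=-3
    System.out.println(describe());
  }
}
```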

In addition, the class provides three public instance variables: nval, sval, and ttype.

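After a call to the nextToken() method, the ttype variable contains the type of the token just read. For a single character token, its value is the single character, converted to an integer. For a quoted string token its value is the quote character (" or ').  Otherwise, its value is one of the following:

  • TT_WORD - The token is a word.
  • TT_NUMBER - The token is a number.
  • TT_EOL - The end of a line has been read.  The variable can have this value only if eolIsSignificant() has been invoked with the argument true.
  • TT_EOF - The end of the input stream has been reached.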

The value contained in ttype is always the same as the value returned by the nextToken() method.

After a call to the nextToken() method, the contents of the sval variable will depend on the type of token that was just read.  If the current token is a word token, sval contains a string giving the characters of the word token.  When the current token is a quoted string token, sval contains the body of the string.  Note that the string can contain any characters including the characters which normally delimit words and numbers.  sval will contain null if the current token is neither a word nor a quoted string.

After a call to the nextToken() method, the contents of the nval variable will also depend on the type of token.  If the current token is a number, nval contains the value of that number.  Because nval is of type double, it cannot contain null; when the current token is not a number, its value should not be relied upon.
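
The following sketch shows one way to decide which of the three instance variables to consult after each call to nextToken().  The class name FieldDemo, the method names, the labels in the returned strings, and the sample text are all hypothetical.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the lesson's program.
class FieldDemo {
  // Describes the current token using ttype to decide whether
  // sval, nval, or the character value itself is meaningful.
  static String describe(StreamTokenizer strTok) {
    switch (strTok.ttype) {
      case StreamTokenizer.TT_WORD:
        return "word:" + strTok.sval;
      case StreamTokenizer.TT_NUMBER:
        return "number:" + strTok.nval;
      case '"':
      case '\'':
        return "quoted:" + strTok.sval;
      default:
        return "char:" + (char) strTok.ttype;
    }
  }

  // Tokenizes text with the default parsing table.
  static List<String> scan(String text) {
    StreamTokenizer strTok =
                  new StreamTokenizer(new StringReader(text));
    List<String> result = new ArrayList<String>();
    try {
      while (strTok.nextToken() != StreamTokenizer.TT_EOF) {
        result.add(describe(strTok));
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return result;
  }

  public static void main(String[] args) {
    // prints [word:width, number:42.0, quoted:big box, char:+]
    System.out.println(scan("width 42 'big box' +"));
  }
}
```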

A variety of methods are available to modify the contents of the table used for parsing.  However, one of the problems with the StreamTokenizer class is that I was unable to find any documentation that defined the default values contained in the table.  Therefore, it was necessary for me to experiment in an attempt to determine the default state of the table.  These are my general conclusions, based solely on experimentation.

By default, the space character is a whitespace character used to separate tokens in the stream.

Only the upper and lower-case alphabetic characters are word characters.

The numeric characters from 0 through 9, the minus sign, and the period are treated as number characters.
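
A short sketch can make the treatment of number characters more concrete.  The class name NumberDemo, the method name scan, and the sample text are hypothetical.  The last item in the sample text is there to show that an exponent is not treated as part of the number; the parser stops the number at the letter e and then begins a word.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the lesson's program.
class NumberDemo {
  // Returns each token as text: numbers via nval, words via
  // sval, and anything else as a single character.
  static List<String> scan(String text) {
    StreamTokenizer strTok =
                  new StreamTokenizer(new StringReader(text));
    List<String> result = new ArrayList<String>();
    try {
      while (strTok.nextToken() != StreamTokenizer.TT_EOF) {
        if (strTok.ttype == StreamTokenizer.TT_NUMBER) {
          result.add(String.valueOf(strTok.nval));
        } else if (strTok.ttype == StreamTokenizer.TT_WORD) {
          result.add(strTok.sval);
        } else {
          result.add(String.valueOf((char) strTok.ttype));
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return result;
  }

  public static void main(String[] args) {
    // prints [3.0, -4.5, 0.25, 1.0, e3]
    System.out.println(scan("3 -4.5 .25 1e3"));
  }
}
```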

The / character initially behaved in ways that puzzled me.  For example, the /0 (slash zero) combination caused my program to return TT_EOF until I invoked a method to force the / character to be treated as an ordinary character.  This is consistent with the / character being a single-line comment character by default: since there were no end-of-line characters in the stream, the entire remainder of the file was ignored.
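
The following sketch reproduces the behavior of the / character on a small scale.  The class name SlashDemo, the method name scan, and the sample text are hypothetical.  With the default table, everything from the / character to the end of the line is skipped.  After ordinaryChar('/') is invoked, the / character comes back as a single-character token.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the lesson's program.
class SlashDemo {
  // Returns words via sval and other tokens as single chars.
  // If slashIsOrdinary is true, '/' loses its default meaning.
  static List<String> scan(String text,
                           boolean slashIsOrdinary) {
    StreamTokenizer strTok =
                  new StreamTokenizer(new StringReader(text));
    if (slashIsOrdinary) {
      strTok.ordinaryChar('/');
    }
    List<String> result = new ArrayList<String>();
    try {
      while (strTok.nextToken() != StreamTokenizer.TT_EOF) {
        if (strTok.ttype == StreamTokenizer.TT_WORD) {
          result.add(strTok.sval);
        } else {
          result.add(String.valueOf((char) strTok.ttype));
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return result;
  }

  public static void main(String[] args) {
    String text = "a / b c\nd";
    // prints [a, d]
    System.out.println(scan(text, false));
    // prints [a, /, b, c, d]
    System.out.println(scan(text, true));
  }
}
```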

Pairs of double quotes or pairs of single quotes serve to delimit string data.

Beyond this, it appears that all of the other characters between 32 and 126 are treated as ordinary characters unless one of the methods is invoked to cause them to be treated differently.

To use this class, instantiate an object of the class, link it to an input stream, set up the parsing table, and then loop calling the nextToken() method in each iteration of the loop until it returns the value TT_EOF.  
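
The steps just described can be sketched in a few lines.  The following example counts the word tokens in a stream.  The class name WordCount, the method name countWords, and the sample text are hypothetical, and a StringReader stands in for a file so that the example is self-contained.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of the lesson's program.
class WordCount {
  static int countWords(Reader in) {
    // Instantiate the tokenizer and link it to the stream
    StreamTokenizer strTok = new StreamTokenizer(in);
    // Set up the parsing table
    strTok.ordinaryChar('/');
    // Loop until nextToken() returns TT_EOF
    int count = 0;
    try {
      while (strTok.nextToken() != StreamTokenizer.TT_EOF) {
        if (strTok.ttype == StreamTokenizer.TT_WORD) {
          count++;
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return count;
  }

  public static void main(String[] args) {
    String text = "the quick brown fox / jumps 3 times";
    // prints 6
    System.out.println(countWords(new StringReader(text)));
  }
}
```

The / character and the number 3 are returned as tokens, but neither is a word, so neither is counted.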

Methods

A large number of methods are available, many of which are used to set up the parsing table.  A sampling of those methods follows.
 

  • eolIsSignificant(boolean) - Invoking this method with a boolean parameter establishes whether or not the end of line is treated as a token.
  • lineno() - Returns the current line number. 
  • lowerCaseMode(boolean) - Lets you specify that all the characters in word tokens are to be converted to lower case.
  • nextToken() - Gets the next token from the stream.  The type of the token is returned in the ttype field. Additional information about the token may be in the nval field or the sval field.  Returns the value of the ttype field. 
  • ordinaryChar(int) - Lets you specify that a character should be treated as ordinary.  This removes any special significance the character has as a comment character, word component, string delimiter, white space, or number character. When such a character is encountered by the parser, the parser treats it as a single-character token, returns the value of the character, and sets the ttype field to the character value. 
  • ordinaryChars(int, int) - Lets you specify a range of character values that are to be treated as ordinary characters.
  • parseNumbers() - Lets you specify that numbers should be parsed by this tokenizer. The syntax table of this tokenizer is modified so that each of the characters from 0 through 9 and including the minus sign and the period will have the "numeric" attribute.  This appears to be the default case.  When the parser encounters a word token that has the format of a double precision floating-point number, it treats the token as a number rather than a word.  It sets the ttype field to the value TT_NUMBER and puts the numeric value of the token into the nval field.
  • pushBack() - Causes the next call to the nextToken() method to return the current value in the ttype field, and not to modify the value in the nval or sval field. 
  • quoteChar(int) - Lets you specify that a matching pair of the character specified by the int parameter will delimit string constants.
  • resetSyntax() - Resets the parsing table so that all characters are treated as ordinary.  This makes it possible for you to use the other methods to construct your own parsing table without having to contend with default values.
  • commentChar(int) - Lets you specify a single-line comment delimiter.  All characters from the character to the end of the line are ignored.
  • slashSlashComments(boolean) - Determines if the tokenizer recognizes C++-style comments.  If the argument is true, any occurrence of two consecutive slash characters, //, is treated as the beginning of a comment that extends to the end of the line. 
  • slashStarComments(boolean) - Determines if the tokenizer recognizes C-style comments. If the argument is true, all text between successive occurrences of /* and */ is discarded. 
  • whitespaceChars(int, int) - Lets you specify a range of character values that will be treated as whitespace characters.  Whitespace characters separate tokens in the input stream.
  • wordChars(int, int) - Lets you specify a range of character values that will be treated as word characters. A word token consists of a word character followed by zero or more word characters or number characters. 
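
As a minimal sketch of two of the methods listed above, the following example uses resetSyntax() to build a parsing table from scratch, and uses pushBack() to read the same token twice.  The class name SyntaxTableDemo, the method names, and the sample text are hypothetical.  The choice of letters, digits, and the underscore as word characters is simply an assumption made for the illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical demo class; not part of the lesson's program.
class SyntaxTableDemo {
  // Tokenizes text using a table built from scratch.
  static List<String> scan(String text) {
    StreamTokenizer strTok =
                  new StreamTokenizer(new StringReader(text));
    strTok.resetSyntax();          //everything is ordinary
    strTok.wordChars('a', 'z');
    strTok.wordChars('A', 'Z');
    strTok.wordChars('0', '9');    //digits are word chars here
    strTok.wordChars('_', '_');
    strTok.whitespaceChars(0, ' ');
    List<String> result = new ArrayList<String>();
    try {
      while (strTok.nextToken() != StreamTokenizer.TT_EOF) {
        if (strTok.ttype == StreamTokenizer.TT_WORD) {
          result.add(strTok.sval);
        } else {
          result.add(String.valueOf((char) strTok.ttype));
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return result;
  }

  // Returns true if the token read after pushBack() has the
  // same type and the same sval as the token read before it.
  static boolean pushBackRepeats(String text) {
    StreamTokenizer strTok =
                  new StreamTokenizer(new StringReader(text));
    try {
      int first = strTok.nextToken();
      String firstSval = strTok.sval;
      strTok.pushBack();
      int again = strTok.nextToken();
      return first == again
             && Objects.equals(firstSval, strTok.sval);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // prints [x_1, =, y2, +, z]
    System.out.println(scan("x_1 = y2+z"));
    // prints true
    System.out.println(pushBackRepeats("alpha beta"));
  }
}
```

Because parseNumbers() is not invoked after resetSyntax(), no token in this sketch is ever reported as TT_NUMBER.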

Sample Program

The program begins by creating a test file containing the sequence of ASCII character values from 32 to 126 inclusive.  In addition, several extra characters are inserted in the sequence to illustrate special behavior of the StreamTokenizer class.

For example, a space character is inserted every seventh character.  A space character is a default word delimiter of the StreamTokenizer class and is used to separate tokens in the input stream.

A double quote character (") is inserted several character positions beyond the normal position of the character in the ASCII sequence to cause the characters in  between to be treated as a quoted string.  The double quote is a default delimiter for a quoted string for the StreamTokenizer.

Likewise, a single quote character (') or apostrophe is inserted several character positions beyond the normal position of the character in the ASCII sequence to cause the characters in between to be treated as a quoted string.  The single quote is also a default delimiter for a quoted string for the StreamTokenizer.

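As mentioned earlier, each input character is treated by the StreamTokenizer as though it were one of the following:

  • A whitespace character that delimits a word.
  • An ordinary character that is not a delimiter and is not part of a word or a number.
  • A character that is part of a word.
  • A character that is part of a number.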

As mentioned in the description of the methods, there are also some special capabilities having to do with the (/) character and the (*) as used in Java comments.  However, those capabilities are not illustrated by this program.

One of the methods of the class was used to cause the (/) character to be treated as an ordinary character.  Another method was used to cause the s, j, k, and l characters to be treated as whitespace characters.  Still another method was used to cause the semicolon, left angle bracket, and equal sign characters (;<=) to be treated as word characters.

The value returned by the nextToken() method indicates how a character is being interpreted according to the possibilities listed above.  A switch statement was used to analyze the return value and take the appropriate action based on that return value.

The file was read and parsed by an object of the StreamTokenizer class.  By using a combination of the return value from the nextToken() method and the value of the ttype instance variable, a display was produced showing how the file was parsed.

Finally, the file was read again in a simple sequential mode and displayed as numeric and character data to confirm its contents and make it possible to compare those contents with the behavior of the parser.

The output from running the program is shown in the comments in the program listing in a later section.  I will comment on some of the output as I discuss the interesting code fragments.

The program was tested using JDK 1.1.6 under Win95.  

Interesting Code Fragments

The entire program is contained in the main() method, the beginning of which is shown in the next fragment.  This fragment instantiates and initializes an output file stream object.
 

  public static void main(String[] args)
  {
    System.out.println(
                    "Start the program and write a file");
    try{
    FileOutputStream outFile = 
                          new FileOutputStream("junk.txt");

The next fragment shows the beginning of a for loop that writes the character values from 32 to 126 inclusive in the file.  

    for(int cnt = 32; cnt < 127; cnt++){

The next fragment shows the insertion of a space character every seventh character.  This resulted in the following type of output where the input stream was parsed into groups of seven characters per group.  The reason that the first group doesn't contain seven characters is that the characters immediately ahead of the upper case "A" were treated as ordinary characters and were not being controlled by the whitespace character.

returnValue=62 ttype=62 char = >
returnValue=63 ttype=63 char = ?
returnValue=64 ttype=64 char = @
TT_WORD ABCDE ttype=-3
TT_WORD FGHIJKL ttype=-3
TT_WORD MNOPQRS ttype=-3
TT_WORD TUVWXYZ ttype=-3  

      if(cnt%7 == 0){
        outFile.write(' ');//every seventh char is space
      }//end if

Because the file contains the entire ASCII sequence, it will contain the value of the double quote character (").  If left completely alone, the parser would consider that character to be the beginning of a quoted string and continue searching for the matching quote character until the end of file is encountered.  To avoid this problem, and also to illustrate the use of a quoted string, a second quote character was inserted immediately ahead of the ampersand character (&).  This caused the characters between the two double quotes to be treated as a quoted string, providing the following output:

Quoted string= #$% ttype=34 char = "  

      if(cnt == '&') outFile.write('\"');

The apostrophe or single quote is also a default delimiter for quoted strings.  For the same reason given above, a single quote was inserted into the file immediately ahead of the asterisk character.  This produced the following output:

Quoted string=()  ttype=39 char = '  

      if(cnt == '*') outFile.write('\'');
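
The handling of quoted strings, including a quote with no matching partner, can be sketched on a small scale as follows.  The class name QuoteDemo, the method name quotedBodies, and the sample text are hypothetical.  The last quoted string in the sample text is never closed, so its body runs to the end of the input.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the lesson's program.
class QuoteDemo {
  // Returns the body (sval) of each quoted string token.
  static List<String> quotedBodies(String text) {
    StreamTokenizer strTok =
                  new StreamTokenizer(new StringReader(text));
    List<String> bodies = new ArrayList<String>();
    try {
      while (strTok.nextToken() != StreamTokenizer.TT_EOF) {
        if (strTok.ttype == '"' || strTok.ttype == '\'') {
          bodies.add(strTok.sval);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bodies;
  }

  public static void main(String[] args) {
    // prints [a b, c d, open]
    System.out.println(
                   quotedBodies("x \"a b\" 'c d' \"open"));
  }
}
```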

Finally, we see the completion of the for loop and the writing of the sequential byte value into the file. This fragment also closes the output file stream.  

      outFile.write((char)cnt);//write the byte
    }//end for loop
    outFile.close();

The next fragment instantiates a FileInputStream object linked to the test file and wraps that input stream object in a Reader object.  Then it instantiates the StreamTokenizer object and wraps it around the Reader object. At this point, we can invoke the nextToken() method on the StreamTokenizer object to get the next token from the FileInputStream.  

    FileInputStream inFile = 
                           new FileInputStream("junk.txt");
    Reader rdr = new BufferedReader(
                            new InputStreamReader(inFile));
    StreamTokenizer strTok = new StreamTokenizer(rdr);

The next few fragments modify the parsing table relative to its default values.  First I caused the slash (/) character to be treated as an ordinary character which is not the case with the default.  This produces the following output which is typical output for an ordinary character.

returnValue=47 ttype=47 char = /  

    strTok.ordinaryChar('/');//forward slash

The next fragment uses two different calls to the same method to cause the s, j, k, and l characters to be treated as whitespace characters.  This produces the following output where the normal sequence of characters in the stream is broken into tokens due to these characters being there.  Note the break between the i and the m (jkl missing), and also between the r and the t (s missing).

TT_WORD a ttype=-3
TT_WORD bcdefgh ttype=-3
TT_WORD i ttype=-3
TT_WORD mno ttype=-3
TT_WORD pqr ttype=-3
TT_WORD tuv ttype=-3  

    strTok.whitespaceChars('s','s');
    strTok.whitespaceChars('j','l');
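
The same technique can be used for more practical purposes.  As a minimal sketch, the following example makes the comma a whitespace character so that a comma-separated line is broken into words.  The class name CommaSplit, the method name split, and the sample text are hypothetical.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the lesson's program.
class CommaSplit {
  // Returns the word tokens in text, with ',' as a delimiter.
  static List<String> split(String text) {
    StreamTokenizer strTok =
                  new StreamTokenizer(new StringReader(text));
    strTok.whitespaceChars(',', ',');
    List<String> words = new ArrayList<String>();
    try {
      while (strTok.nextToken() != StreamTokenizer.TT_EOF) {
        if (strTok.ttype == StreamTokenizer.TT_WORD) {
          words.add(strTok.sval);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return words;
  }

  public static void main(String[] args) {
    // prints [red, green, blue]
    System.out.println(split("red,green,blue"));
  }
}
```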

By default, the characters ; < = (semicolon, left angle bracket, and equal) are ordinary characters.  The following fragment causes them to be treated as word characters producing the following output where these three characters group together to form a word.

returnValue=58 ttype=58 char = :
TT_WORD ;<= ttype=-3
returnValue=62 ttype=62 char = >  

    strTok.wordChars(';','=');
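
A similar sketch shows the effect of wordChars() on an identifier that contains an underscore.  The class name UnderscoreDemo, the method name tokenCount, and the sample text are hypothetical.  By default the underscore is an ordinary character, so it splits the identifier into three tokens.  Once it is made a word character, the identifier is a single token.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of the lesson's program.
class UnderscoreDemo {
  // Counts all tokens in text, optionally treating '_' as a
  // word character.
  static int tokenCount(String text,
                        boolean underscoreIsWordChar) {
    StreamTokenizer strTok =
                  new StreamTokenizer(new StringReader(text));
    if (underscoreIsWordChar) {
      strTok.wordChars('_', '_');
    }
    int count = 0;
    try {
      while (strTok.nextToken() != StreamTokenizer.TT_EOF) {
        count++;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return count;
  }

  public static void main(String[] args) {
    // prints 3
    System.out.println(tokenCount("max_value", false));
    // prints 1
    System.out.println(tokenCount("max_value", true));
  }
}
```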

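That completes the modifications to the parsing table.  All that remains is to loop, calling the nextToken() method until it returns TT_EOF, and to display information about each token.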

The following fragment shows the loop, along with a switch statement that analyzes the return value from nextToken() to decide the type of token, and to display information about the token based on that decision.  The switch statement is followed by a couple of additional statements that use the value of ttype to display more information about the token.

Note that because of the conditional expression in the while loop, it should never be possible to match the value of TT_EOF in the switch statement.  I put the case there anyway for the sake of completeness.  Rather than break this code into smaller fragments, I decided to keep it intact.

If the return value from the method matches any of the class constants of the StreamTokenizer class, it is not an ordinary character.  In one of those cases, the value of sval containing a word is displayed, and in another of those cases, the value of nval containing a number is displayed.

In addition, two of the cases test for single and double quotes, and display the value of sval which contains the body of the quoted string in this case.

The default clause of the switch statement is executed when the return value is an ordinary character.

The code following the switch statement is used to illustrate that the value of ttype can be used for essentially the same purpose.  

    int returnValue = strTok.nextToken();//priming read
    while(returnValue != StreamTokenizer.TT_EOF){
      //Terminate the loop on end of file
      switch(returnValue){//determine the type of token
        case StreamTokenizer.TT_EOF ://shouldn't be here 
          System.out.print("TT_EOF ");break;
        case StreamTokenizer.TT_EOL ://end of line
          System.out.print("TT_EOL ");break;
        case StreamTokenizer.TT_NUMBER ://a number 
          System.out.print("TT_NUMBER " + strTok.nval);
          break;
        case StreamTokenizer.TT_WORD ://a word
          System.out.print("TT_WORD " + strTok.sval);break;
        case '\"' ://a double quote
          System.out.print("Quoted string=" + strTok.sval);
          break;
        case '\'' ://a single quote
          System.out.print("Quoted string=" + strTok.sval);
          break;
        default:System.out.print(//none of the above
                             "returnValue=" + returnValue);
      }//end switch
      
      //Use ttype variable to display additional info.
      System.out.print(" ttype=" + strTok.ttype + " ");  
      if(strTok.ttype >= 0)//neg is eof, number, or word 
        //display the character
        System.out.println("char = " + (char)strTok.ttype);
      else System.out.println();//new line on eol, etc.
      
      returnValue = strTok.nextToken();//get next token
    }//end while loop
    inFile.close();//close the file
    System.out.println(); //new line

Part of the output produced by this fragment is shown below.  Portions of this output were also shown earlier along with the code that prepared the parsing table to produce a specific kind of behavior.  The remaining output can be viewed in the complete program listing later in the lesson.

returnValue=33 ttype=33 char = !
Quoted string= #$% ttype=34 char = "
returnValue=38 ttype=38 char = &
Quoted string=()  ttype=39 char = '
returnValue=42 ttype=42 char = *
returnValue=43 ttype=43 char = +
returnValue=44 ttype=44 char = ,
TT_NUMBER -0.0 ttype=-2
returnValue=47 ttype=47 char = /
TT_NUMBER 0.0 ttype=-2
TT_NUMBER 1234567.0 ttype=-2
TT_NUMBER 89.0 ttype=-2
returnValue=58 ttype=58 char = :
TT_WORD ;<= ttype=-3
returnValue=62 ttype=62 char = >
returnValue=63 ttype=63 char = ?
returnValue=64 ttype=64 char = @
TT_WORD ABCDE ttype=-3
TT_WORD FGHIJKL ttype=-3
TT_WORD MNOPQRS ttype=-3
TT_WORD TUVWXYZ ttype=-3
returnValue=91 ttype=91 char = [
returnValue=92 ttype=92 char = \

The final fragment simply reads and displays the file again in sequential fashion in case you have any questions about the actual contents of the file.  You can view the output produced by this fragment in the next section.  

    System.out.println("Display file data");
    //Open the file again for simple read and display
    inFile = new FileInputStream("junk.txt");
    int data;
    int cnt = 0;
    while( (data = inFile.read()) != -1){
      System.out.print("" + data + " ");
      System.out.print((char)data + "   ");
      if(cnt++ == 4){//new line every fifth read
        System.out.println();//new line
        cnt = 0;//reinitialize the counter
      }//end if
    }//end while
    inFile.close();//close file again

The code that was not highlighted in the fragments above can be viewed in the complete listing of the program that follows in the next section.  

Program Listing

/* File StreamTok02.java Copyright 1998, R.G.Baldwin
Revised 9/24/98

The program begins by creating a file containing the
sequence of ASCII character values from 32 to 126 
inclusive.  In addition, several extra characters are 
inserted in the sequence to illustrate special behavior of
the StreamTokenizer class.

For example, a space character is inserted every seventh
character.  A space character is a default word delimiter
of the StreamTokenizer class.

A double quote character is inserted several character
positions beyond the normal position of the double quote 
character in the ASCII sequence to cause the characters in 
between to be treated as a quoted string.  The double quote
is a default delimiter for a quoted string for the 
StreamTokenizer.

Likewise, a single quote character is inserted several
character positions beyond the normal position of the 
single quote character in the ASCII sequence to cause the
characters in between to be treated as a quoted string.
The single quote is a default delimiter for a quoted
string for the StreamTokenizer.

Each input character is treated by the StreamTokenizer
as one of the following:
1.  A whitespace character that delimits a word.
2.  An ordinary character that is not a delimiter and is
    not part of either of the following.
3.  A character that is part of a word.
4.  A character that is part of a number.

There are also some special capabilities having to do with
the "/" character and the "*" as used in Java comments.

A method of the class was used to cause the "/" character
to be treated as an ordinary character.

A method of the class was used to cause the s, j, k, and l
characters to be treated as whitespace characters to 
delimit words.
    
A method of the class was used to cause the ";", "<", and
"=" characters to be treated as word characters.

The value returned by the nextToken() method indicates 
how a character is being interpreted according to the
possibilities listed above.  A switch statement was used
to analyze the return value and take the appropriate
action based on that return value.

A public instance variable of the class contains the same
information and can be used for similar purposes.

The file was read and parsed by an object of the 
StreamTokenizer class.  By using a combination of the 
return value from the nextToken() method and the
instance variable, a printout was produced showing how the
file was parsed by the object.

Finally, the file was read again and displayed as numeric
and character data to confirm its contents and make it 
possible to compare those contents with the behavior of the
parser.

Program output:
  
Start the program and write a file
returnValue=33 ttype=33 char = !
Quoted string= #$% ttype=34 char = "
returnValue=38 ttype=38 char = &
Quoted string=()  ttype=39 char = '
returnValue=42 ttype=42 char = *
returnValue=43 ttype=43 char = +
returnValue=44 ttype=44 char = ,
TT_NUMBER -0.0 ttype=-2 
returnValue=47 ttype=47 char = /
TT_NUMBER 0.0 ttype=-2 
TT_NUMBER 1234567.0 ttype=-2 
TT_NUMBER 89.0 ttype=-2 
returnValue=58 ttype=58 char = :
TT_WORD ;<= ttype=-3 
returnValue=62 ttype=62 char = >
returnValue=63 ttype=63 char = ?
returnValue=64 ttype=64 char = @
TT_WORD ABCDE ttype=-3 
TT_WORD FGHIJKL ttype=-3 
TT_WORD MNOPQRS ttype=-3 
TT_WORD TUVWXYZ ttype=-3 
returnValue=91 ttype=91 char = [
returnValue=92 ttype=92 char = \
returnValue=93 ttype=93 char = ]
returnValue=94 ttype=94 char = ^
returnValue=95 ttype=95 char = _
returnValue=96 ttype=96 char = `
TT_WORD a ttype=-3 
TT_WORD bcdefgh ttype=-3 
TT_WORD i ttype=-3 
TT_WORD mno ttype=-3 
TT_WORD pqr ttype=-3 
TT_WORD tuv ttype=-3 
TT_WORD wxyz ttype=-3 
returnValue=123 ttype=123 char = {
returnValue=124 ttype=124 char = |
returnValue=125 ttype=125 char = }
returnValue=126 ttype=126 char = ~

Display file data
32     33 !   34 "   32     35 #   
36 $   37 %   34 "   38 &   39 '   
40 (   41 )   32     39 '   42 *   
43 +   44 ,   45 -   46 .   47 /   
48 0   32     49 1   50 2   51 3   
52 4   53 5   54 6   55 7   32     
56 8   57 9   58 :   59 ;   60 <   
61 =   62 >   32     63 ?   64 @   
65 A   66 B   67 C   68 D   69 E   
32     70 F   71 G   72 H   73 I   
74 J   75 K   76 L   32     77 M   
78 N   79 O   80 P   81 Q   82 R   
83 S   32     84 T   85 U   86 V   
87 W   88 X   89 Y   90 Z   32     
91 [   92 \   93 ]   94 ^   95 _   
96 `   97 a   32     98 b   99 c   
100 d   101 e   102 f   103 g   104 h   
32     105 i   106 j   107 k   108 l   
109 m   110 n   111 o   32     112 p   
113 q   114 r   115 s   116 t   117 u   
118 v   32     119 w   120 x   121 y   
122 z   123 {   124 |   125 }   32     
126 ~   End of program

Tested using JDK 1.1.6 under Win95.
**********************************************************/

import java.io.*;

class StreamTok02{
  public static void main(String[] args)
  {
    System.out.println(
                    "Start the program and write a file");
    try{
    //Instantiate and initialize an output file stream 
    // object.
    FileOutputStream outFile = 
                          new FileOutputStream("junk.txt");
  
    //Write a series of bytes to the file and close it
    for(int cnt = 32; cnt < 127; cnt++){
      if(cnt%7 == 0){
        outFile.write(' ');//every seventh char is space
      }//end if
      
      //Force double quotes around several char following
      // the double quote in the ASCII sequence
      if(cnt == '&') outFile.write('\"');
      
      //Force single quotes around several char following
      // the single quote in the ASCII sequence
      if(cnt == '*') outFile.write('\'');
      
      outFile.write((char)cnt);//write the byte
    }//end for loop
    outFile.close();

    //Instantiate and initialize an input stream
    FileInputStream inFile = 
                           new FileInputStream("junk.txt");
    Reader rdr = new BufferedReader(
                            new InputStreamReader(inFile));
                            
    //Instantiate a StreamTokenizer object
    StreamTokenizer strTok = new StreamTokenizer(rdr);
    
    //Convert the "/" character to an ordinary char
    strTok.ordinaryChar('/');//forward slash
    
    //Make the lower-case s, j, k, and l, whitespace char.
    // They will function as word delimiters in the parsed
    // data stream.
    strTok.whitespaceChars('s','s');
    strTok.whitespaceChars('j','l');
    
    //Make the ;<= characters to be word characters
    strTok.wordChars(';','=');

    //Loop getting sequential tokens from the file,
    // interpreting, and displaying those tokens
    int returnValue = strTok.nextToken();//priming read
    while(returnValue != StreamTokenizer.TT_EOF){
      //Terminate the loop on end of file
      switch(returnValue){//determine the type of token
        case StreamTokenizer.TT_EOF ://shouldn't be here 
          System.out.print("TT_EOF ");break;
        case StreamTokenizer.TT_EOL ://end of line
          System.out.print("TT_EOL ");break;
        case StreamTokenizer.TT_NUMBER ://a number 
          System.out.print("TT_NUMBER " + strTok.nval);
          break;
        case StreamTokenizer.TT_WORD ://a word
          System.out.print("TT_WORD " + strTok.sval);break;
        case '\"' ://a double quote
          System.out.print("Quoted string=" + strTok.sval);
          break;
        case '\'' ://a single quote
          System.out.print("Quoted string=" + strTok.sval);
          break;
        default:System.out.print(//none of the above
                             "returnValue=" + returnValue);
      }//end switch
      
      //Use ttype variable to display additional info.
      System.out.print(" ttype=" + strTok.ttype + " ");  
      if(strTok.ttype >= 0)//neg is eof, number, or word 
        //display the character
        System.out.println("char = " + (char)strTok.ttype);
      else System.out.println();//new line on eol, etc.
      
      returnValue = strTok.nextToken();//get next token
    }//end while loop
    
    inFile.close();//close the file
    System.out.println(); //new line

    System.out.println("Display file data");
    //Open the file again for simple read and display
    inFile = new FileInputStream("junk.txt");
    int data;
    int cnt = 0;
    while( (data = inFile.read()) != -1){
      System.out.print("" + data + " ");
      System.out.print((char)data + "   ");
      if(cnt++ == 4){//new line every fifth read
        System.out.println();//new line
        cnt = 0;//reinitialize the counter
      }//end if
    }//end while
    inFile.close();//close file again
    
  }catch(IOException e){System.out.println(e);}
  System.out.println("End of program");
  }// end main
}//end class StreamTok02 definition

-end-