Day 21 — Strings IV

In today's lecture we look at the StringBuffer class and Java's regular expression mechanism.

Mutable Strings: StringBuffer

A String encapsulates an immutable sequence of characters. Even though the String class provides transformer methods to change characters or substrings in a string, these methods always return a new String object instead of changing the existing object.
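
A small sketch of what this means in practice, using toUpperCase as the transformer method. The class name ImmutableDemo and the helper shout are made up for this illustration.

```java
import java.io.PrintStream;

public class ImmutableDemo
{
   // Returns an upper-case version of s; s itself is left unchanged.
   public static String shout(String s)
   {
      return s.toUpperCase();
   }

   public static void main(String[] args)
   {
      PrintStream output = System.out;

      String original = "hello";
      String changed = shout(original);

      output.println(original);   // the original String is untouched
      output.println(changed);    // the result is a separate String object
   }
}
```

To keep the changed text, the client must store the returned reference; calling original.toUpperCase() and ignoring the result has no visible effect.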

Sometimes the client would prefer to change the existing sequence of characters. The StringBuffer and StringBuilder objects encapsulate a mutable sequence of characters.

For CSE1020 it does not matter if you use StringBuffer or StringBuilder (their APIs are identical). We will prefer StringBuffer to stay consistent with the textbook, but the StringBuffer documentation actually recommends StringBuilder as the preferred choice for common usage.
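
As a rough illustration of the matching APIs, the sketch below builds the same text with each class using the same calls. The class and method names are hypothetical and are not from the textbook.

```java
import java.io.PrintStream;

public class BufferOrBuilder
{
   // Builds "CSE1020" with a StringBuffer.
   public static String withBuffer()
   {
      StringBuffer s = new StringBuffer("CSE");
      s.append(1020);
      return s.toString();
   }

   // Builds the same text with a StringBuilder; only the type name changes.
   public static String withBuilder()
   {
      StringBuilder s = new StringBuilder("CSE");
      s.append(1020);
      return s.toString();
   }

   public static void main(String[] args)
   {
      PrintStream output = System.out;
      output.println(withBuffer());
      output.println(withBuilder());
   }
}
```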

StringBuffer

Generally, you should prefer using ordinary strings, but sometimes string buffers produce simpler code or more efficient code.

A standard example is reading in a text file one line at a time:

import java.io.File;
import java.io.PrintStream;
import java.util.Scanner;
import javax.swing.JFileChooser;

public class ReadFile
{
   public static void main(String[] args) throws java.io.IOException
   {
      PrintStream output = System.out;

      JFileChooser chooser = new JFileChooser();
      int returnVal = chooser.showOpenDialog(null);
      if (returnVal == JFileChooser.APPROVE_OPTION)
      {
         File file = chooser.getSelectedFile();
         Scanner input = new Scanner(file);
         
         String text = "";
         int lines = 0;
         for (; input.hasNextLine(); lines++)
         {
            String line = input.nextLine();
            text = text + line;
         }
         output.println("File has " + lines + " lines");
      }
   }
}

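On my laptop PC, this program takes about 10 seconds to read in a file with 10,000 words where each word is on a separate line. The time goes into the concatenation: because strings are immutable, each evaluation of text + line creates a new String and copies every character read so far.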

Appending to a String Buffer

Instead of concatenating to a string, we can append the text to a string buffer.

The client can append anything to the end of a string buffer using the overloaded append methods; it is the string representation that is appended to the string buffer.
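
A short sketch of a few of the overloaded append methods, assuming made-up values for illustration. Each argument is converted to its string representation before it is added to the end of the buffer.

```java
import java.io.PrintStream;

public class AppendDemo
{
   // Appends values of several different types to one string buffer.
   public static String build()
   {
      StringBuffer s = new StringBuffer();
      s.append("eggs: ");      // a String
      s.append(2);             // an int
      s.append(", price: ");
      s.append(3.5);           // a double
      s.append(' ');           // a char
      s.append(true);          // a boolean
      return s.toString();
   }

   public static void main(String[] args)
   {
      PrintStream output = System.out;
      output.println(build());
   }
}
```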

import java.io.File;
import java.io.PrintStream;
import java.util.Scanner;
import javax.swing.JFileChooser;

public class ReadFile2
{
   public static void main(String[] args) throws java.io.IOException
   {
      PrintStream output = System.out;

      JFileChooser chooser = new JFileChooser();
      int returnVal = chooser.showOpenDialog(null);
      if (returnVal == JFileChooser.APPROVE_OPTION)
      {
         File file = chooser.getSelectedFile();
         Scanner input = new Scanner(file);
         
         StringBuffer text = new StringBuffer();
         int lines = 0;
         for (; input.hasNextLine(); lines++)
         {
            String line = input.nextLine();
            text.append(line);
         }
         output.println("File has " + lines + " lines");
      }
   }
}

On my laptop PC, this program takes less than 1 second to read in a file with 10,000 words where each word is on a separate line.
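
The difference can be explored without a file. The sketch below is one possible experiment, not a careful benchmark: it joins the same word many times using both approaches and reports rough timings. The class name, the method names, and the count of 10,000 are assumptions chosen to mirror the file example; actual timings will vary by machine.

```java
import java.io.PrintStream;

public class ConcatVersusAppend
{
   // Joins n copies of word by repeated String concatenation.
   public static String byConcatenation(String word, int n)
   {
      String text = "";
      for (int i = 0; i < n; i++)
      {
         text = text + word;   // creates a new String each time
      }
      return text;
   }

   // Joins n copies of word by appending to one StringBuffer.
   public static String byAppending(String word, int n)
   {
      StringBuffer text = new StringBuffer();
      for (int i = 0; i < n; i++)
      {
         text.append(word);    // modifies the existing buffer
      }
      return text.toString();
   }

   public static void main(String[] args)
   {
      PrintStream output = System.out;
      int n = 10000;   // assumed size, mirroring the 10,000-word file

      long start = System.nanoTime();
      String a = byConcatenation("word", n);
      long middle = System.nanoTime();
      String b = byAppending("word", n);
      long end = System.nanoTime();

      output.println("Same result: " + a.equals(b));
      output.println("Concatenation: " + (middle - start) / 1000000 + " ms");
      output.println("Appending:     " + (end - middle) / 1000000 + " ms");
   }
}
```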

Inserting into a String Buffer

The client can insert anything into a string buffer at any valid position using the overloaded insert methods; it is the string representation that is inserted into the string buffer.

      StringBuffer s =
        new StringBuffer("I had breakfast.");
      output.println(s.toString());
      
      s.insert(6, " eggs for ");
      output.println(s.toString());

      int numEggs = 2;
      s.insert(6, numEggs);
      output.println(s.toString());

The above code fragment prints:

I had breakfast.
I had  eggs for breakfast.
I had 2 eggs for breakfast.

Deleting from a String Buffer

The client can delete a single character from a string buffer using the deleteCharAt method. A range of characters can be deleted using the delete method.

      StringBuffer s =
        new StringBuffer("I had 2 eggs for breakfast.");
      output.println(s.toString());
      
      s.delete(6, 8);
      output.println(s.toString());

      s.delete(6, 15);
      output.println(s.toString());

The above code fragment prints:

I had 2 eggs for breakfast.
I had eggs for breakfast.
I had breakfast.

The complete program:

import java.io.PrintStream;

public class StringBufferExample
{
   public static void main(String[] args)
   {
      PrintStream output = System.out;
      
      StringBuffer s =
        new StringBuffer("I had breakfast.");
      output.println(s.toString());
      
      s.insert(6, " eggs for ");
      output.println(s.toString());

      int numEggs = 2;
      s.insert(6, numEggs);
      output.println(s.toString());

      s.delete(6, 8);
      output.println(s.toString());

      s.delete(6, 15);
      output.println(s.toString());
   }
}
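
The examples above use delete only. A minimal sketch of deleteCharAt, which removes the single character at a given index, might look like the following. The class name and the helper removeCharAt are hypothetical.

```java
import java.io.PrintStream;

public class DeleteCharDemo
{
   // Returns a copy of text with the character at the given index removed.
   public static String removeCharAt(String text, int index)
   {
      StringBuffer s = new StringBuffer(text);
      s.deleteCharAt(index);
      return s.toString();
   }

   public static void main(String[] args)
   {
      PrintStream output = System.out;

      // Index 6 holds the character '2'; removing it leaves two spaces.
      output.println(removeCharAt("I had 2 eggs for breakfast.", 6));
   }
}
```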

StringBuffer Summary

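You should consider using a string buffer or string builder if:

you need to change a sequence of characters in place (by appending, inserting, or deleting)
you build a long string out of many small pieces, as in the file-reading example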

Regular Expressions

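In Java, a regular expression (or regex) is a string that describes a pattern of characters in a concise, unambiguous fashion. Regexes are typically used for pattern matching. Some examples are determining if a string:

starts with a given sequence of characters such as "foo"
is a last name that starts with a letter between 'A' and 'M'
represents a whole number, with or without a sign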

The term regular expression means something else in formal language theory (where the term was invented) which you will learn about in CSE2001: Introduction to Theory of Computation.

The Most Basic Regex

The most basic form of pattern matching supported by the Java regex API is the matching of a string literal. For example, the string "foo" matches the string "foo".

      String s = "foo";
      String regex = "foo";
      output.println(s.matches(regex));

The above code fragment will print true. Notice that this example of matching is equivalent to using equals.

import java.io.PrintStream;

public class Regex1
{
   public static void main(String[] args)
   {
      PrintStream output = System.out;
      
      String s = "foo";
      String regex = "foo";
      output.println(s.matches(regex));
   }
}

Any Character

Suppose that we are now interested in matching "foo" followed by any character (including whitespace such as a space or tab; by default, line terminators such as '\n' are excluded). In the Java regex API, the period '.' is used to match any character.

The regex "foo." means the string "foo" followed by any character.

      String regex = "foo.";
      
      output.println("foo".matches(regex));
      output.println("goo".matches(regex));
      output.println("hello".matches(regex));
      output.println("foofighter".matches(regex));
      output.println();
      
      output.println("foot".matches(regex));
      output.println("fool".matches(regex));
      output.println("foo9".matches(regex));
      output.println("foo ".matches(regex));

The above code fragment will print

false
false
false
false

true
true
true
true

The complete program:

import java.io.PrintStream;

public class Regex2
{
   public static void main(String[] args)
   {
      PrintStream output = System.out;
      
      String regex = "foo.";
      
      output.println("foo".matches(regex));
      output.println("goo".matches(regex));
      output.println("hello".matches(regex));
      output.println("foofighter".matches(regex));
      output.println();
      
      output.println("foot".matches(regex));
      output.println("fool".matches(regex));
      output.println("foo9".matches(regex));
      output.println("foo ".matches(regex));
   }
}
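
Note that "foofighter" does not match because the matches method requires the whole string to fit the pattern. As a sketch of the difference, the hypothetical class below compares matches with the find method of java.util.regex.Matcher, which looks for the pattern anywhere inside the string.

```java
import java.io.PrintStream;
import java.util.regex.Pattern;

public class WholeOrPart
{
   // True if the entire string s matches regex.
   public static boolean matchesWhole(String s, String regex)
   {
      return s.matches(regex);
   }

   // True if some substring of s matches regex.
   public static boolean matchesPart(String s, String regex)
   {
      return Pattern.compile(regex).matcher(s).find();
   }

   public static void main(String[] args)
   {
      PrintStream output = System.out;
      String regex = "foo.";

      output.println(matchesWhole("foofighter", regex));   // whole string must fit
      output.println(matchesPart("foofighter", regex));    // "foof" is enough
   }
}
```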

Metacharacters

The period '.' in a regular expression is a metacharacter—a character with special meaning interpreted by the matcher.

The full set of metacharacters is ([{\^-$|]})?*+. and we will see examples of most if not all of them in the following slides.

Sometimes, you will want a metacharacter to be treated as a normal character. For example, suppose you wanted to match only the string "foo.". The following code does not work:

      String regex = "foo.";
      boolean matches = s.matches(regex);

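because regex will also match strings such as "foo!", "food", and "fooy". To match only "foo." you must put a backslash before the metacharacter. Because the backslash is itself an escape character inside a Java string literal, it is written as "\\" in source code.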

      String regex = "foo\\.";
      
      output.println("food".matches(regex));
      output.println("foo.".matches(regex));

The above code fragment prints:

false
true

The complete program:

import java.io.PrintStream;

public class Regex3
{
   public static void main(String[] args)
   {
      PrintStream output = System.out;
      
      String regex = "foo\\.";
      
      output.println("food".matches(regex));
      output.println("foo.".matches(regex));
   }
}
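
When an entire string should be treated literally, escaping each metacharacter by hand can be avoided. One option, sketched below, is the static method Pattern.quote from java.util.regex, which returns a regex that matches its argument character for character. The class name and the helper matchesLiterally are made up for this example.

```java
import java.io.PrintStream;
import java.util.regex.Pattern;

public class QuoteDemo
{
   // True if s is exactly the given literal text, with no metacharacters interpreted.
   public static boolean matchesLiterally(String s, String literal)
   {
      return s.matches(Pattern.quote(literal));
   }

   public static void main(String[] args)
   {
      PrintStream output = System.out;

      output.println(matchesLiterally("food", "foo."));
      output.println(matchesLiterally("foo.", "foo."));
   }
}
```

This prints false and then true, the same result as the hand-escaped regex "foo\\.".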

Zero or More Times

Suppose that now we are interested in matching any string that starts with "foo". Such strings can be defined as:

"foo" followed by zero or more characters

We already know that '.' means any character. The metacharacter '*' means zero or more times. ".*" means any character zero or more times.

      String regex = "foo.*";
      
      output.println("xfoo".matches(regex));
      output.println();
      
      output.println("foo".matches(regex));
      output.println("foo.".matches(regex));
      output.println("foobar".matches(regex));
      output.println("foofighter".matches(regex));

The above code fragment prints:

false

true
true
true
true

The complete program:

import java.io.PrintStream;

public class Regex4
{
   public static void main(String[] args)
   {
      PrintStream output = System.out;
      
      String regex = "foo.*";
      
      output.println("xfoo".matches(regex));
      output.println();
      
      output.println("foo".matches(regex));
      output.println("foo.".matches(regex));
      output.println("foobar".matches(regex));
      output.println("foofighter".matches(regex));
   }
}
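
For single-line strings such as these, the regex "foo.*" gives the same answer as the String method startsWith. The sketch below compares the two; the class and method names are hypothetical. It also shows one case where they differ: by default '.' does not match a line terminator, so a string containing '\n' after "foo" fails the regex test.

```java
import java.io.PrintStream;

public class StartsWithFoo
{
   // Tests for a leading "foo" using the regex from this slide.
   public static boolean byRegex(String s)
   {
      return s.matches("foo.*");
   }

   // Tests for a leading "foo" using String.startsWith.
   public static boolean byStartsWith(String s)
   {
      return s.startsWith("foo");
   }

   public static void main(String[] args)
   {
      PrintStream output = System.out;
      String[] tests = {"xfoo", "foo", "foo.", "foobar", "foofighter", "foo\nbar"};

      for (String s : tests)
      {
         // The last test string is the only one where the two answers differ.
         output.println(byRegex(s) + " " + byStartsWith(s));
      }
   }
}
```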

Range of Characters

Suppose you have a string that represents a simple (no hyphens or multiword names) last name. You want to know if the name starts with a letter between 'A' and 'M'. If you allow one-letter names, then such strings are defined as:

one character in the range A-M followed by zero or more lowercase letters

The string "[A-M]" means one character in the range A-M. The string "[a-z]" means one character in the range a-z. "*" means zero or more times.

      String regex = "[A-M][a-z]*";
      
      output.println("Newton".matches(regex));
      output.println();
      
      output.println("Gauss".matches(regex));
      output.println("Bernoulli".matches(regex));
      output.println("Aabbccdd".matches(regex));

The above code fragment prints:

false

true
true
true

The complete program:

import java.io.PrintStream;

public class Regex5
{
   public static void main(String[] args)
   {
      PrintStream output = System.out;
      
      String regex = "[A-M][a-z]*";
      
      output.println("Newton".matches(regex));
      output.println();
      
      output.println("Gauss".matches(regex));
      output.println("Bernoulli".matches(regex));
      output.println("Aabbccdd".matches(regex));
   }
}

Range of Characters and Union

Suppose that in the previous example of matching last names you don't care about the case of the first letter (the name can start with an upper or lowercase letter). You could write the regular expression as a union of ranges.

The string "[a-m[A-M]]" means one character in the range "a-m" or "A-M".

      String regex = "[a-m[A-M]][a-z]*";
      
      output.println("Newton".matches(regex));
      output.println();
      
      output.println("gauss".matches(regex));
      output.println("Bernoulli".matches(regex));
      output.println("Aabbccdd".matches(regex));

The above code fragment prints:

false

true
true
true

The complete program:

import java.io.PrintStream;

public class Regex6
{
   public static void main(String[] args)
   {
      PrintStream output = System.out;
      
      String regex = "[a-m[A-M]][a-z]*";
      
      output.println("Newton".matches(regex));
      output.println();
      
      output.println("gauss".matches(regex));
      output.println("Bernoulli".matches(regex));
      output.println("Aabbccdd".matches(regex));
   }
}
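
A union of ranges can also be written without the nested brackets by listing the ranges side by side, as in "[a-mA-M]". The sketch below, with made-up class and method names, checks that both spellings agree on the names used above.

```java
import java.io.PrintStream;

public class UnionDemo
{
   // Union written with a nested character class.
   public static boolean nested(String name)
   {
      return name.matches("[a-m[A-M]][a-z]*");
   }

   // The same union written as two ranges side by side.
   public static boolean flat(String name)
   {
      return name.matches("[a-mA-M][a-z]*");
   }

   public static void main(String[] args)
   {
      PrintStream output = System.out;
      String[] names = {"Newton", "gauss", "Bernoulli", "Aabbccdd"};

      for (String name : names)
      {
         output.println(nested(name) + " " + flat(name));
      }
   }
}
```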

Predefined Character Classes

Suppose you wanted to check if a string was an unsigned (no + or -) whole number. Such strings could be defined as:

any digit one or more times

You could use "[0-9]" to represent any digit, but because matching digits is a common operation, there is a predefined character class "\\d" for digits. The plus '+' metacharacter means one or more times, so "\\d+" means any digit one or more times.

      String regex = "\\d+";
      
      output.println("12a".matches(regex));
      output.println();
      
      output.println("1".matches(regex));
      output.println("861435".matches(regex));
      output.println("000000000".matches(regex));

The above code fragment prints:

false

true
true
true

The complete program:

import java.io.PrintStream;

public class Regex7
{
   public static void main(String[] args)
   {
      PrintStream output = System.out;
      
      String regex = "\\d+";
      
      output.println("12a".matches(regex));
      output.println();
      
      output.println("1".matches(regex));
      output.println("861435".matches(regex));
      output.println("000000000".matches(regex));
   }
}

Zero or One

A signed whole number has a + or - or nothing in front of the digits. Such strings could be defined as:

zero or one of [+-] followed by any digit one or more times

The metacharacter '?' means zero or one, so the string "[+-]?" means zero or one character from the set "+, -". The regex matching a signed whole number is "[+-]?\\d+" or "[+-]?[0-9]+".

      String regex = "[+-]?\\d+";

      output.println("1".matches(regex));
      output.println("+861435".matches(regex));
      output.println("-400".matches(regex));

The above code fragment prints:

true
true
true

The complete program:

import java.io.PrintStream;

public class Regex8
{
   public static void main(String[] args)
   {
      PrintStream output = System.out;
      
      String regex = "[+-]?\\d+";

      output.println("1".matches(regex));
      output.println("+861435".matches(regex));
      output.println("-400".matches(regex));
   }
}
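
One common use of such a regex is to check input before converting it to a number. The sketch below is only an illustration: the class name, the method names, and the idea of returning a fallback value are all assumptions. It also assumes the number is small enough to fit in an int, because Integer.parseInt throws a NumberFormatException for digit strings that are too large.

```java
import java.io.PrintStream;

public class SignedNumberDemo
{
   // True if s is an optional + or - followed by one or more digits.
   public static boolean isSignedWholeNumber(String s)
   {
      return s.matches("[+-]?\\d+");
   }

   // Converts s to an int if it looks like a signed whole number;
   // otherwise returns the fallback value.
   // Assumes the value fits in an int (no overflow check is made).
   public static int parseOrDefault(String s, int fallback)
   {
      if (isSignedWholeNumber(s))
      {
         return Integer.parseInt(s);
      }
      return fallback;
   }

   public static void main(String[] args)
   {
      PrintStream output = System.out;

      output.println(parseOrDefault("-400", 0));
      output.println(parseOrDefault("+861435", 0));
      output.println(parseOrDefault("12a", 0));   // not a number, so the fallback is used
   }
}
```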

Some Random Examples

"a+a+" two or more a's
"[^a]" any character except a (not a)
"[^0-9]" or "[^\\d]" any character except a digit (not a digit)
"[a-mq-z]" or "[a-z&&[^n-p]]" a through z but not n, o nor p
".{3,}" at least 3 characters
".{3,5}" at least 3 but no more than 5 characters
look it up email addresses
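
A quick way to experiment with patterns like these is to try them against a few test strings. The sketch below does so for several of the patterns, with negation written in character-class form such as "[^a]". The class name, the helper check, and the test strings are chosen only for this illustration.

```java
import java.io.PrintStream;

public class RandomExamples
{
   // True if the whole string s matches regex.
   public static boolean check(String regex, String s)
   {
      return s.matches(regex);
   }

   public static void main(String[] args)
   {
      PrintStream output = System.out;

      // two or more a's
      output.println(check("a+a+", "a") + " " + check("a+a+", "aaa"));

      // any single character except a
      output.println(check("[^a]", "a") + " " + check("[^a]", "b"));

      // any single character except a digit
      output.println(check("[^0-9]", "7") + " " + check("[^0-9]", "x"));

      // a through z but not n, o nor p
      output.println(check("[a-z&&[^n-p]]", "o") + " " + check("[a-z&&[^n-p]]", "q"));

      // at least 3 but no more than 5 characters
      output.println(check(".{3,5}", "abcdef") + " " + check(".{3,5}", "abcde"));
   }
}
```

Each line prints false followed by true.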

To Do For Next Lecture