Good programming practices

Simple SGML parser

October 30th, 2011

There are many parsing libraries for XML, HTML and SGML, but sometimes a simple solution is enough and a whole library is unnecessary. I created four classes that parse a string into a list of markup elements and plain text. The parser uses regular expressions to extract all elements. The example below shows how to use it:

    public static void main(String[] args) {
        String text = "<center>Centered text"
                + "<font face=\"Times New Roman\" unused color=\"\" size=\"6\">"
                + "Formatted<br/>New line</center></font>";
        List<Element> elements = new SgmlParser().parseText(text);
        for (Element e : elements) {
            if (e instanceof TextElement) {
                System.out.println("Text: " + e.getText());
            } else if (e instanceof SgmlElement) {
                System.out.println("Markup: <" + e.getMarkup() + ">, "
                        + (((SgmlElement) e).isClosing() ? "closing" : "opening")
                        + ", attributes: " + e.getAttributes());
            }
        }
    }

The output of the example is below:
    Markup: <center>, opening, attributes: {}
    Text: Centered text
    Markup: <font>, opening, attributes: {face=Times New Roman, color=, size=6}
    Text: Formatted
    Markup: <br>, closing, attributes: {}
    Text: New line
    Markup: <center>, closing, attributes: {}
    Markup: <font>, closing, attributes: {}

See and download the source code and JUnit tests.
The classes have no external dependencies except the FindBugs annotations. The code is released under the ISC license.
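
For readers who want a feel for the regular-expression approach without opening the download, here is a minimal sketch of how such a tokenizer might work. It is not the code of the four classes described above: the class name SgmlTokenizerSketch, both patterns and the string-based result are assumptions made for illustration. To match the output shown earlier, the sketch assumes that a self-closing tag such as <br/> is reported as closing and that an attribute without a quoted value (such as "unused") is skipped.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the parser from this post.
class SgmlTokenizerSketch {

    // One tag: optional leading "/", a name, raw attribute text, optional trailing "/".
    // Limitation: an attribute value containing "<" or ">" is not handled.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*?)(/?)>");

    // One attribute written as name="value"; attributes without a quoted value do not match.
    private static final Pattern ATTRIBUTE =
            Pattern.compile("([A-Za-z][A-Za-z0-9-]*)\\s*=\\s*\"([^\"]*)\"");

    // Splits the input into descriptions of tags and of the plain text between them.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher tag = TAG.matcher(input);
        int last = 0; // end of the previous tag
        while (tag.find()) {
            if (tag.start() > last) {
                tokens.add("Text: " + input.substring(last, tag.start()));
            }
            // Assumption: both </name> and <name/> count as closing.
            boolean closing = !tag.group(1).isEmpty() || !tag.group(4).isEmpty();
            tokens.add("Markup: <" + tag.group(2) + ">, "
                    + (closing ? "closing" : "opening")
                    + ", attributes: " + parseAttributes(tag.group(3)));
            last = tag.end();
        }
        if (last < input.length()) {
            tokens.add("Text: " + input.substring(last)); // trailing text after the last tag
        }
        return tokens;
    }

    // Collects name="value" pairs in the order in which they appear.
    static Map<String, String> parseAttributes(String raw) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher attribute = ATTRIBUTE.matcher(raw);
        while (attribute.find()) {
            attributes.put(attribute.group(1), attribute.group(2));
        }
        return attributes;
    }

    public static void main(String[] args) {
        String text = "<center>Centered text"
                + "<font face=\"Times New Roman\" unused color=\"\" size=\"6\">"
                + "Formatted<br/>New line</center></font>";
        for (String token : tokenize(text)) {
            System.out.println(token);
        }
    }
}
```

Like the parser in the post, the sketch only produces a flat list and does not check nesting, so the crossed </center></font> pair in the example is accepted as is. A real implementation would likely return typed objects instead of strings.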
