jMatchParser – © 2006 - 2010 Michael Schierl, <schierlm at users dot sourceforge dot net>
jMatchParser is a utility to match existing, possibly computer-generated, text files against a template file to obtain interesting information that is hidden inside lots of boilerplate text in the file. The result of the parsing is an XML DOM tree that can be further parsed and/or evaluated using Java's standard XML DOM functions. For example, it can match an HTML export of some proprietary database application against a template that describes its HTML format to obtain the original field values again and store them as an XML or CSV file. As this library is mostly used for HTML files (and sometimes for convoluted XML files), there are special options to make HTML/XML parsing easier; they are described at the end of this document.
The most important difference between jMatchParser templates and any scripting language is that templates are declarative and non-deterministic. For example, if a file consists of any number of special blocks which all follow one of (say) three different structures, you just declare the three structures as alternatives and add a loop around them. You don't have to tell the parser how it can disambiguate the different structures. There are methods to give some hints, mainly to make parsing of very complicated nested structures faster, but they are not needed; the parser will try all alternatives that still match until there is only one alternative left. (On the other hand, if there are multiple different alternatives that can match the current input file completely, the parsing will fail.)
This makes it easy to start with a few example files, but still extend the template when parsing more real world files and different file structures are seen. And, as you explicitly have to tell the parser which parts to ignore while parsing, the parser will tell you if you forgot some part and you can decide if you want to parse or ignore it.
The template file is a line-based plain text file. The typical file extension for jMatchParser templates is .jmt. It can use any character set supported by Java (your code will have to specify the encoding that should be used). Every line starts (after any number of whitespace characters used as indentation) with a case-insensitive keyword giving the type of the line (called the command), followed by one space or tab and the parameters. Both case and whitespace (including leading and trailing whitespace) are significant in the parameters.
As template files can get quite confusing, you can add comments, too. Comments start with a # character at the beginning of the line (after indentation) and span to the end of the line. Multi-line comments or comments that start in the middle of the line are not supported.
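The line format described above can be sketched in a few lines of Java. This is only an illustration, not the library's own code: the class TemplateLineSketch and its split method are hypothetical names, and treating blank lines like comments is an assumption.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of jMatchParser.
class TemplateLineSketch {

    /** Returns {command, parameter}, or null for comments (and, by assumption, blank lines). */
    static String[] split(String line) {
        // Skip the indentation (any amount of leading whitespace).
        int start = 0;
        while (start < line.length() && Character.isWhitespace(line.charAt(start))) {
            start++;
        }
        if (start == line.length() || line.charAt(start) == '#') {
            return null;
        }
        // The keyword ends at the first space or tab.
        int end = start;
        while (end < line.length() && line.charAt(end) != ' ' && line.charAt(end) != '\t') {
            end++;
        }
        // Keywords are case-insensitive, so normalize them.
        String command = line.substring(start, end).toUpperCase(Locale.ROOT);
        // Exactly one separator character is dropped; the rest is kept verbatim,
        // including any further leading or trailing whitespace.
        String parameter = end < line.length() ? line.substring(end + 1) : "";
        return new String[] { command, parameter };
    }

    public static void main(String[] args) {
        String[] parts = split("    matchline  indented text ");
        System.out.println("[" + parts[0] + "] [" + parts[1] + "]");
    }
}
```

Note that only the first space or tab after the keyword is treated as a separator, so a parameter may itself begin with whitespace.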
In general, the parameters are treated as literal text. For example, to state that there has to be a line consisting of two dashes at that point in the input file, you just use the MATCHLINE -- command (no quotes, no escaping). This makes it easy to copy and paste any text from a sample input file into the template (there are also features to automatically generate a "dummy template" from an input file or a part of it, which will basically generate MATCHLINE lines like the one shown here).
There is one exception to this rule, which are special tags. They start with "«" and end with "»". To match one of those delimiters exactly, the special tags «[» and «]» are used. Special tags can be nested arbitrarily deep. If you happen to live in an English-speaking country and throw your hands up in horror now as you don't know how you should type these characters in your templates, please read the description of the $SPECIALTAGS preprocessor command that can be used to redefine these characters to anything else you like more (yes, even < and >, although that might be a bad idea when parsing HTML. << and >> should work well for HTML, though).
Most special tags can be used for two different things. They can either match some part of the input document, resulting in added XML to the output or changes of local variables, or they can be used to build a text (from information in local variables). Generally, every tag that can be used to build text can be used to match the same (literal) text. But tags used for matching dynamic content can usually not be used for building strings.
There are commands that need more than one parameter. There are two kinds of parameters. Short parameters are just a single word, and may not contain whitespace. Long parameters may contain any characters. Short parameters always occur before long parameters and are delimited by spaces or tabs. Long parameters (after the short parameters), if there is more than one, are delimited by the «,» special tag.
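As a rough sketch of this rule, the following Java code splits a parameter string into a known number of short parameters followed by long parameters. The class and method names are hypothetical. The sketch assumes the default tag markers and deliberately ignores nesting, so a «,» inside another special tag would be split incorrectly here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of jMatchParser.
class ParameterSplitSketch {

    // The «,» special tag, written with Unicode escapes to stay encoding-independent.
    static final String SEPARATOR = "\u00AB,\u00BB";

    /** Splits off shortCount short parameters, then splits the rest on the separator tag. */
    static List<String> split(String parameters, int shortCount) {
        List<String> result = new ArrayList<>();
        String rest = parameters;
        for (int i = 0; i < shortCount; i++) {
            // A short parameter is a single word ended by a space or tab.
            int end = 0;
            while (end < rest.length() && rest.charAt(end) != ' ' && rest.charAt(end) != '\t') {
                end++;
            }
            result.add(rest.substring(0, end));
            rest = end < rest.length() ? rest.substring(end + 1) : "";
        }
        // Long parameters keep all their characters, including whitespace.
        int from = 0;
        int index;
        while ((index = rest.indexOf(SEPARATOR, from)) >= 0) {
            result.add(rest.substring(from, index));
            from = index + SEPARATOR.length();
        }
        result.add(rest.substring(from));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(split("answer 42", 1));
    }
}
```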
While parsing text, the parser has to keep track of progress. This information is stored in the parser state; whenever the parser has to follow two different alternatives, the parser state is cloned and each copy is used to track progress of one alternative path.
The parser state contains local variables and a partially built XML document (the output). It also stores the progress of scanning through the input document and may optionally track all template commands that have been used while parsing (to detect unnecessary template code). While it is not possible for the parser to change anything already written to the XML document, and while it cannot scan backwards in the input document, local variables can be changed as often as desired.
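The idea of cloning the state for each alternative can be illustrated with a small, self-contained Java simulation. It is a simplified model and not jMatchParser's actual engine: State, Alternative, NUMBER, WORD and run are all hypothetical names, the state holds only a line position and local variables, and every input line is matched against every alternative.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model for illustration; not jMatchParser's real engine.
class StateCloningSketch {

    /** Snapshot of one path: position in the input plus local variables. */
    static final class State {
        final int position;
        final Map<String, String> variables;

        State(int position, Map<String, String> variables) {
            this.position = position;
            this.variables = variables;
        }

        /** Clones this state, records a variable, and moves on to the next line. */
        State advance(String name, String value) {
            Map<String, String> copy = new HashMap<>(variables);
            copy.put(name, value);
            return new State(position + 1, copy);
        }
    }

    /** An alternative returns a label if it matches the line, or null if it does not. */
    interface Alternative {
        String tryMatch(String line);
    }

    static final Alternative NUMBER = line -> line.matches("\\d+") ? "number" : null;
    static final Alternative WORD = line -> line.matches("\\w+") ? "word" : null;

    /** Follows every matching alternative on every line; returns all states that consumed the whole input. */
    static List<State> run(List<String> input, List<Alternative> alternatives) {
        Deque<State> pending = new ArrayDeque<>();
        pending.add(new State(0, new HashMap<>()));
        List<State> finished = new ArrayList<>();
        while (!pending.isEmpty()) {
            State state = pending.poll();
            if (state.position == input.size()) {
                finished.add(state);
                continue;
            }
            for (Alternative alternative : alternatives) {
                String label = alternative.tryMatch(input.get(state.position));
                if (label != null) {
                    // Each matching alternative gets its own copy of the state.
                    pending.add(state.advance("line" + state.position, label));
                }
            }
        }
        return finished;
    }

    public static void main(String[] args) {
        List<Alternative> alternatives = Arrays.asList(NUMBER, WORD);
        // One surviving state means an unambiguous parse; more than one means failure.
        System.out.println(run(Arrays.asList("abc", "xyz"), alternatives).size());
        System.out.println(run(Arrays.asList("abc", "42"), alternatives).size());
    }
}
```

In this model a parse would count as successful only when exactly one state is returned, which mirrors the rule that several complete matches make parsing fail.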
To build text (i. e. constant values), you can of course use literal text without special tags. But there are also a few special tags that can be used for this scenario:
Builds/matches the current value of the local variable name. An error is thrown if the variable is not defined. As with most other tags, you can use other build tags inside this tag to build the variable name dynamically.
Builds/matches the result of the given function applied to the given parameters. Parameters are always separated by «,»; short parameters are not supported. There are a few predefined functions; more application-specific functions can be added by implementing CallbackFunction and registering the new function with the parser. Function names are case sensitive.
Used to escape special characters.
As described above, all tags used for building text (and of course literal text as well) can be used to match their specific value.
In addition, there are more tags for matching text that optionally store the text in the XML output or in a local variable.
In case XML output or a local variable should be set unconditionally, the SET and SETEXPRESSION commands can be used that take one match tag and one building tag and assign the result of the building tag to the match tag.
This tag matches either the given text, or nothing. It is useful to exclude optional parts in the input without needing to build a regular expression and quote all metacharacters in text.
This is the generic syntax for all the matching tags that follow. This tag tries to match a regexp against the input (or .*? if none is given) and if the match was successful, the result will be stored in the target. Note that the regular expression may not contain any capturing groups (non-capturing groups are fine, though) as those are used internally to retrieve the results of multiple matches if more than one tag is used in an expression. Capturing groups in the regular expression will result in a runtime error.
target can be empty (if a regular expression is given); in this case, the regular expression will be matched (consumed) but not stored anywhere.
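Why capturing groups get in the way can be shown with plain java.util.regex. The sketch below assumes, as one plausible reading of the description above, that each tag's regular expression is wrapped in its own capturing group inside one combined pattern; the class and method names are hypothetical and the real implementation may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration of group numbering; not jMatchParser's real implementation.
class GroupNumberingSketch {

    /** Number of capturing groups in a regular expression. */
    static int countGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    /** Matches literal/tag pairs at the start of the input and returns one value per tag, or null. */
    static String[] match(String input, String[] literals, String[] tagRegexes) {
        StringBuilder combined = new StringBuilder();
        for (int i = 0; i < tagRegexes.length; i++) {
            if (countGroups(tagRegexes[i]) != 0) {
                // An extra group would shift the numbering of all later tags.
                throw new IllegalArgumentException("capturing group in: " + tagRegexes[i]);
            }
            combined.append(Pattern.quote(literals[i]));
            combined.append('(').append(tagRegexes[i]).append(')');
        }
        Matcher matcher = Pattern.compile(combined.toString()).matcher(input);
        if (!matcher.lookingAt()) {
            return null;
        }
        String[] values = new String[tagRegexes.length];
        for (int i = 0; i < values.length; i++) {
            values[i] = matcher.group(i + 1); // group i + 1 belongs to tag i
        }
        return values;
    }

    public static void main(String[] args) {
        String[] values = match("x=12, y=34",
                new String[] { "x=", ", y=" },
                new String[] { "\\d+", "(?:\\d)+" });
        System.out.println(values[0] + " " + values[1]);
    }
}
```

A non-capturing group such as (?:\d)+ does not add to the group count, which is why it stays usable.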
All names used with these match commands, regardless of whether they end up as local variables or XML tags/attributes, have to start with a (case-sensitive) letter and may only contain letters and digits.
There is one deprecated tag, which acts as a hybrid between matching and parsing tags.
When you add an ampersand (&) before a tag, its effect depends on whether this tag is a parsing tag or a building tag. When it is a building tag, it builds the result of the inner tag, but with HTML entities escaped (like the ^addentities function). When it is a parsing tag, it will strip all entities (like the ^stripentities function) from the results before they get assigned.
Therefore, it can be useful for parsing/building HTML elements.
It can have some unexpected side effects, for example when using HTML special characters in the regular expression used for a matching tag. The behavior is frozen now (no more bugs will be fixed) so that older scripts do not break, but for newer scripts you are encouraged to use the more explicit ^addentities and ^stripentities functions instead.
Preprocessor commands can basically be used to include common parts into more than one template. For includes to work, it is important that an appropriate MatchTemplateResolver is configured, either manually or by using the MatchTemplate constructors that take a File or a resource.
They can also be used to define symbols and conditionally disable parts of a template; this is mostly useful within included scripts (to have a basic script that does two different things depending on whether a symbol is defined or not).
Preprocessor commands start with a $ sign and are case insensitive, just like normal commands; the main difference is that preprocessor commands are parsed first (i. e. you can have malformed commands within an $IFDEF command and nobody complains if that symbol is not defined).
Includes the given file at that point. Note that the included file may not have any dangling blocks (i. e. blocks closed but not opened in the file or vice versa); every block has to be closed in the same file where it is opened. Of course, there may be blocks open at the point where the include is placed.
Define the given symbol. Can be used by $IFDEF etc. later. A definition cannot have a value; a given symbol is either defined or not.
Remove the definition of the given symbol. Can be used by $IFDEF etc. later.
Interpret the following lines only if the given symbol has or has not been defined.
Redefine the start and end markers for tags. The markers can be arbitrary strings, and the separators do not need to be slashes, but could be any character (only that the character must be both the first and the last character of the argument). The default value is:
Redefine the start and end markers for tags. Additionally, define quoting character sequences that can be used to match a literal marker. For example, a US-ASCII-only version for parsing HTML could be:
Assignment commands are used to assign values to variables or to produce output without parsing parts of the input document.
Build the result of value, and parse the parsetag(s) against this expression. The only difference between SET and SETEXPRESSION is that the latter supports whitespace in the parsetag (which is a long parameter), having the drawback that you have to use the «,» argument separator.
SET «myvar» 42
Sets the local variable myvar to 42.
Creates an XML block (element) and puts output of the content between BLOCK and ENDBLOCK inside of this XML block.
SET «@value» 42
SET «/» See Douglas Adams
Creates this XML in the output: <answer value="42">See Douglas Adams</answer>
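Since the parse result is an XML DOM tree, it can be read with the standard org.w3c.dom API. The following sketch parses the XML shown above from a string as a stand-in for the tree returned by the parser; the class DomReadSketch and its parseRoot method are hypothetical and only exist for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Stand-in for reading a parse result; the XML is parsed from a string here.
class DomReadSketch {

    /** Parses an XML string and returns its root element. */
    static Element parseRoot(String xml) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return document.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element answer = parseRoot("<answer value=\"42\">See Douglas Adams</answer>");
        // The attribute set via «@value» and the text set via «/» are read back here.
        System.out.println(answer.getTagName());
        System.out.println(answer.getAttribute("value"));
        System.out.println(answer.getTextContent());
    }
}
```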
Control flow commands are nondeterministic, in the sense that you don't have to decide how often a loop should run or which branch to take. The structure of the data will decide that for you.
Repeat the block from mincount to maxcount times. If omitted, maxcount is unlimited and mincount is 0.
LOOP 3 7
Match 3 to 7 empty lines.
Short form for encapsulating the given command in a LOOP / ENDLOOP block.
Match any number of empty lines
Provide two or more alternatives that can be parsed here. Nondeterminism will automatically choose the correct one (the one that matches).
The short and long forms are semantically identical. Both are provided because the short forms are hard to read for newcomers, while the long forms are a lot longer.
When starting with OPTALT, an implicit alternative is added that is empty. OPTALT may be the only given alternative (as the empty one is added automatically).
Match either Done. or Completed. or Congratulations.
MATCHLINE Two optional
Match either the two optional lines or not.
Short form for encapsulating the given command in an OPTALT / ENDALT block.
OPTIONAL MATCHLINE Optional line
Match either the optional line or not.
These two commands do not change the result of parsing (but they may make it fail if used incorrectly), but are hints to the nondeterminism engine. As today's computers will have to act deterministically, nondeterminism is simulated by evaluating all alternatives in order. There are situations where the current nondeterminism engine tends to run into a situation where a lot of useless alternatives are determined to be checked later, which will accumulate a lot of memory, where in fact, they could be quickly proven to not match if they were checked first.
The CHECKOTHERS command will check if there are other alternatives pending, and will try to evaluate them first.
The FINISHOTHERS command is more drastic: if it is not possible to finish all other commands (or more than one command is waiting at a FINISHOTHERS command), parsing will fail. This can be used at points where you are sure that there is only one place in the input file where this place can be reached, effectively splitting the template into multiple templates where each one is nondeterministic, but each one may only have one outcome (like the whole template). This is also useful for debugging when you are sure there should not be any nondeterminism at some given place, and you want to make sure there is really none (for example, that you did not accidentally add two ALTERNATIVES that both match the content).
Define a block of commands (which has to be properly nested) as a template with a given name. This template can be called with CALLTEMPLATE like a subroutine in an imperative programming language. Note that templates share the state of the parent template; they can be recursive, though.
Match the lines 1 2 + 1 2.
Acts similarly to the SET command, but can only be used inside a DEFTEMPLATE template and will only set variables (without the parse tag around).
When used for the first time in a DEFTEMPLATE template, the old value of the variable is remembered and restored when the function quits.
This is useful to have local variables in recursive templates.
Check that the current input start can be matched against matchexpression. If the expression contains a newline, it may match more than one line, else it may match up to one line. The CHECKNOT command checks that the input cannot be matched against the expression. The current input is not consumed.
Check that the current input line can be matched against matchexpression. The CHECKLINENOT command checks that the input line cannot be matched against the expression. The current input is not consumed. CL is an alias for CHECKLINE.
Check the beginning of variable's content against the match expression. To check the whole content, anchor the expression at the end by using the «:$» regexp at the end of the match expression. The variable is not changed.
CHECKTEXT answer 42«:$»
Check that the content of the variable answer is exactly 42.
See the chapter Features for HTML and XML parsing.
Act the same way as the CHECK counterparts, except that the matching part will be removed from the input or the variable.
Although jMatchParser uses a nondeterministic parser, each individual match is done deterministically - if you use a greedy expression, it will match as much as possible, even if a later match will fail.
Consider matching the following simple example against aaab:
As the first line will consume all the a characters, the second line will never match. If you use a nongreedy match «:a+?», it will not work either, as the first expression will not match anything.
To solve this dilemma, either do both matching in one command, MATCH «:a+»ab, or use the MATCHANY command for the first one. Note that for the MATCHANY command, greedy matches have to be used, since jMatchParser will only try variants that are shorter than the original match.
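The greedy matching problem is not specific to jMatchParser and can be reproduced with java.util.regex alone. The sketch below models two separate deterministic steps, one combined expression, and a "try shorter variants" strategy in the spirit of what is described for MATCHANY. All names are hypothetical, and how MATCHANY really enumerates variants is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-regex illustration of the problem; not jMatchParser code.
class GreedyMatchSketch {

    /** Two separate deterministic steps: the first match is final, the second must match the rest. */
    static boolean twoSteps(String input, String first, String second) {
        Matcher matcher = Pattern.compile(first).matcher(input);
        if (!matcher.lookingAt()) {
            return false;
        }
        return Pattern.compile(second).matcher(input.substring(matcher.end())).matches();
    }

    /** One combined expression: the regex engine may backtrack inside it. */
    static boolean oneStep(String input, String combined) {
        return Pattern.compile(combined).matcher(input).matches();
    }

    /** Starts with the greedy match and tries shorter prefixes; returns the prefix length or -1. */
    static int shorterVariants(String input, String first, String second) {
        Matcher matcher = Pattern.compile(first).matcher(input);
        if (!matcher.lookingAt()) {
            return -1;
        }
        for (int length = matcher.end(); length >= 0; length--) {
            boolean firstOk = Pattern.compile(first).matcher(input.substring(0, length)).matches();
            boolean secondOk = Pattern.compile(second).matcher(input.substring(length)).matches();
            if (firstOk && secondOk) {
                return length;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(twoSteps("aaab", "a+", "ab"));        // greedy first step eats all a's
        System.out.println(oneStep("aaab", "a+ab"));             // backtracking inside one expression
        System.out.println(shorterVariants("aaab", "a+", "ab")); // shorter variant found
    }
}
```

The third method only works because the initial match is greedy: it has to start from the longest match so that all shorter candidates are still available.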
All commands in this section are deterministic, i. e. they act as if a FINISHOTHERS command was directly before them.
Add the given expression to the parser buffer. The parser buffer is initially empty and is used to collect the input for the next pass. The WRITELINE and WL commands act the same, they add a newline at the end of the written expression.
This command starts the next parsing pass. All the input must have been consumed, and no XML output may have been generated when this command is used.
Example (making all input uppercase and adding ! to each line before parsing):
MATCHLINE HEADER REPEATED MATCHLINE TEXT
Write some text to the debug stream (which can be set on the parser, but is stdout by default). A newline will be added automatically.
Match a line and write it prefixed with ML to the debug stream (which can be set on the parser, but is stdout by default). This is useful when using multi-pass parsing or formats, to create a "template" for the final match template, that can be reused by copy&paste.
Apply a formatter to the rest of the unparsed input and start the next pass. More than one formatter can be applied by separating the formatters by commas, which will apply them from left to right.
Formatters can be added by implementing the Formatter interface; there are also some predefined formatters.
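The left-to-right application of a comma-separated formatter list can be sketched as follows. This does not use the real Formatter interface, whose signature is not shown in this document; instead it models a formatter as a UnaryOperator<String>. The formatter names upper and mark are made up for this example and are not predefined formatters.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Simplified model of formatter chaining; not the real Formatter interface.
class FormatterChainSketch {

    static final Map<String, UnaryOperator<String>> REGISTRY = new HashMap<>();

    static {
        // Hypothetical formatters, chosen so that their order is visible in the result.
        REGISTRY.put("upper", text -> text.toUpperCase(Locale.ROOT));
        REGISTRY.put("mark", FormatterChainSketch::markLines);
    }

    /** Prefixes every line with "x:". */
    static String markLines(String text) {
        String[] lines = text.split("\n");
        for (int i = 0; i < lines.length; i++) {
            lines[i] = "x:" + lines[i];
        }
        return String.join("\n", lines);
    }

    /** Applies the comma-separated formatters from left to right. */
    static String apply(String formatterList, String input) {
        String result = input;
        for (String name : formatterList.split(",")) {
            UnaryOperator<String> formatter = REGISTRY.get(name.trim());
            if (formatter == null) {
                throw new IllegalArgumentException("unknown formatter: " + name);
            }
            result = formatter.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(apply("upper,mark", "a\nb"));
        System.out.println(apply("mark,upper", "a\nb"));
    }
}
```

Swapping the two names changes the result, because the second formatter always sees the output of the first.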
Supported predefined formatters:
Sometimes you do not want to have XML files, but CSV files. The Parser class supports parseToCSV and convertDOMToCSV to make this conversion easy.
Basically, you write your parser as if you wanted to create XML with <row> and <cell> tags. The tag names for these two tags do not even matter - the outer tag is row, the inner tag is cell. Add text content into the cells to put them into the CSV file. You can also add an attribute file to the row, which is a zero-based index of the CSV file to write into, to make it easier to create different CSV files in one parsing pass (in that case you have to pass more than one CSV file to the method, of course).
SET «/cell» Last Name
SET «/cell» First Name
SET «/cell» Adams
SET «/cell» Douglas
SET «@file» 1
SET «/cell» This is for the second file
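The row/cell convention can be illustrated with a small DOM-to-CSV conversion written against the standard DOM API. This is a sketch of the idea and not the library's convertDOMToCSV: the class and method names are hypothetical, and it assumes that rows are direct children of the root element and that cells are quoted with double quotes when needed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustration only; not the library's convertDOMToCSV.
class DomToCsvSketch {

    /** Quotes a cell if it contains a comma, a double quote, or a line break. */
    static String quote(String cell) {
        if (cell.contains(",") || cell.contains("\"") || cell.contains("\n") || cell.contains("\r")) {
            return "\"" + cell.replace("\"", "\"\"") + "\"";
        }
        return cell;
    }

    /** Outer elements are rows, inner elements are cells; the file attribute selects the CSV file. */
    static List<String> toCsv(String xml, int fileCount) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse XML", e);
        }
        List<StringBuilder> files = new ArrayList<>();
        for (int i = 0; i < fileCount; i++) {
            files.add(new StringBuilder());
        }
        for (Node row = root.getFirstChild(); row != null; row = row.getNextSibling()) {
            if (!(row instanceof Element)) {
                continue; // element names do not matter, only the nesting does
            }
            String file = ((Element) row).getAttribute("file");
            StringBuilder target = files.get(file.isEmpty() ? 0 : Integer.parseInt(file));
            boolean first = true;
            for (Node cell = row.getFirstChild(); cell != null; cell = cell.getNextSibling()) {
                if (!(cell instanceof Element)) {
                    continue;
                }
                if (!first) {
                    target.append(',');
                }
                target.append(quote(cell.getTextContent()));
                first = false;
            }
            target.append('\n');
        }
        List<String> result = new ArrayList<>();
        for (StringBuilder file : files) {
            result.add(file.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<t><row><cell>Adams</cell><cell>Douglas</cell></row>"
                + "<row file=\"1\"><cell>This is for the second file</cell></row></t>";
        System.out.print(toCsv(xml, 2).get(0));
    }
}
```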
jMatchParser is basically optimized for text-based files. In practice, most of the files to be parsed are XML or HTML files, and in those files the formatting often changes without any impact on the output file.
Since every formatting change might require changing the templates, there are special formatters available that try to normalize "unimportant" formatting away.
The XML formatter will parse the XML file and build a line from every XML node (except attributes). Start and end tags will be written verbatim (with the attributes sorted), text nodes will be prefixed by a dash, comments by a !, CDATA by ] and processing instructions by ?.
For example, <x y="1">Drag&Drop</x> will be represented as
When parsing XML documents, there are often nodes with lots of sub-nodes that are not interesting for parsing (like ad areas or navigation on a website); therefore the MATCHXML command can be used to match a start tag, all its content, and the end tag.
The CHECKXML command is the counterpart that checks whether the next few lines really contain a valid XML element.
For HTML formatting, jtidy formatters are available; a html formatter combines JTidy with the XML formatter, therefore creating XHTML structured in exactly the way described above.