Cleans up the content of the body and transforms it into valid XML. The body is usually HTML obtained by executing the http processor. The actual parsing and cleaning is delegated to the HtmlCleaner tool. Although no special tuning is needed in most cases, the cleaner can be configured with several parameters, which are set through the processor's attributes.

Syntax

<html-to-xml outputtype="..." advancedxmlescape="..." usecdata="..."
             specialentities="..." unicodechars="..." omitunknowntags="..."
             treatunknowntagsascontent="..." omitdeprtags="..."
             treatdeprtagsascontent="..." omitcomments="..."
             omithtmlenvelope="..." allowmultiwordattributes="..."
             allowhtmlinsideattributes="..." namespacesaware="..."
             prunetags="...">
    body as html to be cleaned
</html-to-xml>

Attributes

Name Required Default Description
outputtype no simple Defines how the resulting XML will be serialized. Allowed values are simple, compact, browser-compact and pretty.
advancedxmlescape no true If this parameter is set to true, an ampersand sign (&) that precedes a valid XML character sequence (&XXX;) will not be escaped as &amp;XXX;
usecdata no true If true, HtmlCleaner treats the contents of SCRIPT and STYLE tags as CDATA sections; otherwise they are treated as ordinary text (special characters will be escaped).
specialentities no true If true, special HTML entities (e.g. &ocirc;, &permil;, &times;) are replaced with the unicode characters they represent (ô, ‰, ×). This doesn't include &, <, >, ", '.
unicodechars no true If true, HTML characters represented by their codes in the form &#XXXX; are replaced with real unicode characters (e.g. &#1078; is replaced with ж).
omitunknowntags no false Tells whether to skip (ignore) unknown tags during cleanup.
treatunknowntagsascontent no false Tells whether to treat unknown tags as ordinary content, i.e. <something...> will be transformed to &lt;something...&gt;. This attribute is applicable only if omitunknowntags is set to false.
omitdeprtags no false Tells whether to skip (ignore) deprecated HTML tags during cleanup.
treatdeprtagsascontent no false Tells whether to treat deprecated tags as ordinary content, i.e. <font...> will be transformed to &lt;font...&gt;. This attribute is applicable only if omitdeprtags is set to false.
omitcomments no false Tells whether to skip HTML comments.
omithtmlenvelope no false Tells whether to remove the HTML and BODY tags from the resulting XML and use the first tag in the BODY section instead. If the BODY section doesn't contain any tags, this attribute has no effect.
allowmultiwordattributes no true Tells the parser whether to allow attribute values consisting of multiple words. If true, the attribute att="a b c" is kept as it is; if false, the parser splits it into att="a" b="b" c="c" (this is the default behaviour of browsers).
allowhtmlinsideattributes no false Tells the parser whether to allow HTML tags inside attribute values. For example, when this flag is set, att="here is <a href='xxxx'>link</a>" is kept as it is; if not, the parser ends the attribute value after "here is ". This flag makes sense only if allowmultiwordattributes is set as well.
namespacesaware no true If true, namespace prefixes found during parsing will be preserved and all necessary XML namespace declarations will be added to the root element. If false, all namespace prefixes and all xmlns namespace declarations will be stripped.
prunetags no empty string Comma-separated list of tags that will be completely removed (with all nested elements) from the XML tree after parsing. For example, if prunetags is "script,style", the resulting XML will not contain scripts and styles.
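
As an illustration of how several of these attributes can be combined, the following sketch drops scripts, styles and comments and removes the HTML and BODY envelope. The URL is a placeholder chosen for this illustration, and the attribute values are just one plausible combination, not a recommended setting.

<html-to-xml outputtype="compact" omitcomments="true"
             omithtmlenvelope="true" prunetags="script,style">
    <!-- placeholder URL, assumed for this illustration only -->
    <http url="http://www.example.com/"/>
</html-to-xml>

Attributes that are left out keep the defaults listed in the table above.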

Example

<html-to-xml outputtype="pretty">
    <http url="http://www.motors.ebay.com"/>
</html-to-xml>

Downloads the www.motors.ebay.com page and cleans it up, producing pretty-printed XML content.
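
Cleaned output is usually handed on to an XML-oriented processor. The sketch below assumes that the xpath processor of the same configuration language is available. The XPath expression is illustrative only. Neither is described in this section, so treat both as assumptions.

<xpath expression="//a/@href">
    <!-- html-to-xml turns the downloaded HTML into well-formed XML,
         which the enclosing xpath processor (assumed here) can query -->
    <html-to-xml omitcomments="true">
        <http url="http://www.motors.ebay.com"/>
    </html-to-xml>
</xpath>

Under these assumptions, the expression would select the href attribute of every link on the cleaned page.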