<html-to-xml outputtype="..." advancedxmlescape="..." usecdata="..." specialentities="..." unicodechars="..." omitunknowntags="..." treatunknowntagsascontent="..." omitdeprtags="..." treatdeprtagsascontent="..." omitcomments="..." omithtmlenvelope="..." allowmultiwordattributes="..." allowhtmlinsideattributes="..." namespacesaware="..." prunetags="..."> body as html to be cleaned </html-to-xml>
Name | Required | Default | Description |
---|---|---|---|
outputtype | no | simple | Defines how the resulting XML will be serialized. Allowed values are simple, compact, browser-compact and pretty. |
advancedxmlescape | no | true | If true, an ampersand (&) that precedes a valid XML character sequence (`&XXX;`) is left as is instead of being escaped to `&amp;XXX;`. |
usecdata | no | true | If true, HtmlCleaner treats the contents of SCRIPT and STYLE tags as CDATA sections; otherwise they are treated as ordinary text and special characters are escaped. |
specialentities | no | true | If true, special HTML entities (e.g. `&ocirc;`, `&permil;`, `&times;`) are replaced with the Unicode characters they represent (ô, ‰, ×). This does not include the entities for &, <, >, " and '. |
unicodechars | no | true | If true, HTML characters given by their codes in the form `&#XXXX;` are replaced with the actual Unicode characters (e.g. `&#1078;` is replaced with ж). |
omitunknowntags | no | false | Tells whether to skip (ignore) unknown tags during cleanup. |
treatunknowntagsascontent | no | false | Tells whether to treat unknown tags as ordinary content, e.g. `<something...>` will be transformed to `&lt;something...&gt;`. This attribute applies only if omitunknowntags is set to false. |
omitdeprtags | no | false | Tells whether to skip (ignore) deprecated HTML tags during cleanup. |
treatdeprtagsascontent | no | false | Tells whether to treat deprecated tags as ordinary content, e.g. `<font...>` will be transformed to `&lt;font...&gt;`. This attribute applies only if omitdeprtags is set to false. |
omitcomments | no | false | Tells whether to skip HTML comments. |
omithtmlenvelope | no | false | Tells whether to remove HTML and BODY tags from the resulting XML, and use first tag in the BODY section instead. If BODY section doesn't contain any tags, then this attribute has no effect. |
allowmultiwordattributes | no | true | Tells the parser whether to allow attribute values consisting of multiple words. If true, att="a b c" is left as it is; if false, the parser splits it into att="a" b="b" c="c" (this is the default behaviour of browsers). |
allowhtmlinsideattributes | no | false | Tells the parser whether to allow HTML tags inside attribute values. For example, when this flag is set, att="here is `<a href='xxxx'>link</a>`" is left as it is; when it is not set, the parser ends the attribute value after "here is ". This flag makes sense only if allowmultiwordattributes is set as well. |
namespacesaware | no | true | If true, namespace prefixes found during parsing are preserved and all necessary XML namespace declarations are added to the root element. If false, all namespace prefixes and all xmlns namespace declarations are stripped. |
prunetags | no | empty string | Comma-separated list of tags that will be completely removed (with all nested elements) from the XML tree after parsing. For example, if prunetags is "script,style", the resulting XML will not contain scripts and styles. |
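
To make the table more concrete, here is a sketch of how several attributes might be combined on an inline HTML snippet rather than a downloaded page. The snippet, its `<something>` tag and the chosen attribute values are invented for illustration. The CDATA wrapper is only there so that HTML which is not well-formed XML can sit inside the configuration file.

```xml
<html-to-xml outputtype="compact"
             omitcomments="true"
             treatunknowntagsascontent="true"
             prunetags="script,style">
    <![CDATA[
        <html>
          <head><style>p { color: red }</style></head>
          <body>
            <!-- a comment, expected to be dropped (omitcomments) -->
            <p class="intro lead">Price &times; quantity</p>
            <something>unknown tag, expected to become escaped text</something>
            <script>alert("expected to be removed by prunetags");</script>
          </body>
        </html>
    ]]>
</html-to-xml>
```

Going by the descriptions above, the comment and the script and style elements should be absent from the result, `&times;` should appear as × (specialentities defaults to true), and the `<something>` tag should be emitted as escaped text instead of an element. The exact serialization depends on outputtype.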
<html-to-xml outputtype="pretty"> <http url="http://www.motors.ebay.com"/> </html-to-xml>
Downloads the www.motors.ebay.com page and cleans it up, producing pretty-printed XML content.
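
In practice the cleaned XML is usually passed on to another processor. The following is a minimal sketch of such a pipeline, assuming the standard Web-Harvest `config`, `var-def`, `var` and `xpath` elements. The variable names `cleanPage` and `links` and the XPath expression are hypothetical and chosen only for illustration.

```xml
<config>
    <!-- Download the page, clean it, and store the XML in a variable. -->
    <var-def name="cleanPage">
        <html-to-xml outputtype="pretty" prunetags="script,style" omitcomments="true">
            <http url="http://www.motors.ebay.com"/>
        </html-to-xml>
    </var-def>

    <!-- Query the cleaned XML; the expression is an example only. -->
    <var-def name="links">
        <xpath expression="//a/@href">
            <var name="cleanPage"/>
        </xpath>
    </var-def>
</config>
```

Pruning scripts and styles and omitting comments before querying keeps the XML smaller. Whether that is appropriate depends on what the later steps need to extract.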