SWI-Prolog -- Loading Structured Documents

3.1 Loading Structured Documents

SGML or XML files are loaded through the common predicate load_structure/3. This is a predicate with many options. For simplicity a number of commonly used shorthands are provided: load_sgml_file/2, load_xml_file/2, and load_html_file/2.

load_structure(+Source, -ListOfContent, +Options)

Parse Source and return the resulting structure in ListOfContent. Source is either a term of the format stream(StreamHandle) or a file-name. Options is a list of options controlling the conversion process.

A proper XML document contains only a single toplevel element whose name matches the document type. Nevertheless, a list is returned for consistency with the representation of element content. The ListOfContent consists of the following types:

Atom: Atoms are used to represent CDATA. Note this is possible in SWI-Prolog, as there is no length-limit on atoms and atom garbage collection is provided.
element(Name, ListOfAttributes, ListOfContent): Name is the name of the element. In SGML, which is case-insensitive, all element names are returned as lowercase atoms.
ListOfAttributes is a list of Name=Value pairs for attributes. Attributes of type CDATA are returned literally. Multi-valued attributes (NAMES, etc.) are returned as a list of atoms. Handling of attributes of the types NUMBER and NUMBERS depends on the number(+NumberMode) option, set through set_sgml_parser/2 or load_structure/3. By default they are returned as atoms, but automatic conversion to Prolog integers is supported. ListOfContent defines the content of the element.
sdata(Text): If an entity with declared content-type SDATA is encountered, this term is returned holding the data in Text.
ndata(Text): If an entity with declared content-type NDATA is encountered, this term is returned holding the data in Text.
pi(Text): If a processing instruction is encountered (<?...?>), Text holds the text of the processing instruction. Please note that the <?xml ...?> instruction is handled internally.
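The term types listed above can be illustrated with a minimal sketch. It assumes SWI-Prolog 7 or later (for open_string/2); the helper names parse_xml_text/2, node_kind/2 and demo_kinds/1, as well as the sample document, are hypothetical and not part of library(sgml).

```prolog
:- use_module(library(sgml)).

% parse_xml_text(+Text, -Content): hypothetical helper that parses Text
% as XML by reading it through a string stream.
parse_xml_text(Text, Content) :-
    setup_call_cleanup(
        open_string(Text, In),
        load_structure(stream(In), Content, [dialect(xml)]),
        close(In)).

% node_kind(+Node, -Kind): classify one member of a ListOfContent.
node_kind(element(_, _, _), element) :- !.
node_kind(sdata(_), sdata) :- !.
node_kind(ndata(_), ndata) :- !.
node_kind(pi(_), pi) :- !.
node_kind(Text, cdata) :- atomic(Text).   % atom (default) or string

% demo_kinds(-Kinds): parse a made-up document and classify the
% children of its single toplevel element.
demo_kinds(Kinds) :-
    parse_xml_text("<note id=\"n1\">Hello<br/><?app do-it?></note>",
                   [element(note, _Attrs, Children)]),
    maplist(node_kind, Children, Kinds).
```

Calling demo_kinds(Kinds) should bind Kinds to [cdata, element, pi]: the text Hello is returned as an atom, the empty br element as an element/3 term, and the processing instruction as a pi/1 term.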

The Options list controls the conversion process. The currently defined options are listed below. Other options are passed to sgml_parse/2.

dtd(?DTD): Reference to a DTD object. If specified, the <!DOCTYPE ...> declaration is ignored and the document is parsed and validated against the provided DTD. If provided as a variable, the created DTD is returned. See section 3.5.
dialect(+Dialect): Specify the parsing dialect. Supported are sgml (default), html4, html5, html (same as html4), xhtml, xhtml5, xml and xmlns. See the option dialect of set_sgml_parser/2 for details.
shorttag(+Bool): Define whether SHORTTAG abbreviation is accepted. The default is true for SGML mode and false for the XML modes. Without SHORTTAG, a / is accepted with a warning as part of an unquoted attribute value, though /> still closes the element tag in XML mode. It may be set to false for parsing HTML documents to allow for unquoted URLs containing /.
space(+SpaceMode): Sets the ‘space-handling-mode’ for the initial environment. This mode is inherited by the other environments, which can override the inherited value using the XML reserved attribute xml:space. See section 3.2.
number(+NumberMode): Determines how attributes of type NUMBER and NUMBERS are handled. If token (default) they are passed as an atom. If integer the parser attempts to convert the value to an integer. If successful, the attribute is passed as a Prolog integer. Otherwise it is still passed as an atom. Note that SGML defines a numeric attribute to be a sequence of digits. The - sign is not allowed and 1 is different from 01. For this reason the default is to handle numeric attributes as tokens. If conversion to integer is enabled, negative values are silently accepted.
case_sensitive_attributes(+Boolean): Treat attribute values as case sensitive. The default is true for XML and false for SGML and HTML dialects.
case_preserving_attributes(+Boolean): Treat attribute values as case insensitive but do not alter their case. The default is false. Setting this option sets the case_sensitive_attributes to the same value. This option was added to support HTML quasi quotations and most likely has little value in other contexts.
system_entities(+Boolean): Define whether SYSTEM entities are expanded. The default is false.
defaults(+Bool): Determines how default and fixed values from the DTD are used. By default, these values are included in the output for attributes that do not appear in the source. If false, only the attributes occurring in the source are emitted.
entity(+Name, +Value): Defines (overwrites) an entity definition. At the moment, only CDATA entities can be specified with this construct. Multiple entity options are allowed.
file(+Name): Sets the name of the file on which errors are reported. Sets the line number to 1.
line(+Line): Sets the starting line-number for reporting errors.
max_memory(+Max): Sets the maximum buffer size in bytes available for input data and CDATA output. If this limit is reached a resource error is raised. Using max_memory(0) (the default) means no resource limit will be enforced.
ignore_doctype(+Bool): If set, doctype declarations in the document will be ignored. This can help prevent XXE attacks.
cdata(+Representation): Specify the representation of cdata elements. Supported are atom (default) and string. The choice is not obvious. Strings are allocated on the Prolog stacks and subject to normal stack garbage collection. They are quicker to create and avoid memory fragmentation. However, each copy of the same string is stored separately, while the text is shared if atoms are used. Strings are also useful for security-sensitive information as they are invisible to other threads and cannot be enumerated using, e.g., current_atom/1. Finally, using strings allows for resource usage limits using the global stack limit (see set_prolog_stack/2).
attribute_value(+Representation): Specify the representation of attribute values. Supported are atom (default), and string. See above for the advantages and disadvantages of using strings.
keep_prefix(+Boolean): If true, xmlns namespaces with prefixes are returned as ns(Prefix, URI) terms. If false (default), the prefix is ignored and the xmlns namespace is returned as just the URI.
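
As an illustration of how options are passed, the following minimal sketch combines dialect/1 with the cdata/1 and attribute_value/1 options described above. It assumes SWI-Prolog 7 or later (for open_string/2 and the string type); the helper name parse_with_strings/2 and the sample input are hypothetical.

```prolog
:- use_module(library(sgml)).

% parse_with_strings(+Text, -Content): hypothetical helper that parses
% Text as XML and asks for strings instead of atoms, both for CDATA
% content and for attribute values.
parse_with_strings(Text, Content) :-
    setup_call_cleanup(
        open_string(Text, In),
        load_structure(stream(In), Content,
                       [ dialect(xml),
                         cdata(string),
                         attribute_value(string)
                       ]),
        close(In)).

% demo_strings(-Value, -Body): extract the attribute value and the
% text content of a made-up one-element document.
demo_strings(Value, Body) :-
    parse_with_strings("<a x=\"1\">hi</a>",
                       [element(a, [x=Value], [Body])]).
```

With these options, Value and Body are expected to be Prolog strings rather than atoms, while element and attribute names remain atoms. Whether strings or atoms are the better choice depends on the trade-offs listed under cdata/1.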