3.1 Loading Structured Documents
SGML or XML files are loaded through the common predicate load_structure/3. This is a predicate with many options. For simplicity a number of commonly used shorthands are provided: load_sgml_file/2, load_xml_file/2, and load_html_file/2.
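As an illustration of the shorthands, the sketch below loads an XML file with load_xml_file/2 and collects the names of the elements at the top of the result. The helper name top_level_elements/2 and the file name used in the query are hypothetical and exist only for this example.

```prolog
:- use_module(library(sgml)).

% top_level_elements(+File, -Names)
% Hypothetical helper, not part of library(sgml): collect the names of
% the element(Name, Attributes, Content) terms at the top of the result.
top_level_elements(File, Names) :-
    load_xml_file(File, Content),
    findall(Name, member(element(Name, _, _), Content), Names).
```

A query such as `?- top_level_elements('example.xml', Names).` should bind Names to a one-element list for a well-formed XML document, assuming a file with that (hypothetical) name exists.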
- load_structure(+Source, -ListOfContent, +Options)
- Parse Source and return the resulting structure in ListOfContent. Source is either a term of the format stream(StreamHandle) or a file-name. Options is a list of options controlling the conversion process.
A proper XML document contains only a single toplevel element whose name matches the document type. Nevertheless, a list is returned for consistency with the representation of element content. The ListOfContent consists of the following types:
- Atom
- Atoms are used to represent CDATA. This is feasible in SWI-Prolog because atoms have no length limit and atom garbage collection is provided.
- element(Name, ListOfAttributes, ListOfContent)
- Name is the name of the element. Using SGML, which is case-insensitive, all element names are returned as lowercase atoms. ListOfAttributes is a list of Name=Value pairs for attributes. Attributes of type CDATA are returned literally. Multi-valued attributes (NAMES, etc.) are returned as a list of atoms. Handling attributes of the types NUMBER and NUMBERS depends on the setting of the number(+NumberMode) attribute through set_sgml_parser/2 or load_structure/3. By default they are returned as atoms, but automatic conversion to Prolog integers is supported. ListOfContent defines the content for the element.
- sdata(Text)
- If an entity with declared content-type SDATA is encountered, this term is returned holding the data in Text.
- ndata(Text)
- If an entity with declared content-type NDATA is encountered, this term is returned holding the data in Text.
- pi(Text)
- If a processing instruction is encountered (<?...?>), Text holds the text of the processing instruction. Please note that the <?xml ...?> instruction is handled internally.
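The content types listed above can be inspected with a small sketch like the following. It parses XML text through a stream(StreamHandle) source and classifies each item of the returned list. The helper names parse_xml_text/2 and content_kind/2 are hypothetical, and open_string/2 is assumed to be available (SWI-Prolog 7 or later).

```prolog
:- use_module(library(sgml)).

% parse_xml_text(+Text, -Content)
% Hypothetical helper: parse XML held in a string by wrapping it in a
% stream and passing stream(In) as the Source of load_structure/3.
parse_xml_text(Text, Content) :-
    setup_call_cleanup(
        open_string(Text, In),
        load_structure(stream(In), Content, [dialect(xml)]),
        close(In)).

% content_kind(+Item, -Kind)
% Hypothetical helper: name the kind of one item of a content list.
content_kind(element(_, _, _), element) :- !.
content_kind(sdata(_), sdata) :- !.
content_kind(ndata(_), ndata) :- !.
content_kind(pi(_), pi) :- !.
content_kind(Text, cdata) :- atomic(Text).   % atom (or string) holding CDATA
```

For example, `?- parse_xml_text("<note id=\"n1\"><b>world</b></note>", Content).` should yield `Content = [element(note, [id=n1], [element(b, [], [world])])]`: a list holding the single toplevel element, whose own content is again a list.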
The Options list controls the conversion process. The currently defined options are listed below. Other options are passed to sgml_parse/2.
- dtd(?DTD)
- Reference to a DTD object. If specified, the <!DOCTYPE ...> declaration is ignored and the document is parsed and validated against the provided DTD. If provided as a variable, the created DTD is returned. See section 3.5.
- dialect(+Dialect)
- Specify the parsing dialect. Supported are sgml (default), html4, html5, html (same as html4), xhtml, xhtml5, xml and xmlns. See the option dialect of set_sgml_parser/2 for details.
- shorttag(+Bool)
- Define whether SHORTTAG abbreviation is accepted. The default is true for SGML mode and false for the XML modes. Without SHORTTAG, a / is accepted with warning as part of an unquoted attribute-value, though /> still closes the element-tag in XML mode. It may be set to false for parsing HTML documents to allow for unquoted URLs containing /.
- space(+SpaceMode)
- Sets the ‘space-handling-mode’ for the initial environment. This mode is inherited by the other environments, which can override the inherited value using the XML reserved attribute xml:space. See section 3.2.
- number(+NumberMode)
- Determines how attributes of type NUMBER and NUMBERS are handled. If token (default) they are passed as an atom. If integer the parser attempts to convert the value to an integer. If successful, the attribute is passed as a Prolog integer. Otherwise it is still passed as an atom. Note that SGML defines a numeric attribute to be a sequence of digits. The - sign is not allowed and 1 is different from 01. For this reason the default is to handle numeric attributes as tokens. If conversion to integer is enabled, negative values are silently accepted.
- case_sensitive_attributes(+Boolean)
- Treat attribute values as case sensitive. The default is true for XML and false for SGML and HTML dialects.
- case_preserving_attributes(+Boolean)
- Treat attribute values as case insensitive but do not alter their case. The default is false. Setting this option sets the case_sensitive_attributes option to the same value. This option was added to support HTML quasi quotations and most likely has little value in other contexts.
- system_entities(+Boolean)
- Define whether SYSTEM entities are expanded. The default is false.
- defaults(+Bool)
- Determines how default and fixed values from the DTD are used. By default, defaults are included in the output if they do not appear in the source. If false, only the attributes occurring in the source are emitted.
- entity(+Name, +Value)
- Defines (overwrites) an entity definition. At the moment, only CDATA entities can be specified with this construct. Multiple entity options are allowed.
- file(+Name)
- Sets the name of the file on which errors are reported. Sets the line number to 1.
- line(+Line)
- Sets the starting line-number for reporting errors.
- max_memory(+Max)
- Sets the maximum buffer size in bytes available for input data and CDATA output. If this limit is reached a resource error is raised. Using max_memory(0) (the default) means no resource limit will be enforced.
- ignore_doctype(+Bool)
- If set, doctype declarations in the document will be ignored. This can help prevent XML external entity (XXE) attacks.
- cdata(+Representation)
- Specify the representation of cdata elements. Supported are atom (default), and string. The choice is not obvious. Strings are allocated on the Prolog stacks and subject to normal stack garbage collection. They are quicker to create and avoid memory fragmentation. However, multiple copies of the same string are stored separately, while the text is shared if atoms are used. Strings are also useful for security sensitive information as they are invisible to other threads and cannot be enumerated using, e.g., current_atom/1. Finally, using strings allows for resource usage limits using the global stack limit (see set_prolog_stack/2).
- attribute_value(+Representation)
- Specify the representation of attribute values. Supported are atom (default), and string. See above for the advantages and disadvantages of using strings.
- keep_prefix(+Boolean)
- If true, xmlns namespaces with prefixes are returned as ns(Prefix, URI) terms. If false (default), the prefix is ignored and the xmlns namespace is returned as just the URI.
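Several of the options above can be combined in one call. The following is a minimal sketch of a conservative configuration for documents from an untrusted source. The predicate name load_untrusted_xml/2 and the 10,000,000 byte limit are illustrative assumptions, not recommendations taken from this manual.

```prolog
:- use_module(library(sgml)).

% load_untrusted_xml(+File, -Content)
% Hypothetical wrapper around load_structure/3 that combines options
% described in this section.
load_untrusted_xml(File, Content) :-
    load_structure(File, Content,
                   [ dialect(xml),             % parse as XML
                     cdata(string),            % text as strings, not atoms
                     attribute_value(string),  % attribute values as strings
                     system_entities(false),   % do not expand SYSTEM entities
                     ignore_doctype(true),     % skip doctype declarations
                     max_memory(10000000)      % illustrative limit in bytes
                   ]).
```

Representing text as strings keeps document data out of the atom table, and a non-zero max_memory turns oversized input into a resource error instead of unbounded buffer growth. Whether these trade-offs fit depends on the application.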