module soup.tf
The tweakstreet/soup module helps extract information from HTML and XML documents.
The soup module treats markup documents as ‘soups of tags’. It favors the ability to extract sensible data from documents over strict adherence to standards. It is usually able to process documents with syntax problems.
library html
The html library contains functions to extract data from HTML documents.
select (string html, any select=nil, dict options=nil) -> any
Given an HTML document as html, selects tags from the document and returns them as structured data.
If the given HTML is a fragment, any missing tags such as head or body tags are implicitly added. All tag and attribute names are normalized to lowercase.
The select parameter can be a CSS selector string, a dict of CSS selectors, or a list of CSS selectors.
The function returns one list per given selector, mirroring the structure of the select argument.
For example:
- If select is a string, a list of matching nodes is returned.
- If select is a dict of strings, a dict of lists containing matching nodes is returned.
- If select is a list of strings, a list of lists containing matching nodes is returned.
- If select is nil, the entire HTML document is returned.
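As a sketch of the shape rule above, the following calls pass the same document with a string, a dict, and a list as select. The sample markup, the variable name page, and the dict keys :links and :items are made up for illustration. The comments only restate the rule; no output is shown because it has not been verified.

```tweakflow
> page: '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'

# string selector: expect a list of nodes matching "a"
> html.select(page, "a")

# dict of selectors: expect a dict with the keys :links and :items,
# each holding a list of matching nodes
> html.select(page, {:links "a", :items "li"})

# list of selectors: expect a list of two lists, in the order of the selectors given
> html.select(page, ["a", "li"])
```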
Each matching HTML node is represented as a dict in the return value. Any attributes and child nodes are placed in corresponding keys.
The options parameter allows fine-tuning the return value.
If any of the following keys are present, they are interpreted as follows:
Key | Handling |
---|---|
:attributes | A string taking one of the values "flat" or "nested". Governs how node attributes are represented in the return value. |
:nested_attributes_name | A string. If :attributes is "nested", the dict containing the attributes is placed at this key. |
:flat_attributes_prefix | A string. If :attributes is "flat", attribute keys of tags are prefixed with this string. |
:text | A string taking one of the values "flat" or "nested". Governs how the text content of tags is represented in the return value. |
:nested_text_name | A string. The key of any text nodes created. |
:descendant_text | A list of strings, each acting as a CSS selector. If a returned node matches one of the given selectors, the return value contains all text contained in the node, including the text of the node's children. |
:flat_text | A list of strings, each acting as a CSS selector. If the :text option is set to "nested", nodes matching the given selectors are treated as exceptions and represented as flat, if possible. |
:nested_text | A list of strings, each acting as a CSS selector. If the :text option is set to "flat", nodes matching the given selectors are treated as exceptions and represented as nested. |
:tags | A string taking one of the values "flat" or "nested". Governs how nested tags are represented in the return value. |
:flat_tags | A list of strings, each acting as a CSS selector. If the :tags option is set to "nested", nodes matching the given selectors are treated as exceptions and represented as flat, if possible. |
:nested_tags | A list of strings, each acting as a CSS selector. If the :tags option is set to "flat", nodes matching the given selectors are treated as exceptions and represented as nested. |
:keep_ns_prefix | A boolean value. If true, any XML namespace prefixes are kept in the return value. If false, they are dropped and only the local names of tags and attributes are used. |
If any other keys are present in options, they are ignored.
If options is nil, the following default value is used:
{
  :attributes "flat",
  :nested_attributes_name "@attributes",
  :flat_attributes_prefix "",
  :text "flat",
  :nested_text_name "text",
  :descendant_text [],
  :flat_text [],
  :nested_text [],
  :tags "flat",
  :flat_tags [],
  :nested_tags [],
  :keep_ns_prefix false
}
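To illustrate how the attribute options might be used, the calls below change only the attribute handling. This is a sketch: it assumes that keys omitted from options fall back to the defaults listed above, which the examples in this section suggest but the text does not state outright. The sample markup and the variable name link are made up, and no output is shown because it has not been verified.

```tweakflow
> link: '<a href="/index.html" title="Home">home</a>'

# nested attributes: per the table, href and title should be grouped in a dict
# at the key named by :nested_attributes_name ("@attributes" by default)
> html.select(link, "a", {:attributes "nested"})

# flat attributes with a prefix: per the table, attribute keys should be prefixed with "@"
> html.select(link, "a", {:flat_attributes_prefix "@"})
```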
> html.select('<p>Click <a href="/index.html">here</a> to go somewhere else</p>', "p")
[
  {
    :a {
      :href "/index.html",
      :text "here"
    },
    :text "Click here to go somewhere else"
  }
]
> html.select('<p><b>Kilroy</b><i> was </i><b>here</b><p>And here!', "p")
[
  {
    :b ["Kilroy", "here"],
    :i "was"
  },
  "And here!"
]
> html.select('<p><b>Kilroy</b><i> was </i><b>here</b><p>And here!', "p", {:descendant_text ['p'], :nested_text ['p']})
[
  {
    :b ["Kilroy", "here"],
    :i "was",
    :text "Kilroy was here"
  },
  {
    :text "And here!"
  }
]
> html.select(nil)
nil
library xml
select (string xml, any select=nil, dict options=nil) -> any
Given an XML document as xml, selects tags from the document and returns them as structured data.
The select and options parameters behave exactly the same as in html.select.
> \e
catalog: '<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer''s Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
</catalog>'
\e
"<?xml version=\"1.0\"?>
<catalog>
<book id=\"bk101\">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
</catalog>"
# by default a single book in the catalog appears as a flat property
> xml.select(catalog)
[
  {
    :catalog {
      :book {
        :genre "Computer",
        :price "44.95",
        :author "Gambardella, Matthew",
        :title "XML Developer's Guide",
        :id "bk101",
        :description "An in-depth look at creating applications with XML.",
        :publish_date "2000-10-01"
      }
    }
  }
]
# forcing books to always be a list, regardless of how many happen to be in the catalog
> xml.select(catalog, nil, {:nested_tags ["catalog > book"]})
[
  {
    :catalog {
      :book [
        {
          :genre "Computer",
          :price "44.95",
          :author "Gambardella, Matthew",
          :title "XML Developer's Guide",
          :id "bk101",
          :description "An in-depth look at creating applications with XML.",
          :publish_date "2000-10-01"
        }
      ]
    }
  }
]
> xml.select(nil)
nil
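The :keep_ns_prefix option described for html.select can be sketched with a small namespaced document. The markup, the urn:example namespace, and the variable name feed are made up for illustration. The comments restate what the option description says should happen; no output is shown because it has not been verified.

```tweakflow
> feed: '<a:feed xmlns:a="urn:example"><a:entry>first</a:entry></a:feed>'

# default (:keep_ns_prefix false): prefixes should be dropped,
# leaving only local names such as feed and entry
> xml.select(feed)

# prefixes kept: tag names should retain the a: prefix
> xml.select(feed, nil, {:keep_ns_prefix true})
```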