module soup.tf

The tweakstreet/soup module helps extract information from HTML and XML documents.

The soup module treats markup documents as ‘soups of tags’. It favors the ability to extract sensible data from documents over strict adherence to standards. It is usually able to process documents with syntax problems.

library html

The html library contains functions to extract data from HTML documents.

select

(
  string html,
  any select=nil,
  dict options=nil
) -> any

Given an HTML document as html, selects tags from the document and returns them as structured data. If the given HTML is a fragment, any missing tags such as head or body tags are implicitly added. All tag and attribute names are normalized to lowercase.

The select parameter can be a single CSS selector given as a string, or a dict or list of CSS selectors.

The function returns a list per given selector, imitating the structure of the select argument. For example:

If select is a string, a list of matching nodes is returned.
If select is a dict of strings, a dict of lists containing matching nodes is returned.
If select is a list of strings, a list of lists containing matching nodes is returned.
If select is nil, the entire html document is returned.
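
The shape rules above can be modeled in a few lines. The following Python sketch is only an illustration of the mirroring behavior, not the module's implementation: `mirror_select`, `match`, and `document` are hypothetical names, `match` stands in for whatever evaluates a single CSS selector, and raising `TypeError` for other argument types is an assumption of the sketch.

```python
from typing import Any, Callable


def mirror_select(select: Any, match: Callable[[str], list], document: Any) -> Any:
    """Return results shaped like `select` (hypothetical model of the rules above)."""
    if select is None:
        # nil selector: the entire document is returned
        return document
    if isinstance(select, str):
        # single selector: a list of matching nodes
        return match(select)
    if isinstance(select, dict):
        # dict of selectors: a dict of lists, same keys
        return {key: match(selector) for key, selector in select.items()}
    if isinstance(select, list):
        # list of selectors: a list of lists, same order
        return [match(selector) for selector in select]
    # assumption of this sketch: other types are rejected
    raise TypeError("select must be nil, a string, a dict, or a list")


def fake_match(selector: str) -> list:
    """Stand-in matcher that returns one placeholder node per selector."""
    return ["node for " + selector]


print(mirror_select({"links": "a", "paras": "p"}, fake_match, "whole document"))
```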

Each matching HTML node is represented as a dict in the return value. Any attributes and child nodes are placed in corresponding keys.

The options parameter allows fine tuning the return value.

If any of the following keys are present, they are interpreted as follows:

Key	Handling
`:attributes`	A string taking one of the values: `"flat"` or `"nested"`. Governs how node attributes are represented in the return value. In flat mode, attributes are attached to nodes directly. In nested mode, all attributes are collected into a dict and that dict is placed under an attributes key.
`:nested_attributes_name`	A string. If `:attributes` is `"nested"`, the dict containing the attributes is placed at this key.
`:flat_attributes_prefix`	A string. If `:attributes` is `"flat"`, attribute keys of tags are prefixed with this string.
`:text`	A string taking one of the values `"flat"` or `"nested"`. Governs how text content of tags is represented in the return value. In flat mode, text nodes are used as the value of a node directly, unless that node has child nodes or attributes. If child nodes or attributes are present, the text is placed at the key specified by `:nested_text_name`. In nested mode, the key specified by `:nested_text_name` is always used for text node values.
`:nested_text_name`	A string. The key under which text node values are placed.
`:descendant_text`	A list of strings, each string acting as a CSS selector. If a returned node matches one of the given selectors, the return value contains all text contained in the node, including the text of the node’s children.
`:flat_text`	A list of strings, each string acting as a CSS selector. If the `:text` option is set to `"nested"`, nodes matching the given selectors are treated as exceptions and they are treated as flat, if possible.
`:nested_text`	A list of strings, each string acting as a CSS selector. If the `:text` option is set to `"flat"`, nodes matching the given selectors are treated as exceptions and they are treated as nested.
`:tags`	A string taking one of the values `"flat"` or `"nested"`. Governs how nested tags are represented in the return value. In flat mode, child tags are indexed as single items, and are upgraded to lists, if another child tag of the same name exists. In nested mode, all child tags are indexed as lists.
`:flat_tags`	A list of strings, each string acting as a CSS selector. If the `:tags` option is set to `"nested"`, nodes matching the given selectors are treated as exceptions and they are treated as flat, if possible.
`:nested_tags`	A list of strings, each string acting as a CSS selector. If the `:tags` option is set to `"flat"`, nodes matching the given selectors are treated as exceptions and they are treated as nested.
`:keep_ns_prefix`	A boolean value. If `true`, any XML namespace prefixes are kept in the return value. If `false`, any XML namespace prefixes are dropped and only local names of tags and attributes are used.

If any other keys are present in options, they are ignored.

If options is nil, the following default value is used:

{
  :attributes "flat",
  :nested_attributes_name "@attributes",
  :flat_attributes_prefix "",
  :text "flat",
  :nested_text_name "text",
  :descendant_text [],
  :flat_text [],
  :nested_text [],
  :tags "flat",
  :flat_tags [],
  :nested_tags [],
  :keep_ns_prefix false
}
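
The attribute options described above can be modeled roughly as follows. This Python sketch is a hedged illustration only: `attach_attributes` is a hypothetical helper, Tweakflow keys such as `:href` are written as plain strings, and leaving attribute-less nodes unchanged is an assumption, since the text does not say what happens in that case.

```python
def attach_attributes(node, attributes, mode="flat",
                      nested_name="@attributes", flat_prefix=""):
    """Model of :attributes, :nested_attributes_name and :flat_attributes_prefix."""
    result = dict(node)  # copy, so the caller's node is not modified
    if not attributes:
        # assumption of this sketch: nothing is added for attribute-less nodes
        return result
    if mode == "flat":
        # flat: attributes sit directly on the node, optionally prefixed
        for name, value in attributes.items():
            result[flat_prefix + name] = value
    elif mode == "nested":
        # nested: all attributes are collected under a single key
        result[nested_name] = dict(attributes)
    else:
        raise ValueError('mode must be "flat" or "nested"')
    return result


link = {"text": "here"}
attrs = {"href": "/index.html"}
print(attach_attributes(link, attrs))
print(attach_attributes(link, attrs, flat_prefix="@"))
print(attach_attributes(link, attrs, mode="nested"))
```

A non-empty prefix or the nested mode can be useful when a tag may have an attribute whose name collides with a child tag name or with the text key.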

> html.select('<p>Click <a href="/index.html">here</a> to go somewhere else</p>', "p")
[
  {
    :a {
      :href "/index.html",
      :text "here"
    },
    :text "Click here to go somewhere else"
  }
]
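
In the result above, the text of the a tag sits under :text because the tag also carries an href attribute. A minimal Python sketch of that rule, as described for the `:text` option, is shown below. `text_value` is a hypothetical helper, keys are plain strings instead of Tweakflow keys, and the sketch assumes the node has text and ignores the `:flat_text`, `:nested_text`, and `:descendant_text` selector lists.

```python
def text_value(text, other_entries, mode="flat", text_name="text"):
    """Model of the :text option for a node that has text content.

    other_entries holds the attribute and child-tag entries of the node.
    """
    if mode == "flat" and not other_entries:
        # flat mode, nothing else on the node: the text is the node's value
        return text
    # otherwise the text goes under the key named by :nested_text_name
    node = dict(other_entries)
    node[text_name] = text
    return node


print(text_value("And here!", {}))
print(text_value("here", {"href": "/index.html"}))
print(text_value("And here!", {}, mode="nested"))
```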

> html.select('<p><b>Kilroy</b><i> was </i><b>here</b><p>And here!', "p")
[
  {
    :b ["Kilroy", "here"],
    :i "was"
  },
  "And here!"
]
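
The :b and :i entries in the result above follow the flat mode of the `:tags` option: a child tag is stored as a single item and upgraded to a list once a second child of the same name appears. The Python sketch below models that indexing rule under stated assumptions. `index_children` is a hypothetical helper, children are given as (name, value) pairs in document order, and the per-selector exceptions `:flat_tags` and `:nested_tags` are not modeled.

```python
def index_children(children, mode="flat"):
    """Model of the :tags option for a sequence of (tag_name, value) pairs."""
    node = {}
    upgraded = set()  # names already converted from single item to list
    for name, value in children:
        if mode == "nested":
            # nested: every child tag is indexed as a list
            node.setdefault(name, []).append(value)
        elif name not in node:
            # flat: the first occurrence is stored as a single item
            node[name] = value
        elif name in upgraded:
            node[name].append(value)
        else:
            # flat: the second occurrence upgrades the entry to a list
            node[name] = [node[name], value]
            upgraded.add(name)
    return node


children = [("b", "Kilroy"), ("i", "was"), ("b", "here")]
print(index_children(children))
print(index_children(children, mode="nested"))
```

Flat mode gives compact results, but the type of an entry then depends on how many children the document happens to contain, which is why forcing nested mode for specific selectors can make downstream code simpler.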

> html.select('<p><b>Kilroy</b><i> was </i><b>here</b><p>And here!', "p", {:descendant_text ['p'], :nested_text ['p']})
[
  {
    :b ["Kilroy", "here"],
    :i "was",
    :text "Kilroy was here"
  },
  {
    :text "And here!"
  }
]

> html.select(nil)
nil

library xml

select

(
  string xml,
  any select=nil,
  dict options=nil
) -> any

Given an XML document as xml, selects tags from the document and returns them as structured data.

The select and options parameters behave exactly the same as in html.select.

> \e
catalog: '<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer''s Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
</catalog>'
\e
"<?xml version=\"1.0\"?>
<catalog>
   <book id=\"bk101\">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
</catalog>"

# by default a single book in the catalog appears as a flat property
> xml.select(catalog)
[
  {
    :catalog {
      :book {
        :genre "Computer",
        :price "44.95",
        :author "Gambardella, Matthew",
        :title "XML Developer's Guide",
        :id "bk101",
        :description "An in-depth look at creating applications with XML.",
        :publish_date "2000-10-01"
      }
    }
  }
]

# forcing books to always be a list, regardless of how many happen to be in the catalog
> xml.select(catalog, nil, {:nested_tags ["catalog > book"]})
[
  {
    :catalog {
      :book [
        {
          :genre "Computer",
          :price "44.95",
          :author "Gambardella, Matthew",
          :title "XML Developer's Guide",
          :id "bk101",
          :description "An in-depth look at creating applications with XML.",
          :publish_date "2000-10-01"
        }
      ]
    }
  }
]

> xml.select(nil)
nil

Tweakstreet v1.22.6
