Dynamic Content Delivery Network: HTMLStream API

Last Updated: Mar 05, 2024

You can use the HTMLStream API to process streaming HTML data, such as real-time stock data or chat records, on points of presence (POPs). The HTMLStream API also lets you transmit streaming HTML data in chunks to accelerate data transmission.

Background information

EdgeRoutine is ideal for a wide array of frontend development scenarios. Request-specific data, such as User-Agent headers, geographic locations, and IP addresses, is available on the POPs. Therefore, you may need to modify the stream of an HTML page in real time on the POPs. Conventionally, an ad hoc parser built from regular expressions is used to modify the HTML stream. However, this approach is error-prone and does not support stream processing. Open source JavaScript parsers, such as parse5 and htmlparser2, consume a large amount of memory and degrade system performance. To address these issues, EdgeRoutine provides a parser that supports stream processing on the POPs. You can use this parser to modify HTML code and pages.

Note

The parser is built into EdgeRoutine and is not based on web standards.

Example

  • Scenario

    To rewrite every anchor tag (<a>) in an HTML page so that it links to http://www.taobao.com, use the following code in EdgeRoutine.

  • Sample code

    addEventListener('fetch', (event) => {
      event.respondWith(handle(event));
    });
    
    async function handle(event) {
      // 1. Fetch the HTML page that you want to modify.
      const response = await fetch("http://www.example.com");
      // 2. Configure the streaming parser to process the HTML content. The parser supports multiple selectors.
      // Specify a selector to match elements and register a callback function to rewrite them.
      const htmlStream = new HTMLStream(
        response.body, // The HTML stream that you want to modify.
        [[
          "a",         // The element selector. This selects all anchor tags.
          {
            // Register a callback function. The element callback function is called for each matched anchor tag, which corresponds to an element node in the Document Object Model (DOM) API.
            // In the callback function, you can modify the element object (e).
            element: function(e) {
              // Modify the href attribute.
              e.setAttribute("href", "http://www.taobao.com");
            }
          }
        ]]);
      
      // 3. Return the modified response to the browser. HTMLStream is a readable stream.
      // You can use HTMLStream wherever a ReadableStream is accepted.
      return new Response(htmlStream);
    }
                
  • Result analysis

    The preceding sample code shows how to modify an HTML stream in real time by using the HTMLStream API. HTMLStream works as follows:

    • The Fetch API returns a response whose body is a stream. When fetch resolves, EdgeRoutine may not yet have read the entire response body from the network layer. Because the body is not buffered in memory, garbage collection is triggered less frequently.

    • HTMLStream is itself a stream that works like a TransformStream. As HTMLStream receives data, it calls the registered rewrite callback functions to modify the HTML content in real time. To modify an HTML page, obtain the raw data stream of the page and pass it to the HTMLStream that you create, as shown in Step 2 of the preceding sample code.

      • The first parameter of HTMLStream is a stream, which represents the raw data stream of the HTML page.

      • The second parameter of HTMLStream is an array of rewriters. Each rewriter is a two-element array that consists of a selector and a callback object. The selector specifies the HTML content that you want to modify, and the properties of the callback object are called as callback functions. In the preceding example, ["a", {...}] declares a rewriter. The string "a" is an element selector that matches all anchor tags in the document, and the object is the callback object. If you use an element selector, the callback object can contain the following callback functions:

        • The element callback function. The signature of this callback function is function(e). This callback function is called when the elements that are selected by the element selector are parsed.

        • The comments callback function. The signature of this callback function is function(e). This callback function is called when a comment that is nested inside a selected element is parsed.

        • The text callback function. The signature of this callback function is function(e). This callback function is called when the text in a selected element is parsed. It can be called multiple times for the same text because, in stream processing, HTMLStream may receive only one chunk of the text at a time.

    • You can return HTMLStream directly as the body of a response. HTMLStream does not buffer data. Unlike parse5 and htmlparser2, HTMLStream does not build a DOM tree, which significantly reduces processing time and memory consumption. This way, HTMLStream can deliver high throughput and concurrency when it parses HTML content.
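
The rewriter structure described above can be sketched as follows. This is a minimal sketch rather than a definitive implementation: the div.article selector, the counters object, and the buildRewriters helper are hypothetical names chosen for illustration. HTMLStream exists only in the EdgeRoutine runtime, so the fetch handler is registered only when HTMLStream is defined.

```javascript
// Hypothetical helper: builds one rewriter that registers all three
// element-selector callbacks and counts how often each one is called.
function buildRewriters(counters) {
  return [
    [
      "div.article", // element selector (hypothetical class name)
      {
        element(e) { counters.elements += 1; },
        comments(e) { counters.comments += 1; },
        // May be called several times per text node because text arrives in chunks.
        text(e) { counters.textChunks += 1; },
      },
    ],
  ];
}

const counters = { elements: 0, comments: 0, textChunks: 0 };
const rewriters = buildRewriters(counters);

// HTMLStream is only defined inside EdgeRoutine, so guard the runtime wiring.
if (typeof HTMLStream !== "undefined" && typeof addEventListener === "function") {
  addEventListener("fetch", (event) => {
    event.respondWith(
      fetch("http://www.example.com").then(
        (response) => new Response(new HTMLStream(response.body, rewriters))
      )
    );
  });
}
```

Keeping the rewriters in a plain array makes them easy to inspect without the runtime, because each callback is an ordinary function.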

Rewriter

A rewriter registers the content that you want to rewrite and the callback functions that rewrite it. A rewriter is an array that consists of two elements.

  • The first element in the array must be of the string type or null.

    • String: specifies an element selector that is used to locate an element or tag.

    • null: specifies that the rewriter applies to an entire document.

      Note

      In most cases, you do not need to apply the rewriter to an entire document. A rewriter that is applied to an entire document cannot select individual elements.

  • The second element in the array must be a JavaScript object. The properties of this object are the callback functions that you register.

    If you use an element selector, this object is called the element callback object. If you use a document selector, this object is called the document callback object.

Note

You can specify one or more rewriters for an HTMLStream operation. You can specify multiple element selectors but only one document selector.
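
The rules above can be checked before a rewriter list is passed to HTMLStream. The following is a minimal sketch under stated assumptions: validateRewriters is a hypothetical helper and not part of the HTMLStream API, the img selector and attribute values are illustrative, and the { html: true } spelling of the option object is an assumption.

```javascript
// Hypothetical helper: checks a rewriter list against the rules described
// above before the list is handed to HTMLStream.
function validateRewriters(rewriters) {
  let documentSelectors = 0;
  for (const rewriter of rewriters) {
    if (!Array.isArray(rewriter) || rewriter.length !== 2) {
      throw new TypeError("each rewriter must be a two-element array");
    }
    const [selector, callbacks] = rewriter;
    if (selector !== null && typeof selector !== "string") {
      throw new TypeError("the selector must be a string or null");
    }
    if (typeof callbacks !== "object" || callbacks === null) {
      throw new TypeError("the second element must be an object");
    }
    if (selector === null) documentSelectors += 1;
  }
  if (documentSelectors > 1) {
    throw new RangeError("only one document selector is allowed");
  }
  return rewriters;
}

// Two element selectors and one document selector.
const rewriters = validateRewriters([
  ["a", { element(e) { e.setAttribute("rel", "noopener"); } }],
  ["img", { element(e) { e.setAttribute("loading", "lazy"); } }],
  // { html: true } is an assumed spelling of the option object.
  [null, { docend(e) { e.append("<!-- processed on a POP -->", { html: true }); } }],
]);
```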

Syntax of element selectors

The element selector syntax is a subset of the CSS selector syntax. The semantics of an element selector may differ from the semantics of the corresponding CSS selector. The following section describes the syntax of element selectors:

  • *: specifies all elements or tags.

  • div: specifies the tag named div. You can specify other tag names in this format. HTML and custom tags are supported.

  • E#id: specifies the tag named E whose ID is id.

  • E.Class: specifies the tag named E whose class is Class.

  • E[attr]: specifies the tag named E that has an attribute named attr.

  • Element attributes:

    • E[attr="a"]: specifies the tag named E that has an attribute named attr whose value is a. The value is case-sensitive.

    • E[attr="a" i]: specifies the tag named E that has an attribute named attr whose value is a. The value is not case-sensitive.

    • E[attr$="a"]: specifies the tag named E that has an attribute named attr whose value ends with a.

    • E[attr^="a"]: specifies the tag named E that has an attribute named attr whose value starts with a.

    • E[attr*="a"]: specifies the tag named E that has an attribute named attr whose value contains a.

    • E[attr|="a"]: specifies the tag named E that has an attribute named attr whose value is exactly a or starts with a followed by a hyphen (a-). For example, E[attr|="en"] matches the values en-ch and en-us.

  • Order between elements:

    • E F: specifies the tag named F that is a descendant of the tag named E at any depth.

    • E > F: specifies the tag named F whose direct parent element is the tag named E.

  • E:not(S): specifies the element named E. S is another element selector. Element E is selected only if it does not match S.
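
The attribute operators listed above can be illustrated with a small matcher. This is a sketch for illustration only: matchesAttribute is a hypothetical function, it assumes that the operators follow standard CSS semantics, and it does not show how HTMLStream matches elements internally.

```javascript
// Illustration only: mirrors the CSS semantics of the attribute operators.
// `value` is the attribute value of the element, or null if it is absent.
function matchesAttribute(value, operator, expected) {
  if (value === null || value === undefined) return false; // attribute absent
  switch (operator) {
    case "=":  return value === expected;                          // exact match
    case "^=": return expected !== "" && value.startsWith(expected);
    case "$=": return expected !== "" && value.endsWith(expected);
    case "*=": return expected !== "" && value.includes(expected);
    // Exact match, or the expected value followed by a hyphen.
    case "|=": return value === expected || value.startsWith(expected + "-");
    default:   throw new Error("unsupported operator: " + operator);
  }
}
```

For example, matchesAttribute("en-us", "|=", "en") is true, whereas matchesAttribute("english", "|=", "en") is false because the prefix is not followed by a hyphen.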

Callback functions for element selectors

The following table describes the callback functions that are supported by element selectors.

| Callback function | Description | Callback function signature |
| --- | --- | --- |
| element | A synchronous callback function that is called after a selected element is completely parsed. | function(e), where e is an Element object. For more information, see Element. |
| comments | A synchronous callback function that is called when a selected element contains a comment. | function(e), where e is a Comments object. For more information, see Comments. |
| text | A synchronous callback function that is called when text in a selected element is parsed. Note: This callback function may be called multiple times. HTMLStream reads text from the raw HTML data in chunks and calls this callback function each time a chunk is parsed. To obtain the complete text, you must collect and merge all the text chunks. | function(e), where e is a TextChunk object. For more information, see TextChunk. |

Note

All of the preceding callback functions are optional. If an element selector registers no callback function, the matched elements are passed through without processing. To modify only a specific part of the content, register only the callback function that you need.
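
Merging text chunks can be sketched as follows. This is a minimal sketch, assuming that the final chunk of each text node has lastInTextNode set to true, as described in the TextChunk section of this topic. The collectText helper, the titles array, and the title selector are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper: returns a text callback that buffers chunks and
// calls onComplete once per text node with the merged string.
function collectText(onComplete) {
  let buffer = "";
  return function text(chunk) {
    buffer += chunk.text; // copy the string out; do not keep `chunk` itself
    if (chunk.lastInTextNode) {
      onComplete(buffer);
      buffer = ""; // start over for the next text node
    }
  };
}

const titles = [];
const rewriter = ["title", { text: collectText((full) => titles.push(full)) }];
```

The callback stores only strings, so nothing from the callback parameter is used after the callback returns.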

Document selector

A document selector selects the entire document instead of individual elements. To use a document selector, set the first element in the rewriter array to null. You can configure only one document selector for an HTMLStream.

Callback functions for document selectors

The callback functions for document selectors are similar to the callback functions for element selectors. The following table describes the callback functions that are supported by document selectors.

| Callback function | Description | Callback function signature |
| --- | --- | --- |
| doctype | A synchronous callback function that is called when the document type declaration (DOCTYPE) of the document is parsed. | function(e), where e is a Doctype object. For more information, see Doctype. |
| comments | A synchronous callback function that is called when the document contains a comment. | function(e), where e is a Comments object. For more information, see Comments. |
| text | A synchronous callback function that is called when the document contains a text node. Note: This callback function may be called multiple times. HTMLStream reads text from the raw HTML data in chunks and calls this callback function each time a chunk is parsed. To obtain the complete text, you must collect and merge all the text chunks. | function(e), where e is a TextChunk object. For more information, see TextChunk. |
| docend | A synchronous callback function that is called after the document is completely parsed. You can use this callback function to append content, such as debugging information in a comment, to the end of the HTML document for troubleshooting. | function(e), where e is a Docend object. For more information, see Docend. |
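
A document rewriter that uses these callback functions might look like the following. This is a sketch under stated assumptions: buildDocumentRewriter and the info object are hypothetical names, and the { html: true } spelling of the option object is an assumption.

```javascript
// Hypothetical helper: builds a document rewriter (selector null) that
// records the DOCTYPE name, counts comments, and appends a debug comment
// to the end of the document.
function buildDocumentRewriter(info) {
  return [
    null, // document selector
    {
      doctype(d) { info.doctype = d.name; },
      comments(c) { info.comments += 1; },
      docend(end) {
        // { html: true } is an assumed spelling of the option object.
        end.append(
          `<!-- doctype=${info.doctype} comments=${info.comments} -->`,
          { html: true }
        );
      },
    },
  ];
}

const info = { doctype: null, comments: 0 };
const documentRewriter = buildDocumentRewriter(info);
```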

Error handling

EdgeRoutine catches all JavaScript exceptions that are thrown by the preceding callback functions. When an exception is caught, HTMLStream stops processing the HTML stream and propagates the exception to the outer layer.

  • If your JavaScript code calls the reader.read method, the exception is thrown again to your code.

  • If the EdgeRoutine runtime calls the reader.read method on HTMLStream, EdgeRoutine hides the exception. For example, if the exception occurs while EdgeRoutine is returning a response to a client, the response is interrupted and the client receives only part of the response. This is because HTMLStream processes data as a stream, and the stream may be interrupted before all the data is returned to the client. HTMLStream handles exceptions in a way similar to TransformStream, which also reads and writes data as streams.
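
One way to keep an exception in a callback function from interrupting the response is to catch it inside the callback. The following is a minimal sketch, not a recommendation from the EdgeRoutine documentation: guarded and the errors array are hypothetical names.

```javascript
// Hypothetical helper: wraps a callback so that an exception is recorded
// instead of propagating to HTMLStream and interrupting the stream.
function guarded(callback, errors) {
  return function (e) {
    try {
      callback(e);
    } catch (err) {
      errors.push(err); // the matched content passes through unmodified
    }
  };
}

const errors = [];
const rewriter = [
  "a",
  {
    element: guarded((e) => {
      const href = e.getAttribute("href");
      if (typeof href !== "string") throw new TypeError("anchor without href");
      e.setAttribute("href", href.trim());
    }, errors),
  },
];
```

The trade-off is that a failing callback leaves the matched content unchanged, so the client receives a complete but only partially rewritten page.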

Callback parameters

Each callback function receives an object that represents the selected HTML tags or other relevant information. The object is also known as a callback parameter. This topic describes callback parameters such as Element, TextChunk, and Comments.

Note

  • Callback parameters are valid only inside the callback functions that receive them. If the methods or attributes of a parameter are invoked outside of the callback function, JavaScript exceptions are thrown. To avoid this problem, copy the values that you need into other JavaScript objects or data structures inside the callback function.

  • The option parameter of the methods that are described in this topic is an object. You can set the HTML attribute in the object to true or false. The value true specifies that the data is inserted as HTML content, and false specifies that the data is inserted as text content. If you set the HTML attribute to false, HTMLStream applies HTML encoding (escaping) to the data.
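
Copying values out of a callback parameter can be sketched as follows. The links array is a hypothetical name, and the a[href] selector is used only for illustration.

```javascript
// Collects plain values from each matched element. Only strings are stored,
// so nothing from the callback parameter is used after the callback returns.
const links = [];
const rewriter = [
  "a[href]",
  {
    element(e) {
      // Copy primitives out of the parameter while the callback is running.
      links.push({ tag: e.tagName, href: e.getAttribute("href") });
      // Do not store `e` itself: using it after the callback returns throws.
    },
  },
];
```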

Element

  • Definition

    This object is passed to the element callback function. It represents a selected HTML tag.

  • Attributes

    • tagName(string): the name of the tag.

    • attributes(iterator): returns an iterator over all attributes of the element. Each attribute is returned in the [name, value] format.

    • removed(bool): indicates whether the element has been removed. This attribute is read-only. To remove an element, call the remove() method. Typically, you check this attribute to skip elements that are already removed.

    • namespaceURI: the namespace URI of the element, for example, for an SVG or script element. This attribute is read-only.

  • Methods

    • Modify attributes

      • getAttribute(name): returns the value of the attribute named name on the element.

      • setAttribute(name, value): sets the attribute named name on the element to value. If the attribute already exists, its value is overwritten.

      • hasAttribute(name): checks whether the element has an attribute named name.

      • removeAttribute(name): removes the attribute named name from the element.

      Note

      Both the attribute name and value must be of string type.

    • Modify content

      • before(data, option): inserts content before the element (before the opening tag).

      • after(data, option): inserts content after the element (after the closing tag).

      • prepend(data, option): inserts content before the content of the specified element (after the opening tag of the element). Example: <div>(prepend) |aaaa|(append)</div>.

      • append(data, option): inserts content after the content of the specified element (before the closing tag of the element). Example: <div>(prepend) |aaaa|(append)</div>.

      • replace(data, option): replaces the entire element, including the tags and nested tags.

      • setInnerContent(data, option): replaces the content of the element and retains the tags and attributes.

      • remove(): deletes the specified element. After the element is deleted, the value of the removed attribute changes to true.

      • removeAndKeepContent(): deletes tags and attributes of the specified element and retains the content.
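
The Element methods above can be combined in a single element callback. The following is a minimal sketch under stated assumptions: upgradeLink is a hypothetical callback, the return value of getAttribute for a missing attribute is not specified here (so the code checks the type), and the { html: false } spelling of the option object is an assumption.

```javascript
// Hypothetical element callback: upgrades http links to https, drops inline
// click handlers, and marks the link with a text label.
function upgradeLink(e) {
  if (e.removed) return; // skip elements that are already removed
  const href = e.getAttribute("href");
  if (typeof href === "string" && href.startsWith("http://")) {
    e.setAttribute("href", "https://" + href.slice("http://".length));
  }
  if (e.hasAttribute("onclick")) {
    e.removeAttribute("onclick");
  }
  // { html: false } is an assumed spelling; the label is inserted as text.
  e.after(" (external)", { html: false });
}

const rewriter = ["a", { element: upgradeLink }];
```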

TextChunk

  • Definition

    This object is passed to the text callback function. It represents a chunk of the selected HTML text.

  • Attributes

    • removed(bool): indicates whether the text chunk has been removed. This attribute is read-only. To remove a text chunk, call the remove() method. Typically, you check this attribute to skip text chunks that are already removed.

    • text(string): the text content. This attribute is read-only. The value may be only one chunk of a text node. An empty string indicates that the last chunk of the text node has been reached. At that point, you can merge all the chunks.

    • lastInTextNode(bool): indicates whether this is the last chunk of the text node. This attribute is read-only. If the value of this attribute is true, the text attribute returns an empty string.

  • Methods

    Modify content

    • before(data, option): inserts content before the text chunk.

    • after(data, option): inserts content after the text chunk.

    • replace(data, option): replaces the text chunk.

    • remove(): removes the text chunk. After the text chunk is removed, the value of the removed attribute changes to true.
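
A text callback that rewrites each chunk in place can be sketched as follows. This is a minimal sketch: maskDigits and the span.phone selector are hypothetical names, and the { html: false } spelling of the option object is an assumption.

```javascript
// Hypothetical text callback: masks every digit in a text chunk.
function maskDigits(chunk) {
  // Skip removed chunks and the empty chunk that ends a text node.
  if (chunk.removed || chunk.text === "") return;
  // { html: false } is an assumed spelling; the data is inserted as text.
  chunk.replace(chunk.text.replace(/\d/g, "*"), { html: false });
}

const rewriter = ["span.phone", { text: maskDigits }];
```

A per-character rewrite like this one is safe to apply chunk by chunk. A rewrite that matches multi-character patterns may miss a match that is split across two chunks, in which case the chunks need to be merged first.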

Comments

  • Definition

    This object is passed to the comments callback function. It represents a comment in the selected HTML content.

  • Attributes

    • removed(bool): indicates whether the comment has been removed. This attribute is read-only. To remove a comment, call the remove() method. Typically, you check this attribute to skip comments that are already removed.

    • text(string): the text of the comment. You can read this attribute to obtain the existing comment or write to it to overwrite the existing comment.

  • Methods

    Modify content

    • before(data, option): inserts content before the comment.

    • after(data, option): inserts content after the comment.

    • replace(data, option): replaces the comment.

    • remove(): removes the comment. After the comment is removed, the value of the removed attribute changes to true.
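
A comments callback that uses these attributes and methods might look like the following. This is a sketch under stated assumptions: stripComments and the keep: marker are hypothetical, and the code assumes that the text attribute contains the comment body without the <!-- and --> delimiters.

```javascript
// Hypothetical comments callback: removes every comment except those whose
// body starts with the marker "keep:".
function stripComments(c) {
  if (c.removed) return; // already removed by another rewriter
  if (c.text.trim().startsWith("keep:")) return;
  c.remove();
}

// Registered on all elements; a document selector (null) could be used
// for comments outside of elements.
const rewriter = ["*", { comments: stripComments }];
```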

Doctype

  • Definition

    This object is passed to the doctype callback function. It represents the DOCTYPE of the HTML document.

  • Attributes

    • name(string): the DOCTYPE name. This attribute is read-only.

    • publicId(string): returns a public identifier. If no public identifier exists, a value of null is returned. This attribute is read-only.

    • systemId(string): returns a system identifier. If no system identifier exists, a value of null is returned. This attribute is read-only.

Docend

  • Definition

    This object is passed to the docend callback function. It represents the end of an HTML document.

  • Method

    append(string, option): appends content to the end of the HTML document.