This topic describes how to use HTMLStream to modify the HTML stream of a page in real time.

Background

EdgeRoutine (ER) applies to a wide range of frontend development scenarios. Request-specific data, such as the User-Agent header, geographic location, and IP address, is available on the edge, so you may need to modify an HTML page on the edge in real time as it streams to the client. The traditional approach is to build an ad hoc parser from regular expressions, but this approach is error-prone and does not support stream processing. Open source JavaScript parsers, such as parse5 and htmlparser2, consume a large amount of memory and degrade performance. To address these issues, ER provides a parser that supports stream processing on the edge. You can use this parser to modify HTML content as it passes through ER.
Note The parser is a built-in feature of ER, and is not based on web standards.

Examples

  • Scenario

    To rewrite every anchor tag (<a>) in an HTML page so that it points to http://www.taobao.com, use the following code in ER.

  • Sample code
    addEventListener('fetch', (event) => {
      event.respondWith(handle(event));
    });
    
    async function handle(event) {
      // 1. Fetch the HTML page that you want to modify.
      const response = await fetch("http://www.example.com");
      // 2. Create a streaming parser for the HTML content. The parser supports multiple CSS selectors.
      // For each selector, register the callback functions that rewrite the matched content.
      const htmlStream = new HTMLStream(
        response.body, // The HTML stream that you want to modify.
        [[
          "a",         // The element selector. This selects all anchor tags.
          {
            // Register a callback function. The element callback is called for each matched anchor tag,
            // which corresponds to an element node in the Document Object Model (DOM) API.
            // In the callback function, you can modify the element object (e).
            element: function(e) {
              // Modify the href attribute.
              e.setAttribute("href", "http://www.taobao.com");
            }
          }
        ]]);
      
      // 3. Return the modified response to the browser. HTMLStream is a readable stream. You can use HTMLStream
      // wherever a ReadableStream is accepted.
      return new Response(htmlStream);
    }
                
  • Result analysis
    The preceding sample code shows how to modify an HTML stream in real time by using HTMLStream. The following section describes how HTMLStream works:
    • The Fetch API returns the response body as a stream. When fetch resolves, ER may not yet have read the entire response body from the network layer. Because the body is not buffered in full, garbage collection caused by data buffering occurs less often.
    • HTMLStream is also a stream, and it works in a way similar to TransformStream. As HTMLStream receives data from the source stream, it calls the registered rewrite callback functions to modify the HTML content in real time. To modify an HTML page, obtain the raw data stream of the page and pass it to the HTMLStream that you create, as shown in Step 2 of the preceding sample code.
      • The first parameter of HTMLStream is a stream, which is the raw data stream of the HTML page.
      • The second parameter of HTMLStream is an array, which is a group of rewriters. A rewriter is an array with two elements: a selector, which specifies the HTML content that you want to modify, and an object whose properties are called as callback functions. In the preceding example, ["a", {...}] declares a rewriter. The string "a" is the element selector that locates all anchor tags in the document, and the object is the callback object. If you use an element selector, the object can contain the following callback functions:
        • The element callback function. The signature of this callback function is function(e). This callback function is called when an element that matches the element selector is parsed.
        • The comments callback function. The signature of this callback function is function(e). This callback function is called when a comment that is nested inside a matched element is parsed.
        • The text callback function. The signature of this callback function is function(e). This callback function is called when text inside a matched element is parsed. Because of stream processing, each call may receive only a chunk of the text, so this callback function can be called multiple times for the same text.
    • You can return HTMLStream in a response immediately, because HTMLStream does not buffer data. Unlike parse5 and htmlparser2, HTMLStream does not build a DOM tree, which significantly reduces processing time and memory consumption. This allows HTMLStream to deliver high throughput and concurrency when it parses HTML content.
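
The following is a minimal sketch of a rewriter that registers all three callback functions at once. The helper makeStatsRewriter and the stats object are hypothetical names introduced for illustration. Because HTMLStream exists only in ER, the callbacks are driven by hand with stub objects that stand in for the parameters HTMLStream would pass.

```javascript
// Hypothetical helper: builds one rewriter ([selector, callbacks]) that
// gathers simple statistics about the matched content.
function makeStatsRewriter(selector) {
  const stats = { elements: 0, comments: 0, textLength: 0 };
  const callbacks = {
    // Called for each element matched by the selector.
    element(e) {
      stats.elements += 1;
    },
    // Called for each comment nested inside a matched element.
    comments(e) {
      stats.comments += 1;
    },
    // May be called several times for one text node, once per chunk.
    text(e) {
      stats.textLength += e.text.length;
    },
  };
  return { rewriter: [selector, callbacks], stats };
}

const { rewriter, stats } = makeStatsRewriter("a");

// In ER, the rewriter would be passed as:
//   new HTMLStream(response.body, [rewriter])
// Here stub objects stand in for the callback parameters.
const [, callbacks] = rewriter;
callbacks.element({ tagName: "a" });
callbacks.text({ text: "Tao", lastInTextNode: false });
callbacks.text({ text: "bao", lastInTextNode: false });
callbacks.text({ text: "", lastInTextNode: true });
callbacks.comments({ text: " tracking link " });
console.log(stats);
```

Keeping the counters in a plain object outside the callbacks means they remain readable after the stream has been consumed.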

Rewriter

A rewriter registers the content that you want to rewrite. A rewriter is an array that consists of two elements:
  • The first element in the array must be a string or null.
    • String: specifies an element selector that is used to locate an element or tag.
    • null: specifies that the rewriter applies to the entire document.
      Note In most cases, you do not need to apply a rewriter to the entire document. A rewriter that applies to the entire document cannot locate elements.
  • The second element in the array must be a JavaScript object. This object holds the callback functions that you register.

    If you use an element selector, this object is called the element callback object. If you use a document selector, this object is called the document callback object.

Note You can specify one or more rewriters for an HTMLStream. You can specify multiple element selectors but only one document selector.
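
The rules above can be sketched as a small validation helper. checkRewriters is a hypothetical function written for this illustration and is not part of ER. It only checks the shape of a rewriter list before the list is handed to HTMLStream.

```javascript
// Hypothetical helper (not part of ER): checks a rewriter list against the
// rules described above and returns a list of problems (empty if none).
function checkRewriters(rewriters) {
  const problems = [];
  let documentSelectors = 0;
  rewriters.forEach((rewriter, i) => {
    if (!Array.isArray(rewriter) || rewriter.length !== 2) {
      problems.push(`rewriter ${i}: must be an array of two elements`);
      return;
    }
    const [selector, callbacks] = rewriter;
    if (selector === null) {
      documentSelectors += 1; // null marks a document selector
    } else if (typeof selector !== "string") {
      problems.push(`rewriter ${i}: selector must be a string or null`);
    }
    if (callbacks === null || typeof callbacks !== "object") {
      problems.push(`rewriter ${i}: second element must be an object`);
    }
  });
  if (documentSelectors > 1) {
    problems.push("only one document selector (null) is allowed");
  }
  return problems;
}

// Two element selectors and one document selector.
const rewriters = [
  ["a", { element(e) { e.setAttribute("href", "http://www.taobao.com"); } }],
  ["div.banner", { element(e) { e.remove(); } }],
  [null, { comments(e) { e.text = ""; } }],
];
console.log(checkRewriters(rewriters)); // []
```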

Syntax of element selectors

The element selector syntax is a subset of the CSS selector syntax. The semantics of an element selector may differ from the semantics of the corresponding CSS selector. The following list describes the syntax of element selectors:
  • *: specifies all elements or tags.
  • div: specifies the tag named div. You can specify other tag names in this format. HTML and custom tags are supported.
  • E#id: specifies the tag named E whose ID is id.
  • E.Class: specifies the tag named E whose class is Class.
  • E[attr]: specifies the tag named E that has an attribute named attr.
  • Element attributes:
    • E[attr="a"]: specifies the tag named E that has an attribute named attr whose value is a. The value is case-sensitive.
    • E[attr="a" i]: specifies the tag named E that has an attribute named attr whose value is a. The value is not case-sensitive.
    • E[attr$="a"]: specifies the tag named E that has an attribute named attr whose value ends with a.
    • E[attr^="a"]: specifies the tag named E that has an attribute named attr whose value starts with a.
    • E[attr*="a"]: specifies the tag named E that has an attribute named attr whose value contains a.
    • E[attr|="a"]: specifies the tag named E that has an attribute named attr whose value starts with a followed by a hyphen (a-). Example: E[attr|="en"] matches the values en-ch and en-us.
  • Order between elements:
    • E F: specifies the tag named F that is a descendant of the tag named E.
    • E > F: specifies the tag named F that is a direct child of the tag named E.
  • E:not(S): specifies the element named E that does not match S. S is another element selector. Element E is selected only when S evaluates to false for it.
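
To make the attribute operators concrete, the following sketch models them with plain JavaScript string methods. It follows standard CSS semantics and is not the parser that ER uses. The attributeOperators table is a hypothetical name, and ER may differ in edge cases such as empty values.

```javascript
// Plain-JavaScript model of the attribute operators. Each function receives
// the attribute value found on a tag and the value written in the selector.
const attributeOperators = {
  "=": (actual, expected) => actual === expected,
  "^=": (actual, expected) => actual.startsWith(expected),
  "$=": (actual, expected) => actual.endsWith(expected),
  "*=": (actual, expected) => actual.includes(expected),
  // CSS semantics: an exact match, or the value followed by a hyphen.
  "|=": (actual, expected) =>
    actual === expected || actual.startsWith(expected + "-"),
};

console.log(attributeOperators["^="]("https://example.com", "https")); // true
console.log(attributeOperators["$="]("logo.png", ".png")); // true
console.log(attributeOperators["|="]("en-us", "en")); // true
console.log(attributeOperators["|="]("english", "en")); // false
```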

Callback functions for element selectors

The following table describes the callback functions that are supported by element selectors.
Callback function | Description | Callback function signature
element | A synchronous callback function that is called after a selected element is completely parsed. | function(e), where e is an Element object. For more information, see Element.
comments | A synchronous callback function that is called when a selected element contains comments. | function(e), where e is a Comments object. For more information, see Comments.
text | A synchronous callback function that is called when text inside a selected element is parsed. | function(e), where e is a TextChunk object. For more information, see TextChunk.
Note The text callback function may be called multiple times. HTMLStream reads text from the raw HTML data in chunks and calls this callback function each time a chunk is parsed. To obtain the complete text, collect and merge all the text chunks.
Note All of the preceding callback functions are optional. If an element selector registers none of them, the matched elements are output as-is without processing. To modify a specific part of the content, register only the callback function that you need.
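
Merging text chunks can be sketched as follows. collectText is a hypothetical helper. It relies only on the text and lastInTextNode attributes that this topic documents for TextChunk, and the stub chunks at the end stand in for the objects that HTMLStream would pass.

```javascript
// Hypothetical helper: returns a text callback that buffers chunks and calls
// onComplete(fullText) once the last chunk of a text node has arrived.
function collectText(onComplete) {
  let buffer = "";
  return function text(e) {
    buffer += e.text; // copy the string out while the callback is running
    if (e.lastInTextNode) {
      onComplete(buffer);
      buffer = ""; // start fresh for the next text node
    }
  };
}

const titles = [];
const rewriter = ["title", { text: collectText((full) => titles.push(full)) }];

// Stub chunks stand in for the TextChunk objects that HTMLStream would pass.
const [, callbacks] = rewriter;
callbacks.text({ text: "Edge", lastInTextNode: false });
callbacks.text({ text: "Routine", lastInTextNode: false });
callbacks.text({ text: "", lastInTextNode: true });
console.log(titles); // [ 'EdgeRoutine' ]
```

Buffering trades a little memory for the ability to see the whole text node. For long text nodes, processing each chunk as it arrives keeps the streaming benefit.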

Document selector

A document selector applies to the entire document. To use a document selector, set the first element in the rewriter array to null. An HTMLStream can have only one document selector.

Callback functions for document selectors

The callback functions for document selectors are similar to the callback functions for element selectors. The following table describes the callback functions that are supported by document selectors.
Callback function | Description | Callback function signature
doctype | A synchronous callback function that is called when the document type declaration (DOCTYPE) in the document is parsed. | function(e), where e is a Doctype object. For more information, see Doctype.
comments | A synchronous callback function that is called when the document contains comments. | function(e), where e is a Comments object. For more information, see Comments.
text | A synchronous callback function that is called when the document contains text nodes. |
docend | A synchronous callback function that is called after the document is completely parsed. You can use this callback function to append content, such as debugging information in the form of comments, to the end of the HTML document to help troubleshoot and track errors. | function(e), where e is a Docend object. For more information, see Docend.
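
As a rough sketch, a document rewriter that records the DOCTYPE name and appends a debugging comment at the end of the document could look like the following. makeDebugRewriter is a hypothetical helper, and the spelling of the option property as html is an assumption. The stub objects stand in for the Doctype and Docend parameters.

```javascript
// Hypothetical helper: builds a document rewriter that records the DOCTYPE
// name and appends a debugging comment to the end of the document.
function makeDebugRewriter(label) {
  const seen = { doctype: null };
  const callbacks = {
    doctype(e) {
      seen.doctype = e.name; // read-only attribute of the Doctype object
    },
    docend(e) {
      // Assumption: the option property is spelled `html`; true means the
      // string is inserted as HTML instead of being escaped as text.
      e.append(`<!-- ${label}: doctype=${seen.doctype} -->`, { html: true });
    },
  };
  return [null, callbacks]; // null selects the entire document
}

// Drive the callbacks with stubs, in the order a parser would call them.
const [selector, callbacks] = makeDebugRewriter("er-debug");
const appended = [];
callbacks.doctype({ name: "html", publicId: null, systemId: null });
callbacks.docend({ append: (data, option) => appended.push({ data, option }) });
console.log(appended[0].data); // <!-- er-debug: doctype=html -->
```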

Solutions to exceptions

ER catches JavaScript exceptions that are thrown by the preceding callback functions. When an exception is thrown, HTMLStream stops processing the HTML stream and propagates the exception to the outer layers.
  • If your JavaScript code calls the reader.read method, the exception is rethrown to your code.
  • If the ER runtime calls the reader.read method on the HTMLStream, ER suppresses the exception. For example, if the exception occurs while ER is returning a response to a client, the response is interrupted and the client receives only part of the response. This is because HTMLStream processes data as a stream, so the stream may be interrupted before all the data is returned to the client. HTMLStream handles exceptions in a way similar to TransformStream, which also reads and writes data as streams.
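
Because an exception in a callback interrupts the whole stream, one possible defensive pattern is to catch errors inside the callback itself. The guard wrapper below is a hypothetical helper, not an ER feature, and the data-config and data-region attributes are made up for the example.

```javascript
// Hypothetical wrapper: catches exceptions thrown by a callback so that one
// bad element does not interrupt the whole stream. Error messages are
// collected for later reporting instead of being propagated.
function guard(callback, errors) {
  return function guarded(e) {
    try {
      callback(e);
    } catch (err) {
      errors.push(err.message);
    }
  };
}

const errors = [];
const rewriter = ["div[data-config]", {
  element: guard((e) => {
    const config = JSON.parse(e.getAttribute("data-config")); // may throw
    e.setAttribute("data-region", String(config.region));
  }, errors),
}];

// Stub elements stand in for the Element parameter.
function stubElement(attrs) {
  const map = new Map(Object.entries(attrs));
  return {
    attrs: map,
    getAttribute: (name) => (map.has(name) ? map.get(name) : null),
    setAttribute: (name, value) => { map.set(name, value); },
  };
}

const good = stubElement({ "data-config": '{"region":"cn-hangzhou"}' });
const bad = stubElement({ "data-config": "{not json" });
rewriter[1].element(good);
rewriter[1].element(bad);
console.log(good.attrs.get("data-region")); // cn-hangzhou
console.log(errors.length); // 1
```

The trade-off is that an element whose callback fails passes through unmodified, which is usually preferable to a truncated response.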

Callback parameters

Each callback function receives an object that represents the selected HTML tags or other relevant information. The object is also known as a callback parameter. This section describes the callback parameters: Element, TextChunk, Comments, Doctype, and Docend.

Note
  • Callback parameters are valid only inside the callback function that receives them. If the methods or properties of a parameter are invoked outside of the callback function, JavaScript exceptions are thrown. To use the data later, copy the values that you need into other JavaScript objects or data structures inside the callback function.
  • The option parameter in the methods that are described in this topic is an object. You can set the HTML property in the object to true or false. True specifies that the data is HTML content, and false specifies that the data is text content. If you set the HTML property to false, HTMLStream applies HTML encoding (escaping) to the data.
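
The first note can be illustrated with a small sketch that copies values out of the parameter while the callback is still running. The links array is a hypothetical name, and the stub objects stand in for the Element parameter.

```javascript
const links = []; // plain data that outlives the callbacks

const rewriter = ["a[href]", {
  element(e) {
    // Copy primitive values out while the callback is running.
    links.push({ tag: e.tagName, href: e.getAttribute("href") });
    // Do not store `e` itself for later use: per the note above, invoking
    // its methods after the callback returns throws an exception.
  },
}];

// Stub elements stand in for the Element parameter that HTMLStream passes.
const [, callbacks] = rewriter;
callbacks.element({ tagName: "a", getAttribute: () => "/home" });
callbacks.element({ tagName: "a", getAttribute: () => "/cart" });
console.log(links.map((link) => link.href)); // [ '/home', '/cart' ]
```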

Element

  • Definition

    This object is passed to the element callback function. This object represents the selected HTML tag.

  • Attributes
    • tagName(string): the name of the tag.
    • attributes(iterator): an iterator over all attributes of the element. Each item is in the [name, value] format.
    • removed(bool): indicates whether the element has been deleted. This attribute is read-only. To delete an element, call the remove() method. Typically, you check this attribute to skip elements that are already deleted.
    • namespaceURI: the namespace URI of the element, for example, for an SVG or Script element. This attribute is read-only.
  • Methods
    • Modify attributes
      • getAttribute(name): returns the value of the attribute named name on the specified element.
      • setAttribute(name, value): sets the attribute named name to value on the specified element. If the attribute already exists, its value is modified.
      • hasAttribute(name): checks whether the specified element has an attribute named name.
      • removeAttribute(name): deletes the attribute named name from the specified element.
      Note Both the attribute name and value must be strings.
    • Modify content
      • before(data, option): inserts content before the specified element (element tag).
      • after(data, option): inserts content after the specified element (element tag).
      • prepend(data, option): inserts content before the element content (after the opening tag of the element). Example: <div>(prepend) |aaaa|(append)</div>.
      • append(data, option): inserts content after the element content (before the closing tag of the element). Example: <div>(prepend) |aaaa|(append)</div>.
      • replace(data, option): replaces the entire element, including the tags and nested tags.
      • setInnerContent(data, option): replaces the content of the element and retains the tags and attributes.
      • remove(): deletes the specified element. After the element is deleted, the value of the removed attribute changes to true.
      • removeAndKeepContent(): deletes tags and attributes of the specified element and retains the content.
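
The attribute and content methods can be combined as in the following sketch. hardenLink is a hypothetical callback, the data-remove attribute is made up for the example, and StubElement is a minimal stand-in for local experiments. The stub returns null for a missing attribute, which is an assumption about getAttribute and not documented behavior.

```javascript
// Hypothetical callback: removes tags flagged with data-remove, hardens links
// that open in a new window, and strips inline click handlers.
function hardenLink(e) {
  if (e.removed) return; // skip elements that were already deleted
  if (e.hasAttribute("data-remove")) {
    e.remove();
    return;
  }
  if (e.getAttribute("target") === "_blank") {
    e.setAttribute("rel", "noopener");
  }
  e.removeAttribute("onclick");
}

// Minimal stand-in for the Element parameter, for local experiments only.
class StubElement {
  constructor(tagName, attrs) {
    this.tagName = tagName;
    this.attrs = new Map(Object.entries(attrs));
    this.removed = false;
  }
  // Assumption: a missing attribute yields null.
  getAttribute(name) { return this.attrs.has(name) ? this.attrs.get(name) : null; }
  setAttribute(name, value) { this.attrs.set(name, value); }
  hasAttribute(name) { return this.attrs.has(name); }
  removeAttribute(name) { this.attrs.delete(name); }
  remove() { this.removed = true; }
}

const link = new StubElement("a", { href: "/x", target: "_blank", onclick: "track()" });
const flagged = new StubElement("a", { href: "/old", "data-remove": "" });
hardenLink(link);
hardenLink(flagged);
console.log(link.getAttribute("rel")); // noopener
console.log(flagged.removed); // true
```

In ER, the callback would be registered as ["a", { element: hardenLink }].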

TextChunk

  • Definition

    This object is passed to the text callback function. This object represents a chunk of the selected HTML text.

  • Attributes
    • text(string): the text content. This attribute is read-only. The value may be only one chunk of a text node. An empty string indicates that the last chunk of the text node has been reached. At that point, you can merge all the chunks that you collected.
    • lastInTextNode(bool): indicates whether this is the last chunk of the text node. This attribute is read-only. If the value of this attribute is true, the text attribute returns an empty string.
  • Methods
    Modify content

Comments

  • Definition

    This object is passed to the comments callback function. This object represents a comment in the selected HTML content.

  • Attributes
    • text(string): the content of the comment. This attribute is readable and writable. Assign a new value to overwrite the existing comment.
  • Methods
    Modify content
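
Because the text attribute of a Comments object is writable, a comments callback can overwrite comments in place. The following is a minimal sketch. redactComment is a hypothetical callback, and the plain objects stand in for the Comments parameter.

```javascript
// Hypothetical callback: overwrites comments that mention internal details
// and leaves all other comments unchanged.
function redactComment(e) {
  if (/internal|todo/i.test(e.text)) {
    e.text = " redacted ";
  }
}

// Registered with a document selector so that it sees every comment.
const rewriter = [null, { comments: redactComment }];

// Plain objects stand in for the Comments parameter.
const visible = { text: " banner start " };
const secret = { text: " TODO: internal cleanup " };
rewriter[1].comments(visible);
rewriter[1].comments(secret);
console.log(visible.text); // " banner start "
console.log(secret.text); // " redacted "
```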

Doctype

  • Definition

    This object is passed to the doctype callback function. This object represents the DOCTYPE of the HTML document.

  • Attributes
    • name(string): the DOCTYPE name. This attribute is read-only.
    • publicId(string): returns a public identifier. If no public identifier exists, a value of null is returned. This attribute is read-only.
    • systemId(string): returns a system identifier. If no system identifier exists, a value of null is returned. This attribute is read-only.

Docend

  • Definition

    This object is passed to the docend callback function. This object represents the end of an HTML document.

  • Methods

    append(string, option): appends content to the end of the HTML document.