This topic describes how to use HTMLStream to modify the HTML stream of a page in real time.

Background information

EdgeRoutine (ER) can be applied to a wide range of frontend development scenarios. Request-specific data, such as the User-Agent header, geographic location, and IP address, is available at the edge. Therefore, you may need to modify the HTML stream of a page in real time on the edge. The traditional approach is to use regular expressions as an ad hoc parser. However, this approach is error-prone and does not support stream processing. Open source JavaScript parsers, such as parse5 and htmlparser2, consume a large amount of memory and degrade performance. To address these issues, ER provides a parser that supports stream processing and runs on the edge. You can use this parser to modify HTML content while it is being streamed.
Note: The parser is not based on web standards. It is a built-in feature of ER.
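
The streaming limitation of the regular expression approach can be illustrated with a minimal sketch in plain JavaScript. The sketch does not use any ER API. The two-chunk input, the `pattern`, and the variable names are assumptions chosen for illustration: they model an HTML fragment that arrives from the network split in the middle of a URL.

```javascript
// Hypothetical input: one HTML fragment delivered as two network chunks.
// The split falls in the middle of the href attribute value.
const chunks = ['<a href="http://www.exa', 'mple.com">Home</a>'];
const pattern = /http:\/\/www\.example\.com/g;
const target = "http://www.taobao.com";

// Per-chunk replacement: works on streamed data, but misses a URL
// that is split across two chunks.
const perChunk = chunks.map((c) => c.replace(pattern, target)).join("");

// Whole-document replacement: finds the URL, but only after every
// chunk has been buffered in memory.
const buffered = chunks.join("").replace(pattern, target);

console.log(perChunk); // the href is left unchanged
console.log(buffered); // the href is rewritten
```

The per-chunk version leaves the link untouched, and the buffered version works only because it gives up streaming. A parser that keeps its state across chunks avoids having to choose between the two.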

Examples

  • Scenario

    If you want to modify all the anchor tags (<a>) in an HTML page so that they link to http://www.taobao.com, use the following code in ER.

  • Sample code
    addEventListener('fetch', (event) => {
      event.respondWith(handle(event));
    });
    
    async function handle(event) {
      // 1. Fetch the HTML page that you want to modify.
      const response = await fetch("http://www.example.com");
      // 2. Configure the stream processing-based parser to manage the HTML content. The parser supports multiple CSS selectors.
      // Specify a selector that determines what to capture, and register a callback function that modifies the HTML page.
      const htmlStream = new HTMLStream(
        response.body, // Specify the HTML stream that you want to modify.
        [[
          "a",         // The element selector that specifies all the anchor tags. 
          {  
            // Register a callback function. The element callback function is called for each matched anchor tag, which corresponds to an element node in the Document Object Model (DOM) API.
            // In the callback function, you can modify the element through the object (e) that is passed in.
            element: function(e) {
              // Modify the href attribute.
              e.setAttribute("href", "http://www.taobao.com");
            }
          }
        ]]);
      
      // 3. Return the modified response to the browser. HTMLStream is a readable stream. Therefore, HTMLStream can be used in
      // all scenarios that support ReadableStream.
      return new Response(htmlStream);
    }
                
  • Result analysis
    The preceding sample code shows a simple way to modify an HTML stream in real time by using HTMLStream. HTMLStream works as follows:
    • The Fetch API returns a stream that represents the response. At this point, ER may not yet have retrieved the response body from the network layer. This mechanism reduces the frequency of garbage collection caused by data buffering.
    • HTMLStream is also a stream, and it supports the TransformStream operation. As HTMLStream receives data from the source stream, it calls the registered rewrite callback functions to modify the HTML content in real time. To modify an HTML page, obtain the raw data stream of the page and pass it to the HTMLStream that you create, as shown in Step 2 of the preceding sample code.
      • The first parameter of HTMLStream is a stream, which represents the raw data stream of the HTML page.
      • The second parameter of HTMLStream is an array of rewriters. Each rewriter is an array that contains a selector and an object. The selector specifies the HTML content that you want to modify. Some properties of the object are called as callback functions. In the preceding example, ["a", {...}] declares a rewriter, which is an array that contains two elements. The string "a" is the element selector that locates all anchor tags in the document. The object holds the callback functions, each of which receives an event object (e). If you use an element selector, the object can contain the following callback functions:
        • The element callback function. The signature of this callback function is function(e). This callback function is called when the elements selected by the element selector are parsed.
        • The comments callback function. The signature of this callback function is function(e). This callback function is called when the comments nested inside the elements are parsed.
        • The text callback function. The signature of this callback function is function(e). This callback function is called when the text in the elements is parsed. Because the text is processed as a stream, this callback function can be called multiple times, and each call may receive only a chunk of the string.
    • HTMLStream is a stream that you can return directly in a response. Unlike parse5 and htmlparser2, HTMLStream does not buffer data or generate a DOM tree. This greatly reduces processing time and memory consumption, so HTMLStream maintains high throughput and concurrency while it parses HTML content.
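
The rewriter structure described above can be sketched as a small helper that builds one [selector, callbacks] pair. This is a minimal sketch, not part of the ER API: `buildAttributeRewriter` and `FakeElement` are hypothetical names introduced for illustration. The only method assumed on the element object is `setAttribute`, which is the method used in the sample code. `FakeElement` is a stand-in that lets the callback be exercised outside ER.

```javascript
// Hypothetical helper: builds one rewriter, that is, a [selector, callbacks]
// pair in the shape that the sample code passes to HTMLStream.
function buildAttributeRewriter(selector, name, value) {
  return [
    selector,
    {
      // Invoked for each parsed element that matches the selector.
      element(e) {
        e.setAttribute(name, value);
      },
    },
  ];
}

// The second HTMLStream parameter is an array of such pairs.
const rewriters = [
  buildAttributeRewriter("a", "href", "http://www.taobao.com"),
];

// Stand-in for the element object (an assumption, not the ER type).
// Only setAttribute is modeled because it is the only method the sample uses.
class FakeElement {
  constructor() {
    this.attributes = {};
  }
  setAttribute(name, value) {
    this.attributes[name] = value;
  }
}

// Exercise the callback locally with the stand-in element.
const [selector, callbacks] = rewriters[0];
const fake = new FakeElement();
callbacks.element(fake);
console.log(selector, fake.attributes.href);

// In ER, the same array would be passed as the second argument:
//   return new Response(new HTMLStream(response.body, rewriters));
```

Keeping the rewriters in a separate array makes it possible to check the callbacks locally before deploying the code to ER, where the real element objects are supplied by HTMLStream.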