A New Stream: Adding the Generator Feature to Java

By Lei Wen (Yilai)

Preface

This article does not recommend tools or share application cases. Its main focus is to introduce a new design pattern.

This design pattern did not originate from Java, but rather from a simple scripting language. It requires minimal language features, making it universally applicable to many modern programming languages.

About Stream

Let's start by reviewing the traditional streaming APIs in Java. Since the introduction of lambda expressions and Stream in Java 8, development has become significantly more convenient. Stream is a fundamental skill that every Java developer should master, as it makes complex business logic far easier to process. However, apart from its support for parallel streams, Stream can hardly be considered a good design.

Firstly, the encapsulation is excessive, the implementation is complex, and the source code is extremely difficult to read. While this might be a compromise to support parallel stream, the level of coupling is too deep, making it hard to grasp. Beginners often find the source code daunting, creating an impression that Stream is an advanced and complex feature. In reality, it is not the case. Streams can be created in a very simple manner.

Secondly, the API is redundant, specifically in the case of stream.collect. In contrast, Kotlin provides comprehensive operations such as toList, toSet, and associate (toMap) that can be directly used on streams. It was not until Java 16 that Java added toList for Stream to directly utilize. Java even declined to add toSet and toMap.

Thirdly, the API functions are limited. In Java 8, there were only seven chain operations: map, filter, skip, limit, peek, distinct, and sorted. It was not until Java 9 that takeWhile and dropWhile were added. However, Kotlin offers many additional useful features beyond these, such as mapIndexed, mapNotNull, filterIndexed, filterNotNull, onEachIndexed, distinctBy, sortedBy, sortedWith, zip, and zipWithNext. The number of features is more than doubled. These features are not complex to implement, but they greatly improve the user experience.

In this article, I will propose a completely new mechanism for creating streams. This mechanism is extremely simple and can be implemented by any developer who understands lambda expressions (closures). Any programming language that supports closures can use this mechanism to create its own stream. The simplicity of this mechanism allows developers to create a large number of practical APIs at a low cost, surpassing the experience provided by Stream.

About Generator

Generator [1] is a widely recognized and important feature in many modern programming languages, including Python, Kotlin, C#, JavaScript, and others. It is primarily built upon the yield keyword (or method).

With generators, whether it's an iterable, iterator, or a complex closure, it can be directly transformed into a stream. For instance, if you want to convert an underscore-separated string into camel case, you can achieve this using a generator in Python.

def underscore_to_camelcase(s):
    def camelcase():
        yield str.lower
        while True:
            yield str.capitalize

    return ''.join(f(sub) for sub, f in zip(s.split('_'), camelcase()))

These few lines of code exemplify the cleverness of Python's generator. First, the yield keyword is used within the camelcase function, which the interpreter recognizes as a generator. This generator initially provides a lower function and subsequently provides numerous capitalize functions. Since generators are always lazily executed, it is common to generate infinite streams using the while True approach without wasting performance or memory. Second, in Python, a stream can be zipped together with a list. When a finite list and an infinite stream are zipped, the stream naturally ends when the list ends.

In the given code, Python utilizes generator comprehension [2] to call the content within the join() function, which essentially remains a stream. The zipped stream, when mapped, can be aggregated into a string using the join method.

The operations demonstrated in the above code can be easily accomplished in any language that supports generators. In Java, however, they have long been out of reach, whether in the long-established Java 8 or in the more recent OpenJDK 19, which previews Project Loom [3] and its coroutine-like virtual threads. Java, unfortunately, lacks direct support for generators.

In essence, the implementation of a generator relies on the suspension and resumption of continuation [4]. Continuation refers to a specific breakpoint after the program executes to a certain position. Coroutines involve jumping to a breakpoint of another function to continue execution without blocking threads after suspending the current function's breakpoint. Generators share a similar characteristic.

Python achieves function reentry and generator [5] by saving and restoring stack frames. Kotlin utilizes Continuation Passing Style (CPS) [6] technology to transform bytecode during the compilation phase, thereby simulating coroutines [7] on the JVM. Other languages either follow a similar approach or provide more direct support.

So, is there a way to implement or at least simulate a yield keyword in Java without coroutines to dynamically create streams with high performance? The answer is yes.

Main Body

In Java, streams are called Streams and in Kotlin, they are called Sequences. I wanted to call them flows, but that name was already taken. Since I couldn't come up with a better name, let's simply call them Seq for now.

Definitions

First, let's define the interface of Seq.

public interface Seq<T> {
    void consume(Consumer<T> consumer);
}

This interface essentially represents a consumer of consumers, and I will explain its true meaning later. Although this interface may seem abstract, it is actually quite common. The java.lang.Iterable interface already has a method of exactly this shape: forEach. By using a method reference, we can create the first instance of Seq.

List<Integer> list = Arrays.asList(1, 2, 3);
Seq<Integer> seq = list::forEach;

As you can see, in this example, consume and forEach are completely equivalent. In fact, I initially named this interface forEach, and only after several iterations did I change it to the more accurate name, consume.

Taking advantage of the great feature in Java where a single-method interface is automatically recognized as a functional interface, we can use a simple lambda expression to create a Seq, such as a Seq with only one element.

static <T> Seq<T> unit(T t) {
    return c -> c.accept(t);
}

This method, which is mathematically important (although not frequently used in practice), defines the unit element operation of the generic type Seq, which means the mapping of T -> Seq<T>.
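To make the two ways of creating a Seq concrete, here is a minimal, self-contained sketch. The Seq interface and unit are copied from above; the collect helper and the SeqBasicsDemo class are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Minimal copy of the Seq interface from this section.
interface Seq<T> {
    void consume(Consumer<T> consumer);

    // A Seq holding exactly one element.
    static <T> Seq<T> unit(T t) {
        return c -> c.accept(t);
    }
}

class SeqBasicsDemo {
    // Hypothetical helper: drains a Seq into a list by passing list::add as the consumer.
    static <T> List<T> collect(Seq<T> seq) {
        List<T> out = new ArrayList<>();
        seq.consume(out::add);
        return out;
    }

    public static void main(String[] args) {
        List<Integer> list = Arrays.asList(1, 2, 3);
        Seq<Integer> fromList = list::forEach; // an Iterable viewed as a Seq
        Seq<String> single = Seq.unit("a");    // a one-element Seq

        System.out.println(collect(fromList)); // prints [1, 2, 3]
        System.out.println(collect(single));   // prints [a]
    }
}
```

Nothing is stored inside either Seq; each one is only a function that feeds elements to whatever consumer it is given.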

Map and flatMap

map

Thinking of consume as forEach, it is easy to write a map [8] that converts a Seq of type T into a Seq of type E, that is, to obtain a mapping Seq<T> -> Seq<E> from a function T -> E.

default <E> Seq<E> map(Function<T, E> function) {
    return c -> consume(t -> c.accept(function.apply(t)));
}

flatMap

Similarly, you can continue to write flatMap, that is, expand each element into a Seq and then merge it.

default <E> Seq<E> flatMap(Function<T, Seq<E>> function) {
    return c -> consume(t -> function.apply(t).consume(c));
}

You can write these two methods in IDEA by yourself. It is actually very convenient to write with the help of intelligent prompts. If you think it is not intuitive to understand, just think of Seq as List and the consumer as forEach.
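As a quick check that the two methods behave like their Stream counterparts, the following sketch runs them on small lists. The toList method and the of helper match the ones this article shows in later sections; MapFlatMapDemo and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

interface Seq<T> {
    void consume(Consumer<T> consumer);

    @SafeVarargs
    static <T> Seq<T> of(T... ts) {
        return Arrays.asList(ts)::forEach;
    }

    default <E> Seq<E> map(Function<T, E> function) {
        return c -> consume(t -> c.accept(function.apply(t)));
    }

    default <E> Seq<E> flatMap(Function<T, Seq<E>> function) {
        return c -> consume(t -> function.apply(t).consume(c));
    }

    // Included only so that the results can be inspected.
    default List<T> toList() {
        List<T> list = new ArrayList<>();
        consume(list::add);
        return list;
    }
}

class MapFlatMapDemo {
    // map: one output element per input element.
    static List<Integer> lengths(List<String> words) {
        Seq<String> seq = words::forEach;
        return seq.map(String::length).toList();
    }

    // flatMap: each line expands into a Seq of words, and the results are merged in order.
    static List<String> splitAll(List<String> lines) {
        Seq<String> seq = lines::forEach;
        return seq.flatMap(line -> Seq.of(line.split(" "))).toList();
    }

    public static void main(String[] args) {
        System.out.println(lengths(Arrays.asList("ab", "cde")));  // prints [2, 3]
        System.out.println(splitAll(Arrays.asList("a b", "c")));  // prints [a, b, c]
    }
}
```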

Filter and Take/Drop

Map and flatMap provide the ability to map and combine Seqs. Seqs also have several core capabilities: element filtering and interrupt control.

filter

Filtering elements is also simple.

default Seq<T> filter(Predicate<T> predicate) {
    return c -> consume(t -> {
        if (predicate.test(t)) {
            c.accept(t);
        }
    });
}

take

There are many scenarios for interrupt control of Seq. Taking is one of the most common scenarios. That is, take the first n elements and drop the rest, which is equivalent to Stream.limit.

Since Seq does not depend on iterators, interruption must be implemented through exceptions. To this end, we define a dedicated exception as a global singleton and override fillInStackTrace so that it does not capture the call stack, which reduces the performance overhead. (Because the instance is a global singleton that is created only once, skipping this override would not matter much either.)

public final class StopException extends RuntimeException {
    public static final StopException INSTANCE = new StopException();

    @Override
    public synchronized Throwable fillInStackTrace() {
        return this;
    }
}

Corresponding methods:

static <T> T stop() {
    throw StopException.INSTANCE;
}

default void consumeTillStop(Consumer<T> consumer) {
    try {
        consume(consumer);
    } catch (StopException ignore) {}
}

Then you can take elements:

default Seq<T> take(int n) {
    return c -> {
        int[] i = {n};
        consumeTillStop(t -> {
            if (i[0]-- > 0) {
                c.accept(t);
            } else {
                stop();
            }
        });
    };
}
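The following sketch puts the pieces together and applies take to an infinite Seq. The StopException, stop, consumeTillStop, and take definitions follow the code above (with the consumer parameter typed as Consumer<T>); TakeDemo, firstNaturals, and the small toList helper are hypothetical additions for the demonstration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class StopException extends RuntimeException {
    static final StopException INSTANCE = new StopException();

    private StopException() {}

    // Skip stack-trace capture: the exception is used purely for control flow.
    @Override
    public synchronized Throwable fillInStackTrace() {
        return this;
    }
}

interface Seq<T> {
    void consume(Consumer<T> consumer);

    static <T> T stop() {
        throw StopException.INSTANCE;
    }

    default void consumeTillStop(Consumer<T> consumer) {
        try {
            consume(consumer);
        } catch (StopException ignore) {
            // Normal early termination.
        }
    }

    default Seq<T> take(int n) {
        return c -> {
            int[] i = {n};
            consumeTillStop(t -> {
                if (i[0]-- > 0) {
                    c.accept(t);
                } else {
                    stop();
                }
            });
        };
    }

    default List<T> toList() {
        List<T> list = new ArrayList<>();
        consume(list::add);
        return list;
    }
}

class TakeDemo {
    // Takes the first n natural numbers from a producer that never ends on its own.
    static List<Integer> firstNaturals(int n) {
        Seq<Integer> naturals = c -> {
            int i = 0;
            while (true) {
                c.accept(i++);
            }
        };
        return naturals.take(n).toList();
    }

    public static void main(String[] args) {
        System.out.println(firstNaturals(3)); // prints [0, 1, 2]
    }
}
```

Note that the producer's loop is only abandoned when the element after the last wanted one arrives, so take(n) pulls n + 1 elements from the source.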

drop

Dropping is the counterpart of taking: drop the first n elements, which is equivalent to Stream.skip. It does not involve interrupt control of the Seq; it is more of a variant of filter, namely a filter with state. Compare its implementation with that of take above. As the Seq iterates, a counter keeps updating its state, but the outside world cannot observe this counter. The clean nature of Seq already shows here: even when it carries state, none of that state is exposed.

default Seq<T> drop(int n) {
    return c -> {
        int[] a = {n - 1};
        consume(t -> {
            if (a[0] < 0) {
                c.accept(t);
            } else {
                a[0]--;
            }
        });
    };
}
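One consequence of keeping the counter inside the returned lambda is that every consumption starts with a fresh counter, so the same dropped Seq can be consumed more than once. The sketch below illustrates this; DropDemo, dropTwo, and the toList helper are hypothetical names used only here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

interface Seq<T> {
    void consume(Consumer<T> consumer);

    default Seq<T> drop(int n) {
        return c -> {
            // Created anew on every consumption, so no state leaks between runs.
            int[] a = {n - 1};
            consume(t -> {
                if (a[0] < 0) {
                    c.accept(t);
                } else {
                    a[0]--;
                }
            });
        };
    }

    default List<T> toList() {
        List<T> list = new ArrayList<>();
        consume(list::add);
        return list;
    }
}

class DropDemo {
    static Seq<Integer> dropTwo(List<Integer> input) {
        Seq<Integer> seq = input::forEach;
        return seq.drop(2);
    }

    public static void main(String[] args) {
        Seq<Integer> dropped = dropTwo(Arrays.asList(1, 2, 3, 4, 5));
        System.out.println(dropped.toList()); // prints [3, 4, 5]
        System.out.println(dropped.toList()); // prints [3, 4, 5] again
    }
}
```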

Other APIs

onEach

Attach an additional consumer to each element of the Seq without executing the Seq. This corresponds to Stream.peek.

default Seq<T> onEach(Consumer<T> consumer) {
    return c -> consume(consumer.andThen(c));
}

zip

Pair the elements of the Seq with the elements of an Iterable, combine each pair with a function, and return the results as a new Seq. Stream has no counterpart, but Python has a built-in of the same name.

default <E, R> Seq<R> zip(Iterable<E> iterable, BiFunction<T, E, R> function) {
    return c -> {
        Iterator<E> iterator = iterable.iterator();
        consumeTillStop(t -> {
            if (iterator.hasNext()) {
                c.accept(function.apply(t, iterator.next()));
            } else {
                stop();
            }
        });
    };
}
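As a small usage sketch, zipping an infinite Seq of indices with a finite list ends as soon as the list is exhausted. The zip, stop, and consumeTillStop code follows the definitions above; ZipDemo, numbered, and the toList helper are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;

final class StopException extends RuntimeException {
    static final StopException INSTANCE = new StopException();

    private StopException() {}

    @Override
    public synchronized Throwable fillInStackTrace() {
        return this;
    }
}

interface Seq<T> {
    void consume(Consumer<T> consumer);

    static <T> T stop() {
        throw StopException.INSTANCE;
    }

    default void consumeTillStop(Consumer<T> consumer) {
        try {
            consume(consumer);
        } catch (StopException ignore) {
            // Normal early termination.
        }
    }

    default <E, R> Seq<R> zip(Iterable<E> iterable, BiFunction<T, E, R> function) {
        return c -> {
            Iterator<E> iterator = iterable.iterator();
            consumeTillStop(t -> {
                if (iterator.hasNext()) {
                    c.accept(function.apply(t, iterator.next()));
                } else {
                    stop();
                }
            });
        };
    }

    default List<T> toList() {
        List<T> list = new ArrayList<>();
        consume(list::add);
        return list;
    }
}

class ZipDemo {
    // Prefixes each item with its index, taken from an endless Seq of naturals.
    static List<String> numbered(List<String> items) {
        Seq<Integer> naturals = c -> {
            int i = 0;
            while (true) {
                c.accept(i++);
            }
        };
        return naturals.zip(items, (n, s) -> n + ":" + s).toList();
    }

    public static void main(String[] args) {
        System.out.println(numbered(Arrays.asList("a", "b", "c"))); // prints [0:a, 1:b, 2:c]
    }
}
```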

Terminal operation

The methods implemented above are chain APIs of Seq that map one Seq to another, but Seq is still lazy or has not yet been actually executed. Executing this Seq requires the so-called terminal operations to consume or aggregate Seq. In Stream, consumption is forEach, and aggregation is Collector. For Collector, there can be a better design, which will not be expanded here. However, for the sake of example, we can simply and quickly implement a join.

default String join(String sep) {
    StringJoiner joiner = new StringJoiner(sep);
    consume(t -> joiner.add(t.toString()));
    return joiner.toString();
}

And implement the toList.

default List<T> toList() {
    List<T> list = new ArrayList<>();
    consume(list::add);
    return list;
}

So far, we have implemented a streaming API with only a few dozen lines of code. In most cases, these APIs already cover 80%-90% of usage scenarios. You can use them in other programming languages, such as Go (laughs).

Derivation of a Generator

This article has been talking about generators from the very beginning, and it is no exaggeration to say that generators are the core feature. However, why am I still not talking about generators after I have introduced several core streaming APIs? In fact, I am not keeping you in suspense. Read carefully, and you can find that the generators appeared as early as when I said Iterable was a born Seq.

List<Integer> list = Arrays.asList(1, 2, 3);
Seq<Integer> seq = list::forEach;

Don't you see it? Expand this method reference into an ordinary lambda expression:

Seq<Integer> seq = c -> list.forEach(c);

Replace this forEach with a more traditional for loop:

Seq<Integer> seq = c -> {
    for (Integer i : list) {
        c.accept(i);
    }
};

Since this list is known to be [1, 2, 3], the above code can be written as:

Seq<Integer> seq = c -> {
    c.accept(1);
    c.accept(2);
    c.accept(3);
};

Does it look familiar? Let's take a look at something similar in Python:

def seq():
    yield 1
    yield 2
    yield 3

The two are almost the same in form. This is actually a generator. The accept in this code plays the role of yield. The name of the consume interface means that it is a consumption operation, and all terminal operations are implemented based on this consumption operation. Functionally, it is completely equivalent to forEach of Iterable. The reason why it is not directly called forEach is that its elements are not self-contained but are temporarily generated through code blocks in the closure.

This kind of generator is not a generator that uses continuation to suspend in the traditional sense but uses closures to capture temporarily generated elements in code blocks. Even if there is no suspension, this kind of generator can simulate the usage and characteristics of traditional generators to a great extent. In fact, all the above chain APIs implementations are essentially generators, but the generated elements come from the original Seq.
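To show that such a closure can carry its own evolving state just like a yield-based generator, here is a sketch of a Fibonacci generator. The take machinery follows the definitions given earlier; FibonacciDemo, fibonacci, and firstFibonacci are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class StopException extends RuntimeException {
    static final StopException INSTANCE = new StopException();

    private StopException() {}

    @Override
    public synchronized Throwable fillInStackTrace() {
        return this;
    }
}

interface Seq<T> {
    void consume(Consumer<T> consumer);

    static <T> T stop() {
        throw StopException.INSTANCE;
    }

    default void consumeTillStop(Consumer<T> consumer) {
        try {
            consume(consumer);
        } catch (StopException ignore) {
            // Normal early termination.
        }
    }

    default Seq<T> take(int n) {
        return c -> {
            int[] i = {n};
            consumeTillStop(t -> {
                if (i[0]-- > 0) {
                    c.accept(t);
                } else {
                    stop();
                }
            });
        };
    }

    default List<T> toList() {
        List<T> list = new ArrayList<>();
        consume(list::add);
        return list;
    }
}

class FibonacciDemo {
    // An endless generator: local variables in the closure hold the running state.
    static Seq<Long> fibonacci() {
        return c -> {
            long a = 0;
            long b = 1;
            while (true) {
                c.accept(a); // plays the role of "yield a"
                long next = a + b;
                a = b;
                b = next;
            }
        };
    }

    static List<Long> firstFibonacci(int n) {
        return fibonacci().take(n).toList();
    }

    public static void main(String[] args) {
        System.out.println(firstFibonacci(8)); // prints [0, 1, 1, 2, 3, 5, 8, 13]
    }
}
```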

With the generator, we can write the operation of converting underscores to camelcase mentioned above in Java.

static String underscoreToCamel(String str) {
    // Java has no built-in method for capitalizing the first letter, so write a simple one
    // (empty segments, as produced by "a__b", are returned unchanged).
    UnaryOperator<String> capitalize = s -> s.isEmpty() ? s
        : s.substring(0, 1).toUpperCase() + s.substring(1).toLowerCase();
    // Use the generator to create a Seq of functions.
    Seq<UnaryOperator<String>> seq = c -> {
        // Yield the first lowercase function.
        c.accept(String::toLowerCase);
        // IDEA will generate an alert indicating the risk of an endless loop. Ignore the alert.
        while (true) {
            // Yield functions with the first letter capitalized on demand.
            c.accept(capitalize);
        }
    };
    List<String> split = Arrays.asList(str.split("_"));
    // Both zip and join here are implemented above.
    return seq.zip(split, (f, sub) -> f.apply(sub)).join("");
}

You can copy these pieces of code and run them to see if they have really achieved their target functions.

The Essence of a Generator

Although the generator has been derived, it may still be unclear what actually happened during the derivation: how does the endless loop work, and how are the elements generated? To explain further, here is another familiar example.

The Producer-Consumer Pattern

The relationship between producers and consumers appears not only in multi-threading or coroutine scenarios but also in some classic single-threaded scenarios. For example, A and B cooperate on a project and each develops one module: A is responsible for producing data, and B is responsible for using it. A does not care how B processes the data; B may need to filter it first, aggregate it before calculation, or write it to local or remote storage. Likewise, B does not care how A obtains the data. The only problem is that there is too much data for memory to hold at one time. In this case, the traditional approach is for A to provide an interface that takes a callback function (a consumer), and for B to pass in a specific consumer when requesting A's data.

public void produce(Consumer<String> callback) {
    // do something that produces strings
    // then pass them to the callback consumer
}

This kind of callback-based interaction is so classic that there is nothing more to say about it. However, now that we have a generator, we can make a bold but minor modification. Look carefully at the producer interface above: it takes a consumer as input and returns void. Surprisingly, it is actually a Seq!

Seq<String> producer = this::produce;

Next, we only need to slightly adjust the code to upgrade this callback function-based interface and turn it into a generator.

public Seq<String> produce() {
    return c -> {
        // still do something that produces strings
        // then pass them to the callback consumer
    };
}

Based on this layer of abstraction, A as a producer and B as a consumer are completely decoupled. A only needs to put the data production process into the closure of the generator. All side effects during the process, such as I/O operations, are completely isolated by this closure. B gets a clean Seq directly. B doesn't need to and is unable to care about the internal details of the Seq. B only needs to focus on what he wants to do.

More importantly, although A and B are completely decoupled in operation logic and invisible to each other, they are overlapped in CPU scheduling time. B can even directly block and interrupt A's production process. It can be said that there is a coroutine even without a coroutine.

At this point, we have finally discovered the essence of Seq as a generator: consumer of callback. It's kind of amazing that a consumer of a callback function becomes a producer. But it makes sense to think carefully: the guy who can meet the consumer demand - callback, no matter how strange that demand is, is the producer.

It is easy to find that the call overhead of a generator based on the callback mechanism is only the execution overhead of the code blocks inside the generator closure and a little closure creation overhead. In many business scenarios that involve stream computing and control, this will bring significant memory and performance advantages. Later, I will show you their performance advantages based on specific scenario examples.

In addition, looking at the transformed code, you will find that what the producer returns is still just a function; no code has been executed and no data has been produced yet. This is the natural advantage of a generator as an anonymous interface: lazy evaluation. The consumer seems to hold the whole Seq, but in fact it holds only a cheap recipe that can be created and discarded freely. The Seq is executed only when a real callback is passed in.
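Laziness can be observed directly by counting how often the producer body runs. In the sketch below, LazySeqDemo and the runs counter are hypothetical; map and toList follow the definitions given earlier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.function.Function;

interface Seq<T> {
    void consume(Consumer<T> consumer);

    default <E> Seq<E> map(Function<T, E> function) {
        return c -> consume(t -> c.accept(function.apply(t)));
    }

    default List<T> toList() {
        List<T> list = new ArrayList<>();
        consume(list::add);
        return list;
    }
}

class LazySeqDemo {
    // The producer increments `runs` each time its body is actually executed.
    static Seq<String> produce(AtomicInteger runs) {
        return c -> {
            runs.incrementAndGet();
            c.accept("a");
            c.accept("b");
        };
    }

    public static void main(String[] args) {
        AtomicInteger runs = new AtomicInteger();
        Seq<String> upper = produce(runs).map(String::toUpperCase);

        System.out.println(runs.get());     // prints 0: building the pipeline runs nothing
        System.out.println(upper.toList()); // prints [A, B]
        System.out.println(upper.toList()); // prints [A, B]
        System.out.println(runs.get());     // prints 2: one run per terminal operation
    }
}
```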

I/O Isolation and Seq Output

Haskell invented IO Monad [9] to isolate I/O operations from pure functions. Java struggles to achieve a similar encapsulation effect with Stream. Take the java.io.BufferedReader as an example. Read a local file as a Stream, which can be written as follows:

Stream<String> lines = new BufferedReader(new InputStreamReader(new FileInputStream("file"))).lines();

If you take a closer look at the implementation of the lines method, you'll see that it uses a big chunk of code to create an iterator before turning it into a stream. Regardless of how cumbersome its implementation is, the first thing to note here is that BufferedReader is a Closeable. The safe way is to close it after use, or to wrap it in try-with-resources so that it is closed automatically. But BufferedReader.lines doesn't close the source, which makes it a less safe interface. In other words, its isolation is incomplete. Java has also patched this with java.nio.file.Files.lines, which registers an onClose handler so that closing the returned stream (for example with try-with-resources) also closes the underlying file.

But is there a more universal method? After all, not everyone knows the difference in safety between BufferedReader.lines and Files.lines, and not every Closeable offers a stream interface that can be closed safely. Some offer no stream interface at all.

Fortunately, now we have Seq, which has the built-in advantage of isolating side effects. It happens that in scenarios involving a large number of data I/O operations, using callback interaction is a very classic design method. These scenarios are the best stage for the method to make a big splash.

It is simple to isolate I/O with generators. It only needs to wrap the whole try-with-resources code, which is equivalent to covering the whole life cycle of I/O.

Seq<String> seq = c -> {
    try (BufferedReader reader = Files.newBufferedReader(Paths.get("file"))) {
        String s;
        while ((s = reader.readLine()) != null) {
            c.accept(s);
        }
    } catch (IOException e) {
        // Catch only IOException so that the StopException thrown by take() is not wrapped.
        throw new UncheckedIOException(e);
    }
};

The core code is actually three lines, building the data source, reading the data one by one, and then yielding (that is, accepting). Any subsequent operations on the Seq appear to occur after the Seq is created. Actual execution is wrapped into this I/O lifecycle. That is, one read for one consumption, which is use-as-you-go.

In other words, the generator's callback mechanism ensures that even though Seq can be passed around as a variable, any side-effect operation involved is executed lazily in the same code block. Unlike Monad, it does not need to define many kinds of Monads, such as IO Monad and State Monad.
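The sketch below checks this behavior end to end: a file-backed Seq is cut short with take, and the reader is still closed because the StopException unwinds through try-with-resources. It catches only IOException so that the StopException is not wrapped on its way out. FileSeqDemo, lines, and firstTwoLinesOfTempFile are hypothetical names, and the temporary file exists only for the demonstration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

final class StopException extends RuntimeException {
    static final StopException INSTANCE = new StopException();

    private StopException() {}

    @Override
    public synchronized Throwable fillInStackTrace() {
        return this;
    }
}

interface Seq<T> {
    void consume(Consumer<T> consumer);

    static <T> T stop() {
        throw StopException.INSTANCE;
    }

    default void consumeTillStop(Consumer<T> consumer) {
        try {
            consume(consumer);
        } catch (StopException ignore) {
            // Normal early termination.
        }
    }

    default Seq<T> take(int n) {
        return c -> {
            int[] i = {n};
            consumeTillStop(t -> {
                if (i[0]-- > 0) {
                    c.accept(t);
                } else {
                    stop();
                }
            });
        };
    }

    default List<T> toList() {
        List<T> list = new ArrayList<>();
        consume(list::add);
        return list;
    }
}

class FileSeqDemo {
    // The whole I/O lifecycle lives inside the closure; the file is opened per consumption.
    static Seq<String> lines(Path path) {
        return c -> {
            try (BufferedReader reader = Files.newBufferedReader(path)) {
                String s;
                while ((s = reader.readLine()) != null) {
                    c.accept(s);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
    }

    // Writes a small temporary file, reads only its first two lines, then deletes it.
    static List<String> firstTwoLinesOfTempFile() {
        try {
            Path tmp = Files.createTempFile("seq-demo", ".txt");
            try {
                Files.write(tmp, Arrays.asList("a", "b", "c", "d"));
                return lines(tmp).take(2).toList();
            } finally {
                Files.delete(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstTwoLinesOfTempFile()); // prints [a, b]
    }
}
```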

Similarly, here is another example of Alibaba Cloud middleware, which uses Tunnel to download the well-known ODPS table data as a Seq:

public static Seq<Record> downloadRecords(TableTunnel.DownloadSession session) {
    return c -> {
        long count = session.getRecordCount();
        try (TunnelRecordReader reader = session.openRecordReader(0, count)) {
            for (long i = 0; i < count; i++) {
                c.accept(reader.read());
            }
        } catch (RuntimeException e) {
            // Let unchecked exceptions, including the StopException used by take(), pass through as-is.
            throw e;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    };
}

With the Record Seq, if a map function can be implemented, the Record Seq can be easily mapped to a DTO Seq with business semantics. This is actually equivalent to an ODPS Reader.

Asynchronous Stream

A generator based on the callback mechanism is naturally compatible with asynchronous operations, in addition to playing an important role in the I/O field. After all, as soon as you hear the phrase callback function, many of you will reflexively think of asynchronous code and Future. By its nature, a callback function does not care where it is placed or how it is used. For example, hand the callback function over to some brute-force asynchronous logic:

public static Seq<Integer> asyncSeq() {
    return c -> {
        CompletableFuture.runAsync(() -> c.accept(1));
        CompletableFuture.runAsync(() -> c.accept(2));
    };
}

This is a simple asynchronous stream generator. For external users, apart from the lack of any guarantee on element order, an asynchronous stream is no different from a synchronous stream. In essence, both are just a piece of runnable code that generates data while running.

Parallel Stream

Since the callback function can be used by anyone, why not by ForkJoinPool? Java's parallelStream is implemented on top of ForkJoinPool, and we can use the same pool to create our own parallel stream. The step itself is very simple: replace CompletableFuture.runAsync in the asynchronous stream example above with ForkJoinPool.submit. Note, however, that a parallelStream blocks until its terminal operation (such as the commonly used forEach) has finished: it does not merely submit the tasks to the ForkJoinPool but also joins them after submission.

Given this, we might as well adopt the simplest idea to construct a list of ForkJoinTask, submit the elements to forkJoinPool in turn, generate a task and add it to the list, and then join all the tasks in the list after all the elements are submitted.

default Seq<T> parallel() {
    ForkJoinPool pool = ForkJoinPool.commonPool();
    return c -> map(t -> pool.submit(() -> c.accept(t))).cache().consume(ForkJoinTask::join);
}

This is the generator-based parallel stream, which requires only two lines of code to implement. As mentioned at the beginning of this article, streams can be created in a very simple way. Even the parallel stream, which Stream has struggled to implement, can be implemented easily in a different way.
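For readers who want to run the idea without the cache method, which is not shown in this article, the sketch below writes out the task list explicitly as described in the prose above. It assumes that cache eagerly collects the submitted tasks before they are joined; ParallelSeqDemo and parallelSum are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

interface Seq<T> {
    void consume(Consumer<T> consumer);

    // Same idea as parallel() above, with the task list spelled out (assumed meaning of cache()).
    default Seq<T> parallel() {
        ForkJoinPool pool = ForkJoinPool.commonPool();
        return c -> {
            List<ForkJoinTask<?>> tasks = new ArrayList<>();
            // Submit one task per element; the callback runs on pool threads.
            consume(t -> tasks.add(pool.submit(() -> c.accept(t))));
            // Block until every submitted task has finished.
            tasks.forEach(ForkJoinTask::join);
        };
    }
}

class ParallelSeqDemo {
    // The consumer must be thread-safe because it is invoked from several threads.
    static long parallelSum(List<Integer> input) {
        AtomicLong sum = new AtomicLong();
        Seq<Integer> seq = input::forEach;
        seq.parallel().consume(i -> sum.addAndGet(i));
        return sum.get();
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(Arrays.asList(1, 2, 3, 4, 5))); // prints 15
    }
}
```

Because all tasks are submitted before any is joined, this form only suits finite Seqs; an endless producer would never reach the join step.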

It is worth emphasizing again that this mechanism is not Java-only. Any programming language that supports closures can use it. In fact, the earliest verification and implementation of this stream mechanism was done on the built-in rudimentary scripting language of the software AutoHotKey_v2 [10].

Talk About the Producer-Consumer Pattern Again

To explain the callback nature of the generator, the producer-consumer pattern in a single-threaded environment was introduced earlier. Now that the asynchronous stream has been implemented, things become even more interesting.

Recall that Seq, as an intermediate data structure, can completely decouple producers and consumers, with one side producing data and the other side taking data for consumption. Does this structure look familiar? Yes, it is the blocking queue familiar to Java developers, and the channel in languages that support coroutines, such as Go and Kotlin.

In a sense, the channel is also a blocking queue. The main difference between the channel and the traditional blocking queue is that when the data in the channel exceeds the limit or is empty, the corresponding producer/consumer will be suspended instead of being blocked. Both methods will suspend production/consumption, but after the coroutine is suspended, the CPU can be released, and it can continue to work in other coroutines.

What are the advantages of Seq over Channel? There are a large number of advantages. First of all, the code block of callback in the generator closure strictly ensures that production and consumption must be executed alternately — the one who gets in first must get out first, the one who gets in must get out, and nothing gets out if nothing gets in. Therefore, there is no need to separately open up heap memory to maintain a queue. Since there is no queue, there is no lock, and thus no blocking or suspension. Secondly, Seq essentially uses consumption to monitor production. Without production, there is naturally no consumption. There will be no overproduction because Seq is lazy. Even if producers produce endlessly because of the while endless loop, it is just a common infinite Seq.
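The strict alternation of production and consumption in a synchronous Seq can be made visible by logging both sides into one list. This is only an illustrative sketch; AlternationDemo and trace are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

interface Seq<T> {
    void consume(Consumer<T> consumer);
}

class AlternationDemo {
    // Records the order in which the producer and the consumer run.
    static List<String> trace() {
        List<String> log = new ArrayList<>();
        Seq<Integer> producer = c -> {
            for (int i = 1; i <= 3; i++) {
                log.add("produce " + i);
                c.accept(i); // control passes to the consumer right here
            }
        };
        producer.consume(x -> log.add("consume " + x));
        return log;
    }

    public static void main(String[] args) {
        // prints [produce 1, consume 1, produce 2, consume 2, produce 3, consume 3]
        System.out.println(trace());
    }
}
```

Each element is handed over by a plain method call, which is why no queue is needed between the two sides.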

This is another way of understanding the generator, a queue-free, lock-free, and non-blocking channel. The deadlock and memory leak problems of the Go language channel, which are often criticized, do not exist at all in Seq. Kotlin's two similar APIs, asynchronous flow and synchronous sequence, can be replaced by Seq.

It can be said that there is no safer channel than Seq, because it has no security risk at all. What should we do if there is no consumption after production? Seq is inherently lazy. Without consumption, nothing will be produced. What should we do if the channel is not closed after consumption? Seq doesn't need to be closed. It's not necessary to close a lambda.

To help you understand more easily, here is a simple channel example. First, implement an asynchronous consumption interface based on ForkJoinPool. This interface allows users to freely choose whether to join after consumption.

default void asyncConsume(Consumer<T> consumer) {
    ForkJoinPool pool = ForkJoinPool.commonPool();
    map(t -> pool.submit(() -> consumer.accept(t))).cache().consume(ForkJoinTask::join);
}

With the asynchronous consumption interface, the channel function of Seq can be demonstrated immediately.

@Test
public void testChan() {
    // Produce infinite natural numbers and put them into the channel Seq, where the Seq itself is the channel, so it doesn't matter whether it is a synchronous Seq or an asynchronous Seq.
    Seq<Long> seq = c -> {
        long i = 0;
        while (true) {
            c.accept(i++);
        }
    };
    long start = System.currentTimeMillis();
    // The channel Seq is given to the consumer. The consumer asks for five even numbers only.
    seq.filter(i -> (i & 1) == 0).take(5).asyncConsume(i -> {
        try {
            Thread.sleep(1000);
            System.out.printf("produce %d and consume\n", i);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    });
    System.out.printf("elapsed time: %dms\n", System.currentTimeMillis() - start);
}

Result

produce 0 and consume
produce 8 and consume
produce 6 and consume
produce 4 and consume
produce 2 and consume
elapsed time: 1032ms

As you can see, because the consumption is executed in parallel, even if each element takes 1 second to consume, the overall time taken ends up being a little more than 1 second. Of course, it still differs from the traditional channel model, for example in how the actual worker threads are used. A more comprehensive design is to add a lock-free, non-blocking queue to the stream to implement a sound channel, which can solve many problems of the Go channel and improve performance at the same time. I will write another article to discuss it later.

Application Scenarios of a Generator

In the previous section, I described the nature and features of the generator. It is a consumer of callback that perfectly encapsulates I/O operations in the form of closures, seamlessly switches to asynchronous and parallel streams, and plays a lock-free channel role in asynchronous interactions. In addition to the advantages of these core features, it has many interesting and valuable application scenarios.

Tree Traversal

The fate of a callback function determines that it does not care where it is placed or how it is used. For example, it can be put in recursion. A typical scenario of recursion is tree traversal. As a contrast, let's look at how to use yield to traverse a binary tree in Python:

def scan_tree(node):
    yield node.value
    if node.left:
        yield from scan_tree(node.left)
    if node.right:
        yield from scan_tree(node.right)

For Seq, since Java does not allow functions to be internally nested, it is necessary to write more code. The core principle is simple. That is, throw the callback function to the recursive function and remember to bring it along with each recursion.

//static <T> Seq<T> of(T... ts) {
//    return Arrays.asList(ts)::forEach;
//}

// Recursive function
public static <N> void scanTree(Consumer<N> c, N node, Function<N, Seq<N>> sub) {
    c.accept(node);
    sub.apply(node).consume(n -> {
        if (n != null) {
            scanTree(c, n, sub);
        }
    });
}

// General method, which can traverse any tree
public static <N> Seq<N> ofTree(N node, Function<N, Seq<N>> sub) {
    return c -> scanTree(c, node, sub);
}

// Traverse a binary tree
public static Seq<Node> scanTree(Node node) {
    return ofTree(node, n -> Seq.of(n.left, n.right));
}
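Here is a runnable sketch of the traversal on a small binary tree. The of, scanTree, and ofTree methods follow the code above and are placed on Seq as static methods; the Node class, TreeSeqDemo, and the sample tree are hypothetical and exist only for the demonstration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

interface Seq<T> {
    void consume(Consumer<T> consumer);

    @SafeVarargs
    static <T> Seq<T> of(T... ts) {
        return Arrays.asList(ts)::forEach;
    }

    // Recursive helper: visit the node, then recurse into each non-null child.
    static <N> void scanTree(Consumer<N> c, N node, Function<N, Seq<N>> sub) {
        c.accept(node);
        sub.apply(node).consume(n -> {
            if (n != null) {
                scanTree(c, n, sub);
            }
        });
    }

    static <N> Seq<N> ofTree(N node, Function<N, Seq<N>> sub) {
        return c -> scanTree(c, node, sub);
    }
}

// Hypothetical binary-tree node used only for this demo.
class Node {
    final int value;
    final Node left;
    final Node right;

    Node(int value, Node left, Node right) {
        this.value = value;
        this.left = left;
        this.right = right;
    }
}

class TreeSeqDemo {
    // Collects node values in pre-order (node, left subtree, right subtree).
    static List<Integer> preOrder(Node root) {
        List<Integer> values = new ArrayList<>();
        Seq.ofTree(root, n -> Seq.of(n.left, n.right)).consume(n -> values.add(n.value));
        return values;
    }

    // Tree shape: 1 has children 2 and 3; 2 has a left child 4.
    static Node sample() {
        return new Node(1, new Node(2, new Node(4, null, null), null), new Node(3, null, null));
    }

    public static void main(String[] args) {
        System.out.println(preOrder(sample())); // prints [1, 2, 4, 3]
    }
}
```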

The ofTree here is a very powerful tree traversal method. Traversing trees is not rare, but generating the traversal process as a stream provides a huge room for imagination. In programming languages, the construction of trees can be found everywhere. For example, we can simply create a stream that traverses JSONObject.

static Seq<Object> ofJson(Object node) {
    return Seq.ofTree(node, n -> c -> {
        if (n instanceof Iterable) {
            ((Iterable<?>)n).forEach(c);
        } else if (n instanceof Map) {
            ((Map<?, ?>)n).values().forEach(c);
        }
    });
}

Then it becomes very convenient to analyze JSON. For example, suppose you want to check whether a JSON has an Integer field regardless of the level of the field. You can achieve it by using the any or anyMatch method with just one line of code:

boolean hasInteger = ofJson(node).any(t -> t instanceof Integer);

This method is powerful not only because it is simple, but also because it is a short-circuit operation. Short-circuiting a depth-first recursive function in ordinary code is troublesome: you either have to throw an exception or add a context parameter that takes part in the recursion (and the recursion can only stop after unwinding back to the root node). With Seq, you only need an any, all, or none operation.
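The short-circuit behavior can be made visible by counting how many nodes are visited before the traversal stops. The sketch below is only one possible way to write any: it assumes the producer is interrupted by throwing a private, stackless exception (the Stop class is hypothetical), which may differ from the actual implementation.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

class ShortCircuitDemo {
    // Hypothetical stop signal; stack trace and suppression are disabled to keep it cheap.
    static final class Stop extends RuntimeException {
        static final Stop INSTANCE = new Stop();

        private Stop() {
            super(null, null, false, false);
        }
    }

    interface Seq<T> {
        void consume(Consumer<T> consumer);

        // Assumed implementation: abort the producer as soon as one element matches.
        default boolean any(Predicate<T> predicate) {
            boolean[] found = {false};
            try {
                consume(t -> {
                    if (predicate.test(t)) {
                        found[0] = true;
                        throw Stop.INSTANCE;
                    }
                });
            } catch (Stop ignored) {
                // The producer was interrupted on purpose.
            }
            return found[0];
        }
    }

    static <N> void scanTree(Consumer<N> c, N node, Function<N, Seq<N>> sub) {
        c.accept(node);
        sub.apply(node).consume(n -> {
            if (n != null) {
                scanTree(c, n, sub);
            }
        });
    }

    static <N> Seq<N> ofTree(N node, Function<N, Seq<N>> sub) {
        return c -> scanTree(c, node, sub);
    }

    static Seq<Object> ofJson(Object node) {
        return ofTree(node, n -> c -> {
            if (n instanceof Iterable) {
                ((Iterable<?>) n).forEach(c);
            } else if (n instanceof Map) {
                ((Map<?, ?>) n).values().forEach(c);
            }
        });
    }

    // A JSON-like structure with 7 nodes in total: the root list, "a", the inner list, 1, "b", "c", "d".
    static Object sample() {
        return Arrays.asList("a", Arrays.asList(1, "b"), "c", "d");
    }

    // Returns how many nodes were tested before an Integer was found, or -1 if there is none.
    static int visitedBeforeStop(Object json) {
        int[] visited = {0};
        boolean found = ofJson(json).any(t -> {
            visited[0]++;
            return t instanceof Integer;
        });
        return found ? visited[0] : -1;
    }

    public static void main(String[] args) {
        System.out.println(visitedBeforeStop(sample())); // 4
    }
}
```

Only 4 of the 7 nodes are visited: the recursion unwinds as soon as the Integer is found, without any extra context parameter in scanTree.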

For example, suppose you want to check whether any JSON field contains the illegal string "114514". This can also be implemented with a single line of code:

boolean isIllegal = ofJson(node).any(n -> (n instanceof String) && ((String)n).contains("114514"));

By the way, the predecessor of JSON, XML, is also a tree structure. Combined with many mature XML parsers, we can also implement similar streaming scanning tools. For example, can we get a faster Excel parser?

Better Cartesian Product

The Cartesian product may not be of much use in most day-to-day development, but it is an important construct in functional languages and is extremely common in the field of operations research when building optimization models. Previously, if you wanted to use Stream to build multiple Cartesian products in Java, you needed to nest multiple layers of flatMap.

public static Stream<Integer> cartesian(List<Integer> list1, List<Integer> list2, List<Integer> list3) {
    return list1.stream().flatMap(i1 ->
        list2.stream().flatMap(i2 ->
            list3.stream().map(i3 -> 
                i1 + i2 + i3)));
}

For such scenarios, Scala provides syntactic sugar that lets users combine Cartesian products in the form of for loop + yield [11]. However, Scala's yield is pure syntactic sugar and has nothing to do with generators. The compiler translates the code into the flatMap form above during compilation. This sugar is formally equivalent to do notation in Haskell [12].

Fortunately, we have generators now. We can directly write a nested for loop and generate it as a stream without adding syntax, introducing keywords, or bothering the compiler. And the form is freer. You can add code logic at any layer of the for loop.

public static Seq<Integer> cartesian(List<Integer> list1, List<Integer> list2, List<Integer> list3) {
    return c -> {
        for (Integer i1 : list1) {
            for (Integer i2 : list2) {
                for (Integer i3 : list3) {
                    c.accept(i1 + i2 + i3);
                }
            }
        }
    };
}

In other words, Java does not need such sugar, and arguably Scala did not need it either.
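The remark that logic can be added at any layer of the loop can be illustrated with a small variant. The increasingPairs method below is hypothetical: it precomputes a value in the outer loop and filters in the inner loop, which would take an extra filter and more nesting in the flatMap form.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

class CartesianDemo {
    interface Seq<T> {
        void consume(Consumer<T> consumer);
    }

    // Hypothetical variant: emits i1 * 10 + i2, but only for pairs with i1 < i2.
    static Seq<Integer> increasingPairs(List<Integer> list1, List<Integer> list2) {
        return c -> {
            for (Integer i1 : list1) {
                int base = i1 * 10;          // logic at the outer layer
                for (Integer i2 : list2) {
                    if (i2 <= i1) {
                        continue;            // logic at the inner layer
                    }
                    c.accept(base + i2);
                }
            }
        };
    }

    static List<Integer> demo() {
        List<Integer> numbers = Arrays.asList(1, 2, 3);
        List<Integer> out = new ArrayList<>();
        increasingPairs(numbers, numbers).consume(out::add);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [12, 13, 23]
    }
}
```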

Probably the Fastest CSV/Excel Parser of Java

I have mentioned multiple times above that generators can greatly improve performance. This view is supported not only by theory but also by clear engineering data. One example is the unified parser I developed for the CSV family, which includes CSV, Excel, and Alibaba Cloud's ODPS. As long as a format conforms to the unified paradigm, it can be considered a member of this family.

However, handling the CSV family has always been a pain point in the Java language. Even ODPS seems to have no solution. Although there are many CSV libraries available, none of them are very satisfactory. They either have complicated APIs or low performance. None of them can compare with Python's Pandas. Some relatively well-known libraries in this field are OpenCSV [13], Jackson's jackson-dataformat-csv [14], and the fastest univocity-parsers [15].

Excel, on the other hand, is different. EasyExcel [16], an open-source project developed within Alibaba, is remarkable. I only claim that my Excel parser is faster than it; I do not plan to cover more features than it already has.

As for the implementation of CsvReader, there are too many similar products in the market, and I don't have the energy to compare them one by one. However, I can confidently say that my implementation is much faster than the one that claims to be the fastest publicly. The CsvReader I implemented on my office computer about a year ago could only reach 80%~90% of the speed of univocity-parsers. No matter how much I optimized it, I couldn't improve its performance. It was not until I discovered the generator mechanism and refactored it that the speed directly surpassed univocity-parsers by 30% to 50%, becoming the fastest implementation among similar open-source products that I know of.

For Excel, on a given data set, my ExcelReader is 50%~55% faster than EasyExcel.

Modifying EasyExcel to Enable Direct Stream Output

EasyExcel, mentioned above, is a well-known open-source product developed by Alibaba Cloud. It has comprehensive features and excellent quality, receiving wide praise. It happens to have a classic case of I/O interaction using callback functions, which is a suitable example to discuss. Based on the example provided on the official website, we can create the simplest Excel reading method using callback functions.

public static <T> void readEasyExcel(String file, Class<T> cls, Consumer<T> consumer) {
    EasyExcel.read(file, cls, new PageReadListener<T>(list -> {
        for (T person : list) {
            consumer.accept(person);
        }
    })).sheet().doRead();
}

EasyExcel delivers data by calling back listeners. For example, the PageReadListener here keeps a list cache internally. When the cache is full, it passes the cached list to the callback function and then goes on filling the cache. This callback-based approach is typical, but it inevitably has some inconveniences:

Consumers need to care about the producer's internal cache. For example, the cache here is a list.
If the consumer wants to collect all the data, it has to add the elements one by one or call addAll on each batch. This operation is not lazy.
It is difficult to convert the reading process into a stream. The data must first be stored in a list and then converted into a stream before any streaming operation can be applied, so flexibility is poor.
It is not convenient for consumers to intervene in the data production process, for example to interrupt it once a certain condition is met (such as a row count), unless this logic is written into the callback listener itself [17].

Using a generator, we can fully encapsulate the process of reading Excel in the above example. The consumer does not need to pass in any callback function or care about any internal details, and simply gets a stream directly. The transformation is also simple. The main logic remains unchanged, and you only need to wrap the callback function with a consumer:

public static <T> Seq<T> readExcel(String pathName, Class<T> head) {
    return c -> {
        ReadListener<T> listener = new ReadListener<T>() {
            @Override
            public void invoke(T data, AnalysisContext context) {
                c.accept(data);
            }

            @Override
            public void doAfterAllAnalysed(AnalysisContext context) {}
        };
        EasyExcel.read(pathName, head, listener).sheet().doRead();
    };
}

I have already submitted a pull request [18] to EasyExcel for this transformation. Instead of producing a Seq, however, it produces a Stream built on the generator principle. How such a Stream is created is described in detail later.

Furthermore, the parsing process of Excel can be completely transformed into a generator method, and a one-time callback call can be used to avoid the storage and modification of a large number of internal states, thus bringing considerable performance improvement. This work cannot be submitted to EasyExcel for the time being because it depends on a series of APIs of CsvReader above.

Building a Stream With a Generator

As a new design pattern, the generator can provide more powerful streaming API features. However, there is always an adaptation or migration cost, because it is not the Stream that people are familiar with. For existing mature libraries, using Stream is still the most responsible choice for users. Thankfully, even though the mechanisms are completely different, Stream and Seq are still highly compatible.

First of all, it's obvious that Stream is naturally a Seq, just like Iterable:

Stream<Integer> stream = Stream.of(1, 2, 3);
Seq<Integer> seq = stream::forEach;

Conversely, can a Seq be converted into a Stream? The official Stream implementation provides a construction tool, StreamSupport.stream, which together with Spliterators can turn an iterator into a stream. Using this entry point, we can use a generator to construct a non-standard iterator: instead of implementing hasNext and next, we override only the forEachRemaining method, thus hacking into the underlying logic of Stream. In the maze-like source code there is a well-hidden method called AbstractPipeline.copyInto. When the stream is actually executed, it calls the forEachRemaining method of Spliterator to traverse the elements. Although the default forEachRemaining is implemented through hasNext and next, we can bypass them by overriding it.

public static <T> Stream<T> stream(Seq<T> seq) {
    Iterator<T> iterator = new Iterator<T>() {
        @Override
        public boolean hasNext() {
            throw new NoSuchElementException();
        }

        @Override
        public T next() {
            throw new NoSuchElementException();
        }

        @Override
        public void forEachRemaining(Consumer<? super T> action) {
            seq.consume(action::accept);
        }
    };
    return StreamSupport.stream(
        Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED),
        false);
}

In other words, we can even use generators to construct Streams now! Example:

public static void main(String[] args) {
    Stream<Integer> stream = stream(c -> {
        c.accept(0);
        for (int i = 1; i < 5; i++) {
            c.accept(i);
        }
    });
    System.out.println(stream.collect(Collectors.toList()));
}

We should thank the authors of Stream for not simply traversing with a while (hasNext()) loop. Otherwise, this trick would hardly be possible.

Of course, because the nature of Iterator here has changed, this operation will also have some restrictions. It is no longer possible to use the parallel method to convert it into a parallel stream, nor to use the limit method to limit the number. However, methods such as map, filter, flatMap, forEach, and collect can still be used normally as long as they do not involve stream interruption.

Infinite Recursive Sequence

Infinite sequences have few practical application scenarios. Stream's iterate method supports infinite sequences defined by a single-seed recurrence, but recurrences with two or more seeds are beyond its capabilities. The best-known example, a favorite among programmers, is the Fibonacci sequence:

public static Seq<Integer> fibonacci() {
    return c -> {
        int i = 1, j = 2;
        c.accept(i);
        c.accept(j);
        while (true) {
            c.accept(j = i + (i = j));
        }
    };
}
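Because this generator never returns on its own, the consumer has to stop it. As a rough sketch, the take method below does so with a stop exception; both take and the Stop class are assumptions made for this illustration and are not taken from the code above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class FibonacciDemo {
    // Hypothetical stop signal used to interrupt an infinite producer.
    static final class Stop extends RuntimeException {
        static final Stop INSTANCE = new Stop();

        private Stop() {
            super(null, null, false, false);
        }
    }

    interface Seq<T> {
        void consume(Consumer<T> consumer);

        // Assumed helper: pass on the first n elements, then stop the producer.
        default Seq<T> take(int n) {
            return c -> {
                if (n <= 0) {
                    return;
                }
                int[] left = {n};
                try {
                    consume(t -> {
                        c.accept(t);
                        if (--left[0] == 0) {
                            throw Stop.INSTANCE;
                        }
                    });
                } catch (Stop ignored) {
                    // Enough elements have been delivered.
                }
            };
        }
    }

    static Seq<Integer> fibonacci() {
        return c -> {
            int i = 1, j = 2;
            c.accept(i);
            c.accept(j);
            while (true) {
                // Operands are evaluated left to right: the old i is read before i is reassigned.
                c.accept(j = i + (i = j));
            }
        };
    }

    static List<Integer> firstTen() {
        List<Integer> out = new ArrayList<>();
        fibonacci().take(10).consume(out::add);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(firstTen()); // [1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
    }
}
```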

There is also a more interesting application that uses the properties of the Farey tree to perform a Diophantine approximation [22], that is, to approximate a real number with rational numbers. This is a suitable and interesting example for a demo. I will not detail it due to space limitations. If I have the opportunity, I will write another article to discuss it.

More Features of Streams

Stream Aggregation

Designing the aggregation interface of a stream is a complicated topic, and a careful discussion could easily run to thousands of words. Because of space limitations, I will only outline it here. In my opinion, a good streaming API should allow the stream itself to call aggregate functions directly, unlike Stream, which uses Collectors to construct a Collector and then passes the Collector to the stream. Compare the following two styles and it is easy to tell which is better.

Set<Integer> set1 = stream.collect(Collectors.toSet());
String string1 = stream.map(Integer::toString).collect(Collectors.joining(","));

Set<Integer> set2 = seq.toSet();
String string2 = seq.join(",", Integer::toString);

Kotlin is much better than Java in this respect. However, every coin has two sides. From the perspective of functional completeness rather than user experience, the design of Collector is more comprehensive. Stream and groupBy are isomorphic: all operations that can be directly done on streams by Collector can be done by the same Collector constructed by groupBy, and even groupBy itself is a Collector.

So, a better design preserves the completeness and isomorphism of the functionality while also providing shortcuts that can be called directly on the stream. To illustrate, here is an example of a common requirement that neither Java nor Kotlin implements: computing a weighted average.

public static void main(String[] args) {
    Seq<Integer> seq = Seq.of(1, 2, 3, 4, 5, 6, 7, 8, 9);

    double avg1 = seq.average(i -> i, i -> i); // = 6.3333
    double avg2 = seq.reduce(Reducer.average(i -> i, i -> i)); // = 6.3333
    Map<Integer, Double> avgMap = seq.groupBy(i -> i % 2, Reducer.average(i -> i, i -> i)); // = {0=6.0, 1=6.6}
    Map<Integer, Double> avgMap2 = seq.reduce(Reducer.groupBy(i -> i % 2, Reducer.average(i -> i, i -> i)));
}

The average and Reducer.average in the preceding code and the average used in groupBy are completely isomorphic. In other words, the same Reducer can be used directly on the stream or on each substream after the stream is grouped. This is a set of APIs similar to Collector, which not only solves some problems of Collector, but also provides more abundant features. The point is it is open, and the mechanism is simple enough for anyone to write.
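To make the isomorphism concrete, here is a rough sketch of how such a Reducer could be built. The three-part shape (supplier, accumulator, finisher) is an assumption borrowed from Collector, and the extra type parameter for the accumulation container is hypothetical; the real Reducer may look different.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.function.ToDoubleFunction;

class ReducerDemo {
    // Hypothetical Reducer: a container A, a way to fold each T into it, and a finisher A -> V.
    static final class Reducer<T, A, V> {
        final Supplier<A> supplier;
        final BiConsumer<A, T> accumulator;
        final Function<A, V> finisher;

        Reducer(Supplier<A> supplier, BiConsumer<A, T> accumulator, Function<A, V> finisher) {
            this.supplier = supplier;
            this.accumulator = accumulator;
            this.finisher = finisher;
        }
    }

    interface Seq<T> {
        void consume(Consumer<T> consumer);

        default <A, V> V reduce(Reducer<T, A, V> reducer) {
            A acc = reducer.supplier.get();
            consume(t -> reducer.accumulator.accept(acc, t));
            return reducer.finisher.apply(acc);
        }
    }

    // Weighted average: acc[0] holds sum(value * weight), acc[1] holds sum(weight).
    static <T> Reducer<T, double[], Double> average(ToDoubleFunction<T> value, ToDoubleFunction<T> weight) {
        return new Reducer<T, double[], Double>(
            () -> new double[2],
            (acc, t) -> {
                double w = weight.applyAsDouble(t);
                acc[0] += value.applyAsDouble(t) * w;
                acc[1] += w;
            },
            acc -> acc[0] / acc[1]);
    }

    // groupBy is itself a Reducer that applies the downstream Reducer to each group.
    static <T, K, A, V> Reducer<T, Map<K, A>, Map<K, V>> groupBy(Function<T, K> key, Reducer<T, A, V> downstream) {
        return new Reducer<T, Map<K, A>, Map<K, V>>(
            () -> new LinkedHashMap<>(),
            (map, t) -> downstream.accumulator.accept(
                map.computeIfAbsent(key.apply(t), k -> downstream.supplier.get()), t),
            map -> {
                Map<K, V> result = new LinkedHashMap<>();
                map.forEach((k, a) -> result.put(k, downstream.finisher.apply(a)));
                return result;
            });
    }

    static final Seq<Integer> NUMBERS = c -> {
        for (int i = 1; i <= 9; i++) {
            c.accept(i);
        }
    };

    static double weightedAverage() {
        Reducer<Integer, double[], Double> avg = average(i -> i, i -> i);
        return NUMBERS.reduce(avg);
    }

    static Map<Integer, Double> groupedAverage() {
        Reducer<Integer, double[], Double> avg = average(i -> i, i -> i);
        return NUMBERS.reduce(groupBy(i -> i % 2, avg));
    }

    public static void main(String[] args) {
        System.out.println(weightedAverage()); // 285 / 45, about 6.3333
        System.out.println(groupedAverage());  // odd numbers average 6.6, even numbers 6.0
    }
}
```

The same avg object is used on the whole stream and on each group, which is the isomorphism described above.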

Stream Segmentation

Segmentation is actually a blind spot of various streaming APIs. Whether it is map or forEach, we occasionally hope that the first half and the second half adopt different processing logic, or more directly, we hope that the first element will be specially processed. To this end, I provide three APIs: element replacing, segment mapping, and segment consuming.

Again, the underscore-to-camel-case conversion mentioned earlier serves as a typical example: after the underscore-separated string is split, lowercase the first element and capitalize the remaining ones. With the segmented map function, this is quick to implement.

static String underscoreToCamel(String str, UnaryOperator<String> capitalize) {
    // split => segmented map => join
    return Seq.of(str.split("_")).map(capitalize, 1, String::toLowerCase).join("");
}
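A segmented map like this needs little more than a counter inside the generator. The following is a minimal sketch, assuming that map(function, n, substitute) applies substitute to the first n elements and function to the rest; the join helper and the inline capitalize lambda are also assumptions made for the example.

```java
import java.util.Arrays;
import java.util.StringJoiner;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.UnaryOperator;

class SegmentMapDemo {
    interface Seq<T> {
        void consume(Consumer<T> consumer);

        @SafeVarargs
        static <T> Seq<T> of(T... ts) {
            return Arrays.asList(ts)::forEach;
        }

        // Assumed semantics: the first n elements go through substitute, the rest through function.
        default <E> Seq<E> map(Function<T, E> function, int n, Function<T, E> substitute) {
            return c -> {
                int[] index = {0}; // created per consumption, so the Seq stays reusable
                consume(t -> c.accept(index[0]++ < n ? substitute.apply(t) : function.apply(t)));
            };
        }

        default String join(String separator) {
            StringJoiner joiner = new StringJoiner(separator);
            consume(t -> joiner.add(String.valueOf(t)));
            return joiner.toString();
        }
    }

    static String underscoreToCamel(String str) {
        UnaryOperator<String> capitalize =
            s -> s.isEmpty() ? s : Character.toUpperCase(s.charAt(0)) + s.substring(1);
        return Seq.of(str.split("_")).map(capitalize, 1, String::toLowerCase).join("");
    }

    public static void main(String[] args) {
        System.out.println(underscoreToCamel("user_first_name")); // userFirstName
    }
}
```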

For another example, when you parse a CSV file, if there is a header, it will be processed separately during parsing: the header information is used to reorder the fields, and the remaining content is converted into DTO by row. With proper segmentation logic, this seemingly cumbersome operation can be done in one stream at a time.

One-time Stream or Reusable Stream?

Those familiar with Stream should be aware that a Stream is a one-time stream because its data comes from an iterator. If you invoke a terminal operation on a Stream that has already been consumed, an exception is thrown. Kotlin's Sequence, on the other hand, follows a different design principle. Its stream comes from Iterable and is reusable in most cases. However, when Kotlin reads a file stream, it still employs the same idea as Stream by encapsulating BufferedReader as an Iterator. Therefore, it is one-time.

Unlike the previous two approaches, the generator's method is evidently more flexible, and whether the stream is reusable depends entirely on whether the data source wrapped by the generator is reusable. For example, in the above code, whether it is a local file or an ODPS table, it is reusable as long as the data source is included in the generator. You can use the same stream multiple times, just like using a normal List. From this perspective, the generator itself is immutable. Its elements are directly produced from the code block without relying on the runtime environment and memory state data. Any consumer can expect consistent streams from the same generator.
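The difference is easy to observe in a few lines. In this sketch (the helper names are made up for the illustration), a Seq that wraps a List or rebuilds its source inside the generator can be consumed repeatedly, while a Seq that merely wraps an existing Stream instance inherits its one-time nature.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Stream;

class ReuseDemo {
    interface Seq<T> {
        void consume(Consumer<T> consumer);
    }

    static <T> List<T> toList(Seq<T> seq) {
        List<T> out = new ArrayList<>();
        seq.consume(out::add);
        return out;
    }

    // The data source (a List) is reusable, so the Seq is reusable.
    static boolean listBackedIsReusable() {
        List<Integer> data = Arrays.asList(1, 2, 3);
        Seq<Integer> seq = data::forEach;
        return toList(seq).equals(data) && toList(seq).equals(data);
    }

    // The Stream is created inside the generator, so every consumption gets a fresh one.
    static boolean generatorBackedIsReusable() {
        Seq<Integer> seq = c -> Stream.of(1, 2, 3).forEach(c);
        return toList(seq).equals(toList(seq));
    }

    // The Stream is created outside the generator, so the second consumption fails.
    static boolean streamBackedFailsOnSecondUse() {
        Stream<Integer> stream = Stream.of(1, 2, 3);
        Seq<Integer> seq = stream::forEach;
        toList(seq);
        try {
            toList(seq);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(listBackedIsReusable());         // true
        System.out.println(generatorBackedIsReusable());    // true
        System.out.println(streamBackedFailsOnSecondUse()); // true
    }
}
```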

The essence of the generator, like humans, is that of a repeater.

Of course, the repetition of the repeater also depends on the cost. For scenarios where a high-cost stream like I/O needs to be reused, repeatedly performing the same I/O operation is definitely unreasonable. It is advisable to design a cache method for caching the stream.

The most common caching method is to read data into an ArrayList. Since ArrayList itself does not implement the Seq interface, it is better to create an ArraySeq, which is both an ArrayList and a Seq—as mentioned many times before, List naturally qualifies as a Seq.

public class ArraySeq<T> extends ArrayList<T> implements Seq<T> {
    @Override
    public void consume(Consumer<T> consumer) {
        forEach(consumer);
    }
}

With ArraySeq, you can immediately cache streams.

default Seq<T> cache() {
    ArraySeq<T> arraySeq = new ArraySeq<>();
    consume(t -> arraySeq.add(t));
    return arraySeq;
}
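The effect of cache can be checked by counting how often the underlying producer runs. The sketch below repeats the two definitions above so that it compiles on its own; the counter and the sourceRuns method are hypothetical additions for the demonstration.

```java
import java.util.ArrayList;
import java.util.function.Consumer;

class CacheDemo {
    interface Seq<T> {
        void consume(Consumer<T> consumer);

        // Eagerly consumes the source once and keeps the elements in memory.
        default Seq<T> cache() {
            ArraySeq<T> arraySeq = new ArraySeq<>();
            consume(t -> arraySeq.add(t));
            return arraySeq;
        }
    }

    static class ArraySeq<T> extends ArrayList<T> implements Seq<T> {
        @Override
        public void consume(Consumer<T> consumer) {
            forEach(consumer);
        }
    }

    // Consumes a Seq twice and reports how many times the producer code block was executed.
    static int sourceRuns(boolean useCache) {
        int[] runs = {0};
        Seq<Integer> expensive = c -> {
            runs[0]++; // stands in for a costly I/O operation
            c.accept(1);
            c.accept(2);
        };
        Seq<Integer> seq = useCache ? expensive.cache() : expensive;
        seq.consume(t -> { });
        seq.consume(t -> { });
        return runs[0];
    }

    public static void main(String[] args) {
        System.out.println(sourceRuns(false)); // 2
        System.out.println(sourceRuns(true));  // 1
    }
}
```

Note that in this form cache is eager: the source is read when cache is called, not when the cached Seq is first consumed.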

Observant readers may notice that this cache method has been used when I created a parallel stream earlier. Additionally, with the help of ArraySeq, we can easily sort Seq. If you find it interesting, give it a try.

BiSeq

Since a Consumer callback can serve as the mechanism for creating a Seq, an interesting question arises: what if the callback is not a Consumer but a BiConsumer? The answer is BiSeq!

public interface BiSeq<K, V> {
    void consume(BiConsumer<K, V> consumer);
}

BiSeq is a new concept. Previously, no iterator-based stream, whether Java streams, Kotlin sequences, or Python generators, could be binary. I'm not singling anyone out: the next method in all of them must return a single object instance, so even if you want a stream that yields two elements at a time, you must wrap them in a structure such as Pair. Such a stream is therefore still essentially one-dimensional, and when it has many elements, the memory overhead of the wrapper objects can be significant.

Even a Python zip that looks most like a BiSeq:

for i, j in zip([1, 2, 3], [4, 5, 6]):
    pass

The i and j in the above code are actually the results of unpacking a tuple.

However, a BiSeq based on the callback mechanism is completely different. It is as lightweight as a one-dimensional stream, which means it saves memory while staying fast. For example, when I implemented CsvReader, I rewrote the String.split method so that it outputs a Seq. Zipping this Seq with the DTO fields produces a binary stream, so each value is matched one-to-one with its field. There is no need to use subscripts or create temporary arrays or lists for storage. Each split substring is disposable throughout its lifecycle and is used as soon as it is produced.

It is also worth mentioning here that, similar to Iterable, Map in Java is inherently a BiSeq.
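This works because Map.forEach accepts a BiConsumer, so a method reference to it fits the BiSeq interface directly. A minimal sketch (the sample map and the render method are made up for the illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;
import java.util.function.BiConsumer;

class MapBiSeqDemo {
    interface BiSeq<K, V> {
        void consume(BiConsumer<K, V> consumer);
    }

    static String render() {
        Map<String, Integer> ages = new LinkedHashMap<>();
        ages.put("alice", 30);
        ages.put("bob", 25);

        // Map.forEach takes a BiConsumer, so the method reference is already a BiSeq.
        BiSeq<String, Integer> seq = ages::forEach;

        StringJoiner joiner = new StringJoiner(", ");
        seq.consume((name, age) -> joiner.add(name + "=" + age));
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(render()); // alice=30, bob=25
    }
}
```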


Just as there is a BiSeq based on BiConsumer, there can be a TriSeq based on TriConsumer, a quaternary Seq, and Seqs based on primitive-type callbacks such as IntConsumer and DoubleConsumer. This is a big Seq family, with many special operations that differ from those of the one-dimensional stream. Here, I will mention only one:

BiSeq, TriSeq, and higher-arity Seqs can be used to construct truly lazy tuples in Java. When a function needs to return multiple values, you now have a better choice than writing a Pair or Triple: return a BiSeq or TriSeq directly by means of a generator. Compared with an ordinary tuple, this adds the advantage of lazy evaluation, because the values are only computed when a callback function consumes them. It is even unnecessary to check whether the pointer is null.
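As a small illustration of such a lazy tuple, the hypothetical divMod method below returns a quotient and a remainder without allocating a Pair. Since nothing is computed until the BiSeq is consumed, even divMod(1, 0) can be created without an exception.

```java
import java.util.function.BiConsumer;

class LazyTupleDemo {
    interface BiSeq<K, V> {
        void consume(BiConsumer<K, V> consumer);
    }

    // Hypothetical example: two return values delivered through a callback instead of a Pair.
    static BiSeq<Integer, Integer> divMod(int a, int b) {
        return c -> c.accept(a / b, a % b);
    }

    static String describe(int a, int b) {
        StringBuilder sb = new StringBuilder();
        divMod(a, b).consume((q, r) -> sb.append(q).append(" r ").append(r));
        return sb.toString();
    }

    // Creating the BiSeq performs no division, so no ArithmeticException is thrown here.
    static boolean creationIsLazy() {
        BiSeq<Integer, Integer> unused = divMod(1, 0);
        return unused != null;
    }

    public static void main(String[] args) {
        System.out.println(describe(17, 5));  // 3 r 2
        System.out.println(creationIsLazy()); // true
    }
}
```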

Summary

First of all, thank you for reading this far. I have covered most of what I wanted to say. Although many interesting details remain undiscussed, they do not affect the integrity of the story. What I want to emphasize again is that all of the content, code, features, and cases above, as well as the CsvReader series I have implemented, derive from this one simple interface. It is the source of everything, and it is well worth writing out once more at the end of this article.

public interface Seq<T> {
    void consume(Consumer<T> consumer);
}

For this magical interface, its roles are as follows:

Define the Seq
Derive the two characteristics of Seq, which is both a stream and a generator.
Use a generator to implement comprehensive streaming APIs, export I/O streams that can be safely isolated, and finally export asynchronous stream, parallel stream, and channel features.

Appendix

The appendix contains API documentation, reference addresses, and performance benchmarks. Since it is not yet open source, only Monad is introduced here.

Monad

Monad [24] is a concept from category theory and an important design pattern in Haskell, a representative functional programming language. However, it is not necessary for Seq or generator, so it is included in the appendix.

I want to mention Monad because Seq naturally becomes a Monad after implementing unit and flatMap. For readers who are interested in related theories, it may be uncomfortable if I don't mention it at all. Unfortunately, although Seq is a Monad in form, there are some conflicts in concept. For example, flatMap, which is crucial in Monad, serves as both a core definition and undertakes the important functions of combination and unpacking. Even though map is not necessary for Monad, it can be derived from flatMap and unit (see the derivation process below), but the reverse derivation does not work. However, for streaming APIs, map is the most critical and frequently used operation, while flatMap is less important and even not commonly used.

Monad is revered for several important features: lazy evaluation, chain calls, and side-effect isolation. Side-effect isolation is especially important in pure functions. However, for most programming languages, including Java, a more direct way to implement lazy evaluation is through interface-oriented programming rather than object-oriented programming, as interfaces are inherently lazy since they have no member variables. Chain operation is a natural characteristic of streams, so I will not elaborate on it here. As for side-effect isolation, it is not exclusive to Monad. Generators can also achieve side-effect isolation through closure + callback, as mentioned above.

Derive the Implementation of Map

Firstly, map can be obtained by directly combining unit and flatMap, which is called map2 here:

default <E> Seq<E> map2(Function<T, E> function) {
    return flatMap(t -> unit(function.apply(t)));
}

This means that each element of type T is transformed into a Seq of type E, and the results are then merged by flatMap. This is the most intuitive approach, as it does not require a prior concept of a stream and is an inherent property of Monad. Its efficiency may be poor, but we can simplify it.

We already have implementations of unit and flatMap:

static <T> Seq<T> unit(T t) {
    return c -> c.accept(t);
}

default <E> Seq<E> flatMap(Function<T, Seq<E>> function) {
    return c -> consume(t -> function.apply(t).consume(c));
}

We can expand unit and substitute it into the implementation of map2:

default <E> Seq<E> map3(Function<T, E> function) {
    return flatMap(t -> c -> c.accept(function.apply(t)));
}

Extract the function in this flatMap and turn it into a flatFunction. Then expand flatMap:

default <E> Seq<E> map4(Function<T, E> function) {
    Function<T, Seq<E>> flatFunction = t -> c -> c.accept(function.apply(t));
    return consumer -> consume(t -> flatFunction.apply(t).consume(consumer));
}

It is easy to notice that the flatFunction here has two consecutive arrows, which makes it exactly the curried form of a two-parameter function of (t, c). We can uncurry it to recover this two-parameter function:

Function<T, Seq<E>> flatFunction = t -> c -> c.accept(function.apply(t));
// Equivalent to
BiConsumer<T, Consumer<E>> biConsumer = (t, c) -> c.accept(function.apply(t));

As you can see, this equivalent two-parameter function is actually a BiConsumer, and then substitute it into map4:

default <E> Seq<E> map5(Function<T, E> function) {
    BiConsumer<T, Consumer<E>> biConsumer = (t, c) -> c.accept(function.apply(t));
    return c -> consume(t -> biConsumer.accept(t, c));
}

Note that the actual parameters and formal parameters of biConsumer here are exactly the same, so its method body can be substituted directly at the call site:

default <E> Seq<E> map6(Function<T, E> function) {
    return c -> consume(t -> c.accept(function.apply(t)));
}

At this step, map6 is exactly the same as the map written directly from the streaming concept. The implementation is complete!
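The derivation can be sanity-checked by running the first and the last version on the same input. This is only a spot check on one sample, not a proof; the toList helper and the sample Seq are made up for the purpose.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

class MapDerivationDemo {
    interface Seq<T> {
        void consume(Consumer<T> consumer);

        static <T> Seq<T> unit(T t) {
            return c -> c.accept(t);
        }

        default <E> Seq<E> flatMap(Function<T, Seq<E>> function) {
            return c -> consume(t -> function.apply(t).consume(c));
        }

        // Starting point of the derivation: map expressed through unit and flatMap.
        default <E> Seq<E> map2(Function<T, E> function) {
            return flatMap(t -> unit(function.apply(t)));
        }

        // End point of the derivation: map written directly from the streaming concept.
        default <E> Seq<E> map6(Function<T, E> function) {
            return c -> consume(t -> c.accept(function.apply(t)));
        }
    }

    static <T> List<T> toList(Seq<T> seq) {
        List<T> out = new ArrayList<>();
        seq.consume(out::add);
        return out;
    }

    static final Seq<Integer> SAMPLE = c -> {
        c.accept(1);
        c.accept(2);
        c.accept(3);
    };

    static List<Integer> viaMap2() {
        Seq<Integer> doubled = SAMPLE.map2(i -> i * 2);
        return toList(doubled);
    }

    static List<Integer> viaMap6() {
        Seq<Integer> doubled = SAMPLE.map6(i -> i * 2);
        return toList(doubled);
    }

    public static void main(String[] args) {
        System.out.println(viaMap2()); // [2, 4, 6]
        System.out.println(viaMap6()); // [2, 4, 6]
    }
}
```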

References

[1]https://en.wikipedia.org/wiki/Generator_(computer_programming)
[2]https://www.pythonlikeyoumeanit.com/Module2_EssentialsOfPython/Generators_and_Comprehensions.html
[3]https://openjdk.org/projects/loom/
[4]https://en.wikipedia.org/wiki/Continuation
[5]https://hackernoon.com/the-magic-behind-python-generator-functions-bc8eeea54220
[6]https://en.wikipedia.org/wiki/Continuation-passing_style
[7]https://kotlinlang.org/spec/asynchronous-programming-with-coroutines.html
[8]https://zh.wikipedia.org/wiki/Map_(%E9%AB%98%E9%98%B6%E5%87%BD%E6%95%B0)
[9]https://crypto.stanford.edu/~blynn/haskell/io.html
[10]https://www.autohotkey.com/docs/v2/
[11]https://stackoverflow.com/questions/1052476/what-is-scalas-yield
[12]https://stackoverflow.com/questions/10441559/scala-equivalent-of-haskells-do-notation-yet-again
[13]https://opencsv.sourceforge.net/
[14]https://github.com/FasterXML/jackson-dataformats-text/tree/master/csv
[15]https://github.com/uniVocity/univocity-parsers
[16]https://github.com/alibaba/easyexcel
[17]https://github.com/alibaba/easyexcel/issues/1566
[18]https://github.com/alibaba/easyexcel/pull/3052
[20]https://github.com/alibaba/easyexcel/pull/3052
[21]https://github.com/alibaba/fastjson2/blob/f30c9e995423603d5b80f3efeeea229b76dc3bb8/extension/src/main/java/com/alibaba/fastjson2/support/csv/CSVParser.java#L197
[22]https://www.bilibili.com/video/BV1ha41137oW/?is_story_h5=false&p=1&share_from=ugc&share_medium=android&share_plat=android&share_session_id=96a03926-820b-4c9f-a2fd-162944103bed&share_source=COPY&share_tag=s_i&timestamp=1663058544&unique_k=p94n8tD
[24]https://en.wikipedia.org/wiki/Monad_(functional_programming)

Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.
