When Users Say "The App Is Frozen" but You Cannot Find the Cause — Your Monitoring Approach Might Be Wrong

This article introduces the mainstream stuttering monitoring solutions in iOS and provides a detailed explanation of a RunLoop-based stuttering monitoring implementation.

Background Information

When users interact with an app, they may encounter the following issues:

● The screen stays black or white for a moment when a complex page is opened.

● The display of a list stutters occasionally during scrolling.

● The UI becomes sluggish when images are being loaded.

● Operations freeze when multiple network requests are sent at the same time.

These issues are not limited to low-end devices. They also happen frequently on mid-range and high-end models. When the main thread cannot respond to user interactions, the app becomes unresponsive, and prolonged stuttering severely impacts both the functionality and user experience. Stuttering issues have always been one of the biggest pain points in mobile app development.

In most cases, the primary causes of main thread blocking and stuttering are as follows:

● Heavy UI rendering: When a screen contains deep view hierarchies or a large amount of mixed text and images, the number of layout calculations and drawing operations increases sharply, exceeding the processing capacity of a single refresh cycle.

● Synchronous network requests on the main thread: After synchronous network calls are triggered on the main thread, the entire app must wait for a network response before it can proceed. During this period, no user interaction can be processed.

● Extensive file I/O operations: Directly performing large-scale data reads or writes on the main thread, such as accessing databases or local files, consumes significant time due to the limited disk speed.

● Heavy-load computation: If complex algorithms or large-scale data processing workflows run on the main thread, the main thread stays busy for an extended period of time, and no room is left to handle UI events.

● Improper thread lock usage: If the main thread is waiting for another thread to release a lock, it is blocked until the lock becomes available. If the wait is long, stuttering happens. In extreme cases, circular waiting between different threads can even cause a deadlock, rendering the entire app unresponsive.

Because these issues are often intermittent and environment-dependent, traditional offline debugging methods are usually ineffective. To accurately and efficiently troubleshoot these online stuttering issues, we have explored several monitoring approaches.

Mainstream Stuttering Monitoring Solutions

The following solutions are some widely used stuttering monitoring approaches in iOS development:

● Ping thread monitoring

● Frames per second (FPS) monitoring

● RunLoop-based monitoring

Ping Thread Monitoring

The ping thread monitoring solution works based on the following principles:

● A worker thread is created to "probe" the responsiveness of the main thread.

● Each time the worker thread pings the main thread, the worker thread sets a flag to YES and then dispatches a task to the main thread. The main thread clears the flag by setting it to NO.

● The worker thread sleeps for the specified period of time. After the period elapses, it checks whether the flag has been cleared. If the value of the flag is still YES, the main thread is experiencing stuttering.

The following figure shows the process.

Key steps:

1. Create a worker thread: Start an independent worker thread for monitoring.
2. Periodically dispatch tasks: The worker thread sets a waiting flag and then dispatches a simple task to the main thread.
3. Wait for the main thread to respond: When the main thread processes the task, it clears the waiting flag.
4. Perform a timeout check: If the worker thread finds that the waiting flag has not been cleared within a short time window after a task is dispatched, the main thread is considered to be experiencing stuttering.
5. Perform capturing and reporting: Capture the call stack of the main thread and report it.

The logic of this monitoring solution is simple and easy to understand. However, its accuracy is limited: a stutter that starts and ends between two pings goes undetected. In addition, the ping thread continuously wakes up the RunLoop of the main thread, which introduces some performance overhead.
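The ping logic described above can be modeled without any iOS API. The following is a minimal sketch in Python, not production code: a queue stands in for the main dispatch queue, an ordinary thread stands in for the main thread, and the names PingMonitor, simulated_main_thread, and demo, as well as the 0.2-second threshold, are assumptions made for illustration.

```python
import queue
import threading
import time


class PingMonitor:
    """Worker thread that pings a simulated main thread through its task queue."""

    def __init__(self, main_queue, threshold=0.2):
        self.main_queue = main_queue    # stands in for the main dispatch queue
        self.threshold = threshold      # seconds to sleep before checking the flag (assumed value)
        self.waiting = False            # True means the last ping has not been answered
        self.stalls = 0                 # number of stutters detected so far
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _clear_flag(self):
        # Executed by the main thread once it reaches the ping task.
        self.waiting = False

    def _run(self):
        while not self._stop.is_set():
            self.waiting = True
            self.main_queue.put(self._clear_flag)
            time.sleep(self.threshold)
            if self.waiting:
                self.stalls += 1        # a real monitor would capture the main thread stack here
                # Do not send another ping until this one has been answered.
                while self.waiting and not self._stop.is_set():
                    time.sleep(0.01)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()


def simulated_main_thread(tasks, stop):
    """Run queued tasks one at a time, like a serial main queue."""
    while not stop.is_set():
        try:
            task = tasks.get(timeout=0.01)
        except queue.Empty:
            continue
        task()


def demo(block_seconds=1.0):
    """Block the simulated main thread once and return the number of stalls seen."""
    tasks, stop = queue.Queue(), threading.Event()
    main = threading.Thread(target=simulated_main_thread, args=(tasks, stop), daemon=True)
    main.start()
    monitor = PingMonitor(tasks)
    monitor.start()
    tasks.put(lambda: time.sleep(block_seconds))    # the blocking "main thread" work
    time.sleep(block_seconds + 0.5)
    monitor.stop()
    stop.set()
    main.join()
    return monitor.stalls
```

Calling demo() blocks the simulated main thread for one second, which is longer than the threshold, so the monitor reports at least one stall. A block shorter than the gap between two pings could pass unnoticed, which is the accuracy limit mentioned above.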

FPS Monitoring

On most iOS devices, the screen refreshes at a rate of 60 Hz, and devices with ProMotion displays support refresh rates of up to 120 Hz. A screen refresh signal is sent for each refresh, and CADisplayLink allows developers to register a callback that is synchronized with this signal.

We can evaluate UI smoothness by counting how many times this callback is triggered within 1 second. Although CADisplayLink is lightweight, its callback is delivered on the RunLoop to which it is added, which is typically the RunLoop of the main thread, so it cannot fire while the main thread is blocked. As a result, stack capturing during severe stuttering may not be timely. In addition, a frame rate that drops moderately below 60 FPS, for example to about 50 FPS, can still appear smooth to the human eye. Therefore, relying solely on FPS monitoring makes it difficult to determine whether stuttering has occurred.
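To make the counting concrete, the following Python sketch computes a per-second FPS value from a list of callback timestamps, such as the ones a display-link callback might record. It is an illustration only: the function names fps_by_second and longest_gap and the sample data are assumptions, and no CADisplayLink API is involved.

```python
from collections import Counter


def fps_by_second(timestamps):
    """Return the number of callbacks in each 1-second window.

    timestamps: ascending callback times in seconds.
    """
    if not timestamps:
        return []
    start = timestamps[0]
    counts = Counter(int(t - start) for t in timestamps)
    return [counts.get(second, 0) for second in range(max(counts) + 1)]


def longest_gap(timestamps):
    """Return the longest interval between two consecutive callbacks."""
    return max((b - a for a, b in zip(timestamps, timestamps[1:])), default=0.0)


# One smooth second at 60 Hz, followed by a second that contains a freeze
# of roughly 0.3 s between two callbacks.
smooth = [i / 60 for i in range(60)]
frozen = [1 + i / 60 for i in range(30)] + [1.8 + i / 60 for i in range(12)]
samples = smooth + frozen
```

For this data, fps_by_second(samples) returns [60, 42]. The second value shows the limitation discussed above: a freeze of about 0.3 seconds only lowers the per-second count to 42, while longest_gap(samples) exposes the long interval directly.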

RunLoop-based Monitoring

This monitoring solution is one of the most widely used solutions in production environments. Its core idea is to observe the state changes of the RunLoop of the main thread by using CFRunLoopObserver. The following figure is a simplified explanation of the RunLoop mechanism, adapted from Dai Ming's RunLoop diagram.

● Notify observers that the RunLoop is about to enter a loop.

● Start a do-while loop to keep the thread alive. At the start of each iteration:

    Notify observers that the RunLoop will trigger the Timer and Source0 callbacks, followed by the execution of blocks.
    If Source1 is in the ready state, jump to the message handling process.

● Notify observers that the RunLoop is about to enter the sleeping state.

● Wait for mach_port messages to wake up the RunLoop again. Any of the following can wake it up:

    Port-based Source events
    Timer expiration
    RunLoop timeout
    Explicit wakeup by a caller

● Notify observers that the RunLoop has been awakened.

● Handle messages.

● Continue with the next loop.
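The order of these notifications can be illustrated with a small model. The following Python sketch is a deliberate simplification rather than the CoreFoundation implementation: the Activity values mirror the names of the CFRunLoopActivity constants, while run_loop, wakeups, and notify are hypothetical names. The Source1 shortcut is omitted, and each wakeup is represented by one callable.

```python
from enum import Enum


class Activity(Enum):
    # Values mirror the names of the CFRunLoopActivity constants.
    ENTRY = "kCFRunLoopEntry"
    BEFORE_TIMERS = "kCFRunLoopBeforeTimers"
    BEFORE_SOURCES = "kCFRunLoopBeforeSources"
    BEFORE_WAITING = "kCFRunLoopBeforeWaiting"
    AFTER_WAITING = "kCFRunLoopAfterWaiting"
    EXIT = "kCFRunLoopExit"


def run_loop(wakeups, notify):
    """Simplified model of one RunLoop run.

    wakeups: list of callables; each one models the message that wakes the
             loop once and is handled after the wakeup.
    notify:  observer callback that receives each Activity in order.
    """
    notify(Activity.ENTRY)
    for handle_message in wakeups:
        notify(Activity.BEFORE_TIMERS)
        notify(Activity.BEFORE_SOURCES)
        # Source0 callbacks and queued blocks would run at this point.
        notify(Activity.BEFORE_WAITING)
        # The real loop sleeps here until a port message, a timer,
        # a timeout, or an explicit wakeup arrives.
        notify(Activity.AFTER_WAITING)
        handle_message()
    notify(Activity.EXIT)


# Usage: record the notification order for a single wakeup.
seen = []
run_loop([lambda: None], seen.append)
```

After the call, seen holds the six activities in the order listed above. The work that can block the main thread happens between BEFORE_SOURCES and BEFORE_WAITING and after AFTER_WAITING, which is why observers on those activities are useful for timing.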

In a typical RunLoop-based monitoring implementation, the following key steps are involved:

1. Register observers: Register observers on the RunLoop of the main thread to listen for state changes.
2. Create a monitoring thread: Use a worker thread to monitor the RunLoop state changes of the main thread.
3. State flag setting and timeout detection: The observer callbacks update flags based on the state changes of the RunLoop, and the worker thread continuously checks whether these flags are updated within the predefined threshold. If not, stuttering is detected.
4. Stack capturing and reporting: When stuttering is detected, the call stack of the main thread is captured and reported to the server for further analysis.

The RunLoop-based monitoring solution can accurately capture various types of stuttering caused by main thread blocking. This makes it well suited for online stuttering monitoring, diagnostics, and analysis.

Solution Comparison

The three performance monitoring solutions focus on different dimensions, as summarized in the following table.

Comparison dimension	Ping thread monitoring	FPS monitoring	RunLoop-based monitoring
Core principle	A worker thread periodically dispatches tasks to detect the responsiveness of the main thread.	CADisplayLink is used to count the number of callbacks per unit time and calculate the FPS value.	CFRunLoopObserver is used to listen for RunLoop state changes and timeout events.
Monitoring accuracy	Medium: depends on probing frequency and may miss intermittent stuttering.	Low: focuses on average performance. Occasional severe stuttering may be averaged out.	High: can capture individual long blocking events.
Root cause analysis	Moderate: captures stacks after timeout events, but with a potential timing delay.	Weak: reflects only smoothness results and cannot locate code-level stacks.	Strong: captures the call stack of the main thread immediately after a timeout event is generated to locate the required code.
Performance overhead	Low: worker thread overhead plus slight main thread overhead. The overall impact is minimal.	Very low: CADisplayLink adopted, which is a lightweight, system-level mechanism.	Low: observer callbacks are lightweight, with extra processing required only during stuttering.
Complexity	Medium: requires thread management and timeout handling logic.	Low: simple implementation based on counts and timestamps.	High: requires a deep understanding of the RunLoop mechanism and multi-threaded synchronization.
Scenarios	Quickly implement basic stuttering monitoring.	Quantify UI smoothness, such as scrolling or animation optimization.	Diagnose main thread blocking, such as I/O, deadlocks, and complex computations.

● The ping thread monitoring solution detects stuttering issues by periodically probing the response time of the main thread from a worker thread. Its accuracy is lower than that of the RunLoop-based monitoring solution.

● The FPS monitoring solution serves as a global performance metric, reflecting app smoothness by using frame rate fluctuations. However, it cannot pinpoint specific performance bottlenecks.

● The RunLoop-based monitoring solution hooks into the event loop mechanism of the main thread. It captures individual blocking events within milliseconds and precisely identifies the sources of stuttering on the main thread.

Stuttering Monitoring Implementation

The core goal of a stuttering monitoring solution is to accurately capture and pinpoint blocking-type stuttering issues that interrupt user interactions and significantly degrade user experience. When stuttering occurs, it is not enough to simply detect the event itself. The monitoring solution must also trace execution paths down to the code line level to identify the root cause.

Compared with other mainstream solutions, the RunLoop-based monitoring solution continuously tracks the task duration on the main thread. The solution can precisely capture stuttering events while simultaneously collecting the complete contextual call stack. Although the implementation of the solution is relatively complex, its suitability for production environments and its strong diagnostic value in identifying root causes make it our preferred solution.

The basic principles of RunLoop have been introduced earlier. The following sections focus on how to implement this solution.

RunLoop State Change Monitoring

To implement the RunLoop-based stuttering monitoring solution, the first step is to monitor RunLoop state changes. As shown in the following figure, by registering an observer, you can listen for state change events in the RunLoop of the main thread. The associated state and timestamp information is recorded by using the running and startTime variables. The monitoring thread then reads the values of running and startTime to determine whether a state change has exceeded the expected time threshold.

When the main thread takes an extended period of time to run a task, RunLoop state changes are delayed. By measuring the time difference between key RunLoop states from a background monitoring thread, you can determine whether the main thread is blocked.

In this implementation:

● When the observer receives a kCFRunLoopBeforeTimers, kCFRunLoopBeforeSources, or kCFRunLoopAfterWaiting notification, the observer sets the value of running to YES and records the current timestamp in startTime.

● When the observer receives a kCFRunLoopBeforeWaiting or kCFRunLoopExit notification, the observer sets the value of running to NO.

● The monitoring thread continuously reads the values of running and startTime, and determines whether a stuttering issue has occurred by comparing the current time with the value of startTime, as shown in the following figure.
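The rules above amount to a small state machine. The following Python sketch models it under stated assumptions: activities are passed in as plain strings, the class name RunLoopState and the method names are hypothetical, and the threshold is chosen by the caller because no value is given here. A real implementation would update running and startTime from a CFRunLoopObserver callback and would need atomic or locked access, because the two values are shared between threads.

```python
import time

# Activities after which the main thread is about to do, or is doing, work.
RUNNING_ACTIVITIES = {"BeforeTimers", "BeforeSources", "AfterWaiting"}
# Activities after which the main thread is idle or leaving the loop.
IDLE_ACTIVITIES = {"BeforeWaiting", "Exit"}


class RunLoopState:
    """Holds the running flag and startTime shared with the monitoring thread."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock          # injectable so the logic can be tested
        self.running = False
        self.start_time = 0.0

    def on_activity(self, activity):
        """Observer callback, invoked on the main thread."""
        if activity in RUNNING_ACTIVITIES:
            self.running = True
            self.start_time = self.clock()
        elif activity in IDLE_ACTIVITIES:
            self.running = False

    def is_stalled(self, threshold):
        """Check made by the monitoring thread; threshold is in seconds."""
        return self.running and (self.clock() - self.start_time) > threshold


# Usage with a fake clock: the main thread starts work at t = 0.0.
now = [0.0]
state = RunLoopState(clock=lambda: now[0])
state.on_activity("BeforeSources")
now[0] = 0.4                        # no further state change for 0.4 s
stalled = state.is_stalled(threshold=0.25)
```

In this usage example, stalled is True because the state has not changed for longer than the assumed 0.25-second threshold. Once the observer reports BeforeWaiting, the same check returns False regardless of how much time passes, because an idle RunLoop is not a stutter.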

Stack Capturing

When a RunLoop state change timeout is detected, that is, when a stuttering issue is identified, the call stack of the main thread needs to be captured and stored in memory. Stack capturing is based on the well-known open source implementation KSCrash. Compared with using system functions to retrieve call stacks, KSCrash-based stack capturing supports symbolication based on dSYM files. This allows issues to be traced back to specific code lines, and the performance overhead is relatively low.

Capturing of the Most Time-consuming Stack

When the monitoring thread detects that the RunLoop of the main thread has timed out, it captures a snapshot of the main thread as the stuttering stack. However, this snapshot is not necessarily the most time-consuming stack, nor is it always the primary cause of the main thread timeout. To improve capturing accuracy, main thread stacks are also sampled once every 50 ms into a circular buffer. If stuttering is detected on the main thread, the SDK retrospectively analyzes the buffered stacks to identify the most time-consuming stack in the recent time window.

As shown in the preceding figure, the most time-consuming stack is identified based on the following characteristics:

● The top function in a call stack is used as a distinguishing characteristic. If two stacks share the same top function, they are considered the same stack. Example:

Stack A has FuncA as its top function.
Stack B has FuncB as its top function.
FuncA and FuncB are different functions. In this case, Stack A and Stack B are considered different stacks.

● Because stacks are captured at fixed intervals, the number of times a stack appears can be used as an approximate metric of the stack execution duration. The more repetitions, the longer the execution duration. Example:

Stack A appears once, with an approximate duration of 50 ms.
Stack B appears once, with an approximate duration of 50 ms.
Stack C appears three times, with an approximate duration of 150 ms.
In this case, Stack C is considered the most time-consuming stack.

● If multiple stacks have the same number of repetitions, the most recent one is selected as the most time-consuming stack.
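The three rules above can be expressed in a few lines. The following Python sketch is one possible reading of them, not the SDK's code: each stack is modeled as a list of function names with the top function first, the function name most_time_consuming and the buffer capacity of 8 are assumptions, and only the 50 ms sampling interval comes from the text.

```python
from collections import deque

SAMPLE_INTERVAL_MS = 50     # sampling period stated in the text


def most_time_consuming(samples):
    """Pick the stack whose top function appears most often.

    samples: stacks ordered from oldest to newest; each stack is a list of
             function names with the top function at index 0.
    Returns (stack, occurrences); ties go to the most recent stack.
    """
    samples = list(samples)
    if not samples:
        return None, 0
    counts, latest = {}, {}
    for index, stack in enumerate(samples):
        top = stack[0]
        counts[top] = counts.get(top, 0) + 1
        latest[top] = index         # remember the newest sample for this top function
    best = max(counts, key=lambda top: (counts[top], latest[top]))
    return samples[latest[best]], counts[best]


# A circular buffer keeps only the most recent samples (capacity is assumed).
buffer = deque(maxlen=8)
for sampled in (["FuncA", "main"], ["FuncB", "main"], ["FuncC", "render", "main"],
                ["FuncC", "layout", "main"], ["FuncC", "render", "main"]):
    buffer.append(sampled)

stack, hits = most_time_consuming(buffer)
approx_ms = hits * SAMPLE_INTERVAL_MS
```

For this buffer, the stack topped by FuncC is selected with three occurrences, so approx_ms is 150. Using the top function as the key is coarse: the two FuncC samples with different callers are counted as the same stack, which matches the first rule above.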

Annealing Algorithm of the Monitoring Thread

In normal scenarios without exceptions, the stuttering detection mechanism introduces negligible overhead. However, if a stuttering issue lasts for several seconds, significant performance degradation occurs when the stack information of the main thread is frequently captured. Repeated recording of identical stack information provides little analytical value and is unnecessary. To reduce the performance overhead introduced by stuttering monitoring, the SDK adopts an annealing algorithm that gradually increases the detection interval. This avoids secondary performance issues caused by the repeated capturing of the same stuttering issue.

● Each time the monitoring thread detects a stuttering issue on the main thread, it captures the call stack of the main thread and stores it in memory.

● The captured stack is compared with the stack obtained from the previous stuttering issue.

If they are different, the current thread snapshot is written to a file.
If they are the same, the snapshot is not written again, and the detection interval is increased based on the Fibonacci sequence until stuttering disappears or a different stack is detected.

This algorithm prevents the same stuttering issue from being written to multiple files and avoids the continuous dumping of thread snapshots by the monitoring thread when the main thread is frozen.
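The backoff can be sketched as follows. This Python example is an illustration under assumptions, not the SDK's implementation: the class name AnnealingPolicy, the 1-second base interval, the use of plain equality to compare stacks, and the reset on recovery are all hypothetical choices. Only the Fibonacci growth and the skip of repeated stacks come from the description above.

```python
class AnnealingPolicy:
    """Decides whether to write a snapshot and how long to wait before the next check."""

    def __init__(self, base_interval=1.0):
        self.base_interval = base_interval   # seconds; assumed value
        self._last_stack = None
        self._prev, self._curr = 1, 1        # consecutive Fibonacci numbers

    def on_stall(self, stack):
        """Return (should_write, next_interval) for a newly captured stack."""
        if stack != self._last_stack:
            # A different stutter: write it and start again from the base interval.
            self._last_stack = stack
            self._prev, self._curr = 1, 1
            return True, self.base_interval
        # Same stutter as before: skip the write and back off.
        self._prev, self._curr = self._curr, self._prev + self._curr
        return False, self.base_interval * self._curr

    def on_recovered(self):
        """Reset once the main thread is responsive again."""
        self._last_stack = None
        self._prev, self._curr = 1, 1


# Usage: the same stack is reported five times in a row.
policy = AnnealingPolicy(base_interval=1.0)
frozen_stack = ["FuncA", "viewDidLoad", "main"]
decisions = [policy.on_stall(frozen_stack) for _ in range(5)]
```

With these assumptions, only the first of the five reports is written, and the intervals grow as 1, 2, 3, 5, and 8 seconds. A different stack, or a recovery followed by a new stutter, brings the interval back to the base value.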

Performance Overhead

The primary principle of any monitoring tool is that it must not noticeably affect the performance of the monitored app. Therefore, it is necessary to measure the actual performance impact of the RunLoop-based stuttering monitoring solution. The core approach is A/B testing. In such testing, two app versions that are almost identical are prepared:

● Version A (baseline version): Stuttering monitoring is disabled.

● Version B (monitored version): Stuttering monitoring is enabled.

Both versions are tested on the same device and under the same conditions, with the same operations performed on them. The difference in key performance metrics is measured, which represents the performance overhead introduced by stuttering monitoring.

Test device: iPhone 12 Pro
Test OS: iOS 18.7

When stuttering monitoring is disabled, let the app run for a period of time and then manually trigger stuttering. In this case, the overall CPU utilization of the app is shown in the following figure.

When stuttering monitoring is enabled, let the app run for a period of time and then manually trigger stuttering. In this case, the overall CPU utilization of the app is shown in the following figure.

With stuttering monitoring enabled, the CPU utilization of the monitoring thread is shown in the following figures.

When stuttering occurs

When no stuttering occurs

Based on the preceding analysis, after stuttering monitoring is introduced into the app:

● When no stuttering occurs, the impact on the performance of the app is almost negligible.

● When stuttering occurs, the overall CPU utilization of the app increases by approximately 0.33%. The actual values may vary slightly based on devices.

Summary

This article introduces the mainstream stuttering monitoring solutions in iOS and provides a detailed explanation of a RunLoop-based stuttering monitoring implementation, which includes RunLoop state change monitoring, stack capturing, time-consuming stack capturing, and the annealing algorithm for sustained stuttering scenarios. By integrating mature and proven industry implementations, the adopted solution accurately detects the blocking and stuttering of the main thread. Stuttering monitoring continues to evolve, with several improvements that can be made in the future, such as the detection of stuttering caused by high CPU utilization and app startup stuttering. This solution has already been applied in the Real User Monitoring (RUM) SDK for iOS of Application Real-Time Monitoring Service (ARMS).
