EMR Serverless Spark is a high-performance lakehouse product for data and AI. It provides a one-stop data platform for enterprises with features such as task development, debugging, scheduling, and operations and maintenance (O&M). This simplifies the entire process of data processing and model training. The product is 100% compatible with the open source Spark ecosystem and can be seamlessly integrated into your existing data platform. EMR Serverless Spark allows enterprises to focus on optimizing data processing, analysis, and model training to improve work efficiency.
Service architecture
The architecture of EMR Serverless Spark consists of the following four layers:
Application scenario layer
EMR Serverless Spark meets a wide range of data needs. For data warehouse and BI analytics scenarios, it provides an SQL editor for simple data queries and report development. It is also compatible with traditional data warehouse usage patterns. For artificial intelligence and data science, it integrates a Notebook feature that supports Python environment management and interactive machine learning development. The platform is designed to be a unified solution that combines multiple scenarios. This allows users to efficiently complete the entire workflow, from data analytics to model training, without switching tools.
Platform capability layer
This layer supports the scenarios in the application scenario layer. It uses workflow orchestration to enable mixed scheduling for batch processing, stream computing, and AI jobs. You can orchestrate ETL tasks, real-time analytics, and machine learning training in the same pipeline. This avoids issues caused by fragmented systems. All operations can be managed through Resource Access Management (RAM) authentication and authorization. This provides fine-grained control over access to resources, data, and features to ensure enterprise-grade security. In addition, the SQL editor and Notebook feature optimize the development experience for data warehouses and AI, respectively. The Notebook, Kyuubi, and Livy services provide developers with flexible programming interfaces and task submission services.
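As an illustration of programmatic task submission, the open source Apache Livy service accepts batch jobs as JSON posted to its /batches endpoint. The sketch below only composes such a payload; the file path and configuration values are hypothetical placeholders, and the exact endpoint exposed by a given deployment should be checked against its documentation.

```python
import json

def build_livy_batch(file, class_name=None, args=None, conf=None):
    """Build the JSON body for Livy's POST /batches endpoint.

    Only the fields that are set are included, matching Livy's
    convention of optional request fields.
    """
    payload = {"file": file}
    if class_name:
        payload["className"] = class_name
    if args:
        payload["args"] = args
    if conf:
        payload["conf"] = conf
    return payload

# Hypothetical job script and settings, for illustration only.
payload = build_livy_batch(
    file="oss://my-bucket/jobs/etl.py",
    conf={"spark.executor.instances": "4"},
)
print(json.dumps(payload))
```

In a real client, this payload would be sent with an HTTP POST (for example, via `urllib.request`) to the Livy endpoint, and the returned batch ID would be polled for job status.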
Core engine layer
Fusion engine: Designed for CPU-intensive scenarios, it provides a C++-based vectorized SQL engine. Compared to the Java Virtual Machine (JVM), the Fusion engine makes better use of SIMD instructions. This improves CPU utilization and reduces memory overhead.
Celeborn: An enterprise-grade Remote Shuffle Service that supports multi-tenant data isolation and resource elasticity for I/O-intensive scenarios.
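As a sketch of how a Spark job points its shuffle at a remote service, the open source Apache Celeborn client is typically enabled with Spark properties like the following. The master endpoint is a placeholder, and the exact keys should be confirmed against the Celeborn version in use:

```
spark.shuffle.manager              org.apache.spark.shuffle.celeborn.SparkShuffleManager
spark.celeborn.master.endpoints    celeborn-master-1:9097
```

With these settings, shuffle data is written to the remote service instead of local executor disks, which is what allows compute nodes to stay small and scale freely.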
Lakehouse storage layer
This layer is based on open data lake formats such as Paimon and Iceberg. It retains the flexibility of a data lake while providing key capabilities of a traditional data warehouse. These capabilities include ACID transactions, efficient data upserts, and complete data lineage records.
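For instance, ACID upserts on these formats are expressed in standard Spark SQL. The MERGE INTO statement below uses hypothetical table names and follows the syntax documented for open source Iceberg and Paimon:

```sql
-- Hypothetical tables: fold a batch of changed rows into a lake table.
MERGE INTO lakehouse.db.orders AS t
USING lakehouse.db.orders_changes AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *        -- update existing orders
WHEN NOT MATCHED THEN INSERT *;       -- insert new orders
```

The whole statement commits atomically, which is the transactional guarantee that distinguishes these table formats from raw files on a data lake.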
Benefits
Cloud-native high-speed compute engine
Built-in Fusion Engine (Spark Native Engine): Delivers a 300% performance improvement over the open source version and significantly accelerates big data computing tasks. The engine optimizes computing efficiency with a vectorized engine and batch data processing technology. It also reduces memory usage, which improves overall performance.
Built-in Celeborn (Remote Shuffle Service): Supports petabyte-scale shuffle data processing, which greatly improves the stability and performance of large shuffle tasks. Compute nodes do not require large disks. The service fully utilizes Spark's dynamic resource scaling capabilities to reduce storage costs. The total cost of computing resources can be reduced by up to 30%.
Flexible scaling and efficient resource utilization
On-demand elastic scaling: Supports a compute-storage decoupled architecture. Computing resources can scale elastically within seconds, with a minimum granularity of one core. Resources are metered at a fine-grained task or queue level. Storage uses a pay-as-you-go model to prevent resource waste and significantly reduce operational costs.
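To make the effect of fine-grained metering concrete, the sketch below compares an always-on allocation with per-second, per-core billing. The unit price and workload numbers are hypothetical illustration values, not actual EMR Serverless Spark pricing.

```python
# Illustrative only: the unit price and task durations below are
# made up for comparison, not real EMR Serverless Spark pricing.
PRICE_PER_CORE_SECOND = 0.0001  # hypothetical unit price

def serverless_cost(tasks):
    """Pay only for the core-seconds each task actually uses."""
    return sum(cores * seconds * PRICE_PER_CORE_SECOND
               for cores, seconds in tasks)

def always_on_cost(cores, total_seconds):
    """Pay for a fixed allocation for the whole period."""
    return cores * total_seconds * PRICE_PER_CORE_SECOND

# Three tasks as (cores, duration_seconds), all within one hour.
tasks = [(8, 600), (16, 300), (4, 900)]
print(serverless_cost(tasks))    # metered: only 13,200 core-seconds
print(always_on_cost(16, 3600))  # reserved: 16 cores for the full hour
```

Under these assumed numbers, metering at task granularity charges for roughly a quarter of the core-seconds that a peak-sized, always-on allocation would.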
Seamless migration and compatibility: Integrates with OSS-HDFS, a cloud storage service that is fully compatible with HDFS, which supports a smooth migration of your business to the cloud. It uses Data Lake Formation (DLF) to fully integrate lakehouse metadata. This ensures data access consistency and complete permission management, which helps you easily build a modern data lakehouse architecture.
Seamless ecosystem compatibility
Full compatibility with open source Spark: You can run jobs directly without code modification. It provides compatible spark-submit and spark-sql tools to lower the migration barrier.
Deep integration with mainstream lakehouse formats: Fully supports Apache Paimon, Iceberg, Delta Lake, and Hudi to meet diverse data storage needs.
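To illustrate the drop-in compatibility, an existing job can be submitted with the familiar command line. The configuration values and script path below are placeholders rather than actual product endpoints:

```
spark-submit \
  --conf spark.executor.memory=4g \
  --conf spark.executor.cores=4 \
  oss://my-bucket/jobs/etl.py
```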
Scheduling systems and security capabilities: Supports integration with mainstream scheduling systems, such as Apache Airflow and Apache DolphinScheduler. It can connect to external Kerberos or LDAP for identity authentication and use Apache Ranger for data authorization to ensure data security.
Machine learning adaptation: Provides a built-in SparkML environment and Notebook. It supports the full lifecycle management of third-party Python libraries.
One-stop development experience
End-to-end development support: Provides a one-stop development experience from task development, debugging, and publishing to scheduling. This meets the high standards for enterprise-level development and release. The built-in version management feature records the complete history of each release and supports comparing differences in source code and configurations, so that changes are traceable.
Efficient collaboration and stability: Development and production environments are strictly isolated to ensure business stability. This helps teams collaborate efficiently and deliver stable results.
Serverless resource platform
Out-of-the-box: You can start task development quickly without manual management or complex infrastructure setup.
Second-level elasticity: Dynamically pulls resources and starts pods based on the resource requirements of Spark tasks. Resources are released immediately after the computation is complete. Billing is based only on the amount of resources that are actually used, which further reduces the total computing cost.
Cost estimation: Provides task-level resource metering and cost estimation to help you achieve fine-grained operations.
Billing
The following billing methods are supported:
Subscription: Purchase resources for a specific period. You pay before you use the resources.
Pay-as-you-go: Activate and release resources as needed. You pay after you use the resources.
How to use
EMR Serverless Spark console: A web-based service page for interactive operations.
API: Supports RPC-style API operations that use GET and POST requests. For more information about the API operations, see the API Reference. The following are common developer tools for calling API operations:
OpenAPI Developer Portal: Provides services such as quick API retrieval, online API calls, and the dynamic generation of SDK example code.
Alibaba Cloud SDK: Provides software development kits (SDKs) for various programming languages, such as Java, Python, and PHP.