This topic describes what a catalog is and how to use a catalog to manage and query internal and external data.

Terms

  • Internal data: the data that is stored in StarRocks.
  • External data: the data that is stored in external data sources, such as Apache Hive, Apache Iceberg, and Apache Hudi.

Catalog

StarRocks 2.3 and later support catalogs, which let you access and query data that is stored in various external data sources. StarRocks supports two types of catalogs: internal catalogs and external catalogs.

  • Internal catalog: used to manage all internal data in a StarRocks cluster. For example, databases and tables that are created by executing the CREATE DATABASE and CREATE TABLE statements are managed in the internal catalog of the StarRocks cluster. Each StarRocks cluster has exactly one internal catalog, which is named default_catalog.
  • External catalog: used to manage the access information of external data sources, such as the data source types and the uniform resource identifiers (URIs) of Hive metastores. In StarRocks, you can directly query external data by using an external catalog.
    You can create external catalogs for external data sources such as Apache Hive, Apache Iceberg, and Apache Hudi.
    When you use an external catalog to query data from an external data source, StarRocks uses two components of the external data source:
    • Metadata service: exposes metadata to the frontends (FEs) of the StarRocks cluster so that the FEs can generate query plans.
    • Storage system: stores the data. Data files are stored in various formats in a distributed file system or an object storage system. After an FE distributes the generated query plan to the backends (BEs), each BE scans the target data in the storage system in parallel, performs the computation, and returns the query results.
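
As a rough illustration of the access information described above, an external catalog for a Hive data source might be created as follows. This is a minimal sketch, not a complete reference: the catalog name hive_catalog is chosen to match the examples later in this topic, and the metastore URI is a placeholder rather than a real address. Check the statement reference for your StarRocks version for the properties that your data source requires.

```sql
-- Hypothetical example: register a Hive metastore as an external catalog.
-- The URI below is a placeholder; replace it with the address of your own metastore.
CREATE EXTERNAL CATALOG hive_catalog
PROPERTIES
(
    "type" = "hive",
    "hive.metastore.uris" = "thrift://<metastore_host>:9083"
);

-- List the catalogs in the cluster. default_catalog should appear
-- alongside the newly created external catalog.
SHOW CATALOGS;
```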

Query data

Query internal data

For more information about how to query data that is stored in StarRocks, see Default catalog.

Query external data

For more information about how to query data that is stored in external data sources, see Data lake analytics.

Query data across catalogs

To query data across catalogs, reference the target by its qualified name: catalog_name.db_name for a database, or catalog_name.db_name.table_name for a table. Examples:
  • When the current database is default_catalog.olap_db, execute the following statement to query data from the hive_table table in the hive_catalog catalog:
    SELECT * FROM hive_catalog.hive_db.hive_table;
  • When the current database is hive_catalog.hive_db, execute the following statement to query data from the olap_table table in the default_catalog catalog:
    SELECT * FROM default_catalog.olap_db.olap_table;
  • When the current database is hive_catalog.hive_db, execute the following statement to perform a federated query that joins the hive_table table in the current database with the olap_table table in the default_catalog catalog:
    SELECT * FROM hive_table h JOIN default_catalog.olap_db.olap_table o ON h.id = o.id;
  • When the current database is in any other catalog, execute the following statement to perform a federated query that joins the hive_table table in the hive_catalog catalog with the olap_table table in the default_catalog catalog:
    SELECT * FROM hive_catalog.hive_db.hive_table h JOIN default_catalog.olap_db.olap_table o ON h.id = o.id;
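
The catalog_name.db_name format can also be used to change the current database of a session, which determines how unqualified table names are resolved. The following is a minimal sketch that assumes the same hypothetical catalog, database, and table names as the examples above; adjust them to match your own cluster.

```sql
-- List the databases that the external catalog exposes.
SHOW DATABASES FROM hive_catalog;

-- Switch the session to a database in the external catalog.
USE hive_catalog.hive_db;

-- hive_table now resolves against hive_catalog.hive_db,
-- so no catalog or database prefix is needed.
SELECT * FROM hive_table LIMIT 10;

-- Tables in other catalogs still need a fully qualified name.
SELECT count(*) FROM default_catalog.olap_db.olap_table;
```

Fully qualifying every table name, as in the last federated query above, makes a statement independent of the current database, at the cost of longer SQL.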