
Quick start with Presto

Last Updated: May 04, 2018

This article describes the basic usage of Presto and common application development methods, so that developers can quickly get started with Presto.

System structure

Presto’s system structure is shown in the following figure:

[Figure: Presto system architecture]

Presto is a typical M/S (master/slave) architecture system, comprising a Coordinator node and multiple Worker nodes. The Coordinator is responsible for the following:

  • Receiving and parsing users’ query requests, generating execution plans, and sending the execution plans to the Worker nodes for execution.

  • Monitoring the running status of the Worker nodes. Each Worker node maintains a heartbeat connection with the Coordinator node and reports its status.

  • Maintaining the MetaStore data.

Worker nodes run the tasks assigned by the Coordinator node, read data from external storage systems through connectors, process the data, and send the results to the Coordinator node.

Basic concepts

This section describes the basic Presto concepts to provide a better understanding of how Presto works.

Data model

The data model refers to the way data is organized. Presto uses a three-level structure, namely Catalog, Schema, and Table, to manage data.

  • Catalog

    A Catalog contains multiple Schemas and physically points to an external data source, which is accessed through a Connector. When you run a SQL statement in Presto, you are running it against one or more Catalogs.

  • Schema

    You can think of a Schema as a database instance, which contains multiple data tables.

  • Table

    A data table, equivalent to a table in a conventional database.

Relations among Catalog, Schema, and Table are shown in the following figure.

[Figure: Relations among Catalog, Schema, and Table]
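This three-level naming is also visible directly in SQL: a table can always be referenced by its fully qualified name in the form catalog.schema.table. As a small sketch, assuming the hive Catalog and a table t1 in the default Schema (the same table used in the Java example later in this article):

    presto> SELECT * FROM hive.default.t1 LIMIT 10;

Because the command line example below starts the console with --catalog hive --schema default, the same table can there be referenced simply as t1.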

Connector

Presto uses Connectors to connect to various external data sources. Presto provides a standard SPI that allows users to develop their own Connectors to access custom data sources.

Generally, a Catalog is associated with a specific Connector (configured in the Catalog's properties file). Presto contains multiple built-in Connectors. For more information, see Connectors.
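As an illustration, a Catalog is typically defined by a properties file in the etc/catalog directory of each Presto node, and its connector.name property selects the Connector. The following sketch shows what a hive Catalog definition might look like; the metastore address is a placeholder, and the exact properties depend on your cluster:

    # etc/catalog/hive.properties (illustrative; adjust to your cluster)
    connector.name=hive-hadoop2
    hive.metastore.uri=thrift://<METASTORE_HOST>:9083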

Query execution model

This section describes the concepts involved in the Presto query process, to help users better understand how Presto executes SQL statements and how to optimize query performance (a concrete EXPLAIN example follows the list below).

  • Statement

    Statement refers to an SQL statement entered by a user via JDBC or the CLI.

  • Query

    Query refers to the execution process of a query. When Presto receives an SQL Statement, the Coordinator parses the statement, generates an execution plan, and sends the plan to the Workers for execution. A Query is logically made up of several components, namely Stage, Task, Driver, Split, Operator, and DataSource, as shown in the following figure:

    [Figure: Components of a Presto Query]

  • Stage

    A Presto Query contains multiple Stages. A Stage is a logical concept that represents one stage of the query process and comprises one or more execution Tasks. Presto organizes Stages in a tree structure whose root node is the Single Stage. This Stage aggregates the data output from upstream Stages and sends the results directly to the Coordinator. The leaf nodes of the tree are Source Stages, which receive data from Connectors for processing.

  • Task

    Task refers to a specific task to be executed; it is the smallest unit of Presto task scheduling. During execution, the Presto task scheduler distributes tasks to individual Workers for execution. Tasks in the same Stage can be executed in parallel, while Tasks in two different Stages transmit data via the Exchange module.

    A Task is also a logical concept, comprising the task's parameters and content. The actual execution is completed by Drivers.

  • Driver

    A Driver is responsible for executing specific tasks. A Task may contain multiple Driver instances to achieve parallel processing within the same Task. Each Driver processes one Split. A Driver is made up of a set of Operators and is responsible for concrete data operations, such as conversion and filtering.

  • Operator

    The Operator is the smallest execution unit and is responsible for processing each Page of a Split, performing operations such as weighting and conversion. Conceptually, it is similar to a logical operator.

    A Page is a column-oriented data structure and the smallest data unit that an Operator can process. A Page consists of multiple Blocks, with each Block holding multiple rows of data for one field. A Page can be at most 1 MB in size and can contain up to 16 * 1024 rows of data.

  • Exchange

    Two Stages exchange data through the Exchange module. The data transmission actually takes place between two Tasks: generally, a downstream Task fetches data from the Output Buffer of an upstream Task using an Exchange Client. The fetched data is then passed to the Drivers as Splits for processing.
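You can inspect the Stage structure that Presto generates for a concrete query with the EXPLAIN statement. A small sketch, reusing the t1 table from the examples in this article (the exact plan output varies with version and data):

    presto:default> EXPLAIN (TYPE DISTRIBUTED) SELECT count(*) FROM t1;

The distributed plan is printed as a set of Fragments; each Fragment corresponds to a Stage and is executed by one or more Tasks.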

The command line tool

Use SSH to log on to an EMR cluster, and execute the following command to enter the Presto console:

    $ presto --server emr-header-1:9090 --catalog hive --schema default --user hadoop

You can execute the following command from the console:

    presto:default> show schemas;
           Schema
    --------------------
     default
     hive
     information_schema
     tpch_100gb_orc
     tpch_10gb_orc
     tpch_10tb_orc
     tpch_1tb_orc
    (7 rows)
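From the console you can then issue regular SQL statements against any of the listed Schemas, for example (output omitted):

    presto:default> show tables from tpch_10gb_orc;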

You can run the presto --help command to obtain help from the console. The parameters and their definitions are as follows:

    --server <server>                        # Specifies the URI of the Coordinator
    --user <user>                            # Sets the username
    --catalog <catalog>                      # Specifies the default Catalog
    --schema <schema>                        # Specifies the default Schema
    --execute <execute>                      # Executes a statement and then exits
    -f <file>, --file <file>                 # Executes the SQL statements in a file and then exits
    --debug                                  # Shows debugging information
    --client-request-timeout <timeout>       # Specifies the client timeout value, which is 2 minutes by default
    --enable-authentication                  # Enables client authentication
    --keystore-password <keystore password>  # KeyStore password
    --keystore-path <keystore path>          # KeyStore path
    --krb5-config-path <krb5 config path>    # Kerberos configuration file path (default: /etc/krb5.conf)
    --krb5-credential-cache-path <path>      # Kerberos credential cache path
    --krb5-keytab-path <krb5 keytab path>    # Kerberos KeyTab file path
    --krb5-principal <krb5 principal>        # Kerberos principal to be used
    --krb5-remote-service-name <name>        # Kerberos service name of the remote server
    --log-levels-file <log levels>           # Configuration file path for debugging logs
    --output-format <output-format>          # Bulk export data format, which is CSV by default
    --session <session>                      # Specifies a session attribute, in the format key=value
    --socks-proxy <socks-proxy>              # Sets the SOCKS proxy server
    --source <source>                        # Sets the query source
    --version                                # Shows version information
    -h, --help                               # Shows help information
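For example, the --execute and --output-format options can be combined for a non-interactive bulk export. A minimal sketch, assuming the same cluster and the t1 table used elsewhere in this article:

    $ presto --server emr-header-1:9090 --catalog hive --schema default \
        --execute 'SELECT * FROM t1' --output-format CSV > t1.csv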

Using JDBC

Java applications can access databases using the JDBC driver provided by Presto. The usage is basically the same as that of general RDBMS databases.

Importing with Maven

You can add the following configuration to the pom.xml file to introduce the Presto JDBC driver:

    <dependency>
        <groupId>com.facebook.presto</groupId>
        <artifactId>presto-jdbc</artifactId>
        <version>0.187</version>
    </dependency>

Driver class name

The Presto JDBC driver class is com.facebook.presto.jdbc.PrestoDriver.

Connection string

The following connection string format is supported:

    jdbc:presto://<COORDINATOR>:<PORT>/[CATALOG]/[SCHEMA]

For example:

    jdbc:presto://emr-header-1:9090               # Connects to the database, using the default Catalog and Schema
    jdbc:presto://emr-header-1:9090/hive          # Connects to the database, using Catalog (hive) and the default Schema
    jdbc:presto://emr-header-1:9090/hive/default  # Connects to the database, using Catalog (hive) and Schema (default)

Connection parameters

The Presto JDBC driver supports various parameters that may be set as URL parameters or as Properties passed to DriverManager. The following two examples are equivalent.

Example for passing to DriverManager as Properties:

    String url = "jdbc:presto://emr-header-1:9090/hive/default";
    Properties properties = new Properties();
    properties.setProperty("user", "hadoop");
    Connection connection = DriverManager.getConnection(url, properties);
    ......

Example for passing to DriverManager as URL parameters:

    String url = "jdbc:presto://emr-header-1:9090/hive/default?user=hadoop";
    Connection connection = DriverManager.getConnection(url);
    ......

The parameters are described as follows:

Parameter Name                Format           Description
----------------------------  ---------------  ------------------------------------------------------------
user                          STRING           Username
password                      STRING           Password
socksProxy                    <host>:<port>    SOCKS proxy server address and port. Example: localhost:1080
httpProxy                     <host>:<port>    HTTP proxy server address and port. Example: localhost:8888
SSL                           true|false       Whether or not to use HTTPS for connections
SSLTrustStorePath             STRING           Java TrustStore file path
SSLTrustStorePassword         STRING           Java TrustStore password
KerberosRemoteServiceName     STRING           Kerberos service name
KerberosPrincipal             STRING           Kerberos principal
KerberosUseCanonicalHostname  true|false       Whether or not to use the canonical hostname
KerberosConfigPath            STRING           Kerberos configuration file path
KerberosKeytabPath            STRING           Kerberos KeyTab file path
KerberosCredentialCachePath   STRING           Kerberos credential cache path
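As an illustration, the HTTPS-related parameters in the table above are set through the same Properties mechanism shown earlier. This is a sketch only; the TrustStore path and password are placeholders:

    // Illustrative sketch: enables HTTPS with a client TrustStore (placeholder values)
    String url = "jdbc:presto://emr-header-1:9090/hive/default";
    Properties properties = new Properties();
    properties.setProperty("user", "hadoop");
    properties.setProperty("SSL", "true");
    properties.setProperty("SSLTrustStorePath", "/path/to/truststore.jks");
    properties.setProperty("SSLTrustStorePassword", "<password>");
    Connection connection = DriverManager.getConnection(url, properties);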

Java example:

The following is an example of using the Presto JDBC driver with Java.

    .....
    // Loads the JDBC driver class
    try {
        Class.forName("com.facebook.presto.jdbc.PrestoDriver");
    } catch (ClassNotFoundException e) {
        LOG.error("Failed to load presto jdbc driver.", e);
        System.exit(-1);
    }
    Connection connection = null;
    Statement statement = null;
    try {
        String url = "jdbc:presto://emr-header-1:9090/hive/default";
        Properties properties = new Properties();
        properties.setProperty("user", "hadoop");
        // Creates the Connection object
        connection = DriverManager.getConnection(url, properties);
        // Creates the Statement object
        statement = connection.createStatement();
        // Executes the query
        ResultSet rs = statement.executeQuery("select * from t1");
        // Iterates over the returned results
        int columnNum = rs.getMetaData().getColumnCount();
        int rowIndex = 0;
        while (rs.next()) {
            rowIndex++;
            for (int i = 1; i <= columnNum; i++) {
                System.out.println("Row " + rowIndex + ", Column " + i + ": " + rs.getInt(i));
            }
        }
    } catch (SQLException e) {
        LOG.error("Exception thrown.", e);
    } finally {
        // Closes the Statement object
        if (statement != null) {
            try {
                statement.close();
            } catch (Throwable t) {
                // No-op
            }
        }
        // Closes the connection
        if (connection != null) {
            try {
                connection.close();
            } catch (Throwable t) {
                // No-op
            }
        }
    }
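On Java 7 and later, the cleanup in the finally block above can be written more concisely with try-with-resources, which closes the Statement and Connection automatically. A minimal equivalent sketch of the same query:

    String url = "jdbc:presto://emr-header-1:9090/hive/default";
    Properties properties = new Properties();
    properties.setProperty("user", "hadoop");
    // try-with-resources closes the connection, statement, and result set automatically
    try (Connection connection = DriverManager.getConnection(url, properties);
         Statement statement = connection.createStatement();
         ResultSet rs = statement.executeQuery("select * from t1")) {
        int columnNum = rs.getMetaData().getColumnCount();
        while (rs.next()) {
            for (int i = 1; i <= columnNum; i++) {
                System.out.println("Column " + i + ": " + rs.getInt(i));
            }
        }
    } catch (SQLException e) {
        LOG.error("Exception thrown.", e);
    }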