Alluxio provides a distributed cache mechanism to cache data in memory and on disks of nodes in an E-MapReduce (EMR) cluster. This topic describes the cache policies and the commands that are used to manage the lifecycle of data.

Prerequisites

  • An EMR Hadoop cluster is created, and Alluxio is selected from the optional services when you create the cluster. For more information, see Create a cluster.
  • You have logged on to the cluster. For more information, see Log on to a cluster.

Background information

By default, EMR uses a two-level cache mechanism that caches data in memory and on disks. Memory accounts for 10% of the storage space of a node, and disk space accounts for 30% of the storage space of the node. To modify these settings, go to the Configure tab on the Alluxio service page, search for the parameters that start with alluxio.worker.tieredstore in the Configuration Filter section, and change the values of the parameters.
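As a rough illustration of the default proportions described above, the following sketch estimates the cache sizes for one node. The function name tier_sizes_gb and its input (node storage in GB) are hypothetical and are not Alluxio parameters. The sketch only applies the 10% and 30% figures stated in this topic.

```shell
# Hypothetical helper: estimate the default cache sizes for one node.
# Assumes the 10% (memory) and 30% (disk) defaults stated in this topic.
tier_sizes_gb() {
  node_storage_gb=$1
  mem_gb=$(( node_storage_gb * 10 / 100 ))    # memory share of the cache
  disk_gb=$(( node_storage_gb * 30 / 100 ))   # disk share of the cache
  echo "memory=${mem_gb}GB disk=${disk_gb}GB"
}

# Example: a node with 500 GB of storage space (illustrative value).
tier_sizes_gb 500
```

For a 500 GB node, the sketch prints memory=50GB disk=150GB. Integer arithmetic is used, so fractional results are truncated.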

For more information about caching, see Caching.

Cache policies

By default, a client writes new data blocks to level 0. If level 0 does not have sufficient space, the client attempts to write the data blocks to the next level. If neither level has sufficient space, Alluxio releases space to store the newly written data blocks. By default, Alluxio releases space based on the LRUAnnotator policy, which releases the space that is occupied by existing data blocks in least recently used (LRU) order.
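The LRU ordering described above can be illustrated with a toy model. This is a simplified sketch and not Alluxio's implementation: it treats both levels as one pool, counts capacity in blocks rather than bytes, and uses hypothetical names (CACHE, CAPACITY, access_block).

```shell
# Toy model of LRU-ordered eviction; not Alluxio's implementation.
# CACHE lists block IDs from least to most recently used.
CACHE=""
CAPACITY=3   # hypothetical number of blocks that fit across both levels

access_block() {
  blk=$1
  kept=""
  # Drop the block from its old position, if it is already cached.
  for b in $CACHE; do
    [ "$b" = "$blk" ] || kept="$kept $b"
  done
  # Evict from the least recently used end until one slot is free.
  set -- $kept
  while [ "$#" -ge "$CAPACITY" ]; do
    echo "evict $1"
    shift
  done
  # Append the block as the most recently used entry.
  CACHE="$* $blk"
  CACHE="${CACHE# }"
}

# Access blocks a, b, c, then a again, then a new block d.
for blk_id in a b c a d; do
  access_block "$blk_id"
done
echo "cached: $CACHE"
```

In this run, block a is accessed again before block d arrives, so block b is the least recently used block and is the one that is evicted.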

When a client reads data blocks that are stored in Alluxio, the client reads the data blocks from the core node on which Alluxio runs. If the data blocks are not stored in Alluxio, the client caches the data blocks in Alluxio. Subsequent reads are then served from the core node on which Alluxio runs.

Commands that are used to manage the lifecycle of data

The following table describes the common commands that you can run to manage the lifecycle of data.

Command    Description
free       Removes data from the cache.
load       Loads data to the Alluxio cache.
persist    Persists files or directories in Alluxio to UFS.
setTtl     Specifies the time-to-live (TTL) of files or directories.

free

Removes data from the cache.

You can run this command to remove data from the Alluxio cache. The data is not removed from the Under File Storage (UFS), so it remains available to users. However, client access to the data may be slower.

  • Syntax:
    alluxio fs free <path>
  • Example: Remove all data in the /tmp directory from the Alluxio cache.
    alluxio fs free /tmp
    The following information is returned:
    /tmp was successfully freed from Alluxio space.

load

Loads data to the Alluxio cache.

  • Syntax:
    alluxio fs load <path>
  • Example: Load all data that is stored in the /tmp3/logs directory to the Alluxio cache.
    alluxio fs load /tmp3/logs
    The following information is returned:
    /tmp3/logs loaded

persist

Persists files or directories in Alluxio to UFS.

You can run this command to write the data in Alluxio to UFS. This way, you can restore data if faults occur in Alluxio.

  • Syntax:
    alluxio fs persist <path>
  • Example: Persist the /tmp directory in Alluxio to UFS.
    alluxio fs persist /tmp
    The following information is returned:
    persisted file /tmp with size 46
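Because persist writes data to UFS and free removes data only from the cache, the two commands can be chained when you want to release cache space without losing data. The following is a minimal sketch under that assumption. The function persist_then_free and the DRY_RUN switch are hypothetical helpers, not part of the Alluxio CLI.

```shell
# Sketch: persist a path to UFS, then free it from the Alluxio cache.
# DRY_RUN is a hypothetical switch: 1 (default) prints the commands, 0 runs them.
DRY_RUN=${DRY_RUN:-1}

persist_then_free() {
  target=$1
  for sub in persist free; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "alluxio fs $sub $target"
    else
      # Stop if a step fails so that data is not freed before it is persisted.
      alluxio fs "$sub" "$target" || return 1
    fi
  done
}

persist_then_free /tmp
```

With the default DRY_RUN=1, the sketch only prints the two commands in order so that you can review them before you run them on the cluster.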

setTtl

Specifies the time-to-live (TTL) of files or directories.

When the current time is later than the creation time of a file plus its TTL value, Alluxio performs the operation that is specified by the action parameter on the file. If you set the action parameter to delete, the file is deleted from both Alluxio and UFS. If you set the action parameter to free, the file is removed only from Alluxio. The default value of the action parameter is delete.

  • Syntax:
    alluxio fs setTtl [--action delete|free] <path> <time to live>
  • Examples:
    • Delete the /tmp directory from Alluxio and UFS one minute after the directory is created.
      alluxio fs setTtl /tmp 60000
      The following information is returned:
      TTL of path '/tmp' was successfully set to 60000 milliseconds, with expiry action set to DELETE
    • Delete the /dir directory only from Alluxio one day after the directory is created.
      alluxio fs setTtl --action free /dir 86400000
      The following information is returned:
      TTL of path '/dir' was successfully set to 86400000 milliseconds, with expiry action set to FREE
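The TTL values in the preceding examples are in milliseconds, which are easy to mistype. The following sketch converts a value and a unit to milliseconds before the setTtl command is built. The function ttl_ms and its unit letters (s, m, h, d) are hypothetical conveniences and are not options of the setTtl command.

```shell
# Hypothetical helper: convert a duration to milliseconds for setTtl.
# Usage: ttl_ms <value> <unit>, where unit is s, m, h, or d.
ttl_ms() {
  value=$1
  unit=$2
  case "$unit" in
    s) echo $(( value * 1000 )) ;;
    m) echo $(( value * 60 * 1000 )) ;;
    h) echo $(( value * 60 * 60 * 1000 )) ;;
    d) echo $(( value * 24 * 60 * 60 * 1000 )) ;;
    *) echo "unknown unit: $unit" >&2; return 1 ;;
  esac
}

# Print (without running) the command for a one-day TTL with the free action.
echo "alluxio fs setTtl --action free /dir $(ttl_ms 1 d)"
```

The sketch prints alluxio fs setTtl --action free /dir 86400000, which matches the one-day example above.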