
Time series

Last Updated: Aug 15, 2018

x13_arima

  • The autoregressive integrated moving average (ARIMA) model is a well-known time series forecasting method proposed by Box and Jenkins in the early 1970s, which is why it is also called the Box-Jenkins model or the Box-Jenkins approach.

  • x13-arima is an ARIMA algorithm based on the open-source X-13ARIMA-SEATS seasonal adjustment program.

  • For more information about X-13ARIMA-SEATS Seasonal Adjustment Program, visit wiki.

  • For more information about ARIMA, visit wiki.

PAI command

    PAI -name x13_arima
        -project algo_public
        -DinputTableName=pai_ft_x13_arima_input
        -DseqColName=id
        -DvalueColName=number
        -Dorder=3,1,1
        -Dstart=1949.1
        -Dfrequency=12
        -Dseasonal=0,1,1
        -Dperiod=12
        -DpredictStep=12
        -DoutputPredictTableName=pai_ft_x13_arima_out_predict
        -DoutputDetailTableName=pai_ft_x13_arima_out_detail

Parameter description

Parameter | Description | Value range | Required/Optional, default value
inputTableName | Input table | Table name | Required
inputTablePartitions | Partitions of the input table used for training, in the format partition_name=value. The multilevel format is name1=value1/name2=value2. Multiple partitions are separated by commas (,). | Partition name | Optional; all partitions are selected by default
seqColName | Time series column. It is used only to sort the values in valueColName; its values do not otherwise affect the algorithm. | Column name | Required
valueColName | Value column | Column name | Required
groupColNames | Grouping columns. Multiple columns are separated by commas (,), such as col0,col1. A time series is created for each group. | Column name | Optional
order | p, d, and q, which represent the autoregressive order, the differencing order, and the moving average order respectively | p, d, and q are non-negative integers in the range [0, 36] | Required
start | Time series start date (see Time series format) | String in the format year.seasonal, such as 1986.1 | Optional; default value: 1.1
frequency | Frequency of the time series (see Time series format) | Positive integer in the range (0, 12] | Optional; default value: 12, indicating 12 months/year
seasonal | sp, sd, and sq, which represent the seasonal autoregressive order, the seasonal differencing order, and the seasonal moving average order respectively | sp, sd, and sq are non-negative integers in the range [0, 36] | Optional; default value: not seasonal
period | Seasonal period | Number in the range (0, 100] | Optional; default value: frequency
maxiter | Maximum number of iterations | Positive integer | Optional; default value: 1500
tol | Tolerance | Double type | Optional; default value: 1e-5
predictStep | Number of prediction items | Number in the range (0, 365] | Optional; default value: 12
confidenceLevel | Prediction confidence level | Number in the range (0, 1) | Optional; default value: 0.95
outputPredictTableName | Prediction output table name | Table name | Required
outputDetailTableName | Detail table | Table name | Required
outputTablePartition | Partitions in the output table | Partition name | Optional; not output to a partition by default
coreNum | Number of nodes, paired with the memSizePerCore parameter | Positive integer | Optional; automatically calculated by default
memSizePerCore | Memory size of each node, in MB | Positive integer in the range [1024, 64*1024] | Optional; automatically calculated by default
lifecycle | Lifecycle of the output table | Positive integer | Optional; no lifecycle by default
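
The parameter rules above can be checked before a job is submitted. The following is a minimal Python sketch, not part of PAI: the helper name build_x13_arima_command is hypothetical, it validates only a few of the documented ranges (order, frequency, predictStep), and it renders the command text without submitting anything.

```python
# Hypothetical helper (not a PAI API): checks a few documented ranges,
# then renders the PAI command text for x13_arima.
REQUIRED = (
    "inputTableName", "seqColName", "valueColName", "order",
    "outputPredictTableName", "outputDetailTableName",
)


def build_x13_arima_command(params: dict) -> str:
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")

    # order is "p,d,q"; each value must be an integer in [0, 36].
    order = [int(part) for part in str(params["order"]).split(",")]
    if len(order) != 3 or not all(0 <= value <= 36 for value in order):
        raise ValueError("order must be p,d,q with each value in [0, 36]")

    # frequency defaults to 12 and must lie in (0, 12].
    frequency = int(params.get("frequency", 12))
    if not 0 < frequency <= 12:
        raise ValueError("frequency must be in (0, 12]")

    # predictStep defaults to 12 and must lie in (0, 365].
    predict_step = int(params.get("predictStep", 12))
    if not 0 < predict_step <= 365:
        raise ValueError("predictStep must be in (0, 365]")

    lines = ["PAI -name x13_arima", "    -project algo_public"]
    lines += [f"    -D{name}={value}" for name, value in params.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_x13_arima_command({
        "inputTableName": "pai_ft_x13_arima_input",
        "seqColName": "id",
        "valueColName": "number",
        "order": "3,1,1",
        "outputPredictTableName": "pai_ft_x13_arima_out_predict",
        "outputDetailTableName": "pai_ft_x13_arima_out_detail",
    }))
```

Failing early on an out-of-range value is usually cheaper than waiting for the job itself to reject it.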

Time series format

The start and frequency parameters specify the two time dimensions, TS1 and TS2, of the data in valueColName. The frequency parameter is the number of TS2 units in each TS1 cycle. The start parameter has the format n1.n2, which means that the series starts at the n2-th TS2 of the n1-th TS1.

Unit time | ts1 | ts2 | frequency | start
12 months/year | Year | Month | 12 | 1949.2 represents the 2nd month of the year 1949.
Four quarters/year | Year | Quarter | 4 | 1949.2 represents the 2nd quarter of the year 1949.
Seven days/week | Week | Day | 7 | 1949.2 represents the 2nd day of week 1949.
1 | Any time unit | 1 | 1 | 1949.1 represents the 1949th unit (year, day, hour, and so on).

For example, value=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]

  • start=1949.3, frequency=12 indicates monthly data that starts in the 3rd month of 1949. The prediction start date is 1950.06.
    year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec
    1949 |  |  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
    1950 | 11 | 12 | 13 | 14 | 15 |  |  |  |  |  |  |
  • start=1949.3, frequency=4 indicates quarterly data that starts in the 3rd quarter of 1949. The prediction start date is 1953.02.
    year | Qtr1 | Qtr2 | Qtr3 | Qtr4
    1949 |  |  | 1 | 2
    1950 | 3 | 4 | 5 | 6
    1951 | 7 | 8 | 9 | 10
    1952 | 11 | 12 | 13 | 14
    1953 | 15 |  |  |
  • start=1949.3, frequency=7 indicates daily data that starts on the 3rd day of week 1949. The prediction start date is 1951.04.
    week | Sun | Mon | Tue | Wed | Thu | Fri | Sat
    1949 |  |  | 1 | 2 | 3 | 4 | 5
    1950 | 6 | 7 | 8 | 9 | 10 | 11 | 12
    1951 | 13 | 14 | 15 |  |  |  |
  • start=1949.1, frequency=1 can represent any time unit. The 15 values cover cycles 1949 to 1963, so the prediction starts at cycle 1964.
    cycle | p1
    1949 | 1
    1950 | 2
    1951 | 3
    1952 | 4
    1953 | 5
    1954 | 6
    1955 | 7
    1956 | 8
    1957 | 9
    1958 | 10
    1959 | 11
    1960 | 12
    1961 | 13
    1962 | 14
    1963 | 15
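
The prediction start dates in the monthly, quarterly, and weekly examples above follow from simple modular arithmetic. The sketch below is an illustration only: the function names parse_start and first_prediction are hypothetical, and the code assumes that n2 is a 1-based position within the cycle, as in the examples.

```python
def parse_start(start: str) -> tuple[int, int]:
    """Split an 'n1.n2' start string into (cycle, position within cycle).

    The value is parsed as text rather than as a float, so that
    '1949.10' means position 10 and not position 1.
    """
    cycle_text, _, position_text = start.partition(".")
    return int(cycle_text), int(position_text)


def first_prediction(start: str, frequency: int, n_values: int) -> tuple[int, int]:
    """Return (cycle, position) of the first point after the observed data."""
    cycle, position = parse_start(start)
    if not 1 <= position <= frequency:
        raise ValueError("n2 must be between 1 and frequency")
    # Zero-based slot, counted from the beginning of the start cycle,
    # of the first point that follows the last observed value.
    slot = (position - 1) + n_values
    return cycle + slot // frequency, slot % frequency + 1


if __name__ == "__main__":
    for freq in (12, 4, 7):
        print(freq, first_prediction("1949.3", freq, 15))
```

With 15 values and start=1949.3, this gives (1950, 6) for frequency=12, (1953, 2) for frequency=4, and (1951, 4) for frequency=7, which matches the three tables.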

Example

Test data

Data set: AirPassengers.

This data set contains the monthly number of international airline passengers from 1949 to 1960. The first rows are shown in the following table.

id number
1 112
2 118
3 132
4 129
5 121

Upload the data by using Tunnel. The commands are as follows.

    create table pai_ft_x13_arima_input(id bigint,number bigint);
    tunnel upload data/airpassengers.csv pai_ft_x13_arima_input -h true;

PAI command

    PAI -name x13_arima
        -project algo_public
        -DinputTableName=pai_ft_x13_arima_input
        -DseqColName=id
        -DvalueColName=number
        -Dorder=3,1,1
        -Dseasonal=0,1,1
        -Dstart=1949.1
        -Dfrequency=12
        -Dperiod=12
        -DpredictStep=12
        -DoutputPredictTableName=pai_ft_x13_arima_out_predict
        -DoutputDetailTableName=pai_ft_x13_arima_out_detail

Output description

  • Output table: outputPredictTableName.
    The fields are:
    column name | comment
    pdate | Prediction date
    forecast | Prediction result
    lower | Lower bound of the prediction at the confidence level confidenceLevel (default: 0.95)
    upper | Upper bound of the prediction at the confidence level confidenceLevel (default: 0.95)
    Data display: [image]
  • Output table: outputDetailTableName.
    The fields are:
    column name | comment
    key | One of: model (the model), evaluation (the evaluation result), parameters (the training parameters), log (the training log)
    summary | Storage details
    Data display: [image]

    Display on the PAI web - Model factor (key=model): [image]

    Display on the PAI web - Evaluation index (key=evaluation): [image]

Algorithm scale

  • Supported scale

    • Rows: a maximum of 1200 rows in a single group

    • Columns: 1 data column

  • Resource calculation method

    • Default calculation method used when groupColNames is not set
      coreNum = 1
      memSizePerCore = 4096

    • Default calculation method used when groupColNames is set
      coreNum = floor (total number of data rows / 120,000)
      memSizePerCore = 4096
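
The default resource rules above can be written as a short Python sketch. The function name default_resources is hypothetical. The text does not say what happens when the floor yields 0 for fewer than 120,000 rows, so the clamp to 1 below is an assumption made because coreNum is documented as a positive integer.

```python
import math


def default_resources(total_rows: int, grouped: bool) -> tuple[int, int]:
    """Return (coreNum, memSizePerCore in MB) under the documented defaults."""
    mem_size_per_core = 4096
    if not grouped:
        # groupColNames is not set: a single node.
        return 1, mem_size_per_core
    core_num = math.floor(total_rows / 120_000)
    # Assumption (not stated in the text): never go below one node.
    return max(1, core_num), mem_size_per_core
```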

x13_auto_arima

x13-auto-arima includes an automatic ARIMA model selection procedure based largely on the procedure of Gomez and Maravall (1998) as implemented in TRAMO (1996) and subsequent revisions.

The x13-auto-arima selection process is as follows:

  1. Default model estimation.

    • When frequency is 1, the default model is (0,1,1).
    • When frequency is greater than 1, the default model is (0,1,1)(0,1,1).
  2. Identification of differencing orders.

    • Skip this step if you set diff and seasonalDiff.
    • Determine the differencing order d and the seasonal differencing order D by using the unit root test (wiki).
  3. Identification of ARMA model orders.

    Select the optimal model based on BIC (wiki). The maxOrder and maxSeasonalOrder parameters are used in this step.

  4. Comparison of identified model with default model.

    Compare the models by using the Ljung-Box Q statistic (wiki). If both models are unacceptable, use the (3,d,1)(0,D,1) model.

  5. Final model checks.
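
Steps 3 and 4 rely on two standard quantities, which the following Python sketch computes in their textbook forms. This is an illustration only: the exact definitions, lags, and thresholds used inside x13-auto-arima are not given here and may differ, and the function names bic and ljung_box_q are hypothetical.

```python
import math
from typing import Sequence


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Textbook BIC: k * ln(n) - 2 * ln(L). Lower values are preferred."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def ljung_box_q(residuals: Sequence[float], max_lag: int) -> float:
    """Ljung-Box Q = n (n + 2) * sum_{k=1..h} r_k^2 / (n - k).

    r_k is the lag-k sample autocorrelation of the residuals. A large Q
    suggests that the residuals are still autocorrelated.
    """
    n = len(residuals)
    if not 0 < max_lag < n:
        raise ValueError("max_lag must be between 1 and len(residuals) - 1")
    mean = sum(residuals) / n
    centered = [value - mean for value in residuals]
    denominator = sum(value * value for value in centered)
    if denominator == 0:
        raise ValueError("residuals are constant; autocorrelation is undefined")
    total = 0.0
    for lag in range(1, max_lag + 1):
        r_k = sum(centered[t] * centered[t - lag] for t in range(lag, n)) / denominator
        total += r_k * r_k / (n - lag)
    return n * (n + 2) * total
```

In a selection loop, each candidate order up to maxOrder and maxSeasonalOrder would be fitted, scored with bic, and the lowest score kept. The fitting itself is outside the scope of this sketch.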

PAI command line

    PAI -name x13_auto_arima
        -project algo_public
        -DinputTableName=pai_ft_x13_arima_input
        -DseqColName=id
        -DvalueColName=number
        -Dstart=1949.1
        -Dfrequency=12
        -DpredictStep=12
        -DoutputPredictTableName=pai_ft_x13_arima_out_predict2
        -DoutputDetailTableName=pai_ft_x13_arima_out_detail2

Parameter description

Parameter | Description | Value range | Required/Optional, default value
inputTableName | Input table | Table name | Required
inputTablePartitions | Partitions of the input table used for training, in the format partition_name=value. The multilevel format is name1=value1/name2=value2. Multiple partitions are separated by commas (,). | Partition name | Optional; all partitions are selected by default
seqColName | Time series column. It is used only to sort the values in valueColName; its values do not otherwise affect the algorithm. | Column name | Required
valueColName | Value column | Column name | Required
groupColNames | Grouping columns. Multiple columns are separated by commas (,), such as col0,col1. A time series is created for each group. | Column name | Optional
start | Time series start date (see Time series format) | String in the format year.seasonal, such as 1986.1 | Optional; default value: 1.1
frequency | Frequency of the time series (see Time series format) | Positive integer in the range (0, 12] | Optional; default value: 12, indicating 12 months/year
maxOrder | Maximum values of p and q | Non-negative integer in the range [0, 4] | Optional; default value: 2
maxSeasonalOrder | Maximum values of seasonal p and q | Non-negative integer in the range [0, 2] | Optional; default value: 1
maxDiff | Maximum value of the differencing order d | Non-negative integer in the range [0, 2] | Optional; default value: 2
maxSeasonalDiff | Maximum value of the seasonal differencing order D | Non-negative integer in the range [0, 1] | Optional; default value: 1
diff | Differencing order d. When both diff and maxDiff are set, maxDiff is ignored. diff and seasonalDiff must be set together. | Non-negative integer in the range [0, 2] | Optional; default value: -1, indicating that diff is not specified
seasonalDiff | Seasonal differencing order D. When both seasonalDiff and maxSeasonalDiff are set, maxSeasonalDiff is ignored. | Non-negative integer in the range [0, 1] | Optional; default value: -1, indicating that seasonalDiff is not specified
maxiter | Maximum number of iterations | Positive integer | Optional; default value: 1500
tol | Tolerance | Double type | Optional; default value: 1e-5
predictStep | Number of prediction items | Number in the range (0, 365] | Optional; default value: 12
confidenceLevel | Prediction confidence level | Number in the range (0, 1) | Optional; default value: 0.95
outputPredictTableName | Prediction output table name | Table name | Required
outputDetailTableName | Detail table | Table name | Required
outputTablePartition | Partitions in the output table | Partition name | Optional; not output to a partition by default
coreNum | Number of nodes, paired with the memSizePerCore parameter | Positive integer | Optional; automatically calculated by default
memSizePerCore | Memory size of each node, in MB | Positive integer in the range [1024, 64*1024] | Optional; automatically calculated by default
lifecycle | Lifecycle of the output table | Positive integer | Optional; no lifecycle by default
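
The interaction between diff, seasonalDiff, maxDiff, and maxSeasonalDiff can be summarized in a few lines. The sketch below is one reading of the rules in the table, not the program's own logic. The function name resolve_differencing and the returned dictionary layout are hypothetical.

```python
def resolve_differencing(diff: int = -1, seasonal_diff: int = -1,
                         max_diff: int = 2, max_seasonal_diff: int = 1) -> dict:
    """Decide whether differencing orders are fixed by the user or searched.

    -1 means "not specified", following the documented default values.
    """
    diff_set = diff != -1
    seasonal_set = seasonal_diff != -1
    if diff_set != seasonal_set:
        raise ValueError("diff and seasonalDiff must be set together")
    if diff_set:
        if not 0 <= diff <= 2:
            raise ValueError("diff must be in [0, 2]")
        if not 0 <= seasonal_diff <= 1:
            raise ValueError("seasonalDiff must be in [0, 1]")
        # Fixed orders: maxDiff and maxSeasonalDiff are ignored.
        return {"mode": "fixed", "d": diff, "D": seasonal_diff}
    if not 0 <= max_diff <= 2:
        raise ValueError("maxDiff must be in [0, 2]")
    if not 0 <= max_seasonal_diff <= 1:
        raise ValueError("maxSeasonalDiff must be in [0, 1]")
    # Orders are chosen by the unit root test, up to these limits.
    return {"mode": "search", "max_d": max_diff, "max_D": max_seasonal_diff}
```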

Time series format

The start and frequency parameters specify the two time dimensions, TS1 and TS2, of the data in valueColName. The frequency parameter is the number of TS2 units in each TS1 cycle. The start parameter has the format n1.n2, which means that the series starts at the n2-th TS2 of the n1-th TS1.

Unit time | ts1 | ts2 | frequency | start
12 months/year | Year | Month | 12 | 1949.2 represents the 2nd month of the year 1949.
Four quarters/year | Year | Quarter | 4 | 1949.2 represents the 2nd quarter of the year 1949.
Seven days/week | Week | Day | 7 | 1949.2 represents the 2nd day of week 1949.
1 | Any time unit | 1 | 1 | 1949.1 represents the 1949th unit (year, day, hour, and so on).

For example, value=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]

  • start=1949.3, frequency=12 indicates monthly data that starts in the 3rd month of 1949. The prediction start date is 1950.06.
    year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec
    1949 |  |  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
    1950 | 11 | 12 | 13 | 14 | 15 |  |  |  |  |  |  |
  • start=1949.3, frequency=4 indicates quarterly data that starts in the 3rd quarter of 1949. The prediction start date is 1953.02.
    year | Qtr1 | Qtr2 | Qtr3 | Qtr4
    1949 |  |  | 1 | 2
    1950 | 3 | 4 | 5 | 6
    1951 | 7 | 8 | 9 | 10
    1952 | 11 | 12 | 13 | 14
    1953 | 15 |  |  |
  • start=1949.3, frequency=7 indicates daily data that starts on the 3rd day of week 1949. The prediction start date is 1951.04.
    week | Sun | Mon | Tue | Wed | Thu | Fri | Sat
    1949 |  |  | 1 | 2 | 3 | 4 | 5
    1950 | 6 | 7 | 8 | 9 | 10 | 11 | 12
    1951 | 13 | 14 | 15 |  |  |  |
  • start=1949.1, frequency=1 can represent any time unit. The 15 values cover cycles 1949 to 1963, so the prediction starts at cycle 1964.
    cycle | p1
    1949 | 1
    1950 | 2
    1951 | 3
    1952 | 4
    1953 | 5
    1954 | 6
    1955 | 7
    1956 | 8
    1957 | 9
    1958 | 10
    1959 | 11
    1960 | 12
    1961 | 13
    1962 | 14
    1963 | 15

Example

Test data

Data set: AirPassengers.

This data set contains the monthly number of international airline passengers from 1949 to 1960. The first rows are shown in the following table.

id number
1 112
2 118
3 132
4 129
5 121

Upload the data by using Tunnel. The commands are as follows.

    create table pai_ft_x13_arima_input(id bigint,number bigint);
    tunnel upload data/airpassengers.csv pai_ft_x13_arima_input -h true;

PAI command

    PAI -name x13_auto_arima
        -project algo_public
        -DinputTableName=pai_ft_x13_arima_input
        -DseqColName=id
        -DvalueColName=number
        -Dstart=1949.1
        -Dfrequency=12
        -DmaxOrder=4
        -DmaxSeasonalOrder=2
        -DmaxDiff=2
        -DmaxSeasonalDiff=1
        -DpredictStep=12
        -DoutputPredictTableName=pai_ft_x13_arima_auto_out_predict
        -DoutputDetailTableName=pai_ft_x13_arima_auto_out_detail

Output description

  • Output table: outputPredictTableName.
    The fields are:
    column name | comment
    pdate | Prediction date
    forecast | Prediction result
    lower | Lower bound of the prediction at the confidence level confidenceLevel (default: 0.95)
    upper | Upper bound of the prediction at the confidence level confidenceLevel (default: 0.95)
    Data display: [image]
  • Output table: outputDetailTableName.
    The fields are:
    column name | comment
    key | One of: model (the model), evaluation (the evaluation result), parameters (the training parameters), log (the training log)
    summary | Storage details
    Data display: [image]

    Display on the PAI web - Model factor (key=model): [image]

    Display on the PAI web - Evaluation index (key=evaluation): [image]

Algorithm scale

  • Supported scale

    • Rows: a maximum of 1200 rows in a single group

    • Columns: 1 data column

  • Resource calculation method

    • Default calculation method used when groupColNames is not set
      coreNum = 1
      memSizePerCore = 4096

    • Default calculation method used when groupColNames is set
      coreNum = floor (total number of data rows / 120,000)
      memSizePerCore = 4096

Important notes

Why are the prediction results the same?

When an exception occurs during model training, the mean model is called, and all prediction results are the mean of the training data.

Common exceptions include “Not stationary after timing differential diff”, “training does not converge”, and “variance is 0”. To obtain the exception details, view the stderr file of the individual nodes in Logview.
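
A quick way to spot this situation is to compare the forecasts with the mean of the training values. The following Python sketch is a hypothetical check, not part of PAI: the function name looks_like_mean_fallback and the tolerance are assumptions. It can only flag a likely fallback; the stderr file remains the place to confirm the actual exception.

```python
import math
from typing import Sequence


def looks_like_mean_fallback(training_values: Sequence[float],
                             forecasts: Sequence[float],
                             rel_tol: float = 1e-9) -> bool:
    """Return True if every forecast equals the mean of the training data."""
    if not training_values or not forecasts:
        return False
    mean = sum(training_values) / len(training_values)
    return all(math.isclose(value, mean, rel_tol=rel_tol) for value in forecasts)
```

Here training_values would be the valueColName column of one group, and forecasts the forecast column of the prediction output table for the same group.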
