This topic describes the features of Mars, the differences between Mars and PyODPS, and the scenarios of using Mars and the PyODPS DataFrame API.

Features

Mars is a unified distributed computing framework based on tensors. Mars can use parallel and distributed computing technologies to accelerate data processing for Python data science libraries such as NumPy, pandas and scikit-learn.

Mars provides the following common APIs:
  • Mars tensor

    The Mars tensor API mimics the NumPy API and supports processing large multidimensional arrays, which are also called tensors. The following code shows an example of how to use the Mars tensor API:

    import mars.tensor as mt
    a = mt.random.rand(10000, 50)
    b = mt.random.rand(50, 5000)
    a.dot(b).execute()
  • Mars DataFrame

    The Mars DataFrame API mimics the pandas API and supports processing and analyzing a large amount of data. The following code shows an example of how to use the Mars DataFrame API:

    import mars.dataframe as md
    ratings = md.read_csv('Downloads/ml-20m/ratings.csv')
    movies = md.read_csv('Downloads/ml-20m/movies.csv')
    movie_rating = ratings.groupby('movieId', as_index=False).agg({'rating': 'mean'})
    result = movie_rating.merge(movies[['movieId', 'title']], on='movieId')
    result.sort_values(by='rating', ascending=False).execute()
  • Mars learn

    The Mars learn API mimics the scikit-learn API. The Mars learn API can be integrated with TensorFlow, PyTorch, and XGBoost. The following code shows an example of how to use the Mars learn API:

    import mars.dataframe as md
    from mars.learn.neighbors import NearestNeighbors
    df = md.read_csv('data.csv')
    nn = NearestNeighbors(n_neighbors=10)
    nn.fit(df)
    neighbors = nn.kneighbors(df).fetch()

Differences between Mars and PyODPS

  • API
    • Mars

      The Mars DataFrame API is fully compatible with pandas. The Mars tensor API is compatible with NumPy. The Mars learn API is compatible with scikit-learn.

    • PyODPS

      PyODPS only provides the DataFrame API, which is different from the pandas API.

  • Index
    • Mars
      The Mars DataFrame API supports operations based on indexes, including row indexes and column indexes. The following code shows an example:
      In [1]: import mars.dataframe as md
      In [5]: import mars.tensor as mt
      In [7]: df = md.DataFrame(mt.random.rand(10, 3), index=md.date_range('2020-5-1', periods=10))
      In [9]: df.loc['2020-5'].execute()
      Out[9]:
                         0         1         2
      2020-05-01  0.061912  0.507101  0.372242
      2020-05-02  0.833663  0.818519  0.943887
      2020-05-03  0.579214  0.573056  0.319786
      2020-05-04  0.476143  0.245831  0.434038
      2020-05-05  0.444866  0.465851  0.445263
      2020-05-06  0.654311  0.972639  0.443985
      2020-05-07  0.276574  0.096421  0.264799
      2020-05-08  0.106188  0.921479  0.202131
      2020-05-09  0.281736  0.465473  0.003585
      2020-05-10  0.400000  0.451150  0.956905
    • PyODPS

      PyODPS does not support index-based operations.

  • Data order
    • Mars
      After a Mars DataFrame is created, it maintains the data order. The Mars DataFrame API provides time series methods such as shift, and missing value handling methods such as ffill and bfill.
      In [3]: df = md.DataFrame([[1, None], [None, 1]])
      In [4]: df.execute()
      Out[4]:
           0    1
      0  1.0  NaN
      1  NaN  1.0
      
      In [5]: df.ffill().execute() # Fill the missing value with the value in the previous row.
      Out[5]:
           0    1
      0  1.0  NaN
      1  1.0  1.0
    • PyODPS

      PyODPS processes and stores data by using MaxCompute, which does not maintain the data order. Therefore, PyODPS does not maintain the data order or support time series methods.

  • Execution
    • Mars

      Mars consists of a client and a distributed execution layer. You can call the o.create_mars_cluster method to create a Mars cluster in MaxCompute and submit computing jobs to the Mars cluster. This process greatly saves the costs for scheduling. Mars outperforms PyODPS in processing smaller amounts of data.

    • PyODPS

      PyODPS is a client and does not contain any servers. When you use the PyODPS DataFrame API, the system compiles the operations to MaxCompute SQL statements for execution. Therefore, the operations supported by the PyODPS DataFrame API depend on MaxCompute SQL. Each time you call the execute method, a MaxCompute job is submitted for the cluster to schedule.

Scenarios

Use Mars and the PyODPS DataFrame API in the following scenarios:
  • Mars
    • You often call the to_pandas() method of the PyODPS DataFrame API to convert a PyODPS DataFrame to a pandas DataFrame.
    • You are familiar with the pandas API, but do not want to learn the PyODPS DataFrame API.
    • You want to use indexes.
    • You want to maintain the data order after you create a DataFrame.

      The Mars DataFrame API provides the iloc method to retrieve rows and obtain data in specific rows. For example, df.iloc[10] is used to obtain data in the tenth row. The Mars DataFrame API also provides the df.shift() and df.ffill() methods, both of which can be used only in scenarios where the data order is maintained.

    • You want to run NumPy or scikit-learn in a parallel and distributed manner, or run TensorFlow, PyTorch, and XGBoost in a distributed manner.
    • You want to process data whose volume is less than 1 TB.
  • PyODPS DataFrame
    • You want to use MaxCompute to schedule jobs. The PyODPS DataFrame API compiles operations on DataFrames to MaxCompute SQL statements. If you want to schedule jobs by using MaxCompute, we recommend that you use the PyODPS DataFrame API.
    • You want to schedule jobs in a more stable environment. The PyODPS DataFrame API compiles operations to MaxCompute SQL statements for execution. MaxCompute is stable, which means PyODPS is stable. Mars is new and less stable. Therefore, we recommend that you use the PyODPS DataFrame API if you require high stability.
    • If you want to process data whose volume is larger than 1 TB, we recommend that you use the PyODPS DataFrame API.

References

Technical support

If you encounter any problems when you use Mars, join the DingTalk group whose ID is 11701793 for technical support.