Apache Spark + Intel Analytics Zoo for deep learning

Analytics Zoo is an open-source big data analytics and AI platform from Intel, built on Apache Spark and Intel BigDL, that helps users develop end-to-end deep learning applications on big data. This topic describes how to use Analytics Zoo on Alibaba Cloud E-MapReduce (EMR) for deep learning.

The following environment is required:

  • JDK 8
  • A Spark cluster (Spark 2.x provided by EMR is recommended)
  • Python 2.7 (Python 3.5 and 3.6 are also supported) and pip
  • Analytics Zoo 0.2.0 (the latest release at the time of writing)
  • Install Scala

Analytics Zoo is compiled with Maven. Before building, set the Maven options:

export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"

If you use an ECS instance for compilation, we recommend that you switch to the Aliyun Maven mirror by adding the following to the <mirrors> section of your Maven settings.xml (for example, ~/.m2/settings.xml):

<mirror>
    <id>nexus-aliyun</id>
    <mirrorOf>central</mirrorOf>
    <name>Nexus aliyun</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>

Download the Analytics Zoo release, decompress it, and run the following in the extracted directory:

bash make-dist.sh

After the build completes, everything needed at run time is in the dist directory. Copy the dist directory to the common directory of the EMR software stack:

cp -r dist/ /usr/lib/analytics_zoo
  • Install Python

Analytics Zoo supports both pip and non-pip installation. A pip installation also installs pyspark and bigdl; because pyspark is already installed on the EMR cluster, a pip installation may cause conflicts, so the non-pip installation is used here.

    • For the non-pip installation, run the following command:
bash make-dist.sh

Then go to the pyzoo directory and install Analytics Zoo:

python setup.py install
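
To verify the installation, you can import the packages with the same Python interpreter that the cluster uses. A minimal sanity check, assuming the non-pip install put zoo and bigdl on the PYTHONPATH:

# Both packages should be importable after the non-pip installation.
import zoo
import bigdl
print(zoo.__file__)  # should point into the installed Analytics Zoo package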
  • Set environment variables

After the Scala build is installed and the dist directory has been placed in the common directory of the EMR software stack, set the environment variables. Edit /etc/profile.d/analytics_zoo.sh and add:


export ANALYTICS_ZOO_HOME=/usr/lib/analytics_zoo
export PATH=$ANALYTICS_ZOO_HOME/bin:$PATH

You do not need to set SPARK_HOME on EMR. Log in again, or source the file, for the variables to take effect.

Run the built-in text classification example by submitting it to the cluster with spark-submit (--baseDir points to the directory in HDFS that holds the training data):

spark-submit --master yarn \
    --deploy-mode cluster \
    --driver-memory 8g \
    --executor-memory 20g \
    --class com.intel.analytics.zoo.examples.textclassification.TextClassification \
    /usr/lib/analytics_zoo/lib/analytics-zoo-bigdl_0.6.0-spark_2.1.0-0.2.0-jar-with-dependencies.jar \
    --baseDir /news
  • You can use the SSH proxy to view the Spark job details page.

Meanwhile, you can view the logs to follow the training progress of each epoch; every line reports the iteration, wall-clock time, throughput, and loss:

INFO optim.DistriOptimizer$: [Epoch 2 9600/15107][Iteration 194][Wall Clock 193.266637037s] Trained 128 records in 0.958591653 seconds. Throughput is 133.52922 records/second. Loss is 0.74216986.
INFO optim.DistriOptimizer$: [Epoch 2 9728/15107][Iteration 195][Wall Clock 194.224064816s] Trained 128 records in 0.957427779 seconds. Throughput is 133.69154 records/second. Loss is 0.51025534.
INFO optim.DistriOptimizer$: [Epoch 2 9856/15107][Iteration 196][Wall Clock 195.189488678s] Trained 128 records in 0.965423862 seconds. Throughput is 132.58424 records/second. Loss is 0.553785.
INFO optim.DistriOptimizer$: [Epoch 2 9984/15107][Iteration 197][Wall Clock 196.164318688s] Trained 128 records in 0.97483001 seconds. Throughput is 131.30495 records/second. Loss is 0.5517549.
  • Use PySpark and Analytics Zoo in Jupyter for deep learning training

    • Install Jupyter
pip install jupyter

    • Run the following command to start Jupyter:

jupyter-with-zoo.sh
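
Inside the notebook, the first cell usually creates a SparkContext configured for Analytics Zoo. A minimal sketch, using the init_nncontext helper documented for the Analytics Zoo Python API (older releases may expose a different helper, so check your version):

# Create a SparkContext with the Analytics Zoo/BigDL configuration applied.
from zoo.common.nncontext import init_nncontext

sc = init_nncontext("Jupyter training session")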

    • To get started with Analytics Zoo, we recommend the built-in Wide And Deep model. The typical workflow has four steps, sketched in code after the list:

  1. Import data
  2. Define the model and optimizer
  3. Perform training
  4. View training results
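
The following is a condensed sketch of these four steps, based on the Wide And Deep API in the zoo.models.recommendation module. The feature columns and dimensions are hypothetical, the data preprocessing is elided, and exact signatures should be checked against your installed release:

from zoo.common.nncontext import init_nncontext
from zoo.models.recommendation import WideAndDeep, ColumnFeatureInfo

sc = init_nncontext("wide_n_deep example")

# 1. Import data: build RDDs of Sample objects with your own
#    preprocessing (elided; this depends on the dataset).
train_rdd = ...  # RDD of training Samples
val_rdd = ...    # RDD of validation Samples

# 2. Define the model and optimizer. Column names and dimensions
#    below are placeholders for your own features.
column_info = ColumnFeatureInfo(
    wide_base_cols=["gender", "occupation"],
    wide_base_dims=[3, 21],
    embed_cols=["userId", "itemId"],
    embed_in_dims=[200, 100],
    embed_out_dims=[20, 20],
    continuous_cols=["age"])
model = WideAndDeep(class_num=5,
                    column_info=column_info,
                    model_type="wide_n_deep")
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# 3. Perform training.
model.fit(train_rdd, batch_size=8000, nb_epoch=10,
          validation_data=val_rdd)

# 4. View training results, e.g. by predicting on the validation set.
print(model.predict(val_rdd).take(5))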
