Introduction to Convolutional Neural Networks for Computer Vision

In this article, we introduce applications of convolutional neural networks (CNNs) in the field of computer vision.

The classic approach to fine-grained image classification is to first define different locations on the image, for example, the head, foot, or wings of a bird. We then extract features from these locations and finally combine these features to complete the classification. This type of method achieves very high accuracy, but it requires a massive dataset and manual tagging of location information. One major trend in fine-grained classification is training without additional supervision information, using only image-level labels instead. A representative example is the bilinear convolutional neural network (CNN) method.

Bilinear CNN computes the outer-product of convolution descriptors to find the mutual relationships between different dimensions. Because the dimensions of different descriptors correspond to different channels for convolution features, and different channels extract different semantic features, using bilinear operation allows us to capture the relationships between different semantic elements on the input image.
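The outer-product step can be sketched in a few lines of NumPy. This is a minimal illustration, not a full bilinear CNN: the function name bilinear_pool and the array shapes are assumptions made for the example, the two feature maps stand in for the outputs of two convolutional streams, and the averaging, signed square root, and L2 normalization are common post-processing choices rather than details given in this article.

```python
import numpy as np

def bilinear_pool(feat_a, feat_b):
    """Bilinear pooling of two convolutional feature maps (illustrative sketch).

    feat_a: array of shape (H, W, Ca); feat_b: array of shape (H, W, Cb).
    Returns an L2-normalized vector of length Ca * Cb.
    """
    h, w, ca = feat_a.shape
    cb = feat_b.shape[2]
    # One descriptor per spatial location.
    a = feat_a.reshape(h * w, ca)
    b = feat_b.reshape(h * w, cb)
    # Averaging the outer products over all locations equals a^T b / (H * W).
    pooled = a.T @ b / (h * w)
    vec = pooled.reshape(-1)
    # Assumed post-processing: signed square root, then L2 normalization.
    vec = np.sign(vec) * np.sqrt(np.abs(vec))
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Usage with random stand-in features (no real CNN involved).
rng = np.random.default_rng(0)
features_a = rng.standard_normal((4, 4, 3))
features_b = rng.standard_normal((4, 4, 5))
descriptor = bilinear_pool(features_a, features_b)
print(descriptor.shape)
```

Each entry of the pooled matrix pairs one channel from the first stream with one channel from the second, which is how the interaction between different semantic features ends up in the final descriptor.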

Image captioning is the process of generating a one- or two-sentence description of an image. The basic idea behind an image captioning network is borrowed from machine translation in the field of natural language processing. After replacing the source-language encoding network in a machine translator with an image CNN encoding network and extracting the features of the image, we can use the decoder network for the target language to create a text description.
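The encoder-decoder idea can be sketched with a toy greedy decoder. Everything here is hypothetical and chosen only for illustration: the function greedy_caption, the parameter names, the tiny vocabulary, and the random weights. The image feature vector stands in for the output of a CNN encoder, and a plain tanh recurrent cell stands in for the decoder, so the generated words are meaningless without training.

```python
import numpy as np

def greedy_caption(image_feature, params, vocab, max_len=10):
    """Greedy decoding from an image feature vector (toy, untrained sketch)."""
    w_init, w_emb, w_h, w_out = params
    # The image feature plays the role of the encoded source sentence:
    # it initializes the decoder's hidden state.
    h = np.tanh(w_init @ image_feature)
    token = vocab.index("<start>")
    words = []
    for _ in range(max_len):
        # Simple recurrent update from the previous token and hidden state.
        h = np.tanh(w_emb[token] + w_h @ h)
        # Pick the highest-scoring next word.
        token = int(np.argmax(w_out @ h))
        if vocab[token] == "<end>":
            break
        words.append(vocab[token])
    return words

# Usage with random stand-in weights.
vocab = ["<start>", "<end>", "a", "bird", "on", "branch"]
rng = np.random.default_rng(0)
feature_dim, hidden_dim = 8, 16
params = (
    rng.standard_normal((hidden_dim, feature_dim)),   # w_init
    rng.standard_normal((len(vocab), hidden_dim)),    # w_emb
    rng.standard_normal((hidden_dim, hidden_dim)),    # w_h
    rng.standard_normal((len(vocab), hidden_dim)),    # w_out
)
caption = greedy_caption(rng.standard_normal(feature_dim), params, vocab)
print(" ".join(caption))
```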

Given an image and a question related to that image, visual question answering aims to answer that question from a selection of candidate answers. The concept is to use a CNN to extract features from the image and an RNN to extract features from the text question, then combine the visual and textual features, and finally perform classification using a fully-connected layer. The key to this task is figuring out how to connect these two types of features. Methods that combine them directly either concatenate them into a single vector, or add or multiply the visual and textual vectors element by element.
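The three direct fusion strategies, followed by a fully-connected classification step, can be sketched as follows. The function names fuse and classify, the mode strings, and the example vectors are assumptions made for illustration; in a real system the two input vectors would come from a trained CNN and RNN, and the classifier weights would be learned.

```python
import numpy as np

def fuse(visual, textual, mode="concat"):
    """Combine a visual and a textual feature vector."""
    visual = np.asarray(visual, dtype=float)
    textual = np.asarray(textual, dtype=float)
    if mode == "concat":
        # Concatenation works for vectors of different lengths.
        return np.concatenate([visual, textual])
    if visual.shape != textual.shape:
        raise ValueError("sum and product fusion need vectors of equal shape")
    if mode == "sum":
        return visual + textual
    if mode == "product":
        return visual * textual
    raise ValueError(f"unknown fusion mode: {mode}")

def classify(fused, weights, bias):
    """Fully-connected layer plus softmax over the candidate answers."""
    logits = weights @ fused + bias
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs = exp / exp.sum()
    return int(np.argmax(probs)), probs

# Usage with made-up feature vectors and random classifier weights.
visual = [0.2, 0.9, 0.1]
textual = [0.5, 0.3, 0.8]
fused = fuse(visual, textual, mode="product")
rng = np.random.default_rng(0)
answer_index, probs = classify(fused, rng.standard_normal((4, 3)), np.zeros(4))
print(answer_index, probs.round(3))
```

Concatenation keeps both vectors intact and lets the classifier learn the interaction, while element-wise sum or product forces the two feature spaces to share the same dimensionality.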

For more information, please go to Deep Dive into Computer Vision with Neural Networks – Part 1.

Related Blog Posts

Deep Dive into Computer Vision with Neural Networks – Part 2

In this part, we will introduce Texture Synthesis and Style Transform, Face Verification/Recognition, Image Search and Retrieval, Object Tracking, Generative Models, and Video Classification. You will also see how convolutional neural networks are used for Video Classification.

Geoffrey Hinton's Capsule Networks: A Novel Approach to Deep Learning

The Capsule Network proposed by Dr. Geoffrey Hinton brings a new perspective to Deep Learning as compared to Convolutional Neural Networks.

In 2017, together with his two colleagues at Google Brain, Sara Sabour and Nicholas Frosst, Hinton published the paper Dynamic Routing Between Capsules. The team proposed a new neural network model called the Capsule Network, which achieves better results on specific tasks than the traditional convolutional neural network (CNN). Unlike a CNN, the Capsule Network helps machines understand images by giving them a new perspective, similar to the three-dimensional perspective that humans have.

Related Documentation

Implement image classification by TensorFlow

This guide shows how to create an image recognition model within 30 minutes using the deep learning framework TensorFlow on Alibaba Cloud Machine Learning Platform for AI. You will also get the code that creates the training model using a convolutional neural network (CNN).

Run TensorFlow-based AlexNet in Alibaba Cloud Container Service

AlexNet is a CNN developed in 2012 by Alex Krizhevsky. It uses five convolutional layers and three fully-connected layers with ReLU activations, and it won the ImageNet competition (ILSVRC). AlexNet proved the effectiveness of CNNs in classification, with a top-5 error rate of 15.3% against the roughly 25% error rate of previous image recognition tools. The emergence of this network marks a milestone for deep learning applications in the computer vision field.

AlexNet is also a common performance benchmark for deep learning frameworks. TensorFlow provides the alexnet_benchmark.py tool to test GPU and CPU performance. This document uses AlexNet as an example to illustrate how to run a GPU application in Alibaba Cloud Container Service easily and quickly.

Related Products

Container Service

Container Service is a high-performance and scalable container application management service that enables you to use Docker and Kubernetes to manage the lifecycle of containerized applications. It is able to run TensorFlow-based AlexNet for deep learning tasks in the computer vision field.

Realtime Compute

Realtime Compute offers a one-stop, high-performance platform that enables real-time big data processing based on Apache Flink. It is widely used in diverse scenarios, such as streaming data processing, offline data processing, and data lake computing. With Realtime Compute, you can process and analyze big data in real time for business insights and decision making.


Alibaba Clouder
