Distributed Computing with Python: Chapter 1, Parallel and Distributed Computing

Contents: Preface; Chapter 1, Introduction to Parallel and Distributed Computing; Chapter 2, Asynchronous Programming; Chapter 3, Parallel Computing in Python; Chapter 4, Distributed Applications with Celery; Chapter 5, Deploying Python on Cloud Platforms; Chapter 6, Using Python on Supercomputer Clusters; Chapter 7, Testing and Debugging Distributed Applications; Chapter 8, Further Learning

The sample code in this book requires Python 3.5 or later.

The first modern digital computers were born in the late 1930s and early 1940s (the exact date is disputed; Konrad Zuse's Z1 dates to 1936), perhaps earlier than most readers of this book, and earlier than the author himself. The past seventy years have seen computers become faster and cheaper at a pace unmatched anywhere else in industry. Today's mobile phones, whether iPhone or Android, are faster than the fastest supercomputers of twenty years ago. Computers have also become smaller and smaller: a supercomputer used to fill a whole room; today one fits in your pocket.

Behind this progress are two important inventions. The first is putting multiple processors on a single motherboard (with each processor containing multiple cores), which allows computers to achieve true concurrency. A single processor core can only execute one thing at a time; as we will see in later chapters, once task switching is fast enough it can create the illusion that multiple tasks run simultaneously, but to actually run multiple tasks at the same time, multiple processors (or cores) are required.

The second invention is high-speed computer networking, which for the first time allowed enormous numbers of computers to communicate with one another. Networked computers may sit in the same location (a LAN, or local area network) or be spread across different locations (a WAN, or wide area network).

Today, multi-processor/multi-core computers are familiar to all of us: our phones, tablets, and laptops are all multi-core. Graphics cards, or graphics processing units (GPUs), are massively parallel machines containing hundreds or even thousands of processing units. Computer networks are everywhere around us, from the Internet to WiFi and 4G.
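As a quick check of how parallel your own machine is, Python can report the number of logical CPUs the operating system exposes (a one-line check, not from the book):

```python
import os

# Number of logical CPUs visible to the interpreter (cores x hardware threads).
print(os.cpu_count())
```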

The rest of this chapter discusses some definitions. We will introduce the concepts of parallel and distributed computing, give some common examples, and discuss the advantages, disadvantages, and programming paradigms of each architecture.

Before introducing these concepts, let us clarify one thing. In the rest of the book, unless otherwise specified, we will use the terms processor and CPU core interchangeably. This is conceptually wrong, of course: a processor has one or more cores, and a computer has one or more processors. Depending on the algorithm and its performance requirements, running on multiple processors or on a single multi-core processor may yield different speeds, assuming the algorithm is parallel at all. However, we will ignore these differences and focus on the concepts themselves.

There are many definitions of parallel computing. This book uses a concise one:

Parallel computing is the simultaneous use of multiple processors to work on a single task.

Typically, this definition further requires that the processors sit on the same motherboard, which distinguishes parallel computing from distributed computing.

The practice of dividing up labor is as old as human civilization, and it applies to the digital world as well: modern computers ship with more and more computing units.

Parallel computing is useful, and indeed necessary, for many reasons. The simplest one is performance: if we can break a lengthy computation into small pieces and hand the pieces to different processors, we get more work done in the same amount of time.

Alternatively, parallel computing can be used to keep an application responsive to its users while it processes a task. Remember that one processor can run only one task at a time. An application with a GUI needs to offload long-running work to a separate thread, potentially running on another processor, so that the main thread can keep the GUI updated and respond to input.

The following figure shows this common architecture. The main thread uses an event loop to process user and system input; tasks that take a long time or would otherwise block the GUI are handed off to background or worker threads:
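Here is a minimal sketch of that pattern, with a plain polling loop standing in for a real GUI event loop (the task and timings are illustrative placeholders, not from the book):

```python
import queue
import threading
import time

results = queue.Queue()

def slow_task():
    # Stand-in for a long computation or blocking I/O.
    time.sleep(2)
    results.put('task finished')

# Hand the slow work to a background worker thread.
threading.Thread(target=slow_task, daemon=True).start()

# Simplified stand-in for the GUI event loop: check for results without blocking.
while True:
    try:
        print(results.get_nowait())
        break
    except queue.Empty:
        print('UI still responsive...')
        time.sleep(0.5)
```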

A practical example of this parallel architecture is an image management application. When we connect a digital camera or phone to a computer, the application performs a series of actions while its user interface remains interactive. For instance, it copies images from the device to the hard disk, creates thumbnails, extracts metadata (such as the date and time of each shot), builds an index, and updates the image library; meanwhile, we can still browse previously transferred pictures, open them, edit them, and so on.

Of course, all of this could happen sequentially on a single processor, the same processor that handles the GUI, but that would freeze the interface and make the whole application feel sluggish. Running these steps in parallel keeps the experience smooth.

Sharp readers may point out that older computers with a single single-core processor could also handle many things at once (via multitasking), and that even today the number of running tasks can far exceed the number of processors in a machine. This is possible because a running task can be removed from the CPU (either voluntarily or forcibly by the operating system, for example in response to an I/O event) so that another task can run in its place. Such interruptions happen constantly: tasks take turns getting on and off the CPU while an application runs. The switching is usually so fast that users perceive their computers as running tasks in parallel, even though at any specific instant only one task is actually running.

Threads are the usual tool for achieving parallelism within an application. However, some systems (Python among them, as we will see in Chapter 3) impose severe restrictions on threads, and developers must switch to subprocesses instead (commonly created by forking). Subprocesses replace, or work alongside, threads and run concurrently with the main process.

The first method is multithreaded programming; the second is multiprocessing. Multiprocessing can be seen as a workaround for the limitations of multithreading.

In many cases, multiprocessing is preferable to multithreading. Interestingly, although both run on a single machine, multithreaded programming is an example of the shared memory architecture, while multiprocessing is an example of the distributed memory architecture (both are covered later in this book).
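A minimal sketch (not from the book) makes the difference concrete: a thread mutates the memory of its parent process, while a child process only mutates its own copy:

```python
import multiprocessing
import threading

data = []

def append_item():
    data.append('added')

if __name__ == '__main__':
    t = threading.Thread(target=append_item)
    t.start()
    t.join()
    print('after thread:', data)    # ['added']: the thread shares our memory

    p = multiprocessing.Process(target=append_item)
    p.start()
    p.join()
    print('after process:', data)   # still ['added']: the child changed its own copy
```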

This book defines distributed computing as follows:

Distributed computing is the simultaneous use of multiple computers to work on a single task.

As with parallel computing, this definition is often narrowed further by requiring that the computers appear to their users as a single machine, hiding the distributed nature of the application. In this book, we prefer the more general definition.

Obviously, distributed computing only works when the computers involved can communicate with one another. In fact, much of the time distributed computing is simply a generalization of the parallel computing discussed in the previous section.

There are many reasons to build distributed systems. A common one is that the task at hand is too large for one computer to complete at all, or to complete within a reasonable time. A practical example is the rendering of 3D animated films at studios such as Pixar or DreamWorks.

Considering the total number of frames in a whole movie (a two-hour film at 30 frames per second contains 2 x 60 x 60 x 30 = 216,000 frames), studios need to distribute the enormous rendering workload across many computers (what they call computer farms).

In other cases, the application itself requires a distributed environment, as with instant messaging and video conferencing applications. For these, raw performance is not the main point: the applications are inherently distributed. The following figure shows a very common web application architecture (another example of a distributed application), in which multiple users connect to the website while the application itself communicates with systems on different hosts in its LAN (such as a database server):

Another, perhaps counterintuitive, example of a distributed system is the CPU-GPU pair. Modern graphics cards are very complicated computers in their own right: highly parallel, and able to process massive compute-intensive workloads, not only to draw images on the screen. A wealth of tools and libraries (such as NVIDIA CUDA, OpenCL, and OpenACC) let developers program GPUs for general-purpose computing tasks. (Translator's note: GPU programming has been used for Bitcoin mining, for example.)

The system formed by a CPU and a GPU really is a distributed system, in which the network is replaced by the PCI bus. Any application that uses both CPU and GPU has to manage the movement of data between the two subsystems, just like a traditional application running across a network.

Migrating existing code to a network of computers (or to a GPU) is not an easy task. When porting code, I have found it a very useful intermediate step to first use multiple processes on a single machine. As we will see in Chapter 3, Python has powerful facilities for this (see the concurrent.futures module).
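As a taste of that intermediate step, here is a minimal sketch using concurrent.futures to spread CPU-bound work over local processes (the simulate function and its inputs are illustrative placeholders, not from the book):

```python
from concurrent.futures import ProcessPoolExecutor

def simulate(n):
    # Stand-in for an expensive, independent unit of work.
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    inputs = [100000, 200000, 300000, 400000]
    # By default the pool starts one worker process per available core.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(simulate, inputs))
    print(results)
```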

Once the code runs in parallel as multiple processes on one machine, we can consider splitting those processes into separate applications on separate computers, although that is beside the point here.

Pay special attention to where data is stored and how it is accessed. In simple cases, a shared file system (for example, NFS on UNIX systems) is sufficient; in others, a database or a message queue is needed. We will look at several examples in Chapter 4. Remember that the real bottleneck is often the data, not the CPU.

Conceptually, parallel computing and distributed computing look similar. After all, both decompose a total workload into small pieces and run the pieces on processors. Some readers may think: in one case the processors live in a single computer, in the other they live in different computers, so isn't the distinction slightly redundant?

The answer is: maybe. As we have seen, some applications are inherently distributed; others simply need more performance. For the latter the distinction may indeed be "a little redundant": the application does not care where its computing power comes from. However, all things considered, the physical location of the hardware resources still matters.

Perhaps the most obvious difference between parallel and distributed computing lies in the underlying memory architecture and access model. In parallel computing, all concurrent tasks can, in principle, access the same memory space. We must say in principle because, as we have seen, parallelism does not have to use threads (and it is threads that share a memory space); it can use processes instead.

The following figure shows a typical shared memory architecture in which four processors access the same memory address space. If the application uses threads, those threads can access the same memory when needed:

For distributed applications, however, the various concurrent tasks cannot access the same memory: some tasks run on one computer and some on another, and they are physically separated.

Because the computers can talk to each other over the network, one can imagine writing a software layer (a middleware) that presents the application with a single, unified logical memory space. Such middleware implements a distributed shared memory architecture; this book does not cover such systems.

In the following figure, we have four CPUs in a distributed memory architecture. Each CPU has its own private memory and cannot see the memory of the others. The four computers (the boxes surrounding each CPU and its memory) communicate over the network, and all data transfer happens across it:

In reality, computers sit somewhere between the two extremes just described. A network of computers communicates like a pure distributed memory architecture, but each computer contains multiple processors running a shared memory architecture. The following figure shows this hybrid architecture:

These architectures have their respective advantages and disadvantages. In shared memory systems, sharing data among the concurrent threads of a single process is extremely fast, far faster than transferring it over a network. In addition, a single memory address space simplifies programming.

On the other hand, special care must be taken so that threads do not step on one another, modifying shared data behind each other's backs.

Distributed memory systems are highly scalable and cheap to assemble: if more performance is needed, more machines can simply be added. Another advantage is that processors access their own private memory without having to worry about race conditions (a race condition arises when several threads or processes read and write shared data and the result depends on the relative timing of their execution). The disadvantages are that developers must hand-code the data transfer strategy and think about where data lives, and that not all algorithms port easily to this architecture.
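To make the notion of a race condition concrete, here is a minimal sketch (not from the book) in which two threads perform an unsynchronized read-modify-write on shared data and lose updates:

```python
import threading
import time

counter = 0

def unsafe_increment(times):
    global counter
    for _ in range(times):
        value = counter      # read
        time.sleep(0)        # yield the CPU, inviting a thread switch
        counter = value + 1  # write: may overwrite the other thread's update

threads = [threading.Thread(target=unsafe_increment, args=(100000,))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Two threads of 100000 increments each should yield 200000,
# but lost updates typically leave the counter far short of that.
print(counter)
```

Guarding the read-modify-write with a threading.Lock would eliminate the lost updates, at the cost of serializing the increments.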

The last important concept of this chapter is Amdahl's law. In short, Amdahl's law says that no matter how much we parallelize or distribute our computation, or how much computing power we add, the code can never run faster than its sequential (non-parallel) component running on a single processor.

More formally, Amdahl's law can be stated as follows. Consider a partially parallel algorithm, and call P its parallel fraction and S its sequential (that is, non-parallel) fraction, so that P + S = 100%. Let T(n) be the runtime on n processors. Then the following relationship holds:

T(n) >= S*T(1) + P*T(1)/n

In plain words: the time the algorithm takes on n processors is greater than or equal to the time spent running the sequential component on a single processor, S*T(1), plus the time the parallel component would take on a single processor, P*T(1), divided by n.

As the number of processors n grows, the second term on the right-hand side becomes smaller and smaller, until it is negligible compared with the first. In that limit, the formula approaches:

T(∞) ≈ S*T(1)

In plain words: the time needed to run the algorithm on an infinitely large number of processors is approximately equal to the runtime of the sequential component on a single processor, S*T(1).

Now, think about what Amdahl's law implies. The assumption of full parallelization is too simplistic: in general, we cannot completely parallelize our algorithms.

That is to say, in most cases we cannot make S = 0, for many reasons: we may have to copy data and/or code to where the various processors can access it; we may have to split the data into chunks and move those chunks over the network; and we may need to collect and aggregate the results of all the concurrent tasks.

Whatever the reason, if an algorithm cannot be fully parallelized, its final runtime is dictated by the performance of the sequential component, and the further we push parallelization, the smaller the additional speedup becomes.

As an aside, fully parallel algorithms are usually called embarrassingly parallel or, in more politically correct language, pleasantly parallel. They offer the best scaling behavior (speedup linear in the number of processors). Of course, there is nothing to be embarrassed about! Unfortunately, embarrassingly parallel algorithms are rather rare.

Let's put some numbers into Amdahl's law. Assume the algorithm takes 100 seconds on a single processor, and that the parallel fraction is 99%, which is already very high. Adding processors should increase the speed. Consider the following calculations:

T(10) = 0.01*100 + 0.99*100/10 = 10.9 seconds (a 9.2x speedup)
T(100) = 0.01*100 + 0.99*100/100 = 1.99 seconds (a 50x speedup)
T(1000) = 0.01*100 + 0.99*100/1000 = 1.099 seconds (a 91x speedup)

As we can see, increasing n yields disappointing returns: with 10 processors the speedup is 9.2 times; with 100 processors it is only 50 times; and with 1,000 processors it is a mere 91 times.
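The quoted figures follow directly from the formula; here is a tiny helper (not from the book) that reproduces them:

```python
# Amdahl speedup T(1)/T(n) = 1 / (S + P/n), with S = 0.01 and P = 0.99.
def amdahl_speedup(n, parallel_fraction=0.99):
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n)

for n in (10, 100, 1000):
    print(n, round(amdahl_speedup(n), 1))  # 9.2, 50.3, 91.0
```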

The following figure shows the relationship between speedup and the number of processors. No matter how many processors we use, the speedup cannot exceed 100 times; equivalently, the running time can never fall below 1 second, the runtime of the sequential component on a single processor.

Amdahl's law tells us two things: how much of a speedup we can reasonably expect in the best case, and when to stop adding hardware because of diminishing returns.

Another interesting observation is that Amdahl's law applies equally to distributed systems and to hybrid parallel-distributed systems; in those cases, n equals the total number of processors across all the computers.

As the systems we have access to become ever more powerful, we can shorten the runtime of our distributed algorithms, provided we can exploit the extra performance.

As application runtimes shrink, we tend to tackle bigger and more complex problems. This kind of evolution, where the problem size (and hence the computational requirement) grows to match acceptable performance, is known as Gustafson's law.

Most of the computers we can buy today are multi-processor, multi-core machines, and the distributed applications we are going to write are meant to run on them. This lets us use both distributed and parallel computing at once. This hybrid distributed-parallel paradigm is the de facto standard for today's networked applications: reality is usually mixed.

This chapter introduced the basic concepts. We looked at parallel and distributed computing, examined two example architectures, and discussed their advantages and disadvantages. We analyzed how they access memory and noted that reality is usually a hybrid of the two. Finally, we introduced Amdahl's law, its implications for scaling performance, and the economics of hardware investment. The next chapter will turn these concepts into practice and write some Python code!

