15 Parallel Programming
Computers with multicore processors are now commonplace. To take advantage of the power of multicore computers, Maple provides tools for parallel programming. This chapter provides a basic introduction to parallel programming in Maple.
15.1 In This Chapter
• The two forms of parallel programming available in Maple: shared memory and multiple process.
• An introduction to shared memory programming using the Task Programming Model.
• An introduction to multiple process programming using the Grid Programming Model.
15.2 Introduction
Maple provides tools for two different types of parallel programming. The Task Programming Model enables parallelism by executing multiple tasks within a single process. The second type of parallelism comes from the Grid package, which enables parallelism by starting multiple processes.
Each type of parallelism has advantages and disadvantages. The Task Programming Model is a high-level programming tool that simplifies many aspects of parallel programming. In the Task Programming Model, tasks share memory, so they can work closely together and share data with low overhead. However, because they share memory, code running in one task must be careful not to interfere with code running in other tasks. As the Task Programming Model is very new to Maple, much of the Maple library has not been verified to work correctly with tasks. This means that much of Maple's core functionality cannot be used in task-based code.
Because Grid uses multiple-process parallelism, it does not suffer from this problem: each process has its own independent memory, so you can use all of Maple's library routines in multiple-process execution. Further, with the addition of the Grid Computing Toolbox, you can spread a multiple-process computation across multiple computers, which gives you access to far more computing power. However, because the processes are independent, the cost of communication between processes can be quite high. As well, balancing the computation evenly across all the available processors, especially those on remote computers, can be difficult.
15.3 Introduction to Parallel Programming with Tasks

Parallel Execution
Consider two procedures, f and g. f contains a sequence of statements, f1, f2, ..., fn, and g contains the sequence g1, g2, ..., gm. If these two procedures are run in serial, they can be run in two possible orders: f followed by g, or g followed by f. In other words, the order in which the statements are run is either f1, f2, ..., fn, g1, g2, ..., gm or g1, g2, ..., gm, f1, f2, ..., fn. The programmer defines the order in which the statements are run. For example, if fi must run before gj for the code to execute correctly, the programmer can call f before g to make sure that the statements run in the correct order.
> f := proc()
      local i;
      for i from 1 to 5 do
          print( procname[i] );
      end do;
  end proc:

> g := eval(f):
> f();
> g();
If f and g are called in parallel (that is, at the same time), different sequences can be generated. Although f1 will always run before f2, the order in which f1 runs relative to g1 cannot be controlled. Therefore, the order could be f1, g1, f2, g2, ..., or it could be f1, f2, g1, g2, ..., or any other valid interleaving. Also, the statements can be ordered differently each time these procedures are run; every possible order will eventually happen, given enough iterations.
The following example uses functions from the Task Programming Model. These functions are described in the Task Programming Model section of this chapter. For now, consider these functions as a way to start tasks, which are functions that can run in parallel.
> Threads:-Task:-Start( null, Task=[f], Task=[g] );
Running the statement above multiple times generates different sequences.
If the code requires that f1 execute before gm to run correctly, running these procedures in parallel may produce errors, even though f1 is the first statement of f and gm is the last statement of g: in some orders, all of g runs before f starts. Every possible order will eventually occur; therefore, to write correct parallel code, you must guarantee that every possible order of execution leads to a correct result.
> f := proc( n )
      local i, tmp;
      global shared;
      for i from 1 to n do
          tmp := shared;
          shared := tmp+1;
      end do;
      NULL;
  end proc:
> g := eval(f):   # rebind g to the new f

> shared := 0:    # initialize the shared value before starting the tasks
> Threads:-Task:-Start( null, Task=[f,1000], Task=[g,1000] ):
> shared;
In the example above, f and g increment the global variable shared 1000 times. You might expect the final value of shared to be 2000; however, this is not the case. The for loop contains two statements:
– s1: tmp := shared;
– s2: shared := tmp+1;
When f and g are running in parallel, these statements could run in the following order:
– f runs s1: tmp := shared; (f reads the current value of shared)
– g runs s1: tmp := shared; (g reads the same value)
– g runs s2: shared := tmp+1; (g writes the incremented value)
– f runs s2: shared := tmp+1; (f overwrites g's write with the same value)
and the increment performed by g is lost.
In some orders, the total will be 2000 and, in fact, every value from 1000 to 2000 could occur. Therefore, even for this simple example, there are 1001 different possible outcomes and even more possible orders.
In sequential programs, the order in which statements are run is defined by the programmer. In parallel code, the order in which statements run is not defined exclusively by the code. In other words, a programmer who writes parallel code must consider all of the different possible orders to determine if the code is correct.
Functions that work correctly when run in parallel with other code are called safe. Functions that do not work correctly when run in parallel with other code are called unsafe.
How the Ordering Is Determined
The operating system can interrupt and pause a running task for many reasons. If the task accesses a memory location, the processor may need to load the value from main memory into a register; this can take hundreds or even thousands of cycles. If the task tries to access a system resource (for example, by reading or writing data, or allocating memory), the operating system may need to pause the task while it waits for the resource to become available, which can take millions of cycles. Also, the operating system may move a task off a core to allow another process to run.
Therefore, many factors may cause a task to pause for a brief or long time period. In some cases, the task may pause as a result of the action that is being performed; however, other factors are beyond the task's control.
Issues Caused by Multiple Orders
These multiple potential orders may cause other issues when developing parallel code. For example, parallel code can be difficult to test because orders that cause issues may not occur during testing. This is particularly true if you are developing code on a single-core computer. Many orders that are possible on multiple-core computers may never occur on a single-core computer.
Controlling Parallel Execution
The previous section provided a simple example of parallel code with 1001 possible outcomes. Each outcome can result from multiple statement orders, and there are numerous potential statement orders for even simple code. The only way to write correct parallel programs is to account for all of these orders.
Execution Orders That Do Not Matter
Many of the possible orders will not cause problems. Consider the following example, which is similar to the previous one.
> f := proc( n )
      local i, tmp;
      global shared;
      shared[procname] := 0;
      for i from 1 to n do
          tmp := shared[procname];
          shared[procname] := tmp+1;
      end do;
      NULL;
  end proc:
> g := eval(f):       # rebind g so that procname distinguishes the two tasks
> shared := 'shared': # clear the previous value so shared can act as a table

> Threads:-Task:-Start( null, Task=[f,1000], Task=[g,1000] ):
> shared[f] + shared[g];
In this case, the result is 2000 each time you run the code. Although this example has just as many statement orders as the previous one, the orders do not cause conflicts: the two tasks do not write to the same location in memory. In general, the statements that cause issues are those that access shared data. In the first example, the two tasks share the value stored in the variable shared, and conflicts occur as both tasks modify it. In this example, the two tasks write to different table entries, so no conflicts occur. The regions of code that contain statements that can cause conflicts are called critical sections.
With this concept in mind, you only need to consider the orderings that involve critical sections. This also implies that the fewer critical sections your parallel code contains, the easier it is to make the code work properly.
Shared Data in Maple
Since sharing data may cause issues in parallel programming, it is useful to consider how data can be shared in Maple. A piece of data is shared if it is accessible from multiple tasks that are running in parallel. Any data that can be reached from a shared value is also shared. In particular, if a shared variable is assigned a module, all of the data in the module is shared, including the module locals. Similarly, the remember tables of shared procedures are shared.
The most common way data is shared is using global variables. A global variable can be accessed from anywhere in the code, so whenever a procedure uses a global variable, it could conflict with another procedure running in parallel. In a similar way, lexically scoped variables that are used in tasks that run in parallel are also shared. Another way to share data is to pass the same value into multiple tasks as an argument.
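For example, a lexically scoped local is shared in exactly the same way as a global. The following sketch (illustrative names, not library code) shows two tasks capturing the same enclosing local and racing on it, just as in the earlier global-variable example:

> MakeCounter := proc()
      local counter, inc;
      counter := 0;
      # both tasks capture the lexically scoped variable counter,
      # so they share it and can conflict over it
      inc := proc()
          local tmp;
          tmp := counter;
          counter := tmp + 1;
      end proc;
      Threads:-Task:-Start( null, Task=[inc], Task=[inc] );
      counter;    # may be 2, or 1 if one increment is lost
  end proc: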
For more information, see Variables in Procedures.
Sharing Data Safely
It is often necessary to share data to implement a parallel algorithm, so you must consider how to share data safely.
The simplest method for sharing data safely is to treat the shared data as read-only. If the tasks do not modify the shared data, no conflicts will occur. However, if even one task modifies the shared data, all of the tasks (even the ones that simply read data) must access shared data carefully.
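As a minimal sketch of read-only sharing (the names here are illustrative), both tasks below read the same list but never write to it, so no order of execution can cause a conflict:

> data := [ seq( i, i = 1 .. 100 ) ]:
> sumRange := proc( lst, lo, hi )
      # reads the shared list but never modifies it
      add( lst[i], i = lo .. hi );
  end proc:
> start := proc( lst )
      Threads:-Task:-Continue( `+`,
          Task=[sumRange, lst, 1, 50],
          Task=[sumRange, lst, 51, 100] );
  end proc:
> Threads:-Task:-Start( start, data );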
Another method is to share a data structure, such as an Array, but limit which elements of the structure are accessible by certain tasks. For example, two tasks can share an Array if each task only accesses one half of the Array. In such an example, the tasks share a structure, but not the data within the structure.
In the following example, an Array is shared between two tasks.
> task := proc( A, lo, hi )
      local i, x;
      for i from lo to hi do
          x := A[i];
          A[i] := x^4+4*x^3+6*x^2+4*x+1;
      end do;
  end proc:

> N := 1000:   # the Array size is assumed here; any even value works
> N2 := N/2:   # each task is limited to one half of the Array
> A := Array( 1..N, x->(Float(x)/N) ):
> Threads:-Task:-Start( null, Task=[task,A,1,N2], Task=[task,A,N2+1,N] ):
Protecting Critical Sections
If you can't avoid creating critical sections using techniques like the ones described in the previous section, you will need to protect them by creating mutual exclusion sections of your code. These sections guarantee that at most one task can execute code in the protected sections at a time.
To create a mutual exclusion zone, use a mutex. A mutex can be in one of two states: locked or unlocked. If a mutex is unlocked, any task can lock it. If a mutex is locked and another task attempts to lock it, that task waits until the mutex is unlocked and then attempts to lock it again. This means that at most one task can hold the lock on a mutex at a time.
If all of the tasks access the shared data only while holding the lock, there will never be more than one task accessing the shared data at a time. Therefore, by using a mutex, multiple tasks can run in parallel and share data without conflicting with one another.
The following example takes the unsafe code from the first example and adds a mutex to make it safe.
> task := proc( m, n )
      local i, tmp;
      global common_variable;
      for i to n do
          Threads:-Mutex:-Lock( m );
          tmp := common_variable;
          common_variable := tmp + 1;
          Threads:-Mutex:-Unlock( m );
      end do;
      NULL;
  end proc:

> m := Threads:-Mutex:-Create():
> common_variable := 0:   # initialize the shared value before starting the tasks
> Threads:-Task:-Start( null, Task=[task,m,1000], Task=[task,m,1000] ):
> common_variable;
Note: Excessive use of mutexes may cause performance issues. First, locking and unlocking a mutex adds processing time to your code. More significantly, if a task tries to lock a mutex that is already locked, it must wait, and this waiting reduces the parallelism of the algorithm.
The mutex example shown above falls into this category. The body of the task runs while holding a lock, which means that at most one task can run that code at a time. To fix this, limit access to the global variable. For this example, a local variable can be used, and the local results can be combined once at the end of the execution.
> task := proc( m, n )
      local i, local_sum;
      global common_variable;
      local_sum := 0;
      for i to n do
          local_sum := local_sum + 1;
      end do;
      Threads:-Mutex:-Lock( m );
      common_variable := common_variable + local_sum;
      Threads:-Mutex:-Unlock( m );
  end proc:

> m := Threads:-Mutex:-Create():
> common_variable := 0:
> Threads:-Task:-Start( null, Task=[task,m,1000], Task=[task,m,1000] ):
> common_variable;
15.4 Task Programming Model
The Task Programming Model is a high-level parallel programming interface. It is designed to make parallel programming easier.
Tasks
Consider the following Maple procedure.
> f := proc() fc( f1(args1), f2(args2), ..., fn(argsn) ); end proc;
To evaluate f, the values f1(args1), f2(args2), ..., fn(argsn) are evaluated and their return values are computed. These return values are then passed to fc as arguments. When fc completes, its return value is passed as the return value of f. The Task Programming Model takes this pattern, but creates tasks for the function calls f1, f2, ..., fn and fc. A task is a piece of executable code. In Maple, a task is a procedure combined with a set of arguments to that procedure. Once a task is created, the Maple kernel can run it. By allowing the kernel to schedule tasks, Maple can automatically distribute them to the available processors of your computer.
In the example above, a function call, fi(argsi), can be replaced by a task, ti, in a straightforward way: the procedure is fi and the arguments are argsi. However, the task, tc, corresponding to the function call fc is more complex. The function call fc will not run until all the fi calls have completed, thus supplying their return values to fc as arguments. Similarly, the task tc must wait for values from the tasks ti before it can run. The procedure of tc is fc, and its arguments are the values returned by the other tasks. These other tasks are called the child tasks of tc. Similarly, tc is called the parent of the ti tasks. The task tc is called the continuation task.
In the example code above, the value returned by fc is the return value of f. Similarly, when a task t creates a continuation task tc, the value returned by tc is given to the parent of t. When this happens, any value returned by t itself is discarded: t is effectively replaced by tc. This occurs because, in the Task Programming Model, tasks always run until they are complete. A task t does not need to stop in the middle of its execution to wait for child tasks to complete; instead, it can finish and the continuation task will handle the return values of the child tasks. If t does not create any tasks, its return value is given to its parent.
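Schematically (a sketch only; fc, f1, f2, args1, and args2 are the placeholders from the pattern above, not defined routines), f can be rewritten in the Task Programming Model as:

> f := proc()
      # replace the running task with two child tasks and a
      # continuation task; fc runs only after f1 and f2 complete,
      # receiving their return values as its arguments
      Threads:-Task:-Continue( fc, Task=[f1, args1], Task=[f2, args2] );
  end proc:

Here f returns immediately after creating the tasks; fc later receives the children's return values, and its result becomes f's result.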
The Task Tree
In the Task Programming Model, any task can replace itself with child tasks and a continuation task. Therefore, these newly created tasks can also create additional tasks. This process creates a tree of tasks. Leaf nodes are the tasks that do not have child tasks and internal nodes are the tasks that have child tasks. A leaf task does not have child tasks, so it can run at any time. As leaf tasks complete, their parent tasks may become leaf tasks, allowing them to run.
Starting Tasks
To create the first task, call the Threads:-Task:-Start function.
> Start( task, arg1, arg2, ..., argn ):
task is the procedure to run as the first task and arg1, arg2, ..., argn are the arguments to task.
> task := proc( )
      `+`( _passed );
  end proc:

> Threads:-Task:-Start( task, 1, x^3, q/3 );
This procedure creates one task. After the task runs and returns a value (or a continuation function returns a value), that value is returned by the Start function. Starting a task by itself is not as useful as starting multiple tasks that run in parallel.
To start child tasks and a continuation function, call the Threads:-Task:-Continue function.
> Continue( cont, arg1, arg2, ..., argn ):
cont is the procedure to use for the continuation task. arg1, arg2, ..., argn are either arguments to cont or child task specifiers of the form
Task=[task, targ1, targ2, ..., targn ]
Tasks=[task, [targs1], [targs2], ..., [targsm] ]
The first task specifier creates one task with the procedure task and arguments targ1, targ2, ..., targn. The second task specifier creates m tasks, each using task as the procedure and the sequence of expressions targsi as arguments to task i.
As Continue replaces a running task with child tasks and a continuation task, it can only be called from within a running task. In addition, it can only be called once per task. However, Continue can be called from any running task, including a continuation task.
When a child task completes, its return value is passed to its parent as a parameter. The position of the parameter corresponds to the position at which the task was specified in the call to the Continue function. The following example illustrates how this passing works.
> task := proc( i )
      cat( t, i );
  end proc:

> start := proc( )
      Threads:-Task:-Continue( print, 1, Task=[task,2], 3, Tasks=[task,[4],[5]], 6 );
  end proc:

> Threads:-Task:-Start( start );
The simple example shown earlier can be modified to use the Task Programming Model.
> task := proc( n )
      local sum, i;
      sum := 0;
      for i from 1 to n do
          sum := sum+1;
      end do;
      sum;
  end proc:

> start := proc( )
      Threads:-Task:-Continue( `+`, Task=[task,1000], Task=[task,1000] );
  end proc:

> Threads:-Task:-Start( start );
By using the value-passing behavior of the Task Programming Model, this problem can be solved without using global variables or a mutex. The return values of the two child tasks are passed to the continuation function, `+`. It adds them together and returns the computed value. This value is then returned by the Start function.
Task Management
Now that you have the functions to create tasks, you must determine how many tasks to start. To understand this, a few parallel programming concepts must be considered. Parallel algorithms are said to scale if they get faster when they run on more cores. A good parallel algorithm will scale linearly with the number of available processors.
To achieve linear scaling, a parallel algorithm must consider load balancing. Load balancing refers to techniques used to distribute the work evenly over all the cores of your computer. If you want to use n cores, you would want to divide the work into n even parts. However, you may not be able to determine how to divide the work evenly. Dividing the input into n evenly sized parts may not divide the work evenly. It is possible that one task will require more time to evaluate than the others. Once the other tasks complete, their cores will be idle while the remaining task runs.
One way to solve this problem is to create a large number of small tasks. This way, each task is relatively small. However, even if one task requires more time to run, the other cores can run many other tasks while one core is running the long task. Another advantage is that you can create the tasks without considering the number of cores. Thus your code does not need to know about the underlying hardware.
One limitation is that creating tasks requires resources. If the tasks are too small, the resources required to create the tasks may dominate the running time. Consider the following example, which is run on a computer with four cores.
> add_range := proc( lo, hi )
      local i;
      add( i, i = lo..hi );
  end proc:
The add_range function adds the numbers from lo to hi.
> N := 10^8:   # the problem size; this value is assumed here
> start := time[real]():
> add_range( 1, N );
> time[real]()-start;
> parallel_add_range := proc( lo, hi, n )
      local i, step, d;
      d := hi-lo+1;
      step := floor( d/n );
      Threads:-Task:-Continue( `+`, Tasks=[ add_range,
          seq( [i*step+lo, (i+1)*step], i=0..n-2 ),
          [ (n-1)*step+lo, hi ] ] );
  end proc:
The parallel_add_range function also adds the numbers from lo to hi, but it distributes the work over n tasks.
> start := time[real]():
> Threads:-Task:-Start( parallel_add_range, 1, N, 2 );
> time[real]()-start;
> start := time[real]():
> Threads:-Task:-Start( parallel_add_range, 1, N, 4 );
> time[real]()-start;
> start := time[real]():
> Threads:-Task:-Start( parallel_add_range, 1, N, 100 );
> time[real]()-start;
Increasing the number of tasks from 2 to 4 increases the performance, as you would expect on a 4-core computer. However, further increasing the number of tasks from 4 to 100 also increases the performance. By using a larger number of tasks, Maple is better able to schedule the work onto the available cores.
> start := time[real]():
> Threads:-Task:-Start( parallel_add_range, 1, N, 10000 );
> time[real]()-start;
However, running 10000 tasks introduces a slowdown. The overhead of managing the tasks begins to become significant. The Task Programming Model is a relatively new feature in Maple, so this overhead will be reduced in future versions of Maple.
Coarse-grained Versus Fine-grained Parallelism
Consider the following example.
> work := proc( n )  # do O(n) "work"
      local i;
      for i from 1 to n do
      end do;
      n;
  end proc:
> N := 100000000:              # the total amount of work
> M := 100:
> n := N/M:
> A := [ seq( M, i=1..n ) ]:   # evenly distributed work
> t := time[real]():
> add( work( A[i] ), i=1..nops(A) );
> time[real]()-t;
In this example, the time taken by the work function depends on the input value n. This process can be parallelized at a high level by subdividing over the input Array until a base case is reached.
> task := proc( A, low, high )
      local i, count, mid;
      mid := high-low;
      if ( mid > 10000 ) then
          mid := floor(mid/2) + low;
          Threads:-Task:-Continue( `+`,
              Task=[ task, A, low, mid ],
              Task=[ task, A, mid+1, high ] );
      else
          count := 0;
          for i from low to high do
              count := count + work(A[i]);
          end do;
          count;
      end if;
  end proc:
> t := time[real]():
> Threads:-Task:-Start( task, A, 1, nops(A) );
> time[real]()-t;
You can see that this provides a reasonable speedup. High-level parallelism, as shown in the example above, is called coarse-grained parallelism. Generally, coarse-grained parallelism refers to dividing a problem into subproblems at a high level, and then running the subproblems in parallel with each other.
However, if a different input is specified, the weakness of coarse-grained parallelism can be seen. For example, if work is distributed unevenly, the speedup is not as significant.
> N2 := N/2:
> n := N2/M:
> A2 := [ N2, seq( M, i=1..n ) ]:
> t := time[real]():
> Threads:-Task:-Start( task, A2, 1, nops(A2) );
> time[real]()-t;
This happens because subdividing over the range does not take into account the actual amount of work necessary to compute the subranges. In the example above, the first subrange contains over half the work. Therefore, it may be difficult to divide the work into equal subsections by only looking at the input.
Another approach to parallelizing a problem like this is to parallelize the work function.
> workTask := proc( n )
      local i, m;
      if ( n > 10000 ) then
          m := floor( n/2 );
          Threads:-Task:-Continue( `+`,
              Task=[ workTask, m ],
              Task=[ workTask, n-m ] );
      else
          for i from 1 to n do
          end do;
          n;
      end if;
  end proc:
> work := proc( n )  # do O(n) "work"
      local i;
      if ( n > 10000 ) then
          Threads:-Task:-Start( workTask, n );
      else
          for i from 1 to n do
          end do;
          n;
      end if;
  end proc:
> t := time[real]():
> add( work( A2[i] ), i=1..nops(A2) );
> time[real]()-t;
Low-level parallelism, as shown in the example above, is called fine-grained parallelism. Simply using the parallel work function gives a speedup in this case. However, fine-grained parallelism also has flaws. In particular, although the work function is faster for large inputs, it is not faster than the sequential version for small inputs. Thus, when you have an even distribution of work, there is no advantage to using this approach.
> t := time[real]():
> add( work( A[i] ), i=1..nops(A) );
> time[real]()-t;
The best solution is to use both coarse and fine-grained parallelism. Note: The work function has been redefined, so task will now use the new definition.
> t := time[real]():
> Threads:-Task:-Start( task, A2, 1, nops(A2) );
> time[real]()-t;
Using both coarse and fine-grained parallelism combines the best of both of these approaches.
15.5 Examples

The N Queens Problem
On an N by N chess board, find the positions for N queens such that no two queens can capture each other. A queen can capture other chess pieces in the row and column in which it is positioned, and along the two diagonals that pass through the queen's position.
We will represent the board position by an Array of length N. Each element of the Array is an integer in the range 1..N, and each integer only appears once. The combination of the Array index and the element stored at that index specify the position of a queen.
This representation is sufficient because only one queen can be in each row and column at a time. These restrictions can be specified while creating the positions, so when the chess board layouts are checked for valid solutions, we only need to look for conflicts along the diagonals.
nQueens := module()
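The body of the nQueens module is not expanded in this chapter. As a rough sketch of the check it must perform (the name and signature are assumptions), a board position is a valid solution when no two queens share a diagonal:

> isSolution := proc( pos, n )
      local i, j;
      for i from 1 to n-1 do
          for j from i+1 to n do
              # queens in rows i and j share a diagonal exactly when
              # the column distance equals the row distance
              if abs( pos[i] - pos[j] ) = j - i then
                  return false;
              end if;
          end do;
      end do;
      true;
  end proc: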
By passing 0 as the second argument, child tasks are not actually created. The following is the running time for sequential execution.
> time[real]( nQueens( 9, 0 ) );
New tasks are created for all of the permutations for the first two rows of the chess board.
> time[real]( nQueens( 9, 2 ) );
15.6 Limitations of Parallel Programming
Parallel programming in Maple is a relatively new feature. Maple has some limitations that affect the performance of parallel code. As new versions of Maple are released, these limitations will change. For more details about the following limitations, refer to the multithreaded help page.
Library Code
Only certain Maple library commands are thread-safe. If a Maple command is thread-safe, a note is included in its help page. If a Maple command that is not thread-safe is used in parallel code, the code may not work correctly.
A list of all the thread-safe functions is available in the Maple help system on the index/threadsafe help page.
Maple Interpreter
The Maple interpreter executes all the code written in Maple. It can execute most Maple statements in parallel; however, there are some internal systems that can reduce parallelism.
For a description of the performance issues in your version of Maple, see the multithreaded/performancelimitations help page.
15.7 Avoiding Common Problems
This section provides hints and describes common mistakes to help you understand and avoid errors in parallel programming.
Every Execution Order Will Happen
In parallel code, all possible execution orders will eventually occur. Therefore, never assume that a statement of one task will complete before another statement in another task, no matter how unlikely it seems that the other statement could run first. Always use the parallel programming tools in Maple (that is, the task dependencies in the Task Programming Model or mutexes) to enforce the order of execution.
Lock around All Accesses
It is common to think that if you have shared data, you only need to lock when modifying the data, not when reading it. In general, this is not correct. If one task is writing data while another is reading it, the writer can interfere with the reader. (Do not forget that tasks can pause at any time.) The only way to keep the writer from interfering with the reader is to have the reading task acquire the lock as well.
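As a minimal sketch (shared_pair and m are illustrative names, not library code), a reader takes the same lock as the writers so that it never observes a half-updated structure:

> reader := proc( m )
      local snapshot;
      global shared_pair;
      Threads:-Mutex:-Lock( m );
      # copy both elements while holding the lock so the pair
      # cannot change between the two reads
      snapshot := [ shared_pair[1], shared_pair[2] ];
      Threads:-Mutex:-Unlock( m );
      snapshot;
  end proc: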
Debugging Parallel Code
Debugging parallel code can be difficult in many ways. The multiple possible orders can make bugs difficult to find. In particular, running your parallel code on a single-core machine may not produce orders that occur on a multicore machine.
Sometimes, the best way to debug parallel code is to do careful code inspections (that is, reading over the code) with the implications of parallel execution in mind. In the most extreme case, you can consider the shared data as the state in a state machine and the critical sections as transitions. This can allow you to see potential states and transitions that you did not consider.
15.8 Introduction to Grid Programming
The Grid package allows the user to launch multiple copies of Maple's computation engine. Each copy of the engine is independent, so the copies do not share memory as tasks do in the Task Programming Model. If the engines need to share data, they must communicate by sending messages back and forth.
Starting a Grid-Based Computation
To start a new computation using the Grid package, use the Grid:-Launch command. This starts new copies of the computation engine, called nodes, and passes a command to each node.
> hello := proc()
      printf("I'm node %d of %d\n", Grid:-MyNode(), Grid:-NumNodes());
      Grid:-Barrier();
  end:

> Grid:-Launch( hello, numnodes=4 ):   # numnodes=4 matches the output shown

I'm node 1 of 4
I'm node 2 of 4
I'm node 3 of 4
I'm node 0 of 4
This example creates a number of nodes and executes the hello function on each node. The Grid:-NumNodes command returns the number of nodes that were started by Launch. Grid:-MyNode returns an integer in the range 0 to NumNodes()-1, which identifies the executing node. The Grid:-Barrier command creates a synchronization point: all the nodes must execute the Barrier command before any of them can proceed past it.
Node 0 is given special significance in Grid programming. The value returned by the function executing on node 0 is returned by the Launch command. Thus, when node 0 returns a value, the whole Grid computation is considered complete, and nodes that are still running are terminated. This is why the call to Barrier is necessary in the previous example; without it, node 0 could exit before the other nodes have finished executing their commands.
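A minimal sketch of this behavior (the strings are arbitrary): only node 0's value comes back from Launch, and the other nodes' values are discarded.

> Grid:-Launch( proc()
      if Grid:-MyNode() = 0 then
          "result from node 0";   # returned by Launch
      else
          "ignored";              # discarded when node 0 finishes
      end if;
  end proc );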
Communicating between Nodes
As nodes are independent processes, to share data you need to explicitly send data from one node to another.
Launch
Launch allows you to specify data that will be passed to the given functions as arguments. Additionally, Launch can automatically import global names to each node when nodes are started. As well, Launch can export global names from node 0 when it exits. In the following example, we pass two arguments into func, arg1 and arg2, and import the global variable data1 into each node using the imports argument. We also set the value of data2 in node 0 and use the exports argument to update the value in the main context.
> func := proc( arg1, arg2 )
      global data1, data2;
      printf( "%d: %a %a %a\n", Grid:-MyNode(), arg1, arg2, data1 );
      Grid:-Barrier();
      if ( Grid:-MyNode() = 0 ) then
          data2 := 1;
      end;
  end:

> Grid:-Launch( func, 10, 20, imports=[ 'data1'=30 ], exports=[ 'data2' ] ):

3: 10 20 30
2: 10 20 30
1: 10 20 30
0: 10 20 30

> data2;   # after Launch returns, the exported value is available here
One important use of the imports option is to pass user-defined functions that are needed on the nodes. Such functions will not be available on the nodes unless they are explicitly imported.
The Grid package also contains two commands for explicitly sending data from one node to another: Grid:-Send and Grid:-Receive.
Send
Send allows one node to send a Maple expression to another node. Send accepts two arguments, an integer that identifies the destination node and the expression to send. Send does not wait for the target node to receive the message before returning.
Receive
Receive receives an expression that was sent from another node. Receive has one optional argument: an integer that identifies the sender from whom an expression should be read. Without the argument, Receive will return an expression from any sender. If there is no expression available, a call to Receive will wait until an expression is received. Some care should be taken, as it is possible to cause a deadlock if all nodes are waiting to receive a message and no node is sending.
An Example Using Send and Receive
> circ := proc()
      local r, me := Grid:-MyNode(), n := Grid:-NumNodes();
      if me = 0 then
          Grid:-Send(1,0);
          r := Grid:-Receive(n-1);
      else
          r := Grid:-Receive(me-1);
          Grid:-Send(me+1 mod n, r, me);
      end if;
  end:

> [ Grid:-Launch( circ ) ];
The next section includes a more complex example using Send and Receive.
15.9 Grid Examples

Computing a Mandelbrot Set
Here is a simple function for computing the Mandelbrot set. It creates a two-dimensional Array that stores the computed values.
Mandelbrot := module()
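The module body is collapsed here. As a rough sketch of the per-point computation such a module performs (the name, and the interpretation of the final 10.0 argument as a bailout radius, are assumptions), each entry of the output Array is an escape-time count:

> mandelbrotPoint := proc( x, y, iter, bailout )
      local zr, zi, t, count;
      zr := 0.0; zi := 0.0; count := 0;
      # iterate z -> z^2 + c until |z| exceeds the bailout radius
      # or the iteration limit is reached
      while zr^2 + zi^2 < bailout^2 and count < iter do
          t  := zr^2 - zi^2 + x;
          zi := 2.0*zr*zi + y;
          zr := t;
          count := count + 1;
      end do;
      count;   # the value stored in the output Array
  end proc: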
> N := 500:
> s := time[real]():
> points := Mandelbrot( N, N, 100, -2.0, .7, -1.35, 1.35, 10.0 ):
> time[real]()-s;
We can create a Grid-based implementation by dividing the input range into evenly sized chunks. In the following example, a node uses its node identifier to determine which chunk of the final Array it should compute. Once a node has completed its computation, it sends the computed Array to node 0. Node 0 collects all the results and returns them, and the results are then combined into a single output Array.
Mandelbrot := module()
For this example, we are executing on a four-core machine.
> s := time[real]():
> points := Mandelbrot( N, N, 100, -2.0, .7, -1.35, 1.35, 10.0 ):
> time[real]()-s;
Although we do see a speedup, it is not as good as we would expect. If you execute this example and watch the CPU utilization, you will notice that some nodes complete quite quickly, while others run for longer. This indicates that the distribution of work is uneven between the nodes.
We can improve this by using a client/server model for work distribution. In this model, one node (node 0 in our case) acts as a server, handing out work to clients as they request it. As long as work is available, the clients can continue computing. In the following example, the server passes row indexes to the clients. A client then computes the entire row and sends it back to the server, which collects all the rows and reconstructs them into the final Array.
It is important to notice that the following example starts an extra node. The server node does relatively little work compared to the other nodes, so we create one client for each processor; the server node does not need a complete processor for itself. A rough sketch of the client/server pattern appears below.
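The module body below is collapsed, but the pattern it implements looks roughly like the following sketch; the message format and the computeRow helper are assumptions, not the chapter's actual code:

> server := proc( rows, clients )
      local msg, finished, nextRow, results;
      results := table();
      nextRow := 1;
      finished := 0;
      while finished < clients do
          msg := Grid:-Receive();            # clients send [node, row, data]
          if msg[2] > 0 then                 # row 0 marks a first request
              results[ msg[2] ] := msg[3];
          end if;
          if nextRow <= rows then
              Grid:-Send( msg[1], nextRow ); # hand out the next row
              nextRow := nextRow + 1;
          else
              Grid:-Send( msg[1], 0 );       # no rows left: tell the client to stop
              finished := finished + 1;
          end if;
      end do;
      results;
  end proc:

> client := proc()
      local me, row;
      me := Grid:-MyNode();
      Grid:-Send( 0, [me, 0, 0] );           # the first request carries no data
      row := Grid:-Receive( 0 );
      while row > 0 do
          Grid:-Send( 0, [me, row, computeRow( row )] );  # computeRow assumed
          row := Grid:-Receive( 0 );
      end do;
  end proc: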
Mandelbrot := module()
> s := time[real]():
> points := Mandelbrot( N, N, 100, -2.0, .7, -1.35, 1.35, 10.0 ):
> time[real]()-s;
By using the client/server model to better distribute the work over the nodes, we get speedups that match our expectations: four processors lead to a four-times speedup.
15.10 The Grid Computing Toolbox
In addition to the Grid package included in Maple, the Grid Computing Toolbox is available as an add-on for Maple. The Grid Computing Toolbox enables nodes to run on remote Grid servers. These remote servers can support a much larger number of nodes distributed over multiple computers.
An algorithm implemented on top of the Grid package that ships with Maple should also work on top of the Grid Computing Toolbox. The Grid Computing Toolbox does introduce new functions; however, these functions are mostly dedicated to managing remote servers.
There are a few differences between local and remote execution. First, local nodes may start with local Maple libraries available. These libraries will generally not be available to remote nodes. Instead of relying on sharing the libraries via libname, explicitly pass the routines you need using the Launch command's imports parameter, as sketched below.
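For example (a sketch only; myHelper is a hypothetical user routine, and passing its body via eval is an assumption consistent with the imports syntax shown earlier):

> myHelper := proc( x ) x^2 + 1 end proc:
> Grid:-Launch( proc() print( myHelper( Grid:-MyNode() ) ) end proc,
                imports=[ 'myHelper' = eval(myHelper) ] ):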
15.11 Limitations
There are a few situations where it may be difficult to effectively take advantage of the Grid package.
Memory Usage
With the Grid package, multiple processes run on the local machine. If the original computation requires a significant amount of memory, then each Grid node may still require a significant amount of memory, effectively multiplying the amount of memory needed by the number of nodes. This could consume all the memory resources on the machine, which can make the entire computation slower in the long run.
Cost of Communication
Passing data between nodes can be slow. Algorithms where each node needs to have access to a large amount of data may be difficult to speed up using the Grid package. Minimizing the amount of data passed between nodes can be an effective way to optimize a Grid-based computation.
Load Balancing
The Grid package currently does not have any built-in load balancing. Therefore, the programmer is responsible for making sure that all the nodes are kept busy. This can be difficult: you need to balance keeping work available for the nodes to compute against the overhead of excessive communication.
15.12 Troubleshooting

Deadlocking
Some care must be taken when using Send and Receive. A call to Receive will wait until a message is received, so if all nodes call Receive when there are no messages to be read, the execution will deadlock. In addition, there are a few limitations on what types of expressions can be used for messages; see the Grid:-Send help page for more information.
When an unhandled exception is raised on a node, the node exits prematurely. This may cause a Send or Receive to be missed, leading to a deadlock. The sketch below shows the simplest case.
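A minimal illustration (hypothetical code; not meant to be run):

> waitForever := proc()
      local r;
      r := Grid:-Receive();   # blocks on every node; nothing is ever sent
  end proc:
> # Grid:-Launch( waitForever ):   # do not run: this never completes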
'libname' and Other Engine Variables
The nodes started by the Grid package are independent of the main engine, so changes in the state of the main engine will not be reflected in the other nodes. In particular, the value of libname on the nodes may not be the same as the value of libname in the main engine. When running a local grid, the local nodes use the libname that the main engine had when the first Grid computation was started; later changes to libname will not affect the nodes. In general, it is better to use the Launch command's imports argument to pass values to the nodes instead of relying on libname.
With remote servers and the Grid Computing Toolbox, the value of libname in the main engine will have no effect on the value of libname set in the remote nodes.
Missing Functions
Forgetting to send all the necessary functions to the nodes may lead to nodes exiting without properly executing the work they have been given. This may occur without any exceptions being raised.