Intro

Parallel programming is the next post in the Insights on .NET Concurrency series. Parallelism is about making more than one thing happen at the same time, with the main objective of improving the performance of CPU-bound operations. The parallel programming model tries to overcome single-core limitations: with parallelism we can break larger, more complex problems into more granular pieces and increase throughput by utilising multiple CPU cores. For example, if we want to search for a keyword in a large file, we could simply iterate through the file lines and execute the search function sequentially, or dice the file into non-overlapping blocks, each with a start and end line, and let Tasks execute the search function on each block in parallel.
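The post's FileSearch snippet was embedded as an image; the sketch below reconstructs the described approach from the text (the file name, keyword, and SearchBlock helper are hypothetical):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class FileSearch
{
    // Searches one block of lines (start inclusive, end exclusive) for the keyword.
    static List<int> SearchBlock(string[] lines, int start, int end, string keyword)
    {
        var matches = new List<int>();
        for (int i = start; i < end; i++)
        {
            if (lines[i].Contains(keyword))
                matches.Add(i + 1); // report 1-based line numbers
        }
        return matches;
    }

    static void Main()
    {
        string[] lines = File.ReadAllLines("large-file.txt"); // hypothetical file name
        const string keyword = "needle";                      // hypothetical keyword
        int blockCount = Environment.ProcessorCount;
        int blockSize = (lines.Length + blockCount - 1) / blockCount;

        // One Task per non-overlapping block of lines.
        var tasks = Enumerable.Range(0, blockCount)
            .Select(b => Task.Run(() => SearchBlock(
                lines,
                b * blockSize,
                Math.Min((b + 1) * blockSize, lines.Length),
                keyword)))
            .ToArray();

        Task.WaitAll(tasks);

        foreach (int lineNumber in tasks.SelectMany(t => t.Result))
            Console.WriteLine($"Match at line {lineNumber}");
    }
}
```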



Concepts

Degree of Parallelism (DOP): Refers to the number of processor cores that are used to process an operation simultaneously. In .NET it also denotes the number of Tasks used to run a parallelisable job; usually the number of Tasks is greater than the number of available cores. DOP is managed automatically by the underlying framework. When the number of threads utilised is larger than the number of available CPU cores, the extra context switching incurs overhead; this is called processor oversubscription. Conversely, when the number of threads utilised is smaller than the number of available cores, the CPU is underutilised; this is called processor undersubscription.
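The default can be overridden when needed; a minimal sketch capping the DOP via ParallelOptions (the loop body is a placeholder):

```csharp
using System;
using System.Threading.Tasks;

class DopExample
{
    static void Main()
    {
        // Cap the DOP at the number of logical cores to limit oversubscription;
        // -1 (the default) lets TPL decide.
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        Parallel.For(0, 100, options, i =>
        {
            // CPU-bound work goes here.
        });
    }
}
```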

Limits of Parallelism: Conceptually each task is handled by an individual CPU core, so intuitively it seems that the more cores we have, the more tasks can run simultaneously. Unfortunately, in reality there are limits to the speed-up that additional cores can deliver. This limitation is mathematically captured by Amdahl's Law: the achievable speed gain is determined by the sequential part of the program flow.
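In symbols: if p is the fraction of the program that can be parallelised and n is the number of cores, the maximum speed-up is

$$ S(n) = \frac{1}{(1 - p) + \frac{p}{n}} $$

so even as n grows without bound, the speed-up is capped at 1/(1 - p).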

Amdahl’s Law – Source: Wikipedia

Synchronisation: Code that is protected from non-deterministic access by multiple threads is called thread-safe. Synchronisation is the mechanism that manages access to shared resources so that race conditions and deadlocks are avoided and behaviour becomes deterministic. In a race condition the outcome depends on thread timing, typically only the last thread’s written value is preserved; in a deadlock, two threads each hold a shared resource the other needs and wait on each other to release it. The synchronisation implementation is a major performance factor.
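To illustrate the deadlock case, a minimal sketch where two hypothetical lock objects stand in for the shared resources:

```csharp
using System.Threading;
using System.Threading.Tasks;

class DeadlockSketch
{
    static readonly object LockA = new object(); // hypothetical shared resource A
    static readonly object LockB = new object(); // hypothetical shared resource B

    static void Main()
    {
        var t1 = Task.Run(() =>
        {
            lock (LockA)           // t1 holds A...
            {
                Thread.Sleep(100); // widen the window for the deadlock
                lock (LockB) { }   // ...and waits for B
            }
        });

        var t2 = Task.Run(() =>
        {
            lock (LockB)           // t2 holds B...
            {
                Thread.Sleep(100);
                lock (LockA) { }   // ...and waits for A: deadlock
            }
        });

        Task.WaitAll(t1, t2); // likely never returns; each task waits on the other
    }
}
```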

Partitioning: In order to process a job in parallel it must be partitioned. TPL starts with a small number of partitions and adjusts itself to a larger number over time. Having too many partitions incurs overhead from the extra context switching, whilst having too few partitions leaves cores underutilised. Similarly, fixed-size partitions reduce the complexity of managing shared resources, whilst losing on load balancing (the attempt to distribute work equally amongst tasks). By default these trade-offs are managed by TPL at runtime, however they can be customised if needed.

Parallel partitioning trade-offs
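When the defaults do not fit, partitioning can be customised, for example with fixed-size ranges via Partitioner.Create; a minimal sketch (the data and work are placeholders):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class PartitionerExample
{
    static void Main()
    {
        var data = new double[1_000_000];

        // Fixed-size range partitions: less partitioning overhead,
        // at the cost of load balancing.
        var ranges = Partitioner.Create(0, data.Length, rangeSize: 10_000);

        Parallel.ForEach(ranges, range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
                data[i] = Math.Sqrt(i); // placeholder CPU-bound work
        });
    }
}
```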


Potential Parallelism: Since .NET TPL is a generic framework that targets general-purpose computing hardware, parallel execution is not always guaranteed. The run-time environment automatically adapts to the workload on a particular machine, unlike when writing software for specific hardware, e.g. a gaming console.

Parallel Checklist

Parallel programming is mainly about a mindset and not just learning a new API.

At the start of every parallelisation journey, try to answer the following mindset questions in order to gain insight into the complexity involved and the potential speed gains of parallelising a sequential workflow in your application.

  1. Can I split the job into smaller jobs? If no, there is no parallelism; if yes, then:
    • e.g. a large data source can be partitioned and processed at the same time
  2. Is there any degree of dependency amongst the jobs? If no, voilà; if yes, then:
    • e.g. each number in a data array can be manipulated independently (no dependency)
    • e.g. one partition’s input is the output of another partition (dependent data blocks)
  3. Do they share resources? If no, again voilà; if yes, then:
    • e.g. each partition accesses and updates a shared local variable
  4. How do I synchronise access to shared resources?
    • e.g. controlling access to a shared variable is one thing; which control mechanism and abstraction level to use is another.
    • Do you lock a shared variable when it is accessed by a Task, or block the calling of a Task in the first place until another Task completes?

Rules of the Game

The TPL API that enables loop iterations to execute in parallel is Parallel.For/Parallel.ForEach. Under the hood the Parallel class creates Tasks and runs them.

ParallelToTasks


Rule 1: #Tasks is less than or equal to #Iterations

The above mapping between parallel iterations and Tasks does not always hold because, as mentioned in the Concepts section, the default TPL partitioner starts with a small number of tasks and tunes itself based on various environmental factors, e.g. machine workload, load-balancing state, or available threads.

Rule 2: #Threads is less than or equal to #Tasks

Potentially, a single thread-pool thread can handle multiple tasks waiting in the global queue.
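Both rules can be observed by recording the distinct task and thread identifiers inside a parallel loop; a small sketch (the exact counts vary per run and machine):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class RulesOfTheGame
{
    static void Main()
    {
        var taskIds = new ConcurrentDictionary<int, byte>();
        var threadIds = new ConcurrentDictionary<int, byte>();

        Parallel.For(0, 1000, i =>
        {
            taskIds.TryAdd(Task.CurrentId ?? -1, 0);
            threadIds.TryAdd(Thread.CurrentThread.ManagedThreadId, 0);
        });

        // Typically observed: #Threads <= #Tasks <= #Iterations.
        Console.WriteLine($"Iterations: 1000, Tasks: {taskIds.Count}, Threads: {threadIds.Count}");
    }
}
```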

Parallel Patterns

Patterns provide systematic guidance for solving a recurring problem, i.e. problems with similarities can follow the same patterns. But since we live in a complex world, there are multiple patterns that address different problems. There are quite a few patterns used in parallel programming, but we will be looking at the most rudimentary ones, since the others are built on top of them. Other patterns to look for are: Producer-Consumer (aka Pipeline), and Divide and Conquer (aka Dynamic Task Parallelism).

Map Pattern

Use case: A large data source of independent numbers, where each input is fed into the same algorithm.

Concurrent checklist: 1- Yes, 2- No, 3- No.

Aka Data Parallelism, Parallel Loop.

API: Parallel.For/Parallel.ForEach
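A minimal map over an array, where each element is transformed independently (the transformation is a placeholder):

```csharp
using System;
using System.Threading.Tasks;

class MapPattern
{
    static void Main()
    {
        double[] input = { 1, 2, 3, 4, 5 };
        double[] output = new double[input.Length];

        // Each iteration reads and writes only its own index: no shared state.
        Parallel.For(0, input.Length, i =>
        {
            output[i] = Math.Pow(input[i], 2); // placeholder algorithm
        });

        Console.WriteLine(string.Join(", ", output));
    }
}
```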

Considerations

#1 Know your dependencies

Maybe your own code design is thread-safe, but what about any external dependencies your code relies on, e.g. third-party libraries or framework classes? For example, Random instances, DbConnection, and FileStream.WriteByte are not thread-safe, therefore you need a separate instance of such classes for each thread.
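One way to get a per-thread instance is ThreadLocal&lt;T&gt;; a minimal sketch with Random (on .NET 6+, Random.Shared is another option):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class ThreadSafeRandom
{
    // One Random instance per thread; Random itself is not thread-safe.
    static readonly ThreadLocal<Random> Rng =
        new ThreadLocal<Random>(() => new Random(Guid.NewGuid().GetHashCode()));

    static void Main()
    {
        Parallel.For(0, 10, i =>
        {
            int sample = Rng.Value.Next(100); // safe: each thread uses its own instance
            Console.WriteLine($"Iteration {i}: {sample}");
        });
    }
}
```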

#2 Small loop bodies with few iterations

Creating a task, pushing it to the global queue, then grabbing a thread to execute it is not a free process: task creation imposes overheads that will most likely deteriorate performance for a small number of tasks with relatively small computation bodies.

Fork/Join Pattern

Use case: Run multiple independent operations at the same time. Each operation may or may not encapsulate a parallelisable implementation. Task parallelism is a higher level of abstraction than data parallelism, since data parallelism leverages a single operation with multiple inputs.

Concurrent checklist: 1- Yes, 2- No, 3- No.

Aka Task Parallelism, Parallel Tasks.

API: Parallel.Invoke, Explicit Tasks
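A minimal sketch of running independent operations with both APIs (the operation bodies are placeholders):

```csharp
using System;
using System.Threading.Tasks;

class ForkJoin
{
    static void Main()
    {
        // Parallel.Invoke forks the operations and joins (blocks) until all complete.
        Parallel.Invoke(
            () => Console.WriteLine("Load customers"),   // placeholder operation
            () => Console.WriteLine("Load orders"),      // placeholder operation
            () => Console.WriteLine("Load inventory"));  // placeholder operation

        // The equivalent with explicit Tasks.
        var t1 = Task.Run(() => Console.WriteLine("Load customers"));
        var t2 = Task.Run(() => Console.WriteLine("Load orders"));
        Task.WaitAll(t1, t2); // join
    }
}
```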

Considerations

#1 Shared Resources and Race Conditions

A very common pitfall. In the example below, in the first loop the index variable is shared by (captured in the closure of) all created Tasks, and by the time each task starts running the value of the variable has changed. In the second loop each task uses its own local variable. Note that in the second loop the output order is non-deterministic.

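The original snippet is an image; a reconstruction of the pitfall it describes, assuming simple console output, looks roughly like this:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class ClosurePitfall
{
    static void Main()
    {
        var tasks = new List<Task>();

        // First loop: every lambda captures the SAME variable i.
        // By the time a task runs, i has moved on (often printing 10 repeatedly).
        for (int i = 0; i < 10; i++)
            tasks.Add(Task.Run(() => Console.WriteLine($"shared: {i}")));

        Task.WaitAll(tasks.ToArray());
        tasks.Clear();

        // Second loop: each task captures its own local copy.
        for (int i = 0; i < 10; i++)
        {
            int local = i;
            tasks.Add(Task.Run(() => Console.WriteLine($"local: {local}")));
        }

        Task.WaitAll(tasks.ToArray()); // prints 0..9, but in non-deterministic order
    }
}
```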

#2 Shared Resources and Atomicity Violation

“A statement sequence S is atomic if S’s effects appear to other threads as if S executed without interruption”.

A sequence of operations that can be interrupted violates atomicity, hence its outcome cannot be guaranteed, especially in a multi-threaded environment. That is true whether you are dealing with high-level operations, e.g. database access (ACID), or low-level in-memory operations, e.g. incrementing or decrementing a variable.

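The post's example is an image; a reconstruction of the kind of violation it shows: counter++ is a read-modify-write sequence, so increments can be lost, whilst Interlocked.Increment (or a lock) makes the update atomic:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class AtomicityViolation
{
    static int _unsafeCounter;
    static int _safeCounter;

    static void Main()
    {
        Parallel.For(0, 1_000_000, i =>
        {
            _unsafeCounter++;                         // read-modify-write: can be interrupted
            Interlocked.Increment(ref _safeCounter);  // atomic
        });

        // _unsafeCounter is very likely < 1,000,000; _safeCounter is exactly 1,000,000.
        Console.WriteLine($"unsafe: {_unsafeCounter}, safe: {_safeCounter}");
    }
}
```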


Reduce Pattern

Use case: At the end of each parallel iteration you have to combine its outcome with another parallel iteration’s outcome. This means there is a dependency amongst the data source elements, unlike in the Map pattern. Common examples of this pattern are calculating the sum or average of a data source. For this pattern to work the operation must be associative and commutative, i.e. a + b = b + a (commutativity) and (a + b) + c = a + (b + c) (associativity).

Concurrent checklist: 1- Yes, 2- Yes, 3- Yes.

Aka Parallel Aggregation.

API: Parallel.For/Parallel.ForEach, Explicit Tasks

Reduce Pattern Diagram

Considerations

#1 Synchronisation price

The code below outputs the same result as the one above, but since synchronisation to protect the shared result takes place on every iteration, it regresses performance: with N elements in the data source, the critical section can be entered up to N-1 times. Aggregating into a task-local value and synchronising only when each task merges its subtotal keeps that cost proportional to the number of tasks instead.
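Since the post's snippets are images, here is a hedged reconstruction of the two approaches for summing a data source: the naive version locks on every iteration, whilst the localInit/localFinally overload of Parallel.ForEach keeps a per-task subtotal and locks only once per task:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class ParallelAggregation
{
    static void Main()
    {
        int[] data = Enumerable.Range(1, 1_000_000).ToArray();
        var gate = new object();

        // Naive: the critical section is entered on every iteration.
        long naiveSum = 0;
        Parallel.ForEach(data, item =>
        {
            lock (gate) { naiveSum += item; }
        });

        // Better: per-task subtotal, lock only when merging each task's result.
        long sum = 0;
        Parallel.ForEach(
            data,
            () => 0L,                                           // localInit: per-task subtotal
            (item, state, subtotal) => subtotal + item,         // body: no locking
            subtotal => { lock (gate) { sum += subtotal; } });  // localFinally: merge once per task

        Console.WriteLine($"{naiveSum} == {sum}");
    }
}
```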

Exception Handling

Exceptions thrown inside Tasks are wrapped in an AggregateException instance. Such exceptions are rethrown on the calling thread when the tasks are waited on, e.g. by Task.Wait, Task.WaitAll, or a Parallel.For/ForEach call. (Note that await unwraps the AggregateException and rethrows the first inner exception.)
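A minimal sketch: the parallel loop observes the failures, then rethrows everything as one AggregateException on the calling thread:

```csharp
using System;
using System.Threading.Tasks;

class ExceptionHandling
{
    static void Main()
    {
        try
        {
            Parallel.For(0, 10, i =>
            {
                if (i % 3 == 0)
                    throw new InvalidOperationException($"failed at {i}");
            });
        }
        catch (AggregateException ae)
        {
            // All exceptions thrown by the loop's tasks, flattened into one place.
            foreach (var ex in ae.Flatten().InnerExceptions)
                Console.WriteLine(ex.Message);
        }
    }
}
```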

Design Mindset

As you may have noticed, parallel code can be implemented at a low level, e.g. in algorithms, and at higher levels of abstraction, e.g. domain coordinators (task parallelism). Parallelising a few algorithms could bring performance improvements, but more can be achieved.

Ideally, aim to parallelise the application structure, not just specific bottlenecks (a holistic view). Remember Amdahl’s law: the sequential part of your application limits the performance gain. The diagram below depicts the trade-offs of various degrees of code structural change; the depth of parallelism.

Depth of Parallelism

