Lazy binary-splitting: a run-time adaptive work-stealing scheduler
Title | Lazy binary-splitting: a run-time adaptive work-stealing scheduler |
Publication Type | Conference Papers |
Year of Publication | 2010 |
Authors | Tzannes A, Caragea GC, Barua R, Vishkin U |
Conference Name | Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming |
Date Published | 2010/// |
Publisher | ACM |
Conference Location | New York, NY, USA |
ISBN Number | 978-1-60558-877-3 |
Keywords | Dynamic scheduling, load balancing, nested parallelism, thread scheduling, work stealing |
Abstract | We present Lazy Binary Splitting (LBS), a user-level scheduler of nested parallelism for shared-memory multiprocessors that builds on existing Eager Binary Splitting work-stealing (EBS) implemented in Intel's Threading Building Blocks (TBB), but improves performance and ease-of-programming. In its simplest form (SP), EBS requires manual tuning by repeatedly running the application under carefully controlled conditions to determine a stop-splitting-threshold (sst)for every do-all loop in the code. This threshold limits the parallelism and prevents excessive overheads for fine-grain parallelism. Besides being tedious, this tuning also over-fits the code to some particular dataset, platform and calling context of the do-all loop, resulting in poor performance portability for the code. LBS overcomes both the performance portability and ease-of-programming pitfalls of a manually fixed threshold by adapting dynamically to run-time conditions without requiring tuning. We compare LBS to Auto-Partitioner (AP), the latest default scheduler of TBB, which does not require manual tuning either but lacks context portability, and outperform it by 38.9% using TBB's default AP configuration, and by 16.2% after we tuned AP to our experimental platform. We also compare LBS to SP by manually finding SP's sst using a training dataset and then running both on a different execution dataset. LBS outperforms SP by 19.5% on average. while allowing for improved performance portability without requiring tedious manual tuning. LBS also outperforms SP with sst=1, its default value when undefined, by 56.7%, and serializing work-stealing (SWS), another work-stealer by 54.7%. Finally, compared to serializing inner parallelism (SI) which has been used by OpenMP, LBS is 54.2% faster. |
URL | http://doi.acm.org/10.1145/1693453.1693479 |
DOI | 10.1145/1693453.1693479 |