Diffstat (limited to 'gcc-4.4.3/boehm-gc/doc/scale.html')
-rw-r--r-- | gcc-4.4.3/boehm-gc/doc/scale.html | 210 |
1 files changed, 0 insertions, 210 deletions
diff --git a/gcc-4.4.3/boehm-gc/doc/scale.html b/gcc-4.4.3/boehm-gc/doc/scale.html
deleted file mode 100644
index 2e70148df..000000000
--- a/gcc-4.4.3/boehm-gc/doc/scale.html
+++ /dev/null
@@ -1,210 +0,0 @@
<HTML>
<HEAD>
<TITLE>Garbage collector scalability</TITLE>
</HEAD>
<BODY>
<H1>Garbage collector scalability</h1>
In its default configuration, the Boehm-Demers-Weiser garbage collector
is not thread-safe. It can be made thread-safe for a number of environments
by building the collector with the appropriate
<TT>-D</tt><I>XXX</i><TT>-THREADS</tt> compilation
flag. This has two primary effects:
<OL>
<LI> It causes the garbage collector to stop all other threads when
it needs to see a consistent memory state.
<LI> It causes the collector to acquire a lock around essentially all
allocation and garbage collection activity.
</ol>
Since a single lock is used for all allocation-related activity, only one
thread can be allocating or collecting at any one time. This inherently
limits the performance of multi-threaded applications on multiprocessors.
<P>
On most platforms, the allocator/collector lock is implemented as a
spin lock with exponential back-off. Longer wait times are implemented
by yielding and/or sleeping. If a collection is in progress, the pure
spinning stage is skipped. This has the advantage that uncontested lock
acquisitions, and thus most lock acquisitions on a uniprocessor, are very
cheap. It has the disadvantage that the application may sleep for short
periods even when there is work to be done, and that threads may be
unnecessarily woken up for short periods. Nonetheless, this scheme
empirically outperforms native queue-based mutual exclusion
implementations in most cases, sometimes drastically so.
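<P>
The collector's actual lock is a platform-specific low-level spin lock,
but the general spin/back-off/yield/sleep structure described above can be
sketched as follows, using a <TT>pthread_mutex_t</tt> as a stand-in. (The
Performance section below notes that the collector built with
<TT>-DTHREAD_LOCAL_ALLOC</tt> really does use repeated
<TT>pthread_mutex_trylock()</tt> calls followed by a blocking
<TT>pthread_mutex_lock()</tt>; everything else here, including the names,
is invented for this illustration.)
<PRE>
#include &lt;pthread.h&gt;
#include &lt;sched.h&gt;
#include &lt;time.h&gt;

static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

/* Sketch only: spin briefly with exponential back-off, then fall back
 * to yielding and sleeping. */
static void acquire_alloc_lock(int collection_in_progress)
{
    int pause = 1;
    /* Skip the pure spinning stage while a collection is in progress:
     * the lock will be held too long for spinning to pay off. */
    int spins = collection_in_progress ? 0 : 10;
    for (int i = 0; i < spins; ++i) {
        if (pthread_mutex_trylock(&alloc_lock) == 0) return;
        for (volatile int j = 0; j < pause; ++j) ;  /* busy-wait */
        pause *= 2;                                 /* back off */
    }
    if (pthread_mutex_trylock(&alloc_lock) == 0) return;
    sched_yield();                        /* longer wait: yield the CPU */
    if (pthread_mutex_trylock(&alloc_lock) == 0) return;
    struct timespec ts = { 0, 1000000 };  /* then sleep ~1 ms */
    nanosleep(&ts, NULL);
    pthread_mutex_lock(&alloc_lock);      /* finally, block */
}
</pre>
An uncontested acquisition succeeds on the very first
<TT>pthread_mutex_trylock()</tt>, which is what makes the uniprocessor case
cheap.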
<H2>Options for enhanced scalability</h2>
Version 6.0 of the collector adds two facilities to enhance collector
scalability on multiprocessors. As of 6.0alpha1, these are supported
only under Linux on X86 and IA64 processors, though ports to other
supported Pthreads platforms should be straightforward.
They are intended to be used together.
<UL>
<LI>
Building the collector with <TT>-DPARALLEL_MARK</tt> allows the collector to
run the mark phase in parallel in multiple threads, and thus on multiple
processors. The mark phase typically consumes the large majority of the
collection time. Thus this largely parallelizes the garbage collector
itself, though not the allocation process. Currently the marking is
performed by the thread that triggered the collection, together with
<I>N</i>-1 dedicated
threads, where <I>N</i> is the number of processors detected by the collector.
The dedicated threads are created once at initialization time.
<P>
A second effect of this flag is to switch to a more concurrent
implementation of <TT>GC_malloc_many</tt>, so that free lists can be
built, and memory can be cleared, by more than one thread concurrently.
<LI>
Building the collector with <TT>-DTHREAD_LOCAL_ALLOC</tt> adds support for
thread-local allocation. It does not, by itself, cause thread-local
allocation to be used. It simply allows the use of the interface in
<TT>gc_local_alloc.h</tt>.
<P>
Memory returned from thread-local allocators is completely interchangeable
with that returned by the standard allocators. It may be used by other
threads. The only difference is that, if the thread allocates enough
memory of a certain kind, it will build a thread-local free list for
objects of that kind, and allocate from that. This greatly reduces
locking. The thread-local free lists are refilled using
<TT>GC_malloc_many</tt>.
<P>
An important side effect of this flag is to replace the default
spin-then-sleep lock with a spin-then-queue-based implementation.
This <I>reduces performance</i> for the standard allocation functions,
though it usually improves performance when thread-local allocation is
used heavily, since the number of short-duration lock acquisitions
is then greatly reduced.
</ul>
<P>
The easiest way to switch an application to thread-local allocation
(illustrated by the sketch following this list) is to
<OL>
<LI> Define the macro <TT>GC_REDIRECT_TO_LOCAL</tt>,
and then include the <TT>gc.h</tt>
header in each client source file.
<LI> Invoke <TT>GC_thr_init()</tt> before any allocation.
<LI> Allocate using <TT>GC_MALLOC</tt>, <TT>GC_MALLOC_ATOMIC</tt>,
and/or <TT>GC_GCJ_MALLOC</tt>.
</ol>
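<P>
For example, a client source file converted in this way might look like the
following minimal sketch. (This is an illustration only, not part of the
distribution; it assumes the collector was built with
<TT>-DTHREAD_LOCAL_ALLOC</tt> and that <TT>gc.h</tt> is on the include path.)
<PRE>
/* Step 1: define the macro, then include gc.h. */
#define GC_REDIRECT_TO_LOCAL
#include "gc.h"

int main(void)
{
    /* Step 2: initialize thread support before any allocation. */
    GC_thr_init();

    /* Step 3: allocate through the macros.  With GC_REDIRECT_TO_LOCAL
     * these are redirected to the thread-local allocators of
     * gc_local_alloc.h; once this thread has allocated enough objects
     * of a given kind, requests are served from a thread-local free
     * list (refilled in batches by GC_malloc_many) without taking the
     * global allocation lock. */
    void *p = GC_MALLOC(24);           /* may contain pointers */
    void *q = GC_MALLOC_ATOMIC(128);   /* pointer-free data */
    return p == q;                     /* dummy use of the results */
}
</pre>
Memory allocated this way remains fully interchangeable with memory from the
standard allocators, so only the allocation sites need to change.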
<H2>The Parallel Marking Algorithm</h2>
We use an algorithm similar to
<A HREF="http://www.yl.is.s.u-tokyo.ac.jp/gc/">that developed by
Endo, Taura, and Yonezawa</a> at the University of Tokyo.
However, the data structures and implementation are different,
and represent a smaller change to the original collector source,
probably at the expense of extreme scalability. Some of
the refinements they suggest, <I>e.g.</i> splitting large
objects, were also incorporated into our approach.
<P>
The global mark stack is transformed into a global work queue.
Unlike the usual case, it never shrinks during a mark phase.
The mark threads remove objects from the queue by copying them to a
local mark stack and changing the global descriptor to zero, indicating
that there is no more work to be done for this entry. This removal
is done with no synchronization. Thus it is possible for more than
one worker to remove the same entry, resulting in some work duplication.
<P>
The global work queue grows only if a marker thread decides to
return some of its local mark stack to the global one. This
is done if the global queue appears to be running low, or if
the local stack is in danger of overflowing. It does require
synchronization, but should be relatively rare.
<P>
The sequential marking code is reused to process local mark stacks.
Hence the amount of additional code required for parallel marking
is minimal.
<P>
It should be possible to use generational collection in the presence of the
parallel collector, by calling <TT>GC_enable_incremental()</tt>.
This does not result in fully incremental collection, since parallel mark
phases cannot currently be interrupted, and doing so may be too
expensive.
<P>
Gcj-style mark descriptors do not currently mix with the combination
of local allocation and incremental collection. They should work correctly
with one or the other, but not both.
<P>
The number of marker threads is set on startup to the number of
available processors (or to the value of the <TT>GC_NPROCS</tt>
environment variable). If only a single processor is detected,
parallel marking is disabled.
<P>
Note that setting <TT>GC_NPROCS</tt> to 1 also causes some lock acquisitions
inside the collector to immediately yield the processor instead of
busy-waiting first. In the case of a multiprocessor and a client with
multiple simultaneously runnable threads, this may have disastrous
performance consequences (e.g. a factor of 10 slowdown).
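<P>
To make the claiming step concrete, here is an illustrative pseudo-C
sketch. The type and variable names are invented for this document and do
not correspond to the collector's internal data structures; bounds checks
on the local stack are omitted.
<PRE>
#include &lt;stddef.h&gt;

typedef struct {
    void   *base;    /* object to be marked */
    size_t  descr;   /* mark descriptor; 0 means "entry already claimed" */
} mark_entry;

extern mark_entry global_queue[];   /* global work queue; never shrinks
                                       during a mark phase */
extern int        global_size;

/* Each marker copies claimable entries to its own local mark stack,
 * which is then drained by the ordinary sequential marking code. */
static int grab_work(mark_entry *local_stack, int local_top)
{
    for (int i = 0; i < global_size; ++i) {
        mark_entry e = global_queue[i];
        if (e.descr == 0) continue;      /* entry already claimed */
        /* Copy the entry, then zero the global descriptor.  There is
         * no synchronization here, so two markers may occasionally
         * claim the same entry and duplicate a little marking work;
         * no work can be lost, however. */
        local_stack[local_top++] = e;
        global_queue[i].descr = 0;
    }
    /* Entries move back from a local stack to the global queue (with
     * locking) only when the queue runs low or the local stack is
     * about to overflow. */
    return local_top;
}
</pre>
Because the sequential marking code is reused to drain the local stacks,
the parallel marker adds relatively little new code to the collector.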
<H2>Performance</h2>
We conducted some simple experiments with a version of
<A HREF="gc_bench.html">our GC benchmark</a> that was slightly modified to
run multiple concurrent client threads in the same address space.
Each client thread does the same work as the original benchmark, but they
share a heap.
This benchmark involves very little work outside of memory allocation.
It was run with GC 6.0alpha3 on a dual processor Pentium III/500 machine
under Linux 2.2.12.
<P>
Running with a thread-unsafe collector, the benchmark ran in 9
seconds. With the simple thread-safe collector,
built with <TT>-DLINUX_THREADS</tt>, the execution time
increased to 10.3 seconds, or 23.5 elapsed seconds with two clients.
(The times for the <TT>malloc</tt>/<TT>free</tt> version
with glibc <TT>malloc</tt>
are 10.51 (standard library, pthreads not linked),
20.90 (one thread, pthreads linked),
and 24.55 seconds (two threads), respectively. The benchmark favors a
garbage collector, since most objects are small.)
<P>
The following table gives execution times for the collector built
with parallel marking and thread-local allocation support
(<TT>-DGC_LINUX_THREADS -DPARALLEL_MARK -DTHREAD_LOCAL_ALLOC</tt>). We tested
the client using either one or two marker threads, and running
one or two client threads. Note that the client uses thread-local
allocation exclusively. With <TT>-DTHREAD_LOCAL_ALLOC</tt> the collector
switches to a locking strategy that is better tuned to less frequent
lock acquisition. The standard allocation primitives thus perform
slightly worse than without <TT>-DTHREAD_LOCAL_ALLOC</tt>, and should be
avoided in time-critical code.
<P>
(The results using <TT>pthread_mutex_lock</tt>
directly for allocation locking would have been worse still, at
least for older versions of linuxthreads.
With <TT>-DTHREAD_LOCAL_ALLOC</tt>, we first repeatedly try to acquire the
lock with <TT>pthread_mutex_trylock()</tt>, busy-waiting between attempts.
After a fixed number of attempts, we use <TT>pthread_mutex_lock()</tt>.)
<P>
These measurements do not use incremental collection, nor was prefetching
enabled in the marker. We used the C version of the benchmark.
All measurements are in elapsed seconds on an unloaded machine.
<P>
<TABLE BORDER ALIGN="CENTER">
<TR><TH>Number of client threads</th><TH>1 marker thread (secs.)</th>
<TH>2 marker threads (secs.)</th></tr>
<TR><TD>1 client</td><TD ALIGN="CENTER">10.45</td><TD ALIGN="CENTER">7.85</td></tr>
<TR><TD>2 clients</td><TD ALIGN="CENTER">19.95</td><TD ALIGN="CENTER">12.3</td></tr>
</table>
<P>
The execution time for the single-threaded case is slightly worse than with
simple locking. However, the single-threaded benchmark runs faster than
even the thread-unsafe version if a second processor is available.
The execution time for two clients with thread-local allocation is
only 1.4 times the sequential execution time for a single thread in a
thread-unsafe environment, even though it involves twice the client work.
That represents close to a
factor of 2 improvement over the 2 client case with the old collector.
The old collector clearly
still suffered from some contention overhead, in spite of the fact that its
locking scheme had been fairly well tuned.
<P>
Full linear speedup (i.e.
the same execution time for 1 client on one
processor as for 2 clients on 2 processors)
is probably not achievable on this kind of
hardware even with such a small number of processors,
since the memory system is
a major constraint for the garbage collector:
the processors usually share a single memory bus, and thus
the aggregate memory bandwidth does not increase in
proportion to the number of processors.
<P>
These results are likely to be very sensitive to both hardware and OS
issues. Preliminary experiments with an older Pentium Pro machine running
an older kernel were far less encouraging.

</body>
</html>