Diffstat (limited to 'gcc-4.2.1/boehm-gc/doc/scale.html')
-rw-r--r--  gcc-4.2.1/boehm-gc/doc/scale.html  210
1 file changed, 0 insertions, 210 deletions
diff --git a/gcc-4.2.1/boehm-gc/doc/scale.html b/gcc-4.2.1/boehm-gc/doc/scale.html
deleted file mode 100644
index 2e70148df..000000000
--- a/gcc-4.2.1/boehm-gc/doc/scale.html
+++ /dev/null
@@ -1,210 +0,0 @@
-<HTML>
-<HEAD>
-<TITLE>Garbage collector scalability</TITLE>
-</HEAD>
-<BODY>
-<H1>Garbage collector scalability</h1>
-In its default configuration, the Boehm-Demers-Weiser garbage collector
-is not thread-safe. It can be made thread-safe for a number of environments
-by building the collector with the appropriate
-<TT>-D</tt><I>XXX</i><TT>_THREADS</tt> compilation
-flag. This has two primary effects:
-<OL>
-<LI> It causes the garbage collector to stop all other threads when
-it needs to see a consistent memory state.
-<LI> It causes the collector to acquire a lock around essentially all
-allocation and garbage collection activity.
-</ol>
-Since a single lock is used for all allocation-related activity, only one
-thread can be allocating or collecting at a time. This inherently
-limits the performance of multi-threaded applications on multiprocessors.
-<P>
-On most platforms, the allocator/collector lock is implemented as a
-spin lock with exponential back-off. Longer wait times are implemented
-by yielding and/or sleeping. If a collection is in progress, the pure
-spinning stage is skipped. This has the advantage that uncontested lock
-acquisitions, and thus most lock acquisitions on a uniprocessor, are very
-cheap. It has the disadvantage that the application may sleep for small
-periods of time even when there is work to be done, and that threads may
-be unnecessarily woken up for short periods. Nonetheless, this scheme empirically
-outperforms native queue-based mutual exclusion implementations in most
-cases, sometimes drastically so.
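-<P>
-The following is a minimal sketch of a lock of this general shape; it is
-<I>not</i> the collector's actual lock code, the names and back-off
-constants are invented for the illustration, and it omits details such as
-skipping the spinning stage while a collection is in progress.
-<PRE>
-#include &lt;sched.h&gt;                     /* sched_yield() */
-
-static volatile int alloc_lock = 0;          /* 0 = free, 1 = held */
-
-static void acquire_alloc_lock(void)
-{
-    int spin_bound = 4;                      /* grows exponentially */
-    for (;;) {
-        int i;
-        /* Pure spinning stage: very cheap if the lock is uncontested. */
-        for (i = 0; i != spin_bound; ++i) {
-            if (alloc_lock == 0
-                && __sync_lock_test_and_set(&alloc_lock, 1) == 0)
-                return;
-        }
-        if (spin_bound != 1024) spin_bound *= 2;  /* exponential back-off */
-        sched_yield();                       /* longer waits: give up the CPU */
-    }
-}
-
-static void release_alloc_lock(void)
-{
-    __sync_lock_release(&alloc_lock);
-}
-</pre>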
-<H2>Options for enhanced scalability</h2>
-Version 6.0 of the collector adds two facilities to enhance collector
-scalability on multiprocessors. As of 6.0alpha1, these are supported
-only under Linux on X86 and IA64 processors, though ports to the other
-Pthreads platforms that the collector already supports should be straightforward.
-They are intended to be used together.
-<UL>
-<LI>
-Building the collector with <TT>-DPARALLEL_MARK</tt> allows the collector to
-run the mark phase in parallel in multiple threads, and thus on multiple
-processors. The mark phase typically consumes the large majority of the
-collection time. Thus this largely parallelizes the garbage collector
-itself, though not the allocation process. Currently the marking is
-performed by the thread that triggered the collection, together with
-<I>N</i>-1 dedicated
-threads, where <I>N</i> is the number of processors detected by the collector.
-The dedicated threads are created once at initialization time.
-<P>
-A second effect of this flag is to switch to a more concurrent
-implementation of <TT>GC_malloc_many</tt>, so that free lists can be
-built, and memory can be cleared, by more than one thread concurrently.
-(A usage sketch of <TT>GC_malloc_many</tt> follows this list.)
-<LI>
-Building the collector with <TT>-DTHREAD_LOCAL_ALLOC</tt> adds support for thread
-local allocation. It does not, by itself, cause thread local allocation
-to be used. It simply allows the use of the interface in
-<TT>gc_local_alloc.h</tt>.
-<P>
-Memory returned from thread-local allocators is completely interchangeable
-with that returned by the standard allocators. It may be used by other
-threads. The only difference is that, if the thread allocates enough
-memory of a certain kind, it will build a thread-local free list for
-objects of that kind, and allocate from that. This greatly reduces
-locking. The thread-local free lists are refilled using
-<TT>GC_malloc_many</tt>.
-<P>
-An important side effect of this flag is that the default
-spin-then-sleep lock is replaced by a spin-then-queue based implementation.
-This <I>reduces performance</i> for the standard allocation functions,
-though it usually improves performance when thread-local allocation is
-used heavily, since the number of short-duration lock acquisitions
-is then greatly reduced.
-</ul>
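-<P>
-For reference, <TT>GC_malloc_many</tt> returns a list of objects of one
-requested size, linked through the first word of each object; <TT>gc.h</tt>
-provides the <TT>GC_NEXT</tt> macro for traversing such a list. The
-following is a minimal sketch of a client-maintained free list refilled
-this way; the object size, names, and cache structure are invented for the
-example, and a multi-threaded client would keep one such cache per thread.
-<PRE>
-#include "gc.h"
-
-#define NODE_BYTES 32                  /* one object size per free list */
-
-static void *node_cache = 0;           /* client-maintained free list   */
-
-/* Take one NODE_BYTES object from the cache, refilling the cache with  */
-/* a single call to GC_malloc_many when it runs out.                    */
-static void *cache_alloc(void)
-{
-    void *p = node_cache;
-    if (p == 0) {
-        p = GC_malloc_many(NODE_BYTES);   /* returns a linked list */
-        if (p == 0) return 0;             /* allocation failed     */
-    }
-    node_cache = GC_NEXT(p);           /* advance to the next free object  */
-    GC_NEXT(p) = 0;                    /* clear the link in the object we  */
-                                       /* hand out                         */
-    return p;
-}
-</pre>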
-<P>
-The easiest way to switch an application to thread-local allocation
-(a minimal example follows this list) is to
-<OL>
-<LI> Define the macro <TT>GC_REDIRECT_TO_LOCAL</tt>,
-and then include the <TT>gc.h</tt>
-header in each client source file.
-<LI> Invoke <TT>GC_thr_init()</tt> before any allocation.
-<LI> Allocate using <TT>GC_MALLOC</tt>, <TT>GC_MALLOC_ATOMIC</tt>,
-and/or <TT>GC_GCJ_MALLOC</tt>.
-</ol>
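-<P>
-A minimal example of a client converted this way, assuming the collector
-itself was built with the flags described above:
-<PRE>
-#define GC_REDIRECT_TO_LOCAL     /* step 1: define before including gc.h */
-#include "gc.h"
-
-int main(void)
-{
-    void *p;
-    GC_thr_init();               /* step 2: before any allocation        */
-    p = GC_MALLOC(16);           /* step 3: served from thread-local     */
-                                 /* free lists once they are built       */
-    return p == 0;
-}
-</pre>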
-<H2>The Parallel Marking Algorithm</h2>
-We use an algorithm similar to
-<A HREF="http://www.yl.is.s.u-tokyo.ac.jp/gc/">that developed by
-Endo, Taura, and Yonezawa</a> at the University of Tokyo.
-However, the data structures and implementation are different,
-and represent a smaller change to the original collector source,
-probably at the expense of extreme scalability. Some of
-the refinements they suggest, <I>e.g.</i> splitting large
-objects, were also incorporated into our approach.
-<P>
-The global mark stack is transformed into a global work queue.
-Unlike the usual case, it never shrinks during a mark phase.
-The mark threads remove objects from the queue by copying them to a
-local mark stack and changing the global descriptor to zero, indicating
-that there is no more work to be done for this entry.
-This removal
-is done with no synchronization. Thus it is possible for more than
-one worker to remove the same entry, resulting in some work duplication.
-<P>
-The global work queue grows only if a marker thread decides to
-return some of its local mark stack to the global one. This
-is done if the global queue appears to be running low, or if
-the local stack is in danger of overflowing. It does require
-synchronization, but should be relatively rare.
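-<P>
-The following sketch illustrates the general shape of these two steps. It
-does <I>not</i> show the collector's actual data structures; the entry
-layout, sizes, and names are all invented for the illustration, and bounds
-checks are omitted.
-<PRE>
-#include &lt;pthread.h&gt;
-
-/* An invented queue entry: an object (or address range) to be scanned,  */
-/* plus its mark descriptor.  A zero descriptor means "already taken".   */
-struct mark_entry { void *base; unsigned long descr; };
-
-static struct mark_entry global_queue[65536];
-static int global_top;      /* grows, but never shrinks, during a mark phase */
-
-static pthread_mutex_t mark_lock = PTHREAD_MUTEX_INITIALIZER;
-
-/* Copy one global entry onto this marker's local stack.  There is no    */
-/* locking here, so two markers may occasionally copy the same entry and */
-/* duplicate a little marking work; that is harmless.                    */
-static int take_entry(struct mark_entry *local, int *local_top, int i)
-{
-    struct mark_entry e = global_queue[i];
-    if (e.descr == 0) return 0;          /* nothing left to do here      */
-    global_queue[i].descr = 0;           /* claim it, unsynchronized     */
-    local[(*local_top)++] = e;
-    return 1;
-}
-
-/* Return half of a large local stack to the global queue.  Unlike the   */
-/* step above this is synchronized, but it should be comparatively rare. */
-static void return_work(struct mark_entry *local, int *local_top)
-{
-    int keep = *local_top / 2, i;
-    pthread_mutex_lock(&mark_lock);
-    for (i = keep; i != *local_top; ++i)
-        global_queue[global_top++] = local[i];
-    pthread_mutex_unlock(&mark_lock);
-    *local_top = keep;
-}
-</pre>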
-<P>
-The sequential marking code is reused to process local mark stacks.
-Hence the amount of additional code required for parallel marking
-is minimal.
-<P>
-It should be possible to use generational collection in the presence of the
-parallel collector, by calling <TT>GC_enable_incremental()</tt>.
-This does not result in fully incremental collection, since parallel mark
-phases cannot currently be interrupted, and doing so may be too
-expensive.
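-<P>
-Concretely, a client that wants this combination would, assuming a
-collector built as described above, simply add one call during setup:
-<PRE>
-#include "gc.h"
-
-static void setup_gc(void)
-{
-    GC_thr_init();               /* as above, before any allocation      */
-    GC_enable_incremental();     /* generational collection; parallel    */
-                                 /* mark phases still run to completion  */
-}
-</pre>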
-<P>
-Gcj-style mark descriptors do not currently mix with the combination
-of local allocation and incremental collection. They should work correctly
-with one or the other, but not both.
-<P>
-The number of marker threads is set on startup to the number of
-available processors (or to the value of the <TT>GC_NPROCS</tt>
-environment variable). If only a single processor is detected,
-parallel marking is disabled.
-<P>
-Note that setting <TT>GC_NPROCS</tt> to 1 also causes some lock acquisitions inside
-the collector to immediately yield the processor instead of busy waiting
-first. In the case of a multiprocessor and a client with multiple
-simultaneously runnable threads, this may have disastrous performance
-consequences (e.g. a factor of 10 slowdown).
-<H2>Performance</h2>
-We conducted some simple experiments with a version of
-<A HREF="gc_bench.html">our GC benchmark</a> that was slightly modified to
-run multiple concurrent client threads in the same address space.
-Each client thread does the same work as the original benchmark, but they share
-a heap.
-This benchmark involves very little work outside of memory allocation.
-This was run with GC 6.0alpha3 on a dual processor Pentium III/500 machine
-under Linux 2.2.12.
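-<P>
-In outline, the modified driver simply starts the client threads and lets
-each run the benchmark's work against the shared, collected heap. The
-sketch below conveys the structure only; the allocation loop stands in for
-the real benchmark's work, and it assumes a collector and client built
-with <TT>-DGC_LINUX_THREADS</tt> so that <TT>gc.h</tt> registers threads
-created with <TT>pthread_create</tt>.
-<PRE>
-#define GC_LINUX_THREADS
-#include "gc.h"
-#include &lt;pthread.h&gt;
-
-#define N_CLIENTS 2
-
-static void *client(void *arg)
-{
-    int i;
-    for (i = 0; i != 1000000; ++i) {
-        void **p = (void **)GC_MALLOC(4 * sizeof(void *));
-        if (p != 0) p[0] = p;        /* small pointer-containing objects */
-    }
-    return 0;
-}
-
-int main(void)
-{
-    pthread_t tids[N_CLIENTS];
-    int i;
-    GC_INIT();                       /* explicit initialization          */
-    for (i = 0; i != N_CLIENTS; ++i)
-        pthread_create(tids + i, 0, client, 0);
-    for (i = 0; i != N_CLIENTS; ++i)
-        pthread_join(tids[i], 0);
-    return 0;
-}
-</pre>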
-<P>
-Running with a thread-unsafe collector, the benchmark ran in 9
-seconds. With the simple thread-safe collector,
-built with <TT>-DLINUX_THREADS</tt>, the execution time
-increased to 10.3 seconds, or 23.5 elapsed seconds with two clients.
-(The times for the <TT>malloc</tt>/<TT>free</tt> version
-with glibc <TT>malloc</tt>
-are 10.51 (standard library, pthreads not linked),
-20.90 (one thread, pthreads linked),
-and 24.55 seconds respectively. The benchmark favors a
-garbage collector, since most objects are small.)
-<P>
-The following table gives execution times for the collector built
-with parallel marking and thread-local allocation support
-(<TT>-DGC_LINUX_THREADS -DPARALLEL_MARK -DTHREAD_LOCAL_ALLOC</tt>). We tested
-the client using either one or two marker threads, and running
-one or two client threads. Note that the client uses thread local
-allocation exclusively. With <TT>-DTHREAD_LOCAL_ALLOC</tt> the collector
-switches to a locking strategy that is better tuned to less frequent
-lock acquisition. The standard allocation primitives thus perform
-slightly worse than without <TT>-DTHREAD_LOCAL_ALLOC</tt>, and should be
-avoided in time-critical code.
-<P>
-(The results using <TT>pthread_mutex_lock</tt>
-directly for allocation locking would have been worse still, at
-least for older versions of linuxthreads.
-With <TT>THREAD_LOCAL_ALLOC</tt>, we first repeatedly try to acquire the
-lock with <TT>pthread_mutex_trylock()</tt>, busy-waiting between attempts.
-After a fixed number of attempts, we use <TT>pthread_mutex_lock()</tt>.)
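-<P>
-In outline, the acquisition path under <TT>THREAD_LOCAL_ALLOC</tt> has the
-following shape; this is a simplified sketch, not the collector's code, and
-the retry bound and names are invented.
-<PRE>
-#include &lt;pthread.h&gt;
-
-static pthread_mutex_t alloc_ml = PTHREAD_MUTEX_INITIALIZER;
-
-#define TRY_LIMIT 16                /* invented bound on trylock attempts */
-
-static void acquire_ml(void)
-{
-    int i;
-    for (i = 0; i != TRY_LIMIT; ++i) {
-        if (pthread_mutex_trylock(&alloc_ml) == 0) return;
-        { volatile int j; for (j = 0; j != 50; ++j) { } }  /* brief busy-wait */
-    }
-    pthread_mutex_lock(&alloc_ml);  /* give up and queue/sleep if still held */
-}
-</pre>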
-<P>
-These measurements do not use incremental collection, nor was prefetching
-enabled in the marker. We used the C version of the benchmark.
-All measurements are in elapsed seconds on an unloaded machine.
-<P>
-<TABLE BORDER ALIGN="CENTER">
-<TR><TH>Number of threads</th><TH>1 marker thread (secs.)</th>
-<TH>2 marker threads (secs.)</th></tr>
-<TR><TD>1 client</td><TD ALIGN="CENTER">10.45</td><TD ALIGN="CENTER">7.85</td></tr>
-<TR><TD>2 clients</td><TD ALIGN="CENTER">19.95</td><TD ALIGN="CENTER">12.3</td></tr>
-</table>
-<P>
-The execution time for the single-threaded case is slightly worse than with
-simple locking. However, even the single-threaded benchmark runs faster than
-the thread-unsafe version if a second processor is available.
-The execution time for two clients with thread-local allocation is
-only 1.4 times the sequential execution time for a single thread in a
-thread-unsafe environment, even though it involves twice the client work.
-That represents close to a
-factor of 2 improvement over the 2 client case with the old collector.
-The old collector clearly
-still suffered from some contention overhead, in spite of the fact that the
-locking scheme had been fairly well tuned.
-<P>
-Full linear speedup (i.e. the same execution time for 1 client on one
-processor as 2 clients on 2 processors)
-is probably not achievable on this kind of
-hardware even with such a small number of processors,
-since the memory system is
-a major constraint for the garbage collector, and the processors usually
-share a single memory bus, so the aggregate memory bandwidth does not
-increase in proportion to the number of processors.
-<P>
-These results are likely to be very sensitive to both hardware and OS
-issues. Preliminary experiments with an older Pentium Pro machine running
-an older kernel were far less encouraging.
-
-</body>
-</html>