path: root/compiler/dex/quick/local_optimizations.cc
author     buzbee <buzbee@google.com>   2013-09-08 16:20:53 -0700
committer  buzbee <buzbee@google.com>   2013-09-09 15:42:01 -0700
commit     252254b130067cd7a5071865e793966871ae0246 (patch)
tree       0c74a9aa87ace39abbaefb323935e3d0c543e382 /compiler/dex/quick/local_optimizations.cc
parent     28c2300d9a85f4e7288fb5d94280332f923b4df3 (diff)
More Quick compile-time tuning: labels & branches
This CL represents a roughly 3.5% performance improvement for the
compile phase of dex2oat. Most of the gain comes from avoiding the
generation of dex boundary LIR labels unless a debug listing is
requested. The other significant change is moving from a basic block
ending branch model of "always generate a fall-through branch, and
then delete it if we can" to an "only generate a fall-through branch
if we need it" model.

The data motivating these changes follow. Note that two areas of
potentially attractive gain remain: restructuring the assembler model
and reworking the register handling utilities. These will be
addressed in subsequent CLs.

--- data follows

The Quick compiler's assembler has shown up on profile reports a bit
more than seems reasonable. We've tried a few quick fixes to
apparently hot portions of the code, but without much gain. So, I've
been looking at the assembly process at a somewhat higher level.
There look to be several potentially good opportunities.

First, an analysis of the makeup of the LIR graph showed a
surprisingly high proportion of LIR pseudo ops. Using the boot
classpath as a basis, we get:

     32.8% of all LIR nodes are pseudo ops.
     10.4% are LIR instructions which require pc-relative fixups.
     11.8% are LIR instructions that have been nop'd by the various
           optimization passes.

Looking only at the LIR pseudo ops, we get:

     kPseudoDalvikByteCodeBoundary   43.46%
     kPseudoNormalBlockLabel         21.14%
     kPseudoSafepointPC              20.20%
     kPseudoThrowTarget               6.94%
     kPseudoTarget                    3.03%
     kPseudoSuspendTarget             1.95%
     kPseudoMethodExit                1.26%
     kPseudoMethodEntry               1.26%
     kPseudoExportedPC                0.37%
     kPseudoCaseLabel                 0.30%
     kPseudoBarrier                   0.07%
     kPseudoIntrinsicRetry            0.02%

     Total LIR count: 10167292

The standout here is the Dalvik opcode boundary marker. This is just
a label inserted at the beginning of the codegen for each Dalvik
bytecode. If we're also doing a verbose listing, this is also where
we hang the pretty-print disassembly string. However, this label was
also being used as a convenient way to find the target of switch case
statements (and, I think, at one point was used in the Mir->GBC
conversion process).

This CL moves the use of kPseudoDalvikByteCodeBoundary labels to
verbose listing runs only, and replaces the codegen uses of the label
with the kPseudoNormalBlockLabel attached to the basic block that
contains the switch case target. Great savings here - a 14.3%
reduction in the number of LIR nodes needed. After this CL, our LIR
pseudo proportions drop to 21.6% of all LIR. That's still a lot, but
much better.

Possible further improvements: combine normal labels with
kPseudoSafepointPC labels where appropriate, and perhaps reduce
memory usage by using a short-hand form for labels rather than a
full LIR node. Also, many of the basic block labels are no longer
branch targets by the time we get to assembly - cheaper to delete,
or just ignore?

Here's the "after" LIR pseudo op breakdown:

     kPseudoNormalBlockLabel         37.39%
     kPseudoSafepointPC              35.72%
     kPseudoThrowTarget              12.28%
     kPseudoTarget                    5.36%
     kPseudoSuspendTarget             3.45%
     kPseudoMethodEntry               2.24%
     kPseudoMethodExit                2.22%
     kPseudoExportedPC                0.65%
     kPseudoCaseLabel                 0.53%
     kPseudoBarrier                   0.12%
     kPseudoIntrinsicRetry            0.04%

     Total LIR count: 5748232

Not done in this CL, but it will be worth experimenting with actually
deleting LIR nodes from the graph when they are optimized away,
rather than just setting the NOP bit. Keeping them around is
invaluable during debugging - but when not debugging it may pay off
if the cost of node removal is less than the cost of traversing
through dead nodes in subsequent passes.
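As an aside, the branch-model change mentioned at the top of this
message boils down to a decision like the following minimal sketch;
the BasicBlock/BranchLIR types and the function name are invented for
the example, not ART's:

    #include <vector>

    // Toy stand-ins for the compiler's basic block and branch LIR.
    struct BasicBlock {
      int id;
      BasicBlock* fall_through;  // Successor taken when no branch is emitted.
    };

    struct BranchLIR {
      int from_block;
      int to_block;
    };

    // Old model: always emit the fall-through branch at the end of the block
    // and let a later pass nop it if the target turns out to be the next
    // instruction. New model: emit the branch only when the fall-through
    // successor is not the next block in layout order.
    void EmitBlockEndBranch(const BasicBlock& bb,
                            const BasicBlock* next_in_layout,
                            std::vector<BranchLIR>* code) {
      if (bb.fall_through != nullptr &&
          (next_in_layout == nullptr || bb.fall_through != next_in_layout)) {
        code->push_back(BranchLIR{bb.id, bb.fall_through->id});
      }
      // Otherwise control simply falls through: no branch LIR is created, so
      // there is nothing to delete later.
    }

Generating the branch lazily like this is presumably why the
RemoveRedundantBranches pass is dropped from local_optimizations.cc
in the diff below.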
Next up (and partially in this CL - but mostly to be done in
follow-on CLs) is the overall assembly process. Inherited from the
trace JIT, the Quick compiler has a fairly simple-minded approach to
instruction assembly. First, a pass is made over the LIR list to
assign offsets to each instruction. Then, the assembly pass is made -
which generates the actual machine instruction bit patterns and
pushes the instruction data into the code_buffer. However, the code
generator takes the "always optimistic" approach to instruction
selection and emits the shortest instruction. If, during assembly, we
find that a branch or load doesn't reach, that short-form instruction
is replaced with a longer sequence. Of course, this invalidates the
previously-computed offset calculations. Assembly thus is an
iterative process: compute offsets and then assemble until we survive
an assembly pass without invalidation. This seems like a likely
candidate for improvement.

First, I analyzed the number of retries required, and the reason for
invalidation, over the boot classpath load. The results: more than
half of the methods don't require a retry, and very few require more
than one extra pass:

     5 or more: 6 of 96334
     4 or more: 22 of 96334
     3 or more: 140 of 96334
     2 or more: 1794 of 96334 - 2%
     1 or more: 40911 of 96334 - 40%
     0 retries: 55423 of 96334 - 58%

The interesting group here is the one that requires 1 retry. Looking
at the reasons, we see three typical causes:

     1. A cbnz/cbz doesn't reach (only 7 bits of offset).
     2. A 16-bit Thumb1 unconditional branch doesn't reach.
     3. An unconditional branch which branches to the next instruction
        is encountered, and deleted.

The first 2 cases are the cost of the optimistic strategy - nothing
much to change there. However, the interesting case is #3 - dead
branch elimination. A further analysis of the single-retry group
showed that 42% of the methods (16305) that required a single retry
did so *only* because of dead branch elimination. The big question
here is why so many dead branches survive to the assembly stage. We
have a dead branch elimination pass which is supposed to catch these
- perhaps it's not working correctly, perhaps it should be moved
later in the optimization process, or perhaps it should be run
multiple times.

Other things to consider:

o Combine the offset generation pass with the assembly pass. Skip
  pc-relative fixup assembly (other than assigning offsets), but push
  the LIR* for them into a work list. Following the main pass, zip
  through the work list and assemble the pc-relative instructions
  (now that we know the offsets). This would significantly cut back
  on traversal costs.

o Store the assembled bits into both the code buffer and the LIR. In
  the event we have to retry, only the pc-relative instructions would
  need to be assembled, and we'd finish with a pass over the LIR just
  to dump the bits into the code buffer.

Change-Id: I50029d216fa14f273f02b6f1c8b6a0dde5a7d6a6
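The retry counts above fall out of the optimistic two-pass shape
described in the message: assign offsets, assemble, and start over
whenever a short encoding has to be widened. A standalone sketch with
toy types, sizes and a toy reach limit (not the Thumb2 assembler's
real rules):

    #include <vector>

    struct LirNode {
      int offset = 0;
      int size = 2;      // Optimistic: assume the short (16-bit) encoding.
      int target = -1;   // Index of a branch target, or -1 for non-branches.
    };

    // One assembly attempt. If an optimistically short branch can't reach its
    // target, widen it and report failure: every later offset is now stale.
    bool TryAssemble(std::vector<LirNode>& lir) {
      for (LirNode& node : lir) {
        if (node.target >= 0 && node.size == 2) {
          int delta = lir[node.target].offset - node.offset;
          if (delta < -256 || delta > 254) {   // Toy reach limit for the sketch.
            node.size = 4;                     // Fall back to the long encoding.
            return false;
          }
        }
        // A real assembler would push the encoded bits into the code buffer here.
      }
      return true;
    }

    void AssembleUntilStable(std::vector<LirNode>& lir) {
      bool stable = false;
      while (!stable) {
        int offset = 0;
        for (LirNode& node : lir) {  // Offset pass, repeated after any widening.
          node.offset = offset;
          offset += node.size;
        }
        stable = TryAssemble(lir);   // Assembly pass; may invalidate offsets.
      }
    }

In this shape, every widening (or deleted dead branch) restarts the
whole offset/assembly cycle, which is what the retry counts above are
measuring.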
Diffstat (limited to 'compiler/dex/quick/local_optimizations.cc')
-rw-r--r--  compiler/dex/quick/local_optimizations.cc  47
1 file changed, 5 insertions, 42 deletions
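For comparison, a rough sketch of the first "consider" item from the
commit message above (combine the offset and assembly passes,
deferring pc-relative instructions to a work list). The Insn type and
the EncodePcRelative() helper are assumptions made up for the
example, not ART code:

    #include <algorithm>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    struct Insn {
      bool pc_relative = false;
      uint32_t offset = 0;    // Assigned during the main pass.
      uint32_t size = 4;      // Encoded size in bytes, assumed known up front.
      uint32_t encoding = 0;  // Final bits; placeholder until fixup if pc-relative.
    };

    // Assumed target-specific helper; stubbed out for the sketch.
    uint32_t EncodePcRelative(const Insn& insn) {
      return insn.offset;  // Real code would fold in the target's final offset.
    }

    void AssembleWithWorklist(std::vector<Insn>& insns, std::vector<uint8_t>* code) {
      std::vector<Insn*> fixups;
      uint32_t offset = 0;
      // Single main pass: assign offsets; anything pc-relative goes on the
      // work list instead of being encoded now.
      for (Insn& insn : insns) {
        insn.offset = offset;
        offset += insn.size;
        if (insn.pc_relative) {
          fixups.push_back(&insn);
        }
      }
      // Short second pass over the work list only: all offsets are final, so
      // the pc-relative instructions can be encoded without invalidating
      // anything already laid out.
      for (Insn* insn : fixups) {
        insn->encoding = EncodePcRelative(*insn);
      }
      // Dump the bits into the code buffer.
      code->assign(offset, 0);
      for (const Insn& insn : insns) {
        std::memcpy(code->data() + insn.offset, &insn.encoding,
                    std::min<uint32_t>(insn.size, sizeof(insn.encoding)));
      }
    }

This trades a full LIR traversal per retry for one traversal plus a
much smaller work list, which is where the commit message expects the
savings to come from.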
diff --git a/compiler/dex/quick/local_optimizations.cc b/compiler/dex/quick/local_optimizations.cc
index 41adb946a2..cb7694de68 100644
--- a/compiler/dex/quick/local_optimizations.cc
+++ b/compiler/dex/quick/local_optimizations.cc
@@ -168,7 +168,7 @@ void Mir2Lir::ApplyLoadStoreElimination(LIR* head_lir, LIR* tail_lir) {
if (check_lir->operands[0] != native_reg_id) {
ConvertMemOpIntoMove(check_lir, check_lir->operands[0], native_reg_id);
}
- check_lir->flags.is_nop = true;
+ NopLIR(check_lir);
}
} else if (alias_condition == ENCODE_DALVIK_REG) {
/* Must alias */
@@ -187,7 +187,7 @@ void Mir2Lir::ApplyLoadStoreElimination(LIR* head_lir, LIR* tail_lir) {
native_reg_id) {
ConvertMemOpIntoMove(check_lir, check_lir->operands[0], native_reg_id);
}
- check_lir->flags.is_nop = true;
+ NopLIR(check_lir);
} else {
/*
* Destinaions are of different types -
@@ -201,7 +201,7 @@ void Mir2Lir::ApplyLoadStoreElimination(LIR* head_lir, LIR* tail_lir) {
stop_here = true;
} else if (!is_this_lir_load && !is_check_lir_load) {
/* WAW - nuke the earlier store */
- this_lir->flags.is_nop = true;
+ NopLIR(this_lir);
stop_here = true;
}
/* Partial overlap */
@@ -256,7 +256,7 @@ void Mir2Lir::ApplyLoadStoreElimination(LIR* head_lir, LIR* tail_lir) {
* top-down order.
*/
InsertLIRBefore(check_lir, new_store_lir);
- this_lir->flags.is_nop = true;
+ NopLIR(this_lir);
}
break;
} else if (!check_lir->flags.is_nop) {
@@ -452,7 +452,7 @@ void Mir2Lir::ApplyLoadHoisting(LIR* head_lir, LIR* tail_lir) {
* is never the first LIR on the list
*/
InsertLIRBefore(cur_lir, new_load_lir);
- this_lir->flags.is_nop = true;
+ NopLIR(this_lir);
}
}
}
@@ -467,41 +467,4 @@ void Mir2Lir::ApplyLocalOptimizations(LIR* head_lir, LIR* tail_lir) {
}
}
-/*
- * Nop any unconditional branches that go to the next instruction.
- * Note: new redundant branches may be inserted later, and we'll
- * use a check in final instruction assembly to nop those out.
- */
-void Mir2Lir::RemoveRedundantBranches() {
- LIR* this_lir;
-
- for (this_lir = first_lir_insn_; this_lir != last_lir_insn_; this_lir = NEXT_LIR(this_lir)) {
- /* Branch to the next instruction */
- if (IsUnconditionalBranch(this_lir)) {
- LIR* next_lir = this_lir;
-
- while (true) {
- next_lir = NEXT_LIR(next_lir);
-
- /*
- * Is the branch target the next instruction?
- */
- if (next_lir == this_lir->target) {
- this_lir->flags.is_nop = true;
- break;
- }
-
- /*
- * Found real useful stuff between the branch and the target.
- * Need to explicitly check the last_lir_insn_ here because it
- * might be the last real instruction.
- */
- if (!is_pseudo_opcode(next_lir->opcode) ||
- (next_lir == last_lir_insn_))
- break;
- }
- }
- }
-}
-
} // namespace art
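The hunks above replace direct writes of flags.is_nop with calls to a
NopLIR() helper; only the call sites appear in this file. A plausible
shape for such a helper - an assumption for illustration, not the
actual Mir2Lir::NopLIR() - is:

    // Hypothetical stand-in for the real LIR type declared in
    // compiler/dex/quick/mir_to_lir.h.
    struct LIR {
      struct {
        bool is_nop;
      } flags;
    };

    // Centralized "turn this LIR into a nop" operation (sketch only).
    void NopLIR(LIR* lir) {
      lir->flags.is_nop = true;  // Assembly and later passes skip nop'd nodes.
    }

Centralizing the operation means a follow-on change - for example,
really unlinking the node instead of nop'ing it, as the commit
message suggests - only has to touch one place rather than every
optimization pass.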