better NEON div255

We were doing (x+127)/255 = ((x+128) + ((x+128)>>8)) >> 8 in three instructions:
  1) x += 128
  2) shift x right 8 bits
  3) add x and x>>8 together, then shift right 8 more bits

Now we do it in two instructions:
  1) shift (x+128) right 8 bits
  2) add x, (x+128)>>8, and 128 all together, then shift right 8 more bits

On ARM this will be a 5-10% speedup for the SrcATop, DstATop, Xor, Multiply, Difference, HardLight, Darken, and Lighten xfermodes. When we have a mask (e.g. text), *all* xfermodes except Plus will get a similar boost.

This should mean that (a*b).div255() is now the same speed as a.approxMulDiv255(b) on both x86 and ARM, and of course it is exact instead of approximate. So we should eliminate approxMulDiv255(), but I'll leave that to another CL, as it will need Blink rebaselines. This CL should not change GMs or Blink.

https://gold.skia.org/search2?issue=1502843002&unt=true&query=source_type%3Dgm&master=false

BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot,Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot

Review URL: https://codereview.chromium.org/1502843002
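The old three-step and new two-step formulas described in the message can be checked exhaustively against (x+127)/255 in portable scalar C++. This is a sketch, not code from the CL: the names div255_old, div255_new, div255_ref, and count_div255_mismatches are hypothetical, each NEON step is modeled with plain integer arithmetic, and inputs are assumed to be products of two bytes (0 to 255*255).

```cpp
#include <cassert>
#include <cstdint>

// Assumption: inputs are products of two bytes, so 0 <= x <= 255*255 = 65025.
constexpr uint32_t kMaxProduct = 255u * 255u;

// Hypothetical scalar model of the old three-instruction div255.
inline uint8_t div255_old(uint16_t x) {
    assert(x <= kMaxProduct);
    uint32_t v = x + 128u;                        // 1) x += 128
    uint32_t hi = v >> 8;                         // 2) shift x right 8 bits
    return static_cast<uint8_t>((v + hi) >> 8);   // 3) add x and x>>8, shift right 8 more bits
}

// Hypothetical scalar model of the new two-instruction div255.
inline uint8_t div255_new(uint16_t x) {
    assert(x <= kMaxProduct);
    uint32_t r = (x + 128u) >> 8;                       // 1) rounding shift right: (x+128)>>8
    return static_cast<uint8_t>((x + r + 128u) >> 8);   // 2) add x, r and 128, keep the high byte
}

// Reference value, written the way the commit message writes it.
inline uint8_t div255_ref(uint16_t x) {
    return static_cast<uint8_t>((x + 127u) / 255u);
}

// Counts inputs in [0, 255*255] where either fast form disagrees with the reference.
inline int count_div255_mismatches() {
    int bad = 0;
    for (uint32_t x = 0; x <= kMaxProduct; ++x) {
        const uint16_t x16 = static_cast<uint16_t>(x);
        const uint8_t want = div255_ref(x16);
        if (div255_old(x16) != want || div255_new(x16) != want) {
            ++bad;
        }
    }
    return bad;
}
```

Calling count_div255_mismatches() should return 0. The largest intermediate sum in either form is 65025 + 254 + 128 = 65407, so both fit in 16-bit lanes without overflow.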
author: mtklein <mtklein@chromium.org> 2015-12-07 08:21:11 -0800
committer: Steve Kondik <steve@cyngn.com> 2016-06-27 17:37:25 -0700
commit: 20d94523acceeaed59bca57e5c75d56bc066c21b
tree: df76995163f6b934171ec08054370afed49ed067
parent: a0bea85aea71d427007dc0d441eb47d002b68428
1 file changed, 3 insertions, 3 deletions
diff --git a/src/opts/Sk4px_NEON.h b/src/opts/Sk4px_NEON.h
index a8def1d418..62f1deb4ac 100644
--- a/src/opts/Sk4px_NEON.h
+++ b/src/opts/Sk4px_NEON.h
@@ -58,9 +58,9 @@ inline Sk4px Sk4px::Wide::addNarrowHi(const Sk16h& other) const {
 }
 
 inline Sk4px Sk4px::Wide::div255() const {
-    // Calculated as ((x+128) + ((x+128)>>8)) >> 8.
-    auto v = *this + Sk16h(128);
-    return v.addNarrowHi(v>>8);
+    // Calculated as (x + (x+128)>>8 +128) >> 8.  The 'r' in each instruction provides each +128.
+    return Sk16b(vcombine_u8(vraddhn_u16(this->fLo.fVec, vrshrq_n_u16(this->fLo.fVec, 8)),
+                             vraddhn_u16(this->fHi.fVec, vrshrq_n_u16(this->fHi.fVec, 8))));
 }
 
 inline Sk4px Sk4px::alphas() const {
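The new div255() body can be mirrored lane by lane on hardware without NEON. This is a scalar sketch, not Skia code: U16x8, U8x8, U8x16, rshr8, raddhn, and div255_wide are hypothetical names, and the models assume the ARM-documented behavior of the intrinsics, namely that vrshrq_n_u16(v, 8) yields (v + 128) >> 8 in each 16-bit lane and vraddhn_u16(a, b) yields the high byte of a + b + 128. The two input halves stand in for fLo and fHi, joined the way vcombine_u8 joins them.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the NEON vector types used in Sk4px_NEON.h.
using U16x8 = std::array<uint16_t, 8>;   // like uint16x8_t
using U8x8  = std::array<uint8_t, 8>;    // like uint8x8_t
using U8x16 = std::array<uint8_t, 16>;   // like uint8x16_t

// Scalar model of vrshrq_n_u16(v, 8): rounding shift right, (v + 128) >> 8 per lane.
inline U16x8 rshr8(const U16x8& v) {
    U16x8 out{};
    for (int i = 0; i < 8; ++i) {
        out[i] = static_cast<uint16_t>((v[i] + 128u) >> 8);
    }
    return out;
}

// Scalar model of vraddhn_u16(a, b): rounding add, then narrow to bits 15..8 of each sum.
inline U8x8 raddhn(const U16x8& a, const U16x8& b) {
    U8x8 out{};
    for (int i = 0; i < 8; ++i) {
        const uint32_t sum = a[i] + b[i] + 128u;
        out[i] = static_cast<uint8_t>((sum >> 8) & 0xFFu);
    }
    return out;
}

// Mirrors the new Sk4px::Wide::div255(): each half is raddhn(x, rshr8(x)), then combined.
inline U8x16 div255_wide(const U16x8& lo, const U16x8& hi) {
    for (int i = 0; i < 8; ++i) {
        // Exact rounding is only claimed for products of two bytes.
        assert(lo[i] <= 255u * 255u && hi[i] <= 255u * 255u);
    }
    const U8x8 nlo = raddhn(lo, rshr8(lo));
    const U8x8 nhi = raddhn(hi, rshr8(hi));
    U8x16 out{};
    for (int i = 0; i < 8; ++i) {
        out[i] = nlo[i];      // vcombine_u8: low half in bytes 0..7
        out[i + 8] = nhi[i];  // high half in bytes 8..15
    }
    return out;
}
```

Because each intrinsic folds a +128 into its rounding, the body no longer needs the separate add of Sk16h(128); that is the instruction this CL removes. For every lane value up to 255*255, each output byte equals (x+127)/255, the same result the old addNarrowHi form produced.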