aboutsummaryrefslogtreecommitdiffstats
path: root/guava/src/com/google/common/hash/HashFunction.java
blob: 038626124a4c11e3425188553cb7817ce225f582 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
/*
 * Copyright (C) 2011 The Guava Authors
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
 * in compliance with the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software distributed under the License
 * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
 * or implied. See the License for the specific language governing permissions and limitations under
 * the License.
 */

package com.google.common.hash;

import com.google.common.annotations.Beta;
import com.google.common.primitives.Ints;

import java.nio.charset.Charset;

/**
 * A hash function is a collision-averse pure function that maps an arbitrary block of
 * data to a number called a <i>hash code</i>.
 *
 * <h3>Definition</h3>
 *
 * <p>Unpacking this definition:
 *
 * <ul>
 * <li><b>block of data:</b> the input for a hash function is always, in concept, an
 *     ordered byte array. This hashing API accepts an arbitrary sequence of byte and
 *     multibyte values (via {@link Hasher}), but this is merely a convenience; these are
 *     always translated into raw byte sequences under the covers.
 *
 * <li><b>hash code:</b> each hash function always yields hash codes of the same fixed bit
 *     length (given by {@link #bits}). For example, {@link Hashing#sha1} produces a
 *     160-bit number, while {@link Hashing#murmur3_32()} yields only 32 bits. Because a
 *     {@code long} value is clearly insufficient to hold all hash code values, this API
 *     represents a hash code as an instance of {@link HashCode}.
 *
 * <li><b>pure function:</b> the value produced must depend only on the input bytes, in
 *     the order they appear. Input data is never modified. {@link HashFunction} instances
 *     should always be stateless, and therefore thread-safe.
 *
 * <li><b>collision-averse:</b> while it can't be helped that a hash function will
 *     sometimes produce the same hash code for distinct inputs (a "collision"), every
 *     hash function strives to <i>some</i> degree to make this unlikely. (Without this
 *     condition, a function that always returns zero could be called a hash function. It
 *     is not.)
 * </ul>
 *
 * <p>Summarizing the last two points: "equal yield equal <i>always</i>; unequal yield
 * unequal <i>often</i>." This is the most important characteristic of all hash functions.
 *
 * <h3>Desirable properties</h3>
 *
 * <p>A high-quality hash function strives for some subset of the following virtues:
 *
 * <ul>
 * <li><b>collision-resistant:</b> while the definition above requires making at least
 *     <i>some</i> token attempt, one measure of the quality of a hash function is <i>how
 *     well</i> it succeeds at this goal. Important note: it may be easy to achieve the
 *     theoretical minimum collision rate when using completely <i>random</i> sample
 *     input. The true test of a hash function is how it performs on representative
 *     real-world data, which tends to contain many hidden patterns and clumps. The goal
 *     of a good hash function is to stamp these patterns out as thoroughly as possible.
 *
 * <li><b>bit-dispersing:</b> masking out any <i>single bit</i> from a hash code should
 *     yield only the expected <i>twofold</i> increase to all collision rates. Informally,
 *     the "information" in the hash code should be as evenly "spread out" through the
 *     hash code's bits as possible. The result is that, for example, when choosing a
 *     bucket in a hash table of size 2^8, <i>any</i> eight bits could be consistently
 *     used.
 *
 * <li><b>cryptographic:</b> certain hash functions such as {@link Hashing#sha512} are
 *     designed to make it as infeasible as possible to reverse-engineer the input that
 *     produced a given hash code, or even to discover <i>any</i> two distinct inputs that
 *     yield the same result. These are called <i>cryptographic hash functions</i>. But,
 *     whenever it is learned that either of these feats has become computationally
 *     feasible, the function is deemed "broken" and should no longer be used for secure
 *     purposes. (This is the likely eventual fate of <i>all</i> cryptographic hashes.)
 *
 * <li><b>fast:</b> perhaps self-explanatory, but often the most important consideration.
 *     We have published <a href="#noWeHaventYet">microbenchmark results</a> for many
 *     common hash functions.
 * </ul>
 *
 * <h3>Providing input to a hash function</h3>
 *
 * <p>The primary way to provide the data that your hash function should act on is via a
 * {@link Hasher}. Obtain a new hasher from the hash function using {@link #newHasher},
 * "push" the relevant data into it using methods like {@link Hasher#putBytes(byte[])},
 * and finally ask for the {@code HashCode} when finished using {@link Hasher#hash}. (See
 * an {@linkplain #newHasher example} of this.)
 *
 * <p>If all you want to hash is a single byte array, string or {@code long} value, there
 * are convenient shortcut methods defined directly on {@link HashFunction} to make this
 * easier.
 *
 * <p>Hasher accepts primitive data types, but can also accept any Object of type {@code
 * T} provided that you implement a {@link Funnel Funnel<T>} to specify how to "feed" data
 * from that object into the function. (See {@linkplain Hasher#putObject an example} of
 * this.)
 *
 * <p><b>Compatibility note:</b> Throughout this API, multibyte values are always
 * interpreted in <i>little-endian</i> order. That is, hashing the byte array {@code
 * {0x01, 0x02, 0x03, 0x04}} is equivalent to hashing the {@code int} value {@code
 * 0x04030201}. If this isn't what you need, methods such as {@link Integer#reverseBytes}
 * and {@link Ints#toByteArray} will help.
 *
 * <h3>Relationship to {@link Object#hashCode}</h3>
 *
 * <p>Java's baked-in concept of hash codes is constrained to 32 bits, and provides no
 * separation between hash algorithms and the data they act on, so alternate hash
 * algorithms can't be easily substituted. Also, implementations of {@code hashCode} tend
 * to be poor-quality, in part because they end up depending on <i>other</i> existing
 * poor-quality {@code hashCode} implementations, including those in many JDK classes.
 *
 * <p>{@code Object.hashCode} implementations tend to be very fast, but have weak
 * collision prevention and <i>no</i> expectation of bit dispersion. This leaves them
 * perfectly suitable for use in hash tables, because extra collisions cause only a slight
 * performance hit, while poor bit dispersion is easily corrected using a secondary hash
 * function (which all reasonable hash table implementations in Java use). For the many
 * uses of hash functions beyond data structures, however, {@code Object.hashCode} almost
 * always falls short -- hence this library.
 *
 * @author Kevin Bourrillion
 * @since 11.0
 */
@Beta
public interface HashFunction {
  /**
   * Begins a new hash code computation by returning an initialized, stateful {@code
   * Hasher} instance that is ready to receive data. Example: <pre>   {@code
   *
   *   HashFunction hf = Hashing.md5();
   *   HashCode hc = hf.newHasher()
   *       .putLong(id)
   *       .putString(name)
   *       .hash();}</pre>
   */
  Hasher newHasher();

  /**
   * Begins a new hash code computation as {@link #newHasher()}, but provides a hint of the
   * expected size of the input (in bytes). This is only important for non-streaming hash
   * functions (hash functions that need to buffer their whole input before processing any
   * of it).
   */
  Hasher newHasher(int expectedInputSize);

  /**
   * Shortcut for {@code newHasher().putInt(input).hash()}; returns the hash code for the given
   * {@code int} value, interpreted in little-endian byte order. The implementation <i>might</i>
   * perform better than its longhand equivalent, but should not perform worse.
   *
   * @since 12.0
   */
  HashCode hashInt(int input);

  /**
   * Shortcut for {@code newHasher().putLong(input).hash()}; returns the hash code for the
   * given {@code long} value, interpreted in little-endian byte order. The implementation
   * <i>might</i> perform better than its longhand equivalent, but should not perform worse.
   */
  HashCode hashLong(long input);

  /**
   * Shortcut for {@code newHasher().putBytes(input).hash()}. The implementation
   * <i>might</i> perform better than its longhand equivalent, but should not perform
   * worse.
   */
  HashCode hashBytes(byte[] input);

  /**
   * Shortcut for {@code newHasher().putBytes(input, off, len).hash()}. The implementation
   * <i>might</i> perform better than its longhand equivalent, but should not perform
   * worse.
   *
   * @throws IndexOutOfBoundsException if {@code off < 0} or {@code off + len > bytes.length}
   *   or {@code len < 0}
   */
  HashCode hashBytes(byte[] input, int off, int len);

  /**
   * Shortcut for {@code newHasher().putString(input).hash()}. The implementation <i>might</i>
   * perform better than its longhand equivalent, but should not perform worse. Note that no
   * character encoding is performed; the low byte and high byte of each character are hashed
   * directly (in that order). This is equivalent to using
   * {@code hashString(input, Charsets.UTF_16LE)}.
   */
  HashCode hashString(CharSequence input);

  /**
   * Shortcut for {@code newHasher().putString(input, charset).hash()}. Characters are encoded
   * using the given {@link Charset}. The implementation <i>might</i> perform better than its
   * longhand equivalent, but should not perform worse.
   */
  HashCode hashString(CharSequence input, Charset charset);

  /**
   * Shortcut for {@code newHasher().putObject(instance, funnel).hash()}. The implementation
   * <i>might</i> perform better than its longhand equivalent, but should not perform worse.
   *
   * @since 14.0
   */
  <T> HashCode hashObject(T instance, Funnel<? super T> funnel);

  /**
   * Returns the number of bits (a multiple of 32) that each hash code produced by this
   * hash function has.
   */
  int bits();
}