This small Jupyter notebook is a short experiment, to compare the time complexity of three different implementations of the SHA-256 hash function, in pure Python, with Cython, and with Numba.
I will copy the API proposed by the hashlib
module in Python standard library, so it will be very easy to compare my implementations with the one provided with your default Python installation.
class Hash(object):
""" Common class for all hash methods.
It copies the one of the hashlib module (https://docs.python.org/3.5/library/hashlib.html).
"""
def __init__(self, *args, **kwargs):
""" Create the Hash object."""
self.name = self.__class__.__name__ # https://docs.python.org/3.5/library/hashlib.html#hashlib.hash.name
self.byteorder = 'little'
self.digest_size = 0 # https://docs.python.org/3.5/library/hashlib.html#hashlib.hash.digest_size
self.block_size = 0 # https://docs.python.org/3.5/library/hashlib.html#hashlib.hash.block_size
def __str__(self):
return self.name
def update(self, arg):
""" Update the hash object with the object arg, which must be interpretable as a buffer of bytes."""
pass
def digest(self):
""" Return the digest of the data passed to the update() method so far. This is a bytes object of size digest_size which may contain bytes in the whole range from 0 to 255."""
return b""
def hexdigest(self):
""" Like digest() except the digest is returned as a string object of double length, containing only hexadecimal digits. This may be used to exchange the value safely in email or other non-binary environments."""
digest = self.digest()
raw = digest.to_bytes(self.digest_size, byteorder=self.byteorder)
format_str = '{:0' + str(2 * self.digest_size) + 'x}'
return format_str.format(int.from_bytes(raw, byteorder='big'))
hashlib
module in Python standard library¶import hashlib
We can check the available algorithms, some of them being guaranteed to be on any platform, some are not.
list(hashlib.algorithms_available)
['SHA', 'dsaWithSHA', 'dsaEncryption', 'md4', 'md5', 'sha512', 'sha256', 'sha1', 'RIPEMD160', 'ecdsa-with-SHA1', 'MD4', 'SHA256', 'MD5', 'SHA224', 'SHA1', 'sha224', 'whirlpool', 'DSA', 'ripemd160', 'sha', 'SHA512', 'SHA384', 'DSA-SHA', 'sha384']
I will need at least this one:
assert 'SHA256' in hashlib.algorithms_available
Lets check that they have the block size and digest size announced:
name = 'SHA256'
s = hashlib.sha256()
print("For {:<8} : the block size is {:<3} and the digest size is {:<2}.".format(name, s.block_size, s.digest_size))
For SHA256 : the block size is 64 and the digest size is 32.
Let now study and implement a last hashing function, again slightly harder to write but more secure: SHA-2, "Secure Hash Algorithm, version 2". See the SHA-2 hashing function on Wikipedia, if needed.
This is exactly like for MD5. But SHA-2 requires right-rotate as well.
def leftrotate(x, c):
""" Left rotate the number x by c bytes."""
x &= 0xFFFFFFFF
return ((x << c) | (x >> (32 - c))) & 0xFFFFFFFF
def rightrotate(x, c):
""" Right rotate the number x by c bytes."""
x &= 0xFFFFFFFF
return ((x >> c) | (x << (32 - c))) & 0xFFFFFFFF
As SHA-2 plays with big-endian and little-endian integers, and at the end it requires a leftshift to combine the 5 hash pieces into one.
def leftshift(x, c):
""" Left shift the number x by c bytes."""
return x << c
def rightshift(x, c):
""" Right shift the number x by c bytes."""
return x >> c
SHA2
class¶I will use a simple class, very similar to the class used for the SHA-1 algorithm (see above). It is a direct implementation of the pseudo-code, as given for instance on the Wikipedia page.
I will only implement the simpler one, SHA-256, of digest size of 256 bits. Other variants are SHA-224, SHA-384, SHA-512 (and others include SHA-512/224, SHA-512/256).
class SHA2(Hash):
"""SHA256 hashing, see https://en.wikipedia.org/wiki/SHA-2#Pseudocode."""
def __init__(self):
self.name = "SHA256"
self.byteorder = 'big'
self.block_size = 64
self.digest_size = 32
# Note 2: For each round, there is one round constant k[i] and one entry in the message schedule array w[i], 0 ≤ i ≤ 63
# Note 3: The compression function uses 8 working variables, a through h
# Note 4: Big-endian convention is used when expressing the constants in this pseudocode,
# and when parsing message block data from bytes to words, for example,
# the first word of the input message "abc" after padding is 0x61626380
# Initialize hash values:
# (first 32 bits of the fractional parts of the square roots of the first 8 primes 2..19):
h0 = 0x6a09e667
h1 = 0xbb67ae85
h2 = 0x3c6ef372
h3 = 0xa54ff53a
h4 = 0x510e527f
h5 = 0x9b05688c
h6 = 0x1f83d9ab
h7 = 0x5be0cd19
# Initialize array of round constants:
# (first 32 bits of the fractional parts of the cube roots of the first 64 primes 2..311):
self.k = [
0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
]
# Store them
self.hash_pieces = [h0, h1, h2, h3, h4, h5, h6, h7]
def update(self, arg):
h0, h1, h2, h3, h4, h5, h6, h7 = self.hash_pieces
# 1. Pre-processing, exactly like MD5
data = bytearray(arg)
orig_len_in_bits = (8 * len(data)) & 0xFFFFFFFFFFFFFFFF
# 1.a. Add a single '1' bit at the end of the input bits
data.append(0x80)
# 1.b. Padding with zeros as long as the input bits length ≡ 448 (mod 512)
while len(data) % 64 != 56:
data.append(0)
# 1.c. append original length in bits mod (2 pow 64) to message
data += orig_len_in_bits.to_bytes(8, byteorder='big')
assert len(data) % 64 == 0, "Error in padding"
# 2. Computations
# Process the message in successive 512-bit = 64-bytes chunks:
for offset in range(0, len(data), 64):
# 2.a. 512-bits = 64-bytes chunks
chunks = data[offset : offset + 64]
w = [0 for i in range(64)]
# 2.b. Break chunk into sixteen 32-bit = 4-bytes words w[i], 0 ≤ i ≤ 15
for i in range(16):
w[i] = int.from_bytes(chunks[4*i : 4*i + 4], byteorder='big')
# 2.c. Extend the first 16 words into the remaining 48
# words w[16..63] of the message schedule array:
for i in range(16, 64):
s0 = (rightrotate(w[i-15], 7) ^ rightrotate(w[i-15], 18) ^ rightshift(w[i-15], 3)) & 0xFFFFFFFF
s1 = (rightrotate(w[i-2], 17) ^ rightrotate(w[i-2], 19) ^ rightshift(w[i-2], 10)) & 0xFFFFFFFF
w[i] = (w[i-16] + s0 + w[i-7] + s1) & 0xFFFFFFFF
# 2.d. Initialize hash value for this chunk
a, b, c, d, e, f, g, h = h0, h1, h2, h3, h4, h5, h6, h7
# 2.e. Main loop, cf. https://tools.ietf.org/html/rfc6234
for i in range(64):
S1 = (rightrotate(e, 6) ^ rightrotate(e, 11) ^ rightrotate(e, 25)) & 0xFFFFFFFF
ch = ((e & f) ^ ((~e) & g)) & 0xFFFFFFFF
temp1 = (h + S1 + ch + self.k[i] + w[i]) & 0xFFFFFFFF
S0 = (rightrotate(a, 2) ^ rightrotate(a, 13) ^ rightrotate(a, 22)) & 0xFFFFFFFF
maj = ((a & b) ^ (a & c) ^ (b & c)) & 0xFFFFFFFF
temp2 = (S0 + maj) & 0xFFFFFFFF
new_a = (temp1 + temp2) & 0xFFFFFFFF
new_e = (d + temp1) & 0xFFFFFFFF
# Rotate the 8 variables
a, b, c, d, e, f, g, h = new_a, a, b, c, new_e, e, f, g
# Add this chunk's hash to result so far:
h0 = (h0 + a) & 0xFFFFFFFF
h1 = (h1 + b) & 0xFFFFFFFF
h2 = (h2 + c) & 0xFFFFFFFF
h3 = (h3 + d) & 0xFFFFFFFF
h4 = (h4 + e) & 0xFFFFFFFF
h5 = (h5 + f) & 0xFFFFFFFF
h6 = (h6 + g) & 0xFFFFFFFF
h7 = (h7 + h) & 0xFFFFFFFF
# 3. Conclusion
self.hash_pieces = [h0, h1, h2, h3, h4, h5, h6, h7]
def digest(self):
# h0 append h1 append h2 append h3 append h4 append h5 append h6 append h7
return sum(leftshift(x, 32 * i) for i, x in enumerate(self.hash_pieces[::-1]))
We can also write a function to directly compute the hex digest from some bytes data.
def hash_SHA2(data):
""" Shortcut function to directly receive the hex digest from SHA2(data)."""
h = SHA2()
if isinstance(data, str):
data = bytes(data, encoding='utf8')
h.update(data)
return h.hexdigest()
Let try the example from SHA-2 Wikipedia page :
hash_SHA2("The quick brown fox jumps over the lazy dog")
assert hash_SHA2("The quick brown fox jumps over the lazy dog") == 'd7a8fbb307d7809469ca9abcb0082e4f8d5651e46d3cdb762d02d0bf37c9e592'
'd7a8fbb307d7809469ca9abcb0082e4f8d5651e46d3cdb762d02d0bf37c9e592'
Even a small change in the message will (with overwhelming probability) result in a mostly different hash, due to the avalanche effect. For example, adding a period at the end of the sentence:
hash_SHA2("The quick brown fox jumps over the lazy dog.")
assert hash_SHA2("The quick brown fox jumps over the lazy dog.") == 'ef537f25c895bfa782526529a9b63d97aa631564d5d789c2b765448c8635fb6c'
'ef537f25c895bfa782526529a9b63d97aa631564d5d789c2b765448c8635fb6c'
The hash of the zero-length string is:
hash_SHA2("")
assert hash_SHA2("") == 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'
'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'
$\implies$ We obtained the same result, OK our function works!
On a small sentence:
hash_SHA2("My name is Zorro !")
'10a6dec377c28d2d34001c103760339f8d6ab02660d200d6014329b86253552c'
h = hashlib.sha256()
h.update(b"My name is Zorro !")
h.hexdigest()
'10a6dec377c28d2d34001c103760339f8d6ab02660d200d6014329b86253552c'
It starts to look good.
def true_hash_SHA2(data):
h = hashlib.sha256()
if isinstance(data, str):
data = bytes(data, encoding='utf8')
h.update(data)
return h.hexdigest()
On some random data:
import numpy.random as nr
alphabets = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
def random_string(size=10000):
return ''.join(alphabets[nr.randint(len(alphabets))] for _ in range(size))
random_string(10)
'bXOwDYtUOz'
from tqdm import tqdm_notebook as tqdm
%%time
for _ in tqdm(range(1000)):
x = random_string()
assert hash_SHA2(x) == true_hash_SHA2(x), "Error: x = {} gave two different SHA2 hashes: my implementation = {} != hashlib implementation = {}...".format(x, hash_SHA2(x), true_hash_SHA2(x))
CPU times: user 1min 2s, sys: 88 ms, total: 1min 2s Wall time: 1min 2s
from numba import jit, jitclass
@jit
def leftrotate_numba(x, c):
""" Left rotate the number x by c bytes."""
x &= 0xFFFFFFFF
return ((x << c) | (x >> (32 - c))) & 0xFFFFFFFF
@jit
def rightrotate_numba(x, c):
""" Right rotate the number x by c bytes."""
x &= 0xFFFFFFFF
return ((x >> c) | (x << (32 - c))) & 0xFFFFFFFF
@jit
def leftshift_numba(x, c):
""" Left shift the number x by c bytes."""
return x << c
@jit
def rightshift_numba(x, c):
""" Right shift the number x by c bytes."""
return x >> c
class SHA2_Numba(Hash):
"""SHA256 hashing, speed-up with Numba.jit, see https://en.wikipedia.org/wiki/SHA-2#Pseudocode."""
def __init__(self):
self.name = "SHA256"
self.byteorder = 'big'
self.block_size = 64
self.digest_size = 32
# Note 2: For each round, there is one round constant k[i] and one entry in the message schedule array w[i], 0 ≤ i ≤ 63
# Note 3: The compression function uses 8 working variables, a through h
# Note 4: Big-endian convention is used when expressing the constants in this pseudocode,
# and when parsing message block data from bytes to words, for example,
# the first word of the input message "abc" after padding is 0x61626380
# Initialize hash values:
# (first 32 bits of the fractional parts of the square roots of the first 8 primes 2..19):
h0 = 0x6a09e667
h1 = 0xbb67ae85
h2 = 0x3c6ef372
h3 = 0xa54ff53a
h4 = 0x510e527f
h5 = 0x9b05688c
h6 = 0x1f83d9ab
h7 = 0x5be0cd19
# Initialize array of round constants:
# (first 32 bits of the fractional parts of the cube roots of the first 64 primes 2..311):
self.k = [
0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
]
# Store them
self.hash_pieces = [h0, h1, h2, h3, h4, h5, h6, h7]
@jit
def update(self, arg):
h0, h1, h2, h3, h4, h5, h6, h7 = self.hash_pieces
# 1. Pre-processing, exactly like MD5
data = bytearray(arg)
orig_len_in_bits = (8 * len(data)) & 0xFFFFFFFFFFFFFFFF
# 1.a. Add a single '1' bit at the end of the input bits
data.append(0x80)
# 1.b. Padding with zeros as long as the input bits length ≡ 448 (mod 512)
while len(data) % 64 != 56:
data.append(0)
# 1.c. append original length in bits mod (2 pow 64) to message
data += orig_len_in_bits.to_bytes(8, byteorder='big')
assert len(data) % 64 == 0, "Error in padding"
# 2. Computations
# Process the message in successive 512-bit = 64-bytes chunks:
for offset in range(0, len(data), 64):
# 2.a. 512-bits = 64-bytes chunks
chunks = data[offset : offset + 64]
w = [0 for i in range(64)]
# 2.b. Break chunk into sixteen 32-bit = 4-bytes words w[i], 0 ≤ i ≤ 15
for i in range(16):
w[i] = int.from_bytes(chunks[4*i : 4*i + 4], byteorder='big')
# 2.c. Extend the first 16 words into the remaining 48
# words w[16..63] of the message schedule array:
for i in range(16, 64):
s0 = (rightrotate(w[i-15], 7) ^ rightrotate(w[i-15], 18) ^ rightshift(w[i-15], 3)) & 0xFFFFFFFF
s1 = (rightrotate(w[i-2], 17) ^ rightrotate(w[i-2], 19) ^ rightshift(w[i-2], 10)) & 0xFFFFFFFF
w[i] = (w[i-16] + s0 + w[i-7] + s1) & 0xFFFFFFFF
# 2.d. Initialize hash value for this chunk
a, b, c, d, e, f, g, h = h0, h1, h2, h3, h4, h5, h6, h7
# 2.e. Main loop, cf. https://tools.ietf.org/html/rfc6234
for i in range(64):
S1 = (rightrotate(e, 6) ^ rightrotate(e, 11) ^ rightrotate(e, 25)) & 0xFFFFFFFF
ch = ((e & f) ^ ((~e) & g)) & 0xFFFFFFFF
temp1 = (h + S1 + ch + self.k[i] + w[i]) & 0xFFFFFFFF
S0 = (rightrotate(a, 2) ^ rightrotate(a, 13) ^ rightrotate(a, 22)) & 0xFFFFFFFF
maj = ((a & b) ^ (a & c) ^ (b & c)) & 0xFFFFFFFF
temp2 = (S0 + maj) & 0xFFFFFFFF
new_a = (temp1 + temp2) & 0xFFFFFFFF
new_e = (d + temp1) & 0xFFFFFFFF
# Rotate the 8 variables
a, b, c, d, e, f, g, h = new_a, a, b, c, new_e, e, f, g
# Add this chunk's hash to result so far:
h0 = (h0 + a) & 0xFFFFFFFF
h1 = (h1 + b) & 0xFFFFFFFF
h2 = (h2 + c) & 0xFFFFFFFF
h3 = (h3 + d) & 0xFFFFFFFF
h4 = (h4 + e) & 0xFFFFFFFF
h5 = (h5 + f) & 0xFFFFFFFF
h6 = (h6 + g) & 0xFFFFFFFF
h7 = (h7 + h) & 0xFFFFFFFF
# 3. Conclusion
self.hash_pieces = [h0, h1, h2, h3, h4, h5, h6, h7]
def digest(self):
# h0 append h1 append h2 append h3 append h4 append h5 append h6 append h7
return sum(leftshift(x, 32 * i) for i, x in enumerate(self.hash_pieces[::-1]))
We can also write a function to directly compute the hex digest from some bytes data.
def hash_SHA2_Numba(data):
""" Shortcut function to directly receive the hex digest from SHA2_Numba(data)."""
h = SHA2_Numba()
if isinstance(data, str):
data = bytes(data, encoding='utf8')
h.update(data)
return h.hexdigest()
Let try the example from SHA-2 Wikipedia page :
hash_SHA2_Numba("The quick brown fox jumps over the lazy dog")
assert hash_SHA2_Numba("The quick brown fox jumps over the lazy dog") == 'd7a8fbb307d7809469ca9abcb0082e4f8d5651e46d3cdb762d02d0bf37c9e592'
--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) <ipython-input-44-12552f1afe46> in <module>() ----> 1 hash_SHA2_Numba("The quick brown fox jumps over the lazy dog") 2 assert hash_SHA2_Numba("The quick brown fox jumps over the lazy dog") == 'd7a8fbb307d7809469ca9abcb0082e4f8d5651e46d3cdb762d02d0bf37c9e592' <ipython-input-43-57e557bf04d8> in hash_SHA2_Numba(data) 4 if isinstance(data, str): 5 data = bytes(data, encoding='utf8') ----> 6 h.update(data) 7 return h.hexdigest() /usr/local/lib/python3.5/dist-packages/numba/dispatcher.py in _compile_for_args(self, *args, **kws) 284 argtypes.append(self.typeof_pyval(a)) 285 try: --> 286 return self.compile(tuple(argtypes)) 287 except errors.TypingError as e: 288 # Intercept typing error that may be due to an argument /usr/local/lib/python3.5/dist-packages/numba/dispatcher.py in compile(self, sig) 530 531 self._cache_misses[sig] += 1 --> 532 cres = self._compiler.compile(args, return_type) 533 self.add_overload(cres) 534 self._cache.save_overload(sig, cres) /usr/local/lib/python3.5/dist-packages/numba/dispatcher.py in compile(self, args, return_type) 79 impl, 80 args=args, return_type=return_type, ---> 81 flags=flags, locals=self.locals) 82 # Check typing error if object mode is used 83 if cres.typing_error is not None and not flags.enable_pyobject: /usr/local/lib/python3.5/dist-packages/numba/compiler.py in compile_extra(typingctx, targetctx, func, args, return_type, flags, locals, library) 691 pipeline = Pipeline(typingctx, targetctx, library, 692 args, return_type, flags, locals) --> 693 return pipeline.compile_extra(func) 694 695 /usr/local/lib/python3.5/dist-packages/numba/compiler.py in compile_extra(self, func) 348 self.lifted = () 349 self.lifted_from = None --> 350 return self._compile_bytecode() 351 352 def compile_ir(self, func_ir, lifted=(), lifted_from=None): /usr/local/lib/python3.5/dist-packages/numba/compiler.py in _compile_bytecode(self) 656 """ 657 assert self.func_ir is None --> 658 return self._compile_core() 659 660 def _compile_ir(self): /usr/local/lib/python3.5/dist-packages/numba/compiler.py in _compile_core(self) 643 644 pm.finalize() --> 645 res = pm.run(self.status) 646 if res is not None: 647 # Early pipeline completion /usr/local/lib/python3.5/dist-packages/numba/compiler.py in run(self, status) 234 # No more fallback pipelines? 235 if is_final_pipeline: --> 236 raise patched_exception 237 # Go to next fallback pipeline 238 else: /usr/local/lib/python3.5/dist-packages/numba/compiler.py in run(self, status) 226 try: 227 event(stage_name) --> 228 stage() 229 except _EarlyPipelineCompletion as e: 230 return e.result /usr/local/lib/python3.5/dist-packages/numba/compiler.py in stage_analyze_bytecode(self) 362 Analyze bytecode and translating to Numba IR 363 """ --> 364 func_ir = translate_stage(self.func_id, self.bc) 365 self._set_and_check_ir(func_ir) 366 /usr/local/lib/python3.5/dist-packages/numba/compiler.py in translate_stage(func_id, bytecode) 754 def translate_stage(func_id, bytecode): 755 interp = interpreter.Interpreter(func_id) --> 756 return interp.interpret(bytecode) 757 758 /usr/local/lib/python3.5/dist-packages/numba/interpreter.py in interpret(self, bytecode) 95 # Data flow analysis 96 self.dfa = dataflow.DataFlowAnalysis(self.cfa) ---> 97 self.dfa.run() 98 99 # Temp states during interpretation /usr/local/lib/python3.5/dist-packages/numba/dataflow.py in run(self) 25 def run(self): 26 for blk in self.cfa.iterliveblocks(): ---> 27 self.infos[blk.offset] = self.run_on_block(blk) 28 29 def run_on_block(self, blk): /usr/local/lib/python3.5/dist-packages/numba/dataflow.py in run_on_block(self, blk) 69 for offset in blk: 70 inst = self.bytecode[offset] ---> 71 self.dispatch(info, inst) 72 return info 73 /usr/local/lib/python3.5/dist-packages/numba/dataflow.py in dispatch(self, info, inst) 78 def dispatch(self, info, inst): 79 fname = "op_%s" % inst.opname.replace('+', '_') ---> 80 fn = getattr(self, fname) 81 fn(info, inst) 82 AttributeError: Failed at object (analyzing bytecode) 'DataFlowAnalysis' object has no attribute 'op_MAKE_FUNCTION'
I failed to make numba.jit
work on that function :-(
SHA-2
hashing function¶%load_ext cython
For the functions defined before, we rewrite them with type annotations in %%cython
cells.
All variables are int
, i.e., 32-bits integer (64-bits are long
).
%%cython
cpdef int leftrotate_cython(int x, int c):
""" Left rotate the number x by c bytes."""
return (x << c) | (x >> (32 - c))
cpdef int rightrotate_cython(int x, int c):
""" Right rotate the number x by c bytes."""
return (x >> c) | (x << (32 - c))
leftrotate_cython?
rightrotate_cython?
Docstring: Left rotate the number x by c bytes. Type: builtin_function_or_method Docstring: Right rotate the number x by c bytes. Type: builtin_function_or_method
On basic functions like this, of course we don't get any speedup with Cython:
from numpy.random import randint
%timeit leftrotate(randint(0, 100000), 5)
%timeit leftrotate_cython(randint(0, 100000), 5)
%timeit rightrotate(randint(0, 100000), 5)
%timeit rightrotate_cython(randint(0, 100000), 5)
1.57 µs ± 42.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) 1.4 µs ± 60.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) 1.63 µs ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) 1.44 µs ± 88.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%%cython
cpdef int leftshift_cython(int x, int c):
""" Left shift the number x by c bytes."""
return x << c
cpdef int rightshift_cython(int x, int c):
""" Right shift the number x by c bytes."""
return x >> c
leftshift_cython?
rightshift_cython?
Docstring: Left shift the number x by c bytes. Type: builtin_function_or_method Docstring: Right shift the number x by c bytes. Type: builtin_function_or_method
On basic functions like this, of course we don't get any speedup with Cython:
%timeit leftshift(randint(0, 100000), 5)
%timeit leftshift_cython(randint(0, 100000), 5)
%timeit rightshift(randint(0, 100000), 5)
%timeit rightshift_cython(randint(0, 100000), 5)
1.4 µs ± 33.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) 1.31 µs ± 24.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) 1.36 µs ± 35 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) 1.33 µs ± 51.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
SHA2_Cython
class¶And similarly for the SHA2
class, we write it in a %%cython
cell, and we type everything.
%%cython
# cython: c_string_type=unicode, c_string_encoding=utf8
cdef int rightrotate_cython(int x, int c):
""" Right rotate the number x by c bytes."""
return (x >> c) | (x << (32 - c))
cdef int rightshift_cython(int x, int c):
""" Right shift the number x by c bytes."""
return x >> c
# See http://cython.readthedocs.io/en/latest/src/tutorial/array.html
from cpython cimport array
import array
cdef array.array empty_64 = array.array('i', [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
cdef int[:] view_empty_64 = empty_64
cpdef void update_cython(int[:] hash_pieces, int[:] k, bytearray arg):
""" One pass of the SHA-256 algorithm, update hash_pieces on place. """
# Extract the 8 variables
cdef int h0 = hash_pieces[0], h1 = hash_pieces[1], h2 = hash_pieces[2], h3 = hash_pieces[3], h4 = hash_pieces[4], h5 = hash_pieces[5], h6 = hash_pieces[6], h7 = hash_pieces[7]
# 1. Pre-processing, exactly like MD5
cdef bytearray data = arg
cdef long orig_len_in_bits = 8 * len(data)
# 1.a. Add a single '1' bit at the end of the input bits
data.append(0x80)
# 1.b. Padding with zeros as long as the input bits length ≡ 448 (mod 512)
while len(data) % 64 != 56:
data.append(0x0)
# 1.c. append original length in bits mod (2 pow 64) to message
data += orig_len_in_bits.to_bytes(8, byteorder='big')
assert len(data) % 64 == 0, "Error in padding"
# Declare loop indexes and variables
cdef int offset, i
cdef int a, b, c, d, e, f, g, h
cdef int temp1, temp2
# 2. Computations
# Process the message in successive 512-bit = 64-bytes chunks:
cdef int[:] w = view_empty_64
for offset in range(0, len(data), 64):
# 2.a. 512-bits = 64-bytes chunks
# 2.b. Break chunk into sixteen 32-bit = 4-bytes words w[i], 0 ≤ i ≤ 15
for i in range(16):
w[i] = int.from_bytes(data[offset : offset + 64][4*i : 4*i + 4], byteorder='big')
# 2.c. Extend the first 16 words into the remaining 48
# words w[16..63] of the message schedule array:
for i in range(16, 64):
w[i] = w[i-16] + (rightrotate_cython(w[i-15], 7) ^ rightrotate_cython(w[i-15], 18) ^ rightshift_cython(w[i-15], 3)) + w[i-7] + (rightrotate_cython(w[i-2], 17) ^ rightrotate_cython(w[i-2], 19) ^ rightshift_cython(w[i-2], 10))
# 2.d. Initialize hash value for this chunk
a = h0
b = h1
c = h2
d = h3
e = h4
f = h5
g = h6
h = h7
# 2.e. Main loop, cf. https://tools.ietf.org/html/rfc6234
for i in range(64):
temp1 = h + (rightrotate_cython(e, 6) ^ rightrotate_cython(e, 11) ^ rightrotate_cython(e, 25)) + ((e & f) ^ ((~e) & g)) + k[i] + w[i]
temp2 = (rightrotate_cython(a, 2) ^ rightrotate_cython(a, 13) ^ rightrotate_cython(a, 22)) + ((a & b) ^ (a & c) ^ (b & c))
# Rotate the 8 variables
a, b, c, d, e, f, g, h = temp1 + temp2, a, b, c, d + temp1, e, f, g
# Add this chunk's hash to result so far:
h0 += a
h1 += b
h2 += c
h3 += d
h4 += e
h5 += f
h6 += g
h7 += h
# 3. Conclusion
hash_pieces[0] = h0
hash_pieces[1] = h1
hash_pieces[2] = h2
hash_pieces[3] = h3
hash_pieces[4] = h4
hash_pieces[5] = h5
hash_pieces[6] = h6
hash_pieces[7] = h7
class SHA2_Cython(Hash):
"""SHA256 hashing, speed-up with Numba.jit, see https://en.wikipedia.org/wiki/SHA-2#Pseudocode."""
def __init__(self):
self.name = "SHA256"
self.byteorder = 'big'
self.block_size = 64
self.digest_size = 32
# Note 2: For each round, there is one round constant k[i] and one entry in the message schedule array w[i], 0 ≤ i ≤ 63
# Note 3: The compression function uses 8 working variables, a through h
# Note 4: Big-endian convention is used when expressing the constants in this pseudocode,
# and when parsing message block data from bytes to words, for example,
# the first word of the input message "abc" after padding is 0x61626380
# Initialize hash values:
# (first 32 bits of the fractional parts of the square roots of the first 8 primes 2..19):
h0 = 0x6a09e667
h1 = 0xbb67ae85
h2 = 0x3c6ef372
h3 = 0xa54ff53a
h4 = 0x510e527f
h5 = 0x9b05688c
h6 = 0x1f83d9ab
h7 = 0x5be0cd19
# Initialize array of round constants:
# (first 32 bits of the fractional parts of the cube roots of the first 64 primes 2..311):
self.k = [
0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
]
# Store them
self.hash_pieces = [h0, h1, h2, h3, h4, h5, h6, h7]
def update(self, data):
update_cython(self.hash_pieces, self.k, data)
def digest(self):
# h0 append h1 append h2 append h3 append h4 append h5 append h6 append h7
return sum(leftshift(x, 32 * i) for i, x in enumerate(self.hash_pieces[::-1]))
We can also write a function to directly compute the hex digest from some bytes data.
def hash_SHA2_Cython(data):
""" Shortcut function to directly receive the hex digest from SHA2_Cython(data)."""
h = SHA2_Cython()
if isinstance(data, str):
data = bytes(data, encoding='utf8')
print("type(data) =", type(data))
h.update(data)
return h.hexdigest()
data = bytes("", encoding='utf8')
h = SHA2_Cython()
h.hash_pieces[:1]
type(h.hash_pieces)
h.k[:1]
type(h.k)
data
type(data)
update_cython(h.hash_pieces, h.k, bytearray(data))
[1779033703]
list
[1116352408]
list
b''
bytes
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-185-38958c7f0f71> in <module>() 9 type(data) 10 ---> 11 update_cython(h.hash_pieces, h.k, bytearray(data)) _cython_magic_892b302b9e5d7477154227e427367045.pyx in _cython_magic_892b302b9e5d7477154227e427367045.update_cython (/home/lilian/.cache/ipython/cython/_cython_magic_892b302b9e5d7477154227e427367045.c:2840)() ~/.cache/ipython/cython/_cython_magic_892b302b9e5d7477154227e427367045.cpython-35m-x86_64-linux-gnu.so in View.MemoryView.memoryview_cwrapper (/home/lilian/.cache/ipython/cython/_cython_magic_892b302b9e5d7477154227e427367045.c:9311)() ~/.cache/ipython/cython/_cython_magic_892b302b9e5d7477154227e427367045.cpython-35m-x86_64-linux-gnu.so in View.MemoryView.memoryview.__cinit__ (/home/lilian/.cache/ipython/cython/_cython_magic_892b302b9e5d7477154227e427367045.c:5546)() TypeError: a bytes-like object is required, not 'list'
Let try the example from SHA-2 Wikipedia page :
hash_SHA2_Cython("The quick brown fox jumps over the lazy dog")
assert hash_SHA2_Cython("The quick brown fox jumps over the lazy dog") == 'd7a8fbb307d7809469ca9abcb0082e4f8d5651e46d3cdb762d02d0bf37c9e592'
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-90-ef3902a7030b> in <module>() ----> 1 hash_SHA2_Cython("The quick brown fox jumps over the lazy dog") 2 assert hash_SHA2_Cython("The quick brown fox jumps over the lazy dog") == 'd7a8fbb307d7809469ca9abcb0082e4f8d5651e46d3cdb762d02d0bf37c9e592' <ipython-input-89-7da2f16c435f> in hash_SHA2_Cython(data) 4 if isinstance(data, str): 5 data = bytes(data, encoding='utf8') ----> 6 h.update(data) 7 return h.hexdigest() <ipython-input-88-3e4059769092> in update(self, data) 41 42 def update(self, data): ---> 43 return update_cython(self.hash_pieces, self.k, data) 44 45 def digest(self): _cython_magic_3a51a049b19292c89caa81e80cc99ee0.pyx in _cython_magic_3a51a049b19292c89caa81e80cc99ee0.update_cython (/home/lilian/.cache/ipython/cython/_cython_magic_3a51a049b19292c89caa81e80cc99ee0.c:3101)() ~/.cache/ipython/cython/_cython_magic_3a51a049b19292c89caa81e80cc99ee0.cpython-35m-x86_64-linux-gnu.so in View.MemoryView.memoryview_cwrapper (/home/lilian/.cache/ipython/cython/_cython_magic_3a51a049b19292c89caa81e80cc99ee0.c:9569)() ~/.cache/ipython/cython/_cython_magic_3a51a049b19292c89caa81e80cc99ee0.cpython-35m-x86_64-linux-gnu.so in View.MemoryView.memoryview.__cinit__ (/home/lilian/.cache/ipython/cython/_cython_magic_3a51a049b19292c89caa81e80cc99ee0.c:5804)() TypeError: a bytes-like object is required, not 'list'
"SHA" is pronouced like the French word "chat", which means cat.
See my GitHub
notebooks
project for others notebooks.