#!/usr/bin/env python
# coding: utf-8
# # Table of Contents
#
# # Manual implementation of some hash functions
#
# This small [Jupyter notebook](https://www.Jupyter.org/) is a short experiment, to see if I can implement the some basic [Hashing functions](https://en.wikipedia.org/wiki/Hash_function), more specifically [cryptographic hashing functions](https://en.wikipedia.org/wiki/Cryptographic_hash_function), like `MD5`, `SHA1`, `SHA256`, `SHA512` etc
#
# And then I want compare my manual implementations with the functions implemented in [the `hashlib` module in Python standard library](https://docs.python.org/3/library/hashlib.html).
# Ideally, my implementation should work exactly like the reference one, only slower!
#
#
# - *Reference*: Wikipedia pages on [Hash functions](https://en.wikipedia.org/wiki/Hash_function), [MD5](https://en.wikipedia.org/wiki/MD5), [SHA-1](https://en.wikipedia.org/wiki/SHA1) and [SHA-2](https://en.wikipedia.org/wiki/SHA-2), as well as [the `hashlib` module in Python standard library](https://docs.python.org/3/library/hashlib.html).
# - *Date*: 13 May 2017 (first part about MD5), 19 June 2017 (second part about SHA-1), 19 June 2017 (last part about SHA-2).
# - *Author*: [Lilian Besson](https://GitHub.com/Naereen/notebooks).
# - *License*: [MIT Licensed](https://LBesson.MIT-License.org/).
# ----
# ## What is a hash function?
# > TL;DR : [Hash functions](https://en.wikipedia.org/wiki/Hash_function) and [cryptographic hashing functions](https://en.wikipedia.org/wiki/Cryptographic_hash_function) on Wikipedia.
# ----
# ## Common API for the different classes
#
# I will copy the API proposed by [the `hashlib` module in Python standard library](https://docs.python.org/3/library/hashlib.html), so it will be very easy to compare my implementations with the one provided with your default [Python](https://www.Python.org/) installation.
# In[1]:
class Hash(object):
""" Common class for all hash methods.
It copies the one of the hashlib module (https://docs.python.org/3.5/library/hashlib.html).
"""
def __init__(self, *args, **kwargs):
""" Create the Hash object."""
self.name = self.__class__.__name__ # https://docs.python.org/3.5/library/hashlib.html#hashlib.hash.name
self.byteorder = 'little'
self.digest_size = 0 # https://docs.python.org/3.5/library/hashlib.html#hashlib.hash.digest_size
self.block_size = 0 # https://docs.python.org/3.5/library/hashlib.html#hashlib.hash.block_size
def __str__(self):
return self.name
def update(self, arg):
""" Update the hash object with the object arg, which must be interpretable as a buffer of bytes."""
pass
def digest(self):
""" Return the digest of the data passed to the update() method so far. This is a bytes object of size digest_size which may contain bytes in the whole range from 0 to 255."""
return b""
def hexdigest(self):
""" Like digest() except the digest is returned as a string object of double length, containing only hexadecimal digits. This may be used to exchange the value safely in email or other non-binary environments."""
digest = self.digest()
raw = digest.to_bytes(self.digest_size, byteorder=self.byteorder)
format_str = '{:0' + str(2 * self.digest_size) + 'x}'
return format_str.format(int.from_bytes(raw, byteorder='big'))
# ----
# ## Checking the [the `hashlib` module in Python standard library](https://docs.python.org/3/library/hashlib.html)
# In[2]:
import hashlib
# We can check [the available algorithms](https://docs.python.org/3.5/library/hashlib.html#hashlib.algorithms_available), some of them being [guaranteed to be on any platform](https://docs.python.org/3.5/library/hashlib.html#hashlib.algorithms_guaranteed), some are not.
# In[3]:
list(hashlib.algorithms_available)
# I will need at least these ones:
# In[4]:
assert 'MD5' in hashlib.algorithms_available
assert 'SHA1' in hashlib.algorithms_available
assert 'SHA256' in hashlib.algorithms_available
assert 'SHA224' in hashlib.algorithms_available
assert 'SHA512' in hashlib.algorithms_available
assert 'SHA384' in hashlib.algorithms_available
# Lets check that they have the block size and digest size announced:
# In[5]:
for name, s in zip(
('MD5', 'SHA1', 'SHA256', 'SHA224', 'SHA512', 'SHA384'),
(hashlib.md5(), hashlib.sha1(), hashlib.sha256(), hashlib.sha224(), hashlib.sha512(), hashlib.sha384())
):
print("For {:<8} : the block size is {:<3} and the digest size is {:<2}.".format(name, s.block_size, s.digest_size))
# ----
# ## First stupid example: a stupid hashing function
#
# This "stupid" hashing function will use `digest_size` of 128 bytes (= 16 bits), and compute it by ... just looking at the first 128 bytes of the input data.
#
# This is just to check the API and how to read from a bytes buffer.
# In[6]:
class HeaderHash(Hash):
""" This "stupid" hashing function will use `digest_size` of 128 bytes, and compute it by ... just looking at the first 128 bytes of the input data.
"""
def __init__(self):
# Common part
self.digest_size = 16
self.block_size = 16
self.name = "Header"
# Specific part
self._data = b""
def update(self, arg):
""" Update the hash object with the object arg, which must be interpretable as a buffer of bytes."""
if len(self._data) == 0:
self._data = arg[:self.block_size]
def digest(self):
""" Return the digest of the data passed to the update() method so far. This is a bytes object of size digest_size which may contain bytes in the whole range from 0 to 255."""
return self._data
# Let us try it:
# In[7]:
h1 = HeaderHash()
# In[8]:
h1
print(h1)
# Let us use some toy data, to test here and after.
# In[9]:
data = b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ" * 100
# In[10]:
h1.update(data)
h1.digest()
# > Well... It seems to work, even if this first example is stupid.
# ----
# ## First real example: the MD5 hashing function
# Let start with a simple one: [the MD5 hashing function](https://en.wikipedia.org/wiki/MD5), from Rivest in 1992.
#
# Warning: it is considered broken since at least 2012, never use it for security purposes.
# ### Useful functions for the MD5 algorithm
# Instead of writing the complete MD5 algorithm in the class below, I preferred to define here some useful functions, using [Bitwise operators](https://wiki.python.org/moin/BitwiseOperators).
# In[11]:
def MD5_f1(b, c, d):
""" First ternary bitwise operation."""
return ((b & c) | ((~b) & d)) & 0xFFFFFFFF
def MD5_f2(b, c, d):
""" Second ternary bitwise operation."""
return ((b & d) | (c & (~d))) & 0xFFFFFFFF
def MD5_f3(b, c, d):
""" Third ternary bitwise operation."""
return (b ^ c ^ d) & 0xFFFFFFFF
def MD5_f4(b, c, d):
""" Forth ternary bitwise operation."""
return (c ^ (b | (~d))) & 0xFFFFFFFF
# In[12]:
def leftrotate(x, c):
""" Left rotate the number x by c bytes."""
x &= 0xFFFFFFFF
return ((x << c) | (x >> (32 - c))) & 0xFFFFFFFF
# In[13]:
def leftshift(x, c):
""" Left shift the number x by c bytes."""
return x << c
# In[14]:
from math import floor, sin
# ### The `MD5` class
# It is a direct implementation of the pseudo-code, as given for instance on the Wikipedia page, or the original research article by Rivest.
# In[15]:
class MD5(Hash):
"""MD5 hashing, see https://en.wikipedia.org/wiki/MD5#Algorithm."""
def __init__(self):
self.name = "MD5"
self.byteorder = 'little'
self.block_size = 64
self.digest_size = 16
# Internal data
s = [0] * 64
K = [0] * 64
# Initialize s, s specifies the per-round shift amounts
s[ 0:16] = [7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22]
s[16:32] = [5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20]
s[32:48] = [4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23]
s[48:64] = [6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21]
# Store it
self._s = s
# Use binary integer part of the sines of integers (Radians) as constants:
for i in range(64):
K[i] = floor(2**32 * abs(sin(i + 1))) & 0xFFFFFFFF
# Store it
self._K = K
# Initialize variables:
a0 = 0x67452301 # A
b0 = 0xefcdab89 # B
c0 = 0x98badcfe # C
d0 = 0x10325476 # D
self.hash_pieces = [a0, b0, c0, d0]
def update(self, arg):
s, K = self._s, self._K
a0, b0, c0, d0 = self.hash_pieces
# 1. Pre-processing
data = bytearray(arg)
orig_len_in_bits = (8 * len(data)) & 0xFFFFFFFFFFFFFFFF
# 1.a. Add a single '1' bit at the end of the input bits
data.append(0x80)
# 1.b. Padding with zeros as long as the input bits length ≡ 448 (mod 512)
while len(data) % 64 != 56:
data.append(0)
# 1.c. append original length in bits mod (2 pow 64) to message
data += orig_len_in_bits.to_bytes(8, byteorder='little')
assert len(data) % 64 == 0, "Error in padding"
# 2. Computations
# Process the message in successive 512-bit = 64-bytes chunks:
for offset in range(0, len(data), 64):
# 2.a. 512-bits = 64-bytes chunks
chunks = data[offset : offset + 64]
# 2.b. Break chunk into sixteen 32-bit = 4-bytes words M[j], 0 ≤ j ≤ 15
A, B, C, D = a0, b0, c0, d0
# 2.c. Main loop
for i in range(64):
if 0 <= i <= 15:
F = MD5_f1(B, C, D)
g = i
elif 16 <= i <= 31:
F = MD5_f2(B, C, D)
g = (5 * i + 1) % 16
elif 32 <= i <= 47:
F = MD5_f3(B, C, D)
g = (3 * i + 5) % 16
elif 48 <= i <= 63:
F = MD5_f4(B, C, D)
g = (7 * i) % 16
# Be wary of the below definitions of A, B, C, D
to_rotate = (A + F + K[i] + int.from_bytes(chunks[4*g : 4*g+4], byteorder='little')) & 0xFFFFFFFF
new_B = (B + leftrotate(to_rotate, s[i])) & 0xFFFFFFFF
A, B, C, D = D, new_B, B, C
# Add this chunk's hash to result so far:
a0 = (a0 + A) & 0xFFFFFFFF
b0 = (b0 + B) & 0xFFFFFFFF
c0 = (c0 + C) & 0xFFFFFFFF
d0 = (d0 + D) & 0xFFFFFFFF
# 3. Conclusion
self.hash_pieces = [a0, b0, c0, d0]
def digest(self):
return sum(leftshift(x, (32 * i)) for i, x in enumerate(self.hash_pieces))
# We can also write a function to directly compute the hex digest from some bytes data.
# In[16]:
def hash_MD5(data):
""" Shortcut function to directly receive the hex digest from MD5(data)."""
h = MD5()
if isinstance(data, str):
data = bytes(data, encoding='utf8')
h.update(data)
return h.hexdigest()
# *Note:* [This page helped for debugging](https://rosettacode.org/wiki/MD5/Implementation#Python).
# ### First check on MD5
#
# Let us try it:
# In[17]:
h2 = MD5()
h2
print(h2)
# In[18]:
h2.update(data)
h2.digest()
# In[19]:
h2.hexdigest()
# ### A less stupid check on MD5
#
# Let try the example from [MD5 Wikipedia page](https://en.wikipedia.org/wiki/MD5#MD5_hashes) :
# In[20]:
hash_MD5("The quick brown fox jumps over the lazy dog")
assert hash_MD5("The quick brown fox jumps over the lazy dog") == '9e107d9d372bb6826bd81d3542a419d6'
# Even a small change in the message will (with overwhelming probability) result in a mostly different hash, due to the [**avalanche effect**](https://en.wikipedia.org/wiki/Avalanche_effect). For example, adding a period to the end of the sentence:
# In[21]:
hash_MD5("The quick brown fox jumps over the lazy dog.")
assert hash_MD5("The quick brown fox jumps over the lazy dog.") == 'e4d909c290d0fb1ca068ffaddf22cbd0'
# The hash of the zero-length string is:
# In[22]:
hash_MD5("")
assert hash_MD5("") == 'd41d8cd98f00b204e9800998ecf8427e'
# $\implies$ We obtained the same result, OK our function works!
# ### Trying 1000 random examples
# On a small sentence:
# In[23]:
hash_MD5("My name is Zorro !")
# In[24]:
h = hashlib.md5()
h.update(b"My name is Zorro !")
h.hexdigest()
# It starts to look good.
# In[25]:
def true_hash_MD5(data):
h = hashlib.md5()
if isinstance(data, str):
data = bytes(data, encoding='utf8')
h.update(data)
return h.hexdigest()
# On some random data:
# In[26]:
import numpy.random as nr
alphabets = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
def random_string(size=10000):
return ''.join(alphabets[nr.randint(len(alphabets))] for _ in range(size))
# In[27]:
random_string(10)
# In[28]:
from tqdm import tqdm_notebook as tqdm
# In[29]:
get_ipython().run_cell_magic('time', '', 'for _ in tqdm(range(1000)):\n x = random_string()\n assert hash_MD5(x) == true_hash_MD5(x), "Error: x = {} gave two different MD5 hashes: my implementation = {} != hashlib implementation = {}...".format(x, hash_MD5(x), true_hash_MD5(x))\n')
# ----
# ## Second real example: the SHA-1 hashing function
# Let now study and implement another hashing function, slightly harder to write but more secure: SHA1, "Secure Hash Algorithm, version 1".
# See [the SHA1 hashing function](https://en.wikipedia.org/wiki/SHA-1) on Wikipedia, if needed.
#
# Warning: it is considered broken since at least 2011, it is not advised to use it for real security purposes. SHA-2 or SHA-3 is better advised.
#
# For instance, see [this nice blog post](https://konklone.com/post/why-google-is-hurrying-the-web-to-kill-sha-1).
# ### Useful functions the SHA-1 algorithm
# Pretty similar to the ones used for the MD5 algorithm.
# In[30]:
def SHA1_f1(b, c, d):
""" First ternary bitwise operation."""
return ((b & c) | ((~b) & d)) & 0xFFFFFFFF
def SHA1_f2(b, c, d):
""" Second ternary bitwise operation."""
return (b ^ c ^ d) & 0xFFFFFFFF
def SHA1_f3(b, c, d):
""" Third ternary bitwise operation."""
return ((b & c) | (b & d) | (c & d) ) & 0xFFFFFFFF
def SHA1_f4(b, c, d):
""" Forth ternary bitwise operation, = SHA1_f1."""
return (b ^ c ^ d) & 0xFFFFFFFF
# This is exactly like for MD5.
# In[31]:
def leftrotate(x, c):
""" Left rotate the number x by c bytes."""
x &= 0xFFFFFFFF
return ((x << c) | (x >> (32 - c))) & 0xFFFFFFFF
# As SHA-1 plays with big-endian and little-endian integers, and at the end it requires a leftshift to combine the 5 hash pieces into one.
# In[32]:
def leftshift(x, c):
""" Left shift the number x by c bytes."""
return x << c
# ### The `SHA1` class
#
# I will use a simple class, very similar to the class used for the MD5 algorithm (see above).
# It is a direct implementation of the pseudo-code, as given for instance on the Wikipedia page.
# In[33]:
class SHA1(Hash):
"""SHA1 hashing, see https://en.wikipedia.org/wiki/SHA-1#Algorithm."""
def __init__(self):
self.name = "SHA1"
self.byteorder = 'big'
self.block_size = 64
self.digest_size = 20
# Initialize variables
h0 = 0x67452301
h1 = 0xEFCDAB89
h2 = 0x98BADCFE
h3 = 0x10325476
h4 = 0xC3D2E1F0
# Store them
self.hash_pieces = [h0, h1, h2, h3, h4]
def update(self, arg):
h0, h1, h2, h3, h4 = self.hash_pieces
# 1. Pre-processing, exactly like MD5
data = bytearray(arg)
orig_len_in_bits = (8 * len(data)) & 0xFFFFFFFFFFFFFFFF
# 1.a. Add a single '1' bit at the end of the input bits
data.append(0x80)
# 1.b. Padding with zeros as long as the input bits length ≡ 448 (mod 512)
while len(data) % 64 != 56:
data.append(0)
# 1.c. append original length in bits mod (2 pow 64) to message
data += orig_len_in_bits.to_bytes(8, byteorder='big')
assert len(data) % 64 == 0, "Error in padding"
# 2. Computations
# Process the message in successive 512-bit = 64-bytes chunks:
for offset in range(0, len(data), 64):
# 2.a. 512-bits = 64-bytes chunks
chunks = data[offset : offset + 64]
w = [0 for i in range(80)]
# 2.b. Break chunk into sixteen 32-bit = 4-bytes words w[i], 0 ≤ i ≤ 15
for i in range(16):
w[i] = int.from_bytes(chunks[4*i : 4*i + 4], byteorder='big')
# 2.c. Extend the sixteen 32-bit words into eighty 32-bit words
for i in range(16, 80):
w[i] = leftrotate(w[i-3] ^ w[i-8] ^ w[i-14] ^ w[i-16], 1)
# 2.d. Initialize hash value for this chunk
a, b, c, d, e = h0, h1, h2, h3, h4
# 2.e. Main loop, cf. http://www.faqs.org/rfcs/rfc3174.html
for i in range(80):
if 0 <= i <= 19 :
f = SHA1_f1(b, c, d)
k = 0x5A827999
elif 20 <= i <= 39 :
f = SHA1_f2(b, c, d)
k = 0x6ED9EBA1
elif 40 <= i <= 59 :
f = SHA1_f3(b, c, d)
k = 0x8F1BBCDC
elif 60 <= i <= 79 :
f = SHA1_f4(b, c, d)
k = 0xCA62C1D6
new_a = leftrotate(a, 5) + f + e + k + w[i] & 0xFFFFFFFF
new_c = leftrotate(b, 30)
# Rotate the 5 variables
a, b, c, d, e = new_a, a, new_c, c, d
# Add this chunk's hash to result so far:
h0 = (h0 + a) & 0xFFFFFFFF
h1 = (h1 + b) & 0xFFFFFFFF
h2 = (h2 + c) & 0xFFFFFFFF
h3 = (h3 + d) & 0xFFFFFFFF
h4 = (h4 + e) & 0xFFFFFFFF
# 3. Conclusion
self.hash_pieces = [h0, h1, h2, h3, h4]
def digest(self):
return sum(leftshift(x, 32 * i) for i, x in enumerate(self.hash_pieces[::-1]))
# We can also write a function to directly compute the hex digest from some bytes data.
# In[34]:
def hash_SHA1(data):
""" Shortcut function to directly receive the hex digest from SHA1(data)."""
h = SHA1()
if isinstance(data, str):
data = bytes(data, encoding='utf8')
h.update(data)
return h.hexdigest()
# ### First check on SHA-1
#
# Let us try it:
# In[35]:
h3 = SHA1()
h3
print(h3)
# In[36]:
h3.update(data)
h3.digest()
# In[37]:
digest = h3.digest()
bin(digest)
len(bin(digest))
# In[38]:
h3.hexdigest()
# ### A less stupid check on SHA-1
#
# Let try the example from [SHA-1 Wikipedia page](https://en.wikipedia.org/wiki/SHA-1#SHA-1_hashes) :
# In[39]:
hash_SHA1("The quick brown fox jumps over the lazy dog")
assert hash_SHA1("The quick brown fox jumps over the lazy dog") == '2fd4e1c67a2d28fced849ee1bb76e7391b93eb12'
# Even a small change in the message will (with overwhelming probability) result in a mostly different hash, due to the [**avalanche effect**](https://en.wikipedia.org/wiki/Avalanche_effect). For example, changing one letter in the sentence:
# In[40]:
hash_SHA1("The quick brown fox jumps over the lazy cog")
assert hash_SHA1("The quick brown fox jumps over the lazy cog") == 'de9f2c7fd25e1b3afad3e85a0bd17d9b100db4b3'
# The hash of the zero-length string is:
# In[41]:
hash_SHA1("")
assert hash_SHA1("") == 'da39a3ee5e6b4b0d3255bfef95601890afd80709'
# $\implies$ We obtained the same result, OK our function works!
# ### Trying 1000 random examples
# On a small sentence:
# In[42]:
hash_SHA1("My name is Zorro !")
# In[43]:
h = hashlib.sha1()
h.update(b"My name is Zorro !")
h.hexdigest()
# It starts to look good.
# In[44]:
def true_hash_SHA1(data):
h = hashlib.sha1()
if isinstance(data, str):
data = bytes(data, encoding='utf8')
h.update(data)
return h.hexdigest()
# On some random data:
# In[45]:
random_string(10)
# In[46]:
from tqdm import tqdm_notebook as tqdm
# In[47]:
get_ipython().run_cell_magic('time', '', 'for _ in tqdm(range(1000)):\n x = random_string()\n assert hash_SHA1(x) == true_hash_SHA1(x), "Error: x = {} gave two different SHA1 hashes: my implementation = {} != hashlib implementation = {}...".format(x, hash_SHA1(x), true_hash_SHA1(x))\n')
# ----
# ## Third real example: the SHA-2 hashing function
# Let now study and implement a last hashing function, again slightly harder to write but more secure: SHA-2, "Secure Hash Algorithm, version 2".
# See [the SHA-2 hashing function](https://en.wikipedia.org/wiki/SHA-2) on Wikipedia, if needed.
#
# Remark: it is not (yet) considered broken, and it is the military standard for security and cryptographic hashing. SHA-3 is preferred for security purposes.
# ### Useful functions the SHA-2 algorithm
# This is exactly like for MD5. But SHA-2 requires right-rotate as well.
# In[48]:
def leftrotate(x, c):
""" Left rotate the number x by c bytes."""
x &= 0xFFFFFFFF
return ((x << c) | (x >> (32 - c))) & 0xFFFFFFFF
def rightrotate(x, c):
""" Right rotate the number x by c bytes."""
x &= 0xFFFFFFFF
return ((x >> c) | (x << (32 - c))) & 0xFFFFFFFF
# As SHA-2 plays with big-endian and little-endian integers, and at the end it requires a leftshift to combine the 5 hash pieces into one.
# In[49]:
def leftshift(x, c):
""" Left shift the number x by c bytes."""
return x << c
def rightshift(x, c):
""" Right shift the number x by c bytes."""
return x >> c
# ### The `SHA2` class
#
# I will use a simple class, very similar to the class used for the SHA-1 algorithm (see above).
# It is a direct implementation of the pseudo-code, as given for instance on the Wikipedia page.
#
# I will only implement the simpler one, SHA-256, of digest size of 256 bits. Other variants are SHA-224, SHA-384, SHA-512 (and others include SHA-512/224, SHA-512/256).
# In[50]:
class SHA2(Hash):
"""SHA256 hashing, see https://en.wikipedia.org/wiki/SHA-2#Pseudocode."""
def __init__(self):
self.name = "SHA256"
self.byteorder = 'big'
self.block_size = 64
self.digest_size = 32
# Note 2: For each round, there is one round constant k[i] and one entry in the message schedule array w[i], 0 ≤ i ≤ 63
# Note 3: The compression function uses 8 working variables, a through h
# Note 4: Big-endian convention is used when expressing the constants in this pseudocode,
# and when parsing message block data from bytes to words, for example,
# the first word of the input message "abc" after padding is 0x61626380
# Initialize hash values:
# (first 32 bits of the fractional parts of the square roots of the first 8 primes 2..19):
h0 = 0x6a09e667
h1 = 0xbb67ae85
h2 = 0x3c6ef372
h3 = 0xa54ff53a
h4 = 0x510e527f
h5 = 0x9b05688c
h6 = 0x1f83d9ab
h7 = 0x5be0cd19
# Initialize array of round constants:
# (first 32 bits of the fractional parts of the cube roots of the first 64 primes 2..311):
self.k = [
0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
]
# Store them
self.hash_pieces = [h0, h1, h2, h3, h4, h5, h6, h7]
def update(self, arg):
h0, h1, h2, h3, h4, h5, h6, h7 = self.hash_pieces
# 1. Pre-processing, exactly like MD5
data = bytearray(arg)
orig_len_in_bits = (8 * len(data)) & 0xFFFFFFFFFFFFFFFF
# 1.a. Add a single '1' bit at the end of the input bits
data.append(0x80)
# 1.b. Padding with zeros as long as the input bits length ≡ 448 (mod 512)
while len(data) % 64 != 56:
data.append(0)
# 1.c. append original length in bits mod (2 pow 64) to message
data += orig_len_in_bits.to_bytes(8, byteorder='big')
assert len(data) % 64 == 0, "Error in padding"
# 2. Computations
# Process the message in successive 512-bit = 64-bytes chunks:
for offset in range(0, len(data), 64):
# 2.a. 512-bits = 64-bytes chunks
chunks = data[offset : offset + 64]
w = [0 for i in range(64)]
# 2.b. Break chunk into sixteen 32-bit = 4-bytes words w[i], 0 ≤ i ≤ 15
for i in range(16):
w[i] = int.from_bytes(chunks[4*i : 4*i + 4], byteorder='big')
# 2.c. Extend the first 16 words into the remaining 48
# words w[16..63] of the message schedule array:
for i in range(16, 64):
s0 = (rightrotate(w[i-15], 7) ^ rightrotate(w[i-15], 18) ^ rightshift(w[i-15], 3)) & 0xFFFFFFFF
s1 = (rightrotate(w[i-2], 17) ^ rightrotate(w[i-2], 19) ^ rightshift(w[i-2], 10)) & 0xFFFFFFFF
w[i] = (w[i-16] + s0 + w[i-7] + s1) & 0xFFFFFFFF
# 2.d. Initialize hash value for this chunk
a, b, c, d, e, f, g, h = h0, h1, h2, h3, h4, h5, h6, h7
# 2.e. Main loop, cf. https://tools.ietf.org/html/rfc6234
for i in range(64):
S1 = (rightrotate(e, 6) ^ rightrotate(e, 11) ^ rightrotate(e, 25)) & 0xFFFFFFFF
ch = ((e & f) ^ ((~e) & g)) & 0xFFFFFFFF
temp1 = (h + S1 + ch + self.k[i] + w[i]) & 0xFFFFFFFF
S0 = (rightrotate(a, 2) ^ rightrotate(a, 13) ^ rightrotate(a, 22)) & 0xFFFFFFFF
maj = ((a & b) ^ (a & c) ^ (b & c)) & 0xFFFFFFFF
temp2 = (S0 + maj) & 0xFFFFFFFF
new_a = (temp1 + temp2) & 0xFFFFFFFF
new_e = (d + temp1) & 0xFFFFFFFF
# Rotate the 8 variables
a, b, c, d, e, f, g, h = new_a, a, b, c, new_e, e, f, g
# Add this chunk's hash to result so far:
h0 = (h0 + a) & 0xFFFFFFFF
h1 = (h1 + b) & 0xFFFFFFFF
h2 = (h2 + c) & 0xFFFFFFFF
h3 = (h3 + d) & 0xFFFFFFFF
h4 = (h4 + e) & 0xFFFFFFFF
h5 = (h5 + f) & 0xFFFFFFFF
h6 = (h6 + g) & 0xFFFFFFFF
h7 = (h7 + h) & 0xFFFFFFFF
# 3. Conclusion
self.hash_pieces = [h0, h1, h2, h3, h4, h5, h6, h7]
def digest(self):
# h0 append h1 append h2 append h3 append h4 append h5 append h6 append h7
return sum(leftshift(x, 32 * i) for i, x in enumerate(self.hash_pieces[::-1]))
# We can also write a function to directly compute the hex digest from some bytes data.
# In[51]:
def hash_SHA2(data):
""" Shortcut function to directly receive the hex digest from SHA2(data)."""
h = SHA2()
if isinstance(data, str):
data = bytes(data, encoding='utf8')
h.update(data)
return h.hexdigest()
# ### First check on SHA-2
#
# Let us try it:
# In[52]:
h4 = SHA2()
h4
print(h4)
# In[53]:
h4.update(data)
h4.digest()
# In[54]:
digest = h4.digest()
bin(digest)
len(bin(digest))
# In[55]:
h4.hexdigest()
# ### Check on SHA-2
#
# Let try the example from [SHA-2 Wikipedia page](https://en.wikipedia.org/wiki/SHA-2#Test_vectors) :
# In[56]:
hash_SHA2("The quick brown fox jumps over the lazy dog")
assert hash_SHA2("The quick brown fox jumps over the lazy dog") == 'd7a8fbb307d7809469ca9abcb0082e4f8d5651e46d3cdb762d02d0bf37c9e592'
# Even a small change in the message will (with overwhelming probability) result in a mostly different hash, due to the [**avalanche effect**](https://en.wikipedia.org/wiki/Avalanche_effect). For example, adding a period at the end of the sentence:
# In[57]:
hash_SHA2("The quick brown fox jumps over the lazy dog.")
assert hash_SHA2("The quick brown fox jumps over the lazy dog.") == 'ef537f25c895bfa782526529a9b63d97aa631564d5d789c2b765448c8635fb6c'
# The hash of the zero-length string is:
# In[58]:
hash_SHA2("")
assert hash_SHA2("") == 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'
# $\implies$ We obtained the same result, OK our function works!
# ### Trying 1000 random examples
# On a small sentence:
# In[59]:
hash_SHA2("My name is Zorro !")
# In[60]:
h = hashlib.sha256()
h.update(b"My name is Zorro !")
h.hexdigest()
# It starts to look good.
# In[61]:
def true_hash_SHA2(data):
h = hashlib.sha256()
if isinstance(data, str):
data = bytes(data, encoding='utf8')
h.update(data)
return h.hexdigest()
# On some random data:
# In[62]:
random_string(10)
# In[63]:
from tqdm import tqdm_notebook as tqdm
# In[64]:
get_ipython().run_cell_magic('time', '', 'for _ in tqdm(range(1000)):\n x = random_string()\n assert hash_SHA2(x) == true_hash_SHA2(x), "Error: x = {} gave two different SHA2 hashes: my implementation = {} != hashlib implementation = {}...".format(x, hash_SHA2(x), true_hash_SHA2(x))\n')
# ----
# ## Comparison : MD5 vs SHA-1 vs SHA-2
#
# It can be interesting to compare each hash functions, with respect to its time complexity.
# In[65]:
def test_MD5():
x = random_string()
return hash_MD5(x) == true_hash_MD5(x)
get_ipython().run_line_magic('timeit', 'test_MD5()')
# In[66]:
def test_SHA1():
x = random_string()
return hash_SHA1(x) == true_hash_SHA1(x)
get_ipython().run_line_magic('timeit', 'test_SHA1()')
# In[67]:
def test_SHA2():
x = random_string()
return hash_SHA2(x) == true_hash_SHA2(x)
get_ipython().run_line_magic('timeit', 'test_SHA2()')
# As expected, the MD5 hash is the fastest, and SHA-2 is slower than SHA-1.
#
# $\implies$ The more secure, the slowest. Of course.
# ----
# ## Bonus : SHA-2 variants
#
# Now that we have implemented SHA-256, it should be quick to add the SHA-224 variant, which is simply the SHA-256 with different initial values and a shorter digest.
#
# The SHA-512 variant is working with 64-bits words and 1024-bits chunks, instead of 32-bits words and 512-bits chunks, and the SHA-384 is very similar to SHA-512 with different initial values and a shorter digest.
#
# > Sorry about the length of this part, I know all these variants are very similar, but I wanted to write them all.
# ### SHA-224
#
# As said on the Wikipedia page:
# > SHA-224 is identical to SHA-256, except that:
# > - the initial hash values `h0` through `h7` are different, and
# > - the output is constructed by omitting `h7`.
# #### The `SHA224` class
# In[68]:
class SHA224(Hash):
"""SHA224 hashing, see https://en.wikipedia.org/wiki/SHA-2#Pseudocode."""
def __init__(self):
self.name = "SHA224"
self.byteorder = 'big'
self.block_size = 64
self.digest_size = 28
# Note 2: For each round, there is one round constant k[i] and one entry in the message schedule array w[i], 0 ≤ i ≤ 63
# Note 3: The compression function uses 8 working variables, a through h
# Note 4: Big-endian convention is used when expressing the constants in this pseudocode,
# and when parsing message block data from bytes to words, for example,
# the first word of the input message "abc" after padding is 0x61626380
# Initialize hash values:
# (The second 32 bits of the fractional parts of the square roots of the 9th through 16th primes 23..53)
h0 = 0xc1059ed8
h1 = 0x367cd507
h2 = 0x3070dd17
h3 = 0xf70e5939
h4 = 0xffc00b31
h5 = 0x68581511
h6 = 0x64f98fa7
h7 = 0xbefa4fa4
# Initialize array of round constants:
# (first 32 bits of the fractional parts of the cube roots of the first 64 primes 2..311):
self.k = [
0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
]
# Store them
self.hash_pieces = [h0, h1, h2, h3, h4, h5, h6, h7]
def update(self, arg):
h0, h1, h2, h3, h4, h5, h6, h7 = self.hash_pieces
# 1. Pre-processing, exactly like MD5
data = bytearray(arg)
orig_len_in_bits = (8 * len(data)) & 0xFFFFFFFFFFFFFFFF
# 1.a. Add a single '1' bit at the end of the input bits
data.append(0x80)
# 1.b. Padding with zeros as long as the input bits length ≡ 448 (mod 512)
while len(data) % 64 != 56:
data.append(0)
# 1.c. append original length in bits mod (2 pow 64) to message
data += orig_len_in_bits.to_bytes(8, byteorder='big')
assert len(data) % 64 == 0, "Error in padding"
# 2. Computations
# Process the message in successive 512-bit = 64-bytes chunks:
for offset in range(0, len(data), 64):
# 2.a. 512-bits = 64-bytes chunks
chunks = data[offset : offset + 64]
w = [0 for i in range(64)]
# 2.b. Break chunk into sixteen 32-bit = 4-bytes words w[i], 0 ≤ i ≤ 15
for i in range(16):
w[i] = int.from_bytes(chunks[4*i : 4*i + 4], byteorder='big')
# 2.c. Extend the first 16 words into the remaining 48
# words w[16..63] of the message schedule array:
for i in range(16, 64):
s0 = (rightrotate(w[i-15], 7) ^ rightrotate(w[i-15], 18) ^ rightshift(w[i-15], 3)) & 0xFFFFFFFF
s1 = (rightrotate(w[i-2], 17) ^ rightrotate(w[i-2], 19) ^ rightshift(w[i-2], 10)) & 0xFFFFFFFF
w[i] = (w[i-16] + s0 + w[i-7] + s1) & 0xFFFFFFFF
# 2.d. Initialize hash value for this chunk
a, b, c, d, e, f, g, h = h0, h1, h2, h3, h4, h5, h6, h7
# 2.e. Main loop, cf. https://tools.ietf.org/html/rfc6234
for i in range(64):
S1 = (rightrotate(e, 6) ^ rightrotate(e, 11) ^ rightrotate(e, 25)) & 0xFFFFFFFF
ch = ((e & f) ^ ((~e) & g)) & 0xFFFFFFFF
temp1 = (h + S1 + ch + self.k[i] + w[i]) & 0xFFFFFFFF
S0 = (rightrotate(a, 2) ^ rightrotate(a, 13) ^ rightrotate(a, 22)) & 0xFFFFFFFF
maj = ((a & b) ^ (a & c) ^ (b & c)) & 0xFFFFFFFF
temp2 = (S0 + maj) & 0xFFFFFFFF
new_a = (temp1 + temp2) & 0xFFFFFFFF
new_e = (d + temp1) & 0xFFFFFFFF
# Rotate the 8 variables
a, b, c, d, e, f, g, h = new_a, a, b, c, new_e, e, f, g
# Add this chunk's hash to result so far:
h0 = (h0 + a) & 0xFFFFFFFF
h1 = (h1 + b) & 0xFFFFFFFF
h2 = (h2 + c) & 0xFFFFFFFF
h3 = (h3 + d) & 0xFFFFFFFF
h4 = (h4 + e) & 0xFFFFFFFF
h5 = (h5 + f) & 0xFFFFFFFF
h6 = (h6 + g) & 0xFFFFFFFF
h7 = (h7 + h) & 0xFFFFFFFF
# 3. Conclusion
self.hash_pieces = [h0, h1, h2, h3, h4, h5, h6, h7]
def digest(self):
# h0 append h1 append h2 append h3 append h4 append h5 append h6
pieces_without_h7 = self.hash_pieces[:-1]
return sum(leftshift(x, 32 * i) for i, x in enumerate(pieces_without_h7[::-1]))
# We can also write a function to directly compute the hex digest from some bytes data.
# In[69]:
def hash_SHA224(data):
""" Shortcut function to directly receive the hex digest from SHA224(data)."""
h = SHA224()
if isinstance(data, str):
data = bytes(data, encoding='utf8')
h.update(data)
return h.hexdigest()
# #### Checks on SHA-224
#
# Let try the example from [SHA-2 Wikipedia page](https://en.wikipedia.org/wiki/SHA-2#Test_vectors) :
# In[70]:
def true_hash_SHA224(data):
h = hashlib.sha224()
if isinstance(data, str):
data = bytes(data, encoding='utf8')
h.update(data)
return h.hexdigest()
# In[71]:
hash_SHA224("The quick brown fox jumps over the lazy dog")
assert hash_SHA224("The quick brown fox jumps over the lazy dog") == '730e109bd7a8a32b1cb9d9a09aa2325d2430587ddbc0c38bad911525'
assert hash_SHA224("The quick brown fox jumps over the lazy dog") == true_hash_SHA224("The quick brown fox jumps over the lazy dog")
# Even a small change in the message will (with overwhelming probability) result in a mostly different hash, due to the [**avalanche effect**](https://en.wikipedia.org/wiki/Avalanche_effect). For example, adding a period at the end of the sentence:
# In[72]:
hash_SHA224("The quick brown fox jumps over the lazy dog.")
assert hash_SHA224("The quick brown fox jumps over the lazy dog.") == '619cba8e8e05826e9b8c519c0a5c68f4fb653e8a3d8aa04bb2c8cd4c'
assert hash_SHA224("The quick brown fox jumps over the lazy dog.") == true_hash_SHA224("The quick brown fox jumps over the lazy dog.")
# The hash of the zero-length string is:
# In[73]:
hash_SHA224("")
assert hash_SHA224("") == 'd14a028c2a3a2bc9476102bb288234c415a2b01f828ea62ac5b3e42f'
assert hash_SHA224("") == true_hash_SHA224("")
# $\implies$ We obtained the same result, OK our function works!
# ### SHA-512
#
# As said on the Wikipedia page:
#
# > SHA-512 is identical in structure to SHA-256, but:
# >
# > - the message is broken into 1024-bit chunks,
# > - the initial hash values and round constants are extended to 64 bits,
# > - there are 80 rounds instead of 64,
# > - the message schedule array w has 80 64-bit words instead of 64 32-bit words,
# > - to extend the message schedule array w, the loop is from 16 to 79 instead of from 16 to 63,
# > - the round constants are based on the first 80 primes 2..409,
# > - the word size used for calculations is 64 bits long,
# > - the appended length of the message (before pre-processing), in bits, is a 128-bit big-endian integer, and
# > - the shift and rotate amounts used are different.
#
# #### Useful functions the SHA-512 algorithm
# This is exactly like for SHA-256, except we work with 64-bits words instead of 32-bits.
# In[74]:
def leftrotate_64(x, c):
""" Left rotate the number x by c bytes, for 64-bits numbers."""
x &= 0xFFFFFFFFFFFFFFFF
return ((x << c) | (x >> (64 - c))) & 0xFFFFFFFFFFFFFFFF
def rightrotate_64(x, c):
""" Right rotate the number x by c bytes, for 64-bits numbers."""
x &= 0xFFFFFFFFFFFFFFFF
return ((x >> c) | (x << (64 - c))) & 0xFFFFFFFFFFFFFFFF
# #### The `SHA512` class
# In[75]:
class SHA512(Hash):
"""SHA384 hashing, see https://en.wikipedia.org/wiki/SHA-2#Pseudocode."""
def __init__(self):
self.name = "SHA512"
self.byteorder = 'big'
self.block_size = 128
self.digest_size = 64
# Note 2: For each round, there is one round constant k[i] and one entry in the message schedule array w[i], 0 ≤ i ≤ 79
# Note 3: The compression function uses 8 working variables, a through h
# Note 4: Big-endian convention is used when expressing the constants in this pseudocode
# Initialize hash values:
# (The second 64 bits of the fractional parts of the square roots of the first 8 primes 2..19)
h0 = 0x6a09e667f3bcc908
h1 = 0xbb67ae8584caa73b
h2 = 0x3c6ef372fe94f82b
h3 = 0xa54ff53a5f1d36f1
h4 = 0x510e527fade682d1
h5 = 0x9b05688c2b3e6c1f
h6 = 0x1f83d9abfb41bd6b
h7 = 0x5be0cd19137e2179
# Initialize array of round constants:
# (first 64 bits of the fractional parts of the cube roots of the first 80 primes 2..409):
self.k = [
0x428a2f98d728ae22, 0x7137449123ef65cd, 0xb5c0fbcfec4d3b2f, 0xe9b5dba58189dbbc, 0x3956c25bf348b538,
0x59f111f1b605d019, 0x923f82a4af194f9b, 0xab1c5ed5da6d8118, 0xd807aa98a3030242, 0x12835b0145706fbe,
0x243185be4ee4b28c, 0x550c7dc3d5ffb4e2, 0x72be5d74f27b896f, 0x80deb1fe3b1696b1, 0x9bdc06a725c71235,
0xc19bf174cf692694, 0xe49b69c19ef14ad2, 0xefbe4786384f25e3, 0x0fc19dc68b8cd5b5, 0x240ca1cc77ac9c65,
0x2de92c6f592b0275, 0x4a7484aa6ea6e483, 0x5cb0a9dcbd41fbd4, 0x76f988da831153b5, 0x983e5152ee66dfab,
0xa831c66d2db43210, 0xb00327c898fb213f, 0xbf597fc7beef0ee4, 0xc6e00bf33da88fc2, 0xd5a79147930aa725,
0x06ca6351e003826f, 0x142929670a0e6e70, 0x27b70a8546d22ffc, 0x2e1b21385c26c926, 0x4d2c6dfc5ac42aed,
0x53380d139d95b3df, 0x650a73548baf63de, 0x766a0abb3c77b2a8, 0x81c2c92e47edaee6, 0x92722c851482353b,
0xa2bfe8a14cf10364, 0xa81a664bbc423001, 0xc24b8b70d0f89791, 0xc76c51a30654be30, 0xd192e819d6ef5218,
0xd69906245565a910, 0xf40e35855771202a, 0x106aa07032bbd1b8, 0x19a4c116b8d2d0c8, 0x1e376c085141ab53,
0x2748774cdf8eeb99, 0x34b0bcb5e19b48a8, 0x391c0cb3c5c95a63, 0x4ed8aa4ae3418acb, 0x5b9cca4f7763e373,
0x682e6ff3d6b2b8a3, 0x748f82ee5defb2fc, 0x78a5636f43172f60, 0x84c87814a1f0ab72, 0x8cc702081a6439ec,
0x90befffa23631e28, 0xa4506cebde82bde9, 0xbef9a3f7b2c67915, 0xc67178f2e372532b, 0xca273eceea26619c,
0xd186b8c721c0c207, 0xeada7dd6cde0eb1e, 0xf57d4f7fee6ed178, 0x06f067aa72176fba, 0x0a637dc5a2c898a6,
0x113f9804bef90dae, 0x1b710b35131c471b, 0x28db77f523047d84, 0x32caab7b40c72493, 0x3c9ebe0a15c9bebc,
0x431d67c49c100d4c, 0x4cc5d4becb3e42b6, 0x597f299cfc657e2a, 0x5fcb6fab3ad6faec, 0x6c44198c4a475817
]
# Store them
self.hash_pieces = [h0, h1, h2, h3, h4, h5, h6, h7]
def update(self, arg):
h0, h1, h2, h3, h4, h5, h6, h7 = self.hash_pieces
# 1. Pre-processing, exactly like MD5
data = bytearray(arg)
orig_len_in_bits = (8 * len(data)) & 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
# 1.a. Add a single '1' bit at the end of the input bits
data.append(0x80)
# 1.b. Padding with zeros as long as the input bits length ≡ 896 (mod 1024)
while len(data) % 128 != 112:
data.append(0)
# 1.c. append original length in bits mod (2 pow 128) to message
data += orig_len_in_bits.to_bytes(16, byteorder='big')
assert len(data) % 128 == 0, "Error in padding"
# 2. Computations
# Process the message in successive 1024-bit = 128-bytes chunks:
for offset in range(0, len(data), 128):
# 2.a. 1024-bits = 128-bytes chunks
chunks = data[offset : offset + 128]
w = [0 for i in range(80)]
# 2.b. Break chunk into sixteen 128-bit = 8-bytes words w[i], 0 ≤ i ≤ 15
for i in range(16):
w[i] = int.from_bytes(chunks[8*i : 8*i + 8], byteorder='big')
# 2.c. Extend the first 16 words into the remaining 64
# words w[16..79] of the message schedule array:
for i in range(16, 80):
s0 = (rightrotate_64(w[i-15], 1) ^ rightrotate_64(w[i-15], 8) ^ rightshift(w[i-15], 7)) & 0xFFFFFFFFFFFFFFFF
s1 = (rightrotate_64(w[i-2], 19) ^ rightrotate_64(w[i-2], 61) ^ rightshift(w[i-2], 6)) & 0xFFFFFFFFFFFFFFFF
w[i] = (w[i-16] + s0 + w[i-7] + s1) & 0xFFFFFFFFFFFFFFFF
# 2.d. Initialize hash value for this chunk
a, b, c, d, e, f, g, h = h0, h1, h2, h3, h4, h5, h6, h7
# 2.e. Main loop, cf. https://tools.ietf.org/html/rfc6234
for i in range(80):
S1 = (rightrotate_64(e, 14) ^ rightrotate_64(e, 18) ^ rightrotate_64(e, 41)) & 0xFFFFFFFFFFFFFFFF
ch = ((e & f) ^ ((~e) & g)) & 0xFFFFFFFFFFFFFFFF
temp1 = (h + S1 + ch + self.k[i] + w[i]) & 0xFFFFFFFFFFFFFFFF
S0 = (rightrotate_64(a, 28) ^ rightrotate_64(a, 34) ^ rightrotate_64(a, 39)) & 0xFFFFFFFFFFFFFFFF
maj = ((a & b) ^ (a & c) ^ (b & c)) & 0xFFFFFFFFFFFFFFFF
temp2 = (S0 + maj) & 0xFFFFFFFFFFFFFFFF
new_a = (temp1 + temp2) & 0xFFFFFFFFFFFFFFFF
new_e = (d + temp1) & 0xFFFFFFFFFFFFFFFF
# Rotate the 8 variables
a, b, c, d, e, f, g, h = new_a, a, b, c, new_e, e, f, g
# Add this chunk's hash to result so far:
h0 = (h0 + a) & 0xFFFFFFFFFFFFFFFF
h1 = (h1 + b) & 0xFFFFFFFFFFFFFFFF
h2 = (h2 + c) & 0xFFFFFFFFFFFFFFFF
h3 = (h3 + d) & 0xFFFFFFFFFFFFFFFF
h4 = (h4 + e) & 0xFFFFFFFFFFFFFFFF
h5 = (h5 + f) & 0xFFFFFFFFFFFFFFFF
h6 = (h6 + g) & 0xFFFFFFFFFFFFFFFF
h7 = (h7 + h) & 0xFFFFFFFFFFFFFFFF
# 3. Conclusion
self.hash_pieces = [h0, h1, h2, h3, h4, h5, h6, h7]
def digest(self):
# h0 append h1 append h2 append h3 append h4 append h5 append h6 append h7
return sum(leftshift(x, 64 * i) for i, x in enumerate(self.hash_pieces[::-1]))
# We can also write a function to directly compute the hex digest from some bytes data.
# In[76]:
def hash_SHA512(data):
""" Shortcut function to directly receive the hex digest from SHA512(data)."""
h = SHA512()
if isinstance(data, str):
data = bytes(data, encoding='utf8')
h.update(data)
return h.hexdigest()
# #### Checks on SHA-512
#
# Let try the example from [SHA-2 Wikipedia page](https://en.wikipedia.org/wiki/SHA-2#Test_vectors) :
# In[77]:
def true_hash_SHA512(data):
h = hashlib.sha512()
if isinstance(data, str):
data = bytes(data, encoding='utf8')
h.update(data)
return h.hexdigest()
# In[78]:
hash_SHA512("The quick brown fox jumps over the lazy dog")
assert hash_SHA512("The quick brown fox jumps over the lazy dog") == true_hash_SHA512("The quick brown fox jumps over the lazy dog")
# Even a small change in the message will (with overwhelming probability) result in a mostly different hash, due to the [**avalanche effect**](https://en.wikipedia.org/wiki/Avalanche_effect). For example, adding a period at the end of the sentence:
# In[79]:
hash_SHA512("The quick brown fox jumps over the lazy dog.")
assert hash_SHA512("The quick brown fox jumps over the lazy dog.") == true_hash_SHA512("The quick brown fox jumps over the lazy dog.")
# The hash of the zero-length string is:
# In[80]:
hash_SHA512("")
assert hash_SHA512("") == 'cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e'
assert hash_SHA512("") == true_hash_SHA512("")
# $\implies$ We obtained the same result, OK our function works!
# ### SHA-384
#
# As said on the Wikipedia page:
#
# > SHA-384 is identical to SHA-512, except that:
# > - the initial hash values h0 through h7 are different (taken from the 9th through 16th primes), and
# > - the output is constructed by omitting h6 and h7.
# #### The `SHA384` class
# In[81]:
class SHA384(Hash):
"""SHA384 hashing, see https://en.wikipedia.org/wiki/SHA-2#Pseudocode."""
def __init__(self):
self.name = "SHA384"
self.byteorder = 'big'
self.block_size = 96
self.digest_size = 48
# Note 2: For each round, there is one round constant k[i] and one entry in the message schedule array w[i], 0 ≤ i ≤ 79
# Note 3: The compression function uses 8 working variables, a through h
# Note 4: Big-endian convention is used when expressing the constants in this pseudocode
# Initialize hash values:
# (The second 64 bits of the fractional parts of the square roots of the first 9th through 16th primes 23..53)
h0 = 0xcbbb9d5dc1059ed8
h1 = 0x629a292a367cd507
h2 = 0x9159015a3070dd17
h3 = 0x152fecd8f70e5939
h4 = 0x67332667ffc00b31
h5 = 0x8eb44a8768581511
h6 = 0xdb0c2e0d64f98fa7
h7 = 0x47b5481dbefa4fa4
# Initialize array of round constants:
# (first 64 bits of the fractional parts of the cube roots of the first 80 primes 2..409):
self.k = [
0x428a2f98d728ae22, 0x7137449123ef65cd, 0xb5c0fbcfec4d3b2f, 0xe9b5dba58189dbbc, 0x3956c25bf348b538,
0x59f111f1b605d019, 0x923f82a4af194f9b, 0xab1c5ed5da6d8118, 0xd807aa98a3030242, 0x12835b0145706fbe,
0x243185be4ee4b28c, 0x550c7dc3d5ffb4e2, 0x72be5d74f27b896f, 0x80deb1fe3b1696b1, 0x9bdc06a725c71235,
0xc19bf174cf692694, 0xe49b69c19ef14ad2, 0xefbe4786384f25e3, 0x0fc19dc68b8cd5b5, 0x240ca1cc77ac9c65,
0x2de92c6f592b0275, 0x4a7484aa6ea6e483, 0x5cb0a9dcbd41fbd4, 0x76f988da831153b5, 0x983e5152ee66dfab,
0xa831c66d2db43210, 0xb00327c898fb213f, 0xbf597fc7beef0ee4, 0xc6e00bf33da88fc2, 0xd5a79147930aa725,
0x06ca6351e003826f, 0x142929670a0e6e70, 0x27b70a8546d22ffc, 0x2e1b21385c26c926, 0x4d2c6dfc5ac42aed,
0x53380d139d95b3df, 0x650a73548baf63de, 0x766a0abb3c77b2a8, 0x81c2c92e47edaee6, 0x92722c851482353b,
0xa2bfe8a14cf10364, 0xa81a664bbc423001, 0xc24b8b70d0f89791, 0xc76c51a30654be30, 0xd192e819d6ef5218,
0xd69906245565a910, 0xf40e35855771202a, 0x106aa07032bbd1b8, 0x19a4c116b8d2d0c8, 0x1e376c085141ab53,
0x2748774cdf8eeb99, 0x34b0bcb5e19b48a8, 0x391c0cb3c5c95a63, 0x4ed8aa4ae3418acb, 0x5b9cca4f7763e373,
0x682e6ff3d6b2b8a3, 0x748f82ee5defb2fc, 0x78a5636f43172f60, 0x84c87814a1f0ab72, 0x8cc702081a6439ec,
0x90befffa23631e28, 0xa4506cebde82bde9, 0xbef9a3f7b2c67915, 0xc67178f2e372532b, 0xca273eceea26619c,
0xd186b8c721c0c207, 0xeada7dd6cde0eb1e, 0xf57d4f7fee6ed178, 0x06f067aa72176fba, 0x0a637dc5a2c898a6,
0x113f9804bef90dae, 0x1b710b35131c471b, 0x28db77f523047d84, 0x32caab7b40c72493, 0x3c9ebe0a15c9bebc,
0x431d67c49c100d4c, 0x4cc5d4becb3e42b6, 0x597f299cfc657e2a, 0x5fcb6fab3ad6faec, 0x6c44198c4a475817
]
# Store them
self.hash_pieces = [h0, h1, h2, h3, h4, h5, h6, h7]
def update(self, arg):
h0, h1, h2, h3, h4, h5, h6, h7 = self.hash_pieces
# 1. Pre-processing, exactly like MD5
data = bytearray(arg)
orig_len_in_bits = (8 * len(data)) & 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
# 1.a. Add a single '1' bit at the end of the input bits
data.append(0x80)
# 1.b. Padding with zeros as long as the input bits length ≡ 896 (mod 1024)
while len(data) % 128 != 112:
data.append(0)
# 1.c. append original length in bits mod (2 pow 128) to message
data += orig_len_in_bits.to_bytes(16, byteorder='big')
assert len(data) % 128 == 0, "Error in padding"
# 2. Computations
# Process the message in successive 1024-bit = 128-bytes chunks:
for offset in range(0, len(data), 128):
# 2.a. 1024-bits = 128-bytes chunks
chunks = data[offset : offset + 128]
w = [0 for i in range(80)]
# 2.b. Break chunk into sixteen 128-bit = 8-bytes words w[i], 0 ≤ i ≤ 15
for i in range(16):
w[i] = int.from_bytes(chunks[8*i : 8*i + 8], byteorder='big')
# 2.c. Extend the first 16 words into the remaining 64
# words w[16..79] of the message schedule array:
for i in range(16, 80):
s0 = (rightrotate_64(w[i-15], 1) ^ rightrotate_64(w[i-15], 8) ^ rightshift(w[i-15], 7)) & 0xFFFFFFFFFFFFFFFF
s1 = (rightrotate_64(w[i-2], 19) ^ rightrotate_64(w[i-2], 61) ^ rightshift(w[i-2], 6)) & 0xFFFFFFFFFFFFFFFF
w[i] = (w[i-16] + s0 + w[i-7] + s1) & 0xFFFFFFFFFFFFFFFF
# 2.d. Initialize hash value for this chunk
a, b, c, d, e, f, g, h = h0, h1, h2, h3, h4, h5, h6, h7
# 2.e. Main loop, cf. https://tools.ietf.org/html/rfc6234
for i in range(80):
S1 = (rightrotate_64(e, 14) ^ rightrotate_64(e, 18) ^ rightrotate_64(e, 41)) & 0xFFFFFFFFFFFFFFFF
ch = ((e & f) ^ ((~e) & g)) & 0xFFFFFFFFFFFFFFFF
temp1 = (h + S1 + ch + self.k[i] + w[i]) & 0xFFFFFFFFFFFFFFFF
S0 = (rightrotate_64(a, 28) ^ rightrotate_64(a, 34) ^ rightrotate_64(a, 39)) & 0xFFFFFFFFFFFFFFFF
maj = ((a & b) ^ (a & c) ^ (b & c)) & 0xFFFFFFFFFFFFFFFF
temp2 = (S0 + maj) & 0xFFFFFFFFFFFFFFFF
new_a = (temp1 + temp2) & 0xFFFFFFFFFFFFFFFF
new_e = (d + temp1) & 0xFFFFFFFFFFFFFFFF
# Rotate the 8 variables
a, b, c, d, e, f, g, h = new_a, a, b, c, new_e, e, f, g
# Add this chunk's hash to result so far:
h0 = (h0 + a) & 0xFFFFFFFFFFFFFFFF
h1 = (h1 + b) & 0xFFFFFFFFFFFFFFFF
h2 = (h2 + c) & 0xFFFFFFFFFFFFFFFF
h3 = (h3 + d) & 0xFFFFFFFFFFFFFFFF
h4 = (h4 + e) & 0xFFFFFFFFFFFFFFFF
h5 = (h5 + f) & 0xFFFFFFFFFFFFFFFF
h6 = (h6 + g) & 0xFFFFFFFFFFFFFFFF
h7 = (h7 + h) & 0xFFFFFFFFFFFFFFFF
# 3. Conclusion
self.hash_pieces = [h0, h1, h2, h3, h4, h5, h6, h7]
def digest(self):
# h0 append h1 append h2 append h3 append h4 append h5
hash_pieces_without_67 = self.hash_pieces[:-2]
return sum(leftshift(x, 64 * i) for i, x in enumerate(hash_pieces_without_67[::-1]))
# We can also write a function to directly compute the hex digest from some bytes data.
# In[82]:
def hash_SHA384(data):
""" Shortcut function to directly receive the hex digest from SHA384(data)."""
h = SHA384()
if isinstance(data, str):
data = bytes(data, encoding='utf8')
h.update(data)
return h.hexdigest()
# #### Checks on SHA-384
#
# Let try the example from [SHA-2 Wikipedia page](https://en.wikipedia.org/wiki/SHA-2#Test_vectors) :
# In[83]:
def true_hash_SHA384(data):
h = hashlib.sha384()
if isinstance(data, str):
data = bytes(data, encoding='utf8')
h.update(data)
return h.hexdigest()
# In[84]:
hash_SHA384("The quick brown fox jumps over the lazy dog")
assert hash_SHA384("The quick brown fox jumps over the lazy dog") == true_hash_SHA384("The quick brown fox jumps over the lazy dog")
# Even a small change in the message will (with overwhelming probability) result in a mostly different hash, due to the [**avalanche effect**](https://en.wikipedia.org/wiki/Avalanche_effect). For example, adding a period at the end of the sentence:
# In[85]:
hash_SHA384("The quick brown fox jumps over the lazy dog.")
assert hash_SHA384("The quick brown fox jumps over the lazy dog.") == true_hash_SHA384("The quick brown fox jumps over the lazy dog.")
# The hash of the zero-length string is:
# In[86]:
hash_SHA384("")
assert hash_SHA384("") == '38b060a751ac96384cd9327eb1b1e36a21fdb71114be07434c0cc7bf63f6e1da274edebfe76f65fbd51ad2f14898b95b'
assert hash_SHA384("") == true_hash_SHA384("")
# $\implies$ We obtained the same result, OK our function works!
# ### More comparison
# In[87]:
def test_SHA224():
x = random_string()
return hash_SHA224(x) == true_hash_SHA224(x)
get_ipython().run_line_magic('timeit', 'test_SHA224()')
# In[88]:
def test_SHA512():
x = random_string()
return hash_SHA512(x) == true_hash_SHA512(x)
get_ipython().run_line_magic('timeit', 'test_SHA512()')
# In[89]:
def test_SHA384():
x = random_string()
return hash_SHA384(x) == true_hash_SHA384(x)
get_ipython().run_line_magic('timeit', 'test_SHA384()')
# `SHA512` and `SHA384` are slower than `SHA256` obviously, but it's weird that `SHA224` is slower than the 64-bits versions...
# ----
# ## Conclusion
# Well, it was fun and interesting to implement these hashing functions, manually.
# Using [Python](https://www.Python.org) made it easy!
#
# > Note that a Python 2 library implementing manually all these hashing functions already exist: [`pysha2`](https://github.com/thomdixon/pysha2), by [@thomdixon](https://github.com/thomdixon).
# > (I discovered it *after* writing this notebook!)
#
# [![made-with-jupyter](https://img.shields.io/badge/Made%20for-Jupyter%20notebook-1f425f.svg)](https://www.jupyter.org/)
# [![GitHub license](https://img.shields.io/github/license/Naereen/notebooks.svg)](https://github.com/Naereen/notebooks/blob/master/LICENSE.txt)
# [![forthebadge made-with-python](http://ForTheBadge.com/images/badges/made-with-python.svg)](https://www.python.org/)
# [![ForTheBadge built-with-science](http://ForTheBadge.com/images/badges/built-with-science.svg)](https://GitHub.com/Naereen/)
# [![ForTheBadge powered-by-electricity](http://ForTheBadge.com/images/badges/powered-by-electricity.svg)](http://ForTheBadge.com)
# ### Bonus
# "SHA" is pronouced like the French word "chat", which means *cat*.
#
# ![a cat playing on a computer](https://media.giphy.com/media/JIX9t2j0ZTN9S/giphy.gif)
# > See [my GitHub `notebooks` project](https://GitHub.com/Naereen/notebooks/) for others notebooks.