===== Hashing Files in Python - Performance Considerations - Finding Duplicates =====

With the "hashlib" package the various [[https://en.wikipedia.org/wiki/Cryptographic_hash_function#Cryptographic_hash_algorithms|hash]] algorithms can be used in Python.

  * Import with: ''import hashlib''
  * Show all available algorithms with: ''hashlib.algorithms_available''
  * Create the hash object: ''hashtask = hashlib.new(hashAlgo)''
  * Feed the data into the hash: ''hashtask.update(buf)''
  * Read the file hash: ''filehash = hashtask.hexdigest()''

=== Code to check the performance of the different algorithms ===

The test file is read through a buffer, and each buffer chunk is hashed:

<code python>
__author__ = 'gpipperr'

import time, os
# Hashing
import hashlib
# Graph
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d

# check the possible hash libs
print(hashlib.algorithms_available)

# add your test file here
file = 'D:\\temp\\20140818\\DSCN0025.TIF'
size = os.path.getsize(file)

# size of the file buffer
buffersize = 262144

results = []

for hashAlgo in hashlib.algorithms_available:

    # remember the start time of this run
    start_time = time.perf_counter()

    # define the hash algorithm
    hashtask = hashlib.new(hashAlgo)

    # open the file, read it in chunks and hash each chunk
    with open(file, 'rb') as afile:
        buffer = afile.read(buffersize)
        while len(buffer) > 0:
            hashtask.update(buffer)
            buffer = afile.read(buffersize)

    # get the hash of the file
    filehash = hashtask.hexdigest()

    # remember the result
    results.append((hashAlgo, time.perf_counter() - start_time))
    # print("-- Info :: The file {0} has the {1} hash :: {2}".format(file, hashAlgo, filehash))
    # print("-- Info :: The single run needs :: {0:5.8f} seconds".format(time.perf_counter() - start_time))

# sort the result list by the second element (runtime)
results.sort(key=lambda x: x[1])
sorted_result = results

print("--" + 40 * "=")
print("-- Info :: File {0} and size {1:3.3f} MB".format(file, size / 1024 / 1024))
print("-- Info :: Use Buffersize {0}".format(buffersize))
for result in sorted_result:
    print(" Hash {0:15} | {1:2.6f} ".format(result[0], result[1]))
print("--" + 40 * "=")

# check the main algorithms with different buffer sizes
resultsBuffersize = []

# algoList = hashlib.algorithms_available
algoList = ['SHA1', 'MD5', 'DSA-SHA', 'MD4', 'SHA', 'DSA']

for run in range(12):
    buffersize = 512
    for i in range(10):
        buffersize = buffersize * 2
        for hashAlgo in algoList:
            # remember the start time of this run
            start_time = time.perf_counter()
            hashtask = hashlib.new(hashAlgo)
            with open(file, 'rb') as afile:
                buffer = afile.read(buffersize)
                while len(buffer) > 0:
                    hashtask.update(buffer)
                    buffer = afile.read(buffersize)
            filehash = hashtask.hexdigest()
            resultsBuffersize.append((hashAlgo, buffersize, time.perf_counter() - start_time))

# sort the list with sorted() by the third element (runtime)
sorted_result = sorted(resultsBuffersize, key=lambda x: float(x[2]))

print("--" + 40 * "=")
print("-- Info :: Results for Buffersize and Algorithm")
print(" {0:17} | {1:11} | {2:10} ".format("Hash", "Buffer", "Time"))
for result in sorted_result:
    print(" {0:15} | {1:8} | {2:2.6f} ".format(result[0], result[1], result[2]))
print("--" + 40 * "=")

valuesX = []
valuesY = []

# graph the figures
for hashAlgo in algoList:
    for val in sorted_result:
        if val[0] == hashAlgo:
            valuesX.append(val[2])
            valuesY.append(val[1])
    plt.plot(valuesX, valuesY, label=hashAlgo, marker='x')
    # --, marker='o', linestyle='--', color='r',
    # plot a smooth line with interpolation - must be implemented later
    # f2 = interp1d(valuesX, valuesY, kind='cubic')
    # plt.plot(f2, label=hashAlgo, marker='x')
    # empty the lists for the next algorithm
    valuesX[:] = []
    valuesY[:] = []

############################################################
# Graph the figures
############################################################

# Runtime X axis
# plt.xscale('log')
plt.xlabel('Runtime s')
plt.xlim([0.04, 0.12])

# Buffersize Y axis
plt.ylabel('Buffersize Byte')
plt.yscale('log')

buffersize = 512
yticks = []
for i in range(10):
    buffersize = buffersize * 2
    yticks.append(buffersize)
plt.yticks(yticks, yticks)

# Diagram options
plt.title('Hash Performance for 10 runs of each algorithm with different buffersize')
legend = plt.legend(loc='upper right', shadow=True, fontsize='x-large')
legend.get_frame().set_facecolor('#FFFFE0')
plt.grid()
plt.show()
############################################################
</code>

==== Performance Considerations ====

The following graph shows, for the most important hash algorithms, the relation between the buffer size used while processing the file and the resulting runtime.

{{ :python:python_hash_performance_v01.png | Overview Hash Performance Python}}

Each hash test is run 10 times with the same buffer size to even out fluctuations on the system. For the test file (size 12.588 MB) on an Intel i7 notebook the best values result from the following combinations:

^ Hash ^ Buffer ^ Time s ^
| MD4 | 262144 | 0.044618 |
| MD5 | 262144 | 0.046280 |
| DSA-SHA | 262144 | 0.049455 |
| DSA | 262144 | 0.049489 |
| SHA1 | 262144 | 0.049500 |
| SHA | 262144 | 0.084308 |

In this case, therefore, a buffer size of 262144 bytes works best.

The hash performance of the current machine can also be checked with openssl:

<code>
openssl speed
</code>

In principle, however, this serves more to better understand the performance of the encryption algorithms.

----

==== Use Case - Detecting Duplicates in a Directory Tree ====

Read the directory structure with ''for folder, subFolders, files in os.walk(base_dir):'', hash all files, and afterwards check the list for duplicates. A progress bar is shown during the run, so unfortunately the number of files has to be determined beforehand.

The output then looks like this:

<code>
D:\Python34\python.exe D:/entwicklung/work/python/ImageImp/deduplicateDirectoryTree.py
--========================================
-- Info :: start to read base directory d:\temp with 7116 files (each x for 88 files)
--========================================
--
-- [xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx]
--
--========================================
....
 ----------------------------------------
-- Info hash ffd7ac3677ce85775a40b195f4263d5d and this date 2014-10-07 10:43:06 :: File d:\temp\old_images\Handy_Bilder\WhatsApp Images\IMG-20141007-WA0002.jpeg
-- Info hash ffd7ac3677ce85775a40b195f4263d5d and this date 2014-10-07 09:43:06 :: File d:\temp\old_images\WhatsApp Images\IMG-20141007-WA0002.jpeg
--========================================
-- Info :: found 7116 duplicate entries in 7116 total files
-- Info :: The run needs :: 33.66735999 seconds
--========================================
</code>

The next question is which of the duplicate files should be deleted in the end. This will probably be implemented so that one directory is given as the source and a tree as the target; the user can then decide whether the files are deleted from the source or from the target tree.
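Stripped of the progress bar and timing, the detection itself boils down to grouping all file paths by their hash value. The following is only a minimal sketch of that idea (the helper names ''file_hash'' and ''find_duplicates'' and the use of ''collections.defaultdict'' are illustrative, not part of the script below); the complete script with progress bar and duplicate listing follows in the next section:

<code python>
import os, hashlib
from collections import defaultdict

def file_hash(path, algo='md5', buffersize=262144):
    # hash a single file chunk by chunk
    h = hashlib.new(algo)
    with open(path, 'rb') as f:
        chunk = f.read(buffersize)
        while chunk:
            h.update(chunk)
            chunk = f.read(buffersize)
    return h.hexdigest()

def find_duplicates(base_dir):
    # group all files of the tree by their hash value
    groups = defaultdict(list)
    for folder, subFolders, files in os.walk(base_dir):
        for name in files:
            full = os.path.join(folder, name)
            if os.path.isfile(full):
                groups[file_hash(full)].append(full)
    # only hashes with more than one file are real duplicates
    return {h: paths for h, paths in groups.items() if len(paths) > 1}

# example call - adjust the directory to your system
# for h, paths in find_duplicates('d:\\temp').items():
#     print(h, paths)
</code>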
=== Code ===

<code python>
__author__ = 'gpipperr'

import time, os, sys, datetime
# Hashing
import hashlib

# get the change date of a file
def modification_date(filename):
    t = os.path.getmtime(filename)
    return datetime.datetime.fromtimestamp(t)

# write the progressbar
def write_progress(step, width):
    if width > step:
        s = "X" * step + ">"
    else:
        s = "X" * step
    r = "." * (width - step)
    print("-- [{0:{1}}]".format(s + r, width), end="\r")
    sys.stdout.flush()

# close the progressbar
def finish_progress(step):
    s = "x" * step
    print("-- [" + s + "]")
    sys.stdout.flush()

# after how many files a progress mark is drawn
progress_points = 10
progress_width = 80

# algoList = hashlib.algorithms_available
algoList = ['SHA1', 'MD5', 'DSA-SHA', 'MD4', 'SHA', 'DSA']

# best performance
hashAlgo = 'MD5'

# size of the file buffer
buffersize = 262144

# remember the start time of the program
start_time = time.perf_counter()

# read a complete directory structure
# base_dir = "d:\\temp\\duptest"
base_dir = "d:\\temp"

fileRead = 0

# calculate how many files to compare
for folder, subFolders, files in os.walk(base_dir):
    # print("folder {0}, subFolders {1}, files {2}".format(folder, subFolders, files))
    for file in files:
        fileRead += 1

# calculate the progressbar step width (at least 1 to avoid a division by zero)
progress_points = max(1, int(fileRead / progress_width))

print("--" + 40 * "=")
print("-- Info :: start to read base directory {0} with {1} files (each x for {2} files)".format(base_dir, fileRead, progress_points))
# print("-- Info :: The prep needs :: {0:5.8f} seconds".format(time.perf_counter() - start_time))
print("--" + 40 * "=")
print("--")

fileRead = 0
step = 1

# get the hashes of all files in the directory structure
results = []
for folder, subFolders, files in os.walk(base_dir):
    for file in files:
        fileFullName = folder + os.path.sep + file
        if os.path.isfile(fileFullName):
            fileRead += 1
            # progressbar
            if fileRead % progress_points == 0:
                write_progress(step, progress_width)
                step += 1
            # define the hash algorithm
            hashtask = hashlib.new(hashAlgo)
            try:
                # open the file, read it in chunks and hash each chunk
                with open(fileFullName, 'rb') as afile:
                    buffer = afile.read(buffersize)
                    while len(buffer) > 0:
                        hashtask.update(buffer)
                        buffer = afile.read(buffersize)
                # get the hash of the file
                filehash = hashtask.hexdigest()
                # remember the result
                results.append([filehash, fileFullName, modification_date(fileFullName)])
                # print("-- Info :: The file {0} has the {1} hash :: {2}".format(fileFullName, hashAlgo, filehash))
            except:
                print("-- Error :: 01 - Read File {0} :: error {1}:".format(fileFullName, sys.exc_info()))
                pass

# close progressbar
finish_progress(step)
print("--")

# find the duplicate hash values

# sort the list by the hash value
results.sort(key=lambda x: x[0])

# get only the hash of each list entry
h = []
for b in results:
    h.append(b[0])

# get an unordered but unique set of the hashes
s = set(h)

print("--" + 40 * "=")

# every hash that occurs more than once ends up in dub
dub = []
for x in h:
    if x in s:
        s.remove(x)
    else:
        dub.append(x)

x = '-'
dub.sort()
if len(dub) > 0:
    x = dub[0]

# show the duplicate entries, separated per hash group
for res in results:
    if res[0] in dub:
        if res[0] != x:
            print(" " + 40 * "-")
        print("-- Info hash {0} and this date {2} :: File {1} ".format(res[0], res[1], res[2]))
        x = res[0]

print("--" + 40 * "=")
print("-- Info :: found {0} duplicate entries in {1} total files".format(len(dub), fileRead))
print("-- Info :: The run needs :: {0:5.8f} seconds".format(time.perf_counter() - start_time))
print("--" + 40 * "=")
</code>
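For the planned delete step a possible (not yet implemented) approach could look like the following sketch. ''find_duplicates'' is assumed to return a dict ''{hash: [path, path, ...]}'' as in the sketch above; ''source_dir'', ''target_dir'', the ''delete_from'' switch and the ''dry_run'' flag are only assumptions to illustrate the idea:

<code python>
import os

def remove_duplicates(duplicates, source_dir, target_dir, delete_from='target', dry_run=True):
    # duplicates: {hash: [path, path, ...]} as returned by the sketch above (assumption)
    for filehash, paths in duplicates.items():
        in_source = [p for p in paths if p.startswith(source_dir)]
        in_target = [p for p in paths if p.startswith(target_dir)]
        # only act if the same content exists in both trees
        if in_source and in_target:
            victims = in_source if delete_from == 'source' else in_target
            for path in victims:
                if dry_run:
                    print("-- Info :: would delete {0}".format(path))
                else:
                    os.remove(path)
</code>

With ''dry_run=True'' the function only reports what it would delete, which makes it easier to verify the result before any file is actually removed.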
==== Sources ====

  * http://pythoncentral.io/hashing-files-with-python/
  * https://docs.python.org/3/library/hashlib.html

Hashing in general:

  * http://www.unixwiz.net/techtips/iguide-crypto-hashes.html
  * http://cacr.uwaterloo.ca/hac/

Finding duplicates in lists:

  * http://stackoverflow.com/questions/9835762/find-and-list-duplicates-in-python-list