===== Hashing Files in Python - Performance Considerations - Finding Duplicates =====

With the "hashlib" package the various [[https://en.wikipedia.org/wiki/Cryptographic_hash_function#Cryptographic_hash_algorithms|hash]] algorithms can be used in Python.

  * Import with: ''import hashlib''
  * Show all available algorithms with: ''hashlib.algorithms_available''
  * Create the hash object: ''hashtask = hashlib.new(hashAlgo)''
  * Feed the data into the hash: ''hashtask.update(buf)''
  * Read the file hash: ''filehash = hashtask.hexdigest()''

=== Code to check the performance of the different algorithms ===

The test file is read through a buffer, and each buffer chunk is hashed:

<code python>
__author__ = 'gpipperr'

import time, os
# Hashing
import hashlib
# Graph
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d

# check the possible hash libs
print(hashlib.algorithms_available)

# add your test file here
file = 'D:\\temp\\20140818\\DSCN0025.TIF'
size = os.path.getsize(file)

# size of the file buffer
buffersize = 262144

results = []

for hashAlgo in hashlib.algorithms_available:

    # remember the start time of this run
    start_time = time.perf_counter()

    # define the hash algorithm
    hashtask = hashlib.new(hashAlgo)

    # open the file, read it in chunks and hash each chunk
    with open(file, 'rb') as afile:
        buffer = afile.read(buffersize)
        while len(buffer) > 0:
            hashtask.update(buffer)
            buffer = afile.read(buffersize)

    # get the hash of the file
    filehash = hashtask.hexdigest()

    # remember the result
    results.append((hashAlgo, time.perf_counter() - start_time))
    # print("-- Info :: The file {0} has the {1} hash :: {2}".format(file, hashAlgo, filehash))
    # print("-- Info :: The single run needs :: {0:5.8f} seconds".format(time.perf_counter() - start_time))

# sort the result list by the second element (runtime)
results.sort(key=lambda x: x[1])
sorted_result = results

print("--" + 40 * "=")
print("-- Info :: File {0} and size {1:3.3f} MB".format(file, size / 1024 / 1024))
print("-- Info :: Use Buffersize {0}".format(buffersize))
for result in sorted_result:
    print(" Hash {0:15} | {1:2.6f} ".format(result[0], result[1]))
print("--" + 40 * "=")

# check the main algorithms with different buffer sizes
resultsBuffersize = []

# algoList = hashlib.algorithms_available
algoList = ['SHA1', 'MD5', 'DSA-SHA', 'MD4', 'SHA', 'DSA']

for run in range(12):
    buffersize = 512
    for i in range(10):
        buffersize = buffersize * 2
        for hashAlgo in algoList:
            # remember the start time of this run
            start_time = time.perf_counter()
            hashtask = hashlib.new(hashAlgo)
            with open(file, 'rb') as afile:
                buffer = afile.read(buffersize)
                while len(buffer) > 0:
                    hashtask.update(buffer)
                    buffer = afile.read(buffersize)
            filehash = hashtask.hexdigest()
            resultsBuffersize.append((hashAlgo, buffersize, time.perf_counter() - start_time))

# sort the list with sorted() by the third element (runtime)
sorted_result = sorted(resultsBuffersize, key=lambda x: float(x[2]))

print("--" + 40 * "=")
print("-- Info :: Results for Buffersize and Algorithm")
print(" {0:17} | {1:11} | {2:10} ".format("Hash", "Buffer", "Time"))
for result in sorted_result:
    print(" {0:15} | {1:8} | {2:2.6f} ".format(result[0], result[1], result[2]))
print("--" + 40 * "=")

valuesX = []
valuesY = []

# graph the figures
for hashAlgo in algoList:
    for val in sorted_result:
        if val[0] == hashAlgo:
            valuesX.append(val[2])
            valuesY.append(val[1])
    plt.plot(valuesX, valuesY, label=hashAlgo, marker='x')
    # --, marker='o', linestyle='--', color='r',
    # plot a smooth line with interpolation - must be implemented later
    # f2 = interp1d(valuesX, valuesY, kind='cubic')
    # plt.plot(f2, label=hashAlgo, marker='x')
    # empty the lists for the next algorithm
    valuesX[:] = []
    valuesY[:] = []

############################################################
# Graph the figures
############################################################

# Runtime X axis
# plt.xscale('log')
plt.xlabel('Runtime s')
plt.xlim([0.04, 0.12])

# Buffersize Y axis
plt.ylabel('Buffersize Byte')
plt.yscale('log')

buffersize = 512
yticks = []
for i in range(10):
    buffersize = buffersize * 2
    yticks.append(buffersize)
plt.yticks(yticks, yticks)

# Diagram options
plt.title('Hash Performance for 10 runs of each algorithm with different buffersize')
legend = plt.legend(loc='upper right', shadow=True, fontsize='x-large')
legend.get_frame().set_facecolor('#FFFFE0')
plt.grid()
plt.show()
############################################################
</code>

==== Performance Considerations ====

The following graph shows, for the most important hash algorithms, the relation between the buffer size used while processing the file and the resulting runtime.

{{ :python:python_hash_performance_v01.png | Overview Hash Performance Python}}

Each hash test is run 10 times with the same buffer size to even out fluctuations on the system. For the test file (size 12.588 MB) on an Intel i7 notebook the best values result from the following combinations:

^ Hash ^ Buffer ^ Time s ^
| MD4 | 262144 | 0.044618 |
| MD5 | 262144 | 0.046280 |
| DSA-SHA | 262144 | 0.049455 |
| DSA | 262144 | 0.049489 |
| SHA1 | 262144 | 0.049500 |
| SHA | 262144 | 0.084308 |

In this case, therefore, a buffer size of 262144 bytes works best.

The hash performance of the current machine can also be checked with openssl:

<code>
openssl speed
</code>

In principle, however, this serves more to better understand the performance of the encryption algorithms.

----

==== Use Case - Detecting Duplicates in a Directory Tree ====

Read the directory structure with ''for folder, subFolders, files in os.walk(base_dir):'', hash all files, and afterwards check the list for duplicates. A progress bar is shown during the run, so unfortunately the number of files has to be determined beforehand.

The output then looks like this:

<code>
D:\Python34\python.exe D:/entwicklung/work/python/ImageImp/deduplicateDirectoryTree.py
--========================================
-- Info :: start to read base directory d:\temp with 7116 files (each x for 88 files)
--========================================
--
-- [xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx]
--
--========================================
....
 ----------------------------------------
-- Info hash ffd7ac3677ce85775a40b195f4263d5d and this date 2014-10-07 10:43:06 :: File d:\temp\old_images\Handy_Bilder\WhatsApp Images\IMG-20141007-WA0002.jpeg
-- Info hash ffd7ac3677ce85775a40b195f4263d5d and this date 2014-10-07 09:43:06 :: File d:\temp\old_images\WhatsApp Images\IMG-20141007-WA0002.jpeg
--========================================
-- Info :: found 7116 duplicate entries in 7116 total files
-- Info :: The run needs :: 33.66735999 seconds
--========================================
</code>

The next question is which of the duplicate files should be deleted in the end. This will probably be implemented so that one directory is given as the source and a tree as the target; the user can then decide whether the files are deleted from the source or from the target tree.
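Stripped of the progress bar and timing, the detection itself boils down to grouping all file paths by their hash value. The following is only a minimal sketch of that idea (the helper names ''file_hash'' and ''find_duplicates'' and the use of ''collections.defaultdict'' are illustrative, not part of the script below); the complete script with progress bar and duplicate listing follows in the next section:

<code python>
import os, hashlib
from collections import defaultdict

def file_hash(path, algo='md5', buffersize=262144):
    # hash a single file chunk by chunk
    h = hashlib.new(algo)
    with open(path, 'rb') as f:
        chunk = f.read(buffersize)
        while chunk:
            h.update(chunk)
            chunk = f.read(buffersize)
    return h.hexdigest()

def find_duplicates(base_dir):
    # group all files of the tree by their hash value
    groups = defaultdict(list)
    for folder, subFolders, files in os.walk(base_dir):
        for name in files:
            full = os.path.join(folder, name)
            if os.path.isfile(full):
                groups[file_hash(full)].append(full)
    # only hashes with more than one file are real duplicates
    return {h: paths for h, paths in groups.items() if len(paths) > 1}

# example call - adjust the directory to your system
# for h, paths in find_duplicates('d:\\temp').items():
#     print(h, paths)
</code>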
=== Code ===

<code python>
__author__ = 'gpipperr'

import time, os, sys, datetime
# Hashing
import hashlib

# get the change date of a file
def modification_date(filename):
    t = os.path.getmtime(filename)
    return datetime.datetime.fromtimestamp(t)

# write the progressbar
def write_progress(step, width):
    if width > step:
        s = "X" * step + ">"
    else:
        s = "X" * step
    r = "." * (width - step)
    print("-- [{0:{1}}]".format(s + r, width), end="\r")
    sys.stdout.flush()

# close the progressbar
def finish_progress(step):
    s = "x" * step
    print("-- [" + s + "]")
    sys.stdout.flush()

# after how many files a progress mark is drawn
progress_points = 10
progress_width = 80

# algoList = hashlib.algorithms_available
algoList = ['SHA1', 'MD5', 'DSA-SHA', 'MD4', 'SHA', 'DSA']

# best performance
hashAlgo = 'MD5'

# size of the file buffer
buffersize = 262144

# remember the start time of the program
start_time = time.perf_counter()

# read a complete directory structure
# base_dir = "d:\\temp\\duptest"
base_dir = "d:\\temp"

fileRead = 0

# calculate how many files to compare
for folder, subFolders, files in os.walk(base_dir):
    # print("folder {0}, subFolders {1}, files {2}".format(folder, subFolders, files))
    for file in files:
        fileRead += 1

# calculate the progressbar step width (at least 1 to avoid a division by zero)
progress_points = max(1, int(fileRead / progress_width))

print("--" + 40 * "=")
print("-- Info :: start to read base directory {0} with {1} files (each x for {2} files)".format(base_dir, fileRead, progress_points))
# print("-- Info :: The prep needs :: {0:5.8f} seconds".format(time.perf_counter() - start_time))
print("--" + 40 * "=")
print("--")

fileRead = 0
step = 1

# get the hashes of all files in the directory structure
results = []
for folder, subFolders, files in os.walk(base_dir):
    for file in files:
        fileFullName = folder + os.path.sep + file
        if os.path.isfile(fileFullName):
            fileRead += 1
            # progressbar
            if fileRead % progress_points == 0:
                write_progress(step, progress_width)
                step += 1
            # define the hash algorithm
            hashtask = hashlib.new(hashAlgo)
            try:
                # open the file, read it in chunks and hash each chunk
                with open(fileFullName, 'rb') as afile:
                    buffer = afile.read(buffersize)
                    while len(buffer) > 0:
                        hashtask.update(buffer)
                        buffer = afile.read(buffersize)
                # get the hash of the file
                filehash = hashtask.hexdigest()
                # remember the result
                results.append([filehash, fileFullName, modification_date(fileFullName)])
                # print("-- Info :: The file {0} has the {1} hash :: {2}".format(fileFullName, hashAlgo, filehash))
            except:
                print("-- Error :: 01 - Read File {0} :: error {1}:".format(fileFullName, sys.exc_info()))
                pass

# close progressbar
finish_progress(step)
print("--")

# find the duplicate hash values

# sort the list by the hash value
results.sort(key=lambda x: x[0])

# get only the hash of each list entry
h = []
for b in results:
    h.append(b[0])

# get an unordered but unique set of the hashes
s = set(h)

print("--" + 40 * "=")

# every hash that occurs more than once ends up in dub
dub = []
for x in h:
    if x in s:
        s.remove(x)
    else:
        dub.append(x)

x = '-'
dub.sort()
if len(dub) > 0:
    x = dub[0]

# show the duplicate entries, separated per hash group
for res in results:
    if res[0] in dub:
        if res[0] != x:
            print(" " + 40 * "-")
        print("-- Info hash {0} and this date {2} :: File {1} ".format(res[0], res[1], res[2]))
        x = res[0]

print("--" + 40 * "=")
print("-- Info :: found {0} duplicate entries in {1} total files".format(len(dub), fileRead))
print("-- Info :: The run needs :: {0:5.8f} seconds".format(time.perf_counter() - start_time))
print("--" + 40 * "=")
</code>
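For the planned delete step a possible (not yet implemented) approach could look like the following sketch. ''find_duplicates'' is assumed to return a dict ''{hash: [path, path, ...]}'' as in the sketch above; ''source_dir'', ''target_dir'', the ''delete_from'' switch and the ''dry_run'' flag are only assumptions to illustrate the idea:

<code python>
import os

def remove_duplicates(duplicates, source_dir, target_dir, delete_from='target', dry_run=True):
    # duplicates: {hash: [path, path, ...]} as returned by the sketch above (assumption)
    for filehash, paths in duplicates.items():
        in_source = [p for p in paths if p.startswith(source_dir)]
        in_target = [p for p in paths if p.startswith(target_dir)]
        # only act if the same content exists in both trees
        if in_source and in_target:
            victims = in_source if delete_from == 'source' else in_target
            for path in victims:
                if dry_run:
                    print("-- Info :: would delete {0}".format(path))
                else:
                    os.remove(path)
</code>

With ''dry_run=True'' the function only reports what it would delete, which makes it easier to verify the result before any file is actually removed.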
==== Sources ====

  * http://pythoncentral.io/hashing-files-with-python/
  * https://docs.python.org/3/library/hashlib.html

Hashing in general:

  * http://www.unixwiz.net/techtips/iguide-crypto-hashes.html
  * http://cacr.uwaterloo.ca/hac/

Finding duplicates in lists:

  * http://stackoverflow.com/questions/9835762/find-and-list-duplicates-in-python-list