Automating Metagoofil with Python
I recently needed to automate metagoofil searches using Python, and thought I'd share the "proof of concept" script that got it working. I'll apologize ahead of time for any errors--I'm still only just beginning to learn Python.
Why automate metagoofil? Well, once you have your hands on the type of files and metadata metagoofil brings back, there's a myriad of things you could dive into.
- Extract text from the resulting PDFs
- Manipulate the .docx files
- Use PyMongo (https://api.mongodb.com/python/current/) to record metadata into MongoDB
- Use scikit-learn to perform some advanced text processing
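Whichever of these ideas you pick, a likely first step is sorting the downloaded files by type so that PDFs and .docx files can go to different tools. Here is a minimal sketch of that step using only the standard library. The helper name group_by_extension is made up for illustration, and it assumes metagoofil has already written its downloads somewhere under a single directory.

```python
from collections import defaultdict
from pathlib import Path


def group_by_extension(directory):
    """Map each lowercase file extension to the files found under directory."""
    groups = defaultdict(list)
    # rglob("*") walks the directory recursively; sorted() keeps the order stable.
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            groups[path.suffix.lower()].append(path)
    return dict(groups)
```

Calling group_by_extension("/your/files/here").get(".pdf", []) would then give the list of PDFs to hand to a text extractor, and the ".docx" key would give the Word documents.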
I already have some pretty specific ideas for how I can use this short automation script, but I'll save those for later. There's quite a bit to share.
Ok, here's the code.
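from subprocess import Popen, PIPE
import os
import pprint
import time

printer = pprint.PrettyPrinter(indent=4)

# Popen does not go through a shell, so expand "~" explicitly.
script = os.path.expanduser("~/metagoofil.py")

res = Popen([
    "python", script,
    "-d", "some-domain.com", "-t", "doc,pdf",
    "-l", "200", "-n", "100",
    "-o", "/your/files/here", "-f", "results.html"],
    stdout=PIPE)

# communicate() returns a (stdout, stderr) tuple once the process exits.
printer.pprint(res.communicate())

# Poll until the process reports an exit code.
while res.poll() is None:
    time.sleep(0.5)

printer.pprint("completed metagoofil run")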
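If you plan to run this for more than one domain, it may help to build the argument list in one place. The sketch below is one way to do that. The function name and its parameter names are my own invention, chosen from how the flags are used in the script, so check them against metagoofil's own help output before relying on them.

```python
import os


def build_metagoofil_command(script_path, domain, file_types, search_limit,
                             download_limit, output_dir, report_file):
    """Return an argument list for Popen; every element must be a string."""
    return [
        "python",
        os.path.expanduser(script_path),  # expand "~" here, no shell is involved
        "-d", domain,
        "-t", ",".join(file_types),       # e.g. ["doc", "pdf"] becomes "doc,pdf"
        "-l", str(search_limit),          # numbers are converted to strings
        "-n", str(download_limit),
        "-o", output_dir,
        "-f", report_file,
    ]
```

For example, build_metagoofil_command("~/metagoofil.py", "some-domain.com", ["doc", "pdf"], 200, 100, "/your/files/here", "results.html") should produce the same list of strings that the script passes to Popen, with the home directory filled in.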
I'm specifically using Popen so that I can easily use this pattern:
res = Popen(["params", "array", "of", "strings"], stdout = PIPE)
printer.pprint(res.communicate())
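To see what communicate() actually hands back, here is a small sketch you can run without metagoofil installed. The child process is a stand-in I made up for illustration: a one-line Python program that prints a message.

```python
import sys
from subprocess import Popen, PIPE

# Stand-in child process (an assumption for illustration): a one-line Python
# program, so the example runs without metagoofil.
child = [sys.executable, "-c", "print('hello from the child process')"]

res = Popen(child, stdout=PIPE)

# communicate() reads stdout to the end and waits for the child to exit.
stdout_data, stderr_data = res.communicate()

print(stdout_data.decode().strip())  # captured output arrives as bytes
print(stderr_data)                   # None, because stderr was not piped
print(res.returncode)                # exit code of the child
```

The tuple is why the pretty-printed result of a metagoofil run shows up as a pair: the captured output as bytes, followed by None for the stderr stream that was never redirected.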
This is a naive (yet effective) mechanism for blocking until the process is completed. It checks every 0.5 seconds to see if the process is finished. (In the script above, communicate() has already waited for the process to exit, so the loop acts as a second check.) BEWARE! If you are using this...and your process never finishes...this won't finish either! Hence, naive. :)
while res.poll() is None:
    time.sleep(0.5)
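One way to take the edge off the "never finishes" problem is to give the polling loop a deadline. The sketch below is just one possible approach, not something the original script does. The function name wait_with_timeout and its parameters are hypothetical.

```python
import sys
import time
from subprocess import Popen


def wait_with_timeout(proc, timeout, interval=0.5):
    """Poll proc until it exits or timeout seconds pass.

    Returns the exit code, or None if the process was killed for running too long.
    """
    deadline = time.monotonic() + timeout
    while proc.poll() is None:
        if time.monotonic() >= deadline:
            proc.kill()  # stop the runaway process
            proc.wait()  # reap it so it does not linger as a zombie
            return None
        time.sleep(interval)
    return proc.returncode
```

You would call it as wait_with_timeout(res, timeout=600) in place of the bare while loop. On Python 3.3 and later, Popen.wait(timeout=...) and communicate(timeout=...) offer a built-in alternative that raises subprocess.TimeoutExpired instead of returning None.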
Metagoofil tool http://www.edge-security.com/metagoofil.php
Extract text with Python + pdfminer-six https://gist.github.com/jmcarp/7105045
Work with .docx files https://python-docx.readthedocs.io/en/latest/user/documents.html
Textual processing with Python + scikit-learn https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/