FileAnalyser: Hashing, PE Parsing, and VirusTotal

The problem

I was working through malware triage exercises and kept switching between separate tools - hash a sample, look it up on VirusTotal, open PE headers somewhere else, guess at runtime behavior. FileAnalyser bundles those steps into one pipeline I could run from a GUI or the terminal while I was still learning what each layer actually tells you.

Background

A hash is a fixed-size fingerprint of a file's exact bytes. Change even one bit and the digest almost always changes completely - that is what makes hashes useful for malware triage. Services like VirusTotal index known files by hash, so a lookup can return in seconds with multi-engine results: has anyone else already seen and labeled this exact binary? That is reputation lookup, not magic.
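To make that sensitivity concrete, here is a minimal sketch using Python's standard hashlib module. The sample bytes are made up for illustration and do not come from any real binary.

```python
import hashlib

original = b"MZ" + b"\x00" * 62      # made-up sample bytes, not a real binary
modified = bytearray(original)
modified[10] ^= 0x01                 # flip a single bit in one byte

digest_a = hashlib.sha256(original).hexdigest()
digest_b = hashlib.sha256(bytes(modified)).hexdigest()

# Count how many hex characters differ between the two digests.
differing = sum(a != b for a, b in zip(digest_a, digest_b))

print(digest_a)
print(digest_b)
print(f"{differing} of {len(digest_a)} hex characters differ")
```

The two inputs differ by one bit, yet the digests share no obvious relationship, which is why a hash match means "exactly this file" and nothing looser.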

It works because a lot of commodity malware gets reused. Same strain, same builder, sometimes the same unpacked sample - so an unmodified file has often been uploaded and labeled by someone else already. Security teams treat hash checks as step one because they are fast and often decisive before anyone runs a sandbox.

It breaks down in predictable ways. Repack or tweak the payload and the hash is new even when behavior is identical. Zero detections only means no engine has flagged that hash yet, not that the file is safe. Weak algorithms like MD5 have practical collision attacks (two different files deliberately crafted to share one digest) - which is part of why I lean on SHA-256 for submissions even when older feeds still show MD5.

One detail I picked up while reading about this: much of real-world triage never reaches deep reverse engineering. Analysts lean on hash and reputation layers first because they behave like a cache of other people's lab time. VirusTotal started in 2004 as a small project at the Spanish security company Hispasec Sistemas before becoming the shared lookup desk the industry uses today - which made me feel less guilty about building a tool that stops at "hash, then PE, then maybe behavior" instead of pretending I was writing a full sandbox on day one.

The workflow

Both interfaces walk the same path. For a file on disk the tool can:

Generate MD5, SHA-1, and SHA-256 hashes
Submit the file or an existing hash to VirusTotal and pull multi-engine results (reputation lookup, not local signature matching)
Parse PE structure for Windows executables (.exe, .dll, .sys) - sections, imports, exports
Run optional sandbox-style behavior monitoring on the host (CPU, memory, open files, network connections)
Extract metadata from Office documents, images, and plain text files
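The hashing step in that list can be sketched as below. This is an illustration with the standard library, not FileAnalyser's actual code; the function name hash_file and the chunk size are assumptions I picked for the example.

```python
import hashlib
import os
import tempfile

def hash_file(path, chunk_size=65536):
    """Return MD5, SHA-1, and SHA-256 hex digests from a single read pass."""
    hashers = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    with open(path, "rb") as handle:
        # Read in chunks so a large sample never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}

# Demo on a throwaway file so the example does not depend on a real sample.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
    sample_path = tmp.name

digests = hash_file(sample_path)
os.remove(sample_path)

for name, value in digests.items():
    print(f"{name}: {value}")
```

Feeding every chunk to all three hashers means the file is read from disk once, which matters more for multi-gigabyte samples than for typical executables.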

The GUI also has a hash lookup tab when I already have a digest from another source and just want VirusTotal results without re-uploading the binary. Full PE analysis is most reliable on Windows, which matches how I built and tested it.
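A hash-only lookup like that could look roughly like the sketch below. It assumes the public VirusTotal API v3 file report endpoint and its x-apikey header. The function names are hypothetical, and the sample report is made-up data shaped like the last_analysis_stats field, not a real result.

```python
import json
import urllib.error
import urllib.request

VT_FILE_URL = "https://www.virustotal.com/api/v3/files/{}"

def build_lookup_request(file_hash, api_key):
    """Build (but do not send) a VirusTotal v3 file report request."""
    return urllib.request.Request(
        VT_FILE_URL.format(file_hash),
        headers={"x-apikey": api_key},
    )

def summarize_report(report):
    """Reduce a file report to (flagged engines, total engines)."""
    stats = report["data"]["attributes"]["last_analysis_stats"]
    flagged = stats.get("malicious", 0) + stats.get("suspicious", 0)
    return flagged, sum(stats.values())

def lookup_hash(file_hash, api_key):
    """Return (flagged, total), or None when VirusTotal has no record."""
    request = build_lookup_request(file_hash, api_key)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return summarize_report(json.load(response))
    except urllib.error.HTTPError as error:
        if error.code == 404:
            # Unknown hash: nobody has uploaded it, which is not "clean".
            return None
        raise

# Made-up report fragment, used so the example runs without a network call.
sample_report = {"data": {"attributes": {"last_analysis_stats": {
    "malicious": 41, "suspicious": 2, "undetected": 25, "harmless": 0}}}}
flagged, total = summarize_report(sample_report)
print(f"{flagged}/{total} engines flagged this hash")
```

Returning None for a 404 keeps "never seen" separate from "seen and not flagged", two cases that are easy to blur in a summary line.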

Key decisions

API key via .env. Keeps the VirusTotal key out of source control and mirrors how real tools ship config.
Dual interface. GUI for exploring results; CLI for scripting and quick checks (with an optional --sandbox flag).
Reputation plus structure. VirusTotal answers "seen before?"; PE parsing answers "how is this binary shaped?"; behavior monitoring hints at "what did it do when run?" - none of those replace the others.
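For the .env decision, a minimal stand-in for the loading step might look like this. The variable name VT_API_KEY and both function names are assumptions for illustration; many projects use the python-dotenv package for this instead, and the sketch below only uses the standard library.

```python
import os

def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values

def load_api_key(env_path=".env", name="VT_API_KEY"):
    """Prefer a real environment variable, then fall back to the .env file."""
    if name in os.environ:
        return os.environ[name]
    try:
        with open(env_path, encoding="utf-8") as handle:
            return parse_env(handle.read()).get(name)
    except FileNotFoundError:
        return None

# Hypothetical file contents; a real .env would be listed in .gitignore.
example = parse_env("# local secrets\nVT_API_KEY='abc123'\n")
print(example)
```

Checking the process environment first lets the same code run in CI or a container, where secrets usually arrive as environment variables and no .env file exists.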

What I learned

Malware triage is layered, not a single verdict. Hashes are fast fingerprints. VirusTotal aggregates other engines' detections - useful, but rate limits are real and a clean result is not proof of safety. PE imports and section layout sometimes reveal more about a sample than a hash lookup alone.
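As a rough illustration of what the PE layer reads, the sketch below walks the DOS header, the COFF file header, and the section table using only the struct module. It is not the parser FileAnalyser uses; the function name is hypothetical, the input is a synthetic header built in the example, and imports and exports are out of scope here (a dedicated library such as pefile handles those).

```python
import struct

def parse_pe_headers(data):
    """Return the machine type and section table entries from raw PE bytes."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature: not a PE file")
    # e_lfanew at offset 0x3C holds the file offset of the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The COFF file header is the 20 bytes right after the signature.
    machine, section_count, _, _, _, optional_size, _ = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4
    )
    # The section table follows the optional header; each entry is 40 bytes.
    table = pe_offset + 4 + 20 + optional_size
    sections = []
    for index in range(section_count):
        name, virtual_size, virtual_address, raw_size = struct.unpack_from(
            "<8sIII", data, table + index * 40
        )
        sections.append({
            "name": name.rstrip(b"\x00").decode("ascii", errors="replace"),
            "virtual_size": virtual_size,
            "virtual_address": virtual_address,
            "raw_size": raw_size,
        })
    return {"machine": hex(machine), "sections": sections}

# Synthetic headers only: enough structure to parse, not a runnable program.
dos_header = b"MZ" + b"\x00" * 58 + struct.pack("<I", 64)
coff_header = struct.pack("<HHIIIHH", 0x8664, 1, 0, 0, 0, 0, 0x0002)
text_section = struct.pack(
    "<8sIIIIIIHHI", b".text", 0x1000, 0x1000, 0x200, 0x200, 0, 0, 0, 0, 0x60000020
)
fake_pe = dos_header + b"PE\x00\x00" + coff_header + text_section

info = parse_pe_headers(fake_pe)
print(info)
```

Comparing virtual_size with raw_size per section is one cheap structural signal: a section that is tiny on disk but large in memory is a common hint that the file is packed.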

Running even basic behavior monitoring on my own machine reminded me that "sandbox-style" on a laptop is still my laptop. Fine for learning, not something I would treat as production isolation. This is a triage helper I run locally, not a deployed security product.

What I'd improve next

Richer reporting exports, YARA rule hooks, and clearer handling when VirusTotal rate-limits or returns ambiguous results. Longer term I would containerize dynamic analysis instead of running binaries on the host OS.

Source code