Sandboxes are terrific tools for malware protection. They allow to analyze executables and documents, looking for threats unknown to antivirus products.

Thus, one of the main objective of malware writers is to bypass sandboxes by exhibiting malicious behaviors only when infecting the targets. But the eternal fight between malware writer versus sandbox writers to write respectively bypass and anti-bypass solution can end very quickly if the sandbox manager uses MD5 as unique file identifier. The first improvement of an automated Sandbox solution is to not re-analyze already known files (according to certain temporal policies maybe), so if you analyzed the program P identified with HASH(P), when a new program P’ arrives with HASH(P’) == HASH(P), then you can assume that P’ == P, right?

Statistically you can make that assumption, but if HASH is MD5, then you must know that writing two different files having the same MD5 hash value is very, very, very easy. As explained in HashClash, the generation of Chosen-prefix Collisions is very efficient, so one could write a program with the same prefix but different body. I’m not saying you can generate two completely different programs and append some magic suffix that make them identical in a reasonable period of time, but you can generate two different collision blocks that differs for a few bytes but have the same MD5.

Let’s assume we can create two collision blocks and embed them (in the correct position) inside a program data:

#include <stdio.h>

char pillData[] = "000000000000000000000"
                  /** ... **/
                  /** Collision Blocks **/
                  /** ... **/

int main() {
  if (pillData[666] !== 0x43) {
    // Do something evil
  } else {
    // Do something good

Let’s name the collision blocks A and A’, such that MD5(A) == MD5(A’), and A[J] == 0x43, A'[J] != 0x43. If we can embed them into pillData such that pillData[666] == 0x43 and pillData'[666] != 0x43 then the program with the A block will have a good behavior, while the program with the A’ code will have a bad behavior! (Note that the code is the same, the only difference between the two program is the value of pillData[666].)

Before approaching the binary world, I want to give a demonstration using an interpreted language. First download e compile fastcoll:

git clone
cd clone-fastcoll

Then write the beginning of a Python program:

#! /usr/bin/env python2
# -*- coding: utf-8 -*-

pill = """

Execute fastcoll and verify that the two generated files (md5_data1 and md5_data2) are different and have the same MD5.

./fastcoll ./
md5sum ./md5_data*
diff md5_data*
sha256sum md5_data*

Now we can modify each file introducing the malicious and legal code:

#! /usr/bin/env python2
# -*- coding: utf-8 -*-

pill = """

if ord(pill[idx]) == val:
  print "Malicious!"
  print "Legal!"

Where idxis the index such that md5_data1.pill[idx] == val != md5_data2.pill[idx] . You can download two sample files here: .

Now that we have the malicious and legal code with same MD5, we can bypass the target sandbox solution by sending the legal program first. Once it has been analyzed, we send the malicious code that, having the same MD5, won’t be analyzed at all.

My example of course was only a PoC, with few applications. In the next post we will analyze the SilentSignal’s sheep-wolf tool to create two real executable files: one .exe that spawn a meterpreter and one .exe that does nothing.