Using the algorithms described in the previous post, it's possible to detect whether one application is very similar to another. This can be really useful in many situations, for example:
- checking whether your application has been stolen by someone: for now it's very easy to rip an application from the Android Market, crack or re-package it with smali/baksmali/apktool, and re-inject it into the market,
- checking whether your obfuscator is effective (ProGuard? :)),
- identifying the new methods injected by malware.
We can measure the similarity between:
- methods (as described in the previous post),
- Android/Java API usage,
- constants (integer, float, ...),
- variable initialisations,
- control flow graphs,
- array data (fill-array-data).
For now, the program (androsim.py, available in the androguard repository) uses only the first three criteria, and it computes the inclusion similarity, as a percentage, of the first application inside the second one.
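To make the idea of "inclusion" concrete, here is a minimal sketch. It is not the code of androsim.py: it assumes each application is reduced to a plain set of signature strings (the names below are made up), and it simply asks what share of the first set is found in the second one. The point is that the measure is not symmetric.

```python
def inclusion_percentage(first, second):
    """Percentage of the elements of `first` that also appear in `second`."""
    first, second = set(first), set(second)
    if not first:
        # An empty application is trivially included in anything.
        return 100.0
    return 100.0 * len(first & second) / len(first)


# Hypothetical signatures, for illustration only.
app1 = {"La/A;->foo()V", "La/A;->bar()I", "La/B;->run()V"}
app2 = app1 | {"Levil/X;->steal()V"}

print(inclusion_percentage(app1, app2))  # 100.0: app1 is entirely inside app2
print(inclusion_percentage(app2, app1))  # 75.0: app2 has one extra method
```

A re-packaged application with injected code would look like app2 here: the original is fully included in it, but not the other way round.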
Still, it's interesting to see it in action and to look at some first results:
with two nearly identical applications:
The display is very basic. It shows:
- DIFF METHODS: how many methods were actually compared (they are nearly the same),
- NEW METHODS: how many methods are totally new in the second application,
- MATCH METHODS: how many methods matched perfectly,
- DELETE METHODS: how many methods of the first application have been deleted (they no longer appear in the second one).
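The four counters can be pictured with a small sketch. This is only an illustration under my own assumptions: methods are represented as a dict from a name to a string standing in for the body, `difflib.SequenceMatcher` replaces the similarity measure of the previous post, and the 0.8 threshold is arbitrary.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Stand-in similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()


def classify(first, second, threshold=0.8):
    """Sort methods into MATCH / DIFF / NEW / DELETE buckets."""
    remaining = dict(second)  # methods of `second` not paired yet
    match, diff, delete, pending = [], [], [], {}

    # Pass 1: methods whose bodies are strictly identical.
    for name, body in first.items():
        twin = next((n for n, b in remaining.items() if b == body), None)
        if twin is None:
            pending[name] = body
        else:
            match.append(name)
            del remaining[twin]

    # Pass 2: pair each leftover method with its closest candidate.
    for name, body in pending.items():
        best = max(remaining, key=lambda n: ratio(body, remaining[n]), default=None)
        if best is not None and ratio(body, remaining[best]) >= threshold:
            diff.append((name, best))
            del remaining[best]
        else:
            delete.append(name)

    # Whatever is left in `second` has no counterpart in `first`.
    return {"MATCH": match, "DIFF": diff, "NEW": sorted(remaining), "DELETE": delete}


# Hypothetical method bodies, for illustration only.
first = {"a": "const 1; add; return", "b": "invoke foo; return", "c": "nop; nop; throw"}
second = {"a": "const 1; add; return", "b": "invoke foo2; return",
          "d": "invoke sendSMS; goto; return"}

result = classify(first, second)
print(result)
# {'MATCH': ['a'], 'DIFF': [('b', 'b')], 'NEW': ['d'], 'DELETE': ['c']}
```

Identical methods are paired first so that a near-duplicate cannot "steal" the exact twin of another method.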
The last two lines of the output are:
- the marks used to calculate the final score (0.0 is a good mark, 1.0 a bad one),
- the final score as a percentage (100.0 indicates that the applications are the same).
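The exact formula that turns the marks into the final percentage is not given here, so the following is only a guess at one plausible scheme: average the marks and invert them, so that all-zero marks give 100.0. The function name and the averaging are my assumptions, not necessarily what androsim.py does.

```python
def final_score(marks):
    """Map marks in [0.0, 1.0] (0.0 = good) to a percentage (100.0 = same).

    Assumption: the marks are simply averaged; the real tool may weight them.
    """
    if not marks:
        raise ValueError("at least one mark is required")
    return 100.0 * (1.0 - sum(marks) / len(marks))


print(final_score([0.0, 0.0, 0.0]))  # 100.0: every mark is perfect
print(final_score([0.0, 0.5, 1.0]))  # 50.0
```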
with two different applications:
After obfuscation, the applications are still nearly the same: this software works as an optimizer and does not really obfuscate anything... but is there a good Java obfuscator? :)
with infected applications:
- identical methods,
- new methods (the methods of the malware!!).
As I said, the algorithm must be improved with new tests, documentation, ... but I think it's possible to do the same thing with classical assembly applications, because it's a very generic algorithm.
The whitepaper describing all the algorithms is coming soon, so "stay tuned" for new examples :)
See ya!!!