I feel like the decompiler space is a little stuck? I mostly go with Hex-Rays ou...

tralarpa · on July 13, 2022

> A lot of my colleagues use Ghidra a lot now and complain about its decompiler regularly.

Are your colleagues decompiling obfuscated code (for example malware)? Publicly available decompilers are not working well for that, but I assume that many specalists have their own little improvements and plugins that they don't share with others because it's their core business.

For non-obfuscated code, Ghidra has served me very well, even for entire applications. Often, it has to be pushed into the right direction (for example, by manually specifying the type of a variable) and it sometimes misses some obvious simplifications especially when arrays are involved, but I think those issues could be solved relatively easily by polishing/extending its heuristics. Nothing where I would say that ML is needed, although it would be possible. At the end, most programs contain the same patterns and an ML-based system could help identifying them.

But yeah, obfuscated code, that's something else. There are some academic publications about the usage of ML for that. No idea what's happening inside the company labs, though.

badsectoracula · on July 13, 2022

I haven't used Ghidra "seriously" but i fed it some non-trivial programs i wrote in Free Pascal and i was very surprised to see that it recreated a C++ program that was incredibly similar to what the Free Pascal program looked like.

Of course it wasn't obfuscated and there were a couple of mistakes here and there but overall it'd work perfectly fine for someone to understand what the program was doing if they didn't had access to the source code.

aardshark · on July 13, 2022

From my small experience of Ghidra, it didn't do great once the code was not using standard calling conventions (i.e it was probably compiled with optimization flags )

Sometimes it would just straight up ignore (functional) assembly for apparently no reason. Or it would turn simple code into a myriad of nested conditionals and loops, achieving the same goal, but looking nothing like a human would write.

It was still very helpful in understanding blocks of assembly much faster than I otherwise would, and it's possible I was lacking some configuration that a more experienced user could do to help the decompiler out.

Akronymus · on July 13, 2022

>Sometimes it would just straight up ignore (functional) assembly for apparently no reason.

probably code it thinks is unreachable. (Jmp or ret right in front of it and no jmp/call into that address, probably a computed jump/call)

> Or it would turn simple code into a myriad of nested conditionals and loops

Ran into that myself, usually a switch case. (Dunno how to get ghidra to deal with that properly myself)

The biggest help you can give ghidra is defining structs, naming the fields, and setting the right types.

Joker_vD · on July 13, 2022

> probably code it thinks is unreachable.

Or dead assignments. I've seen it with HexRays: if you don't tell it that e.g. var_16 is actually 32 bytes long, not 4, it will completely ignore any code that reads/writes stack between var_16 + 4 and var_48 (which is at var_16 + 32). It's quite an amusing sight to see: you have an 8 lines-long decompiled function from 300 lines of assembly, you edit a variable's annotation, boom, the decompiled function is now 40 lines long, with all kinds of interesting computations in its body.

ishitatsuyuki · on July 13, 2022

Rellic [1] implements an algorithm that generates goto-free control flows (citation in README), which would be a significant improvement against what Ghidra/IDA generates currently.

Unfortunately it looks like the maintenance state of the pieces around Rellic isn't very good, and it's quite rocket science to get it building. It doesn't have as much UI/GUI as Ghidra either so it's a bit far from accessible right now.

[1]: https://github.com/lifting-bits/rellic

dataflow · on July 13, 2022

> that generates goto-free control flows

...note: from LLVM bitcode.

lmz · on July 13, 2022

https://github.com/lifting-bits/remill is linked from there, which I guess is where you get your bitcode from.

dataflow · on July 13, 2022

Oh that's cool, thanks!

FrozenVoid · on July 13, 2022

What happens with code that uses lots of gotos(incl. computed gotos)?

jcranmer · on July 13, 2022

From reading the paper, it basically does jump unthreading. Basically, if you imagine code like this:

  bool found = false;
  for (...) {
    if (...) {
      found = true;
      break;
    }
  }
  if (found) {
    // A
  } else {
    // B
  }

Jump threading is an optimization pass that replace the break statement with a goto A. After that replacement, found is always false, so the boolean variable and the if statement is deleted. The resulting code would look something like this [1]:

  for (...) {
    if (...) {
      // A
      goto end;
    }
  }
  // B
  end:;

What the lifting is doing here is essentially running this pass in reverse. If you find a branch pattern that doesn't meet any preordained schema (such as a loop with multiple exits), just synthesize a variable that tells you which target you're going to jump to. Were the compiler to optimize the resulting code, jump threading would convert it back into the gotos present in the compiled binary.

[1] This kind of optimization pass runs at a stage when the code is basically treated entirely as a CFG and there's no such thing as if statements or jumps or gotos, just conditional and unconditional branches terminating basic blocks. Any reflection of the code outside of this form is therefore somewhat imprecise.

mdp2021 · on July 13, 2022

Can you think of an example of "goto"-based code that cannot be translated into conventionally structure code?

gjm11 · on July 13, 2022

[EDITED to say explicitly:] You can translate any goto-laden code into "conventionally structured" code mechanically, if you don't care about having the structure of the resulting code actually indicate what it does. Here's an example of the sort of code for which that might be the best you can do.

Suppose you implement a state machine with gotos. So, for a simple (and contrived) example, suppose you have something that absorbs the decimal digits of a number and keeps track of the value of the number modulo 3 by having three states. Something like this (pseudocode):

    def mod3():
        state0:
            d = getdigit()
            if d == FINISHED: return 0
            if d is 0, 3, 6, 9: goto state0
            if d is 1, 4, 7: goto state1
            if d is 2, 5, 8: goto state2
            return ERROR
        state1: 
            d = getdigit()
            if d == FINISHED: return 0
            if d is 0, 3, 6, 9: goto state1

(etc.) You've got three stateN labels each of which can jump to any of the stateN labels (as well as being able to return from the function).

If you have tail-call optimization you can turn this into conventionally structured code, more or less:

    def state0():
        d = getdigit()
        if d == FINISHED: return 0
        if d is 0, 3, 6, 9: return state0()
        if d is 1, 4, 7: return state1()
        if d is 2, 5, 8: return state2()
        return ERROR

with similar definitions for state1() and state2(), and then the top-level function just calls state0. But this depends on knowing that all those tail calls will get optimized, or else on never having enough digits to overflow your stack.

Or else you can have an explicit state variable:

    def mod3():
        state = 0
        loop:
            if state == 0:
                d = getdigit()
                if d == FINISHED: return 0
                if d is 0, 3, 6, 9: state = 0
                else if d is 1, 4, 7: state = 1
                else if d is 2, 5, 8: state = 2
                else: return ERROR
            else if state == 1:
                ...
            else:
                ...

which works pretty well for the special case of state machines but badly for most other things a goto might be used for. (Though obviously you can translate literally any goto-using code into this sort of thing. You might want to call the analogue of the "state" variable here "program_counter" in that case.)

not2b · on July 13, 2022

The translation can always be done, but for dense spaghetti control structures duplication of code may be required. One can construct artificial cases where the size increase is exponential, but that's unlikely to be an issue in even the worst real code.

psifertex · on July 14, 2022

This is actually why we chose _not_ to implement no-more-gotos for Binary Ninja's HLIL! Code is actually more readable with gotos in some situations and trying to force their elimination hurts readability.

zozbot234 · on July 13, 2022

> Rellic [1] implements an algorithm that generates goto-free control flows

Doesn't WebAssembly implement that already, via Relooper?

hoosieree · on July 13, 2022

> Is there any new approach in the works? Maybe something ML-based for optimization?

I'm doing a PhD on this.

My goal is to detect known functions from obfuscated binaries.

The biggest challenge by far is building a good dataset. Unlike computer vision (millions of pictures with the label "dog") the number of training examples for a typical function is one. For now I'm focusing on C standard libraries, since there are a handful of real-world implementations plus some FOSS or students samples available for things like strlen and atoi.

If anyone wants to collaborate, feel free to message me.

anitil · on July 14, 2022

I'm not sure I follow - wouldn't many statically linked programs have much of some version of libc within them? So you could take any program, change it to be statically linked and use that for training?

That said I assume I'm missing something here.

Akronymus · on July 13, 2022

Could a best guess + fuzzing + compiling the decompiled code work towarda a heuristic?

hoosieree · on July 13, 2022

Not sure exactly what you mean by "best guess + fuzzing", but I have compiled code that was first decompiled by Ghidra. The problem is there are lots of invalid identifiers in the decompiled output.

The worst are symbols that are used inconsistently within the same function, like a parameter which is passed in as a long and then used as a pointer to a struct or even as a function.

The Ghidra community basically says you should not expect the exported decompiled code to be valid [1,2]. Which is fine, since rount-trip compile-decompile-compile is not exactly Ghidra's purpose.

Maybe there's a setting to make Ghidra export asm literals when it can't figure out a valid disassembly, but I am pretty new to Ghidra so it could just be my own ignorance.

[1]: https://github.com/NationalSecurityAgency/ghidra/issues/236

[2]: https://github.com/NationalSecurityAgency/ghidra/issues/3553

Akronymus · on July 13, 2022

> The worst are symbols that are used inconsistently within the same function, like a parameter which is passed in as a long and then used as a pointer to a struct or even as a function.

Split into new variable. Sounds like ghidra has trouble telling whether it is a reused stoarge location or actually the same variable.

Best guess = something that looks approxinately fitting for the relevant assembly

fuzzing = tweaking the source code to get what it compiles to closer to the actual assembly.

as in, generate a function, see how close its compilation resembles the assembly, tweak until you find a match

hoosieree · on July 13, 2022

Yeah, I ended up creating new variables to get the compile to succeed.

As for generating functions, I'll have to think about what that loss function would look like. I've been looking at asm2vec[1] and structure2vec[2] for inspiration. I'm currently looking at different kinds of graph embeddings, because even answering the basic question of "are these N bytes of assembly semantically similar to these other N bytes" is a challenge.

[1]: https://ieeexplore.ieee.org/document/8835340

[2]: https://arxiv.org/abs/1603.05629

Akronymus · on July 13, 2022

Maybe start with a simple fixed size instriction set? To get some methology down to later be refined. something like early 8 bit micros

hoosieree · on July 14, 2022

That's not a bad idea. My first crack at this has been with a linux x64 target, but I have the infrastructure in place for mips, armv7, thumb, etc. I haven't tried compiling to very old/simple targets but I was considering using the MOVfuscator as one of the compilers.

Or maybe I can figure out how to tell LLVM to do some extreme strength reduction and target an ultra reduced subset of some ISA. Great food for thought, thanks!

develatio · on July 13, 2022

I hear good things about Binary Ninja!

baby · on July 13, 2022

I always found it odd that ida pro was such a pile of poop when it probably made sooo much money

ohnoesjmr · on July 13, 2022

Decompiler space probably has a few tens of millions in revenue yearly, yet writing a good decompiler is quite a lot of engineering effort, and you are not going to spend tons of money and effort to capture a measly 10m market, you'll rather be the next uber type thing that targets a much bigger market.

Hence HexRay can get away with not doing much and just collecting license fees from existing customers yearly, as there isn't a better alternative anyway.

Sesse__ · on July 13, 2022

One thing Hex-Rays has that Ghidra doesn't (and cannot) is amazing support. Back when I had a license at work, I could report bugs and literally get a fixed binary back a couple of hours later.

They're both amazing, they're both quirky, and they're both buggy. But one is free and the other has its support. Pick which one matters to you :-)

jcul · on July 13, 2022

It's been a while since I needed to do any serious reverse engineering work, but I do remember IDA pro having support for a range of obscure CPUs, not sure how ghidra compares on that front these days.

popinman322 · on July 13, 2022

Ghidra still mostly relies on community plugins for obscure platform support. It's ~OK but less than ideal.

saagarjha · on July 14, 2022

Ghidra outstrips IDA Pro in the number of architectures it supports.

pjc50 · on July 13, 2022

One of the major categories of users is people in the warez scene, all of whom are pirating it. The only other one is security researchers, which is a pretty small market.

mdp2021 · on July 13, 2022

And backward engineers, including porters and students, and code recoverers.

And especially, software tweakers and improvers. Not all software is open source.

lalopalota · on July 13, 2022

IDA Pro was an amazing disassembler and accompanying set of tools - top of the pack for quite a while.