Intro to Malware Analysis: Analyzing Python Malware

January 19, 2023 By Juan Aguirre

Sonatype's next-generation AI behavioral analysis systems are constantly searching for malicious packages published to open source repositories. Once these systems flag a package, it is passed on to our Security Research team, where we verify whether it is truly malicious.

In this article, we will dive into the waters of malware analysis, starting with some basics and slowly going into the deep end as we see fit along the way.

A popular attack vector for malicious authors is typosquatting, a technique we've mentioned in some of our other articles. This consists of authors publishing malicious packages with names similar to legitimate ones, such that a small typo would result in the malicious package name. This way authors can prey on unsuspecting victims as they attempt to install what they believe is legitimate software, right until they notice something has gone terribly wrong.

`Views` is a Python package designed to make generators and sequence creation efficient. There are many similarly named packages for the same purpose. If someone tried to download this package, but forgot the last 's' they could have found themselves infected with one of the most recent finds by our AI-enabled systems: 'view.'
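To make the typosquatting idea concrete, here is a minimal sketch of how one might flag package names that sit a single typo away from a known name. This is not how our systems work; the function names and the list of known packages are hypothetical, and plain edit distance is only one of several possible similarity measures.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or keep) a character
        prev = curr
    return prev[-1]


def possible_typosquats(name: str, known_names, max_distance: int = 1):
    """Return the known names that `name` is suspiciously close to (but not equal to)."""
    name = name.lower()
    return [known for known in known_names
            if 0 < edit_distance(name, known.lower()) <= max_distance]


# Hypothetical list of legitimate package names, for illustration only.
print(possible_typosquats("view", ["views", "requests", "numpy"]))  # ['views']
```

A name that matches a known package exactly is not reported, since that is simply the legitimate package.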

When it comes to malware, there are usually two main things we want to do: static and dynamic analysis. Static analysis focuses on the source code. What can I find out by looking at the sources, imports, strings, etc.? Dynamic analysis, on the other hand, focuses more on behavior and understanding what the malware actually does by executing portions of the code or, in some cases, all of it. In most cases, we will need to apply both techniques to get the full picture. Dynamic analysis is commonly run while also looking through the resulting assembly code in a disassembler and/or debugger. But enough introductions already, let's begin with the actual analysis.

Source Code: The Low-Hanging Fruit

Since we are talking about open source malware, we have access to the source code. However, open source malware is becoming more and more like traditional malware, in the sense that all you see in the open source code is a first-stage dropper whose sole purpose is to reach out to an external server and grab the second-stage payload, the true malware. Let's see if we get lucky with our malicious package 'view' and find something interesting within the code.

Image 2: malicious snippet in setup.py from ‘view’

Just as we suspected, all this code does is reach out to an external server and grab a second-stage executable. In this case, as we can see in Image 2, it's reaching out to anonfiles, which is immediately a red flag. At this point, we probably know enough about this package to understand that it is malicious, and we probably don't want it anywhere near our systems.
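This kind of first look at a setup.py can be partly automated without ever running it. The following is a rough sketch, assuming we only want to list imports, hard-coded URLs, and a few process-launching calls; the function name, the call list, and the sample source with its placeholder URL are all hypothetical and are not taken from the 'view' package.

```python
import ast

# Call names that often deserve a closer look in an install script (illustrative list).
EXEC_CALLS = {"system", "popen", "Popen", "startfile", "exec", "eval"}


def scan_source(source: str) -> dict:
    """Statically list imports, URL string literals and risky calls. Nothing is executed."""
    findings = {"imports": set(), "urls": [], "calls": set()}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                findings["imports"].add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module:
            findings["imports"].add(node.module.split(".")[0])
        elif isinstance(node, ast.Constant) and isinstance(node.value, str):
            if node.value.startswith(("http://", "https://")):
                findings["urls"].append(node.value)
        elif isinstance(node, ast.Call):
            func = node.func
            # Handle both `os.system(...)` (Attribute) and `exec(...)` (Name).
            name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
            if name in EXEC_CALLS:
                findings["calls"].add(name)
    return findings


# Hypothetical install script; the URL is a placeholder, not a real indicator.
SAMPLE = '''
import os
import requests
page = requests.get("https://example.invalid/landing").text
os.startfile("payload.exe")
'''

print(scan_source(SAMPLE))
```

A hit here is only a hint for a human reviewer. Plenty of legitimate install scripts import `os` or contain URLs, so the output narrows down where to look rather than delivering a verdict.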

As a security researcher, my eyes light up when I see this, and all I'm thinking is that I hope the file hasn’t been taken down yet. So I rushed to download it. This feels like those times in TV commercials where they have to place a disclaimer at the bottom: "Don't try this at home." However, in the interest of learning, all I can say is: Take all proper precautions when dealing with malware. Perhaps an article on how to set up a safe malware analysis environment could be a good addition to our blog. Let us know if this sounds interesting by submitting a comment at the bottom of the page.

The code first makes a request, parses the response, and then makes a second request because the target file has a changing URL. The first request only fetches the page so that the correct download link can be parsed out of it. The second request then grabs the file.

With our Virtual Machine all set up, we can now download the executable. I like to use `curl` or `wget` to ensure the malicious file is only downloaded and never executed. For additional precautions, we can write the output to a file with a non-executable file extension, such as txt.
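If the sample's bytes are already in hand inside the analysis VM, the same precaution can be sketched in Python: write them under a non-executable extension and without execute permission. The helper name `save_inert` is hypothetical, and this is an additional safety habit, not a replacement for an isolated environment.

```python
import os
from pathlib import Path


def save_inert(data: bytes, dest_dir, name: str) -> Path:
    """Write sample bytes as '<name>.txt' with owner read/write permission only."""
    # Path(name).name drops any directory components hidden in the file name.
    path = Path(dest_dir) / (Path(name).name + ".txt")
    path.write_bytes(data)
    os.chmod(path, 0o600)  # no execute bit (only partly honored on Windows)
    return path
```

For example, `save_inert(sample_bytes, "samples", "payload.exe")` would produce `samples/payload.exe.txt`, assuming the `samples` directory already exists.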

Now that we have the executable, you could think the source code analysis is over. But not so fast. Even though executables don't have source code we can read, they do have strings that are often full of valuable information. Something as simple as running `strings` on the exe is enough to give us tons of clues about what this is doing. With strings, we can see some imports the executable uses, along with lots of Python libraries and Python code. Maybe this is a Python script compiled with pyinstaller to make it an executable, which would explain all the Python code we see within the executable. To get a better idea, we can move on to other tools that will help us.

Image 3: strings command output from ‘view’s remote executable
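For readers without the `strings` utility at hand, its core behavior is easy to approximate. This is a minimal sketch that pulls runs of printable ASCII out of raw bytes. The real tool has more options (other encodings, offsets), and the function name here is our own.

```python
import re


def extract_strings(data: bytes, min_len: int = 4) -> list:
    """Return runs of at least min_len printable ASCII characters found in data."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


blob = b"\x00\x01import os\x00ab\x00GetProcAddress\xff"
print(extract_strings(blob))  # ['import os', 'GetProcAddress']
```

The two-character run `ab` is dropped because it is shorter than the minimum length, which is what keeps the output from drowning in noise.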

Online Tools for Analyzing Binaries

Now that we have an executable, there are some valuable services online that will help us understand this malware. Many sandboxes exist, and several have free options that are fully featured. My favorites are VirusTotal (VT) and any.run, and they are always the first I go to, but I wanted to try something new and came across filescan.io. Let's give it a shot.

Image 4: filescan.io summary result

This gives us tons of information, and one of the most interesting parts is that it allows you to download extracted files the executable may hide. But it doesn't always work, so you might still need to dive into the deep waters of malware analysis to get to the bottom of the malicious behavior of extracted files.

One of the tabs contains extracted strings, where we can see the imports and functions the malware is trying to use. We can see things like `GetProcAddress` and `LoadLibrary`, which tell us that the author is likely trying to hide the true inner workings of their code by loading libraries in memory. Another interesting one, as seen in Image 5, is `IsDebuggerPresent`, which tells us that this malware is implementing some sort of sandbox and analysis evasion and wants to complicate things for us. Oftentimes, as soon as malware detects it's being debugged or run in a sandbox, it proceeds to sleep while hiding its true behavior.

Image 5: extracted strings from filescan.io
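The same triage can be done by hand on any list of extracted strings. Below is a small sketch that flags the three API names discussed above; the table and the function name are hypothetical, and a real triage list would be much longer.

```python
# API names from the paragraph above and why each one is worth a second look.
SUSPICIOUS_APIS = {
    "GetProcAddress": "resolves function addresses at run time",
    "LoadLibrary": "loads libraries dynamically",
    "IsDebuggerPresent": "checks whether a debugger is attached",
}


def flag_apis(strings) -> dict:
    """Map each suspicious API name found in the strings to a short reason."""
    hits = {}
    for text in strings:
        for api, reason in SUSPICIOUS_APIS.items():
            # Substring match so variants such as LoadLibraryA or LoadLibraryExW count too.
            if api in text:
                hits[api] = reason
    return hits


print(flag_apis(["kernel32.dll", "LoadLibraryA", "IsDebuggerPresent"]))
```

None of these imports is malicious on its own. Many ordinary programs use them, so they are best read as pointers to code worth inspecting.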

VirusTotal leads the industry among these solutions, because its reports give us so much more detail. It's always good to have various tools under your belt and ready to go, but we tend to have favorites for a reason. As we can see in Image 6, the level of detail is much greater. We can see the actual arguments passed to the calls, and even the returned values. This goes a long way in understanding true behavior.

Image 6: Native calls in VirusTotal report

The network information given by VT also has lots of hints about what the malware is doing, where it is going, and who it is talking to. In this case, we can see plenty of indicators of malicious behavior, and we can even extract a few Indicators of Compromise (IOCs).

Our malware sample contacts malicious sites:

Accv.es : Hints for attribution. European, Spain?
URL paths in Spanish : More hints for attribution.
discord.com/api/webhooks : We know Discord webhooks are usually up to no good.
Crl.dhimyotis.com : Reaches out to grab a root certificate. Odd?
Pastebin.com : Command and control

Image 7: Pastebin.com: Command and Control (C2)
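Pulling network indicators like the ones listed above out of a pile of strings can also be scripted. This is a rough sketch, assuming the indicators appear as plain `http(s)` URLs; the function name, the watchlist, and the URL paths in the example are placeholders and not the actual values from this sample.

```python
import re
from urllib.parse import urlsplit

URL_RE = re.compile(r"https?://[^\s\"'<>]+")

# Hosts from the list above that we want called out explicitly (illustrative).
WATCHLIST = {"pastebin.com", "discord.com"}


def extract_hosts(strings) -> dict:
    """Group every URL found in the strings by its lower-cased host name."""
    hosts = {}
    for text in strings:
        for url in URL_RE.findall(text):
            host = urlsplit(url).hostname
            if host:
                hosts.setdefault(host.lower(), []).append(url)
    return hosts


found = extract_hosts([
    "GET https://pastebin.com/raw/PLACEHOLDER",
    "https://discord.com/api/webhooks/ID/TOKEN",
    "no url in this string",
])
print(sorted(found))                   # ['discord.com', 'pastebin.com']
print(sorted(set(found) & WATCHLIST))  # hosts that are on the watchlist
```

Grouping by host makes it easy to see at a glance which services the sample talks to before digging into individual paths.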

Of course, pastebin.com stands out and is even identified by the automated sandbox engines as the Command and Control (C2 or C&C) server. Another one that stands out is dhimyotis.com. The latter is odd, because when I check out what it is, it turns out to be a security website. They have a product designed to help verify trust and identity on the internet. However, why is this malware reaching out to it, and why is it grabbing a root certificate from a page with directory listing enabled? Seeing a page with directory listing can be an indicator that the site has been compromised, but the listing could also be there for legitimate purposes. Much of malware analysis consists of heavy research, understanding new concepts, and exploring all possibilities, so this isn't necessarily something malicious. It does, however, tell us something more about the behavior of our executable.

Image 8: Directory listing reached by malware from dhimyotis.com

Manual Analysis

We have now used manual inspection and automated online tools to help us understand what the malware is doing, and we certainly have enough to deem this a true malicious package with nefarious intentions. But we're not clear on what the end goal is here. This is where things start getting fun.

Before we go full reversing mode and open IDA or Ghidra, let's follow the clues that have been telling us this is a Python script wrapped in a Windows executable. We already know the exe is filled with Python bytecode and libraries, so to confirm this is a packed executable, we can look at the PE headers.

Image 9: PE headers with pestudio

In Image 9, we can see this executable contains a section, `_RDATA`, that doesn't follow the standard naming used for sections in typical executables. pestudio is even nice enough to highlight it to bring our attention to it. Looking further, we see that there is also an overlay. An overlay is data appended to the exe after the end of its last section, which screams packed executable: in this case, a Python script wrapped in an exe.
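To make the overlay idea concrete, here is a minimal sketch that finds where the overlay starts using only the standard library: it walks the PE section table and takes the furthest end of any section's raw data. The function names are our own, the approach ignores edge cases that a dedicated PE parsing library handles, and the PyInstaller check relies on the magic bytes PyInstaller places in its archive cookie.

```python
import struct

# Magic bytes PyInstaller writes in the cookie of its embedded archive.
PYINSTALLER_MAGIC = b"MEI\x0c\x0b\x0a\x0b\x0e"


def overlay_offset(data: bytes) -> int:
    """Return the file offset just past the last section's raw data."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ/PE file")
    (pe_off,) = struct.unpack_from("<I", data, 0x3C)             # e_lfanew
    if data[pe_off:pe_off + 4] != b"PE\x00\x00":
        raise ValueError("PE signature not found")
    (num_sections,) = struct.unpack_from("<H", data, pe_off + 6)
    (opt_size,) = struct.unpack_from("<H", data, pe_off + 20)    # SizeOfOptionalHeader
    table = pe_off + 24 + opt_size                               # first section header
    end = 0
    for i in range(num_sections):
        # SizeOfRawData and PointerToRawData sit 16 bytes into each 40-byte header.
        raw_size, raw_ptr = struct.unpack_from("<II", data, table + i * 40 + 16)
        end = max(end, raw_ptr + raw_size)
    return end


def extract_overlay(data: bytes) -> bytes:
    """Return everything appended after the last section (empty if no overlay)."""
    return data[overlay_offset(data):]


def looks_like_pyinstaller(overlay: bytes) -> bool:
    """True if the overlay contains the PyInstaller archive magic."""
    return PYINSTALLER_MAGIC in overlay
```

With the sample saved as, say, `sample.exe.txt` (a hypothetical file name), reading it with `open(path, "rb").read()` and passing the bytes to `extract_overlay` yields the appended data, and `looks_like_pyinstaller` gives a quick hint about whether a PyInstaller extractor is the right next tool.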

Enough playing around. We know there is Python inside, so let's crack it open and extract it. There are many ways to extract the section we are interested in. There is even a library in the PyPI registry to help us with this: `pefile`. We can write a quick script to read the executable, get the overlay offset, and dump the file contents from that offset until the end of the file. But why stop there? There is another Python script available that can extract all the `pyc` files, which are the Python bytecode within our exe: `pyinstxtractor`. Let's run this and get the interesting files.

Image 10: extracting pyc bytecode from exe via pyinstxtractor.

In Image 10, we can see that this tool not only extracts our files of interest, but also tells us the probable entry point, which in this case is `source.pyc`. That is useful because these executables are bundled with everything they need to run, meaning every library and function they use. Malicious actors can also add dead code to make things confusing, so knowing the entry point is valuable.

Image 11: extracted files

Finally, we can use a Python decompiler to go from bytecode (`pyc` files) to source code (`py` files). There are plenty of options out there. Don't you just love open source? For our analysis, we used `uncompyle6`, which can of course be found on the PyPI registry. This is as simple as point and run.

Ultimately, this looks like some sort of cryptominer. It reaches out to pastebin and other sites. It contains many references to crypto wallets, specifically Exodus wallets, and even uses some Discord webhooks for exfiltration and communication.

The feeling here is the reason I love doing this: Putting it all together in an article makes it look fast and simple. And sometimes with enough experience, it can certainly be that way. But there is nothing better than banging your head against the wall for a couple days, then getting an epiphany mid-day while doing something completely different, and finally coming back to the problem to realize the solution. That feeling of finally understanding everything you were working on is amazing, and a great part of the reason why I love to wrap it all nicely in a blog post and share my experience.

We didn't end up fully in the deep end; we just briefly tested the waters and found what we wanted. But there is so much more that can be done depending on the complexity of the malware sample. In this case, we only needed some basic malware analysis to get to the bottom of it, but perhaps some reversing with IDA or Ghidra can be next.

All our research regarding this package is now available in our products and cataloged under Sonatype-2023-0134. Users of Sonatype Repository Firewall can rest easy knowing that such malicious packages would automatically be blocked from reaching their development builds.

If you're not yet a Sonatype customer and want to find out if your code is vulnerable, you can use our free Sonatype Vulnerability Scanner to quickly find out.

Written by Juan Aguirre

Juan is a security researcher at Sonatype and part of the team who has helped Sonatype catalog more than 100 million open source components.
