Cold case: protocol reverse engineering (part 2)

October 18, 2017

If you missed the previous post, you can find it here.

I was determined to find out which protocol they were mimicking. If you know the OSI model (often called ISO/OSI), you already know that each layer may add (and in practice does add) a header, and sometimes also a trailer, to every Protocol Data Unit (PDU), in a process named "encapsulation".

So, I ignored the Ethernet "type" field and tried to match the protocol manually by looking at the raw PDU (printed in hexadecimal on a piece of paper). After some trial and error, I found that the PDU was a real, complete IP packet! I hadn't tried IP first because I thought they had built a custom protocol on top of something simpler or older than IP.
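The same check can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure used at the time (that was pen and paper): the function names are made up for this example, and the sample bytes are a well-known textbook IPv4 header, not data from the capture. A candidate PDU "looks like" IPv4 if the version nibble is 4, the lengths are consistent and the header checksum verifies.

```python
import struct

def ipv4_checksum(header: bytes) -> int:
    """Ones'-complement sum of the 16-bit words of an IPv4 header (RFC 791)."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack(f">{len(header) // 2}H", header))
    while total >> 16:                      # fold the carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def looks_like_ipv4(pdu: bytes) -> bool:
    """Heuristic: does this raw PDU start with a valid IPv4 header?"""
    if len(pdu) < 20:
        return False
    version, ihl = pdu[0] >> 4, pdu[0] & 0x0F
    if version != 4 or ihl < 5 or len(pdu) < ihl * 4:
        return False
    (total_length,) = struct.unpack_from(">H", pdu, 2)
    if total_length < ihl * 4 or total_length > len(pdu):
        return False
    # Summing a valid header *including* its checksum field gives zero.
    return ipv4_checksum(pdu[: ihl * 4]) == 0

# Textbook sample header (total length 0x73 = 115 bytes), padded with a dummy payload.
sample = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7") + bytes(95)
print(looks_like_ipv4(sample))
```

A checksum match on a random 20-byte window is very unlikely, which is why this test is much stronger than just spotting a leading `0x45`.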

After that discovery, I forced Wireshark to decode it as IP, and the TCP connection showed up as well. The counters, addresses and other things that I had been guessing at before (see previous post) were actually IP and TCP fields (IP addresses, TCP ports, sequence numbers, etc.).

I was happy because this was a big step forward. Now we can assume that the payload at the TCP level is the actual proprietary protocol. And here comes the hardest part.

I made a number of traces (~200) and went back to vbindiff to compare the TCP data. Note that, as in the previous case, each trace was made with a passive bridge tap while performing some action on the controller/PLC. Every trace was then saved as a raw capture plus "metadata" (the complete environment, e.g. which commands were sent by the controller, which actions were performed on the PLC, which values were displayed or set).
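The core of this comparison work can be sketched in Python. This is not a replacement for vbindiff, just a minimal illustration of the idea: given several TCP payloads captured while changing one thing at a time, list the byte offsets that actually vary. The function name and the sample payloads are invented for the example.

```python
def varying_offsets(payloads):
    """Return the byte offsets whose value differs in at least one payload.

    Only the common prefix (length of the shortest payload) is compared.
    """
    if not payloads:
        return []
    shortest = min(len(p) for p in payloads)
    return [i for i in range(shortest) if len({p[i] for p in payloads}) > 1]

# Two made-up payloads: same message, different counter and different value.
a = bytes.fromhex("01000a41200000")
b = bytes.fromhex("01000b41a00000")
print(varying_offsets([a, b]))  # [2, 4]
```

Offsets that never change across all traces are candidates for fixed header fields, while offsets that change with every message are candidates for counters. This is where the "metadata" saved with each trace becomes essential.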

What I learned by doing many comparisons (which took me nearly two weeks) is that:

The "dual reply" that we were seeing before the IP decoding was actually made of TCP ACKs: if the PLC was slow to reply (we're talking about milliseconds...), the networking component on the PLC sent a separate ACK before the actual reply. Otherwise, if the queried data was available immediately, it responded with a single ACK,PSH packet (as is standard)
The connection carries two kinds of messages:
- Periodic messages from the PLC to the controller (not enabled by default)
- Controller actions/queries (with their replies)
The message type is an 8-bit field inside the TCP PDU
The second byte indicates whether the PDU is a reply or a request
There are two counters: one is used to detect duplicates in a dual-Ethernet configuration, the other to detect duplicates on the same link
On a value set or get, values are encoded as TLV (type-length-value) at the end of the PDU, including strings and the item ID (which, in this system, is an 8+3 static string, whitespace-padded)
Floating point values are encoded as 4-byte IEEE 754
The byte order is big-endian
The FCS field of the Ethernet frame was used for a custom counter (maybe something that the controller board uses)
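To make the TLV part more concrete, here is a minimal Python sketch of how such a tail could be decoded. Everything specific in it is an assumption made for illustration: the 1-byte type and 1-byte length, the type codes `TLV_ITEM_ID` and `TLV_FLOAT`, and the sample item name are invented and are not the real values of the protocol. Only the general shape (TLV, 8+3 whitespace-padded item ID, big-endian 4-byte IEEE 754 floats) comes from the findings above.

```python
import struct

# Hypothetical type codes, chosen only for this example.
TLV_ITEM_ID = 0x01
TLV_FLOAT = 0x02

def parse_tlvs(data: bytes):
    """Split data into (type, value) pairs, assuming 1-byte type and 1-byte length."""
    items, pos = [], 0
    while pos < len(data):
        if pos + 2 > len(data):
            raise ValueError("truncated TLV header")
        tlv_type, length = data[pos], data[pos + 1]
        value = data[pos + 2 : pos + 2 + length]
        if len(value) != length:
            raise ValueError("truncated TLV value")
        items.append((tlv_type, value))
        pos += 2 + length
    return items

def decode_value(tlv_type: int, value: bytes):
    """Decode the two value kinds used in this sketch; return raw bytes otherwise."""
    if tlv_type == TLV_ITEM_ID and len(value) == 11:
        # 8+3 static string, whitespace-padded
        text = value.decode("ascii")
        return text[:8].rstrip(), text[8:].rstrip()
    if tlv_type == TLV_FLOAT and len(value) == 4:
        return struct.unpack(">f", value)[0]   # big-endian IEEE 754 single
    return value

# Made-up tail of a "set value" PDU: item ID followed by a float.
tail = bytes([TLV_ITEM_ID, 11]) + b"TANK1   PV " + bytes([TLV_FLOAT, 4]) + struct.pack(">f", 12.5)
print([decode_value(t, v) for t, v in parse_tlvs(tail)])  # [('TANK1', 'PV'), 12.5]
```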

A side note: actually the first thing that I noticed was the floating point values. I still remember the IEEE 754 logic and structure (I studied it in high school, nearly 7 years ago), and that helped me because floating point values are stored in a particular (characteristic) format.
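Why floats stand out can be shown with a short Python sketch. It splits a big-endian single-precision value into its sign, exponent and mantissa fields, and then slides a 4-byte window over a payload looking for values in a "plausible" range. The function names and the range limits (`low`, `high`) are arbitrary choices for this example, not something derived from the protocol.

```python
import math
import struct

def float_fields(raw: bytes):
    """Return (sign, biased exponent, mantissa) of a big-endian IEEE 754 single."""
    (bits,) = struct.unpack(">I", raw)
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def plausible_floats(payload: bytes, low=1e-3, high=1e6):
    """List (offset, value) for every 4-byte window that decodes to a 'reasonable' float."""
    hits = []
    for offset in range(len(payload) - 3):
        (value,) = struct.unpack_from(">f", payload, offset)
        if math.isfinite(value) and low <= abs(value) <= high:
            hits.append((offset, value))
    return hits

print(float_fields(bytes.fromhex("41480000")))        # (0, 130, 4718592) -> 12.5
print(plausible_floats(bytes.fromhex("010041480000")))  # [(2, 12.5)]
```

Everyday process values (temperatures, levels, setpoints) have exponents close to the bias of 127, so their first byte tends to fall in a narrow band around `0x3F`-`0x47` (or `0xBF`-`0xC7` for negative numbers), often followed by trailing zero bytes. That pattern is easy to spot in a hex dump once you know it.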

There was enough data to write custom software for testing. But one question remained: how do I simulate a "wrong" ethertype for IP? I suspected that the ethertype was being modified by the industrial Ethernet board (or its driver), so I built my "PLC" client for Windows and ran it on the controller. And I was right: every Ethernet frame belonging to an IP/TCP connection to the PLC had its ethertype changed.

After some adjustments and testing, my client worked as well as the original controller for setting and getting values (not the PLC-programming part yet). But still, how could I use a different industrial Ethernet card with Windows and make it change the ethertype?

With a kernel driver, of course.

The Windows Kernel Driver parable

I must admit: that was a really impulsive decision. But still, it taught me a lot.

Unfortunately I do not remember a lot - I'm not a Windows fan. I downloaded the whole Windows Driver Kit and began to play with some examples. Then I wrote a simple "network filter" that did the same thing as the board: check whether the connection was for the TCP port of the PLC and, if so, change the ethertype field.
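The logic of that filter is simple enough to sketch in Python. To be clear, this is not the driver code (which was a Windows kernel network filter, and which I no longer have at hand); it only shows the per-frame decision on a raw Ethernet frame. `CUSTOM_ETHERTYPE` and `PLC_TCP_PORT` are placeholders: the real values are not given in this post, so the sketch uses an IEEE "local experimental" ethertype and an arbitrary port. VLAN tags are ignored.

```python
import struct

IPV4_ETHERTYPE = 0x0800
CUSTOM_ETHERTYPE = 0x88B5   # placeholder (IEEE local experimental), not the real value
PLC_TCP_PORT = 5000         # placeholder, not the real PLC port
ETH_HEADER_LEN = 14
IPPROTO_TCP = 6

def rewrite_ethertype(frame: bytes) -> bytes:
    """Return the frame with its ethertype swapped if it is IPv4/TCP to the PLC port."""
    if len(frame) < ETH_HEADER_LEN + 20:
        return frame
    (ethertype,) = struct.unpack_from(">H", frame, 12)
    if ethertype != IPV4_ETHERTYPE:
        return frame
    ihl = (frame[ETH_HEADER_LEN] & 0x0F) * 4          # IPv4 header length in bytes
    protocol = frame[ETH_HEADER_LEN + 9]
    if protocol != IPPROTO_TCP or len(frame) < ETH_HEADER_LEN + ihl + 4:
        return frame
    (dst_port,) = struct.unpack_from(">H", frame, ETH_HEADER_LEN + ihl + 2)
    if dst_port != PLC_TCP_PORT:
        return frame
    # Only bytes 12-13 change; the IP packet itself is left untouched.
    return frame[:12] + struct.pack(">H", CUSTOM_ETHERTYPE) + frame[ETH_HEADER_LEN:]
```

Because only the two ethertype bytes change, a receiver that ignores the ethertype and parses the payload as IP sees exactly the same packet either way.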

The "development environment" was a pain: two physical machines (why didn't I use VMs? I tried, but something in the Windows kernel didn't cooperate...) linked by both an Ethernet and a serial connection. There was an option to use Ethernet only, but for some reason it didn't work at all. The "host" was a plain Windows 7 PC with Visual Studio; the "guest" was a Windows 7 machine with debugging enabled.

As you can expect, with a debugger attached to a kernel you can do pretty much everything: freeze the machine (and then continue the execution), inject drivers, trigger events and, of course, trigger a Blue Screen Of Death.

The driver was working perfectly - except for two bugs (which caused some BSODs in tests) that I fixed immediately after each crash. I checked with Wireshark: the packets were indistinguishable (I mean, you could not tell whether they were sent by the original controller or by my software).

Then, after the tests with my client on the driver-equipped machine, I accidentally launched my PLC client on the host machine too (which didn't have the driver). And it worked like a charm. WTF? Why?

It turns out that the custom ethertype (and so the Windows driver) was not really necessary. The standard IP ethertype was accepted just as well.

See why I say that was an impulsive decision?

Conclusion

After that, I moved all the code into the OPC toolkit to build an OPC driver, so I was able to get/set values using OPC-compliant software, such as Kepware's.

It took me nearly three months to reverse engineer this protocol. And I'm still asking myself how the hell I made it.

Appendix - tools

To summarize the tools and how I used them:

Wireshark and tcpdump: used for capturing and decoding packets. Wireshark made this work possible.
A binary diff tool (like vbindiff) if you have a binary protocol to decode, or the standard diff if your protocol is plain text (like HTTP).
A Linux distribution (Debian in this case) with bridge-utils: with the brctl command you can set up a bridge between two (or more) interfaces in a PC.
To build the bridge, I usually carry one or more USB-to-Ethernet adapters.
You may want to disable IPv6, stateless IPv4 configuration, Zeroconf/Avahi/Bonjour services, DHCP clients, network managers and anything else that immediately does something when you plug a cable into your computer. If you don't know what will happen, you need to study more to learn what's running on your computer (e.g. your operating system).
The previous point is why I use GNU/Linux: I have full control over what is running (and where). With pretty much every automation disabled, you still have full control of your system, without the garbage. With Windows, for example, you'll get SMB broadcast packets and things like that.
Reverse engineering a protocol is a matter of knowledge, hard work and luck: you need to know your environment perfectly (e.g. protocols at each layer, engineering solutions embedded in various standards, etc.); you need to do comparisons, tests, speculation, inference, hypotheses, [go back to 1] for an unknown amount of time, a lot more than you might expect; and you need to be lucky.

A lot of the things that helped me did not come from studying networking: programming skills (I'm a C developer, and I have written some small programs in assembly), operating system internals, RFC comments about technical tricks in standards (see, for example, RFCs about IEEE 754 and TCP Tahoe/Reno).