Parsing XML in bash

airman_dopey · January 24, 2013

Hey guys,

I'm trying to parse some information given from etterlog in XML and am not sure how to proceed. The output I am being given is:

<?xml version="1.0" encoding="UTF-8" ?>

<etterlog version="0.7.4.1" date="Wed Jan 23 19:08:37 2013">
	<host ip="192.168.0.1">
		<mac>XX:XX:XX:XX:XX:XX</mac>
		<manuf></manuf>
		<distance>1</distance>
		<type>LAN host</type>
	</host>
	<host ip="192.168.0.4">
		<mac>XX:XX:XX:XX:XX:XX</mac>
		<manuf></manuf>
		<distance>1</distance>
		<type>LAN host</type>
	</host>
	<host ip="192.168.0.102">
		<mac>XX:XX:XX:XX:XX:XX</mac>
		<manuf></manuf>
		<distance>1</distance>
		<type>LAN host</type>
		<port proto="udp" addr="68" service="dhcpclient">
		</port>
		<port proto="udp" addr="137" service="netbios-ns">
		</port>
		<port proto="udp" addr="138" service="netbios-dgm">
		</port>
	</host>
</etterlog>

What I would like to do is parse this and output it as a single line for each <host>info</host> and I am at a loss for how to accomplish this.

<host ip="192.168.0.1"><mac>XX:XX:XX:XX:XX:XX</mac><manuf></manuf><distance>1</distance><type>LAN host</type></host>
<host ip="192.168.0.4"><mac>XX:XX:XX:XX:XX:XX</mac><manuf></manuf><distance>1</distance><type>LAN host</type></host>
<host ip="192.168.0.102"><mac>XX:XX:XX:XX:XX:XX</mac><manuf></manuf><distance>1</distance><type>LAN host</type><port proto="udp" addr="68" service="dhcpclient"></port><port proto="udp" addr="137" service="netbios-ns"></port><port proto="udp" addr="138" service="netbios-dgm"></port></host>

Any ideas how to do this? I am aware that awk (or sed) will probably work here, but my skills in either are not anywhere near this type of problem. Thanks in advance

digip · January 24, 2013

It might be easier to cat the file and pipe it through grep, awk, and so on, writing each part out to a single text file, or to use some other language to strip the carriage returns and keep, via regex, only what lies between the first <host and the last </host>. I generally suck at regex, but it can be helpful for returning only the data you want. Bwall could probably write a one-liner that does just what you ask for. I suck at bash and programming in general, but I would read up on the cat, grep, awk and sed commands and how to cut out the things you don't need. That should be able to return all of it on one line, which you can then >> to a single file.

Jason Cooper · January 24, 2013

This is where I would reach for good old reliable Perl.

perl -e'$x=join("",<STDIN>);$x=~s/\s*[\r\n]+\s*//gs; $x=~s/^.*?(<host.*<\/host>).*?$/$1/i;$x=~s/<\/host>/<\/host>\n/gi;print $x;' <InputFile.xml

First we load the whole input into a variable ($x), and then there are just three stages, as digip suggested. The first removes all the newlines (including any spaces before and after those characters), which leaves us with a single line. Second, we strip everything outside of the host tags ($x=~s/^.*?(<host.*<\/host>).*?$/$1/i;). Finally, we put a newline after every closing host tag.
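
For readers who don't use Perl, here is a minimal sketch of the same three stages in Python. The function name hosts_one_per_line and the shortened SAMPLE are made up for illustration, and the regex caveats of the one-liner apply here too: it only works because the input is this regular.

```python
import re

# Shortened copy of the etterlog output from the first post.
SAMPLE = """<?xml version="1.0" encoding="UTF-8" ?>

<etterlog version="0.7.4.1" date="Wed Jan 23 19:08:37 2013">
    <host ip="192.168.0.1">
        <mac>XX:XX:XX:XX:XX:XX</mac>
        <manuf></manuf>
    </host>
    <host ip="192.168.0.102">
        <mac>XX:XX:XX:XX:XX:XX</mac>
        <port proto="udp" addr="68" service="dhcpclient">
        </port>
    </host>
</etterlog>
"""


def hosts_one_per_line(xml_text):
    """Mirror the three substitutions of the Perl one-liner (hypothetical helper)."""
    # Stage 1: remove every line break plus the whitespace around it.
    flat = re.sub(r"\s*[\r\n]+\s*", "", xml_text)
    # Stage 2: keep only the span from the first <host to the last </host>.
    match = re.search(r"<host.*</host>", flat, flags=re.IGNORECASE)
    if match is None:
        return ""
    # Stage 3: put a newline after every closing host tag.
    return re.sub(r"</host>", "</host>\n", match.group(0), flags=re.IGNORECASE)


print(hosts_one_per_line(SAMPLE), end="")
```

Unlike the Perl version, this returns an empty string when the input contains no host element instead of echoing the flattened input back.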

If you are wanting to do this regularly then I would suggest putting it into an actual perl script file that you can just run and direct your XML into.

The next question is what you are planning to do once you have the files in this format. I assume you are planning on looking for changes over time by comparing older files with newer files, but if you are planning on doing something more complex then it could pay off to expand the script further.

Sitwon · January 24, 2013

To be honest, regular expressions are the wrong tool to use for working with XML. In this case, you are trying to do a simple enough transformation on a narrow enough data set that it could work, but in general you should avoid using regular expressions on context-free grammars.

To put it simply, the XML data model is conceptually a tree of nodes. The tags we see are just a serialization of that tree. The way to deal with XML is as a node tree, not as a set of tags. One of the problems with using tools like RegEx to deal with tags is that RegEx doesn't do tag-matching. RegEx looks for patterns as a sequence of characters in a line of text, it has no concept of "matching tag" or parent-child relationships.

If you're just transforming whitespace you might get away with it, for a while, but when you start manipulating the data your house of cards will likely collapse.
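
To make the node-tree idea concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The helper names and the shortened SAMPLE are made up for illustration. It parses the document into a tree, drops the whitespace that exists only for layout, and serializes each host node back onto one line.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the etterlog output from the first post.
SAMPLE = """<?xml version="1.0" encoding="UTF-8" ?>
<etterlog version="0.7.4.1" date="Wed Jan 23 19:08:37 2013">
    <host ip="192.168.0.1">
        <mac>XX:XX:XX:XX:XX:XX</mac>
        <manuf></manuf>
        <distance>1</distance>
        <type>LAN host</type>
    </host>
    <host ip="192.168.0.102">
        <mac>XX:XX:XX:XX:XX:XX</mac>
        <port proto="udp" addr="68" service="dhcpclient">
        </port>
        <port proto="udp" addr="137" service="netbios-ns">
        </port>
    </host>
</etterlog>
"""


def strip_layout_whitespace(elem):
    """Drop whitespace-only text and tails so the output has no line breaks."""
    for node in elem.iter():
        if node.text is not None and not node.text.strip():
            node.text = None
        if node.tail is not None and not node.tail.strip():
            node.tail = None


def hosts_as_lines(xml_text):
    """Return one serialized <host> element per list entry."""
    root = ET.fromstring(xml_text)
    lines = []
    for host in root.iter("host"):
        strip_layout_whitespace(host)
        # short_empty_elements=False keeps <manuf></manuf> instead of <manuf />.
        lines.append(
            ET.tostring(host, encoding="unicode", short_empty_elements=False)
        )
    return lines


for line in hosts_as_lines(SAMPLE):
    print(line)
```

Because the parser does the tag matching, this keeps working if the indentation changes or a host gains nested children. Note that attribute order in the output follows the input only on Python 3.8 and later.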

airman_dopey · January 24, 2013

Let me explain what I am doing and maybe that will help.

Currently a friend and I have been trying to greatly expand our knowledge of the different security tools. One area we are working on is passive OS fingerprinting. We have developed a script that reads the output of p0f, sorts the data into multiple arrays based on the last octet (if the IP address is found to be within our network), and displays an easy-to-read report of the information found. The problem is that p0f only gets OS, browsers, apps, and uptime.

Since we had already gone over some of the ARP spoofing that ettercap can do, I dug further to see if it can also do any fingerprinting and found that it does. After creating the passive log and spitting it out using etterlog, we discovered MAC address, port, and a few other things that we'd like to add to the report in real time.

My idea is to handle this the same way we do it for p0f. There, each line is one "report" with two addresses (server and client), with additional information based on which IP address it is from, as specified by the "subj" variable (cli or srv). So sorting that information is easy. For ettercap, I want to keep a timestamp going (date +%s), and after 10 seconds I want a function to dump said report into the p0f file I am parsing (using tail) and read each line. Since each line would have the host information at the front, sorting the remaining information would be much easier.

Thank you for the responses so far, and if you have better ideas for what I am trying to do I'd love to hear them.

Sitwon · January 24, 2013

If your ultimate goal is to parse the information out of the XML document, then I would recommend using a language with convenient libraries for parsing XML. Python supports both the SAX and DOM APIs, and Perl and Ruby do as well. JavaScript does DOM very well, if you're more comfortable with that. Java supports SAX and DOM, and Scala adds syntactic sugar to Java's DOM support. XSLT/XPath support a different, arguably more complete API, but that's a bit esoteric to learn just for a short script. Erlang has terrible XML support, but I hear it's improving.
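
As an illustration of the DOM route in Python, here is a minimal sketch using the standard xml.dom.minidom module. The helper names, the shortened SAMPLE, and the pipe-delimited layout produced by report_line are all hypothetical; nothing in this thread specifies what the final report line should look like.

```python
from xml.dom import minidom

# Shortened copy of the etterlog output from the first post.
SAMPLE = """<?xml version="1.0" encoding="UTF-8" ?>
<etterlog version="0.7.4.1" date="Wed Jan 23 19:08:37 2013">
    <host ip="192.168.0.1">
        <mac>XX:XX:XX:XX:XX:XX</mac>
        <manuf></manuf>
        <type>LAN host</type>
    </host>
    <host ip="192.168.0.102">
        <mac>XX:XX:XX:XX:XX:XX</mac>
        <manuf></manuf>
        <type>LAN host</type>
        <port proto="udp" addr="68" service="dhcpclient">
        </port>
        <port proto="udp" addr="137" service="netbios-ns">
        </port>
    </host>
</etterlog>
"""


def child_text(parent, tag):
    """Return the text of the first <tag> under parent, or "" if absent or empty."""
    nodes = parent.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return ""
    return nodes[0].firstChild.data.strip()


def host_records(xml_text):
    """Turn every <host> element into a plain dictionary."""
    doc = minidom.parseString(xml_text)
    records = []
    for host in doc.getElementsByTagName("host"):
        records.append({
            "ip": host.getAttribute("ip"),
            "mac": child_text(host, "mac"),
            "type": child_text(host, "type"),
            "ports": [
                (p.getAttribute("proto"), p.getAttribute("addr"),
                 p.getAttribute("service"))
                for p in host.getElementsByTagName("port")
            ],
        })
    return records


def report_line(record):
    """Hypothetical flat layout: ip|mac|type|proto/port/service,..."""
    ports = ",".join("/".join(port) for port in record["ports"])
    return "|".join([record["ip"], record["mac"], record["type"], ports])


for record in host_records(SAMPLE):
    print(report_line(record))
```

Once the data is in dictionaries, the one-line-per-host format becomes an output choice rather than something the XML itself has to be bent into.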

As for Bash, well I have yet to see a clean XML parser implementation for Bash. Not to say it is impossible, but I haven't seen one so far.

Jason Cooper · January 24, 2013

Sitwon wrote:
> To put it simply, the XML data model is conceptually a tree of nodes. The tags we see are just a serialization of that tree. The way to deal with XML is as a node tree, not as a set of tags. One of the problems with using tools like RegEx to deal with tags is that RegEx doesn't do tag-matching. RegEx looks for patterns as a sequence of characters in a line of text, it has no concept of "matching tag" or parent-child relationships.

I have encountered a number of situations where regular expressions were the only way to effectively process the necessary data out of some XML files in the time required. Generally it was when the XML files I was dealing with were so large, both physically and logically, that the overhead of fully parsing the XML would tie up the server's available resources for far too long. In one case we were dealing with a processing time of over a day on a server with 16GB of memory available to it; switching to regular expressions brought that time down to about an hour. Other times I have had to use regular expressions to recover data from corrupt XML files that actually break the parser.

You do have to be careful though when using regular expressions to process data from XML files, and usually you will be better off using a proper XML parser. Generally the best way for programmers to think of dealing with XML is "If you don't have a reason why you can't use a proper XML parser then you should be using one."
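
Between loading a whole tree and falling back to regular expressions there is also incremental parsing. The sketch below uses ElementTree's iterparse from the Python standard library; the function name iter_hosts and the shortened SAMPLE are made up for illustration. It is only an example of the technique, with no claim that it would have been fast enough for the large files described above.

```python
import io
import xml.etree.ElementTree as ET

# Shortened copy of the etterlog output from the first post, as bytes.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<etterlog version="0.7.4.1" date="Wed Jan 23 19:08:37 2013">
    <host ip="192.168.0.1">
        <mac>XX:XX:XX:XX:XX:XX</mac>
    </host>
    <host ip="192.168.0.102">
        <mac>XX:XX:XX:XX:XX:XX</mac>
        <port proto="udp" addr="68" service="dhcpclient">
        </port>
        <port proto="udp" addr="137" service="netbios-ns">
        </port>
    </host>
</etterlog>
"""


def iter_hosts(source):
    """Yield (ip, [service, ...]) per <host>, discarding each subtree once read."""
    # "end" events fire when an element's closing tag has been parsed,
    # so the host and all of its children are complete at that point.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "host":
            services = [port.get("service") for port in elem.iter("port")]
            yield elem.get("ip"), services
            # Free the children and attributes of the host we just handled.
            elem.clear()


for ip, services in iter_hosts(io.BytesIO(SAMPLE)):
    print(ip, services)
```

The root element still keeps an emptied stub for every host, so memory use grows slowly with the number of hosts rather than staying flat. A file name can be passed in place of the BytesIO object.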

As to what airman_dopey is trying to do I suspect that it might be easier to create a program/script that effectively sucks data in from both logs while p0f and ettercap are running and uses that data to update its own data and then regularly uses its data to output a report.

airman_dopey · January 24, 2013

Jason Cooper wrote:
> As to what airman_dopey is trying to do I suspect that it might be easier to create a program/script that effectively sucks data in from both logs while p0f and ettercap are running and uses that data to update its own data and then regularly uses its data to output a report.

That's exactly what I am trying to do. The original XML will be removed once I pull the relevant info from it. But to more easily find what info is in the log, I want one host per line.

If this thread has shown me anything it's that I am far behind when it comes to programming languages.

digip · January 25, 2013

Another approach you could try, although I suck at programming myself, is serializing the data into an array with JSON and PHP and spitting it all out into a web page interface that is even easier to read, maybe with some AJAX to update it asynchronously. Most XML can be displayed in a web page if you supply a pre-defined XML style sheet to tell the browser how to display it. I'm just not the one to write it, since I wouldn't know where to begin, but I know it can be done, the same way Google Maps can be fed XML and GPX files (which are also XML files) to plot all the info on a map, or the way nmap spits out an XML file for browsers to view reports.

airman_dopey · February 2, 2013

Jason Cooper wrote:
> This is where I would reach for good old reliable Perl.
>
> perl -e'$x=join("",<STDIN>);$x=~s/\s*[\r\n]+\s*//gs; $x=~s/^.*?(<host.*<\/host>).*?$/$1/i;$x=~s/<\/host>/<\/host>\n/gi;print $x;' <InputFile.xml
>
> First we load the whole input into a variable ($x), and then there are just three stages, as digip suggested. The first removes all the newlines (including any spaces before and after those characters), which leaves us with a single line. Second, we strip everything outside of the host tags ($x=~s/^.*?(<host.*<\/host>).*?$/$1/i;). Finally, we put a newline after every closing host tag.
>
> If you are wanting to do this regularly then I would suggest putting it into an actual perl script file that you can just run and direct your XML into.
>
> The next question is what you are planning to do once you have the files in this format. I assume you are planning on looking for changes over time by comparing older files with newer files, but if you are planning on doing something more complex then it could pay off to expand the script further.

I am an A-hole for not liking the post. Thank you again, as this is what we used for our tool. Thank you again!
