Help parsing a document?

toughbunny · November 17, 2015

Hi guys,

I need help parsing a document (dialogue from Die Hard!)

Right now, it is in this XML format:

<s id="631">
    <time id="T583S" value="00:58:08,840" />
    <w id="631.1">Do</w>
    <w id="631.2">you</w>
    <w id="631.3">really</w>
    <w id="631.4">think</w>
    <w id="631.5">you</w>
    <w id="631.6">have</w>
    <w id="631.7">a</w>
    <w id="631.8">chance</w>
    <w id="631.9">against</w>
    <w id="631.10">us</w>
    <w id="631.11">,</w>
    <w id="631.12">Mr.</w>
    <w id="631.13">Cowboy</w>
    <w id="631.14">?</w>
    <time id="T583E" value="00:58:11,960" />
  </s>
  <s id="632">
    <time id="T584S" value="00:58:14,960" />
    <w id="632.1">Yippee-</w>
    <w id="632.2">ki-</w>
    <w id="632.3">yay</w>
    <w id="632.4">,</w>
    <w id="632.5">motherfucker</w>
    <w id="632.6">.</w>
    <time id="T584E" value="00:58:17,000" />
  </s>

What I want is for it to be parsed so that each "id" block is turned into only the text it contains, with one "id" block per line. For example, this text would become exactly the following:

Do you really think you have a chance against us Mr. Cowboy?
Yippee-ki-yay, motherfucker.

How do I do this?

Edited November 17, 2015 by toughbunny

overwraith · November 17, 2015

You really should tell everybody which programming language you want it parsed in. As it stands it's really just kind of sinister.

Edited November 17, 2015 by overwraith

toughbunny · November 17, 2015

@overwraith, the language really doesn't matter to me, so whatever you're comfortable with. I just need to get the xml file to the plaintext file.

cooper · November 17, 2015

The best 'language' for processing XML this way is, I feel, XSL. It's also easy because you can use your browser as the conversion tool. The stylesheet below converts to HTML, which is a copy/paste away from the format you want. Here's the style sheet:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/">
            <html>
                  <body>
                        <xsl:for-each select="*/s">
                              <xsl:for-each select="w">
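                                    <!-- no xsl:sort needed: the w elements are already in document order (sorting on @id as text would put 631.10 before 631.2) -->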
                                    <xsl:value-of select="concat(text(),' ')"/>
                              </xsl:for-each>
                              <br/>
                        </xsl:for-each>
                  </body>
            </html>
      </xsl:template>
</xsl:stylesheet>

Save this to a "stylesheet.xsl" file in the same folder as your XML file. Make your XML file start like this:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="stylesheet.xsl"?>

Now open the XML file in any modern browser.

You're welcome.

toughbunny · November 17, 2015

Thanks so much, this works great. Just one question: would there be a way to have this output the plaintext to a file? I have enough files in enough folders that adjusting the XML of each one manually and then putting the stylesheet in each folder manually is impractical.

Edited November 17, 2015 by toughbunny

Sebkinne · November 17, 2015

Really quick and dirty python 2.7 script:

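import xml.etree.ElementTree as ET

# Parse the subtitle file and grab its root element
tree = ET.parse('test.xml')
root = tree.getroot()

# One output line per <s> block
for segment in root.iter("s"):
	for word in segment.iter("w"):
		print word.text,  # trailing comma: stay on the same line, separated by a space
	print  # end the line for this <s> block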

Note that this will insert spaces after every element (such as punctuation). Should be trivial to add the check and not to print a space.

If you want to save the output, either redirect the output or add some file IO in there.
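As a rough sketch of both suggestions (the spacing check and the file IO), something like the following could work. It is written for Python 3 rather than 2.7, and the names `join_words`, `sentences` and `convert` are made up for illustration. The spacing rule is only a guess at what the sample needs: no space before a token that consists purely of punctuation, and no space after a token ending in a hyphen.

```python
import xml.etree.ElementTree as ET

# Assumption: tokens made up only of these characters attach to the previous word.
CLOSING_PUNCTUATION = set(".,?!;:")

def join_words(words):
    """Join tokens with spaces, except before punctuation and after a trailing hyphen."""
    line = ""
    for word in words:
        attach = (
            not line
            or all(ch in CLOSING_PUNCTUATION for ch in word)
            or line.endswith("-")
        )
        line += word if attach else " " + word
    return line

def sentences(xml_source):
    """Yield one line of text per <s> element of a file name or file object."""
    root = ET.parse(xml_source).getroot()
    for segment in root.iter("s"):
        yield join_words(w.text for w in segment.iter("w") if w.text)

def convert(xml_path, txt_path):
    """Write the text of every <s> element in xml_path to txt_path, one per line."""
    with open(txt_path, "w", encoding="utf-8") as out:
        for line in sentences(xml_path):
            out.write(line + "\n")
```

With this rule the second block of the sample comes out as "Yippee-ki-yay, motherfucker." and calling `convert('test.xml', 'test.txt')` would write the result next to the input. Other subtitle files may need a different punctuation set.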

Best Regards,

Sebkinne

Mr-Protocol · November 17, 2015

You got that from stack overflow didn't you? :P

Sebkinne · November 18, 2015

Nah, just had a couple of minutes and saw this thread. I think most people use ElementTree / get the basics from the manpage.

Mr-Protocol · November 18, 2015

Just messing with you. Kind of like urlsnarf.

toughbunny · November 18, 2015

Thank you so much, this worked great. Just one question: do you think I could modify this code so that it performs the function on every single file in my directory full of xml files? I have a huge number, so it is impractical to modify the code each time. Also: if I were to use this command

python xmlparse.py >> myfile.txt

would that work, to put all of the outputs in one single plaintext file?

Sebkinne · November 18, 2015

Yeah, that is trivial to implement. And yes, that redirection should work.
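For what it's worth, one possible way to do the "every file in the directory" part is sketched below. It assumes Python 3, and the helper names `xml_files` and `lines_of` are invented for this example. It walks a folder (and its subfolders), parses every file ending in .xml, and prints one line per <s> block, so a single redirection collects everything.

```python
import os
import sys
import xml.etree.ElementTree as ET

def xml_files(top):
    """Yield the path of every .xml file below `top`, in a predictable order."""
    for folder, subfolders, names in os.walk(top):
        subfolders.sort()  # os.walk visits subfolders in the order of this list
        for name in sorted(names):
            if name.lower().endswith(".xml"):
                yield os.path.join(folder, name)

def lines_of(path):
    """Yield the space-joined words of each <s> element in one file."""
    root = ET.parse(path).getroot()
    for segment in root.iter("s"):
        yield " ".join(w.text for w in segment.iter("w") if w.text)

if __name__ == "__main__":
    # Folder to scan: first command-line argument, or the current folder.
    top = sys.argv[1] if len(sys.argv) > 1 else "."
    for path in xml_files(top):
        try:
            for line in lines_of(path):
                print(line)
        except ET.ParseError as err:
            # Report malformed files on stderr and carry on with the rest.
            sys.stderr.write("skipping %s: %s\n" % (path, err))
```

Saved as, say, xmlparse_all.py (a hypothetical name), it could be run as `python xmlparse_all.py some_folder >> myfile.txt`.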

Best Regards,

Sebkinne

cooper · November 18, 2015

Save this as that_script.sh and make it executable:

#!/bin/sh

cat "$1" | python xmlparse.py >> "$1.txt"

Then run it on every XML file under the current folder:

find . -name \*.xml -exec ./that_script.sh {} \;

You'll need to alter the Python script to read its XML from standard input. Alternatively, pass the name of the file to process as a parameter to the script, which still requires a small update to the script.

Edited November 18, 2015 by cooper
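A minimal sketch of that small update, covering both variants, might look like this. It assumes Python 3, and `print_sentences` and `main` are names chosen for illustration, not part of the original script.

```python
import sys
import xml.etree.ElementTree as ET

def print_sentences(source, out=None):
    """Write one line per <s> element; `source` is a file name or a file object."""
    if out is None:
        out = sys.stdout
    root = ET.parse(source).getroot()
    for segment in root.iter("s"):
        out.write(" ".join(w.text for w in segment.iter("w") if w.text) + "\n")

def main(argv):
    """Parse the file named in argv[1], or standard input when no name is given."""
    # sys.stdin.buffer is the raw byte stream, which lets the parser honour
    # the encoding declared in the XML header.
    source = argv[1] if len(argv) > 1 else sys.stdin.buffer
    print_sentences(source)
```

Ending the file with `main(sys.argv)` would turn it into a script that works both as `python xmlparse.py < file.xml` and as `python xmlparse.py file.xml`.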

zoro25 · December 9, 2015

I'm surprised that no one mentioned regular expressions (regex); parsing is what they were created for. The code below will match your text in 18 chars.

(?<=">).*?(?=</w>)

You may need to escape special chars depending on the language you are coding in.

Explanation of the code follows.

() are groupings.

Inside the first grouping we have ?<= which means match a prefix but exclude it from the returned results.

We then have "> which is the end of the opening XML tag (right after the id attribute), just before the text we want.

I then have a wildcard match .*? which means match any character, any number of repetitions, as few as possible.

Then another group (), and inside that group I have ?= which means match a suffix and don't return it in the results.

The suffix I'm matching on is </w>, the end tag in your XML.

So I'm saying: match anything between "> and </w>, and don't return those two delimiters or anything outside them, just what's in between.
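To see the pattern in action, here is a small sketch using Python's `re` module (the choice of Python and the shortened sample are assumptions made for this example). It also shows one case where the pattern depends on the layout of the file: `.` does not cross line breaks by default, so the match only stays clean while every tag sits on its own line.

```python
import re

# The pattern from above: text preceded by "> and followed by </w>
WORD = re.compile(r'(?<=">).*?(?=</w>)')

sample = '''<s id="632">
    <time id="T584S" value="00:58:14,960" />
    <w id="632.1">Yippee-</w>
    <w id="632.2">ki-</w>
    <w id="632.3">yay</w>
    <time id="T584E" value="00:58:17,000" />
  </s>'''

words = WORD.findall(sample)
print(words)  # → ['Yippee-', 'ki-', 'yay']

# Same kind of data squeezed onto one line: the first "> now belongs to <s>,
# so the match starts too early and drags a tag along.
one_line = '<s id="1"><w id="1.1">Hi</w></s>'
print(WORD.findall(one_line))  # → ['<w id="1.1">Hi']
```

Note that this returns a flat list of words. Grouping them back into one line per <s> block would need extra logic on top of the regex.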

Hope it helps.

Edited December 9, 2015 by zoro25

cooper · December 9, 2015

A regexp is great for finding significant parts in a structured document, but it will always lose to a format-specific parser, which is what that ElementTree thing is.
