
Help parsing a document?


toughbunny

Hi guys,

I need help parsing a document (dialogue from Die Hard!)

Right now, it is in this XML format:

  <s id="631">
    <time id="T583S" value="00:58:08,840" />
    <w id="631.1">Do</w>
    <w id="631.2">you</w>
    <w id="631.3">really</w>
    <w id="631.4">think</w>
    <w id="631.5">you</w>
    <w id="631.6">have</w>
    <w id="631.7">a</w>
    <w id="631.8">chance</w>
    <w id="631.9">against</w>
    <w id="631.10">us</w>
    <w id="631.11">,</w>
    <w id="631.12">Mr.</w>
    <w id="631.13">Cowboy</w>
    <w id="631.14">?</w>
    <time id="T583E" value="00:58:11,960" />
  </s>
  <s id="632">
    <time id="T584S" value="00:58:14,960" />
    <w id="632.1">Yippee-</w>
    <w id="632.2">ki-</w>
    <w id="632.3">yay</w>
    <w id="632.4">,</w>
    <w id="632.5">motherfucker</w>
    <w id="632.6">.</w>
    <time id="T584E" value="00:58:17,000" />
  </s>

What I want is for it to be parsed so that each "id" block (each <s> element) becomes only the text it contains, with one block per line. For example, this text would become exactly the following:

Do you really think you have a chance against us, Mr. Cowboy?
Yippee-ki-yay, motherfucker.

How do I do this?

Edited by toughbunny

I feel the best 'language' for processing XML this way is XSL. It's also easy because your browser can do the conversion. The stylesheet below converts to HTML, which is a copy/paste away from the format you want. Here's the style sheet:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/">
            <html>
                  <body>
                        <!-- one output line per <s> element under the root element -->
                        <xsl:for-each select="*/s">
                              <!-- <w> elements are visited in document order, so no sort is needed -->
                              <xsl:for-each select="w">
                                    <xsl:value-of select="concat(text(),' ')"/>
                              </xsl:for-each>
                              <br/>
                        </xsl:for-each>
                  </body>
            </html>
      </xsl:template>
</xsl:stylesheet>

Save this to a "stylesheet.xsl" file in the same folder as your XML file. Make your XML file start like this:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="stylesheet.xsl"?>

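Now open the XML file in any modern browser. If the browser refuses to apply a stylesheet loaded from a local file, serve the folder from a local web server instead.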

You're welcome.


Thanks so much, this works great. Just one question: would there be a way to have this output the plaintext to a file? I have enough files in enough folders that adjusting the XML of each one manually and then putting the stylesheet in each folder manually is impractical.

Edited by toughbunny

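Really quick and dirty Python 2.7 script:

import xml.etree.ElementTree as ET

tree = ET.parse('test.xml')
root = tree.getroot()

# One output line per <s> segment, with its <w> words joined by spaces.
# print(...) with a single argument behaves the same on Python 2.7 and 3.
for segment in root.iter("s"):
    print(" ".join(word.text or "" for word in segment.iter("w")))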

Note that this puts a space between every element, so punctuation ends up with a space in front of it. It should be trivial to add a check that skips the space in those cases.
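A rough sketch of that check, in case it helps. It assumes Python 3 and that a token consisting of a single punctuation mark should attach to the previous word, while a token ending in a hyphen (like "Yippee-") should attach to the next one. The names NO_SPACE_BEFORE, join_tokens and segment_lines, and the <document> root element in the sample, are made up for illustration and are not part of the original script.

```python
import xml.etree.ElementTree as ET

# Tokens that attach to the previous word (an assumed set; extend as needed)
NO_SPACE_BEFORE = {",", ".", "?", "!", ";", ":"}


def join_tokens(tokens):
    """Join tokens, skipping the space before punctuation and after a
    token that ends with a hyphen (e.g. "Yippee-")."""
    line = ""
    for token in tokens:
        if line and token not in NO_SPACE_BEFORE and not line.endswith("-"):
            line += " "
        line += token
    return line


def segment_lines(root):
    """Yield one line of text per <s> element."""
    for segment in root.iter("s"):
        yield join_tokens(word.text or "" for word in segment.iter("w"))


# Hypothetical sample; the real root element name is not shown in the thread
SAMPLE = (
    '<document><s id="632">'
    "<w>Yippee-</w><w>ki-</w><w>yay</w><w>,</w><w>motherfucker</w><w>.</w>"
    "</s></document>"
)

for line in segment_lines(ET.fromstring(SAMPLE)):
    print(line)  # prints: Yippee-ki-yay, motherfucker.
```

Abbreviations such as "Mr." are separate tokens that never equal a bare ".", so they still get a space in front of them.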

If you want to save the output, either redirect the output or add some file IO in there.

Best Regards,

Sebkinne

You got that from Stack Overflow, didn't you? :P


Thank you so much, this worked great. Just one question: do you think I could modify this code so that it performs the function on every single file in my directory full of XML files? I have a huge number, so it is impractical to modify the code each time. Also: if I were to use this command

python xmlparse.py >> myfile.txt

would that work to put all of the output in one single plaintext file?

Yeah, that is trivial to implement. And yes, that redirection should work.
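For the "every file in my directory" part, here is a minimal sketch, assuming Python 3 and that all the files share the <s>/<w> layout shown earlier in the thread. The script name xmlparse_all.py and the helpers segment_lines and convert_tree are hypothetical. It walks the folder tree itself and writes one combined text file, so no shell redirection is needed.

```python
import os
import sys
import xml.etree.ElementTree as ET


def segment_lines(xml_path):
    """Yield one space-joined line of words per <s> element in one file."""
    root = ET.parse(xml_path).getroot()
    for segment in root.iter("s"):
        yield " ".join(word.text or "" for word in segment.iter("w"))


def convert_tree(top_dir, out_path):
    """Write the lines of every .xml file under top_dir to out_path.

    Returns the number of XML files that were converted.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for folder, subdirs, files in os.walk(top_dir):
            subdirs.sort()  # visit subfolders in a stable order
            for name in sorted(files):
                if name.lower().endswith(".xml"):
                    for line in segment_lines(os.path.join(folder, name)):
                        out.write(line + "\n")
                    count += 1
    return count


if __name__ == "__main__":
    if len(sys.argv) == 3:
        done = convert_tree(sys.argv[1], sys.argv[2])
        print("converted %d file(s)" % done)
    else:
        print("usage: python xmlparse_all.py XML_DIR OUTPUT.txt")
```

A file that is not well-formed XML will raise xml.etree.ElementTree.ParseError and stop the run; wrapping the inner loop in try/except would let it skip bad files instead.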

Best Regards,

Sebkinne


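Save this as that_script.sh and make it executable:

#!/bin/sh
# Feed one XML file to the parser on standard input; append the text to <file>.txt
python xmlparse.py < "$1" >> "$1.txt"

Then run it on every XML file below the current folder:

find . -name \*.xml -exec ./that_script.sh {} \;

You'll need to alter the Python script to read its XML from standard input. Alternatively, pass the name of the file to process as a parameter to the script, which still requires a small update to the script.

Edited by cooper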
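One way that small update could look, as a sketch only and assuming Python 3. The helper lines_from_xml and the "-" argument for standard input are my own choices for illustration, not something the original xmlparse.py has.

```python
import sys
import xml.etree.ElementTree as ET


def lines_from_xml(source):
    """Return one line per <s> element; source is a path or a file object."""
    root = ET.parse(source).getroot()
    return [" ".join(word.text or "" for word in segment.iter("w"))
            for segment in root.iter("s")]


def main(argv):
    if len(argv) != 2:
        sys.stderr.write("usage: python xmlparse.py FILE  (use - for stdin)\n")
        return
    # "-" is a common convention for "read from standard input";
    # sys.stdin.buffer hands the raw bytes to the XML parser.
    source = sys.stdin.buffer if argv[1] == "-" else argv[1]
    for line in lines_from_xml(source):
        print(line)


if __name__ == "__main__":
    main(sys.argv)
```

With this version the line inside the shell script would pass the file name directly (python xmlparse.py "$1" >> "$1.txt"), or keep feeding standard input by adding the dash (python xmlparse.py - < "$1" >> "$1.txt").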

  • 3 weeks later...

I'm surprised that no one mentioned regular expressions (regex); parsing text is what they were created for. The pattern below will match your text in 18 characters.

(?<=">).*?(?=</w>)

You may need to escape special characters, depending on the language you are coding in.

Explanation of the pattern:

() are groupings.

Inside the first grouping we have ?<= which means: match this prefix but exclude it from the returned result.

We then have "> which is the end of the opening XML tag (the closing quote of the id attribute, then >), right before the text to be returned.

Then comes a wildcard match .*? which means: match any character, any number of repetitions, as few as possible.

Then another group () and inside that group ?= which means: match this suffix but don't return it.

The suffix I'm matching on is </w>, the end tag in your XML.

So I'm saying: match anything between "> and </w>, and don't return those two delimiters or anything on either side of them, just what's in between.
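To try the pattern out, here is a small sketch in Python (my choice of language; the post does not name one). In a Python raw string the pattern needs no extra escaping. The names WORD and SAMPLE are only for illustration.

```python
import re

# The pattern from the post: text preceded by "> and followed by </w>
WORD = re.compile(r'(?<=">).*?(?=</w>)')

SAMPLE = """  <s id="632">
    <time id="T584S" value="00:58:14,960" />
    <w id="632.1">Yippee-</w>
    <w id="632.2">ki-</w>
    <w id="632.3">yay</w>
    <w id="632.4">,</w>
    <w id="632.5">motherfucker</w>
    <w id="632.6">.</w>
    <time id="T584E" value="00:58:17,000" />
  </s>
"""

words = WORD.findall(SAMPLE)
print(words)
# ['Yippee-', 'ki-', 'yay', ',', 'motherfucker', '.']
```

Note that this returns one flat list of words, so the sentence boundaries are lost unless the text is split on </s> first, and it relies on every <w> element sitting on its own line, because . does not match a newline by default.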

Hope it helps.

Edited by zoro25

A regexp is great for finding significant parts in a structured document, but it will always lose to a format-specific parser, which is what that ElementTree thing is.
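A small illustration of where the two approaches part ways, as a sketch in Python 3. The one-line sample with an &amp; entity is hypothetical; it only reuses the tag names from this thread and the pattern posted above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: same tags as in the thread, but on a single line
# and with an XML entity inside the word
SAMPLE = '<s id="1"><w id="1.1">Q&amp;A</w></s>'

regex_words = re.findall(r'(?<=">).*?(?=</w>)', SAMPLE)
parser_words = [word.text for word in ET.fromstring(SAMPLE).iter("w")]

print(regex_words)   # ['<w id="1.1">Q&amp;A']
print(parser_words)  # ['Q&A']
```

The regexp starts matching right after the "> of the <s> tag, so it drags the <w> markup along, and it leaves the entity undecoded. The parser is unaffected by the layout and decodes &amp; to &.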
