Saturday, May 5, 2012

Getting from an EPUB to an XML file

I just noticed the following on my Twitter stream:
https://twitter.com/#!/matthewdiener/status/198163400587612160
MatthewDiener
Has anyone used oXygen XML Editor (or any other program) to transform an ePUB to XML? Any XSLT out there? #eprdctn
5/4/12 12:33 AM
As I know of no existing solution, I wrote the following XSLT 2.0 stylesheet, which converts an EPUB to a single XML file. It first reads container.xml from the EPUB to identify the content file, and then uses the content file to identify the toc file. The toc file is then processed, and the content it refers to is added to the result, yielding one single XML file. An xml:base attribute is added on each piece of included content so that any relative references resolve correctly. Here is the stylesheet:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:c="urn:oasis:names:tc:opendocument:xmlns:container"
    xmlns:opf="http://www.idpf.org/2007/opf"
    xmlns:ncx="http://www.daisy.org/z3986/2005/ncx/"
    exclude-result-prefixes="xs c opf ncx"
    version="2.0">
        
    <xsl:param name="epub" select="'file:/Users/george/Documents/workspace/eXml/samples/epub/flowers.epub'"/>
    
    <!-- container.xml is read from META-INF/container.xml inside the archive -->
    <xsl:variable name="containerFile" select="concat('zip:', $epub, '!/META-INF/container.xml')"/>
    <xsl:variable name="containerData" select="document($containerFile)"/>
    <!-- full-path of the content (package) file, relative to the archive root -->
    <xsl:variable name="content" select="$containerData/c:container/c:rootfiles/c:rootfile[@media-type='application/oebps-package+xml']/@full-path"/>
    <xsl:variable name="contentFile" select="concat('zip:', $epub, '!/', $content)"/>
    <xsl:variable name="contentData" select="document($contentFile)"/>
    <!-- the toc href is passed as a node, so it is resolved relative to the content file -->
    <xsl:variable name="tocData" select="document($contentData/opf:package/opf:manifest/opf:item[@media-type='application/x-dtbncx+xml']/@href)"/>

    <xsl:template name="main">
        <mergedEpub>
            <xsl:apply-templates select="$tocData" mode="copy"/>
        </mergedEpub>
    </xsl:template>
    
    <xsl:template match="node() | @*" mode="copy">
        <xsl:copy>
            <xsl:apply-templates select="node() | @*" mode="copy"/>
        </xsl:copy>
    </xsl:template>
    
    <xsl:template match="ncx:content[@src]" mode="copy">
        <xsl:copy>
            <xsl:copy-of select="@*"/>
            <xsl:choose>
                <!-- src has an anchor: copy only the element whose id matches it -->
                <xsl:when test="contains(@src, '#')">
                    <xsl:variable name="file" select="substring-before(@src, '#')"/>
                    <xsl:variable name="id" select="substring-after(@src, '#')"/>
                    <xsl:attribute name="xml:base" select="$file"/>
                    <xsl:apply-templates select="document($file, .)//*[@id=$id]" mode="copy"/>
                </xsl:when>
                <!-- src has no anchor: copy the whole referenced document -->
                <xsl:otherwise>
                    <xsl:attribute name="xml:base" select="@src"/>
                    <xsl:apply-templates select="document(@src)" mode="copy"/>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

Note that this is an XSLT 2.0 stylesheet and that its entry point is a named template called main. In oXygen you need to configure a transformation scenario that uses Saxon 9 PE as the XSLT processor, and then use the "Advanced options" button next to the processor to set the initial template to "main".
 
The XML URL should be left empty.
The stylesheet has one parameter called "epub". If you set it to ${afu}, which stands for archive file URL, oXygen will expand it at runtime to the URL of the current archive/EPUB open in the Archive Browser view.


14 comments:

  1. Looks like good work George, however, I cannot get Oxygen to configure a transformation scenario as per yours.

    I'm using Oxygen 13.2 and I have no option to select "Saxon-PE 9.4.0.3". I only have "Saxon-PE 9.3.0.5" available.

    Also, when configuring the transformation scenario it reports "You need a valid XML URL as an input for the XSLT". So I'm unsure how to follow the instruction "The XML URL should be left empty.".

    The only output I'm achieving is as follows:



    Would really appreciate it if you could provide a more detailed explanation on how to configure Oxygen to use this XSLT.

    Best Regards, Rob.

    1. Yes, that is correct - I always use the latest development snapshot and of course we updated Saxon to 9.4. The stylesheet should also work fine with Saxon 9.3.
      To get this working you need to set the initial template to "main". In the edit scenario dialog there is a button immediately to the right of the combo box where you select Saxon-PE 9.3.0.5. Click that button to open the dialog that configures the Saxon options. In the "Template (-it)" field enter "main", as you can see in the dialog shown before my remark that the XML URL should be left empty. That instructs Saxon to start the transformation with the named template of that name.
      Let me know if you still have issues.
      Regards,
      George

    2. Thanks for your reply George.
      I have got a bit further this time round in achieving xml output.
      Now I get an error message after applying the xsl as follows:

      "Description: I/O error reported by XML parser processing
      zip:file:Users/myName/Desktop/new.epub!/toc.ncx:
      /Users/myName/Desktop/new.epub/toc.ncx (no such file entry)-
      /Users/myName/Desktop/new.epub/toc.ncx (no such file entry)
      Start location: 19:0

      I think the issue I am encountering is as a result of the EPUB/zip file structure that I am using. The epub I am using is directly output from InDesign CS5.x. It has the following standard file/folder structure inside:

      new.epub
      META-INF (folder)
      >> container.xml
      OEBPS (folder)
      >> content.opf
      >> template.css
      >> toc.ncx
      >> Untitled-1.xhtml
      mimetype

      As you can see above the "toc.ncx" (inside the epub from Indesign) is nested inside the 'OEBPS' folder.
      When using your .xsl does it assume the "toc.ncx" does NOT reside inside a folder named "OEBPS", as per the sample epub 'flowers.epub' which ships with Oxygen?

      If I move the "toc.ncx" out of the "OEBPS" folder (up one level) and then apply your xsl I achieve output, however, it's only the "toc.ncx" that has been converted to xml.

      What would I need to change inside the .xsl to tell it that the toc file is nested inside a folder named "OEBPS"? I'm guessing lines 18-19 in the xsl.

      Thanks again, Rob.

      PS: FYI - The script achieves successful xml output when transforming 'flowers.epub', however, that has a different folder structure to the standard epub exported out of InDesign.

    3. Most of the stylesheet code deals exactly with identifying the toc file. However, I assumed the reference from content.opf works like the reference from container.xml, that is, relative to the archive root, and that is wrong. I will update the stylesheet to treat the reference as relative to content.opf itself, and that should work. Thanks for spotting this issue.
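
      As a rough illustration of the difference (a sketch only, not necessarily the exact change made to the stylesheet): in XSLT 2.0, document() resolves a relative reference against the base URI of the node it is given, so passing the href attribute itself resolves it relative to content.opf rather than the archive root. The variable names tocHref and tocDataExplicit are made up for this example; $contentData and the opf prefix are those of the stylesheet in the post.

```xml
<!-- Fragment meant to sit at the top level of the stylesheet in the post. -->
<!-- tocHref is a hypothetical helper variable holding the href attribute node. -->
<xsl:variable name="tocHref"
    select="$contentData/opf:package/opf:manifest/opf:item[@media-type='application/x-dtbncx+xml']/@href"/>

<!-- Node argument: the href resolves against the base URI of content.opf. -->
<xsl:variable name="tocData" select="document($tocHref)"/>

<!-- Same resolution spelled out: a string plus an explicit base node. -->
<xsl:variable name="tocDataExplicit" select="document(string($tocHref), $contentData)"/>
```

      Both variables should load the same toc file; the second form only makes the base of the resolution visible.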

    4. Please try with the updated stylesheet and let me know if you still have issues.

      Regards,
      George

    5. Hi George,

      It works perfectly using the updated stylesheet.

      Many thanks for providing an update - much appreciated :)

      Regards, Rob.

    6. George,

      Sorry - I think I spoke too soon!

      I just ran the template on several other ePUBs and the resultant output seems to have limited content inside it.

      In the resultant file I get the NavMap information from the original toc.ncx, however, ONLY the first paragraph from each .xhtml file is output.

      Sample of the source mark-up inside the toc.ncx as follows:-

      <navPoint id="navpoint1" playOrder="1">
      <navLabel>
      <text>INTRODUCTION</text>
      </navLabel>
      <content src="Introduction.html#toc_marker-1"/>
      </navPoint>


      Sample of the resultant output is as follows:

      <navPoint id="navpoint1" playOrder="1">
      <navLabel>
      <text>INTRODUCTION</text>
      </navLabel>
      <content src="Introduction.html#toc_marker-1" xml:base="Introduction.html">
      <p xmlns="http://www.w3.org/1999/xhtml" id="toc_marker-1" class="Chapter-Head-19"> <span class="Small-Caps-19 char-style-override-2">INTRODUCTION</span>
      </p>
      </content>
      </navPoint>

      Regards, Rob.

    7. That is how I designed the stylesheet: if there is an anchor, we get only the element whose id attribute matches that anchor. That is done in the xsl:choose instruction, when the src contains a # sign:
      <xsl:when test="contains(@src, '#')">
      <xsl:variable name="file" select="substring-before(@src, '#')"/>
      <xsl:variable name="id" select="substring-after(@src, '#')"/>
      <xsl:attribute name="xml:base" select="$file"/>
      <xsl:apply-templates select="document($file, .)//*[@id=$id]" mode="copy"/>
      </xsl:when>


      If you want to get the whole document no matter what, just replace the xsl:choose with the content of the xsl:otherwise:

      <xsl:attribute name="xml:base" select="@src"/>
      <xsl:apply-templates select="document(@src)" mode="copy"/>

      Regards,
      George

  2. Thanks again George,

    Perhaps this is my misunderstanding then. I thought the template would transform all the contents to xml as the title of this post is named 'Getting from an EPUB to an XML file'.

    In attempting to get the whole document I have, as per your suggestion, replaced the xsl:choose with the content of the xsl:otherwise, so that part of the .xsl now reads:

    <xsl:template match="ncx:content[@src]" mode="copy">
    <xsl:copy>
    <xsl:copy-of select="@*"/>
    <xsl:attribute name="xml:base" select="@src"/>
    <xsl:apply-templates select="document(@src)" mode="copy"/>
    </xsl:copy>
    </xsl:template>

    However, the output remains the same as previously noted 'ONLY the first paragraph from each .xhtml file is output'.

    In summary - the only output I'm achieving is an exact copy of the code found in the original .ncx with the addition of the one paragraph of text that each of the 'navPoint' > 'content/@src' tags point to.

    Regards, Rob.

    1. You are right, that is because of the #anchor after the filename. Revert to the original and just replace

      <xsl:apply-templates select="document($file, .)//*[@id=$id]" mode="copy"/>

      with

      <xsl:apply-templates select="document($file, .)" mode="copy"/>

      Regards,
      George
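
      Put together, the resulting template might look like the sketch below. It is a condensed variant assembled from the snippets in this thread rather than the author's exact stylesheet: it folds the xsl:choose into a single if expression. It is a drop-in replacement for the ncx:content template of the stylesheet in the post and relies on the ncx prefix and the "copy" mode declared there.

```xml
<xsl:template match="ncx:content[@src]" mode="copy">
    <xsl:copy>
        <xsl:copy-of select="@*"/>
        <!-- Drop any #anchor so that the whole referenced file is loaded. -->
        <xsl:variable name="file"
            select="if (contains(@src, '#')) then substring-before(@src, '#') else string(@src)"/>
        <xsl:attribute name="xml:base" select="$file"/>
        <!-- The second argument makes $file resolve relative to the toc file. -->
        <xsl:apply-templates select="document($file, .)" mode="copy"/>
    </xsl:copy>
</xsl:template>
```

      One consequence to keep in mind: if several navPoint entries point into the same file with different anchors, that file is copied into the result once per entry.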

  3. Dear George,
    Looks like exactly what I need! However, I cannot get it to work. What exactly should go into the "XML URL" field of the transformation scenario? I need to have some XML file opened in oXygen to even be able to create a scenario, so what would that file be?
    Any help would be greatly appreciated!
    Christof

  4. The XML URL should be empty, but make sure you set the initial template to "main" using the "Advanced options" button that you find next to the processor. The "epub" parameter points to the EPUB file. If you use the editor variable ${afu} as its value, it will be expanded to the "Archive File URL" at runtime, so if you have an EPUB opened in the oXygen archive browser the parameter will be expanded to the URL of that EPUB.

    Best Regards,
    George

  5. I see, but how do I get oXygen to let me leave the XML URL empty, even with the correct settings you describe? I get "You need a valid XML URL as input for the XSLT" if I leave it empty, and I can't save. - Christof

  6. You need Saxon PE or Saxon EE as the XSLT engine. Next to the engine there is a button that opens the advanced options dialog, where you need to set the initial template to "main". See the second screenshot in my post.
