Dissecting a docx

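A sudden flood of incoming files, every one of them with a docx extension... the kind of situation that makes your nose run. And when you are also told to peel off the photos pasted inside, you feel like going home, but that is not an option.
The other day I dropped a Word document onto a text editor by mistake and noticed that the garbled string started with PK. Ah right, I remembered, these things are just zip files wearing a Microsoft costume.
So I poked around a bit, and here is what I found.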
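
The PK observation can be turned into a quick check. This is only a minimal sketch: the function name looksLikeZip and the idea of passing the leading bytes as an array of numbers are my own assumptions, not part of the original workflow. It tests for the ZIP local file header signature, the bytes 0x50 0x4B 0x03 0x04 ("PK" followed by 3 and 4).

```javascript
// Returns true when the data starts with the ZIP local file header
// signature "PK\x03\x04" (0x50 0x4B 0x03 0x04).
// `bytes` is assumed to be an array-like of byte values (hypothetical input).
function looksLikeZip(bytes) {
    var sig = [0x50, 0x4B, 0x03, 0x04];
    if (!bytes || bytes.length < sig.length) return false;
    for (var i = 0; i < sig.length; i++) {
        if (bytes[i] !== sig[i]) return false;
    }
    return true;
}
```

An old binary .doc starts with different bytes (0xD0 0xCF 0x11 0xE0), so the same check is a cheap way to tell the two apart before trying to unzip anything.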

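First, unpack the docx file. Simply change the extension to .zip. It is a plain zip file, so that is all it takes.
Next, double-click it to extract. The directory structure after extraction looks like the following. If the document contains no placed images, the media folder does not exist.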

_rels
[Content_Types].xml
docProps
word
   _rels
   document.xml
   endnotes.xml
   fontTable.xml
   footnotes.xml
   media
   settings.xml
   styles.xml
   stylesWithEffects.xml
   theme
   webSettings.xml

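Let's look only at the important parts. Basically you just need the folder named word. There are style-related files and various others, but document.xml is the target. In addition, the placed images live in the directory called media. They have been renamed, though, so rename them as you pull them out, or you will lose track of what you extracted.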
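
To keep track of which renamed file is which, the relationship file word/_rels/document.xml.rels can help: it maps relationship ids to paths under media, and DrawingML pictures in document.xml refer to those ids through r:embed attributes. The following is a rough sketch in the same regular-expression spirit. The function names imageRelationships, attr and embeddedImageIds are hypothetical, and it assumes the XML text has already been read into a string and that attribute values are double-quoted.

```javascript
// Reads one attribute value from a single tag string (double-quoted values only).
function attr(tag, name) {
    var m = tag.match(new RegExp("\\s" + name + "=\"([^\"]*)\""));
    return m ? m[1] : null;
}

// Builds a map from relationship Id to target path out of the text of
// word/_rels/document.xml.rels. Only image relationships are kept.
function imageRelationships(relsXml) {
    var tags = relsXml.match(/<Relationship\b[^>]*>/g) || [];
    var map = {};
    for (var i = 0; i < tags.length; i++) {
        var type = attr(tags[i], "Type");
        if (!type || !/\/image$/.test(type)) continue;
        map[attr(tags[i], "Id")] = attr(tags[i], "Target");
    }
    return map;
}

// Lists the r:embed ids in document order, i.e. the order the pictures
// appear in document.xml.
function embeddedImageIds(documentXml) {
    var ids = [], re = /r:embed="([^"]+)"/g, m;
    while ((m = re.exec(documentXml)) !== null) ids.push(m[1]);
    return ids;
}
```

Looking up each id from embeddedImageIds in the map gives the media files in the order they appear in the text, which is a reasonable basis for renaming them. Images placed through VML are referenced differently, so this sketch does not cover them.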
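As for the text, you could walk the XML children, but it is often quicker to search directly for the tags related to text boxes and paragraphs. Representative tags are listed below, each as an Open XML SDK class name followed by its element.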

DocumentFormat.OpenXml.Wordprocessing.RunProperties <w:rPr>
DocumentFormat.OpenXml.Wordprocessing.Break <w:br>
DocumentFormat.OpenXml.Wordprocessing.Text <w:t>
DocumentFormat.OpenXml.Wordprocessing.DeletedText <w:delText>
DocumentFormat.OpenXml.Wordprocessing.FieldCode <w:instrText>
DocumentFormat.OpenXml.Wordprocessing.DeletedFieldCode <w:delInstrText>
DocumentFormat.OpenXml.Wordprocessing.NoBreakHyphen <w:noBreakHyphen>
DocumentFormat.OpenXml.Wordprocessing.SoftHyphen <w:softHyphen>
DocumentFormat.OpenXml.Wordprocessing.DayShort <w:dayShort>
DocumentFormat.OpenXml.Wordprocessing.MonthShort <w:monthShort>
DocumentFormat.OpenXml.Wordprocessing.YearShort <w:yearShort>
DocumentFormat.OpenXml.Wordprocessing.DayLong <w:dayLong>
DocumentFormat.OpenXml.Wordprocessing.MonthLong <w:monthLong>
DocumentFormat.OpenXml.Wordprocessing.YearLong <w:yearLong>
DocumentFormat.OpenXml.Wordprocessing.AnnotationReferenceMark <w:annotationRef>
DocumentFormat.OpenXml.Wordprocessing.FootnoteReferenceMark <w:footnoteRef>
DocumentFormat.OpenXml.Wordprocessing.EndnoteReferenceMark <w:endnoteRef>
DocumentFormat.OpenXml.Wordprocessing.SeparatorMark <w:separator>
DocumentFormat.OpenXml.Wordprocessing.ContinuationSeparatorMark <w:continuationSeparator>
DocumentFormat.OpenXml.Wordprocessing.SymbolChar <w:sym>
DocumentFormat.OpenXml.Wordprocessing.PageNumber <w:pgNum>
DocumentFormat.OpenXml.Wordprocessing.CarriageReturn <w:cr>
DocumentFormat.OpenXml.Wordprocessing.TabChar <w:tab>
DocumentFormat.OpenXml.Wordprocessing.EmbeddedObject <w:object>
DocumentFormat.OpenXml.Wordprocessing.Picture <w:pict>
DocumentFormat.OpenXml.Wordprocessing.FieldChar <w:fldChar>
DocumentFormat.OpenXml.Wordprocessing.Ruby <w:ruby>
DocumentFormat.OpenXml.Wordprocessing.FootnoteReference <w:footnoteReference>
DocumentFormat.OpenXml.Wordprocessing.EndnoteReference <w:endnoteReference>
DocumentFormat.OpenXml.Wordprocessing.CommentReference <w:commentReference>
DocumentFormat.OpenXml.Wordprocessing.Drawing <w:drawing>
DocumentFormat.OpenXml.Wordprocessing.PositionalTab <w:ptab>
DocumentFormat.OpenXml.Wordprocessing.LastRenderedPageBreak <w:lastRenderedPageBreak>
DocumentFormat.OpenXml.Math.Text <m:t>
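
As an illustration of how a few of these tags behave in practice, here is a small sketch that flattens one paragraph into plain text. The names paragraphText and decodeEntities are made up for this example. It assumes w:t carries the text, an attribute-less w:tab is a tab character, and w:br or w:cr is a line break. It only decodes the five predefined XML entities, not numeric character references.

```javascript
// Decodes the five predefined XML entities (&amp; last, to avoid double decoding).
function decodeEntities(s) {
    return s.replace(/&lt;/g, "<").replace(/&gt;/g, ">")
            .replace(/&quot;/g, "\"").replace(/&apos;/g, "'")
            .replace(/&amp;/g, "&");
}

// Flattens the XML of one <w:p> paragraph into plain text.
function paragraphText(paraXml) {
    // <w:t> runs, attribute-less <w:tab/> (tab stop definitions inside
    // <w:tabs> carry attributes and are skipped), and <w:br/> or <w:cr/>.
    var token = /<w:t(?:\s[^>]*)?>([^<]*)<\/w:t>|<w:tab\s*\/>|<w:(?:br|cr)(?:\s[^>]*)?\/>/g;
    var out = "", m;
    while ((m = token.exec(paraXml)) !== null) {
        if (m[0].indexOf("<w:tab") === 0) {
            out += "\t";
        } else if (m[0].indexOf("<w:t") === 0) {
            out += decodeEntities(m[1] || "");
        } else {
            out += "\n";
        }
    }
    return out;
}
```

Deleted text (w:delText) and field codes (w:instrText) are deliberately ignored here, which is usually what you want when extracting the visible text.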

Next, some sample code.

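var txbx = /<v:textbox\s.*?>.+?<\/v:textbox>/g;   // each VML text box
var nm = /[A-Z]+\s[A-Z][a-z]+/g;                  // not used below
var para = /<w:p.*?>.+?<\/w:p>/g;                 // each paragraph
var txt = /<w:t xml:space="preserve">(.+?)<\/w:t>|<w:t>(.+?)<\/w:t>/;   // one text run

var f = File.openDialog("Select target file.");   // pick the extracted document.xml
f.encoding = 'UTF8';
f.open("r");
var tx = f.read();
f.close();

var bxContents = tx.match(txbx) || [];            // no text boxes: empty result
var st = "";
for (var i = 0; i < bxContents.length; i++) {
    var paras = bxContents[i].match(para);
    if (paras == null) continue;
    for (var j = 0; j < paras.length; j++) {
        var str = paras[j].match(txt);
        while (str != null) {
            st += str[1] || str[2];               // whichever alternative matched
            paras[j] = paras[j].replace(txt, ""); // drop that run, then look for the next
            str = paras[j].match(txt);
        }
        st += "\n";                               // one line per paragraph
    }
}
alert(st);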

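This pulls the text out of a document whose body is spread across multiple text boxes.
When I deal with docx files, as you can see, I usually analyse the structure first and then extract just the parts I need with regular expressions. With a complicated document I don't even feel like walking the XML, lol.
For reference, a document with a simple layout has a structure like the one below.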

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
   <w:document
        xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas"
        xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
        xmlns:o="urn:schemas-microsoft-com:office:office"
        xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
        xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
        xmlns:v="urn:schemas-microsoft-com:vml"
        xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"
        xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
        xmlns:w10="urn:schemas-microsoft-com:office:word"
        xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
        xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
        xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
        xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk"
        xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"
        xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
        mc:Ignorable="w14 wp14">
        <w:body>
            <w:p w:rsidR="00736A6C" w:rsidRDefault="00CD0FDD">
                <w:r>
                    <w:rPr>
                        <w:rFonts w:hint="eastAsia"/>
                    </w:rPr>
                    <w:t>test string</w:t>
                </w:r>
                <w:bookmarkStart w:id="0" w:name="_GoBack"/>
                <w:bookmarkEnd w:id="0"/>
            </w:p>
            <w:sectPr w:rsidR="00736A6C" w:rsidSect="00736A6C">
                <w:pgSz w:w="11906" w:h="16838"/>
                <w:pgMar
                    w:top="1985"
                    w:right="1701"
                    w:bottom="1701"
                    w:left="1701"
                    w:header="851"
                    w:footer="992"
                    w:gutter="0"/>
                <w:cols w:space="425"/>
                <w:docGrid w:type="lines" w:linePitch="360"/>
            </w:sectPr>
        </w:body>
   </w:document>
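
The numbers in w:pgSz and w:pgMar are lengths in twentieths of a point (1440 per inch). As a small worked example, the sketch below reads the page size and converts it to millimetres. The function names twipsToMm and pageSizeMm are hypothetical, and the code assumes document.xml has been read into a string with ordinary double-quoted attributes.

```javascript
// Converts twentieths of a point (1440 per inch) to millimetres.
function twipsToMm(twips) {
    return twips / 1440 * 25.4;
}

// Reads w:w and w:h from the first <w:pgSz> tag; returns null when absent.
function pageSizeMm(documentXml) {
    var tag = documentXml.match(/<w:pgSz\b[^>]*>/);
    if (!tag) return null;
    var w = tag[0].match(/\sw:w="(\d+)"/);
    var h = tag[0].match(/\sw:h="(\d+)"/);
    if (!w || !h) return null;
    return {
        width: twipsToMm(Number(w[1])),
        height: twipsToMm(Number(h[1]))
    };
}
```

For the sample above, 11906 by 16838 comes out at roughly 210 by 297 mm, in other words A4 portrait.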

If you want exact extraction, read the Open XML format reference. It is a hefty read, though.

Category: etc

Tags: ESTK

ten_a

Graphic Designer, Scripter and Coder. Adobe Community Professional.
