Fast Data Extraction From Structured Documents

StepPy, Fast Text Extraction By Phrase Sequence Search

The intent of this project is to demonstrate a fast and simple way to extract data from structured text documents like XML and HTML.

For example, EDGAR insider trading reports are stored as XML documents, and a search for all insider trades over even a short period, such as a week, returns thousands of reports.

A parsing library like Beautiful Soup can be used to extract the data, but it is too slow for thousands of documents. A faster method is to search for a sequence of signature phrases that precede the required data. The end of the data is then detected by another phrase.
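The phrase sequence search described above can be sketched with plain string searches. This is a minimal illustration, not necessarily how StepPy implements it; the function name `extract_after` and its parameters are hypothetical.

```python
def extract_after(text, start_phrases, end_phrase, pos=0):
    """Find each start phrase in order, then return the text up to end_phrase.

    Returns (value, next_pos), or (None, pos) if any phrase is missing.
    Hypothetical helper for illustration only.
    """
    cursor = pos
    for phrase in start_phrases:
        found = text.find(phrase, cursor)
        if found == -1:
            return None, pos
        # Step past the phrase so the next search starts after it.
        cursor = found + len(phrase)
    end = text.find(end_phrase, cursor)
    if end == -1:
        return None, pos
    return text[cursor:end], end + len(end_phrase)


# Made-up input: the first <b> is skipped because "<c>" must be found first.
sample = "<a><b>ignored</b></a><c><b>wanted</b></c>"
value, next_pos = extract_after(sample, ["<c>", "<b>"], "</b>")
print(value)  # wanted
```

Returning the position after the end phrase lets a caller resume the search from where the last value ended instead of rescanning the document.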

A simple example is extracting the issuer name from a report. Here a single phrase, "<issuerName>", identifies the start of the data, and the phrase "</issuerName>" identifies the end.

Partial EDGAR Report

<issuerName>HYSTER-YALE MATERIALS HANDLING, INC.</issuerName>

</issuer>
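For the single-phrase case, two calls to `str.find` are enough. The sketch below runs against the fragment shown above; the variable names are illustrative.

```python
# The report fragment from the example above.
report = (
    "<issuerName>HYSTER-YALE MATERIALS HANDLING, INC.</issuerName>\n"
    "</issuer>\n"
)

start_phrase = "<issuerName>"
end_phrase = "</issuerName>"

issuer_name = None
start = report.find(start_phrase)
if start != -1:
    start += len(start_phrase)            # first character of the data
    end = report.find(end_phrase, start)  # search only after the start phrase
    if end != -1:
        issuer_name = report[start:end]

print(issuer_name)  # HYSTER-YALE MATERIALS HANDLING, INC.
```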

For other parts of the file, more leading phrases are required to unambiguously identify the start of required data. For example, a report can contain both non-derivative and derivative transactions.

Typical Form 4 Report

The outer tags differ ("<nonDerivativeTable>" and "<nonDerivativeTransaction>" versus "<derivativeTable>" and "<derivativeTransaction>"), while the inner tags are the same.

The following is an example of a non-derivative transaction table entry in an EDGAR Form 4 report.

Non-Derivative Transaction Table

<value>Common Stock</value>

</securityTitle>

</transactionDate>

</transactionCoding>

</transactionShares>

</transactionPricePerShare>

</transactionAcquiredDisposedCode>

</transactionAmounts>

</sharesOwnedFollowingTransaction>

</postTransactionAmounts>

</directOrIndirectOwnership>

</ownershipNature>

</nonDerivativeTransaction>

</nonDerivativeTable>

To extract the share count, three phrases are required to identify the start of the value: "<nonDerivativeTransaction>", "<transactionShares>", and "<value>". The terminal phrase is "</value>".
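The three-phrase search might look like the following sketch. The sample report is made up for illustration: it follows the tag layout described above, but the share count and price are invented values, not data from a real filing.

```python
# Made-up sample using the Form 4 tag layout; values are illustrative only.
report = (
    "<nonDerivativeTable><nonDerivativeTransaction>"
    "<securityTitle><value>Common Stock</value></securityTitle>"
    "<transactionAmounts>"
    "<transactionShares><value>1500</value></transactionShares>"
    "<transactionPricePerShare><value>52.25</value></transactionPricePerShare>"
    "</transactionAmounts>"
    "</nonDerivativeTransaction></nonDerivativeTable>"
)

# Step through the leading phrases in order. Each search starts where the
# previous phrase ended, so the <value> inside <securityTitle> is skipped.
cursor = 0
for phrase in ("<nonDerivativeTransaction>", "<transactionShares>", "<value>"):
    found = report.find(phrase, cursor)
    if found == -1:
        raise ValueError(f"phrase not found: {phrase}")
    cursor = found + len(phrase)

end = report.find("</value>", cursor)
if end == -1:
    raise ValueError("terminal phrase not found")

shares = report[cursor:end]
print(shares)  # 1500
```

Searching for "<value>" alone would return "Common Stock"; the two preceding phrases are what make the match unambiguous.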

If the share price is required, the search can continue using the new phrases "<transactionPricePerShare>" and "<value>" to identify the start of the price value and "</value>" for the terminal phrase.

The above search sequence can be looped until all transactions are read.
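A loop over all transactions can be sketched as below: read the share count, continue from that position to the price, and repeat until the opening phrase is no longer found. The helper `next_value`, the sample builder `make_transaction`, and all values are hypothetical and exist only to make the example runnable.

```python
def next_value(text, start_phrases, end_phrase, pos):
    """Return (value, next_pos), or (None, pos) if a phrase is missing."""
    cursor = pos
    for phrase in start_phrases:
        found = text.find(phrase, cursor)
        if found == -1:
            return None, pos
        cursor = found + len(phrase)
    end = text.find(end_phrase, cursor)
    if end == -1:
        return None, pos
    return text[cursor:end], end + len(end_phrase)


def make_transaction(shares, price):
    """Build one made-up transaction entry with the Form 4 tag layout."""
    return (
        "<nonDerivativeTransaction><transactionAmounts>"
        f"<transactionShares><value>{shares}</value></transactionShares>"
        f"<transactionPricePerShare><value>{price}</value>"
        "</transactionPricePerShare>"
        "</transactionAmounts></nonDerivativeTransaction>"
    )


report = (
    "<nonDerivativeTable>"
    + make_transaction("1500", "52.25")
    + make_transaction("300", "53.10")
    + "</nonDerivativeTable>"
)

SHARE_PHRASES = ["<nonDerivativeTransaction>", "<transactionShares>", "<value>"]
PRICE_PHRASES = ["<transactionPricePerShare>", "<value>"]

transactions = []
pos = 0
while True:
    shares, pos = next_value(report, SHARE_PHRASES, "</value>", pos)
    if shares is None:
        break  # no further transactions
    # Continue from the end of the share count to reach the price.
    price, pos = next_value(report, PRICE_PHRASES, "</value>", pos)
    if price is None:
        break
    transactions.append((shares, price))

print(transactions)  # [('1500', '52.25'), ('300', '53.10')]
```

One caveat of this sketch: if a transaction lacks one of the inner tags, the search runs on into the next transaction, so real reports may need a check against the position of "</nonDerivativeTransaction>".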

The same phrase sequence search can be used for web page scraping. Generally, only one or two phrases are needed to uniquely identify the start of the data, plus a single end phrase.
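As a rough illustration of the web scraping case, the sketch below applies two start phrases and one end phrase to an HTML string. The page content, the `id` and `class` names, and the function name `scrape` are all invented for the example; a real page would need phrases chosen by inspecting its source.

```python
def scrape(page, start_phrases, end_phrase):
    """Return the text between the last start phrase and end_phrase, or None."""
    cursor = 0
    for phrase in start_phrases:
        found = page.find(phrase, cursor)
        if found == -1:
            return None
        cursor = found + len(phrase)
    end = page.find(end_phrase, cursor)
    return page[cursor:end] if end != -1 else None


# Made-up page: two spans share the same class, so one phrase is not enough.
page = (
    "<html><body>"
    '<div id="header"><span class="value">ignore me</span></div>'
    '<div id="quote"><span class="value">123.45</span></div>'
    "</body></html>"
)

quote = scrape(page, ['<div id="quote">', '<span class="value">'], "</span>")
print(quote)  # 123.45
```

Because the match is on exact strings, a change in attribute order or whitespace on the page breaks the phrases, which is the trade-off for not parsing the document.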