XTractor is an algorithmic text extractor for web pages, written in Java. It builds on the "commonly used web design practices" approach (from readability.js, goose and snacktory) to create a set of heuristics for fast article text extraction. It adds several features like
Here are some sample links to text extracted from pages around the web:
Similar to the readability approach, XTractor uses heuristics based on common web design practices to extract text. It walks the DOM of the HTML page, removing nodes that are unlikely to be part of the main article text (fluff such as ads, banners and shout-outs). It then assigns a score to each remaining container element to find the best match for the article text. It also extracts one significant image to represent the page. The resulting output is sanitized, tokenized into paragraphs and returned to the caller.
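The prune-then-score idea described above can be illustrated with a minimal sketch. This is not XTractor's actual code: the class name `ExtractorSketch`, the `UNLIKELY` pattern, the list of pruned tags and the scoring weights are all assumptions made up for illustration. To stay dependency-free it parses well-formed XHTML with the JDK's XML parser; real-world HTML would need a lenient parser such as jsoup. Image extraction is left out.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ExtractorSketch {

    // Hypothetical class/id hints for "fluff"; not XTractor's real rule set.
    private static final Pattern UNLIKELY =
        Pattern.compile("(^|[-_ ])(ads?|banner|promo|sidebar|footer|share)([-_ ]|$)");

    /** Returns the cleaned paragraphs of the best-scoring container. */
    static List<String> extract(String xhtml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
        Element root = doc.getDocumentElement();
        prune(root);
        Element best = bestContainer(root);
        List<String> paragraphs = new ArrayList<>();
        if (best != null) {
            for (Element child : childElements(best)) {
                if (child.getTagName().equalsIgnoreCase("p")) {
                    String text = clean(child.getTextContent());
                    if (!text.isEmpty()) paragraphs.add(text);
                }
            }
        }
        return paragraphs;
    }

    // Step 1: drop subtrees that are unlikely to hold article text.
    static void prune(Element element) {
        for (Element child : childElements(element)) {
            String tag = child.getTagName().toLowerCase(Locale.ROOT);
            String hints = (child.getAttribute("class") + " " + child.getAttribute("id"))
                .toLowerCase(Locale.ROOT);
            boolean fluff = tag.equals("script") || tag.equals("style")
                || tag.equals("nav") || tag.equals("aside")
                || UNLIKELY.matcher(hints).find();
            if (fluff) {
                element.removeChild(child);
            } else {
                prune(child);
            }
        }
    }

    // Step 2: score every remaining element and keep the highest scorer.
    static Element bestContainer(Element root) {
        Element best = null;
        int bestScore = 0;
        Deque<Element> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Element current = stack.pop();
            int score = score(current);
            if (score > bestScore) {
                best = current;
                bestScore = score;
            }
            for (Element child : childElements(current)) stack.push(child);
        }
        return best;
    }

    // Assumed scoring: text length of direct <p> children, plus 10 per comma.
    static int score(Element container) {
        int score = 0;
        for (Element child : childElements(container)) {
            if (child.getTagName().equalsIgnoreCase("p")) {
                String text = clean(child.getTextContent());
                score += text.length()
                    + 10 * (int) text.chars().filter(ch -> ch == ',').count();
            }
        }
        return score;
    }

    // Snapshot of element children, so callers can remove nodes while iterating.
    static List<Element> childElements(Element parent) {
        List<Element> out = new ArrayList<>();
        NodeList nodes = parent.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            if (nodes.item(i) instanceof Element) out.add((Element) nodes.item(i));
        }
        return out;
    }

    // Collapse runs of whitespace and trim the ends.
    static String clean(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }
}
```

Calling `ExtractorSketch.extract(...)` on a page with an ad block, a content block and a sidebar returns only the content block's paragraphs, with whitespace collapsed. Scoring only direct `<p>` children keeps the sketch short; a real extractor would likely weigh nested text, link density and other signals as well.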
I wrote this to improve the article detection of my algorithmic summarizer demo.