
What is XTractor?

XTractor is an algorithmic text extractor for web pages, written in Java. It builds on the "commonly used web design practices" approach (from readability.js, goose and snacktory) to create a set of heuristics for fast article text extraction. It adds several features, such as:

  • paragraph preservation,
  • better image detection heuristics,
  • sibling-score-based enhancements to article detection.

If you find a link that misbehaves with XTractor, please let me know at mohaps AT gmail DOT com or tweet it to me at @mohaps.

XTractor is built using JSoup and Apache HTTP Async Client.

Live Samples

Here are some sample links of extracted text from around the web:

How does XTractor work?

Similar to the readability approach, XTractor uses heuristics based on common web design practices to extract text. It walks the DOM of the HTML page, removing nodes that are unlikely to be part of the main article text (fluff such as ads, banners and shout-outs). It then assigns a score to each remaining container element to find the best match for the article text. It also extracts one significant image to represent the page. The resulting output is sanitized, split into paragraphs and returned to the caller.
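
The prune-then-score idea described above can be sketched roughly as follows. This is a minimal illustration and not XTractor's actual code: the `Node` class is a hypothetical stand-in for a JSoup element so the example runs with the JDK alone, and the hint words and scoring weights are assumptions chosen for the demo. Image selection and sanitizing are left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class ExtractionSketch {

    /** Hypothetical, minimal stand-in for a DOM element. */
    static final class Node {
        final String tag;
        final String cssClass;
        final String text; // text owned directly by this node
        final List<Node> children = new ArrayList<>();

        Node(String tag, String cssClass, String text) {
            this.tag = tag;
            this.cssClass = cssClass;
            this.text = text;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Assumed hint words; a real extractor would use a longer, tuned list.
    private static final Set<String> UNLIKELY =
            Set.of("ad", "ads", "banner", "sidebar", "footer", "comment", "promo");

    /** True if any token of the class name (split on space, '_' or '-') is a hint word. */
    static boolean isUnlikely(Node n) {
        for (String token : n.cssClass.toLowerCase(Locale.ROOT).split("[\\s_-]+")) {
            if (UNLIKELY.contains(token)) {
                return true;
            }
        }
        return false;
    }

    /** Recursively removes subtrees that look like page fluff. */
    static void prune(Node n) {
        n.children.removeIf(ExtractionSketch::isUnlikely);
        for (Node child : n.children) {
            prune(child);
        }
    }

    /** Assumed weights: per direct <p> child of 25+ chars, 1 point, plus 1 per comma, plus up to 3 for length. */
    static int score(Node container) {
        int score = 0;
        for (Node child : container.children) {
            String t = child.text.trim();
            if (!child.tag.equals("p") || t.length() < 25) {
                continue;
            }
            score += 1;
            score += (int) t.chars().filter(ch -> ch == ',').count();
            score += Math.min(t.length() / 100, 3);
        }
        return score;
    }

    /** Returns the highest-scoring container, or null if nothing scores above zero. */
    static Node findBest(Node root) {
        Node best = null;
        int bestScore = 0;
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            int s = score(n);
            if (s > bestScore) {
                best = n;
                bestScore = s;
            }
            for (Node child : n.children) {
                stack.push(child);
            }
        }
        return best;
    }

    /** Prunes the tree, picks the best container and returns its paragraphs in document order. */
    static List<String> extract(Node root) {
        prune(root);
        Node best = findBest(root);
        List<String> paragraphs = new ArrayList<>();
        if (best == null) {
            return paragraphs;
        }
        for (Node child : best.children) {
            if (child.tag.equals("p") && !child.text.trim().isEmpty()) {
                paragraphs.add(child.text.trim());
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        Node page = new Node("body", "", "")
                .add(new Node("div", "top-banner", "")
                        .add(new Node("p", "", "Subscribe now, save big, and never miss another deal!")))
                .add(new Node("div", "story", "")
                        .add(new Node("p", "", "The extractor walks the DOM, drops unlikely nodes, and scores the rest."))
                        .add(new Node("p", "", "The best container wins, and its paragraphs are returned in order.")))
                .add(new Node("div", "footer", "")
                        .add(new Node("p", "", "Copyright")));
        extract(page).forEach(System.out::println);
    }
}
```

Class names are matched on whole tokens rather than substrings so that a hint such as "ad" does not accidentally discard a "header" or "heading" element. Returning the paragraphs as a list keeps paragraph boundaries intact for the caller.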

Who built this?

XTractor is a weekend project/quick hack demo created by Saurav Mohapatra. If you like this, you might also like my other hack/weekend project TL;DRzr - an algorithmic summary generator.

I wrote this to improve the article detection of my algorithmic summarizer demo.