Render JavaScript and HTML in (any) Java Program (Access rendered DOM Tree)?

Question

What are the best Java libraries to fully download any webpage, run its embedded JavaScript, and then access the rendered page (that is, the DOM tree) programmatically and get the DOM tree as "HTML source"?

(Something similar to what Firebug does: it renders the page and I get access to the fully rendered DOM tree, as the page looks in the browser. In contrast, if I click "show source" I only get the JavaScript source code. This is not what I want. I need access to the rendered page...)

(By "rendering" I mean only rendering the DOM tree, not visual rendering...)

This does not have to be a single library; it's OK to have several libraries that accomplish this together (one downloads, one renders...). But due to the dynamic nature of JavaScript, the JavaScript library will most likely also need some kind of downloader to fully render any asynchronous JS...

Background:
In the "good old days" HttpClient (the Apache library) was everything required to build your own very simple crawler. (A lot of crawlers like Nutch or Heritrix are still built around this core principle, mainly focusing on standard HTML parsing, so I can't learn from them.) My problem is that I need to crawl some websites that rely heavily on JavaScript and that I can't parse with HttpClient, as I definitely need to execute the JavaScript first...

When you say "render any asynchronous JS", do you mean that the library needs to be able to "scrape" any asynchronous calls that the page makes? This would be really difficult, because you'd basically be trying to capture the content of a dynamic page that updates after the initial request is complete, and sometimes data is not pulled in asynchronously until the user triggers an event.
– bsimic, Commented Jan 31, 2012 at 17:17

Sergei Grinev · Accepted Answer · 2012-02-06 16:23:36Z

4

You can use JavaFX 2 WebEngine. Download JavaFX SDK (you may already have it if you installed JDK7u2 or later) and try code below.

It prints the HTML after the JavaScript has been processed. You can uncomment the lines in the middle to see the rendering as well.

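import javafx.application.Application;
import javafx.beans.value.ChangeListener;
import javafx.beans.value.ObservableValue;
import javafx.scene.Scene;
import javafx.scene.web.WebEngine;
import javafx.scene.web.WebView;
import javafx.stage.Stage;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

public class WebLauncher extends Application {

    @Override
    public void start(Stage stage) {
        final WebView webView = new WebView();
        final WebEngine webEngine = webView.getEngine();
        webEngine.load("http://stackoverflow.com");
        // Uncomment to show the rendered page in a window:
        //stage.setScene(new Scene(webView));
        //stage.show();

        // workDone is reported as a percentage; 100 means the load is complete.
        webEngine.getLoadWorker().workDoneProperty().addListener(new ChangeListener<Number>() {
            @Override
            public void changed(ObservableValue<? extends Number> observable, Number oldValue, Number newValue) {
                if (newValue.intValue() == 100 /*percents*/) {
                    try {
                        org.w3c.dom.Document doc = webEngine.getDocument();
                        // Serialize the DOM with the JDK's standard transformer API
                        // (used here in place of the JDK-internal XMLSerializer).
                        Transformer transformer = TransformerFactory.newInstance().newTransformer();
                        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
                        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
                        transformer.transform(new DOMSource(doc), new StreamResult(System.out));
                    } catch (TransformerException ex) {
                        ex.printStackTrace();
                    }
                }
            }
        });

    }

    public static void main(String[] args) {
        launch(args);
    }

}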
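
If you need the DOM as a string rather than printed to the console, the `org.w3c.dom.Document` returned by `webEngine.getDocument()` can probably be serialized with the JDK's `javax.xml.transform` API alone. The following is only a sketch under that assumption: the class name `DomToString` and the hand-built sample document are made up for illustration, and the sample stands in for a document that would really come from the engine.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomToString {

    /** Serializes a W3C DOM document to an HTML string. */
    public static String serialize(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // "html" output avoids an XML declaration and XML-style empty tags.
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            transformer.setOutputProperty(OutputKeys.INDENT, "no");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize DOM", e);
        }
    }

    /** Hypothetical stand-in for a document obtained from a browser engine. */
    public static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            Element p = doc.createElement("p");
            p.setTextContent("rendered by script");
            body.appendChild(p);
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(sampleDocument()));
    }
}
```

Inside the listener above you would call `DomToString.serialize(webEngine.getDocument())` instead of building the sample document.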

edited Feb 6, 2012 at 16:23

answered Feb 6, 2012 at 15:40

Sergei Grinev


4 Comments

morja Over a year ago

Hi, thanks for that answer. But I did not manage to load some websites. E.g. I could not load maps.google.com/maps/…. It never reaches 100 percent, it hangs at 0. Also how could I make sure that everything loaded?

DevZer0 Over a year ago

You need to load the Scene, or else it will not work.

Andrew Scott Evans Over a year ago

you can use a JFrame to make the webengine work outside of launch(args) as well. So public class WebLauncher extends JFrame. You will need to avoid Selenium Drivers, they suck (leak, hang, throw exceptions when threaded or processed, and all sorts of nonsense). Also, you will need to rpc to a WebEngine server as there are leaks galore. If your sites don't normally demand up to date SSL and connection failures are ok with you, Scrapy uses the almost 10 year old Qt4.8 to do this with their new JS server. I recommend restarting the engine and passing cookies from time to time. JDK 9 should fix.

Andrew Scott Evans Over a year ago

For more on the above, I have an example program at github.com/asevans48/JScrape. Feel free to fork it and help out on any Goat project or jXXX project. I am putting the finishing touches on the first part (the Http Components part) now and then will incorporate my Java FX research with rabbitMQ RPC and JSON for http parameters per request.

ROMANIA_engineer · Accepted Answer · 2020-03-11 10:14:13Z

4

+100

This is a bit outside of the box, but if you are planning on running your code in a server where you have complete control over your environment, it might work...

Install Firefox (or XulRunner, if you want to keep things lightweight) on your machine.

Using the Firefox plugin system, write a small plugin which loads a given URL, waits a few seconds, then copies the page's DOM into a String.

From this plugin, use the Java LiveConnect API (see http://jdk6.java.net/plugin2/liveconnect/ and https://developer.mozilla.org/en/LiveConnect ) to push that string across to a public static function in some embedded Java code, which can either do the required processing itself or farm it out to some more complicated code.
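
The Java side of that last step could look roughly like the sketch below. Everything here is hypothetical: the class name `DomReceiver`, the method names, and the idea of handing pages over through a queue are illustrative choices, and the browser-side LiveConnect call that would invoke `receive` is not shown.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class DomReceiver {

    /** One captured page: the URL that was loaded and its serialized DOM. */
    public static final class Page {
        public final String url;
        public final String dom;

        Page(String url, String dom) {
            this.url = url;
            this.dom = dom;
        }
    }

    // Decouples the browser thread from whatever processes the pages.
    private static final BlockingQueue<Page> PAGES = new LinkedBlockingQueue<>();

    /** Public static entry point the plugin would call with the DOM string (hypothetical). */
    public static void receive(String url, String dom) {
        PAGES.add(new Page(url, dom));
    }

    /** Used by the processing code; returns null if nothing arrives within the timeout. */
    public static Page next(long timeoutMillis) {
        try {
            return PAGES.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        // Simulates the plugin pushing one page across.
        receive("http://example.com/", "<html><body>hello</body></html>");
        Page page = next(1000);
        System.out.println(page.url + " -> " + page.dom.length() + " characters");
    }
}
```

Keeping `receive` trivial means the browser is never blocked while the more complicated processing runs elsewhere.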

Benefits: You are using a browser that most application developers target, so the observed behavior should be comparable. You can also upgrade the browser along the normal upgrade path, so your library won't become out-of-date as HTML standards change.

Disadvantages: You will need to have permission to start a non-headless application on your server. You'll also have the complexity of inter-process communication to worry about.

I have used the plugin API to call Java before, and it's quite achievable. If you'd like some sample code, you should take a look at the XQuery plugin - it loads XQuery code from the DOM, passes it across to the Java Saxon library for processing, then pushes the result back into the browser. There are some details about it here:

https://developer.mozilla.org/en/XQuery

edited Mar 11, 2020 at 10:14

ROMANIA_engineer


answered Jan 31, 2012 at 23:32

Erica


2 Comments

Steffen Opel Over a year ago

+1 - A solution along these lines had already been started once, but unfortunately development stalled in 2008, apparently - enter Crowbar: Its purpose is to allow running javascript scrapers against a DOM to automate web sites scraping but avoiding all the syntax normalization issues. - Even a Java integration has been attempted with some success, but Ben's conclusion and updates are highlighting some drawbacks and issues as well.

morja Over a year ago

Thanks, yes, I had this idea too. But if at all possible I would like to have a "headless" solution, as the software has to run on servers with possibly no X system installed... But thanks for the details and explanations, I will have a deeper look into it if nothing else comes up.

Erica · Accepted Answer · 2012-02-01 00:38:24Z

2

The Selenium library is normally used for testing, but it does give you remote control of most standard browsers (IE, Firefox, etc.) as well as a headless, browser-free mode (using HtmlUnit). Because it is intended for UI verification by page scraping, it may well serve your purposes.

In my experience it can sometimes struggle with very slow JavaScript, but with careful use of "wait" commands you can get quite reliable results.
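
To illustrate what such a "wait" amounts to, here is a minimal polling helper in plain Java. It is not Selenium's API, only a sketch of the idea: the class `PollingWait`, the method `waitUntil`, and the sample condition are all made up for illustration. In real use the condition would check something like "the element I need is now present on the page".

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

public class PollingWait {

    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns true if the condition became true in time, false otherwise
     * (including when the waiting thread is interrupted).
     */
    public static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration interval) {
        long start = System.nanoTime();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - start >= timeout.toNanos()) {
                return false;
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long begin = System.currentTimeMillis();
        // Hypothetical condition standing in for "slow JavaScript has finished".
        boolean ready = waitUntil(
                () -> System.currentTimeMillis() - begin >= 200,
                Duration.ofSeconds(2),
                Duration.ofMillis(50));
        System.out.println("condition met: " + ready);
    }
}
```

Polling with a deadline, rather than sleeping for a fixed time, returns as soon as the page is ready and still gives up cleanly when it never is.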

It also has the benefit that you can actually drive the page, not just scrape it. That means that if you perform some actions on the page before you get to the data you want (click the search button, click next, now scrape) then you can code that into the process.

I don't know if you'll be able to get the full DOM in a navigable form from Selenium, but it does provide XPath retrieval for the various parts of the page, which is what you'd normally need for a scraping application.
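
For comparison, if you do end up with a navigable W3C DOM (for example an `org.w3c.dom.Document`), the JDK's own `javax.xml.xpath` package can run XPath queries against it. The sketch below uses that JDK API, not Selenium; the class name `XPathScrape`, its helper methods, and the sample markup are assumptions made for illustration, and the parser shown only accepts well-formed markup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathScrape {

    /** Returns the text content of every node matched by the XPath expression. */
    public static List<String> textOf(Document doc, String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    /** Parses well-formed markup; a stand-in for a DOM produced by a real browser. */
    public static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Markup is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><ul>"
                + "<li class=\"hit\">first</li><li>skip</li><li class=\"hit\">second</li>"
                + "</ul></body></html>");
        System.out.println(textOf(doc, "//li[@class='hit']"));
    }
}
```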

edited Feb 1, 2012 at 0:38

answered Jan 31, 2012 at 23:48

Erica


1 Comment

morja Over a year ago

Thank you! Selenium looks promising, but if I want to run it headless I could directly use HtmlUnit. And so far I had a few issues with HtmlUnit. Especially when it comes to performance. I will have a closer look at Selenium.

Gepsens · Accepted Answer · 2012-02-06 10:50:45Z

2

You can use Java or Groovy, with or without Grails. Then use WebDriver, Selenium, Spock and Geb. These are meant for testing, but the libraries are useful for your case. You can implement a crawler that does not open a new browser window but just uses the runtime of one of these browsers.

Selenium : http://code.google.com/p/selenium/
Webdriver : http://seleniumhq.org/projects/webdriver/
Spock : http://code.google.com/p/spock/
Geb : http://www.gebish.org/manual/current/testing.html

answered Feb 6, 2012 at 10:50

Gepsens


2 Comments

morja Over a year ago

Geb looks promising, I will look deeper into it. Thank you!

Gepsens Over a year ago

Yes I should have specified that Geb includes all of the above :) It's really a great new way to do testing.

Carlos Blanco · Accepted Answer · 2010-01-29 17:12:08Z

1

MozSwing could help http://confluence.concord.org/display/MZSW/Home.

answered Jan 29, 2010 at 17:12

Carlos Blanco


Ruslans Uralovs · Accepted Answer · 2010-11-06 18:46:42Z

1

You can try JExplorer. For more information see http://www.teamdev.com/downloads/jexplorer/docs/JExplorer-PGuide.html

You can also try Cobra, see http://lobobrowser.org/cobra.jsp

answered Nov 6, 2010 at 18:46

Ruslans Uralovs


Eran · Accepted Answer · 2014-01-27 14:09:37Z

1

I haven't tried this project, but I have seen several implementations for Node.js that include JavaScript DOM manipulation.

https://github.com/tmpvar/jsdom

edited Jan 27, 2014 at 14:09

Eran


answered Feb 6, 2012 at 13:40

James Westgate

