Manipulating HTML with Java and jsoup


Brendon Anderson

Have you ever needed to manipulate HTML in your Java code? Maybe you are working with HTML fragments that need some decorating, or you need to clean up possibly bad syntax, or you need to do some screen scraping. A handy little library named jsoup is just what you need.

It’s easy to set up. Just download the jar from the jsoup download area and include it in your classpath. There is even a Maven artifact if you are into that sort of thing. The only requirement is Java 5 or higher; there are no other dependencies.

Let’s start with a basic example. Let’s read in an HTML fragment:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Note: the second paragraph is deliberately left unclosed.
String fragment = "<div id='div1'>"
    + "<p id='para1'>This is the first paragraph</p>"
    + "<p id='para2'>Second paragraph here!"
    + "</div>";
Document doc = Jsoup.parseBodyFragment(fragment);
System.out.println(doc.toString());

The output from this is:

<html>
 <head></head>
 <body>
  <div id="div1">
   <p id="para1">This is the first paragraph</p>
   <p id="para2">Second paragraph here!</p>
  </div>
 </body>
</html>

The first thing you’ll notice is that jsoup wraps your fragment with all the tags needed to create a valid HTML document. This can be a help or a hindrance, depending on what you are doing. You can also read in a complete HTML document using Jsoup.parse(). Notice that the closing p tag missing from the source HTML has been added in the output. Jsoup does its best to clean up invalid HTML to make it valid. If you want to get back to your (now valid) fragment without the added html, head, and body tags, you can do this:

doc.body().children().toString();
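As a rough sketch of the same round trip, the helper below parses a fragment and returns only the contents of the body. The class and method names (FragmentRoundTrip, normalize) are made up for illustration, and body().html() is used here as an alternative to children().toString() that returns the body's inner HTML as a string.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FragmentRoundTrip {

    // Hypothetical helper: parse a (possibly sloppy) fragment and return
    // the repaired markup without the html, head, and body wrappers.
    static String normalize(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        return doc.body().html();
    }

    public static void main(String[] args) {
        // Neither paragraph is closed in the input.
        System.out.println(normalize("<p>one<p>two"));
    }
}
```

Both paragraphs should come back with their closing tags added, and without the surrounding document shell.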

Now you probably want to manipulate the document a little. Say you want to add in a third paragraph. Using the same HTML fragment as above, add in a paragraph like this:

doc.select("p").last().after("<p id='para3'>Third paragraph I just added</p>");

Output:

<div id="div1">
 <p id="para1">This is the first paragraph</p>
 <p id="para2">Second paragraph here!</p>
 <p id="para3">Third paragraph I just added</p>
</div>

Does that look familiar? Hint:

$("p").last().after("<p id='para3'>Third paragraph I just added</p>");

If you are used to jQuery, jsoup should be an easy transition. Many of the same methods and selectors are available in jsoup. You can select on id, tag name (e.g. “p” or “div”), class name, or elements with specific attributes. Just as in jQuery, you can retrieve children, siblings, and parents, insert and remove elements, and get values of elements or attributes. For example, to select an element by id:

System.out.println(doc.select("#para1").toString());

This will get you what you’d expect:

<p id="para1">This is the first paragraph</p>

Similarly, to find all p elements:

Elements elements = doc.select("p");
System.out.println(elements.toString());

Output:

<p id="para1">This is the first paragraph</p>
<p id="para2">Second paragraph here!</p>
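The other selector and traversal styles mentioned above (class names, attributes, siblings, parents, attribute values) might look something like the sketch below. The markup is a made-up variation on the earlier fragment, with a class and a link added purely for illustration, and the class name SelectorSketch is hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SelectorSketch {

    // Hypothetical sample markup, loosely based on the fragment above.
    static final String HTML =
          "<div id='div1' class='box'>"
        + "<p id='para1' class='intro'>This is the first paragraph</p>"
        + "<p id='para2'>Second paragraph here!</p>"
        + "<a href='http://example.com/'>A link</a>"
        + "</div>";

    static Document load() {
        return Jsoup.parseBodyFragment(HTML);
    }

    public static void main(String[] args) {
        Document doc = load();

        // Select by tag and class name.
        System.out.println(doc.select("p.intro").size());

        // Select elements that carry a given attribute, then read its value.
        Element link = doc.select("a[href]").first();
        System.out.println(link.attr("href"));

        // Walk from one element to its next sibling and to its parent.
        Element para1 = doc.select("#para1").first();
        System.out.println(para1.nextElementSibling().id());
        System.out.println(para1.parent().id());

        // Read the text content of the selected element.
        System.out.println(doc.select("#para2").text());
    }
}
```

Note that first() returns null when nothing matches, so real code would want to check for that before chaining further calls.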

To remove an element:

Elements elements = doc.select("#para1").remove();
System.out.println(doc.body().children().toString());
System.out.println("---------------------------------");
System.out.println(elements.toString());

Output:

<div id="div1">
 <p id="para2">Second paragraph here!</p>
</div>
---------------------------------
<p id="para1">This is the first paragraph</p>

The removed elements are returned in an Elements object, but no longer exist in the Document.
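Inserting and decorating work in much the same chained style as removing. The sketch below is one possible way to do the "decorating" mentioned at the start: it adds a class to every paragraph, sets an attribute, and prepends a heading. The class name DecorateSketch, the body-text class, the title value, and the heading text are all invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class DecorateSketch {

    // Hypothetical helper: decorate a fragment and return the body's inner HTML.
    static String decorate(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);

        // Add a CSS class to every p element.
        doc.select("p").addClass("body-text");

        // Set an attribute on one specific element.
        doc.select("#para2").attr("title", "second");

        // Insert new markup as the first child of the div.
        doc.select("#div1").prepend("<h2>Paragraphs</h2>");

        return doc.body().html();
    }

    public static void main(String[] args) {
        String fragment = "<div id='div1'>"
            + "<p id='para1'>This is the first paragraph</p>"
            + "<p id='para2'>Second paragraph here!"
            + "</div>";
        System.out.println(decorate(fragment));
    }
}
```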

A powerful feature of jsoup is its ability to scrub HTML. You may be accepting HTML from users on your website, but you don’t want them injecting potentially harmful tags or code. The clean() method on the Jsoup class takes a Whitelist as one of its parameters (newer jsoup releases rename this class Safelist). Jsoup comes with several Whitelists, and you can create your own if you need something customized. Here’s an example of cleaning the example HTML from above with the “basic” Whitelist:

System.out.println(Jsoup.clean(fragment, Whitelist.basic()));

Output:

<p>This is the first paragraph</p>
<p>Second paragraph here!</p>

Notice the missing <div> tags. The basic Whitelist does not allow <div> tags. The built-in Whitelists range from allowing no tags at all (only text) to a fairly wide variety of tags. You can even limit protocols (e.g. http and ftp) and the attributes allowed on specific tags.

Whitelist myWhitelist = new Whitelist();
myWhitelist.addTags("div", "p");
myWhitelist.addAttributes("div", "class");
myWhitelist.addAttributes("p", "id");
System.out.println(Jsoup.clean(fragment, myWhitelist));

Output:

<div>
 <p id="para1">This is the first paragraph</p>
 <p id="para2">Second paragraph here!</p>
</div>

Notice that the div tag has lost its id attribute: the custom Whitelist allows only the class attribute on div.
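Protocol limits can be sketched the same way. The example below allows links only when their href uses http or https, so a javascript: URL has its href stripped. It assumes a newer jsoup release, where the class this article calls Whitelist is named Safelist; on an older release, substitute Whitelist. The class name LinkCleaner and the sample markup are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class LinkCleaner {

    // Hypothetical helper: keep p and a tags, and keep href only for http(s) links.
    static String cleanLinks(String html) {
        Safelist links = new Safelist()
            .addTags("p", "a")
            .addAttributes("a", "href")
            .addProtocols("a", "href", "http", "https");
        return Jsoup.clean(html, links);
    }

    public static void main(String[] args) {
        String dirty = "<p><a href='javascript:alert(1)'>bad</a> "
            + "<a href='https://example.com/'>good</a></p>";
        System.out.println(cleanLinks(dirty));
    }
}
```

The link text survives in both cases; only the disallowed href is dropped.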

Some other nice features of jsoup are its ability to read directly from a URL (Jsoup.connect(url)), its ability to test a string of HTML against a Whitelist to check for validity, its CSS selectors, and more.
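A minimal sketch of the validity check and the URL fetch follows. As with the previous example, it assumes a newer jsoup release where Whitelist is named Safelist. The class and method names (ValiditySketch, isSafe, fetchTitle) are hypothetical, and fetchTitle is not called in main because it needs network access.

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class ValiditySketch {

    // True when the HTML contains only what the basic list allows.
    static boolean isSafe(String html) {
        return Jsoup.isValid(html, Safelist.basic());
    }

    // Fetch a page over the network and return the text of its title element.
    static String fetchTitle(String url) throws IOException {
        return Jsoup.connect(url).get().title();
    }

    public static void main(String[] args) {
        System.out.println(isSafe("<p>Hello</p>"));
        System.out.println(isSafe("<p>Hi</p><script>alert(1)</script>"));
    }
}
```

Unlike clean(), isValid() does not repair anything; it only reports whether the input would pass through untouched, which is handy for rejecting a submission outright instead of silently altering it.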

If you need to manipulate HTML in your Java code, you need jsoup!

About the author

Brendon Anderson

Sr. Consultant

Brendon has over 10 years of software development experience at organizations large and small.  He craves learning new technologies and techniques and lives in and understands large enterprise application environments with complex software and hardware architectures.

© Object Partners 2019 All Rights Reserved