Execute XPath queries on HTML downloaded from a URL with Java using Eclipse Luna

FYI: Eclipse Luna (4.4) is currently beta.

Create a new Maven project.

Open the pom.xml.

Add the following dependencies:

    <dependency>
      <groupId>net.sourceforge.htmlcleaner</groupId>
      <artifactId>htmlcleaner</artifactId>
      <version>2.6</version>
    </dependency>
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>4.3.4</version>
    </dependency>

The code is as follows:

package sty.qainjava.xpath.on.html;
 
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.URL;
import java.nio.charset.Charset;
 
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
 
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.entity.ContentType;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.SimpleHtmlSerializer;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;
 
/**
 * QAinJava: how to run an XPath query on HTML in Java.
 * 
 * We use <a href="http://htmlcleaner.sourceforge.net/">HtmlCleaner</a>
 * and <a href="https://hc.apache.org/">HttpClient</a>.
 * 
 * @author Mihail STY
 */
public class Program {
	/**
	 * Everything lives in main, with no helper methods, so that the source
	 * code is as straightforward as possible.
	 * 
	 * No exception handling at all, for simplicity.
	 */
	public static void main(String[] args) throws IOException,
	        ParserConfigurationException, XPathExpressionException,
	        TransformerException {
 
		String address = "https://www.google.com/";
 
		String html;
 
 
		{
			// the httpclient part
			CloseableHttpClient httpclient = HttpClients.createDefault();
			HttpGet httpGet = new HttpGet(address);
			CloseableHttpResponse response = httpclient.execute(httpGet);
			HttpEntity entity = response.getEntity();
 
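			ContentType contentType = ContentType.getOrDefault(entity);
			Charset charset = contentType.getCharset();
			if (charset == null) {
				// no charset in the Content-Type header; assume UTF-8
				charset = Charset.forName("UTF-8");
			}
 
			BufferedReader r = new BufferedReader(new InputStreamReader(
			        entity.getContent(), charset));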
 
			// we can directly plug the input to HtmlCleaner,
			// but we put it in a string so we can print it,
			// or save it to a file
			String line = null;
			StringBuilder builder = new StringBuilder();
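			while ((line = r.readLine()) != null) {
				// readLine() strips the line terminator, so put it back
				builder.append(line).append('\n');
			}
			html = builder.toString();
			// release the connection now that the body has been read
			r.close();
			response.close();
			httpclient.close();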
		}
 
		{// write html to a file
			BufferedWriter bf = new BufferedWriter(new OutputStreamWriter(
			        new FileOutputStream("google.html.xml")));
			bf.write(html);
			// exception handling is not exceptionally good, but that's not our
			// focus here
			bf.flush();
			bf.close();
		}
 
		// HtmlCleaner part
		TagNode tagNode = new HtmlCleaner().clean(html);
		String cleanHtml = new SimpleHtmlSerializer(new CleanerProperties())
		        .getAsString(tagNode);
		// System.out.println(cleanHtml);
 
		{// write cleanHtml to a file
			BufferedWriter bf = new BufferedWriter(new OutputStreamWriter(
			        new FileOutputStream("clean.html.xml")));
			bf.write(cleanHtml);
			// exception handling is not exceptionally good, but that's not our
			// focus here
			bf.flush();
			bf.close();
		}
 
		// we need a DOM document to execute xpath, HtmlCleaner helps in creating one
		Document doc = new DomSerializer(new CleanerProperties())
		        .createDOM(tagNode);
 
		{// save dom to file with a transformer (just for testing purposes)
			TransformerFactory factory = TransformerFactory.newInstance();
			Transformer transformer = factory.newTransformer();
			transformer.transform(new DOMSource(doc), new StreamResult(
			        new File("dom.html.xml")));
		}
 
		// the xpath part: with XPathConstants.STRING we get the value of the
		// first matching node only
		XPath xpath = XPathFactory.newInstance().newXPath();
		String imgURL = (String) xpath.evaluate("//img/@src", doc,
		        XPathConstants.STRING);
 
		// resolving imgURL against the page URL gives an absolute URL even
		// when the src attribute is relative
		System.out.println(new URL(new URL(address), imgURL).toString());
	}
}
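
The program above prints only the first image, because XPathConstants.STRING returns the string value of the first matching node. If you want every match, a minimal sketch could look like the following. It uses only the JDK so it can run on its own: the class name ImageSources, the helper names, and the example.com addresses are made up for illustration, and the small well-formed snippet parsed with DocumentBuilderFactory stands in for the DOM that HtmlCleaner's DomSerializer would produce in the real program.

```java
import java.io.StringReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original program.
class ImageSources {

	/** Returns every img/@src in doc as an absolute URL, resolved against base. */
	static List<String> absoluteImageUrls(Document doc, String base)
	        throws Exception {
		XPath xpath = XPathFactory.newInstance().newXPath();
		// NODESET returns all matching attribute nodes instead of one string
		NodeList srcs = (NodeList) xpath.evaluate("//img/@src", doc,
		        XPathConstants.NODESET);

		URL baseUrl = new URL(base);
		List<String> result = new ArrayList<String>();
		for (int i = 0; i < srcs.getLength(); i++) {
			// same two-URL trick as above: relative values are resolved
			// against the base, absolute values are kept unchanged
			result.add(new URL(baseUrl, srcs.item(i).getNodeValue()).toString());
		}
		return result;
	}

	/** Parses a well-formed XML string; stands in for the HtmlCleaner DOM. */
	static Document parse(String xml) throws Exception {
		return DocumentBuilderFactory.newInstance().newDocumentBuilder()
		        .parse(new InputSource(new StringReader(xml)));
	}

	public static void main(String[] args) throws Exception {
		Document doc = parse("<html><body>"
		        + "<img src=\"/images/logo.png\"/>"
		        + "<img src=\"http://example.org/a.gif\"/>"
		        + "</body></html>");
		for (String url : absoluteImageUrls(doc,
		        "https://www.example.com/index.html")) {
			System.out.println(url);
		}
	}
}
```

In the real program you would pass the doc created by DomSerializer and the address variable instead of the sample snippet.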

TrueCrypt insecure?

There is a warning on the TrueCrypt homepage:

http://truecrypt.sourceforge.net/

>WARNING: Using TrueCrypt is not secure as it may contain unfixed security issues

More info:

http://www.networkworld.com/article/2342845/microsoft-subnet/encryption-canary-or-insecure-app--truecrypt-warning-says-use-microsoft-s-bitlocker.html
http://www.rawstory.com/rs/2014/05/29/security-enthusiasts-may-revive-truecrypt-encryption-tool-after-mystery-shutdown/
http://www.hbarel.com/analysis/itsec/the-status-of-truecrypt
http://it.slashdot.org/comments.pl?sid=5212985&cid=47115785

Alternatives:

http://truecrypt.ch/ (forking or actually continuing the development)
https://ciphershed.org/ (truecrypt fork)

TrueCrypt's audit (PDF):
https://opencryptoaudit.org/reports/iSec_Final_Open_Crypto_Audit_Project_TrueCrypt_Security_Assessment.pdf

Two problems

  • Some people, when confronted with a problem, think, "I know, I'll use regular expressions." Now they have two problems.
  • Some people, when faced with a problem, think, "I know, I'll use #binary." Now they have 10 problems.
  • Some people, when confronted with a problem, think, "I know, I'll use #threads," and then two they hav erpoblesms.
  • Some people, when confronted with a problem, think "I know, I'll use #multithreading". Nothhw tpe yawrve o oblems.
  • Some people, when confronted with a problem, think, "I know, I'll use mutexes." Now they have
  • Some people, when confronted with a problem, think: "I know, I'll use caching." Now they have one problems.
  • Some people see a problem and think "I know, I'll use #Java!" Now they have a ProblemFactory.
  • Some programmers, when confronted with a problem, think "I know, I'll use floating point arithmetic." Now they have 1.999999999997 problems.
  • Some people, wanting an escape from their full-time job, think "I know, I'll contribute to open source." Now they have two full-time jobs.
  • Some people, when confronted with a problem, think: "I know, I'll think outside the box!" Now, they have 3.75 problems, an entirely new framework, and three dozen toll house cookies cooling in the kitchen.
  • Some people when confronted with a desire to use pithy quotes in their presentations think "I know, I'll use something from Star Wars". Now two problems they have.
  • Some people, when confronted with a problem, think, "I know, I'll use #UTF8." Now they à??????µ?ç°§ùÔ_¦Ñ?.
  • "I'll use #PHP!" Now they have ("1 apple" + "1 orange") problems.
  • "I'll use #Perl!" Now they have more than one way to have more than one problem....
  • Some people, when confronted with a problem, think, "I know, I'll use Shareware." Now they have two trials.
  • Some people, when confronted with a problem, think, "I know, I'll use delegations." Now their problem is a problem of their problem.
  • Some people when confronted with a problem think "I know, I'll quote jwz". Now everyone has a problem.

Source: http://nedbatchelder.com/blog/201204/two_problems.html