How do I read text from a web page using Java?


Question

I want to read the text of a web page. I do not want the page's HTML source. I found this code:

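    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;

    try {
        // Create a URL for the desired page
        URL url = new URL("http://www.uefa.com/uefa/aboutuefa/organisation/congress/news/newsid=1772321.html#uefa+moving+with+tide+history");

        // Read all the text returned by the server
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String str;
        while ((str = in.readLine()) != null) {
            // str is one line of text; readLine() strips the newline character(s).
            // Do not call readLine() again here, or every other line is skipped.
            System.out.println(str);
        }
        in.close();
    } catch (MalformedURLException e) {
        // The URL string could not be parsed
        e.printStackTrace();
    } catch (IOException e) {
        // The connection or the read failed
        e.printStackTrace();
    }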

But this code gives me the page's HTML source. I want to get the full text of the page instead. How can I do this in Java?


Answer:

You might want to take a look at jsoup:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
    Document doc = Jsoup.parse(html);
    String text = doc.body().text(); // "An example link."

This example is an excerpt from their website.
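
The excerpt parses an HTML string, while the question starts from a URL. A minimal sketch of the same idea applied to a live page follows, assuming the jsoup library is on the classpath. The class name `PageText`, the helper names `textOf` and `textAt`, and the default URL are made up for illustration and are not part of jsoup.

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PageText {

    // Hypothetical helper: parses an HTML string and returns its text with the tags removed.
    static String textOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().text();
    }

    // Hypothetical helper: downloads the page at the given URL and returns the text of its body.
    static String textAt(String url) throws IOException {
        Document doc = Jsoup.connect(url).get();
        return doc.body().text();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL used only for illustration; pass the real page as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.com/";
        System.out.println(textAt(url));
    }
}
```

Note that `text()` returns the text of the whole body, so navigation menus and footers are included along with the article. If only part of the page is wanted, jsoup's `select` method with a CSS selector can narrow the result, but the right selector depends on the markup of the page in question.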