Asked by: 小点点

Indexing files in a directory with Lucene


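Dear users, I am using Apache Lucene for indexing and searching. I need to index HTML files stored on my computer's local disk, covering both the file name and the contents of each file. I can store the file names in the Lucene index, but not the contents of the HTML files. The index should cover not only the text but the whole page, including image links and URLs. How can I access the contents of these files for indexing? I used the following code: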

    File indexDir = new File(indexpath);
    File dataDir = new File(datapath);
    String suffix = ".htm";
    IndexWriter indexWriter = new IndexWriter(
            FSDirectory.open(indexDir),
            new SimpleAnalyzer(),
            true,
            IndexWriter.MaxFieldLength.LIMITED);
    indexWriter.setUseCompoundFile(false);
    indexDirectory(indexWriter, dataDir, suffix);

    numIndexed = indexWriter.maxDoc();
    indexWriter.optimize();
    indexWriter.close();


private void indexDirectory(IndexWriter indexWriter, File dataDir, String suffix) throws IOException {
    try {
        for (File f : dataDir.listFiles()) {
            if (f.isDirectory()) {
                indexDirectory(indexWriter, f, suffix);
            } else {
                indexFileWithIndexWriter(indexWriter, f, suffix);
            }
        }
    } catch (Exception ex) {
        System.out.println("exception 2 is" + ex);
    }
}

private void indexFileWithIndexWriter(IndexWriter indexWriter, File f,
    String suffix) throws IOException {
    try {
        if (f.isHidden() || f.isDirectory() || !f.canRead() || !f.exists()) {
            return;
        }
        if (suffix != null && !f.getName().endsWith(suffix)) {
            return;
        }
        Document doc = new Document();
        doc.add(new Field("contents", new FileReader(f)));
        // java.io.File has no getFileName(); getName() returns the last path segment
        doc.add(new Field("filename", f.getName(),
                Field.Store.YES, Field.Index.ANALYZED));
        indexWriter.addDocument(doc);
    } catch (Exception ex) {
        System.out.println("exception 4 is" + ex);
    }
}

Thanks in advance.


1 answer

Anonymous user

This line of code is the reason your contents are not being stored:

doc.add(new Field("contents", new FileReader(f)));

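The Field(String, Reader) constructor tokenizes and indexes the text it reads but does not store it, so the contents can be searched but cannot be read back from the index.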
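One way to keep the contents retrievable is to read the file into a String first and add that String as a stored field. The following is only a minimal sketch of the reading step, using the JDK alone. It assumes the files are UTF-8 encoded and small enough to fit in memory, and the class name HtmlFileText and the method readAll are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class HtmlFileText {

    // Reads the whole file into memory as one String.
    // Assumption: the file is UTF-8 encoded; adjust the charset otherwise.
    static String readAll(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    // With the Lucene 3.x API used in the question, the result could then be
    // added as a stored, analyzed field, for example:
    //   doc.add(new Field("contents", text, Field.Store.YES, Field.Index.ANALYZED));
}
```

Storing the full page makes the index larger, so this is probably only worthwhile if the raw HTML really has to be returned with the search hits.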

If you are trying to index HTML files, try using JTidy. It will make the process much easier.

Sample code:

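import java.io.InputStream;

import org.apache.lucene.document.Field;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.w3c.tidy.Tidy;

public class JTidyHTMLHandler {

    // DocumentHandlerException is not defined in this snippet; it is assumed
    // to be an application-defined exception type.
    public org.apache.lucene.document.Document getDocument(InputStream is) throws DocumentHandlerException {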
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        org.w3c.dom.Document root = tidy.parseDOM(is, null);
        Element rawDoc = root.getDocumentElement();

        org.apache.lucene.document.Document doc =
                new org.apache.lucene.document.Document();

        String body = getBody(rawDoc);

        if ((body != null) && (!body.equals(""))) {
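            // Store.NO indexes the text without storing it; use Field.Store.YES
            // if the text must be retrievable from the index.
            doc.add(new Field("contents", body, Field.Store.NO, Field.Index.ANALYZED));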
        }

        return doc;
    }

    protected String getTitle(Element rawDoc) {
        if (rawDoc == null) {
            return null;
        }

        String title = "";

        NodeList children = rawDoc.getElementsByTagName("title");
        if (children.getLength() > 0) {
            Element titleElement = ((Element) children.item(0));
            Text text = (Text) titleElement.getFirstChild();
            if (text != null) {
                title = text.getData();
            }
        }
        return title;
    }

    protected String getBody(Element rawDoc) {
        if (rawDoc == null) {
            return null;
        }

        String body = "";
        NodeList children = rawDoc.getElementsByTagName("body");
        if (children.getLength() > 0) {
            body = getText(children.item(0));
        }
        return body;
    }

    protected String getText(Node node) {
        NodeList children = node.getChildNodes();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            switch (child.getNodeType()) {
                case Node.ELEMENT_NODE:
                    sb.append(getText(child));
                    sb.append(" ");
                    break;
                case Node.TEXT_NODE:
                    sb.append(((Text) child).getData());
                    break;
            }
        }
        return sb.toString();
    }
}
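
To see what the recursive getText traversal produces, here is a small sketch that swaps JTidy for the JDK's built-in XML parser, so it only works on well-formed XHTML and is not a replacement for the handler above. The class name BodyTextDemo and the helper bodyText are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class BodyTextDemo {

    // Same traversal idea as getText above: concatenate text nodes and
    // recurse into elements, appending a space after each element.
    static String getText(Node node) {
        StringBuilder sb = new StringBuilder();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                sb.append(getText(child)).append(' ');
            } else if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString();
    }

    // Assumption: the input is well-formed XHTML, because the JDK parser
    // (unlike JTidy) rejects malformed markup.
    static String bodyText(String xhtml) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        NodeList bodies = dom.getElementsByTagName("body");
        return bodies.getLength() > 0 ? getText(bodies.item(0)) : "";
    }
}
```

Note that attribute values such as image sources and link targets are not text nodes, so this traversal drops them. If those need to be searchable, they would have to be collected separately.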

To get an InputStream from a URL:

URL url = new URL(htmlURLlocation);
URLConnection connection = url.openConnection();
InputStream stream = connection.getInputStream();

To get an InputStream from a file:

InputStream stream = new FileInputStream(new File (htmlFile));
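
Neither snippet closes the stream once parsing is done. As a rough sketch, assuming Java 8 or later, try-with-resources can take care of that. The names StreamUse, StreamHandler, and withFile are hypothetical and not part of Lucene or JTidy.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

final class StreamUse {

    // Callback that consumes the stream and returns a result.
    interface StreamHandler<T> {
        T handle(InputStream in) throws IOException;
    }

    // Opens the file, passes the stream to the handler, and closes the
    // stream afterwards, even if the handler throws.
    static <T> T withFile(File file, StreamHandler<T> handler) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            return handler.handle(in);
        }
    }
}
```

A handler passed to withFile could, for instance, hand the stream to a parser and return the resulting document, as long as any exception it throws fits the IOException signature.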