Stage 9, Module 2: Lucene

Source: Lagou Education Java Employment Training Camp

1.1 Data classification

Structured data: data with a fixed format or limited length, such as database records and metadata.

Unstructured data: data with variable length or no fixed format, such as emails, Word documents, and other files on disk.

Common structured data is the data in the database.


Data in a database is stored in a regular way: it has rows and columns, and the data format and length are fixed. This makes a database easy to search.

1.3 Unstructured data query methods

(1) Serial Scanning

User search -> files

Serial scanning means checking the documents one by one. To find the files that contain a certain string, each document is read from beginning to end; if it contains the string, it is one of the files we are looking for. Then the next file is read, and so on until all the files have been scanned. Windows search can also search file contents this way, but it is quite slow.
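As a rough illustration of serial scanning, here is a minimal plain-Java sketch. The class `SerialScan`, its `scan` method, and the sample file names are all hypothetical, and the "files" are just in-memory strings. The point is only that every document is read in full for every query, which is why the approach gets slow as the data grows.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Lucene.
class SerialScan {

    // Returns the names of all documents whose text contains the keyword.
    static List<String> scan(Map<String, String> documents, String keyword) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : documents.entrySet()) {
            // Each document is examined in turn, whether or not it matches
            if (entry.getValue().contains(keyword)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> documents = new LinkedHashMap<>();
        documents.put("a.txt", "Lucene is a full-text search library");
        documents.put("b.txt", "MySQL stores structured data");
        documents.put("c.txt", "An index makes search fast");
        System.out.println(scan(documents, "search")); // prints [a.txt, c.txt]
    }
}
```

The cost of each query is proportional to the total size of all documents, no matter how few of them match.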

(2) Full-text Search

User query -> index library (built in advance) -> documents

In full-text search, an indexing program scans every word in the article and builds an index entry for each word, recording how many times and where the word appears in the article. When the user submits a query, the search program looks it up in the index built in advance and returns the results to the user. This is similar to looking up a character in a dictionary.

In other words: extract part of the information from the unstructured data and reorganize it so that it has a certain structure, then search that structured data, which makes the search relatively fast. The information extracted from the unstructured data and reorganized in this way is called an index.

You can use Lucene to achieve full-text search.
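Before turning to Lucene itself, the idea of recording the number and location of each word can be sketched in plain Java. This is only an assumption-laden toy: the class `WordPositions` is hypothetical, words are split on non-word characters (so it only suits ASCII text, not Chinese), and positions are counted in words rather than characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of Lucene.
class WordPositions {

    // Maps each lowercased word to the 0-based word positions where it occurs.
    // The number of occurrences is simply the size of the position list.
    static Map<String, List<Integer>> index(String article) {
        Map<String, List<Integer>> positions = new TreeMap<>();
        int position = 0;
        for (String word : article.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (word.isEmpty()) {
                continue; // split can yield an empty first element
            }
            positions.computeIfAbsent(word, k -> new ArrayList<>()).add(position);
            position++;
        }
        return positions;
    }

    public static void main(String[] args) {
        // prints {be=[1, 5], not=[3], or=[2], to=[0, 4]}
        System.out.println(index("to be or not to be"));
    }
}
```

Once such a structure exists, answering "where does this word occur?" is a map lookup instead of a scan over the text.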

Lucene usage scenarios

Provide full-text search over database data inside an application.
Develop standalone search engine services and systems.

Features of Lucene:

  1. Stable, high indexing performance
  • Can index more than 150 GB of data per hour
  • Low memory requirements: only 1 MB of heap memory is needed
  • Incremental indexing is as fast as bulk indexing
  • The index is about 20%~30% of the size of the indexed text

  2. Efficient, accurate, high-performance search algorithms
  • Good ranking of search results
  • Powerful query support: phrase query, wildcard query, proximity query, range query, etc.
  • Field-based search (such as title, author, content)
  • Sorting by any field
  • Merging of query results from multiple indexes

  3. Cross-platform

[Figure: Lucene architecture]

2.1 Index and search flowchart

[Figure: index and search flowchart]

Green indicates the indexing process, which builds an index library from the original content to be searched. It consists of:
determine the original content to be searched -> collect documents -> create documents -> analyze documents -> index documents

Red indicates the search process, which retrieves content from the index library. It consists of:
the user enters a query through the search interface -> create a query -> execute the search against the index library -> render the search results

2.2 Create an index


The source provided by the user is a record: a text file, a string, a row in a database table, and so on. After a record is indexed, it is stored in the index file as a Document. Search results are also returned to the user as a list of Documents.


A Document can contain multiple fields of information. For example, an article can contain the fields "title", "body", and "last modified time". These are stored in the Document as Fields.

A Field has two optional attributes: store and index. The store attribute controls whether the Field is stored; the index attribute controls whether the Field is indexed.


A Term is the smallest unit of search. It represents a word in a document and consists of two parts: the word itself and the name of the Field in which the word appears.

Creation process

  1. Obtain the original documents: query the data to be indexed from the MySQL database with an SQL statement.
  2. Create document objects (Document): convert the query results into Document objects that Lucene can recognize. The original content is obtained in order to index it, and before indexing it must be turned into documents. Each document contains fields (Field), and each field corresponds to a column in the table.
  3. Analyze the documents: analyze the content of each field. Analysis extracts words from the original text, converts letters to lowercase, removes punctuation, removes stop words, and so on, to produce the final tokens. A token can be understood as a single word.
  4. Create the index: index the tokens produced by analyzing all the documents. The purpose of indexing is search: in the end, only the indexed tokens are searched in order to find the Documents.

PS: Creating an index means indexing the tokens so that documents can be found through words. This index structure is called an inverted index.
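The analysis step in the creation process can be approximated in a few lines of plain Java. This is a sketch under stated assumptions, not how Lucene or IKAnalyzer actually implement analysis: the class `SimpleAnalysis` and its tiny stop-word list are hypothetical, and splitting on non-letter, non-digit characters only works for space-delimited languages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of Lucene.
class SimpleAnalysis {

    // Illustrative stop-word list; real analyzers ship their own
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Splitting on anything that is not a letter or digit extracts
        // the words and drops the punctuation in one step
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT); // convert to lowercase
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue; // skip empty fragments and stop words
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [lucene, index, fast, small]
        System.out.println(analyze("The Lucene index is fast, and small!"));
    }
}
```

The same analysis has to be applied to the user's query at search time; otherwise the query terms would not match the tokens stored in the index.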


2.3 Inverted Index

An inverted index records which documents each term appears in and where in each document it appears, so the documents containing a term, and its positions in them, can be located quickly from the term itself.

Create an inverted index, divided into the following steps:

1) Create a document list:
Lucene first assigns a number (DocID) to each original document, forming a document list.

2) Create an inverted index list:
Segment the data in each document into terms (the individual words produced by word segmentation). Number the terms and build an index keyed by term, then record for each term the numbers of all documents that contain it (along with other information).

Search process:
When the user enters a query, the input is first segmented to obtain the terms the user wants to search for. These terms are then matched against the inverted index list, which gives the numbers of all documents that contain them. Finally, the documents are looked up in the document list by these numbers.
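The two creation steps and the search process can be put together in a small plain-Java sketch. Everything here is hypothetical and heavily simplified: `ToyInvertedIndex` is not a Lucene class, the DocID is just the position in a list, segmentation is a lowercase split on non-word characters, and a query is assumed to require every term to be present (AND semantics), which is one possible choice among several.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy model for illustration; not part of Lucene.
class ToyInvertedIndex {
    // Document list: the DocID is the position in this list
    private final List<String> documents = new ArrayList<>();
    // Inverted index list: term -> DocIDs of the documents containing it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    int add(String text) {
        int docId = documents.size();
        documents.add(text);
        for (String term : segment(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Returns the documents that contain every term of the query.
    List<String> search(String query) {
        Set<Integer> matches = null;
        for (String term : segment(query)) {
            Set<Integer> docIds = postings.get(term);
            if (docIds == null) {
                return new ArrayList<>(); // a term no document contains
            }
            if (matches == null) {
                matches = new TreeSet<>(docIds);
            } else {
                matches.retainAll(docIds); // keep DocIDs common to all terms
            }
        }
        List<String> hits = new ArrayList<>();
        if (matches != null) {
            for (int docId : matches) {
                hits.add(documents.get(docId)); // look up by DocID
            }
        }
        return hits;
    }

    static List<String> segment(String text) {
        List<String> terms = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                terms.add(word);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("Lucene is a search library");
        index.add("MySQL is a database");
        index.add("Lucene builds an inverted index for search");
        System.out.println(index.search("lucene search"));
    }
}
```

Note that the search never touches the text of a non-matching document: it works on the term-to-DocID map and only fetches the documents whose numbers it found.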

2.4 Query Index

Querying the index is the search process: the user enters keywords, the index is searched by those keywords, and the corresponding documents are found through the index.

  1. Create a user interface: where the user enters keywords
  2. Create a query: specify the field name and keywords to query
  3. Execute the query
  4. Render the results

3 Lucene in practice

Build an index library from job information, then retrieve data from it.

3.1 Development environment

Create a SpringBoot project

Import dependencies



Create a boot class

@SpringBootApplication
public class LuceneApplication {

    public static void main(String[] args) {
        SpringApplication.run(LuceneApplication.class, args);
    }
}


Configure yml file

server:
  port: 9000
spring:
  application:
    name: lagou-lucene
  datasource:
    driver-class-name: com.mysql.jdbc.Driver
    url: jdbc:mysql://localhost:3306/es?useUnicode=true&characterEncoding=utf8&serverTimezone=UTC
    username: root
    password: 19970821
mybatis-plus:
  configuration:
    map-underscore-to-camel-case: true

Create entity class, mapper, service

Entity class:

@Table(name = "job_info")
public class JobInfo {
  private long id;
  private String companyName;
  private String companyAddr;
  private String companyInfo;
  private String jobName;
  private String jobAddr;
  private String jobInfo;
  private int salaryMin;
  private int salaryMax;
  private String url;
  private String time;
  // getters and setters omitted
}

Mapper:

public interface JobInfoMapper extends BaseMapper<JobInfo> {
}


Service:

@Service
public class JobInfoServiceImpl implements JobInfoService {

    @Autowired
    private JobInfoMapper jobInfoMapper;

    @Override
    public JobInfo selectById(long id) {
        return jobInfoMapper.selectById(id);
    }

    @Override
    public List<JobInfo> selectAll() {
        // An empty wrapper adds no conditions, so every row is returned
        QueryWrapper<JobInfo> queryWrapper = new QueryWrapper<>();
        List<JobInfo> jobInfoList = jobInfoMapper.selectList(queryWrapper);
        return jobInfoList;
    }
}

3.2 Create Index

    @Autowired
    private JobInfoService jobInfoService;

    /**
     * Create the index
     */
    public void create() throws Exception {
        // 1. Specify the directory where the index files are stored
        Directory directory = FSDirectory.open(new File("D:/class/index"));
        // 2. Configure the analyzer (IKAnalyzer performs Chinese word segmentation)
        Analyzer analyzer = new IKAnalyzer();
        // 3. Create the IndexWriterConfig
        IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
        // 4. Create the IndexWriter
        IndexWriter indexWriter = new IndexWriter(directory, config);
        List<JobInfo> jobInfoList = jobInfoService.selectAll();
        // 5. Iterate over jobInfoList, creating one Document per record
        for (JobInfo jobInfo : jobInfoList) {
            Document document = new Document();
            document.add(new LongField("id", jobInfo.getId(), Field.Store.YES));
            document.add(new TextField("companyName", jobInfo.getCompanyName(), Field.Store.YES));
            document.add(new TextField("companyAddr", jobInfo.getCompanyAddr(), Field.Store.YES));
            document.add(new TextField("companyInfo", jobInfo.getCompanyInfo(), Field.Store.YES));
            document.add(new TextField("jobName", jobInfo.getJobName(), Field.Store.YES));
            document.add(new TextField("jobAddr", jobInfo.getJobAddr(), Field.Store.YES));
            document.add(new TextField("jobInfo", jobInfo.getJobInfo(), Field.Store.YES));
            document.add(new IntField("salaryMin", jobInfo.getSalaryMin(), Field.Store.YES));
            document.add(new IntField("salaryMax", jobInfo.getSalaryMax(), Field.Store.YES));
            document.add(new StringField("url", jobInfo.getUrl(), Field.Store.YES));
            // Write the document to the index
            indexWriter.addDocument(document);
        }
        // Closing the writer commits the changes and releases resources
        indexWriter.close();
        System.out.println("create index success!");
    }

Characteristics of Field:

A Document is the carrier of Fields and is composed of multiple Fields. A Field consists of two parts: a name and a value. The value of a Field is the content to be indexed and searched.


Whether to tokenize

Yes: Perform word segmentation on the value of the Field; the purpose of segmentation is indexing. For example: product name, product description. Users query this content by entering keywords, and because the content is so varied, it must be segmented before it can be indexed.
No: Do not perform word segmentation. For example: order number, ID card number. Each is a single unit that loses its meaning once segmented, so no segmentation is needed.

Whether to index

Yes: Index the words obtained by segmenting the Field content (or the entire Field content) and store them in the index. The purpose of indexing is search. For example: product name and product description are segmented and then indexed; order number and ID card number are indexed as a whole. Anything that may be used as a query condition needs to be indexed.
No: Do not index. For example, the product image path will not be used as a query condition, so it does not need to be indexed.

Whether to store

Yes: Save the Field value in the Document. For example: product name, product price. Any content that will be displayed to users on the search results page needs to be stored.
No: Do not store. For example: product description. The content is large and does not need to be displayed directly on the search results page, so it is not stored. It can be retrieved from the relational database when needed.
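To make the three switches concrete, here is a toy plain-Java model of them. It is only a sketch: `ToyFieldIndex` and `ToyField` are hypothetical names rather than Lucene classes, segmentation is a lowercase split on non-word characters, and the sample product data is invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy model for illustration; not part of Lucene.
class ToyFieldIndex {

    static class ToyField {
        final String name;
        final String value;
        final boolean tokenized; // segment the value into words?
        final boolean indexed;   // make the value searchable?
        final boolean stored;    // keep the original value for display?

        ToyField(String name, String value, boolean tokenized, boolean indexed, boolean stored) {
            this.name = name;
            this.value = value;
            this.tokenized = tokenized;
            this.indexed = indexed;
            this.stored = stored;
        }
    }

    // "field:term" -> DocIDs
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // DocID -> stored field values
    private final List<Map<String, String>> storedValues = new ArrayList<>();

    int addDocument(List<ToyField> fields) {
        int docId = storedValues.size();
        Map<String, String> stored = new LinkedHashMap<>();
        for (ToyField f : fields) {
            if (f.indexed) {
                // Tokenized fields are indexed word by word, others as a whole
                String[] terms = f.tokenized
                        ? f.value.toLowerCase(Locale.ROOT).split("\\W+")
                        : new String[] { f.value };
                for (String term : terms) {
                    if (!term.isEmpty()) {
                        postings.computeIfAbsent(f.name + ":" + term, k -> new TreeSet<>()).add(docId);
                    }
                }
            }
            if (f.stored) {
                stored.put(f.name, f.value);
            }
        }
        storedValues.add(stored);
        return docId;
    }

    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term, new TreeSet<Integer>());
    }

    String stored(int docId, String field) {
        return storedValues.get(docId).get(field);
    }

    public static void main(String[] args) {
        ToyFieldIndex index = new ToyFieldIndex();
        int docId = index.addDocument(Arrays.asList(
                new ToyField("name", "Red Running Shoes", true, true, true),
                new ToyField("orderNo", "A-1001", false, true, true),
                new ToyField("imagePath", "/img/1.png", false, false, true),
                new ToyField("description", "Light shoes for daily running", true, true, false)));
        System.out.println(index.search("name", "shoes"));            // found by a single word
        System.out.println(index.search("orderNo", "1001"));          // empty: indexed as a whole
        System.out.println(index.search("imagePath", "/img/1.png"));  // empty: not indexed
        System.out.println(index.stored(docId, "description"));       // null: searchable but not stored
    }
}
```

The example mirrors the cases in the text: the product name is segmented, indexed, and stored; the order number is indexed as a whole; the image path is stored only; the description is searchable but has to be fetched from the database for display.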

Commonly used Field types:

[Figure: commonly used Field types]

3.3 Query Index

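    public void query() throws Exception {
        // 1. Open the directory that holds the index
        Directory directory = FSDirectory.open(new File("D:/class/index"));
        // 2. Create the IndexReader and the IndexSearcher
        IndexReader indexReader = DirectoryReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        // 3. Build the query: the term "北京" in the companyName field
        Query query = new TermQuery(new Term("companyName", "北京"));
        // 4. Execute the search, returning at most the top 100 hits
        TopDocs topDocs = indexSearcher.search(query, 100);
        // Total number of matching documents
        int totalHits = topDocs.totalHits;
        // Get the hits; each ScoreDoc wraps a document ID
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc scoreDoc : scoreDocs) {
            int docId = scoreDoc.doc;
            // Load the stored fields of the document by its ID
            Document doc = indexSearcher.doc(docId);
            System.out.println("jobName:" + doc.get("jobName"));
            System.out.println("*******************************************");
        }
        // Release resources
        indexReader.close();
    }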