Stage 9, Module 2: Lucene

Content output source: Lagou Education Java Employment Training Camp

1.1 Data classification

Structured data : data with a fixed format or limited length, such as database records and metadata.

Unstructured data : data of variable length or with no fixed format, such as emails, Word documents, and other files on disk.

1.2 Structured data query

The most common structured data is the data in a database. Storage in a database is regular: there are rows and columns, and the data format and length are fixed, so searching a database is easy.

1.3 Unstructured data query methods

(1) Sequential scanning

User search----->File

Sequential scanning means examining the documents one by one. To find every file that contains a certain string, read each file from beginning to end; if it contains the string, it is one of the files we are looking for. Then move on to the next file, until all files have been scanned. Windows search can also search file contents this way, but it is quite slow.
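The sequential scan described above can be sketched in a few lines of plain Java (the directory and file names in `main` are made up for the demo):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class SerialScan {
    // Return the paths of all files under `dir` whose content contains `keyword`,
    // reading every file from beginning to end -- O(total bytes) per search.
    public static List<Path> scan(Path dir, String keyword) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                        .filter(p -> {
                            try {
                                return Files.readString(p).contains(keyword);
                            } catch (IOException e) {
                                return false; // unreadable file: skip it
                            }
                        })
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical demo directory with two small files
        Path dir = Files.createTempDirectory("scan-demo");
        Files.writeString(dir.resolve("a.txt"), "hello lucene");
        Files.writeString(dir.resolve("b.txt"), "hello world");
        System.out.println(scan(dir, "lucene")); // only a.txt matches
    }
}
```

Every search re-reads every byte of every file, which is why this approach degrades badly as the data set grows.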

(2) Full-text Search

User search ----> Index library (generated in advance) ----> Documents

Full-text search means that an indexing program scans every word in a text and builds an index entry for each word, recording how many times and where the word appears. When the user searches, the search program looks up this pre-built index and returns the matching results. The process is similar to looking up a character through a dictionary's index.

We extract part of the information from the unstructured data and reorganize it so that it has some structure, then search that structured data, achieving a relatively fast search. This information, extracted from unstructured data and then reorganized, is called an index .

You can use Lucene to implement full-text search.

Lucene usage scenarios:

  • Providing a full-text search implementation for database data within an application
  • Developing standalone search engine services and systems

Features of Lucene:

  1. Stable, high indexing performance
  • Can index more than 150 GB of data per hour
  • Small memory footprint: as little as 1 MB of heap is needed
  • Incremental indexing is as fast as batch indexing
  • The index size is about 20%~30% of the size of the indexed text

  2. Efficient, accurate, high-performance search algorithms
  • Good ranking of search results
  • Powerful query support: phrase queries, wildcard queries, proximity queries, range queries, etc.
  • Field-level search support (such as title, author, content)
  • Sorting by any field
  • Merging of results from multiple indexes

  3. Cross-platform

Lucene architecture:


2.1 Index and search flowchart


Green indicates the indexing process: the original content to be searched is indexed to build the index library. The indexing process is:
determine the original content to be searched -> collect documents -> create documents -> analyze documents -> index documents

Red indicates the search process: content is searched from the index library. The search process is:
the user enters a query through the search interface -> create a query -> execute the search against the index library -> render the search results

2.2 Create an index

Document:

What the user provides is a record, which can be a text file, a string, a record in a database table, etc. Once a record is indexed, it is stored in the index file in the form of a Document, and search results are likewise returned to the user as a list of Documents.

Field

A document can contain multiple pieces of information. For example, an article can carry fields such as "title", "body", and "last modified time"; these pieces of information are stored in the Document as Fields.

A Field has two notable attributes: stored and indexed. The store attribute controls whether the Field's value is saved in the index; the index attribute controls whether the Field is indexed (i.e. searchable).

Term:

A Term is the smallest unit of search. It represents a word in a document and consists of two parts: the word itself and the name of the Field in which it appears.

Creation process

  1. Obtain the original documents: query the data to be indexed from the MySQL database with a SQL statement.
  2. Create Document objects: turn the query results into Document objects that Lucene can work with. The original content is obtained in order to index it; before indexing, each original record must be turned into a document whose Fields correspond to the columns of the table.
  3. Analyze the documents: the content of each Field is analyzed. Analysis extracts words from the original content, converts letters to lowercase, removes punctuation, removes stop words, and so on, producing the final tokens; a token can be understood as a single word.
  4. Create the index: index the tokens produced by analyzing all documents. The purpose of indexing is search: in the end, only the indexed tokens need to be searched in order to find the Documents that contain them.
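As a rough illustration of step 3, here is a toy analysis pipeline in plain Java. It only mimics the idea (split, lowercase, strip punctuation, drop stop words); it is not Lucene's actual analyzer, whose tokenization rules and stop-word lists differ:

```java
import java.util.*;
import java.util.stream.*;

public class SimpleAnalyzer {
    // A toy stop-word list; real analyzers ship much larger ones.
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "is", "and", "of");

    // Mimics the analysis step: split on whitespace, lowercase,
    // strip punctuation, drop stop words and empty tokens.
    public static List<String> analyze(String text) {
        return Arrays.stream(text.split("\\s+"))
                     .map(String::toLowerCase)
                     .map(t -> t.replaceAll("\\p{Punct}", ""))
                     .filter(t -> !t.isEmpty())
                     .filter(t -> !STOP_WORDS.contains(t))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(analyze("Lucene is a search library."));
        // -> [lucene, search, library]
    }
}
```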

PS: Creating an index means indexing the tokens so that documents can be found through words. This index structure is called an inverted index structure.


2.3 Inverted Index

The inverted index records, for each term, which documents it appears in and its positions within them, so the documents containing a term (and the positions where it occurs) can be located quickly.

Creating an inverted index involves the following steps:

1) Create a document list:
Lucene first numbers the original documents (assigning each a DocID), forming the document list.

2) Create the inverted index list:
Segment the text of each document to obtain terms (the words produced by word segmentation). Number the terms and build an index keyed on them, then record, for each term, the numbers of all documents that contain it (along with other information).

Search process :
When the user enters a query, the input is first segmented to obtain the terms the user wants to search for, and these terms are matched against the inverted index list. Finding the terms yields the numbers of all documents that contain them, and those numbers are then used to fetch the documents from the document list.
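The document list, inverted index list, and search process above can be sketched with a plain Java map from term to document IDs. This is a toy model: Lucene's real postings lists also record positions and frequencies, and live on disk.

```java
import java.util.*;

public class MiniInvertedIndex {
    // Inverted index list: term -> sorted set of IDs of documents containing it
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    // Document list: DocID is simply the position in this list
    private final List<String> docs = new ArrayList<>();

    // Steps 1-2: number the document, then record each of its terms in the postings.
    public int add(String text) {
        int docId = docs.size();
        docs.add(text);
        for (String term : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Search: segment the query the same way, intersect the posting lists of
    // the terms, then fetch the matching documents from the document list.
    public List<String> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : query.toLowerCase().split("\\s+")) {
            TreeSet<Integer> ids = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) result = new TreeSet<>(ids);
            else result.retainAll(ids);
        }
        List<String> hits = new ArrayList<>();
        if (result != null) for (int id : result) hits.add(docs.get(id));
        return hits;
    }

    public static void main(String[] args) {
        MiniInvertedIndex index = new MiniInvertedIndex();
        index.add("java developer beijing");
        index.add("python developer shanghai");
        index.add("java architect beijing");
        System.out.println(index.search("java beijing"));
        // -> [java developer beijing, java architect beijing]
    }
}
```

Note that the expensive work (tokenizing and recording terms) happens once at indexing time; each search is just a few map lookups, which is the whole point of the inverted index.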

2.4 Query Index

Querying the index is the search process: the user enters keywords, the index is searched by those keywords, and the corresponding documents are found through the index.

  1. Create a user interface where the user enters keywords
  2. Create a query, specifying the field name and keywords to search
  3. Execute the query
  4. Render the results

3 Lucene in practice

Build an index library of job postings and retrieve data from the index library.

3.1 Development environment

Create a SpringBoot project

Import dependencies

<dependencies>
        <!-- web dependency -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <!-- test dependency -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <!-- Lombok tooling -->
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.18.4</version>
            <scope>provided</scope>
        </dependency>
        <!-- hot deployment -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-devtools</artifactId>
            <optional>true</optional>
        </dependency>
        <!--mybatis-plus-->
        <dependency>
            <groupId>com.baomidou</groupId>
            <artifactId>mybatis-plus-boot-starter</artifactId>
            <version>3.3.2</version>
        </dependency>
        <!-- JPA annotations for POJO persistence -->
        <dependency>
            <groupId>javax.persistence</groupId>
            <artifactId>javax.persistence-api</artifactId>
            <version>2.2</version>
        </dependency>
        <!-- MySQL driver -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <scope>runtime</scope>
        </dependency>
        <!-- Lucene core and analyzer packages -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>4.10.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
            <version>4.10.3</version>
        </dependency>
        <dependency>
            <groupId>org.testng</groupId>
            <artifactId>testng</artifactId>
            <version>RELEASE</version>
            <scope>test</scope>
        </dependency>
        <!-- IK Chinese analyzer -->
        <dependency>
            <groupId>com.janeluo</groupId>
            <artifactId>ikanalyzer</artifactId>
            <version>2012_u6</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <!-- compiler plugin -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>11</source>
                    <target>11</target>
                    <encoding>utf-8</encoding>
                </configuration>
            </plugin>
            <!-- packaging plugin -->
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <executions>
                    <execution>
                        <goals>
                            <goal>repackage</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

Create a boot class

@SpringBootApplication
@MapperScan("com.lagou.lucene.mapper")
public class LuceneApplication {

    public static void main(String[] args) {
        SpringApplication.run(LuceneApplication.class, args);
    }

}

Configure yml file

server:
  port: 9000
spring:
  application:
    name: lagou-lucene
  datasource:
    driver-class-name: com.mysql.jdbc.Driver
    url: jdbc:mysql://localhost:3306/es?useUnicode=true&characterEncoding=utf8&serverTimezone=UTC
    username: root
    password: 19970821

# enable underscore-to-camel-case column mapping
mybatis-plus:
  configuration:
    map-underscore-to-camel-case: true

Create entity class, mapper, service

Entity class:

@Data
@Table(name = "job_info")
public class JobInfo {
  @Id
  private long id;
  private String companyName;
  private String companyAddr;
  private String companyInfo;
  private String jobName;
  private String jobAddr;
  private String jobInfo;
  private int salaryMin;
  private int salaryMax;
  private String url;
  private String time;
}

mapper:

public interface JobInfoMapper extends BaseMapper<JobInfo> {
}

service

@Service
public class JobInfoServiceImpl implements JobInfoService{

    @Autowired
    private JobInfoMapper jobInfoMapper;

    @Override
    public JobInfo selectById(long id) {
        return jobInfoMapper.selectById(id);
    }

    @Override
    public List<JobInfo> selectAll() {
        QueryWrapper<JobInfo> queryWrapper = new QueryWrapper<>();
        List<JobInfo> jobInfoList = jobInfoMapper.selectList(queryWrapper);
        return jobInfoList;
    }
}

3.2 Create Index

// The following test methods live in a @SpringBootTest test class.
@Autowired
private JobInfoService jobInfoService;

/**
 * Create the index
 */
@Test
public void create() throws Exception {
    // 1. Specify where the index files are stored; on disk an index is a set of structured files
    Directory directory = FSDirectory.open(new File("D:/class/index"));
    // 2. Configure the version and the analyzer
    Analyzer analyzer = new IKAnalyzer();
    IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
    // 3. Create the IndexWriter object, which builds the index
    IndexWriter indexWriter = new IndexWriter(directory, config);
    // Delete any existing index library first
    indexWriter.deleteAll();
    // 4. Obtain the index source / original data
    List<JobInfo> jobInfoList = jobInfoService.selectAll();
    // 5. Iterate over jobInfoList, creating one Document object per record
    for (JobInfo jobInfo : jobInfoList) {
        // Create the Document object
        Document document = new Document();
        // Create Field objects and add them to the document
        document.add(new LongField("id", jobInfo.getId(), Field.Store.YES));
        // Tokenized, indexed, and stored
        document.add(new TextField("companyName", jobInfo.getCompanyName(), Field.Store.YES));
        document.add(new TextField("companyAddr", jobInfo.getCompanyAddr(), Field.Store.YES));
        document.add(new TextField("companyInfo", jobInfo.getCompanyInfo(), Field.Store.YES));
        document.add(new TextField("jobName", jobInfo.getJobName(), Field.Store.YES));
        document.add(new TextField("jobAddr", jobInfo.getJobAddr(), Field.Store.YES));
        document.add(new TextField("jobInfo", jobInfo.getJobInfo(), Field.Store.YES));
        document.add(new IntField("salaryMin", jobInfo.getSalaryMin(), Field.Store.YES));
        document.add(new IntField("salaryMax", jobInfo.getSalaryMax(), Field.Store.YES));
        document.add(new StringField("url", jobInfo.getUrl(), Field.Store.YES));
        // Add the document to the index library
        indexWriter.addDocument(document);
    }
    // Release resources
    indexWriter.close();
    System.out.println("create index success!");
}

Characteristics of Field:

A Document is the carrier of Fields: a Document is composed of multiple Fields. A Field has two parts, a name and a value; the Field value is the content to be indexed and searched.

Whether to tokenize

Yes: tokenize (segment) the Field value; the purpose of tokenizing is indexing. Examples: product name, product description. Users query such content by entering keywords, and because the content is varied, it must be tokenized before it can be indexed.
No: do not tokenize. Examples: order number, ID card number. These are a single whole that loses its meaning when split, so no tokenization is needed.

Whether to index

Yes: index the tokens obtained from tokenizing the Field value (or the entire Field value) and store them in the index. The purpose of indexing is search. Examples: product name and product description are tokenized and indexed; order numbers and ID card numbers are indexed as a whole. Anything that can serve as a user query condition needs to be indexed.
No: do not index. For example, a product image path will never be used as a query condition, so it needs no index.

Whether to store

Yes: store the Field value in the Document. Examples: product name, product price. Anything that will later be displayed to users on the search results page must be stored.
No: do not store. For example, a long product description does not need to appear directly on the results page, so it need not be stored; it can be fetched from the relational database when needed.

Commonly used Lucene 4.x Field types include:

  • StringField: indexed as a whole without tokenizing, e.g. order number, URL
  • TextField: tokenized and indexed, e.g. title, body text
  • IntField / LongField / FloatField / DoubleField: numeric fields, e.g. prices, IDs
  • StoredField: stored only, never indexed, e.g. an image path

Each constructor takes Field.Store.YES or Field.Store.NO to control whether the value is stored.

3.3 Query Index

@Test
public void query() throws Exception {
    // 1. Specify where the index files are stored; on disk an index is a set of structured files
    Directory directory = FSDirectory.open(new File("D:/class/index"));
    // 2. Create the IndexReader object
    IndexReader indexReader = DirectoryReader.open(directory);
    // 3. Create the search object, IndexSearcher
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // Use a term query to find all documents whose company name contains "北京"
    Query query = new TermQuery(new Term("companyName", "北京"));
    TopDocs topDocs = indexSearcher.search(query, 100);
    // Number of documents matching the query
    int totalHits = topDocs.totalHits;
    System.out.println("matching documents: " + totalHits);
    // The hits; each ScoreDoc wraps a document id
    ScoreDoc[] scoreDocs = topDocs.scoreDocs;
    for (ScoreDoc scoreDoc : scoreDocs) {
        // Document id
        int docId = scoreDoc.doc;
        // Fetch the document object by its id
        Document doc = indexSearcher.doc(docId);
        System.out.println("id:" + doc.get("id"));
        System.out.println("companyName:" + doc.get("companyName"));
        System.out.println("companyAddr:" + doc.get("companyAddr"));
        System.out.println("companyInfo:" + doc.get("companyInfo"));
        System.out.println("jobName:" + doc.get("jobName"));
        System.out.println("jobInfo:" + doc.get("jobInfo"));
        System.out.println("*******************************************");
    }
    // Release resources
    indexReader.close();
}
