Online data cleaning-an interesting algorithm

Article Directory

demand:

I look at you
You look at me
two functions: a. I can check who I look at, b. I look at who I look at.
Historical design:
MySQL database is divided into 100 tables. When I look at you, a piece of data is stored in my table: I look at you status is 0, and a piece of data is inserted into your table and I was looked at with 1 status. When
you look at me, you insert data with state 0 into your table, and insert data with state 1 into my table, that is: one look at inserts data from both sides.

For example: My id is 111111 and your id is 222222. When I look at you, the table looks like this

My watch LOOK_AT_11

MY_IDYOU_IDSTATUSCREATE_TIMEUPDATE_TIME11111122222202005-06-04 13:45:44

Your watch LOOK_AT_22

MY_IDYOU_IDSTATUSCREATE_TIMEUPDATE_TIME22222211111112005-06-04 13:45:44

When you look at me, and I look at you, the situation is like this.

My watch LOOK_AT_11

Your watch LOOK_AT_22

Insert picture description here

Requirement 2: Now there is a new requirement. When I look at you and you look at me, it means that the two people see the right eye, and the product manager should select the person who sees the right eye and show it to the user. On the basis of the current design, we have added a new status, STATUS=2 is the right eye. When we look at each other, insert a piece of data with STATUS=2, because I need to query the right user The query is sorted according to UPDATE_TIME, so the data is like this after going online.
(My id: 333333, your id: 444444, we look at each other)

My watch LOOK_AT_33

Insert picture description here

Your watch LOOK_AT_22

Insert picture description here

So the question is, what should I do with the old data?
We need to write a job to insert a pair of STATUS=2 matching data for users who have looked at each other in history.

READY===========================GO!!!

First edition:

1. Cycle through 100 tables, each time through each table, take out all the beLookEdAtlists that I looked at at and the iLookAtLists that I looked at at yours.

2. Convert beLookEdAtlist to map, structure: Map<myId,youIdList>, purpose: to judge whether there is a person who looks at me, continue to judge if there is someone who looks at me, if not, the next cycle is used to reduce judgment.

3. Loop beLookEdAtlist and iLookAtList to find out the data assembly with the same myid and youId. Insert data


    public void execute()  {

        for (int table = 0; table<100; table++) {
            //param 控制第几张表
            String dbTable = "LOOK_AT" + table;
            Integer lookType = LookAtEnum.ILOOKATYOU.getType();
            List<LookAtEmpty> iLookAtList = Dao.findByLookAtType(dbTable, lookType);
            if (CollectionUtils.isEmpty(iLookAtList)) {
               return;
            }


            Integer beLookType = LookActionEnum.BE_LookED.getType();
            List<LookAtEmpty> beLookEdAtlist  = Dao.findByLookAtType(dbTable, beLookType);
            if (CollectionUtils.isEmpty(beLookEdAtlist )) {
               return;
            }


            Map<Long, List<Long>> map = new HashMap<>();
            beLookEdAtlist.forEach(lookAtEmpty -> {

                if (map.get(lookAtEmpty.getUserId()) == null) {
                    List<Long> list = new ArrayList<>();
                    list.add(lookAtEmpty.getOtherUserId());
                    map.put(lookAtEmpty.getUserId(), list);
                } else {
                    List<Long> existList = map.get(lookAtEmpty.getUserId());
                    existList.add(lookAtEmpty.getOtherUserId());
                    map.put(lookAtEmpty.getUserId(), existList);
                }
            });


            List<LookAtEmpty> resList = new ArrayList<>();
            for (LookAtEmpty lookAtEmpty : iLookAtList) {

                List<Long> beLookEdAtUserIds = map.get(lookAtEmpty.getUserId());
                if (CollectionUtils.isEmpty(beLookEdAtUserIds) || !beLookEdAtUserIds.contains(lookAtEmpty.getOtherUserId())) {
                    continue;
                }


                for (LookAtEmpty beLookedAtEmpty : beLookEdAtlist ) {
                    if (beLookedAtEmpty.getUserId() == beLookUser.getUserId() && beLookedAtEmpty.getOtherUserId() == beLookedAtEmpty.getOtherUserId()) {

                        Date updateTime = beLookUser.getCreateTime().before(beLookedAtEmpty.getCreateTime()) ? beLookedAtEmpty.getCreateTime() : beLookUser.getCreateTime();
                        beLookUser.setCreateTime(updateTime);
                        beLookUser.setUpdateTime(updateTime);
                        resList.add(beLookUser);

                    }

                }
            }
        }



    }

After reading my code, the boss said: "In fact, it can be paged."
I said: "No, because the relationship may exist on the next page." The
boss said: "You can page..."
I said: "No!" The
boss said: "You can page ......"
I said: "No!"

I still let the boss finish talking before I have the second version of the code.

second edition:

1. Circulate 100 tables, in each table, use the database limit to page out the data that I look at you, all of them are taken out into the memory, and stored with map, the key is myId-youId, and the value is Createtime. Because I need to take out the largest Createtime from the two data of my look at you and your look at me, and put it in the Createtime and UpdateTime of the corresponding data.

2. Then page out the data that I was looked at by you, and look for Createtime in the map. If there is, it means that the two have met each other, and you need to insert data

3. Insert data

 public void execute() {


        int pageSize = 500;
        int pageNum = 0 ;


        for (int table = 0; table<100; table++){


            String dbTable ="LOOK_AT"+table;

            Map<String,Date> lookAtMap = new HashMap<>();

            for (int i =0;i<= 400;i++){

                List<LookAtEmpty> lookAtList = Dao.findByLookAtType(dbTable,LookAtEnum.ILOOKATYOU.getType(),pageNum*pageSize,pageSize);
                if (CollectionUtils.isEmpty(lookAtList)  ){
                    break;
                }

                for (LookAtEmpty lookAtEmpty :lookAtList){
                    String key = lookAtEmpty.getMyId()+"-"+lookAtEmpty.getYouId();
                    lookAtMap.put(key,lookAtEmpty.getCreateTime());
                }


                if( lookAtList.size()<pageSize){
                    break;
                }
                pageNum++;
            }

            List<LookAtEmpty> idResList = new ArrayList<>();


            int res=0;
            int pageNum1 =0;
            for (int i =0;i<=400;i++){
                List<LookAtEmpty> beLookList = Dao.findByLookAtType(dbTable,LookAtEnum.IBELOOKEDBYYOU.getType(),pageNum1*pageSize,pageSize);
                if (CollectionUtils.isEmpty(beLookList)){
                    break;
                }
                for (LookAtEmpty lookAtEmpty :beLookList){
                    String key = lookAtEmpty.getMyId()+"-"+lookAtEmpty.getYouId();

                    if (lookAtMap.get(key) !=null){

                        Date lookDate = lookAtMap.get(key);
                        Date updateTime = lookAtEmpty.getCreateTime().before(lookDate) ? lookDate:userRelation.getCreateTime();


                        LookAtEmpty u = new LookAtEmpty(
                                lookAtEmpty.getMyId(),
                                lookAtEmpty.getYouId(),
                                LookAtEnum.DUIYANER.getType(),
                                updateTime,
                                updateTime);

                        idResList.add(u);

                    }

                }



                if (CollectionUtils.isNotEmpty(idResList)){
                    res += Dao.insertLookAtEmptyList(dbTable,idResList);

                }


                idResList.clear();
                pageNum1++;

                if (beLookList.size()<pageSize){
                    break;
                }

            }


             try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {

                log.error("sss,",e);
            }

        }





    }

The boss gave a few suggestions after reading it: 1.
There is a termination condition in the for loop:

if(lookAtList.size() < pageSize){
	break;
}

So there is no need to limit 400 in the for loop

2. The query paging of the findByLookAtType method uses the limit of MySQL. The lower the limit is in the paging, the lower the efficiency, so it is better to query in reverse order, each record the createtime of the last data, and the next query the data of <createtime Yes, createtime plus index.

Third edition:

  public void execute() {


        int pageSize = 500;

        for (int table = 0; table<100; table++){


            String dbTable ="LOOK_AT"+table;

            Map<String,Date> lookAtMap = new HashMap<>();

            Date iLookCreateTime = new Date();
            for (int i =0;;i++){

                List<LookAtEmpty> lookAtList = Dao.findByTime(dbTable,LookAtEnum.ILOOKATYOU.getType(),iLookCreateTime,pageSize);
                if (CollectionUtils.isEmpty(lookAtList)  ){
                    break;
                }

                for (LookAtEmpty lookAtEmpty :lookAtList){
                    String key = lookAtEmpty.getMyId()+"-"+lookAtEmpty.getYouId();
                    lookAtMap.put(key,lookAtEmpty.getCreateTime());
                }


                if( lookAtList.size()<pageSize){
                    break;
                }
                iLookCreateTime = lookAtList.get(lookAtList.size()-1).getCreateTime();
            }

            List<LookAtEmpty> idResList = new ArrayList<>();


            int res=0;
            int pageNum1 =0;
            Date beLookDateTime = new Date();
            for (int i =0;;i++){

                List<LookAtEmpty> beLookList = Dao.findByTime(dbTable,LookAtEnum.IBELOOKEDBYYOU.getType(),beLookDateTime,pageSize);
                if (CollectionUtils.isEmpty(beLookList)){
                    break;
                }
                for (LookAtEmpty lookAtEmpty :beLookList){
                    String key = lookAtEmpty.getMyId()+"-"+lookAtEmpty.getYouId();

                    Date lookDate = lookAtMap.get(key);
                    if (lookDate !=null){


                        Date updateTime = lookAtEmpty.getCreateTime().before(lookDate) ? lookDate:userRelation.getCreateTime();


                        LookAtEmpty u = new LookAtEmpty(
                                lookAtEmpty.getMyId(),
                                lookAtEmpty.getYouId(),
                                LookAtEnum.DUIYANER.getType(),
                                updateTime,
                                updateTime);

                        idResList.add(u);

                    }

                }



                if (CollectionUtils.isNotEmpty(idResList)){
                    res += Dao.insertLookAtEmptyList(dbTable,idResList);

                }

                beLookDateTime = beLookList.get(beLookList.size()-1).getCreateTime();

                idResList.clear();


                if (beLookList.size()<pageSize){
                    break;
                }

            }


             try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {

                log.error("sss,",e);
            }

        }

    }

Conclusion:

In this way, the data structure has improved a lot of efficiency, it's interesting!