Python rules 8: BatchSimpleGet, a rule set for fetching data in batches with simple network requests


Use asynchronous requests to fetch data from a URL according to a slice_list built from the ids, and save the results to a folder.


1 Rule set design

The code name in the database is BatchSimpleGet.

1.1 Scenario

The data is numbered sequentially by an id field; it needs to be fetched with network GET requests and stored in a folder. Because the target source holds a very large amount of data, we want the reads to behave like a cache: if the number/size of files in the current folder reaches the limit, the cache is considered full and the read is skipped.
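The cache check can be sketched as follows. This is only an illustration under stated assumptions: the limit is taken to be a count of .pkl files, the log file is excluded from the count, and the helper name `cache_is_full` and the demo file names are hypothetical, not part of the original rule set.

```python
import os
import tempfile


def cache_is_full(folder_path, max_files_in_folder):
    """Return True when the folder already holds the allowed number of .pkl files."""
    # Count only result files; read.log is not treated as part of the cache (assumption).
    pkl_files = [name for name in os.listdir(folder_path) if name.endswith('.pkl')]
    return len(pkl_files) >= max_files_in_folder


# Small demonstration in a throwaway folder (file names are made up).
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ('10000_10500.pkl', '10500_11000.pkl', 'read.log'):
        open(os.path.join(demo_folder, name), 'w').close()
    full_at_2 = cache_is_full(demo_folder, 2)  # two .pkl files, limit 2
    full_at_3 = cache_is_full(demo_folder, 3)  # two .pkl files, limit 3
```

When the check returns True, the rule set would skip the read for this run and wait for other programs to consume files from the folder.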

In addition, we do not need to know the latest valid id in advance; it can be found by trial and error.

1.2 Requirements/Assumptions

  • 1 Result files are stored in pkl format
  • 2 Other programs will come to read/delete the files in the result folder
  • 3 The doc_ids within one step are stored as one file
  • 4 The specified folder contains a default log file, whose name ends with read.log and which does not interfere with the other files
  • 5 The web request always returns a response, and the success/failure execution status is in the status field of the returned json body
  • 6 The queried data is in the data field of the returned json body
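Assumptions 1, 5 and 6 can be illustrated with a small sketch: check the status field, then pickle the data field into one file per slice. The function name `save_slice_if_ok` and the file naming scheme are hypothetical; the original post does not show how files are named.

```python
import os
import pickle
import tempfile


def save_slice_if_ok(resp_json, folder_path, slice_name):
    """Save resp_json['data'] as <slice_name>.pkl when resp_json['status'] is truthy."""
    if not resp_json.get('status'):
        return None  # failed request: nothing is written
    file_path = os.path.join(folder_path, slice_name + '.pkl')
    with open(file_path, 'wb') as f:
        pickle.dump(resp_json.get('data'), f)
    return file_path


# Demonstration with made-up response bodies.
with tempfile.TemporaryDirectory() as demo_folder:
    ok_path = save_slice_if_ok({'status': True, 'data': [1, 2, 3]}, demo_folder, '10000_10500')
    failed_path = save_slice_if_ok({'status': False, 'data': None}, demo_folder, '10500_11000')
    with open(ok_path, 'rb') as f:
        loaded = pickle.load(f)
```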

1.3 Rules

1.3.1 Entry rules

  • 1 Pass in a usable folder path, the desired id range and read step, the maximum number of files allowed in the folder, and url_template.

1.3.2 Intermediate rules

  • 2 Generate the full slice_list to be read
  • 3 Read the completed slices from the log
  • 4 Based on the number of files currently in the folder and the completed slices, decide whether to request data or skip, and generate the actual slice_list to request
  • 5 Write the requests and the returned results to the log
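The intermediate rules above can be sketched roughly as follows. All function names are hypothetical, and two details are assumptions rather than facts from the original: slices are half-open (start, end) pairs, and the log stores one JSON object per line.

```python
import json
import os
import tempfile


def gen_slice_list(id_min, id_max, read_step):
    """Split [id_min, id_max) into (start, end) pairs covering at most read_step ids."""
    return [(start, min(start + read_step, id_max))
            for start in range(id_min, id_max, read_step)]


def append_log(log_path, slice_pair, status):
    """Append one request result to the log (assumed format: one JSON object per line)."""
    with open(log_path, 'a', encoding='utf-8') as f:
        f.write(json.dumps({'slice': list(slice_pair), 'status': status}) + '\n')


def read_done_slices(log_path):
    """Return the set of slices that the log records as successful."""
    done = set()
    if not os.path.exists(log_path):
        return done
    with open(log_path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                rec = json.loads(line)
                if rec['status']:
                    done.add(tuple(rec['slice']))
    return done


def plan_slices(full_slices, done_slices, files_in_folder, max_files_in_folder):
    """Keep unfinished slices, limited by the free slots left in the folder."""
    free_slots = max(max_files_in_folder - files_in_folder, 0)
    pending = [s for s in full_slices if s not in done_slices]
    return pending[:free_slots]


with tempfile.TemporaryDirectory() as demo_folder:
    log_path = os.path.join(demo_folder, 'read.log')
    full_slices = gen_slice_list(10000, 12000, 500)
    append_log(log_path, full_slices[0], True)   # first slice succeeded
    append_log(log_path, full_slices[1], False)  # second slice failed, so it is retried
    done = read_done_slices(log_path)
    todo = plan_slices(full_slices, done, files_in_folder=1, max_files_in_folder=1000)
    todo_when_full = plan_slices(full_slices, done, files_in_folder=1000, max_files_in_folder=1000)
```

With id_min = 10000, id_max = 12000 and read_step = 500, this yields four slices; an empty planned list corresponds to the "skip" branch of rule 4.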

1.3.3 Exit rules

  • 6 Return some summary data of the current execution

1.4 Dependencies

Asynchronous code is still a bit cumbersome, but it is done.

Because requesting data over the web is IO-bound and time-consuming, it has to be done asynchronously. However, integrating asynchronous and synchronous programs is troublesome, so an independent SimpleAPI application is responsible for issuing the asynchronous requests and returning the results.
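To give a feel for the asynchronous part, here is a minimal sketch of issuing several requests concurrently with asyncio. It is not the SimpleAPI application itself: `fetch_all` is a hypothetical helper, `fake_get` is a stand-in that never touches the network, and the demo URLs are made up. In practice `fake_get` would be replaced by a real asynchronous HTTP client call.

```python
import asyncio


async def fake_get(url):
    """Stand-in for a real HTTP GET; returns a json-like dict without using the network."""
    await asyncio.sleep(0.01)  # simulate IO wait
    return {'status': True, 'data': {'url': url}}


async def fetch_all(urls, get_func, max_concurrency=5):
    """Run get_func for every url concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url):
        async with semaphore:
            try:
                return await get_func(url)
            except Exception as exc:
                # Mirror the assumed response shape so callers can always check 'status'.
                return {'status': False, 'msg': str(exc), 'data': None}

    # gather returns results in the same order as the input urls.
    return await asyncio.gather(*(fetch_one(u) for u in urls))


demo_urls = ['http://example.invalid/doc/%s' % i for i in (10000, 10500)]
results = asyncio.run(fetch_all(demo_urls, fake_get))
```

Keeping the asynchronous loop behind a single entry point such as `asyncio.run` is one way to let otherwise synchronous code call it, which is the same separation the independent application provides.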

2 Effect

# User-defined parameters

folder_path = './test_001/'
max_files_in_folder = 1000

id_min = 10000
id_max = 12000
read_step = 500

url_template = 'http:/xxx:%s'

# Public test address
target_url = 'http://xxx:20002/api/'
# -- Package the parameters
cur_dict = {}
cur_dict['input_dict'] = {}
cur_dict['input_dict']['folder_path'] = folder_path
cur_dict['input_dict']['max_files_in_folder'] = max_files_in_folder
cur_dict['input_dict']['id_min'] = id_min
cur_dict['input_dict']['id_max'] = id_max
cur_dict['input_dict']['read_step'] = read_step
cur_dict['input_dict']['url_template'] = url_template
cur_dict['input_dict']['target_url'] = target_url

fs.run_ruleset_v2('simple_get', BatchSimpleGet_df, cur_dict, fs)
{'name': 'simple_get',
 'status': True,
 'msg': 'ok',
 'duration': 14,
 'data': {'total_success_recs': 4,
  'cur_success_recs': 0,
  'cur_start_dt': None,
  'cur_end_dt': None,
  'cur_mean_seconds': None,
  'data_path': './test_001/',
  'log_path': './test_001/read.log',
  'cur_slice_list': []}}

According to the design, slices that have already been read are not read again. We tried adding some new ones, and the results met expectations. The whole process did not require much thought, just a little time.

3 Upload and save


# rr_id 
BatchSimpleGet_df.loc[:,'rr_id'] = fs.gen_md5_for_columns(BatchSimpleGet_df, ['rule_set','rule_id'],fs)

Save to database

# Insert
fs.mongo_opr_insert_df_slice(func_lmongo, 'rules', 'rules', BatchSimpleGet_df, 'rr_id', fs)
# Update
# fs.mongo_opr_update_df_slice(func_lmongo, 'rules', 'rules', BatchSimpleGet_df, 'rr_id', fs)

Using it takes only three steps:

  • 1 Modify parameters
  • 2 Pull the rule set
  • 3 Call like a function

4 Summary

Overall it is still very much in line with expectations:

  • 1 Many rule functions are universal and can be used with a little modification
  • 2 The time spent and the complexity are about right

What may need further attention:

  • 1 How to further extract the universal functions (this time there happen to be two rule sets with similar functions)
  • 2 Whether the complexity of a function can be further standardized (when to keep it as one function, and when to split it into two)