An analysis of the data stored in the graph database file

Analysis of graph_test.py and import_hg.py files

Previously we studied re_web_scraping.py, a crawler script that scrapes CVE information from a web page and stores it in MongoDB. Today we will study the import_hg.py and graph_test.py files. The longest and most difficult one, graph_generator4.py, is left for the next article.

graph_test.py source code

# -*- coding: utf-8 -*-
from PyHugeGraph import PyHugeGraphClient
import json

if __name__ == '__main__':
    hg = PyHugeGraphClient.HugeGraphClient("http://127.0.0.1", "8080", "hugegraph")
    # instantiate the HugeGraphClient class via its constructor
    print hg.graph
    # hg.graph is the 'hugegraph' string passed to the constructor above

    pk = json.loads(hg.get_graph_allpropertykeys().response)
    # fetch all PropertyKeys

    vertexes = hg.get_all_vertelabels().response
    # fetch all VertexLabels
    vertexes = json.loads(vertexes)
    # convert vertexes from a JSON string into a dict

    for v in vertexes['vertexlabels']:  # iterate over all vertex labels (i.e. every vertex type)
        ret = json.loads(hg.get_vertex_by_condition(label=v['name']).response)
        # conditional query by label name, i.e. all vertices of one type
        for node in ret['vertices']:    # delete vertices by ID, so every run of graph_test.py wipes the data in hugegraph!
            hg.delete_vertex_by_id(node['id']).response  # delete_vertex_by_id does not seem to produce any output

    edges = hg.get_all_edgelabels().response    # fetch all EdgeLabels
    edges = json.loads(edges)
    ret2 = json.loads(hg.get_edge_by_condition(label='vulnerability').response)  # conditional query for edges whose label is 'vulnerability'; oddly, the EdgeLabels result above shows no label attribute on edges
    paths = hg.traverser_all_paths(source='7:A-1', target='7:F-2', direction='OUT', max_depth='10').response  # find paths given source vertex, target vertex, direction, (optional) edge type and maximum depth
    paths = eval(paths)  # paths is originally a string; eval turns it into a dict
    for p in paths['paths']:
        print 'path:{}'.format(p['objects'])

PyHugeGraph library

This is a library for connecting to HugeGraph. Its source code is very concise. The author is tangjiawei, and the GitHub repository is tanglion/PyHugeGraph (a python API for the hugegraph database).

hg = PyHugeGraphClient.HugeGraphClient("http://127.0.0.1","8080", "hugegraph")

This statement instantiates the HugeGraphClient class from the library via its constructor. After initialization, we are effectively connected to the hugegraph database.

The HugeGraphClient class defines many methods, such as the get_graph_allpropertykeys method used here. Its source code from the library is shown below.

    def get_graph_allpropertykeys(self):
        """
        Fetch all PropertyKeys.
        :return:
        """
        url = self.host + "/graphs" + "/" + self.graph + "/schema/propertykeys"
        response = requests.get(url)
        res = Response(response.status_code, response.content)
        return res

As the code shows, this method fetches all PropertyKeys in the graph database by calling the HugeGraph REST backend and wrapping the returned value. The principle is very simple, but writing a wrapper function for every endpoint is no small amount of work; credit to the author.
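As a sketch, we can rebuild by hand the URL that this wrapper requests. Note this assumes the host string already carries the port, which may differ from how the constructor actually combines its host and port arguments:

```python
# Rebuild the URL that get_graph_allpropertykeys() requests; the address and
# graph name below match the ones used in graph_test.py (assumed local server).
host = "http://127.0.0.1:8080"
graph = "hugegraph"
url = host + "/graphs" + "/" + graph + "/schema/propertykeys"
print(url)   # http://127.0.0.1:8080/graphs/hugegraph/schema/propertykeys

# With a live HugeGraph server you could then fetch the same JSON by hand:
# import requests
# print(requests.get(url).json()["propertykeys"])
```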

The returned value looks like this. I saved the result into a json file and formatted it with VS Code, so it is very readable.

{
    "propertykeys": [
        {
            "name": "entry",
            "data_type": "TEXT",
            "aggregate_type": "NONE",
            "user_data": {},
            "id": 7,
            "cardinality": "SINGLE",
            "properties": []
        },
        {
            "name": "accessVector",
            "data_type": "TEXT",
            "aggregate_type": "NONE",
            "user_data": {},
            "id": 6,
            "cardinality": "SINGLE",
            "properties": []
        },
        {
            "name": "pre_path",
            "data_type": "TEXT",
            "aggregate_type": "NONE",
            "user_data": {},
            "id": 9,
            "cardinality": "SINGLE",
            "properties": []
        },
        {
            "name": "reachgroup",
            "data_type": "TEXT",
            "aggregate_type": "NONE",
            "user_data": {},
            "id": 8,
            "cardinality": "SINGLE",
            "properties": []
        },
        {
            "name": "vul_port",
            "data_type": "TEXT",
            "aggregate_type": "NONE",
            "user_data": {},
            "id": 3,
            "cardinality": "SINGLE",
            "properties": []
        },
        {
            "name": "type",
            "data_type": "TEXT",
            "aggregate_type": "NONE",
            "user_data": {},
            "id": 2,
            "cardinality": "SINGLE",
            "properties": []
        },
        {
            "name": "accessLevel",
            "data_type": "INT",
            "aggregate_type": "NONE",
            "user_data": {},
            "id": 5,
            "cardinality": "SINGLE",
            "properties": []
        },
        {
            "name": "req_privilege",
            "data_type": "INT",
            "aggregate_type": "NONE",
            "user_data": {},
            "id": 4,
            "cardinality": "SINGLE",
            "properties": []
        },
        {
            "name": "name",
            "data_type": "TEXT",
            "aggregate_type": "NONE",
            "user_data": {},
            "id": 1,
            "cardinality": "SINGLE",
            "properties": []
        },
        {
            "name": "vh_pair",
            "data_type": "TEXT",
            "aggregate_type": "NONE",
            "user_data": {},
            "id": 10,
            "cardinality": "SINGLE",
            "properties": []
        }
    ]
}

We can also query the HugeGraph backend interface manually to check that the results are consistent.

[Screenshot: manual query of the propertykeys endpoint]

We can see that the content is the same, but the order of the attributes has changed. This is likely an artifact of serialization: JSON object members are unordered, so the order can change when the json module converts between JSON strings and dict objects.

Next, let's talk about the json module.

json module

With the json module we can convert between JSON strings and dict objects. Four functions are all we need: dump, dumps, load and loads.

dump and dumps convert dict objects into JSON strings.

load and loads parse JSON strings into dict objects. The name "load" is quite vivid: it reloads a string back into an object.

This is an example of dumps.

➜  ~ cat test.py    
import json
dict1 = {'wwuconix': 'yyds','wcx': 'yyds'}
print('type(dict1):',type(dict1))
dict1 = json.dumps(dict1)
print('type(dict1):',type(dict1))
➜  ~ python3 test.py
type(dict1): <class 'dict'>
type(dict1): <class 'str'>

What is the difference between dump and dumps? dump requires a file handle, into which it writes the generated JSON string.

Our usual goal when calling the json module is not simply to print JSON in the terminal; more often we want to store the result in a json file. We can do that in two ways, using dumps and dump respectively.

import json
dict1 = {'wwuconix': 'yyds','wcx': 'yyds'}
with open('test.json','w') as f:
    f.write(json.dumps(dict1))

and, with dump:

import json
dict1 = {'wwuconix': 'yyds','wcx': 'yyds'}
with open('test.json','w') as f:
    json.dump(dict1, f)

Using dump looks cleaner. However, dumps is still needed in some situations, for example when we only want the JSON string itself and don't want to write it to a file.

Load is similar to loads, so I won’t go into details.
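For completeness, here is a minimal round trip showing both directions, reusing the same dict as above (the temp-file path is just for the demo):

```python
import json
import os
import tempfile

data = {'wwuconix': 'yyds', 'wcx': 'yyds'}

# loads: parse a JSON string that is already in memory.
parsed = json.loads(json.dumps(data))
print(type(parsed))              # <class 'dict'>

# load: parse straight from an open file handle.
path = os.path.join(tempfile.gettempdir(), 'test.json')
with open(path, 'w') as f:
    json.dump(data, f)
with open(path) as f:
    loaded = json.load(f)
print(loaded == data)            # True
```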

The magical effect of eval

At the end of the program we see this line:

 paths = eval(paths)  # paths is originally a string; eval turns it into a dict

The comment notes that eval can convert a JSON-like string into a dict object. My previous impression of eval was limited to things like eval('1+1'); unexpectedly, it can handle more than that.

Let's take a look at the effect first.

➜  ~ cat test.py    
import json
dict1 = {'wwuconix': 'yyds','cwj': 'yyds'}
print('type(dict1):',type(dict1))
dict1 = json.dumps(dict1)
print('type(dict1):',type(dict1))
dict1 = eval(dict1)
print('type(dict1):',type(dict1))
➜  ~ python3 test.py
type(dict1): <class 'dict'>
type(dict1): <class 'str'>
type(dict1): <class 'dict'>

As you can see, it has the same effect as json.loads.
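That said, eval and json.loads are not fully interchangeable. JSON literals such as true, false and null are not Python names, so eval chokes on real JSON, while json.loads rejects Python-only syntax such as single quotes:

```python
import json

# JSON literals are not valid Python names, so eval fails on real JSON...
try:
    eval('{"ok": true}')
except NameError as e:
    print('eval failed:', e)

# ...while json.loads handles them fine.
print(json.loads('{"ok": true}'))    # {'ok': True}

# Conversely, json.loads rejects Python-style single quotes that eval accepts.
try:
    json.loads("{'ok': 1}")
except json.JSONDecodeError:
    print('json.loads failed')
print(eval("{'ok': 1}"))             # {'ok': 1}
```

So eval only happens to work here because the server response is both valid JSON and valid Python literal syntax.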

import_hg.py source code and analysis

# -*- coding: utf-8 -*-
from PyHugeGraph import PyHugeGraphClient
import ast
import json
import csv

def create_graph(hg, jfile, ejfile):    # hg is the connected database client; the files passed in are 'nodes.json' and 'edges.json'
    nodes = []
    edges = []
    with open(jfile, "r") as jf:    # read each line of nodes.json and append the processed vertex to the nodes list
        for l in jf.readlines():
            node = l.replace('\n', '')
            nodes.append(ast.literal_eval(node))    # ast.literal_eval turns each line's string into a dict
    print nodes  # all the vertices
    print hg.create_multi_vertex(nodes).response    # create multiple vertices at once; the return value is the ids of all created vertices

    print('-------------')

    with open(ejfile, "r") as ejf:  # read each line of edges.json and append the processed edge to the edges list
        for l in ejf.readlines():
            edge = l.replace('\n', '')
            edges.append(ast.literal_eval(edge))
    print edges  # all the edges
    print hg.create_multi_edge(edges).response  # create multiple edges at once; the return value resembles the edges' names, though not exactly

if __name__ == '__main__':
    hg = PyHugeGraphClient.HugeGraphClient("http://127.0.0.1", "8080", "hugegraph")  # connect to hugegraph
    create_graph(hg, "nodes.json", "edges.json")

    print hg.graph
    print hg.get_all_graphs().response

The function of this file is very simple.

Let's first look at what the nodes.json and edges.json files used by the program look like.

[Screenshot: contents of nodes.json]

This is the content of the nodes.json file. Each line stores one vertex in JSON-like format, and the lines are independent of each other, with no commas between them.

So the purpose of this program is to extract each line, store it in the nodes list, and finally create the vertices with the create_multi_vertex function from the PyHugeGraph library.

edges.json is handled similarly: the edge on each line is extracted into the edges list, and the edges are then created with the create_multi_edge function.
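The per-line parsing step can be sketched in isolation. The file name and the node structure below are made up for illustration; the real entries follow the schema shown in the screenshot:

```python
import ast
import os
import tempfile

# Write a tiny two-line file in the same one-dict-per-line format as nodes.json.
path = os.path.join(tempfile.gettempdir(), 'nodes_demo.json')
with open(path, 'w') as f:
    f.write("{'label': 'host', 'properties': {'name': 'A-1'}}\n")
    f.write("{'label': 'host', 'properties': {'name': 'F-2'}}\n")

nodes = []
with open(path) as f:
    for line in f:
        line = line.replace('\n', '')
        if line:
            nodes.append(ast.literal_eval(line))   # str -> dict, as in import_hg.py

print(len(nodes))                        # 2
print(nodes[0]['properties']['name'])    # A-1
```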

ast module

We see the ast module appear in the code:

nodes.append(ast.literal_eval(node))    # ast.literal_eval turns each line's string into a dict

As the comment points out, this is another way to turn a JSON-like string into a dict object. We saw above that eval can achieve the same effect; an introduction to ast.literal_eval follows.

Simply put, the ast module helps Python applications process abstract syntax trees. Its literal_eval() function first checks whether the string to be evaluated is a legal Python literal; if it is, the evaluation is performed, otherwise it refuses to evaluate. So a simple literal expression is evaluated, while a dangerous expression passed to ast.literal_eval() is simply rejected.

So it is effectively a safe version of eval.
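A quick comparison shows the safety difference. The os.system payload below is just a demonstration string and is never actually executed:

```python
import ast

# literal_eval handles plain Python literals...
print(ast.literal_eval("{'a': 1, 'b': [2, 3]}"))   # {'a': 1, 'b': [2, 3]}

# ...but refuses anything that is not a literal, unlike eval, which would run it.
try:
    ast.literal_eval("__import__('os').system('echo pwned')")
except ValueError as e:
    print('rejected:', e)
```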

So to sum it up, there are now three ways to convert a json string into a dictionary object.

  1. json.loads
  2. eval
  3. ast.literal_eval

Summary of experience

When studying code, don't try to brute-force your way through it.

We should break the code into pieces, go step by step, and inspect the value produced at each step.

Seeing the results in an intuitive way makes it much easier to understand the purpose of the code.

Once one case is understood, other similar problems become easy to solve.

The content of the next analysis

We see that the import_hg.py file depends on two important files, nodes.json and edges.json. How are these two files generated? In the next article we will analyze graph_generator4.py, the important python file that generates them.

Things you can do in the future

PyHugeGraph is a very useful library, but the author wrote it in 2018 using Python 2. As a result, any file that uses this library is also forced to be written in Python 2. If there is time in the future, the library could be rewritten in Python 3 to keep up with the times.

Because each line of nodes.json and edges.json is an independent json string, import_hg.py has to assemble the lines one by one when importing into the database. Could these files be generated differently, so that the lines are not independent but joined by commas into one larger JSON document? Of course, after analyzing graph_generator4.py there may turn out to be reasons why it has to be done this way.
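As a sketch of that alternative, the per-line dicts could be emitted once as a single JSON array, which json.load can then read in one call (the file name and node contents are made up for illustration):

```python
import json
import os
import tempfile

nodes = [
    {'label': 'host', 'properties': {'name': 'A-1'}},
    {'label': 'host', 'properties': {'name': 'F-2'}},
]

# One valid JSON document instead of one dict per line.
path = os.path.join(tempfile.gettempdir(), 'nodes_array.json')
with open(path, 'w') as f:
    json.dump(nodes, f, indent=4)

with open(path) as f:
    loaded = json.load(f)
print(loaded == nodes)    # True
```

The trade-off is that the one-dict-per-line format can be appended to and streamed line by line, which may be exactly why graph_generator4.py produces it that way.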