Database notes-data model

1 Conceptual data model and basic data model

1.1  Conceptual data model

The data model is independent of the computer system. It does not involve the representation of information in the computer system at all. It is only used to describe the information structure that a specific organization cares about. This type of model is called "conceptual data model".

Conceptual data model is used to establish the data model of the information world, emphasizing its semantic expression ability. The concept should be simple, clear and easy for users to understand. It is the first level of abstraction in the real world and a tool for communication between users and database personnel. The most famous is the "Entity Connection Model". (ER model)

1.2 Basic data model

The data model that directly faces the logical structure of the database is the second level of abstraction in the real world. This type of model involves computer systems and database management systems, and is also called "basic data model" or "structured data model".

For example, hierarchical, meshed, relational, object-oriented data model". This type of model has strict formal definitions so that it can be implemented in computer systems.

2  Hierarchical Data Model hierarchical model

2.1 Two concepts

record: Real-world entities, expressed in the form of "records"

field: Each record has several fields, which are used to represent different characteristics of the entity

2.2  Parent-Child relationship (PCR)

The most basic data relationship in the hierarchy, a one-to-many relationship between two data types.

Definition of RCR

Examples of PCR

2.3 Hierarchical Data Schema Hierarchical Data Schema

Each PCR is one-to-many, and N PCRs are composed.

Each record type can only have one parent.

definition

Instance

2.4 Virtual Record

In the real world, many data are not hierarchically structured, and it is difficult to express this structure directly with PCR.

To give a few examples:

Many-to-many relationship

 

 

 

Many-to-one relationship

 

 

Multiple relations

 

In order to avoid redundancy and to ensure the tree structure of PCR, we introduced the concept of virtual records.

Only one record is stored, and the other references to the record are replaced by pointers (records replaced by pointers—virtual records) [the record indicated by the subscript v in the figure below is a virtual record]

Many-to-many relationship

Many-to-one relationship

 

Multiple relations

Taking the many-to-many relationship as an example, we define two virtual record types, one for students and one for courses.

Then we define the one-to-many relationship of course-student and student-course respectively

However, the child nodes at this time are all virtual records. This virtual record record type does not really store data, but stores a pointer to real data.

2.5 Linear representation of hierarchical data

Since the memory is linear, the hierarchical data must be transformed into a linear form to be stored. An instance of the hierarchical data pattern corresponds to a hierarchical tree (or forest). The sequence generated by traversing the hierarchical tree (or forest) in order is called the hierarchical sequence ( hierarchical sequence), which is specified as the storage order.

2.6 Features of Hierarchical Data Model

2.6.1 Constraints

(1) Except for the root record, any other record cannot leave its parent record and exist in isolation;

(2) Any record, regardless of whether it is virtual or actual, is only allowed to have one parent record (ensure that the hierarchical data mode and its instances are tree-shaped);

(3) The pointer of the virtual record must point to an actual record, and the record pointed to by the virtual record must not be deleted;

(4) The virtual record must not be the root record.

2.6.2 Advantages

The connection between records is realized through pointers, and the query efficiency is higher (for the hierarchical structure).

2.6.3 Disadvantages

1. It can only represent 1:N connection. Although virtual records can be used to describe non-hierarchical data relationships, they are more complicated and difficult for users to grasp, and the efficiency of non-hierarchical queries is relatively low;

2. Due to the strict and complex hierarchical order, the query and update of the data are very complicated, so the writing of the application program is also more complicated;

3. The pattern description language is more complicated and the data independence is poor.

3 Network Data Model

The basic data structure of the network data model is set, which represents a one-to-many relationship between two record types . One side is the master record; the more side is the subordinate record.

A record type can be a master record of multiple departments, or it can be a record of multiple departments.

Departments with multiple types of genus records-"multiple genus lines".

A record type cannot be both the first record of a department and the belonging record of the department at the same time.

data item: The description of the record in the network data structure is somewhat similar to the field in the hierarchical data structure, but the'field' here can be a vector (composite type).

In the network data structure, the value of the system is no longer a tree, but a linked list

 

Types of

Instance

How to express self-connection in a networked data structure? (Because a record type cannot be a master record and a subordinate record at the same time)

For example, if we want to express the data model of the upper-lower relationship of the formula, we introduce a LINK record type:

Types of

Instance

1:1 means one person holds a leadership position

1: N means one person can lead N people

Because I and myself are also the relationship between the leader and the led, so we introduce the LINK type for assistance.

Employees -> Corresponding LINK is a one-to-one relationship, in other words, each LINK is equivalent to a "stand-in" for an employee, using it to connect with other employees.

s2 expresses the relationship between the leader and the led.

If you want to find a subordinate of a certain employee, first use the S1 linked list to find his corresponding S, and then use the S2 linked list to find elements other than yourself, and traverse this linked list. The person traversed is the employee's direct subordinate

3.2 The network data structure expresses many-to-many relationships

It also refers to the Link record, the link record has an SL linked list, which is the relationship with the student; the CL linked list, and the course

 

 

When we want to see how many courses a student chooses, we start from the designated student, first follow the SL to the Link, then start from the Link, and follow the corresponding CL to the course

 

If you want to see a course, which students choose? Then the other way around, starting from the course, first follow the corresponding CL to the link, and then start from the link, follow the SL to find all the students

The upper figure in this table is not acceptable (a record cannot appear in multiple system values ​​of the same system type, and Link records must be used if necessary)

Need this, after L5, you can follow CL to DB, or to L6 (C)

 

4 Relational Data Model relational data network

The main feature of the relaction data model is to express the entity set with a table structure, and use foreign keys to express the relationship between entities. Compared with the hierarchical model and the mesh model, the relational model is relatively simple.

The relational model is a collection of several relational patterns. Each relationship is actually a table, and the relationship between records is embodied through the keys of each relationship mode.

The most basic data structure of the relational model is the "table". In the real world, entities, entities and relationships between entities are all represented by "tables" (to show the difference, the previous levels and meshes, and the relationships between entities and entities are represented by levels and networks)

4.1 Difference from hierarchical structure and mesh structure

The biggest difference between the relational model, the hierarchy, and the mesh model is to navigate the data with keys instead of pointers. Its tables are simple and easy for users to understand. Users only need to use simple query statements to operate the database without involving storage structure, Access technical details.

4.2 Features

At this time, the stored things, the inquired things, all things are tables, they form a closed space, we can use this to define an algebra system (relational algebra).

Based on set theory, it has a higher level of abstraction.

Some details of the underlying computer implementation will be shielded.

Non-procedural query language (SQL) (the user only needs to provide what he wants; instead of the previous hierarchical data model and network data model, he needs to specify how he needs to query [traverse the linked list or traverse the tree and wait until])

4.2.1 Soft connection

In the previous two data models, many-to-many is complex (either virtual records or LINK records are required).

In the relational data model, we only need to make another table (elective as shown in the figure below). The s# and C# in it are the id of student and course, which can be regarded as the logical pointers (ie soft connection) of these two.

4.3  Some concepts of relational data model

Attribute (attribute): The nature of the entity is described by attributes (eg, student's name, student number...)

Domain: Every attribute has a value range, and this range is the domain.

4.3.1 The nature of the domain

1 Paradigm restriction (1NF): Every attribute is atomic and cannot be subdivided; that is, every attribute cannot be a structure, an array, or something; in other words, no set of tables in a table is allowed.

Allow the value in a record to be null (NULL, don’t know, it means that the attribute value is vacant)

In the same relationship, attributes with the same name are not allowed, but different attributes can have the same domain.

4.3.2 Relationship

Any entity is one or more relationships

If a relation R has n attributes A1, A2,...An, and the corresponding domains are D1, D2,...Dn, then the relation R can be expressed as R = (A1/D1, A2/D2,… An /Dn), simplified to R = (A1, A2,… An) ——>This is the relational schema (schema)

n is the degree of the relationship (degree)

In relation R, the order of tuples is irrelevant, but the same two tuples cannot be allowed

In relation R, the order of the attributes is irrelevant.

4.3.3 Tuples

Tuple is an instance of relation R

A relation R can be expressed as r or r(R), r = {t1, t2, …, tm}

Each tuple can be expressed as t = <v1, v2, …, vn>, vi∈Di, 1≤i ≤n (t ∈ D1×D2×, …, ×Dn, 1≤i ≤n)

Relations are also called tables, attributes are called columns, and tuples are called rows.

4.4 key

4.4.1 Candidate key (candidate key)

For a relationship, a set of attributes is called a candidate key, when it meets the following conditions:

1. Any two tuples, the values ​​in this group of attributes are different

2. Any subset of this group of attributes does not have the previous feature

Then this group of attributes is called candidate key (eg, ID number)

(To put it plainly, it is similar to the extremely irrelevant group)

4.4.2 Super key

If condition 2 of the candidate key is not satisfied, that is to say, there are redundant attributes, then such a set of attributes is called a super key.

4.4.3 Primary key and all key

The extremely irrelevant group is not unique, so the candidate key is not unique, so we can select one of the groups as the primary key, and the others are the alternate key.

If the primary key includes all the attributes in the relationship, then we call it all key.

4.4.4  foreign key (foreign key)

If the attribute in this table is used to refer to a tuple of another table (that is, the primary key of another table), then the attribute in this table is called a foreign key

In students, sid is the primary key, and in enrolled, sid is the foreign key

4.5 Constraints

4.5.1 Domain integrity constraint

The value of each attribute of each tuple meets the requirements of the value range.

4.5.2 Entity integrity constraint

The attribute of the primary key is not allowed to be empty

If this value must be NULL, then it is not set as the primary key.

4.5.3 Referential integrity constraints

The foreign key is either vacant or refers to the actual primary key value

5 Relational algebra

5.1  Tables that will be used in this section

Sailor information table (rating is the sailor’s rating)

Ship information form

Reservation relationship (At this time, the attribute group of the sailor's id and the ship's id is not the primary key, because the same sailor can book the ship at different times, so the primary key at this time is the full key).

5.2 Basic operators

If a system supports the following five basic operations, then the system is relational complete (all operations on the relational model can be derived through these five operations).

Because the results of these operations are also relations, these operators can be combined with each other.

Unary operations (monocular) have higher priority than binary operations.

5.2.1 Selection

1-Selection (σ): Selection is a monocular operation, that is, an operation applied to a relationship, and a set of tuples that meet the conditions are selected from the relationship according to a given condition.

5.2.2 Projection

Projection (Π): The projection operation is a monocular operation, which selects a new relationship composed of specified attributes from the relationship. That is, remove the unnecessary fields (columns)

In theory, the projection operation needs to remove duplicate elements (in the object-oriented model, even if all the attributes are the same, as long as the two element id are different, it is not an element; but in the relational model, if all the attributes are equal, then we consider them Express the same element).

However, in the actual database operation, the same elements are not actively removed unless the user explicitly asks for it to be deleted (because the database does not know what kind of data the user needs for application development, the deletion may not meet the user's needs) .

5.2.3 Cartesian product

Cross-product (Cartesian product): splicing two relations together, the two tables are put together into one big table

The result of the Cartesian product contains all the attributes of the original two tables.

The result is formed by concatenating each tuple of the two tables participating in the operation.

<t, g> is the splicing of t and g, that is, R × S is still a relationship, its order is nr+ns, and the number of tuples is |R|×|S|.

Take S1×R1 as an example:

But because S1 and R1 have the sid attribute, and two identical attributes are not allowed in the same table; so we have to rename the attribute:

The result is

5.2.4 Set difference

et-difference (-): set difference, AB: records belonging to relation A but not belonging to relation B

The intersection operation can be represented by the difference operation. Suppose A and B are two sets, then the intersection of A and B can be expressed as: A∩B≡A-(A-B)

5.2.5 Set Union

Union(∪) The set is composed of all tuples belonging to A or B. That is, the two tables are merged into one

The tuples of the two relations participating in the union and difference operation must be restricted to the same type, that is, they have the same order (number of attributes), and the domains of the corresponding attributes are the same-and compatible (union compatibility);

5.3 Connection

5.3.1 condition join

The Cartesian product is two-by-two splicing, which may result in many meaningless tuples, resulting in data redundancy.

Conditional linking is based on the Cartesian product, the result is selected according to condition C.

Sometimes, the conditional connection is also called theta connection.

The connection condition is the comparison of the corresponding attributes in the two relations. The corresponding attributes may not have the same name, but they must have the same domain. The general expression is: <condition 1>and <condition 2>and...and<condition k>.

5.3.2 Equi-join

Special condition link, only the two tables that do the Cartesian product, and the one/several attributes of the same will be selected

Compared with the result of conditional linking and Cartesian product, the result of equivalence connection removes one attribute of equivalence.

5.3.3 Natural join

Do two tables of Cartesian product, and only those with the same common attributes will be selected.

5.4 Completeness

5.5 Division

What conditions does the resulting set of x satisfy? For each y in B, a record can be found in A such that <x,y> is in A

For example, A is composed of two attributes, sno and pno, and the result of A/B has only the sno attribute because the pno attribute is removed.

We are looking for xno that is related to each element in B in the original relationship A.

Let's take A/B2 as an example, why not choose s2? Because there is <s2,p2> in A, but there is no <s2,p4>, the division condition is not satisfied (for each y in B, a record can be found in A)

5.5.1 Representation of division

Because division is not a basic operator, we have to look at how to express division with five basic operators.

It is more cumbersome to directly look at the constraint condition of “satisfying and all elements in B” are cumbersome, so we can look at this problem from the perspective of “negation of negation”. That is to say, the points that do not meet the division condition are first, and then these points can be eliminated.

Let's see first

All x and B in A are joined in pairs, and the Cartesian product is performed, then all possible combinations of A_x and B are available.

Followed by

Remove all the <x,y> pairs in A, and what is left is all the <x,y> pairs that do not meet the division conditions (the <x,y> pairs that are not in the original A)

Put all the x in the <x,y> pairs that do not meet the division conditions, these x are the points that do not meet the division conditions (because if a certain x meets the division conditions, then all the <x,y> pairs of x In front of -A should be eliminated, and all that remains are at least one <x,y> pair that is not in A)

So we get A/B (remove the x in the previous step)

5.6 Outer connection

The general connection is a natural connection, so the tuples that do not satisfy the connection operation will be discarded.

What if we must have these elements? Then we will use external connections.

Outer joins are divided into left outer joins, right outer joins, and full outer joins, which are to keep all tuples of left operand, right operand, and two operands respectively. Matches that can be matched, the value corresponding to the attribute that cannot be matched is NULL

5.7 Outer merge

To perform basic operations, the condition that needs to be met is that the attribute patterns need to be the same. However, in actual applications, if those that do not satisfy the compatibility relationship must be merged together, then external merge is required. The attribute set of the result is the union of the attribute sets participating in the operation. Elements without a value are filled with NULL.

6 Relational Calculus

The operations expressed in relational algebra require us to specify the order of operations [therefore the DB language based on relational algebra is a procedural language], while the relational calculus only needs to specify the logical conditions that need to be met (based on the logic of predicate calculus) Expression) [As long as the result to be obtained is explained, it is not necessary to indicate the process of operation, so the database language based on relational calculus is a declarative language].

In other words, relational calculus only uses predicate formulas to tell "what to do", but not "how to do".

At present, user-oriented relational database languages ​​are basically based on relational calculus.

Relationship calculus is divided into tuple relationship calculus (define variables in units of tuples) and domain relationship calculus (define variables in units of attributes). The only difference between them is the definition of variables.

Boolean expressions in relational calculus are called formulas, and all the results of making Boolean expressions true are the answers we want.

6.1 Security query

We can write an infinite number of relational calculus that is syntactically correct, but meets the condition of the result. Such queries are not safe.

Let’s take the sailor as an example, such as a relational calculus

It means to find all sailors who are not in the sailer table. This relational calculation is syntactically correct, but it is not safe, because we can find an infinite number of such sailors who meet the conditions.

6.2 Completeness of expressive ability

What can be expressed by relational algebra, we can also use a safe domain relation calculus or tuple relation calculus (the two expressions are equivalent).

6.3 Domain Relational Calculus

Variables are defined in domains, which are attributes. The form is as follows:

The formula on the right of the'|' sign can have more domain variables, but I only need n for the result.

6.3.1 Atomic formula in DRC

Among them, X and Y are domain variables, and op is a comparison operator.

6.3.2 Formulas in DRC

The last two lines of "free" mean that in p(X), X is not arbitrarily taken and there is a limit.

Taking the previous sailor as an example, if we want to find the id, name level and age of the sailor above level 7, we can write this using relational calculus:

6.4 Tuple Relational Calculus

It has the following format:

t is a tuple variable, you can use the entire t as the query object, or you can query some attributes in t.

If the entire t is queried, the <attribute table> can be omitted.

Let’s take a sailor as an example. We need to find the names of all sailors above level 7 and below age 50. We can write like this

6.5 Expression of relational operations with relational calculus

6.5.1 Projection

6.5.2 Selection

6.5.3 Subtraction

6.5.4 (Natural) Connection

7 Evaluation of traditional data structure

The traditional data structure is the collective name of the hierarchical, mesh, and relational data structure that we have discussed before.

7.1 Features

●All inherited the concepts of records, fields, etc. in the file.

●The physical level also draws on the file index, hash and other access methods.

● Provide users with a unified data model and corresponding database language.

●The basic structure, constraints and operations of the data are defined on the basis of records

7.2 Insufficiency

Based on records, based on structured data implementation, the actual data is represented as standardized content, but this is not well-oriented to users and applications.

Cannot express the connection between entities and entities naturally.

Lack of semantic information (taking relationships as an example, expressing entities as a list of tables, logically speaking, the status of the tables is equal. The mutual reference between the tables, the inheritance relationship and other internal relations, you need to restrict, Application, database documents, etc. can only be seen. From the perspective of the model alone, it is invisible).

Too few data types are supported (structures, dictionaries, and lists are not acceptable).

8 ER model

The ER data model is not oriented towards realization, but oriented towards the real world. The starting point of its design is to simulate the real world effectively and naturally, instead of first considering its implementation in the machine.

The ER model provides three abstract concepts of entity, attribute and connection. These three concepts are simple, clear, intuitive and easy to understand. They are natural to simulate the real world, and can easily convert relationships, levels, and network data patterns.

When using ER to represent the data model, we only care about what data (that is, what entities and attributes) and the relationship between the data (entity relationship), but not how these data are represented in the computer and what DBMS is used.

In the ER model, both entities and connections can have attributes. For example, the attributes of the course selection connection can have grades, course selection time, and so on.

Most of the features of the entity of the ER model are the same as the relational model, but the attributes of the entity allow composite types and multi-value attributes. (For example, if a person has multiple phones, the attribute of phone calls can be a multi-valued attribute)

8.1 ER diagram

Although the ER model did not succeed in the end, its derivative, the ER diagram, plays an important role in the design of the database (we can know its semantic information by looking at the diagram directly)

8.2 EER extended ER

8.2.1 Weak entity

Cannot exist alone, must depend on other entities.

Eg, school employees-employee family members, each employee family entity is a weak entity and must rely on the existence of employees.

8.2.2 Specialization and Generalization

Similar to the subclass and parent class in object-oriented.

8.2.3 Gathering

Allows the relationship between two entities to be regarded as one entity.

8.2.4 Category

An entity can be expressed as a collection of entities of different types.