IDENTIFYING SEMANTIC RELATIONS BETWEEN SENTENCES BY SOLVING AN ANAPHORA

This paper presents a method for identifying semantic relations by resolving anaphora, specifically for determining the nature of relations between sentences in scientific texts. The authors explored the use of the pronoun anaphora, developed a mathematical predicate model to determine the anaphora and the antecedent, and described the relationships between them for fragments in the Ukrainian language. The authors conducted an experiment based on the developed model and on the logical network that was built from the presented predicate model.


Introduction
Problem Statement. The exponential increase in the amount of textual information gives rise to new tasks for natural language processing. Search engine technologies, automatic machine translation methods, text summarization, and other language processing methods need constant improvement. One of the tasks in these areas is to create a new method for determining semantic relations between phrases, within one sentence, and between different sentences. Scientific publications explore approaches to identifying semantic relations by defining anaphora. Determining the place and role of anaphora in syntactic constructions is a separate task of computational linguistics.
Analysis of Relevant Works. In computational linguistics, semantic relations in syntactic constructions are studied via the anaphora and the antecedent. Sentences, phrases, parts of sentences, and combinations of sentences are syntactic constructions that can have anaphoric relations. This paper focuses on the semantic relations between sentences.
The problem of linguistic anaphora has been the subject of many studies [1][2][3][4][5][6]. Work [4] investigates the functionality of known system-analysis approaches to resolving anaphora in Ukrainian natural-language texts. An algorithm for calculating semantic-syntactic criteria for a pre-selected antecedent-anaphora pair is presented. The author of [5] investigates the peculiarities of the semantic-syntactic structure of the pronoun anaphora and its functions in literary text. The publication [6] proposes an approach to resolving the pronoun anaphora by constructing a classifier: based on the syntactic and semantic properties of the anaphora and the antecedent, the classifier draws a conclusion about their compatibility or incompatibility.
An extractive approach to information abstracting is implemented in [7]. There, anaphoric relations are used for the automatic summarization of textual information on the Internet. Automatic text summarization is the process of creating a summary of one or more related documents that retains only the most valuable information.
The purpose of this paper is to present a method for identifying semantic relations by resolving anaphora: to determine the nature of relations between sentences in scientific texts; to explore the use of the pronoun anaphora; to develop a mathematical predicate model that determines the anaphora and the antecedent and describes the relationship between them for fragments in the Ukrainian language; to build an appropriate logical network; and to conduct an experiment based on the developed model.

Types of relations between sentences in scientific texts
All texts can be divided into three varieties according to the types of relations between sentences [8]: texts with parallel (remote) relations; texts with serial (chain) relations; texts with combined relations.
In parallel relations, the sentences are equal. A parallel relation is the use of sentences that have the same word order and the same grammatical forms of the sentence members. The main means of implementing parallel relations is syntactic parallelism: the same or similar construction of sentences, often expressed in the same sequence of words and in the unity of the tense forms of the predicate verbs [8].
("The degree of Web-space development will be determined by the technology of working with the huge amount of information accumulated on the Internet. The next-generation Web will be characterized by the transition from a network of documents to a network of data, which, if necessary, are aggregated into semantically related documents using Web services.")
The research object of this article is a scientific text. Concepts and actions must be presented clearly and unambiguously in a scientific text so that any researcher-reader understands them correctly. This requirement explains the predominant use of sequential (chain) relations in academic texts.
The chain relation is used in all language styles. It is the most common way of connecting sentences and the one most consistent with the specifics of thinking. The chain relation is necessarily present where the thought develops sequentially, where each subsequent sentence develops the previous one as if formed from it. The connection between sentences in a sequential relation is strong, so without the previous sentence the meaning of the next one may not be clear.
Serial or chain relations exist because the complement of the previous sentence becomes the subject in the next sentence. The structural form of this relation is as follows: "complement-subject". Other models of the sentence structure are also widely used: "subject-complement", "complement-object", "subject-subject". The syntactic essence of the chain relation is expressed in these syntactic models, in the syntactic relations between neighboring members of the sentence. This is the internal, structural side of the chain relation. There are ways to embody syntactic relations in a serial relation [8]: lexical repetition, synonyms, indicative words, personal pronouns, pronoun adverbs, conjunctions, verbal omission, etc. Thus, we can distinguish: 1) chain relation by lexical repetition, 2) chain synonymous relation, 3) chain pronoun relation.
The syntactic essence of the chain relation, its structural types, and its models (the relations between pairs of sentence members) are preserved regardless of the means by which the model is implemented: "The first generation of innovation process models dates back to the 1950s and 1960s. According to R. Roswell, this generation can be described as a 'technology-driven model'."; "The process of auto-oxidation during the operation of PPT, for example for polyethylene film, is characterized by three stages: the induction period, which corresponds to the stage of chain nucleation; the acceleration period, which corresponds to the stage of chain growth; and the deceleration period, which corresponds to the stage of chain breakage. According to the change in the properties of the polymer raw materials, these processes are divided into aggregative, associated with crosslinking processes, and destructive, associated with the decomposition of macromolecules into smaller fragments."

Using the pronoun "anaphora"
There are the following means of relation: words close in meaning, pronouns, adverbs, numerals, and other means, repetition of words, words that indicate the sequence of development of the content. According to this, we can distinguish the following types of anaphora: pronoun, noun, adverb, and zero [6].
The most common type of anaphora in natural language is the pronoun anaphora [3]. In computational linguistics, anaphora is usually defined as a reference to objects mentioned earlier in the discourse, i.e. as "pointing back". The expression that points back (the reference) is called an anaphora, and the entity to which it refers is its predecessor, the antecedent [2].
("The most pronounced in terms of dynamics is, of course, the segment of information in the form of news. On the one hand, it has the highest level of updating, and on the other hand, it generates and distributes really large amounts of data.")
One of the most commonly used relations is between the anaphoric pronoun and the antecedent. "Anaphoric pronouns are pronouns that refer to some other words or phrase (antecedent) of this text, the semantic meaning of which they reflect" [9].
For the research, we used our own text corpus, which consists of scientific articles and publications. The statistical characteristics of this corpus are presented in Table 1.
From the total number of words in the Ukrainian text corpus, 5.3% are pronouns (21923 out of 415565). Only 14 (0.5%) Ukrainian fragments out of the total number of 2633 do not contain pronouns. With an average fragment length of 157 words, the maximum number of pronouns in a fragment is 32, the average is 9.
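The two percentages quoted above can be rechecked directly from the reported counts. The following is a minimal sketch; the variable and function names are illustrative and do not come from the paper.

```python
# Counts reported in the text for the Ukrainian corpus.
total_words = 415565
pronouns = 21923
total_fragments = 2633
fragments_without_pronouns = 14


def share_percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


pronoun_share = share_percent(pronouns, total_words)
no_pronoun_share = share_percent(fragments_without_pronouns, total_fragments)
print(pronoun_share, no_pronoun_share)
```

Both values agree with the rounded figures given in the text (5.3% and 0.5%).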
The presence of pronouns in fragments of the Ukrainian text corpus is presented in Table 2.
Anaphora resolution is based on syntactic and morphological information. These features are used to identify the antecedent of the anaphora. The antecedent and the anaphora should have similar characteristics. 90% of antecedents are in the same sentence as the anaphora or in the previous one [10].

The mathematical model development for determining anaphora and antecedent
To describe the semantic relations between sentences and resolve the anaphora, we will use the apparatus of finite predicate algebra (FPA) and predicate operations (PO) [11][12]. It is known [13][14] that the algebra of finite predicates is complete, i.e. any finite relation can be described in its language.
The algebra of finite predicates is a formal language in which the structures and functions under investigation can be described mathematically. It is used to formalize the description of deterministic, discrete, finite objects. Deterministic processes are processes with an unambiguous result; they do not have a factor of randomness. Discrete processes are processes in which information has the form of separate portions or quanta, for example numbers, letters, words, or formulas; they lack the factor of continuity. Finite processes are processes in which there is no infinity factor, i.e. only a finite number of information units can participate in them [11][12]. This is how we describe our task of natural language processing and of defining anaphoric relations.
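For our problem, a predicate is defined on the Cartesian product of the domains of the variables $x_1, x_2, \dots, x_n$. A relation $P$ is a subset of this product, and the corresponding predicate $P(x_1, x_2, \dots, x_n)$ takes the values 0 and 1:

$$P(x_1, x_2, \dots, x_n) = \begin{cases} 1, & \text{if } (x_1, x_2, \dots, x_n) \in P, \\ 0, & \text{if } (x_1, x_2, \dots, x_n) \notin P. \end{cases} \qquad (1)$$

If we know the relation $P$, then we can always use rule (1) to trace all the values of the predicate $P$, i.e. to give a complete description of it (for example, to represent the predicate $P$ in graph form).
The inverse transition from the predicate $P$ to the relation $P$ can be performed using the following rule:

$$P = \{ (x_1, x_2, \dots, x_n) \mid P(x_1, x_2, \dots, x_n) = 1 \}. \qquad (2)$$

Expression (2) allows us, for known Boolean values of the predicate $P$, to find all the tuples that make up the relation $P$.
Rules (1) and (2) connect relations and predicates: the set of all tuples for which $P(x_1, x_2, \dots, x_n) = 1$ forms the relation $P$, which is called the truth domain of the predicate $P$. The predicate $P \in M$, which is given by rule (1), is called the characteristic function of the relation $P \in L$. For Boolean elements, as well as for predicates, there are operations of disjunction, conjunction and negation.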
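The correspondence between a relation and its predicate described by rules (1) and (2) can be illustrated with a minimal sketch. The two toy domains and all names below are hypothetical and serve only as an illustration; they are not taken from the paper.

```python
from itertools import product

# Hypothetical toy domains for two variables (illustration only).
domains = {
    "x1": ("before", "after"),
    "x2": ("noun", "other"),
}

# A relation is a set of tuples from the Cartesian product of the domains.
relation = {("before", "noun")}


def predicate_from_relation(rel):
    """Rule (1): build the characteristic function of the relation."""
    return lambda *values: 1 if tuple(values) in rel else 0


def relation_from_predicate(pred, doms):
    """Rule (2): collect the truth domain of the predicate."""
    return {t for t in product(*doms.values()) if pred(*t) == 1}


P = predicate_from_relation(relation)
recovered = relation_from_predicate(P, domains)
print(P("before", "noun"), P("after", "noun"), recovered == relation)
```

Because the domains are finite, the truth domain can be recovered by simply enumerating the Cartesian product, which is what makes the round trip between the two representations possible.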
Let us give a formal description of the process of identifying the anaphoric relation. We will develop a mathematical model that determines the anaphora, the antecedent, and the relationship between them for fragments in the Ukrainian language, and build an appropriate logical network. In this work we analyze only one type of construction: a third-person pronoun and a singular noun.
We identify seven features that characterize the antecedent (Table 3). The selected features are complete, consistent and indivisible.
Table 3. Features of the antecedent
1. Location in the text: $x_1^1$ - before the anaphora; $x_1^2$ - after the anaphora.
2. Part of speech: $x_2^1$ - noun; $x_2^2$ - other parts of speech.
3. Grammatical gender: $x_3^1$ - masculine gender; $x_3^2$ - feminine gender; $x_3^3$ - neuter gender.
4. Number: $x_4^1$ - plural; $x_4^2$ - singular.
5. Grammatical case: $x_5^1$ - Subjective; $x_5^2$ - Genitive; $x_5^3$ - Dative; $x_5^4$ - Accusative; $x_5^5$ - Ablative; $x_5^6$ - Prepositional.
6. Proper names: $x_6^1$ - proper name; $x_6^2$ - not a proper name.
7. The distance between a pronoun and a potential antecedent in words: $x_7^1$ - less than 30 words; $x_7^2$ - more than 30 words.
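The seven features and their value sets can be written down as a small data structure. The following is a sketch only: the class and field names are hypothetical, the case names are kept as given in the text, and the raw distance is stored as a number from which the two-valued feature is derived.

```python
import math
from dataclasses import dataclass
from enum import Enum


class Position(Enum):       # feature 1
    BEFORE = 1
    AFTER = 2


class PartOfSpeech(Enum):   # feature 2
    NOUN = 1
    OTHER = 2


class Gender(Enum):         # feature 3
    MASCULINE = 1
    FEMININE = 2
    NEUTER = 3


class Number(Enum):         # feature 4
    PLURAL = 1
    SINGULAR = 2


class Case(Enum):           # feature 5, names as given in the text
    SUBJECTIVE = 1
    GENITIVE = 2
    DATIVE = 3
    ACCUSATIVE = 4
    ABLATIVE = 5
    PREPOSITIONAL = 6


@dataclass(frozen=True)
class Candidate:
    """A potential antecedent described by the seven features."""
    position: Position
    part_of_speech: PartOfSpeech
    gender: Gender
    number: Number
    case: Case
    proper_name: bool        # feature 6
    distance_in_words: int   # raw value behind feature 7

    @property
    def within_distance(self) -> bool:
        """Feature 7 as a two-valued attribute: less than 30 words."""
        return self.distance_in_words < 30


def feature_space_size() -> int:
    """Number of tuples in the Cartesian product of the seven value sets."""
    sizes = [len(Position), len(PartOfSpeech), len(Gender),
             len(Number), len(Case), 2, 2]
    return math.prod(sizes)


example = Candidate(Position.BEFORE, PartOfSpeech.NOUN, Gender.FEMININE,
                    Number.SINGULAR, Case.SUBJECTIVE, False, 15)
print(example.within_distance, feature_space_size())
```

The size of the product (2 x 2 x 3 x 2 x 6 x 2 x 2 = 576 combinations) shows that the attribute space is small enough to be enumerated exhaustively.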
By distance we mean a value of up to 30 words (approximately double the average number of words in a sentence of the scientific corpus, $16 \times 2 = 32$). We have 7 attributes, for each of which we define the domain of values using the laws of truth, falsity, and negation.
The set of anaphora values (which define the anaphoric relation) is binary: $x \in \{0, 1\}$. The position in the text relative to the pronoun (anaphora) is important: the candidate antecedent should stand before the anaphoric pronoun, preferably in the same sentence or in the previous one.
Based on the combination of attributes and the relations between them, we build a logical network, which is a graphical representation of the Cartesian product of all the mentioned attributes. Logical networks are a universal, simple, and natural means of visually representing the structure of any object. From a mathematical point of view, a logical network is a system of binary predicates [14].
Any logical network consists of poles and edges. Each pole has its own subject variable (pole attribute). The pole is denoted by its own subject variable. The pole attribute change area (domain) is associated with the pole. Any pole of a logical network at any given time reflects certain knowledge of attribute value. It is called the state of the pole. It is possible to obtain the state of a network at a given moment by specifying the state of all its poles at that moment [13].
Each edge of a logical network is assigned its own binary relation, which is called the relation of this edge. Each edge is denoted by the number of its relation. An edge connects the two poles whose subject variables are linked by the relation of this edge [13].
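The notions of pole, pole state, and edge relation can be made concrete with a small sketch. The pole names, the value labels ("near", "far", "possible", "impossible"), the two edge relations, and the simple propagation loop are all assumptions made for illustration; they are not the network built in the paper.

```python
# Poles: attribute name -> state (the set of values still considered possible).
poles = {
    "X1": {"before", "after"},          # location relative to the pronoun
    "X7": {"near", "far"},              # hypothetical labels for < 30 / > 30 words
    "Q1": {"possible", "impossible"},   # hypothetical aggregate indicator values
}

# Edges: each edge links two poles by a binary relation (a set of allowed pairs).
edges = [
    ("X1", "Q1", {("before", "possible"), ("before", "impossible"),
                  ("after", "impossible")}),
    ("X7", "Q1", {("near", "possible"), ("near", "impossible"),
                  ("far", "impossible")}),
]


def propagate(poles, edges):
    """Shrink pole states until every remaining value is supported on each edge."""
    state = {name: set(values) for name, values in poles.items()}
    changed = True
    while changed:
        changed = False
        for a, b, rel in edges:
            allowed = {(u, v) for (u, v) in rel
                       if u in state[a] and v in state[b]}
            new_a = {u for u, _ in allowed}
            new_b = {v for _, v in allowed}
            if new_a != state[a] or new_b != state[b]:
                state[a], state[b] = new_a, new_b
                changed = True
    return state


# Knowledge "the candidate stands after the pronoun" rules the relation out.
after_case = propagate({**poles, "X1": {"after"}}, edges)
# Knowledge "the relation is possible" fixes both location attributes.
possible_case = propagate({**poles, "Q1": {"possible"}}, edges)
print(after_case["Q1"], possible_case["X1"], possible_case["X7"])
```

In this sketch, the state of the network at a given moment is simply the dictionary of pole states, and new knowledge about one pole narrows the states of the poles connected to it.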
To build a logical network, we perform decomposition. Decomposition plays an important role: it replaces one complex equation (task) with an equivalent system of simpler equations (tasks). Thus, decomposition serves as a powerful means of simplifying and shortening the notation of finite predicate algebra equations [12].
After performing the decomposition, we group the attributes related by type and form aggregate indicators for each type of attribute.
The first group of attributes determines the location of the word that is analyzed for the role of the antecedent: $X_1$ is the location in the text (relative to the pronoun (anaphora)), and $X_7$ is the distance in words between the pronoun and a potential antecedent. They form the aggregate indicator $Q_1$.
The second group of attributes analyzes whether the word can perform the role of the antecedent: $X_2$ is the part of speech (noun), and $X_6$ is a proper name. These attributes form the aggregate indicator $Q_2$.
The third group consists of the morphological attributes of a potential antecedent: $X_3$ is grammatical gender, $X_4$ is number, and $X_5$ is grammatical case.
Each aggregate indicator takes values $q$. For the indicator $Q_1$, one of these values means that the relation is impossible; the logical network of attributes for this indicator is presented in Fig. 1 and is described by a corresponding equation. For the morphological attributes, one of the values $q$ means that the morphological attributes of a potential antecedent do not match the attributes of the anaphoric pronoun (Fig. 3).
According to the aggregate indicators, we obtain a binary value of the indicator $R$ (result), which shows the presence or absence of an anaphoric relation. The general process of creating the aggregate indicators is presented in Fig. 4. Here $R = \{r_1, r_2\}$, where $r_1$ means that an anaphoric relation has been identified and $r_2$ means that there is no anaphoric relation.
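One possible reading of how the three aggregate indicators could be combined into the result R is sketched below. This is an assumption-laden illustration, not the authors' equations: the function names are hypothetical, the indicators are combined by plain conjunction, any noun (proper or not) is admitted, and only gender and number are required to agree.

```python
DISTANCE_LIMIT = 30  # words, the threshold named in the text


def q1_location(candidate: dict) -> bool:
    """Location group: before the pronoun and within the distance limit."""
    return (candidate["position"] == "before"
            and candidate["distance"] < DISTANCE_LIMIT)


def q2_word_type(candidate: dict) -> bool:
    """Word-type group (assumption: any noun may act as an antecedent)."""
    return candidate["part_of_speech"] == "noun"


def q3_morphology(candidate: dict, pronoun: dict) -> bool:
    """Morphology group (assumption: gender and number must agree)."""
    return (candidate["gender"] == pronoun["gender"]
            and candidate["number"] == pronoun["number"])


def result(candidate: dict, pronoun: dict) -> str:
    """Return 'r1' if an anaphoric relation is identified, otherwise 'r2'."""
    found = (q1_location(candidate)
             and q2_word_type(candidate)
             and q3_morphology(candidate, pronoun))
    return "r1" if found else "r2"


pronoun = {"gender": "feminine", "number": "singular"}
good = {"position": "before", "distance": 15, "part_of_speech": "noun",
        "gender": "feminine", "number": "singular"}
too_far = {**good, "distance": 45}
wrong_gender = {**good, "gender": "masculine"}
print(result(good, pronoun), result(too_far, pronoun), result(wrong_gender, pronoun))
```

Grouping the attributes this way keeps each check small and lets a failing group be reported separately, which mirrors the decomposition into simpler tasks described above.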
Thus, using the apparatus of finite predicate algebra (FPA) and predicate operations (PO), we created a mathematical model that describes the process of resolving the anaphora and identifying the antecedent in scientific articles in Ukrainian.

Experiments and Results
All experiments were carried out on our corpus, described in the section "Using the pronoun "anaphora"". As a result of processing the texts from this corpus, we obtained statistical information on the pronouns that are used for resolving the anaphora. Table 4 shows that scientific texts often contain pronouns. Fig. 5 shows the most frequent pronouns in our corpus, i.e. those that occur more than 500 times.
"Електроенергетика - це базова галузь національної економіки, стабільність роботи якої для розвитку країни має особливе значення. Вона впливає не тільки на розвиток народного господарства, а і на територіальну організацію продуктивних сил."
("The electric power industry is a basic branch of the national economy, the stability of which is of particular importance for the development of the country. It affects not only the development of the national economy, but also the territorial organization of productive forces.") Our method identified the pronoun "Вона" ("she") and found the corresponding antecedent "Електроенергетика" ("electric power industry").
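The agreement step for this example can be illustrated with a short sketch. The gender and number annotations below were entered by hand for illustration (they are not the output of the morphological analyzer used in the experiments), and the filter covers only gender and number agreement, not the full model.

```python
# Nouns of the first sentence in text order, with hand-made annotations
# (assumed for illustration): (word, gender, number).
nouns = [
    ("Електроенергетика", "feminine", "singular"),
    ("галузь", "feminine", "singular"),
    ("економіки", "feminine", "singular"),
    ("стабільність", "feminine", "singular"),
    ("роботи", "feminine", "singular"),
    ("розвитку", "masculine", "singular"),
    ("країни", "feminine", "singular"),
    ("значення", "neuter", "singular"),
]
pronoun = ("Вона", "feminine", "singular")


def agreeing_candidates(nouns, pronoun):
    """Keep the nouns whose gender and number match those of the pronoun."""
    _, gender, number = pronoun
    return [word for word, g, n in nouns if g == gender and n == number]


candidates = agreeing_candidates(nouns, pronoun)
print(candidates)
```

The antecedent reported above, "Електроенергетика", passes the filter, but so do several other feminine singular nouns. Agreement alone therefore does not single out the antecedent in this sentence, which is why the remaining features (location, distance, case, proper name) are needed to choose among the surviving candidates.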
The experiments showed that 96% of the semantic relations expressed through pronominal anaphora were found in our corpus. The cases in which the model did not work were related to the syntactic features of sentences, to errors of the morphological analyzer, and to the distance between the anaphora and the antecedent.

Conclusions and Recommendations
The authors analyzed various means of relation between sentences: words close in meaning, pronouns, adverbs, numerals and other means, repetition of words, and words that indicate the sequence of development of the content. The main attention in the paper is paid to the pronoun anaphora, for which a predicate mathematical model was developed. This model defines the anaphora, the antecedent, and the relationship between them. All research was conducted on the Ukrainian corpus of scientific publications created by the authors. The relationship between the anaphora and the antecedent for fragments in Ukrainian was described using the algebra of finite predicates and predicate operations.
The selected features made it possible to form aggregate indicators for each type and set of these attributes. On this basis, a logical network was built, which is a graphical representation of the Cartesian product of all the attributes considered. The authors conducted an experiment using the developed model and the logical network built on the basis of the presented predicate model. The experiments showed that about 96% of the semantic relations were found in the text. The cases in which the model did not work were related to the syntactic features of sentences and to the distance between the anaphora and the antecedent. The developed predicate model can be used in various natural language processing applications, such as text summarization and the semantic similarity of sentences.