the identification of indicators of sentiment using a multiview self-training algorithm

This article presents a “multi-view self-training” algorithm that identifies indicators of sentiment by: 1. extracting causal relations, 2. classifying the causal relations into a sentiment category, 3. grouping common causes and 4. assigning sentiment categories to common causes to create a sentiment distribution for each common cause. A global manual evaluation of the strategy found that it had an accuracy of 70.00%.

Drury & Lopes

[2] Related work
The related work discusses the following: causation in text, causal relation extraction, sentiment classification and the prediction of future texts from information in past documents.
[2.1] Causation in text
A causal relation in text can be seen as a relation that exists between two events if one event is the cause of the other (Altenberg 1984). Altenberg (1984) stated that three conditions must hold before a causative relation can exist in written or spoken language. The three conditions are: 1. encapsulate the two members of the relationship, 2. express the type of relationship between the relation's members and 3. identify the members in a coherent sequence. An alternate definition of a causative relation was provided by Baron (1974), who stated: "Causation is a relationship between two states of affairs, X at time T1 and X' at time T2, and a cause Z that provides the necessary conditions for causing the change from X to X'". Baron (1974) provided four areas that should be considered when analyzing causative grammar: 1. what is represented by the causative relation, 2. what mechanisms the language has to represent causation, 3. at what level in the grammar the causation is represented and 4. what syntactic/semantic parameters define the relationship between elements in causative constructions. Baron (1974) further states that causation can be seen as a relation between entire propositions and/or sentences.
Two types of causation in text can be considered: explicit and implicit. Explicit causation is when the causative link is explicitly stated, for example in the generalization for causative verbs, NP V NP, that was provided by Levin (1993). An example of explicit causation that fits the NP V NP pattern is "Smoking causes cancer.". Implicit causation is when the causal link is implied, for example, "The sun was bright and I was sweating". The implied cause of the sweating is the warmth of the sun.

[2.2] Causal relation extraction
Causal relation extraction methods can be grouped into two general types: manual and automatic. Manual methods rely upon manually identified characteristics of language, typically patterns, to detect a causative relation. The automatic approaches tend to be supervised machine learning strategies. Supervised learning strategies are methods where labelled data is used to induce a classification model that is then used to identify causal relations in unlabelled text.

Manual Approaches
A simple manual strategy is to use hand-crafted patterns. These patterns are typically created by human experts and can be domain specific, so that they cannot be generalized to other domains. In addition, rule construction can be a time-consuming process. A number of approaches relied upon domain knowledge and hand-crafted rules. One of the earliest examples found in the literature was by Kaplan (1991). His system had a pipeline with four stages: 1. a hand-coded propositional representational parser, 2. a semantic analysis component, 3. causal analysis and 4. knowledge base acquisition. Each stage is dependent upon the previous stage. The causal analysis component creates a causal chain of events based upon the output of the semantic analysis component (SAC). The output of the SAC is a series of concept frames that are represented as a structured inheritance network. The root node of the network is known as "thing", and the sub-nodes can be members of one of the following classes: objects, actions or relationships. The causal chain is constructed from an event seed pair, for example, "air rising" and "air cooling". The effect part of the pair is used as the cause part of the next causal pair. This process continues until no more causal pairs can be made. The detection of causal pairs is achieved with "propositional clues". Joskowicz et al. (1989) identified causal links between messages generated by equipment installed in navy ships. This approach was also manual and domain specific.

Machine Learning
A popular supervised approach to extracting causative relations is to use a sequence classification strategy. There are a number of machine learning methods that can be used in sequence classification strategies, for example Hidden Markov Models (HMM) and Maximum Entropy Markov Models (MEMM). The research literature indicates that one of the most common methods for causal relation extraction is Conditional Random Fields (CRF). Mehrabi et al. (2013) used CRFs in a supervised strategy to extract causative relations from texts in the geriatric care domain. The authors used the following features: tokens, token categories, prefixes and suffixes, and Part Of Speech (POS) tags. The CRF had three possible labels: cause, effect and out. Riaz & Girju (2014) used verbs and nouns as features for a classifier, which the authors describe as a "basic supervised classifier". The features were grouped as lexical, semantic and structural. Lexical features were described as "verb, lemma of verb, noun phrase, lemma of all words of noun phrase, head noun of noun phrase, lemmas of all words between verb and head noun of noun phrase.". The semantic features used were the nine noun hierarchies of WordNet. The structural features were the subject and object of a verb.

[2.3] Sentiment Analysis
There are different types of sentiment analysis, for example: extraction of sentiment lexicons (fine grained) and classification (document level). This related work concentrates on sentiment classification because it is directly related to the work described in this paper. Sentiment classification treats sentiment as a classification task that assigns a document to a category, typically negative, neutral or positive. A common approach is to use machine learning (Pang et al. 2002). Machine learning uses training data to induce a classification model. The model is then used to classify unlabelled instances into the aforementioned categories. Labelled data for sentiment classification can be imbalanced, with one category comprising the majority of the data set (Drury & Lopes 2014). There are a number of strategies to reduce the effect of imbalanced data for sentiment classification, and balancing by oversampling seems to be the most effective for imbalanced Portuguese sentiment data (Drury & Lopes 2014).
Manually labelling data can be a time-consuming task; consequently, there have been a number of approaches that use semi-supervised learning. [3] Semi-supervised learning uses labelled and unlabelled data to produce a model from a classifier. One semi-supervised strategy for sentiment classification is self-training (He & Zhou 2011). Self-training induces a model from labelled instances and unlabelled data in an iterative way. In each iteration, high-confidence classifications are added to the labelled data. At the end of an iteration a new model is induced from the new training data, and the process continues. The process stops when no new instances are added to the training data. Self-training can often produce worse results than supervised learning (Drury et al. 2011). This is due to a weak classifier being induced from the training data and propagating errors through each iteration. There are strategies, such as guided self-training, that attempt to eliminate these high-confidence errors (Drury et al. 2011).

[2.4] Prediction of Future Information from Texts
This area of related work concentrates upon work that uses past information in text to predict the likelihood of a future event. Radinsky & Horvitz (2013) used causal chains and probabilistic models to infer the likelihood of a specific event occurring in the future based upon current information. Hashimoto et al. (2014) used a supervised approach to learn causal chains and predict future events. They based causality on three assumptions: 1. two nouns that are joined by a binary semantic relation form causality between two events when combined with two predicates, 2. there are specific grammatical scenarios where causality will occur and 3. cause and events are strongly associated. Radinsky & Horvitz (2013) produced an algorithm called "Pundit" that generated event scenarios from a causal event. Kunneman & Van den Bosch (2012) used Tweets about Dutch football to predict future transfers of players.

[3] A common alternative strategy is to propagate labels from labelled to unlabelled instances in a transductive strategy (Rossi et al. 2014).
[3] Corpus
The corpus that we used for the experiments consisted of news stories about agriculture in Brazil. These stories were gathered from various Internet sources and were published between 1995 and 2014. The data was not contiguous, and consequently there were temporal gaps in the data. The stories were split into sentences and POS tagged with the tool described by De Alencar (2010). The corpus contained 295,307 sentences.
[3.1] Manually Labelled Data
Labelled data was required for the causal relation extraction and the sentiment classification tasks. A random set of 394 sentences was selected from the corpus. The data was categorized by a single annotator into two categories: causative and non-causative. The non-causative category had 84 sentences and the causative category had 310 sentences. Each word of a sentence in the causative category was labelled with one of the following categories: cause, effect, causative link or non-causative. The density of causative relations was high when compared to other causative relation annotation exercises we have undertaken (Drury et al. 2014a). This may be due to the type of text annotated, or the selection of sentences may have been atypical.
The labelled causative data was sub-divided into three categories (neutral, negative or positive) for the sentiment classification evaluation. The negative category had 228 sentences, the neutral 37 and the positive 45. The negative category was the majority class, which was unsurprising as most of the agricultural news stories were negative. Examples of the labelled data can be found in Table 1.

[...] (Drury et al. 2014c) that labels causative verbs in a sentence. [5] It uses a graph-based approach that propagates causative and non-causative labels from labelled verbs to unlabelled verbs depending upon the link density between the verbs in the graph. The technique is described in full by Drury et al. (2014c). RLD is complemented by a rule tagger that annotates noun phrases in sentences. The rule classifier is based upon a number of manually created decision rules. This combination of RLD and rule labeller attempts to identify the NP V NP pattern described in the related work.
The local classifier is a stacked combination of CRFs. Stacking is a meta-learning technique where the training data is divided randomly between the CRFs. Each CRF produces a model, and the models are used in combination to label causal relations in text. Each CRF has a "separate view" of the data, and consequently the number of errors produced by the models is reduced (Vilalta & Drissi 2002). The CRFs classify each word in a sentence as either: 1. Non-causative, 2. Causative Link, 3. Cause or 4. Effect. Classification sequences that match the aforementioned NP V NP pattern are assumed to be causal relations. There were two steps to train the CRFs: feature selection and selection of the meta-learning technique.
[5] A list of causative verbs generated by a previous version of this algorithm is freely available from the resources described by Drury et al. (2014b).

OSLa volume 7(1), 2015

Feature selection was achieved using a genetic algorithm (GA) (Nongmeikapam & Bandyopadhyay 2011) because: 1. it was not clear what the best features were and 2. the feature space was large, and it was not possible to test every feature combination. The GA used a pool of 499 random solutions and 1 seed solution that contained all 54 categories of possible features. The GA used an accuracy figure from a hold-out evaluation as its fitness function. The hold-out evaluation used the manually labelled data described in Section [3.1]. The hold-out evaluation ignored correct classifications of non-causative words because this class was the majority class, and simply guessing this class for all words would have produced an accuracy of approximately 90.00% without correctly identifying any causal relations. The accuracy figure was calculated from the number of effect, causative link and cause words classified correctly, penalized by the number of non-causative words incorrectly classified as causal relation elements. The equation for the hold-out function is

$$\text{accuracy} = \frac{C_{cr}}{T_{cr} + E_{nc}}$$

where $C_{cr}$ is the number of causal relation elements (cause, effect, causal link) classified correctly, $T_{cr}$ is the total number of causal relation elements and $E_{nc}$ is the number of erroneous classifications of non-causal words as a causal relation element.
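The hold-out function can be illustrated with a short Python sketch. It assumes that gold and predicted labels are given as two equal-length sequences and borrows the label names NC, CN, EN and CV from the annotation schema of Table 1; the function name and the handling of an empty denominator are illustrative choices, not part of the original system.

```python
def holdout_fitness(gold, predicted, non_causal="NC"):
    """Return C_cr / (T_cr + E_nc) for two equal-length label sequences."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    # C_cr: causal relation elements (cause, effect, causal link) labelled correctly.
    correct_causal = sum(1 for g, p in pairs if g != non_causal and g == p)
    # T_cr: all causal relation elements in the gold annotation.
    total_causal = sum(1 for g in gold if g != non_causal)
    # E_nc: non-causal words wrongly labelled as a causal relation element.
    false_causal = sum(1 for g, p in pairs if g == non_causal and p != non_causal)
    denominator = total_causal + false_causal
    # Assumption: a sequence without any causal element scores 0.0.
    return correct_causal / denominator if denominator else 0.0


gold = ["CN", "CV", "EN", "NC", "NC"]
predicted = ["CN", "CV", "NC", "EN", "NC"]
score = holdout_fitness(gold, predicted)  # 2 / (3 + 1)
```

Because correct non-causative classifications never enter the numerator, a model that labels every word as non-causative scores zero under this measure, which is the behaviour the fitness function was designed to enforce.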
The solutions were ranked by accuracy and the bottom 50% of the solutions were removed. The breeding strategy selected one surviving solution and randomly chose another surviving solution to breed with. The order of the features of the breeding solutions was randomized, and 50% of each solution was selected for the new solution. Duplicate features were removed. The mutation rate was 0.1, meaning that 25 of the new solutions were mutated. The mutation strategy took one feature of the solution and either changed its value or swapped it for a new feature. The GA ran for 35 generations; it was limited to 35 generations because it was a time-intensive process. The results are displayed in Figure 1. The diagram shows a steady increase in accuracy over the generations, with a number of plateaus. We hypothesize that the plateaus were caused by a delay in the best solutions influencing the population. The results represent a 14.28% relative increase over the initial "best solution" selected in the first generation. The results were unimpressive because 1. we excluded correct non-causative classifications from the fitness measure and 2. the limited amount of labelled data produced weak models.

The categories of features selected by the GA strategy were: words ahead (number of words ahead: 16, 4, 8), word behind (number of words behind: 1), and the word features number, punctuation, start of sentence, sentiment value, stopword and current word. An example of the features is provided in Figure 2, where the word features are demonstrated for the cause candidate "fumo". The "look behind" word is "O" and the "look ahead" words are: do, momento, Nervoso. Each of these words had a number of word-specific features. For example, the cause candidate "fumo" would have the following word features: IsStartOfSentence: false, IsPunctuation: false, HasSentimentValue: false, IsStopword: false and CurrentWord: fumo. Each of the look-ahead and look-behind word features would be included in the features for the cause candidate "fumo".
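A minimal Python sketch of such a window-based feature extractor follows. The stopword list, the sentiment lexicon, the offset-prefixed key format and the token order of the example are assumptions made for illustration; only the feature names and the words "O", "fumo", "do", "momento" and "Nervoso" come from the description above.

```python
import string

STOPWORDS = {"o", "a", "do", "da", "de"}  # hypothetical stopword list
SENTIMENT_WORDS = {"nervoso"}             # hypothetical sentiment lexicon


def word_features(tokens, i):
    """Word-specific features for the token at position i."""
    word = tokens[i]
    return {
        "CurrentWord": word,
        "IsNumber": word.isdigit(),
        "IsPunctuation": bool(word) and all(ch in string.punctuation for ch in word),
        "IsStartOfSentence": i == 0,
        "HasSentimentValue": word.lower() in SENTIMENT_WORDS,
        "IsStopword": word.lower() in STOPWORDS,
    }


def window_features(tokens, i, behind=1, ahead=3):
    """Features of the candidate at i plus those of its look-behind and look-ahead words."""
    features = {f"0:{name}": value for name, value in word_features(tokens, i).items()}
    for offset in range(-behind, ahead + 1):
        j = i + offset
        if offset == 0 or not 0 <= j < len(tokens):
            continue  # skip the candidate itself and positions outside the sentence
        for name, value in word_features(tokens, j).items():
            features[f"{offset:+d}:{name}"] = value
    return features


tokens = ["O", "fumo", "do", "momento", "Nervoso"]  # assumed token order
fumo_features = window_features(tokens, 1)
```

Prefixing each key with its offset keeps the features of neighbouring words distinct from those of the candidate, so a sequence classifier can weight "the previous word is a stopword" separately from "the current word is a stopword".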
In addition to using feature selection to improve the performance of the CRF, we evaluated the effectiveness of meta-learning. The meta-learning technique we evaluated was stacking (Klugl et al. 2012), because the research literature suggests that stacked CRFs outperform a single CRF. The stacking strategy we attempted was to provide a separate random part of the training data to each individual CRF. The CRFs then vote on each classification, with the majority vote being accepted as the classification of the stacked CRF.
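The partition-and-vote scheme can be sketched in Python without committing to a particular CRF library. Both function names are hypothetical, the round-robin split after shuffling is one possible way of giving each model a separate random part of the data, and the tie-breaking rule (the label seen first wins) is an arbitrary assumption that the text does not specify.

```python
import random
from collections import Counter


def split_training_data(sentences, n_models, seed=0):
    """Shuffle the sentences and deal them out into n_models disjoint parts."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return [shuffled[k::n_models] for k in range(n_models)]


def majority_vote(label_sequences):
    """Combine one label sequence per model into a single sequence by voting per word."""
    voted = []
    for labels_at_position in zip(*label_sequences):
        # most_common(1) returns the most frequent label; ties go to the first seen.
        voted.append(Counter(labels_at_position).most_common(1)[0][0])
    return voted


parts = split_training_data(range(10), n_models=3)
combined = majority_vote([
    ["CN", "CV", "EN"],  # output of model 1
    ["CN", "NC", "EN"],  # output of model 2
    ["NC", "CV", "EN"],  # output of model 3
])
```

An odd number of voters, as in the 3 and 5 CRF configurations, reduces the number of ties when two labels compete for the same word.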
We performed a basic evaluation of stacks of 3 and 5 CRFs against a baseline of 1 CRF. The evaluation was a hold-out evaluation using the manually labelled data described in Section [3.1]. The hold-out evaluation was 80:20 (1 × 10), where the data was randomly separated into two partitions: 80% for training and 20% for evaluation. The process was repeated 10 times and an average accuracy was calculated. We found that a stack of 3 CRFs gained the highest accuracy on the hold-out evaluation. A more in-depth evaluation is described later in the paper.
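The repeated hold-out protocol can be sketched as follows. The `train_and_score` callback is a hypothetical stand-in for training a classifier on the first partition and returning its accuracy on the second; the seed and the truncating split are illustrative assumptions.

```python
import random
from statistics import mean


def repeated_holdout(data, train_and_score, repeats=10, train_fraction=0.8, seed=0):
    """Average the score of `repeats` random train/evaluation splits of the data."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        shuffled = list(data)
        rng.shuffle(shuffled)                           # a fresh random split each time
        cut = int(len(shuffled) * train_fraction)       # 80% for training
        scores.append(train_and_score(shuffled[:cut], shuffled[cut:]))
    return mean(scores)
```

With the 394 labelled sentences of Section [3.1], this split would give 315 training and 79 evaluation sentences per repetition.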

[4.2] Self-training
The labelled data described in Section [3.1] was limited, and consequently any model produced from this data was likely to be weak and to produce errors. This characteristic of a weak classifier was shown in the feature selection experiments, where the single classifier gained relatively low accuracy measures. A semi-supervised learning strategy is a method that combines labelled and unlabelled data to improve the performance of a classifier. We chose self-training, which is an iterative technique that adds high-confidence classifications of unlabelled data to the training data for the next cycle. A weakness of self-training is error propagation, where the classifier makes an error in classification that is then added to the training data and influences the next cycle. It is possible that a classifier could have lower accuracy after self-training than the model induced from the original training data (Drury et al. 2011).
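The basic self-training cycle can be sketched in Python as below. This is plain self-training, not the multi-view variant used in this paper: the `train` callback, the confidence threshold of 0.9 and the iteration cap are all hypothetical, and a model is assumed to be a callable that returns a (label, confidence) pair.

```python
def self_train(labelled, unlabelled, train, threshold=0.9, max_iterations=10):
    """Iteratively move high-confidence classifications into the training data."""
    labelled = list(labelled)      # (item, label) pairs
    pool = list(unlabelled)        # items without a label
    model = train(labelled)
    for _ in range(max_iterations):
        confident, remaining = [], []
        for item in pool:
            label, confidence = model(item)
            if confidence >= threshold:
                confident.append((item, label))
            else:
                remaining.append(item)
        if not confident:
            break                  # stop when no new instances are added
        labelled.extend(confident)
        pool = remaining
        model = train(labelled)    # induce a new model from the enlarged data
    return model, labelled
```

The sketch also makes the error propagation risk visible: any wrong label that clears the threshold is appended to `labelled` and is treated as ground truth by every later call to `train`.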
As stated earlier, this algorithm used local and global classifiers to mitigate error propagation. We performed a number of experiments with various configurations of classifiers to supplement the limited hold-out evaluation we performed earlier. The experiments with self-training were designed to justify the selections made for the algorithm. The experiments allowed each configuration of classifiers to classify the whole corpus, and a random selection of 100 classifications was analyzed manually to produce an accuracy figure for annotations and for sentence classification. There was only one iteration for each classifier due to time constraints. The combinations analyzed were: 1. single Conditional Random Field, 2. Relative Link Classifier and Rule Labeller, 3. Relative Link Classifier and Rule Labeller with single Conditional Random Field, and 4. Relative Link Classifier and Rule Labeller with stacked Conditional Random Fields. We calculated an error bar based upon a 95% confidence interval. The results are displayed in Table 2.
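The text does not state how the error bar was computed. One common choice for an accuracy estimated from a random sample is the normal approximation to the binomial, sketched here in Python as an assumption rather than as the method actually used; the function name and the accuracy value in the example are hypothetical.

```python
import math


def proportion_margin(accuracy, n, z=1.96):
    """Half-width of a confidence interval for a proportion (z = 1.96 for 95%)."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


# Illustrative only: a sample of 100 manually checked classifications.
margin = proportion_margin(0.5, 100)
```

Under this approximation a sample of 100 classifications gives a margin of at most about ten percentage points, so only fairly large differences between configurations can be separated.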
The results show that the combinations of the rule classifier with the various configurations of CRFs outperformed both the Relative Link Classifier and a single Conditional Random Field. The stacked CRF was the only combination that outperformed the Relative Link Classifier by more than the margin of error; consequently, it was chosen for the causative relation extraction of our algorithm. The relatively poor performance of the single CRF reflected our experience in the feature selection phase. The causal relation extraction self-training algorithm is fully described in Algorithm 1.
Dictionary construction was achieved by extracting adjectives, adverbs and nouns from the training data. These words were expanded with synonyms from Onto.pt (Gonçalo Oliveira 2014). Onto.pt is a taxonomy of Portuguese words that are organized into synsets of related words. The synonyms were extracted by: 1. loading the taxonomy into the rdflib Python library [6] and 2. returning words (synonyms) from the same synset as a target word.
The training data was constructed by dividing the training data described in Section [3] into three sentiment categories: neutral, negative and positive. This data was used for dictionary construction and as training data for a classifier. The positive dictionary had 312 entries, whereas the negative dictionary had 4767 entries. This indicates that the training data was overwhelmingly negative. Examples of the entries are shown in Table 3.
The linguistic rules are the rules described by Drury et al. (2011), where a causal relation is classified into one of the sentiment classes with the following criteria: 1. a sentence is classified as positive if it has two or more entries from the positive dictionary and none from the negative dictionary, 2. a sentence is classified as negative if it has two or more entries from the negative dictionary and none from the positive dictionary, 3. a sentence is classified as neutral if it contains no entries from either the positive or the negative dictionary, and 4. if a sentence contains one entry from the positive or the negative dictionary then no classification is made.
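The four criteria translate directly into a small Python function. The function name, the lower-casing of tokens and the dictionary entries in the example are hypothetical; sentences that mix positive and negative entries are not covered by the criteria, and returning no classification for them is an assumption.

```python
def classify_by_rules(tokens, positive_dict, negative_dict):
    """Return 'positive', 'negative', 'neutral', or None when no rule applies."""
    positive_hits = sum(1 for token in tokens if token.lower() in positive_dict)
    negative_hits = sum(1 for token in tokens if token.lower() in negative_dict)
    if positive_hits >= 2 and negative_hits == 0:
        return "positive"   # criterion 1
    if negative_hits >= 2 and positive_hits == 0:
        return "negative"   # criterion 2
    if positive_hits == 0 and negative_hits == 0:
        return "neutral"    # criterion 3
    # Criterion 4 (a single entry) and, by assumption, mixed sentences.
    return None


positive_dict = {"alta", "lucro"}   # hypothetical dictionary entries
negative_dict = {"seca", "perda"}   # hypothetical dictionary entries
label = classify_by_rules(["seca", "causa", "perda"], positive_dict, negative_dict)
```

Requiring two entries and no opposing entry trades coverage for precision: the rules abstain on weak evidence instead of guessing, which suits their role of checking classifications during self-training.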
The guided self-training strategy was adjusted to use balancing strategies to improve the performance of the induced model. We used random oversampling, which has been shown to gain good results in the sentiment classification of Portuguese (Drury & Lopes 2014). The guided self-training algorithm for sentiment classification is described in Algorithm 2.
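Random oversampling can be sketched in Python as follows. The function name and the seed are hypothetical, and the sketch assumes the usual form of the technique, in which every minority class is topped up with randomly repeated examples until it matches the size of the majority class.

```python
import random
from collections import defaultdict


def random_oversample(examples, seed=0):
    """Balance (text, label) pairs by randomly repeating minority-class examples."""
    if not examples:
        return []
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    target = max(len(group) for group in by_label.values())  # majority class size
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw with replacement until the class reaches the majority size.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced
```

Applied to the labelled sentiment data of Section [3.1] (228 negative, 37 neutral and 45 positive sentences), this would bring each class to 228 examples.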

Guided Self-training evaluation:
The suitability of the sentiment classification strategy was evaluated with an 80:20 (1 × 10) hold-out evaluation. The hold-out evaluation relied upon labelled data, which in this case was the labelled sentiment data described in Section [3.1]. The hold-out evaluation reserved 80% of the data for training and 20% for testing. The test was repeated 10 times with different splits of the data. The competing strategies [...]

[6] http://code.google.com/p/rdflib/

[5] Sentiment prediction
The last step in the strategy is to assign a sentiment probability to a cause. This is achieved by grouping common causes and aggregating their sentiment categories to produce a sentiment distribution for a specific cause. This grouping process is illustrated in the following example. We have three causative sentences and their sentiment categories: 1. "chuva causa cheias no Porto", neutral, 2. "chuva causa danos em Minas Gerais", negative and 3. "Chuva causa inundações e destrói casa em Itapetininga", negative. The common cause is "chuva", and its sentiment distribution would be P = {Neu = 0.33, Neg = 0.66, Pos = 0.0}.
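The grouping and aggregation step can be sketched in Python. The sketch assumes that the causal relation extractor has already reduced each sentence to a (cause, sentiment) pair and that causes are grouped after trimming and lower-casing, which is what the "chuva"/"Chuva" example implies; the function name and the label strings are illustrative.

```python
from collections import Counter, defaultdict

SENTIMENTS = ("neutral", "negative", "positive")


def sentiment_distributions(relations):
    """Map each common cause to the relative frequency of its sentiment categories."""
    counts = defaultdict(Counter)
    for cause, sentiment in relations:
        counts[cause.strip().lower()][sentiment] += 1   # group common causes
    distributions = {}
    for cause, counter in counts.items():
        total = sum(counter.values())
        distributions[cause] = {label: counter[label] / total for label in SENTIMENTS}
    return distributions


relations = [
    ("chuva", "neutral"),    # "chuva causa cheias no Porto"
    ("chuva", "negative"),   # "chuva causa danos em Minas Gerais"
    ("Chuva", "negative"),   # "Chuva causa inundações e destrói casa em Itapetininga"
]
chuva = sentiment_distributions(relations)["chuva"]
```

For the three example sentences the function yields one third neutral and two thirds negative, which the text reports as 0.33 and 0.66.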

[5.1] Experiments
The experiments for sentiment prediction manually evaluated the sentiment classifications for specific common causes. In the experiments we ran the aforementioned causal relation extractor and sentiment classifier. The relations were grouped by cause and their sentiment distributions were calculated. There were 4988 common causes. The most frequent sentiment causal events and their sentiment distributions are displayed in Table 5.

[...] seem to make "intuitive" sense. For example, "seca" (drought) will be mainly negative for agriculture because of future lower crop yields; however, it seems reasonable that there may be some positive future news (for farmers) in the form of crop price rises due to lower supply and constant demand, although this news could be seen as negative for consumers. The future work is to evaluate the predictive ability of the sentiment distributions of causes. This work is centred around agriculture, and causes such as "falta de chuva" (lack of rain) or "seca" are likely to have similar effects on crops in the future as they have had in the past. It is reasonable to assume, at least in this domain, that we can estimate the sentiment distribution of future news stories. This may allow the improvement of time-dependent sentiment tasks such as reputation management and stock trading.

Figure 1: Evolution of accuracy with GA feature selection.

Acknowledgements
This work was supported by FAPESP grant number: 11/20451-1.

References
Altenberg, Bengt. 1984. Causal linking in spoken and written English. Studia Linguistica 38(1). 20-69.
Ando, Rie Kubota & Tong Zhang. 2007. Two-view feature generation model for semi-supervised learning. In Proceedings of the 24th international conference on machine learning, 25-32. ACM.
Baron, Naomi S. 1974. The structure of English causatives. Lingua 33(4). 299-342.
De Alencar, Leonel Figueiredo. 2010. Uma ferramenta para anotação automática de corpora usando o NLTK. In The 9th Brazilian corpus linguistics meeting, s/pp.
Drury, Brett, Paula C. F. Cardoso, Jorge Carlos Valverde-Rebaza, Alan Valejo, Fabio Pereira & Alneu de Andrade Lopes. 2014a. An open source tool for crowdsourcing the manual annotation of texts. In Computational processing of the Portuguese language - 11th international conference, PROPOR, 268-273.
Drury, Brett, Paula C. F. Cardoso, Janie M. Thomas & Alneu de Andrade Lopes. 2014b. Lexical resources for the identification of causative relations in Portuguese texts. In Proceedings of workshop on tools and resources for automatically processing Portuguese and Spanish, s/pp.
Drury, Brett & Alneu Lopes. 2014. A comparison of the effect of feature selection and balancing strategies upon the sentiment classification of Portuguese news stories. In Proceedings of ENIAC, s/pp.

Table 1: Example of causative labelled data.

The training data is available from http://goo.gl/IYP1t1. [4]

[4] The annotation schema for the data is: NC = non-causative, CN = Cause Noun, EN = Effect Noun and CV = Causal Verb.

The algorithm was designed to: 1. extract causal relations from text, 2. label the cause, effect and causal link of the relation and 3. classify the causal relation into negative, neutral or positive categories.

Table 2: Analysis of Causal Relation Strategies.

Table 4: Results for Hold-Out Evaluation.

Table 5: Frequent Causal Events and their Sentiment Distribution.