
Abstract— Sarcasm is defined as witty language used to convey insults or scorn. It is used for remarks that clearly mean the opposite
of what people want to say, made in order to hurt someone’s feelings or to criticize something in a humorous way. In speech, sarcasm is
easy to recognize from pitch of voice, gestures, facial expressions and similar cues. In textual data these cues are absent, so sarcasm is
difficult to detect. Sentiment analysis is used to determine someone’s opinion of, or attitude towards, a particular event, company,
etc. Sarcasm is one kind of sentiment, used for taunting, insulting or making fun of someone. Various algorithms have been
proposed to detect sarcasm based on different features, domains and types of sarcasm. We propose a Hadoop-based framework that
captures real-time tweets, processes them and applies a hybrid algorithm that identifies sarcastic sentiment efficiently. The hybrid approach combines
lexical and hyperbole features to improve the accuracy, precision and F-score of the system.
Keywords— Big data, Hadoop, MapReduce, Sentiment analysis, Sarcasm detection
I. INTRODUCTION
Nowadays, most people use Twitter, Facebook and other microblogging sites. They share their opinions and feelings about
particular topics through comments and reviews. The volume of data generated daily is very large, so it is important to analyse this data
to extract information from it. Sentiment analysis mines various types of data for opinions through text analytics.
A sentiment can be positive, negative or neutral.
Twitter has become one of the biggest platforms for people to express opinions, share their thoughts and stay regularly updated about
organizations, events, etc. The data collected is therefore huge and is commonly called big data. To process such data we need a framework
that manages the entire pipeline.
People also use sarcasm in their daily life. Sarcasm expresses the opposite of what a person literally says, and it is used to
make fun of others, to annoy someone or to show anger. Detecting it is therefore important for the accuracy of a sentiment analysis system.
A sentiment is classified as positive, negative or neutral, but a seemingly positive sentiment may be either genuinely positive or sarcastic, and
a seemingly negative sentiment may be either genuinely negative or sarcastic. If sarcasm is ignored, sentiment analysis suffers because sarcasm may
reverse the polarity of a sentence. Detecting it is thus necessary for accurate sentiment analysis of any company or organization.
The online Oxford dictionary defines sarcasm as “the use of irony to mock or convey contempt”. Collins dictionary defines it
as “mocking, contemptuous, or ironic language intended to convey scorn or insult”. According to Macmillan English dictionary,
sarcasm is “the activity of saying or writing the opposite of what you mean, or of speaking in a way intended to make someone
else feel stupid or show them that you are angry”.
Many researchers are now working in this field. A huge amount of data is generated every day, and analysing it to
extract information takes time. Different algorithms and approaches have been proposed to detect
sarcasm, but only limited accuracy has been achieved. This makes improving the accuracy of sarcasm detection an attractive
research topic.
Several difficulties make sarcasm detection an interesting task. For example, the sentence “Wow, there is huge
amount discount.” is read as a compliment. However, consider the following sentence: “Wow, there is huge
amount of discount but I don’t buy anything.” This sentence makes clear that the person did not mean what he/she said. Even for human readers
such sarcasm can be difficult to detect.
Different features can be used to detect sarcasm. Bharti et al. [1] describe the types of
features available. First, lexical features are used to detect sarcasm in plain text through unigrams, bigrams
and n-grams. Bigrams and n-grams have more impact on sentiment analysis. Second,
hyperbole features emphasize the meaning of text. Among them, interjection words have a strong tendency to indicate sarcasm, so
they play an important role in detection. Other hyperbole features are punctuation marks, quotes and
intensifiers, which are used to improve the performance of the system. For example, “excellent marks” has a higher impact than “good marks”,
so intensifiers make sarcasm easier to detect. Third, pragmatic features express emotions more accurately through
smileys, emoticons and replies. We need to identify which type of feature is present so that the appropriate algorithm is applied. In our
research, we combine two features, lexical and hyperbole, to improve the accuracy of the system.
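As an illustration of the lexical feature, the following minimal sketch extracts unigrams and bigrams from a tokenized tweet. The function name ngrams and the whitespace tokenization are assumptions made for this example; they are not taken from the cited work.

```python
def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Whitespace tokenization is a simplification; real tweets need a proper tokenizer.
tokens = "wow there is huge amount of discount".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
```

A token list shorter than n simply yields an empty list, so the same helper can be used for any n.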
Negation words have an impact on sentiment analysis. We have to consider them when detecting sarcasm because they reverse the polarity
of a sentence. In addition to the lexical and hyperbole features, we therefore use a
negation feature to improve the precision of sarcasm detection. MapReduce is used to reduce execution time. It is a parallel
computing framework for building reliable, cost-effective and flexible applications.
There are different types of sarcasm: (1) Contradiction between positive sentiment and negative situation. For
example, “I feel great being ignored”. (2) Contradiction between negative sentiment and positive situation. For example, “I hate
New Zealand team because it always wins”. (3) Tweet starts with an interjection word. For example, “Wow, there is huge amount of
discount but I don’t buy anything!!” (4) Likes and dislikes contradiction. (5) Tweet contradicting universal facts. (6) Tweet
containing positive sentiment with an antonym pair. (7) Tweet contradicting time-dependent facts.
Detecting sarcasm poses many challenges. Twitter is used as the dataset for sarcasm detection. Twitter limits messages to 140
characters, which creates more ambiguity. Tweets also contain uncommon words, slang and abbreviations of an
informal nature, which makes sarcasm detection difficult. There is no predefined structure for sarcasm. Sarcasm is easy to
detect if a #sarcasm tag is present at the end or in the middle of a tweet, but detection is difficult when no such tag
is available. Joshi et al. [2] highlight three main challenges: i) the identification of common knowledge, ii) the intent to
ridicule, and iii) the speaker-listener (or reader in the case of written text) context.
The objectives of our system are listed below:
1. To study different approaches available for sarcasm detection.
2. To study different features and types of sarcasm available for detection.
3. To propose a modified approach for efficient sarcasm detection.
4. To improve accuracy of sentimental analysis and reduce execution time.
II. RELATED WORK
Many approaches are available for sarcasm detection. Different authors consider various features and approaches to
improve the accuracy of the system. There are two main approaches: (1) machine learning and (2) rule-based. The
machine learning approach is a method of analysis that builds a model to predict, arrange or classify data through a statistical
process. The rule-based approach is a technique that exploits semantic, syntactic and stylistic properties of sentences in
any language, such as phrase patterns and lexical and structural attributes, to analyse the sentiment of a sentence.
Bouazizi and Ohtsuki [3] proposed a supervised machine learning approach. They focus on the importance of a proposed set of
features for detecting sarcasm, and for each feature they identify a set of parameters, which they use to train and test on the data set.
Sentiment, punctuation, syntactic, semantic and pattern-based features are used to train the classifier. For classification, random
forest, maximum entropy, SVM and naïve Bayes are used. Rajadesingan et al. [4] address the difficult task of sarcasm
detection on Twitter by leveraging behavioral traits of users expressing sarcasm. They employ theories from behavioral and
psychological studies to construct a behavioral modeling framework for detecting sarcasm, called SCUBA (Sarcasm Classification Using
a Behavioral modeling Approach). Different forms of sarcasm are considered: sarcasm as a contrast of sentiments, sarcasm
as a complex form of expression, sarcasm as a means of conveying emotion, sarcasm as a possible function of familiarity, and
sarcasm as a form of written expression. Tungthamthiti et al. [5] use concept-level knowledge to identify
contradiction between sentiment and situation. For example, “I love going to work on holidays” contains the positive sentiment word “love”, but it
is actually a sarcastic sentence. Concept-level knowledge indicates that holidays are a relaxed situation while work is a stressful
one, so the two contradict each other and the sentence is considered sarcastic. They also focus on coherency, that is, the correlation among
sentences, to detect sarcasm when multiple sentences are present.
Bharti et al. [1] proposed algorithms for different types of sarcasm and considered lexical and interjection features to detect
sarcasm. They captured and processed real-time tweets using Apache Flume and Hive under the Hadoop framework, proposed a
set of algorithms to detect sarcasm in tweets under the Hadoop framework and proposed another set of algorithms to detect
sarcasm in tweets. Riloff et al. [6] proposed a bootstrapping algorithm that automatically learns phrases corresponding to positive
sentiments and phrases corresponding to negative situations. They use tweets that contain a sarcasm hashtag as positive instances
for the learning process. They use the learned lists of sentiment and situation phrases to recognize sarcasm in new tweets by
identifying contexts that contain a positive sentiment in close proximity to a negative situation phrase.
Clews and Kuzma [7] apply string matching against positive sentiment and interjection lexicons to test whether the presence of both can be
used to classify content as sarcastic. Because they focus only on positive sentiment, which would suggest a negative feeling,
tweets that contain negative sentiment and therefore a positive feeling are ignored. Additionally, the use of interjections
is not unique to sarcastic texts, and many tweets may contain them where an author wishes to enhance the expressed sentiment.
Vijayalaksmi and Senthilrajan [8] combined different semi-supervised algorithms, namely lexical analysis with an n-gram approach, knowledge
extraction, a contrast approach, an emoticon-based approach and a hyperbole approach, into a new rule-based hybrid approach for
sarcasm detection. However, developing the dictionaries for these algorithms takes considerable time. Their study did not address sarcasm detection for
languages other than English, repeated tweets, or empty or single-letter/word tweets.
Different authors have proposed different approaches for efficient sarcasm detection. PBLGA is a parsing-based lexicon generation
algorithm that generates the lexicon used to check for sarcasm. A contradiction between sentiment and situation has a high
probability of being classified as sarcastic. Another algorithm, IWS (interjection word start), identifies sarcasm in sentences that start with
interjection words such as wow, oh, yeah, etc. Table I shows a comparison of individual algorithms with existing state-of-the-art algorithms
on various parameters such as precision, recall and F-score.
Table I Comparison of individual algorithm with state-of-art algorithm
III. PROPOSED SYSTEM
A. Data
In this study, we consider Twitter data for sarcasm detection, so tweets have to be retrieved through an API (input).
Twitter provides different APIs: the Search API retrieves tweets matching a keyword, the Streaming API
fetches real-time live tweets, and the REST API retrieves tweets from the Twitter database. These tweets are then stored in
Hadoop’s HDFS file system for further processing.
B. Preprocessing of Data
Tweet preprocessing is required to remove noisy data that is not useful for deciding sentiment. Tweets contain
extra information such as URLs, which give more information about a particular topic or point to an image, and @user
mentions. These are not necessary for detecting sarcasm and are therefore noise. We remove this type of
noisy data to improve the performance of the system.
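The cleaning step can be sketched with two regular expressions, one for URLs and one for @user mentions. The patterns and the function name clean_tweet are assumptions for illustration; the paper does not specify how the noise is matched. Hashtags are deliberately left in place, since a #sarcasm tag is itself a useful signal.

```python
import re

# Assumed patterns: http(s) or www links, and @user mentions.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")


def clean_tweet(text):
    """Strip URLs and @user mentions, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return " ".join(text.split())
```

Removing URLs before mentions avoids leaving fragments behind when a link happens to contain an @ character.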
C. Part of Speech Tagging
Part-of-speech (P.O.S.) tagging is the process of taking the words of a text (corpus) as input and assigning the corresponding part of speech
to each word as output, based on its definition and context, i.e. its relationship with adjacent and related words in a phrase,
sentence or paragraph. After P.O.S. tagging, all phrases are stored in a parse file (PF), which is given as input to our proposed algorithm.
For example: “I love being ignored”. After P.O.S. tagging: I|PRP love|VBP being|VBG ignored|VBN.
After assigning a part of speech to each word, it is necessary to record the position of each tag so that we can identify which is the first
tag, the second tag and the remaining tags. Separating the tags is useful for identifying sarcasm in interjection-related tweets. Bharti et al. [9]
proposed an algorithm for assigning a tag to each word.
P.O.S. Tagging
Data: dataset := annotated corpus
Result: WT := dictionary with a count for each (word, tag) pair in the corpus
        TT := dictionary with a count for each bigram tag pair
        T := dictionary with a count for each tag
while sentence in corpus do
    while word in sentence do
        if word == first word then
            previous tag = $
            current tag = POS tag of current word
            TT[previous tag, current tag]++
            T[current tag]++
            WT[word, current tag]++
        else
            previous tag = POS tag of previous word
            current tag = POS tag of current word
            TT[previous tag, current tag]++
            T[current tag]++
            WT[word, current tag]++
        end
    end
end
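The counting procedure above can be written as a short runnable sketch. It assumes the corpus is already tagged and supplied as lists of (word, tag) pairs; the function name count_tags and this input layout are illustrative choices, not part of the cited algorithm.

```python
from collections import Counter


def count_tags(tagged_corpus, start_tag="$"):
    """Build the WT, TT and T count tables from a tagged corpus.

    tagged_corpus: list of sentences, each a list of (word, tag) pairs.
    """
    wt = Counter()  # (word, tag) -> occurrences
    tt = Counter()  # (previous tag, current tag) -> occurrences
    t = Counter()   # tag -> occurrences
    for sentence in tagged_corpus:
        previous_tag = start_tag  # the first word is preceded by the start marker
        for word, current_tag in sentence:
            tt[(previous_tag, current_tag)] += 1
            t[current_tag] += 1
            wt[(word, current_tag)] += 1
            previous_tag = current_tag
    return wt, tt, t


corpus = [[("I", "PRP"), ("love", "VBP"), ("being", "VBG"), ("ignored", "VBN")]]
wt, tt, t = count_tags(corpus)
```

Carrying previous_tag forward through the loop replaces the separate first-word branch of the pseudocode while producing the same counts.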
D. Sentiment analysis of phrase
After P.O.S. tagging, sentiment analysis of each phrase is performed. For that, the positive ratio and negative ratio have to be determined.
The positive ratio is the number of positive words in a phrase divided by the total number of words in the phrase. The negative ratio is the
number of negative words in a phrase divided by the total number of words in the phrase. Intensifiers have a high impact on sarcasm detection. For
example, “fantastic weather” has a higher impact than “good weather”. A rule-based pattern is applied to find the polarity of a word if an intensifier
is present. The sentiment score is calculated as:
Sentiment score= Positive Ratio – Negative Ratio
PR = PWP / TWP
NR = NWP / TWP
PR= Positive Ratio, NR= Negative Ratio, PWP= Number of positive words per phrase, NWP=Number of negative words per
phrase, TWP= Total words in phrase.
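A minimal sketch of this score follows. The two word lists are tiny hypothetical lexicons chosen only for the example; a real system would load a full sentiment lexicon, and the intensifier weighting mentioned above is not modelled here.

```python
# Hypothetical mini-lexicons, for illustration only.
POSITIVE_WORDS = {"love", "great", "good", "fantastic", "excellent"}
NEGATIVE_WORDS = {"hate", "ignored", "bad", "terrible"}


def sentiment_score(phrase):
    """Return PR - NR for a phrase, where PR = PWP / TWP and NR = NWP / TWP."""
    words = phrase.lower().split()
    twp = len(words)
    if twp == 0:
        return 0.0  # an empty phrase is treated as neutral
    pwp = sum(1 for w in words if w in POSITIVE_WORDS)
    nwp = sum(1 for w in words if w in NEGATIVE_WORDS)
    return pwp / twp - nwp / twp
```

Under these assumed lexicons, the phrase "love" scores positive and "being ignored" scores negative, which is the kind of split the sentiment and situation files rely on.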
E. Feature based composite approach
The feature-based composite approach (FBCA) using MapReduce is our proposed algorithm, explained in Section IV. It
combines two features, lexical and hyperbole, and uses MapReduce for faster execution. It also considers punctuation and
negation features to improve the precision of the system. Executing the proposed algorithm determines whether a tweet is sarcastic or
not; this is the step in which the actual detection of sarcasm is done.
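The contradiction test at the core of this step can be sketched as follows. This is a simplified illustration, not the full FBCA procedure: it assumes the four phrase sets (positive and negative sentiment, positive and negative situation) have already been built, that punctuation has been removed from the tweet, and it ignores the order in which the phrases occur. All function and parameter names are hypothetical.

```python
def contains_any(text, phrases):
    """True if any phrase occurs in the text as a whole-word sequence."""
    padded = " " + " ".join(text.lower().split()) + " "
    return any(" " + p.lower() + " " in padded for p in phrases)


def is_contradictory(tweet, pos_sentiment, neg_sentiment,
                     pos_situation, neg_situation):
    """Flag a tweet whose sentiment polarity clashes with its situation polarity."""
    pos_vs_neg = (contains_any(tweet, pos_sentiment)
                  and contains_any(tweet, neg_situation))
    neg_vs_pos = (contains_any(tweet, neg_sentiment)
                  and contains_any(tweet, pos_situation))
    return pos_vs_neg or neg_vs_pos
```

For instance, with "love" as a positive sentiment phrase and "being ignored" as a negative situation phrase, the tweet "I love being ignored" is flagged, while a tweet containing only the sentiment phrase is not.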
F. Compare precision with individual approach
We compute the precision of the proposed approach and compare it with that of the individual approaches to identify the improvement.
Precision is the fraction of tweets classified as sarcastic that are actually sarcastic. In other words, it is the number of
tweets correctly classified as sarcastic divided by the total number of tweets classified as sarcastic. To compute
precision, the true positive (tp), true negative (tn), false positive (fp) and false negative (fn) counts are considered, taking sarcastic as the positive class. A true positive is a sarcastic tweet
detected as sarcastic. A true negative is a non-sarcastic tweet detected as non-sarcastic. A false positive is
a non-sarcastic tweet detected as sarcastic. A false negative is a sarcastic tweet detected as non-sarcastic. A confusion matrix
is created after executing the proposed algorithm. After performing all steps, the output is shown as a graph comparing the
individual algorithms with the proposed algorithm.
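For reference, the evaluation measures can be computed from the confusion-matrix counts as in the sketch below. The function name is an assumption, and returning 0.0 when a denominator is zero is a convention chosen for this example rather than something stated in the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F-score from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score
```

The true negative count does not enter any of the three measures, which is why precision and F-score remain informative even when non-sarcastic tweets greatly outnumber sarcastic ones.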
IV. PROPOSED ALGORITHM
FBCA (Feature based Composite Approach)
Input: Tweet Corpus, interjection corpus, P.O.S. tag file (TF), Parse file (PF)
Output: Classification of tweets as sarcastic or not sarcastic.
Notation: A: adjective, V: verb, R: adverb, N: noun, UH: interjection, T: tweets, C: corpus, t: tag, TWT: tweet wise tag, FT: first
tag, INT: immediate next tag, NT: next tag, SF: sentiment file, sf: situation file, PSF: positive sentiment file, NSF: negative
sentiment file, psf: positive situation file, nsf: negative situation file, SC: sentiment score, E: exclamation mark more than two,
ISC: interjection sarcastic count, IF: interjection file, TWP: tweet wise phrase
Initialisation: TF = {}, SF = {}, sf = {}, PSF = {}, NSF = {}, psf = {}, nsf = {}, count = 0, flag = 0
for T in C do
    Take FT, INT, NT from TWT
    if UH in T then
        if FT = UH && INT = (ADJ || ADV) && NT = E then
            Tweet is sarcastic; increment ISC
            Store tweet into IF
        else if (FT = UH) && (NT = (ADV + ADJ) || (ADJ + N) || (ADV + V)) then
            Tweet is sarcastic; increment ISC
            Store tweet into IF
        else
            Tweet is not sarcastic
        end if
        for T in IF do
            k = find_parse(T)
            PF ← TF ∪ k
        end for
    else
        for TWP in PF do
            k = find_subset(TWP)
            if k = NP || ADJP || (NP + VP) then
                SF ← SF ∪ k
            else if k = VP || (ADVP + VP) || (VP + ADVP) || (ADJP + VP) || (VP + NP) || (VP + ADVP + ADJP) ||
                    (V + ADJP + NP) || (ADV + ADJP + NP) then
                sf ← sf ∪ k
            end if
        end for
        for P in SF do
            SC = sentiment_score(P)
            if SC > 0.0 then
                PSF ← PSF ∪ P
            else if SC < 0.0 then
                NSF ← NSF ∪ P
            else
                Neutral Sentiment Phrase
            end if
        end for
        for P in sf do
            SC = sentiment_score(P)
            if SC > 0.0 then
                psf ← psf ∪ P
            else if SC < 0.0 then
                nsf ← nsf ∪ P
            else
                Neutral Situation Phrase
            end if
        end for
        while words in tweet do
            if word ∈ PSF && count == 0 then
                count = 1; check nsf
                continue
            end
            if word ∈ nsf && count == 1 then
                flag = True
                break
            end
            else if word ∈ NSF && count == 0 then
                count = 1; check psf
                continue
            end
            if word ∈ psf && count == 1 then
                flag = True
                break
            end
        end
        if flag == True then
            Given tweet is sarcastic
        else
            Given tweet is not sarcastic
        end
    end if
end for
FBCA detects sarcastic tweets using lexical and hyperbole features. In this approach, we first check for tweets containing interjection words. Lunando and Purwarianti [10] state that “if the text is using interjection words, the text has more tendencies to be classified into sarcastic”. The inputs are the interjection corpus [11], which is used to find the interjection words present in a tweet; the P.O.S. tag file, which stores the first tag, immediate next tag and next tag of each tweet as produced by the P.O.S. tagging algorithm; and the parse file, which stores the phrases generated by the TextBlob tool [12] for each tweet. The output of the proposed algorithm is whether each tweet is sarcastic or not.
FBCA focuses on interjection words and the number of exclamation marks to detect sarcasm. If the first tag is an interjection, the immediate next tag is an adverb or adjective and the next tag is an exclamation mark, the tweet is classified as sarcastic. Likewise, if the first tag is an interjection and the next tags are an adverb followed by an adjective, an adjective followed by a noun, or an adverb followed by a verb, the tweet is classified as sarcastic. Interjection-related tweets are also stored in IF (interjection file) to create more sentiment and situation phrases, which makes sarcasm easier to detect in further analysis. If no interjection word is present, a rule-based pattern is applied to create the sentiment and situation files. Sentiment and situation are then used to detect sarcasm by checking whether they contradict each other.
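The two interjection rules can be sketched as follows. This is an illustrative reading under stated assumptions: tags are Penn Treebank tags (UH, JJ*, RB*, NN*, VB*), the exclamation condition is simplified to a count of more than two exclamation marks anywhere in the tweet, and the names coarse, SARCASTIC_PAIRS and interjection_rule are hypothetical.

```python
def coarse(tag):
    """Map a Penn Treebank tag to the coarse classes used by the rules."""
    if tag == "UH":
        return "UH"
    if tag.startswith("JJ"):
        return "ADJ"
    if tag.startswith("RB"):
        return "ADV"
    if tag.startswith("NN"):
        return "N"
    if tag.startswith("VB"):
        return "V"
    return "OTHER"


# Tag pairs that may follow the leading interjection (second rule).
SARCASTIC_PAIRS = {("ADV", "ADJ"), ("ADJ", "N"), ("ADV", "V")}


def interjection_rule(tags, exclamation_count):
    """Return True if either interjection rule fires for the tweet's tag sequence."""
    classes = [coarse(t) for t in tags]
    if len(classes) < 2 or classes[0] != "UH":
        return False
    # Rule 1: interjection, then adjective or adverb, plus more than two '!'.
    if classes[1] in ("ADJ", "ADV") and exclamation_count > 2:
        return True
    # Rule 2: interjection followed by one of the listed tag pairs.
    return len(classes) >= 3 and (classes[1], classes[2]) in SARCASTIC_PAIRS
```

Tweets for which neither rule fires are not labelled by this step; in the approach described here they would go on to the sentiment and situation analysis.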
Contradiction between positive sentiment and negative situation, and between negative sentiment and positive situation, is identified by applying rule-based patterns. To create the sentiment file, a phrase is stored in it if the phrase is a noun, an adjective, or a noun followed by a verb. A phrase is stored in the situation file if it is a verb, an adverb followed by a verb, a verb followed by an adverb, an adjective followed by a verb, a verb followed by a noun, a verb followed by an adverb and an adjective, a verb followed by an adjective and a noun, or an adverb followed by an adjective and a noun. The positive sentiment, negative sentiment, positive situation and negative situation files are then created using the sentiment scores of the phrases. If a contradiction between sentiment and situation is present, the tweet is classified as sarcastic, otherwise not. A phrase that carries no sentiment is treated as neutral and is not processed further.
We use MapReduce to reduce execution time: constructing the sentiment and situation files takes time, so the work needs to run in parallel. In the map phase, we detect interjection-related tweets and classify them as sarcastic, and we create the sentiment and situation files using rule-based patterns. In the reduce phase, we create the positive sentiment, negative sentiment, positive situation and negative situation files using the sentiment score and check for sarcastic tweets. At the end, we combine all results from the map phase, that is, the total number of sarcastic tweets detected.
V. CONCLUSION
Sarcasm detection is a challenging task because sarcasm has no predefined structure. Researchers are improving the accuracy of sarcasm detection by proposing different algorithms. In this paper, we proposed an algorithm that uses lexical and hyperbole features to detect sarcasm.
It also considers three types of sarcasm: (i) contradiction between positive sentiment and negative situation, (ii) contradiction between negative sentiment and positive situation, and (iii) occurrence of interjection words. The proposed algorithm also considers punctuation-related features to improve precision. Constructing the sentiment and situation files takes time, so we use the Hadoop framework to reduce execution time. We combine two features into a hybrid to improve the accuracy of the system. In future, we will consider emoticons to detect sarcasm: a contradiction between text and emoticon indicates sarcasm. Extending the proposed algorithm to other languages also remains an area for future research.
References
[1] S. K. Bharti, K. S. Babu, and S. K. Jena, "Parsing-based sarcasm sentiment recognition in twitter data," 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Paris, 2015, pp. 1373-1380.
[2] A. Joshi, P. Bhattacharyya, and M. J. Carman, "Automatic sarcasm detection: A survey," Feb. 2016. Online. Available: https://arxiv.org/abs/1602.03426
[3] M. Bouazizi and T. Otsuki Ohtsuki, "A Pattern-Based Approach for Sarcasm Detection on Twitter," IEEE Access, vol. 4, pp. 5477-5488, 2016.
[4] A. Rajadesingan, R. Zafarani, and H. Liu, "Sarcasm detection on twitter: A behavioral modeling approach," WSDM 2015 - Proceedings of the 8th ACM International Conference on Web Search and Data Mining, 2015, pp. 97-106.
[5] P. Tungthamthiti, K. Shirai, and M. Mohd, "Recognition of sarcasm in tweets based on concept level sentiment analysis and supervised learning approaches," 28th Pacific Asia Conference on Language, Information and Computation (PACLIC 2014), 2014, pp. 404-413.
[6] E. Riloff, A. Qadir, P. Surve, L. De Silva, N. Gilbert, and R. Huang, "Sarcasm as contrast between a positive sentiment and negative situation," Proceedings of EMNLP 2013, pp. 704-714.
[7] P. Clews and J. Kuzma, "Rudimentary Lexicon Based Method for Sarcasm Detection," International Journal of Academic Research and Reflection, vol. 5, no. 4, pp. 24-33, 2017.
[8] N. Vijayalaksmi and A. Senthilrajan, "A hybrid approach for Sarcasm Detection of Social Media Data," International Journal of Scientific and Research Publications (IJSRP), vol. 7, no. 5, May 2017.
[9] S. K. Bharti, B. Vachha, R. K. Pradhan, K. S. Babu, and S. K. Jena, "Sarcastic sentiment detection in tweets streamed in real time: A big data approach," Digital Communications and Networks, vol. 2, no. 3, pp. 108-121.
[10] E. Lunando and A. Purwarianti, "Indonesian Social Media Sentiment Analysis With Sarcasm Detection," pp. 195-198, doi: 10.1109/ICACSIS.2013.6761575.
[11] Enchanted Learning, http://www.enchantedlearning.com/Home.html
[12] TextBlob: Simplified Text Processing, TextBlob 0.15.0 documentation, http://textblob.readthedocs.io/en/dev/
