Detecting homophily in the network of movie reviewers.
Web-based social networks grow every day, which motivates researchers to study various aspects of how they function. In this post, I discuss the results of my project analysing a network of movie reviewers: the users who rated movies on the MovieLens website in 1997-1998. MovieLens is a system that recommends movies to watch. KinoPoisk is a Russian analogue (it would be fantastic if they published their data one day). Another famous movie recommender is IMDb.
In my project, I aim to find out whether the network of movie reviewers exhibits homophily in terms of the reviewers’ negative ratings. In other words, I look for ties within and between groups of people and try to identify how similar they are when rating movies (do not be afraid of the word homophily; it is the cornerstone of my research, and I will describe it later).
This research answers such questions as:
- Do women dislike the same movies? What about men?
- Do programmers hate the same movies? What about students, educators and administrators?
- Do people under 18 tend to give similar negative ratings? Do people over 40 do the same?
- And if not… why?
So this time I will be an objectionable guy who focuses only on negative things 😀
Researchers’ interest in movie reviews
An economist’s job is traditionally about identifying what is profitable and maximizing profits. No, no, this is not because economists are voracious; it is just that demand drives the market, and someone has to do this job 🙂 No wonder that existing papers on movie reviews concentrate on questions such as how movie ratings influence companies’ revenues. However, we can still extract some relevant information from these papers.
In 2008, Duan et al. found that online user reviews had little persuasive effect on consumer purchase decisions. A few years later, however, Moe and Trusov (2011) discovered that consumers’ ratings were influenced by their own experience as well as by others’ ratings. In their paper, they consider both direct and indirect effects on sales that result from the influence on future ratings. They conclude that the behaviour of reviewers is significantly influenced by previously posted ratings. However, they highlight that the effects are limited when one considers indirect effects. Lee et al. (2015) study the social influence of previously posted ratings. They examine the impact of initial ratings by strangers versus friends. As a result, they find evidence of both herding and differentiation behaviour in response to strangers’ ratings, where initial ratings positively or negatively influence future ratings depending on the movie’s popularity.
The findings about the influence of previously posted ratings are important for my project. I assume that there is a social influence inside my network since the users are able to see an actual rating of the movie before they vote.
To answer my research questions, I had to calculate a homophily index for each group of people. In network studies, the term homophily (literally ‘love of the same’) was first introduced by Lazarsfeld and Merton in 1954. The term describes the tendency of people to have relationships with those who are similar to themselves. This observation applies to a number of attributes, such as gender, ethnicity, age, class background, educational attainment, etc. In addition, I calculated an inbreeding homophily index, which measures how strongly a group interacts within itself once the group’s size is taken into account.
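For readers who prefer code to definitions, here is a minimal sketch of how the two indices can be computed. It assumes one common pair of definitions from the network literature, which may differ in detail from the exact formulas behind the tables in this post: the homophily index of a group is the share of its members' ties that stay inside the group, H = s / (s + d), and the inbreeding homophily index rescales it by the group's population share w, IH = (H - w) / (1 - w). The function name, the edge-list input and the toy data are illustrative.

```python
from collections import Counter, defaultdict

def homophily_indices(edges, group_of):
    """Return {group: (H, IH)} for an undirected edge list.

    Assumed definitions: H = s / (s + d) and IH = (H - w) / (1 - w),
    where s and d count a group's same-group and cross-group tie
    endpoints and w is the group's share of all nodes.
    """
    same = defaultdict(int)  # tie endpoints that stay inside the group
    diff = defaultdict(int)  # tie endpoints that leave the group
    for u, v in edges:
        gu, gv = group_of[u], group_of[v]
        if gu == gv:
            same[gu] += 2    # both endpoints belong to the same group
        else:
            diff[gu] += 1
            diff[gv] += 1

    n = len(group_of)
    sizes = Counter(group_of.values())
    result = {}
    for g, size in sizes.items():
        total = same[g] + diff[g]
        if total == 0 or size == n:
            result[g] = (None, None)  # undefined: no ties, or only one group
            continue
        w = size / n
        h = same[g] / total
        result[g] = (h, (h - w) / (1 - w))
    return result

# Toy network: two men (a, b) and two women (c, d).
toy_groups = {"a": "M", "b": "M", "c": "F", "d": "F"}
toy_edges = [("a", "b"), ("a", "c"), ("b", "d")]
toy_result = homophily_indices(toy_edges, toy_groups)
```

Under these definitions, IH is zero when a group's ties are spread exactly in proportion to the population, positive under homophily and negative under heterophily.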
It is also important to determine the segregation index as a measure of segregation (isolation) in networks. For this, I use Freeman’s segregation index as covered in the paper by Fagiolo et al. (2007). Freeman introduced the index in 1978. The rationale behind it is that ‘if a given agent-attribute does not matter for social relationships, then the links among the agents should be distributed randomly with respect to that attribute’. Freeman’s segregation index is given by the difference between the number of cross-group ties expected by chance and the number of observed cross-group ties, divided by the expected number. The index ranges between −1 and 1, with the highest segregation level obtained when there are no cross-group links in place.
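Here is a minimal sketch of the index as described above, S = (E - O) / E, where O is the observed number of cross-group ties and E is the number expected by chance. The sketch assumes that 'by chance' means the same number of ties placed uniformly at random among all pairs of nodes, and it generalizes the two-group case to any number of groups in the obvious way; both choices are my assumptions, and the function name and toy data are illustrative.

```python
from collections import Counter

def freeman_segregation(edges, group_of):
    """Freeman-style segregation index S = (E - O) / E for an undirected network."""
    n = len(group_of)
    m = len(edges)
    if n < 2:
        return None  # undefined: no pairs of nodes
    sizes = Counter(group_of.values())
    # Number of node pairs whose two members belong to different groups.
    cross_pairs = (n * n - sum(s * s for s in sizes.values())) / 2
    all_pairs = n * (n - 1) / 2
    # Expected cross-group ties if the m ties were placed uniformly at random.
    expected = m * cross_pairs / all_pairs
    if expected == 0:
        return None  # undefined: no ties, or only one group
    observed = sum(1 for u, v in edges if group_of[u] != group_of[v])
    return (expected - observed) / expected

groups = {"a": "M", "b": "M", "c": "F", "d": "F"}
fully_segregated = freeman_segregation([("a", "b"), ("c", "d")], groups)
fully_mixed = freeman_segregation([("a", "c"), ("b", "d")], groups)
```

With no cross-group ties at all, the sketch returns 1, the highest segregation level; when there are more cross-group ties than chance would predict, the value turns negative.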
I use the data from the MovieLens website published by Harper & Konstan (2015). This data set consists of 100,000 ratings, ranging from 1 to 5, from 943 users on 1682 movies. In this set, each user has rated at least 20 movies. In addition, the data provides simple demographic information about the reviewers, such as age, gender, occupation and location. The data was collected during the seven-month period from September 19th, 1997 through April 22nd, 1998. Let us start exploring the data we have.
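If you want to follow along, a minimal loading sketch might look like this. It assumes the layout I remember from the 100K release: a tab-separated ratings file (user id, item id, rating, timestamp) and a pipe-separated users file (user id, age, gender, occupation, zip code), usually named `u.data` and `u.user`. Please check the README that ships with the data before relying on this. The sample rows below are only illustrative stand-ins in that layout, and the function names are mine.

```python
import csv
import io

# Illustrative rows in the assumed layouts; with the real data you would
# pass open("u.data") and open("u.user") instead (assumed file names).
SAMPLE_RATINGS = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n22\t377\t1\t878887116\n"
SAMPLE_USERS = "1|24|M|technician|85711\n2|53|F|other|94043\n"

def read_ratings(handle):
    """Parse tab-separated rows: user id, item id, rating, timestamp."""
    return [
        {"user": int(u), "item": int(i), "rating": int(r), "timestamp": int(t)}
        for u, i, r, t in csv.reader(handle, delimiter="\t")
    ]

def read_users(handle):
    """Parse pipe-separated rows: user id, age, gender, occupation, zip code."""
    return {
        int(uid): {"age": int(age), "gender": gender,
                   "occupation": occupation, "zip": zip_code}
        for uid, age, gender, occupation, zip_code in csv.reader(handle, delimiter="|")
    }

ratings = read_ratings(io.StringIO(SAMPLE_RATINGS))
users = read_users(io.StringIO(SAMPLE_USERS))
```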
Good news, ladies and gentlemen, the world is not as evil as it seems to be: people put lots of positive and neutral ratings (Figure 1). The share of the negative votes (‘1’ and ‘2’) in the network is only 17.5%.
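To make the notion of a negative vote concrete, here is a minimal sketch that computes the share of negative ratings and builds ties between reviewers. The tie rule is my assumption for illustration: two reviewers are linked when both gave a ‘1’ or ‘2’ to the same movie. The function names and toy ratings are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

NEGATIVE = {1, 2}  # ratings treated as negative, as in the text

def negative_share(ratings):
    """Fraction of (user, movie, rating) triples whose rating is 1 or 2."""
    negative = sum(1 for _, _, r in ratings if r in NEGATIVE)
    return negative / len(ratings)

def co_dislike_edges(ratings):
    """Link two users when both rated the same movie negatively (assumed tie rule)."""
    dislikers = defaultdict(set)
    for user, movie, r in ratings:
        if r in NEGATIVE:
            dislikers[movie].add(user)
    edges = set()
    for users_of_movie in dislikers.values():
        # Every pair of users who disliked this movie gets one undirected tie.
        edges.update(combinations(sorted(users_of_movie), 2))
    return edges

toy = [(1, "A", 1), (2, "A", 2), (3, "A", 5), (1, "B", 2), (3, "B", 1)]
toy_share = negative_share(toy)
toy_edges = co_dislike_edges(toy)
```

The resulting edge set, together with each reviewer's gender, age group or occupation, is all that the homophily and segregation indices need as input.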
Figure 1. Votes distribution by rating
Take a look at the list of the most disliked movies (Table 1). All the movies here were produced in 1996 or 1997. This is easily explained by the assumption that the users tend to evaluate brand-new movies more often.
Table 1. Top-10 movies in terms of number of negative votes
| Movie | Year | Negative votes |
|---|---|---|
| Bed of Roses | 1996 | 88 |
| The English Patient | 1996 | 83 |
The ‘top 3’ (bottom, in fact) movies in this list are Liar Liar (1997), Independence Day (1996) and The Saint (1997). Their presence in this list does not necessarily mean that they are bad (I personally like Liar Liar and all of Jim Carrey’s work). I consider this to be another case of social influence, but a reversed one. When a reviewer sees a very high rating for a popular movie and disagrees with it, he gives a lower rating in order to ‘restore justice’ as he understands it (I must admit, I have done this several times) :). Probably, if a movie’s rating were not artificially inflated (e.g. by advertising) and were more or less deserved, the movie would receive fewer low ratings on average.
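A ranking like the one in Table 1 comes down to a simple count. Here is a minimal sketch, assuming ratings are available as (user, movie, rating) triples; the function name and toy data are illustrative.

```python
from collections import Counter

def most_disliked(ratings, top=10):
    """Return the `top` movies with the most ratings of 1 or 2."""
    counts = Counter(movie for _, movie, r in ratings if r <= 2)
    return counts.most_common(top)

toy_ratings = [(1, "A", 1), (2, "A", 2), (3, "B", 1), (4, "B", 5), (5, "C", 4)]
toy_ranking = most_disliked(toy_ratings, top=2)
```

Note that this counts negative votes in absolute terms, so heavily rated movies rise to the top of the list even if most of their ratings are positive.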
Male and Female Reviewers
The network is dominated by men: 71% of the reviewers are male (Figure 2). The large share of men also corresponds to a higher homophily index (Table 2).
Figure 2. Distribution by gender
Let us look at the inbreeding homophily. It is negative in both groups. This means that the network is characterized by heterophily (the opposite of homophily), where vertices with different characteristics are preferentially linked. Therefore, a man and a woman are more likely to negatively rate the same movie than two women or two men. At the same time, men rate the same movies similarly to women more often than women do (toadies!).
There is no easy explanation for this observation. Researchers who do network analysis know that social networks are mostly characterised by homophily rather than heterophily. My working hypothesis is that people in romantic relationships tend to watch movies with partners of the opposite sex, which leads to social influence and broader heterophily in their votes. However, this assumption requires further research, as our network does not contain information on romantic relationships.
Students, educators, administrators and programmers who put negative votes
The largest group by occupation in our network is students (Figure 3). Next comes the mysterious group of ‘others’, followed by educators. Among the reviewers we also have many administrators, engineers, programmers, librarians, writers, executives and scientists.
Figure 3. People distribution in the network by their occupation
When calculating homophily for the occupation groups, I focused on students, educators, administrators and programmers. In the end, I detected neither homophily nor heterophily within these groups (Table 3). This means that people of one occupation do not share similar negative or positive opinions on movies.
To check for the existence of homophily in the age groups, I divided the population of my network into four groups. The first group includes all reviewers under 18 (a relatively small number: only 4% of all reviewers). The next three groups are roughly similar in size. The second group represents people aged 18-27 (32% of the population). The third group includes those aged 28-40 (36%). Finally, people older than 40 make up the fourth group (28%).
It is fascinating that the results show relatively strong negative inbreeding homophily in the age groups “18-27”, “28-40” and “>40” (Table 4). Thus, one can observe some heterophily in these age groups, which is even more puzzling than the presence of heterophily in the gender groups.
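The grouping above can be sketched in a few lines. The cut-offs follow the text; the function names and the toy ages are illustrative.

```python
from collections import Counter

def age_group(age):
    """Map an age in years to one of the four groups used in this post."""
    if age < 18:
        return "<18"
    if age <= 27:
        return "18-27"
    if age <= 40:
        return "28-40"
    return ">40"

def group_shares(ages):
    """Share of reviewers in each age group."""
    counts = Counter(age_group(a) for a in ages)
    return {group: count / len(ages) for group, count in counts.items()}

toy_ages = [16, 18, 27, 28, 40, 41, 60, 22]
toy_shares = group_shares(toy_ages)
```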
When studying movie reviews, researchers mainly concentrate on how movie ratings are expected to influence revenue. They focus on understanding the mechanisms behind rating prediction. Several authors find that social influence exists, since one can see a movie’s rating and read reviews before voting. I use this assumption for my project, taking into account possible social influence and information cascades in my network.
This project has brought some unexpected and interesting results. First, for every social group studied, I reject the hypothesis of homophily. Second, I detect heterophily in the male and female groups. I try to explain this by the tendency of men and women to watch movies together and thus influence each other’s opinions. Third, I observe some heterophily in the age groups 18-27, 28-40 and over 40.
Heterophily is quite a rare phenomenon in social networks; most social networks are characterised by homophily. This makes the results of this project curious and stimulates further research. Hopefully, it also stimulates you to invite your other half to the cinema this weekend and bring me more evidence for my hypothesis 😉
Duan W., Gu B., Whinston A.B. (2008) Do online reviews matter? — An empirical investigation of panel data, Decision Support Systems 45 1007–1016.
Fagiolo G., Valente M., Vriend N.J. (2007) Segregation in networks, Journal of Economic Behavior & Organization Vol. 64 (2007) 316–336.
Harper F.M., Konstan, J.A. (2015) The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015).
Lee Y.-J., Hosanagar K., Tan Y. (2015) Do I Follow My Friends or the Crowd? Information Cascades in Online Movie Ratings, Management Science, Volume 61, Issue 9.
Mantegna R.N. (2016) Economic Networks Lectures, Central European University.
Moe W.W., Trusov M. (2011) The Value of Social Dynamics in Online Product Ratings Forums. Journal of Marketing Research: June 2011, Vol. 48, No. 3, pp. 444-456.
Annexes for the most curious
Table 2. Indices of homophily, inbreeding homophily and segregation in gender groups
Table 3. Indices of homophily, inbreeding homophily and segregation in occupations
Table 4. Indices of homophily, inbreeding homophily and segregation in age groups