Mailing lists on the Internet are communities where people discuss various topics via e-mail. In this paper, we aim to discover influential comments that stimulate people's interest by mining the archives of mailing lists. We employ the Influence Diffusion Model (IDM) for text-based communication, in which the influence of a comment is defined as the degree of text-based relevance between messages.
Influence Diffusion Model, mining mailing list archives
Diffusion research has attracted attention for decades. In the 1950s and 1960s, Katz et al. [1] and Rogers [2] proposed models of diffusion from mass media to people. Turning to diffusion in text-based communication, research on computer-mediated communication (CMC) [3][4] is closely relevant. Shibanai et al. analyzed the diffusion process of the 'Pentium bug in 1994' by questionnaires [5]. Bordia et al. studied rumor transmission chains by classifying the content of individual messages [6]. Kaneko et al. analyzed the comment-chain of e-mails in a mailing list by using network analysis methods to discover influential comments and people [7]. That study, however, used only the structure of the comment-chain, not the contents of the comments.
In this paper, we aim to discover influential comments that stimulate people's interest by using not only the structure of the comment-chain but also the contents of the comments. In Section 2, we propose the Influence Diffusion Model (IDM) for text-based communication, in which the influence of a comment is defined as the degree of text-based relevance between messages. We then apply this model to the archives of a mailing list and present our findings in Section 3.
In a mailing list, people communicate by exchanging comments, i.e., by posting new comments or replying to existing ones. Our first assumption is that the reply relations among comments, called the comment-chain, show the flow of influence. For example, if comment C_{y} replies to comment C_{x}, C_{y} is considered to be affected by C_{x}; that is, influence diffuses from C_{x} to C_{y}. In this way, influence diffuses throughout the comment-chain. Our second assumption is that people's ideas are expressed and propagated through the medium of terms. The process of influence diffusion is therefore defined as follows.
Definition 1 In text-based communication, influence diffuses along the comment-chain through the medium of terms, i.e., words or phrases.
We define influence as the degree to which terms propagate through the comment-chain. For example, if C_{y} replies to C_{x}, the influence of C_{x} onto C_{y}, i_{x, y}, is defined as
i_{x, y} = | w_{x} ∩ w_{y} | / | w_{y} | ,
where w_{x} and w_{y} are the sets of terms in C_{x} and C_{y}, respectively. In addition, if C_{z} replies to C_{y}, the influence of C_{x} onto C_{z} via C_{y}, i_{x, z}, is defined as
i_{x, z} = | w_{x} ∩ w_{y} ∩ w_{z} | / | w_{z} | × i_{x, y} ,
where w_{z} is the set of terms in C_{z}. The more a comment affects other comments, the greater its influence is considered to be. The influence of a comment thus becomes measurable.
Definition 2 The influence of a comment on the community is measured by the sum of the influence diffused from the comment to all other members of the community.
Applying Definition 2 to C_{x}, its influence is measured by the sum of the influence diffused from C_{x}, i.e., i_{x, y} + i_{x, z} if the community has three members x, y and z.
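As a minimal sketch of Definitions 1 and 2 for the three-comment case, the Python fragment below computes i_{x, y}, i_{x, z} and their sum. The term sets are hypothetical, and the function name `influence` is ours. How terms are extracted from a message is not specified here, so each comment is assumed to be already reduced to a set of terms.

```python
def influence(sender_terms: set[str], receiver_terms: set[str]) -> float:
    """i_{x,y} = |w_x ∩ w_y| / |w_y| for a direct reply C_y to C_x."""
    if not receiver_terms:
        return 0.0  # assumption: a reply with no terms receives no influence
    return len(sender_terms & receiver_terms) / len(receiver_terms)


# Hypothetical term sets: C_y replies to C_x, and C_z replies to C_y.
w_x = {"text", "mining", "lecture", "tool"}
w_y = {"text", "mining", "schedule"}
w_z = {"mining", "room"}

i_xy = influence(w_x, w_y)                      # |{text, mining}| / 3 = 2/3
i_xz = len(w_x & w_y & w_z) / len(w_z) * i_xy   # |{mining}| / 2 * 2/3 = 1/3
influence_of_x = i_xy + i_xz                    # Definition 2: 2/3 + 1/3

print(round(i_xy, 3), round(i_xz, 3), round(influence_of_x, 3))
```

In this toy chain, only the term "mining" survives from C_{x} to C_{z}, so the influence reaching C_{z} is half of the influence that reached C_{y}.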
We now formalize the influence of a comment C_{i}. The influence of C_{i} diffuses along the comment-chain through the medium of terms (Definition 1), and it is measured by the sum of the influence diffused from C_{i} (Definition 2). Let ξ_{i, z} be a comment-chain that starts from C_{i}, i.e., ξ_{i, z} = { C_{i}, C_{j}, C_{k}, …, C_{q}, C_{r}, …, C_{y}, C_{z} } with i < j < k < … < q < r < … < y < z, and let i_{i, r} be the influence of C_{i} onto C_{r}. Then i_{i, r} is described as
i_{i, r} = | w_{i} ∩ w_{j} ∩ … ∩ w_{r} | / | w_{r} | × i_{i, q} ,
where | w_{r} | denotes the number of terms in C_{r}, and | w_{i} ∩ w_{j} ∩ … ∩ w_{r} | denotes the number of terms propagated from C_{i} to C_{r}. That is, i_{i, q} carries over to i_{i, r} in proportion to the share of the terms in C_{r} that were propagated from C_{i}. For the direct reply C_{j} there is no preceding factor, so i_{i, j} = | w_{i} ∩ w_{j} | / | w_{j} |, as in Definition 1. Let I_{ξi, z} be the sum of the influence diffused from C_{i} in ξ_{i, z}. Then I_{ξi, z} is described as
I_{ξi, z} = i_{i, j} + i_{i, k} + … + i_{i, y} + i_{i, z} .
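The recursion for i_{i, r} and the sum I_{ξ} can be sketched in Python as follows. This is an illustrative reading of the formulas, not a reference implementation. The function name `chain_influences` and the sample term sets are hypothetical, and a chain is assumed to be given as a list of term sets ordered from C_{i} to C_{z}.

```python
def chain_influences(chain: list[set[str]]) -> list[float]:
    """Return [i_{i,j}, i_{i,k}, ..., i_{i,z}] for a chain [w_i, w_j, ..., w_z]."""
    if not chain:
        return []
    influences = []
    propagated = set(chain[0])  # terms of C_i that have survived so far
    previous = 1.0              # stands in for i_{i,q}; 1.0 yields the base case i_{i,j}
    for terms in chain[1:]:
        propagated &= terms     # w_i ∩ w_j ∩ ... ∩ w_r
        # assumption: a comment with no terms receives no influence
        current = len(propagated) / len(terms) * previous if terms else 0.0
        influences.append(current)
        previous = current
    return influences


# Hypothetical chain of four comments C_i -> C_j -> C_k -> C_l.
chain = [
    {"a", "b", "c", "d"},   # w_i
    {"a", "b", "c", "e"},   # w_j: 3 of 4 terms come from C_i
    {"a", "b", "f"},        # w_k: 2 of 3 terms propagated from C_i
    {"a", "g"},             # w_l: 1 of 2 terms propagated from C_i
]
values = chain_influences(chain)   # 3/4, 2/3 * 3/4, 1/2 * (2/3 * 3/4)
i_xi = sum(values)                 # I_ξ for this chain
print([round(v, 3) for v in values], round(i_xi, 3))
```

Because each step multiplies by a ratio of at most 1, the influence of C_{i} can only stay the same or shrink as the chain grows, and it drops to zero once no term of C_{i} survives.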
The influence of C_{i} is defined as the sum of I_{ξ} over all comment-chains that start from C_{i}, where ξ denotes one such comment-chain. Let P_{i} be the set of all comment-chains that start from C_{i}, and let D_{Ci} be the influence of C_{i}. Then D_{Ci} is described as
D_{Ci} = ∑_{ξ ∈ Pi} I_{ξ} .
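A possible way to compute D_{Ci} over a branching reply structure is sketched below in Python. The data layout (`terms` maps a comment ID to its term set, `replies` maps a comment ID to the IDs of its direct replies), the function name `diffused_influence` and the toy thread are all assumptions made for illustration. The sketch also assumes that each comment reachable from C_{i} is counted once, following Definition 2. This coincides with the sum of I_{ξ} when the replies form a single chain, whereas a literal sum over root-to-leaf chains would count a shared prefix once per branch.

```python
def diffused_influence(root: str,
                       terms: dict[str, set[str]],
                       replies: dict[str, list[str]]) -> float:
    """Sum i_{root, r} over every comment r reachable from root via reply links."""
    total = 0.0
    # Stack entries: (comment id, terms of root propagated so far, influence so far).
    stack = [(root, set(terms[root]), 1.0)]
    while stack:
        node, propagated, inf = stack.pop()
        for child in replies.get(node, []):
            child_terms = terms[child]
            shared = propagated & child_terms
            child_inf = len(shared) / len(child_terms) * inf if child_terms else 0.0
            total += child_inf
            if child_inf > 0.0:   # below a zero, every later influence is zero as well
                stack.append((child, shared, child_inf))
    return total


# Hypothetical thread: c2 and c3 reply to c1, and c4 replies to c2.
terms = {
    "c1": {"lecture", "text", "mining", "tool"},
    "c2": {"lecture", "text", "room"},
    "c3": {"tool", "mining"},
    "c4": {"text", "slides"},
}
replies = {"c1": ["c2", "c3"], "c2": ["c4"]}

d = {cid: diffused_influence(cid, terms, replies) for cid in terms}
print({cid: round(value, 3) for cid, value in d.items()})
```

Here c1 reaches c2 with 2/3, c3 with 1 and c4 with 1/3, so its diffused influence is 2, while c2 reaches only c4 with 1/2.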
We apply the IDM proposed in Section 2 to part of a comment-chain in a mailing list managed in our laboratory. The comment-chain used here consists of 24 comments, and its main topic is a lecture on text-mining and natural language processing tools.
The flows of influence between comments are shown in Fig. 1, and the top five comments ranked by diffused influence (D_{C}) are shown in Table 1.
Table 1. Top five comments ranked by D_{C}
Ranking | Comment ID | D_{C}
1 | #445 | 0.700
2 | #417 | 0.607
3 | #411 | 0.382
4 | #443 | 0.374
5 | #405 | 0.329
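Once D_{C} is available for every comment, a ranking like the one in Table 1 is a plain sort. The short Python sketch below uses only the five values reported in Table 1; the helper name `top_k` is hypothetical, and in practice the dictionary would hold all comments of the comment-chain.

```python
# D_C values copied from Table 1, deliberately listed out of order.
d_c = {"#405": 0.329, "#411": 0.382, "#417": 0.607, "#443": 0.374, "#445": 0.700}


def top_k(scores: dict[str, float], k: int = 5) -> list[tuple[str, float]]:
    """Return the k comments with the largest D_C, highest first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


for rank, (comment_id, score) in enumerate(top_k(d_c), start=1):
    print(f"{rank} | {comment_id} | {score:.3f}")
```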
The summaries of comments in Table 1 are as follows.
Intuitively, #411 seems to be the most influential comment because it received the most replies in Fig. 1. However, considering the context of the comment-chain, the influence of #411 was in fact less than that of #445 and #417. Similarly, #443 and #405 were influential in that their topics dominated the subsequent discussion. From these observations, we conclude that comments with high influence values supplied topics that attracted people's interest and triggered further comments.
In this paper, we proposed a method for mining the archives of a mailing list with IDM and confirmed its effectiveness through experiments. In future work, we plan to analyze human relationships in a mailing list with IDM to understand people's roles in the community.