Algorithm with Probability Theory

Quote:

>Now here is my problem. I believe that the program works flawlessly

>when there is only one author or one point of evidence. However, I

>think that there must be a better way to combine the answers for the

>individual authors into an answer for the whole set of authors.

>Currently, if you enter authors in a different order, you will get a

>different answer. The reason for this is that the background

>probabilities for the first author are incorrect -- it should not be

>1/numYears but rather should depend on the answer for the other

>author. But the solution to this problem does not seem to be simple,

>because the answer for the other author requires background

>probabilities that depend on the answer for the first author. Or at

>least that is what I think my problem is.

Some might object that there is really no such thing as the "probability"

that document X was written in year Y. It either was written then or

it wasn't. But we can take a Bayesian point of view and say that you

have a subjective probability for it, with a "prior" that would assign

equal probabilities to all years in a given interval. Then given a

certain body of evidence (who cited what when) and a model of the

citing process, you can use Bayes' Theorem in the following form:

Pr{Y=y | E} = Pr{Y=y} Pr{E | Y=y} / ( sum_j Pr{Y=j} Pr{E | Y=j} )

where Y = j means the document was written in year j, E is the body of

evidence, Pr{Y=j} is the (prior) probability that the document was written

in year j, and Pr{ A | B } is the conditional probability of A given B.
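
As a minimal sketch of that formula for a single document, the Python below normalizes prior times likelihood over the candidate years. The function name posterior, the interval of years 1 to 4, and the likelihood values are all made up for illustration; a real citation model would have to supply Pr{E | Y=y}.

```python
from fractions import Fraction

def posterior(prior, likelihood):
    """Return Pr{Y=y | E} for each year y, given Pr{Y=y} and Pr{E | Y=y}."""
    joint = {y: prior[y] * likelihood[y] for y in prior}
    total = sum(joint.values())  # the denominator: sum_j Pr{Y=j} Pr{E | Y=j}
    if total == 0:
        raise ValueError("the evidence has probability zero under this prior")
    return {y: p / total for y, p in joint.items()}

# Uniform prior over a hypothetical interval of years 1..4.
years = range(1, 5)
prior = {y: Fraction(1, len(years)) for y in years}

# Hypothetical likelihoods Pr{E | Y=y}; these numbers are invented,
# not derived from any citation model.
likelihood = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4), 4: Fraction(0)}

post = posterior(prior, likelihood)
for y in years:
    print(y, post[y])
```

With a uniform prior the prior terms cancel, so the posterior is just the likelihood rescaled to sum to 1.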

It gets complicated when there is more than one document with an uncertain

date. Then what you'd have to do is take into account all possible

assignments of dates to the documents. For example, suppose there are

three documents:

A, certainly written in years 1 to 3

B, certainly written in years 2 to 4

C, certainly written in years 2 to 6.

Suppose C cites A and B (and its author would certainly have included such a

citation, provided the other document was written in a previous year),

while the authors of A and B would never cite anything.

Then there are 3*3*5=45 "prior" triples of dates, from (A,B,C) = (1,2,2)

to (3,4,6), each of which the prior distribution assigns probability 1/45.

The citation evidence gives the conditional probabilities

Pr{E | (A,B,C)=(a,b,c)} = 1 if c>a and c>b, 0 otherwise.

There are 26 triples that fit the citation evidence, and so

Pr{(A,B,C)=(a,b,c) | E} = 1/26 for each of these. Then e.g. the

probability that A was written in year 1 is 9/26, because there are

9 such triples where a=1.
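
The counts in this example can be checked by brute force. The sketch below enumerates all triples, keeps those consistent with the citation evidence, and reads off a marginal probability; the names ranges, consistent and marginal are illustrative only. Full enumeration is practical only for small cases like this one, since the number of assignments grows multiplicatively with each uncertain document.

```python
from fractions import Fraction
from itertools import product

# Possible years for each document, as in the example above.
ranges = {"A": range(1, 4), "B": range(2, 5), "C": range(2, 7)}

def consistent(a, b, c):
    """Pr{E | (A,B,C)=(a,b,c)}: C cites A and B, so both must be earlier."""
    return c > a and c > b

# All "prior" triples, each with equal prior probability.
triples = list(product(ranges["A"], ranges["B"], ranges["C"]))

# Triples that fit the citation evidence; each has equal posterior probability.
fitting = [t for t in triples if consistent(*t)]

def marginal(index, year):
    """Posterior probability that document number `index` (0=A, 1=B, 2=C)
    was written in `year`."""
    hits = sum(1 for t in fitting if t[index] == year)
    return Fraction(hits, len(fitting))

print(len(triples))    # 45
print(len(fitting))    # 26
print(marginal(0, 1))  # 9/26
```

Because every fitting triple gets the same posterior probability, the answer does not depend on the order in which the documents are considered.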

Department of Mathematics http://www.math.ubc.ca/~israel

University of British Columbia

Vancouver, BC, Canada V6T 1Z2