Algorithm with Probability Theory
Author Message
Algorithm with Probability Theory

Hello,

If anyone has any suggestions towards a solution to this problem or

I have been working on a computer program that aids the historian, but
I suppose that the underlying mathematics / probability theory /
statistics could be applied to a range of problems.

The problem is a matter of calculating the probability of a document
being written in a certain year given some imperfect knowledge about
subsequent authors that may have cited the document as well as
probabilistic evidence of any other variety.

For a simple example, suppose we know that the document was written
between 100 and 200 for certain.  And suppose that we are 100% certain
that the document was cited in 150 CE.  We would calculate that there
is a 2% chance that the document was written in each year between 100
and 149, inclusive, and that the expected year of composition is
124.5.

Obviously, the actual problems will be more complicated and involve
the following data for each author that may have cited the document:

* The year in which the author wrote.
* The number of years which we would expect it to take for the
document to have circulated widely ('circulation').
* The probability that the author knew of the document given that the
document was written more than 'circulation' years ago.
* The probability that the author knew of the document given that the
document was written less than 'circulation' years ago (but before the
author).
* The probability that the author cited the document given that the
* The probability that the author cited the document in his or her own
writings.

In working on this computer program, I believe I have solved the
subproblem of finding the probabilities that a document was written
'circulation' or more years before (termed B), that a document was
written during the period between 'circulation' years ago and the time
that the author wrote (termed D), and that a document was written
after the author wrote (termed A) given the data on the author
mentioned above as well as background probabilities for B, D, and A.

The source code for the program in C++ as well as the associated Win32
console application executable can be found here:

http://www.*-*-*.com/ ~kirby/dater3.html

Now here is my problem.  I believe that the program works flawlessly
when there is only one author or one point of evidence.  However, I
think that there must be a better way to combine the answers for the
individual authors into an answer for the whole set of authors.
Currently, if you enter authors in a different order, you will get a
different answer.  The reason for this is that the background
probabilities for the first author are incorrect -- it should not be
1/numYears but rather should depend on the answer for the other
author.  But the solution to this problem does not seem to be simple,
because the answer for the other author requires background
probabilities that depend on the answer for the first author.  Or at
least that is what I think my problem is.

Again, I would appreciate any advice or direction.

thanks,
Peter Kirby

Fri, 01 Oct 2004 13:47:50 GMT
Algorithm with Probability Theory

Quote:

>Now here is my problem.  I believe that the program works flawlessly
>when there is only one author or one point of evidence.  However, I
>think that there must be a better way to combine the answers for the
>individual authors into an answer for the whole set of authors.
>Currently, if you enter authors in a different order, you will get a
>different answer.  The reason for this is that the background
>probabilities for the first author are incorrect -- it should not be
>1/numYears but rather should depend on the answer for the other
>author.  But the solution to this problem does not seem to be simple,
>because the answer for the other author requires background
>probabilities that depend on the answer for the first author.  Or at
>least that is what I think my problem is.

Some might object that there is really no such thing as the "probability"
that document X was written in year Y.  It either was written then or
it wasn't.  But we can take a Bayesian point of view and say that you
have a subjective probability for it, with a "prior" that would assign
equal probabilities to all years in a given interval.  Then given a
certain body of evidence (who cited what when) and a model of the
citing process, you can use Bayes' Theorem in the following form:

Pr{ Y = y | E } = Pr{Y=y} Pr{E | Y=y} / sum_j Pr{Y=j} Pr{E | Y=j}

where Y = j means the document was written in year j, E is the body of
evidence, Pr{Y=j} is the (prior) probability that the document was written
in year j, and Pr{ A | B } is the conditional probability of A given B.

It gets complicated when there is more than one document with an uncertain
date.  Then what you'd have to do is take into account all possible
assignments of dates to the documents.  For example, suppose there are
three documents:
A, certainly written in years 1 to 3
B, certainly written in 2 to 4
C, certainly written in 2 to 6,
C cites A and B (and its author would certainly have included such a
citation, provided the other document was written in a previous year),
while the authors of A and B would never cite anything.

Then there are 3*3*5=45 "prior" triples of dates, from (A,B,C) = (1,2,2)
to (3,4,6), which the prior distribution gives probability 1/45 each.
The citation evidence gives the conditional probabilities
Pr{E | (A,B,C)=(a,b,c)} = 1 if c>a and c>b, 0 otherwise.
There are 26 triples that fit the citation evidence, and so
Pr{(A,B,C)=(a,b,c) | E} = 1/26 for each of these.  Then e.g. the
probability that A was written in year 1 is 9/26, because there are
9 such triples where a=1.

Department of Mathematics        http://www.math.ubc.ca/~israel
University of British Columbia

Sat, 02 Oct 2004 06:16:00 GMT
Algorithm with Probability Theory

Quote:

>Some might object that there is really no such thing as the "probability"
>that document X was written in year Y.  It either was written then or
>it wasn't.

When you're talking about an event that happened in the past, rather than a
prediction about the future, it's obvious that what is meant is the
probability that a guess will be correct, not whether it happened at all.

--

Genuity, Woburn, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.

Sat, 02 Oct 2004 06:31:30 GMT
Algorithm with Probability Theory

Quote:
>Hello,

>If anyone has any suggestions towards a solution to this problem or

Hi Peter,
I just glanced at your 15 pages of source code and decided its more
than I feel like studying to give you any detailed support, even if I
could.

My check of your code did confirm that you are basically on the right
track with Bayes theorem.  I did a cursory search on Bayesian networks
which might be helpful.  Bayesian networking is an area of
probabilistc reasoning which I have run across but have no familiarity
with.  In my cursory check I determined that there is more software
than papers.  I worked my way down through a tutorial at
http://www.gmu.edu/departments/seor/faculty/Buede/TutFinbd/Default.htm
and learned there is something called Perl's algorithm which might
yield a solid basis for understanding the theory.  I can vaguely
visualize an extension of Bayesian networking to a distributed model
which might be new and interesting.

Another area which might be worth examining is the use of Hidden
Markov Model technology.  I gather that subject has gotten a large
amount of recent play due to its application to language translation
and speech recognition.  It also has hot new algorithms.
http://www.comp.leeds.ac.uk/scs-only/teaching-materials/HiddenMarkovM...

I don't think Maximum entropy concepts apply here although if you
want, check out:.
If that URL is beyond the capacity of your browser, my post that
kicked off the free for all is
and the message ID is:

You will note there are a number of anti-bayesian bigots on the
sci.math.stat newsgroup.  Some heavy hitters eventually contributed to
the discussion however.

John Bailey
http://www.frontiernet.net/~jmb184

Sat, 02 Oct 2004 19:13:38 GMT
Algorithm with Probability Theory

Quote:

> >Now here is my problem.  I believe that the program works flawlessly
> >when there is only one author or one point of evidence.  However, I
> >think that there must be a better way to combine the answers for the
> >individual authors into an answer for the whole set of authors.
> >Currently, if you enter authors in a different order, you will get a
> >different answer.  The reason for this is that the background
> >probabilities for the first author are incorrect -- it should not be
> >1/numYears but rather should depend on the answer for the other
> >author.  But the solution to this problem does not seem to be simple,
> >because the answer for the other author requires background
> >probabilities that depend on the answer for the first author.  Or at
> >least that is what I think my problem is.

> Some might object that there is really no such thing as the "probability"
> that document X was written in year Y.  It either was written then or
> it wasn't.  But we can take a Bayesian point of view and say that you
> have a subjective probability for it, with a "prior" that would assign
> equal probabilities to all years in a given interval.  Then given a
> certain body of evidence (who cited what when) and a model of the
> citing process, you can use Bayes' Theorem in the following form:

> Pr{ Y = y | E } = Pr{Y=y} Pr{E | Y=y} / sum_j Pr{Y=j} Pr{E | Y=j}

> where Y = j means the document was written in year j, E is the body of
> evidence, Pr{Y=j} is the (prior) probability that the document was written
> in year j, and Pr{ A | B } is the conditional probability of A given B.

Do you have a suggestion as to how to compute Pr( E | Y=y ) based on
the data stated for each author:

* The year in which the author wrote.
* The number of years which we would expect it to take for the
document to have circulated widely ('circulation').
* The probability that the author knew of the document given that the
document was written more than 'circulation' years ago.
* The probability that the author knew of the document given that the
document was written less than 'circulation' years ago (but before the
author).
* The probability that the author cited the document given that the
* The probability that the author cited the document in his or her own
writings.

Or do you think that more data is needed?  (Besides constants such as
the probability that the author knew of the document given that it was
written after the author, which is zero.)

For reference, my current program is shown here: