wikimath's Blog

Happy coding

RSA

 

加密消息

假设Bob想给Alice送一个消息m,他知道Alice产生的Ne。他使用起先与Alice约好的格式将m转换为一个小于N的整数n,比如他可以将每一个字转换为这个字的Unicode码,然后将这些数字连在一起组成一个数字。假如他的信息非常长的话,他可以将这个信息分为几段,然后将每一段转换为n。用下面这个公式他可以将n加密为c

 n^e \equiv c\ (\mathrm{mod}\ N)

计算c并不复杂。Bob算出c后就可以将它传递给Alice。

[编辑] 解密消息

Alice得到Bob的消息c后就可以利用她的密钥d来解码。她可以用以下这个公式来将c转换为n

 c^d \equiv n\ (\mathrm{mod}\ N)

得到n后,她可以将原来的信息m重新复原。

解码的原理是

 c^d \equiv n^{e \cdot d}\ (\mathrm{mod}\ N)

以及ed ≡ 1 (mod p-1)和ed ≡ 1 (mod q-1)。费马小定理证明

 n^{e \cdot d} \equiv n\ (\mathrm{mod}\ p)     和      n^{e \cdot d} \equiv n\ (\mathrm{mod}\ q)

这说明(因为pq不同的质数)

 n^{e \cdot d} \equiv n\ (\mathrm{mod}\ pq)

 

Conditional independence

In probability theory, two events R and B are conditionally independent given a third event G precisely if the occurrence or non-occurrence of R and B are independent events in their conditional probability distribution given G. In the standard notation of probability theory,

\Pr(R \cap B \mid G) = \Pr(R \mid G)\Pr(B \mid G),\,

or equivalently,

\Pr(R \mid B \cap G) = \Pr(R \mid G).\,

Two random variables X and Y are conditionally independent given an event G if they are independent in their conditional probability distribution given G.

Two events R and B are conditionally independent given a σ-algebra Σ if

\Pr(R \cap B \mid \Sigma) = \Pr(R \mid \Sigma)\Pr(B \mid \Sigma)\ a.s.

where \Pr(A \mid \Sigma) denotes the conditional expectation of the indicator function of the event A, χA, given the sigma algebra Σ. That is,

\Pr(A \mid \Sigma) := \operatorname{E}[\chi_A\mid\Sigma].

Two random variables X and Y are conditionally independent given a σ-algebra Σ if the above equation holds for all R in σ(X) and B in σ(Y).

Two random variables X and Y are conditionally independent given a random variable W if they are independent given σ(W): the σ-algebra generated by W. This is commonly written:

X \perp\!\!\!\perp Y \,|\, W \text{   or   } X \perp Y \,|\, W

If W assumes a countable set of values, this is equivalent to the conditional independence of X and Y for the events of the form [W = w]. Conditional independence of more than two events, or of more than two random variables, is defined analogously.

The following two examples show that X Y neither implies nor is implied by X Y | W. First, suppose W is 0 with probability 0.5 and 1 otherwise. When W=0 take X and Y to be independent, each having the value 0 with probability 0.99, the value 1 otherwise. When W=1, X and Y are again independent, but this time they take the value 1 with probability 0.99. Then X Y | W. But X and Y are dependent, because Pr(X=0) < Pr(X=0|Y=0). This is because Pr(X=0) = 0.5, but if Y=0 then it's very likely that W=0 and thus that X=0 as well, so Pr(X=0|Y=0) > 0.5. For the second example, suppose X Y, each taking the values 0 and 1 with probability 0.5. Let W be the product X×Y. Then when W=0, Pr(X=0)=2/3, but Pr(X=0|Y=0)=1/2, so X Y | W is false. This also an example of Explaining Away. See Kevin Murphy's tutorial [1] where X and Y take the values "brainy" and "sporty".

 

Uses in Bayesian statistics

Let p be the proportion of voters who will vote "yes" in an upcoming referendum. In taking an opinion poll, one chooses n voters randomly from the population. For i = 1, ..., n, let Xi = 1 or 0 according as the ith chosen voter will or will not vote "yes".

In a frequentist approach to statistical inference one would not attribute any probability distribution to p (unless the probabilities could be somehow interpreted as relative frequencies of occurrence of some event or as proportions of some population) and one would say that X1, ..., Xn are independent random variables.

By contrast, in a Bayesian approach to statistical inference, one would assign a probability distribution to p regardless of the non-existence of any such "frequency" interpretation, and one would construe the probabilities as degrees of belief that p is in any interval to which a probability is assigned. In that model, the random variables X1, ..., Xn are not independent, but they are conditionally independent given the value of p. In particular, if a large number of the Xs are observed to be equal to 1, that would imply a high conditional probability, given that observation, that p is near 1, and thus a high conditional probability, given that observation, that the next X to be observed will be equal to 1.

Rules of conditional independence

A set of rules governing statements of conditional independence have been derived from the basic definition.[2][3]

Symmetry: X \perp\!\!\!\perp Y \mid Z  \Rightarrow Y \perp\!\!\!\perp X \mid Z

Decomposition: Y,W \perp\!\!\!\perp X  \mid Z  \Rightarrow Y \perp\!\!\!\perp X \mid Z and  W \perp\!\!\!\perp X \mid Z

Weak Union:  X \perp\!\!\!\perp Y,W \mid Z \Rightarrow X \perp\!\!\!\perp Y \mid Z,W

Contraction:  X \perp\!\!\!\perp W \mid Z, Y and  X \perp\!\!\!\perp Y \mid Z \Rightarrow X \perp\!\!\!\perp W,Y \mid Z

If the probabilities of X, Y, Z, W are all positive, then the following also holds:

Intersection:  X \perp\!\!\!\perp Y \mid Z, W and  X \perp\!\!\!\perp W \mid Z, Y \Rightarrow X \perp\!\!\!\perp Y, W \mid Z

bayes

Abstractly, the probability model for a classifier is a conditional model

p(C \vert F_1,\dots,F_n)\,

over a dependent class variable C with a small number of outcomes or classes, conditional on several feature variables F1 through Fn. The problem is that if the number of features n is large or when a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable.

Using Bayes' theorem, we write

p(C \vert F_1,\dots,F_n) = \frac{p(C) \ p(F_1,\dots,F_n\vert C)}{p(F_1,\dots,F_n)}. \,

In plain English the above equation can be written as

\mbox{posterior} = \frac{\mbox{prior} \times \mbox{likelihood}}{\mbox{evidence}}. \,

In practice we are only interested in the numerator of that fraction, since the denominator does not depend on C and the values of the features Fi are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability model

p(C, F_1, \dots, F_n)\,

which can be rewritten as follows, using repeated applications of the definition of conditional probability:

p(C, F_1, \dots, F_n)\,
= p(C) \ p(F_1,\dots,F_n\vert C)
= p(C) \ p(F_1\vert C) \ p(F_2,\dots,F_n\vert C, F_1)
= p(C) \ p(F_1\vert C) \ p(F_2\vert C, F_1) \ p(F_3,\dots,F_n\vert C, F_1, F_2)
= p(C) \ p(F_1\vert C) \ p(F_2\vert C, F_1) \ p(F_3\vert C, F_1, F_2) \ p(F_4,\dots,F_n\vert C, F_1, F_2, F_3)
= p(C) \ p(F_1\vert C) \ p(F_2\vert C, F_1) \ p(F_3\vert C, F_1, F_2) \ \dots p(F_n\vert C, F_1, F_2, F_3,\dots,F_{n-1}).

Now the "naive" conditional independence assumptions come into play: assume that each feature Fi is conditionally independent of every other feature Fj for j\neq i. This means that

p(F_i \vert C, F_j) = p(F_i \vert C)\,

and so the joint model can be expressed as

p(C, F_1, \dots, F_n) = p(C) \ p(F_1\vert C) \ p(F_2\vert C) \ p(F_3\vert C) \ \cdots\,
= p(C) \prod_{i=1}^n p(F_i \vert C).\,

This means that under the above independence assumptions, the conditional distribution over the class variable C can be expressed like this:

p(C \vert F_1,\dots,F_n) = \frac{1}{Z}  p(C) \prod_{i=1}^n p(F_i \vert C)

where Z (the evidence) is a scaling factor dependent only on F_1,\dots,F_n, i.e., a constant if the values of the feature variables are known.

Models of this form are much more manageable, since they factor into a so-called class prior p(C) and independent probability distributions p(F_i\vert C). If there are k classes and if a model for each p(F_i\vert C=c) can be expressed in terms of r parameters, then the corresponding naive Bayes model has (k − 1) + n r k parameters. In practice, often k = 2 (binary classification) and r = 1 (Bernoulli variables as features) are common, and so the total number of parameters of the naive Bayes model is 2n + 1, where n is the number of binary features used for classification and prediction

 

 

 

Here is a worked example of naive Bayesian classification to the document classification problem. Consider the problem of classifying documents by their content, for example into spam and non-spam E-mails. Imagine that documents are drawn from a number of classes of documents which can be modelled as sets of words where the (independent) probability that the i-th word of a given document occurs in a document from class C can be written as

p(w_i \vert C)\

(For this treatment, we simplify things further by assuming that words are randomly distributed in the document - that is, words are not dependent on the length of the document, position within the document with relation to other words, or other document-context.)

Then the probability of a given document D, given a class C, is

p(D\vert C)=\prod_i p(w_i \vert C)\,

The question that we desire to answer is: "what is the probability that a given document D belongs to a given class C?" In other words, what is p(C \vert D)\,?

Now by definition

p(D\vert C)={p(D\cap C)\over p(C)}

and

p(C\vert D)={p(D\cap C)\over p(D)}

Bayes' theorem manipulates these into a statement of probability in terms of likelihood.

p(C\vert D)={p(C)\over p(D)}\,p(D\vert C)

Assume for the moment that there are only two mutually exclusive classes, S and ¬S (e.g. spam and not spam), such that every element (email) is in either one or the other;

p(D\vert S)=\prod_i p(w_i \vert S)\,

and

p(D\vert\neg S)=\prod_i p(w_i\vert\neg S)\,

Using the Bayesian result above, we can write:

p(S\vert D)={p(S)\over p(D)}\,\prod_i p(w_i \vert S)
p(\neg S\vert D)={p(\neg S)\over p(D)}\,\prod_i p(w_i \vert\neg S)

Dividing one by the other gives:

{p(S\vert D)\over p(\neg S\vert D)}={p(S)\,\prod_i p(w_i \vert S)\over p(\neg S)\,\prod_i p(w_i \vert\neg S)}

Which can be re-factored as:

{p(S\vert D)\over p(\neg S\vert D)}={p(S)\over p(\neg S)}\,\prod_i {p(w_i \vert S)\over p(w_i \vert\neg S)}

Thus, the probability ratio p(S | D) / p(¬S | D) can be expressed in terms of a series of likelihood ratios. The actual probability p(S | D) can be easily computed from log (p(S | D) / p(¬S | D)) based on the observation that p(S | D) + p(¬S | D) = 1.

Taking the logarithm of all these ratios, we have:

\ln{p(S\vert D)\over p(\neg S\vert D)}=\ln{p(S)\over p(\neg S)}+\sum_i \ln{p(w_i\vert S)\over p(w_i\vert\neg S)}

(This technique of "log-likelihood ratios" is a common technique in statistics. In the case of two mutually exclusive alternatives (such as this example), the conversion of a log-likelihood ratio to a probability takes the form of a sigmoid curve: see logit for details.)

Finally, the document can be classified as follows. It is spam if p(S\vert D) > p(\neg S\vert D) (i.e. that \ln{p(S\vert D)\over p(\neg S\vert D)} > 0), otherwise it is not spam.