Asymptotic properties of goodness-of-fit criteria for testing hypotheses in a selection scheme without return, based on filling cells in a generalized placement scheme. Kolodzey Alexander Vladimirovich. Asymptotic behavior of functions

Definition. The direction determined by a non-zero vector is called an asymptotic direction relative to a second-order line if every straight line of this direction (that is, parallel to the vector) either has at most one common point with the line or is contained in this line.

Question: How many common points can a second-order line have with a straight line of asymptotic direction relative to this line?

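In the general theory of second-order lines it is proven that if

$a_{11}\alpha^2 + 2a_{12}\alpha\beta + a_{22}\beta^2 = 0,$

then the non-zero vector $(\alpha, \beta)$ specifies an asymptotic direction relative to the line whose equation has the second-degree terms $a_{11}x^2 + 2a_{12}xy + a_{22}y^2$

(general criterion for asymptotic direction).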

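For a second-order line with second-degree terms $a_{11}x^2 + 2a_{12}xy + a_{22}y^2$, put $\delta = a_{11}a_{22} - a_{12}^2$. Then:

if $\delta > 0$, then there are no asymptotic directions,

if $\delta < 0$, then there are two asymptotic directions,

if $\delta = 0$, then there is only one asymptotic direction.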
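The classification above can be checked numerically. The following is a minimal sketch in Python, assuming the usual notation in which the second-degree terms of the line are a11*x^2 + 2*a12*x*y + a22*y^2 and delta = a11*a22 - a12^2. The function name `asymptotic_directions` and the sample coefficients are illustrative assumptions, not taken from the lecture.

```python
import math

def asymptotic_directions(a11, a12, a22, tol=1e-12):
    """Return vectors (alpha, beta) with a11*alpha^2 + 2*a12*alpha*beta + a22*beta^2 = 0.

    Assumes the quadratic part is not identically zero.
    """
    delta = a11 * a22 - a12 * a12
    if delta > tol:
        # Elliptic type: the quadratic form vanishes only at the zero vector.
        return []
    if abs(delta) <= tol:
        # Parabolic type: one direction, the kernel of [[a11, a12], [a12, a22]].
        if abs(a11) > tol or abs(a12) > tol:
            return [(-a12, a11)]
        return [(a22, -a12)]
    # Hyperbolic type: two directions.
    root = math.sqrt(-delta)
    if abs(a11) > tol:
        return [(-a12 + root, a11), (-a12 - root, a11)]
    # a11 == 0: the form factors as beta * (2*a12*alpha + a22*beta).
    return [(1.0, 0.0), (a22, -2.0 * a12)]

# Hypothetical quadratic part x^2 - 5xy + 4y^2 (a11 = 1, a12 = -2.5, a22 = 4).
print(asymptotic_directions(1, -2.5, 4))
```

For this hypothetical quadratic part the ratios alpha/beta are 4 and 1, the same values of t that appear in the worked task below, although the equation used there is not preserved in the text.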

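The following lemma turns out to be useful (criterion for the asymptotic direction of a line of parabolic type).

Lemma. Let $\gamma$ be a line of parabolic type.

A non-zero vector $(\alpha, \beta)$ has an asymptotic direction relative to $\gamma$ if and only if

$a_{11}\alpha + a_{12}\beta = 0, \quad a_{12}\alpha + a_{22}\beta = 0.$ (5)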

(Problem: Prove the lemma.)

Definition. A straight line of asymptotic direction is called an asymptote of a second-order line if it either does not intersect the line or is contained in it.

Theorem. If a non-zero vector has an asymptotic direction relative to a second-order line, then the asymptote parallel to this vector is determined by the equation

Let's fill out the table.

TASKS.

1. Find the vectors of asymptotic directions for the following second order lines:

4 - hyperbolic type two asymptotic directions.

Let's use the asymptotic direction criterion:

Has an asymptotic direction relative to this line 4.

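If one coordinate of the vector equals 0, then so does the other, that is, the vector is zero. So we may divide by the square of that coordinate and obtain a quadratic equation in t, the ratio of the coordinates. Solving this quadratic equation, we find two solutions: t = 4 and t = 1, which give the two asymptotic directions of the line.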

(Two methods can be considered, since the line is of a parabolic type.)

2. Find out whether the coordinate axes have asymptotic directions relative to the second-order lines:

3. Write the general equation of the second order line for which

a) the x-axis has an asymptotic direction;

b) Both coordinate axes have asymptotic directions;

c) the coordinate axes have asymptotic directions and O is the center of the line.

4. Write the equations of the asymptotes for the lines:

a) ;

5. Prove that if a second-order line has two non-parallel asymptotes, then their intersection point is the center of this line.

Note: Since there are two non-parallel asymptotes, there are two asymptotic directions; then $\delta = a_{11}a_{22} - a_{12}^2 < 0$ and, therefore, the line is central.

Write the equations of the asymptotes in general form and the system for finding the center. The rest is obvious.

6.(No. 920) Write the equation of a hyperbola passing through point A(0, -5) and having asymptotes x – 1 = 0 and 2x – y + 1 = 0.

Note. Use the statement from the previous problem.

Homework: No. 915 (c, e, f), No. 916 (c, d, e), No. 920 (if you did not have time in class);

Cribs;

Silaev, Timoshenko. Practical tasks in geometry, 1st semester. P. 67, questions 1-8; p. 70, questions 1-3 (oral).

DIAMETERS OF SECOND ORDER LINES.

CONJUGATE DIAMETERS.

An affine coordinate system is given.

Definition. The diameter of a second-order line conjugate to a vector of non-asymptotic direction relative to this line is the set of midpoints of all chords of the line parallel to that vector.

In the lecture it was proven that a diameter is a straight line, and its equation was obtained

Recommendations: Show (on an ellipse) how it is constructed (we set a non-asymptotic direction; draw [two] straight lines of this direction intersecting the line; find the midpoints of the chords to be cut off; draw a straight line through the midpoints - this is the diameter).
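The construction can also be checked by computation. The sketch below assumes the standard form of the diameter equation, alpha*(a11*x + a12*y + a1) + beta*(a12*x + a22*y + a2) = 0, for the line a11*x^2 + 2*a12*x*y + a22*y^2 + 2*a1*x + 2*a2*y + a0 = 0. This coefficient notation and the function name `conjugate_diameter` are assumptions made for illustration; the equation obtained in the lecture is not preserved in the text.

```python
def conjugate_diameter(a11, a12, a22, a1, a2, alpha, beta):
    """Return (A, B, C) of the diameter A*x + B*y + C = 0 conjugate to (alpha, beta).

    Assumed curve: a11*x^2 + 2*a12*x*y + a22*y^2 + 2*a1*x + 2*a2*y + a0 = 0.
    """
    form = a11 * alpha**2 + 2 * a12 * alpha * beta + a22 * beta**2
    if form == 0:
        # No diameter is defined for a vector of asymptotic direction.
        raise ValueError("the vector has an asymptotic direction")
    A = a11 * alpha + a12 * beta
    B = a12 * alpha + a22 * beta
    C = a1 * alpha + a2 * beta
    return A, B, C

# Ellipse x^2/4 + y^2 = 1 and chords parallel to the vector (1, 1).
print(conjugate_diameter(0.25, 0, 1, 0, 0, 1, 1))
```

For this ellipse the result is the line x/4 + y = 0: a chord y = x + c has its midpoint at (-0.8c, 0.2c), which satisfies that equation for every c.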

Discuss:

1. Why is a vector of non-asymptotic direction taken in the definition of a diameter? If the students cannot answer, ask them to construct a diameter, for example, for a parabola.

2. Does any second-order line have at least one diameter? Why?

3. During the lecture it was proven that diameter is a straight line. The midpoint of which chord is point M in the figure?

4. Look at the parentheses in equation (7). What do they remind you of?

Conclusion: 1) each center belongs to each diameter;

2) if there is a line of centers, then there is a single diameter.

5. What direction do the diameters of a parabolic line have? (Asymptotic)

Proof (probably in lecture).

Let the diameter d, given by equation (7`), be conjugate to a vector $(\alpha, \beta)$ of non-asymptotic direction. Then its direction vector is

$(-(a_{12}\alpha + a_{22}\beta),\ a_{11}\alpha + a_{12}\beta)$. Let us show that this vector has an asymptotic direction. Let us use the criterion of the asymptotic direction vector for a line of parabolic type (see (5)). Substitute and check (do not forget that $a_{11}a_{22} - a_{12}^2 = 0$).

6. How many diameters does a parabola have? Their relative position? How many diameters do the remaining parabolic lines have? Why?

7. How to construct the common diameter of some pairs of second-order lines (see questions 30, 31 below).

8. Fill out the table and be sure to make drawings.

1. . Write an equation for the set of midpoints of all chords parallel to the vector

2. Write the equation for the diameter d passing through the point K(1,-2) for the line.

Solution steps:

1st method.

1. Determine the type (to know how the diameters of this line behave).

In this case, the line is central, then all diameters pass through center C.

2. We compose the equation of a straight line passing through two points K and C. This is the desired diameter.

2nd method.

1. We write the equation for diameter d in the form (7`).

2. Substituting the coordinates of point K into this equation, we find the relationship between the coordinates of the vector conjugate to the diameter d.

3. We set this vector, taking into account the found dependence, and compose an equation for diameter d.

In this problem, it is easier to calculate using the second method.

3. . Write an equation for the diameter parallel to the x-axis.

4. Find the midpoint of the chord cut off by the line

on the straight line x + 3y – 12 =0.

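Directions to the solution: Of course, one can find the points of intersection of the given straight line and the given second-order line, and then the midpoint of the resulting segment. The desire to do this disappears if we take, for example, the straight line with the equation x + 3y – 2009 = 0. Instead, the midpoint can be found as the intersection of the straight line with the diameter conjugate to its direction vector.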
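A minimal sketch of the diameter-based approach follows. It assumes the standard diameter equation alpha*(a11*x + a12*y + a1) + beta*(a12*x + a22*y + a2) = 0 and uses a hypothetical curve, the circle x^2 + y^2 - 25 = 0, because the curve of the original task is not preserved in the text. The function name `chord_midpoint` is likewise an assumption.

```python
from fractions import Fraction

def chord_midpoint(a11, a12, a22, a1, a2, p, q, r):
    """Midpoint of the chord cut on the straight line p*x + q*y + r = 0.

    Assumed curve: a11*x^2 + 2*a12*x*y + a22*y^2 + 2*a1*x + 2*a2*y + a0 = 0.
    Integer coefficients are expected. The function does not check that the
    straight line really meets the curve in two points.
    """
    alpha, beta = q, -p                 # direction vector of the straight line
    A = a11 * alpha + a12 * beta        # conjugate diameter: A*x + B*y + C = 0
    B = a12 * alpha + a22 * beta
    C = a1 * alpha + a2 * beta
    det = A * q - B * p                 # zero exactly for an asymptotic direction
    if det == 0:
        raise ValueError("the straight line has an asymptotic direction")
    # Cramer's rule for A*x + B*y = -C, p*x + q*y = -r.
    x = Fraction(-C * q + B * r, det)
    y = Fraction(-A * r + C * p, det)
    return x, y

# Hypothetical circle x^2 + y^2 - 25 = 0 and the line x + 3y - 12 = 0.
print(chord_midpoint(1, 0, 1, 0, 0, 1, 3, -12))
```

No intersection points are computed, so replacing the constant term of the straight line changes only one input value.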


Kolodzey Alexander Vladimirovich. Asymptotic properties of agreement criteria for testing hypotheses in a selection scheme without return, based on filling cells in a generalized placement scheme: dissertation... Candidate of Physical and Mathematical Sciences: 01.01.05.- Moscow, 2006.- 110 pp.: ill. RSL OD, 61 07-1/496

Introduction

1 Entropy and information distance

1.1 Basic definitions and notations

1.2 Entropy of discrete distributions with bounded mathematical expectation

1.3 Logarithmic generalized metric on a set of discrete distributions

1.4 Compactness of functions with a countable set of arguments

1.5 Continuity of the Kullback - Leibler - Sanov information distance

1.6 Conclusions

2 Probabilities of large deviations

2.1 Probabilities of large deviations of functions of the number of cells with a given filling

2.1.1 Local limit theorem

2.1.2 Integral limit theorem

2.1.3 Information distance and probabilities of large deviations of separable statistics

2.2 Probabilities of large deviations of separable statistics that do not satisfy the Cramer condition

2.3 Conclusions

3 Asymptotic properties of goodness-of-fit criteria

3.1 Goodness-of-fit criteria for the selection scheme without return

3.2 Asymptotic relative efficiency of goodness-of-fit criteria

3.3 Criteria based on the number of cells in generalized placement schemes

3.4 Conclusions

Conclusion

Literature

Introduction to the work

Object of research and relevance of the topic. In the theory of statistical analysis of discrete sequences, a special place is occupied by goodness-of-fit criteria for testing a possibly composite null hypothesis, which states that for a random sequence $(X_i)_{i=1}^{n}$ such that

$X_i \in I_M$, $i = 1, \dots, n$, $I_M = \{0, 1, \dots, M\}$, for any $i = 1, \dots, n$ and any $k \in I_M$ the probability of the event $\{X_i = k\}$ does not depend on $i$. This means that the sequence $(X_i)_{i=1}^{n}$ is in some sense stationary.

In a number of applied problems, the sequence $(X_i)$ is taken to be the sequence of colors of balls drawn without return (that is, without replacement), until exhaustion, from an urn containing $n_k - 1 > 0$ balls of color $k$, $k \in I_M$. We will denote the set of such selections by $\mathfrak{M}(n_0 - 1, \dots, n_M - 1)$. Let the urn contain $n - 1$ balls in total, $n - 1 = \sum_{k=0}^{M} (n_k - 1)$.

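Let us denote by $r^{(k)} = (r^{(k)}_1, \dots, r^{(k)}_{n_k-1})$ the sequence of positions of the balls of color $k$ in the sample. Consider the sequence $h^{(k)} = (h^{(k)}_1, \dots, h^{(k)}_{n_k})$, where $h^{(k)}_1 = r^{(k)}_1$, $h^{(k)}_j = r^{(k)}_j - r^{(k)}_{j-1}$, $j = 2, \dots, n_k - 1$, $h^{(k)}_{n_k} = n - r^{(k)}_{n_k-1}$.

The sequence $h^{(k)}$ is determined by the distances between the places of neighboring balls of color $k$ in such a way that $\sum_{j=1}^{n_k} h^{(k)}_j = n$.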

The set of sequences $h^{(k)}$ for all $k \in I_M$ uniquely determines the sequence $(X_i)$. The sequences $h^{(k)}$ for different $k$ are dependent on each other. In particular, any of them is uniquely determined by all the others. If the cardinality of the set $I_M$ is 2, then the sequence of colors of balls is uniquely determined by the sequence $h^{(k)}$ of distances between the places of neighboring balls of one fixed color $k$. Let there be $N - 1$ balls of color 0 in an urn containing $n - 1$ balls of two different colors. We can establish a one-to-one correspondence between the set $\mathfrak{M}(N-1, n-N)$ and the set $\mathfrak{R}_{n,N}$ of vectors $h(n, N) = (h_1, \dots, h_N)$ with positive integer components such that

$h_1 + \dots + h_N = n.$ (0.1)

The set $\mathfrak{R}_{n,N}$ corresponds to the set of all distinct partitions of a positive integer $n$ into $N$ ordered terms.
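The one-to-one correspondence described above can be illustrated with a short sketch. It assumes that a selection is encoded as a tuple of 0s and 1s of length n - 1 with exactly N - 1 zeros, and that the components of h are the gaps between consecutive positions of color 0, with the last gap closed at position n. The function names are hypothetical.

```python
from itertools import combinations
from math import comb

def gaps_from_colors(colors):
    """Map a 0/1 color sequence of length n - 1 to the vector h with sum n."""
    n = len(colors) + 1
    positions = [i + 1 for i, c in enumerate(colors) if c == 0]
    bounds = [0] + positions + [n]
    return tuple(b - a for a, b in zip(bounds, bounds[1:]))

def colors_from_gaps(h):
    """Inverse map: rebuild the color sequence from a vector of positive gaps."""
    n = sum(h)
    zero_positions = set()
    pos = 0
    for gap in h[:-1]:
        pos += gap
        zero_positions.add(pos)
    return tuple(0 if i in zero_positions else 1 for i in range(1, n))

n, N = 6, 3
selections = []
for zeros in combinations(range(n - 1), N - 1):
    selections.append(tuple(0 if i in zeros else 1 for i in range(n - 1)))
vectors = {gaps_from_colors(s) for s in selections}
print(len(selections), len(vectors), comb(n - 1, N - 1))
```

Both counts equal the binomial coefficient C(n - 1, N - 1), the number of ordered partitions of n into N positive terms.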

By specifying a probability distribution on the set of vectors $\mathfrak{R}_{n,N}$, we obtain the corresponding probability distribution on the set $\mathfrak{M}(N-1, n-N)$. The set $\mathfrak{R}_{n,N}$ is a subset of the set $\mathfrak{Z}_{n,N}$ of vectors with non-negative integer components satisfying (0.1). In the dissertation work, distributions of the form

$P\{h(n, N) = (r_1, \dots, r_N)\} = P\{\xi_\nu = r_\nu,\ \nu = 1, \dots, N \mid \xi_1 + \dots + \xi_N = n\},$ (0.2)

where $\xi_1, \dots, \xi_N$ are independent non-negative integer random variables, will be considered as probability distributions on the set of vectors $\mathfrak{Z}_{n,N}$.

Distributions of the form (0.2) are called in /24/ generalized schemes for placing $n$ particles in $N$ cells. In particular, if the random variables $\xi_1, \dots, \xi_N$ in (0.2) have Poisson distributions with parameters $\lambda_1, \dots, \lambda_N$, respectively, then the vector $h(n, N)$ has a multinomial distribution with outcome probabilities

$p_\nu = \dfrac{\lambda_\nu}{\lambda_1 + \dots + \lambda_N}, \quad \nu = 1, \dots, N.$
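The Poisson case can be verified numerically for small parameters. The sketch below conditions independent Poisson variables on their sum and compares the result with the multinomial probabilities whose outcome probabilities are proportional to the Poisson parameters. The parameter values and function names are illustrative assumptions.

```python
from itertools import product
from math import exp, factorial, prod

def poisson_pmf(lam, k):
    return exp(-lam) * lam**k / factorial(k)

def conditional_law(lams, n):
    """Law of (xi_1, ..., xi_N) given xi_1 + ... + xi_N = n, xi_v ~ Poisson(lams[v])."""
    weights = {}
    for r in product(range(n + 1), repeat=len(lams)):
        if sum(r) == n:
            weights[r] = prod(poisson_pmf(lam, k) for lam, k in zip(lams, r))
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}

def multinomial_pmf(r, probs):
    coef = factorial(sum(r))
    for k in r:
        coef //= factorial(k)      # stays an integer at every step
    return coef * prod(p**k for p, k in zip(probs, r))

lams = (0.5, 1.0, 2.5)
probs = [lam / sum(lams) for lam in lams]
law = conditional_law(lams, 4)
worst = max(abs(p - multinomial_pmf(r, probs)) for r, p in law.items())
print(worst)  # difference at the level of floating-point rounding
```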

If the random variables $\xi_1, \dots, \xi_N$ in (0.2) are identically distributed according to the geometric law $P\{\xi_1 = k\} = p^{k-1}(1 - p)$, $k = 1, 2, \dots$, where $p$ is any number in the interval $0 < p < 1$

As noted in /14/, /38/, a special place in testing hypotheses about the distribution of the frequency vectors $h(n, N) = (h_1, \dots, h_N)$ in generalized schemes for placing $n$ particles in $N$ cells is occupied by the criteria constructed on the basis of statistics of the form

$L_N(h(n, N)) = \sum_{\nu=1}^{N} f_\nu(h_\nu),$ (0.3)

$\varphi\left(\frac{\mu_0}{N}, \frac{\mu_1}{N}, \dots\right),$ (0.4)

where $f_\nu$, $\nu = 1, 2, \dots$, and $\varphi$ are some real-valued functions,

$\mu_r = \sum_{\nu=1}^{N} I\{h_\nu = r\}, \quad r = 0, 1, \dots.$

The quantities $\mu_r$ were called in /27/ the number of cells containing exactly $r$ particles.

Statistics of the form (0.3) are called in /30/ separable (additively separable) statistics. If the functions $f_\nu$ in (0.3) do not depend on $\nu$, then such statistics were called in /31/ symmetric separable statistics.

For any $r$, the statistic $\mu_r$ is a symmetric separable statistic. From the equality

$\sum_{\nu=1}^{N} f(h_\nu) = \sum_{r=0}^{n} f(r)\mu_r$ (0.5)

it follows that the class of symmetric separable statistics of $h_\nu$ coincides with the class of linear functions of $\mu_r$. Moreover, the class of functions of the form (0.4) is wider than the class of symmetric separable statistics.
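The relation between a symmetric separable statistic and the numbers of cells with a given filling can be illustrated as follows. The sketch assumes that h is a vector of non-negative integer cell fillings and that f is any real function; the names `occupancy_counts`, `separable_statistic` and `statistic_via_counts` are hypothetical.

```python
from collections import Counter

def occupancy_counts(h):
    """Return [mu_0, ..., mu_n], where mu_r is the number of cells holding exactly r particles."""
    n = sum(h)
    counts = Counter(h)
    return [counts.get(r, 0) for r in range(n + 1)]

def separable_statistic(h, f):
    """Symmetric separable statistic: the sum of f over the cell fillings."""
    return sum(f(x) for x in h)

def statistic_via_counts(h, f):
    """The same statistic written as a linear function of the mu_r."""
    return sum(f(r) * m for r, m in enumerate(occupancy_counts(h)))

h = (2, 0, 1, 2, 0, 3)            # N = 6 cells, n = 8 particles
square = lambda x: x * x
print(occupancy_counts(h))
print(separable_statistic(h, square), statistic_via_counts(h, square))
```

The two printed values of the statistic coincide, which is the content of the equality between the sum over cells and the sum over fillings.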

Let $H_0 = (H_0(n, N))$ be a sequence of simple null hypotheses stating that the distribution of the vector $h(n, N)$ is (0.2), where the random variables $\xi_1, \dots, \xi_N$ in (0.2) are identically distributed and $P\{\xi_1 = k\} = p_k$, $k = 0, 1, 2, \dots$, and the parameters $n$, $N$ vary in the central region.

Consider some $\beta \in (0, 1)$ and a sequence of, generally speaking, composite alternatives $H_1 = (H_1(n, N))$ such that there exists $a_N(\beta)$ with

$P\{\varphi > a_N(\beta)\} \geq \beta$. We will reject the hypothesis $H_0(n, N)$ if $\varphi > a_N(\beta)$. If the limit $\lim_{N \to \infty} -\frac{1}{N}\ln P\{\varphi > a_N(\beta)\} = j_\varphi(\beta, H_1)$ exists, where the probability for each $N$ is calculated under the hypothesis $H_0(n, N)$, then the value $j_\varphi(\beta, H_1)$ is called in /38/ the index of the criterion $\varphi$ at the point $(\beta, H_1)$. The last limit may, generally speaking, not exist. Therefore, in the dissertation work, in addition to the criterion index, the value $\liminf_{N \to \infty} \left(-\frac{1}{N}\ln P\{\varphi > a_N(\beta)\}\right) = \underline{j}_\varphi(\beta, H_1)$ is considered, which the author of the dissertation work, by analogy, called the lower index of the criterion $\varphi$ at the point $(\beta, H_1)$. Here and below, $\liminf_{N \to \infty} a_N$ and $\limsup_{N \to \infty} a_N$ mean, respectively, the lower and upper limits of the sequence $(a_N)$ as $N \to \infty$.

If the criterion index exists, then the lower index of the criterion coincides with it. The lower index of the criterion always exists. The greater the value of the criterion index (lower index of the criterion), the better the statistical criterion in the sense under consideration. In /38/, the problem of constructing goodness-of-fit criteria for generalized placement schemes with the highest value of the criterion index was solved in the class of criteria that reject the hypothesis $H_0(n, N)$ when a statistic $f_m$ exceeds a constant; here $m > 0$ is some fixed number, the sequence of constants is selected from the given value of the power of the criterion for the sequence of alternatives, and $f_m$ is a real function of $m + 1$ arguments.

The criterion indices are determined by the probabilities of large deviations. As was shown in /38/, the rough (up to logarithmic equivalence) asymptotics of the probabilities of large deviations of separable statistics, when the Cramer condition is satisfied for the random variable $f(\xi)$, is determined by the corresponding Kullback - Leibler - Sanov information distance (a random variable $\eta$ satisfies the Cramer condition if for some $H > 0$ the moment generating function $M e^{t\eta}$ is finite in the interval $|t| < H$).

The question of the probabilities of large deviations of statistics depending on an unbounded number of the quantities $\mu_r$, as well as of arbitrary separable statistics that do not satisfy the Cramer condition, remained open. This did not make it possible to finally solve the problem of constructing criteria for testing hypotheses in generalized placement schemes with the highest rate of tending to zero of the probability of a type I error with approaching alternatives in the class of criteria based on statistics of the form (0.4). The relevance of the dissertation research is determined by the need to complete the solution to the specified problem.

The purpose of the dissertation work is to construct goodness-of-fit criteria with the highest value of the criterion index (lower index of the criterion) for testing hypotheses in the selection scheme without return, in the class of criteria that reject the hypothesis $H_0(n, N)$ for

$\varphi\left(\frac{\mu_0}{N}, \frac{\mu_1}{N}, \dots, \frac{\mu_n}{N}, 0, 0, \dots\right) > c_N,$ (0.7)

where $\varphi$ is a function of a countable number of arguments, and the parameters $n$, $N$ vary in the central region.

In accordance with the purpose of the study, the following tasks were set:

- to investigate the properties of the entropy and the Kullback - Leibler - Sanov information distance for discrete distributions with a countable number of outcomes;

- to study the probabilities of large deviations of statistics of the form (0.4);

- to study the probabilities of large deviations of symmetric separable statistics (0.3) that do not satisfy the Cramer condition;

- to find a statistic such that the goodness-of-fit criterion constructed on its basis for testing hypotheses in generalized placement schemes has the highest index value in the class of criteria of the form (0.7).

Scientific novelty: the concept of a generalized metric is given - a function that admits infinite values and satisfies the axioms of identity, symmetry and triangle inequality. A generalized metric is found and sets are indicated on which the entropy and information distance functions, defined on a family of discrete distributions with a countable number of outcomes, are continuous in this metric; in the generalized placement scheme, a rough (up to logarithmic equivalence) asymptotics was found for the probabilities of large deviations of statistics of the form (0.4), satisfying the corresponding form of the Cramer condition; in the generalized placement scheme, a rough (up to logarithmic equivalence) asymptotics was found for the probabilities of large deviations of symmetric separable statistics that do not satisfy the Cramer condition; in the class of criteria of the form (0.7), a criterion with the highest value of the criterion index is constructed.

Scientific and practical value. The work solves a number of questions about the behavior of the probabilities of large deviations in generalized placement schemes. The results obtained can be used in the educational process in the specialties of mathematical statistics and information theory and in the study of statistical procedures for the analysis of discrete sequences; they were used in /3/, /21/ in justifying the security of one class of information systems.

Provisions put forward for defense:

- reduction of the problem of testing, from a single sequence of colors of balls, the hypothesis that this sequence was obtained by selection without return until the exhaustion of balls from an urn containing balls of two colors, each such selection having the same probability, to the construction of goodness-of-fit criteria for testing hypotheses in the corresponding generalized placement scheme;

- continuity of the entropy and Kullback - Leibler - Sanov information distance functions on an infinite-dimensional simplex with the introduced logarithmic generalized metric;

- a theorem on rough (up to logarithmic equivalence) asymptotics of the probabilities of large deviations of symmetric separable statistics that do not satisfy the Cramer condition in the generalized placement scheme in the semi-exponential case;

- a theorem on rough (up to logarithmic equivalence) asymptotics of the probabilities of large deviations for statistics of the form (0.4);

- construction of a goodness-of-fit criterion for testing hypotheses in generalized placement schemes with the highest index value in the class of criteria of the form (0.7).

Approbation of work. The results were presented at seminars of the Department of Discrete Mathematics of the V. A. Steklov Mathematical Institute of the RAS and of the information security department of the S. A. Lebedev ITM&VT of the RAS, and at: the fifth All-Russian Symposium on Applied and Industrial Mathematics, spring session, Kislovodsk, May 2 - 8, 2004; the sixth International Petrozavodsk conference "Probabilistic methods in discrete mathematics", June 10 - 16, 2004; the second International conference "Information systems and technologies (IST'2004)", Minsk, November 8 - 10, 2004;

the International conference "Modern Problems and new Trends in Probability Theory", Chernivtsi, Ukraine, June 19 - 26, 2005.

The main results of the work were used in the research work "Apology", carried out by the S. A. Lebedev ITM&VT of the RAS in the interests of the Federal Service for Technical and Export Control of the Russian Federation, and were included in the report on the implementation of the research stage /21/. Some results of the dissertation were included in the research report "Development of mathematical problems of cryptography" of the Academy of Cryptography of the Russian Federation for 2004 /22/.

The author expresses deep gratitude to the scientific supervisor, Doctor of Physical and Mathematical Sciences A. F. Ronzhin, and to the scientific consultant, Doctor of Physical and Mathematical Sciences, Senior Researcher A. V. Knyazev. The author is also grateful to Doctor of Physical and Mathematical Sciences, Professor A. M. Zubkov and Candidate of Physical and Mathematical Sciences I. A. Kruglov for their attention to the work and a number of valuable comments.

Structure and content of the work.

The first chapter examines the properties of entropy and information distance for distributions on the set of non-negative integers.

In the first paragraph of the first chapter, notation is introduced and the necessary definitions are given. In particular, the following notation is used: $x = (x_0, x_1, \dots)$ is an infinite-dimensional vector with a countable number of components;

Н(х) - -Ex^oXvlnx,; trunc m (x) = (x 0,x 1,...,x t,0,0,...); SI* = (x, x u > 0, u = 0.1,..., E~ o x„ 0,v = 0,l,...,E? =Q x v = 1); fi 7 = (x Є O, L 0 vx v = 7); %] = (хЄП,Эо»х и

16 mі = e o ** v \ &c = Ue>1 | 5 є Q 7) o

It is clear that the set $\Omega$ corresponds to the family of probability distributions on the set of non-negative integers, and $\Omega_\gamma$ to the family of probability distributions on the set of non-negative integers with mathematical expectation $\gamma$. If $y \in \Omega$, then for $\varepsilon > 0$ we denote by $O_\varepsilon(y)$ the set

Оє(у) - (х eO,x v

In the second paragraph of the first chapter, a theorem on the boundedness of the entropy of discrete distributions with limited mathematical expectation is proved.

Theorem 1. On the boundedness of the entropy of discrete distributions with bounded mathematical expectation. For any $x \in \Omega_{[0,\gamma]}$, the entropy satisfies $H(x) \leq F(\gamma)$.

If $x \in \Omega_\gamma$ corresponds to the geometric distribution with mathematical expectation $\gamma$, that is,

$x_\nu = (1 - p)p^\nu$, $\nu = 0, 1, \dots$, where $p = \dfrac{\gamma}{1 + \gamma}$,

then the equality $H(x) = F(\gamma)$ holds.

The statement of the theorem can be viewed as the result of a formal application of the method of Lagrange multipliers for constrained extrema in the case of an infinite number of variables. The theorem that the only distribution on the set (k, k + 1, k + 2,...) with a given mathematical expectation and maximum entropy is the geometric distribution with that mathematical expectation is given (without proof) in /47/. The author, however, gives a rigorous proof.
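The extremal role of the geometric distribution can be checked numerically. The sketch below uses the well-known closed form of the entropy of the geometric distribution on 0, 1, 2, ... with mean g, namely (1 + g) ln(1 + g) - g ln g; whether this is exactly the function used in the dissertation is an assumption, since its definition is not preserved in the text. The Poisson distribution is used only as a sample competitor with the same mean, and all series are truncated to a finite number of terms.

```python
import math

def entropy(x):
    """Entropy -sum x_v ln x_v with the convention 0 ln 0 = 0."""
    return -sum(p * math.log(p) for p in x if p > 0)

def geometric(mean, terms):
    """x_v = (1 - p) p^v with p = mean / (1 + mean), truncated to `terms` values."""
    p = mean / (1 + mean)
    return [(1 - p) * p**v for v in range(terms)]

def poisson(mean, terms):
    probs, term = [], math.exp(-mean)
    for v in range(terms):
        probs.append(term)
        term *= mean / (v + 1)
    return probs

def geometric_entropy_closed_form(mean):
    return (1 + mean) * math.log(1 + mean) - mean * math.log(mean)

g = 2.0
print(entropy(geometric(g, 400)), geometric_entropy_closed_form(g))
print(entropy(poisson(g, 400)))   # smaller than the geometric value
```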

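The third paragraph of the first chapter gives the definition of a generalized metric, that is, a metric that admits infinite values.

For $x, y \in \Omega$ the function $\rho(x, y)$ is defined as the minimal $\varepsilon \geq 0$ with the property $y_\nu e^{-\varepsilon} \leq x_\nu \leq y_\nu e^{\varepsilon}$ for all $\nu = 0, 1, \dots$.

If such an $\varepsilon$ does not exist, then it is assumed that $\rho(x, y) = \infty$.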

It is proved that the function $\rho(x, y)$ is a generalized metric on the family of distributions on the set of non-negative integers, as well as on the entire set $\Omega^*$. Instead of $e$ in the definition of the metric $\rho(x, y)$, one can use any other positive number other than 1. The resulting metrics will differ by a multiplicative constant. Let us denote by $J(x, y)$ the information distance

$J(x, y) = \sum_{\nu=0}^{\infty} x_\nu \ln \dfrac{x_\nu}{y_\nu}.$

Here and below it is assumed that $0 \ln 0 = 0$, $0 \ln \frac{0}{0} = 0$. The information distance is defined for those $x$, $y$ for which $x_\nu = 0$ for all $\nu$ such that $y_\nu = 0$. If this condition is not met, then we put $J(x, y) = \infty$. Let $A \subseteq \Omega$. Then we denote $J(A, y) = \inf_{x \in A} J(x, y)$.

We put $J(\emptyset, y) = \infty$.
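Both quantities are easy to compute for vectors with finitely many non-zero components. The following sketch assumes that the generalized metric is the smallest eps with y_v e^{-eps} <= x_v <= y_v e^{eps} for all v (which equals the largest absolute log-ratio of the components) and that the information distance is the sum of x_v ln(x_v / y_v) with the conventions stated above. The function names and sample vectors are hypothetical.

```python
import math

def log_metric(x, y):
    """Generalized metric: sup |ln(x_v / y_v)|, infinite if the supports differ."""
    rho = 0.0
    for xv, yv in zip(x, y):
        if xv == 0 and yv == 0:
            continue                  # the two-sided bound holds for any eps
        if xv == 0 or yv == 0:
            return math.inf           # no finite eps satisfies the bound
        rho = max(rho, abs(math.log(xv / yv)))
    return rho

def info_distance(x, y):
    """Information distance: sum of x_v ln(x_v / y_v), with 0 ln 0 = 0."""
    total = 0.0
    for xv, yv in zip(x, y):
        if xv == 0:
            continue
        if yv == 0:
            return math.inf
        total += xv * math.log(xv / yv)
    return total

x = (0.5, 0.5, 0.0)
y = (0.25, 0.75, 0.0)
print(log_metric(x, y), info_distance(x, y))
```

For probability vectors every term ln(x_v / y_v) is at most the metric value, so in this sketch the information distance never exceeds the generalized metric.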

In the fourth paragraph of the first chapter, the definition of compactness of functions defined on the set $\Omega^*$ is given. The compactness of a function with a countable number of arguments means that the value of the function can be approximated, with any degree of accuracy, by the values of this function at points where only a finite number of arguments are non-zero. The compactness of the entropy and information distance functions is proved.

For any 0

If for some 0 0 the function \(x) = J(x,p) is compact on the set 7 ] P O g (p).

The fifth paragraph of the first chapter discusses the properties of the information distance defined on an infinite-dimensional space. Compared to the finite-dimensional case, the situation with the continuity of the information distance function changes qualitatively. It is shown that the information distance function is not continuous on the set Г2 in any of the metrics pi(,y)= E|z„-i/„|, (

00 \ 2 p 2 (x,y) = sup (x^-ij^.

The validity of the following inequalities is proved for the entropy functions H(x) and information distance J(x,p):

1. For any x, x" Є fi \H(x) - H(x")\

2. If for some х,р є П there is є > 0 such that х є О є (р), then for any X і Є Q \J(x,p) - J(x",p)\

From these inequalities, taking into account Theorem 1, it follows that the entropy and information distance functions are uniformly continuous on the corresponding subsets fi in the metric p(x,y), namely,

For any 7 such that 0

If for some 7o, O

20 then for any 0 0 the function \p(x) = J(x t p) is uniformly continuous on the set 7 ] P O є (p) in the metric p(x,y).

A definition of a non-extremal function is given. The non-extremality condition means that the function either has no local extrema or takes the same value at all local minima (and the same value at all local maxima). The non-extremality condition weakens the requirement of the absence of local extrema. For example, the function sin x on the set of real numbers has local extrema, but satisfies the non-extremality condition.

Let for some 7 > 0, the region A is given by the condition

А = (хЄЇ1 1 ,ф(х) >а), (0.9) where Ф(х) is a real-valued function, а is some real constant, inf Ф(х)

And 3y, the question arose, n P „ under what conditions „a „ φ for i_ „ara- q meters n, N in the central region, ^ -> 7, for all their sufficiently large values there will be such non-negative integers ko, k\, ..., k n, what ko + hi + ... + k n = N,

21 k\ + 2/... + nk n - N

Kq k\ k n . ^"iv"-"iv" 0 " 0 "-")>a -

It is proved that for this it is enough to require that the function $\varphi$ be non-extremal, compact and continuous in the metric $\rho(x, y)$, and also that for at least one point $x$ satisfying (0.9), for some $\varepsilon > 0$ there is a finite moment of degree $1 + \varepsilon$, $\sum_{\nu=0}^{\infty} \nu^{1+\varepsilon} x_\nu < \infty$, and $x_\nu > 0$ for any $\nu = 0, 1, \dots$.

In the second chapter, we study the rough (up to logarithmic equivalence) asymptotics of the probabilities of large deviations of functions of $\mu = (\mu_0, \dots, \mu_n, 0, \dots)$, the numbers of cells with a given filling, in the central region of variation of the parameters $N$, $n$. Rough asymptotics of the probabilities of large deviations are sufficient to study the indices of the goodness-of-fit criteria.

Let the random variables ^ in (0.2) be identically distributed and

Р(Сі = к)=рьк = 0.1,... > P(z) - generating function of random variable i - converges in a circle of radius 1

Let us denote $p(z) = (P\{\xi(z) = 0\}, P\{\xi(z) = 1\}, \dots)$.

If there is a solution $z_\gamma$ to the equation

$M\xi(z) = \gamma$, then it is unique /38/. Throughout what follows we will assume that $p_k > 0$, $k = 0, 1, \dots$.

In the first subparagraph of the first paragraph of the second chapter, the asymptotics of logarithms of probabilities of the form $-\frac{1}{N}\ln P\{\mu_0 = k_0, \dots, \mu_n = k_n\}$ is found.

The following theorem is proved.

Theorem 2. Rough local theorem on the probabilities of large deviations. Let n, N -* co such that - ->7>0

The statement of the theorem follows directly from the formula for the joint distribution of $\mu_0, \mu_1, \dots, \mu_n$ in /26/ and the following estimate: if non-negative integer values $\mu_1, \mu_2, \dots, \mu_n$ satisfy the condition $\mu_1 + 2\mu_2 + \dots + n\mu_n = n$, then the number of non-zero values among them is $O(\sqrt{n})$. This is a rough estimate and does not claim to be new. The number of non-zero $\mu_r$ in generalized placement schemes does not exceed the value of the maximum filling of cells, which in the central region, with a probability tending to 1, does not exceed a value of order $O(\ln n)$ /25/, /27/. Nevertheless, the resulting estimate $O(\sqrt{n})$ is satisfied with probability 1 and is sufficient to obtain rough asymptotics.
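The square-root estimate has a short combinatorial explanation: if k of the counts with positive index r are non-zero, then the weighted sum is at least 1 + 2 + ... + k = k(k + 1)/2, so k(k + 1)/2 <= n and k < sqrt(2n). The sketch below computes the largest such k; the function name `max_distinct_fillings` is hypothetical.

```python
import math

def max_distinct_fillings(n):
    """Largest k with 1 + 2 + ... + k <= n.

    This is the largest possible number of non-zero counts among mu_1, ..., mu_n
    when mu_1 + 2*mu_2 + ... + n*mu_n = n.
    """
    k, total = 0, 0
    while total + (k + 1) <= n:
        k += 1
        total += k
    return k

for n in (10, 100, 1000, 10000):
    print(n, max_distinct_fillings(n), math.isqrt(2 * n))
```

The first printed value in each row never exceeds the integer square root of 2n shown next to it.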

In the second subparagraph of the first paragraph of the second chapter, the limit of $-\frac{1}{N}\ln P\left\{\varphi\left(\frac{\mu_0}{N}, \frac{\mu_1}{N}, \dots\right) > a_N\right\}$ is found, where $a_N$ is a sequence of real numbers converging to some $a \in R$ and $\varphi(x)$ is a real-valued function. The following theorem is proved.

Theorem 3. Rough integral theorem on the probabilities of large deviations. Let the conditions of Theorem 2 be satisfied, for some r > 0, (> 0) the real function φ(x) is compact and uniformly continuous in the metric p on the set

A = 0 rH (p(r 1))nP bn] and satisfies the condition of non-extremality on the set Г2 7 . If for some constant a such that inf f(x)

24 there is a vector p a fi 7 P 0 r (p(z 7)); such that

Ф(ra) > а J(( (x) >а,хЄ П 7 ),р(2; 7)) = J(p a ,p(^y)), mo for any sequence а^ converging to а, ^-^\nP(f(^,^,...)>a m) = Pr a,p(r,)). (0.11)

With additional restrictions on the function φ(x), the information distance J(pa,P(zy)) in (2.3) can be calculated more specifically. Namely, the following theorem is true. Theorem 4. On information distance. Let for some 0

Whether some r > 0, C > 0, the real function φ(x) and its first-order partial derivatives are compact and uniformly continuous in the generalized metric p(x, y) on the set

A = O g (p)PP bn] , there exist T > 0, R > 0 such that for all \t\ O p v v 1+ z u exp(i--ph(x))

0(p(gaL)) = a, / h X v \Z,t) T, u= oX LJ (Z,t)

Then p(z a , t a) Є ft, u J((z Є Л,0(z) = а),р) = J(p(z a ,t a),p) d _ 9 = 7111 + t a «-^ OFaL)) - In 2Wexp( a --0(p(g a,i a))). j/=0 CnEi/ ^_o CX(/

If the function f(x) is a linear function, and the function fix) is defined using equality (0.5), then condition (0.12) turns into the Cramer condition for the random variable f(,(z)). Condition (0.13) is a form of condition (0.10) and is used to prove the presence in domains of the form (x Є Г2, φ(x) > a) of at least one point from 0(n, N) for all sufficiently large n, N.

Let $h(n, N) = (h_1, \dots, h_N)$ be the frequency vector in the generalized placement scheme (0.2). As a corollary of Theorems 3 and 4, the following theorem is formulated.

Theorem 5. Rough integral theorem on the probabilities of large deviations of symmetric separable statistics in a generalized placement scheme.

Let n, N -> co such that jfr - 7» 0 0,R > 0 such that for all \t\ Then for any sequence a# converging to a, 1 iv =

This theorem was first proven by A.F. Ronzhin in /38/ using the saddle point method.

In the second paragraph of the second chapter, the probabilities of large deviations of separable statistics in generalized placement schemes are studied in the case when the Cramer condition for the random variable $f(\xi(z))$ is not satisfied. The Cramer condition for the random variable $f(\xi(z))$ is not satisfied, in particular, if $\xi(z)$ is a Poisson random variable and $f(x) = x^2$. Note that the Cramer condition for the separable statistics themselves in generalized placement schemes is always satisfied, since for any fixed $n$, $N$ the number of possible outcomes in these schemes is finite.
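The failure of the Cramer condition in the Poisson example can be seen from the terms of the series for the moment generating function. For a Poisson variable with parameter lam, the logarithm of the k-th term of the series for M exp(t f(xi)) is t f(k) - lam + k ln(lam) - ln(k!). The sketch below, with hypothetical parameter values and function name, compares f(x) = x^2 with f(x) = x for a small positive t.

```python
import math

def log_term(t, lam, k, f):
    """Logarithm of exp(t * f(k)) * P(xi = k) for xi ~ Poisson(lam)."""
    return t * f(k) - lam + k * math.log(lam) - math.lgamma(k + 1)

square = lambda k: k * k
identity = lambda k: k

t, lam = 0.01, 1.0
for k in (10, 100, 1000, 5000):
    print(k, log_term(t, lam, k, square), log_term(t, lam, k, identity))
```

For f(x) = x^2 the logarithms eventually grow without bound because t k^2 outgrows ln(k!), so the series diverges for every t > 0; for f(x) = x they tend to minus infinity and the series converges.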

As noted in /2/, if the Cramer condition is not satisfied, then in order to find the asymptotics of the probabilities of large deviations of sums of identically distributed random variables, additional conditions of regular variation must be imposed on the distribution of the summand. The work considers the case corresponding to the fulfillment of condition (3) in /2/, that is, the semi-exponential case. Let $P\{\xi_1 = k\} > 0$ for all

28 k = 0.1,... and the function p(k) = -\nP(^ = k), can be continued to a function of continuous argument - a regularly varying function of order p, 0 oo P(tx) , r v P(t )

Let the function f(x) for sufficiently large values of the argument be a positive, strictly increasing, regularly varying function of order d>1,^ On the rest of the number axis

Then s. V. /(i) has moments of any order and does not satisfy the Cramer condition, ip(x) = o(x) as x -> oo, and the following Theorem 6 is valid. Let the function ip(x) be monotonically nondecreasing for sufficiently large x, the function ^p does not increase monotonically, n, N --> oo so that jf - A, 0 b(z\), where b(z) = M/(1(2)), there is a limit l(n,lg)) > cN] = "(c ~ b(zx))l b""ї

It follows from Theorem 6 that if the Cramer condition is not satisfied, then the limit $\lim_{N \to \infty} \frac{1}{N} \ln P(L_N(h(n,N)) > cN) = 0$, which proves the validity of the hypothesis expressed in /39/. Thus, the value of the index of a goodness-of-fit criterion in generalized placement schemes is always equal to zero when the Cramer condition is not met. At the same time, in the class of criteria for which the Cramer condition is satisfied, criteria with a nonzero index value have been constructed. From this we can conclude that using criteria whose statistics do not satisfy the Cramer condition, for example the chi-square test in a multinomial scheme, to construct goodness-of-fit tests for testing hypotheses under non-converging alternatives is asymptotically inefficient in the indicated sense. A similar conclusion was reached in /54/ from a comparison of the chi-square and maximum likelihood ratio statistics in a multinomial scheme.

The third chapter solves the problem of constructing goodness-of-fit criteria with the largest value of the criterion index (the largest value of the lower index of the criterion) for testing hypotheses in generalized placement schemes. Based on the results of the first and second chapters on the properties of the entropy function, the information distance, and the probabilities of large deviations, the third chapter finds a function of the form (0.4) such that the goodness-of-fit criterion constructed from it has the largest value of the exact lower index in the class of criteria under consideration. The following theorem is proved.

Theorem 7 (on the existence of an index). Let the conditions of Theorem 3 be satisfied, let $0 < \beta < 1$, let [...] be a sequence of alternative distributions, let $a_\varphi(\beta, N)$ be the maximum number for which, under the hypothesis [...], the inequality $P(\varphi(\dots) > a_\varphi(\beta, N)) \ge \beta$ holds, and let the limit $\lim_{N \to \infty} a_\varphi(\beta, N) = a$ exist. Then at the point $(\beta, N)$ there exists a criterion index [...]

In this case, [...]

The Conclusion presents the results obtained in relation to the general goal and the specific tasks posed in the dissertation, formulates conclusions based on the results of the research, indicates the scientific novelty and the theoretical and practical value of the work, and lists specific scientific problems identified by the author whose solution appears relevant.

Brief review of the literature on the research topic.

The thesis examines the problem of constructing goodness-of-fit criteria in generalized placement schemes with the highest value of the criterion index in the class of functions of the form (0.4) under non-converging alternatives.

Generalized placement schemes were introduced by V. F. Kolchin in /24/. The quantities $\mu_r$ in the multinomial scheme were called the number of cells with $r$ particles and were studied in detail in the monograph by V. F. Kolchin, B. A. Sevastyanov, and V. P. Chistyakov /27/. The quantities $\mu_r$ in generalized placement schemes were studied by V. F. Kolchin in /25/, /26/. Statistics of the form (0.3) were first considered by Yu. I. Medvedev in /30/ and were called separable (additively separable) statistics. If the functions $f_\nu$ in (0.3) do not depend on $\nu$, such statistics were called symmetric separable statistics in /31/. The asymptotic behavior of the moments of separable statistics in generalized placement schemes was obtained by G. I. Ivchenko in /9/. Limit theorems for generalized placement schemes were also considered in /23/. Surveys of results on limit theorems and goodness-of-fit criteria in discrete probabilistic schemes of type (0.2) were given by V. A. Ivanov, G. I. Ivchenko, and Yu. I. Medvedev in /8/ and by G. I. Ivchenko, Yu. I. Medvedev, and A. F. Ronzhin in /14/. Goodness-of-fit criteria for generalized placement schemes were considered by A. F. Ronzhin in /38/.

In these works, the properties of statistical criteria were compared from the point of view of relative asymptotic efficiency. Both converging (contiguous) hypotheses, with efficiency in the sense of Pitman, and non-converging hypotheses, with efficiency in the sense of Bahadur, Hodges-Lehmann, and Chernoff, were considered. The connection between the various types of relative efficiency of statistical tests is discussed, for example, in /49/. As follows from the results of Yu. I. Medvedev in /31/ on the distribution of separable statistics in a multinomial scheme, the criterion based on the chi-square statistic has the greatest asymptotic power under converging hypotheses in the class of separable statistics of the outcome frequencies in a multinomial scheme. This result was generalized by A. F. Ronzhin to schemes of type (0.2) in /38/. I. I. Viktorova and V. P. Chistyakov in /4/ constructed an optimal criterion for a multinomial scheme in the class of linear functions of the $\mu_r$. A. F. Ronzhin in /38/ constructed a criterion that, for a sequence of alternatives not close to the null hypothesis, minimizes the logarithmic rate at which the probability of a type I error tends to zero, in the class of statistics of the form (0.6). A comparison of the relative performance of the chi-square and maximum likelihood ratio statistics under converging and non-converging hypotheses was carried out in /54/. The thesis considers the case of non-converging hypotheses. Studying the relative statistical efficiency of criteria under non-converging hypotheses requires studying the probabilities of extremely large deviations, of order $O(\sqrt{n})$. Such a problem was first solved for a multinomial distribution with a fixed number of outcomes by I. N. Sanov in /40/.
The asymptotic optimality of goodness-of-fit tests for testing simple and composite hypotheses for a multinomial distribution with a finite number of outcomes under non-converging alternatives was considered in /48/. The properties of the information distance were previously studied by Kullback and Leibler /29/, /53/ and I. N. Sanov /40/, as well as Hoeffding /48/. In these works, the continuity of the information distance was considered on finite-dimensional spaces in the Euclidean metric. A number of authors considered a sequence of spaces of increasing dimension, for example Yu. V. Prokhorov /37/ and V. I. Bogachev and A. V. Kolesnikov /1/. Rough (up to logarithmic equivalence) theorems on the probabilities of large deviations of separable statistics in generalized placement schemes under the Cramer condition were obtained by A. F. Ronzhin in /38/. A. N. Timashev in /42/, /43/ obtained exact (up to equivalence) multidimensional integral and local limit theorems on the probabilities of large deviations of the vector $(\mu_{r_1}(n, N), \dots, \mu_{r_s}(n, N))$, where $s, r_1, \dots, r_s$ are fixed integers, [...]

Statistical problems of hypothesis testing and parameter estimation in a sampling scheme without replacement, in a slightly different formulation, were considered by G. I. Ivchenko, V. V. Levin, and E. E. Timonina in /10/, /15/, where estimation problems were solved for a finite population whose number of elements is unknown, and the asymptotic normality of multivariate S-statistics from $s$ independent samples in a sampling scheme without replacement was proved. Random variables associated with repetitions in sequences of independent trials were studied by A. M. Zubkov, V. G. Mikhailov, and A. M. Shoitov in /6/, /7/, /32/, /33/, /34/. An analysis of the main statistical problems of estimation and hypothesis testing within the framework of the general Markov-Pólya model was carried out by G. I. Ivchenko and Yu. I. Medvedev in /13/; a probabilistic analysis of this model was given in /11/. A method for specifying non-uniform probability measures on a set of combinatorial objects that is not reducible to the generalized placement scheme (0.2) was described by G. I. Ivchenko and Yu. I. Medvedev in /12/. A number of problems in probability theory in which the answer can be obtained by computations with recurrence formulas are indicated by A. M. Zubkov in /5/.

Inequalities for the entropy of discrete distributions were obtained in /50/ (cited from the abstract by A. M. Zubkov in RZhMat). If $(p_n)_{n=0}^{\infty}$ is a probability distribution, $P_n = \sum_{k=n}^{\infty} p_k$, and $\lambda$ [...], then [...] (0.14), [...]

$p_n = \frac{\lambda^n}{(\lambda + 1)^{n+1}}, \quad n \ge 0. \quad (0.15)$

Note that the extremal distribution (0.15) is a geometric distribution with mathematical expectation $\lambda$, and the function $F(\lambda)$ of the parameter in (0.14) coincides with the function of the mathematical expectation in Theorem 1.
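The statement that (0.15) is a geometric distribution whose expectation equals its parameter can be checked numerically. The following is a minimal sketch, writing `lam` for that parameter; the function names are hypothetical, and the closed-form entropy $(\lambda + 1)\ln(\lambda + 1) - \lambda \ln \lambda$ is the standard entropy of this geometric law, which is not claimed here to be the exact form of the function $F$ used in the dissertation.

```python
import math


def geometric_pmf(lam, n):
    """p_n = lam**n / (lam + 1)**(n + 1), n >= 0, as in (0.15).

    Computed through the ratio lam / (lam + 1) < 1 to avoid overflow.
    """
    ratio = lam / (lam + 1.0)
    return ratio ** n / (lam + 1.0)


def truncated_mean_and_entropy(lam, terms=5000):
    """Sum the series for the mean and the entropy until terms underflow."""
    mean = 0.0
    entropy = 0.0
    for n in range(terms):
        p = geometric_pmf(lam, n)
        if p == 0.0:  # the remaining tail is below floating-point resolution
            break
        mean += n * p
        entropy -= p * math.log(p)
    return mean, entropy


def closed_form_entropy(lam):
    """Entropy of the geometric law with mean lam (natural logarithm)."""
    return (lam + 1.0) * math.log(lam + 1.0) - lam * math.log(lam)


if __name__ == "__main__":
    for lam in (0.5, 2.0, 10.0):
        mean, entropy = truncated_mean_and_entropy(lam)
        print(lam, mean, entropy, closed_form_entropy(lam))
```

The series is truncated only where the terms underflow, so the printed mean and entropy should agree with `lam` and the closed form up to rounding error.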

Entropy of discrete distributions with bounded mathematical expectation

If a criterion index exists, then the lower index of the criterion coincides with it. The lower index of the criterion always exists. The higher the value of the criterion index (the lower index of the criterion), the better the statistical criterion in this sense. In /38/, the problem of constructing goodness-of-fit criteria for generalized placement schemes with the highest value of the criterion index was solved in the class of criteria that reject the hypothesis $H_0(n, N)$ for [...], where $m > 0$ is some fixed number, the sequence of constants [...] is chosen from the given value of the power of the criterion on a sequence of alternatives, and [...] is a real function of $m + 1$ arguments.

The criterion indices are determined by the probabilities of large deviations. As was shown in /38/, the rough (up to logarithmic equivalence) asymptotics of the probabilities of large deviations of separable statistics, when the Cramer condition for the random variable $f(\xi)$ is satisfied, is determined by the corresponding Kullback-Leibler-Sanov information distance. (A random variable $\eta$ satisfies the Cramer condition if, for some $H > 0$, the moment generating function $\mathbf{M} e^{t\eta}$ is finite in the interval $|t| < H$ /28/.)
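For readers who want a concrete handle on the information distance mentioned here, the sketch below computes the Kullback-Leibler quantity $\sum_k p_k \ln(p_k / q_k)$ for two discrete distributions given as finite lists. The function name and the convention of returning infinity when $q_k = 0 < p_k$ are assumptions made for illustration; the dissertation works with countably many outcomes and its exact definition is not reproduced in this text.

```python
import math


def information_distance(p, q):
    """Kullback-Leibler information distance sum_k p_k * ln(p_k / q_k).

    Terms with p_k == 0 contribute nothing; if q_k == 0 while p_k > 0
    the distance is taken to be infinite (illustrative convention).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pk, qk in zip(p, q):
        if pk == 0.0:
            continue
        if qk == 0.0:
            return math.inf
        total += pk * math.log(pk / qk)
    return total


if __name__ == "__main__":
    print(information_distance([0.5, 0.5], [0.9, 0.1]))
    print(information_distance([0.5, 0.5], [0.5, 0.5]))
```

Note that the quantity is not symmetric in its arguments, which is why the text speaks of an information distance rather than a metric in the usual sense.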

The question of the probabilities of large deviations of statistics depending on an unbounded number of the $\mu_r$, as well as of arbitrary separable statistics that do not satisfy the Cramer condition, remained open. This made it impossible to completely solve the problem of constructing criteria for testing hypotheses in generalized placement schemes with the highest rate of convergence to zero of the probability of a type I error under non-converging alternatives in the class of criteria based on statistics of the form (0.4). The relevance of the dissertation research is determined by the need to complete the solution of this problem.

The purpose of the dissertation is to construct goodness-of-fit criteria with the largest value of the criterion index (the lower index of the criterion) for testing hypotheses in a sampling scheme without replacement, in the class of criteria that reject the hypothesis $H_0(n, N)$ for [...], where $\varphi$ is a function of a countable number of arguments and the parameters $n, N$ vary in the central region.

In accordance with the purpose of the study, the following tasks were set:
- to study the properties of the entropy and the Kullback-Leibler-Sanov information distance for discrete distributions with a countable number of outcomes;
- to study the probabilities of large deviations of statistics of the form (0.4);
- to study the probabilities of large deviations of symmetric separable statistics (0.3) that do not satisfy the Cramer condition;
- to find a statistic such that the goodness-of-fit criterion constructed from it for testing hypotheses in generalized placement schemes has the highest index value in the class of criteria of the form (0.7).

Scientific novelty:
- the concept of a generalized metric is introduced: a function that admits infinite values and satisfies the axioms of identity, symmetry, and the triangle inequality. A generalized metric is found, and sets are indicated on which the entropy and information distance functions, defined on a family of discrete distributions with a countable number of outcomes, are continuous in this metric;
- in the generalized placement scheme, rough (up to logarithmic equivalence) asymptotics are found for the probabilities of large deviations of statistics of the form (0.4) satisfying the corresponding form of the Cramer condition;
- in the generalized placement scheme, rough (up to logarithmic equivalence) asymptotics are found for the probabilities of large deviations of symmetric separable statistics that do not satisfy the Cramer condition;
- in the class of criteria of the form (0.7), a criterion with the highest value of the criterion index is constructed.

Scientific and practical value. The work solves a number of questions about the behavior of the probabilities of large deviations in generalized placement schemes. The results obtained can be used in teaching in the specialties of mathematical statistics and information theory and in the study of statistical procedures for the analysis of discrete sequences, and were used in /3/, /21/ to justify the security of one class of information systems.

Provisions submitted for defense:
- reduction of the problem of testing, from a single sequence of ball colors, the hypothesis that this sequence was obtained by sampling without replacement until exhaustion from an urn containing balls of two colors, with every such sample having the same probability, to the construction of goodness-of-fit criteria for testing hypotheses in the corresponding generalized placement scheme;
- continuity of the entropy and Kullback-Leibler-Sanov information distance functions on an infinite-dimensional simplex with the introduced logarithmic generalized metric;
- a theorem on rough (up to logarithmic equivalence) asymptotics of the probabilities of large deviations of symmetric separable statistics that do not satisfy the Cramer condition in the generalized placement scheme in the semi-exponential case;

Continuity of the Kullback-Leibler-Sanov information distance


The study of the probabilities of large deviations when the Cramer condition is not met for the case of independent random variables was carried out in the works of A. V. Nagaev /35/. The method of conjugate distributions is described by Feller /45/.


Information distance and large deviation probabilities of separable statistics

When the Cramer condition is not satisfied, large deviations of separable statistics in the generalized placement scheme in the semi-exponential case considered are determined by the probability of a deviation of a single independent summand. When the Cramer condition is satisfied, this is not the case, as emphasized in /39/.

Remark 10. The function $\varphi(x)$ is such that the mathematical expectation [...] is finite for $0 < t < 1$ and infinite for $t > 1$.

Remark 11. For separable statistics that do not satisfy the Cramer condition, the limit (2.14) is equal to 0, which proves the validity of the hypothesis expressed in /39/.

Remark 12. For the chi-square statistic in a multinomial scheme with $n, N \to \infty$ so that $n/N \to \lambda$, it follows immediately from the theorem that [...]. This result was obtained directly in /54/.

In this chapter, in the central region of variation of the parameters of generalized schemes for placing particles in cells, rough (up to logarithmic equivalence) asymptotics were found for the probabilities of large deviations of additively separable statistics and of functions of the numbers of cells with a given filling.

If the Cramer condition is satisfied, then the rough asymptotics of the probabilities of large deviations is determined by the rough asymptotics of the probabilities of hitting a sequence of points with rational coordinates that converges, in the above sense, to the point at which the extremum of the corresponding information distance is attained.

The semi-exponential case of failure of the Cramer condition for the random variables $f(\xi_1), \dots, f(\xi_N)$ was considered, where $\xi_1, \dots, \xi_N$ are the independent random variables generating the generalized placement scheme (0.2) and $f(k)$ is the function in the definition of the symmetric additively separable statistic in (0.3). That is, it was assumed that the functions $p(k) = -\ln P(\xi_1 = k)$ and $f(k)$ can be extended to regularly varying functions of a continuous argument of orders $p > 0$ and $q > 0$, respectively, with $p < q$. It turned out that the main contribution to the rough asymptotics of the probabilities of large deviations of separable statistics in generalized placement schemes is made, similarly, by the rough asymptotics of the probability of hitting the corresponding sequence of points. It is interesting to note that the theorem on the probabilities of large deviations for separable statistics had previously been proved by the saddle point method, with the main contribution to the asymptotics coming from a single saddle point. The case in which the Cramer condition is not met and the condition [...] is also not satisfied remains unexplored.

If the Cramer condition is not satisfied, then the specified condition can fail only in the case $p$ [...] $1$. As follows directly from taking the logarithm of the corresponding probabilities, $p = 1$ for the Poisson distribution and for the geometric distribution. From the result on the asymptotics of the probabilities of large deviations when the Cramer condition is not met, we can conclude that criteria whose statistics do not satisfy the Cramer condition have a significantly lower rate of convergence to zero of the probability of a type II error, for a fixed probability of a type I error and non-converging alternatives, than criteria whose statistics satisfy the Cramer condition.

Let balls be drawn without replacement, until complete exhaustion, from an urn containing $N - 1 \ge 1$ white and $n - N \ge 1$ black balls. We associate the positions of the white balls in the sample, $1 \le i_1 < \dots < i_{N-1} \le n - 1$, with the sequence of distances between neighboring white balls $h_1, \dots, h_N$ as follows: [...]. Then $h_\nu \ge 1$, $\nu = 1, \dots, N$, and $\sum_{\nu=1}^{N} h_\nu = n$. Let us define a probability distribution on the set of vectors $h = (h_1, \dots, h_N)$ by setting $P(h_\nu = r_\nu, \nu = 1, \dots, N) = P(\xi_\nu = r_\nu, \nu = 1, \dots, N \mid \xi_1 + \dots + \xi_N = n)$, where $\xi_1, \dots, \xi_N$ are independent non-negative integer random variables (r.v.), that is, consider the generalized placement scheme (0.2). The distribution of the vector $h$ depends on $n, N$, but the corresponding indices will be omitted where possible to simplify notation.

Remark 14. If each of the $\binom{n-1}{N-1}$ ways of selecting balls from the urn is assigned the same probability $\binom{n-1}{N-1}^{-1}$, then for any $r_1, \dots, r_N$ such that $r_\nu \ge 1$, $\nu = 1, \dots, N$, $\sum_{\nu=1}^{N} r_\nu = n$, the probability that the distances between adjacent white balls in the sample take these values [...]
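The urn construction can be illustrated with a small simulation. The sketch below assumes the convention $h_\nu = i_\nu - i_{\nu-1}$ with $i_0 = 0$ and $i_N = n$, which is one natural reading consistent with the stated constraints $h_\nu \ge 1$ and $h_1 + \dots + h_N = n$; the explicit formula is lost in the text above, so this convention and the function names are assumptions.

```python
import random
from collections import Counter


def draw_distances(n, N, rng):
    """Draw all balls from an urn with N-1 white and n-N black balls
    and return the distances h_1, ..., h_N between neighboring white balls.

    Assumed convention: h_v = i_v - i_{v-1}, with i_0 = 0 and i_N = n.
    """
    if N < 2 or n <= N:
        raise ValueError("need N - 1 >= 1 white and n - N >= 1 black balls")
    urn = [1] * (N - 1) + [0] * (n - N)  # 1 = white, 0 = black
    rng.shuffle(urn)                     # uniformly random order of drawing
    white_positions = [i + 1 for i, ball in enumerate(urn) if ball == 1]
    boundaries = [0] + white_positions + [n]
    return [b - a for a, b in zip(boundaries, boundaries[1:])]


def cell_counts(h):
    """mu_r = number of distances equal to r (cells containing r particles)."""
    return Counter(h)


if __name__ == "__main__":
    rng = random.Random(0)
    h = draw_distances(n=20, N=6, rng=rng)
    print(h, sum(h), dict(cell_counts(h)))
```

Each draw yields $N$ positive distances summing to $n$, so the vector $h$ can be read as the contents of $N$ cells holding $n$ particles in total, which is the sense in which the urn problem reduces to a placement scheme.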

Criteria based on the number of cells in generalized placement schemes

The purpose of the dissertation was to construct goodness-of-fit criteria for testing hypotheses in a scheme of sampling without replacement from an urn containing balls of two colors. The author chose to study statistics based on the frequencies of distances between balls of the same color. In this formulation, the problem was reduced to testing hypotheses in a suitable generalized placement scheme.

In the dissertation:
- the properties of the entropy and the information distance of discrete distributions with an unbounded number of outcomes and a bounded mathematical expectation were studied;
- rough (up to logarithmic equivalence) asymptotics of the probabilities of large deviations of a wide class of statistics in a generalized placement scheme were obtained;
- based on the results obtained, a criterion function was constructed with the highest logarithmic rate of convergence to zero of the probability of a type I error, for a fixed probability of a type II error and non-converging alternatives;
- it was proved that statistics that do not satisfy the Cramer condition have a lower rate of convergence to zero of the probabilities of large deviations than statistics that satisfy this condition.

The scientific novelty of the work is as follows:
- the concept of a generalized metric is introduced: a function that admits infinite values and satisfies the axioms of identity, symmetry, and the triangle inequality. A generalized metric is found, and sets are indicated on which the entropy and information distance functions, defined on a family of discrete distributions with a countable number of outcomes, are continuous in this metric;
- in the generalized placement scheme, rough (up to logarithmic equivalence) asymptotics are found for the probabilities of large deviations of statistics of the form (0.4) satisfying the corresponding form of the Cramer condition;
- in the generalized placement scheme, rough (up to logarithmic equivalence) asymptotics are found for the probabilities of large deviations of symmetric separable statistics that do not satisfy the Cramer condition;
- in the class of criteria of the form (0.7), a criterion with the highest value of the criterion index is constructed.

The work solves a number of questions about the behavior of the probabilities of large deviations in generalized placement schemes. The results obtained can be used in teaching in the specialties of mathematical statistics and information theory and in the study of statistical procedures for the analysis of discrete sequences, and were used in /3/, /21/ to justify the security of one class of information systems.

However, a number of questions remain open. The author limited himself to the central zone of variation of the parameters $n, N$ of generalized schemes for placing $n$ particles in $N$ cells. If the support of the distribution of the random variables generating the generalized placement scheme (0.2) is not a set of the form $r, r + 1, r + 2, \dots$, then in proving the continuity of the information distance function and in studying the probabilities of large deviations it is necessary to take into account the arithmetic structure of the support, which was not considered in the author's work. For the practical application of criteria built on the proposed function with the maximum index value, it is necessary to study its distribution both under the null hypothesis and under alternatives, including converging ones. It is also of interest to transfer the developed methods and generalize the results obtained to probabilistic schemes other than generalized placement schemes.

If $\mu_1, \mu_2, \dots$ are the frequencies of the distances between the positions of outcome 0 in a binomial scheme with outcome probabilities $p_0$, $1 - p_0$, then it can be shown that [...] (3.3). In this case, from the analysis of the formula for the joint distribution of the quantities $\mu_r$ in a generalized placement scheme, proved in /26/, it follows that distribution (3.3) cannot, in general, be represented as the joint distribution of the quantities $\mu_r$ in any generalized scheme for placing particles in cells. This distribution is a special case of the distributions on the set of combinatorial objects introduced in /12/. Transferring the results of the dissertation for generalized placement schemes to this case, which was discussed in /52/, appears to be an important task.

Exact Tests provides two additional methods for calculating significance levels for the statistics available through the Crosstabs and Nonparametric Tests procedures. These methods, the exact and Monte Carlo methods, provide a means for obtaining accurate results when your data fail to meet any of the underlying assumptions necessary for reliable results using the standard asymptotic method. Available only if you have purchased the Exact Tests Options.

Example. Asymptotic results obtained from small datasets or sparse or unbalanced tables can be misleading. Exact tests enable you to obtain an accurate significance level without relying on assumptions that might not be met by your data. For example, results of an entrance exam for 20 fire fighters in a small township show that all five white applicants received a pass result, whereas the results for Black, Asian and Hispanic applicants are mixed. A Pearson chi-square testing the null hypothesis that results are independent of race produces an asymptotic significance level of 0.07. This result leads to the conclusion that exam results are independent of the race of the examinee. However, because the data contains only 20 cases and the cells have expected frequencies of less than 5, this result is not trustworthy. The exact significance of the Pearson chi-square is 0.04, which leads to the opposite conclusion. Based on the exact significance, you would conclude that exam results and race of the examinee are related. This demonstrates the importance of obtaining exact results when the assumptions of the asymptotic method cannot be met. The exact significance is always reliable, regardless of the size, distribution, sparseness, or balance of the data.

Statistics. Asymptotic significance. Monte Carlo approximation with confidence level, or exact significance.

Asymptotic. The significance level based on the asymptotic distribution of a test statistic. Typically, a value of less than 0.05 is considered significant. The asymptotic significance is based on the assumption that the data set is large. If the data set is small or poorly distributed, this may not be a good indication of significance.
Monte Carlo Estimate. An unbiased estimate of the exact significance level, calculated by repeatedly sampling from a reference set of tables with the same dimensions and row and column margins as the observed table. The Monte Carlo method allows you to estimate exact significance without relying on the assumptions required for the asymptotic method. This method is most useful when the data set is too large to compute exact significance, but the data do not meet the assumptions of the asymptotic method.
Exact. The probability of the observed outcome or an outcome more extreme is calculated exactly. Typically, a significance level less than 0.05 is considered significant, indicating that there is some relationship between the row and column variables.
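The Monte Carlo idea described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure the product itself uses: tables with the observed row and column margins are sampled by shuffling the column labels of the individual cases, the Pearson chi-square statistic is recomputed for each sampled table, and the estimate is the fraction of sampled tables at least as extreme as the observed one. The function names and the 4 x 2 table are hypothetical; the source does not give the counts for the fire fighter example.

```python
import random


def chi_square(table):
    """Pearson chi-square statistic for a table with positive margins."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def monte_carlo_significance(table, n_samples=10000, seed=0):
    """Estimate the exact significance by sampling tables with fixed margins."""
    rng = random.Random(seed)
    # Expand the table into one (row label, column label) pair per case.
    rows = [i for i, row in enumerate(table) for c in row for _ in range(c)]
    cols = [j for row in table for j, c in enumerate(row) for _ in range(c)]
    observed = chi_square(table)
    n_rows, n_cols = len(table), len(table[0])
    hits = 0
    for _ in range(n_samples):
        rng.shuffle(cols)  # permuting labels keeps both margins fixed
        sampled = [[0] * n_cols for _ in range(n_rows)]
        for i, j in zip(rows, cols):
            sampled[i][j] += 1
        if chi_square(sampled) >= observed - 1e-12:  # tolerance for rounding
            hits += 1
    return hits / n_samples


if __name__ == "__main__":
    # Hypothetical counts (rows: groups, columns: pass / fail), 20 cases.
    hypothetical = [[5, 0], [2, 3], [2, 3], [2, 3]]
    print(chi_square(hypothetical), monte_carlo_significance(hypothetical))
```

Increasing `n_samples` narrows the sampling error of the estimate, which is what the confidence level reported alongside a Monte Carlo significance reflects.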

Nowadays, interest in data analysis is growing steadily and rapidly in very different fields, such as biology, linguistics, economics, and, of course, IT. This analysis rests on statistical methods, and every self-respecting data mining specialist needs to understand them.

Unfortunately, truly good literature, the kind that provides both mathematically rigorous proofs and clear intuitive explanations, is not very common. In my opinion, these lectures are unusually good precisely in this respect for mathematicians who understand probability theory. They are taught to master's students at the Christian-Albrecht University in Germany in the Mathematics and Financial Mathematics programs. For those who are interested in how this subject is taught abroad, I have translated these lectures. The translation took me several months; I supplemented the lectures with illustrations, exercises, and footnotes on some theorems. I note that I am not a professional translator, just an altruist and amateur in this field, so I will accept any criticism as long as it is constructive.

In short, this is what the lectures are about:

Conditional mathematical expectation

This chapter is not directly about statistics, but it is an ideal starting point for studying it. The conditional expectation is the best prediction (in the mean-square sense) of a random outcome given the information already available, and it is itself a random variable. The chapter covers its various properties, such as linearity, monotonicity, monotone convergence and others.
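
The "best prediction" property can be checked on a toy example. The sketch below assumes a small, made-up joint distribution of (X, Y) and measures prediction quality by mean squared error; all names are illustrative.

```python
from collections import defaultdict

# Hypothetical joint distribution of (X, Y): {(x, y): probability}.
joint = {(0, 1): 0.2, (0, 3): 0.2, (1, 2): 0.3, (1, 6): 0.3}

def conditional_expectation(joint):
    """Return {x: E[Y | X = x]} for a discrete joint distribution."""
    mass = defaultdict(float)
    weighted = defaultdict(float)
    for (x, y), p in joint.items():
        mass[x] += p
        weighted[x] += y * p
    return {x: weighted[x] / mass[x] for x in mass}

def mse(joint, predictor):
    """Mean squared error E[(Y - predictor(X))^2]."""
    return sum(p * (y - predictor(x)) ** 2 for (x, y), p in joint.items())

cond = conditional_expectation(joint)
print(cond)                                 # E[Y | X = x] for each x
print(mse(joint, lambda x: cond[x]))        # error of the conditional expectation
print(mse(joint, lambda x: 3.2))            # error of the constant predictor E[Y]
```

Any other function of X, such as the constant E[Y], gives a mean squared error at least as large as that of the conditional expectation.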

Point Estimation Basics

How do you estimate a distribution parameter? Which criterion should you choose for this? Which methods should you use? This chapter helps answer all these questions. It introduces the concepts of an unbiased estimator and a uniformly minimum-variance unbiased estimator. It explains where the chi-square and t-distributions come from and why they are important in estimating the parameters of the normal distribution. It explains what the Cramér-Rao inequality and Fisher information are. The concept of an exponential family is also introduced, which makes it much easier to obtain a good estimator.
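
Unbiasedness can be verified exactly on a small example. The sketch below assumes samples of size 3 drawn with replacement from a fair die (a made-up population) and averages the variance estimator over all possible samples; the names are illustrative.

```python
from itertools import product

population = [1, 2, 3, 4, 5, 6]  # hypothetical population: a fair die
pop_mean = sum(population) / len(population)
true_var = sum((x - pop_mean) ** 2 for x in population) / len(population)

def var_estimate(sample, ddof):
    """Sum of squared deviations from the sample mean, divided by n - ddof."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / (len(sample) - ddof)

def expected_estimate(n, ddof):
    """Exact expectation of the estimator over all equally likely samples of size n."""
    samples = list(product(population, repeat=n))
    return sum(var_estimate(s, ddof) for s in samples) / len(samples)

print(true_var)                 # variance of the population
print(expected_estimate(3, 1))  # divisor n - 1: matches the true variance
print(expected_estimate(3, 0))  # divisor n: too small by the factor (n - 1) / n
```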

Bayesian and minimax parameter estimation

A different philosophical approach to estimation is described here. In this case, the parameter is unknown because it is regarded as a realization of a random variable with a known (prior) distribution. By observing the result of the experiment, we calculate the so-called posterior distribution of the parameter. Based on this, we can obtain a Bayesian estimator, which minimizes the average loss, or a minimax estimator, which minimizes the maximum possible loss.
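
As an illustration, consider estimating a success probability p from k successes in n trials under squared-error loss. The sketch below assumes a Beta(a, b) prior for the Bayesian estimator and uses the classical minimax estimator for this problem, which is the Bayes estimator for the Beta(sqrt(n)/2, sqrt(n)/2) prior; the function names are illustrative.

```python
import math

def bayes_estimate(k, n, a=1.0, b=1.0):
    """Posterior mean of p under a Beta(a, b) prior after k successes in n trials."""
    return (a + k) / (a + b + n)

def minimax_estimate(k, n):
    """Bayes estimate for the Beta(sqrt(n)/2, sqrt(n)/2) prior; its risk is constant in p."""
    r = math.sqrt(n)
    return (k + r / 2) / (n + r)

def risk(estimator, n, p):
    """Exact mean squared error E[(estimator(K, n) - p)^2] for K ~ Binomial(n, p)."""
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k) * (estimator(k, n) - p) ** 2
        for k in range(n + 1)
    )

for p in (0.02, 0.5, 0.9):
    print(p, risk(bayes_estimate, 16, p), risk(minimax_estimate, 16, p))
```

The minimax estimator has the same risk for every p, whereas the uniform-prior Bayesian estimator is better for some values of p and worse for others.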

Sufficiency and completeness

This chapter is of considerable practical importance. A sufficient statistic is a function of the sample whose value alone is enough to estimate the parameter, so the raw sample need not be stored. There are many such functions, and among them are the so-called minimal sufficient statistics. For example, to estimate the mean of a normal distribution (which coincides with its median) when the variance is known, it is enough to store a single number: the arithmetic mean of the whole sample. Does this also work for other distributions, such as the Cauchy distribution? How do sufficient statistics help in choosing estimators? Here you can find answers to these questions.
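
One way to see sufficiency numerically is sketched below, assuming a normal model with known standard deviation: two different samples with the same size and the same sum give exactly the same likelihood ratio between any two candidate means, so they carry the same information about the mean. The function names and sample values are illustrative.

```python
import math

def log_likelihood(sample, mu, sigma=1.0):
    """Log-likelihood of an i.i.d. normal sample with mean mu and known sigma."""
    n = len(sample)
    return (-0.5 * n * math.log(2 * math.pi * sigma**2)
            - sum((x - mu) ** 2 for x in sample) / (2 * sigma**2))

def log_likelihood_ratio(sample, mu1, mu2):
    """Log of L(mu1) / L(mu2); depends on the sample only through its size and sum."""
    return log_likelihood(sample, mu1) - log_likelihood(sample, mu2)

a = [1.0, 2.0, 3.0]  # hypothetical samples with equal size and equal sum
b = [0.0, 2.0, 4.0]
print(log_likelihood_ratio(a, 0.5, 2.5), log_likelihood_ratio(b, 0.5, 2.5))
```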

Asymptotic properties of estimates

Perhaps the most important and necessary property of an estimator is its consistency, that is, its convergence to the true parameter as the sample size increases. This chapter describes which properties are possessed by the estimators obtained with the statistical methods of the previous chapters. The concepts of asymptotic unbiasedness, asymptotic efficiency and Kullback-Leibler distance are introduced.
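
Consistency is easy to observe in a simulation. The sketch below assumes observations uniform on [0, 1) (true mean 0.5) and a fixed random seed; the function name and the sample sizes are illustrative.

```python
import random

def sample_mean_errors(true_mean=0.5, sizes=(10, 1000, 100000), seed=1):
    """Absolute error of the sample mean for uniform [0, 1) data at several sample sizes."""
    rng = random.Random(seed)
    errors = {}
    for n in sizes:
        sample_mean = sum(rng.random() for _ in range(n)) / n
        errors[n] = abs(sample_mean - true_mean)
    return errors

print(sample_mean_errors())
```

The error for a single run need not decrease monotonically, but it becomes small with high probability as the sample size grows.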

Testing Basics

Besides estimating an unknown parameter, we must also be able to check whether it satisfies required properties. For example, suppose an experiment is conducted to test a new drug. How do you know whether the probability of recovery is higher with it than with the old medications? This chapter explains how such tests are constructed. You will learn what a uniformly most powerful test, the Neyman-Pearson test, the significance level and a confidence interval are, and where the well-known Gauss test and t-test come from.
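
A minimal sketch of the Gauss test follows, assuming normal observations with known standard deviation and the one-sided hypotheses H0: mu <= mu0 against H1: mu > mu0. The function name and the data are made up for illustration.

```python
import math

def gauss_test(sample, mu0, sigma, alpha=0.05):
    """One-sided Gauss (z) test of H0: mu <= mu0 against H1: mu > mu0, sigma known."""
    n = len(sample)
    z = (sum(sample) / n - mu0) * math.sqrt(n) / sigma
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z > z) for a standard normal Z
    return z, p_value, p_value < alpha

data = [5.1, 4.9, 5.6, 5.8, 6.0, 5.5, 5.3, 5.7, 5.4]  # hypothetical measurements
z, p_value, reject = gauss_test(data, mu0=5.0, sigma=0.5)
print(z, p_value, reject)
```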

Asymptotic properties of tests

Like estimators, tests must satisfy certain asymptotic properties. Sometimes it is impossible to construct the required test exactly; then, using the well-known central limit theorem, we construct a test that asymptotically approaches the required one. Here you will learn what the asymptotic significance level and the likelihood ratio method are, and how the Bartlett test and the chi-square test of independence are constructed.
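
As an example of an asymptotic test, the sketch below applies the likelihood ratio method to k successes in n Bernoulli trials with H0: p = p0. It assumes 0 < p0 < 1 and relies on the large-sample chi-square approximation with one degree of freedom for twice the log likelihood ratio; the function name is illustrative.

```python
import math

def lr_test_bernoulli(k, n, p0):
    """Likelihood ratio test of H0: p = p0; returns (statistic, asymptotic p-value)."""
    def loglik(p):
        # Bernoulli log-likelihood with the convention 0 * log(0) = 0.
        total = 0.0
        if k > 0:
            total += k * math.log(p)
        if k < n:
            total += (n - k) * math.log(1 - p)
        return total

    # 2 * (maximized log-likelihood - log-likelihood under H0), clipped at 0.
    stat = max(0.0, 2 * (loglik(k / n) - loglik(p0)))
    # Chi-square with 1 degree of freedom: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

print(lr_test_bernoulli(70, 100, 0.5))
```

The significance level of such a test is only approximately the nominal one for finite n, which is exactly what "asymptotic significance level" means.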

Linear model

This chapter can be seen as a supplement: an application of statistics to linear regression. You will understand which estimators are good and under what conditions. You will learn where the least squares method comes from, how to construct tests, and why the F-distribution is needed.
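
For the simplest linear model, y = intercept + slope * x + error, the least squares estimates and the F statistic for H0: slope = 0 can be sketched as follows. The sketch assumes at least three points, not all x equal, and a fit that is not perfect (otherwise the residual sum of squares is zero); the data are made up.

```python
def least_squares_line(xs, ys):
    """Least squares estimates (intercept, slope) for simple linear regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def f_statistic(xs, ys):
    """F statistic for H0: slope = 0; F(1, n - 2) distributed under H0 with normal errors."""
    n = len(xs)
    intercept, slope = least_squares_line(xs, ys)
    my = sum(ys) / n
    fitted = [intercept + slope * x for x in xs]
    ss_reg = sum((f - my) ** 2 for f in fitted)            # explained sum of squares
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # residual sum of squares
    return ss_reg / (ss_res / (n - 2))

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]  # hypothetical noisy observations
print(least_squares_line(xs, ys), f_statistic(xs, ys))
```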

As noted in the previous section, classical algorithms can in many cases be studied using asymptotic methods of mathematical statistics, in particular the central limit theorem (CLT) and methods of inheritance of convergence. The gap between classical mathematical statistics and the needs of applied research shows, in particular, in the fact that widely used monographs lack the mathematical apparatus needed, for example, to study two-sample statistics. The point is that the limit has to be taken not in one parameter but in two: the sizes of the two samples. We had to develop an appropriate theory, the theory of inheritance of convergence, set out in our monograph.

However, the results of such a study have to be applied to finite sample sizes, and this transition raises a whole range of problems. Some of them were discussed in connection with the study of the properties of statistics constructed from samples from specific distributions.

However, when discussing the impact of deviations from initial assumptions on the properties of statistical procedures, additional problems arise. What deviations are considered typical? Should we focus on the most “harmful” deviations that most distort the properties of algorithms, or should we focus on “typical” deviations?

With the first approach, we get a guaranteed result, but the “price” of this result may be too high. As an example, consider the universal Berry-Esseen inequality for the error in the CLT. A.A. Borovkov quite rightly emphasizes that “the speed of convergence in real problems, as a rule, turns out to be better.”
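
For reference, the Berry-Esseen inequality in its standard form is given below, assuming independent identically distributed summands X_1, ..., X_n with mean \mu, variance \sigma^2 > 0 and finite third absolute central moment \rho. Here F_n is the distribution function of the standardized sum, \Phi is the standard normal distribution function, and C is an absolute constant.

```latex
\sup_{x \in \mathbb{R}} \left| F_n(x) - \Phi(x) \right|
  \le \frac{C \rho}{\sigma^{3} \sqrt{n}},
\qquad
F_n(x) = P\!\left( \frac{X_1 + \dots + X_n - n\mu}{\sigma \sqrt{n}} \le x \right),
\qquad
\rho = \mathbb{E}\,|X_1 - \mu|^{3} < \infty .
```

The bound holds for every distribution with a finite third moment, which is why it is a worst-case guarantee and may be far from the actual error in a particular problem.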

With the second approach, the question arises of which deviations are considered “typical.” You can try to answer this question by analyzing large amounts of real data. It is quite natural that the answers of different research groups will differ, as can be seen, for example, from the results given in the article.

One mistaken idea is to use only a specific parametric family when analyzing possible deviations: the Weibull-Gnedenko distributions, the three-parameter family of gamma distributions, etc. Back in 1927, S.N. Bernstein, Academician of the USSR Academy of Sciences, discussed the methodological error of reducing all empirical distributions to the four-parameter Pearson family. However, parametric methods of statistics are still very popular, especially among applied scientists, and the blame for this misconception lies primarily with teachers of statistical methods (see below, as well as the article).

15. Selecting one of many criteria to test a specific hypothesis

Often several methods have been developed to solve one specific practical problem, and a specialist in mathematical research methods faces the question: which of them should be offered to the applied scientist for analyzing specific data?

As an example, consider the problem of testing the homogeneity of two independent samples. As is well known, many criteria can be proposed for solving it: Student, Cramer-Welch, Lord, chi-square, Wilcoxon (Mann-Whitney), Van der Waerden, Savage, N.V. Smirnov, omega-square type (Lehmann-Rosenblatt), G.V. Martynov, etc. Which one should be chosen?

The idea of “voting” naturally comes to mind: apply many criteria and then make a decision “by majority vote”. From the point of view of statistical theory, such a procedure simply amounts to constructing yet another criterion, which is a priori no better than the previous ones but is more difficult to study. On the other hand, if the decisions coincide for all the statistical criteria considered, which are based on different principles, then, in accordance with the concept of stability, this increases confidence in the resulting common decision.

There is a false and harmful opinion, widespread especially among mathematicians, that one must search for optimal methods, solutions, etc. The fact is that optimality usually disappears when one deviates from the initial premises. Thus, the arithmetic mean as an estimate of the mathematical expectation is optimal only when the initial distribution is normal, while it is always a consistent estimate, as long as the mathematical expectation exists. On the other hand, for any arbitrarily chosen method of estimation or hypothesis testing, it is usually possible to formulate the concept of optimality in such a way that the method in question becomes optimal from this specially chosen point of view. Take, for example, the sample median as an estimate of the mathematical expectation. It is, of course, optimal, although in a different sense than the arithmetic mean (which is optimal for a normal distribution). Namely, for the Laplace distribution, the sample median is the maximum likelihood estimate, and therefore optimal (in the sense specified in the monograph).
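
The statement about the Laplace distribution follows directly from the form of its likelihood. The short derivation below assumes a Laplace density with location \mu and scale b > 0 and an i.i.d. sample x_1, ..., x_n; for even n any point between the two middle order statistics maximizes the likelihood, and the sample median is one such point.

```latex
\begin{aligned}
f(x \mid \mu, b) &= \frac{1}{2b} \exp\!\left( -\frac{|x - \mu|}{b} \right), \\
\log L(\mu, b) &= -n \log(2b) - \frac{1}{b} \sum_{i=1}^{n} |x_i - \mu| , \\
\hat{\mu} &= \arg\min_{\mu} \sum_{i=1}^{n} |x_i - \mu|
           = \operatorname{med}(x_1, \dots, x_n) .
\end{aligned}
```

For the normal distribution the same argument with squared deviations in place of absolute deviations gives the arithmetic mean.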

The homogeneity criteria were analyzed in the monograph. There are several natural approaches to comparing criteria, based on asymptotic relative efficiency in the sense of Bahadur, Hodges-Lehmann, or Pitman. It turned out that each criterion is optimal for a corresponding alternative or for a suitable distribution on the set of alternatives. Moreover, the mathematical calculations usually use the shift alternative, which is relatively rare in the practice of analyzing real statistical data (in connection with the Wilcoxon test, we have discussed and criticized this alternative elsewhere). The result is sad: the brilliant mathematical technique demonstrated in the monograph does not allow us to give recommendations for choosing a homogeneity criterion when analyzing real data. In other words, from the point of view of the applied statistician, i.e. for the analysis of specific data, the monograph is useless. The brilliant mastery of mathematics and the enormous diligence demonstrated by its author, alas, contributed nothing to practice.

Of course, every practicing statistician solves the problem of choosing a statistical criterion for himself in one way or another. Based on a number of methodological considerations, we chose the omega-square (Lehmann-Rosenblatt) criterion, which is consistent against any alternative. However, a feeling of dissatisfaction remains because this choice lacks justification.
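
For readers who want to try the criterion, here is a minimal sketch of the Lehmann-Rosenblatt statistic computed directly from its definition: the squared difference of the two empirical distribution functions, integrated with respect to the empirical distribution function of the pooled sample and multiplied by mn/(m + n). It assumes samples without ties; the function name is illustrative.

```python
from bisect import bisect_right

def lehmann_rosenblatt(xs, ys):
    """Two-sample omega-square (Lehmann-Rosenblatt) statistic, assuming no ties."""
    m, n = len(xs), len(ys)
    sx, sy = sorted(xs), sorted(ys)
    total = 0.0
    for z in sx + sy:                  # every point of the pooled sample
        f = bisect_right(sx, z) / m    # empirical CDF of the first sample at z
        g = bisect_right(sy, z) / n    # empirical CDF of the second sample at z
        total += (f - g) ** 2
    # Integral with respect to the pooled empirical CDF is total / (m + n).
    return m * n / (m + n) ** 2 * total

print(lehmann_rosenblatt([1, 2], [3, 4]))  # fully separated samples
print(lehmann_rosenblatt([1, 3], [2, 4]))  # interleaved samples
```

Large values of the statistic indicate that the two samples come from different distributions; critical values are taken from the limiting distribution of the statistic.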

Asymptotic properties of goodness-of-fit criteria for testing hypotheses in a selection scheme without return, based on filling cells in a general well placement scheme Alexander Vladimirovich. Asymptotic behavior of functions

Entropy of discrete distributions with bounded mathematical expectation

Continuity of Kullback - Leibler - Sanov information distance

Information distance and large deviation probabilities of separable statistics