Maximum Likelihood Method for Parameter Estimation

Nguyễn Thị Hương
Hanoi University (Trường Đại Học Hà Nội)

Abstract - Maximum likelihood estimation (MLE) is a method for choosing estimators of parameters that avoids using prior distributions and loss functions. It chooses as the estimate of $\theta$ the value of $\theta$ that provides the largest value of the likelihood function. In other words, MLE determines values for the parameters of a statistical model such that they maximize the likelihood that the process described by the model produced the data that were actually observed. This paper explains the method and its role in parameter estimation through typical examples. Some of the content relies on basic probability, including the definitions and theorems of conditional probability and independent events. MLE is a simple and fairly effective method that can be applied to most parameter estimation problems. Moreover, when the sample is large, it typically yields efficient and reliable estimates. For these reasons, MLE is widely used in statistics.

Keywords - Maximum likelihood estimation (MLE), Bernoulli distribution, Poisson distribution, likelihood function.

I. INTRODUCTION

In this paper I'll explain what the maximum likelihood method for parameter estimation is and go through a simple example to demonstrate the method. Some of the content requires knowledge of fundamental probability concepts such as the definition of joint probability and independence of events. Maximum likelihood estimation (MLE) is a simple method of constructing an estimator without having to specify a loss function and a prior distribution; it was introduced by R. A. Fisher in 1912. Maximum likelihood estimation can be applied in most problems, it has a strong intuitive appeal, and it will often yield a reasonable estimator of $\theta$. Furthermore, if the sample is large, the method will typically yield an excellent estimator of $\theta$. For these reasons, the method of maximum likelihood is probably the most widely used method of estimation in statistics.

II. MAXIMUM LIKELIHOOD METHOD

1) Definition

Let the random variables $X_1, X_2, \dots, X_n$ have joint density denoted

$$f_\theta(x_1, x_2, \dots, x_n) = f(x_1, x_2, \dots, x_n \mid \theta).$$

Given observed values $X_1 = x_1, X_2 = x_2, \dots, X_n = x_n$, the likelihood of $\theta$ is the function

$$\mathrm{lik}(\theta) = f(x_1, x_2, \dots, x_n \mid \theta) = f_n(x \mid \theta),$$

considered as a function of $\theta$. If the distribution is discrete, $f$ will be the frequency function (probability mass function). In words: $\mathrm{lik}(\theta)$ is the probability of observing the given data, viewed as a function of $\theta$.

The maximum likelihood estimate (MLE) of $\theta$ is that value of $\theta$ that maximizes $\mathrm{lik}(\theta)$: it is the value that makes the observed data the "most probable".

Likelihood Function: When the joint probability density function (p.d.f.) or the joint probability mass function (p.m.f.) $f_n(x \mid \theta)$ of the observations in a random sample is regarded as a function of $\theta$ for given values of $x_1, x_2, \dots, x_n$, it is called the likelihood function.

What are parameters?

Often in machine learning we use a model to describe the process that results in the data that are observed.
For example, we may use a random forest model to classify whether customers may cancel a subscription to a service (known as churn modelling), or we may use a linear model to predict the revenue that will be generated for a company depending on how much it spends on advertising (this would be an example of linear regression). Each model contains its own set of parameters that ultimately defines what the model looks like.

For a linear model we can write this as $y = mx + c$. In this example $x$ could represent the advertising spend and $y$ might be the revenue generated; $m$ and $c$ are the parameters of this model. Different values for these parameters will give different lines (see figure below).

So parameters define a blueprint for the model. It is only when specific values are chosen for the parameters that we get an instantiation of the model that describes a given phenomenon.

Calculating the Maximum Likelihood Estimates

Now that we have an intuitive understanding of what maximum likelihood estimation is, we can move on to learning how to calculate the parameter values. The values that we find are called the maximum likelihood estimates (MLE). We'll demonstrate this with an example.

Suppose we have three data points and we assume that they have been generated from a process that is adequately described by a Gaussian distribution. These points are 9, 9.5 and 11. How do we calculate the maximum likelihood estimates of the parameters $\mu$ and $\sigma$ of the Gaussian distribution?

What we want to calculate is the total probability of observing all of the data, i.e. the joint probability distribution of all observed data points. To do this we would need to calculate some conditional probabilities, which can get very difficult. So it is here that we'll make our first assumption: each data point is generated independently of the others. This assumption makes the maths much easier. If the events (i.e. the process that generates the data) are independent, then the total probability of observing all of the data is the product of the probabilities of observing each data point individually (i.e. the product of the marginal probabilities).

The probability density of observing a single data point $x$ that is generated from a Gaussian distribution is given by:

$$P(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

The semicolon used in the notation $P(x; \mu, \sigma)$ is there to emphasize that the symbols that appear after it are parameters of the probability distribution. So it shouldn't be confused with a conditional probability (which is typically represented with a vertical line, e.g. $P(A \mid B)$).

In this example the total (joint) probability density of observing the three data points is given by:

$$P(9, 9.5, 11; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(9 - \mu)^2}{2\sigma^2}\right) \cdot \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(9.5 - \mu)^2}{2\sigma^2}\right) \cdot \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(11 - \mu)^2}{2\sigma^2}\right)$$

We just have to figure out the values of $\mu$ and $\sigma$ that result in the maximum value of the above expression.

The log-likelihood

The above expression for the total probability is actually quite a pain to differentiate, so it is almost always simplified by taking the natural logarithm of the expression. This is absolutely fine because the natural logarithm is a monotonically increasing function. This means that if the value on the x-axis increases, the value on the y-axis also increases (see figure below). This is important because it ensures that the maximum value of the log of the probability occurs at the same point as the maximum of the original probability function. Therefore we can work with the simpler log-likelihood instead of the original likelihood.

Figure: Monotonic behaviour of the original function $y = x$ (left) and the (natural) logarithm function $y = \ln(x)$ (right). These functions are both monotonic because as you go from left to right on the x-axis the y value always increases.
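The claim that the likelihood and the log-likelihood peak at the same parameter value can be checked numerically for the three data points. The following is a minimal sketch, not part of the original paper: it assumes $\sigma$ is held fixed at 1 and searches over a hypothetical grid of candidate values for $\mu$; the function names and the grid are illustrative choices.

```python
import math

# The three observations used in the text.
DATA = [9.0, 9.5, 11.0]

# Assumption for this sketch: sigma is fixed so that only mu is varied.
SIGMA = 1.0

# Hypothetical search grid for mu: 8.000, 8.001, ..., 12.000.
MU_GRID = [8.0 + 0.001 * k for k in range(4001)]


def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def likelihood(mu, sigma, xs=DATA):
    """Joint density of independent observations: product of the marginals."""
    result = 1.0
    for x in xs:
        result *= gaussian_pdf(x, mu, sigma)
    return result


def log_likelihood(mu, sigma, xs=DATA):
    """Natural log of the joint density: sum of the log marginals."""
    return sum(math.log(gaussian_pdf(x, mu, sigma)) for x in xs)


# Grid point that maximizes each criterion.
best_mu_lik = max(MU_GRID, key=lambda mu: likelihood(mu, SIGMA))
best_mu_log = max(MU_GRID, key=lambda mu: log_likelihood(mu, SIGMA))

print("argmax of likelihood:    ", best_mu_lik)
print("argmax of log-likelihood:", best_mu_log)
```

Both searches select the same grid point, which is what the monotonicity of the logarithm guarantees. A grid search is used here only because it makes the comparison transparent; it is not how the estimate is usually computed.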
Figure: Example of a non-monotonic function: as you go from left to right on the graph, the value of $f(x)$ goes up, then goes down and then goes back up again.

Taking logs of the original expression gives us:

$$\ln P(9, 9.5, 11; \mu, \sigma) = \ln\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{(9 - \mu)^2}{2\sigma^2} + \ln\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{(9.5 - \mu)^2}{2\sigma^2} + \ln\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{(11 - \mu)^2}{2\sigma^2}$$

This expression can be simplified again using the laws of logarithms to obtain:

$$\ln P(9, 9.5, 11; \mu, \sigma) = -3\ln(\sigma) - \frac{3}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\left[(9 - \mu)^2 + (9.5 - \mu)^2 + (11 - \mu)^2\right]$$

This expression can be differentiated to find the maximum. In this example we'll find the MLE of the mean, $\mu$. To do this we take the partial derivative of the function with respect to $\mu$, giving

$$\frac{\partial \ln P(9, 9.5, 11; \mu, \sigma)}{\partial \mu} = \frac{1}{\sigma^2}\left[9 + 9.5 + 11 - 3\mu\right].$$

Finally, setting the left-hand side of the equation to zero and then rearranging for $\mu$ gives:

$$\mu = \frac{9 + 9.5 + 11}{3} \approx 9.833$$

And there we have our maximum likelihood estimate for $\mu$.

2) Examples of Maximum Likelihood Estimators

• Test for a Disease. Suppose that you are walking down the street and notice that the Department of Public Health is giving a free medical test for a certain disease. The test is 90 percent reliable in the following sense: if a person has the disease, there is a probability of 0.9 that the test will give a positive response; whereas, if a person does not have the disease, there is a probability of only 0.1 that the test will give a positive response. We shall let $X$ stand for the result of the test, where $X = 1$ means that the test is positive and $X = 0$ means that the test is negative. Let the parameter space be $\Omega = \{0.1, 0.9\}$, where $\theta = 0.1$ means that the person tested does not have the disease, and $\theta = 0.9$ means that the person has the disease. This parameter space was chosen so that, given $\theta$, $X$ has the Bernoulli distribution with parameter $\theta$. The likelihood function is

$$f(x \mid \theta) = \theta^x (1 - \theta)^{1 - x}.$$

If $x = 0$ is observed, then

$$f(0 \mid \theta) = \begin{cases} 0.9 & \text{if } \theta = 0.1, \\ 0.1 & \text{if } \theta = 0.9. \end{cases}$$

Clearly, $\theta = 0.1$ maximizes the likelihood when $x = 0$ is observed.
If $x = 1$ is observed, then

$$f(1 \mid \theta) = \begin{cases} 0.1 & \text{if } \theta = 0.1, \\ 0.9 & \text{if } \theta = 0.9. \end{cases}$$

Clearly, $\theta = 0.9$ maximizes the likelihood when $x = 1$ is observed. Hence, the M.L.E. is

$$\hat{\theta} = \begin{cases} 0.1 & \text{if } X = 0, \\ 0.9 & \text{if } X = 1. \end{cases}$$

• Poisson Distribution. Consider a Poisson distribution with probability mass function

$$f(x \mid \mu) = \frac{e^{-\mu}\mu^{x}}{x!}, \quad x = 0, 1, 2, \dots$$

Suppose that a random sample $x_1, x_2, \dots, x_n$ is taken from the distribution. What is the maximum likelihood estimate of $\mu$?

Solution: The likelihood function is

$$L(x_1, x_2, \dots, x_n; \mu) = \prod_{i=1}^{n} f(x_i \mid \mu) = \frac{e^{-n\mu}\,\mu^{\sum_{i=1}^{n} x_i}}{\prod_{i=1}^{n} x_i!}.$$

Now consider

$$\ln L(x_1, x_2, \dots, x_n; \mu) = -n\mu + \sum_{i=1}^{n} x_i \ln\mu - \ln\prod_{i=1}^{n} x_i!$$

$$\frac{\partial \ln L(x_1, x_2, \dots, x_n; \mu)}{\partial \mu} = -n + \sum_{i=1}^{n} \frac{x_i}{\mu}.$$

Solving for $\hat{\mu}$, the maximum likelihood estimator, involves setting the derivative to zero and solving for the parameter. Thus,

$$\hat{\mu} = \sum_{i=1}^{n} \frac{x_i}{n} = \bar{x}.$$

The second derivative of the log-likelihood function, $-\sum_{i=1}^{n} x_i / \mu^2$, is negative (provided $\sum_{i=1}^{n} x_i > 0$), which implies that the solution above is indeed a maximum. Since $\mu$ is the mean of the Poisson distribution, the sample average would certainly seem like a reasonable estimator.

3) Limitation of Maximum Likelihood Estimation

Despite its intuitive appeal, the method of maximum likelihood is not necessarily appropriate in all problems. For instance, when $X_1, \dots, X_n$ form a random sample from the uniform distribution on the interval $[0, \theta]$, the M.L.E. is $\hat{\theta} = \max\{X_1, \dots, X_n\}$. Since $\max\{X_1, \dots, X_n\} < \theta$ with probability 1, it follows that $\hat{\theta}$ surely underestimates the value of $\theta$. Indeed, if any prior distribution is assigned to $\theta$, then the Bayes estimator of $\theta$ will be greater than $\hat{\theta}$; the amount by which it exceeds $\hat{\theta}$ will, of course, depend on the particular prior distribution that is used and on the observed values of $X_1, \dots, X_n$.

Finally, we shall mention one point concerning the interpretation of the M.L.E. The M.L.E. is the value of $\theta$ that maximizes the conditional p.f. or p.d.f. of the data $\mathbf{X}$ given $\theta$. Therefore, the maximum likelihood estimate is the value of $\theta$ that assigns the highest probability to seeing the observed data. It is not necessarily the value of the parameter that appears to be most likely given the data.
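The difference between "the value of $\theta$ that makes the data most probable" and "the value of $\theta$ that is most probable given the data" can be made concrete with the disease-test example. The sketch below is illustrative only: the prior (a 1% prevalence of the disease) is a hypothetical assumption that does not appear in the paper, and the function names are chosen for this example.

```python
# Parameter space from the disease-test example:
# theta = 0.1 -> person does not have the disease, theta = 0.9 -> person has it.
OMEGA = (0.1, 0.9)

# Hypothetical prior (an assumption for this sketch): 1% of people have the disease.
PRIOR = {0.1: 0.99, 0.9: 0.01}


def likelihood(x, theta):
    """Bernoulli likelihood f(x | theta) = theta^x * (1 - theta)^(1 - x)."""
    return theta ** x * (1 - theta) ** (1 - x)


def mle(x):
    """Value of theta in OMEGA that maximizes the likelihood of the observed x."""
    return max(OMEGA, key=lambda theta: likelihood(x, theta))


def posterior(x, prior):
    """Posterior probabilities over OMEGA given x, computed with Bayes' theorem."""
    joint = {theta: likelihood(x, theta) * prior[theta] for theta in OMEGA}
    total = sum(joint.values())
    return {theta: value / total for theta, value in joint.items()}


x_observed = 1  # a positive test result
theta_hat = mle(x_observed)
post = posterior(x_observed, PRIOR)

print("M.L.E. after a positive test:", theta_hat)
print("Posterior probability of having the disease:", post[0.9])
```

Under this assumed prior, a positive test gives the M.L.E. $\hat{\theta} = 0.9$ ("has the disease"), while the posterior probability of having the disease is only about 0.08. The M.L.E. ignores how rare the disease is because no prior enters its calculation.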
To say how likely different values of the parameter are, one would need a probability distribution for the parameter. Of course, the posterior distribution of the parameter would serve this purpose, but no posterior distribution is involved in the calculation of the M.L.E. Hence, it is not legitimate to interpret the M.L.E. as the most likely value of the parameter after having seen the data.

III. SUMMARY

The maximum likelihood estimate of a parameter $\theta$ is that value of $\theta$ that provides the largest value of the likelihood function $f_n(x \mid \theta)$ for fixed data $x$. If $\delta(x)$ denotes the maximum likelihood estimate, then $\hat{\theta} = \delta(\mathbf{X})$ is the maximum likelihood estimator (M.L.E.). The method of maximum likelihood allows the analyst to make use of knowledge of the distribution in determining an appropriate estimator; it cannot be applied without knowledge of the underlying distribution.