Title: Stata Blogs - An introduction to the lasso in Stata

Source: Stata Blogs - An introduction to the lasso in Stata

Authors: David Drukker, Executive Director of Econometrics, and Di Liu, Senior Econometrician

The least absolute shrinkage and selection operator (lasso) estimates model coefficients and these estimates can be used to select which covariates should be included in a model. The lasso is used for outcome prediction and for inference about causal parameters. In this post, we provide an introduction to the lasso and discuss using the lasso for prediction. In the next post, we discuss using the lasso for inference about causal parameters.

The lasso is most useful when a few out of many potential covariates affect the outcome and it is important to include only the covariates that have an effect. "Few" and "many" are defined relative to the sample size. In the example discussed below, we observe the most recent health-inspection scores for 600 restaurants, and we have 100 covariates that could potentially affect each one's score. We have too many potential covariates because we cannot reliably estimate 100 coefficients from 600 observations. We believe that only about 10 of the covariates are important, and we feel that 10 covariates are "a few" relative to 600 observations.

Given that only a few of the many covariates affect the outcome, the problem is now that we don't know which covariates are important and which are not. The lasso produces estimates of the coefficients and solves this covariate-selection problem.

There are technical terms for our example situation. A model with more covariates than the available sample size allows you to estimate reliably is known as a high-dimensional model. The assumption that the number of coefficients that are nonzero in the true model is small relative to the sample size is known as a sparsity assumption. More realistically, the approximate sparsity assumption requires that the number of nonzero coefficients in the model that best approximates the real world be small relative to the sample size.

In these technical terms, the lasso is most useful when estimating the coefficients in a high-dimensional, approximately sparse, model.

High-dimensional models are nearly ubiquitous in prediction problems and models that use flexible functional forms. In many cases, the many potential covariates are created from polynomials, splines, or other functions of the original covariates. In other cases, the many potential covariates come from administrative data, social media, or other sources that naturally produce huge numbers of potential covariates.

We use a series of examples to make our discussion of the lasso more accessible. These examples use some simulated data from the following problem. A health inspector in a small U.S. city wants to use social-media reviews to predict the health-inspection scores of restaurants. The inspector plans to add surprise inspections to the restaurants with the lowest-predicted health scores, using our predictions.

**hsafety2.dta** has 1 observation for each of 600 restaurants, and the score from the most recent inspection is in **score**. The percentage of a restaurant's social-media reviews that contain a word like "dirty" could predict the inspection score. We identified 50 words, 30 word pairs, and 20 phrases whose occurrence percentages in reviews written in the three months prior to an inspection could predict the inspection score. The occurrence percentages of the 50 words are in **word1** – **word50**. The occurrence percentages of the 30 word pairs are in **wpair1** – **wpair30**. The occurrence percentages of the 20 phrases are in **phrase1** – **phrase20**.

Researchers widely use the following steps to find the best predictor.

1. Divide the sample into training and validation subsamples.
2. Use the training data to estimate the model parameters of each of the competing estimators.
3. Use the validation data to estimate the out-of-sample mean squared error (MSE) of the predictions produced by each competing estimator.
4. The best predictor is the estimator that produces the smallest out-of-sample MSE.
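The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are simulated, the two competing estimators (a mean-only predictor and a one-covariate OLS line) are hypothetical stand-ins, and none of the function names come from Stata.

```python
import random

def split_sample(n_obs, train_share=0.75, seed=12345):
    """Randomly split observation indices into training and validation lists."""
    rng = random.Random(seed)            # a fixed seed makes the split reproducible
    indices = list(range(n_obs))
    rng.shuffle(indices)
    n_train = round(train_share * n_obs)
    return indices[:n_train], indices[n_train:]

def fit_mean(xs, ys):
    """Competing estimator 1: always predict the training-sample mean of y."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Competing estimator 2: OLS of y on one covariate with an intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def mse(predict, xs, ys):
    """Mean squared error of a predictor on the given observations."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Simulated data (hypothetical): y = 1 + 2x + noise.
data_rng = random.Random(1)
x_all = [data_rng.uniform(0, 10) for _ in range(600)]
y_all = [1 + 2 * x + data_rng.gauss(0, 1) for x in x_all]

train, valid = split_sample(len(x_all))                          # step 1
x_train, y_train = [x_all[i] for i in train], [y_all[i] for i in train]
x_valid, y_valid = [x_all[i] for i in valid], [y_all[i] for i in valid]

out_of_sample_mse = {}
for name, fit in {"mean": fit_mean, "line": fit_line}.items():
    predictor = fit(x_train, y_train)                            # step 2
    out_of_sample_mse[name] = mse(predictor, x_valid, y_valid)   # step 3

best = min(out_of_sample_mse, key=out_of_sample_mse.get)         # step 4
print(len(train), len(valid), best)
```

The estimators are fit only on the training indices, so the validation MSE is a genuine out-of-sample measure.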

The ordinary least-squares (OLS) estimator is frequently included as a benchmark estimator when it is feasible. We begin the process by splitting the sample and computing the OLS estimates.

In the output below, we read the data into memory and use **splitsample** with the option **split(.75 .25)** to generate the variable **sample**, which is 1 for 75% of the sample and 2 for the remaining 25% of the sample. The assignment of each observation in **sample** to 1 or 2 is random, but the **rseed** option makes the random assignment reproducible.

`. use hsafety2, clear`

`. splitsample , generate(sample) split(.75 .25) rseed(12345)`

`. label define slabel 1 "Training" 2 "Validation"`

`. label values sample slabel`

`. tabulate sample`

`sample | Freq. Percent Cum.`

`------------+-----------------------------------`

`Training | 450 75.00 75.00`

`Validation | 150 25.00 100.00`

`------------+-----------------------------------`

`Total | 600 100.00`

The one-way tabulation of **sample** produced by **tabulate** verifies that **sample** contains the requested 75%–25% division.

Next, we compute the OLS estimates using the data in the training sample and store the results in memory as **ols**.

`. quietly regress score word1-word50 wpair1-wpair30 phrase1-phrase20 ///`

`if sample==1`

`. estimates store ols`

Now, we use **lassogof** with option **over(sample)** to compute the in-sample (**Training**) and out-of-sample (**Validation**) estimates of the MSE.

`. lassogof ols, over(sample)`

`Penalized coefficients`

`-------------------------------------------------------------`

`Name sample | MSE R-squared Obs`

`------------------------+------------------------------------`

`ols |`

`Training | 24.43515 0.5430 450`

`Validation | 35.53149 0.2997 150`

`-------------------------------------------------------------`

As expected, the estimated MSE is much smaller in the **Training** subsample than in the **Validation** sample. The out-of-sample estimate of the MSE is the more reliable estimator for the prediction error; see, for example, chapters 1, 2, and 3 in Hastie, Tibshirani, and Friedman (2009).

In this section, we introduce the lasso and compare its estimated out-of-sample MSE to the one produced by OLS.

The lasso is an estimator of the coefficients in a model. What makes the lasso special is that some of the coefficient estimates are exactly zero, while others are not. The lasso selects covariates by excluding the covariates whose estimated coefficients are zero and by including the covariates whose estimates are not zero. There are no standard errors for the lasso estimates. The lasso's ability to work as a covariate-selection method makes it a nonstandard estimator and prevents the estimation of standard errors. In the next post, we discuss how to use the lasso for inferential questions.

Tibshirani (1996) derived the lasso, and Hastie, Tibshirani, and Wainwright (2015) provide a textbook introduction.

The remainder of this section provides some details about the mechanics of how the lasso produces its coefficient estimates. There are different versions of the lasso for linear and nonlinear models. Versions of the lasso for linear models, logistic models, and Poisson models are available in Stata 16. We discuss only the lasso for the linear model, but the points we make generalize to the lasso for nonlinear models.

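Like many estimators, the lasso for linear models solves an optimization problem. Specifically, the linear lasso point estimates $\hat{\boldsymbol{\beta}}$ are given by

$$
\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left\{ \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \mathbf{x}_i \boldsymbol{\beta}' \right)^2 + \lambda \sum_{j=1}^{p} \omega_j \left| \beta_j \right| \right\}
$$

where

- $\lambda > 0$ is the lasso penalty parameter,
- $y$ is the outcome variable,
- $\mathbf{x}$ contains the $p$ potential covariates,
- $\boldsymbol{\beta}$ is the vector of coefficients on $\mathbf{x}$,
- $\beta_j$ is the $j$th element of $\boldsymbol{\beta}$,
- the $\omega_j$ are parameter-level weights known as penalty loadings, and
- $n$ is the sample size.

There are two terms in this optimization problem, the least-squares fit measure

$$
\frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \mathbf{x}_i \boldsymbol{\beta}' \right)^2
$$

and the penalty term

$$
\lambda \sum_{j=1}^{p} \omega_j \left| \beta_j \right|
$$

The parameters $\lambda$ and the $\omega_j$ are called tuning parameters. They determine the weight applied to the penalty term. When $\lambda = 0$, the linear lasso reduces to the OLS estimator. As $\lambda$ increases, the magnitudes of all the estimated coefficients are shrunk toward zero, because the cost of each nonzero $\hat{\beta}_j$ rises with the penalty term.

There is a value $\lambda_{\max}$ for which all the estimated coefficients are exactly zero. As $\lambda$ decreases from $\lambda_{\max}$, the number of nonzero coefficient estimates generally increases. For $\lambda \in (0, \lambda_{\max})$, some of the estimated coefficients are exactly zero and some are not. When you use the lasso for covariate selection, covariates with estimated coefficients of zero are excluded, and covariates with nonzero estimated coefficients are included.

That the number of potential covariates $p$ can be greater than the sample size $n$ is a much-discussed advantage of the lasso. It is important to remember that the approximate sparsity assumption still requires that the number of covariates that belong in the model be small relative to $n$.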
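To make the shrinkage mechanics concrete, here is a minimal Python sketch for the simplest possible case: one covariate, no intercept, and an objective of least-squares fit divided by $2n$ plus $\lambda \omega |b|$. In that special case the minimizer has a closed form based on soft-thresholding. The function names and the toy data are illustrative assumptions, not part of Stata's implementation, which handles many covariates numerically.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_objective(b, xs, ys, lam, omega=1.0):
    """Least-squares fit measure plus the lasso penalty for a single coefficient b."""
    n = len(xs)
    fit = sum((y - x * b) ** 2 for x, y in zip(xs, ys)) / (2 * n)
    penalty = lam * omega * abs(b)
    return fit + penalty

def lasso_one_covariate(xs, ys, lam, omega=1.0):
    """Closed-form minimizer of lasso_objective (one covariate, no intercept)."""
    n = len(xs)
    rho = sum(x * y for x, y in zip(xs, ys)) / n    # (1/n) * sum of x_i * y_i
    scale = sum(x * x for x in xs) / n              # (1/n) * sum of x_i squared
    return soft_threshold(rho, lam * omega) / scale

# Toy data (hypothetical).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-3.0, -1.0, 0.5, 1.5, 3.0]

# Smallest penalty at which the single coefficient is exactly zero.
lambda_max = abs(sum(x * y for x, y in zip(xs, ys)) / len(xs))

for lam in (0.0, 0.5 * lambda_max, lambda_max):
    print(lam, lasso_one_covariate(xs, ys, lam))
```

With `lam = 0.0` the result is the OLS slope. As `lam` grows the estimate moves toward zero, and it is exactly zero once `lam` reaches `lambda_max`.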

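The tuning parameters must be selected before using the lasso for prediction or model selection. The most frequent methods used to select the tuning parameters are cross-validation (CV), the adaptive lasso, and plug-in methods. In addition, $\lambda$ is sometimes set by hand in a sensitivity analysis.

CV finds the $\lambda$ that minimizes the out-of-sample MSE of the predictions. The mechanics of CV mimic the split-sample process used above to find the best out-of-sample predictor; the details are given at the end of this post.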

CV is the default method of selecting the tuning parameters in the **lasso** command. In the output below, we use **lasso** to estimate the coefficients in the model for **score**, using the training sample. We specified the option **rseed()** to make our CV results reproducible.

`. lasso linear score word1-word50 wpair1-wpair30 phrase1-phrase20 ///`

`if sample==1, nolog rseed(12345)`

`Lasso linear model No. of obs = 450`

`No. of covariates = 100`

`Selection: Cross-validation No. of CV folds = 10`

`--------------------------------------------------------------------------`

`| No. of Out-of- CV mean`

`| nonzero sample prediction`

`ID | Description lambda coef. R-squared error`

`---------+----------------------------------------------------------------`

`1 | first lambda 3.271123 0 0.0022 53.589`

`25 | lambda before .3507518 22 0.3916 32.53111`

`* 26 | selected lambda .319592 25 0.3917 32.52679`

`27 | lambda after .2912003 26 0.3914 32.53946`

`30 | last lambda .2202824 30 0.3794 33.18254`

`--------------------------------------------------------------------------`

`* lambda selected by cross-validation.`

`. estimates store cv`

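We specified the option **nolog** to suppress the CV log over the candidate values of $\lambda$. The output shows that CV selected a $\lambda$ for which 25 of the 100 covariates have nonzero coefficients. We used **estimates store** to store these results in memory under the name **cv**.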

We use **cvplot** to plot the CV function.

`. cvplot, minmax`

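The CV function appears somewhat flat near the optimal $\lambda$, which implies that nearby values of $\lambda$ would produce similar out-of-sample MSEs. The number of included covariates can vary substantially over this flat part of the CV function. We can investigate that variation with a table of knots, where a knot is a value of $\lambda$ at which a covariate is added to or removed from the set of covariates with nonzero coefficients. We use **lassoknots** to display the table of knots.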

`. lassoknots`

`--------------------------------------------------------------------------`

`| No. of CV mean |`

`| nonzero pred. | Variables (A)dded, (R)emoved,`

`ID | lambda coef. error | or left (U)nchanged`

`-----+-----------------------------+--------------------------------------`

`2 | 2.980526 2 52.2861 | A phrase3 phrase4`

`3 | 2.715744 3 50.48463 | A phrase5`

`4 | 2.474485 4 48.55981 | A word3`

`6 | 2.054361 5 44.51782 | A phrase6`

`9 | 1.554049 6 40.23385 | A wpair3`

`10 | 1.415991 8 39.04494 | A wpair2 phrase2`

`12 | 1.175581 9 36.983 | A word2`

`14 | .9759878 10 35.42697 | A word31`

`16 | .8102822 11 34.2115 | A word19`

`17 | .738299 12 33.75501 | A word4`

`21 | .5088809 14 32.74808 | A word14 phrase7`

`22 | .4636733 17 32.64679 | A word32 wpair19 wpair26`

`23 | .4224818 19 32.56572 | A wpair15 wpair25`

`24 | .3849497 22 32.53301 | A wpair24 phrase13 phrase14`

`* 26 | .319592 25 32.52679 | A word25 word30 phrase8`

`27 | .2912003 26 32.53946 | A wpair11`

`29 | .2417596 27 32.86193 | A wpair17`

`30 | .2202824 30 33.18254 | A word23 word38 wpair4`

`--------------------------------------------------------------------------`

`* lambda selected by cross-validation.`

The CV function is minimized at the $\lambda$ with ID = 26, and the lasso includes 25 covariates at this $\lambda$ value.

CV tends to include extra covariates whose coefficients are zero in the model that best approximates the process that generated the data. This can affect the prediction performance of the CV-based lasso, and it can affect the performance of inferential methods that use a CV-based lasso for model selection. The adaptive lasso is a multistep version of CV. It was designed to exclude some of these extra covariates.

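The first step of the adaptive lasso is CV. The second step does CV among the covariates selected in the first step. In this second step, the penalty loadings are $\omega_j = 1/|\hat{\beta}_j|$, where the $\hat{\beta}_j$ are the penalized estimates from the first step. Covariates with smaller-magnitude coefficients are more likely to be excluded in the second step. See Zou (2006) and Bühlmann and Van de Geer (2011) for more about the adaptive lasso and the tendency of the CV-based lasso to overselect. Also see Chetverikov, Liao, and Chernozhukov (2019) for formal results for the CV-based lasso.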

We specify the option **selection(adaptive)** below to cause **lasso** to use the adaptive lasso instead of CV to select the tuning parameters. We used **estimates store** to store the results under the name **adaptive**.

`. lasso linear score word1-word50 wpair1-wpair30 phrase1-phrase20 ///`

`if sample==1, nolog rseed(12345) selection(adaptive)`

`Lasso linear model No. of obs = 450`

`No. of covariates = 100`

`Selection: Adaptive No. of lasso steps = 2`

`Final adaptive step results`

`--------------------------------------------------------------------------`

`| No. of Out-of- CV mean`

`| nonzero sample prediction`

`ID | Description lambda coef. R-squared error`

`---------+----------------------------------------------------------------`

`31 | first lambda 124.1879 0 0.0037 53.66569`

`77 | lambda before 1.719861 12 0.4238 30.81155`

`* 78 | selected lambda 1.567073 12 0.4239 30.8054`

`79 | lambda after 1.427859 14 0.4237 30.81533`

`128 | last lambda .0149585 22 0.4102 31.53511`

`--------------------------------------------------------------------------`

`* lambda selected by cross-validation in final adaptive step.`

`. estimates store adaptive`

We see that the adaptive lasso included 12 instead of 25 covariates.
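The reason a second lasso step can drop covariates is easiest to see from the penalty loadings. The sketch below assumes the common adaptive-lasso form from Zou (2006), in which a covariate kept in the first step gets the loading $1/|\hat{\beta}_j|$ in the second step. The coefficient values are hypothetical and are not taken from the output above.

```python
def adaptive_penalty_loadings(first_step_coefficients):
    """Map each covariate kept in step one to the loading 1 / |coefficient|."""
    loadings = {}
    for name, coefficient in first_step_coefficients.items():
        if coefficient != 0.0:          # covariates dropped in step one stay excluded
            loadings[name] = 1.0 / abs(coefficient)
    return loadings

# Hypothetical first-step penalized estimates.
first_step = {"word3": 0.80, "phrase4": -0.40, "wpair11": 0.02, "word23": 0.0}

loadings = adaptive_penalty_loadings(first_step)
for name, loading in sorted(loadings.items(), key=lambda item: item[1]):
    print(name, loading)
```

A covariate with a tiny first-step coefficient, such as the hypothetical `wpair11`, gets a very large loading, so the second step penalizes it heavily and is likely to set it to zero.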

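Plug-in methods tend to be even more parsimonious than the adaptive lasso. Plug-in methods find the value of the $\lambda$ that is large enough to dominate the estimation noise. The plug-in method chooses the $\omega_j$ to normalize the scores of the (unpenalized) fit measure for each parameter. Given the normalized scores, it chooses a value for $\lambda$ that is greater than the largest normalized score with a probability that is close to 1.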

The plug-in-based lasso is much faster than the CV-based lasso and the adaptive lasso. In practice, the plug-in-based lasso tends to include the important covariates and it is really good at not including covariates that do not belong in the model that best approximates the data. The plug-in-based lasso has a risk of missing some covariates with large coefficients and finding only some covariates with small coefficients. See Belloni, Chernozhukov, and Wei (2016) and Belloni, et al. (2012) for details and formal results.

We specify the option **selection(plugin)** below to cause **lasso** to use the plug-in method to select the tuning parameters. We used **estimates store** to store the results under the name **plugin**.

`. lasso linear score word1-word50 wpair1-wpair30 phrase1-phrase20 ///`

`if sample==1, selection(plugin)`

`Computing plugin lambda ...`

`Iteration 1: lambda = .1954567 no. of nonzero coef. = 8`

`Iteration 2: lambda = .1954567 no. of nonzero coef. = 9`

`Iteration 3: lambda = .1954567 no. of nonzero coef. = 9`

`Lasso linear model No. of obs = 450`

`No. of covariates = 100`

`Selection: Plugin heteroskedastic`

`--------------------------------------------------------------------------`

`| No. of`

`| nonzero In-sample`

`ID | Description lambda coef. R-squared BIC`

`---------+----------------------------------------------------------------`

`* 1 | selected lambda .1954567 9 0.3524 2933.203`

`--------------------------------------------------------------------------`

`* lambda selected by plugin formula assuming heteroskedastic.`

`. estimates store plugin`

The plug-in-based lasso included 9 of the 100 covariates, which is far fewer than included by the CV-based lasso or the adaptive lasso.

We now have four different predictors for **score**: OLS, CV-based lasso, adaptive lasso, and plug-in-based lasso. The three lasso methods could predict **score** using the penalized coefficients estimated by lasso, or they could predict **score** using the unpenalized coefficients estimated by OLS, including only the covariates selected by lasso. The predictions that use the penalized lasso estimates are known as the lasso predictions and the predictions that use the unpenalized coefficients are known as the postselection predictions, or the postlasso predictions.
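The difference between the two kinds of predictions comes down to which coefficients are used. A minimal Python sketch follows, under a strong simplifying assumption that is not true of the restaurant data: the covariates are orthonormal, so the lasso coefficient for each covariate is just its OLS coefficient soft-thresholded by $\lambda$ (with all penalty loadings equal to 1), and refitting OLS on the selected covariates returns the original OLS coefficients. All names and numbers are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_and_postselection(ols_coefficients, lam):
    """Return (penalized, postselection) coefficients under an orthonormal design."""
    penalized, postselection = {}, {}
    for name, b_ols in ols_coefficients.items():
        shrunk = soft_threshold(b_ols, lam)
        if shrunk != 0.0:                 # the covariate is selected
            penalized[name] = shrunk      # used for the lasso predictions
            postselection[name] = b_ols   # unpenalized refit on selected covariates
    return penalized, postselection

# Hypothetical OLS coefficients on three orthonormal covariates.
ols = {"a": 1.0, "b": -0.6, "c": 0.1}
penalized, postselection = lasso_and_postselection(ols, lam=0.25)
print(penalized)
print(postselection)
```

Both sets of coefficients cover the same selected covariates; the penalized ones are shrunk toward zero, and the postselection ones are not.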

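For linear models, Belloni and Chernozhukov (2013) present conditions in which the postselection predictions perform at least as well as the lasso predictions. Heuristically, one expects the lasso predictions from a CV-based lasso to perform better than the postselection predictions because CV chooses $\lambda$ to make the best lasso predictions. Analogously, one expects the postselection predictions for the plug-in-based lasso to perform better than the lasso predictions because the plug-in tends to select a set of covariates close to those that best approximate the process that generated the data.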

In practice, we estimate the out-of-sample MSE of the predictions for all estimators using both the lasso predictions and the postselection predictions. We select the one that produces the lowest out-of-sample MSE of the predictions.

In the output below, we use **lassogof** to compare the out-of-sample prediction performance of OLS and the lasso predictions from the three lasso methods.

`. lassogof ols cv adaptive plugin if sample==2`

`Penalized coefficients`

`-------------------------------------------------`

`Name | MSE R-squared Obs`

`------------+------------------------------------`

`ols | 35.53149 0.2997 150`

`cv | 27.83779 0.4513 150`

`adaptive | 27.83465 0.4514 150`

`plugin | 32.29911 0.3634 150`

`-------------------------------------------------`

For these data, the lasso predictions using the adaptive lasso performed a little bit better than the lasso predictions from the CV-based lasso.

In the output below, we compare the out-of-sample prediction performance of OLS and the lasso predictions from the three lasso methods using the postselection coefficient estimates.

`. lassogof ols cv adaptive plugin if sample==2, postselection`

`Penalized coefficients`

`-------------------------------------------------`

`Name | MSE R-squared Obs`

`------------+------------------------------------`

`ols | 35.53149 0.2997 150`

`cv | 27.87639 0.4506 150`

`adaptive | 27.79562 0.4522 150`

`plugin | 26.50811 0.4775 150`

`-------------------------------------------------`

It is not surprising that the plug-in-based lasso produces the smallest out-of-sample MSE. The plug-in method tends to select covariates whose postselection estimates do a good job of approximating the data.

The real competition tends to be between the lasso estimates from the best of the penalized lasso predictions and the postselection estimates from the plug-in-based lasso. In this case, the postselection estimates from the plug-in-based lasso produced the better out-of-sample predictions, and we would use these results to predict **score**.
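The selection rule used here, picking the predictor with the smallest out-of-sample MSE across both kinds of predictions, can be written down directly. The Python sketch below simply re-enters the validation-sample MSEs reported in the two tables above; the data structure is an illustrative choice.

```python
# Validation-sample MSEs copied from the two lassogof tables above.
validation_mse = [
    ("ols",      "ols",           35.53149),
    ("cv",       "penalized",     27.83779),
    ("adaptive", "penalized",     27.83465),
    ("plugin",   "penalized",     32.29911),
    ("cv",       "postselection", 27.87639),
    ("adaptive", "postselection", 27.79562),
    ("plugin",   "postselection", 26.50811),
]

# The best predictor is the one with the smallest out-of-sample MSE.
best_name, best_kind, best_mse = min(validation_mse, key=lambda row: row[2])
print(best_name, best_kind, best_mse)
```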

The elastic net extends the lasso by using a more general penalty term. The elastic net was originally motivated as a method that would produce better predictions and model selection when the covariates were highly correlated. See Zou and Hastie (2005) for details.

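The linear elastic net solves

$$
\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left\{ \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \mathbf{x}_i \boldsymbol{\beta}' \right)^2 + \lambda \left[ \alpha \sum_{j=1}^{p} \left| \beta_j \right| + \frac{1-\alpha}{2} \sum_{j=1}^{p} \beta_j^2 \right] \right\}
$$

where $\alpha$ is the elastic-net penalty parameter. Setting $\alpha = 0$ produces ridge regression, and setting $\alpha = 1$ produces the lasso.

The **elasticnet** command selects $\alpha$ and $\lambda$ by CV. The option **alpha()** specifies the candidate values for $\alpha$.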

`. elasticnet linear score word1-word50 ///`

`wpair1-wpair30 phrase1-phrase20 ///`

`if sample==1, alpha(.25 .5 .75) ///`

`nolog rseed(12345)`

`Elastic net linear model No. of obs = 450`

`No. of covariates = 100`

`Selection: Cross-validation No. of CV folds = 10`

`-------------------------------------------------------------------------------`

`| No. of Out-of- CV mean`

`| nonzero sample prediction`

`alpha ID | Description lambda coef. R-squared error`

`---------------+---------------------------------------------------------------`

`0.750 |`

`1 | first lambda 13.08449 0 0.0062 53.79915`

`39 | lambda before .4261227 24 0.3918 32.52101`

`* 40 | selected lambda .3882671 25 0.3922 32.49847`

`41 | lambda after .3537745 27 0.3917 32.52821`

`44 | last lambda .2676175 34 0.3788 33.21631`

`---------------+---------------------------------------------------------------`

`0.500 |`

`45 | first lambda 13.08449 0 0.0062 53.79915`

`84 | last lambda .3882671 34 0.3823 33.02645`

`---------------+---------------------------------------------------------------`

`0.250 |`

`85 | first lambda 13.08449 0 0.0058 53.77755`

`120 | last lambda .5633091 54 0.3759 33.373`

`-------------------------------------------------------------------------------`

`* alpha and lambda selected by cross-validation.`

`. estimates store enet`

We see that the elastic net selected 25 of the 100 covariates.

For comparison, we also use **elasticnet** to perform ridge regression, with the penalty parameter selected by CV.

`. elasticnet linear score word1-word50 ///`

`wpair1-wpair30 phrase1-phrase20 ///`

`if sample==1, alpha(0) nolog rseed(12345)`

`Elastic net linear model No. of obs = 450`

`No. of covariates = 100`

`Selection: Cross-validation No. of CV folds = 10`

`-------------------------------------------------------------------------------`

`| No. of Out-of- CV mean`

`| nonzero sample prediction`

`alpha ID | Description lambda coef. R-squared error`

`---------------+---------------------------------------------------------------`

`0.000 |`

`1 | first lambda 3271.123 100 0.0062 53.79914`

`90 | lambda before .829349 100 0.3617 34.12734`

`* 91 | selected lambda .7556719 100 0.3621 34.1095`

`92 | lambda after .6885401 100 0.3620 34.11367`

`100 | last lambda .3271123 100 0.3480 34.86129`

`-------------------------------------------------------------------------------`

`* alpha and lambda selected by cross-validation.`

`. estimates store ridge`

Ridge regression does not perform model selection and thus includes all the covariates.
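Why ridge keeps every covariate while the lasso does not can be seen in a one-covariate sketch. The Python code below assumes the elastic-net penalty $\lambda[\alpha|b| + \tfrac{1-\alpha}{2} b^2]$ added to a least-squares fit divided by $2n$, with no intercept; in that special case the minimizer has a closed form. The function names and toy data are hypothetical, and this is not how **elasticnet** computes its estimates.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def elastic_net_one_covariate(xs, ys, lam, alpha):
    """Closed-form elastic-net coefficient for one covariate and no intercept."""
    n = len(xs)
    rho = sum(x * y for x, y in zip(xs, ys)) / n    # (1/n) * sum of x_i * y_i
    scale = sum(x * x for x in xs) / n              # (1/n) * sum of x_i squared
    # The L1 part thresholds; the squared part inflates the denominator.
    return soft_threshold(rho, lam * alpha) / (scale + lam * (1.0 - alpha))

# Toy data (hypothetical).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-3.0, -1.0, 0.5, 1.5, 3.0]

lam = 3.0
for alpha in (1.0, 0.5, 0.0):   # lasso, a mixed penalty, ridge
    print(alpha, elastic_net_one_covariate(xs, ys, lam, alpha))
```

At this penalty level the lasso case (`alpha = 1.0`) sets the coefficient exactly to zero, while the ridge case (`alpha = 0.0`) shrinks it but leaves it nonzero.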

We now compare the out-of-sample predictive ability of the CV-based lasso, the elastic net, ridge regression, and the plug-in-based lasso using the lasso predictions. (For elastic net and ridge regression, the "lasso predictions" are made using the coefficient estimates produced by the penalized estimator.)

`. lassogof cv adaptive enet ridge plugin if sample==2`

`Penalized coefficients`

`-------------------------------------------------`

`Name | MSE R-squared Obs`

`------------+------------------------------------`

`cv | 27.83779 0.4513 150`

`adaptive | 27.83465 0.4514 150`

`enet | 27.77314 0.4526 150`

`ridge | 29.47745 0.4190 150`

`plugin | 32.29911 0.3634 150`

`-------------------------------------------------`

In this case, the penalized elastic-net coefficient estimates predict best out of sample among the lasso estimates. The postselection predictions produced by the plug-in-based lasso perform best overall. This can be seen by comparing the above output with the output below.

`. lassogof cv adaptive enet plugin if sample==2, postselection`

`Penalized coefficients`

`-------------------------------------------------`

`Name | MSE R-squared Obs`

`------------+------------------------------------`

`cv | 27.87639 0.4506 150`

`adaptive | 27.79562 0.4522 150`

`enet | 27.87639 0.4506 150`

`plugin | 26.50811 0.4775 150`

`-------------------------------------------------`

So we would use these postselection coefficient estimates from the plug-in-based lasso to predict **score**.

Sensitivity analysis is sometimes performed to see if a small change in the tuning parameters leads to a large change in the prediction performance. When looking at the output of **lassoknots** produced by the CV-based lasso, we noted that for a small increase in the CV function produced by the penalized estimates, there could be a significant reduction in the number of selected covariates. Restoring the **cv** estimates and repeating the **lassoknots** output, we see that

`. estimates restore cv`

`(results cv are active now)`

`. lassoknots`

`-------------------------------------------------------------------------------------`

`| No. of CV mean |`

`| nonzero pred. | Variables (A)dded, (R)emoved,`

`ID | lambda coef. error | or left (U)nchanged`

`-------+-------------------------------+---------------------------------------------`

`2 | 2.980526 2 52.2861 | A phrase3 phrase4`

`3 | 2.715744 3 50.48463 | A phrase5`

`4 | 2.474485 4 48.55981 | A word3`

`6 | 2.054361 5 44.51782 | A phrase6`

`9 | 1.554049 6 40.23385 | A wpair3`

`10 | 1.415991 8 39.04494 | A wpair2 phrase2`

`12 | 1.175581 9 36.983 | A word2`

`14 | .9759878 10 35.42697 | A word31`

`16 | .8102822 11 34.2115 | A word19`

`17 | .738299 12 33.75501 | A word4`

`21 | .5088809 14 32.74808 | A word14 phrase7`

`22 | .4636733 17 32.64679 | A word32 wpair19 wpair26`

`23 | .4224818 19 32.56572 | A wpair15 wpair25`

`24 | .3849497 22 32.53301 | A wpair24 phrase13 phrase14`

`* 26 | .319592 25 32.52679 | A word25 word30 phrase8`

`27 | .2912003 26 32.53946 | A wpair11`

`29 | .2417596 27 32.86193 | A wpair17`

`30 | .2202824 30 33.18254 | A word23 word38 wpair4`

`-------------------------------------------------------------------------------------`

`* lambda selected by cross-validation.`

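**lasso** selected the $\lambda$ with ID = 26 and 25 covariates. We now use **lassoselect** to specify that the $\lambda$ with ID = 21 be the selected $\lambda$, and we store the results under the name **hand**.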

`. lassoselect id = 21`

`ID = 21 lambda = .5088809 selected`

`. estimates store hand`
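One way to make a hand-picked $\lambda$ less arbitrary is to look for the sparsest knot whose CV mean prediction error is within some tolerance of the minimum. The Python sketch below applies that idea to the knot table shown above. The 1% tolerance is a hypothetical choice for illustration, not a rule from the Stata documentation.

```python
from collections import namedtuple

Knot = namedtuple("Knot", "id lam nonzero cv_error")

# Rows copied from the lassoknots table above.
knots = [
    Knot(2, 2.980526, 2, 52.2861),   Knot(3, 2.715744, 3, 50.48463),
    Knot(4, 2.474485, 4, 48.55981),  Knot(6, 2.054361, 5, 44.51782),
    Knot(9, 1.554049, 6, 40.23385),  Knot(10, 1.415991, 8, 39.04494),
    Knot(12, 1.175581, 9, 36.983),   Knot(14, .9759878, 10, 35.42697),
    Knot(16, .8102822, 11, 34.2115), Knot(17, .738299, 12, 33.75501),
    Knot(21, .5088809, 14, 32.74808), Knot(22, .4636733, 17, 32.64679),
    Knot(23, .4224818, 19, 32.56572), Knot(24, .3849497, 22, 32.53301),
    Knot(26, .319592, 25, 32.52679),  Knot(27, .2912003, 26, 32.53946),
    Knot(29, .2417596, 27, 32.86193), Knot(30, .2202824, 30, 33.18254),
]

def sparsest_within_tolerance(knots, tolerance=0.01):
    """Return the knot with the fewest nonzero coefficients whose CV error
    is no more than (1 + tolerance) times the smallest CV error."""
    smallest = min(k.cv_error for k in knots)
    eligible = [k for k in knots if k.cv_error <= (1 + tolerance) * smallest]
    return min(eligible, key=lambda k: k.nonzero)

choice = sparsest_within_tolerance(knots)
print(choice.id, choice.nonzero, choice.cv_error)
```

With a 1% tolerance, this rule picks the knot with ID 21 and 14 nonzero coefficients, the same $\lambda$ that was selected by hand above.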

We now compute the out-of-sample MSE produced by the postselection estimates of the lasso whose $\lambda$ has ID = 21 and compare it with that of the plug-in-based lasso.

`. lassogof hand plugin if sample==2, postselection`

`Penalized coefficients`

`-------------------------------------------------`

`Name | MSE R-squared Obs`

`------------+------------------------------------`

`hand | 27.71925 0.4537 150`

`plugin | 26.50811 0.4775 150`

`-------------------------------------------------`

This post has presented an introduction to the lasso and to the elastic net, and it has illustrated how to use them for prediction. There is much more information available in the Stata 16 LASSO manual. The next post will discuss using the lasso for inference about causal parameters.

## References

Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen. 2012. Sparse models and methods for optimal instruments with an application to eminent domain. *Econometrica* 80: 2369–2429.

Belloni, A., and V. Chernozhukov. 2013. Least squares after model selection in high-dimensional sparse models. *Bernoulli* 19: 521–547.

Belloni, A., V. Chernozhukov, and Y. Wei. 2016. Post-selection inference for generalized linear models with many controls. *Journal of Business & Economic Statistics* 34: 606–619.

Bühlmann, P., and S. Van de Geer. 2011. *Statistics for High-Dimensional Data: Methods, Theory and Applications*. Berlin: Springer.

Chetverikov, D., Z. Liao, and V. Chernozhukov. 2019. On cross-validated Lasso. arXiv Working Paper No. arXiv:1605.02214. http://arxiv.org/abs/1605.02214.

Hastie, T., R. Tibshirani, and J. Friedman. 2009. *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*. 2nd ed. New York: Springer.

Hastie, T., R. Tibshirani, and M. Wainwright. 2015. *Statistical Learning with Sparsity: The Lasso and Generalizations*. Boca Raton, FL: CRC Press.

Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. *Journal of the Royal Statistical Society, Series B* 58: 267–288.

Zou, H. 2006. The adaptive Lasso and its oracle properties. *Journal of the American Statistical Association* 101: 1418–1429.

Zou, H., and T. Hastie. 2005. Regularization and variable selection via the elastic net. *Journal of the Royal Statistical Society, Series B* 67: 301–320.

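## Appendix: K-fold cross-validation

Cross-validation finds the value for $\lambda$ in a grid of candidate values $\{\lambda_1, \lambda_2, \dots, \lambda_Q\}$ that minimizes the MSE of the out-of-sample predictions. Cross-validation sets the $\omega_j$ to 1 or to user-specified values.

After you specify the grid, the sample is partitioned into $K$ nonoverlapping subsets. For each grid value $\lambda_q$, predict the out-of-sample squared errors using the following steps.

- For each $k \in \{1, 2, \dots, K\}$,
  - using the data not in partition $k$, estimate the penalized coefficients $\hat{\boldsymbol{\beta}}$ with $\lambda = \lambda_q$;
  - using the data in partition $k$, predict the out-of-sample squared errors.

The mean of these out-of-sample squared errors estimates the out-of-sample MSE of the predictions. The cross-validation function traces the values of these out-of-sample MSEs over the grid of candidate values for $\lambda$. The $\lambda_q$ that produces the smallest estimated out-of-sample MSE minimizes the cross-validation function, and it is selected.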
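The cross-validation procedure described here can be sketched in Python for a deliberately tiny problem: one covariate, no intercept, and a closed-form lasso fit by soft-thresholding. The simulated data, the grid of penalty values, and all function names are hypothetical; Stata's **lasso** command applies the same logic to many covariates.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_lasso(xs, ys, lam):
    """One-covariate lasso coefficient (no intercept, penalty loading of 1)."""
    n = len(xs)
    rho = sum(x * y for x, y in zip(xs, ys)) / n
    scale = sum(x * x for x in xs) / n
    return soft_threshold(rho, lam) / scale

def cross_validate(xs, ys, lambda_grid, n_folds=10, seed=12345):
    """Return (selected lambda, CV function evaluated over the grid)."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    rng.shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]  # K nonoverlapping subsets
    cv_function = []
    for lam in lambda_grid:
        squared_errors = []
        for held_out in folds:
            held_out_set = set(held_out)
            # Estimate the coefficient with the data not in this partition.
            train_x = [xs[i] for i in indices if i not in held_out_set]
            train_y = [ys[i] for i in indices if i not in held_out_set]
            b = fit_lasso(train_x, train_y, lam)
            # Predict the out-of-sample squared errors in this partition.
            squared_errors.extend((ys[i] - xs[i] * b) ** 2 for i in held_out)
        cv_function.append(sum(squared_errors) / len(squared_errors))
    best = min(range(len(lambda_grid)), key=cv_function.__getitem__)
    return lambda_grid[best], cv_function

# Simulated data (hypothetical): y = 0.5x + noise.
data_rng = random.Random(2)
xs = [data_rng.gauss(0, 1) for _ in range(200)]
ys = [0.5 * x + data_rng.gauss(0, 1) for x in xs]

lambda_grid = [1.0, 0.5, 0.25, 0.1, 0.05, 0.0]
selected, cv_function = cross_validate(xs, ys, lambda_grid)
print(selected)
for lam, value in zip(lambda_grid, cv_function):
    print(lam, value)
```

Each observation is held out exactly once, so the value stored for each grid point is the mean of the out-of-sample squared errors over the whole sample.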

Posts by David Drukker, Executive Director of Econometrics and Di Liu, Senior Econometrician: *https://blog.stata.com/author/drukker-liu/*
