温馨提示: 定期 清理浏览器缓存,可以获得最佳浏览体验。
编译: 刘亮(三峡大学) 邮件: jetinskyliu0@gmail.com
Sources:
编者按: 这篇推文在编译上述三篇 Stata Blogs 的基础上,进一步做了针对中国的数据分析和可视化处理。主要不目的不是实时呈现疫情变动 (已经有很多优秀的项目在做这件事情了),而是为了演示 Stata 中的一些常用数据分析和绘图命令。
目录
本文介绍 Chuck Huber 博士编写的 Covid19 Stata 命令,并将约翰`霍普金斯大学公布的全球新型冠状病毒数据为例,依次介绍如何使用 Stata 命令进行数据获取、清洗、保存以及数据可视化。
该命令能生成一个包含日期、确诊病例、新增病例、死亡人数,以及康复人数的表格。为了能够查看具体国家病毒感染人数,作者添加了 country() 和 graph 选项,以便画出确诊人数图, 添加保存选项 saving, 以便将数据保存为 covid19_usa.dta。
其基本命令为:
covid19, country("US") graph saving(covid19_usa)
> Confirmed cases of COVID-19 in the United States
GitHub 是一个流行的软件开发和发布平台。约翰霍普金斯大学系统科学与工程中心(School of Engineering Center for Systems Science and Engineering)拥有一个包括来自世界各地的定期更新的 COVID-19GitHub 数据存储库。
其原始数据网址为:https://github.com/CSSEGISandData/COVID19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports
打开网站后得到如下界面,可以看到部分数据如下图所示:
每天的相关数据存储在独立的.csv 文件中。以 2020 年 1 月 29 日的数据为例,查看相关内容。
点击"Raw"按键可以看到原始的数据,原始数据如下所示:
我们能够将这些原始数据(Raw data)使用 import delimited 命令将其导入到 Stata 中,首先以 2020 年 1 月 29 日为例。Stata 的基本命令为:import delimited https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-29-2020.csv
导入数据后先输入describe
命令,结果显示如下:
输入list in 1/5
查看前 5 个观测值。
我们看到变量 provincestate 在一些文件中被当成变量 ïprovincestate。由于导入的文件很多,我们避免手动逐个查看。我们使用 confirm 命令去逐个查看每个文件中的 ïprovincestate。其相关命令如下:
confirm variable ïprovincestate
display_rc
输出结果为:0
如果变量 ïprovincestate 存在,command 命令会为_rc 存一个"0"值,从而我们可以使用 rename 和 label 命令修改变量名和变量标签。相关命令如下:
if _rc == 0{
rename ïprovincestate provincestate
label variable provincestate "Province/State"
}
执行完了 if 命令块后,使用describe provincestate
命令,可以看到现在的数据中包含的变量为 provincestate 的相关信息。
为了检查数据是否还存在 ïprovincestate 变量,再使用comfirm variable ïprovincestate
其结果如下:
可以发现原始数据中的变量已经被我们清理完成了。
宏可以分为局部宏和全局宏。全局宏(global macro)一旦定义将贯穿整个 Stata 程序,局部宏(local macro)仅存在被定义的程序或 do 文件中。我们做一个示例定义并引用一个局部宏 today,如下所示:
local today = "3-19-2020"
我们可以通过在宏名称使用左右引号进行引用。左引号(在"esc"键下方)和右引号(在"回车"键的左侧)来引用宏。命令如下:
local today = "3-19-2020"
display "`today'"
3-19-2020
我们可以通过整合其他宏来创建宏,例如:
local month = "3"
local day = "19"
local year = "2020"
local today = "`month'-`day'-`year'"
display "`today'"
3-19-2020
我们想导入名为 03-19-2020.csv 的文件。注意月份包含一个"前置 0"。1 至 9 月被指定为:"01","02"往后依次类推。10 至 12 月指定为"10","11","12"。我们需要一种方法值赋时给月份添加适当的包含前导零的局部宏。String()函数能够满足这一要求。相关代码如下所示:
local month = string(3, "%02.0f")
local day = string(19, "%02.0f")
local year = "2020"
local today = "`month'-`day'-`year'"
display "`today'"
03-19-2020
接下来将前面的内容进行整合,使用宏来导入文件。命令为:
local URL = "https://raw.githubusercontent.com/CSSEGISandData/ > COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/"
local FileName = "`URL'`today'.csv"
import delimited "`FileName'"
(8 vars,284 obs)
连享会主页: lianxh.cn
作者思路非常清晰,首先介绍简单的月循环,将每个月展示出来,然后介绍嵌套循环,将年和月共同展示。
作者调用宏将月进行循环,代码如下:
forvalues month = 1/12{
display "month = `month'"
}
回归结果如下:
继续将月和日共同循环,相关代码如下:
forvalues month = 1/12{
forvalues day = 1/31{
display "month = `month', day = `day'"
}
}
结果展示如下:
有了前文的介绍,继续运用暂元和循环的相关知识,展现 2020 年所有的日期。相关代码如下:
forvalues month = 1/12{
forvalues day = 1/31{
local month = string(`month', "%02.0f")
local day = string(`day', "%02.0f")
local year = "2020"
local today = "`month'-`day'-`year'"
display "`today'"
}
}
结果展示:
接下来作者使用关于循环和宏的命令将网站的所有数据均导入到 Stata 中,但在执行命令的过程中需要注意 2 点:
疫情的数据从 1 月 23 号开始公布,在此之前并没有数据,如果不加以提示,会出现报错,程序停止运行。
原始数据中有的文件的变量名 ïprovincestate,有的是 provincestate,全部统一修改为 provincestate。为了避免程序出现错误,使用 capture 命令,即使程序出现错误也可以继续执行。
本部分命令如下:
local URL = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/"
forvalues month = 1/12 {
forvalues day = 1/31 {
local month = string(`month', "%02.0f")
local day = string(`day', "%02.0f")
local year = "2020"
local today = "`month'-`day'-`year'"
local FileName = "`URL'`today'.csv"
clear
capture import delimited "`FileName'"
capture save "`today'"
}
}
输入ls
命令 可以查看 Stata 中下载到的数据。
打开 01-22-2020.dta 文件,并且输入describe
查看数据。相关结果如下:
我们看到其中的文件包括 ïprovincestate 变量名,需要将其全部统一为provincestate,代码如下:
local URL = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/"
forvalues month = 1/12 {
forvalues day = 1/31 {
local month = string(`month', "%02.0f")
local day = string(`day', "%02.0f")
local year = "2020"
local today = "`month'-`day'-`year'"
local FileName = "`URL'`today'.csv"
clear
capture import delimited "`FileName'"
capture confirm variable ïprovincestate
if _rc == 0 {
rename ïprovincestate provincestate
label variable provincestate "Province/State"
}
capture save "`today'", replace //需要添加replace,替换掉之间存在内存中的数据
}
}
现在打开 01-22-2020.dta 数据,并且describe
结果如下:
继续使用循环命令将所有的数据依次添加并保存,代码如下:
clear
forvalues month = 1/12 {
forvalues day = 1/31 {
local month = string(`month', "%02.0f")
local day = string(`day', "%02.0f")
local year = "2020"
local today = "`month'-`day'-`year'"
capture append using "`today'"
}
}
最后使用describe
描述最后生成的数据。
随着疫情的不断发展,相关的数据不断更新,因此本文的相关数据和变量会产生不一致。例如provoncestate,province_state,countryregion,country_region等。当我们追加数据时必须要保证变量名称一致,接下来继续使用 capture。其完整代码如下:
//收集数据
local URL = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/"
forvalues month = 1/12 {
forvalues day = 1/31 {
local month = string(`month', "%02.0f")
local day = string(`day', "%02.0f")
local year = "2020"
local today = "`month'-`day'-`year'"
local FileName = "`URL'`today'.csv"
clear
capture import delimited "`FileName'"
capture confirm variable ïprovincestate
if _rc == 0 {
rename ïprovincestate provincestate
label variable provincestate "Province/State"
}
capture rename province_state provincestate
capture rename country_region countryregion
capture rename last_update lastupdate
capture rename lat latitude
capture rename long longitude
capture save "`today'", replace
}
}
clear
forvalues month = 1/12 {
forvalues day = 1/31 {
local month = string(`month', "%02.0f")
local day = string(`day', "%02.0f")
local year = "2020"
local today = "`month'-`day'-`year'"
capture append using "`today'"
}
}
//合并数据
clear
forvalues month = 1/12 {
forvalues day = 1/31 {
local month = string(`month', "%02.0f")
local day = string(`day', "%02.0f")
local year = "2020"
local today = "`month'-`day'-`year'"
capture append using "`today'"
}
}
describe
数据可以看到数据结果为:
更新数据已经处理完毕,所有变量均统一,最后保存,为以后所用。
save covid19_raw
本文收集到的疫情数据截止到 2020 年 4 月 2 日,随着疫情的不断发展,相关数据可以到https://github.com/CSSEGISandData/COVID19/tree/master/csse_covid_19_data/csse_covid_19_daily_reportszi自行下载。
湖北省是国内疫情重灾区,我们心系湖北,所以先画出湖北省现存确诊人数。本部分代码借鉴 TidyFriday 公众号《实时疫情与 Stata 地图绘制 》一文,相关代码如下:
use China_covid_0409.dta,clear
gen current_confirmed = confirmed - deaths - recovered //计算现存确诊人数
replace current_confirmed = 0 if current_confirmed < 0
keep province date current_confirmed
drop if missing(current_confirmed)
replace province = subinstr(province, " ", "_", .)
spread province current_confirmed //生成面板
tw conn Hubei date, xtitle("日期", size(*1.0)) ///
ytitle("现存确诊人数", size(*0.8)) ///
title("湖北省现存确诊人数变化)", size(*1.3)) ///
subtitle("") caption("数据来源:约翰·霍普金斯大学", size(*0.8))
graph save hubei.png, replace
我们同时也关注湖北省周边省市的疫情情况,同时画出周边四省的现存确诊人数,从上往下依次为河南、安徽、江西、湖南。相关代码如下:
tw conn Henan Anhui Jiangxi Hunan date, xtitle("日期", size(*0.8)) ///
ytitle("现存确诊人数", size(*0.8)) ///
title("湖北周边四省现存确诊人数变化", size(*1.1)) ///
subtitle("河南、安徽、江西和湖南四省") caption("数据来源:约翰·霍普金斯大学", size(*0.8)) xla(21937(16)22007) ///
xsc(range(21937 22017) extend) ///
leg(order(1 "河南" 2 "安徽" 3 "江西" 4 "湖南") pos(1) ring(0)) ///
lc("100 200 165" "252 141 98" "141 160 203" "231 138 195") ///
mc("102 194 165" "252 141 98" "141 160 203" "231 138 195")
graph save Fourprovince.png, replace
接下来我们观察累计治愈人数,累计死亡病例数、每日新增治愈人数、每日确诊人数。相关代码如下:
use China_covid_0409.dta,clear
collapse (sum) confirmed (sum) deaths (sum) recovered, by(date)
tsset date
gen new_confirmed = confirmed - l.confirmed
tw ///
lpolyci new_confirmed date, bw(3) || ///
sc new_confirmed date, ms(circle) ||, ///
leg(off) xlab(, format(%tdCY-N-D)) xla(21937(16)22013) ///
xsc(range(21937 22013) extend) ///
ti("每日新增确诊数", size(*1.2)) ///
yti("人数") name(a, replace) nodraw
gen current_confirmed = confirmed - deaths - recovered
replace current_confirmed = 0 if current_confirmed < 0
tw ///
lpolyci current_confirmed date, bw(3) || ///
sc current_confirmed date, ms(circle) ||, ///
leg(off) xlab(, format(%tdCY-N-D)) xla(21937(16)22013) ///
xsc(range(21937 22013) extend) ///
ti("现存确诊数", size(*1.2)) ///
yti("人数") name(b, replace) nodraw
graph combine a b, caption("数据来源:约翰霍普金斯大学", size(*0.8)) xsize(20) ysize(12) ///
graphr(margin(medlarge))
graph export ab.png, replace
* 死亡和治愈病例数量
use China_covid_0409.dta,clear
collapse (sum) confirmed (sum) deaths (sum) recovered, by(date)
tsset date
tw ///
lpolyci recovered date, bw(3) || ///
sc recovered date, ms(circle) ||, ///
leg(off) xlab(, format(%tdCY-N-D)) xla(21937(16)22013) ///
xsc(range(21937 22013) extend) ///
ti("累计治愈人数", size(*1.2)) ///
yti("人数") name(a, replace) nodraw
tw ///
lpolyci deaths date, bw(3) || ///
sc deaths date, ms(circle) ||, ///
leg(off) xlab(, format(%tdCY-N-D)) xla(21937(16)22013) ///
xsc(range(21937 22013) extend) ///
ti("累计死亡病例数", size(*1.2)) ///
yti("人数") name(b, replace) nodraw
* 每日新增治愈
gen new_recovered = recovered - l.recovered
tw ///
lpolyci new_recovered date, bw(3) || ///
sc new_recovered date, ms(circle) ||, ///
leg(off) xlab(, format(%tdCY-N-D)) xla(21937(16)22013) ///
xsc(range(21937 22013) extend) ///
ti("每日新增治愈人数", size(*1.2)) ///
yti("人数") name(c, replace) nodraw
graph combine a b, caption("数据来源:约翰霍普金斯大学", size(*0.8)) xsize(20) ysize(12) ///
graphr(margin(medlarge))
* 每日新增死亡
gen new_deaths = deaths - l.deaths
tw ///
lpolyci new_deaths date, bw(3) || ///
sc new_deaths date, ms(circle) ||, ///
leg(off) xlab(, format(%tdCY-N-D)) xla(21937(16)22013) ///
xsc(range(21937 22013) extend) ///
ti("每日新增", size(*1.2)) ///
yti("人数") name(d, replace) nodraw
graph combine a b c d, r(2) caption("数据来源:约翰霍普金斯大学", size(*0.8)) xsize(20) ysize(12) ///
graphr(margin(medlarge))
graph save fourtype.png, replace
接下来我们分析治愈人数,死亡人数,现存确诊的动态图。
use China_covid_0409.dta, clear
collapse (sum) confirmed (sum) deaths (sum) recovered, by(date)
* 计算累积量
gen base = 0
gen deaths_recovered = deaths + recovered
* 绘图
tw ///
rarea recovered base date, fc("180 209 132") ///
c("180 209 132") fintensity(inten80) lw(vvthin) || ///
rarea deaths_recovered recovered date, fc("122 118 123") ///
c("122 118 123") fintensity(inten80) lw(vvthin) || ///
rarea confirmed deaths_recovered date, fc("95 144 161") ///
c("95 144 161") fintensity(inten80) lw(vvthin) ||, ///
leg(order(3 "现存确诊" 2 "死亡" 1 "治愈") pos(11) ring(0)) ///
title("新冠肺炎疫情在中国的发展状况", size(*1.5) ///
justification(left) bexpand) ///
subtitle("中国新冠肺炎的确诊人数、治愈人数和死亡人数的发展趋势。", ///
justification(left) bexpand) ///
caption("数据来源:约翰霍普金斯大学", ///
size(*0.8)ring(0)) ///
xtitle("") xla(21937(8)22013) ///
xsc(range(21937 22013) extend) ///
graphr(margin(medlarge)) ///
text(20000 21984 "治愈", color(black) size(*1.2)) ///
text(63000 21983 "死亡", color(black) size(*1.2)) ///
text(70000 21978 "现存确诊", color(black) size(*1.2)) || ///
pci 80000 21957 0 21957, lp(dash) text(56000 21952 "我国改变了" "确诊的标准", color(black) size(*0.8) justification(right) bexpand)
graph save dynamic.png,replace
新冠肺炎在全球蔓延,约翰霍普金斯大学同时收集了各个国家的疫情数据。本部分内容将绘制新冠肺炎在世界蔓延情况。
首先绘制截止到 4 月 8 号全球疫情最严重的 12 个国家。本部分代码借鉴 TidyFriday 公众号《实时疫情与 Stata 地图绘制 》一文
use global_covid_0403.dta,replace
* 全球疫情最严重的12个国家现存确诊人数
use global_covid_0403.dta,clear
gen current_confirmed = confirmed - deaths - recovered
gsort - current_confirmed
sencode country,gen(country_id)
drop country
rename country_id country
keep in 1/12
twoway bar current_confirmed country,sort horizontal barwidth(0.5) ///
fcolor(red) ylabel(1(1)12,valuelabel angle(0) labsize(*1.2)) ///
ytitle("") xtitle("现存确诊人数") ysize(2) xsize(4) title(截止2020年4月2日确诊人数) ///
caption("数据来源:约翰霍普金斯大学")
graph save global_12.png,replace
接下来绘制世界疫情图,本文部分代码借鉴 TidyFriday 公众号《使用 Stata 分析全球的新冠疫情数据 TidyFriday》
use global_covid_0409.dta,clear
* 修改国家
replace country = "United States of America" if country == "US"
replace country = "The Bahamas" if country == "Bahamas"
replace country = "Republic of the Congo" if country == "Congo (Brazzaville)"
replace country = "Democratic Republic of the Congo" if country == "Congo (Kinshasa)"
replace country = "Ivory Coast" if country == "Cote d'Ivoire"
replace country = "Czech Republic" if country == "Czechia"
replace country = "Guinea Bissau" if country == "Guinea-Bissau"
replace country = "South Korea" if country == "Korea, South"
replace country = "Macedonia" if country == "North Macedonia"
replace country = "Republic of Serbia" if country == "Serbia"
replace country = "United Republic of Tanzania" if country == "Tanzania"
replace country = "East Timor" if country == "Timor-Leste"
rename country name
merge 1:1 name using worlddb
* 绘制地图
replace confirmed = 0 if missing(confirmed)
xtset, clear
grmap confirmed using worldcoord, id(ID) ///
fcolor("252 255 164" "250 193 39" "245 125 21" "212 72 66" "159 42 99" "101 21 110" "40 11 84") ///
ocolor("white" ...) ///
clmethod(custom) clbreaks(0 0.9 9 99 999 9999 99999 999990) ///
title("新型冠状病毒肺炎疫情分布", size(*1.2) color(white)) ///
graphr(margin(medium)) ///
subtitle("截止 2020 年 4 月 8 日", color(white)) ///
osize(vthin ...) ///
legend(size(*1.5) ///
order(2 "无" 3 "1~9 人" 4 "10~99 人" 5 "100~999 人" 6 "1000~9999人" 7 "10000~100000 人" 8 ">= 100000 人") ///
ti(确诊, size(*0.5) pos(11) color(white)) color(white)) ///
plotr(fcolor(24 24 24) lcolor(24 24 24)) ///
graphr(fcolor(24 24 24) lcolor(24 24 24)) ///
xsize(20) ysize(8.510638298) ///
caption("数据来源:约翰·霍普金斯大学", size(*0.8) color(white))
graph save global_confirmed.png,replace
新冠疫情备受全球瞩目,在 github 上同样发现了很多有趣的项目,感兴趣的小伙伴可以自行浏览
https://github.com/emanuele-guidotti/COVID19
全文代码如下:
cd "D:\stata15\ado\personal\Stata_Blog\4.9"
****************************
*******收集数据*************
****************************
//收集数据//合并数据//保存数据
* 收集数据
local URL = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/"
forvalues month = 1/12 {
forvalues day = 1/31 {
local month = string(`month', "%02.0f")
local day = string(`day', "%02.0f")
local year = "2020"
local today = "`month'-`day'-`year'"
local FileName = "`URL'`today'.csv"
clear
capture import delimited "`FileName'"
capture confirm variable ïprovincestate
if _rc == 0 {
rename ïprovincestate provincestate
label variable provincestate "Province/State"
}
capture rename province_state provincestate
capture rename country_region countryregion
capture rename last_update lastupdate
capture rename lat latitude
capture rename long longitude
capture save "`today'", replace
}
}
clear
forvalues month = 1/12 {
forvalues day = 1/31 {
local month = string(`month', "%02.0f")
local day = string(`day', "%02.0f")
local year = "2020"
local today = "`month'-`day'-`year'"
capture append using "`today'"
}
}
* 合并数据
clear
forvalues month = 1/12 {
forvalues day = 1/31 {
local month = string(`month', "%02.0f")
local day = string(`day', "%02.0f")
local year = "2020"
local today = "`month'-`day'-`year'"
capture append using "`today'"
}
}
* 保存数据
save covid_0409_chen.dta,replace
*保存主要变量
use covid_0409_chen.dta,clear
keep provincestate countryregion lastupdate confirmed deaths recovered
* 缩小字符串
format provincestate %15s
format countryregion %15s
* 将时间变量lastupdate分成10个类型依次处理
gen lastupdate_01 = lastupdate in 1/38
gen lastupdate_02 = lastupdate in 39/431
gen lastupdate_03 = lastupdate in 432/560
gen lastupdate_04 = lastupdate in 561/7617
gen lastupdate_05 = lastupdate in 7618/11034
gen lastupdate_06 = lastupdate in 11035/31566
gen lastupdate_07 = lastupdate in 31567/38439
gen lastupdate_08 = lastupdate in 38440/53995
gen lastupdate_09 = lastupdate in 53996/56804
gen lastupdate_10 = lastupdate in 56805/62544
rename lastupdate_01 type1
rename lastupdate_02 type2
rename lastupdate_03 type3
rename lastupdate_04 type4
rename lastupdate_05 type5
rename lastupdate_06 type6
rename lastupdate_07 type7
rename lastupdate_08 type8
rename lastupdate_09 type9
rename lastupdate_10 type10
//处理type1
gen y_01 = real(substr(type1,6,4))
gen m_01 = real(substr(type1,1,1))
gen d_01 = real(substr(type1,3,2))
gen new_type1 = string(y_01) + "-" + "0" + string(m_01) + "-" + string(d_01)
//处理type2
gen y_02 = real(substr(type2,6,2))
gen m_02 = real(substr(type2,1,1))
gen d_02 = real(substr(type2,3,2))
gen new_type2 = string(y_02) + "20" + "-" + "0" + string(m_02) + "-" + string(d_02)
//处理type3
replace type3 = subinstr(type3,"2/1/2020","2/01/2020",.)
gen y_03 = real(substr(type3,6,4))
gen m_03 = real(substr(type3,1,1))
gen d_03 = real(substr(type3,3,2))
gen new_type3 = string(y_03) + "-" + "0" + string(m_03) + "-" + string(d_03)
replace new_type3 = subinstr(new_type3,"2020-02-1","2020-02-01",.)
//处理type4
gen new_type4 = substr(type4,1,10)
//处理type5
gen y_05 = real(substr(type5,6,2))
gen m_05 = real(substr(type5,1,1))
gen d_05 = real(substr(type5,3,2))
gen new_type5 = string(y_05) + "20" + "-" + "0" +string(m_05) + "-" + string(d_05)
//处理type6
gen new_type6 = substr(type6,1,10)
//处理type7
gen y_07 = real(substr(type7,6,2))
gen m_07 = real(substr(type7,1,1))
gen d_08 = real(substr(type7,6,2))
gen new_type7 = string(y_07) + "20" + "-" + "0" + string(m_07) + "-" + string(d_07)
//处理type8
gen new_type8 = substr(type8,1,10)
//处理type9
gen y_09 = real(substr(type9,5,2))
gen m_09 = real(substr(type9,1,1))
gen d_09 = real(substr(type9,3,1))
gen new_type9 = string(y_09) + "20" + "-" + "0" +string(m_09) + "-" + "0" + string(d_09)
//处理type10
gen new_type10 = substr(type10,1,10)
//将new_type1-10合并到new_time统一时间
gen new_time = new_type1 in 1/38
replace new_time = new_type2 in 39/431
replace new_time = new_type3 in 432/560
replace new_time = new_type4 in 561/7617
replace new_time = new_type5 in 7618/11034
replace new_time = new_type6 in 11035/31566
replace new_time = new_type7 in 31567/38439
replace new_time = new_type8 in 38440/53995
replace new_time = new_type9 in 53996/56804
replace new_time = new_type10 in 56805/62544
//检查是否完整生成时间变量
gen date = date(new_time,"YMD")
gen v1 = "100" if date ==. // 有缺失值
list lastupdate if v1 == "100" //找到缺失值
replace new_time = "2020-03-08" in 10847
replace new_time = "2020-03-08" in 10848
replace new_time = "2020-03-08" in 34744
replace new_time = "2020-03-08" in 34824
replace new_time = "2020-03-08" in 38182
replace new_time = "2020-03-08" in 38262
replace new_time = "2020-03-29" in 56547
replace new_time = "2020-03-18" in 56565
replace new_time = "2020-03-18" in 56566
replace new_time = "2020-03-30" in 56576
replace new_time = "2020-03-29" in 56577
replace new_time = "2020-03-16" in 56588
replace new_time = "2020-02-23" in 56596
replace new_time = "2020-02-23" in 56615
gen date1 = date(new_time,"YMD")
format date1 %tdCY-N-D
* 保存变量名
keep provincestate countryregion confirmed deaths recovered date1
*修改变量
rename provincestate province
rename countryregion country
rename date1 date
* 数据处理过程保留
save date_complete_covid_0409.dta,replace
******************************************
*******************画图*******************
******************************************
//继续处理数据
use date_complete_covid_0409.dta,clear
* 将缺失值替换为0
replace confirmed = 0 if confirmed ==.
replace deaths = 0 if deaths ==.
replace recovered = 0 if recovered == .
//********画中国的图*******
* 保留中国的数据
keep if inlist(country,"Mainland China","China","Hong Kong","Taiwan","Taiwan*","Macau")
* 完善省份名称
replace province = "Taiwan" if country == "Taiwan*"
rename provincestate province
* 生成面板的中国数据
encode province,gen(province_id)
drop country
xtset province_id date
//处理重复值
duplicates list province_id date
duplicates tag province_id date,gen(isdup)
edit if isdup
drop isdup
duplicates drop province_id date,force
xtset province_id date //生成面板数据
save China_covid_0409.dta,replace //保存中国疫情数据
*************************************画第一个图*******************************
use China_covid_0409.dta,clear
gen current_confirmed = confirmed - deaths - recovered //计算现存确诊人数计算图
replace current_confirmed = 0 if current_confirmed < 0
keep province date current_confirmed
drop if missing(current_confirmed)
replace province = subinstr(province, " ", "_", .)
spread province current_confirmed
*画一个湖北省的疫情确诊图
tw conn Hubei date, xti("日期", size(*0.8)) ///
ytitle("现存确诊人数", size(*0.8)) ///
title("湖北省现存确诊人数变化", size(*1.3)) ///
subtitle("") caption("数据来源:约翰·霍普金斯大学", size(*0.8))
graph save hubei.png, replace
**湖北省周边四省现存确诊人数
tw conn Henan Anhui Jiangxi Hunan date, xtitle("日期", size(*0.8)) ///
ytitle("现存确诊人数", size(*0.8)) ///
title("湖北周边四省现存确诊人数变化", size(*1.1)) ///
subtitle("河南、安徽、江西和湖南四省") caption("数据来源:约翰·霍普金斯大学", size(*0.8)) xla(21937(16)22013) ///
xsc(range(21937 22010) extend) ///
leg(order(1 "河南" 2 "安徽" 3 "江西" 4 "湖南") pos(1) ring(0)) ///
lc("100 200 165" "252 141 98" "141 160 203" "231 138 195") ///
mc("102 194 165" "252 141 98" "141 160 203" "231 138 195")
graph save Fourprovince.png, replace
****************************************************************************
* 地图二
* 死亡和治愈病例数量
* 当前确诊人数与每日新增确诊人数
use China_covid_0409.dta,clear
collapse (sum) confirmed (sum) deaths (sum) recovered, by(date)
tsset date
gen new_confirmed = confirmed - l.confirmed
tw ///
lpolyci new_confirmed date, bw(3) || ///
sc new_confirmed date, ms(circle) ||, ///
leg(off) xlab(, format(%tdCY-N-D)) xla(21937(16)22013) ///
xsc(range(21937 22013) extend) ///
ti("每日新增确诊数", size(*1.2)) ///
yti("人数") name(a, replace) nodraw
gen current_confirmed = confirmed - deaths - recovered
replace current_confirmed = 0 if current_confirmed < 0
tw ///
lpolyci current_confirmed date, bw(3) || ///
sc current_confirmed date, ms(circle) ||, ///
leg(off) xlab(, format(%tdCY-N-D)) xla(21937(16)22013) ///
xsc(range(21937 22013) extend) ///
ti("现存确诊数", size(*1.2)) ///
yti("人数") name(b, replace) nodraw
gr combine a b, caption("数据来源:约翰霍普金斯大学", size(*0.8)) xsize(20) ysize(12) ///
graphr(margin(medlarge))
gr export ab.png, replace
* 死亡和治愈病例数量
use China_covid_0409.dta,clear
collapse (sum) confirmed (sum) deaths (sum) recovered, by(date)
tsset date
tw ///
lpolyci recovered date, bw(3) || ///
sc recovered date, ms(circle) ||, ///
leg(off) xlab(, format(%tdCY-N-D)) xla(21937(16)22013) ///
xsc(range(21937 22013) extend) ///
ti("累计治愈人数", size(*1.2)) ///
yti("人数") name(a, replace) nodraw
tw ///
lpolyci deaths date, bw(3) || ///
sc deaths date, ms(circle) ||, ///
leg(off) xlab(, format(%tdCY-N-D)) xla(21937(16)22007) ///
xsc(range(21937 22000) extend) ///
ti("累计死亡病例数", size(*1.2)) ///
yti("人数") name(b, replace) nodraw
* 每日新增治愈
gen new_recovered = recovered - l.recovered
tw ///
lpolyci new_recovered date, bw(3) || ///
sc new_recovered date, ms(circle) ||, ///
leg(off) xlab(, format(%tdCY-N-D)) xla(21937(16)22007) ///
xsc(range(21937 22007) extend) ///
ti("每日新增治愈人数", size(*1.2)) ///
yti("人数") name(c, replace) nodraw
gr combine a b, caption("数据来源:约翰霍普金斯大学", size(*0.8)) xsize(20) ysize(12) ///
graphr(margin(medlarge))
* 每日新增死亡
gen new_deaths = deaths - l.deaths
tw ///
lpolyci new_deaths date, bw(3) || ///
sc new_deaths date, ms(circle) ||, ///
leg(off) xlab(, format(%tdCY-N-D)) xla(21937(16)22007) ///
xsc(range(21937 22007) extend) ///
ti("每日新增", size(*1.2)) ///
yti("人数") name(d, replace) nodraw
gr combine a b c d, r(2) caption("数据来源:约翰霍普金斯大学", size(*0.8)) xsize(20) ysize(12) ///
graphr(margin(medlarge))
graph save fourtype.png, replace
* 地图三 每天增长总数 //时间轴比较拥挤如何处理
use China_covid_0409.dta, clear
collapse (sum) confirmed (sum) deaths (sum) recovered, by(date)
* 计算累积量
gen base = 0
gen deaths_recovered = deaths + recovered
* 绘图
tw ///
rarea recovered base date, fc("180 209 132") ///
c("180 209 132") fintensity(inten80) lw(vvthin) || ///
rarea deaths_recovered recovered date, fc("122 118 123") ///
c("122 118 123") fintensity(inten80) lw(vvthin) || ///
rarea confirmed deaths_recovered date, fc("95 144 161") ///
c("95 144 161") fintensity(inten80) lw(vvthin) ||, ///
leg(order(3 "现存确诊" 2 "死亡" 1 "治愈") pos(11) ring(0)) ///
title("新冠肺炎疫情在中国的发展状况", size(*1.5) ///
justification(left) bexpand) ///
subtitle("中国新冠肺炎的确诊人数、治愈人数和死亡人数的发展趋势。", ///
justification(left) bexpand) ///
caption("数据来源:约翰霍普金斯大学", ///
size(*0.8)ring(0)) ///
xti("") xla(21937(8)22000) ///
xsc(range(21937 22000) extend) ///
graphr(margin(medlarge)) ///
text(20000 21984 "治愈", color(black) size(*1.2)) ///
text(63000 21983 "死亡", color(black) size(*1.2)) ///
text(70000 21978 "现存确诊", color(black) size(*1.2)) || ///
pci 80000 21957 0 21957, lp(dash) text(56000 21952 "我国改变了" "确诊的标准", color(black) size(*0.8) justification(right) bexpand)
graph save dynamic.png, replace
* 地图三 疫情治愈,死亡,新增图
use China_covid_0409.dta, clear
collapse (sum) confirmed (sum) deaths (sum) recovered, by(date)
* 计算累积量
gen base = 0
gen deaths_recovered = deaths + recovered
* 绘图
tw ///
rarea recovered base date, fc("180 209 132") ///
c("180 209 132") fintensity(inten80) lw(vvthin) || ///
rarea deaths_recovered recovered date, fc("122 118 123") ///
c("122 118 123") fintensity(inten80) lw(vvthin) || ///
rarea confirmed deaths_recovered date, fc("95 144 161") ///
c("95 144 161") fintensity(inten80) lw(vvthin) ||, ///
leg(order(3 "现存确诊" 2 "死亡" 1 "治愈") pos(11) ring(0)) ///
title("新冠肺炎疫情在中国的发展状况", size(*1.5) ///
justification(left) bexpand) ///
subtitle("中国新冠肺炎的确诊人数、治愈人数和死亡人数的发展趋势。", ///
justification(left) bexpand) ///
caption("数据来源:约翰霍普金斯大学", ///
size(*0.8)ring(0)) ///
xti("") xla(21937(8)22000) ///
xsc(range(21937 22000) extend) ///
graphr(margin(medlarge)) ///
text(20000 21984 "治愈", color(black) size(*1.2)) ///
text(63000 21983 "死亡", color(black) size(*1.2)) ///
text(70000 21978 "现存确诊", color(black) size(*1.2)) || ///
pci 80000 21957 0 21957, lp(dash) text(56000 21952 "我国改变了" "确诊的标准", color(black) size(*0.8) justification(right) bexpand)
graph save dynamic.png, replace
* 绘制世界地图
use date_complete_covid_0409.dta,clear
replace confirmed = 0 if confirmed ==.
replace deaths = 0 if deaths ==.
replace recovered = 0 if recovered ==.
* 保存变量*************************************************************
keep province country date confirmed deaths recovered
* 准备截面数据
keep if date == date("2020-04-08","YMD")
replace country = "China" if countryregion == "Taiwan*"
* 将各国数据汇总
bysort country :egen confirmed_sum = sum(confirmed)
bysort country :egen deaths_sum = sum(deaths)
bysort country :egen recovered_sum = sum(recovered)
gsort -confirmed_sum //由大到小进行排序
drop in 2/2621 //保留美国
drop in 5/14 //保留法国
drop in 7/21 //中国
drop in 9/18 //英国
drop in 13/16 //荷兰
drop in 14/27 //加拿大
drop in 24/30 //澳大利亚
drop in 25/26 //保存丹麦
keep country date confirmed_sum deaths_sum recovered_sum
rename confirmed_sum confirmed
rename deaths_sum deaths
rename recovered_sum recovered
rename countryregion country
save global_covid_0409.dta,replace //全球数据保存完成
*绘制12个国家的疫情图
* 全球疫情最严重12个国家的现存确诊人数
use global_covid_0409.dta,clear
gen current_confirmed = confirmed - deaths - recovered
gsort - current_confirmed
sencode country,gen(country_id)
drop country
rename country_id country
keep in 1/12
save global_12_covid_0409.dta,replace
twoway bar current_confirmed country, sort horizontal barwidth(0.5) ///
fcolor(red) ylabel(1(1)12,valuelabel angle(0) labsize(*1.2)) ///
ytitle("") xtitle("现存确诊人数") ysize(2) xsize(4) title(截止2020年4月8日现存确诊人数) ///
caption("数据来源:约翰霍普金斯大学")
graph save global_12.png,replace
**绘制世界地形图
*绘制世界疫情图需要重新处理数据
use global_covid_0409.dta,clear
* 修改国家
replace country = "United States of America" if country == "US"
replace country = "The Bahamas" if country == "Bahamas"
replace country = "Republic of the Congo" if country == "Congo (Brazzaville)"
replace country = "Democratic Republic of the Congo" if country == "Congo (Kinshasa)"
replace country = "Ivory Coast" if country == "Cote d'Ivoire"
replace country = "Czech Republic" if country == "Czechia"
replace country = "Guinea Bissau" if country == "Guinea-Bissau"
replace country = "South Korea" if country == "Korea, South"
replace country = "Macedonia" if country == "North Macedonia"
replace country = "Republic of Serbia" if country == "Serbia"
replace country = "United Republic of Tanzania" if country == "Tanzania"
replace country = "East Timor" if country == "Timor-Leste"
rename country name
merge 1:1 name using worlddb
* 绘制地图
replace confirmed = 0 if missing(confirmed)
xtset, clear
grmap confirmed using worldcoord, id(ID) ///
fcolor("252 255 164" "250 193 39" "245 125 21" "212 72 66" "159 42 99" "101 21 110" "40 11 84") ///
ocolor("white" ...) ///
clmethod(custom) clbreaks(0 0.9 9 99 999 9999 99999 999990) ///
title("新型冠状病毒肺炎疫情分布", size(*1.2) color(white)) ///
graphr(margin(medium)) ///
subtitle("截止 2020 年 4 月 8 日", color(white)) ///
osize(vthin ...) ///
legend(size(*1.5) ///
order(2 "无" 3 "1~9 人" 4 "10~99 人" 5 "100~999 人" 6 "1000~9999人" 7 "10000~100000 人" 8 ">= 100000 人") ///
ti(确诊, size(*0.5) pos(11) color(white)) color(white)) ///
plotr(fcolor(24 24 24) lcolor(24 24 24)) ///
graphr(fcolor(24 24 24) lcolor(24 24 24)) ///
xsize(20) ysize(8.510638298) ///
caption("数据来源:约翰·霍普金斯大学", size(*0.8) color(white))
graph save global_confirmed.png,replace
连享会-直播课 上线了!
http://lianxh.duanshu.com
免费公开课:
直击面板数据模型 - 连玉君,时长:1小时40分钟 Stata 33 讲 - 连玉君, 每讲 15 分钟. 部分直播课 课程资料下载 (PPT,dofiles等)
支持回看,所有课程可以随时购买观看。
专题 | 嘉宾 | 直播/回看视频 |
---|---|---|
⭐ 最新专题 ⭐ | DSGE, 因果推断, 空间计量等 | |
⭕ Stata数据清洗 | 游万海 | 直播, 2 小时,已上线 |
研究设计 | 连玉君 | 我的特斯拉-实证研究设计,-幻灯片- |
面板模型 | 连玉君 | 动态面板模型,-幻灯片- |
面板模型 | 连玉君 | 直击面板数据模型 [免费公开课,2小时] |
Note: 部分课程的资料,PPT 等可以前往 连享会-直播课 主页查看,下载。
关于我们
课程, 直播, 视频, 客服, 模型设定, 研究设计, stata, plus, 绘图, 编程, 面板, 论文重现, 可视化, RDD, DID, PSM, 合成控制法
等
连享会小程序:扫一扫,看推文,看视频……
扫码加入连享会微信群,提问交流更方便
✏ 连享会学习群-常见问题解答汇总:
✨ https://gitee.com/arlionn/WD