The tools mentioned here help manage reproducible research and handle new types of data. Why should you go after new data? Because it yields new insights: the recent Clark Medal winners, for example, relied on unconventional data in their major works. Those data were large and unstructured, so Excel, Word, and email couldn't do the job.
I write for economists, but other social scientists may also find these recommendations useful. The tools have a steep learning curve but pay off over time. Some improve small-data analysis as well, though most of the gains come from new data sources and real-time analysis.
Each section ends with a recommended reading list.
Some websites offer APIs, which return data in structured formats but cap the number of requests (site owners may raise the cap by agreement). When a website has no API, services like Kimono and Import.io extract structured data from its pages. When they fall short, BeautifulSoup and similar parsers can do the job.
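To make the parsing step concrete, here is a minimal sketch using BeautifulSoup. The HTML table below is made up for illustration; in practice you would first download the page (for instance with the requests library) and then parse it the same way.

```python
# A minimal scraping sketch: pull a small table out of an HTML page with
# BeautifulSoup. The HTML string stands in for a downloaded webpage.
from bs4 import BeautifulSoup

html = """
<table id="gdp">
  <tr><th>Country</th><th>GDP</th></tr>
  <tr><td>A</td><td>100</td></tr>
  <tr><td>B</td><td>250</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table", id="gdp").find_all("tr")[1:]  # skip the header row
data = {cells[0].get_text(): int(cells[1].get_text())
        for cells in (row.find_all("td") for row in rows)}
print(data)  # {'A': 100, 'B': 250}
```

The same pattern (find the enclosing tag, iterate over rows, extract cell text) covers most one-off scraping jobs; only the tag names and attributes change from site to site.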
Other sources include industrial software, custom data-collection systems (such as surveys on Amazon Mechanical Turk), and physical media. Text-recognition systems require little manual labor, so digitizing analog sources is now easy.
A general-purpose programming language can manage data that comes in peculiar formats or requires cleaning.
Use Python by default. Its packages replicate the core functionality of Stata, Matlab, and Mathematica, and other packages handle GIS, NLP, visual, and audio data.
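As a taste of Stata-style work in Python, here is a minimal sketch with pandas: build a tiny panel and collapse it to group means, much like Stata's `collapse (mean) wage, by(year)`. The data are made up for illustration.

```python
# Stata-style "collapse" in pandas: compute mean wage by year.
import pandas as pd

df = pd.DataFrame({
    "year": [2014, 2014, 2015, 2015],
    "wage": [10.0, 14.0, 11.0, 17.0],
})

means = df.groupby("year")["wage"].mean()
print(means.loc[2014], means.loc[2015])  # 12.0 14.0
```

The same `groupby` call scales from four rows to millions, which is where the gains over spreadsheet tools show up.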
Python is slow compared to other popular languages, but certain tweaks make it fast enough to avoid learning alternatives like Julia or Java. In most research settings, execution time is not the binding constraint: computing keeps getting cheaper (roughly in line with Moore's Law) while the coder's time gets more expensive.
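The most common such tweak is vectorization: replacing an explicit Python loop with a single NumPy operation. A minimal sketch, comparing a loop and a vectorized dot product that compute the same number:

```python
# Vectorization: the NumPy expression below replaces the Python loop and
# runs orders of magnitude faster on large arrays.
import numpy as np

x = np.arange(1000, dtype=np.float64)
y = np.arange(1000, dtype=np.float64)

# Pure-Python loop (slow):
loop_total = 0.0
for a, b in zip(x, y):
    loop_total += a * b

# Vectorized NumPy equivalent (fast):
fast_total = x @ y

print(loop_total == fast_total)  # True
```

Other standard tweaks in the same spirit are just-in-time compilation (Numba) and rewriting hot loops in Cython; vectorization alone is often enough.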
Version control tracks changes in files.
Git is the de facto standard for version control. GitHub.com is the largest service hosting Git repositories; it offers free storage for open projects and paid storage for private repositories.
A GitHub repository is a one-click solution for sharing both code and data: no problems with university servers, relocated personal pages, or sending large files via email.
When your project goes north of 1 GB, you can use GitHub’s Large File Storage or alternatives: AWS, Google Cloud, mega.nz, or torrents.
Jupyter notebooks combine text, code, and output on the same page.
Remote servers hold large datasets in memory and run numerical optimization and Monte Carlo simulations. GPU-based servers train artificial neural networks much faster and require less coding. All of this saves time.
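To show the kind of job a remote server would scale up, here is a minimal Monte Carlo sketch: estimating E[max(X − 1, 0)] for a standard normal X by simulation. The quantity and sample size are chosen for illustration; the analytic value is about 0.0833.

```python
# Monte Carlo sketch: estimate E[max(X - 1, 0)] for X ~ N(0, 1).
import numpy as np

rng = np.random.default_rng(0)          # fixed seed for reproducibility
draws = rng.standard_normal(1_000_000)  # one million simulated draws
estimate = np.maximum(draws - 1.0, 0.0).mean()
print(estimate)  # close to the analytic value of ~0.0833
```

On a laptop this takes a fraction of a second; the point of a server is to run the same code with billions of draws, or thousands of parameter configurations, without tying up your machine.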
If campus servers have peculiar limitations, third-party companies offer scalable solutions (AWS and Google Cloud). Users pay for storage and processor power, so exploratory analysis goes quickly.
A typical workflow with version control: commit changes locally as you work, push them to a shared repository, and pull your collaborators' updates.
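Such a workflow can be sketched in Git commands. The project name, file names, and remote URL below are placeholders; the publish/sync commands are commented out because they require a real GitHub remote.

```shell
# Initialize a repository and identify yourself (needed once).
git init my-project
cd my-project
git config user.name  "Jane Econ"
git config user.email "jane@example.com"

# Record the first version of the analysis.
echo "print('hello')" > analysis.py
git add analysis.py
git commit -m "Add initial analysis script"

# Work, then record each meaningful change as a commit.
echo "# data cleaning notes" > cleaning.md
git add cleaning.md
git commit -m "Add data cleaning notes"

# Publish to GitHub and sync with coauthors (placeholder URL):
# git remote add origin https://github.com/<user>/my-project.git
# git push -u origin main
# git pull
```

Each commit is a snapshot you can return to, so a broken regression script is never more than one `git checkout` away from its last working version.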
Some services allow writing code in a browser and running it right on their servers.
Real-time analysis requires code optimized for performance, as industrial applications demonstrate.
Swami Chandrasekaran has drawn a helpful map for learning new data technologies.