@@ -61,3 +61,28 @@
+ [`xswang.com`](./src/crawler/xswang.com) :[`https://www.xswang.com/book/56718/`](https://www.xswang.com/book/56718/)
+ [`zhihu.com`](./src/crawler/zhihu.com) :[`https://www.zhihu.com/column/c_1553471910075449344`](https://www.zhihu.com/column/c_1553471910075449344)
## Crawler Sample Analysis

The initial crawl yields 5 `raw` data files in three distinct groups:

+ sample_1-a
+ sample_1-b
+ sample_2-a
+ sample_2-b
+ sample_3

A simple merge within each group then gives three initial `combine` samples:

+ sample_1
+ sample_2
+ sample_3

The three samples are then cross-checked and merged, fixing grammar and vocabulary errors, censored or blocked words, and similar issues, to produce the final three groups of `fixed` samples.
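The grouping step described above (five `raw` files collapsing into three `combine` samples) can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `group_raw_samples` and `combine_samples` are hypothetical, and the "simple merge" is assumed here to be plain concatenation of the parts in suffix order (`-a`, then `-b`), which may differ from what the project actually does.

```python
from collections import defaultdict


def group_raw_samples(names):
    """Group raw sample names by base name, e.g. 'sample_1-a' -> 'sample_1'."""
    groups = defaultdict(list)
    for name in sorted(names):
        # Everything before the first '-' is treated as the group key;
        # a name without '-' (e.g. 'sample_3') forms its own group.
        base, _, _suffix = name.partition("-")
        groups[base].append(name)
    return dict(groups)


def combine_samples(raw_texts):
    """Merge raw texts per group (assumption: simple concatenation)."""
    groups = group_raw_samples(raw_texts.keys())
    return {
        base: "\n".join(raw_texts[part] for part in parts)
        for base, parts in groups.items()
    }


if __name__ == "__main__":
    # Placeholder contents, not real crawl output.
    raw = {
        "sample_1-a": "part a of sample 1",
        "sample_1-b": "part b of sample 1",
        "sample_2-a": "part a of sample 2",
        "sample_2-b": "part b of sample 2",
        "sample_3": "sample 3",
    }
    for name, text in combine_samples(raw).items():
        print(name, "->", repr(text))
```

The later `fixed` stage (cross-checking the three samples against each other and correcting wording) involves editorial judgment and is not covered by this sketch.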