From 9f222320c77a7473a539c72503f9e15c44f5924f Mon Sep 17 00:00:00 2001
From: Dnomd343
Date: Sun, 4 Dec 2022 14:03:47 +0800
Subject: [PATCH] docs: update README.md

---
 README.md | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/README.md b/README.md
index 0d13f68..374c38a 100644
--- a/README.md
+++ b/README.md
@@ -61,3 +61,28 @@
 + [`xswang.com`](./src/crawler/xswang.com) :[`https://www.xswang.com/book/56718/`](https://www.xswang.com/book/56718/)
 
 + [`zhihu.com`](./src/crawler/zhihu.com) :[`https://www.zhihu.com/column/c_1553471910075449344`](https://www.zhihu.com/column/c_1553471910075449344)
+
+
+## Crawler Sample Analysis
+
+The original crawlers produced five `raw` datasets in three groups:
+
++ sample_1-a
+
++ sample_1-b
+
++ sample_2-a
+
++ sample_2-b
+
++ sample_3
+
+A simple merge within each group yields three initial `combine` samples:
+
++ sample_1
+
++ sample_2
+
++ sample_3
+
+The three samples are then cross-checked and merged against each other, fixing grammar and vocabulary errors, censored or blocked words, and similar defects, to produce the final three `fixed` samples.