diff --git a/README.md b/README.md
index e0b8181..b0428c4 100644
--- a/README.md
+++ b/README.md
@@ -2,6 +2,16 @@
 ```mermaid
 graph LR
 
+    subgraph crawler
+        source_1([108shu.com])
+        source_2([aidusk.com])
+        source_3([ixsw.la])
+        source_4([m.wxsy.net])
+        source_5([wxsy.net])
+        source_6([xswang.com])
+        source_7([zhihu.com])
+    end
+
     subgraph sample
         subgraph raw
             s1a_raw{{sample_1-a}}
@@ -23,11 +33,19 @@
             s3_fixed(sample_3)
         end
 
-        subgraph release
+        subgraph replenish
             sa{{sample_a}}
             sb{{sample_b}}
         end
 
+        source_1 ==> s1a_raw
+        source_2 ==> s1b_raw
+        source_3 ==> s1b_raw
+        source_4 ==> s2a_raw
+        source_5 ==> s2a_raw
+        source_6 ==> s2b_raw
+        source_7 ==> s3_raw
+
         s1a_raw -- replenish --> s1_combine
         s1b_raw --> s1_combine
         s2a_raw -- replenish --> s2_combine
@@ -44,25 +62,15 @@
         s3_fixed -- replenish --> sb
     end
 
-    subgraph crawler
-        source_1([108shu.com]) --> s1a_raw
-        source_2([aidusk.com]) --> s1b_raw
-        source_3([ixsw.la]) --> s1b_raw
-        source_4([m.wxsy.net]) --> s2a_raw
-        source_5([wxsy.net]) --> s2a_raw
-        source_6([xswang.com]) --> s2b_raw
-        source_7([zhihu.com]) --> s3_raw
-    end
-
-    subgraph rc
-        rc-1(rc-1)
+    subgraph release
+        rc-1([rc-1])
         sa --> rc-1
         sb -- fix --> rc-1
     end
 ```
 
 
-## Crawler data sources
+## Data sources
 
 + [`108shu.com`](./src/crawler/108shu.com)
   :[http://www.108shu.com/book/54247/](http://www.108shu.com/book/54247/)
@@ -79,9 +87,9 @@
 + [`zhihu.com`](./src/crawler/zhihu.com)
   :[https://www.zhihu.com/column/c_1553471910075449344](https://www.zhihu.com/column/c_1553471910075449344)
 
-## Crawler sample analysis
+## Sample analysis
 
-The original crawl yields 5 distinct `raw` datasets in three groups:
+Crawling seven websites yields five distinct `raw` samples in three groups:
 
 + `sample_1-a`
 
@@ -101,7 +109,7 @@
 
 + `sample_3`
 
-The three samples are cross-checked and merged, fixing grammar and vocabulary errors, censored or masked words, and similar issues, to produce the final three groups of `fixed` samples, which are then merged into two groups of `release` samples:
+These are cross-checked and merged, fixing grammar and vocabulary errors, censored or masked words, and similar issues, to produce three groups of `fixed` samples, which are merged again into two `release` samples:
 
 + `sample_a`