
docs: update README.md

Dnomd343 2 years ago
commit 5544d51ff3
README.md: 42 lines changed
@@ -2,6 +2,16 @@
```mermaid
graph LR
subgraph crawler
source_1([108shu.com])
source_2([aidusk.com])
source_3([ixsw.la])
source_4([m.wxsy.net])
source_5([wxsy.net])
source_6([xswang.com])
source_7([zhihu.com])
end
subgraph sample
subgraph raw
s1a_raw{{sample_1-a}}
@@ -23,11 +33,19 @@
s3_fixed(sample_3)
end
subgraph release
subgraph replenish
sa{{sample_a}}
sb{{sample_b}}
end
source_1 ==> s1a_raw
source_2 ==> s1b_raw
source_3 ==> s1b_raw
source_4 ==> s2a_raw
source_5 ==> s2a_raw
source_6 ==> s2b_raw
source_7 ==> s3_raw
s1a_raw -- replenish --> s1_combine
s1b_raw --> s1_combine
s2a_raw -- replenish --> s2_combine
@@ -44,25 +62,15 @@
s3_fixed -- replenish --> sb
end
subgraph crawler
source_1([108shu.com]) --> s1a_raw
source_2([aidusk.com]) --> s1b_raw
source_3([ixsw.la]) --> s1b_raw
source_4([m.wxsy.net]) --> s2a_raw
source_5([wxsy.net]) --> s2a_raw
source_6([xswang.com]) --> s2b_raw
source_7([zhihu.com]) --> s3_raw
end
subgraph rc
rc-1(rc-1)
subgraph release
rc-1([rc-1])
sa --> rc-1
sb -- fix --> rc-1
end
```
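The graph above encodes which crawler source feeds which raw sample. As a minimal Python sketch of that mapping (the names mirror the graph; the helper below is illustrative and not part of the repository):

```python
# Map each crawler source to the raw sample it feeds, as in the graph above.
SOURCE_TO_RAW = {
    "108shu.com": "sample_1-a",
    "aidusk.com": "sample_1-b",
    "ixsw.la": "sample_1-b",
    "m.wxsy.net": "sample_2-a",
    "wxsy.net": "sample_2-a",
    "xswang.com": "sample_2-b",
    "zhihu.com": "sample_3",
}

def raw_groups():
    """Invert the mapping: raw sample -> list of sources feeding it."""
    groups = {}
    for source, raw in SOURCE_TO_RAW.items():
        groups.setdefault(raw, []).append(source)
    return groups
```

Inverting the table makes the "five copies, three groups" structure visible: five `raw` keys, some fed by more than one site.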
## Crawler Data Sources
## Data Sources
+ [`108shu.com`](./src/crawler/108shu.com): [http://www.108shu.com/book/54247/](http://www.108shu.com/book/54247/)
@@ -79,9 +87,9 @@
+ [`zhihu.com`](./src/crawler/zhihu.com): [https://www.zhihu.com/column/c_1553471910075449344](https://www.zhihu.com/column/c_1553471910075449344)
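Each directory under `./src/crawler/` holds the scraper for one of the sites listed above; the actual implementations are not shown in this diff. A hypothetical minimal fetch helper such a scraper might start from:

```python
from urllib.request import urlopen

def fetch_page(url: str, encoding: str = "utf-8") -> str:
    """Download one page and decode it. A real scraper would also
    parse the chapter list and handle retries and request headers
    (none of which is shown here)."""
    with urlopen(url) as resp:
        return resp.read().decode(encoding, errors="replace")
```

The `encoding` parameter matters for these sites: several Chinese novel sites serve GBK rather than UTF-8, so the caller should pass the encoding the site actually uses.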
## Crawler Sample Analysis
## Sample Analysis
The initial crawl yields 5 copies of `raw` data in three different groups
Crawling the seven sites yields five copies, in three groups, of distinct `raw` samples
+ `sample_1-a`
@ -101,7 +109,7 @@
+ `sample_3`
The three sets of samples are cross-checked and merged, fixing grammar and vocabulary errors, banned and blocked words, and so on, to produce the final three groups of `fixed` samples; these are then merged to obtain two groups of `release` samples:
They are cross-checked and merged, fixing grammar and vocabulary errors, banned and blocked words, and so on, to obtain three groups of `fixed` samples; merging again yields two `release` samples:
+ `sample_a`
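The cross-check merge described above can be sketched as a line-aligned majority vote. This is a simplification: the repository's real merge step involves manual proofreading and a blocked-word list, neither of which appears in this diff.

```python
from collections import Counter

def merge_copies(copies):
    """Merge line-aligned copies of the same text by majority vote;
    a line with no majority falls back to the first copy (a crude
    stand-in for the manual proofreading described above)."""
    merged = []
    for lines in zip(*copies):
        best, count = Counter(lines).most_common(1)[0]
        merged.append(best if count > 1 else lines[0])
    return merged
```

With three copies of each line, a typo or blocked-word substitution in any single source is outvoted by the other two, which is the point of crawling the same text from several sites.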
