Dnomd343 9f222320c7 docs: update README.md 2 years ago
assets file: add metadata and cover photo 2 years ago
crawler_release chore: file structure 2 years ago
sample update: recover wrong fixes on `sample_3` 2 years ago
src/crawler chore: file structure 2 years ago
.gitignore chore: ignore idea folder 2 years ago
README.md docs: update README.md 2 years ago
xxrs.json file: xxrs json release 2 years ago
xxrs.txt file: release `xxrs.txt` 2 years ago

README.md

栩栩若生

  graph LR
    subgraph sample
      subgraph raw
        s1a_raw{{sample_1-a}}
        s1b_raw{{sample_1-b}}
        s2a_raw{{sample_2-a}}
        s2b_raw{{sample_2-b}}
        s3_raw{{sample_3}}
      end

      subgraph combine
        s1_combine[sample_1]
        s2_combine[sample_2]
        s3_combine[sample_3]
      end

      subgraph fixed
        s1_fixed(sample_1)
        s2_fixed(sample_2)
        s3_fixed(sample_3)
      end

      s1a_raw -- replenish --> s1_combine
      s1b_raw --> s1_combine
      s2a_raw -- replenish --> s2_combine
      s2b_raw -- replenish --> s2_combine
      s3_raw -- clean up --> s3_combine

      s1_combine -- fix --> s1_fixed
      s2_combine -- fix --> s2_fixed
      s3_combine -- fix --> s3_fixed
    end

    subgraph crawler
      source_1([108shu.com]) --> s1a_raw
      source_2([aidusk.com]) --> s1b_raw
      source_3([ixsw.la]) --> s1b_raw
      source_4([m.wxsy.net]) --> s2a_raw
      source_5([wxsy.net]) --> s2a_raw
      source_6([xswang.com]) --> s2b_raw
      source_7([zhihu.com]) --> s3_raw
    end

Crawler data sources

Crawler sample analysis

The crawlers originally produced five raw datasets, which fall into three groups:

  • sample_1-a

  • sample_1-b

  • sample_2-a

  • sample_2-b

  • sample_3

A simple merge of the raw data within each group yields three initial combine samples:

  • sample_1

  • sample_2

  • sample_3
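The "replenish" step in the graph could be sketched roughly as follows. This is only an illustration, not the repository's actual code: it assumes each raw sample is a mapping from chapter title to chapter text, and the `replenish` function and the sample contents are hypothetical.

```python
def replenish(primary: dict[str, str], secondary: dict[str, str]) -> dict[str, str]:
    """Return a copy of `primary` with missing or empty chapters filled from `secondary`."""
    combined = dict(primary)
    for title, text in secondary.items():
        # Only fill gaps: keep the primary text whenever it is non-empty.
        if not combined.get(title, "").strip():
            combined[title] = text
    return combined


# Hypothetical miniature inputs standing in for sample_1-a and sample_1-b.
sample_1a = {"ch1": "text A1", "ch2": ""}
sample_1b = {"ch2": "text B2", "ch3": "text B3"}
sample_1 = replenish(sample_1a, sample_1b)
```

Treating one raw source as primary and the other as a gap filler keeps the merge deterministic; which source should take priority is a judgment call that the README does not specify.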

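The three samples are then compared against one another and merged, fixing grammar and vocabulary errors, censored or blocked words, and similar problems, to produce the final three fixed samples.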
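One way to approach the three-way comparison is a line-by-line majority vote. The sketch below assumes the three combine samples have already been aligned so that line `i` refers to the same passage in each. The README does not say how the comparison was actually done (it may well have been manual), so `cross_check` and the example lines are hypothetical.

```python
from collections import Counter


def cross_check(lines_a: list[str], lines_b: list[str], lines_c: list[str]):
    """Pick the majority text per line and report indices where all three disagree."""
    fixed, disputed = [], []
    for index, candidates in enumerate(zip(lines_a, lines_b, lines_c)):
        text, votes = Counter(candidates).most_common(1)[0]
        if votes == 1:
            # No two samples agree: fall back to the first sample and flag for manual review.
            text = candidates[0]
            disputed.append(index)
        fixed.append(text)
    return fixed, disputed


# Hypothetical aligned lines; "0" stands in for a character corrupted at the source.
a = ["line one", "line tw0", "aaa"]
b = ["line one", "line two", "bbb"]
c = ["line 0ne", "line two", "ccc"]
fixed, disputed = cross_check(a, b, c)
```

A vote like this only resolves errors that appear in a single source. Lines listed in `disputed`, and errors shared by two or more sources, would still need manual correction.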