law-site-link-discovery

Hybrid knowledge.webimport

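Link discovery for government and legal sites. Site-specific behavior:

chinacourt.cn (/article/index/id/… list pages): links are taken only from #articleList (up to paginationControl) and point to /article/detail/YYYY/MM/id/*.shtml; the 要闻 (top news) block and the right sidebar in the same HTML are excluded.
CAC (/wxzw/zcfg/): article hrefs have the form /YYYY-MM/DD/c_*.htm; with followPagination, page 2 and later are loaded via POST /cms/JsonList (the page URL stays index_1.htm).
PBOC tiaofasi: list→detail; the detail page's HTML URL is preferred over attachments inside the detail page.
CSRC and SAFE: dedicated extractors.
NDA (/zwgk/zcfb/list/): list→detail, keeping only /zwgk/zcfb/ links (/zjjd/ and /ytdd/ are excluded).

A generic crawl is also available, with pagination, nested-link following, and PDF collection (collectPdf). maxTotalFetches caps the combined number of GETs and JsonList requests. No filesystem access.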
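The URL shapes named in the description can be approximated with regular expressions. The sketch below is illustrative only: the patterns are inferred from the description (four-digit year, two-digit month and day, c_*.htm file names), the helper name classify_href is hypothetical, and the gene's actual extractors may match differently.

```python
import re

# Patterns inferred from the description; assumptions, not the gene's real code.
CHINACOURT_DETAIL = re.compile(r"^/article/detail/\d{4}/\d{2}/id/[^/]+\.shtml$")
CAC_ARTICLE = re.compile(r"^/\d{4}-\d{2}/\d{2}/c_[^/]+\.htm$")


def classify_href(path: str) -> str:
    """Label a URL path, or return 'other' when neither pattern matches."""
    if CHINACOURT_DETAIL.match(path):
        return "chinacourt_detail"
    if CAC_ARTICLE.match(path):
        return "cac_article"
    return "other"
```

A list-page href such as /article/index/id/… matches neither pattern and is labeled "other", which mirrors the list→detail distinction in the description.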

Phenotype

Input

| Property | Type | Req | Description |
| --- | --- | --- | --- |
| seedUrl | string | yes | HTTP(S) page URL; the host must be in network.allowedDomains (whitelisted government sites). |
| maxPages | integer | no | Maximum number of pages to fetch, including the seed, when followPagination is true. Default 1, max 50. |
| linkScope | string (default or single_page_downloadable) | no | single_page_downloadable: only the seed list page (no pagination or nested links); list→detail is still followed on that page for whitelisted sites. default: pagination and nested depth apply when enabled. |
| collectPdf | boolean | no | Include .pdf links in items. Default true. |
| maxTotalFetches | integer | no | Maximum number of HTTP GETs per invocation (JsonList requests count toward the cap). Default 200, max 200. |
| nestedLinkDepth | integer | no | 0, 1, or 2. For same-origin navigable hrefs, fetch 1 or 2 levels of child pages to collect more file links. Default 0. |
| followPagination | boolean | no | If true, try to follow rel=next, a "next" link, or a 下一页 ("next page") link, up to maxPages. May not work on single-page applications (SPAs). |
| extraAllowedHosts | array of strings | no | Optional extra hostnames for the link/download allowlist; kb-assistant-app merges the local user-allowlist.json here. Cloud WASM typically ignores this. |
| followDetailPages | boolean | no | When true (default): follow article/detail links from the list (PBOC tiaofasi: emit detail-page HTML URLs for indexing; file hrefs on the list page are still collected). When false: collect only file-like hrefs on the list HTML. |
| maxNestedUrlsPerLevel | integer | no | Maximum number of distinct URLs to follow per nested level. Default 12, max 40. |
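The defaults and maxima in the table can be applied before invoking the gene. The following is a minimal sketch, assuming the caller clamps out-of-range integers rather than rejecting them; normalize_input is a hypothetical helper, the lower bound of 0 is an assumption, and properties whose defaults the table does not state (linkScope, followPagination, extraAllowedHosts) are passed through untouched.

```python
from typing import Any, Dict

# Defaults and maxima copied from the Input table; the clamping itself is an assumption.
DEFAULTS = {
    "maxPages": 1,
    "collectPdf": True,
    "maxTotalFetches": 200,
    "nestedLinkDepth": 0,
    "followDetailPages": True,
    "maxNestedUrlsPerLevel": 12,
}
CAPS = {
    "maxPages": 50,
    "maxTotalFetches": 200,
    "nestedLinkDepth": 2,
    "maxNestedUrlsPerLevel": 40,
}


def normalize_input(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: fill documented defaults and clamp integers to documented maxima."""
    if not str(raw.get("seedUrl", "")).startswith(("http://", "https://")):
        raise ValueError("seedUrl must be an HTTP(S) URL")
    params = {**DEFAULTS, **raw}  # caller-supplied values override defaults
    for key, cap in CAPS.items():
        params[key] = max(0, min(int(params[key]), cap))
    return params
```

Clamping keeps a caller's oversized request (for example, maxPages of 500) within the documented limits instead of failing outright; a stricter caller could raise an error instead.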

Output

| Property | Type | Req |
| --- | --- | --- |
| site | string | yes |
| error | string | no |
| items | array of objects (url, title) | yes |
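A caller will usually want to check error and then sort the returned items. The sketch below assumes that a non-empty error means the run failed (the schema does not say whether error and items can coexist) and that PDF links can be recognized by a .pdf path suffix; split_items is a hypothetical helper, and the url field comes from the output schema.

```python
from typing import Any, Dict, List, Tuple


def split_items(result: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: separate PDF links from other URLs in a result object."""
    if result.get("error"):
        # Assumption: a non-empty error string signals a failed invocation.
        raise RuntimeError(f"{result.get('site', '?')}: {result['error']}")
    pdfs: List[str] = []
    pages: List[str] = []
    for item in result.get("items", []):
        url = item["url"]
        # Compare the path without query string or fragment, case-insensitively.
        path = url.split("#", 1)[0].split("?", 1)[0].lower()
        (pdfs if path.endswith(".pdf") else pages).append(url)
    return pdfs, pages
```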
Raw JSON Schema

inputSchema

{
  "type": "object",
  "required": [
    "seedUrl"
  ],
  "properties": {
    "seedUrl": {
      "type": "string",
      "description": "HTTP(S) page URL; host must be in network.allowedDomains (whitelisted govt sites)."
    },
    "maxPages": {
      "type": "integer",
      "description": "Max pages to fetch including the seed, when followPagination. Default 1, max 50."
    },
    "linkScope": {
      "enum": [
        "default",
        "single_page_downloadable"
      ],
      "type": "string",
      "description": "single_page: only the seed list screen (no pagination/nested); still list→detail on that screen for whitelisted sites. default: pagination and nested depth when enabled."
    },
    "collectPdf": {
      "type": "boolean",
      "description": "Include .pdf in items. Default true."
    },
    "maxTotalFetches": {
      "type": "integer",
      "description": "Max HTTP GETs per invocation. Default 200, max 200."
    },
    "nestedLinkDepth": {
      "type": "integer",
      "description": "0, 1, or 2. Same-origin navigable hrefs: fetch 1 or 2 levels of child pages to collect more file links. Default 0."
    },
    "followPagination": {
      "type": "boolean",
      "description": "If true, try to follow rel=next, next link, 下一页; maxPages; SPA may not work."
    },
    "extraAllowedHosts": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "description": "Optional extra hostnames for link/download allowlist; kb-assistant-app merges local user-allowlist.json here. Cloud WASM typically ignores."
    },
    "followDetailPages": {
      "type": "boolean",
      "description": "When true (default): follow article/detail links from the list (PBOC tiaofasi: emit detail-page HTML URLs for indexing; list-page file hrefs still collected). When false: only file-like hrefs on the list HTML."
    },
    "maxNestedUrlsPerLevel": {
      "type": "integer",
      "description": "Cap of distinct URLs to follow per nested level. Default 12, max 40."
    }
  }
}
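For callers without a JSON Schema library, the constraints in the inputSchema can be checked by hand. This is a minimal sketch that covers only what the schema states (required, type, enum, array item type); check_input is a hypothetical name, and a full JSON Schema validator would be more thorough.

```python
from typing import Any, Dict, List

# Property types transcribed from the inputSchema.
INPUT_TYPES = {
    "seedUrl": str, "maxPages": int, "linkScope": str, "collectPdf": bool,
    "maxTotalFetches": int, "nestedLinkDepth": int, "followPagination": bool,
    "extraAllowedHosts": list, "followDetailPages": bool,
    "maxNestedUrlsPerLevel": int,
}
LINK_SCOPES = {"default", "single_page_downloadable"}


def check_input(payload: Dict[str, Any]) -> List[str]:
    """Hypothetical checker: return a list of problems; an empty list means no issue was found."""
    problems: List[str] = []
    if "seedUrl" not in payload:
        problems.append("missing required property: seedUrl")
    for key, value in payload.items():
        expected = INPUT_TYPES.get(key)
        if expected is None:
            continue  # the schema does not forbid additional properties
        # bool is a subclass of int in Python, so reject it explicitly for integer fields.
        if (expected is int and isinstance(value, bool)) or not isinstance(value, expected):
            problems.append(f"{key}: expected {expected.__name__}")
    scope = payload.get("linkScope")
    if isinstance(scope, str) and scope not in LINK_SCOPES:
        problems.append("linkScope: not one of the enum values")
    hosts = payload.get("extraAllowedHosts")
    if isinstance(hosts, list) and not all(isinstance(h, str) for h in hosts):
        problems.append("extraAllowedHosts: items must be strings")
    return problems
```

Note that the enum accepts only default and single_page_downloadable, so the shorthand single_page used in the description text would be rejected by this check.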

outputSchema

{
  "type": "object",
  "required": [
    "site",
    "items"
  ],
  "properties": {
    "site": {
      "type": "string"
    },
    "error": {
      "type": "string"
    },
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "required": [
          "url",
          "title"
        ],
        "properties": {
          "url": {
            "type": "string"
          },
          "title": {
            "type": "string"
          }
        }
      }
    }
  }
}