law-site-link-discovery

Hybrid · knowledge.webimport

Link discovery for government and law sites. Generic crawl with site-specific extractors; supports pagination, nested links, and collectPdf. No filesystem access.

chinacourt.cn (/article/index/id/… list pages): only links inside #articleList (up to paginationControl) are followed, to /article/detail/YYYY/MM/id/*.shtml; the 要闻 ("top news") block and the right sidebar on the same HTML are excluded.
CAC (/wxzw/zcfg/): article hrefs have the form /YYYY-MM/DD/c_*.htm; followPagination adds POST /cms/JsonList for page 2 and later (the URL stays index_1.htm).
PBOC tiaofasi: list→detail prefers the detail-page HTML URL over attachments inside the detail page.
CSRC, SAFE: dedicated extractors.
NDA (/zwgk/zcfb/list/): list→detail keeps only /zwgk/zcfb/ links (excludes /zjjd/ and /ytdd/).

maxTotalFetches caps GETs plus JsonList requests.
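
The href shapes named in the description can be illustrated with a small filter. This is only a sketch: the regular expressions are a reading of the /article/detail/YYYY/MM/id/*.shtml and /YYYY-MM/DD/c_*.htm patterns, and the function name, site keys, and sample hrefs in the usage note are hypothetical. The gene's actual extractors are not shown on this page and may differ.

```python
import re
from urllib.parse import urlsplit

# Assumed patterns, inferred from the description above (not the gene's code).
CHINACOURT_DETAIL = re.compile(r"^/article/detail/\d{4}/\d{2}/id/[^/]+\.shtml$")
CAC_ARTICLE = re.compile(r"/\d{4}-\d{2}/\d{2}/c_[^/]+\.htm$")


def is_detail_href(site: str, href: str) -> bool:
    """Return True if href looks like an article/detail link for the site.

    `site` is a hypothetical key ("chinacourt" or "cac"); href may be a
    path or an absolute URL, only its path component is inspected.
    """
    path = urlsplit(href).path
    if site == "chinacourt":
        return CHINACOURT_DETAIL.search(path) is not None
    if site == "cac":
        return CAC_ARTICLE.search(path) is not None
    return False
```

For instance, with made-up hrefs, is_detail_href("chinacourt", "/article/detail/2024/05/id/1234567.shtml") is accepted, while a list-page path such as "/wxzw/zcfg/index_1.htm" is rejected for "cac". Restricting chinacourt links to #articleList would additionally need an HTML parser, which this sketch leaves out.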

Phenotype

Input

| Property | Type | Req | Description |
| --- | --- | --- | --- |
| seedUrl | string | yes | HTTP(S) page URL; host must be in network.allowedDomains (whitelisted govt sites). |
| maxPages | integer | no | Max pages to fetch including the seed, when followPagination. Default 1, max 50. |
| linkScope | "default" or "single_page_downloadable" | no | single_page: only the seed list screen (no pagination/nested); still list→detail on that screen for whitelisted sites. default: pagination and nested depth when enabled. |
| collectPdf | boolean | no | Include .pdf in items. Default true. |
| maxTotalFetches | integer | no | Max HTTP GETs per invocation. Default 200, max 200. |
| nestedLinkDepth | integer | no | 0, 1, or 2. Same-origin navigable hrefs: fetch 1 or 2 levels of child pages to collect more file links. Default 0. |
| followPagination | boolean | no | If true, try to follow rel=next, next link, 下一页; maxPages; SPA may not work. |
| extraAllowedHosts | array | no | Optional extra hostnames for link/download allowlist; kb-assistant-app merges local user-allowlist.json here. Cloud WASM typically ignores. |
| followDetailPages | boolean | no | When true (default): follow article/detail links from the list (PBOC tiaofasi: emit detail-page HTML URLs for indexing; list-page file hrefs still collected). When false: only file-like hrefs on the list HTML. |
| maxNestedUrlsPerLevel | integer | no | Cap of distinct URLs to follow per nested level. Default 12, max 40. |
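
The documented defaults and limits can be collected into a small helper that builds an input object. This is a minimal sketch, not the gene's own validation: the helper name build_input is hypothetical, the lower bound of 0 is an assumption, and rejecting out-of-range values (rather than clamping them) is a choice made here. Parameters without a documented default, such as followPagination and linkScope, are left unset unless passed explicitly.

```python
from typing import Any

# Defaults as stated in the Input table.
DEFAULTS: dict[str, Any] = {
    "maxPages": 1,
    "collectPdf": True,
    "maxTotalFetches": 200,
    "nestedLinkDepth": 0,
    "followDetailPages": True,
    "maxNestedUrlsPerLevel": 12,
}
# Documented upper bounds for the integer parameters.
MAXIMA: dict[str, int] = {
    "maxPages": 50,
    "maxTotalFetches": 200,
    "nestedLinkDepth": 2,
    "maxNestedUrlsPerLevel": 40,
}


def build_input(seed_url: str, **overrides: Any) -> dict[str, Any]:
    """Merge overrides onto the documented defaults and check the limits."""
    if not seed_url.startswith(("http://", "https://")):
        raise ValueError("seedUrl must be an HTTP(S) URL")
    params = {"seedUrl": seed_url, **DEFAULTS, **overrides}
    for key, upper in MAXIMA.items():
        value = params[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{key} must be an integer")
        if not 0 <= value <= upper:  # lower bound 0 is an assumption
            raise ValueError(f"{key} must be between 0 and {upper}")
    return params
```

A call such as build_input("https://example.gov.cn/wxzw/zcfg/", followPagination=True, maxPages=5) yields an object with maxPages set to 5 and the remaining documented defaults filled in; the host here is a placeholder, and a real seedUrl must be on the allowed-domains list.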

Output

| Property | Type | Req |
| --- | --- | --- |
| site | string | yes |
| error | string | no |
| items | array | yes |

Raw JSON Schema

inputSchema

{
  "type": "object",
  "required": [
    "seedUrl"
  ],
  "properties": {
    "seedUrl": {
      "type": "string",
      "description": "HTTP(S) page URL; host must be in network.allowedDomains (whitelisted govt sites)."
    },
    "maxPages": {
      "type": "integer",
      "description": "Max pages to fetch including the seed, when followPagination. Default 1, max 50."
    },
    "linkScope": {
      "enum": [
        "default",
        "single_page_downloadable"
      ],
      "type": "string",
      "description": "single_page: only the seed list screen (no pagination/nested); still list→detail on that screen for whitelisted sites. default: pagination and nested depth when enabled."
    },
    "collectPdf": {
      "type": "boolean",
      "description": "Include .pdf in items. Default true."
    },
    "maxTotalFetches": {
      "type": "integer",
      "description": "Max HTTP GETs per invocation. Default 200, max 200."
    },
    "nestedLinkDepth": {
      "type": "integer",
      "description": "0, 1, or 2. Same-origin navigable hrefs: fetch 1 or 2 levels of child pages to collect more file links. Default 0."
    },
    "followPagination": {
      "type": "boolean",
      "description": "If true, try to follow rel=next, next link, 下一页; maxPages; SPA may not work."
    },
    "extraAllowedHosts": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "description": "Optional extra hostnames for link/download allowlist; kb-assistant-app merges local user-allowlist.json here. Cloud WASM typically ignores."
    },
    "followDetailPages": {
      "type": "boolean",
      "description": "When true (default): follow article/detail links from the list (PBOC tiaofasi: emit detail-page HTML URLs for indexing; list-page file hrefs still collected). When false: only file-like hrefs on the list HTML."
    },
    "maxNestedUrlsPerLevel": {
      "type": "integer",
      "description": "Cap of distinct URLs to follow per nested level. Default 12, max 40."
    }
  }
}

outputSchema

{
  "type": "object",
  "required": [
    "site",
    "items"
  ],
  "properties": {
    "site": {
      "type": "string"
    },
    "error": {
      "type": "string"
    },
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "required": [
          "url",
          "title"
        ],
        "properties": {
          "url": {
            "type": "string"
          },
          "title": {
            "type": "string"
          }
        }
      }
    }
  }
}
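
A caller might check a result against the outputSchema before using it. The following is a sketch using only the standard library; the function name parse_output is hypothetical, and treating a non-empty error string as a failure is an assumption, since the page does not say when error is populated.

```python
from typing import Any


def parse_output(result: dict[str, Any]) -> list[tuple[str, str]]:
    """Check the documented output shape and return (title, url) pairs."""
    for key in ("site", "items"):
        if key not in result:
            raise ValueError(f"missing required field: {key}")
    if not isinstance(result["site"], str):
        raise ValueError("site must be a string")
    if not isinstance(result["items"], list):
        raise ValueError("items must be an array")

    # Assumption: a non-empty error string signals a failed invocation.
    error = result.get("error")
    if error:
        raise RuntimeError(str(error))

    pairs: list[tuple[str, str]] = []
    for item in result["items"]:
        if not isinstance(item, dict):
            raise ValueError("each item must be an object")
        url, title = item.get("url"), item.get("title")
        if not isinstance(url, str) or not isinstance(title, str):
            raise ValueError("each item needs string url and title")
        pairs.append((title, url))
    return pairs
```

Returning plain (title, url) tuples keeps the sketch independent of any particular indexing pipeline; a full JSON Schema validator could replace the manual checks if one is available.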