
law-site-link-discovery

Hybrid · knowledge.webimport

Link discovery for government and law portals. Collects same-origin direct file links (PDF/DOCX/…/ZIP). Optional behavior: (1) follow list pagination via rel=next and next-page heuristics; (2) follow nested same-origin links to depth 1–2 to collect more files; (3) collectPdf toggles whether .pdf links appear in the results. PBOC tiaofasi: fetches the article list plus each article's detail page. NFRA/gov: optional self-page HTML. maxTotalFetches caps the number of GET requests. No filesystem access.
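
The core step described above, collecting same-origin direct file links from one fetched page, might look roughly like the following. This is a minimal sketch and not the gene's actual implementation: the function name `discover_file_links`, the helper class, and the extension list are hypothetical (the description only names PDF, DOCX, and ZIP explicitly), and "same origin" is assumed to mean matching scheme and host.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Hypothetical extension list; the description elides the full set.
FILE_EXTENSIONS = (".pdf", ".docx", ".zip")


class _AnchorCollector(HTMLParser):
    """Collects (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def discover_file_links(html, page_url, collect_pdf=True):
    """Return same-origin direct-file links as {"url", "title"} dicts."""
    parser = _AnchorCollector()
    parser.feed(html)
    base = urlsplit(page_url)
    items, seen = [], set()
    for href, text in parser.links:
        absolute = urljoin(page_url, href)
        parts = urlsplit(absolute)
        # Assumed same-origin test: scheme and host[:port] match the page.
        if (parts.scheme, parts.netloc) != (base.scheme, base.netloc):
            continue
        path = parts.path.lower()
        if not path.endswith(FILE_EXTENSIONS):
            continue
        # Mirrors the collectPdf toggle: drop .pdf links when disabled.
        if path.endswith(".pdf") and not collect_pdf:
            continue
        if absolute in seen:
            continue
        seen.add(absolute)
        items.append({"url": absolute, "title": text or absolute})
    return items
```

The sketch takes already-fetched HTML as a string, so it performs no network or filesystem access itself; fetching and the maxTotalFetches budget would sit in the caller.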

v0.3.0 · July 5, 2026
Newer version available: v0.4.1

README

No documentation yet.

The gene author can add a README when publishing.

Phenotype

Input

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| seedUrl | string | Yes | HTTP(S) page URL to fetch. |
| maxPages | integer | No | Max pages to fetch, including the seed, when followPagination is true. Default 1, max 50. |
| linkScope | enum: default, single_page_downloadable | No | single_page_downloadable: one GET (PBOC still follows details); disables pagination and nested links for other sites. default: full rules apply. |
| collectPdf | boolean | No | Include .pdf links in items. Default true. |
| maxTotalFetches | integer | No | Max HTTP GETs per invocation. Default 200, max 200. |
| nestedLinkDepth | integer | No | 0, 1, or 2. For same-origin navigable hrefs, fetch 1 or 2 levels of child pages to collect more file links. Default 0. |
| followPagination | boolean | No | If true, try to follow rel=next, a next link, or 下一页 ("next page"), up to maxPages. May not work on SPAs. |
| followDetailPages | boolean | No | PBOC tiaofasi: also fetch the detail page of each article on the list. Default true. |
| maxNestedUrlsPerLevel | integer | No | Cap on distinct URLs to follow per nested level. Default 12, max 40. |
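
The followPagination and maxPages behavior listed above could be sketched as follows. This is an illustration under assumptions, not the gene's code: the names `find_next_page` and `crawl_pages`, the injected `fetch` callable, and the exact label set in `NEXT_TEXTS` are all hypothetical. It prefers an explicit rel=next and falls back to matching anchor text.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed heuristic labels; the source only mentions "next link" and 下一页.
NEXT_TEXTS = {"下一页", "next", "next page"}


class _NextFinder(HTMLParser):
    """Records the first rel=next href and the first next-labelled anchor."""

    def __init__(self):
        super().__init__()
        self.rel_next = None
        self.text_next = None
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag in ("a", "link"):
            rels = (a.get("rel") or "").lower().split()
            if "next" in rels and a.get("href") and self.rel_next is None:
                self.rel_next = a["href"]
        if tag == "a":
            self._href = a.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = "".join(self._text).strip().lower()
            if label in NEXT_TEXTS and self.text_next is None:
                self.text_next = self._href
            self._href = None


def find_next_page(html, page_url):
    """Return the absolute URL of the next list page, or None."""
    finder = _NextFinder()
    finder.feed(html)
    href = finder.rel_next or finder.text_next
    return urljoin(page_url, href) if href else None


def crawl_pages(seed_url, fetch, max_pages=1):
    """Visit up to max_pages pages (seed included), following next links."""
    pages, url, seen = [], seed_url, set()
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)
        html = fetch(url)  # fetch is any callable mapping URL -> HTML text
        pages.append(url)
        url = find_next_page(html, url)
    return pages
```

Because the sketch only parses static HTML, it shares the limitation noted in the table: a single-page application that renders its pager with JavaScript exposes no next link to find.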

Output

| Property | Type | Required |
| --- | --- | --- |
| site | string | Yes |
| error | string | No |
| items | array | Yes |
Raw JSON Schema

inputSchema

{
  "type": "object",
  "required": [
    "seedUrl"
  ],
  "properties": {
    "seedUrl": {
      "type": "string",
      "description": "HTTP(S) page URL to fetch."
    },
    "maxPages": {
      "type": "integer",
      "description": "Max pages to fetch including the seed, when followPagination. Default 1, max 50."
    },
    "linkScope": {
      "enum": [
        "default",
        "single_page_downloadable"
      ],
      "type": "string",
      "description": "single_page: one GET (PBOC still follows details); disables pagination and nested for other sites. default: full rules below."
    },
    "collectPdf": {
      "type": "boolean",
      "description": "Include .pdf in items. Default true."
    },
    "maxTotalFetches": {
      "type": "integer",
      "description": "Max HTTP GETs per invocation. Default 200, max 200."
    },
    "nestedLinkDepth": {
      "type": "integer",
      "description": "0, 1, or 2. Same-origin navigable hrefs: fetch 1 or 2 levels of child pages to collect more file links. Default 0."
    },
    "followPagination": {
      "type": "boolean",
      "description": "If true, try to follow rel=next, next link, 下一页; maxPages; SPA may not work."
    },
    "followDetailPages": {
      "type": "boolean",
      "description": "PBOC tiaofasi: also fetch each article detail on the list (default true)."
    },
    "maxNestedUrlsPerLevel": {
      "type": "integer",
      "description": "Cap of distinct URLs to follow per nested level. Default 12, max 40."
    }
  }
}
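
The schema states defaults and maxima only in its description strings, so a caller may want to apply them before invoking the gene. The sketch below shows one way to do that. The function name `normalize_input` is hypothetical, and several choices are assumptions rather than documented behavior: clamping out-of-range values instead of rejecting them, the lower bounds of 1, a followPagination default of false, and a linkScope default of "default".

```python
ALLOWED_LINK_SCOPES = ("default", "single_page_downloadable")


def normalize_input(raw):
    """Apply the documented defaults and caps to a raw input dict."""
    seed = raw.get("seedUrl")
    if not isinstance(seed, str) or not seed.startswith(("http://", "https://")):
        raise ValueError("seedUrl must be an HTTP(S) URL")

    link_scope = raw.get("linkScope", "default")  # assumed default
    if link_scope not in ALLOWED_LINK_SCOPES:
        raise ValueError("linkScope must be one of %r" % (ALLOWED_LINK_SCOPES,))

    def clamp(name, default, low, high):
        # Assumption: clamp into range rather than reject the request.
        return max(low, min(high, int(raw.get(name, default))))

    return {
        "seedUrl": seed,
        "linkScope": link_scope,
        # Defaults and maxima come from the schema descriptions;
        # the lower bounds of 1 are assumptions.
        "maxPages": clamp("maxPages", 1, 1, 50),
        "maxTotalFetches": clamp("maxTotalFetches", 200, 1, 200),
        "nestedLinkDepth": clamp("nestedLinkDepth", 0, 0, 2),
        "maxNestedUrlsPerLevel": clamp("maxNestedUrlsPerLevel", 12, 1, 40),
        "collectPdf": bool(raw.get("collectPdf", True)),
        "followDetailPages": bool(raw.get("followDetailPages", True)),
        # Assumed default: the schema does not state one.
        "followPagination": bool(raw.get("followPagination", False)),
    }
```

Normalizing on the caller's side keeps requests predictable even if the gene itself would reject or silently adjust out-of-range values; which of those it does is not stated in the schema.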

outputSchema

{
  "type": "object",
  "required": [
    "site",
    "items"
  ],
  "properties": {
    "site": {
      "type": "string"
    },
    "error": {
      "type": "string"
    },
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "required": [
          "url",
          "title"
        ],
        "properties": {
          "url": {
            "type": "string"
          },
          "title": {
            "type": "string"
          }
        }
      }
    }
  }
}
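
A consumer might check a result against the outputSchema above before using it. The following is a minimal hand-written check using only the standard library, not a full JSON Schema validator; the function name `check_output` is hypothetical, and it covers only the constraints visible in the schema (required keys and string/array/object types).

```python
def check_output(result):
    """Return a list of problems; an empty list means the shape matches."""
    if not isinstance(result, dict):
        return ["result is not an object"]
    problems = []
    if not isinstance(result.get("site"), str):
        problems.append("site: missing or not a string")
    # error is optional, but must be a string when present.
    if "error" in result and not isinstance(result["error"], str):
        problems.append("error: not a string")
    items = result.get("items")
    if not isinstance(items, list):
        problems.append("items: missing or not an array")
        return problems
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"items[{i}]: not an object")
            continue
        for key in ("url", "title"):
            if not isinstance(item.get(key), str):
                problems.append(f"items[{i}].{key}: missing or not a string")
    return problems
```

Note that the schema marks items as required even when error is set, so a failed invocation would presumably still carry an items array, possibly empty; that reading is an inference from the schema, not documented behavior.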