Description

Verify numerical data against original papers and maintain traceable provenance for every value in datasets, tables, and plots. Includes citation source mana...

Usage (SKILL.md)

Data Source Verification

Name: Data Source Verification
Author: larry-of-cosmotim

A systematic workflow for verifying that every data point in a research dataset can be traced back to its original source paper, figure, table, or text passage.

When to Use

Building datasets from literature (CSV, JSON, tables)
Populating tables or plots with values from multiple papers
Reviewing existing datasets for data integrity
Before submitting any paper that includes compiled data

Core Rule

Every numerical value must be traceable to a specific location in the original paper. If you cannot find the value in the cited source, it is unverified and must be flagged — never included as confirmed data.

Data Provenance Chain

Source PDF → CITATION.md (extracted values) → CSV/data table → LaTeX manuscript

Every link in this chain must be auditable. If someone asks "where did this number come from?", the answer should be: paper X, Table Y, column Z — and we have the PDF to prove it.

Citation Source Management

Project Setup (`init`)

Create a Citation_Sources/ directory for the project:

Citation_Sources/
  AuthorLastName_Year_Journal_ShortTitle/
    Author_Year_Topic.pdf          ← original paper
    Author_Year_Topic_SI.pdf       ← supplementary info (if any)
    CITATION.md                    ← structured metadata + data provenance

CITATION.md Template

Every cited paper gets a CITATION.md file:

# Author et al. Year — Short Description
**Title**: Full title
**Authors**: Author list
**Journal**: Journal Vol, Pages (Year)
**DOI**: 10.xxxx/xxxxx
**Data used**: [exact values extracted, with table/figure reference]
**PDF**: ✅ Confirmed | ❌ NOT DOWNLOADED — [reason]
**Status**: CONFIRMED | ⚠️ NEEDS CONFIRM — [reason]
**Notes**: [any caveats, discrepancies, proxy assumptions]
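
The template can be checked mechanically. The sketch below parses the `**Key**: value` lines and lists the fields that are still empty. The function names are hypothetical, and treating Notes as optional is an assumption rather than something the template states.

```python
import re

# Fields that must be filled in; Notes is treated as optional (an assumption).
REQUIRED_KEYS = ("Title", "Authors", "Journal", "DOI", "Data used", "PDF", "Status")
FIELD_RE = re.compile(r"^\*\*(?P<key>[^*]+)\*\*:\s*(?P<value>.*)$")


def parse_citation(text: str) -> dict[str, str]:
    """Collect the '**Key**: value' lines of a CITATION.md file into a dict."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            fields[match["key"].strip()] = match["value"].strip()
    return fields


def empty_fields(text: str) -> list[str]:
    """Return the required keys that are missing or have no value."""
    fields = parse_citation(text)
    return [key for key in REQUIRED_KEYS if not fields.get(key)]


sample = "\n".join([
    "# Cheng et al. 2021 - example entry",
    "**Title**: placeholder title",
    "**Authors**: placeholder authors",
    "**Journal**: placeholder journal",
    "**DOI**: 10.1002/smll.202101693",
    "**Data used**:",
    "**PDF**: ✅ Confirmed",
    "**Status**: CONFIRMED",
    "**Notes**:",
])
print(empty_fields(sample))  # ['Data used']
```

A field left at its bracketed placeholder text would still count as filled here, so this only catches blank entries.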

Adding a Source (`add`)

When adding a new citation:

Create the folder: Citation_Sources/AuthorLastName_Year_Journal_ShortTitle/
Download the original PDF — always try to get the actual paper, not just the abstract
Download supplementary information if it contains data
Create CITATION.md from the template
Extract the specific values you need, recording exact table/figure/page locations
Mark the PDF status and verification status
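
The folder and stub creation in these steps could be scripted roughly as follows. This is a minimal sketch: `add_source` and `CITATION_STUB` are hypothetical names, the default status wording is an assumption, and downloading the PDF is deliberately left to the user.

```python
from pathlib import Path

# Stub following the CITATION.md template; the default wording is an assumption.
CITATION_STUB = """# {author} et al. {year} - {short_title}
**Title**:
**Authors**:
**Journal**: {journal}
**DOI**: {doi}
**Data used**:
**PDF**: ❌ NOT DOWNLOADED - not fetched yet
**Status**: ⚠️ NEEDS CONFIRM - values not yet checked against the PDF
**Notes**:
"""


def add_source(project_root: Path, author: str, year: int, journal: str,
               short_title: str, doi: str) -> Path:
    """Create the source folder and a CITATION.md stub without overwriting an existing one."""
    folder = (project_root / "Citation_Sources"
              / f"{author}_{year}_{journal}_{short_title}")
    folder.mkdir(parents=True, exist_ok=True)
    citation = folder / "CITATION.md"
    if not citation.exists():  # never clobber provenance that was already recorded
        citation.write_text(
            CITATION_STUB.format(author=author, year=year, journal=journal,
                                 short_title=short_title, doi=doi),
            encoding="utf-8",
        )
    return folder
```

A new entry starts out as not downloaded and unconfirmed, so it cannot pass for verified before someone has opened the paper.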

Verification Workflow

Step 1: Collect with Provenance

When extracting data from a paper, record ALL of the following for each value:

Value: 0.65 W/m·K
Paper: Cheng et al. 2021
DOI: 10.1002/smll.202101693
Location: Table 2, row 3
Method: TDTR (time-domain thermoreflectance)
Data type: Experimental
Verified: YES — value confirmed in Table 2

Never record a value without filling in the Location, Data type, and Verified fields.
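
One way to enforce this rule is to keep each value in a small record type that reports its own gaps. The sketch below is illustrative only: the class and field names are assumptions, not part of the skill.

```python
from dataclasses import dataclass

# The three fields that must never be left blank.
REQUIRED_FIELDS = ("location", "data_type", "verified")


@dataclass
class ProvenanceRecord:
    value: float
    unit: str
    paper: str
    doi: str
    location: str = ""
    method: str = ""
    data_type: str = ""
    verified: str = ""

    def missing_fields(self) -> list[str]:
        """Return the required provenance fields that are still empty."""
        return [name for name in REQUIRED_FIELDS if not getattr(self, name).strip()]


record = ProvenanceRecord(
    value=0.65,
    unit="W/m·K",
    paper="Cheng et al. 2021",
    doi="10.1002/smll.202101693",
    location="Table 2, row 3",
    method="TDTR (time-domain thermoreflectance)",
    data_type="Experimental",
    verified="YES - value confirmed in Table 2",
)
print(record.missing_fields())  # []
```

A record should only be written to the dataset when `missing_fields()` returns an empty list.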

Step 2: Verify Against Original

For each data point:

Always download the original PDF — don't trust web scraping, abstracts, or secondary sources
Find the exact value in a table, figure, or text passage
Record where you found it — table number, figure number, page, equation
Note the measurement method — experimental technique, simulation, estimate
Check units — convert if needed, note the original units
Track the data type: DFT-calculated, experimentally measured, or derived (note assumptions)

If the paper is behind a paywall and you cannot verify:

Mark as ⚠️ NEEDS CONFIRM — paywall
Note this limitation in CITATION.md

Step 3: Cross-Check the Full Chain

Verify consistency at every step:

Value in PDF → Value in CITATION.md → Value in data table/CSV → Value in manuscript

Any mismatch at any step is a flag.
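
The chain comparison can be sketched as a pairwise check between adjacent stages. The stage keys and the tolerance are assumptions, and the manuscript value in the example is a made-up transposition error used only to show a flag.

```python
import math

# Stages of the provenance chain, in order; the key names are illustrative.
STAGES = ("pdf", "citation_md", "csv", "manuscript")


def chain_mismatches(values: dict[str, float],
                     rel_tol: float = 1e-9) -> list[tuple[str, str]]:
    """Compare each stage with the next and return the adjacent pairs that disagree."""
    mismatches = []
    for earlier, later in zip(STAGES, STAGES[1:]):
        if not math.isclose(values[earlier], values[later], rel_tol=rel_tol):
            mismatches.append((earlier, later))
    return mismatches


# Hypothetical case: digits transposed when the value was typed into the manuscript.
chain = {"pdf": 0.69, "citation_md": 0.69, "csv": 0.69, "manuscript": 0.96}
print(chain_mismatches(chain))  # [('csv', 'manuscript')]
```

Reporting the pair rather than a bare pass or fail shows at which hand-off the value changed.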

Step 4: Flag Problems

Mark any value with one of these status levels:

| Status | Meaning | Action |
|---|---|---|
| `VERIFIED` | Found exact value in cited paper at stated location | Include in dataset |
| `APPROXIMATE` | Value is close but not exact (e.g., read from figure) | Include with note |
| `UNVERIFIED` | Cannot find value in cited paper | Flag: do not use without user approval |
| `MISATTRIBUTED` | Cited paper does not contain this data at all | Remove from dataset, alert user immediately |
| `ESTIMATED` | Value was calculated or estimated, not directly measured | Include with clear label |
| `⚠️ NEEDS CONFIRM` | PDF not available (paywall) or value needs double-check | Flag for manual verification |
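
The status levels map naturally onto an enumeration, with the Action column deciding which rows may enter the dataset. The sketch below is one possible encoding: the names `Status`, `parse_status`, and `split_by_status` are hypothetical, and the row layout (a dict with a `status` key) is an assumption.

```python
from enum import Enum


class Status(Enum):
    VERIFIED = "VERIFIED"
    APPROXIMATE = "APPROXIMATE"
    UNVERIFIED = "UNVERIFIED"
    MISATTRIBUTED = "MISATTRIBUTED"
    ESTIMATED = "ESTIMATED"
    NEEDS_CONFIRM = "NEEDS CONFIRM"


# Statuses whose action is "include" (some only with a note or label).
INCLUDABLE = {Status.VERIFIED, Status.APPROXIMATE, Status.ESTIMATED}


def parse_status(label: str) -> Status:
    """Map a status label to a Status, ignoring backticks and the warning emoji."""
    cleaned = "".join(ch for ch in label if ch.isascii() and (ch.isalpha() or ch == " "))
    return Status(cleaned.strip())


def split_by_status(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate rows that may enter the dataset from rows that need attention."""
    include, flagged = [], []
    for row in rows:
        target = include if parse_status(row["status"]) in INCLUDABLE else flagged
        target.append(row)
    return include, flagged


rows = [
    {"property": "kappa", "status": "VERIFIED"},
    {"property": "v_longitudinal", "status": "MISATTRIBUTED"},
    {"property": "density", "status": "⚠️ NEEDS CONFIRM"},
]
include, flagged = split_by_status(rows)
print(len(include), len(flagged))  # 1 2
```

An unknown label raises `ValueError`, which is deliberate: a misspelled status should not slip into the dataset unnoticed.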

Step 5: Flag Discrepancies

When multiple sources report different values for the same quantity:

Record both values with their sources
Note the discrepancy explicitly (e.g., "B = 45 GPa (Author A, Table 2) vs B = 86 GPa (Author B, Fig. 3)")
Check if the difference is due to measurement method, sample preparation, or temperature
Let the user decide which value to use — do not silently pick one
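
Discrepancies can be surfaced automatically before a person decides between them. The sketch below groups rows by material and property and reports groups whose values spread by more than a relative tolerance. The 5% default, the row layout, and the material name are assumptions, and all values are assumed to share the same unit.

```python
from collections import defaultdict


def find_discrepancies(rows: list[dict], rel_tol: float = 0.05) -> list[tuple]:
    """Report (material, property) groups whose values differ by more than rel_tol."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["material"], row["property"])].append(row)

    report = []
    for key, members in groups.items():
        values = [m["value"] for m in members]
        low, high = min(values), max(values)
        if high - low > rel_tol * abs(high):
            # Keep every value with its source; choosing one is left to the user.
            report.append((key, [(m["value"], m["source"]) for m in members]))
    return report


rows = [
    {"material": "ExampleMaterial", "property": "B", "value": 45.0,
     "source": "Author A, Table 2"},
    {"material": "ExampleMaterial", "property": "B", "value": 86.0,
     "source": "Author B, Fig. 3"},
]
for key, entries in find_discrepancies(rows):
    print(key, entries)
```

The function only reports. It never drops or averages values, so the decision stays with the user.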

Dataset Format

When building compiled datasets, always include provenance columns:

CSV format:

Material,Property,Value,Unit,Source_Paper,DOI,Source_Location,Method,Data_Type,Verified,Notes
Li6PS5Cl,kappa,0.69,W/m·K,Cheng 2021,10.1002/smll.202101693,Table 2,TDTR,experimental,YES,
Li3InCl6,v_longitudinal,2800,m/s,Asano 2018,10.1002/adma.201803075,NOT FOUND,Unknown,unknown,MISATTRIBUTED,Paper contains no Li3InCl6 sound velocity data

JSON format:

{
  "material": "Li6PS5Cl",
  "property": "thermal_conductivity",
  "value": 0.69,
  "unit": "W/m·K",
  "source": {
    "paper": "Cheng et al. 2021",
    "doi": "10.1002/smll.202101693",
    "location": "Table 2, row 5",
    "method": "TDTR",
    "dataType": "experimental",
    "verified": true
  }
}
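
The two formats carry the same fields, so one can be derived from the other. The sketch below converts CSV rows into the nested JSON layout using only the standard library. The function name `row_to_record` is hypothetical, values are copied through as they appear in the CSV, and the Notes column is omitted because the JSON example has no such field.

```python
import csv
import io
import json


def row_to_record(row: dict) -> dict:
    """Map one flat CSV row onto the nested JSON layout."""
    return {
        "material": row["Material"],
        "property": row["Property"],
        "value": float(row["Value"]),
        "unit": row["Unit"],
        "source": {
            "paper": row["Source_Paper"],
            "doi": row["DOI"],
            "location": row["Source_Location"],
            "method": row["Method"],
            "dataType": row["Data_Type"],
            # Anything other than an explicit YES counts as not verified.
            "verified": row["Verified"] == "YES",
        },
    }


CSV_TEXT = """\
Material,Property,Value,Unit,Source_Paper,DOI,Source_Location,Method,Data_Type,Verified,Notes
Li6PS5Cl,kappa,0.69,W/m·K,Cheng 2021,10.1002/smll.202101693,Table 2,TDTR,experimental,YES,
Li3InCl6,v_longitudinal,2800,m/s,Asano 2018,10.1002/adma.201803075,NOT FOUND,Unknown,unknown,MISATTRIBUTED,Paper contains no Li3InCl6 sound velocity data
"""

records = [row_to_record(row) for row in csv.DictReader(io.StringIO(CSV_TEXT))]
print(json.dumps(records[0], ensure_ascii=False, indent=2))
```

Because `verified` is a boolean in the JSON form, a status such as MISATTRIBUTED collapses to `false`. Keep the CSV (or add a status field) if that distinction matters downstream.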

Audit Workflow (`audit`)

Scan all CITATION.md files and generate a report:

List all unique sources in Citation_Sources/
For each source, check:
- PDF downloaded? (✅ or ❌)
- CITATION.md complete? (all fields filled)
- Values confirmed against PDF?
Generate audit summary:

## Audit Report — [Project Name]
Date: [timestamp]

### Summary
- Total sources: [N]
- PDFs confirmed: [N] / [N]
- Values verified: [N] / [N]
- Needs confirmation: [N]
- Missing PDFs: [N]

### Source Details

| Paper | PDF | Values | Verified | Status |
|---|---|---|---|---|
| Cheng 2021 | ✅ | 3 | 3/3 | CONFIRMED |
| Asano 2018 | ✅ | 2 | 1/2 | ⚠️ 1 MISATTRIBUTED |
| Wang 2014 | ❌ | 4 | 0/4 | ⚠️ NEEDS CONFIRM |

### Flagged Values
- Li3InCl6 v_longitudinal: MISATTRIBUTED to Asano 2018; paper contains no Li3InCl6 data
- LGPS density: conflicting values (2.0 vs 1.9 g/cm³) between Wang 2014 and Kamaya 2011

Report findings — list verified, flagged, and misattributed values
Recommend action for each flagged value
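
The counting part of the audit could be automated along these lines. This is a rough sketch: `audit_sources` is a hypothetical name, a source is assumed to be any subfolder of Citation_Sources/, and a file counts as unconfirmed if the text "NEEDS CONFIRM" appears anywhere in its CITATION.md. It checks that files exist, not that the values match the PDF.

```python
from pathlib import Path


def audit_sources(citation_root: Path) -> dict[str, int]:
    """Count sources, PDFs, and unconfirmed CITATION.md files under citation_root."""
    summary = {
        "total_sources": 0,
        "pdfs_present": 0,
        "missing_pdfs": 0,
        "missing_citation_md": 0,
        "needs_confirmation": 0,
    }
    for folder in sorted(p for p in citation_root.iterdir() if p.is_dir()):
        summary["total_sources"] += 1
        if any(folder.glob("*.pdf")):
            summary["pdfs_present"] += 1
        else:
            summary["missing_pdfs"] += 1

        citation = folder / "CITATION.md"
        if not citation.exists():
            summary["missing_citation_md"] += 1
        elif "NEEDS CONFIRM" in citation.read_text(encoding="utf-8"):
            summary["needs_confirmation"] += 1
    return summary
```

Calling `audit_sources(Path("Citation_Sources"))` returns the counts for the Summary section. Checking each value against its PDF still has to be done by reading the paper.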

Export (`export`)

Generate a summary table of all data values and their provenance:

## Data Provenance Summary — [Project Name]

| Material | Property | Value | Unit | Source | Location | Data Type | Status |
|---|---|---|---|---|---|---|---|
| LLZTO | κ | 0.42 | W/m·K | Muy 2019 | Table 1 | experimental | VERIFIED |
| LAGP | v_avg | 4700 | m/s | Rohde 2021 | Table S2 | experimental | VERIFIED |
| Li3InCl6 | v_avg | 1849 | m/s | Qiu 2025 | Table 1 | DFT | VERIFIED |
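
A table in this shape can be produced from a list of records with a few lines of string formatting. The sketch below assumes each record is a dict keyed by the column headers. `to_markdown_table` is a hypothetical name, and cell values containing a pipe character are not escaped.

```python
COLUMNS = ["Material", "Property", "Value", "Unit",
           "Source", "Location", "Data Type", "Status"]


def to_markdown_table(rows: list[dict]) -> str:
    """Render provenance records as a pipe table with the columns above."""
    header = "| " + " | ".join(COLUMNS) + " |"
    divider = "|" + "---|" * len(COLUMNS)
    body = ["| " + " | ".join(str(row[col]) for col in COLUMNS) + " |" for row in rows]
    return "\n".join([header, divider, *body])


rows = [
    {"Material": "LLZTO", "Property": "κ", "Value": 0.42, "Unit": "W/m·K",
     "Source": "Muy 2019", "Location": "Table 1",
     "Data Type": "experimental", "Status": "VERIFIED"},
]
print(to_markdown_table(rows))
```

A record missing any column raises `KeyError`, so a value without full provenance cannot be exported silently.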

Red Flags

Watch for these indicators of unreliable data:

Value attributed to a paper but no specific table/figure cited
"Estimated from family properties" without a clear methodology
Values that appear in reviews but cannot be traced to original measurements
Round numbers that suggest estimation rather than measurement (e.g., 2800 m/s vs 2837 m/s)
Same value appearing in multiple papers without independent measurement
DFT values presented as experimental without noting the distinction
Discrepancies between different sources for the same quantity left unaddressed
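
The round-number indicator is easy to screen for automatically. The sketch below is a crude heuristic under an assumed threshold of two trailing zeros, and `looks_rounded` is a hypothetical helper. A hit is a prompt to recheck the source, not evidence that the value is wrong.

```python
def looks_rounded(value: float, min_trailing_zeros: int = 2) -> bool:
    """Heuristic: True if a whole number ends in at least min_trailing_zeros zeros."""
    if value == 0 or value != int(value):
        return False  # zero and non-integer values are not judged by this check
    digits = str(abs(int(value)))
    trailing = len(digits) - len(digits.rstrip("0"))
    return trailing >= min_trailing_zeros


print(looks_rounded(2800))  # True
print(looks_rounded(2837))  # False
```

Genuine measurements reported to two significant figures will also trigger it, so treat the result as a reason to reopen the cited table, not as a verdict.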

Rules

Never assume a citation is correct — always verify against the original paper
Always download the PDF — don't trust abstracts, web scraping, or secondary sources
Secondary sources are not verification — a review paper citing a value does not confirm it
Flag immediately when a value cannot be found in its cited source
Track data type — distinguish DFT-calculated, experimentally measured, and derived values
Flag discrepancies — when two sources disagree, note both values and let the user decide
Prefer measured over estimated — clearly label the difference
Document everything — future researchers need the audit trail
When in doubt, exclude — a smaller verified dataset beats a larger unverified one

Safe Usage Recommendations

This skill is a coherent, instruction-only workflow: it instructs the agent to download PDFs from the web and to create local folders and CITATION.md files that record provenance. Before using it, be aware that (1) it performs network downloads, so avoid giving the agent any publisher credentials or session cookies you don't want used; (2) it writes PDFs and metadata to your project directory, so make sure you have enough storage and the right to keep copies of those PDFs; and (3) the instructions rely on the agent's ability to fetch and parse PDFs, so result quality depends on the agent and host environment. If you want tighter control, run the workflow manually (download the PDFs yourself and place them in the Citation_Sources/ folders) or restrict the agent's network and file permissions.

Functional Analysis

Type: OpenClaw Skill
Name: data-source-verification
Version: 1.0.0

The skill bundle provides a legitimate and well-structured workflow for scientific data verification and provenance tracking. It instructs the agent to manage citations, download research PDFs via DOIs/URLs, and audit datasets for accuracy, with no evidence of malicious intent, data exfiltration, or unauthorized system access in SKILL.md or README.md.

Capability Assessment

✓ Purpose & Capability

The name and description (verify numerical data and maintain provenance) match the instructions: creating Citation_Sources/, extracting values from PDFs, populating CITATION.md, and producing audit/export outputs. There are no unrelated requirements (no cloud credentials, no unrelated binaries).

ℹ Instruction Scope

SKILL.md explicitly instructs agents to download original PDFs, extract specific table/figure locations, write CITATION.md files, and generate provenance reports. Those actions are appropriate for the stated purpose, but they involve network downloads and writing/reading project files on disk — behaviour the user should expect and control.

✓ Install Mechanism

This is instruction-only with no install spec and no bundled code; nothing will be written to disk by an installer. That limits the attack surface compared with skills that download and install binaries.

✓ Credentials

The skill declares no environment variables, credentials, or config paths. The tasks described (PDF download, local file writes) do not require additional secrets. No disproportionate credential access is requested.

✓ Persistence & Privilege

The skill is not always-enabled and is user-invocable. It does not request to modify other skills or system-wide settings. It will create and manage files within the project's Citation_Sources/ directory as part of normal operation.

Version History

v1.0.0

Initial release — provenance tracking, source audit workflow, verification statuses

Metadata

Slug data-source-verification

Version 1.0.0

License MIT-0

Total installs 1

Current installs 1

Version count 1
