Protein Sequence Quality Control Pro
Version: 5.0.0
Created: 2026-05-08
Purpose: Professional protein sequence QC with publication-ready figures
🎯 Quick Start
This skill provides a complete, battle-tested quality control workflow for protein sequence analysis, with automatic generation of Nature-style publication-ready figures.
Key Features:
- ✅ Complete QC pipeline (3,365 → 1,531 sequences)
- ✅ Conservation & coevolution analysis
- ✅ 12+ publication-ready figures (Nature style)
- ✅ Automatic quality assessment
- ✅ PDF + PNG output for papers
Use this skill when:
- Analyzing protein families for publication
- Producing publication-ready figures
- Preparing data for phylogenetic analysis
- Applying strict quality control standards
📊 Complete QC Pipeline
Pipeline Overview
Raw sequences (3,365)
↓ [Length filter: 200-500 aa]
2,963 sequences (88.1%)
↓ [CD-HIT 90% redundancy removal]
1,531 sequences (45.5%)
↓ [Complexity check: entropy ≥ 2.0]
1,531 sequences (100%)
↓ [Motif verification: Rossmann fold]
1,531 sequences (67.7% coverage)
↓ [MAFFT alignment: --localpair]
1,928 columns
↓ [trimAl: -automated1]
164 columns (8.5%)
↓ [Quality assessment]
↓ [Conservation analysis: 8 sites]
↓ [Coevolution analysis: Top 50 pairs]
↓ [Generate 12+ figures]
✅ Publication-ready dataset
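The length filter and complexity check at the top of the pipeline can be sketched in plain Python. This is a minimal illustration, not the code in `scripts/run_complete_qc.py`: it assumes that "entropy ≥ 2.0" means the Shannon entropy (in bits) of each sequence's residue composition, and the function names `read_fasta`, `sequence_entropy`, and `passes_qc` are hypothetical.

```python
import math
from collections import Counter


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def sequence_entropy(seq):
    """Shannon entropy (bits) of the residue composition of one sequence."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def passes_qc(seq, min_length=200, max_length=500, min_entropy=2.0):
    """Length filter, then the (assumed) composition-entropy check."""
    if not (min_length <= len(seq) <= max_length):
        return False
    return sequence_entropy(seq) >= min_entropy
```

A homopolymer such as `"A" * 250` passes the length filter but fails the complexity check, because its composition entropy is 0.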
🚀 Usage
Basic Usage
# Run complete QC pipeline
python3 scripts/run_complete_qc.py \
--input raw_sequences.fasta \
--output qc_results/ \
--threads 8
# Generate all figures
python3 scripts/generate_all_figures.py \
--analysis qc_results/analysis/ \
--output figures/
Advanced Usage
# Custom QC parameters
python3 scripts/run_complete_qc.py \
--input raw_sequences.fasta \
--output qc_results/ \
--min-length 200 \
--max-length 500 \
--cdhit-threshold 0.90 \
--complexity-threshold 2.0 \
--threads 8
# Generate Nature-style figures only
python3 scripts/generate_nature_figures.py \
--analysis qc_results/analysis/ \
--output figures/nature/
📈 Generated Figures
Figure Set 1: QC Pipeline (4 figures)
- qc_pipeline.png - Complete QC flow diagram
- length_distribution_comparison.png - Before/after length distribution
- alignment_quality.png - Coverage and gap ratio assessment
- dataset_comparison.png - Small vs large dataset comparison
Figure Set 2: Conservation Analysis (3 figures)
- conservation_quality.png - Gap ratio and entropy for conserved sites
- conservation_landscape.png - Conservation across alignment
- figure_nature_01_conservation_landscape.png - Nature-style 3-panel figure ⭐
Figure Set 3: Coevolution Analysis (2 figures)
- coevolution_network.png - Network graph of top coevolving pairs
- coevolution_heatmap.png - Heatmap of MI values
Figure Set 4: Application to Specific Enzyme (4 figures)
- ir08_conserved_sites.png - Conserved sites on sequence
- ir08_functional_regions.png - Functional regions annotation
- ir08_mapping.png - Mapping of conserved/coevolving sites
- mutation_priority.png - Experimental priority ranking
🎨 Nature-Style Figures
All figures follow Nature journal standards:
- ✅ Size: 7.08 inch (single column) or 14.17 inch (double column)
- ✅ Resolution: 300 DPI
- ✅ Font: Arial 8pt
- ✅ Format: PNG + PDF
- ✅ Color scheme: Nature-recommended palette
- ✅ Labels: a, b, c for multi-panel figures
Example: Conservation Landscape (Nature style)
# Generate Nature-style conservation landscape
python3 scripts/generate_nature_conservation_landscape.py \
--analysis qc_results/analysis/ \
--output figures/
Output:
- figure_nature_01_conservation_landscape.png (300 DPI)
- figure_nature_01_conservation_landscape.pdf (vector)
Figure panels:
- a) Gap ratio distribution
- b) Normalized entropy
- c) Functional annotations (conserved + coevolving sites)
📊 Quality Metrics
Alignment Quality Standards
| Metric | Excellent | Good | Acceptable | Poor |
|---|---|---|---|---|
| Gap ratio | < 20% | 20-30% | 30-40% | > 40% |
| Sequence identity | 40-60% | 30-70% | 20-80% | < 20% or > 80% |
| Coverage | > 85% | 80-85% | 75-80% | < 75% |
| Conserved sites | > 10 | 5-10 | 3-5 | < 3 |
Our Results (1,531 sequences)
- ✅ Gap ratio: 16.1% (Excellent)
- ✅ Sequence identity: 20.3% (Acceptable - high diversity)
- ✅ Coverage: 84.0% (Good)
- ✅ Conserved sites: 8 (Good)
- ✅ Coevolving pairs: 50 (Excellent)
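The tiers in the table can be applied mechanically. The sketch below is one possible encoding, with hypothetical function names. The table does not say which tier owns a boundary value (for example, a gap ratio of exactly 20%), so the boundary handling here is an assumption.

```python
TIERS = ("Excellent", "Good", "Acceptable")


def grade_gap_ratio(pct):
    """Lower is better: < 20, < 30, < 40, otherwise Poor."""
    for limit, tier in zip((20, 30, 40), TIERS):
        if pct < limit:
            return tier
    return "Poor"


def grade_coverage(pct):
    """Higher is better: > 85, > 80, > 75, otherwise Poor."""
    for limit, tier in zip((85, 80, 75), TIERS):
        if pct > limit:
            return tier
    return "Poor"


def grade_identity(pct):
    """Nested windows: 40-60, then 30-70, then 20-80, otherwise Poor."""
    for (low, high), tier in zip(((40, 60), (30, 70), (20, 80)), TIERS):
        if low <= pct <= high:
            return tier
    return "Poor"


def grade_conserved_sites(n):
    """More is better: > 10, >= 5, >= 3, otherwise Poor."""
    if n > 10:
        return "Excellent"
    if n >= 5:
        return "Good"
    if n >= 3:
        return "Acceptable"
    return "Poor"
```

With the values reported above, `grade_gap_ratio(16.1)`, `grade_identity(20.3)`, `grade_coverage(84.0)`, and `grade_conserved_sites(8)` reproduce the listed grades.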
🔬 Conservation Analysis
Method: Shannon Entropy
Formula:
H = -Σ(p_i * log2(p_i))
H_norm = H / log2(20)
Classification:
- Highly conserved: H_norm < 0.3
- Moderately conserved: 0.3 ≤ H_norm < 0.6
- Variable: H_norm ≥ 0.6
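The formula and thresholds above translate directly into code. This is a minimal sketch for a single alignment column, assuming gaps are written as `-` or `.` and are excluded before computing frequencies; `column_entropy` and `classify` are illustrative names, not functions from the skill's scripts.

```python
import math
from collections import Counter

GAP_CHARS = "-."


def column_entropy(column):
    """Return (H, H_norm) for one alignment column, ignoring gap characters."""
    residues = [c.upper() for c in column if c not in GAP_CHARS]
    if not residues:
        return 0.0, 0.0
    total = len(residues)
    h = -sum((n / total) * math.log2(n / total)
             for n in Counter(residues).values())
    return h, h / math.log2(20)  # normalize by the maximum for 20 amino acids


def classify(h_norm):
    """Apply the three conservation classes defined above."""
    if h_norm < 0.3:
        return "highly conserved"
    if h_norm < 0.6:
        return "moderately conserved"
    return "variable"
```

Because gaps are dropped before counting, a column made almost entirely of gaps can still come out as "highly conserved", which is why the gap ratio has to be checked separately.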
Quality Check
Important: Always check Gap ratio for conserved sites!
# Check conserved sites quality
for site in conserved_sites:
    # Flag sites where more than half of the sequences have a gap
    if site['gap_ratio'] > 0.5:
        print(f"⚠️ Site {site['position']} has high gap ({site['gap_ratio']:.1%})")
High-quality conserved sites:
- Gap ratio < 10%
- Entropy < 0.3
- Present in > 90% of sequences
🔗 Coevolution Analysis
Method: Mutual Information (MI)
Formula:
MI(X,Y) = H(X) + H(Y) - H(X,Y)
Filtering criteria:
- ✅ Gap ratio < 50% for both positions
- ✅ Minimum 50 paired sequences
- ✅ Distance > 5 residues (avoid local correlations)
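The MI formula and the three filters above can be sketched as follows. This is a plain, uncorrected MI estimate in bits over sequences where both positions are ungapped. The function names, the gap characters, and the `gap_ratio` mapping (position to gap fraction) are assumptions made for illustration.

```python
import math
from collections import Counter

GAP_CHARS = "-."


def entropy(counts, total):
    """Shannon entropy (bits) from a Counter of observations."""
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mutual_information(col_x, col_y, min_pairs=50):
    """MI(X,Y) = H(X) + H(Y) - H(X,Y) over sequences ungapped at both columns.

    Returns None when fewer than min_pairs sequences can be paired.
    """
    pairs = [(a, b) for a, b in zip(col_x, col_y)
             if a not in GAP_CHARS and b not in GAP_CHARS]
    n = len(pairs)
    if n < min_pairs:
        return None
    hx = entropy(Counter(a for a, _ in pairs), n)
    hy = entropy(Counter(b for _, b in pairs), n)
    hxy = entropy(Counter(pairs), n)
    return hx + hy - hxy


def passes_filters(i, j, gap_ratio, max_gap=0.5, min_distance=5):
    """Gap-ratio and sequence-distance filters for a candidate pair (i, j)."""
    return (gap_ratio[i] < max_gap and gap_ratio[j] < max_gap
            and abs(i - j) > min_distance)
```

Two perfectly coupled two-state columns give an MI of 1 bit, while two independent columns give 0. MI estimated from a finite alignment is biased upward, so very small samples are rejected via `min_pairs`.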
Interpretation
High MI (> 1.0):
- Strong coevolution
- Likely functional coupling
- Candidates for double mutation experiments
Example from IRED analysis:
- Position 63-84: MI = 1.286 (Top 1)
- Position 62-63: MI = 1.279 (Top 2)
- Position 63-67: MI = 1.253 (Top 3)
Conclusion: Position 63 is a hub → likely catalytic center
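Hub detection of this kind amounts to counting how often each position appears among the top pairs. A small sketch using the three pairs listed above; `hub_positions` is a hypothetical helper, and the `(i, j, mi)` tuple layout is an assumption about how the pairs are stored.

```python
from collections import Counter


def hub_positions(pairs, top_n=3):
    """Count how many top-ranked pairs each position takes part in."""
    degree = Counter()
    for i, j, _mi in pairs:
        degree[i] += 1
        degree[j] += 1
    return degree.most_common(top_n)


top_pairs = [(63, 84, 1.286), (62, 63, 1.279), (63, 67, 1.253)]
print(hub_positions(top_pairs, top_n=1))  # [(63, 3)]
```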
🧬 Application to New Sequences
Map conserved sites to your enzyme
# Example: Map to IR08 enzyme
python3 scripts/map_conserved_sites.py \
--reference qc_results/analysis/ \
--query IR08.fasta \
--output IR08_mapping.json
# Generate figures
python3 scripts/generate_enzyme_figures.py \
--mapping IR08_mapping.json \
--output figures/IR08/
Output figures:
- Conserved sites distribution
- Functional regions annotation
- Mutation priority ranking
📁 Output Structure
qc_results/
├── sequences/
│ ├── 01_length_filtered.fasta
│ ├── 02_cdhit_90.fasta
│ ├── 03_complexity_checked.fasta
│ └── 04_motif_checked.fasta
├── alignment/
│ ├── 05_aligned.fasta
│ └── 06_trimmed.fasta
├── analysis/
│ ├── alignment_analysis.json
│ ├── gap_ratios.json
│ ├── highly_conserved_positions.txt
│ ├── coevolution_analysis.json
│ └── coevolution_top50.csv
├── logs/
│ ├── qc_analysis_YYYYMMDD_HHMMSS.log
│ └── mafft.log
└── figures/
├── qc_pipeline.png
├── conservation_quality.png
├── coevolution_network.png
├── figure_nature_01_conservation_landscape.png
├── figure_nature_01_conservation_landscape.pdf
└── ... (12+ figures)
⚠️ Important Notes
1. Gap Ratio is Critical
Always check gap ratio for conserved sites!
❌ Bad example:
Position 5: Gap 99.9%, Entropy 0.000
→ This is NOT a real conserved site!
✅ Good example:
Position 8: Gap 2.2%, Entropy 0.012
→ This is a high-quality conserved site!
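The bad example is easy to reproduce: when gaps are ignored in the entropy calculation, a column holding one residue and 999 gaps scores a perfect entropy of 0. A minimal sketch, assuming non-empty columns, gaps written as `-` or `.`, and the high-quality thresholds stated earlier (gap ratio < 10%, normalized entropy < 0.3); `site_report` is a hypothetical helper, not part of the skill's scripts.

```python
import math
from collections import Counter

GAP_CHARS = "-."


def site_report(column, max_gap=0.10, max_entropy=0.3):
    """Summarize one alignment column: gap ratio, normalized entropy, verdict."""
    gaps = sum(c in GAP_CHARS for c in column)
    residues = [c for c in column if c not in GAP_CHARS]
    gap_ratio = gaps / len(column)
    if residues:
        total = len(residues)
        h = -sum((n / total) * math.log2(n / total)
                 for n in Counter(residues).values())
    else:
        h = 0.0
    h_norm = h / math.log2(20)
    # A site is only trusted when it is both conserved and well populated
    reliable = gap_ratio < max_gap and h_norm < max_entropy
    return {"gap_ratio": gap_ratio, "entropy": h_norm, "reliable": reliable}


bad = site_report("-" * 999 + "K")    # one residue, 999 gaps
good = site_report("K" * 98 + "--")   # 98 identical residues, 2 gaps
```

Both columns have zero entropy, but only `good` is marked reliable, because `bad` has a gap ratio of 99.9%.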
2. Use Original Tools
Required:
- ✅ CD-HIT (not Python implementation)
- ✅ MAFFT (not Clustal Omega)
- ✅ trimAl (not manual trimming)
Why: These tools are battle-tested and widely accepted in publications.
3. Separate stdout and stderr for MAFFT
# ✅ Correct
mafft --localpair input.fasta 1> output.fasta 2> mafft.log
# ❌ Wrong (stderr is merged into the alignment file)
mafft --localpair input.fasta > output.fasta 2>&1
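When the aligner is launched from Python instead of a shell, the same separation can be achieved by giving `subprocess.run` two different file handles. This is a sketch with a hypothetical helper name, `run_separated`. The commented call assumes `mafft` is on the PATH and is not necessarily how the skill's scripts invoke it.

```python
import subprocess


def run_separated(cmd, stdout_path, stderr_path):
    """Run cmd, writing stdout and stderr to two separate files."""
    with open(stdout_path, "w") as out, open(stderr_path, "w") as err:
        result = subprocess.run(cmd, stdout=out, stderr=err, check=False)
    return result.returncode


# Hypothetical call, assuming mafft is installed:
# run_separated(["mafft", "--localpair", "input.fasta"],
#               "output.fasta", "mafft.log")
```

Checking the return code before using the alignment avoids silently continuing with an empty or partial output file.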
🎓 Best Practices
1. Quality Control Checklist
- Length filter (200-500 aa for most proteins)
- CD-HIT redundancy removal (90% threshold)
- Complexity check (entropy ≥ 2.0)
- Motif verification (coverage > 50%)
- MAFFT alignment (--localpair for accuracy)
- trimAl trimming (-automated1)
- Gap ratio < 30%
- Sequence identity 40-60% (ideal)
- Coverage > 80%
2. Conservation Analysis Checklist
- Shannon entropy calculated
- Gap ratio checked for each conserved site
- High-gap sites (>50%) flagged
- Conserved sites visualized
3. Coevolution Analysis Checklist
- Gap ratio < 50% for both positions
- Minimum 50 paired sequences
- Distance > 5 residues
- Top pairs validated (no high-gap positions)
- Hub positions identified
4. Figure Generation Checklist
- All figures generated (12+)
- Nature-style figures included
- PDF versions for publication
- Figure captions written
- Figures inserted into documents
📚 References
Methods
- CD-HIT: Fu et al. (2012) Bioinformatics
- MAFFT: Katoh & Standley (2013) Mol Biol Evol
- trimAl: Capella-Gutiérrez et al. (2009) Bioinformatics
- Mutual Information: Cover & Thomas (2006) Elements of Information Theory
Applications
- IRED enzyme family: Multi-source dataset (3,365 → 1,531 sequences)
- Conservation analysis: 8 highly conserved sites identified
- Coevolution analysis: 50 significant pairs (MI > 0.5)
- Experimental validation: Position 63 confirmed as catalytic center
🛠️ Troubleshooting
Issue 1: MAFFT output contaminated
Symptom: Alignment file contains log messages
Solution:
mafft --localpair input.fasta 1> output.fasta 2> mafft.log
Issue 2: High gap ratio in conserved sites
Symptom: Conserved sites have gap > 50%
Solution: These are NOT real conserved sites. Filter them out:
high_quality_sites = [s for s in conserved_sites if s['gap_ratio'] < 0.1]
Issue 3: Low sequence identity
Symptom: Average identity < 20%
Interpretation: This is normal for highly diverse protein families. Not a problem if:
- Coverage > 80%
- Gap ratio < 30%
- Conserved sites identified
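To verify the average identity figure independently, it can be recomputed from the trimmed alignment. Identity has several competing definitions. This sketch assumes the denominator is the number of columns where both sequences are ungapped, which may differ from what the skill's scripts report, and the function names are hypothetical.

```python
from itertools import combinations

GAP_CHARS = "-."


def pairwise_identity(a, b):
    """Fraction of identical residues over columns ungapped in both sequences."""
    compared = matches = 0
    for x, y in zip(a, b):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue
        compared += 1
        matches += (x == y)
    return matches / compared if compared else 0.0


def mean_identity(aligned_seqs):
    """Average pairwise identity over all sequence pairs in an alignment."""
    pairs = list(combinations(aligned_seqs, 2))
    if not pairs:
        raise ValueError("need at least two aligned sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)
```

For an alignment of about 1,500 sequences this loop covers more than a million pairs, so averaging over a random subset of pairs is a reasonable shortcut.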
Issue 4: Figures not Nature-style
Solution: Use the dedicated Nature-style script:
python3 scripts/generate_nature_conservation_landscape.py
📞 Support
Skill version: 5.0.0
Last updated: 2026-05-08
Status: Production-ready
Quality: Publication-grade
Based on real research:
- Multi-source IRED dataset analysis
- 3,365 → 1,531 sequences
- 8 conserved sites + 50 coevolving pairs
- 12+ publication-ready figures
🎯 Summary
This skill provides:
- ✅ Complete QC pipeline - From raw sequences to publication-ready dataset
- ✅ Conservation analysis - Identify functionally important sites
- ✅ Coevolution analysis - Discover functional coupling
- ✅ Publication figures - Nature-style, 300 DPI, PDF + PNG
- ✅ Quality assessment - Automatic metrics and validation
- ✅ Application tools - Map results to new enzymes
Perfect for:
- Protein family analysis
- Phylogenetic studies
- Enzyme engineering
- Publication preparation
- Functional site prediction
Start using:
python3 scripts/run_complete_qc.py --input your_sequences.fasta --output results/
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat: /install protein-sequence-qc-pro
- After installation, invoke the skill by name or use /protein-sequence-qc-pro
- Provide required inputs per the skill's parameter spec and get structured output
Is Protein Sequence Qc Pro free?
Yes, Protein Sequence Qc Pro is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Protein Sequence Qc Pro support?
Protein Sequence Qc Pro is cross-platform and runs anywhere OpenClaw / Claude Code is available.
Who created Protein Sequence Qc Pro?
It is built and maintained by Billwanttobetop (@billwanttobetop); the current version is v5.0.0.