Detail of model production
Each protein sequence in the dataset was degenerated into many alignments, so that the sequence identity to the initial sequence in the PDB file covered the range from 1 to 99%.
The following steps were applied:
- The sequence in FASTA format was derived directly from the PDB file (from the residue names) and NOT taken from the sequence provided by pdb.org. Because the protein in a PDB entry can be processed or mutated, or may contain missing residues, this sequence may differ from the upstream UniProt reference sequence. For chimeric proteins (for instance GPCRs), only the sequence of the protein of interest was kept, not that of the other protein (for instance, the phage T4 sequence was removed in 2RH1).
- A dummy sequence alignment was created by copying the initial sequence into a second entry:
    >seqA
    ORIGINALSEQUENCEFILE
    >copy_of_seqA
    ORIGINALSEQUENCEFILE
- The dummy sequence alignment was altered using permutations or gaps until the target percentage identity was reached (1 to 99%, in steps of 1%):
  - for gaps: runs of 1 to 8 consecutive gaps could be added to each sequence (template and target), and one to five such insertions could occur
  - for permutations: amino acids of the target sequence were exchanged until the required percentage identity was obtained, or until a maximum number of iterations was reached (typically twice the sequence length, to avoid infinite loops).
- When necessary, both sequences were padded with initial or terminal hyphens (-) so that the two sequences in the alignment had the same length. This step is mandatory for MODELLER: initial or terminal gaps are not a problem, but the sequences must have the same length.
- The resulting sequence alignment was submitted to MODELLER:
  - the template sequence (seqA) was renamed to "template"
  - the target sequence (the modified sequence) was renamed to "target"
  - a dedicated PIR file was created for MODELLER.
- For each alignment (about 300 on average for each PDB file), 25 models were produced. Each model was assessed using:
  - MODELLER's internal objective function value (molpdf)
  - DOPE
  - DOPE_HR
  - GA_341
  - its RMSD to the original PDB file according to IPBA
  - its GDT_TS to the original PDB file according to IPBA
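
The alignment preparation steps above can be sketched in Python as follows. This is a minimal illustration, not the code used to build the dataset. All function names (`sequence_from_pdb`, `percent_identity`, `degrade_by_permutation`, `pad_to_same_length`) are hypothetical. Identity is assumed to be the share of alignment columns in which both sequences carry the same residue, the permutation step is interpreted as swapping pairs of still-unchanged target positions, padding is shown for terminal hyphens only, and gap insertion is left out.

```python
import random

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def sequence_from_pdb(pdb_lines, chain_id):
    """Build the sequence of one chain from the residue names of ATOM records."""
    residues, seen = [], set()
    for line in pdb_lines:
        if len(line) < 27 or not line.startswith("ATOM") or line[21] != chain_id:
            continue
        key = line[22:27]  # residue number plus insertion code
        if key not in seen:
            seen.add(key)
            # Assumption of this sketch: unknown residue names become "X".
            residues.append(THREE_TO_ONE.get(line[17:20], "X"))
    return "".join(residues)


def percent_identity(template, target):
    """Percentage of columns where both sequences have the same residue."""
    matches = sum(1 for a, b in zip(template, target) if a == b and a != "-")
    return 100.0 * matches / len(template)


def degrade_by_permutation(template, target_identity, rng=None):
    """Swap pairs of unchanged target residues until the identity is low enough."""
    rng = rng or random.Random()
    target = list(template)
    for _ in range(2 * len(template)):  # cap: twice the sequence length
        if percent_identity(template, "".join(target)) <= target_identity:
            break
        # Positions that still match the template.
        intact = [i for i, aa in enumerate(target) if aa == template[i]]
        if len(intact) < 2:
            break
        i, j = rng.sample(intact, 2)
        # Swapping two identical residues changes nothing but uses one iteration.
        target[i], target[j] = target[j], target[i]
    return "".join(target)


def pad_to_same_length(seq_a, seq_b):
    """Append terminal hyphens so that both sequences have the same length."""
    width = max(len(seq_a), len(seq_b))
    return seq_a.ljust(width, "-"), seq_b.ljust(width, "-")
```

Because only positions that still match the template are swapped, the identity in this sketch never increases from one iteration to the next. The iteration cap means that very low targets may not be reached for short or low-complexity sequences, which mirrors the interruption rule described above.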
Statistics
89 PDB files from the original HOMEP2 dataset (article) (dataset download), split into 22 families, were considered. The perturbation of the sequence alignments led to 332 degenerated sequence alignments on average, each of them leading to 25 models, for a total of more than 8300 models per structure (PDB file).
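
The arithmetic behind these figures can be checked with a few lines of Python. The per-structure count follows directly from the averages quoted above. The dataset-wide total is only an order-of-magnitude estimate derived here from those averages, not a figure reported for the dataset, and the variable names are illustrative.

```python
structures = 89                  # PDB files considered
alignments_per_structure = 332   # average number of degenerated alignments
models_per_alignment = 25

# Average number of models built for one structure.
models_per_structure = alignments_per_structure * models_per_alignment

# Rough estimate over the whole dataset, assuming the average holds throughout.
estimated_total_models = structures * models_per_structure

print(models_per_structure)      # 8300
print(estimated_total_models)    # 738700
```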