Using FASTA and GOA Databases in Scaffold – Proteome Software Technical Help Center

The following article contains a list of frequently asked questions about using FASTA and GOA databases in Scaffold.

FASTA

Where and how do I get FASTA databases?
There are many places to download FASTA databases. We find that the Oxford Journals has a good list.

Also, UniProt, SwissProt and NCBI are great resources for FASTA and GOA files:

http://www.uniprot.org/
http://www.expasy.org/
http://www.ncbi.nlm.nih.gov/

We recommend using UniProt both for FASTA databases and for GO terms and biological annotations. Once the proper file has been located, it can be downloaded and added to Scaffold.

Can I make my own FASTA databases?
While it is possible to make a custom FASTA database, there are some considerations in doing so. FASTA databases must follow a specific format to be parsed correctly. Parsing is needed so that Scaffold can pull out the information it needs, such as protein sequence, molecular weight, and accession number. The FASTA file should also be an ASCII text file. We make our best effort to handle the file's encoding properly when parsing, but in some cases other encodings have caused problems. Therefore, technically, Scaffold only supports ASCII 8-bit encoding.
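Before loading a custom FASTA file, it can help to check it for the problems described above. The following Python sketch is not part of Scaffold; the function name check_fasta_text is hypothetical, and the checks (strict ASCII, a ">" header first, at least one sequence line per header) are assumptions about what a well-formed file looks like rather than Scaffold's actual parsing rules.

```python
def check_fasta_text(raw_bytes):
    """Return a list of problems found in raw FASTA bytes (empty list = none found)."""
    problems = []
    try:
        text = raw_bytes.decode("ascii")
    except UnicodeDecodeError as err:
        # Report the first offending byte, then continue with a lossy decode.
        problems.append(f"non-ASCII byte at offset {err.start}")
        text = raw_bytes.decode("ascii", errors="replace")

    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        problems.append("first non-blank line is not a '>' header")

    pending_header = None  # header still waiting for its first sequence line
    for ln in lines:
        if ln.startswith(">"):
            if pending_header is not None:
                problems.append(f"no sequence after header: {pending_header[:40]}")
            pending_header = ln
        else:
            pending_header = None
    if pending_header is not None:
        problems.append(f"no sequence after header: {pending_header[:40]}")
    return problems


if __name__ == "__main__":
    sample = b">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha\nMVLSPADKTN\n"
    print(check_fasta_text(sample))
```

An empty list means none of these particular problems were found; it does not guarantee that Scaffold will parse the file.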

Can I correct the FASTA database after I load an MS file?
Yes, the Edit > Edit FASTA Databases menu allows users to modify previously parsed databases or add new ones. Choose the database that you wish to correct and select Edit. A pop-up window will appear giving you the option to Auto Parse or Use Regular Expressions. The Use Regular Expressions option allows users to choose from a variety of parse rules or create custom ones.

Why do I need to parse the FASTA database?
You need to parse the FASTA database so it will match the parsing parameters used by the previous search engine. This allows Scaffold to compare its results with the results of the previous search engine.

The Auto Parse option in Scaffold will automatically choose what it calculates to be the best parsing option for the database. However, if the database is parsed improperly, Scaffold will not display protein molecular weight, percent coverage, or protein sequence. The protein names on the Samples view are obtained from the search engine results; therefore, problems with their display must be corrected in the search engine's FASTA parsing.

How can I edit the parse rules for FASTA databases in Scaffold?
To apply custom parse rules, follow Edit > Edit FASTA Databases > [database of interest] > Use Regular Expressions. Choosing Use Regular Expressions will bring up a dialog with a dropdown list containing many preset parse rules. Alternatively, you can select User Specified... from the dropdown list, enter the regular expression into the box, and click Apply. After applying the new parse rules, go to Experiment > Apply New Database..., select the database you want, and click Apply.
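It is often easier to test a regular expression on a few sample headers before entering it in Scaffold. The sketch below uses Python's re module purely for illustration; the two patterns are assumptions written for UniProt-style headers and are not Scaffold's preset rules, and Scaffold's regular expression engine may differ from Python's in some details.

```python
import re

# Hypothetical parse rules for UniProt-style headers such as
# ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens"
ACCESSION_RULE = re.compile(r"^>(?:sp|tr)\|([^|]+)\|")   # text between the first two pipes
DESCRIPTION_RULE = re.compile(r"^>\S+\s+(.*)$")          # everything after the first space


def parse_header(header):
    """Return (accession, description); either is None if its rule does not match."""
    acc = ACCESSION_RULE.search(header)
    desc = DESCRIPTION_RULE.search(header)
    return (acc.group(1) if acc else None, desc.group(1) if desc else None)


if __name__ == "__main__":
    for h in (">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens",
              ">custom_entry_1"):
        print(h, "->", parse_header(h))
```

A header for which the accession rule returns None is the kind of entry that would need a different or custom rule.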

How do I load a FASTA database into Scaffold?
The FASTA database of interest should be downloaded onto the computer running Scaffold, as this will speed up Scaffold's analysis (typically users will have a data drive with directories for FASTA and GOA database files). When the database is on your computer, you can import it from the Edit > Edit FASTA Databases > Add Database... menu. Select the database and a window will pop up displaying how the database will be parsed. Here users have the option to choose Auto Parse or Use Regular Expressions to set the parsing rule. In most cases Auto Parse will work with a FASTA obtained from a common source such as UniProt. A custom FASTA may require a custom regular expression.

Because Scaffold needs to compare the proteins in the database you are loading with the proteins in the previously searched database, the parsing parameters must be identical to those used in the search engine.

How can I find out what database the data was searched against?
You can find the FASTA database used in the search engine files created prior to loading into Scaffold.

To find the location of the Mascot FASTA database either look at the file name in the Mascot browser, or open the Mascot DAT file you wish to load with a text editor and find the database name inside. Each search engine should display the name of the FASTA file used in searching the data.

Once the data is loaded into Scaffold, the FASTA searched is displayed in the Analysis Information portion of the Load Data view.

What is a decoy database and how do I use it?
A decoy database is a database of sequences that are not expected to match any real peptide. A common example is a reversed or jumbled (randomized) database, that is, one whose sequences are reversed or shuffled relative to the forward (normal) database. Unless there is a palindromic sequence (one that is the same forward as it is backward), a reversed database should not return any peptide matches. If it does, these matches can be considered "false" and therefore contribute to the false discovery rate, or FDR. The FDR is equal to the number of false hits divided by the number of positive hits, multiplied by 100 to give a percentage. For example, if you get 1 false hit for every 100 positive hits, your FDR is 1% (Kall et al. 2007, J. Proteome Res.).
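The FDR arithmetic described above can be written out as a short Python sketch. The function name fdr_percent is hypothetical, and this only restates the simple ratio given in this article; it is not necessarily how Scaffold computes the FDR internally.

```python
def fdr_percent(false_hits, positive_hits):
    """FDR as a percentage: false (decoy) hits divided by positive hits, times 100."""
    if positive_hits <= 0:
        raise ValueError("positive_hits must be greater than zero")
    return 100.0 * false_hits / positive_hits


if __name__ == "__main__":
    # The example from the text: 1 false hit per 100 positive hits.
    print(fdr_percent(1, 100))
```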

Can Scaffold create decoy or reverse databases?
Yes, and they can be helpful if you wish to calculate false positives at the protein level.
Open Scaffold and go to Edit > Edit FASTA Databases. Select the database by highlighting it and click Edit > Use Regular Expressions. Once this dialogue box is open, click Export. You will get another small dialogue box with export options including a standard FASTA, reverse or random concatenated databases or simply the reversed or random database. You can then use this database in your searches to check how well your search engine is identifying your proteins.
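To illustrate what a reverse concatenated database contains, here is a minimal Python sketch that appends a reversed copy of every entry to the forward entries. It is not Scaffold's export code: the function names and the "REV_" prefix on decoy headers are assumptions made for this example, and the naming Scaffold uses in its own exports may differ.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header_without_gt, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def concatenated_reverse(text, tag="REV_"):
    """Return forward entries followed by reversed (decoy) entries as FASTA text."""
    records = read_fasta(text)
    out = [f">{header}\n{seq}" for header, seq in records]
    # Decoy entries: same header with a tag prefix, sequence reversed.
    out += [f">{tag}{header}\n{seq[::-1]}" for header, seq in records]
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(concatenated_reverse(">P1 demo\nMKTA\nYI\n"), end="")
```

Tagging the decoy headers makes it possible to count decoy hits separately from forward hits after the search.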

Does Scaffold have an option to automatically update FASTA protein databases?
Currently there is not an option available in Scaffold that allows the user to automatically update FASTA protein databases.

How can I create subset non-redundant (NR) databases for quicker analyses?
One reason loading NR data into Scaffold takes a long time is that Scaffold scans the entire NR database for each MS sample. It has proven useful to create a subset database containing only the taxonomic species you are interested in and to use that smaller database when loading the data into Scaffold.

To create a subset FASTA database using Scaffold, you first have to parse the full database into Scaffold. With NR that can take a while. After the full database has been parsed, go to Edit > Edit FASTA Databases. Select the NR database and click Edit > Use Regular Expressions > Export. Add keywords for the taxonomy you are interested in. When you are finished adding keywords, select the type of database (standard format is suggested) and select Export.

To use the smaller subset database that you just created, go back into Scaffold, click on Edit > Edit FASTA Databases > Add Database..., and add your subset database. Now use this database when loading your NR search results and Scaffold will load the data much faster.
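The idea behind a keyword-based subset can be sketched in Python as follows. This is an illustration only: the function name subset_fasta is hypothetical, and the case-insensitive substring match on the header line is an assumption, not a description of how Scaffold's Export dialog matches keywords.

```python
def subset_fasta(lines, keywords):
    """Yield the lines of every FASTA record whose header contains any keyword."""
    wanted = [k.lower() for k in keywords]
    keep = False
    for line in lines:
        if line.startswith(">"):
            # Decide once per record, based on its header line.
            header = line.lower()
            keep = any(k in header for k in wanted)
        if keep:
            yield line


if __name__ == "__main__":
    source = [
        ">gi|1 protein A [Homo sapiens]\n", "MKT\n",
        ">gi|2 protein B [Mus musculus]\n", "AAA\n",
    ]
    print("".join(subset_fasta(source, ["homo sapiens"])), end="")
```

Because the function works line by line, it can be given an open file object for a large database without reading the whole file into memory.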

Can I apply more than one FASTA database to a single data set?
Yes, the Load Data wizard allows users to apply multiple FASTA files. Simply click the green check to add additional FASTA files in the Searched Databases portion of the Load and Analyze dialog. Alternatively, FASTA files can be added after the data is loaded: use Experiment > Apply New Database and choose the next database from which you wish to get IDs. You can do this iteratively until all the databases have been applied.

Why does my Scaffold data show question marks for molecular weights? Why are there missing sequences?
As described in the parsing sections of this FAQ, Scaffold requires you to parse the FASTA database files so it knows which portions of the entries are accession numbers, descriptions, and sequences. It is important that the parsing rules used in Scaffold match those used by the search engine. If the parse rules in Scaffold are invalid or don't match, or if Scaffold has a hard time parsing the database, the result will be question marks for the molecular weights and missing sequences. This is because Scaffold couldn't determine where the accession, description, or sequence for each entry started or ended.

To address this, first confirm your search engine's FASTA database and parse rules, and make sure they match in Scaffold. Scaffold has a list of standard parse rules in a dropdown menu from which to choose. You can also write your own regular expressions for parse rules. To edit your FASTA database, see the entries on correcting the database and editing parse rules above.

GO Annotation Databases

Can Scaffold show GO annotations?
Yes, there are two GO term sources available: directly querying the NCBI through Scaffold, or adding UniProt GOA files available from their FTP service. Scaffold's user interface offers a list of common organisms from which to import annotations.

How do I add GO terms to a Scaffold experiment?
GO annotations can be added by first selecting Import Annotations in the Edit > Edit GO Term Options > GO Annotations dialog box. To apply these annotations, use the Experiment > Apply Annotations menu option.

See the article on how to add organism-specific GOA databases to Scaffold for more information.

To toggle the GO terms on or off at any time, use View > Show GO Annotations in Scaffold or the column control button (far-right button next to the sample names row) in Scaffold DDA/DIA.

How much drive space is needed for storage of GOA files?
As more and more GO annotations become available, GOA file sizes continue to increase. UniProt provides an "All Proteomes" file that contains all available GO terms. Once indexed, this file takes approximately 160 GB of storage space, so users should plan accordingly. Subset taxonomy files are significantly smaller, so if space is a consideration, using subset databases is recommended over the "All Proteomes" file. Scaffold has the ability to save GOA databases to alternate locations to avoid filling up a user's C drive.

Protein Information Lookup

It is important to ensure gene names are populated for your protein list for best GO Term results. Although the local goa.db file contains both accession- and gene name-linkages, we find that including gene names in your data set works best.

Why doesn't the protein lookup find a protein in the database?
If the chosen database does not have an entry for the selected protein, Scaffold will not be able to find it. You can change the database Scaffold uses to look up your protein by clicking on the Lookup Accession Number In dropdown and choosing a different database.

Another reason Scaffold may not be finding a protein’s information is if Scaffold’s internet settings are not allowing it to connect to the internet. You can view Scaffold’s connection settings in the Edit > Preferences menu. The check box for allowing Scaffold to connect to the internet can be found on the Internet tab.

How can I look up proteins in a custom database?
To look up protein information in a custom database, you need to have internet access to the custom database. The Edit > Edit Preferences menu's "Web Link" tab allows users to add custom databases to Scaffold. Click on the New Database button and enter the name and the link of the custom database. Be careful to follow the instructions regarding the % sign when typing in the web address of the database. Once you have added the information for your custom database, click Apply and then go to the Scaffold Samples view. Click on the Lookup Accession Number In dropdown and choose your custom database from the options.

What do the choices in the Lookup Accession Number In dropdown list mean?
The Lookup Accession Number In dropdown at the bottom of the Samples view allows users to look up the accession number for the selected protein in various databases. The choices in the dropdown menu represent both Scaffold's standard databases and custom ones added by users.

How can I select protein annotation preferences for protein groups?
Selecting Experiment > Apply Protein Preferences will open a dialog that allows you to select which proteins are displayed in the Samples view when more than one similar protein is present. It will allow you to automatically set preferred protein names, accession numbers, and taxonomy. This action will automatically search for proteins that contain the text you specify.

For example, you can select Human entries out of SwissProt searching for accession numbers that contain "_HUMAN". All preferences also support regular expressions. For example, you can select SwissProt numbers out of UniProtKB databases using "^[OPQ]".
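The two example preferences above can be tried out on a small list of identifiers before applying them in Scaffold. The Python sketch below is only an illustration: the helper name matching and the sample identifiers are made up for this example, and it assumes that a preference matches when the pattern is found anywhere in the identifier (so "^" is needed to anchor at the start), which may not be exactly how Scaffold applies preferences.

```python
import re


def matching(identifiers, pattern):
    """Return the identifiers in which the regular expression is found."""
    rule = re.compile(pattern)
    return [i for i in identifiers if rule.search(i)]


if __name__ == "__main__":
    # Illustrative identifiers only.
    sample = ["P69905", "A0A024R161", "ALBU_HUMAN", "ALBU_MOUSE", "Q9Y6K9"]
    print(matching(sample, "_HUMAN"))   # entries containing "_HUMAN"
    print(matching(sample, "^[OPQ]"))   # entries starting with O, P, or Q
```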