Request a New Genome in BioCyc

If you'd like to see a new genome added to BioCyc, please send a request to biocyc-support@ai.sri.com. Please include the species name, the strain name, and the RefSeq or GenBank assembly accession ID.

Difficulties Querying BioCyc Genomes by Gene Name

We have received a number of reports from users who are unable to search certain BioCyc genomes by gene name. Here we explain the reasons for this situation.

Most of the genomes within BioCyc were obtained from NCBI RefSeq. Many RefSeq annotations include gene names and protein names, which are the source of the gene names and protein names that you see in BioCyc databases (before the additional curation that also occurs in BioCyc).

Periodically we re-download RefSeq genomes, and re-generate the BioCyc PGDBs. This allows us to integrate improved RefSeq annotations for the genomes, and to provide improved metabolic reconstructions based on newer versions of our MetaCyc database and PathoLogic software. However, it also means that any problems introduced in RefSeq can appear in BioCyc. Recently released RefSeq genomes omit large numbers of gene names that were present in earlier versions of those genomes. We were not aware of this problem at the time we re-generated these BioCyc databases, and some such genomes were integrated into BioCyc.

NCBI has acknowledged the problem and is working to fix it, but will not give an estimate of when it will be fixed. In the meantime, we will not integrate any updated RefSeq genome into BioCyc if the updated version contains significantly fewer gene names than the previous version. Reverting or reliably repairing the existing BioCyc PGDBs that have small numbers of gene names is not feasible.
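The integration rule described above can be illustrated with a minimal sketch. This is not BioCyc's actual procedure: the function name `should_integrate` and the 10% tolerance `max_drop` are hypothetical, since the text only says "significantly fewer" and does not define a threshold.

```python
def should_integrate(prev_named: int, new_named: int, max_drop: float = 0.10) -> bool:
    """Decide whether an updated genome keeps enough gene names.

    prev_named: number of genes with names in the previous version.
    new_named:  number of genes with names in the updated version.
    max_drop:   hypothetical tolerance (assumed 10%); the source does not
                state what counts as "significantly fewer".
    """
    if prev_named == 0:
        # The previous version had no gene names, so none can be lost.
        return True
    # Fraction of previously available gene names that disappeared.
    drop = (prev_named - new_named) / prev_named
    return drop <= max_drop


# A genome that keeps most of its names passes; one that loses most does not.
print(should_integrate(1000, 950))
print(should_integrate(1000, 400))
```

A relative threshold is used here so that the same rule applies to small and large genomes; an absolute count would be an equally plausible choice.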

Here are a few statistics regarding gene names in BioCyc that illustrate the variability in current genome annotations. We do not know how these numbers compare to the previous version of BioCyc. However, only 2,000 of the 11,000 BioCyc databases (roughly 18%) were re-generated in our last release, which limits how much these numbers could have changed since the previous BioCyc release.

In BioCyc 21.5:

  • 1334 genomes contain no gene names.

  • For 8928 genomes, fewer than 10% of the genes have gene names. Conversely, for 2052 genomes, more than 10% of the genes have gene names.

  • For 208 genomes, more than 50% of the genes have gene names. However, for a number of these genomes, many of the "names" stored for the genes are actually accession numbers, not true gene names.

  • For EcoCyc, 69.5% of genes have gene names that do not begin with the letter "y" (genes whose names begin with "y" do not have well-established functions).
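Statistics of this kind could be computed along the following lines. This is a sketch only, assuming each genome is available as a simple list of gene names with `None` for unnamed genes. The function `fraction_named` and the `ACCESSION_LIKE` pattern are hypothetical; in particular, the regular expression is a rough heuristic for accession-style strings and is not how BioCyc identifies them.

```python
import re
from typing import Iterable, Optional

# Hypothetical heuristic: letters, an optional underscore, then 3+ digits
# (for example "b0001" or "ABC_01234") is treated as accession-like.
ACCESSION_LIKE = re.compile(r"^[A-Za-z]+_?\d{3,}$")


def fraction_named(
    names: Iterable[Optional[str]],
    skip_y: bool = False,
    skip_accessions: bool = False,
) -> float:
    """Return the fraction of genes that carry a usable gene name.

    names:           one entry per gene; None or "" means no name.
    skip_y:          do not count names beginning with "y".
    skip_accessions: do not count names matching ACCESSION_LIKE.
    """
    names = list(names)
    if not names:
        return 0.0
    counted = 0
    for name in names:
        if not name:
            continue  # gene has no name at all
        if skip_accessions and ACCESSION_LIKE.match(name):
            continue  # looks like an accession number, not a gene name
        if skip_y and name.startswith("y"):
            continue  # "y" names mark genes without established function
        counted += 1
    return counted / len(names)


# Toy genome with five genes, one of them unnamed.
genes = ["dnaA", "yaaA", None, "b0001", "thrB"]
print(fraction_named(genes))
print(fraction_named(genes, skip_y=True, skip_accessions=True))
```

Applying the filters separately makes it easy to see how much of a genome's apparent name coverage comes from accession-like strings or from "y" names.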