Nucleotide

NCBI Nucleotide contains a wealth of information that would be useful to almost any scientist. However, the data is not parser-friendly, and it is unclear how to download all the genes of interest. This video covers both of these topics.

To create the two Python dictionaries seen in the video, you first need to download a GenBank (full) file from NCBI Nucleotide. This file is called "sequence.gb" by default. You then need to run parsing.py, which parses the information of interest from sequence.gb and creates a new file called "parsed.txt". You only need to run parsing.py once, and sequence.gb can be deleted after it has run.

Each time you want to load the data into dictionaries, you need to run dictionaries.py. It reads parsed.txt and creates a dictionary called "mrnas" and a dictionary called "sequence". Depending on the size of your dataset, these dictionaries may take up several hundred MB of RAM.

If you are interested in seeing how the code was written, or how you might alter it to include different information, you can view the video below or on YouTube.

Edit 3/28/2016: I fixed a minor bug in parsing.py that was causing incomplete extraction of the gene synonyms. I also removed some list brackets in dictionaries.py that shouldn't have been there. Please use the code at this GitHub repository.
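For readers who want a feel for what this kind of parsing involves before watching the video, here is a minimal sketch. It is not the code from parsing.py or dictionaries.py. It skips the intermediate parsed.txt step and goes straight from GenBank text to two dictionaries. The function name `parse_genbank`, the choice of accession as the dictionary key, the layout of each "mrnas" entry, and the tiny sample record (including the accession NM_TEST1) are all assumptions made up for illustration.

```python
import re

def parse_genbank(text):
    """Build two dictionaries from GenBank flat-file text.

    Hypothetical layout (an assumption, not the repository's format):
      mrnas[accession]    -> {"gene": str or None, "synonyms": list of str}
      sequence[accession] -> nucleotide string in upper case
    """
    mrnas = {}
    sequence = {}
    # Each GenBank record ends with a line containing only "//".
    for record in text.split("\n//"):
        if "LOCUS" not in record:
            continue  # skip trailing whitespace after the last record
        acc = re.search(r"^ACCESSION\s+(\S+)", record, re.MULTILINE)
        if acc is None:
            continue
        accession = acc.group(1)

        # First /gene qualifier in the feature table, if any.
        gene = re.search(r'/gene="([^"]+)"', record)

        # Synonyms are separated by ";" and may wrap across lines,
        # so collapse whitespace before splitting.
        synonyms = []
        syn = re.search(r'/gene_synonym="([^"]+)"', record)
        if syn:
            joined = " ".join(syn.group(1).split())
            synonyms = [s.strip() for s in joined.split(";") if s.strip()]

        # Everything after the ORIGIN line is the sequence, written
        # with position numbers and spaces that need to be stripped.
        seq = ""
        _, found, origin = record.partition("\nORIGIN")
        if found and "\n" in origin:
            body = origin.split("\n", 1)[1]
            seq = re.sub(r"[^A-Za-z]", "", body).upper()

        mrnas[accession] = {
            "gene": gene.group(1) if gene else None,
            "synonyms": synonyms,
        }
        sequence[accession] = seq
    return mrnas, sequence

# Invented sample record, shaped like a GenBank entry but not real data.
SAMPLE_RECORD = """LOCUS       NM_TEST1     24 bp    mRNA    linear   PRI 01-JAN-2016
ACCESSION   NM_TEST1
FEATURES             Location/Qualifiers
     gene            1..24
                     /gene="ABC1"
                     /gene_synonym="FOO; BAR;
                     BAZ"
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""

mrnas, sequence = parse_genbank(SAMPLE_RECORD)
```

With a real download you would read the file instead, for example `parse_genbank(open("sequence.gb").read())`. Reading the whole file into memory is simple but may be slow for large datasets, which is one reason a two-step approach with an intermediate file can be attractive.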