PrePubMed: A PubMed for preprints

I believe that our publication system is broken. Where an article is published has undue influence on how it is read, cited, and perceived. Where a researcher publishes essentially determines whether they will obtain a faculty position or grants, with complete disregard for the other contributions they might bring to the scientific community. For example, maybe they run a popular blog, contribute to numerous GitHub repositories, or answer questions on forums.

And it can take years to get a paper published, with each resubmission requiring extensive format changes. The reviewers are often unqualified and raise ridiculous questions or ask for pointless experiments. Papers that get into glamour journals do not represent the best science, but rather reflect what the editors consider "hot" at the moment, resulting in these journals having high retraction rates. And why do we put up with this? Because it only takes one glamour publication to make our careers? When did performing science turn into playing the lottery?

Preprints can solve many of these issues. They are posted without having to go through peer review, which allows them to be available within a couple of days. All articles are welcome, putting everyone's science on an even playing field. And the articles are open access, allowing anyone to read and comment. Not to mention it is free to preprint an article.

However, there are multiple problems with preprints, and all of these problems could be solved if we treated preprints as official publications. It is possible for an article to be present at multiple preprint servers, or even present at the same preprint server multiple times. Would we let someone publish their work at several journals? Of course we wouldn't; one (or both) papers would get retracted. Would we let articles get posted without affiliations or full author names? Would we let articles which are just abstracts, or are short enough to be better suited as a blog post, get published? I'm assuming not, but with PLOS ONE you never know. With these issues it is not surprising that preprints do not get indexed into PubMed and become part of the official literature.

I can't solve the above issues, which require an improvement in preprint servers and an acceptance of preprints as official publications. However, to improve the credibility of preprints I can index them myself and provide the ability to search them, which is why I created PrePubMed. I attempted to make PrePubMed a proxy for PubMed, but there are inherent limitations with preprints. In addition, there are limitations imposed by my code (for example, only AND searches are allowed), but these are limitations which can be removed given enough time, motivation, or assistance from others. In this article I will discuss how PrePubMed works, and exactly what the limitations are.

Although I could publish PrePubMed as a preprint, I don't believe everything should be published, and by extension, preprinted. Although I think PrePubMed is useful for a wide audience and should be publicized, I don't think the site utilizes any algorithms or technology that warrants its inclusion in the literature. And a blog post is a more natural medium for some work which would otherwise awkwardly pollute the literature.
PrePubMed works under the philosophy that preprints are official published articles. This has the implication that an article only gets indexed once. Preprints have the unique problem of new versions. At the preprint servers each article gets a dedicated page, and when you search articles by date, the date of the most recent version is used. Once an article is indexed by PrePubMed, new versions of the article are ignored. This allows the PrePubMed search results to reflect truly new research. If someone preprinted their article two years ago but then updated the author list yesterday, do you really want that to show up at the top of your PrePubMed search? Do you want that to show up in your RSS feed?

PrePubMed avoids this situation by utilizing the unique numbers that the preprint servers assign to preprints. For example, at PeerJ Preprints the link to the newest version of the article doesn't change. If the article's link is preprints/2064/ and a new version is posted, they will create a new version-specific link, preprints/2064v1/, with the original link still pointing to the newest version. As a result, when PrePubMed does its daily indexing, if it sees a link already in the database it ignores the article, even if changes to the article have occurred. It might be preferable to have PrePubMed update the database with the changes that have occurred (excluding the date), but that is currently not implemented.

More basics of PrePubMed and how the search operates can be found on the Help page and in the video below. However, this article is not about the basics of PrePubMed; it is about how PrePubMed actually works. PrePubMed is written in Python 2.7 and runs on Django 1.8. The production code is available at GitHub. I am using the default Django template engine to generate HTML, SQLite3 and the Django ORM for the database, Amazon S3 to serve the few static images that I have, and I make heavy use of Bootstrap CSS for the layout of the site. The main app is located in the mysite directory. Looking at urls.py gives a good overview of the site:
# Imports assumed for this excerpt (Django 1.8 URL routing; the views live in mysite/views.py)
from django.conf.urls import url
from mysite import views

urlpatterns = [
    url(r'^$', views.home),
    url(r'^search_results/',views.search_results),
    url(r'^search_tag/',views.search_tag),
    url(r'^search_author/',views.search_author),
    url(r'^help/',views.my_help),
    url(r'^advanced_search/',views.advanced_search),
    url(r'^ad_search_results/',views.advanced_search_results),
    url(r'^grim_test/',views.grim_test),
    url(r'^make_plot/',views.make_plot),
    url(r'^grim_plot/',views.grim_plot),
    url(r'^general_grim/',views.general_grim),
    url(r'^rss_feed/',views.rss_feed),
    url(r'^articles/(.{1,50})/rss/',views.RSSFeed()),
]
When someone visits www.prepubmed.org the home view gets called, which just sends them to the home page. The view search_results gets called when someone types their search into the default search box. When an author name is clicked the search_author view is called, and when a tag is clicked the search_tag view is called. Visiting advanced_search sends them to the advanced search page; if they enter a query there the advanced_search_results view gets called. The grim_test view handles the basic GRIM test, and make_plot and grim_plot are needed if they decide to generate a plot. The general_grim view handles a more flexible GRIM test. If a user wants an RSS feed they are sent to the RSS query page by rss_feed, and the Django Feed class handles the creation of their RSS feed.

Perhaps the most important aspect of PrePubMed is the database, as it is thus far the only attempt to index preprints. The preprints that are indexed are saved in two different formats. First, they are saved in a text file in a format that I affectionately think of as a lazy Python-friendly format. I may have invented this format or it might be widely used, I don't know. More details about this format can be found here. Basically, the articles are saved in the data structure in which they existed in Python, which allows them to easily be read back into Python if necessary with an eval() call (a short sketch of reading these files back appears after the example entries below). Each preprint server is indexed separately and has its own dedicated folders and files. For example, the PeerJ preprints that were indexed on 2016-06-04 are saved in the folder peerj/update_log as 2016-06-04-05-01-59.txt:
[u'Metagenomics accelerates species discovery and unravel great biodiversity of benthic invertebrates in marine sediments in Campos basin, Brazil', ['Milena Marcela D P Schettini', 'Raony G C C L Cardenas', 'Marcella A A Detoni', 'Mauro F Rebelo'], u'2016-06-03', u'Sediment fauna characterization and monitoring are mandatory requirements for obtaining oil and gas (O&G) environmental licensing for exploration and production (E&P) activities. Currently, for environmental characterizations and monitoring, biodiversity is assessed through morphological taxonomy, a time-consuming process. Taxonomists are constantly failing to meet the demands for biodiversity assessment required in monitoring programs. Thus, we combined three different phylogenetic markers(rDNA 18S, rDNA 28S and COI), HTS and Bioinformatics to identify benthic invertebrate organisms from sediment samples collected in five stations in the Campos Basin in southeast Brazil, an important oil extraction area and one of the best-studied marine biota in Brazil. Our results obtained with metagenomics were compared to morphology data provided by the Habitats Project whereas the database Global Biodiversity Information Facility ( www.gbif.org ) was used for organism localization. We obtained around 4.83 \u03bcg of DNA from 15 samples. A total of 3.3 million sequences were clustered in Operational Taxonomic Units and more than 1.6 million sequences (about 50% of all reads) were assigned to 957 prokaryotes and 577 eukaryotes. BLAST identified 23 phyla, 60 classes, 62 orders, 70 families, 67 genus and 46 species of eukaryotes. Our metagenomic analysis identified phyla that are traditionally found in samples of marine benthos, such as Annelida, Arthropoda, Mollusca and Chordata, as well as more rarely found phyla such as Bryozoa, Cnidaria, Echinodermata, Nematoda, Nemertea, Platyhelminthes, Porifera and Priapulida; and even more rare phyla like Entoprocta and Gastrotricha. The low availability of genetic markers for Brazilian species in Genebank impaired our ability to compare our findings with those obtained morphologically for which no sequences were found in Genebank. Our study shows that metagenomics can be applied for environmental characterization and monitoring programs and, with the possibility of automating the method, may reduce from years to few months the time currently required for species identification and biodiversity determination, which will certainly accelerate species discovery.', u'https://peerj.com/preprints/2103/', ['Bioinformatics', 'Environmental Sciences', 'Genomics'], [u'Instituto de Biof\xedsica Carlos Chagas Filho., Universidade Federal do Rio de Janeiro']]
[u'Coupling spatiotemporal community assembly processes to ecosystem function', ['Emily Graham', 'Alex R. Crump', 'Charles T Resch', 'Sarah Fansler', 'Evan Arntzen', 'David Kennedy', 'Jim Fredrickson', 'James C. Stegen'], u'2016-06-03', u'Community assembly processes govern shifts in species abundances in response to environmental change, yet our understanding of assembly remains largely decoupled from ecosystem function. Here, we test hypotheses regarding assembly and function across space and time using hyporheic microbial communities as a model system. We pair sampling of two habitat types (e.g., attached and unattached) through seasonal and sub-hourly hydrologic fluctuation with null modeling and temporally-explicit multivariate statistics. We demonstrate that dual selective pressures assimilate to generate compositional changes at distinct timescales among habitat types, resulting in contrasting associations of Betaproteobacteria and Thaumarchaeota with selection and with seasonal changes in aerobic metabolism. Our results culminate in a conceptual model in which selection from contrasting environments regulates taxon abundance and ecosystem function through time, with increases in function when oscillating selection opposes stable selective pressures. Our model is applicable within both macrobial and microbial ecology and presents an avenue for assimilating community assembly processes into predictions of ecosystem function.', u'https://peerj.com/preprints/2102/', ['Ecology', 'Ecosystem Science', 'Environmental Sciences', 'Microbiology', 'Molecular Biology'], [u'Biological Sciences Division, Pacific Northwest National Laboratory']]
[u'Modeling potential distribution of Indo-Pacific humpback dolphins (Sousa chinensis) in the Beibu Gulf, China', ['Mei Chen', 'Yuqin Song', 'Dagong Qin'], u'2016-06-03', u'Mapping key habitats of marine mega-vertebrates with high mobility is crucial for establishing Marine Protected Area (MPA) networks. Due to difficulties in achieving sound data in the field, Species Distribution Modeling (SDM) provide an efficient alternative. As a keystone and flagship species in inshore waters in southern China, Indo-Pacific humpback dolphins (Sousa chinensis) play an important role in coastal ecosystems. We used a maximum entropy (Maxent) modeling approach to predict potential habitats for the dolphins in the Beibu Gulf of China. Models was based on eight independent oceanographic parameters derived from Google Earth Digital Elevation Model (DEM) and Landsat images, and presence-only data from boat-based surveys between 2003 and 2013. Three variables, distance from major estuaries, from coast and from 10-m isobaths, were the strongest predictors, consistent with previous studies. Apart from known areas, a new area, Beilunhe Estuary (BE) close to the boundary of China and Vietnam was predicted. Based on our findings, we proposed a regional MPA network for humpback dolphins in the Beibu Gulf of China.', u'https://peerj.com/preprints/2101/', ['Biodiversity', 'Ecology', 'Marine Biology'], [u'Department of Environmental Management, College of Environmental Sciences and Engineering, Peking University', u'Center for Nature and Society, School of Life Sciences, Peking University']]
[u'The biomechanical, chemical, and physiological adaptations of the eggs of two Australian megapodes to their nesting strategies and their implications for extinct titanosaur dinosaurs', ['Gerald Grellet-Tinner', 'Suzanne Lindsay', 'Mike Thompson'], u'2016-06-03', u'Megapodes are galliform birds endemic to Australasia and unusual amongst modern birds in that they bury their eggs for incubation in diverse substrates and using various strategies. Alectura lathami and Leipoa ocellata are Australian megapodes that build and nest in mounds of soil and organic matter. Such unusual nesting behaviors have resulted in particular evolutionary adaptations of their eggs and eggshells. We used a combination of scanning electron microscopy, including electron backscatter diffraction and energy-dispersive X-ray spectroscopy, to determine the fine structure of the eggshells and micro-CT scanning to map the structure of pores. We discovered that the surface of the eggshell of A. lathami displays nodes similar to those of extinct titanosaur dinosaurs from Transylvania and Auca Mahuevo egg layer #4 (AM L#4). We propose that this pronounced nodular ornamentation is an adaptation to an environment rich in organic acids from their nest mound, protecting the egg surface from chemical etching and leaving the eggshell thickness intact. By contrast, L. ocellata nests in mounds of sand with less organic matter in semiarid environments and has eggshells with weakly defined nodes, like those of extinct titanosaurs from AM L#3 that also lived in a semiarid environment. We suggest the internode spaces in both megapode and titanosaur species act as funnels, which concentrate the condensed water vapor between the nodes. This water funneling in megapodes through the layer of calcium phosphate reduces the likelihood of bacterial infection by creating a barrier to microbial invasion. In addition, the accessory layer of both species possesses sulfur, which reinforces the calcium phosphate barrier to bacterial and fungal contamination. Like titanosaurs, pores through the eggshell are Y-shaped in both species, but A. lathami displays unique mid-shell connections tangential to the eggshell surface and that connect some adjacent pores, like the eggshells of titanosaur of AM L#4 and Transylvania. The function of these inter-connections is not known, but likely helps the diffusion of gases in eggs buried in environments where occlusion of pores is possible.', u'https://peerj.com/preprints/2100/', ['Animal Behavior', 'Evolutionary Studies', 'Paleontology'], [u'Associate Researcher, Orcas Island Historical Museum', u'Department of Geosciences, Centro Regional de Investigaciones Cient\xedficas y Transferencia Tecnol\xf3gica de La Rioja (CRILAR-CONICET)', u'Australian Museum', u'School of Biological Sciences, University of Sydney']]
[u'Plant spatial patterns and functional traits interaction along a chronosequence of primary succession: evidence from a central Alpine glacier foreland', ['Tommaso Sitzia', 'Matteo Dainese', 'Bertil O. Krusi', 'Duncan McCollin'], u'2016-06-03', u'The main aim of this study was to elucidate the roles of terrain age and spatial self-organisation as drivers of primary succession using high-resolution assessment of plant composition, functional traits and landscape metrics. We sampled 46 plots, 1m x 1m each, distributed along a 15-70 year range of terrain ages on the foreland of the Nardis glacier, located in the southern central Alps of Italy. From existing databases, we selected nine quantitative traits for the 16 plant species present, and we measured a set of seven landscape metrics, which described the spatial arrangement of the plant species patches on the study plots, at a 1cm x 1cm resolution. We applied linear models to study the relationships among plant communities, landscape metrics and terrain age. Furthermore, we used RLQ-analysis to examine trait-spatial configuration relations. To assess the effect of terrain age variation on trait performance, we applied a partial-RLQ analysis approach. Finally, we used the fourth-corner statistic to quantify and test relations between traits, landscape metrics and RLQ axes. Surprisingly, linear models revealed that neither the plant composition nor any of the landscape metrics differed among the three classes of terrain age distinguished, viz. 15-41 y, 41-57 y and 57-66 y, respectively. Further, no correlations were detected between trait patterns and terrain age, however, the floristically defined relev\xe9 clusters differed significantly with regard to several landscape metrics and suggestive relationships between increasing patch diversity and traits connected to growth rate were detected. We conclude that (i) terrain age below 70 years is not a good predictor for neither plant composition nor spatial configuration on the studied microhabitat and (ii) the small-scale configuration of the plant species patches correlates with certain functional traits and with plant composition, suggesting species-based spatial self-organisation.', u'https://peerj.com/preprints/2099/', ['Biochemistry', 'Ecology', 'Plant Science'], [u'Department of Land, Environment, Agriculture and Forestry, Universit\xe0 degli Studi di Padova', u'Department of Animal Ecology and Tropical Biology, Universit\xe4t W\xfcrzburg', u'School of Life Sciences and Facility Management, Zurich University of Applied Science', u'Landscape & Biodiversity Research Group, The University of Northampton']]
[u'Impact of agricultural management on bacterial laccase-encoding genes with possible implications for soil carbon storage in semi-arid Mediterranean olive farming', ['Beatriz Moreno', 'Emilio Benitez'], u'2016-06-03', u'Background  . Laccases, mostly laccase-like multicopper oxidases (LMCO), are probably the most common ligninolytic enzymes in soil. Although, in recent studies, laccase-encoding genes have been successfully used as molecular markers in order to elucidate the role of bacteria in soil organic C cycling , further research in this field is necessary . In this study, using rainfed olive farming as an experimental model, we determined the stability and accumulation levels of humic substances and appliedthese data to bacterial laccase-encoding gene expression and diversity in soils under four different agricultural management systems (bare soils under tillage/no tillage and vegetation cover under chemical/mechanical management).\nMaterials and Methods.  Humic C (>10 4  Da) was subjected to isoelectric focusing. The GC-MS method was used to analyze aromatic hydrocarbons. Real-Time PCR quantification and denaturing gradient gel electrophoresis ( DGGE) of DNA/RNA for functional bacterial laccase-like multicopper oxidase (LMCO)-encoding genes and transcripts were also carried out.\nResults.  Soils under spontaneous vegetation, eliminated in springtime using mechanical methods, showed the highest humic acid levels as well as the largest bacterial population, rich in laccase genes and transcripts after more than 30 years of experiments. The structure of the bacterial community based on LMCO genes also pointed to phylogenetic differences between these soils due to the impact of different management systems. Soils where herbicides were used to eliminate spontaneous vegetation once a year and those where pre-emergence herbicides resulted in bare soils clustered together for DNA-based DGGEanalysis, which indicated a certain amount of microbial selection due to the application of herbicides. When LMCO-encoding gene expression was studied, soils where cover vegetation was managed either with herbicides or with mechanical methods showed less than 10% similarity, suggesting that the different laccase substrates derived from vegetation cover decay when herbicides are used.\nConclusions.  We suggest that the low humic acid content retrieved in the herbicide-treated soils was mainly related to the type (due to vegetal cover specialization ) and smaller quantity (due to lower vegetal biomass levels) of phenolic substrates for laccase enzymes involved in humification processes. We also found that spontaneous vegetal cover managed using mechanical methods could be the best option for achieving C stabilization in rainfed Mediterranean agroecosystems.', u'https://peerj.com/preprints/2097/', ['Agricultural Science', 'Environmental Sciences', 'Microbiology', 'Soil Science'], [u'Department of Environmental Protection, CSIC-Estacion Experimental del Zaidin (EEZ)']]
[u'Geothematic open data in Umbria region', ['Andrea Motti', 'Norman Natali'], u'2016-06-03', u'Detailed information about geology, hydrogeology and seismic hazard issues for Umbria region are contained in a spatial database available as open data format (shape file or KMZ) and distributed under the regional open data portal called Open Data Umbria ( http://dati.umbria.it ) where 297 datasets have been produced by Umbria Region until now and most of them are made by Geological Survey. Development of standardized regional geologic database (BDG from now on) took about 20 years since 2010 to manage the huge set of information contained in the 276 geologic maps. As a result of migration to BDG, 231 distinct geologic units were found for Umbria Region territory represented by about 47,000 polygon features. The total land area of Umbria 8,475 km  2 wide is divided in the BDG into 46,982 different geological areas. Analysis of the information contained in the BDG is preliminary to the creation of more geothematic layers and custom maps. The key word is the characteristic index of the single geologic unit. Characteristic index, shown in percentage, calculates the ratio between the surface of the geologic units compared to their thickness. Thickness value for each geologic unit is intended to be based on rank level and calculated as weighted average of the thickness for each geologic unit. Calculations in terms of land area percentage show many differences between portions of the territory capable of storing water and the characteristic index of the geologic units capable of storing water. The situation changes if instead we analyze aquifers within individual geological domains and their characteristic index of the single geologic unit whose charts show significant differences. Moreover, after accurate analysis by the Geological Survey, regional seismic hazard maps were derived from the BDG and available as open data format. Umbria has been divided in thirteen zones where local conditions, i.e. presence of artificial fills or particular surface topography, may affect the shaking levels and amplify the effects of the earthquake. The total land area of Umbria is 8,475 square kilometers, and it has been classified in 69,675 unique zones each one characterized by particular seismic hazard. Statistics also show (in percent) that 48 of Umbria land area is characterized by morphological and stratigraphic conditions affecting the shake while 52 is not subject to amplification. Population living in area with no amplification is 322,987 accounting for 36.5 % of the total while 561,281 accounting for 63.5 % of the total live in area where amplification of the shake is likely to happen. Currently four italian regions, Emilia-Romagna, Marche, Tuscany and Umbria, have planned to cooperate starting from their own BDG and develop, after data generalization and analysis, a shared GIS based geologic database of Northern Appennines, following the European Standard database structure and format.', u'https://peerj.com/preprints/2096/', ['Data Science'], [u'Geological Survey, Regione Umbria']]
[u'Complexity curve: a graphical measure of data complexity and classifier performance', ['Julian Zubek', 'Dariusz M Plewczynski'], u'2016-06-03', u'We describe a method for assessing data set complexity based on the estimation of the underlining probability distribution and Hellinger distance. Contrary to some popular measures it is not focused on the shape of decision boundary in a classification task but on the amount of available data with respect to attribute structure. Complexity is expressed in terms of graphical plot, which we call complexity curve. We use it to propose a new variant of learning curve plot called generalisation curve. Generalisation curve is a standard learning curve with x-axis rescaled according to the data set complexity curve. It is a classifier performance measure, which shows how well the information present in the data is utilised. We perform theoretical and experimental examination of properties of the introduced complexity measure and show its relation to the variance component of classification error. We compare it with popular data complexity measures on 81 diverse data sets and show that it can contribute to explaining the performance of specific classifiers on these sets. Then we apply our methodology to a panel of benchmarks of standard machine learning algorithms on typical data sets, demonstrating how it can be used in practice to gain insights into data characteristics and classifier behaviour. Moreover, we show that complexity curve is an effective tool for reducing the size of the training set (data pruning), allowing to significantly speed up the learning process without reducing classification accuracy. Associated code is available to download at: https://github.com/zubekj/complexity_curve (open source Python implementation).', u'https://peerj.com/preprints/2095/', ['Algorithms and Analysis of Algorithms', 'Artificial Intelligence', 'Data Mining and Machine Learning'], [u'Centre of New Technologies, University of Warsaw', u'Institute of Computer Science, Polish Academy of Sciences']]
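As a rough illustration of the lazy Python-friendly format, each line in these files is a Python list literal, so it can be read straight back into a Python data structure. Here is a minimal sketch (read_entries is a hypothetical helper, not part of the PrePubMed code, and ast.literal_eval is used as a safer stand-in for the eval() call mentioned above):
import ast

def read_entries(path):
    # Read one indexed preprint per line back into a Python list.
    # Each entry has the structure:
    # [title, authors, date, abstract, link, tags, affiliations]
    entries = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(ast.literal_eval(line))
    return entries

articles = read_entries('peerj/update_log/2016-06-04-05-01-59.txt')
print(articles[0][0])  # title of the first preprint indexed that day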
If it isn't clear by inspecting these entries, the data structure is [title, author, date, abstract, link, tag, author_af]. These articles also get appended to the previously indexed PeerJ preprints present in peerj.txt. As a result, peerj.txt contains every indexed PeerJ Preprints article.

Having each preprint server indexed separately makes it easy to isolate any errors that may occur. I even have a folder that logs certain errors that may occur during the daily indexing. This is primarily to deal with the unfortunate situation in which a preprint server posts new articles while I am indexing, and to date this seems to have only happened once. This is not a big deal since the articles will just get indexed during the next round of indexing (assuming an error does not occur). More serious, however, is the possibility that the HTML layout changes without triggering an error file. As a result, I try to check the update files every few days to make sure articles are still being indexed correctly. Hopefully PrePubMed users will alert me if articles are getting indexed incorrectly, or better yet, the preprint servers themselves will give me a heads-up that they are redesigning their sites. Another advantage of saving the articles in text files is that, should the SQLite3 database need to be reconstructed, all the data to do so is there in a Python-friendly format.

Speaking of the SQLite3 database, let's go over the models present in the "papers" app. The SQL database consists of only four tables: an Author table, a Tag table, an Affiliation table, and an Article table, with no distinction for where a preprint originated (although you could use the link to tell where the preprint came from). The Article table is connected to the other three tables through many-to-many relationships:
from django.db import models

class Author(models.Model):
    first=models.CharField(max_length=50)
    last=models.CharField(max_length=50)
    middle=models.CharField(max_length=50)
    def __unicode__(self):
        return self.last

class Tag(models.Model):
    name=models.CharField(max_length=100)
    def __unicode__(self):
        return self.name

class Affiliation(models.Model):
    name=models.CharField(max_length=200)
    def __unicode__(self):
        return self.name

class Article(models.Model):
    title=models.CharField(max_length=300)
    abstract=models.TextField()
    pub_date=models.DateField()
    authors=models.ManyToManyField(Author)
    author_list=models.TextField()
    tags=models.ManyToManyField(Tag)
    affiliations=models.ManyToManyField(Affiliation)
    link=models.CharField(max_length=200)
    def __unicode__(self):
        return self.title
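As a quick illustration of how these models get queried through the Django ORM, here is a hypothetical snippet (not code from the repository; it assumes the models above live in papers/models.py):
from papers.models import Article

# All indexed preprints with an author whose last name is "Anaya", newest first
recent = Article.objects.filter(authors__last__iexact='Anaya').order_by('-pub_date')

# The tags attached to the first hit
tags = [t.name for t in recent[0].tags.all()]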
As far as I can tell, the "max_length" attributes seem to only affect the length of a Django form field, and don't impose length limitations when modifying the database through scripts (SQLite itself does not enforce CharField lengths). And since I'm currently not using Django Forms, they appear to be pointless. One thing to note is the "author_list" field in the Article table. It would seem that this is not needed since there is a connected Author table, but the many-to-many relationship does not preserve the order of the authors. As a result, I had to save the original order of the authors in its own field.

Perhaps the next most important, and most complicated, feature of PrePubMed is its search functionality. I attempted to replicate the experience of searching with PubMed, with a few deviations. The first thing that happens when you type something into the default search box is that your query is broken up into individual terms, with double quotes grouping multiple words into a single term. I utilized code from the blog of Julien Phalip, who is a Django core developer:
import re

def normalize_query(query_string):
    normspace=re.compile(r'\s{2,}').sub
    findterms=re.compile(r'"([^"]+)"|(\S+)').findall
    return [normspace(' ', (t[0] or t[1]).strip()) for t in findterms(query_string)]
Note: the code related to the default search is present in papers/views.py. These functions get imported by mysite/views.py and are not actually views themselves; although they are part of actual views, a different name may be more appropriate. What normalize_query does is take a query such as ' Jordan Anaya "prognostic genes" ' and return ['Jordan', 'Anaya', 'prognostic genes']. It has the added benefit of removing extra whitespace that may be present either at the ends of words or between words. I then use the power of this function to iteratively remove punctuation. Punctuation is converted to spaces, and then normalize_query is called to clean up the resulting string:
import string

mypunctuation='!#"$%&()*+,./:;<=>?@\\^_`{|}~'

mytable = string.maketrans(mypunctuation,' '*len(mypunctuation))

def get_mystring(q):
    rawstring=normalize_query(q)
    finalstring=[]
    for i in rawstring:
        if len(i.split())>1:
            finalstring.append(i)
        else:
            if string.translate(i,mytable)==i:
                finalstring.append(i)
            else:
                finalstring+=normalize_query(string.translate(i,mytable))
    return finalstring
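For example, tracing a hypothetical query by hand through the two functions above (punctuation becomes whitespace, quoted phrases are kept intact, and normalize_query cleans up the pieces):
get_mystring('breast cancer "prognostic genes" p53, BRCA1/BRCA2')
# returns ['breast', 'cancer', 'prognostic genes', 'p53', 'BRCA1', 'BRCA2']
Note that quoted phrases skip the punctuation translation entirely, which is why the quoted phrase comes through verbatim.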
The fact that I'm using a string method here requires that the queries be strings. Because of this, default searches are converted to Python strings, and non-ASCII characters are not allowed. Another reason to limit searches to ASCII characters is that I save author names as strings, and this limitation should force a user to avoid non-ASCII characters when searching for authors.

Speaking of authors, one of the best features of PubMed is automatically recognizing author names. PubMed can identify a full name written as "Julia S Wong" or "Wong Julia S", and can identify abbreviated names with up to two initials: "Wong J" or "Wong JS". I decided to add the same functionality to PrePubMed. Here is what I do in the search_results view:
from papers.views import *
all_terms={'names':[],'terms':[]}
all_terms=parsing_query(get_mystring(raw),all_terms)
A dictionary is created that will contain the portions of the query that will be treated as author names and the portions which will be searched against Title/Abstract. The function parsing_query populates this dictionary until all of the query has been used. It takes as inputs the list of terms produced by passing the raw query string through the previously discussed get_mystring function, and the dictionary which was just instantiated. Any portions of the query that get_mystring recognizes as a double-quoted phrase are immediately partitioned into the 'terms' key, while the rest are treated as potential authors.

Now is where things get complicated (not algorithmically, but rather logistically). Although I have a SQL table for authors, to avoid hitting the database I have two Python dictionaries for authors which get updated whenever the database is updated. One dictionary, name_first, has author first names as keys, while the other dictionary, name_last, has author last names as keys. Utilizing these two dictionaries I can quickly identify author names like PubMed does:
def parsing_query(mystring,all_terms):
    while mystring:
        query=mystring[0]
        if len(query.split())==1:
            if query in name_last:
                if len(mystring)!=1:
                    query_plus_1=mystring[1]
                    if len(query_plus_1.split())==1:
                        if len(query_plus_1)>2:
                            if query_plus_1 in name_first:
                                ##check if query is e.g. wong julia, could still be wong j, or wong julia s
                                if query in name_first[query_plus_1][0]:
                                    ##confirmed first name and last name could match, check for middle abbreviation
                                    if len(mystring)!=2:
                                        query_plus_2=mystring[2]
                                        if len(query_plus_2.split())==1:
                                            ##check abbreviation, limit length
                                            if len(query_plus_2)<=2:
                                                if query_plus_2 in name_first[query_plus_1][1]:
                                                    all_terms['names']=all_terms.get('names',[])+[[[query_plus_1,query_plus_2,query],[['first','full'],['middle',''],['last','full']]]]
                                                    return parsing_query(mystring[3:],all_terms)
                                                else:
                                                    all_terms['names']=all_terms.get('names',[])+[[[query_plus_1,query],[['first','full'],['last','full']]]]
                                                    return parsing_query(mystring[2:],all_terms)
                                            else:
                                                all_terms['names']=all_terms.get('names',[])+[[[query_plus_1,query],[['first','full'],['last','full']]]]
                                                return parsing_query(mystring[2:],all_terms)
                                        else:
                                            ##handle quoted string here, if there are two words can't be abbreviation
                                            all_terms['names']=all_terms.get('names',[])+[[[query_plus_1,query],[['first','full'],['last','full']]]]
                                            return parsing_query(mystring[2:],all_terms)
                                    else:
                                        all_terms['names']=all_terms.get('names',[])+[[[query_plus_1,query],[['first','full'],['last','full']]]]
                                        return parsing_query(mystring[2:],all_terms)
                                else:
                                    all_terms['names']=all_terms.get('names',[])+[[[query],[['last','full']]]]
                                    return parsing_query(mystring[1:],all_terms)
                            else:
                                ##check if it is actually first last
                                if query_plus_1 in name_last and query in name_first:
                                    if query_plus_1 in name_first[query][0]:
                                        all_terms['names']=all_terms.get('names',[])+[[[query,query_plus_1],[['first','full'],['last','full']]]]
                                        return parsing_query(mystring[2:],all_terms)
                                    else:
                                        all_terms['names']=all_terms.get('names',[])+[[[query],[['last','full']]]]
                                        return parsing_query(mystring[1:],all_terms)
                                else:
                                    all_terms['names']=all_terms.get('names',[])+[[[query],[['last','full']]]]
                                    return parsing_query(mystring[1:],all_terms)
                        else:
                            #could still be wong j or wong js
                            if len(query_plus_1)==1:
                                #check for first name abbreviation
                                if query_plus_1 in name_last[query][0]:
                                    all_terms['names']=all_terms.get('names',[])+[[[query_plus_1,query],[['first',''],['last','full']]]]
                                    return parsing_query(mystring[2:],all_terms)
                                else:
                                    all_terms['names']=all_terms.get('names',[])+[[[query],[['last','full']]]]
                                    return parsing_query(mystring[1:],all_terms)
                            elif len(query_plus_1)==2:
                                #check for first name middle name abbreviation
                                ##make a list of possible middle abbreviations
                                middle_abb=[middle for middle,first in zip(name_last[query][1],name_last[query][0]) if first==query_plus_1[0] and middle!='']
                                if query_plus_1[1] in middle_abb:
                                    all_terms['names']=all_terms.get('names',[])+[[[query_plus_1[0],query,query_plus_1[1]],[['first',''],['last','full'],['middle','']]]]
                                    return parsing_query(mystring[2:],all_terms)
                                else:
                                    all_terms['names']=all_terms.get('names',[])+[[[query],[['last','full']]]]
                                    return parsing_query(mystring[1:],all_terms)
                    else:
                        #handle quoted here
                        ##treating this as a new search term
                        all_terms['names']=all_terms.get('names',[])+[[[query],[['last','full']]]]
                        return parsing_query(mystring[1:],all_terms)
                else:
                    all_terms['names']=all_terms.get('names',[])+[[[query],[['last','full']]]]
                    return parsing_query(mystring[1:],all_terms)
            elif query in name_first:
                if len(mystring)!=1:
                    query_plus_1=mystring[1]
                    if len(query_plus_1.split())==1:
                        if query_plus_1 in name_first[query][0]:
                            ##check if query is e.g. julia wong, if so return name
                            all_terms['names']=all_terms.get('names',[])+[[[query,query_plus_1],[['first','full'],['last','full']]]]
                            return parsing_query(mystring[2:],all_terms)
                        else:
                            ##check for initial, e.g. julia s wong
                            if len(query_plus_1)<=2 and len(mystring)!=2:
                                query_plus_2=mystring[2]
                                if len(query_plus_2.split())==1:
                                    if query_plus_2 in name_first[query][0]:
                                        #check if the middle initial is in the correct place in the list
                                        #need to get a list of the first middle initial of each matching last name first name combo if middle exists
                                        middle_list=[middle[0] for last,middle in zip(name_first[query][0],name_first[query][1]) if last==query_plus_2 and middle!='']
                                        if query_plus_1[0] in middle_list:
                                            all_terms['names']=all_terms.get('names',[])+[[[query,query_plus_1,query_plus_2],[['first','full'],['middle',''],['last','full']]]]
                                            return parsing_query(mystring[3:],all_terms)
                                        else:
                                            all_terms['names']=all_terms.get('names',[])+[[[query,query_plus_2],[['first','full'],['last','full']]]]
                                            ##the middle initial was not found, but still allowing search to proceed
                                            all_terms['unknown']=all_terms.get('unknown',[])+[query_plus_1]
                                            return parsing_query(mystring[3:],all_terms)
                                    else:
                                        all_terms['names']=all_terms.get('names',[])+[[[query],[['first','full']]]]
                                        return parsing_query(mystring[1:],all_terms)        
                                else:
                                    #handle quoted here
                                    #last name is not allowed to be two words, return first name
                                    all_terms['names']=all_terms.get('names',[])+[[[query],[['first','full']]]]
                                    return parsing_query(mystring[1:],all_terms)
                            else:
                                all_terms['names']=all_terms.get('names',[])+[[[query],[['first','full']]]]
                                return parsing_query(mystring[1:],all_terms)           
                    else:
                        #handle quoted here
                        #use quoted strings to denote a new search term, stop name search here, return first
                        all_terms['names']=all_terms.get('names',[])+[[[query],[['first','full']]]]
                        return parsing_query(mystring[1:],all_terms)
                else:
                    all_terms['names']=all_terms.get('names',[])+[[[query],[['first','full']]]]
                    return parsing_query(mystring[1:],all_terms)
            else:
                if query not in stopwords:
                    all_terms['terms']=all_terms.get('terms',[])+[query]
                    return parsing_query(mystring[1:],all_terms)
                else:
                    return parsing_query(mystring[1:],all_terms)
        else:
            #handle quoted strings here
            ##quoted strings will be used to denote phrases instead of authors
            all_terms['terms']=all_terms.get('terms',[])+[query]
            return parsing_query(mystring[1:],all_terms)
    return all_terms
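To make the output concrete, here is roughly what parsing_query produces for the earlier example query Jordan Anaya "prognostic genes", assuming "Anaya" is a key in name_last, "Jordan" is a key in name_first, and the two are linked to the same indexed author (each dictionary maps a name to parallel lists of the other name parts seen with it):
parsing_query(['Jordan', 'Anaya', 'prognostic genes'], {'names': [], 'terms': []})
# roughly returns:
# {'names': [[['Jordan', 'Anaya'], [['first', 'full'], ['last', 'full']]]],
#  'terms': ['prognostic genes']}
# Each 'names' entry pairs the name parts with how each part should be matched
# ('full' for an exact match, '' for an initial), which the query-building
# functions below rely on.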
This function is almost perfect. For example, last names get preference over first names, but even if someone has the last name Jordan, the code will still recognize Jordan Anaya as me. Its one issue is when a full name contains a middle initial. For example, it won't recognize Michael B Eisen as the co-founder of PLOS, but rather as someone with the last name Michael and someone else with the last name Eisen. I can fix this; I just need to carefully insert a few more if statements. In the meantime there is advanced search if users are having difficulty with the default search box.

Once PrePubMed has identified what should be searched against Title/Abstract and what should be searched as authors, there is still the matter of actually performing the search. For the default search I have a function called perform_query which builds up a Django QuerySet:
def perform_query(all_terms):
    qs=[]
    if all_terms['terms']!=[]:
        q=tiab_query(all_terms['terms'])
        qs=Article.objects.filter(q)
        for i in all_terms['names']:
            qs=qs.filter(au_query(i))
    else:
        if all_terms['names']!=[]:
            qs=Article.objects.filter(au_query(all_terms['names'][0]))
            for i in all_terms['names'][1:]:
                qs=qs.filter(au_query(i))
    return qs
This function utilizes two other functions, au_query and tiab_query. I used an example found on the blog of Julien Phalip as a blueprint for how to build Django Q objects.
from django.db.models import Q

def tiab_query(query_string):
    query = None
    terms = query_string
    for term in terms:
        or_query = None
        for field_name in ['title','abstract']:
            q = Q(**{"%s__icontains" % field_name: term})
            if or_query is None:
                or_query = q
            else:
                or_query = or_query | q
        if query is None:
            query = or_query
        else:
            query = query & or_query
    return query

def au_query(names):
    and_query = None
    for index, name in enumerate(names[0]):
        if names[1][index][1]=='full':
            q = Q(**{"authors__%s__iexact" % names[1][index][0]: name})
        else:
            q = Q(**{"authors__%s__istartswith" % names[1][index][0]: name})
        if and_query is None:
            and_query = q
        else:
            and_query = and_query & q
    return and_query
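Tracing the running example by hand through these two helpers shows the filters they build:
terms_q = tiab_query(['prognostic genes'])
# equivalent to:
#   Q(title__icontains='prognostic genes') | Q(abstract__icontains='prognostic genes')
# with multiple terms, the per-term OR groups are AND-ed together

author_q = au_query([['Jordan', 'Anaya'], [['first', 'full'], ['last', 'full']]])
# equivalent to:
#   Q(authors__first__iexact='Jordan') & Q(authors__last__iexact='Anaya')
# name parts marked '' rather than 'full' (i.e., initials) use istartswith instead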
The resulting QuerySet, which behaves much like a Python list, is then sliced with Django's Paginator class and sent to a template to be turned into beautiful HTML. Although I could go over every line of code in the GitHub repository, that would turn this article into a small book, and this article is not meant to be a manual for PrePubMed or a Django tutorial. Rather, I wanted to highlight the essential functions of PrePubMed which replicate the familiar search functionality of PubMed.

I also want to emphasize that I have not perfectly replicated the search functionality of PubMed, and in fact I did not intend to. One glaring omission in PrePubMed is the absence of search tags such as [au]. These tags are important in PubMed because there are many different fields indexed, for example grant numbers, principal investigators, language, journal, publisher, etc. However, in PrePubMed there are only two options: your query is an author, or it isn't, so I don't find these search tags necessary for PrePubMed. One implication of not allowing search tags is that users don't have the ability to mark a stopword as an author name. As a result, I deviated from PubMed and gave author names preference over stopwords. This could be useful for someone searching for an author with the last name "the", but could be annoying for someone using "the" in their query. When using the default search, the Search Details box tells users how their query was searched, and should be consulted when results are unexpected. In addition, there is advanced search, which is similar to searching with tags. Another deviation from PubMed is that I don't allow OR searches. I guess I just don't find OR searches very useful; most people are trying to narrow down their search, not expand it.
I don't view PrePubMed as a permanent solution for searching preprints, but more of a stopgap until preprints are indexed by PubMed or a separate governmental or institutional search engine. And although one such search engine has been made, search.bioPreprint by the University of Pittsburgh Health Sciences Library System, it does not implement the functionality of PubMed, and by extension, PrePubMed. What search.bioPreprint does is take the user's query, input it into the search engine of each preprint server, and then organize the resulting hits. This allows them to perform a full text search, which is essential if you are interested in how many preprints have cited a certain article. However, they are limited in how many results they will return and don't sort the articles by date, which makes it difficult to identify the most recent work related to your query.

Edit 20160704: I realized that PeerJ does not allow full text searching of their preprints, and the full text searching on bioRxiv is inconsistent and sometimes takes so long that it times out, resulting in search.bioPreprint returning no bioRxiv articles. So the claim that search.bioPreprint allows for full text searching is questionable.

Another problem is that they don't auto-identify author names, so searching for an author with a common name will return a lot of spurious results. As a result, I see search.bioPreprint as more of a supplemental resource for searching preprints. And could PrePubMed index the full text of preprints? Sure, anything is possible. But again, I don't see PrePubMed as a long-term solution, and I am happy to see that there are attempts being made to develop preprint search engines. I believe preprints are too important for the future of science to only be indexed by some random guy in his apartment.