As in a previous cross-temporal analysis of the Dutch urban system (Van der Knaap, 1980), we decided to start from the current situation and keep the list of units of analysis consistent throughout the period covered by the data collection. The content of a digital archive might be influenced by many factors, such as digitalization policies, projects targeting a specific part of the media landscape (a newspaper, a region or a time period) or copyright issues. Pred (1971) defines information fields as the total array of non-local contacts of individual places. By extracting geographical information from a selection of 102 million news items, this database allowed us to study the spatial diffusion of information about and between Dutch cities, based on a set of 81 newspapers published in 29 cities between 1869 and 1994. Ehrmann M., Colavizza G., Rochat Y., Kaplan F., 2016, "Diachronic Evaluation of NER Systems on Old Newspapers", 11. At the time the data collection started, there were 1970 different titles in the archive. As we noticed that the multiNER software misclassified the type of some named entities, we kept only the articles with named entities that exactly match the city name. Table 1: Summary statistics of the Delpher corpus. Information circulation has been identified as a key factor in urban dynamics. Cities can be defined according to many criteria: they can be continuous built-up areas, functional entities, or places designated by a certain level of urban functions or by administrative status. For years, the main concern of the Ottoman Porte in Transjordan was to ensure the safety of the Hajj caravan by paying the Bedouin tribes of the regions it passed through (eg. Then, it presents issues in place name recognition and the choices made to deal with these issues. 
Cybergeo, the electronic European Journal of Geography, is intended to promote faster communication of research and greater direct contact between authors and readers. Created with the aim of encouraging the exchange of ideas, methods and results, it publishes in any European language. This tendency reflects the history of the Dutch press. This operation could be done in a reasonable amount of time. and ambiguities in place names. This special issue aims to explore, interrogate and reflect on the ways in which women are understood, contextualised and represented in the text of the Bible that has developed, in various ways, a foundational significance for Western culture. While the importance of such an approach was widely acknowledged, the study received a number of critiques related to the book selection (Morse-Gagné, 2011), and the fact that it did not include newspapers, which were thought to better reflect their time due to the frequency of publication (Schwartz, 2011). It has great potential for urban scholars seeking to answer questions related to the dynamics of Dutch cities and the spatial diffusion of information, as well as for historians or media scientists interested in the geographical bias of news coverage. The very short lifespan of most titles is consistent with the findings of Van Kranenburg et al. The wealth of geographic information in such digital archives has not been used much, although it is very valuable for the study of cities. By contrast, few studies have explored the wealth of geographical information that can be extracted from these archives. Pour une approche constructiviste de la dimension éthique de l'espace des sociétés. This database was built by analysing the content of 102 million articles and classified advertisements published in 81 local newspapers from 29 Dutch cities, with publication dates ranging from 1869 to 1994. Family names: in many cultures, it is common to have a family name that relates to a place. 
Antoine Peris, Willem Jan Faber, Evert Meijers and Maarten van Ham, "One century of information diffusion in the Netherlands derived from a massive digital archive of historical newspapers: the DIGGER dataset", Cybergeo : European Journal of Geography [Online], Data papers, document 928, published online on 14 January 2020, … Figure 1: News items per year in Delpher and in the sub-corpus. Because we are interested in identifying cities in texts, we must go beyond these definitions and identify the terms that relate to cities in common language. Irbid’s growth rate is very high (4.2% per year between 1979 and 1994, and 1.9% between 1994 and 2004). Brieven franco, left. Figure 6: Information field extracted from 15 local newspapers. Table 2 shows that the vast majority of city names are not ambiguous (86.4%) and do not require the use of NLP techniques. Peris A., Meijers E. J., van Ham M., 2019, "Information diffusion between Dutch cities: revisiting Zipf and Pred using a computational social science approach", Submitted. We adopt what Goodchild and Li (2011) call a “placial” perspective. This problem is particularly acute for data on interurban relations at the scale of systems of cities. The woonplaatsen are used in everyday language; they are the toponyms people include when writing down an address. This can also be the case when a region and its most important city have the same name, as with Groningen and Utrecht. edited by Zanne Domoney-Lyttle and Sarah Nicholson. The most important sources of errors leading to false positives are listed below. However, problems related to extracting spatial information from text were not addressed, including the variety of scales (an article can mention a street, a city, a country, etc.) 
ZEE-MILITIE.De Burgemeeiter en Wethouders van Venloo nootfigen bij deze de lotelineen uit, die bij de Zee-Militie verlangen te dienen, zich daartoe bij hen aantemelden, ter plaatselijke Secretarie vóór den 1 April aanstaande. However, with the recent development of computing techniques, it is now possible to upscale and systematize data collection from newspapers to analyse information circulation at the level of an entire territory. (2011) showed the potential of this approach by compiling 5 million digitalized books to provide quantitative insights on the evolution of grammar, as well as the detection of events such as pandemics, the influence of certain thinkers, or the evolution of gender bias in vocabulary. This is the case for “Katwijk”, which is at the same time a medium-sized coastal city in South Holland and a very small village in North Brabant. We then counted the number of true positives, true negatives, false positives and false negatives to derive precision and recall indices for our three periods of time. Bani Sakhr and … These maps confirm the importance of distance for information flows, as most of the attention in 1871 is concentrated on nearby cities and towns, with some attention to the big cities of the provinces of North and South Holland. Researchers have shown that these massive digital archives can be used to identify macroscopic trends related to historical and cultural changes. In a previous study (Meijers, Peris, 2018), different problems were identified in the case of the Dutch woonplaatsen. This resulted in the presence of many short-lived newspapers published only during the Second World War (n=2139), which can be very interesting for historians of the war but are less relevant for long-term studies. 
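The precision and recall indices derived from the validation counts follow the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). A minimal Python sketch is given here for illustration; the period labels and all counts are invented placeholders, not results from the validation.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from validation counts.

    True negatives are not needed for either index.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical counts for three periods (placeholders, not the paper's figures).
example_counts = {
    "period_1": {"tp": 80, "fp": 20, "fn": 20},
    "period_2": {"tp": 90, "fp": 10, "fn": 10},
    "period_3": {"tp": 95, "fp": 5, "fn": 5},
}

# One (precision, recall) pair per period.
scores = {
    period: precision_recall(c["tp"], c["fp"], c["fn"])
    for period, c in example_counts.items()
}
```

Computing the indices separately per period makes it possible to see whether recognition quality changes over time, for instance with print and OCR quality.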
While this study could look more precisely at historical and cultural trends, the analysis of the geographical focus, which was not the core of the study, remained at the stage of visualisation. Carefully selecting the corpus can significantly reduce bias, and is necessary to create a dataset that is as representative as possible for the research question. 33 en J.A.v.der Goes,jd. Lansdall-Welfare T., Sudhahar S., Thompson J., Lewis J., Team F. N., Cristianini N., 2017, "Content analysis of 150 years of British periodicals", Proceedings of the National Academy of Sciences, Vol.114, No.4, E457-E465. However, extracting such patterns remains an important challenge from a methodological point of view. To allow data collection in a reasonable amount of time, it is very important to work on a limited number of entities. The town’s surface area has … A search of Sociological … 31 en E. v. Vollenho ven, jd. But because of the time and workforce needed for the data collection, these studies were limited to a very small number of cities or short periods of time. International, national and institutional contexts have led to the redefinition of a project that began in 2003 and that has already fulfilled its original … One crucial requirement for understanding urban dynamics is having data on the relationships between cities. STR is the result of a simple string query for unambiguous place names, the NER column is the result of a string query for the places that are in the list of ambiguous place names, and NER result is the outcome of the NER algorithm on the ambiguous place names. Van Kranenburg H. L., Palm F. C., Pfann G. A., 1998, "The life cycle of daily newspapers in the Netherlands: 1848–1997", De Economist, Vol.146, No.3, 475-494. 
He intended the project to highlight Islamic engineering and power, but also profited … The only difference is that, in addition to the frequency returned by the simple string query, there is an extra column with the number of hits after performing NER on the individual articles returned by the first query: Table 4: Structure of the freq_count_ner.csv file. Years of publication. This selection resulted in three tables similar in structure to the one shown in Table 3. The different steps of the data collection are summarized in Figure 5. Classification of issues in place name recognition, A trade-off between computation time and precision level, Application: The information field of 15 Dutch cities in 1871. Over the past two decades, major efforts have been made to digitise historical texts, particularly books and newspapers, which are very rich sources on the societies that produced them. We decided to use a mixed technique to retrieve the data on cities in a reasonable amount of time. Different types of NER algorithms exist. This huge variability in duration is also reflected in the number of news items published by the different newspapers. Mining these huge amounts of textual data is an important challenge for the social sciences because these textual sources contain much information on social and economic processes, which are very often tied to places. The increase in the second half of the 19th century can be explained by the abolition of a tax on newspapers – the 'dagbladzegel' – which made them cheaper and affordable for a wider public. The result of this selection is a set of 317 cities. This way, we could drop the names of people that are composed of a first name (or initials), a family name, and sometimes a prefix in between ("van", "de", "van der", etc.). 
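The exact-match rule that lets person names drop out can be sketched as follows. This is only an illustration under stated assumptions: the `Entity` structure, its labels and the helper name `mentions_city` are hypothetical and do not reproduce the output format of the NER software used for the dataset. The sketch ignores the entity label, consistent with the idea that labels may be misclassified, and relies only on the surface form.

```python
from typing import Iterable, NamedTuple


class Entity(NamedTuple):
    """Hypothetical NER output: surface form plus a (possibly wrong) label."""
    text: str
    label: str


def mentions_city(entities: Iterable[Entity], city: str) -> bool:
    """Keep an article only if some entity is exactly the city name.

    The label is deliberately not checked; only the surface form counts.
    """
    return any(entity.text.strip() == city for entity in entities)


# A person name containing the city is a different surface form, so it is dropped.
person_only = [Entity("J. van Utrecht", "PER")]
# An exact match is kept, even if the label is wrong.
city_mislabelled = [Entity("Utrecht", "PER"), Entity("Jan de Groot", "PER")]

keep_first = mentions_city(person_only, "Utrecht")
keep_second = mentions_city(city_mislabelled, "Utrecht")
```

Because a full person name such as "J. van Utrecht" is recognised as one entity, it never equals the bare city name, which is why no explicit list of prefixes is needed in this sketch.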
Classical urban literature has highlighted the importance of available information for the locational decisions of individuals, groups and firms, and its role as a prerequisite for other kinds of movements of people and goods. This separation into different sets was done because we were aware that the quality of prints significantly improved during this period, affecting the efficiency of the automatic recognition of characters (OCR) used during the digitalisation of the newspapers. The column ppn corresponds to a unique identifier given to each newspaper title. In this paper, we present DIGGER, a newly developed dataset that we built on Delpher, the digital archive of historical newspapers of the National Library of the Netherlands, by extracting geographical information from a selection of 102 million news items. They resulted in two files: one with the results of the data collection for the unambiguous city names (freq_count_STR.csv) and one for the ambiguous city names (freq_count_NER.csv). Figure 2: Location of the 317 cities for which data is collected. Nonetheless, we acknowledge that there are also some drawbacks. Over the last two decades, many efforts have been made to digitalize texts, including books and newspapers, which are primary sources on most of our societies. More detailed descriptions of the files can be found in the metadata of the dataset. This paper presents the method developed to build the dataset as well as the validation steps for the accuracy of the place name recognition. After that, type describes whether the city is mentioned in an article, an advertisement, a family announcement, or the caption of an illustration. 