Mark Davies, Professor of (Corpus) Linguistics


Education	Education I received a B.A. in 1986 with a double major in Linguistics and Spanish, which was followed by an M.A. in Spanish Linguistics in 1989. I then received a PhD from the University of Texas at Austin in 1992, with a specialization in "Ibero-Romance Linguistics" (a fancy term for Spanish and Portuguese linguistics).
Research CV Publications (downloadable articles)	Research As a professor of Spanish at Illinois State University from 1992-2003, most of my publications dealt with historical and genre-based variation in Spanish and Portuguese syntax. I then taught at BYU from 2003-2020, where my research dealt primarily with general issues in corpus design, creation, and use (especially with regard to English), as well as word frequency. Overall, I have published 6 books and about 90 articles, and I have given numerous presentations at international conferences (many of them keynote / plenary talks).
Awards	Awards At BYU I received the Karl G. Maeser Research and Creative Arts Award, which recognizes achievements in research. This award is given each year to only two or three people from the 1,500+ full-time faculty members at BYU, and it had not been given to anyone else in the College of Humanities in the previous eight years. I was also given the Creative Works Award, which is given to one person each year, who "demonstrates outstanding achievement in the development of creative works that have had wide acceptance and distribution nationally or internationally." Finally, I have received several awards from the College of Humanities (approx 200 faculty members), including the Barker lectureship, a two-year College Professorship, and two terms (two years + three years) as a Fellow for the Humanities Center.
Grants	Grants I have received six large federal grants to create and analyze corpora. These include four from the National Endowment for the Humanities: 2001-02 (to create a large corpus of historical Spanish), 2004-2006 (to create a large corpus of historical Portuguese, with Michael Ferreira), 2009-2011 (to create a large corpus of historical English), and 2015-2017 (to enlarge the Spanish and Portuguese corpora). The two grants from the National Science Foundation were in 2002-2004 (to examine genre-based variation in Spanish, with Douglas Biber) and 2013-2016 (to examine "web-genres", with Douglas Biber and Jesse Egbert). In addition to these six US-based grants, I have had a large subcontract for a grant from the UK Arts and Humanities Research Council (2014-2016; to create the architecture and web interface for large semantically-tagged corpora). I am also a co-PI for a grant from the Korea Research Foundation (2014-2017, with Jong-Bok Kim) to examine three related syntactic constructions in English from a corpus-based perspective. See below for more information on these projects.

Integrating AI/LLMs into the corpora (2025)	Integrating AI/LLMs into the corpora (2025) English-Corpora.org (and the Spanish and Portuguese corpora) now offer something entirely new: the ability to combine the depth and reliability of corpus data with the analytic power of Large Language Models (LLMs) like GPT, Gemini, Claude, Perplexity, Llama, DeepSeek, and Mistral. With just one click, the corpus sends collocates, frequency patterns, phrase lists, or concordance lines to an LLM, which instantly groups, explains, and interprets the data. These AI-powered insights appear directly in the interface, alongside the original corpus results, and nothing like this exists for any other online corpora. (More information)
Comparing AI / LLMs and corpora (2024)	Comparing AI / LLMs and corpora (2024) I have just finished in-depth research that compares the "intuitions" of AI models / Large Language Models (like ChatGPT or Google Gemini) with actual corpus data -- for word frequency, phrase frequency, collocates, word comparisons (via collocates), and more. In addition, I looked at what LLMs "know" about linguistic variation -- between genres, dialects, and historical periods. I recently released the findings in "white papers" at English-Corpora.org.
Detailed training and documentation (2023)	Detailed training and documentation (2023-2024) In the last two years or so, I have added several detailed PDF help files: overview / guided tour, architecture, association measures, collocates (cf Sketch Engine), topics (and collocates), word sketches, browsing words, analyzing texts, KWIC -> analyze text, saved words and phrases, saving KWIC entries, customized word lists, search history, external resources, monitor corpus, Virtual Corpora, Virtual Corpora: quick overview. I have also added several detailed instructional videos: overview, language learning and teaching, word sketches, browsing words, analyze texts, search history, customized word lists, saved words (favorites), KWIC lines: limiting and sorting, saved KWIC lines, analyze KWIC lines, external resources, Virtual Corpora, examining recent change.
The NOW Corpus as a monitor corpus (2022)	The NOW Corpus as a monitor corpus (2022) As of March 2022, it is possible to find daily keywords in the NOW Corpus (15.9+ billion words as of September 2022, and growing by about 200-220 million words each month). This is useful to research current events like the invasion of Ukraine -- or any other current event. It is now also possible to quickly and easily search the NOW Corpus by year, and then month, and then day - something that no other large corpus offers.
Spanish and Portuguese corpora (2021)	Spanish and Portuguese corpora (2021) A number of new features were added to the Corpus del Español and the Corpus do Português. These include the ability to browse and search through the top 40,000 words in the language, and to see detailed information on each word (frequency and distribution, definition, translation to 100+ languages, images, videos, pronunciation, synonyms, collocates, related topics, concordance lines, etc). Users can now import entire texts, and analyze the texts to find keywords, see detailed information on each word in the text, and quickly and easily search for related phrases in the corpora.
COCA (Analyze Texts) (2020)	COCA (Analyze Texts) (2020) In COCA, users can now analyze entire texts (e.g. student compositions or online newspaper articles) using COCA data. They can find keywords in their texts, and can click on any word in the text to see a wide range of information (definition, pronunciation, images, videos, synonyms, related words, collocates and related topics, clusters, concordances, etc). They can also quickly and easily find phrases in COCA that are related to phrases in their text, which allows them to find "just the right phrase" to express a given concept.
Coronavirus Corpus (2020)	Coronavirus Corpus (2020) 1.5 billion words of data in almost 1.9 million texts from Jan 2020 - Dec 2022. The corpus is designed to be the definitive record of the social, cultural, and economic impact of the coronavirus (COVID-19) in 2020 and beyond.
All corpora (2020)	All corpora (2020) The frequency-based data from all of the corpora is now linked to a wide range of external resources, including searches of the web, images, and billions of words of books; videos from YouGlish; and translations of the corpus phrases into many different languages. All of this combines some of the most powerful and widely-used corpora in the world with huge amounts of data from other sources.
COCA 2020 (2020)	COCA 2020 (2020) The Corpus of Contemporary American English (COCA) is probably the most widely-used corpus throughout the world, and the only corpus that is 1) large 2) recent and 3) has texts from a wide range of genres. In early 2020, it was nearly doubled in size (to one billion words), it now includes texts through Dec 2019, and it now includes three new genres (blogs, other web pages, and TV/Movies subtitles). In addition, the "word-oriented" pages (see iWeb below) are now available for COCA as well. (More information)
TV and Movie corpora (2019)	TV and Movie corpora (2019) These are the most informal of all of the corpora from English-Corpora.org. The TV Corpus has 325 million words in 75,000 TV scripts (comedies and dramas) from 1950-2018 and the Movie Corpus has 200 million words in 25,000 scripts from 1930-2018. In addition to having extremely informal language (even more informal than actual spoken corpora like the BNC-Spoken), the corpora also allow you to look at change over time, as well as variation between dialects. (More information)
iWeb Corpus (2018)	iWeb Corpus (2018) iWeb is the largest corpus that we've ever created -- 14 billion words, which is nearly 14 times the size of COCA. (And yet it's still as fast as any other corpus, due to its advanced architecture.) The corpus allows users to browse through the top 60,000 words in the corpus (including by pronunciation), and for each of these words you can see a wealth of information -- much of which is not available for any of the other corpora from English-Corpora.org, other than COCA (including links to pronunciation, images, videos, and translations). iWeb is perhaps the most innovative and learner-friendly corpus that we've ever created. (More information; also available in Chinese)
Billion word extensions to the Spanish and Portuguese corpora (2015-2018)	Billion word extensions to the Spanish and Portuguese corpora (2015-2018) In 2015 I was awarded (see p37) a three year grant from the US National Endowment for the Humanities to create much larger, updated versions of the Corpus del Español and the Corpus do Português. The Corpus del Español is now 100 times as large as before (two billion words, compared to 20 million words for the 1900s) and the Corpus do Português is now 50 times as large as before (one billion words, compared to 20 million words for the 1900s). In addition, each corpus allows users to see the frequency by country, as is already possible for English with the GloWbE corpus.
Early English Books Online (2017)	Early English Books Online (2017) Part of the SAMUELS project and funded by the AHRC (UK). This corpus contains 755 million words in more than 25,000 texts from the 1470s to the 1690s. The corpus provides many types of searches that are not available from other EEBO corpora online.
Corpus of US Supreme Court Opinions (2017)	Corpus of US Supreme Court Opinions (2017) This corpus contains approximately 130 million words in 32,000 Supreme Court decisions from the 1790s to the current time. This allows users to see how words and phrases have been used in a legal context since that time. This corpus is related to other activities and projects that use corpora to look at legal questions.
NOW corpus ("News on the Web") (2016)	NOW corpus ("News on the Web") (2016) The corpus automatically grows by about 7-8 million words per day, 180-200 million words per month, or more than 2 billion words each year. So when people search the NOW corpus, the data will be current as of yesterday, which should be useful for research that would benefit from up-to-date corpora (i.e. no excuse to be limited to stale corpora from 20-25 years ago).
CORE corpus (Corpus of Online Registers of English) (2016)	CORE corpus (Corpus of Online Registers of English) (2016) Douglas Biber, Jesse Egbert, and I received a grant from the US National Science Foundation to create "A Linguistic Taxonomy of English Web Registers", and this corpus is the result of that research (see also 1 and 2). The corpus contains more than 50 million words of text from the web, and it is the first large web-based corpus that is so carefully categorized into so many different registers. This is quite different from other very large corpora that simply present huge amounts of data from web pages as giant "blobs", with no real attempt to categorize them into linguistically distinct registers.
New corpus interface (2016)	New corpus interface (2016) The new corpus interface has the following improvements and enhancements over the interface that had been used since 2008: 1) it now works great with mobile devices as well 2) cleaner, simpler interface 3) more helpful help files 4) simpler, more intuitive search syntax. It also allows users to easily and quickly create and use "virtual corpora" [VC] (e.g. texts from a particular magazine, or related to a particular concept), and then search within the VC, compare frequency across different VC, and quickly generate keyword lists from the virtual corpus.
Hansard Corpus (British Parliament) (2015)	Hansard Corpus (British Parliament) (2015) Part of the SAMUELS project and funded by the AHRC (UK). This corpus contains 1.6 billion words in 7.6 million speeches in the British Parliament from 1803-2005. A unique feature of the corpus is that it is semantically tagged, which allows for powerful meaning-based searches. In addition, users can create "virtual corpora" by speaker, time period, House of Parliament, and party in power, and compare across these corpora. The end result is a corpus that is of value not only to linguists (as the largest structured corpus of historical British English from the 1800s-1900s), but it is also very useful for historians, political scientists, and others.
Wikipedia Corpus (2015)	Wikipedia Corpus (2015) This corpus is based on 1.9 billion words in 4.4 million articles from Wikipedia. You can quickly and easily create "virtual corpora" from the 4.4 million web pages (e.g. electrical engineering, investments, or basketball), and then search just that corpus, or create keyword lists based on that virtual corpus. If you want to create a customized corpus for a particular topic, but don't want to have the hassle of collecting all of the texts yourself, this should be a very useful corpus.
Downloadable full-text corpus data (2014)	Downloadable full-text corpus data (2014) You can download all of the texts for several of our largest corpora -- tens of billions of words of data. With this data on your own computer, you can do many things that would be difficult or impossible via the regular web interface, such as sentiment analysis, topic modeling, named entity recognition, advanced regex searches, creating treebanks, and creating your own word frequency, collocates, and n-grams lists.
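As a rough illustration of the last items in that list (word frequency, collocates, and n-grams), here is a minimal Python sketch. It is not the code used to build the corpora, and it makes no assumption about the format of the downloaded files: the short `sample` string simply stands in for text you have loaded yourself. The function names (`tokenize`, `ngram_counts`, `collocate_counts`) and the `span` parameter are hypothetical, chosen only for this example, and the tokenizer is deliberately simple (no lemmatization or part-of-speech tagging).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def ngram_counts(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def collocate_counts(tokens, node, span=4):
    """Count the words found within `span` tokens on either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts


# Stand-in text; in practice this would be read from the downloaded files.
sample = "The corpus is large. The corpus is recent, and the corpus is balanced."

tokens = tokenize(sample)
word_freq = Counter(tokens)                              # word frequency list
bigrams = ngram_counts(tokens, 2)                        # 2-gram frequency list
collocates = collocate_counts(tokens, "corpus", span=1)  # immediate neighbors

print(word_freq.most_common(3))
print(bigrams.most_common(2))
print(collocates.most_common())
```

For billions of words, the same counting logic would need to process the files one at a time rather than holding everything in memory, and raw co-occurrence counts like these would usually be followed by an association measure before being treated as collocates.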
www.academicwords.info (2013)	www.academicwords.info (2013) Our Academic Vocabulary List of English improves substantially on the AWL created by Coxhead (2000). Most of this data is also integrated into the WordAndPhrase (Academic) site, so that you can see a wealth of information about each word. See the Applied Linguistics article.
GloWbE: Corpus of Global Web-Based English (2013)	GloWbE: Corpus of Global Web-Based English (2013) 1.9 billion word corpus from 1.8 million web pages in 20 different English-speaking countries. In addition to being very large (20 times as big as the BNC), this corpus also allows you to carry out powerful searches to compare the English dialects and see the frequency of words, phrases, grammatical constructions, and meaning in these twenty different countries.
www.wordandphrase.info (2012)	www.wordandphrase.info (2012) Even more so than the standard COCA interface, this website is designed to provide information on nearly everything that you might want to know about words and phrases and their usage on one screen and with one search. Best of all, you can enter entire texts and see detailed information about each word in the text, and see related phrases from COCA.
Google Books Corpus (2011)	Google Books Corpus (2011) This improves greatly on the standard n-grams interface from Google Books. It allows users to actually use the frequency data (rather than just see it in a picture), to search by wildcard, lemma, part of speech, and synonyms, to find collocates, and to compare data in different historical periods.
Corpus of Historical American English (COHA) (2010)	Corpus of Historical American English (COHA) (2010) 400 million word corpus of historical American English, 1810-2009. The corpus is 100 times as large as any other structured corpus of historical English, and it is well-balanced by genre in each decade. As a result, it allows researchers to examine a wide range of changes in English with much more accuracy and detail than with any other available corpus. (Funded by the US National Endowment for the Humanities)
English word frequency, collocates, and n-grams (2010)	English word frequency, collocates, and n-grams (2010) Based on COCA and other corpora, the data provides a very accurate listing of the top 100,000 words in English (including frequency by genre), the frequency of 15,300,000+ collocate pairs, and the frequency of all n-grams (1, 2, 3, 4-grams) in the corpus.
Frequency dictionary of American English (2009)	Frequency dictionary of American English (2009) The dictionary contains the top 5000 words (lemmas) in American English, based on the data from the Corpus of Contemporary American English (COCA). The dictionary gives the top collocates for each of the 5000 words, which gives a very good idea of the overall meaning of each word. (Co-authored with Dee Gardner (BYU), and published by Routledge.)
Corpus of Contemporary American English (COCA) (2008)	Corpus of Contemporary American English (COCA) (2008) This 450+ million word corpus (now 1 billion words; 2020) is the only large and balanced corpus of American English. It is probably the most widely-used online corpus currently available. Because of its design, it is also perhaps the only large corpus of English that can be used to look at ongoing changes in the language.
Frequency dictionary of Portuguese (2007)	Frequency dictionary of Portuguese (2007) The dictionary is based on the 20 million words from the 1900s portion of the 45 million word Corpus do Português. It is the first frequency dictionary of Portuguese that is based on a large corpus from several different genres. (Co-authored with Prof. Ana Preto-Bay of the Department of Spanish and Portuguese at BYU, and published by Routledge.)
Corpus do Português (2006)	Corpus do Português (2006) 45 million word corpus of Portuguese (1300s-1900s). The corpus allows users to find the frequency, distribution, and use of words, phrases, and grammatical constructions in different historical periods, as well as in the genres and dialects of Modern Portuguese. (Created in conjunction with Michael Ferreira of Georgetown University, and funded by the US National Endowment for the Humanities)
Frequency dictionary of Spanish (2005)	Frequency dictionary of Spanish (2005) This is the first major frequency dictionary of Spanish that has been published in English since 1964. It is based on the 20 million words from the 1900s portion of the 100 million word Corpus del Español, and it includes many features not found in any previous dictionary of Spanish. Second edition (with Kathy Hayward Davies) in 2017; based on a much larger corpus, with many improved features.
Register variation in Spanish (2004)	Register variation in Spanish (2004) Used large corpora of many different registers of Spanish as the basis for a "Multi-dimensional analysis of register variation in Spanish". (Carried out in conjunction with Douglas Biber of NAU, and funded by the US National Science Foundation)
Corpus del Español (2002)	Corpus del Español (2002) 100 million word corpus of Spanish (1200s-1900s). The corpus allows users to find the frequency, distribution, and use of words, phrases, and grammatical constructions in different historical periods, as well as in the genres of Modern Spanish. (Funded by the US National Endowment for the Humanities)
LDS General Conference Corpus (2000-)	LDS General Conference Corpus (2000-) Quickly and easily search talks from General Conference of the Church of Jesus Christ of Latter-day Saints (Mormons). This corpus (or collection of texts) contains 25 million words in 11,000+ talks from 1851 to the current time. You can see the frequency of words and phrases and study how words and phrases are used differently over time. You can also compare the frequency by speaker, and see what keywords characterize a given speaker.

Technology	Technology In order to create large corpora and place them online, I have acquired experience in a number of different technologies. These include database organization and optimization (mainly with SQL Server, including advanced SQL queries), web-database integration (ActiveX Data Objects), client-side programming (mainly DHTML / Javascript), VB.NET (for processing billions of words of data) and several different corpus and text-related tools. I also maintain the hardware and software for my Windows servers, including the administration of Internet Information Services (IIS).
Personal	Personal Beyond life at the university, my interests include comparative religion, world cultures, history, languages of the world, and the relationship between technology and culture, including the Internet. And of course I enjoy spending time with my family -- Kathy, our children, and our grandchildren.
Email	Email markmark-davies.org