Wednesday, May 21, 2014

Georgian Word and Phrase Frequency Lists

This first post introduces the word frequency collections I have created for the Georgian language.  Frequency lists for other foreign languages exist on the web (here is another collection), but those for Georgian are scarce, very limited in size, or don't reflect the colloquial and practical usage one finds on news portals, message boards, or blogs.  I have therefore generated several lists from the content of such sites, intended not only to aid my own study of Georgian but also to serve anyone else interested in the language and in how it is used in the media and by Georgian speakers communicating online.

Frequency Lists

forum.ge (Download)
This list was created from six months' worth of posts; 43 million words were analyzed.

Be warned, the list contains several words and phrases that may be regarded as non-standard or grammatically incorrect, in addition to words that are considered vulgar and/or offensive by native Georgian speakers. Many non-Georgian (predominantly Russian) words are also present in the list. As such, I don't advise you to refer to this as a standard vocabulary for studying Georgian, but as a rough portrait of modern, colloquial Georgian as it is used on the internet.

Despite the above caveat, the list may be useful as a modern reference when studying from a formal Georgian language course like Georgian: A Reading Grammar (Aronson, Howard) or Einführung in die Georgische Sprache (ჩხენკელი, კიტა). In particular, you can focus on the words and verb forms that appear both in the list and in the course vocabularies.

intermedia.ge, civil.ge, mediamall.ge (Download)
These lists were created from three separate media sites, each focusing on a different variety of topics.


Each archive above contains not only single-word frequencies but also two- and three-word combination frequencies.  The lists are very simple: words and phrases are arranged from most to least common, and each entry is tagged with its count, i.e. the number of occurrences in the source material analyzed.  The following example lists the top 10 words in the forum.ge archive:

და    1883889
არ    957538
რომ    604393
რა    473123
თუ    402558
უნდა    314222
ეს    308874
მე    239708
ამ    228322
მაგრამ    207494

და ('and' or 'sister') is the most common, followed by არ ('not') and რომ ('that', 'which').  In all of the lists I've found, და has always appeared at the top.  Interestingly, the most common word occurs nearly twice as often as the second most common word.  The same phenomenon is present in the list for Hungarian: 'a' ('the') is twice as common as 'nem' ('not').  Both Georgian and Hungarian are agglutinative languages, though Georgian has far more irregularities in several parts of speech.
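As a rough illustration, the ratio between the top two entries can be checked with a few lines of Python. This is only a sketch: it assumes each line holds an entry followed by whitespace and its count, as in the excerpt above (the readme.txt in each archive is the authority on the actual format), and the name parse_frequency_list is made up for this example.

```python
def parse_frequency_list(text):
    """Parse lines of the assumed form '<word or phrase> <whitespace> <count>'."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The count is the last whitespace-separated field; everything
        # before it is the entry, which may contain spaces (2- and 3-word phrases).
        phrase, count = line.rsplit(None, 1)
        entries.append((phrase, int(count)))
    return entries


# First three lines of the forum.ge excerpt quoted in the post.
SAMPLE = """\
და    1883889
არ    957538
რომ    604393
"""

entries = parse_frequency_list(SAMPLE)
(top_word, top_count), (second_word, second_count) = entries[0], entries[1]
ratio = top_count / second_count
print(f"{top_word} occurs {ratio:.2f} times as often as {second_word}")
```

Splitting from the right keeps multi-word entries intact, so the same function should also cope with the 2- and 3-word lists if they follow the same layout.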

Information on the Georgian Language

Georgian is primarily spoken in the country of Georgia, located in the Caucasus between Russia and Turkey, by about 4.2 million people.  The language uses its own writing system, and its grammar is characterized by a strikingly complex verb system.  In addition to the root, a verb can contain morphemes that indicate not only the subject and tense but also the direct object, indirect object, and aspect.  Because of this, a phrase in English such as "I had him send it" can be expressed in only one or two words in Georgian (გავაგზავნინე).

Due to the morphological properties of Georgian, the above lists contain counts for words exactly as they appear in the source material analyzed, not necessarily for their dictionary (or "root") forms.  As a result, a word like "this" will appear several times, in different declensions of both its demonstrative adjectival and nominal forms.  Included in each archive is a readme.txt file with further details on the source material and the format of each list.
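To make the surface-form counting concrete, here is a minimal Python sketch of how 1-, 2-, and 3-word frequencies could be computed from raw text. It is not the procedure actually used to build the archives: the tokenization rule (any run of letters in any script, so non-Georgian words are kept too), the function name ngram_counts, and the sample sentences are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed tokenization: a token is any run of Unicode letters.
# [^\W\d_] matches word characters that are neither digits nor underscore.
TOKEN = re.compile(r"[^\W\d_]+")


def ngram_counts(text, n):
    """Count n-word sequences exactly as they appear (no lemmatization)."""
    tokens = TOKEN.findall(text)
    # Note: in this simple sketch, n-grams may span sentence boundaries.
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


# Made-up sample: "this book" appears in the nominative (ეს წიგნი)
# and in the dative (ამ წიგნს).
text = "ეს წიგნი კარგია. ამ წიგნს ვკითხულობ და ეს წიგნი მომწონს."

unigrams = ngram_counts(text, 1)
bigrams = ngram_counts(text, 2)
trigrams = ngram_counts(text, 3)

# Print entries from most to least common, one "entry<TAB>count" per line.
for entry, count in unigrams.most_common():
    print(f"{entry}\t{count}")
```

Because nothing is reduced to a dictionary form, ეს and ამ are counted as separate entries, as are წიგნი and წიგნს, which is the effect described above.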