Friday, February 02, 2007

Ten things you want to know about dictionaries

I met Erin McKean at Wikimania 2006 and loved her presentation there, so I was really happy to find her talk in the Google Author series, titled "Ten things you want to know about dictionaries". I loved it; I have seen it twice now. I may even have added value to it by adding the "dictionary" label to the presentation. Here I am going to react to it with my OmegaWiki hat on. So yes, please watch the presentation (almost an hour and well worth it), as I hope it will improve the understanding of my reaction.

There is no one dictionary; it is a tool
OmegaWiki as a resource is very much a child of the Internet; consequently its use can be configured. When people do not care for particular information, they should be able to make it invisible or turn it off. In a way it is like the cordless drill: by replacing the drill bit with a different thingie it becomes another tool. The same is true for pronunciation; we love to include IPA, but we can also record pronunciations, and this way people do not need the understanding required to read IPA.
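
To make "turn it off" concrete, here is a minimal sketch in Python. The entry, its field names and the `hidden_fields` preference are made up for illustration; this is not OmegaWiki's data model or code.

```python
# Hypothetical entry; the field names and values are sample data,
# not OmegaWiki's actual schema.
entry = {
    "expression": "malaria",
    "language": "English",
    "definition": "an infectious disease transmitted by mosquitoes",
    "ipa": "/məˈlɛəriə/",
    "audio": "malaria.ogg",  # hypothetical file name of a recording
}

def render(entry, hidden_fields):
    """Return only the fields the reader has not switched off."""
    return {name: value for name, value in entry.items()
            if name not in hidden_fields}

# A reader who cannot read IPA hides it and keeps the recording.
view = render(entry, hidden_fields={"ipa"})
```

The data stays complete; only the view changes, which is the cordless drill with a different thingie on it.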

Please read the "front matter"
People indeed assume that they understand tools like dictionaries, and wikis for that matter, well enough not to RTFM. For a consumer good like a read-only lexical resource, it is pretty safe to leave the introduction unread. As OmegaWiki allows people to add to and edit the information that is in there, this proves to be much more problematic.

Inclusion in the dictionary is because it is useful
As we do not have all the functionality that we need to be a credible lexical resource, this is very much a state we hope to get to. However, our aim to include all lexicological, terminological and ontological data sounds pretty much like megalomania. Our standard excuse is that it is already less problematic because we only want to do this once, and this is where we came from. The data will only be useful when there are people who care about particular categories of data. I am totally with Erin that only data that is useful should be included. Getting rid of unnecessary cruft is hard work.

Horrible words make it in there too
I am a fan of swear words in dictionaries, particularly when there is some etymology to them. Most often people use swear words as expletives without much understanding of their original meaning. As English is a second language for me, it is relevant for me to understand why I would rather be a bigot than a racist or someone who discriminates.

The other kind of horrible words are those that are actually used and offend aesthetic sensibilities. In several medical resources you find stuff like MALARIA and Malaria. UGLY. However, as these forms are useful to these folks, it makes sense to include them anyway. As they mean exactly the same as the preferred English expression, malaria, it does not hurt.

As for words like "irregardless": well, being a non-native speaker of the language, I just want to be able to find them.

You have to look at all definitions to find the REAL meaning
The way the New Oxford American Dictionary does this is exquisite. They use a core sense / sub sense approach. To me this seems an approach that is very much language specific. For OmegaWiki to have such an approach, it will need quite a lot of thinking on how to build this.

Approaching the understanding of an expression with core senses / sub senses could be one way of stretching the number of concepts that people can juggle with. For Operational Definitions there is currently this practical limit of some 7 different meanings.

Dictionaries have a sell by date
For OmegaWiki this is not an issue, as it is a web-based resource. However, a related issue still applies; a Dutch book printed in 2003 will use the orthography of 1995 and not the 2005 orthography. People will still want to be able to understand what such words mean; annotating them as not being the official spelling since 2005 is relevant.

When words are tagged as used in earnest up to a certain date, I could even include 15th century German words without confusing people. Filtering would also help here.
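
Date tagging and filtering along these lines could look roughly like the Python sketch below. The keys `official_from` and `superseded_in`, the helper functions and the two placeholder spellings are all assumptions of mine for illustration; nothing here is OmegaWiki functionality.

```python
# Hypothetical records: "old-form" and "new-form" are placeholders,
# not real Dutch words; the years echo the 1995 and 2005 orthographies.
spellings = [
    {"text": "old-form", "official_from": 1995, "superseded_in": 2005},
    {"text": "new-form", "official_from": 2005},
]

def is_current(spelling, year):
    """True when the spelling is official in the given year."""
    start = spelling.get("official_from")
    end = spelling.get("superseded_in")
    started = start is None or start <= year
    ended = end is not None and end <= year
    return started and not ended

def label(spelling, year):
    """Show the spelling, annotated when it is no longer (or not yet) official."""
    if is_current(spelling, year):
        return spelling["text"]
    end = spelling.get("superseded_in")
    if end is not None and end <= year:
        return f"{spelling['text']} (not the official spelling since {end})"
    return f"{spelling['text']} (not yet official in {year})"

def filter_current(spellings, year):
    """Keep only the spellings that are official in the given year."""
    return [s for s in spellings if is_current(s, year)]
```

With this, `label(spellings[0], 2007)` gives the annotated old spelling, so a reader of a 2003 book can still find it, while `filter_current(spellings, 2007)` hides it from people who only care about today's orthography.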

Facts are good
Referring to actual usage seems obvious. What we intend to do is link OmegaWiki's content to Wikipedia; this will be the most obvious resource to start off with, as we aim to have a Wikipedia in all languages. Its language is modern usage, so given our Wiki credentials it is the obvious corpus. I totally agree when it is said that we are limited in this way; it is however a great start. When we gain a community of people and organisations that introduce us to other resources, it will be great.

At this moment OmegaWiki is still very much like a "stamp collection"; it is a nice collection, and at some stage it will even become useful.

What we do is like an iceberg
The work done on the New Oxford American Dictionary may be in preparation for the moment when other information, like thesaurus information, will be included. For OmegaWiki, including information from thesauri is what we did from the start by including the GEMET data. OmegaWiki at this moment is very much "you get what you see"; there is little of an iceberg yet. In a way this hampers the usefulness of our data, because there is often too much to take in.

Etymology
Technically, etymology is one of the hardest nuts to crack. When a word has its root in Latin, it often came to the English language through a French or Spanish connection. I wonder how the NOAD does this. Indeed, I do not have a copy, so I have not read the "front matter" either :) .

Neologisms
At this moment there is not much of a problem with neologisms yet. Our community is still small and sane.. sort of (you must be weird to involve yourself in a project like this). My current thinking is that this problem can be solved using annotation. When a word is tagged as "Neologism; this word is not used except by the author", it will be pretty devastating to the prestige of the word and/or the author on OmegaWiki.

The bonus: Using Google and other resources
Erin explains that the resources for building a resource like the NOAD are hardly as much as she would want. Given that this is true for a successful resource for the American English market, consider what this means for languages like Kituba, Stellingwerfs or Seeltersk. Consider what it means when you want to use a translation dictionary for such languages. These resources become a reality when there is the necessary cooperation; this is what OmegaWiki hopes to achieve.

There is a need for people to work on their terminology and indeed we would like to include the terminology of falconry, tennis, and ships. It will happen when it does.

Q&A
*OmegaWiki does want to include out of copyright content as well. For us it is a start. Collaborating with, for instance, WordNet would however be more important and relevant.
*Proper names; yes we want them; we have had George W. Bush for quite some time already.
*We would retire words by tagging them with the date when they went out of use.
*Circular definitions are even more problematic in OmegaWiki; this is a great example why.
*Context .. yes, I wish this was a problem that we had to deal with already; we need more functionality.
*Print versions .. this is no issue at this stage. Nobody has indicated that they want to work on this.

Conclusion:
I really enjoyed Erin's presentation. It helps me to get my mind around issues that have not popped up for OmegaWiki yet. As many are quite likely to make their appearance, it is best to be forewarned; it allows us to get forearmed.

Thanks,
GerardM
