New blog explores Norwegian grammatical phenomena with data
A CAS fellow’s new blog aims to use data to explore some myths about the Norwegian language.
One such resource is known as a treebank, which is a large collection of text in which each sentence has been analysed syntactically (and sometimes also semantically). Treebanks are often used in language technology development, as they give developers a database of examples of grammatical phenomena.
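To make the idea concrete, here is a minimal sketch in Python of how a syntactically analysed collection might be queried for a grammatical phenomenon. Everything in it is an assumption made for illustration: the `Token` fields, the tag names, the three toy sentences, and the `determiners_of` helper are hypothetical and do not reflect the format or tooling of any actual treebank.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One analysed word (hypothetical, simplified annotation scheme)."""
    form: str      # the word as written in the text
    lemma: str     # dictionary form
    pos: str       # coarse part-of-speech tag
    head: int      # 1-based position of the syntactic head, 0 for the root
    relation: str  # label of the link to the head


# A toy "treebank": each sentence is a list of analysed tokens.
TREEBANK = [
    [Token("En", "en", "DET", 2, "det"),
     Token("jente", "jente", "NOUN", 3, "subj"),
     Token("sover", "sove", "VERB", 0, "root")],
    [Token("Ei", "ei", "DET", 2, "det"),
     Token("jente", "jente", "NOUN", 3, "subj"),
     Token("leser", "lese", "VERB", 0, "root")],
    [Token("Jeg", "jeg", "PRON", 2, "subj"),
     Token("ser", "se", "VERB", 0, "root"),
     Token("en", "en", "DET", 4, "det"),
     Token("jente", "jente", "NOUN", 2, "obj")],
]


def determiners_of(treebank, noun_lemma):
    """Count the determiner forms whose syntactic head is the given noun."""
    counts = Counter()
    for sentence in treebank:
        for token in sentence:
            if token.pos != "DET" or token.head == 0:
                continue
            head = sentence[token.head - 1]  # heads are 1-based positions
            if head.pos == "NOUN" and head.lemma == noun_lemma:
                counts[token.form.lower()] += 1
    return counts


counts = determiners_of(TREEBANK, "jente")
```

Because the query follows the syntactic links rather than matching adjacent words, it would still find the determiner if other words stood between it and the noun, which is what a plain text search cannot do reliably.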
Dyvik’s blog is based on NorGramBank, a Norwegian treebank of about 70 million words. It was developed as part of the Infrastructure for the Exploration of Syntax and Semantics (INESS) project, in which he has participated. The words come from sources such as newspaper articles, novels, parliamentary records, and other publications.
NorGramBank has also inspired the name of the blog: NorGram-Tall (literally ‘NorGram numbers’).
In posts published so far, Dyvik has explored the use of the masculine indefinite article (‘en’) with the feminine word ‘jente’ (‘girl’), how often, and in which authors’ writing, the definite plural of neuter nouns ends in ‘-a’ or ‘-ene’, and the use of passive constructions, among other topics.
This year, Dyvik is participating in the CAS project SynSem: From Form to Meaning - Integrating Linguistics and Computing.
‘This blog is being developed during my stay at CAS, where I am in close contact with leading international scholars in the fields of computational grammar development, syntactic and semantic analysis, and treebanks,’ Dyvik writes.