SEMANTIC NETS
For contemporary lexicography
I. Chiari, Computational Linguistics, academic year 2009/2010 (12/03/2010)
Some slides come from Semeraro (http://www.di.uniba.it/~semeraro/GCI/WordNet.pdf), others from Piek Vossen.
Wordnets
WordNet
http://wordnet.princeton.edu
A linguistic ontology that represents human linguistic knowledge explicitly and formally.
The idea was born in 1985 within a group of linguists and psycholinguists at Princeton University.
- Goal: conceptual search in dictionaries
- Result: definition of a lexical database
- Line of research: human lexical memory
WordNet is a top-level linguistic ontology.
Linguistic knowledge
- is common-sense knowledge
- can be used in any domain
WordNet does not cover words such as
of, an, the, and, about, above, because, etc.
Each word meaning is represented by the set of word forms that can be used to express it: the synset.
A synset associated with a word form lets the user infer the meaning of that word form, provided the user knows the meaning of at least one word form listed in the synset.
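
The synset idea can be sketched with a toy lexicon. This is only an illustration: the synsets below are made up for the example and are not read from the WordNet database, and the helper names `synsets_for` and `known_synonyms` are hypothetical.

```python
# Toy lexicon: each synset is the set of word forms that express one meaning.
# The entries are illustrative assumptions, not WordNet data.
SYNSETS = [
    frozenset({"car", "auto", "automobile", "motorcar"}),
    frozenset({"board", "plank"}),
    frozenset({"board", "committee"}),
]


def synsets_for(word_form):
    """Return every synset that lists the given word form."""
    return [s for s in SYNSETS if word_form in s]


def known_synonyms(word_form, known_words):
    """For each sense of word_form, list the word forms the user already knows.

    Knowing at least one word form of a synset is enough to infer the
    meaning that word_form has in that synset.
    """
    return [sorted(s & known_words) for s in synsets_for(word_form)]


# "board" is polysemous here: it belongs to two synsets.
print(len(synsets_for("board")))  # 2
print(known_synonyms("board", {"plank", "committee"}))  # [['plank'], ['committee']]
```

A word form with more than one synset is polysemous; each synset separates one of its senses.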
Relations
LEXICAL RELATIONS hold between word forms (synonymy, antonymy, morphological relations).
SEMANTIC RELATIONS hold between word meanings (hyponymy/hypernymy and meronymy/holonymy).
From Semeraro's slides…
Nouns
WordNet divides nouns into 25 distinct semantic fields (animal, substance, …).
Within each semantic field, nouns are organized in a lexical tree according to the hypernymy relation.
The inheritance principle holds.
A noun (canary) can be associated with:
- Attributes of the noun (small and yellow)
- Parts of the noun (beak and wings)
- Functions of the noun (sings and flies)
Many of the attributes, parts and activities of a term are inherited from its direct hypernym.
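
The inheritance principle can be illustrated with a small sketch. The hierarchy and the placement of each feature (for instance, putting "beak" and "wings" on "bird") are assumptions made for the example, not WordNet content.

```python
# Toy noun hierarchy: noun -> direct hypernym (illustrative, not WordNet data).
HYPERNYM = {"canary": "bird", "bird": "animal"}

# Features stated locally on each noun; where each one lives is an assumption.
FEATURES = {
    "animal": {"functions": {"breathes"}},
    "bird": {"parts": {"beak", "wings"}, "functions": {"flies"}},
    "canary": {"attributes": {"small", "yellow"}, "functions": {"sings"}},
}


def hypernym_chain(noun):
    """Return the noun followed by its hypernyms, most specific first."""
    chain = [noun]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def inherited_features(noun):
    """Merge the noun's own features with those of all its hypernyms."""
    merged = {"attributes": set(), "parts": set(), "functions": set()}
    for ancestor in hypernym_chain(noun):
        for kind, values in FEATURES.get(ancestor, {}).items():
            merged[kind] |= values
    return merged


print(hypernym_chain("canary"))  # ['canary', 'bird', 'animal']
print(sorted(inherited_features("canary")["parts"]))  # ['beak', 'wings']
```

Only "small", "yellow" and "sings" are stored on "canary" itself; everything else is found by walking up the tree.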
Online query
http://wordnetweb.princeton.edu/perl/webwn
WordNet statistics
Polysemy in WordNet
Verbs in the database (Semeraro)
About 10,000 forms, 20,000 senses
A verb is the core on which the meaning of a sentence is built.
The meaning of a verb changes depending on the noun it is associated with.
To resolve the ambiguity, one could imagine inserting in every verb synset a pointer to the synset of the noun to which the verb's meaning refers.
Once this idea was abandoned, the verbs were instead divided into several semantic categories (files).
With this organization, the meaning of a verb within a category is no longer subject to ambiguity, because it is tied to the semantic category itself.
Verb relations
V1 ENTAILS V2
when "Someone V1" (logically) entails "Someone V2"
- e.g. snore entails sleep
TROPONYMY
when "To do V1" is "To do V2 in some manner"
- e.g. limp is a troponym of walk
Hypernym: fly -> travel
Troponym: walk -> stroll
Entails: snore -> sleep
Antonym: increase -> decrease
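
A minimal sketch of how such typed verb relations could be stored and queried. The triples repeat the examples listed above; the storage format and the function names `targets` and `sources` are assumptions for illustration, not the WordNet file format.

```python
# (source verb, relation, target verb) triples taken from the examples above.
TRIPLES = [
    ("fly", "hypernym", "travel"),
    ("walk", "troponym", "stroll"),
    ("walk", "troponym", "limp"),
    ("snore", "entails", "sleep"),
    ("increase", "antonym", "decrease"),
]


def targets(verb, relation):
    """Verbs reached from `verb` by following `relation` forwards."""
    return sorted(t for s, r, t in TRIPLES if s == verb and r == relation)


def sources(verb, relation):
    """Verbs that point to `verb` through `relation`."""
    return sorted(s for s, r, t in TRIPLES if t == verb and r == relation)


print(targets("walk", "troponym"))  # ['limp', 'stroll']
print(sources("sleep", "entails"))  # ['snore']
```

Entailment is directional: snore entails sleep, but nothing is recorded in the other direction, so `targets("sleep", "entails")` is empty.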
Differences in wordnet structures
[Figure: two hierarchies compared. Dutch Wordnet: voorwerp (object), lepel (spoon), werktuig (tool), tas (bag), bak (box), blok (block), lichaam (body). WordNet 1.5: object; natural object (an object occurring naturally); artifact, artefact (a man-made object); instrumentality; container; device; implement; tool; instrument; with bag, spoon, box, block and body as the specific concepts.]
- Artificial classes versus lexicalized classes: instrumentality, natural object
- Lexicalization differences of classes: container and artifact (object) are not lexicalized in Dutch
Applications of WordNet
http://www.lexiology.com
Memidex: a WordNet application
MultiWordNet
http://multiwordnet.fbk.eu
- lexical relations between words
- semantic relations between lexical concepts (synsets)
- correspondences between Italian and English lexical concepts
- semantic fields (domains)
The latest version of MultiWordNet (1.39) contains around 58,000 Italian word senses and 41,500 lemmas organized into 32,700 synsets, aligned whenever possible with Princeton WordNet English synsets.
Semantic and lexical relations
Applications of MWN
- Information Retrieval: synonymy relations are used for query expansion to improve the recall of IR; cross-language correspondences between Italian and English synsets are used for Cross-Language Information Retrieval.
- Semantic tagging: MultiWordNet constitutes a large-coverage sense inventory, which is the basis for semantic tagging, i.e. texts are tagged with synset identifiers.
- Disambiguation: semantic relationships are used to measure the semantic distance between words, which can be used to disambiguate the meaning of words in texts. Semantic fields have also proved to be very useful for the disambiguation task.
- Ontologies: MultiWordNet can be seen as an ontology to be used for a variety of knowledge-based NLP tasks.
- Terminologies: MultiWordNet constitutes a robust framework supporting the development of specific structured terminologies.
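
The information retrieval use can be sketched as follows. The two aligned synsets, their identifiers and the function `expand_query` are hypothetical and stand in for real MultiWordNet data and access code.

```python
# Toy aligned synsets: one record per concept, with word forms per language.
# Identifiers and contents are illustrative assumptions.
SYNSETS = {
    "s1": {"it": {"casa", "abitazione"}, "en": {"house", "dwelling"}},
    "s2": {"it": {"macchina", "automobile"}, "en": {"car", "automobile"}},
}


def expand_query(terms, lang="it", target_lang=None):
    """Expand query terms with the word forms of every synset containing them.

    With target_lang set, the aligned synsets are used to move the query
    into the other language (cross-language retrieval).
    """
    out_lang = target_lang or lang
    expanded = set()
    for term in terms:
        if out_lang == lang:
            expanded.add(term)  # keep the original term in monolingual mode
        for forms in SYNSETS.values():
            if term in forms.get(lang, set()):
                expanded |= forms.get(out_lang, set())
    return sorted(expanded)


print(expand_query(["casa"]))                    # ['abitazione', 'casa']
print(expand_query(["casa"], target_lang="en"))  # ['dwelling', 'house']
```

This sketch expands over every sense of a term without disambiguating it first, which favors recall; a real system would normally restrict the expansion to the intended sense.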
ItalWordNet
ItalWordNet (IWN) is a lexical-semantic database developed within two distinct research projects: EuroWordNet (EWN) and Sistema Integrato per il Trattamento Automatico del Linguaggio (SI-TAL), a national project devoted to creating large linguistic resources and software tools for processing written and spoken Italian.
The IWN database
- a wordnet containing about 47,000 lemmas, 50,000 synsets and 130,000 semantic relations (among the encoded relations, the most important are hypernymy/hyponymy, antonymy, meronymy, cause relations, role relations, etc.)
- an Inter-Lingual Index (ILI), which is an unstructured version of WN 1.5; this module, used in EWN to link wordnets of different languages, was kept in IWN to make the resource usable in multilingual applications
- the Top Ontology (TO), a hierarchy of language-independent concepts reflecting fundamental semantic distinctions, built within EWN and partially modified in IWN to account for adjectives (not covered in EWN)
The TO consists of language-independent aspects that may (or may not) be lexicalized in various ways, or according to different patterns, in different languages [Rodriguez et al. 1998]; through the ILI, all the concepts of the wordnet are directly or indirectly linked to the TO.
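
A rough sketch of how the three modules fit together: local synsets point to ILI records, and ILI records point to Top Ontology concepts. All identifiers, the synset contents and the top-concept labels below are assumptions chosen for illustration; they are not taken from the IWN or EWN data.

```python
# ILI record -> Top Ontology concepts (labels are illustrative assumptions).
ILI = {
    "ili:organ#instrument": {"top_concepts": ["Artifact", "Instrument"]},
    "ili:organ#body-part": {"top_concepts": ["Natural", "Part"]},
}

# Each language-specific wordnet maps its own synsets to ILI records.
WORDNETS = {
    "it": {"organo#1": "ili:organ#instrument", "organo#2": "ili:organ#body-part"},
    "nl": {"orgel#1": "ili:organ#instrument", "orgaan#1": "ili:organ#body-part"},
}


def equivalents(lang, synset_id, target_lang):
    """Synsets of target_lang linked to the same ILI record as synset_id."""
    ili = WORDNETS[lang].get(synset_id)
    if ili is None:
        return []
    return sorted(s for s, rec in WORDNETS[target_lang].items() if rec == ili)


def top_concepts(lang, synset_id):
    """Top Ontology concepts reached from a local synset through the ILI."""
    ili = WORDNETS[lang].get(synset_id)
    return ILI[ili]["top_concepts"] if ili in ILI else []


print(equivalents("nl", "orgel#1", "it"))  # ['organo#1']
print(top_concepts("nl", "orgaan#1"))      # ['Natural', 'Part']
```

Because the wordnets never point at each other directly, adding a language only requires linking its synsets to the ILI.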
Mangiare ("to eat", v)
FrameNet
The Berkeley FrameNet project is creating an online lexical resource for English, based on frame semantics and supported by corpus evidence. The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses, through computer-assisted annotation of example sentences and automatic tabulation and display of the annotation results.
Database
The FrameNet lexical database currently contains more than 11,600 lexical units (defined below), more than 6,800 of which are fully annotated, in more than 960 semantic frames, exemplified in more than 150,000 annotated sentences.
From a multilingual perspective
Aligning wordnets
[Figure: the Dutch wordnet (muziekinstrument, orgel, hammond orgel, orgaan) aligned with the English wordnet (object, natural object, artifact, instrument, musical instrument, organ, hammond organ). The English word "organ" appears three times as separate senses, which the Dutch wordnet splits between orgel and orgaan.]
General criteria
- Maximize the overlap with wordnets for other languages
- Maximize semantic consistency within and across wordnets
- Focus manual effort where it is needed
- Make maximal use of automatic techniques
Top-down methodology
- Develop a core wordnet (5,000 synsets):
  - all the semantic building blocks, or foundation, to define the relations for all other, more specific synsets, e.g. building -> house, church, school
  - provide a formal and explicit semantics
- Validate the core wordnet:
  - does it include the most frequent words?
  - are semantic constraints violated?
- Extend the core wordnet (5,000 synsets or more):
  - automatic techniques for more specific concepts with high-confidence results
  - add other levels of hyponymy
  - add specific domains
  - add 'easy' derivational words
  - add 'easy' translation equivalence
- Validate the complete wordnet
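
The two validation questions above can be sketched as simple checks. The slides do not say which semantic constraints are meant, so the example assumes one plausible constraint (the hypernym hierarchy must not contain cycles); the data and function names are likewise hypothetical.

```python
def missing_frequent_words(lexicon, frequent_words):
    """Frequent words that have no entry in the core wordnet."""
    return sorted(set(frequent_words) - set(lexicon))


def has_hypernym_cycle(hypernym, start):
    """True if following hypernym links from `start` revisits a synset.

    Assumes each synset has at most one hypernym (synset -> hypernym).
    """
    seen = set()
    node = start
    while node in hypernym:
        if node in seen:
            return True
        seen.add(node)
        node = hypernym[node]
    return False


# Toy core wordnet: word -> synsets, and synset -> hypernym synset.
lexicon = {"building": ["building#1"], "house": ["house#1"]}
hypernym = {"house#1": "building#1", "building#1": "artifact#1"}

print(missing_frequent_words(lexicon, ["house", "church", "school"]))  # ['church', 'school']
print(has_hypernym_cycle(hypernym, "house#1"))  # False
```

Checks of this kind are cheap to run after each extension step, so that errors introduced by automatic techniques are caught before the next level of hyponyms is added.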
Developing a core wordnet
- Define a set of concepts (so-called Base Concepts) that play an important role in wordnets:
  - high position in the hierarchy & high connectivity
  - represented as English WordNet synsets
  - Common base concepts: shared by various wordnets in different languages
  - Local base concepts: not shared
  - EuroWordNet: 1,024 synsets, shared by 2 or more languages
  - BalkaNet: 5,000 synsets (including the 1,024)
- Common semantic framework for all Base Concepts in the form of a Top-Ontology
- Manually translate all Base Concepts (English WordNet synsets) to synsets in the local languages (was applied for 13 wordnets)
- Manually build and verify the hypernym relations for the Base Concepts
- All 13 wordnets are developed from a similar semantic core closely related to the English WordNet
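
The two selection criteria (high position in the hierarchy and high connectivity) can be sketched as a filter. The thresholds, the toy hierarchy and the relation counts are assumptions for illustration; this is not the procedure actually used in EuroWordNet or BalkaNet.

```python
def depth(hypernym, synset):
    """Number of hypernym links between a synset and the top of its tree."""
    steps = 0
    while synset in hypernym:
        synset = hypernym[synset]
        steps += 1
    return steps


def base_concepts(hypernym, relation_count, max_depth=2, min_relations=3):
    """Synsets that sit high in the hierarchy and have many relations."""
    candidates = set(hypernym) | set(hypernym.values())
    return sorted(
        s
        for s in candidates
        if depth(hypernym, s) <= max_depth
        and relation_count.get(s, 0) >= min_relations
    )


# Toy hierarchy (synset -> hypernym) and number of relations per synset.
hypernym = {
    "artifact": "object",
    "building": "artifact",
    "house": "building",
    "church": "building",
    "school": "building",
}
relation_count = {"object": 10, "artifact": 8, "building": 5, "house": 1}

print(base_concepts(hypernym, relation_count))  # ['artifact', 'building', 'object']
```

In this toy data "house", "church" and "school" are excluded on both counts: they are too deep and too weakly connected.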
Top-down methodology
[Figure: the Top-Ontology (63 TCs) above the 1,024 CBCs of the Inter-Lingual-Index (WordNet 1.5 synsets). On each side, a local wordnet with Hyperonyms, CBC Representatives, Local BCs, WMs related via non-hyponymy, First Level Hyponyms and Remaining Hyponyms.]
Top-down methodology
[Figure: building the Arabic Wordnet from the English Wordnet. Labels: SUMO Ontology; EuroWordNet/BalkaNet Base Concepts (5,000 synsets); 1,000 synsets; CBC, SBC, ABC; Hypernyms; Next Level Hyponyms; More Hyponyms; Easy Translations; Named Entities; WordNet Domains (domain "chemics"); WordNet synsets; English-Arabic Lexicon; Arabic word frequency; Arabic roots & derivation rules.]
Advantages of the approach
- Well-defined semantics that can be inherited down to more specific concepts
- Apply consistency checks
- Automatic techniques can use the semantic basis
- Most frequent concepts and words are covered
- High overlap and compatibility with other wordnets
- Manual effort is focused on the most difficult concepts and words
WordNet Domain  Concepts  Proportion (%)   WordNet Domain  Concepts  Proportion (%)
acoustics            104           0.092   linguistics         1545           1.363
administration      2974           2.624   literature           686           0.605
aeronautic           154           0.136   mathematics          575           0.507
agriculture          306           0.270   mechanics            532           0.469
alimentation          28           0.025   medicine            2690           2.374
anatomy             2705           2.387   merchant_navy        485           0.428
anthropology         896           0.791   meteorology          231           0.204
applied_science       28           0.025   metrology           1409           1.243
archaeology           68           0.060   military            1490           1.315
archery                5           0.004   money                624           0.551
architecture         255           0.225   mountaineering        28           0.025
art                  420           0.371   music                985           0.869
artisanship          148           0.131   mythology            314           0.277
astrology             17           0.015   number               220           0.194
astronautics          29           0.026   numismatics           43           0.038
astronomy            376           0.332   occultism             52           0.046
athletics             22           0.019   oceanography          10           0.009
12032010
3
I Chiari Linguistica computazionale - aa 20092010
5
Ogni word meaning egrave rappresentata dallrsquoinsieme
delle word form che possono essere usate per
esprimerla synset
Un synset associato ad una word form consente
allrsquoutente di inferire la semantica della word form in
esame purcheacute conosca la semantica di almeno una
word form elencata nel synset
Relazioni
I Chiari Linguistica computazionale - aa 20092010
6
LE RELAZIONI LESSICALI Si instaurano tra word
form (synonymy antonymy morphological)
LE RELAZIONI SEMANTICHE Si instaurano tra
word meaning (hyponymy hypernymy e
meronymy holonymy)
12032010
4
Da Diapositive Semerarohellip
I Chiari Linguistica computazionale - aa 20092010
7
Sostantivi
I Chiari Linguistica computazionale - aa 20092010
8
WordNet suddivide i nomi in 25 campi semantici distinti (animale sostanzahellip)
In ogni campo semantico i nomi sono organizzati in un albero lessicale secondo la relazione hypernymy
Vale il principio di ereditarietagrave
Ad un nome (canarino) si possono associare
1048708 Attributi del nome (piccolo e giallo)
1048708 Parti del nome (becco e ali)
1048708 Funzioni del nome (canta e vola)
Molti degli attributi delle parti e delle attivitagrave di un termine sono ereditate dal diretto hypernym
12032010
5
Interrogazione online
I Chiari Linguistica computazionale - aa 20092010
9
httpwordnetwebprincetoneduperlwebwn
I Chiari Linguistica computazionale - aa 2009201010
12032010
6
Statistiche su Wordnet
I Chiari Linguistica computazionale - aa 20092010
11
Polisemia in Wordnet
I Chiari Linguistica computazionale - aa 20092010
12
12032010
7
200405ANLE
13
Verbi nel database (Semeraro)
About 10000 forms 20000 senses
Un verbo egrave il nucleo su cui si basa la semantica
associata ad una frase
Il significato dei verbi cambia a seconda del nome
con cui i verbi stessi sono associati
Per risolvere lrsquoambiguitagrave si potrebbe immaginare di
inserire in ogni synset di verbi un puntatore al
synset del nome a cui il significato del verbo egrave
riferito
I Chiari Linguistica computazionale - aa 20092010
14
Abbandonata lrsquoidea proposta precedentemente si
egrave pensato di suddividere i verbi in varie categorie
semantiche (file)
Con tale organizzazione il significato di un verbo in
una categoria non egrave piugrave soggetto ad ambiguitagrave
percheacute legato alla categoria semantica stessa
12032010
8
200405ANLE
15
Relazioni verbali
V1 ENTAILS V2
when Someone V1 (logically) entails Someone V2
- eg snore entails sleep
TROPONYMY
when To do V1 is To do V2 in some manner
- eg limp is a troponym of walk
Hypernym fly-gt travel
Troponym Walk -gt stroll
Entails Snore -gt sleep
Antonym Increase -gt decrease
Differences in wordnet structures
voorwerp
object
lepel
spoon
werktuig
tool
tas
bag
bak
box
blok
block
lichaam
body
Wordnet15 Dutch Wordnet
bagspoonbox
object
natural object (an
object occurring
naturally)
artifact artefact
(a man-made object)
instrumentalityblock body
containerdeviceimplement
tool instrument
- Artificial Classes versus Lexicalized Classes
instrumentality natural object
- Lexicalization differences of classes
container and artifact (object) are not lexicalized in Dutch
12032010
9
Applicazioni di Wordnet
I Chiari Linguistica computazionale - aa 20092010
17
httpwwwlexiologycom
I Chiari Linguistica computazionale - aa 2009201018
12032010
10
Memidex applicazione Wordnet
I Chiari Linguistica computazionale - aa 20092010
19
I Chiari Linguistica computazionale - aa 2009201020
12032010
11
I Chiari Linguistica computazionale - aa 2009201021
Multiwordnet22
I Chiari Linguistica computazionale - aa 20092010
12032010
12
I Chiari Linguistica computazionale - aa 20092010
23
httpmultiwordnetfbkeu
lexical relations between words
semantic relations between lexical concepts
(synsets)
correspondences between Italian and English lexical
concepts
semantic fields (domains)
I Chiari Linguistica computazionale - aa 20092010
24
The lastest version of MultiWordNet (139) contains
around 58000 Italian word senses and 41500
lemmas organized into 32700 synsets aligned
whenever possible with Princeton WordNet English
synsets
12032010
13
Relazioni semantiche e lessicali
I Chiari Linguistica computazionale - aa 20092010
25
I Chiari Linguistica computazionale - aa 2009201026
12032010
14
I Chiari Linguistica computazionale - aa 2009201027
I Chiari Linguistica computazionale - aa 2009201028
12032010
15
Le applicazioni di MWN
I Chiari Linguistica computazionale - aa 20092010
29
Information Retrieval synonymy relations are used for query expansion to improve the recall of IR cross language correspondences between Italian and English synsets are used for Cross Language Information Retrieval
Semantic tagging MultiWordNet constitutes a large coverage sense inventory which is the basis for semantic tagging ie texts are tagged with synset identifiers
Disambiguation Semantic relationships are used to measure the semantic distance between words which can be used to disambiguate the meaning of words in texts Also semantic fields have proved to be very useful for the disambiguation task
Ontologies MultiWordNet can be seen as an ontology to be used for a variety of knowledge-based NLP tasks
Terminologies MultiWordNet constitutes a robust framework supporting the development of specific structured terminologies
ItalWordNet30
I Chiari Linguistica computazionale - aa 20092010
12032010
16
I Chiari Linguistica computazionale - aa 20092010
31
ItalWordNet (IWN) egrave un database semantico-
lessicale sviluppato nellambito di due progetti di
ricerca distinti EuroWordNet (EWN)1 e Sistema
Integrato per il Trattamento Automatico del
Linguaggio (SI-TAL) un progetto nazionale dedicato
alla creazione di ampie risorse linguistiche e di
strumenti software per lelaborazione dellitaliano
scritto e parlato
il database IWN
I Chiari Linguistica computazionale - aa 20092010
32
un wordnet contenente circa 47000 lemmi 50000 synset e 130000 relazioni semantiche (tra le relazioni codificate le piugrave importanti sono le seguenti iperonimiaiponimia antonimia meronimia relazioni di causa relazioni di ruolo etc)
un Inter-Lingual Index (ILI) che egrave una versione non strutturata di WN15questo modulo usato in EWN per collegare wordnet di diverse lingue egrave stato mantenuto anche in IWN per rendere la risorsa utilizzabile in applicazioni multilingue
la Top Ontology (TO) una gerarchia di concetti indipendenti dalla lingua che riflette fondamentali distinzioni semantiche costruita nellambito di EWN e parzialmente modificata in IWN per spiegare gli aggettivi (non trattati in EWN)
la TO egrave costituita da aspetti indipendenti dalla lingua che possono (o non possono) essere lessicalizzati in vari modi o secondo diversi modelli in diverse lingue [Rodriguez et al 1998] attraverso lILI tutti i concetti del wordnet sono direttamente o indirettamente collegati alla TO
Mangiare (v)
FrameNet
The Berkeley FrameNet project is creating an online lexical resource for English, based on frame semantics and supported by corpus evidence. The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses, through computer-assisted annotation of example sentences and automatic tabulation and display of the annotation results.
Database
The FrameNet lexical database currently contains more than 11,600 lexical units (defined below), more than 6,800 of which are fully annotated, in more than 960 semantic frames, exemplified in more than 150,000 annotated sentences.
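How frames, lexical units, and annotated sentences relate can be sketched with a small data model. This is a minimal sketch, not the FrameNet release format: the class names, the frame and frame-element names, and the two example annotations are illustrative assumptions rather than entries copied from the database.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    name: str
    frame_elements: list[str]


@dataclass
class LexicalUnit:
    lemma: str    # e.g. "eat"
    pos: str      # part of speech, e.g. "v"
    frame: Frame  # a lexical unit pairs a word with one frame (one sense)
    # Each annotation maps a frame element to the text span that realizes it.
    annotations: list[dict[str, str]] = field(default_factory=list)

    def valence_patterns(self):
        """Tabulate which combinations of frame elements occur in the annotations."""
        counts = {}
        for labelled in self.annotations:
            pattern = tuple(sorted(labelled))
            counts[pattern] = counts.get(pattern, 0) + 1
        return counts


# Hypothetical example data.
ingestion = Frame("Ingestion", ["Ingestor", "Ingestibles"])
eat = LexicalUnit("eat", "v", ingestion)
eat.annotations.append({"Ingestor": "The children", "Ingestibles": "the cake"})
eat.annotations.append({"Ingestor": "She"})
```

Calling `eat.valence_patterns()` groups the annotated sentences by the set of frame elements they express, which is a simplified version of the "automatic tabulation" mentioned above.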
A multilingual perspective
Aligning wordnets
[Figure: alignment of a Dutch wordnet fragment with an English wordnet fragment. Dutch wordnet nodes: muziekinstrument, orgel, hammond orgel, orgaan. English wordnet nodes: object, artifact, natural object, instrument, musical instrument, organ (three senses), hammond organ.]
General criteria
- Maximize the overlap with wordnets of other languages
- Maximize semantic consistency within and across wordnets
- Focus manual effort where it is needed
- Make maximal use of automatic techniques
Top-down methodology
Develop a core wordnet (5000 synsets)
  - all the semantic building blocks, or foundation, to define the relations for all other, more specific synsets, e.g. building -> house, church, school
  - provide a formal and explicit semantics
Validate the core wordnet
  - does it include the most frequent words?
  - are semantic constraints violated?
Extend the core wordnet (5000 synsets or more)
  - automatic techniques for more specific concepts with high-confidence results
  - add other levels of hyponymy
  - add specific domains
  - add 'easy' derivational words
  - add 'easy' translation equivalence
Validate the complete wordnet
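The two validation questions above can be sketched as simple checks. This is a minimal sketch under assumptions: the slides do not say which semantic constraints are tested, so the example picks one plausible constraint (the hypernym graph must contain no cycles), and the function names and data layout are hypothetical.

```python
def uncovered_frequent_words(frequent_words, wordnet_lemmas):
    """Frequent words that the core wordnet does not yet contain."""
    return [w for w in frequent_words if w not in wordnet_lemmas]


def has_hypernym_cycle(hypernyms):
    """Check one possible semantic constraint: the hypernym graph is acyclic.

    `hypernyms` maps each synset to the list of its direct hypernyms.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {}

    def visit(node):
        state[node] = GREY
        for parent in hypernyms.get(node, []):
            colour = state.get(parent, WHITE)
            if colour == GREY:  # we came back to a node on the current path
                return True
            if colour == WHITE and visit(parent):
                return True
        state[node] = BLACK
        return False

    return any(state.get(n, WHITE) == WHITE and visit(n) for n in list(hypernyms))
```

For the slide's own example, `{"house": ["building"], "church": ["building"], "school": ["building"], "building": []}` passes the cycle check.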
Developing a core wordnet
Define a set of concepts (so-called Base Concepts) that play an important role in wordnets
  - high position in the hierarchy and high connectivity
  - represented as English WordNet synsets
  - Common Base Concepts: shared by various wordnets in different languages
  - Local Base Concepts: not shared
  - EuroWordNet: 1024 synsets shared by 2 or more languages
  - BalkaNet: 5000 synsets (including the 1024)
Common semantic framework for all Base Concepts in the form of a Top-Ontology
Manually translate all Base Concepts (English WordNet synsets) to synsets in the local languages (applied for 13 wordnets)
Manually build and verify the hypernym relations for the Base Concepts
All 13 wordnets are developed from a similar semantic core closely related to the English WordNet
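The two selection criteria for Base Concepts (high position in the hierarchy, high connectivity) can be sketched as a filter over a hypernym hierarchy. This is only an illustration, not the procedure actually used in EuroWordNet or BalkaNet: the thresholds, the function names, the single-hypernym (tree) assumption, and the approximation of connectivity by the number of direct hyponyms are all assumptions made here.

```python
def depth(synset, hypernym_of):
    """Number of hypernym steps from `synset` up to the root (assumes a tree)."""
    steps = 0
    while synset in hypernym_of:
        synset = hypernym_of[synset]
        steps += 1
    return steps


def candidate_base_concepts(hypernym_of, max_depth=2, min_relations=2):
    """Select synsets that are high in the hierarchy and well connected.

    `hypernym_of` maps each synset to its single direct hypernym.
    Connectivity is approximated by the number of direct hyponyms.
    """
    hyponym_count = {}
    for child, parent in hypernym_of.items():
        hyponym_count[parent] = hyponym_count.get(parent, 0) + 1
    return sorted(
        s for s, n in hyponym_count.items()
        if n >= min_relations and depth(s, hypernym_of) <= max_depth
    )
```

A fuller version would count all relation types, not only hyponymy, when measuring connectivity.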
[Figure: Top-down methodology. Labels recoverable from the diagram: Top-Ontology, 63 TCs; Inter-Lingual-Index, WordNet 1.5 synsets, 1024 CBCs; hyperonyms; CBC representatives; local BCs; WMs related via non-hyponymy; first-level hyponyms; remaining hyponyms.]
[Figure: Top-down methodology, English wordnet to Arabic wordnet. Labels recoverable from the diagram: SUMO Ontology; EuroWordNet and BalkaNet Base Concepts (CBC, SBC, ABC); 1000 synsets; 5000 synsets; hypernyms; next-level hyponyms; more hyponyms; easy translations; named entities; WordNet Domains; domain "chemics"; WordNet synsets; English Arabic lexicon; Arabic word frequency; Arabic roots and derivation rules.]
Advantages of the approach
Well-defined semantics that can be inherited down to more specific concepts
Apply consistency checks
Automatic techniques can use semantic basis
Most frequent concepts and words are covered
High overlap and compatibility with other wordnets
Manual effort is focussed on the most difficult concepts and words
WordNet Domains   Concepts  Proportion (%)   WordNet Domains   Concepts  Proportion (%)
acoustics              104           0.092   linguistics           1545           1.363
administration        2974           2.624   literature             686           0.605
aeronautic             154           0.136   mathematics            575           0.507
agriculture            306           0.270   mechanics              532           0.469
alimentation            28           0.025   medicine              2690           2.374
anatomy               2705           2.387   merchant_navy          485           0.428
anthropology           896           0.791   meteorology            231           0.204
applied_science         28           0.025   metrology             1409           1.243
archaeology             68           0.060   military              1490           1.315
archery                  5           0.004   money                  624           0.551
architecture           255           0.225   mountaineering          28           0.025
art                    420           0.371   music                  985           0.869
artisanship            148           0.131   mythology              314           0.277
astrology               17           0.015   number                 220           0.194
astronautics            29           0.026   numismatics             43           0.038
astronomy              376           0.332   occultism               52           0.046
athletics               22           0.019   oceanography            10           0.009
From Semeraro's slides…
Nouns
WordNet divides nouns into 25 distinct semantic fields (animal, substance, …).
Within each semantic field, nouns are organized into a lexical tree according to the hypernymy relation.
The inheritance principle holds.
A noun (canary) can be associated with:
- attributes of the noun (small and yellow)
- parts of the noun (beak and wings)
- functions of the noun (sings and flies)
Many of the attributes, parts, and activities of a term are inherited from its direct hypernym.
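The inheritance principle can be sketched by walking up the hypernym chain and collecting features along the way. The canary features come from the example above; the small hierarchy (canary, bird, animal), the decision to store the parts and "flies" on bird, and the extra "breathes" feature are assumptions made for illustration, not WordNet data.

```python
# Toy hierarchy and features; the placement of each feature is an assumption.
HYPERNYM = {"canary": "bird", "bird": "animal"}
FEATURES = {
    "animal": {"functions": {"breathes"}},
    "bird": {"parts": {"beak", "wings"}, "functions": {"flies"}},
    "canary": {"attributes": {"small", "yellow"}, "functions": {"sings"}},
}


def inherited_features(noun):
    """Collect the features of `noun` and of every hypernym above it."""
    collected = {"attributes": set(), "parts": set(), "functions": set()}
    while noun is not None:
        for kind, values in FEATURES.get(noun, {}).items():
            collected[kind] |= values
        noun = HYPERNYM.get(noun)  # None once the root has been processed
    return collected
```

Storing "beak" and "wings" once on bird, rather than on every kind of bird, is the economy that inheritance buys.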
Online querying
http://wordnetweb.princeton.edu/perl/webwn
WordNet statistics
Polysemy in WordNet
Verbs in the database (Semeraro)
About 10,000 forms, 20,000 senses
A verb is the core on which the semantics of a sentence is built.
The meaning of a verb changes depending on the noun it is associated with.
To resolve this ambiguity, one could imagine adding to every verb synset a pointer to the synset of the noun to which the verb's meaning refers.
This idea was abandoned; instead, verbs were divided into various semantic categories (files).
With this organization, the meaning of a verb within a category is no longer subject to ambiguity, because it is tied to the semantic category itself.
Verb relations
V1 ENTAILS V2
when "Someone V1" (logically) entails "Someone V2"
- e.g. snore entails sleep
TROPONYMY
when "To do V1" is "To do V2 in some manner"
- e.g. limp is a troponym of walk
Hypernym: fly -> travel
Troponym: walk -> stroll
Entails: snore -> sleep
Antonym: increase -> decrease
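The verb relations listed above can be sketched as typed triples with a small lookup. The triples repeat the slide's own examples; the storage layout and function names are assumptions for illustration and do not reflect WordNet's actual file format.

```python
# (source verb, relation, target verb), following the examples above.
RELATIONS = [
    ("fly", "hypernym", "travel"),
    ("walk", "troponym", "stroll"),
    ("walk", "troponym", "limp"),
    ("snore", "entails", "sleep"),
    ("increase", "antonym", "decrease"),
]


def related(verb, relation):
    """Targets reached from `verb` through the given relation."""
    return [t for s, r, t in RELATIONS if s == verb and r == relation]


def hypernyms_from_troponyms(verb):
    """Read troponymy backwards: if B is a troponym of A, A is a hypernym of B."""
    return [s for s, r, t in RELATIONS if r == "troponym" and t == verb]
```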
Differences in wordnet structures
[Figure: the same objects classified in WordNet 1.5 and in the Dutch wordnet. Dutch wordnet nodes, with English glosses: voorwerp (object), lepel (spoon), werktuig (tool), tas (bag), bak (box), blok (block), lichaam (body). WordNet 1.5 nodes: object; natural object (an object occurring naturally); artifact, artefact (a man-made object); instrumentality; container; device; implement; tool; instrument; bag; spoon; box; block; body.]
- Artificial classes versus lexicalized classes: instrumentality, natural object
- Lexicalization differences of classes: container and artifact (object) are not lexicalized in Dutch
Applications of WordNet
http://www.lexiology.com
Memidex, a WordNet application
MultiWordNet
http://multiwordnet.fbk.eu
- lexical relations between words
- semantic relations between lexical concepts (synsets)
- correspondences between Italian and English lexical concepts
- semantic fields (domains)
The latest version of MultiWordNet (1.39) contains around 58,000 Italian word senses and 41,500 lemmas organized into 32,700 synsets, aligned whenever possible with Princeton WordNet English synsets.
Semantic and lexical relations
Applications of MWN
Information Retrieval: synonymy relations are used for query expansion to improve the recall of IR; cross-language correspondences between Italian and English synsets are used for Cross-Language Information Retrieval.
Semantic tagging: MultiWordNet constitutes a large-coverage sense inventory, which is the basis for semantic tagging, i.e. texts are tagged with synset identifiers.
Disambiguation: semantic relationships are used to measure the semantic distance between words, which can be used to disambiguate the meaning of words in texts. Semantic fields have also proved very useful for the disambiguation task.
Ontologies: MultiWordNet can be seen as an ontology to be used for a variety of knowledge-based NLP tasks.
Terminologies: MultiWordNet constitutes a robust framework supporting the development of specific structured terminologies.
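Two of the applications above, query expansion through synonymy and semantic distance for disambiguation, can be sketched on a toy wordnet. This is a minimal sketch under assumptions: the synsets and hypernym edges are invented, and the distance used here (the number of hypernym/hyponym links on the shortest path) is only one simple choice among the many measures in the literature.

```python
from collections import deque

# Toy data, invented for illustration.
SYNSETS = {
    "car": [{"car", "auto", "automobile"}],
    "house": [{"house", "home"}],
}
HYPERNYM_EDGES = {
    "car": ["vehicle"], "bicycle": ["vehicle"], "vehicle": ["artifact"],
    "house": ["building"], "building": ["artifact"],
}


def expand_query(terms):
    """Add the synonyms of each query term (query expansion for recall)."""
    expanded = set(terms)
    for term in terms:
        for synset in SYNSETS.get(term, []):
            expanded |= synset
    return expanded


def path_distance(a, b):
    """Shortest number of hypernym/hyponym links between two concepts, or None."""
    neighbours = {}
    for child, parents in HYPERNYM_EDGES.items():
        for parent in parents:
            neighbours.setdefault(child, set()).add(parent)
            neighbours.setdefault(parent, set()).add(child)
    queue, seen = deque([(a, 0)]), {a}
    while queue:  # breadth-first search over the undirected hierarchy
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

A disambiguation step could then prefer, for an ambiguous word, the sense whose concept lies closest to the concepts of the surrounding words.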
12032010
6
Statistiche su Wordnet
I Chiari Linguistica computazionale - aa 20092010
11
Polisemia in Wordnet
I Chiari Linguistica computazionale - aa 20092010
12
12032010
7
200405ANLE
13
Verbi nel database (Semeraro)
About 10000 forms 20000 senses
Un verbo egrave il nucleo su cui si basa la semantica
associata ad una frase
Il significato dei verbi cambia a seconda del nome
con cui i verbi stessi sono associati
Per risolvere lrsquoambiguitagrave si potrebbe immaginare di
inserire in ogni synset di verbi un puntatore al
synset del nome a cui il significato del verbo egrave
riferito
I Chiari Linguistica computazionale - aa 20092010
14
Abbandonata lrsquoidea proposta precedentemente si
egrave pensato di suddividere i verbi in varie categorie
semantiche (file)
Con tale organizzazione il significato di un verbo in
una categoria non egrave piugrave soggetto ad ambiguitagrave
percheacute legato alla categoria semantica stessa
12032010
8
200405ANLE
15
Relazioni verbali
V1 ENTAILS V2
when Someone V1 (logically) entails Someone V2
- eg snore entails sleep
TROPONYMY
when To do V1 is To do V2 in some manner
- eg limp is a troponym of walk
Hypernym fly-gt travel
Troponym Walk -gt stroll
Entails Snore -gt sleep
Antonym Increase -gt decrease
Differences in wordnet structures
voorwerp
object
lepel
spoon
werktuig
tool
tas
bag
bak
box
blok
block
lichaam
body
Wordnet15 Dutch Wordnet
bagspoonbox
object
natural object (an
object occurring
naturally)
artifact artefact
(a man-made object)
instrumentalityblock body
containerdeviceimplement
tool instrument
- Artificial Classes versus Lexicalized Classes
instrumentality natural object
- Lexicalization differences of classes
container and artifact (object) are not lexicalized in Dutch
12032010
9
Applicazioni di Wordnet
I Chiari Linguistica computazionale - aa 20092010
17
httpwwwlexiologycom
I Chiari Linguistica computazionale - aa 2009201018
12032010
10
Memidex applicazione Wordnet
I Chiari Linguistica computazionale - aa 20092010
19
I Chiari Linguistica computazionale - aa 2009201020
12032010
11
I Chiari Linguistica computazionale - aa 2009201021
Multiwordnet22
I Chiari Linguistica computazionale - aa 20092010
12032010
12
I Chiari Linguistica computazionale - aa 20092010
23
httpmultiwordnetfbkeu
lexical relations between words
semantic relations between lexical concepts
(synsets)
correspondences between Italian and English lexical
concepts
semantic fields (domains)
I Chiari Linguistica computazionale - aa 20092010
24
The lastest version of MultiWordNet (139) contains
around 58000 Italian word senses and 41500
lemmas organized into 32700 synsets aligned
whenever possible with Princeton WordNet English
synsets
12032010
13
Relazioni semantiche e lessicali
I Chiari Linguistica computazionale - aa 20092010
25
I Chiari Linguistica computazionale - aa 2009201026
12032010
14
I Chiari Linguistica computazionale - aa 2009201027
I Chiari Linguistica computazionale - aa 2009201028
12032010
15
Le applicazioni di MWN
I Chiari Linguistica computazionale - aa 20092010
29
Information Retrieval synonymy relations are used for query expansion to improve the recall of IR cross language correspondences between Italian and English synsets are used for Cross Language Information Retrieval
Semantic tagging MultiWordNet constitutes a large coverage sense inventory which is the basis for semantic tagging ie texts are tagged with synset identifiers
Disambiguation Semantic relationships are used to measure the semantic distance between words which can be used to disambiguate the meaning of words in texts Also semantic fields have proved to be very useful for the disambiguation task
Ontologies MultiWordNet can be seen as an ontology to be used for a variety of knowledge-based NLP tasks
Terminologies MultiWordNet constitutes a robust framework supporting the development of specific structured terminologies
ItalWordNet30
I Chiari Linguistica computazionale - aa 20092010
12032010
16
I Chiari Linguistica computazionale - aa 20092010
31
ItalWordNet (IWN) egrave un database semantico-
lessicale sviluppato nellambito di due progetti di
ricerca distinti EuroWordNet (EWN)1 e Sistema
Integrato per il Trattamento Automatico del
Linguaggio (SI-TAL) un progetto nazionale dedicato
alla creazione di ampie risorse linguistiche e di
strumenti software per lelaborazione dellitaliano
scritto e parlato
il database IWN
I Chiari Linguistica computazionale - aa 20092010
32
un wordnet contenente circa 47000 lemmi 50000 synset e 130000 relazioni semantiche (tra le relazioni codificate le piugrave importanti sono le seguenti iperonimiaiponimia antonimia meronimia relazioni di causa relazioni di ruolo etc)
un Inter-Lingual Index (ILI) che egrave una versione non strutturata di WN15questo modulo usato in EWN per collegare wordnet di diverse lingue egrave stato mantenuto anche in IWN per rendere la risorsa utilizzabile in applicazioni multilingue
la Top Ontology (TO) una gerarchia di concetti indipendenti dalla lingua che riflette fondamentali distinzioni semantiche costruita nellambito di EWN e parzialmente modificata in IWN per spiegare gli aggettivi (non trattati in EWN)
la TO egrave costituita da aspetti indipendenti dalla lingua che possono (o non possono) essere lessicalizzati in vari modi o secondo diversi modelli in diverse lingue [Rodriguez et al 1998] attraverso lILI tutti i concetti del wordnet sono direttamente o indirettamente collegati alla TO
Mangiare (v), "to eat"
FrameNet
The Berkeley FrameNet project is creating an online lexical resource for English, based on frame semantics and supported by corpus evidence. The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses, through computer-assisted annotation of example sentences and automatic tabulation and display of the annotation results.
Database
The FrameNet lexical database currently contains more than 11,600 lexical units (defined below), more than 6,800 of which are fully annotated, in more than 960 semantic frames, exemplified in more than 150,000 annotated sentences.
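The relation between lexical units, frames and annotated sentences can be made concrete with a small sketch. It does not use the FrameNet data or its file format: the classes, the example sentences and the frame element names (loosely modelled on an ingestion frame) are assumptions for illustration. The method `valence_patterns` shows what tabulating the annotation results amounts to: counting which combinations of frame elements are attested for a lexical unit.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    sentence: str
    elements: dict  # frame element name -> text span that fills it

@dataclass
class LexicalUnit:
    lemma: str
    pos: str
    frame: str  # a lexical unit pairs a word with one sense, i.e. one frame
    annotations: list = field(default_factory=list)

    def valence_patterns(self):
        """Count the combinations of frame elements found in the annotations."""
        counts = {}
        for ann in self.annotations:
            pattern = tuple(sorted(ann.elements))
            counts[pattern] = counts.get(pattern, 0) + 1
        return counts

# Invented sample annotations, for illustration only.
eat = LexicalUnit("eat", "v", "Ingestion")
eat.annotations += [
    Annotation("The children ate the cake.",
               {"Ingestor": "The children", "Ingestibles": "the cake"}),
    Annotation("We ate the soup.",
               {"Ingestor": "We", "Ingestibles": "the soup"}),
    Annotation("She ate quickly.",
               {"Ingestor": "She", "Manner": "quickly"}),
]

for pattern, count in eat.valence_patterns().items():
    print(count, pattern)
```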
A multilingual perspective
Aligning wordnets
[Figure: fragments of the Dutch wordnet (muziekinstrument, orgel, hammond orgel; orgaan, orgel) aligned with the English wordnet (object; artifact, natural object; instrument; musical instrument; several senses of organ; hammond organ).]
General criteria
Maximize the overlap with wordnets for other languages
Maximize semantic consistency within and across wordnets
Focus manual effort where it is needed
Make maximal use of automatic techniques
Top-down methodology
Develop a core wordnet (5,000 synsets)
  all the semantic building blocks or foundation to define the relations for all other, more specific synsets, e.g. building -> house, church, school
  provide a formal and explicit semantics
Validate the core wordnet
  does it include the most frequent words?
  are semantic constraints violated?
Extend the core wordnet (5,000 synsets or more)
  automatic techniques for more specific concepts with high-confidence results
  add other levels of hyponymy
  add specific domains
  add 'easy' derivational words
  add 'easy' translation equivalence
Validate the complete wordnet
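The two validation questions for the core wordnet lend themselves to simple automatic checks. The following is a minimal sketch under assumptions of my own: the function names, the sample data and the choice of "no cycles in the hypernym hierarchy" as the semantic constraint are illustrative and are not taken from the slides.

```python
def missing_frequent_words(core_lemmas, frequent_words):
    """Frequent words that have no entry in the core wordnet."""
    return [w for w in frequent_words if w not in core_lemmas]

def hypernym_cycles(hypernym_of):
    """Synsets whose hypernym chain runs into a loop (a violated constraint)."""
    cyclic = []
    for start in hypernym_of:
        seen, node = set(), start
        while node in hypernym_of:
            if node in seen:
                cyclic.append(start)
                break
            seen.add(node)
            node = hypernym_of[node]
    return cyclic

# Invented sample data: child -> hypernym.
CORE = {"house": "building", "church": "building", "school": "building",
        "building": "artifact"}
CORE_LEMMAS = set(CORE) | set(CORE.values())
FREQUENT = ["house", "school", "car"]

print(missing_frequent_words(CORE_LEMMAS, FREQUENT))
print(hypernym_cycles(CORE))
print(hypernym_cycles({"a": "b", "b": "a"}))
```

The same checks can be run again on the extended wordnet in the final validation step.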
Developing a core wordnet
Define a set of concepts (so-called Base Concepts) that play an important role in wordnets
  high position in the hierarchy and high connectivity
  represented as English WordNet synsets
Common base concepts: shared by various wordnets in different languages
Local base concepts: not shared
EuroWordNet: 1,024 synsets shared by 2 or more languages
BalkaNet: 5,000 synsets (including the 1,024)
Common semantic framework for all Base Concepts in the form of a Top-Ontology
Manually translate all Base Concepts (English WordNet synsets) to synsets in the local languages (applied for 13 wordnets)
Manually build and verify the hypernym relations for the Base Concepts
All 13 wordnets are developed from a similar semantic core closely related to the English WordNet
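The two selection criteria for Base Concepts, high position in the hierarchy and high connectivity, can be approximated with two counts. This is a sketch under stated assumptions rather than the procedure used in EuroWordNet or BalkaNet: the relation triples, the thresholds `max_depth` and `min_relations` and the function names are all hypothetical, and each synset is assumed to have at most one hypernym.

```python
def depth(synset, hypernym_of):
    """Number of hypernym steps from the synset up to the top of its chain."""
    steps = 0
    while synset in hypernym_of:
        synset = hypernym_of[synset]
        steps += 1
    return steps

def relation_count(synset, relations):
    """Connectivity: relations in which the synset takes part, in either role."""
    return sum(1 for source, _, target in relations if synset in (source, target))

def base_concepts(relations, max_depth, min_relations):
    """Synsets that sit high in the hierarchy and are well connected."""
    hypernym_of = {s: t for s, rel, t in relations if rel == "hypernym"}
    synsets = {s for s, _, _ in relations} | {t for _, _, t in relations}
    return sorted(s for s in synsets
                  if depth(s, hypernym_of) <= max_depth
                  and relation_count(s, relations) >= min_relations)

# Invented sample relations: (source, relation, target).
RELATIONS = [
    ("house", "hypernym", "building"),
    ("church", "hypernym", "building"),
    ("school", "hypernym", "building"),
    ("chapel", "hypernym", "church"),
    ("building", "hypernym", "artifact"),
    ("artifact", "hypernym", "object"),
    ("roof", "part_of", "building"),
]

print(base_concepts(RELATIONS, max_depth=2, min_relations=2))
```

In this toy data `church` is connected well enough but lies too deep, and `object` is high enough but poorly connected, so neither is selected.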
Top-down methodology
[Figure: the Top-Ontology (63 TCs) and the 1,024 CBCs are connected through the Inter-Lingual-Index (WordNet 1.5 synsets) to each local wordnet, which contains hypernyms, CBC representatives, local BCs, first-level hyponyms, remaining hyponyms, and WMs related via non-hyponymy.]
Top-down methodology
[Figure: building the Arabic Wordnet from the English Wordnet. English side: SUMO Ontology, WordNet synsets, EuroWordNet and BalkaNet Base Concepts (CBC, SBC, ABC; 1,000 and 5,000 synsets), hypernyms, next-level hyponyms, WordNet Domains (e.g. the domain "chemics"), named entities. Arabic side: English-Arabic lexicon, Arabic word frequency, Arabic roots and derivation rules, more hyponyms, easy translations, named entities.]
Advantages of the approach
Well-defined semantics that can be inherited down to more specific concepts
Apply consistency checks
Automatic techniques can use semantic basis
Most frequent concepts and words are covered
High overlap and compatibility with other wordnets
Manual effort is focussed on the most difficult concepts and words
WordNet Domains
WordNet Domains Concepts  Proportion   WordNet Domains Concepts  Proportion
acoustics            104       0.092   linguistics         1545       1.363
administration      2974       2.624   literature           686       0.605
aeronautic           154       0.136   mathematics          575       0.507
agriculture          306       0.270   mechanics            532       0.469
alimentation          28       0.025   medicine            2690       2.374
anatomy             2705       2.387   merchant_navy        485       0.428
anthropology         896       0.791   meteorology          231       0.204
applied_science       28       0.025   metrology           1409       1.243
archaeology           68       0.060   military            1490       1.315
archery                5       0.004   money                624       0.551
architecture         255       0.225   mountaineering        28       0.025
art                  420       0.371   music                985       0.869
artisanship          148       0.131   mythology            314       0.277
astrology             17       0.015   number               220       0.194
astronautics          29       0.026   numismatics           43       0.038
astronomy            376       0.332   occultism             52       0.046
athletics             22       0.019   oceanography          10       0.009
Verbs in the database (Semeraro)
About 10,000 forms, 20,000 senses
A verb is the core on which the semantics of a sentence is built.
The meaning of a verb changes depending on the noun with which it is associated.
To resolve this ambiguity, one could imagine inserting in each verb synset a pointer to the synset of the noun to which the verb's meaning refers.
That idea was abandoned; instead, verbs were divided into several semantic categories (files).
With this organization, the meaning of a verb within a category is no longer ambiguous, because it is tied to the semantic category itself.
Verb relations
V1 ENTAILS V2
when "Someone V1" (logically) entails "Someone V2"
- e.g. snore entails sleep
TROPONYMY
when "To do V1" is "To do V2 in some manner"
- e.g. limp is a troponym of walk
Hypernym: fly -> travel
Troponym: walk -> stroll
Entails: snore -> sleep
Antonym: increase -> decrease
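The verb relations above can be stored as simple mappings and queried. This is a minimal sketch with a handful of hand-entered verbs; the dictionary layout and the function names are assumptions, not WordNet's storage format. Troponymy is treated as the verb counterpart of hyponymy and followed transitively, while entailment is kept as a separate, one-directional relation.

```python
# verb -> the more general verb it is a manner of (illustrative entries)
TROPONYM_OF = {"limp": "walk", "stroll": "walk", "walk": "travel", "fly": "travel"}
# verb -> verbs it logically entails
ENTAILS = {"snore": {"sleep"}}
# antonymy holds in both directions
ANTONYM = {"increase": "decrease", "decrease": "increase"}

def is_troponym_of(v1, v2):
    """True if doing v1 is doing v2 in some manner, following the chain upward."""
    while v1 in TROPONYM_OF:
        v1 = TROPONYM_OF[v1]
        if v1 == v2:
            return True
    return False

def troponyms(verb):
    """Direct troponyms of a verb, i.e. its more specific manners."""
    return sorted(v for v, parent in TROPONYM_OF.items() if parent == verb)

def entails(v1, v2):
    """True if 'Someone v1' logically entails 'Someone v2'."""
    return v2 in ENTAILS.get(v1, set())

print(troponyms("walk"))
print(is_troponym_of("limp", "travel"))
print(entails("snore", "sleep"), entails("sleep", "snore"))
```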
Differences in wordnet structures
[Figure: two hierarchies for the same concepts. Dutch Wordnet: voorwerp (object) above lepel (spoon), werktuig (tool), tas (bag), bak (box), blok (block), lichaam (body). WordNet 1.5: object above natural object (an object occurring naturally) and artifact, artefact (a man-made object), with the intermediate classes instrumentality, container, device, implement and tool, instrument above bag, spoon, box, block, body.]
- Artificial classes versus lexicalized classes: instrumentality, natural object
- Lexicalization differences of classes: container and artifact (object) are not lexicalized in Dutch
Applications of WordNet
http://www.lexiology.com
Memidex: a WordNet application
MultiWordNet
http://multiwordnet.fbk.eu
lexical relations between words
semantic relations between lexical concepts (synsets)
correspondences between Italian and English lexical concepts
semantic fields (domains)
The latest version of MultiWordNet (1.39) contains around 58,000 Italian word senses and 41,500 lemmas organized into 32,700 synsets, aligned whenever possible with Princeton WordNet English synsets.
Semantic and lexical relations
Applications of MWN
Information Retrieval: synonymy relations are used for query expansion to improve the recall of IR; cross-language correspondences between Italian and English synsets are used for Cross-Language Information Retrieval.
Semantic tagging: MultiWordNet constitutes a large-coverage sense inventory, which is the basis for semantic tagging, i.e. texts are tagged with synset identifiers.
Disambiguation: semantic relationships are used to measure the semantic distance between words, which can be used to disambiguate the meaning of words in texts. Semantic fields have also proved to be very useful for the disambiguation task.
Ontologies: MultiWordNet can be seen as an ontology to be used for a variety of knowledge-based NLP tasks.
Terminologies: MultiWordNet constitutes a robust framework supporting the development of specific structured terminologies.
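Query expansion and cross-language retrieval both follow from the aligned synsets, as the following sketch suggests. It is a toy illustration, not the MultiWordNet API: the synset identifiers, the lemma lists and the function `expand` are assumptions. A query term is replaced by every lemma of every synset it belongs to, either in the same language (monolingual expansion) or in the aligned language (cross-language retrieval).

```python
# Invented aligned synsets: identifier -> lemmas per language.
SYNSETS = {
    "n#001": {"it": ["automobile", "macchina", "auto"],
              "en": ["car", "automobile", "auto"]},
    "n#002": {"it": ["macchina"],
              "en": ["machine"]},
}

def expand(term, language, target=None):
    """Lemmas sharing a synset with `term`, in `target` (default: same language)."""
    target = target or language
    expanded = []
    for members in SYNSETS.values():
        if term in members.get(language, []):
            for lemma in members.get(target, []):
                if lemma not in expanded:
                    expanded.append(lemma)
    return expanded

print(expand("automobile", "it"))        # monolingual query expansion
print(expand("macchina", "it", "en"))    # cross-language expansion
```

The second call returns lemmas from both senses of macchina, which is why expansion tends to raise recall at some cost in precision unless the query term is disambiguated first.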
Memidex: an application of WordNet
MultiWordNet
http://multiwordnet.fbk.eu
lexical relations between words
semantic relations between lexical concepts (synsets)
correspondences between Italian and English lexical concepts
semantic fields (domains)
The latest version of MultiWordNet (1.39) contains around 58,000 Italian word senses and 41,500 lemmas organized into 32,700 synsets, aligned whenever possible with Princeton WordNet English synsets.
Semantic and lexical relations
Applications of MWN
Information Retrieval: synonymy relations are used for query expansion to improve the recall of IR; cross-language correspondences between Italian and English synsets are used for Cross-Language Information Retrieval.
Semantic tagging: MultiWordNet constitutes a large-coverage sense inventory, which is the basis for semantic tagging, i.e. texts are tagged with synset identifiers.
Disambiguation: semantic relationships are used to measure the semantic distance between words, which can be used to disambiguate the meaning of words in texts. Semantic fields have also proved to be very useful for the disambiguation task.
Ontologies: MultiWordNet can be seen as an ontology to be used for a variety of knowledge-based NLP tasks.
Terminologies: MultiWordNet constitutes a robust framework supporting the development of specific structured terminologies.
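The query-expansion use mentioned in the list above can be illustrated with a minimal Python sketch. The synset identifiers and word lists below are invented for illustration and are not real MultiWordNet data; the function name `expand_query` is likewise an assumption.

```python
# Toy synsets: identifiers and members are illustrative, not MultiWordNet data.
SYNSETS = {
    "n#car": {"car", "automobile", "auto"},
    "n#film": {"film", "movie", "picture"},
}


def expand_query(terms, synsets=SYNSETS):
    """Return the query terms plus every synonym that shares a synset with them."""
    expanded = []
    seen = set()
    for term in terms:
        candidates = [term]
        for words in synsets.values():
            if term in words:
                # Add all members of the synset, in a stable (sorted) order.
                candidates.extend(sorted(words))
        for word in candidates:
            if word not in seen:
                seen.add(word)
                expanded.append(word)
    return expanded


print(expand_query(["car", "rental"]))
# ['car', 'auto', 'automobile', 'rental']
```

A retrieval system would then match documents against the expanded list, which tends to raise recall because documents using a synonym of the original term are also found.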
ItalWordNet
ItalWordNet (IWN) is a lexical-semantic database developed within two distinct research projects: EuroWordNet (EWN) and Sistema Integrato per il Trattamento Automatico del Linguaggio (SI-TAL), a national project devoted to creating large linguistic resources and software tools for processing written and spoken Italian.
The IWN database
a wordnet containing about 47,000 lemmas, 50,000 synsets and 130,000 semantic relations (among the encoded relations, the most important are hypernymy/hyponymy, antonymy, meronymy, cause relations, role relations, etc.)
an Inter-Lingual Index (ILI), which is an unstructured version of WN 1.5; this module, used in EWN to link wordnets of different languages, has been kept in IWN to make the resource usable in multilingual applications
the Top Ontology (TO), a hierarchy of language-independent concepts that reflects fundamental semantic distinctions, built within EWN and partially modified in IWN to account for adjectives (not handled in EWN)
the TO consists of language-independent aspects that may (or may not) be lexicalized in various ways, or according to different patterns, in different languages [Rodriguez et al. 1998]; through the ILI, all concepts in the wordnet are directly or indirectly linked to the TO
Mangiare (v)
FrameNet
The Berkeley FrameNet project is creating an on-line lexical resource for English, based on frame semantics and supported by corpus evidence. The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses, through computer-assisted annotation of example sentences and automatic tabulation and display of the annotation results.
Database
The FrameNet lexical database currently contains more than 11,600 lexical units (defined below), more than 6,800 of which are fully annotated, in more than 960 semantic frames, exemplified in more than 150,000 annotated sentences.
From a multilingual perspective
Aligning wordnets
[Figure: alignment of the Dutch wordnet with the English wordnet. Dutch wordnet nodes: muziekinstrument, orgel, hammond orgel, orgaan. English wordnet nodes: object, artifact object, natural object, instrument, musical instrument, organ (three nodes), hammond organ.]
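The alignment idea can be sketched in Python as a lookup through a shared index, in the spirit of the Inter-Lingual Index. The record identifiers, glosses and sense labels below are invented for illustration, and the English side is reduced to two senses of "organ"; none of this reproduces the actual EuroWordNet data.

```python
# Illustrative index records; ids and glosses are made up for this sketch.
ILI = {
    "ili-1": "organ (musical instrument)",
    "ili-2": "organ (body part)",
}

# Each wordnet maps its own entries to a record of the shared index.
DUTCH = {"orgel": "ili-1", "orgaan": "ili-2"}
ENGLISH = {"organ#1": "ili-1", "organ#2": "ili-2"}


def equivalents(word, source, target):
    """Find the entries of the target wordnet linked to the same index record."""
    ili_id = source.get(word)
    if ili_id is None:
        return []
    return sorted(entry for entry, i in target.items() if i == ili_id)


print(equivalents("orgel", DUTCH, ENGLISH))   # ['organ#1']
print(equivalents("orgaan", DUTCH, ENGLISH))  # ['organ#2']
```

Because the two Dutch words point to different index records, the ambiguity of English "organ" is resolved by the index rather than by the word form.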
General criteria
Maximize the overlap with wordnets of other languages
Maximize semantic consistency within and across wordnets
Focus manual effort where it is needed
Make maximal use of automatic techniques
Top-down methodology
Develop a core wordnet (5,000 synsets)
  all the semantic building blocks or foundation to define the relations for all other, more specific synsets, e.g. building -> house, church, school
  provide a formal and explicit semantics
Validate the core wordnet
  does it include the most frequent words?
  are semantic constraints violated?
Extend the core wordnet (5,000 synsets or more)
  automatic techniques for more specific concepts with high-confidence results
  add other levels of hyponymy
  add specific domains
  add 'easy' derivational words
  add 'easy' translation equivalence
Validate the complete wordnet
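The two validation questions (frequent-word coverage and violated semantic constraints) could be checked along the following lines. This is a minimal sketch under stated assumptions: the wordnet is reduced to a dictionary mapping each synset to a single hypernym, the only constraint checked is the absence of cycles in the hypernym hierarchy, and the function names and toy data are hypothetical.

```python
# Toy core wordnet: each synset maps to its hypernym (None for a top node).
# The entries follow the slide's example: building -> house, church, school.
CORE = {
    "building": None,
    "house": "building",
    "church": "building",
    "school": "building",
}


def missing_frequent_words(frequent_words, wordnet_words):
    """Return the frequent words that have no entry in the wordnet."""
    return [w for w in frequent_words if w not in wordnet_words]


def has_hypernym_cycle(hypernym_of):
    """Return True if following hypernym links from some synset revisits a synset."""
    for start in hypernym_of:
        seen = {start}
        node = hypernym_of.get(start)
        while node is not None:
            if node in seen:
                return True
            seen.add(node)
            node = hypernym_of.get(node)
    return False


print(missing_frequent_words(["house", "dog"], CORE))  # ['dog']
print(has_hypernym_cycle(CORE))                        # False
```

Real wordnets allow multiple hypernyms and many more constraint types, so a production check would need a more general graph traversal.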
Developing a core wordnet
Define a set of concepts (so-called Base Concepts) that play an important role in wordnets
  high position in the hierarchy and high connectivity
  represented as English WordNet synsets
Common Base Concepts: shared by various wordnets in different languages
Local Base Concepts: not shared
EuroWordNet: 1,024 synsets, shared by 2 or more languages
BalkaNet: 5,000 synsets (including the 1,024)
Common semantic framework for all Base Concepts in the form of a Top-Ontology
Manually translate all Base Concepts (English WordNet synsets) to synsets in the local languages (applied for 13 wordnets)
Manually build and verify the hypernym relations for the Base Concepts
All 13 wordnets are developed from a similar semantic core, closely related to the English WordNet
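The two selection criteria for Base Concepts (high position in the hierarchy and high connectivity) can be approximated as simple filters. The sketch below is only one possible reading: the thresholds `max_depth` and `min_relations`, the function names and the toy data are all assumptions, not the procedure actually used in EuroWordNet or BalkaNet.

```python
def depth(synset, hypernym_of):
    """Count the hypernym links between a synset and the top of its hierarchy.

    Assumes the hierarchy has no cycles.
    """
    steps = 0
    while hypernym_of.get(synset) is not None:
        synset = hypernym_of[synset]
        steps += 1
    return steps


def base_concept_candidates(hypernym_of, relation_count, max_depth=1, min_relations=3):
    """Select synsets that sit high in the hierarchy and have many relations."""
    return sorted(
        s
        for s in hypernym_of
        if depth(s, hypernym_of) <= max_depth
        and relation_count.get(s, 0) >= min_relations
    )


# Toy hierarchy and relation counts, invented for illustration.
HYPERNYM_OF = {
    "object": None,
    "artifact": "object",
    "building": "artifact",
    "house": "building",
}
RELATION_COUNT = {"object": 5, "artifact": 4, "building": 3, "house": 1}

print(base_concept_candidates(HYPERNYM_OF, RELATION_COUNT))
# ['artifact', 'object']
```

Relaxing `max_depth` to 2 would also admit "building" here, which shows how the size of the core set depends on where the thresholds are placed.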
[Figure: Top-down methodology. Labels: 63 TCs; 1024 CBCs; First Level Hyponyms; Remaining Hyponyms; Hyperonyms; CBC Representatives; Local BCs; WMs related via non-hyponymy; Top-Ontology; Inter-Lingual-Index; WordNet 1.5 Synsets.]
[Figure: Top-down methodology, English Wordnet and Arabic Wordnet. Labels: Sumo Ontology; EuroWordNet BalkaNet Base Concepts (CBC, SBC, ABC); 1000 Synsets; 5000 Synsets; Hypernyms; Next Level Hyponyms; More Hyponyms; Easy Translations; Named Entities; WordNet Domains; Domain "chemics"; WordNet Synsets; English Arabic Lexicon; Arabic word frequency; Arabic roots and derivation rules.]
Advantages of the approach
Well-defined semantics that can be inherited down to more specific concepts
Apply consistency checks
Automatic techniques can use the semantic basis
Most frequent concepts and words are covered
High overlap and compatibility with other wordnets
Manual effort is focussed on the most difficult concepts and words
WordNet Domains Concepts  Proportion (%)   WordNet Domains Concepts  Proportion (%)
acoustics            104           0.092   linguistics         1545           1.363
administration      2974           2.624   literature           686           0.605
aeronautic           154           0.136   mathematics          575           0.507
agriculture          306           0.270   mechanics            532           0.469
alimentation          28           0.025   medicine            2690           2.374
anatomy             2705           2.387   merchant_navy        485           0.428
anthropology         896           0.791   meteorology          231           0.204
applied_science       28           0.025   metrology           1409           1.243
archaeology           68           0.060   military            1490           1.315
archery                5           0.004   money                624           0.551
architecture         255           0.225   mountaineering        28           0.025
art                  420           0.371   music                985           0.869
artisanship          148           0.131   mythology            314           0.277
astrology             17           0.015   number               220           0.194
astronautics          29           0.026   numismatics           43           0.038
astronomy            376           0.332   occultism             52           0.046
athletics             22           0.019   oceanography          10           0.009
12032010
13
Relazioni semantiche e lessicali
I Chiari Linguistica computazionale - aa 20092010
25
I Chiari Linguistica computazionale - aa 2009201026
12032010
14
I Chiari Linguistica computazionale - aa 2009201027
I Chiari Linguistica computazionale - aa 2009201028
12032010
15
Le applicazioni di MWN
I Chiari Linguistica computazionale - aa 20092010
29
Information Retrieval synonymy relations are used for query expansion to improve the recall of IR cross language correspondences between Italian and English synsets are used for Cross Language Information Retrieval
Semantic tagging MultiWordNet constitutes a large coverage sense inventory which is the basis for semantic tagging ie texts are tagged with synset identifiers
Disambiguation Semantic relationships are used to measure the semantic distance between words which can be used to disambiguate the meaning of words in texts Also semantic fields have proved to be very useful for the disambiguation task
Ontologies MultiWordNet can be seen as an ontology to be used for a variety of knowledge-based NLP tasks
Terminologies MultiWordNet constitutes a robust framework supporting the development of specific structured terminologies
ItalWordNet30
I Chiari Linguistica computazionale - aa 20092010
12032010
16
I Chiari Linguistica computazionale - aa 20092010
31
ItalWordNet (IWN) egrave un database semantico-
lessicale sviluppato nellambito di due progetti di
ricerca distinti EuroWordNet (EWN)1 e Sistema
Integrato per il Trattamento Automatico del
Linguaggio (SI-TAL) un progetto nazionale dedicato
alla creazione di ampie risorse linguistiche e di
strumenti software per lelaborazione dellitaliano
scritto e parlato
il database IWN
I Chiari Linguistica computazionale - aa 20092010
32
un wordnet contenente circa 47000 lemmi 50000 synset e 130000 relazioni semantiche (tra le relazioni codificate le piugrave importanti sono le seguenti iperonimiaiponimia antonimia meronimia relazioni di causa relazioni di ruolo etc)
un Inter-Lingual Index (ILI) che egrave una versione non strutturata di WN15questo modulo usato in EWN per collegare wordnet di diverse lingue egrave stato mantenuto anche in IWN per rendere la risorsa utilizzabile in applicazioni multilingue
la Top Ontology (TO) una gerarchia di concetti indipendenti dalla lingua che riflette fondamentali distinzioni semantiche costruita nellambito di EWN e parzialmente modificata in IWN per spiegare gli aggettivi (non trattati in EWN)
la TO egrave costituita da aspetti indipendenti dalla lingua che possono (o non possono) essere lessicalizzati in vari modi o secondo diversi modelli in diverse lingue [Rodriguez et al 1998] attraverso lILI tutti i concetti del wordnet sono direttamente o indirettamente collegati alla TO
12032010
17
Mangiare (v)
I Chiari Linguistica computazionale - aa 20092010
33
FrameNet34
I Chiari Linguistica computazionale - aa 20092010
12032010
18
Framenet
I Chiari Linguistica computazionale - aa 20092010
35
The Berkeley FrameNet project is creating an on-
line lexical resource for English based on frame
semantics and supported by corpus evidence The
aim is to document the range of semantic and
syntactic combinatory possibilities (valences) of
each word in each of its senses through computer-
assisted annotation of example sentences and
automatic tabulation and display of the annotation
results
database
I Chiari Linguistica computazionale - aa 20092010
36
the FrameNet lexical database currently contains
more than 11600 lexical units (defined below)
more than 6800 of which are fully annotated in
more than 960 semantic frames exemplified in
more than 150000 annotated sentences
12032010
19
I Chiari Linguistica computazionale - aa 2009201037
I Chiari Linguistica computazionale - aa 2009201038
12032010
20
I Chiari Linguistica computazionale - aa 2009201039
I Chiari Linguistica computazionale - aa 2009201040
12032010
21
In ottica multilingue41
I Chiari Linguistica computazionale - aa 20092010
Aligning wordnets
muziekinstrument
orgel
hammond orgel
organ organ organ
hammond organ
musical instrument
instrument
artifact object natural object
objectDutch wordnetEnglish wordnet
orgaan
orgel
12032010
22
Criteri generali
Massimizzare la sovrapposizione con altri wordnet
di altre lingue
Massimizzare la consistenza semantica allrsquointerno e
attraverso i wordnet
Focalizzare lo sforzo manuale dove necessario
Sfruttare massimamente le tecniche automatiche
Top-down methodology
Develop a core wordnet (5000 synsets)
all the semantic building blocks or foundation to define the relations for all other more specific synsets eg building -gt house church school
provide a formal and explicit semantics
Validate the core wordnet
does it include the most frequent words
are semantic constraints violated
Extend the core wordnet (5000 synsets or more)
automatic techniques for more specific concepts with high-confidence results
add other levels of hyponymy
add specific domains
add lsquoeasyrsquo derivational words
add lsquoeasyrsquo translation equivalence
Validate the complete wordnet
12032010
23
Developing a core wordnet
Define a set of concepts(so-called Base Concepts) that play an important role in wordnets
high position in the hierarchy amp high connectivity
represented as English WordNet synsets
Common base concepts shared by various wordnets in different languages
Local base concepts not shared
EuroWordNet 1024 synsets shared by 2 or more languages
BalkaNet 5000 synsets (including 1024)
Common semantic framework for all Base Concepts in the form of a Top-Ontology
Manually translate all Base Concepts (English Wordnet synsets) to synsets in the local languages (was applied for 13 Wordnets)
Manually build and verify the hypernym relations for the Base Concepts
All 13 Wordnets are developed from a similar semantic core closely related to the English Wordnet
63TCs
1024 CBCs
First Level Hyponyms
Remaining
Hyponyms
Hypero
nyms
CBC
Represen-
tatives
Local
BCs
WMs
related via
non-hypo
nymy
Top-Ontology
Inter-Lingual-Index
Remaining
Hyponyms
Hypero
nyms
CBC
Repre-
senta
Local
BCs
WMs
related via
non-hypo
nymyFirst Level HyponymsRemaining
WordNet15
Synsets
Top-down methodology
[Figure: top-down construction of the Arabic Wordnet from the English Wordnet. Labels: SUMO ontology; EuroWordNet and BalkaNet Base Concepts (CBC, SBC, ABC; 1000 synsets, 5000 synsets); hypernyms; next-level hyponyms; more hyponyms; easy translations; named entities; WordNet Domains (domain "chemics"); WordNet synsets; English-Arabic lexicon; Arabic word frequency; Arabic roots and derivation rules.]
Advantages of the approach
Well-defined semantics that can be inherited down to more specific concepts
Apply consistency checks
Automatic techniques can use semantic basis
Most frequent concepts and words are covered
High overlap and compatibility with other wordnets
Manual effort is focussed on the most difficult concepts and words
WordNet Domains   Concepts  Proportion (%)    WordNet Domains   Concepts  Proportion (%)
acoustics              104  0.092             linguistics           1545  1.363
administration        2974  2.624             literature             686  0.605
aeronautic             154  0.136             mathematics            575  0.507
agriculture            306  0.270             mechanics              532  0.469
alimentation            28  0.025             medicine              2690  2.374
anatomy               2705  2.387             merchant_navy          485  0.428
anthropology           896  0.791             meteorology            231  0.204
applied_science         28  0.025             metrology             1409  1.243
archaeology             68  0.060             military              1490  1.315
archery                  5  0.004             money                  624  0.551
architecture           255  0.225             mountaineering          28  0.025
art                    420  0.371             music                  985  0.869
artisanship            148  0.131             mythology              314  0.277
astrology               17  0.015             number                 220  0.194
astronautics            29  0.026             numismatics             43  0.038
astronomy              376  0.332             occultism               52  0.046
athletics               22  0.019             oceanography            10  0.009
Applications of MWN (MultiWordNet)
Information Retrieval: synonymy relations are used for query expansion to improve the recall of IR; cross-language correspondences between Italian and English synsets are used for Cross-Language Information Retrieval.
Semantic tagging: MultiWordNet constitutes a large-coverage sense inventory, which is the basis for semantic tagging, i.e. texts are tagged with synset identifiers.
Disambiguation: semantic relationships are used to measure the semantic distance between words, which can be used to disambiguate the meaning of words in texts. Semantic fields have also proved very useful for the disambiguation task.
Ontologies: MultiWordNet can be seen as an ontology to be used for a variety of knowledge-based NLP tasks.
Terminologies: MultiWordNet constitutes a robust framework supporting the development of specific structured terminologies.
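Query expansion, the first application listed above, can be illustrated with a toy aligned lexicon. This is a sketch under stated assumptions: the synset identifiers and word lists are invented, the data layout is not the MultiWordNet format, and no MultiWordNet API is used.

```python
# Toy aligned synsets: one identifier, word forms per language (made-up data).
SYNSETS = {
    "n#car": {"en": ["car", "automobile"], "it": ["automobile", "macchina"]},
    "n#house": {"en": ["house"], "it": ["casa"]},
}


def expand_query(terms, synsets, lang, target_lang=None):
    """Expand query terms with the word forms of every synset they belong to.

    With target_lang unset, synonyms in the same language are added
    (monolingual expansion). With target_lang set, the terms are replaced by
    the word forms of the aligned synsets in the target language
    (cross-language retrieval).
    """
    target_lang = target_lang or lang
    expanded = []
    for term in terms:
        if target_lang == lang and term not in expanded:
            expanded.append(term)
        for words in synsets.values():
            if term in words.get(lang, []):
                for word in words.get(target_lang, []):
                    if word not in expanded:
                        expanded.append(word)
    return expanded


if __name__ == "__main__":
    print(expand_query(["car"], SYNSETS, "en"))
    print(expand_query(["car"], SYNSETS, "en", target_lang="it"))
```

Because a polysemous term belongs to several synsets, this sketch expands every sense at once; in practice expansion is usually combined with the disambiguation step described above.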
ItalWordNet
ItalWordNet (IWN) is a lexical-semantic database developed within two distinct research projects: EuroWordNet (EWN) and Sistema Integrato per il Trattamento Automatico del Linguaggio (SI-TAL), a national project devoted to creating large linguistic resources and software tools for processing written and spoken Italian.
The IWN database
a wordnet containing about 47,000 lemmas, 50,000 synsets and 130,000 semantic relations (the most important encoded relations are hypernymy/hyponymy, antonymy, meronymy, cause relations, role relations, etc.)
an Inter-Lingual Index (ILI), which is an unstructured version of WN 1.5; this module, used in EWN to link wordnets of different languages, was kept in IWN to make the resource usable in multilingual applications
the Top Ontology (TO), a hierarchy of language-independent concepts reflecting fundamental semantic distinctions, built within EWN and partially modified in IWN to account for adjectives (not covered in EWN)
the TO consists of language-independent aspects that may (or may not) be lexicalized in various ways, or according to different patterns, in different languages [Rodriguez et al. 1998]; through the ILI, all concepts in the wordnet are directly or indirectly linked to the TO
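The statement that every concept is linked to the TO "directly or indirectly" through the ILI can be made concrete with a small lookup. The sketch below assumes that an indirect link means inheriting the TO features of the nearest hypernym whose ILI record carries them; that reading, the identifiers and the feature names are illustrative assumptions, not the IWN data model.

```python
# Made-up Italian synsets, ILI records and Top Ontology features.
ILI_OF = {
    "organo": "ili-organ",
    "strumento_musicale": "ili-musical_instrument",
}
HYPERNYM_OF = {"organo": "strumento_musicale"}
TO_FEATURES = {"ili-musical_instrument": {"Artifact", "Instrument"}}


def top_ontology_features(synset, ili_of, to_features, hypernym_of):
    """Return the TO features of a synset, climbing hypernyms if needed."""
    seen = set()
    while synset is not None and synset not in seen:
        seen.add(synset)
        ili_record = ili_of.get(synset)
        if ili_record in to_features:
            # Direct link: this synset's ILI record carries TO features.
            return to_features[ili_record]
        # Indirect link: try the hypernym of the current synset.
        synset = hypernym_of.get(synset)
    return set()


if __name__ == "__main__":
    print(sorted(top_ontology_features("organo", ILI_OF, TO_FEATURES, HYPERNYM_OF)))
```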
Mangiare (v)
FrameNet
The Berkeley FrameNet project is creating an online lexical resource for English, based on frame semantics and supported by corpus evidence. The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses, through computer-assisted annotation of example sentences and automatic tabulation and display of the annotation results.
Database
The FrameNet lexical database currently contains more than 11,600 lexical units (defined below), more than 6,800 of which are fully annotated, in more than 960 semantic frames, exemplified in more than 150,000 annotated sentences.
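The "automatic tabulation" of annotation results mentioned above amounts to counting, for each lexical unit, how often each pattern of frame elements and phrase types occurs. The following sketch uses hand-written annotations in an invented record format; it does not read FrameNet data and does not reproduce the project's own tools.

```python
from collections import Counter, defaultdict

# Hand-written example annotations (not taken from the FrameNet corpus).
ANNOTATIONS = [
    {"lexical_unit": "eat.v", "frame": "Ingestion",
     "elements": [("Ingestor", "NP"), ("Ingestibles", "NP")]},
    {"lexical_unit": "eat.v", "frame": "Ingestion",
     "elements": [("Ingestor", "NP"), ("Ingestibles", "NP")]},
    {"lexical_unit": "eat.v", "frame": "Ingestion",
     "elements": [("Ingestor", "NP")]},
]


def tabulate_valences(annotations):
    """Count valence patterns (frame element, phrase type) per lexical unit."""
    table = defaultdict(Counter)
    for annotation in annotations:
        pattern = tuple(annotation["elements"])
        table[annotation["lexical_unit"]][pattern] += 1
    return table


if __name__ == "__main__":
    for pattern, count in tabulate_valences(ANNOTATIONS)["eat.v"].most_common():
        print(count, pattern)
```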
A multilingual perspective
Aligning wordnets
[Figure: alignment of the Dutch and English wordnets. Dutch wordnet: muziekinstrument, orgel, hammond orgel, orgaan. English wordnet: organ (three senses), hammond organ, musical instrument, instrument, artifact, natural object, object.]
12032010
19
I Chiari Linguistica computazionale - aa 2009201037
I Chiari Linguistica computazionale - aa 2009201038
12032010
20
I Chiari Linguistica computazionale - aa 2009201039
I Chiari Linguistica computazionale - aa 2009201040
12032010
21
In a multilingual perspective
Aligning wordnets
[Figure: aligned fragments of the Dutch wordnet and the English wordnet. Dutch nodes: object, muziekinstrument, orgel (two nodes), hammond orgel, orgaan. English nodes: object, natural object, artifact, instrument, musical instrument, organ (three nodes), hammond organ.]
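The alignment of the Dutch and English fragments above can be sketched as a lookup through shared index records. This is only an illustration: the record identifiers are hypothetical placeholders, and the sense labels in parentheses are an assumption about which "organ" node each Dutch word links to (orgel as the instrument, orgaan as the body part).

```python
# Toy alignment of Dutch and English synsets through shared index records.
# The record identifiers ("ili-...") are hypothetical placeholders.
dutch_to_ili = {
    "muziekinstrument": "ili-musical-instrument",
    "orgel": "ili-organ-instrument",
    "hammond orgel": "ili-hammond-organ",
    "orgaan": "ili-organ-body-part",
}

english_to_ili = {
    "musical instrument": "ili-musical-instrument",
    "organ (instrument)": "ili-organ-instrument",
    "hammond organ": "ili-hammond-organ",
    "organ (body part)": "ili-organ-body-part",
}


def align(source_to_ili, target_to_ili):
    """Pair source and target synsets that point to the same index record."""
    target_by_ili = {}
    for synset, ili in target_to_ili.items():
        target_by_ili.setdefault(ili, []).append(synset)
    return {
        synset: target_by_ili.get(ili, [])
        for synset, ili in source_to_ili.items()
    }


alignment = align(dutch_to_ili, english_to_ili)
for dutch, english in alignment.items():
    print(f"{dutch} -> {english}")
```

Because each wordnet links only to the shared records, adding a further language needs one new mapping rather than one mapping per language pair.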
General criteria
Maximize overlap with the wordnets of other languages
Maximize semantic consistency within and across wordnets
Focus manual effort where it is needed
Make the most of automatic techniques
Top-down methodology
Develop a core wordnet (5000 synsets)
  all the semantic building blocks, i.e. the foundation for defining the relations of all other, more specific synsets, e.g. building -> house, church, school
  provide a formal and explicit semantics
Validate the core wordnet
  does it include the most frequent words?
  are semantic constraints violated?
Extend the core wordnet (5000 synsets or more)
  automatic techniques for more specific concepts with high-confidence results
  add other levels of hyponymy
  add specific domains
  add 'easy' derivational words
  add 'easy' translation equivalences
Validate the complete wordnet
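The two validation questions above (are the most frequent words included, are semantic constraints violated) can be sketched on a toy wordnet. This is a minimal sketch under stated assumptions: the data structure, the sample synsets and the frequent-word list are invented for illustration, and the only constraints checked are that every hypernym exists and that the hypernym relation has no cycles.

```python
# Toy core wordnet: each synset id maps to its hypernym ids and word forms.
# All synsets, words and the frequency list are invented for illustration.
core = {
    "entity": {"hypernyms": [], "words": ["entity"]},
    "building": {"hypernyms": ["entity"], "words": ["building", "edifice"]},
    "house": {"hypernyms": ["building"], "words": ["house"]},
    "church": {"hypernyms": ["building"], "words": ["church"]},
}

frequent_words = ["house", "building", "school", "church"]


def missing_frequent_words(wordnet, words):
    """Return the frequent words that no synset in the wordnet covers."""
    covered = {w for entry in wordnet.values() for w in entry["words"]}
    return [w for w in words if w not in covered]


def constraint_violations(wordnet):
    """Check two simple constraints: every hypernym must exist, and no
    synset may be its own ancestor (no cycles in the hypernym relation)."""
    problems = []
    for synset, entry in wordnet.items():
        for parent in entry["hypernyms"]:
            if parent not in wordnet:
                problems.append((synset, f"unknown hypernym {parent}"))
    for start in wordnet:
        stack = list(wordnet[start]["hypernyms"])
        seen = set()
        while stack:
            node = stack.pop()
            if node == start:
                problems.append((start, "hypernym cycle"))
                break
            if node in seen or node not in wordnet:
                continue
            seen.add(node)
            stack.extend(wordnet[node]["hypernyms"])
    return problems


print(missing_frequent_words(core, frequent_words))
print(constraint_violations(core))
```

In this toy case the word "school" is reported as missing, which is the kind of gap the extension step would then fill.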
Developing a core wordnet
Define a set of concepts (so-called Base Concepts) that play an important role in wordnets
  high position in the hierarchy and high connectivity
  represented as English WordNet synsets
  Common base concepts: shared by various wordnets in different languages
  Local base concepts: not shared
  EuroWordNet: 1024 synsets shared by 2 or more languages
  BalkaNet: 5000 synsets (including the 1024)
Common semantic framework for all Base Concepts in the form of a Top-Ontology
Manually translate all Base Concepts (English Wordnet synsets) to synsets in the local languages (was applied for 13 Wordnets)
Manually build and verify the hypernym relations for the Base Concepts
All 13 Wordnets are developed from a similar semantic core closely related to the English Wordnet
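The selection criteria above can be sketched in two steps: pick synsets that are high in the hierarchy and well connected, then separate the concepts chosen by at least two languages (common) from the rest (local). This is a rough sketch, not the procedure actually used in EuroWordNet or BalkaNet: the hierarchy, the relation counts, the thresholds and the per-language selections are all invented for illustration.

```python
from collections import Counter

# Toy hierarchy: synset -> hypernym (None marks the top). Invented data.
hypernym = {
    "entity": None,
    "object": "entity",
    "artifact": "object",
    "instrument": "artifact",
    "musical instrument": "instrument",
    "organ": "musical instrument",
}

# Number of relations each synset takes part in (invented counts).
relation_count = {
    "entity": 40, "object": 35, "artifact": 30,
    "instrument": 12, "musical instrument": 9, "organ": 2,
}


def depth(synset):
    """Number of hypernym steps from the synset up to the top."""
    steps = 0
    while hypernym[synset] is not None:
        synset = hypernym[synset]
        steps += 1
    return steps


def select_base_concepts(max_depth, min_relations):
    """Keep synsets that sit high in the hierarchy and are well connected."""
    return {
        s for s in hypernym
        if depth(s) <= max_depth and relation_count[s] >= min_relations
    }


def split_common_and_local(selections, min_languages=2):
    """Common base concepts are selected in at least `min_languages`
    wordnets; the rest stay local to the wordnet that chose them."""
    counts = Counter(s for chosen in selections.values() for s in chosen)
    common = {s for s, n in counts.items() if n >= min_languages}
    local = {lang: chosen - common for lang, chosen in selections.items()}
    return common, local


base = select_base_concepts(max_depth=3, min_relations=10)

# Hypothetical per-language selections, all expressed as English synsets.
selections = {
    "nl": {"entity", "object", "artifact"},
    "it": {"entity", "object", "instrument"},
    "es": {"entity", "musical instrument"},
}
common, local = split_common_and_local(selections)
print(sorted(base))
print(sorted(common), local)
```

Expressing every selection as English WordNet synsets is what makes the per-language sets directly comparable in the second step.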
[Figure: Top-down methodology. Diagram labels: Top-Ontology, 63 TCs, Inter-Lingual-Index, 1024 CBCs, WordNet 1.5 synsets, and a repeated group of hyperonyms, CBC representatives, local BCs, WMs related via non-hyponymy, first-level hyponyms, remaining hyponyms.]
[Figure: Top-down methodology, English Wordnet and Arabic Wordnet. Diagram labels: SUMO Ontology, WordNet synsets, EuroWordNet and BalkaNet Base Concepts, 5000 synsets, 1000 synsets, SBC, CBC, ABC, hypernyms, next-level hyponyms, more hyponyms, WordNet Domains, domain "chemics", named entities, easy translations, English Arabic lexicon, Arabic word frequency, Arabic roots and derivation rules.]
Advantages of the approach
Well-defined semantics that can be inherited down to more specific concepts
Consistency checks can be applied
Automatic techniques can use the semantic basis
Most frequent concepts and words are covered
High overlap and compatibility with other wordnets
Manual effort is focussed on the most difficult concepts and words
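The first advantage, semantics inherited down to more specific concepts, can be illustrated with a small sketch. Features are attached by hand to a few high-level concepts only, and every more specific synset collects them by walking up its hypernym chain. The hierarchy and the feature labels below are illustrative assumptions, not taken from an actual top-ontology file.

```python
# Toy hyponymy chain; feature labels are illustrative only.
hypernym = {
    "artifact": None,
    "building": "artifact",
    "house": "building",
    "church": "building",
    "school": "building",
}

# Features attached manually to a few high-level concepts.
own_features = {
    "artifact": {"Artifact", "Object"},
    "building": {"Building"},
}


def inherited_features(synset):
    """Collect the features of a synset and of all its hypernyms."""
    features = set()
    while synset is not None:
        features |= own_features.get(synset, set())
        synset = hypernym[synset]
    return features


print(sorted(inherited_features("church")))
```

Here "church" carries no features of its own, yet it receives all three labels from "building" and "artifact", so manual effort stays concentrated at the top of the hierarchy.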
WordNet Domains Concepts  Proportion (%)   WordNet Domains Concepts  Proportion (%)
acoustics            104           0.092   linguistics         1545           1.363
administration      2974           2.624   literature           686           0.605
aeronautic           154           0.136   mathematics          575           0.507
agriculture          306           0.270   mechanics            532           0.469
alimentation          28           0.025   medicine            2690           2.374
anatomy             2705           2.387   merchant_navy        485           0.428
anthropology         896           0.791   meteorology          231           0.204
applied_science       28           0.025   metrology           1409           1.243
archaeology           68           0.060   military            1490           1.315
archery                5           0.004   money                624           0.551
architecture         255           0.225   mountaineering        28           0.025
art                  420           0.371   music                985           0.869
artisanship          148           0.131   mythology            314           0.277
astrology             17           0.015   number               220           0.194
astronautics          29           0.026   numismatics           43           0.038
astronomy            376           0.332   occultism             52           0.046
athletics             22           0.019   oceanography          10           0.009
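As a rough consistency check on the table above, the proportion column can be read as a percentage with three decimal places (0.092 for acoustics, 2.624 for administration). That reading is an assumption, since the decimal points were lost in the slide text. Under it, each row implies a total number of concepts, and the rows should roughly agree.

```python
# (domain, concepts, proportion in percent). The decimal point placement in
# the proportion values is an assumption about the flattened table.
rows = [
    ("acoustics", 104, 0.092),
    ("administration", 2974, 2.624),
    ("anatomy", 2705, 2.387),
    ("linguistics", 1545, 1.363),
    ("medicine", 2690, 2.374),
]


def implied_total(concepts, percent):
    """Total number of concepts implied by one row of the table."""
    return concepts / (percent / 100)


totals = [implied_total(c, p) for _, c, p in rows]
for (domain, _, _), total in zip(rows, totals):
    print(f"{domain}: about {total:,.0f} concepts in total")
```

The implied totals all fall close to 113,000, which supports the percentage reading. Small domains such as acoustics deviate slightly more because their proportions are rounded to fewer significant digits.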