Semantic Emergence in Croatian vs. English: A Study of Word Connectivity in Vector Space

Introduction
In the era of large language models (LLMs), understanding how words are structured in vector space provides valuable insights into linguistic differences. This study explores the semantic emergence of Croatian and English words by analyzing their connectivity within an n-dimensional space, using a Euclidean radius approach. Our goal is to examine how words form relationships and whether language structure affects the way meaning propagates across semantic layers.
Methodology: Mapping Words in Vector Space
We selected 10 fundamental Croatian words and their English equivalents, covering key concepts such as objects, emotions, and actions. Using principal component analysis (PCA), we projected their embeddings into a lower-dimensional space and defined an n-dimensional sphere with a fixed Euclidean radius (r = 1.5) around each word.
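The projection and radius test can be sketched as follows. This is a minimal illustration and not the study's actual pipeline: the random vectors stand in for real embeddings, and the embedding size (50), the number of retained components (3), and the function names pca_project and radius_adjacency are all assumptions made for the example. Only the radius r = 1.5 comes from the text.

```python
import numpy as np

def pca_project(embeddings, n_components):
    """Project row vectors onto their top principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def radius_adjacency(points, radius=1.5):
    """Boolean matrix: True where two distinct points lie within `radius`."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    adjacency = dists <= radius
    np.fill_diagonal(adjacency, False)  # a word is not its own neighbor
    return adjacency

# Placeholder data: 10 random 50-dimensional vectors, NOT real word embeddings.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(10, 50))

projected = pca_project(toy_embeddings, n_components=3)
adjacency = radius_adjacency(projected, radius=1.5)
print(adjacency.sum(axis=1))  # number of in-radius neighbors per word
```

Because the radius is fixed, the neighbor counts depend on the scale of the projected space, so the same r would need to be applied to embeddings that were prepared the same way in both languages.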
Two levels of emergence were calculated:
- 1st Level Emergence: The number of words directly connected to the target word within the defined radius.
- 2nd Level Emergence: The number of words connected indirectly (i.e., neighbors of neighbors).
This structure allows us to analyze how densely connected or isolated each word is within the language model’s representation.
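The two emergence levels can be sketched on a small neighbor graph. The sketch assumes that the second level counts every word reachable within two hops (direct neighbors plus their neighbors, excluding the target itself). That is one reading that fits the table, where the second-level count is never lower than the first-level count, but the exact definition used in the study is not stated. The function name emergence_counts and the toy graph are hypothetical.

```python
def emergence_counts(adjacency, index):
    """Return (first_level, second_level) counts for the word at `index`.

    `adjacency[i][j]` is True when words i and j lie within the radius.
    Second level is taken here as everything reachable in at most two hops.
    """
    n = len(adjacency)
    first = {j for j in range(n) if j != index and adjacency[index][j]}
    within_two = set(first)
    for j in first:
        within_two.update(k for k in range(n) if adjacency[j][k])
    within_two.discard(index)  # the target word does not count itself
    return len(first), len(within_two)

# Toy graph (hypothetical): a chain 0 - 1 - 2 - 3 plus an isolated word 4.
edges = [(0, 1), (1, 2), (2, 3)]
size = 5
toy_adjacency = [[False] * size for _ in range(size)]
for a, b in edges:
    toy_adjacency[a][b] = True
    toy_adjacency[b][a] = True

for word_index in range(size):
    print(word_index, emergence_counts(toy_adjacency, word_index))
```

In this toy graph the isolated word gets zero at both levels, while a word in the middle of the chain gains connections at the second level.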
| HR Word | EN Word | HR – First Level Count | HR – Second Level Count | EN – First Level Count | EN – Second Level Count | First Level Difference (HR − EN) | Second Level Difference (HR − EN) |
| kuća | house | 2 | 2 | 0 | 0 | 2 | 2 |
| ljubav | love | 1 | 2 | 5 | 7 | -4 | -5 |
| voda | water | 3 | 6 | 1 | 1 | 2 | 5 |
| čovjek | man | 3 | 7 | 0 | 0 | 3 | 7 |
| žena | woman | 1 | 2 | 2 | 2 | -1 | 0 |
| govoriti | speak | 0 | 0 | 2 | 4 | -2 | -4 |
| pisati | write | 2 | 6 | 2 | 4 | 0 | 2 |
| riječ | word | 0 | 0 | 2 | 6 | -2 | -6 |
| jezik | language | 3 | 6 | 0 | 0 | 3 | 6 |
| misliti | think | 0 | 0 | 1 | 1 | -1 | -1 |
| Average | | 1.5 | 3.1 | 1.5 | 2.5 | 0 | 0.6 |
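The last two columns and the average row can be recomputed from the four count columns. The short check below uses only the counts from the table and assumes that the last two columns are the Croatian count minus the English count at each level, which matches every row. The helper names are illustrative.

```python
# (HR word, EN word, HR first, HR second, EN first, EN second), copied from the table.
rows = [
    ("kuća", "house", 2, 2, 0, 0),
    ("ljubav", "love", 1, 2, 5, 7),
    ("voda", "water", 3, 6, 1, 1),
    ("čovjek", "man", 3, 7, 0, 0),
    ("žena", "woman", 1, 2, 2, 2),
    ("govoriti", "speak", 0, 0, 2, 4),
    ("pisati", "write", 2, 6, 2, 4),
    ("riječ", "word", 0, 0, 2, 6),
    ("jezik", "language", 3, 6, 0, 0),
    ("misliti", "think", 0, 0, 1, 1),
]

def level_differences(row):
    """HR minus EN at the first and second level."""
    _, _, hr1, hr2, en1, en2 = row
    return hr1 - en1, hr2 - en2

def column_average(values):
    values = list(values)
    return sum(values) / len(values)

differences = {row[0]: level_differences(row) for row in rows}
averages = {
    "hr_first": column_average(r[2] for r in rows),
    "hr_second": column_average(r[3] for r in rows),
    "en_first": column_average(r[4] for r in rows),
    "en_second": column_average(r[5] for r in rows),
}

for word, (d1, d2) in differences.items():
    print(f"{word}: first-level diff {d1}, second-level diff {d2}")
print({name: round(value, 1) for name, value in averages.items()})
```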
Findings: Semantic Connectivity in Croatian vs. English
1. High-Emergence Words: Semantic Hubs
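In the Croatian data, certain words, such as “čovjek” (man; 3 first-level and 7 second-level connections) and “voda” (water; 3 and 6), demonstrated high first-level and second-level emergence. These words serve as semantic hubs, bridging multiple concepts across the vector space. Their widespread connectivity suggests that they appear in varied linguistic contexts, making them central to meaning propagation. Their English counterparts do not show the same pattern: “man” has no connections within the radius and “water” has only one.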
2. Low-Emergence Words: Isolated or Specialized Concepts
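Conversely, Croatian words like “ljubav” (love) and “žena” (woman) had fewer direct and indirect connections (1 first-level and 2 second-level each), and “govoriti” (speak), “riječ” (word), and “misliti” (think) had none. This suggests they exist in more distinct meaning clusters and may require more contextual information for accurate representation. Abstract concepts, such as emotions, often require deeper contextual embedding, which might explain the lower emergence of “ljubav”. The effect is language-specific, however: English “love” has the highest counts in the table (5 and 7).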
3. Croatian vs. English: Structural Differences in Vector Space
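Croatian, as a highly inflected language, likely causes words to be more variably distributed in vector space due to case, aspect, and morphology. English, with its relatively fixed word order, might be expected to exhibit tighter clustering, meaning words are more predictably positioned. The averages in the table only partly reflect this: the first-level averages are equal (1.5 in both languages), while Croatian has the higher second-level average (3.1 vs. 2.5). The differences appear mainly at the level of individual words, which may influence how LLMs encode relationships differently between the two languages.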
4. Second-Level Emergence: The Hidden Connectivity Layer
Some words, such as “voda” (water), significantly increased their number of connections in the second level (e.g., 3 words in the first level → 6 words in the second level). This suggests that certain words act as indirect bridges between concepts, expanding their reach in the semantic network even if their immediate neighborhood appears limited.
Diminutives in Slavic vs. English: A Unique Linguistic Feature
One of the key differences between Slavic and English linguistic structures is the frequent use of diminutives in Slavic languages, including Croatian. Diminutives are a morphological feature that modifies the base word to express smallness, affection, or familiarity. In Croatian, diminutives are highly productive and are commonly used in everyday speech, whereas English relies more on adjectives and context to express the same nuances.
Implications for LLMs
- LLMs trained on English may struggle with capturing the nuanced meanings of diminutives in Slavic languages.
- Word embeddings need to account for morphological richness when dealing with Croatian and similar languages to ensure correct clustering.
- Diminutives may artificially increase second-level emergence due to their widespread usage, affecting how word relations are structured.
Implications for NLP and Language Models
This study highlights key takeaways for language modeling:
- Words with high second-level emergence play a central role in semantic expansion and should be prioritized in word embedding refinements.
- Words with low emergence require stronger contextual embeddings to ensure nuanced understanding.
- Croatian-English differences reveal that grammar and morphology influence how meaning is structured and propagated in LLMs, which is crucial for multilingual model design.
- Diminutives introduce additional complexity, requiring specialized handling in Slavic languages to capture their full semantic range.
