Zi-Ke Zhang
2013
Scientific Reports 3(1082), 2013
Zipf's law on word frequency and Heaps' law on the growth of distinct words are observed in the Indo-European language family, but they do not hold for languages like Chinese, Japanese and Korean. These languages are written in characters and have very limited dictionary sizes. Extensive experiments show that: (i) The character frequency distribution follows a power law with exponent close to one, at which the corresponding Zipf's exponent diverges. Indeed, the character frequency decays exponentially in the Zipf plot. (ii) The number of distinct characters grows with the text length in three stages: it grows linearly in the beginning, then turns to a logarithmic form, and eventually saturates. A theoretical model for the writing process is proposed, which embodies the rich-get-richer mechanism and the effects of limited dictionary size. Experiments, simulations and analytical solutions agree well with each other. This work refines the understanding of Zipf's and Heaps' laws in human language systems.
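The two quantities this abstract refers to can be measured on any text with a few lines of Python. The following is only a minimal sketch, not the authors' code: the function names `heaps_curve` and `zipf_ranking` and the toy sample string are assumptions made for illustration, and the paper's theoretical model is not reproduced here.

```python
from collections import Counter


def heaps_curve(text):
    """Number of distinct characters seen after each position in the text.

    Entry t-1 of the result is the count of distinct characters among the
    first t characters (the quantity whose growth Heaps' law describes).
    """
    seen = set()
    curve = []
    for ch in text:
        seen.add(ch)
        curve.append(len(seen))
    return curve


def zipf_ranking(text):
    """Character frequencies in descending order (rank 1 first).

    Plotting these values against their rank gives the Zipf plot.
    """
    return sorted(Counter(text).values(), reverse=True)


# Toy input, hypothetical; a real study would use a large corpus.
sample = "abracadabra"
print(heaps_curve(sample))
print(zipf_ranking(sample))
```

On a real corpus, the first list would be plotted against text length and the second against rank, which is where the three growth stages and the exponential decay described in the abstract would be looked for.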
2008
Empirical analysis on a keyword-based semantic system
The European Physical Journal B-Condensed Matter and Complex Systems 66(4):557--561, 2008
Keywords in scientific articles have found their significance in information filtering and classification. In this article, we empirically investigated statistical characteristics and evolutionary properties of keywords in a very famous journal, namely Proceedings of the ...
Physica A: Statistical Mechanics and its Applications 387(12):3039-3047, 2008
Chinese is spoken by the largest number of people in the world, and it is regarded as one of the most important languages. In this paper, we explore the statistical properties of Chinese language networks (CLNs) within the framework of complex network theory. Based on one of the largest Chinese corpora, i.e. the People's Daily Corpus, we construct two networks (CLN1 and CLN2) in two different ways, with Chinese words as nodes. In CLN1, a link between two nodes exists if they appear next to each other in at least one sentence; in CLN2, a link indicates that two nodes appear together in the same sentence. We show that both networks exhibit the small-world effect, scale-free structure, hierarchical organization and disassortative mixing. These results indicate that in many topological aspects the Chinese language forms complex networks with organizing principles similar to those of other previously studied language systems, which suggests that different languages may share some common characteristics in their evolution processes. We believe that our research may shed some new light on the Chinese language and reveal some potentially significant implications.
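The two linking rules in this abstract can be illustrated with a short Python sketch. It is not the authors' implementation: it assumes sentences are already segmented into words, treats links as undirected and unweighted, and ignores self-links, none of which the abstract specifies. The function name `build_networks` and the toy sentences are hypothetical.

```python
from itertools import combinations


def build_networks(sentences):
    """Build the edge sets of CLN1 and CLN2 from word-segmented sentences.

    sentences: list of sentences, each a list of word tokens.
    Returns (cln1, cln2), each a set of frozenset({word_a, word_b}) edges.
    Assumption: links are undirected and unweighted, with no self-links.
    """
    cln1 = set()  # words adjacent to each other in at least one sentence
    cln2 = set()  # words appearing together in at least one sentence
    for words in sentences:
        # CLN1: link each word to its immediate successor in the sentence.
        for a, b in zip(words, words[1:]):
            if a != b:
                cln1.add(frozenset((a, b)))
        # CLN2: link every pair of distinct words in the sentence.
        for a, b in combinations(set(words), 2):
            cln2.add(frozenset((a, b)))
    return cln1, cln2


# Toy input, hypothetical; the paper uses the People's Daily Corpus.
toy = [["a", "b", "c"], ["c", "d"]]
cln1, cln2 = build_networks(toy)
print(len(cln1), len(cln2))
```

Under these assumptions every CLN1 link is also a CLN2 link, since adjacent words necessarily share a sentence, so CLN2 is the denser of the two networks.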