This page contains examples of the eight citation tones of a male speaker from Wēnzhōu 温州, a city near the coast in the southern part of the Chinese province of Zhejiang. I have presented them so that you can look at their acoustics and listen to them at the same time. You can also listen to their mean values synthesised as glottal pulses. I have also pointed out some of the things that make the tones phonetically and phonologically interesting.
- Recording Info
- Listen to data
- Entering Tone Development & the OUJIANG subgroup
- Duration differences in WENZHOU tones
- Co-occurrence relationships between Tone and Syllable Onsets
- Correlates of 'Register'
- Other descriptions of WENZHOU citation tones
INTRODUCTION
Wenzhou belongs to the Ōujiāng 甌江 subgroup of the Wú 吳 dialects. Eponymous with the river Ou bisecting it in a N.W. to S.E. direction (jiāng = river), the Oujiang subgroup is located in the S.E. corner of Zhejiang province, and is approximately congruent with the Wenzhou administrative area. About five million people speak Oujiang varieties, of which Wenzhou is probably the best known. Click here to view the location of Wenzhou in the Oujiang subgroup. You can find a characterisation of Oujiang Wu, in Chinese, on pages 18 through 21 of Fù Guótōng et al.’s monograph Wu dialect subgroups of Zhejiang (傅国通, 方松熹, 蔡勇飞, 鲍士杰,傅佐之: 浙江吴语分区) published in 1985 by the Zhejiang Linguistics Society. Information on Oujiang may also be found in Cáo Zhìyún’s 2002 book Studies on the phonetics of Southern Wu dialects (曹志耘: 南部吴语语音研究) published by the Commercial Press, Beijing.
The Wenzhou speaker's tones are interesting because they provide a nice example of duration as an extrinsic tonal parameter. Read more about this, and its relationship to diphthongal vowel quality, here. The relationship between the citation tones and the syllable Onsets is also important. Read about that here. Register is an important tonal dimension; read about its phonological correlates here. Read about the recording of the speaker’s tones here. Wenzhou tones have been described many times. Click here to read about other descriptions of Wenzhou citation tones.
RECORDING
The examples are from recordings of a 34 year old male - Zhū Guóqìng 朱国庆 - made by Prof. William Ballard in April 1988. Zhu Guoqing was born in 1954 in Ruìān 瑞安, a county to the south of Wenzhou, but moved to Wenzhou when 3 years old. His parents came from Wenzhou. Ballard's elicitation text (he used simplified characters) is given together with glosses and phonemic representation. (Some 30 years later, in 2019, Zhu Guoqing, now Professor Emeritus at Wenzhou Normal University, kindly agreed to do some more recordings! These are now being processed.)
Ballard compiled the corpus, and elicited the tones, according to their eight Middle Chinese tonal categories: Ia/Yinping 阴平, Ib/Yangping 阳平, IIa/Yinshang 阴上, IIb/Yangshang 阳上, IIIa/Yinqu 阴去, IIIb/Yangqu 阳去, IVa/Yinru 阴入, IVb/Yangru 阳入, in that sequence. Five examples of each of the eight categories were included, each repeated three times after Ballard's numerical prompt. Click the following button to listen to Ballard's elicitation of the five tokens of the Ia/Yinping tone:
Ballard also recorded examples of the speaker's disyllabic tone-sandhi, which I have described and analysed in the following papers:
Wenzhou Dialect Disyllabic Lexical Tone Sandhi with First Syllable Entering Tones (2000)
Independent depressor and register effects in Wu dialect tonology: Evidence from Wenzhou tone sandhi (2002)
"Defying Explanation"? - Accounting for Tones in Wenzhou Dialect Disyllabic Lexical Tone Sandhi (2004)
DATA
The table below shows the speaker's citation tone acoustics (F0 as a function of absolute duration). It is arranged according to the conventional 2x4 matrix of Middle Chinese tonal categories (Yin-Yang; Ping-Shang-Qu-Ru). Each tone's Middle Chinese name is followed by the name of the Wenzhou tone in capitals, followed by Ballard's Simplified Character elicitation text, and glosses.
I measured the acoustics from the first of the three replicates elicited for each token. You can hear these tokens by clicking on the blue Play/Pause buttons (should work with Chrome, other browsers not guaranteed, sorry!). The dotted lines in the graphs belong to the individual replicates; their arithmetical mean is shown with a thick red line. You can listen to the mean values by clicking on the red Play/Pause mean buttons. (I have synthesised them on quasi-glottal pulses using Praat's Pitch Tier > Synthesise > To Sound (phonation), so that you can listen to their pitch without the distraction of segmentals.) A segmental phonemicisation is also included on the graph, should you wish to check out the relation between segmentals and acoustics for any token.
Yinping/Ia: UPPER-MID LEVEL tone. 香 fragrant, 东 east, 西 west, 风 wind, 关 shut, 背 carry on back |
Yinshang/IIa: MID RISE tone. 酒 wine, 手 hand, 草 grass, 表 watch, 好 good |
Yinqu/IIIa: HIGH FALL tone. 汽 steam, 四 four, 背 back (read with tone Ia), 酱 paste, 跳 jump |
Yinru/IVa: MID FALL-RISE tone. 北 north, 作 do, 竹 bamboo, 国 country, 出 go out |
![]() |
![]() |
![]() |
![]() |
Yangping/Ib: MID FALL tone. 梅 plum (repeated), 年 year, 田 field, 平 flat, 茶 tea |
Yangshang/IIb: DELAYED LOW RISE tone. 被 blanket, 旱 drought, 坐 sit, 肚 stomach, 上 on |
Yangqu/IIIb: LOWER-MID LEVEL tone. 用 use, 面 face, 共 shared, 地 ground, 自 self |
Yangru/IVb: LOW FALL-RISE tone. 肉 meat, 实 real, 学 study, 月 month, 白 white |
![]() |
![]() |
![]() |
![]() |
ENTERING TONE DEVELOPMENT and the OUJIANG SUBGROUP
It is an important methodological principle in historical linguistics that sub-groups can only be established on the basis of shared, unusual innovations. The so-called subgroups of the Wu dialects, like most others in Sinitic, have not been established on innovations, but are typically characterised as showing various constellations of phonological, lexical and morpho-syntactic features. Oujiang, however, is different. Oujiang can be considered a bona-fide sub-group because of its unusual development of two proto-Wu tones. Proto-Wu is reconstructed with eight tones, two of which - tones *IVa and *IVb - occurred on syllables with short Rhymes and final stops.
Elsewhere in the Wu area, reflexes of these tones are usually short, with a word-final glottal stop, e.g. Zhenhai [păʔ 5] (< proto-Wu *pak 45) 百 hundred; & [b̥ăʔ 24] (< proto-Wu *bak 23) 白 white (reconstructions are from Ballard’s 1969 UC Berkeley Ph.D. dissertation Phonological history of Wu, p.70). Click on the buttons below to listen to some examples of short stopped tones from a speaker of Longyou 龙游 dialect (from the Chuqu 处衢 sub-group) saying five *IVa words and five *IVb words.
Oujiang dialects, however, show an unusual compensatory super-lengthening, whereby the proto-Wu final stop in tones *IVa/b has been lost, and the original short tone has developed an overlong, complex pitch: these are the tones in the right-most column of the table above. Since the Longyou speaker you have just listened to was actually recorded saying the same words as the Wenzhou speaker, you can compare the tones' relative length by clicking on the buttons below.
Although most other Wu varieties show the short stopped reflexes of Middle Chinese *IVa and *IVb, they are also lengthening in the Wuzhou sub-group. In some Wuzhou varieties the short tones have lengthened and merged with other tones; in others they have lengthened but remained separate by virtue of different pitch shapes. You can listen to some examples of this in a speaker of Máodiàn 毛店 on my website here.
DURATION DIFFERENCES in WENZHOU TONES
It is clear that this speaker's tones vary markedly in duration. This is interesting for several reasons. Firstly, of course, the conventional wisdom is that rising tones have intrinsically longer duration than falling tones. But the mid rising tone has the same duration as the high falling tone, the most likely interpretation of which is that the former has an extrinsically short duration.
Another interesting aspect of the tonal duration is the nature of the timing between laryngeal and supralaryngeal gestures that these duration differences require. Listen for example to the word wine 酒, with the (short) mid rising tone, and the word bamboo 竹, with the (long) mid fall-rise tone, which both have phonemically the same segments /ʨoʊ/ (in this representation I have treated the audible high front onglide to the rhyme as phonetically conditioned by the preceding palatal):
You can hear, however, that the rhyme sounds rather different in the two words. The spectrograms in figure 1 show the time course of the first three formants and F0 in these two words. The difference in their duration is very clear.

Figure 2 shows the extracted F-pattern of the two words (modelled with cubic polynomials). The left panel shows the F-pattern plotted as a function of absolute duration. The right panel shows how the F-pattern varies as a function of equalised duration. You can see that when equalised in this fashion the F-pattern of the two words agrees quite well.
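The duration equalisation behind the right panel can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure actually used for figure 2: it assumes a formant track stored as (time, value) samples and uses plain linear interpolation rather than the cubic polynomial modelling mentioned above. The function name `equalise_duration` and the F2 values are hypothetical.

```python
from bisect import bisect_right


def equalise_duration(times, values, n_points=11):
    """Resample a track onto n_points equally spaced points of normalised time (0..1).

    times must be strictly increasing; linear interpolation is an assumption
    made for this sketch.
    """
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two matching (time, value) samples")
    if n_points < 2:
        raise ValueError("need at least two output points")
    t0, t1 = times[0], times[-1]
    # Express every sample time as a proportion of the total duration.
    norm = [(t - t0) / (t1 - t0) for t in times]
    resampled = []
    for k in range(n_points):
        x = k / (n_points - 1)
        # Index of the right-hand neighbour, clamped to a valid segment.
        j = min(max(bisect_right(norm, x), 1), len(norm) - 1)
        x0, x1 = norm[j - 1], norm[j]
        y0, y1 = values[j - 1], values[j]
        w = (x - x0) / (x1 - x0)
        resampled.append(y0 + w * (y1 - y0))
    return resampled


# Hypothetical F2 tracks (ms, Hz): the same movement over 100 ms and over 300 ms.
short_f2 = equalise_duration([0, 50, 100], [2000.0, 1500.0, 1000.0], n_points=5)
long_f2 = equalise_duration([0, 150, 300], [2000.0, 1500.0, 1000.0], n_points=5)
```

With these invented values the two resampled tracks coincide, which is the pattern one would expect if the same supralaryngeal gesture were simply stretched over a longer Rhyme.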
This suggests that, at least for this segmental sequence - others may be different - the same supralaryngeal gesture is involved, timed with respect to the duration of the tonal Rhyme. Another example of the same timing of segmentals relative to tonal Rhyme duration occurs with the timing of the onset of a nasal Coda in Zhenhai dialect. A paper on this is here.
CO-OCCURRENCE RELATIONSHIP BETWEEN TONE & SYLLABLE ONSET
In many Wu dialects there is a close relationship between the tone of a syllable and the segmental sound at the syllable’s beginning, called its Onset. Click here to open up a table with the Wenzhou Onsets. This table shows that the sounds that can occur at the beginning of Wenzhou syllables include exponents of nearly all the major classes (rhotics excepted) of obstruent (i.e. stop, affricate, fricative) and sonorant (nasal, lateral, glide). It is also useful to recognise a zero Onset for cases where the syllable begins with a vowel. The relationship between Onsets and isolation tones (which include citation tones of the type demonstrated on this page) is one of phonotactic co-occurrence. In other words, some Onsets can only occur with some isolation tones and some with others. Let’s have a look at this relationship.
Looking at the table of Onsets, the first important thing to note is that three phonemic categories of stops or affricates are listed for each of the Place columns except glottal. These are voiceless aspirated, voiceless unaspirated, and voiced. So for example the bilabial stops have /pʰ/ /p/ /b/. The realisation of the first two (aspirated and voiceless unaspirated) sets is straightforward: they have aspirated and voiceless unaspirated allophones, e.g. [pʰ] [p], [ʦʰ] [ts] etc. The third (phonemically voiced) set is more complex. It has two allophones, conditioned by word position. Word-internally the allophones are modally voiced: [b] [ʣ] etc., which means there is a three-way contrast word-internally at all Places between [aspirated], [voiceless unaspirated] and [voiced] stops and affricates. Figure 3 illustrates typical acoustics for this three-way contrast for alveolar stops /tʰ/ /t/ & /d/ in the speaker’s subminimal triplet /dʊŋ tʰe/ norm 动态, /va ti/ hotel 饭店, and /va deɪ/ other lands 外地. (I chose these words because they are controlled for first- and second-syllable tone, and also for vowel height (almost) on the second syllable - these are factors that might affect the realisation of the intervocalic consonant.) Listen to them by clicking on the buttons below the figure.
Figure 3.



You can see that the three stops differ most clearly in VOT: the aspirated [tʰ] has lag VOT of about 5 csec., the voiceless unaspirated [t] a coincident VOT of about 1 csec., and the voiced [d] a lead VOT of about -11 csec (with periodicity dropping in amplitude in mid hold). They also differ in hold duration, with the voiceless unaspirated [t] showing a much longer hold (about 17 csec.) than the other two, of which the voiced [d] is slightly greater (11 csec.) than the aspirated [tʰ] (8 csec.). Both voiceless stops appear to show typical F0 perturbation at release, but automatically extracted F0 values at release are not necessarily reliable.
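The three VOT categories can be made concrete with a small Python sketch. The VOT values are the approximate ones quoted above for figure 3; the category boundaries (`lead_max`, `lag_min`) are purely illustrative assumptions, not values proposed on this page, and the function name is hypothetical.

```python
def vot_category(vot_csec, lead_max=-2.0, lag_min=3.0):
    """Label a VOT value (in centiseconds) as lead, coincident or lag.

    lead_max and lag_min are assumed boundaries chosen only for illustration.
    """
    if vot_csec <= lead_max:
        return "lead"
    if vot_csec >= lag_min:
        return "lag"
    return "coincident"


# Approximate VOT values (csec.) reported for the word-internal alveolar stops.
measured_vot = {"tʰ": 5.0, "t": 1.0, "d": -11.0}
categories = {stop: vot_category(vot) for stop, vot in measured_vot.items()}
```

Any real classification would of course need boundaries estimated from many tokens, and hold duration could be added as a second cue.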
Word-initially, the phonemically voiced set of stops and affricates, e.g. /b/ /dz/ etc., are realised as coincident voiceless lenis ([b̥], [d̥z̥] etc.) in free variation with voiced lead ([b], [dz] etc.). The voiceless lenis variant is much more common for this speaker. He has no citation tone examples with the lead VOT, but you can hear the free variation nicely in the word-initial /d/ in his three repeats of the word stomach 肚皮 /dœʏ beɪ/. The acoustics of this triplet are shown in figure 4. Click below the figure to listen to the individual repeats.
Figure 4.



As can be seen in the table of Onsets, there are also systematic two-way voiced and voiceless contrasts for fricatives, e.g. between /s/ and /z/. As with the stops and affricates, the allophonics are straightforward for the /voiceless fricatives/ but more complex for the /voiced/. Figure 5 shows typical acoustics for the word-internal contrast between voiced and voiceless alveolar fricatives /s/ and /z/ in the words /ɕoʊ saŋ/ palm 手心 and /y zaŋ/ quiet 安静. The acoustics are again unremarkable: the two fricatives differ both in the duration of the hold phase and in the presence of periodicity during it. The duration of /s/ is about 15 csec. and that of /z/ is a somewhat shorter 9 csec.
Figure 5.
/s/ /z/

As with the /voiced stops and affricates/, there is free variation word-initially in the phonemically voiced set of fricatives, e.g. /v/ /z/ etc. They are realised as voiceless lenis (e.g. [z̥]) in free variation with voiced (e.g. [z]). Click the following button to listen to three repeats of /zai ku/ 罪过 sin, the first with a voiceless realisation of the /z/ and the second two with clear voicing. Spectrograms of the three replicates are in figure 6 below. Click on the buttons beneath them to listen. You can easily see the glottal pulses of the word-initial /z/ in the 2nd and 3rd repeats, lasting for about the same time as the aperiodicity from the fricative constriction. Not surprisingly, F0 has also been extracted during this phase and you can see an apparent lowering effect on the tonally-relevant F0 on the following vowel. In the first replicate, however, at most one or two glottal pulses are evident before the release of the fricative into the vowel, and no F0 has been extracted. Of interest is that the F0 at the onset of the vowel in the first repeat, despite the fact that its fricative Onset is not phonetically voiced, is still low. It is lower, in fact, than the F0 at the onset of the vowel in the 2nd and 3rd repeats. This means that the lowered F0 cannot be caused by the vocal cord vibration per se. A last thing to note is the duration of the voiceless [z̥] in the first repeat. Comparing it with the duration of the word-initial /ɕ/ in figure 5 above, you can see that it is several centiseconds shorter. A slightly shorter duration for word-initial voiceless lenis allophones of the /voiced fricatives/ is typical and probably one of the acoustic cues making them sound lenis relative to the allophones of the /voiceless fricatives/.
Figure 6.


SYLLABLE ONSETS & 'REGISTER'
The eight Wenzhou citation tones divide into two natural classes depending on whether they can occur with the voiced stop and affricate phonemes (/b d dz g/ etc.), or with the two other, voiceless, sets (/pʰ p, tʰ t, tsʰ ts, kʰ k/ etc.). (Note that this complementary distribution only applies to obstruent Onsets: both sets of tones can occur with sonorant onsets like m l or j.) The mean F0 values of these two classes of citation tones are plotted separately in figure 7, so that you can see some of the acoustic tonal properties that underlie the distinction between the two natural classes. The tones that can occur with the /voiceless obstruent/ Onsets are on the left. You can see that they lie overall higher than the tones with the /voiced obstruent/ Onsets, but there is a substantial overlap. A typical pairing-by-contour is also clear: both classes consist of tones with level, rising, falling and fall-rising contour.


Figure 8 shows how much the difference between the two classes is due to position within the F0 range, and how much to F0 contour, by factoring out position within the F0 range. This was achieved with a simple normalisation that uniformly shifts the /voiced obstruent/ tonal F0 up until its difference from the /voiceless obstruent/ tone of corresponding contour is minimised. The left panel of figure 8 shows normalised LEVEL and FALLING tones, the right panel shows the normalised RISING and FALLING-RISING tones. The normalised F0 is plotted as a function of equalised duration. F0 has been transformed to semitones relative to 80 Hz. The original value of the /voiced obstruent/ tonal F0 is shown with a dotted grey line. The mean shift necessary for the low register tones is shown in the bottom left of each panel: low register level and falling tones required a mean shift of 2.2 semitones, whereas rising and fall-rising tones required a smaller shift of 1.1 semitones. (These semitone values correspond to values of 13 Hz and 7 Hz respectively.)
Figure 8 shows that the upper and lower register tones have very similar normalised offset values and differ mostly in the depressed onset of the lower register tone contour. Thus for example the [high falling] IIIa tone and the [mid falling] Ib tone can be considered as [+/- depressed] versions of a HIGH FALLING CONTOUR; likewise the [mid fall-rising] IVa tone and the [low fall-rising] IVb tone can be considered as [+/- depressed] versions of a MID FALL-RISE CONTOUR. This depression effect is actually a word-initial prosody: it usually disappears on non-word-initial syllables, leaving just four (level, rising, falling and fall-rising) tones: click here to read more about depression in Wenzhou.
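The two steps of this normalisation (conversion to semitones, then a uniform upward shift) can be sketched as follows. This is a hedged illustration rather than the exact procedure behind figure 8: it assumes that "minimised" means minimising the summed squared difference between two contours sampled at the same equalised time points, in which case the best uniform shift is simply the mean pointwise difference. All function names are hypothetical.

```python
import math


def hz_to_semitones(f_hz, ref_hz=80.0):
    """Convert a frequency in Hz to semitones relative to ref_hz."""
    return 12.0 * math.log2(f_hz / ref_hz)


def best_uniform_shift(upper_st, lower_st):
    """Semitone shift that, added to every lower-register value, minimises the
    summed squared difference from the upper-register contour (assumed criterion)."""
    if len(upper_st) != len(lower_st) or not upper_st:
        raise ValueError("contours must be non-empty and equally sampled")
    diffs = [u - l for u, l in zip(upper_st, lower_st)]
    return sum(diffs) / len(diffs)


def shift_in_hz(base_hz, shift_st):
    """Size in Hz of an upward shift of shift_st semitones starting from base_hz."""
    return base_hz * (2.0 ** (shift_st / 12.0) - 1.0)


# Toy contours in semitones (hypothetical values, not the speaker's data).
upper = [3.0, 2.0, 1.0]
lower = [1.0, 1.0, 1.0]
shift = best_uniform_shift(upper, lower)
normalised_lower = [value + shift for value in lower]
```

Note that the Hz equivalent of a semitone shift depends on where in the F0 range it is applied, which `shift_in_hz` makes explicit.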
There have been several proposals for the appropriate feature for defining these two natural classes. Probably the best known candidate is in terms of pitch register, with the /voiceless obstruent/ tones being [+ Upper], and the /voiced obstruent/ tones [-Upper]. But this will not work for the Wenzhou data because it implies that the [+ Upper] tones are separated from the [–Upper] tones by occurrence within upper and lower halves of the pitch range respectively, and figure 7 makes it clear that, because of the overlap in F0 range, there is no single point in the pitch range which separates the two sets of tones thus. Another candidate is defined in terms of phonatory register. Thus the /voiced obstruent/ tones have been proposed as [+ lax voice] (or 弛聲), and Cao & Maddieson were able to demonstrate phonatory differences between selected Wenzhou tones in their 1992 Journal of Phonetics paper 'An exploration of phonation types in Wu dialects of Chinese'.
If one is looking for a single feature to distinguish the two sets of tones, the F0 value at onset is the simplest: /voiced obstruent/ tones (in the right-hand panel of figure 7) have an F0 onset below about 120 Hz; /voiceless obstruent/ tones (left-hand panel) have a higher onset. However, the correlates of the distinction are more complicated than that. As is shown in figure 8, the two tonal classes differ acoustically in both position in F0 range AND F0 depression, which suggests a pitch register distinction. From the point of view of production, however, both of these features are probably ultimately referable to different extrinsic word-initial phonatory gestures, which suggests a phonatory register interpretation. In this case, therefore, if we name the dimension that underlies the difference between the two sets of tones REGISTER, it is best to consider it a cover term for both phonatory and pitch features. Finally, since the difference between the two classes is also reflected in the syllable /Onsets/, whatever underlies the difference between these natural classes is probably better regarded as a feature of morphemes rather than of tone. This could also of course apply to the Middle Chinese so-called tonal categories of Yin 阴 and Yang 阳 from which the two synchronic classes derive.
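The single-feature criterion mentioned at the start of this paragraph amounts to a one-line rule, sketched below. The 120 Hz boundary is the approximate value given above for this one speaker; the function name and the example onset values are hypothetical, and the rule deliberately ignores the phonatory and contour correlates just discussed.

```python
def register_from_onset(f0_onset_hz, boundary_hz=120.0):
    """Assign a tone to the lower or upper class from its F0 onset alone.

    boundary_hz is speaker-specific; 120 Hz is the approximate value quoted
    for this speaker.
    """
    return "lower" if f0_onset_hz < boundary_hz else "upper"


# Hypothetical onset values in Hz, for illustration only.
onsets = {"tone A": 150.0, "tone B": 105.0}
labels = {name: register_from_onset(f0) for name, f0 in onsets.items()}
```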
In the table below I have arranged the tones according to the two basic orthogonal tonal dimensions of Register and Contour. The horizontal register dimension divides the eight tones into the two natural classes on the basis of pitch/F0 depression and range, and co-occurrence with Onsets. That leaves four pitch-target contours: level, rising, falling and fall-rising. The tones are thus paired according to contour. You can click to hear their mean F0, synthesised with Praat. One of the most interesting things about this neat pairing-by-contour configuration is that the rules for disyllabic tone sandhi do not actually treat two of the pairings as natural classes, but work rather in terms of historical categories. Thus in tone-sandhi the upper-mid level and the mid falling tones (historical ping category) act as one natural class; and the high falling and lower-mid level tone (historical qu category) act as another. This is described in my 2004 paper "Defying Explanation"? - Accounting for Tones in Wenzhou Dialect Disyllabic Lexical Tone Sandhi.
REGISTER ↓ / CONTOUR → | level | rise | fall | fall-rise |
---|---|---|---|---|
UPPER | UPPER-MID LEVEL (Ia/Yinping) | MID RISE (IIa/Yinshang) | HIGH FALL (IIIa/Yinqu) | MID FALL-RISE (IVa/Yinru) |
LOWER | LOWER-MID LEVEL (IIIb/Yangqu) | DELAYED LOW RISE (IIb/Yangshang) | MID FALL (Ib/Yangping) | LOW FALL-RISE (IVb/Yangru) |
OTHER DESCRIPTIONS OF WENZHOU CITATION TONES
Descriptions of Wenzhou dialect citation tones date back to the late nineteenth century. Most of these are summarised in the native-speaker-linguist Zhèngzhāng Shàngfāng’s comprehensive 2008 鄭張尚芳: 温州方言志 [Wenzhou Dialect Gazetteer] p. 50 ff., 92, 93. See also his 1995 paper 温州方言近百年来的语音变化 [Changes in Wenzhou dialect phonology in the last hundred years], in Eric Zee (ed.) Studies in the Wu Dialects, Chinese University of Hong Kong.
The earliest acoustic description of Wenzhou citation tones is in Chao Yuen Ren's (1928) pioneering monograph Studies in the Modern Wu Dialects / 現代吳語的研究, which on pp. 76-77 contains a musical description of the tones of a female speaker from Yǒngjiā 永嘉, a site about 10 kilometres almost due north of Wenzhou city. The citation tone acoustics of another Wenzhou speaker may be found in Píng Yuèlíng et al.’s 平悦铃等著: 吴语声调的实验研究 [Experimental studies on Wu Tones] pp. 341-350.

There are also many examples of Wenzhou citation tones transcribed in Chao’s well-known 5-point “tone letters”. For example the 汉语方音字汇 [Dictionary of Chinese Dialect Character Pronunciations] gives [44] (Ia), [31] (Ib), [45] (IIa), [34] (IIb), [42] (IIIa), [22] (IIIb), [323] (IVa), [212] (IVb).
Figure 9 shows the perceptual transform of the speaker’s mean tonal values in order to see how well this pitch representation fits this speaker's mean data. This figure is a bit complicated so it needs some explanation. In it, the speaker’s mean tonal F0 has been declination-adjusted and converted to semitones relative to 90 Hz (a reference value which gave the best perceptual fit) - for details of this transformation see my Interspeech paper on tone transcription. The semitone scales are on the left of each panel. The scales to the right are the Chao 5-point values fitted to the highest and lowest semitone values.
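The fitting of the Chao scale to the semitone range can be illustrated with a short sketch. It assumes a simple linear mapping in which the lowest semitone value corresponds to [1] and the highest to [5]; it omits the declination adjustment, and the function names are hypothetical rather than taken from the paper mentioned above.

```python
import math


def hz_to_semitones(f_hz, ref_hz=90.0):
    """Convert Hz to semitones relative to ref_hz (90 Hz as used for figure 9)."""
    return 12.0 * math.log2(f_hz / ref_hz)


def chao_value(st, st_min, st_max):
    """Map a semitone value linearly onto a continuous 1..5 Chao scale."""
    if st_max <= st_min:
        raise ValueError("st_max must be greater than st_min")
    return 1.0 + 4.0 * (st - st_min) / (st_max - st_min)


def chao_digits(contour_st, st_min, st_max):
    """Round each point of a contour to the nearest Chao level (halves round up)."""
    digits = []
    for st in contour_st:
        level = math.floor(chao_value(st, st_min, st_max) + 0.5)
        digits.append(min(5, max(1, level)))
    return digits


# Toy example with a hypothetical range of 0 to 8 semitones.
falling = chao_digits([8.0, 4.0, 0.0], 0.0, 8.0)
halfway = chao_value(7.0, 0.0, 8.0)
```

In the toy example `halfway` lies exactly between two Chao levels, so the integer it receives depends entirely on the rounding convention. That is the kind of case where a five-level representation is least informative.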
It can be appreciated that, transformed in this way, some of the speaker’s tones can be assigned a Chao tone-letter representation without too much Procrusteanism, although the values do not correspond particularly well to the Dictionary of Chinese Dialect Character Pronunciations descriptions above. The falling tones are well represented as [51] and [31] (although of course the congruence of the [5] is forced). The low fall-rising tone is also close to [212]. The onsets of the rising tones are also reasonable, at [3] for mid rising and [1] for delayed low rising. Their offsets are badly represented, however, as their pitch targets lie just about equidistant between [5] and [4]. This means you would be equally likely to get them assuming a representation of either [5] or [4]! The low point of the mid fall-rise tone is also not well modelled by a 5-point scale. Finally, both the level tones are distributed around the mid pitch range but are neither high enough nor low enough to warrant a [44] or a [22].
This sub-optimal fit between quantified pitch target and Chao representation is probably due to the way the two sets of tones are distributed, with the lower set only lying slightly lower than the upper. It might also be the case that most complex tone systems are badly modelled by just five levels (in the likelihood sense that the probability of getting the data assuming the model - P(data|model) - is low). If one wanted to obtain a more accurate discrete perceptual representation, one might, in the spirit of phonetic vowel representation, invoke a shifting indicator e.g. [33↑] (or [44↓]) for the upper-mid level tone and [33↓] (or [22↑]) for the lower mid level tone. But I suspect that it is unreasonable to expect pitch contours like this to be modelled well by any single discrete set of values.