Vowels and the phonotactic clustering of German Dialects


This work addresses the question of what role vowel information plays in the hierarchical clustering of German dialects. Comparing the results for three different settings of the data (unmodified vowel information; vowels replaced by „V“; vowels completely removed) shows that in each case the data strongly tends to form the same three main clusters, corresponding to the large dialect areas Upper German, Low German/East Central German and West Central German. The fact that each setting gives approximately the same result suggests that vowels play only a minor role in the phonotactic clustering of German dialects, although they make up 30% of the data.

Several studies in German dialectology already make use of hierarchical clustering and other statistical similarity-based methods. Nerbonne and Siedle (2005) carried out a large cluster analysis based on the Levenshtein distance of selected words from the PAD corpus (see next section) and found that the result of their quantitative approach corresponds to the categorisation of traditional German dialectology. Other publications applied similarity measures to smaller areas: Lameli (2014) used Neighbor-Nets on Swabian (see McMahon and McMahon 2005 for more on Neighbor-Nets), and Birkenes (2019) calculated character-trigram cosine distances for 55 North Frisian Wenker questionnaires.


The current analysis uses annotated audio data from the Phonetic Atlas of Germany (Phonetischer Atlas von Deutschland; PAD) by Göschel (1992; 2000). From 1956 to 1990, native dialect speakers from 186 locations in Germany were asked to translate the so-called Wenker sentences into their respective dialects. A reading of these translations was recorded on tape. To ensure dialect competence, the speakers were selected from the group of so-called NORMs (non-mobile, old, rural males; see Chambers and Trudgill 1998). From 1980 to 1995, narrow phonetic transcriptions were made of all monosyllabic words; these transcriptions make up the PAD corpus.

Currently, our project is preparing broad, phonological transcriptions of the full PAD material. Students first cut the interviewer out of the original recording and then produce two orthographic transcripts: a dialect version and a (translated) Standard German version. From these transcripts, two Praat TextGrids are created with the online forced aligner WebMAUS (Schiel 1999; Kisler et al. 2017). In a last step, the resulting TextGrids are manually corrected. The corrected TextGrids are then combined in CSV format. Since our project is work in progress, the data is not yet complete. At the moment, it consists of 25 aligned recordings (for the locations, see the dendrograms below).

Because the texts that the interviewers presented to the speakers vary to differing degrees, it was necessary to reduce the data to those phrases of the Wenker sentences that were spoken by all participants, in order to avoid any bias resulting from template similarity. For each speaker, the SAMPA-annotated dialect words were split into separate sounds (with affricates and diphthongs kept as single units) and these were grouped into bigrams. As an example, the bigrams for the word g@bli:b@n are shown in (1). For each bigram, the probability (its frequency divided by the total number of bigrams) was calculated separately for each speaker, and the resulting matrix was used as input for hierarchical clustering.

  1. g@, @b, bl, li:, i:b, b@, @n
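
The bigram step described above can be illustrated with a minimal Python sketch. It is not the project's actual pipeline: the list of affricates and diphthongs in UNITS is an assumed, hypothetical inventory, and the function names are chosen for illustration only. The sketch splits a SAMPA string into sounds, attaches the length mark ":" to the preceding sound, forms bigrams, and computes relative frequencies for one speaker.

```python
from collections import Counter

# Assumed (hypothetical) inventory of sounds that are kept as single units.
AFFRICATES = ("pf", "ts", "tS")
DIPHTHONGS = ("aI", "aU", "OY")
UNITS = AFFRICATES + DIPHTHONGS


def segment(word):
    """Split a SAMPA string into sounds, keeping listed units and length marks."""
    sounds = []
    i = 0
    while i < len(word):
        # Prefer a two-character affricate or diphthong if one starts here.
        sound = word[i:i + 2] if word[i:i + 2] in UNITS else word[i]
        i += len(sound)
        # Attach a following length mark ":" to the sound it belongs to.
        if i < len(word) and word[i] == ":":
            sound += ":"
            i += 1
        sounds.append(sound)
    return sounds


def bigrams(word):
    """Return the list of adjacent sound pairs of a word."""
    sounds = segment(word)
    return [a + b for a, b in zip(sounds, sounds[1:])]


def bigram_probabilities(words):
    """Relative frequency of each bigram over all words of one speaker."""
    counts = Counter(bg for word in words for bg in bigrams(word))
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()}


print(bigrams("g@bli:b@n"))
# ['g@', '@b', 'bl', 'li:', 'i:b', 'b@', '@n']
```

A simple greedy match like this would also treat a sequence such as t + s across a morpheme boundary as an affricate; whether that is desirable depends on the annotation, which the sketch does not model.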

The data was analysed using hierarchical cluster analysis in the statistics software R. Hierarchical clustering is a family of distance-based methods that group objects according to their similarity (Jain and Dubes 1988). Ward (1963) suggests an agglomerative approach that takes a bottom-up perspective: first, each object is treated as an individual cluster; then, step by step, objects are combined into larger clusters by merging, pair by pair, those that are most similar to each other. The graphical representation of such a cluster analysis is the so-called dendrogram, in which the leaves represent the individual objects and the nodes stand for the clusters of all the objects below them. For this analysis, hierarchical clustering was done using the R package pvclust (Distance = Ward; Method = Average) by Suzuki and Shimodaira (2006).
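
The bottom-up merging idea can be sketched in a few lines of Python. This is only an illustration of agglomerative clustering in general, not a reimplementation of pvclust: it assumes Euclidean distance between bigram-probability profiles and average linkage, it computes no p-values, and the labels and numbers in the toy input are invented for the example rather than taken from the PAD data.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two sparse probability profiles (dicts)."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def cluster_distance(a, b, profiles):
    """Average linkage: mean pairwise distance between members of a and b."""
    total = sum(euclidean(profiles[x], profiles[y]) for x in a for y in b)
    return total / (len(a) * len(b))


def agglomerate(profiles):
    """Merge the two closest clusters until one is left; return the merges."""
    clusters = [(label,) for label in sorted(profiles)]  # every object alone
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], profiles),
        )
        merges.append((a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Toy profiles with invented values, for illustration only.
toy = {
    "A": {"bl": 0.5, "li:": 0.5},
    "B": {"bl": 0.6, "li:": 0.4},
    "C": {"pf": 1.0},
}
for left, right in agglomerate(toy):
    print(left, "+", right)
```

The returned merge history corresponds to the nodes of a dendrogram read from the leaves upwards: here the two similar profiles are joined first and the dissimilar one is attached last.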


Hierar­chi­cal cluste­ring was done with three different settings: (i) unmodi­fied, (ii) all vowels changed to „V“ and (iii) all vowels comple­te­ly removed.
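
As a rough illustration of the three settings, the following Python sketch transforms an already segmented word. The set of SAMPA vowel symbols in VOWEL_CHARS and the rule that a sound counts as a vowel if its first character is in that set (so that a diphthong becomes a single „V“) are assumptions made for this example; the setting names are likewise hypothetical.

```python
# Assumed set of SAMPA vowel symbols; a sound counts as a vowel if it
# starts with one of them (so "i:" and a diphthong like "aI" are vowels).
VOWEL_CHARS = set("iIeEaoOuUyY29@6")


def is_vowel(sound):
    return sound[0] in VOWEL_CHARS


def apply_setting(sounds, setting):
    """Return the sound sequence under one of the three settings."""
    if setting == "unmodified":
        return list(sounds)
    if setting == "V":
        # Keep the vowel position but discard the vowel quality.
        return ["V" if is_vowel(s) else s for s in sounds]
    if setting == "removed":
        # Discard vowels entirely, including their position.
        return [s for s in sounds if not is_vowel(s)]
    raise ValueError(f"unknown setting: {setting}")


sounds = ["g", "@", "b", "l", "i:", "b", "@", "n"]  # g@bli:b@n
for setting in ("unmodified", "V", "removed"):
    print(setting, "".join(apply_setting(sounds, setting)))
# unmodified g@bli:b@n
# V gVblVbVn
# removed gblbn
```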

1. Setting: Unmodified

Fig. 1 shows the dendrogram obtained when clustering the data with unmodified vowel information. The red numbers give the AU p-value (“approximately unbiased p-value for non-selective inference”), which indicates how well a cluster is supported by the data (Suzuki and Shimodaira 2015). The green numbers give the bootstrap probability value, which is, however, much less accurate than the red AU value (Suzuki and Shimodaira 2015). The grey numbers merely serve as identifiers for the clusters. The dendrogram in Fig. 1 shows three large groups: cluster 22 (p-value 92%) mainly contains recordings from the West Central German area (and a few Upper German recordings), cluster 20 (p-value 85%) mostly consists of Low German and East Central German recordings, and cluster 21 (p-value 91%) groups together only Upper German recordings. These three groups correspond to known German dialect classifications such as Wiesinger (1983), König (2005), Nerbonne and Siedle (2005) or Girnth (2007). This suggests that phonotactic information plays an important role in distinguishing broader dialect areas.

Fig. 1 Cluste­ring data with unmodi­fied vowel information

2. Setting: Vowels changed to „V“

In the second setting, vowel quality information is removed from the data by simply replacing all vowels with „V“; for example, g@bli:b@n becomes gVblVbVn. Fig. 2 shows the resulting dendrogram. It consists of two main clusters: cluster 23 (p-value 79%) mostly contains recordings from the Upper German area, and cluster 22 (p-value 76%) is further subdivided into a Low German/East Central German cluster (cluster 17; p-value 94%) and a West Central German cluster (cluster 20; p-value 88%). This suggests that German dialects can still be grouped phonotactically when vowel quality information is removed.

Fig. 2 Cluste­ring data with all vowels replaced by „V“

3. Setting: Vowels comple­te­ly removed

Even when vowels are completely removed (g@bli:b@n becomes gblbn), the three main clusters remain, as shown in Fig. 3: Low German/East Central German in cluster 22 (p-value 92%), West Central German in cluster 20 (p-value 93%) and Upper German in cluster 21 (p-value 89%). This indicates that even if vowel position information is not available, the larger German dialect groups can still be distinguished on the basis of phonotactics.

Fig. 3 Cluste­ring data with all vowels comple­te­ly removed


Vowels make up around 30% of the data used in this work. However, the three clustering analyses show that they do not play a major role in the phonotactic grouping of German dialects. In all three settings, approximately the same three large groups were present in the dendrograms: Low German/East Central German, West Central German and Upper German. Surprisingly, the clearest picture was obtained under the third setting, without vowels: in this dendrogram, only a few outliers from West Central German are in the Upper German cluster and vice versa.


  • Birkenes, Magnus Breder (2019): North Frisian dialects: A quanti­ta­ti­ve inves­ti­ga­ti­on using a parallel corpus of trans­la­ti­ons. Us Wurk 68.3–4: 119–168. https://doi.org/10.21827/5c98880d173a4
  • Chambers, J.K. / Peter Trudgill (1998): Dialec­to­lo­gy. 2nd edn. Cambridge: Cambridge Univer­si­ty Press. https://doi.org/10.1017/CBO9780511805103
  • Girnth, Heiko (2007): Variationslinguistik. In: Steinbach, Markus, et al.: Schnittstellen der germanistischen Linguistik. Stuttgart: Metzler, 187–217. https://doi.org/10.1007/978-3-476-05042-7_6
  • Göschel, Joachim (1992): Das Forschungs­in­sti­tut für Deutsche Sprache ‘Deutscher Sprach­at­las’. Wissen­schaft­li­cher Bericht. Marburg: FIDS.
  • Göschel, Joachim (2000): Der Phonetische Atlas von Deutschland. Јужнословенски филолог 56: 283–288.
  • Jain, Anil K. / Richard C. Dubes (1988): Algorithms for Clustering Data. Englewood Cliffs, New Jersey: Prentice Hall.
  • König, Werner (2005): dtv-Atlas deutsche Sprache.
  • Lameli, Alfred (2014): Distanz als raumstruk­tu­rel­le Eigen­schaft dialek­ta­ler Kontakt­si­tua­tio­nen. Eine Analyse des Schwä­bi­schen. Dominique Huck (Hg.): Aleman­ni­sche Dialek­to­lo­gie: Dialekte im Kontakt. Beiträge zur 17: 67–86.
  • McMahon, April / Robert McMahon (2005): Language classi­fi­ca­ti­on by numbers. Oxford: Univer­si­ty Press.
  • Nerbonne, John / Christine Siedle (2005): Dialekt­klas­si­fi­ka­ti­on auf der Grundlage aggre­gier­ter Ausspra­che­un­ter­schie­de. Zeitschrift für Dialek­to­lo­gie und Lingu­is­tik: 129–147.
  • Schiel, Florian (1999): Automatic Phonetic Transcrip­ti­on of Non-Prompted Speech. Proc. of the ICPhS, 607–610.
  • Kisler, T. / Reichel U. D. / Schiel, F. (2017): Multi­lin­gu­al proces­sing of speech via web services. Computer Speech & Language 45: 326–347.
  • Suzuki, Ryota / Hidetoshi Shimodai­ra (2006). Pvclust: an R package for assessing the uncer­tain­ty in hierar­chi­cal cluste­ring. Bioin­for­ma­tics 22.12: 1540–1542.
  • Suzuki, Ryota / Hidetoshi Shimodaira (2015): Package ‘pvclust’. R package manual.
  • Ward Jr, Joe H. (1963): Hierar­chi­cal grouping to optimize an objective function. Journal of the American statis­ti­cal associa­ti­on 58.301: 236–244. http://dx.doi.org/10.1080/01621459.1963.10500845
  • Wiesinger, Peter (1983): Die Eintei­lung der deutschen Dialekte. Berlin: de Gruyter. http://dx.doi.org/10.1515/9783110203332.807

Cite this article as:

Link, Samantha (2021): Vowels and the phono­tac­tic cluste­ring of German Dialects. Sprach­spu­ren: Beiträge aus dem Deutschen Sprach­at­las 1(3). https://sprachspuren.de/phonotacticclustering.

Samantha Link
Samantha Link is a research associate at the Forschungszentrum Deutscher Sprachatlas and is working on her doctorate within the DFG project "Phonotaktik der Dialekte in Deutschland". She studied computational linguistics, general linguistics, German studies and Protestant theology at the University of Tübingen.