Are you ready for a call? – Spontaneous conversations in tourism for speech-to-speech translation systems
Darinka Verdonik, Matej Rojc
Faculty of Electrical Engineering and Computer Science,
Smetanova ul. 17,
2000
E-mail: darinka.verdonik@uni-mb.si, matej.rojc@uni-mb.si
Abstract
The paper represents the Turdis
database of spontaneous conversations in tourist domain in Slovenian language. Database
was built for use in developing speech-to-speech translation components,
however it can be used also for developing dialog systems or used for
linguistic researches. The idea was to record a database of telephone
conversations in tourism where the naturalness of conversations is affected as
little as possible while we still obtain a permission for recording from all
the speakers. When recording in studio environment there can be many problems. It
is especially difficult to imitate a tourist agent if a speaker does not have such
experiences and therefore lacks the background knowledge that a tourist agent
has. Therefore the Turdis database was recorded with professional tourist
agents. The agreement with local tourist companies enabled that we recorded a
tourist agent while he was at his working place in his working time answering
the telephone. Callers were contacted individually and asked to use the Turdis
system and make a call to selected tourist company. Technically the recording
was done using PC ISDN card. Database was orthographically transcribed with
Transcriber tool. At the present it includes cca. 43.000
words.
1. Introduction
The goal of speech-to-speech translation is that one day a person could for example carry out a telephone conversation with a speaker with whom she/he shares no common language. However, researchers developing speech-to-speech translation technologies (this includes speech recognition, speech centred translation and speech synthesis) find the spontaneous conversation full of phenomena like disfluencies, repairs, false starts, hesitations, filled pauses, silences, repetitions etc. In the context of the conversation, these phenomena are not disturbing, the opposite, they may have their own communicative role, but for speech-to-speech translation technologies this is very problematic. The TC-STAR consortium (http://www.tcstar.com.tw/) therefore suggests that simple combining of machine translation technics, developed for the translation of the written text, with speech recognition and speech synthesis into speech-to-speech translation systems cannot achieve satisfying quality, but special approaches to the speech centred translation are needed. Similar is concluded in the Verbmobil (http://verbmobil.dfki.de/verbmobil/VM. English.Mail.30.10.96.html) and other projects where speech-to-speech translation systems were built.
2. Recording conversations in a studio
In order to get closer to the solution, the databases of spontaneous conversations are needed first, to be able to track and study the spontaneous conversation phenomena. For that we wanted to obtain the conversations as real and natural as possible.
Recording real conversations can be difficult since speakers have to be notified in advance that their conversation will be recorded. Depending on a type of conversation we want to record, there may rise different problems. One of the problems is certainly to convince the speakers to allow the recording. We must consider that we will not get a permission to record conversations where personal or classified data are discussed. If we want to record conversations between a seller and his customer we would scare some customers since not all people are ready to be recorded, so there is a small chance that a seller would agree. We must also consider that naturalness of conversation is usually affected as soon as speaker knows that she/he is being recorded.
In many projects of speech-to-speech translation systems databases of conversations were recorded in studio: Janus (Shum et al., 1994; Lavie et al., 1997), Verbmobil (Kurematsu et al., 2000), EuTrans (Aiello et al., 1999), LC-STAR (Arranz et al., 2004) etc.), also Nespole! (Mana et al., 2004), but with professional tourist agent.
When planning the recording for the Turdis
database we first recorded some examples of imitated conversations in the studio
with the scenario of making hotel reservations. The speaker imitating the role
of the hotel receptionist got a calendar and the speaker imitating a caller got
a task to make a reservation for specified number of rooms, beds and for
specified dates. The instructions were to talk to each other naturally, not to
prepare sentences or learn a dialogue by heart. First recordings showed many
problems. It was very hard to motivate speakers (the same report other
researchers (Arranz et al., 2004). Since we were recording telephone
conversations in tourist domain it was especially difficult to imitate the role
of a professional tourist agent who usually provides information: if a person has
never been a professional tourist agent she/he does not have all necessary
background knowledge. Many problems rise from this. Some are contextual: for
example a speaker in studio answered the telephone ringing with: Dober dan. Hotel Piramida. Rezervacije. (
3. Recording conversations with the Turdis system
In the Turdis database presented here we selected
telephone conversations in tourist domain where professional tourist agent
provides information to customers. The domain was chosen according to the LC-STAR
project (www.lc-star.com; Arranz 2004; Foerse et al., 2004) in which
Since the tourist domain in general is too broad as a domain of interest for typical speech-to-speech translation applications, it was further restricted to the following sub-domains:
- telephone conversations in tourist agency
- telephone conversations in tourist office
- telephone conversations in hotel reception
Comparing to the LC-STAR project we did not include telephone conversations in railway and airline companies since it would not be so interesting in the local tourist environment.
In order to avoid most of the problems rising from recording imitated conversations in studio, we made two steps: we contacted professional tourist companies for cooperation, and we enabled the speakers to use the Turdis recording system in their natural environment, professional tourist agent at his working place, and potential customer at home, office or anywhere else.
Technically this was made possible by using the ISDN card. The recording system Turdis uses both available ISDN channels. One is used for connection with an agent and the other for connection with a caller. Callers do not call a tourist agency directly, instead they call the Turdis system. The system calls an agent in the selected tourist agency immediately after receiving a call from a caller. When both connections are established the system automatically connects both lines and establishes direct connection between a caller and an agent. At the same time recording session on both channels starts. The solution was very efficient on shorter distances, but when calling from larger distances (eg. to tourist agency 100 km or more away from where the recording system was placed) echo was already quite disturbing for the callers.
We needed two kinds of speakers for
recording: professional tourist agents and callers. In order to find the agents
who would participate in the recording, we first contacted local tourist
companies for general permission that their agents can spend some (li
The callers were contacted individually and
asked to make a call; they were mostly employees and students of the
We did not give much limitations about a
topic of conversation since it was already enough restricted by conversational
situation: calls could be made only to two hotel receptions (in Hotel Piramida
and Hotel Habakuk), local tourist office (MATIC) and four different tourist
agencies (Sonček, Kompas, Neckermann Reisen, Aritours), all in
All conversations were in Slovenian language which was also a mother tongue of all the callers. As an exemplary study we recorded one conversation between a native German speaker and Slovenian agent in the tourist office. Even though the agent spoke good German some language barriers affected the conversation, e.g. his utterances included longer pauses, sometimes searching for words, more repairing, some standard language imperfections etc.
4. Transcribing
Recorded material was transcribed using the Transcriber tool (http://www.etca.fr/CTA/gip/Projets/Transcriber/ -fr/user.html). We considered some of the EAGLES recommendations (http://www.lc.cnr.it/EAGLES96/ spokentx/) and principles of transcribing BNSI Broadcast News database (Žgank et al., 2004).
4.1 Segmentation
The basic units of segmentation of recorded material are segments or utterances. In order to achieve unitary segmentation, we defined the utterance as a semantic unit of speech, limited with pauses and marked with intonation, in speech of the same speaker. The average length of the utterances, segmented according to this rule, was between 6 and 7 words.
One or more utterances construct a turn, i.e. everything that speaker says before the next speaker starts talking. The way of recording enabled overlapping speech which is common characteristic of conversation. Overlapping speech in the Turdis database was transcribed in a special overlapping segment where speech of each speaker was transcribed. It was very common that overlapping speech started or ended in the middle of an utterance. To keep the information about utterances, we included tags that marked:
1) where the utterance is broken because overlapping speech started or ended (sign [1]),
2) where the utterance continues because overlapping speech started or ended (first next sign [2]),
3) where there should be a new segment for new utterance in speech of the same speaker but was not tagged by segment because of overlapping with the other speaker (sign [P]).
Example (the way it is seen through Transcriber tool, not in XML file):
In
Slovenian language:
Sp1[overlap]: štirinajst
dni prej bi že bilo fajn
Sp2[overlap]: mislim kak ...
[P] štirinajst dni [1]
Sp2: [2] najmanj prej[+overlap_ja]
no
In English:
Sp1[overlap]: fourteen days before is recommended
Sp2[overlap]: I mean how ... [P]
at least [1]
Sp2: [2] fourteen days
before[+lex=overlap_ja] I see
In the recorded conversations it was very common that listener was expressing his attention, understanding, agreement with short words like mhm, aha, ja (Eng. ah, oh, yeah etc.), without signaling that he would like to take over the turn. We did not consider this as overlapping speech, so this words were tagged as special overlapping events (e.g. [lex=overlap_ja] where lex=overlap is the description of event and ja is the word that was pronounced, see previous example).
At a section level only four types
of section were tagged: the beginning, the main body and the end of a conversation,
and longer breaks, e.g. silences or connecting telephone line (more than 1,5
sec.).
Most of the well heard sound events were
tagged: breathing, laughing, coughing, incomprehensible or quiet speech, as well
as some background sounds (eg. telephone ringing).
4.2 Transcription
Transcription was basically
orthographic, however phonetic transcriptions with SAMPA symbols (Zemljak et
al., 2002) were added for some words. In Slovenian language there can be
important difference between spoken language and literary standard because of
vocal reduction. It is very common that some vowels are not pronounced when speaking,
e.g. for word [” t u: - d i] (
Proper names, abbreviations, spelling,
foreign words... were all tagged the way they can be automatically tracked.
Some basic prosody information were added:
shorter pauses within speech (sing [.]),
prolonged syllables (sign [:]),
raising intonation (sign ?),
emphasized pronunciation (sign #), as
well as cut off words (sign ()) and cut off or unfinished utterances (sign ...).
While Transcriber tool enables some
basic information about speaker and turn, these are added as well: gender,
dialect, and mode, fidelity and channel.
The format of the transcriptions in
XML is determined by Transcriber tool.
5. Some main characteristics of the Turdis database
At the
present the Turdis database, according to it’s size, represents only a
foundation for further work. The transcriptions include 43.000 words. The
recorded material includes 74 recordings. In 7 recordings the agent who
answered the phone connected a caller to some other agent, so there were two
conversations in one recording, all together 81 conversations. The total length
of the recordings is 4,6 hours, the average length of a conversation 3,4
minutes. The table 1 shows more details about number and length of
conversations.
|
|
No. of conv. |
Total length |
|
Tourist agency |
48 |
169,2 min. |
|
Tourist office |
18 |
66,3 min. |
|
Hotel
reception |
15 |
42,2 min. |
|
Total |
81 |
277,7 min. |
Table
1: Number and total length of conversations in the Turdis database.
We can see
that more than a half of the conversations were recorded with tourist agency. It
is because the information that a tourist agency can offer are far more diverse
and numerous than information in a hotel reception. There is a lot of tourist
agencies in the local area where we were recording and many people working as
tourist agents, but only one tourist office with few tourist agents providing
information.
The
conversations were done by 75 speakers, 42 callers and 33 tourist agents. Table
2 brings further details about the gender of the speakers. Mostly females work
as a tourist agent who provides information therefore and that reflects also in
the Turdis database.
|
|
Male |
Female |
|
Tourist agents |
6 |
18 |
|
Callers |
24 |
18 |
|
Total |
30 |
36 |
Table
2: Gender of the callers and the agents in the Turdis database.
Slovenian
language is dialectically very diverse. At the most general level we
distinguish the northeast dialects and the southwest dialects. The speakers in
recorded conversations were speaking different northeast dialects, only three
callers and three agents were speakers of southwest dialects. It is still the
task for future work to obtain conversations with speakers of the southwest
dialects.
6. Conclusion
The Turdis
database was basically built for use in developing speech-to-speech translation
components: speech recognition, speech centred translation, speech synthesis. At
the moment it can be used for improving acoustic and language models. It is very
important source of knowledge about conversation and spontaneous speech
phenomena (eg. vocal reduction, morphological particularities, conversational
words which are not part of literary standard, utterance structure like
repairing, non-standard word order) which is necessary for developing
technology. It can be used also for developing dialog systems or for linguistic
researches. The next step should be translation of the transcriptions in
Turdis; as the parallel corpus it could be used also for developing speech
centred translation.
7. Acknowledgements
We sincerely thank to all tourist
companies who participated in recording: tourist agencies Sonček, Kompas, Neckermann Reisen and Aritours, to Terme Maribor,
especially Hotel Piramida and Hotel Habakuk, and to Mariborski zavod za turizem with tourist
office MATIC. We also thank to all
tourist agents in these companies who participated in recording and to all the
callers who were ready to use the Turdis system.
8. References
Aiello, D., L. Cerrato, C. Delogu,
A. Di Carlo (1999). The Acquisition of a Speech Corpus
for Limited Domain Translation. In Proceedings of Eurospeech 1999,
Arranz, V., N. Castell, J. Gimenez, H. Ney, N. Ueffing, (2004). Description of language resources used for experiments. http://www.lc-star.com/archive.htm.
Foerse, H., E.
Hartikainen, H. van den Heuvel, G. Maltese, A. Moreno, S. Shammass, U.
Ziegenhain (2004). Creation and Validation of Language Resources for
Speech-to-Speech Translation Purposes. In Proceedings of the 4th
International Conference on Language Resources and Evaluation. Lisbon, Portugal.
Kurematsu, A., Akegami, Y., Burger, S., Jekat, S., Lause, B., MacLaren, V., Oppermann, D., Schultz, T. (2000). Verbmobil Dialogues: Multifaced Analysis. In Proceedings of the International Conference of Spoken Language Processing.
Lavie, A., L. Levin, P. Zhan, M. Taboada, D. Gates, M. Lapata, C.
Clark, M. Broadhead, A. Waibel (1997). Expanding the Domain of a Multi-lingual
Speech-to-Speech Translation System. In Proceedings of the Workshop on Spoken
Language Translation.
Mana, N., R. Cattoni, E. Pianta, F. Rossi, F. Pianesi, S. Burger
(2004). The Italian NESPOLE! Corpus: a Multilingual Database with Interlingua
Annotation in Tourism and Medical Domains. In Proceedings of 4th
International Conference LREC’04.
Shum, B., L. Levin, N. Coccaro, J. Carbonell, K. Horiguachi, R. Isotani, A. Lavie, L. Mayfield, C. P. Rose, C. Van Ess-Dykema, A. Waibel (1994). Speech-Language Integration in a Multi-lingual Translation System. In Proceedings of AAAI Workshop on Integration of Natural Language and Speech Processing.
Zemljak, M., Kačič, Z., Dobrišek, S., Gros, J., Weiss, P. (2002). Računalniški simbolni fonetični zapis slovenskega govora. Slavistična revija, 50(2), 159--169.
Žgank, A., T. Rotovnik, M. Sepesy Maučec, D. Verdonik, J.
Kitak, D. Vlaj, V. Hozjan, Z. Kačič, B. Horvat. Acquisition and
Annotation of Slovenian Broadcast News Database (2004). In Proceedings of the 4th International Conference on
Language Resources and Evaluation.