
dc.contributor.author: Prochazka, V.
dc.contributor.author: Pollak, P.
dc.contributor.author: Zdansky, J.
dc.contributor.author: Nouza, J.
dc.date.accessioned: 2016-03-01T09:16:14Z
dc.date.available: 2016-03-01T09:16:14Z
dc.date.issued: 2011-12 [cs]
dc.identifier.citation: Radioengineering. 2011, vol. 20, no. 4, pp. 1002-1008. ISSN 1210-2512 [cs]
dc.identifier.issn: 1210-2512
dc.identifier.uri: http://hdl.handle.net/11012/56902
dc.description.abstract: In this paper, we investigate the usability of publicly available n-gram corpora for the creation of language models (LMs) applicable to Czech speech recognition systems. N-gram LMs with various parameters and settings were created from two publicly available sets: the Czech Web 1T 5-gram corpus provided by Google and a 5-gram corpus obtained from the Czech National Corpus Institute. For comparison, we also tested an LM built from a large private resource of newspaper and broadcast texts collected by a Czech media mining company. The LMs were analyzed and compared from a statistical point of view (mainly via their perplexity) and from a performance point of view when employed in large vocabulary continuous speech recognition (LVCSR) systems. Our study shows that the Web1T-based LMs, even after intensive cleaning and normalization, cannot compete with those built from smaller but more consistent corpora. The experiments, carried out on large test data, also illustrate the impact of Czech as a highly inflective language on perplexity, out-of-vocabulary (OOV) rate, and recognition accuracy. [en]
dc.format: text [cs]
dc.format.extent: 1002-1008 [cs]
dc.format.mimetype: application/pdf [en]
dc.language.iso: en [cs]
dc.publisher: Společnost pro radioelektronické inženýrství [cs]
dc.relation.ispartof: Radioengineering [cs]
dc.relation.uri: http://www.radioeng.cz/fulltexts/2011/11_04_1002_1008.pdf [cs]
dc.rights: Creative Commons Attribution 3.0 Unported License [en]
dc.rights.uri: http://creativecommons.org/licenses/by/3.0/ [en]
dc.subject: speech recognition [en]
dc.subject: LVCSR [en]
dc.subject: n-gram language models [en]
dc.subject: public language resources [en]
dc.title: Performance of Czech Speech Recognition with Language Models Created from Public Resources [en]
eprints.affiliatedInstitution.faculty: Fakulta elektrotechniky a komunikačních technologií [cs]
dc.coverage.issue: 4 [cs]
dc.coverage.volume: 20 [cs]
dc.rights.access: openAccess [en]
dc.type.driver: article [en]
dc.type.status: Peer-reviewed [en]
dc.type.version: publishedVersion [en]



Except where otherwise noted, this item's license is described as Creative Commons Attribution 3.0 Unported License