Project description
End-to-end neural network based automatic speech recognition (ASR) systems are becoming the state-of-the-art technology. However, current training algorithms for end-to-end systems require large volumes of data, which are usually collected by large industrial companies but are not normally freely available to the research community. In recent years, many efforts have been made to create large, freely available ASR corpora from public access repositories, most of them focused on English. For languages such as Spanish, however, collecting hundreds of hours of public domain transcribed data is a challenging task that requires considerable effort and research on aligning and filtering the transcripts, since the scarcity of resources implies establishing a lower quality threshold for data collection. Targeting minority languages such as Basque poses even more challenges.

This project focuses on research into methods for collecting and processing public domain transcribed audio data in order to build state-of-the-art ASR systems from scratch. Three target languages are selected: English, Spanish and Basque. The first serves as a baseline (there have already been plenty of similar projects focused on English data collection), while the other two will be the test bed for research on alignment and filtering techniques when dealing with low-quality resources.

Four different data sources will be exploited during the project. First, small ready-to-use databases will be used to bootstrap one ASR system per language. Second, audiobooks from public repositories will be aligned, segmented and filtered (using the bootstrap ASR systems) to obtain a set of transcribed audio segments. Those segments will be used to retrain the bootstrap ASR system, iteratively realigning, filtering and segmenting all the audiobooks. Third, an automated YouTube data collection script will search for videos containing closed captions. Different alignment and filtering techniques will be applied in order to iteratively produce a set of transcribed segments with improved alignment and transcription quality, along with an enhanced ASR system. Finally, the feasibility of collecting transcribed audio from podcasts will be studied.

The availability of robust ASR systems for Basque and Spanish is essential for developing a fully automated subtitling system for the bilingual plenary sessions of the Basque Parliament. Two different solutions addressing the bilingual nature of the sessions will be developed and compared: (1) applying language diarization and then subtitling with two monolingual ASR systems; and (2) fully bilingual subtitling based on a single bilingual ASR system. The subtitles will also include speaker labels produced by a speaker diarization system. Unlike other speech processing tasks, state-of-the-art diarization systems still largely rely on generative models. The use of deep learning techniques and the migration towards end-to-end neural network based diarization approaches will also be explored in this project.

Automatic speech recognition\Low-resource languages\Public domain data\Speaker diarization\Language diarization\Automatic subtitling\Speech-to-text alignment with long audio\Semi-supervised training\Reproducible research\Open source
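
The filtering step applied to audiobook and closed-caption segments can be illustrated with a minimal sketch. The abstract does not specify a filtering criterion, so this example assumes one common choice: keep a segment only when the word error rate (WER) between its source transcript and the output of the current bootstrap ASR system stays below a threshold. The `Segment` structure, the `max_wer` parameter and its default of 0.2 are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Segment:
    """A candidate training segment (hypothetical structure)."""
    audio_id: str
    reference: str   # text taken from the audiobook or closed caption
    hypothesis: str  # output of the current bootstrap ASR system


def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance computed over word tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER of the hypothesis against the reference (case-insensitive)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        # Empty reference: perfect only if the hypothesis is empty too.
        return 0.0 if not hyp else 1.0
    return word_edit_distance(ref, hyp) / len(ref)


def filter_segments(segments: list[Segment], max_wer: float = 0.2) -> list[Segment]:
    """Keep segments whose transcript agrees closely enough with the ASR output.

    max_wer is an assumed threshold; scarcer resources may call for a looser one.
    """
    return [s for s in segments
            if word_error_rate(s.reference, s.hypothesis) <= max_wer]
```

In an iterative setup of the kind described, the retained segments would be used to retrain the ASR system, new hypotheses would be produced for all segments, and the filter would be applied again, possibly with a stricter threshold as the system improves.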