Towards Effective Natural Language Application Development
Foundations of NLP Lean Programming framework
More and more computer programs analyze written or spoken natural language. Digital voice assistants (DVAs), information extraction (IE) systems, machine translation systems, and many other types of programs process natural language to address specific use cases when interacting with humans. Amazon, Google, and Mycroft AI are just some of the companies that have produced DVAs capable of interacting with humans via voice. Such NLP applications use techniques from computer science and artificial intelligence to address their use cases. Additionally, many companies have begun to evaluate the capacity of NLP applications to improve their business processes, for instance by automatically processing customer requests received via e-mail.
Developing NLP applications requires years of experience in computer science, artificial intelligence, machine learning (ML), linguistics, and similar disciplines. Development is therefore restricted to experts with many years of training: even an application that, for instance, automatically processes customer e-mails takes this level of experience to build. However, the demand for NLP applications continues to grow, while the number of such experts remains limited. To meet this demand, companies must be able to develop such applications with in-house developers who have neither years of training nor a Ph.D. in computer science and artificial intelligence.
Based on this limitation, this thesis identifies the main obstacles that developers without many years of experience in computer science encounter when creating NLP applications. These obstacles are identified through a research project named ETL Quadrat, which aims at building an IE system for gathering EC data from human-readable documents. The development of this IE system is hindered by a number of obstacles:
- Developers require extensive knowledge of natural language, computer linguistics, statistics, ML, artificial intelligence, computer science, and NLP.
- NLP applications must preprocess natural language before addressing the applications’ use cases. Additionally, a wide variety of NLP tools is available, and it is impossible to judge which set of NLP tools will perform best for a given application. This in turn makes the construction of preprocessing NLP pipelines extremely complex.
- Often, customizing NLP tools and models is necessary in order to improve the quality of the tools’ outputs. This customization process is complex and requires a great deal of effort from domain experts and developers.
- Finally, the available tool stack for building custom NLP tools, models, and pipelines is complex and difficult to use.
Based on these and further obstacles, this thesis suggests a method based on CI/CD tools that supports developers and domain experts in building NLP applications more efficiently. This method is then implemented through the open-source project NLPf. The project is available on GitLab (https://gitlab.com/schrieveslaac/NLPf) and provides the following features to improve the development process of NLP applications:
- Based on Maven’s core features, NLPf enables quick project setup for creating a domain-specific corpus, which is then used to derive domain-specific NLP models based on existing NLP tools.
- NLPf uses build automation to determine the best-performing NLP pipeline for a given NLP application. Additionally, NLPf measures and displays common metrics.
- NLPf enables domain experts to annotate the required training data with the easy-to-use annotation tool QPT and an Xbox 360 controller.
- NLPf makes the best-performing NLP pipeline available as a Maven artifact, so that it can be integrated into any Maven project. Additionally, developers can use a simple API to integrate the best-performing NLP pipeline into their program code.
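The idea of letting a build determine the best-performing pipeline can be illustrated with a minimal sketch. It is not NLPf's actual implementation or API: the class `PipelineSelectionSketch`, the pipeline names, and the annotation strings are hypothetical, and F1 is assumed here as the ranking metric, since the abstract only speaks of "common metrics". Each candidate pipeline is scored against gold annotations, and the one with the highest F1 is selected.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;

public class PipelineSelectionSketch {

    /** Precision, recall, and F1 of one candidate pipeline (hypothetical holder type). */
    public static final class Score {
        public final String pipeline;
        public final double precision;
        public final double recall;
        public final double f1;

        Score(String pipeline, double precision, double recall, double f1) {
            this.pipeline = pipeline;
            this.precision = precision;
            this.recall = recall;
            this.f1 = f1;
        }
    }

    /** Compares the annotations a pipeline predicted with the gold annotations. */
    public static Score evaluate(String pipeline, Set<String> predicted, Set<String> gold) {
        long truePositives = predicted.stream().filter(gold::contains).count();
        double precision = predicted.isEmpty() ? 0.0 : (double) truePositives / predicted.size();
        double recall = gold.isEmpty() ? 0.0 : (double) truePositives / gold.size();
        double sum = precision + recall;
        double f1 = sum == 0.0 ? 0.0 : 2 * precision * recall / sum;
        return new Score(pipeline, precision, recall, f1);
    }

    /** Returns the candidate with the highest F1, or empty if there are no candidates. */
    public static Optional<Score> best(List<Score> scores) {
        return scores.stream().max(Comparator.comparingDouble(s -> s.f1));
    }

    public static void main(String[] args) {
        // Gold annotations from a domain expert, encoded as "begin-end:label" (made-up data).
        Set<String> gold = Set.of("0-5:PER", "10-16:ORG", "20-26:LOC", "30-34:PER");

        // Output of three hypothetical tool combinations on the same test documents.
        Map<String, Set<String>> predictions = new LinkedHashMap<>();
        predictions.put("tokenizerA+taggerA", Set.of("0-5:PER", "10-16:ORG", "40-44:LOC"));
        predictions.put("tokenizerA+taggerB", Set.of("0-5:PER", "10-16:ORG", "20-26:LOC"));
        predictions.put("tokenizerB+taggerB", Set.of("0-5:PER"));

        List<Score> scores = predictions.entrySet().stream()
                .map(e -> evaluate(e.getKey(), e.getValue(), gold))
                .collect(Collectors.toList());

        for (Score s : scores) {
            System.out.printf("%s: P=%.3f R=%.3f F1=%.3f%n", s.pipeline, s.precision, s.recall, s.f1);
        }
        best(scores).ifPresent(s -> System.out.println("Selected pipeline: " + s.pipeline));
    }
}
```

In this made-up data set, `tokenizerA+taggerB` finds three of the four gold annotations without false positives and is therefore selected. A build-automation setup in the spirit of the description above would run such an evaluation for every tool combination and publish only the winning pipeline.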
@phdthesis{doi:10.17170/kobra-20190529539,
  author    = {Schreiber, Marc},
  title     = {Towards Effective Natural Language Application Development},
  keywords  = {004 and Textverstehendes System and Natürlichsprachiges System},
  copyright = {http://creativecommons.org/licenses/by-nc-nd/3.0/de/},
  language  = {en},
  school    = {Kassel, Universität Kassel, Fachbereich Elektrotechnik/Informatik},
  year      = {2019}
}