Francisco Herrera Triguero, Professor of Computer Science at the University of Granada
Jorge Casillas Barranquero, Salvador García López, Alberto Fernández Hilario, Julián Luengo Martín, Francisco Charte Ojeda, Daniel Peralta Cámara, Sara del Río García, Sergio Ramírez Gallego and Elena Ruiz Sánchez, University of Granada.
University of Granada
We are witnessing a growing trend in the study and application of Big Data problems. This is mainly due to the great advantages of extracting knowledge from a high volume of information. For this reason, we observe a migration of standard Data Mining systems towards a more complete scheme known as Data Science.
Data Science has also linked model-building analytics with preprocessing approaches within the Big Data scenario. This preprocessing stage is devoted to enhancing the quality of the input data in order to boost the performance of the models that are applied. However, it is a challenging task in Big Data, as existing approaches cannot be applied directly because of scalability issues. In addition, the lack of a widely used package or library is a shortcoming for the data science community.
In this project we focus on Big Data Preprocessing as a central challenge, developing models and software tools suited to such a demanding environment. Specifically, the project comprises three main objectives:
- Design and development of data preprocessing models for Big Data. For this task, we will propose novel models covering imperfect data treatment, data reduction, imbalanced class learning, and preprocessing for non-standard classification tasks (multi-label and data streams).
- These new Big Data preprocessing models will result in new software tools, used both to validate and to exploit the novel techniques. In particular, we will create quality software packages for the two emerging Big Data platforms, Spark and Flink, and an R package that acts as a container of dependencies on the whole set of preprocessing packages in CRAN, including new small packages for specific tasks.
- Finally, we plan to demonstrate the advantages and effectiveness of the models in three case studies: encephalogram stream analysis, biometrics (fingerprint recognition), and analysis of users' satisfaction (in banking).