Where: Center for Modelling and Scientific Computing, Universidad de La Frontera (UFRO), Temuco, Chile
When: January 19-22, 2016
At Universidad de La Frontera, we want to learn about HPC tools used for the analysis of Big Data. We will give courses on GPUs and Intel Xeon Phi coprocessors, the current technologies for processing large datasets, as well as on new data mining techniques to improve the understanding and modelling of Big Data.
Our invited speakers:
- Federico Meza, Universidad de Talca, Chile
- Alba Melo, Universidad de Brasilia, Brazil
- Verónica Gil-Costa, Universidad de San Luis, Argentina
- Manuel Ujaldón, Universidad de Málaga, Spain
- Thomas Fober, Philipps-Universität Marburg, Germany
- Ricardo Barrientos, Universidad de La Frontera, Chile
- Barbara Russo, Universidad de Bolzano, Italy
- Tzu-Chiang Shen, ALMA Observatory
- Martin Cabrera Aguilar, Everis Chile
- Mauricio Marin, Universidad de Santiago de Chile
- Oscar Peredo, Telefónica
Some of the talks:
Data Mining for Structured and Unstructured Data by Thomas Fober
In the last decade, the amount of data has grown rapidly, mainly due to the Internet and Web 3.0. The growth will, however, become even faster: modern technologies like RFID and smartphones collect data permanently; retail companies track the purchases of their customers, and other companies track customer behaviour. In the life sciences, vast amounts of data are collected to be used, e.g., for the design of new therapies or drugs. With Web 4.0 – the Internet of Things – other sources of data will emerge in the near future.
Without doubt, all this data contains valuable information: a retail company would like to predict the interests of customers to place target-oriented advertisements, and pharmaceutical companies would like to identify candidate compounds in data to be used as drugs. However, the amount of data is often too big to be analysed by a human, and the data is sometimes high-dimensional and highly correlated.
In this situation, data mining provides a plethora of methods to analyse Big Data. It can be used to detect patterns in data, and to learn models predicting certain classes (classification) or values (regression). Mining frequent patterns is another important topic covered by data mining. These methods work well for structured data, i.e., data represented by SQL tables. Nowadays, however, data is increasingly unstructured: a data scientist often has to cope with images or videos, unstructured text (such as web pages), or even geometric objects such as molecular structures or manufactured workpieces.
In my talk, I will present some fundamental data mining techniques for classification, regression, clustering and association mining. Subsequently, I will demonstrate how to deal with unstructured data, in particular with geometric structures, pictures and textual data. I will close my talk with some comments on scalability, and the usefulness of Spark MLlib as a framework for distributed computing.
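To make the classification task mentioned above concrete, here is a minimal, self-contained sketch of a 1-nearest-neighbour classifier on toy structured data. It is purely illustrative and not taken from the talk; real workloads would use a distributed framework such as Spark MLlib, and the feature names below are invented.

```python
import math

def nearest_neighbour(train, query):
    """train: list of (feature_vector, label) pairs; returns the label
    of the training point closest to query in Euclidean distance."""
    def dist(u, v):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
    _, label = min(train, key=lambda pair: dist(pair[0], query))
    return label

# Toy structured data: two well-separated classes in 2D feature space.
train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"),
         ((5.0, 5.2), "B"), ((4.8, 5.1), "B")]
print(nearest_neighbour(train, (1.1, 1.0)))  # → A
```

Regression works analogously, predicting a numeric value instead of a class label; scalable libraries replace the brute-force distance scan with indexing or approximate search.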
Parallel Optimal Biological Sequence Alignment of Whole Chromosomes with Hundreds of GPUs by Alba Melo
Biological Sequence Alignment is a very popular application in Bioinformatics, used routinely worldwide. Smith-Waterman is the exact algorithm used to compare two sequences, obtaining the optimal alignment in quadratic time and space complexity. In order to accelerate Smith-Waterman, many GPU-based strategies have been proposed in the literature. However, aligning sequences of Millions of Base Pairs (MBP) is still a very challenging task. In this talk, we discuss related work in the area of parallel biological sequence alignment and present our multi-GPU strategy to align DNA sequences of up to 249 MBP on 384 GPUs. In order to achieve this, we propose an innovative speculation technique, which is able to parallelize a phase of the Smith-Waterman algorithm that is inherently sequential.
We combined our speculation technique with sophisticated buffer management and fine-grain linear space matrix processing strategies to obtain our parallel algorithm. As far as we know, this is the first implementation of Smith-Waterman able to retrieve the optimal alignment between sequences with more than 50 MBP.
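For readers unfamiliar with the baseline algorithm, the following is a textbook-style sketch of Smith-Waterman with a linear gap penalty, showing where the quadratic time and space costs come from. It is only the naive sequential formulation; the scoring parameters are illustrative, and the talk's contribution lies precisely in the speculation, buffering, and linear-space techniques that this sketch omits.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the optimal local alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # Full DP matrix: quadratic space, as in the textbook formulation.
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):          # quadratic time: every cell (i, j)
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores are clamped at zero.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("GATTACA", "GCATGCU"))
```

Each cell depends on its left, upper, and diagonal neighbours, which is why cells on the same anti-diagonal can be computed in parallel on a GPU, while the matrix as a whole must be filled in a wavefront order.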
We will show the results obtained on the Keeneland cluster (USA), where we compared all the human × chimpanzee homologous chromosomes (ranging from 26 MBP to 249 MBP). The human × chimpanzee chromosome 5 comparison (180 MBP × 183 MBP) attained 10.35 TCUPS (Trillions of Cells Updated per Second) using 384 GPUs. In this case, we processed 45 petacells and produced the optimal alignment in 53 minutes and 7 seconds, with a speculation hit ratio of 98.2%.
Mining System Logs to Learn Error Predictors by Barbara Russo
Systems generate big data. Part of the data that systems generate traces changes in system behaviour. Logging services store such changes as events in log archives, and some of these events reveal undesirable system behaviours. Mining logs can be used to learn system error predictors and help system administrators gain better command over the performance of their systems.
There are several challenges in mining logs. This talk illustrates some of them, drawing on our experience mining log sequences to predict system errors in a case study of a telemetry system for race cars. We will overview our method and illustrate the challenges we faced.
Operational logs analysis at ALMA observatory based on ELK stack by Tzu-Chiang Shen
During operation, the ALMA observatory generates a huge amount of logs, which contain valuable information not only about specific failures but also for long-term performance analysis. We implemented a big data solution based on ElasticSearch, Logstash and Kibana, configured in a decoupled manner that has zero impact on existing operations and is able to keep more than six months of operation logs online. In this talk, we describe this infrastructure, the applications built on top of it, and the problems we faced during its implementation.
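The core of such a pipeline is turning raw log lines into structured documents before indexing, which Logstash typically does with grok patterns. The sketch below mimics that step in plain Python; the log format, field names, and component name are invented for illustration and do not reflect ALMA's actual schema.

```python
import re

# Hypothetical log format: "<timestamp> <LEVEL> <component> <message>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<component>\S+) (?P<message>.+)"
)

def parse_log_line(line):
    """Turn one raw log line into a dict of named fields suitable for
    indexing into a search engine; return None if the line doesn't match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

doc = parse_log_line(
    "2016-01-19T10:00:00 ERROR CorrelatorControl timeout waiting for antenna"
)
print(doc["level"], doc["component"])  # → ERROR CorrelatorControl
```

Once documents carry structured fields like `level` and `component`, the search engine can index them, and a dashboard such as Kibana can filter and aggregate over them for failure and long-term performance analysis.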
Check the program here
For more information please visit the event website here
Source: UFRO.CL