Big Data is one of the most challenging problems in computer science today, because it involves managing a series of non-trivial complexities.
To characterize the main aspects of the problem, it is common to refer to the so-called V’s of Big Data:
- Volume: data grows day after day at a critical pace, and storage must grow accordingly while facing all the related technical issues (e.g. disks fail very often when subjected to an intensive rate of I/O operations)
- Velocity: data arrives almost in real time, demanding a proper storage infrastructure
- Variety: there is no standardization in data formatting, which makes it hard to organize the data in a smart way
- Veracity: data must be cleaned of “false” entries, which can make the results of the analysis less accurate
- Visualization: charts provide a meaningful way of communicating a result
- Value: the final goal, to extract something valuable for the company.
The main reasons that brought us into the Big Data field were the need to parse, process and analyze a huge volume of data, written once persistently to disk and read multiple times. The starting point was a dataset obtained with TSTAT, a passive sniffer able to log several network monitoring statistics: it was a typical “embarrassingly parallel” workload, namely a problem that can easily be split into a number of independent tasks, with no need for communication or sharing between them.
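As a minimal sketch of what “embarrassingly parallel” means in practice, the Python snippet below processes each chunk of a log independently and only collects the results at the end. The two-column line format (client address, byte count) and the function names are illustrative assumptions; they do not reproduce the actual TSTAT log format or our processing code.

```python
from concurrent.futures import ThreadPoolExecutor


def bytes_per_client(lines):
    """Sum the bytes seen per client in ONE chunk; no shared state is read or written."""
    totals = {}
    for line in lines:
        client, nbytes = line.split()
        totals[client] = totals.get(client, 0) + int(nbytes)
    return totals


def process_chunks(chunks, workers=4):
    """Run every chunk as an independent task and return the results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(bytes_per_client, chunks))


# Hypothetical chunks (e.g. one per log file); NOT the real TSTAT format.
chunks = [
    ["10.0.0.1 100", "10.0.0.2 50", "10.0.0.1 25"],
    ["10.0.0.3 10"],
]
results = process_chunks(chunks)
```

Since the tasks never communicate, the thread pool used here could be replaced by a process pool or by a cluster scheduler without changing the structure of the computation.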
SmartData@PoliTO owns the current Politecnico’s Big Data cluster (built in 2015), the pilot project realized to satisfy our research aims. It is composed of 30 worker nodes with a maximum processing capacity of 220 TB of data.
Today this infrastructure serves several academic activities: in particular, it is the reference point for a Master’s Degree course dedicated to Big Data with more than 300 students, and it is used daily by several researchers from various departments.
Given the ever-increasing demand for the infrastructure, we plan to design a second cluster, exploiting the experience gathered over these years.
The cluster architecture has been designed to be horizontally scalable: scaling out means that when the current resources become insufficient for the requested tasks, brand new hardware can be installed alongside the existing one in order to enhance overall performance.
In order to draw up the fundamental requirements of the new architecture, a survey was conducted among all the SmartData@PoliTO members. Here is a brief summary of the desired capabilities:
- storage capacity up to 4.5 PB
- each worker equipped with at least 256 GB of RAM, two CPUs and dual 10GbE network interfaces
- up to 70 machines in the cluster
- introduction of some GPUs for experimentation
Our new Big Data solution aims to be optimized for processing large quantities of data, taking into account high availability and continuity of service. To achieve that, the Hadoop framework has been chosen as the best option to store and maintain data across the cluster: HDFS (Hadoop Distributed File System) replicates each “piece” of data across several different machines (e.g. 3), making the data more resilient and fault-tolerant.
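To illustrate why replication makes data fault-tolerant, here is a small Python sketch that assigns each block to three distinct machines and then checks which blocks remain readable after some machines fail. The round-robin placement and the node names are simplifying assumptions made for illustration only; the real HDFS placement policy is more sophisticated (it is rack-aware, for instance) and is not reproduced here.

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simplified round-robin)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live node."""
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    ]


nodes = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical names
placement = place_replicas(num_blocks=6, nodes=nodes)

# With 3 replicas per block, any 2 simultaneous node failures leave every block readable.
survivors = readable_blocks(placement, failed={"node2", "node3"})
```

In this toy model a block is lost only when all three of its holders fail at the same time, which is the intuition behind choosing a replication factor such as 3.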
Hadoop also provides an implementation of the MapReduce programming paradigm, which broadly consists in splitting huge volumes of data into small pieces and processing each of them in parallel, distributing the computation across many machines.
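The paradigm can be sketched with the classic word-count example. The snippet below is plain Python running on a single machine, not Hadoop code: the function names and input strings are illustrative assumptions, and the three phases (map, shuffle, reduce) only mimic what the framework would distribute across many nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one piece of input into (key, value) pairs, independently of other pieces."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


# Each string stands for a piece of data that a different machine could process.
chunks = ["big data big cluster", "data cluster data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
```

Because `map_phase` looks at one chunk at a time and `reduce_phase` at one key at a time, both steps can be spread over many workers, which is what the framework automates.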
One of the main drawbacks of Hadoop MapReduce is that it relies heavily on disk operations (each chunk of data is read from and written to disk), resulting in slow computation. The solution is to move to an in-memory processing approach, the one implemented by Spark. This strongly enhances performance, with an improvement of up to 100 times in computation speed.
Cluster technical features
The cluster of the BigData laboratory consists of a group of nodes, organized in two racks:
- 30 worker nodes – total storage available 768 TB, memory 2 TB (8 GB per computing core)
  - 18 nodes provided by Dell – maximum storage available per node 36 TB
  - 12 nodes from the previous BigData cluster, brought by DET in 2013 – maximum storage available per node 10 TB
- 3 master nodes, also provided by Dell
- 1 storage device provided by QNAP – 40 TB RAID-6
- 4 48-port layer-2 switches, provided by Dell, used to interconnect all nodes of the cluster
- Uninterruptible Power Supply
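The worker figures in the list are consistent with each other, as the short Python check below shows. The per-node values come from the list itself; reading “2 TB” as 2048 GB is our assumption, and the resulting core count is only an estimate derived from the stated 8 GB per core.

```python
# Per-node figures taken from the list above.
dell_nodes, dell_tb_per_node = 18, 36
legacy_nodes, legacy_tb_per_node = 12, 10

total_workers = dell_nodes + legacy_nodes
total_storage_tb = dell_nodes * dell_tb_per_node + legacy_nodes * legacy_tb_per_node

# Assumption: 2 TB of memory is read as 2 * 1024 GB.
total_memory_gb = 2 * 1024
gb_per_core = 8
estimated_cores = total_memory_gb // gb_per_core

print(f"workers: {total_workers}")
print(f"raw storage: {total_storage_tb} TB")
print(f"estimated computing cores: {estimated_cores}")
```

The 18 Dell nodes account for 648 TB and the 12 older nodes for 120 TB, which together give the 768 TB reported for the 30 workers.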
The rack cooling of the cluster is based on a closed-architecture solution, an energy-efficient method in which the cooling unit is integrated on the side of the rack, without affecting the temperature of the room where the two racks are located. With this approach the air circulates only inside the rack and not within the whole room. This is an advantage for the cluster because it is easier to maintain the desired temperature in the two racks and to reduce energy waste.