Computing Facilities

SmartData@PoliTO owns Politecnico’s current Big Data cluster (built in 2015), the pilot project carried out to meet our research aims. It consists of 30 worker nodes with a maximum processing capacity of 220 TB of data.

Today this infrastructure serves several academic activities: in particular, it is the reference point for a Master’s Degree course on Big Data with more than 300 students, and it is used daily by several researchers from various departments.
Given the ever-increasing demand for the infrastructure, we plan to design a second cluster, building on the experience gained over these years.

The cluster architecture has been designed to be horizontally scalable: scaling out means that when the current resources become insufficient for the requested tasks, new hardware can be installed right beside the existing one to increase overall performance.

To draw up the fundamental requirements of the new architecture, a survey was conducted among all the SmartData@PoliTO members. A brief summary of the desired capabilities follows:

  • storage capacity up to 4.5 PB
  • each worker equipped with at least 256 GB of RAM, two CPUs and dual 10 GbE network interfaces
  • up to 70 machines in the cluster
  • introduction of some GPUs for experimentation

Our new Big Data solution aims to be optimized for the processing of large quantities of data, with high availability and continuity of service in mind. To achieve this, the Hadoop framework has been chosen as the best option to store and maintain data across the cluster: HDFS (Hadoop Distributed File System) replicates each block of data across several different machines (e.g., 3), making data access more resilient and fault-tolerant.
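
As an illustration, the snippet below is a minimal sketch, in Python, of how a file could be uploaded to HDFS and kept at a replication factor of 3; it only wraps the standard hdfs command-line tools (assumed to be available on a cluster node), and the file paths are hypothetical.

    import subprocess

    # Hypothetical paths, used only for illustration.
    LOCAL_FILE = "measurements.csv"
    HDFS_PATH = "/data/measurements.csv"

    # Copy the local file into HDFS.
    subprocess.run(["hdfs", "dfs", "-put", LOCAL_FILE, HDFS_PATH], check=True)

    # Ask HDFS to keep 3 replicas of every block of this file
    # (-w waits until the target replication factor is actually reached).
    subprocess.run(["hdfs", "dfs", "-setrep", "-w", "3", HDFS_PATH], check=True)

    # Show which DataNodes hold each replica of each block.
    subprocess.run(["hdfs", "fsck", HDFS_PATH, "-files", "-blocks", "-locations"], check=True)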

Hadoop also provides an implementation of the MapReduce programming paradigm, which broadly consists of splitting huge volumes of data into small pieces and processing each of them in parallel, distributing the computation across many machines.
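
To make the paradigm concrete, the following is a purely illustrative sketch of the map/shuffle/reduce structure applied to a toy word count; it uses plain Python with a local process pool standing in for the cluster, and describes the programming model rather than Hadoop's actual implementation.

    from collections import defaultdict
    from multiprocessing import Pool

    def map_phase(chunk):
        # Map: turn a chunk of lines into (word, 1) pairs.
        return [(word, 1) for line in chunk for word in line.split()]

    def reduce_phase(item):
        # Reduce: sum all the counts collected for one word.
        word, counts = item
        return word, sum(counts)

    if __name__ == "__main__":
        lines = ["big data on hadoop", "spark and hadoop", "big data cluster"]
        # Split the input into small pieces, one per parallel map task.
        chunks = [lines[i::3] for i in range(3)]

        with Pool(3) as pool:
            mapped = pool.map(map_phase, chunks)

            # Shuffle: group all intermediate pairs by key (the word).
            groups = defaultdict(list)
            for pairs in mapped:
                for word, count in pairs:
                    groups[word].append(count)

            # Reduce each group in parallel and print the word counts.
            for word, total in pool.map(reduce_phase, list(groups.items())):
                print(word, total)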

One of the main drawbacks of Hadoop is that it relies heavily on disk operations (each chunk of data is read from and written to disk), resulting in slow computation times. The solution is to move to an in-memory processing approach, the one implemented by Spark. This greatly improves performance, with speed-ups of up to 100 times in computation time.
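
For example, using the PySpark API a dataset can be read once, kept in the executors' memory with cache(), and then reused by several computations without touching the disk again; this is only a small sketch, and the HDFS path is hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("in-memory-example").getOrCreate()

    # Hypothetical input path on HDFS.
    logs = spark.read.text("hdfs:///data/logs/*.txt")

    # Keep the dataset in memory after the first action, so that
    # later computations are served from RAM instead of disk.
    logs.cache()

    total_lines = logs.count()
    error_lines = logs.filter(logs.value.contains("ERROR")).count()

    print(total_lines, error_lines)
    spark.stop()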

Cluster technical features

The cluster of the BigData laboratory consists of a group of nodes organized in two racks:

  • 30 worker nodes – available storage 768 TB, memory 2 TB (8 GB per computing core)
    • 18 nodes provided by Dell – maximum available storage per node 36 TB
    • 12 nodes from the previous BigData cluster, brought by DET in 2013 – maximum available storage per node 10 TB
  • 3 master nodes, also provided by Dell
  • 1 storage device provided by QNAP – 40 TB RAID-6
  • 4 48-port layer-2 switches, provided by Dell, used to interconnect all the nodes of the cluster
  • Uninterruptible Power Supply

BigData rack

The rack cooling of the cluster is based on a closed-architecture solution, an energy-efficient method where the cooling unit is integrated on the side of the rack without affecting the temperature of the room where the two racks are located. With this approach the air circulates only inside the rack rather than in the whole room. This is an advantage for the cluster, since it makes it easier to maintain the desired temperature in the two racks and reduces energy waste.

Interconnection network schema (see figure).

Node characteristics

3 Master nodes DELL PowerEdge R620

  • Processors: 2 × Intel E5-2630v2, 6 cores, 2.6 GHz
  • RAM: 128 GB DDR3 ECC 1600 MHz
  • Disk space: 3 × 600 GB hard disks, hot-plug SAS 6 Gb/s, 2.5”, 10,000 rpm
  • Network interfaces: 5 GbE ports (4 for internal network + 1 for management)

18 Worker nodes DELL PowerEdge R720XD

  • Processors: 2 × Intel E5-2620v2, 6 cores, 2.6 GHz
  • RAM: 96 GB DDR3 ECC 1600 MHz
  • Disk space: 12 × 3 TB hard disks, hot-plug SAS 6 Gb/s, 3.5”, 7,200 rpm
  • Network interfaces: 5 GbE ports (4 for internal network + 1 for management)

2 Worker nodes SuperMicro from DET cluster

  • Processor: 1 × Intel Xeon, 6 cores, 2.5 GHz
  • RAM: 64 GB DDR3 ECC 1600 MHz
  • Disk space: 5 × 2 TB hard disks, hot-plug SATA, 3.5”, 7,200 rpm
  • Network interfaces: 3 GbE ports (2 for internal network + 1 for management)

10 Worker nodes SuperMicro from DET cluster

  • Processor: 1 × Intel Xeon, 6 cores, 2.5 GHz
  • RAM: 32 GB DDR3 ECC 1600 MHz
  • Disk space: 5 × 2 TB hard disks, hot-plug SATA, 3.5”, 7,200 rpm
  • Network interfaces: 3 GbE ports (2 for internal network + 1 for management)