SmartData@PoliTO operates Politecnico's BigData@PoliTO Cluster, built in 2015 as a pilot project to support the center's research activities. The cluster was composed of 36 nodes with a maximum capacity of 220 TB of data (around 650 TB of raw disk space). In the context of the HPC4AI project, financed by the Piemonte Region, the BigData@PoliTO Cluster will be upgraded in 2020.
For support, accounts or technical issues, write to supporto@smartdata.polito.it.
The complete infrastructure will have more than 1,700 CPU cores, 19 TB of RAM and 8 PB of raw storage available for users.
This infrastructure today serves several academic activities: in particular, it is the reference point for a Master's Degree course dedicated to Big Data with more than 300 students, and it is used daily by researchers from various departments.
The cluster architecture has been designed to be horizontally scalable. Scaling out means that, when the current resources become insufficient for the requested tasks, new hardware can be installed alongside the existing one to increase the overall capacity and performance.
2020 BigData@PoliTO Cluster Upgrade
In order to draw up the fundamental requirements of the new architecture, a survey was conducted among all SmartData@PoliTO members. Based on the researchers' feedback, the BigData@PoliTO Cluster will be extended with the following new capabilities:
- 33 storage workers, each equipped with:
- 216 TB of raw disk storage
- 384 GB of RAM
- Two CPUs with 18 cores/36 threads each
- Two 25 GbE network interfaces
- Introduction of 2 GPU nodes (4 GPUs in total) for experimentation
- More than 50 GB/s of data reading and processing speed
BigData@PoliTO Cluster Technical Features
The BigData@PoliTO Cluster consists of 84 servers and switches, organized in four racks across two data centers:
- 33 PowerEdge R740xd (Storage and Processing Workers)
- 2 PowerEdge R740 (GPU Workers)
- 18 PowerEdge R720xd (Storage and Processing Workers)
- 2 PowerEdge R7425 (Storage and Processing Workers)
- 5 PowerEdge R640 (Master Nodes)
- 3 PowerEdge R620 (Management Nodes)
- 1 Synology RAID Storage (Master Backup Node)
- 11 Supermicro 6027R-TRF (Development and Staging Nodes)
- 1 Dell Networking Z9100-ON (100 GbE Switch)
- 2 Dell Networking S3048-ON (1 GbE Switches)
- 4 Dell Networking N3048 (1 GbE Switches)
Rack cooling is based on a closed-architecture solution, an energy-efficient method in which the cooling unit is integrated into the side of the rack without affecting the temperature of the room where the racks are located. With this approach, the air circulates only inside the rack and not within the whole room. This is an advantage for the cluster, since it is easier to maintain the desired temperature and reduce energy waste.
Deployed Software
The BigData@PoliTO Cluster is optimized for processing large quantities of data, with particular attention to high availability and continuity of services. To achieve this, the BigData@PoliTO Cluster is supported by a number of open-source frameworks and solutions.
Hadoop Ecosystem
The Hadoop framework is used to store and maintain data across the BigData@PoliTO Cluster: HDFS (the Hadoop Distributed File System) replicates each piece of data across several machines, making the data resilient and fault-tolerant. The BigData@PoliTO Cluster also supports CEPH, which will soon become the central repository of users' data, concentrating all data in a single distributed file system.
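As a small illustration of how a user might interact with files stored in HDFS from Python, the sketch below lists the contents of a directory; it assumes the pyarrow package, a configured Hadoop client, and a hypothetical path, and is not a prescribed access method:

```python
# Minimal sketch: listing files stored in HDFS with pyarrow.
# Assumes libhdfs is available and HADOOP_CONF_DIR points to the cluster
# configuration; the directory path is hypothetical.
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="default")           # connect via the default namenode from the Hadoop config
selector = fs.FileSelector("/user/example", recursive=False)
for entry in hdfs.get_file_info(selector):
    print(entry.path, entry.size)                    # each file is transparently replicated by HDFS
```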
Hadoop provides an implementation of the MapReduce programming paradigm, which broadly consists of processing in parallel the small pieces into which a huge volume of data is split, bringing the code as close as possible to where the data resides in HDFS.
One of the main drawbacks of MapReduce is that it relies heavily on disk operations (each chunk of data is read from and written to disk), resulting in slow computation times. The solution is to move to an in-memory processing approach, such as the one implemented by Spark. This approach strongly enhances performance, with improvements of up to 100 times in computation speed.
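As a hedged sketch of the paradigm, the classic word count below expresses the computation as map and reduce steps in PySpark, which keeps intermediate results in memory; the application name and input path are placeholders, not the cluster's actual configuration:

```python
# Word-count sketch in PySpark: map each word to (word, 1), then reduce by key.
# The HDFS input path is a placeholder.
from pyspark import SparkContext

sc = SparkContext(appName="wordcount-example")
counts = (
    sc.textFile("hdfs:///user/example/input.txt")   # read lines from HDFS
      .flatMap(lambda line: line.split())           # map: split each line into words
      .map(lambda word: (word, 1))                  # map: emit (word, 1) pairs
      .reduceByKey(lambda a, b: a + b)              # reduce: sum the counts per word
)
print(counts.take(10))                              # triggers the distributed computation
sc.stop()
```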
JupyterLab
JupyterLab provides Jupyter Notebooks to multiple users. Users spawn notebook servers that are hosted on the worker machines of the BigData@PoliTO Cluster and, when spawning a server, request specific amounts of memory and CPU cores for their notebooks. The distribution of resources in the BigData@PoliTO Cluster is managed with Kubernetes.
Jupyter Notebooks are interactive computational environments accessed from web browsers. Several programming languages and environments are supported in the BigData@PoliTO Cluster, including Python, Java, R and Octave. Jupyter Notebooks can also interface with Hadoop services, for example, firing Spark jobs directly from users’ notebooks to interact with TBs of data stored in HDFS and/or CEPH distributed file systems.
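For instance, a notebook cell could start a Spark session and query a dataset stored in HDFS, as in the sketch below; the executor sizing, file path and column name are illustrative assumptions, not the cluster's defaults:

```python
# Sketch of firing a Spark job from a Jupyter Notebook.
# Executor sizing, the HDFS path and the column name are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("notebook-example")
    .config("spark.executor.memory", "4g")      # assumed per-executor memory request
    .config("spark.executor.cores", "2")        # assumed cores per executor
    .getOrCreate()
)

df = spark.read.csv("hdfs:///user/example/data.csv", header=True, inferSchema=True)
df.groupBy("some_column").count().show()        # hypothetical aggregation
spark.stop()
```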
All popular Python packages are supported by the BigData@PoliTO Cluster, and the distribution of packages and virtual environments is managed with Anaconda.
Detailed Node Characteristics
33 PowerEdge R740xd (Storage and Processing Workers)
Processor type | 2x Intel Xeon Gold 6140 CPU @ 2.30 GHz (18 cores per socket) |
RAM | 384 GB DDR-4 Dual Rank 2666 MHz |
Disk space | 18x HD 12 TB, SATA 6 Gb/s, 3.5”, 7,200 rpm + 2x SSD 240 GB, SATA 6 Gb/s, M.2 |
Network interface | 2x Mellanox ConnectX-4 LX 25 GbE SFP + 1x GbE management port |
2 PowerEdge R740 (GPU Workers)
Processor type | 2x Intel Xeon Gold 6130 CPU @ 2.10 GHz (16 cores per socket) |
GPU Accelerator | 2x NVIDIA Tesla V100 PCIe 16 GB |
RAM | 384 GB DDR-4 Dual Rank 2666 MHz |
Disk space | 8x HD 1 TB, SAS 12 Gb/s, 2.5”, 7,200 rpm + 2x SSD 480 GB, SATA 6 Gb/s, 2.5” |
Network interface | 2x Mellanox ConnectX-4 LX 25 GbE SFP + 1x GbE management port |
18 PowerEdge R720xd (Storage and Processing Workers)
Processor type | 2x Intel Xeon CPU E5-2620 v2 @ 2.10 GHz (6 cores per socket) |
RAM | 96 GB DDR3 ECC 1600 MHz |
Disk space | 12x HD 3 TB, SATA 3 Gb/s, 3.5”, 7,200 rpm |
Network interface | 5x GbE Ports – 4 for production traffic + 1 for management |
2 PowerEdge R7425 (Storage and Processing Workers)
Processor type | 2x AMD EPYC 7301 (16 cores per socket) |
RAM | 512 GB DDR-4 Dual Rank 2400 MHz |
Disk space | 2x HD 600 GB, SAS 12 Gb/s, 2.5”, 10,000 rpm + 2x HD 1.8 TB, SAS 12 Gb/s, 2.5”, 10,000 rpm |
Network interface | 5x GbE Ports – 4 for production traffic + 1 for management |
5 PowerEdge R640 (Master Nodes)
Processor type | 2x Intel Xeon Gold 6134 CPU @ 3.20 GHz (8 cores per socket) |
RAM | 384 GB DDR-4 Dual Rank 2666 MHz |
Disk space | 8x HD 1 TB, SAS 12 Gb/s, 2.5”, 7,200 rpm + 2x SSD 480 GB, SATA 6 Gb/s, 2.5” + 2x SSD 240 GB, SATA 6 Gb/s, M.2 |
Network interface | 2x Mellanox ConnectX-4 LX 25 GbE SFP + 1x GbE management port |
3 PowerEdge R620 (Management Nodes)
Processor type | 2x Intel Xeon CPU E5-2630 v2 @ 2.60 GHz (6 cores per socket) |
RAM | 128 GB DDR3 ECC 1600 MHz |
Disk space | 3x HD 600 GB, SAS 6 Gb/s, 2.5”, 10,000 rpm |
Network interface | 5x GbE Ports – 4 for production traffic + 1 for management |
1 Synology RS3618xs NAS (Master Backup Node)
Processor type | 1x Intel Xeon CPU D-1521 @ 2.40 GHz (4 cores per socket) |
RAM | 16 GB |
Disk space | 12x HD 10 TB, SATA 6 Gb/s, 3.5”, 7,200 rpm |
Network interface | 5x GbE Ports – 4 for production traffic + 1 for management; 2x 10 GbE SFP Ports |
11 Supermicro 6027R-TRF (Development and Staging Nodes)
Processor type | 1x Intel Xeon CPU E5-2640 @ 2.50 GHz (6 cores per socket) |
RAM | 32 GB DDR3 ECC 1600 MHz |
Disk space | 5x HD 2 TB, SATA, 3.5”, 7,200 rpm |
Network interface | 3x GbE Ports – 2 for production traffic + 1 for management |