Distributed Processing: Raw Resources (2008/2009)
I’ve decided to share some of my 2008/2009 notes on distributed processing. These documents are just raw material and personal observations from my research at the time. I intend to revisit and organize them into a more presentable format later; in the meantime, this is the first post in a series of about 30 that I plan to release soon.
File Systems for Large Files (TB/PB)
My research started with the search for a file system capable of handling extremely large files (terabytes to petabytes). Below are my raw notes on the various file systems I explored.
File Systems Overview
A file system (or filesystem) is a method for organizing and storing computer files and the data they contain. It essentially acts as a database for storing, manipulating, and retrieving files through the operating system.
Types of File Systems
- File Systems with Built-in Fault Tolerance
These file systems are designed to continue functioning even when a failure occurs, preventing data loss.
- Shared Disk File Systems
File systems that allow multiple computers to access the same disk storage.
- Distributed File Systems
These systems allow access to files stored across multiple machines in a network.
- Distributed Fault-Tolerant File Systems
These are designed to handle failures while ensuring that file access is not interrupted.
- Distributed Parallel File Systems
These systems optimize file access across multiple machines for parallel computing workloads.
- Distributed Parallel Fault-Tolerant File Systems
These combine the benefits of distributed parallelism and fault tolerance for even greater reliability and performance.
Comparison of File Systems
To understand which file system was best suited for large-scale data processing, I had to compare the advantages and limitations of each, especially in terms of scalability, performance, and fault tolerance. Below are a few systems that I considered.
SAN (Storage Area Network)
A Storage Area Network (SAN) is a specialized network architecture that connects remote storage devices (such as disk arrays or tape libraries) to servers so that the devices appear to the operating system as if they were directly attached. While the cost and complexity of SANs are decreasing, they remain more common in larger enterprises.
Difference from NAS
Network Attached Storage (NAS) differs from SAN in that it uses file-based protocols such as NFS or SMB/CIFS, so the operating system knows the storage is remote. With NAS, computers request portions of an abstract file rather than raw disk blocks.
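The block-versus-file distinction can be illustrated with a small Python sketch. It is only an analogy: an ordinary temporary file stands in for the storage in both cases, and the 512-byte block size and the helper names read_block and read_file_range are assumptions made for the illustration, not part of any SAN or NAS protocol.

```python
import os
import tempfile

BLOCK_SIZE = 512  # assumed block size, chosen only for this illustration


def read_block(device_path, block_number):
    """Block-style access: address data by block number, as a SAN client would."""
    with open(device_path, "rb") as dev:
        dev.seek(block_number * BLOCK_SIZE)
        return dev.read(BLOCK_SIZE)


def read_file_range(file_path, offset, length):
    """File-style access: name a file and ask for a byte range, as a NAS client would."""
    with open(file_path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def demo():
    # An ordinary temporary file stands in for the storage in both cases.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE)
        path = tmp.name
    try:
        block = read_block(path, 1)            # the whole second block
        piece = read_file_range(path, 510, 4)  # four bytes spanning both blocks
        return block, piece
    finally:
        os.remove(path)
```

In the block-style call the client has to know how the data is laid out in blocks; in the file-style call it only names a file and a byte range, and the layout is left to the server.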
NFS (Network File System)
The Network File System (NFS) was developed by Sun Microsystems in 1984 and allows clients to access files over a network similarly to local storage. NFS is built on the Open Network Computing Remote Procedure Call (ONC RPC) system and is an open standard, meaning anyone can implement it.
Primarily used on Unix and Linux systems, NFS enables sharing files transparently between servers, desktops, and laptops. It allows users and administrators to mount files from remote systems as though they were local.
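Because an NFS mount looks like any other directory, one way to tell which paths are remote is to consult the mount table. The sketch below parses text in the format of Linux's /proc/mounts (device, mount point, file system type, options) and reports the file system type that serves a given path. The sample table, the host name fileserver, and the function names are made up for the illustration.

```python
import posixpath

# Made-up mount table in /proc/mounts format, for illustration only.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext3 rw 0 0
fileserver:/export/home /home nfs rw,hard,intr 0 0
/dev/sdb1 /data xfs rw,noatime 0 0
"""


def parse_mounts(text):
    """Return (mount_point, fs_type) pairs from /proc/mounts-style text."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            entries.append((fields[1], fields[2]))
    return entries


def fs_type_for_path(path, mounts_text):
    """Find the file system type of the longest mount point that contains path."""
    path = posixpath.normpath(path)
    best_point, best_type = "", None
    for mount_point, fs_type in parse_mounts(mounts_text):
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or (path + "/").startswith(prefix):
            if len(mount_point) > len(best_point):
                best_point, best_type = mount_point, fs_type
    return best_type
```

With the sample table, fs_type_for_path("/home/alice/notes.txt", SAMPLE_MOUNTS) reports nfs, even though the path is opened with the same calls as a local file. On a real Linux machine the text could come from reading /proc/mounts; this sketch does not decode the octal escapes (such as \040 for a space) that the kernel uses in mount points.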
CIFS (Common Internet File System)
CIFS is used by Windows operating systems for file sharing. Based on the Server Message Block (SMB) protocol, it follows the client/server programming model. CIFS is generally considered “chattier” than NFS, meaning it exchanges more messages over the network for comparable operations.
CIFS enables file sharing over a network, where a client requests access to a file, and the server responds. It’s commonly used in NAS environments.
XFS
XFS is a high-performance journaling file system, initially created by Silicon Graphics for their IRIX operating system and later ported to Linux. XFS excels at managing large files and providing smooth data transfers.
CXFS (Clustered XFS) is a distributed networked file system specifically designed for SAN environments. It separates the management of file data and metadata, providing direct access to data via SAN while using a metadata broker for managing metadata and file locks.
A key benefit of CXFS is that file locks are handled centrally by the metadata broker, which avoids many of the lock-coordination problems typically encountered in distributed file systems.
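The split between a central lock authority and a direct data path can be sketched as a toy model in Python. This is not CXFS's actual protocol or API: the MetadataBroker class, the write_shared helper, and the dictionary standing in for shared SAN storage are all hypothetical, chosen only to show the idea of asking one broker for a lock and then writing the data without going through it.

```python
class MetadataBroker:
    """Toy stand-in for a central metadata broker that arbitrates file locks."""

    def __init__(self):
        self._owners = {}  # file path -> client currently holding the lock

    def acquire(self, path, client):
        """Grant the lock if it is free or already held by this client."""
        owner = self._owners.setdefault(path, client)
        return owner == client

    def release(self, path, client):
        """Release the lock only if this client holds it."""
        if self._owners.get(path) == client:
            del self._owners[path]
            return True
        return False


def write_shared(broker, storage, path, client, data):
    """Ask the broker for the lock, then write straight to shared storage."""
    if not broker.acquire(path, client):
        return False  # another client holds the lock
    try:
        storage[path] = data  # the data path bypasses the broker
        return True
    finally:
        broker.release(path, client)
```

Because every client asks the same broker, two clients can never both believe they hold the lock on one file, while the bulk data never passes through the broker.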
CXFS supports heterogeneous environments (including Solaris, Linux, macOS, AIX, and Windows), but requires IRIX or Linux for the metadata broker host.
ExaStore
ExaStore is a distributed file system designed for large-scale storage systems, optimized for environments where speed and scalability are critical.
These notes provide a broad overview of the file systems and storage technologies I explored for handling massive datasets. Each system has its unique strengths, so selecting the right one depends on the specific requirements of your distributed processing tasks.