Proposal
Proposal detailing what this project is about and its extent
Introduction
Computer storage requirements have grown, especially for research applications and internet application service providers. As the price of hardware, and of storage in particular, has fallen, it is not uncommon to see computing clusters running on commodity hardware. These range from small networks to very large distributed computing clusters (see GFS from Google).
I propose to create a distributed file system in user space for Linux. This file system will offer increased storage capacity by distributing data across several data nodes, with its operations supported by metadata servers. It will also be more tolerant of failure, since each file will have one or more replicas on different nodes.
Description
The file system will have a communication layer over TCP/IP which will allow any machine to enter the network, announce itself as a storage machine to the metadata server and serve both clients and other storage machines (for content replicas).
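As a rough illustration of the announce step, the sketch below frames each message as a 4-byte length prefix followed by a JSON payload. The wire format, the "ANNOUNCE" message type and the field names (node_id, host, port, capacity_bytes) are assumptions made for this example only; the proposal does not fix a protocol.

```python
import json
import socket
import struct


def encode_message(msg: dict) -> bytes:
    """Serialize a message as a 4-byte big-endian length plus a JSON payload."""
    payload = json.dumps(msg).encode("utf-8")
    return struct.pack("!I", len(payload)) + payload


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, since TCP may deliver a message in pieces."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        buf += chunk
    return buf


def recv_message(sock: socket.socket) -> dict:
    """Read one length-prefixed JSON message from the socket."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return json.loads(recv_exact(sock, length).decode("utf-8"))


def announce(sock: socket.socket, node_id: str, host: str, port: int,
             capacity_bytes: int) -> None:
    """Tell the metadata server that this machine offers storage.

    The message layout here is hypothetical, chosen only for illustration.
    """
    sock.sendall(encode_message({
        "type": "ANNOUNCE",
        "node_id": node_id,
        "host": host,
        "port": port,
        "capacity_bytes": capacity_bytes,
    }))
```

A storage machine would open a TCP connection to the metadata server and call announce once; the metadata server would call recv_message on the accepted connection. The length prefix is one simple way to mark message boundaries on a byte stream.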
The metadata server will index file pathnames and map them to object IDs stored on one or more storage servers. Clients can request this information from the metadata server and then contact the storage nodes directly for read/write operations.
Metadata servers will also be responsible for locking objects during write operations and for tracking version numbers to avoid inconsistencies in the file system. They will also monitor heartbeats from the storage servers and adapt when one or more of them disconnect, or when new storage servers announce themselves as part of the network.
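The mapping described above could be sketched as two in-memory dictionaries, one from pathname to object ID and one from object ID to the storage nodes holding it. The class name MetadataIndex, its method names and the use of random UUIDs as object IDs are assumptions for illustration; the actual index is planned to live in Berkeley DB.

```python
import uuid


class MetadataIndex:
    """Toy in-memory index: pathname -> object ID -> storage nodes."""

    def __init__(self):
        self._paths = {}      # pathname -> object ID
        self._locations = {}  # object ID -> set of storage node addresses

    def create(self, path, nodes):
        """Register a new file and the storage nodes that hold its object."""
        if path in self._paths:
            raise FileExistsError(path)
        object_id = uuid.uuid4().hex  # assumed ID scheme, not from the proposal
        self._paths[path] = object_id
        self._locations[object_id] = set(nodes)
        return object_id

    def lookup(self, path):
        """Return (object_id, nodes) so a client can contact the nodes directly."""
        object_id = self._paths[path]  # raises KeyError for an unknown path
        return object_id, sorted(self._locations[object_id])

    def add_replica(self, object_id, node):
        """Record that another storage node now holds a copy of the object."""
        self._locations[object_id].add(node)
```

A client would call lookup("/home/user/file.txt"), receive the object ID and node list, and then read or write the object on one of those nodes without involving the metadata server further.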
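The three responsibilities above (write locks, version numbers and heartbeat monitoring) can be sketched in one small class. Everything here is an assumption for illustration: the class name WriteCoordinator, the 10 second default timeout, and the policy of bumping the version only when a write is committed. Lock expiry for crashed clients is deliberately left out.

```python
import time


class WriteCoordinator:
    """Toy bookkeeping for write locks, object versions and node liveness."""

    def __init__(self, heartbeat_timeout=10.0, clock=time.monotonic):
        self._timeout = heartbeat_timeout  # assumed value, not from the proposal
        self._clock = clock                # injectable so tests can fake time
        self._last_seen = {}               # node ID -> time of last heartbeat
        self._lock_owner = {}              # object ID -> client holding the lock
        self._version = {}                 # object ID -> last committed version

    def heartbeat(self, node_id):
        """Record that a storage server reported in just now."""
        self._last_seen[node_id] = self._clock()

    def live_nodes(self):
        """Return the nodes whose last heartbeat is within the timeout."""
        now = self._clock()
        return sorted(node for node, seen in self._last_seen.items()
                      if now - seen <= self._timeout)

    def begin_write(self, object_id, client_id):
        """Try to take the exclusive write lock; False if someone else holds it."""
        if object_id in self._lock_owner:
            return False
        self._lock_owner[object_id] = client_id
        return True

    def commit_write(self, object_id, client_id):
        """Release the lock and return the new version number of the object."""
        if self._lock_owner.get(object_id) != client_id:
            raise PermissionError("client does not hold the write lock")
        del self._lock_owner[object_id]
        self._version[object_id] = self._version.get(object_id, 0) + 1
        return self._version[object_id]
```

Replicas can compare the version number they hold against the committed one to detect that they are stale, and live_nodes gives the metadata server the candidate set for placing new replicas.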
Scope
A high-performance implementation would ideally be written in C as a kernel module. Because of the project's time constraints, however, development will be done in Python, which allows faster development, and the file system will run in user space. The files will be stored in the underlying file system (ext3 or ReiserFS).
There will be one metadata server, several storage servers, and several client machines.
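One possible way for a storage server to keep objects in the underlying file system is to write each object as an ordinary file named after its object ID. The sketch below is an assumption, not a decided design: the class name ObjectStore, the two-character directory fan-out, and the write-to-temporary-file-then-rename step are all illustrative choices.

```python
import os
import tempfile


class ObjectStore:
    """Toy object store backed by plain files under a root directory."""

    def __init__(self, root):
        self.root = root

    def _path(self, object_id):
        # Object IDs are assumed to be plain hex strings with no path separators.
        # The first two characters pick a subdirectory so no directory gets huge.
        return os.path.join(self.root, object_id[:2], object_id)

    def put(self, object_id, data):
        """Write the object so readers never observe a half-written file."""
        path = self._path(object_id)
        directory = os.path.dirname(path)
        os.makedirs(directory, exist_ok=True)
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(data)
        os.replace(tmp_path, path)  # rename over any previous copy of the object

    def get(self, object_id):
        """Return the object's bytes; raises FileNotFoundError if absent."""
        with open(self._path(object_id), "rb") as f:
            return f.read()
```

Because the objects are ordinary files, the store works the same way on ext3 or ReiserFS and can be inspected with standard tools.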
Tools
- Operating System: Gentoo Linux
- Programming language: Python
- Supporting DB: Berkeley DB (for the metadata index)