Installing Greenplum Single Node Edition on Amazon's EC2

March 23, 2010

I have been thinking for a while now about adding Greenplum support to an open-source application for web analytics that I wrote a few years ago, which is called htMiner and uses PostgreSQL.

In order to do this, I need a multi-CPU environment. While still waiting to get our new servers installed here in our data centre in Italy, I decided to look at Amazon’s Elastic Compute Cloud (EC2) infrastructure. My intention is to do some benchmarking and identify the main performance differences between Greenplum Single Node Edition and PostgreSQL 8.4, my favourite DBMS.

If you wish to follow this article, you need to have an Amazon AWS account with a valid credit card. Do not worry, this test will only cost you a couple of dollars!

Greenplum Single Node Edition (SNE) is a free version of the Greenplum database, one of the most advanced solutions for data warehousing and analytics. Greenplum is based on a shared-nothing architecture and allows for data distribution and parallel processing across several nodes (servers).

As the name suggests, the Single Node Edition is installed on a single node. On a multi-processor architecture, it lets you create multiple segments (usually one per core) and hence take advantage of parallel processing. Greenplum Single Node Edition can be downloaded for free from the main website.

My intention is to install it on a Large Instance running CentOS Linux 5.4 on Amazon. EC2’s large instance has the following characteristics:

  • 7.5 GB of memory
  • 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
  • 850 GB of local instance storage
  • 64-bit platform

I also decided to get a 10 GB Elastic Block Store (EBS) volume (1 dollar a month), which I will format with the XFS file system. This volume will hold the Greenplum data directories (this time I will try with just a single volume; next time I will try with one volume per segment).

The first step is to log into your Amazon AWS management console. Create your 10 GB EBS volume and then launch a large instance from the ami-ebe4cf9f AMI (AMI stands for Amazon Machine Image), a 64-bit CentOS 5.4 image distributed by RightScale. Your AMI ID may be different, as I use a Europe-based server.

I then attach the created volume to the instance I just started. The management console informs me that the volume has been attached on /dev/sdf. I grab the public DNS information and connect to the server via ssh as root, using my EC2 identity.

I install the packages for XFS support with yum:

yum install kmod-xfs.x86_64 xfsprogs xfsdump

I create a primary partition on /dev/sdf using fdisk and format it:

mkfs -t xfs /dev/sdf1 

I then add the entry to /etc/fstab:

/dev/sdf1 /greenplum xfs noatime 0 0

and mount the partition on the /greenplum mount point:

mkdir /greenplum
mount /greenplum
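
The fstab and mount steps above can be sketched as a small shell helper that is safe to re-run. This is only an illustration: ensure_fstab_entry is a hypothetical function introduced here, not a system utility, and it assumes the /dev/sdf1 device and the /greenplum mount point used above. Its optional argument (the path of the fstab file) exists only so that the helper can be tried against a scratch file first.

```shell
# Hypothetical helper (not a system utility): append the /greenplum entry
# to an fstab file only if that exact line is not already present.
ensure_fstab_entry() {
  fstab_file="${1:-/etc/fstab}"
  entry="/dev/sdf1 /greenplum xfs noatime 0 0"
  grep -qxF "$entry" "$fstab_file" 2>/dev/null || echo "$entry" >> "$fstab_file"
}

# Usage, as root:
#   ensure_fstab_entry      # edits /etc/fstab
#   mkdir -p /greenplum
#   mount /greenplum
#   df -h /greenplum        # confirm the volume is mounted
```

Checking for the exact line before appending avoids duplicate entries if the setup is repeated on the same instance.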

Download Greenplum’s Quickstart guide from the download area. Grab the URL of the 64-bit Red Hat installer of Greenplum and download it directly onto the EC2 server using wget (or upload it from your computer using scp).

Follow the instructions in the Quickstart guide about preparing your system for Greenplum (in particular kernel settings and limits).

Unzip the Greenplum zip file and execute the .bin file it contains. Answer yes to all the questions; at the end of the process, Greenplum is installed in the /usr/local/greenplum-db directory.

Create the gpadmin user and set the password:

useradd gpadmin
passwd gpadmin

Prepare the data directories for the master and the segments:

mkdir -p /greenplum/master
mkdir -p /greenplum/segment1
mkdir -p /greenplum/segment2
chown -R gpadmin:gpadmin /greenplum

Become gpadmin using the su command and add the line source /usr/local/greenplum-db/greenplum_path.sh to gpadmin’s ~/.bashrc file, then load these settings (for example by sourcing ~/.bashrc). Create the ~/single_host_file file with localhost as its only line and launch:

gpssh-exkeys -f ~/single_host_file
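
The preparation that precedes gpssh-exkeys (adding the source line to ~/.bashrc and creating the host file) can be sketched as follows. This is a minimal sketch, not an official procedure: setup_gpadmin_env is a hypothetical helper name, and its optional argument (a home directory) is there only so that the function can be tried in a scratch directory. The paths are the ones used in this article.

```shell
# Hypothetical helper (not part of Greenplum): prepare gpadmin's environment.
# Meant to be run as gpadmin, for example after `su - gpadmin`.
setup_gpadmin_env() {
  home_dir="${1:-$HOME}"
  gp_path_script=/usr/local/greenplum-db/greenplum_path.sh

  # Add the source line to .bashrc only once, so re-running is harmless.
  touch "$home_dir/.bashrc"
  grep -qxF "source $gp_path_script" "$home_dir/.bashrc" \
    || echo "source $gp_path_script" >> "$home_dir/.bashrc"

  # The host file holds one host name per line; a single node needs only localhost.
  echo localhost > "$home_dir/single_host_file"
}

# Usage, as gpadmin:
#   setup_gpadmin_env
#   source ~/.bashrc        # load the Greenplum settings into the current shell
```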

Create the ~/gp_init_config file with the following content:

ARRAY_NAME="Greenplum"
MACHINE_LIST_FILE=/home/gpadmin/single_host_file
SEG_PREFIX=gp
PORT_BASE=50000
declare -a DATA_DIRECTORY=(/greenplum/segment1 /greenplum/segment2)
MASTER_HOSTNAME=localhost
MASTER_DIRECTORY=/greenplum/master
MASTER_PORT=5432
ENCODING=UNICODE
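
If you want to experiment with a different number of segments (for instance one per core on a larger instance), the same file can be generated rather than typed by hand. The sketch below is only an illustration: write_gp_init_config is a hypothetical helper, not a Greenplum utility, and it assumes the /greenplum/segmentN directory naming used above. All other values are copied unchanged from the configuration shown above.

```shell
# Hypothetical helper (not a Greenplum utility): write a gpinitsystem
# configuration file for a given number of segments.
# Arguments: <number of segments> <output file>
write_gp_init_config() {
  num_segments="$1"
  config_file="$2"

  # Build the list /greenplum/segment1 ... /greenplum/segmentN.
  dirs=""
  i=1
  while [ "$i" -le "$num_segments" ]; do
    dirs="$dirs /greenplum/segment$i"
    i=$((i + 1))
  done
  dirs="${dirs# }"   # drop the leading space

  cat > "$config_file" <<EOF
ARRAY_NAME="Greenplum"
MACHINE_LIST_FILE=/home/gpadmin/single_host_file
SEG_PREFIX=gp
PORT_BASE=50000
declare -a DATA_DIRECTORY=($dirs)
MASTER_HOSTNAME=localhost
MASTER_DIRECTORY=/greenplum/master
MASTER_PORT=5432
ENCODING=UNICODE
EOF
}

# Usage, as gpadmin (reproduces the two-segment file shown above):
#   write_gp_init_config 2 ~/gp_init_config
```

Each directory listed in DATA_DIRECTORY must already exist and be owned by gpadmin, as in the mkdir and chown step earlier.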

Finally, launch:

gpinitsystem -c ~/gp_init_config

At the end of the process, Greenplum SNE is installed on your Amazon EC2 server running CentOS 5.4. On this server you can test the solution at quite a reasonable price (I was on the server for 7 hours today and I spent only 3 dollars).

I will post a few more articles on this topic in the next few days, and hopefully I will be able to post the first benchmarks too. Enjoy!
