Cassandra – Getting a 3 node cluster built

Cassandra – Getting off the ground.
PREV: Cassandra – researching a DB for ‘Big Data’

While researching a project on Big Data services, I knew that I’d need a multi-node cluster to experiment with, but a pile of hardware was not immediately available.

Using the VERY helpful book Cassandra High Performance Cookbook I was able to build a 3 node cluster on a single machine. This is how I did it:


For this cluster test example, I am using Ubuntu 10, with the following JVM:
      JVM vendor/version: OpenJDK 64-Bit Server VM/1.6.0_22

Downloaded Cassandra 1.0.8 package from here:

http://apache.mirrors.tds.net//cassandra/1.0.8/apache-cassandra-1.0.8-bin.tar.gz

Created new user on system: bigdata

Create the required base data directories

  $ mkdir commitlog log data saved_caches

Moved that package there and started the build

$ cp /tmp/apache-cassandra-1.0.8-bin.tar.gz .

Unzipped and extracted the contents

$ gunzip apache-cassandra-1.0.8-bin.tar.gz
$ tar xvf apache-cassandra-1.0.8-bin.tar

Renamed the long extracted directory for the first instance, cassA-1.0.8

$ mv apache-cassandra-1.0.8 cassA-1.0.8

Extracted again and renamed this to the other two planned instances:

$ tar xfv apache-cassandra-1.0.8-bin.tar
$ mv apache-cassandra-1.0.8 cassB-1.0.8  

$ tar xfv apache-cassandra-1.0.8-bin.tar
$ mv apache-cassandra-1.0.8 cassC-1.0.8  

This gave me three instances to configure, each with a unique IP:

  cassA-1.0.8   10.1.100.101
  cassB-1.0.8   10.1.100.102
  cassC-1.0.8   10.1.100.103

Edit the configuration files in each instance (cassA-1.0.8 used as the example):

$ vi cassA-1.0.8/conf/cassandra.yaml 

[...]

# directories where Cassandra should store data on disk.
data_file_directories: 
    - /home/bigdata/data/cassA

# commit log
commitlog_directory: /home/bigdata/commitlog/cassA

# saved caches
saved_caches_directory: /home/bigdata/saved_caches/cassA

[...]

# If blank, Cassandra will request a token bisecting the range of
# the heaviest-loaded existing node.  If there is no load information
# available, such as is the case with a new cluster, it will pick
# a random token, which will lead to hot spots.
initial_token: 0

[...]

# Setting this to 0.0.0.0 is always wrong.
listen_address: 10.1.100.101

[...]

rpc_address: 10.1.100.101

[...]

          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "10.1.100.101,10.1.100.102,10.1.100.103"
[...]

Setting a separate log file for each instance is recommended. Edit the log4j config to set it:

vi cassA-1.0.8/conf/log4j-server.properties

[...]
log4j.appender.R.File=/home/bigdata/log/cassA.log
[...]

Repeat for instances cassB and cassC, setting the token value for B and C to appropriate values (see Extra Credit below if you need to know how to do *that* part):

#cassB
initial_token: 56713727820156410577229101238628035242

#cassC
initial_token: 113427455640312821154458202477256070485

To enable the JMX management console, each instance will require its own port. Edit the env file to set that up.

vi cassA-1.0.8/conf/cassandra-env.sh

[...]
# Specifies the default port over which Cassandra will be available for
# JMX connections.
JMX_PORT="8001"
[...]

Repeated for the other two instances, defining 8002 and 8003 respectively.
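
The per-instance values above follow a simple pattern, so it can help to tabulate them before editing three sets of files by hand. This is only a sketch in Python: the function name and dictionary layout are my own invention, the addresses are the ones from the seeds line, and nothing here writes any Cassandra config.

```python
def instance_settings(names, base_dir="/home/bigdata",
                      ip_prefix="10.1.100.", first_host=101,
                      first_jmx_port=8001):
    """Return the per-instance values that differ between the configs."""
    settings = []
    for i, name in enumerate(names):
        address = f"{ip_prefix}{first_host + i}"
        settings.append({
            "instance": name,
            # cassandra.yaml values
            "data_file_directories": [f"{base_dir}/data/{name}"],
            "commitlog_directory": f"{base_dir}/commitlog/{name}",
            "saved_caches_directory": f"{base_dir}/saved_caches/{name}",
            "listen_address": address,
            "rpc_address": address,
            # log4j-server.properties value
            "log_file": f"{base_dir}/log/{name}.log",
            # cassandra-env.sh value
            "jmx_port": first_jmx_port + i,
        })
    return settings


if __name__ == "__main__":
    for entry in instance_settings(["cassA", "cassB", "cassC"]):
        print(entry["instance"], entry["listen_address"], entry["jmx_port"])
```

Printing the table first makes it easier to spot a copy-and-paste slip, such as two instances sharing a JMX port.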

Now, for the final trick, start up the instances:

  cassA-1.0.8/bin/cassandra
  cassB-1.0.8/bin/cassandra
  cassC-1.0.8/bin/cassandra

Cluster elements started up, and they can be seen active in the process table here:

$ ps -lf
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 S bigdata   4554     1  2  80   0 - 226846 futex_ 12:13 pts/0   00:00:05 java -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=
0 S bigdata   4593     1  2  80   0 - 210824 futex_ 12:13 pts/0   00:00:05 java -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=
0 S bigdata   4632     1  2  80   0 - 226830 futex_ 12:13 pts/0   00:00:05 java -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=
0 R bigdata   5047  3054  0  80   0 -  5483 -      12:16 pts/0    00:00:00 ps -lf

Finally, to check the status, connect to one of the JMX ports and check the ring. You only need to connect to one of the cluster’s nodes to check the complete cluster’s status:

$ bin/nodetool -h 10.1.100.101 -port 8001 ring
Address         DC          Rack        Status State   Load            Owns    Token                                       
                                                                               113427455640312821154458202477256070485     
10.1.100.101    datacenter1 rack1       Up     Normal  21.86 KB        33.33%  0                                           
10.1.100.102    datacenter1 rack1       Up     Normal  20.28 KB        33.33%  56713727820156410577229101238628035242      
10.1.100.103    datacenter1 rack1       Up     Normal  29.1 KB         33.33%  113427455640312821154458202477256070485      
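
If you want to check the ring from a script rather than by eye, the output above is easy enough to parse. This is a rough Python sketch that assumes the column layout shown here (other Cassandra versions may print different columns); the function names are mine, and the sample text is copied from the output above.

```python
import ipaddress

SAMPLE = """\
Address         DC          Rack        Status State   Load            Owns    Token
                                                                               113427455640312821154458202477256070485
10.1.100.101    datacenter1 rack1       Up     Normal  21.86 KB        33.33%  0
10.1.100.102    datacenter1 rack1       Up     Normal  20.28 KB        33.33%  56713727820156410577229101238628035242
10.1.100.103    datacenter1 rack1       Up     Normal  29.1 KB         33.33%  113427455640312821154458202477256070485
"""


def parse_ring(text):
    """Extract one dict per node line; header and token-only lines are skipped."""
    nodes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 8:
            continue
        try:
            ipaddress.ip_address(parts[0])  # node lines start with an IP
        except ValueError:
            continue
        nodes.append({
            "address": parts[0],
            "status": parts[3],
            "state": parts[4],
            "owns": parts[-2],
            "token": int(parts[-1]),
        })
    return nodes


def all_up(nodes):
    """True only if there is at least one node and every node is Up/Normal."""
    return bool(nodes) and all(
        n["status"] == "Up" and n["state"] == "Normal" for n in nodes
    )


if __name__ == "__main__":
    ring = parse_ring(SAMPLE)
    print(len(ring), "nodes, all up:", all_up(ring))
```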

Now, that’s a functional 3 instance cluster running on a single machine. These are not in separate VMs. If you wanted to experiment with a larger cluster by running multiple instances on multiple VMs on a single hypervisor, I don’t really see why you couldn’t!

In the next article, I’m going to start feeding data into the cluster. Stay tuned for that!


Extra Credit:

To create the token values I needed for this three-node ring, I used the following Perl script. BTW, bignum is required unless you want Perl printing these big numbers in scientific notation:

#!/usr/bin/perl
use bignum;
my $nodes = shift;
print "Calculate tokens for $nodes nodes\n";
print "node 0\ttoken: 0\n" unless $nodes;
exit unless $nodes;
my $factor = 2**127;
print "factor = $factor\n";
for (my $i=0;$i<$nodes;$i++) {
	my $token = $i * ( $factor / $nodes);
	print "node $i\ttoken: $token\n";
}

Running the script for three nodes gave me the following results. The token has to be a whole number, so I dropped the fractional part when setting initial_token:

$ ./maketokens.pl  3

Calculate tokens for 3 nodes
factor = 170141183460469231731687303715884105728
node 0	token: 0
node 1	token: 56713727820156410577229101238628035242.67
node 2	token: 113427455640312821154458202477256070485.34
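
For comparison, here is the same calculation as a small Python sketch. This is not the script I used, just an equivalent under the same assumption of a 2**127 token space. Python integers are arbitrary precision, so integer division gives whole-number tokens directly and there is no fractional part to trim.

```python
RING_SIZE = 2 ** 127  # same factor as in the Perl script above


def tokens(node_count):
    """Evenly spaced whole-number tokens for node_count nodes."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Multiply before dividing so each token is the floor of the exact value.
    return [i * RING_SIZE // node_count for i in range(node_count)]


if __name__ == "__main__":
    for i, token in enumerate(tokens(3)):
        print(f"node {i}\ttoken: {token}")
```

For three nodes this yields 0, 56713727820156410577229101238628035242 and 113427455640312821154458202477256070485, matching the initial_token values used for cassA, cassB and cassC.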

NEXT: Java build env to prepare for Cassandra development

Examining an SSL CERT – openssl to the rescue!

Viewing the contents of an SSL CERT

You have that shiny new SSL CERT you purchased online, but how do you know it’s properly tagged and signed?

What if you find a cert on your system and you want to know what it covers, when it expires, who might own it, and so on?

Well, that’s possible using the openssl command line tool.

Running a simple command we’ll examine the SSL Cert. The important info is in the ‘Issuer’ and ‘Subject’ blocks.

  openssl x509 -noout -text -in my.super-awesome.hostname.cert

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            c4:3d:66:b4:e3:cc:61:86
        Signature Algorithm: sha1WithRSAEncryption
        Issuer: C=US, ST=Kellyfornia, L=Sac-of-Tomatoes, O=Crazy Assembly House, OU=Committe on wasting tax payer money, CN=*.super-awesome.net/emailAddress=admin@super-awesome.net
        Validity
            Not Before: Jan  9 17:50:56 2012 GMT
            Not After : Jan  6 17:50:56 2022 GMT
        Subject: C=US, ST=Kellyfornia, L=Sac-of-Tomatoes, O=Crazy Assembly House, OU=Committe on wasting tax payer money, CN=*.super-awesome.net/emailAddress=admin@super-awesome.net
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
            RSA Public Key: (2048 bit)
                Modulus (2048 bit):
[...]  /*  removed the modulus to keep the post short */
               Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Subject Key Identifier: 
                9D:72:0C:A0:E6:EB:77:2C:77:EF:E8:9E:B7:BC:9F:53:81:1A:40:9D
            X509v3 Authority Key Identifier: 
                keyid:9D:72:0C:A0:E6:EB:77:2C:77:EF:E8:9E:B7:BC:9F:53:81:1A:40:9D
                DirName:/C=US/ST=Kellyfornia/L=Sac-of-Tomatoes/O=Crazy Assembly House/OU=Committe on wasting tax payer money/CN=*.super-awesome.net/emailAddress=admin@super-awesome.net
                serial:C4:3D:66:B4:E3:CC:61:86

            X509v3 Basic Constraints: 
                CA:TRUE
    Signature Algorithm: sha1WithRSAEncryption
[...]  /*  removed the signature to keep the post short */

The Subject breaks down as follows:

  Subject: C=US, ST=Kellyfornia, L=Sac-of-Tomatoes, O=Crazy Assembly House, OU=Committe on wasting tax payer money, CN=*.super-awesome.net/emailAddress=admin@super-awesome.net

  C=US  -  Country code 'US'
  ST=Kellyfornia  -  State or Province
  L=Sac-of-Tomatoes  -  City/Location
  O=Crazy Assembly House  -  Company or Organization name
  OU=Committe on wasting tax payer money  -  Organizational Unit (department, etc.)
  CN=*.super-awesome.net  -  Common Name (hostname / domain) that
 the CERT serves.  In this case it's a wildcard, signified by the '*'
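
The same breakdown can be done in a few lines of Python if you have a pile of certs to look through. This is a naive sketch of my own, not anything openssl provides: it assumes the one-line Subject format shown above, and it will break on values that themselves contain a comma followed by a space.

```python
def parse_subject(subject):
    """Split a one-line openssl Subject string into a dict of its fields."""
    fields = {}
    # This output format tacks emailAddress onto the CN with a "/",
    # so normalise that separator before splitting.
    normalised = subject.replace("/emailAddress=", ", emailAddress=")
    for part in normalised.split(", "):
        key, _, value = part.partition("=")
        fields[key.strip()] = value.strip()
    return fields


def is_wildcard(fields):
    """A wildcard cert has a CN that starts with '*.'."""
    return fields.get("CN", "").startswith("*.")


if __name__ == "__main__":
    subject = (
        "C=US, ST=Kellyfornia, L=Sac-of-Tomatoes, O=Crazy Assembly House, "
        "OU=Committe on wasting tax payer money, "
        "CN=*.super-awesome.net/emailAddress=admin@super-awesome.net"
    )
    parsed = parse_subject(subject)
    print(parsed["CN"], "wildcard:", is_wildcard(parsed))
```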

That’s all there is to it. Now, secure those website communications!

Cassandra – researching a DB for ‘Big Data’

Being an old-school OSS’er, MySQL has been my go-to DB for data storage since the turn of the century. It’s great, I love it (mostly), but it does have its drawbacks. The largest is that it’s now owned by Oracle, which does a HORRIBLE JOB of supporting it. I have personal experience with this, as the result of a recent issue with InnoDB and MySQL.

In the meantime, some of the hot-shot up-and-comers in another department have been facing their own data processing challenges (with MySQL and other DBs), and have started to look at some highly scalable alternatives. One of the front-runners right now is Apache’s Cassandra database project.

The synopsis from the page is (as would be most marketing verbiage) very encouraging!

The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra’s support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

This sounds too good to be true. Finally, a solution that we might be able to implement and grow, and one that does not have the incredibly frustrating drawback of InnoDB and MySQL’s fragile replication architecture. (I’ve found out exactly how fragile it is: despite having a cluster of high-speed, specially designed DB servers, the amount of downtime we had was NOT ACCEPTABLE!)

With a charter to handle ever growing amounts of data and the need for ultimate availability and reliability, an alternative to MySQL is almost certainly required.

Of the items discussed on the main page, this one really hits home and stands out to me:

Fault Tolerant

Data is automatically replicated to multiple nodes for fault-tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.

I recently watched a video from the 2011 Cassandra Conference in San Francisco, and a lot of good information was shared. The video is available on the Cassandra home page. I recommend muscling through the marketing BS at the beginning and taking in what they cover. My notes:

Job graph for ‘Big Data’ is skyrocketing.

Demand for Cassandra experts is also skyrocketing.

Big data players are using Cassandra.

It’s a known issue that RDBMSs (e.g. MySQL) have serious limitations (no kidding).

RDBMSs generally have an 8GB cache limit (this is interesting, and would explain some issues we’ve had with scalability in our DB servers, which have 64GB of memory).

The notion that Cassandra does not have good read speed is a fallacy. Version 0.8 read speed is at parity with the already-considered-fast 0.6 write speed. Fast!?

No global or even low-level write locks. The column swap architecture alleviates the need for these locks, which allows high-speed writes.

Quorum reads and writes are consistent across the distribution.

The new LOCAL_QUORUM feature allows quorums to be established from only the local nodes, avoiding the latency of waiting for a quorum that includes remote nodes in other geographic locations.

Cassandra uses XML files for schema modifications. Version 0.7 provides new features to allow on-line schema updates.

CLI for Cassandra is now very powerful.

Has a SQL language capability (yes!).

Latest version provides much easier to implement secondary indexing (indexes other than the primary).

Version 0.8 supports bulk loading. This is very interesting for my current project

There is wide support for Cassandra in both interpreted and compiled OSS languages, including the ones I most frequently use.

CQL: Cassandra Query Language.

Replication architecture is vastly superior to MySQL’s transaction and log replay strategy. Cassandra uses an rsync-style replication where hash comparisons are exchanged to find which parts of the data tree a given replication node (that is responsible for that tree of data) might need updating, then transferring just that data. Not only does this reduce bandwidth, but this implies asynchronous replication! Finally! Now this makes sense to me!!

Hadoop support exists for Cassandra, BUT, it’s not a great fit for Cassandra. Look into Brisk if Hadoop implementation is desired or required.

Division of Real-Time and Analytics nodes.

Nodes can be configured to communicate with each other in an encrypted fashion, but in general inter-node communication across public-private networks should be established using VPN tunnels.
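
The quorum notes above come down to a little arithmetic. A quorum is a majority of the replicas, and a read is guaranteed to touch at least one replica holding the latest write when the number of replicas read plus the number written is greater than the replication factor. A minimal Python sketch of that rule (the function names are mine, not Cassandra’s):

```python
def quorum(replication_factor):
    """Number of replicas that make up a majority."""
    return replication_factor // 2 + 1


def read_sees_latest_write(read_replicas, write_replicas, replication_factor):
    """True when the read set and the write set must share a replica."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    print("quorum for RF=3:", q)
    print("quorum read + quorum write overlap:",
          read_sees_latest_write(q, q, rf))
    print("single read + single write overlap:",
          read_sees_latest_write(1, 1, rf))
```

With a replication factor of 3, a quorum is 2 replicas, so quorum reads paired with quorum writes always overlap, while reading and writing a single replica each does not.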

This needs further research, but it’s very, VERY promising!

NEXT: Cassandra – Getting a 3 node cluster built

Re-setting / Re-Indexing Apple OSX Help Books

While developing applications and the associated help books, sometimes you need to flush the cache to test updates to your application’s Apple Help Book.

I won’t say this is the most effective or correct way to handle the issue, but I will say it’s the one that worked for me.

First order of business is to flush out the help cache. NOTE: If you do this, you’ll need to make sure to restart helpd to create the fresh cache entries.

You will find your user’s cached versions of the files located here:


~/Library/Caches/com.apple.helpd

Located there are the following 3 items (two files and one directory)


-rw-r--r-- 1 mememe staff 6574080 Feb 7 22:13 Cache.db
drwxr-xr-x 56 mememe staff 1904 Feb 2 13:37 Generated
-rw-r--r-- 1 mememe staff 69585 Feb 7 22:15 HelpCache.plist

The application I’m looking to flush the help file for is com.daviddemartini.cidrcalculator.help. It’s highlighted in this partial list here:


[...]
com.apple.Mail.help
com.apple.keychainaccess.help
com.apple.PhotoBooth.help
com.apple.machelp
com.apple.PodcastCapture.help
com.canon.Digital Photo Professional.help
com.apple.PodcastPublisher.help
com.daviddemartini.cidrcalculator.help
com.apple.Preview.help
com.apple.QuickTimePlayerX.help
com.prect.NavicatPremium.help
[...]

I removed that directory and its contents, then removed the two cache files in the uppermost directory (the first one we entered):


cd ~/Library/Caches/com.apple.helpd/Generated
cd com.daviddemartini.cidrcalculator.help
rm English.helpindex
cd ..
rmdir com.daviddemartini.cidrcalculator.help
cd ..
rm Cache.db HelpCache.plist
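
Before deleting anything, it is worth listing exactly what would go. Here is a dry-run sketch in Python that only reports paths and removes nothing. It assumes the layout described above (a Generated directory holding one folder per help book, next to Cache.db and HelpCache.plist), and the function name is my own.

```python
from pathlib import Path


def help_cache_targets(cache_root, bundle_id):
    """List the cache entries for one help book that currently exist."""
    root = Path(cache_root).expanduser()
    candidates = [
        root / "Generated" / bundle_id,  # the generated per-book directory
        root / "Cache.db",
        root / "HelpCache.plist",
    ]
    return [path for path in candidates if path.exists()]


if __name__ == "__main__":
    targets = help_cache_targets(
        "~/Library/Caches/com.apple.helpd",
        "com.daviddemartini.cidrcalculator.help",
    )
    for path in targets:
        print("would remove:", path)
```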

Next I went looking for copies of the help plist files to wipe them out:


cd ~/Library
find . -name 'com.apple.help.plist'

./Containers/com.apple.Preview/Data/Library/Preferences/com.apple.help.plist
./Containers/com.apple.TextEdit/Data/Library/Preferences/com.apple.help.plist
./Preferences/com.apple.help.plist

Based on some vague recollection of the last time I did this, and one Google page that skirted around it (not really calling out this very file), I nuked the preferences help.plist.


rm ./Preferences/com.apple.help.plist

Then I went looking for the start-up process for helpd, and eventually located it by looking in this file:


cd /System/Library/LaunchAgents

cat com.apple.helpd.plist



[...]
<key>Program</key>
<string>/System/Library/PrivateFrameworks/HelpData.framework/Versions/A/Resources/helpd</string>
<key>EnableTransactions</key>

[...]

Here is where it’s at:

/System/Library/PrivateFrameworks/HelpData.framework/Versions/A/Resources/helpd

Since the reason I was doing this was to update my Help Book, I want to be sure I’ve re-indexed it. There are at least two primary ways to do it: the GUI App (/Developer/Applications/Utilities/Help Indexer.app) or the command line tool (hiutil).

I’m going for the command line version. The following command creates a new index file right in my help directory.


hiutil -Caf CidrCalc.help/CidrCalcHelp.helpindex CidrCalc.help

Finally, restart helpd so it creates fresh cache entries and makes the book available:


root# /System/Library/PrivateFrameworks/HelpData.framework/Versions/A/Resources/helpd &

So, those are some crude, ugly, but hopefully effective Apple Help Book reset and rebuild steps.

CIDR Calculator picked up by Softpedia

This just in… (seems like a good thing). I got notice that my new App, CIDR Calculator for the Mac (in the wild for barely 24 hours now), was found by Softpedia and listed on their site.

Congratulations,

CIDR Calculator, one of your products, has been added to Softpedia’s
database of software programs for Mac OS. It is featured with a description
text, screenshots, download links and technical details on this page:
http://mac.softpedia.com/get/Utilities/DeMartini-CIDR-Calculator.shtml

The description text was created by our editors, using sources such as text
from your product’s homepage, information from its help system, the PAD
file (if available) and the editor’s own opinions on the program itself.

Nothing wrong with a little free exposure. No ratings so far, but I hope to get some good feedback. It’s already sold several units so I know someone is out there giving it a test.

If you want to learn more about this entry in Softpedia, follow the link in the notice above.

If you want to check out the App itself, it is at the Apple Mac App Store:

Buy at the Mac App Store

CIDR Calculator for MAC OSX Released!

Now available. The same CIDR tool functionality in my popular iCIDR iPhone application is available on the desktop!

CIDR Calculator is a simple to use utility. Not loaded with a bunch of bells and whistles, CIDR Calculator concentrates on the job at hand.

Requesting IPs using CIDR notation
If you need to request a net block from your ISP, you can determine the right CIDR block size by simply checking the handy ‘IPv4 CIDRs’ quick reference sheet:

IPv4 CIDR Reference Sheet

Understanding a CIDR
If you already have a CIDR assigned for your network, or simply want to understand the IP range of any IPv4 CIDR block, punch the value into the input box. You will receive the starting IP, ending IP, and common network settings. For those who deal in the world of IP blocklists and data warehousing, you also get a base-10 integer representation of the start and end IP addresses.
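
To illustrate the kind of values involved, here is a short Python sketch using the standard ipaddress module. This is not the App’s code, just the same arithmetic, and the function name and dictionary keys are made up for the example.

```python
import ipaddress


def describe_cidr(cidr):
    """Return the range, netmask and integer form of an IPv4 CIDR block."""
    # strict=False accepts input like 192.168.1.77/24 and masks the host bits.
    net = ipaddress.ip_network(cidr, strict=False)
    return {
        "start_ip": str(net.network_address),
        "end_ip": str(net.broadcast_address),
        "netmask": str(net.netmask),
        "addresses": net.num_addresses,
        "start_int": int(net.network_address),
        "end_int": int(net.broadcast_address),
    }


if __name__ == "__main__":
    for key, value in describe_cidr("192.168.1.0/24").items():
        print(f"{key}: {value}")
```

For 192.168.1.0/24 this gives a range of 192.168.1.0 to 192.168.1.255, a netmask of 255.255.255.0, and the base-10 integers 3232235776 and 3232236031 for the start and end addresses.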

CIDR Calculator in Action

Help at your Fingertips
Along with the CIDR Calculator, new to the MAC OSX desktop version is a handy Apple Help Book containing all the information you need to make the most out of CIDR Calculator.

CIDR Calculator Help Book

Additional information is available on the official App page.

App is ON SALE now at the Apple MAC App Store for only 99 cents! This is a limited time offer!

Take the useful CIDR Calculator on the road with the iPhone/iPad version also available now at the iTunes App Store!
iCIDR - David DeMartini