Thursday 12 September 2013

Using Apache Whirr & Amazon EC2

So once you are done with basic stuff of installing hadoop locally and doing some "Hello World" stuff (wordcount in hadoop parlance), now it is time to put hadoop into some real usage to understand it's value. A good infrastructure is required to do some heavylifting with fairly large amount of data. I did go through some of the cloud providers and figured out that Amazon EC2 works out cheaper. I was also promised to get 50$ AWS credit for participating in RedHat survey.

There are plenty of posts available as to how to setup hadoop on EC2. I followed this one. I also figured out that installing hadoop into cloud is fairly cumbersome - if you are doing it by hand. Apache Whirr an installer tool for cloud services could help get things going pretty fast.

So using Apache whirr and Amazon EC2, I was able to setup my first hadoop cluster and was able to run some word count map-reds.

Saturday 29 June 2013

Quick hands-on with Hadoop for Java Professionals

Apache Hadoop is a suite of components that are aimed at handling large scale data.The traditional n-tier architecture with RDBMS based data store doesnt scale as the data volume exceeds and in the range of peta bytes and this is the space where Hadoop comes into play. Hadoop is not a replacement for traditional architecture, but complements it in the high volume data space.

This series of post is aimed at giving a head start for Java professionals through some practical hands on exercises.

1. Download cygwin. http://cygwin.com/setup.exe
Make sure that you install the optional packages - SSH, Open SSH, python > 2.6

2. JDK - Install a java distribution which is > 1.6.

3. Hadoop core
hadoop.apache.org

4. Follow the instructions here to complete the rest of the setup
http://hadoop.apache.org/docs/stable/single_node_setup.html

5. You could try some examples that are inbuilt in the hadoop package. The equivalent of Hello world in Hadoop is the word count problem where the search for a key word is performed in given data set.

In next post, we will see more practical use case with Hadoop.

Common Issues faced while setting up Hadoop
Install all packages in folder without spaces. For e.g. if you have your JDK in c:/program files/java then you may face issues with the path.

Saturday 7 July 2012

Item 78: Consider serialization proxies instead of serialized instances


As mentioned in Item 74 and discussed throughout this chapter, the decision to implement Serializable increases the likelihood of bugs and security problems, because it causes instances to be created using an extralinguistic mechanism in place of ordinary constructors. There is, however, a technique that greatly reduces these risks. This technique is known as the serialization proxy pattern.

The serialization proxy pattern is reasonably straightforward. First, design a private static nested class of the serializable class that concisely represents the logical state of an instance of the enclosing class. This nested class, known as the serialization proxy, should have a single constructor, whose parameter type is the enclosing class. This constructor merely copies the data from its argument: it need not do any consistency checking or defensive copying. By design, the default serialized form of the serialization proxy is the perfect serialized form of the enclosing class. Both the enclosing class and its serialization proxy must be declared to implement Serializable.

For example, consider the immutable Period class written in Item 39 and made serializable in Item 76. Here is a serialization proxy for this class. Period is so simple that its serialization proxy has exactly the same fields as the class:

// Serialization proxy for Period class
private static class SerializationProxy implements Serializable {
private final Date start;
private final Date end;
SerializationProxy(Period p) {
this.start = p.start;
this.end = p.end;
}
private static final long serialVersionUID =
234098243823485285L; // Any number will do (Item 75)
}

Next, add the following writeReplace method to the enclosing class. This method can be copied verbatim into any class with a serialization proxy:

// writeReplace method for the serialization proxy pattern
private Object writeReplace() {
return new SerializationProxy(this);
}


The presence of this method causes the serialization system to emit a SerializationProxy instance instead of an instance of the enclosing class. In other words, the writeReplace method translates an instance of the enclosing class to its serialization proxy prior to serialization.

With this writeReplace method in place, the serialization system will never generate a serialized instance of the enclosing class, but an attacker might fabricate one in an attempt to violate the class’s invariants. To guarantee that such an attack would fail, merely add this readObject method to the enclosing class:

// readObject method for the serialization proxy pattern
private void readObject(ObjectInputStream stream)
throws InvalidObjectException {
throw new InvalidObjectException("Proxy required");
}

Finally, provide a readResolve method on the SerializationProxy class that returns a logically equivalent instance of the enclosing class. The presence of this method causes the serialization system to translate the serialization proxy back into an instance of the enclosing class upon deserialization.

Here is the readResolve method for Period.SerializationProxy above:

// readResolve method for Period.SerializationProxy
private Object readResolve() {
return new Period(start, end); // Uses public constructor
}

There is another way in which the serialization proxy pattern is more powerful than defensive copying. The serialization proxy pattern allows the deserialized instance to have a different class from the originally serialized instance. You might not think that this would be useful in practice, but it is.

Consider the case of EnumSet (Item 32). This class has no public constructors, only static factories. From the client’s perspective, they return EnumSet instances, but in fact, they return one of two subclasses, depending on the size of the underlying enum type (Item 1). If the underlying enum type has sixty-four or fewer elements, the static factories return a RegularEnumSet; otherwise, they return a JumboEnumSet. Now consider what happens if you serialize an enum set whose enum type has sixty elements, then add five more elements to the enum type, and then deserialize the enum set. It was a RegularEnumSet instance when it was serialized, but it had better be a JumboEnumSet instance once it is deserialized. In fact that’s exactly what happens, because EnumSet uses the serialization proxy pattern. In case you’re curious, here is EnumSet’s serialization proxy. It really is this simple:

// EnumSet's serialization proxy
private static class SerializationProxy <E extends Enum<E>>
implements Serializable {
// The element type of this enum set.
private final Class<E> elementType;
// The elements contained in this enum set.
private final Enum[] elements;
SerializationProxy(EnumSet<E> set) {
elementType = set.elementType;
elements = set.toArray(EMPTY_ENUM_ARRAY); // (Item 43)
}
private Object readResolve() {
EnumSet<E> result = EnumSet.noneOf(elementType);
for (Enum e : elements)
result.add((E)e);
return result;
}
private static final long serialVersionUID =
362491234563181265L;
}

The serialization proxy pattern has two limitations. It is not compatible with classes that are extendable by their clients (Item 17). Also, it is not compatible with some classes whose object graphs contain circularities: if you attempt to invoke a method on an object from within its serialization proxy’s readResolve method, you’ll get a ClassCastException, as you don’t have the object yet, only its serialization proxy.

Finally, the added power and safety of the serialization proxy pattern are not free. On my machine, it is 14 percent more expensive to serialize and deserialize Period instances with serialization proxies than it is with defensive copying.

In summary, consider the serialization proxy pattern whenever you find yourself having to write a readObject or writeObject method on a class that is not extendable by its clients. This pattern is perhaps the easiest way to robustly serialize objects with nontrivial invariants.


Reference: Effective Java 2nd Edition by Joshua Bloch

Item 77: For instance control, prefer enum types to readResolve


Item 3 describes the Singleton pattern and gives the following example of a singleton class. This class restricts access to its constructor to ensure that only a single instance is ever created:

public class Elvis {
public static final Elvis INSTANCE = new Elvis();
private Elvis() { ... }
public void leaveTheBuilding() { ... }
}

As noted in Item 3, this class would no longer be a singleton if the words “implements Serializable” were added to its declaration. It doesn’t matter whether the class uses the default serialized form or a custom serialized form (Item 75), nor does it matter whether the class provides an explicit readObject method (Item 76). Any readObject method, whether explicit or default, returns a newly created instance, which will not be the same instance that was created at class initialization time.

The readResolve feature allows you to substitute another instance for the one created by readObject [Serialization, 3.7]. If the class of an object being deserialized defines a readResolve method with the proper declaration, this method is invoked on the newly created object after it is deserialized. The object reference returned by this method is then returned in place of the newly created object. In most uses of this feature, no reference to the newly created object is retained, so it immediately becomes eligible for garbage collection.

If the Elvis class is made to implement Serializable, the following readResolve method suffices to guarantee the singleton property:

// readResolve for instance control - you can do better!
private Object readResolve() {
// Return the one true Elvis and let the garbage collector
// take care of the Elvis impersonator.
return INSTANCE;
}

This method ignores the deserialized object, returning the distinguished Elvis instance that was created when the class was initialized. Therefore, the serialized form of an Elvis instance need not contain any real data; all instance fields should be declared transient. In fact, if you depend on readResolve for instance control, all instance fields with object reference types must be declared transient. Otherwise, it is possible for a determined attacker to secure a reference to the deserialized object before its readResolve method is run, using a technique that is vaguely similar to the MutablePeriod attack in Item 76.

The attack is a bit complicated, but the underlying idea is simple. If a singleton contains a nontransient object reference field, the contents of this field will be deserialized before the singleton’s readResolve method is run. This allows a carefully crafted stream to “steal” a reference to the originally deserialized singleton at the time the contents of the object reference field are deserialized.

To make this concrete, consider the following broken singleton:

// Broken singleton - has nontransient object reference field!
public class Elvis implements Serializable {
public static final Elvis INSTANCE = new Elvis();
private Elvis() { }
private String[] favoriteSongs =
{ "Hound Dog", "Heartbreak Hotel" };
public void printFavorites() {
System.out.println(Arrays.toString(favoriteSongs));
}
private Object readResolve() throws ObjectStreamException {
return INSTANCE;
}
}

You could fix the problem by declaring the favorites field transient, but you’re better off fixing it by making Elvis a single-element enum type (Item 3). Historically, the readResolve method was used for all serializable instance-controlled classes. As of release 1.5, this is no longer the best way to maintain instance control in a serializable class. As demonstrated by the ElvisStealer attack, this technique is fragile and demands great care.

If instead you write your serializable instance-controlled class as an enum, you get an ironclad guarantee that there can be no instances besides the declared constants. The JVM makes this guarantee, and you can depend on it. It requires no special care on your part. Here’s how our Elvis example looks as an enum:

// Enum singleton - the preferred approach
public enum Elvis {
INSTANCE;
private String[] favoriteSongs =
{ "Hound Dog", "Heartbreak Hotel" };
public void printFavorites() {
System.out.println(Arrays.toString(favoriteSongs));
}
}

The use of readResolve for instance control is not obsolete. If you have to write a serializable instance-controlled class whose instances are not known at compile time, you will not be able to represent the class as an enum type. The accessibility of readResolve is significant. If you place a readResolve method on a final class, it should be private. If you place a readResolve method on a nonfinal class, you must carefully consider its accessibility. If it is private, it will not apply to any subclasses. If it is package-private, it will apply only to subclasses in the same package. If it is protected or public, it will apply to all subclasses that do not override it. If a readResolve method is protected or public and a subclass does not override it, deserializing a serialized subclass instance will produce a superclass instance, which is likely to cause a ClassCastException.

To summarize, you should use enum types to enforce instance control invariants wherever possible. If this is not possible and you need a class to be both serializable and instance-controlled, you must provide a readResolve method and ensure that all of the class’s instance fields are either primitive or transient.


Reference: Effective Java 2nd Edition by Joshua Bloch

Item 76: Write readObject methods defensively


Item 39 contains an immutable date-range class containing mutable private Date fields. The class goes to great lengths to preserve its invariants and its immutability by defensively copying Date objects in its constructor and accessors. Here is the class:

// Immutable class that uses defensive copying
public final class Period {
private final Date start;
private final Date end;
/**
* @param start the beginning of the period
* @param end the end of the period; must not precede start
* @throws IllegalArgumentException if start is after end
* @throws NullPointerException if start or end is null
*/
public Period(Date start, Date end) {
this.start = new Date(start.getTime());
this.end = new Date(end.getTime());
if (this.start.compareTo(this.end) > 0)
throw new IllegalArgumentException(
start + " after " + end);
}
public Date start () { return new Date(start.getTime()); }
public Date end () { return new Date(end.getTime()); }
public String toString() { return start + " - " + end; }
... // Remainder omitted
}

Suppose you decide that you want this class to be serializable. Because the physical representation of a Period object exactly mirrors its logical data content, it is not unreasonable to use the default serialized form (Item 75). Therefore, it might seem that all you have to do to make the class serializable is to add the words “implements Serializable” to the class declaration. If you did so, however, the class would no longer guarantee its critical invariants.

The problem is that the readObject method is effectively another public constructor, and it demands all of the same care as any other constructor. Just as a constructor must check its arguments for validity (Item 38) and make defensive copies of parameters where appropriate (Item 39), so must a readObject method. If a readObject method fails to do either of these things, it is a relatively simple matter for an attacker to violate the class’s invariants.

Loosely speaking, readObject is a constructor that takes a byte stream as its sole parameter. In normal use, the byte stream is generated by serializing a normally constructed instance. The problem arises when readObject is presented with a byte stream that is artificially constructed to generate an object that violates the invariants of its class. To fix this problem, provide a readObject method for Period that calls defaultReadObject and then checks the validity of the deserialized object. If the validity check fails, the readObject method throws an InvalidObjectException, preventing the deserialization from completing:

// readObject method with validity checking
private void readObject(ObjectInputStream s)
throws IOException, ClassNotFoundException {
s.defaultReadObject();
// Check that our invariants are satisfied
if (start.compareTo(end) > 0)
throw new InvalidObjectException(start +" after "+ end);
}

While this fix prevents an attacker from creating an invalid Period instance, there is a more subtle problem still lurking. It is possible to create a mutable Period instance by fabricating a byte stream that begins with a valid Period instance and then appends extra references to the private Date fields internal to the Period instance.

The source of the problem is that Period’s readObject method is not doing enough defensive copying. When an object is deserialized, it is critical to defensively copy any field containing an object reference that a client must not possess. Therefore, every serializable immutable class containing private mutable components must defensively copy these components in its readObject method. The following readObject method suffices to ensure Period’s invariants and to maintain its immutability:

// readObject method with defensive copying and validity checking
private void readObject(ObjectInputStream s)
throws IOException, ClassNotFoundException {
s.defaultReadObject();
// Defensively copy our mutable components
start = new Date(start.getTime());
end = new Date(end.getTime());
// Check that our invariants are satisfied
if (start.compareTo(end) > 0)
throw new InvalidObjectException(start +" after "+ end);
}

Note that the defensive copy is performed prior to the validity check and that we did not use Date’s clone method to perform the defensive copy. Both of these details are required to protect Period against attack (Item 39). Note also that defensive copying is not possible for final fields. To use the readObject method, we must make the start and end fields nonfinal.

Here is a simple litmus test for deciding whether the default readObject method is acceptable for a class: would you feel comfortable adding a public constructor that took as parameters the values for each nontransient field in the object and stored the values in the fields with no validation whatsoever? If not, you must provide a readObject method, and it must perform all the validity checking and defensive copying that would be required of a constructor. Alternatively, you can use the serialization proxy pattern (Item 78).

There is one other similarity between readObject methods and constructors, concerning nonfinal serializable classes. A readObject method must not invoke an overridable method, directly or indirectly (Item 17). If this rule is violated and the method is overridden, the overriding method will run before the subclass’s state has been deserialized. A program failure is likely to result.

To summarize, anytime you write a readObject method, adopt the mind-set that you are writing a public constructor that must produce a valid instance regardless of what byte stream it is given. Do not assume that the byte stream represents an actual serialized instance. While the examples in this item concern a class that uses the default serialized form, all of the issues that were raised apply equally to classes with custom serialized forms. Here, in summary form, are the guidelines for writing a bulletproof readObject method:

• For classes with object reference fields that must remain private, defensively copy each object in such a field. Mutable components of immutable classes fall into this category.

• Check any invariants and throw an InvalidObjectException if a check fails. The checks should follow any defensive copying.

• If an entire object graph must be validated after it is deserialized, use the ObjectInputValidation interface [JavaSE6, Serialization].

• Do not invoke any overridable methods in the class, directly or indirectly.


Reference: Effective Java 2nd Edition by Joshua Bloch

Item 75: Consider using a custom serialized form


Do not accept the default serialized form without first considering whether it is appropriate. Accepting the default serialized form should be a conscious decision that this encoding is reasonable from the standpoint of flexibility, performance, and correctness. Generally speaking, you should accept the default serialized form only if it is largely identical to the encoding that you would choose if you were designing a custom serialized form.

The default serialized form is likely to be appropriate if an object’s physical representation is identical to its logical content. For example, the default serialized form would be reasonable for the following class, which simplistically represents a person’s name:

// Good candidate for default serialized form
public class Name implements Serializable {
/**
* Last name. Must be non-null.
* @serial
*/
private final String lastName;
/**
* First name. Must be non-null.
* @serial
*/
private final String firstName;
/**
* Middle name, or null if there is none.
* @serial
*/
private final String middleName;
... // Remainder omitted
}

Logically speaking, a name consists of three strings that represent a last name, a first name, and a middle name. The instance fields in Name precisely mirror this logical content.

Even if you decide that the default serialized form is appropriate, you often must provide a readObject method to ensure invariants and security. In the case of Name, the readObject method must ensure that lastName and first- Name are non-null. This issue is discussed at length in Items 76 and 78.

Note that there are documentation comments on the lastName, firstName, and middleName fields, even though they are private. That is because these private fields define a public API, which is the serialized form of the class, and this public API must be documented. The presence of the @serial tag tells the Javadoc utility to place this documentation on a special page that documents serialized forms.

Near the opposite end of the spectrum from Name, consider the following class, which represents a list of strings (ignoring for the moment that you’d be better off using one of the standard List implementations):

// Awful candidate for default serialized form
public final class StringList implements Serializable {
private int size = 0;
private Entry head = null;
private static class Entry implements Serializable {
String data;
Entry next;
Entry previous;
}
... // Remainder omitted
}

Logically speaking, this class represents a sequence of strings. Physically, it represents the sequence as a doubly linked list. If you accept the default serialized form, the serialized form will painstakingly mirror every entry in the linked list and all the links between the entries, in both directions.

Using the default serialized form when an object’s physical representation differs substantially from its logical data content has four disadvantages:

It permanently ties the exported API to the current internal representation. In the above example, the private StringList.Entry class becomes part of the public API. If the representation is changed in a future release, the StringList class will still need to accept the linked list representation on input and generate it on output. The class will never be rid of all the code dealing with linked list entries, even if it doesn’t use them anymore.

It can consume excessive space. In the above example, the serialized form unnecessarily represents each entry in the linked list and all the links. These entries and links are mere implementation details, not worthy of inclusion in the serialized form. Because the serialized form is excessively large, writing it to disk or sending it across the network will be excessively slow.

It can consume excessive time. The serialization logic has no knowledge of the topology of the object graph, so it must go through an expensive graph traversal. In the example above, it would be sufficient simply to follow the next references.

It can cause stack overflows. The default serialization procedure performs a recursive traversal of the object graph, which can cause stack overflows even for moderately sized object graphs. Serializing a StringList instance with 1,258 elements causes the stack to overflow on my machine. The number of elements required to cause this problem may vary depending on the JVM implementation and command line flags; some implementations may not have this problem at all.

A reasonable serialized form for StringList is simply the number of strings in the list, followed by the strings themselves. This constitutes the logical data represented by a StringList, stripped of the details of its physical representation. Here is a revised version of StringList containing writeObject and readObject methods implementing this serialized form. As a reminder, the transient modifier indicates that an instance field is to be omitted from a class’s default serialized form:

// StringList with a reasonable custom serialized form
public final class StringList implements Serializable {
private transient int size = 0;
private transient Entry head = null;
// No longer Serializable!
private static class Entry {
String data;
Entry next;
Entry previous;
}
// Appends the specified string to the list
public final void add(String s) { ... }
/**
* Serialize this {@code StringList} instance.
*
* @serialData The size of the list (the number of strings
* it contains) is emitted ({@code int}), followed by all of
* its elements (each a {@code String}), in the proper
* sequence.
*/
private void writeObject(ObjectOutputStream s)
throws IOException {
s.defaultWriteObject();
s.writeInt(size);
// Write out all elements in the proper order.
for (Entry e = head; e != null; e = e.next)
s.writeObject(e.data);
}
private void readObject(ObjectInputStream s)
throws IOException, ClassNotFoundException {
s.defaultReadObject();
int numElements = s.readInt();
// Read in all elements and insert them in list
for (int i = 0; i < numElements; i++)
add((String) s.readObject());
}
... // Remainder omitted
}

Note that the first thing writeObject does is to invoke defaultWriteObject, and the first thing readObject does is to invoke defaultReadObject, even though all of StringList’s fields are transient. If all instance fields are transient, it is technically permissible to dispense with invoking defaultWriteObject and defaultReadObject, but it is not recommended.

Before deciding to make a field nontransient, convince yourself that its value is part of the logical state of the object. If you use a custom serialized form, most or all of the instance fields should be labeled transient, as in the StringList example shown above.

If you are using the default serialized form and you have labeled one or more fields transient, remember that these fields will be initialized to their default values when an instance is deserialized: null for object reference fields, zero for umeric primitive fields, and false for boolean fields [JLS, 4.12.5]. If these values are unacceptable for any transient fields, you must provide a readObject method that invokes the defaultReadObject method and then restores transient fields to acceptable values (Item 76). Alternatively, these fields can be lazily initialized the first time they are used (Item 71).

Whether or not you use the default serialized form, you must impose any synchronization on object serialization that you would impose on any other method that reads the entire state of the object. So, for example, if you have a thread-safe object (Item 70) that achieves its thread safety by synchronizing every method, and you elect to use the default serialized form, use the following writeObject method:

// writeObject for synchronized class with default serialized form
private synchronized void writeObject(ObjectOutputStream s)
throws IOException {
s.defaultWriteObject();
}

If you put synchronization in the writeObject method, you must ensure that it adheres to the same lock-ordering constraints as other activity, or you risk a resource-ordering deadlock [Goetz06, 10.1.5].

Regardless of what serialized form you choose, declare an explicit serial version UID in every serializable class you write. This eliminates the serial version UID as a potential source of incompatibility (Item 74). There is also a small performance benefit. If no serial version UID is provided, an expensive computation is required to generate one at runtime.

Declaring a serial version UID is simple. Just add this line to your class:

private static final long serialVersionUID = randomLongValue ;

To summarize, when you have decided that a class should be serializable (Item 74), think hard about what the serialized form should be. Use the default serialized form only if it is a reasonable description of the logical state of the object; otherwise design a custom serialized form that aptly describes the object. You should allocate as much time to designing the serialized form of a class as you allocate to designing its exported methods (Item 40). Just as you cannot eliminate exported methods from future versions, you cannot eliminate fields from the serialized form; they must be preserved forever to ensure serialization compatibility. Choosing the wrong serialized form can have a permanent, negative impact on the complexity and performance of a class.


Reference: Effective Java 2nd Edition by Joshua Bloch