Tuesday, 18 August 2015

How to use Java 7 Pattern & Matcher to extract values from a String?

import java.sql.Timestamp;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternTester {

    public static void main(String[] args) {
        String txt = "\"199.47.181.213\" \"NULL-AUTH-USER\" \"06/Oct/2014:11:19:54 +0000\" \"GET /site/\" 'HTTP/1.0\" 200 20668 ";
        String rgx = "\"(.*)\" \"(.*)\" \"(.*)\" \"(.*)\" ([0-9]+) ([0-9]+) ";
        // HH = 24-hour clock, Z = RFC 822 zone offset such as "+0000".
        // (The pattern "dd/MMM/yyyy:hh:mm:ss +SSSS" would use a 12-hour clock
        // and read the zone offset as milliseconds.)
        SimpleDateFormat dateFormat = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);
        Pattern p = Pattern.compile(rgx);
        Matcher m = p.matcher(txt);
        if (m.matches()) {
            // group 0 is the whole match; groups 1..groupCount() are the capturing groups
            int groupCount = m.groupCount();
            for (int i = 0; i <= groupCount; i++) {
                String g = m.group(i);
                System.out.print("matched group " + i + ":\t");
                if (i == 3) { // the timestamp field
                    try {
                        Date parsedDate = dateFormat.parse(g);
                        // note: Timestamp.toString() renders in the JVM default time zone
                        Timestamp timestamp = new Timestamp(parsedDate.getTime());
                        System.out.println(timestamp);
                    } catch (ParseException ex) {
                        Logger.getLogger(PatternTester.class.getName()).log(Level.SEVERE, g, ex);
                    }
                } else {
                    System.out.println(g);
                }
            }
        } else {
            System.out.println(txt + "\nDOES NOT MATCH\n" + rgx);
        }
    }
}

Below is the output:

matched group 0: "199.47.181.213" "NULL-AUTH-USER" "06/Oct/2014:11:19:54 +0000" "GET /site/" 'HTTP/1.0" 200 20668
matched group 1: 199.47.181.213
matched group 2: NULL-AUTH-USER
matched group 3: 2014-10-06 11:19:54.0
matched group 4: GET /site/" 'HTTP/1.0
matched group 5: 200
matched group 6: 20668
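Java 7 also introduced named capturing groups, which make this kind of extraction self-documenting: instead of remembering that the IP is group 1 and the timestamp is group 3, you refer to fields by name. A minimal sketch of the same idea (the group names `ip`, `user`, and `time` are my own choice, and `[^"]*` is used instead of the greedy `.*` so each group stops at its closing quote):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroups {
    public static void main(String[] args) {
        String txt = "\"199.47.181.213\" \"NULL-AUTH-USER\" \"06/Oct/2014:11:19:54 +0000\"";
        // (?<name>...) defines a named capturing group (new in Java 7)
        Pattern p = Pattern.compile("\"(?<ip>[^\"]*)\" \"(?<user>[^\"]*)\" \"(?<time>[^\"]*)\"");
        Matcher m = p.matcher(txt);
        if (m.matches()) {
            System.out.println(m.group("ip"));   // 199.47.181.213
            System.out.println(m.group("user")); // NULL-AUTH-USER
            System.out.println(m.group("time")); // 06/Oct/2014:11:19:54 +0000
        }
    }
}
```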

Monday, 9 February 2015

How to run a Hadoop MapReduce job?

Assuming a Hadoop 2.6 cluster has been properly set up, here are the steps to run the demo job:

# prepare jar file

vi WordCount.java
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class

# prepare data files

vi file01
vi file02
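For example, the two input files can be created with the sample data from the official Hadoop WordCount tutorial:

```shell
# sample input data from the Hadoop MapReduce WordCount tutorial
cat > file01 <<'EOF'
Hello World Bye World
EOF
cat > file02 <<'EOF'
Hello Hadoop Goodbye Hadoop
EOF
```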

# format HDFS namenode

hdfs namenode -format

# start HDFS & MapReduce daemons

start-dfs.sh
start-yarn.sh

# copy data into hdfs

hdfs dfs -mkdir -p /user/hdpuser/wordcount/input
hdfs dfs -copyFromLocal file* /user/hdpuser/wordcount/input

# calculating

hadoop jar wc.jar WordCount /user/hdpuser/wordcount/input /user/hdpuser/wordcount/output

# copy result out of hdfs

hdfs dfs -copyToLocal /user/hdpuser/wordcount/output/part-r-00000 result
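Stripped of the Hadoop API, the WordCount job simply tokenizes each line (the map step) and sums the counts per word (the reduce step). A plain-Java sketch of that logic, with no Hadoop dependencies, using the same sample data as the tutorial:

```java
import java.util.Map;
import java.util.TreeMap;

public class LocalWordCount {
    // Equivalent of map + reduce: tokenize each line, then sum counts per word.
    static Map<String, Integer> count(String... lines) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys, like part-r-00000
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    Integer c = counts.get(word);
                    counts.put(word, c == null ? 1 : c + 1);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count("Hello World Bye World",
                                            "Hello Hadoop Goodbye Hadoop");
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Running this prints each word with its count (e.g. `Hello  2`), which is the same tab-separated format the real job writes to `part-r-00000`.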


Friday, 9 January 2015

Big Data - issues & technologies

Data has always been an issue to someone. Big data is a new issue because it is 'big', by which I mean not only the size, but also the variety of sources, types, complexity, and so on. For example, in the healthcare domain, we collect data about people, medicines, food, the environment, and so on. Within any of these aspects, there can be many sub-domains we are interested in.

issue 1: collection
What should we collect? We can collect what we have selected, or whatever exists.
How do we collect it? It is a challenge to collect big data about a large population over a long period.

issue 2: cleansing

issue 3: integration

issue 4: analysis