import java.sql.Timestamp;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternTester {

    public static void main(String[] args) {
        // A sample access-log line; the stray single quote before HTTP/1.0
        // is part of the test data on purpose
        String txt = "\"199.47.181.213\" \"NULL-AUTH-USER\" \"06/Oct/2014:11:19:54 +0000\" \"GET /site/\" 'HTTP/1.0\" 200 20668 ";
        String rgx = "\"(.*)\" \"(.*)\" \"(.*)\" \"(.*)\" ([0-9]+) ([0-9]+) ";
        // HH = 24-hour clock, Z = RFC 822 time zone such as +0000
        SimpleDateFormat dateFormat = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z");
        Pattern p = Pattern.compile(rgx);
        Matcher m = p.matcher(txt);
        if (m.matches()) {
            // group 0 is the entire match; groups 1..groupCount() are the captures
            int groupCount = m.groupCount();
            for (int i = 0; i <= groupCount; i++) {
                String g = m.group(i);
                System.out.print("matched group " + i + ":\t");
                if (i == 3) {
                    // group 3 is the timestamp: parse it into a java.sql.Timestamp
                    try {
                        Date parsedDate = dateFormat.parse(g);
                        Timestamp timestamp = new Timestamp(parsedDate.getTime());
                        System.out.println(timestamp);
                    } catch (ParseException ex) {
                        Logger.getLogger(PatternTester.class.getName()).log(Level.SEVERE, g, ex);
                    }
                } else {
                    System.out.println(g);
                }
            }
        } else {
            System.out.println(txt + "\nDOES NOT MATCH\n" + rgx);
        }
    }
}
Below is the output:
matched group 0: "199.47.181.213" "NULL-AUTH-USER" "06/Oct/2014:11:19:54 +0000" "GET /site/" 'HTTP/1.0" 200 20668
matched group 1: 199.47.181.213
matched group 2: NULL-AUTH-USER
matched group 3: 2014-10-06 11:19:54.0
matched group 4: GET /site/" 'HTTP/1.0
matched group 5: 200
matched group 6: 20668
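As an aside not in the original code: on Java 7+, the same regex can use named capturing groups, which makes the extraction self-documenting. A minimal sketch reusing the txt string above; the group names ip, user, ts, req, status and bytes are my own:

Pattern logPattern = Pattern.compile(
        "\"(?<ip>.*)\" \"(?<user>.*)\" \"(?<ts>.*)\" \"(?<req>.*)\" (?<status>[0-9]+) (?<bytes>[0-9]+) ");
Matcher logMatcher = logPattern.matcher(txt);
if (logMatcher.matches()) {
    // groups can now be fetched by name instead of position
    System.out.println("ip = " + logMatcher.group("ip"));
    System.out.println("timestamp = " + logMatcher.group("ts"));
    System.out.println("status = " + logMatcher.group("status"));
}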
Monday, 9 February 2015
How to run a HADOOP MapReduce job?
Assuming a HADOOP 2.6 full cluster has been properly set up, here are the steps to run the demo job:
# prepare jar file
vi WordCount.java
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
# prepare data files
vi file01
vi file02
# format HDFS namenode (first-time setup only; this wipes any existing HDFS data)
hdfs namenode -format
# start HDFS & MapReduce daemons
start-dfs.sh
start-yarn.sh
# copy data into hdfs
hdfs dfs -mkdir -p /user/hdpuser/wordcount/input
hdfs dfs -copyFromLocal file* /user/hdpuser/wordcount/input
# run the WordCount job
hadoop jar wc.jar WordCount /user/hdpuser/wordcount/input /user/hdpuser/wordcount/output
# copy result out of hdfs
hdfs dfs -copyToLocal /user/hdpuser/wordcount/output/part-r-00000 result
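The steps above never show WordCount.java itself; the commands assume the classic example from the Hadoop MapReduce tutorial. A copy along those lines is reproduced below so the post is self-contained:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every token in the input line
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as combiner): sum the counts for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With the tutorial's sample inputs (file01 containing "Hello World Bye World" and file02 containing "Hello Hadoop Goodbye Hadoop"), part-r-00000 lists each word with its count, e.g. Hadoop 2, Hello 2, World 2.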
Friday, 9 January 2015
Big Data - issues & technologies
Data is always an issue for someone. Big data is a new issue because it is 'big', by which I mean not only the size, but also the variety of sources, types, complexities and so on. For example, in the healthcare domain, we collect data about people, medicines, food, the environment and so on. In any of these aspects, there could be many sub-domains we are interested in.
issue 1: collection
What should we collect? We can collect what we select, or whatever exists.
How do we collect it? It is a challenge to collect big data about a large population over a long period.
issue 2: cleansing
issue 3: integration
issue 4: analysis