Monday, June 21, 2010

Removing HTML tags from a String using Regex

Note: the tag-removal patterns are applied first; otherwise the <br/> inserted for "\n" would itself match the opening-tag pattern and get stripped.

String result = data
        .replaceAll("<[\\p{Alnum}\\p{Space}\\.\\-=/:\\\"\\\\;]*>", " ") // remove opening (and self-closing) html tags
        .replaceAll("</[\\p{Alnum}]*>", " ")  // remove closing html tags
        .replaceAll("\\\\n", "<br/>")         // replace the literal two-character sequence \n with <br/>
        .replaceAll("\\\\", "")               // remove stray "\"s
        .replaceAll("\\\"", "\\\\\"");        // escape double quotes

Tuesday, June 15, 2010

High Level Hadoop MapReduce: Rookie/Novice/fresher...

MapReduce borrows a lot from functional programming (Lisp/ML/Scheme). Functional programs process lists of data all the time, so these languages come with built-in iterator mechanisms and higher-order functions that operate over lists. Two of these operators are map and reduce. Like the map operator, a Mapper takes a record (assumed to be a key-value pair), but it can emit multiple key-value pairs (the map operator emits exactly one result per element). A defining characteristic of the MapReduce paradigm is that the mapper must process each record in isolation from every other record. (What counts as "one record" is up to you: it depends on the input format class that loads the data from the block into the mapper as key-value records.)
A Reducer takes a key and a list of values and can emit zero, one, or multiple key-value pairs.
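As a concrete sketch, here is the classic word count written against the Hadoop 0.20 (org.apache.hadoop.mapreduce) API; this is my own minimal illustration, not code from the original post:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Mapper: called once per record. With the default TextInputFormat the key
    // is the byte offset of the line within the split -- we ignore it here.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // a mapper may emit many pairs per record
            }
        }
    }

    // Reducer: receives one key plus the list of all values emitted for it.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}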

It's common practice to ignore the key that's input to a map task; with the default input format it is just the byte offset of the record within the chunk of data (as in the word-count sketch above).

People also skip the mapper or, more often, the reducer if that suffices for the app.
Reducers don't start reducing until all mappers complete. They run on the same worker nodes as the mappers (after the mappers complete).
All values with the same key are collected from every mapper that emitted that key and sent to the same reducer, which involves network communication. If multiple keys are processed by one reducer, they are presented to it in sorted key order (no ordering is guaranteed among the values of a key). This is the "sort and shuffle" phase, handled by the underlying framework. A single reduce() call processes a single key (it takes one key and the list of values for that key as input).

Within sort and shuffle there can be a user-defined combine task that runs on each mapper's intermediate results. If the app logic allows it, the combiner can be the same code as the reducer (e.g. when the reduce operation is commutative and associative). The combine phase is disabled by default. It runs on the mapper's machine, and exists solely to reduce the data sent over the network and the load on the reducers. Don't put anything in the combine phase that correctness depends on: it may run zero, one, or more times (depending on the size of the data). The word-count job is the standard example. The mapper emits [word, 1] pairs. The combiner aggregates that mapper's results into [word, n] (the number of times the word occurred in the input blocks local to that machine). Then the reducers take over (the mappers are done); see the driver sketch below.
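Wiring this up in a driver looks roughly like the following (again my own sketch against the 0.20 API; reusing IntSumReducer as the combiner is safe here because integer addition is commutative and associative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class); // combiner reuses the reducer code
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir, must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}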

Thank You.

Friday, June 11, 2010

Hibernate JDBCConnectionException: could not execute query- Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:

"org.quartz.JobExecutionException: could not execute query; nested exception is org.hibernate.exception.JDBCConnectionException: could not execute query [See nested exception: org.springframework.dao.DataAccessResourceFailureException: could not execute query; nested exception is org.hibernate.exception.JDBCConnectionException: could not execute query]
Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: The last packet successfully received from the server was 13,768,115 milliseconds ago.  The last packet sent successfully to the server was 13,768,115 milliseconds ago. is longer than the server configured value of 'wait_timeout'. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem."

This problem occurred in a Grails application with a cron job that requests some data from an external server, processes it, and persists some results. The data is very large and the app takes 2+ hours to process all of it before persisting a small result; the JDBC connection stays idle the whole time.
Setting 'autoReconnect=true' or bumping max_idle_time is certainly not a reliable solution for this.

What worked for me:
http://sacharya.com/grails-dbcp-stale-connections/: This was my exact problem. It says "By default, DBCP holds the pooled connections open for infinite time. But a database connection is essentially a socket connection, and it doesn’t come for free. The host OS, database host, and firewall have to allocate a certain amount of memory and other resources for each socket connection. It makes sense to those devices not to hold onto idle connections for ever. So the idea is to make sure that you don’t have stale connections in your pool that would otherwise be silently dropped by OS or firewall."

Modified my DataSource.groovy to work this out, using the properties-block syntax from http://stackoverflow.com/questions/376544/grails-mysql-maxpoolsize:

dataSource {
    pooled = true
    dbCreate = "update"
    url = "jdbc:mysql://localhost/yourDB"
    driverClassName = "com.mysql.jdbc.Driver"
    username = "yourUser"
    password = "yourPassword"
    properties {
        maxActive = 50
        maxIdle = 25
        minIdle = 5
        initialSize = 5
        minEvictableIdleTimeMillis = 60000    // a connection may be evicted once it has sat idle for 60s
        timeBetweenEvictionRunsMillis = 60000 // run the idle-connection evictor every 60s
        maxWait = 10000
    }
}


Other relevant links:

http://drglennn.blogspot.com/2009/05/javasqlsqlexception-communication-link.html
http://commons.apache.org/dbcp/configuration.html
http://www.grails.org/DataSources+New
http://www.grails.org/1.2+Release+Notes

[javac] Warning: MyClass.java modified in the future.

Uploaded my code to a server in a different timezone (I am at UTC+xx; the server is at UTC-yy).
So the modification dates of my source files were later than the current time on the server. (As far as the server is concerned, these files came through a time machine from the future.)

What worked for me:
Used the Linux "touch" command to set the source files' modification dates to the current time on the server (something like find . -name "*.java" -exec touch {} \; from the source directory). I don't know the Windows equivalent.

Thank you.

Custom Grails environment: Running Grails application in custom environment

grails -Dgrails.env=yourEnvName run-app

http://www.pubbs.net/201005/grails/7053-grails-user-unable-to-run-app-in-custom-environment.html

No Hibernate Session bound to thread: in a Grails artifact that's supposed to have an injected session.

org.quartz.SchedulerException: JobListener 'sessionBinderListener' threw exception: No Hibernate Session bound to thread, and configuration does not allow creation of non-transactional one here [See nested exception:
org.hibernate.HibernateException: No Hibernate Session bound to thread, and configuration does not allow creation of non-transactional one here]
    at org.quartz.core.QuartzScheduler.notifyJobListenersWasExecuted(QuartzScheduler.java:1912)
    at org.quartz.core.JobRunShell.notifyJobListenersComplete(JobRunShell.java:355)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:226)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:525)

Occurred in a Grails application with the Quartz plugin 0.4.1 installed, running a cron job. The exception occurred after the last line of code in the Job class had executed successfully. It looks like a job-completion notification led to a session.flush() call in one of the Hibernate classes, where it failed to retrieve the session.

Solution that worked for me:
Upgraded Quartz-plugin to version 0.4.2


Thank you.