For the last two weeks, I have been researching options for processing free text. I think I have explored the entire spectrum of possibilities. Below are some notes that I can revisit in a few months, so that I don't have to spend the same effort again.
I was looking for a way to process auto-generated tweets, like the ones on http://twitter.com/moneyvidya_com. Some sample tweets:
- #moneyvidya arunthestocksguru (5 Star rated) says Buy Vijay Shanthi Builders – 6m (Monday 29 March 2010 @ 09:55 … http://bit.ly/bd5JgC
- #moneyvidya arunthestocksguru (5 Star rated) says Buy Bhagwati Banquets & Hotels – 6m (Monday 29 March 2010 @ 09… http://bit.ly/9MzRDG
- #moneyvidya NSV (5 Star rated) says Buy ACC – 6m (Wednesday 24 March 2010 @ 09:55 AM): http://bit.ly/b5xTrN
- #moneyvidya justsurjit (5 Star rated) says Sell Sesa Goa – Intraday (Monday 22 March 2010 @ 10:31 AM): http://bit.ly/9lLo8U
As is clear, the text follows a specific format but has its own little variations. I intended to process the ‘insights’ and see each expert’s success rate. Although I never got around to actually completing the task, I did learn a lot about text processing.
The apprentice – Regular Expressions
The first approach was the most obvious one – regular expressions. I am sure RegEx would have addressed the particular task at hand. But the parsing expression would become a convoluted mess very soon. So I started looking for better alternatives.
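To give a feel for how quickly such an expression grows, here is a minimal Python sketch that parses the sample tweets above. The field names (expert, stars, action, stock, horizon, when) and the helper parse_tweet are my own labels for illustration, and the pattern only covers the variations visible in the four samples, including timestamps cut off by an ellipsis.

```python
import re
from typing import Optional

# \u2013 is the en dash and \u2026 the ellipsis that appear in the tweets.
TWEET_RE = re.compile(
    r"^#moneyvidya\s+"
    r"(?P<expert>\S+)\s+"                         # expert's user name
    r"\((?P<stars>\d+) Star rated\)\s+says\s+"    # star rating
    r"(?P<action>Buy|Sell)\s+"                    # recommendation
    r"(?P<stock>.+?)\s+[\u2013-]\s+"              # stock name, up to the dash
    r"(?P<horizon>\S+)\s+"                        # e.g. 6m or Intraday
    r"\((?P<when>[^)\u2026]*?)\s*(?:\)|\u2026)"   # timestamp, possibly truncated
)


def parse_tweet(text: str) -> Optional[dict]:
    """Return the parsed fields of a tweet, or None if it does not match."""
    match = TWEET_RE.match(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields["stars"] = int(fields["stars"])
    return fields


SAMPLE_TWEETS = [
    "#moneyvidya NSV (5 Star rated) says Buy ACC \u2013 6m "
    "(Wednesday 24 March 2010 @ 09:55 AM): http://bit.ly/b5xTrN",
    "#moneyvidya arunthestocksguru (5 Star rated) says Buy Vijay Shanthi "
    "Builders \u2013 6m (Monday 29 March 2010 @ 09:55 \u2026 http://bit.ly/bd5JgC",
]

for tweet in SAMPLE_TWEETS:
    print(parse_tweet(tweet))
```

Even with named groups and comments, every new variation in the tweet format means another optional group or alternation, which is the convoluted mess I was worried about.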
The strict teacher – Lexical Analysis
Lexical analysis starts where regular expressions give up. It also needs pretty strict rules on the allowed input text, but the rules can be a lot more flexible and easier to comprehend.
I especially enjoyed using Irony, which makes it trivial to convert BNF-style rules into C# code. There is a good, gentle introduction to lexical analysis using Irony on CodeProject.
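The grammar-driven idea can be sketched without Irony. The Python below is not Irony and does not use its API; it is a hand-written tokenizer plus a tiny recursive-descent parser, and the grammar, token names and function names are all assumptions of mine, covering only the part of a tweet up to the time horizon.

```python
import re
from typing import Iterator, List, NamedTuple

# Assumed grammar, in BNF-like form:
#   recommendation ::= HASHTAG WORD rating "says" action stock DASH WORD
#   rating         ::= "(" NUMBER "Star" "rated" ")"
#   action         ::= "Buy" | "Sell"
#   stock          ::= WORD+


class Token(NamedTuple):
    kind: str
    text: str


# Token rules, tried in order. \u2013 is the en dash used in the tweets.
TOKEN_SPEC = [
    ("HASHTAG", r"#\w+"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("DASH", r"[\u2013-]"),
    ("NUMBER", r"\d+\b"),          # "5" is a NUMBER, "6m" falls through to WORD
    ("WORD", r"[^\s()\u2013-]+"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{kind}>{rule})" for kind, rule in TOKEN_SPEC))


def tokenize(text: str) -> Iterator[Token]:
    """Split the text into tokens, dropping whitespace."""
    for match in TOKEN_RE.finditer(text):
        if match.lastgroup != "SKIP":
            yield Token(match.lastgroup, match.group())


class Parser:
    def __init__(self, tokens: List[Token]) -> None:
        self.tokens = tokens
        self.pos = 0

    def peek(self) -> Token:
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return Token("EOF", "")

    def expect(self, kind: str, text: str = "") -> Token:
        """Consume the next token if it has the given kind (and text, if set)."""
        token = self.peek()
        if token.kind != kind or (text and token.text != text):
            raise ValueError(f"expected {text or kind}, got {token.text!r}")
        self.pos += 1
        return token

    def recommendation(self) -> dict:
        self.expect("HASHTAG")
        expert = self.expect("WORD").text
        stars = self.rating()
        self.expect("WORD", "says")
        action = self.expect("WORD").text
        if action not in ("Buy", "Sell"):
            raise ValueError(f"expected Buy or Sell, got {action!r}")
        stock = [self.expect("WORD").text]       # stock ::= WORD+
        while self.peek().kind == "WORD":
            stock.append(self.expect("WORD").text)
        self.expect("DASH")
        horizon = self.expect("WORD").text
        return {"expert": expert, "stars": stars, "action": action,
                "stock": " ".join(stock), "horizon": horizon}

    def rating(self) -> int:
        self.expect("LPAREN")
        stars = int(self.expect("NUMBER").text)
        self.expect("WORD", "Star")
        self.expect("WORD", "rated")
        self.expect("RPAREN")
        return stars


def parse_recommendation(text: str) -> dict:
    return Parser(list(tokenize(text))).recommendation()


TWEET = ("#moneyvidya justsurjit (5 Star rated) says Sell Sesa Goa \u2013 Intraday "
         "(Monday 22 March 2010 @ 10:31 AM): http://bit.ly/9lLo8U")
print(parse_recommendation(TWEET))
```

Each grammar rule maps to one method, so a new variation in the input becomes a change to one rule rather than to one giant expression. The price is strictness: anything outside the grammar raises an error instead of matching loosely.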
The guru – Natural Language Processing
Processing text using tools like NLTK allows you to parse and understand any unstructured text. Although this gives you maximum freedom, it also needs a lot of work to get right. For this to produce good results, be sure you have lots of data with which to tweak and test your implementation. I guess this is the reason Google and co. can do so much better at translation: they have huge amounts of data available for improving their systems.
As for a conclusion, I don’t have one :). I guess there are several ways to solve a problem, and half the solution is identifying the best way to solve the given problem. As for me, it was a good learning exercise and may come in handy if I ever write a DSL.
It is becoming more and more obvious that there are just two runtimes left to execute code, the Java Virtual Machine (JVM) and the Common Language Infrastructure (CLI). So, I decided to see how they stack up. Looks like both environments have something for everyone.
Here is a list of programming languages available on these runtimes.
- Can run on CLI using IKVM.NET
- Can run on JVM using Mainsoft solution
- Not yet usable
- Can run on CLR, but is behind the JVM implementation
The main reason for the research was to identify a new language I should pick up. I looked at Python and Ruby, but both have some sore points that I just can’t stand. I really liked Boo and Groovy; they are similar to C#/Java in syntax and incorporate the good things from Python. Although I like Boo’s syntax and approach more than Groovy’s, Groovy has a more mature implementation and ecosystem. I will try to use Groovy for some hobby project and get a feel for things.
Yesterday I stumbled upon SmallBasic while looking for something else. It is an interesting little project by Microsoft to create an entry-level language for teaching programming. It is a mix of toned-down BASIC and Logo. Since the language (or is it an application?) is still in its infancy, with version 0.5 released recently, I will try not to be too harsh on it.
Sample program with obligatory screenshot
Showing Flickr Image
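' Fetch the URL of a current Flickr picture and load it
url = Flickr.GetPictureOfMoment()
img = ImageList.LoadImage(url)
GraphicsWindow.Title = url
' Resize the window to fit the image
GraphicsWindow.Height = ImageList.GetHeightOfImage(img)
GraphicsWindow.Width = ImageList.GetWidthOfImage(img)
' Draw the image at the top-left corner of the window
GraphicsWindow.DrawImage(img, 0, 0)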
I spent the last couple of days looking at JavaFX, Sun’s response to Silverlight and Flex. It is an interesting mix of ideas, clearly inspired by dynamic languages as well as Silverlight. Maybe Flex as well, but I couldn’t tell as I have not tried Flex.
Things I liked:
- Type inference: Could have been better, but I will take this any day over the verbose Java alternative.
- Binding: In fact, this is a great thing. Two-way binding and binding to expressions.
- Triggers: Need to explore more, but shows promise.
- Timelines: This, along with the exceptional support for multimedia, will help in creating the next killer app.
- Collections: You can iterate over collections in an SQL-like syntax which, to me, looks better than LINQ.
- Strings and Dates: Finally they get treated with the respect they deserve, since most of the time one is juggling text and dates.
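For readers who have not met binding and triggers, here is a rough Python sketch of the ideas. It is not JavaFX code; the names Var, bind, bind_both and on_replace are made up for illustration, and the sketch assumes a bound expression is simply re-evaluated whenever one of its listed sources changes.

```python
from typing import Any, Callable, List


class Var:
    """A value that notifies listeners (triggers) when it is replaced."""

    def __init__(self, value: Any) -> None:
        self._value = value
        self._listeners: List[Callable[[Any, Any], None]] = []

    def get(self) -> Any:
        return self._value

    def set(self, value: Any) -> None:
        if value == self._value:
            return                      # no change, so no triggers fire
        old, self._value = self._value, value
        for listener in list(self._listeners):
            listener(old, value)

    def on_replace(self, listener: Callable[[Any, Any], None]) -> None:
        """Register a trigger called with (old, new) after each change."""
        self._listeners.append(listener)


def bind(expression: Callable[[], Any], *sources: Var) -> Var:
    """One-way binding: the result tracks an expression over the sources."""
    target = Var(expression())
    for source in sources:
        source.on_replace(lambda _old, _new: target.set(expression()))
    return target


def bind_both(first: Var, second: Var) -> None:
    """Two-way binding: a change to either variable updates the other."""
    second.set(first.get())
    first.on_replace(lambda _old, new: second.set(new))
    second.on_replace(lambda _old, new: first.set(new))


width, height = Var(3), Var(4)
area = bind(lambda: width.get() * height.get(), width, height)
area.on_replace(lambda old, new: print(f"area changed from {old} to {new}"))
width.set(5)  # area is recomputed and its trigger fires
```

The equality check in set is what keeps the two-way binding from looping forever: once both sides hold the same value, the update stops propagating.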
Things I am meh about:
- Declarative UI design: I believe UI design is best left to designers (the software, not the people)
- Using all the available brackets: All examples look like they have used up every possible punctuation mark
- Init vs. assignment: The ambiguity about where to use variable: value and where to use variable = value
Overall it looks good, and I am going to spend some time learning the innards of JavaFX.
Finally, what I would like to see is guidance on how patterns will evolve to address this new form of development. I immediately see a lot of older patterns that are no longer needed, like Singleton, Visitor, Lazy Loading, Thread Pool, Observer and more. Similarly, we need to recalibrate a few, like the MVC and MVP patterns.