Thursday, January 22, 2009

Java Certification Tips

A new page of Java Certification Tips gives a "cribsheet" of some of the "syntactic niggles" and other commonly overlooked features of Java that could trip up an otherwise moderately experienced programmer when taking the Sun Certified Java Programmer (SCJP) exam. Did you know, for example, that:

  • this is invalid Java syntax?
    float f = 2.5;
  • goto is a Java keyword?
  • this line does not assign the value 13 to the variable?:
    int i = 013;
See the page of tips for more information on these and other niggles.
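
As a quick illustration of the three niggles above, here is a minimal sketch (the class and method names are made up for this example and are not taken from the tips page). A leading zero makes an integer literal octal, and a decimal literal without a suffix is a double:

```java
class LiteralNiggles {
    // A leading zero makes an integer literal octal: 013 is 1*8 + 3 = 11.
    static int octalThirteen() {
        return 013;
    }

    // "float f = 2.5;" does not compile because 2.5 is a double literal.
    // Adding the f suffix (or an explicit cast) fixes it.
    static float floatLiteral() {
        float f = 2.5f;
        return f;
    }

    public static void main(String[] args) {
        System.out.println(octalThirteen()); // prints 11
        System.out.println(floatLiteral());  // prints 2.5
        // int goto = 1;  // would not compile: goto is a reserved keyword, even though Java never uses it
    }
}
```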

Wednesday, January 21, 2009

Regular expression tutorial: new example

The Java regular expression tutorial section has been updated with a new example of using regular expressions. In this example, we look at how to perform what is sometimes referred to as HTML scraping: pulling out data from an HTML page (or indeed XML document).

The example is located here: HTML scraping with Java regular expressions. As explained in this tutorial, regular expressions are a good candidate for data scraping because they are flexible. Various libraries exist to parse an HTML or XML document and return an object representation of that document. But such libraries are often "too fussy": many if not most web pages actually do not conform strictly to HTML standards. Similarly, many XML parsing libraries are too fussy for real-life RSS feeds, which are often malformed, strictly speaking. Using regular expressions cuts through the fuss.
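
To give a flavour of the approach, here is a minimal sketch (not the code from the tutorial itself) that pulls the href values out of anchor tags. The LinkScraper class, the pattern and the sample HTML are all assumptions made for this illustration. Note that the sample input is deliberately not well-formed, which a strict parser might reject:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkScraper {
    // Matches an opening <a ...> tag with an href attribute quoted with " or ',
    // ignoring case. Group 1 is the quote character, group 2 the link target.
    private static final Pattern HREF = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
        Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // Unclosed tags and mixed case: typical of real-life pages
        String html = "<p>unclosed <a href=\"a.html\">one<br><A class=x HREF='b.html'>two";
        System.out.println(extractLinks(html)); // prints [a.html, b.html]
    }
}
```

A pattern this simple will miss unquoted attribute values, so it would need adapting to the pages actually being scraped.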

When should you scrape web pages?

Note that the article focusses on how technically to scrape HTML pages in Java. It doesn't deal with the "political" issue of whether the site in question wants its content scraped in the first place. In general, it is good practice to do the following:

  • find out if the web site in question has an API to provide the data in a more convenient format (and in a format that it would prefer to provide it to you in!)
  • be open about what you're doing: if the web site administrator thinks you're trying to "hide" something, they may block your IP address
  • respect server resources: if you are retrieving multiple pages, consider putting a thread sleep between fetches
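
The last point can be sketched roughly as follows. This is only an illustration: the PoliteFetcher class is hypothetical, and the actual fetching is passed in as a function so that any HTTP client could be plugged in. The pause length is a parameter, since a suitable value depends on the site:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class PoliteFetcher {
    // Applies the given fetch function to each URL in turn, sleeping for
    // pauseMillis between consecutive fetches (but not before the first one).
    static <T> List<T> fetchAll(List<String> urls, Function<String, T> fetch, long pauseMillis) {
        List<T> results = new ArrayList<>();
        for (int i = 0; i < urls.size(); i++) {
            if (i > 0) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and stop early with what we have
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            results.add(fetch.apply(urls.get(i)));
        }
        return results;
    }
}
```

A call might then look like fetchAll(urls, url -> download(url), 2000), where download() stands for whatever method you use to retrieve a page.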

Tuesday, December 30, 2008

Updates to the Spanish-English glossary

Various new entries have been added to the Javamex site's Spanish-English glossary of computing terms. For those not familiar with the glossary, it contains English translations of various Spanish computing terms, covering various IT topics such as programming, networking, the Internet, software and hardware, GUIs etc.

Monday, December 29, 2008

Information on memory usage of objects

The section on Java memory usage now contains the following additional articles:
  • information on how to calculate the memory usage of a Java object in general, considering the memory used for "housekeeping" by the JVM
  • calculating the memory usage of Strings, which can often be the type of object to use up the biggest proportion of space in a Java application: this section actually considers the memory use of string-related objects such as StringBuffers and StringBuilders
A section on reducing the memory taken up by Strings looks at string canonicalisation, a fairly standard approach (but one which comes with certain caveats), and introduces the example of a CompactCharSequence class, which stores strings as 1 byte per character, thus taking up around half the memory taken up by a regular Java String (at the expense of not supporting Unicode).
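
To illustrate the idea of storing one byte per character, here is a minimal sketch of what such a class might look like. It is not the code from the article: the choice of ISO-8859-1 as the encoding and the details of the constructors are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;

final class CompactCharSequence implements CharSequence {
    private final byte[] data;
    private final int offset;
    private final int length;

    CompactCharSequence(String s) {
        // ISO-8859-1 maps each character in the range 0-255 to a single byte;
        // characters outside that range are replaced with '?' (no Unicode support).
        this.data = s.getBytes(StandardCharsets.ISO_8859_1);
        this.offset = 0;
        this.length = data.length;
    }

    private CompactCharSequence(byte[] data, int offset, int length) {
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new StringIndexOutOfBoundsException(index);
        }
        // Mask to avoid sign extension of bytes above 127
        return (char) (data[offset + index] & 0xff);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException();
        }
        // Shares the underlying byte array rather than copying it
        return new CompactCharSequence(data, offset + start, end - start);
    }

    @Override
    public String toString() {
        return new String(data, offset, length, StandardCharsets.ISO_8859_1);
    }
}
```

Because the class implements CharSequence, an instance can be passed to many standard methods that accept one, such as StringBuilder.append() or Pattern.matcher().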

Comments on these articles welcome as usual.

Saturday, December 27, 2008

Beta: Classmexer agent

The beta version of a simple instrumentation tool is available for download from the Javamex site. The Classmexer jar provides various calls for querying the memory usage of Java objects. Via the deepMemoryUsageOf() method of the provided MemoryUtil class, it is possible to get an estimate from the JVM of the number of bytes taken up by an object and its "subobjects" (objects referred to by a non-public reference, or by references with other visibility criteria). The memory usage of subobjects is combined recursively (so subobjects of subobjects are considered etc), but without counting the same object more than once.

A variant of the call is also provided which gives the total memory usage of several objects at a time, without double-counting objects that are referenced by more than one of them.

Thursday, December 25, 2008

Updates to profiling section

Firstly, some minor corrections and additions to the Java profiling section. The corrections mainly concern a couple of typos that crept into the variable names of the examples. Readers should be reassured that the code, like that of the site in general, is copied and pasted from working, live profiling code. But things such as variable names are occasionally changed or shortened for the purposes of making it clearer on the site, and that seems to be where the errors crept in. I've also taken the opportunity to add a few links to other sections of the site (such as the section on threading, sleep() and yield()) that were added since the profiling tutorial was written.

Readers interested in Java profiling may also be interested in the first page of an upcoming section on Java and memory. This first page looks at how to find out the memory usage of a Java object. The technique involves using the Java Instrumentation framework introduced in version 5 of the language to query the JVM directly for the size of an object. Although slightly fiddly to set up, the technique has the advantage that there's less guesswork involved than if we were to just estimate an object's size (although future pages in the section will nonetheless look at estimation).
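
In outline, the technique relies on an "agent" class whose premain() method is handed an Instrumentation object by the JVM. The following is a minimal sketch rather than the code from the new page: the class name ObjectSizeAgent and the sizeOf() helper are made up for this illustration. The class would need to be packaged in a jar whose manifest names it in a Premain-Class entry, with the JVM then started using the -javaagent option.

```java
import java.lang.instrument.Instrumentation;

// Left package-private to keep the example to a single file; in a real
// agent jar the class would usually be declared public in its own file.
class ObjectSizeAgent {
    private static volatile Instrumentation instrumentation;

    // Called by the JVM before main() when the agent jar is specified
    // with -javaagent on the command line
    public static void premain(String agentArgs, Instrumentation inst) {
        instrumentation = inst;
    }

    // Returns the JVM's estimate of the size in bytes of the object itself,
    // not including any objects that it refers to
    static long sizeOf(Object o) {
        Instrumentation inst = instrumentation;
        if (inst == null) {
            throw new IllegalStateException("Agent not loaded: start the JVM with -javaagent");
        }
        return inst.getObjectSize(o);
    }
}
```

Note that getObjectSize() reports a "shallow" size: for an object holding references to other objects, only the references themselves are counted.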

Wednesday, December 10, 2008

RSS feeds of Java tutorials

The Javamex web site now publishes various RSS feeds containing links to articles published recently or on particular topics of frequent interest. The available feeds are as follows:
Suggestions are welcome if you think a feed on another theme would be useful.