The Javamex companion blog. This blog includes both technical articles relating to the programming information that you'll find on the Javamex site, plus information covering the IT industry more generally.
Thursday, January 22, 2009
Java Certification Tips
A new page of Java Certification Tips gives a "cribsheet" of some of the "syntactic niggles" and other commonly overlooked features of Java that could trip up an otherwise moderately experienced programmer when taking the Sun Certified Java Programmer (SCJP) exam. Did you know, for example, that:
- this is invalid Java syntax?
  float f = 2.5;
- goto is a Java keyword?
- this line does not assign the value 13 to the variable?
  int i = 013;
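For readers who want to check the three certification points from this post for themselves, here is a minimal sketch. The class and method names (CertTips, validFloat, leadingZero) are made up for illustration and do not come from the tips page. The lines that would not compile are left commented out.

```java
public class CertTips {

    // float f = 2.5;   // does not compile: 2.5 is a double literal,
    //                  // and a double cannot be narrowed to float implicitly
    public static float validFloat() {
        float f = 2.5f; // the 'f' suffix makes the literal a float
        return f;
    }

    // int goto = 1;    // does not compile: goto is a reserved keyword,
    //                  // even though the language never uses it

    public static int leadingZero() {
        int i = 013;    // a leading zero marks an octal literal: 1*8 + 3
        return i;
    }

    public static void main(String[] args) {
        System.out.println(validFloat());  // prints 2.5
        System.out.println(leadingZero()); // prints 11
    }
}
```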
Wednesday, January 21, 2009
Regular expression tutorial: new example
The Java regular expression tutorial section has been updated with a new example of using regular expressions. In this example, we look at how to perform what is sometimes referred to as HTML scraping: pulling out data from an HTML page (or indeed XML document).
The example is located here: HTML scraping with Java regular expressions. As explained in this tutorial, regular expressions are a good candidate for data scraping because they are flexible. Various libraries exist to parse an HTML or XML document and return an object representation of that document. But such libraries are often "too fussy": many if not most web pages actually do not conform strictly to HTML standards. Similarly, many XML parsing libraries are too fussy for real-life RSS feeds, which are often malformed, strictly speaking. Using regular expressions cuts through the fuss.
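To give a flavour of the approach, the following is a minimal sketch of pulling link targets out of a fragment of HTML with java.util.regex. It is not the code from the tutorial: the class name LinkScraper, the method extractHrefs and the sample HTML are invented for this post, and the pattern only handles href values in double quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkScraper {

    // Matches an opening <a ...> tag and captures a double-quoted href value.
    // Case-insensitive so that <A HREF="..."> is also picked up.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    public static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<String>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1)); // group 1 is the text between the quotes
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Deliberately sloppy HTML: mixed case and an unclosed <p> tag.
        String html = "<p>See <A HREF=\"http://example.com/a\">first</a> and "
                + "<a class=\"x\" href=\"/b\">second</a>";
        System.out.println(extractHrefs(html)); // prints [http://example.com/a, /b]
    }
}
```

Note that the pattern never asks whether the document as a whole is well formed, which is why the unclosed tag does no harm here.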
When should you scrape web pages?
Note that the article focusses on the technical side of scraping HTML pages in Java. It doesn't deal with the "political" issue of whether the site in question wants its content scraped in the first place. In general, it is good practice to do the following:
- find out if the web site in question has an API to provide the data in a more convenient format (and in a format that it would prefer to provide it to you in!)
- be open about what you're doing: if the web site administrator thinks you're trying to "hide" something, they may block your IP address
- respect server resources: if you are retrieving multiple pages, consider putting a thread sleep between fetches
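As a rough illustration of the last point, here is a sketch of a loop that pauses between fetches using Thread.sleep(). The class PoliteFetcher, the fetchAll method and the delay value are all hypothetical. The actual page retrieval is passed in as a function, so that the example stays self-contained and never touches the network.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class PoliteFetcher {

    /**
     * Fetches each URL in turn, sleeping for delayMillis between fetches.
     * The fetcher function is a placeholder for whatever code actually
     * retrieves a page. If the thread is interrupted while sleeping, the
     * pages fetched so far are returned.
     */
    public static List<String> fetchAll(List<String> urls,
                                        Function<String, String> fetcher,
                                        long delayMillis) {
        List<String> pages = new ArrayList<String>();
        for (int i = 0; i < urls.size(); i++) {
            if (i > 0) {
                try {
                    Thread.sleep(delayMillis); // give the server a breather
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt
                    break;
                }
            }
            pages.add(fetcher.apply(urls.get(i)));
        }
        return pages;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("/page1", "/page2", "/page3");
        // Fake fetcher for demonstration: no real HTTP request is made.
        List<String> pages = fetchAll(urls, url -> "<html>" + url + "</html>", 500);
        System.out.println(pages.size() + " pages fetched");
    }
}
```

A suitable delay depends on the site; half a second is only a placeholder.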
Labels: HTML scraping, Java, regular expression, scraping, XML scraping
Tuesday, December 30, 2008
Updates to the Spanish-English glossary
New entries have been added to the Javamex site's Spanish-English glossary of computing terms. For those not familiar with the glossary, it contains English translations of Spanish computing terms, covering IT topics such as programming, networking, the Internet, software and hardware, GUIs etc.
Monday, December 29, 2008
Information on memory usage of objects
The section on Java memory usage now contains the following additional articles:
- information on how to calculate the memory usage of a Java object in general, considering the memory used for "housekeeping" by the JVM
- calculating the memory usage of Strings, which can often be the type of object to use up the biggest proportion of space in a Java application: this section actually considers the memory use of string-related objects such as StringBuffers and StringBuilders
Comments on these articles welcome as usual.
Saturday, December 27, 2008
Beta: Classmexer agent
The beta version of a simple instrumentation tool is available for download from the Javamex site. The Classmexer jar provides various calls for querying the memory usage of Java objects. Via the deepMemoryUsageOf() method of the provided MemoryUtil class, it is possible to get an estimate from the JVM of the number of bytes taken up by an object and its "subobjects" (objects referred to by a non-public reference, or by references with other visibility criteria). The memory usage of subobjects is combined recursively (so subobjects of subobjects are considered etc), but without counting the same object more than once.
A variant of the call is also provided which gives the total memory usage of several objects at a time, without counting as duplicates objects referenced by more than one of the objects.
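The idea of adding up sizes recursively without counting any object twice can be sketched on a toy model. This is not Classmexer's implementation: the Node class, its shallowBytes field and the deepUsageOf method are hypothetical stand-ins, and the sizes are made-up numbers rather than figures from the JVM. The point is only the identity-based "seen" set, which handles shared subobjects, reference cycles and multiple starting objects alike.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

public class DeepUsageSketch {

    /** Toy stand-in for an object: a shallow size plus outgoing references. */
    public static final class Node {
        final long shallowBytes;
        final List<Node> refs = new ArrayList<Node>();

        public Node(long shallowBytes) {
            this.shallowBytes = shallowBytes;
        }

        public Node ref(Node other) {
            refs.add(other);
            return this;
        }
    }

    /** Sums the shallow sizes of everything reachable from the given roots. */
    public static long deepUsageOf(Node... roots) {
        // Identity-based set: two distinct objects are never merged,
        // and the same object is never counted twice.
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
        Deque<Node> pending = new ArrayDeque<Node>();
        for (Node root : roots) {
            pending.push(root);
        }
        long total = 0;
        while (!pending.isEmpty()) {
            Node n = pending.pop();
            if (!seen.add(n)) {
                continue; // already counted via another path
            }
            total += n.shallowBytes;
            for (Node child : n.refs) {
                pending.push(child);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Node shared = new Node(16);
        Node a = new Node(24).ref(shared);
        Node b = new Node(32).ref(shared);
        System.out.println(deepUsageOf(a));    // prints 40
        System.out.println(deepUsageOf(a, b)); // prints 72: shared counted once
    }
}
```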
Labels: heap, instrumentation, Java, memory usage, profiling
Thursday, December 25, 2008
Updates to profiling section
Firstly, some minor corrections and additions to the Java profiling section. The corrections mainly concern a couple of typos that crept into the variable names of the examples. Readers should be reassured that the code, like that of the site in general, is copied and pasted from working, live profiling code. But things such as variable names are occasionally changed or shortened for the purposes of making it clearer on the site, and that seems to be where the errors crept in. I've also taken the opportunity to add a few links to other sections of the site (such as the section on threading, sleep() and yield()) that were added since the profiling tutorial was written.
Readers interested in Java profiling may also be interested in the first page of an upcoming section on Java and memory. This first page looks at how to find out the memory usage of a Java object. The technique involves using the Java Instrumentation framework introduced in version 5 of the language to query the JVM directly for the size of an object. Although slightly fiddly to set up, the technique has the advantage that there's less guesswork involved than if we were to just estimate an object's size (although future pages in the section will nonetheless look at estimation).
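In outline, the setup looks something like the sketch below. This is a minimal illustration rather than the code from the article: the class name ObjectSizeAgent and the sizeOf method are invented here. What does come from the platform is the premain() hook, which the JVM calls when the class is packaged in a jar whose manifest names it as Premain-Class and the program is started with the -javaagent option, and the Instrumentation.getObjectSize() call.

```java
import java.lang.instrument.Instrumentation;

public class ObjectSizeAgent {

    // Set by the JVM via premain() when the agent jar is loaded.
    private static volatile Instrumentation instrumentation;

    /** Called by the JVM before main() when started with -javaagent. */
    public static void premain(String agentArgs, Instrumentation inst) {
        instrumentation = inst;
    }

    /**
     * Returns the JVM's estimate of the size in bytes of the given object
     * itself, not including any objects it refers to.
     */
    public static long sizeOf(Object obj) {
        Instrumentation inst = instrumentation;
        if (inst == null) {
            throw new IllegalStateException(
                    "Agent not loaded: start the JVM with the -javaagent option");
        }
        return inst.getObjectSize(obj);
    }
}
```

The value returned is implementation-specific, so it is best treated as an estimate from that particular JVM rather than a fixed property of the class.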
Labels: instrumentation, Java, profiling, ThreadMXBean
Wednesday, December 10, 2008
RSS feeds of Java tutorials
The Javamex web site now publishes various RSS feeds containing links to articles published recently or on particular topics of frequent interest. The available feeds are as follows:
- Main RSS feed: all Java programming articles published in the past 3 months
- Java threading feed: recent articles on multithreading and concurrency (including both "classical" thread programming and the Java 5+ concurrency libraries)
- Java Web programming feed: articles on topics such as Servlets and AJAX programming
- Java collections and algorithms feed: articles covering the Java collections framework itself, plus other algorithm-related frameworks such as compression
- Java performance feed: performance and profiling-related articles
- Java Regular expressions feed