Sun Java System Application Server 9.1 Failing To Start Domain Due To PKCS11 Errors

Posted by sam Wed, 15 Apr 2009 20:32:00 GMT

I’ve you’ve landed at this page I’ll bet my bottom dollar that more than one of the following is true, and some of these you don’t even know yet:

  • You’re running Sun’s Glassfish or Sun Java System Application Server 9.1 (possibly the EE or Enterprise edition)
  • You’re on some flavour of Windows Server
  • You’re attempting an upgrade from a previous version - most likely 8.x to 9.1. This upgrade is more than likely an installation alongside an existing one, rather than an in over-the-top installation
  • You have some kind of Mozilla product on the machine, and that product uses encryption in some way.

I recently attempted to install Sun Java System Application Server 9.1 EE (SJSAS) onto an aging box alongside an existing copy of 8.2. The reasoning behind this was the usual blather about back-out plans and minimising risk. I had real problems with the thing: no matter what I did it would always complain with the error:

CLI156  Could not start the domain domain1

The stack trace in the server.log looked something like this:

[#|2009-02-26T22:51:04.515+0100|WARNING|sun-appserver2.1|javax.enterprise.system.stream.err|_ThreadID=10;_ThreadName=main;_RequestID=b3263e27-5cb1-4f58-8ec8-c9c2b55724a9;|java.lang.reflect.InvocationTargetException 
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) 
        at java.lang.reflect.Method.invoke(Method.java:597) 
        at com.sun.enterprise.server.PELaunch.main(PELaunch.java:415) 
Caused by: java.lang.NoClassDefFoundError: sun/security/pkcs11/SunPKCS11 
        at java.lang.Class.forName0(Native Method) 
        at java.lang.Class.forName(Class.java:169) 
        at com.sun.enterprise.pluggable.PluggableFeatureFactoryBaseImpl.invoke(PluggableFeatureFactoryBaseImpl.java:84) 
        at $Proxy0.getSecuritySupport(Unknown Source) 
        at com.sun.enterprise.security.SecurityUtil.getSecuritySupport(SecurityUtil.java:364) 
        at com.sun.enterprise.security.SSLUtils.(SSLUtils.java:102)
    at com.sun.enterprise.security.SecurityLifecycle.onInitialization(SecurityLifecycle.java:101)
    at com.sun.enterprise.server.ApplicationServer.onInitialization(ApplicationServer.java:262)
    ...  

So, at this point I did what any self-respecting production engineer would do: I bunged SEC8001 into Google. Take a look at what the SJSAS 9.1 Error Message Reference says. Yup, you read it right: "Cause TBD, Solution TBD". Nice one Sun. Excellent.

I tried a few random scattershot things, banking on hunches from a decade of dealing with this kind of stuff:

  • Check how the classpath was constructed
  • Remove the JRE that some numpty had installed alongside the JDK
  • Reinstall the thing using its own JDK rather than the system one
  • Make sure there was nothing stupid in the system %PATH%
  • Check the asenv.bat to ensure that everything seemed right and proper

It seemed that some digging was in order.

Taking a look at the stack trace, some investigation of PKCS11 seems to be the next step. I’ll save you the leg-work and just link straight to the Wikipedia page. So, we’re dealing with encryption - and so working again on a hunch I widened my search to include NSS - as we’re talking encryption, cross-platform, Sun and Java.

A few minutes of research lead me to this gem:

On Windows: Application Server fails to start due to NSS initialization failure. (6618442)

 

Description

In some cases, Application Server (enterprise profile) might not start on Windows when PKCS11 fails to initialize.

Solution

Remove softokn3.dll from the Windows system path.

 

What on earth is softokn3.dll? First hit on Google says this:

softokn3.dll is a NSS PKCS #11 Library from Mozilla Foundation belonging to Network Security Services

Things are starting to make sense now. Confident I’ve got this thing sussed, I renamed the copy of softokn3.dll I find it %SYSTEMROOT%\system32 and restart SJSAS. Nope.

When I search the file system I found 7 copies of this .dll. All of these were 300ish K, apart from the two installed with SJSAS 9.1which were 200-odd K. Checking the %PATH% it seems that these versions of softokn3.dll had made it into the search list before the intended 9.1 - remember automatic %PATH% additions are appended, and if you’re upgrading you will most likely have bits of previous installations still in there.

However, there’s one last trick to getting all of this to work. It’s not good enough to just rename and/or eliminate these existing copies of the .dll from your %PATH%. Your domain still will not start. They key issue is that they cannot be present during the installation. If they are present during the installation it seems to screw up the keystore(s) for the admin and master keys. No matter what you do post-installation, correct .dlls or not, you won’t easily be able to examine the contents of the keystore and you won’t be able to boot your domain. The domain upgrade wizard failing due to invalid credentials that you know are correct is another sign of this problem.

So, we’re down to:

  1. Remove all previous 300-ish K copies of softokn3.dll from the %PATH% by renaming them all to .dlx and amending the path entries.
  2. Uninstall and reinstall SJSAS 9.1 EE
  3. Confirm the domain starts OK
  4. Unpick the mess you’ve made of all of your other applications that rely on this .dll, including your "backout" 8.x instance, in my case.

You might have remember that I mentioned above that you could be running some kind of Mozilla product that has nothing to do with SJSAS/Glassfish, and still be experiencing this problem. Every version of Firefox I’ve ever come across includes this .dll at one version level or another, for example. If the Firefox installation root is in your path before SJSAS you’ll be for it too.

Even though I’ve fixed this, there’s something tempting me to run the Sysinternals FileMon/ListDLLS/Process Explorer on the thing to get a measure of what’s going on during the installation. If I get enough time in the next couple of days I’ll do so and post my results.

UNC Shares and "java.io.ioexception: insufficient system resources exist to complete the request"

Posted by sam Mon, 06 Apr 2009 19:55:00 GMT

I recently had cause to investigate a production problem related to the error message:

java.io.ioexception: insufficient system resources exist to complete the request

and noticed that through the course of my investigation there was quite a bit of cruft floating around this issue. In particular, if you enter the error message in your favourite generic internet search engine, you will most likely hit a bug logged with Sun in 2003 that offers nothing in the way of help in the slightest.

Before I start, let me try and get some of the background in first.

In our production setup we had a 64 bit Windows Server 2008 box running a long-lived Java-based daemon/service. This process is the man in the middle for an SFTP and bulk load process; all day it does something like this:

  1. Poll UNC share advertised by Samba on a Solaris box for files uploaded via SFTP.
  2. Determine that those files are not still being written to and some other unimportant (to this explanation!) validation.
  3. Copy files to database server (another similar box)
  4. Send message to database server that files have arrived and bulk upload into database can begin.

Of course, all of this had been tested in the various stages of our production release cycle and passed with flying colors. In production, it was now failing to write to the database server.

In light of that, if we now add UNC to our search query we get this Sun bug ticket, which is a bit more enlightening:

I have seen similar when a java application on Windows tries a "write" over a UNC path. 
When this is just a "write" there is an application limit of 64MB, "writes" cannot use Windows cache manager if the destination is a remote file systems. If you change this to a "read/write" it will work. The restriction of 64MB is not configurable and is set to prevent Applications starting by issuing a lot of large IO. This issue only happens for remote file systems, local systems will work. 

For local cases, MS can cache the data because they can enforce the “write-only” logic in the OS. They don’t allow a “write only” handle to read, so its OK if the cache manager caches the write. They can’t do this for remote files because the server can’t trust the client to enforce that logic.

Hmm. This seemed more promising to me, but the 64MB file size just didn’t gel with what we were seeing: the files it was failing to copy were well below 64MB, so even if the code had been terrible nasty heap-abusing things, failing because the files were over 64MB wasn’t one of them in this case.

After much digging about (this Microsoft support article about failing backups turned out to be the key to understanding this) we determined that the following was happening:

  • java.nio was being used in the application code, and it wasn’t using any sort of chunking or buffering - it was just passing a whacking great multi-megabyte array of bytes to Windows.
  • When a file is copied to a UNC share in this fashion (this doesn’t occur for local copies) the client machine will allocate a region of memory approximately the size of (never less than) the file to be copied from the paged kernel memory. You can see your current paged kernel memory value in the Windows task manager.
  • The sized of the kernel paged memory buffer will vary depending on a number of factors, including but not limited to, the uptime of the machine, the amount of RAM in the machine, the usage and I/O profile of the machine over it’s uptime - I’m no Windows internals guru but let’s just say the thing can change ;)

What we discovered is that is the current size of the paged kernel memory, plus the size of the buffer that Windows is attempting to allocate out of it (which is itself derived from the size of the file you’re trying to copy) exceed 160MB then the "insufficient system resources" exception will be raised.

It turned out for us that all of our internal and production release machines had a suitably small paged kernel memory size and as such didn’t pop this seemingly arbitrary 160MB limit. Our production machines, much longer lived and more trafficked, did. The file in question was 47MB in size and our test machines had 22MB of paged kernel memory, whereas our production machines at 120MB. 47 + 22 was a flying pass in the tests, 47 + 120 popped the limit by 7MB and it was game over for files of that size or above.

There is a registry key that can be edited to raise the 160MB upper limit for the paged kernel memory buffer and therefore allow drivers to allocate more memory for buffers, but that isn’t really the way to fix problems such as these. In my case the solution was to explain all of the above to the developers and have them configure their java.nio class to properly "chunk" the copy of the file in suitably small pieces.