FAQ

This FAQ covers questions that are not time-sensitive. For possible transient issues, please see the Known Issues page.

Q. How can I get time on Blue Waters?  

A: Please visit the Allocations page to learn about the different pathways available to get time on Blue Waters.

Q. Where can I download the Blue Waters logo?

A: If you would like to use the Blue Waters logo, you can download the .zip file here. Included in the .zip file are the logo usage guidelines. Please follow these guidelines when using the Blue Waters logo.

Q:  As the project PI or designee, how do I request a Blue Waters account for a team member?

A: Log in to the Blue Waters Portal and go to YOUR BLUE WATERS -> Manage Users. Enter the user's name and email address (make sure the email address is correct). The user will then receive an invitation to fill out an application for a Blue Waters account.

Q: I am the PI/Project Lead, but I want a different team member to manage the users.  Can I designate someone to add/remove team members?

A: Yes.  Go to YOUR BLUE WATERS -> Manage Users.  In your list of team members, you may select one person to be your designee.  When account-related issues arise, both the designee and the PI will receive notifications.

Q: I added a user to my team, but he/she still doesn't have an OTP token to log in to the system.

A: Although the PI has added a user, the user may not have completed the application for the account. Check with the user to see if the application was completed and submitted. Once a user submits an application, the PI needs to verify the application one more time before NCSA sends the OTP token. PIs receive an email notification when an account is waiting for verification. Unless the account is urgent or time sensitive, NCSA sends the OTP tokens via USPS, which can take a few days to reach the user.

Q: How do I reset my PIN?

A: Please see the Forgot PIN page.

Q: Should I expect certificate warnings for the following site:  https://diamond.ncsa.illinois.edu:7004/console-selfservice/

A: Yes. See Logging In or Forgot PIN.

Q: What happened to my job when I see ...

A: Commonly encountered job termination messages are listed below.

[NID XXXXX] 2012-11-15 15:21:22 Apid 147699 killed. Received node failed or halted event for nid YYYYY

This is a common message sent when a node (nid YYYYY in this case) fails for some reason (memory, kernel panic, etc.). The admins get a notification of this situation and work to determine whether it is due to a known issue, a new issue that would generate a bug report to Cray, or a failed piece of hardware.

[NID XXXXX] 2013-03-12 20:22:34 Apid 1410368 killed. Received node event ec_node_failed for nid YYYYY

This is a typical error message when a node (nid YYYYY in this case) fails with a hardware error such as a machine check exception (MCE). Typically, the node health checker automatically marks the node down.

Application 98398 network quiesced: 1248 nodes quiesced, 35:24:46 node-seconds

This most likely means that work was being done on the system, such as handling a failed Gemini router or taking a blade down and adding it back (warm-boot). The network routing has to be updated to avoid the routers on that blade.

Application 61435 network throttled: 4459 nodes throttled, 25:31:21 node-seconds

This is a case of congestion protection throttling. Please see the Balanced Injection section for more information.

Q: I am seeing the following error messages when compiling my code ...

undefined reference to `__pgas_register_dv'

This is a mixed-language issue with the Cray Compiler when using CC to link applications that use PGAS and/or Fortran routines. The CC compiler option -hpgas_runtime makes sure the correct libraries are linked in.

Q: I am seeing the following error messages when running my code ...

LIBDMAPP WARNING: Unable to open kgni version file /sys/class/gemini/kgni0/version errno 2

This error message occurs when an application compiled with the Cray Compiler (CCE) is run on an external login node (h2ologin1|2|3|4). To compile code that will run only on the h2ologin nodes, without communication or device libraries, add "-target=local_host" to the options for cc, CC and ftn.

Q: I am seeing aprun failures with pmi_ errors when running my non-MPI code (multiple serial codes in a wrapper, for example):

Mon Jun 11 07:35:50 2018: [PE_0]:_pmi_alps_sync:alps response not OKAY
Mon Jun 11 07:35:50 2018: [PE_0]:_pmi_init:_pmi_alps_sync failed -1

The aprun command forks a copy of your process via the Process Management Interface (PMI) daemon running on a node. When the application is not pure MPI, it may be desirable to bypass this launch sequence and let the application(s) start on their own rather than as forked children of the PMI daemon. To disable the default behavior, set (export) PMI_NO_FORK=1.
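
As a rough illustration, the fragment below shows where the setting might go in the launch portion of a job script. The task count (4) and the wrapper name ./serial_wrapper.sh are placeholders, not Blue Waters defaults, and the check for aprun is there only so the fragment does nothing harmful when tried outside a batch job.

```shell
#!/bin/bash
# Sketch of the launch portion of a job script for non-MPI work.
# Assumptions: 4 tasks and a wrapper named ./serial_wrapper.sh (both hypothetical).

# Let the tasks start on their own instead of as forked children of the PMI daemon.
export PMI_NO_FORK=1

if command -v aprun >/dev/null 2>&1; then
    aprun -n 4 ./serial_wrapper.sh
else
    echo "aprun not found; run this from a Blue Waters batch job" >&2
fi
```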

Q. My OpenACC code is running the same as (or slower than) the CPU version.  What's happening?

The OpenACC directives are just that--directives. Compilers are free to ignore them, or to skip generating accelerated code, if the required compiler flags are missing or the required modules are not loaded. Take care to include the appropriate modules and flags as shown in the OpenACC compiler table of the user guide (under programming), and don't forget to have the correct modules loaded at runtime (best done in your job script).

Q: I am seeing the following error messages when using the checkjob command ...

NOTE: job violates constraints for partition login (partition login not in job partition mask)

This message can be ignored and does not indicate a problem with your job. The login partition consists of the MOM nodes, which need to be part of the job but are not part of the compute node partition.

Q: I am seeing "no CUDA-capable device is detected". What causes this and how can I avoid it?

This error occurs when XE nodes are used instead of XK nodes for GPU applications. Make sure XK nodes are specified in the resource request.

This error can also occur when interactive mode is used for GPU applications. In this case, you will need to use CCM (Cluster Compatibility Mode).

Q. How can I monitor the memory usage of my application?

Compile and link into your application a routine that calls getrusage(), and call that routine from Fortran or C at the points in your code where you would like to monitor memory usage.

Q: How do I submit a bug?

Send an e-mail to help+bw@ncsa.illinois.edu. If you ever forget the address, it is always at the bottom of the message of the day; just run "more /etc/motd" to be reminded.

Make the subject line an informative description of the problem. Put as many details as you can think of in the body.

If you want to point the help staff to an example, please make the code and all the directories above the code readable by everyone on the system. You can do this with Unix file permission settings: chmod a+rX -R my/top/code/directory
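
Before pointing staff at a directory, it can help to confirm that every directory on the path is open to other users. The sketch below only reports problems and changes nothing. The function name list_unshared and the idea of stopping at a chosen top directory are illustrative assumptions, not an NCSA tool.

```shell
#!/bin/bash
# list_unshared DIR TOP (hypothetical helper): print each directory from DIR
# up to TOP (inclusive) whose "other" permission bits lack read or search access.
list_unshared() {
    local dir stop other
    dir=$(cd -- "$1" >/dev/null && pwd -P) || return 1
    stop=$(cd -- "$2" >/dev/null && pwd -P) || return 1
    while :; do
        # Characters 8-10 of the mode string are the permissions for "other".
        other=$(ls -ld "$dir" | cut -c8-10)
        case "$other" in
            r?x|r?t) ;;              # readable and searchable by everyone
            *) echo "$dir" ;;        # something is missing here
        esac
        if [ "$dir" = "$stop" ] || [ "$dir" = "/" ]; then
            break
        fi
        dir=$(dirname "$dir")
    done
}
```

For example, list_unshared my/top/code/directory "$HOME" prints any directory between the code and your home directory that others cannot read or enter. No output means the path is open.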

Q: My Globus Online transfers using Globus Connect are getting the following error in the transfer log. What does that mean?

500-globus_gsi_gssapi: SSLv3 handshake problems: Couldn't do ssl handshake 

500-OpenSSL Error: s3_srvr.c:956: in library: SSL routines, function SSL3_GET_CLIENT_HELLO: wrong version 

This is an issue with the encryption protocol used by your version of the Globus Toolkit. SSLv3 is no longer supported; TLS is now the supported method. Please update your installation of Globus Connect or Globus Toolkit.

Q: How can a task get its rank number without asking MPI?

A script or program launched with aprun may obtain the rank number by retrieving the ALPS_APP_PE environment variable.
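
For example, a wrapper script could use the variable to pick a per-task input file. This is only a sketch: the function name task_input and the input_<rank>.dat naming are made up for illustration, and the fallback to 0 is an assumption so the script can be tried outside aprun.

```shell
#!/bin/bash
# task_input (hypothetical helper): print the input file name for this task,
# based on the rank that ALPS places in ALPS_APP_PE.
task_input() {
    local rank="${ALPS_APP_PE:-0}"   # assumed fallback when not launched by aprun
    printf 'input_%s.dat\n' "$rank"
}

echo "rank ${ALPS_APP_PE:-0} would process $(task_input)"
```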

Q: I am getting "Assertion failed" in MPI_File_read_at_all() on arrays larger than 2GB.

There is a hardcoded upper limit of 2GB in the Cray MPI implementation on Blue Waters, and it will not be fixed. The workaround is to split the buffers into smaller chunks.

Q. Why is the color feature for ls disabled by default? 

The issue is performance related in that the extra work needed by the color feature is not handled well by Lustre when the number of files is large or the files are striped wide across the Lustre OSTs.

Q: Why am I seeing "relocation truncated to fit: R_X86_64_PC32 against `.bss'" errors when linking?

You need to enable special compiler/linker options for codes with large static data:

  • Cray: compile: -hpic, link: -dynamic -hpic
  • GNU: compile: -mcmodel=medium (and maybe -fpic), link: -mcmodel=medium (and maybe -dynamic)
  • PGI:  compile: -mcmodel=medium -Mlarge_arrays, link: -dynamic -mcmodel=medium -Mlarge_arrays

Before linking, remove ATP: module delete atp

Q: Why am I seeing the following error message when connecting via ssh to Blue Waters login nodes such as h2ologin or bw?

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
SHA256:EtANMQV12S951bexYFeAC/b3MnoRJDe3J1YlyUXDPL8.
Please contact your system administrator.
Add correct host key in /Path/user/.ssh/known_hosts to get rid of this message.
Offending RSA key in /Path/user/.ssh/known_hosts:NN
RSA host key for h2ologin2.ncsa.illinois.edu has changed and you have requested strict checking.
Host key verification failed.

The host keys for the Blue Waters login nodes (h2ologin[1-4].ncsa.illinois.edu or bw[1-4].ncsa.illinois.edu) will change as of 01/14/2020. The new fingerprint for the RSA key sent by the remote host is: SHA256:EtANMQV12S951bexYFeAC/b3MnoRJDe3J1YlyUXDPL8.

 

To accept the new host key:

  1. Edit the known_hosts file and delete the offending entry, identified by the line number in the warning (known_hosts:NN)
  2. Reconnect via ssh to the host, where you should see:
The authenticity of host 'h2ologin2.ncsa.illinois.edu (141.142.176.130)' can't be established.
ECDSA key fingerprint is SHA256:9y0z7Rtj9ugBXCoVjOHsy37PG2AUh5tMEsxfimN+kCE.
Are you sure you want to continue connecting (yes/no)?

You should accept the new host key once its fingerprint matches one of those listed below. You might encounter this for each h2ologin or bw host, but the new host key will be the same.
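
Step 1 above can also be done from the command line. The following is a minimal sketch, assuming the line number NN is taken from the "Offending RSA key in ...:NN" line of the warning. The function name drop_known_hosts_line is hypothetical. It keeps a backup copy next to the original file.

```shell
#!/bin/bash
# drop_known_hosts_line FILE N (hypothetical helper): delete line N from FILE,
# leaving the previous contents in FILE.bak.
drop_known_hosts_line() {
    local file="$1" line="$2"
    case "$line" in
        ''|*[!0-9]*) echo "line number must be a positive integer" >&2; return 1 ;;
    esac
    [ "$line" -gt 0 ] || { echo "line number must be a positive integer" >&2; return 1; }
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
    sed -i.bak "${line}d" "$file"
}
```

A call such as drop_known_hosts_line ~/.ssh/known_hosts NN, with NN replaced by the reported line number, removes only that entry. OpenSSH's ssh-keygen -R hostname is an alternative that removes every stored key for the named host.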

The fingerprints of the new keys for h2ologin[1-4].ncsa.illinois.edu or bw[1-4].ncsa.illinois.edu and h2ologin-duo[1-4].ncsa.illinois.edu or bw[1-4]-duo.ncsa.illinois.edu are:

SHA256:EtANMQV12S951bexYFeAC/b3MnoRJDe3J1YlyUXDPL8 (RSA)
SHA256:9y0z7Rtj9ugBXCoVjOHsy37PG2AUh5tMEsxfimN+kCE (ECDSA)
SHA256:jaIZt/ybXU37JB48b3MQOvq/Y11P0cICcQwLO3mSBOo (ED25519)