This FAQ covers long-standing questions. For possible transient issues, please see the Known Issues page.
Q: How can I get time on Blue Waters?
A: Please visit the Allocations page to learn about the different pathways available to get time on Blue Waters.
Q: As the project PI or designee, how do I request a Blue Waters account for a team member?
A: Log in to the Blue Waters Portal with your NCSA-issued OTP token and go to YOUR BLUE WATERS -> Manage Users. Enter the user's name and email address (make sure the email address is correct). The user will then receive an invitation to fill out an application for a Blue Waters account.
Q: I am the PI/Project Lead, but I want a different team member to manage the users. Can I designate someone to add/remove team members?
A: Yes. Go to YOUR BLUE WATERS -> Manage Users. In your list of team members, you may select one person to be your designee. When account-related issues arise, both the designee and the PI will receive notifications.
Q: I added a user to my team, but the user still doesn't have an OTP token to log in to the system.
A: Although the PI has added a user, the user may not have completed the application for the account. Check with the user to see if the application was completed and submitted. Once a user submits an application, the PI needs to verify the application one more time before NCSA sends the OTP token. PIs receive an email notification when an account is waiting for verification. Unless the account is urgent or time sensitive, NCSA sends the OTP tokens via USPS, which can take a few days to reach the user.
Q: How do I reset my pin?
A: Please see the forgot my pin page.
Q: Should I expect certificate warnings for the following site: https://diamond.ncsa.illinois.edu:7004/console-selfservice/
Q: What happened to my job when I see ...
A: Commonly encountered job termination messages are listed below.
[NID XXXXX] 2012-11-15 15:21:22 Apid 147699 killed. Received node failed or halted event for nid YYYYY
This is a common message sent when a node (nid YYYYY in this case) fails for some reason (memory, kernel panic, etc.). The admins get a notification of this situation and work to determine whether it is due to a known issue, a new issue that would generate a bug report to Cray, or a failed piece of hardware.
[NID XXXXX] 2013-03-12 20:22:34 Apid 1410368 killed. Received node event ec_node_failed for nid YYYYY
This is a typical error message when a node (nid YYYYY in this case) fails with a hardware error such as a memory check error (MCE). Typically the node is automatically marked down by the node health checker.
Application 98398 network quiesced: 1248 nodes quiesced, 35:24:46 node-seconds
This likely means that some work was being done on the system, such as responding to a failed Gemini router or taking a blade down and adding it back (warm boot). The network routing has to avoid the routers on that blade.
Application 61435 network throttled: 4459 nodes throttled, 25:31:21 node-seconds
This is a case of congestion protection throttling. Please see the Balanced Injection section for more information.
Q: I am seeing the following error messages when compiling my code ...
undefined reference to `__pgas_register_dv'
This is a mixed language issue with the Cray Compiler when using CC to link applications that use PGAS and/or Fortran routines. The CC compiler option:
-hpgas_runtime will make sure the correct libraries are linked in.
Q: I am seeing the following error messages when running my code ...
LIBDMAPP WARNING: Unable to open kgni version file /sys/class/gemini/kgni0/version errno 2
This error message occurs when an application compiled with the Cray Compiler (CCE) is run on an external login node (h2ologin1|2|3|4). To compile code only for running on the h2ologin nodes, without communication or device libraries, add "-target=local_host" to the options for cc, CC and ftn.
Q: My OpenACC code is running the same as (or slower than) the CPU version. What's happening?
The OpenACC directives are just that--directives. Compilers are free to ignore them or not implement accelerated code if compiler flags or modules are missing or not loaded. Take care to include the appropriate modules and flags as shown in the OpenACC compiler table of the user guide (under programming), and don't forget to have the correct modules loaded at runtime (best done in your job script).
Q: I am seeing the following error messages when using the checkjob command ...
NOTE: job violates constraints for partition login (partition login not in job partition mask)
This error message can be ignored and does not represent a problem with your job. The login partition consists of the MOM nodes, which need to be part of the job but are not part of the compute node partition.
Q: I am seeing "no CUDA-capable device is detected". What causes it, and how can I avoid it?
This error occurs when XE nodes are used instead of XK nodes for GPU applications. Make sure XK nodes are specified in the resource request.
This error can also occur when interactive mode is used for GPU applications. In this case you will need to use CCM (cluster compatibility mode).
Q: How can I monitor the memory usage of my application?
Compile and link into your application a routine with getrusage() and call it from Fortran or C at points in your code where you would like to monitor memory usage.
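As a rough illustration of such a routine, the sketch below wraps getrusage() in two small C functions. The names peak_rss_kb and report_memory are hypothetical, chosen only for this example. It assumes a Linux system, where ru_maxrss is reported in kilobytes and gives the peak resident set size of the process so far rather than its current usage. Calling it from Fortran would additionally need an interface to the C routine, which is not shown here.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: return the peak resident set size of the
 * calling process in kilobytes (the unit Linux uses for ru_maxrss),
 * or -1 if getrusage() fails. */
long peak_rss_kb(void)
{
    struct rusage usage;

    if (getrusage(RUSAGE_SELF, &usage) != 0) {
        return -1;
    }
    return usage.ru_maxrss;
}

/* Hypothetical helper: print a labelled report. Call it at the points
 * in your code where you would like to monitor memory usage. */
void report_memory(const char *label)
{
    long kb = peak_rss_kb();

    if (kb < 0) {
        fprintf(stderr, "%s: getrusage failed\n", label);
    } else {
        printf("%s: peak RSS = %ld kB\n", label, kb);
    }
}
```

A call such as report_memory("after mesh setup") placed after each major phase shows how the peak grows over the run. In an MPI job every rank prints its own line, so it may be worth restricting the call to a few ranks.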
Q: How do I submit a bug?
Send an e-mail to email@example.com. (If you ever forget, this information is always at the bottom of the message-of-the-day. Just run "more /etc/motd" to be reminded.)
Make the subject line an informative description of the problem. Put as many details as you can think of in the body.
If you want to point the help staff to an example, please make the code and all the directories above the code readable by everyone on the system. You do this with Unix file permission settings: chmod a+rX -R my/top/code/directory
Q: My globusonline transfers using globus connect are getting the following error in the transfer log. What does that mean?
500-globus_gsi_gssapi: SSLv3 handshake problems: Couldn't do ssl handshake
500-OpenSSL Error: s3_srvr.c:956: in library: SSL routines, function SSL3_GET_CLIENT_HELLO: wrong version
Q: How can a task get its rank number without asking MPI?
A script or program launched with aprun may obtain the rank number by retrieving the ALPS_APP_PE environment variable.
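For a compiled program, a minimal sketch of that lookup in C might look like the following. The function name alps_rank and the -1 return value for "not available" are assumptions made for this example. Only the ALPS_APP_PE variable itself comes from the answer above.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: return the rank stored in ALPS_APP_PE, or -1 if
 * the variable is unset, empty, or not a non-negative integer (for
 * example when the program was not launched with aprun). */
int alps_rank(void)
{
    const char *value = getenv("ALPS_APP_PE");
    char *end;
    long rank;

    if (value == NULL || *value == '\0') {
        return -1;
    }
    rank = strtol(value, &end, 10);
    if (*end != '\0' || rank < 0) {
        return -1;
    }
    return (int)rank;
}
```

Checking for -1 lets the same binary fall back to a default when it is run outside aprun, for instance during testing on a login node.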