[Hacking the JVM] The Hidden Cost of Thread Creation

Abstract

Thread creation in Java is often cited as a heavyweight operation, but what makes it costly? This post explores the complete journey of thread creation—from calling `Thread.start()` in Java code through JNI boundaries, into JVM internals (jvm.cpp, thread.cpp), and finally to OS-level pthread creation on Linux. We examine stack allocation, guard pages, TLS adjustments, synchronization mechanisms, and state transitions that occur before a single line of user code executes. Understanding this machinery provides insight into why thread pools exist and helps engineers make informed decisions.

Thread Creation in Java

// Code is only for demonstration purposes
Runnable r = new Runnable() {
  @Override
  public void run() {
    // some logic here
  }
};

Thread threadRun = new Thread(r);
threadRun.start();

Once the work is done, the thread completes its life cycle and its resources are reclaimed. I was curious about what happens under the hood, so let’s dive right in.

Thread Object creation

Thread threadRun = new Thread(r);

This line creates a Thread object just like any other Java object and assigns it various identifiers, such as a name and an ID. No OS thread exists yet at this point. We have covered some of these identifiers in this blog post here
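To make the point concrete, here is a minimal sketch showing that a freshly constructed Thread is only a heap object in state NEW. The class name `NewThreadState` and the thread name `demo-thread` are made up for illustration.

```java
// Sketch: a Thread object before start() is just a Java object in state NEW.
class NewThreadState {
    // Returns the state of a Thread that was constructed but never started.
    static Thread.State stateBeforeStart() {
        Thread t = new Thread(() -> { /* never runs in this sketch */ });
        return t.getState();
    }

    public static void main(String[] args) {
        Thread t = new Thread(() -> { }, "demo-thread");
        // Name, priority and daemon flag are plain fields on the Java object.
        System.out.println("name=" + t.getName()
                + " priority=" + t.getPriority()
                + " daemon=" + t.isDaemon()
                + " state=" + t.getState()
                + " alive=" + t.isAlive());
    }
}
```

Nothing native has happened yet; the expensive part only begins with start().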

The start() Method: Where Magic Begins

The Java 17 source for start() can be found here

public synchronized void start() {
        /**
         * This method is not invoked for the main method thread or "system"
         * group threads created/set up by the VM. Any new functionality added
         * to this method in the future may have to also be added to the VM.
         *
         * A zero status value corresponds to state "NEW".
         */
        if (threadStatus != 0)
            throw new IllegalThreadStateException();

        /* Notify the group that this thread is about to be started
         * so that it can be added to the group's list of threads
         * and the group's unstarted count can be decremented. */
        group.add(this);

        boolean started = false;
        try {
            start0();
            started = true;
        } finally {
            try {
                if (!started) {
                    group.threadStartFailed(this);
                }
            } catch (Throwable ignore) {
                /* do nothing. If start0 threw a Throwable then
                  it will be passed up the call stack */
            }
        }
    }
private native void start0();

As we can see, start() performs a few checks and some setup, then calls the native method start0(). This is where control passes to the JVM.
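The state check at the top of start() is easy to observe from user code. The sketch below (class name `DoubleStart` is hypothetical) starts a thread, waits for it to finish, and then tries to start it again.

```java
// Sketch: a Thread can be started only once, because start() rejects any
// thread whose status is no longer NEW.
class DoubleStart {
    // Returns true if the second start() call was rejected.
    static boolean secondStartRejected() {
        Thread t = new Thread(() -> { });
        t.start();
        try {
            t.join(); // wait until the thread has terminated
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        try {
            t.start(); // status is no longer NEW at this point
            return false;
        } catch (IllegalThreadStateException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("second start rejected: " + secondStartRejected());
    }
}
```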

JNI Boundary: Crossing into Native Code

This section is included for completeness. The following code shows the method mapping used for start0(). The code can be found here

static JNINativeMethod methods[] = {
    {"start0",           "()V",        (void *)JVM_StartThread},
    {"stop0",            "(" OBJ ")V", (void *)JVM_StopThread},
    {"suspend0",         "()V",        (void *)JVM_SuspendThread},
    {"resume0",          "()V",        (void *)JVM_ResumeThread},
    {"setPriority0",     "(I)V",       (void *)JVM_SetThreadPriority},
    {"yield",            "()V",        (void *)JVM_Yield},
    {"sleep",            "(J)V",       (void *)JVM_Sleep},
    {"currentThread",    "()" THD,     (void *)JVM_CurrentThread},
    {"interrupt0",       "()V",        (void *)JVM_Interrupt},
    {"holdsLock",        "(" OBJ ")Z", (void *)JVM_HoldsLock},
    {"getThreads",        "()[" THD,   (void *)JVM_GetAllThreads},
    {"dumpThreads",      "([" THD ")[[" STE, (void *)JVM_DumpThreads},
    {"setNativeName",    "(" STR ")V", (void *)JVM_SetNativeThreadName},
};
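The Java side of this table can be inspected with plain reflection. The following sketch lists the methods that java.lang.Thread declares as native; the exact set of names differs between JDK versions, so treat the output as informational rather than as a stable contract. The class name `ThreadNativeMethods` is made up for this example.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: list the native methods declared on java.lang.Thread.
class ThreadNativeMethods {
    static List<String> nativeMethodNames() {
        List<String> names = new ArrayList<>();
        for (Method m : Thread.class.getDeclaredMethods()) {
            if (Modifier.isNative(m.getModifiers())) {
                names.add(m.getName());
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        // Each printed name is implemented in native code, not in Java.
        nativeMethodNames().forEach(System.out::println);
    }
}
```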

JVM Thread Creation

jvm.cpp starts the thread creation process here

// removed documentation to compact
JVM_ENTRY(void, JVM_StartThread(JNIEnv* env, jobject jthread))
  JavaThread *native_thread = NULL;
  bool throw_illegal_thread_state = false;

  {
    // Lock threads list for thread safety
    MutexLocker mu(Threads_lock);

    // Check if thread has already been started
    if (java_lang_Thread::thread(JNIHandles::resolve_non_null(jthread)) != NULL) {
      throw_illegal_thread_state = true;
    } else {
      // Get stack size from Java Thread object
      jlong size = java_lang_Thread::stackSize(JNIHandles::resolve_non_null(jthread));
      size_t sz = size > 0 ? (size_t) size : 0;

      // CREATE THE JAVATHREAD
      native_thread = new JavaThread(&thread_entry, sz);

      // Link Java and C++ thread objects
      if (native_thread->osthread() != NULL) {
        native_thread->prepare(jthread);
      }
    }
  }

  if (throw_illegal_thread_state) {
    THROW(vmSymbols::java_lang_IllegalThreadStateException());
  }

  if (native_thread->osthread() == NULL) {
    native_thread->smr_delete();
    THROW_MSG(vmSymbols::java_lang_OutOfMemoryError(),
              os::native_thread_creation_failed_msg());
  }

  // START THE THREAD
  Thread::start(native_thread);

JVM_END

At a high level, the work done in this function is:

1. Acquire `Threads_lock` for thread-safe access to the thread list

2. Validate thread state – throw `IllegalThreadStateException` if already started

3. Create `JavaThread` with `thread_entry` function pointer and stack size

4. Call `prepare()` to link Java/C++ thread objects

5. Call `Thread::start()` to signal the thread to begin execution

The native thread is then created here:

native_thread = new JavaThread(&thread_entry, sz);

JavaThread is a class worth mentioning here, as it represents a Java Thread within the JVM.

JavaThread::JavaThread(ThreadFunction entry_point, size_t stack_sz) : JavaThread() {
  _jni_attach_state = _not_attaching_via_jni;
  set_entry_point(entry_point);
  // Create the native thread itself.
  // %note runtime_23
  os::ThreadType thr_type = os::java_thread;
  thr_type = entry_point == &CompilerThread::thread_entry ? os::compiler_thread :
                                                            os::java_thread;
  os::create_thread(this, thr_type, stack_sz);
}
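The stack_sz value passed to os::create_thread originates on the Java side: it is the stackSize argument of the four-argument Thread constructor, which JVM_StartThread reads through java_lang_Thread::stackSize. The sketch below requests two different sizes and measures how deep a trivial recursion gets before a StackOverflowError. The Javadoc describes stackSize as a hint that a VM may round or ignore, so the numbers are illustrative only. The names `StackSizeHint` and `Probe` are hypothetical.

```java
// Sketch: pass a stack size hint to a thread and probe the usable depth.
class StackSizeHint {
    // Recurses until the stack is exhausted and remembers the depth reached.
    static final class Probe implements Runnable {
        int depth;

        @Override
        public void run() {
            try {
                recurse();
            } catch (StackOverflowError expected) {
                // stack exhausted; depth now holds the deepest call reached
            }
        }

        private void recurse() {
            depth++;
            recurse();
        }
    }

    // Runs the probe on a thread created with the given stack size hint.
    static int maxDepth(long stackSizeBytes) {
        Probe probe = new Probe();
        Thread t = new Thread(null, probe, "probe-" + stackSizeBytes, stackSizeBytes);
        t.start();
        try {
            t.join(); // join() makes probe.depth visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return probe.depth;
    }

    public static void main(String[] args) {
        System.out.println("512 KB hint: depth " + maxDepth(512L * 1024));
        System.out.println("4 MB hint:   depth " + maxDepth(4L * 1024 * 1024));
    }
}
```

On a typical HotSpot build on Linux the larger hint should allow a noticeably deeper recursion, but exact depths depend on the JIT and the platform.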

From here, control passes to the OS-specific implementation. Here we explore the Linux implementation.

Platform-Specific Implementation (Linux)

All the action happens inside os::create_thread

NOTE: Skip the code below if not interested

bool os::create_thread(Thread* thread, ThreadType thr_type,
                       size_t req_stack_size) {
  assert(thread->osthread() == NULL, "caller responsible");

  // Allocate the OSThread object
  OSThread* osthread = new OSThread(NULL, NULL);
  if (osthread == NULL) {
    return false;
  }

  // set the correct thread state
  osthread->set_thread_type(thr_type);

  // Initial state is ALLOCATED but not INITIALIZED
  osthread->set_state(ALLOCATED);

  thread->set_osthread(osthread);

  // init thread attributes
  pthread_attr_t attr;
  pthread_attr_init(&attr);
  pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

  // Calculate stack size if it's not specified by caller.
  size_t stack_size = os::Posix::get_initial_stack_size(thr_type, req_stack_size);
  size_t guard_size = os::Linux::default_guard_size(thr_type);

  // Configure glibc guard page. Must happen before calling
  // get_static_tls_area_size(), which uses the guard_size.
  pthread_attr_setguardsize(&attr, guard_size);

  // Apply stack size adjustments if needed. However, be careful not to end up
  // with a size of zero due to overflow. Don't add the adjustment in that case.
  size_t stack_adjust_size = 0;
  if (AdjustStackSizeForTLS) {
    // Adjust the stack_size for on-stack TLS - see get_static_tls_area_size().
    stack_adjust_size += get_static_tls_area_size(&attr);
  } else if (os::Linux::adjustStackSizeForGuardPages()) {
    stack_adjust_size += guard_size;
  }

  stack_adjust_size = align_up(stack_adjust_size, os::vm_page_size());
  if (stack_size <= SIZE_MAX - stack_adjust_size) {
    stack_size += stack_adjust_size;
  }
  assert(is_aligned(stack_size, os::vm_page_size()), "stack_size not aligned");

  if (THPStackMitigation) {
    // In addition to the glibc guard page that prevents inter-thread-stack hugepage
    // coalescing (see comment in os::Linux::default_guard_size()), we also make
    // sure the stack size itself is not huge-page-size aligned; that makes it much
    // more likely for thread stack boundaries to be unaligned as well and hence
    // protects thread stacks from being targeted by khugepaged.
    if (HugePages::thp_pagesize() > 0 &&
        is_aligned(stack_size, HugePages::thp_pagesize())) {
      stack_size += os::vm_page_size();
    }
  }

  int status = pthread_attr_setstacksize(&attr, stack_size);
  if (status != 0) {
    // pthread_attr_setstacksize() function can fail
    // if the stack size exceeds a system-imposed limit.
    assert_status(status == EINVAL, status, "pthread_attr_setstacksize");
    log_warning(os, thread)("The %sthread stack size specified is invalid: " SIZE_FORMAT "k",
                            (thr_type == compiler_thread) ? "compiler " : ((thr_type == java_thread) ? "" : "VM "),
                            stack_size / K);
    thread->set_osthread(NULL);
    delete osthread;
    return false;
  }

  ThreadState state;

  {
    ResourceMark rm;
    pthread_t tid;
    int ret = 0;
    int limit = 3;
    do {
      ret = pthread_create(&tid, &attr, (void* (*)(void*)) thread_native_entry, thread);
    } while (ret == EAGAIN && limit-- > 0);

    char buf[64];
    if (ret == 0) {
      log_info(os, thread)("Thread \"%s\" started (pthread id: " UINTX_FORMAT ", attributes: %s). ",
                           thread->name(), (uintx) tid, os::Posix::describe_pthread_attr(buf, sizeof(buf), &attr));

      // Print current timer slack if override is enabled and timer slack value is available.
      // Avoid calling prctl otherwise for extra safety.
      if (TimerSlack >= 0) {
        int slack = prctl(PR_GET_TIMERSLACK);
        if (slack >= 0) {
          log_info(os, thread)("Thread \"%s\" (pthread id: " UINTX_FORMAT ") timer slack: %dns",
                               thread->name(), (uintx) tid, slack);
        }
      }
    } else {
      log_warning(os, thread)("Failed to start thread \"%s\" - pthread_create failed (%s) for attributes: %s.",
                              thread->name(), os::errno_name(ret), os::Posix::describe_pthread_attr(buf, sizeof(buf), &attr));
      // Log some OS information which might explain why creating the thread failed.
      log_info(os, thread)("Number of threads approx. running in the VM: %d", Threads::number_of_threads());
      LogStream st(Log(os, thread)::info());
      os::Posix::print_rlimit_info(&st);
      os::print_memory_info(&st);
      os::Linux::print_proc_sys_info(&st);
      os::Linux::print_container_info(&st);
    }

    pthread_attr_destroy(&attr);

    if (ret != 0) {
      // Need to clean up stuff we've allocated so far
      thread->set_osthread(NULL);
      delete osthread;
      return false;
    }

    // Store pthread info into the OSThread
    osthread->set_pthread_id(tid);

    // Wait until child thread is either initialized or aborted
    {
      Monitor* sync_with_child = osthread->startThread_lock();
      MutexLocker ml(sync_with_child, Mutex::_no_safepoint_check_flag);
      while ((state = osthread->get_state()) == ALLOCATED) {
        sync_with_child->wait_without_safepoint_check();
      }
    }
  }

  // The thread is returned suspended (in state INITIALIZED),
  // and is started higher up in the call chain
  assert(state == INITIALIZED, "race condition");
  return true;
}

This is an interesting function. I would like to dive deeper into it, but to limit the scope of this post, here is a brief summary:

1. Allocates an OSThread object to track thread metadata and sets its initial state to ALLOCATED
2. Configures pthread attributes – sets the thread as detached (won’t need explicit joining) and calculates the stack size based on thread type, applying adjustments for TLS (Thread Local Storage) and guard pages if needed
3. Handles Transparent Huge Pages (THP) mitigation – intentionally misaligns stack size to prevent khugepaged from targeting thread stacks
4. Creates the thread via pthread_create – passes the native thread entry point (thread_native_entry) and retries up to three times if pthread_create returns EAGAIN
5. Synchronizes with the child thread – waits on a monitor until the child thread signals it has initialized, ensuring the parent doesn’t return until the thread is ready
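The stack size arithmetic in steps 2 and 3 can be restated in a few lines of Java. This is a simplified sketch, not the HotSpot code: it folds the TLS and guard page adjustments into a single `adjust` parameter, and the class and method names are made up for illustration.

```java
// Sketch of the stack size adjustment: page-align the adjustment, add it
// unless that would overflow, then avoid a huge-page-aligned final size.
class StackSizeMath {
    // Rounds value up to the next multiple of alignment (a power of two).
    static long alignUp(long value, long alignment) {
        return (value + alignment - 1) & -alignment;
    }

    static long adjustedStackSize(long requested, long adjust,
                                  long pageSize, long thpPageSize) {
        long size = requested;
        long alignedAdjust = alignUp(adjust, pageSize);
        if (size <= Long.MAX_VALUE - alignedAdjust) { // skip on overflow
            size += alignedAdjust;
        }
        // THP mitigation: nudge the size off a huge-page boundary.
        if (thpPageSize > 0 && size % thpPageSize == 0) {
            size += pageSize;
        }
        return size;
    }

    public static void main(String[] args) {
        long page = 4096;
        long thp = 2L * 1024 * 1024;
        // 1 MiB stack plus one guard page: not THP aligned, left alone.
        System.out.println(adjustedStackSize(1L << 20, page, page, thp)); // 1052672
        // 2 MiB stack, no adjustment: THP aligned, so one page is added.
        System.out.println(adjustedStackSize(2L << 20, 0, page, thp));    // 2101248
    }
}
```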

The thread is returned in a suspended INITIALIZED state – the actual execution begins later when os::start_thread() is called higher up in the call chain. If anything fails (invalid stack size, pthread_create failure), it cleans up and returns false.
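The two-phase start described above can be modelled in plain Java with a monitor. This is a toy model, not HotSpot code: `startLock` stands in for startThread_lock, and the child-side behaviour (report INITIALIZED, then wait for RUNNABLE) is my reading of how the child cooperates, stated here as an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the parent/child start-up handshake.
class StartHandshake {
    enum State { ALLOCATED, INITIALIZED, RUNNABLE }

    private final Object startLock = new Object();         // stands in for startThread_lock
    private final List<String> events = new ArrayList<>(); // guarded by startLock
    private State state = State.ALLOCATED;                 // guarded by startLock

    // Must be called while holding startLock.
    private void waitOnLock() {
        try {
            startLock.wait();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Child side: report INITIALIZED, then park until the parent says RUNNABLE.
    private void childEntry() {
        synchronized (startLock) {
            state = State.INITIALIZED;
            events.add("child: INITIALIZED");
            startLock.notifyAll();
            while (state == State.INITIALIZED) {
                waitOnLock();
            }
            events.add("child: running user code");
        }
    }

    // Parent, phase 1 (modelled on os::create_thread): spawn, wait for INITIALIZED.
    private Thread create() {
        Thread child = new Thread(this::childEntry);
        child.start();
        synchronized (startLock) {
            while (state == State.ALLOCATED) {
                waitOnLock();
            }
            events.add("parent: child is INITIALIZED");
        }
        return child;
    }

    // Parent, phase 2 (modelled on os::start_thread): release the child.
    private void release() {
        synchronized (startLock) {
            state = State.RUNNABLE;
            events.add("parent: set RUNNABLE");
            startLock.notifyAll();
        }
    }

    static List<String> demo() {
        StartHandshake h = new StartHandshake();
        Thread child = h.create();
        h.release();
        try {
            child.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        synchronized (h.startLock) {
            return new ArrayList<>(h.events);
        }
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

The takeaway is that the creating thread blocks twice on behalf of the new thread: once until the child has initialized, and once more (briefly) to release it.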

Let’s briefly cover thread_native_entry. This routine is the entry point of every newly created thread and does the following:

  1. Records stack bounds and initializes thread-local state
  2. Does some housekeeping and eventually calls thread->call_run()
    1. For a Java thread, this calls run()

At this point the new thread exists but is not yet running user code. It proceeds only after Thread::start is called from JVM_StartThread, a call that goes through thread.cpp -> os.cpp -> os_linux.cpp

For a Java thread, JavaThread::run() is executed, which performs TLAB (thread-local allocation buffer) initialization and other setup

This last part is where I need to understand more; I still do not have a clear picture.

The Complete Flow: A Visual Summary

The call chain traced in this post, in order:

Thread.start() -> start0() -> JVM_StartThread -> new JavaThread(...) -> os::create_thread -> pthread_create -> thread_native_entry (child reaches INITIALIZED and waits)

Thread::start -> os::start_thread -> thread->call_run() -> JavaThread::run() -> run()

Performance Implications

Now that we’ve seen the machinery, the cost becomes clear:

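  • Memory allocation: Stack space (typically 1 MB by default on Linux x86-64, reserved as virtual memory), OSThread object, guard pages
  • System calls: pthread_create has to map a stack and clone a new kernel thread, and the start-up handshake blocks on synchronization primitives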
  • State synchronization: Parent-child coordination through monitors
  • TLS setup: Thread-local storage allocation and initialization
  • JVM bookkeeping: Adding to thread list, TLAB initialization, safepoint setup
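
To get a rough feel for these costs on your own machine, the sketch below starts and joins a number of no-op threads and reports the average wall-clock time per thread. It is a crude measurement with no warm-up or statistical rigor, so treat the number as an order-of-magnitude hint; a harness such as JMH would be the right tool for anything serious. The class name `ThreadCreationCost` is made up for this example.

```java
// Rough sketch: average time to create, start and join a no-op thread.
class ThreadCreationCost {
    static long averageNanosPerThread(int count) {
        Runnable noop = () -> { };
        long begin = System.nanoTime();
        for (int i = 0; i < count; i++) {
            Thread t = new Thread(noop);
            t.start();
            try {
                t.join(); // includes the full create/start/exit round trip
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return (System.nanoTime() - begin) / count;
    }

    public static void main(String[] args) {
        averageNanosPerThread(1_000); // crude warm-up pass
        System.out.println("avg ns per thread: " + averageNanosPerThread(10_000));
    }
}
```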

This is why:

  • Thread pools reuse threads instead of creating new ones
  • Virtual threads (Project Loom) are scheduled by the JVM onto a small set of carrier threads rather than mapping one-to-one to OS threads
  • Async/event-driven architectures avoid per-request threads
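
The first point is easy to verify. In the sketch below, many tasks are submitted to a fixed-size pool, and we count how many distinct worker threads actually ran them. The class name `PoolReuse` and the chosen sizes are arbitrary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: a fixed thread pool pays the creation cost at most poolSize times,
// no matter how many tasks are submitted.
class PoolReuse {
    static int distinctWorkerThreads(int poolSize, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        Set<String> workerNames = ConcurrentHashMap.newKeySet();
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                futures.add(pool.submit(() -> {
                    workerNames.add(Thread.currentThread().getName());
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for every task to finish
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return workerNames.size();
    }

    public static void main(String[] args) {
        System.out.println("1000 tasks ran on "
                + distinctWorkerThreads(4, 1_000) + " threads");
    }
}
```

However many tasks are submitted, the count never exceeds the pool size, so the whole JNI, JVM and pthread journey described in this post happens only a handful of times.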

Summary

Thread creation in Java involves a complex journey through multiple layers—from Java API to JNI, through JVM internals, and finally to OS-level pthread creation. Each layer adds overhead: memory allocation for stacks and metadata, system calls, synchronization between parent and child threads, and initialization of thread-local storage. This deep understanding explains why:

  • Thread pools are essential for high-performance applications 
  • The Java community developed Virtual Threads (Project Loom)
  • Modern applications favor async patterns for high-concurrency scenarios

 
