Read Node.js Source Code to Deeply Understand the CJS Module System

This article explores the Node.js source code to understand the loading process of CJS modules.

By Moumou (Zhou Feiyu)

Overview

The following is how we load a module in Node.js:

const fs = require('fs');
const express = require('express');
const anotherModule = require('./another-module');

require is the API for loading CJS modules, but V8 itself has no CJS module system. So how does Node.js find and load modules through require? In this article, we explore the Node.js source code to understand the loading process of CJS modules. The version of the Node.js source we read is v17.x.

Source Code Reading

Built-in Module

To understand how require works, we first need to understand how built-in modules (such as 'fs', 'path', and 'child_process', as well as internal modules that users cannot reference) are loaded into Node.js. With the source code checked out, we start reading from the entry point. The main function in src/node_main.cc starts a Node.js instance by calling node::Start:

int Start(int argc, char** argv) {
  InitializationResult result = InitializeOncePerProcess(argc, argv);
  if (result.early_return) {
    return result.exit_code;
  }

  {
    Isolate::CreateParams params;
    const std::vector<size_t>* indices = nullptr;
    const EnvSerializeInfo* env_info = nullptr;
    bool use_node_snapshot =
        per_process::cli_options->per_isolate->node_snapshot;
    if (use_node_snapshot) {
      v8::StartupData* blob = NodeMainInstance::GetEmbeddedSnapshotBlob();
      if (blob != nullptr) {
        params.snapshot_blob = blob;
        indices = NodeMainInstance::GetIsolateDataIndices();
        env_info = NodeMainInstance::GetEnvSerializeInfo();
      }
    }
    uv_loop_configure(uv_default_loop(), UV_METRICS_IDLE_TIME);

    NodeMainInstance main_instance(&params,
                                   uv_default_loop(),
                                   per_process::v8_platform.Platform(),
                                   result.args,
                                   result.exec_args,
                                   indices);
    result.exit_code = main_instance.Run(env_info);
  }

  TearDownOncePerProcess();
  return result.exit_code;
}

Here, the default event loop is configured and a NodeMainInstance named main_instance is created. Its Run method is then called:

int NodeMainInstance::Run(const EnvSerializeInfo* env_info) {
  Locker locker(isolate_);
  Isolate::Scope isolate_scope(isolate_);
  HandleScope handle_scope(isolate_);

  int exit_code = 0;
  DeleteFnPtr<Environment, FreeEnvironment> env =
      CreateMainEnvironment(&exit_code, env_info);
  CHECK_NOT_NULL(env);

  Context::Scope context_scope(env->context());
  Run(&exit_code, env.get());
  return exit_code;
}

CreateMainEnvironment is called in the Run method to create and initialize the environment. The CreateEnvironment function below shows the creation and bootstrapping steps:

Environment* CreateEnvironment(
    IsolateData* isolate_data,
    Local<Context> context,
    const std::vector<std::string>& args,
    const std::vector<std::string>& exec_args,
    EnvironmentFlags::Flags flags,
    ThreadId thread_id,
    std::unique_ptr<InspectorParentHandle> inspector_parent_handle) {
  Isolate* isolate = context->GetIsolate();
  HandleScope handle_scope(isolate);
  Context::Scope context_scope(context);
  // TODO(addaleax): This is a much better place for parsing per-Environment
  // options than the global parse call.
  Environment* env = new Environment(
      isolate_data, context, args, exec_args, nullptr, flags, thread_id);
#if HAVE_INSPECTOR
  if (inspector_parent_handle) {
    env->InitializeInspector(
        std::move(static_cast<InspectorParentHandleImpl*>(
            inspector_parent_handle.get())->impl));
  } else {
    env->InitializeInspector({});
  }
#endif

  if (env->RunBootstrapping().IsEmpty()) {
    FreeEnvironment(env);
    return nullptr;
  }

  return env;
}

This creates an Environment object env and calls its RunBootstrapping method:

MaybeLocal<Value> Environment::RunBootstrapping() {
  EscapableHandleScope scope(isolate_);

  CHECK(!has_run_bootstrapping_code());

  if (BootstrapInternalLoaders().IsEmpty()) {
    return MaybeLocal<Value>();
  }

  Local<Value> result;
  if (!BootstrapNode().ToLocal(&result)) {
    return MaybeLocal<Value>();
  }

  // Make sure that no request or handle is created during bootstrap -
  // if necessary those should be done in pre-execution.
  // Usually, doing so would trigger the checks present in the ReqWrap and
  // HandleWrap classes, so this is only a consistency check.
  CHECK(req_wrap_queue()->IsEmpty());
  CHECK(handle_wrap_queue()->IsEmpty());

  DoneBootstrapping();

  return scope.Escape(result);
}

BootstrapInternalLoaders performs a very important step in the Node.js module loading process. It wraps and executes internal/bootstrap/loaders.js to obtain two functions: nativeModuleRequire, which loads built-in JS modules, and internalBinding, which loads built-in C++ modules. NativeModule is a small module system used only for built-in modules.

function nativeModuleRequire(id) {
  if (id === loaderId) {
    return loaderExports;
  }

  const mod = NativeModule.map.get(id);
  // Can't load the internal errors module from here, have to use a raw error.
  // eslint-disable-next-line no-restricted-syntax
  if (!mod) throw new TypeError(`Missing internal module '${id}'`);
  return mod.compileForInternalLoader();
}

const loaderExports = {
  internalBinding,
  NativeModule,
  require: nativeModuleRequire
};

return loaderExports;

Note that this require function is only used to load built-in modules, not user modules. This is also why printing require('module')._cache shows all loaded user modules but no built-in modules such as fs: the two kinds of modules are loaded and cached in different ways.
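The difference in cache maintenance can be observed from user code. The following is a minimal sketch, assuming a throwaway module written to a temporary directory (the file name user-module.js is made up for illustration). After the module is required, its resolved path appears in require('module')._cache, while fs does not:

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const Module = require('module');

// Create a throwaway user module in a temp directory (hypothetical file name).
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'cjs-cache-'));
const userModulePath = path.join(tmpDir, 'user-module.js');
fs.writeFileSync(userModulePath, 'module.exports = { answer: 42 };');

const userModule = require(userModulePath);

// Cache keys are resolved file names, so compare against the resolved path.
const cacheKeys = Object.keys(Module._cache);
const userModuleCached = cacheKeys.includes(require.resolve(userModulePath));
const fsCached = cacheKeys.includes('fs') || cacheKeys.includes('node:fs');

console.log('user module in Module._cache:', userModuleCached);
console.log('fs in Module._cache:', fsCached);
```

The first line should report true and the second false, even though fs was required at the top of the script.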

User Module

Next, let's look back at the NodeMainInstance::Run function:

int NodeMainInstance::Run(const EnvSerializeInfo* env_info) {
  Locker locker(isolate_);
  Isolate::Scope isolate_scope(isolate_);
  HandleScope handle_scope(isolate_);

  int exit_code = 0;
  DeleteFnPtr<Environment, FreeEnvironment> env =
      CreateMainEnvironment(&exit_code, env_info);
  CHECK_NOT_NULL(env);

  Context::Scope context_scope(env->context());
  Run(&exit_code, env.get());
  return exit_code;
}

We have created an env object through CreateMainEnvironment. This Environment instance already has the NativeModule module system for maintaining built-in modules. Execution then continues into the other overload of Run:

void NodeMainInstance::Run(int* exit_code, Environment* env) {
  if (*exit_code == 0) {
    LoadEnvironment(env, StartExecutionCallback{});

    *exit_code = SpinEventLoop(env).FromMaybe(1);
  }

  ResetStdio();

  // TODO(addaleax): Neither NODE_SHARED_MODE nor HAVE_INSPECTOR really
  // make sense here.
#if HAVE_INSPECTOR && defined(__POSIX__) && !defined(NODE_SHARED_MODE)
  struct sigaction act;
  memset(&act, 0, sizeof(act));
  for (unsigned nr = 1; nr < kMaxSignal; nr += 1) {
    if (nr == SIGKILL || nr == SIGSTOP || nr == SIGPROF)
      continue;
    act.sa_handler = (nr == SIGPIPE) ? SIG_IGN : SIG_DFL;
    CHECK_EQ(0, sigaction(nr, &act, nullptr));
  }
#endif

#if defined(LEAK_SANITIZER)
  __lsan_do_leak_check();
#endif
}

This calls LoadEnvironment:

MaybeLocal<Value> LoadEnvironment(
    Environment* env,
    StartExecutionCallback cb) {
  env->InitializeLibuv();
  env->InitializeDiagnostics();

  return StartExecution(env, cb);
}

LoadEnvironment then executes StartExecution:

MaybeLocal<Value> StartExecution(Environment* env, StartExecutionCallback cb) {
  // Here we only look at the "node index.js" case and ignore the other ways of
  // running Node.js, which does not affect our understanding of the module system.
  if (!first_argv.empty() && first_argv != "-") {
    return StartExecution(env, "internal/main/run_main_module");
  }
}

The call StartExecution(env, "internal/main/run_main_module") wraps the code of lib/internal/main/run_main_module.js in a function, passes in the require function exported from loaders earlier, and runs it:

'use strict';

const {
  prepareMainThreadExecution
} = require('internal/bootstrap/pre_execution');

prepareMainThreadExecution(true);

markBootstrapComplete();

// Note: this loads the module through the ESM loader if the module is
// determined to be an ES module. This hangs from the CJS module loader
// because we currently allow monkey-patching of the module loaders
// in the preloaded scripts through require('module').
// runMain here might be monkey-patched by users in --require.
// XXX: the monkey-patchability here should probably be deprecated.
require('internal/modules/cjs/loader').Module.runMain(process.argv[1]);

Wrapping the file in a function and passing in require looks roughly like this in pseudo-code:

(function(require, /* other input parameters */) {
// Here is the file content of internal/main/run_main_module.js.
})();

Therefore, the require function for built-in modules is used to load lib/internal/modules/cjs/loader.js, and the runMain method on its exported Module object is called. However, runMain is not defined in loader.js. It is attached to the Module object in lib/internal/bootstrap/pre_execution.js:

function initializeCJSLoader() {
  const CJSLoader = require('internal/modules/cjs/loader');
  if (!noGlobalSearchPaths) {
    CJSLoader.Module._initPaths();
  }
  // TODO(joyeecheung): deprecate this in favor of a proper hook?
  CJSLoader.Module.runMain =
    require('internal/modules/run_main').executeUserEntryPoint;
}

Find the executeUserEntryPoint method in the lib/internal/modules/run_main.js:

function executeUserEntryPoint(main = process.argv[1]) {
  const resolvedMain = resolveMainPath(main);
  const useESMLoader = shouldUseESMLoader(resolvedMain);
  if (useESMLoader) {
    runMainESM(resolvedMain || main);
  } else {
    // Module._load is the monkey-patchable CJS module loader.
    Module._load(main, null, true);
  }
}
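The third argument in Module._load(main, null, true) marks the module as the entry point. One observable effect can be sketched from user code, assuming child processes can be spawned and using made-up file names (index.js and helper.js in a temporary directory). Only the entry file sees require.main === module, and its module.id is '.':

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const { execFileSync } = require('child_process');

// Hypothetical entry file and helper module, written to a temp directory.
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'cjs-main-'));
const entryPath = path.join(tmpDir, 'index.js');
const helperPath = path.join(tmpDir, 'helper.js');

fs.writeFileSync(helperPath,
  'module.exports = { isMain: require.main === module };');
fs.writeFileSync(entryPath, [
  "const helper = require('./helper.js');",
  'console.log(JSON.stringify({',
  '  entryIsMain: require.main === module,',
  '  entryId: module.id,',
  '  helperIsMain: helper.isMain,',
  '}));',
].join('\n'));

// Run "node index.js" in a child process and read back what the entry printed.
const output = execFileSync(process.execPath, [entryPath], { encoding: 'utf8' });
const report = JSON.parse(output);
console.log(report);
```

The helper module is loaded through an ordinary require call, so it is not treated as the main module.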

The parameter main is the entry file index.js that we passed in. As a CJS module, index.js is loaded by Module._load. What does _load do? It is the most important function in the CJS loading process and is worth reading carefully:

// The '_load' function checks the cache for the requested file.
// 1. If the module already exists in the cache, return its exports object.
// 2. If the module is a built-in module, call
//    'NativeModule.prototype.compileForPublicLoader()' to obtain its exports object.
//    compileForPublicLoader has an allowlist, so only the exports of public
//    built-in modules can be obtained.
// 3. Otherwise, create a new Module object and save it to the cache. Then load
//    the file through it and return its exports.

// request: the requested module, such as 'fs', './another-module', or '@pipcook/core'.
// parent: the parent module. For example, for require('b.js') in 'a.js', the request
//         is 'b.js' and the parent is the Module object of 'a.js'.
// isMain: true for the entry file, false for all other modules.
Module._load = function(request, parent, isMain) {
  let relResolveCacheIdentifier;
  if (parent) {
    debug('Module._load REQUEST %s parent: %s', request, parent.id);
    // relativeResolveCache caches resolved module paths.
    // It speeds up repeated requests for the same module from any module in the
    // parent module's directory: the actual path can be looked up directly
    // without searching for files through _resolveFilename.
    relResolveCacheIdentifier = `${parent.path}\x00${request}`;
    const filename = relativeResolveCache[relResolveCacheIdentifier];
    if (filename !== undefined) {
      const cachedModule = Module._cache[filename];
      if (cachedModule !== undefined) {
        updateChildren(parent, cachedModule, true);
        if (!cachedModule.loaded)
          return getExportsForCircularRequire(cachedModule);
        return cachedModule.exports;
      }
      delete relativeResolveCache[relResolveCacheIdentifier];
    }
  }
  // Try to resolve the path of the module file. An exception is thrown if the module cannot be found.
  const filename = Module._resolveFilename(request, parent, isMain);
  // If it is a built-in module, load it from 'NativeModule'.
  if (StringPrototypeStartsWith(filename, 'node:')) {
    // Slice 'node:' prefix
    const id = StringPrototypeSlice(filename, 5);

    const module = loadNativeModule(id, request);
    if (!module?.canBeRequiredByUsers) {
      throw new ERR_UNKNOWN_BUILTIN_MODULE(filename);
    }

    return module.exports;
  }
  // If the module is already cached, add it to the children of the parent module.
  const cachedModule = Module._cache[filename];
  if (cachedModule !== undefined) {
    updateChildren(parent, cachedModule, true);
    // Handle circular references.
    if (!cachedModule.loaded) {
      const parseCachedModule = cjsParseCache.get(cachedModule);
      if (!parseCachedModule || parseCachedModule.loaded)
        return getExportsForCircularRequire(cachedModule);
      parseCachedModule.loaded = true;
    } else {
      return cachedModule.exports;
    }
  }
  // Try to load from the built-in modules.
  const mod = loadNativeModule(filename, request);
  if (mod?.canBeRequiredByUsers) return mod.exports;

  // Don't call updateChildren(), Module constructor already does.
  const module = cachedModule || new Module(filename, parent);

  if (isMain) {
    process.mainModule = module;
    module.id = '.';
  }
  // Add the module object to the cache.
  Module._cache[filename] = module;
  if (parent !== undefined) {
    relativeResolveCache[relResolveCacheIdentifier] = filename;
  }

  // Try to load the module. If loading fails, delete the module object from
  // the cache and from the children of the parent module.
  let threw = true;
  try {
    module.load(filename);
    threw = false;
  } finally {
    if (threw) {
      delete Module._cache[filename];
      if (parent !== undefined) {
        delete relativeResolveCache[relResolveCacheIdentifier];
        const children = parent?.children;
        if (ArrayIsArray(children)) {
          const index = ArrayPrototypeIndexOf(children, module);
          if (index !== -1) {
            ArrayPrototypeSplice(children, index, 1);
          }
        }
      }
    } else if (module.exports &&
               !isProxy(module.exports) &&
               ObjectGetPrototypeOf(module.exports) ===
                 CircularRequirePrototypeWarningProxy) {
      ObjectSetPrototypeOf(module.exports, ObjectPrototype);
    }
  }
  // Return the exports object.
  return module.exports;
};
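Two of the branches in _load can be observed with the public require function. The following is a small sketch, assuming two made-up files (ok.js and bad.js) in a temporary directory. A second require of the same file returns the cached exports object, and a module that throws while loading is removed from the cache again:

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical modules: one loads fine, one throws during loading.
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'cjs-load-'));
const okPath = path.join(tmpDir, 'ok.js');
const badPath = path.join(tmpDir, 'bad.js');
fs.writeFileSync(okPath, 'module.exports = { createdAt: Date.now() };');
fs.writeFileSync(badPath, "throw new Error('boom');");

// Cache hit: the second call returns the same exports object.
const first = require(okPath);
const second = require(okPath);
const sameExports = first === second;

// Failed load: the error propagates and the module is not left in the cache.
let caughtMessage = null;
try {
  require(badPath);
} catch (err) {
  caughtMessage = err.message;
}
const okCached = require.resolve(okPath) in require.cache;
const badCached = require.resolve(badPath) in require.cache;

console.log({ sameExports, caughtMessage, okCached, badCached });
```

Because the failed module is evicted, a later require of bad.js would run the file again instead of returning a half-initialized exports object.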

The load function on the module object is used to load a module:

Module.prototype.load = function(filename) {
  debug('load %j for module %j', filename, this.id);

  assert(!this.loaded);
  this.filename = filename;
  this.paths = Module._nodeModulePaths(path.dirname(filename));

  const extension = findLongestRegisteredExtension(filename);
  // allow .mjs to be overridden
  if (StringPrototypeEndsWith(filename, '.mjs') && !Module._extensions['.mjs'])
    throw new ERR_REQUIRE_ESM(filename, true);

  Module._extensions[extension](this, filename);
  this.loaded = true;

  const esmLoader = asyncESM.esmLoader;
  // Create module entry at load time to snapshot exports correctly
  const exports = this.exports;
  // Preemptively cache
  if ((module?.module === undefined ||
       module.module.getStatus() < kEvaluated) &&
      !esmLoader.cjsCache.has(this))
    esmLoader.cjsCache.set(this, exports);
};

The actual loading is performed by Module._extensions[extension](this, filename). Each extension has its own loading strategy:

  • .js: Call fs.readFileSync to read the file content and wrap it in the wrapper shown below. Note that the require passed to the wrapper is based on Module.prototype.require, not the require used for built-in modules.
const wrapper = [
  '(function (exports, require, module, __filename, __dirname) { ',
  '\n});',
];
  • .json: Call fs.readFileSync to read the file content and parse it into an object.
  • .node: Call dlopen to load the Node.js native addon.
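The wrapping step for .js files can be imitated from user code. The following is an illustrative sketch only: Node.js may compile modules differently internally, and the source string and the /virtual/demo.js file name are made up. It uses the public Module.wrap helper to add the wrapper text, compiles the result with vm.runInThisContext, and calls the resulting function with the five wrapper arguments:

```javascript
const vm = require('vm');
const Module = require('module');

// Hypothetical module source and file name, used only for this illustration.
const source = 'module.exports = { dir: __dirname, sum: 1 + 2 };';
const filename = '/virtual/demo.js';

// Module.wrap surrounds the source with the wrapper prefix and suffix.
const wrapped = Module.wrap(source);

// Compile the wrapped source into a function, then call it by hand.
const compiledWrapper = vm.runInThisContext(wrapped, { filename });
const fakeModule = { exports: {} };
compiledWrapper.call(
  fakeModule.exports, // `this` inside the module body
  fakeModule.exports, // exports
  require,            // require (here simply the caller's own require)
  fakeModule,         // module
  filename,           // __filename
  '/virtual',         // __dirname
);

console.log(wrapped);
console.log(fakeModule.exports);
```

The module body never sees real globals named exports, module, or __dirname. They are ordinary function parameters, which is why each file gets its own values.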

The Module.prototype.require function also calls the static method Module._load to load modules:

Module.prototype.require = function(id) {
  validateString(id, 'id');
  if (id === '') {
    throw new ERR_INVALID_ARG_VALUE('id', id,
                                    'must be a non-empty string');
  }
  requireDepth++;
  try {
    return Module._load(id, this, /* isMain */ false);
  } finally {
    requireDepth--;
  }
};
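Since every require call is routed through Module._load, temporarily replacing that function is enough to trace a whole dependency chain. The sketch below assumes two made-up files (parent.js and child.js) in a temporary directory. Patching Module._load relies on the monkey-patchability mentioned in the source comments and is shown only for illustration:

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const Module = require('module');

// Hypothetical modules: parent.js requires child.js.
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'cjs-trace-'));
const parentPath = path.join(tmpDir, 'parent.js');
fs.writeFileSync(path.join(tmpDir, 'child.js'), "module.exports = 'child';");
fs.writeFileSync(parentPath, "module.exports = require('./child.js');");

const originalLoad = Module._load;
const trace = [];
Module._load = function tracedLoad(request, parent, isMain) {
  // Record each request, then delegate to the real loader.
  trace.push({ request: path.basename(request), isMain });
  return originalLoad.apply(this, arguments);
};

let result;
try {
  result = require(parentPath);
} finally {
  Module._load = originalLoad; // always restore the original loader
}

console.log(result);
console.log(trace.map((entry) => entry.request));
```

The trace should contain parent.js followed by child.js, both with isMain set to false, which matches the Module._load(id, this, false) call in Module.prototype.require.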

Summary

The loading process of the CJS module is clearer now:

  1. Initialize Node.js and set up NativeModule, which is used to load the built-in JS and C++ modules.
  2. Run the built-in module run_main_module.
  3. In run_main_module, bring in the user module system (Module).
  4. Load the entry file with Module._load. During loading, each module receives require and module.exports, so the entry file can require its dependencies and the entire dependency tree is loaded recursively.
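The last step can be condensed into a toy loader. The following is a simplified sketch, not the Node.js implementation: miniLoad and miniCache are hypothetical names, only absolute file paths and relative requests with a file extension are supported, and a.js and b.js are made-up modules that require each other. It keeps the three ideas from _load: check the cache first, cache the module before running it, and evict it if loading throws.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const vm = require('vm');

const miniCache = Object.create(null);

// Simplified stand-in for Module._load: absolute file paths only.
function miniLoad(filename) {
  const cached = miniCache[filename];
  if (cached !== undefined) {
    return cached.exports; // may be incomplete while the module is still loading
  }

  const mod = { id: filename, exports: {}, loaded: false };
  miniCache[filename] = mod; // cache before running so cycles terminate

  let threw = true;
  try {
    const dirname = path.dirname(filename);
    const source = fs.readFileSync(filename, 'utf8');
    const wrapper = vm.runInThisContext(
      `(function (exports, require, module, __filename, __dirname) { ${source}\n});`,
      { filename },
    );
    // Simplified stand-in for Module.prototype.require: relative paths only.
    const miniRequire = (request) => miniLoad(path.resolve(dirname, request));
    wrapper.call(mod.exports, mod.exports, miniRequire, mod, filename, dirname);
    mod.loaded = true;
    threw = false;
  } finally {
    if (threw) delete miniCache[filename];
  }
  return mod.exports;
}

// Two hypothetical modules that require each other.
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'cjs-mini-'));
const aPath = path.join(tmpDir, 'a.js');
const bPath = path.join(tmpDir, 'b.js');
fs.writeFileSync(aPath, [
  "exports.early = 'set before requiring b';",
  "exports.b = require('./b.js');",
  "exports.late = 'set after requiring b';",
].join('\n'));
fs.writeFileSync(bPath, [
  "const a = require('./a.js');",
  'exports.keysOfAWhenBLoaded = Object.keys(a);',
].join('\n'));

const a = miniLoad(aPath);
console.log(Object.keys(a));
console.log(a.b.keysOfAWhenBLoaded);
```

When b.js runs, a.js is already in the cache but has only executed its first line, so b.js sees a partially filled exports object. This is the same situation that the circular-reference branch in _load handles.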

Now that we know the complete CJS loading process, we can follow the same path to read other parts of the code (such as the initialization of global variables and the handling of ES modules) to gain a deeper understanding of how Node.js is implemented.

Alibaba F(x) Team
